# Testing Guide - Bandit Runner LangGraph Agent

## ✅ Current Status

### What's Working
- ✅ Build successful - no TypeScript errors
- ✅ Dev server starts on port 3002
- ✅ SSH proxy running on port 3001
- ✅ All components installed and configured
- ✅ Beautiful UI fully functional
- ✅ WebSocket infrastructure ready

### What Needs Configuration
- ⚠️ OpenRouter API key (required for LLM)
- ⚠️ Durable Object export (works in production, limited in dev)

## 🚀 Quick Start Testing

### 1. Set Your OpenRouter API Key

Edit `.dev.vars`:
```bash
OPENROUTER_API_KEY=sk-or-v1-YOUR-ACTUAL-KEY-HERE
```

Get a key from: https://openrouter.ai/keys

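
Before starting the app, it can help to confirm that `.dev.vars` no longer holds the placeholder value. The sketch below is a hypothetical helper, not part of the project: `parseDevVars` and `hasUsableOpenRouterKey` are illustrative names, and the `sk-or-v1-` prefix check is an assumption taken from the placeholder shown above. It handles only plain `KEY=VALUE` lines (no quoting).

```typescript
// Hypothetical helper for illustration; not part of bandit-runner-app.
const PLACEHOLDER_KEY = "sk-or-v1-YOUR-ACTUAL-KEY-HERE";

/** Parse plain KEY=VALUE lines, skipping blank lines and # comments. */
function parseDevVars(text: string): Record<string, string> {
  const vars: Record<string, string> = {};
  const lines = text.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const line = lines[i].trim();
    if (line === "" || line.charAt(0) === "#") continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key before "=": ignore the line
    vars[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return vars;
}

/** True when the key exists, has the assumed prefix, and is not the placeholder. */
function hasUsableOpenRouterKey(text: string): boolean {
  const key = parseDevVars(text)["OPENROUTER_API_KEY"];
  return key !== undefined && key.indexOf("sk-or-v1-") === 0 && key !== PLACEHOLDER_KEY;
}
```

A check like this only catches a missing or placeholder key. Whether the key is valid and has credits still has to be confirmed against OpenRouter itself.
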
### 2. Start the Application

```bash
cd bandit-runner-app
pnpm dev
```

The dev server starts on http://localhost:3002 (port 3000 was already taken).

### 3. Test the UI

**What You'll See:**
- Beautiful retro terminal interface with control panel
- Model selection dropdown (GPT-4o, Claude, etc.)
- Level range selector (0-33)
- START/PAUSE/RESUME buttons
- Connection status indicators

**Try These Actions:**
1. **Select a model** - Choose "GPT-4o Mini" (cheapest for testing)
2. **Set level range** - Start with 0-2 (quick test)
3. **Click START** - This will attempt to create a run

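
When checking the level selector, it is useful to be explicit about which ranges count as valid. The following is a minimal sketch of such a rule, assuming the 0-33 bounds listed above and assuming that a single-level range (start equal to end) is allowed. `checkLevelRange` is a hypothetical name, not a function from the app.

```typescript
// Bounds taken from the level selector described above (0-33).
const MIN_LEVEL = 0;
const MAX_LEVEL = 33;

type LevelRangeCheck = { ok: true } | { ok: false; reason: string };

/** Validate a start/end level pair before a run is created. */
function checkLevelRange(start: number, end: number): LevelRangeCheck {
  // x % 1 is 0 only for finite whole numbers (NaN and Infinity are rejected too).
  if (start % 1 !== 0 || end % 1 !== 0) {
    return { ok: false, reason: "levels must be whole numbers" };
  }
  if (start < MIN_LEVEL || end > MAX_LEVEL) {
    return { ok: false, reason: "levels must be between 0 and 33" };
  }
  if (start > end) {
    return { ok: false, reason: "start level must not exceed end level" };
  }
  return { ok: true };
}
```

With this rule, the quick-test range 0-2 passes, while reversed ranges such as 5-2 and out-of-bounds values such as 34 are rejected with a reason that the UI could display.
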
### 4. Expected Behavior (Current State)

**⚠️ Known Limitation:**
The Durable Object binding doesn't work in local dev mode (`next dev`). You'll see:
```
POST /api/agent/run-xxx/start - 500 (Durable Object binding not found)
```

This is expected! The warning message tells us:
> "internal Durable Objects... will not work in local development, but they should work in production"

### 5. Testing Options

**Option A: Test UI Without Backend (Current)**
- UI works perfectly
- Control panel functional
- Model selection works
- WebSocket connection is attempted (and fails gracefully)
- You can type commands and messages in the interface

**Option B: Use Wrangler Dev (Full Testing)**
```bash
# Install wrangler globally if needed
npm i -g wrangler

# Run with Workers runtime
wrangler dev

# This gives you:
# ✅ Full Durable Object support
# ✅ Real WebSocket connections
# ✅ Actual agent runs
```

**Option C: Deploy to Cloudflare (Production Testing)**
```bash
# Build
pnpm build

# Deploy
wrangler deploy

# Test on:
# https://bandit-runner-app.your-account.workers.dev
```

## 🧪 Manual Testing Checklist

### UI Testing (Works Now)

- [ ] Control panel displays correctly
- [ ] Model dropdown shows all options
- [ ] Level selectors work (0-33)
- [ ] Streaming mode toggle functional
- [ ] START button enabled when idle
- [ ] Status indicators show correct state
- [ ] Terminal panel renders
- [ ] Agent chat panel renders
- [ ] Command input accepts text
- [ ] Chat input accepts text
- [ ] Keyboard shortcuts work (Ctrl+K/J, ESC, arrow keys)
- [ ] Theme toggle works
- [ ] Retro styling (scan lines, grid) visible

### Backend Testing (Requires Wrangler Dev)

- [ ] Start run creates Durable Object
- [ ] WebSocket connection established
- [ ] Agent begins planning
- [ ] SSH commands execute via proxy
- [ ] Terminal shows command output
- [ ] Chat shows agent thoughts
- [ ] Pause button stops execution
- [ ] Resume button continues
- [ ] Manual commands work when paused
- [ ] Level advancement works
- [ ] Run completes successfully
- [ ] Error handling works
- [ ] Retry logic functions

### SSH Proxy Integration

Test your SSH proxy directly:

```bash
# Test connection
curl -X POST http://localhost:3001/ssh/connect \
  -H "Content-Type: application/json" \
  -d '{
    "host":"bandit.labs.overthewire.org",
    "port":2220,
    "username":"bandit0",
    "password":"bandit0"
  }'

# Should return:
# {"connectionId":"conn-xxx","success":true,"message":"Connected successfully"}

# Test command execution
# (replace conn-xxx with the connectionId returned by the connect call)
curl -X POST http://localhost:3001/ssh/exec \
  -H "Content-Type: application/json" \
  -d '{
    "connectionId":"conn-xxx",
    "command":"cat readme"
  }'

# Should return:
# {"output":"boJ9jbbUNNfktd78OOpsqOltutMc3MY1\n","exitCode":0,"success":true}
```

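
If you script these proxy checks instead of reading the curl output by eye, it helps to validate the response shapes explicitly. The sketch below assumes the two responses contain exactly the fields shown in the sample outputs above; the interface and function names are hypothetical and are not taken from the proxy's source.

```typescript
// Shapes copied from the sample outputs above; any additional fields are unknown.
interface ConnectResponse { connectionId: string; success: boolean; message: string }
interface ExecResponse { output: string; exitCode: number; success: boolean }

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null;
}

/** Parse and validate the body returned by POST /ssh/connect. */
function parseConnectResponse(json: string): ConnectResponse {
  const data: unknown = JSON.parse(json);
  if (!isRecord(data)) throw new Error("unexpected /ssh/connect response: " + json);
  const connectionId = data["connectionId"];
  const success = data["success"];
  const message = data["message"];
  if (typeof connectionId !== "string" || typeof success !== "boolean" || typeof message !== "string") {
    throw new Error("unexpected /ssh/connect response: " + json);
  }
  return { connectionId: connectionId, success: success, message: message };
}

/** Parse and validate the body returned by POST /ssh/exec. */
function parseExecResponse(json: string): ExecResponse {
  const data: unknown = JSON.parse(json);
  if (!isRecord(data)) throw new Error("unexpected /ssh/exec response: " + json);
  const output = data["output"];
  const exitCode = data["exitCode"];
  const success = data["success"];
  if (typeof output !== "string" || typeof exitCode !== "number" || typeof success !== "boolean") {
    throw new Error("unexpected /ssh/exec response: " + json);
  }
  return { output: output, exitCode: exitCode, success: success };
}

/** Build the JSON body for POST /ssh/exec from a connectionId returned by connect. */
function buildExecRequest(connectionId: string, command: string): string {
  return JSON.stringify({ connectionId: connectionId, command: command });
}
```

Passing the `connectionId` from `parseConnectResponse` into `buildExecRequest` avoids the easy mistake of sending the literal `conn-xxx` placeholder to `/ssh/exec`.
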
## 🐛 Known Issues & Workarounds

### Issue 1: Durable Object Not Found (Local Dev)

**Error:**
```
Durable Object binding not found
```

**Cause:** `next dev` uses the standard Node.js runtime, not the Workers runtime

**Solutions:**
1. Use `wrangler dev` instead of `pnpm dev`
2. Deploy to Cloudflare for full testing
3. Test UI functionality only in local dev

### Issue 2: WebSocket Connection Failed

**Error:**
```
WebSocket connection error
connectionState: 'error'
```

**Cause:** Durable Object not available in local dev

**Solution:** Use `wrangler dev` or deploy to production

### Issue 3: OpenRouter API Errors

**Error:**
```
401 Unauthorized / Invalid API key
```

**Solution:**
1. Check `.dev.vars` has correct API key
2. Verify key at https://openrouter.ai/activity
3. Ensure key has credits

## 📊 Test Scenarios

### Scenario 1: Simple Level Test (0-1)

**Setup:**
- Model: GPT-4o Mini
- Levels: 0 to 1
- Max retries: 3

**Expected:**
1. Agent connects as bandit0
2. Executes `ls -la`
3. Finds `readme` file
4. Executes `cat readme`
5. Extracts password: `boJ9jbbUNNfktd78OOpsqOltutMc3MY1`
6. Validates password
7. Advances to level 1
8. Completes successfully

**Duration:** ~30 seconds

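
Step 5 of the expected flow (extracting the password from the `cat readme` output) can be illustrated with a small pattern match. This is a sketch under one assumption: that the password appears as a standalone run of 32 alphanumeric characters, as in the sample above. The agent's real extraction logic is not shown in this guide, and `extractPassword` is a hypothetical name.

```typescript
// Assumption: a password is a standalone run of exactly 32 alphanumeric characters.
const PASSWORD_PATTERN = /\b[A-Za-z0-9]{32}\b/;

/** Return the first password-like token in the command output, or null if none. */
function extractPassword(output: string): string | null {
  const match = PASSWORD_PATTERN.exec(output);
  return match === null ? null : match[0];
}
```

The word boundaries keep the pattern from matching part of a longer token, so a 33-character string returns `null` rather than a truncated password.
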
### Scenario 2: Multi-Level Test (0-5)

**Setup:**
- Model: Claude 3 Haiku or GPT-4o
- Levels: 0 to 5
- Max retries: 3

**Expected:**
- Each level solved systematically
- SSH connections maintained
- Checkpoints saved
- Total time: ~3-5 minutes

### Scenario 3: Pause/Resume Test

**Setup:**
- Model: Any
- Levels: 0 to 3
- Pause after level 1

**Expected:**
1. Start run
2. Complete level 0-1
3. Click PAUSE
4. Type manual command: `pwd`
5. See output in terminal
6. Click RESUME
7. Agent continues from level 1

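
The pause/resume flow above implies a small state machine behind the START/PAUSE/RESUME buttons. The sketch below is one possible model, written as an assumption about the intended behavior rather than a copy of the app's code. The state and action names are hypothetical, and `FINISH` stands in for the run completing on its own.

```typescript
type RunState = "idle" | "running" | "paused" | "complete";
type RunAction = "START" | "PAUSE" | "RESUME" | "FINISH";

/** Compute the next run state; actions that are invalid in the current state are ignored. */
function nextRunState(state: RunState, action: RunAction): RunState {
  if (action === "START" && state === "idle") return "running";
  if (action === "PAUSE" && state === "running") return "paused";
  if (action === "RESUME" && state === "paused") return "running";
  if (action === "FINISH" && state === "running") return "complete";
  return state;
}

/** Manual commands are accepted only while the run is paused. */
function canRunManualCommand(state: RunState): boolean {
  return state === "paused";
}
```

Walking the scenario through this model gives idle, running, paused (manual `pwd` allowed), running, and finally complete. Clicking PAUSE while idle leaves the state unchanged, which matches a START button that is enabled only when idle.
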
### Scenario 4: Error Recovery Test

**Setup:**
- Model: GPT-4o Mini
- Levels: 0 to 10
- Intentionally disconnect SSH mid-run

**Expected:**
- Agent detects error
- Retry logic kicks in
- Re-establishes connection
- Continues execution

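
The recovery behavior expected here (detect the error, reconnect, retry, continue) can be sketched as a small retry wrapper. This is a hypothetical illustration, not the agent's actual retry code. It is synchronous for brevity, whereas a real SSH call would be awaited, and it assumes "max retries" means retries after the first attempt.

```typescript
// Hypothetical sketch; the agent's real retry logic is not shown in this guide.
interface RetryOutcome<T> { value: T; attempts: number }

/**
 * Run `operation`, calling `reconnect` and trying again after each failure,
 * for at most `maxRetries` retries after the initial attempt.
 */
function runWithRetries<T>(
  operation: () => T,
  reconnect: () => void,
  maxRetries: number
): RetryOutcome<T> {
  let lastError: unknown = undefined;
  for (let attempt = 1; attempt <= maxRetries + 1; attempt++) {
    try {
      return { value: operation(), attempts: attempt };
    } catch (error) {
      lastError = error;
      // Re-establish the connection only if another attempt will follow.
      if (attempt <= maxRetries) reconnect();
    }
  }
  throw lastError; // every attempt failed: surface the last error
}
```

When testing this scenario manually, the equivalent observable signs are a logged error, a new connection request in the SSH proxy logs, and the run continuing rather than stopping.
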
## 📈 Success Criteria

### Minimum Viable Test
- ✅ UI loads without errors
- ✅ SSH proxy connects to Bandit server
- ✅ Can start a run (even if it fails)
- ✅ WebSocket attempts connection
- ✅ Terminal displays messages

### Full Integration Test
- ✅ Complete level 0-1 successfully
- ✅ Agent reasoning visible in chat
- ✅ Commands executed via SSH proxy
- ✅ Password validation works
- ✅ Level advancement automatic
- ✅ Pause/resume functional
- ✅ Manual intervention works

### Production Ready
- ✅ Complete levels 0-10 reliably
- ✅ Error recovery working
- ✅ Cost tracking accurate
- ✅ Logs saved to R2 (when configured)
- ✅ Multiple concurrent runs supported
- ✅ All models work via OpenRouter

## 🔍 Debugging Tips

### Check SSH Proxy Logs
```bash
# In your ssh-proxy terminal
# Should see connection requests
```

### Check Browser Console
```javascript
// Open DevTools (F12)
// Look for:
// - WebSocket connection attempts
// - API call results
// - Error messages
```

### Check Network Tab
- API calls to `/api/agent/[runId]/start`
- WebSocket upgrade to `/api/agent/[runId]/ws`
- Response status codes

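
To know exactly which requests to look for in the Network tab, the two routes can be derived from a run ID. The helpers below are a hypothetical sketch: the function names are illustrative, and they assume the WebSocket endpoint lives on the same origin as the page, with `http` mapped to `ws` and `https` to `wss`.

```typescript
/** Path of the start endpoint for a given run. */
function agentStartPath(runId: string): string {
  return "/api/agent/" + encodeURIComponent(runId) + "/start";
}

/** WebSocket URL for a given run, derived from the page origin. */
function agentWsUrl(origin: string, runId: string): string {
  const wsOrigin = origin
    .replace(/\/+$/, "")     // drop any trailing slash
    .replace(/^http/, "ws"); // http -> ws, https -> wss
  return wsOrigin + "/api/agent/" + encodeURIComponent(runId) + "/ws";
}
```

For the local dev server, `agentWsUrl("http://localhost:3002", "run-123")` yields `ws://localhost:3002/api/agent/run-123/ws`, which is the upgrade request to filter for.
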
### Check Wrangler Logs
```bash
# If using wrangler dev, logs print in the same terminal (Ctrl+C stops it).
# The logs show:
# - Durable Object creation
# - WebSocket messages
# - LangGraph execution
```

## 🎯 Next Steps

### For Local Testing:
1. ✅ SSH proxy running (you have this!)
2. ✅ Set OpenRouter API key in `.dev.vars`
3. ⏳ Switch to `wrangler dev` for full testing
4. 🎉 Test complete run (level 0-2)

### For Production:
1. Create Cloudflare account
2. Deploy with `wrangler deploy`
3. Set secrets: `wrangler secret put OPENROUTER_API_KEY`
4. Test on live URL
5. Optional: Set up D1 and R2

## 🎨 Current UI Features You Can Test

Even without the backend, you can test:

- **Theme toggle** - Dark/light mode
- **Panel switching** - Ctrl+K/J or ESC
- **Command history** - Arrow up/down
- **Model selection** - All 10+ models listed
- **Level range** - Any combination 0-33
- **Control buttons** - START/PAUSE/RESUME visual states
- **Status indicators** - Connection and run state
- **Retro effects** - Scan lines, grid, CRT glow
- **Responsive layout** - Desktop and mobile
- **Terminal styling** - Monospace, colors, timestamps
- **Chat formatting** - User/agent message differentiation

## 📝 Test Results Template

```markdown
## Test Run - [Date]

**Configuration:**
- Model: GPT-4o Mini
- Levels: 0-2
- Runtime: Wrangler Dev

**Results:**
- ✅ UI loaded correctly
- ✅ SSH proxy connected
- ✅ Agent started
- ✅ Level 0 completed (30s)
- ✅ Level 1 completed (45s)
- ❌ Level 2 failed (wrong command)
- Total time: 2m 15s
- Cost: $0.003

**Issues Found:**
- Agent confused by file with spaces in name
- Retry logic worked correctly
- Manual intervention successful

**Notes:**
- Claude 3 Haiku performed better on level 2
- Should increase timeout for decompression
```

## 🚀 Ready to Test!

You're all set! The implementation is complete. Start with UI testing, then move to `wrangler dev` for full integration testing.

Good luck! 🎉