2025-10-09 22:03:37 -06:00

Testing Guide - Bandit Runner LangGraph Agent

Current Status

What's Working

  • Build successful - no TypeScript errors
  • Dev server starts on port 3002
  • SSH proxy running on port 3001
  • All components installed and configured
  • Beautiful UI fully functional
  • WebSocket infrastructure ready

What Needs Configuration

  • ⚠️ OpenRouter API key (required for LLM)
  • ⚠️ Durable Object binding (works in production and under wrangler dev, not under next dev)

🚀 Quick Start Testing

1. Set Your OpenRouter API Key

Edit .dev.vars:

OPENROUTER_API_KEY=sk-or-v1-YOUR-ACTUAL-KEY-HERE

Get a key from: https://openrouter.ai/keys
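A quick sanity check on the file contents can catch a forgotten placeholder before a run fails with a 401. The sketch below is a hypothetical helper, not part of the app: it assumes .dev.vars uses dotenv-style KEY=value lines and that real keys start with the sk-or-v1- prefix shown in the example above.

```typescript
// Hypothetical helper (not part of the app): inspects the text of .dev.vars.
// Assumptions: dotenv-style KEY=value lines, and keys that start with the
// "sk-or-v1-" prefix shown in the example above.
type KeyCheck = { ok: true } | { ok: false; reason: string };

function checkOpenRouterKey(devVars: string): KeyCheck {
  for (const raw of devVars.split(/\r?\n/)) {
    const line = raw.trim();
    // Skip blank lines and comments.
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq === -1) continue;
    if (line.slice(0, eq).trim() !== "OPENROUTER_API_KEY") continue;
    // Strip optional surrounding quotes from the value.
    const value = line.slice(eq + 1).trim().replace(/^["']|["']$/g, "");
    if (value === "") {
      return { ok: false, reason: "OPENROUTER_API_KEY is empty" };
    }
    if (value.includes("YOUR-ACTUAL-KEY-HERE")) {
      return { ok: false, reason: "placeholder key has not been replaced" };
    }
    if (!value.startsWith("sk-or-v1-")) {
      return { ok: false, reason: "key does not start with sk-or-v1-" };
    }
    return { ok: true };
  }
  return { ok: false, reason: "OPENROUTER_API_KEY is not set" };
}
```

Passing the file text to checkOpenRouterKey returns { ok: true } for a plausible key, or a reason string that says what to fix. It does not contact OpenRouter, so a revoked or unfunded key still passes this check.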

2. Start the Application

cd bandit-runner-app
pnpm dev

The server starts on http://localhost:3002 because port 3000 was already in use and the SSH proxy holds 3001.

3. Test the UI

What You'll See:

  • Beautiful retro terminal interface with control panel
  • Model selection dropdown (GPT-4o, Claude, etc.)
  • Level range selector (0-33)
  • START/PAUSE/RESUME buttons
  • Connection status indicators

Try These Actions:

  1. Select a model - Choose "GPT-4o Mini" (cheapest for testing)
  2. Set level range - Start with 0-2 (quick test)
  3. Click START - This will attempt to create a run

4. Expected Behavior (Current State)

⚠️ Known Limitation: The Durable Object binding doesn't work in local dev mode (next dev). You'll see:

POST /api/agent/run-xxx/start - 500 (Durable Object binding not found)

This is expected! The warning message tells us:

"internal Durable Objects... will not work in local development, but they should work in production"
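The 500 above comes from a route that cannot find its Durable Object binding. One way to make that failure explicit is a small guard like the sketch below. The binding name AGENT_RUN and the helper requireAgentBinding are assumptions for illustration; the real binding name is whatever the project's wrangler configuration declares.

```typescript
// Sketch only. AGENT_RUN is a hypothetical binding name; the real one is
// defined in the project's wrangler configuration.
interface DurableObjectNamespaceLike {
  // Mirrors the idFromName method of a Durable Object namespace.
  idFromName(name: string): unknown;
}

type Env = { AGENT_RUN?: DurableObjectNamespaceLike };

type BindingLookup =
  | { ok: true; namespace: DurableObjectNamespaceLike }
  | { ok: false; status: number; error: string };

function requireAgentBinding(env: Env): BindingLookup {
  if (!env.AGENT_RUN) {
    // Under next dev the binding is absent, so report it clearly.
    return {
      ok: false,
      status: 500,
      error:
        "Durable Object binding not found. Run under wrangler dev or deploy to Cloudflare.",
    };
  }
  return { ok: true, namespace: env.AGENT_RUN };
}
```

A route handler could call requireAgentBinding(env) first and return the status and error as JSON when ok is false, which keeps the local-dev failure readable in the browser's Network tab.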

5. Testing Options

Option A: Test UI Without Backend (Current)

  • UI works perfectly
  • Control panel functional
  • Model selection works
  • WebSocket connection is attempted and fails gracefully
  • You can type commands and messages in the interface

Option B: Use Wrangler Dev (Full Testing)

# Install wrangler globally if needed
npm i -g wrangler

# Run with Workers runtime
wrangler dev

# This gives you:
# ✅ Full Durable Object support
# ✅ Real WebSocket connections
# ✅ Actual agent runs

Option C: Deploy to Cloudflare (Production Testing)

# Build
pnpm build

# Deploy
wrangler deploy

# Test on:
# https://bandit-runner-app.your-account.workers.dev

🧪 Manual Testing Checklist

UI Testing (Works Now)

  • Control panel displays correctly
  • Model dropdown shows all options
  • Level selectors work (0-33)
  • Streaming mode toggle functional
  • START button enabled when idle
  • Status indicators show correct state
  • Terminal panel renders
  • Agent chat panel renders
  • Command input accepts text
  • Chat input accepts text
  • Keyboard shortcuts work (Ctrl+K/J, ESC, arrow keys)
  • Theme toggle works
  • Retro styling (scan lines, grid) visible

Backend Testing (Requires Wrangler Dev)

  • Start run creates Durable Object
  • WebSocket connection established
  • Agent begins planning
  • SSH commands execute via proxy
  • Terminal shows command output
  • Chat shows agent thoughts
  • Pause button stops execution
  • Resume button continues
  • Manual commands work when paused
  • Level advancement works
  • Run completes successfully
  • Error handling works
  • Retry logic functions

SSH Proxy Integration

Test your SSH proxy directly:

# Test connection
curl -X POST http://localhost:3001/ssh/connect \
  -H "Content-Type: application/json" \
  -d '{
    "host":"bandit.labs.overthewire.org",
    "port":2220,
    "username":"bandit0",
    "password":"bandit0"
  }'

# Should return:
# {"connectionId":"conn-xxx","success":true,"message":"Connected successfully"}

# Test command execution
curl -X POST http://localhost:3001/ssh/exec \
  -H "Content-Type: application/json" \
  -d '{
    "connectionId":"conn-xxx",
    "command":"cat readme"
  }'

# Should return something like this (the password differs if OverTheWire has rotated it):
# {"output":"boJ9jbbUNNfktd78OOpsqOltutMc3MY1\n","exitCode":0,"success":true}
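The same two calls can be expressed in TypeScript for use in a script or test. This is a minimal sketch that only builds the request descriptors. The endpoint paths, port, and field names are copied from the curl examples above, while the helper names (postJson, connectRequest, execRequest) are illustrative.

```typescript
// Request descriptors mirroring the two curl calls above.
// Helper names are illustrative; paths and fields come from the curl examples.
interface ProxyRequest {
  url: string;
  method: "POST";
  headers: { "Content-Type": "application/json" };
  body: string;
}

// Response shapes copied from the sample output above.
interface ConnectResponse {
  connectionId: string;
  success: boolean;
  message: string;
}

interface ExecResponse {
  output: string;
  exitCode: number;
  success: boolean;
}

const PROXY_BASE = "http://localhost:3001"; // SSH proxy port from this guide

function postJson(path: string, payload: object): ProxyRequest {
  return {
    url: `${PROXY_BASE}${path}`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}

function connectRequest(username: string, password: string): ProxyRequest {
  return postJson("/ssh/connect", {
    host: "bandit.labs.overthewire.org",
    port: 2220,
    username,
    password,
  });
}

function execRequest(connectionId: string, command: string): ProxyRequest {
  return postJson("/ssh/exec", { connectionId, command });
}
```

To send one, pass the descriptor to fetch, for example fetch(req.url, req) with req = connectRequest("bandit0", "bandit0"), and read the JSON body as a ConnectResponse. The connectionId from that response is then the first argument to execRequest.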

🐛 Known Issues & Workarounds

Issue 1: Durable Object Not Found (Local Dev)

Error:

Durable Object binding not found

Cause: next dev uses the standard Node.js runtime, not the Workers runtime

Solutions:

  1. Use wrangler dev instead of pnpm dev
  2. Deploy to Cloudflare for full testing
  3. Test UI functionality only in local dev

Issue 2: WebSocket Connection Failed

Error:

WebSocket connection error
connectionState: 'error'

Cause: Durable Object not available in local dev

Solution: Use wrangler dev or deploy to production

Issue 3: OpenRouter API Errors

Error:

401 Unauthorized / Invalid API key

Solution:

  1. Check .dev.vars has correct API key
  2. Verify key at https://openrouter.ai/activity
  3. Ensure key has credits

📊 Test Scenarios

Scenario 1: Simple Level Test (0-1)

Setup:

  • Model: GPT-4o Mini
  • Levels: 0 to 1
  • Max retries: 3

Expected:

  1. Agent connects as bandit0
  2. Executes ls -la
  3. Finds readme file
  4. Executes cat readme
  5. Extracts password, for example boJ9jbbUNNfktd78OOpsqOltutMc3MY1 (the value differs if OverTheWire has rotated it)
  6. Validates password
  7. Advances to level 1
  8. Completes successfully

Duration: ~30 seconds
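When checking step 5 by hand, it helps to know what "extracting the password" might look like. The sketch below is one possible approach, not the agent's actual logic: it assumes a Bandit password is a standalone run of exactly 32 alphanumeric characters, as in the sample output in this guide.

```typescript
// Illustrative only; the agent's real extraction logic may differ.
// Assumption: a password is a standalone token of exactly 32 alphanumeric
// characters, matching the sample output in this guide.
function extractPassword(output: string): string | null {
  for (const line of output.split(/\r?\n/)) {
    // \b on both sides rejects longer alphanumeric tokens.
    const match = line.match(/\b[A-Za-z0-9]{32}\b/);
    if (match) return match[0];
  }
  return null;
}
```

Given the cat readme output from the proxy test, extractPassword returns the 32-character token. Given an ls -la listing it returns null, which is a cue that the agent should keep exploring.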

Scenario 2: Multi-Level Test (0-5)

Setup:

  • Model: Claude 3 Haiku or GPT-4o
  • Levels: 0 to 5
  • Max retries: 3

Expected:

  • Each level solved systematically
  • SSH connections maintained
  • Checkpoints saved
  • Total time: ~3-5 minutes

Scenario 3: Pause/Resume Test

Setup:

  • Model: Any
  • Levels: 0 to 3
  • Pause after level 1

Expected:

  1. Start run
  2. Complete level 0-1
  3. Click PAUSE
  4. Type manual command: pwd
  5. See output in terminal
  6. Click RESUME
  7. Agent continues from level 1
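The pause/resume flow above amounts to a small state machine. The sketch below models it under assumed state and action names (idle, running, paused, complete, and a FINISH action are illustrative, chosen to match the START/PAUSE/RESUME buttons); the app's real state names may differ.

```typescript
// Hypothetical model of the run lifecycle; names are assumptions chosen to
// match the START/PAUSE/RESUME buttons described in this guide.
type RunState = "idle" | "running" | "paused" | "complete";
type RunAction = "START" | "PAUSE" | "RESUME" | "FINISH";

function nextState(state: RunState, action: RunAction): RunState {
  switch (state) {
    case "idle":
      return action === "START" ? "running" : state;
    case "running":
      if (action === "PAUSE") return "paused";
      if (action === "FINISH") return "complete";
      return state;
    case "paused":
      return action === "RESUME" ? "running" : state;
    case "complete":
      // Terminal state: no action changes it.
      return state;
  }
}

// Manual commands are only expected to work while the run is paused.
function manualCommandsAllowed(state: RunState): boolean {
  return state === "paused";
}
```

Walking the scenario through this model: START moves idle to running, PAUSE moves running to paused (where manual commands such as pwd are allowed), and RESUME returns to running. Any action that is invalid for the current state leaves it unchanged.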

Scenario 4: Error Recovery Test

Setup:

  • Model: GPT-4o Mini
  • Levels: 0 to 10
  • Intentionally disconnect SSH mid-run

Expected:

  • Agent detects error
  • Retry logic kicks in
  • Re-establishes connection
  • Continues execution
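For reference while watching the retry behaviour, the sketch below shows one common shape for retry logic: exponential backoff with a cap. It is an assumption about how such logic could work, not a description of the agent's implementation; the 500 ms base and 8000 ms cap are arbitrary illustrative values, and the default of 3 retries mirrors the "Max retries: 3" setting used in the scenarios.

```typescript
// Generic retry sketch, not the agent's actual implementation.
// baseMs and capMs are arbitrary illustrative values.
function backoffDelay(attempt: number, baseMs = 500, capMs = 8000): number {
  // Delay doubles with each failed attempt, up to capMs.
  return Math.min(capMs, baseMs * 2 ** attempt);
}

function backoffDelays(maxRetries: number, baseMs = 500, capMs = 8000): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    delays.push(backoffDelay(attempt, baseMs, capMs));
  }
  return delays;
}

// Runs op once, then retries up to maxRetries times, waiting between tries.
async function withRetry<T>(
  op: () => Promise<T>,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      // Wait only if another attempt remains.
      if (attempt < maxRetries) await sleep(backoffDelay(attempt));
    }
  }
  throw lastError;
}
```

With the defaults, a dropped SSH connection would be retried after 500 ms, 1000 ms, and 2000 ms before the error is surfaced. The sleep parameter is injectable so a test can replace the real timer.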

📈 Success Criteria

Minimum Viable Test

  • UI loads without errors
  • SSH proxy connects to Bandit server
  • Can start a run (even if it fails)
  • WebSocket attempts connection
  • Terminal displays messages

Full Integration Test

  • Complete level 0-1 successfully
  • Agent reasoning visible in chat
  • Commands executed via SSH proxy
  • Password validation works
  • Level advancement automatic
  • Pause/resume functional
  • Manual intervention works

Production Ready

  • Complete levels 0-10 reliably
  • Error recovery working
  • Cost tracking accurate
  • Logs saved to R2 (when configured)
  • Multiple concurrent runs supported
  • All models work via OpenRouter

🔍 Debugging Tips

Check SSH Proxy Logs

# In your ssh-proxy terminal
# Should see connection requests

Check Browser Console

// Open DevTools (F12)
// Look for:
// - WebSocket connection attempts
// - API call results
// - Error messages

Check Network Tab

  • API calls to /api/agent/[runId]/start
  • WebSocket upgrade to /api/agent/[runId]/ws
  • Response status codes
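To know which URLs to look for in the Network tab, the two routes above can be derived from the page origin and the run ID. The helper below is a hypothetical sketch; only the route paths are taken from this guide.

```typescript
// Hypothetical helper; only the route paths come from this guide.
function agentUrls(origin: string, runId: string): { start: string; ws: string } {
  const id = encodeURIComponent(runId);
  // http -> ws and https -> wss for the WebSocket upgrade.
  const wsOrigin = origin.replace(/^http/, "ws");
  return {
    start: `${origin}/api/agent/${id}/start`,
    ws: `${wsOrigin}/api/agent/${id}/ws`,
  };
}
```

For the local dev server, agentUrls("http://localhost:3002", "run-123") gives the POST target for the start call and the ws:// address the browser tries to upgrade to.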

Check Wrangler Logs

# If using wrangler dev, logs stream in its terminal (Ctrl+C stops the server).
# Look for:
# - Durable Object creation
# - WebSocket messages
# - LangGraph execution

🎯 Next Steps

For Local Testing:

  1. SSH proxy running (you have this!)
  2. Set OpenRouter API key in .dev.vars
  3. Switch to wrangler dev for full testing
  4. 🎉 Test complete run (level 0-2)

For Production:

  1. Create Cloudflare account
  2. Deploy with wrangler deploy
  3. Set secrets: wrangler secret put OPENROUTER_API_KEY
  4. Test on live URL
  5. Optional: Set up D1 and R2

🎨 Current UI Features You Can Test

Even without the backend, you can test:

  • Theme toggle - Dark/light mode
  • Panel switching - Ctrl+K/J or ESC
  • Command history - Arrow up/down
  • Model selection - All 10+ models listed
  • Level range - Any combination 0-33
  • Control buttons - START/PAUSE/RESUME visual states
  • Status indicators - Connection and run state
  • Retro effects - Scan lines, grid, CRT glow
  • Responsive layout - Desktop and mobile
  • Terminal styling - Monospace, colors, timestamps
  • Chat formatting - User/agent message differentiation

📝 Test Results Template

## Test Run - [Date]

**Configuration:**
- Model: GPT-4o Mini
- Levels: 0-2
- Runtime: Wrangler Dev

**Results:**
- ✅ UI loaded correctly
- ✅ SSH proxy connected
- ✅ Agent started
- ✅ Level 0 completed (30s)
- ✅ Level 1 completed (45s)
- ❌ Level 2 failed (wrong command)
- Total time: 2m 15s
- Cost: $0.003

**Issues Found:**
- Agent confused by file with spaces in name
- Retry logic worked correctly
- Manual intervention successful

**Notes:**
- Claude 3 Haiku performed better on level 2
- Should increase timeout for decompression
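If results are collected by a script rather than typed by hand, the per-level lines of the template can be generated. The sketch below is one way to do that; the LevelResult shape and the helper names are assumptions made for illustration.

```typescript
// Illustrative formatter for the per-level lines of the template above.
// The LevelResult shape is an assumption, not an existing type in the app.
interface LevelResult {
  level: number;
  passed: boolean;
  seconds: number;
  note?: string; // e.g. a failure reason; shown instead of the duration
}

function formatDuration(totalSeconds: number): string {
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return minutes > 0 ? `${minutes}m ${seconds}s` : `${seconds}s`;
}

function renderLevel(result: LevelResult): string {
  const mark = result.passed ? "✅" : "❌";
  const verb = result.passed ? "completed" : "failed";
  const detail = result.note ?? formatDuration(result.seconds);
  return `- ${mark} Level ${result.level} ${verb} (${detail})`;
}

function renderResults(results: LevelResult[]): string {
  const total = results.reduce((sum, r) => sum + r.seconds, 0);
  const lines = results.map(renderLevel);
  lines.push(`- Total time: ${formatDuration(total)}`);
  return lines.join("\n");
}
```

renderResults takes one LevelResult per level and returns the bullet lines plus a total, ready to paste under the Results heading of the template.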

🚀 Ready to Test!

You're all set! The implementation is complete. Start with UI testing, then move to wrangler dev for full integration testing.

Good luck! 🎉