Nicholai/bandit-runner


nicholai e934d047b0 added example runs

2025-10-09 22:03:37 -06:00


Testing Guide - Bandit Runner LangGraph Agent

✅ Current Status

What's Working

✅ Build successful - no TypeScript errors
✅ Dev server starts on port 3002
✅ SSH proxy running on port 3001
✅ All components installed and configured
✅ Beautiful UI fully functional
✅ WebSocket infrastructure ready

What Needs Configuration

⚠️ OpenRouter API key (required for LLM)
⚠️ Durable Object export (works in production, limited in dev)

🚀 Quick Start Testing

1. Set Your OpenRouter API Key

Edit .dev.vars:

OPENROUTER_API_KEY=sk-or-v1-YOUR-ACTUAL-KEY-HERE

Get a key from: https://openrouter.ai/keys
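A quick sanity check on the key can catch the most common setup mistake, which is leaving the placeholder in place. The sketch below is a hypothetical helper, not part of the bandit-runner codebase, and it assumes keys start with `sk-or-` as in the example above.

```typescript
// Hypothetical helper: returns a problem description, or null if the key looks usable.
function openRouterKeyProblem(key: string | undefined): string | null {
  if (key === undefined || key.trim() === "") {
    return "OPENROUTER_API_KEY is not set";
  }
  // The placeholder from this guide should never reach the API.
  if (key.includes("YOUR-ACTUAL-KEY-HERE")) {
    return "OPENROUTER_API_KEY still holds the placeholder value";
  }
  // Assumption: keys start with "sk-or-", matching the example in .dev.vars.
  if (!key.startsWith("sk-or-")) {
    return "OPENROUTER_API_KEY does not look like an OpenRouter key";
  }
  return null;
}
```

This only checks the shape of the value. Whether the key is valid and has credits can only be confirmed by a real API call.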

2. Start the Application

cd bandit-runner-app
pnpm dev

The server starts on http://localhost:3002 because port 3000 was already in use.

3. Test the UI

What You'll See:

Beautiful retro terminal interface with control panel
Model selection dropdown (GPT-4o, Claude, etc.)
Level range selector (0-33)
START/PAUSE/RESUME buttons
Connection status indicators

Try These Actions:

Select a model - Choose "GPT-4o Mini" (cheapest for testing)
Set level range - Start with 0-2 (quick test)
Click START - This will attempt to create a run

4. Expected Behavior (Current State)

⚠️ Known Limitation: The Durable Object binding doesn't work in local dev mode (next dev). You'll see:

POST /api/agent/run-xxx/start - 500 (Durable Object binding not found)

This is expected! The warning message tells us:

"internal Durable Objects... will not work in local development, but they should work in production"

5. Testing Options

Option A: Test UI Without Backend (Current)

UI works perfectly
Control panel functional
Model selection works
WebSocket connection is attempted (and fails gracefully)
You can type commands and messages in the interface

Option B: Use Wrangler Dev (Full Testing)

# Install wrangler globally if needed
npm i -g wrangler

# Run with Workers runtime
wrangler dev

# This gives you:
# ✅ Full Durable Object support
# ✅ Real WebSocket connections
# ✅ Actual agent runs

Option C: Deploy to Cloudflare (Production Testing)

# Build
pnpm build

# Deploy
wrangler deploy

# Test on:
# https://bandit-runner-app.your-account.workers.dev

🧪 Manual Testing Checklist

UI Testing (Works Now)

Control panel displays correctly
Model dropdown shows all options
Level selectors work (0-33)
Streaming mode toggle functional
START button enabled when idle
Status indicators show correct state
Terminal panel renders
Agent chat panel renders
Command input accepts text
Chat input accepts text
Keyboard shortcuts work (Ctrl+K/J, ESC, arrow keys)
Theme toggle works
Retro styling (scan lines, grid) visible

Backend Testing (Requires Wrangler Dev)

Start run creates Durable Object
WebSocket connection established
Agent begins planning
SSH commands execute via proxy
Terminal shows command output
Chat shows agent thoughts
Pause button stops execution
Resume button continues
Manual commands work when paused
Level advancement works
Run completes successfully
Error handling works
Retry logic functions

SSH Proxy Integration

Test your SSH proxy directly:

# Test connection
curl -X POST http://localhost:3001/ssh/connect \
  -H "Content-Type: application/json" \
  -d '{
    "host":"bandit.labs.overthewire.org",
    "port":2220,
    "username":"bandit0",
    "password":"bandit0"
  }'

# Should return:
# {"connectionId":"conn-xxx","success":true,"message":"Connected successfully"}

# Test command execution
curl -X POST http://localhost:3001/ssh/exec \
  -H "Content-Type: application/json" \
  -d '{
    "connectionId":"conn-xxx",
    "command":"cat readme"
  }'

# Should return something like this (the password value may differ if OverTheWire has rotated it):
# {"output":"boJ9jbbUNNfktd78OOpsqOltutMc3MY1\n","exitCode":0,"success":true}
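The same two calls can be scripted, which is handy for a repeatable smoke test. This is a minimal sketch that assumes only the endpoints and payloads shown in the curl commands above. The names `proxyRequest`, `readLevel0Readme`, and `FetchLike` are illustrative and not part of the project.

```typescript
// Minimal fetch-like signature so the sketch does not depend on DOM or Node typings.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ json(): Promise<unknown> }>;

interface ProxyRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds a JSON POST for one of the two proxy endpoints.
function proxyRequest(
  baseUrl: string,
  path: "/ssh/connect" | "/ssh/exec",
  payload: object,
): ProxyRequest {
  return {
    url: baseUrl.replace(/\/+$/, "") + path, // tolerate a trailing slash
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    },
  };
}

// Connects as bandit0, then runs `cat readme` on that connection.
async function readLevel0Readme(fetchImpl: FetchLike, baseUrl: string): Promise<unknown> {
  const connect = proxyRequest(baseUrl, "/ssh/connect", {
    host: "bandit.labs.overthewire.org",
    port: 2220,
    username: "bandit0",
    password: "bandit0",
  });
  const connectResponse = await fetchImpl(connect.url, connect.init);
  // Assumption: the response carries a connectionId field, as in the sample output.
  const connected = (await connectResponse.json()) as { connectionId: string };

  const exec = proxyRequest(baseUrl, "/ssh/exec", {
    connectionId: connected.connectionId,
    command: "cat readme",
  });
  const execResponse = await fetchImpl(exec.url, exec.init);
  return execResponse.json();
}
```

With the proxy running, `readLevel0Readme(fetch, "http://localhost:3001")` should resolve to the exec response. Passing the fetch function in as a parameter keeps the request-building logic testable without a network.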

🐛 Known Issues & Workarounds

Issue 1: Durable Object Not Found (Local Dev)

Error:

Durable Object binding not found

Cause: next dev runs on the standard Node.js runtime, not the Workers runtime, so Durable Object bindings are unavailable

Solutions:

Use wrangler dev instead of pnpm dev
Deploy to Cloudflare for full testing
Test UI functionality only in local dev
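One way to make this failure easier to diagnose is to check for the binding before using it and return a hint alongside the error. The sketch below illustrates that idea only. The binding name `BANDIT_AGENT` and the response shape are assumptions, not taken from the project's configuration or route handlers.

```typescript
// Hypothetical environment shape; the real binding name may differ.
interface EnvLike {
  BANDIT_AGENT?: unknown;
}

interface BindingCheck {
  status: number;
  body: { ok?: boolean; error?: string; hint?: string };
}

// Returns a 500 with a hint when the Durable Object binding is missing.
function checkDurableObjectBinding(env: EnvLike): BindingCheck {
  if (env.BANDIT_AGENT === undefined) {
    return {
      status: 500,
      body: {
        error: "Durable Object binding not found",
        hint: "Run under wrangler dev or deploy; next dev has no Workers runtime",
      },
    };
  }
  return { status: 200, body: { ok: true } };
}
```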

Issue 2: WebSocket Connection Failed

Error:

WebSocket connection error
connectionState: 'error'

Cause: Durable Object not available in local dev

Solution: Use wrangler dev or deploy to production

Issue 3: OpenRouter API Errors

Error:

401 Unauthorized / Invalid API key

Solution:

Check .dev.vars has correct API key
Verify the key at https://openrouter.ai/keys and check its usage at https://openrouter.ai/activity
Ensure key has credits

📊 Test Scenarios

Scenario 1: Simple Level Test (0-1)

Setup:

Model: GPT-4o Mini
Levels: 0 to 1
Max retries: 3

Expected:

Agent connects as bandit0
Executes ls -la
Finds readme file
Executes cat readme
Extracts password: boJ9jbbUNNfktd78OOpsqOltutMc3MY1
Validates password
Advances to level 1
Completes successfully

Duration: ~30 seconds
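The "extracts password" step can be approximated with a pattern match on the command output. This is a sketch under one assumption: Bandit passwords are 32 alphanumeric characters, as in the example above. The agent's actual extraction logic may differ, and `extractPassword` is an illustrative name.

```typescript
// Returns the first standalone 32-character alphanumeric token, or null if none is found.
function extractPassword(output: string): string | null {
  const match = output.match(/\b[A-Za-z0-9]{32}\b/);
  return match ? match[0] : null;
}
```

The word boundaries keep the pattern from matching a 32-character slice of a longer token.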

Scenario 2: Multi-Level Test (0-5)

Setup:

Model: Claude 3 Haiku or GPT-4o
Levels: 0 to 5
Max retries: 3

Expected:

Each level solved systematically
SSH connections maintained
Checkpoints saved
Total time: ~3-5 minutes

Scenario 3: Pause/Resume Test

Setup:

Model: Any
Levels: 0 to 3
Pause after level 1

Expected:

Start run
Complete level 0-1
Click PAUSE
Type manual command: pwd
See output in terminal
Click RESUME
Agent continues from level 1
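The pause/resume flow above can be read as a small state machine, which makes the expected button behavior easy to check. The state and action names below are illustrative assumptions, not the identifiers used in the app.

```typescript
type RunState = "idle" | "running" | "paused" | "complete" | "failed";
type RunAction = "start" | "pause" | "resume" | "finish" | "fail";

// Allowed transitions; anything not listed leaves the state unchanged.
const transitions: Record<RunState, Partial<Record<RunAction, RunState>>> = {
  idle: { start: "running" },
  running: { pause: "paused", finish: "complete", fail: "failed" },
  paused: { resume: "running" },
  complete: {},
  failed: {},
};

function nextState(state: RunState, action: RunAction): RunState {
  return transitions[state][action] ?? state;
}

// Manual commands are only expected to work while the run is paused.
function canRunManualCommand(state: RunState): boolean {
  return state === "paused";
}
```

Walking the scenario through this table gives idle, running, paused, running, which matches the START, PAUSE, RESUME sequence.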

Scenario 4: Error Recovery Test

Setup:

Model: GPT-4o Mini
Levels: 0 to 10
Intentionally disconnect SSH mid-run

Expected:

Agent detects error
Retry logic kicks in
Re-establishes connection
Continues execution
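For reference, retry logic of this kind is often written as a bounded loop with exponential backoff. The sketch below shows that general pattern, not the agent's actual implementation. `backoffDelays` and `withRetry` are hypothetical names, and the sleep function is passed in so the loop can be exercised without real delays.

```typescript
// Delay before each retry: baseMs, 2*baseMs, 4*baseMs, ...
function backoffDelays(maxRetries: number, baseMs: number): number[] {
  return Array.from({ length: maxRetries }, (_, i) => baseMs * 2 ** i);
}

// Runs op once, then retries after each delay; rethrows the last error if all attempts fail.
async function withRetry<T>(
  op: () => Promise<T>,
  delays: number[],
  sleep: (ms: number) => Promise<void>,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= delays.length; attempt++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
      if (attempt < delays.length) {
        await sleep(delays[attempt]);
      }
    }
  }
  throw lastError;
}
```

With "Max retries: 3" and a base of 500 ms, the delays would be 500, 1000, and 2000 ms. In an SSH reconnect scenario, `op` would re-establish the connection before re-running the command.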

📈 Success Criteria

Minimum Viable Test

✅ UI loads without errors
✅ SSH proxy connects to Bandit server
✅ Can start a run (even if it fails)
✅ WebSocket attempts connection
✅ Terminal displays messages

Full Integration Test

✅ Complete level 0-1 successfully
✅ Agent reasoning visible in chat
✅ Commands executed via SSH proxy
✅ Password validation works
✅ Level advancement automatic
✅ Pause/resume functional
✅ Manual intervention works

Production Ready

✅ Complete levels 0-10 reliably
✅ Error recovery working
✅ Cost tracking accurate
✅ Logs saved to R2 (when configured)
✅ Multiple concurrent runs supported
✅ All models work via OpenRouter

🔍 Debugging Tips

Check SSH Proxy Logs

# In your ssh-proxy terminal
# Should see connection requests

Check Browser Console

// Open DevTools (F12)
// Look for:
// - WebSocket connection attempts
// - API call results
// - Error messages

Check Network Tab

API calls to /api/agent/[runId]/start
WebSocket upgrade to /api/agent/[runId]/ws
Response status codes
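When filtering the Network tab, it helps to know the exact URLs to look for. This small sketch derives them from the two route patterns listed above. `agentEndpoints` is an illustrative helper, and the example run ID is made up.

```typescript
// Builds the start and WebSocket URLs for a given run.
function agentEndpoints(origin: string, runId: string): { start: string; ws: string } {
  const base = origin.replace(/\/+$/, ""); // tolerate a trailing slash
  const id = encodeURIComponent(runId);
  return {
    start: `${base}/api/agent/${id}/start`,
    // http becomes ws, https becomes wss
    ws: `${base.replace(/^http/, "ws")}/api/agent/${id}/ws`,
  };
}
```

For example, `agentEndpoints("http://localhost:3002", "run-123")` gives a start URL under `http://` and a WebSocket URL under `ws://` on the same host and port.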

Check Wrangler Logs

# If using wrangler dev
# Logs stream in the terminal (press Ctrl+C to stop) and show:
# - Durable Object creation
# - WebSocket messages
# - LangGraph execution

🎯 Next Steps

For Local Testing:

✅ SSH proxy running (you have this!)
✅ Set OpenRouter API key in .dev.vars
⏳ Switch to wrangler dev for full testing
🎉 Test a complete run (levels 0-2)

For Production:

Create Cloudflare account
Deploy with wrangler deploy
Set secrets: wrangler secret put OPENROUTER_API_KEY
Test on live URL
Optional: Set up D1 and R2

🎨 Current UI Features You Can Test

Even without the backend, you can test:

Theme toggle - Dark/light mode
Panel switching - Ctrl+K/J or ESC
Command history - Arrow up/down
Model selection - All 10+ models listed
Level range - Any combination 0-33
Control buttons - START/PAUSE/RESUME visual states
Status indicators - Connection and run state
Retro effects - Scan lines, grid, CRT glow
Responsive layout - Desktop and mobile
Terminal styling - Monospace, colors, timestamps
Chat formatting - User/agent message differentiation

📝 Test Results Template

## Test Run - [Date]

**Configuration:**
- Model: GPT-4o Mini
- Levels: 0-2
- Runtime: Wrangler Dev

**Results:**
- ✅ UI loaded correctly
- ✅ SSH proxy connected
- ✅ Agent started
- ✅ Level 0 completed (30s)
- ✅ Level 1 completed (45s)
- ❌ Level 2 failed (wrong command)
- Total time: 2m 15s
- Cost: $0.003

**Issues Found:**
- Agent confused by file with spaces in name
- Retry logic worked correctly
- Manual intervention successful

**Notes:**
- Claude 3 Haiku performed better on level 2
- Should increase timeout for decompression

🚀 Ready to Test!

You're all set! The implementation is complete. Start with UI testing, then move to wrangler dev for full integration testing.

Good luck! 🎉
