# Testing Guide - Bandit Runner LangGraph Agent

## ✅ Current Status

### What's Working

- ✅ Build successful - no TypeScript errors
- ✅ Dev server starts on port 3002
- ✅ SSH proxy running on port 3001
- ✅ All components installed and configured
- ✅ Beautiful UI fully functional
- ✅ WebSocket infrastructure ready

### What Needs Configuration

- ⚠️ OpenRouter API key (required for LLM)
- ⚠️ Durable Object export (works in production, limited in dev)

## 🚀 Quick Start Testing

### 1. Set Your OpenRouter API Key

Edit `.dev.vars`:

```bash
OPENROUTER_API_KEY=sk-or-v1-YOUR-ACTUAL-KEY-HERE
```

Get a key from: https://openrouter.ai/keys

### 2. Start the Application

```bash
cd bandit-runner-app
pnpm dev
```

Server will start on http://localhost:3002 (port 3000 was taken)

### 3. Test the UI

**What You'll See:**

- Beautiful retro terminal interface with control panel
- Model selection dropdown (GPT-4o, Claude, etc.)
- Level range selector (0-33)
- START/PAUSE/RESUME buttons
- Connection status indicators

**Try These Actions:**

1. **Select a model** - Choose "GPT-4o Mini" (cheapest for testing)
2. **Set level range** - Start with 0-2 (quick test)
3. **Click START** - This will attempt to create a run

### 4. Expected Behavior (Current State)

**⚠️ Known Limitation:** The Durable Object binding doesn't work in local dev mode (`next dev`).

You'll see:

```
POST /api/agent/run-xxx/start - 500 (Durable Object binding not found)
```

This is expected! The warning message tells us:

> "internal Durable Objects... will not work in local development, but they should work in production"

### 5. Testing Options

**Option A: Test UI Without Backend (Current)**

- UI works perfectly
- Control panel functional
- Model selection works
- WebSocket connection attempts (fails gracefully)
- You can type commands and messages in the interface

**Option B: Use Wrangler Dev (Full Testing)**

```bash
# Install wrangler globally if needed
npm i -g wrangler

# Run with Workers runtime
wrangler dev

# This gives you:
# ✅ Full Durable Object support
# ✅ Real WebSocket connections
# ✅ Actual agent runs
```

**Option C: Deploy to Cloudflare (Production Testing)**

```bash
# Build
pnpm build

# Deploy
wrangler deploy

# Test on:
# https://bandit-runner-app.your-account.workers.dev
```

## 🧪 Manual Testing Checklist

### UI Testing (Works Now)

- [ ] Control panel displays correctly
- [ ] Model dropdown shows all options
- [ ] Level selectors work (0-33)
- [ ] Streaming mode toggle functional
- [ ] START button enabled when idle
- [ ] Status indicators show correct state
- [ ] Terminal panel renders
- [ ] Agent chat panel renders
- [ ] Command input accepts text
- [ ] Chat input accepts text
- [ ] Keyboard shortcuts work (Ctrl+K/J, ESC, arrow keys)
- [ ] Theme toggle works
- [ ] Retro styling (scan lines, grid) visible

### Backend Testing (Requires Wrangler Dev)

- [ ] Start run creates Durable Object
- [ ] WebSocket connection established
- [ ] Agent begins planning
- [ ] SSH commands execute via proxy
- [ ] Terminal shows command output
- [ ] Chat shows agent thoughts
- [ ] Pause button stops execution
- [ ] Resume button continues
- [ ] Manual commands work when paused
- [ ] Level advancement works
- [ ] Run completes successfully
- [ ] Error handling works
- [ ] Retry logic functions

### SSH Proxy Integration

Test your SSH proxy directly:

```bash
# Test connection
curl -X POST http://localhost:3001/ssh/connect \
  -H "Content-Type: application/json" \
  -d '{
    "host":"bandit.labs.overthewire.org",
    "port":2220,
    "username":"bandit0",
    "password":"bandit0"
  }'

# Should return:
# {"connectionId":"conn-xxx","success":true,"message":"Connected successfully"}

# Test command execution
curl -X POST http://localhost:3001/ssh/exec \
  -H "Content-Type: application/json" \
  -d '{
    "connectionId":"conn-xxx",
    "command":"cat readme"
  }'

# Should return:
# {"output":"boJ9jbbUNNfktd78OOpsqOltutMc3MY1\n","exitCode":0,"success":true}
```

## 🐛 Known Issues & Workarounds

### Issue 1: Durable Object Not Found (Local Dev)

**Error:**

```
Durable Object binding not found
```

**Cause:** `next dev` uses standard Node.js runtime, not Workers runtime

**Solutions:**

1. Use `wrangler dev` instead of `pnpm dev`
2. Deploy to Cloudflare for full testing
3. Test UI functionality only in local dev

### Issue 2: WebSocket Connection Failed

**Error:**

```
WebSocket connection error
connectionState: 'error'
```

**Cause:** Durable Object not available in local dev

**Solution:** Use wrangler dev or deploy to production

### Issue 3: OpenRouter API Errors

**Error:**

```
401 Unauthorized / Invalid API key
```

**Solution:**

1. Check `.dev.vars` has correct API key
2. Verify key at https://openrouter.ai/activity
3. Ensure key has credits

## 📊 Test Scenarios

### Scenario 1: Simple Level Test (0-1)

**Setup:**

- Model: GPT-4o Mini
- Levels: 0 to 1
- Max retries: 3

**Expected:**

1. Agent connects as bandit0
2. Executes `ls -la`
3. Finds `readme` file
4. Executes `cat readme`
5. Extracts password: `boJ9jbbUNNfktd78OOpsqOltutMc3MY1`
6. Validates password
7. Advances to level 1
8. Completes successfully

**Duration:** ~30 seconds

### Scenario 2: Multi-Level Test (0-5)

**Setup:**

- Model: Claude 3 Haiku or GPT-4o
- Levels: 0 to 5
- Max retries: 3

**Expected:**

- Each level solved systematically
- SSH connections maintained
- Checkpoints saved
- Total time: ~3-5 minutes

### Scenario 3: Pause/Resume Test

**Setup:**

- Model: Any
- Levels: 0 to 3
- Pause after level 1

**Expected:**

1. Start run
2. Complete level 0-1
3. Click PAUSE
4. Type manual command: `pwd`
5. See output in terminal
6. Click RESUME
7. Agent continues from level 1

### Scenario 4: Error Recovery Test

**Setup:**

- Model: GPT-4o Mini
- Levels: 0 to 10
- Intentionally disconnect SSH mid-run

**Expected:**

- Agent detects error
- Retry logic kicks in
- Re-establishes connection
- Continues execution

## 📈 Success Criteria

### Minimum Viable Test

- ✅ UI loads without errors
- ✅ SSH proxy connects to Bandit server
- ✅ Can start a run (even if it fails)
- ✅ WebSocket attempts connection
- ✅ Terminal displays messages

### Full Integration Test

- ✅ Complete level 0-1 successfully
- ✅ Agent reasoning visible in chat
- ✅ Commands executed via SSH proxy
- ✅ Password validation works
- ✅ Level advancement automatic
- ✅ Pause/resume functional
- ✅ Manual intervention works

### Production Ready

- ✅ Complete levels 0-10 reliably
- ✅ Error recovery working
- ✅ Cost tracking accurate
- ✅ Logs saved to R2 (when configured)
- ✅ Multiple concurrent runs supported
- ✅ All models work via OpenRouter

## 🔍 Debugging Tips

### Check SSH Proxy Logs

```bash
# In your ssh-proxy terminal
# Should see connection requests
```

### Check Browser Console

```javascript
// Open DevTools (F12)
// Look for:
// - WebSocket connection attempts
// - API call results
// - Error messages
```

### Check Network Tab

- API calls to `/api/agent/[runId]/start`
- WebSocket upgrade to `/api/agent/[runId]/ws`
- Response status codes

### Check Wrangler Logs

```bash
# If using wrangler dev
# Ctrl+C to stop, logs show:
# - Durable Object creation
# - WebSocket messages
# - LangGraph execution
```

## 🎯 Next Steps

### For Local Testing:

1. ✅ SSH proxy running (you have this!)
2. ✅ Set OpenRouter API key in `.dev.vars`
3. ⏳ Switch to `wrangler dev` for full testing
4. 🎉 Test complete run (level 0-2)

### For Production:

1. Create Cloudflare account
2. Deploy with `wrangler deploy`
3. Set secrets: `wrangler secret put OPENROUTER_API_KEY`
4. Test on live URL
5. Optional: Set up D1 and R2

## 🎨 Current UI Features You Can Test

Even without the backend, you can test:

- **Theme toggle** - Dark/light mode
- **Panel switching** - Ctrl+K/J or ESC
- **Command history** - Arrow up/down
- **Model selection** - All 10+ models listed
- **Level range** - Any combination 0-33
- **Control buttons** - START/PAUSE/RESUME visual states
- **Status indicators** - Connection and run state
- **Retro effects** - Scan lines, grid, CRT glow
- **Responsive layout** - Desktop and mobile
- **Terminal styling** - Monospace, colors, timestamps
- **Chat formatting** - User/agent message differentiation

## 📝 Test Results Template

```markdown
## Test Run - [Date]

**Configuration:**
- Model: GPT-4o Mini
- Levels: 0-2
- Runtime: Wrangler Dev

**Results:**
- ✅ UI loaded correctly
- ✅ SSH proxy connected
- ✅ Agent started
- ✅ Level 0 completed (30s)
- ✅ Level 1 completed (45s)
- ❌ Level 2 failed (wrong command)
- Total time: 2m 15s
- Cost: $0.003

**Issues Found:**
- Agent confused by file with spaces in name
- Retry logic worked correctly
- Manual intervention successful

**Notes:**
- Claude 3 Haiku performed better on level 2
- Should increase timeout for decompression
```

## 🚀 Ready to Test!

You're all set! The implementation is complete. Start with UI testing, then move to `wrangler dev` for full integration testing.

Good luck! 🎉
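
## 🧩 Appendix: Scripted SSH Proxy Check (Sketch)

The two `curl` calls in the SSH Proxy Integration section can also be run as one scripted check. The following is a minimal sketch, not part of the project: `buildRequest`, `postJson`, `smokeTestProxy`, and the `FetchLike` type are hypothetical names introduced here. The endpoints (`/ssh/connect`, `/ssh/exec`), the request payloads, and the response fields are taken from the curl examples in this guide, and the real proxy may return additional fields. The fetch implementation is passed in as a parameter so the flow can be exercised with a stub, without a running proxy.

```typescript
// Hypothetical smoke test for the SSH proxy. None of these names exist in the
// project; endpoints and JSON fields are copied from the curl examples above.

interface ConnectResponse {
  connectionId: string;
  success: boolean;
  message?: string;
}

interface ExecResponse {
  output: string;
  exitCode: number;
  success: boolean;
}

// Minimal fetch-like signature so a stub can be injected instead of real fetch.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Build the URL and POST options for a JSON request (no network access here).
function buildRequest(baseUrl: string, path: string, payload: unknown) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}${path}`, // drop trailing slashes
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(payload),
    },
  };
}

// POST a JSON payload and parse the JSON response, failing on non-2xx status.
async function postJson<T>(
  fetchFn: FetchLike,
  baseUrl: string,
  path: string,
  payload: unknown
): Promise<T> {
  const { url, init } = buildRequest(baseUrl, path, payload);
  const res = await fetchFn(url, init);
  if (!res.ok) throw new Error(`${path} returned HTTP ${res.status}`);
  return (await res.json()) as T;
}

// Connect as bandit0, then run `cat readme`, mirroring the two curl calls.
async function smokeTestProxy(
  fetchFn: FetchLike,
  baseUrl = "http://localhost:3001"
): Promise<ExecResponse> {
  const conn = await postJson<ConnectResponse>(fetchFn, baseUrl, "/ssh/connect", {
    host: "bandit.labs.overthewire.org",
    port: 2220,
    username: "bandit0",
    password: "bandit0",
  });
  if (!conn.success) {
    throw new Error(`connect failed: ${conn.message ?? "no message"}`);
  }

  const exec = await postJson<ExecResponse>(fetchFn, baseUrl, "/ssh/exec", {
    connectionId: conn.connectionId,
    command: "cat readme",
  });
  if (!exec.success || exec.exitCode !== 0) {
    throw new Error(`exec failed with exit code ${exec.exitCode}`);
  }
  return exec;
}
```

With the proxy running on port 3001 and a runtime that provides a global `fetch` (for example Node 18 or later), calling `smokeTestProxy(fetch).then((r) => console.log(r.output))` should print the contents of `readme`. A thrown error indicates which of the two steps failed.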