Agent Reliability, Terminal Fidelity, and Reasoning Visibility - Implementation Summary
Overview
This implementation addresses three critical issues identified in the agent's behavior (the first three items below) and adds two supporting capabilities (the last two):
- Max-Retries User Decision Flow - Prevents dead-ends at max retries by giving users options to Stop, Intervene, or Continue
- Terminal Fidelity Improvements - Enhanced command hygiene and pre-advance password validation for better agent behavior
- Reasoning Visibility - Properly displays LLM thinking/reasoning in the chat panel
- Error Recovery - Added retry logic with exponential backoff for all critical operations
- Cost Tracking - Real-time token usage and cost display in the agent panel
Implementation Details
1. Max-Retries → User Decision Flow
Files Modified:
bandit-runner-app/src/lib/durable-objects/BanditAgentDO.ts
bandit-runner-app/src/lib/agents/bandit-state.ts
bandit-runner-app/src/hooks/useAgentWebSocket.ts
bandit-runner-app/src/components/terminal-chat-interface.tsx
Changes:
- BanditAgentDO now emits user_action_required events when max retries are hit instead of immediately failing
- Agent state transitions to paused rather than failed on max-retries errors
- The /retry endpoint now properly resets retry count AND resumes the agent run
- AgentEvent type extended with user_action_required event type and associated data fields
- WebSocket hook now supports callbacks for user_action_required events
- Terminal Interface displays a modal dialog (shadcn AlertDialog) with three options:
  - Stop: Ends the run completely
  - Intervene: Enables manual mode and pauses the agent
  - Continue: Resets retry counter and resumes the agent
Benefits:
- No more dead-ends at Level 1 or any level
- Users can provide manual assistance when the agent gets stuck
- Enables iterative debugging and agent improvement
- Maintains leaderboard integrity (manual intervention is tracked)
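The decision flow in this section can be sketched as two small pure functions. This is a minimal illustration, not the code in BanditAgentDO.ts: the RunState shape, the "stopped" status, and the function names onRetryLimit and applyDecision are assumptions made for the example. Only the three options, the paused state, and the retry-count reset come from the summary.

```typescript
// Hypothetical types: the summary names the 'paused' and 'failed' states and
// the three user options, but not the exact shapes used in the codebase.
type RunStatus = "running" | "paused" | "failed" | "stopped";
type UserDecision = "stop" | "intervene" | "continue";

interface RunState {
  status: RunStatus;
  retryCount: number;
  maxRetries: number;
  manualMode: boolean;
}

interface UserActionRequiredEvent {
  type: "user_action_required";
  data: { reason: "max_retries"; retryCount: number; options: UserDecision[] };
}

// When the retry limit is reached, pause and ask the user instead of failing.
function onRetryLimit(
  state: RunState,
): { state: RunState; event: UserActionRequiredEvent | null } {
  if (state.retryCount < state.maxRetries) {
    return { state, event: null };
  }
  return {
    state: { ...state, status: "paused" },
    event: {
      type: "user_action_required",
      data: {
        reason: "max_retries",
        retryCount: state.retryCount,
        options: ["stop", "intervene", "continue"],
      },
    },
  };
}

// Apply the button the user clicked in the dialog.
function applyDecision(state: RunState, decision: UserDecision): RunState {
  switch (decision) {
    case "stop":
      return { ...state, status: "stopped" };
    case "intervene":
      return { ...state, status: "paused", manualMode: true };
    case "continue":
      return { ...state, status: "running", retryCount: 0 };
  }
}
```

Keeping both steps free of side effects makes the three outcomes easy to unit-test without a WebSocket or Durable Object in the loop.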
2. Terminal Fidelity & Command Hygiene
Files Modified:
ssh-proxy/agent.ts
Changes:
- Updated SYSTEM_PROMPT to explicitly forbid nested SSH connections and dangerous commands
- Command Validation in executeCommand checks for forbidden patterns:
  - ssh commands (nested SSH)
  - scp, sudo, su commands
  - Dangerous patterns like rm -rf
- Forbidden commands return error messages and return to planning state instead of executing
- Pre-Advance Password Validation: After extracting a password, validateResult now:
  - Tests the password with a non-interactive SSH connection (testOnly: true)
  - Only advances if the password is valid
  - Counts invalid passwords as retries (fail-fast approach)
  - Falls back to proceeding on network errors (fail-open for robustness)
- Accurate completion events: run_complete now includes status information based on final state
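The command check described in this list can be sketched as a small pattern table. The command names come from the summary; the regular expressions, the checkCommand name, and the choice to match a command only at the start of the line or after a shell separator are assumptions of this sketch, not the patterns used in agent.ts.

```typescript
interface CommandCheck {
  allowed: boolean;
  reason?: string;
}

// Build a pattern that matches a command name at the start of the input or
// right after a shell separator (; & |). Hypothetical matching strategy.
function commandPattern(name: string): RegExp {
  return new RegExp(`(^|[;&|]\\s*)${name}(\\s|$)`);
}

const FORBIDDEN: { pattern: RegExp; reason: string }[] = [
  { pattern: commandPattern("ssh"), reason: "nested SSH connections are not allowed" },
  { pattern: commandPattern("scp"), reason: "scp is not allowed" },
  { pattern: commandPattern("sudo"), reason: "sudo is not allowed" },
  { pattern: commandPattern("su"), reason: "su is not allowed" },
  {
    // rm with both r and f in one flag group, e.g. "rm -rf" or "rm -fr".
    pattern: /(^|[;&|]\s*)rm\s+-(?=[a-zA-Z]*r)(?=[a-zA-Z]*f)[a-zA-Z]+/i,
    reason: "recursive forced deletion is not allowed",
  },
];

// Returns the first matching rule, or allows the command if none match.
function checkCommand(command: string): CommandCheck {
  const trimmed = command.trim();
  for (const rule of FORBIDDEN) {
    if (rule.pattern.test(trimmed)) {
      return { allowed: false, reason: rule.reason };
    }
  }
  return { allowed: true };
}
```

With this approach `cat ~/.ssh/config` is still allowed because `ssh` appears there as part of a path, not as a command. A caller would surface the reason as the error message and return to the planning state rather than executing.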
Benefits:
- Prevents common agent errors (nested SSH causing timeouts)
- Reduces wasted retries on invalid passwords
- More reliable level advancement
- Better alignment with example terminal agent UX (like opencode)
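The mix of fail-fast and fail-open behavior in the pre-advance password check can be illustrated as follows. This is a sketch under stated assumptions: testPassword is a hypothetical callback standing in for the non-interactive SSH test (testOnly: true), assumed to resolve to true or false for the authentication result and to reject when the connection itself cannot be made. The validateBeforeAdvance name and the result shape are also invented for the example.

```typescript
interface ValidationResult {
  outcome: "advance" | "retry";
  failedOpen: boolean; // true when validation could not run and we proceeded anyway
  reason: string;
}

// Decide whether to advance to the next level after a password was extracted.
async function validateBeforeAdvance(
  testPassword: () => Promise<boolean>, // hypothetical stand-in for the SSH test
): Promise<ValidationResult> {
  try {
    const accepted = await testPassword();
    if (accepted) {
      return { outcome: "advance", failedOpen: false, reason: "password accepted" };
    }
    // Fail fast: a rejected password counts as a retry.
    return { outcome: "retry", failedOpen: false, reason: "password rejected" };
  } catch (err) {
    // Fail open: this sketch treats any thrown error as a network problem
    // and proceeds rather than blocking the run.
    const detail = err instanceof Error ? err.message : String(err);
    return { outcome: "advance", failedOpen: true, reason: `validation unavailable: ${detail}` };
  }
}
```

Recording failedOpen separately lets a caller log that a level was advanced without confirmation, which may matter when comparing runs.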
3. Reasoning Visibility
Files Modified:
bandit-runner-app/src/components/terminal-chat-interface.tsx
Changes:
- Updated chat message rendering to display thinking messages with their full content
- Thinking messages now show with distinct styling (blue border/text)
- Message type label shows "THINKING" for reasoning messages
- Thinking messages were already emitted by the agent; they are now properly rendered in the UI
Benefits:
- Full transparency into agent's decision-making process
- Critical for benchmarking and debugging
- Helps users understand what the agent is thinking before executing commands
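A rough sketch of the rendering rule described above, written as a pure function so it can be shown without React. The ChatMessage and RenderedMessage shapes, the describeMessage name, the Tailwind-style class names, and the uppercase fallback label are all assumptions; the summary only states that thinking messages show their full content, a "THINKING" label, and blue border/text styling.

```typescript
interface ChatMessage {
  type: string; // e.g. "thinking"; other type names are not specified in the summary
  content: string;
}

interface RenderedMessage {
  label: string;
  className: string; // hypothetical Tailwind-style classes
  body: string;
}

// Map a chat message to what the chat panel should show for it.
function describeMessage(msg: ChatMessage): RenderedMessage {
  if (msg.type === "thinking") {
    return {
      label: "THINKING",
      className: "border-blue-500 text-blue-400",
      body: msg.content, // full reasoning text, not a placeholder
    };
  }
  return {
    label: msg.type.toUpperCase(),
    className: "border-border text-foreground",
    body: msg.content,
  };
}
```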
4. Error Recovery with Exponential Backoff
Files Modified:
ssh-proxy/agent.ts
Changes:
- Added retryWithBackoff helper function:
  - Generic retry logic with exponential backoff (1s → 2s → 4s)
  - Configurable max retries and base delay
  - Contextual error messages for debugging
- Applied to critical operations:
  - SSH connections (3 retries, 1s base delay)
  - LLM planning calls (3 retries, 2s base delay)
  - SSH command execution (2 retries, 1.5s base delay)
- Graceful error handling with informative error messages
Benefits:
- Resilient to transient network failures
- Reduces run failures due to temporary issues
- Better user experience (fewer unexplained failures)
- Production-ready reliability
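A helper along these lines could look like the sketch below. It is not the retryWithBackoff in agent.ts: the options shape and the injectable sleep parameter are assumptions, and the sketch assumes "max retries" counts retries after the initial attempt, so 3 retries with a 1s base delay wait 1s, 2s, and 4s.

```typescript
interface RetryOptions {
  maxRetries: number;  // retries after the first attempt (assumption)
  baseDelayMs: number; // delay before the first retry; doubles each time
  context: string;     // included in the final error message
  sleep?: (ms: number) => Promise<void>; // injectable for tests (hypothetical)
}

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const sleep =
    options.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;

  for (let attempt = 0; attempt <= options.maxRetries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt === options.maxRetries) break; // no delay after the last attempt
      await sleep(options.baseDelayMs * Math.pow(2, attempt));
    }
  }

  const detail = lastError instanceof Error ? lastError.message : String(lastError);
  throw new Error(`${options.context} failed after ${options.maxRetries} retries: ${detail}`);
}
```

A call for SSH connections with the values listed above might then read `retryWithBackoff(() => connect(), { maxRetries: 3, baseDelayMs: 1000, context: "SSH connection" })`, where connect is a placeholder for the real operation.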
5. Token Usage & Cost Tracking
Files Modified:
ssh-proxy/agent.ts
bandit-runner-app/src/lib/agents/bandit-state.ts
bandit-runner-app/src/hooks/useAgentWebSocket.ts
bandit-runner-app/src/components/terminal-chat-interface.tsx
bandit-runner-app/src/components/agent-control-panel.tsx
Changes:
- Agent State now tracks totalTokens and totalCost (accumulated via reducers)
- Planning Node extracts token usage from LLM responses and estimates costs
- Agent emits usage_update events after each LLM call
- WebSocket Hook handles usage_update events with callbacks
- AgentControlPanel displays token count and cost in metadata section
- Terminal Interface updates agent state with usage data in real-time
Cost Estimation:
- Rough approximation: 70% prompt tokens ($1/M), 30% completion tokens ($5/M)
- Real-world costs may vary based on specific OpenRouter model pricing
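As a worked example of this approximation: 100,000 total tokens are treated as 70,000 prompt tokens and 30,000 completion tokens, giving 70,000 × $1/M + 30,000 × $5/M = $0.07 + $0.15 = $0.22. The sketch below encodes the same arithmetic together with an additive accumulator; the function names and the UsageTotals shape are illustrative assumptions, not the reducers in bandit-state.ts.

```typescript
// Figures from the rough approximation above.
const PROMPT_SHARE = 0.7;
const PROMPT_USD_PER_MILLION = 1;
const COMPLETION_USD_PER_MILLION = 5;

// Estimate the cost of one LLM call from its total token count.
function estimateCostUsd(totalTokens: number): number {
  const promptTokens = totalTokens * PROMPT_SHARE;
  const completionTokens = totalTokens - promptTokens;
  return (
    (promptTokens * PROMPT_USD_PER_MILLION +
      completionTokens * COMPLETION_USD_PER_MILLION) / 1e6
  );
}

interface UsageTotals {
  totalTokens: number;
  totalCost: number;
}

// Additive accumulation: each call's usage is added to the running totals.
function addUsage(totals: UsageTotals, callTokens: number): UsageTotals {
  return {
    totalTokens: totals.totalTokens + callTokens,
    totalCost: totals.totalCost + estimateCostUsd(callTokens),
  };
}
```

Under these figures the blended rate works out to $2.20 per million tokens, so the estimate is only as good as the 70/30 split and the per-token prices assumed for the model in use.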
Benefits:
- Real-time visibility into LLM costs
- Helps users make informed model selection decisions
- Essential for benchmarking tool economics
- Transparent cost tracking for production deployments
Testing Checklist
Max-Retries Flow
- Start a run with a model (e.g., openai/gpt-4o-mini)
- Wait for Level 1 to hit max retries (3 attempts)
- Verify modal appears with Stop/Intervene/Continue options
- Test "Continue" → verify retry count resets and agent resumes
- Test "Intervene" → verify manual mode is enabled
- Test "Stop" → verify run ends cleanly
Terminal Fidelity
- Verify agent doesn't attempt ssh commands
- Check that forbidden commands trigger error messages
- Confirm ANSI codes are preserved in terminal output
- Test password validation: invalid password should trigger retry with error message
- Test password validation: valid password should advance to next level
Reasoning Visibility
- Start a run and observe chat panel
- Verify "THINKING" messages appear with blue styling
- Confirm full reasoning content is displayed (not just "Processing...")
- Test with different models to ensure consistent behavior
Error Recovery
- Simulate network issues (if possible) to test retry logic
- Verify agent recovers from temporary SSH connection failures
- Check that LLM API rate limits are handled gracefully
Cost Tracking
- Start a run and observe agent control panel
- Verify "TOKENS" and "COST" appear after first LLM call
- Confirm counts increment with each planning step
- Test with different models to see cost variations
Architecture Notes
Event Flow for Max-Retries
Agent (validateResult)
→ Detects max retries
→ Emits 'error' with "Max retries..." message
→ BanditAgentDO.updateStateFromEvent
→ Checks error message for "Max retries"
→ Emits 'user_action_required' event
→ State set to 'paused' (not 'failed')
→ WebSocket → Frontend
→ useAgentWebSocket.onUserActionRequired callback
→ Terminal Interface shows AlertDialog
→ User clicks button
→ POST to /retry endpoint
→ BanditAgentDO.retryLevel resets count & resumes agent
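The translation step in the middle of this flow, where an 'error' event mentioning "Max retries" becomes a 'user_action_required' event, might be sketched as below. The event and state shapes and the translateAgentEvent name are assumptions for illustration; the assumption that other errors still mark the run as failed is inferred from the summary rather than stated in it.

```typescript
interface AgentEvent {
  type: string;
  data?: { content?: string; level?: number };
}

interface DoState {
  status: "running" | "paused" | "failed";
}

// Decide the next state and which events to forward to the frontend.
function translateAgentEvent(
  state: DoState,
  event: AgentEvent,
): { state: DoState; outgoing: AgentEvent[] } {
  const message = event.data?.content ?? "";

  if (event.type === "error" && message.indexOf("Max retries") !== -1) {
    // Pause and ask the user instead of failing the run.
    return {
      state: { ...state, status: "paused" },
      outgoing: [
        { type: "user_action_required", data: { content: message, level: event.data?.level } },
      ],
    };
  }

  if (event.type === "error") {
    // Assumption: errors other than max-retries still fail the run.
    return { state: { ...state, status: "failed" }, outgoing: [event] };
  }

  return { state, outgoing: [event] };
}
```

Matching on the message text is simple but brittle; if the agent's wording changes, the check silently stops firing, so a dedicated error code field would be a more robust signal.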
Event Flow for Usage Tracking
Agent (planLevel)
→ LLM invoke with retry logic
→ Extract token usage from response
→ Update state.totalTokens and state.totalCost
→ Emit 'usage_update' event
→ WebSocket → Frontend
→ useAgentWebSocket.onUsageUpdate callback
→ Terminal Interface updates agentState
→ AgentControlPanel renders updated metrics
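On the frontend side, the callback dispatch in this flow can be sketched as a plain function over the raw WebSocket message. The payload fields, the AgentCallbacks interface, and the dispatchMessage name are assumptions; the hook in useAgentWebSocket.ts may be structured differently.

```typescript
interface UsagePayload {
  totalTokens: number;
  totalCost: number;
}

interface AgentCallbacks {
  onUsageUpdate?: (usage: UsagePayload) => void;
  onUserActionRequired?: (data: unknown) => void;
}

// Parse one WebSocket message and invoke the matching callback.
// Returns true when a callback was invoked.
function dispatchMessage(raw: string, callbacks: AgentCallbacks): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // ignore malformed frames
  }
  if (typeof parsed !== "object" || parsed === null) return false;
  const event = parsed as { type?: unknown; data?: unknown };

  if (event.type === "usage_update" && callbacks.onUsageUpdate) {
    const data = event.data as Partial<UsagePayload> | null | undefined;
    if (data && typeof data.totalTokens === "number" && typeof data.totalCost === "number") {
      callbacks.onUsageUpdate({ totalTokens: data.totalTokens, totalCost: data.totalCost });
      return true;
    }
    return false;
  }

  if (event.type === "user_action_required" && callbacks.onUserActionRequired) {
    callbacks.onUserActionRequired(event.data);
    return true;
  }

  return false;
}
```

A component would pass an onUsageUpdate callback that merges the payload into its agent state, which the control panel then renders.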
Compatibility & Safety
- ✅ No changes to DO bindings or WS protocol
- ✅ All new features are additive (no breaking changes)
- ✅ Existing functionality preserved
- ✅ Fallback behavior for network errors (fail-open for password validation)
- ✅ Error messages are user-friendly and actionable
- ✅ Linter errors fixed, TypeScript types properly defined
Future Enhancements (Optional)
These were outlined in the plan but not implemented in this iteration:
Phase 2: PTY Streaming (Optional)
- Implement stream: true in /ssh/exec to send incremental PTY chunks
- Provides more 1:1 terminal experience with progressive rendering
- Feature-flagged for optional enablement
Phase 3: Persistent Interactive Shell (Optional)
- Implement /ssh/shell WebSocket endpoint for persistent PTY session
- Full TUI fidelity similar to opencode
- More complex implementation, requires careful state management
Deployment Notes
- SSH Proxy: Redeploy to Fly.io with updated agent.ts
    cd ssh-proxy
    flyctl deploy
- Cloudflare Worker: Deploy updated DO and routes
    cd bandit-runner-app
    pnpm run deploy
- Environment Variables: No new variables required
- Database/Storage: No schema changes
Summary
This implementation successfully addresses all three core issues while also adding error recovery and cost tracking. The agent is now:
- ✅ More robust (retry logic with exponential backoff)
- ✅ More transparent (reasoning visible, costs tracked)
- ✅ More reliable (command hygiene, password validation)
- ✅ More user-friendly (max-retries decision flow, clear error messages)
- ✅ Production-ready (proper error handling, type safety, no breaking changes)
The changes maintain backward compatibility and follow the plan's phased approach, delivering immediate improvements while leaving room for future enhancements.