Nicholai/bandit-runner

nicholai 0d93e26986 updates

2025-10-13 10:21:50 -06:00

Agent Reliability, Terminal Fidelity, and Reasoning Visibility - Implementation Summary

Overview

This implementation addresses three critical issues identified in the agent's behavior (the first three items below) and adds two supporting features (error recovery and cost tracking):

Max-Retries User Decision Flow - Prevents dead-ends at max retries by giving users options to Stop, Intervene, or Continue
Terminal Fidelity Improvements - Enhanced command hygiene and pre-advance password validation for better agent behavior
Reasoning Visibility - Properly displays LLM thinking/reasoning in the chat panel
Error Recovery - Added retry logic with exponential backoff for all critical operations
Cost Tracking - Real-time token usage and cost display in the agent panel

Implementation Details

1. Max-Retries → User Decision Flow

Files Modified:

bandit-runner-app/src/lib/durable-objects/BanditAgentDO.ts
bandit-runner-app/src/lib/agents/bandit-state.ts
bandit-runner-app/src/hooks/useAgentWebSocket.ts
bandit-runner-app/src/components/terminal-chat-interface.tsx

Changes:

BanditAgentDO now emits user_action_required events when max retries are hit instead of immediately failing
Agent state transitions to paused rather than failed on max-retries errors
The /retry endpoint now properly resets retry count AND resumes the agent run
AgentEvent type extended with user_action_required event type and associated data fields
WebSocket hook now supports callbacks for user_action_required events
Terminal Interface displays a modal dialog (shadcn AlertDialog) with three options:
- Stop: Ends the run completely
- Intervene: Enables manual mode and pauses the agent
- Continue: Resets retry counter and resumes the agent
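The three dialog options can be read as state transitions on the run. The following is a minimal sketch of that idea, not the code in terminal-chat-interface.tsx: the `RunState` shape, its field names, and the `"stopped"` status are assumptions made for illustration; only the three options and the paused state come from the summary.

```typescript
// Only the three options and the paused state come from the summary above.
// Field names and the "stopped" status are assumptions made for this sketch.
type UserDecision = "stop" | "intervene" | "continue";

interface RunState {
  status: "running" | "paused" | "stopped";
  retryCount: number;
  manualMode: boolean;
}

// Pure transition describing what each dialog button should do to the run.
function applyDecision(state: RunState, decision: UserDecision): RunState {
  switch (decision) {
    case "stop":
      // Ends the run completely.
      return { ...state, status: "stopped" };
    case "intervene":
      // Keeps the agent paused and hands control to the user.
      return { ...state, status: "paused", manualMode: true };
    case "continue":
      // Resets the retry counter and lets the agent resume.
      return { ...state, status: "running", retryCount: 0 };
  }
}
```

Keeping the transition pure makes it easy to unit test separately from the dialog component and the /retry request.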

Benefits:

No more dead-ends at Level 1 or any level
Users can provide manual assistance when the agent gets stuck
Enables iterative debugging and agent improvement
Maintains leaderboard integrity (manual intervention is tracked)

2. Terminal Fidelity & Command Hygiene

Files Modified:

ssh-proxy/agent.ts

Changes:

Updated SYSTEM_PROMPT to explicitly forbid nested SSH connections and dangerous commands
Command Validation in executeCommand checks for forbidden patterns:
- ssh commands (nested SSH)
- scp, sudo, su commands
- Dangerous patterns like rm -rf
Forbidden commands return error messages and return to planning state instead of executing
Pre-Advance Password Validation: After extracting a password, validateResult now:
1. Tests the password with a non-interactive SSH connection (testOnly: true)
2. Only advances if the password is valid
3. Counts invalid passwords as retries (fail-fast approach)
4. Falls back to proceeding on network errors (fail-open for robustness)
Accurate completion events: run_complete now includes status information based on final state
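The command check described above could look roughly like the sketch below. The pattern list and the `validateCommand` name are illustrative assumptions; the actual patterns in ssh-proxy/agent.ts may differ. The regexes only match a forbidden word in command position (start of the line or after `;`, `&` or `|`), so a file name such as `ssh_notes.txt` is not rejected.

```typescript
// Illustrative rules only; the real list in ssh-proxy/agent.ts may differ.
interface ForbiddenRule {
  pattern: RegExp;
  reason: string;
}

const FORBIDDEN_RULES: ForbiddenRule[] = [
  // ssh, scp, sudo or su in command position.
  {
    pattern: /(?:^|[;&|])\s*(?:ssh|scp|sudo|su)(?:\s|$)/,
    reason: "nested SSH and privilege escalation are not allowed",
  },
  // rm with a combined recursive+force flag such as -rf or -fr.
  {
    pattern: /\brm\s+-[a-z]*(?:rf|fr)[a-z]*\b/i,
    reason: "recursive forced deletion is not allowed",
  },
];

type ValidationResult = { allowed: true } | { allowed: false; reason: string };

// Returns the first matching rule's reason, or allows the command.
function validateCommand(command: string): ValidationResult {
  for (const { pattern, reason } of FORBIDDEN_RULES) {
    if (pattern.test(command)) {
      return { allowed: false, reason };
    }
  }
  return { allowed: true };
}
```

A rejected command would then be reported back to the planner as an error message rather than executed. Pattern matching of this kind is a guardrail, not a sandbox: variants such as `rm -r -f` are not caught by this sketch.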

Benefits:

Prevents common agent errors (nested SSH causing timeouts)
Reduces wasted retries on invalid passwords
More reliable level advancement
Better alignment with example terminal agent UX (like opencode)

3. Reasoning Visibility

Files Modified:

bandit-runner-app/src/components/terminal-chat-interface.tsx

Changes:

Updated chat message rendering to display thinking messages with their full content
Thinking messages now show with distinct styling (blue border/text)
Message type label shows "THINKING" for reasoning messages
The agent already emitted these thinking messages; the change is that the UI now renders them

Benefits:

Full transparency into agent's decision-making process
Critical for benchmarking and debugging
Helps users understand what the agent is thinking before executing commands

4. Error Recovery with Exponential Backoff

Files Modified:

ssh-proxy/agent.ts

Changes:

Added retryWithBackoff helper function:
- Generic retry logic with exponential backoff (1s → 2s → 4s)
- Configurable max retries and base delay
- Contextual error messages for debugging
Applied to critical operations:
- SSH connections (3 retries, 1s base delay)
- LLM planning calls (3 retries, 2s base delay)
- SSH command execution (2 retries, 1.5s base delay)
Graceful error handling with informative error messages
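A helper of this kind can be sketched as follows. This is not the code from ssh-proxy/agent.ts: the option names, the injectable `sleep`, and the reading of "3 retries" as three retries after the first attempt are assumptions. Under that reading, a 1s base delay gives the 1s, 2s, 4s sequence listed above.

```typescript
interface RetryOptions {
  maxRetries: number; // retries after the first attempt (assumed meaning)
  baseDelayMs: number; // delay before the first retry
  context: string; // label used in the final error message
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
function backoffDelayMs(attempt: number, baseDelayMs: number): number {
  return baseDelayMs * 2 ** attempt;
}

async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const { maxRetries, baseDelayMs, context, sleep = defaultSleep } = options;
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break; // no delay after the final failure
      await sleep(backoffDelayMs(attempt, baseDelayMs));
    }
  }
  const detail = lastError instanceof Error ? lastError.message : String(lastError);
  throw new Error(`${context} failed after ${maxRetries + 1} attempts: ${detail}`);
}
```

With this shape, the SSH connection case would pass `{ maxRetries: 3, baseDelayMs: 1000, context: "SSH connection" }`, and the LLM planning case would use a 2000 ms base delay.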

Benefits:

Resilient to transient network failures
Reduces run failures due to temporary issues
Better user experience (fewer unexplained failures)
Production-ready reliability

5. Token Usage & Cost Tracking

Files Modified:

ssh-proxy/agent.ts
bandit-runner-app/src/lib/agents/bandit-state.ts
bandit-runner-app/src/hooks/useAgentWebSocket.ts
bandit-runner-app/src/components/terminal-chat-interface.tsx
bandit-runner-app/src/components/agent-control-panel.tsx

Changes:

Agent State now tracks totalTokens and totalCost (accumulated via reducers)
Planning Node extracts token usage from LLM responses and estimates costs
Agent emits usage_update events after each LLM call
WebSocket Hook handles usage_update events with callbacks
AgentControlPanel displays token count and cost in metadata section
Terminal Interface updates agent state with usage data in real-time

Cost Estimation:

Rough approximation: assumes 70% of tokens are prompt tokens (at $1/M) and 30% are completion tokens (at $5/M), a blended rate of about $2.20 per million tokens
Real-world costs may vary based on specific OpenRouter model pricing
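The arithmetic behind this estimate is small enough to write out. The sketch below uses the 70/30 split and the $1/M and $5/M rates stated above; the function and constant names are hypothetical and not taken from the codebase.

```typescript
// Rates and split as stated in the summary; names are illustrative.
const PROMPT_SHARE = 0.7;
const PROMPT_USD_PER_MILLION = 1;
const COMPLETION_USD_PER_MILLION = 5;

// Estimates cost in USD from a total token count, assuming a fixed
// 70/30 prompt/completion split rather than the real per-call breakdown.
function estimateCostUsd(totalTokens: number): number {
  const promptTokens = totalTokens * PROMPT_SHARE;
  const completionTokens = totalTokens - promptTokens;
  return (
    (promptTokens * PROMPT_USD_PER_MILLION +
      completionTokens * COMPLETION_USD_PER_MILLION) /
    1_000_000
  );
}
```

For one million total tokens this works out to 0.7 × $1 + 0.3 × $5 = $2.20, so a 10,000-token run is estimated at roughly $0.022.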

Benefits:

Real-time visibility into LLM costs
Helps users make informed model selection decisions
Essential for benchmarking tool economics
Transparent cost tracking for production deployments

Testing Checklist

Max-Retries Flow

Start a run with a model (e.g., openai/gpt-4o-mini)
Wait for Level 1 to hit max retries (3 attempts)
Verify modal appears with Stop/Intervene/Continue options
Test "Continue" → verify retry count resets and agent resumes
Test "Intervene" → verify manual mode is enabled
Test "Stop" → verify run ends cleanly

Terminal Fidelity

Verify agent doesn't attempt ssh commands
Check that forbidden commands trigger error messages
Confirm ANSI codes are preserved in terminal output
Test password validation: invalid password should trigger retry with error message
Test password validation: valid password should advance to next level
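The expected outcomes of the password-validation checks can be captured as a small decision function, which is convenient to unit test without an SSH server. This is a sketch under assumptions: the type names, the `decideAfterPasswordCheck` function, and the exact retry accounting are hypothetical; the fail-fast and fail-open behavior follows the description in section 2.

```typescript
// Result of the non-interactive SSH test (testOnly: true in the summary).
type PasswordCheck =
  | { kind: "valid" }
  | { kind: "invalid" }
  | { kind: "network_error"; message: string };

type NextStep =
  | { action: "advance" }
  | { action: "retry"; retryCount: number; error: string }
  | { action: "max_retries"; error: string };

function decideAfterPasswordCheck(
  check: PasswordCheck,
  retryCount: number,
  maxRetries: number,
): NextStep {
  // A valid password advances; a network error also advances (fail-open).
  if (check.kind === "valid" || check.kind === "network_error") {
    return { action: "advance" };
  }
  // An invalid password counts as a retry (fail-fast).
  const nextRetryCount = retryCount + 1;
  if (nextRetryCount >= maxRetries) {
    return { action: "max_retries", error: "Max retries reached: password rejected" };
  }
  return {
    action: "retry",
    retryCount: nextRetryCount,
    error: "Extracted password was rejected by the next level",
  };
}
```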

Reasoning Visibility

Start a run and observe chat panel
Verify "THINKING" messages appear with blue styling
Confirm full reasoning content is displayed (not just "Processing...")
Test with different models to ensure consistent behavior

Error Recovery

Simulate network issues (if possible) to test retry logic
Verify agent recovers from temporary SSH connection failures
Check that LLM API rate limits are handled gracefully

Cost Tracking

Start a run and observe agent control panel
Verify "TOKENS" and "COST" appear after first LLM call
Confirm counts increment with each planning step
Test with different models to see cost variations

Architecture Notes

Event Flow for Max-Retries

Agent (validateResult) 
  → Detects max retries 
  → Emits 'error' with "Max retries..." message
  → BanditAgentDO.updateStateFromEvent 
  → Checks error message for "Max retries"
  → Emits 'user_action_required' event
  → State set to 'paused' (not 'failed')
  → WebSocket → Frontend
  → useAgentWebSocket.onUserActionRequired callback
  → Terminal Interface shows AlertDialog
  → User clicks button
  → POST to /retry endpoint
  → BanditAgentDO.retryLevel resets count & resumes agent
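The middle of this flow, where an error is turned into a user prompt, can be sketched as a pure function. The event payload fields and the `handleAgentError` name are assumptions for illustration; only the "Max retries" message check, the `user_action_required` event name, and the paused/failed distinction come from the flow above.

```typescript
// Payload fields are hypothetical; event names follow the flow above.
type OutgoingEvent =
  | { type: "error"; data: { content: string } }
  | { type: "user_action_required"; data: { reason: "max_retries"; message: string } };

type RunStatus = "running" | "paused" | "failed";

interface ErrorHandlingResult {
  status: RunStatus;
  emit: OutgoingEvent;
}

// Maps an agent error message to the next status and the event to broadcast.
function handleAgentError(message: string): ErrorHandlingResult {
  if (message.includes("Max retries")) {
    // Pause instead of failing so the user can Stop, Intervene, or Continue.
    return {
      status: "paused",
      emit: { type: "user_action_required", data: { reason: "max_retries", message } },
    };
  }
  return { status: "failed", emit: { type: "error", data: { content: message } } };
}
```

Matching on the message text keeps the agent and the Durable Object loosely coupled, but it depends on the wording staying stable; a structured error code would be one way to make the check less fragile.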

Event Flow for Usage Tracking

Agent (planLevel) 
  → LLM invoke with retry logic
  → Extract token usage from response
  → Update state.totalTokens and state.totalCost
  → Emit 'usage_update' event
  → WebSocket → Frontend
  → useAgentWebSocket.onUsageUpdate callback
  → Terminal Interface updates agentState
  → AgentControlPanel renders updated metrics
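The accumulation step in this flow amounts to a reducer that adds each call's usage to the running totals. The sketch below assumes each LLM call reports a per-call delta; whether the real `usage_update` event carries deltas or running totals is not stated in this summary, and the `CallUsage` and `addUsage` names are hypothetical.

```typescript
// totalTokens and totalCost are the state fields named in the summary.
interface UsageTotals {
  totalTokens: number;
  totalCost: number;
}

// Assumed per-call shape; the real event payload may differ.
interface CallUsage {
  tokens: number;
  cost: number;
}

// Reducer: returns new totals without mutating the previous state.
function addUsage(current: UsageTotals, call: CallUsage): UsageTotals {
  return {
    totalTokens: current.totalTokens + call.tokens,
    totalCost: current.totalCost + call.cost,
  };
}

// Folding a run's calls into totals, starting from zero.
function summarizeUsage(calls: CallUsage[]): UsageTotals {
  return calls.reduce<UsageTotals>(addUsage, { totalTokens: 0, totalCost: 0 });
}
```

If the event instead carries running totals, the frontend would replace its values rather than add to them.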

Compatibility & Safety

✅ No changes to DO bindings; the WS protocol gains only additive event types (user_action_required, usage_update)
✅ All new features are additive (no breaking changes)
✅ Existing functionality preserved
✅ Fallback behavior for network errors (fail-open for password validation)
✅ Error messages are user-friendly and actionable
✅ Linter errors fixed, TypeScript types properly defined

Future Enhancements (Optional)

These were outlined in the plan but not implemented in this iteration:

Phase 2: PTY Streaming (Optional)

Implement stream: true in /ssh/exec to send incremental PTY chunks
Provides more 1:1 terminal experience with progressive rendering
Feature-flagged for optional enablement

Phase 3: Persistent Interactive Shell (Optional)

Implement /ssh/shell WebSocket endpoint for persistent PTY session
Full TUI fidelity similar to opencode
More complex implementation, requires careful state management

Deployment Notes

SSH Proxy: Redeploy to Fly.io with updated agent.ts
```
cd ssh-proxy
flyctl deploy
```
Cloudflare Worker: Deploy updated DO and routes
```
cd bandit-runner-app
pnpm run deploy
```
Environment Variables: No new variables required
Database/Storage: No schema changes

Summary

This implementation successfully addresses all three core issues while also adding error recovery and cost tracking. The agent is now:

✅ More robust (retry logic with exponential backoff)
✅ More transparent (reasoning visible, costs tracked)
✅ More reliable (command hygiene, password validation)
✅ More user-friendly (max-retries decision flow, clear error messages)
✅ Production-ready (proper error handling, type safety, no breaking changes)

The changes maintain backward compatibility and follow the plan's phased approach, delivering immediate improvements while leaving room for future enhancements.
