# Agent Reliability, Terminal Fidelity, and Reasoning Visibility - Implementation Summary

## Overview

This implementation addresses three core issues identified in the agent's behavior (items 1-3) and adds two supporting improvements (items 4-5):

1. **Max-Retries User Decision Flow** - Prevents dead ends at max retries by letting users choose to Stop, Intervene, or Continue
2. **Terminal Fidelity Improvements** - Stricter command hygiene and pre-advance password validation
3. **Reasoning Visibility** - Displays LLM thinking/reasoning in the chat panel
4. **Error Recovery** - Retry logic with exponential backoff for SSH connections, LLM planning calls, and SSH command execution
5. **Cost Tracking** - Real-time token usage and cost display in the agent panel

## Implementation Details

### 1. Max-Retries → User Decision Flow

**Files Modified:**
- `bandit-runner-app/src/lib/durable-objects/BanditAgentDO.ts`
- `bandit-runner-app/src/lib/agents/bandit-state.ts`
- `bandit-runner-app/src/hooks/useAgentWebSocket.ts`
- `bandit-runner-app/src/components/terminal-chat-interface.tsx`

**Changes:**
- **BanditAgentDO** now emits `user_action_required` events when max retries are hit instead of failing immediately
  - Agent state transitions to `paused` rather than `failed` on max-retries errors
  - The `/retry` endpoint now resets the retry count and resumes the agent run
- **AgentEvent** type extended with the `user_action_required` event type and its associated data fields
- **WebSocket hook** now supports callbacks for `user_action_required` events
- **Terminal Interface** displays a modal dialog (shadcn AlertDialog) with three options:
  - **Stop**: Ends the run completely
  - **Intervene**: Enables manual mode and pauses the agent
  - **Continue**: Resets the retry counter and resumes the agent

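
A minimal sketch of how the three dialog options could map onto run state. Only the `user_action_required` event name and the Stop/Intervene/Continue options come from this summary; the field names, the `RunState` shape, and `applyUserChoice` are hypothetical and are not taken from `bandit-state.ts`.

```typescript
// Hypothetical shapes for illustration; real definitions live in bandit-state.ts.
type UserChoice = "stop" | "intervene" | "continue";

type AgentEvent =
  | { type: "error"; data: { content: string } }
  | { type: "thinking"; data: { content: string } }
  | {
      type: "user_action_required";
      data: { reason: "max_retries"; level: number; retryCount: number; message: string };
    };

interface RunState {
  status: "running" | "paused" | "stopped";
  retryCount: number;
  manualMode: boolean;
}

// True when the frontend should open the decision dialog.
function needsUserDecision(event: AgentEvent): boolean {
  return event.type === "user_action_required";
}

// Pure mapping from the clicked button to the next run state.
function applyUserChoice(state: RunState, choice: UserChoice): RunState {
  switch (choice) {
    case "stop":
      return { ...state, status: "stopped" };
    case "intervene":
      // Manual mode on, agent stays paused so the user can type commands.
      return { ...state, status: "paused", manualMode: true };
    case "continue":
      // Fresh retry budget, agent resumes.
      return { ...state, status: "running", retryCount: 0 };
  }
}
```

Keeping the mapping pure makes each button's effect easy to unit-test without mounting the dialog.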
**Benefits:**
- No more dead ends at Level 1 or any other level
- Users can provide manual assistance when the agent gets stuck
- Enables iterative debugging and agent improvement
- Maintains leaderboard integrity (manual intervention is tracked)

### 2. Terminal Fidelity & Command Hygiene

**Files Modified:**
- `ssh-proxy/agent.ts`

**Changes:**
- **Updated SYSTEM_PROMPT** to explicitly forbid nested SSH connections and dangerous commands
- **Command Validation** in `executeCommand` checks for forbidden patterns:
  - `ssh` commands (nested SSH)
  - `scp`, `sudo`, `su` commands
  - Dangerous patterns like `rm -rf`
  - Forbidden commands produce an error message and send the agent back to the planning state instead of executing
- **Pre-Advance Password Validation**: After extracting a password, `validateResult` now:
  1. Tests the password with a non-interactive SSH connection (`testOnly: true`)
  2. Only advances if the password is valid
  3. Counts invalid passwords as retries (fail-fast approach)
  4. Proceeds anyway if the check itself fails with a network error (fail-open for robustness)
- **Accurate completion events**: `run_complete` now includes status information based on the final state

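
The forbidden-pattern check could look roughly like the sketch below. The regular expressions and the `checkCommand` name are assumptions for illustration; the actual patterns in `executeCommand` may differ. This version only inspects the start of the command and segments after `;`, `&`, or `|`, so it does not catch commands hidden in subshells or backticks.

```typescript
// Hypothetical pattern list; the real one lives in ssh-proxy/agent.ts.
const FORBIDDEN_PATTERNS: { pattern: RegExp; reason: string }[] = [
  { pattern: /(^|[;&|]\s*)ssh(\s|$)/, reason: "nested SSH connections are not allowed" },
  { pattern: /(^|[;&|]\s*)(scp|sudo|su)(\s|$)/, reason: "scp, sudo, and su are not allowed" },
  { pattern: /\brm\s+-(rf|fr)\b/, reason: "recursive force-delete is not allowed" },
];

type CommandCheck = { allowed: true } | { allowed: false; error: string };

// Returns an error instead of executing, so the caller can route the agent
// back to its planning step with the rejection reason.
function checkCommand(command: string): CommandCheck {
  const trimmed = command.trim();
  for (const { pattern, reason } of FORBIDDEN_PATTERNS) {
    if (pattern.test(trimmed)) {
      return { allowed: false, error: `Command rejected: ${reason}` };
    }
  }
  return { allowed: true };
}
```

Anchoring on command position rather than matching `ssh` anywhere keeps harmless commands such as `cat ssh_notes.txt` working.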
**Benefits:**
- Prevents common agent errors (nested SSH causing timeouts)
- Reduces wasted retries on invalid passwords
- More reliable level advancement
- Closer to the UX of reference terminal agents such as opencode

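
The pre-advance password check described in this section reduces to a small decision function once the SSH probe has run. The sketch below assumes the probe reports one of three outcomes and that the retry budget is 3 (the figure used in the testing checklist); `ProbeResult`, `Decision`, and `decideAfterProbe` are hypothetical names.

```typescript
// Outcome of the non-interactive SSH test (testOnly: true). Names are assumed.
type ProbeResult = "valid" | "invalid" | "network_error";

type Decision =
  | { action: "advance"; warning?: string }
  | { action: "retry"; retryCount: number; error: string }
  | { action: "max_retries"; retryCount: number; error: string };

function decideAfterProbe(probe: ProbeResult, retryCount: number, maxRetries = 3): Decision {
  if (probe === "valid") {
    return { action: "advance" };
  }
  if (probe === "network_error") {
    // Fail-open: the check could not run, so do not penalize the agent.
    return { action: "advance", warning: "Password check could not run; proceeding unverified" };
  }
  // Fail-fast: a rejected password costs one retry immediately.
  const next = retryCount + 1;
  const error = `Extracted password was rejected (attempt ${next} of ${maxRetries})`;
  return next >= maxRetries
    ? { action: "max_retries", retryCount: next, error }
    : { action: "retry", retryCount: next, error };
}
```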
### 3. Reasoning Visibility

**Files Modified:**
- `bandit-runner-app/src/components/terminal-chat-interface.tsx`

**Changes:**
- Updated chat message rendering to display `thinking` messages with their full content
- Thinking messages now show with distinct styling (blue border/text)
- Message type label shows "THINKING" for reasoning messages
- Thinking messages were already emitted by the agent; the UI now renders them

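
As a rough illustration of the rendering change, the sketch below maps a chat message to the label, styling, and body the UI would show. The message types other than `thinking`, the `MessageView` shape, and the class names are assumptions; only the "THINKING" label, the blue styling, and the full-content display come from this summary.

```typescript
// Hypothetical message and view shapes; class names are placeholder utility classes.
type ChatMessageType = "user" | "agent" | "thinking" | "error";

interface ChatMessage {
  type: ChatMessageType;
  content: string;
}

interface MessageView {
  label: string;
  className: string;
  body: string;
}

function toMessageView(message: ChatMessage): MessageView {
  if (message.type === "thinking") {
    return {
      label: "THINKING",
      className: "border-blue-500 text-blue-400",
      // Full reasoning text, not a "Processing..." placeholder.
      body: message.content,
    };
  }
  return {
    label: message.type.toUpperCase(),
    className: "border-border text-foreground",
    body: message.content,
  };
}
```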
**Benefits:**
- Full transparency into agent's decision-making process
- Critical for benchmarking and debugging
- Helps users understand what the agent is thinking before executing commands

### 4. Error Recovery with Exponential Backoff

**Files Modified:**
- `ssh-proxy/agent.ts`

**Changes:**
- **Added `retryWithBackoff` helper function**:
  - Generic retry logic with exponential backoff (1s → 2s → 4s)
  - Configurable max retries and base delay
  - Contextual error messages for debugging
- **Applied to critical operations**:
  - SSH connections (3 retries, 1s base delay)
  - LLM planning calls (3 retries, 2s base delay)
  - SSH command execution (2 retries, 1.5s base delay)
- Graceful error handling with informative error messages

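
A helper with this behavior might look like the sketch below. It assumes "retries" means additional attempts after the first call, so 3 retries at a 1s base delay wait 1s, 2s, then 4s. The option names and the injectable `sleep` are illustrative and may not match the real `retryWithBackoff` signature in `agent.ts`.

```typescript
interface RetryOptions {
  maxRetries: number;   // retries after the initial attempt (assumed meaning)
  baseDelayMs: number;  // delay before the first retry
  context: string;      // label used in the final error message
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

// Delay before retry number `attempt` (0-based): base, 2x base, 4x base, ...
function backoffDelay(baseDelayMs: number, attempt: number): number {
  return baseDelayMs * 2 ** attempt;
}

async function retryWithBackoff<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === opts.maxRetries) break; // budget exhausted, no final sleep
      await sleep(backoffDelay(opts.baseDelayMs, attempt));
    }
  }
  const detail = lastError instanceof Error ? lastError.message : String(lastError);
  throw new Error(`${opts.context} failed after ${opts.maxRetries} retries: ${detail}`);
}
```

Under these assumptions, an SSH connection would be wrapped as `retryWithBackoff(connect, { maxRetries: 3, baseDelayMs: 1000, context: "SSH connection" })`, where `connect` is whatever function opens the session.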
**Benefits:**
- Resilient to transient network failures
- Reduces run failures due to temporary issues
- Better user experience (fewer unexplained failures)
- Production-ready reliability

### 5. Token Usage & Cost Tracking

**Files Modified:**
- `ssh-proxy/agent.ts`
- `bandit-runner-app/src/lib/agents/bandit-state.ts`
- `bandit-runner-app/src/hooks/useAgentWebSocket.ts`
- `bandit-runner-app/src/components/terminal-chat-interface.tsx`
- `bandit-runner-app/src/components/agent-control-panel.tsx`

**Changes:**
- **Agent State** now tracks `totalTokens` and `totalCost` (accumulated via reducers)
- **Planning Node** extracts token usage from LLM responses and estimates costs
  - Agent emits `usage_update` events after each LLM call
- **WebSocket Hook** handles `usage_update` events with callbacks
- **AgentControlPanel** displays token count and cost in metadata section
- **Terminal Interface** updates agent state with usage data in real-time

**Cost Estimation:**
- Rough approximation: assumes 70% of tokens are prompt tokens ($1 per million) and 30% are completion tokens ($5 per million)
- Real-world costs may vary based on specific OpenRouter model pricing

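
The approximation works out to a blended rate of $2.20 per million tokens (0.7 × $1 + 0.3 × $5). A minimal sketch of that arithmetic, with hypothetical constant and function names:

```typescript
// Assumed split and prices, taken from the approximation described above.
const PROMPT_SHARE = 0.7;
const PROMPT_USD_PER_MILLION = 1;
const COMPLETION_USD_PER_MILLION = 5;

// Estimates cost in USD from a total token count when the real
// prompt/completion split is unknown.
function estimateCostUsd(totalTokens: number): number {
  const promptTokens = totalTokens * PROMPT_SHARE;
  const completionTokens = totalTokens - promptTokens;
  return (
    (promptTokens * PROMPT_USD_PER_MILLION + completionTokens * COMPLETION_USD_PER_MILLION) /
    1_000_000
  );
}
```

For example, a 10,000-token call is estimated at 7,000 × $1/M + 3,000 × $5/M = $0.022.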
**Benefits:**
- Real-time visibility into LLM costs
- Helps users make informed model selection decisions
- Essential for benchmarking tool economics
- Transparent cost tracking for production deployments

## Testing Checklist

### Max-Retries Flow
- [ ] Start a run with a model (e.g., `openai/gpt-4o-mini`)
- [ ] Wait for Level 1 to hit max retries (3 attempts)
- [ ] Verify modal appears with Stop/Intervene/Continue options
- [ ] Test "Continue" → verify retry count resets and agent resumes
- [ ] Test "Intervene" → verify manual mode is enabled
- [ ] Test "Stop" → verify run ends cleanly

### Terminal Fidelity
- [ ] Verify agent doesn't attempt `ssh` commands
- [ ] Check that forbidden commands trigger error messages
- [ ] Confirm ANSI codes are preserved in terminal output
- [ ] Test password validation: invalid password should trigger retry with error message
- [ ] Test password validation: valid password should advance to next level

### Reasoning Visibility
- [ ] Start a run and observe chat panel
- [ ] Verify "THINKING" messages appear with blue styling
- [ ] Confirm full reasoning content is displayed (not just "Processing...")
- [ ] Test with different models to ensure consistent behavior

### Error Recovery
- [ ] Simulate network issues (if possible) to test retry logic
- [ ] Verify agent recovers from temporary SSH connection failures
- [ ] Check that LLM API rate limits are handled gracefully

### Cost Tracking
- [ ] Start a run and observe agent control panel
- [ ] Verify "TOKENS" and "COST" appear after first LLM call
- [ ] Confirm counts increment with each planning step
- [ ] Test with different models to see cost variations

## Architecture Notes

### Event Flow for Max-Retries
```
Agent (validateResult)
  → Detects max retries
  → Emits 'error' with "Max retries..." message
  → BanditAgentDO.updateStateFromEvent
    → Checks error message for "Max retries"
    → Emits 'user_action_required' event
    → State set to 'paused' (not 'failed')
  → WebSocket → Frontend
    → useAgentWebSocket.onUserActionRequired callback
    → Terminal Interface shows AlertDialog
    → User clicks button
    → POST to /retry endpoint
    → BanditAgentDO.retryLevel resets count & resumes agent
```

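
The Durable Object side of this flow can be sketched as a pure state transition. The substring match on "Max retries", the `paused` status, and the method names `updateStateFromEvent` and `retryLevel` come from the flow above; the state and event shapes, the handling of other errors as `failed`, and the choice to emit only the `user_action_required` event are assumptions.

```typescript
type RunStatus = "running" | "paused" | "failed";

// Hypothetical, simplified event and state shapes.
interface AgentEvent {
  type: string;
  data: { content?: string; level?: number; retryCount?: number };
}

interface DOState {
  status: RunStatus;
  level: number;
  retryCount: number;
}

function updateStateFromEvent(
  state: DOState,
  event: AgentEvent,
): { state: DOState; outgoing: AgentEvent[] } {
  if (event.type !== "error") {
    return { state, outgoing: [event] };
  }
  const message = event.data.content ?? "";
  if (message.includes("Max retries")) {
    // Pause instead of failing, and ask the user what to do next.
    const prompt: AgentEvent = {
      type: "user_action_required",
      data: { content: message, level: state.level, retryCount: state.retryCount },
    };
    return { state: { ...state, status: "paused" }, outgoing: [prompt] };
  }
  // Assumption: any other error still ends the run.
  return { state: { ...state, status: "failed" }, outgoing: [event] };
}

// What the /retry endpoint does to the state: reset the count and resume.
function retryLevel(state: DOState): DOState {
  return { ...state, retryCount: 0, status: "running" };
}
```

Matching on message text is brittle if the wording of the agent's error ever changes; a dedicated error code on the event would be a sturdier signal.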
### Event Flow for Usage Tracking
```
Agent (planLevel)
  → LLM invoke with retry logic
  → Extract token usage from response
  → Update state.totalTokens and state.totalCost
  → Emit 'usage_update' event
  → WebSocket → Frontend
    → useAgentWebSocket.onUsageUpdate callback
    → Terminal Interface updates agentState
    → AgentControlPanel renders updated metrics
```

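
The agent-side half of this flow, sketched under assumptions: the response exposes a total token count at `usage.totalTokens` (a hypothetical field name), the reducers are plain additions, and each `usage_update` event carries the running totals. The cost estimator is passed in so the sketch stays independent of any pricing model.

```typescript
// Hypothetical shapes; real field names depend on the LLM client in use.
interface LlmResponse {
  usage?: { totalTokens?: number };
}

interface UsageState {
  totalTokens: number;
  totalCost: number;
}

interface UsageUpdateEvent {
  type: "usage_update";
  data: UsageState;
}

// Additive reducer: new total = current total + this call's contribution.
const sum = (current: number, delta: number): number => current + delta;

function recordUsage(
  state: UsageState,
  response: LlmResponse,
  estimateCostUsd: (tokens: number) => number,
): { state: UsageState; event: UsageUpdateEvent | null } {
  const tokens = response.usage?.totalTokens ?? 0;
  if (tokens <= 0) {
    // Some providers omit usage data; leave totals untouched and emit nothing.
    return { state, event: null };
  }
  const next: UsageState = {
    totalTokens: sum(state.totalTokens, tokens),
    totalCost: sum(state.totalCost, estimateCostUsd(tokens)),
  };
  return { state: next, event: { type: "usage_update", data: next } };
}
```

Sending running totals rather than per-call deltas means the frontend can overwrite its displayed values and stay correct even if an event is dropped.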
## Compatibility & Safety

- ✅ No changes to DO bindings or WS protocol
- ✅ All new features are additive (no breaking changes)
- ✅ Existing functionality preserved
- ✅ Fallback behavior for network errors (fail-open for password validation)
- ✅ Error messages are user-friendly and actionable
- ✅ Linter errors fixed, TypeScript types properly defined

## Future Enhancements (Optional)

These were outlined in the plan but not implemented in this iteration:

### Phase 2: PTY Streaming (Optional)
- Implement `stream: true` in `/ssh/exec` to send incremental PTY chunks
- Provides a closer 1:1 terminal experience with progressive rendering
- Feature-flagged for optional enablement

### Phase 3: Persistent Interactive Shell (Optional)
- Implement `/ssh/shell` WebSocket endpoint for persistent PTY session
- Full TUI fidelity similar to opencode
- More complex implementation, requires careful state management

## Deployment Notes

1. **SSH Proxy**: Redeploy to Fly.io with updated `agent.ts`

   ```bash
   cd ssh-proxy
   flyctl deploy
   ```

2. **Cloudflare Worker**: Deploy updated DO and routes

   ```bash
   cd bandit-runner-app
   pnpm run deploy
   ```

3. **Environment Variables**: No new variables required

4. **Database/Storage**: No schema changes

## Summary

This implementation successfully addresses all three core issues while also adding error recovery and cost tracking. The agent is now:

- ✅ More robust (retry logic with exponential backoff)
- ✅ More transparent (reasoning visible, costs tracked)
- ✅ More reliable (command hygiene, password validation)
- ✅ More user-friendly (max-retries decision flow, clear error messages)
- ✅ Production-ready (proper error handling, type safety, no breaking changes)

The changes maintain backward compatibility and follow the plan's phased approach, delivering immediate improvements while leaving room for future enhancements.