bandit-runner/MAX-RETRIES-ROOT-CAUSE.md
2025-10-13 10:21:50 -06:00

146 lines
4.5 KiB
Markdown

# Max-Retries Modal - Root Cause Analysis
## Test Results
**Status**: ❌ Modal does NOT appear
**Error Seen**: "ERROR: Max retries reached for level 0" (in terminal and chat)
**Modal Shown**: NO
## Root Cause
The `user_action_required` event is **never emitted** from the Durable Object.
### Why?
Looking at `BanditAgentDO.ts`:
```typescript
private updateStateFromEvent(event: AgentEvent) {
if (!this.state) return
switch (event.type) {
case 'error':
const errorContent = event.data.content || ''
if (errorContent.includes('Max retries')) {
// Emit user_action_required event
this.broadcast({
type: 'user_action_required',
data: { ... }
})
}
}
}
```
**The Problem**: `updateStateFromEvent()` is only called when processing events FROM the SSH proxy. But by the time we see the `error` event here, the proxy has already ended its stream with `run_complete`.
The `error` event from the proxy goes:
1. SSH Proxy emits `error: Max retries...`
2. DO receives it via `runAgentViaProxy()` stream
3. DO calls `updateStateFromEvent(event)`
4. DO tries to `broadcast()` the `user_action_required`
5. **BUT** - we're inside the proxy stream handler, and immediately after this the proxy sends `run_complete` and ends the stream
6. The frontend never gets the `user_action_required` because it's racing with `run_complete`
## The Real Fix
We need to **pause BEFORE emitting the final error**, not after.
### Option 1: Fix in SSH Proxy (Recommended)
In `ssh-proxy/agent.ts`, when `validateResult` hits max retries, instead of returning status `'failed'`, return status `'paused_for_user_action'`:
```typescript
// In validateResult()
if (state.retryCount >= state.maxRetries) {
return {
status: 'paused_for_user_action' as const, // New status
error: `Max retries reached for level ${state.currentLevel}`,
}
}
```
Then in the graph conditional routing:
```typescript
function shouldContinue(state: BanditAgentState): string {
if (state.status === 'paused_for_user_action') {
return END // Stop graph execution
}
// ... rest of routing
}
```
And in the DO, when we see this status, emit the user action event:
```typescript
case 'node_update':
if (nodeOutput.status === 'paused_for_user_action') {
this.broadcast({
type: 'user_action_required',
data: {
reason: 'max_retries',
level: this.state.currentLevel,
// ...
}
})
this.state.status = 'paused'
}
```
### Option 2: Fix in DO (Simpler but less clean)
Before broadcasting the error event, check if it's a max-retries error and emit `user_action_required` FIRST:
```typescript
// In runAgentViaProxy(), when processing events:
if (agentEvent.type === 'error' && agentEvent.data.content?.includes('Max retries')) {
// Emit user_action_required FIRST
this.broadcast({
type: 'user_action_required',
data: { ... }
})
this.state.status = 'paused'
await this.storage.saveState(this.state)
}
// Then broadcast the error normally
this.broadcast(agentEvent)
```
## Why Current Code Doesn't Work
The current code tries to detect the error in `updateStateFromEvent()` which is called too late in the event processing pipeline. By the time we try to emit `user_action_required`, the proxy stream has already ended and the frontend has moved on to `run_complete`.
## Recommended Fix
**Option 1** is cleaner because it makes the agent's state machine explicit about needing user action. This also prevents the `run_complete` event from firing prematurely.
## Testing Plan
1. Implement Option 1 in `ssh-proxy/agent.ts`
2. Add new status to type definitions
3. Update DO to recognize this status and emit event
4. Test with GPT-4o Mini, wait for Level 1 max retries
5. Verify logs show:
- Agent graph ends with `paused_for_user_action`
- DO emits `user_action_required`
- Frontend receives event and shows modal
6. Test Continue button → retry count resets, agent resumes
## Files to Modify
1. `ssh-proxy/agent.ts`:
- Update `BanditState` annotation to include `paused_for_user_action` status
- Modify `validateResult` to return this status instead of `'failed'`
- Update `shouldContinue` routing
2. `bandit-runner-app/src/lib/agents/bandit-state.ts`:
- Add `'paused_for_user_action'` to status union type
3. `bandit-runner-app/src/lib/durable-objects/BanditAgentDO.ts`:
- In `runAgentViaProxy()`, detect `paused_for_user_action` status
- Emit `user_action_required` when detected
- Remove detection from `updateStateFromEvent()` (it's too late)