bandit-runner/MAX-RETRIES-ROOT-CAUSE.md
2025-10-13 10:21:50 -06:00

Max-Retries Modal - Root Cause Analysis

Test Results

Status: Modal does NOT appear
Error Seen: "ERROR: Max retries reached for level 0" (in terminal and chat)
Modal Shown: NO

Root Cause

The user_action_required event from the Durable Object never reaches the frontend in time: it is broadcast too late, just before the proxy stream ends with run_complete.

Why?

Looking at BanditAgentDO.ts:

private updateStateFromEvent(event: AgentEvent) {
  if (!this.state) return

  switch (event.type) {
    case 'error':
      const errorContent = event.data.content || ''
      if (errorContent.includes('Max retries')) {
        // Emit user_action_required event
        this.broadcast({
          type: 'user_action_required',
          data: { ... }
        })
      }
  }
}

The Problem: updateStateFromEvent() is only called while processing events from the SSH proxy, and the max-retries error is the last event the proxy sends before run_complete closes the stream.

The error event takes this path:

  1. SSH Proxy emits error: Max retries...
  2. DO receives it via runAgentViaProxy() stream
  3. DO calls updateStateFromEvent(event)
  4. DO tries to broadcast() the user_action_required
  5. However, this happens inside the proxy stream handler, and immediately afterwards the proxy sends run_complete and ends the stream
  6. The frontend never gets user_action_required because it races with run_complete

Option 1: Fix in Agent (The Real Fix)

We need to pause BEFORE emitting the final error, not after.

In ssh-proxy/agent.ts, when validateResult hits max retries, instead of returning status 'failed', return status 'paused_for_user_action':

// In validateResult()
if (state.retryCount >= state.maxRetries) {
  return {
    status: 'paused_for_user_action' as const, // New status
    error: `Max retries reached for level ${state.currentLevel}`,
  }
}

Then in the graph conditional routing:

function shouldContinue(state: BanditAgentState): string {
  if (state.status === 'paused_for_user_action') {
    return END // Stop graph execution
  }
  // ... rest of routing
}
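The two fragments above can be combined into a small self-contained sketch. Everything here beyond the names used above is an assumption for illustration: the AgentStatus union, the RetryState shape, the checkRetryBudget helper (standing in for the retry check inside validateResult), the local END constant, and the 'continue' route. The real functions in ssh-proxy/agent.ts likely do more, and the real graph would take END from its graph library.

```typescript
// Stand-in for the graph library's END sentinel (assumption: the real code imports it).
const END = '__end__'

// Hypothetical status union; the analysis only names 'failed' and 'paused_for_user_action'.
type AgentStatus = 'running' | 'failed' | 'paused_for_user_action'

interface RetryState {
  status: AgentStatus
  retryCount: number
  maxRetries: number
  currentLevel: number
  error?: string
}

// Retry-budget portion of validateResult(): returns a partial state update,
// or an empty update when the budget is not yet exhausted.
function checkRetryBudget(state: RetryState): Partial<RetryState> {
  if (state.retryCount >= state.maxRetries) {
    return {
      status: 'paused_for_user_action',
      error: `Max retries reached for level ${state.currentLevel}`,
    }
  }
  return {}
}

// Routing: a paused state ends the graph. 'continue' is a placeholder
// for the rest of the real routing, which is not shown in this analysis.
function shouldContinue(state: RetryState): string {
  if (state.status === 'paused_for_user_action') {
    return END
  }
  return 'continue'
}

const exhausted: RetryState = { status: 'running', retryCount: 3, maxRetries: 3, currentLevel: 0 }
const paused: RetryState = { ...exhausted, ...checkRetryBudget(exhausted) }
```

Because the graph ends on a paused status instead of a failed one, the pause is visible in the state itself, so downstream code does not need to match on the error string.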

And in the DO, when we see this status, emit the user action event:

case 'node_update':
  if (nodeOutput.status === 'paused_for_user_action') {
    this.broadcast({
      type: 'user_action_required',
      data: {
        reason: 'max_retries',
        level: this.state.currentLevel,
        // ...
      }
    })
    this.state.status = 'paused'
  }
  break
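The DO-side handling can be sketched as a standalone function so it runs without a Durable Object. The OutgoingEvent and RunState shapes and the handleNodeUpdate name are hypothetical, and broadcast is passed in as a parameter here, whereas the real code calls this.broadcast().

```typescript
// Hypothetical shapes; the real AgentEvent and DO state types are not shown in this analysis.
interface OutgoingEvent {
  type: string
  data: Record<string, unknown>
}

interface RunState {
  status: string
  currentLevel: number
}

// Handles one node update. Returns true when the run was paused for user action.
function handleNodeUpdate(
  nodeOutput: { status?: string },
  state: RunState,
  broadcast: (event: OutgoingEvent) => void,
): boolean {
  if (nodeOutput.status !== 'paused_for_user_action') {
    return false
  }
  // Tell the frontend first, then record the pause in the run state.
  broadcast({
    type: 'user_action_required',
    data: { reason: 'max_retries', level: state.currentLevel },
  })
  state.status = 'paused'
  return true
}

// Usage: collect broadcasts in an array instead of sending them to clients.
const sent: OutgoingEvent[] = []
const runState: RunState = { status: 'running', currentLevel: 0 }
const didPause = handleNodeUpdate(
  { status: 'paused_for_user_action' },
  runState,
  (e) => sent.push(e),
)
```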

Option 2: Fix in DO (Simpler but less clean)

Before broadcasting the error event, check if it's a max-retries error and emit user_action_required FIRST:

// In runAgentViaProxy(), when processing events:
if (agentEvent.type === 'error' && agentEvent.data.content?.includes('Max retries')) {
  // Emit user_action_required FIRST
  this.broadcast({
    type: 'user_action_required',
    data: { ... }
  })
  this.state.status = 'paused'
  await this.storage.saveState(this.state)
}

// Then broadcast the error normally
this.broadcast(agentEvent)
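The ordering that Option 2 relies on can be shown as a pure function that maps one incoming proxy event to the events to broadcast, in order. This is only a sketch: the ProxyEvent shape and the expandForBroadcast name are hypothetical, and the payload fields (reason, level) are borrowed from the Option 1 fragment rather than taken from the elided data above.

```typescript
// Hypothetical event shape for illustration.
interface ProxyEvent {
  type: string
  data: { content?: string; [key: string]: unknown }
}

// Returns the events to broadcast, in order, for one incoming proxy event.
// A max-retries error is preceded by user_action_required; anything else passes through.
function expandForBroadcast(event: ProxyEvent, level: number): ProxyEvent[] {
  const isMaxRetries =
    event.type === 'error' && /Max retries/.test(event.data.content ?? '')
  if (!isMaxRetries) {
    return [event]
  }
  return [
    { type: 'user_action_required', data: { reason: 'max_retries', level } },
    event,
  ]
}

const maxRetriesError: ProxyEvent = {
  type: 'error',
  data: { content: 'ERROR: Max retries reached for level 0' },
}
const ordered = expandForBroadcast(maxRetriesError, 0)
```

Matching on the error text is what makes this option less clean: a change to the message wording in the proxy would silently disable the modal.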

Why Current Code Doesn't Work

The current code detects the error in updateStateFromEvent(), which runs too late in the event-processing pipeline. By the time user_action_required is broadcast, the proxy is about to end the stream, and the frontend moves on to run_complete.

Option 1 is cleaner because it makes the agent's state machine explicit about needing user action. This also prevents the run_complete event from firing prematurely.

Testing Plan

  1. Implement Option 1 in ssh-proxy/agent.ts
  2. Add new status to type definitions
  3. Update DO to recognize this status and emit event
  4. Test with GPT-4o Mini, wait for Level 1 max retries
  5. Verify logs show:
    • Agent graph ends with paused_for_user_action
    • DO emits user_action_required
    • Frontend receives event and shows modal
  6. Test Continue button → retry count resets, agent resumes
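The log check in step 5 can be made mechanical once event types are captured in the order they were received. The emittedBefore helper and both sample arrays below are hypothetical; the arrays are illustrative sequences, not captured logs.

```typescript
// Returns true when `first` occurs in the log and either `second` never occurs
// or the first occurrence of `first` precedes the first occurrence of `second`.
function emittedBefore(eventTypes: string[], first: string, second: string): boolean {
  const firstIdx = eventTypes.indexOf(first)
  const secondIdx = eventTypes.indexOf(second)
  if (firstIdx === -1) {
    return false
  }
  return secondIdx === -1 || firstIdx < secondIdx
}

// Illustrative sequences only (assumptions, not real output).
const expectedAfterFix = ['node_update', 'user_action_required']
const observedBeforeFix = ['error', 'run_complete']

const fixLooksRight = emittedBefore(expectedAfterFix, 'user_action_required', 'run_complete')
const bugReproduced = !emittedBefore(observedBeforeFix, 'user_action_required', 'run_complete')
```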

Files to Modify

  1. ssh-proxy/agent.ts:
    • Update BanditState annotation to include paused_for_user_action status
    • Modify validateResult to return this status instead of 'failed'
    • Update shouldContinue routing
  2. bandit-runner-app/src/lib/agents/bandit-state.ts:
    • Add 'paused_for_user_action' to status union type
  3. bandit-runner-app/src/lib/durable-objects/BanditAgentDO.ts:
    • In runAgentViaProxy(), detect paused_for_user_action status
    • Emit user_action_required when detected
    • Remove detection from updateStateFromEvent() (it's too late)