146 lines
4.5 KiB
Markdown
146 lines
4.5 KiB
Markdown
# Max-Retries Modal - Root Cause Analysis
|
|
|
|
## Test Results
|
|
|
|
**Status**: ❌ Modal does NOT appear
|
|
**Error Seen**: "ERROR: Max retries reached for level 0" (in terminal and chat)
|
|
**Modal Shown**: NO
|
|
|
|
## Root Cause
|
|
|
|
The `user_action_required` event is **never emitted** from the Durable Object.
|
|
|
|
### Why?
|
|
|
|
Looking at `BanditAgentDO.ts`:
|
|
|
|
```typescript
|
|
private updateStateFromEvent(event: AgentEvent) {
|
|
if (!this.state) return
|
|
|
|
switch (event.type) {
|
|
case 'error':
|
|
const errorContent = event.data.content || ''
|
|
if (errorContent.includes('Max retries')) {
|
|
// Emit user_action_required event
|
|
this.broadcast({
|
|
type: 'user_action_required',
|
|
data: { ... }
|
|
})
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
**The Problem**: `updateStateFromEvent()` is only called when processing events FROM the SSH proxy. But by the time we see the `error` event here, the proxy has already ended its stream with `run_complete`.
|
|
|
|
The `error` event from the proxy goes:
|
|
1. SSH Proxy emits `error: Max retries...`
|
|
2. DO receives it via `runAgentViaProxy()` stream
|
|
3. DO calls `updateStateFromEvent(event)`
|
|
4. DO tries to `broadcast()` the `user_action_required`
|
|
5. **BUT** - we're inside the proxy stream handler, and immediately after this the proxy sends `run_complete` and ends the stream
|
|
6. The frontend never gets the `user_action_required` because it's racing with `run_complete`
|
|
|
|
## The Real Fix
|
|
|
|
We need to **pause BEFORE emitting the final error**, not after.
|
|
|
|
### Option 1: Fix in SSH Proxy (Recommended)
|
|
|
|
In `ssh-proxy/agent.ts`, when `validateResult` hits max retries, instead of returning status `'failed'`, return status `'paused_for_user_action'`:
|
|
|
|
```typescript
|
|
// In validateResult()
|
|
if (state.retryCount >= state.maxRetries) {
|
|
return {
|
|
status: 'paused_for_user_action' as const, // New status
|
|
error: `Max retries reached for level ${state.currentLevel}`,
|
|
}
|
|
}
|
|
```
|
|
|
|
Then in the graph conditional routing:
|
|
|
|
```typescript
|
|
function shouldContinue(state: BanditAgentState): string {
|
|
if (state.status === 'paused_for_user_action') {
|
|
return END // Stop graph execution
|
|
}
|
|
// ... rest of routing
|
|
}
|
|
```
|
|
|
|
And in the DO, when we see this status, emit the user action event:
|
|
|
|
```typescript
|
|
case 'node_update':
|
|
if (nodeOutput.status === 'paused_for_user_action') {
|
|
this.broadcast({
|
|
type: 'user_action_required',
|
|
data: {
|
|
reason: 'max_retries',
|
|
level: this.state.currentLevel,
|
|
// ...
|
|
}
|
|
})
|
|
this.state.status = 'paused'
|
|
}
|
|
```
|
|
|
|
### Option 2: Fix in DO (Simpler but less clean)
|
|
|
|
Before broadcasting the error event, check if it's a max-retries error and emit `user_action_required` FIRST:
|
|
|
|
```typescript
|
|
// In runAgentViaProxy(), when processing events:
|
|
if (agentEvent.type === 'error' && agentEvent.data.content?.includes('Max retries')) {
|
|
// Emit user_action_required FIRST
|
|
this.broadcast({
|
|
type: 'user_action_required',
|
|
data: { ... }
|
|
})
|
|
this.state.status = 'paused'
|
|
await this.storage.saveState(this.state)
|
|
}
|
|
|
|
// Then broadcast the error normally
|
|
this.broadcast(agentEvent)
|
|
```
|
|
|
|
## Why Current Code Doesn't Work
|
|
|
|
The current code tries to detect the error in `updateStateFromEvent()` which is called too late in the event processing pipeline. By the time we try to emit `user_action_required`, the proxy stream has already ended and the frontend has moved on to `run_complete`.
|
|
|
|
## Recommended Fix
|
|
|
|
**Option 1** is cleaner because it makes the agent's state machine explicit about needing user action. This also prevents the `run_complete` event from firing prematurely.
|
|
|
|
## Testing Plan
|
|
|
|
1. Implement Option 1 in `ssh-proxy/agent.ts`
|
|
2. Add new status to type definitions
|
|
3. Update DO to recognize this status and emit event
|
|
4. Test with GPT-4o Mini, wait for Level 1 max retries
|
|
5. Verify logs show:
|
|
- Agent graph ends with `paused_for_user_action`
|
|
- DO emits `user_action_required`
|
|
- Frontend receives event and shows modal
|
|
6. Test Continue button → retry count resets, agent resumes
|
|
|
|
## Files to Modify
|
|
|
|
1. `ssh-proxy/agent.ts`:
|
|
- Update `BanditState` annotation to include `paused_for_user_action` status
|
|
- Modify `validateResult` to return this status instead of `'failed'`
|
|
- Update `shouldContinue` routing
|
|
|
|
2. `bandit-runner-app/src/lib/agents/bandit-state.ts`:
|
|
- Add `'paused_for_user_action'` to status union type
|
|
|
|
3. `bandit-runner-app/src/lib/durable-objects/BanditAgentDO.ts`:
|
|
- In `runAgentViaProxy()`, detect `paused_for_user_action` status
|
|
- Emit `user_action_required` when detected
|
|
- Remove detection from `updateStateFromEvent()` (it's too late)
|
|
|