bandit-runner/docs/development_documentation/WEBSOCKET-DEBUG-STATUS.md
2025-10-09 22:03:37 -06:00

# WebSocket Debugging Status
## ✅ What's Working
1. **App loads without errors** - Fixed `__name is not defined` with polyfill in layout.tsx
2. **Model selection** - Dropdown populated with OpenRouter models
3. **HTTP API routes** - All working:
   - `/api/agent/[runId]/start` → 200 ✅
   - `/api/agent/[runId]/status` → 200 ✅
   - `/api/agent/[runId]/pause` → 200 ✅
   - `/api/agent/[runId]/resume` → 200 ✅
4. **Durable Object HTTP** - DO responds to HTTP requests correctly
5. **UI state updates** - Status changes from IDLE → RUNNING, agent message appears
## ❌ What's Broken
**WebSocket connection fails with 500 error during handshake**
### Error Details
```
WebSocket connection to 'wss://bandit-runner-app.nicholaivogelfilms.workers.dev/api/agent/run-XXX/ws'
failed: Error during WebSocket handshake: Unexpected response code: 500
```
### Test Results
| Test | Result | Details |
|------|--------|---------|
| curl with WS headers | 426 | Returns "Expected Upgrade: websocket" |
| Browser WebSocket | 500 | Handshake fails |
| DO `/status` endpoint | 200 | DO is accessible |
## Code Analysis
### /ws Route (`src/app/api/agent/[runId]/ws/route.ts`)
- ✅ Checks for `Upgrade: websocket` header
- ✅ Gets DO stub correctly
- ✅ Forwards request to DO
- ⚠️ **curl gets 426, browser gets 500**: the two clients fail at different points, so the curl test does not reproduce the browser failure
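The 426 in the table is consistent with a handler that rejects any request it does not recognize as a WebSocket upgrade. A minimal sketch of such a check follows. `upgradeRejection` is a hypothetical name, not the function in `route.ts`, and the case-insensitive comparison is an assumption (the DO code in this document compares with `===`).

```javascript
// Hypothetical helper: returns a 426 Response when the headers do not ask
// for a WebSocket upgrade, or null when the upgrade may proceed.
function upgradeRejection(headers) {
  // Header names are case-insensitive in Headers.get(); the value is
  // normalized here so "WebSocket" and "websocket" are treated alike.
  const upgrade = (headers.get("Upgrade") || "").trim().toLowerCase();
  if (upgrade === "websocket") {
    return null; // caller would forward the request to the DO
  }
  return new Response("Expected Upgrade: websocket", { status: 426 });
}
```

If the real route behaves like this, a 426 means the `Upgrade` header was absent or altered by the time the check ran, which is a different failure from the browser's 500.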
### Durable Object WebSocket Code
```javascript
// In patch-worker.js (deployed to .open-next/worker.js)
if (request.headers.get("Upgrade") === "websocket") {
  const pair = new WebSocketPair();
  const [client, server] = Object.values(pair);
  this.ctx.acceptWebSocket(server); // ✅ Modern Hibernatable API
  return new Response(null, { status: 101, webSocket: client });
}
// WebSocket handler methods exist:
async webSocketMessage(ws, message) { ... }
async webSocketClose(ws, code, reason, wasClean) { ... }
async webSocketError(ws, error) { ... }
```
### Verified Deployed Code
- ✅ Polyfill at top of worker.js
- ✅ `BanditAgentDO` class exported
- ✅ WebSocket handling using Hibernatable API
- ✅ Handler methods present
## Possible Causes
### 1. **Next.js/OpenNext Middleware Interception**
- OpenNext may be intercepting WebSocket upgrades before they reach the route
- Middleware might be stripping headers or modifying the request
### 2. **Request Object Compatibility**
- `NextRequest` forwarded to DO might not be compatible with DO's `fetch()`
- Headers may be lost/modified during forwarding
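One way to test this cause is to stop forwarding the `NextRequest` itself and instead forward a plain `Request` built from an explicit list of handshake headers. The sketch below covers only the header copy. `REQUIRED_WS_HEADERS`, `OPTIONAL_WS_HEADERS`, and `copyHandshakeHeaders` are hypothetical names, and whether OpenNext preserves these headers is exactly what remains unverified.

```javascript
// Headers a client sends in the WebSocket opening handshake (RFC 6455).
const REQUIRED_WS_HEADERS = [
  "Upgrade",
  "Connection",
  "Sec-WebSocket-Key",
  "Sec-WebSocket-Version",
];
const OPTIONAL_WS_HEADERS = [
  "Sec-WebSocket-Protocol",
  "Sec-WebSocket-Extensions",
];

// Hypothetical helper: copy only the handshake headers into a fresh Headers
// object and report which required ones were already missing on the way in.
function copyHandshakeHeaders(incoming) {
  const headers = new Headers();
  const missing = [];
  for (const name of REQUIRED_WS_HEADERS) {
    const value = incoming.get(name);
    if (value === null) missing.push(name);
    else headers.set(name, value);
  }
  for (const name of OPTIONAL_WS_HEADERS) {
    const value = incoming.get(name);
    if (value !== null) headers.set(name, value);
  }
  return { headers, missing };
}
```

In the route, the result could be passed along as `stub.fetch(new Request(request.url, { headers }))`. A non-empty `missing` list would point at header loss before the route rather than inside the DO.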
### 3. **Deployment Issue**
- Despite code looking correct, deployed worker may differ
- Bundling process may be corrupting WebSocket code
### 4. **Missing Secret**
- `OPENROUTER_API_KEY` not set (though this shouldn't affect WS upgrade)
## Next Steps to Try
### Option A: Bypass Next.js Route Entirely
Create a direct Worker route handler that doesn't go through Next.js:
1. Add to `wrangler.jsonc`:
```json
{
  "routes": [
    {
      "pattern": "*/ws/*",
      "custom_domain": false,
      "zone_name": "your-domain.com"
    }
  ]
}
```
2. Create Worker-native WebSocket handler
### Option B: Use Service Bindings
Instead of routing through Next.js, add a Service Binding to a separate Worker (`websocket-handler` below) that owns the WebSocket endpoint and talks to the DO:
```json
{
  "services": [
    {
      "binding": "WS_SERVICE",
      "service": "websocket-handler",
      "environment": "production"
    }
  ]
}
```
### Option C: Deploy Separate DO Worker (RECOMMENDED)
As outlined in the plan - this guarantees no Next.js interference:
1. Deploy the standalone DO worker:
```bash
cd workers/bandit-agent-do
wrangler deploy
```
2. Update the main `wrangler.jsonc` so the binding points at the external worker:
```jsonc
{
  "durable_objects": {
    "bindings": [{
      "name": "BANDIT_AGENT",
      "class_name": "BanditAgentDO",
      "script_name": "bandit-agent-do" // External worker
    }]
  }
}
```
3. Remove the patch script from the deploy process.
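A standalone Worker that accepts WebSocket connections directly would need to derive the run ID from the URL before choosing a DO instance. The sketch below covers only that path parsing. `parseRunId` is a hypothetical name, the `/api/agent/<runId>/ws` shape is taken from the failing URL above, and how the ID then maps to a DO instance (for example through `idFromName`) is an assumption about the eventual design.

```javascript
// Hypothetical helper: extract the run ID from a WebSocket path such as
// /api/agent/run-123/ws. Returns null for any other path.
function parseRunId(pathname) {
  const match = /^\/api\/agent\/([^/]+)\/ws\/?$/.exec(pathname);
  return match ? match[1] : null;
}
```

Returning `null` for non-matching paths lets the Worker answer with a 404 before it touches the DO namespace.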
### Option D: Add Debug Logging and Re-test
- Deploy with comprehensive logging
- Use `wrangler tail` to capture actual request/response
- Identify exact failure point
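For the logging step, emitting the same compact summary from both the Next.js route and the DO makes the two `wrangler tail` lines directly comparable. The sketch below is one possible shape. `handshakeSnapshot` and its field names are hypothetical, not existing code in this project.

```javascript
// Hypothetical helper: summarize what a handler actually received so the
// route log line and the DO log line can be compared field by field.
function handshakeSnapshot(stage, method, url, headers) {
  return {
    stage, // e.g. "route" or "do"
    method,
    path: new URL(url).pathname,
    upgrade: headers.get("Upgrade"),
    connection: headers.get("Connection"),
    hasKey: headers.has("Sec-WebSocket-Key"), // presence only, not the value
    version: headers.get("Sec-WebSocket-Version"),
  };
}
```

A handler could log it with `console.log(JSON.stringify(handshakeSnapshot("route", request.method, request.url, request.headers)))`. If the `route` line shows `upgrade: "websocket"` and the `do` line shows `null`, the header is being lost in forwarding.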
## Current Theory
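**Most Likely**: Next.js/OpenNext cannot pass a WebSocket upgrade through an API route. The framework expects an ordinary HTTP response from a route handler, so a `101` response carrying a `webSocket` is probably dropped or rejected on its way back through the OpenNext adapter. This has not been confirmed; it is the best fit for the evidence so far.

**Evidence**:
- curl gets 426 ("Expected Upgrade: websocket"), which suggests a handler ran but did not see a valid `Upgrade` header on that request
- Browser (going through the full Next.js stack) gets 500
- HTTP routes work fine (standard request/response)
- WebSocket routes fail (protocol upgrade)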
## Recommendation
**Proceed with Option C** (Separate DO Worker) as it:
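1. Bypasses Next.js/OpenNext for WebSocket traffic, provided the browser connects to the standalone worker rather than to the Next.js `/ws` route
2. Uses a standard Cloudflare pattern: a DO class hosted in one Worker and bound from another via `script_name`
3. Matches the plan we already created
4. Takes the DO out of the OpenNext bundle, so the patch script is no longer needed
5. Provides independent deployment and debugging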
The inline DO + patch script approach was worth trying, but WebSocket upgrades likely need a native Worker environment, not a Next.js API route.