bandit-runner/docs/development_documentation/WEBSOCKET-DEBUG-STATUS.md
2025-10-09 22:03:37 -06:00

WebSocket Debugging Status

What's Working

  1. App loads without errors - Fixed __name is not defined with polyfill in layout.tsx
  2. Model selection - Dropdown populated with OpenRouter models
  3. HTTP API routes - All working:
    • /api/agent/[runId]/start → 200
    • /api/agent/[runId]/status → 200
    • /api/agent/[runId]/pause → 200
    • /api/agent/[runId]/resume → 200
  4. Durable Object HTTP - DO responds to HTTP requests correctly
  5. UI state updates - Status changes from IDLE → RUNNING, agent message appears

What's Broken

WebSocket connection fails with 500 error during handshake

Error Details

WebSocket connection to 'wss://bandit-runner-app.nicholaivogelfilms.workers.dev/api/agent/run-XXX/ws' 
failed: Error during WebSocket handshake: Unexpected response code: 500

Test Results

| Test | Result | Details |
| --- | --- | --- |
| curl with WS headers | 426 | Returns "Expected Upgrade: websocket" |
| Browser WebSocket | 500 | Handshake fails |
| DO /status endpoint | 200 | DO is accessible |

Code Analysis

/ws Route (src/app/api/agent/[runId]/ws/route.ts)

  • Checks for Upgrade: websocket header
  • Gets DO stub correctly
  • Forwards request to DO
  • ⚠️ curl gets 426, browser gets 500 - different behavior!

Durable Object WebSocket Code

// In patch-worker.js (deployed to .open-next/worker.js)
if (request.headers.get("Upgrade") === "websocket") {
  const pair = new WebSocketPair();
  const [client, server] = Object.values(pair);
  this.ctx.acceptWebSocket(server);  // ✅ Modern Hibernatable API
  return new Response(null, { status: 101, webSocket: client });
}

// WebSocket handler methods exist:
async webSocketMessage(ws, message) { ... }
async webSocketClose(ws, code, reason, wasClean) { ... }
async webSocketError(ws, error) { ... }
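
The strict `=== "websocket"` comparison above matches only the exact lowercase value. Browsers normally send that value, so this is unlikely to explain the 500, but a case-insensitive check rules it out cheaply. A minimal sketch, assuming a hypothetical helper named `isWebSocketUpgrade` that is not part of the deployed worker:

```typescript
// Hypothetical helper (not in the deployed worker): returns true when the
// headers ask for a WebSocket upgrade.
function isWebSocketUpgrade(headers: Headers): boolean {
  // Headers.get() already ignores case in the header *name*; the *value* is
  // normalized here because RFC 6455 treats the token case-insensitively.
  const upgrade = headers.get("Upgrade");
  return upgrade !== null && upgrade.trim().toLowerCase() === "websocket";
}
```

In the DO's `fetch()`, the condition would then read `if (isWebSocketUpgrade(request.headers)) { ... }`.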

Verified Deployed Code

  • Polyfill at top of worker.js
  • BanditAgentDO class exported
  • WebSocket handling using Hibernatable API
  • Handler methods present

Possible Causes

1. Next.js/OpenNext Middleware Interception

  • OpenNext may be intercepting WebSocket upgrades before they reach the route
  • Middleware might be stripping headers or modifying the request

2. Request Object Compatibility

  • NextRequest forwarded to DO might not be compatible with DO's fetch()
  • Headers may be lost/modified during forwarding
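
One way to test this cause is to stop forwarding the `NextRequest` itself and pass the DO only plain Fetch API values, then check which handshake headers survived. The sketch below is an assumption about how that could look; `toForwardInit` and `missingHeaders` are hypothetical names, and the list of headers to require is up to the caller:

```typescript
// Hypothetical helpers for ruling out header loss during forwarding.
interface ForwardInit {
  method: string;
  headers: Headers;
}

// Copies the method and headers into plain Fetch API values so nothing
// Next.js-specific is handed to the Durable Object stub.
function toForwardInit(incoming: { method: string; headers: Headers }): ForwardInit {
  // new Headers(existing) creates an independent copy of every header.
  return { method: incoming.method, headers: new Headers(incoming.headers) };
}

// Returns the names from `required` that are absent, for logging.
function missingHeaders(headers: Headers, required: string[]): string[] {
  return required.filter((name) => !headers.has(name));
}
```

In the route, the result could be used as `stub.fetch(new Request(request.url, toForwardInit(request)))`, logging `missingHeaders(request.headers, ["Upgrade"])` first. Whether this changes the 500 is untested.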

3. Deployment Issue

  • Despite code looking correct, deployed worker may differ
  • Bundling process may be corrupting WebSocket code

4. Missing Secret

  • OPENROUTER_API_KEY not set (though this shouldn't affect WS upgrade)

Next Steps to Try

Option A: Bypass Next.js Route Entirely

Create a direct Worker route handler that doesn't go through Next.js:

  1. Add to wrangler.jsonc:
{
  "routes": [
    {
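      // Route patterns allow "*" only at the start of the hostname and the end
      // of the path, so "*/ws/*" cannot match /api/agent/<runId>/ws. Match the
      // prefix here and check for the /ws suffix inside the Worker.
      "pattern": "your-domain.com/api/agent/*",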
      "custom_domain": false,
      "zone_name": "your-domain.com"
    }
  ]
}
  2. Create Worker-native WebSocket handler
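
A Worker-native handler has to recognize the WebSocket path itself before anything is handed to Next.js. The sketch below shows only that dispatch step, under stated assumptions: `matchWsRoute` and `createRouter` are hypothetical names, `handleWs` stands in for code that forwards to the DO, and `handleNext` stands in for the OpenNext-generated handler.

```typescript
// Hypothetical: extracts <runId> from /api/agent/<runId>/ws, or returns null
// for every other path.
function matchWsRoute(pathname: string): string | null {
  const match = /^\/api\/agent\/([^/]+)\/ws\/?$/.exec(pathname);
  return match?.[1] ?? null;
}

type Handler = (request: Request) => Promise<Response>;

// Sends WebSocket paths to handleWs and everything else to handleNext, so
// upgrade requests never enter the Next.js stack.
function createRouter(
  handleWs: (runId: string, request: Request) => Promise<Response>,
  handleNext: Handler,
): Handler {
  return (request) => {
    const runId = matchWsRoute(new URL(request.url).pathname);
    return runId !== null ? handleWs(runId, request) : handleNext(request);
  };
}
```

The Worker's `fetch` entry point would call the function returned by `createRouter`; the untouched HTTP routes keep going through Next.js.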

Option B: Use Service Bindings

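Instead of routing through Next.js, create a Service Binding to a separate Worker that handles the WebSocket and owns the DO (Service Bindings target Workers, not Durable Objects directly):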

{
  "services": [
    {
      "binding": "WS_SERVICE",
      "service": "websocket-handler",
      "environment": "production"
    }
  ]
}

Option C: Separate DO Worker

As outlined in the plan, this guarantees no Next.js interference:

# 1. Deploy standalone DO worker
cd workers/bandit-agent-do
wrangler deploy

# 2. Update main wrangler.jsonc
{
  "durable_objects": {
    "bindings": [{
      "name": "BANDIT_AGENT",
      "class_name": "BanditAgentDO",
      "script_name": "bandit-agent-do"  // External worker
    }]
  }
}

# 3. Remove patch script from deploy process
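
With the DO hosted in an external worker, the calling side only needs the namespace binding. The sketch below models just the three namespace calls involved (`idFromName`, `get`, and the stub's `fetch`) with minimal structural types so it runs outside the Workers runtime. Addressing the DO by `idFromName(runId)` is an assumption, since the source does not say how IDs are derived, and `forwardToAgent` is a hypothetical name.

```typescript
// Minimal stand-ins for Cloudflare's Durable Object namespace and stub types;
// only the calls used below are modeled.
interface AgentStub {
  fetch(request: Request): Promise<Response>;
}
interface AgentNamespace<Id> {
  idFromName(name: string): Id;
  get(id: Id): AgentStub;
}

// Hypothetical helper: forwards a request to the DO instance for one run.
// Assumes one DO per runId, addressed by name.
function forwardToAgent<Id>(
  ns: AgentNamespace<Id>,
  runId: string,
  request: Request,
): Promise<Response> {
  const id = ns.idFromName(runId); // the same runId always yields the same ID
  return ns.get(id).fetch(request);
}
```

A caller would pass the `BANDIT_AGENT` binding as `ns`, however it obtains its bindings.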

Option D: Add Debug Logging and Re-test

  • Deploy with comprehensive logging
  • Use wrangler tail to capture actual request/response
  • Identify exact failure point
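
To make the `wrangler tail` output comparable between the route and the DO, both could log the same small summary of the incoming request. A minimal sketch, assuming a hypothetical helper named `describeHandshake`; the fields chosen are illustrative:

```typescript
// Hypothetical log shape for comparing what each layer actually received.
interface HandshakeLog {
  method: string;
  path: string;
  upgrade: string | null;
  hasWebSocketKey: boolean;
}

// Summarizes the parts of a request that matter for a WebSocket handshake.
// Accepts any object with method, url, and headers, so it works for both a
// NextRequest in the route and a plain Request in the DO.
function describeHandshake(req: { method: string; url: string; headers: Headers }): HandshakeLog {
  return {
    method: req.method,
    path: new URL(req.url).pathname,
    upgrade: req.headers.get("Upgrade"),
    hasWebSocketKey: req.headers.has("Sec-WebSocket-Key"),
  };
}
```

Each layer would call something like `console.log(JSON.stringify({ stage: "route", ...describeHandshake(request) }))`, with `stage` set to `"do"` inside the Durable Object. If the route logs `upgrade: "websocket"` and the DO logs `upgrade: null`, the header is being lost during forwarding.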

Current Theory

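Most Likely: Next.js/OpenNext does not support WebSocket upgrades in API routes. The framework expects ordinary HTTP responses, so the DO's 101 response (with its attached webSocket) probably does not survive the trip back through the route handler. Workers themselves support WebSockets natively, so the limitation would be in the Next.js layer, not the platform.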

Evidence:

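  • curl gets 426 ("Expected Upgrade: websocket"), so its request reached an Upgrade check without a usable Upgrade header and never attempted the handshake (possibly because curl negotiated HTTP/2; retrying with --http1.1 would confirm)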
  • Browser (going through full Next.js stack) gets 500
  • HTTP routes work fine (standard request/response)
  • WebSocket routes fail (protocol upgrade)

Recommendation

Proceed with Option C (Separate DO Worker) as it:

  1. Completely bypasses Next.js/OpenNext
  2. Uses Cloudflare's recommended architecture
  3. Matches the plan we already created
  4. Removes the patch script and the bundling/compatibility risks that come with it
  5. Provides independent deployment and debugging

The inline DO + patch script approach was worth trying, but WebSocket upgrades likely need a native Worker environment, not a Next.js API route.