From 046ef202cb04c7c17dc345a6a7d035411fdfe088 Mon Sep 17 00:00:00 2001 From: nicholai Date: Thu, 9 Oct 2025 19:27:58 -0600 Subject: [PATCH] docs: comprehensive README update with accurate architecture and deployment guide MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Add detailed Features section with LangGraph agent, WebSocket streaming, ANSI terminal - Document real architecture: Browser → Workers → DO → SSH Proxy → Bandit - Add complete Prerequisites section (accounts, software, API keys) - Provide step-by-step Installation instructions for local dev - Add comprehensive 4-step Deployment guide (SSH Proxy, DO, Main, Verify) - Update Usage section with actual UI features and manual mode - Add Troubleshooting section for common issues - Update Roadmap to reflect completed features - Remove outdated D1/R2 architecture references - Add LangGraph badge and update Built With section - Document monorepo structure and component responsibilities - Include data flow diagram and event streaming details --- README.md | 496 ++++++++++++++---- .../components/terminal-chat-interface.tsx | 28 +- readme-improvement.plan.md | 468 +++++++++++++++++ 3 files changed, 876 insertions(+), 116 deletions(-) create mode 100644 readme-improvement.plan.md diff --git a/README.md b/README.md index de2ee75..bbe2b81 100644 --- a/README.md +++ b/README.md @@ -66,20 +66,25 @@ [![Product Screenshot][product-screenshot]](#) -**Bandit Runner** is a public, deterministic evaluation harness for large language models. -It transforms AI models into autonomous operators tasked with completing the **OverTheWire Bandit** wargame via SSH — entirely on Cloudflare Workers. +**Bandit Runner** is an autonomous AI agent testing framework that evaluates large language models by having them solve the **OverTheWire Bandit** wargame challenges via SSH. **Why it matters** -- Provides a real-world, hands-on benchmark for autonomous reasoning and command execution. 
-- Tests tool use (SSH), planning, error handling, and persistence under real network conditions. -- Generates reproducible, privacy-safe logs for research or public leaderboards. +- Provides a real-world, hands-on benchmark for autonomous reasoning and command execution +- Tests tool use (SSH), planning, error handling, and persistence under real network conditions +- Generates reproducible logs for research and public leaderboards +- Demonstrates LLM capabilities in a sandboxed, deterministic environment -### Core Concepts -- **Agent Role:** Each run instantiates an LLM as “BanditRunner” — a scripted, deterministic persona following a strict system prompt and command allow-list. -- **Environment:** Next.js frontend + OpenNext build → Cloudflare Workers backend (Durable Objects + D1 + R2). -- **Security:** Hard-scoped to `bandit.labs.overthewire.org:2220`. - All discovered passwords are redacted in logs and sealed in short-lived encrypted blobs. -- **Goal:** Advance from Level 0 → final level autonomously while documenting every decision. +### Features + +- 🤖 **LangGraph.js Autonomous Agent** - State machine with tool use and planning +- 🔌 **Real-Time WebSocket Streaming** - Live updates with <50ms latency +- 🖥️ **Full Terminal Output** - Complete SSH session with ANSI colors and formatting +- 💬 **Agent Reasoning Display** - See the LLM's thinking process and command selection +- 🎯 **OpenRouter Integration** - Access 100+ models with search and filtering +- 🎨 **Beautiful Retro UI** - Terminal-inspired interface with shadcn/ui components +- 🔒 **Security Boundaries** - Sandboxed SSH access to read-only Bandit server +- 📊 **Level Progression** - Automatic password extraction and advancement +- 🛠️ **Manual Mode** - Optional human intervention for debugging
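The feature list mentions automatic password extraction, and the architecture notes later say this is done with regex patterns. A minimal sketch of that idea follows. It assumes a level password appears in the terminal output as a standalone 32-character alphanumeric token; the pattern and the helper name `extractPasswordCandidates` are hypothetical and not taken from the project's code.

```typescript
// Hypothetical helper, not the project's actual extraction logic.
// Assumption: a password is a standalone token of exactly 32 letters/digits.
const PASSWORD_PATTERN = /\b[A-Za-z0-9]{32}\b/g;

function extractPasswordCandidates(output: string): string[] {
  const matches = output.match(PASSWORD_PATTERN);
  const found: string[] = matches === null ? [] : matches.slice();
  // Drop duplicates while keeping first-seen order.
  return found.filter((token, index) => found.indexOf(token) === index);
}
```

A real agent would likely need extra context (for example, the command that produced the output) to decide which candidate is the password.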
@@ -87,14 +92,14 @@ It transforms AI models into autonomous operators tasked with completing the **O ### Built With -* [![Next.js][Next.js]][Next-url] -* [![React][React.js]][React-url] -* [![Cloudflare][Cloudflare-badge]][Cloudflare-url] -* [![OpenNext][OpenNext-badge]][OpenNext-url] -* [![Shadcn/UI][Shadcn-badge]][Shadcn-url] -* [![TypeScript][TypeScript-badge]][TypeScript-url] -* [![Drizzle ORM][Drizzle-badge]][Drizzle-url] -* [![pnpm][pnpm-badge]][pnpm-url] +* [![Next.js][Next.js]][Next-url] - Frontend framework (v15 with App Router) +* [![React][React.js]][React-url] - UI library (v19) +* [![Cloudflare][Cloudflare-badge]][Cloudflare-url] - Edge compute + Durable Objects +* [![OpenNext][OpenNext-badge]][OpenNext-url] - Next.js adapter for Cloudflare Workers +* [![LangGraph][LangGraph-badge]][LangGraph-url] - Agent orchestration framework +* [![Shadcn/UI][Shadcn-badge]][Shadcn-url] - Component library +* [![TypeScript][TypeScript-badge]][TypeScript-url] - Type safety +* [![pnpm][pnpm-badge]][pnpm-url] - Package manager
@@ -104,55 +109,140 @@ It transforms AI models into autonomous operators tasked with completing the **O ### Prerequisites -You need: +#### Required Accounts +* **Cloudflare** account (free tier works, needs Durable Objects enabled) +* **Fly.io** account (free tier works) +* **OpenRouter** account + API key ([get one here](https://openrouter.ai)) + +#### Required Software * **Node.js ≥ 20** -* **pnpm** +* **pnpm** package manager ```bash - npm i -g pnpm - ``` - -* **Wrangler 3 CLI** - - ```bash - npm i -g wrangler + npm install -g pnpm + ``` +* **Wrangler CLI** (Cloudflare) + ```bash + npm install -g wrangler + ``` +* **flyctl CLI** (Fly.io) + ```bash + curl -L https://fly.io/install.sh | sh ``` -* A Cloudflare account with access to: - - * Durable Objects - * D1 Database - * R2 Storage ### Installation -1. Clone the repo +#### 1. Clone Repository +```bash +git clone https://git.biohazardvfx.com/Nicholai/bandit-runner.git +cd bandit-runner +``` - ```bash - git clone https://git.biohazardvfx.com/Nicholai/bandit-runner.git - cd bandit-runner - ``` -2. Install dependencies +#### 2. Install Dependencies +```bash +# Frontend + Cloudflare Workers +cd bandit-runner-app +pnpm install - ```bash - pnpm install - ``` -3. Copy and configure environment +# SSH Proxy (LangGraph agent) +cd ../ssh-proxy +npm install +``` - ```bash - cp .env.example .env.local - ``` -4. Build and run locally +#### 3. Configure Environment - ```bash - pnpm dev - # or - wrangler dev - ``` -5. Deploy preview +**Cloudflare DO Worker:** +```bash +cd bandit-runner-app/workers/bandit-agent-do +# Create .env.local with your OpenRouter API key +echo "OPENROUTER_API_KEY=your_key_here" > .env.local +``` - ```bash - pnpm build - wrangler deploy --env preview - ``` +**Main Worker:** +No `.env` needed - configuration is in `wrangler.jsonc` + +#### 4. 
Local Development + +**Option A: Frontend Only** (requires deployed backend) +```bash +cd bandit-runner-app +pnpm dev +# Open http://localhost:3000 +``` + +**Option B: Full Stack** (all components local) +```bash +# Terminal 1: SSH Proxy +cd ssh-proxy +npm run dev +# Runs on http://localhost:3001 + +# Terminal 2: Main App +cd bandit-runner-app +# Update wrangler.jsonc: "SSH_PROXY_URL": "http://localhost:3001" +pnpm dev +``` + +
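Option B points `SSH_PROXY_URL` at a plain `http://localhost:3001`, while the deployed setup uses an `https://` URL. The sketch below shows one way to sanity-check that value before starting. Only the name `SSH_PROXY_URL` comes from the README; `checkProxyUrl` and its rules (plain HTTP allowed only for loopback hosts) are assumptions for illustration.

```typescript
// Hypothetical config check; the rules are an assumption, not project behavior.
interface ProxyUrlCheck {
  ok: boolean;
  reason: string;
}

function checkProxyUrl(raw: string): ProxyUrlCheck {
  // scheme://host[:port][/path] parsed with a small regex to avoid runtime-specific APIs.
  const match = /^(https?):\/\/([^\/:?#]+)(:\d+)?([\/?#].*)?$/i.exec(raw.trim());
  if (match === null) {
    return { ok: false, reason: "not an http(s) URL" };
  }
  const scheme = (match[1] || "").toLowerCase();
  const host = (match[2] || "").toLowerCase();
  if (scheme === "https") {
    return { ok: true, reason: "https" };
  }
  const isLoopback = host === "localhost" || host === "127.0.0.1";
  return isLoopback
    ? { ok: true, reason: "plain http allowed for local development" }
    : { ok: false, reason: "plain http is only accepted for localhost" };
}
```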
+ +--- + +## Deployment + +### Step 1: Deploy SSH Proxy to Fly.io + +```bash +cd ssh-proxy + +# Login (opens browser) +flyctl auth login + +# Deploy (creates app on first run) +flyctl deploy + +# Note your URL: https://bandit-ssh-proxy.fly.dev +``` + +### Step 2: Deploy Durable Object Worker + +```bash +cd ../bandit-runner-app/workers/bandit-agent-do + +# Set OpenRouter API key as secret +wrangler secret put OPENROUTER_API_KEY +# Paste your key when prompted + +# Deploy +wrangler deploy + +# Verify (optional) +wrangler tail +``` + +### Step 3: Deploy Main Application + +```bash +cd ../.. # Back to bandit-runner-app root + +# Verify wrangler.jsonc has correct SSH_PROXY_URL +# Should be: "SSH_PROXY_URL": "https://bandit-ssh-proxy.fly.dev" + +# Build and deploy +pnpm run deploy + +# Your app is live at: +# https://bandit-runner-app..workers.dev +``` + +### Step 4: Verify Deployment + +```bash +# Test SSH Proxy +curl https://bandit-ssh-proxy.fly.dev/ssh/health +# Should return: {"status":"ok","activeConnections":0} + +# Open your app +open https://bandit-runner-app..workers.dev +```
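The verification step expects the health endpoint to return `{"status":"ok","activeConnections":0}`. A small sketch of checking that body programmatically is shown here. The field names come from the expected output above; the helper names `parseHealth` and `isHealthy` are hypothetical.

```typescript
// Shape of the health payload as shown in the expected curl output.
interface ProxyHealth {
  status: string;
  activeConnections: number;
}

// Hypothetical helper: returns null when the body is not the expected JSON shape.
function parseHealth(body: string): ProxyHealth | null {
  let data: unknown;
  try {
    data = JSON.parse(body);
  } catch {
    return null; // e.g. an HTML error page from a gateway
  }
  if (typeof data !== "object" || data === null) {
    return null;
  }
  const record = data as Record<string, unknown>;
  const status = record["status"];
  const activeConnections = record["activeConnections"];
  if (typeof status !== "string" || typeof activeConnections !== "number") {
    return null;
  }
  return { status, activeConnections };
}

function isHealthy(body: string): boolean {
  const health = parseHealth(body);
  return health !== null && health.status === "ok";
}
```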
@@ -160,21 +250,37 @@ You need: ## Usage -Once deployed, visit `/runs/new` to start a new evaluation. -Provide a model endpoint (OpenAI, OpenRouter, or self-hosted) and initiate a Bandit Run. +### Starting a Run -Each run: +1. **Open the application** in your browser +2. **Select Model**: Click the dropdown to choose from 100+ models + - Search by name + - Filter by provider (OpenAI, Anthropic, Google, Meta, etc.) + - Filter by price range + - Filter by context length +3. **Set Target Level**: Choose how far the agent should progress (default: 5) +4. **Select Output Mode**: + - **Selective** (default): LLM reasoning + tool calls + - **Everything**: Reasoning + tool calls + full command outputs +5. **Click START** -* Spawns a Durable Object → “Run Coordinator” -* Connects to `bandit.labs.overthewire.org:2220` -* Executes controlled `ssh.connect` / `ssh.exec` / `ssh.close` operations -* Streams JSONL logs and commentary to the Live Viewer +### Monitoring Progress -Developers can extend: +The interface provides real-time feedback: -* Scoring rules (`lib/scoring/verdicts.ts`) -* Level validators (`lib/scoring/validators.ts`) -* Model interfaces (`lib/ssh/tool-adapter.ts`) +- **Left Panel (Terminal)**: Full SSH session with ANSI colors and formatting +- **Right Panel (Agent Chat)**: LLM reasoning, planning, and discoveries +- **Top Bar**: Current status, level, and active model +- **Control Panel**: Pause/Resume/Stop controls + +### Manual Mode (Debugging) + +Toggle **"Manual Mode"** in the terminal footer to: +- Type commands directly into the SSH session +- Assist the agent when stuck +- Debug connection or logic issues + +⚠️ **Note**: Manual intervention disqualifies runs from leaderboards
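The two output modes differ only in whether full command output is shown. As a sketch of that rule, the filter below decides whether an event should be displayed. The event kinds (`reasoning`, `tool_call`, `command_output`) and the function name `shouldDisplay` are hypothetical labels chosen for this example, not the project's event schema.

```typescript
// Hypothetical event model for illustration only.
type OutputMode = "selective" | "everything";
type AgentEventKind = "reasoning" | "tool_call" | "command_output";

interface AgentEvent {
  kind: AgentEventKind;
  text: string;
}

// Selective: reasoning + tool calls. Everything: also full command outputs.
function shouldDisplay(mode: OutputMode, event: AgentEvent): boolean {
  if (mode === "everything") {
    return true;
  }
  return event.kind !== "command_output";
}
```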
@@ -182,28 +288,189 @@ Developers can extend:
 
 ## Architecture
 
+### System Overview
+
 ```text
-Next.js (App Router)
-│
-├── UI (Shadcn/UI)
-│   ├─ LiveLog
-│   └─ LevelCard
-│
-├── Edge API Routes (OpenNext)
-│   ├─ /api/startRun
-│   ├─ /api/toolInvoke
-│   └─ /api/stream
-│
-└── Cloudflare Worker
-    ├─ Durable Object: RunCoordinator
-    │   ├─ TCP connect() to Bandit
-    │   ├─ State machine (levels, caps, timers)
-    │   └─ Writes logs → R2
-    ├─ D1 (metadata)
-    └─ R2 (artifacts)
+┌─────────────┐
+│   Browser   │  WebSocket (wss://)
+└──────┬──────┘
+       │
+       ↓
+┌─────────────────────────────┐
+│  Cloudflare Worker (Main)   │  Next.js + OpenNext
+│  bandit-runner-app          │  Intercepts WebSocket upgrades
+└──────┬──────────────────────┘
+       │
+       ↓
+┌─────────────────────────────┐
+│  Durable Object Worker      │  WebSocket connection manager
+│  bandit-agent-do            │  Run state coordinator
+└──────┬──────────────────────┘
+       │ HTTP (JSONL streaming)
+       ↓
+┌─────────────────────────────┐
+│  SSH Proxy (Fly.io)         │  LangGraph.js autonomous agent
+│  bandit-ssh-proxy.fly.dev   │  SSH2 client
+└──────┬──────────────────────┘
+       │ SSH Protocol
+       ↓
+┌─────────────────────────────┐
+│  Bandit SSH Server          │  CTF challenges
+│  overthewire.org:2220       │  Real command execution
+└─────────────────────────────┘
 ```
 
-*See `docs/ADR-001-architecture.md` for the detailed decision record.*
+### Component Responsibilities
+
+**Frontend (Next.js 15)**
+- React UI with shadcn/ui components
+- Real-time terminal with ANSI rendering
+- Agent chat display
+- Model selection and configuration
+
+**Main Worker (Cloudflare)**
+- Serves Next.js application
+- Intercepts WebSocket upgrades before Next.js routing
+- Forwards WS connections to Durable Object
+
+**Durable Object (Separate Worker)**
+- Manages WebSocket connections from browsers
+- Maintains run state (level, status, passwords)
+- Calls SSH proxy HTTP endpoints
+- Streams JSONL events to connected clients
+
+**SSH Proxy (Fly.io)**
+- Runs LangGraph.js agent state machine
+- Maintains SSH2 connections to Bandit server
+- Executes commands via PTY (full terminal capture)
+- Extracts passwords using regex patterns
+- Streams events back to DO via HTTP response
+
+### Data Flow
+
+1. User clicks START in browser
+2. Browser establishes WebSocket to Main Worker
+3. Worker forwards to Durable Object
+4. DO sends HTTP POST to SSH Proxy `/agent/run`
+5. SSH Proxy runs LangGraph agent
+6. Agent connects via SSH to Bandit server
+7. Commands executed, passwords extracted
+8. Events stream: SSH Proxy → DO → WebSocket → Browser
+9. UI updates in real-time
+
+### Monorepo Structure
+
+```
+bandit-runner/
+├── bandit-runner-app/          # Frontend + Cloudflare Workers
+│   ├── src/
+│   │   ├── app/                # Next.js pages + API routes
+│   │   ├── components/         # React components
+│   │   ├── hooks/              # useAgentWebSocket
+│   │   └── lib/                # Utilities, types
+│   ├── workers/
+│   │   └── bandit-agent-do/    # Standalone Durable Object
+│   ├── scripts/
+│   │   └── patch-worker.js     # Injects WebSocket intercept
+│   └── wrangler.jsonc          # Main worker config
+├── ssh-proxy/                  # LangGraph agent runtime
+│   ├── agent.ts                # State machine definition
+│   ├── server.ts               # Express HTTP server
+│   ├── Dockerfile              # Container image
+│   └── fly.toml                # Fly.io deployment config
+└── docs/                       # Documentation
+```
+
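The data flow relies on JSONL events streamed over an HTTP response, where a network chunk can end in the middle of a line. The sketch below shows one common way to handle that: buffer the partial line and parse only complete ones. `JsonlBuffer` is a hypothetical name, and silently skipping malformed lines is a design choice made for this example, not a documented behavior of the project.

```typescript
// Hypothetical line buffer for a JSONL stream; not the project's implementation.
class JsonlBuffer {
  private pending = "";

  // Accepts the next decoded text chunk and returns every event completed by it.
  push(chunk: string): unknown[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is an unfinished line (or "" when the chunk ended with "\n").
    const last = lines.pop();
    this.pending = last === undefined ? "" : last;

    const events: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") {
        continue;
      }
      try {
        events.push(JSON.parse(trimmed));
      } catch {
        // Assumption for this sketch: malformed lines are skipped rather than fatal.
      }
    }
    return events;
  }
}
```

Each parsed event could then be forwarded to connected WebSocket clients one message at a time.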
+
+---
+
+## Troubleshooting
+
+### WebSocket Connection Failed
+
+**Symptoms**: "WebSocket connection failed" or "NS_ERROR_WEBSOCKET_CONNECTION_REFUSED"
+
+**Solutions**:
+```bash
+# Check that the DO worker is deployed
+wrangler deployments list
+
+# Check browser console for specific errors
+# Verify DO binding in wrangler.jsonc
+
+# Test DO directly
+curl https://bandit-runner-app..workers.dev/api/agent/test-run/status
+```
+
+### SSH Proxy Not Responding
+
+**Symptoms**: Agent starts but no terminal output appears
+
+**Solutions**:
+```bash
+# Check Fly.io app status
+flyctl status
+
+# View real-time logs
+flyctl logs
+
+# Test health endpoint
+curl https://bandit-ssh-proxy.fly.dev/ssh/health
+# Should return: {"status":"ok","activeConnections":0}
+
+# Restart if needed
+flyctl restart
+```
+
+### Agent Not Starting
+
+**Symptoms**: Run status stays "IDLE" or errors in console
+
+**Solutions**:
+```bash
+# Verify OpenRouter API key is set
+wrangler secret list
+
+# Check DO worker logs
+wrangler tail --name bandit-agent-do
+
+# Ensure SSH_PROXY_URL is correct in wrangler.jsonc
+# Should be: https://bandit-ssh-proxy.fly.dev (not http://)
+```
+
+### Commands Not Executing
+
+**Symptoms**: Agent thinks but no SSH commands run
+
+**Solutions**:
+```bash
+# Check SSH proxy logs
+flyctl logs
+
+# Verify Bandit server is accessible
+ssh bandit0@bandit.labs.overthewire.org -p 2220
+# Password: bandit0
+
+# Review agent reasoning in chat panel
+# Look for error messages or failed attempts
+```
+
+### Build or Deploy Errors
+
+**Common issues** (run in order, starting from the repository root):
+```bash
+# "Cannot find module" during build
+cd bandit-runner-app && pnpm install
+cd ../ssh-proxy && npm install
+
+# "Account ID not found"
+# Add account_id to bandit-runner-app/workers/bandit-agent-do/wrangler.toml
+
+# "Script not found: bandit-agent-do"
+# Deploy DO worker first:
+cd ../bandit-runner-app/workers/bandit-agent-do && wrangler deploy
+```
@@ -211,16 +478,33 @@ Next.js (App Router) ## Roadmap -* [x] Core runner architecture -* [x] JSONL log streaming -* [x] SSH tool scaffolding -* [ ] Add live leaderboard -* [ ] Add mock SSH server for tests -* [ ] Expand scoring heuristics -* [ ] Implement model-agnostic adapter layer -* [ ] Public demo page +### Completed ✅ +* [x] LangGraph.js autonomous agent framework +* [x] Real-time WebSocket streaming (JSONL events) +* [x] Full terminal output with ANSI colors +* [x] Agent reasoning and thinking display +* [x] OpenRouter model selection with search/filters +* [x] Level 0 → N automatic progression +* [x] Password extraction and advancement +* [x] Manual mode for debugging +* [x] Durable Object state management -See the [open issues](https://git.biohazardvfx.com/Nicholai/bandit-runner/issues) for the full roadmap. +### In Progress 🚧 +* [ ] Retry logic with exponential backoff +* [ ] Token usage and cost tracking UI +* [ ] Persistent run history (D1 database) +* [ ] Log storage and analysis (R2 bucket) + +### Planned 📋 +* [ ] Public leaderboard (fastest completions) +* [ ] Multi-agent comparison mode +* [ ] Custom system prompts and strategies +* [ ] Agent performance analytics dashboard +* [ ] Mock SSH server for testing +* [ ] Expanded CTF support (Natas, Krypton, etc.) +* [ ] Run replay and debugging tools + +See the [open issues](https://git.biohazardvfx.com/Nicholai/bandit-runner/issues) for discussion and feature requests.
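The "retry logic with exponential backoff" item is listed as in progress, so no implementation is described here. As a rough sketch of what such a delay schedule usually looks like, the function below doubles the wait per attempt up to a cap. The name `backoffDelayMs` and the default values are assumptions for illustration; production versions often add random jitter as well.

```typescript
// Hypothetical delay schedule; defaults are illustrative, not project settings.
function backoffDelayMs(attempt: number, baseMs: number = 500, capMs: number = 30000): number {
  // attempt 0 -> base, attempt 1 -> 2x base, attempt 2 -> 4x base, ... capped at capMs.
  const safeAttempt = Math.max(0, Math.floor(attempt));
  const uncapped = baseMs * Math.pow(2, safeAttempt);
  return Math.min(capMs, uncapped);
}
```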
@@ -268,14 +552,14 @@ Project Link: [https://git.biohazardvfx.com/Nicholai/bandit-runner](https://git. ## Acknowledgments -* [OverTheWire Bandit](https://overthewire.org/wargames/bandit/) — for the wargame challenge itself -* [Cloudflare Workers Docs](https://developers.cloudflare.com/workers/) -* [OpenNext](https://opennext.js.org/) -* [Shadcn/UI](https://ui.shadcn.com) -* [Drizzle ORM](https://orm.drizzle.team) -* [Choose a License](https://choosealicense.com) -* [Img Shields](https://shields.io) -* [Contrib.rocks](https://contrib.rocks) +* [OverTheWire Bandit](https://overthewire.org/wargames/bandit/) - CTF wargame challenges +* [LangGraph.js](https://langchain-ai.github.io/langgraphjs/) - Agent orchestration framework +* [Cloudflare Workers](https://developers.cloudflare.com/workers/) - Edge compute platform +* [Fly.io](https://fly.io) - Global application hosting +* [OpenRouter](https://openrouter.ai) - Unified LLM API access +* [OpenNext](https://opennext.js.org/) - Next.js adapter for Cloudflare +* [shadcn/ui](https://ui.shadcn.com) - Beautiful component library +* [ANSI-to-HTML](https://github.com/rburns/ansi-to-html) - Terminal color rendering
@@ -304,12 +588,12 @@ Project Link: [https://git.biohazardvfx.com/Nicholai/bandit-runner](https://git. [Cloudflare-url]: https://developers.cloudflare.com/workers/ [OpenNext-badge]: https://img.shields.io/badge/OpenNext-18181B?style=for-the-badge&logo=vercel&logoColor=white [OpenNext-url]: https://opennext.js.org/ +[LangGraph-badge]: https://img.shields.io/badge/LangGraph-1C3C3C?style=for-the-badge&logo=langchain&logoColor=white +[LangGraph-url]: https://langchain-ai.github.io/langgraphjs/ [Shadcn-badge]: https://img.shields.io/badge/Shadcn%2FUI-000000?style=for-the-badge&logo=react&logoColor=white [Shadcn-url]: https://ui.shadcn.com [TypeScript-badge]: https://img.shields.io/badge/TypeScript-3178C6?style=for-the-badge&logo=typescript&logoColor=white [TypeScript-url]: https://www.typescriptlang.org/ -[Drizzle-badge]: https://img.shields.io/badge/Drizzle%20ORM-3E63DD?style=for-the-badge&logo=sqlite&logoColor=white -[Drizzle-url]: https://orm.drizzle.team [pnpm-badge]: https://img.shields.io/badge/pnpm-F69220?style=for-the-badge&logo=pnpm&logoColor=white [pnpm-url]: https://pnpm.io [conventional-commits-badge]: https://img.shields.io/badge/Conventional%20Commits-1.0.0-%23FE5196?style=for-the-badge&logo=conventionalcommits&logoColor=white diff --git a/bandit-runner-app/src/components/terminal-chat-interface.tsx b/bandit-runner-app/src/components/terminal-chat-interface.tsx index 34feaee..60085d9 100644 --- a/bandit-runner-app/src/components/terminal-chat-interface.tsx +++ b/bandit-runner-app/src/components/terminal-chat-interface.tsx @@ -74,8 +74,8 @@ export function TerminalChatInterface() { const [mounted, setMounted] = useState(false) const [manualMode, setManualMode] = useState(false) - const terminalEndRef = useRef(null) - const chatEndRef = useRef(null) + const terminalScrollRef = useRef(null) + const chatScrollRef = useRef(null) const terminalInputRef = useRef(null) const chatInputRef = useRef(null) @@ -119,11 +119,21 @@ export function TerminalChatInterface() 
{ }, []) useEffect(() => { - terminalEndRef.current?.scrollIntoView({ behavior: "smooth" }) + if (terminalScrollRef.current) { + const viewport = terminalScrollRef.current.querySelector('[data-radix-scroll-area-viewport]') + if (viewport) { + viewport.scrollTop = viewport.scrollHeight + } + } }, [wsTerminalLines]) useEffect(() => { - chatEndRef.current?.scrollIntoView({ behavior: "smooth" }) + if (chatScrollRef.current) { + const viewport = chatScrollRef.current.querySelector('[data-radix-scroll-area-viewport]') + if (viewport) { + viewport.scrollTop = viewport.scrollHeight + } + } }, [wsChatMessages]) const formatTimestamp = (date: Date) => { @@ -399,8 +409,8 @@ export function TerminalChatInterface() { {/* Terminal content */} - -
+ +
{wsTerminalLines.map((line, idx) => (
))} -
@@ -506,8 +515,8 @@ export function TerminalChatInterface() {
{/* Messages */} - -
+ +
{wsChatMessages.map((msg, idx) => (
@@ -551,7 +560,6 @@ export function TerminalChatInterface() {
)} -
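Both effects in this component change repeat the same steps: find the Radix scroll-area viewport inside the ref'd root and set `scrollTop` to `scrollHeight`. A sketch of factoring that into one helper is shown below. `scrollAreaToBottom` and the two structural interfaces are hypothetical names introduced so the logic can be exercised without a DOM; only the selector string is taken from the diff.

```typescript
// Minimal structural types so the helper can be tested with plain objects.
interface ScrollableViewport {
  scrollTop: number;
  scrollHeight: number;
}

interface ScrollAreaRoot {
  querySelector(selector: string): ScrollableViewport | null;
}

// Selector used in the diff to locate the Radix ScrollArea viewport.
const VIEWPORT_SELECTOR = "[data-radix-scroll-area-viewport]";

// Hypothetical helper: returns true when the viewport was found and scrolled.
function scrollAreaToBottom(root: ScrollAreaRoot | null): boolean {
  if (root === null) {
    return false;
  }
  const viewport = root.querySelector(VIEWPORT_SELECTOR);
  if (viewport === null) {
    return false;
  }
  viewport.scrollTop = viewport.scrollHeight;
  return true;
}
```

Under these assumptions, each effect body would reduce to a single call with the corresponding ref's current value.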
diff --git a/readme-improvement.plan.md b/readme-improvement.plan.md new file mode 100644 index 0000000..71ea194 --- /dev/null +++ b/readme-improvement.plan.md @@ -0,0 +1,468 @@ +# README Improvement Plan + +## Problem + +The current README is outdated and doesn't reflect the actual architecture or provide clear setup/deployment instructions. It references old concepts (D1, R2, mock setup) that aren't implemented, and doesn't explain the real LangGraph + SSH Proxy architecture that's currently working. + +## Goals + +1. **Clear Project Overview**: Explain what Bandit Runner actually does and why it matters +2. **Accurate Architecture**: Document the real system (Next.js → Cloudflare Workers → Durable Objects → SSH Proxy → Bandit Server) +3. **Step-by-Step Setup**: Complete local development setup instructions +4. **Deployment Guide**: Clear instructions for deploying all components +5. **Feature Documentation**: What the system actually does (model selection, real-time terminal, agent reasoning, etc.) +6. 
**Remove Outdated Content**: Clean up references to unimplemented features + +## Current Issues + +### Inaccurate Architecture Description +- README shows old architecture with D1/R2 that aren't implemented +- Doesn't mention LangGraph.js agent framework +- Doesn't explain SSH Proxy on Fly.io +- Missing Durable Object standalone worker details + +### Missing Prerequisites +- No mention of OpenRouter API key requirement +- Missing Fly.io account for SSH proxy +- No Node.js version specification (needs 20+) +- Missing pnpm workspace setup + +### Incomplete Setup Instructions +- Doesn't explain monorepo structure +- No `.env` file examples or required variables +- Missing SSH proxy deployment steps +- No explanation of DO worker vs main worker +- No Wrangler secrets configuration + +### Vague Usage Section +- Doesn't explain the actual UI (control panel, terminal, chat) +- No screenshots or feature descriptions +- Missing model selection details +- No explanation of manual mode + +### Outdated Roadmap +- Lists items as incomplete that are done +- Missing current features (WebSocket streaming, ANSI colors, etc.) + +## New README Structure + +```markdown +# Bandit Runner + +## About +- What it is: Autonomous AI agent testing framework +- What it does: Solves OverTheWire Bandit CTF challenges +- Why it matters: Real-world benchmark for LLM autonomous capabilities + +## Features +- 🤖 LangGraph.js autonomous agent with tool use +- 🔌 Real-time WebSocket streaming +- 🖥️ Full terminal output with ANSI colors +- 💬 Live agent reasoning/thinking display +- 🎯 OpenRouter model selection (100+ models) +- 🎨 Beautiful retro-terminal UI +- 🔒 SSH security boundaries (read-only Bandit server) +- 📊 Level progression tracking + +## Architecture + +### System Components +1. **Frontend** - Next.js 15 with App Router + shadcn/ui +2. **Backend** - Cloudflare Workers with OpenNext +3. **Coordinator** - Durable Objects (separate worker) +4. 
**Agent Runtime** - SSH Proxy on Fly.io (LangGraph.js) +5. **Target** - OverTheWire Bandit SSH server + +### Data Flow +Browser → WebSocket → Main Worker → DO Worker → HTTP → SSH Proxy → LangGraph Agent → SSH → Bandit Server + +### Tech Stack +- Next.js 15 + React 19 +- Cloudflare Workers + Durable Objects +- LangGraph.js for agent orchestration +- OpenRouter for LLM access +- Fly.io for SSH proxy hosting +- shadcn/ui for components +- ANSI-to-HTML for terminal rendering + +## Prerequisites + +### Required Accounts +- Cloudflare account (free tier OK, needs Durable Objects enabled) +- Fly.io account (free tier OK) +- OpenRouter account + API key + +### Required Software +- Node.js 20+ +- pnpm (package manager) +- Wrangler CLI (Cloudflare) +- flyctl CLI (Fly.io) +- Git + +## Installation + +### 1. Clone Repository +```bash +git clone https://git.biohazardvfx.com/Nicholai/bandit-runner.git +cd bandit-runner +``` + +### 2. Install Dependencies +```bash +# Install pnpm if you don't have it +npm install -g pnpm + +# Install all workspace dependencies +cd bandit-runner-app +pnpm install + +cd ../ssh-proxy +npm install +``` + +### 3. Configure Environment + +#### Cloudflare (Main Worker) +No `.env` needed - uses `wrangler.jsonc` vars + +#### Cloudflare (DO Worker) +```bash +cd bandit-runner-app/workers/bandit-agent-do +cp .env.example .env.local +# Add your OpenRouter API key +``` + +#### SSH Proxy (Fly.io) +No configuration needed - OpenRouter key passed via HTTP headers + +### 4. 
Local Development + +#### Option A: Run Frontend Only (with deployed backend) +```bash +cd bandit-runner-app +pnpm dev +# Open http://localhost:3000 +``` + +#### Option B: Run Full Stack Locally +```bash +# Terminal 1: SSH Proxy +cd ssh-proxy +npm run dev +# Runs on http://localhost:3001 + +# Terminal 2: Frontend (with local proxy) +cd bandit-runner-app +# Update wrangler.jsonc SSH_PROXY_URL to http://localhost:3001 +pnpm dev +``` + +## Deployment + +### Step 1: Deploy SSH Proxy to Fly.io + +```bash +cd ssh-proxy + +# Login to Fly.io (opens browser) +flyctl auth login + +# Deploy (first time creates app) +flyctl deploy + +# Note the URL: https://bandit-ssh-proxy.fly.dev +``` + +### Step 2: Deploy Durable Object Worker + +```bash +cd ../bandit-runner-app/workers/bandit-agent-do + +# Set your OpenRouter API key +wrangler secret put OPENROUTER_API_KEY +# Paste your key when prompted + +# Deploy the DO worker +wrangler deploy + +# Verify deployment +wrangler tail +``` + +### Step 3: Deploy Main Application + +```bash +cd ../.. # Back to bandit-runner-app root + +# Verify SSH_PROXY_URL in wrangler.jsonc points to your Fly.io URL +# Should be: "SSH_PROXY_URL": "https://bandit-ssh-proxy.fly.dev" + +# Build and deploy +pnpm run deploy + +# Your app will be live at: +# https://bandit-runner-app..workers.dev +``` + +### Step 4: Verify Deployment + +```bash +# Test SSH Proxy health +curl https://bandit-ssh-proxy.fly.dev/ssh/health + +# Should return: {"status":"ok","activeConnections":0} + +# Test main app +open https://bandit-runner-app..workers.dev +``` + +## Usage + +### Starting a Run + +1. Open the application in your browser +2. **Select Model**: Click the model dropdown to choose from 100+ models + - Search by name + - Filter by provider (OpenAI, Anthropic, Google, etc.) + - Filter by price range + - Filter by context length +3. **Set Target Level**: Choose how far the agent should attempt (default: 5) +4. 
**Select Output Mode**: + - **Selective** (default): Shows LLM reasoning + tool calls + - **Everything**: Shows reasoning + tool calls + full command outputs +5. **Click START** + +### Monitoring Progress + +The interface shows: +- **Left Panel (Terminal)**: Full SSH session output with colors +- **Right Panel (Agent Chat)**: LLM reasoning and discovered information +- **Top Status Bar**: Current status, level, and model info +- **Control Panel**: Pause/Resume/Stop controls + +### Manual Mode (Debugging) + +Toggle "Manual Mode" in the terminal to: +- Type commands directly into the SSH session +- Assist the agent when it gets stuck +- Debug connection issues + +⚠️ **Note**: Runs with manual intervention are disqualified from leaderboards + +## Features Explained + +### Real-Time Terminal +- Full PTY session capture +- ANSI color support (green prompts, red errors, etc.) +- Timestamps for all output +- Automatic scrolling +- Read-only by default (manual mode available) + +### Agent Reasoning Display +- LLM "thinking" messages (what command to try next) +- Planning output (strategy for current level) +- Password discovery announcements +- Level completion summaries + +### Model Selection +- 100+ models from OpenRouter +- Real-time search +- Filter by provider, price, context length +- Shows token costs +- Supports all major providers + +### Level Progression +- Always starts at Level 0 +- Automatic password extraction +- Automatic SSH re-login for next level +- Configurable target level cap +- Retry logic for failed attempts + +## Architecture Details + +### Monorepo Structure +``` +bandit-runner/ +├── bandit-runner-app/ # Next.js frontend + Cloudflare Workers +│ ├── src/ +│ │ ├── app/ # Next.js pages + API routes +│ │ ├── components/ # React components (shadcn/ui) +│ │ ├── hooks/ # React hooks (WebSocket) +│ │ └── lib/ # Utilities, agents, storage +│ ├── workers/ +│ │ └── bandit-agent-do/ # Standalone Durable Object worker +│ ├── scripts/ +│ │ └── patch-worker.js # 
WebSocket intercept injector +│ └── wrangler.jsonc # Main worker config +├── ssh-proxy/ # LangGraph agent on Fly.io +│ ├── agent.ts # LangGraph state machine +│ ├── server.ts # Express HTTP server +│ ├── Dockerfile # Container definition +│ └── fly.toml # Fly.io config +├── docs/ # Documentation +└── public/ # Assets +``` + +### WebSocket Intercept Pattern + +The main worker intercepts WebSocket upgrade requests **before** they reach Next.js: + +```javascript +// Injected by scripts/patch-worker.js +if (request.headers.get('Upgrade') === 'websocket') { + // Forward directly to DO, bypass Next.js + return env.BANDIT_AGENT.get(id).fetch(request); +} +``` + +This is necessary because Next.js API routes don't support WebSocket upgrades. + +### Durable Object Responsibilities + +- Accept WebSocket connections from browsers +- Maintain run state (level, passwords, status) +- Call SSH proxy HTTP endpoints +- Stream JSONL events from SSH proxy to WebSocket clients +- Handle START/PAUSE/RESUME/STOP actions + +### SSH Proxy Responsibilities + +- Run LangGraph.js agent state machine +- Maintain SSH2 connections to Bandit server +- Execute commands via PTY +- Extract passwords using regex +- Stream events back to DO via HTTP response body (JSONL) + +### Event Flow + +1. User clicks START in browser +2. Browser → WebSocket → Main Worker → DO Worker +3. DO → HTTP POST `/agent/run` → SSH Proxy +4. SSH Proxy → SSH Connection → Bandit Server +5. Agent executes commands, extracts passwords +6. SSH Proxy → JSONL events → DO +7. DO → WebSocket → Browser +8. 
UI updates in real-time + +## Troubleshooting + +### WebSocket Connection Failed +- Check DO worker is deployed: `wrangler deployments list` +- Verify DO binding in main worker config +- Check browser console for specific errors + +### SSH Proxy Not Responding +- Check Fly.io status: `flyctl status` +- View logs: `flyctl logs` +- Test health endpoint: `curl https://your-proxy.fly.dev/ssh/health` + +### Agent Not Starting +- Verify OpenRouter API key: `wrangler secret list` +- Check DO worker logs: `wrangler tail --name bandit-agent-do` +- Ensure SSH_PROXY_URL is correct in wrangler.jsonc + +### Commands Not Executing +- Check SSH proxy logs: `flyctl logs` +- Verify Bandit server is accessible: `ssh bandit0@bandit.labs.overthewire.org -p 2220` +- Review agent reasoning in chat panel for errors + +## Roadmap + +### Completed ✅ +- [x] LangGraph autonomous agent framework +- [x] Real-time WebSocket streaming +- [x] Full terminal output with ANSI colors +- [x] Agent reasoning display +- [x] OpenRouter model selection with search/filters +- [x] Level 0 → N progression +- [x] Manual mode for debugging +- [x] Password extraction and advancement + +### In Progress 🚧 +- [ ] Retry logic with exponential backoff +- [ ] Token usage and cost tracking UI +- [ ] Persistent run history (D1) +- [ ] Log storage (R2) + +### Planned 📋 +- [ ] Leaderboard (fastest completions) +- [ ] Multi-agent comparison mode +- [ ] Custom prompts and strategies +- [ ] Agent performance analytics +- [ ] Mock SSH server for testing +- [ ] Expanded CTF support (beyond Bandit) + +## Contributing + +Contributions welcome! Please: + +1. Fork the repository +2. Create a feature branch (`git checkout -b feat/amazing-feature`) +3. Commit with Conventional Commits (`feat:`, `fix:`, `docs:`) +4. Push and open a Pull Request + +See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed guidelines. 
+ +## License + +GNU GPLv3 - See [COPYING.txt](COPYING.txt) + +## Contact + +**Nicholai Vogel** +[Website](https://nicholai.work) • [LinkedIn](https://linkedin.com/in/nicholai-vogel) + +## Acknowledgments + +- [OverTheWire Bandit](https://overthewire.org/wargames/bandit/) - CTF challenges +- [LangGraph.js](https://langchain-ai.github.io/langgraphjs/) - Agent framework +- [Cloudflare Workers](https://developers.cloudflare.com/workers/) - Edge compute +- [Fly.io](https://fly.io) - Global app hosting +- [OpenRouter](https://openrouter.ai) - Unified LLM API +- [shadcn/ui](https://ui.shadcn.com) - Component library +- [OpenNext](https://opennext.js.org/) - Next.js on Workers +``` + +## Key Changes + +### What's Being Added +1. **Accurate architecture diagram** - Shows real components (DO, SSH Proxy, LangGraph) +2. **Complete prerequisite list** - All accounts, software, and setup requirements +3. **Monorepo structure docs** - Explains the workspace layout +4. **Step-by-step deployment** - 4 clear deployment phases +5. **Feature documentation** - What each UI component does +6. **Troubleshooting section** - Common issues and solutions +7. **Current roadmap** - Reflects actual completion status + +### What's Being Removed +1. Mock architecture references (D1, R2 unimplemented features) +2. Incorrect usage instructions +3. Outdated screenshots/descriptions +4. Placeholder content from template + +### What's Being Updated +1. Tech stack badges - Add LangGraph, Fly.io, OpenRouter +2. Built With section - Reflect actual dependencies +3. Installation steps - Real commands that work +4. Usage examples - Match actual UI +5. 
Contact links - Keep current + +## Implementation Notes + +- Keep existing badge links/formatting style +- Maintain table of contents structure +- Add new sections: Features, Architecture Details, Troubleshooting +- Update all code blocks with real, tested commands +- Add architecture ASCII diagram for clarity +- Include actual file paths and structure +- Reference real configuration files (wrangler.jsonc, etc.) + +## Success Criteria + +✅ README accurately describes working system +✅ User can deploy from scratch following instructions +✅ No references to unimplemented features +✅ Clear troubleshooting for common issues +✅ Architecture matches production code +✅ All commands are tested and accurate +
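The WebSocket intercept snippet in the plan compares the `Upgrade` header to the exact string `websocket`. RFC 6455 treats that value as case-insensitive, so a slightly more tolerant check may be worth considering. The sketch below is one possible form; `isWebSocketUpgrade` and the `HeaderReader` interface are hypothetical and only mirror the `headers.get(...)` call used in the plan.

```typescript
// Minimal interface matching the `headers.get(name)` call shown in the plan.
interface HeaderReader {
  get(name: string): string | null;
}

// Hypothetical helper: case-insensitive check for a WebSocket upgrade request.
function isWebSocketUpgrade(headers: HeaderReader): boolean {
  const value = headers.get("Upgrade");
  return value !== null && value.trim().toLowerCase() === "websocket";
}
```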