bandit-runner

A deterministic AI testing rig for LLMs-as-agents — built on Next.js, OpenNext, and Cloudflare Workers.

Table of Contents

About The Project
- Core Concepts
- Built With
Getting Started
- Prerequisites
- Installation
Deployment
Usage
Architecture
Troubleshooting
Roadmap
Contributing
License
Contact
Acknowledgments

About The Project

Bandit Runner is an autonomous AI agent testing framework that evaluates large language models by having them solve the OverTheWire Bandit wargame challenges via SSH.

Why it matters

- Provides a real-world, hands-on benchmark for autonomous reasoning and command execution
- Tests tool use (SSH), planning, error handling, and persistence under real network conditions
- Generates reproducible logs for research and public leaderboards
- Demonstrates LLM capabilities in a sandboxed, deterministic environment

Features

🤖 LangGraph.js Autonomous Agent - State machine with tool use and planning
🔌 Real-Time WebSocket Streaming - Live updates with <50ms latency
🖥️ Full Terminal Output - Complete SSH session with ANSI colors and formatting
💬 Agent Reasoning Display - See the LLM's thinking process and command selection
🎯 OpenRouter Integration - Access 100+ models with search and filtering
🎨 Beautiful Retro UI - Terminal-inspired interface with shadcn/ui components
🔒 Security Boundaries - Sandboxed SSH access to read-only Bandit server
📊 Level Progression - Automatic password extraction and advancement
🛠️ Manual Mode - Optional human intervention for debugging


Built With

- Next.js - Frontend framework (v15 with App Router)
- React - UI library (v19)
- Cloudflare Workers - Edge compute + Durable Objects
- OpenNext - Next.js adapter for Cloudflare Workers
- LangGraph.js - Agent orchestration framework
- shadcn/ui - Component library
- TypeScript - Type safety
- pnpm - Package manager


Getting Started

Prerequisites

Required Accounts

- Cloudflare account (free tier works, needs Durable Objects enabled)
- Fly.io account (free tier works)
- OpenRouter account and API key

Required Software

- Node.js ≥ 20
- pnpm package manager
```
npm install -g pnpm
```
- Wrangler CLI (Cloudflare)
```
npm install -g wrangler
```
- flyctl CLI (Fly.io)
```
curl -L https://fly.io/install.sh | sh
```

Installation

1. Clone Repository

git clone https://git.biohazardvfx.com/Nicholai/bandit-runner.git
cd bandit-runner

2. Install Dependencies

# Frontend + Cloudflare Workers
cd bandit-runner-app
pnpm install

# SSH Proxy (LangGraph agent)
cd ../ssh-proxy
npm install

3. Configure Environment

Cloudflare DO Worker:

cd bandit-runner-app/workers/bandit-agent-do
# Create .env.local with your OpenRouter API key
echo "OPENROUTER_API_KEY=your_key_here" > .env.local

Main Worker:
No .env needed - configuration is in wrangler.jsonc

4. Local Development

Option A: Frontend Only (requires deployed backend)

cd bandit-runner-app
pnpm dev
# Open http://localhost:3000

Option B: Full Stack (all components local)

# Terminal 1: SSH Proxy
cd ssh-proxy
npm run dev
# Runs on http://localhost:3001

# Terminal 2: Main App
cd bandit-runner-app
# Update wrangler.jsonc: "SSH_PROXY_URL": "http://localhost:3001"
pnpm dev


Deployment

Step 1: Deploy SSH Proxy to Fly.io

cd ssh-proxy

# Login (opens browser)
flyctl auth login

# Deploy (creates app on first run)
flyctl deploy

# Note your URL: https://bandit-ssh-proxy.fly.dev

Step 2: Deploy Durable Object Worker

cd ../bandit-runner-app/workers/bandit-agent-do

# Set OpenRouter API key as secret
wrangler secret put OPENROUTER_API_KEY
# Paste your key when prompted

# Deploy
wrangler deploy

# Verify (optional)
wrangler tail

Step 3: Deploy Main Application

cd ../..  # Back to bandit-runner-app root

# Verify wrangler.jsonc has correct SSH_PROXY_URL
# Should be: "SSH_PROXY_URL": "https://bandit-ssh-proxy.fly.dev"

# Build and deploy
pnpm run deploy

# Your app is live at:
# https://bandit-runner-app.<your-subdomain>.workers.dev

Step 4: Verify Deployment

# Test SSH Proxy
curl https://bandit-ssh-proxy.fly.dev/ssh/health
# Should return: {"status":"ok","activeConnections":0}

# Open your app
open https://bandit-runner-app.<your-subdomain>.workers.dev
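The health check above can also be scripted, for example in a post-deploy smoke test. The following is a minimal sketch, assuming the response body always has the shape shown in the example (`status` and `activeConnections`); the helper names `parseProxyHealth` and `isProxyHealthy` are hypothetical and not part of the repository.

```typescript
// Hypothetical helpers: the response shape is taken from the example
// {"status":"ok","activeConnections":0} shown in the verification step.
interface ProxyHealth {
  status: string;
  activeConnections: number;
}

// Parse the raw body returned by GET /ssh/health.
// Returns null when the body is not JSON or lacks the expected fields.
function parseProxyHealth(body: string): ProxyHealth | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    return null; // e.g. an HTML error page instead of JSON
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const record = parsed as Record<string, unknown>;
  const status = record["status"];
  const activeConnections = record["activeConnections"];
  if (typeof status !== "string") return null;
  if (typeof activeConnections !== "number") return null;
  return { status, activeConnections };
}

// The proxy counts as healthy only when it reports status "ok".
function isProxyHealthy(body: string): boolean {
  const health = parseProxyHealth(body);
  return health !== null && health.status === "ok";
}
```

Passing the body returned by the `curl` command to `isProxyHealthy` gives a single boolean that a deploy script can act on.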


Usage

Starting a Run

1. Open the application in your browser
2. Select Model: Click the dropdown to choose from 100+ models
   - Search by name
   - Filter by provider (OpenAI, Anthropic, Google, Meta, etc.)
   - Filter by price range
   - Filter by context length
3. Set Target Level: Choose how far the agent should progress (default: 5)
4. Select Output Mode:
   - Selective (default): LLM reasoning + tool calls
   - Everything: Reasoning + tool calls + full command outputs
5. Click START
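The choices made in these steps amount to a small run configuration. The sketch below shows one way to model it, assuming field names (`modelId`, `targetLevel`, `outputMode`) that are illustrative only; the defaults (level 5, selective output) come from the steps above, but the application's actual types may differ.

```typescript
// Hypothetical types: field names are illustrative, not the app's actual API.
type OutputMode = "selective" | "everything";

interface RunConfig {
  modelId: string;        // model identifier picked in the dropdown
  targetLevel: number;    // how far the agent should progress
  outputMode: OutputMode; // how much detail is streamed to the UI
}

const DEFAULT_TARGET_LEVEL = 5; // default stated in the Usage steps
const DEFAULT_OUTPUT_MODE: OutputMode = "selective";

// Build a run configuration, filling in the documented defaults
// and rejecting obviously invalid input before a run is started.
function buildRunConfig(
  modelId: string,
  overrides: { targetLevel?: number; outputMode?: OutputMode } = {}
): RunConfig {
  if (modelId.trim() === "") {
    throw new Error("a model must be selected");
  }
  const targetLevel = overrides.targetLevel ?? DEFAULT_TARGET_LEVEL;
  if (!Number.isInteger(targetLevel) || targetLevel < 0) {
    throw new RangeError("targetLevel must be a non-negative integer");
  }
  return {
    modelId,
    targetLevel,
    outputMode: overrides.outputMode ?? DEFAULT_OUTPUT_MODE,
  };
}
```

Calling `buildRunConfig("provider/model")` with no overrides yields target level 5 and selective output, matching what the UI preselects.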

Monitoring Progress

The interface provides real-time feedback:

- Left Panel (Terminal): Full SSH session with ANSI colors and formatting
- Right Panel (Agent Chat): LLM reasoning, planning, and discoveries
- Top Bar: Current status, level, and active model
- Control Panel: Pause/Resume/Stop controls

Manual Mode (Debugging)

Toggle "Manual Mode" in the terminal footer to:

- Type commands directly into the SSH session
- Assist the agent when stuck
- Debug connection or logic issues

⚠️ Note: Manual intervention disqualifies runs from leaderboards


Architecture

System Overview

┌─────────────┐
│   Browser   │ WebSocket (wss://)
└──────┬──────┘
       │
       ↓
┌─────────────────────────────┐
│  Cloudflare Worker (Main)   │ Next.js + OpenNext
│  bandit-runner-app          │ Intercepts WebSocket upgrades
└──────┬──────────────────────┘
       │
       ↓
┌─────────────────────────────┐
│  Durable Object Worker      │ WebSocket connection manager
│  bandit-agent-do            │ Run state coordinator
└──────┬──────────────────────┘
       │ HTTP (JSONL streaming)
       ↓
┌─────────────────────────────┐
│  SSH Proxy (Fly.io)         │ LangGraph.js autonomous agent
│  bandit-ssh-proxy.fly.dev   │ SSH2 client
└──────┬──────────────────────┘
       │ SSH Protocol
       ↓
┌─────────────────────────────┐
│  Bandit SSH Server          │ CTF challenges
│  overthewire.org:2220       │ Real command execution
└─────────────────────────────┘

Component Responsibilities

Frontend (Next.js 15)

- React UI with shadcn/ui components
- Real-time terminal with ANSI rendering
- Agent chat display
- Model selection and configuration

Main Worker (Cloudflare)

- Serves Next.js application
- Intercepts WebSocket upgrades before Next.js routing
- Forwards WS connections to Durable Object
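The interception decision can be pictured as a pure check that runs before the request reaches Next.js. The following is a rough sketch only: the route shape `/api/agent/<runId>/ws` is an assumption (the troubleshooting section shows only `/api/agent/test-run/status`), and `matchWebSocketUpgrade` is a hypothetical name, not the code that `patch-worker.js` injects.

```typescript
// Minimal structural type so the sketch does not depend on a specific
// runtime's Request type; a real worker would read these from the Request.
interface IncomingRequest {
  method: string;
  pathname: string;
  upgradeHeader: string | null; // value of the "Upgrade" header, if any
}

// Assumed route shape: /api/agent/<runId>/ws (hypothetical).
const WS_ROUTE = /^\/api\/agent\/([^/]+)\/ws$/;

// Returns the run ID when the request is a WebSocket upgrade that should
// bypass Next.js routing and go to the Durable Object; otherwise null.
function matchWebSocketUpgrade(req: IncomingRequest): string | null {
  if (req.method !== "GET") return null;
  // Header values are case-insensitive for the Upgrade token.
  if ((req.upgradeHeader ?? "").toLowerCase() !== "websocket") return null;
  const match = WS_ROUTE.exec(req.pathname);
  return match?.[1] ?? null;
}
```

When the function returns a run ID, the worker would hand the request to the Durable Object for that run; any other request falls through to the Next.js handler unchanged.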

Durable Object (Separate Worker)

- Manages WebSocket connections from browsers
- Maintains run state (level, status, passwords)
- Calls SSH proxy HTTP endpoints
- Streams JSONL events to connected clients
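Because JSONL arrives over a streaming HTTP response, a chunk boundary can fall in the middle of a line. The sketch below shows one common way to handle that by buffering the unfinished tail; `JsonlDecoder` is a hypothetical class for illustration, and the event objects are treated as opaque since their schema is not documented here.

```typescript
// Incremental JSONL decoder: network chunks can split a line anywhere,
// so incomplete trailing text is buffered until its newline arrives.
class JsonlDecoder {
  private buffer = "";

  // Feed one decoded text chunk; returns every event completed by it.
  // Throws if a complete line is not valid JSON.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is whatever follows the final newline (possibly "").
    this.buffer = lines.pop() ?? "";
    const events: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue; // tolerate blank keep-alive lines
      events.push(JSON.parse(trimmed));
    }
    return events;
  }
}
```

Each array returned by `push` can then be forwarded to the connected WebSocket clients one event at a time.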

SSH Proxy (Fly.io)

- Runs LangGraph.js agent state machine
- Maintains SSH2 connections to Bandit server
- Executes commands via PTY (full terminal capture)
- Extracts passwords using regex patterns
- Streams events back to DO via HTTP response
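As an illustration of regex-based password extraction from PTY output, here is a minimal sketch. It assumes a level password is a standalone run of 32 alphanumeric characters, which is an assumption about the Bandit server rather than something this README states, and `extractPasswordCandidates` is a hypothetical helper; the patterns actually used in `agent.ts` may differ.

```typescript
// Assumption: a password is a standalone run of exactly 32 alphanumerics.
const PASSWORD_PATTERN = /\b[A-Za-z0-9]{32}\b/g;

// Matches ANSI CSI sequences such as colour codes (ESC [ ... letter).
const ANSI_PATTERN = /\u001b\[[0-9;]*[A-Za-z]/g;

// Return the distinct password-shaped tokens found in terminal output.
function extractPasswordCandidates(output: string): string[] {
  // PTY output carries ANSI sequences; strip them first so a trailing
  // letter of a colour code is not glued onto the token that follows it.
  const plain = output.replace(ANSI_PATTERN, "");
  const matches: string[] = plain.match(PASSWORD_PATTERN) ?? [];
  // Keep the first occurrence of each candidate, preserving order.
  return matches.filter((token, index) => matches.indexOf(token) === index);
}
```

The function returns candidates rather than a single answer, since ordinary output (hashes, file names) can also match the pattern and the agent still has to decide which one to try.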

Data Flow

1. User clicks START in browser
2. Browser establishes WebSocket to Main Worker
3. Worker forwards to Durable Object
4. DO sends HTTP POST to SSH Proxy /agent/run
5. SSH Proxy runs LangGraph agent
6. Agent connects via SSH to Bandit server
7. Commands executed, passwords extracted
8. Events stream: SSH Proxy → DO → WebSocket → Browser
9. UI updates in real time
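On the browser side, the last two steps of this flow can be thought of as folding a stream of events into view state. The sketch below is illustrative only: the event names, the status values other than `IDLE`, and the `applyEvent` function are all hypothetical, since the actual event schema is not documented in this README.

```typescript
// Hypothetical event and state shapes, for illustration only.
type RunStatus = "IDLE" | "RUNNING" | "PAUSED" | "STOPPED";

type AgentEvent =
  | { type: "status"; status: RunStatus }
  | { type: "terminal"; data: string }       // raw SSH output for the left panel
  | { type: "reasoning"; text: string }      // agent chat for the right panel
  | { type: "level_complete"; level: number };

interface RunView {
  status: RunStatus;
  level: number;
  terminal: string;
  chat: string[];
}

const initialView: RunView = { status: "IDLE", level: 0, terminal: "", chat: [] };

// Pure reducer: returns a new view and never mutates the previous one.
function applyEvent(view: RunView, event: AgentEvent): RunView {
  switch (event.type) {
    case "status":
      return { ...view, status: event.status };
    case "terminal":
      return { ...view, terminal: view.terminal + event.data };
    case "reasoning":
      return { ...view, chat: [...view.chat, event.text] };
    case "level_complete":
      // Finishing level N moves the run on to level N + 1.
      return { ...view, level: event.level + 1 };
  }
}
```

Keeping the update logic in a pure function like this makes it easy to replay a recorded event log and arrive at the same view, which fits the goal of reproducible runs.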

Monorepo Structure

bandit-runner/
├── bandit-runner-app/          # Frontend + Cloudflare Workers
│   ├── src/
│   │   ├── app/                # Next.js pages + API routes
│   │   ├── components/         # React components
│   │   ├── hooks/              # useAgentWebSocket
│   │   └── lib/                # Utilities, types
│   ├── workers/
│   │   └── bandit-agent-do/    # Standalone Durable Object
│   ├── scripts/
│   │   └── patch-worker.js     # Injects WebSocket intercept
│   └── wrangler.jsonc          # Main worker config
├── ssh-proxy/                   # LangGraph agent runtime
│   ├── agent.ts                # State machine definition
│   ├── server.ts               # Express HTTP server
│   ├── Dockerfile              # Container image
│   └── fly.toml                # Fly.io deployment config
└── docs/                        # Documentation


Troubleshooting

WebSocket Connection Failed

Symptoms: "WebSocket connection failed" or "NS_ERROR_WEBSOCKET_CONNECTION_REFUSED"

Solutions:

# Check DO worker is deployed
wrangler deployments list

# Check browser console for specific errors
# Verify DO binding in wrangler.jsonc

# Test DO directly
curl https://bandit-runner-app.<your-subdomain>.workers.dev/api/agent/test-run/status

SSH Proxy Not Responding

Symptoms: Agent starts but no terminal output appears

Solutions:

# Check Fly.io app status
flyctl status

# View real-time logs
flyctl logs

# Test health endpoint
curl https://bandit-ssh-proxy.fly.dev/ssh/health
# Should return: {"status":"ok","activeConnections":0}

# Restart if needed
flyctl restart

Agent Not Starting

Symptoms: Run status stays "IDLE" or errors in console

Solutions:

# Verify OpenRouter API key is set
# (run from bandit-runner-app/workers/bandit-agent-do, where the secret was created)
wrangler secret list

# Check DO worker logs
wrangler tail --name bandit-agent-do

# Ensure SSH_PROXY_URL is correct in wrangler.jsonc
# Should be: https://bandit-ssh-proxy.fly.dev (not http://)
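The `SSH_PROXY_URL` rule above (plain `http://` only for local development, `https://` once deployed) is easy to check mechanically. This is a minimal sketch with a hypothetical helper, `checkProxyUrl`; it is not part of the repository and only encodes the two cases this README mentions.

```typescript
type DeployTarget = "local" | "deployed";

// Returns a description of the problem, or null when the URL looks usable.
function checkProxyUrl(url: string, target: DeployTarget): string | null {
  const trimmed = url.trim();
  if (trimmed === "") {
    return "SSH_PROXY_URL is empty";
  }
  if (target === "deployed" && !trimmed.startsWith("https://")) {
    // Deployed workers should reach the Fly.io proxy over https, not http.
    return "SSH_PROXY_URL must start with https:// when deployed";
  }
  if (target === "local" && !/^https?:\/\//.test(trimmed)) {
    // Local development may use http://localhost:3001.
    return "SSH_PROXY_URL must start with http:// or https://";
  }
  return null;
}
```

For example, `checkProxyUrl("http://localhost:3001", "local")` returns `null`, while the same `http://` scheme with the `"deployed"` target returns an error message.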

Commands Not Executing

Symptoms: Agent thinks but no SSH commands run

Solutions:

# Check SSH proxy logs
flyctl logs

# Verify Bandit server is accessible
ssh bandit0@bandit.labs.overthewire.org -p 2220
# Password: bandit0

# Review agent reasoning in chat panel
# Look for error messages or failed attempts

Build or Deploy Errors

Common issues:

# "Cannot find module" during build
# (run from the repository root)
cd bandit-runner-app && pnpm install
cd ../ssh-proxy && npm install

# "Account ID not found"
# Add account_id to workers/bandit-agent-do/wrangler.toml

# "Script not found: bandit-agent-do"
# Deploy DO worker first:
# (run from the repository root)
cd bandit-runner-app/workers/bandit-agent-do && wrangler deploy
