Bandit Runner

A deterministic AI testing rig for LLMs-as-agents — built on Next.js, OpenNext, and Cloudflare Workers.
Explore the docs »

View Demo · Report Bug · Request Feature


Table of Contents
  1. About The Project
  2. Getting Started
  3. Deployment
  4. Usage
  5. Architecture
  6. Troubleshooting
  7. Roadmap
  8. Contributing
  9. License
  10. Contact
  11. Acknowledgments

About The Project

Bandit Runner is an autonomous AI agent testing framework that evaluates large language models by having them solve the OverTheWire Bandit wargame challenges via SSH.

Why it matters

  • Provides a real-world, hands-on benchmark for autonomous reasoning and command execution
  • Tests tool use (SSH), planning, error handling, and persistence under real network conditions
  • Generates reproducible logs for research and public leaderboards
  • Demonstrates LLM capabilities in a sandboxed, deterministic environment

Features

  • 🤖 LangGraph.js Autonomous Agent - State machine with tool use and planning
  • 🔌 Real-Time WebSocket Streaming - Live updates with <50ms latency
  • 🖥️ Full Terminal Output - Complete SSH session with ANSI colors and formatting
  • 💬 Agent Reasoning Display - See the LLM's thinking process and command selection
  • 🎯 OpenRouter Integration - Access 100+ models with search and filtering
  • 🎨 Beautiful Retro UI - Terminal-inspired interface with shadcn/ui components
  • 🔒 Security Boundaries - Sandboxed SSH access to read-only Bandit server
  • 📊 Level Progression - Automatic password extraction and advancement
  • 🛠️ Manual Mode - Optional human intervention for debugging

(back to top)


Built With

  • Next.js - Frontend framework (v15 with App Router)
  • React - UI library (v19)
  • Cloudflare - Edge compute + Durable Objects
  • OpenNext - Next.js adapter for Cloudflare Workers
  • LangGraph - Agent orchestration framework
  • Shadcn/UI - Component library
  • TypeScript - Type safety
  • pnpm - Package manager

(back to top)


Getting Started

Prerequisites

Required Accounts

  • Cloudflare account (free tier works, needs Durable Objects enabled)
  • Fly.io account (free tier works)
  • OpenRouter account + API key (create one at openrouter.ai)

Required Software

  • Node.js ≥ 20
  • pnpm package manager
    npm install -g pnpm
    
  • Wrangler CLI (Cloudflare)
    npm install -g wrangler
    
  • flyctl CLI (Fly.io)
    curl -L https://fly.io/install.sh | sh
    

Installation

1. Clone Repository

git clone https://git.biohazardvfx.com/Nicholai/bandit-runner.git
cd bandit-runner

2. Install Dependencies

# Frontend + Cloudflare Workers
cd bandit-runner-app
pnpm install

# SSH Proxy (LangGraph agent)
cd ../ssh-proxy
npm install

3. Configure Environment

Cloudflare DO Worker:

cd bandit-runner-app/workers/bandit-agent-do
# Create .env.local with your OpenRouter API key
echo "OPENROUTER_API_KEY=your_key_here" > .env.local

Main Worker:
No .env needed - configuration is in wrangler.jsonc

4. Local Development

Option A: Frontend Only (requires deployed backend)

cd bandit-runner-app
pnpm dev
# Open http://localhost:3000

Option B: Full Stack (all components local)

# Terminal 1: SSH Proxy
cd ssh-proxy
npm run dev
# Runs on http://localhost:3001

# Terminal 2: Main App
cd bandit-runner-app
# Update wrangler.jsonc: "SSH_PROXY_URL": "http://localhost:3001"
pnpm dev

(back to top)


Deployment

Step 1: Deploy SSH Proxy to Fly.io

cd ssh-proxy

# Login (opens browser)
flyctl auth login

# Deploy (creates app on first run)
flyctl deploy

# Note your URL: https://bandit-ssh-proxy.fly.dev

Step 2: Deploy Durable Object Worker

cd ../bandit-runner-app/workers/bandit-agent-do

# Set OpenRouter API key as secret
wrangler secret put OPENROUTER_API_KEY
# Paste your key when prompted

# Deploy
wrangler deploy

# Verify (optional)
wrangler tail

Step 3: Deploy Main Application

cd ../..  # Back to bandit-runner-app root

# Verify wrangler.jsonc has correct SSH_PROXY_URL
# Should be: "SSH_PROXY_URL": "https://bandit-ssh-proxy.fly.dev"

# Build and deploy
pnpm run deploy

# Your app is live at:
# https://bandit-runner-app.<your-subdomain>.workers.dev

Step 4: Verify Deployment

# Test SSH Proxy
curl https://bandit-ssh-proxy.fly.dev/ssh/health
# Should return: {"status":"ok","activeConnections":0}

# Open your app (macOS; use xdg-open on Linux)
open https://bandit-runner-app.<your-subdomain>.workers.dev
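
The health check can also be scripted, for example in a post-deploy step. The sketch below relies only on the response shape shown above; `HealthStatus` and `parseHealth` are hypothetical names, not part of the project.

```typescript
// Shape of the health payload, inferred from the sample response above.
interface HealthStatus {
  status: string;
  activeConnections: number;
}

// Parse a response body and return the payload only if it looks healthy.
// Returns null for malformed JSON, a non-"ok" status, or an invalid counter.
function parseHealth(body: string): HealthStatus | null {
  let data: unknown;
  try {
    data = JSON.parse(body);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const record = data as Record<string, unknown>;
  if (record.status !== "ok") return null;
  const count = record.activeConnections;
  if (typeof count !== "number" || !Number.isInteger(count) || count < 0) {
    return null;
  }
  return { status: "ok", activeConnections: count };
}
```

Pass it the text of the response from the /ssh/health endpoint; a null result means the proxy should be inspected with flyctl logs.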

(back to top)


Usage

Starting a Run

  1. Open the application in your browser
  2. Select Model: Click the dropdown to choose from 100+ models
    • Search by name
    • Filter by provider (OpenAI, Anthropic, Google, Meta, etc.)
    • Filter by price range
    • Filter by context length
  3. Set Target Level: Choose how far the agent should progress (default: 5)
  4. Select Output Mode:
    • Selective (default): LLM reasoning + tool calls
    • Everything: Reasoning + tool calls + full command outputs
  5. Click START
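
The options chosen in these steps can be thought of as a small configuration object. The following is a minimal sketch with the documented defaults (target level 5, Selective output); `RunConfig`, `OutputMode`, and `buildRunConfig` are hypothetical names, and the real payload sent by the UI may be shaped differently.

```typescript
// Output modes as described in the steps above.
type OutputMode = "selective" | "everything";

interface RunConfig {
  model: string;       // model identifier picked in the dropdown
  targetLevel: number; // highest Bandit level the agent should reach
  outputMode: OutputMode;
}

// Fill in the documented defaults: target level 5, "selective" output.
function buildRunConfig(
  model: string,
  overrides: Partial<Omit<RunConfig, "model">> = {},
): RunConfig {
  const config: RunConfig = {
    model,
    targetLevel: overrides.targetLevel ?? 5,
    outputMode: overrides.outputMode ?? "selective",
  };
  if (!Number.isInteger(config.targetLevel) || config.targetLevel < 0) {
    throw new RangeError("targetLevel must be a non-negative integer");
  }
  return config;
}
```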

Monitoring Progress

The interface provides real-time feedback:

  • Left Panel (Terminal): Full SSH session with ANSI colors and formatting
  • Right Panel (Agent Chat): LLM reasoning, planning, and discoveries
  • Top Bar: Current status, level, and active model
  • Control Panel: Pause/Resume/Stop controls

Manual Mode (Debugging)

Toggle "Manual Mode" in the terminal footer to:

  • Type commands directly into the SSH session
  • Assist the agent when stuck
  • Debug connection or logic issues

⚠️ Note: Manual intervention disqualifies runs from leaderboards

(back to top)


Architecture

System Overview

┌─────────────┐
│   Browser   │ WebSocket (wss://)
└──────┬──────┘
       │
       ↓
┌─────────────────────────────┐
│  Cloudflare Worker (Main)   │ Next.js + OpenNext
│  bandit-runner-app          │ Intercepts WebSocket upgrades
└──────┬──────────────────────┘
       │
       ↓
┌─────────────────────────────┐
│  Durable Object Worker      │ WebSocket connection manager
│  bandit-agent-do            │ Run state coordinator
└──────┬──────────────────────┘
       │ HTTP (JSONL streaming)
       ↓
┌─────────────────────────────┐
│  SSH Proxy (Fly.io)         │ LangGraph.js autonomous agent
│  bandit-ssh-proxy.fly.dev   │ SSH2 client
└──────┬──────────────────────┘
       │ SSH Protocol
       ↓
┌─────────────────────────────┐
│  Bandit SSH Server          │ CTF challenges
│  overthewire.org:2220       │ Real command execution
└─────────────────────────────┘

Component Responsibilities

Frontend (Next.js 15)

  • React UI with shadcn/ui components
  • Real-time terminal with ANSI rendering
  • Agent chat display
  • Model selection and configuration

Main Worker (Cloudflare)

  • Serves Next.js application
  • Intercepts WebSocket upgrades before Next.js routing
  • Forwards WS connections to Durable Object
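
The intercept step boils down to a routing decision made before Next.js sees the request. The sketch below shows that decision as a pure function. The `/api/agent/<runId>/ws` path is an assumption modeled on the status URL used in Troubleshooting, and `IncomingRequest`, `Route`, and `routeRequest` are hypothetical names; the actual patch injected by scripts/patch-worker.js may work differently.

```typescript
// Minimal view of an incoming request; a real Worker receives a Request object.
interface IncomingRequest {
  method: string;
  path: string;
  upgradeHeader: string | null; // value of the "Upgrade" header, if any
}

type Route =
  | { target: "durable-object"; runId: string }
  | { target: "nextjs" };

// Hypothetical WebSocket path; the project's real route may differ.
const WS_PATH = /^\/api\/agent\/([^\/]+)\/ws$/;

// Send WebSocket upgrades to the Durable Object, everything else to Next.js.
function routeRequest(req: IncomingRequest): Route {
  const isUpgrade =
    req.method === "GET" &&
    req.upgradeHeader !== null &&
    req.upgradeHeader.toLowerCase() === "websocket";
  const match = WS_PATH.exec(req.path);
  const runId = match ? match[1] : undefined;
  if (isUpgrade && runId) {
    return { target: "durable-object", runId };
  }
  return { target: "nextjs" };
}
```

Keeping the decision separate from the forwarding code makes it easy to test without a Workers runtime.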

Durable Object (Separate Worker)

  • Manages WebSocket connections from browsers
  • Maintains run state (level, status, passwords)
  • Calls SSH proxy HTTP endpoints
  • Streams JSONL events to connected clients

SSH Proxy (Fly.io)

  • Runs LangGraph.js agent state machine
  • Maintains SSH2 connections to Bandit server
  • Executes commands via PTY (full terminal capture)
  • Extracts passwords using regex patterns
  • Streams events back to DO via HTTP response
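
As an illustration of regex-based password extraction, here is a minimal sketch. It assumes a password is a standalone run of exactly 32 letters or digits, which this README does not state and which does not cover the well-known level 0 password; `PASSWORD_PATTERN` and `extractPasswordCandidates` are hypothetical names, and the patterns in agent.ts may differ.

```typescript
// Assumption: a password is a standalone run of exactly 32 letters or digits.
const PASSWORD_PATTERN = /\b[A-Za-z0-9]{32}\b/g;

// Return every distinct candidate found in a chunk of terminal output.
function extractPasswordCandidates(output: string): string[] {
  // Strip ANSI color sequences first so they cannot interfere with a match.
  const plain = output.replace(/\x1b\[[0-9;]*m/g, "");
  const matches: string[] = plain.match(PASSWORD_PATTERN) ?? [];
  // Drop duplicates while preserving first-seen order.
  return matches.filter((value, index) => matches.indexOf(value) === index);
}
```

Candidates still need to be confirmed by logging in to the next level, since other 32-character tokens (hashes, for example) match the same pattern.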

Data Flow

  1. User clicks START in browser
  2. Browser establishes WebSocket to Main Worker
  3. Worker forwards to Durable Object
  4. DO sends HTTP POST to SSH Proxy /agent/run
  5. SSH Proxy runs LangGraph agent
  6. Agent connects via SSH to Bandit server
  7. Commands executed, passwords extracted
  8. Events stream: SSH Proxy → DO → WebSocket → Browser
  9. UI updates in real-time
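
Step 8 depends on splitting a streamed HTTP body into JSONL events, and network chunks rarely end on a line boundary. The following sketch shows one way to buffer partial lines. `AgentEvent`, `JsonlDecoder`, and the event type names in the usage example are hypothetical; the project's real event schema is not shown in this README.

```typescript
// Hypothetical event shape: only a string "type" field is assumed.
interface AgentEvent {
  type: string;
  [key: string]: unknown;
}

// Buffers partial chunks and emits one parsed event per complete line.
class JsonlDecoder {
  private buffer = "";

  // Feed a chunk of text; returns the events completed by this chunk.
  push(chunk: string): AgentEvent[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or ""); keep it for later.
    this.buffer = lines.pop() ?? "";
    const events: AgentEvent[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        const parsed: unknown = JSON.parse(trimmed);
        if (
          typeof parsed === "object" &&
          parsed !== null &&
          typeof (parsed as AgentEvent).type === "string"
        ) {
          events.push(parsed as AgentEvent);
        }
      } catch {
        // Skip malformed lines rather than aborting the whole stream.
      }
    }
    return events;
  }
}

// Usage: an event split across two chunks is emitted once it is complete.
const decoder = new JsonlDecoder();
const firstBatch = decoder.push('{"type":"thinking","text":"ls"}\n{"type":"ter');
const secondBatch = decoder.push('minal","data":"ok"}\n');
```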

Monorepo Structure

bandit-runner/
├── bandit-runner-app/          # Frontend + Cloudflare Workers
│   ├── src/
│   │   ├── app/                # Next.js pages + API routes
│   │   ├── components/         # React components
│   │   ├── hooks/              # useAgentWebSocket
│   │   └── lib/                # Utilities, types
│   ├── workers/
│   │   └── bandit-agent-do/    # Standalone Durable Object
│   ├── scripts/
│   │   └── patch-worker.js     # Injects WebSocket intercept
│   └── wrangler.jsonc          # Main worker config
├── ssh-proxy/                   # LangGraph agent runtime
│   ├── agent.ts                # State machine definition
│   ├── server.ts               # Express HTTP server
│   ├── Dockerfile              # Container image
│   └── fly.toml                # Fly.io deployment config
└── docs/                        # Documentation

(back to top)


Troubleshooting

WebSocket Connection Failed

Symptoms: "WebSocket connection failed" or "NS_ERROR_WEBSOCKET_CONNECTION_REFUSED"

Solutions:

# Check DO worker is deployed
wrangler deployments list

# Check browser console for specific errors
# Verify DO binding in wrangler.jsonc

# Test DO directly
curl https://bandit-runner-app.<your-subdomain>.workers.dev/api/agent/test-run/status

SSH Proxy Not Responding

Symptoms: Agent starts but no terminal output appears

Solutions:

# Check Fly.io app status
flyctl status

# View real-time logs
flyctl logs

# Test health endpoint
curl https://bandit-ssh-proxy.fly.dev/ssh/health
# Should return: {"status":"ok","activeConnections":0}

# Restart if needed
flyctl restart

Agent Not Starting

Symptoms: Run status stays "IDLE" or errors in console

Solutions:

# Verify OpenRouter API key is set
wrangler secret list

# Check DO worker logs
wrangler tail bandit-agent-do

# Ensure SSH_PROXY_URL is correct in wrangler.jsonc
# Should be: https://bandit-ssh-proxy.fly.dev (not http://)
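
The URL rule above (https:// for a deployed proxy, http:// only for local development) can be checked mechanically. This is a minimal sketch with a hypothetical helper name, `checkProxyUrl`; it is not part of the project.

```typescript
// Returns a list of problems with an SSH_PROXY_URL value; empty means it looks fine.
function checkProxyUrl(value: string): string[] {
  // Capture scheme and host; optional port and path are allowed but ignored.
  const match = /^(https?):\/\/([^\/:]+)(:\d+)?(\/.*)?$/.exec(value);
  if (!match) {
    return ["expected a full http:// or https:// URL"];
  }
  const scheme = match[1];
  const host = match[2];
  const isLocal = host === "localhost" || host === "127.0.0.1";
  if (scheme === "http" && !isLocal) {
    return ["deployed proxies should use https://, not http://"];
  }
  return [];
}
```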

Commands Not Executing

Symptoms: Agent thinks but no SSH commands run

Solutions:

# Check SSH proxy logs
flyctl logs

# Verify Bandit server is accessible
ssh bandit0@bandit.labs.overthewire.org -p 2220
# Password: bandit0

# Review agent reasoning in chat panel
# Look for error messages or failed attempts

Build or Deploy Errors

Common issues:

# "Cannot find module" during build (run from the repository root)
(cd bandit-runner-app && pnpm install)
(cd ssh-proxy && npm install)

# "Account ID not found"
# Add account_id to workers/bandit-agent-do/wrangler.toml

# "Script not found: bandit-agent-do"
# Deploy DO worker first (run from bandit-runner-app):
cd workers/bandit-agent-do && wrangler deploy

(back to top)


Roadmap

Completed

  • LangGraph.js autonomous agent framework
  • Real-time WebSocket streaming (JSONL events)
  • Full terminal output with ANSI colors
  • Agent reasoning and thinking display
  • OpenRouter model selection with search/filters
  • Level 0 → N automatic progression
  • Password extraction and advancement
  • Manual mode for debugging
  • Durable Object state management

In Progress 🚧

  • Retry logic with exponential backoff
  • Token usage and cost tracking UI
  • Persistent run history (D1 database)
  • Log storage and analysis (R2 bucket)

Planned 📋

  • Public leaderboard (fastest completions)
  • Multi-agent comparison mode
  • Custom system prompts and strategies
  • Agent performance analytics dashboard
  • Mock SSH server for testing
  • Expanded CTF support (Natas, Krypton, etc.)
  • Run replay and debugging tools

See the open issues for discussion and feature requests.

(back to top)


Contributing

Contributions are welcome.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feat/amazing)
  3. Commit (pnpm commit) using Conventional Commits
  4. Push (git push origin feat/amazing)
  5. Open a Pull Request

(back to top)


License

Distributed under the GNU GPLv3 License. See LICENSE for details.

(back to top)


Contact

Nicholai Vogel: Website · LinkedIn · Instagram

Project Link: https://git.biohazardvfx.com/Nicholai/bandit-runner

(back to top)


Acknowledgments

(back to top)


Description
LLM test rig that runs OverTheWire Bandit end-to-end on Cloudflare Workers, with a strict SSH tool, scoring, and public run logs.