nicholai acd04dd6ac 🎉 BREAKTHROUGH: WebSocket working! Real-time streaming functional
 What's Working:
- WebSocket connections established (patched worker to intercept upgrades)
- Real-time event streaming: Agent → DO → Browser
- Terminal panel showing live command execution
- Agent chat panel showing LLM thoughts
- Full infrastructure: UI → API → DO → SSH Proxy → LangGraph Agent

🔧 Key Changes:
- Created standalone DO worker at workers/bandit-agent-do/
- Deployed DO as separate Worker (bandit-agent-do)
- Updated wrangler.jsonc to reference external DO via script_name
- Modified patch-worker.js to intercept WS upgrades before Next.js
- Added __name polyfill to fix esbuild helper
- Created pnpm workspace config for monorepo

📝 Architecture:
- Frontend (Next.js) → Cloudflare Worker
- Worker intercepts /api/agent/*/ws → forwards to DO
- DO (bandit-agent-do) → manages WebSocket connections
- DO → calls SSH Proxy API
- SSH Proxy → runs LangGraph agent → executes SSH commands
- Events stream back: SSH Proxy → DO → WebSocket → UI
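
The interception step above can be sketched as a small route matcher. This is only an illustration under assumptions: `matchAgentWebSocket` and `WS_ROUTE` are hypothetical names, and the actual logic in patch-worker.js is not shown here. The idea is that a request is forwarded to the DO only when it carries a WebSocket upgrade header and its path has the form /api/agent/*/ws.

```typescript
// Hypothetical sketch: decide whether a request should bypass Next.js and go to the DO.
// The path shape comes from the commit message; function and constant names are assumptions.
const WS_ROUTE = /^\/api\/agent\/([^/]+)\/ws$/;

function matchAgentWebSocket(
  pathname: string,
  upgradeHeader: string | null,
): string | null {
  // Only WebSocket upgrade requests are intercepted; everything else falls through.
  if ((upgradeHeader ?? "").toLowerCase() !== "websocket") return null;
  const match = WS_ROUTE.exec(pathname);
  // The captured segment identifies the run whose DO should receive the socket.
  return match?.[1] ?? null;
}
```

A `null` result means the request should be handled by the regular Next.js handler.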

🐛 Known Issue:
- Agent logic needs refinement (not parsing SSH output correctly)
- But core infrastructure is 100% functional!

This resolves all WebSocket and real-time streaming issues.
2025-10-09 15:10:16 -06:00

Bandit Runner

A deterministic AI testing rig for LLMs-as-agents — built on Next.js, OpenNext, and Cloudflare Workers.


Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Architecture
  5. Roadmap
  6. Contributing
  7. License
  8. Contact
  9. Acknowledgments

About The Project

Bandit Runner is a public, deterministic evaluation harness for large language models.
It transforms AI models into autonomous operators tasked with completing the OverTheWire Bandit wargame via SSH — entirely on Cloudflare Workers.

Why it matters

  • Provides a real-world, hands-on benchmark for autonomous reasoning and command execution.
  • Tests tool use (SSH), planning, error handling, and persistence under real network conditions.
  • Generates reproducible, privacy-safe logs for research or public leaderboards.

Core Concepts

  • Agent Role: Each run instantiates an LLM as “BanditRunner” — a scripted, deterministic persona following a strict system prompt and command allow-list.
  • Environment: Next.js frontend + OpenNext build → Cloudflare Workers backend (Durable Objects + D1 + R2).
  • Security: Hard-scoped to bandit.labs.overthewire.org:2220.
    All discovered passwords are redacted in logs and sealed in short-lived encrypted blobs.
  • Goal: Advance from Level 0 → final level autonomously while documenting every decision.
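
The security rules above (one allowed host and port, a command allow-list, and password redaction in logs) could look roughly like the following. This is a minimal sketch, not the project's implementation: the function names and the contents of `ALLOWED_COMMANDS` are assumptions made for illustration, and only the host and port come from this README.

```typescript
// Host and port are taken from the README; everything else here is a hypothetical sketch.
const ALLOWED_HOST = "bandit.labs.overthewire.org";
const ALLOWED_PORT = 2220;

// Illustrative allow-list only; the project's real list is not given in this README.
const ALLOWED_COMMANDS = new Set(["ls", "cat", "file", "find", "grep"]);

function isAllowedTarget(host: string, port: number): boolean {
  return host.toLowerCase() === ALLOWED_HOST && port === ALLOWED_PORT;
}

function isAllowedCommand(commandLine: string): boolean {
  const trimmed = commandLine.trim();
  if (trimmed === "") return false;
  // Reject shell metacharacters so an allowed binary cannot chain into another command.
  if (/[;&|`$<>\n]/.test(trimmed)) return false;
  const [binary = ""] = trimmed.split(/\s+/);
  return ALLOWED_COMMANDS.has(binary);
}

function redactSecrets(line: string, secrets: readonly string[]): string {
  let out = line;
  for (const secret of secrets) {
    if (secret.length === 0) continue; // an empty string would match everywhere
    out = out.split(secret).join("[REDACTED]");
  }
  return out;
}
```

Redaction here is a plain substring replacement over known secrets, which keeps the log format intact while removing the values themselves.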


Built With

  • Next.js
  • React
  • Cloudflare
  • OpenNext
  • Shadcn/UI
  • TypeScript
  • Drizzle ORM
  • pnpm


Getting Started

Prerequisites

You need:

  • Node.js ≥ 20

  • pnpm

    npm i -g pnpm
    
  • Wrangler 3 CLI

    npm i -g wrangler
    
  • A Cloudflare account with access to:

    • Durable Objects
    • D1 Database
    • R2 Storage

Installation

  1. Clone the repo

    git clone https://git.biohazardvfx.com/Nicholai/bandit-runner.git
    cd bandit-runner
    
  2. Install dependencies

    pnpm install
    
  3. Copy and configure environment

    cp .env.example .env.local
    
  4. Build and run locally

    pnpm dev
    # or
    wrangler dev
    
  5. Deploy preview

    pnpm build
    wrangler deploy --env preview
    


Usage

Once deployed, visit /runs/new to start a new evaluation. Provide a model endpoint (OpenAI, OpenRouter, or self-hosted) and start a Bandit run.

Each run:

  • Spawns a Durable Object that acts as the “Run Coordinator”
  • Connects to bandit.labs.overthewire.org:2220
  • Executes controlled ssh.connect / ssh.exec / ssh.close operations
  • Streams JSONL logs and commentary to the Live Viewer
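
One way to picture the JSONL log stream from the steps above is as one JSON object per line, one object per event. The sketch below is an assumption about shape only: the `RunEvent` fields and the helper names `toJsonl` and `parseJsonl` are hypothetical, while the `ssh.connect`, `ssh.exec`, and `ssh.close` operation names come from the list above.

```typescript
// Hypothetical event shapes; only the ssh.* operation names appear in the README.
type RunEvent =
  | { type: "ssh.connect"; ts: string; host: string; port: number }
  | { type: "ssh.exec"; ts: string; command: string; exitCode: number }
  | { type: "ssh.close"; ts: string }
  | { type: "commentary"; ts: string; text: string };

// Serialize events as JSONL: one compact JSON document per line, newline-terminated.
function toJsonl(events: readonly RunEvent[]): string {
  if (events.length === 0) return "";
  return events.map((event) => JSON.stringify(event)).join("\n") + "\n";
}

// Parse JSONL back into events, skipping blank lines.
function parseJsonl(text: string): RunEvent[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as RunEvent);
}
```

Because each line is a complete JSON document, a viewer can render events as they arrive without waiting for the run to finish.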

Developers can extend:

  • Scoring rules (lib/scoring/verdicts.ts)
  • Level validators (lib/scoring/validators.ts)
  • Model interfaces (lib/ssh/tool-adapter.ts)


Architecture

Next.js (App Router)
│
├── UI (Shadcn/UI)
│   ├─ LiveLog
│   └─ LevelCard
│
├── Edge API Routes (OpenNext)
│   ├─ /api/startRun
│   ├─ /api/toolInvoke
│   └─ /api/stream
│
└── Cloudflare Worker
    ├─ Durable Object: RunCoordinator
    │   ├─ TCP connect() to Bandit
    │   ├─ State machine (levels, caps, timers)
    │   └─ Writes logs → R2
    ├─ D1 (metadata)
    └─ R2 (artifacts)

See docs/ADR-001-architecture.md for the detailed decision record.
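
The "State machine (levels, caps, timers)" entry in the diagram can be illustrated with a small pure reducer. This is a sketch under stated assumptions, not the RunCoordinator's actual code: the `RunState`, `RunLimits`, and `RunAction` types, the `step` function, and the specific caps are all hypothetical.

```typescript
// Hypothetical types; the README only says the coordinator tracks levels, caps, and timers.
type RunStatus = "running" | "completed" | "failed";

interface RunState {
  level: number;
  commandsThisLevel: number;
  levelStartedAt: number; // milliseconds since epoch
  status: RunStatus;
}

interface RunLimits {
  finalLevel: number;
  maxCommandsPerLevel: number; // cap on commands per level
  maxLevelMs: number; // time budget per level
}

type RunAction =
  | { kind: "command"; at: number }
  | { kind: "levelSolved"; at: number };

function step(state: RunState, action: RunAction, limits: RunLimits): RunState {
  // Terminal states absorb all further actions.
  if (state.status !== "running") return state;
  // Timer: any action arriving after the level's time budget fails the run.
  if (action.at - state.levelStartedAt > limits.maxLevelMs) {
    return { ...state, status: "failed" };
  }
  if (action.kind === "command") {
    const used = state.commandsThisLevel + 1;
    // Cap: exceeding the per-level command budget fails the run.
    if (used > limits.maxCommandsPerLevel) return { ...state, status: "failed" };
    return { ...state, commandsThisLevel: used };
  }
  // levelSolved: advance, and reset the per-level counter and timer.
  const next = state.level + 1;
  if (next >= limits.finalLevel) return { ...state, level: next, status: "completed" };
  return { ...state, level: next, commandsThisLevel: 0, levelStartedAt: action.at };
}
```

Keeping the transition function pure makes a run replayable from its log, which fits the goal of deterministic, reproducible evaluation.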


Roadmap

  • Core runner architecture
  • JSONL log streaming
  • SSH tool scaffolding
  • Add live leaderboard
  • Add mock SSH server for tests
  • Expand scoring heuristics
  • Implement model-agnostic adapter layer
  • Public demo page

See the open issues for the full roadmap.


Contributing

Contributions are welcome.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feat/amazing)
  3. Commit (pnpm commit) using Conventional Commits
  4. Push (git push origin feat/amazing)
  5. Open a Pull Request


License

Distributed under the GNU GPLv3 License. See LICENSE for details.


Contact

Nicholai Vogel: Website · LinkedIn · Instagram

Project Link: https://git.biohazardvfx.com/Nicholai/bandit-runner


Acknowledgments


Description
LLM test rig that runs OverTheWire Bandit end-to-end on Cloudflare Workers, with a strict SSH tool, scoring, and public run logs.