Bandit Runner
A deterministic AI testing rig for LLMs-as-agents — built on Next.js, OpenNext, and Cloudflare Workers.
Explore the docs » · View Demo · Report Bug · Request Feature
About The Project
Bandit Runner is a public, deterministic evaluation harness for large language models.
It transforms AI models into autonomous operators tasked with completing the OverTheWire Bandit wargame via SSH — entirely on Cloudflare Workers.
Why it matters
- Provides a real-world, hands-on benchmark for autonomous reasoning and command execution.
- Tests tool use (SSH), planning, error handling, and persistence under real network conditions.
- Generates reproducible, privacy-safe logs for research or public leaderboards.
Core Concepts
- Agent Role: Each run instantiates an LLM as “BanditRunner” — a scripted, deterministic persona following a strict system prompt and command allow-list.
- Environment: Next.js frontend + OpenNext build → Cloudflare Workers backend (Durable Objects + D1 + R2).
- Security: Hard-scoped to bandit.labs.overthewire.org:2220. All discovered passwords are redacted in logs and sealed in short-lived encrypted blobs.
- Goal: Advance from Level 0 → final level autonomously while documenting every decision.
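The scoping and redaction rules above can be illustrated with a minimal sketch. Every name here (`isAllowedTarget`, `isAllowedCommand`, `redact`) and the example allow-list are assumptions made for illustration; only the host and port come from this README.

```typescript
// Hypothetical sketch: function names and the allow-list contents are assumptions.
const ALLOWED_TARGET = { host: "bandit.labs.overthewire.org", port: 2220 } as const;

// Illustrative allow-list only; the project defines its own.
const ALLOWED_COMMANDS = new Set(["ls", "cat", "file", "find", "grep"]);

// Accept a connection only if it points at the single permitted host and port.
function isAllowedTarget(host: string, port: number): boolean {
  return host.toLowerCase() === ALLOWED_TARGET.host && port === ALLOWED_TARGET.port;
}

// Check only the program name (first token). A real check would also need to
// handle shell operators such as ";" or "|", which this sketch ignores.
function isAllowedCommand(commandLine: string): boolean {
  const [program] = commandLine.trim().split(/\s+/);
  return program !== undefined && ALLOWED_COMMANDS.has(program);
}

// Replace every occurrence of each known secret before a line is logged.
function redact(line: string, secrets: readonly string[]): string {
  let out = line;
  for (const secret of secrets) {
    if (secret.length === 0) continue; // an empty secret would match everywhere
    out = out.split(secret).join("[REDACTED]");
  }
  return out;
}
```

Keeping these checks as pure functions would make them easy to unit test without a live SSH session.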
Built With
Getting Started
Prerequisites
You need:
- Node.js ≥ 20
- pnpm
  npm i -g pnpm
- Wrangler 3 CLI
  npm i -g wrangler
- A Cloudflare account with access to:
  - Durable Objects
  - D1 Database
  - R2 Storage
Installation
- Clone the repo
  git clone https://git.biohazardvfx.com/Nicholai/bandit-runner.git
  cd bandit-runner
- Install dependencies
  pnpm install
- Copy and configure environment
  cp .env.example .env.local
- Build and run locally
  pnpm dev # or wrangler dev
- Deploy preview
  pnpm build
  wrangler deploy --env preview
Usage
Once deployed, visit /runs/new to start a new evaluation.
Provide a model endpoint (OpenAI, OpenRouter, or self-hosted) and initiate a Bandit Run.
Each run:
- Spawns a Durable Object → “Run Coordinator”
- Connects to bandit.labs.overthewire.org:2220
- Executes controlled ssh.connect / ssh.exec / ssh.close operations
- Streams JSONL logs and commentary to the Live Viewer
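As a rough illustration of the run steps above, the three tool operations and the JSONL log stream might be modelled as follows. The operation names come from this README; the field names (`runId`, `seq`, `note`) and both helper functions are assumptions, not the project's actual schema.

```typescript
// Hypothetical shapes: op names are from the README, all fields are assumptions.
type ToolCall =
  | { op: "ssh.connect"; level: number }
  | { op: "ssh.exec"; command: string }
  | { op: "ssh.close" };

interface LogEvent {
  runId: string;
  seq: number; // monotonically increasing position within the run
  call: ToolCall;
  note?: string; // optional model commentary
}

// JSONL: one JSON object per line, each line newline-terminated.
function toJsonl(events: readonly LogEvent[]): string {
  return events.map((e) => JSON.stringify(e) + "\n").join("");
}

// Parse JSONL back into events, skipping blank lines.
function parseJsonl(text: string): LogEvent[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as LogEvent);
}
```

Because each line is a complete JSON object, a viewer can render events as they arrive instead of waiting for the whole log.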
Developers can extend:
- Scoring rules (lib/scoring/verdicts.ts)
- Level validators (lib/scoring/validators.ts)
- Model interfaces (lib/ssh/tool-adapter.ts)
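To give a feel for what a level validator could look like, here is a minimal sketch. The `LevelAttempt`, `Verdict`, and `LevelValidator` types, the `makeFormatValidator` helper, and the example password pattern are all hypothetical; the real interfaces live in the files listed above and may differ.

```typescript
// Hypothetical types: not taken from lib/scoring, purely illustrative.
interface LevelAttempt {
  level: number;
  candidatePassword: string;
  commandsUsed: number;
}

interface Verdict {
  level: number;
  passed: boolean;
  reason: string;
}

type LevelValidator = (attempt: LevelAttempt) => Verdict;

// Build a validator that enforces a command cap and a password format.
// Pass a non-global RegExp: a /g pattern keeps state between test() calls.
function makeFormatValidator(pattern: RegExp, maxCommands: number): LevelValidator {
  return (attempt) => {
    if (attempt.commandsUsed > maxCommands) {
      return { level: attempt.level, passed: false, reason: "command cap exceeded" };
    }
    if (!pattern.test(attempt.candidatePassword)) {
      return { level: attempt.level, passed: false, reason: "candidate does not match expected format" };
    }
    return { level: attempt.level, passed: true, reason: "format ok" };
  };
}

// Assumed format for illustration only: 32 alphanumeric characters.
const validateLevel = makeFormatValidator(/^[A-Za-z0-9]{32}$/, 50);
```

A format check like this only screens candidates; confirming a password would still require logging in to the next level.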
Architecture
Next.js (App Router)
│
├── UI (Shadcn/UI)
│   ├─ LiveLog
│   └─ LevelCard
│
├── Edge API Routes (OpenNext)
│   ├─ /api/startRun
│   ├─ /api/toolInvoke
│   └─ /api/stream
│
└── Cloudflare Worker
    ├─ Durable Object: RunCoordinator
    │   ├─ TCP connect() to Bandit
    │   ├─ State machine (levels, caps, timers)
    │   └─ Writes logs → R2
    ├─ D1 (metadata)
    └─ R2 (artifacts)
See docs/ADR-001-architecture.md for the detailed decision record.
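The "state machine (levels, caps, timers)" inside the RunCoordinator could be sketched as a pure transition function. This is an assumption about its shape, not the actual implementation: the state fields, event names, and limits (`finalLevel`, `maxCommandsPerLevel`, `maxRunMs`) are all hypothetical.

```typescript
// Hypothetical state machine: every name and limit here is an assumption.
type RunStatus = "running" | "succeeded" | "failed";

interface RunState {
  level: number;
  commandsThisLevel: number;
  startedAtMs: number;
  status: RunStatus;
}

interface RunLimits {
  finalLevel: number; // the run succeeds on reaching this level
  maxCommandsPerLevel: number; // per-level command cap
  maxRunMs: number; // wall-clock budget for the whole run
}

type RunEvent =
  | { type: "command"; atMs: number }
  | { type: "levelSolved"; atMs: number };

// Pure transition: the same state and event always yield the same next state.
function step(state: RunState, event: RunEvent, limits: RunLimits): RunState {
  if (state.status !== "running") return state; // terminal states absorb events
  if (event.atMs - state.startedAtMs > limits.maxRunMs) {
    return { ...state, status: "failed" }; // timer expired
  }
  if (event.type === "command") {
    const commandsThisLevel = state.commandsThisLevel + 1;
    const status: RunStatus =
      commandsThisLevel > limits.maxCommandsPerLevel ? "failed" : "running";
    return { ...state, commandsThisLevel, status };
  }
  // levelSolved: advance and reset the per-level command counter.
  const level = state.level + 1;
  const status: RunStatus = level >= limits.finalLevel ? "succeeded" : "running";
  return { ...state, level, commandsThisLevel: 0, status };
}
```

Taking timestamps from the events rather than reading a clock inside `step` keeps transitions replayable from a stored log.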
Roadmap
- Core runner architecture
- JSONL log streaming
- SSH tool scaffolding
- Add live leaderboard
- Add mock SSH server for tests
- Expand scoring heuristics
- Implement model-agnostic adapter layer
- Public demo page
See the open issues for the full roadmap.
Contributing
Contributions are welcome.
- Fork the Project
- Create your Feature Branch (git checkout -b feat/amazing)
- Commit (pnpm commit) using Conventional Commits
- Push (git push origin feat/amazing)
- Open a Pull Request
License
Distributed under the GNU GPLv3 License.
See LICENSE for details.
Contact
Nicholai Vogel Website • LinkedIn • Instagram
Project Link: https://git.biohazardvfx.com/Nicholai/bandit-runner
Acknowledgments
- OverTheWire Bandit — for the wargame challenge itself
- Cloudflare Workers Docs
- OpenNext
- Shadcn/UI
- Drizzle ORM
- Choose a License
- Img Shields
- Contrib.rocks
