nicholai bb346b2988
updated system prompt
2025-10-09 01:44:03 -06:00




Bandit Runner

A deterministic AI testing rig for LLMs-as-agents — built on Next.js, OpenNext, and Cloudflare Workers.
Explore the docs »

View Demo · Report Bug · Request Feature


Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Architecture
  5. Roadmap
  6. Contributing
  7. License
  8. Contact
  9. Acknowledgments

About The Project


Bandit Runner is a public, deterministic evaluation harness for large language models.
It runs each model as an autonomous operator that must complete the OverTheWire Bandit wargame over SSH, with the entire run hosted on Cloudflare Workers.

Why it matters

  • Provides a real-world, hands-on benchmark for autonomous reasoning and command execution.
  • Tests tool use (SSH), planning, error handling, and persistence under real network conditions.
  • Generates reproducible, privacy-safe logs for research or public leaderboards.

Core Concepts

  • Agent Role: Each run instantiates an LLM as “BanditRunner” — a scripted, deterministic persona following a strict system prompt and command allow-list.
  • Environment: Next.js frontend + OpenNext build → Cloudflare Workers backend (Durable Objects + D1 + R2).
  • Security: Hard-scoped to bandit.labs.overthewire.org:2220.
    All discovered passwords are redacted in logs and sealed in short-lived encrypted blobs.
  • Goal: Advance from Level 0 → final level autonomously while documenting every decision.
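
The Security and Agent Role points above can be illustrated with a minimal sketch. Everything named here (`ALLOWED_TARGET`, `ALLOWED_COMMANDS`, `isAllowedTarget`, `isAllowedCommand`, `redactSecrets`) is hypothetical and not taken from the project source. The command list and the assumption that passwords are 32-character alphanumeric tokens are illustrative only.

```typescript
// Hypothetical sketch: none of these names come from the project source.

// The only SSH target a run may reach, as stated in the README.
const ALLOWED_TARGET = { host: "bandit.labs.overthewire.org", port: 2220 } as const;

// Assumed allow-list; the real list would live in the project's own config.
const ALLOWED_COMMANDS = new Set([
  "ls", "cat", "file", "find", "grep", "sort", "uniq", "strings", "base64", "tr",
]);

function isAllowedTarget(host: string, port: number): boolean {
  return host.toLowerCase() === ALLOWED_TARGET.host && port === ALLOWED_TARGET.port;
}

function isAllowedCommand(commandLine: string): boolean {
  // Reject metacharacters that could chain, substitute, or redirect commands.
  if (/[;&`$<>\n]/.test(commandLine)) return false;
  // Every pipeline stage must start with an allow-listed binary.
  return commandLine.split("|").every((stage) => {
    const binary = stage.trim().split(/\s+/)[0];
    return binary !== undefined && ALLOWED_COMMANDS.has(binary);
  });
}

// Assumption: a discovered password is a standalone 32-character alphanumeric token.
function redactSecrets(line: string): string {
  return line.replace(/\b[A-Za-z0-9]{32}\b/g, "[REDACTED]");
}
```

A check like this would run before any command reaches the SSH session, and redaction would run before any output is written to a log. Pattern-based redaction is a coarse filter, so a real implementation would probably also track the exact secrets it has seen.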

(back to top)


Built With

  • Next.js
  • React
  • Cloudflare
  • OpenNext
  • Shadcn/UI
  • TypeScript
  • Drizzle ORM
  • pnpm

(back to top)


Getting Started

Prerequisites

You need:

  • Node.js ≥ 20

  • pnpm

    npm i -g pnpm
    
  • Wrangler 3 CLI

    npm i -g wrangler
    
  • A Cloudflare account with access to:

    • Durable Objects
    • D1 Database
    • R2 Storage

Installation

  1. Clone the repo

    git clone https://git.biohazardvfx.com/Nicholai/bandit-runner.git
    cd bandit-runner
    
  2. Install dependencies

    pnpm install
    
  3. Copy and configure environment

    cp .env.example .env.local
    
  4. Build and run locally

    pnpm dev
    # or
    wrangler dev
    
  5. Deploy preview

    pnpm build
    wrangler deploy --env preview
    

(back to top)


Usage

Once deployed, visit /runs/new to start a new evaluation. Provide a model endpoint (OpenAI, OpenRouter, or self-hosted) and initiate a Bandit Run.

Each run:

  • Spawns a Durable Object → “Run Coordinator”
  • Connects to bandit.labs.overthewire.org:2220
  • Executes controlled ssh.connect / ssh.exec / ssh.close operations
  • Streams JSONL logs and commentary to the Live Viewer
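
As a rough illustration of the run loop above, the three SSH operations can be modelled as a discriminated union and serialized one event per line. The field names (`runId`, `seq`, `level`, `call`) and the helpers `toJsonl` and `parseJsonl` are assumptions made for this sketch, not the project's actual log schema.

```typescript
// Hypothetical shapes: the project's real tool and log schema may differ.
type ToolCall =
  | { tool: "ssh.connect"; host: string; port: number; username: string }
  | { tool: "ssh.exec"; command: string }
  | { tool: "ssh.close" };

interface LogEvent {
  runId: string;
  seq: number;   // monotonically increasing per run, so replays keep their order
  level: number; // Bandit level the agent was working on
  call: ToolCall;
}

// JSONL: one JSON document per line, newline-terminated.
function toJsonl(events: LogEvent[]): string {
  return events.map((e) => JSON.stringify(e)).join("\n") + "\n";
}

function parseJsonl(text: string): LogEvent[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as LogEvent);
}
```

Because each line is a complete JSON document, a viewer can render events as they arrive without waiting for the run to finish.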

Developers can extend:

  • Scoring rules (lib/scoring/verdicts.ts)
  • Level validators (lib/scoring/validators.ts)
  • Model interfaces (lib/ssh/tool-adapter.ts)
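
To give a sense of what a level validator extension might look like, here is a small sketch. The `Verdict` and `LevelValidator` types, the registry functions, and the 32-character password check are all hypothetical. The actual contents of lib/scoring/validators.ts are not shown in this README.

```typescript
// Hypothetical extension point: not the actual contents of lib/scoring/validators.ts.
type Verdict = { status: "pass" } | { status: "fail"; reason: string };
type LevelValidator = (candidate: string) => Verdict;

const validators = new Map<number, LevelValidator>();

function registerValidator(level: number, validator: LevelValidator): void {
  validators.set(level, validator);
}

function validate(level: number, candidate: string): Verdict {
  const validator = validators.get(level);
  return validator
    ? validator(candidate)
    : { status: "fail", reason: `no validator registered for level ${level}` };
}

// Assumed shape check: a candidate password is a 32-character alphanumeric token.
const looksLikePassword: LevelValidator = (candidate) =>
  /^[A-Za-z0-9]{32}$/.test(candidate.trim())
    ? { status: "pass" }
    : { status: "fail", reason: "expected a 32-character alphanumeric token" };

registerValidator(0, looksLikePassword);
```

Returning a reason with every failure keeps verdicts explainable in the run log.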

(back to top)


Architecture

Next.js (App Router)
│
├── UI (Shadcn/UI)
│   ├─ LiveLog
│   └─ LevelCard
│
├── Edge API Routes (OpenNext)
│   ├─ /api/startRun
│   ├─ /api/toolInvoke
│   └─ /api/stream
│
└── Cloudflare Worker
    ├─ Durable Object: RunCoordinator
    │   ├─ TCP connect() to Bandit
    │   ├─ State machine (levels, caps, timers)
    │   └─ Writes logs → R2
    ├─ D1 (metadata)
    └─ R2 (artifacts)

See docs/ADR-001-architecture.md for the detailed decision record.
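
The "State machine (levels, caps, timers)" node in the diagram could be sketched as a pure reducer. This is only an assumption about its shape: `RunState`, `Caps`, `RunEvent`, and `step` are illustrative names, and the caps shown are invented for the example. Timestamps are passed in as data rather than read from a clock, which keeps transitions reproducible.

```typescript
// Hypothetical reducer: illustrates the idea, not the RunCoordinator's real code.
interface RunState {
  level: number;
  commandsThisLevel: number;
  levelStartedAt: number; // milliseconds, supplied by the caller
  status: "running" | "completed" | "failed";
  failure?: string;
}

interface Caps {
  maxCommandsPerLevel: number;
  maxMsPerLevel: number;
  finalLevel: number;
}

type RunEvent =
  | { type: "command"; at: number }
  | { type: "levelSolved"; at: number };

function step(state: RunState, event: RunEvent, caps: Caps): RunState {
  // Terminal states absorb all further events.
  if (state.status !== "running") return state;
  // Timer cap: measured from the start of the current level.
  if (event.at - state.levelStartedAt > caps.maxMsPerLevel) {
    return { ...state, status: "failed", failure: "time cap exceeded" };
  }
  switch (event.type) {
    case "command": {
      const used = state.commandsThisLevel + 1;
      return used > caps.maxCommandsPerLevel
        ? { ...state, commandsThisLevel: used, status: "failed", failure: "command cap exceeded" }
        : { ...state, commandsThisLevel: used };
    }
    case "levelSolved": {
      const next = state.level + 1;
      return next >= caps.finalLevel
        ? { ...state, level: next, status: "completed" }
        : { level: next, commandsThisLevel: 0, levelStartedAt: event.at, status: "running" };
    }
  }
}
```

Since `step` has no side effects, the same event sequence always produces the same final state, which fits the determinism goal stated earlier.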

(back to top)


Roadmap

  • Core runner architecture
  • JSONL log streaming
  • SSH tool scaffolding
  • Add live leaderboard
  • Add mock SSH server for tests
  • Expand scoring heuristics
  • Implement model-agnostic adapter layer
  • Public demo page

See the open issues for the full roadmap.

(back to top)


Contributing

Contributions are welcome.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feat/amazing)
  3. Commit (pnpm commit) using Conventional Commits
  4. Push (git push origin feat/amazing)
  5. Open a Pull Request

Top Contributors

Contributors

(back to top)


License

Distributed under the GNU GPLv3 License. See LICENSE for details.

(back to top)


Contact

Nicholai Vogel · Website · LinkedIn · Instagram

Project Link: https://git.biohazardvfx.com/Nicholai/bandit-runner

(back to top)


Acknowledgments

(back to top)

