bandit-runner

Nicholai/bandit-runner

nicholai bb346b2988

updated system prompt

2025-10-09 01:44:03 -06:00

.gitea

initialized repository

2025-10-09 01:39:24 -06:00

bandit-runner-app

initialized repository

2025-10-09 01:39:24 -06:00

docs

updated system prompt

2025-10-09 01:44:03 -06:00

public

initialized repository

2025-10-09 01:39:24 -06:00

scripts

initialized repository

2025-10-09 01:39:24 -06:00

.env.example

initialized repository

2025-10-09 01:39:24 -06:00

.gitignore

initialized repository

2025-10-09 01:39:24 -06:00

CONTRIBUTING.md

initialized repository

2025-10-09 01:39:24 -06:00

COPYING.txt

initialized repository

2025-10-09 01:39:24 -06:00

README.md

initialized repository

2025-10-09 01:39:24 -06:00

Bandit Runner

A deterministic AI testing rig for LLMs-as-agents — built on Next.js, OpenNext, and Cloudflare Workers.

Table of Contents

About The Project
- Core Concepts
- Built With
Getting Started
- Prerequisites
- Installation
Usage
Architecture
Roadmap
Contributing
License
Contact
Acknowledgments

About The Project

Bandit Runner is a public, deterministic evaluation harness for large language models.
It transforms AI models into autonomous operators tasked with completing the OverTheWire Bandit wargame via SSH — entirely on Cloudflare Workers.

Why it matters

Provides a real-world, hands-on benchmark for autonomous reasoning and command execution.
Tests tool use (SSH), planning, error handling, and persistence under real network conditions.
Generates reproducible, privacy-safe logs for research or public leaderboards.

Core Concepts

Agent Role: Each run instantiates an LLM as “BanditRunner” — a scripted, deterministic persona following a strict system prompt and command allow-list.
Environment: Next.js frontend + OpenNext build → Cloudflare Workers backend (Durable Objects + D1 + R2).
Security: Hard-scoped to bandit.labs.overthewire.org:2220. All discovered passwords are redacted in logs and sealed in short-lived encrypted blobs.
Goal: Advance from Level 0 → final level autonomously while documenting every decision.
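
The Security and Agent Role constraints above can be pictured as three small guards: a target check, a command allow-list, and log redaction. The following is a minimal sketch under stated assumptions, not the project's actual code. The function names, the contents of `ALLOWED_COMMANDS`, and the `[REDACTED]` marker are all hypothetical; only the host and port come from this README.

```typescript
// Hypothetical sketch: only the host and port below are taken from the README.
const ALLOWED_HOST = "bandit.labs.overthewire.org";
const ALLOWED_PORT = 2220;

// Illustrative allow-list (an assumption); the real list is defined by the project.
const ALLOWED_COMMANDS = new Set(["ls", "cat", "cd", "file", "find", "grep", "sort", "uniq"]);

// Only the single Bandit endpoint is a valid SSH target.
function isTargetAllowed(host: string, port: number): boolean {
  return host === ALLOWED_HOST && port === ALLOWED_PORT;
}

function isCommandAllowed(commandLine: string): boolean {
  // Reject command chaining (; &) and command substitution (` and $().
  if (/[;&`]|\$\(/.test(commandLine)) return false;
  // Every stage of a pipeline must start with an allow-listed binary.
  return commandLine.split("|").every((stage) => {
    const binary = stage.trim().split(/\s+/)[0] ?? "";
    return ALLOWED_COMMANDS.has(binary);
  });
}

// Replace every occurrence of a known password before a line is logged.
function redactSecrets(line: string, knownPasswords: string[]): string {
  let out = line;
  for (const password of knownPasswords) {
    if (password.length === 0) continue; // an empty string would match everywhere
    out = out.split(password).join("[REDACTED]");
  }
  return out;
}
```

Checking every pipeline stage rather than only the first token matters here because many shell tasks are naturally solved with pipes, and a first-token check alone would let an unlisted binary run as a later stage.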


Built With

Next.js
OpenNext
Cloudflare Workers (Durable Objects, D1, R2)
Shadcn/UI

Getting Started

Prerequisites

You need:

Node.js ≥ 20
pnpm
```
npm i -g pnpm
```
Wrangler 3 CLI
```
npm i -g wrangler
```
A Cloudflare account with access to:
- Durable Objects
- D1 Database
- R2 Storage

Installation

Clone the repo

```
git clone https://git.biohazardvfx.com/Nicholai/bandit-runner.git
cd bandit-runner
```

Install dependencies
```
pnpm install
```
Copy and configure environment
```
cp .env.example .env.local
```
Build and run locally
```
pnpm dev
# or
wrangler dev
```

Deploy preview

```
pnpm build
wrangler deploy --env preview
```


Usage

Once deployed, visit /runs/new to start a new evaluation. Provide a model endpoint (OpenAI, OpenRouter, or self-hosted) and initiate a Bandit Run.

Each run:

Spawns a Durable Object → “Run Coordinator”
Connects to bandit.labs.overthewire.org:2220
Executes controlled ssh.connect / ssh.exec / ssh.close operations
Streams JSONL logs and commentary to the Live Viewer
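
To make the per-run steps above concrete, here is a rough sketch of how the three SSH operations and the JSONL log stream could be typed. The tool names `ssh.connect`, `ssh.exec`, and `ssh.close` come from this README; every field name and the `LogEvent` shape are assumptions made for illustration only.

```typescript
// Hypothetical shapes: only the three tool names are taken from the README.
type ToolCall =
  | { tool: "ssh.connect"; host: string; port: number; username: string }
  | { tool: "ssh.exec"; command: string }
  | { tool: "ssh.close" };

interface LogEvent {
  runId: string;
  seq: number; // monotonically increasing position within the run
  level: number; // Bandit level the agent was working on
  call: ToolCall;
  commentary?: string;
}

// JSONL is one JSON object per line, so each event can be written as soon as it exists.
function toJsonl(events: LogEvent[]): string {
  return events.map((event) => JSON.stringify(event)).join("\n") + "\n";
}

// Parse a JSONL document back into events, skipping blank lines.
function parseJsonl(text: string): LogEvent[] {
  return text
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as LogEvent);
}

const sampleRun: LogEvent[] = [
  { runId: "run-1", seq: 0, level: 0, call: { tool: "ssh.connect", host: "bandit.labs.overthewire.org", port: 2220, username: "bandit0" } },
  { runId: "run-1", seq: 1, level: 0, call: { tool: "ssh.exec", command: "ls" }, commentary: "List the home directory." },
  { runId: "run-1", seq: 2, level: 0, call: { tool: "ssh.close" } },
];
```

A discriminated union on `tool` lets the compiler reject, for example, an `ssh.exec` call without a `command`, which is one plausible way to keep tool invocations controlled.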

Developers can extend:

Scoring rules (lib/scoring/verdicts.ts)
Level validators (lib/scoring/validators.ts)
Model interfaces (lib/ssh/tool-adapter.ts)
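
As an illustration of what extending the scoring layer might involve, the sketch below shows a per-level validator registry with a default rule. The contents of `lib/scoring/validators.ts` are not shown in this README, so every type and function here (`LevelAttempt`, `Verdict`, `registerValidator`, `validate`) and the command budget of 20 are hypothetical.

```typescript
// Hypothetical types: the real validator interface is not documented in this README.
type Verdict = "pass" | "fail";

interface LevelAttempt {
  level: number;
  loginSucceeded: boolean; // whether the discovered password opened the next level
  commandsUsed: number;
}

type LevelValidator = (attempt: LevelAttempt) => Verdict;

// Default rule: a level passes if the login to the next level succeeded.
const defaultValidator: LevelValidator = (attempt) =>
  attempt.loginSucceeded ? "pass" : "fail";

const validators = new Map<number, LevelValidator>();

function registerValidator(level: number, validator: LevelValidator): void {
  validators.set(level, validator);
}

// Use the level-specific validator when one is registered, else the default.
function validate(attempt: LevelAttempt): Verdict {
  const validator = validators.get(attempt.level) ?? defaultValidator;
  return validator(attempt);
}

// Example extension: level 5 also enforces an illustrative command budget.
registerValidator(5, (attempt) =>
  attempt.loginSucceeded && attempt.commandsUsed <= 20 ? "pass" : "fail",
);
```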


Architecture

```
Next.js (App Router)
│
├── UI (Shadcn/UI)
│   ├─ LiveLog
│   └─ LevelCard
│
├── Edge API Routes (OpenNext)
│   ├─ /api/startRun
│   ├─ /api/toolInvoke
│   └─ /api/stream
│
└── Cloudflare Worker
    ├─ Durable Object: RunCoordinator
    │   ├─ TCP connect() to Bandit
    │   ├─ State machine (levels, caps, timers)
    │   └─ Writes logs → R2
    ├─ D1 (metadata)
    └─ R2 (artifacts)
```

See docs/ADR-001-architecture.md for the detailed decision record.
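
The RunCoordinator's "State machine (levels, caps, timers)" could be modelled as a pure transition function, which keeps it testable outside a Durable Object. This is a sketch under assumptions, not the design recorded in the ADR: the status names, event types, and limit fields are all hypothetical, and timestamps are passed in explicitly so that replaying the same events yields the same state.

```typescript
// Hypothetical model: status names, events, and limits are assumptions.
type RunStatus = "running" | "completed" | "capped" | "timed_out";

interface RunState {
  status: RunStatus;
  level: number;
  commandsUsed: number;
  startedAtMs: number;
}

interface RunLimits {
  finalLevel: number; // level at which the run counts as completed
  maxCommands: number; // cap on executed commands
  maxDurationMs: number; // wall-clock budget for the whole run
}

type RunEvent =
  | { type: "command_executed"; atMs: number }
  | { type: "level_solved"; atMs: number };

function step(state: RunState, event: RunEvent, limits: RunLimits): RunState {
  // Terminal states absorb all further events.
  if (state.status !== "running") return state;
  // Timer: any event arriving after the budget ends the run.
  if (event.atMs - state.startedAtMs > limits.maxDurationMs) {
    return { ...state, status: "timed_out" };
  }
  switch (event.type) {
    case "command_executed": {
      const commandsUsed = state.commandsUsed + 1;
      const status = commandsUsed >= limits.maxCommands ? "capped" : "running";
      return { ...state, commandsUsed, status };
    }
    case "level_solved": {
      const level = state.level + 1;
      const status = level >= limits.finalLevel ? "completed" : "running";
      return { ...state, level, status };
    }
  }
}
```

In a Durable Object, a function like this would typically be called from the request handler with the stored state, and the returned state persisted before responding.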