bandit-runner


nicholai acd04dd6ac 🎉 BREAKTHROUGH: WebSocket working! Real-time streaming functional

✅ What's Working:
- WebSocket connections established (patched worker to intercept upgrades)
- Real-time event streaming: Agent → DO → Browser
- Terminal panel showing live command execution
- Agent chat panel showing LLM thoughts
- Full infrastructure: UI → API → DO → SSH Proxy → LangGraph Agent

🔧 Key Changes:
- Created standalone DO worker at workers/bandit-agent-do/
- Deployed DO as separate Worker (bandit-agent-do)
- Updated wrangler.jsonc to reference external DO via script_name
- Modified patch-worker.js to intercept WS upgrades before Next.js
- Added __name polyfill to fix esbuild helper
- Created pnpm workspace config for monorepo

📝 Architecture:
- Frontend (Next.js) → Cloudflare Worker
- Worker intercepts /api/agent/*/ws → forwards to DO
- DO (bandit-agent-do) → manages WebSocket connections
- DO → calls SSH Proxy API
- SSH Proxy → runs LangGraph agent → executes SSH commands
- Events stream back: SSH Proxy → DO → WebSocket → UI

🐛 Known Issue:
- Agent logic needs refinement (not parsing SSH output correctly)
- But core infrastructure is 100% functional!

This resolves all WebSocket and real-time streaming issues.

2025-10-09 15:10:16 -06:00

.gitea

initialized repository

2025-10-09 01:39:24 -06:00

bandit-runner-app

🎉 BREAKTHROUGH: WebSocket working! Real-time streaming functional

2025-10-09 15:10:16 -06:00

docs

updated system prompt

2025-10-09 01:44:03 -06:00

public

initialized repository

2025-10-09 01:39:24 -06:00

scripts

initialized repository

2025-10-09 01:39:24 -06:00

ssh-proxy

Fix __name polyfill - app now loads without errors

2025-10-09 14:27:03 -06:00

.env.example

initialized repository

2025-10-09 01:39:24 -06:00

.gitignore

feat: redesign terminal UI with theme support and retro aesthetic

2025-10-09 04:00:19 -06:00

BROWSER-TEST-REPORT.md

Fix __name polyfill - app now loads without errors

2025-10-09 14:27:03 -06:00

CONTRIBUTING.md

initialized repository

2025-10-09 01:39:24 -06:00

COPYING.txt

initialized repository

2025-10-09 01:39:24 -06:00

CORE-FUNCTIONALITY-STATUS.md

Fix __name polyfill - app now loads without errors

2025-10-09 14:27:03 -06:00

DEBUGGING-GUIDE.md

Fix __name polyfill - app now loads without errors

2025-10-09 14:27:03 -06:00

DURABLE-OBJECT-SETUP.md

feat: implement LangGraph.js agentic framework with OpenRouter integration

2025-10-09 07:03:29 -06:00

FINAL-STATUS.md

feat: implement LangGraph.js agentic framework with OpenRouter integration

2025-10-09 07:03:29 -06:00

IMPLEMENTATION-COMPLETE.md

feat: implement LangGraph.js agentic framework with OpenRouter integration

2025-10-09 07:03:29 -06:00

IMPLEMENTATION-FINAL.md

feat: implement LangGraph.js agentic framework with OpenRouter integration

2025-10-09 07:03:29 -06:00

IMPLEMENTATION-SUMMARY.md

feat: implement LangGraph.js agentic framework with OpenRouter integration

2025-10-09 07:03:29 -06:00

QUICK-START.md

feat: implement LangGraph.js agentic framework with OpenRouter integration

2025-10-09 07:03:29 -06:00

README.md

initialized repository

2025-10-09 01:39:24 -06:00

SSH-PROXY-README.md

feat: implement LangGraph.js agentic framework with OpenRouter integration

2025-10-09 07:03:29 -06:00

TESTING-GUIDE.md

feat: implement LangGraph.js agentic framework with OpenRouter integration

2025-10-09 07:03:29 -06:00

UI-ENHANCEMENTS-SUMMARY.md

Fix __name polyfill - app now loads without errors

2025-10-09 14:27:03 -06:00

WEBSOCKET-DEBUG-STATUS.md

🎉 BREAKTHROUGH: WebSocket working! Real-time streaming functional

2025-10-09 15:10:16 -06:00

README.md

Bandit Runner

A deterministic AI testing rig for LLMs-as-agents — built on Next.js, OpenNext, and Cloudflare Workers.


Table of Contents

About The Project
- Core Concepts
- Built With
Getting Started
- Prerequisites
- Installation
Usage
Architecture
Roadmap
Contributing
License
Contact
Acknowledgments

About The Project

Bandit Runner is a public, deterministic evaluation harness for large language models.
It transforms AI models into autonomous operators tasked with completing the OverTheWire Bandit wargame via SSH — entirely on Cloudflare Workers.

Why it matters

- Provides a real-world, hands-on benchmark for autonomous reasoning and command execution.
- Tests tool use (SSH), planning, error handling, and persistence under real network conditions.
- Generates reproducible, privacy-safe logs for research or public leaderboards.

Core Concepts

- Agent Role: Each run instantiates an LLM as “BanditRunner”, a scripted, deterministic persona following a strict system prompt and command allow-list.
- Environment: Next.js frontend + OpenNext build → Cloudflare Workers backend (Durable Objects + D1 + R2).
- Security: Hard-scoped to bandit.labs.overthewire.org:2220. All discovered passwords are redacted in logs and sealed in short-lived encrypted blobs.
- Goal: Advance from Level 0 → final level autonomously while documenting every decision.
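
The scoping, allow-list, and redaction ideas above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function names, the example allow-list, and the `[REDACTED]` marker are all assumptions. Only the host and port come from this README.

```typescript
// Illustrative sketch only: every name and the example allow-list are assumptions.
const ALLOWED_HOST = "bandit.labs.overthewire.org";
const ALLOWED_PORT = 2220;

// Hypothetical allow-list; the project's real list is not shown in this README.
const ALLOWED_COMMANDS = new Set(["ls", "cat", "cd", "file", "find", "grep", "sort", "uniq"]);

/** True only for the single host and port the harness is scoped to. */
function isAllowedTarget(host: string, port: number): boolean {
  return host === ALLOWED_HOST && port === ALLOWED_PORT;
}

/** Checks the first word of every pipeline stage against the allow-list. */
function isAllowedCommand(commandLine: string): boolean {
  // Reject chaining, substitution, and redirection outright.
  if (/[;&`$<>]/.test(commandLine)) return false;
  return commandLine.split("|").every((stage) => {
    const binary = stage.trim().split(/\s+/)[0] ?? "";
    return ALLOWED_COMMANDS.has(binary);
  });
}

/** Replaces every occurrence of each known secret before a line is logged. */
function redactSecrets(line: string, secrets: string[]): string {
  return secrets
    .filter((secret) => secret.length > 0)
    .reduce((out, secret) => out.split(secret).join("[REDACTED]"), line);
}
```

Checking each pipeline stage separately keeps commands such as `sort data.txt | uniq -u` usable while still rejecting anything whose first word is not on the list.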


Built With


Getting Started

Prerequisites

You need:

Node.js ≥ 20
pnpm
```
npm i -g pnpm
```
Wrangler 3 CLI
```
npm i -g wrangler
```
A Cloudflare account with access to:
- Durable Objects
- D1 Database
- R2 Storage

Installation

Clone the repo
```
git clone https://git.biohazardvfx.com/Nicholai/bandit-runner.git
cd bandit-runner
```

Install dependencies
```
pnpm install
```
Copy and configure environment
```
cp .env.example .env.local
```
Build and run locally
```
pnpm dev
# or
wrangler dev
```

Deploy preview
```
pnpm build
wrangler deploy --env preview
```


Usage

Once deployed, visit /runs/new to start a new evaluation. Provide a model endpoint (OpenAI, OpenRouter, or self-hosted) and initiate a Bandit Run.

Each run:

- Spawns a Durable Object → “Run Coordinator”
- Connects to bandit.labs.overthewire.org:2220
- Executes controlled ssh.connect / ssh.exec / ssh.close operations
- Streams JSONL logs and commentary to the Live Viewer
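
To make the per-run flow above concrete, here is a hedged sketch of how the three tool operations could be recorded as JSONL, with one JSON object per line. The operation names `ssh.connect`, `ssh.exec`, and `ssh.close` come from this README. The event shape, field names, and helper functions are assumptions for illustration only.

```typescript
// Illustrative sketch only: the event shape and field names are assumptions.
type ToolCall =
  | { tool: "ssh.connect"; host: string; port: number; user: string }
  | { tool: "ssh.exec"; command: string }
  | { tool: "ssh.close" };

interface LogEvent {
  runId: string;
  seq: number; // monotonically increasing position within the run
  ts: string;  // ISO 8601 timestamp supplied by the caller
  call: ToolCall;
}

/** Serializes events as JSONL: one compact JSON object per line. */
function toJsonl(events: LogEvent[]): string {
  return events.map((event) => JSON.stringify(event) + "\n").join("");
}

/** Parses JSONL back into events, skipping blank lines. */
function parseJsonl(text: string): LogEvent[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as LogEvent);
}
```

Because each line is a complete JSON object, a viewer can append and render events as they arrive without waiting for the run to finish.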

Developers can extend:

- Scoring rules (lib/scoring/verdicts.ts)
- Level validators (lib/scoring/validators.ts)
- Model interfaces (lib/ssh/tool-adapter.ts)


Architecture

```
Next.js (App Router)
│
├── UI (Shadcn/UI)
│   ├─ LiveLog
│   └─ LevelCard
│
├── Edge API Routes (OpenNext)
│   ├─ /api/startRun
│   ├─ /api/toolInvoke
│   └─ /api/stream
│
└── Cloudflare Worker
    ├─ Durable Object: RunCoordinator
    │   ├─ TCP connect() to Bandit
    │   ├─ State machine (levels, caps, timers)
    │   └─ Writes logs → R2
    ├─ D1 (metadata)
    └─ R2 (artifacts)
```

See docs/ADR-001-architecture.md for the detailed decision record.
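
The "State machine (levels, caps, timers)" entry in the diagram can be pictured as a small pure transition function. The sketch below is one possible shape, not the RunCoordinator's actual code: the state fields, event names, and limits are assumptions, and the final level is left as a parameter because this README does not state it.

```typescript
// Illustrative sketch only: state shape, event names, and limits are assumptions.
type RunStatus = "running" | "completed" | "failed";

interface RunState {
  level: number;
  commandsThisLevel: number;
  status: RunStatus;
  reason?: string;
}

interface RunLimits {
  finalLevel: number;          // level at which the run counts as complete
  maxCommandsPerLevel: number; // per-level command cap
}

type CoordinatorEvent =
  | { type: "command_executed" }
  | { type: "level_passed" }
  | { type: "timer_expired" };

function initialState(): RunState {
  return { level: 0, commandsThisLevel: 0, status: "running" };
}

/** Pure transition: returns the next state and never mutates the input. */
function step(state: RunState, event: CoordinatorEvent, limits: RunLimits): RunState {
  if (state.status !== "running") return state; // terminal states absorb all events
  switch (event.type) {
    case "command_executed": {
      const used = state.commandsThisLevel + 1;
      if (used > limits.maxCommandsPerLevel) {
        return { ...state, commandsThisLevel: used, status: "failed", reason: "command cap exceeded" };
      }
      return { ...state, commandsThisLevel: used };
    }
    case "level_passed": {
      const level = state.level + 1;
      const status: RunStatus = level >= limits.finalLevel ? "completed" : "running";
      return { level, commandsThisLevel: 0, status };
    }
    case "timer_expired":
      return { ...state, status: "failed", reason: "timer expired" };
  }
}
```

Keeping the transition pure means the same event sequence always yields the same final state, which fits the determinism goal stated earlier and makes the logic testable without a network connection.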
