<a id="readme-top"></a>

<!-- PROJECT SHIELDS -->
[![Contributors][contributors-shield]][contributors-url]
[![Forks][forks-shield]][forks-url]
[![Stargazers][stars-shield]][stars-url]
[![Issues][issues-shield]][issues-url]
[![GPLv3 License][license-shield]][license-url]
[![Conventional Commits][conventional-commits-badge]](https://conventionalcommits.org)
[![LinkedIn][linkedin-shield]][linkedin-url]

<!-- PROJECT LOGO -->
<br />
<div align="center">
<a href="https://git.biohazardvfx.com/Nicholai/bandit-runner">
<img src="public/bandit-logo.png" alt="Logo" width="100" height="100">
</a>

<h3 align="center">Bandit Runner</h3>

<p align="center">
A deterministic AI testing rig for LLMs-as-agents, built on Next.js, OpenNext, and Cloudflare Workers.
<br />
<a href="https://git.biohazardvfx.com/Nicholai/bandit-runner"><strong>Explore the docs »</strong></a>
<br />
<br />
<a href="#">View Demo</a>
·
<a href="https://git.biohazardvfx.com/Nicholai/bandit-runner/issues/new?labels=bug">Report Bug</a>
·
<a href="https://git.biohazardvfx.com/Nicholai/bandit-runner/issues/new?labels=enhancement">Request Feature</a>
</p>
</div>

---

<!-- TABLE OF CONTENTS -->
<details>
<summary>Table of Contents</summary>
<ol>
  <li><a href="#about-the-project">About The Project</a>
    <ul>
      <li><a href="#core-concepts">Core Concepts</a></li>
      <li><a href="#built-with">Built With</a></li>
    </ul>
  </li>
  <li><a href="#getting-started">Getting Started</a>
    <ul>
      <li><a href="#prerequisites">Prerequisites</a></li>
      <li><a href="#installation">Installation</a></li>
    </ul>
  </li>
  <li><a href="#usage">Usage</a></li>
  <li><a href="#architecture">Architecture</a></li>
  <li><a href="#roadmap">Roadmap</a></li>
  <li><a href="#contributing">Contributing</a></li>
  <li><a href="#license">License</a></li>
  <li><a href="#contact">Contact</a></li>
  <li><a href="#acknowledgments">Acknowledgments</a></li>
</ol>
</details>

---

## About The Project

[![Product Screenshot][product-screenshot]](#)

**Bandit Runner** is a public, deterministic evaluation harness for large language models.
It turns an AI model into an autonomous operator tasked with completing the **OverTheWire Bandit** wargame over SSH, running entirely on Cloudflare Workers.

**Why it matters**

- Provides a real-world, hands-on benchmark for autonomous reasoning and command execution.
- Tests tool use (SSH), planning, error handling, and persistence under real network conditions.
- Generates reproducible, privacy-safe logs for research or public leaderboards.

### Core Concepts

- **Agent Role:** Each run instantiates an LLM as “BanditRunner”: a scripted, deterministic persona that follows a strict system prompt and a command allow-list.
- **Environment:** Next.js frontend + OpenNext build → Cloudflare Workers backend (Durable Objects + D1 + R2).
- **Security:** Hard-scoped to `bandit.labs.overthewire.org:2220`.
  All discovered passwords are redacted in logs and sealed in short-lived encrypted blobs.
- **Goal:** Advance from Level 0 → final level autonomously while documenting every decision.

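
The host scoping and command allow-list described above could be enforced with checks along these lines. This is a minimal sketch, not the project's code: every identifier (`isAllowedTarget`, `isAllowedCommand`, `ALLOWED_COMMANDS`) is hypothetical, and the command list is an example rather than the real allow-list.

```typescript
// Illustrative sketch only: all names here are hypothetical, and the
// command list is an example, not the project's actual allow-list.
const ALLOWED_HOST = "bandit.labs.overthewire.org";
const ALLOWED_PORT = 2220;

const ALLOWED_COMMANDS: ReadonlySet<string> = new Set([
  "ls", "cat", "cd", "find", "file", "grep", "sort", "uniq", "strings", "base64",
]);

interface Target {
  host: string;
  port: number;
}

// Only the single Bandit endpoint is a valid SSH target.
function isAllowedTarget(target: Target): boolean {
  return target.host === ALLOWED_HOST && target.port === ALLOWED_PORT;
}

// A command line passes only if it contains no chaining, substitution,
// or redirection characters and every pipeline stage starts with an
// allow-listed binary.
function isAllowedCommand(commandLine: string): boolean {
  if (/[;&`$<>\n]/.test(commandLine)) return false;
  return commandLine.split("|").every((stage) => {
    const binary = stage.trim().split(/\s+/)[0] ?? "";
    return ALLOWED_COMMANDS.has(binary);
  });
}
```

Checking each pipeline stage separately keeps `sort data.txt | uniq -u` usable while still rejecting `cat readme; rm -rf /`.
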
<p align="right">(<a href="#readme-top">back to top</a>)</p>

---

### Built With

* [![Next.js][Next.js]][Next-url]
* [![React][React.js]][React-url]
* [![Cloudflare][Cloudflare-badge]][Cloudflare-url]
* [![OpenNext][OpenNext-badge]][OpenNext-url]
* [![Shadcn/UI][Shadcn-badge]][Shadcn-url]
* [![TypeScript][TypeScript-badge]][TypeScript-url]
* [![Drizzle ORM][Drizzle-badge]][Drizzle-url]
* [![pnpm][pnpm-badge]][pnpm-url]

<p align="right">(<a href="#readme-top">back to top</a>)</p>

---

## Getting Started

### Prerequisites

You need:

* **Node.js ≥ 20**
* **pnpm**

  ```bash
  npm i -g pnpm
  ```

* **Wrangler 3 CLI**

  ```bash
  npm i -g wrangler
  ```

* A Cloudflare account with access to:

  * Durable Objects
  * D1 Database
  * R2 Storage

### Installation

1. Clone the repo

   ```bash
   git clone https://git.biohazardvfx.com/Nicholai/bandit-runner.git
   cd bandit-runner
   ```

2. Install dependencies

   ```bash
   pnpm install
   ```

3. Copy and configure environment

   ```bash
   cp .env.example .env.local
   ```

4. Build and run locally

   ```bash
   pnpm dev
   # or
   wrangler dev
   ```

5. Deploy preview

   ```bash
   pnpm build
   wrangler deploy --env preview
   ```

<p align="right">(<a href="#readme-top">back to top</a>)</p>

---

## Usage

Once deployed, visit `/runs/new` to start a new evaluation.
Provide a model endpoint (OpenAI, OpenRouter, or self-hosted) and initiate a Bandit Run.

Each run:

* Spawns a Durable Object → “Run Coordinator”
* Connects to `bandit.labs.overthewire.org:2220`
* Executes controlled `ssh.connect` / `ssh.exec` / `ssh.close` operations
* Streams JSONL logs and commentary to the Live Viewer

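
To make the log stream concrete, here is a rough sketch of how tool events might be serialized to JSONL with passwords redacted. The `LogEvent` shape and the helper names are assumptions, not the project's schema, and the redaction pattern assumes Bandit passwords are 32-character alphanumeric strings.

```typescript
// Hypothetical event shape; the project's real log schema may differ.
type ToolName = "ssh.connect" | "ssh.exec" | "ssh.close";

interface LogEvent {
  runId: string;
  seq: number;
  tool: ToolName;
  level: number;
  output: string;
  ts: string; // ISO 8601 timestamp
}

// Assumption: Bandit passwords are 32-character alphanumeric tokens.
const PASSWORD_PATTERN = /\b[A-Za-z0-9]{32}\b/g;

// Replace anything that looks like a password before it reaches a log.
function redact(text: string): string {
  return text.replace(PASSWORD_PATTERN, "[REDACTED]");
}

// JSONL: one JSON object per line, each line terminated by "\n".
function toJsonl(events: LogEvent[]): string {
  return events
    .map((event) => JSON.stringify({ ...event, output: redact(event.output) }) + "\n")
    .join("");
}
```

Redacting at serialization time means a password cannot leak through a code path that forgets to sanitize its own output.
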
Developers can extend:

* Scoring rules (`lib/scoring/verdicts.ts`)
* Level validators (`lib/scoring/validators.ts`)
* Model interfaces (`lib/ssh/tool-adapter.ts`)

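
As an illustration of what a custom level validator could look like, the sketch below defines a validator type and composes it with a command cap. None of these names come from `lib/scoring/validators.ts`; they are hypothetical. The check only tests the shape of a candidate password (assumed to be 32 alphanumeric characters), so it does not prove the password is correct.

```typescript
// Hypothetical types; the real interfaces in lib/scoring may differ.
type Verdict = "pass" | "fail" | "incomplete";

interface LevelAttempt {
  level: number;
  commandsUsed: number;
  candidatePassword: string | null;
}

type LevelValidator = (attempt: LevelAttempt) => Verdict;

// Shape check only (assumes 32 alphanumeric characters); it does not
// confirm that the password actually unlocks the next level.
const passwordShapeValidator: LevelValidator = (attempt) => {
  if (attempt.candidatePassword === null) return "incomplete";
  return /^[A-Za-z0-9]{32}$/.test(attempt.candidatePassword) ? "pass" : "fail";
};

// Wraps another validator and fails any attempt that exceeds the cap.
function withCommandCap(maxCommands: number, inner: LevelValidator): LevelValidator {
  return (attempt) => (attempt.commandsUsed > maxCommands ? "fail" : inner(attempt));
}
```

Composing small validators this way keeps each rule testable on its own, for example `withCommandCap(25, passwordShapeValidator)`.
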
<p align="right">(<a href="#readme-top">back to top</a>)</p>

---

## Architecture

```text
Next.js (App Router)
│
├── UI (Shadcn/UI)
│    ├─ LiveLog
│    └─ LevelCard
│
├── Edge API Routes (OpenNext)
│    ├─ /api/startRun
│    ├─ /api/toolInvoke
│    └─ /api/stream
│
└── Cloudflare Worker
     ├─ Durable Object: RunCoordinator
     │   ├─ TCP connect() to Bandit
     │   ├─ State machine (levels, caps, timers)
     │   └─ Writes logs → R2
     ├─ D1 (metadata)
     └─ R2 (artifacts)
```

*See `docs/ADR-001-architecture.md` for the detailed decision record.*

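
The "state machine (levels, caps, timers)" in the diagram could be modeled as a pure transition function, sketched below. This is one possible design rather than the RunCoordinator's actual implementation; the state fields, event names, and limits are all assumptions.

```typescript
// Hypothetical model of a run; not the RunCoordinator's real state.
type RunStatus = "idle" | "running" | "completed" | "failed";

interface RunState {
  status: RunStatus;
  level: number;
  commandsThisLevel: number;
  startedAtMs: number | null;
}

interface RunLimits {
  finalLevel: number;          // run completes on reaching this level
  maxCommandsPerLevel: number; // cap
  maxRunMs: number;            // timer
}

type RunEvent =
  | { type: "start"; nowMs: number }
  | { type: "command"; nowMs: number }
  | { type: "levelSolved"; nowMs: number };

const initialState: RunState = {
  status: "idle",
  level: 0,
  commandsThisLevel: 0,
  startedAtMs: null,
};

// Pure transition: same state + event + limits always gives the same result.
function step(state: RunState, event: RunEvent, limits: RunLimits): RunState {
  if (event.type === "start") {
    return state.status === "idle"
      ? { ...state, status: "running", startedAtMs: event.nowMs }
      : state;
  }
  // Events are ignored unless the run is active.
  if (state.status !== "running" || state.startedAtMs === null) return state;
  // Timer: fail the run once the wall-clock budget is exceeded.
  if (event.nowMs - state.startedAtMs > limits.maxRunMs) {
    return { ...state, status: "failed" };
  }
  if (event.type === "command") {
    const used = state.commandsThisLevel + 1;
    // Cap: fail the run when a level uses too many commands.
    return used > limits.maxCommandsPerLevel
      ? { ...state, commandsThisLevel: used, status: "failed" }
      : { ...state, commandsThisLevel: used };
  }
  // levelSolved: advance, and finish once the final level is reached.
  const nextLevel = state.level + 1;
  return nextLevel >= limits.finalLevel
    ? { ...state, level: nextLevel, status: "completed" }
    : { ...state, level: nextLevel, commandsThisLevel: 0 };
}
```

Passing the clock in through each event (`nowMs`) rather than reading it inside `step` keeps transitions replayable from a stored event log, which fits the project's emphasis on determinism.
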
<p align="right">(<a href="#readme-top">back to top</a>)</p>

---

## Roadmap

* [x] Core runner architecture
* [x] JSONL log streaming
* [x] SSH tool scaffolding
* [ ] Add live leaderboard
* [ ] Add mock SSH server for tests
* [ ] Expand scoring heuristics
* [ ] Implement model-agnostic adapter layer
* [ ] Public demo page

See the [open issues](https://git.biohazardvfx.com/Nicholai/bandit-runner/issues) for the full roadmap.

<p align="right">(<a href="#readme-top">back to top</a>)</p>

---

## Contributing

Contributions are welcome.

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feat/amazing`)
3. Commit (`pnpm commit`) using Conventional Commits
4. Push (`git push origin feat/amazing`)
5. Open a Pull Request

### Top Contributors

<a href="https://git.biohazardvfx.com/Nicholai/bandit-runner/graphs/contributors">
<img src="https://contrib.rocks/image?repo=Nicholai/bandit-runner" alt="Contributors" />
</a>

<p align="right">(<a href="#readme-top">back to top</a>)</p>

---

## License

Distributed under the **GNU GPLv3** License.
See `LICENSE` for details.

<p align="right">(<a href="#readme-top">back to top</a>)</p>

---

## Contact

**Nicholai Vogel**
[Website](https://nicholai.work) • [LinkedIn](https://linkedin.com/in/nicholai-vogel) • [Instagram](https://instagram.com/nicholai.exe)

Project Link: [https://git.biohazardvfx.com/Nicholai/bandit-runner](https://git.biohazardvfx.com/Nicholai/bandit-runner)

<p align="right">(<a href="#readme-top">back to top</a>)</p>

---

## Acknowledgments

* [OverTheWire Bandit](https://overthewire.org/wargames/bandit/): the wargame challenge itself
* [Cloudflare Workers Docs](https://developers.cloudflare.com/workers/)
* [OpenNext](https://opennext.js.org/)
* [Shadcn/UI](https://ui.shadcn.com)
* [Drizzle ORM](https://orm.drizzle.team)
* [Choose a License](https://choosealicense.com)
* [Img Shields](https://shields.io)
* [Contrib.rocks](https://contrib.rocks)

<p align="right">(<a href="#readme-top">back to top</a>)</p>

---

<!-- MARKDOWN LINKS & IMAGES -->

[contributors-shield]: https://img.shields.io/github/contributors/Nicholai/bandit-runner.svg?style=for-the-badge
[contributors-url]: https://git.biohazardvfx.com/Nicholai/bandit-runner/graphs/contributors
[forks-shield]: https://img.shields.io/github/forks/Nicholai/bandit-runner.svg?style=for-the-badge
[forks-url]: https://git.biohazardvfx.com/Nicholai/bandit-runner/network/members
[stars-shield]: https://img.shields.io/github/stars/Nicholai/bandit-runner.svg?style=for-the-badge
[stars-url]: https://git.biohazardvfx.com/Nicholai/bandit-runner/stargazers
[issues-shield]: https://img.shields.io/github/issues/Nicholai/bandit-runner.svg?style=for-the-badge
[issues-url]: https://git.biohazardvfx.com/Nicholai/bandit-runner/issues
[license-shield]: https://img.shields.io/github/license/Nicholai/bandit-runner.svg?style=for-the-badge
[license-url]: https://git.biohazardvfx.com/Nicholai/bandit-runner/blob/main/COPYING.txt
[linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=for-the-badge&logo=linkedin&colorB=555
[linkedin-url]: https://linkedin.com/in/nicholai-vogel
[product-screenshot]: public/screenshot.png
[Next.js]: https://img.shields.io/badge/Next.js-000000?style=for-the-badge&logo=nextdotjs&logoColor=white
[Next-url]: https://nextjs.org/
[React.js]: https://img.shields.io/badge/React-20232A?style=for-the-badge&logo=react&logoColor=61DAFB
[React-url]: https://react.dev/
[Cloudflare-badge]: https://img.shields.io/badge/Cloudflare%20Workers-F38020?style=for-the-badge&logo=cloudflare&logoColor=white
[Cloudflare-url]: https://developers.cloudflare.com/workers/
[OpenNext-badge]: https://img.shields.io/badge/OpenNext-18181B?style=for-the-badge&logo=vercel&logoColor=white
[OpenNext-url]: https://opennext.js.org/
[Shadcn-badge]: https://img.shields.io/badge/Shadcn%2FUI-000000?style=for-the-badge&logo=react&logoColor=white
[Shadcn-url]: https://ui.shadcn.com
[TypeScript-badge]: https://img.shields.io/badge/TypeScript-3178C6?style=for-the-badge&logo=typescript&logoColor=white
[TypeScript-url]: https://www.typescriptlang.org/
[Drizzle-badge]: https://img.shields.io/badge/Drizzle%20ORM-3E63DD?style=for-the-badge&logo=sqlite&logoColor=white
[Drizzle-url]: https://orm.drizzle.team
[pnpm-badge]: https://img.shields.io/badge/pnpm-F69220?style=for-the-badge&logo=pnpm&logoColor=white
[pnpm-url]: https://pnpm.io
[conventional-commits-badge]: https://img.shields.io/badge/Conventional%20Commits-1.0.0-%23FE5196?style=for-the-badge&logo=conventionalcommits&logoColor=white