nicholai e934d047b0 added example runs

2025-10-09 22:03:37 -06:00

# README Improvement Plan

## Problem

The current README is outdated and doesn't reflect the actual architecture or provide clear setup/deployment instructions. It references old concepts (D1, R2, mock setup) that aren't implemented, and doesn't explain the real LangGraph + SSH Proxy architecture that's currently working.

## Goals

1. **Clear Project Overview**: Explain what Bandit Runner actually does and why it matters
2. **Accurate Architecture**: Document the real system (Next.js → Cloudflare Workers → Durable Objects → SSH Proxy → Bandit Server)
3. **Step-by-Step Setup**: Complete local development setup instructions
4. **Deployment Guide**: Clear instructions for deploying all components
5. **Feature Documentation**: What the system actually does (model selection, real-time terminal, agent reasoning, etc.)
6. **Remove Outdated Content**: Clean up references to unimplemented features

## Current Issues

### Inaccurate Architecture Description

- README shows old architecture with D1/R2 that aren't implemented
- Doesn't mention LangGraph.js agent framework
- Doesn't explain SSH Proxy on Fly.io
- Missing Durable Object standalone worker details

### Missing Prerequisites

- No mention of OpenRouter API key requirement
- Missing Fly.io account for SSH proxy
- No Node.js version specification (needs 20+)
- Missing pnpm workspace setup

### Incomplete Setup Instructions

- Doesn't explain monorepo structure
- No `.env` file examples or required variables
- Missing SSH proxy deployment steps
- No explanation of DO worker vs main worker
- No Wrangler secrets configuration

### Vague Usage Section

- Doesn't explain the actual UI (control panel, terminal, chat)
- No screenshots or feature descriptions
- Missing model selection details
- No explanation of manual mode

### Outdated Roadmap

- Lists items as incomplete that are done
- Missing current features (WebSocket streaming, ANSI colors, etc.)

## New README Structure

# Bandit Runner

## About
- What it is: Autonomous AI agent testing framework
- What it does: Solves OverTheWire Bandit CTF challenges
- Why it matters: Real-world benchmark for LLM autonomous capabilities

## Features
- 🤖 LangGraph.js autonomous agent with tool use
- 🔌 Real-time WebSocket streaming
- 🖥️ Full terminal output with ANSI colors
- 💬 Live agent reasoning/thinking display
- 🎯 OpenRouter model selection (100+ models)
- 🎨 Beautiful retro-terminal UI
- 🔒 SSH security boundaries (read-only Bandit server)
- 📊 Level progression tracking

## Architecture

### System Components
1. **Frontend** - Next.js 15 with App Router + shadcn/ui
2. **Backend** - Cloudflare Workers with OpenNext
3. **Coordinator** - Durable Objects (separate worker)
4. **Agent Runtime** - SSH Proxy on Fly.io (LangGraph.js)
5. **Target** - OverTheWire Bandit SSH server

### Data Flow
Browser → WebSocket → Main Worker → DO Worker → HTTP → SSH Proxy → LangGraph Agent → SSH → Bandit Server

### Tech Stack
- Next.js 15 + React 19
- Cloudflare Workers + Durable Objects
- LangGraph.js for agent orchestration
- OpenRouter for LLM access
- Fly.io for SSH proxy hosting
- shadcn/ui for components
- ANSI-to-HTML for terminal rendering

## Prerequisites

### Required Accounts
- Cloudflare account (free tier OK, needs Durable Objects enabled)
- Fly.io account (free tier OK)
- OpenRouter account + API key

### Required Software
- Node.js 20+
- pnpm (package manager)
- Wrangler CLI (Cloudflare)
- flyctl CLI (Fly.io)
- Git

## Installation

### 1. Clone Repository
```bash
git clone https://git.biohazardvfx.com/Nicholai/bandit-runner.git
cd bandit-runner
```

### 2. Install Dependencies

```bash
# Install pnpm if you don't have it
npm install -g pnpm

# Install all workspace dependencies
cd bandit-runner-app
pnpm install

cd ../ssh-proxy
npm install
```

### 3. Configure Environment

#### Cloudflare (Main Worker)

No `.env` needed - uses `wrangler.jsonc` vars

#### Cloudflare (DO Worker)

```bash
# From the repository root
cd bandit-runner-app/workers/bandit-agent-do
cp .env.example .env.local
# Add your OpenRouter API key
```

#### SSH Proxy (Fly.io)

No configuration needed - OpenRouter key passed via HTTP headers

### 4. Local Development

#### Option A: Run Frontend Only (with deployed backend)

```bash
# From the repository root
cd bandit-runner-app
pnpm dev
# Open http://localhost:3000
```

#### Option B: Run Full Stack Locally

```bash
# Terminal 1: SSH Proxy (from the repository root)
cd ssh-proxy
npm run dev
# Runs on http://localhost:3001

# Terminal 2: Frontend (with local proxy, from the repository root)
cd bandit-runner-app
# Update wrangler.jsonc SSH_PROXY_URL to http://localhost:3001
pnpm dev
```

## Deployment

### Step 1: Deploy SSH Proxy to Fly.io

```bash
cd ssh-proxy

# Login to Fly.io (opens browser)
flyctl auth login

# Deploy (first time creates app)
flyctl deploy

# Note the URL: https://bandit-ssh-proxy.fly.dev
```

### Step 2: Deploy Durable Object Worker

```bash
cd ../bandit-runner-app/workers/bandit-agent-do

# Set your OpenRouter API key
wrangler secret put OPENROUTER_API_KEY
# Paste your key when prompted

# Deploy the DO worker
wrangler deploy

# Verify deployment
wrangler tail
```

### Step 3: Deploy Main Application

```bash
cd ../..  # Back to bandit-runner-app root

# Verify SSH_PROXY_URL in wrangler.jsonc points to your Fly.io URL
# Should be: "SSH_PROXY_URL": "https://bandit-ssh-proxy.fly.dev"

# Build and deploy
pnpm run deploy

# Your app will be live at:
# https://bandit-runner-app.<your-subdomain>.workers.dev
```

### Step 4: Verify Deployment

```bash
# Test SSH Proxy health
curl https://bandit-ssh-proxy.fly.dev/ssh/health

# Should return: {"status":"ok","activeConnections":0}

# Test main app (macOS; use xdg-open on Linux or paste the URL into a browser)
open https://bandit-runner-app.<your-subdomain>.workers.dev
```
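
The health check in Step 4 can also be scripted, for example as a post-deploy smoke test. The sketch below assumes only the response shape shown above (`status` and `activeConnections`). The `ProxyHealth` type and the `parseProxyHealth` and `isProxyHealthy` helpers are hypothetical names, not part of the repository.

```typescript
// Shape of the /ssh/health response as documented above.
// Any additional fields the proxy may return are ignored.
interface ProxyHealth {
  status: string;
  activeConnections: number;
}

// Parse a response body; returns null when the body is not the expected JSON
// (for example an HTML error page from a proxy that failed to start).
function parseProxyHealth(body: string): ProxyHealth | null {
  let data: unknown;
  try {
    data = JSON.parse(body);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const record = data as Record<string, unknown>;
  if (typeof record.status !== "string") return null;
  if (typeof record.activeConnections !== "number") return null;
  return { status: record.status, activeConnections: record.activeConnections };
}

// Healthy means: valid JSON of the expected shape with status "ok".
function isProxyHealthy(body: string): boolean {
  const health = parseProxyHealth(body);
  return health !== null && health.status === "ok";
}
```

In a smoke test, the body would come from a request to the `/ssh/health` endpoint (for instance via `fetch`), and a `false` result would fail the deploy.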

## Usage

### Starting a Run

1. Open the application in your browser
2. **Select Model**: Click the model dropdown to choose from 100+ models
   - Search by name
   - Filter by provider (OpenAI, Anthropic, Google, etc.)
   - Filter by price range
   - Filter by context length
3. **Set Target Level**: Choose how far the agent should attempt (default: 5)
4. **Select Output Mode**:
   - **Selective** (default): Shows LLM reasoning + tool calls
   - **Everything**: Shows reasoning + tool calls + full command outputs
5. Click **START**
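
The options chosen in these steps amount to a small run configuration. A minimal sketch of how that could be modeled follows; the `RunConfig` type, its field names, and `buildRunConfig` are illustrative assumptions, and only the defaults (target level 5, selective output) come from the steps above.

```typescript
// The two output modes described above.
type OutputMode = "selective" | "everything";

// Hypothetical run configuration; field names are illustrative.
interface RunConfig {
  modelId: string;       // model identifier picked in the dropdown
  targetLevel: number;   // how far the agent should attempt
  outputMode: OutputMode;
}

const DEFAULT_TARGET_LEVEL = 5;
const DEFAULT_OUTPUT_MODE: OutputMode = "selective";

// Build a config from a model id plus optional overrides, applying the
// documented defaults and rejecting obviously invalid input.
function buildRunConfig(
  modelId: string,
  overrides: { targetLevel?: number; outputMode?: OutputMode } = {},
): RunConfig {
  if (modelId.trim() === "") {
    throw new Error("a model must be selected before starting a run");
  }
  const targetLevel = overrides.targetLevel ?? DEFAULT_TARGET_LEVEL;
  if (!Number.isInteger(targetLevel) || targetLevel < 0) {
    throw new RangeError("targetLevel must be a non-negative integer");
  }
  return {
    modelId,
    targetLevel,
    outputMode: overrides.outputMode ?? DEFAULT_OUTPUT_MODE,
  };
}
```

Validating before START keeps malformed requests from ever reaching the agent.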

### Monitoring Progress

The interface shows:

- **Left Panel (Terminal)**: Full SSH session output with colors
- **Right Panel (Agent Chat)**: LLM reasoning and discovered information
- **Top Status Bar**: Current status, level, and model info
- **Control Panel**: Pause/Resume/Stop controls

### Manual Mode (Debugging)

Toggle "Manual Mode" in the terminal to:

- Type commands directly into the SSH session
- Assist the agent when it gets stuck
- Debug connection issues

⚠️ **Note**: Runs with manual intervention will be disqualified from the planned leaderboard

## Features Explained

### Real-Time Terminal

- Full PTY session capture
- ANSI color support (green prompts, red errors, etc.)
- Timestamps for all output
- Automatic scrolling
- Read-only by default (manual mode available)
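
To illustrate what ANSI color support involves, here is a deliberately small converter from SGR color sequences to HTML spans. It is a sketch under stated assumptions, not the renderer the app uses: it handles only the eight basic foreground colors and reset, leaves all other escape sequences untouched, and the names `FG_COLORS`, `escapeHtml`, and `ansiToHtml` are made up for this example.

```typescript
// Basic SGR foreground color codes (30-37) mapped to CSS color names.
const FG_COLORS: Record<number, string> = {
  30: "black", 31: "red", 32: "green", 33: "yellow",
  34: "blue", 35: "magenta", 36: "cyan", 37: "white",
};

// Escape text so terminal output cannot inject markup.
function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Convert SGR sequences such as ESC[32m ... ESC[0m into <span> elements.
// Unsupported SGR codes (bold, background colors, ...) are ignored.
function ansiToHtml(input: string): string {
  const sgr = /\x1b\[([0-9;]*)m/g;
  let html = "";
  let spanOpen = false;
  let last = 0;
  let match: RegExpExecArray | null;
  while ((match = sgr.exec(input)) !== null) {
    html += escapeHtml(input.slice(last, match.index));
    last = match.index + match[0].length;
    const params = match[1] ?? "";
    const codes = params === "" ? [0] : params.split(";").map(Number);
    for (const code of codes) {
      const color = FG_COLORS[code];
      if (code !== 0 && color === undefined) continue; // unsupported code
      if (spanOpen) {
        html += "</span>"; // a reset or a new color ends the current span
        spanOpen = false;
      }
      if (color !== undefined) {
        html += `<span style="color:${color}">`;
        spanOpen = true;
      }
    }
  }
  html += escapeHtml(input.slice(last));
  if (spanOpen) html += "</span>";
  return html;
}
```

Escaping before wrapping matters here because the terminal panel displays untrusted output from a remote shell.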

### Agent Reasoning Display

- LLM "thinking" messages (what command to try next)
- Planning output (strategy for current level)
- Password discovery announcements
- Level completion summaries

### Model Selection

- 100+ models from OpenRouter
- Real-time search
- Filter by provider, price, context length
- Shows token costs
- Supports all major providers
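
The search and filter behavior listed above can be expressed as one pure function over the model list. The sketch below is illustrative only: the `ModelInfo` and `ModelFilter` shapes, the field names, and the assumption that a model id has the form `provider/model` are not taken from the codebase.

```typescript
// Hypothetical shape of a model entry; field names are illustrative.
interface ModelInfo {
  id: string;                 // assumed to look like "provider/model"
  name: string;
  contextLength: number;      // maximum context window, in tokens
  promptPricePerMTok: number; // USD per million prompt tokens
}

// Every criterion is optional; an empty filter matches all models.
interface ModelFilter {
  search?: string;
  provider?: string;
  maxPromptPrice?: number;
  minContextLength?: number;
}

// Provider is taken to be the part of the id before the first slash.
function providerOf(model: ModelInfo): string {
  return model.id.split("/")[0] ?? "";
}

function filterModels(models: ModelInfo[], filter: ModelFilter): ModelInfo[] {
  const needle = (filter.search ?? "").trim().toLowerCase();
  return models.filter((model) => {
    const matchesSearch =
      needle === "" ||
      model.name.toLowerCase().includes(needle) ||
      model.id.toLowerCase().includes(needle);
    if (!matchesSearch) return false;
    if (filter.provider !== undefined && providerOf(model) !== filter.provider) return false;
    if (filter.maxPromptPrice !== undefined && model.promptPricePerMTok > filter.maxPromptPrice) return false;
    if (filter.minContextLength !== undefined && model.contextLength < filter.minContextLength) return false;
    return true;
  });
}
```

Keeping the filter pure makes real-time search cheap to re-run on every keystroke and easy to unit test.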

### Level Progression

- Always starts at Level 0
- Automatic password extraction
- Automatic SSH re-login for next level
- Configurable target level cap
- Retry logic for failed attempts
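
Password extraction and re-login can be sketched as two small functions. This is not the agent's actual implementation: the pattern assumes Bandit passwords are 32-character alphanumeric strings, usernames are assumed to follow the `bandit<level>` convention, and the host and port are the ones used in the Troubleshooting section. All function and type names are hypothetical.

```typescript
const BANDIT_HOST = "bandit.labs.overthewire.org";
const BANDIT_PORT = 2220;

// Assumption: a password is a standalone run of exactly 32 letters/digits.
const PASSWORD_PATTERN = /\b[A-Za-z0-9]{32}\b/;

interface SshLogin {
  host: string;
  port: number;
  username: string;
  password: string;
}

// Return the first password-like token in command output, or null.
function extractPassword(output: string): string | null {
  const match = PASSWORD_PATTERN.exec(output);
  return match ? match[0] : null;
}

// Build the login for a given level (assumed username: bandit<level>).
function loginForLevel(level: number, password: string): SshLogin {
  if (!Number.isInteger(level) || level < 0) {
    throw new RangeError("level must be a non-negative integer");
  }
  return { host: BANDIT_HOST, port: BANDIT_PORT, username: `bandit${level}`, password };
}

// A password found while on level N is the credential for level N + 1.
// Returns null when the output contains no password-like token.
function nextLogin(currentLevel: number, output: string): SshLogin | null {
  const password = extractPassword(output);
  return password === null ? null : loginForLevel(currentLevel + 1, password);
}
```

A pattern this loose can also match unrelated 32-character tokens such as hashes, so a real agent would likely confirm a candidate by attempting the next login.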

## Architecture Details

### Monorepo Structure

bandit-runner/
├── bandit-runner-app/          # Next.js frontend + Cloudflare Workers
│   ├── src/
│   │   ├── app/                # Next.js pages + API routes
│   │   ├── components/         # React components (shadcn/ui)
│   │   ├── hooks/              # React hooks (WebSocket)
│   │   └── lib/                # Utilities, agents, storage
│   ├── workers/
│   │   └── bandit-agent-do/    # Standalone Durable Object worker
│   ├── scripts/
│   │   └── patch-worker.js     # WebSocket intercept injector
│   └── wrangler.jsonc          # Main worker config
├── ssh-proxy/                   # LangGraph agent on Fly.io
│   ├── agent.ts                # LangGraph state machine
│   ├── server.ts               # Express HTTP server
│   ├── Dockerfile              # Container definition
│   └── fly.toml                # Fly.io config
├── docs/                        # Documentation
└── public/                      # Assets

### WebSocket Intercept Pattern

The main worker intercepts WebSocket upgrade requests before they reach Next.js:

// Injected by scripts/patch-worker.js
// `id` must be a DurableObjectId for the run (for example from env.BANDIT_AGENT.idFromName(...))
if (request.headers.get('Upgrade')?.toLowerCase() === 'websocket') {
  // Forward directly to the DO, bypassing Next.js
  return env.BANDIT_AGENT.get(id).fetch(request);
}

This is necessary because Next.js API routes don't support WebSocket upgrades.
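
A slightly fuller sketch of the intercept, written against minimal structural types so it runs outside the Workers runtime, might look like the following. `idFromName`, `get`, and the stub's `fetch` mirror the Durable Object namespace API. The `runId` query parameter, the `"default"` fallback, and the names `routeRequest` and `isWebSocketUpgrade` are assumptions made for illustration.

```typescript
// Minimal structural stand-ins for the Workers types used by the intercept.
interface RequestLike {
  url: string;
  headers: { get(name: string): string | null };
}
interface StubLike<R> {
  fetch(request: RequestLike): R;
}
interface NamespaceLike<R> {
  idFromName(name: string): unknown;
  get(id: unknown): StubLike<R>;
}

// Header values are compared case-insensitively.
function isWebSocketUpgrade(request: RequestLike): boolean {
  return request.headers.get("Upgrade")?.toLowerCase() === "websocket";
}

// Route WebSocket upgrades to the Durable Object; everything else goes to
// the regular (Next.js) handler.
function routeRequest<R>(
  request: RequestLike,
  namespace: NamespaceLike<R>,
  nextHandler: (request: RequestLike) => R,
): R {
  if (!isWebSocketUpgrade(request)) return nextHandler(request);
  // Hypothetical: pick the DO instance from a `runId` query parameter.
  const match = /[?&]runId=([^&#]+)/.exec(request.url);
  const runId = decodeURIComponent(match?.[1] ?? "default");
  const id = namespace.idFromName(runId);
  return namespace.get(id).fetch(request);
}
```

Deriving the id from a stable name means every client of the same run reaches the same Durable Object instance.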

### Durable Object Responsibilities

- Accept WebSocket connections from browsers
- Maintain run state (level, passwords, status)
- Call SSH proxy HTTP endpoints
- Stream JSONL events from SSH proxy to WebSocket clients
- Handle START/PAUSE/RESUME/STOP actions
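
Handling the four actions is easiest to reason about as a small state machine. The sketch below uses the action names from the list above. The status names (`idle`, `running`, `paused`, `stopped`) and the transition table are assumptions rather than the Durable Object's actual state model.

```typescript
type RunStatus = "idle" | "running" | "paused" | "stopped";
type RunAction = "START" | "PAUSE" | "RESUME" | "STOP";

// Allowed transitions; any action not listed for a status is ignored.
const TRANSITIONS: Record<RunStatus, Partial<Record<RunAction, RunStatus>>> = {
  idle: { START: "running" },
  running: { PAUSE: "paused", STOP: "stopped" },
  paused: { RESUME: "running", STOP: "stopped" },
  stopped: { START: "running" },
};

// Return the next status, or the current one if the action does not apply
// (for example PAUSE while idle).
function applyAction(status: RunStatus, action: RunAction): RunStatus {
  return TRANSITIONS[status][action] ?? status;
}
```

Ignoring invalid actions instead of throwing keeps duplicate clicks or late WebSocket messages from breaking a run.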

### SSH Proxy Responsibilities

- Run LangGraph.js agent state machine
- Maintain SSH2 connections to Bandit server
- Execute commands via PTY
- Extract passwords using regex
- Stream events back to DO via HTTP response body (JSONL)
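
Because JSONL arrives over a streamed response body, a chunk can end in the middle of a line, so the consumer has to buffer partial lines. A minimal decoder for that, assuming chunks have already been decoded to strings, could look like this. The `JsonlDecoder` class and the `type` field in the usage example are hypothetical, and the real event schema is not shown here.

```typescript
// Incrementally decodes newline-delimited JSON from arbitrary string chunks.
class JsonlDecoder {
  private buffer = "";

  // Feed one chunk; returns every event whose line is now complete.
  // Throws if a complete line is not valid JSON.
  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last piece has no trailing newline yet, so keep it for later.
    this.buffer = lines.pop() ?? "";
    return JsonlDecoder.parseLines(lines);
  }

  // Call once the stream ends to parse a final line without a newline.
  flush(): unknown[] {
    const rest = this.buffer;
    this.buffer = "";
    return JsonlDecoder.parseLines([rest]);
  }

  private static parseLines(lines: string[]): unknown[] {
    const events: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue; // skip blank lines
      events.push(JSON.parse(trimmed));
    }
    return events;
  }
}
```

For example, pushing `{"type":"a"}\n{"ty` yields one event, and pushing `pe":"b"}\n` afterwards yields the second one, which could then be forwarded to WebSocket clients.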

### Event Flow

1. User clicks START in browser
2. Browser → WebSocket → Main Worker → DO Worker
3. DO → HTTP POST /agent/run → SSH Proxy
4. SSH Proxy → SSH Connection → Bandit Server
5. Agent executes commands, extracts passwords
6. SSH Proxy → JSONL events → DO
7. DO → WebSocket → Browser
8. UI updates in real-time

## Troubleshooting

### WebSocket Connection Failed

- Check DO worker is deployed: `wrangler deployments list`
- Verify DO binding in main worker config
- Check browser console for specific errors

### SSH Proxy Not Responding

- Check Fly.io status: `flyctl status`
- View logs: `flyctl logs`
- Test health endpoint: `curl https://your-proxy.fly.dev/ssh/health`

### Agent Not Starting

- Verify OpenRouter API key: `wrangler secret list`
- Check DO worker logs: `wrangler tail --name bandit-agent-do`
- Ensure `SSH_PROXY_URL` is correct in `wrangler.jsonc`

### Commands Not Executing

- Check SSH proxy logs: `flyctl logs`
- Verify Bandit server is accessible: `ssh bandit0@bandit.labs.overthewire.org -p 2220`
- Review agent reasoning in chat panel for errors

## Roadmap

### Completed ✅

- LangGraph autonomous agent framework
- Real-time WebSocket streaming
- Full terminal output with ANSI colors
- Agent reasoning display
- OpenRouter model selection with search/filters
- Level 0 → N progression
- Manual mode for debugging
- Password extraction and advancement

### In Progress 🚧

- Retry logic with exponential backoff
- Token usage and cost tracking UI
- Persistent run history (D1)
- Log storage (R2)

### Planned 📋

- Leaderboard (fastest completions)
- Multi-agent comparison mode
- Custom prompts and strategies
- Agent performance analytics
- Mock SSH server for testing
- Expanded CTF support (beyond Bandit)

## Contributing

Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feat/amazing-feature`)
3. Commit with Conventional Commits (`feat:`, `fix:`, `docs:`)
4. Push and open a Pull Request

See `CONTRIBUTING.md` for detailed guidelines.

## License

GNU GPLv3 - See `COPYING.txt`

## Contact

Nicholai Vogel
Website • LinkedIn

## Acknowledgments

- OverTheWire Bandit - CTF challenges
- LangGraph.js - Agent framework
- Cloudflare Workers - Edge compute
- Fly.io - Global app hosting
- OpenRouter - Unified LLM API
- shadcn/ui - Component library
- OpenNext - Next.js on Workers


## Key Changes

### What's Being Added
1. **Accurate architecture diagram** - Shows real components (DO, SSH Proxy, LangGraph)
2. **Complete prerequisite list** - All accounts, software, and setup requirements
3. **Monorepo structure docs** - Explains the workspace layout
4. **Step-by-step deployment** - 4 clear deployment phases
5. **Feature documentation** - What each UI component does
6. **Troubleshooting section** - Common issues and solutions
7. **Current roadmap** - Reflects actual completion status

### What's Being Removed
1. Mock architecture references (D1, R2, and other unimplemented features)
2. Incorrect usage instructions
3. Outdated screenshots/descriptions
4. Placeholder content from template

### What's Being Updated
1. Tech stack badges - Add LangGraph, Fly.io, OpenRouter
2. Built With section - Reflect actual dependencies
3. Installation steps - Real commands that work
4. Usage examples - Match actual UI
5. Contact links - Keep current

## Implementation Notes

- Keep existing badge links/formatting style
- Maintain table of contents structure
- Add new sections: Features, Architecture Details, Troubleshooting
- Update all code blocks with real, tested commands
- Add architecture ASCII diagram for clarity
- Include actual file paths and structure
- Reference real configuration files (wrangler.jsonc, etc.)

## Success Criteria

✅ README accurately describes working system  
✅ User can deploy from scratch following instructions  
✅ No references to unimplemented features  
✅ Clear troubleshooting for common issues  
✅ Architecture matches production code  
✅ All commands are tested and accurate
