---
title: llama.cpp Engine
description: Configure Jan's local AI engine for optimal performance.
keywords:
  [
    Jan,
    local AI,
    llama.cpp,
    AI engine,
    local models,
    performance,
    GPU acceleration,
    CPU processing,
    model optimization,
  ]
---

import { Tabs } from 'nextra/components'
import { Callout, Steps } from 'nextra/components'
import { Settings } from 'lucide-react'

# Local AI Engine (llama.cpp)

llama.cpp is the engine that runs AI models locally on your computer. It's what makes Jan work without needing an internet connection or cloud services.

## Accessing Engine Settings

Find llama.cpp settings at **Settings** (<Settings width={16} height={16} style={{display:"inline"}}/>) > **Local Engine** > **llama.cpp**:

![llama.cpp](./_assets/llama.cpp-01-updated.png)

Most users don't need to change these settings. Jan picks good defaults for your hardware automatically.

## When to Adjust Settings

You might need to modify these settings if:

- Models load slowly or don't work
- You've installed new hardware (like a graphics card)
- You want to optimize performance for your specific setup

## Engine Management

| Feature | What It Does | When You Need It |
|---------|--------------|------------------|
| **Engine Version** | Shows the current llama.cpp version | Check compatibility with newer models |
| **Check Updates** | Downloads engine updates | When new models require an updated engine |
| **Backend Selection** | Choose a hardware-optimized build | After hardware changes or performance issues |

## Hardware Backends

Different backends are optimized for different hardware. Pick the one that matches your computer:

<Tabs items={['Windows', 'Linux', 'macOS']}>
<Tabs.Tab>

### NVIDIA Graphics Cards (Fastest)

**For CUDA 12.0:**
- `llama.cpp-avx2-cuda-12-0` (most common)
- `llama.cpp-avx512-cuda-12-0` (newer Intel/AMD CPUs)

**For CUDA 11.7:**
- `llama.cpp-avx2-cuda-11-7` (older drivers)

### CPU Only
- `llama.cpp-avx2` (modern CPUs)
- `llama.cpp-avx` (older CPUs)
- `llama.cpp-noavx` (very old CPUs)

### Other Graphics Cards
- `llama.cpp-vulkan` (AMD, Intel Arc)

</Tabs.Tab>
<Tabs.Tab>

### NVIDIA Graphics Cards
- `llama.cpp-avx2-cuda-12-0` (recommended)
- `llama.cpp-avx2-cuda-11-7` (older drivers)

### CPU Only
- `llama.cpp-avx2` (modern CPUs)
- `llama.cpp-arm64` (ARM processors)

### Other Graphics Cards
- `llama.cpp-vulkan` (AMD, Intel graphics)

</Tabs.Tab>
<Tabs.Tab>

### Apple Silicon (M1/M2/M3/M4)
- `llama.cpp-mac-arm64` (recommended)

### Intel Macs
- `llama.cpp-mac-amd64`

Apple Silicon automatically uses GPU acceleration through Metal.

</Tabs.Tab>
</Tabs>
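If you're not sure which instruction sets your CPU supports, you can check before picking a CPU backend. The sketch below is a minimal, Linux-only example that reads `/proc/cpuinfo`; it is illustrative rather than part of Jan, and on Windows a tool such as CPU-Z reports the same flags. Apple Silicon Macs can skip this check entirely.

```python
# Minimal sketch: detect x86 SIMD support on Linux by reading /proc/cpuinfo.
# Illustrative only; Jan's backend picker does this for you automatically.
def cpu_flags() -> set:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                # The "flags" line lists every instruction set the CPU advertises.
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
if "avx512f" in flags:
    print("AVX-512 supported: llama.cpp-avx512-* backends will work")
elif "avx2" in flags:
    print("AVX2 supported: use a llama.cpp-avx2-* backend")
elif "avx" in flags:
    print("AVX supported: use llama.cpp-avx")
else:
    print("No AVX support detected: use llama.cpp-noavx")
```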
## Performance Settings

| Setting | What It Does | Recommended | Impact |
|---------|--------------|-------------|--------|
| **Continuous Batching** | Handles multiple requests simultaneously | Enabled | Faster when using tools or multiple chats |
| **Parallel Operations** | Number of concurrent requests | 4 | Higher = more multitasking, uses more memory |
| **CPU Threads** | Processor cores to use | Auto | More threads can speed up CPU processing |

## Memory Settings

| Setting | What It Does | Recommended | When to Change |
|---------|--------------|-------------|----------------|
| **Flash Attention** | Efficient memory usage | Enabled | Leave enabled unless problems occur |
| **Caching** | Remembers recent conversations | Enabled | Speeds up follow-up questions |
| **KV Cache Type** | Memory vs. quality trade-off | f16 | Change to q8_0 if low on memory |
| **mmap** | Efficient model loading | Enabled | Helps with large models |
| **Context Shift** | Handles very long conversations | Disabled | Enable for very long chats |

### Memory Options Explained

- **f16**: Best quality, uses the most memory
- **q8_0**: Balanced memory and quality
- **q4_0**: Least memory, slight quality reduction

For a sense of how much memory each option actually uses, see the sketch at the end of this page.

## Quick Troubleshooting

**Models won't load:**
- Try a different backend
- Check available RAM/VRAM
- Update the engine version

**Slow performance:**
- Verify GPU acceleration is active
- Close memory-intensive applications
- Increase GPU Layers in model settings

**Out of memory:**
- Change KV Cache Type to q8_0
- Reduce Context Size in model settings
- Try a smaller model

**Crashes or errors:**
- Switch to a more stable backend (avx instead of avx2)
- Update your graphics drivers
- Check system temperature

## Quick Setup Guide

**Most users:**
1. Use the default settings
2. Only change them if problems occur

**NVIDIA GPU users:**
1. Download a CUDA backend
2. Ensure GPU Layers is set high
3. Enable Flash Attention

**Performance optimization:**
1. Enable Continuous Batching
2. Use the appropriate backend for your hardware
3. Monitor memory usage

The default settings work well for most hardware. Only adjust them if you're experiencing specific issues or want to optimize for your particular setup.
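To make the f16 / q8_0 / q4_0 trade-off concrete, the rough estimator below computes an approximate KV cache size for a given context length. The model dimensions are assumptions (roughly a 7B-class model); the bytes-per-element figures for the quantized types approximate llama.cpp's block formats rather than quoting exact values, so treat the output as a ballpark, not a guarantee.

```python
# Back-of-envelope KV cache size estimate. All model dimensions here are
# assumptions for a roughly 7B-class model; check your model's metadata for
# real values. Quantized sizes approximate llama.cpp block formats.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 1.0625, "q4_0": 0.5625}

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, cache_type):
    # One K and one V entry per layer per token, each n_kv_heads * head_dim wide.
    per_token = 2 * n_layers * n_kv_heads * head_dim
    return per_token * context_len * BYTES_PER_ELEMENT[cache_type]

for cache_type in ("f16", "q8_0", "q4_0"):
    gib = kv_cache_bytes(32, 32, 128, 8192, cache_type) / 1024**3
    print(f"{cache_type}: ~{gib:.2f} GiB KV cache at 8,192-token context")
```

At these assumed dimensions, f16 needs about 4 GiB of KV cache at an 8k context and q8_0 roughly half that, which is why switching the KV Cache Type is the first suggestion above when you run out of memory.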