---
title: llama.cpp Engine
description: Configure Jan's local AI engine for optimal performance.
keywords:
  [
    Jan,
    local AI,
    llama.cpp,
    AI engine,
    local models,
    performance,
    GPU acceleration,
    CPU processing,
    model optimization,
  ]
---
import { Tabs, Callout } from 'nextra/components'
import { Settings } from 'lucide-react'

# Local AI Engine (llama.cpp)
llama.cpp is the engine that runs AI models locally on your computer. It's what makes Jan work without needing internet or cloud services.
## Accessing Engine Settings
Find llama.cpp settings at **Settings** (<Settings width={16} height={16} style={{display:"inline"}}/>) > **Local Engine** > **llama.cpp**:
![llama.cpp](./_assets/llama.cpp-01-updated.png)
<Callout type="info">
Most users don't need to change these settings. Jan picks good defaults for your hardware automatically.
</Callout>
## When to Adjust Settings
You might need to modify these settings if:
- Models load slowly or don't work
- You've installed new hardware (like a graphics card)
- You want to optimize performance for your specific setup
## Engine Management
| Feature | What It Does | When You Need It |
|---------|-------------|------------------|
| **Engine Version** | Shows current llama.cpp version | Check compatibility with newer models |
| **Check Updates** | Downloads engine updates | When new models require updated engine |
| **Backend Selection** | Choose hardware-optimized version | After hardware changes or performance issues |
## Hardware Backends
Different backends are optimized for different hardware. Pick the one that matches your computer:
<Tabs items={['Windows', 'Linux', 'macOS']}>
<Tabs.Tab>

### NVIDIA Graphics Cards (Fastest)

**For CUDA 12.0:**
- `llama.cpp-avx2-cuda-12-0` (most common)
- `llama.cpp-avx512-cuda-12-0` (newer Intel/AMD CPUs)

**For CUDA 11.7:**
- `llama.cpp-avx2-cuda-11-7` (older drivers)

### CPU Only

- `llama.cpp-avx2` (modern CPUs)
- `llama.cpp-avx` (older CPUs)
- `llama.cpp-noavx` (very old CPUs)

### Other Graphics Cards

- `llama.cpp-vulkan` (AMD, Intel Arc)

</Tabs.Tab>
<Tabs.Tab>

### NVIDIA Graphics Cards

- `llama.cpp-avx2-cuda-12-0` (recommended)
- `llama.cpp-avx2-cuda-11-7` (older drivers)

### CPU Only

- `llama.cpp-avx2` (modern CPUs)
- `llama.cpp-arm64` (ARM processors)

### Other Graphics Cards

- `llama.cpp-vulkan` (AMD, Intel graphics)

</Tabs.Tab>
<Tabs.Tab>

### Apple Silicon (M1/M2/M3/M4)

- `llama.cpp-mac-arm64` (recommended)

### Intel Macs

- `llama.cpp-mac-amd64`

<Callout type="info">
Apple Silicon automatically uses GPU acceleration through Metal.
</Callout>

</Tabs.Tab>
</Tabs>
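
Not sure which CPU variant your processor supports? You can check its instruction-set flags directly. A minimal sketch for Linux (reads `/proc/cpuinfo`; on Windows, tools such as CPU-Z report the same flags):

```python
# Minimal sketch (Linux, x86 only): map CPU instruction-set flags to the
# CPU backend names listed above. ARM CPUs report "Features" instead and
# should use the arm64 builds.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
if "avx512f" in flags:
    print("AVX-512 supported (e.g. llama.cpp-avx512-cuda-12-0 with an NVIDIA GPU)")
elif "avx2" in flags:
    print("Use llama.cpp-avx2 (or an avx2 CUDA build for NVIDIA GPUs)")
elif "avx" in flags:
    print("Use llama.cpp-avx")
else:
    print("Use llama.cpp-noavx")
```
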
## Performance Settings
| Setting | What It Does | Recommended | Impact |
|---------|-------------|-------------|---------|
| **Continuous Batching** | Handle multiple requests simultaneously | Enabled | Faster when using tools or multiple chats |
| **Parallel Operations** | Number of concurrent requests | 4 | Higher = more multitasking, uses more memory |
| **CPU Threads** | Processor cores to use | Auto | More threads can speed up CPU processing |
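
Continuous Batching and Parallel Operations only pay off when several requests are in flight at once. Here is a hypothetical sketch that sends four chat requests concurrently to Jan's OpenAI-compatible local server (enable the Local API Server first; the URL, port, and model ID below are placeholders, so substitute the values Jan shows you):

```python
# Hypothetical sketch: fire several requests at once so Continuous
# Batching / Parallel Operations can take effect. URL and model ID are
# placeholders; check Jan's Local API Server settings for yours.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:1337/v1/chat/completions"  # placeholder address
MODEL = "llama3.2-3b-instruct"                     # placeholder model ID

def ask(prompt):
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

prompts = ["Summarize RAM vs VRAM.", "What is a KV cache?",
           "Explain quantization.", "What does mmap do?"]
# With Parallel Operations = 4, these are served concurrently
# instead of queueing one behind another.
with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80], "...")
```
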
## Memory Settings
| Setting | What It Does | Recommended | When to Change |
|---------|-------------|-------------|----------------|
| **Flash Attention** | Computes attention more efficiently, reducing memory use | Enabled | Leave enabled unless problems occur |
| **Caching** | Remember recent conversations | Enabled | Speeds up follow-up questions |
| **KV Cache Type** | Memory vs quality trade-off | f16 | Change to q8_0 if low on memory |
| **mmap** | Efficient model loading | Enabled | Helps with large models |
| **Context Shift** | Drops the oldest tokens to keep a conversation going once the context window fills | Disabled | Enable for very long chats |
### Memory Options Explained
- **f16**: Best quality, uses more memory
- **q8_0**: Balanced memory and quality
- **q4_0**: Least memory, slight quality reduction
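
To see why the cache type matters, you can estimate KV cache size as 2 (K and V) × layers × context length × KV heads × head dimension × bytes per element. A back-of-the-envelope sketch with illustrative Llama-3-8B-style numbers (check your model's metadata for the real values):

```python
# Rough KV-cache size estimate. The architecture numbers below are
# illustrative (Llama-3-8B-like); actual models vary.
layers, kv_heads, head_dim = 32, 8, 128
ctx = 8192  # context length in tokens
bytes_per_elem = {"f16": 2.0, "q8_0": 1.0625, "q4_0": 0.5625}  # approx.

for kind, b in bytes_per_elem.items():
    size = 2 * layers * ctx * kv_heads * head_dim * b
    print(f"{kind}: {size / 2**30:.2f} GiB")
# At 8k context this works out to roughly 1.00 GiB (f16),
# 0.53 GiB (q8_0), and 0.28 GiB (q4_0).
```
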
## Quick Troubleshooting

**Models won't load:**
- Try a different backend
- Check available RAM/VRAM
- Update engine version

**Slow performance:**
- Verify GPU acceleration is active
- Close memory-intensive applications
- Increase GPU Layers in model settings

**Out of memory:**
- Change KV Cache Type to q8_0
- Reduce Context Size in model settings
- Try a smaller model

**Crashes or errors:**
- Switch to a more stable backend (avx instead of avx2)
- Update graphics drivers
- Check system temperature
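
To make the memory checks above concrete, here is a small sketch that prints free RAM and, for NVIDIA cards, VRAM usage (assumes the third-party `psutil` package, installed with `pip install psutil`; `nvidia-smi` ships with NVIDIA drivers):

```python
# Quick check of free RAM and (NVIDIA-only) VRAM before loading a model.
import shutil
import subprocess
import psutil  # third-party: pip install psutil

ram = psutil.virtual_memory()
print(f"RAM free: {ram.available / 2**30:.1f} / {ram.total / 2**30:.1f} GiB")

# nvidia-smi is only available with NVIDIA drivers installed.
if shutil.which("nvidia-smi"):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print("VRAM (used, total):", out.stdout.strip())
```
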
## Quick Setup Guide

**Most users:**
1. Use default settings
2. Only change if problems occur

**NVIDIA GPU users:**
1. Download CUDA backend
2. Set GPU Layers as high as your VRAM allows
3. Enable Flash Attention

**Performance optimization:**
1. Enable Continuous Batching
2. Use appropriate backend for hardware
3. Monitor memory usage
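
Once the server is up, a quick way to confirm everything works end to end is to list the models the engine is serving. A sketch against Jan's OpenAI-compatible API (enable the Local API Server and substitute the address Jan displays):

```python
# Hypothetical smoke test: list models served by the local engine.
# The URL is a placeholder; use the address shown in Jan's settings.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:1337/v1/models") as resp:
    for model in json.load(resp).get("data", []):
        print(model.get("id"))
```
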
<Callout type="info">
The default settings work well for most hardware. Only adjust these if you're experiencing specific issues or want to optimize for your particular setup.
</Callout>