---
title: llama.cpp Engine
description: Configure Jan's local AI engine for optimal performance.
keywords:
  [
    Jan,
    local AI,
    llama.cpp,
    AI engine,
    local models,
    performance,
    GPU acceleration,
    CPU processing,
    model optimization,
  ]
---

import { Tabs, Callout, Steps } from 'nextra/components'
import { Settings } from 'lucide-react'

# Local AI Engine (llama.cpp)

llama.cpp is the engine that runs AI models locally on your computer. It's what lets Jan work without an internet connection or cloud services.

## Accessing Engine Settings

Find llama.cpp settings at **Settings** (<Settings width={16} height={16} style={{display:"inline"}}/>) > **Local Engine** > **llama.cpp**:

![llama.cpp](./_assets/llama.cpp-01-updated.png)

<Callout type="info">
Most users don't need to change these settings. Jan picks good defaults for your hardware automatically.
</Callout>

## When to Adjust Settings

You might need to modify these settings if:
- Models load slowly or don't work
- You've installed new hardware (like a graphics card)
- You want to optimize performance for your specific setup

## Engine Management

| Feature | What It Does | When You Need It |
|---------|-------------|------------------|
| **Engine Version** | Shows the current llama.cpp version | Check compatibility with newer models |
| **Check Updates** | Downloads engine updates | When new models require an updated engine |
| **Backend Selection** | Choose a hardware-optimized version | After hardware changes or performance issues |

## Hardware Backends

Different backends are optimized for different hardware. Pick the one that matches your computer:

<Tabs items={['Windows', 'Linux', 'macOS']}>

<Tabs.Tab>

### NVIDIA Graphics Cards (Fastest)

**For CUDA 12.0:**
- `llama.cpp-avx2-cuda-12-0` (most common)
- `llama.cpp-avx512-cuda-12-0` (newer Intel/AMD CPUs)

**For CUDA 11.7:**
- `llama.cpp-avx2-cuda-11-7` (older drivers)

### CPU Only
- `llama.cpp-avx2` (modern CPUs)
- `llama.cpp-avx` (older CPUs)
- `llama.cpp-noavx` (very old CPUs)

### Other Graphics Cards
- `llama.cpp-vulkan` (AMD, Intel Arc)

</Tabs.Tab>

<Tabs.Tab>

### NVIDIA Graphics Cards
- `llama.cpp-avx2-cuda-12-0` (recommended)
- `llama.cpp-avx2-cuda-11-7` (older drivers)

### CPU Only
- `llama.cpp-avx2` (modern CPUs)
- `llama.cpp-arm64` (ARM processors)

### Other Graphics Cards
- `llama.cpp-vulkan` (AMD, Intel graphics)

</Tabs.Tab>

<Tabs.Tab>

### Apple Silicon (M1/M2/M3/M4)
- `llama.cpp-mac-arm64` (recommended)

### Intel Macs
- `llama.cpp-mac-amd64`

<Callout type="info">
Apple Silicon automatically uses GPU acceleration through Metal.
</Callout>

</Tabs.Tab>

</Tabs>
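
Not sure which instruction sets your CPU supports? Below is a minimal Python sketch that reads standard OS interfaces (`/proc/cpuinfo` on Linux, `sysctl` on Intel Macs). The mapping from CPU flags to backend names is illustrative, not an official Jan lookup:

```python
# Minimal sketch: check which CPU instruction sets this machine supports,
# to help pick a matching llama.cpp backend. Assumes Linux or macOS.
import platform
import subprocess

def cpu_flags() -> set[str]:
    if platform.system() == "Linux":
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    elif platform.system() == "Darwin":
        # On Intel Macs these sysctl keys list CPU features; absent on Apple Silicon
        out = subprocess.run(
            ["sysctl", "-n", "machdep.cpu.features", "machdep.cpu.leaf7_features"],
            capture_output=True, text=True,
        ).stdout.lower()
        return set(out.split())
    return set()

flags = cpu_flags()
if platform.machine().lower() in ("arm64", "aarch64"):
    print("ARM CPU: use an arm64 build")
elif "avx512f" in flags:
    print("AVX-512 supported: avx512 builds should work")
elif "avx2" in flags:
    print("AVX2 supported: avx2 builds should work")
elif "avx" in flags:
    print("AVX supported: use avx builds")
else:
    print("No AVX support detected: use noavx builds")
```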

## Performance Settings

| Setting | What It Does | Recommended | Impact |
|---------|-------------|-------------|---------|
| **Continuous Batching** | Handle multiple requests simultaneously | Enabled | Faster when using tools or multiple chats |
| **Parallel Operations** | Number of concurrent requests | 4 | Higher = more multitasking, uses more memory |
| **CPU Threads** | Processor cores to use | Auto | More threads can speed up CPU processing |
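
To see what Continuous Batching buys you, you can fire several requests at Jan's local API server at once. This is a hedged sketch: the port and model id below are placeholders, so check the API Server settings in your own install before running it:

```python
# Minimal sketch: send several chat requests concurrently to observe
# continuous batching. Assumes Jan's local API server is enabled; the
# URL and model name are placeholders -- check your own Jan settings.
import concurrent.futures
import json
import time
import urllib.request

URL = "http://localhost:1337/v1/chat/completions"  # placeholder port
MODEL = "your-model-id"                            # placeholder model

def ask(prompt: str) -> str:
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

prompts = ["Define RAM.", "Define VRAM.", "Define cache.", "Define swap."]
start = time.time()
# With Continuous Batching enabled and Parallel Operations >= 4, these
# requests are processed together instead of strictly one at a time.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:60], "...")
print(f"Total: {time.time() - start:.1f}s")
```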

## Memory Settings

| Setting | What It Does | Recommended | When to Change |
|---------|-------------|-------------|----------------|
| **Flash Attention** | Efficient memory usage | Enabled | Leave enabled unless problems occur |
| **Caching** | Remember recent conversations | Enabled | Speeds up follow-up questions |
| **KV Cache Type** | Memory vs quality trade-off | f16 | Change to q8_0 if low on memory |
| **mmap** | Efficient model loading | Enabled | Helps with large models |
| **Context Shift** | Handle very long conversations | Disabled | Enable for very long chats |

### Memory Options Explained
- **f16**: Best quality, uses more memory
- **q8_0**: Balanced memory and quality
- **q4_0**: Least memory, slight quality reduction
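
The KV cache grows linearly with context length, so the cache type matters most for long chats. A back-of-the-envelope sketch (the layer and head counts below are illustrative, roughly a 7B-class model; real numbers depend on the exact architecture):

```python
# Rough sketch: estimate KV cache size for different cache types.
# Model dimensions are illustrative; quantized formats also carry a
# small per-block overhead that this ignores.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    # 2x for the K and V tensors, one pair per layer
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

n_layers, n_kv_heads, head_dim = 32, 32, 128   # illustrative 7B-ish shape
context = 8192

for name, nbytes in [("f16", 2), ("q8_0", 1), ("q4_0", 0.5)]:
    gib = kv_cache_bytes(n_layers, n_kv_heads, head_dim, context, nbytes) / 2**30
    print(f"{name}: ~{gib:.1f} GiB at {context} context")
```

With these example dimensions, 8192 tokens of context costs roughly 4 GiB at f16 versus about 2 GiB at q8_0, which is why dropping the cache type is the first lever to pull when memory is tight.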

## Quick Troubleshooting

**Models won't load:**
- Try a different backend
- Check available RAM/VRAM (see the sketch below)
- Update the engine version
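
A quick way to check free memory before blaming the backend; this assumes the third-party `psutil` package is installed (`pip install psutil`):

```python
# Minimal sketch: check free RAM before loading a model.
import psutil

avail_gib = psutil.virtual_memory().available / 2**30
print(f"Available RAM: {avail_gib:.1f} GiB")
# Rule of thumb: a model needs roughly its file size in free memory,
# plus headroom for the KV cache (see the estimate above).
```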

**Slow performance:**
- Verify GPU acceleration is active (see the check below)
- Close memory-intensive applications
- Increase GPU Layers in model settings
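
On NVIDIA hardware you can confirm the GPU is doing the work with the standard `nvidia-smi` tool; if VRAM usage stays near zero while a model is loaded, the CUDA backend probably isn't active:

```python
# Minimal sketch: check GPU utilization and VRAM use while a model is
# loaded. NVIDIA-only; relies on the standard nvidia-smi tool.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
     "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()
print(out)  # e.g. "85 %, 5632 MiB"
```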

**Out of memory:**
- Change KV Cache Type to q8_0
- Reduce Context Size in model settings
- Try a smaller model

**Crashes or errors:**
- Switch to a more stable backend (avx instead of avx2)
- Update graphics drivers
- Check system temperature

## Quick Setup Guide

**Most users:**
1. Use default settings
2. Only change if problems occur

**NVIDIA GPU users:**
1. Download a CUDA backend
2. Ensure GPU Layers is set high
3. Enable Flash Attention

**Performance optimization:**
1. Enable Continuous Batching
2. Use the appropriate backend for your hardware
3. Monitor memory usage

<Callout type="info">
The default settings work well for most hardware. Only adjust these if you're experiencing specific issues or want to optimize for your particular setup.
</Callout>