---
title: LlamaCPP Extension
slug: /guides/providers/llamacpp
sidebar_position: 1
description: A step-by-step guide on how to customize the LlamaCPP extension.
keywords:
[
Jan AI,
Jan,
ChatGPT alternative,
local AI,
private AI,
conversational AI,
no-subscription fee,
large language model,
Llama CPP integration,
LlamaCPP Extension,
]
---
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
## Overview
[Nitro](https://github.com/janhq/nitro) is an inference server built on top of [llama.cpp](https://github.com/ggerganov/llama.cpp). It provides an OpenAI-compatible API, request queueing, and scaling.
Nitro is the default AI engine downloaded with Jan, so no additional setup is needed.
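Because the API is OpenAI-compatible, you can test the engine with a standard chat-completions request. The sketch below assumes Nitro's default local port (`3928`) and a model that has already been loaded; adjust both to your setup:
```sh
# Assumes Nitro's default local port (3928) and an already-loaded model.
curl http://localhost:3928/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```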
In this guide, we'll walk you through customizing your engine settings by configuring the `nitro.json` file.
1. Navigate to the `App Settings` > `Advanced` > `Open App Directory` > `~/jan/engines` folder.
<Tabs>
<TabItem value="mac" label="MacOS" default>
```sh
cd ~/jan/engines
```
</TabItem>
<TabItem value="windows" label="Windows">
```sh
cd C:/Users/<your_user_name>/jan/engines
```
</TabItem>
<TabItem value="linux" label="Linux">
```sh
cd ~/jan/engines
```
</TabItem>
</Tabs>
2. Modify the `nitro.json` file based on your needs. The default settings are shown below.
```json title="~/jan/engines/nitro.json"
{
  "ctx_len": 2048,
  "ngl": 100,
  "cpu_threads": 1,
  "cont_batching": false,
  "embedding": false
}
```
The table below describes the parameters in the `nitro.json` file.
| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `ctx_len` | **Integer** | The context length, in tokens, available to the model. The default of `2048` provides ample context for most models. (*Minimum*: `1`, *Maximum*: `4096`) |
| `ngl` | **Integer** | The number of model layers to offload to the GPU. Defaults to `100`, which offloads all layers. |
| `cpu_threads` | **Integer** | The number of CPU threads used for inference. The maximum is determined by your hardware and OS. |
| `cont_batching` | **Boolean** | Enables continuous batching, which improves throughput for LLM inference. |
| `embedding` | **Boolean** | Enables embedding generation, used for tasks like document-enhanced chat in RAG-based applications. |
:::tip
- By default, `ngl` is set to `100`, which offloads all model layers to the GPU. To offload only about half of the layers, set `ngl` to `15`, since most Mistral and Llama models have roughly 30 layers.
- To use the embedding feature, set `"embedding": true`. This enables Nitro to process inference requests with embedding capabilities. See [Embedding in the Nitro documentation](https://nitro.jan.ai/features/embed) for a more detailed explanation.
- To use continuous batching, which boosts throughput and minimizes latency in LLM inference, set `"cont_batching": true`. For details, see [Continuous Batching in the Nitro documentation](https://nitro.jan.ai/features/cont-batch). A combined example is sketched after this tip.
:::
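Putting these options together, a customized `nitro.json` that enables both embedding and continuous batching might look like the sketch below. The values are illustrative, not recommendations: match `cpu_threads` to your CPU and keep `ctx_len` within your model's supported context window.
```json title="~/jan/engines/nitro.json"
{
  "ctx_len": 4096,
  "ngl": 100,
  "cpu_threads": 4,
  "cont_batching": true,
  "embedding": true
}
```
Restart Jan after editing the file so the engine picks up the new settings.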
:::info[Assistance and Support]
If you have questions, please join our [Discord community](https://discord.gg/Dt7MxDyNNZ) for support, updates, and discussions.
:::