---
title: LlamaCPP Extension
slug: /guides/engines/llamacpp
sidebar_position: 1
description: A step-by-step guide on how to customize the LlamaCPP extension.
keywords:
  [
    Jan AI,
    Jan,
    ChatGPT alternative,
    local AI,
    private AI,
    conversational AI,
    no-subscription fee,
    large language model,
    Llama CPP integration,
    LlamaCPP Extension,
  ]
---


import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

## Overview
[Nitro](https://github.com/janhq/nitro) is an inference server built on top of [llama.cpp](https://github.com/ggerganov/llama.cpp). It provides an OpenAI-compatible API, request queueing, and scaling.

## LlamaCPP Extension
:::note
Nitro is the default AI engine downloaded with Jan. There is no additional setup needed.
:::

This guide walks you through customizing your engine settings by editing the `nitro.json` file.

1. Navigate to `App Settings` > `Advanced` > `Open App Directory`, then open the `~/jan/engines` folder.

<Tabs>
    <TabItem value="mac" label="macOS" default>
        ```sh
        cd ~/jan/engines
        ```
    </TabItem>
    <TabItem value="windows" label="Windows">
        ```sh
        cd C:/Users/<your_user_name>/jan/engines
        ```
    </TabItem>
    <TabItem value="linux" label="Linux">
        ```sh
        cd ~/jan/engines
        ```
    </TabItem>
</Tabs>

2. Modify the `nitro.json` file based on your needs. The default settings are shown below.

```json title="~/jan/engines/nitro.json"
{
  "ctx_len": 2048,
  "ngl": 100,
  "cpu_threads": 1,
  "cont_batching": false,
  "embedding": false
}
```
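
For illustration, a customized file might raise the context length and the thread count. This is only a sketch: `4096` and `4` are assumed values, chosen for a hypothetical machine with at least four CPU cores and a model that supports a 4096-token context. The remaining keys keep their defaults.

```json title="~/jan/engines/nitro.json"
{
  "ctx_len": 4096,
  "ngl": 100,
  "cpu_threads": 4,
  "cont_batching": false,
  "embedding": false
}
```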

The table below describes the parameters in the `nitro.json` file.

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `ctx_len` | **Integer** | Context length in tokens. Typically set to `2048`, which provides ample context for model operations comparable to `GPT-3.5`. (*Maximum*: `4096`, *Minimum*: `1`) |
| `ngl` | **Integer** | Number of model layers offloaded to the GPU. Defaults to `100`. |
| `cpu_threads` | **Integer** | Number of CPU threads used for inference, limited by hardware and OS. (*Maximum* determined by system) |
| `cont_batching` | **Boolean** | Enables continuous batching, which improves throughput for LLM inference. |
| `embedding` | **Boolean** | Enables embeddings for tasks like document-enhanced chat in RAG-based applications. |

:::tip
 - By default, `ngl` is set to `100`, which offloads all model layers to the GPU. To offload only about 50% of the layers, set `ngl` to `15`, since most Mistral and Llama models have around 30 layers.
 - To use the embedding feature, set `"embedding": true`. This enables Nitro to process inferences with embedding capabilities. See [Embedding in the Nitro documentation](https://nitro.jan.ai/features/embed) for a more detailed explanation.
 - To use continuous batching, which boosts throughput and minimizes latency in large language model (LLM) inference, set `"cont_batching": true`. For details, see [Continuous Batching in the Nitro documentation](https://nitro.jan.ai/features/cont-batch).

:::
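
Taken together, the tips above could produce a `nitro.json` like the following. It is a sketch under stated assumptions: `ngl` is `15` because the model is assumed to have around 30 layers and about half of them should go to the GPU. `ctx_len` and `cpu_threads` are left at their defaults.

```json title="~/jan/engines/nitro.json"
{
  "ctx_len": 2048,
  "ngl": 15,
  "cpu_threads": 1,
  "cont_batching": true,
  "embedding": true
}
```

If your model has a different layer count, scale `ngl` accordingly, for example to roughly half the layer count for a 50% offload.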

:::info[Assistance and Support]

If you have questions, please join our [Discord community](https://discord.gg/Dt7MxDyNNZ) for support, updates, and discussions.

:::