docs: trt-llm extension guides
docs/docs/guides/inference/README.mdx (new file)
@@ -0,0 +1,8 @@
---
title: Extensions
slug: /guides/inference/
---

import DocCardList from "@theme/DocCardList";

<DocCardList />
docs/docs/guides/inference/image.png (new binary file, 27 KiB)
docs/docs/guides/inference/llama-cpp.md (new file)
@@ -0,0 +1,11 @@
---
title: Llama-CPP Extension
---

## Overview

[LlamaCPP](https://github.com/ggerganov/llama.cpp) is the default AI engine that ships with Jan. It is served through Nitro, a C++ inference server that handles additional UX and hardware optimizations.

The source code for Nitro-llama-cpp is available [here](https://github.com/janhq/nitro).

No additional setup is needed.
docs/docs/guides/inference/tensorrt-llm.md (new file)
@@ -0,0 +1,83 @@
---
title: TensorRT-LLM Extension
---

Users with Nvidia GPUs can get 20-40% faster* token speeds on their laptops or desktops by using [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).

This guide walks you through installing Jan's official [TensorRT-LLM Extension](https://github.com/janhq/nitro-tensorrt-llm). This extension uses [Nitro-TensorRT-LLM](https://github.com/janhq/nitro-tensorrt-llm) as the AI engine instead of the default [Nitro-Llama-CPP](https://github.com/janhq/nitro). It includes an efficient C++ server that natively executes the [TRT-LLM C++ runtime](https://nvidia.github.io/TensorRT-LLM/gpt_runtime.html), and it adds feature and performance improvements such as OpenAI compatibility, tokenizer improvements, and queues.

\*Compared to using the LlamaCPP engine.

:::info
This feature is currently only available to Windows users. Linux support is coming soon.
:::

## Requirements

- A Windows PC
- Nvidia GPU(s): Ada or Ampere series (i.e., RTX 4000s and 3000s). More will be supported soon.
- 3GB+ of disk space to download TRT-LLM artifacts and a Nitro binary
- Jan v0.4.9+ or Jan v0.4.8-321+ (nightly)
- Nvidia Driver v535+ ([installation guide](https://jan.ai/guides/common-error/not-using-gpu/#1-ensure-gpu-mode-requirements)); you can verify the driver and CUDA versions with the snippet after this list
- CUDA Toolkit v12.2+ ([installation guide](https://jan.ai/guides/common-error/not-using-gpu/#1-ensure-gpu-mode-requirements))
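Before installing, you can confirm the driver and CUDA requirements from a terminal. This is only a quick sketch using the standard Nvidia tools; the exact output format varies between machines.

```sh
# Driver version is shown in the header of the nvidia-smi output (needs v535+)
nvidia-smi

# CUDA Toolkit version (needs v12.2+); this requires the toolkit itself, not just the driver
nvcc --version
```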
## Install TensorRT-Extension

1. Go to Settings > Extensions.
2. Click Install next to the TensorRT-LLM Extension.
3. Check that the files downloaded correctly:

```sh
ls ~\jan\extensions\@janhq\tensorrt-llm-extension\dist\bin
# Your extension folder should now include `nitro.exe`, among other artifacts needed to run TRT-LLM
```
## Download a Compatible Model

TensorRT-LLM can only run models in `TensorRT` format. These models, a.k.a. "TensorRT Engines", are prebuilt specifically for each target OS and GPU architecture.

We offer a handful of precompiled models for Ampere and Ada cards that you can download and play with immediately:

1. Restart the application and go to the Hub.
2. Look for models with the `TensorRT-LLM` label in the recommended models list and click Download. This step might take some time. 🙏

![image](image.png)

3. Click Use and start chatting! A sketch of reaching the model programmatically follows below.
4. You may need to allow Nitro in your network.

![alt text](image-1.png)
:::info
Due to our limited resources, we have only prebuilt a few demo models. You can always build your desired models directly on your machine. [Read here](#build-your-own-tensorrt-models).
:::
## Configure Settings

You can customize the default parameters for how Jan runs TensorRT-LLM.

:::info
Coming soon.
:::

## Troubleshooting

### Incompatible Extension vs. Engine versions

For now, model versions are pinned to extension versions.

### Uninstall Extension

1. Quit the app.
2. Go to Settings > Extensions.
3. Delete the entire Extensions folder. A command-line equivalent is sketched after this list.
4. Reopen the app; only the default extensions should be restored.
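A minimal PowerShell sketch of step 3, assuming Jan's data folder lives at `~\jan` as in the path used earlier in this guide:

```sh
# Remove the entire Extensions folder (assumes Jan's data folder is ~\jan).
# Make sure Jan is fully closed before running this.
Remove-Item -Recurse -Force "$HOME\jan\extensions"
```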
### Install Nitro-TensorRT-LLM manually

To manually build the artifacts needed to run the server and TensorRT-LLM, you can reference the source code. [Read here](https://github.com/janhq/nitro-tensorrt-llm?tab=readme-ov-file#quickstart).

### Build your own TensorRT models

:::info
Coming soon.
:::
@@ -199,6 +199,19 @@ const sidebars = {
       "guides/models/integrate-remote",
     ]
   },
+  {
+    type: "category",
+    label: "Inference Providers",
+    className: "head_SubMenu",
+    link: {
+      type: 'doc',
+      id: "guides/inference/README",
+    },
+    items: [
+      "guides/inference/llama-cpp",
+      "guides/inference/tensorrt-llm",
+    ]
+  },
   {
     type: "category",
     label: "Extensions",