docs: trt-llm extension guides
This commit is contained in:
parent d9c3852997
commit f878555598

docs/docs/guides/inference/README.mdx (new file, 8 lines)
@ -0,0 +1,8 @@
---
title: Extensions
slug: /guides/inference/
---

import DocCardList from "@theme/DocCardList";

<DocCardList />

BIN docs/docs/guides/inference/image.png (new binary file, 27 KiB; not shown)

docs/docs/guides/inference/llama-cpp.md (new file, 11 lines)
@ -0,0 +1,11 @@
---
title: Llama-CPP Extension
---

## Overview

[LlamaCPP](https://github.com/ggerganov/llama.cpp) is the default AI engine downloaded with Jan. It is served through Nitro, a C++ inference server that handles additional UX and hardware optimizations.

The source code for Nitro-llama-cpp is [here](https://github.com/janhq/nitro).

There is no additional setup needed.
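
To confirm the engine is up once Jan is running, you can hit Nitro's local health endpoint. This is a minimal sketch, assuming Nitro's default local port (3928) and its `/healthz` route; adjust if your install differs.

```sh
# Should return a success response while the Llama-CPP engine is loaded
curl http://localhost:3928/healthz
```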

docs/docs/guides/inference/tensorrt-llm.md (new file, 83 lines)
@ -0,0 +1,83 @@
---
title: TensorRT-LLM Extension
---

Users with Nvidia GPUs can get 20-40% faster* token speeds on their laptops or desktops by using [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).

This guide walks you through how to install Jan's official [TensorRT-LLM Extension](https://github.com/janhq/nitro-tensorrt-llm). This extension uses [Nitro-TensorRT-LLM](https://github.com/janhq/nitro-tensorrt-llm) as the AI engine instead of the default [Nitro-Llama-CPP](https://github.com/janhq/nitro). It includes an efficient C++ server that natively executes the [TRT-LLM C++ runtime](https://nvidia.github.io/TensorRT-LLM/gpt_runtime.html). It also comes with additional features and performance improvements such as OpenAI compatibility, tokenizer improvements, and queues.

*Compared to using the LlamaCPP engine.

:::info
This feature is currently only available to Windows users. Linux support is coming soon.
:::

## Requirements

- A Windows PC
- Nvidia GPU(s): Ada or Ampere series (i.e. RTX 4000s & 3000s). More will be supported soon.
- 3GB+ of disk space to download TRT-LLM artifacts and a Nitro binary
- Jan v0.4.9+ or Jan v0.4.8-321+ (nightly)
- Nvidia Driver v535+ ([installation guide](https://jan.ai/guides/common-error/not-using-gpu/#1-ensure-gpu-mode-requirements))
- CUDA Toolkit v12.2+ ([installation guide](https://jan.ai/guides/common-error/not-using-gpu/#1-ensure-gpu-mode-requirements))
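
A quick way to confirm the driver and CUDA Toolkit requirements from a terminal (assuming `nvidia-smi` and `nvcc` are on your PATH):

```sh
# The driver version appears in the header of the output; it should be 535 or newer
nvidia-smi
# Should report "release 12.2" or newer
nvcc --version
```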

## Install TensorRT-Extension

1. Go to Settings > Extensions
2. Click Install next to the TensorRT-LLM Extension
3. Check that the files were downloaded correctly

```sh
ls ~\jan\extensions\@janhq\tensorrt-llm-extension\dist\bin
# Your Extension folder should now include `nitro.exe`, among other artifacts needed to run TRT-LLM
```
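
The listing above assumes a PowerShell session (where `ls` and `~` resolve). From a classic Command Prompt, an equivalent check might look like this:

```sh
dir %USERPROFILE%\jan\extensions\@janhq\tensorrt-llm-extension\dist\bin
```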

## Download a Compatible Model

TensorRT-LLM can only run models in the `TensorRT` format. These models, also known as "TensorRT engines", are prebuilt specifically for each target OS and GPU architecture.

We offer a handful of precompiled models for Ampere and Ada cards that you can immediately download and play with:

1. Restart the application and go to the Hub.
2. Look for models with the `TensorRT-LLM` label in the recommended models list. Click download. This step might take some time. 🙏



3. Click use and start chatting!
4. You may need to allow Nitro in your network (a quick check is sketched below).



:::info
Due to our limited resources, we have only prebuilt a few demo models. You can always build your desired models directly on your machine. [Read here](#build-your-own-tensorrt-models).
:::

## Configure Settings

You can customize the default parameters for how Jan runs TensorRT-LLM.

:::info
coming soon
:::

## Troubleshooting

### Incompatible Extension vs Engine versions

For now, the model versions are pinned to the extension versions.

### Uninstall Extension

1. Quit the app.
2. Go to Settings > Extensions.
3. Delete the entire Extensions folder.
4. Reopen the app; only the default extensions should be restored.
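
If you prefer the terminal, deleting the folder from the data directory used earlier in this guide would look roughly like this (note that it removes ALL extensions, not just TensorRT-LLM):

```sh
Remove-Item -Recurse -Force ~\jan\extensions
```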

### Install Nitro-TensorRT-LLM manually

To manually build the artifacts needed to run the server and TensorRT-LLM, you can reference the source code. [Read here](https://github.com/janhq/nitro-tensorrt-llm?tab=readme-ov-file#quickstart).
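
A hedged sketch of the first step, cloning the repository linked above (the `--recursive` flag is an assumption, in case the project pulls in git submodules):

```sh
git clone --recursive https://github.com/janhq/nitro-tensorrt-llm
# Then follow the build steps from the repository's Quickstart linked above
```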

### Build your own TensorRT models

:::info
coming soon
:::

@ -199,6 +199,19 @@ const sidebars = {
        "guides/models/integrate-remote",
      ]
    },
    {
      type: "category",
      label: "Inference Providers",
      className: "head_SubMenu",
      link: {
        type: 'doc',
        id: "guides/inference/README",
      },
      items: [
        "guides/inference/llama-cpp",
        "guides/inference/tensorrt-llm",
      ]
    },
    {
      type: "category",
      label: "Extensions",