docs: trt-llm extension guides
docs/docs/guides/inference/README.mdx (new file)
@@ -0,0 +1,8 @@
---
title: Extensions
slug: /guides/inference/
---

import DocCardList from "@theme/DocCardList";

<DocCardList />
docs/docs/guides/inference/image.png (new binary file, 27 KiB)
docs/docs/guides/inference/llama-cpp.md (new file)
@@ -0,0 +1,11 @@
---
title: Llama-CPP Extension
---

## Overview

[LlamaCPP](https://github.com/ggerganov/llama.cpp) is the default AI engine that ships with Jan. It is served through Nitro, a C++ inference server that handles additional UX and hardware optimizations.

The source code for Nitro-llama-cpp is available [here](https://github.com/janhq/nitro).

No additional setup is needed.
docs/docs/guides/inference/tensorrt-llm.md (new file)
@@ -0,0 +1,83 @@
---
title: TensorRT-LLM Extension
---

Users with Nvidia GPUs can get 20-40% faster* token speeds on their laptops or desktops by using [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).

This guide walks you through installing Jan's official [TensorRT-LLM Extension](https://github.com/janhq/nitro-tensorrt-llm). This extension uses [Nitro-TensorRT-LLM](https://github.com/janhq/nitro-tensorrt-llm) as the AI engine instead of the default [Nitro-Llama-CPP](https://github.com/janhq/nitro). It includes an efficient C++ server that natively executes the [TRT-LLM C++ runtime](https://nvidia.github.io/TensorRT-LLM/gpt_runtime.html), and it adds feature and performance improvements such as OpenAI compatibility, tokenizer improvements, and queues.

\*Compared to using the LlamaCPP engine.

:::info
This feature is currently only available to Windows users. Linux support is coming soon.
:::

## Requirements

- A Windows PC
- Nvidia GPU(s): Ada or Ampere series (i.e., RTX 4000s and 3000s). More will be supported soon.
- 3GB+ of disk space to download TRT-LLM artifacts and a Nitro binary
- Jan v0.4.9+ or Jan v0.4.8-321+ (nightly)
- Nvidia Driver v535+ ([installation guide](https://jan.ai/guides/common-error/not-using-gpu/#1-ensure-gpu-mode-requirements)); you can verify the driver and CUDA versions with the snippet after this list
- CUDA Toolkit v12.2+ ([installation guide](https://jan.ai/guides/common-error/not-using-gpu/#1-ensure-gpu-mode-requirements))
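Before installing, you can confirm the driver and CUDA requirements from a terminal. This is only a quick sketch using the standard Nvidia tools; the exact output format varies between machines.

```sh
# Driver version is shown in the header of the nvidia-smi output (needs v535+)
nvidia-smi

# CUDA Toolkit version (needs v12.2+); this requires the toolkit itself, not just the driver
nvcc --version
```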
## Install TensorRT-Extension

1. Go to Settings > Extensions.
2. Click Install next to the TensorRT-LLM Extension.
3. Check that the files downloaded correctly:

```sh
ls ~\jan\extensions\@janhq\tensorrt-llm-extension\dist\bin
# Your extension folder should now include `nitro.exe`, among other artifacts needed to run TRT-LLM
```
## Download a Compatible Model

TensorRT-LLM can only run models in `TensorRT` format. These models, a.k.a. "TensorRT Engines", are prebuilt specifically for each target OS and GPU architecture.

We offer a handful of precompiled models for Ampere and Ada cards that you can download and play with immediately:

1. Restart the application and go to the Hub.
2. Look for models with the `TensorRT-LLM` label in the recommended models list and click Download. This step might take some time. 🙏

![image](image.png)

3. Click Use and start chatting! A sketch of reaching the model programmatically follows below.
4. You may need to allow Nitro in your network.

![alt text](image-1.png)
:::info
Due to our limited resources, we have only prebuilt a few demo models. You can always build your desired models directly on your machine. [Read here](#build-your-own-tensorrt-models).
:::
## Configure Settings

You can customize the default parameters for how Jan runs TensorRT-LLM.

:::info
Coming soon.
:::

## Troubleshooting

### Incompatible Extension vs. Engine versions

For now, model versions are pinned to extension versions.

### Uninstall Extension

1. Quit the app.
2. Go to Settings > Extensions.
3. Delete the entire Extensions folder. A command-line equivalent is sketched after this list.
4. Reopen the app; only the default extensions should be restored.
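A minimal PowerShell sketch of step 3, assuming Jan's data folder lives at `~\jan` as in the path used earlier in this guide:

```sh
# Remove the entire Extensions folder (assumes Jan's data folder is ~\jan).
# Make sure Jan is fully closed before running this.
Remove-Item -Recurse -Force "$HOME\jan\extensions"
```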
### Install Nitro-TensorRT-LLM manually

To manually build the artifacts needed to run the server and TensorRT-LLM, you can reference the source code. [Read here](https://github.com/janhq/nitro-tensorrt-llm?tab=readme-ov-file#quickstart).

### Build your own TensorRT models

:::info
Coming soon.
:::
@@ -199,6 +199,19 @@ const sidebars = {
       "guides/models/integrate-remote",
     ]
   },
+  {
+    type: "category",
+    label: "Inference Providers",
+    className: "head_SubMenu",
+    link: {
+      type: 'doc',
+      id: "guides/inference/README",
+    },
+    items: [
+      "guides/inference/llama-cpp",
+      "guides/inference/tensorrt-llm",
+    ]
+  },
   {
     type: "category",
     label: "Extensions",