diff --git a/docs/docs/guides/inference/README.mdx b/docs/docs/guides/inference/README.mdx
new file mode 100644
index 000000000..289fd8241
--- /dev/null
+++ b/docs/docs/guides/inference/README.mdx
@@ -0,0 +1,8 @@
+---
+title: Inference Providers
+slug: /guides/inference/
+---
+
+import DocCardList from "@theme/DocCardList";
+
+<DocCardList />
diff --git a/docs/docs/guides/inference/image.png b/docs/docs/guides/inference/image.png
new file mode 100644
index 000000000..5f1f7104e
Binary files /dev/null and b/docs/docs/guides/inference/image.png differ
diff --git a/docs/docs/guides/inference/llama-cpp.md b/docs/docs/guides/inference/llama-cpp.md
new file mode 100644
index 000000000..470424dbb
--- /dev/null
+++ b/docs/docs/guides/inference/llama-cpp.md
@@ -0,0 +1,11 @@
+---
+title: Llama-CPP Extension
+---
+
+## Overview
+
+[LlamaCPP](https://github.com/ggerganov/llama.cpp) is the default AI engine downloaded with Jan. It is served through Nitro, a C++ inference server that handles additional UX and hardware optimizations.
+
+The source code for Nitro-llama-cpp is available [here](https://github.com/janhq/nitro).
+
+No additional setup is needed.
\ No newline at end of file
diff --git a/docs/docs/guides/inference/tensorrt-llm.md b/docs/docs/guides/inference/tensorrt-llm.md
new file mode 100644
index 000000000..d795d55e6
--- /dev/null
+++ b/docs/docs/guides/inference/tensorrt-llm.md
@@ -0,0 +1,83 @@
+---
+title: TensorRT-LLM Extension
+---
+
+Users with Nvidia GPUs can get 20-40% faster* token speeds on their laptops or desktops by using [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).
+
+This guide walks you through installing Jan's official [TensorRT-LLM Extension](https://github.com/janhq/nitro-tensorrt-llm). This extension uses [Nitro-TensorRT-LLM](https://github.com/janhq/nitro-tensorrt-llm) as the AI engine instead of the default [Nitro-Llama-CPP](https://github.com/janhq/nitro). It includes an efficient C++ server that natively executes the [TRT-LLM C++ runtime](https://nvidia.github.io/TensorRT-LLM/gpt_runtime.html). It also comes with additional features and performance improvements, such as OpenAI compatibility, tokenizer improvements, and queues.
+
+*Compared to using the LlamaCPP engine.
+
+:::info
+This feature is only available for Windows users. Linux support is coming soon.
+:::
+
+## Requirements
+
+- A Windows PC
+- Nvidia GPU(s): Ada or Ampere series (i.e., RTX 40- and 30-series). More will be supported soon.
+- 3GB+ of disk space to download TRT-LLM artifacts and a Nitro binary
+- Jan v0.4.9+ or Jan v0.4.8-321+ (nightly)
+- Nvidia Driver v535+ ([installation guide](https://jan.ai/guides/common-error/not-using-gpu/#1-ensure-gpu-mode-requirements))
+- CUDA Toolkit v12.2+ ([installation guide](https://jan.ai/guides/common-error/not-using-gpu/#1-ensure-gpu-mode-requirements))
+
+## Install TensorRT-Extension
+
+1. Go to Settings > Extensions
+2. Click **Install** next to the TensorRT-LLM Extension
+3. Check that the files downloaded correctly
+
+```powershell
+ls ~\jan\extensions\@janhq\tensorrt-llm-extension\dist\bin
+# Your Extension Folder should now include `nitro.exe`, among other artifacts needed to run TRT-LLM
+```
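+
+After you download and run a model (next section), you can optionally smoke-test the engine from the command line. This is a minimal sketch assuming the Nitro server exposes its OpenAI-compatible route on Nitro's default port `3928`; the port and the model ID below are placeholders to adjust for your setup.
+
+```powershell
+# Send a one-off chat completion to the local Nitro server.
+# `curl.exe` forces the real curl binary rather than PowerShell's Invoke-WebRequest alias.
+curl.exe http://localhost:3928/v1/chat/completions `
+  -H "Content-Type: application/json" `
+  -d '{"model": "<downloaded-model-id>", "messages": [{"role": "user", "content": "Hello!"}]}'
+```
+
+A JSON response containing a `choices` array indicates the TensorRT-LLM engine is serving requests.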
+
+## Download a Compatible Model
+
+TensorRT-LLM can only run models in `TensorRT` format. These models, aka "TensorRT engines", are prebuilt specifically for each target OS and GPU architecture.
+
+We offer a handful of precompiled models for Ampere and Ada cards that you can download and play with immediately:
+
+1. Restart the application and go to the Hub
+2. Look for models with the `TensorRT-LLM` label in the recommended models list. Click **Download**. This step might take some time. 🙏
+
+![image](https://hackmd.io/_uploads/rJewrEgRp.png)
+
+3. Click **Use** and start chatting!
+4. You may need to allow Nitro through your network firewall
+
+![Allow Nitro network access](image.png)
+
+:::info
+Due to our limited resources, we have only prebuilt a few demo models. You can always build your desired models directly on your machine. [Read here](#build-your-own-tensorrt-models).
+:::
+
+## Configure Settings
+
+You can customize the default parameters for how Jan runs TensorRT-LLM.
+
+:::info
+Coming soon
+:::
+
+## Troubleshooting
+
+### Incompatible Extension vs. Engine versions
+
+For now, model versions are pinned to extension versions.
+
+### Uninstall Extension
+
+1. Quit the app.
+2. Go to Settings > Extensions.
+3. Delete the entire Extensions folder.
+4. Reopen the app; only the default extensions should be restored.
+
+### Install Nitro-TensorRT-LLM manually
+
+To manually build the artifacts needed to run the server and TensorRT-LLM, you can reference the source code. [Read here](https://github.com/janhq/nitro-tensorrt-llm?tab=readme-ov-file#quickstart).
+
+### Build your own TensorRT models
+
+:::info
+Coming soon
+:::
diff --git a/docs/sidebars.js b/docs/sidebars.js
index 4c45cadbe..26cb09eb2 100644
--- a/docs/sidebars.js
+++ b/docs/sidebars.js
@@ -199,6 +199,19 @@ const sidebars = {
         "guides/models/integrate-remote",
       ]
     },
+    {
+      type: "category",
+      label: "Inference Providers",
+      className: "head_SubMenu",
+      link: {
+        type: 'doc',
+        id: "guides/inference/README",
+      },
+      items: [
+        "guides/inference/llama-cpp",
+        "guides/inference/tensorrt-llm",
+      ]
+    },
     {
       type: "category",
       label: "Extensions",