From 806afcd197c503c051ee7841a6317e6a3a0a0f06 Mon Sep 17 00:00:00 2001
From: Daniel
Date: Tue, 19 Mar 2024 22:27:09 +0800
Subject: [PATCH] Initial commit of TensorRT-LLM blog post

---
 ...01-10-bitdefender-false-positive-flag.mdx} |   0
 docs/blog/2024-03-19-TensorRT-LLM.md          | 116 ++++++++++++++++++
 2 files changed, 116 insertions(+)
 rename docs/blog/{01-january-10-2024-bitdefender-false-positive-flag.mdx => 2024-01-10-bitdefender-false-positive-flag.mdx} (100%)
 create mode 100644 docs/blog/2024-03-19-TensorRT-LLM.md

diff --git a/docs/blog/01-january-10-2024-bitdefender-false-positive-flag.mdx b/docs/blog/2024-01-10-bitdefender-false-positive-flag.mdx
similarity index 100%
rename from docs/blog/01-january-10-2024-bitdefender-false-positive-flag.mdx
rename to docs/blog/2024-01-10-bitdefender-false-positive-flag.mdx
diff --git a/docs/blog/2024-03-19-TensorRT-LLM.md b/docs/blog/2024-03-19-TensorRT-LLM.md
new file mode 100644
index 000000000..08f1a1d1a
--- /dev/null
+++ b/docs/blog/2024-03-19-TensorRT-LLM.md
@@ -0,0 +1,116 @@
---
title: Jan now supports TensorRT-LLM
description: Jan has added support for Nvidia's TensorRT-LLM, a hardware-optimized LLM inference engine that runs very fast on Nvidia GPUs
tags: [Nvidia, TensorRT-LLM]
---

Jan now supports [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) as an alternative inference engine. TensorRT-LLM is a hardware-optimized LLM inference engine that compiles models to [run extremely fast on Nvidia GPUs](https://blogs.nvidia.com/blog/tensorrt-llm-windows-stable-diffusion-rtx/).

- The [TensorRT-LLM Extension](/guides/providers/tensorrt-llm) is available in the [0.4.9 release](https://github.com/janhq/jan/releases/tag/v0.4.9)
- It is currently available for Windows only

We've made a few TensorRT-LLM models available in the Jan Hub for download:

- TinyLlama-1.1b
- Mistral 7b
- TinyJensen-1.1b, which is trained on Jensen Huang's keynotes πŸ‘€

## What is TensorRT-LLM?

For the full details, please read our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).

In short, TensorRT-LLM is mainly used on datacenter-grade GPUs, where it achieves speeds on the order of [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html).

## Performance Benchmarks

We were curious to see how TensorRT-LLM would perform on consumer-grade GPUs, which is what most of Jan's users run. So we compared it against llama.cpp, our default inference engine, on the following hardware:

| NVIDIA GPU | Architecture | VRAM (GB) | CUDA Cores | Tensor Cores | Memory Bus Width (bit) | Memory Bandwidth (GB/s) |
| ---------- | ------------ | --------- | ---------- | ------------ | ---------------------- | ----------------------- |
| RTX 4090   | Ada          | 24        | 16,384     | 512          | 384                    | 1,008                   |
| RTX 3090   | Ampere       | 24        | 10,496     | 328          | 384                    | 935.8                   |
| RTX 4060   | Ada          | 8         | 3,072      | 96           | 128                    | 272                     |

> We test with batch size 1, an input length of 2048, and an output length of 512, as this reflects the most common usage pattern. Each configuration is run 5 times and we report the average.

> We use Windows Task Manager on Windows, and nvidia-smi/htop on Linux, to collect per-process CPU, memory, and GPU metrics.

> We close all other user applications and run only the Jan app (with the Nitro TensorRT-LLM extension) or NVIDIA's Python benchmark script.
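If you'd like to reproduce the throughput numbers below, here is a minimal sketch of one way to time a streaming completion against Jan's local OpenAI-compatible server and derive tokens/s. This is an illustration, not our exact benchmark script: the base URL (we assume Jan's default local server at `http://localhost:1337`) and the model ID are assumptions that may differ on your setup, and counting one streamed chunk as one token is an approximation.

```python
import json
import time

import requests

# Assumptions: Jan's local API server is running at this address with an
# OpenAI-compatible API, and the model below is downloaded and started.
# Adjust both for your own setup.
BASE_URL = "http://localhost:1337/v1"
MODEL_ID = "mistral-7b"  # hypothetical model ID

payload = {
    "model": MODEL_ID,
    "messages": [{"role": "user", "content": "Write a short story about a GPU."}],
    "max_tokens": 512,
    "stream": True,  # stream so we can count tokens as they arrive
}

start = time.time()
tokens = 0

with requests.post(f"{BASE_URL}/chat/completions", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # Server-sent events arrive as lines of the form "data: {...}"
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content", "")
        if delta:
            tokens += 1  # approximation: one streamed chunk ~ one token

elapsed = time.time() - start
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tokens/s")
```

Run it 5 times per configuration, as in our methodology above, and average the results.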
### RTX 4090 on Windows PC

- CPU: Intel 13th-gen Core
- GPU: NVIDIA RTX 4090 (Ada - sm_89)
- RAM: 120GB
- OS: Windows

#### TinyLlama-1.1b q4

| Metrics              | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | 104                  | βœ… 131       |
| VRAM Used (GB)       | 2.1                  | 😱 21.5      |
| RAM Used (GB)        | 0.3                  | 😱 15        |
| Disk Size (GB)       | 4.07                 | 4.07         |

#### Mistral-7b int4

| Metrics              | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | 80                   | βœ… 97.9      |
| VRAM Used (GB)       | 2.1                  | 😱 23.5      |
| RAM Used (GB)        | 0.3                  | 😱 15        |
| Disk Size (GB)       | 4.07                 | 4.07         |

### RTX 3090 on Windows PC

- CPU: Intel 13th-gen Core
- GPU: NVIDIA RTX 3090 (Ampere - sm_86)
- RAM: 64GB
- OS: Windows

#### TinyLlama-1.1b q4

| Metrics              | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | 131.28               | βœ… 194       |
| VRAM Used (GB)       | 2.1                  | 😱 21.5      |
| RAM Used (GB)        | 0.3                  | 😱 15        |
| Disk Size (GB)       | 4.07                 | 4.07         |

#### Mistral-7b int4

| Metrics              | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | 88                   | βœ… 137       |
| VRAM Used (GB)       | 6.0                  | 😱 23.8      |
| RAM Used (GB)        | 0.3                  | 😱 25        |
| Disk Size (GB)       | 4.07                 | 4.07         |

### RTX 4060 on Windows Laptop

- Manufacturer: Acer Nitro 16 Phoenix
- CPU: AMD Ryzen 7000 series
- RAM: 16GB
- GPU: NVIDIA RTX 4060 Laptop GPU (Ada)

#### TinyLlama-1.1b q4

| Metrics              | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | 65                   | ❌ 41        |
| VRAM Used (GB)       | 2.1                  | 😱 7.6       |
| RAM Used (GB)        | 0.3                  | 😱 7.2       |
| Disk Size (GB)       | 4.07                 | 4.07         |

#### Mistral-7b int4

| Metrics              | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | 22                   | ❌ 19        |
| VRAM Used (GB)       | 2.1                  | 😱 7.7       |
| RAM Used (GB)        | 0.3                  | 😱 13.5      |
| Disk Size (GB)       | 4.07                 | 4.07         |
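A closing note on methodology: the VRAM Used rows above come from OS tools, but the same per-GPU number can be captured programmatically via nvidia-smi. Here is a minimal sketch, assuming nvidia-smi is on your PATH; the helper function name is ours for illustration, not part of any Jan API.

```python
import subprocess

def gpu_memory_used_mb(gpu_index: int = 0) -> int:
    """Return VRAM currently in use on the given GPU, in MB, as reported
    by the NVIDIA driver (requires nvidia-smi)."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=memory.used",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )
    return int(out.strip())

if __name__ == "__main__":
    # Sample before and during inference, and take the difference to
    # attribute VRAM usage to the model you just loaded.
    print(f"VRAM used: {gpu_memory_used_mb() / 1024:.1f} GB")
```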