From 4e001bb2459c2201196079391ab04da03daf35b5 Mon Sep 17 00:00:00 2001
From: Daniel
Date: Wed, 20 Mar 2024 12:12:49 +0800
Subject: [PATCH 1/3] Initial commit for TensorRT-LLM blog

---
 docs/blog/2024-03-19-TensorRT-LLM.md | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/docs/blog/2024-03-19-TensorRT-LLM.md b/docs/blog/2024-03-19-TensorRT-LLM.md
index 08f1a1d1a..8b41b8178 100644
--- a/docs/blog/2024-03-19-TensorRT-LLM.md
+++ b/docs/blog/2024-03-19-TensorRT-LLM.md
@@ -13,20 +13,15 @@ We've made a few TensorRT-LLM models TensorRT-LLM models available in the Jan Hu
 - TinyLlama-1.1b
 - Mistral 7b
-- TinyJensen-1.1b, which is trained on Jensen Huang's πŸ‘€
+- TinyJensen-1.1b πŸ˜‚
 
-## What is TensorRT-LLM?
-
-Please read our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).
-
-TensorRT-LLM is mainly used in datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) type speeds.
+You can get started by following our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).
 
 ## Performance Benchmarks
 
+TensorRT-LLM is mainly used in datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) type speeds. Naturally, we were curious to see how this would perform on consumer-grade GPUs.
-We were curious to see how this would perform on consumer-grade GPUs, as most of Jan's users use consumer-grade GPUs.
-
-- We’ve done a comparison of how TensorRT-LLM does vs. llama.cpp, our default inference engine.
+We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://github.com/ggerganov/llama.cpp), our default inference engine.
 
 | NVIDIA GPU | Architecture | VRAM Used (GB) | CUDA Cores | Tensor Cores | Memory Bus Width (bit) | Memory Bandwidth (GB/s) |
 | ---------- | ------------ | -------------- | ---------- | ------------ | ---------------------- | ----------------------- |
 | RTX 3090 | Ampere | 24 | 10,496 | 328 | 384 | 935.8 |
 | RTX 4060 | Ada | 8 | 3,072 | 96 | 128 | 272 |
 
-> We test using batch_size 1 and input length 2048, output length 512 as it’s the common use case people all use. We run 5 times and get the Average.
-
-> We use Windows task manager and Linux NVIDIA-SMI/ Htop to get CPU/ Memory/ NVIDIA GPU metrics per process.
-
-> We turn off all user application and only open Jan app with Nitro tensorrt-llm or NVIDIA benchmark script in python
+- We tested using batch_size 1 and input length 2048, output length 512 as it’s the common use case people all use.
+- We ran the tests 5 times to get get the Average.
+- CPU, Memory were obtained from... Windows Task Manager
+- GPU Metrics were obtained from `nvidia-smi` or `htop`/`nvtop`
+- All tests were run on bare metal PCs with no other apps open
+- There is a slight difference between the models: AWQ models for TensorRT-LLM, while llama.cpp has its own quantization technique
 
 ### RTX 4090 on Windows PC
 
+TensorRT-LLM handily outperformed llama.cpp in for the 4090s. Interestingly,
+
 - CPU: Intel 13th series
 - GPU: NVIDIA GPU 4090 (Ampere - sm 86)
 - RAM: 120GB
 - OS: Windows

From c885d59c0b71061b0a0e31f09a730a6a25929238 Mon Sep 17 00:00:00 2001
From: hiro
Date: Wed, 20 Mar 2024 18:51:34 +0700
Subject: [PATCH 2/3] fix: Add latest result on 3090/ 4090

---
 docs/blog/2024-03-19-TensorRT-LLM.md | 54 ++++++++++++++--------------
 1 file changed, 27 insertions(+), 27 deletions(-)

diff --git a/docs/blog/2024-03-19-TensorRT-LLM.md b/docs/blog/2024-03-19-TensorRT-LLM.md
index 8b41b8178..0cd61adc1 100644
--- a/docs/blog/2024-03-19-TensorRT-LLM.md
+++ b/docs/blog/2024-03-19-TensorRT-LLM.md
@@ -11,15 +11,15 @@ Jan now supports [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) as an al
 We've made a few TensorRT-LLM models TensorRT-LLM models available in the Jan Hub for download:
 
-- TinyLlama-1.1b
+- TinyLlama-1.1b
 - Mistral 7b
-- TinyJensen-1.1b πŸ˜‚
+- TinyJensen-1.1b πŸ˜‚
 
-You can get started by following our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).
+You can get started by following our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).
 
 ## Performance Benchmarks
 
-TensorRT-LLM is mainly used in datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) type speeds. Naturally, we were curious to see how this would perform on consumer-grade GPUs.
+TensorRT-LLM is mainly used in datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) type speeds. Naturally, we were curious to see how this would perform on consumer-grade GPUs.
 
 We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://github.com/ggerganov/llama.cpp), our default inference engine.
 
@@ -29,7 +29,7 @@ We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://githu
 | RTX 3090 | Ampere | 24 | 10,496 | 328 | 384 | 935.8 |
 | RTX 4060 | Ada | 8 | 3,072 | 96 | 128 | 272 |
 
-- We tested using batch_size 1 and input length 2048, output length 512 as it’s the common use case people all use.
+- We tested using batch_size 1 and input length 2048, output length 512 as it’s the common use case people all use.
 - We ran the tests 5 times to get get the Average.
 - CPU, Memory were obtained from... Windows Task Manager
 - GPU Metrics were obtained from `nvidia-smi` or `htop`/`nvtop`
@@ -38,30 +38,30 @@ We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://githu
 
 ### RTX 4090 on Windows PC
 
-TensorRT-LLM handily outperformed llama.cpp in for the 4090s. Interestingly,
 
 - CPU: Intel 13th series
 - GPU: NVIDIA GPU 4090 (Ampere - sm 86)
-- RAM: 120GB
-- OS: Windows
+- RAM: 32GB
+- OS: Windows 11 Pro
 
-#### TinyLlama-1.1b q4
+#### TinyLlama-1.1b FP16
 
 | Metrics | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 104 | βœ… 131 |
-| VRAM Used (GB) | 2.1 | 😱 21.5 |
-| RAM Used (GB) | 0.3 | 😱 15 |
-| Disk Size (GB) | 4.07 | 4.07 |
+| Throughput (token/s) | No support | βœ… 257.76 |
+| VRAM Used (GB) | No support | 3.3 |
+| RAM Used (GB) | No support | 0.54 |
+| Disk Size (GB) | No support | 2 |
 
 #### Mistral-7b int4
 
 | Metrics | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 80 | βœ… 97.9 |
-| VRAM Used (GB) | 2.1 | 😱 23.5 |
-| RAM Used (GB) | 0.3 | 😱 15 |
-| Disk Size (GB) | 4.07 | 4.07 |
+| Throughput (token/s) | 101.3 | βœ… 159 |
+| VRAM Used (GB) | 5.5 | 6.3 |
+| RAM Used (GB) | 0.54 | 0.42 |
+| Disk Size (GB) | 4.07 | 3.66 |
 
 ### RTX 3090 on Windows PC
 
 - CPU: Intel 13th series
 - GPU: NVIDIA GPU 3090 (Ampere - sm 86)
 - RAM: 64GB
 - OS: Windows
 
-#### TinyLlama-1.1b q4
+#### TinyLlama-1.1b FP16
 
 | Metrics | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 131.28 | βœ… 194 |
-| VRAM Used (GB) | 2.1 | 😱 21.5 |
-| RAM Used (GB) | 0.3 | 😱 15 |
-| Disk Size (GB) | 4.07 | 4.07 |
+| Throughput (token/s) | No support | βœ… 203 |
+| VRAM Used (GB) | No support | 3.8 |
+| RAM Used (GB) | No support | 0.54 |
+| Disk Size (GB) | No support | 2 |
 
 #### Mistral-7b int4
 
 | Metrics | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 88 | βœ… 137 |
-| VRAM Used (GB) | 6.0 | 😱 23.8 |
-| RAM Used (GB) | 0.3 | 😱 25 |
-| Disk Size (GB) | 4.07 | 4.07 |
+| Throughput (token/s) | 90 | 140.27 |
+| VRAM Used (GB) | 6.0 | 6.8 |
+| RAM Used (GB) | 0.54 | 0.42 |
+| Disk Size (GB) | 4.07 | 3.66 |
 
 ### RTX 4060 on Windows Laptop
 
 - Manufacturer: Acer Nitro 16 Phenix
 - CPU: Ryzen 7000
 - RAM: 16GB
 - GPU: NVIDIA Laptop GPU 4060 (Ada)
 
-#### TinyLlama-1.1b q4
+#### TinyLlama-1.1b FP16
 
 | Metrics | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |

From 00e97718323e3e2ed9467c4261d801295eb2902c Mon Sep 17 00:00:00 2001
From: Daniel
Date: Thu, 21 Mar 2024 20:11:54 +0800
Subject: [PATCH 3/3] Add writeup for TensorRT-LLM

---
 docs/blog/2024-03-19-TensorRT-LLM.md | 169 ++++++++++++++++-----------
 1 file changed, 100 insertions(+), 69 deletions(-)

diff --git a/docs/blog/2024-03-19-TensorRT-LLM.md b/docs/blog/2024-03-19-TensorRT-LLM.md
index 0cd61adc1..03afb2179 100644
--- a/docs/blog/2024-03-19-TensorRT-LLM.md
+++ b/docs/blog/2024-03-19-TensorRT-LLM.md
@@ -1,114 +1,145 @@
 ---
-title: Jan now supports TensorRT-LLM
-description: Jan has added for Nvidia's TensorRT-LLM, a hardware-optimized LLM inference engine that runs very fast on Nvidia GPUs
-tags: [Nvidia, TensorRT-LLM]
+title: Benchmarking TensorRT-LLM vs. llama.cpp
+description: Jan has added support for the TensorRT-LLM inference engine as an alternative to llama.cpp. We provide a head-to-head performance benchmark of the two inference engines and their model formats, with TensorRT-LLM delivering higher throughput on high-end GPUs at the cost of somewhat more VRAM.
+tags: [Nvidia, TensorRT-LLM, llama.cpp, 3090, 4090, "inference engine"]
+unlisted: true
 ---
 
-Jan now supports [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) as an alternative inference engine. TensorRT-LLM is a hardware-optimized LLM inference engine that compiles models to [run extremely fast on Nvidia GPUs](https://blogs.nvidia.com/blog/tensorrt-llm-windows-stable-diffusion-rtx/).
+Jan has added support for [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) as an alternative to the default [llama.cpp](https://github.com/ggerganov/llama.cpp) inference engine. TensorRT-LLM is a hardware-optimized inference engine that compiles models to [run extremely fast on Nvidia GPUs](https://blogs.nvidia.com/blog/tensorrt-llm-windows-stable-diffusion-rtx/), letting Nvidia GPU owners run blazing-fast LLM inference.
 
-- [TensorRT-LLM Extension](/guides/providers/tensorrt-llm) is available in [0.4.9 release](https://github.com/janhq/jan/releases/tag/v0.4.9)
-- Currently available only for Windows
+You can follow our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm) to try it out today. We've also added a few TensorRT-LLM models to Jan's Model Hub for download:
 
-We've made a few TensorRT-LLM models TensorRT-LLM models available in the Jan Hub for download:
-
-- TinyLlama-1.1b
+- Mistral 7b
+- TinyLlama-1.1b
 - TinyJensen-1.1b πŸ˜‚
 
-You can get started by following our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).
+:::tip
+
+TensorRT-LLM support is available in [v0.4.9](https://github.com/janhq/jan/releases/tag/v0.4.9), but should be considered an experimental feature.
+
+Please report bugs on [GitHub](https://github.com/janhq/jan) or on our Discord's [#tensorrt-llm](https://discord.com/channels/1107178041848909847/1201832734704795688) channel.
+
+:::
 
 ## Performance Benchmarks
 
-TensorRT-LLM is mainly used in datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) type speeds. Naturally, we were curious to see how this would perform on consumer-grade GPUs.
-
-We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://github.com/ggerganov/llama.cpp), our default inference engine.
+We were really curious to see how TensorRT-LLM would perform vs. llama.cpp on consumer-grade GPUs. TensorRT-LLM has previously been shown by Nvidia to reach speeds of up to [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) on datacenter-grade GPUs. As most of Jan's users are proud card-carrying members of the [GPU Poor](https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini#the-gpu-poor), we wanted to see how the two inference engines performed on the same hardware.
 
-| NVIDIA GPU | Architecture | VRAM Used (GB) | CUDA Cores | Tensor Cores | Memory Bus Width (bit) | Memory Bandwidth (GB/s) |
-| ---------- | ------------ | -------------- | ---------- | ------------ | ---------------------- | ----------------------- |
+:::info
+
+An interesting aside: Jan actually started out in June 2023 building on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), the precursor library to TensorRT-LLM. TensorRT-LLM was released in September 2023, making it a very young library. We're excited to see its roadmap develop!
+
+:::
-| RTX 4090 | Ada | 24 | 16,384 | 512 | 384 | ~1000 |
-| RTX 3090 | Ampere | 24 | 10,496 | 328 | 384 | 935.8 |
-| RTX 4060 | Ada | 8 | 3,072 | 96 | 128 | 272 |
-
-- We tested using batch_size 1 and input length 2048, output length 512 as it’s the common use case people all use.
-- We ran the tests 5 times to get get the Average.
-- CPU, Memory were obtained from... Windows Task Manager
-- GPU Metrics were obtained from `nvidia-smi` or `htop`/`nvtop`
-- All tests were run on bare metal PCs with no other apps open
-- There is a slight difference between the models: AWQ models for TensorRT-LLM, while llama.cpp has its own quantization technique
 
-### RTX 4090 on Windows PC
+### Test Setup
+
+We picked 3 hardware platforms to run the tests on, based on the most common GPUs that Jan's users self-report.
+
+| NVIDIA GPU                | VRAM (GB) | CUDA Cores | Tensor Cores | Memory Bus Width (bit) | Memory Bandwidth (GB/s) |
+| ------------------------- | --------- | ---------- | ------------ | ---------------------- | ----------------------- |
+| RTX 4090 Desktop (Ada)    | 24        | 16,384     | 512          | 384                    | ~1000                   |
+| RTX 3090 Desktop (Ampere) | 24        | 10,496     | 328          | 384                    | 935.8                   |
+| RTX 4060 Laptop (Ada)     | 8         | 3,072      | 96           | 128                    | 272                     |
+
+:::warning[Low-spec Machines?]
+
+We didn't bother including low-spec machines: TensorRT-LLM is meant for performance, and simply doesn't work on lower-grade Nvidia GPUs or on computers without GPUs.
+
+TensorRT-LLM provides blazing-fast performance at the cost of [memory usage](https://nvidia.github.io/TensorRT-LLM/memory.html). This means that the performance improvements only show up on higher-end GPUs with larger VRAM.
+
+We've found that [llama.cpp](https://github.com/ggerganov/llama.cpp) does an incredible job of democratizing inference for the [GPU Poor](https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini#the-gpu-poor) with CPU-only or lower-end GPUs. Huge shout-outs to the [llama.cpp maintainers](https://github.com/ggerganov/llama.cpp/graphs/contributors) and the [ggml.ai](https://ggml.ai/) team.
+
+:::
+
+We chose the popular Mistral 7b model and ran it in both GGUF (llama.cpp) and TensorRT-LLM formats, picking comparable quantizations.
+
+#### llama.cpp Setup
+- For llama.cpp, we used `Mistral-7b-q4_k_m` (an illustrative setup sketch follows this list)
+- [ ] Fill in `ngl` params, GPU offload etc
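+
+The exact llama.cpp parameters are still listed as a TODO above, so treat the following as an illustrative sketch only: it shows what a comparable llama.cpp configuration (q4_K_M Mistral 7b, full GPU offload, 2048-token context) looks like through the `llama-cpp-python` bindings. The model path and offload values are assumptions for illustration, not the exact settings behind the published numbers.
+
+```python
+from llama_cpp import Llama
+
+# Hypothetical setup: load a q4_K_M Mistral 7b GGUF with every layer offloaded to the GPU.
+llm = Llama(
+    model_path="./mistral-7b.Q4_K_M.gguf",  # assumed local path to the GGUF file
+    n_gpu_layers=-1,  # -1 offloads all layers to the GPU (the "ngl" setting referenced above)
+    n_ctx=2048,       # matches the 2048-token input length used in the benchmark
+)
+
+out = llm("Write a short story about GPUs.", max_tokens=512, temperature=0)
+print(out["choices"][0]["text"])
+```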
+
+#### TensorRT-LLM Setup
+- For TensorRT-LLM, we used `Mistral-7b-int4 AWQ`
+- We ran TensorRT-LLM with `free_gpu_memory_fraction` configured for the lowest VRAM consumption (performance may be affected)
+- Note: We picked AWQ for TensorRT-LLM as a handicap, as AWQ supposedly trades some speed for quality
+
+#### Experiment Setup
+We ran the experiment using a standardized inference request in a sandboxed environment on the same machine (a sketch of the measurement loop follows this list):
+- We ran the tests 5 times for each inference engine, on a bare-metal PC with no other applications open
+- Each inference request used `batch_size` 1, `input_len` 2048, and `output_len` 512 as a realistic test case
+- CPU and memory usage were obtained from... Windows Task Manager 😱
+- GPU usage was obtained from `nvtop`, `htop`, and `nvidia-smi`
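+
+As a rough sketch of what one measurement run looks like: the snippet below sends a single standardized request to a local OpenAI-compatible server and derives tokens/s from the reported completion tokens, while snapshotting VRAM with `nvidia-smi`. The endpoint URL, port, and model name are placeholder assumptions, not the exact harness used for the published numbers.
+
+```python
+import subprocess
+import time
+
+import requests
+
+URL = "http://localhost:1337/v1/completions"  # assumed local OpenAI-compatible endpoint
+PROMPT = "..."  # a fixed prompt of roughly 2,048 tokens
+
+
+def vram_used_mib() -> int:
+    # Snapshot VRAM usage for GPU 0 via nvidia-smi (reported in MiB).
+    out = subprocess.check_output(
+        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
+        text=True,
+    )
+    return int(out.strip().splitlines()[0])
+
+
+def tokens_per_second() -> float:
+    # One standardized request: batch size 1, up to 512 output tokens, greedy decoding.
+    start = time.time()
+    resp = requests.post(
+        URL,
+        json={"model": "mistral-7b", "prompt": PROMPT, "max_tokens": 512, "temperature": 0},
+        timeout=600,
+    ).json()
+    elapsed = time.time() - start
+    # Assumes the server returns an OpenAI-style usage block with completion token counts.
+    return resp["usage"]["completion_tokens"] / elapsed
+
+
+runs = [tokens_per_second() for _ in range(5)]
+print(f"avg throughput: {sum(runs) / len(runs):.1f} tokens/s, VRAM used: {vram_used_mib()} MiB")
+```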
+
+## Results
+
+Our biggest takeaway: TensorRT-LLM is faster than llama.cpp on 4090s and 3090s with larger VRAM. However, on smaller GPUs (e.g. laptop 4060 GPUs), llama.cpp still came out slightly ahead.
+
+|              | 4090 Desktop | 3090 Desktop | 4060 Laptop |
+| ------------ | ------------ | ------------ | ----------- |
+| TensorRT-LLM | βœ… 159 t/s    | βœ… 140.27 t/s | ❌ 19 t/s    |
+| llama.cpp    | 101.3 t/s    | 90 t/s       | 22 t/s      |
+
+### RTX-4090 Desktop
+
+:::info[Hardware Details]
 
 - CPU: Intel 13th series
-- GPU: NVIDIA GPU 4090 (Ampere - sm 86)
+- GPU: NVIDIA GPU 4090 (Ada - sm 89)
 - RAM: 32GB
-- OS: Windows 11 Pro
+- OS: Windows 11 Pro on Proxmox
 
-#### TinyLlama-1.1b FP16
+:::
 
-| Metrics | GGUF (using the GPU) | TensorRT-LLM |
-| -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | No support | βœ… 257.76 |
-| VRAM Used (GB) | No support | 3.3 |
-| RAM Used (GB) | No support | 0.54 |
-| Disk Size (GB) | No support | 2 |
+Nvidia's RTX 4090 is its top-of-the-line consumer GPU and retails for [approximately $2,000](https://www.amazon.com/rtx-4090/s?k=rtx+4090).
 
 #### Mistral-7b int4
 
-| Metrics | GGUF (using the GPU) | TensorRT-LLM |
-| -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 101.3 | βœ… 159 |
-| VRAM Used (GB) | 5.5 | 6.3 |
-| RAM Used (GB) | 0.54 | 0.42 |
-| Disk Size (GB) | 4.07 | 3.66 |
+| Metrics              | GGUF (using GPU) | TensorRT-LLM | Difference     |
+| -------------------- | ---------------- | ------------ | -------------- |
+| Throughput (token/s) | 101.3            | 159          | βœ… 57% faster   |
+| VRAM Used (GB)       | 5.5              | 6.3          | πŸ€” 14% more    |
+| RAM Used (GB)        | 0.54             | 0.42         | 🀯 22% less    |
+| Disk Size (GB)       | 4.07             | 3.66         | 🀯 10% smaller |
 
-### RTX 3090 on Windows PC
+### RTX-3090 Desktop
+
+:::info[Hardware Details]
 
 - CPU: Intel 13th series
 - GPU: NVIDIA GPU 3090 (Ampere - sm 86)
 - RAM: 64GB
 - OS: Windows
 
-#### TinyLlama-1.1b FP16
-
-| Metrics | GGUF (using the GPU) | TensorRT-LLM |
-| -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | No support | βœ… 203 |
-| VRAM Used (GB) | No support | 3.8 |
-| RAM Used (GB) | No support | 0.54 |
-| Disk Size (GB) | No support | 2 |
+:::
 
 #### Mistral-7b int4
 
-| Metrics | GGUF (using the GPU) | TensorRT-LLM |
-| -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 90 | 140.27 |
-| VRAM Used (GB) | 6.0 | 6.8 |
-| RAM Used (GB) | 0.54 | 0.42 |
-| Disk Size (GB) | 4.07 | 3.66 |
+| Metrics              | GGUF (using GPU) | TensorRT-LLM | Difference   |
+| -------------------- | ---------------- | ------------ | ------------ |
+| Throughput (token/s) | 90               | βœ… 140.27     | βœ… 55% faster |
+| VRAM Used (GB)       | 6.0              | 6.8          | πŸ€” 13% more  |
+| RAM Used (GB)        | 0.54             | 0.42         | 🀯 22% less  |
+| Disk Size (GB)       | 4.07             | 3.66         | 🀯 10% less  |
 
-### RTX 4060 on Windows Laptop
+### RTX-4060 Laptop
+
+- [ ] Dan to re-run perf tests and fill in details
+
+:::info[Hardware Details]
 
 - Manufacturer: Acer Nitro 16 Phenix
 - CPU: Ryzen 7000
 - RAM: 16GB
 - GPU: NVIDIA Laptop GPU 4060 (Ada)
 
-#### TinyLlama-1.1b FP16
-
-| Metrics | GGUF (using the GPU) | TensorRT-LLM |
-| -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 65 | ❌ 41 |
-| VRAM Used (GB) | 2.1 | 😱 7.6 |
-| RAM Used (GB) | 0.3 | 😱 7.2 |
-| Disk Size (GB) | 4.07 | 4.07 GB |
+:::
 
 #### Mistral-7b int4
 
-| Metrics | GGUF (using the GPU) | TensorRT-LLM |
-| -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 22 | ❌ 19 |
-| VRAM Used (GB) | 2.1 | 😱 7.7 |
-| RAM Used (GB) | 0.3 | 😱 13.5 |
-| Disk Size (GB) | 4.07 | 4.07 |
+| Metrics              | GGUF (using the GPU) | TensorRT-LLM | Difference |
+| -------------------- | -------------------- | ------------ | ---------- |
+| Throughput (token/s) | 22                   | ❌ 19         |            |
+| VRAM Used (GB)       | 2.1                  | 7.7          |            |
+| RAM Used (GB)        | 0.3                  | 13.5         |            |
+| Disk Size (GB)       | 4.07                 | 4.07         |            |
\ No newline at end of file