fix: Add latest results on 3090/4090
parent 70e10fcc4a
commit c885d59c0b
@@ -11,15 +11,15 @@ Jan now supports [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) as an al
We've made a few TensorRT-LLM models available in the Jan Hub for download:
- TinyLlama-1.1b
- Mistral 7b
- TinyJensen-1.1b 😂
You can get started by following our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).
## Performance Benchmarks
TensorRT-LLM is mainly used on datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html)-class speeds. Naturally, we were curious to see how it would perform on consumer-grade GPUs.
We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://github.com/ggerganov/llama.cpp), our default inference engine.
@@ -29,7 +29,7 @@ We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://githu
| RTX 3090 | Ampere | 24 | 10,496 | 328 | 384 | 935.8 |
| RTX 4060 | Ada | 8 | 3,072 | 96 | 128 | 272 |
- We tested with batch_size 1, input length 2048, and output length 512, as this reflects a common real-world usage pattern (a minimal sketch of this setup follows the list).
- We ran each test 5 times and averaged the results.
- CPU and memory usage were obtained from Windows Task Manager.
- GPU metrics were obtained from `nvidia-smi` or `nvtop`.
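For readers who want to reproduce a run like this, below is a minimal sketch, not our exact harness. It assumes a local OpenAI-compatible completions endpoint (such as the one Jan's local API server exposes), a placeholder model id, and a crude stand-in for a 2048-token prompt; adjust all three for your setup.

```python
# Minimal benchmark sketch (assumptions: endpoint URL, model id,
# and prompt construction are placeholders, not the exact harness).
import subprocess
import time
from statistics import mean

import requests

BASE_URL = "http://localhost:1337/v1"  # assumed local endpoint
MODEL_ID = "mistral-7b-int4"           # placeholder model id
RUNS = 5                               # we averaged over 5 runs
PROMPT = "lorem " * 2048               # rough stand-in for a ~2048-token input
MAX_TOKENS = 512                       # fixed output length


def vram_used_gb() -> float:
    """Read current GPU memory use via nvidia-smi (MiB -> GB)."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.splitlines()[0]) / 1024


def one_run() -> float:
    """Time one completion end to end and return tokens/s."""
    start = time.perf_counter()
    resp = requests.post(
        f"{BASE_URL}/completions",
        json={
            "model": MODEL_ID,
            "prompt": PROMPT,
            "max_tokens": MAX_TOKENS,
            "temperature": 0,
        },
        timeout=600,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - start
    # Assumes the server reports token usage, as OpenAI-compatible APIs do.
    generated = resp.json()["usage"]["completion_tokens"]
    return generated / elapsed


speeds = [one_run() for _ in range(RUNS)]
print(f"avg throughput: {mean(speeds):.1f} token/s")
print(f"VRAM used:      {vram_used_gb():.1f} GB")
```

Timing the whole request lumps prompt processing in with generation, so a stricter harness would time the decode phase separately; for the relative comparisons below it is close enough.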
@@ -38,30 +38,30 @@ We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://githu
### RTX 4090 on Windows PC
TensorRT-LLM handily outperformed llama.cpp on the 4090. Interestingly,
- CPU: Intel 13th series
- GPU: NVIDIA RTX 4090 (Ada)
- RAM: 32GB
- OS: Windows 11 Pro
#### TinyLlama-1.1b FP16
| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | No support | ✅ 257.76 |
| VRAM Used (GB) | No support | 3.3 |
| RAM Used (GB) | No support | 0.54 |
| Disk Size (GB) | No support | 2 |
#### Mistral-7b int4
| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | 101.3 | ✅ 159 |
| VRAM Used (GB) | 5.5 | 6.3 |
| RAM Used (GB) | 0.54 | 0.42 |
| Disk Size (GB) | 4.07 | 3.66 |
### RTX 3090 on Windows PC
@@ -70,23 +70,23 @@ TensorRT-LLM handily outperformed llama.cpp on the 4090. Interestingly,
- RAM: 64GB
- OS: Windows
#### TinyLlama-1.1b FP16
| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | No support | ✅ 203 |
| VRAM Used (GB) | No support | 3.8 |
| RAM Used (GB) | No support | 0.54 |
| Disk Size (GB) | No support | 2 |
#### Mistral-7b int4
| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |
| Throughput (token/s) | 90 | ✅ 140.27 |
| VRAM Used (GB) | 6.0 | 6.8 |
| RAM Used (GB) | 0.54 | 0.42 |
| Disk Size (GB) | 4.07 | 3.66 |
### RTX 4060 on Windows Laptop
@@ -95,7 +95,7 @@ TensorRT-LLM handily outperformed llama.cpp on the 4090. Interestingly,
- RAM: 16GB
- GPU: NVIDIA RTX 4060 Laptop GPU (Ada)
#### TinyLlama-1.1b FP16
| Metrics | GGUF (using the GPU) | TensorRT-LLM |
| -------------------- | -------------------- | ------------ |