From 4e001bb2459c2201196079391ab04da03daf35b5 Mon Sep 17 00:00:00 2001
From: Daniel
Date: Wed, 20 Mar 2024 12:12:49 +0800
Subject: [PATCH 1/3] Initial commit for TensorRT-LLM blog

---
 docs/blog/2024-03-19-TensorRT-LLM.md | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/docs/blog/2024-03-19-TensorRT-LLM.md b/docs/blog/2024-03-19-TensorRT-LLM.md
index 08f1a1d1a..8b41b8178 100644
--- a/docs/blog/2024-03-19-TensorRT-LLM.md
+++ b/docs/blog/2024-03-19-TensorRT-LLM.md
@@ -13,20 +13,15 @@ We've made a few TensorRT-LLM models TensorRT-LLM models available in the Jan Hu
 - TinyLlama-1.1b
 - Mistral 7b
-- TinyJensen-1.1b, which is trained on Jensen Huang's πŸ‘€
+- TinyJensen-1.1b πŸ˜‚
 
-## What is TensorRT-LLM?
-
-Please read our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).
-
-TensorRT-LLM is mainly used in datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) type speeds.
+You can get started by following our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).
 
 ## Performance Benchmarks
 
+TensorRT-LLM is mainly used in datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) type speeds. Naturally, we were curious to see how this would perform on consumer-grade GPUs.
-We were curious to see how this would perform on consumer-grade GPUs, as most of Jan's users use consumer-grade GPUs.
-
-- We’ve done a comparison of how TensorRT-LLM does vs. llama.cpp, our default inference engine.
+We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://github.com/ggerganov/llama.cpp), our default inference engine.
 
 | NVIDIA GPU | Architecture | VRAM Used (GB) | CUDA Cores | Tensor Cores | Memory Bus Width (bit) | Memory Bandwidth (GB/s) |
 | ---------- | ------------ | -------------- | ---------- | ------------ | ---------------------- | ----------------------- |
 | RTX 3090 | Ampere | 24 | 10,496 | 328 | 384 | 935.8 |
 | RTX 4060 | Ada | 8 | 3,072 | 96 | 128 | 272 |
 
-> We test using batch_size 1 and input length 2048, output length 512 as it’s the common use case people all use. We run 5 times and get the Average.
-
-> We use Windows task manager and Linux NVIDIA-SMI/ Htop to get CPU/ Memory/ NVIDIA GPU metrics per process.
-
-> We turn off all user application and only open Jan app with Nitro tensorrt-llm or NVIDIA benchmark script in python
+- We tested using batch_size 1 and input length 2048, output length 512 as it’s the common use case people all use.
+- We ran the tests 5 times to get get the Average.
+- CPU, Memory were obtained from... Windows Task Manager
+- GPU Metrics were obtained from `nvidia-smi` or `htop`/`nvtop`
+- All tests were run on bare metal PCs with no other apps open
+- There is a slight difference between the models: AWQ models for TensorRT-LLM, while llama.cpp has its own quantization technique
 
 ### RTX 4090 on Windows PC
 
+TensorRT-LLM handily outperformed llama.cpp in for the 4090s. Interestingly,
+
 - CPU: Intel 13th series
 - GPU: NVIDIA GPU 4090 (Ampere - sm 86)
 - RAM: 120GB
 - OS: Windows

From c885d59c0b71061b0a0e31f09a730a6a25929238 Mon Sep 17 00:00:00 2001
From: hiro
Date: Wed, 20 Mar 2024 18:51:34 +0700
Subject: [PATCH 2/3] fix: Add latest result on 3090/ 4090

---
 docs/blog/2024-03-19-TensorRT-LLM.md | 54 ++++++++++++++--------------
 1 file changed, 27 insertions(+), 27 deletions(-)

diff --git a/docs/blog/2024-03-19-TensorRT-LLM.md b/docs/blog/2024-03-19-TensorRT-LLM.md
index 8b41b8178..0cd61adc1 100644
--- a/docs/blog/2024-03-19-TensorRT-LLM.md
+++ b/docs/blog/2024-03-19-TensorRT-LLM.md
@@ -11,15 +11,15 @@ Jan now supports [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) as an al
 We've made a few TensorRT-LLM models TensorRT-LLM models available in the Jan Hub for download:
 
-- TinyLlama-1.1b
+- TinyLlama-1.1b
 - Mistral 7b
-- TinyJensen-1.1b πŸ˜‚
+- TinyJensen-1.1b πŸ˜‚
 
-You can get started by following our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).
+You can get started by following our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).
 
 ## Performance Benchmarks
 
-TensorRT-LLM is mainly used in datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) type speeds. Naturally, we were curious to see how this would perform on consumer-grade GPUs.
+TensorRT-LLM is mainly used in datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) type speeds. Naturally, we were curious to see how this would perform on consumer-grade GPUs.
 
 We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://github.com/ggerganov/llama.cpp), our default inference engine.
 
@@ -29,7 +29,7 @@ We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://githu
 | RTX 3090 | Ampere | 24 | 10,496 | 328 | 384 | 935.8 |
 | RTX 4060 | Ada | 8 | 3,072 | 96 | 128 | 272 |
 
-- We tested using batch_size 1 and input length 2048, output length 512 as it’s the common use case people all use.
+- We tested using batch_size 1 and input length 2048, output length 512 as it’s the common use case people all use.
 - We ran the tests 5 times to get get the Average.
 - CPU, Memory were obtained from... Windows Task Manager
 - GPU Metrics were obtained from `nvidia-smi` or `htop`/`nvtop`
@@ -38,30 +38,30 @@ We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://githu
 
 ### RTX 4090 on Windows PC
 
-TensorRT-LLM handily outperformed llama.cpp in for the 4090s. Interestingly,
 
 - CPU: Intel 13th series
 - GPU: NVIDIA GPU 4090 (Ampere - sm 86)
-- RAM: 120GB
-- OS: Windows
+- RAM: 32GB
+- OS: Windows 11 Pro
 
-#### TinyLlama-1.1b q4
+#### TinyLlama-1.1b FP16
 
 | Metrics | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 104 | βœ… 131 |
-| VRAM Used (GB) | 2.1 | 😱 21.5 |
-| RAM Used (GB) | 0.3 | 😱 15 |
-| Disk Size (GB) | 4.07 | 4.07 |
+| Throughput (token/s) | No support | βœ… 257.76 |
+| VRAM Used (GB) | No support | 3.3 |
+| RAM Used (GB) | No support | 0.54 |
+| Disk Size (GB) | No support | 2 |
 
 #### Mistral-7b int4
 
 | Metrics | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 80 | βœ… 97.9 |
-| VRAM Used (GB) | 2.1 | 😱 23.5 |
-| RAM Used (GB) | 0.3 | 😱 15 |
-| Disk Size (GB) | 4.07 | 4.07 |
+| Throughput (token/s) | 101.3 | βœ… 159 |
+| VRAM Used (GB) | 5.5 | 6.3 |
+| RAM Used (GB) | 0.54 | 0.42 |
+| Disk Size (GB) | 4.07 | 3.66 |
 
 ### RTX 3090 on Windows PC
 
 - CPU: Intel 13th series
 - GPU: NVIDIA GPU 3090 (Ampere - sm 86)
 - RAM: 64GB
 - OS: Windows
 
-#### TinyLlama-1.1b q4
+#### TinyLlama-1.1b FP16
 
 | Metrics | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 131.28 | βœ… 194 |
-| VRAM Used (GB) | 2.1 | 😱 21.5 |
-| RAM Used (GB) | 0.3 | 😱 15 |
-| Disk Size (GB) | 4.07 | 4.07 |
+| Throughput (token/s) | No support | βœ… 203 |
+| VRAM Used (GB) | No support | 3.8 |
+| RAM Used (GB) | No support | 0.54 |
+| Disk Size (GB) | No support | 2 |
 
 #### Mistral-7b int4
 
 | Metrics | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 88 | βœ… 137 |
-| VRAM Used (GB) | 6.0 | 😱 23.8 |
-| RAM Used (GB) | 0.3 | 😱 25 |
-| Disk Size (GB) | 4.07 | 4.07 |
+| Throughput (token/s) | 90 | 140.27 |
+| VRAM Used (GB) | 6.0 | 6.8 |
+| RAM Used (GB) | 0.54 | 0.42 |
+| Disk Size (GB) | 4.07 | 3.66 |
 
 ### RTX 4060 on Windows Laptop
 
 - Manufacturer: Acer Nitro 16 Phenix
 - CPU: Ryzen 7000
 - RAM: 16GB
 - GPU: NVIDIA Laptop GPU 4060 (Ada)
 
-#### TinyLlama-1.1b q4
+#### TinyLlama-1.1b FP16
 
 | Metrics | GGUF (using the GPU) | TensorRT-LLM |
 | -------------------- | -------------------- | ------------ |

From 00e97718323e3e2ed9467c4261d801295eb2902c Mon Sep 17 00:00:00 2001
From: Daniel
Date: Thu, 21 Mar 2024 20:11:54 +0800
Subject: [PATCH 3/3] Add writeup for TensorRT-LLM

---
 docs/blog/2024-03-19-TensorRT-LLM.md | 169 ++++++++++++++++-----------
 1 file changed, 100 insertions(+), 69 deletions(-)

diff --git a/docs/blog/2024-03-19-TensorRT-LLM.md b/docs/blog/2024-03-19-TensorRT-LLM.md
index 0cd61adc1..03afb2179 100644
--- a/docs/blog/2024-03-19-TensorRT-LLM.md
+++ b/docs/blog/2024-03-19-TensorRT-LLM.md
@@ -1,114 +1,145 @@
 ---
-title: Jan now supports TensorRT-LLM
-description: Jan has added for Nvidia's TensorRT-LLM, a hardware-optimized LLM inference engine that runs very fast on Nvidia GPUs
-tags: [Nvidia, TensorRT-LLM]
+title: Benchmarking TensorRT-LLM vs. llama.cpp
+description: Jan has added support for the TensorRT-LLM inference engine as an alternative to llama.cpp. We provide a head-to-head performance benchmark of the two inference engines and their model formats, with TensorRT-LLM delivering higher throughput on high-end GPUs at the cost of somewhat more VRAM.
+tags: [Nvidia, TensorRT-LLM, llama.cpp, 3090, 4090, "inference engine"]
+unlisted: true
 ---
 
-Jan now supports [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) as an alternative inference engine. TensorRT-LLM is a hardware-optimized LLM inference engine that compiles models to [run extremely fast on Nvidia GPUs](https://blogs.nvidia.com/blog/tensorrt-llm-windows-stable-diffusion-rtx/).
+Jan has added support for [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) as an alternative to the default [llama.cpp](https://github.com/ggerganov/llama.cpp) inference engine. TensorRT-LLM is a hardware-optimized inference engine that compiles models to [run extremely fast on Nvidia GPUs](https://blogs.nvidia.com/blog/tensorrt-llm-windows-stable-diffusion-rtx/), letting Nvidia GPU owners run blazing-fast LLM inference.
 
-- [TensorRT-LLM Extension](/guides/providers/tensorrt-llm) is available in [0.4.9 release](https://github.com/janhq/jan/releases/tag/v0.4.9)
-- Currently available only for Windows
+You can follow our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm) to try it out today. We've also added a few TensorRT-LLM models to Jan's Model Hub for download:
 
-We've made a few TensorRT-LLM models TensorRT-LLM models available in the Jan Hub for download:
-
-- TinyLlama-1.1b
+- Mistral 7b
+- TinyLlama-1.1b
 - TinyJensen-1.1b πŸ˜‚
 
-You can get started by following our [TensorRT-LLM Guide](/guides/providers/tensorrt-llm).
+:::tip
+
+TensorRT-LLM support is available in [v0.4.9](https://github.com/janhq/jan/releases/tag/v0.4.9), but should be considered an experimental feature.
+
+Please report bugs on [GitHub](https://github.com/janhq/jan) or on our Discord's [#tensorrt-llm](https://discord.com/channels/1107178041848909847/1201832734704795688) channel.
+
+:::
 
 ## Performance Benchmarks
 
-TensorRT-LLM is mainly used in datacenter-grade GPUs to achieve [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) type speeds. Naturally, we were curious to see how this would perform on consumer-grade GPUs.
-
-We’ve done a comparison of how TensorRT-LLM does vs. [llama.cpp](https://github.com/ggerganov/llama.cpp), our default inference engine.
+We were really curious to see how TensorRT-LLM would perform vs. llama.cpp on consumer-grade GPUs. TensorRT-LLM has previously been shown by Nvidia to reach speeds of up to [10,000 tokens/s](https://nvidia.github.io/TensorRT-LLM/blogs/H100vsA100.html) on datacenter-grade GPUs. As most of Jan's users are proud card-carrying members of the [GPU Poor](https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini#the-gpu-poor), we wanted to see how the two inference engines performed on the same hardware.
 
-| NVIDIA GPU | Architecture | VRAM Used (GB) | CUDA Cores | Tensor Cores | Memory Bus Width (bit) | Memory Bandwidth (GB/s) |
-| ---------- | ------------ | -------------- | ---------- | ------------ | ---------------------- | ----------------------- |
+:::info
+
+An interesting aside: Jan actually started out in June 2023 building on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), the precursor library to TensorRT-LLM. TensorRT-LLM was released in September 2023, making it a very young library. We're excited to see its roadmap develop!
+
+:::
-| RTX 4090 | Ada | 24 | 16,384 | 512 | 384 | ~1000 |
-| RTX 3090 | Ampere | 24 | 10,496 | 328 | 384 | 935.8 |
-| RTX 4060 | Ada | 8 | 3,072 | 96 | 128 | 272 |
-
-- We tested using batch_size 1 and input length 2048, output length 512 as it’s the common use case people all use.
-- We ran the tests 5 times to get get the Average.
-- CPU, Memory were obtained from... Windows Task Manager
-- GPU Metrics were obtained from `nvidia-smi` or `htop`/`nvtop`
-- All tests were run on bare metal PCs with no other apps open
-- There is a slight difference between the models: AWQ models for TensorRT-LLM, while llama.cpp has its own quantization technique
 
-### RTX 4090 on Windows PC
+### Test Setup
+
+We picked 3 hardware platforms to run the tests on, based on the most common GPUs that Jan's users self-report.
+
+| NVIDIA GPU                | VRAM (GB) | CUDA Cores | Tensor Cores | Memory Bus Width (bit) | Memory Bandwidth (GB/s) |
+| ------------------------- | --------- | ---------- | ------------ | ---------------------- | ----------------------- |
+| RTX 4090 Desktop (Ada)    | 24        | 16,384     | 512          | 384                    | ~1000                   |
+| RTX 3090 Desktop (Ampere) | 24        | 10,496     | 328          | 384                    | 935.8                   |
+| RTX 4060 Laptop (Ada)     | 8         | 3,072      | 96           | 128                    | 272                     |
+
+:::warning[Low-spec Machines?]
+
+We didn't bother including low-spec machines: TensorRT-LLM is meant for performance, and simply doesn't work on lower-grade Nvidia GPUs or on computers without GPUs.
+
+TensorRT-LLM provides blazing-fast performance at the cost of [memory usage](https://nvidia.github.io/TensorRT-LLM/memory.html). This means that the performance improvements only show up on higher-end GPUs with larger VRAM.
+
+We've found that [llama.cpp](https://github.com/ggerganov/llama.cpp) does an incredible job of democratizing inference for the [GPU Poor](https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini#the-gpu-poor) with CPU-only or lower-end GPUs. Huge shout-outs to the [llama.cpp maintainers](https://github.com/ggerganov/llama.cpp/graphs/contributors) and the [ggml.ai](https://ggml.ai/) team.
+
+:::
+
+We chose the popular Mistral 7b model and ran it in both GGUF (llama.cpp) and TensorRT-LLM formats, picking comparable quantizations.
+
+#### llama.cpp Setup
+- For llama.cpp, we used `Mistral-7b-q4_k_m` (an illustrative setup sketch follows this list)
+- [ ] Fill in `ngl` params, GPU offload etc
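+
+The exact llama.cpp parameters are still listed as a TODO above, so treat the following as an illustrative sketch only: it shows what a comparable llama.cpp configuration (q4_K_M Mistral 7b, full GPU offload, 2048-token context) looks like through the `llama-cpp-python` bindings. The model path and offload values are assumptions for illustration, not the exact settings behind the published numbers.
+
+```python
+from llama_cpp import Llama
+
+# Hypothetical setup: load a q4_K_M Mistral 7b GGUF with every layer offloaded to the GPU.
+llm = Llama(
+    model_path="./mistral-7b.Q4_K_M.gguf",  # assumed local path to the GGUF file
+    n_gpu_layers=-1,  # -1 offloads all layers to the GPU (the "ngl" setting referenced above)
+    n_ctx=2048,       # matches the 2048-token input length used in the benchmark
+)
+
+out = llm("Write a short story about GPUs.", max_tokens=512, temperature=0)
+print(out["choices"][0]["text"])
+```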
+
+#### TensorRT-LLM Setup
+- For TensorRT-LLM, we used `Mistral-7b-int4 AWQ`
+- We ran TensorRT-LLM with `free_gpu_memory_fraction` configured for the lowest VRAM consumption (performance may be affected)
+- Note: We picked AWQ for TensorRT-LLM as a handicap, as AWQ supposedly trades some speed for quality
+
+#### Experiment Setup
+We ran the experiment using a standardized inference request in a sandboxed environment on the same machine (a sketch of the measurement loop follows this list):
+- We ran the tests 5 times for each inference engine, on a bare-metal PC with no other applications open
+- Each inference request used `batch_size` 1, `input_len` 2048, and `output_len` 512 as a realistic test case
+- CPU and memory usage were obtained from... Windows Task Manager 😱
+- GPU usage was obtained from `nvtop`, `htop`, and `nvidia-smi`
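+
+As a rough sketch of what one measurement run looks like: the snippet below sends a single standardized request to a local OpenAI-compatible server and derives tokens/s from the reported completion tokens, while snapshotting VRAM with `nvidia-smi`. The endpoint URL, port, and model name are placeholder assumptions, not the exact harness used for the published numbers.
+
+```python
+import subprocess
+import time
+
+import requests
+
+URL = "http://localhost:1337/v1/completions"  # assumed local OpenAI-compatible endpoint
+PROMPT = "..."  # a fixed prompt of roughly 2,048 tokens
+
+
+def vram_used_mib() -> int:
+    # Snapshot VRAM usage for GPU 0 via nvidia-smi (reported in MiB).
+    out = subprocess.check_output(
+        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
+        text=True,
+    )
+    return int(out.strip().splitlines()[0])
+
+
+def tokens_per_second() -> float:
+    # One standardized request: batch size 1, up to 512 output tokens, greedy decoding.
+    start = time.time()
+    resp = requests.post(
+        URL,
+        json={"model": "mistral-7b", "prompt": PROMPT, "max_tokens": 512, "temperature": 0},
+        timeout=600,
+    ).json()
+    elapsed = time.time() - start
+    # Assumes the server returns an OpenAI-style usage block with completion token counts.
+    return resp["usage"]["completion_tokens"] / elapsed
+
+
+runs = [tokens_per_second() for _ in range(5)]
+print(f"avg throughput: {sum(runs) / len(runs):.1f} tokens/s, VRAM used: {vram_used_mib()} MiB")
+```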
+
+## Results
+
+Our biggest takeaway: TensorRT-LLM is faster than llama.cpp on 4090s and 3090s with larger VRAM. However, on smaller GPUs (e.g. laptop 4060 GPUs), llama.cpp still came out slightly ahead.
+
+|              | 4090 Desktop | 3090 Desktop | 4060 Laptop |
+| ------------ | ------------ | ------------ | ----------- |
+| TensorRT-LLM | βœ… 159 t/s    | βœ… 140.27 t/s | ❌ 19 t/s    |
+| llama.cpp    | 101.3 t/s    | 90 t/s       | 22 t/s      |
+
+### RTX-4090 Desktop
+
+:::info[Hardware Details]
 
 - CPU: Intel 13th series
-- GPU: NVIDIA GPU 4090 (Ampere - sm 86)
+- GPU: NVIDIA GPU 4090 (Ada - sm 89)
 - RAM: 32GB
-- OS: Windows 11 Pro
+- OS: Windows 11 Pro on Proxmox
 
-#### TinyLlama-1.1b FP16
+:::
 
-| Metrics | GGUF (using the GPU) | TensorRT-LLM |
-| -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | No support | βœ… 257.76 |
-| VRAM Used (GB) | No support | 3.3 |
-| RAM Used (GB) | No support | 0.54 |
-| Disk Size (GB) | No support | 2 |
+Nvidia's RTX 4090 is its top-of-the-line consumer GPU and retails for [approximately $2,000](https://www.amazon.com/rtx-4090/s?k=rtx+4090).
 
 #### Mistral-7b int4
 
-| Metrics | GGUF (using the GPU) | TensorRT-LLM |
-| -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 101.3 | βœ… 159 |
-| VRAM Used (GB) | 5.5 | 6.3 |
-| RAM Used (GB) | 0.54 | 0.42 |
-| Disk Size (GB) | 4.07 | 3.66 |
+| Metrics              | GGUF (using GPU) | TensorRT-LLM | Difference     |
+| -------------------- | ---------------- | ------------ | -------------- |
+| Throughput (token/s) | 101.3            | 159          | βœ… 57% faster   |
+| VRAM Used (GB)       | 5.5              | 6.3          | πŸ€” 14% more    |
+| RAM Used (GB)        | 0.54             | 0.42         | 🀯 22% less    |
+| Disk Size (GB)       | 4.07             | 3.66         | 🀯 10% smaller |
 
-### RTX 3090 on Windows PC
+### RTX-3090 Desktop
+
+:::info[Hardware Details]
 
 - CPU: Intel 13th series
 - GPU: NVIDIA GPU 3090 (Ampere - sm 86)
 - RAM: 64GB
 - OS: Windows
 
-#### TinyLlama-1.1b FP16
-
-| Metrics | GGUF (using the GPU) | TensorRT-LLM |
-| -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | No support | βœ… 203 |
-| VRAM Used (GB) | No support | 3.8 |
-| RAM Used (GB) | No support | 0.54 |
-| Disk Size (GB) | No support | 2 |
+:::
 
 #### Mistral-7b int4
 
-| Metrics | GGUF (using the GPU) | TensorRT-LLM |
-| -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 90 | 140.27 |
-| VRAM Used (GB) | 6.0 | 6.8 |
-| RAM Used (GB) | 0.54 | 0.42 |
-| Disk Size (GB) | 4.07 | 3.66 |
+| Metrics              | GGUF (using GPU) | TensorRT-LLM | Difference   |
+| -------------------- | ---------------- | ------------ | ------------ |
+| Throughput (token/s) | 90               | βœ… 140.27     | βœ… 55% faster |
+| VRAM Used (GB)       | 6.0              | 6.8          | πŸ€” 13% more  |
+| RAM Used (GB)        | 0.54             | 0.42         | 🀯 22% less  |
+| Disk Size (GB)       | 4.07             | 3.66         | 🀯 10% less  |
 
-### RTX 4060 on Windows Laptop
+### RTX-4060 Laptop
+
+- [ ] Dan to re-run perf tests and fill in details
+
+:::info[Hardware Details]
 
 - Manufacturer: Acer Nitro 16 Phenix
 - CPU: Ryzen 7000
 - RAM: 16GB
 - GPU: NVIDIA Laptop GPU 4060 (Ada)
 
-#### TinyLlama-1.1b FP16
-
-| Metrics | GGUF (using the GPU) | TensorRT-LLM |
-| -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 65 | ❌ 41 |
-| VRAM Used (GB) | 2.1 | 😱 7.6 |
-| RAM Used (GB) | 0.3 | 😱 7.2 |
-| Disk Size (GB) | 4.07 | 4.07 GB |
+:::
 
 #### Mistral-7b int4
 
-| Metrics | GGUF (using the GPU) | TensorRT-LLM |
-| -------------------- | -------------------- | ------------ |
-| Throughput (token/s) | 22 | ❌ 19 |
-| VRAM Used (GB) | 2.1 | 😱 7.7 |
-| RAM Used (GB) | 0.3 | 😱 13.5 |
-| Disk Size (GB) | 4.07 | 4.07 |
+| Metrics              | GGUF (using the GPU) | TensorRT-LLM | Difference |
+| -------------------- | -------------------- | ------------ | ---------- |
+| Throughput (token/s) | 22                   | ❌ 19         |            |
+| VRAM Used (GB)       | 2.1                  | 7.7          |            |
+| RAM Used (GB)        | 0.3                  | 13.5         |            |
+| Disk Size (GB)       | 4.07                 | 4.07         |            |
\ No newline at end of file