---
title: Benchmarking TensorRT-LLM vs. llama.cpp
description: Jan has added support for the TensorRT-LLM inference engine as an alternative to llama.cpp. We provide a head-to-head performance benchmark of the two inference engines and model formats; TensorRT-LLM delivers higher throughput but consumes significantly more VRAM and RAM.
tags:
unlisted: true
---
Jan has added support for TensorRT-LLM as an alternative to the default llama.cpp inference engine. TensorRT-LLM is a hardware-optimized inference engine that compiles models to run extremely fast on Nvidia GPUs.
You can follow our TensorRT-LLM Guide to try it out today. We've also added a few TensorRT-LLM models to Jan's Model Hub for download:
- Mistral 7b
- TinyLlama-1.1b
- TinyJensen-1.1b 😂
:::tip
TensorRT-LLM support is available in v0.4.9, but should be considered an experimental feature.
Please report bugs on GitHub or in our Discord's #tensorrt-llm channel.
:::
## Performance Benchmarks
We were really curious to see how TensorRT-LLM would perform vs. llama.cpp on consumer-grade GPUs. Nvidia has previously shown TensorRT-LLM reaching up to 10,000 tokens/s on datacenter-grade GPUs. As most of Jan's users are proud, card-carrying members of the GPU Poor, we wanted to see how the two inference engines performed on the same consumer hardware.
:::info
An interesting aside: Jan actually started out in June 2023 building on FastTransformer, the precursor library to TensorRT-LLM. TensorRT-LLM was released in September 2023, making it a very young library. We're excited to see its roadmap develop!
:::
### Test Setup
We picked 3 hardware platforms to run the test on, based on the hardware most commonly self-reported by Jan's users.
| NVIDIA GPU | VRAM (GB) | CUDA Cores | Tensor Cores | Memory Bus Width (bit) | Memory Bandwidth (GB/s) |
|---|---|---|---|---|---|
| RTX 4090 Desktop (Ada) | 24 | 16,384 | 512 | 384 | ~1000 |
| RTX 3090 Desktop (Ampere) | 24 | 10,496 | 328 | 384 | 935.8 |
| RTX 4060 Laptop (Ada) | 8 | 3,072 | 96 | 128 | 272 |
:::warning[Low-spec Machines?]
We didn't include low-spec machines: TensorRT-LLM is built for performance, and simply doesn't work on lower-end Nvidia GPUs or on computers without GPUs.
TensorRT-LLM provides blazing-fast performance at the cost of memory usage. This means the performance improvements only show up on higher-end GPUs with more VRAM.
We've found that llama.cpp does an incredible job of democratizing inference to the GPU Poor with CPU-only or lower-range GPUs. Huge shout outs to the llama.cpp maintainers and the ggml.ai team.
:::
We chose the popular Mistral 7b model, running it in both GGUF and TensorRT-LLM formats with comparable quantizations.
#### llama.cpp Setup
- For llama.cpp, we used `Mistral-7b-q4_k_m`
- We offloaded all model layers to the GPU via the `ngl` parameter (see the sketch after this list)
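For readers who want to reproduce a comparable llama.cpp setup, here is a minimal sketch using the `llama-cpp-python` bindings. The model filename is illustrative, and `n_gpu_layers=-1` offloads every layer to the GPU (the same knob as `ngl`):

```python
from llama_cpp import Llama

# Load a q4_k_m GGUF build of Mistral 7b with full GPU offload.
llm = Llama(
    model_path="mistral-7b.Q4_K_M.gguf",  # illustrative filename
    n_gpu_layers=-1,  # offload all layers to the GPU (the `ngl` knob)
    n_ctx=2048,       # matches the 2048-token input length used in our tests
)

out = llm("Explain GPU offloading in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```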
#### TensorRT-LLM Setup
- For TensorRT-LLM, we used `Mistral-7b-int4 AWQ`
- We ran TensorRT-LLM with `free_gpu_memory_fraction` set for the lowest VRAM consumption (performance may be affected); the sketch after this list shows what such a knob means in practice
- Note: We picked AWQ for TensorRT-LLM as a handicap, since AWQ supposedly sacrifices some performance for quality
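The exact plumbing of `free_gpu_memory_fraction` varies by TensorRT-LLM version, but the idea is simple: the runtime claims a fraction of whatever VRAM is currently free for its KV cache. A rough sketch of that calculation, using the real `pynvml` bindings (the function name and fraction below are our own, for illustration):

```python
import pynvml

def kv_cache_budget_bytes(fraction: float = 0.2) -> int:
    """Return `fraction` of the currently free VRAM on GPU 0, in bytes."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    free_bytes = pynvml.nvmlDeviceGetMemoryInfo(handle).free
    pynvml.nvmlShutdown()
    return int(free_bytes * fraction)

# e.g. on a 24GB card with ~20GB free, a 0.2 fraction yields a ~4GB budget
print(f"KV-cache budget: {kv_cache_budget_bytes(0.2) / 1e9:.1f} GB")
```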
#### Experiment Setup
We ran the experiment using a standardized inference request in a sandboxed environment on the same machine (a sketch of the timing harness follows the list):
- We ran tests 5 times for each inference engine, on a bare-metal PC with no other applications open
- Each inference request used `batch_size` 1, `input_len` 2048, and `output_len` 512 as a realistic test case
- CPU and memory usage were obtained from... Windows Task Manager 😱
- GPU usage was obtained from `nvtop`, `htop`, and `nvidia-smi`
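A sketch of the kind of timing harness described above, assuming a local OpenAI-compatible endpoint (the URL and model id below are illustrative, not our exact configuration):

```python
import time
import requests

URL = "http://localhost:1337/v1/chat/completions"  # illustrative local endpoint

def run_once(prompt: str, output_len: int = 512) -> float:
    """Time one batch-size-1 request and return decoded tokens per second."""
    start = time.perf_counter()
    resp = requests.post(URL, json={
        "model": "mistral-7b",  # illustrative model id
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": output_len,
    })
    elapsed = time.perf_counter() - start
    return resp.json()["usage"]["completion_tokens"] / elapsed

prompt = "lorem " * 2048  # stand-in for a ~2048-token input
speeds = [run_once(prompt) for _ in range(5)]  # 5 runs per engine
print(f"mean throughput: {sum(speeds) / len(speeds):.1f} tokens/s")
```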
## Results
Our biggest takeaway: TensorRT-LLM is faster than llama.cpp on 4090s and 3090s with larger VRAM. However, on smaller GPUs (e.g. laptop 4060s), llama.cpp came out slightly ahead while using a fraction of the memory.
| Throughput (tokens/s) | 4090 Desktop | 3090 Desktop | 4060 Laptop |
|---|---|---|---|
| TensorRT-LLM | ✅ 159 | ✅ 140.27 | ❌ 19 |
| llama.cpp | 101.3 | 90 | ✅ 22 |
### RTX-4090 Desktop
:::info[Hardware Details]
- CPU: Intel 13th-gen
- GPU: NVIDIA RTX 4090 (Ada Lovelace - sm 89)
- RAM: 32GB
- OS: Windows 11 Pro on Proxmox
:::
Nvidia's RTX-4090 is their top-of-the-line consumer GPU, and retails for approximately $2,000.
#### Mistral-7b int4
| Metrics | GGUF (using GPU) | TensorRT-LLM | Difference |
|---|---|---|---|
| Throughput (tokens/s) | 101.3 | 159 | ✅ 57% faster |
| VRAM Used (GB) | 5.5 | 6.3 | 🤔 14% more |
| RAM Used (GB) | 0.54 | 0.42 | 🤯 22% less |
| Disk Size (GB) | 4.07 | 3.66 | 🤯 10% smaller |
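For anyone checking our math, the Difference column is computed relative to the llama.cpp (GGUF) number:

```python
# The throughput row above: TensorRT-LLM vs. llama.cpp on the 4090.
trt_llm, gguf = 159.0, 101.3
print(f"{(trt_llm - gguf) / gguf:.0%} faster")  # -> 57% faster
```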
### RTX-3090 Desktop
:::info[Hardware Details]
- CPU: Intel 13th-gen
- GPU: NVIDIA RTX 3090 (Ampere - sm 86)
- RAM: 64GB
- OS: Windows
:::
#### Mistral-7b int4
| Metrics | GGUF (using GPU) | TensorRT-LLM | Difference |
|---|---|---|---|
| Throughput (tokens/s) | 90 | 140.27 | ✅ 55% faster |
| VRAM Used (GB) | 6.0 | 6.8 | 🤔 13% more |
| RAM Used (GB) | 0.54 | 0.42 | 🤯 22% less |
| Disk Size (GB) | 4.07 | 3.66 | 🤯 10% smaller |
### RTX-4060 Laptop
:::info[Hardware Details]
- Manufacturer: Acer Nitro 16 Phenix
- CPU: AMD Ryzen 7000 series
- RAM: 16GB
- GPU: NVIDIA RTX 4060 Laptop GPU (Ada)
:::
#### Mistral-7b int4
| Metrics | GGUF (using GPU) | TensorRT-LLM | Difference |
|---|---|---|---|
| Throughput (tokens/s) | 22 | ❌ 19 | ❌ 14% slower |
| VRAM Used (GB) | 2.1 | 7.7 | ❌ 267% more |
| RAM Used (GB) | 0.3 | 13.5 | ❌ 45x more |
| Disk Size (GB) | 4.07 | 4.07 | same |