---
title: Reduce vLLM overhead with these magic flags
description: NA
tags:
categories: research
ogImage:
date: 2025-09-25
---

import { Callout } from 'nextra/components'
import CTABlog from '@/components/Blog/CTA'

# Reduce vLLM overhead with these magic flags

If you don't have much time, add `--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","compile_sizes":[1]}' --async-scheduling` to your vLLM launch command, and you can close this tab. See you next time!

If you are still here, let's talk about how we discovered some lesser-known vLLM flags that can speed up your local LLM inference by up to 30%!

## Background

[vLLM](https://github.com/vllm-project/vllm) is one of the most established LLM inference serving frameworks. It includes the best-known techniques for serving LLMs not only on a single GPU, but also at large distributed scale. However, most of the effort has focused on data center GPUs - think H100, B200, and MI300X. That leaves vLLM rather unoptimized for typical consumer hardware.

On the journey to ensure vLLM runs as fast as possible on our fleet of RTX PRO 6000 GPUs, we must understand the **realistic theoretical limit** of serving a particular model on a particular GPU. To do so, we build a **performance model** to predict the best possible metrics we can get out of our hardware.

## LLM performance model

### Napkin math

Before discussing the details of our perf model, we also want to give you a napkin-math version. To predict decode performance in terms of tok/s, take the memory bandwidth and divide it by the model size. Model size can be estimated by multiplying the parameter count by 2 for BF16, or by 1 or 0.5 for INT8 and INT4 respectively. For example, say you want to know the theoretical performance of running [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) on an [RTX 4090](https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4090/). The GPU has 1000 GB/s of VRAM bandwidth, and the model is estimated to be 4 x 2 = 8 GB. The theoretical decode speed would be 1000 / 8 = 125 tok/s.

$$
\text{decode speed (tok/s)} \approx \frac{\text{memory bandwidth (GB/s)}}{\text{model size (GB)}}
$$

This estimate is actually quite accurate for short context sizes, as you will see later when we walk through the performance model. To get a slightly better estimate, you can also look at the HuggingFace repo, which shows the exact repo size. In the case of Qwen3-4B-Instruct-2507, the repo size is 8.06 GB, which is not that different from our times-2 rule.

TODO: screenshot of HF

The biggest problem with this approach is that it does not take into account the effect of long context - doing attention on a large KV cache takes time! It also doesn't predict the latency of the **prefill** operation, i.e. Time to First Token (TTFT). We aim to predict these metrics more accurately with our perf model!

### Break down into ops: Matmul and Attention

Let's take a top-down view of a typical transformer-based LLM. We have an embedding layer, a series of repeated hidden layers, and finally the Language Modelling (LM) head. Each hidden layer consists of two modules: an attention module, which typically employs Grouped Query Attention (GQA), and a Multi-Layer Perceptron (MLP) module.

TODO: image of transformer arch

There are a lot of operations in these layers, such as matrix multiplication (matmul), attention, RMS norm, activation functions, and so on. Though they can look overwhelming, we only need to consider **matmul** and **attention** as they account for most of the runtime.

| Module | Main operations |
|---|---|
| Input embedding | Embed tokens (ignore) |
| Hidden layers (repeated N times) | |
| - Attention module | Query, Key, Value projections (matmul), Attention, Output projection (matmul) |
| - MLP module | Up and Down projections (matmul) |
| LM Head | Matmul |
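
To make this concrete, here is a minimal sketch of the matmul shapes that show up in a single forward pass over `T` tokens, following the table above. The config values below are made up for illustration and do not correspond to any particular model.

```python
# Hypothetical config values, for illustration only.
hidden_size = 4096          # model width
num_q_heads = 32            # query heads
num_kv_heads = 8            # key/value heads (GQA)
head_dim = 128              # dimension per head
intermediate_size = 12288   # MLP width
vocab_size = 151_936        # LM head output size
T = 1                       # tokens in this forward pass (1 during decode)

# Each matmul is described by its input shapes (M, K) @ (K, N).
matmuls_per_layer = {
    "q_proj": (T, hidden_size, num_q_heads * head_dim),
    "k_proj": (T, hidden_size, num_kv_heads * head_dim),
    "v_proj": (T, hidden_size, num_kv_heads * head_dim),
    "o_proj": (T, num_q_heads * head_dim, hidden_size),
    "up_proj": (T, hidden_size, intermediate_size),
    "down_proj": (T, intermediate_size, hidden_size),
}
lm_head = ("lm_head", (T, hidden_size, vocab_size))

for name, (M, K, N) in list(matmuls_per_layer.items()) + [lm_head]:
    print(f"{name}: ({M}, {K}) @ ({K}, {N})")
```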

A given matmul or attention operation, applied to inputs of a specific shape, can be characterized as either **compute-bound** or **memory-bound**. These bounds are the theoretical limits on how fast the operation can run, assuming we can fully utilize the compute units and memory bandwidth of the GPU (or any kind of hardware).

#### Characterizing Matmul

Matmul takes two input matrices, A with shape `(M, K)` and B with shape `(K, N)`, and produces an output matrix C with shape `(M, N)`. Mathematically speaking, each output element is a dot product between a row of A and a column of B.

$$
C_{mn} = \sum_{k=0}^{K-1} A_{mk}B_{kn}, \quad \forall\, 0 \leq m < M,\ 0 \leq n < N
$$

TODO: show a diagram

We want to count:

1. The number of floating point operations needed to compute the matmul, or FLOPs. This will determine the compute bound.
2. The amount of data read from and written to VRAM. This will determine the memory bound.

Recall that each output element is a dot product of two vectors of size K. This involves 2K FLOPs, since we need 1 multiplication and 1 addition for each pair of elements. To compute the whole output of size `(M, N)`, that adds up to **2MNK FLOPs**. This is the most important fact you need to know about matmul - there are 2MNK floating point operations.

For the amount of data transferred: regardless of how a particular matmul algorithm is implemented, we **minimally** have to read the input data from VRAM and write the output data to VRAM. Assuming the BF16 data type, which has 2 bytes per element, matmul transfers at least 2(MK + KN + MN) bytes.

To determine whether a particular matmul is compute-bound or memory-bound, we compute its expected duration under each assumption: FLOPs divided by the hardware's peak compute throughput, and bytes transferred divided by the memory bandwidth. Whichever is larger tells us which bound we hit. This requires the actual hardware specifications - compute throughput in floating point operations per second (FLOPS, with a capital S) and memory bandwidth. Take the RTX PRO 6000 as an example.
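
Here is a minimal sketch of that comparison. The RTX PRO 6000 numbers below are rough placeholders, not official specs - substitute the measured values from the next section for real use.

```python
# Placeholder hardware specs; measure your own GPU as shown later in this post.
PEAK_BF16_FLOPS = 200e12  # assumed peak compute, FLOP/s
PEAK_MEM_BW = 1.5e12      # assumed memory bandwidth, bytes/s

def matmul_bound(M: int, N: int, K: int, bytes_per_elem: int = 2) -> str:
    flops = 2 * M * N * K                                    # 2MNK floating point operations
    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)   # read A, read B, write C
    t_compute = flops / PEAK_BF16_FLOPS                      # duration if compute-bound
    t_memory = bytes_moved / PEAK_MEM_BW                     # duration if memory-bound
    bound = "compute-bound" if t_compute > t_memory else "memory-bound"
    return f"M={M} N={N} K={K}: {bound}, ~{max(t_compute, t_memory) * 1e6:.1f} us"

# A single decode token (M=1) is memory-bound; a large prefill batch is compute-bound.
print(matmul_bound(M=1, N=4096, K=4096))
print(matmul_bound(M=4096, N=4096, K=4096))
```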

There is also another approach using **arithmetic intensity**, but it boils down to the same comparison.
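
For reference, arithmetic intensity is just the ratio of FLOPs to bytes moved; comparing it against the hardware's ratio of peak FLOPS to memory bandwidth gives the same verdict:

$$
\text{intensity} = \frac{2MNK}{2(MK + KN + MN)}, \qquad \text{compute-bound} \iff \text{intensity} > \frac{\text{peak FLOPS}}{\text{memory bandwidth}}
$$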

#### Characterizing Attention

For each attention head, attention is computed as

$$
\begin{aligned}
S &= QK^T \\
P &= \mathrm{softmax}(S) \\
O &= PV
\end{aligned}
$$

where S and P are the attention weights before and after softmax, respectively. If we ignore the softmax operation, which we do, attention is simply two back-to-back matmuls. Hence, the FLOPs count is simply the sum of the FLOPs of the two matmuls. For the amount of data transferred, we only count the inputs (Q, K, V) and the output (O); in other words, we assume intermediates like S and P are never written to VRAM (this is in fact the key innovation of [Flash Attention](https://github.com/Dao-AILab/flash-attention)).
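
As a rough sketch of that counting, here is the per-head cost of one decode step (a single query token attending to a KV cache of length `L`, with head dimension `D`), assuming BF16. The function name and the numbers are just for illustration.

```python
def attention_head_cost(L: int, D: int, q_len: int = 1, bytes_per_elem: int = 2):
    """Approximate FLOPs and bytes for one attention head, treated as two matmuls."""
    # S = Q @ K^T: (q_len, D) @ (D, L) -> 2 * q_len * L * D FLOPs
    # O = P @ V:   (q_len, L) @ (L, D) -> 2 * q_len * L * D FLOPs
    flops = 2 * q_len * L * D + 2 * q_len * L * D
    # Only Q, K, V, and O touch VRAM; S and P stay on-chip (the Flash Attention trick).
    bytes_moved = bytes_per_elem * (q_len * D + L * D + L * D + q_len * D)
    return flops, bytes_moved

flops, bytes_moved = attention_head_cost(L=8192, D=128)
print(f"{flops / 1e6:.1f} MFLOPs, {bytes_moved / 1e6:.1f} MB per head")
```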

### Theoretical vs Actual hardware specs

Our perf model can only be as good as its inputs, i.e. the raw performance of the GPU. We can consider more operations and overheads involved in an LLM's forward pass, but it would be pointless if our assumed GPU FLOPS and memory bandwidth are not accurate. Fortunately, we don't need to guess - we can just measure them!

Generally, matmul is guaranteed to be compute-bound when the shapes are sufficiently large. In other words, by measuring the runtime of a large matmul, we can estimate the realistically achievable FLOPS of the GPU. We can safely assume the default matmul implementation in PyTorch is near optimal for large problem sizes. In practice, we sweep through a range of large matrix sizes and take the highest achieved FLOPS.

```python
import time
import torch
from triton.testing import do_bench

max_flops = 0

# Sweep square matmuls from 4096 up to 14336 and keep the best achieved FLOPS.
for size in range(4096, 16384, 2048):
    A = torch.randn(size, size, dtype=torch.bfloat16, device="cuda")
    B = torch.randn(size, size, dtype=torch.bfloat16, device="cuda").T

    # Brief pause between sizes to let the GPU settle.
    time.sleep(0.5)
    latency_ms = do_bench(lambda: torch.mm(A, B))
    flops = (2 * size * size * size) / (latency_ms * 1e-3)  # 2MNK / seconds
    max_flops = max(max_flops, flops)

print(f"{max_flops * 1e-12:.2f} TFLOPS")
```

Estimating memory bandwidth is much easier. We can use `memcpy` as the optimal memory operation: it simply reads data from one location and writes it to another. For a tensor of N bytes, `memcpy` performs N bytes of memory reads and N bytes of memory writes.

```python
import torch
from triton.testing import do_bench

max_mem_bw = 0

# Clone tensors of roughly 4, 8, and 12 GiB and keep the best achieved bandwidth.
for size in range(1, 4):
    x = torch.randn(size * (1 << 30), device="cuda")

    latency_ms = do_bench(lambda: x.clone())
    # clone() reads x.nbytes and writes x.nbytes, hence the factor of 2.
    mem_bw = x.nbytes * 2 / (latency_ms * 1e-3)
    max_mem_bw = max(max_mem_bw, mem_bw)

print(f"{max_mem_bw * 1e-9:.2f} GB/s")
```

<CTABlog />