diff --git a/docs/src/pages/post/vllm-reduce-overhead.mdx b/docs/src/pages/post/vllm-reduce-overhead.mdx
index d251aa15f..db4100190 100644
--- a/docs/src/pages/post/vllm-reduce-overhead.mdx
+++ b/docs/src/pages/post/vllm-reduce-overhead.mdx
@@ -28,14 +28,107 @@ On the journey to ensure vLLM runs as fast as possible on our fleet of RTX PRO 6
 
 Before discussing the details of our perf model, we also want to provide you with a napkin-math version. To predict decode performance in terms of tok/s, you can take memory bandwidth and divide it by model size. Model size can be estimated by multiplying parameters count by 2 for BF16, or by 1 and 0.5 for INT8 and INT4 respectively. For example, you want to know the theoretical performance of running [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) on [RTX 4090](https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/rtx-4090/). The GPU has 1000 GB/s VRAM bandwidth, and the model is estimated to be 4x2 = 8GB. The theoretical decode speed would be 1000 / 8 = 125 tok/s.
 
+TODO: show the equation
+
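+Roughly, in equation form (one way to write the rule of thumb above; bytes per parameter is 2 for BF16, 1 for INT8, 0.5 for INT4):
+
+$$
+\text{decode speed (tok/s)} \approx \frac{\text{memory bandwidth}}{\text{model size}} = \frac{\text{memory bandwidth}}{\text{parameter count} \times \text{bytes per parameter}}
+$$
+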
 This estimate is actually quite accurate for short context size, and as you can see later when we walk through the performance model. To get a slightly better estimate, you can also look up the HuggingFace repo which shows the exact repo size. In our case of Qwen3-4B-Instruct-2507, the repo size is 8.06 GB, so it's not that much difference from our times 2 rule.
 
 TODO: screenshot of HF
 
 The biggest problem with this approach is not taking into account the effect of long context - doing attention on a large KV cache takes time! It also doesn't predict the latency of **prefill** operation i.e. Time to First Token (TTFT). We aim to predict these metrics more accurately with our perf model!
 
-### 
+### Break down into ops: Matmul and Attention
 
-An LLM forward pass consists of a series of operations: matrix multiplication, attention, RMS norm, and so on. For simplicity, we only need to consider **matrix multiplication** and **attention** operations as they account for most of the runtime.
+Let's take a top-down view of a typical transformer-based LLM. We have an embedding layer, a series of repeated hidden layers, and finally the Language Modelling (LM) head. Each hidden layer consists of two modules: an attention module, which typically employs Grouped Query Attention (GQA), and a Multi-Layer Perceptron (MLP) module.
+
+TODO: image of transformer arch
+
+There are a lot of operations in these layers, such as matrix multiplication (matmul), attention, RMS norm, activation functions, and so on. Though they can look overwhelming, we only need to consider **matmul** and **attention** as they account for most of the runtime.
+
+Module | Main operations
+-------|----------------
+Input embedding | Embed tokens (ignored)
+Hidden layers (repeated N times): Attention module | Query, Key, Value projections (matmul), Attention, Output projection (matmul)
+Hidden layers (repeated N times): MLP module | Up and Down projections (matmul)
+LM Head | Matmul
+
+A given matmul or attention operation, applied to inputs of a specific shape, can be characterized as either **compute-bound** or **memory-bound**. These bounds are the theoretical limits on how fast the operation can run, assuming we can fully utilize the compute units and memory bandwidth of the GPU (or any other hardware).
+
+#### Characterizing Matmul
+
+Matmul takes in two input matrices, A with shape `(M, K)` and B with shape `(K, N)`, and produces an output matrix C with shape `(M, N)`. Mathematically speaking, each output element is a dot product between a row of A and a column of B.
+
+$$
+C_{mn} = \sum_{k=0}^{K-1} A_{mk} B_{kn}, \quad \forall \, 0 \leq m < M, \; 0 \leq n < N
+$$
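+
+To build intuition for how this characterization plays out, here is a small, illustrative Python sketch (not our perf model, just a roofline-style estimate): it counts the FLOPs and bytes moved by a BF16 matmul, compares the resulting FLOPs-per-byte against the GPU's compute-to-bandwidth ratio, and takes the slower of the compute time and the memory time as the theoretical lower bound on runtime. The RTX 4090 peak numbers below are approximate placeholders.
+
+```python
+# Illustrative roofline-style check for a BF16 matmul C = A @ B,
+# where A has shape (M, K) and B has shape (K, N).
+
+def characterize_matmul(M, N, K, bytes_per_elem=2,
+                        peak_flops=165e12,   # ~165 TFLOPS BF16 tensor compute (RTX 4090, approximate)
+                        peak_bw=1000e9):     # ~1000 GB/s VRAM bandwidth (RTX 4090, approximate)
+    # Each of the M*N output elements needs K multiplies and K additions.
+    flops = 2 * M * N * K
+    # Read A and B, write C (ignoring caches and intermediate traffic).
+    bytes_moved = bytes_per_elem * (M * K + K * N + M * N)
+    intensity = flops / bytes_moved          # FLOPs per byte for this matmul
+    ridge = peak_flops / peak_bw             # FLOPs per byte the GPU can sustain
+    bound = "compute-bound" if intensity > ridge else "memory-bound"
+    # Theoretical best-case runtime: whichever resource is the bottleneck.
+    time_s = max(flops / peak_flops, bytes_moved / peak_bw)
+    return bound, intensity, time_s
+
+# Decode-like matmul (M=1) vs prefill-like matmul (M=4096) on a 4096x4096 weight.
+for M in (1, 4096):
+    bound, intensity, t = characterize_matmul(M, N=4096, K=4096)
+    print(f"M={M}: {bound}, {intensity:.1f} FLOPs/byte, {t * 1e6:.1f} us")
+```
+
+With these assumed peaks, the M=1 (decode-like) case sits far below the ~165 FLOPs-per-byte ridge point and is memory-bound, while the M=4096 (prefill-like) case is compute-bound.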