vLLM
Deep Dive
How vLLM achieves near-zero memory fragmentation and 2–4× higher throughput through PagedAttention and continuous batching.
KV Cache — Deep Dive
The KV cache is the single most important concept in LLM serving. Understanding it from first principles — attention math, memory growth, and architectural variants — explains every design decision in vLLM.
1.1 — Transformer Attention: The Math
At the heart of every transformer is the scaled dot-product attention operation. For each token, the model learns to "look up" relevant information from all previous tokens by comparing queries against keys and retrieving weighted values.
Three weight matrices — WQ, WK, WV — are learned during training. At inference time, each token's embedding is multiplied by these weights to produce its Q, K, and V vectors.
Project → Query
Current token embedding × WQ. Represents "what am I looking for?"
Project → Keys
All past tokens × WK. Each token says "what I represent as a key."
Dot Product → Scores
Compare Q against every K. High dot-product = high relevance. Divide by √dₖ to stabilize gradients.
Weighted Sum → Output
Softmax turns scores into probabilities. Multiply by V to produce the final attended representation.
```python
# Scaled dot-product attention — the exact computation
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    """
    Q : [batch, heads, seq_len_q, d_k]   ← only current token during decode
    K : [batch, heads, seq_len_kv, d_k]  ← ALL tokens (cached)
    V : [batch, heads, seq_len_kv, d_v]  ← ALL tokens (cached)
    """
    d_k = Q.shape[-1]

    # Similarity: how much does each query "match" each key?
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)

    # Causal mask: token i can only attend to tokens ≤ i
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    weights = F.softmax(scores, dim=-1)   # [batch, heads, seq_q, seq_kv]
    return torch.matmul(weights, V)       # [batch, heads, seq_q, d_v]
```
Think of K-V as a fuzzy dictionary. The Query is your search term. Keys are the dictionary headwords. Values are the definitions. Attention computes a weighted blend of all definitions based on how well your search term matches each headword — no hard lookup, all soft.
1.2 — Why We Cache K and V, Not Q
During autoregressive decoding (generating one token at a time), the model runs a forward pass for a single new token. The key insight:
Query (Q) — computed only for the current token being generated. Once we've used Q to compute attention, we discard it — the next token will produce a completely new Q.
Keys (K) — computed for every past token. The new token's Q must compare against all previous Ks. These never change — token 5's key is the same whether we're generating token 6 or token 600.
Values (V) — computed for every past token. After attention scores are computed, we retrieve a weighted blend of all past Vs to form the output. These also never change for past tokens.
```python
# What happens on each decode step (generating token t)

# 1. Embed the NEW token only
x_new = embed(token_t)        # shape: [1, d_model]

# 2. Project to Q, K, V for this token
q_new = x_new @ W_Q           # [1, d_k] ← only new token
k_new = x_new @ W_K           # [1, d_k] ← only new token
v_new = x_new @ W_V           # [1, d_v] ← only new token

# 3. APPEND to KV cache (this is what grows each step)
kv_cache.K = torch.cat([kv_cache.K, k_new], dim=0)   # [t, d_k]
kv_cache.V = torch.cat([kv_cache.V, v_new], dim=0)   # [t, d_v]

# 4. Attend: current Q against ALL cached K, V
output = attention(
    Q=q_new,         # [1, d_k] ← 1 query (new token)
    K=kv_cache.K,    # [t, d_k] ← t keys (ALL past tokens)
    V=kv_cache.V,    # [t, d_v] ← t values (ALL past tokens)
)

# Without KV cache: we would recompute K and V for ALL t past tokens
# every single step → O(t²) compute. With cache: O(t) per step.
```
Without a KV cache, generating a 1000-token response requires computing K and V for token 1 a total of 999 times, for token 2 a total of 998 times, etc. That's O(n²) compute. With the KV cache it's O(n) — a dramatic difference for long sequences.
1.3 — Live KV Cache Simulator
Watch the KV cache grow token by token. Each new token appends a row to the K and V matrices at every layer. The memory meter shows actual GPU RAM consumed (Llama-3 8B settings: 32 layers, 8 KV heads, head_dim=128, bfloat16).
1.4 — Real Memory Numbers
The KV cache formula depends on the model's architecture. For standard MHA, per-token KV cache size is:
```python
# KV cache memory per token (one layer)
bytes_per_token_per_layer = (
    2               # K and V
    * num_kv_heads
    * head_dim
    * dtype_size    # 2 for bfloat16, 4 for float32
)

# Total KV cache for a full request
total_bytes = (
    num_tokens      # prompt + generated tokens
    * num_layers
    * bytes_per_token_per_layer
)
```
Real numbers for popular models at 2048 tokens per request:
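The sketch below applies that formula to a few well-known configurations (layer counts, KV-head counts, and head dims taken from each model's published config; treat the exact byte counts as estimates):

```python
# Per-request KV cache at 2048 tokens, bfloat16 (2 bytes/element).
# Architecture tuples are (num_layers, num_kv_heads, head_dim).
models = {
    "Llama-2 7B  (MHA)": (32, 32, 128),
    "Llama-3 8B  (GQA)": (32,  8, 128),
    "Mistral 7B  (GQA)": (32,  8, 128),
    "Llama-3 70B (GQA)": (80,  8, 128),
}

for name, (layers, kv_heads, head_dim) in models.items():
    per_token = 2 * kv_heads * head_dim * 2 * layers   # K+V × dtype × all layers
    total_mb = 2048 * per_token / 2**20
    print(f"{name}: {per_token // 1024} KB/token → {total_mb:,.0f} MB per request")

# Llama-2 7B  (MHA): 512 KB/token → 1,024 MB per request
# Llama-3 8B  (GQA): 128 KB/token →   256 MB per request
# Mistral 7B  (GQA): 128 KB/token →   256 MB per request
# Llama-3 70B (GQA): 320 KB/token →   640 MB per request
```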
For Llama-3 70B with 32 concurrent requests at 2048 tokens each: model weights ≈ 140 GB (bfloat16), KV cache ≈ 32 × 0.64 GB = 20.5 GB. At 8192 tokens, KV cache balloons to 82 GB — nearly as large as the weights themselves. This is why KV cache management is the bottleneck, not weight memory.
1.5 — Multi-Head Attention: MHA vs GQA vs MQA
Modern LLMs use variations of multi-head attention that dramatically reduce KV cache size with minimal accuracy loss. The key axis: how many K/V heads share each Q head?
MHA (multi-head attention) — each Q head has its own private K and V head. Maximum expressivity, maximum memory. Used in: GPT-2, GPT-3, Llama-2.
GQA (grouped-query attention) — groups of Q heads share a K/V head. 4× smaller KV cache than MHA with near-identical quality. Used in: Llama-3, Mistral, Gemma.
MQA (multi-query attention) — all Q heads share a single K/V head. Extreme memory savings, but noticeable quality degradation for complex tasks. Used in: PaLM, Falcon.
```python
# MHA: each of the num_heads heads has its own K, V
kv_heads_mha = num_heads      # e.g. 32 for Llama-2 7B

# GQA: num_kv_heads groups, each shared by (num_heads / num_kv_heads) Q heads
kv_heads_gqa = num_kv_heads   # e.g. 8 for Llama-3 8B (4× fewer than 32)

# MQA: single K, V shared by all Q heads
kv_heads_mqa = 1

# Memory impact (Llama-3 8B example, 2048 tokens, bfloat16):
# MHA would be: 2048 × 32 × 32 × 128 × 2 × 2 = 1,073,741,824 bytes = 1 GB
# GQA actual:   2048 × 32 ×  8 × 128 × 2 × 2 =   268,435,456 bytes = 256 MB ✓
```
1.6 — The Allocation Problem This Creates
Now that we know the KV cache is large, dynamic, and per-request, we can understand why allocating it naively is disastrous. Two failure modes emerge immediately:
Over-reservation (internal fragmentation) — a contiguous chunk reserved for the maximum possible output length (e.g. 2048 tokens) is held by a request that generates only 50 tokens. For Llama-3 8B that's 256 MB reserved but only ~12 MB actually used (prompt plus output); the remaining ~244 MB sits idle until the request finishes.
External fragmentation — as requests complete, freed memory becomes scattered. A new request needing 256 MB contiguously cannot start even if 300 MB total is free, because that memory is split into non-contiguous chunks.
Memory Fragmentation — Visual
Each cell = one 16-token block of GPU memory. Notice the wasted reservations on the left vs tight packing on the right:
This is the problem PagedAttention (Chapter 2) solves: treat GPU memory like an OS treats RAM — fixed-size pages, non-contiguous allocation, virtual mapping.
PagedAttention
vLLM's core innovation: divide GPU memory into fixed-size physical blocks and give each request a virtual block table — exactly like an OS page table.
The OS Analogy
Physical block = physical RAM page.
Logical block table = page table (virtual → physical mapping).
Block pool = physical RAM pool managed by the kernel.
Prefix cache = shared libraries mapped read-only into multiple processes.
Swap = CPU offload when GPU pool is exhausted.
How It Works
GPU KV cache memory is divided into fixed-size physical blocks (default: 16 tokens each). Each request maintains a logical block table — an ordered list of physical block IDs that hold its KV data. These blocks need not be contiguous.
The attention kernel is modified to perform scatter-gather reads across non-contiguous blocks using the block table. The attention math is identical; only the memory access pattern changes.
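What the gather amounts to can be sketched in plain PyTorch (illustrative only — the real kernel fuses this lookup into the attention computation, and the tensor layout and names here are assumptions rather than vLLM's actual ones):

```python
import torch

def gather_from_blocks(kv_pool, block_table, seq_len, block_size=16):
    """
    kv_pool     : [num_physical_blocks, block_size, num_kv_heads, head_dim]
    block_table : physical block IDs for this request, in logical order
    seq_len     : number of valid tokens in the request
    """
    blocks = kv_pool[torch.tensor(block_table)]      # pick physical blocks in logical order
    flat = blocks.reshape(-1, *kv_pool.shape[2:])    # [len(block_table) * block_size, heads, dim]
    return flat[:seq_len]                            # drop unused slots in the last block

# Example: a 40-token request whose KV lives in scattered physical blocks 7, 2, 19
# K = gather_from_blocks(k_pool, block_table=[7, 2, 19], seq_len=40)
```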
Block Table — Physical Memory Mapping
Three requests share the same GPU block pool. Their logical blocks map to scattered physical blocks:
Core Allocation Code
```python
# KVCacheManager.allocate_slots — called every decode step
def allocate_slots(self, request, num_new_tokens):
    # How many new 16-token blocks are needed?
    num_blocks_needed = self.get_num_blocks_to_allocate(
        request_id=request.request_id,
        num_tokens=request.num_tokens + num_new_tokens,
    )

    # Is the pool large enough?
    if num_blocks_needed > self.block_pool.get_num_free_blocks():
        return None   # → scheduler will trigger preemption

    # Allocate blocks and update the request's block table
    new_blocks = self.allocate_new_blocks(request.request_id, ...)
    return KVCacheBlocks(new_blocks)
```
Prefix Caching (APC)
When requests share an identical prefix (e.g. the same system prompt), vLLM can map both requests' logical block tables to the same physical blocks. These blocks are hashed by token content — a cache hit skips prefill entirely.
A 1000-token system prompt shared across 100 concurrent requests is prefilled once instead of 100 times. Enable with --enable-prefix-caching.
Copy-on-write semantics apply when a shared block must be mutated.
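Conceptually, each block's cache key is a hash chained from its parent block, so a block is shared only when the entire prefix up to and including it matches. A rough sketch (hash function and names are illustrative, not vLLM's exact implementation):

```python
import hashlib

def block_hash(parent_hash: str, block_token_ids: tuple) -> str:
    # A block's identity covers its own tokens AND everything before it,
    # so "the same 16 tokens" at a different position never collide.
    return hashlib.sha256(f"{parent_hash}|{block_token_ids}".encode()).hexdigest()

# Two requests with the same 1000-token system prompt produce identical hash
# chains for the first 62 full blocks (1000 // 16), so the second request maps
# those logical blocks onto the already-filled physical blocks and skips their prefill.
```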
Continuous Batching
Traditional inference idles the GPU waiting for the slowest request. Continuous batching re-schedules on every single forward pass, filling freed capacity immediately.
Static vs. Continuous
```python
# ❌ Static batching
batch = [req_A, req_B, req_C, req_D]
while any_running(batch):
    model.forward(batch)               # ALL wait for the slowest
# req_A done at step 10, but GPU idles until req_D at step 80

# ✅ Continuous batching (vLLM)
while True:
    scheduler.schedule()               # re-evaluate EVERY iteration
    outputs = model.forward(batch)
    for req in outputs:
        if req.is_done():
            free_blocks(req)           # instantly reclaim
            batch.add(waiting_queue.pop())
```
Live Simulation
7 requests compete for a 48-block GPU pool. Watch how finished requests immediately free capacity, allowing the waiting queue to drain continuously.
Memory Lifecycle
From request arrival to completion, every GPU block passes through a well-defined lifecycle managed by the scheduler and block manager in lockstep.
Request Arrives → Waiting Queue
No GPU memory allocated yet. Scheduler checks if enough free blocks exist.
Prefill Phase — Allocate Initial Blocks
The entire prompt is processed in one forward pass. KV cache written into freshly allocated blocks. New block appended every 16 tokens.
Decode Loop — Allocate on Demand
Each decode step generates one token. Every 16 steps, one new block is needed. If the pool is empty, the scheduler preempts a lower-priority request.
Request Completes → Free All Blocks
On EOS or max_tokens, all physical blocks are returned to the free list in the same scheduler iteration — immediately available.
Next Iteration → Immediate Reuse
Freed blocks are considered in the very next scheduler call (milliseconds later). A waiting request immediately claims them. Zero idle time.
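Taken together, one scheduler iteration drives this whole lifecycle. A simplified sketch (helper names are made up for illustration, not vLLM's actual scheduler API):

```python
def scheduler_iteration(running, waiting, block_pool):
    # Completed requests return every block to the pool (step 4)
    for req in [r for r in running if r.finished()]:
        block_pool.free(req.block_ids)
        running.remove(req)

    # Running requests get a new block each time they cross a 16-token boundary (step 3)
    for req in running:
        if req.needs_new_block():
            req.block_ids += block_pool.allocate(1)

    # Freed blocks are claimed by waiting requests in this same iteration (steps 2 and 5)
    while waiting and block_pool.num_free() >= waiting[0].prompt_blocks():
        req = waiting.pop(0)
        req.block_ids = block_pool.allocate(req.prompt_blocks())
        running.append(req)

    return running
```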
Preemption Strategies
When the block pool runs dry, the scheduler must evict a running request to make room. Two strategies: swap to CPU, or discard and recompute.
When Does Preemption Trigger?
The scheduler runs before each forward pass. If a running request needs a new block but block_pool.get_num_free_blocks() == 0, it preempts one or more running requests using one of these strategies:
Use swap when CPU RAM is plentiful and requests are long (expensive to recompute).
Use recompute when CPU RAM is limited or requests are short. Default is recompute.
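In pseudocode, the two strategies differ only in whether the victim's KV cache is preserved (a hedged sketch; the function and attribute names are illustrative):

```python
def preempt(victim, mode, block_pool, cpu_swap_pool):
    if mode == "swap":
        # Copy the victim's KV blocks to CPU RAM, then free the GPU blocks.
        # Resuming later means swapping the blocks back in — no recomputation.
        cpu_swap_pool.copy_in(victim.block_ids)
        block_pool.free(victim.block_ids)
        victim.state = "swapped"
    else:  # "recompute" (the default)
        # Drop the KV cache entirely. When rescheduled, the request re-runs
        # prefill over its prompt plus the tokens it had already generated.
        block_pool.free(victim.block_ids)
        victim.block_ids = []
        victim.state = "waiting"
```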
Configuration Reference
The key parameters that control vLLM's memory management and scheduling behavior.
| Parameter | Default | Effect |
|---|---|---|
| gpu_memory_utilization | 0.9 | Fraction of GPU VRAM reserved for the KV block pool. Increase for more concurrency; decrease to leave headroom for activations. |
| block_size | 16 | Tokens per physical block. Smaller = finer granularity (less waste). Larger = better attention kernel efficiency. 16 is empirically optimal for most models. |
| max_num_seqs | 256 | Max concurrent sequences in one batch. Cap this if you're hitting OOM on activations (separate budget from the KV cache). |
| swap_space | 4 GB | CPU RAM reserved for the swap block pool. Increase if you expect heavy preemption and want to preserve KV caches rather than recompute. |
| enable_prefix_caching | False | Enable APC — share physical blocks for identical prompt prefixes. High value for chatbots with shared system prompts. |
| preemption_mode | recompute | Strategy when block pool is exhausted. recompute drops and re-prefills; swap offloads to CPU. |
| max_model_len | model max | Maximum sequence length (prompt + output). Caps block table size per request. Reducing this directly increases concurrency capacity. |
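These parameters map directly onto engine arguments. For example, a configuration that trades a little headroom for more KV blocks and turns on prefix caching might look like this (values are illustrative, not tuned recommendations):

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    gpu_memory_utilization=0.92,    # leave ~8% of VRAM for activations
    max_num_seqs=128,               # cap concurrent sequences to bound activation memory
    max_model_len=4096,             # smaller cap → smaller worst-case per-request KV budget
    enable_prefix_caching=True,     # share physical blocks for identical prompt prefixes
    swap_space=8,                   # GiB of CPU RAM reserved for the swap pool
)
```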
Startup Memory Calculation
```python
# 1. Load weights → measure VRAM consumed
weight_mem = measure_model_footprint()

# 2. Profile dummy batch → measure peak activation memory
activation_peak = profile_with_dummy_batch()

# 3. Remaining VRAM → KV block pool
available = (total_vram * gpu_memory_utilization) \
            - weight_mem - activation_peak

# 4. Calculate block count
bytes_per_block = (
    block_size
    * num_layers
    * num_kv_heads
    * head_dim
    * 2            # K and V
    * dtype_bytes
)
num_gpu_blocks = available // bytes_per_block

# Example: A100 80 GB, Llama-3 8B, bfloat16, block_size=16
# bytes_per_block = 16 × 32 × 8 × 128 × 2 × 2 = 2 MiB
# available ≈ 0.9 × 80 GB − ~16 GB weights − a few GB activations ≈ 50 GB
# → roughly 25,000 blocks → ~400,000 concurrent token slots
```