GPU Memory Architecture

vLLM
Deep Dive

How vLLM achieves near-zero memory fragmentation and 2–4× higher throughput through PagedAttention and continuous batching.

Audience: SWE / ML Eng
Chapters: 6 + deep dive
Source: Kwon et al. 2023
Chapter 01

KV Cache — Deep Dive

The KV cache is the single most important concept in LLM serving. Understanding it from first principles — attention math, memory growth, and architectural variants — explains every design decision in vLLM.

1.1 — Transformer Attention: The Math

At the heart of every transformer is the scaled dot-product attention operation. For each token, the model learns to "look up" relevant information from all previous tokens by comparing queries against keys and retrieving weighted values.

Attention(Q, K, V)  =  softmax( Q · Kᵀ / √dₖ ) · V
Q = Query matrix
K = Key matrix
V = Value matrix
dₖ = head dimension (scaling factor)

Three weight matrices — WQ, WK, WV — are learned during training. At inference time, each token's embedding is multiplied by these weights to produce its Q, K, and V vectors.

Step 1
Q
Project → Query

Current token embedding × WQ. Represents "what am I looking for?"

Step 2
K
Project → Keys

All past tokens × WK. Each token says "what I represent as a key."

Step 3
Q·Kᵀ
Dot Product → Scores

Compare Q against every K. High dot-product = high relevance. Divide by √dₖ to stabilize gradients.

Step 4
V
Weighted Sum → Output

Softmax turns scores into probabilities. Multiply by V to produce the final attended representation.

# Scaled dot-product attention — the exact computation
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    """
    Q : [batch, heads, seq_len_q,  d_k]  ← only current token during decode
    K : [batch, heads, seq_len_kv, d_k]  ← ALL tokens (cached)
    V : [batch, heads, seq_len_kv, d_v]  ← ALL tokens (cached)
    """
    d_k = Q.shape[-1]

    # Similarity: how much does each query "match" each key?
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)

    # Causal mask: token i can only attend to tokens ≤ i
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    weights = F.softmax(scores, dim=-1)   # [batch, heads, seq_q, seq_kv]
    return torch.matmul(weights, V)          # [batch, heads, seq_q, d_v]
Analogy: Dictionary Lookup

Think of K-V as a fuzzy dictionary. The Query is your search term. Keys are the dictionary headwords. Values are the definitions. Attention computes a weighted blend of all definitions based on how well your search term matches each headword — no hard lookup, all soft.

1.2 — Why We Cache K and V, Not Q

During autoregressive decoding (generating one token at a time), the model runs a forward pass for a single new token. The key insight:

Q
Query

Computed only for the current token being generated. Once we've used Q to compute attention, we discard it — the next token will produce a completely new Q.

✗ NOT cached — recomputed each step
K
Key

Computed for every past token. The new token's Q must compare against all previous Ks. These never change — token 5's key is the same whether we're generating token 6 or token 600.

✓ CACHED — accumulated across steps
V
Value

Computed for every past token. After attention scores are computed, we retrieve a weighted blend of all past Vs to form the output. These also never change for past tokens.

✓ CACHED — accumulated across steps
# What happens on each decode step (generating token t)

# 1. Embed the NEW token only
x_new = embed(token_t)                     # shape: [1, d_model]

# 2. Project to Q, K, V for this token
q_new = x_new @ W_Q                       # [1, d_k]  ← only new token
k_new = x_new @ W_K                       # [1, d_k]  ← only new token
v_new = x_new @ W_V                       # [1, d_v]  ← only new token

# 3. APPEND to KV cache (this is what grows each step)
kv_cache.K = torch.cat([kv_cache.K, k_new], dim=0)  # [t, d_k]
kv_cache.V = torch.cat([kv_cache.V, v_new], dim=0)  # [t, d_v]

# 4. Attend: current Q against ALL cached K, V
output = attention(
    Q = q_new,       # [1, d_k]    ← 1 query (new token)
    K = kv_cache.K,  # [t, d_k]   ← t keys (ALL past tokens)
    V = kv_cache.V,  # [t, d_v]   ← t values (ALL past tokens)
)

# Without KV cache: we would recompute K and V for ALL t past tokens on
# every step → O(t²) total projection work. With cache: each token's K/V
# is computed exactly once → O(t) total.
The Cost of Not Caching

Without a KV cache, generating a 1000-token response requires computing K and V for token 1 a total of 999 times, for token 2 a total of 998 times, etc. That's O(n²) compute. With the KV cache it's O(n) — a dramatic difference for long sequences.
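This cost difference is easy to check with a toy count of K/V projection operations (a pure-Python sketch; the helper names are made up for illustration, not vLLM code):

```python
# Toy accounting (no real model): how many K/V projections are computed
# while generating n tokens, with and without a KV cache?
def kv_projections_without_cache(n):
    # step t must (re)project K and V for all t tokens seen so far
    return sum(t for t in range(1, n + 1))

def kv_projections_with_cache(n):
    # each token's K and V are projected exactly once, then cached
    return n

print(kv_projections_without_cache(1000))  # 500500 → O(n²)
print(kv_projections_with_cache(1000))     # 1000   → O(n)
```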

1.3 — Live KV Cache Simulator

Watch the KV cache grow token by token. Each new token appends a row to the K and V matrices at every layer. The memory meter shows actual GPU RAM consumed (Llama-3 8B settings: 32 layers, 8 KV heads, head_dim=128, bfloat16).

// KV Cache Growth Simulator — Llama-3 8B (interactive)
The simulator shows the K and V matrices (3 of 32 layers drawn) growing row by row as tokens are generated. Each colored cell = one 128-dim vector stored in bfloat16; columns = dimension chunks (8 KV heads × 128 = 1024 dims shown as 16 cells); rows = tokens per layer. A meter tracks GPU RAM used, up to 256 MB at the 2048-token cap.
Config: Layers 32 · KV Heads 8 · Head dim 128 · Dtype bfloat16
Memory = tokens × layers × kv_heads × head_dim × 2 (K+V) × 2 (bytes); at 2048 tokens: 2048 × 32 × 8 × 128 × 2 × 2 = 268,435,456 bytes = 256 MB
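The formula behind the meter, written as a runnable helper (defaults are the Llama-3 8B settings above; `kv_cache_bytes` is an illustrative name, not vLLM API):

```python
# KV cache size for a single request. Defaults are the Llama-3 8B
# settings used by the simulator (32 layers, 8 KV heads, head_dim 128, bf16).
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # tokens × layers × kv_heads × head_dim × 2 (K and V) × bytes per element
    return tokens * layers * kv_heads * head_dim * 2 * dtype_bytes

print(kv_cache_bytes(1) / 2**20)     # 0.125 MB per generated token
print(kv_cache_bytes(2048) / 2**20)  # 256.0 MB at the 2048-token cap
```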

1.4 — Real Memory Numbers

The KV cache formula depends on the model's architecture. For standard MHA, per-token KV cache size is:

# KV cache memory per token (one layer)
bytes_per_token_per_layer = (
    2          # K and V
  * num_kv_heads
  * head_dim
  * dtype_size  # 2 for bfloat16, 4 for float32
)

# Total KV cache for a full request
total_bytes = (
    num_tokens           # prompt + generated tokens
  * num_layers
  * bytes_per_token_per_layer
)

Real numbers for popular models at 2048 tokens per request (bfloat16):

GPT-2 (117M): MHA, 12 layers → 72 MB
Llama-2 7B: MHA, 32 layers → 1024 MB
Llama-3 8B: GQA, 32 layers → 256 MB
Llama-3 70B: GQA, 80 layers → 640 MB
Llama-3.1 405B: GQA, 126 layers → 1008 MB
KV Cache vs Model Weights

For Llama-3 70B with 32 concurrent requests at 2048 tokens each: model weights ≈ 140 GB (bfloat16), KV cache ≈ 32 × 0.64 GB = 20.5 GB. At 8192 tokens, KV cache balloons to 82 GB — nearly as large as the weights themselves. This is why KV cache management is the bottleneck, not weight memory.

1.5 — Multi-Head Attention: MHA vs GQA vs MQA

Modern LLMs use variations of multi-head attention that dramatically reduce KV cache size with minimal accuracy loss. The key axis: how many K/V heads share each Q head?

MHA — Multi-Head
Q₁
K₁
V₁
Q₂
K₂
V₂
Q₃
K₃
V₃
Q₄
K₄
V₄
1024 MB (Llama-2 7B @ 2048 tok)

Each Q head has its own private K and V head. Maximum expressivity, maximum memory. Used in: GPT-2, GPT-3, Llama-2.

GQA — Grouped-Query
Q₁
Q₂
K₁
V₁
shared
Q₃
Q₄
K₂
V₂
shared
256 MB (Llama-3 8B @ 2048 tok)

Groups of Q heads share a K/V head. 4× smaller KV cache than MHA with near-identical quality. Used in: Llama-3, Mistral, Gemma.

MQA — Multi-Query
Q₁
Q₂
Q₃
K₁
V₁
ALL share
Q₄
64 MB (example model @ 2048 tok)

All Q heads share a single K/V head. Extreme memory savings, but noticeable quality degradation for complex tasks. Used in: PaLM, Falcon.

# MHA: each of the num_heads heads has its own K, V
kv_heads_mha = num_heads          # e.g. 32 for Llama-2 7B

# GQA: num_kv_heads groups, each shared by (num_heads / num_kv_heads) Q heads
kv_heads_gqa = num_kv_heads       # e.g. 8 for Llama-3 8B (4× fewer than 32)

# MQA: single K, V shared by all Q heads
kv_heads_mqa = 1

# Memory impact (Llama-3 8B example, 2048 tokens, bfloat16):
# MHA would be: 2048 × 32 × 32 × 128 × 2 × 2 = 1,073,741,824 bytes = 1 GB
# GQA actual:   2048 × 32 ×  8 × 128 × 2 × 2 =   268,435,456 bytes = 256 MB ✓

1.6 — The Allocation Problem This Creates

Now that we know the KV cache is large, dynamic, and per-request, we can understand why allocating it naively is disastrous. Two failure modes emerge immediately:

Internal Fragmentation

A contiguous chunk reserved for the maximum possible output length (e.g. 2048 tokens) is held for a request that generates only 50 tokens. For Llama-3 8B that's 256 MB reserved but only 12 MB used. 244 MB sits idle until the request finishes.

External Fragmentation

As requests complete, freed memory becomes scattered. A new request needing 256 MB contiguously cannot start even if 300 MB total is free — because it's fragmented into non-contiguous chunks.
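A toy free-list makes the failure concrete (all numbers illustrative):

```python
# External fragmentation: 300 MB free in total, but split into
# non-contiguous chunks, none large enough for a 256 MB request.
free_chunks_mb = [96, 80, 64, 60]   # scattered free regions
request_mb = 256                    # needs one contiguous chunk

total_free = sum(free_chunks_mb)
can_start = any(chunk >= request_mb for chunk in free_chunks_mb)

print(total_free)  # 300 → plenty of memory "in aggregate"
print(can_start)   # False → yet the request cannot be placed
```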

Memory Fragmentation — Visual

Each cell = one 16-token block of GPU memory. Notice the wasted reservations on the left vs tight packing on the right:

Naive Allocation (legend: Req A actual use, Req B actual use, wasted reservation)
Utilization: 21% · Fragmented: ~79%
PagedAttention (vLLM) (legend: Req A, Req B, Req C, Req D)
Utilization: 95% · Fragmented: <1 block

This is the problem PagedAttention (Chapter 2) solves: treat GPU memory like an OS treats RAM — fixed-size pages, non-contiguous allocation, virtual mapping.

Chapter 02

PagedAttention

vLLM's core innovation: divide GPU memory into fixed-size physical blocks and give each request a virtual block table — exactly like an OS page table.

The OS Analogy

Think: Virtual Memory Paging

Physical block = physical RAM page.
Logical block table = page table (virtual → physical mapping).
Block pool = physical RAM pool managed by the kernel.
Prefix cache = shared libraries mapped read-only into multiple processes.
Swap = CPU offload when GPU pool is exhausted.

How It Works

GPU KV cache memory is divided into fixed-size physical blocks (default: 16 tokens each). Each request maintains a logical block table — an ordered list of physical block IDs that hold its KV data. These blocks need not be contiguous.

The attention kernel is modified to perform scatter-gather reads across non-contiguous blocks using the block table. The attention math is identical; only the memory access pattern changes.
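The translation step can be sketched in a few lines (`locate` is a hypothetical helper for illustration, not vLLM's kernel code):

```python
# Sketch: resolve a logical token index to its physical storage location
# through a per-request block table (illustrative, not vLLM's kernel).
BLOCK_SIZE = 16

def locate(block_table, token_idx):
    """Map a logical token index to (physical_block_id, offset_in_block)."""
    return block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

# A request whose logical blocks map to scattered physical blocks 7, 2, 9:
block_table = [7, 2, 9]
print(locate(block_table, 0))    # (7, 0) → token 0 lives in physical block 7
print(locate(block_table, 20))   # (2, 4) → token 20 lives in physical block 2
print(locate(block_table, 33))   # (9, 1) → token 33 lives in physical block 9
```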

Block Table — Physical Memory Mapping

Three requests share the same GPU block pool. Their logical blocks map to scattered physical blocks:

// Block Table
Logical View (per request)
Request A
Logical[0] → Phys #7
Logical[1] → Phys #2
Logical[2] → Phys #9
Request B
Logical[0] → Phys #1
Logical[1] → Phys #4
Request C
Logical[0] → Phys #6
Physical GPU Block Pool
Phys #0 — free
Phys #1 · Req B · tok 0–15
Phys #2 · Req A · tok 16–31
Phys #3 — free
Phys #4 · Req B · tok 16–31
Phys #5 — free
Phys #6 · Req C · tok 0–15
Phys #7 · Req A · tok 0–15
Phys #8 — free
Phys #9 · Req A · tok 32–47
Phys #10 — free
Phys #11 — free

Core Allocation Code

# KVCacheManager.allocate_slots — called every decode step
def allocate_slots(self, request, num_new_tokens):

    # How many new 16-token blocks are needed?
    num_blocks_needed = self.get_num_blocks_to_allocate(
        request_id=request.request_id,
        num_tokens=request.num_tokens + num_new_tokens,
    )

    # Is the pool large enough?
    if num_blocks_needed > self.block_pool.get_num_free_blocks():
        return None  # → scheduler will trigger preemption

    # Allocate blocks and update the request's block table
    new_blocks = self.allocate_new_blocks(request.request_id, ...)
    return KVCacheBlocks(new_blocks)

Prefix Caching (APC)

When requests share an identical prefix (e.g. the same system prompt), vLLM can map both requests' logical block tables to the same physical blocks. These blocks are hashed by token content — a cache hit skips prefill entirely.

APC in Practice

A 1000-token system prompt shared across 100 concurrent requests eliminates 100× prefill computation. Enable with --enable-prefix-caching. Copy-on-write semantics apply when a shared block must be mutated.
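One way to sketch content-based block hashing (illustrative only; vLLM's actual hashing scheme differs in detail):

```python
# Sketch: hash each *full* block by its token content chained with the
# previous block's hash, so a cache hit implies the entire prefix matches.
import hashlib

BLOCK_SIZE = 16

def block_hashes(token_ids):
    hashes, prev = [], b""
    n_full = len(token_ids) - len(token_ids) % BLOCK_SIZE  # only full blocks
    for i in range(0, n_full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(prev + repr(block).encode()).hexdigest()
        hashes.append(digest)
        prev = digest.encode()
    return hashes

shared_prefix = list(range(32))               # e.g. a common system prompt
req_a = block_hashes(shared_prefix + [101, 102])
req_b = block_hashes(shared_prefix + [201, 202])
print(req_a == req_b)   # True → both requests map to the same physical blocks
```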

Chapter 03

Continuous Batching

Traditional inference idles the GPU waiting for the slowest request. Continuous batching re-schedules on every single forward pass, filling freed capacity immediately.

Static vs. Continuous

// ❌ Static batching
batch = [req_A, req_B, req_C, req_D]
while any_running(batch):
    model.forward(batch)   // ALL wait for the slowest
// req_A done at step 10, but GPU idles until req_D at step 80

// ✅ Continuous batching (vLLM)
while True:
    scheduler.schedule()        // re-evaluate EVERY iteration
    outputs = model.forward(batch)
    for req in outputs:
        if req.is_done():
            free_blocks(req)          // instantly reclaim
    batch.add(waiting_queue.pop())

Live Simulation

7 requests compete for a 48-block GPU pool. Watch how finished requests immediately free capacity, allowing the waiting queue to drain continuously.

// Continuous Batching Simulator (interactive)
The simulator steps through the waiting queue, the running batch, and a 48-block GPU pool, updating free/used block counts and utilization as finished requests release capacity.
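The same dynamic can be reproduced in a few lines of plain Python (request lengths and slot count are made up; this is a sketch of the scheduling idea, not vLLM's scheduler):

```python
# Toy continuous-batching loop: 6 requests, 4 concurrent slots.
# A finished request frees its slot in the same step it completes.
waiting = [("A", 3), ("B", 8), ("C", 2), ("D", 5), ("E", 4), ("F", 1)]
running, done, step, MAX_SLOTS = [], [], 0, 4

while waiting or running:
    while waiting and len(running) < MAX_SLOTS:   # admit immediately
        name, remaining = waiting.pop(0)
        running.append([name, remaining])
    step += 1
    for req in running:                           # one decode step for all
        req[1] -= 1
    for req in [r for r in running if r[1] == 0]: # reclaim finished slots
        running.remove(req)
        done.append(req[0])

print(step, done)   # 8 ['C', 'A', 'F', 'D', 'E', 'B']
# Static batching would hold all four slots for max(3, 8, 2, 5) = 8 steps
# before admitting E and F at all, then need 4 more steps (12 total vs 8).
```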
Chapter 04

Memory Lifecycle

From request arrival to completion, every GPU block passes through a well-defined lifecycle managed by the scheduler and block manager in lockstep.

1

Request Arrives → Waiting Queue

No GPU memory allocated yet. Scheduler checks if enough free blocks exist.

// state: waiting
2

Prefill Phase — Allocate Initial Blocks

The entire prompt is processed in one forward pass. KV cache written into freshly allocated blocks. New block appended every 16 tokens.

// block_pool.allocate(ceil(prompt_len / 16))
3

Decode Loop — Allocate on Demand

Each decode step generates one token. Every 16 steps, one new block is needed. If the pool is empty, the scheduler preempts a lower-priority request.

// every 16 decode tokens: block_pool.allocate(1)
4

Request Completes → Free All Blocks

On EOS or max_tokens, all physical blocks are returned to the free list in the same scheduler iteration — immediately available.

// block_pool.free(request.block_table)
5

Next Iteration → Immediate Reuse

Freed blocks are considered in the very next scheduler call (milliseconds later). A waiting request immediately claims them. Zero idle time.

// scheduler.schedule() → waiting_queue.pop()
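The five stages above can be mirrored with a minimal block pool (class and method names are illustrative, not vLLM's):

```python
# Minimal block pool mirroring the lifecycle stages (a sketch only).
BLOCK_SIZE = 16

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self, n):
        if n > len(self.free):
            return None                    # stage 3: would trigger preemption
        taken, self.free = self.free[:n], self.free[n:]
        return taken

    def release(self, blocks):
        self.free.extend(blocks)           # stages 4-5: instantly reusable

pool = BlockPool(48)
blocks = pool.allocate(-(-500 // BLOCK_SIZE))  # stage 2: prefill, ceil(500/16) = 32
print(len(pool.free))                          # 16 blocks left
blocks += pool.allocate(1)                     # stage 3: decode crosses a block boundary
pool.release(blocks)                           # stage 4: request completes
print(len(pool.free))                          # 48 → stage 5: whole pool free again
```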
Chapter 05

Preemption Strategies

When the block pool runs dry, the scheduler must evict a running request to make room. Two strategies: swap to CPU, or discard and recompute.

When Does Preemption Trigger?

The scheduler runs before each forward pass. If a running request needs a new block but block_pool.get_num_free_blocks() == 0, it preempts one or more running requests using one of these strategies:

Strategy 1 — Swap (CPU Offload)
Preempted request selected
GPU blocks copied → CPU RAM
GPU blocks freed for others
Request enters "swapped" queue
When GPU space available: swap back
Request resumes from exact position
✓ No recomputation — KV cache preserved
✗ Slow — limited by PCIe bandwidth (~32 GB/s)
Strategy 2 — Recomputation
Preempted request selected
GPU blocks freed immediately
Token IDs preserved (cheap)
Request re-enters waiting queue
When rescheduled: re-run prefill
KV cache rebuilt from scratch
✓ Instant block release, no CPU RAM needed
✗ Wasted GPU compute on re-prefill
Rule of Thumb

Use swap when CPU RAM is plentiful and requests are long (expensive to recompute). Use recompute when CPU RAM is limited or requests are short. Default is recompute.
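A back-of-envelope cost model shows why (every constant here is an illustrative assumption, not a measurement):

```python
# When is swapping cheaper than recomputing? Illustrative constants:
# ~32 GB/s effective PCIe bandwidth, ~10k prefill tokens/s.
def swap_cost_s(kv_bytes, pcie_bytes_per_s=32e9):
    return 2 * kv_bytes / pcie_bytes_per_s       # copy out + copy back

def recompute_cost_s(num_tokens, prefill_tokens_per_s=10_000):
    return num_tokens / prefill_tokens_per_s     # re-run prefill from scratch

# A long 4096-token Llama-3 8B request (KV cache = 512 MB):
kv = 4096 * 32 * 8 * 128 * 2 * 2
print(round(swap_cost_s(kv), 3))        # 0.034  → swap
print(round(recompute_cost_s(4096), 3)) # 0.41   → recompute is ~12x costlier here
```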

Chapter 06

Configuration Reference

The key parameters that control vLLM's memory management and scheduling behavior.

Parameter · Default · Effect
gpu_memory_utilization · 0.9 · Fraction of GPU VRAM reserved for the KV block pool. Increase for more concurrency; decrease to leave headroom for activations.
block_size · 16 · Tokens per physical block. Smaller = finer granularity (less waste). Larger = better attention kernel efficiency. 16 is empirically optimal for most models.
max_num_seqs · 256 · Max concurrent sequences in one batch. Cap this if you're hitting OOM on activations (a separate budget from the KV cache).
swap_space · 4 GB · CPU RAM reserved for the swap block pool. Increase if you expect heavy preemption and want to preserve KV caches rather than recompute.
enable_prefix_caching · False · Enable APC: share physical blocks for identical prompt prefixes. High value for chatbots with shared system prompts.
preemption_mode · recompute · Strategy when the block pool is exhausted. recompute drops and re-prefills; swap offloads to CPU.
max_model_len · model max · Maximum sequence length (prompt + output). Caps block table size per request. Reducing it directly increases concurrency capacity.
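An example launch combining these parameters (flag names as used by recent vLLM releases; verify against `vllm serve --help` for your version):

```shell
# Serve Llama-3 8B with an explicit memory-management configuration.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --block-size 16 \
    --max-num-seqs 256 \
    --swap-space 4 \
    --enable-prefix-caching \
    --max-model-len 4096
```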

Startup Memory Calculation

# 1. Load weights → measure VRAM consumed
weight_mem = measure_model_footprint()

# 2. Profile dummy batch → measure peak activation memory
activation_peak = profile_with_dummy_batch()

# 3. Remaining VRAM → KV block pool
available = (total_vram * gpu_memory_utilization) \
          - weight_mem - activation_peak

# 4. Calculate block count
bytes_per_block = (block_size * num_layers
                   * num_kv_heads * head_dim
                   * 2  # K and V
                   * dtype_bytes)
num_gpu_blocks = available // bytes_per_block

# Example: A100 80 GB, Llama-3 8B, bfloat16, block_size=16
# → bytes_per_block = 16 × 32 × 8 × 128 × 2 × 2 = 2 MiB; with ~52 GB left
#   after weights and activations: ~26,600 blocks ≈ 425,000 token slots
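Worked through with rough assumptions (~16 GB of bf16 weights for the 8B model, ~4 GiB peak activations; both are estimates, not profiled values):

```python
# Worked example of the startup calculation, integer arithmetic throughout.
GiB = 2**30

total_vram = 80 * GiB
weight_mem = 16 * GiB            # assumption: ~8B params × 2 bytes (bf16)
activation = 4 * GiB             # assumption: profiled peak for a dummy batch
available = total_vram * 9 // 10 - weight_mem - activation   # 90% budget

# block_size × layers × kv_heads × head_dim × 2 (K and V) × 2 bytes
bytes_per_block = 16 * 32 * 8 * 128 * 2 * 2
num_gpu_blocks = available // bytes_per_block

print(bytes_per_block)           # 2097152 → each block is 2 MiB
print(num_gpu_blocks)            # 26624 blocks
print(num_gpu_blocks * 16)       # 425984 concurrent token slots
```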