GPU Memory Architecture

vLLM
Deep Dive

How vLLM achieves near-zero memory fragmentation and 2–4× higher throughput through PagedAttention and continuous batching.

Audience: SWE / ML Eng
Chapters: 6 + deep dive
Source: Kwon et al. 2023
Chapter 01

KV Cache — Deep Dive

The KV cache is the single most important concept in LLM serving. Understanding it from first principles — attention math, memory growth, and architectural variants — explains every design decision in vLLM.

1.1 — Transformer Attention: The Math

At the heart of every transformer is the scaled dot-product attention operation. For each token, the model learns to "look up" relevant information from all previous tokens by comparing queries against keys and retrieving weighted values.

Attention(Q, K, V)  =  softmax( Q · Kᵀ / √dₖ ) · V
Q = Query matrix
K = Key matrix
V = Value matrix
dₖ = head dimension (scaling factor)

Three weight matrices — WQ, WK, WV — are learned during training. At inference time, each token's embedding is multiplied by these weights to produce its Q, K, and V vectors.

Step 1
Q
Project → Query

Current token embedding × WQ. Represents "what am I looking for?"

Step 2
K
Project → Keys

All past tokens × WK. Each token says "what I represent as a key."

Step 3
Q·Kᵀ
Dot Product → Scores

Compare Q against every K. High dot-product = high relevance. Divide by √dₖ to stabilize gradients.

Step 4
V
Weighted Sum → Output

Softmax turns scores into probabilities. Multiply by V to produce the final attended representation.

# Scaled dot-product attention — the exact computation
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    """
    Q : [batch, heads, seq_len_q,  d_k]  ← only current token during decode
    K : [batch, heads, seq_len_kv, d_k]  ← ALL tokens (cached)
    V : [batch, heads, seq_len_kv, d_v]  ← ALL tokens (cached)
    """
    d_k = Q.shape[-1]

    # Similarity: how much does each query "match" each key?
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)

    # Causal mask: token i can only attend to tokens ≤ i
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))

    weights = F.softmax(scores, dim=-1)   # [batch, heads, seq_q, seq_kv]
    return torch.matmul(weights, V)          # [batch, heads, seq_q, d_v]
Analogy: Dictionary Lookup

Think of K-V as a fuzzy dictionary. The Query is your search term. Keys are the dictionary headwords. Values are the definitions. Attention computes a weighted blend of all definitions based on how well your search term matches each headword — no hard lookup, all soft.

1.2 — Why We Cache K and V, Not Q

During autoregressive decoding (generating one token at a time), the model runs a forward pass for a single new token. The key insight:

Q
Query

Computed only for the current token being generated. Once we've used Q to compute attention, we discard it — the next token will produce a completely new Q.

✗ NOT cached — recomputed each step
K
Key

Computed for every past token. The new token's Q must compare against all previous Ks. These never change — token 5's key is the same whether we're generating token 6 or token 600.

✓ CACHED — accumulated across steps
V
Value

Computed for every past token. After attention scores are computed, we retrieve a weighted blend of all past Vs to form the output. These also never change for past tokens.

✓ CACHED — accumulated across steps
# What happens on each decode step (generating token t)

# 1. Embed the NEW token only
x_new = embed(token_t)                     # shape: [1, d_model]

# 2. Project to Q, K, V for this token
q_new = x_new @ W_Q                       # [1, d_k]  ← only new token
k_new = x_new @ W_K                       # [1, d_k]  ← only new token
v_new = x_new @ W_V                       # [1, d_v]  ← only new token

# 3. APPEND to KV cache (this is what grows each step)
kv_cache.K = torch.cat([kv_cache.K, k_new], dim=0)  # [t, d_k]
kv_cache.V = torch.cat([kv_cache.V, v_new], dim=0)  # [t, d_v]

# 4. Attend: current Q against ALL cached K, V
output = attention(
    Q = q_new,       # [1, d_k]    ← 1 query (new token)
    K = kv_cache.K,  # [t, d_k]   ← t keys (ALL past tokens)
    V = kv_cache.V,  # [t, d_v]   ← t values (ALL past tokens)
)

# Without KV cache: we would recompute K and V for ALL t past tokens on
# every step → O(t²) total projection work. With cache: each token's K/V
# is computed exactly once → O(t) total.
The Cost of Not Caching

Without a KV cache, generating a 1000-token response requires computing K and V for token 1 a total of 999 times, for token 2 a total of 998 times, etc. That's O(n²) compute. With the KV cache it's O(n) — a dramatic difference for long sequences.
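This cost difference is easy to check with a toy count of K/V projection operations (a pure-Python sketch; the helper names are made up for illustration, not vLLM code):

```python
# Toy accounting (no real model): how many K/V projections are computed
# while generating n tokens, with and without a KV cache?
def kv_projections_without_cache(n):
    # step t must (re)project K and V for all t tokens seen so far
    return sum(t for t in range(1, n + 1))

def kv_projections_with_cache(n):
    # each token's K and V are projected exactly once, then cached
    return n

print(kv_projections_without_cache(1000))  # 500500 → O(n²)
print(kv_projections_with_cache(1000))     # 1000   → O(n)
```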

1.3 — Live KV Cache Simulator

Watch the KV cache grow token by token. Each new token appends a row to the K and V matrices at every layer. The memory meter shows actual GPU RAM consumed (Llama-3 8B settings: 32 layers, 8 KV heads, head_dim=128, bfloat16).

// KV Cache Growth Simulator — Llama-3 8B (interactive)
The simulator shows the K and V matrices (3 of 32 layers drawn) growing row by row as tokens are generated. Each colored cell = one 128-dim vector stored in bfloat16; columns = dimension chunks (8 KV heads × 128 = 1024 dims shown as 16 cells); rows = tokens per layer. A meter tracks GPU RAM used, up to 256 MB at the 2048-token cap.
Config: Layers 32 · KV Heads 8 · Head dim 128 · Dtype bfloat16
Memory = tokens × layers × kv_heads × head_dim × 2 (K+V) × 2 (bytes); at 2048 tokens: 2048 × 32 × 8 × 128 × 2 × 2 = 268,435,456 bytes = 256 MB
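The formula behind the meter, written as a runnable helper (defaults are the Llama-3 8B settings above; `kv_cache_bytes` is an illustrative name, not vLLM API):

```python
# KV cache size for a single request. Defaults are the Llama-3 8B
# settings used by the simulator (32 layers, 8 KV heads, head_dim 128, bf16).
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # tokens × layers × kv_heads × head_dim × 2 (K and V) × bytes per element
    return tokens * layers * kv_heads * head_dim * 2 * dtype_bytes

print(kv_cache_bytes(1) / 2**20)     # 0.125 MB per generated token
print(kv_cache_bytes(2048) / 2**20)  # 256.0 MB at the 2048-token cap
```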

1.4 — Real Memory Numbers

The KV cache formula depends on the model's architecture. For standard MHA, per-token KV cache size is:

# KV cache memory per token (one layer)
bytes_per_token_per_layer = (
    2          # K and V
  * num_kv_heads
  * head_dim
  * dtype_size  # 2 for bfloat16, 4 for float32
)

# Total KV cache for a full request
total_bytes = (
    num_tokens           # prompt + generated tokens
  * num_layers
  * bytes_per_token_per_layer
)

Real numbers for popular models at 2048 tokens per request (bfloat16):

GPT-2 (117M): MHA, 12 layers → 72 MB
Llama-2 7B: MHA, 32 layers → 1024 MB
Llama-3 8B: GQA, 32 layers → 256 MB
Llama-3 70B: GQA, 80 layers → 640 MB
Llama-3.1 405B: GQA, 126 layers → 1008 MB
KV Cache vs Model Weights

For Llama-3 70B with 32 concurrent requests at 2048 tokens each: model weights ≈ 140 GB (bfloat16), KV cache ≈ 32 × 0.64 GB = 20.5 GB. At 8192 tokens, KV cache balloons to 82 GB — nearly as large as the weights themselves. This is why KV cache management is the bottleneck, not weight memory.

1.5 — Multi-Head Attention: MHA vs GQA vs MQA

Modern LLMs use variations of multi-head attention that dramatically reduce KV cache size with minimal accuracy loss. The key axis: how many K/V heads share each Q head?

MHA — Multi-Head
Q₁
K₁
V₁
Q₂
K₂
V₂
Q₃
K₃
V₃
Q₄
K₄
V₄
1024 MB (Llama-2 7B @ 2048 tok)

Each Q head has its own private K and V head. Maximum expressivity, maximum memory. Used in: GPT-2, GPT-3, Llama-2.

GQA — Grouped-Query
Q₁
Q₂
K₁
V₁
shared
Q₃
Q₄
K₂
V₂
shared
256 MB (Llama-3 8B @ 2048 tok)

Groups of Q heads share a K/V head. 4× smaller KV cache than MHA with near-identical quality. Used in: Llama-3, Mistral, Gemma.

MQA — Multi-Query
Q₁
Q₂
Q₃
K₁
V₁
ALL share
Q₄
64 MB (example model @ 2048 tok)

All Q heads share a single K/V head. Extreme memory savings, but noticeable quality degradation for complex tasks. Used in: PaLM, Falcon.

# MHA: each of the num_heads heads has its own K, V
kv_heads_mha = num_heads          # e.g. 32 for Llama-2 7B

# GQA: num_kv_heads groups, each shared by (num_heads / num_kv_heads) Q heads
kv_heads_gqa = num_kv_heads       # e.g. 8 for Llama-3 8B (4× fewer than 32)

# MQA: single K, V shared by all Q heads
kv_heads_mqa = 1

# Memory impact (Llama-3 8B example, 2048 tokens, bfloat16):
# MHA would be: 2048 × 32 × 32 × 128 × 2 × 2 = 1,073,741,824 bytes = 1 GB
# GQA actual:   2048 × 32 ×  8 × 128 × 2 × 2 =   268,435,456 bytes = 256 MB ✓

1.6 — The Allocation Problem This Creates

Now that we know the KV cache is large, dynamic, and per-request, we can understand why allocating it naively is disastrous. Two failure modes emerge immediately:

Internal Fragmentation

A contiguous chunk reserved for the maximum possible output length (e.g. 2048 tokens) is held for a request that generates only 50 tokens. For Llama-3 8B that's 256 MB reserved but only 12 MB used. 244 MB sits idle until the request finishes.

External Fragmentation

As requests complete, freed memory becomes scattered. A new request needing 256 MB contiguously cannot start even if 300 MB total is free — because it's fragmented into non-contiguous chunks.
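A toy free-list makes the failure concrete (all numbers illustrative):

```python
# External fragmentation: 300 MB free in total, but split into
# non-contiguous chunks, none large enough for a 256 MB request.
free_chunks_mb = [96, 80, 64, 60]   # scattered free regions
request_mb = 256                    # needs one contiguous chunk

total_free = sum(free_chunks_mb)
can_start = any(chunk >= request_mb for chunk in free_chunks_mb)

print(total_free)  # 300 → plenty of memory "in aggregate"
print(can_start)   # False → yet the request cannot be placed
```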

Memory Fragmentation — Visual

Each cell = one 16-token block of GPU memory. Notice the wasted reservations on the left vs tight packing on the right:

Naive Allocation (legend: Req A actual use, Req B actual use, wasted reservation)
Utilization: 21% · Fragmented: ~79%
PagedAttention (vLLM) (legend: Req A, Req B, Req C, Req D)
Utilization: 95% · Fragmented: <1 block

This is the problem PagedAttention (Chapter 2) solves: treat GPU memory like an OS treats RAM — fixed-size pages, non-contiguous allocation, virtual mapping.

Chapter 02

PagedAttention

vLLM's core innovation: divide GPU memory into fixed-size physical blocks and give each request a virtual block table — exactly like an OS page table.

The OS Analogy

Think: Virtual Memory Paging

Physical block = physical RAM page.
Logical block table = page table (virtual → physical mapping).
Block pool = physical RAM pool managed by the kernel.
Prefix cache = shared libraries mapped read-only into multiple processes.
Swap = CPU offload when GPU pool is exhausted.

How It Works

GPU KV cache memory is divided into fixed-size physical blocks (default: 16 tokens each). Each request maintains a logical block table — an ordered list of physical block IDs that hold its KV data. These blocks need not be contiguous.

The attention kernel is modified to perform scatter-gather reads across non-contiguous blocks using the block table. The attention math is identical; only the memory access pattern changes.
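The translation step can be sketched in a few lines (`locate` is a hypothetical helper for illustration, not vLLM's kernel code):

```python
# Sketch: resolve a logical token index to its physical storage location
# through a per-request block table (illustrative, not vLLM's kernel).
BLOCK_SIZE = 16

def locate(block_table, token_idx):
    """Map a logical token index to (physical_block_id, offset_in_block)."""
    return block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

# A request whose logical blocks map to scattered physical blocks 7, 2, 9:
block_table = [7, 2, 9]
print(locate(block_table, 0))    # (7, 0) → token 0 lives in physical block 7
print(locate(block_table, 20))   # (2, 4) → token 20 lives in physical block 2
print(locate(block_table, 33))   # (9, 1) → token 33 lives in physical block 9
```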

Block Table — Physical Memory Mapping

Three requests share the same GPU block pool. Their logical blocks map to scattered physical blocks:

// Block Table
Logical View (per request)
Request A
Logical[0] → Phys #7
Logical[1] → Phys #2
Logical[2] → Phys #9
Request B
Logical[0] → Phys #1
Logical[1] → Phys #4
Request C
Logical[0] → Phys #6
Physical GPU Block Pool
Phys #0 — free
Phys #1 · Req B · tok 0–15
Phys #2 · Req A · tok 16–31
Phys #3 — free
Phys #4 · Req B · tok 16–31
Phys #5 — free
Phys #6 · Req C · tok 0–15
Phys #7 · Req A · tok 0–15
Phys #8 — free
Phys #9 · Req A · tok 32–47
Phys #10 — free
Phys #11 — free

Core Allocation Code

# KVCacheManager.allocate_slots — called every decode step
def allocate_slots(self, request, num_new_tokens):

    # How many new 16-token blocks are needed?
    num_blocks_needed = self.get_num_blocks_to_allocate(
        request_id=request.request_id,
        num_tokens=request.num_tokens + num_new_tokens,
    )

    # Is the pool large enough?
    if num_blocks_needed > self.block_pool.get_num_free_blocks():
        return None  # → scheduler will trigger preemption

    # Allocate blocks and update the request's block table
    new_blocks = self.allocate_new_blocks(request.request_id, ...)
    return KVCacheBlocks(new_blocks)

Prefix Caching (APC)

When requests share an identical prefix (e.g. the same system prompt), vLLM can map both requests' logical block tables to the same physical blocks. These blocks are hashed by token content — a cache hit skips prefill entirely.

APC in Practice

A 1000-token system prompt shared across 100 concurrent requests eliminates 100× prefill computation. Enable with --enable-prefix-caching. Copy-on-write semantics apply when a shared block must be mutated.
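One way to sketch content-based block hashing (illustrative only; vLLM's actual hashing scheme differs in detail):

```python
# Sketch: hash each *full* block by its token content chained with the
# previous block's hash, so a cache hit implies the entire prefix matches.
import hashlib

BLOCK_SIZE = 16

def block_hashes(token_ids):
    hashes, prev = [], b""
    n_full = len(token_ids) - len(token_ids) % BLOCK_SIZE  # only full blocks
    for i in range(0, n_full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(prev + repr(block).encode()).hexdigest()
        hashes.append(digest)
        prev = digest.encode()
    return hashes

shared_prefix = list(range(32))               # e.g. a common system prompt
req_a = block_hashes(shared_prefix + [101, 102])
req_b = block_hashes(shared_prefix + [201, 202])
print(req_a == req_b)   # True → both requests map to the same physical blocks
```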

Chapter 03

Continuous Batching

Traditional inference idles the GPU waiting for the slowest request. Continuous batching re-schedules on every single forward pass, filling freed capacity immediately.

Static vs. Continuous

// ❌ Static batching
batch = [req_A, req_B, req_C, req_D]
while any_running(batch):
    model.forward(batch)   // ALL wait for the slowest
// req_A done at step 10, but GPU idles until req_D at step 80

// ✅ Continuous batching (vLLM)
while True:
    scheduler.schedule()        // re-evaluate EVERY iteration
    outputs = model.forward(batch)
    for req in outputs:
        if req.is_done():
            free_blocks(req)          // instantly reclaim
    batch.add(waiting_queue.pop())

Live Simulation

7 requests compete for a 48-block GPU pool. Watch how finished requests immediately free capacity, allowing the waiting queue to drain continuously.

// Continuous Batching Simulator (interactive)
The simulator steps through the waiting queue, the running batch, and a 48-block GPU pool, updating free/used block counts and utilization as finished requests release capacity.
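The same dynamic can be reproduced in a few lines of plain Python (request lengths and slot count are made up; this is a sketch of the scheduling idea, not vLLM's scheduler):

```python
# Toy continuous-batching loop: 6 requests, 4 concurrent slots.
# A finished request frees its slot in the same step it completes.
waiting = [("A", 3), ("B", 8), ("C", 2), ("D", 5), ("E", 4), ("F", 1)]
running, done, step, MAX_SLOTS = [], [], 0, 4

while waiting or running:
    while waiting and len(running) < MAX_SLOTS:   # admit immediately
        name, remaining = waiting.pop(0)
        running.append([name, remaining])
    step += 1
    for req in running:                           # one decode step for all
        req[1] -= 1
    for req in [r for r in running if r[1] == 0]: # reclaim finished slots
        running.remove(req)
        done.append(req[0])

print(step, done)   # 8 ['C', 'A', 'F', 'D', 'E', 'B']
# Static batching would hold all four slots for max(3, 8, 2, 5) = 8 steps
# before admitting E and F at all, then need 4 more steps (12 total vs 8).
```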
Chapter 04

Memory Lifecycle

From request arrival to completion, every GPU block passes through a well-defined lifecycle managed by the scheduler and block manager in lockstep.

1

Request Arrives → Waiting Queue

No GPU memory allocated yet. Scheduler checks if enough free blocks exist.

// state: waiting
2

Prefill Phase — Allocate Initial Blocks

The entire prompt is processed in one forward pass. KV cache written into freshly allocated blocks. New block appended every 16 tokens.

// block_pool.allocate(ceil(prompt_len / 16))
3

Decode Loop — Allocate on Demand

Each decode step generates one token. Every 16 steps, one new block is needed. If the pool is empty, the scheduler preempts a lower-priority request.

// every 16 decode tokens: block_pool.allocate(1)
4

Request Completes → Free All Blocks

On EOS or max_tokens, all physical blocks are returned to the free list in the same scheduler iteration — immediately available.

// block_pool.free(request.block_table)
5

Next Iteration → Immediate Reuse

Freed blocks are considered in the very next scheduler call (milliseconds later). A waiting request immediately claims them. Zero idle time.

// scheduler.schedule() → waiting_queue.pop()
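The five stages above can be mirrored with a minimal block pool (class and method names are illustrative, not vLLM's):

```python
# Minimal block pool mirroring the lifecycle stages (a sketch only).
BLOCK_SIZE = 16

class BlockPool:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self, n):
        if n > len(self.free):
            return None                    # stage 3: would trigger preemption
        taken, self.free = self.free[:n], self.free[n:]
        return taken

    def release(self, blocks):
        self.free.extend(blocks)           # stages 4-5: instantly reusable

pool = BlockPool(48)
blocks = pool.allocate(-(-500 // BLOCK_SIZE))  # stage 2: prefill, ceil(500/16) = 32
print(len(pool.free))                          # 16 blocks left
blocks += pool.allocate(1)                     # stage 3: decode crosses a block boundary
pool.release(blocks)                           # stage 4: request completes
print(len(pool.free))                          # 48 → stage 5: whole pool free again
```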
Chapter 05

Preemption Strategies

When the block pool runs dry, the scheduler must evict a running request to make room. Two strategies: swap to CPU, or discard and recompute.

When Does Preemption Trigger?

The scheduler runs before each forward pass. If a running request needs a new block but block_pool.get_num_free_blocks() == 0, it preempts one or more running requests using one of these strategies:

Strategy 1 — Swap (CPU Offload)
Preempted request selected
GPU blocks copied → CPU RAM
GPU blocks freed for others
Request enters "swapped" queue
When GPU space available: swap back
Request resumes from exact position
✓ No recomputation — KV cache preserved
✗ Slow — limited by PCIe bandwidth (~32 GB/s)
Strategy 2 — Recomputation
Preempted request selected
GPU blocks freed immediately
Token IDs preserved (cheap)
Request re-enters waiting queue
When rescheduled: re-run prefill
KV cache rebuilt from scratch
✓ Instant block release, no CPU RAM needed
✗ Wasted GPU compute on re-prefill
Rule of Thumb

Use swap when CPU RAM is plentiful and requests are long (expensive to recompute). Use recompute when CPU RAM is limited or requests are short. Default is recompute.
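A back-of-envelope cost model shows why (every constant here is an illustrative assumption, not a measurement):

```python
# When is swapping cheaper than recomputing? Illustrative constants:
# ~32 GB/s effective PCIe bandwidth, ~10k prefill tokens/s.
def swap_cost_s(kv_bytes, pcie_bytes_per_s=32e9):
    return 2 * kv_bytes / pcie_bytes_per_s       # copy out + copy back

def recompute_cost_s(num_tokens, prefill_tokens_per_s=10_000):
    return num_tokens / prefill_tokens_per_s     # re-run prefill from scratch

# A long 4096-token Llama-3 8B request (KV cache = 512 MB):
kv = 4096 * 32 * 8 * 128 * 2 * 2
print(round(swap_cost_s(kv), 3))        # 0.034  → swap
print(round(recompute_cost_s(4096), 3)) # 0.41   → recompute is ~12x costlier here
```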

Chapter 06

Configuration Reference

The key parameters that control vLLM's memory management and scheduling behavior.

Parameter · Default · Effect
gpu_memory_utilization · 0.9 · Fraction of GPU VRAM reserved for the KV block pool. Increase for more concurrency; decrease to leave headroom for activations.
block_size · 16 · Tokens per physical block. Smaller = finer granularity (less waste). Larger = better attention kernel efficiency. 16 is empirically optimal for most models.
max_num_seqs · 256 · Max concurrent sequences in one batch. Cap this if you're hitting OOM on activations (a separate budget from the KV cache).
swap_space · 4 GB · CPU RAM reserved for the swap block pool. Increase if you expect heavy preemption and want to preserve KV caches rather than recompute.
enable_prefix_caching · False · Enable APC: share physical blocks for identical prompt prefixes. High value for chatbots with shared system prompts.
preemption_mode · recompute · Strategy when the block pool is exhausted. recompute drops and re-prefills; swap offloads to CPU.
max_model_len · model max · Maximum sequence length (prompt + output). Caps block table size per request. Reducing it directly increases concurrency capacity.
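An example launch combining these parameters (flag names as used by recent vLLM releases; verify against `vllm serve --help` for your version):

```shell
# Serve Llama-3 8B with an explicit memory-management configuration.
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --block-size 16 \
    --max-num-seqs 256 \
    --swap-space 4 \
    --enable-prefix-caching \
    --max-model-len 4096
```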

Startup Memory Calculation

# 1. Load weights → measure VRAM consumed
weight_mem = measure_model_footprint()

# 2. Profile dummy batch → measure peak activation memory
activation_peak = profile_with_dummy_batch()

# 3. Remaining VRAM → KV block pool
available = (total_vram * gpu_memory_utilization) \
          - weight_mem - activation_peak

# 4. Calculate block count
bytes_per_block = (block_size * num_layers
                   * num_kv_heads * head_dim
                   * 2  # K and V
                   * dtype_bytes)
num_gpu_blocks = available // bytes_per_block

# Example: A100 80 GB, Llama-3 8B, bfloat16, block_size=16
# → bytes_per_block = 16 × 32 × 8 × 128 × 2 × 2 = 2 MiB; with ~52 GB left
#   after weights and activations: ~26,600 blocks ≈ 425,000 token slots
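Worked through with rough assumptions (~16 GB of bf16 weights for the 8B model, ~4 GiB peak activations; both are estimates, not profiled values):

```python
# Worked example of the startup calculation, integer arithmetic throughout.
GiB = 2**30

total_vram = 80 * GiB
weight_mem = 16 * GiB            # assumption: ~8B params × 2 bytes (bf16)
activation = 4 * GiB             # assumption: profiled peak for a dummy batch
available = total_vram * 9 // 10 - weight_mem - activation   # 90% budget

# block_size × layers × kv_heads × head_dim × 2 (K and V) × 2 bytes
bytes_per_block = 16 * 32 * 8 * 128 * 2 * 2
num_gpu_blocks = available // bytes_per_block

print(bytes_per_block)           # 2097152 → each block is 2 MiB
print(num_gpu_blocks)            # 26624 blocks
print(num_gpu_blocks * 16)       # 425984 concurrent token slots
```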