The inference loop

Forward pass, KV cache, prefill vs decode, batching, speculative decoding.

This page is about the forward pass machinery itself: driven by weights, fed by context, ending in a decoding step. The KV cache is runtime state (see callout below).

One token, end to end

The hub showed the loop at a glance. Let's walk it more slowly.

  1. Tokenize the input text into integer IDs.
  2. Look up each ID's embedding from the embedding table.
  3. Inject position via RoPE (or whatever positional scheme this model uses).
  4. Pass through N transformer blocks. Each block: norm → attention → add → norm → MLP → add. Information flows through the residual stream.
  5. Take the hidden state at the last position — that's a single vector representing "what the model is thinking right now about what to say next."
  6. Project that vector through the output matrix W_out (often the transpose of the embedding matrix, "tied weights"). The result is a vector of logits — one number for every token in the vocabulary, often 50,000–200,000 numbers long.
  7. Softmax the logits into a probability distribution that sums to 1.
  8. Sample one token from that distribution according to your decoding settings (temperature, top-k, top-p).
  9. Append the sampled token to the sequence.
  10. Repeat from step 2 for the newly appended token, using the cache trick we'll discuss next so past work isn't redone.

The whole thing is like a writer dictating to themselves one word at a time, where the next word always depends on every word said so far. The model is a function from "everything written" to "probability of every possible next word." The loop is just calling that function over and over, growing the answer one word at a time.
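Steps 6–8 are compact enough to sketch. Here's a minimal NumPy version; the function name and default settings are illustrative, not any particular library's API:

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=50):
    """Steps 6-8: raw logits -> filtered distribution -> one sampled token ID."""
    logits = np.asarray(logits, dtype=np.float64) / temperature  # <1 sharpens
    if top_k is not None and top_k < len(logits):
        cutoff = np.sort(logits)[-top_k]            # k-th largest logit
        logits = np.where(logits >= cutoff, logits, -np.inf)  # mask the rest
    exp = np.exp(logits - logits.max())             # numerically stable softmax
    probs = exp / exp.sum()                         # step 7: sums to 1
    return int(np.random.choice(len(probs), p=probs))  # step 8: sample

token_id = sample_next_token(np.random.randn(50_000))  # one token, vocab of 50k
```

With `top_k=1` this degenerates to greedy decoding (always pick the argmax); raising the temperature flattens the distribution and makes rarer tokens more likely.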

Prefill vs decode — two very different cost shapes

If you've ever wondered why ChatGPT pauses for a second before starting to stream a response, then streams quickly word-by-word, you've experienced the prefill / decode split.

Prefill

When you submit a prompt of n tokens, the first forward pass runs over all n tokens at once, in parallel. Every token at every layer is computed in the same forward pass. This is great for GPUs (highly parallel, dense matrix multiplies all the way) but the cost scales with the prompt length. Prefill latency = "time to first token."

Decode

After the first token is produced, every subsequent token requires its own forward pass — but only for one new token at a time. Decode is sequential by necessity (you can't compute token 12 before token 11 exists). Decode latency per token stays roughly constant; the total decode time scales with how many tokens you generate.

Prefill is silently reading the entire question. Your eyes can scan the whole page in parallel; it's fast. Decode is then speaking the answer aloud, one word at a time. You can't say word seven before word six is out of your mouth; it's slow and serial.
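The two cost shapes can be captured in a back-of-envelope model. The throughput numbers below are made-up placeholders, not benchmarks; the point is the structure, not the values:

```python
def response_latency(n_prompt, n_output,
                     prefill_tok_per_s=5_000, decode_tok_per_s=50):
    """Toy cost model: prefill processes the whole prompt in parallel
    (high throughput), decode emits tokens serially (low throughput)."""
    time_to_first_token = n_prompt / prefill_tok_per_s
    total = time_to_first_token + n_output / decode_tok_per_s
    return time_to_first_token, total

ttft, total = response_latency(n_prompt=8_000, n_output=500)
# With these placeholder rates: ~1.6 s of silence, then ~10 s of streaming.
```

Note the asymmetry: doubling the prompt only stretches the initial pause, while doubling the requested output doubles the streaming phase.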

Two practical consequences:

  1. Time to first token scales with prompt length. A huge system prompt or a long retrieved document lengthens the initial pause, even if the answer is short.
  2. Total response time scales with output length. Once streaming starts, asking for a concise answer does more for perceived speed than trimming the prompt.

The KV cache — what it is, what it does, how big it gets

If decode just kept re-running the entire forward pass over the entire growing sequence for every new token, generating 1,000 tokens would be horrendous — each token costs more than the last because the sequence is longer. The trick that makes decode feasible: cache the K and V vectors of past tokens.

What's stored

For every past token, in every layer, in every attention head, the model stores two vectors: K (key) and V (value). On the next decode step, the new token's Q (query) is computed fresh, but it attends over the cached K's and V's of all the past tokens. No recomputation of past K's and V's needed.

What this doesn't mean

The new token still attends over every previous position. That's the whole point of attention. The cache doesn't reduce the amount of attention computation — it eliminates the redundant work of recomputing K and V projections that haven't changed.
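A single-head, unbatched decode step makes this concrete. This is a NumPy sketch, not production attention code; note that only the new token's K and V get computed, while the score computation still touches every cached position:

```python
import numpy as np

def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One attention step during decode, single head, no batching."""
    q = x_new @ W_q                 # query for the new token (always fresh)
    k_cache.append(x_new @ W_k)     # K and V are computed once, cached forever
    v_cache.append(x_new @ W_v)
    K = np.stack(k_cache)           # (n_tokens, head_dim)
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(q.shape[-1])  # attends over ALL cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V              # attention output for the new token
```

Each call appends one entry to the caches, which is exactly why cache memory grows linearly with the number of tokens generated.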

Memory footprint

cache_bytes ≈ 2 × n_layers × n_heads × head_dim × n_tokens × bytes_per_value

For a Llama 3 70B-style model with full multi-head attention (80 layers, 64 heads, head_dim 128, FP16): one token of cache is about 2.6 MB. A 100k-token context is ~260 GB just for the cache, far more than the ~140 GB the FP16 weights themselves take. (The real Llama 3 70B uses grouped-query attention with only 8 KV heads, shrinking the cache 8×, but the shape of the problem is unchanged.) This is why long context is the most memory-intensive lever you can pull.
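The formula is worth running yourself. A small calculator, using the Llama-3-70B-like shape as input:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """The formula above: factor of 2 for K and V, FP16 = 2 bytes per value."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Full multi-head attention, 80 layers, 64 heads, head_dim 128:
per_token = kv_cache_bytes(80, 64, 128, 1)        # 2,621,440 B, ~2.6 MB/token
full_mha = kv_cache_bytes(80, 64, 128, 100_000)   # ~262 GB at 100k tokens

# Grouped-query attention with 8 KV heads (as in the real Llama 3 70B)
# shrinks the cache 8x: ~33 GB at 100k tokens.
gqa = kv_cache_bytes(80, 8, 128, 100_000)
```

Plugging in your own model's shape is the fastest way to see whether a long-context deployment will fit in GPU memory.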

The KV cache is your conversation notes. When the conversation has gone on for an hour, you don't re-read the entire transcript before saying each next sentence — you glance at your accumulated notes and respond. The notes grow as the conversation grows; eventually they fill the table.

The KV cache and intermediate activations are transient computation state. They live during one request and disappear. They are not weights (which are learned and persistent), and not context (which is the input you sent). When someone says "the model forgot what I said," the issue is almost always that older context rolled out of the input window or was never passed in by the surrounding system — not that the cache lost it. Don't confuse runtime state with the four lenses.

Try the loop yourself
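There's no real model wired up on this page, but the skeleton of the loop fits in a few lines. Here the "forward pass" is a stand-in that returns fake logits; swapping in a real model (for instance via Hugging Face transformers) means replacing `toy_model` with an actual forward pass:

```python
import numpy as np

def toy_model(token_ids, vocab_size=16, seed=0):
    """Stand-in for steps 4-6: deterministic fake logits derived from the
    sequence. A real model would run N transformer blocks here."""
    rng = np.random.default_rng(seed + sum(token_ids))
    return rng.standard_normal(vocab_size)

def generate(prompt_ids, n_new_tokens, temperature=1.0, seed=0):
    rng = np.random.default_rng(seed)
    ids = list(prompt_ids)
    for _ in range(n_new_tokens):
        logits = toy_model(ids) / temperature      # steps 4-6
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                       # step 7: softmax
        ids.append(int(rng.choice(len(probs), p=probs)))  # steps 8-9
    return ids

out = generate([1, 2, 3], n_new_tokens=5)
```

The structure is the whole lesson: a fixed function called in a loop, with its own output appended to its next input.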

Batching — sharing the GPU across users

A single user's decode underutilizes a modern GPU dramatically — the matrix multiply for one token is too small to fill a chip designed to multiply enormous matrices. So inference servers batch: combine many users' decode steps into one forward pass that's wide instead of deep.

The naive version is "static batching" — wait until you have n requests, run them together, return all results. But this couples slow requests with fast ones. Continuous batching lets requests join and leave the batch dynamically: as one request finishes, a new one slots in to take its place mid-flight. This is how production LLM serving (vLLM, TGI, TensorRT-LLM) gets high utilization without crushing per-user latency.
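The join-and-leave behaviour is easy to simulate. A toy scheduler, where each request is just a count of decode steps it needs and each "tick" is one batched forward pass (real servers like vLLM also manage cache memory per slot, which this ignores):

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """requests: list of decode-step counts. Returns {request_id: finish_tick}.
    Finished requests free their slot; queued ones join mid-flight."""
    queue = deque(enumerate(requests))   # (request_id, remaining_steps)
    active, finished_at, tick = {}, {}, 0
    while queue or active:
        while queue and len(active) < max_batch:   # fill free slots
            rid, steps = queue.popleft()
            active[rid] = steps
        tick += 1
        for rid in list(active):                   # one batched forward pass
            active[rid] -= 1
            if active[rid] == 0:
                finished_at[rid] = tick
                del active[rid]
    return finished_at

print(continuous_batching([2, 5, 1], max_batch=2))
```

Note how the short third request (1 step) finishes at tick 3 instead of waiting behind the 5-step request, which is the whole advantage over static batching.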

Batching does change one thing for the end user: latency variance. If you're alone on the server, your decode is fast. If a hundred other requests arrive, you slow down even though the model didn't change. This is why benchmark numbers from a quiet test rig rarely match production performance.

Speculative decoding — letting a small model do the typing

Decode is serial — you can't generate token N+1 until token N exists. Or can you? Speculative decoding says: let a small fast "draft" model cheaply guess the next few tokens, say N+1, N+2, N+3. Then have the big model verify all three guesses in one forward pass (which it can, because verifying multiple positions in parallel is just like prefill). Accepted tokens get appended; the first wrong guess and everything after it gets thrown out.

A junior writer speed-types a draft. The senior reads it and only corrects the parts that are wrong. As long as the junior is right most of the time, the senior gets through the work much faster than typing every word themselves.

In practice this gives 1.5×–3× decode speedup with no quality loss (the big model is still the only one whose output is final). The tricky part is choosing a good draft model that agrees with the big model often enough to make verification worthwhile.
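The draft-and-verify cycle can be sketched with exact-match (greedy) acceptance. `target_next` and `draft_next` are hypothetical stand-ins mapping a sequence to its next token; real implementations use a probabilistic accept rule over distributions, and also get one bonus token free from the verification pass:

```python
def speculative_decode(target_next, draft_next, prompt, n_tokens, k=4):
    """Simplified greedy speculative decoding."""
    seq = list(prompt)
    while len(seq) < len(prompt) + n_tokens:
        guesses = []
        for _ in range(k):                         # draft types k tokens ahead
            guesses.append(draft_next(seq + guesses))
        accepted = []
        for i in range(k):                         # in a real system this loop
            correct = target_next(seq + guesses[:i])  # is ONE parallel pass
            if guesses[i] == correct:
                accepted.append(guesses[i])        # draft was right: keep it
            else:
                accepted.append(correct)           # first mismatch: take the
                break                              # target's token, drop the rest
        seq.extend(accepted)
    return seq[:len(prompt) + n_tokens]
```

Because the target corrects every mismatch, the final sequence is identical to what the target alone would have produced; only the speed changes with draft quality.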

Scaffolding — RAG, tool use, agents, reasoning chains

None of these change the model. They change what the model sees in its context window, and what happens to the model's output before the next call. RAG injects retrieved documents into the prompt. Tool use parses the model's output, runs an external function (search, calculator, code execution), and injects the result back into the conversation. Reasoning scaffolds (chain-of-thought, ReAct, multi-step agent loops) just structure prompting and feedback. From the model's point of view, every call looks like another forward pass on whatever's currently in its context. The intelligence in scaffolding lives entirely outside the weights.
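A minimal tool-use scaffold makes the point concrete. The `TOOL[name](args)` syntax here is a made-up convention for this sketch, not a real protocol, and `model` is any function from context string to output string:

```python
import re

def scaffold_loop(model, tools, prompt, max_steps=5):
    """The model never changes; the loop only edits what lands in its context."""
    context = prompt
    output = ""
    for _ in range(max_steps):
        output = model(context)                    # one ordinary forward pass
        call = re.search(r"TOOL\[(\w+)\]\((.*?)\)", output)
        if call is None:
            return output                          # no tool call: final answer
        name, args = call.groups()
        result = tools[name](args)                 # run the external function
        # Inject the result; the next call is just another forward pass
        # on this enlarged context.
        context = context + "\n" + output + "\nRESULT: " + str(result)
    return output
```

Everything an "agent" does (deciding when to call a tool, feeding results back, stopping) lives in this outer loop, not in the weights.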