The transformer block

Attention + MLP + residual + norm, repeated N times. What depth actually buys you.

Weights: attention projections, MLP weights, and norm parameters all live here.
Context: information flows up through the residual stream as the block transforms it.

Anatomy of a single block

A transformer block is a small, repeated unit. Modern decoder-only LLMs stack 30–80 of them. The block itself is short:

[Figure: one transformer block. Input vector → LayerNorm / RMSNorm → multi-head attention → residual add (+) → LayerNorm / RMSNorm → MLP (per-token feed-forward) → residual add (+).]

Two sub-blocks: attention then MLP. Each is wrapped in a residual connection (the dotted bypass) and a normalization step.

Read top to bottom:

  1. Normalize the input (LayerNorm or RMSNorm — modern models prefer RMSNorm for being slightly cheaper).
  2. Run multi-head attention — tokens look at each other (see attention.html).
  3. Add the attention output back to the original input. This is the residual connection.
  4. Normalize again.
  5. Run a small per-token feed-forward neural network (the MLP).
  6. Add again.

That's one block. Output goes into the next block. Sixty times.
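The six steps above can be sketched in a few lines of NumPy. This is a minimal illustration under loud assumptions, not any particular model's implementation: attention is single-head rather than multi-head, ReLU stands in for GELU / SwiGLU, and names such as `init_block`, `g1`, and `W_up` are made up for the example.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Rescale each token vector to unit root-mean-square, then apply a learned gain.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gain

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, p):
    # Single-head causal attention, a stand-in for multi-head attention (assumption).
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # a token may not look ahead
    return softmax(scores) @ v @ p["Wo"]

def mlp(x, p):
    # Per-token feed-forward; ReLU keeps the sketch short (assumption).
    return np.maximum(x @ p["W_up"], 0.0) @ p["W_down"]

def block(x, p):
    x = x + attention(rms_norm(x, p["g1"]), p)  # steps 1-3: norm, attention, add
    x = x + mlp(rms_norm(x, p["g2"]), p)        # steps 4-6: norm, MLP, add
    return x

def init_block(rng, d):
    # Hypothetical initialiser: random weights scaled by 1/sqrt(fan_in).
    w = lambda *shape: rng.normal(size=shape) / np.sqrt(shape[0])
    return {"g1": np.ones(d), "g2": np.ones(d),
            "Wq": w(d, d), "Wk": w(d, d), "Wv": w(d, d), "Wo": w(d, d),
            "W_up": w(d, 4 * d), "W_down": w(4 * d, d)}

rng = np.random.default_rng(0)
d, seq_len, n_blocks = 16, 5, 3
blocks = [init_block(rng, d) for _ in range(n_blocks)]

x = rng.normal(size=(seq_len, d))   # one vector per token
y = x
for p in blocks:                    # output of one block feeds the next
    y = block(y, p)
print(y.shape)  # (5, 16)
```

The shape never changes from block to block, which is what lets the same unit be stacked as many times as the budget allows.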

What the MLP actually does

After attention has let tokens swap information, the MLP runs on each token independently: no token-to-token communication, just a small neural network applied identically at every position. Typical structure: project up to ~4× the embedding dimension, apply a non-linearity (GELU, or a gated variant such as SwiGLU), project back down.
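A small sketch of the "applied identically at every position" point, assuming a plain (non-gated) MLP with 4× expansion and the tanh approximation of GELU. The weight names are illustrative. Running the MLP over the whole sequence gives the same result as running it one token at a time.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W_up, W_down):
    # project up, apply the non-linearity, project back down
    return gelu(x @ W_up) @ W_down

rng = np.random.default_rng(0)
d, seq_len = 8, 4
W_up = rng.normal(size=(d, 4 * d)) * 0.1     # d -> 4d
W_down = rng.normal(size=(4 * d, d)) * 0.1   # 4d -> d
x = rng.normal(size=(seq_len, d))

whole_sequence = mlp(x, W_up, W_down)
one_at_a_time = np.stack([mlp(tok, W_up, W_down) for tok in x])
print(np.allclose(whole_sequence, one_at_a_time))  # True
```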

Attention is the conversation phase: tokens compare notes, see what others know, pull in relevant context. The MLP is the thinking phase: each token, having heard from the others, retreats to its own corner and processes what it just learned. Then the next attention round happens, and the next thinking round, and so on for all N layers.

A common research finding: a lot of factual knowledge in LLMs lives in the MLP weights. Fine-tuning that targets specific factual updates often touches MLP layers more than attention. This isn't a hard rule. A plausible reason is that in a standard dense block the MLP holds roughly two-thirds of the parameters (about 8d² versus 4d² for attention, at 4× expansion), so that's where most knowledge ends up packed.
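To make the capacity comparison concrete, here is a back-of-the-envelope count. It assumes standard multi-head attention (four d × d projections), a non-gated MLP with 4× expansion, and no biases. Real models vary (grouped-query attention shrinks the attention side, for example), so treat the numbers as illustrative only.

```python
def attention_params(d):
    # Q, K, V and output projections, each d x d (biases ignored).
    return 4 * d * d

def mlp_params(d, expansion=4):
    # Up-projection d -> expansion*d plus down-projection back (biases ignored).
    return 2 * expansion * d * d

d = 4096  # illustrative embedding dimension
attn = attention_params(d)
mlp = mlp_params(d)
print(attn, mlp)                       # 67108864 134217728
print(round(mlp / (attn + mlp), 3))    # 0.667
```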

The residual stream — the highway through the model

Notice the dotted lines bypassing both attention and MLP in the diagram. Those are residual connections. Their job is simple but profound: instead of replacing the token's vector at each block, attention and MLP add their contribution to it. The original information is always preserved.

The vector that flows up through every layer — gathering additions but never replacements — is the residual stream.

The residual stream is a shared whiteboard running floor-to-ceiling through every layer. Each layer reads what's already on the whiteboard, computes its own contribution, and writes it on top. Nothing is erased. By the top floor, the whiteboard is dense with annotations from every layer — each one's contribution still legible if you knew where to look.
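The whiteboard picture has a direct arithmetic reading: the final vector equals the original embedding plus the sum of every layer's contribution. A minimal sketch, where `sublayer` is a hypothetical stand-in for "normalize, then attention or MLP":

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 8, 4
weights = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_layers)]

def sublayer(x, W):
    # Stand-in for a real sub-block: reads the stream, returns a contribution.
    return np.tanh(x @ W)

x0 = rng.normal(size=d)   # the token's embedding
stream = x0.copy()
contributions = []
for W in weights:
    delta = sublayer(stream, W)   # read what is on the whiteboard
    contributions.append(delta)
    stream = stream + delta       # write on top; nothing is overwritten

# The top-of-stack vector is the embedding plus every layer's addition.
print(np.allclose(stream, x0 + sum(contributions)))  # True
```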

Two important consequences:

What depth buys you — the layer slider

As a token's vector travels up through the layers, what gets added to it changes character. Empirically:

[Interactive figure: a layer slider illustrating this qualitative shift on a tiny example.]

Why deeper isn't always better

More layers means more capacity, but capacity costs both compute (forward-pass cost scales roughly linearly with depth) and parameters (each layer holds millions to hundreds of millions of weights). Diminishing returns kick in: a 200-layer model is rarely twice as good as a 100-layer model of the same width, but it costs roughly twice as much to run and, because layers execute one after another, takes roughly twice as long per token.

Modern open-weight models tend to land in the 30–80 layer range for this reason. Width (embedding dimension and MLP expansion) and depth (number of blocks) are tuned together. There's no universal right answer, but for a given parameter budget, very deep and very narrow rarely wins.
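A rough way to see the width/depth trade is to count block parameters as about 12·d² per layer (4d² for attention plus 8d² for the MLP, assuming a 4× non-gated MLP and ignoring embeddings, norms, and biases). Under that assumption, two quite different shapes can spend exactly the same budget:

```python
def block_params(d, expansion=4):
    # 4*d^2 for attention projections + 2*expansion*d^2 for the MLP.
    return 4 * d * d + 2 * expansion * d * d

def stack_params(n_layers, d):
    return n_layers * block_params(d)

wide = stack_params(n_layers=32, d=4096)    # shallower, wider
deep = stack_params(n_layers=128, d=2048)   # 4x deeper, half as wide

print(wide, deep)  # 6442450944 6442450944
```

Same parameter count, but the deep variant has four times as many sequential steps per token. The count says nothing about which one trains better; it only shows that the budget alone doesn't pick the shape.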

The sparse block — same attention, different MLP

Most of this page still applies if a model is MoE: attention, residual stream, norm, stacking — unchanged. What changes is the MLP slot. Instead of one fat feed-forward network that every token runs through, an MoE block has many smaller expert FFNs and a router that picks a few per token. Total parameters balloon; per-token compute stays roughly flat.

The motivation

In a dense block, the MLP is where most of the parameters live. Scaling the model means scaling the MLP — and that means every token pays for every parameter, at every layer. At some point you're paying a lot for capacity the current token doesn't need.

Idea: keep many specialized MLPs, and only run a few of them per token. Memory goes up (you still need all of them on the GPU); per-token compute stays small.
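The memory-versus-compute split can be put in numbers. The sketch below assumes each expert has the same shape as the dense MLP, 8 experts, top-2 routing, and a non-gated FFN without biases. The function names are invented for the example.

```python
def ffn_params(d, hidden):
    # Up-projection plus down-projection, biases ignored.
    return 2 * d * hidden

def moe_ffn_params(d, hidden, num_experts, top_k):
    per_expert = ffn_params(d, hidden)
    router = d * num_experts                    # one linear layer d -> num_experts
    total = num_experts * per_expert + router   # must all sit in memory
    active = top_k * per_expert + router        # actually used per token
    return total, active

d, hidden = 4096, 16384
dense = ffn_params(d, hidden)
total, active = moe_ffn_params(d, hidden, num_experts=8, top_k=2)

print(round(total / dense, 2))   # 8.0  (memory grows with the expert count)
print(round(active / dense, 2))  # 2.0  (per-token compute grows only with k)
```

With full-size experts, top-2 costs about two dense MLPs per token. Designs that want per-token compute to stay close to the dense model make each expert proportionally smaller.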

A dense block is a single chef doing every step of every dish. An MoE block is a kitchen brigade — a pasta chef, a grill chef, a dessert chef — and a maître d' who sends each order to the right specialists. Only two chefs touch your meal, but the whole brigade is on the clock.

Anatomy of a sparse block

[Figure: a sparse block. Input vector → self-attention (unchanged) → LayerNorm / RMSNorm → router (gate) → experts E1–E8, with the top-2 highlighted: only these compute for this token.]

A sparse block with 8 experts, top-2 routing. The other 6 experts sit idle for this token but are still in memory.

The router (gate)

The router is small — just a single linear layer from embedding-dim to num-experts, followed by softmax and a top-k selection. Input: the token's hidden vector at this layer. Output: k expert indices + a weight for each (how much the final merge weights that expert's contribution).

[Figure: the router. Token hidden state (d-dim) → linear layer d → N experts (a small d × N weight matrix) → softmax → top-k indices + weights.]

The gate is a tiny classifier. Cheap to run; its whole job is picking which experts to use.
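The router described above fits in a handful of lines. This sketch assumes the top-k weights are renormalized so they sum to 1, which some models do and others don't. `W_gate` and `route` are illustrative names.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def route(h, W_gate, k):
    # h: the token's hidden vector (d,). W_gate: (d, num_experts).
    probs = softmax(h @ W_gate)           # one score per expert
    top = np.argsort(-probs)[:k]          # indices of the k highest scores
    weights = probs[top] / probs[top].sum()  # renormalize over the chosen k (assumption)
    return top, weights

rng = np.random.default_rng(0)
d, num_experts, k = 16, 8, 2
W_gate = rng.normal(size=(d, num_experts))
h = rng.normal(size=d)

experts, weights = route(h, W_gate, k)
print(len(experts), round(float(weights.sum()), 6))  # 2 1.0
```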

The experts

Each expert is an ordinary FFN with the same structure as a dense model's MLP, though in fine-grained designs each one is smaller. There are typically 8, 16, or 64 of them, or hundreds (DeepSeek-V3). They're randomly initialized at the start and specialize as training proceeds: one expert might drift toward code-like tokens, another toward punctuation contexts, another toward numeric sequences. The aux loss (see below) doesn't assign specialties; it only keeps the load spread so that every expert sees enough tokens to learn from. Nobody tells them how to specialize; it emerges.

Each expert is a specialist chef. Pasta chef, grill chef, dessert chef. They never do each other's jobs. Over a year of service, each gets faster and better at their one thing.

Top-1, top-2, higher-k

How many experts to activate per token is a trade-off. Top-1 (Switch Transformer) is simplest and cheapest but risks brittle routing: if the router picks wrong, there is no fallback. Top-2 (Mixtral) was long the common default: cheap enough, with ensemble-like robustness from a weighted merge of two expert outputs. DeepSeek-V3 uses top-8 with very fine-grained experts (256 routed). Higher k over fine-grained experts is increasingly the frontier move.
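Whatever k is, the merge step looks the same: run only the chosen experts and add up their outputs, weighted by the router. A minimal, unoptimized sketch (a plain loop over tokens, ReLU experts, renormalized top-k weights, all of which are assumptions made for brevity):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def expert_ffn(x, W_up, W_down):
    return np.maximum(x @ W_up, 0.0) @ W_down

def moe_layer(x, W_gate, experts, k):
    # x: (tokens, d). experts: list of (W_up, W_down) pairs.
    probs = softmax(x @ W_gate)                 # (tokens, num_experts)
    top = np.argsort(-probs, axis=-1)[:, :k]    # chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        w = probs[t, top[t]]
        w = w / w.sum()                         # renormalize over the chosen k
        for e, w_e in zip(top[t], w):
            W_up, W_down = experts[e]
            out[t] += w_e * expert_ffn(x[t], W_up, W_down)
    return out, top

rng = np.random.default_rng(0)
d, hidden, num_experts, tokens = 8, 16, 4, 6
W_gate = rng.normal(size=(d, num_experts))
experts = [(rng.normal(size=(d, hidden)) * 0.1, rng.normal(size=(hidden, d)) * 0.1)
           for _ in range(num_experts)]
x = rng.normal(size=(tokens, d))

out_top1, chosen_top1 = moe_layer(x, W_gate, experts, k=1)
out_top2, chosen_top2 = moe_layer(x, W_gate, experts, k=2)
print(out_top2.shape, chosen_top2.shape)  # (6, 8) (6, 2)
```

With k=1 each token's output is exactly one expert's output; with k=2 it is a blend of two, which is where the ensemble-like robustness comes from.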

Expert capacity — the tokens-per-chef limit

Batched matrix multiplication needs fixed-size tensors. But routing is per-token — if 40% of a batch routes to expert 3, that expert's tensor is huge while others are empty. The solution: expert capacity. Each expert gets a budget of
capacity_factor × (tokens_in_batch / num_experts)
tokens per batch. When an expert is full, overflow tokens are dropped — they skip the MLP for this layer entirely and rely on the residual stream to carry information forward.

[Figure: tokens routed per expert per batch, with a capacity line. E1: 16 toks, E2: 20 toks, E3: 26 toks (overflow dropped), E4: 14 toks, E5: 19 toks, E6: 12 toks, E7: 5 toks.]

Expert 3 is over capacity; the overflow tokens are dropped — they bypass the MLP and rely on the residual stream.
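The capacity rule above can be traced by hand on a toy batch. The sketch assumes top-1 routing, rounds the capacity up, and fills experts first come, first served in batch order. Those three choices are assumptions for illustration; real implementations differ on all of them.

```python
import math

def expert_capacity(tokens_in_batch, num_experts, capacity_factor):
    # capacity_factor x (tokens_in_batch / num_experts), rounded up (assumption).
    return math.ceil(capacity_factor * tokens_in_batch / num_experts)

def apply_capacity(assignments, num_experts, capacity):
    # assignments[i] is the expert chosen for token i (top-1 routing).
    load = [0] * num_experts
    kept, dropped = [], []
    for token, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append(token)
        else:
            dropped.append(token)   # skips the MLP; the residual carries it forward
    return kept, dropped, load

# 16 tokens, 4 experts; expert 2 is popular and receives half the batch.
assignments = [2] * 8 + [0] * 3 + [1] * 3 + [3] * 2
capacity = expert_capacity(len(assignments), num_experts=4, capacity_factor=1.25)
kept, dropped, load = apply_capacity(assignments, 4, capacity)

print(capacity)  # 5
print(load)      # [3, 3, 5, 2]
print(dropped)   # [5, 6, 7]
```

Raising `capacity_factor` drops fewer tokens but pads every expert's tensor with more empty slots, which is the trade the factor controls.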

Load balance — keeping every chef busy

Without an incentive to spread tokens around, routers collapse. Early in training, one expert happens to be slightly better than the others by chance; the router sends more tokens to it; that expert gets more gradient, improves further, and attracts even more tokens. Eventually all tokens route to one expert and the others are dead. That is router collapse.

The fix: an auxiliary loss during training that penalizes uneven routing. If expert usage is concentrated, the aux loss pushes back. Paired with small Gaussian noise injected into the routing logits (for exploration), this is usually enough to keep the kitchen balanced.

[Figure: expert usage over training steps. Left: aux loss off, collapse. Right: aux loss on, balanced.]

Without aux loss (left), one expert monopolizes; the others are starved. With aux loss (right), usage converges toward even.

The aux loss is the maître d's manager walking the floor with a clipboard. If any chef got fewer than their fair share of orders this hour, the maître d' gets a talking-to. Next hour, the maître d' spreads orders more evenly. Repeat.
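The text doesn't say which auxiliary loss is meant, so the sketch below assumes one common formulation, the Switch Transformer style: num_experts × Σ (fraction of tokens sent to expert i) × (mean router probability for expert i). It is lowest, at 1.0, when usage is even, and grows as routing concentrates. `make_probs` is a hypothetical helper that builds toy router outputs.

```python
import numpy as np

def load_balance_loss(router_probs):
    # router_probs: (tokens, num_experts), each row sums to 1. Top-1 routing assumed.
    num_tokens, num_experts = router_probs.shape
    chosen = router_probs.argmax(axis=-1)
    frac_tokens = np.bincount(chosen, minlength=num_experts) / num_tokens
    mean_prob = router_probs.mean(axis=0)
    return num_experts * float(np.sum(frac_tokens * mean_prob))

def make_probs(preferred, num_experts, peak=0.7):
    # Each token puts `peak` on its preferred expert, the rest spread evenly.
    rest = (1.0 - peak) / (num_experts - 1)
    probs = np.full((len(preferred), num_experts), rest)
    probs[np.arange(len(preferred)), preferred] = peak
    return probs

balanced = make_probs(np.arange(8) % 4, num_experts=4)           # tokens spread evenly
collapsed = make_probs(np.zeros(8, dtype=int), num_experts=4)    # everyone picks expert 0

print(round(load_balance_loss(balanced), 3))   # 1.0
print(round(load_balance_loss(collapsed), 3))  # 2.8
```

In training this term would be scaled by a small coefficient and added to the main loss, so that balance is encouraged without overriding what the router learns about the tokens.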

Shared experts (DeepSeek-style)

Newer MoE designs (DeepSeek-V2, V3) use a hybrid: one or two shared experts that run for every token, alongside the router-selected experts. The shared expert captures patterns common across all tokens (basic syntactic structure, common phrases), so the routed experts don't waste capacity re-learning those — they specialize further.

[Figure: shared-expert block. The token goes to the shared expert (always on) and, through the router, to routed experts R1–R4. Output = shared + top-k routed.]

Shared experts handle common patterns for every token in parallel with router-selected specialists.
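The "output = shared + top-k routed" rule for a single token might look like this. It is a sketch with one shared expert, ReLU FFNs, and renormalized routing weights, all of which are simplifying assumptions rather than the DeepSeek recipe.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ffn(x, W_up, W_down):
    return np.maximum(x @ W_up, 0.0) @ W_down

def shared_moe(x, shared, routed, W_gate, k):
    # shared: one (W_up, W_down) pair that runs for every token.
    # routed: list of (W_up, W_down) pairs; only k of them run.
    probs = softmax(x @ W_gate)
    top = np.argsort(-probs)[:k]
    w = probs[top] / probs[top].sum()
    routed_out = sum(w_e * ffn(x, *routed[e]) for e, w_e in zip(top, w))
    return ffn(x, *shared) + routed_out, top

rng = np.random.default_rng(0)
d, hidden, num_routed, k = 8, 16, 4, 2
make = lambda: (rng.normal(size=(d, hidden)) * 0.1, rng.normal(size=(hidden, d)) * 0.1)
shared = make()
routed = [make() for _ in range(num_routed)]
W_gate = rng.normal(size=(d, num_routed))
x = rng.normal(size=d)

out, chosen = shared_moe(x, shared, routed, W_gate, k)
print(out.shape, len(chosen))  # (8,) 2
```

Note that the router only scores the routed experts; the shared expert never competes for selection.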

Fine-grained vs coarse experts

The old default was 8 experts with top-2. The modern trend is many small experts: DeepSeek-V3 has 256 routed experts (plus 1 shared), with top-8 routing. The intuition: smaller experts can specialize more sharply; more of them means more total capacity without making any single one bulky. The router learns to pick a richer combination per token.

Back to the default (dense) frame. Everything below applies to both dense and MoE unless called out.

Encoder-decoder (T5, BART, original transformer). Two stacks. The encoder reads the input bidirectionally (no causal mask, every token sees every other token) and produces a representation. The decoder generates the output one token at a time, attending both to its own earlier tokens (causally) and to the encoder's output (cross-attention). Useful for translation and summarization, where input and output are clearly separate. Most modern chat LLMs are decoder-only, largely because a single stack is simpler to train and serve and has scaled well in practice, but the encoder-decoder lineage isn't dead, especially for narrow tasks.
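The three attention patterns in an encoder-decoder model differ only in which positions are allowed to see which. A small sketch of the masks (True means "may attend"), with `attention_mask` as an illustrative helper name:

```python
import numpy as np

def attention_mask(n_query, n_key, causal):
    allowed = np.ones((n_query, n_key), dtype=bool)
    if causal:
        allowed = np.tril(allowed)   # position i sees positions 0..i only
    return allowed

src_len, tgt_len = 4, 3
encoder_self = attention_mask(src_len, src_len, causal=False)   # bidirectional
decoder_self = attention_mask(tgt_len, tgt_len, causal=True)    # causal
decoder_cross = attention_mask(tgt_len, src_len, causal=False)  # sees the whole input

print(decoder_self.astype(int))
# [[1 0 0]
#  [1 1 0]
#  [1 1 1]]
```

A decoder-only model uses just the causal pattern, applied to prompt and output alike.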