The transformer block

Attention + MLP + residual + norm, repeated N times. What depth actually buys you.

Weights: attention projections, MLP weights, and norm parameters all live here. Context: information flows up through the residual stream as the block transforms it.

Anatomy of a single block

A transformer block is a small, repeated unit. Modern decoder-only LLMs stack 30–80 of them. The block itself is short:

Input vector
  → LayerNorm / RMSNorm
  → Multi-head attention
  → + (residual add of the input)
  → LayerNorm / RMSNorm
  → MLP (per-token feed-forward)
  → + (residual add)

Two sub-blocks: attention, then MLP. Each is wrapped in a residual connection (the bypass path) and a normalization step.

Read top to bottom:

  1. Normalize the input (LayerNorm or RMSNorm — modern models prefer RMSNorm for being slightly cheaper).
  2. Run multi-head attention — tokens look at each other (see attention.html).
  3. Add the attention output back to the original input. This is the residual connection.
  4. Normalize again.
  5. Run a small per-token feed-forward neural network (the MLP).
  6. Add again.

That's one block. Output goes into the next block. Sixty times.
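The six steps above can be sketched end to end. This is a toy single-head version in NumPy (real models use many heads and learned norm scales; all weights here are random and the names are illustrative):

```python
# Minimal sketch of one pre-norm transformer block in NumPy.
import numpy as np

rng = np.random.default_rng(0)
d, seq = 16, 4            # embedding dim, sequence length (toy sizes)

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):              # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Per-block weights (random stand-ins)
Wq, Wk, Wv, Wo = (rng.normal(0, 0.02, (d, d)) for _ in range(4))
W_up = rng.normal(0, 0.02, (d, 4 * d))    # project up to 4x
W_down = rng.normal(0, 0.02, (4 * d, d))  # project back down

def attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    scores += np.triu(np.full((seq, seq), -np.inf), k=1)  # causal mask
    return softmax(scores) @ v @ Wo

def mlp(x):               # runs on each token independently
    return gelu(x @ W_up) @ W_down

def block(x):
    x = x + attention(rmsnorm(x))  # steps 1-3: norm, attend, residual add
    x = x + mlp(rmsnorm(x))        # steps 4-6: norm, MLP, residual add
    return x

x = rng.normal(size=(seq, d))
y = block(x)
print(y.shape)            # (4, 16): same shape in, same shape out
```

Note the shape invariant: a block maps a `(seq, d)` array to a `(seq, d)` array, which is exactly what lets you stack them.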

What the MLP actually does

After attention has let tokens swap information, the MLP runs on each token independently — no token-to-token communication, just a small neural network applied identically at every position. Typical structure: project up to ~4× the embedding dimension, apply a non-linearity (GELU, or a gated variant like SwiGLU), project back down.
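The up/non-linearity/down pattern is a few lines. Here is a sketch of the SwiGLU variant, where a second "gate" projection multiplies the up projection elementwise (sizes are toy; weight names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_ff = 16, 64          # toy sizes; real models use d_ff around 4*d

def silu(x):              # x * sigmoid(x), the activation inside SwiGLU
    return x / (1 + np.exp(-x))

W_gate = rng.normal(0, 0.02, (d, d_ff))
W_up = rng.normal(0, 0.02, (d, d_ff))
W_down = rng.normal(0, 0.02, (d_ff, d))

def swiglu_mlp(tok):
    # gate and up projections run in parallel; their product is projected back down
    return (silu(tok @ W_gate) * (tok @ W_up)) @ W_down

tok = rng.normal(size=d)  # one token's vector
out = swiglu_mlp(tok)
print(out.shape)          # (16,): per-token in, same dimension out
```

Because the function takes one token vector and returns one token vector, applying it "identically at every position" is just mapping it over the sequence.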

Attention is the conversation phase: tokens compare notes, see what others know, pull in relevant context. The MLP is the thinking phase: each token, having heard from the others, retreats to its own corner and processes what it just learned. Then the next attention round happens, and the next thinking round, and so on for all N layers.

A common research finding: a lot of factual knowledge in LLMs lives in the MLP weights. Fine-tuning that targets specific factual updates often touches MLP layers more than attention. This isn't a hard rule — it's more like "the MLP has more raw parameter capacity than attention, so that's where most knowledge ends up packed."
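The "more raw parameter capacity" point is quick arithmetic. For a standard (non-gated) 4× MLP, ignoring biases and norm parameters:

```python
d = 4096                       # embedding dim, roughly 7B-model scale
attn_params = 4 * d * d        # Wq, Wk, Wv, Wo projections
mlp_params = 2 * d * (4 * d)   # up- and down-projection of a 4x MLP
total = attn_params + mlp_params

print(attn_params / 1e6)       # ~67.1M per layer for attention
print(mlp_params / 1e6)        # ~134.2M per layer for the MLP
print(round(mlp_params / total, 2))  # 0.67: the MLP holds about 2/3
```

Gated variants like SwiGLU use three MLP matrices with a smaller hidden size, but the split comes out similar: most per-layer weights sit in the MLP.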

The residual stream — the highway through the model

Notice the bypass paths around both attention and MLP in the diagram. Those are residual connections. Their job is simple but profound: instead of replacing the token's vector at each block, attention and MLP add their contribution to it. The original information is always preserved.

The vector that flows up through every layer — gathering additions but never replacements — is the residual stream.

The residual stream is a shared whiteboard running floor-to-ceiling through every layer. Each layer reads what's already on the whiteboard, computes its own contribution, and writes it on top. Nothing is erased. By the top floor, the whiteboard is dense with annotations from every layer — each one's contribution still legible if you knew where to look.
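The whiteboard picture corresponds to a simple identity: after N layers, the stream equals the original input plus the sum of every layer's contribution. A toy check, with random functions standing in for attention/MLP sub-blocks:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_layers = 8, 5
x0 = rng.normal(size=d)           # token vector entering the stack

# Stand-in "layers": each computes some delta to write on the stream
Ws = [rng.normal(0, 0.1, (d, d)) for _ in range(n_layers)]

stream, deltas = x0.copy(), []
for W in Ws:
    delta = np.tanh(stream @ W)   # this layer's contribution
    deltas.append(delta)
    stream = stream + delta       # written on top; nothing erased

# The final stream is exactly the input plus every layer's addition
assert np.allclose(stream, x0 + sum(deltas))
```

Each delta depends on everything written so far (layers read the stream before writing), but the additions themselves never overwrite anything.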

Two important consequences:

  1. Gradients flow straight through the additive bypasses, which is a big part of why stacks of 30–80 blocks are trainable at all.
  2. Every layer reads and writes the same vector space, so a later layer — or the final unembedding — can pick up a contribution written many layers earlier.

What depth buys you

As a token's vector travels up through the layers, what gets added to it changes character. Empirically, early layers tend to handle surface features and local syntax, middle layers build more abstract semantic and relational information, and the final layers increasingly shape the vector toward predicting the next token.


Why deeper isn't always better

More layers means more capacity — but capacity costs both compute (each forward pass scales linearly with depth) and parameters (each layer is millions to hundreds of millions of weights). Diminishing returns kick in: a 200-layer model is rarely twice as good as a 100-layer model, but it's twice as expensive and twice as slow.

Modern open-weight models tend to land in the 30–80 layer range for this reason. Width (embedding dimension and MLP expansion) and depth (number of blocks) are tuned together — there's no universal right answer, but for a given parameter budget, very deep + very narrow rarely wins.

Mixture-of-Experts (MoE). In standard transformers, every token goes through the same MLP at every layer. MoE replaces the single MLP with many MLPs ("experts") and a small "router" that picks one or two experts per token per layer. The total parameter count balloons (Mixtral 8×7B has ~47B parameters total) but per-token compute stays small (only ~2 experts active per token). The headline trade: more capacity at similar inference cost, but harder to train (load-balancing experts is fiddly) and harder to serve (memory still has to hold every expert).
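The router can be sketched as a tiny top-2 softmax gate over per-token logits. This is a toy NumPy version (names and sizes illustrative; real routers add load-balancing losses and batched dispatch):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_experts, k = 16, 8, 2        # toy sizes; Mixtral uses 8 experts, top-2

W_router = rng.normal(0, 0.02, (d, n_experts))
experts = [(rng.normal(0, 0.02, (d, 4 * d)), rng.normal(0, 0.02, (4 * d, d)))
           for _ in range(n_experts)]   # each expert is its own small MLP

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_mlp(tok):
    logits = tok @ W_router
    top = np.argsort(logits)[-k:]      # indices of the top-2 experts
    gate = softmax(logits[top])        # renormalize weights over the chosen two
    out = np.zeros(d)
    for g, i in zip(gate, top):
        W_up, W_down = experts[i]
        out += g * (np.maximum(0, tok @ W_up) @ W_down)  # ReLU MLP for brevity
    return out                         # only 2 of the 8 experts ever ran

tok = rng.normal(size=d)
print(moe_mlp(tok).shape)              # (16,)
```

All 8 experts' weights must sit in memory, but each token's forward pass only touches 2 of them — which is exactly the capacity-vs-compute trade described above.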

Encoder-decoder (T5, BART, original transformer). Two stacks. The encoder reads the input bidirectionally — no causal mask, every token sees every other token — and produces a representation. The decoder generates the output one token at a time, attending to both itself (causally) and the encoder's output. Useful for translation/summarization where input and output are clearly separate. Most modern chat LLMs are decoder-only because it scales better and is simpler — but the encoder-decoder lineage isn't dead, especially for narrow tasks.