The transformer block
Attention + MLP + residual + norm, repeated N times. What depth actually buys you.
Anatomy of a single block
A transformer block is a small, repeated unit. Modern decoder-only LLMs stack 30–80 of them. The block itself is short:
Two sub-blocks: attention then MLP. Each is wrapped in a residual connection (the dotted bypass) and a normalization step.
Read top to bottom:
- Normalize the input (LayerNorm or RMSNorm — most modern models use RMSNorm, which skips mean-centering and is slightly cheaper).
- Run multi-head attention — tokens look at each other (see attention.html).
- Add the attention output back to the original input. This is the residual connection.
- Normalize again.
- Run a small per-token feed-forward neural network (the MLP).
- Add again.
That's one block. Output goes into the next block. Sixty times.
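The six steps above can be sketched in a few lines. This is a minimal pre-norm residual pattern — `attn` and `mlp` are placeholder stubs here, and the dimensions are illustrative, not any particular model's:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale by root-mean-square (no mean subtraction, no bias).
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps) * weight

def transformer_block(x, attn, mlp, norm1_w, norm2_w):
    # Pre-norm residual pattern: normalize, transform, add back.
    x = x + attn(rms_norm(x, norm1_w))   # sub-block 1: attention
    x = x + mlp(rms_norm(x, norm2_w))    # sub-block 2: per-token MLP
    return x

# Toy run: 4 tokens, embedding dim 8, trivial stand-ins for attention/MLP.
d = 8
x = np.random.randn(4, d)
out = transformer_block(x, attn=lambda h: h * 0.1, mlp=lambda h: h * 0.1,
                        norm1_w=np.ones(d), norm2_w=np.ones(d))
print(out.shape)  # (4, 8) — same shape in, same shape out, ready for the next block
```

The input and output shapes match, which is what lets you stack the block N times.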
What the MLP actually does
After attention has let tokens swap information, the MLP runs on each token independently — no token-to-token communication, just a small neural network applied identically at every position. Typical structure: project up to ~4× the embedding dimension, apply a non-linearity (GELU or SwiGLU), project back down.
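The up-project / nonlinearity / down-project structure is a few matrix multiplies. A sketch with a 4× expansion and the tanh approximation of GELU (dimensions are illustrative):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w_up, w_down):
    # Project up to 4*d, apply the nonlinearity, project back down to d.
    return gelu(x @ w_up) @ w_down

d = 8
w_up = np.random.randn(d, 4 * d) * 0.1
w_down = np.random.randn(4 * d, d) * 0.1
tokens = np.random.randn(5, d)       # 5 tokens
out = mlp(tokens, w_up, w_down)      # shape (5, 8)
```

Because the same weights are applied row-by-row, running the MLP on one token alone gives exactly the same result as running it on the whole batch — that's the "no token-to-token communication" property in code.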
Attention is the conversation phase: tokens compare notes, see what others know, pull in relevant context. The MLP is the thinking phase: each token, having heard from the others, retreats to its own corner and processes what it just learned. Then the next attention round happens, and the next thinking round, and so on for all N layers.
A common research finding: a lot of factual knowledge in LLMs lives in the MLP weights. Fine-tuning that targets specific factual updates often touches MLP layers more than attention. This isn't a hard rule — it's more like "the MLP has more raw parameter capacity than attention, so that's where most of the knowledge ends up."
The residual stream — the highway through the model
Notice the dotted lines bypassing both attention and MLP in the diagram. Those are residual connections. Their job is simple but profound: instead of replacing the token's vector at each block, attention and MLP add their contribution to it. The original information is always preserved.
The vector that flows up through every layer — gathering additions but never replacements — is the residual stream.
The residual stream is a shared whiteboard running floor-to-ceiling through every layer. Each layer reads what's already on the whiteboard, computes its own contribution, and writes it on top. Nothing is erased. By the top floor, the whiteboard is dense with annotations from every layer — each one's contribution still legible if you know where to look.
Two important consequences:
- Gradients can flow backwards across many layers. Without residuals, training a 60-layer model would be a numerical nightmare; with them, gradients have a clear "shortcut" path through the residual highway.
- Each layer adds, doesn't overwrite. If a later layer's attention or MLP learns nothing useful, it can output near-zero and the residual stream just passes through unchanged. The model degrades gracefully when capacity is wasted.
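The second consequence is easy to verify directly — with residual addition, a layer that outputs zero is a no-op:

```python
import numpy as np

def block(x, layer_fn):
    # Residual connection: the layer's output is ADDED to the stream,
    # never written over it.
    return x + layer_fn(x)

x = np.random.randn(4, 8)
useless_layer = lambda h: np.zeros_like(h)   # a layer that learned nothing

# The residual stream passes through completely unchanged.
assert np.allclose(block(x, useless_layer), x)
```

Without the residual (i.e., `return layer_fn(x)`), the same useless layer would zero out everything upstream.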
What depth buys you — the layer slider
As a token's vector travels up through the layers, what gets added to it changes character. Empirically:
- Early layers tend to capture surface features — what part of speech this token is, what tokens are nearby, basic syntactic role.
- Middle layers tend to capture relational structure — who's the subject of which verb, what entity is this pronoun referring to.
- Late layers tend to capture task-relevant semantics — features useful for predicting the next token specifically.
The slider below illustrates this qualitative shift on a tiny example:
Why deeper isn't always better
More layers means more capacity — but capacity costs both compute (each forward pass scales linearly with depth) and parameters (each layer is millions to hundreds of millions of weights). Diminishing returns kick in: a 200-layer model is rarely twice as good as a 100-layer model, but it's twice as expensive and twice as slow.
Modern open-weight models tend to land in the 30–80 layer range for this reason. Width (embedding dimension and MLP expansion) and depth (number of blocks) are tuned together — there's no universal right answer, but for a given parameter budget, very deep and very narrow rarely wins.
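You can see the depth-width trade-off with back-of-envelope arithmetic. Ignoring norms and biases, a block with embedding dimension d has roughly 4d² attention parameters (Q, K, V, O projections) and 8d² MLP parameters (4× expansion, up and down). The numbers below are illustrative shapes, not real models:

```python
def params_per_block(d):
    # Rough count per block: 4*d*d for attention projections (Q, K, V, O)
    # plus 8*d*d for the MLP (d -> 4d -> d). Norms and biases ignored.
    return 12 * d * d

# Two ways to spend the same parameter budget:
deep_narrow = 96 * params_per_block(3072)    # 96 layers, dim 3072
shallow_wide = 24 * params_per_block(6144)   # 24 layers, dim 6144

print(deep_narrow, shallow_wide)  # identical: ~10.9B parameters each
```

Both shapes cost the same parameters, but the 96-layer model does four times as many sequential layer-to-layer steps per forward pass — which is part of why very deep and narrow tends to lose at a fixed budget.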
Mixture-of-Experts (MoE). In standard transformers, every token goes through the same MLP at every layer. MoE replaces the single MLP with many MLPs ("experts") and a small "router" that picks one or two experts per token per layer. The total parameter count balloons (Mixtral 8×7B has ~47B parameters total) but per-token compute stays small (only ~2 experts active per token). The headline trade: more capacity at similar inference cost, but harder to train (load-balancing experts is fiddly) and harder to serve (memory still has to hold every expert).
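The routing idea can be sketched in a few lines — per token, score every expert, keep the top-k, and mix only those k outputs. This is a toy sketch of top-k gating, not any real model's router (load balancing, batching, and capacity limits are all omitted):

```python
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    # For each token: score all experts, keep the top-k,
    # softmax the kept scores into gates, mix those experts' outputs.
    out = np.zeros_like(x)
    for i, tok in enumerate(x):
        scores = tok @ router_w                   # one score per expert
        top = np.argsort(scores)[-k:]             # indices of the top-k experts
        gates = np.exp(scores[top] - scores[top].max())
        gates /= gates.sum()                      # softmax over the k kept scores
        for g, e in zip(gates, top):
            out[i] += g * experts[e](tok)         # only k experts actually run
    return out

d, n_experts = 8, 4
rng = np.random.default_rng(0)
# Each "expert" here is just a random linear map standing in for a full MLP.
experts = [lambda t, w=rng.standard_normal((d, d)) * 0.1: t @ w
           for _ in range(n_experts)]
x = rng.standard_normal((3, d))
out = moe_layer(x, rng.standard_normal((d, n_experts)), experts)
print(out.shape)  # (3, 8)
```

All four experts' weights exist in memory (the parameter "balloon"), but each token only ever multiplies through two of them (the small per-token compute).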
Encoder-decoder (T5, BART, original transformer). Two stacks. The encoder reads the input bidirectionally — no causal mask, every token sees every other token — and produces a representation. The decoder generates the output one token at a time, attending to both itself (causally) and the encoder's output. Useful for translation/summarization where input and output are clearly separate. Most modern chat LLMs are decoder-only because it scales better and is simpler — but the encoder-decoder lineage isn't dead, especially for narrow tasks.