Where this sits in the family of LLMs
A one-glance orientation across dense, MoE, diffusion, SSMs, and the rest.
This page is orientation, not mastery. The rest of the artifact teaches the default frame (a dense decoder-only transformer) and branches into MoE where the mental model materially changes. This page tells you what else exists and when you'd meet it. Each family gets a compact row in the table and a short paragraph. For the default and for MoE, the deep pages already do the full treatment.
The comparison at a glance
| Family | Dominant mechanism | Sequence? | Objective | Cost shape | Memory | Typical use | Examples | Year |
|---|---|---|---|---|---|---|---|---|
| Dense transformer | self-attention + FFN | yes | next-token | O(n²) attn, linear depth | all weights in VRAM | general text | GPT-3, Llama 3, Mistral 7B | 2017+ |
| MoE transformer | self-attention + routed experts | yes | next-token + aux losses | sparse compute, full memory | all experts in VRAM | frontier quality, cheap inference | Mixtral, DeepSeek-V3, Llama 4 MoE | 2021+ |
| Encoder-decoder | two-stack attention | yes | span corruption / seq2seq | O(n²) both stacks | both stacks in VRAM | translation, summarization | T5, BART, FLAN-T5 | 2017–2020 |
| State-space (Mamba) | selective scan | yes | next-token | O(n) linear | small, constant | long context, edge devices | Mamba, Mamba-2 | 2023+ |
| Hybrid | Mamba + attention + MoE (alternating) | yes | next-token | mixed | mixed | long context + quality | Jamba, Zamba | 2024+ |
| Post-attention recurrent | linear attention / receptance | yes | next-token | O(n) linear | constant | constant-memory inference | RWKV, RetNet | 2023+ |
| Text diffusion | iterative denoising | kind of | denoising | O(steps × seq) | moderate | research, infilling, fast parallel gen | Mercury (Inception Labs), research diffusion LMs | 2024+ |
| Media diffusion | iterative denoising over images/audio/video | n/a | score / flow-matching | many steps; expensive | large | image, audio, video generation | Stable Diffusion, DALL-E 3, Sora | 2020+ |
| Multimodal add-on | modality encoders + transformer | yes | next-token (usually) | baseline + encoder | baseline + encoder | text + images/audio/video in/out | GPT-4V, Claude 3.5, Gemini | 2023+ |
Row by row
Dense transformer
Every token passes through every parameter at every layer. This is the architecture the rest of the artifact teaches as the default frame. When someone says "LLM" without qualification, they usually mean this. Trade-off: simple to reason about, well-understood, but parameter count and compute grow together: you can't have a "big" model that's also "cheap" to run.
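A minimal numpy sketch of one such decoder block (single head, no LayerNorm, toy shapes; every name here is a placeholder for illustration, not any library's API):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dense_block(x, Wq, Wk, Wv, Wo, W1, W2):
    """One decoder block: causal self-attention, then a feed-forward MLP.
    Every token multiplies against every weight matrix -- the 'dense' property."""
    n, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)                # (n, n): the O(n^2) term
    scores += np.triu(np.full((n, n), -1e9), 1)  # causal mask: no peeking ahead
    x = x + softmax(scores) @ v @ Wo             # attention + residual
    return x + np.maximum(x @ W1, 0) @ W2        # ReLU MLP + residual

rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.standard_normal((n, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
W1 = rng.standard_normal((d, 4 * d)) * 0.1       # the 'fat' MLP: 4x expansion
W2 = rng.standard_normal((4 * d, d)) * 0.1
y = dense_block(x, Wq, Wk, Wv, Wo, W1, W2)
print(y.shape)  # (8, 16)
```

Note the causal mask: editing a later token cannot change any earlier token's output, which is what makes autoregressive generation coherent.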
Pick it when: you want the simplest deployment profile, well-supported tooling, and aren't pushing the quality frontier at a specific cost target. Go to the hub for the full treatment.
MoE transformer
Same as dense outside the block: tokens, embeddings, attention, KV cache, sampling, all unchanged. Inside the block, the single fat MLP is replaced by N smaller expert MLPs and a router that picks top-k per token. Total parameters balloon (Mixtral 8×7B has 47B total); per-token compute stays flat (~13B active). The headline trade: more quality per FLOP, at the cost of more VRAM and more training complexity (router must be load-balanced).
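The router's job can be sketched in a few lines of numpy (a toy top-k gate with made-up shapes; real implementations batch this and add the load-balancing auxiliary losses mentioned above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, router_W, experts, top_k=2):
    """Per token: score all experts, keep the top_k, run only those, and mix
    their outputs by renormalized gate weights. All experts must sit in
    memory; only top_k of them do work for any given token."""
    gates = softmax(x @ router_W)                 # (n_tokens, n_experts)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(gates[t])[-top_k:]       # chosen expert indices
        w = gates[t, top] / gates[t, top].sum()   # renormalize over the k
        for e, we in zip(top, w):
            W1, W2 = experts[e]                   # each expert is a small MLP
            out[t] += we * (np.maximum(x[t] @ W1, 0) @ W2)
    return out

rng = np.random.default_rng(1)
n, d, n_experts = 4, 8, 4
x = rng.standard_normal((n, d))
router_W = rng.standard_normal((d, n_experts))
experts = [(rng.standard_normal((d, 2 * d)) * 0.1,
            rng.standard_normal((2 * d, d)) * 0.1) for _ in range(n_experts)]
y = moe_layer(x, router_W, experts)
print(y.shape)  # (4, 8)
```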
Pick it when: you want frontier quality at lower inference cost per token and have the VRAM to hold all experts. See block.html for the full mental model.
Encoder-decoder
Two transformer stacks. The encoder reads the full input bidirectionally (no causal mask) and produces a representation. The decoder generates the output autoregressively, attending to its own tokens and to the encoder's output via cross-attention. Popular pre-2022 for translation, summarization, and any task where input and output are clearly separate.
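Cross-attention is the piece that distinguishes this family. A toy single-head numpy sketch (names and shapes are illustrative only):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_x, enc_out, Wq, Wk, Wv):
    """Queries come from the decoder, keys/values from the encoder output,
    so every generated token can look at the entire input. No causal mask:
    the input was read bidirectionally."""
    q = dec_x @ Wq
    k = enc_out @ Wk
    v = enc_out @ Wv
    scores = q @ k.T / np.sqrt(q.shape[1])   # (n_dec, n_enc)
    return softmax(scores) @ v

rng = np.random.default_rng(2)
d = 8
enc_out = rng.standard_normal((10, d))   # encoder read 10 input tokens
dec_x = rng.standard_normal((3, d))      # decoder has produced 3 so far
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
y = cross_attention(dec_x, enc_out, Wq, Wk, Wv)
print(y.shape)  # (3, 8)
```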
Pick it when: the task is genuinely seq-in / seq-out with distinct sides (translation being the clearest example). For open chat and completion, decoder-only has largely won: it's simpler and scales better.
State-space models (Mamba)
Attention is O(n²); state-space models are O(n). Mamba replaces the attention sub-block with a selective scan: a parameterized recurrence that runs in linear time but, unlike classical RNNs, trains in parallel and captures long-range dependencies selectively. Memory is roughly constant regardless of sequence length.
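A toy scalar-output version of the idea, heavily simplified relative to the real Mamba kernel (the decay vector `A` and the projections `Wb`, `Wc` are made-up placeholders; the real model also discretizes step sizes and runs a parallel scan):

```python
import numpy as np

def selective_scan(x, A, Wb, Wc):
    """Toy selective recurrence: one fixed-size state vector h, updated once
    per token. 'Selective' = the write/read vectors b, c depend on the token
    itself, so the state can choose what to remember."""
    h = np.zeros(A.shape[0])       # state size is constant, whatever n is
    out = []
    for xt in x:                   # single pass over the sequence: O(n)
        b = Wb @ xt                # token-dependent write vector
        c = Wc @ xt                # token-dependent read vector
        h = A * h + b              # elementwise decay, then write
        out.append(float(c @ h))   # read the state
    return np.array(out)

rng = np.random.default_rng(3)
n, d, d_state = 6, 3, 4
x = rng.standard_normal((n, d))
A = np.full(d_state, 0.9)          # decay per state channel
Wb = rng.standard_normal((d_state, d)) * 0.5
Wc = rng.standard_normal((d_state, d)) * 0.5
y = selective_scan(x, A, Wb, Wc)
print(y.shape)  # (6,)
```

The state `h` never grows with sequence length, which is the whole point: a KV cache grows linearly with context, this does not.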
Pick it when: you need very long context or constant-memory inference on edge devices, and can accept somewhat lower pure-benchmark quality than a comparable dense transformer on short-context tasks.
Hybrid architectures
Take the strengths of multiple families and interleave them. Jamba alternates Mamba layers (cheap long-range context), attention layers (crisp local reasoning), and MoE (parameter-efficient scaling) within a single stack. The bet: the downsides of each family are specific enough that a mix can outperform any pure version at a given budget.
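The interleaving can be pictured as a layer schedule. The ratios below are placeholders chosen for illustration, not Jamba's published configuration:

```python
def jamba_style_schedule(n_layers=8, attn_every=4, moe_every=2):
    """Illustrative hybrid stack: mostly Mamba mixers, periodic attention
    layers, and MoE replacing the dense MLP on alternating layers."""
    layers = []
    for i in range(1, n_layers + 1):
        mixer = "attention" if i % attn_every == 0 else "mamba"
        mlp = "moe" if i % moe_every == 0 else "dense_mlp"
        layers.append((mixer, mlp))
    return layers

for mixer, mlp in jamba_style_schedule():
    print(mixer, mlp)
```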
Pick it when: you're building at the frontier and none of the pure families fits the profile you need, e.g. 128k+ context and frontier quality and reasonable inference cost.
Post-attention recurrent (RWKV, RetNet)
A different route to linear-time sequence modeling. RWKV uses a time-mix + channel-mix formulation that's trainable in parallel like a transformer but runs at inference like an RNN: constant memory, constant per-token cost. RetNet takes a similar "parallel-train, recurrent-infer" route with a retention mechanism. Neither uses attention in the classical sense.
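The trick shared by these models, "parallel-train, recurrent-infer", is easiest to see with plain linear attention (no normalization or decay, stripped down for clarity), where the two forms are exactly equal:

```python
import numpy as np

def linear_attn_parallel(q, k, v):
    """Training-time form: a full causal n x n map, like attention with the
    softmax removed."""
    return np.tril(q @ k.T) @ v

def linear_attn_recurrent(q, k, v):
    """Inference-time form: one fixed-size state matrix S, updated per token.
    Constant memory, constant per-token cost -- no growing KV cache."""
    S = np.zeros((q.shape[1], v.shape[1]))   # running sum of outer(k_s, v_s)
    out = []
    for qt, kt, vt in zip(q, k, v):
        S += np.outer(kt, vt)
        out.append(qt @ S)
    return np.array(out)

rng = np.random.default_rng(4)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
# Both forms compute sum_{s<=t} (q_t . k_s) v_s -- exactly the same output:
assert np.allclose(linear_attn_parallel(q, k, v), linear_attn_recurrent(q, k, v))
```

RWKV and RetNet each dress this skeleton up differently (receptance/decay in RWKV, a retention decay in RetNet), but the parallel/recurrent duality is the common core.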
Pick it when: constant-memory inference is the dominant constraint and you're willing to accept somewhat less well-supported tooling than the transformer ecosystem.
Text diffusion
Instead of autoregressively predicting one token at a time, a text diffusion model starts with a noisy target and iteratively denoises it over N steps. Potentially much faster wall-clock (all positions update in parallel each step) and can naturally fill in blanks, but as of 2026 is still mostly research with a few production pilots (Inception Labs' Mercury being the most visible). Quality at the frontier is not yet at parity with autoregressive.
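A toy discrete-denoising loop, with a stub standing in for the learned denoiser (the vocabulary, confidence rule, and unmasking budget are all invented for illustration; real models learn the denoiser and use proper noise schedules):

```python
import numpy as np

MASK = -1

def toy_text_diffusion(denoiser, length, steps=8):
    """Start fully masked; each step, re-predict ALL positions in parallel
    and commit the most confident still-masked ones."""
    seq = np.full(length, MASK)
    for _ in range(steps):
        logits = denoiser(seq)              # (length, vocab): denoiser's guess
        pred = logits.argmax(axis=1)
        conf = logits.max(axis=1)
        budget = max(1, length // steps)    # positions to commit this step
        for i in np.argsort(-conf):
            if seq[i] == MASK and budget > 0:
                seq[i] = pred[i]
                budget -= 1
    seq[seq == MASK] = pred[seq == MASK]    # finalize any leftovers
    return seq

target = np.array([3, 1, 4, 1, 5])

def denoiser(seq):
    # Stand-in for the learned model: always confident in one fixed answer.
    logits = np.zeros((len(seq), 10))
    logits[np.arange(len(seq)), target] = 5.0
    return logits

out = toy_text_diffusion(denoiser, length=5)
print(out)  # [3 1 4 1 5]
```

Contrast with autoregression: every step touches every position at once, which is where the parallelism (and the natural infilling ability) comes from.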
Pick it when: you need massively parallel generation for a specific distribution and can tolerate research-grade tooling.
Media diffusion
The diffusion paradigm has conquered image, audio, and video generation. Train a network to denoise; at inference, start from pure noise and iteratively denoise toward something from the training distribution. Stable Diffusion and its descendants (SDXL, SD3) for images; Sora and Veo for video; Suno and Udio for audio. Different training objective (score-matching or flow-matching) than LLMs, but the underlying network is often a transformer (DiT, Diffusion Transformer).
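The sampling loop in miniature, with a hand-written stand-in for the trained score network and no noise schedule (real samplers are considerably more involved):

```python
import numpy as np

def toy_sampler(score_fn, shape, steps=200, dt=0.05, seed=0):
    """Start from pure Gaussian noise and repeatedly nudge the sample along
    a score estimate. score_fn stands in for the trained denoising network;
    real samplers also schedule noise levels across the steps."""
    x = np.random.default_rng(seed).standard_normal(shape)  # pure noise
    for _ in range(steps):
        x = x + dt * score_fn(x)        # one denoising step
    return x

# Stand-in score that pulls every sample toward the 'dataset' point 2.0:
out = toy_sampler(lambda x: 2.0 - x, shape=(4,))
print(out)  # all entries near 2.0
```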
Pick it when: you're generating media, not text. Not a text-LLM choice at all; included here so the family map is complete.
Multimodal add-on
Not really a separate architecture family; more a technique applied on top of any of the above. Take a text transformer (usually dense or MoE), bolt on modality-specific encoders (ViT for vision, an audio encoder, etc.), and project their outputs into the transformer's embedding space so the model sees "image tokens" alongside text tokens. The transformer backbone is unchanged; the input side becomes multimodal.
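The projection step in a few lines (the ViT features and projection matrix here are random placeholders; real systems learn `proj_W`, often as a small MLP):

```python
import numpy as np

def build_multimodal_input(text_emb, image_feats, proj_W):
    """Project vision-encoder features into the text model's embedding width
    and prepend them as 'image tokens'. The transformer downstream is
    unchanged -- it just sees a longer sequence."""
    image_tokens = image_feats @ proj_W          # (n_patches, d_model)
    return np.concatenate([image_tokens, text_emb], axis=0)

rng = np.random.default_rng(5)
d_model, d_vision = 16, 32
text_emb = rng.standard_normal((5, d_model))      # 5 text tokens
image_feats = rng.standard_normal((9, d_vision))  # e.g. a 3x3 ViT patch grid
proj_W = rng.standard_normal((d_vision, d_model)) * 0.1
seq = build_multimodal_input(text_emb, image_feats, proj_W)
print(seq.shape)  # (14, 16): 9 image tokens + 5 text tokens, one shared width
```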
Pick it when: the task involves non-text input or output. GPT-4V, Claude 3.5, Gemini all fit here with various text-backbone choices underneath.
Further reading: landmark papers
- Dense transformer: Attention Is All You Need, Vaswani et al., 2017. arXiv:1706.03762
- MoE (Switch Transformer): Fedus, Zoph & Shazeer, 2021. arXiv:2101.03961
- MoE (Mixtral): Mistral AI, 2024. arXiv:2401.04088
- MoE (DeepSeek-V3): DeepSeek, 2024. arXiv:2412.19437
- Encoder-decoder (T5): Raffel et al., 2019. arXiv:1910.10683
- Mamba: Gu & Dao, 2023. arXiv:2312.00752
- Mamba-2: Dao & Gu, 2024. arXiv:2405.21060
- Jamba: AI21 Labs, 2024. arXiv:2403.19887
- RWKV: Peng et al., 2023. arXiv:2305.13048
- RetNet: Sun et al., 2023. arXiv:2307.08621
- Latent Diffusion (Stable Diffusion): Rombach et al., 2021. arXiv:2112.10752
- DiT (Diffusion Transformer): Peebles & Xie, 2022. arXiv:2212.09748