Where this sits in the family of LLMs
A one-glance orientation across dense, MoE, diffusion, SSMs, and the rest.
This page is orientation, not mastery. The rest of the artifact teaches the default frame (a dense decoder-only transformer) and branches into MoE where the mental model materially changes. This page tells you what else exists and when you'd meet it. Each family gets a compact row in the table and a short paragraph. For the default and for MoE, the deep pages already do the full treatment.
The comparison at a glance
| Family | Dominant mechanism | Sequence? | Objective | Cost shape | Memory | Typical use | Examples | Year |
|---|---|---|---|---|---|---|---|---|
| Dense transformer | self-attention + FFN | yes | next-token | O(n²) attn, linear depth | all weights in VRAM | general text | GPT-3, Llama 3, Mistral 7B | 2017+ |
| MoE transformer | self-attention + routed experts | yes | next-token + aux losses | sparse compute, full memory | all experts in VRAM | frontier quality, cheap inference | Mixtral, DeepSeek-V3, Llama 4 MoE | 2021+ |
| Encoder-decoder | two-stack attention | yes | span corruption / seq2seq | O(n²) both stacks | both stacks in VRAM | translation, summarization | T5, BART, FLAN-T5 | 2017–2020 |
| State-space (Mamba) | selective scan | yes | next-token | O(n) linear | small, constant | long context, edge devices | Mamba, Mamba-2 | 2023+ |
| Hybrid | Mamba + attention + MoE (alternating) | yes | next-token | mixed | mixed | long context + quality | Jamba, Zamba | 2024+ |
| Post-attention recurrent | linear attention / receptance | yes | next-token | O(n) linear | constant | constant-memory inference | RWKV, RetNet | 2023+ |
| Text diffusion | iterative denoising | kind of | denoising | O(steps × seq) | moderate | research, infilling, fast parallel gen | Mercury (Inception Labs), research diffusion LMs | 2024+ |
| Media diffusion | iterative denoising over images/audio/video | n/a | score / flow-matching | many steps; expensive | large | image, audio, video generation | Stable Diffusion, DALL-E 3, Sora | 2020+ |
| Multimodal add-on | modality encoders + transformer | yes | next-token (usually) | baseline + encoder | baseline + encoder | text + images/audio/video in/out | GPT-4V, Claude 3.5, Gemini | 2023+ |
Row by row
Dense transformer
Every token passes through every parameter at every layer. This is the architecture the rest of the artifact teaches as the default frame. When someone says "LLM" without qualification, they usually mean this. Trade-off: simple to reason about, well-understood, but parameter count and compute grow together: you can't have a "big" model that's also "cheap" to run.
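A minimal numpy sketch of one such decoder block (single head, no LayerNorm, toy shapes; every name here is a placeholder for illustration, not any library's API):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dense_block(x, Wq, Wk, Wv, Wo, W1, W2):
    """One decoder block: causal self-attention, then a feed-forward MLP.
    Every token multiplies against every weight matrix -- the 'dense' property."""
    n, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)                # (n, n): the O(n^2) term
    scores += np.triu(np.full((n, n), -1e9), 1)  # causal mask: no peeking ahead
    x = x + softmax(scores) @ v @ Wo             # attention + residual
    return x + np.maximum(x @ W1, 0) @ W2        # ReLU MLP + residual

rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.standard_normal((n, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
W1 = rng.standard_normal((d, 4 * d)) * 0.1       # the 'fat' MLP: 4x expansion
W2 = rng.standard_normal((4 * d, d)) * 0.1
y = dense_block(x, Wq, Wk, Wv, Wo, W1, W2)
print(y.shape)  # (8, 16)
```

Note the causal mask: editing a later token cannot change any earlier token's output, which is what makes autoregressive generation coherent.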
Pick it when: you want the simplest deployment profile, well-supported tooling, and aren't pushing the quality frontier at a specific cost target. Go to the hub for the full treatment.
MoE transformer
Same as dense outside the block: tokens, embeddings, attention, KV cache, sampling, all unchanged. Inside the block, the single fat MLP is replaced by N smaller expert MLPs and a router that picks top-k per token. Total parameters balloon (Mixtral 8×7B has 47B total); per-token compute stays flat (~13B active). The headline trade: more quality per FLOP, at the cost of more VRAM and more training complexity (router must be load-balanced).
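The router's job can be sketched in a few lines of numpy (a toy top-k gate with made-up shapes; real implementations batch this and add the load-balancing auxiliary losses mentioned above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, router_W, experts, top_k=2):
    """Per token: score all experts, keep the top_k, run only those, and mix
    their outputs by renormalized gate weights. All experts must sit in
    memory; only top_k of them do work for any given token."""
    gates = softmax(x @ router_W)                 # (n_tokens, n_experts)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(gates[t])[-top_k:]       # chosen expert indices
        w = gates[t, top] / gates[t, top].sum()   # renormalize over the k
        for e, we in zip(top, w):
            W1, W2 = experts[e]                   # each expert is a small MLP
            out[t] += we * (np.maximum(x[t] @ W1, 0) @ W2)
    return out

rng = np.random.default_rng(1)
n, d, n_experts = 4, 8, 4
x = rng.standard_normal((n, d))
router_W = rng.standard_normal((d, n_experts))
experts = [(rng.standard_normal((d, 2 * d)) * 0.1,
            rng.standard_normal((2 * d, d)) * 0.1) for _ in range(n_experts)]
y = moe_layer(x, router_W, experts)
print(y.shape)  # (4, 8)
```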
Pick it when: you want frontier quality at lower inference cost per token and have the VRAM to hold all experts. See block.html for the full mental model.
Encoder-decoder
Two transformer stacks. The encoder reads the full input bidirectionally (no causal mask) and produces a representation. The decoder generates the output autoregressively, attending to its own tokens and to the encoder's output via cross-attention. Popular pre-2022 for translation, summarization, and any task where input and output are clearly separate.
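Cross-attention is the piece that distinguishes this family. A toy single-head numpy sketch (names and shapes are illustrative only):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(dec_x, enc_out, Wq, Wk, Wv):
    """Queries come from the decoder, keys/values from the encoder output,
    so every generated token can look at the entire input. No causal mask:
    the input was read bidirectionally."""
    q = dec_x @ Wq
    k = enc_out @ Wk
    v = enc_out @ Wv
    scores = q @ k.T / np.sqrt(q.shape[1])   # (n_dec, n_enc)
    return softmax(scores) @ v

rng = np.random.default_rng(2)
d = 8
enc_out = rng.standard_normal((10, d))   # encoder read 10 input tokens
dec_x = rng.standard_normal((3, d))      # decoder has produced 3 so far
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
y = cross_attention(dec_x, enc_out, Wq, Wk, Wv)
print(y.shape)  # (3, 8)
```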
Pick it when: the task is genuinely seq-in / seq-out with distinct sides (translation being the clearest example). For open chat and completion, decoder-only has largely won: it's simpler and scales better.
State-space models (Mamba)
Attention is O(n²); state-space models are O(n). Mamba replaces the attention sub-block with a selective scan: a parameterized recurrence that runs in linear time but, unlike classical RNNs, trains in parallel and captures long-range dependencies selectively. Memory is roughly constant regardless of sequence length.
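A toy scalar-output version of the idea, heavily simplified relative to the real Mamba kernel (the decay vector `A` and the projections `Wb`, `Wc` are made-up placeholders; the real model also discretizes step sizes and runs a parallel scan):

```python
import numpy as np

def selective_scan(x, A, Wb, Wc):
    """Toy selective recurrence: one fixed-size state vector h, updated once
    per token. 'Selective' = the write/read vectors b, c depend on the token
    itself, so the state can choose what to remember."""
    h = np.zeros(A.shape[0])       # state size is constant, whatever n is
    out = []
    for xt in x:                   # single pass over the sequence: O(n)
        b = Wb @ xt                # token-dependent write vector
        c = Wc @ xt                # token-dependent read vector
        h = A * h + b              # elementwise decay, then write
        out.append(float(c @ h))   # read the state
    return np.array(out)

rng = np.random.default_rng(3)
n, d, d_state = 6, 3, 4
x = rng.standard_normal((n, d))
A = np.full(d_state, 0.9)          # decay per state channel
Wb = rng.standard_normal((d_state, d)) * 0.5
Wc = rng.standard_normal((d_state, d)) * 0.5
y = selective_scan(x, A, Wb, Wc)
print(y.shape)  # (6,)
```

The state `h` never grows with sequence length, which is the whole point: a KV cache grows linearly with context, this does not.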
Pick it when: you need very long context or constant-memory inference on edge devices, and can accept somewhat lower pure-benchmark quality than a comparable dense transformer on short-context tasks.
Hybrid architectures
Take the strengths of multiple families and interleave them. Jamba alternates Mamba layers (cheap long-range context), attention layers (crisp local reasoning), and MoE (parameter-efficient scaling) within a single stack. The bet: the downsides of each family are specific enough that a mix can outperform any pure version at a given budget.
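The interleaving can be pictured as a layer schedule. The ratios below are placeholders chosen for illustration, not Jamba's published configuration:

```python
def jamba_style_schedule(n_layers=8, attn_every=4, moe_every=2):
    """Illustrative hybrid stack: mostly Mamba mixers, periodic attention
    layers, and MoE replacing the dense MLP on alternating layers."""
    layers = []
    for i in range(1, n_layers + 1):
        mixer = "attention" if i % attn_every == 0 else "mamba"
        mlp = "moe" if i % moe_every == 0 else "dense_mlp"
        layers.append((mixer, mlp))
    return layers

for mixer, mlp in jamba_style_schedule():
    print(mixer, mlp)
```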
Pick it when: you're building at the frontier and none of the pure families fits the profile you need, e.g. 128k+ context and frontier quality and reasonable inference cost.
Post-attention recurrent (RWKV, RetNet)
A different route to linear-time sequence modeling. RWKV uses a time-mix + channel-mix formulation that's trainable in parallel like a transformer but runs at inference like an RNN: constant memory, constant per-token cost. RetNet takes a similar "parallel-train, recurrent-infer" route with a retention mechanism. Neither uses attention in the classical sense.
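The trick shared by these models, "parallel-train, recurrent-infer", is easiest to see with plain linear attention (no normalization or decay, stripped down for clarity), where the two forms are exactly equal:

```python
import numpy as np

def linear_attn_parallel(q, k, v):
    """Training-time form: a full causal n x n map, like attention with the
    softmax removed."""
    return np.tril(q @ k.T) @ v

def linear_attn_recurrent(q, k, v):
    """Inference-time form: one fixed-size state matrix S, updated per token.
    Constant memory, constant per-token cost -- no growing KV cache."""
    S = np.zeros((q.shape[1], v.shape[1]))   # running sum of outer(k_s, v_s)
    out = []
    for qt, kt, vt in zip(q, k, v):
        S += np.outer(kt, vt)
        out.append(qt @ S)
    return np.array(out)

rng = np.random.default_rng(4)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
# Both forms compute sum_{s<=t} (q_t . k_s) v_s -- exactly the same output:
assert np.allclose(linear_attn_parallel(q, k, v), linear_attn_recurrent(q, k, v))
```

RWKV and RetNet each dress this skeleton up differently (receptance/decay in RWKV, a retention decay in RetNet), but the parallel/recurrent duality is the common core.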
Pick it when: constant-memory inference is the dominant constraint and you're willing to accept somewhat less well-supported tooling than the transformer ecosystem.
Text diffusion
Instead of autoregressively predicting one token at a time, a text diffusion model starts with a noisy target and iteratively denoises it over N steps. Potentially much faster wall-clock (all positions update in parallel each step) and can naturally fill in blanks, but as of 2026 is still mostly research with a few production pilots (Inception Labs' Mercury being the most visible). Quality at the frontier is not yet at parity with autoregressive.
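A toy discrete-denoising loop, with a stub standing in for the learned denoiser (the vocabulary, confidence rule, and unmasking budget are all invented for illustration; real models learn the denoiser and use proper noise schedules):

```python
import numpy as np

MASK = -1

def toy_text_diffusion(denoiser, length, steps=8):
    """Start fully masked; each step, re-predict ALL positions in parallel
    and commit the most confident still-masked ones."""
    seq = np.full(length, MASK)
    for _ in range(steps):
        logits = denoiser(seq)              # (length, vocab): denoiser's guess
        pred = logits.argmax(axis=1)
        conf = logits.max(axis=1)
        budget = max(1, length // steps)    # positions to commit this step
        for i in np.argsort(-conf):
            if seq[i] == MASK and budget > 0:
                seq[i] = pred[i]
                budget -= 1
    seq[seq == MASK] = pred[seq == MASK]    # finalize any leftovers
    return seq

target = np.array([3, 1, 4, 1, 5])

def denoiser(seq):
    # Stand-in for the learned model: always confident in one fixed answer.
    logits = np.zeros((len(seq), 10))
    logits[np.arange(len(seq)), target] = 5.0
    return logits

out = toy_text_diffusion(denoiser, length=5)
print(out)  # [3 1 4 1 5]
```

Contrast with autoregression: every step touches every position at once, which is where the parallelism (and the natural infilling ability) comes from.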
Pick it when: you need massively parallel generation for a specific distribution and can tolerate research-grade tooling.
Media diffusion
The diffusion paradigm has conquered image, audio, and video generation. Train a network to denoise; at inference, start from pure noise and iteratively denoise toward something from the training distribution. Stable Diffusion and its descendants (SDXL, SD3) for images; Sora and Veo for video; Suno and Udio for audio. Different training objective (score-matching or flow-matching) than LLMs, but the underlying network is often a transformer (DiT, Diffusion Transformer).
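The sampling loop in miniature, with a hand-written stand-in for the trained score network and no noise schedule (real samplers are considerably more involved):

```python
import numpy as np

def toy_sampler(score_fn, shape, steps=200, dt=0.05, seed=0):
    """Start from pure Gaussian noise and repeatedly nudge the sample along
    a score estimate. score_fn stands in for the trained denoising network;
    real samplers also schedule noise levels across the steps."""
    x = np.random.default_rng(seed).standard_normal(shape)  # pure noise
    for _ in range(steps):
        x = x + dt * score_fn(x)        # one denoising step
    return x

# Stand-in score that pulls every sample toward the 'dataset' point 2.0:
out = toy_sampler(lambda x: 2.0 - x, shape=(4,))
print(out)  # all entries near 2.0
```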
Pick it when: you're generating media, not text. Not a text-LLM choice at all; included here so the family map is complete.
Multimodal add-on
Not really a separate architecture family; more a technique applied on top of any of the above. Take a text transformer (usually dense or MoE), bolt on modality-specific encoders (ViT for vision, an audio encoder, etc.), and project their outputs into the transformer's embedding space so the model sees "image tokens" alongside text tokens. The transformer backbone is unchanged; the input side becomes multimodal.
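The projection step in a few lines (the ViT features and projection matrix here are random placeholders; real systems learn `proj_W`, often as a small MLP):

```python
import numpy as np

def build_multimodal_input(text_emb, image_feats, proj_W):
    """Project vision-encoder features into the text model's embedding width
    and prepend them as 'image tokens'. The transformer downstream is
    unchanged -- it just sees a longer sequence."""
    image_tokens = image_feats @ proj_W          # (n_patches, d_model)
    return np.concatenate([image_tokens, text_emb], axis=0)

rng = np.random.default_rng(5)
d_model, d_vision = 16, 32
text_emb = rng.standard_normal((5, d_model))      # 5 text tokens
image_feats = rng.standard_normal((9, d_vision))  # e.g. a 3x3 ViT patch grid
proj_W = rng.standard_normal((d_vision, d_model)) * 0.1
seq = build_multimodal_input(text_emb, image_feats, proj_W)
print(seq.shape)  # (14, 16): 9 image tokens + 5 text tokens, one shared width
```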
Pick it when: the task involves non-text input or output. GPT-4V, Claude 3.5, Gemini all fit here with various text-backbone choices underneath.
Further reading: landmark papers
- Dense transformer: Attention Is All You Need, Vaswani et al., 2017. arXiv:1706.03762
- MoE (Switch Transformer): Fedus, Zoph & Shazeer, 2021. arXiv:2101.03961
- MoE (Mixtral): Mistral AI, 2024. arXiv:2401.04088
- MoE (DeepSeek-V3): DeepSeek, 2024. arXiv:2412.19437
- Encoder-decoder (T5): Raffel et al., 2019. arXiv:1910.10683
- Mamba: Gu & Dao, 2023. arXiv:2312.00752
- Mamba-2: Dao & Gu, 2024. arXiv:2405.21060
- Jamba: AI21 Labs, 2024. arXiv:2403.19887
- RWKV: Peng et al., 2023. arXiv:2305.13048
- RetNet: Sun et al., 2023. arXiv:2307.08621
- Latent Diffusion (Stable Diffusion): Rombach et al., 2021. arXiv:2112.10752
- DiT (Diffusion Transformer): Peebles & Xie, 2022. arXiv:2212.09748