Glossary
One-line definitions for every technical term in the artifact. Each links back to where it's fully explained.
Glossary entries are orienting — what a term means in one sentence. The deep pages are explaining — why it works, what it implies, when it bites you. Use the glossary when you need to remember which term is which; use the deep pages when you want to actually understand.
The four lenses
- Weights
- The numbers learned during training; what the model "knows." Frozen at inference. Changing weights = teaching new behavior. → hub §5
- Context
- What the model sees in the current turn — the input window. Changes per request, forgotten between requests. → hub §5
- Scaffolding
- The system around the model — system prompts, retrieval, tools, agent loops. Doesn't change the model itself; changes what reaches it and what happens with its output. → hub §5
- Decoding
- How the sampler turns the model's probability output into a single chosen token. Temperature, top-k, top-p. → hub §5
- Runtime state (not a fifth lens)
- Transient computation state (KV cache, intermediate activations) that exists during one forward pass and disappears. Don't confuse with weights or context. → inference.html
Architecture
- Token
- A small piece of text (often a sub-word) mapped to an integer ID. The unit a model reads and writes. → tokens.html
- Embedding
- A vector representing a token's location in "meaning space." Tokens with related meanings sit near each other. → tokens.html
- Embedding matrix
- The lookup table that maps token IDs to embedding vectors. Part of the model's weights, learned during training.
- Positional encoding
- How the model knows token order. Modern models use RoPE (Rotary Position Embedding). → tokens.html
- RoPE
- Rotary Position Embedding. Rotates each token's Q and K vectors by an angle proportional to its position; attention then reads relative position from the rotation difference. → tokens.html
- Attention
- The mechanism by which each token looks at every other token in the sequence and updates itself based on a weighted blend of their information. → attention.html
- Q, K, V (Query, Key, Value)
- Per-token vectors used in attention. Q = "what am I looking for", K = "what I'm advertising", V = "what I'll share if you listen." → attention.html
- Attention head
- One parallel attention computation. Modern models run many heads simultaneously, each capturing a different relationship.
- Multi-head attention
- Standard pattern: run attention several times in parallel with different Q/K/V projections, concatenate, project. → attention.html
- Causal autoregressive property
- Each token only attends to itself and earlier tokens, never later ones. The structural reason a decoder-only model is a next-token predictor. → attention.html
- Causal mask
- The matrix mask that enforces the causal property during parallel training. Not a separate idea from autoregressive — just an implementation detail.
- Transformer block
- One unit of attention + MLP + residual + normalization. Modern LLMs stack 30-80 of these. → block.html
- MLP (feed-forward block)
- A small per-token neural network inside each transformer block, applied independently to every position. Where a lot of factual knowledge lives. → block.html
- Residual stream
- The vector flowing up through all transformer blocks; each block adds its contribution rather than replacing. → block.html
- LayerNorm / RMSNorm
- Normalization steps applied before attention and MLP. Keeps activation scales stable across layers. RMSNorm is a slightly cheaper variant favored by modern models.
- Context mixing
- What self-attention does: each token's representation absorbs information from other tokens. Every layer is another opportunity to mix more context into each position.
- Knowledge recall
- What the feed-forward network (MLP) does: each token, independently, uses its current representation to recall factual associations stored in the FFN weights. Most of a model's stored factual knowledge lives here.
- Hidden state
- The vector at any given layer/position in the residual stream. The "hidden state at the last position" is what gets projected to logits.
- Logits
- Raw "enthusiasm scores" the model outputs for every possible next token, before softmax converts them to probabilities. → inference.html
- Softmax
- The function that converts logits into a probability distribution that sums to 1.
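The logits-to-softmax step can be sketched in a few lines of plain Python. The logits below are invented toy numbers, not real model output:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, then exponentiate and normalize.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over a 4-token vocabulary.
probs = softmax([2.0, 1.0, 0.1, -1.0])
print(probs)       # highest logit gets the highest probability
print(sum(probs))  # sums to 1.0
```

Note the max-subtraction trick: it changes nothing mathematically but prevents `exp` from overflowing on large logits.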
Inference
- Inference
- Running the model to produce output. Distinguished from training (which updates weights).
- Forward pass
- One run of the model from input through all layers to logits.
- Autoregressive generation
- Producing text one token at a time, with each token conditioned on all previous tokens. → inference.html
- KV cache
- Cached K and V vectors for past tokens, so decode doesn't recompute them on every step. Runtime state, not weights or context. → inference.html
- Prefill
- The first forward pass over the entire prompt, processed in parallel. Sets up the KV cache. → inference.html
- Decode
- Generating tokens one at a time after prefill, each requiring its own forward pass. → inference.html
- Time to first token (TTFT)
- Latency from request start to first generated token. Dominated by prefill cost.
- Inter-token latency / time per output token (TPOT)
- Time between successive generated tokens during decode.
- Batching
- Combining multiple users' requests into one forward pass to improve GPU utilization. Continuous batching dynamically adds/removes requests mid-flight.
- Speculative decoding
- A small "draft" model proposes several tokens; the big model verifies them in one pass. Faster, identical output. → inference.html
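The prefill/decode split can be sketched with a stub in place of the model. `fake_logits` below is invented for illustration (a real forward pass returns logits over a large vocabulary); the loop shows the key shape: the prompt is known up front, then each new token costs one pass:

```python
def fake_logits(tokens):
    # Invented stand-in for a forward pass: favors (last token + 1) mod vocab.
    vocab = 10
    scores = [0.0] * vocab
    scores[(tokens[-1] + 1) % vocab] = 1.0
    return scores

def generate(prompt, n_new):
    tokens = list(prompt)          # "prefill": the whole prompt is available at once
    for _ in range(n_new):         # "decode": one forward pass per generated token
        logits = fake_logits(tokens)
        next_tok = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        tokens.append(next_tok)    # each new token conditions the next pass
    return tokens

print(generate([3, 7], 4))  # -> [3, 7, 8, 9, 0, 1]
```

In a real serving stack the per-step forward pass reuses the KV cache rather than re-reading the whole sequence; that optimization is invisible at this level of sketch.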
Training
- Pretraining
- The first stage: predict-the-next-token training on trillions of tokens of text. Where capability comes from. → training.html
- SFT (Supervised Fine-Tuning)
- The second stage: training on curated (prompt, ideal-response) pairs to give the model assistant-shaped behavior. → training.html
- RLHF (Reinforcement Learning from Human Feedback)
- Post-SFT stage: train a reward model on human preferences, then use RL to nudge the LLM toward higher-reward outputs. → training.html
- DPO (Direct Preference Optimization)
- Simpler alternative to RLHF: tweak the model directly using preference pairs, no separate reward model. → training.html
- RLAIF
- Reinforcement Learning from AI Feedback. Same as RLHF but with another LLM doing the judging. Cheaper, but inherits the judge model's biases.
- Constitutional AI
- Anthropic's approach: use a written constitution of principles to have the model critique and revise its own outputs.
- Catastrophic forgetting
- When fine-tuning on new data damages capability the model previously had. Common LoRA failure mode.
- Alignment tax
- Capability loss that often comes alongside RLHF/DPO; the model becomes more obedient but slightly blander.
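To make "no separate reward model" concrete, DPO's per-pair loss fits in a few lines. The log-probabilities below are made up, and `beta` is the usual DPO temperature hyperparameter; this is a sketch of the objective, not a training loop:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards: how far the policy has moved from the frozen reference.
    reward_w = beta * (logp_w - ref_logp_w)   # chosen (winning) response
    reward_l = beta * (logp_l - ref_logp_l)   # rejected (losing) response
    # -log sigmoid(margin): shrinks as the chosen response pulls ahead.
    margin = reward_w - reward_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up log-probs: the policy already slightly prefers the chosen response.
print(dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0))
```

When the margin is zero the loss is log 2; gradient descent on this quantity pushes probability toward chosen responses and away from rejected ones, using only the two models' log-probs.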
Levers
- Fine-tuning
- Updating model weights on a specific dataset to teach new behavior. Full, LoRA, or QLoRA variants. → levers.html §1
- LoRA (Low-Rank Adaptation)
- Fine-tuning by learning a small low-rank delta (A·B) attached to selected weight matrices. Base weights stay frozen. → levers.html §1
- QLoRA
- LoRA with the base model quantized to 4 bits during training. Lets you fine-tune very large models on consumer GPUs.
- Quantization
- Storing weights in fewer bits (e.g., 16 → 4) to save memory. Same number of weights, coarser numerical resolution. → levers.html §2
- RTN, GPTQ, AWQ, GGUF
- Quantization methods. RTN is naive round-to-nearest. GPTQ adjusts for quantization errors per column. AWQ protects activation-important weights. GGUF is the file format for llama.cpp.
- Context extension
- Making a model handle longer inputs than it was trained for. Done by stretching positional encoding (RoPE scaling) and/or restructuring attention. → levers.html §3
- RoPE scaling
- Rescaling RoPE rotation frequencies so a model trained on shorter contexts can handle longer ones. Variants: linear (Position Interpolation), NTK-aware, YaRN.
- Sliding window attention
- Each token only attends to the last k tokens instead of all previous tokens. Reduces memory and compute; costs some long-range coherence.
- Pruning
- Removing weights, heads, or layers from a trained model. Unstructured zeroes individual weights; structured removes whole units. Usually paired with a healing fine-tune. → levers.html §4
- Decoding
- The process of converting model output (probability distribution) into chosen tokens. Controlled by temperature, top-k, top-p, repetition penalty. → levers.html §5
- Temperature
- Divides logits before softmax. T < 1 sharpens (more deterministic); T > 1 flattens (more random); T = 0 picks the top token always (greedy).
- Top-k
- Keep only the top K most-likely tokens before sampling.
- Top-p (nucleus)
- Keep the smallest set of tokens whose cumulative probability is at least p.
- Repetition penalty
- Multiplicative penalty applied to logits of recently-generated tokens. Prevents loops.
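The decoding knobs above compose naturally: temperature first, then top-k, then top-p, then a draw. A minimal pure-Python sketch (toy logits, not a production sampler):

```python
import math, random

def sample(logits, temperature=1.0, top_k=0, top_p=1.0):
    # Temperature: divide logits before softmax. T < 1 sharpens, T > 1 flattens.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Rank token ids from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k > 0:
        order = order[:top_k]          # top-k: keep only the K most likely
    kept, cum = [], 0.0
    for i in order:                    # top-p: smallest prefix reaching mass p
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over survivors and draw one token.
    total = sum(probs[i] for i in kept)
    r = random.random() * total
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

random.seed(0)
print(sample([4.0, 2.0, 1.0, -2.0], temperature=0.7, top_k=3, top_p=0.9))
```

Setting `top_k=1` recovers greedy decoding regardless of temperature, which is a handy sanity check.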
Evaluation
- Perplexity
- Intrinsic eval: how well the model predicts held-out text. Measures raw modeling quality; misses helpfulness/safety. → eval.html
- MMLU, ARC, GSM8K, HumanEval, etc.
- Multiple-choice and code-generation benchmarks. Easy to score, prone to contamination. → eval.html
- Contamination
- When benchmark questions leak into training data, inflating scores without reflecting true capability.
- MT-Bench
- LLM-as-judge benchmark: a strong model (typically GPT-4) grades pairs of model outputs against criteria.
- LMSYS Chatbot Arena
- Human preference benchmark: users compare two anonymous models side-by-side and vote. Aggregated into Elo-style scores.
- Needle-in-a-Haystack
- Long-context probe: insert a unique fact at varying positions in a long context, ask about it. Tests retrieval accuracy by depth.
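Perplexity is just the exponentiated average negative log-likelihood per token. A minimal sketch, with made-up per-token log-probs standing in for real model outputs:

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp(mean negative log-likelihood per token). Lower is better;
    # 1.0 means the model assigned probability 1 to every held-out token.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical log-probabilities for a short held-out sequence.
print(perplexity([-0.5, -1.2, -0.3, -2.0]))
```

Intuitively, a perplexity of N means the model was, on average, as uncertain as if it were choosing uniformly among N tokens at each step.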
Architecture variants
- Decoder-only
- The standard modern LLM architecture (GPT family). One stack, causal attention, autoregressive generation. The baseline this artifact teaches.
- Encoder-decoder
- Two-stack architecture (T5, BART). Encoder reads bidirectionally; decoder generates autoregressively while attending to encoder output. → block.html variant note
- Mixture-of-Experts (MoE)
- Replaces single MLP per block with many "expert" MLPs and a router that picks 1-2 per token. More capacity at similar inference cost. → block.html variant note
- Multi-Query Attention (MQA)
- One K/V shared across all attention heads. Smaller KV cache; small quality cost. → attention.html variant note
- Grouped-Query Attention (GQA)
- Heads grouped to share K/V — a middle ground between standard MHA and MQA. Used by most modern open models (Llama 3, Mistral). → attention.html variant note
- Multimodal
- Models that accept images, audio, etc. as input. Typically add a small encoder that converts non-text inputs into vectors, then pass them through the same transformer downstream. → tokens.html variant note
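The practical payoff of MQA/GQA is KV-cache size. A back-of-the-envelope sketch, using illustrative dimensions loosely shaped like an 8B-class model (not official figures):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Two tensors (K and V) per layer, one vector per KV head per position,
    # fp16 by default (2 bytes per value).
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Assumed dimensions: 32 layers, head_dim 128, 8k-token context, fp16 cache.
mha = kv_cache_bytes(32, 32, 128, 8192)   # full multi-head: 32 KV heads
gqa = kv_cache_bytes(32, 8, 128, 8192)    # grouped-query: 8 KV heads
mqa = kv_cache_bytes(32, 1, 128, 8192)    # multi-query: 1 shared KV head
print(f"MHA {mha / 2**30:.2f} GiB, GQA {gqa / 2**30:.2f} GiB, MQA {mqa / 2**30:.2f} GiB")
```

Cache size scales linearly with KV-head count, so with these numbers GQA cuts the cache 4x and MQA 32x versus full MHA, which is exactly why long-context serving favors them.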
Scaffolding
- System prompt
- A special prompt prefix instructing the model how to behave for the rest of the conversation. Pure scaffolding — doesn't change the model.
- RAG (Retrieval-Augmented Generation)
- Retrieve relevant documents from an external store and inject them into the model's context before generation.
- Tool use / function calling
- The model emits structured output indicating an external function should be called; the system runs the function and feeds the result back into the conversation.
- Agent / ReAct loop
- Multi-step scaffolding: model thinks → acts (calls tool) → observes result → thinks again, until a final answer.
- Chain-of-thought
- Prompting the model to explain its reasoning step-by-step before giving a final answer. A scaffolding technique that often improves accuracy on reasoning tasks.
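The tool-use and agent-loop entries above can be sketched end-to-end with a stubbed model. `fake_model`, the message format, and the `add` tool are all invented for illustration; a real loop would call an LLM API and a real tool registry:

```python
import json

# Hypothetical tool registry; in a real system these would call external services.
TOOLS = {"add": lambda a, b: a + b}

def fake_model(messages):
    # Invented stand-in for an LLM. First turn: emit a structured tool call.
    # After observing a tool result: emit a final answer.
    tool_results = [m for m in messages if m["role"] == "tool"]
    if tool_results:
        return {"role": "assistant",
                "content": f"The answer is {tool_results[-1]['content']}."}
    return {"role": "assistant",
            "tool_call": json.dumps({"name": "add", "args": {"a": 2, "b": 3}})}

def agent_loop(user_msg, max_steps=5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):       # think -> act -> observe, repeated
        reply = fake_model(messages)
        messages.append(reply)
        if "tool_call" not in reply:
            return reply["content"]              # final answer: stop the loop
        call = json.loads(reply["tool_call"])    # act: run the requested tool
        result = TOOLS[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": str(result)})  # observe

print(agent_loop("What is 2 + 3?"))  # -> The answer is 5.
```

Note that everything here is scaffolding in the four-lens sense: the loop, the registry, and the message log all live outside the model, which only ever sees context and emits tokens.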