Glossary
One-line definitions for every technical term in the artifact. Each links back to where it's fully explained.
Glossary entries are orienting — what a term means in one sentence. The deep pages are explaining — why it works, what it implies, when it bites you. Use the glossary when you need to remember which term is which; use the deep pages when you want to actually understand.
The four lenses
- Weights
- The numbers learned during training; what the model "knows." Frozen at inference. Changing weights = teaching new behavior. → hub §5
- Context
- What the model sees in the current turn — the input window. Changes per request, forgotten between requests. → hub §5
- Scaffolding
- The system around the model — system prompts, retrieval, tools, agent loops. Doesn't change the model itself; changes what reaches it and what happens with its output. → hub §5
- Decoding
- How the sampler turns the model's probability output into a single chosen token. Temperature, top-k, top-p. → hub §5
- Runtime state (not a fifth lens)
- Transient computation state (KV cache, intermediate activations) that exists during one forward pass and disappears. Don't confuse with weights or context. → inference.html
Architecture
- Token
- A small piece of text (often a sub-word) mapped to an integer ID. The unit a model reads and writes. → tokens.html
- Embedding
- A vector representing a token's location in "meaning space." Tokens with related meanings sit near each other. → tokens.html
- Embedding matrix
- The lookup table that maps token IDs to embedding vectors. Part of the model's weights, learned during training.
- Positional encoding
- How the model knows token order. Modern models use RoPE (Rotary Position Embedding). → tokens.html
- RoPE
- Rotary Position Embedding. Rotates each token's Q and K vectors by an angle proportional to its position; attention then reads relative position from the rotation difference. → tokens.html
- Attention
- The mechanism by which each token looks at every other token in the sequence and updates itself based on a weighted blend of their information. → attention.html
- Q, K, V (Query, Key, Value)
- Per-token vectors used in attention. Q = "what am I looking for", K = "what I'm advertising", V = "what I'll share if you listen." → attention.html
- Attention head
- One parallel attention computation. Modern models run many heads simultaneously, each capturing a different relationship.
- Multi-head attention
- Standard pattern: run attention several times in parallel with different Q/K/V projections, concatenate, project. → attention.html
- Causal autoregressive property
- Each token only attends to itself and earlier tokens, never later ones. The structural reason a decoder-only model is a next-token predictor. → attention.html
- Causal mask
- The matrix mask that enforces the causal property during parallel training. Not a separate idea from autoregressive — just an implementation detail.
- Transformer block
- One unit of attention + MLP + residual + normalization. Modern LLMs stack 30-80 of these. → block.html
- MLP (feed-forward block)
- A small per-token neural network inside each transformer block, applied independently to every position. Where a lot of factual knowledge lives. → block.html
- Residual stream
- The vector flowing up through all transformer blocks; each block adds its contribution rather than replacing. → block.html
- LayerNorm / RMSNorm
- Normalization steps applied before attention and MLP. Keeps activation scales stable across layers. RMSNorm is a slightly cheaper variant favored by modern models.
- Context mixing
- What self-attention does: each token's representation absorbs information from other tokens. Every layer is another opportunity to mix more context into each position.
- Knowledge recall
- What the feed-forward network (MLP) does: each token, independently, uses its current representation to recall factual associations stored in the FFN weights. Most of a model's stored factual knowledge lives here.
- Hidden state
- The vector at any given layer/position in the residual stream. The "hidden state at the last position" is what gets projected to logits.
- Logits
- Raw "enthusiasm scores" the model outputs for every possible next token, before softmax converts them to probabilities. → inference.html
- Softmax
- The function that converts logits into a probability distribution that sums to 1.
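The logits-to-softmax step can be sketched in a few lines of plain Python. The logits below are invented toy numbers, not real model output:

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, then exponentiate and normalize.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over a 4-token vocabulary.
probs = softmax([2.0, 1.0, 0.1, -1.0])
print(probs)       # highest logit gets the highest probability
print(sum(probs))  # sums to 1.0
```

Note the max-subtraction trick: it changes nothing mathematically but prevents `exp` from overflowing on large logits.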
Inference
- Inference
- Running the model to produce output. Distinguished from training (which updates weights).
- Forward pass
- One run of the model from input through all layers to logits.
- Autoregressive generation
- Producing text one token at a time, with each token conditioned on all previous tokens. → inference.html
- KV cache
- Cached K and V vectors for past tokens, so decode doesn't recompute them on every step. Runtime state, not weights or context. → inference.html
- Prefill
- The first forward pass over the entire prompt, processed in parallel. Sets up the KV cache. → inference.html
- Decode
- Generating tokens one at a time after prefill, each requiring its own forward pass. → inference.html
- Time to first token (TTFT)
- Latency from request start to first generated token. Dominated by prefill cost.
- Inter-token latency / time per output token (TPOT)
- Time between successive generated tokens during decode.
- Batching
- Combining multiple users' requests into one forward pass to improve GPU utilization. Continuous batching dynamically adds/removes requests mid-flight.
- Speculative decoding
- A small "draft" model proposes several tokens; the big model verifies them in one pass. Faster, identical output. → inference.html
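The prefill/decode split can be sketched with a stub in place of the model. `fake_logits` below is invented for illustration (a real forward pass returns logits over a large vocabulary); the loop shows the key shape: the prompt is known up front, then each new token costs one pass:

```python
def fake_logits(tokens):
    # Invented stand-in for a forward pass: favors (last token + 1) mod vocab.
    vocab = 10
    scores = [0.0] * vocab
    scores[(tokens[-1] + 1) % vocab] = 1.0
    return scores

def generate(prompt, n_new):
    tokens = list(prompt)          # "prefill": the whole prompt is available at once
    for _ in range(n_new):         # "decode": one forward pass per generated token
        logits = fake_logits(tokens)
        next_tok = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        tokens.append(next_tok)    # each new token conditions the next pass
    return tokens

print(generate([3, 7], 4))  # -> [3, 7, 8, 9, 0, 1]
```

In a real serving stack the per-step forward pass reuses the KV cache rather than re-reading the whole sequence; that optimization is invisible at this level of sketch.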
Training
- Pretraining
- The first stage: predict-the-next-token training on trillions of tokens of text. Where capability comes from. → training.html
- SFT (Supervised Fine-Tuning)
- The second stage: training on curated (prompt, ideal-response) pairs to give the model assistant-shaped behavior. → training.html
- RLHF (Reinforcement Learning from Human Feedback)
- Post-SFT stage: train a reward model on human preferences, then use RL to nudge the LLM toward higher-reward outputs. → training.html
- DPO (Direct Preference Optimization)
- Simpler alternative to RLHF: tweak the model directly using preference pairs, no separate reward model. → training.html
- RLAIF
- Reinforcement Learning from AI Feedback. Same as RLHF but with another LLM doing the judging. Cheaper, but inherits the judge model's biases.
- Constitutional AI
- Anthropic's approach: use a written constitution of principles to have the model critique and revise its own outputs.
- Catastrophic forgetting
- When fine-tuning on new data damages capability the model previously had. Common LoRA failure mode.
- Alignment tax
- Capability loss that often comes alongside RLHF/DPO; the model becomes more obedient but slightly blander.
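To make "no separate reward model" concrete, DPO's per-pair loss fits in a few lines. The log-probabilities below are made up, and `beta` is the usual DPO temperature hyperparameter; this is a sketch of the objective, not a training loop:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards: how far the policy has moved from the frozen reference.
    reward_w = beta * (logp_w - ref_logp_w)   # chosen (winning) response
    reward_l = beta * (logp_l - ref_logp_l)   # rejected (losing) response
    # -log sigmoid(margin): shrinks as the chosen response pulls ahead.
    margin = reward_w - reward_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Made-up log-probs: the policy already slightly prefers the chosen response.
print(dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0))
```

When the margin is zero the loss is log 2; gradient descent on this quantity pushes probability toward chosen responses and away from rejected ones, using only the two models' log-probs.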
Levers
- Fine-tuning
- Updating model weights on a specific dataset to teach new behavior. Full, LoRA, or QLoRA variants. → levers.html §1
- LoRA (Low-Rank Adaptation)
- Fine-tuning by learning a small low-rank delta (A·B) attached to selected weight matrices. Base weights stay frozen. → levers.html §1
- QLoRA
- LoRA with the base model quantized to 4 bits during training. Lets you fine-tune very large models on consumer GPUs.
- Quantization
- Storing weights in fewer bits (e.g., 16 → 4) to save memory. Same number of weights, coarser numerical resolution. → levers.html §2
- RTN, GPTQ, AWQ, GGUF
- Quantization methods. RTN is naive round-to-nearest. GPTQ adjusts for quantization errors per column. AWQ protects activation-important weights. GGUF is the file format for llama.cpp.
- Context extension
- Making a model handle longer inputs than it was trained for. Done by stretching positional encoding (RoPE scaling) and/or restructuring attention. → levers.html §3
- RoPE scaling
- Rescaling RoPE rotation frequencies so a model trained on shorter contexts can handle longer ones. Variants: linear (Position Interpolation), NTK-aware, YaRN.
- Sliding window attention
- Each token only attends to the last k tokens instead of all previous tokens. Reduces memory and compute; costs some long-range coherence.
- Pruning
- Removing weights, heads, or layers from a trained model. Unstructured zeroes individual weights; structured removes whole units. Usually paired with a healing fine-tune. → levers.html §4
- Decoding
- The process of converting model output (probability distribution) into chosen tokens. Controlled by temperature, top-k, top-p, repetition penalty. → levers.html §5
- Temperature
- Divides logits before softmax. T < 1 sharpens (more deterministic); T > 1 flattens (more random); T = 0 picks the top token always (greedy).
- Top-k
- Keep only the top K most-likely tokens before sampling.
- Top-p (nucleus)
- Keep the smallest set of tokens whose cumulative probability is at least p.
- Repetition penalty
- Multiplicative penalty applied to logits of recently-generated tokens. Prevents loops.
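The decoding knobs above compose naturally: temperature first, then top-k, then top-p, then a draw. A minimal pure-Python sketch (toy logits, not a production sampler):

```python
import math, random

def sample(logits, temperature=1.0, top_k=0, top_p=1.0):
    # Temperature: divide logits before softmax. T < 1 sharpens, T > 1 flattens.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Rank token ids from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k > 0:
        order = order[:top_k]          # top-k: keep only the K most likely
    kept, cum = [], 0.0
    for i in order:                    # top-p: smallest prefix reaching mass p
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over survivors and draw one token.
    total = sum(probs[i] for i in kept)
    r = random.random() * total
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

random.seed(0)
print(sample([4.0, 2.0, 1.0, -2.0], temperature=0.7, top_k=3, top_p=0.9))
```

Setting `top_k=1` recovers greedy decoding regardless of temperature, which is a handy sanity check.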
Evaluation
- Perplexity
- Intrinsic eval: how well the model predicts held-out text. Measures raw modeling quality; misses helpfulness/safety. → eval.html
- MMLU, ARC, GSM8K, HumanEval, etc.
- Multiple-choice and code-generation benchmarks. Easy to score, prone to contamination. → eval.html
- Contamination
- When benchmark questions leak into training data, inflating scores without reflecting true capability.
- MT-Bench
- LLM-as-judge benchmark: a strong model (typically GPT-4) grades pairs of model outputs against criteria.
- LMSYS Chatbot Arena
- Human preference benchmark: users compare two anonymous models side-by-side and vote. Aggregated into Elo-style scores.
- Needle-in-a-Haystack
- Long-context probe: insert a unique fact at varying positions in a long context, ask about it. Tests retrieval accuracy by depth.
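Perplexity is just the exponentiated average negative log-likelihood per token. A minimal sketch, with made-up per-token log-probs standing in for real model outputs:

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp(mean negative log-likelihood per token). Lower is better;
    # 1.0 means the model assigned probability 1 to every held-out token.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical log-probabilities for a short held-out sequence.
print(perplexity([-0.5, -1.2, -0.3, -2.0]))
```

Intuitively, a perplexity of N means the model was, on average, as uncertain as if it were choosing uniformly among N tokens at each step.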
Architecture variants
- Decoder-only
- The standard modern LLM architecture (GPT family). One stack, causal attention, autoregressive generation. The baseline this artifact teaches.
- Encoder-decoder
- Two-stack architecture (T5, BART). Encoder reads bidirectionally; decoder generates autoregressively while attending to encoder output. → block.html variant note
- Mixture-of-Experts (MoE)
- Replaces single MLP per block with many "expert" MLPs and a router that picks 1-2 per token. More capacity at similar inference cost. → block.html variant note
- Multi-Query Attention (MQA)
- One K/V shared across all attention heads. Smaller KV cache; small quality cost. → attention.html variant note
- Grouped-Query Attention (GQA)
- Heads grouped to share K/V — a middle ground between standard MHA and MQA. Used by most modern open models (Llama 3, Mistral). → attention.html variant note
- Multimodal
- Models that accept images, audio, etc. as input. Typically add a small encoder that converts non-text inputs into vectors, then pass them through the same transformer downstream. → tokens.html variant note
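The practical payoff of MQA/GQA is KV-cache size. A back-of-the-envelope sketch, using illustrative dimensions loosely shaped like an 8B-class model (not official figures):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Two tensors (K and V) per layer, one vector per KV head per position,
    # fp16 by default (2 bytes per value).
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Assumed dimensions: 32 layers, head_dim 128, 8k-token context, fp16 cache.
mha = kv_cache_bytes(32, 32, 128, 8192)   # full multi-head: 32 KV heads
gqa = kv_cache_bytes(32, 8, 128, 8192)    # grouped-query: 8 KV heads
mqa = kv_cache_bytes(32, 1, 128, 8192)    # multi-query: 1 shared KV head
print(f"MHA {mha / 2**30:.2f} GiB, GQA {gqa / 2**30:.2f} GiB, MQA {mqa / 2**30:.2f} GiB")
```

Cache size scales linearly with KV-head count, so with these numbers GQA cuts the cache 4x and MQA 32x versus full MHA, which is exactly why long-context serving favors them.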
Scaffolding
- System prompt
- A special prompt prefix instructing the model how to behave for the rest of the conversation. Pure scaffolding — doesn't change the model.
- RAG (Retrieval-Augmented Generation)
- Retrieve relevant documents from an external store and inject them into the model's context before generation.
- Tool use / function calling
- The model emits structured output indicating an external function should be called; the system runs the function and feeds the result back into the conversation.
- Agent / ReAct loop
- Multi-step scaffolding: model thinks → acts (calls tool) → observes result → thinks again, until a final answer.
- Chain-of-thought
- Prompting the model to explain its reasoning step-by-step before giving a final answer. A scaffolding technique that often improves accuracy on reasoning tasks.
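The tool-use and agent-loop entries above can be sketched end-to-end with a stubbed model. `fake_model`, the message format, and the `add` tool are all invented for illustration; a real loop would call an LLM API and a real tool registry:

```python
import json

# Hypothetical tool registry; in a real system these would call external services.
TOOLS = {"add": lambda a, b: a + b}

def fake_model(messages):
    # Invented stand-in for an LLM. First turn: emit a structured tool call.
    # After observing a tool result: emit a final answer.
    tool_results = [m for m in messages if m["role"] == "tool"]
    if tool_results:
        return {"role": "assistant",
                "content": f"The answer is {tool_results[-1]['content']}."}
    return {"role": "assistant",
            "tool_call": json.dumps({"name": "add", "args": {"a": 2, "b": 3}})}

def agent_loop(user_msg, max_steps=5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):       # think -> act -> observe, repeated
        reply = fake_model(messages)
        messages.append(reply)
        if "tool_call" not in reply:
            return reply["content"]              # final answer: stop the loop
        call = json.loads(reply["tool_call"])    # act: run the requested tool
        result = TOOLS[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": str(result)})  # observe

print(agent_loop("What is 2 + 3?"))  # -> The answer is 5.
```

Note that everything here is scaffolding in the four-lens sense: the loop, the registry, and the message log all live outside the model, which only ever sees context and emits tokens.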