Glossary

One-line definitions for every technical term in the artifact. Each links back to where it's fully explained.

Glossary entries orient — what a term means in one sentence. The deep pages explain — why it works, what it implies, when it bites you. Use the glossary when you need to remember which term is which; use the deep pages when you want to actually understand.

The four lenses

Weights
The numbers learned during training; what the model "knows." Frozen at inference. Changing weights = teaching new behavior. → hub §5
Context
What the model sees in the current turn — the input window. Changes per request, forgotten between requests. → hub §5
Scaffolding
The system around the model — system prompts, retrieval, tools, agent loops. Doesn't change the model itself; changes what reaches it and what happens with its output. → hub §5
Decoding
How the sampler turns the model's probability output into a single chosen token. Temperature, top-k, top-p. → hub §5
Runtime state (not a fifth lens)
Transient computation state (KV cache, intermediate activations) that exists during one forward pass and disappears. Don't confuse with weights or context. → inference.html

Architecture

Token
A small piece of text (often a sub-word) mapped to an integer ID. The unit a model reads and writes. → tokens.html
Embedding
A vector representing a token's location in "meaning space." Tokens with related meanings sit near each other. → tokens.html
Embedding matrix
The lookup table that maps token IDs to embedding vectors. Part of the model's weights, learned during training.
Positional encoding
How the model knows token order. Modern models use RoPE (Rotary Position Embedding). → tokens.html
RoPE
Rotary Position Embedding. Rotates each token's Q and K vectors by an angle proportional to its position; attention then reads relative position from the rotation difference. → tokens.html
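The rotation trick can be sketched on a single 2-D slice of a query/key vector. This is a minimal illustration, not a full implementation — real models rotate many frequency pairs per head, and `rope_rotate` is an illustrative name.

```python
# Minimal RoPE sketch: rotate one 2-D pair by an angle proportional to position.
import numpy as np

def rope_rotate(vec, position):
    angle = float(position)  # lowest-frequency pair: angle = position * 1.0
    cos, sin = np.cos(angle), np.sin(angle)
    x, y = vec
    return np.array([x * cos - y * sin, x * sin + y * cos])

q = np.array([1.0, 0.2])
k = np.array([0.5, 0.5])
# Attention reads *relative* position: the Q-K dot product depends only on offset.
same_offset_a = rope_rotate(q, 7) @ rope_rotate(k, 4)    # positions 7 and 4, offset 3
same_offset_b = rope_rotate(q, 12) @ rope_rotate(k, 9)   # positions 12 and 9, offset 3
assert np.isclose(same_offset_a, same_offset_b)
```

The assertion holds because the dot product of two rotated vectors depends only on the difference of the two rotation angles — which is exactly the relative-position property the definition above describes.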
Attention
The mechanism by which each token looks at every other token in the sequence and updates itself based on a weighted blend of their information. → attention.html
Q, K, V (Query, Key, Value)
Per-token vectors used in attention. Q = "what am I looking for", K = "what I'm advertising", V = "what I'll share if you listen." → attention.html
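The Q/K/V mechanics fit in a few lines. A minimal single-head sketch (names illustrative, no batching or masking):

```python
# Single-head attention: each query row blends the value rows, weighted by Q-K similarity.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # how much each query "listens" to each key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted blend of values

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(Q, K, V)  # 4 positions, each updated with context from the others
```

The `1/sqrt(d)` scaling keeps scores from growing with vector size, which would otherwise saturate the softmax.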
Attention head
One parallel attention computation. Modern models run many heads simultaneously, each capturing a different relationship.
Multi-head attention
Standard pattern: run attention several times in parallel with different Q/K/V projections, concatenate, project. → attention.html
Causal autoregressive property
Each token only attends to itself and earlier tokens, never later ones. The structural reason a decoder-only model is a next-token predictor. → attention.html
Causal mask
The matrix mask that enforces the causal property during parallel training. Not a separate idea from autoregressive — just an implementation detail.
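The implementation detail really is small. A sketch for four positions:

```python
# Causal mask: position i may attend to positions <= i; future positions are blocked.
import numpy as np

T = 4
blocked = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal = future
scores = np.zeros((T, T))                            # stand-in attention scores
scores[blocked] = -np.inf                            # softmax maps -inf to zero weight
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
# Row 0 can only see itself; row 3 sees all four positions equally here.
```

Setting blocked scores to negative infinity before the softmax is the standard trick: it zeroes those attention weights exactly, so every position's output depends only on itself and earlier tokens.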
Transformer block
One unit of attention + MLP + residual + normalization. Modern LLMs stack 30-80 of these. → block.html
MLP (feed-forward block)
A small per-token neural network inside each transformer block, applied independently to every position. Where a lot of factual knowledge lives. → block.html
Residual stream
The vector flowing up through all transformer blocks; each block adds its contribution rather than replacing. → block.html
LayerNorm / RMSNorm
Normalization steps applied before attention and MLP. Keeps activation scales stable across layers. RMSNorm is a slightly cheaper variant favored by modern models.
Context mixing
What self-attention does: each token's representation absorbs information from other tokens. Every layer is another opportunity to mix more context into each position.
Knowledge recall
What the feed-forward network (MLP) does: each token, independently, uses its current representation to recall factual associations stored in the FFN weights. Most of a model's stored factual knowledge lives here.
Hidden state
The vector at any given layer/position in the residual stream. The "hidden state at the last position" is what gets projected to logits.
Logits
Raw "enthusiasm scores" the model outputs for every possible next token, before softmax converts them to probabilities. → inference.html
Softmax
The function that converts logits into a probability distribution that sums to 1.
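The function itself, in its standard numerically-stable form (subtracting the max changes nothing mathematically but avoids overflow):

```python
# Softmax: turn arbitrary logits into probabilities that sum to 1.
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)  # stability: largest exponent becomes exp(0)
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))  # higher logit -> higher probability
```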

Inference

Inference
Running the model to produce output. Distinguished from training (which updates weights).
Forward pass
One run of the model from input through all layers to logits.
Autoregressive generation
Producing text one token at a time, with each token conditioned on all previous tokens. → inference.html
KV cache
Cached K and V vectors for past tokens, so decode doesn't recompute them on every step. Runtime state, not weights or context. → inference.html
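A toy sketch of the idea (per-token vectors of size 8, names illustrative): each decode step appends one new K and V row rather than recomputing the whole past.

```python
# Toy KV cache: runtime state that grows by one row per generated token.
import numpy as np

k_cache, v_cache = [], []

def decode_step(new_k, new_v):
    k_cache.append(new_k)   # only the *new* token's K/V are computed this step
    v_cache.append(new_v)
    return np.stack(k_cache), np.stack(v_cache)  # what attention sees this step

for t in range(3):
    K, V = decode_step(np.full(8, float(t)), np.full(8, float(t)))
# After 3 steps the cache holds K/V for all 3 past tokens.
```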
Prefill
The first forward pass over the entire prompt, processed in parallel. Sets up the KV cache. → inference.html
Decode
Generating tokens one at a time after prefill, each requiring its own forward pass. → inference.html
Time to first token (TTFT)
Latency from request start to first generated token. Dominated by prefill cost.
Inter-token latency / time per output token (TPOT)
Time between successive generated tokens during decode.
Batching
Combining multiple users' requests into one forward pass to improve GPU utilization. Continuous batching dynamically adds/removes requests mid-flight.
Speculative decoding
A small "draft" model proposes several tokens; the big model verifies them in one pass. Faster, identical output. → inference.html

Training

Pretraining
The first stage: predict-the-next-token training on trillions of tokens of text. Where capability comes from. → training.html
SFT (Supervised Fine-Tuning)
The second stage: training on curated (prompt, ideal-response) pairs to give the model assistant-shaped behavior. → training.html
RLHF (Reinforcement Learning from Human Feedback)
Post-SFT stage: train a reward model on human preferences, then use RL to nudge the LLM toward higher-reward outputs. → training.html
DPO (Direct Preference Optimization)
Simpler alternative to RLHF: tweak the model directly using preference pairs, no separate reward model. → training.html
RLAIF
Reinforcement Learning from AI Feedback. Same as RLHF but with another LLM doing the judging. Cheaper, but biased toward the judge model's preferences.
Constitutional AI
Anthropic's approach: use a written constitution of principles to have the model critique and revise its own outputs.
Catastrophic forgetting
When fine-tuning on new data damages capability the model previously had. Common LoRA failure mode.
Alignment tax
Capability loss that often comes alongside RLHF/DPO; the model becomes more obedient but slightly blander.

Levers

Fine-tuning
Updating model weights on a specific dataset to teach new behavior. Full, LoRA, or QLoRA variants. → levers.html §1
LoRA (Low-Rank Adaptation)
Fine-tuning by learning a small low-rank delta (A·B) attached to selected weight matrices. Base weights stay frozen. → levers.html §1
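The low-rank delta can be sketched directly. A minimal example, assuming the common convention of initializing B to zero so training starts from the base model's behavior (dimensions are illustrative):

```python
# LoRA sketch: effective weight = frozen W plus a low-rank delta (alpha/r) * A @ B.
import numpy as np

d, r = 16, 2                              # full dimension, low rank (r << d)
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))           # frozen base weight: never updated
A = rng.standard_normal((d, r)) * 0.01    # trainable, d x r
B = np.zeros((r, d))                      # trainable, r x d; zero init -> delta starts at 0
alpha = 4.0                               # scaling hyperparameter

def forward(x):
    return x @ (W + (alpha / r) * A @ B)  # only A and B would receive gradients

x = rng.standard_normal(d)
# Before any training the delta is zero, so output matches the base model exactly.
```

The memory win: training updates `2 * d * r` numbers per adapted matrix instead of `d * d`.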
QLoRA
LoRA with the base model quantized to 4 bits during training. Lets you fine-tune very large models on consumer GPUs.
Quantization
Storing weights in fewer bits (e.g., 16 → 4) to save memory. Same number of weights, coarser numerical resolution. → levers.html §2
RTN, GPTQ, AWQ, GGUF
Quantization methods. RTN is naive round-to-nearest. GPTQ adjusts for quantization errors per column. AWQ protects activation-important weights. GGUF is the file format for llama.cpp.
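The naive RTN baseline named above is simple enough to sketch in full. A minimal symmetric 4-bit version over one weight group (illustrative, not any particular library's implementation):

```python
# Round-to-nearest (RTN) quantization of one weight group to 4-bit integers.
import numpy as np

def quantize_rtn(w, bits=4):
    qmax = 2 ** (bits - 1) - 1                 # symmetric range, e.g. [-7, 7] for 4 bits
    scale = np.abs(w).max() / qmax             # one float scale per group
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale        # coarse reconstruction of the weights

w = np.array([0.12, -0.7, 0.33, 0.05], dtype=np.float32)
q, s = quantize_rtn(w)
w_hat = dequantize(q, s)                       # same count of weights, lower resolution
```

GPTQ and AWQ improve on exactly this step: instead of rounding each weight independently, they account for which rounding errors matter most for the model's outputs.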
Context extension
Making a model handle longer inputs than it was trained for. Done by stretching positional encoding (RoPE scaling) and/or restructuring attention. → levers.html §3
RoPE scaling
Rescaling RoPE rotation frequencies so a model trained on shorter contexts can handle longer ones. Variants: linear (Position Interpolation), NTK-aware, YaRN.
Sliding window attention
Each token only attends to the last k tokens instead of all previous tokens. Reduces memory and compute; costs some long-range coherence.
Pruning
Removing weights, heads, or layers from a trained model. Unstructured zeroes individual weights; structured removes whole units. Usually paired with a healing fine-tune. → levers.html §4
Decoding
The process of converting model output (probability distribution) into chosen tokens. Controlled by temperature, top-k, top-p, repetition penalty. → levers.html §5
Temperature
Divides logits before softmax. T < 1 sharpens (more deterministic); T > 1 flattens (more random); T = 0 always picks the top token (greedy).
Top-k
Keep only the top K most-likely tokens before sampling.
Top-p (nucleus)
Keep the smallest set of tokens whose cumulative probability is at least p.
Repetition penalty
Multiplicative penalty applied to logits of recently-generated tokens. Prevents loops.
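The temperature, top-k, and top-p entries above chain into one pipeline. A minimal sketch of that chain (function name and defaults are illustrative; repetition penalty is omitted for brevity):

```python
# Decoding pipeline sketch: temperature, then top-k, then top-p, then one random draw.
import numpy as np

def sample_token(logits, temperature=0.8, top_k=50, top_p=0.9, seed=0):
    rng = np.random.default_rng(seed)
    scaled = logits / temperature                  # T < 1 sharpens, T > 1 flattens
    order = np.argsort(scaled)[::-1][:top_k]       # keep the top-k candidate token ids
    probs = np.exp(scaled[order] - scaled[order].max())
    probs /= probs.sum()
    cum = np.cumsum(probs)
    keep = (cum - probs) < top_p                   # smallest prefix with mass >= p
    probs = probs[keep] / probs[keep].sum()        # renormalize the survivors
    return int(rng.choice(order[keep], p=probs))

token = sample_token(np.array([3.0, 2.5, 0.1, -1.0, -2.0]))
# With these logits, nucleus filtering leaves only the two most likely tokens.
```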

Evaluation

Perplexity
Intrinsic eval: how well the model predicts held-out text. Measures raw modeling quality; misses helpfulness/safety. → eval.html
MMLU, ARC, GSM8K, HumanEval, etc.
Multiple-choice and code-generation benchmarks. Easy to score, prone to contamination. → eval.html
Contamination
When benchmark questions leak into training data, inflating scores without reflecting true capability.
MT-Bench
LLM-as-judge benchmark: a strong model (typically GPT-4) grades pairs of model outputs against criteria.
LMSYS Chatbot Arena
Human preference benchmark: users compare two anonymous models side-by-side and vote. Aggregated into Elo-style scores.
Needle-in-a-Haystack
Long-context probe: insert a unique fact at varying positions in a long context, ask about it. Tests retrieval accuracy by depth.

Architecture variants

Decoder-only
The standard modern LLM architecture (GPT family). One stack, causal attention, autoregressive generation. The baseline this artifact teaches.
Encoder-decoder
Two-stack architecture (T5, BART). Encoder reads bidirectionally; decoder generates autoregressively while attending to encoder output. → block.html variant note
Mixture-of-Experts (MoE)
Replaces single MLP per block with many "expert" MLPs and a router that picks 1-2 per token. More capacity at similar inference cost. → block.html variant note
Multi-Query Attention (MQA)
One K/V shared across all attention heads. Smaller KV cache; small quality cost. → attention.html variant note
Grouped-Query Attention (GQA)
Heads grouped to share K/V — middle ground between standard MHA and MQA. Used by most modern open models (Llama 3, Mistral). → attention.html variant note
Multimodal
Models that accept images, audio, etc. as input. Typically add a small encoder that converts non-text inputs into vectors, then pass them through the same transformer downstream. → tokens.html variant note

Scaffolding

System prompt
A special prompt prefix instructing the model how to behave for the rest of the conversation. Pure scaffolding — doesn't change the model.
RAG (Retrieval-Augmented Generation)
Retrieve relevant documents from an external store and inject them into the model's context before generation.
Tool use / function calling
The model emits structured output indicating an external function should be called; the system runs the function and feeds the result back into the conversation.
Agent / ReAct loop
Multi-step scaffolding: model thinks → acts (calls tool) → observes result → thinks again, until a final answer.
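The think/act/observe cycle is just a loop around the model. A hypothetical skeleton — the model and tool here are stand-in stubs, and real scaffolding would parse the model's structured tool-call output rather than use plain dicts:

```python
# ReAct-style loop skeleton: think -> act -> observe, until a final answer.
def agent_loop(question, model, tools, max_steps=5):
    transcript = question
    for _ in range(max_steps):
        step = model(transcript)                    # think: answer, or request a tool
        if "final" in step:
            return step["final"]
        result = tools[step["tool"]](step["args"])  # act: the *system* runs the tool
        transcript += f"\nObservation: {result}"    # observe: result re-enters context
    return None                                     # gave up after max_steps

# Stub model: requests the calculator once, then answers from the observation.
def stub_model(transcript):
    if "Observation:" in transcript:
        return {"final": transcript.split("Observation: ")[-1]}
    return {"tool": "calc", "args": "2+2"}

answer = agent_loop("What is 2+2?", stub_model, {"calc": lambda expr: "4"})
```

Note that the model never executes anything itself — it only emits requests, and the loop feeds results back into its context. That separation is what makes this scaffolding rather than a model change.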
Chain-of-thought
Prompting the model to explain its reasoning step-by-step before giving a final answer. A scaffolding technique that often improves accuracy on reasoning tasks.