Training
Pretraining → SFT → RLHF / DPO. What each stage actually changes inside the weights.
The big picture: three stages
A modern instruction-following LLM goes through (at minimum) three distinct training stages. Each one uses gradient descent, but the data, the loss, and the goal differ.
| Stage | What you give it | What it learns | Compute scale |
|---|---|---|---|
| Pretraining | Trillions of tokens from the web, books, code | How language works; vast factual knowledge; many implicit skills | ~1× (the baseline) |
| SFT | ~10k–1M curated (prompt, ideal-response) pairs | Format and style of helpful responses | ~0.1%–1% of pretraining |
| RLHF / DPO | ~10k–1M (prompt, preferred-response, rejected-response) triples | Human preferences for tone, helpfulness, refusal patterns | ~0.1%–1% of pretraining |
Notice the lopsidedness: pretraining costs roughly two to three orders of magnitude more compute than the post-training stages combined. Nearly all of the model's capability comes from pretraining; the post-training stages mostly shape behavior.
Stage 1: Pretraining
Take a colossal pile of text — Common Crawl scrapes, Wikipedia, GitHub code, books, papers, Reddit, StackOverflow, scientific repositories. After heavy filtering, deduplication, and quality scoring, you might end up with a few trillion tokens. The training objective is the same one you've seen all along: predict the next token.
For every position in every document, the model is asked: given everything before, what's the next token? It guesses (a probability distribution); you compare to ground truth (the actual next token); compute loss; backpropagate; update weights; move on. Repeat for every token in every document. Modern pretraining runs typically make only about one pass over the corpus; data is rarely revisited for multiple epochs.
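The per-position objective described above can be sketched in a few lines. This is a toy illustration, not a real training step: the logits and the 4-token vocabulary are made up, and `next_token_loss` is a hypothetical helper name.

```python
import math

def next_token_loss(logits, target_id):
    """Cross-entropy at one position: -log P(actual next token)."""
    # Softmax over the vocabulary turns raw scores into probabilities.
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[target_id])

# Toy example: model's raw scores over a 4-token vocabulary at one position.
logits = [2.0, 0.5, -1.0, 0.1]
loss_confident = next_token_loss(logits, 0)  # ground truth is the favored token
loss_wrong = next_token_loss(logits, 2)      # ground truth got a low score
assert loss_confident < loss_wrong           # being right costs less
```

Backpropagation then nudges the weights to shrink this loss, one token at a time, trillions of times.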
Pretraining is reading the entire library while playing a fill-in-the-blank game on every sentence. You don't know what any one book is for. Nobody tested you on comprehension. But by the time you've finished, you've absorbed how language works, what facts are true, what arguments tend to follow what claims, what code does what. You're an oracle who has read everything but has no idea what anyone wants.
What "emerges" — and why scale matters
A smaller pretrained model can complete sentences plausibly. A larger pretrained model can also do arithmetic, code, reason about hypotheticals, translate languages — capabilities that nobody trained directly. These didn't appear because someone wrote a "now learn arithmetic" stage. They appeared because predicting the next token well on a large enough corpus requires a lot of implicit competence, and the model finds it.
This is the empirical observation behind the "scale" thesis: capabilities emerge as a function of model size, training data, and compute together. Roughly: 10× more of all three buys you a qualitative jump in what the model can do, even though the loss curve itself improves smoothly. Whether this continues indefinitely is an open question; whether it's gotten us this far is not.
Compute scale, in pictures
Training a frontier model takes thousands of GPUs running for weeks or months, costing tens to hundreds of millions of dollars. The numbers are eye-watering enough to be useless intuition pumps, so here's a more useful one: a frontier pretraining run is plausibly among the most expensive single computations humans regularly perform. Few civilian workloads come anywhere close to the compute spent on one.
Stage 2: Supervised fine-tuning (SFT)
A pretrained model is a strange creature. Ask "What is the capital of France?" and it might continue with "is a question often asked in geography classes…" — because that's a plausible continuation of an internet article that contains that question. It's not refusing to answer; it just sees a string of text and predicts what comes next, having no concept of "you're asking me a question."
SFT teaches it to behave like an assistant. The data: thousands of (prompt, ideal-response) pairs hand-curated by humans (or, increasingly, by other models). For each pair, train the model to maximize the probability of the ideal response given the prompt. Same loss, same gradient descent — just very different data.
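One detail worth making concrete: in SFT the loss is usually computed only on the response tokens, with the prompt masked out, since we want to grade the answer, not the question. A minimal sketch, assuming per-token log-probabilities are already available (`sft_loss` and the numbers are hypothetical):

```python
def sft_loss(logprobs, prompt_len):
    """Average negative log-likelihood over response tokens only.

    logprobs[i] is the model's log P(token i | all earlier tokens).
    Prompt tokens are masked out of the loss entirely.
    """
    response_lps = logprobs[prompt_len:]
    return -sum(response_lps) / len(response_lps)

# Hypothetical numbers: 3 prompt tokens, 4 response tokens.
logprobs = [-5.0, -4.2, -6.1,        # prompt (ignored by the loss)
            -0.3, -0.1, -0.5, -0.2]  # response (what we train on)
loss = sft_loss(logprobs, prompt_len=3)   # (0.3+0.1+0.5+0.2)/4 = 0.275
```

Same cross-entropy as pretraining; the only changes are the data and the mask.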
SFT is an apprentice mimicking a master. The master shows them ten thousand worked examples — "here's a question, here's how a helpful answer looks; here's a request, here's how a graceful refusal looks." The apprentice was already a competent writer; now they know what shape a helpful response takes.
What changes inside the weights
SFT rarely teaches new knowledge — there isn't enough of it to compete with the trillions of pretraining tokens. What it changes is response shape: how the model formats answers, when it asks for clarification, whether it adds caveats. Concretely, the weight changes are usually small but cluster in particular subspaces — the parts of the model that produce final-token-level behavior.
SFT has a known failure mode: over-generalization from style. If your SFT data is stylistically uniform (always starts with "Sure! I'd be happy to help…" for instance), the model learns that style as a brand and applies it everywhere, including places it shouldn't. The chirpy "I'm just a friendly AI" voice in many models is partially an SFT artifact.
Stage 3: RLHF or DPO
SFT teaches the model what good responses look like, but it can't directly teach preferences. Concise vs verbose, careful vs confident, this style vs that style — these are subtler than "right answer" and need a different signal.
RLHF (Reinforcement Learning from Human Feedback)
The classic recipe:
- Have the SFT model generate two completions for the same prompt.
- Show both to a human; they pick the one they prefer.
- Collect tens of thousands of these preference pairs.
- Train a separate reward model to predict which response a human would prefer.
- Use reinforcement learning (typically PPO) to nudge the LLM toward producing higher-reward responses, while not drifting too far from the SFT model.
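The reward-model step in the recipe above is typically trained with a Bradley-Terry-style loss on the preference pairs. A minimal sketch, with scalar rewards standing in for the reward model's actual outputs (`preference_loss` is a hypothetical name):

```python
import math

def preference_loss(r_preferred, r_rejected):
    """Bradley-Terry loss: -log sigmoid(r_preferred - r_rejected).

    Small when the reward model already scores the human-preferred
    response higher; large when the ranking is flipped.
    """
    margin = r_preferred - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

good_ranking = preference_loss(2.0, -1.0)   # preferred scored higher: low loss
bad_ranking = preference_loss(-1.0, 2.0)    # rewards flipped: high loss
assert good_ranking < bad_ranking
```

Only the *difference* between the two rewards matters, which is why reward-model scores have no absolute scale.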
DPO (Direct Preference Optimization)
A more recent, simpler alternative. DPO skips the separate reward model entirely. Instead, given preference pairs, it tweaks the LLM directly to make the preferred response more likely and the rejected response less likely, in one combined loss. Fewer moving parts, often comparable results, easier to implement and stabilize. Many open models since 2023 have moved from RLHF to DPO.
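That "one combined loss" has a compact form: it rewards the policy for widening its preference margin over a frozen reference (usually the SFT model). A sketch with hypothetical summed log-probabilities standing in for real model outputs:

```python
import math

def dpo_loss(lp_chosen, lp_rejected, ref_lp_chosen, ref_lp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    lp_* are summed log-probs of each full response under the policy;
    ref_lp_* are the same under the frozen reference (SFT) model.
    beta controls how hard the policy is pushed away from the reference.
    """
    margin = beta * ((lp_chosen - ref_lp_chosen)
                     - (lp_rejected - ref_lp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probs: the policy already prefers the chosen response
# more strongly than the reference does, so the loss sits below -log(0.5).
loss = dpo_loss(lp_chosen=-10.0, lp_rejected=-14.0,
                ref_lp_chosen=-12.0, ref_lp_rejected=-13.0)
assert loss < math.log(2.0)
```

Note the same sigmoid shape as the reward-model loss: DPO effectively folds the reward model into the policy's own log-probability ratios.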
RLHF is the apprentice cooking many dishes, having a panel of judges taste each one and score it, training a robotic stand-in for the judges' taste, and then cooking against the robotic stand-in. DPO is cutting out the robot entirely: just hand the apprentice pairs of dishes, say which one tasted better, and tell them to cook more like the winner and less like the loser. Same end goal; fewer steps.
What changes inside the weights
RLHF/DPO updates touch roughly the same parts of the model as SFT — final-layer behavioral subspaces — but they're sensitive in a way SFT isn't. Push too hard and you get "alignment tax": the model becomes obedient and bland, losing some of the sharpness it had before. This is one reason post-training is more art than science: the goal is to nudge behavior without flattening capability.
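The usual guard against this flattening is the "don't drift too far" term: in RLHF it commonly appears as a per-token KL-style penalty against the SFT model, folded into the reward. A hedged sketch (the function name and numbers are illustrative):

```python
def penalized_reward(reward, policy_lp, ref_lp, kl_coef=0.1):
    """RLHF-style shaped reward with a drift penalty.

    policy_lp - ref_lp is the per-token log-ratio between the policy and
    the frozen SFT model; its expectation under the policy is the KL.
    Bigger drift from the SFT model -> smaller effective reward.
    """
    return reward - kl_coef * (policy_lp - ref_lp)

# Same raw reward, different amounts of drift from the SFT model.
r_close = penalized_reward(1.0, policy_lp=-2.0, ref_lp=-2.1)   # barely moved
r_drift = penalized_reward(1.0, policy_lp=-0.5, ref_lp=-4.0)   # moved a lot
assert r_drift < r_close
```

Tuning `kl_coef` is exactly the art referred to above: too low and the model reward-hacks; too high and preference tuning barely changes anything.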
What about RLAIF, constitutional AI, and the rest?
The post-training landscape moves fast. A few names you'll see:
- RLAIF — Reinforcement Learning from AI Feedback. Replaces the human judges with another LLM acting as a judge. Cheaper and more scalable, but it inherits the judge's biases.
- Constitutional AI (Anthropic) — uses a written "constitution" of principles to have the model critique and revise its own outputs, generating training data without human raters in the inner loop.
- RLHF + safety fine-tuning — additional rounds focused specifically on refusing harmful requests, handling sensitive topics, etc. Usually mixed in with general preference tuning.
All of these share the same shape: gather some signal about preferred vs unpreferred outputs, use that signal to nudge the weights. The mental model is the same; the data pipeline differs.