Training
Pretraining → SFT → RLHF / DPO. What each stage actually changes inside the weights.
The big picture: three stages
A modern instruction-following LLM goes through (at minimum) three distinct training stages. All three use gradient descent. Pretraining and SFT share the same next-token loss, while RLHF and DPO swap in a preference-based objective. What differs most between stages is the data and the goal.
| Stage | What you give it | What it learns | Compute scale |
|---|---|---|---|
| Pretraining | Trillions of tokens from the web, books, code | How language works; vast factual knowledge; many implicit skills | ~1× (the baseline) |
| SFT | ~10k–1M curated (prompt, ideal-response) pairs | Format and style of helpful responses | ~0.1%–1% of pretraining |
| RLHF / DPO | ~10k–1M (prompt, preferred-response, rejected-response) triples | Human preferences for tone, helpfulness, refusal patterns | ~0.1%–1% of pretraining |
Notice the lopsidedness: pretraining costs roughly two to three orders of magnitude more compute than the post-training stages combined (0.2%–2% of pretraining, by the table above). Nearly all of the model's capability comes from pretraining; the post-training stages mostly shape behavior.
Stage 1: Pretraining
Take a colossal pile of text — Common Crawl scrapes, Wikipedia, GitHub code, books, papers, Reddit, StackOverflow, scientific repositories. After heavy filtering, deduplication, and quality scoring, you might end up with a few trillion tokens. The training objective is the same one you've seen all along: predict the next token.
For every position in every document, the model is asked: given everything before, what's the next token? It guesses (a probability distribution); you compare to ground truth (the actual next token); compute loss; backpropagate; update weights; move on. Repeat for every token in every document. At this scale, most of the corpus is seen only once or a few times rather than for many epochs.
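The loop described above boils down to a cross-entropy loss on the actual next token. Here is a minimal sketch in plain Python, assuming a toy vocabulary of four tokens and made-up logits; the function names `softmax` and `next_token_loss` are illustrative, not taken from any particular library.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Mean of -log p(actual next token) over all positions."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Two positions, vocabulary of 4 tokens (made-up numbers).
# Position 0: the model leans toward token 0, which is correct -> low loss.
# Position 1: the model is undecided -> loss is log(4).
logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
targets = [0, 3]
loss = next_token_loss(logits, targets)
```

In a real run the logits come from the model and the gradient of this loss flows back through every weight; the sketch only shows what number is being minimized.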
Pretraining is reading the entire library while playing a fill-in-the-blank game on every sentence. You don't know what any one book is for. Nobody tested you on comprehension. But by the time you've finished, you've absorbed how language works, what facts are true, what arguments tend to follow what claims, what code does what. You're an oracle who has read everything but has no idea what anyone wants.
What "emerges" — and why scale matters
A smaller pretrained model can complete sentences plausibly. A larger pretrained model can also do arithmetic, write code, reason about hypotheticals, and translate between languages: capabilities that nobody trained for directly. These didn't appear because someone wrote a "now learn arithmetic" stage. They appeared because predicting the next token well on a large enough corpus requires a lot of implicit competence, and the model finds it.
This is the empirical observation behind the "scale" thesis: capabilities emerge as a function of model size, training data, and compute together. As a loose rule of thumb, each 10× increase in all three has tended to bring a noticeable jump in capability. Whether this continues indefinitely is an open question; whether it's gotten us this far is not.
Compute scale
Training a frontier model takes thousands of GPUs running for weeks or months, costing tens to hundreds of millions of dollars. The numbers are eye-watering enough to be useless as intuition pumps, so here's a more useful one: a frontier pretraining run is among the most compute-intensive single jobs anyone regularly runs. Few other civilian workloads spend as much compute on one task.
Stage 2: Supervised fine-tuning (SFT)
A pretrained model is a strange creature. Ask "What is the capital of France?" and it might continue with "is a question often asked in geography classes…" — because that's a plausible continuation of an internet article that contains that question. It's not refusing to answer; it just sees a string of text and predicts what comes next, having no concept of "you're asking me a question."
SFT teaches it to behave like an assistant. The data: thousands of (prompt, ideal-response) pairs hand-curated by humans (or, increasingly, by other models). For each pair, train the model to maximize the probability of the ideal response given the prompt. Same loss, same gradient descent — just very different data.
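"Maximize the probability of the ideal response given the prompt" can be sketched as the same next-token loss, counted only over the response tokens. Masking out the prompt tokens is a common convention that this sketch assumes; the source does not specify it. The function name `sft_loss` and the numbers are made up for illustration.

```python
import math

def sft_loss(token_log_probs, is_response):
    """Negative mean log-probability over response tokens only.

    token_log_probs[i] is log p(token i | everything before it).
    is_response[i] is True for tokens of the ideal response and
    False for prompt tokens, which are excluded from the loss.
    """
    picked = [lp for lp, keep in zip(token_log_probs, is_response) if keep]
    if not picked:
        raise ValueError("example has no response tokens")
    return -sum(picked) / len(picked)

# Four tokens: two prompt tokens, then two response tokens (made-up values).
log_probs = [math.log(0.2), math.log(0.1), math.log(0.5), math.log(0.25)]
mask = [False, False, True, True]
loss = sft_loss(log_probs, mask)  # only the last two tokens count
```

The prompt still conditions the prediction; it just does not contribute its own loss terms.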
SFT is an apprentice mimicking a master. The master shows them ten thousand worked examples — "here's a question, here's how a helpful answer looks; here's a request, here's how a graceful refusal looks." The apprentice was already a competent writer; now they know what shape a helpful response takes.
What changes inside the weights
SFT rarely teaches new knowledge: the SFT dataset is far too small to compete with the trillions of pretraining tokens. What it changes is response shape: how the model formats answers, when it asks for clarification, whether it adds caveats. Concretely, the weight changes are usually small, and they appear to cluster in particular subspaces, the parts of the model that shape output-level behavior.
SFT has a known failure mode: over-generalization from style. If your SFT data is stylistically uniform (always starts with "Sure! I'd be happy to help…" for instance), the model learns that style as a brand and applies it everywhere — including places it shouldn't. The Mickey Mouse "I'm just a friendly AI" voice in many models is partially an SFT artifact.
Stage 3: RLHF or DPO
SFT teaches the model what good responses look like, but it can't directly teach preferences. Concise vs verbose, careful vs confident, this style vs that style — these are subtler than "right answer" and need a different signal.
RLHF (Reinforcement Learning from Human Feedback)
The classic recipe:
- Have the SFT model generate two completions for the same prompt.
- Show both to a human; they pick the one they prefer.
- Collect tens of thousands of these preference pairs.
- Train a separate reward model to predict which response a human would prefer.
- Use reinforcement learning (PPO) to nudge the LLM toward producing higher-reward responses, while not drifting too far from the SFT model.
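Two pieces of the recipe above can be written down compactly: the pairwise loss used to train the reward model, and the "don't drift too far" penalty applied during the RL step. The sketch below assumes a Bradley-Terry style pairwise loss and a per-sample log-probability penalty against the SFT model, which are common choices but not the only ones; `beta` and all function names are illustrative. The PPO update itself is omitted.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(r_preferred, r_rejected):
    """Pairwise loss: small when the reward model scores the
    human-preferred response above the rejected one."""
    return -log_sigmoid(r_preferred - r_rejected)

def kl_shaped_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Reward handed to the RL step: the reward-model score minus a
    penalty for assigning the response a higher log-probability than
    the SFT model did (a per-sample estimate of the KL drift)."""
    return reward - beta * (logp_policy - logp_sft)

# Reward model agrees with the human -> lower loss than when it disagrees.
good = reward_model_loss(r_preferred=2.0, r_rejected=0.0)
bad = reward_model_loss(r_preferred=0.0, r_rejected=2.0)

# Same raw reward, but the policy has drifted from the SFT model by 3 nats.
shaped = kl_shaped_reward(reward=1.0, logp_policy=-2.0, logp_sft=-5.0)
```

A larger `beta` keeps the tuned model closer to the SFT model at the cost of a weaker preference signal.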
DPO (Direct Preference Optimization)
A more recent, simpler alternative. DPO skips the separate reward model entirely. Instead, given preference pairs, it tweaks the LLM directly to make the preferred response more likely and the rejected response less likely, in one combined loss. Fewer moving parts, often comparable results, easier to implement and stabilize. Many open models since 2023 have moved from RLHF to DPO.
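The "one combined loss" can be sketched for a single preference pair as follows. This assumes the standard DPO formulation, in which each response's log-probability is measured relative to a frozen reference model (typically the SFT model); the argument names and the default `beta` are illustrative.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (prompt, preferred, rejected) triple.

    Each logp is the summed log-probability of a whole response under
    either the model being trained or the frozen reference model.
    """
    chosen_gain = logp_chosen - ref_logp_chosen        # how much more likely than reference
    rejected_gain = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_gain - rejected_gain))

# Model identical to the reference: no preference learned yet.
start = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Preferred response made more likely, rejected made less likely.
later = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss drops only when the gap between preferred and rejected widens relative to the reference, which plays the same "stay close" role as the KL penalty in RLHF.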
RLHF is the apprentice cooking many dishes, having a panel of judges taste each one and score it, training a robotic stand-in for the judges' taste, and then cooking against the robotic stand-in. DPO is cutting out the robot entirely: just hand the apprentice pairs of dishes, say which one tasted better, and tell them to cook more like the winner and less like the loser. Same end goal; fewer steps.
What changes inside the weights
RLHF/DPO updates touch roughly the same parts of the model as SFT (final-layer behavioral subspaces), but they are easier to overdo. Push too hard and you get "alignment tax": the model becomes obedient and bland, losing some of the sharpness it had before. This is one reason post-training is more art than science: the goal is to nudge behavior without flattening capability.
What about RLAIF, constitutional AI, and the rest?
The post-training landscape moves fast. A few names you'll see:
- RLAIF — Reinforcement Learning from AI Feedback. Replaces the human judges with another LLM acting as a judge. Cheaper, scales, but inherits the judge's biases.
- Constitutional AI (Anthropic) — uses a written "constitution" of principles to have the model critique and revise its own outputs, generating training data without human raters in the inner loop.
- RLHF + safety fine-tuning — additional rounds focused specifically on refusing harmful requests, handling sensitive topics, etc. Usually mixed in with general preference tuning.
All of these share the same shape: gather some signal about preferred vs unpreferred outputs, use that signal to nudge the weights. The mental model is the same; the data pipeline differs.
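The shared shape (gather a preferred-vs-unpreferred signal, then train on it) can be sketched as a data-collection loop where the judge is pluggable. Everything here is hypothetical: `generate` and `judge` stand in for a real model and a real human or AI rater, and the toy versions below exist only so the sketch runs.

```python
import itertools

def build_preference_triples(prompts, generate, judge):
    """Collect (prompt, preferred, rejected) triples.

    generate(prompt) -> one sampled response (hypothetical model call)
    judge(prompt, a, b) -> True if response a is preferred over b
                           (a human rater in RLHF, another LLM in RLAIF)
    """
    triples = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)
        if judge(prompt, a, b):
            triples.append((prompt, a, b))
        else:
            triples.append((prompt, b, a))
    return triples

# Toy stand-ins: a "model" that alternates canned answers and a
# "judge" that simply prefers the shorter response.
canned = itertools.cycle(["A long and rambling answer.", "Short answer."])

def toy_generate(prompt):
    return next(canned)

def toy_judge(prompt, a, b):
    return len(a) <= len(b)

triples = build_preference_triples(["q1", "q2"], toy_generate, toy_judge)
```

Swapping `toy_judge` for a different rater changes the data pipeline without changing what the downstream preference-tuning step consumes.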
Training an MoE adds two real concerns
Pretraining, SFT, RLHF/DPO — all three stages work the same way on an MoE as on a dense model. What's new is that the router needs to be shaped alongside the experts, and the router is a fragile thing. Two training-time mechanics that don't exist in dense models:
Auxiliary losses — keeping the router from collapsing
The main loss (predict the next token) doesn't care how tokens are routed; it only cares about the final output. So the router, left to its own devices, can collapse every token onto one expert and still hit decent loss (because that one expert gets all the gradient and becomes competent). Two extra loss terms fight this:
- Load-balance loss — penalizes uneven expert usage. Computed per batch from the routing distribution; gradient pushes the router toward uniform routing.
- Router z-loss (ST-MoE) — penalizes large routing logits. Keeps the softmax from saturating, which stabilizes routing decisions over long training runs.
These aux losses have small weights relative to the main cross-entropy loss (typically ~0.01 for load-balance, ~0.001 for z-loss). They're nudges, not instructions — enough to spread tokens without preventing the experts from actually specializing.
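Both auxiliary terms can be computed from the router logits of one batch. The sketch below assumes top-1 routing and the Switch Transformer form of the load-balance loss (number of experts times the sum over experts of token fraction times mean router probability), plus the ST-MoE z-loss as the mean squared log-sum-exp of the logits. Function names are illustrative; the default weights are the typical values quoted above.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balance_loss(router_logits):
    """About 1.0 when routing is uniform; grows toward n_experts on collapse."""
    n_tokens, n_experts = len(router_logits), len(router_logits[0])
    probs = [softmax(row) for row in router_logits]
    counts = [0] * n_experts
    for p in probs:
        counts[p.index(max(p))] += 1          # top-1 routing decision
    frac = [c / n_tokens for c in counts]     # share of tokens per expert
    mean_prob = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(f * q for f, q in zip(frac, mean_prob))

def router_z_loss(router_logits):
    """Mean squared log-sum-exp of the logits; penalizes large logits."""
    total = 0.0
    for row in router_logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse ** 2
    return total / len(router_logits)

def total_loss(main_loss, router_logits, lb_weight=0.01, z_weight=0.001):
    return (main_loss
            + lb_weight * load_balance_loss(router_logits)
            + z_weight * router_z_loss(router_logits))

balanced = [[5.0, 0.0], [0.0, 5.0], [5.0, 0.0], [0.0, 5.0]]
collapsed = [[5.0, 0.0]] * 4   # every token goes to expert 0
```

With these toy batches, `load_balance_loss(collapsed)` comes out close to 2 (the number of experts) while the balanced batch gives about 1, which is the gap the gradient pushes against.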
In some recipes the aux-loss weight starts high, which forces the router to explore so no expert dies, and then decays as training progresses, which lets experts drift into specializations.
Router warm-up and upcycling
Training from scratch is one thing; upcycling — starting from a pretrained dense model and converting it to MoE — is another. In upcycling, the router is the only randomly-initialized thing in the model. Everything else (attention, embeddings, expert initial weights) starts from a competent dense checkpoint.
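The initialization described above can be sketched as: copy the dense MLP into every expert, and give the router small random weights. The weight layout (nested lists in a dict), the function name `upcycle_mlp`, and the 0.02 standard deviation are assumptions made for illustration only.

```python
import copy
import random

def upcycle_mlp(dense_mlp_weights, n_experts, d_model, seed=0):
    """Build an MoE block's initial state from one dense MLP.

    Every expert starts as an independent copy of the dense MLP;
    the router (one row of d_model weights per expert) is the only
    randomly initialized part.
    """
    rng = random.Random(seed)
    experts = [copy.deepcopy(dense_mlp_weights) for _ in range(n_experts)]
    router = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
              for _ in range(n_experts)]
    return {"experts": experts, "router": router}

# Toy dense MLP with 2x2 matrices (made-up values).
dense = {"w_in": [[0.1, 0.2], [0.3, 0.4]], "w_out": [[0.5, 0.6], [0.7, 0.8]]}
moe = upcycle_mlp(dense, n_experts=4, d_model=2)
```

Because the experts start identical, any specialization has to come from the router sending them different tokens during training.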
That random router needs care. If you drop it in and train normally, it might not learn useful routing before the aux loss pushes it to uniform, and you end up with experts that are all copies of the original MLP. The fix is a router warm-up: crank the aux loss high early, decay it over the first few thousand steps, and only then let the main loss dominate. This gives the router a chance to find its feet before the experts diverge.
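A warm-up of this kind amounts to a schedule for the aux-loss weight. Here is a minimal sketch assuming a linear decay; the starting weight of 0.1 and the 2,000-step window are hypothetical values, and the final weight of 0.01 is the typical load-balance weight quoted earlier.

```python
def aux_loss_weight(step, start=0.1, end=0.01, decay_steps=2000):
    """Linearly decay the aux-loss weight from `start` to `end`
    over the first `decay_steps` steps, then hold it at `end`."""
    if step >= decay_steps:
        return end
    frac = step / decay_steps
    return start + frac * (end - start)

# Weight at a few points in training.
schedule = [aux_loss_weight(s) for s in (0, 1000, 2000, 5000)]
```

Other decay shapes (cosine, exponential) would fit the same description; the linear form is just the simplest one to read.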
Fine-tuning — LoRA on MoE is tricky
LoRA works by attaching a low-rank adapter to a weight matrix. In an MoE block there are N experts, each with its own weight matrices, so you have to decide where the adapters go. Three choices, each with problems:
- One LoRA shared across all experts. Cheap. But if the task wants different behavior per expert, the shared adapter can't provide it.
- Per-expert LoRAs. More expressive. But sparsely-activated experts (ones the router rarely picks for your fine-tune data) never get meaningful gradient. You end up with uneven adaptation.
- LoRA only on the router — change routing behavior without touching experts. Useful for narrow tweaks; useless for broader behavioral change.
In practice: fine-tuning MoEs with LoRA is still an active area. If you're doing serious MoE fine-tuning, consider full or partial fine-tuning (expensive but predictable) over LoRA.
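For reference, the low-rank adapter mentioned above adds a trainable update `B @ A` on top of a frozen weight matrix `W`. The sketch below uses plain lists, assumes the usual `alpha / rank` scaling, and shows the per-expert variant where each expert carries its own `(A, B)` pair. All names and numbers are illustrative.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """y = W x + (alpha / rank) * B (A x), with W frozen.

    A has shape (rank, d_in) and B has shape (d_out, rank);
    only A and B would be trained.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

def per_expert_lora_forward(expert_weights, adapters, expert_id, x):
    """Per-expert LoRA: only the routed expert's adapter is used."""
    A, B = adapters[expert_id]
    return lora_forward(expert_weights[expert_id], A, B, x)

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
A = [[1.0, 1.0]]               # rank 1
x = [1.0, 2.0]

y_init = lora_forward(W, A, [[0.0], [0.0]], x)    # B = 0: adapter is a no-op
y_tuned = lora_forward(W, A, [[1.0], [0.0]], x)   # B nonzero: output shifts
```

In the per-expert variant, an expert the router never selects for the fine-tuning data never enters the forward pass, so its adapter receives no gradient. That is the uneven-adaptation problem described in the list above.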