How do we know a change helped?
Evaluation as the feedback loop for every other lever.
The framing
Every lever in this artifact — fine-tune, quantize, extend context, prune, swap a decoding setting — is a change. Each one is meant to make something better. How do you actually tell? That's evaluation.
There is no single eval that captures "is the model better." Different methods measure different things, and all of them have known holes. The skill is knowing which eval to reach for given which change you made.
Four families of evaluation
1. Intrinsic — perplexity
What it measures: how well the model predicts held-out text. Lower perplexity = better at predicting the actual next token. Mathematically, it's just exp(loss), where loss is the average per-token cross-entropy.
What it catches: raw modeling quality. If you quantize a model and perplexity blows up, something's broken. If pretraining loss is going down, the model is learning.
What it misses: almost everything you actually care about in a chat product. A model can have great perplexity on Wikipedia and still refuse to answer your question, hallucinate, or be terrible at instruction-following. Perplexity rewards generic next-token prediction, not usefulness.
Perplexity is like measuring a writer by how well they can guess the next word in random books. It tells you something about their general fluency. It tells you nothing about whether they'd write a useful email.
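To make the exp(loss) definition concrete, here's a minimal sketch of computing perplexity from per-token log-probabilities (the `perplexity` helper is illustrative, not from any particular library):

```python
import math

def perplexity(token_logprobs):
    # Perplexity is exp of the mean negative log-likelihood per token,
    # i.e. exp(loss) when loss is the average cross-entropy in nats.
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token is, intuitively,
# "choosing among 4 equally likely options" -- its perplexity is ~4.0:
ppl = perplexity([math.log(0.25)] * 10)
```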
2. Static benchmarks — MMLU, ARC, GSM8K, HumanEval
What it measures: accuracy on a battery of test questions. MMLU has ~16k multiple-choice questions across 57 academic subjects; ARC is grade-school science multiple choice; GSM8K is grade-school math word problems; HumanEval is Python coding problems. The model picks an answer (or generates code that is run against tests); you score correctness automatically.
What it catches: narrow, well-defined, automatable capability. Easy to compare two models — just run both, get two numbers.
What it misses: real-world helpfulness. A model can be tuned to ace MMLU and still be a poor assistant. Worse, it might be a poor assistant because it was tuned to ace MMLU at the expense of conversational quality.
Contamination is the dirty secret. Many benchmark questions exist on the public internet, often in test prep materials. Pretraining corpora include the public internet. So benchmark questions leak into training data. Models then "score well" not because they reasoned to the answer but because they memorized the question. Most labs now do contamination checks, but it's an arms race.
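The scoring loop for this family is mechanical, which is exactly why it's automatable. A sketch with a hypothetical item schema (real harnesses such as EleutherAI's lm-evaluation-harness define their own formats):

```python
def benchmark_accuracy(items, predict):
    """Score multiple-choice items.

    items: dicts with "question", "choices", and "answer" (the correct
    index) -- an illustrative schema, not a real benchmark format.
    predict: the model under test, returning a chosen index.
    """
    correct = sum(
        predict(item["question"], item["choices"]) == item["answer"]
        for item in items
    )
    return correct / len(items)
```

Comparing two models really is this simple: run both predict functions over the same items and compare the two numbers.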
3. LLM-as-judge — MT-Bench, AlpacaEval
What it measures: a strong reference model (often GPT-4) reads pairs of model outputs and judges which is better. Aggregate over many prompts to get win rates.
What it catches: open-ended quality at scale. You can evaluate "is this answer helpful and well-formatted?" without paying humans for every judgment.
What it misses: the judge has biases. GPT-4 prefers GPT-4-style answers. It rewards length even when conciseness is better. It struggles to grade things outside its own competence (a GPT-4 judge is not a great judge of code that exceeds GPT-4's coding ability).
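The core loop can be sketched in a few lines, assuming a `judge` callable that returns "1", "2", or "tie" (MT-Bench and AlpacaEval differ in details; randomizing answer order per prompt is a standard mitigation for judges' known position bias):

```python
import random

def pairwise_win_rate(prompts, model_a, model_b, judge):
    wins = ties = 0
    for prompt in prompts:
        a, b = model_a(prompt), model_b(prompt)
        # Randomize presentation order: judges tend to favor
        # whichever answer appears first.
        if random.random() < 0.5:
            verdict = judge(prompt, a, b)
            a_won = verdict == "1"
        else:
            verdict = judge(prompt, b, a)
            a_won = verdict == "2"
        if verdict == "tie":
            ties += 1
        elif a_won:
            wins += 1
    # Count ties as half a win -- one common convention.
    return (wins + 0.5 * ties) / len(prompts)
```

Note that the length bias mentioned above lives inside `judge` itself, which no amount of loop hygiene can fix.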
4. Preference / arena — LMSYS Chatbot Arena
What it measures: humans submit prompts to two anonymous models side-by-side and vote on the better response. After thousands of votes per model, an Elo-style score emerges.
What it catches: real-world taste. Captures the messy gestalt of "do users actually like this model" — formatting, helpfulness, tone, creativity, refusal patterns all blended together.
What it misses: attribution. If model A is rated higher than model B, you don't know why. Is A smarter? Better-formatted? Just different in style? Hard to tell. Also: most arena votes are short prompts in English, so capabilities outside that distribution may not be reflected.
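The textbook Elo update after a single vote looks like this (the live leaderboard actually fits a Bradley-Terry model over all votes, so treat this as the intuition, not the exact method):

```python
def elo_update(rating_winner, rating_loser, k=32):
    # Expected score of the winner under the Elo logistic model.
    expected = 1 / (1 + 10 ** ((rating_loser - rating_winner) / 400))
    # An upset (low expected score) moves ratings more than a
    # predictable win; k caps the per-vote movement.
    delta = k * (1 - expected)
    return rating_winner + delta, rating_loser - delta

# Two models at 1000: the winner gains 16 points, the loser drops 16.
```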
Match the eval to the lever
This is the practical payoff. When you pull a specific lever, certain evals will tell you whether it helped:
| Lever | Evals that will catch the effect | Evals that will miss it |
|---|---|---|
| Fine-tuning for a task | Your task-specific eval (held-out test set); preference eval if it's about tone/format; held-out general benchmark to catch forgetting | Perplexity on generic web text (changes in ways you don't care about) |
| Quantization | Perplexity (catches if numerics broke); your task eval (catches if the bits you care about degraded) | Arena (signal too coarse to catch sub-percent quality drops) |
| Context extension | Long-context retrieval probes (Needle-in-a-Haystack), long-document QA, multi-turn fidelity | Short-prompt benchmarks (most of MMLU); short-form arena prompts |
| Pruning | Perplexity (catches gross degradation), full task eval suite (catches narrow capability loss) | Anything narrow — pruning often degrades skills nobody tested for |
| Decoding controls (temperature, top-p) | Variance metrics (multiple runs of same prompt); creativity/diversity metrics; user studies | Perplexity (it's computed on the model's raw distribution; decoding settings only change how you sample from it) |
| System prompt change | Task-specific eval; preference eval focused on tone/format | Static benchmarks (don't run system prompts) |
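Why perplexity is blind to decoding settings is easiest to see in code: temperature and top-p reshape and truncate the distribution at generation time, while perplexity is computed from the model's raw probabilities. A self-contained sketch of the two controls:

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=1.0):
    # Temperature rescales logits before softmax: <1 sharpens the
    # distribution, >1 flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Top-p (nucleus): keep the smallest set of highest-probability
    # tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample a token index from the renormalized kept set.
    r = random.random() * sum(probs[i] for i in kept)
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]
```

The model's logits never change; only the sampling on top of them does, which is why a teacher-forced perplexity measurement can't see any of this.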
How teams actually pick a model
Across many real conversations with people who ship LLM products, the pattern looks the same:
- Start with arena scores as a coarse filter. Top 10 by arena Elo is a reasonable shortlist.
- Check benchmark scores on the dimensions you care about — code (HumanEval, SWE-Bench), math (GSM8K, MATH), long-context (RULER), instruction-following (IFEval).
- Build a custom eval set on your task. Even 100 hand-picked examples beats any public benchmark for your specific use case.
- Vibe-check. Spend an hour talking to the top 2-3 candidates. Things you'll notice in conversation often don't show up in any number.
- Account for cost, latency, deployment constraints. The model that wins on capability per dollar is usually not the model with the highest absolute capability.
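The custom-eval step in the list above can be a very small harness (names are illustrative, not a real framework API):

```python
def run_custom_eval(examples, model, grade):
    # examples: ~100 hand-picked dicts with "prompt" and "expected".
    # model: callable, prompt -> output string.
    # grade: callable, (output, expected) -> bool; exact match, a
    # regex, or even another model acting as grader.
    results = []
    for ex in examples:
        output = model(ex["prompt"])
        results.append({
            "prompt": ex["prompt"],
            "output": output,
            "pass": grade(output, ex["expected"]),
        })
    score = sum(r["pass"] for r in results) / len(results)
    return score, results
```

Keeping the per-example results around (not just the score) is what lets you read the failures, which is where most of the signal is.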
"Vibes" gets disrespected as unscientific, but it's actually doing useful work — it's catching the residual after all your formal evals have run, telling you about the shape of capability that no individual eval is measuring.
The honest summary
Eval is hard. Every method has holes. The teams that ship best treat evals as a portfolio: a few automated benchmarks for fast iteration, a custom task-specific suite for honest measurement, periodic human review for catching things automation misses.
If you're making changes to an LLM (any of the levers), commit to evaluating those changes. If you ship without evals, you don't actually know whether your last change helped — and you'll find out, painfully, when something breaks downstream.