How do we know a change helped?

Evaluation as the feedback loop for every other lever.

Eval doesn't change the model — it's how you measure changes from any of the other lenses. Without a matching eval, you're shipping changes blind.

The framing

Every lever in this artifact — fine-tune, quantize, extend context, prune, swap a decoding setting — is a change. Each one is meant to make something better. How do you actually tell? That's evaluation.

There is no single eval that captures "is the model better." Different methods measure different things, and all of them have known holes. The skill is knowing which eval to reach for given which change you made.

Four families of evaluation

1. Intrinsic — perplexity

What it measures: how well the model predicts held-out text. Lower perplexity = better at predicting the actual next token. Mathematically, it's the exponential of the average per-token cross-entropy loss: exp(loss).
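The exp(loss) relationship is small enough to show directly. A minimal sketch, assuming you already have per-token log-probabilities from a model (the list of logprobs here is a made-up input):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from per-token log-probabilities (natural log).

    Perplexity is exp of the average negative log-likelihood,
    i.e. exp(cross-entropy loss).
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# A model that assigns probability 0.25 to every token has
# perplexity ~4: it behaves as if choosing among 4 equally
# likely options at each step.
print(perplexity([math.log(0.25)] * 10))  # ≈ 4.0
```

This is also why "perplexity blew up after quantization" is an easy check: you only need the loss on a held-out corpus, no task harness.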

What it catches: raw modeling quality. If you quantize a model and perplexity blows up, something's broken. If pretraining loss is going down, the model is learning.

What it misses: almost everything you actually care about in a chat product. A model can have great perplexity on Wikipedia and still refuse to answer your question, hallucinate, or be terrible at instruction-following. Perplexity rewards generic next-token prediction, not usefulness.

Perplexity is like measuring a writer by how well they can guess the next word in random books. It tells you something about their general fluency. It tells you nothing about whether they'd write a useful email.

2. Multiple-choice benchmarks — MMLU, ARC, GSM8K, HumanEval

What it measures: accuracy on a battery of test questions. MMLU has ~16k multiple-choice questions across 57 academic subjects. GSM8K is math word problems. HumanEval is Python coding problems. The model picks an answer (or generates code that runs); you score correctness automatically.
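The scoring side of these benchmarks really is this simple, which is why they're so easy to automate. A minimal sketch for MMLU-style answer letters (the example predictions are made up):

```python
def mc_accuracy(predictions: list[str], gold: list[str]) -> float:
    """Score a multiple-choice benchmark: fraction of exact matches.

    predictions/gold are answer letters like "A".."D" (MMLU-style).
    Real harnesses also handle answer extraction from free-form
    model output, which is where most of the engineering lives.
    """
    assert len(predictions) == len(gold)
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

print(mc_accuracy(["A", "C", "B", "D"], ["A", "C", "C", "D"]))  # → 0.75
```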

What it catches: narrow, well-defined, automatable capability. Easy to compare two models — just run both, get two numbers.

What it misses: real-world helpfulness. A model can be tuned to ace MMLU and still be a poor assistant. Worse, it might be a poor assistant because it was tuned to ace MMLU at the expense of conversational quality.

Contamination is the dirty secret. Many benchmark questions exist on the public internet, often in test prep materials. Pretraining corpora include the public internet. So benchmark questions leak into training data. Models then "score well" not because they reasoned to the answer but because they memorized the question. Most labs now do contamination checks, but it's an arms race.

3. LLM-as-judge — MT-Bench, AlpacaEval

What it measures: a strong reference model (often GPT-4) reads pairs of model outputs and judges which is better. Aggregate over many prompts to get win rates.

What it catches: open-ended quality at scale. You can evaluate "is this answer helpful and well-formatted?" without paying humans for every judgment.

What it misses: the judge has biases. GPT-4 prefers GPT-4-style answers. It rewards length even when conciseness is better. It struggles to grade things outside its own competence (a GPT-4 judge is not a great judge of code that exceeds GPT-4's coding ability).
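The aggregation loop is straightforward; the hard part is the judge's biases. A hedged sketch, where `model_a`, `model_b`, and `judge` are placeholder callables (in practice, API calls), and the presentation order is randomized per prompt to reduce the judge's known position bias:

```python
import random

def judge_win_rate(prompts, model_a, model_b, judge) -> float:
    """Win rate of model_a over model_b under an LLM judge.

    `model_a`/`model_b` map a prompt to an answer string; `judge`
    takes (prompt, answer_1, answer_2) and returns 1 or 2 for which
    answer it prefers. All three are assumptions standing in for
    real model calls.
    """
    wins = 0.0
    for prompt in prompts:
        a, b = model_a(prompt), model_b(prompt)
        # Randomize which answer is shown first: judges tend to
        # favor one position, so a fixed order inflates one side.
        if random.random() < 0.5:
            wins += 1.0 if judge(prompt, a, b) == 1 else 0.0
        else:
            wins += 1.0 if judge(prompt, b, a) == 2 else 0.0
    return wins / len(prompts)

# Toy demo: a "judge" that always prefers the longer answer —
# exactly the length bias described above.
prompts = ["Explain X", "Explain Y"]
a = lambda p: "a detailed, well-structured answer"
b = lambda p: "short"
length_judge = lambda p, x, y: 1 if len(x) > len(y) else 2
print(judge_win_rate(prompts, a, b, length_judge))  # → 1.0
```

The toy judge makes the failure mode concrete: a verbose model "wins" every comparison regardless of quality.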

4. Preference / arena — LMSYS Chatbot Arena

What it measures: humans submit prompts to two anonymous models side by side and vote on the better response. After thousands of votes per model, an Elo-style score emerges.

What it catches: real-world taste. Captures the messy gestalt of "do users actually like this model" — formatting, helpfulness, tone, creativity, refusal patterns all blended together.

What it misses: attribution. If model A is rated higher than model B, you don't know why. Is A smarter? Better-formatted? Just different in style? Hard to tell. Also: most arena votes are short prompts in English, so capabilities outside that distribution may not be reflected.
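For intuition on how thousands of pairwise votes turn into a single score, here's a classic chess-style Elo update. Arena leaderboards use more sophisticated variants (the exact method is an assumption here), but the mechanism is the same: each vote nudges the winner up and the loser down, weighted by how surprising the result was.

```python
def elo_update(r_winner: float, r_loser: float, k: float = 32.0):
    """One Elo update after a single human vote.

    Expected score uses the logistic curve on a 400-point scale;
    k controls how much one vote can move a rating.
    """
    expected_winner = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1.0 - expected_winner)
    return r_winner + delta, r_loser - delta

# An even matchup moves both ratings by k/2.
print(elo_update(1000, 1000))  # → (1016.0, 984.0)
```

Note the attribution problem the section describes lives exactly here: the update only sees who won, never why.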

Match the eval to the lever

This is the practical payoff. When you pull a specific lever, certain evals will tell you whether it helped:

| Lever | Evals that will catch the effect | Evals that will miss it |
| --- | --- | --- |
| Fine-tuning for a task | Your task-specific eval (held-out test set); preference eval if it's about tone/format; a held-out general benchmark to catch forgetting | Perplexity on generic web text (changes in ways you don't care about) |
| Quantization | Perplexity (catches if numerics broke); your task eval (catches if the bits you care about degraded) | Arena (signal too coarse to catch sub-percent quality drops) |
| Context extension | Long-context retrieval probes (Needle-in-a-Haystack); long-document QA; multi-turn fidelity | Short-prompt benchmarks (most of MMLU); short-form arena prompts |
| Pruning | Perplexity (catches gross degradation); full task eval suite (catches narrow capability loss) | Anything narrow — pruning often degrades skills nobody tested for |
| Decoding controls (temperature, top-p) | Variance metrics (multiple runs of the same prompt); creativity/diversity metrics; user studies | Perplexity (decoding doesn't change next-token probabilities, just how you sample from them) |
| System prompt change | Task-specific eval; preference eval focused on tone/format | Static benchmarks (they don't run system prompts) |
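A Needle-in-a-Haystack probe, the standard check for context extension, is simple to construct. A hedged sketch (the filler text, needle sentence, and question phrasing are illustrative, not the original benchmark's exact prompts):

```python
def make_needle_probe(filler: str, needle: str, depth: float, n_chars: int):
    """Build one needle-in-a-haystack test case.

    Repeats `filler` out to roughly n_chars of distractor text,
    inserts `needle` at relative depth in [0, 1], and returns
    (context, question). Sweeping depth and n_chars produces the
    familiar grid of retrieval accuracy by position and length.
    """
    haystack = (filler * (n_chars // len(filler) + 1))[:n_chars]
    pos = int(depth * len(haystack))
    context = haystack[:pos] + "\n" + needle + "\n" + haystack[pos:]
    question = "What is the magic number mentioned in the document?"
    return context, question

ctx, q = make_needle_probe(
    "Lorem ipsum dolor sit amet. ",
    "The magic number is 7481.",  # made-up fact to retrieve
    depth=0.5,
    n_chars=2000,
)
```

Scoring is then just: did the model's answer to `q`, given `ctx`, contain the needle's fact?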

How teams actually pick a model

Across many real conversations with people who ship LLM products, the pattern looks the same:

  1. Start with arena scores as a coarse filter. Top 10 by arena Elo is a reasonable shortlist.
  2. Check benchmark scores on the dimensions you care about — code (HumanEval, SWE-Bench), math (GSM8K, MATH), long-context (RULER), instruction-following (IFEval).
  3. Build a custom eval set on your task. Even 100 hand-picked examples beats any public benchmark for your specific use case.
  4. Vibe-check. Spend an hour talking to the top 2-3 candidates. Things you'll notice in conversation often don't show up in any number.
  5. Account for cost, latency, deployment constraints. The model that wins on capability per dollar is usually not the model with the highest absolute capability.
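Step 3, the custom eval set, needs almost no infrastructure to start. A minimal sketch, where `model`, the examples, and the scorer are all placeholders for your own (the toy data below is made up):

```python
def run_custom_eval(model, examples, scorer) -> float:
    """Run a model over a hand-built eval set; return its mean score.

    `model` maps a prompt to an output; `scorer` maps
    (output, reference) to a float in [0, 1] — exact match, a
    rubric, or a judge call. Both are assumptions to fill in.
    """
    scores = [scorer(model(ex["prompt"]), ex["reference"]) for ex in examples]
    return sum(scores) / len(scores)

examples = [
    {"prompt": "2+2?", "reference": "4"},
    {"prompt": "Capital of France?", "reference": "Paris"},
]
exact = lambda out, ref: float(out.strip() == ref)
print(run_custom_eval(lambda p: "4", examples, exact))  # → 0.5
```

Run the same 100 examples against each shortlisted candidate and you have a comparison no public leaderboard can give you for your task.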

"Vibes" gets disrespected as unscientific, but it's actually doing useful work — it's catching the residual after all your formal evals have run, telling you about the shape of capability that no individual eval is measuring.

The honest summary

Eval is hard. Every method has holes. The teams that ship best treat evals as a portfolio: a few automated benchmarks for fast iteration, a custom task-specific suite for honest measurement, periodic human review for catching things automation misses.

If you're making changes to an LLM (any of the levers), commit to evaluating those changes. If you ship without evals, you don't actually know whether your last change helped — and you'll find out, painfully, when something breaks downstream.