Diagnosis: when something breaks, look here first

Ten common symptoms β†’ which lens to suspect, how to confirm, what to try.

This is the artifact's payoff page. When something goes wrong with an LLM in your work, open this. Each entry follows the same template: symptom β†’ suspect lens(es) β†’ confirm by β†’ try (in order). The mental model from the rest of the artifact is what makes the diagnoses make sense; if any feel mysterious, that's a sign to revisit the relevant deep page.

Warm-up β€” practice the four-lens diagnosis

The ten failure modes

#1 β€” "It forgot something I told it five turns ago."

context scaffolding

Confirm by:
Inspect the actual prompt being sent to the model. Print it. Count tokens. Is the earlier turn even there?
Try (in order):
  1. If the turn isn't in the prompt, your scaffolding is dropping it. Fix the system that builds the conversation history.
  2. If the turn is in the prompt but the model ignores it, try a model with longer effective context, or summarize old turns instead of dumping raw.
  3. If you're past the model's context limit, use context extension (levers.html Β§3) β€” but expect quality degradation in the middle ("lost in the middle").
  4. For long-running conversations, reach for an external memory layer (a separate retrieval system over conversation history). This is a scaffolding fix, not a model fix.

Almost never weights β€” the model still could use the info if it were in front of it.

#2 β€” "Same question, different answers each time."

decoding

Confirm by:
Set temperature=0 and rerun. If the answer is now consistent, decoding was the culprit.
Try (in order):
  1. For deterministic tasks (Q&A, code, structured outputs): use temperature=0 always.
  2. For semi-deterministic tasks (helpful chat): temperature=0.7, top_p=0.9.
  3. For creative tasks: accept that variance is the feature.
  4. If temp=0 still produces variation across runs: check that you're not getting different model versions from your provider, or different hardware (rare floating-point determinism issues).

The most common diagnostic mistake is blaming the model for variance that's coming from the sampler.

#3 β€” "It's confidently wrong about a fact."

weights scaffolding

Confirm by:
Check known-good facts in the same domain. Is the model wrong about everything in this area, or just this fact? Search if the fact appears anywhere reliable on the public internet β€” if it doesn't, the model never had a chance.
Try (in order):
  1. Scaffolding fix first (cheaper, faster): RAG. Inject the correct fact into context as part of the prompt. Done in an afternoon; no model retraining.
  2. If you control the data and the fact comes up often, fine-tune (LoRA) on a dataset including the correct fact. Slow; risks forgetting.
  3. If the fact lives in a database you control, give the model a tool to look it up rather than memorizing it.

RAG vs fine-tune vs tools is a fundamental question β€” match the fix to how often the fact changes and how often it's queried.

#4 β€” "It hallucinated a citation that doesn't exist."

weights scaffolding

Confirm by:
Trivial β€” try to look up the citation. If it doesn't exist, you have a hallucination.
Try (in order):
  1. Scaffolding fix: use RAG with real source URLs that the model is told to cite. The model still hallucinates wording, but the URLs are real and grounded.
  2. System prompt: "Cite only sources from the provided context. If you don't know, say so."
  3. Post-process: parse all URLs/citations in the model's output and verify them programmatically. If any fail, regenerate or flag.
  4. Choose a model trained more aggressively against hallucination (newer Anthropic / OpenAI / Google models tend to be better here than 2-3 generations back).

Decoding tweaks (temperature) don't help β€” hallucinations live in the weights' tendency to confabulate plausible-looking outputs.

#5 β€” "It's worse since I quantized it."

weights β€” representation

Confirm by:
Run your task eval before and after quantization. Don't trust perplexity alone β€” it can stay flat while specific capabilities collapse.
Try (in order):
  1. Move up to a higher bit-depth: 4 β†’ 5 β†’ 6 β†’ 8.
  2. Switch quantization method: RTN β†’ GPTQ β†’ AWQ. Expect quality jumps especially at 4-bit.
  3. Use mixed precision: keep sensitive layers (embedding, output projection, first/last few transformer layers) in higher precision.
  4. If the task is narrow and capability sensitive, accept that quantization may not be viable for this combination β€” use the unquantized model on better hardware.

#6 β€” "Longer context, but worse retrieval from the middle of it."

context weights

Confirm by:
Run a Needle-in-a-Haystack-style probe: insert a unique fact at varying depths in a long context, ask about it. Plot accuracy by position. The "U-shaped" curve (good at start and end, bad in middle) is the signature.
Try (in order):
  1. Restructure your prompt: put the most important info in the first 25% or last 25% of the context. The middle is where models forget.
  2. Use a model trained for long context with explicit anti-lost-in-the-middle techniques (some long-context models from 2024+ help here).
  3. Don't use long context if you don't have to. RAG with shorter context often beats stuffing everything in.

#7 β€” "After LoRA fine-tune, great at my task but worse at everything else."

weights

Confirm by:
Run a held-out general benchmark (MMLU, MT-Bench) before and after fine-tune. If it dropped significantly, that's catastrophic forgetting.
Try (in order):
  1. Reduce LoRA rank (r) β€” smaller adapter = less overwriting of base behavior.
  2. Reduce learning rate. Fine-tunes that overfit also overforget.
  3. Mix in some general SFT data alongside your task data (10-20% mix). The model gets your task without losing breadth.
  4. Switch from SFT to DPO if you have preference data β€” DPO tends to nudge less violently than SFT for the same goal.
  5. Train fewer epochs. One epoch is often enough.

#8 β€” "My RAG retrieves the right docs but the model still ignores them."

scaffolding context

Confirm by:
Print the actual final prompt the model receives. Are the docs really in there? In what format? Is the system prompt clear about how to use them?
Try (in order):
  1. System prompt clarity: "Answer using only the information in the <context> tags below. If the answer isn't in the context, say 'I don't have that information.'"
  2. Wrap retrieved docs in clear tags (<context>...</context>) so the model can identify them.
  3. Reranking: maybe the top-retrieved docs aren't actually the most relevant. A reranking step (often a smaller cross-encoder model) often helps.
  4. Few-shot examples in the system prompt showing the desired "use the context" behavior.

#9 β€” "It used the wrong tool, or called the API with bad arguments."

scaffolding weights

Confirm by:
Test the same task with a known-good tool-use model (Claude 3.5+, GPT-4+, recent Llama). If they handle it, your model's tool-use training is the issue. If they don't either, your tool description is the issue.
Try (in order):
  1. Improve the tool description: clear name, clear parameter docstrings, examples of when it should and shouldn't be called.
  2. Add few-shot examples in the system prompt showing the tool being used correctly on representative inputs.
  3. Switch to a model with stronger tool-use capability β€” function calling support has improved dramatically in recent generations.
  4. For complex tool flows, use a structured framework (function calling APIs, JSON schema validation, retry on parse failure).

#10 β€” "Output is super repetitive ('the the the…')."

decoding weights

Confirm by:
Raise temperature slightly. Add a repetition penalty of 1.1. Does the repetition stop?
Try (in order):
  1. Add repetition_penalty=1.1 (or 1.05–1.15).
  2. Raise temperature from 0 to 0.3-0.7.
  3. Check that you're not in a degenerate sampling mode (top_k=1 with temp=0 will deterministically loop on certain prompts).
  4. If repetition persists across all settings: something is broken with the model β€” corrupted weights, wrong tokenizer, mismatched chat template. Reload from a clean source.

Ten more failure modes β€” MoE only

These ten can't happen in a dense model. They're all about routing, experts, or sparsity. Same symptom β†’ suspect lens β†’ confirm β†’ try format.

#11 β€” "Router sends everything to expert 3."

weights

Confirm by:
Inspect expert utilization on a batch of varied prompts. If one expert is receiving >50% of tokens, it's collapse.
Try (in order):
  1. Increase aux-loss weight (load-balance term).
  2. Add router noise during training.
  3. If post-training (can't retrain), accept that this model has bad routing; use it only for the distribution it was trained on.

#12 β€” "Some experts are never used."

weights

Confirm by:
Per-expert usage histogram. If some experts are <1% of traffic, they're dead.
Try (in order):
  1. Early in training: stronger aux loss, more router noise.
  2. Mid-training: consider reinitializing dead experts from the mean of live ones ("expert revival").
  3. Post-training: prune the dead experts (they don't help), accept the smaller effective model.

#13 β€” "Throughput cratered."

weights β€” representation

Confirm by:
Check drop rate and per-expert overflow during inference. Elevated drops β†’ capacity is limiting you.
Try (in order):
  1. Raise capacity factor.
  2. Reduce batch size so drops don't concentrate.
  3. If throughput is still bad, the routing itself is skewed β€” see #11, fix at training time.

#14 β€” "The model got big but didn't slow down β€” is it broken?"

(not actually a failure mode, but people ask)

Confirm by:
Check active-vs-total parameter ratio. If it's an MoE, that's by design.
Try:
Nothing β€” this is the point of MoE. Total params in VRAM; sparse compute per token. See inference.html.

#15 β€” "Quantized the experts and quality crashed."

weights β€” representation

Confirm by:
Run perplexity + your task eval before vs after quantization. MoE quality often drops more than dense at the same bit-depth.
Try (in order):
  1. Keep experts at higher precision than the backbone (mixed precision β€” 8-bit experts, 4-bit attention).
  2. Use AWQ or GPTQ-style quantization (activation-aware), not naive RTN.
  3. Quantize only the most-used experts; leave the specialists full-precision.

#16 β€” "LoRA fine-tuning on MoE gave weird outputs."

weights

Confirm by:
Inspect which experts got meaningful gradient during fine-tune. If only 2-3 of 8 experts were consistently picked on your fine-tune data, the others didn't adapt at all.
Try (in order):
  1. Diversify fine-tune data so routing is less skewed.
  2. Consider full fine-tuning instead of LoRA (expensive but predictable on MoE).
  3. Try router-only LoRA (change routing behavior without touching experts) β€” works for narrow tweaks.

#17 β€” "Upcycled from dense but training stalled."

weights

Confirm by:
Check whether the router is actually learning (does expert usage differ from uniform after a few thousand steps?).
Try:
Router warm-up β€” aux loss weight high (e.g., 0.1) for the first few thousand steps, then decay to the typical 0.01.

#18 β€” "Scales on one GPU but breaks on 8."

scaffolding

Confirm by:
Profile all-to-all communication volume between GPUs. Check if routing is skewing to experts on specific GPUs.
Try (in order):
  1. Balance expert placement across GPUs (don't put all "popular" experts on one GPU).
  2. Reduce expert-parallelism degree; use tensor-parallelism alongside.
  3. Pick a framework designed for MoE serving (vLLM with MoE support, TensorRT-LLM, DeepSpeed-Inference).

#19 β€” "Hindi (or any rare-in-training) prompts are worse than on an equivalent dense model."

weights

Confirm by:
Compare the same prompt on a same-quality dense model. If MoE is specifically worse on low-resource distributions, experts for that content are undertrained.
Try:
For production: pick a model with more shared-expert or fine-grained-expert coverage (DeepSeek-style) for language-agnostic use. For research: continued pretraining on the low-resource distribution specifically.

#20 β€” "After pruning unused experts, out-of-distribution prompts got worse."

weights β€” structure

Confirm by:
Compare OOD eval before vs after pruning.
Try:
Accept the trade-off. Pruned experts were holding rare-pattern capacity; you saved memory by giving that up. If OOD matters, un-prune or re-train.
Back to the default (dense) frame.

The diagnostic moves, summarized

After internalizing the ten cases, the pattern becomes:

  1. Always check decoding first. It's the cheapest possibility and the most often confused with model quality. Set temperature=0 and rerun. If the symptom changes, you've found it.
  2. Then check context. Print the actual prompt. Count tokens. Verify that the information you assume is there, is there.
  3. Then check scaffolding. What's your system doing between the user and the model? System prompt? Retrieval? Tool calls? Each is a potential confound.
  4. Then suspect weights. If decoding, context, and scaffolding all check out, you have a model-shape problem. Now you're choosing between fine-tune, swap to a different model, or accept the limitation.

This ordering β€” decoding β†’ context β†’ scaffolding β†’ weights β€” is roughly cheapest to most expensive to investigate and roughly most-common-cause to least-common-cause. Following it saves time.

If none of these match your symptom

Most LLM problems map to one of these ten or a close variant. If yours genuinely doesn't: