Diagnosis: when something breaks, look here first

Ten common symptoms β†’ which lens to suspect, how to confirm, what to try.

This is the artifact's payoff page. When something goes wrong with an LLM in your work, open this. Each entry follows the same template: symptom β†’ suspect lens(es) β†’ confirm by β†’ try (in order). The mental model from the rest of the artifact is what gives these diagnoses their logic; if any feel mysterious, that's a sign to revisit the relevant deep page.

The ten failure modes

#1 β€” "It forgot something I told it five turns ago."

Suspect lens(es): context, scaffolding

Confirm by:
Inspect the actual prompt being sent to the model. Print it. Count tokens. Is the earlier turn even there?
Try (in order):
  1. If the turn isn't in the prompt, your scaffolding is dropping it. Fix the system that builds the conversation history.
  2. If the turn is in the prompt but the model ignores it, try a model with longer effective context, or summarize old turns instead of dumping raw.
  3. If you're past the model's context limit, use context extension (levers.html Β§3) β€” but expect quality degradation in the middle ("lost in the middle").
  4. For long-running conversations, reach for an external memory layer (a separate retrieval system over conversation history). This is a scaffolding fix, not a model fix.

Almost never weights β€” the model could still use the info if it were in front of it.
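The confirm step above can be sketched in a few lines. Everything here is illustrative: `build_prompt` stands in for whatever your scaffolding actually does to assemble history, and the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer.

```python
def build_prompt(history: list[dict]) -> str:
    """Stand-in for whatever your scaffolding does to assemble context."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in history)

def check_turn_present(history: list[dict], needle: str) -> None:
    prompt = build_prompt(history)
    approx_tokens = len(prompt) // 4  # rough heuristic, not a real tokenizer
    print(f"~{approx_tokens} tokens in final prompt")
    if needle in prompt:
        print("Turn IS in the prompt -> suspect the model, not the scaffolding")
    else:
        print("Turn is MISSING -> fix the history-building code")

history = [
    {"role": "user", "content": "My order number is 55412."},
    {"role": "assistant", "content": "Got it."},
    {"role": "user", "content": "What's my order number?"},
]
check_turn_present(history, "55412")
```

If the needle is missing here, no amount of model-side tuning will help; the bug is upstream of the model.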

#2 β€” "Same question, different answers each time."

Suspect lens(es): decoding

Confirm by:
Set temperature=0 and rerun. If the answer is now consistent, decoding was the culprit.
Try (in order):
  1. For deterministic tasks (Q&A, code, structured outputs): use temperature=0 always.
  2. For semi-deterministic tasks (helpful chat): temperature=0.7, top_p=0.9.
  3. For creative tasks: accept that variance is the feature.
  4. If temp=0 still produces variation across runs: check that you're not getting different model versions from your provider, or different hardware (rare floating-point determinism issues).

The most common diagnostic mistake is blaming the model for variance that's coming from the sampler.
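The confirm step is mechanical enough to script. A minimal sketch, assuming a hypothetical `call_model` wrapper around your provider's API (stubbed here with a deterministic function so the sketch runs):

```python
def call_model(prompt: str, temperature: float = 0.0) -> str:
    """Stand-in for your provider's API; replace with a real call."""
    return f"echo: {prompt}"

def is_decoding_variance(prompt: str, runs: int = 5) -> bool:
    """True if temperature=0 makes outputs identical across runs,
    i.e. the variance you saw before was coming from the sampler."""
    outputs = {call_model(prompt, temperature=0.0) for _ in range(runs)}
    return len(outputs) == 1

print(is_decoding_variance("What is the capital of France?"))
```

If this returns False with a real model, move on to checking model versions and hardware as in step 4.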

#3 β€” "It's confidently wrong about a fact."

Suspect lens(es): weights, scaffolding

Confirm by:
Check known-good facts in the same domain. Is the model wrong about everything in this area, or just this fact? Then check whether the fact appears anywhere reliable on the public internet β€” if it doesn't, the model never had a chance.
Try (in order):
  1. Scaffolding fix first (cheaper, faster): RAG. Inject the correct fact into context as part of the prompt. Done in an afternoon; no model retraining.
  2. If you control the data and the fact comes up often, fine-tune (LoRA) on a dataset including the correct fact. Slow; risks forgetting.
  3. If the fact lives in a database you control, give the model a tool to look it up rather than memorizing it.

RAG vs fine-tune vs tools is a fundamental question β€” match the fix to how often the fact changes and how often it's queried.
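Step 1 is mostly string assembly. A minimal sketch of fact injection, where the `FACTS` store and the prompt shape are illustrative assumptions rather than any particular RAG framework:

```python
FACTS = {
    "support_email": "The current support address is help@example.com.",
}

def build_grounded_prompt(question: str, fact_keys: list[str]) -> str:
    """Inject known-correct facts into the prompt instead of retraining."""
    facts = "\n".join(FACTS[k] for k in fact_keys)
    return (
        "Answer using the facts below. If they don't cover the question, say so.\n"
        f"<facts>\n{facts}\n</facts>\n\n"
        f"Question: {question}"
    )

print(build_grounded_prompt("What is the support email?", ["support_email"]))
```

In a real system the `fact_keys` lookup is replaced by retrieval; the point is that the correct fact reaches the model as context, not as weights.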

#4 β€” "It hallucinated a citation that doesn't exist."

Suspect lens(es): weights, scaffolding

Confirm by:
Trivial β€” try to look up the citation. If it doesn't exist, you have a hallucination.
Try (in order):
  1. Scaffolding fix: use RAG with real source URLs that the model is told to cite. The model still hallucinates wording, but the URLs are real and grounded.
  2. System prompt: "Cite only sources from the provided context. If you don't know, say so."
  3. Post-process: parse all URLs/citations in the model's output and verify them programmatically. If any fail, regenerate or flag.
  4. Choose a model trained more aggressively against hallucination (newer Anthropic / OpenAI / Google models tend to be better here than 2-3 generations back).

Decoding tweaks (temperature) don't help β€” hallucinations live in the weights' tendency to confabulate plausible-looking outputs.
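The post-processing check in step 3 can be as simple as comparing emitted URLs against the sources you actually supplied. A sketch using only the standard library; a production version might also issue HTTP requests to confirm the pages resolve:

```python
import re

URL_RE = re.compile(r"https?://\S+")

def unverified_citations(output: str, known_sources: set[str]) -> list[str]:
    """Return every cited URL that is not in the set of provided sources."""
    cited = URL_RE.findall(output)
    return [u.rstrip(".,)") for u in cited if u.rstrip(".,)") not in known_sources]

sources = {"https://example.com/paper-a"}
answer = "See https://example.com/paper-a and https://example.com/made-up."
bad = unverified_citations(answer, sources)
print(bad)  # non-empty -> regenerate or flag
```

Wired in after generation, this turns "hope the citations are real" into a hard gate.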

#5 β€” "It's worse since I quantized it."

Suspect lens(es): weights (representation)

Confirm by:
Run your task eval before and after quantization. Don't trust perplexity alone β€” it can stay flat while specific capabilities collapse.
Try (in order):
  1. Move up to a higher bit-depth: 4 β†’ 5 β†’ 6 β†’ 8.
  2. Switch quantization method: RTN β†’ GPTQ β†’ AWQ. The quality gains from better methods are largest at 4-bit.
  3. Use mixed precision: keep sensitive layers (embedding, output projection, first/last few transformer layers) in higher precision.
  4. If the task is narrow and capability-sensitive, accept that quantization may not be viable for this combination β€” use the unquantized model on better hardware.
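The before/after comparison in the confirm step looks like this in outline. `run_task_eval` stands in for your own task-specific eval harness, and the scores below are fabricated purely to illustrate the failure pattern:

```python
def run_task_eval(model_name: str) -> float:
    """Stand-in for a real task eval; numbers are illustrative, not measured."""
    scores = {"model-fp16": 0.91, "model-q4": 0.62}
    return scores[model_name]

baseline = run_task_eval("model-fp16")
quantized = run_task_eval("model-q4")
drop = baseline - quantized
print(f"task accuracy: {baseline:.2f} -> {quantized:.2f} (drop {drop:.2f})")
if drop > 0.05:  # pick a threshold that matters for your task
    print("quantization harmed this capability: try higher bit-depth or AWQ/GPTQ")
```

The key design choice is measuring the task you care about, not perplexity, on both sides of the quantization step.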

#6 β€” "Longer context, but worse retrieval from the middle of it."

Suspect lens(es): context, weights

Confirm by:
Run a Needle-in-a-Haystack-style probe: insert a unique fact at varying depths in a long context, ask about it. Plot accuracy by position. The "U-shaped" curve (good at start and end, bad in middle) is the signature.
Try (in order):
  1. Restructure your prompt: put the most important info in the first 25% or last 25% of the context. The middle is where models forget.
  2. Use a model trained for long context with explicit anti-lost-in-the-middle techniques (some long-context models from 2024+ help here).
  3. Don't use long context if you don't have to. RAG with shorter context often beats stuffing everything in.
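The needle probe from the confirm step is easy to generate. A sketch that builds the probe prompts; sending them to a model and scoring the answers is left to your eval loop, and the filler sentence is arbitrary:

```python
def make_probe(depth_fraction: float, total_sentences: int = 200) -> str:
    """Insert a unique fact at a given depth in filler text."""
    needle = "The secret code is 7319."
    filler = ["The sky was a pleasant shade of blue that day."] * total_sentences
    position = int(depth_fraction * total_sentences)
    sentences = filler[:position] + [needle] + filler[position:]
    return " ".join(sentences) + "\n\nWhat is the secret code?"

# One probe per depth; score whether "7319" appears in each answer and
# plot accuracy by depth. A dip around 0.5 is the lost-in-the-middle signature.
probes = {d: make_probe(d) for d in (0.0, 0.25, 0.5, 0.75, 1.0)}
```

Run the same probe set across candidate models to compare their U-shaped curves directly.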

#7 β€” "After LoRA fine-tune, great at my task but worse at everything else."

Suspect lens(es): weights

Confirm by:
Run a held-out general benchmark (MMLU, MT-Bench) before and after fine-tune. If it dropped significantly, that's catastrophic forgetting.
Try (in order):
  1. Reduce LoRA rank (r) β€” smaller adapter = less overwriting of base behavior.
  2. Reduce learning rate. Fine-tunes that overfit also overforget.
  3. Mix in some general SFT data alongside your task data (10-20% mix). The model gets your task without losing breadth.
  4. Switch from SFT to DPO if you have preference data β€” DPO tends to nudge less violently than SFT for the same goal.
  5. Train fewer epochs. One epoch is often enough.
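To build intuition for step 1: a LoRA adapter's size, and hence how much base behavior it can overwrite, scales linearly with rank. A back-of-envelope sketch with illustrative dimensions (one 4096-wide weight matrix):

```python
def lora_params(r: int, d_in: int = 4096, d_out: int = 4096) -> int:
    """Trainable params for one adapted matrix: A is (d_in x r), B is (r x d_out)."""
    return r * (d_in + d_out)

for r in (64, 16, 8):
    print(f"r={r}: {lora_params(r):,} trainable params per adapted matrix")
```

Halving `r` halves the adapter, which is why it is the first knob to turn when the fine-tune is stomping on base capabilities.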

#8 β€” "My RAG retrieves the right docs but the model still ignores them."

Suspect lens(es): scaffolding, context

Confirm by:
Print the actual final prompt the model receives. Are the docs really in there? In what format? Is the system prompt clear about how to use them?
Try (in order):
  1. System prompt clarity: "Answer using only the information in the <context> tags below. If the answer isn't in the context, say 'I don't have that information.'"
  2. Wrap retrieved docs in clear tags (<context>...</context>) so the model can identify them.
  3. Reranking: the top-retrieved docs may not actually be the most relevant. A reranking step (often a smaller cross-encoder model) frequently helps.
  4. Few-shot examples in the system prompt showing the desired "use the context" behavior.
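Step 3 in outline: re-score retrieved docs before they reach the prompt. A real reranker is usually a small cross-encoder model; the term-overlap scorer below is a trivial stand-in so the sketch stays self-contained:

```python
def overlap_score(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query terms present in the doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    return sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)[:top_k]

docs = [
    "Shipping policy: orders ship within two business days.",
    "Refund policy: refunds are issued within 14 days of return.",
    "Company history and founding story.",
]
print(rerank("when will my refund be issued", docs))
```

With a real cross-encoder, only `overlap_score` changes; the rerank-then-truncate shape stays the same.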

#9 β€” "It used the wrong tool, or called the API with bad arguments."

Suspect lens(es): scaffolding, weights

Confirm by:
Test the same task with a known-good tool-use model (Claude 3.5+, GPT-4+, recent Llama). If they handle it, your model's tool-use training is the issue. If they don't either, your tool description is the issue.
Try (in order):
  1. Improve the tool description: clear name, clear parameter docstrings, examples of when it should and shouldn't be called.
  2. Add few-shot examples in the system prompt showing the tool being used correctly on representative inputs.
  3. Switch to a model with stronger tool-use capability β€” function calling support has improved dramatically in recent generations.
  4. For complex tool flows, use a structured framework (function calling APIs, JSON schema validation, retry on parse failure).

#10 β€” "Output is super repetitive ('the the the…')."

Suspect lens(es): decoding, weights

Confirm by:
Raise temperature slightly. Add a repetition penalty of 1.1. Does the repetition stop?
Try (in order):
  1. Add repetition_penalty=1.1 (or 1.05–1.15).
  2. Raise temperature from 0 to 0.3-0.7.
  3. Check that you're not in a degenerate sampling mode (top_k=1 with temp=0 will deterministically loop on certain prompts).
  4. If repetition persists across all settings: something is broken with the model β€” corrupted weights, wrong tokenizer, mismatched chat template. Reload from a clean source.
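A cheap programmatic check for this symptom: flag outputs where a single n-gram dominates. The n-gram size and threshold below are illustrative; tune them on your own outputs:

```python
def is_degenerate(text: str, n: int = 3, max_ratio: float = 0.3) -> bool:
    """True if any single n-gram accounts for more than max_ratio of all n-grams."""
    tokens = text.split()
    if len(tokens) < n * 2:
        return False
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    most_common = max(ngrams.count(g) for g in set(ngrams))
    return most_common / len(ngrams) > max_ratio

print(is_degenerate("the the the the the the the the the the"))  # True
print(is_degenerate("a normal sentence with varied words throughout it"))  # False
```

Useful as a guard in production: detect the loop, then retry with the decoding fixes from steps 1-2.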

The diagnostic moves, summarized

After internalizing the ten cases, the pattern becomes:

  1. Always check decoding first. It's the cheapest possibility and the most often confused with model quality. Set temperature=0 and rerun. If the symptom changes, you've found it.
  2. Then check context. Print the actual prompt. Count tokens. Verify that the information you assume is there, is there.
  3. Then check scaffolding. What's your system doing between the user and the model? System prompt? Retrieval? Tool calls? Each is a potential confound.
  4. Then suspect weights. If decoding, context, and scaffolding all check out, you have a model-shape problem. Now you're choosing between fine-tune, swap to a different model, or accept the limitation.

This ordering β€” decoding β†’ context β†’ scaffolding β†’ weights β€” is roughly cheapest to most expensive to investigate and roughly most-common-cause to least-common-cause. Following it saves time.

If none of these match your symptom

Most LLM problems map to one of these ten or a close variant. If yours genuinely doesn't: