Knobs you can turn after the model is trained

Five levers: what each one actually changes inside the model, and what it costs.

Every lever on this page follows the same structure: what concretely changes, when, where it lives, and which tradeoff surface it moves. If you can fill in those four boxes for any LLM intervention, you can predict its consequences.

1. Fine-tuning

What changes where?

Object:
Weight tensors – either all of them (full fine-tune) or a small low-rank delta attached to selected matrices (LoRA).
When:
Fine-tune time (a separate training run, much shorter than pretraining).
Where it lives:
On disk as updated weights (full) or as a separate small adapter file (LoRA). At inference: in VRAM with the rest of the model.
Tradeoffs:
Quality on your task ↑, capability elsewhere can ↓ (forgetting), training cost ↑, deployment complexity ↑ if managing multiple adapters.

weights

Full fine-tune vs LoRA vs QLoRA

Full fine-tune updates every weight. Maximum flexibility – the model can change as much as needed. Maximum cost – for a 70B model, you need enough VRAM to hold the model, the gradients, the optimizer state, and the activations. In practice that's often hundreds of GB across multiple GPUs. Risk: if your fine-tune dataset is small or narrow, you can wreck capabilities the model had before (catastrophic forgetting).

LoRA (Low-Rank Adaptation) is the workhorse of practical fine-tuning. Instead of updating the original weight matrix W (size d × d), you freeze W and learn two small matrices A (d × r) and B (r × d) where r is small (typically 4–64). The effective weight at inference time is W + A·B. Because r is tiny, the trainable parameter count drops by 100×–1000×.
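The parameter arithmetic is easy to verify directly. A minimal NumPy sketch with hypothetical sizes (real adapters also scale the delta by alpha / r, omitted here):

```python
import numpy as np

d, r = 4096, 16                       # hypothetical hidden size and LoRA rank

full_params = d * d                   # trainable params if we updated W itself
lora_params = d * r + r * d           # A (d x r) plus B (r x d)
assert full_params // lora_params == 128   # ~128x fewer trainable parameters

# Effective weight at inference: W + A @ B. B starts at zero,
# so the adapter is a no-op until training moves it.
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)).astype(np.float32)
A = rng.standard_normal((d, r)).astype(np.float32) * 0.01
B = np.zeros((r, d), dtype=np.float32)
W_eff = W + A @ B
assert np.allclose(W_eff, W)
```

The savings ratio works out to d / 2r, which is why small ranks give such dramatic reductions on large matrices.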

Full fine-tune is repainting the entire house. LoRA is hanging removable overlays in specific rooms – you can take them down, swap them, layer them. The walls (base weights) never change.

QLoRA combines LoRA with 4-bit quantization of the base model. The base weights are frozen and stored in 4 bits; only the LoRA adapter trains in normal precision. This makes it possible to fine-tune a 30B-class model on a single 24 GB consumer GPU, and a 65B-class model on a single 48 GB card. The trade is some numerical noise during training; in practice quality remains close to LoRA.
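Back-of-envelope memory math shows why the bit width dominates feasibility. A sketch counting base weights only (activations, KV cache, and per-block quantization overhead are ignored):

```python
def base_weight_gib(n_params_billion: float, bits: int) -> float:
    """Memory for the base weights alone, in GiB."""
    return n_params_billion * 1e9 * bits / 8 / 2**30

fp16_70b = base_weight_gib(70, 16)   # ~130 GiB: multi-GPU territory
nf4_70b  = base_weight_gib(70, 4)    # ~33 GiB: one big GPU, still too big for a 24 GB card
nf4_33b  = base_weight_gib(33, 4)    # ~15 GiB: fits a 24 GB consumer GPU with room for LoRA
```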

A real LoRA config, line by line

from peft import LoraConfig

config = LoraConfig(
    r              = 16,                              # rank – bigger r = more capacity, more params
    lora_alpha     = 32,                              # scaling factor (effective scale = alpha / r)
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"],  # which weights to adapt
    lora_dropout   = 0.05,                            # regularization during training
    bias           = "none",                          # don't train bias terms
    task_type      = "CAUSAL_LM",                    # decoder-only LM
)

LoRA hyperparameters look magical until they bite you. Common failure mode: r too small for the task – the adapter doesn't have enough capacity, training loss plateaus high, eval barely moves. Common fix: bump r, increase target_modules to include MLP layers. Opposite failure: r too high or training too long – overfits the fine-tune set, forgets general capability. Always evaluate on a held-out general benchmark, not just your task.

2. Quantization

What changes where?

Object:
The numerical representation of weight tensors. Same number of weights; fewer bits per weight.
When:
Load time (or as a separate one-time conversion step). The model is quantized once and stored.
Where it lives:
On disk as smaller files; in VRAM as smaller tensors.
Tradeoffs:
Memory ↓ (4× from FP16→INT4), often speed ↑, accuracy ↓ slightly (a lot if done badly).

weights – representation

The basic idea

A weight stored in FP16 is a 16-bit floating-point number β€” about 65,000 distinct possible values. Most weights in a trained LLM cluster near zero with a long tail of larger values. You don't need 65,000 values to represent that distribution well. Quantization picks a smaller set of representative values and rounds every weight to its nearest one.
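A toy round-to-nearest quantizer makes the bucket idea concrete. One scale per tensor; real methods like GPTQ use calibration data and finer-grained scales, so treat this purely as illustration:

```python
import numpy as np

def quantize_symmetric(w: np.ndarray, bits: int = 4):
    """Map floats to signed integers via one shared scale."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit signed
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = (rng.standard_normal(10_000) * 0.02).astype(np.float32)   # clustered near zero
q, scale = quantize_symmetric(w)
roundtrip_err = np.abs(q.astype(np.float32) * scale - w).mean()
assert roundtrip_err <= scale / 2              # worst case per weight is half a bucket
```

One outlier weight inflates `scale` and wastes buckets on empty range – which is why per-channel/per-block scales and outlier handling matter so much in practice.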

Quantization is replacing a high-resolution photo with a posterized version. Fewer color buckets, but the scene is still recognizable. Lower bit-depth = fewer buckets = more obvious posterization.

The quantization zoo

Several methods exist; they differ in how they pick the buckets and which tensors they touch. The names you'll run into most often:

- GPTQ: one-shot post-training quantization that uses a small calibration set and approximate second-order information to round weights layer by layer.
- AWQ (activation-aware weight quantization): identifies the weight channels that see the largest activations and protects them by rescaling before quantizing.
- GGUF k-quants (llama.cpp): a family of mixed-precision block formats aimed at CPU and consumer-GPU inference.
- bitsandbytes NF4: a 4-bit "NormalFloat" data type applied at load time; this is the format QLoRA trains on top of.

Where quantization fails

Not all layers tolerate low-bit representation equally. Some commonly reported patterns:

- The embedding and output (lm_head) layers are the most sensitive and are usually kept at higher precision.
- A small fraction of "outlier" channels carries disproportionate signal; quantizing them naively causes most of the damage, which is exactly what activation-aware methods guard against.
- Smaller models degrade more at a given bit width than larger ones, and aggressive 2–3-bit formats hurt far more than 4-bit.

The "4-bit is fine" mantra has caveats. For tasks the model was already weak at (uncommon languages, niche domains, edge-case reasoning), quantization can push the model from "barely working" to "broken." Always run a task-specific eval before and after β€” perplexity alone won't catch narrow capability collapse.

3. Context extension

What changes where?

Object:
Positional encoding formula (RoPE scaling) or attention structure (sliding window). Sometimes also the model's training recipe (continued pretraining on long-context data).
When:
Load time (RoPE scaling factor changes), or as additional fine-tuning on longer sequences.
Where it lives:
In the model config (RoPE base/scale parameters); in attention implementation (sliding window mask). KV cache memory grows linearly with the new max length.
Tradeoffs:
Max input length ↑, attention compute ↑↑ (quadratic), KV memory ↑, quality at long lengths often ↓ (especially in the middle of the context).

context

Why context isn't free

Two costs grow with context length:

- Attention compute: each new token attends to every previous token, so total attention work grows quadratically with sequence length.
- KV cache memory: one key and one value vector per token, per layer, per KV head; this grows linearly with length and at very long contexts can rival the weights themselves.
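The linear KV-cache growth is easy to estimate. A sketch with hypothetical 7B-class dimensions (32 layers, 8 KV heads of dim 128, FP16 cache):

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_value: int = 2) -> float:
    """Per-sequence cache: 2 tensors (K and V) x layers x heads x dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value / 2**30

at_4k   = kv_cache_gib(32, 8, 128, 4_096)     # 0.5 GiB per sequence
at_128k = kv_cache_gib(32, 8, 128, 131_072)   # 16 GiB per sequence: linear in length
```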

RoPE scaling – making old position formulas cover new lengths

A model trained with RoPE on 4k context has rotation angles tuned for positions 0–4095. If you naively feed it 16k tokens, the angles for positions 4096–16383 are out of distribution and quality collapses.

The fix: rescale the rotation frequencies. Several flavors:

- Linear position interpolation: divide every position index by the extension factor, squeezing the new range into the trained one. Simple, but slightly blurs fine-grained local distances.
- NTK-aware scaling: change the RoPE base instead, stretching the low-frequency (long-range) components while leaving the high-frequency (local) components mostly intact.
- YaRN: interpolates different frequency bands by different amounts and adds a small attention-temperature correction; often the best-performing of the three at large extension factors.

Often paired with a short fine-tuning run on long-context data to "heal" the model into the new range.
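Linear interpolation is the easiest flavor to see in code. A sketch of the frequency and angle computation with made-up dimensions:

```python
import numpy as np

d, base = 128, 10_000.0                      # head dim and RoPE base (illustrative)
freqs = base ** (-np.arange(0, d, 2) / d)    # one frequency per rotation pair

trained_len, target_len = 4_096, 16_384
scale = target_len / trained_len             # 4x extension

pos = np.arange(target_len)
angles_naive  = np.outer(pos, freqs)          # positions >= 4096: out of distribution
angles_interp = np.outer(pos / scale, freqs)  # squeeze 0..16383 into the trained range
assert (pos / scale).max() < trained_len      # every effective position was seen in training
```

NTK-aware scaling would instead raise `base`; YaRN scales different entries of `freqs` by different amounts.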

Sliding window attention

An alternative: instead of every token attending to every previous token, each token only attends to the last k tokens (typical k = 4096). Memory and compute become O(n × k) instead of O(n²). Mistral models use this. Trade: tokens far apart can't directly attend to each other, but the residual stream can carry information across the boundary indirectly through stacked layers.
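The mask is the whole mechanism; a NumPy sketch with tiny illustrative sizes:

```python
import numpy as np

def sliding_window_mask(n: int, k: int) -> np.ndarray:
    """True where attention is allowed: causal, limited to the previous k tokens."""
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    return (j <= i) & (j > i - k)

mask = sliding_window_mask(n=8, k=3)
assert mask.sum(axis=1).max() == 3   # each row touches at most k keys: O(n*k) work
assert not mask[7, 0]                # token 7 cannot directly see token 0
```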

The dirty truth about long context: most models are worse at retrieving information from the middle of their context window than from the beginning or end. This is the "lost in the middle" effect, observed across many models. Putting the relevant info in the first or last 25% of the prompt usually works better than relying on the middle. Test this with Needle-in-a-Haystack-style probes before trusting long context.

4. Pruning

What changes where?

Object:
Weight tensors – either individual values zeroed (unstructured) or whole heads/layers/dimensions removed (structured).
When:
Post-training, often followed by a "healing" fine-tune.
Where it lives:
On disk as smaller files; in VRAM as smaller tensors (structured) or sparse tensors (unstructured, requires hardware support to actually save compute).
Tradeoffs:
Size ↓, speed ↑ (only with structured pruning on supported hardware), capacity ↓, often a quality drop that needs a recovery fine-tune.

weights – structure

Unstructured vs structured

Unstructured pruning identifies the smallest individual weights (close to zero) and zeroes them. Easy to do, can prune 50%+ of weights with little quality loss. Catch: most hardware can't actually skip work for sparse matrices unless the sparsity is highly structured (e.g., 2:4 sparsity). So you save disk space and maybe memory, but rarely compute.
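Magnitude pruning fits in a few lines; a minimal sketch:

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the smallest-magnitude fraction of weights (unstructured)."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0).astype(w.dtype)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
pruned = magnitude_prune(w, sparsity=0.5)
assert (pruned == 0).mean() >= 0.49   # half the values are gone...
assert pruned.shape == w.shape        # ...but the tensor is the same shape and size
```

The unchanged shape is the catch: without hardware sparsity support, the zeros still get multiplied.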

Structured pruning removes whole units – entire attention heads, MLP dimensions, even entire layers. Quality drops more sharply but the resulting model is genuinely smaller and faster on standard hardware. Methods like LLM-Pruner score components by their "importance" and remove the lowest-scoring ones.

Unstructured pruning is removing individual leaves from a tree. The tree looks slightly thinner; same shape, same shadow. Structured pruning is cutting whole branches. The tree is genuinely smaller – but it might look lopsided until it grows back.

Healing fine-tune

After any nontrivial pruning, the model needs recovery training: same data as SFT (or even some pretraining data), short run, learning rate small. This re-routes information through the surviving weights. Skipping it usually leaves visible quality damage.

Pruning vs quantization – when to reach for which

For most "make this model smaller / cheaper" goals, quantization is the first tool to reach for – easier, less destructive, more predictable. Pruning becomes useful when:

- you've already quantized and still need the model smaller or faster;
- your hardware can actually exploit the result (structured pruning, or 2:4 sparsity support);
- you can afford a healing fine-tune afterwards to recover the quality drop.

5. Decoding controls

What changes where?

Object:
The sampling step – how the model's probability output becomes a single chosen token.
When:
Inference time, per request. Configurable per call.
Where it lives:
In the inference engine's sampling loop, after the model's forward pass. The model itself is unchanged.
Tradeoffs:
Determinism vs creativity vs coherence vs latency.

decoding

The core controls

temperature divides the logits before the softmax. Below 1, the distribution sharpens toward the most probable tokens; at 0, sampling collapses to greedy argmax. Above 1, it flattens and rarer tokens surface.

top_p (nucleus sampling) keeps the smallest set of tokens whose cumulative probability reaches p and samples only within it, cutting off the long tail of unlikely tokens.

top_k is the blunter cousin: keep the k most probable tokens, regardless of how probability is spread among them.
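A minimal sampler showing how temperature and top-p interact (pure NumPy, made-up logits; real engines add top_k, repetition penalties, and more):

```python
import numpy as np

def sample(logits, temperature=1.0, top_p=1.0, rng=None):
    """Temperature scaling followed by nucleus (top-p) truncation."""
    rng = rng or np.random.default_rng()
    if temperature == 0:                          # greedy: fully deterministic
        return int(np.argmax(logits))
    z = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]               # most probable first
    cut = int(np.searchsorted(np.cumsum(probs[order]), top_p)) + 1
    keep = order[:cut]                            # smallest set covering top_p mass
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

logits = [2.0, 1.0, 0.1, -3.0]
assert sample(logits, temperature=0) == 0         # temperature=0 is always the argmax
```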

Common combinations

Goal | Settings
Deterministic answers (Q&A, code) | temperature=0, or temp=0.1 with top_p=0.95
Helpful chat (default) | temperature=0.7, top_p=0.9
Creative writing | temperature=1.0, top_p=0.95
Brainstorming / exploration | temperature=1.2, top_p=0.98

The most common diagnostic mistake in production: blaming the model for variance that's coming from temperature. "The model gives different answers each time" is almost always temperature, not the model. "The model contradicts itself" might be the same. If you want determinism, set temperature=0 first, then debug.

KV cache tuning

Already covered in inference.html, but worth noting here as a "knob": for very long contexts you can configure cache behavior β€” eviction policies (drop oldest, drop least-attended), cache offloading (move parts to CPU), or attention sinks (always keep the first few tokens). All of these trade a small quality cost for memory savings.

Speculative decoding

Also covered in inference.html. The setting you'd see in an inference engine is something like --speculative-model llama-1b alongside the main model. Output is identical to non-speculative decoding; only latency improves. If your serving infra supports it, almost always worth turning on.

Putting it together – diagnosis preview

Once you have all five levers in your head, the diagnostic move is: given a problem, which lever is most likely to fix it?

Symptom | First lever to try
Model is great in general but bad at my niche task | Fine-tuning (LoRA)
Model fits in memory but is too slow / too expensive | Quantization → speculative decoding
Model doesn't fit in my GPU at all | Quantization (4-bit) first; then maybe pruning
Model can't see enough of my prompt | Context extension (RoPE scaling) or use a long-context model
Model is too random / not random enough | Decoding controls: temperature, top-p
Model invents facts | Not a lever on this page – go to diagnosis.html; usually a scaffolding (RAG) fix

For the deeper "what broke and why," see the diagnosis page.