Knobs you can turn after the model is trained
Five levers: what each one actually changes inside the model, and what it costs.
Every lever on this page follows the same structure: what concretely changes, when, where it lives, and which tradeoff surface it moves. If you can fill in those four boxes for any LLM intervention, you can predict its consequences.
1. Fine-tuning
What changes where?
- Object:
- Weight tensors: either all of them (full fine-tune) or a small low-rank delta attached to selected matrices (LoRA).
- When:
- Fine-tune time (a separate training run, much shorter than pretraining).
- Where it lives:
- On disk as updated weights (full) or as a separate small adapter file (LoRA). At inference: in VRAM with the rest of the model.
- Tradeoffs:
- Quality on your task ↑, capability elsewhere can ↓ (forgetting), training cost ↑, deployment complexity ↑ if managing multiple adapters.
weights
Full fine-tune vs LoRA vs QLoRA
Full fine-tune updates every weight. Maximum flexibility: the model can change as much as needed. Maximum cost: for a 70B model, you need enough VRAM to hold the model, the gradients, the optimizer state, and the activations. In practice that's often hundreds of GB across multiple GPUs. Risk: if your fine-tune dataset is small or narrow, you can wreck capabilities the model had before (catastrophic forgetting).
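To make "hundreds of GB" concrete, here is rough, illustrative arithmetic for Adam-style mixed-precision training; the exact total depends on the optimizer (8-bit optimizers shrink it a lot), sharding, and activation checkpointing.

# Rough memory arithmetic for fully fine-tuning a 70B model with Adam in mixed
# precision. Illustrative only - real setups vary (8-bit optimizers, ZeRO, etc.).
params = 70e9
weights_bf16 = params * 2        # model weights in bf16
grads_bf16 = params * 2          # gradients in bf16
adam_fp32 = params * 4 * 2       # Adam first + second moments in fp32
master_fp32 = params * 4         # fp32 master copy of the weights
total_gb = (weights_bf16 + grads_bf16 + adam_fp32 + master_fp32) / 1e9
print(f"{total_gb:.0f} GB before activations")   # ~1120 GB - far beyond one GPU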
LoRA (Low-Rank Adaptation) is the workhorse of practical fine-tuning. Instead of updating the original weight matrix W (size d × d), you freeze W and learn two small matrices A (d × r) and B (r × d) where r is small (typically 4-64). The effective weight at inference time is W + A·B. Because r is tiny, the trainable parameter count drops by 100×-1000×.
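A minimal numpy sketch of that decomposition, using the (alpha / r) scaling explained further down; illustrative only, not the peft implementation.

import numpy as np

d, r, alpha = 4096, 16, 32
W = np.random.randn(d, d) * 0.02     # frozen base weight (d x d)
A = np.random.randn(d, r) * 0.01     # trainable (d x r)
B = np.zeros((r, d))                 # trainable (r x d), zero-init so the delta starts at 0

x = np.random.randn(d)
# effective weight is W + (alpha / r) * A @ B, but the full delta is never materialized:
y = x @ W + (alpha / r) * (x @ A) @ B
print("trainable:", A.size + B.size, "vs full:", W.size)   # ~128x fewer parameters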
Full fine-tune is repainting the entire house. LoRA is hanging removable overlays in specific rooms: you can take them down, swap them, layer them. The walls (base weights) never change.
QLoRA combines LoRA with 4-bit quantization of the base model. The base weights are frozen and stored in 4 bits; only the LoRA adapter trains in normal precision. This makes it possible to fine-tune a ~30B-class model on a single 24 GB consumer GPU, and a 65-70B model on a single 48 GB card. The trade is some numerical noise during training; in practice quality remains close to LoRA.
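In the Hugging Face ecosystem this is typically wired up with transformers + bitsandbytes + peft along the lines below; the model id is a placeholder and exact argument names can shift between versions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the frozen base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",            # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model.print_trainable_parameters()          # only the adapter trains, in higher precision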
A real LoRA config, line by line
from peft import LoraConfig
config = LoraConfig(
    r=16,                       # rank: bigger r = more capacity, more trainable params
    lora_alpha=32,              # scaling factor (effective scale = alpha / r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # which weights to adapt
    lora_dropout=0.05,          # regularization during training
    bias="none",                # don't train bias terms
    task_type="CAUSAL_LM",      # decoder-only LM
)
- r=16: rank of the adapter. Powers of 2 are conventional. r=4 is tiny (good for very narrow tasks); r=64 starts to compete with full fine-tune in capacity.
- lora_alpha=32: a scaling knob. The effective adapter contribution is (alpha / r) × A·B. Common pattern: alpha = 2×r.
- target_modules: which weight matrices in each layer get the adapter. Attention projections (q/k/v/o) are the standard set; adding MLP weights (gate_proj, down_proj, up_proj) increases capacity but also parameter count.
- lora_dropout=0.05: regularization. 0.05-0.1 is typical.
- bias="none": don't train bias parameters. They're tiny anyway and excluded by convention.
LoRA hyperparameters look magical until they bite you. Common failure mode: r too small for the task, so the adapter doesn't have enough capacity, training loss plateaus high, and eval barely moves. Common fix: bump r, increase target_modules to include MLP layers. Opposite failure: r too high or training too long, which overfits the fine-tune set and forgets general capability. Always evaluate on a held-out general benchmark, not just your task.
2. Quantization
What changes where?
- Object:
- The numerical representation of weight tensors. Same number of weights; fewer bits per weight.
- When:
- Load time (or as a separate one-time conversion step). The model is quantized once and stored.
- Where it lives:
- On disk as smaller files; in VRAM as smaller tensors.
- Tradeoffs:
- Memory ↓ (4× from FP16→INT4), often speed ↑, accuracy ↓ slightly (a lot if done badly).
weights → representation
The basic idea
A weight stored in FP16 is a 16-bit floating-point number with about 65,000 distinct possible values. Most weights in a trained LLM cluster near zero with a long tail of larger values. You don't need 65,000 values to represent that distribution well. Quantization picks a smaller set of representative values and rounds every weight to its nearest one.
Quantization is replacing a high-resolution photo with a posterized version. Fewer color buckets, but the scene is still recognizable. Lower bit-depth = fewer buckets = more obvious posterization.
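A toy round-to-nearest quantizer makes the bucket idea concrete; this uses a single per-tensor scale, whereas real recipes use per-group scales and smarter bucket placement.

import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest onto a uniform integer grid."""
    qmax = 2 ** (bits - 1) - 1               # 7 for 4-bit
    scale = np.abs(w).max() / qmax           # one scale per tensor (per-group in practice)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

w = np.random.normal(0.0, 0.02, size=4096).astype(np.float32)
q, scale = quantize_rtn(w)
w_hat = q * scale                            # the dequantized values used at compute time
print("mean abs rounding error:", float(np.abs(w - w_hat).mean()))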
The quantization zoo
Several methods exist; they differ in how they pick the buckets and which tensors they touch.
- RTN (round-to-nearest): simplest. Define a uniform grid; round each weight to the nearest grid point. Fast to apply; quality OK at 8-bit, often poor at 4-bit.
- GPTQ: quantize one column of weights at a time, adjusting the not-yet-quantized columns to compensate for the rounding errors. Much better quality at 4-bit. Takes a few hours to apply.
- AWQ (Activation-aware Weight Quantization): observes typical activation magnitudes and protects the weights that matter most for them. Often beats GPTQ on perplexity, especially at very low bit-depths.
- GGUF: the format used by llama.cpp. Mixes precision: some layers in 4-bit, others in 5- or 6-bit, with various trade-off "quants" you can pick (Q4_K_M, Q5_K_S, etc.). The de-facto standard for running LLMs on consumer hardware.
Where quantization fails
Not all layers tolerate low-bit representation equally. Some empirical patterns:
- Embedding and output layers are often more sensitive; many recipes leave these in higher precision.
- Attention K/V projections seem to tolerate aggressive quantization well.
- The very first and very last few layers of the stack are often more sensitive than the middle.
- 4-bit is the practical floor for general use. 3-bit and 2-bit work for specific models with dedicated methods, but quality drops are real.
The "4-bit is fine" mantra has caveats. For tasks the model was already weak at (uncommon languages, niche domains, edge-case reasoning), quantization can push the model from "barely working" to "broken." Always run a task-specific eval before and after; perplexity alone won't catch narrow capability collapse.
3. Context extension
What changes where?
- Object:
- Positional encoding formula (RoPE scaling) or attention structure (sliding window). Sometimes also the model's training recipe (continued pretraining on long-context data).
- When:
- Load time (RoPE scaling factor changes), or as additional fine-tuning on longer sequences.
- Where it lives:
- In the model config (RoPE base/scale parameters); in attention implementation (sliding window mask). KV cache memory grows linearly with the new max length.
- Tradeoffs:
- Max input length ↑, attention compute ↑↑ (quadratic), KV memory ↑, quality at long lengths often ↓ (especially in the middle of the context).
context
Why context isn't free
Two costs grow with context length:
- Attention compute is O(n²). Every pair of tokens computes a score. Double the context, four times the cost.
- KV cache memory is O(n). Linear in tokens, but with a large constant. For a 70B-class model, one token of cache is ~2 MB, so 128k tokens is ~256 GB of cache (the arithmetic is sketched below this list).
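A back-of-the-envelope version of that cache number, with assumed shapes: full multi-head attention in FP16. Grouped-query attention shrinks kv_heads and the total by a large factor, and the exact hidden size moves the figure around the ~2 MB estimate above.

# KV cache per token = layers * 2 (K and V) * kv_heads * head_dim * bytes per value
layers, kv_heads, head_dim, bytes_per_value = 80, 64, 128, 2   # assumed 70B-class shapes, FP16
per_token = layers * 2 * kv_heads * head_dim * bytes_per_value
print(per_token / 1e6, "MB per token")                  # ~2.6 MB
print(per_token * 128_000 / 1e9, "GB at 128k tokens")   # hundreds of GB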
RoPE scaling: making old position formulas cover new lengths
A model trained with RoPE on 4k context has rotation angles tuned for positions 0-4096. If you naively feed it 16k tokens, the angles for positions 4097-16383 are out of distribution and quality collapses.
The fix: rescale the rotation frequencies. Several flavors:
- Linear (Position Interpolation): divide all positions by 4 so 16k tokens "fit" into the 0-4k range the model was trained on. Simple; degrades quality somewhat.
- NTK-aware: spread the scaling unevenly across dimensions, compressing mostly the low-frequency ones (which encode coarse position) while leaving the high-frequency ones (which carry fine local detail) nearly untouched. Better quality than linear.
- YaRN: refines NTK-aware scaling with a few additional tricks (length-dependent scaling, attention temperature). The default for many recent long-context models.
Often paired with a short fine-tuning run on long-context data to "heal" the model into the new range.
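A sketch of the linear (Position Interpolation) flavor: positions are squeezed by the scale factor before the rotation angles are computed, so an extended sequence reuses the angle range seen in training. In Hugging Face model configs this typically surfaces as a rope_scaling entry; the exact keys vary by model and library version.

import numpy as np

def rope_angles(position, dim=128, base=10000.0, scale=1.0):
    """Rotation angles for one position; scale > 1 implements linear position interpolation."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)    # one frequency per pair of dimensions
    return (position / scale) * inv_freq

theta_naive = rope_angles(12000)               # position 12000 on a 4k-trained model: out of range
theta_scaled = rope_angles(12000, scale=4.0)   # squeezed: behaves like position 3000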
Sliding window attention
An alternative: instead of every token attending to every previous token, each token only attends to the last k tokens (typical k = 4096). Memory and compute become O(n Γ k) instead of O(nΒ²). Mistral models use this. Trade: tokens far apart can't directly attend to each other, but the residual stream can carry information across the boundary indirectly through stacked layers.
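A minimal mask construction shows the shape of the constraint; illustrative only, since real kernels never materialize the full n × n mask.

import numpy as np

def sliding_window_mask(n, window):
    """True where attention is allowed: causal AND within the last `window` positions."""
    i = np.arange(n)[:, None]     # query positions
    j = np.arange(n)[None, :]     # key positions
    return (j <= i) & (j > i - window)

print(sliding_window_mask(8, window=3).astype(int))   # each row attends to at most 3 keys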
The dirty truth about long context: most models are worse at retrieving information from the middle of their context window than from the beginning or end. This is the "lost in the middle" effect, observed across many models. Putting the relevant info in the first or last 25% of the prompt usually works better than relying on the middle. Test this with Needle-in-a-Haystack-style probes before trusting long context.
4. Pruning
What changes where?
- Object:
- Weight tensors: either individual values zeroed (unstructured) or whole heads/layers/dimensions removed (structured).
- When:
- Post-training, often followed by a "healing" fine-tune.
- Where it lives:
- On disk as smaller files; in VRAM as smaller tensors (structured) or sparse tensors (unstructured, requires hardware support to actually save compute).
- Tradeoffs:
- Size ↓, speed ↑ (only with structured pruning on supported hardware), capacity ↓, often a quality drop that needs a recovery fine-tune.
weights → structure
Unstructured vs structured
Unstructured pruning identifies the smallest individual weights (close to zero) and zeroes them. Easy to do, can prune 50%+ of weights with little quality loss. Catch: most hardware can't actually skip work for sparse matrices unless the sparsity is highly structured (e.g., 2:4 sparsity). So you save disk space and maybe memory, but rarely compute.
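The simplest unstructured recipe is magnitude pruning; here is a minimal sketch with a single per-tensor threshold (real methods such as SparseGPT or Wanda score weights more carefully).

import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Unstructured pruning: zero out the smallest-magnitude weights."""
    k = int(w.size * sparsity)
    threshold = np.partition(np.abs(w).ravel(), k)[k]   # k-th smallest magnitude
    return np.where(np.abs(w) < threshold, 0.0, w)

w = np.random.randn(1024, 1024).astype(np.float32)
w_pruned = magnitude_prune(w)
print("fraction zeroed:", float((w_pruned == 0).mean()))   # ~0.5, but dense kernels won't skip the zeros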
Structured pruning removes whole units: entire attention heads, MLP dimensions, even entire layers. Quality drops more sharply but the resulting model is genuinely smaller and faster on standard hardware. Methods like LLM-Pruner score components by their "importance" and remove the lowest-scoring ones.
Unstructured pruning is removing individual leaves from a tree. The tree looks slightly thinner; same shape, same shadow. Structured pruning is cutting whole branches. The tree is genuinely smaller, but it might look lopsided until it grows back.
Healing fine-tune
After any nontrivial pruning, the model needs recovery training: same data as SFT (or even some pretraining data), short run, learning rate small. This re-routes information through the surviving weights. Skipping it usually leaves visible quality damage.
Pruning vs quantization β when to reach for which
For most "make this model smaller / cheaper" goals, quantization is the first tool to reach for: easier, less destructive, more predictable. Pruning becomes useful when:
- You want to specifically reduce compute, not just memory (structured pruning helps; quantization mostly doesn't).
- You're already at 4-bit quantization and need more reduction.
- You want a smaller model that can be fully fine-tuned downstream (pruning gives you fewer parameters to train).
5. Decoding controls
What changes where?
- Object:
- The sampling step: how the model's probability output becomes a single chosen token.
- When:
- Inference time, per request. Configurable per call.
- Where it lives:
- In the inference engine's sampling loop, after the model's forward pass. The model itself is unchanged.
- Tradeoffs:
- Determinism vs creativity vs coherence vs latency.
decoding
The core controls
- Temperature: divide all logits by T before softmax. T < 1 sharpens the distribution (more deterministic); T > 1 flattens it (more random). T = 0 picks the top token always (greedy decoding, fully deterministic). T = 1 leaves the distribution as the model emitted it.
- Top-k: keep only the top K most-likely tokens; renormalize. K = 1 is greedy; K = 50 keeps the top 50 candidates regardless of their probabilities.
- Top-p (nucleus): keep the smallest set of tokens whose probabilities sum to at least p. p = 1.0 keeps everything; p = 0.9 keeps the top "90% of probability mass," cutting whatever long tail is left. (Temperature and top-p are combined in the sketch after this list.)
- Repetition penalty: apply a small penalty to tokens that have already appeared, to prevent loops like "the the the." Usually 1.05-1.15.
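A compact sketch of how temperature and top-p compose in the sampling loop; illustrative only, since inference engines fuse and heavily optimize this step.

import numpy as np

def sample(logits, temperature=0.7, top_p=0.9, rng=np.random.default_rng(0)):
    """Pick one token id from raw logits using temperature + nucleus (top-p) sampling."""
    if temperature == 0:                              # greedy decoding
        return int(np.argmax(logits))
    z = logits / temperature                          # temperature reshapes the distribution
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                   # most likely first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]                             # smallest set covering >= top_p of the mass
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
print(sample(logits))                                 # raise temperature and the choice varies more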
Common combinations
| Goal | Settings |
|---|---|
| Deterministic answers (Q&A, code) | temperature=0, or temperature=0.1 with top_p=0.95 |
| Helpful chat (default) | temperature=0.7, top_p=0.9 |
| Creative writing | temperature=1.0, top_p=0.95 |
| Brainstorming / exploration | temperature=1.2, top_p=0.98 |
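With Hugging Face transformers, these knobs are passed per call to generate, roughly as below (gpt2 stands in as a small placeholder model; the argument names are the standard generate parameters).

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # tiny placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Explain KV caching in one paragraph.", return_tensors="pt")
output = model.generate(
    **inputs,
    do_sample=True,                  # False = greedy/beam; temperature and top_p then do nothing
    temperature=0.7,                 # the "helpful chat" row from the table above
    top_p=0.9,
    repetition_penalty=1.1,
    max_new_tokens=64,
    pad_token_id=tok.eos_token_id,   # gpt2 has no pad token; this silences a warning
)
print(tok.decode(output[0], skip_special_tokens=True))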
The most common diagnostic mistake in production: blaming the model for variance that's coming from temperature. "The model gives different answers each time" is almost always temperature, not the model. "The model contradicts itself" might be the same. If you want determinism, set temperature=0 first, then debug.
KV cache tuning
Already covered in inference.html, but worth noting here as a "knob": for very long contexts you can configure cache behavior: eviction policies (drop oldest, drop least-attended), cache offloading (move parts to CPU), or attention sinks (always keep the first few tokens). All of these trade a small quality cost for memory savings.
Speculative decoding
Also covered in inference.html. The setting you'd see in an inference engine is something like --speculative-model llama-1b alongside the main model. Output is identical to non-speculative decoding; only latency improves. If your serving infra supports it, almost always worth turning on.
Putting it together β diagnosis preview
Once you have all five levers in your head, the diagnostic move is: given a problem, which lever is most likely to fix it?
| Symptom | First lever to try |
|---|---|
| Model is great in general but bad at my niche task | Fine-tuning (LoRA) |
| Model fits in memory but is too slow / too expensive | Quantization → speculative decoding |
| Model doesn't fit in my GPU at all | Quantization (4-bit) first; then maybe pruning |
| Model can't see enough of my prompt | Context extension (RoPE scaling) or use a long-context model |
| Model is too random / not random enough | Decoding controls: temperature, top-p |
| Model invents facts | Not a lever on this page: go to diagnosis.html; usually a scaffolding (RAG) fix |
For the deeper "what broke and why," see the diagnosis page.