Knobs you can turn after the model is trained
Five levers: what each one actually changes inside the model, and what it costs.
Every lever on this page follows the same structure: what concretely changes, when, where it lives, and which tradeoff surface it moves. If you can fill in those four boxes for any LLM intervention, you can predict its consequences.
1. Fine-tuning
What changes where?
- Object:
- Weight tensors: either all of them (full fine-tune) or a small low-rank delta attached to selected matrices (LoRA).
- When:
- Fine-tune time (a separate training run, much shorter than pretraining).
- Where it lives:
- On disk as updated weights (full) or as a separate small adapter file (LoRA). At inference: in VRAM with the rest of the model.
- Tradeoffs:
- Quality on your task ↑, capability elsewhere can ↓ (forgetting), training cost ↑, deployment complexity ↑ if managing multiple adapters.
weights
Full fine-tune vs LoRA vs QLoRA
Full fine-tune updates every weight. Maximum flexibility: the model can change as much as needed. Maximum cost: for a 70B model, you need enough VRAM to hold the model, the gradients, the optimizer state, and the activations. In practice that's often hundreds of GB across multiple GPUs. Risk: if your fine-tune dataset is small or narrow, you can wreck capabilities the model had before (catastrophic forgetting).
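To make "hundreds of GB" concrete, here is rough, illustrative arithmetic for Adam-style mixed-precision training; the exact total depends on the optimizer (8-bit optimizers shrink it a lot), sharding, and activation checkpointing.

# Rough memory arithmetic for fully fine-tuning a 70B model with Adam in mixed
# precision. Illustrative only - real setups vary (8-bit optimizers, ZeRO, etc.).
params = 70e9
weights_bf16 = params * 2        # model weights in bf16
grads_bf16 = params * 2          # gradients in bf16
adam_fp32 = params * 4 * 2       # Adam first + second moments in fp32
master_fp32 = params * 4         # fp32 master copy of the weights
total_gb = (weights_bf16 + grads_bf16 + adam_fp32 + master_fp32) / 1e9
print(f"{total_gb:.0f} GB before activations")   # ~1120 GB - far beyond one GPU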
LoRA (Low-Rank Adaptation) is the workhorse of practical fine-tuning. Instead of updating the original weight matrix W (size d × d), you freeze W and learn two small matrices A (d × r) and B (r × d) where r is small (typically 4-64). The effective weight at inference time is W + A·B. Because r is tiny, the trainable parameter count drops by 100×-1000×.
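A minimal numpy sketch of that decomposition, using the (alpha / r) scaling explained further down; illustrative only, not the peft implementation.

import numpy as np

d, r, alpha = 4096, 16, 32
W = np.random.randn(d, d) * 0.02     # frozen base weight (d x d)
A = np.random.randn(d, r) * 0.01     # trainable (d x r)
B = np.zeros((r, d))                 # trainable (r x d), zero-init so the delta starts at 0

x = np.random.randn(d)
# effective weight is W + (alpha / r) * A @ B, but the full delta is never materialized:
y = x @ W + (alpha / r) * (x @ A) @ B
print("trainable:", A.size + B.size, "vs full:", W.size)   # ~128x fewer parameters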
Full fine-tune is repainting the entire house. LoRA is hanging removable overlays in specific rooms: you can take them down, swap them, layer them. The walls (base weights) never change.
QLoRA combines LoRA with 4-bit quantization of the base model. The base weights are frozen and stored in 4 bits; only the LoRA adapter trains in normal precision. This makes it possible to fine-tune a ~30B-class model on a single 24 GB consumer GPU, and a 65-70B model on a single 48 GB card. The trade is some numerical noise during training; in practice quality remains close to LoRA.
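In the Hugging Face ecosystem this is typically wired up with transformers + bitsandbytes + peft along the lines below; the model id is a placeholder and exact argument names can shift between versions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the frozen base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4, the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",            # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
model.print_trainable_parameters()          # only the adapter trains, in higher precision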
A real LoRA config, line by line
from peft import LoraConfig
config = LoraConfig(
    r=16,                       # rank: bigger r = more capacity, more trainable params
    lora_alpha=32,              # scaling factor (effective scale = alpha / r)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # which weights to adapt
    lora_dropout=0.05,          # regularization during training
    bias="none",                # don't train bias terms
    task_type="CAUSAL_LM",      # decoder-only LM
)
- r=16: rank of the adapter. Powers of 2 are conventional. r=4 is tiny (good for very narrow tasks); r=64 starts to compete with full fine-tune in capacity.
- lora_alpha=32: a scaling knob. The effective adapter contribution is (alpha / r) × A·B. Common pattern: alpha = 2×r.
- target_modules: which weight matrices in each layer get the adapter. Attention projections (q/k/v/o) are the standard set; adding MLP weights (gate_proj, down_proj, up_proj) increases capacity but also parameter count.
- lora_dropout=0.05: regularization. 0.05-0.1 is typical.
- bias="none": don't train bias parameters. They're tiny anyway and excluded by convention.
LoRA hyperparameters look magical until they bite you. Common failure mode: r too small for the task, so the adapter doesn't have enough capacity, training loss plateaus high, and eval barely moves. Common fix: bump r, increase target_modules to include MLP layers. Opposite failure: r too high or training too long, which overfits the fine-tune set and forgets general capability. Always evaluate on a held-out general benchmark, not just your task.
2. Quantization
What changes where?
- Object:
- The numerical representation of weight tensors. Same number of weights; fewer bits per weight.
- When:
- Load time (or as a separate one-time conversion step). The model is quantized once and stored.
- Where it lives:
- On disk as smaller files; in VRAM as smaller tensors.
- Tradeoffs:
- Memory ↓ (4× from FP16→INT4), often speed ↑, accuracy ↓ slightly (a lot if done badly).
weights → representation
The basic idea
A weight stored in FP16 is a 16-bit floating-point number with about 65,000 distinct possible values. Most weights in a trained LLM cluster near zero with a long tail of larger values. You don't need 65,000 values to represent that distribution well. Quantization picks a smaller set of representative values and rounds every weight to its nearest one.
Quantization is replacing a high-resolution photo with a posterized version. Fewer color buckets, but the scene is still recognizable. Lower bit-depth = fewer buckets = more obvious posterization.
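A toy round-to-nearest quantizer makes the bucket idea concrete; this uses a single per-tensor scale, whereas real recipes use per-group scales and smarter bucket placement.

import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest onto a uniform integer grid."""
    qmax = 2 ** (bits - 1) - 1               # 7 for 4-bit
    scale = np.abs(w).max() / qmax           # one scale per tensor (per-group in practice)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

w = np.random.normal(0.0, 0.02, size=4096).astype(np.float32)
q, scale = quantize_rtn(w)
w_hat = q * scale                            # the dequantized values used at compute time
print("mean abs rounding error:", float(np.abs(w - w_hat).mean()))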
The quantization zoo
Several methods exist; they differ in how they pick the buckets and which tensors they touch.
- RTN (round-to-nearest): simplest. Define a uniform grid; round each weight to the nearest grid point. Fast to apply; quality OK at 8-bit, often poor at 4-bit.
- GPTQ: quantize one column of weights at a time, adjusting the not-yet-quantized columns to compensate for the rounding errors. Much better quality at 4-bit. Takes a few hours to apply.
- AWQ (Activation-aware Weight Quantization): observes typical activation magnitudes and protects the weights that matter most for them. Often beats GPTQ on perplexity, especially at very low bit-depths.
- GGUF: the format used by llama.cpp. Mixes precision: some layers in 4-bit, others in 5- or 6-bit, with various trade-off "quants" you can pick (Q4_K_M, Q5_K_S, etc.). The de-facto standard for running LLMs on consumer hardware.
Where quantization fails
Not all layers tolerate low-bit representation equally. Some empirical patterns:
- Embedding and output layers are often more sensitive; many recipes leave these in higher precision.
- Attention K/V projections seem to tolerate aggressive quantization well.
- The very first and very last few layers of the stack are often more sensitive than the middle.
- 4-bit is the practical floor for general use. 3-bit and 2-bit work for specific models with dedicated methods, but quality drops are real.
The "4-bit is fine" mantra has caveats. For tasks the model was already weak at (uncommon languages, niche domains, edge-case reasoning), quantization can push the model from "barely working" to "broken." Always run a task-specific eval before and after; perplexity alone won't catch narrow capability collapse.
3. Context extension
What changes where?
- Object:
- Positional encoding formula (RoPE scaling) or attention structure (sliding window). Sometimes also the model's training recipe (continued pretraining on long-context data).
- When:
- Load time (RoPE scaling factor changes), or as additional fine-tuning on longer sequences.
- Where it lives:
- In the model config (RoPE base/scale parameters); in attention implementation (sliding window mask). KV cache memory grows linearly with the new max length.
- Tradeoffs:
- Max input length ↑, attention compute ↑↑ (quadratic), KV memory ↑, quality at long lengths often ↓ (especially in the middle of the context).
context
Why context isn't free
Two costs grow with context length:
- Attention compute is O(n²). Every pair of tokens computes a score. Double the context, four times the cost.
- KV cache memory is O(n). Linear in tokens, but with a large constant. For a 70B-class model, one token of cache is ~2 MB, so 128k tokens is ~256 GB of cache (the arithmetic is sketched below this list).
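A back-of-the-envelope version of that cache number, with assumed shapes: full multi-head attention in FP16. Grouped-query attention shrinks kv_heads and the total by a large factor, and the exact hidden size moves the figure around the ~2 MB estimate above.

# KV cache per token = layers * 2 (K and V) * kv_heads * head_dim * bytes per value
layers, kv_heads, head_dim, bytes_per_value = 80, 64, 128, 2   # assumed 70B-class shapes, FP16
per_token = layers * 2 * kv_heads * head_dim * bytes_per_value
print(per_token / 1e6, "MB per token")                  # ~2.6 MB
print(per_token * 128_000 / 1e9, "GB at 128k tokens")   # hundreds of GB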
RoPE scaling: making old position formulas cover new lengths
A model trained with RoPE on 4k context has rotation angles tuned for positions 0-4096. If you naively feed it 16k tokens, the angles for positions 4097-16383 are out of distribution and quality collapses.
The fix: rescale the rotation frequencies. Several flavors:
- Linear (Position Interpolation): divide all positions by 4 so 16k tokens "fit" into the 0-4k range the model was trained on. Simple; degrades quality somewhat.
- NTK-aware: spread the scaling unevenly across dimensions, compressing mostly the low-frequency ones (which encode coarse position) while leaving the high-frequency ones (which carry fine local detail) nearly untouched. Better quality than linear.
- YaRN: refines NTK-aware scaling with a few additional tricks (length-dependent scaling, attention temperature). The default for many recent long-context models.
Often paired with a short fine-tuning run on long-context data to "heal" the model into the new range.
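A sketch of the linear (Position Interpolation) flavor: positions are squeezed by the scale factor before the rotation angles are computed, so an extended sequence reuses the angle range seen in training. In Hugging Face model configs this typically surfaces as a rope_scaling entry; the exact keys vary by model and library version.

import numpy as np

def rope_angles(position, dim=128, base=10000.0, scale=1.0):
    """Rotation angles for one position; scale > 1 implements linear position interpolation."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)    # one frequency per pair of dimensions
    return (position / scale) * inv_freq

theta_naive = rope_angles(12000)               # position 12000 on a 4k-trained model: out of range
theta_scaled = rope_angles(12000, scale=4.0)   # squeezed: behaves like position 3000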
Sliding window attention
An alternative: instead of every token attending to every previous token, each token only attends to the last k tokens (typical k = 4096). Memory and compute become O(n Γ k) instead of O(nΒ²). Mistral models use this. Trade: tokens far apart can't directly attend to each other, but the residual stream can carry information across the boundary indirectly through stacked layers.
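A minimal mask construction shows the shape of the constraint; illustrative only, since real kernels never materialize the full n × n mask.

import numpy as np

def sliding_window_mask(n, window):
    """True where attention is allowed: causal AND within the last `window` positions."""
    i = np.arange(n)[:, None]     # query positions
    j = np.arange(n)[None, :]     # key positions
    return (j <= i) & (j > i - window)

print(sliding_window_mask(8, window=3).astype(int))   # each row attends to at most 3 keys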
The dirty truth about long context: most models are worse at retrieving information from the middle of their context window than from the beginning or end. This is the "lost in the middle" effect, observed across many models. Putting the relevant info in the first or last 25% of the prompt usually works better than relying on the middle. Test this with Needle-in-a-Haystack-style probes before trusting long context.
4. Pruning
What changes where?
- Object:
- Weight tensors: either individual values zeroed (unstructured) or whole heads/layers/dimensions removed (structured).
- When:
- Post-training, often followed by a "healing" fine-tune.
- Where it lives:
- On disk as smaller files; in VRAM as smaller tensors (structured) or sparse tensors (unstructured, requires hardware support to actually save compute).
- Tradeoffs:
- Size ↓, speed ↑ (only with structured pruning on supported hardware), capacity ↓, often a quality drop that needs a recovery fine-tune.
weights → structure
Unstructured vs structured
Unstructured pruning identifies the smallest individual weights (close to zero) and zeroes them. Easy to do, can prune 50%+ of weights with little quality loss. Catch: most hardware can't actually skip work for sparse matrices unless the sparsity is highly structured (e.g., 2:4 sparsity). So you save disk space and maybe memory, but rarely compute.
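The simplest unstructured recipe is magnitude pruning; here is a minimal sketch with a single per-tensor threshold (real methods such as SparseGPT or Wanda score weights more carefully).

import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Unstructured pruning: zero out the smallest-magnitude weights."""
    k = int(w.size * sparsity)
    threshold = np.partition(np.abs(w).ravel(), k)[k]   # k-th smallest magnitude
    return np.where(np.abs(w) < threshold, 0.0, w)

w = np.random.randn(1024, 1024).astype(np.float32)
w_pruned = magnitude_prune(w)
print("fraction zeroed:", float((w_pruned == 0).mean()))   # ~0.5, but dense kernels won't skip the zeros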
Structured pruning removes whole units: entire attention heads, MLP dimensions, even entire layers. Quality drops more sharply but the resulting model is genuinely smaller and faster on standard hardware. Methods like LLM-Pruner score components by their "importance" and remove the lowest-scoring ones.
Unstructured pruning is removing individual leaves from a tree. The tree looks slightly thinner; same shape, same shadow. Structured pruning is cutting whole branches. The tree is genuinely smaller, but it might look lopsided until it grows back.
Healing fine-tune
After any nontrivial pruning, the model needs recovery training: same data as SFT (or even some pretraining data), short run, learning rate small. This re-routes information through the surviving weights. Skipping it usually leaves visible quality damage.
Pruning vs quantization β when to reach for which
For most "make this model smaller / cheaper" goals, quantization is the first tool to reach for: easier, less destructive, more predictable. Pruning becomes useful when:
- You want to specifically reduce compute, not just memory (structured pruning helps; quantization mostly doesn't).
- You're already at 4-bit quantization and need more reduction.
- You want a smaller model that can be fully fine-tuned downstream (pruning gives you fewer parameters to train).
5. Decoding controls
What changes where?
- Object:
- The sampling step: how the model's probability output becomes a single chosen token.
- When:
- Inference time, per request. Configurable per call.
- Where it lives:
- In the inference engine's sampling loop, after the model's forward pass. The model itself is unchanged.
- Tradeoffs:
- Determinism vs creativity vs coherence vs latency.
decoding
The core controls
- Temperature: divide all logits by T before softmax. T < 1 sharpens the distribution (more deterministic); T > 1 flattens it (more random). T = 0 picks the top token always (greedy decoding, fully deterministic). T = 1 leaves the distribution as the model emitted it.
- Top-k: keep only the top K most-likely tokens; renormalize. K = 1 is greedy; K = 50 keeps the top 50 candidates regardless of their probabilities.
- Top-p (nucleus): keep the smallest set of tokens whose probabilities sum to at least p. p = 1.0 keeps everything; p = 0.9 keeps the top "90% of probability mass," cutting whatever long tail is left. (Temperature and top-p are combined in the sketch after this list.)
- Repetition penalty: apply a small penalty to tokens that have already appeared, to prevent loops like "the the the." Usually 1.05-1.15.
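A compact sketch of how temperature and top-p compose in the sampling loop; illustrative only, since inference engines fuse and heavily optimize this step.

import numpy as np

def sample(logits, temperature=0.7, top_p=0.9, rng=np.random.default_rng(0)):
    """Pick one token id from raw logits using temperature + nucleus (top-p) sampling."""
    if temperature == 0:                              # greedy decoding
        return int(np.argmax(logits))
    z = logits / temperature                          # temperature reshapes the distribution
    probs = np.exp(z - z.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                   # most likely first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]                             # smallest set covering >= top_p of the mass
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

logits = np.array([2.0, 1.0, 0.5, -1.0, -3.0])
print(sample(logits))                                 # raise temperature and the choice varies more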
Common combinations
| Goal | Settings |
|---|---|
| Deterministic answers (Q&A, code) | temperature=0, or temperature=0.1 with top_p=0.95 |
| Helpful chat (default) | temperature=0.7, top_p=0.9 |
| Creative writing | temperature=1.0, top_p=0.95 |
| Brainstorming / exploration | temperature=1.2, top_p=0.98 |
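With Hugging Face transformers, these knobs are passed per call to generate, roughly as below (gpt2 stands in as a small placeholder model; the argument names are the standard generate parameters).

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # tiny placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Explain KV caching in one paragraph.", return_tensors="pt")
output = model.generate(
    **inputs,
    do_sample=True,                  # False = greedy/beam; temperature and top_p then do nothing
    temperature=0.7,                 # the "helpful chat" row from the table above
    top_p=0.9,
    repetition_penalty=1.1,
    max_new_tokens=64,
    pad_token_id=tok.eos_token_id,   # gpt2 has no pad token; this silences a warning
)
print(tok.decode(output[0], skip_special_tokens=True))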
The most common diagnostic mistake in production: blaming the model for variance that's coming from temperature. "The model gives different answers each time" is almost always temperature, not the model. "The model contradicts itself" might be the same. If you want determinism, set temperature=0 first, then debug.
KV cache tuning
Already covered in inference.html, but worth noting here as a "knob": for very long contexts you can configure cache behavior: eviction policies (drop oldest, drop least-attended), cache offloading (move parts to CPU), or attention sinks (always keep the first few tokens). All of these trade a small quality cost for memory savings.
Speculative decoding
Also covered in inference.html. The setting you'd see in an inference engine is something like --speculative-model llama-1b alongside the main model. Output is identical to non-speculative decoding; only latency improves. If your serving infra supports it, almost always worth turning on.
Putting it together β diagnosis preview
Once you have all five levers in your head, the diagnostic move is: given a problem, which lever is most likely to fix it?
| Symptom | First lever to try |
|---|---|
| Model is great in general but bad at my niche task | Fine-tuning (LoRA) |
| Model fits in memory but is too slow / too expensive | Quantization → speculative decoding |
| Model doesn't fit in my GPU at all | Quantization (4-bit) first; then maybe pruning |
| Model can't see enough of my prompt | Context extension (RoPE scaling) or use a long-context model |
| Model is too random / not random enough | Decoding controls: temperature, top-p |
| Model invents facts | Not a lever on this page: go to diagnosis.html; usually a scaffolding (RAG) fix |
For the deeper "what broke and why," see the diagnosis page.