Model Harness Mental Model

Hover any element for details
Click a band to enter · Double-click a component to jump to its exact section

Tap a band to enter · Tap a component to jump to its exact section

How to read this

The model sits at the bottom. It is a single function. Token IDs in, probabilities out. Nothing more. Above it sit two layers of harness, stacked like geological strata. The middle band, the agent harness, is what a developer builds when one call to the model is not enough. The top band, the production harness, is what the model service ships around the model to serve it to the world safely and at scale.

Click any band to descend into it. Double-click any component to jump to its exact section. Or use the index in the hamburger menu, top right.

Tap any band to descend into it. Tap any component to jump to its exact section. Or use the index in the hamburger menu, top right.

§ 01 · The Surface Layer

The production harness

What the model service ships around the model. Its job is to accept one HTTP request and return one stream of tokens, safely and quickly.

Click any component to jump to its section · Click the model to descend into volume III

The production harness is the layer most developers never write. It is built by the platform that hosts the model. From the outside it looks like a thin pipe: send a request, get a response. From the inside it is eight or nine distinct components, each doing a specialised job, each capable of being upgraded or swapped without touching the model itself.

Ingress

The front door. Authenticates the caller, applies rate limits, routes the request to the correct model and the correct datacentre. Looks boring. Becomes interesting at scale, where a slow ingress can dominate the latency of every request even if the model is fast.

Ingress is also where pricing tiers express themselves. Different keys get different priority queues. Different keys get different versions of the model. Different keys get different size limits on the request payload.

Input classifiers

Before the model sees the user's message, a separate set of small models or rule systems checks it. They look for prompt injection attempts, content that violates policy, requests that should be refused on principle, and signals about what kind of request this actually is. They run fast because they have to. They cost almost nothing relative to running the main model.

The trade-off is latency. A classifier that adds two hundred milliseconds to every request is a real product cost. Platforms work hard to make them as small and as parallel as possible.

Context assembly

The model only ever sees one document. Everything else is a fight to make that document the right document.

Context assembly is the harness deciding what goes into the prompt before a single token of generation happens. The output is one ordered sequence of tokens, ready for the inference engine. This is the highest-leverage component in the entire stack.

What gets assembled, in roughly the order it appears

System prompt. Instructions, role, behaviour rules, format requirements. Sets the lens through which everything else is read.
Safety rules. Often injected by the platform, invisible to the developer.
Tool definitions. Schemas describing what tools exist, their arguments, their returns.
Memories. Long-term facts about the user, retrieved from a separate store.
Retrieved knowledge. Chunks fetched by semantic search from a vector store, when applicable.
Conversation history. Prior turns of the current session, often summarised.
Current user message. The thing being responded to. Goes last because of recency.

Three principles

Order matters more than people think. The model attends to every token, but positional encoding and recency dynamics mean tokens at the start and end behave differently from tokens in the middle. The middle of a long prompt is a graveyard for instructions you wanted the model to notice.

Tokens cost money and context. Every assembled token costs compute and eats into a finite window. Assembly decisions are budget decisions, not just relevance decisions.

Assembly is not concatenation. Real harnesses compress, summarise, deduplicate, and reorder. Two products on the same model can produce wildly different outputs because their assembly logic does different work.

What breaks

Context overflow

Window exceeded. The oldest content gets truncated silently. The model loses the thread.

Wrong order

System prompt buried in the middle of a long context. The model effectively ignores it.

Stale memory

Retrieved facts are no longer true. The model states them confidently anyway.

Tool pollution

Too many tool schemas defined. The model picks the wrong one or hallucinates parameters.

Retrieval noise

Chunks that look topically relevant but actively mislead.

Prompt leakage

The model can be coaxed into revealing or paraphrasing its system prompt.

Sampling

The model outputs a probability distribution over the vocabulary. Sampling is how the harness turns that distribution into a single chosen token. The levers here are familiar but worth listing.

Temperature. Scales the logits before the softmax. Lower temperature concentrates probability on the top candidates; higher temperature flattens it. Temperature zero is deterministic.
Top-k. Only sample from the k most likely tokens. Everything else is discarded.
Top-p (nucleus). Sample only from the smallest set of tokens whose cumulative probability exceeds p. Adapts to the shape of the distribution.
Stop sequences. Strings that, if emitted, cause generation to halt.
Repetition penalties. Down-weight tokens that have appeared recently.

Sampling is where the same model can feel cautious, creative, or unhinged depending on which parameter the harness chose. The model itself does not change.

Tool execution loop

The model cannot do anything. The tool loop is how the harness lets it.

A model emits tokens. If some of those tokens encode a structured tool call, something has to notice and act. That noticing-and-acting is the tool loop. It is the central pattern of agentic AI.

The mechanics

Tool definitions arrive in the context. The model is told which tools exist and how to call them.
The model emits tokens that look like a tool call. Usually a structured JSON block or a special tagged format. This is just text. The model has no idea what a tool is. It has learned a pattern.
A parser detects the call. The harness watches the output stream. When it sees a tool-call shape, it stops generation.
The harness dispatches. It calls the actual code or API the schema described. The model does not call the tool. The harness does, on the model's behalf.
The result is appended to the context. The tool's return value becomes new tokens in the conversation history.
The model is run again. Same model, new context. The next token is conditioned on whatever the tool returned.
The loop continues until the model emits an ordinary text response with no tool call, or hits a stop signal, or a guard trips.

"Tool use is not the model picking up a hammer. It is the model writing the word 'hammer' and the harness fetching one."

What breaks

Loop divergence

The agent keeps calling tools forever. No exit condition was specified or the model failed to detect one.

Schema mismatch

The model emits a tool call with the wrong shape. The parser fails or the tool errors out. Often visible as confidently wrong JSON.

Result drowning

A tool returns thousands of tokens. The context fills. Earlier context gets evicted. The agent loses the original task.

Mid-loop drift

Each tool result subtly shifts what the model thinks the task is. By turn five, it is solving a different problem.

Silent failures

Tool fails, but the error is swallowed. The model continues as if it succeeded.

Output classifiers

After the model generates tokens, another set of classifiers checks the output. Same idea as input classifiers, opposite direction. They look for policy violations the model failed to refuse, leaked secrets, leaked system prompts, or content the platform has committed not to produce. If they trip, the output gets blocked, replaced, or rewritten before the user sees it.

Two thin classifiers do work that, if removed, would force the main model to be both producer and police. Specialisation lets each do its job better.

Streaming

The model generates tokens autoregressively, one after the next. The harness has a choice: wait for the full response and send it as one block, or send each token as it is generated. Streaming sends each token over a persistent connection, typically using server-sent events, so the user starts reading before the model is done writing.

This is more than user-experience polish. Streaming is also a control point. Tool calls can be detected mid-stream and trigger the loop earlier. Output classifiers can interrupt generation when a violation appears. The streaming layer is both a delivery mechanism and a place to intervene.

Observability

Logs, traces, billing meters, replay artifacts. Every token in and every token out, recorded for debugging, billing, evals, and audits. Looks like infrastructure plumbing. Becomes the lifeblood of any team trying to improve a system, because without it you cannot tell whether a change helped, hurt, or made no difference.

Observability is also where the most sensitive privacy and retention decisions live. What gets logged, for how long, who can see it, and whether it can be used for training are all observability questions.

§ 02 · The Orchestration Layer

The agent harness

What you build when one call to the model is not enough. Lives inside the tool loop of the production harness, and bends one model into something that plans, acts, and revises.

Click any component to jump to its section

If the production harness is the platform's job, the agent harness is the developer's. It is where you decide how a multi-step task gets carried out: who plans, what state is carried between steps, which tools the agent has access to, what gets remembered, how quality is checked, and when the work is considered done. This is the layer where most of the visible design choices in modern AI products are made.

State

State is the harness's working memory for the current task. A typed object, often a dictionary, that every step of the agent reads from and writes to. It is recreated for each new task and discarded when the task ends. Not memory in the long-term sense. More like a scratchpad pinned to the wall while the work happens.

Good state design is invisible. Bad state design shows up as fields the agent reads but no longer trusts, or as fields that get clobbered by parallel steps. The hardest part is deciding what to put in state versus what to recompute or re-fetch.

Planner

Something has to decide what step happens next. Where that something lives changes the character of the whole system.

Pattern one · the model is the planner

The classic reasoning-and-acting pattern. The model itself, on each turn, decides whether to emit a tool call or a final answer. No separate planning step. The prompt encourages the model to think out loud, then act. Cheap, simple, surprisingly capable. Fails when the task requires planning many steps ahead, because the model has no place to put a plan. It can only emit one token at a time.

Pattern two · a graph is the planner

The developer writes an explicit graph of nodes and edges. Each node does one thing. The edges encode the logic. The model is consulted at certain nodes for narrow decisions, but the topology of the workflow is hard-coded. Predictable, debuggable, deployable. Constrained to the topologies the developer foresaw. New kinds of tasks need a new graph.

Pattern three · a separate planner model

One model writes the plan. A different model executes each step. The planner emits something close to a script. The executor follows it. Strong on multi-step tasks. Expensive because every step pays for two model calls. Brittle when the executor disagrees with the plan or when the plan goes stale partway through.

Why this choice changes the harness

If the model is the planner, most of the harness is the tool loop and context management. If a graph is the planner, most of the harness is the graph runtime and state object. If a separate planner exists, most of the harness is the orchestration between two models. The choice of planner is the choice of which problems the harness is built to solve.

What breaks

Goal drift

The planner forgets the original goal after a few steps because it is no longer near the end of the context.

Plan staleness

A separate planner writes a plan based on assumptions that turn out to be wrong. The executor follows it anyway.

Graph rigidity

A new variant of the task does not fit the existing graph. The system fails on inputs it could in principle handle.

Reasoning collapse

The model planner emits a confident but incoherent next step. With no critic, it gets executed.

Tools

Tools are the agent's hands. They can be APIs, code execution, retrieval functions, file readers, other agents. As far as the model is concerned, every tool is the same shape: a function that takes a JSON blob and returns text. As far as the harness is concerned, each tool is a piece of real-world machinery with its own error modes.

Two design decisions matter most: how many tools to expose at once, and how clearly the schemas describe them. Most production agents settle around ten to twenty tools per call. More than that and the model starts confusing them. Schema clarity does more work than people expect; the descriptions are part of the prompt, and the model reasons from them.

Memory

Three flavours, often conflated.

One · working memory

The current context window. Everything the model can actually see right now. Working memory exists only for the duration of a single inference call. The harness rebuilds it each time, which is why context assembly is the lever it is.

Two · episodic memory

What happened in the past, recorded so the harness can recall it. Conversation history is the simplest form. Summarised conversation history is denser. Distilled facts (the user prefers concise replies; the user lives in Bangalore) are densest. Episodic memory is rebuilt into working memory at the start of each new conversation.

Three · semantic memory

Knowledge stored as embeddings in a vector database, retrievable by semantic similarity at query time. This is what retrieval-augmented generation operates on. Semantic memory can be enormous, because none of it lives in the prompt until it is queried in.

The Analogy

Working memory is the page in front of you right now. Episodic memory is your journal of past pages. Semantic memory is the library down the street. All three are needed to write the next sentence well, but only one is open at a time.

The Precise Version

Working memory is the active context window, capped by the model's window size. Episodic memory is durable storage of prior interactions, in summarised or distilled form. Semantic memory is content addressable by vector similarity rather than by recency, with no inherent size limit.

What breaks

Memory leakage

Facts from one user's episodic memory end up retrieved into another user's context. A data isolation bug, usually fatal for trust.

Stale summaries

Episodic memory was summarised days ago. The summary no longer reflects what was actually said.

Retrieval mismatch

The semantic store returns chunks that are similar by embedding but not actually useful for the question.

State pollution

The agent's whiteboard accumulates fields it no longer needs, making it slower to reason about and easier to confuse.

Quality gates

If the model is the producer, something else has to be the editor. Quality gates are the editor. Without gates, an agent is as good as its worst draft. With gates, an agent can be substantially better than its average.

Maker and checker. One model call produces a draft. A second call, with a different prompt, checks it against the requirements. If the check fails, the draft is sent back for revision.
Critique and revise. Same model, two passes. The second pass is told to find what is wrong with the first and rewrite it. Cheaper than maker-checker and often nearly as good.
Best of n. The model produces several candidates. A judging step picks the best. Useful when generation is cheap and quality varies.
Hard validators. Non-model code that checks structural properties. Does this JSON parse. Does this SQL run. Does this number fall in range. Cheap, fast, catches the obvious.
Human in the loop. A person reviews before the agent acts on something irreversible. The most expensive gate and the most reliable one.

Why gates feel expensive but are not

A gate adds a model call. That looks expensive in unit economics. But the alternative is that the agent ships a wrong answer that costs more downstream than two model calls ever would. The right question is not whether to gate but where the gate has the highest leverage.

Control flow

When to loop, when to stop, when to escalate, when to fail. Control flow is the rules that govern how the agent moves between its components. It is usually the least glamorous part of the design and the most consequential. An agent without explicit control flow is an agent whose runtime behaviour is unpredictable.

The most important rule is the stop rule. An agent that does not know when it is done will either stop too early and ship a partial answer or never stop at all. Stop rules are most often based on either a maximum number of steps, a confidence check by a separate gate, or a structural signal in the agent's own output.

Output composition

Everything that happened during the run has to become one final answer for the user. Output composition is the step that decides what the user actually sees. Sometimes it is trivial: the last model call's response is the answer. More often it is non-trivial: stitching together tool results, citations, reasoning traces, and rendered artefacts into one coherent message.

Composition is also where formatting decisions live. Tables versus prose. Citations inline or at the end. Length caps. Markdown versus structured payload. The model produced the substance. The harness produces the form.

Sub-agent delegation

The cleanest scaling pattern for complex tasks is to call other agents as if they were tools. A research sub-agent, a writing sub-agent, a critic sub-agent, each with their own narrow prompt and tool set, each invoked by a parent agent that coordinates the work.

Sub-agents add expressive power but multiply the failure modes. A bug in a sub-agent's prompt can cascade through the parent without obvious warning. The hierarchy needs its own observability story, because debugging one model call inside many is hard, and debugging one model call inside many running in parallel is harder still.

§ 03 · The Bedrock

The model

A pure function. Given a sequence of token IDs, it returns a probability distribution over the next token. Nothing more.

Click any layer to jump to its section · bottom is input, top is output

The model is the bedrock. Everything else in this document is the harness sitting on top of it. The model itself does not remember the previous call. It cannot look anything up. It cannot decide to stop talking, ask a clarifying question, or pick up a tool. It can only emit probabilities. Every loop, every tool call, every multi-turn conversation, every flicker of agentic behaviour happens outside it.

This page is a quick reference. The full mental model of the model itself lives in Volume I.

The Analogy

A brilliant savant locked in a windowless room. You slide a sheet of paper under the door. They read what is on it, write a continuation, and slide it back. They cannot stand up. They cannot look anything up. They have no memory of yesterday's paper. They cannot ask you to clarify. They can only continue text.

The Precise Version

The model is a deterministic function from a sequence of token IDs to a vector of logits over the vocabulary. It is stateless across calls, inert with respect to the world, single-shot per invocation, and disconnected from any external resource.

Tokeniser

Raw text gets chopped into token IDs from a fixed vocabulary, typically thirty-two thousand to one hundred and twenty-eight thousand tokens. The algorithm is usually byte-pair encoding, which builds the vocabulary by repeatedly merging the most common pairs in a training corpus until the desired size is reached. The tokeniser is not a neural network. It is a deterministic lookup.

Embedding

Each token ID is converted into a dense vector by a simple table lookup. Row n of a large learned matrix gives you the vector for token n. At this point each token's vector knows nothing about its neighbours. It only knows what the token means in isolation, as learned during pretraining.

Attention

Attention is what lets each token's representation be informed by every other token in the sequence. Each token projects its vector into three roles: a query, a key, and a value. Queries are matched against keys, the matches are scored and softmaxed, and the resulting weights pull in a mixture of values. The mechanism runs in parallel across many heads, each learning to attend to different relational patterns. This is the engine of context.

Feed-forward

After attention has mixed information across tokens, each token's vector is run independently through a feed-forward network. The network expands the vector into a much wider intermediate space, applies a non-linearity, and contracts it back. This is where most of the model's parameters live, and where most of the world knowledge is stored.

Sampling head

At the top of the stack, the final token's vector is projected against the embedding matrix, producing one logit per vocabulary entry. Softmaxed, this becomes a probability distribution over what the next token might be. The harness then picks one, using temperature, top-k, top-p, and the other sampling levers.

"The model produced a probability distribution. Everything between that distribution and the user seeing a wrong answer is the harness."

Companion volumes

Volume I · The Model. The full mental model of the box. Tokeniser, embedding, positional encoding, attention, feed-forward, sampling, in depth.

Volume II · The Harness. This document. What wraps the model.

Volume III · The Manipulation. Forthcoming. How the box came to know what it knows. Pretraining, supervised fine-tuning, reinforcement learning from human feedback, direct preference optimisation.

The Curiosity Loop · Field Notes · Volume II
Composed in the Curator's Notebook