Tokens, embeddings, and position

How text becomes numbers — and how the model knows which number came first.

Tokens and embeddings are how your input gets encoded for the model. The embedding lookup table itself is part of the model's weights — it's learned.

Tokens — chopping text into reusable pieces

A model can't operate on raw characters directly — well, it could, but you'd waste a lot of capacity making it re-learn that "the" is a recurring chunk every single time. Instead, text gets chopped into tokens: pieces that are usually shorter than a word but longer than a character, chosen so that common pieces are reused across many words.

Tokens are like prefab Lego bricks for language. Common bricks ("the", "-ing", "un-") are stocked in standard sizes so you snap them together fast. Rare bricks ("antidisestablishmentarianism") get assembled out of multiple smaller pieces. The vocabulary is the catalog of available bricks — typically 50,000 to 200,000 entries.

How merges are learned (BPE in plain terms)

Most modern tokenizers are built with a process called Byte-Pair Encoding (BPE). The algorithm is dumber than it sounds:

  1. Start with the rawest possible vocabulary: every individual character.
  2. Look at all the text in your training corpus. Find the most common pair of adjacent characters.
  3. Add that pair as a new token. From now on, treat it as one unit.
  4. Repeat — find the next most common pair (now potentially involving your new token).
  5. Stop when you've added enough tokens to hit your target vocabulary size.

After thousands of iterations, you end up with a vocabulary where high-frequency patterns are single tokens and low-frequency ones decompose. No human chose them; they emerge from corpus statistics.
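The five steps above fit in a few dozen lines. This is a minimal sketch of BPE training on a toy corpus, not a production tokenizer (real implementations work on bytes, handle pre-tokenization rules, and are heavily optimized):

```python
from collections import Counter

def bpe_train(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges from a tiny corpus. Words start as character
    sequences; each merge fuses the most frequent adjacent pair."""
    # Represent each word as a tuple of symbols, weighted by frequency.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the whole corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

merges = bpe_train("the thin thing the thing the", 3)
print(merges)  # "t"+"h" wins first; new merges build on earlier ones
```

Note how the second merge already uses the token created by the first, which is exactly how multi-character tokens like "the" emerge.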

| Iteration | Pair found | New token | Why |
|---|---|---|---|
| 1 | "t" + "h" | "th" | Most common digraph in English |
| 2 | "th" + "e" | "the" | "the" is everywhere |
| 3 | "i" + "n" | "in" | Very common |
| 4 | "in" + "g" | "ing" | Suffix everywhere |
| 50,000 | "establish" + "ment" | "establishment" | Frequent enough to deserve its own ID |

The result: "the" is one token (cheap), "establishment" might be one token (cheap), and "antidisestablishmentarianism" splits into something like ["anti", "dis", "establishment", "arian", "ism"] — five tokens, but built from familiar parts.
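You can reproduce that split with a greedy longest-match segmenter over a hypothetical vocabulary (real BPE replays its merges in learned order, which usually lands in the same place; this is a simpler stand-in):

```python
def segment(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match segmentation: at each step, take the longest
    vocabulary entry that prefixes the remaining text."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:  # no match at all: fall back to a single character
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary containing common English pieces.
vocab = {"anti", "dis", "establishment", "arian", "ism", "the"}
print(segment("antidisestablishmentarianism", vocab))
# → ['anti', 'dis', 'establishment', 'arian', 'ism']
```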

The tokenizer playground

(Interactive demo: type any text to see the token sequence a model would actually receive.)

Why tokenization is more interesting than it sounds

Tokenization is one of those topics that feels mechanical until it bites you. A few cases worth internalizing:

Numbers are often broken

A typical tokenizer might encode 1234 as a single token (because that exact sequence appeared often enough during BPE training) but 1235 as three: ["12", "3", "5"]. To the model, those two numbers don't look at all alike. This is part of why arithmetic is hard for LLMs: the input representation actively obscures numeric structure.
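The same greedy-match idea makes the asymmetry concrete. The vocabulary below is invented for illustration: "1234" happens to have earned its own token during training while "1235" did not:

```python
def segment(text, vocab):
    # Greedy longest-match over a toy vocabulary (illustrative only).
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                out.append(text[i:j]); i = j; break
        else:
            out.append(text[i]); i += 1
    return out

# Hypothetical vocab: "1234" was frequent enough to merge; "1235" wasn't.
vocab = {"1234", "12", "1", "2", "3", "4", "5"}
print(segment("1234", vocab))  # → ['1234']          one token
print(segment("1235", vocab))  # → ['12', '3', '5']  three tokens
```

Two numbers that differ by one are represented by token sequences of completely different lengths and IDs.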

Multilingual cost is asymmetric

English benefits massively from BPE — it's what the vocabulary was tuned on. Other languages, especially non-Latin scripts, often pay a "tokenization tax" of 2–5× more tokens for the same semantic content. "Hello, how are you?" might be 5 tokens in English; the equivalent Hindi sentence in Devanagari script could be 25. Per-token pricing turns directly into per-language pricing.
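Part of the tax is visible before any merges even apply. Byte-level tokenizers (GPT-2 style) start from UTF-8 bytes, and Devanagari characters cost three bytes each versus one for ASCII; with fewer merges learned for Hindi, much of that raw cost survives. The Hindi sentence here is a rough equivalent chosen for illustration:

```python
english = "Hello, how are you?"
hindi = "नमस्ते, आप कैसे हैं?"  # rough Hindi equivalent, for illustration

for text in (english, hindi):
    chars = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    # Byte-level tokenizers see the UTF-8 bytes, not the characters.
    print(f"{chars} chars -> {utf8_bytes} UTF-8 bytes: {text}")
```

The Hindi string is about the same length in characters but roughly three times the length in bytes, which is the floor the tokenizer's merges have to work down from.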

Code has its own dialect

Tokenizers trained mostly on prose handle code awkwardly: function might be a single token (good), but HashMap<String, Integer> could be a dozen tokens (bad). Models marketed for code (Code Llama, GPT-4 family) usually have vocabularies trained on more code, which makes them not just smarter at code but cheaper to run on it.

Whitespace and capitalization matter

Most tokenizers treat " the" (with leading space) as a different token than "the" (no space), because they often appear in different contexts. "The", "the", and "THE" are usually three separate tokens too. This is invisible to humans but very visible to the model.

Embeddings — turning IDs into geometry

A token ID like 1820 isn't useful to the math directly. The model needs each token represented as a vector — a list of (typically) 1024 to 8192 numbers. That vector is the token's embedding, and it's looked up from a giant table called the embedding matrix.

The embedding matrix is a giant phone book where every token's "phone number" is a list of thousands of numbers. The phone numbers aren't random — they're learned during training such that tokens with related meanings end up with phone numbers that point to nearby places in a high-dimensional space.
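Mechanically, the lookup is just row indexing into a big matrix. A minimal sketch with random placeholder values (in a real model the matrix is learned, and the token IDs below are hypothetical):

```python
import numpy as np

vocab_size, d_model = 50_000, 1024   # typical orders of magnitude
rng = np.random.default_rng(0)

# The embedding matrix: one row per token ID. In a real model these
# values come from training; here they're random placeholders.
embedding_matrix = rng.normal(size=(vocab_size, d_model)).astype(np.float32)

token_ids = np.array([1820, 318, 257])   # hypothetical IDs for a short prompt
vectors = embedding_matrix[token_ids]    # the "phone book" lookup: pure indexing

print(vectors.shape)  # (3, 1024): one d_model-sized vector per token
```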

The famous example: in the embedding space of GPT-2 and similar models, the vector arithmetic king − man + woman lands close to queen. Meaning has become geometry. This isn't programmed in — it's an emergent property of training the model to predict the next token across enormous amounts of text.
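You can see the mechanics in a deliberately tiny space. These 2-D vectors are hand-crafted along axes you could gloss as (royalty, gender); real learned spaces have thousands of uninterpretable dimensions, but the arithmetic works the same way:

```python
import numpy as np

# Hand-crafted toy "embeddings", not learned vectors.
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

target = emb["king"] - emb["man"] + emb["woman"]
# Find the vocabulary word whose vector is closest to the result.
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # → queen
```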

The "phone book" analogy breaks because embeddings are continuous, not discrete. Two embeddings can be arbitrarily close. Also: real embedding spaces have hundreds or thousands of dimensions; you can't visualize them directly. Any 2D plot projects everything down to two axes, losing most of the structure.

Position — how the model knows what came first

Here's a gap that most explanations gloss over: by themselves, embeddings carry no information about word order. "Dog bites man" and "man bites dog" would produce identical sets of embeddings — same words, same vectors. The model would have no way to tell which word came first.
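A quick sanity check of that claim: treat each sentence as its bag of embedding vectors and compare an order-insensitive summary (the sum). The vectors are random placeholders standing in for learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random placeholder embeddings, one vector per word.
emb = {w: rng.normal(size=8) for w in ["dog", "bites", "man"]}

a = [emb[w] for w in ["dog", "bites", "man"]]
b = [emb[w] for w in ["man", "bites", "dog"]]

# Same words, same vectors: any order-insensitive view is identical.
print(np.allclose(sum(a), sum(b)))  # → True
```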

So position information has to be injected separately. Several schemes exist; modern decoder-only LLMs almost universally use one called RoPE (Rotary Position Embedding).

RoPE in plain terms

RoPE works by rotating each token's query and key vectors by an angle that depends on the token's position. Same word at position 1 versus position 50 ends up rotated differently. When attention later compares two tokens (by computing the dot product of their Q and K), the rotation difference encodes the relative distance between them.

Imagine each embedding is a tiny arrow. RoPE rotates each arrow clockwise a little bit for each step forward in the sequence. The arrow for "cat" at position 0 points up; the arrow for "cat" at position 50 points off to the side. Attention compares arrows by angle — so it can read "these two arrows are 50 steps of rotation apart, meaning the tokens are 50 positions apart."
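The arrow picture is easy to make concrete in two dimensions. A minimal sketch (the angle step `theta` is an arbitrary choice here): rotating queries and keys by position-dependent angles makes their dot product depend only on the offset between positions, not on the positions themselves.

```python
import numpy as np

def rotate(vec, pos, theta=0.1):
    """Rotate a 2-D vector by pos * theta radians — RoPE applied to a
    single dimension pair. Real RoPE rotates many pairs at once, each
    with its own frequency."""
    angle = pos * theta
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * vec[0] - s * vec[1],
                     s * vec[0] + c * vec[1]])

q = np.array([1.0, 0.0])   # a query vector for one token
k = np.array([1.0, 0.0])   # a key vector for another token

# The attention score q·k depends only on the relative offset:
score_a = rotate(q, 5) @ rotate(k, 2)     # positions 5 and 2
score_b = rotate(q, 53) @ rotate(k, 50)   # positions 53 and 50
print(np.isclose(score_a, score_b))  # → True: both offsets are 3
```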

This trick has nice properties:

  - Attention scores depend only on the relative distance between two tokens, not their absolute positions — which is usually what language needs.
  - There are no extra learned parameters: the rotation is a fixed function of position.
  - Because position enters as an angle, the scheme can be stretched or interpolated, which is the basis of most context-window extension tricks.

The "rotating arrow" picture suggests one rotation per token. In practice RoPE rotates pairs of dimensions of the embedding by different frequencies — fast oscillations in some dimensions, slow in others. This lets the model encode positions across a wide range of distances. The single 2D arrow is a simplification for visualization.

Multimodal models — vision and audio LLMs replace text tokenization with something analogous. An image becomes a grid of patches (e.g., 16×16-pixel squares), each passed through a small encoder that produces a vector; audio gets chunked similarly. Those vectors enter the same transformer downstream, so everything in the rest of this artifact still applies: only the architecture upstream of the residual stream changes.
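The patch-cutting step is a pure reshape. A sketch with a dummy image, using the common ViT-style 16×16 patch size (the zeros stand in for real pixel values):

```python
import numpy as np

# A dummy 224×224 RGB image; zeros stand in for pixel values.
image = np.zeros((224, 224, 3), dtype=np.float32)
P = 16  # patch size: 16×16 pixels, as in ViT-style encoders

H, W, C = image.shape
# Cut the image into a grid of non-overlapping P×P patches,
# then flatten each patch into one vector.
patches = (image.reshape(H // P, P, W // P, P, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, P * P * C))
print(patches.shape)  # → (196, 768): a 14×14 grid, each patch a 768-dim vector
```

Each of those 196 vectors then plays the role a token embedding plays for text.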