From tokenization to decoding, the full forward pass
The model gets 2 token vectors, not 10 letter vectors. Letter counting requires reconstructing characters from subword pieces.
Used in the next three slides.
IDs shown are illustrative; exact values depend on the tokenizer.
Frequent stems become single tokens. Hyphens, prefixes, and suffixes become reusable pieces, so words stay decomposable rather than unknown.
Short, but brittle. Breaks on hyphens, new jargon, and multilingual text.
Full coverage, but very long. Quadratic attention cost makes it impractical.
Near-word efficiency with no unknown tokens. English prose averages about 0.75 words per token.
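As a quick sanity check in code, here is a minimal sketch comparing the three granularities on one word; the tiktoken package and the cl100k_base encoding are assumptions on my part, and the exact pieces depend on the tokenizer.

```python
# Sketch: word-, character-, and subword-level views of the same string.
# Assumes the tiktoken package; the exact split depends on the encoding.
import tiktoken

text = "strawberry"
enc = tiktoken.get_encoding("cl100k_base")

word_units = text.split()                       # word level: 1 unit, brittle on unseen words
char_units = list(text)                         # character level: 10 units, full coverage
subword_ids = enc.encode(text)                  # subword level: a handful of IDs
subword_pieces = [enc.decode([i]) for i in subword_ids]

print(len(word_units), len(char_units), len(subword_ids))   # 1 10 2 (for this encoding)
print(subword_pieces)                           # e.g. ['str', 'awberry']: two tokens, ten letters
```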
We start with a character-level vocabulary. Now let the corpus teach us which adjacent pairs deserve their own tokens.
On the last slide, the fallback was character-level text. BPE starts there too: count adjacent pairs in the corpus, merge the most frequent pair, and repeat. Those learned merges become the vocabulary.
BPE is driven by frequency, not grammar. That is why stems and endings like -ing and -ed can emerge naturally when they keep showing up together.
Start at characters, count adjacent pairs, and let repetition decide what gets promoted.
There is a tie at the top. We will follow the hug stem first, then come back for reusable endings.
Same corpus, one level up: hu + g is still the most common pair.
The stem is built. Next we let shared endings compete.
One more merge turns a recurring ending into a reusable chunk.
Gray = base characters · Blue = stem-building merges · Green = ing · Orange = ed
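The merge loop described above, as a minimal runnable sketch; the toy corpus and the number of merges are my own illustrative choices, not the slides' exact example.

```python
# Minimal BPE training loop: start from characters, repeatedly merge the
# most frequent adjacent pair. The toy corpus below is illustrative only.
from collections import Counter

corpus = {("h","u","g"): 10, ("h","u","g","s"): 5,
          ("h","u","g","g","i","n","g"): 4, ("h","u","g","g","e","d"): 3}

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def apply_merge(word, pair):
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1]); i += 2
        else:
            out.append(word[i]); i += 1
    return tuple(out)

merges = []
for _ in range(4):                       # how many merges to learn is a free choice
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = {apply_merge(w, pair): f for w, f in corpus.items()}

print(merges)   # [('h', 'u'), ('hu', 'g'), ...]: the stem gets assembled first
```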
The model computes on token vectors, not character vectors. When one token covers many letters, character-level reasoning must be reconstructed from subword pieces.
Cosma et al., "The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models", EMNLP 2025.
A token ID is simply a row index in a learned embedding matrix.
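A minimal sketch of that lookup; the sizes and IDs below are illustrative, not the slides' exact values.

```python
import numpy as np

vocab_size, d_model = 50_000, 8            # toy sizes; real models use d_model in the thousands
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, d_model)).astype(np.float32)  # learned during training

token_ids = [17, 4521, 17]                 # illustrative IDs from the tokenizer
x = embedding[token_ids]                   # each ID simply selects a row
print(x.shape)                             # (3, 8): one d_model-dimensional vector per token
# The same ID always returns the same row; context only enters later, through attention.
```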
Each token is one high-dimensional vector. These plots are different 2D views of the same space.
The underlying vectors do not move. Only the 2D projection changes, revealing different relationships.
At input, both sentences start from the same bank embedding.
Each token updates its representation using the rest of the sequence.
Embeddings are static; attention makes them context-aware.
Project the same token state into three learned subspaces.
Compare \(q_{\mathrm{sat}}\) with every key \(k_j\).
Scale the scores, then softmax them into weights.
Mix value vectors with attention weights, then add back \(x_{\mathrm{sat}}\).
Stack tokens into matrices, then compute the full sequence at once.
Different heads attend in different learned subspaces, then their outputs are recombined.
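A minimal NumPy sketch of the whole computation, heads included; the sizes are toys, there is no masking or positional signal, and the reshape convention is just one common choice.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

S, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
rng = np.random.default_rng(0)

X = rng.normal(size=(S, d_model))                      # one row per token (the residual stream)
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def split_heads(M):                                    # (S, d_model) -> (n_heads, S, d_head)
    return M.reshape(S, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, S, S): every query vs every key
weights = softmax(scores, axis=-1)                     # each row sums to 1 within each head
heads = weights @ V                                    # (n_heads, S, d_head): mixtures of values

concat = heads.transpose(1, 0, 2).reshape(S, d_model)  # put the heads side by side again
out = X + concat @ W_o                                 # recombine heads, then add the residual
print(out.shape)                                       # (5, 16)
```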
Without positional signals, self-attention is permutation equivariant: reorder the input token rows, and the outputs reorder the same way.
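A quick numerical check of that claim, using a stripped-down single-head attention helper (no positions, no residual); sizes are toys.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):            # bare single-head attention, no positional signal
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

S, d = 5, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(S, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(S)                   # shuffle the token rows
out = attention(X, W_q, W_k, W_v)
out_perm = attention(X[perm], W_q, W_k, W_v)
print(np.allclose(out[perm], out_perm))     # True: the outputs shuffle in exactly the same way
```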
Attention needs order. Build a position signal from first principles.
Start with the requirements, then choose the formula.
A raw index is unbounded, varies at only one rate, and its scale depends on sequence length.
Binary encoding keeps values bounded and multi-rate, but nearby positions can still change abruptly (many bits flip at once).
Different dimensions vary at different speeds, but nearby positions stay close.
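A minimal sketch of the sinusoidal construction; the 10000 base follows the original Transformer formulation, and the sizes are toys.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model, base=10_000.0):
    # Even dims get sin, odd dims get cos; each dimension pair oscillates at its own rate.
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model // 2)
    angles = pos / base ** (2 * i / d_model)          # low dims vary quickly, high dims slowly
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(seq_len=128, d_model=16)
print(pe.shape)     # (128, 16): values stay in [-1, 1], and neighboring positions stay close
```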
Absolute position says where a token sits. Attention scores often need the relative distance between the tokens being compared.
Rotate \(Q\) and \(K\) by absolute position; the score ends up depending on \((m-n)\).
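A minimal sketch of that rotation and of the relative-offset property; this assumes the standard pairwise-rotation form of RoPE, with toy sizes and base.

```python
import numpy as np

def rope(x, positions, base=10_000.0):
    # Rotate consecutive dimension pairs of each vector by an angle proportional
    # to its absolute position; each pair rotates at its own rate.
    d = x.shape[-1]
    theta = positions[:, None] / base ** (2 * np.arange(d // 2) / d)   # (S, d/2) angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

S, d_head = 6, 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=(S, d_head)), rng.normal(size=(S, d_head))
pos = np.arange(S, dtype=float)

scores = rope(q, pos) @ rope(k, pos).T
shifted = rope(q, pos + 100) @ rope(k, pos + 100).T   # shift every absolute position by 100
print(np.allclose(scores, shifted))                   # True: only the offset m - n matters
```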
Attention shares information across tokens. The FFN then updates each token independently.
Per block, attention is roughly \(4d_{\mathrm{model}}^2\) parameters, while the FFN is roughly \(8d_{\mathrm{model}}^2\). That is why, in large dense LLMs, the FFN usually outweighs attention in total parameters.
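A quick worked example with illustrative sizes: at \(d_{\mathrm{model}} = 4096\), the four attention projections \(W_Q, W_K, W_V, W_O\) contribute about \(4 \times 4096^2 \approx 67\)M parameters per block, while an FFN that expands to \(4d_{\mathrm{model}}\) and back contributes about \(2 \times 4096 \times 16384 \approx 134\)M.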
Same diagram as slide 3. This time, every box has a clear role.
The block repeats \(L\) times. The shape stays \(S \times d_{\mathrm{model}}\); each layer further refines the residual stream.
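A shape-level sketch of the stack; the block body here is a placeholder standing in for attention plus FFN, and the sizes are toys.

```python
import numpy as np

S, d_model, L = 5, 16, 4                  # toy sizes
rng = np.random.default_rng(0)
x = rng.normal(size=(S, d_model))         # residual stream after embeddings + positions

def block(x):
    # Placeholder for (attention sublayer) then (FFN sublayer), each with a residual add.
    return x + 0.1 * rng.normal(size=x.shape)

for _ in range(L):
    x = block(x)                          # the shape never changes
print(x.shape)                            # still (5, 16): S x d_model after every layer
```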
Take the last hidden state, score the vocabulary, form a distribution, then decode one token.
The model outputs one distribution. Decoding determines how that distribution becomes an actual token.
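A minimal sketch of that last step; the unembedding matrix, vocabulary size, and temperature value here are illustrative, and weight tying is ignored.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d_model, vocab_size = 16, 1_000                   # toy sizes
rng = np.random.default_rng(0)
h_last = rng.normal(size=d_model)                 # hidden state of the final position
W_unembed = rng.normal(size=(d_model, vocab_size))

logits = h_last @ W_unembed                       # one score per vocabulary entry
probs = softmax(logits)                           # the single distribution the model outputs

greedy_id = int(np.argmax(probs))                                  # always the top token
sampled_id = int(rng.choice(vocab_size, p=softmax(logits / 0.8)))  # temperature sampling
print(greedy_id, sampled_id)
```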
This is the full loop.