Day 2

How a Transformer
Computes the Next Token

From tokenization to decoding, the full forward pass

Roadmap

Workshop Roadmap

Day 1 — Completed

Why LLMs Behave as They Do

  • Inference vs. training
  • Next-token prediction
  • Loss → gradients → updates
  • Why this objective works
  • Data, scale, and alignment
i Day 1: the model as a system shaped by training.
Day 2 — Today

How a Transformer Processes Text and Predicts the Next Token

  • Tokenization & embeddings
  • Self-attention
  • Feed-forward layers
  • Full forward pass end-to-end
i Day 2: the model as a machine executing computation.
Architecture

Decoder-Only Transformer: High-Level View

Output
Softmax
Linear
LayerNorm
Transformer Block
Layer L
Transformer Block
Layer 2
Transformer Block
Layer 1
× L
Dropout
Positional
Encoding
+
Input Embedding
Input tokens
i Each block writes into the residual stream, which accumulates across all layers.
Tokenization

How Many r's Are in "Strawberry"?

strawberry
You see: 3 r's
Model answer: 2 r's

What the Model Sees

You type
s t r a w b e r r y
Model sees
[straw] [berry]

The model gets 2 token vectors, not 10 letter vectors. Letter counting requires reconstructing characters from subword pieces.

i This is structural, not a bug. The next slides show where the limitation comes from.
Tokenization

Before Reading, the Model Tokenizes

Raw text
Tokenizer
Token IDs
Embeddings

Three Steps

  • Define a fixed vocabulary of discrete symbols.
  • Split input text into pieces from that vocabulary.
  • Map each piece to an integer ID.
i The vocabulary is learned once, then frozen. GPT-2 uses 50,257 tokens; GPT-4-class tokenizers are roughly 100k.

Running Example

Used in the next three slides.

Sentence
"researchers fine-tune large-scale models"
Subword tokenization
[research] [ers] [fine] [-] [tune] [large] [-] [scale] [model] [s]
[3245, 891, 4680, 12, 10812, 3267, 12, 5649, 219, 82]

IDs shown are illustrative; exact values depend on the tokenizer.
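In code, the three steps collapse into a single encode call. Below is a minimal sketch, assuming the tiktoken library and its GPT-2 vocabulary are available; the exact splits and IDs depend on that tokenizer, as noted above.

```python
# Sketch of the three steps with a real BPE tokenizer (assumes `tiktoken` is installed).
import tiktoken

enc = tiktoken.get_encoding("gpt2")            # fixed, frozen vocabulary of 50,257 tokens

text = "researchers fine-tune large-scale models"
ids = enc.encode(text)                         # split into vocabulary pieces, map each to an integer ID
pieces = [enc.decode([i]) for i in ids]        # decode each ID back to its surface piece

print(ids)                                     # a list of integer IDs (exact values depend on the vocabulary)
print(pieces)                                  # the subword pieces: stems, suffixes, punctuation
assert enc.decode(ids) == text                 # the round trip reproduces the original text
```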

Word-Level

Approach 1: Word-Level Tokenization

Running Sentence: Word-Level Tokenization

If every word is in the vocabulary:
[researchers] [fine-tune] [large-scale] [models]
But compounds like fine-tune and large-scale are often missing, so they collapse to unknown tokens:
[researchers] [UNK] [UNK] [models]

Strengths

  • Short sequences for familiar text.
  • Common words stay intact as single units.

Limits

  • Language is open-vocabulary. English Wikipedia alone has ~1.5M unique word forms.
  • New terms like "GPT-4", "LoRA", and "RLHF" create unknowns.
! Efficient, but brittle. Anything outside the fixed vocabulary becomes unknown.
Character-Level

Approach 2: Character-Level Tokenization

Running Sentence: Character-Level Tokenization

[r][e][s][e][a][r][c][h][e][r][s] [ ] [f][i][n][e][-][t][u][n][e] [ ] [l][a][r][g][e][-][s][c][a][l][e] [ ] [m][o][d][e][l][s]
Word-level (best case)
4 tokens
Character-level (same sentence)
~40 tokens
Scaling: 500-word passage
~2,500 character tokens

Strengths

  • No unknown tokens — ever.
  • Handles new words, typos, code, any language.

Limits

  • Sequences become \(5\text{–}10\times\) longer.
  • Self-attention is \(O(n^2)\), so \(5\times\) longer means roughly \(25\times\) more compute.
  • The model spends capacity reassembling words from letters on every forward pass.
! Perfect coverage, impractical cost. Quadratic attention scaling makes it too expensive at realistic sequence lengths.
Subword Strategy

Approach 3: Subword Tokenization, the Practical Default

Running Sentence: Subword Tokenization

[research] [ers] [fine] [-] [tune] [large] [-] [scale] [model] [s]

Frequent stems become single tokens. Hyphens, prefixes, and suffixes become reusable pieces, so words stay decomposable rather than unknown.

Word-level

4 tokens

Short, but brittle. Breaks on hyphens, new jargon, and multilingual text.

Character-level

~40 tokens

Full coverage, but very long. Quadratic attention cost makes it impractical.

Subword

~10 tokens

Near-word efficiency with no unknown tokens. English prose averages about 0.75 words per token.

+ Near-word efficiency with broad coverage. That is why subword tokenization became the modern default.
i One tradeoff remains: characters inside a token are no longer individually visible. This matters again at the end of the section.
BPE

How the Tokenizer Learns These Pieces

We start with a character-level vocabulary. Now let the corpus teach us which adjacent pairs deserve their own tokens.

Start: characters
Count adjacent pairs
Merge most frequent
Repeat N times

Core Rule

On the previous slide, the fallback was character-level text. BPE starts there too: count adjacent pairs in the corpus, merge the most frequent pair into a new token, and repeat. Those learned merges become the vocabulary.

Why BPE Works

BPE is driven by frequency, not grammar. That is why stems and endings like -ing and -ed can emerge naturally when they keep showing up together.

i BPE is a frequency-based compression algorithm. It builds the vocabulary the data demands, not the one a linguist would design.
? Watch BPE build up from characters on a tiny action corpus.
BPE Walkthrough

BPE on a tiny action corpus

Corpus state · Vocabulary built so far
Start from a character-level vocabulary: every word is just letters (</w> marks word end)
h u g </w> h u g s </w> h u g g e d </w> h u g g i n g </w> s m i l e d </w> s m i l i n g </w> w a v e d </w> w a v i n g </w>

Start at characters, count adjacent pairs, and let repetition decide what gets promoted.

Base vocabulary
h u g s e d i n m l w a v
h u → 4
u g → 4
i n → 3
e d → 3

There is a tie at the top. We will follow the hug stem first, then come back for reusable endings.

Merge 1: h + u → hu  (freq: 4)
hu g </w> hu g s </w> hu g g e d </w> hu g g i n g </w> s m i l e d </w> s m i l i n g </w> w a v e d </w> w a v i n g </w>
Merged chunks so far
hu
hu g → 4
i n → 3
e d → 3
n g → 3

Same corpus, one level up: hu + g is still the most common pair.

Merge 2: hu + g → hug  (freq: 4)
hug </w> hug s </w> hug g e d </w> hug g i n g </w> s m i l e d </w> s m i l i n g </w> w a v e d </w> w a v i n g </w>
Merged chunks so far
hu hug
i n → 3
e d → 3
n g → 3
hug g → 2

The stem is built. Next we let shared endings compete.

Merge 3: i + n → in  (freq: 3)
hug </w> hug s </w> hug g e d </w> hug g in g </w> s m i l e d </w> s m i l in g </w> w a v e d </w> w a v in g </w>
Merged chunks so far
hu hug in
in g → 3
e d → 3
hug g → 2
l in → 1

One more merge turns a recurring ending into a reusable chunk.

Merge 4: in + g → ing  (freq: 3)
hug </w> hug s </w> hug g e d </w> hug g ing </w> s m i l e d </w> s m i l ing </w> w a v e d </w> w a v ing </w>
Merged chunks so far
hu hug in ing
e d → 3
hug g → 2
l ing → 1
v ing → 1
! Aha: ing just became reusable. Three different words now share the same ending token.
Merge 5: e + d → ed  (freq: 3)
hug </w> hug s </w> hug g ed </w> hug g ing </w> s m i l ed </w> s m i l ing </w> w a v ed </w> w a v ing </w>
Reusable endings unlocked
hu hug in ing ed
! Now ed joins the vocabulary too. The tokenizer is learning reusable endings, not memorizing whole words.
BPE Takeaway

BPE builds a vocabulary of reusable chunks

Vocabulary After These Merges

Base characters
h u g s e d i n m l w a v
Stem-building chunks
hu hug in
Reusable suffix chunks
ing ed


Reusable Chunks in Action

hugs → [hug] [s]
hugging → [hug] [g] [ing]
hugged → [hug] [g] [ed]
smiling → [s] [m] [i] [l] [ing]
waved → [w] [a] [v] [ed]
i The tokenizer has now learned reusable endings like ing and ed, so common word forms can reuse chunks instead of being rebuilt from scratch every time.
+ BPE starts from characters and promotes repeated pairs into reusable chunks. That is how pieces like hug, ing, and ed enter the vocabulary.
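The walkthrough above fits in a few lines of code. Below is a toy version of the BPE training loop on the same action corpus; word-end markers and tie-breaking are simplified, so the merge order can differ slightly from the slides even though the same chunks emerge.

```python
from collections import Counter

# Toy BPE training on the action corpus (word-end markers omitted for brevity).
corpus = ["hug", "hugs", "hugged", "hugging",
          "smiled", "smiling", "waved", "waving"]
words = [tuple(w) for w in corpus]             # start from a character-level vocabulary

def count_pairs(words):
    """Count every adjacent symbol pair across the corpus."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs

def merge_pair(words, pair):
    """Replace each occurrence of `pair` with one merged symbol."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(tuple(out))
    return merged

learned = []
for step in range(5):                          # the five merges shown in the walkthrough
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)           # most frequent adjacent pair wins
    words = merge_pair(words, best)
    learned.append("".join(best))
    print(f"merge {step + 1}: {best} -> {''.join(best)} (freq {pairs[best]})")

print(learned)                                 # hu, hug, ed, in, ing (order depends on tie-breaking)
```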
Why It Matters

Why Tokenization Shapes Reasoning

Core Mechanism

The model computes on token vectors, not character vectors. When one token covers many letters, character-level reasoning must be reconstructed from subword pieces.

"strawberry" [straw] [berry] 2 token vectors. No separate letter vectors.

Cosma et al., "The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models", EMNLP 2025.

Where It Breaks

  • Counting specific letters in a word.
  • Extracting the nth character from a string.
  • Detecting repeated characters.
  • Reversing a string exactly.

Why It Works So Well

  • Subword pieces let the model handle unseen words by composing familiar parts instead of needing every word in the vocabulary.
  • Sequences stay much shorter than character-level, which makes attention cheaper and preserves more context.
  • The vocabulary stays much smaller than word-level, reducing the size of the embedding and output layers.
  • That tradeoff is why subword tokenization is the standard choice: efficient, flexible, and good enough for most language-understanding tasks.
+ Tokenization defines the vocabulary, shapes sequence length, and sets the granularity of every downstream computation.
Mechanism

From token IDs to vectors

A token ID is simply a row index in a learned embedding matrix.

\[ T\in\{1,\dots,|V|\}^{n},\quad X^{(0)}=E[T]\in\mathbb{R}^{n\times d_{\text{model}}} \]
Token IDs: [812, 1452, 5021]
Row lookup in \(E\)
Output: \(x_1,x_2,x_3 \in \mathbb{R}^{d_{\text{model}}}\)

Three Retrieved Rows

\(E[812,:]\) → "river": [+0.14, -0.77, +0.32, ...]
\(E[1452,:]\) → "bank": [+0.11, -0.70, +0.28, ...]
\(E[5021,:]\) → "loan": [-0.62, +0.04, +0.80, ...]

Scale of the Full Table

\[ E \in \mathbb{R}^{|V|\times d_{\text{model}}} \]
Examples:
GPT-2 small: \(|V| = 50{,}257,\quad d_{\text{model}} = 768\)
\(\lvert E \rvert = 50{,}257 \times 768 = 38{,}597{,}376\)

GPT-4 tokenizer: \(|V| \approx 100{,}000\)
\(\text{Exact } d_{\text{model}} \text{ is not public, but the table still scales as } |V| \times d_{\text{model}}\)
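A minimal sketch of the lookup itself, using NumPy with illustrative GPT-2-small sizes; the table values are random stand-ins here, whereas in a trained model they are learned.

```python
import numpy as np

V, d_model = 50_257, 768                       # GPT-2 small vocabulary size and width

# The embedding table E: one trainable row per token type (random stand-in here).
E = np.random.randn(V, d_model).astype(np.float32)

# A token ID is simply a row index: X0 = E[T].
T = np.array([812, 1452, 5021])                # illustrative IDs for "river", "bank", "loan"
X0 = E[T]                                      # shape (n, d_model) = (3, 768)

print(E.size)                                  # 38,597,376 parameters in the table
print(X0.shape)                                # one d_model-dimensional vector per input token
```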
Key Properties

What Embeddings Provide, and What They Still Lack

What Embeddings Already Provide

  • Learned table \(E \in \mathbb{R}^{|V| \times d_{\text{model}}}\): one trainable row per token type.
  • Often one of the largest parameter blocks, scaling with \(|V| \times d_{\text{model}}\).
  • Outputs \(X^{(0)}\) already match model width — later blocks consume them directly.
  • Useful structure appears as directions, neighborhoods, and relative offsets.

What Later Layers Must Add

  • Sense is unresolved: one row can support multiple intended meanings.
  • No token-token interaction yet — rows cannot use neighboring words.
  • Row identity alone does not specify sentence role or dependency structure.
  • Later layers must refine these vectors into context-dependent states.
Different projections reveal different relationships; context resolves the intended meaning.
Geometry

One Embedding Space, Multiple Semantic Projections

Each token is one high-dimensional vector. These plots are different 2D views of the same space.

Change the Lens, Not the Space

Active lens: Pet / companionship
This view emphasizes: pet and companionship features
Nearest to cat in this view:
dog hamster vet
\(z^{(\mathrm{lens})} = P_{\mathrm{lens}}x,\; P_{\mathrm{lens}}\in\mathbb{R}^{2\times d_{\text{model}}}\)

The underlying vectors do not move. Only the 2D projection changes, revealing different relationships.

Bridge

Why embeddings are not enough

Same Row, Different Meaning

The fisherman sat on the bank of the river.
She applied for a loan at the bank.
\[ h_{\text{bank}}^{(0)} = E[\text{bank}] \]

At input, both sentences start from the same bank embedding.

  • The row tells us which token this is.
  • The neighboring words tell us which sense is intended.
  • Attention is the mechanism that mixes in those neighbors.
Attention lets a token read from surrounding words.
Mechanism

Attention lets tokens read from each other

Each token updates its representation using the rest of the sequence.

Embeddings are static; attention makes them context-aware.

Attention now needs a scoring rule: what matters, and what gets copied forward?
Mechanism

Step 1 — Create \(Q\), \(K\), and \(V\) for each token

Project the same token state into three learned subspaces.

\(x_{\mathrm{sat}}\)
Attention creates three role-specific versions of each token state.
Mechanism

Step 2 — Compute attention scores

Compare \(q_{\mathrm{sat}}\) with every key \(k_j\)

\(q_{\mathrm{sat}}\)
who sat?
Embeddings (\(X\))
Queries (\(Q\))
Scores (\(\mathbf{s}\))
Keys (\(K\))
Values (\(V\))
Start from embeddings \(x_1, \ldots, x_n\).
Mechanism

Step 3 — From scores to attention weights

Scale the scores, then softmax them into weights.

RAW SCORES \(\mathbf{s}\)
query token: sat
SCALED SCORES \(\mathbf{z}\)
\(z_j = \frac{s_j}{\sqrt{d_k}}\)
Scaling keeps dot products in a stable range
ATTENTION WEIGHTS \(\mathbf{a}\)
\(\sum_j a_j = 1\)
Raw similarity scores compare \(q_{\mathrm{sat}}\) with each key \(k_j\): \(s_j = q_{\mathrm{sat}}^{\mathsf{T}} k_j\).
Mechanism

Step 4 — Mix values, then add the residual

Mix value vectors with the attention weights, then add back \(x_{\mathrm{sat}}\): \(\mathrm{out}_{\mathrm{sat}} = x_{\mathrm{sat}} + \sum_j a_j v_j\).

\(q_{\mathrm{sat}}\)
who sat?
Embeddings (\(X\))
Queries (\(Q\))
Scores (\(\mathbf{s}\))
Keys (\(K\))
Values (\(V\))
Start from the Step 3 attention weights for \(q_{\mathrm{sat}}\).
Mechanism

Attention in matrix form

Stack tokens into matrices, then compute the full sequence at once.

Token Matrix \(T\)
Embedding Matrix \(X\)
\(X \in \mathbb{R}^{S \times d}\)
Projected Matrices
Start from the same sequence: tokens and their embedding rows.
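A single-head sketch of that matrix computation in NumPy, with random weights, illustrative sizes, and the causal mask a decoder-only model uses; real implementations add multiple heads and an output projection (next slide).

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """One attention head over the whole sequence X of shape (S, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                     # project into query/key/value subspaces
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # compare every query with every key, then scale
    mask = np.triu(np.ones((len(X), len(X)), dtype=bool), k=1)
    scores[mask] = -np.inf                               # decoder-only: no looking at future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax: each row of weights sums to 1
    return weights @ V                                   # mix value vectors with the attention weights

S, d_model, d_k = 6, 768, 64                             # illustrative sizes
X = np.random.randn(S, d_model)
Wq, Wk, Wv = [np.random.randn(d_model, d_k) * 0.02 for _ in range(3)]
out = causal_self_attention(X, Wq, Wk, Wv)
print(out.shape)                                         # (6, 64): one context-mixed vector per token
```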
Mechanism

Multi-Head Attention

Different heads attend in different learned subspaces, then their outputs are recombined.

Multi-head attention lets different heads read the same sequence in parallel.
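A sketch of that split-and-recombine pattern, extending the single-head example above; the causal mask is omitted here to keep the example short, and all weights are random stand-ins.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Split d_model into n_heads subspaces, attend in each, then recombine with Wo."""
    S, d_model = X.shape
    d_head = d_model // n_heads
    # Project, then reshape so each head gets its own slice of the width.
    Q = (X @ Wq).reshape(S, n_heads, d_head).transpose(1, 0, 2)     # (heads, S, d_head)
    K = (X @ Wk).reshape(S, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(S, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)             # (heads, S, S)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                  # per-head softmax
    heads = weights @ V                                             # each head mixes in its own subspace
    concat = heads.transpose(1, 0, 2).reshape(S, d_model)           # put the heads back side by side
    return concat @ Wo                                              # learned recombination

S, d_model, n_heads = 6, 768, 12
X = np.random.randn(S, d_model)
Wq, Wk, Wv, Wo = [np.random.randn(d_model, d_model) * 0.02 for _ in range(4)]
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads).shape)       # (6, 768)
```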
Problem

Self-Attention Compares Vectors, Not Word Order

Without positional signals, self-attention is permutation equivariant: reorder the input token rows, and the output rows reorder in exactly the same way.

The dog chased the cat
The dog is the chaser
Same tokens
The cat chased the dog
The cat is the chaser
Same tokens, different meaning.
Does attention know which word came first?
No. The layer is permutation equivariant without positions.
It only compares the token vectors it receives. If you permute those input rows, the same pairwise scores are computed between the same vectors, and the output rows are permuted in exactly the same way.
⚠️ Add position before attention so permuting tokens changes the vectors and breaks this symmetry.
Position must be added before attention so order changes the input.
Design Journey

Designing Positional Encoding

Attention needs order. Build a position signal from first principles.

1 Target
2 Integer
3 Binary
4 Sin/Cos
5 Relative
6 RoPE
Step 1 · Goals

What must a position code do?

Start with the requirements, then choose the formula.

Three design goals
  1. Distinct slots: slot 1 and slot 100 must look different.
  2. Stable meaning: position 5 should mean the same thing in short and long contexts.
  3. Local smoothness: nearby slots should get nearby codes.
token embedding \(x_i\) + position code \(p_i\) = input to attention \(h_i^{(0)}\)
Why this matters
The dog chased another dog
Both dog tokens share an embedding; position distinguishes them.
Good position codes are distinct, length-stable, and locally smooth.
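As a concrete instance of those goals, here is a sketch of the sin/cos code from step 4 of the journey (the fixed sinusoidal encoding from the original Transformer); the token embeddings below are random stand-ins.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sin/cos position code: every slot distinct, stable across lengths, smooth between neighbors."""
    pos = np.arange(seq_len)[:, None]                    # (S, 1) slot indices
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2) frequency indices
    freq = 1.0 / 10000 ** (2 * i / d_model)              # geometric range of wavelengths
    P = np.zeros((seq_len, d_model))
    P[:, 0::2] = np.sin(pos * freq)                      # even dimensions
    P[:, 1::2] = np.cos(pos * freq)                      # odd dimensions
    return P

S, d_model = 16, 768
X = np.random.randn(S, d_model)                          # token embeddings (random stand-in)
H0 = X + sinusoidal_positions(S, d_model)                # h_i^(0) = x_i + p_i, the input to attention
print(H0.shape)                                          # (16, 768)
```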
Mechanism

Attention Mixes, FFN Transforms

Attention shares information across tokens. The FFN then updates each token independently.

Attention
Mixes information across tokens
Cross-token context
FFN
Transforms each token independently
Per-token transformation
Per Token: Expand, Activate, Contract
token state
\(d_{\mathrm{model}}\)
expand 4×
\(W_1\)
GELU
activate
contract
\(W_2\)
updated state
\(d_{\mathrm{model}}\)
\(\mathrm{FFN}(h_i) = W_2\,\mathrm{GELU}(W_1 h_i + b_1) + b_2\)
The same small network runs at every position: shared weights, different inputs.
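A sketch of that per-token network in NumPy, written in row-vector form so `H @ W1` matches \(W_1 h_i\) in the formula above; the weights are random stand-ins for what training would learn.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation named above."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(H, W1, b1, W2, b2):
    """FFN(h_i) = W2 GELU(W1 h_i + b1) + b2, applied to every row of H independently."""
    return gelu(H @ W1 + b1) @ W2 + b2

d_model = 768
d_ff = 4 * d_model                                       # expand 4x, then contract back
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)

H = np.random.randn(6, d_model)                          # six token states
print(ffn(H, W1, b1, W2, b2).shape)                      # (6, 768): same shape, per-token update
```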
Parameter Reality

Per block, attention is roughly \(4d_{\mathrm{model}}^2\) parameters, while the FFN is roughly \(8d_{\mathrm{model}}^2\). That is why, in large dense LLMs, the FFN usually outweighs attention in total parameters.

Model        Total      Attention          FFN                Other
GPT-3 13B    12.95B     4.23B (32.6%)      8.46B (65.3%)      0.27B (2.1%)
GPT-3 175B   174.60B    57.99B (33.2%)     115.97B (66.4%)    0.65B (0.4%)
Other = token embeddings + positional embeddings + LayerNorm parameters.
Attention shares information across tokens. The FFN updates each token independently and usually holds most of the parameters.
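Those ratios are easy to re-derive. The sketch below redoes the arithmetic for GPT-3 175B (96 layers, \(d_{\mathrm{model}} = 12{,}288\)), ignoring biases and LayerNorm, and lands close to the table above.

```python
# Per-block parameter estimate, ignoring biases and LayerNorm (a rough approximation).
d_model = 12_288                              # GPT-3 175B width
n_layers = 96

attn_per_block = 4 * d_model ** 2             # W_Q, W_K, W_V, W_O: four d_model x d_model matrices
ffn_per_block = 8 * d_model ** 2              # W_1 (d_model x 4*d_model) + W_2 (4*d_model x d_model)

print(f"attention per block: {attn_per_block / 1e6:.0f}M")    # ~604M
print(f"FFN per block:       {ffn_per_block / 1e6:.0f}M")     # ~1208M
print(f"all {n_layers} blocks: {n_layers * (attn_per_block + ffn_per_block) / 1e9:.1f}B")  # ~173.9B
```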
Synthesis

One Transformer Block, Revisited

Same diagram as slide 3. This time, every box has a clear role.

Block Output \(R^{(\ell+1)}\)
+
Dropout
Feed-Forward Network
LayerNorm
+
Dropout
Multi-Head Attention
LayerNorm
Block Input \(R^{(\ell)}\)
Read the Block Bottom to Top
LayerNorm → attention → residual add → LayerNorm → FFN → residual add. Both sublayers write into the same residual stream.
LayerNorm
Normalizes each row before the sublayer reads it.
Multi-Head Attention
Attention mixes information across earlier tokens and writes a context update.
Residual Add (+)
Writes the attention update back into the same stream.
Feed-Forward Network
The same two-layer MLP updates each row independently.
Two update equations
\(R_{\mathrm{mid}} = R^{(\ell)} + \mathrm{MHA}(\mathrm{LN}(R^{(\ell)}))\)
\(R^{(\ell+1)} = R_{\mathrm{mid}} + \mathrm{FFN}(\mathrm{LN}(R_{\mathrm{mid}}))\)
A full model stacks this same Transformer block layer after layer.
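The two update equations translate directly into code. Below is a minimal sketch with a simplified LayerNorm (no learned scale or shift) and stand-in sublayers; swap in the attention and FFN sketches from earlier slides for the real computation.

```python
import numpy as np

def layer_norm(H, eps=1e-5):
    """Normalize each row to zero mean, unit variance (learned scale/shift omitted)."""
    return (H - H.mean(axis=-1, keepdims=True)) / np.sqrt(H.var(axis=-1, keepdims=True) + eps)

def transformer_block(R, attn, ffn):
    """One refinement pass: both sublayers write their updates into the residual stream."""
    R_mid = R + attn(layer_norm(R))          # R_mid  = R + MHA(LN(R))
    return R_mid + ffn(layer_norm(R_mid))    # R_next = R_mid + FFN(LN(R_mid))

# Stand-in sublayers so the sketch runs on its own (random linear maps).
d_model = 768
attn = lambda H: H @ (np.random.randn(d_model, d_model) * 0.01)
ffn  = lambda H: H @ (np.random.randn(d_model, d_model) * 0.01)

R = np.random.randn(6, d_model)              # residual stream: S x d_model
for _ in range(12):                          # "x L": GPT-2 small stacks 12 of these blocks
    R = transformer_block(R, attn, ffn)
print(R.shape)                               # (6, 768): same shape after every layer
```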
Concept

Same Block, More Depth

The block repeats \(L\) times. The shape stays \(S \times d_{\mathrm{model}}\); each layer further refines the residual stream.

\(R^{(L)}\) — next-token-ready context   \(S \times d_{\mathrm{model}}\)
Transformer Block
Layer L
Transformer Block
Layer 2
Transformer Block
Layer 1
× L layers
\(R^{(0)}\) — token + position embeddings   \(S \times d_{\mathrm{model}}\)
What Depth Means
Each layer runs the same transformer block again. The sequence shape stays fixed; the representation becomes more refined.
One block = one refinement pass — attention and FFN read the current stream and write back an update.
Each new block starts from the updated stream — later layers build on what earlier layers already produced.
Depth adds repeated computation — the model gets multiple chances to combine and refine information.
📖 GPT-2 small uses 12 layers. GPT-3 uses 96. Same block, more refinement passes.
Depth keeps the same sequence shape and refines meaning layer by layer.
Mechanism

Predict One Token, Append, Repeat

Take the last hidden state, score the vocabulary, form a distribution, then decode one token.

One full forward pass produces scores over the vocabulary.
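A sketch of that loop with greedy decoding; the model and unembedding matrix are random stand-ins for the stacked blocks from the previous slides, so the output IDs are meaningless but the control flow is the real one.

```python
import numpy as np

def generate(model, W_U, ids, eos_id, max_new_tokens=20):
    """Predict one token, append it, run the forward pass again."""
    for _ in range(max_new_tokens):
        H = model(ids)                         # full forward pass: (len(ids), d_model)
        logits = H[-1] @ W_U                   # last hidden state scores every vocabulary entry
        next_id = int(np.argmax(logits))       # greedy pick (other decoding rules: next slides)
        ids = ids + [next_id]                  # append and repeat with the longer context
        if next_id == eos_id:                  # stop once the end-of-sequence token appears
            break
    return ids

# Stand-ins so the sketch runs; a real model is the block stack from earlier slides.
V, d_model = 50_257, 768
W_U = np.random.randn(d_model, V) * 0.02
model = lambda ids: np.random.randn(len(ids), d_model)

print(generate(model, W_U, [812, 1452], eos_id=0))
```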
Bridge

From Probabilities to Behavior

The model outputs one distribution. Decoding determines how that distribution becomes an actual token.

One Distribution
\( \ell = W_U\, h_t^{(L)} \)
softmax
next-token distribution
mat 42%
rug 24%
. 16%
floor 11%
bed 7%
Same distribution. Different selection rule.
Greedy (argmax)
Always pick the highest probability token.
Chosen token mat
Temperature Sampling
Adjust sharpness, then sample.
Example sample rug
Top-k Sampling
Restrict the candidate set, then sample.
Keep mat rug .
Example sample .
The transformer produces scores. Decoding turns those scores into text.
The distribution stays fixed; decoding changes how the next token is chosen.
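The three selection rules, sketched on the example distribution from this slide; the probabilities are taken from the figure, and the random seed only fixes which samples get drawn.

```python
import numpy as np

rng = np.random.default_rng(0)

# The example next-token distribution from this slide.
tokens = ["mat", "rug", ".", "floor", "bed"]
probs = np.array([0.42, 0.24, 0.16, 0.11, 0.07])
logits = np.log(probs)                            # pretend these came from the unembedding layer

# Greedy: always pick the highest-probability token.
greedy = tokens[int(np.argmax(probs))]

def sample_with_temperature(logits, T):
    """Rescale logits by 1/T, re-softmax, then sample."""
    z = logits / T
    p = np.exp(z - z.max())
    p /= p.sum()
    return tokens[rng.choice(len(tokens), p=p)]

def sample_top_k(probs, k):
    """Keep only the k most likely tokens, renormalize, then sample."""
    top = np.argsort(probs)[::-1][:k]
    p = probs[top] / probs[top].sum()
    return tokens[rng.choice(top, p=p)]

print(greedy)                                     # 'mat'
print(sample_with_temperature(logits, T=0.8))     # T < 1 sharpens the distribution before sampling
print(sample_top_k(probs, k=3))                   # only 'mat', 'rug', and '.' remain candidates
```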
Closing

You Now Know the Transformer

This is the full loop.

↩ autoregressive loop: append and run again
Each new token changes the context → the next distribution changes
Tokenize → Represent → Understand × L → Predict → Decode → Append → Repeat or Stop
Stop at EOS, a stop sequence, or the max token limit
From tokenization through decoding, this is the forward pass that drives generation.
Tokenize → Represent → Understand × L → Predict → Decode → Append → Repeat or Stop.