Day 2

How a Transformer
Computes the Next Token

From tokenization to decoding, the full forward pass

Roadmap

Workshop Roadmap

Day 1 — Completed

Why LLMs Behave as They Do

  • Inference vs. training
  • Next-token prediction
  • Loss → gradients → updates
  • Why this objective works
  • Data, scale, and alignment
i Day 1: the model as a system shaped by training.
Day 2 — Today

How a Transformer Processes Text and Predicts the Next Token

  • Tokenization & embeddings
  • Self-attention
  • Feed-forward layers
  • Full forward pass end-to-end
i Day 2: the model as a machine executing computation.
Architecture

Decoder-Only Transformer: High-Level View

Output
Softmax
Linear
LayerNorm
Transformer Block
Layer L
Transformer Block
Layer 2
Transformer Block
Layer 1
× L
Dropout
Positional
Encoding
+
Input Embedding
Input tokens
i Each block writes into the residual stream, which accumulates across all layers.
Tokenization

How Many r's Are in "Strawberry"?

strawberry
You see: 3 r's
Model answer: 2 r's

What the Model Sees

You type
s t r a w b e r r y
Model sees
[straw] [berry]

The model gets 2 token vectors, not 10 letter vectors. Letter counting requires reconstructing characters from subword pieces.

i This is structural, not a bug. The next slides show where the limitation comes from.
Tokenization

Before Reading, the Model Tokenizes

Raw text
Tokenizer
Token IDs
Embeddings

Three Steps

  • Define a fixed vocabulary of discrete symbols.
  • Split input text into pieces from that vocabulary.
  • Map each piece to an integer ID.
i The vocabulary is learned once, then frozen. GPT-2 uses 50,257 tokens; GPT-4-class tokenizers are roughly 100k.

Running Example

Used in the next three slides.

Sentence
"researchers fine-tune large-scale models"
Subword tokenization
[research] [ers] [fine] [-] [tune] [large] [-] [scale] [model] [s]
[3245, 891, 4680, 12, 10812, 3267, 12, 5649, 219, 82]

IDs shown are illustrative; exact values depend on the tokenizer.
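In code, the three steps collapse into a single encode call. Below is a minimal sketch, assuming the tiktoken library and its GPT-2 vocabulary are available; the exact splits and IDs depend on that tokenizer, as noted above.

```python
# Sketch of the three steps with a real BPE tokenizer (assumes `tiktoken` is installed).
import tiktoken

enc = tiktoken.get_encoding("gpt2")            # fixed, frozen vocabulary of 50,257 tokens

text = "researchers fine-tune large-scale models"
ids = enc.encode(text)                         # split into vocabulary pieces, map each to an integer ID
pieces = [enc.decode([i]) for i in ids]        # decode each ID back to its surface piece

print(ids)                                     # a list of integer IDs (exact values depend on the vocabulary)
print(pieces)                                  # the subword pieces: stems, suffixes, punctuation
assert enc.decode(ids) == text                 # the round trip reproduces the original text
```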

Word-Level

Approach 1: Word-Level Tokenization

Running Sentence: Word-Level Tokenization

If every word is in the vocabulary:
[researchers] [fine-tune] [large-scale] [models]
But compounds like fine-tune and large-scale are often missing, so they collapse to unknown tokens:
[researchers] [UNK] [UNK] [models]

Strengths

  • Short sequences for familiar text.
  • Common words stay intact as single units.

Limits

  • Language is open-vocabulary. English Wikipedia alone has ~1.5M unique word forms.
  • New terms like "GPT-4", "LoRA", and "RLHF" create unknowns.
! Efficient, but brittle. Anything outside the fixed vocabulary becomes unknown.
Character-Level

Approach 2: Character-Level Tokenization

Running Sentence: Character-Level Tokenization

[r][e][s][e][a][r][c][h][e][r][s] [ ] [f][i][n][e][-][t][u][n][e] [ ] [l][a][r][g][e][-][s][c][a][l][e] [ ] [m][o][d][e][l][s]
Word-level (best case)
4 tokens
Character-level (same sentence)
~40 tokens
Scaling: 500-word passage
~2,500 character tokens

Strengths

  • No unknown tokens — ever.
  • Handles new words, typos, code, any language.

Limits

  • Sequences become \(5\text{–}10\times\) longer.
  • Self-attention is \(O(n^2)\), so \(5\times\) longer means roughly \(25\times\) more compute.
  • The model spends capacity reassembling words from letters on every forward pass.
! Perfect coverage, impractical cost. Quadratic attention scaling makes it too expensive at realistic sequence lengths.
Subword Strategy

Approach 3: Subword Tokenization, the Practical Default

Running Sentence: Subword Tokenization

[research] [ers] [fine] [-] [tune] [large] [-] [scale] [model] [s]

Frequent stems become single tokens. Hyphens, prefixes, and suffixes become reusable pieces, so words stay decomposable rather than unknown.

Word-level

4 tokens

Short, but brittle. Breaks on hyphens, new jargon, and multilingual text.

Character-level

~40 tokens

Full coverage, but very long. Quadratic attention cost makes it impractical.

Subword

~10 tokens

Near-word efficiency with no unknown tokens. English prose averages about 0.75 words per token.

+ Near-word efficiency with broad coverage. That is why subword tokenization became the modern default.
i One tradeoff remains: characters inside a token are no longer individually visible. This matters again at the end of the section.
BPE

How the Tokenizer Learns These Pieces

We start with a character-level vocabulary. Now let the corpus teach us which adjacent pairs deserve their own tokens.

Start: characters
Count adjacent pairs
Merge most frequent
Repeat N times

Core Rule

On the previous slide, the fallback was character-level text. BPE starts there too: count adjacent pairs in the corpus, merge the most frequent pair into a new token, and repeat. Those learned merges become the vocabulary.

Why BPE Works

BPE is driven by frequency, not grammar. That is why stems and endings like -ing and -ed can emerge naturally when they keep showing up together.

i BPE is a frequency-based compression algorithm. It builds the vocabulary the data demands, not the one a linguist would design.
? Watch BPE build up from characters on a tiny action corpus.
BPE Walkthrough

BPE on a tiny action corpus

Corpus state · Vocabulary built so far
Start from a character-level vocabulary: every word is just letters (</w> marks word end)
h u g </w> h u g s </w> h u g g e d </w> h u g g i n g </w> s m i l e d </w> s m i l i n g </w> w a v e d </w> w a v i n g </w>

Start at characters, count adjacent pairs, and let repetition decide what gets promoted.

Base vocabulary
h u g s e d i n m l w a v
h u → 4
u g → 4
i n → 3
e d → 3

There is a tie at the top. We will follow the hug stem first, then come back for reusable endings.

Merge 1: h + u → hu  (freq: 4)
hu g </w> hu g s </w> hu g g e d </w> hu g g i n g </w> s m i l e d </w> s m i l i n g </w> w a v e d </w> w a v i n g </w>
Merged chunks so far
hu
hu g → 4
i n → 3
e d → 3
n g → 3

Same corpus, one level up: hu + g is still the most common pair.

Merge 2: hu + g → hug  (freq: 4)
hug </w> hug s </w> hug g e d </w> hug g i n g </w> s m i l e d </w> s m i l i n g </w> w a v e d </w> w a v i n g </w>
Merged chunks so far
hu hug
i n → 3
e d → 3
n g → 3
hug g → 2

The stem is built. Next we let shared endings compete.

Merge 3: i + n → in  (freq: 3)
hug </w> hug s </w> hug g e d </w> hug g in g </w> s m i l e d </w> s m i l in g </w> w a v e d </w> w a v in g </w>
Merged chunks so far
hu hug in
in g → 3
e d → 3
hug g → 2
l in → 1

One more merge turns a recurring ending into a reusable chunk.

Merge 4: in + g → ing  (freq: 3)
hug </w> hug s </w> hug g e d </w> hug g ing </w> s m i l e d </w> s m i l ing </w> w a v e d </w> w a v ing </w>
Merged chunks so far
hu hug in ing
e d → 3
hug g → 2
l ing → 1
v ing → 1
! Aha: ing just became reusable. Three different words now share the same ending token.
Merge 5: e + d → ed  (freq: 3)
hug </w> hug s </w> hug g ed </w> hug g ing </w> s m i l ed </w> s m i l ing </w> w a v ed </w> w a v ing </w>
Reusable endings unlocked
hu hug in ing ed
! Now ed joins the vocabulary too. The tokenizer is learning reusable endings, not memorizing whole words.
BPE Takeaway

BPE builds a vocabulary of reusable chunks

Vocabulary After These Merges

Base characters
h u g s e d i n m l w a v
Stem-building chunks
hu hug in
Reusable suffix chunks
ing ed


Reusable Chunks in Action

hugs → [hug] [s]
hugging → [hug] [g] [ing]
hugged → [hug] [g] [ed]
smiling → [s] [m] [i] [l] [ing]
waved → [w] [a] [v] [ed]
i The tokenizer has now learned reusable endings like ing and ed, so common word forms can reuse chunks instead of being rebuilt from scratch every time.
+ BPE starts from characters and promotes repeated pairs into reusable chunks. That is how pieces like hug, ing, and ed enter the vocabulary.
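The walkthrough above fits in a few lines of code. Below is a toy version of the BPE training loop on the same action corpus; word-end markers and tie-breaking are simplified, so the merge order can differ slightly from the slides even though the same chunks emerge.

```python
from collections import Counter

# Toy BPE training on the action corpus (word-end markers omitted for brevity).
corpus = ["hug", "hugs", "hugged", "hugging",
          "smiled", "smiling", "waved", "waving"]
words = [tuple(w) for w in corpus]             # start from a character-level vocabulary

def count_pairs(words):
    """Count every adjacent symbol pair across the corpus."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs

def merge_pair(words, pair):
    """Replace each occurrence of `pair` with one merged symbol."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(tuple(out))
    return merged

learned = []
for step in range(5):                          # the five merges shown in the walkthrough
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)           # most frequent adjacent pair wins
    words = merge_pair(words, best)
    learned.append("".join(best))
    print(f"merge {step + 1}: {best} -> {''.join(best)} (freq {pairs[best]})")

print(learned)                                 # hu, hug, ed, in, ing (order depends on tie-breaking)
```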
Why It Matters

Why Tokenization Shapes Reasoning

Core Mechanism

The model computes on token vectors, not character vectors. When one token covers many letters, character-level reasoning must be reconstructed from subword pieces.

"strawberry" [straw] [berry] 2 token vectors. No separate letter vectors.

Cosma et al., "The Strawberry Problem: Emergence of Character-level Understanding in Tokenized Language Models", EMNLP 2025.

Where It Breaks

  • Counting specific letters in a word.
  • Extracting the nth character from a string.
  • Detecting repeated characters.
  • Reversing a string exactly.

Why It Works So Well

  • Subword pieces let the model handle unseen words by composing familiar parts instead of needing every word in the vocabulary.
  • Sequences stay much shorter than character-level, which makes attention cheaper and preserves more context.
  • The vocabulary stays much smaller than word-level, reducing the size of the embedding and output layers.
  • That tradeoff is why subword tokenization is the standard choice: efficient, flexible, and good enough for most language-understanding tasks.
+ Tokenization defines the vocabulary, shapes sequence length, and sets the granularity of every downstream computation.
Mechanism

From token IDs to vectors

A token ID is simply a row index in a learned embedding matrix.

\[ T\in\{1,\dots,|V|\}^{n},\quad X^{(0)}=E[T]\in\mathbb{R}^{n\times d_{\text{model}}} \]
Token IDs: [812, 1452, 5021]
Row lookup in \(E\)
Output: \(x_1,x_2,x_3 \in \mathbb{R}^{d_{\text{model}}}\)

Three Retrieved Rows

\(E[812,:]\) → "river": [+0.14, -0.77, +0.32, ...]
\(E[1452,:]\) → "bank": [+0.11, -0.70, +0.28, ...]
\(E[5021,:]\) → "loan": [-0.62, +0.04, +0.80, ...]

Scale of the Full Table

\[ E \in \mathbb{R}^{|V|\times d_{\text{model}}} \]
Examples:
GPT-2 small: \(|V| = 50{,}257,\quad d_{\text{model}} = 768\)
\(\lvert E \rvert = 50{,}257 \times 768 = 38{,}597{,}376\)

GPT-4 tokenizer: \(|V| \approx 100{,}000\)
\(\text{Exact } d_{\text{model}} \text{ is not public, but the table still scales as } |V| \times d_{\text{model}}\)
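A minimal sketch of the lookup itself, using NumPy with illustrative GPT-2-small sizes; the table values are random stand-ins here, whereas in a trained model they are learned.

```python
import numpy as np

V, d_model = 50_257, 768                       # GPT-2 small vocabulary size and width

# The embedding table E: one trainable row per token type (random stand-in here).
E = np.random.randn(V, d_model).astype(np.float32)

# A token ID is simply a row index: X0 = E[T].
T = np.array([812, 1452, 5021])                # illustrative IDs for "river", "bank", "loan"
X0 = E[T]                                      # shape (n, d_model) = (3, 768)

print(E.size)                                  # 38,597,376 parameters in the table
print(X0.shape)                                # one d_model-dimensional vector per input token
```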
Key Properties

What Embeddings Provide, and What They Still Lack

What Embeddings Already Provide

  • Learned table \(E \in \mathbb{R}^{|V| \times d_{\text{model}}}\): one trainable row per token type.
  • Often one of the largest parameter blocks, scaling with \(|V| \times d_{\text{model}}\).
  • Outputs \(X^{(0)}\) already match model width — later blocks consume them directly.
  • Useful structure appears as directions, neighborhoods, and relative offsets.

What Later Layers Must Add

  • Sense is unresolved: one row can support multiple intended meanings.
  • No token-token interaction yet — rows cannot use neighboring words.
  • Row identity alone does not specify sentence role or dependency structure.
  • Later layers must refine these vectors into context-dependent states.
Different projections reveal different relationships; context resolves the intended meaning.
Geometry

One Embedding Space, Multiple Semantic Projections

Each token is one high-dimensional vector. These plots are different 2D views of the same space.

Change the Lens, Not the Space

Active lens: Pet / companionship
This view emphasizes: pet and companionship features
Nearest to cat in this view:
dog hamster vet
\(z^{(\mathrm{lens})} = P_{\mathrm{lens}}x,\; P_{\mathrm{lens}}\in\mathbb{R}^{2\times d_{\text{model}}}\)

The underlying vectors do not move. Only the 2D projection changes, revealing different relationships.

Bridge

Why embeddings are not enough

Same Row, Different Meaning

The fisherman sat on the bank of the river.
She applied for a loan at the bank.
\[ h_{\text{bank}}^{(0)} = E[\text{bank}] \]

At input, both sentences start from the same bank embedding.

  • The row tells us which token this is.
  • The neighboring words tell us which sense is intended.
  • Attention is the mechanism that mixes in those neighbors.
Attention lets a token read from surrounding words.
Mechanism

Attention lets tokens read from each other

Each token updates its representation using the rest of the sequence.

Embeddings are static; attention makes them context-aware.

Attention now needs a scoring rule: what matters, and what gets copied forward?
Mechanism

Step 1 — Create \(Q\), \(K\), and \(V\) for each token

Project the same token state into three learned subspaces.

\(x_{\mathrm{sat}}\)
Attention creates three role-specific versions of each token state.
Mechanism

Step 2 — Compute attention scores

Compare \(q_{\mathrm{sat}}\) with every key \(k_j\)

\(q_{\mathrm{sat}}\)
who sat?
Embeddings (\(X\))
Queries (\(Q\))
Scores (\(\mathbf{s}\))
Keys (\(K\))
Values (\(V\))
Start from embeddings \(x_1, \ldots, x_n\).
Mechanism

Step 3 — From scores to attention weights

Scale the scores, then softmax them into weights.

RAW SCORES \(\mathbf{s}\)
query token: sat
SCALED SCORES \(\mathbf{z}\)
\(z_j = \frac{s_j}{\sqrt{d_k}}\)
Scaling keeps dot products in a stable range
ATTENTION WEIGHTS \(\mathbf{a}\)
\(\sum_j a_j = 1\)
Raw similarity scores compare \(q_{\mathrm{sat}}\) with each key \(k_j\): \(s_j = q_{\mathrm{sat}}^{\mathsf{T}} k_j\).
Mechanism

Step 4 — Mix values, then add the residual

Mix value vectors with the attention weights, then add back \(x_{\mathrm{sat}}\): \(\mathrm{out}_{\mathrm{sat}} = x_{\mathrm{sat}} + \sum_j a_j v_j\).

\(q_{\mathrm{sat}}\)
who sat?
Embeddings (\(X\))
Queries (\(Q\))
Scores (\(\mathbf{s}\))
Keys (\(K\))
Values (\(V\))
Start from the Step 3 attention weights for \(q_{\mathrm{sat}}\).
Mechanism

Attention in matrix form

Stack tokens into matrices, then compute the full sequence at once.

Token Matrix \(T\)
Embedding Matrix \(X\)
\(X \in \mathbb{R}^{S \times d}\)
Projected Matrices
Start from the same sequence: tokens and their embedding rows.
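A single-head sketch of that matrix computation in NumPy, with random weights, illustrative sizes, and the causal mask a decoder-only model uses; real implementations add multiple heads and an output projection (next slide).

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """One attention head over the whole sequence X of shape (S, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                     # project into query/key/value subspaces
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # compare every query with every key, then scale
    mask = np.triu(np.ones((len(X), len(X)), dtype=bool), k=1)
    scores[mask] = -np.inf                               # decoder-only: no looking at future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax: each row of weights sums to 1
    return weights @ V                                   # mix value vectors with the attention weights

S, d_model, d_k = 6, 768, 64                             # illustrative sizes
X = np.random.randn(S, d_model)
Wq, Wk, Wv = [np.random.randn(d_model, d_k) * 0.02 for _ in range(3)]
out = causal_self_attention(X, Wq, Wk, Wv)
print(out.shape)                                         # (6, 64): one context-mixed vector per token
```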
Mechanism

Multi-Head Attention

Different heads attend in different learned subspaces, then their outputs are recombined.

Multi-head attention lets different heads read the same sequence in parallel.
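A sketch of that split-and-recombine pattern, extending the single-head example above; the causal mask is omitted here to keep the example short, and all weights are random stand-ins.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Split d_model into n_heads subspaces, attend in each, then recombine with Wo."""
    S, d_model = X.shape
    d_head = d_model // n_heads
    # Project, then reshape so each head gets its own slice of the width.
    Q = (X @ Wq).reshape(S, n_heads, d_head).transpose(1, 0, 2)     # (heads, S, d_head)
    K = (X @ Wk).reshape(S, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(S, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)             # (heads, S, S)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)                  # per-head softmax
    heads = weights @ V                                             # each head mixes in its own subspace
    concat = heads.transpose(1, 0, 2).reshape(S, d_model)           # put the heads back side by side
    return concat @ Wo                                              # learned recombination

S, d_model, n_heads = 6, 768, 12
X = np.random.randn(S, d_model)
Wq, Wk, Wv, Wo = [np.random.randn(d_model, d_model) * 0.02 for _ in range(4)]
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads).shape)       # (6, 768)
```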
Problem

Self-Attention Compares Vectors, Not Word Order

Without positional signals, self-attention is permutation equivariant: reorder the input token rows, and the output rows reorder in exactly the same way.

The dog chased the cat
The dog is the chaser
Same tokens
The cat chased the dog
The cat is the chaser
Same tokens, different meaning.
Does attention know which word came first?
No. The layer is permutation equivariant without positions.
It only compares the token vectors it receives. If you permute those input rows, the same pairwise scores are computed between the same vectors, and the output rows are permuted in exactly the same way.
⚠️ Add position before attention so permuting tokens changes the vectors and breaks this symmetry.
Position must be added before attention so order changes the input.
Design Journey

Designing Positional Encoding

Attention needs order. Build a position signal from first principles.

1 Target
2 Integer
3 Binary
4 Sin/Cos
5 Relative
6 RoPE
Step 1 · Goals

What must a position code do?

Start with the requirements, then choose the formula.

Three design goals
  1. Distinct slots: slot 1 and slot 100 must look different.
  2. Stable meaning: position 5 should mean the same thing in short and long contexts.
  3. Local smoothness: nearby slots should get nearby codes.
token embedding \(x_i\) + position code \(p_i\) = input to attention \(h_i^{(0)}\)
Why this matters
The dog chased another dog
Both dog tokens share an embedding; position distinguishes them.
Good position codes are distinct, length-stable, and locally smooth.
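As a concrete instance of those goals, here is a sketch of the sin/cos code from step 4 of the journey (the fixed sinusoidal encoding from the original Transformer); the token embeddings below are random stand-ins.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sin/cos position code: every slot distinct, stable across lengths, smooth between neighbors."""
    pos = np.arange(seq_len)[:, None]                    # (S, 1) slot indices
    i = np.arange(d_model // 2)[None, :]                 # (1, d_model/2) frequency indices
    freq = 1.0 / 10000 ** (2 * i / d_model)              # geometric range of wavelengths
    P = np.zeros((seq_len, d_model))
    P[:, 0::2] = np.sin(pos * freq)                      # even dimensions
    P[:, 1::2] = np.cos(pos * freq)                      # odd dimensions
    return P

S, d_model = 16, 768
X = np.random.randn(S, d_model)                          # token embeddings (random stand-in)
H0 = X + sinusoidal_positions(S, d_model)                # h_i^(0) = x_i + p_i, the input to attention
print(H0.shape)                                          # (16, 768)
```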
Mechanism

Attention Mixes, FFN Transforms

Attention shares information across tokens. The FFN then updates each token independently.

Attention
Mixes information across tokens
Cross-token context
FFN
Transforms each token independently
Per-token transformation
Per Token: Expand, Activate, Contract
token state
\(d_{\mathrm{model}}\)
expand 4×
\(W_1\)
GELU
activate
contract
\(W_2\)
updated state
\(d_{\mathrm{model}}\)
\(\mathrm{FFN}(h_i) = W_2\,\mathrm{GELU}(W_1 h_i + b_1) + b_2\)
The same small network runs at every position: shared weights, different inputs.
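A sketch of that per-token network in NumPy, written in row-vector form so `H @ W1` matches \(W_1 h_i\) in the formula above; the weights are random stand-ins for what training would learn.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation named above."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def ffn(H, W1, b1, W2, b2):
    """FFN(h_i) = W2 GELU(W1 h_i + b1) + b2, applied to every row of H independently."""
    return gelu(H @ W1 + b1) @ W2 + b2

d_model = 768
d_ff = 4 * d_model                                       # expand 4x, then contract back
W1, b1 = np.random.randn(d_model, d_ff) * 0.02, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.02, np.zeros(d_model)

H = np.random.randn(6, d_model)                          # six token states
print(ffn(H, W1, b1, W2, b2).shape)                      # (6, 768): same shape, per-token update
```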
Parameter Reality

Per block, attention is roughly \(4d_{\mathrm{model}}^2\) parameters, while the FFN is roughly \(8d_{\mathrm{model}}^2\). That is why, in large dense LLMs, the FFN usually outweighs attention in total parameters.

Model        Total      Attention          FFN                Other
GPT-3 13B    12.95B     4.23B (32.6%)      8.46B (65.3%)      0.27B (2.1%)
GPT-3 175B   174.60B    57.99B (33.2%)     115.97B (66.4%)    0.65B (0.4%)
Other = token embeddings + positional embeddings + LayerNorm parameters.
Attention shares information across tokens. The FFN updates each token independently and usually holds most of the parameters.
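Those ratios are easy to re-derive. The sketch below redoes the arithmetic for GPT-3 175B (96 layers, \(d_{\mathrm{model}} = 12{,}288\)), ignoring biases and LayerNorm, and lands close to the table above.

```python
# Per-block parameter estimate, ignoring biases and LayerNorm (a rough approximation).
d_model = 12_288                              # GPT-3 175B width
n_layers = 96

attn_per_block = 4 * d_model ** 2             # W_Q, W_K, W_V, W_O: four d_model x d_model matrices
ffn_per_block = 8 * d_model ** 2              # W_1 (d_model x 4*d_model) + W_2 (4*d_model x d_model)

print(f"attention per block: {attn_per_block / 1e6:.0f}M")    # ~604M
print(f"FFN per block:       {ffn_per_block / 1e6:.0f}M")     # ~1208M
print(f"all {n_layers} blocks: {n_layers * (attn_per_block + ffn_per_block) / 1e9:.1f}B")  # ~173.9B
```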
Synthesis

One Transformer Block, Revisited

Same diagram as slide 3. This time, every box has a clear role.

Block Output \(R^{(\ell+1)}\)
+
Dropout
Feed-Forward Network
LayerNorm
+
Dropout
Multi-Head Attention
LayerNorm
Block Input \(R^{(\ell)}\)
Read the Block Bottom to Top
LayerNorm → attention → residual add → LayerNorm → FFN → residual add. Both sublayers write into the same residual stream.
LayerNorm
Normalizes each row before the sublayer reads it.
Multi-Head Attention
Attention mixes information across earlier tokens and writes a context update.
Residual Add (+)
Writes the attention update back into the same stream.
Feed-Forward Network
The same two-layer MLP updates each row independently.
Two update equations
\(R_{\mathrm{mid}} = R^{(\ell)} + \mathrm{MHA}(\mathrm{LN}(R^{(\ell)}))\)
\(R^{(\ell+1)} = R_{\mathrm{mid}} + \mathrm{FFN}(\mathrm{LN}(R_{\mathrm{mid}}))\)
A full model stacks this same Transformer block layer after layer.
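The two update equations translate directly into code. Below is a minimal sketch with a simplified LayerNorm (no learned scale or shift) and stand-in sublayers; swap in the attention and FFN sketches from earlier slides for the real computation.

```python
import numpy as np

def layer_norm(H, eps=1e-5):
    """Normalize each row to zero mean, unit variance (learned scale/shift omitted)."""
    return (H - H.mean(axis=-1, keepdims=True)) / np.sqrt(H.var(axis=-1, keepdims=True) + eps)

def transformer_block(R, attn, ffn):
    """One refinement pass: both sublayers write their updates into the residual stream."""
    R_mid = R + attn(layer_norm(R))          # R_mid  = R + MHA(LN(R))
    return R_mid + ffn(layer_norm(R_mid))    # R_next = R_mid + FFN(LN(R_mid))

# Stand-in sublayers so the sketch runs on its own (random linear maps).
d_model = 768
attn = lambda H: H @ (np.random.randn(d_model, d_model) * 0.01)
ffn  = lambda H: H @ (np.random.randn(d_model, d_model) * 0.01)

R = np.random.randn(6, d_model)              # residual stream: S x d_model
for _ in range(12):                          # "x L": GPT-2 small stacks 12 of these blocks
    R = transformer_block(R, attn, ffn)
print(R.shape)                               # (6, 768): same shape after every layer
```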
Concept

Same Block, More Depth

The block repeats \(L\) times. The shape stays \(S \times d_{\mathrm{model}}\); each layer further refines the residual stream.

\(R^{(L)}\) — next-token-ready context   \(S \times d_{\mathrm{model}}\)
Transformer Block
Layer L
Transformer Block
Layer 2
Transformer Block
Layer 1
× L layers
\(R^{(0)}\) — token + position embeddings   \(S \times d_{\mathrm{model}}\)
What Depth Means
Each layer runs the same transformer block again. The sequence shape stays fixed; the representation becomes more refined.
One block = one refinement pass — attention and FFN read the current stream and write back an update.
Each new block starts from the updated stream — later layers build on what earlier layers already produced.
Depth adds repeated computation — the model gets multiple chances to combine and refine information.
📖 GPT-2 small uses 12 layers. GPT-3 uses 96. Same block, more refinement passes.
Depth keeps the same sequence shape and refines meaning layer by layer.
Mechanism

Predict One Token, Append, Repeat

Take the last hidden state, score the vocabulary, form a distribution, then decode one token.

One full forward pass produces scores over the vocabulary.
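A sketch of that loop with greedy decoding; the model and unembedding matrix are random stand-ins for the stacked blocks from the previous slides, so the output IDs are meaningless but the control flow is the real one.

```python
import numpy as np

def generate(model, W_U, ids, eos_id, max_new_tokens=20):
    """Predict one token, append it, run the forward pass again."""
    for _ in range(max_new_tokens):
        H = model(ids)                         # full forward pass: (len(ids), d_model)
        logits = H[-1] @ W_U                   # last hidden state scores every vocabulary entry
        next_id = int(np.argmax(logits))       # greedy pick (other decoding rules: next slides)
        ids = ids + [next_id]                  # append and repeat with the longer context
        if next_id == eos_id:                  # stop once the end-of-sequence token appears
            break
    return ids

# Stand-ins so the sketch runs; a real model is the block stack from earlier slides.
V, d_model = 50_257, 768
W_U = np.random.randn(d_model, V) * 0.02
model = lambda ids: np.random.randn(len(ids), d_model)

print(generate(model, W_U, [812, 1452], eos_id=0))
```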
Bridge

From Probabilities to Behavior

The model outputs one distribution. Decoding determines how that distribution becomes an actual token.

One Distribution
\( \ell = W_U\, h_t^{(L)} \)
softmax
next-token distribution
mat 42%
rug 24%
. 16%
floor 11%
bed 7%
Same distribution. Different selection rule.
Greedy (argmax)
Always pick the highest probability token.
Chosen token mat
Temperature Sampling
Adjust sharpness, then sample.
Example sample rug
Top-k Sampling
Restrict the candidate set, then sample.
Keep mat rug .
Example sample .
The transformer produces scores. Decoding turns those scores into text.
The distribution stays fixed; decoding changes how the next token is chosen.
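The three selection rules, sketched on the example distribution from this slide; the probabilities are taken from the figure, and the random seed only fixes which samples get drawn.

```python
import numpy as np

rng = np.random.default_rng(0)

# The example next-token distribution from this slide.
tokens = ["mat", "rug", ".", "floor", "bed"]
probs = np.array([0.42, 0.24, 0.16, 0.11, 0.07])
logits = np.log(probs)                            # pretend these came from the unembedding layer

# Greedy: always pick the highest-probability token.
greedy = tokens[int(np.argmax(probs))]

def sample_with_temperature(logits, T):
    """Rescale logits by 1/T, re-softmax, then sample."""
    z = logits / T
    p = np.exp(z - z.max())
    p /= p.sum()
    return tokens[rng.choice(len(tokens), p=p)]

def sample_top_k(probs, k):
    """Keep only the k most likely tokens, renormalize, then sample."""
    top = np.argsort(probs)[::-1][:k]
    p = probs[top] / probs[top].sum()
    return tokens[rng.choice(top, p=p)]

print(greedy)                                     # 'mat'
print(sample_with_temperature(logits, T=0.8))     # T < 1 sharpens the distribution before sampling
print(sample_top_k(probs, k=3))                   # only 'mat', 'rug', and '.' remain candidates
```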
Closing

You Now Know the Transformer

This is the full loop.

↩ autoregressive loop: append and run again
Each new token changes the context → the next distribution changes
Tokenize → Represent → Understand × L → Predict → Decode → Append → Repeat or Stop
Stop at EOS, a stop sequence, or the max token limit
From tokenization through decoding, this is the forward pass that drives generation.
Tokenize → Represent → Understand × L → Predict → Decode → Append → Repeat or Stop.