Module 01 · ~75 min · Interactive

What is a Language Model?

Forget GPT-4 for a second. Forget chatbots. Forget transformers. By the end of this lesson, you'll know — in your bones — what every single language model that has ever existed is actually doing.

— Field note

The whole field of language modeling is one equation: P(next_token | previous_tokens). That's it. Everything else — attention, RoPE, MoE, RLHF — is a clever way to make that prediction sharper.

1. The One-Sentence Definition

A language model is a function that takes a sequence of tokens and assigns a probability distribution over what comes next.

"The cat sat on the ___" → { mat: 0.41, floor: 0.18, sofa: 0.09, keyboard: 0.005, … }

That's not a metaphor. That is literally what every LM does, mechanically. Sample from that distribution, append the result, and repeat → generation. Score two candidate sentences and keep the more probable one → the core of translation ranking, spell check, and code completion.
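Sampling from the example distribution above can be sketched in a few lines. This is a minimal illustration, not how any particular library does it: the "…" in the example hides the rest of the vocabulary, so the sketch renormalizes the four listed words (an assumption), and the names `next_token_probs` and `sample_next` are made up for this example.

```python
import random

# Toy distribution from the example above. The lesson's "…" hides the rest of
# the vocabulary, so we renormalize the four listed words (an assumption).
listed = {"mat": 0.41, "floor": 0.18, "sofa": 0.09, "keyboard": 0.005}
total = sum(listed.values())
next_token_probs = {tok: p / total for tok, p in listed.items()}

def sample_next(probs, rng):
    """Draw one token with probability proportional to its weight."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so runs are repeatable
prompt = "The cat sat on the"
samples = [sample_next(next_token_probs, rng) for _ in range(1000)]

print(prompt, samples[0])
# Empirical frequencies should land close to the renormalized probabilities.
print({tok: samples.count(tok) / len(samples) for tok in next_token_probs})
```

Generation is this step in a loop: sample a token, append it to the prompt, ask the model for a new distribution, and sample again.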

2. Demo: A Tiny LM, Right Here in Your Browser

Below is a real bigram model over a snippet of Shakespeare (the same tinyshakespeare dataset used in makemore and nanoGPT). It only knows one thing: given the last character, what's likely next? Click Step a few times and watch it work.

Seed char: T

Probability of next char given “T”: 50.0%, 37.5%, 12.5%

Notice: after "T", the most likely next character is "h". After a space, vowels light up. The model never read Shakespeare — it just counted co-occurrences. That's already a language model.
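The "just counted co-occurrences" idea can be sketched as below. This is a toy version under stated assumptions: `toy_text` is a made-up string (not the tinyshakespeare snippet behind the demo, so the numbers differ from the bars above), and `bigram_probs` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def bigram_probs(text):
    """Return {current_char: {next_char: P(next | current)}} from raw counts."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(text, text[1:]):  # every adjacent pair of characters
        counts[cur][nxt] += 1
    probs = {}
    for cur, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        probs[cur] = {ch: n / total for ch, n in nxt_counts.items()}
    return probs

toy_text = "That thing. This thought. To be. The end."  # hypothetical corpus
probs = bigram_probs(toy_text)

print(probs["T"])  # {'h': 0.75, 'o': 0.25}
```

In this toy string "T" is followed by "h" three times and by "o" once, so the table says P("h" | "T") = 0.75. Nothing was learned beyond counting.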

3. But… Where Do "Tokens" Come From?

Computers read numbers, not words. Tokenization splits raw text into discrete units, then assigns each an integer ID. Three common strategies — try them:

Character-level tokens (65; · marks a space):

The·quick·brown·fox·jumps·over·the·lazy·dog.·Tokenization·is·fun!

Vocab: tiny (~80 symbols), but sequences get long. The model is forced to learn spelling.

The trade-off to keep in your head:

Char-level: tiny vocab (~80 symbols), but long sequences.
Word-level: short sequences, huge vocab, and any unseen word collapses to <UNK>.
BPE / subword: the sweet spot. GPT-2's vocabulary has 50,257 tokens.
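The three strategies can be compared on the sample sentence with a short sketch. Assumptions to note: the word-level tokenizer is a plain whitespace split, and the BPE part performs only a single merge over the raw character stream, ignoring word boundaries. Real BPE (Sennrich et al.) repeats the merge many times within words. All function and variable names here are illustrative.

```python
from collections import Counter

text = "The quick brown fox jumps over the lazy dog. Tokenization is fun!"

# Character-level: every character is a token; the vocab is the set of chars.
char_tokens = list(text)
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in char_tokens]

# Word-level: whitespace split; words outside the vocab map to <UNK>.
word_tokens = text.split()
word_vocab = {"<UNK>": 0}
for w in word_tokens:
    word_vocab.setdefault(w, len(word_vocab))

def encode_words(s):
    return [word_vocab.get(w, word_vocab["<UNK>"]) for w in s.split()]

# One BPE-style merge step: fuse the most frequent adjacent pair of symbols.
def merge_most_frequent_pair(symbols):
    pairs = Counter(zip(symbols, symbols[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            merged.append(a + b)  # the pair becomes one new symbol
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged, (a, b)

bpe_tokens, merged_pair = merge_most_frequent_pair(char_tokens)

print(len(char_tokens), len(word_tokens), len(bpe_tokens))
print(encode_words("The zebra"))  # "zebra" was never seen, so it becomes <UNK>
print("merged pair:", merged_pair)
```

Each merge shortens the sequence a little and adds one entry to the vocabulary, which is how subword tokenizers trade vocab size against sequence length.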

4. Tokens Become Vectors — "Embeddings"

Once a token has an integer ID, we look it up in a giant embedding table — that gives us a vector (say, 768 numbers). We need a representation where similar things are close together and that gradient descent can nudge. Integers don't have that property; vectors do.
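The lookup itself is nothing more than indexing a table by token ID. Here is a minimal pure-Python sketch, assuming a made-up three-word vocab, 8 dimensions instead of 768 so the output stays readable, and small random initial values (the 0.02 standard deviation is an arbitrary choice for this example).

```python
import random

random.seed(0)  # repeatable random initialization

vocab = {"the": 0, "cat": 1, "sat": 2}  # hypothetical toy vocab
d_model = 8                             # the lesson's example uses 768

# One row of d_model numbers per token ID. Training would adjust these values.
embedding_table = [
    [random.gauss(0.0, 0.02) for _ in range(d_model)] for _ in vocab
]

def embed(token_ids):
    """Map each integer ID to its row in the embedding table."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([vocab["the"], vocab["cat"]])
print(len(vectors), "vectors of length", len(vectors[0]))
```

Before training, the rows are random and carry no meaning. Gradient descent moves them so that tokens used in similar contexts end up near each other.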

An embedding space, visualized (2D projection):

Hover words to see neighbors. The model was never told "king is like queen" — it figured that out from co-occurrence. Meaning falls out of context.

— Field note

We never tell the network what words mean. We just train it to predict the next token. Meaning is an emergent property of the prediction task. The same will be true at scale — "understanding," "reasoning," all of it.

5. The Bigram Heatmap

Here's the secret of your tiny LM. It's just a 2D table: rows = "current char", columns = "next char", cells = "how often did this happen." Divide each row by its total and every cell becomes the conditional probability P(next char | current char). Brighter = more likely. Hover any cell to inspect it.

This 27×27 matrix is your model. In Module 2 we replace it with a small neural net; in Module 5, with self-attention. The arc of the whole course is right here.

Foundations this lesson is built on

Read these alongside the exercise. Global first: these are the papers the whole field stands on.

2003 · A Neural Probabilistic Language Model · Bengio et al. · Learns distributed word vectors jointly with a neural n-gram model, beating the curse of dimensionality.
2013 · Efficient Estimation of Word Representations in Vector Space · Mikolov et al. · word2vec: shallow nets that produce dense embeddings where analogies become vector arithmetic.
2015 · Neural Machine Translation of Rare Words with Subword Units · Sennrich et al. · Byte-Pair Encoding splits words into subwords → open vocabulary, no <UNK>.

Kazakh corner

Kazakh is agglutinative: suffixes stack onto a root, so a single root with its suffixes spills into extra tokens. Where Kazakh stands on tokenization:

→ Contribution tree: tokenization

6. The Hot Questions

Don't move on until you can answer these out loud, in your own words. (No Googling.)

Why predict the next token rather than the current one?
What would the bigram heatmap look like for a model that has seen no data?
Why is character-level tokenization "easier to learn" but "harder to scale"?
What does it mean for two embedding vectors to be "close"? Cosine vs. Euclidean — which, and why?
If a bigram model is just a lookup table, what does "training" even mean?

7. Your Mission

Open code/exercise_01_bigram.py. There's a stub waiting. You will:

Load tinyshakespeare.txt.
Build a vocab (unique chars → integer IDs).
Count bigrams — a (V, V) matrix.
Normalize rows to probabilities (watch for division by zero on rows with no counts).
Sample 200 chars of new "Shakespeare." That's the magic moment.
Compute the average negative log-likelihood. That's your loss — write it down.
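The loss in the last step can be written as follows. The notation is ours, not the exercise stub's: c_1, …, c_{N+1} are the characters of the text, N is the number of bigrams, and P is your row-normalized bigram table.

```latex
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log P(c_{i+1} \mid c_i)
```

As a quick check by hand: if a text has just two bigrams and the model assigns them probabilities 0.5 and 0.25, the loss is -(ln 0.5 + ln 0.25) / 2 ≈ 1.04 nats. Lower is better. A table that assigns probability 0 to a bigram that actually occurs gives an infinite loss, which is one reason counts are often smoothed.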

Reading checklist

Sennrich 2015 (BPE) — full read, 8 easy pages.
Bengio 2003 — sections 1–3 for now.
Watch "makemore Part 1" after the exercise.

Next: Module 02: From Counts to Neurons → soon