What is a Language Model?
Forget GPT-4 for a second. Forget chatbots. Forget transformers. By the end of this lesson, you'll know — in your bones — what every single language model that has ever existed is actually doing.
The whole field of language modeling is one equation: P(next_token | previous_tokens). That's it. Everything else — attention, RoPE, MoE, RLHF — is a clever way to make that prediction sharper.
1. The One-Sentence Definition
A language model is a function that takes a sequence of tokens and assigns a probability distribution over what comes next.
That's not a metaphor. It is literally what every LM does, mechanically. Sample from that distribution and you get generation. Score two candidate sentences and keep the more probable one, and you get the core of translation, spell check, and code completion.
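The definition above can be sketched in a few lines of Python. This is a minimal illustration, not any real model: `TOY_TABLE`, `next_token_distribution`, `sample_next`, and `sequence_log_prob` are hypothetical names, and the probabilities are made up by hand. The toy only looks at the last token, but the interface is the general one: tokens in, distribution over the next token out.

```python
import math
import random

# Hypothetical toy model: these probabilities are invented for illustration.
TOY_TABLE = {
    "the": {"cat": 0.5, "dog": 0.4, "the": 0.1},
    "cat": {"sat": 0.7, "the": 0.2, "cat": 0.1},
    "dog": {"sat": 0.6, "the": 0.3, "dog": 0.1},
    "sat": {"the": 0.8, "sat": 0.2},
}


def next_token_distribution(tokens):
    """Return P(next_token | previous_tokens) as a dict of token -> probability."""
    return TOY_TABLE[tokens[-1]]


def sample_next(tokens, rng):
    """Generation: draw one token from the predicted distribution."""
    dist = next_token_distribution(tokens)
    choices = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(choices, weights=weights, k=1)[0]


def sequence_log_prob(tokens):
    """Scoring: sum log P(token | prefix) over the sequence.

    The first token is treated as given, so it contributes no term.
    """
    total = 0.0
    for i in range(1, len(tokens)):
        dist = next_token_distribution(tokens[:i])
        # Tiny floor so a token the toy table never predicts does not hit log(0).
        total += math.log(dist.get(tokens[i], 1e-12))
    return total


rng = random.Random(0)
tokens = ["the"]
for _ in range(5):
    tokens.append(sample_next(tokens, rng))
print(tokens)

# Comparing two candidates: the higher log-probability is the more likely sentence.
score_a = sequence_log_prob(["the", "cat", "sat"])
score_b = sequence_log_prob(["the", "sat", "cat"])
print(score_a > score_b)
```

Generation and scoring use the same function; the only difference is whether you draw from the distribution or read a probability out of it.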
2. Demo: A Tiny LM, Right Here in Your Browser
Below is a real bigram model over a snippet of Shakespeare (the tinyshakespeare dataset from char-rnn, also used in nanoGPT). It knows only one thing: given the last character, what's likely next? Click Step a few times. Watch it work.
3. But… Where Do "Tokens" Come From?
Computers read numbers, not words. Tokenization splits raw text into discrete units, then assigns each an integer ID. Three common strategies to try:
- Char-level: tiny vocab (~80 symbols), but long sequences, and the model is forced to learn spelling.
- Word-level: short sequences, huge vocab, can't handle new words (they all collapse to <UNK>).
- BPE / subword: the sweet spot. GPT-2's vocabulary has 50,257 tokens.
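The three strategies can be compared on a toy sentence. This is a rough sketch under simplifying assumptions: the sentence and the helper names (`encode_words`, `most_frequent_pair`, `merge_pair`) are hypothetical, and the BPE part shows a single merge step on the raw character stream, whereas real BPE tokenizers repeat the merge many times and usually split text into words first.

```python
from collections import Counter

text = "the cat sat on the mat"

# Character-level: every distinct character gets an integer ID.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
char_ids = [char_vocab[ch] for ch in text]

# Word-level: every distinct word gets an ID, plus a reserved <UNK> ID
# for words that were not present when the vocab was built.
word_vocab = {"<UNK>": 0}
for w in sorted(set(text.split())):
    word_vocab[w] = len(word_vocab)


def encode_words(sentence):
    """Map each whitespace-separated word to its ID, or to <UNK> if unseen."""
    return [word_vocab.get(w, word_vocab["<UNK>"]) for w in sentence.split()]


def most_frequent_pair(symbols):
    """Return the adjacent pair of symbols that occurs most often."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """One simplified BPE step: replace every occurrence of `pair` with one symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


symbols = list(text)
best = most_frequent_pair(symbols)
merged = merge_pair(symbols, best)

print(len(char_vocab), len(char_ids))             # vocab size, sequence length
print(len(word_vocab), len(encode_words(text)))   # vocab size, sequence length
print(encode_words("the dog sat"))                # "dog" is unseen
print(best, len(symbols), len(merged))            # merged pair, length before and after
```

Each merge adds one entry to the vocab and shortens the sequence, which is the trade-off that lets subword tokenizers sit between the other two.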
4. Tokens Become Vectors — "Embeddings"
Once a token has an integer ID, we look it up in a giant embedding table — that gives us a vector (say, 768 numbers). We need a representation where similar things are close together and that gradient descent can nudge. Integers don't have that property; vectors do.
Hover words to see neighbors. The model was never told "king is like queen" — it figured that out from co-occurrence. Meaning falls out of context.
We never tell the network what words mean. We just train it to predict the next token. Meaning is an emergent property of the prediction task. The same will be true at scale — "understanding," "reasoning," all of it.
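The lookup step in this section can be sketched as plain row indexing. Everything here is an assumption for illustration: the three-word vocab, the 4-dimensional vectors, and the function names are hypothetical, and the values are picked by hand rather than learned. A trained model would arrive at its own values through gradient descent and would use far more dimensions (the lesson's example is 768).

```python
import math

# Hypothetical hand-picked embedding table: one row per token ID.
vocab = {"king": 0, "queen": 1, "banana": 2}
embedding_table = [
    [0.9, 0.8, 0.1, 0.0],  # row 0: "king"
    [0.8, 0.9, 0.2, 0.0],  # row 1: "queen"
    [0.0, 0.1, 0.9, 0.7],  # row 2: "banana"
]


def embed(token):
    """Token -> integer ID -> row of the embedding table."""
    return embedding_table[vocab[token]]


def cosine_similarity(u, v):
    """Angle-based closeness: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance: 0.0 means the vectors are identical."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


king, queen, banana = embed("king"), embed("queen"), embed("banana")
print(cosine_similarity(king, queen), cosine_similarity(king, banana))
print(euclidean_distance(king, queen), euclidean_distance(king, banana))
```

With these made-up vectors both measures agree that "king" sits nearer to "queen" than to "banana". They can disagree when vectors differ a lot in length, since cosine similarity ignores length and Euclidean distance does not.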
5. The Bigram Heatmap
Here's the secret of your tiny LM. It's just a 2D table: rows = "current char", columns = "next char", cells = "how often did this happen." Brighter = more likely. Hover to inspect.
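The table described above can be built by hand on a tiny string. This is a sketch on a made-up 9-character text, not the demo's actual code; the names `stoi`, `counts`, and `probs` are hypothetical. Falling back to a uniform row when a character was never followed by anything is one possible choice, not the only one.

```python
text = "abab abba"

# Vocab: each distinct character gets a row and column index.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
V = len(chars)

# counts[i][j] = how often character j came right after character i.
counts = [[0] * V for _ in range(V)]
for current, nxt in zip(text, text[1:]):
    counts[stoi[current]][stoi[nxt]] += 1

# Turn each row of counts into a probability distribution.
probs = []
for row in counts:
    total = sum(row)
    if total == 0:
        # Character never seen with a successor: assume a uniform row.
        probs.append([1.0 / V] * V)
    else:
        probs.append([c / total for c in row])

# Text version of the heatmap: one row per current character.
for ch in chars:
    print(repr(ch), [round(p, 2) for p in probs[stoi[ch]]])
```

Reading a row of `probs` is the whole "forward pass" of a bigram model: look up the current character, get the distribution over the next one.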
Foundations this lesson is built on
Read these alongside the exercise — global-first, the papers the whole field stands on.
Kazakh agglutination inflates tokenization: suffixes stack onto a single root, so one word spills into extra tokens. Where Kazakh stands on tokenization:
→ Contribution tree: tokenization
6. The Hot Questions
Don't move on until you can answer these out loud, in your own words. (No Googling.)
- Why predict the next token rather than the current one?
- What would the bigram heatmap look like for a model that has seen no data?
- Why is character-level tokenization "easier to learn" but "harder to scale"?
- What does it mean for two embedding vectors to be "close"? Cosine vs. Euclidean — which, and why?
- If a bigram model is just a lookup table, what does "training" even mean?
7. Your Mission
Open code/exercise_01_bigram.py. There's a stub waiting. You will:
- Load tinyshakespeare.txt.
- Build a vocab (unique chars → integer IDs).
- Count bigrams into a (V, V) matrix.
- Normalize rows to probabilities (watch division by zero).
- Sample 200 chars of new "Shakespeare." That's the magic moment.
- Compute the average negative log-likelihood. That's your loss — write it down.
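The last step of the list, the average negative log-likelihood, can be sketched on a toy case before you try it on Shakespeare. The two-character probability table and the name `average_nll` are hypothetical, and this is one reasonable way to write it, not necessarily what the exercise stub expects.

```python
import math

# Hypothetical bigram probabilities for a two-character vocab.
# probs[current][next] = P(next | current); each row sums to 1.
probs = {
    "a": {"a": 0.25, "b": 0.75},
    "b": {"a": 0.5, "b": 0.5},
}


def average_nll(text, probs):
    """Mean of -log P(next | current) over every adjacent pair in text."""
    pairs = list(zip(text, text[1:]))
    total = 0.0
    for current, nxt in pairs:
        total += -math.log(probs[current][nxt])
    return total / len(pairs)


loss = average_nll("abba", probs)
uniform_loss = math.log(2)  # loss of a model that guesses uniformly over 2 chars

print(loss, uniform_loss)
```

A lower number means the model assigned higher probability to what actually came next. Note that `math.log` raises an error on a probability of exactly zero, which is one reason to think carefully about empty cells when you normalize.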
Reading checklist
- Sennrich 2015 (BPE) — full read, 8 easy pages.
- Bengio 2003 — sections 1–3 for now.
- Watch "makemore Part 1" after the exercise.