Module 01 · ~75 min · Interactive

What is a Language Model?

Forget GPT-4 for a second. Forget chatbots. Forget transformers. By the end of this lesson, you'll know — in your bones — what every single language model that has ever existed is actually doing.

Field note

The whole field of language modeling is one equation: P(next_token | previous_tokens). That's it. Everything else — attention, RoPE, MoE, RLHF — is a clever way to make that prediction sharper.
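Written out, the conditional in the note above is one factor of the chain rule of probability. As a sketch, for a token sequence x_1, …, x_T (notation introduced here for illustration, not taken from the lesson):

```latex
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})
```

Each factor is exactly P(next_token | previous_tokens), so a model that predicts the next token well also assigns a probability to whole sequences.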

1. The One-Sentence Definition

A language model is a function that takes a sequence of tokens and assigns a probability distribution over what comes next.

"The cat sat on the ___" → { mat: 0.41, floor: 0.18, sofa: 0.09, keyboard: 0.005, … }

That's not a metaphor. That is literally what every LM does, mechanically. Sample from that distribution repeatedly and you get generation. Compare the probabilities of two candidate sentences and you get the scoring step behind translation, spell check, and code completion.
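To make the "sampling → generation" step concrete, here is a minimal sketch that draws tokens from the example distribution above. The `<other>` bucket and the helper name `sample_next` are assumptions added for illustration; a real model would produce the distribution itself.

```python
import random

# Toy next-token distribution for the context "The cat sat on the ___".
# The first four values come from the example above; "<other>" is a
# made-up bucket holding the remaining probability mass.
next_token_probs = {
    "mat": 0.41,
    "floor": 0.18,
    "sofa": 0.09,
    "keyboard": 0.005,
}
next_token_probs["<other>"] = 1.0 - sum(next_token_probs.values())


def sample_next(probs, rng):
    """Draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_next(next_token_probs, rng) for _ in range(10)]
print(draws)
```

Over many draws, "mat" should appear roughly 41% of the time. Generation is this step in a loop: sample a token, append it to the context, ask for the next distribution.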

2. Demo: A Tiny LM, Right Here in Your Browser

Below is a real bigram model over a snippet of Shakespeare (the same tinyshakespeare dataset used in makemore and nanoGPT). It knows only one thing: given the last character, what is likely to come next? Click Step a few times. Watch it work.

Seed char: T
Probability of next char given "T": "o" 50.0% · "h" 37.5% · ":" 12.5%
Notice: after "T", this snippet favors "o" (50.0%), with "h" close behind (37.5%). After a space, vowels light up. The model never read Shakespeare; it just counted co-occurrences. That's already a language model.
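The counting behind the demo can be sketched in a few lines. The text below is a stand-in (the demo's exact snippet is not shown here), and `next_char_probs` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Stand-in text; the lesson's demo uses a snippet of tinyshakespeare instead.
text = "To be, or not to be: that is the question."

# counts[a][b] = number of times character b directly follows character a.
counts = defaultdict(Counter)
for current_char, next_char in zip(text, text[1:]):
    counts[current_char][next_char] += 1


def next_char_probs(char):
    """Turn the raw counts for one character into probabilities."""
    followers = counts[char]
    total = sum(followers.values())
    return {c: n / total for c, n in followers.items()}


print(next_char_probs("t"))
```

Nothing here is learned by gradient descent. The "model" is the table of counts, and a prediction is a lookup followed by a division.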

3. But… Where Do "Tokens" Come From?

Computers read numbers, not words. Tokenization splits raw text into discrete units, then assigns each an integer ID. Three common strategies — try them:

Character-level tokens (65, with · marking each space):
The·quick·brown·fox·jumps·over·the·lazy·dog.·Tokenization·is·fun!

Vocab: tiny (~80 symbols), but sequences get long. The model is forced to learn spelling.

Trade-off in your head:
  • Char-level: tiny vocab (~80 symbols), but long sequences.
  • Word-level: short sequences, huge vocab, can't handle new words (<UNK>).
  • BPE (byte-pair encoding) / subword: the sweet spot. GPT-2's vocabulary has about 50k tokens (50,257).
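The first two strategies are easy to try by hand. The sketch below tokenizes the example sentence at character and word level and maps tokens to integer IDs. `build_vocab` and the `<UNK>` ID convention are assumptions for illustration, and BPE is left out because it needs a training step of its own.

```python
text = "The quick brown fox jumps over the lazy dog. Tokenization is fun!"

# Character-level: every character (spaces and punctuation included) is a token.
char_tokens = list(text)

# Word-level: a naive whitespace split; punctuation stays glued to words.
word_tokens = text.split()


def build_vocab(tokens):
    """Map each distinct token to an integer ID, in sorted order."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}


char_vocab = build_vocab(char_tokens)
word_vocab = build_vocab(word_tokens)

char_ids = [char_vocab[t] for t in char_tokens]
# Unseen words fall back to a hypothetical <UNK> ID one past the vocab.
unk_id = len(word_vocab)
new_ids = [word_vocab.get(w, unk_id) for w in "The quick red fox".split()]

print(len(char_tokens), len(char_vocab))  # sequence length vs. vocab size
print(len(word_tokens), len(word_vocab))
print(new_ids)
```

The character sequence is 65 tokens long with a small vocabulary; the word sequence is 12 tokens long, but "red" was never seen, so it collapses to the `<UNK>` ID.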

4. Tokens Become Vectors — "Embeddings"

Once a token has an integer ID, we look it up in a giant embedding table — that gives us a vector (say, 768 numbers). We need a representation where similar things are close together and that gradient descent can nudge. Integers don't have that property; vectors do.
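As a rough sketch, an embedding table is just a list of rows, and "looking up" a token means indexing by its ID. The toy vocab, the 4-dimensional size, and the random initialization below are all assumptions for illustration; real tables are learned during training.

```python
import random

vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}  # hypothetical toy vocab
embed_dim = 4  # the lesson's example uses 768; 4 keeps the printout readable

rng = random.Random(0)
# The embedding table: one row of embed_dim floats per token ID.
# Rows start random; training would nudge them with gradient descent.
embedding_table = [
    [rng.gauss(0.0, 1.0) for _ in range(embed_dim)]
    for _ in range(len(vocab))
]


def embed(tokens):
    """Look up each token's ID, then fetch that row of the table."""
    return [embedding_table[vocab[tok]] for tok in tokens]


vectors = embed(["the", "cat", "sat"])
print(len(vectors), len(vectors[0]))
```

The lookup itself involves no arithmetic. What makes the table useful is that each row is a set of continuous numbers that training can adjust by small amounts.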

An embedding space, visualized (2D projection):

Hover words to see neighbors. The model was never told "king is like queen" — it figured that out from co-occurrence. Meaning falls out of context.
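One common way to measure "closeness" between embeddings is cosine similarity. The sketch below finds a word's nearest neighbor that way. The 2D vectors are hand-written for illustration, not learned, and `nearest_neighbor` is a hypothetical helper.

```python
import math

# Hand-written 2D vectors, made up for illustration (not learned).
embeddings = {
    "king": [0.90, 0.80],
    "queen": [0.85, 0.90],
    "apple": [-0.70, 0.20],
    "banana": [-0.80, 0.10],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor(word):
    """Return the other word whose vector points in the closest direction."""
    others = (w for w in embeddings if w != word)
    return max(
        others,
        key=lambda w: cosine_similarity(embeddings[word], embeddings[w]),
    )


print(nearest_neighbor("king"), nearest_neighbor("apple"))
```

With these made-up vectors, "king" pairs with "queen" and "apple" with "banana". In a trained model the same lookup runs over vectors that were shaped only by the prediction task.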

Field note

We never tell the network what words mean. We just train it to predict the next token. Meaning is an emergent property of the prediction task. The same will be true at scale — "understanding," "reasoning," all of it.

5. The Bigram Heatmap

Here's the secret of your tiny LM. It's just a 2D table: rows = "current char", columns = "next char", cells = "how often did this happen." Brighter = more likely. Hover to inspect.

This 27×27 matrix is your model. In Module 2 we replace it with a small neural net; in Module 5, with self-attention. The arc of the whole course is right here.
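The table described above can be sketched with plain nested lists. The corpus below is a stand-in, so the matrix is 7×7 rather than the lesson's size, and `cell` is a hypothetical helper that mimics inspecting one heatmap cell.

```python
text = "to be or not to be"  # stand-in corpus, not the lesson's dataset

chars = sorted(set(text))
char_to_id = {c: i for i, c in enumerate(chars)}
V = len(chars)

# counts[i][j] = how often character j directly follows character i.
counts = [[0] * V for _ in range(V)]
for a, b in zip(text, text[1:]):
    counts[char_to_id[a]][char_to_id[b]] += 1

# Normalize each row into P(next | current); rows with no counts stay all-zero.
probs = []
for row in counts:
    total = sum(row)
    probs.append([n / total if total else 0.0 for n in row])


def cell(current_char, next_char):
    """What inspecting one heatmap cell would show."""
    return probs[char_to_id[current_char]][char_to_id[next_char]]


print(V, cell("t", "o"), cell("b", "e"))
```

Each row is its own probability distribution, which is why normalization runs per row and not over the whole table.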

Foundations this lesson is built on

Read these alongside the exercise — global-first, the papers the whole field stands on.

The Kazakh corner

Kazakh is agglutinative: one root can carry a long chain of suffixes, so a single word often spills into several extra tokens. Where Kazakh stands on tokenization:

→ Contribution tree: tokenization

6. The Hot Questions

Don't move on until you can answer these out loud, in your own words. (No Googling.)

  1. Why predict the next token rather than the current one?
  2. What would the bigram heatmap look like for a model that has seen no data?
  3. Why is character-level tokenization "easier to learn" but "harder to scale"?
  4. What does it mean for two embedding vectors to be "close"? Cosine vs. Euclidean — which, and why?
  5. If a bigram model is just a lookup table, what does "training" even mean?

7. Your Mission

Open code/exercise_01_bigram.py. There's a stub waiting. You will:

  • Load tinyshakespeare.txt.
  • Build a vocab (unique chars → integer IDs).
  • Count bigrams — a (V, V) matrix.
  • Normalize rows to probabilities (watch for division by zero).
  • Sample 200 chars of new "Shakespeare." That's the magic moment.
  • Compute the average negative log-likelihood. That's your loss — write it down.
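For reference, the loss in the last step can be written as follows. This is a sketch in notation introduced here for illustration: c_1, …, c_N are the characters of the text, and P is the row-normalized bigram table.

```latex
\mathcal{L} = -\frac{1}{N-1} \sum_{i=1}^{N-1} \log P(c_{i+1} \mid c_i)
```

Lower is better. A model that spread its probability uniformly over V characters would score log V, which gives a baseline to compare your number against.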