Atlas · Lessons

Build an LLM from scratch

An interactive course on how a language model actually works, from a bigram in the browser to a Transformer. Each lesson teaches from the world's foundational papers first, then anchors every layer to where Kazakh stands.

global-first: the world canon · Kazakh-second: tied to the contribution tree
01 · ~75 min · lesson
What is a language model?
Forget GPT-4 for a minute. At its core, every LM does one thing: estimate P(next token | past tokens). A bigram in the browser, tokenization, embeddings, a heatmap.
based on: Bengio et al. 2003 · Mikolov et al. 2013 · Sennrich et al. 2015
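A minimal sketch of the bigram idea from this lesson, assuming whitespace tokenization and a made-up toy corpus (the function names `train_bigram` and `next_token_probs` are illustrative, not the course's code). It estimates P(next token | previous token) from raw counts:

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Normalize the counts for `prev` into P(next | prev)."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

# Toy corpus, invented for illustration only.
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "the")  # "cat" follows "the" twice, "mat" once
```

A real model conditions on a longer history than one token, but the table above is already a complete language model in the sense the lesson uses.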
02 · ~90 min · soon
From counts to neurons (MLP)
Swap the bigram table for a small neural net. Same task, a smarter representation.
based on: Bengio et al. 2003
03 · ~80 min · soon
Embeddings: meaning from context
word2vec and why "king − man + woman ≈ queen". Meaning is a by-product of prediction.
based on: Mikolov et al. 2013
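The analogy can be sketched with vector arithmetic and cosine similarity. The 2-D vectors below are hand-made for illustration (axes loosely "royalty" and "gender"); they are an assumption, not learned word2vec embeddings, and `analogy` is a hypothetical helper:

```python
import math

# Hand-made toy vectors, NOT learned embeddings.
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [0.1, 0.0],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
```

With learned embeddings the match is only approximate, which is why the nearest neighbour is taken rather than an exact equality.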
04 · ~85 min · soon
Attention
Bahdanau attention: the decoder learns to look at the right parts of the input. The bridge to the Transformer.
based on: Bahdanau et al. 2014
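A rough sketch of additive (Bahdanau-style) attention for one decoder step, in plain Python. The weights `W_s`, `W_h`, `v` and the toy states are invented for illustration; in a real model they are learned, and the names are assumptions rather than the paper's notation:

```python
import math

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(dec_state, enc_states, W_s, W_h, v):
    """score_i = v . tanh(W_s s + W_h h_i); returns (weights, context)."""
    proj_s = matvec(W_s, dec_state)
    scores = []
    for h in enc_states:
        proj_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(proj_s, proj_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(enc_states[0])
    # Context vector: attention-weighted average of the encoder states.
    context = [sum(w * h[d] for w, h in zip(weights, enc_states)) for d in range(dim)]
    return weights, context

# Toy inputs, invented for illustration.
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec_state = [0.5, -0.5]
W_s = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, 1.0]
weights, context = additive_attention(dec_state, enc_states, W_s, W_h, v)
```

The weights sum to one, so the context is a soft selection over encoder states rather than a hard choice of one position.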
05 · ~120 min · soon
The Transformer — Attention Is All You Need
Drop recurrence entirely. Multi-head self-attention + RoPE. The architecture of the whole era.
based on: Vaswani et al. 2017 · Su et al. 2021
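A minimal sketch of the RoPE idea, assuming the common formulation in which consecutive pairs of dimensions are rotated by an angle proportional to the position (the function name `rope`, the `base` default, and the toy vectors are illustrative). It checks that the query-key score depends only on the relative offset between positions:

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # one frequency per pair of dimensions
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors, invented for illustration.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

score_a = dot(rope(q, 5), rope(k, 3))    # positions 5 and 3, offset 2
score_b = dot(rope(q, 12), rope(k, 10))  # positions 12 and 10, same offset 2
```

Both scores agree up to floating-point error, because rotating the query by position m and the key by position n leaves a score that depends only on the difference between m and n.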
06 · ~90 min · soon
BERT and pretraining
Masked-LM, bidirectionality, fine-tuning. KazRoBERTa as the Kazakh descendant.
based on: Devlin et al. 2018
07 · ~95 min · soon
Scale: GPT-3 and in-context learning
Few-shot out of nowhere, Chinchilla optimality. Why training data matters as much as model size.
based on: Radford et al. 2019 · Brown et al. 2020 · Hoffmann et al. 2022
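A back-of-the-envelope sketch of the Chinchilla argument. It assumes two widely quoted approximations that are not stated in this page: roughly 20 training tokens per parameter for compute-optimal training, and training compute of about 6 × parameters × tokens FLOPs. The GPT-3 figures (175B parameters, about 300B training tokens) are the commonly reported ones; all function names are illustrative:

```python
TOKENS_PER_PARAM = 20  # rough rule of thumb, an assumption rather than an exact law

def compute_optimal_tokens(n_params):
    """Approximate compute-optimal training tokens for a given model size."""
    return TOKENS_PER_PARAM * n_params

def training_flops(n_params, n_tokens):
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens

gpt3_params = 175e9  # GPT-3 size as reported
gpt3_tokens = 300e9  # approximate training tokens as reported

suggested_tokens = compute_optimal_tokens(gpt3_params)
shortfall = suggested_tokens / gpt3_tokens  # how many times more data the rule suggests
flops_used = training_flops(gpt3_params, gpt3_tokens)
```

Under these assumptions a 175B-parameter model would want on the order of 3.5 trillion tokens, roughly an order of magnitude more than it was trained on, which is the sense in which such models were undertrained.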
08 · ~100 min · soon
Alignment: RLHF
SFT + RLHF (InstructGPT). How a "next-token predictor" turns into an assistant.
based on: Ouyang et al. 2022

Canon

foundational papers · global-first
2003 · A Neural Probabilistic Language Model · Bengio et al. · Learns distributed word vectors jointly with a neural n-gram model; beats the curse of dimensionality.
2013 · Efficient Estimation of Word Representations in Vector Space · Mikolov et al. · word2vec: shallow nets that produce dense embeddings where analogies become vector arithmetic.
2014 · Sequence to Sequence Learning with Neural Networks · Sutskever et al. · An LSTM encoder compresses the input to a vector; a second LSTM decodes the output.
2014 · Neural Machine Translation by Jointly Learning to Align and Translate · Bahdanau et al. · Soft attention: the decoder learns to focus on the right encoder states per step.
2015 · Neural Machine Translation of Rare Words with Subword Units · Sennrich et al. · Byte-Pair Encoding splits words into subwords → open vocabulary, no <UNK>.
2017 · Attention Is All You Need · Vaswani et al. · Drop recurrence entirely: multi-head self-attention is the whole architecture. The Transformer.
2018 · BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · Devlin et al. · Masked-LM pretraining of a bidirectional Transformer, then fine-tune on any task.
2019 · Language Models are Unsupervised Multitask Learners · Radford et al. · GPT-2: a big autoregressive LM picks up tasks with zero task-specific training.
2020 · Language Models are Few-Shot Learners · Brown et al. · GPT-3 (175B): scale alone unlocks in-context few-shot learning.
2022 · Training Compute-Optimal Large Language Models · Hoffmann et al. · Chinchilla: most LLMs are undertrained; scale tokens with parameters.
2022 · Training language models to follow instructions with human feedback · Ouyang et al. · InstructGPT: SFT + RLHF (PPO) aligns a base model with human intent.
2023 · LLaMA: Open and Efficient Foundation Language Models · Touvron et al. · Open 7B–65B models matching GPT-3 class by training longer on more tokens.
2021 · RoFormer: Enhanced Transformer with Rotary Position Embedding · Su et al. · RoPE encodes position by rotating Q/K: relative position for free.