Build an LLM from scratch

An interactive course on how a language model actually works, from a bigram in the browser to a Transformer. We teach from the world's foundational papers, then anchor every layer to where Kazakh stands.

global-first: the world canon · Kazakh-second: tied to the contribution tree

01 · ~75 min · lesson

What is a language model?

Forget GPT-4 for a minute. Every LM that has ever existed does one thing: it estimates P(next token | past tokens). This lesson covers a bigram in the browser, tokenization, embeddings, and a heatmap.

based on: Bengio et al. 2003 · Mikolov et al. 2013 · Sennrich et al. 2015
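
As an illustration of P(next token | past tokens) in its simplest form, here is a minimal bigram sketch in Python. It is not the course's browser implementation; the whitespace tokenizer, the toy corpus, and the function names `train_bigram` and `next_token_probs` are assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_probs(counts, prev):
    """Return P(next | prev) as a dict by normalising the raw counts."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    if total == 0:
        return {}  # token never seen as a context
    return {tok: c / total for tok, c in following.items()}


# Toy corpus, split on whitespace (a stand-in for a real tokenizer).
corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "the")
print(probs)  # "cat" follows "the" twice, "mat" once
```

A bigram conditions on only one previous token; the later lessons widen that context.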

02 · ~90 min · soon

From counts to neurons (MLP)

Swap the bigram table for a small neural net. Same task, a smarter representation.

based on: Bengio et al. 2003

03 · ~80 min · soon

Embeddings: meaning from context

word2vec and why "king − man + woman ≈ queen". Meaning is a by-product of prediction.

based on: Mikolov et al. 2013
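
The "king − man + woman ≈ queen" arithmetic can be sketched with a few hand-made vectors. These are not learned word2vec embeddings: the three axes, the values, and the helper names `cosine` and `analogy` are all assumptions chosen so the mechanics are easy to follow.

```python
import math

# Hand-made toy vectors (NOT learned embeddings). The three axes loosely
# stand for "royalty", "male", "female" and are purely illustrative.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.1, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman", vectors)
print(result)  # queen
```

Excluding the three query words from the candidates is the usual convention when evaluating analogies.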

04 · ~85 min · soon

Attention

Bahdanau attention: the decoder learns to look at the right parts of the input. The bridge to the Transformer.

based on: Bahdanau et al. 2014
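
A rough sketch of additive (Bahdanau-style) attention for one decoder step, in plain Python. The scoring form is score_i = v · tanh(W_s s + W_h h_i); the tiny identity weight matrices, the example states, and the function names are assumptions for illustration, not trained parameters.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """Return (weights, context) for one decoder step."""
    proj_s = matvec(W_s, decoder_state)
    scores = []
    for h in encoder_states:
        proj_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(proj_s, proj_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)  # one weight per encoder state
    dim = len(encoder_states[0])
    # Context vector: attention-weighted average of the encoder states.
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)
    ]
    return weights, context


# Illustrative (untrained) parameters: identity projections, all-ones v.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = additive_attention(
    decoder_state=[1.0, 0.0],
    encoder_states=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    W_s=identity,
    W_h=identity,
    v=[1.0, 1.0],
)
print(weights, context)
```

In a trained model W_s, W_h, and v are learned, so the weights come to reflect which input positions matter for the current output step.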

05 · ~120 min · soon

The Transformer — Attention Is All You Need

Drop recurrence entirely. Multi-head self-attention + RoPE. The architecture of the whole era.

based on: Vaswani et al. 2017 · Su et al. 2021
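
To make the RoPE idea concrete, the sketch below rotates adjacent pairs of dimensions by an angle proportional to the token position, then checks that the query-key dot product depends only on the relative offset. The function name `rope`, the example vectors, and the chosen positions are assumptions; the frequency schedule follows the base-10000 form described in the RoFormer paper.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair (x[i], x[i+1]) by pos * theta, theta = base**(-i/d)."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)  # i is already 2 * pair_index
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        out.append(x[i] * c - x[i + 1] * s)
        out.append(x[i] * s + x[i + 1] * c)
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Arbitrary example query/key vectors (not from a real model).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (2 positions apart) at two different absolute positions.
score_near = dot(rope(q, 3), rope(k, 5))
score_far = dot(rope(q, 10), rope(k, 12))
print(score_near, score_far)  # equal up to floating-point error
```

Because both vectors are rotated, the absolute angles cancel in the dot product and only the difference in position survives.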

06 · ~90 min · soon

BERT and pretraining

Masked-LM, bidirectionality, fine-tuning. KazRoBERTa as the Kazakh descendant.

based on: Devlin et al. 2018
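
The masked-LM objective can be sketched as a data-preparation step. Following the recipe in Devlin et al., about 15% of tokens are selected as prediction targets; of those, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The function name `mask_tokens`, the toy vocabulary, and the per-token coin flip (an approximation of selecting exactly 15% of positions) are assumptions for this example.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels); labels are None where no prediction is required."""
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                inputs.append("[MASK]")
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # random replacement
            else:
                inputs.append(tok)  # kept as-is, but still predicted
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


# Toy vocabulary and sentence, repeated so that some tokens get selected.
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = "the cat sat on the mat".split() * 20
inputs, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
print(inputs[:12])
print(labels[:12])
```

The loss is computed only at positions where the label is not None, which is what lets the model use context on both sides.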

07 · ~95 min · soon

Scale: GPT-3 and in-context learning

Few-shot learning that emerges with scale, and Chinchilla compute-optimality. Why training data matters as much as model size.

based on: Radford et al. 2019 · Brown et al. 2020 · Hoffmann et al. 2022
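
As a back-of-the-envelope sketch of the Chinchilla result, the code below splits a compute budget between parameters and tokens. It relies on two commonly quoted approximations, both assumptions here rather than exact results: training compute C ≈ 6 · N · D FLOPs, and roughly 20 training tokens per parameter. The function name `compute_optimal` is hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0  # rule-of-thumb ratio (approximation)
FLOPS_PER_PARAM_TOKEN = 6.0  # from the C ~ 6 * N * D approximation


def compute_optimal(compute_flops):
    """Split a FLOP budget into (n_params, n_tokens) under the two assumptions above."""
    # C = 6 * N * D and D = 20 * N  =>  N = sqrt(C / (6 * 20))
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# Budget matching Chinchilla itself: 70B parameters trained on 1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12
n_params, n_tokens = compute_optimal(budget)
print(f"{n_params:.3g} params, {n_tokens:.3g} tokens")
```

Under these assumptions parameters and tokens both grow with the square root of compute, so doubling model size calls for doubling the data as well.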

08 · ~100 min · soon

Alignment: RLHF

SFT + RLHF (InstructGPT). How a "next-token predictor" turns into an assistant.

based on: Ouyang et al. 2022
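
One piece of the RLHF pipeline that fits in a few lines is the reward model's pairwise loss: given a human-preferred response and a rejected one, the loss is −log σ(r_chosen − r_rejected). The sketch below computes that value for scalar rewards; the function name and the example reward values are assumptions, and the PPO stage is not shown.

```python
import math


def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, avoid overflow in exp(-margin).
    return -margin + math.log1p(math.exp(margin))


# Made-up scalar rewards for illustration.
loss_tied = pairwise_reward_loss(0.0, 0.0)  # model has no preference
loss_right = pairwise_reward_loss(2.0, 0.0)  # agrees with the human label
loss_wrong = pairwise_reward_loss(0.0, 2.0)  # disagrees with the human label
print(loss_tied, loss_right, loss_wrong)
```

Minimising this loss pushes the reward of preferred responses above that of rejected ones, which gives the later reinforcement-learning step a signal to optimise.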

Canon

foundational papers · global-first

2003 · A Neural Probabilistic Language Model · Bengio et al. · Learns distributed word vectors jointly with a neural n-gram model; beats the curse of dimensionality.
2013 · Efficient Estimation of Word Representations in Vector Space · Mikolov et al. · word2vec: shallow nets that produce dense embeddings where analogies become vector arithmetic.
2014 · Sequence to Sequence Learning with Neural Networks · Sutskever et al. · An LSTM encoder compresses the input to a vector; a second LSTM decodes the output.
2014 · Neural Machine Translation by Jointly Learning to Align and Translate · Bahdanau et al. · Soft attention: the decoder learns to focus on the right encoder states per step.
2015 · Neural Machine Translation of Rare Words with Subword Units · Sennrich et al. · Byte-Pair Encoding splits words into subwords → open vocabulary, no <UNK>.
2017 · Attention Is All You Need · Vaswani et al. · Drop recurrence entirely: multi-head self-attention is the whole architecture. The Transformer.
2018 · BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · Devlin et al. · Masked-LM pretraining of a bidirectional Transformer, then fine-tune on any task.
2019 · Language Models are Unsupervised Multitask Learners · Radford et al. · GPT-2: a big autoregressive LM picks up tasks with zero task-specific training.
2020 · Language Models are Few-Shot Learners · Brown et al. · GPT-3 (175B): scale alone unlocks in-context few-shot learning.
2022 · Training Compute-Optimal Large Language Models · Hoffmann et al. · Chinchilla: most LLMs are undertrained; scale tokens with parameters.
2022 · Training language models to follow instructions with human feedback · Ouyang et al. · InstructGPT: SFT + RLHF (PPO) aligns a base model with human intent.
2023 · LLaMA: Open and Efficient Foundation Language Models · Touvron et al. · Open 7B–65B models matching GPT-3 class by training longer on more tokens.
2021 · RoFormer: Enhanced Transformer with Rotary Position Embedding · Su et al. · RoPE encodes position by rotating Q/K; relative position for free.