Research cartography · 2013–2026

Kazakh NLP Atlas

The State of Kazakh NLP Research

Kazakh NLP research surged in 2024–2026, but the growth is concentrated in speech and machine translation. Tokenization and morphology remain a thin, under-mapped frontier, and that is exactly where contributions are open.

222

papers in corpus

2013–2026

years

47%

in 2024–2026

on tokenization

Field chronology

volume of work by year

[Bar chart: papers per year, x-axis 2013 to 2026]

▮ in gold — 2024–2026: the LLM era arrives in Kazakh

Breakthrough timeline

world ⟷ Kazakh · the lag is visible

🌍 World · 🇰🇿 Kazakhstan

2013

🌍 World

word2vec

Dense vector representations of words.

2014

🌍 World

seq2seq

Encoder-decoder: translation as generation.

2014

🌍 World

Attention (Bahdanau)

Learned alignment: the end of the fixed-length encoding bottleneck in seq2seq.

2016

🌍 World

BPE for NMT (Sennrich)

Subword units → rare words and morphology become tractable.

2017

🌍 World

Transformer

"Attention Is All You Need": the architecture of the whole era.

2018

🌍 World

BERT

Bidirectional pretraining, transfer to downstream tasks.

2019

🇰🇿 Kazakhstan

ASR / NMT in the neural mainstream

WMT19 introduces kk–en; continuous ASR enters production. The first neural works date to 2015–2017.

2019

🌍 World

XLM-R

Multilingual pretraining over 100 languages — Kazakh included.

2020

🇰🇿 Kazakhstan

Kazakh Speech Corpus (KSC)

A foundational dataset — the No. 1 hub in the citation graph.

2020

🌍 World

GPT-3

In-context learning: scale as a capability.

2021

🇰🇿 Kazakhstan

KazNERD + KazakhTTS

The first open NER and TTS resources for Kazakh.

2021

🌍 World

How Good is Your Tokenizer

Fertility: the cost of a tokenizer for non-English.

2022

🌍 World

Chinchilla

Compute-optimal training: for a fixed budget, most large models were undertrained on data relative to their size.

2023

🇰🇿 Kazakhstan

Kaz-RoBERTa

One of the first Kazakh pretrained models (BPE 52k, kz-transformers).

2023

🌍 World

LLaMA + Tokenizer Unfairness

Open weights → a wave of local adaptations; the low-resource token tax measured (Petrov et al.).

2024

🇰🇿 Kazakhstan

KazLLM (ISSAI)

The first large open Kazakh LLM (8B/70B, 150B+ tokens).

2024

🇰🇿 Kazakhstan

"Do LLMs Speak Kazakh?"

A pilot systematic evaluation of Kazakh across 7 models.

2024

🌍 World

MorphScore

A metric for tokenizer morpheme alignment.

2025

🇰🇿 Kazakhstan

Sherkala-Chat

SOTA chat at release, vocab 159k, fertility 4.73→2.04.

2026

🇰🇿 Kazakhstan

SozKZ

A from-scratch Kazakh SLM (50k BPE, 50–600M).

2026

🇰🇿 Kazakhstan

KazByte

Tokenizer-free: a byte-level adapter for Qwen2.5. Validation is ongoing.

2026

🇰🇿 Kazakhstan

Til-Core (state)

morphBPE with a 256k vocabulary: a loud morphology claim, but not a single downstream benchmark.
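The byte-level idea in the KazByte entry above can be made concrete with a minimal sketch. It shows only what UTF-8 byte input looks like for Kazakh text: Cyrillic letters take two bytes each, so a byte sequence is about twice as long as the character sequence. It does not reproduce KazByte's adapter, whose details are not given here; the helper names `byte_ids` and `bytes_per_char` are hypothetical.

```python
def byte_ids(text: str) -> list[int]:
    """Map text to its UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per non-space character."""
    chars = [c for c in text if not c.isspace()]
    return sum(len(c.encode("utf-8")) for c in chars) / len(chars)


word = "кітаптарымыз"  # 12 Cyrillic characters
ids = byte_ids(word)

print(len(word), len(ids))
print(bytes_per_char(word), bytes_per_char("books"))

# The mapping is lossless: the bytes decode back to the original text.
restored = bytes(ids).decode("utf-8")
```

A byte-level model never meets an out-of-vocabulary word, but it pays for that with longer sequences, which is the trade-off any tokenizer-free approach has to justify empirically.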

Territories

by decreasing openness

◇ Tokenization · 21

A thin frontier. The boom is 2024–2026. There is no independent audit of morpheme alignment.

❖ Morphology / segmentation · 48

New segmenters are being built, but no one audits what existing tokenizers actually do.

◈ Language models / LLMs · 50

mapped

▣ Evaluation / benchmarks · 76

Benchmarks are few and scattered. Til-Core shipped without a single downstream benchmark.

✶ Embeddings · 22

mapped

◐ Classification / sentiment · 26

mapped

◎ NER / extraction · 55

mapped

⇄ Machine translation · 105

dense

◉ Speech (ASR / TTS) · 110

dense

▤ Datasets / corpora · 146

dense

LLM architecture

a schematic for newcomers · and a contribution map

First, how an LLM is built at all. Arrows = data flow; the colors are the same as in the tree below: where Kazakh is dense, where it's empty.

Forward pass · how the model "thinks"

Input: Kazakh text · "My name is…"

1. Tokenization: text → tokens (subword / BPE)
   ↳ agglutination → extra tokens (token tax)

2. Embeddings + positional encoding: tokens → vectors, plus word order

3. Transformer block (× N layers): the heart of the model, repeated N times
   Self-Attention: who attends to whom in the text
   Feed-Forward: "thinks" over each token
   ↳ residual + LayerNorm around each sub-layer

4. Output: next-token probabilities → generation
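The four steps of the forward pass can be sketched in pure Python. This is a toy under loud assumptions: weights are random, attention is single-head and causal with identity Q/K/V projections, the output layer reuses the embedding matrix, and the sizes (`VOCAB`, `D`, `HIDDEN`, `N_LAYERS`) are made up. It shows the data flow, not any real Kazakh model.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def self_attention(xs):
    """Causal single-head attention; Q, K, V are the inputs themselves (toy)."""
    d = len(xs[0])
    out = []
    for i, q in enumerate(xs):
        scores = [dot(q, xs[j]) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * xs[j][k] for j, w in enumerate(weights))
                    for k in range(d)])
    return out


def feed_forward(v, w1, w2):
    hidden = [max(0.0, dot(row, v)) for row in w1]  # ReLU
    return [dot(row, hidden) for row in w2]


def forward(token_ids, emb, pos, layers):
    # Step 2: token embedding plus positional encoding.
    xs = [[e + p for e, p in zip(emb[t], pos[i])]
          for i, t in enumerate(token_ids)]
    # Step 3: N blocks, each sub-layer wrapped in residual + LayerNorm.
    for w1, w2 in layers:
        att = self_attention(xs)
        xs = [layer_norm([a + b for a, b in zip(x, y)])
              for x, y in zip(xs, att)]
        xs = [layer_norm([a + b for a, b in zip(x, feed_forward(x, w1, w2))])
              for x in xs]
    # Step 4: next-token probabilities from the last position (tied weights).
    return softmax([dot(row, xs[-1]) for row in emb])


def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
VOCAB, D, HIDDEN, N_LAYERS = 10, 8, 16, 2
emb = rand_matrix(VOCAB, D)
pos = rand_matrix(32, D)
layers = [(rand_matrix(HIDDEN, D), rand_matrix(D, HIDDEN))
          for _ in range(N_LAYERS)]

# Step 1 (tokenization) is assumed done: these are hypothetical token ids.
probs = forward([3, 1, 4], emb, pos, layers)
print(len(probs), round(sum(probs), 6))
```

The point for Kazakh is step 1: every extra token produced by the tokenizer lengthens `token_ids`, and the attention loop above grows quadratically with that length.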

Training · how the model is built

Corpus → Pretraining → SFT / fine-tuning → RLHF / DPO → Evaluation → Serving

Kazakh LLM stack

The same skeleton, but see where it's empty for Kazakh, and which layers a vanilla stack doesn't need at all. Color = status, number = volume of work, bar = how complete the layer is.

many papers · active · incomplete · almost none

Data · 154 papers

Tokenization · 59 papers

Representations · 28 papers

Model · 50 papers

Adaptation · 28 papers

Evaluation · 76 papers

Inference · 4 papers

Applications · 210 papers

Citation graph

size = influence · color = topic


104 related papers + 15 global hubs · 312 citation edges · method: s2-batch. 118 more papers without edges — in the list below.

Flagship models

claim ≠ verified

Model	Developer	Year	Params	Base	Vocab	Tokenizer	Morphology?	Benchmarks?	Notes
Til-Core-0.5B	Tıl Qazyna (state)	2026	497M	Qwen2 arch. (from scratch)	256 000	morphBPE: BPE that forbids merges across morpheme boundaries (BiLSTM segmenter)	YES, but the segmenter isn't released	NO, only val-PPL	A loud morphology claim, with validation perplexity as the only metric. A 0.5B/1B family (+Instruct). No independent verification.
Sherkala-Chat-8B	Inception / MBZUAI	2025	8B	Llama-3.1	159 766	extended BPE (+25% over Llama-3.1)	no (fertility-driven)	yes	Kazakh fertility 4.73 → 2.04. Morpheme alignment is not discussed.
SozKZ (50M–600M)	S. Tukenov	2026	50–600M	Llama-arch	50 000	ByteLevel BPE trained from scratch on Kazakh	no	partial	Argues via fertility, not via morpheme boundaries.
KazByte	R. Akylzhanov	2026	adapter→Qwen2.5-7B	Qwen2.5	n/a (byte-level)	bypasses the tokenizer entirely (byte-level adapter)	n/a, no tokenizer	NO, position paper	A counterpoint to the whole field: the "tokenizer tax" is solved by removing the tokenizer. "Validation is ongoing"; no published results yet.
KazLLM (8B / 70B)	ISSAI / NU	2024	8B, 70B	Llama-3.1	128 256 (Llama-3.1)	inherits Llama-3.1, extension undocumented	no	yes (task-perf)	150B+ tokens, 4 languages. No dedicated tokenizer work.
Kaz-RoBERTa	kz-transformers	2023	~83M	RoBERTa	52 000	byte-level BPE (Kazakh + code-switched RU dialogues)	no	partial	An early baseline. Used in hybrid morphological analyzers.
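The table describes morphBPE as BPE that forbids merges across morpheme boundaries. One straightforward way to enforce such a constraint, sketched below, is to train BPE on morphemes as separate units so that no merge can ever span a boundary. This is an assumption about how the constraint could work, not Til-Core's implementation: its segmenter is not released, so the segmentations here are hand-made for illustration and the function names are hypothetical.

```python
from collections import Counter


def apply_merge(unit, pair):
    """Replace every adjacent occurrence of `pair` in `unit` with one symbol."""
    out, i = [], 0
    while i < len(unit):
        if i + 1 < len(unit) and (unit[i], unit[i + 1]) == pair:
            out.append(unit[i] + unit[i + 1])
            i += 2
        else:
            out.append(unit[i])
            i += 1
    return tuple(out)


def train_bpe(units, num_merges):
    """Learn BPE merges. Pairs are counted only inside a unit (a morpheme)."""
    units = [tuple(u) for u in units]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for u in units:
            pairs.update(zip(u, u[1:]))
        if not pairs:
            break  # every unit is already a single symbol
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic ties
        merges.append(best)
        units = [apply_merge(u, best) for u in units]
    return merges


def encode(morphemes, merges):
    """Tokenize a pre-segmented word; tokens never cross a morpheme boundary."""
    tokens = []
    for morpheme in morphemes:
        unit = tuple(morpheme)
        for pair in merges:
            unit = apply_merge(unit, pair)
        tokens.extend(unit)
    return tokens


# Hand-made segmentations standing in for a real morphological segmenter.
corpus = [
    ["кітап", "тар", "ымыз"],
    ["кітап", "тар"],
    ["бала", "лар", "ымыз"],
    ["бала", "лар"],
    ["кітап", "ымыз"],
]
units = [tuple(m) for word in corpus for m in word]
merges = train_bpe(units, num_merges=50)

tokens = encode(["кітап", "тар", "ымыз"], merges)
print(tokens)
```

Because the constraint lives entirely in the segmentation step, the quality of such a tokenizer cannot be checked without access to the segmenter or to a gold standard, which is why the missing release matters.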

Open territory

where contribution is open

◆ YOU ARE HERE

Independent audit of morpheme alignment in Kazakh tokenizers

No one has compared several Kazakh-specific tokenizers (Kaz-RoBERTa, SozKZ, Sherkala, Til-Core) on morpheme boundaries against a single gold standard. Arnett 2025 treats Kazakh as 1 of 70 languages and covers only generic tokenizers; Duisenova 2026 builds a new one but doesn't audit the existing ones.

SHIPPABLE this week

◆ YOU ARE HERE

Empirical test of Til-Core's morphology claim

Til-Core shipped without a single downstream benchmark (only validation perplexity) and with a loud claim of "Kazakh morphology support." Be the first to measure it independently.

part of the audit

◆ YOU ARE HERE

Precision/F1 of morpheme alignment for Kazakh tokenizers

The original MorphScore (2024) measures only boundary recall; Arnett 2025 added precision/recall for Kazakh, but only for generic tokenizers (BLOOM, Llama, Gemma). No one has computed precision and F1 for Kazakh-specific tokenizers (Kaz-RoBERTa, SozKZ, Sherkala, Til-Core).

small add-on to the audit
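Boundary-level precision, recall and F1 can be sketched as a comparison of character offsets between a tokenizer's split and a gold morpheme split. This is a plain boundary metric written for illustration, not the MorphScore implementation; the tokenization in the example is hypothetical, and the gold segmentation is hand-made.

```python
def boundaries(pieces):
    """Character offsets of the internal boundaries in a segmented word."""
    offsets, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        offsets.add(pos)
    return offsets


def boundary_prf(tokens, morphemes):
    """Precision, recall and F1 of token boundaries against gold boundaries."""
    if "".join(tokens) != "".join(morphemes):
        raise ValueError("tokens and morphemes must spell the same word")
    pred, gold = boundaries(tokens), boundaries(morphemes)
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["кітап", "тар", "ымыз"]          # gold boundaries at offsets 5 and 8
tokens = ["кі", "тап", "тары", "мыз"]    # hypothetical tokenizer output
precision, recall, f1 = boundary_prf(tokens, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Recall alone rewards over-splitting: a character-level tokenizer hits every gold boundary. Precision penalizes the extra cuts, which is why the audit needs both.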

○ open

Joint fertility × morpheme-alignment table

Sherkala reports fertility, the MorphScore work reports alignment — but no one has brought both axes for Kazakh tokenizers into a single table.

medium
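A joint table needs both axes computed over the same word list. The sketch below does that for toy tokenizers, taking fertility as average tokens per word and alignment as boundary recall. The fixed-width chunk tokenizers are stand-ins chosen only to show that the two axes can diverge; real entries would come from the actual Kazakh tokenizers, and the two-word gold list is hand-made.

```python
def chunk_tokenizer(n):
    """Toy tokenizer: split a word into fixed-width chunks of n characters."""
    def tokenize(word):
        return [word[i:i + n] for i in range(0, len(word), n)]
    return tokenize


def boundary_offsets(pieces):
    offsets, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        offsets.add(pos)
    return offsets


def fertility(tokenize, words):
    """Average number of tokens per word."""
    return sum(len(tokenize(w)) for w in words) / len(words)


def boundary_recall(tokenize, gold_segmentations):
    """Share of gold morpheme boundaries that coincide with a token boundary."""
    hits = total = 0
    for morphemes in gold_segmentations:
        gold = boundary_offsets(morphemes)
        pred = boundary_offsets(tokenize("".join(morphemes)))
        hits += len(gold & pred)
        total += len(gold)
    return hits / total if total else 0.0


gold = [["кітап", "тар", "ымыз"], ["бала", "лар"]]  # hand-made gold standard
words = ["".join(m) for m in gold]

tokenizers = {
    "chars": chunk_tokenizer(1),
    "chunks-of-3": chunk_tokenizer(3),
    "whole-word": chunk_tokenizer(100),
}
table = {name: (fertility(tok, words), boundary_recall(tok, gold))
         for name, tok in tokenizers.items()}

print(f"{'tokenizer':<12} {'fertility':>9} {'recall':>7}")
for name, (fert, rec) in table.items():
    print(f"{name:<12} {fert:>9.2f} {rec:>7.2f}")
```

The character tokenizer has perfect recall and the worst fertility, while the whole-word tokenizer is the reverse, so neither axis can be read on its own.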

◆ YOU ARE HERE

Usage-vs-morphology divergence (what speakers actually say)

A morphologically correct form ≠ the form a native speaker uses (e.g. "біздің кітаптар" instead of "кітаптарымыз", both "our books"; "неге", "why", as a monolith). No existing work covers this methodologically. A native-speaker survey → a new angle.

mini-survey, 30–50 responses