Kazakh NLP Atlas
Kazakh NLP saw an explosion in 2024–2026 — but the growth is concentrated in speech and machine translation. Tokenization and morphology remain a thin, under-mapped frontier. That is exactly where contributions are open.
Field chronology
volume of work by year
Breakthrough timeline
world ⟷ Kazakh · the lag is visible
Territories
by decreasing openness
LLM architecture
a schematic for newcomers · and a contribution map
First, how an LLM is built at all. Arrows = data flow; the colors match the tree below and mark where Kazakh work is dense and where it's empty. Click a block to jump to its layer and its papers.
The same skeleton — but see where it's empty for Kazakh, and which layers a vanilla stack doesn't need at all. Color = status, number = volume of work, bar = how complete the layer is. Click an element for the list of papers.
Citation graph
size = influence · color = topic
Flagship models
claim ≠ verified

| Model | Year | Params | Base | Vocab | Tokenizer | Morphology? | Benchmarks? |
|---|---|---|---|---|---|---|---|
| Til-Core-0.5B · Tıl Qazyna (state) · A loud morphology claim, with validation perplexity as the only metric. A 0.5B/1B family (+Instruct). No independent verification. | 2026 | 497M | Qwen2 arch. (from scratch) | 256 000 | morphBPE: BPE that forbids merges across morpheme boundaries (BiLSTM segmenter) | YES, but the segmenter isn't released | NO, only val-PPL |
| Sherkala-Chat-8B · Inception / MBZUAI · Kazakh fertility 4.73 → 2.04. Morpheme alignment is not discussed. | 2025 | 8B | Llama-3.1 | 159 766 | extended BPE (+25% over Llama-3.1) | no (fertility-driven) | yes |
| SozKZ (50M–600M) · S. Tukenov · Argues via fertility, not via morpheme boundaries. | 2026 | 50–600M | Llama-arch | 50 000 | ByteLevel BPE trained from scratch on Kazakh | no | partial |
| KazByte · R. Akylzhanov · A counterpoint to the whole field: the "tokenizer tax" is solved by removing the tokenizer. "Validation is ongoing"; no published results yet. | 2026 | adapter → Qwen2.5-7B | Qwen2.5 | none (byte-level) | bypasses the tokenizer entirely (byte-level adapter) | n/a (no tokenizer) | NO (position paper) |
| KazLLM (8B / 70B) · ISSAI / NU · 150B+ tokens, 4 languages. No dedicated tokenizer work. | 2024 | 8B, 70B | Llama-3.1 | 128 256 (Llama-3.1) | inherits Llama-3.1, extension undocumented | no | yes (task-perf) |
| Kaz-RoBERTa · kz-transformers · An early baseline. Used in hybrid morphological analyzers. | 2023 | ~83M | RoBERTa | 52 000 | byte-level BPE (Kazakh + code-switched RU dialogues) | no | partial |
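
The morphBPE row describes BPE that forbids merges across morpheme boundaries. Below is a minimal sketch of that idea, assuming the morpheme segmentation is already given. Til-Core's BiLSTM segmenter isn't released, so the hand-written segmentations here are illustrative, and the names `train_constrained_bpe`, `merge_pair`, and `encode` are hypothetical. This is not Til-Core's implementation, only one simple way to enforce the constraint: each morpheme is its own training unit, so no symbol pair is ever counted across a boundary.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_constrained_bpe(segmented_words, num_merges):
    """Learn up to `num_merges` BPE merges that never cross a morpheme boundary.

    segmented_words: list of words, each a list of morpheme strings.
    ASSUMPTION: the segmentation comes from some external segmenter.
    """
    # Each morpheme is a separate unit, so pairs only exist inside a morpheme.
    units = Counter()
    for word in segmented_words:
        for morpheme in word:
            units[tuple(morpheme)] += 1

    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in units.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every morpheme is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_units = Counter()
        for symbols, freq in units.items():
            merged_units[merge_pair(symbols, best)] += freq
        units = merged_units
    return merges


def encode(morphemes, merges):
    """Tokenize a pre-segmented word by replaying the merges inside each morpheme."""
    tokens = []
    for morpheme in morphemes:
        symbols = tuple(morpheme)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


# Illustrative, hand-segmented toy corpus (not a gold standard).
corpus = [["кітап", "тар"], ["кітап", "тар", "ымыз"], ["бала", "лар"]]
merges = train_constrained_bpe(corpus, num_merges=50)
print(encode(["кітап", "тар", "ымыз"], merges))
```

With few merges a morpheme may still be split into several tokens, but no token can span two morphemes. The catch is the dependency this creates: encoding needs the same segmenter at inference time, which is why an unreleased segmenter makes the claim hard to verify.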
Open territory
where contribution is open
Independent audit of morpheme alignment in Kazakh tokenizers
No one has compared several Kazakh-specific tokenizers (Kaz-RoBERTa, SozKZ, Sherkala, Til-Core) on morpheme boundaries against a single gold standard. Arnett 2025 treats Kazakh as one of 70 languages and covers only generic tokenizers; Duisenova 2026 builds a new tokenizer but doesn't audit the existing ones.
SHIPPABLE this week
Empirical test of Til-Core's morphology claim
Til-Core shipped without a single downstream benchmark (only validation perplexity) and with a loud claim of "Kazakh morphology support." Be the first to measure it independently.
part of the audit
Precision/F1 of morpheme alignment for Kazakh tokenizers
The original MorphScore (2024) measures only boundary recall; Arnett 2025 added precision/recall for Kazakh, but only for generic tokenizers (BLOOM, Llama, Gemma). No one has computed precision and F1 for Kazakh-specific tokenizers (Kaz-RoBERTa, SozKZ, Sherkala, Til-Core).
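
A minimal sketch of what boundary precision, recall, and F1 could look like for a single word, assuming a gold morpheme segmentation and a tokenizer output that concatenate to the same string. The function names are hypothetical, and this is a generic boundary-overlap score, not a claim about MorphScore's or Arnett 2025's exact scoring rules. Real tokenizer output would first need its prefix markers and byte-level encoding undone.

```python
def boundaries(pieces):
    """Character offsets of the internal boundaries between consecutive pieces."""
    offsets, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        offsets.add(pos)
    return offsets


def boundary_prf(gold_morphemes, tokens):
    """Precision, recall, and F1 of token boundaries against gold morpheme boundaries.

    ASSUMPTION: both segmentations spell out exactly the same word.
    Convention chosen here: a score with an empty denominator is 0.0.
    """
    if "".join(gold_morphemes) != "".join(tokens):
        raise ValueError("gold and tokens must cover the same string")
    gold = boundaries(gold_morphemes)
    pred = boundaries(tokens)
    hits = len(gold & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative word: кітап + тар + ымыз, with a made-up token split.
gold = ["кітап", "тар", "ымыз"]
tokens = ["кіт", "ап", "тарымыз"]
print(boundary_prf(gold, tokens))  # → (0.5, 0.5, 0.5)
```

In the example, one of the two token boundaries falls on a gold boundary (after "кітап"), and one of the two gold boundaries is recovered. Recall alone rewards over-splitting, since a character-level tokenizer hits every boundary. That is why the precision and F1 columns matter for the audit.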
small add-on to the audit
Joint fertility × morpheme-alignment table
Sherkala reports fertility, the MorphScore work reports alignment — but no one has brought both axes for Kazakh tokenizers into a single table.
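
A sketch of how the two axes could be brought into one table, assuming fertility means the average number of tokens per word (the usual reading; the source doesn't say which variant Sherkala reports) and that alignment scores have been computed separately. The function names are hypothetical. The `chunk2` and `chunk4` tokenizers are stand-ins that cut words into fixed-size pieces, and their alignment values are placeholders, not measurements of any real model.

```python
def fertility(tokenize, words):
    """Average number of tokens per word for a word-level tokenize function."""
    if not words:
        raise ValueError("need at least one word")
    return sum(len(tokenize(w)) for w in words) / len(words)


def joint_table(tokenizers, alignment_f1, words):
    """Rows of (name, fertility, alignment F1 or None), sorted by fertility."""
    rows = [
        (name, fertility(tokenize, words), alignment_f1.get(name))
        for name, tokenize in tokenizers.items()
    ]
    return sorted(rows, key=lambda row: row[1])


def to_markdown(rows):
    """Render the rows as a pipe table; missing alignment scores show as n/a."""
    lines = ["| Tokenizer | Fertility | Alignment F1 |", "|---|---|---|"]
    for name, fert, f1 in rows:
        f1_text = "n/a" if f1 is None else f"{f1:.2f}"
        lines.append(f"| {name} | {fert:.2f} | {f1_text} |")
    return "\n".join(lines)


def make_chunker(n):
    """Stand-in tokenizer (hypothetical): cut a word into pieces of n characters."""
    return lambda word: [word[i:i + n] for i in range(0, len(word), n)]


words = ["кітаптарымыз", "балалар"]
tokenizers = {"chunk2": make_chunker(2), "chunk4": make_chunker(4)}
# PLACEHOLDER alignment scores, not real results; chunk4 is left unmeasured.
alignment_f1 = {"chunk2": 0.40}
print(to_markdown(joint_table(tokenizers, alignment_f1, words)))
```

Keeping both columns in one row makes the trade-off visible. A tokenizer can lower fertility by merging across morpheme boundaries, which the fertility number alone would count as an improvement.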
medium
Usage-vs-morphology divergence (what speakers actually say)
A morphologically correct form ≠ the form a native speaker uses (e.g. "біздің кітаптар" instead of "кітаптарымыз", "неге" as a monolith). No existing work covers this gap methodologically. A native-speaker survey would open a new angle.
mini-survey, 30–50 responses
The paper corpus
arXiv + Semantic Scholar
222 papers shown