What can you do right now?
You don't need to invent a new paradigm. The field is full of concrete, under-covered gaps and claims nobody has verified. Below are the pain points and unverified hypotheses of Kazakh NLP, each with a citation to real work. Pick the one within your reach and make your first contribution.
Pain points
12 · with citations
High "tokenizer tax" of Kazakh text
Multilingual tokenizers (BPE, SentencePiece), trained predominantly on high-resource languages, split Kazakh words into 3–5x more tokens than English words. This shortens the effective context window and increases compute cost. KazByte names this "tokenizer tax" as its main motivation, while SozKZ sidesteps the problem by training a tokenizer from scratch.
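One way to put a number on the tokenizer tax is fertility, the average number of tokens per word. The sketch below is a minimal illustration, not the measurement used by KazByte or SozKZ. The `utf8_bytes` stand-in tokenizer and the two sample sentences are assumptions made for the example; a meaningful figure needs the real tokenizer's encode function and a parallel corpus.

```python
from typing import Callable, Iterable, List


def fertility(words: Iterable[str], tokenize: Callable[[str], List]) -> float:
    """Average number of tokens produced per word."""
    words = list(words)
    if not words:
        raise ValueError("need at least one word")
    return sum(len(tokenize(w)) for w in words) / len(words)


def utf8_bytes(word: str) -> List[int]:
    """Stand-in tokenizer (assumption): one token per UTF-8 byte."""
    return list(word.encode("utf-8"))


# Illustrative sentences only, not a benchmark.
kk_words = "Мен кітап оқып отырмын".split()
en_words = "I am reading a book".split()

kk_fertility = fertility(kk_words, utf8_bytes)
en_fertility = fertility(en_words, utf8_bytes)

# Ratio above 1 means Kazakh words cost more tokens than English words.
tax = kk_fertility / en_fertility
print(f"kk={kk_fertility:.2f} en={en_fertility:.2f} tax={tax:.2f}x")
```

To measure an actual tokenizer, pass its encode function as `tokenize` in place of `utf8_bytes`; the rest of the sketch stays the same.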
No standardized benchmark for morphology
Despite many works on Kazakh morphological analysis, there is no single public dataset with gold annotation of morpheme boundaries and POS tags for reproducible comparison. Differing annotation schemes make it impossible to compare results directly.
Weak support for Kazakh-Russian code-switching
Speakers regularly switch between Kazakh and Russian within a single utterance (intra-sentential code-switching), but most ASR/NLP systems are trained on monolingual data. KSC2 contains such samples (proportion undisclosed), and dedicated datasets are almost nonexistent.
No unified set of LLM evaluation metrics
Benchmarks for Kazakh are fragmented: KazMMLU, TUMLU, KazQAD, Qorgau, and KZ-SafetyPrompts use different protocols, models, and metrics. With no single leaderboard, progress cannot be compared objectively across works.
ASR for spontaneous and children's speech is almost absent
Most Kazakh ASR systems are trained on read speech (news, books, parliament). Children's speech, spontaneous dialogue, telephone conversations, and accented speech are minimally represented: the only children's-speech corpus covers children aged 2–8 and was collected via a Telegram bot, voice recorders, and home recordings.
OCR for Arabic and Latin Kazakh script barely exists
KazakhOCR showed that all multimodal LLMs (Gemma-3, Qwen2.5-VL, Llama-3.2-Vision) fail on Kazakh Arabic and Latin scripts, confusing them with Arabic, Persian, and Kurdish. There is no public dataset of real (non-synthetic) images.
Emotional / paralinguistic resources are limited to a single dataset
KazEmoTTS is the first public corpus of emotional Kazakh speech (74.85 h, 6 emotions). This is too little for emotion recognition in spontaneous speech and for multimodal analysis. Probing Whisper showed that emotion concentrates in the middle layers, but there are no downstream experiments.
NER does not cover specialized domains (medicine, law)
KazNERD is built from TV news. Legal, medical, and scientific texts have different terminology and entities. In one study of translation post-editing, fine-tuning yielded the largest quality gain precisely in the legal (+17%) and medical (+22%) domains, which suggests these domains have the most room for improvement.
Parallel corpora are small for most language pairs
KazParC is the first large public parallel corpus (kk–en–ru–tr, ~372K sentences), but it covers only 4 languages. The pairs kk–zh, kk–uz, kk–ky are used in training, but verified corpora are minimal, which limits translation quality.
Sentiment datasets cover only reviews, not all genres
KazSAnDRA is the largest public sentiment dataset (180K), but it consists only of consumer reviews. News, social media, and political statements are barely represented, which limits the applicability of classifiers.
Embeddings have no standard intrinsic benchmark
Cross-lingual embeddings for Turkic languages have been studied, but for Kazakh there is no public set of analogies or a SimLex-like resource that would allow comparing embedding quality without a downstream task.
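A SimLex-like resource would be used roughly as follows: compute the cosine similarity for each rated word pair and correlate it with the human scores using Spearman's rank correlation. The sketch below is a minimal pure-Python version under stated assumptions: the word keys, vectors, and ratings are hypothetical placeholders, since no such Kazakh resource exists yet.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def evaluate(embeddings: Dict[str, Sequence[float]],
             rated_pairs: Sequence[Tuple[str, str, float]]) -> float:
    """Correlate model similarity with human ratings, skipping OOV pairs."""
    model, human = [], []
    for w1, w2, score in rated_pairs:
        if w1 in embeddings and w2 in embeddings:
            model.append(cosine(embeddings[w1], embeddings[w2]))
            human.append(score)
    return spearman(model, human)


# Hypothetical toy data: keys stand in for Kazakh words.
toy_embeddings = {"w_a": [1.0, 0.0], "w_b": [0.9, 0.1],
                  "w_c": [0.0, 1.0], "w_d": [0.5, 0.5]}
toy_pairs = [("w_a", "w_b", 9.0), ("w_a", "w_d", 5.0), ("w_a", "w_c", 1.0)]
rho = evaluate(toy_embeddings, toy_pairs)
```

Skipping out-of-vocabulary pairs is one design choice among several; reporting the coverage alongside the correlation keeps models with different vocabularies comparable.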
Punctuation and normalization of ASR output are barely studied
The only work on punctuation/capitalization restoration for Kazakh uses only Wikipedia and books and reports a low F1 for rare marks (exclamation mark: F1=32.85). Normalization of ASR output in real applications remains unsolved.
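Per-mark scores such as the exclamation-mark F1 above come from treating restoration as token-level labeling and scoring each mark separately. The following is a minimal sketch of that computation, not the cited work's evaluation script; the label scheme (`"O"` for "no mark after this token") and the toy sequences are assumptions.

```python
from collections import Counter
from typing import Dict, Sequence


def per_mark_f1(gold: Sequence[str], pred: Sequence[str],
                no_mark: str = "O") -> Dict[str, float]:
    """F1 for each punctuation label, ignoring the 'no mark' label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != no_mark:
                tp[g] += 1
        else:
            if p != no_mark:
                fp[p] += 1  # predicted a mark that is not in gold here
            if g != no_mark:
                fn[g] += 1  # missed (or mislabeled) a gold mark
    labels = set(tp) | set(fp) | set(fn)
    # F1 = 2TP / (2TP + FP + FN); the denominator is > 0 for every label seen.
    return {m: 2 * tp[m] / (2 * tp[m] + fp[m] + fn[m]) for m in labels}


# Toy sequences: one label per token position.
gold = ["O", ",", "O", ".", "O", "!", "O", "."]
pred = ["O", ",", ",", ".", "O", ".", "O", "."]
scores = per_mark_f1(gold, pred)
for mark in sorted(scores):
    print(mark, round(scores[mark], 3))
```

Reporting each mark separately matters because frequent marks (comma, period) dominate any averaged score and can hide a near-zero result on rare ones.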
Unverified hypotheses
8 · testable
Morphology-aware segmentation improves downstream tasks in Kazakh compared to BPE
The intuition that "accounting for morpheme boundaries should help agglutinative languages" is widespread. But Sälevä & Lignos (2021) on en–kk (one of the three pairs in the work) showed that morphology-aware methods (LMVR, MORSEL) give no consistent advantage over BPE — the best method varies, and the results are statistically indistinguishable.
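Whether two segmentation methods are "statistically indistinguishable" can be checked with a paired bootstrap over per-sentence scores. The sketch below shows the general technique and is not necessarily the test used by Sälevä & Lignos; the function name, the default of 1000 resamples, and the example scores are assumptions.

```python
import random
from typing import Sequence


def paired_bootstrap(scores_a: Sequence[float], scores_b: Sequence[float],
                     n_resamples: int = 1000, seed: int = 0) -> float:
    """Fraction of resampled test sets on which system A beats system B.

    scores_a[i] and scores_b[i] must be scores of the two systems on the
    same sentence i. Values near 0.5 mean neither system wins reliably.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample sentence indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in idx)
        if diff > 0:
            wins += 1
    return wins / n_resamples


# Hypothetical per-sentence scores for a morphology-aware system and BPE.
morph = [0.31, 0.42, 0.28, 0.35, 0.40]
bpe = [0.30, 0.44, 0.27, 0.36, 0.39]
print(paired_bootstrap(morph, bpe))
```

A common convention is to call A significantly better only when it wins on at least 95% of resamples; anything in between supports the "no consistent advantage" reading.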
Byte-level tokenization outperforms BPE for Kazakh due to agglutinativity
KazByte hypothesizes that raw bytes, fed through an adapter into a frozen Qwen2.5-7B, will match or surpass the original model. The authors explicitly state "empirical validation is ongoing", and no published comparisons exist. For other languages, byte-level tokenization has not yielded an unambiguous advantage.
Transfer from Turkish is more effective than transfer from Russian for Kazakh tasks
The typological closeness of Kazakh and Turkish (agglutination, vowel harmony, SOV) is often cited as a rationale for cross-lingual transfer, but there is no systematic "from Turkish vs. from Russian" comparison on fixed tasks (NER, SA, QA).
Reasoning in English with the answer translated into Kazakh preserves quality in modern LLMs
"Left Behind" (2026) showed that cross-lingual transfer (CoT in English → translation) yields gains only for bilingual architectures and does not work for English-dominant models. Nonetheless, the strategy is often assumed to work without verification on Kazakh benchmarks.
Increasing the tokenizer vocabulary size significantly improves downstream quality of Kazakh LLMs
SozKZ uses a 50K BPE vocabulary instead of 32K and shows competitive results, but it does not isolate the effect of vocabulary size in an ablation at a fixed token count. Sherkala is trained with an expanded vocabulary, but no comparison across vocabulary sizes was conducted.
Synthetic data from TTS is sufficient to bootstrap ASR without real recordings
A work on speech command recognition (2023) achieved 89.79% on TTS synthesis. But generalization to continuous spontaneous speech is not proven: TTS produces read, not conversational, speech, which risks domain shift at deployment.
Prompts in Kazakh are systematically safer than Russian ones in the same LLMs
Qorgau shows differences in safety behavior between Kazakh and Russian, but the direction of the effect is inconsistent across categories. On KZ-SafetyPrompts, GPT-4o refuses 28.2% of Kazakh prompts (range 5.5–53.8%), but there is no systematic kk-vs-ru comparison on identical prompts.
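A kk-vs-ru comparison on identical prompts is a paired design, so only the prompts where the two languages disagree carry information. One suitable analysis is an exact McNemar test, sketched below. This is an illustration of the technique, not an analysis from Qorgau or KZ-SafetyPrompts; the function name and the refusal flags are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(refused_kk: Sequence[bool],
                  refused_ru: Sequence[bool]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired refusal outcomes.

    refused_kk[i] and refused_ru[i] describe the same prompt in the two
    languages. Returns (kk_only, ru_only, p_value).
    """
    if len(refused_kk) != len(refused_ru):
        raise ValueError("both lists must cover the same prompts")
    kk_only = sum(1 for k, r in zip(refused_kk, refused_ru) if k and not r)
    ru_only = sum(1 for k, r in zip(refused_kk, refused_ru) if r and not k)
    n = kk_only + ru_only
    if n == 0:
        return kk_only, ru_only, 1.0  # no discordant pairs, no evidence
    # Under the null, discordant pairs split 50/50: Binomial(n, 0.5).
    k = min(kk_only, ru_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return kk_only, ru_only, min(1.0, 2 * tail)


# Hypothetical outcomes for 15 prompts translated into both languages.
kk = [True] * 10 + [False] * 5
ru = [False] * 15
kk_only, ru_only, p = mcnemar_exact(kk, ru)
print(kk_only, ru_only, p)
```

Running the test per harm category as well as overall would expose the inconsistent direction of the effect rather than averaging it away.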
A small model trained from scratch on Kazakh outperforms a large multilingual one at equal inference budget
SozKZ-600M approaches LLaMA-3.2-1B (30.3% vs 32.0% on cultural QA) and beats 2B multilingual models on SIB-200, which is indirect support. But there is no direct comparison with Sherkala at equal inference budget (FLOPs or latency), and there are no results on KazQAD or KazNERD either.
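An equal-budget comparison has to account for both model size and tokenizer fertility, because a model that needs fewer tokens for the same text does less work. The sketch below uses the common rough estimate of about 2 FLOPs per parameter per token for a forward pass, which ignores attention cost over long contexts. The parameter counts are nominal values read off the model names, and the tokens-per-word figures are hypothetical placeholders, not measurements.

```python
def approx_inference_flops(n_params: float, n_words: int,
                           tokens_per_word: float) -> float:
    """Rough forward-pass cost for a text: 2 * params * tokens.

    Assumption: dense decoder, cost dominated by matrix multiplies.
    """
    n_tokens = n_words * tokens_per_word
    return 2 * n_params * n_tokens


N_WORDS = 1000  # same Kazakh text for both models

# Hypothetical fertility values; measure them with the real tokenizers.
small_native = approx_inference_flops(600e6, N_WORDS, tokens_per_word=1.5)
large_multi = approx_inference_flops(1e9, N_WORDS, tokens_per_word=3.0)

print(f"small/native: {small_native:.2e} FLOPs")
print(f"large/multilingual: {large_multi:.2e} FLOPs")
print(f"cost ratio: {small_native / large_multi:.2f}")
```

With real fertility numbers, the ratio indicates how much larger a native-tokenizer model could be while staying within the same budget, which is the comparison the hypothesis actually calls for.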