Entry point · for your first contribution

What can you do right now?

Open problems for your first contribution

You don't need to invent a new paradigm. The field is full of concrete, under-covered gaps and claims nobody has verified. Below are the pain points and unverified hypotheses of Kazakh NLP, each with a citation to real work. Pick the one within your reach and make your first contribution.

How to read. “Pain points” are what the field is missing (data, benchmarks, metrics); each comes with a concrete first step. “Unverified hypotheses” are claims that are stated or implied but not proven; each comes with a way to test it. Every item also carries a difficulty badge; “easy start” marks items suitable for a first project.

Pain points

12 · with citations

Tokenization · easy start

High "tokenizer tax" of Kazakh text

Multilingual tokenizers (BPE, SentencePiece), trained predominantly on high-resource languages, fragment Kazakh words 3–5x more than English ones — this shortens the effective context window and increases compute cost. KazByte names this "tokenizer tax" as its main motivation, while SozKZ sidesteps the problem by training a tokenizer from scratch.

First step: Measure the fertility of existing tokenizers (GPT-4o, LLaMA-3, Qwen2.5) on standardized Kazakh text and compare against Turkish and English baselines.

arXiv:2603.27859 ↗ arXiv:2603.20854 ↗ arXiv:2503.01493 ↗
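
The fertility measurement in this first step can be sketched as follows. Fertility is taken here to mean the average number of tokens per whitespace-separated word, which is an assumption of this sketch. `fertility` and `toy_tokenize` are hypothetical names, and the toy tokenizer only stands in for whichever real tokenizer is being evaluated.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def toy_tokenize(text: str, piece_len: int = 3) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into fixed-length pieces."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


# Illustrative sentences only; a real comparison needs matched, standardized text.
samples = {
    "kk": ["мектептерімізде оқушылар көп"],
    "en": ["there are many pupils in our schools"],
}
for lang, texts in samples.items():
    print(lang, round(fertility(texts, toy_tokenize), 2))
```

To evaluate a real tokenizer, pass a function that maps a string to its list of tokens in place of `toy_tokenize`; the same text must be used for every tokenizer so the ratios are comparable.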

Morphology / segmentation · medium

No standardized benchmark for morphology

Despite many works on Kazakh morphological analysis, there is no single public dataset with gold annotation of morpheme boundaries and POS tags for reproducible comparison. Differing annotation schemes make it impossible to align results.

First step: Collect 1000–2000 sentences from open sources (KazNERD, Wikipedia), annotate morphemes under a single scheme, and publish the result as a micro-benchmark.

source ↗ source ↗ source ↗

Datasets / corpora · medium

Weak support for Kazakh-Russian code-switching

Speakers regularly switch between Kazakh and Russian within a single utterance (intra-sentential code-switching), but most ASR/NLP systems are trained on monolingual data. KSC2 contains such samples (proportion undisclosed), and dedicated datasets are almost nonexistent.

First step: Using the open reviews dataset (100K Movie Reviews from Kazakhstan), annotate code-switched phrases, measure their share, and publish the statistics as a baseline.

arXiv:2503.20007 ↗ source ↗ arXiv:2605.08600 ↗
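
Once reviews are annotated, the share statistic from this first step could be computed as in the sketch below. The annotation format (one language tag per token, with `kk`, `ru`, and `other`) and the function names are assumptions made for illustration, and the share is counted per sentence rather than per phrase.

```python
from typing import List, Sequence


def is_code_switched(tags: Sequence[str]) -> bool:
    """True if a sentence contains both Kazakh ('kk') and Russian ('ru') tokens."""
    present = set(tags)
    return "kk" in present and "ru" in present


def code_switch_share(annotated: List[Sequence[str]]) -> float:
    """Fraction of sentences with intra-sentential kk/ru switching."""
    if not annotated:
        raise ValueError("empty annotation set")
    return sum(is_code_switched(tags) for tags in annotated) / len(annotated)


# Hypothetical annotations: one tag per token, 'other' for names, digits, etc.
annotated = [
    ["kk", "kk", "ru", "kk"],
    ["kk", "kk", "kk"],
    ["ru", "ru", "other"],
    ["kk", "other", "ru"],
]
share = code_switch_share(annotated)
print(f"code-switched sentences: {share:.1%}")
```

Counting per sentence keeps the statistic easy to reproduce; a per-token or per-phrase share would need span boundaries in the annotation scheme.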

Evaluation / benchmarks · easy start

No unified set of LLM evaluation metrics

Benchmarks for Kazakh are fragmented: KazMMLU, TUMLU, KazQAD, Qorgau, and KZ-SafetyPrompts use different protocols, models, and metrics. With no single leaderboard, progress cannot be compared objectively across works.

First step: Run 3–5 public LLMs on all existing Kazakh benchmarks under a single protocol and publish a comparative table.

arXiv:2502.12829 ↗ arXiv:2502.11020 ↗ arXiv:2404.04487 ↗ arXiv:2502.13640 ↗

Speech (ASR / TTS) · medium

ASR for spontaneous and children's speech is almost absent

Most Kazakh ASR systems are trained on read speech (news, books, parliament). Children's speech, spontaneous dialogue, telephone conversations, and accented speech are minimally represented: the only children's-speech corpus covers children aged 2–8 (collected via a Telegram bot, voice recorders, and home recordings).

First step: On the available KSC2 and Common Voice data, run a WER error analysis broken down by speaker accent, gender, and age, and publish the diagnostics.

source ↗ arXiv:2009.10334 ↗ source ↗
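
The WER breakdown in this first step can be sketched with a word-level edit distance and a group-by over speaker metadata. The row layout (`ref`, `hyp`, plus metadata fields such as `gender`) is a hypothetical format chosen for the example, not the schema of KSC2 or Common Voice.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(rows: List[dict], key: str) -> Dict[str, float]:
    """Corpus-level WER per metadata value: errors are summed, then divided."""
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for row in rows:
        e, n = word_errors(row["ref"], row["hyp"])
        errors[row[key]] += e
        words[row[key]] += n
    return {g: errors[g] / words[g] for g in errors if words[g] > 0}


# Placeholder transcripts; real rows would hold reference and ASR output text.
rows = [
    {"ref": "a b c d", "hyp": "a b c d", "gender": "f"},
    {"ref": "a b c d", "hyp": "a x c", "gender": "m"},
    {"ref": "a b", "hyp": "a b c", "gender": "m"},
]
print(wer_by_group(rows, "gender"))
```

Summing errors before dividing weights long utterances more than short ones, which is the usual corpus-level convention; averaging per-utterance WER instead would give a different number.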

Datasets / corpora · easy start

OCR for Arabic and Latin Kazakh script barely exists

KazakhOCR showed that all the multimodal LLMs it tested (Gemma-3, Qwen2.5-VL, Llama-3.2-Vision) fail on Kazakh Arabic and Latin scripts, confusing them with Arabic, Persian, and Kurdish. There is no public dataset of real (non-synthetic) images.

First step: Collect 200–500 photos of real signs, newspapers, and documents in Kazakh Arabic/Latin script, annotate them, and publish as a small benchmark.

arXiv:2603.13238 ↗ arXiv:2110.04075 ↗ arXiv:2007.03579 ↗

Speech (ASR / TTS) · medium

Emotional / paralinguistic resources are limited to a single dataset

KazEmoTTS is the first public corpus of emotional Kazakh speech (74.85 h, 6 emotions). This is too little for emotion recognition in spontaneous speech and for multimodal analysis. Probing Whisper showed that emotion concentrates in the middle layers, but there are no downstream experiments.

First step: Fine-tune an open-source emotion-recognition model on KazEmoTTS and run a zero-shot evaluation on Common Voice Kazakh for a baseline.

arXiv:2404.01033 ↗ source ↗ source ↗

NER / extraction · medium

NER does not cover specialized domains (medicine, law)

KazNERD is built from TV news. Legal, medical, and scientific texts have different terminology and entities. In one study of translation post-editing, fine-tuning yielded the largest quality gains precisely in the legal (+17%) and medical (+22%) domains, a sign that these domains have the most room for improvement.

First step: Take open Kazakh legal acts (data.egov.kz), annotate 500 sentences using the KazNERD scheme, and evaluate the zero-shot transfer of existing models.

arXiv:2111.13419 ↗ source ↗

Machine translation · easy start

Parallel corpora are small for most language pairs

KazParC is the first large public parallel corpus (kk–en–ru–tr, ~372K sentences), but it covers only 4 languages. The pairs kk–zh, kk–uz, kk–ky are used in training, but verified corpora are minimal, which limits translation quality.

First step: Using NLLB and monolingual corpora, build a synthetic parallel dataset for kk–uz or kk–ky and evaluate it via back-translation BLEU.

arXiv:2403.19399 ↗ arXiv:2602.04442 ↗ source ↗

Classification / sentiment · easy start

Sentiment datasets cover only reviews, not all genres

KazSAnDRA is the largest public sentiment dataset (180K), but it consists only of consumer reviews. News, social media, and political statements are barely represented, which limits the applicability of classifiers.

First step: Annotate 500–1000 Kazakh news headlines under a three-class sentiment scheme and evaluate the transfer of the KazSAnDRA model.

arXiv:2403.19335 ↗ arXiv:2605.08600 ↗

Embeddings · easy start

Embeddings have no standard intrinsic benchmark

Cross-lingual embeddings for Turkic languages have been studied, but for Kazakh there is no public analogy set or SimLex-like resource that would allow comparing embedding quality without a downstream task.

First step: Translate a subset of SimLex-999 or BATS into Kazakh with native speakers and publish it as the first intrinsic benchmark for Kazakh embeddings.

arXiv:2005.08340 ↗ arXiv:2604.06202 ↗
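
A SimLex-like resource is typically scored by rank-correlating human similarity ratings with the cosine similarity of the embeddings. The sketch below shows that scoring step under stated assumptions: the word pairs, ratings, and 2-d vectors are made up for illustration, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(pairs: List[Tuple[str, str, float]],
             emb: Dict[str, Sequence[float]]) -> float:
    gold = [score for _, _, score in pairs]
    pred = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
    return spearman(gold, pred)


# Made-up ratings and toy vectors, for illustration only.
emb = {"ит": [1.0, 0.0], "мысық": [0.9, 0.1], "кітап": [0.5, 0.5], "үй": [0.0, 1.0]}
pairs = [("ит", "мысық", 9.0), ("ит", "кітап", 4.0), ("ит", "үй", 1.0)]
print(round(evaluate(pairs, emb), 3))
```

Rank correlation is used rather than Pearson on the raw values because human ratings and cosine similarities live on different scales; only their ordering is compared.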

Speech (ASR / TTS) · easy start

Punctuation and normalization of ASR output are barely studied

The only work on punctuation/capitalization restoration for Kazakh uses only Wikipedia and books and reports a low F1 for rare marks (exclamation mark: F1=32.85). Normalization of ASR output in real applications remains unsolved.

First step: Fine-tune a punctuation-restoration model on KSC2 transcriptions and compare against the Wikipedia baseline by F1 across all mark classes.

source ↗ source ↗
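
The per-class F1 comparison in this first step can be sketched as token-level scoring, where each token carries the label of the mark that follows it. The label names (`COMMA`, `PERIOD`, `EXCL`, and `O` for no mark) are assumptions for the example and may differ from the scheme used in the cited work.

```python
from collections import Counter
from typing import Dict, Sequence


def per_class_f1(gold: Sequence[str], pred: Sequence[str],
                 ignore: str = "O") -> Dict[str, float]:
    """F1 per punctuation class; the 'no mark' label is not scored."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != ignore:
                tp[g] += 1
        else:
            if p != ignore:
                fp[p] += 1  # predicted a mark that is not there
            if g != ignore:
                fn[g] += 1  # missed (or mislabeled) a gold mark
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        # F1 = 2TP / (2TP + FP + FN)
        scores[label] = 2 * tp[label] / (2 * tp[label] + fp[label] + fn[label])
    return scores


# Hypothetical label sequences, one label per token.
gold = ["O", "COMMA", "O", "PERIOD", "O", "EXCL"]
pred = ["O", "COMMA", "COMMA", "PERIOD", "O", "PERIOD"]
for label, f1 in sorted(per_class_f1(gold, pred).items()):
    print(label, round(f1, 3))
```

Reporting each class separately matters here because frequent marks dominate any pooled score and would hide weak performance on rare ones.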

Unverified hypotheses

8 · testable

contested · medium

Morphology-aware segmentation improves downstream tasks in Kazakh compared to BPE

The intuition that "accounting for morpheme boundaries should help agglutinative languages" is widespread. However, Sälevä & Lignos (2021), whose work covers en–kk as one of three language pairs, found that morphology-aware methods (LMVR, MORSEL) give no consistent advantage over BPE: the best method varies, and the results are statistically indistinguishable.

How to test: Compare the SozKZ BPE tokenizer (50K) against a morphological segmenter (Morfessor) on three tasks (NER, MT, masked LM) under a single protocol on the same data.

arXiv:2103.11189 ↗ arXiv:2603.20854 ↗
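
To decide whether two segmentation methods are statistically distinguishable on the same test set, one option is paired bootstrap resampling. This is a technique chosen for this sketch, not necessarily the test used in the cited work. It assumes a per-example score for each system (for instance 1 for a correct item and 0 otherwise); `paired_bootstrap` is a hypothetical name.

```python
import random
from typing import Sequence


def paired_bootstrap(scores_a: Sequence[float], scores_b: Sequence[float],
                     n_resamples: int = 2000, seed: int = 0) -> float:
    """Fraction of resampled test sets on which A's total score beats B's."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, same indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples


# Hypothetical per-example correctness for two systems on the same 8 items.
bpe = [1, 1, 0, 1, 0, 1, 1, 0]
morph = [1, 0, 0, 1, 1, 1, 1, 0]
print(paired_bootstrap(bpe, morph))
```

A win fraction near 0.5 suggests the two systems cannot be told apart on this test set, while values close to 0 or 1 suggest a consistent difference. Corpus-level metrics such as BLEU are not a sum of per-example scores, so they would need to be recomputed on each resample instead.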

unverified · medium

Byte-level tokenization outperforms BPE for Kazakh due to agglutinativity

KazByte hypothesizes that raw bytes, via an adapter to a frozen Qwen2.5-7B, will match or surpass the original. The authors explicitly state "empirical validation is ongoing" — no published comparisons exist. For other languages, byte-level has not yielded an unambiguous advantage.

How to test: Fine-tune ByT5-small on Kazakh (OSCAR/CC100) and compare against a BPE model of the same size on KazMMLU and KazQAD.

arXiv:2603.27859 ↗ arXiv:2603.20854 ↗

assumption · medium

Transfer from Turkish is more effective than transfer from Russian for Kazakh tasks

The typological closeness of Kazakh and Turkish (agglutination, vowel harmony, SOV) is often cited as a rationale for cross-lingual transfer, but there is no systematic "from Turkish vs. from Russian" comparison on fixed tasks (NER, SA, QA).

How to test: Fine-tune models on Turkish and Russian data of equal volume, then fine-tune on KazNERD and compare F1 on the test set.

arXiv:2604.06202 ↗ source ↗ arXiv:2603.21036 ↗

unverified · easy start

Reasoning in English with the answer translated into Kazakh preserves quality in modern LLMs

"Left Behind" (2026) showed that cross-lingual transfer (CoT in English → translation) yields gains only for bilingual architectures and does not work for English-dominant models. Nonetheless, the strategy is often assumed to work without verification on Kazakh benchmarks.

How to test: On KazMMLU/KazQAD, compare three modes (direct answer in Kazakh, CoT in Kazakh, CoT in English + translation) for 3–5 models.

arXiv:2603.21036 ↗ arXiv:2502.12829 ↗ arXiv:2604.20531 ↗

unverified · ambitious

Increasing the tokenizer vocabulary size significantly improves downstream quality of Kazakh LLMs

SozKZ uses 50K BPE instead of 32K and shows competitive results, but without an ablation that isolates the effect of vocabulary size at a fixed token count. Sherkala is trained with an expanded vocabulary, but no comparison across vocabulary sizes was conducted.

How to test: Train three identical models (architecture, data) with a 16K/32K/64K BPE vocabulary and compare fertility, perplexity, and F1 on NER.

arXiv:2603.20854 ↗ arXiv:2503.01493 ↗
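
One caveat when comparing perplexity across vocabulary sizes: per-token perplexity depends on how many tokens the text is split into, so models with different tokenizers are not directly comparable on it. A common workaround, sketched below, is to normalize the summed negative log-likelihood by the number of UTF-8 bytes instead. The function names and the numbers in the example are hypothetical.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Summed negative log-likelihood (in nats) converted to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("empty text")
    return total_nll_nats / (math.log(2) * n_bytes)


def token_perplexity(total_nll_nats: float, n_tokens: int) -> float:
    """Conventional per-token perplexity, shown for contrast."""
    return math.exp(total_nll_nats / n_tokens)


text = "қазақ тілі"
# Made-up case: two models assign the same total likelihood to the text
# but split it into a different number of tokens.
total_nll = 19 * math.log(2)
for name, n_tokens in [("16K vocab", 8), ("64K vocab", 4)]:
    print(name,
          "ppl/token:", round(token_perplexity(total_nll, n_tokens), 2),
          "bits/byte:", round(bits_per_byte(total_nll, text), 3))
```

In this made-up case the two models are equally good at predicting the text, yet their per-token perplexities differ; bits per byte gives them the same score.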

unverified · easy start

Synthetic data from TTS is sufficient to bootstrap ASR without real recordings

A work on speech command recognition (2023) achieved 89.79% on TTS synthesis. But generalization to continuous spontaneous speech is not proven: TTS produces read, not conversational, speech, which risks domain shift at deployment.

How to test: Fine-tune Whisper only on KazakhTTS2 synthesis and compare WER across three domains (KSC2 news, Common Voice, spontaneous chat) against a model trained on real data.

source ↗ arXiv:2201.05771 ↗ source ↗

unverified · easy start

Prompts in Kazakh are systematically safer than Russian ones in the same LLMs

Qorgau shows differences in safety behavior between Kazakh and Russian, but the direction of the effect is inconsistent across categories. On KZ-SafetyPrompts, GPT-4o refuses 28.2% of Kazakh prompts (range 5.5–53.8%), but there is no systematic kk-vs-ru comparison on identical prompts.

How to test: Take 200 prompts from Qorgau/KZ-SafetyPrompts, translate them from Russian into Kazakh with a native speaker, and compare the refusal rate of a single model on both versions.

arXiv:2502.13640 ↗ arXiv:2605.26947 ↗
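
Because the same prompts are answered in both languages, the refusal rates form paired data, and an exact McNemar test on the discordant pairs is one reasonable way to compare them. The choice of test is an assumption of this sketch rather than something the cited benchmarks prescribe, and `refusal_comparison` is a hypothetical name.

```python
from math import comb
from typing import Dict, Sequence


def refusal_comparison(refused_kk: Sequence[bool],
                       refused_ru: Sequence[bool]) -> Dict[str, float]:
    """Refusal rates per language plus an exact two-sided McNemar p-value."""
    if len(refused_kk) != len(refused_ru) or not refused_kk:
        raise ValueError("need two non-empty lists of equal length")
    n = len(refused_kk)
    # Discordant pairs: the model refused in one language only.
    only_kk = sum(1 for k, r in zip(refused_kk, refused_ru) if k and not r)
    only_ru = sum(1 for k, r in zip(refused_kk, refused_ru) if r and not k)
    discordant = only_kk + only_ru
    if discordant == 0:
        p_value = 1.0
    else:
        # Under H0 each discordant pair falls on either side with probability 0.5.
        tail = sum(comb(discordant, i) for i in range(min(only_kk, only_ru) + 1))
        p_value = min(1.0, 2 * tail / 2 ** discordant)
    return {
        "rate_kk": sum(1 for k in refused_kk if k) / n,
        "rate_ru": sum(1 for r in refused_ru if r) / n,
        "only_kk": only_kk,
        "only_ru": only_ru,
        "p_value": p_value,
    }


# Hypothetical outcomes for 10 prompt pairs (True = the model refused).
refused_kk = [True] * 6 + [False] * 4
refused_ru = [False] * 10
print(refusal_comparison(refused_kk, refused_ru))
```

Only the discordant pairs carry information about a language effect; prompts refused in both languages or in neither do not change the p-value.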

unverified · medium

A small model trained from scratch on Kazakh outperforms a large multilingual one at equal inference budget

SozKZ-600M approaches LLaMA-3.2-1B (30.3% vs 32.0% on cultural QA) and beats 2B multilingual models on SIB-200, which is indirect support. But there is no direct comparison with Sherkala at equal inference budget (FLOPs/latency), and there are no results on KazQAD/KazNERD either.

How to test: Compare SozKZ-600M against quantized Sherkala-8B on KazMMLU/KazQAD/KazNERD under an equal latency constraint (≤100ms CPU) and record the accuracy-throughput trade-off.

arXiv:2603.20854 ↗ arXiv:2503.01493 ↗ arXiv:2502.12829 ↗
