Research cartography · 2013 – 2026

Kazakh NLP Atlas

The State of Kazakh NLP Research

Kazakh NLP saw an explosion in 2024–2026 — but the growth is concentrated in speech and machine translation. Tokenization and morphology remain a thin, under-mapped frontier. That is exactly where contributions are open.

222 papers in corpus
2013–2026 years covered
47% in 2024–2026
21 on tokenization
01

Field chronology

volume of work by year
2013: 3
2014: 8
2015: 5
2016: 1
2017: 4
2018: 2
2019: 14
2020: 18
2021: 18
2022: 17
2023: 25
2024: 35
2025: 45
2026: 25
▮ in gold — 2024–2026: the LLM era arrives in Kazakh
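The 47% headline share can be re-derived from the per-year bars. The sketch below assumes the yearly values are exactly those listed in the chronology; the names `papers_by_year` and `CORPUS_SIZE` are illustrative. As listed, the bars sum to 220, while the headline corpus size is 222; the 47% figure corresponds to 105 / 222.

```python
# Per-year paper counts as listed in the field chronology.
papers_by_year = {
    2013: 3, 2014: 8, 2015: 5, 2016: 1, 2017: 4, 2018: 2, 2019: 14,
    2020: 18, 2021: 18, 2022: 17, 2023: 25, 2024: 35, 2025: 45, 2026: 25,
}
CORPUS_SIZE = 222  # headline "papers in corpus" figure

# Papers published in the 2024-2026 window.
recent = sum(n for year, n in papers_by_year.items() if year >= 2024)
share = recent / CORPUS_SIZE

print(f"{recent} papers in 2024-2026 = {share:.0%} of {CORPUS_SIZE}")
```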
02

Breakthrough timeline

world ⟷ Kazakh · the lag is visible
🌍 World · 🇰🇿 Kazakhstan
03

Territories

by decreasing openness
Tokenization · 21
A thin frontier. The boom is 2024–2026. There is no independent audit of morpheme alignment.
Morphology / segmentation · 48
New segmenters are being built, but no one audits what existing tokenizers actually do.
Language models / LLMs · 50
mapped
Evaluation / benchmarks · 76
Benchmarks are few and scattered. Til-Core shipped without a single downstream benchmark.
Embeddings · 22
mapped
Classification / sentiment · 26
mapped
NER / extraction · 55
mapped
Machine translation · 105
dense
Speech (ASR / TTS) · 110
dense
Datasets / corpora · 146
dense
04

LLM architecture

a schematic for newcomers · and a contribution map

First, how an LLM is built at all. Arrows show data flow; the colors match the tree below and mark where Kazakh work is dense and where it is empty.

Kazakh LLM stack

The same skeleton, showing where it is empty for Kazakh and which layers a vanilla stack doesn't need at all. Color = status, number = volume of work, bar = how complete the layer is.

Legend: many papers · active · incomplete · almost none
01 Data · 154 papers
02 Tokenization · 59 papers
03 Representations · 28 papers
04 Model · 50 papers
05 Adaptation · 28 papers
06 Evaluation · 76 papers
07 Inference · 4 papers
08 Applications · 210 papers
05

Citation graph

size = influence · color = topic
104 related papers + 15 global hubs · 312 citation edges · method: s2-batch. 118 more papers without edges — in the list below.
06

Flagship models

claim ≠ verified
Model · Year · Params · Base · Vocab · Tokenizer · Morphology? · Benchmarks?
Til-Core-0.5B
Tıl Qazyna (state)
A loud morphology claim, with validation perplexity as the only metric. A 0.5B/1B family (+Instruct). No independent verification.
Year: 2026 · Params: 497M · Base: Qwen2 arch. (from scratch) · Vocab: 256 000 · Tokenizer: morphBPE, a BPE that forbids merges across morpheme boundaries (BiLSTM segmenter) · Morphology?: YES, but the segmenter isn't released · Benchmarks?: NO, only val-PPL
Sherkala-Chat-8B
Inception / MBZUAI
Kazakh fertility 4.73 → 2.04. Morpheme alignment is not discussed.
Year: 2025 · Params: 8B · Base: Llama-3.1 · Vocab: 159 766 · Tokenizer: extended BPE (+25% over Llama-3.1) · Morphology?: no (fertility-driven) · Benchmarks?: yes
SozKZ (50M–600M)
S. Tukenov
Argues via fertility, not via morpheme boundaries.
Year: 2026 · Params: 50–600M · Base: Llama-arch · Vocab: 50 000 · Tokenizer: ByteLevel BPE trained from scratch on Kazakh · Morphology?: no · Benchmarks?: partial
KazByte
R. Akylzhanov
A counterpoint to the whole field: the "tokenizer tax" is solved by removing the tokenizer. "Validation is ongoing" — no published results yet.
Year: 2026 · Params: adapter→Qwen2.5-7B · Base: Qwen2.5 · Vocab: none (byte-level) · Tokenizer: bypasses the tokenizer entirely (byte-level adapter) · Morphology?: n/a, no tokenizer · Benchmarks?: NO, position paper
KazLLM (8B / 70B)
ISSAI / NU
150B+ tokens, 4 languages. No dedicated tokenizer work.
Year: 2024 · Params: 8B, 70B · Base: Llama-3.1 · Vocab: 128 256 (Llama-3.1) · Tokenizer: inherits Llama-3.1, extension undocumented · Morphology?: no · Benchmarks?: yes (task-perf)
Kaz-RoBERTa
kz-transformers
An early baseline. Used in hybrid morphological analyzers.
Year: 2023 · Params: ~83M · Base: RoBERTa · Vocab: 52 000 · Tokenizer: byte-level BPE (Kazakh + code-switched RU dialogues) · Morphology?: no · Benchmarks?: partial
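The idea described in the Til-Core row, a BPE that forbids merges across morpheme boundaries, can be illustrated with a toy sketch. This is not Til-Core's morphBPE (its segmenter is unreleased); it assumes words arrive already split into morphemes by hand, and the function names `train_constrained_bpe`, `apply_merge` and `encode` are hypothetical.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train_constrained_bpe(segmented_words, num_merges):
    """Learn BPE merges, counting pairs only inside a single morpheme.

    segmented_words: list of words, each a list of morpheme strings
    (hand-segmented here; a real system would use a segmenter).
    """
    corpus = Counter()
    for word in segmented_words:
        for morpheme in word:
            corpus[tuple(morpheme)] += 1  # each morpheme is its own merge domain

    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every morpheme is already a single symbol
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[tuple(apply_merge(list(symbols), best))] += freq
        corpus = merged
    return merges

def encode(segmented_word, merges):
    """Tokenize morpheme by morpheme, so no token can span a boundary."""
    tokens = []
    for morpheme in segmented_word:
        symbols = list(morpheme)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens

words = [["кітап", "тар", "ымыз"], ["кітап", "тар"], ["бала", "лар"]]
merges = train_constrained_bpe(words, num_merges=50)
print(encode(["кітап", "тар", "ымыз"], merges))
```

Because pairs are only ever counted inside one morpheme, the constraint holds by construction; the open question the atlas raises is whether the segmentation feeding such a scheme is itself accurate.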
07

Open territory

where contribution is open
◆ YOU ARE HERE

Independent audit of morpheme alignment in Kazakh tokenizers

No one has compared several KAZAKH tokenizers (Kaz-RoBERTa, SozKZ, Sherkala, Til-Core) on morpheme boundaries against a single gold standard. Arnett 2025 treats Kazakh as 1 of 70 languages and only generic tokenizers; Duisenova 2026 builds a new one but doesn't audit the existing ones.

SHIPPABLE this week
◆ YOU ARE HERE

Empirical test of Til-Core's morphology claim

Til-Core shipped without a single downstream benchmark (only validation perplexity) and with a loud claim of "Kazakh morphology support." Be the first to measure it independently.

part of the audit
◆ YOU ARE HERE

Precision/F1 of morpheme alignment for Kazakh tokenizers

The original MorphScore (2024) measures only boundary recall; Arnett 2025 added precision/recall for Kazakh — but only for generic tokenizers (BLOOM, Llama, Gemma). No one has computed precision and F1 for KAZAKH tokenizers (Kaz-RoBERTa, SozKZ, Sherkala, Til-Core).

small add-on to the audit
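A boundary-level precision/recall/F1 for a single word can be sketched as follows. This is a generic formulation, not necessarily the exact MorphScore definition; the tokenization in the example is made up for illustration and is not the output of any tokenizer named above. The names `internal_boundaries` and `boundary_prf` are hypothetical.

```python
def internal_boundaries(pieces):
    """Character offsets at which one piece ends and the next begins."""
    offsets, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        offsets.add(pos)
    return offsets

def boundary_prf(token_pieces, gold_morphemes):
    """Precision, recall and F1 of token boundaries against gold morpheme boundaries."""
    if "".join(token_pieces) != "".join(gold_morphemes):
        raise ValueError("both segmentations must spell the same word")
    pred = internal_boundaries(token_pieces)
    gold = internal_boundaries(gold_morphemes)
    hits = len(pred & gold)
    # Convention chosen here: an empty set scores 0.0 rather than being undefined.
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

gold = ["кітап", "тар", "ымыз"]           # gold morphemes: boundaries at 5 and 8
tokens = ["кі", "тап", "тары", "мыз"]     # made-up tokenization: boundaries at 2, 5, 9
p, r, f1 = boundary_prf(tokens, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In a real audit, tokenizer-specific markers (byte-level encodings, word-start prefixes) would have to be mapped back to plain character offsets before this comparison.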
○ open

Joint fertility × morpheme-alignment table

Sherkala reports fertility, the MorphScore work reports alignment — but no one has brought both axes for Kazakh tokenizers into a single table.

medium
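For the fertility axis of such a table, a minimal sketch is given below. It assumes fertility means average tokens per whitespace-delimited word, a common definition that may differ in detail from the one used in the Sherkala report. `chunk3` is a hypothetical stand-in tokenizer, used only so the example runs without any model download.

```python
def fertility(tokenize, text):
    """Average number of tokens per whitespace-delimited word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return sum(len(tokenize(word)) for word in words) / len(words)

def chunk3(word):
    """Hypothetical stand-in tokenizer: fixed 3-character chunks."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]

sample = "біздің кітаптарымыз"
print(fertility(chunk3, sample))  # 6 tokens over 2 words

# Relative reduction reported for Sherkala: 4.73 -> 2.04 tokens per word.
before, after = 4.73, 2.04
print(f"{before / after:.2f}x fewer tokens per word")
```

Tokenizing word by word keeps the sketch simple; real subword tokenizers can behave differently in context (for example around leading-space markers), so an actual audit would tokenize full sentences and map tokens back to words.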
◆ YOU ARE HERE

Usage-vs-morphology divergence (what speakers actually say)

A morphologically correct form ≠ the form a native speaker uses (e.g. the analytic "біздің кітаптар" ("our books") instead of the suffixed "кітаптарымыз", or "неге" ("why") treated as a monolith). No existing work covers this methodologically. A native-speaker survey → a new angle.

mini-survey, 30–50 responses
08

The paper corpus

arXiv + Semantic Scholar

222 papers shown

KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models
2026 · Wajdi Zaghouani, Shimaa Amer Ibrahim, Aruzhan Muratbek et al. · arXiv
Kazakh is underrepresented in resources for evaluating the safety behavior of large language models. We present KZ-SafetyPrompts, a Kazakh prompt dataset for safety evaluation across eleven categories covering common risk areas such as self-harm, violence, ch…
Language models / LLMs · Machine translation · Datasets / corpora · Evaluation / benchmarks
Bidirectional Kazakh Sign Language prosody-aware translation using computer vision and speech recognition techniques
2026 · M. Zhassuzak, Zholdas Buribayev, Maria Aouani et al. · 0 · Frontiers in Artificial Intelligence
Introduction: This study presents a bidirectional communication system designed to enhance interaction between hearing-impaired and hearing individuals using gesture recognition. Methods: The proposed framework integrates multiple components, including Kazakh S…
Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
Sign Language Recognition and Translation for Low-Resource Languages: Challenges and Pathways Forward
2026 · Nigar Alishzade, Gulchin Abdullayeva · arXiv
Sign languages are natural, visual-gestural languages used by Deaf communities worldwide. Over 300 distinct sign languages remain severely low-resource due to limited documentation, sparse datasets, and insufficient computational tools. This systematic review…
Machine translation · NER / extraction · Datasets / corpora · Evaluation / benchmarks · Embeddings
100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts
2026 · Rustem Yeshpanov · arXiv
We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switch…
Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
Effects of Cross-lingual Evidence in Multilingual Medical Question Answering
2026 · Anar Yeginbergen, Maite Oronoz, Rodrigo Agerri · arXiv
This paper investigates Multilingual Medical Question Answering across high-resource (English, Spanish, French, Italian) and low-resource (Basque, Kazakh) languages. We evaluate three types of external evidence sources across models of varying size: curated r…
Language models / LLMs · Datasets / corpora · Evaluation / benchmarks
Development of a Crowdsourcing-Based Adaptive Kazakh—English Translation System for the Kazakh Language
2026 · Almat Begaidarov, A. Serek, Aisaule Bazarkulova et al. · 0 · International Conference on Electronics, Computer and Computation
The Kazakh language is hard for learners and machine translation because of its complicated morphology, regional differences, and subtle differences in meaning. This research presents a crowdsourcing-driven adaptive translation system that amalgamates automat…
Morphology / segmentation · Machine translation · NER / extraction · Evaluation / benchmarks
Development of Hybrid LLM-ASR Methodology for Improving Process of Kazakh Language Learning
2026 · Dinmukhammed Zhassulanov, A. Serek, Aisaule Bazarkulova · 0 · International Conference on Electronics, Computer and Computation
This paper suggests a structure that can guide the enhancement of the Kazakh language learning process by using the large language models (LLMs) and automatic speech recognition (ASR). To fill the gap of high-quality speech materials on the Kazakh language, a …
Language models / LLMs · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks
LLM-Assisted Weak Supervision for Low-Resource Kazakh Sequence Labeling: Synthetic Annotation and CRF-Refined NER/POS Models
2026 · A. Aitim · 1 · Applied Sciences
Kazakh sequence labeling is constrained by limited annotated resources, while its agglutinative morphology and productive suffixation increase data sparsity and exacerbate label inconsistency in part-of-speech (POS) tagging and named entity recognition (NER).…
Tokenization · Morphology / segmentation · Language models / LLMs · Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Layer-Wise Probing of Paralinguistic Attributes in Fine-Tuned Whisper for Kazakh Speech
2026 · Aimoldir Aldabergen, B. Kynabay, S. Kadyrov · 0 · Engineering, Technology & Applied Science Research
Large pre-trained speech models similar to Whisper are now commonly used for speech recognition and related tasks. The distribution of paralinguistic features, which include emotions and speaker characteristics across model layers, remains uncertain, particul…
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks · Embeddings
An Empirical Comparison of Cascade and Direct End-to-End Speech Translation for Low-Resource Language Pair
2026 · Zhanibek Kozhirbayev · 0 · Computers
Speech-to-text translation (S2TT) for low-resource languages remains challenging due to the scarcity of parallel speech translation data and the susceptibility of modular pipelines to error propagation. This paper presents a controlled empirical comparison of…
Morphology / segmentation · Machine translation · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Unifying Kazakh proper names in English-language texts: the role of translation technologies and translator training
2026 · D. Popič · 0 · Bulletin of L.N. Gumilyov Eurasian National University. PHILOLOGY Series
The rapid expansion of translation technologies has transformed both professional translation practice and translator education. In languages with developing digital infrastructures, such as Kazakh, the integration of machine translation and computer-assisted…
Machine translation · NER / extraction · Embeddings
Multi-lingual meeting minutes-taking system: design and implementation
2026 · B. Kumalakov, A. Mazhitova · 0 · Bulletin of the National Engineering Academy of the Republic of Kazakhstan
This study examines the challenge of automatically transcribing multilingual institutional speech in Kazakhstan, where speakers frequently switch between Kazakh, Russian, and English. While modern automatic speech recognition (ASR) systems achieve high acc…
Tokenization · Morphology / segmentation · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
KazByte: Adapting Qwen models to Kazakh via Byte-level Adapter
2026 · Rauan Akylzhanov · arXiv
Large language models fragment Kazakh text into many more tokens than equivalent English text, because their tokenizers were built for high-resource languages. This tokenizer tax inflates compute, shortens the effective context window, and weakens the model's…
Tokenization · Morphology / segmentation · Language models / LLMs · Datasets / corpora · Evaluation / benchmarks
Left Behind: Cross-Lingual Transfer as a Bridge for Low-Resource Languages in Large Language Models
2026 · Abdul-Salem Beibitkhan · arXiv
We investigate how large language models perform on low-resource languages by benchmarking eight LLMs across five experimental conditions in English, Kazakh, and Mongolian. Using 50 hand-crafted questions spanning factual, reasoning, technical, and culturally…
Language models / LLMs · Datasets / corpora · Evaluation / benchmarks
SozKZ: Training Efficient Small Language Models for Kazakh from Scratch
2026 · Saken Tukenov · arXiv
Kazakh, a Turkic language spoken by over 22 million people, remains underserved by existing multilingual language models, which allocate minimal capacity to low-resource languages and employ tokenizers ill-suited to agglutinative morphology. We present SozKZ,…
Tokenization · Morphology / segmentation · Language models / LLMs · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
Speech-to-Sign Gesture Translation for Kazakh: Dataset and Sign Gesture Translation System
2026 · Akdaulet Mnuarbek, A. Bekarystankyzy, M. Turdalyuly et al. · 0 · Computers
This paper presents the first prototype of a speech-to-sign language translation system for Kazakh Sign Language (KRSL). The proposed pipeline integrates the NVIDIA FastConformer model for automatic speech recognition (ASR) in the Kazakh language and addresse…
Machine translation · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Cross-Lingual Transfer and Parameter-Efficient Adaptation in the Turkic Language Family: A Theoretical Framework for Low-Resource Language Models
2026 · O. Ibrahimzade, K. Tabasaransky · arXiv
Large language models (LLMs) have transformed natural language processing, yet their capabilities remain uneven across languages. Most multilingual models are trained primarily on high-resource languages, leaving many languages with large speaker populations …
Morphology / segmentation · Language models / LLMs · Datasets / corpora · Evaluation / benchmarks · Embeddings
Enhancing Post-Editing of Kazakh Translations Using Fine-Tuned Large Language Models
2026 · A. Bekarystankyzy, D. Rakhimova, Aliya Zhiger et al. · 0 · Algorithms
Machine translation for low-resource languages such as Kazakh remains a complex task due to the scarcity of training data, intricate morphological structures, and culturally specific linguistic characteristics. This study presents the first extensive explorat…
Morphology / segmentation · Language models / LLMs · Machine translation · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
Using Songs to Improve Kazakh Automatic Speech Recognition
2026 · Rustem Yeshpanov · 0 · arXiv
Developing automatic speech recognition (ASR) systems for low-resource languages is hindered by the scarcity of transcribed corpora. This proof-of-concept study explores songs as an unconventional yet promising data source for Kazakh ASR. We curate a dataset …
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR
2026 · Henry Gagnier, Sophie Gagnier, Ashwin Kirubakaran · arXiv
Kazakh is a Turkic language using the Arabic, Cyrillic, and Latin scripts, making it unique in terms of optical character recognition (OCR). Work on OCR for low-resource Kazakh scripts is very scarce, and no OCR benchmarks or images exist for the Arabic and L…
Language models / LLMs · Datasets / corpora · Evaluation / benchmarks
Pedagogical features of using artificial intelligence programs in teaching the kazakh language
2026 · A. A. Mukhametkali, A. Z. Mazinova · 0 · Eurasian Journal of Current Research in Psychology and Pedagogy
The article examines the pedagogical features of integrating artificial intelligence (AI) technologies into the process of teaching Kazakh at higher education institutions. Artificial intelligence is viewed as a tool for enhancing the effectiveness of languag…
Speech (ASR / TTS) · NER / extraction · Datasets / corpora
No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data
2026 · Dmitry Karpov · 1 · arXiv
We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bash…
Machine translation · Datasets / corpora
An Integrated Approach to Adapting Open-Source AI Models for Machine Translation of Low-Resource Turkic Languages
2026 · U. Tukeyev, A. Shormakova, Aidana Karibayeva et al. · 2 · Computers
This study presents the application of free, open-source artificial intelligence (AI) techniques to advance machine translation for low-resource Turkic languages such as Kazakh, Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek. This machine translation proble…
Morphology / segmentation · Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks
Synthetic data generation for Kazakh speech separation and diarization based on the use of neural networks
2025 · D. Oralbekova, Orken J. Mamyrbayev, L. Azarova et al. · 0 · Symposium on Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments (WILGA)
This paper explores the impact of various synthetic data generation methods on the performance of speech separation and diarization models. Three approaches are considered: simple audio track overlay, synthetic dialogue generation, and acoustic condition mode…
Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Development and increase of noise immunity of a model of biometric identification of a speaker based on metal-frequency cepstral coefficients and a convolutional neural network
2025 · M. Khizirova, K. Chezhimbayeva, Abdurazak Kassimov et al. · 0 · Eastern-European Journal of Enterprise Technologies
This study is focused on improving the noise robustness of a biometric speaker identification system based on mel-frequency cepstral coefficients (MFCC) and a convolutional neural network (CNN). The object of analysis is the acoustic structure of the Kazakh l…
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
Natural Language Processing and Speech Technologies for Central Asian Turkic Languages: A Review of Current Methods, Resources, and Challenges
2025 · Palidan Muhetaer · 0 · Actual Problems of the Present
This article provides a comprehensive review of contemporary research in the field of natural language processing (NLP) and speech technologies for Central Asian Turkic languages, including Kazakh, Kyrgyz, and Uzbek. Although a number of theoretical and appli…
Tokenization · Morphology / segmentation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment · Embeddings
COMPARATIVE ANALYSIS OF LOCAL AND CLOUD-BASED SPEECH RECOGNITION MODELS FOR THE KAZAKH LANGUAGE
2025 · М.Г. Оспанов, К.С. Мауленов, А.Т. Байманкулов · 0 · Bulletin of D. Serikbayev EKTU
Developing automatic speech recognition systems for Kazakh remains a pressing task given limited language resources and the high morphological complexity of agglutinative languages. The aim of the study is a comparative analysis of local and …
Speech (ASR / TTS)
APPLICATION OF NON-AUTOREGRESSIVE DECODING FOR KAZAKH SPEECH RECOGNITION
2025 · D. Oralbekova, O. Mamyrbayev, A. Yerimbetova et al. · 0 · Herald of Kazakh-British technical university
In the field of speech recognition, end-to-end models are gradually replacing traditional and hybrid approaches. Their main principle is autoregressive decoding, where the output sequence is formed from left to right. However, it has not yet been proven that …
Morphology / segmentation · Speech (ASR / TTS) · NER / extraction · Classification / sentiment
DEVELOPMENT OF A MODEL FOR REAL-TIME RECOGNITION OF KAZAKH SIGN LANGUAGE USING MEDIAPIPE AND DEEP LEARNING METHODS
2025 · A. Yerimbetova, U. Berzhanova, E. Daiyrbayeva et al. · 0 · Herald of Kazakh-British technical university
This article discusses the process of developing a Kazakh sign language recognition system using the MediaPipe platform. The platform allows for efficient real-time gesture recognition. The main focus is on creating models for gesture recognition, training ne…
Speech (ASR / TTS) · Datasets / corpora
DEVELOPMENT OF A MODEL FOR REAL-TIME RECOGNITION OF KAZAKH SIGN LANGUAGE USING MEDIAPIPE AND DEEP LEARNING METHODS
2025 · N. Amangeldy, A. Yerimbetova, N. Gazizova et al. · 0 · Herald of Kazakh-British technical university
Technologies for automatic processing of sign language have become an urgent need for members of society with hearing and speech impairments who face inequality in the era of digital transformation. In recent years, the issue of considering sign language as a…
Morphology / segmentation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Embeddings
Artificial Intelligence and the Scientific Development of the Kazakh Language: Corpus, Terminology, and Content Automation
2025 · F. Orazbayeva, A. Ryskulova · 0 · Iasaýı ýnıversıtetіnіń habarshysy
This article provides a comprehensive analysis of effective strategies for enhancing the scientific and theoretical development of the Kazakh language through the integration of artificial intelligence technologies and linguistic corpora. The primary aim of t…
Morphology / segmentation · Speech (ASR / TTS) · Datasets / corpora
Low-Resource Speech Recognition by Fine-Tuning Whisper with Optuna-LoRA
2025 · Huan Wang, Jie Bin, Chunyan Gou et al. · 2 · Applied Sciences
In low-resource speech recognition, the performance of the Whisper model is often limited by the size of the available training data. To address this challenge, this paper proposes a training optimization method for the Whisper model that integrates Low-Rank …
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
The Development and Experimental Evaluation of a Multilingual Speech Corpus for Low-Resource Turkic Languages
2025 · Aidana Karibayeva, V. Karyukin, U. Tukeyev et al. · 1 · Applied Sciences
The development of parallel audio corpora for Turkic languages, such as Kazakh, Uzbek, and Tatar, remains a significant challenge in the development of multilingual speech synthesis, recognition systems, and machine translation. These languages are low-resour…
Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks
TRANSLATING THE UNTRANSLATABLE IN KAZAKH FAIRY TALES: HUMAN vs AI
2025 · B. Mizamkhan, G. Dyussembina, A. Yezmakhunova · 0 · Вестник "Филологичекие науки"
While the roles of human translators and machines have shifted with the development of such engines as Google Translate and DeepL, the issue of translating the untranslatable or culture-specific items from Kazakh into English still remain challenging. The lac…
Language models / LLMs · Machine translation · NER / extraction · Datasets / corpora · Embeddings
A Kazakh language Dataset of Lip Movements for Command Recognition
2025 · Batyr Kenzheakhmetov, Alissultan Amankos, B. Amirgaliyev et al. · 0 · Scientific Data
Lip reading systems determine the content of speech based on the visual tracking of lips of the speaker and therefore serve to offer communicative substitutes when acoustic information is not available in the environment. The training of strong lip reading mo…
Speech (ASR / TTS) · Datasets / corpora
Beyond Ranked Lists: The SARAL Framework for Cross-Lingual Document Set Retrieval
2025 · Shantanu Agarwal, Joel Barry, Elizabeth Boschee et al. · arXiv
Machine Translation for English Retrieval of Information in Any Language (MATERIAL) is an IARPA initiative targeted to advance the state of cross-lingual information retrieval (CLIR). This report provides a detailed description of Information Sciences Institu…
Machine translation · Evaluation / benchmarks
MULTILINGUAL AUTOMATIC SPEECH RECOGNITION INTERFACE FOR TYPING: USABILITY STUDY AND PERFORMANCE EVALUATION FOR KAZAKH, RUSSIAN, AND ENGLISH
2025 · Z. Makhataeva, Nursultan Atymtay, Rakhat Meiramov et al. · 0 · Scientific Journal of Astana IT University
We present a multilingual automatic speech recognition (ASR) system for Kazakh, Russian, and English designed for the trilingual community of Kazakhstan. Although prior research has shown that speech-based text entry can outperform conventional keyboard typin…
Language models / LLMs · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Multilingual Speech Command Recognition with Language Identification
2025 · Artur Muratov, Askat Kuzdeuov, H. A. Varol · 1 · Annual Conference of the IEEE Industrial Electronics Society
Multilingual Speech Command Recognition (SCR) facilitates voice interaction in environments where multiple languages are used interchangeably, a common characteristic of multilingual regions. In such settings, SCR and language identification (LID) are handled…
Language models / LLMs · Speech (ASR / TTS) · NER / extraction
Speech Recognition and Synthesis Models and Platforms for the Kazakh Language
2025 · Aidana Karibayeva, V. Karyukin, B. Abduali et al. · 2 · Inf.
With the rapid development of artificial intelligence and machine learning technologies, automatic speech recognition (ASR) and text-to-speech (TTS) have become key components of the digital transformation of society. The Kazakh language, as a representative …
Morphology / segmentation · Language models / LLMs · Speech (ASR / TTS) · Datasets / corpora
Improving post-editing of Kazakh translations with fine-tuned large language models: Dataset and evaluation
2025 · D. Rakhimova, Aliya Zhiger, Madina Mansurova et al. · 0 · International journal of innovative research and scientific studies
Machine translation for low-resource languages like Kazakh faces significant challenges due to limited training data, complex morphology, and cultural-linguistic nuances. This paper presents the first comprehensive study on fine-tuning large language models f…
Morphology / segmentation · Language models / LLMs · Machine translation · Datasets / corpora · Evaluation / benchmarks
Improved Kazakh Named Entity Transcription Using Synthetic Speech
2025 · Rakhat Meiramov, H. A. Varol · 0 · 2025 10th International Conference on Computer Science and Engineering (UBMK)
Named entity transcription remains a challenge for low-resource languages such as Kazakh, often causing errors in proper nouns, locations, and other named entities. This paper presents a fine-tuning approach using OpenAI’s Whisper model, KazakhTTS2 and KazEmo…
Language models / LLMs · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks
Fine-Tuning Neural ASR Models for Low-Resource Kazakh Children’s Speech with Preprocessing Enhancements
2025 · Zhansaya Duisenbekkyzy, D. Rakhimova, A. Yessirkepova et al. · 0 · 2025 10th International Conference on Computer Science and Engineering (UBMK)
This article presents a comprehensive study of automatic speech recognition (ASR) for Kazakh children's speech, with a particular focus on the impact of signal preprocessing techniques in a low-resource linguistic environment. Although commercial ASR systems …
Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Restoring Punctuation and Capitalization in Kazakh: A BERT-Based Approach for Text Normalization
2025 · Sanzhar Umbet, Zhanibek Kozhirbayev · 1 · 2025 10th International Conference on Computer Science and Engineering (UBMK)
This paper introduces a punctuation and capitalization (PC) restoration model for Kazakh, developed using the bert-base-multilingual-uncased model within NVIDIA’s NeMo framework. The model was trained on a curated dataset of preprocessed Kazakh Wikipedia arti…
Speech (ASR / TTS) · Datasets / corpora
A Multimodal Framework for Speech Emotion Recognition in Low-Resource Languages
2025 · Mamyr Altaibek, Altanbek Zulkazhav, B. Yergesh et al. · 1 · Journal of Artificial Intelligence and Technology
Speech emotion recognition (SER) plays a crucial role in enhancing human–computer interaction by identifying emotional states in speech. However, low-resource languages like Kazakh face challenges due to limited datasets and linguistic tools. To address this …
Speech (ASR / TTS) · Datasets / corpora · Classification / sentiment · Embeddings
Deploying Multilingual ASR in Digital Twin Systems: A Performance and Efficiency Analysis of Whisper and SeamlessM4T
2025 · B. Amirkhanov, G. Tyulepberdinova, G. Amirkhanova et al. · 1 · 2025 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA)
This study presents a systematic comparison of two leading automatic speech recognition (ASR) model families—OpenAI's Whisper and Meta's SeamlessM4T—across three typologically diverse languages: English (Germanic), Russian (Slavic), and Kazakh (Turkic). Motiv…
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Fine-Tuning Methods and Dataset Structures for Multilingual Neural Machine Translation: A Kazakh–English–Russian Case Study in the IT Domain
2025 · Zhanibek Kozhirbayev, Zhandos Yessenbayev · 1 · Electronics
This study explores fine-tuning methods and dataset structures for multilingual neural machine translation using the No Language Left Behind model, with a case study on Kazakh, English, and Russian. We compare single-stage and two-stage fine-tuning approaches…
Machine translation · Datasets / corpora · Evaluation / benchmarks
Technology of Distinguishing Homonyms Using Probabilistic and Statistical Methods in Compiling a Frequency Dictionary of Texts of the National Corpus of the Kazakh Language
2025 · Y. Bessirov, A. Zhanabekova · 0 · Tiltanym
The article presents a comprehensive overview of a probabilistic-statistical method for distinguishing homonyms in the process of compiling a frequency dictionary based on the Kazakh National Corpus. The research focuses on words that share identical written …
Morphology / segmentation · Speech (ASR / TTS) · Datasets / corpora · Classification / sentiment
NEURAL MACHINE TRANSLATION FOR ENGLISH-KAZAKH LANGUAGE PAIR
2025 · D. Rakhimova, A. Zhiger, V. Malykh et al. · 0 · Herald of Kazakh-British technical university
Currently, information technology is rapidly developing and one of its branches can be called machine translation. The use of machine translation in the process of understanding each other by people from different countries is increasing every year. At the mo…
Machine translation · Datasets / corpora · Evaluation / benchmarks
FEATURES OF USING EXTENDED FORMS OF THE TRANSFORMER MODEL IN KAZAKH SPEECH RECOGNITION
2025 · Kurmetkan Turdybek, Orken J. Mamyrbayev · 0 · Bulletin of D. Serikbayev EKTU
In recent years, speech recognition technologies have significantly advanced due to artificial intelligence and machine learning methods. These technologies enable the automatic understanding of human speech and its conversion into text. The field of speech r…
Morphology / segmentation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora
AI-Based Offline Speech Recognition for Kazakh, Russian and English Languages
2025 · Nursultan Nyssanov, Zuleikha Syzdykova, K. Niyazaliyev et al. · 0 · Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing
This paper presents a novel fully autonomous multilingual audio transcription system tailored to Kazakh, Russian, and English. The proposed solution integrates a language detection module based on SpeechBrain with a transcription engine using Vosk, and employ…
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
A SYSTEMATIC REVIEW ON TRANSLATING KAZAKH SIGN LANGUAGE TO SPEECH USING DYNAMIC GESTURE RECOGNITION
2025 · A. Aitim, Dariga Sattarkhuzhayeva, Aisulu Khairullayeva · 0 · Вестник КазАТК
This systematic review explores the importance of the Kazakh Sign Language (KSL) as the main method of communication for deaf and hard-of-hearing people in Kazakhstan. Despite its critical role, the automatic translation of KSL into spoken language remains un…
Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Development and Evaluation of a Small Kazakh Language Corpus to Improve the Efficiency of Multilingual NLP Systems in Low-Resource Environments
2025 · Arailym Tleubayeva, Sultan Aubakirov, Aisultan Tabuldin et al. · 0 · 2025 IEEE 5th International Conference on Smart Information Systems and Technologies (SIST)
This study tackles NLP challenges in low-resource settings by developing the Small Kazakh Language Corpus—a high-quality, annotated collection of Kazakh texts sourced from news, scientific publications, and Wikipedia. The corpus was used to fine-tune two mode…
Language models / LLMs · Machine translation · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
Development of a Translator for Kazakh Sign Language to Speech Using Gesture Recognition
2025 · A. Aitim, Dariga Sattarkhuzhayeva, Aisulu Khairullayeva · 0 · 2025 IEEE 5th International Conference on Smart Information Systems and Technologies (SIST)
This paper presents the development of "DauysYm," an AI-powered translator that converts Kazakh Sign Language (KSL) into spoken language using dynamic gesture recognition. The project addresses communication barriers faced by people with hearing impairments i…
Machine translation · Speech (ASR / TTS) · Datasets / corpora
SEMANTIC ROLE LABELING FOR KAZAKH: MODELS AND DATASETS
2025 · A. Aitim · 0 · Bulletin of Abai KazNPU. Series of Physical and Mathematical sciences
A fundamental component of natural language understanding, semantic role labeling (SRL) clarifies the relationship between predicates and their arguments, therefore enabling activities including information extraction, machine translation, and question answer…
Morphology / segmentation · Machine translation · NER / extraction · Datasets / corpora · Embeddings
Multilingual Speech Command Recognition for Voice Controlled Robots and Smart Systems
2025 · Askat Kuzdeuov, H. A. Varol · 1 · 2025 11th International Conference on Control, Automation and Robotics (ICCAR)
Speech Command Recognition (SCR) has many applications in smart home systems, voice-controlled robots, and voice assistants. The modern SCR systems employ deep learning models trained on speech command datasets. Nowadays, the Google Speech Commands (GSC) data…
Language models / LLMs · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
From Characters to Subwords: Modeling Unit Conversion for Low-resource Speech Recognition
2025 · Yizhi Wang, Haofei Zhang, Huiqiong Wang et al. · 0 · IEEE International Conference on Acoustics, Speech, and Signal Processing
Multilingual automatic speech recognition (ASR) models greatly facilitate recognizing low-resource languages by sharing representations across similar languages. However, the commonly adopted modeling units, e.g., character-level modeling, lack language-speci…
Tokenization · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Embeddings
Do Chinese models speak Chinese languages?
2025 · Andrea W Wen-Yi, Unso Eun Seo Jo, David Mimno · arXiv
The release of top-performing open-weight LLMs has cemented China's role as a leading force in AI development. Do these models support languages spoken in China? Or do they support the same languages as models developed in the United States or in Europe? Comp…
Language models / LLMs · Datasets / corpora · Evaluation / benchmarks
Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair
2025 · Maksim Borisov, Zhanibek Kozhirbayev, Valentin Malykh · 5 · arXiv
Machine translation for low resource language pairs is a challenging task. This task could become extremely difficult once a speaker uses code switching. We propose a method to build a machine translation model for code-switched Kazakh-Russian language pair w…
Machine translation · NER / extraction · Datasets / corpora · Evaluation / benchmarks
APPLICATION OF BLEU AND SARI METRICS IN EVALUATING SIMPLIFIED TEXTS IN KAZAKH: ANALYSIS AND EFFECTIVENESS
2025 · S. T. Nursapa, I. Ualiyeva · 0 · Herald of Kazakh-British technical university
This article explores the methodology for evaluating the quality of simplified texts in Kazakh using BLEU and SARI metrics. Text simplification is an important aspect for ensuring information accessibility and facilitating the learning process in Kazakh langu…
Machine translation · Evaluation / benchmarks
Creating a Parallel Corpus for the Kazakh Sign Language and Learning
2025 · A. Yerimbetova, B. Sakenov, M. Sambetbayeva et al. · 6 · Applied Sciences
Kazakh Sign Language (KSL) is a crucial communication tool for individuals with hearing and speech impairments. Deep learning, particularly Transformer models, offers a promising approach to improving accessibility in education and communication. This study a…
Machine translation · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting
2025 · Fajri Koto, Rituraj Joshi, Nurdaulet Mukhituly et al. · arXiv
Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Ad…
Tokenization · Language models / LLMs · NER / extraction · Datasets / corpora · Evaluation / benchmarks
MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages
2025 · Chen Zhang, Mingxu Tao, Zhiyuan Liao et al. · arXiv
Large language models (LLMs) excel in high-resource languages but struggle with low-resource languages (LRLs), particularly those spoken by minority communities in China, such as Tibetan, Uyghur, Kazakh, and Mongolian. To systematically track the progress in …
Language models / LLMs · Datasets / corpora · Evaluation / benchmarks
Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh
2025 · Nurkhan Laiyk, Daniil Orel, Rituraj Joshi et al. · arXiv
Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, cov…
Language models / LLMs · NER / extraction · Datasets / corpora
Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts
2025 · Maiya Goloburda, Nurkhan Laiyk, Diana Turmakhan et al. · arXiv
Large language models (LLMs) are known to have the potential to generate harmful content, posing risks to users. While significant progress has been made in developing taxonomies for LLM risks and safety evaluation prompts, most studies have focused on monoli…
Language models / LLMs · NER / extraction · Datasets / corpora · Evaluation / benchmarks
KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan
2025 · Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan et al. · arXiv
Despite having a population of twenty million, Kazakhstan's culture and language remain underrepresented in the field of natural language processing. Although large language models (LLMs) continue to advance worldwide, progress in Kazakh language has been lim…
Language models / LLMs · Datasets / corpora · Evaluation / benchmarks
TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages
2025 · Jafar Isbarov, Arofat Akhundjanova, Mammad Hajili et al. · arXiv
Being able to thoroughly assess massive multi-task language understanding (MMLU) capabilities is essential for advancing the applicability of multilingual language models. However, preparing such benchmarks in high quality native language is often costly and …
Language models / LLMs · Machine translation · Datasets / corpora · Evaluation / benchmarks
Design of QazSL Sign Language Recognition System for Physically Impaired Individuals
2025 · L. Zholshiyeva, T. Zhukabayeva, D. Baumuratova et al. · 13 · Journal of Robotics and Control (JRC)
Automating real-time sign language translation through deep learning and machine learning techniques can greatly enhance communication between the deaf community and the wider public. This research investigates how these technologies can change the way indivi…
Machine translation · Speech (ASR / TTS) · Datasets / corpora
Analysis of the use of the hiformer model for kazakh speech recognition
2024 · O. Mamyrbayev, T. Kurmetkan · 0 · Bulletin of the National Engineering Academy of the Republic of Kazakhstan
This article presents an overview of automatic speech recognition (ASR) technologies and describes the use of an advanced version of the Transformer model, the Hiformer model, in Kazakh speech recognition. A literature review of Kazakh speech recognition syst…
Speech (ASR / TTS)
Machine Learning Methods for Kazakh Morphology: A Comprehensive Overview
2024 · I. Akhmetov, S. Aubakirov, T. Saparov et al. · 2 · 2024 IEEE 3rd International Conference on Problems of Informatics, Electronics and Radio Engineering (PIERE)
Kazakh is an agglutinative language, where the sequential attachment of morphemes forms words, each bearing specific grammatical information. The complexity of this morphological structure presents significant challenges for computational linguistics, particu…
Morphology / segmentation · Machine translation · Speech (ASR / TTS) · Datasets / corpora · Embeddings
KAZAKH SPEECH AND RECOGNITION METHODS: ERROR ANALYSIS AND IMPROVEMENT PROSPECTS
2024 · Yerlan Karabaliyev, Kateryna Kolesnikova · 3 · Scientific Journal of Astana IT University
This study offers a detailed evaluation of automatic speech recognition (ASR) systems for Kazakh, examining their performance in recognizing the phonetic and linguistic features unique to the language. The Kazakh language presents specific challenges for …
Morphology / segmentation · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Machine Learning-Based Synonymous Word Detection in Kazakh
2024 · D. Rakhimova, Madina Mansurova, Nurakhmet Matanov et al. · 0 · 2024 9th International Conference on Computer Science and Engineering (UBMK)
This paper considers the problem of synonymous word detection in Kazakh using machine learning methods. The problem of automatic synonym detection is key to natural language processing tasks such as machine translation, context search, and semantic text analy…
Machine translation · Classification / sentiment
Aligning Sentences for Kazakh-Turkish Parallel Corpora
2024 · D. Rakhimova, E. Adalı, Aidana Karibayeva · 0 · 2024 9th International Conference on Computer Science and Engineering (UBMK)
This paper presents a hybrid approach to sentence alignment for the Kazakh-Turkish parallel corpus, addressing the challenges posed by linguistic and structural differences between the two languages. The system is divided into two modules: the Hunalign module…
Machine translation · Datasets / corpora
Named Entity Recognition from Kazakh Speech
2024 · Bauyrzhan Kairatuly, Madina Mansurova · 0 · 2024 9th International Conference on Computer Science and Engineering (UBMK)
This paper addresses the challenges of Named Entity Recognition (NER) in Kazakh speech, a critical task in Natural Language Processing (NLP). The integration of Automatic Speech Recognition (ASR) and NER technologies is explored to improve recognition accurac…
Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Research on Low-Resource Neural Machine Translation Methods Based on Explicit Sparse Attention and Pre-trained Language Model
2024 · Tao Ning, Xin Xie, Jin Zhang · 0 · International Conference on Computer Science and Network Technology
The development of machine translation is driven by the need for global communication through the automatic translation of words, sentences, and texts from one language to another. This paper proposes an improved Transformer-based model to enhance the perform…
Language models / LLMs · Machine translation · NER / extraction · Datasets / corpora
DEVELOPMENT OF METHODS AND ALGORITHMS TO BUILD A SPEAKER VERIFICATION IN KAZAKH LANGUAGE
2024 · S. Rashid, D. Kuanyshbay, A. Nurkey · 0 · Suleyman Demirel University Bulletin Natural and Technical Sciences
Speaker verification interfaces are gaining more and more popularity in both academic and commercial industries. It's connected with the latest advances in this area, which can be seen firsthand in our daily life: voice interfaces in computers, robots, cell p…
Speech (ASR / TTS)
Application of the conformer model for kazakh speech recognition
2024 · O. Mamyrbayev, T. Kurmetkan, R. Arslan · 0 · Bulletin of the National Engineering Academy of the Republic of Kazakhstan
The article describes the application of a transformer-based conformer model and a convolutional neural network (CNN) in Kazakh speech recognition, with an overview of automatic speech recognition (ASR) technologies. By exploring ways to combine convolutional…
Speech (ASR / TTS)
Research and processing of the kazakh children’s acoustic corpus
2024 · D. Rakhimova, Zh. Duisenbekkyzy, E. Adali et al. · 1 · Bulletin of the National Engineering Academy of the Republic of Kazakhstan
Recent advancements in speech recognition technology have significantly enhanced accessibility and functionality across various sectors. Nonetheless, the task of recognizing kid’s speech presents considerable challenges. Children from different age groups exh…
Tokenization · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
A SYSTEMATIC REVIEW OF EXISTING TOOLS TO AUTOMATED PROCESSING SYSTEMS FOR KAZAKH LANGUAGE
2024 · A. Aitim, R. Satybaldiyeva · 8 · BULLETIN Series of Physics & Mathematical Sciences
The development of automated systems for the Kazakh language has gained significant momentum in recent years, driven by the growing need for natural language processing (NLP) tools tailored to underrepresented languages. This systematic review aims to critica…
Tokenization · Morphology / segmentation · Machine translation · Speech (ASR / TTS) · Datasets / corpora
AI-Based IVR
2024 · Gassyrbek Kosherbay, Nurgissa Apbaz · arXiv
The use of traditional IVR (Interactive Voice Response) methods often proves insufficient to meet customer needs. This article examines the application of artificial intelligence (AI) technologies to enhance the efficiency of IVR systems in call centers. A pr…
Language models / LLMs · Speech (ASR / TTS) · Datasets / corpora · Classification / sentiment
Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text
2024 · Jinpeng Li, Yu Pu, Qi Sun et al. · 5 · arXiv
Whisper and other large-scale automatic speech recognition models have made significant progress in performance. However, their performance on many low-resource languages, such as Kazakh, is not satisfactory. It is worth researching how to utilize low-cost da…
Tokenization · Language models / LLMs · Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Recent Advancements and Challenges of Turkic Central Asian Language Processing
2024 · Yana Veitsman, Mareike Hartmann · arXiv
Research in NLP for Central Asian Turkic languages - Kazakh, Uzbek, Kyrgyz, and Turkmen - faces typical low-resource language challenges like data scarcity, limited linguistic resources and technology development. However, recent advancements have included th…
Datasets / corpora
SCIENTIFIC ASPECTS OF MODERN APPROACHES TO MACHINE TRANSLATION FOR SIGN LANGUAGE
2024 · Dana Nurgazina, Saule Kudubayeva, A. Ismailov · 1 · Scientific Journal of Astana IT University
Scientific research in the field of automated sign language translation represents a crucial stage in the development of technologies supporting the hearing-impaired and deaf communities. This article presents a comprehensive approach to addressing semantic a…
Machine translation
The link between translation difficulty and the quality of machine translation: a literature review and empirical investigation
2024 · S. Araghi, A. Palangkaraya · 2 · Language Resources and Evaluation
We survey the relevant literature on translation difficulty and automatic evaluation of machine translation (MT) quality and investigate whether source text’s translation difficulty features contain any information about MT quality. We analyse the 2017–2019 C…
Machine translation · Evaluation / benchmarks
Data Augmentation Based Unsupervised Pre-Training for Low-resource Speech Recognition
2024 · Hong Luo, Xiao Xie, Penghua Li et al. · 2 · Chinese Control and Decision Conference
This paper proposes SpecWav2vec-F, a novel model built upon the Wav2vec 2.0 baseline. The model demonstrates enhanced effectiveness in low-resource speech recognition tasks by preserving relationships between different time steps in the latent speech space. I…
Speech (ASR / TTS) · Datasets / corpora
Review of Hierarchical Transfer Learning Architecture in Low-Resource Machine Translation
2024 · Bilge Kagan Yazar, E. Kılıç · 0 · Signal Processing and Communications Applications Conference
Machine translation is a field of study that has attracted significant attention in recent years. The success of a model built on a language pair depends mainly on the number of parallel sentences between languages. Unlike high-resource languages, low-resourc…
Machine translation · Datasets / corpora
Integrated End-to-End Automatic Speech Recognition for Languages for Agglutinative Languages
2024 · A. Bekarystankyzy, O. Mamyrbayev, Tolganay Anarbekova · 6 · ACM Trans. Asian Low Resour. Lang. Inf. Process.
The relevance of the problem of automatic speech recognition lies in the lack of research for low-resource languages, stemming from limited training data and the necessity for new technologies to enhance efficiency and performance. The purpose of this work wa…
Morphology / segmentation · Language models / LLMs · Speech (ASR / TTS) · Datasets / corpora · Classification / sentiment
DEVELOPING METHODS FOR AUTOMATIC PROCESSING SYSTEMS OF KAZAKH LANGUAGE
2024 · A. Aitim · 10 · Вестник КазАТК
The linguistic diversity of the Kazakh language poses unique challenges and opportunities for the development of automatic processing systems. This research explores a comprehensive array of methodologies employed in advancing automatic processing systems tai…
Morphology / segmentation · Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Classification / sentiment
Research on the Construction of Low-Resource Parallel Corpus Based on Translation Plug-in Technology
2024 · Zulkar Iskander, Azragul Yusup · 0 · International Journal of Educational Curriculum Management and Research
Parallel corpora play a crucial role in the field of natural language processing, especially in tasks such as machine translation and cross-language information retrieval. However, with the increasing demand for more languages, the challenges are becoming i…
Machine translation · Datasets / corpora
KazQAD: Kazakh Open-Domain Question Answering Dataset
2024 · Rustem Yeshpanov, Pavel Efimov, Leonid Boytsov et al. · 14 · arXiv
We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with ex…
Language models / LLMs · Machine translation · Datasets / corpora
Integration AI Techniques in Low-Resource Language: The Case of Kazakh Language
2024 · Asmaganbetova Kamshat, Ulanbek Auyeskhan, Nurzhanova Zarina et al. · 3 · 2024 IEEE AITU: Digital Generation
Sentiment analysis is an extensively explored domain within natural language processing (NLP); nevertheless, a significant emphasis has been placed on languages possessing ample resources, such as English. This paper delves into the transformative capabilitie…
Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis
2024 · Adal Abilbekov, Saida Mussakhojayeva, Rustem Yeshpanov et al. · arXiv
This study focuses on the creation of the KazEmoTTS dataset, designed for emotional Kazakh text-to-speech (TTS) applications. KazEmoTTS is a collection of 54,760 audio-text pairs, with a total duration of 74.85 hours, featuring 34.23 hours delivered by a fema…
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
TEXT GENERATION MODELS FOR PARAPHRASE ON KAZAKH LANGUAGE
2024 · A. Kassenkhan, N. Mukazhanov, S. Nuralykyzy et al. · 0 · КазУТБ
This study delves into the relatively unexplored domain of natural language processing for the Kazakh language—a language with limited computational resources. The paper dissects the effectiveness of diffusion models and transformers in generating text, speci…
Tokenization · Morphology / segmentation · Machine translation · NER / extraction · Datasets / corpora
KazParC: Kazakh Parallel Corpus for Machine Translation
2024 · Rustem Yeshpanov, Alina Polonskaya, Huseyin Atakan Varol · 16 · arXiv
We introduce KazParC, a parallel corpus designed for machine translation across Kazakh, English, Russian, and Turkish. The first and largest publicly available corpus of its kind, KazParC contains a collection of 371,902 parallel sentences covering different …
Machine translation · Datasets / corpora · Evaluation / benchmarks
KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes
2024 · Rustem Yeshpanov, Huseyin Atakan Varol · arXiv
This paper presents KazSAnDRA, a dataset developed for Kazakh sentiment analysis that is the first and largest publicly available dataset of its kind. KazSAnDRA comprises an extensive collection of 180,064 reviews obtained from various sources and includes nu…
Datasets / corpora · Evaluation / benchmarks · Classification / sentiment · Embeddings
Parallel texts dataset for Uzbek-Kazakh machine translation
2024 · B. Allaberdiev, G. Matlatipov, Elmurod Kuriyozov et al. · 12 · Data in Brief
This paper presents a parallel corpus of raw texts between the Uzbek and Kazakh languages as a dataset for machine translation applications, focusing on the data collection process, dataset description, and its potential for reuse. The dataset-building proces…
Machine translation · NER / extraction · Datasets / corpora
The Task of Post-Editing Machine Translation for the Low-Resource Language
2024 · D. Rakhimova, Aidana Karibayeva, A. Turarbek · 12 · Applied Sciences
In recent years, machine translation has made significant advancements; however, its effectiveness can vary widely depending on the language pair. Languages with limited resources, such as Kazakh, Uzbek, Kalmyk, Tatar, and others, often encounter challenges i…
Morphology / segmentation · Machine translation · NER / extraction · Datasets / corpora · Evaluation / benchmarks
Neurorecognition visualization in multitask end-to-end speech
2023 · Orken J. Mamyrbayev, Sergii Pavlov, A. Bekarystankyzy et al. · 0 · Optical Fibers and Their Applications
Nowadays, speech-processing technologies with different language systems are successfully used in mobile and stationary devices. Kazakh is considered a low-resource language, which poses various challenges for conventional speech recognition methods. This pap…
Speech (ASR / TTS) · Datasets / corpora
An Empirical study of Unsupervised Neural Machine Translation: analyzing NMT output, model's behavior and sentences' contribution
2023 · Isidora Chara Tourni, Derry Wijaya · 0 · arXiv
Unsupervised Neural Machine Translation (UNMT) focuses on improving NMT results under the assumption there is no human translated parallel data, yet little work has been done so far in highlighting its advantages compared to supervised methods and analyzing i…
Machine translation · NER / extraction · Datasets / corpora · Embeddings
Speech Command Recognition: Text-to-Speech and Speech Corpus Scraping Are All You Need
2023 · Askat Kuzdeuov, Shakhizat Nurgaliyev, Diana Turmakhan et al. · 5 · 2023 3rd International Conference on Robotics, Automation and Artificial Intelligence (RAAI)
Speech Command Recognition (SCR) is rapidly gaining prominence due to its diverse applications, such as virtual assistants, smart homes, hands-free navigation, and voice-controlled industrial machinery. In this paper, we present a data-centric approach to cre…
Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks
Relevance-guided Neural Machine Translation
2023 · Isidora Chara Tourni, Derry Wijaya · 0 · arXiv
With the advent of the Transformer architecture, Neural Machine Translation (NMT) results have shown great improvement lately. However, results in low-resource conditions still lag behind in both bilingual and multilingual setups, due to the limited amount of…
Machine translation · Datasets / corpora
MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China
2023 · Chen Zhang, Mingxu Tao, Quzhe Huang et al. · arXiv
Current large language models demonstrate deficiencies in understanding low-resource languages, particularly the minority languages in China. This limitation stems from the scarcity of available pre-training data. To address this accessibility challenge, we p…
Language models / LLMs · Datasets / corpora
Noise-Robust Automatic Speech Recognition for Industrial and Urban Environments
2023 · D. Orel, H. A. Varol · 4 · Annual Conference of the IEEE Industrial Electronics Society
Automatic Speech Recognition (ASR) models can achieve human parity, but their performance degrades significantly when used in noisy industrial and urban environments. In this paper, we present monolingual and multilingual ASR models, which can perform effecti…
Speech (ASR / TTS) · NER / extraction · Datasets / corpora
A Chinese–Kazakh Translation Method That Combines Data Augmentation and R-Drop Regularization
2023 · Cang-Rong Liu, Wushouer Silamu, Yanbing Li · 3 · Applied Sciences
Low-resource languages often face the problem of insufficient data, which leads to poor quality in machine translation. One approach to address this issue is data augmentation. Data augmentation involves creating new data by transforming existing data through…
Machine translation · NER / extraction · Datasets / corpora
Machine Translation Shortcomings and Teaching Translation
2023 · L. Mirzoyeva · 4 · Revista Romaneasca pentru Educatie Multidimensionala
Nowadays, machine translation is considered to be a frequently used tool to render various types of texts related to such different spheres as science, film industry, etc. Statement of the problem: currently, as the higher school system in Kazakhstan starts i…
Machine translation · NER / extraction
Cascade Speech Translation for the Kazakh Language
2023 · Zhanibek Kozhirbayev, T. Islamgozhayev · 11 · Applied Sciences
Speech translation systems have become indispensable in facilitating seamless communication across language barriers. This paper presents a cascade speech translation system tailored specifically for translating speech from the Kazakh language to Russian. The…
Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks
The Development of a Kazakh Speech Recognition Model Using a Convolutional Neural Network with Fixed Character Level Filters
2023 · N. Kadyrbek, Madina Mansurova, A. Shomanov et al. · 10 · Big Data and Cognitive Computing
This study is devoted to the transcription of human speech in the Kazakh language in dynamically changing conditions. It discusses key aspects related to the phonetic structure of the Kazakh language, technical considerations in collecting the transcribed aud…
Tokenization · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks · Embeddings
Continuous Sign Language Recognition and Its Translation into Intonation-Colored Speech
2023 · N. Amangeldy, Aru Ukenova, G. Bekmanova et al. · 25 · Italian National Conference on Sensors
This article is devoted to solving the problem of converting sign language into a consistent text with intonation markup for subsequent voice synthesis of sign phrases by speech with intonation. The paper proposes an improved method of continuous recognition …
Morphology / segmentation · Machine translation · Speech (ASR / TTS) · Evaluation / benchmarks
Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection
2023 · Phat Do, Matt Coler, Jelske Dijkstra et al. · 6 · arXiv
We compare using a PHOIBLE-based phone mapping method and using phonological features input in transfer learning for TTS in low-resource languages. We use diverse source languages (English, Finnish, Hindi, Japanese, and Russian) and target languages (Bulgaria…
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration
2023 · Rustem Yeshpanov, Saida Mussakhojayeva, Yerbolat Khassanov · arXiv
This work aims to build a multilingual text-to-speech (TTS) synthesis system for ten lower-resourced Turkic languages: Azerbaijani, Bashkir, Kazakh, Kyrgyz, Sakha, Tatar, Turkish, Turkmen, Uyghur, and Uzbek. We specifically target the zero-shot learning scena…
Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Kazakh-Chinese neural machine translation based on data augmentation
2023 · Hao Wu, Beiqiang Ma · 0 · Conference on Computer Graphics, Artificial Intelligence, and Data Processing
Machine translation is an important research field in natural language processing and artificial intelligence, which studies how to use computers to automatically convert languages. We experimented with attentional neural network machine translation for Chine…
Machine translation · Datasets / corpora
Extractive Question Answering for Kazakh Language
2023 · Magzhan Shymbayev, Yermek Alimzhanov · 5 · 2023 IEEE International Conference on Smart Information Systems and Technologies (SIST)
This article provides research and development of an extractive question answering system based on the BERT-like model for the Kazakh language. Developing an extractive question answering system requires large training datasets - tens of thousands of annotate…
Language models / LLMs · Machine translation · NER / extraction · Datasets / corpora
Fine-Tuning the Wav2vec2 Model for Kazakh Speech: A Study on a Limited Corpus
2023 · Kairatuly Bauyrzhan, M. Madina, Ospan Assel · 3 · 2023 IEEE International Conference on Smart Information Systems and Technologies (SIST)
In this study, we developed a model for automatic recognition of Kazakh speech by fine-tuning the XLSR-Wav2Vec2 pre-trained model to a corpus of Kazakh speech. Our results show that fine-tuning the wav2vec2 model on a small corpus of Kazakh speech allows a si…
Speech (ASR / TTS) · Datasets / corpora
END-TO-END SPEECH RECOGNITION SYSTEMS FOR AGGLUTINATIVE LANGUAGES
2023 · A. Bekarystankyzy, O. Mamyrbayev · 1 · Scientific Journal of Astana IT University
With the improvement of intelligent systems, speech recognition technologies are being widely integrated into various aspects of human life. Speech recognition is applied to smart assistants, smart home infrastructure, the call center applications of banks, i…
Morphology / segmentation · Language models / LLMs · Speech (ASR / TTS) · Datasets / corpora · Classification / sentiment
The neural machine translation models for the low-resource Kazakh–English language pair
2023 · V. Karyukin, D. Rakhimova, Aidana Karibayeva et al. · 24 · PeerJ Computer Science
The development of the machine translation field was driven by people’s need to communicate with each other globally by automatically translating words, sentences, and texts from one language into another. The neural machine translation approach has become on…
Tokenization · Machine translation · Datasets / corpora · Evaluation / benchmarks
Speech Recognition for Turkic Languages Using Cross-Lingual Transfer Learning from Kazakh
2023 · D. Orel, Rustem Yeshpanov, H. A. Varol · 3 · International Conference on Big Data and Smart Computing
This paper investigates the effectiveness of transfer learning in building automatic speech recognition models for nine Turkic languages (Azerbaijani, Bashkir, Chuvash, Kyrgyz, Sakha, Tatar, Turkish, Uyghur, and Uzbek), by leveraging large-scale training data…
Speech (ASR / TTS) · Datasets / corpora
Multilingual Speech Recognition for Turkic Languages
2023 · Saida Mussakhojayeva, Kaisar Dauletbek, Rustem Yeshpanov et al. · 25 · Inf.
The primary aim of this study was to contribute to the development of multilingual automatic speech recognition for lower-resourced Turkic languages. Ten languages—Azerbaijani, Bashkir, Chuvash, Kazakh, Kyrgyz, Sakha, Tatar, Turkish, Uyghur, and Uzbek—were co…
Speech (ASR / TTS) · Datasets / corpora
A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training
2023 · Weijing Meng, Nurmemet Yolwas · 10 · Italian National Conference on Sensors
Building a good speech recognition system usually requires a lot of pairing data, which poses a big challenge for low-resource languages, such as Kazakh. In recent years, unsupervised pre-training has achieved good performance in low-resource speech recogniti…
Speech (ASR / TTS)Datasets / corporaClassification / sentimentEmbeddings
Automatic Speech Recognition for Uyghur, Kazakh, and Kyrgyz: An Overview
2022Wenqiang Du, Yikeremu Maimaitiyiming, Mewlude Nijat et al.17Applied Sciences
With the emergence of deep learning, the performance of automatic speech recognition (ASR) systems has remarkably improved. Especially for resource-rich languages such as English and Chinese, commercial usage has been made feasible in a wide range of applicat…
Speech (ASR / TTS)NER / extractionDatasets / corpora
MiLMo: Minority Multilingual Pre-trained Language Model
2022Junjie Deng, Hanru Shi, Xinhe Yu et al.arXiv
Pre-trained language models are trained on large-scale unsupervised data, and they can be fine-tuned on only small-scale labeled datasets and still achieve good results. Multilingual pre-trained language models can be trained on multiple languages, and the m…
Language models / LLMsDatasets / corporaClassification / sentimentEmbeddings
Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF
2022Bakhyt BakiyevarXiv
The task of determining the similarity of text documents has received considerable attention in many areas such as Information Retrieval, Text Mining, Natural Language Processing (NLP) and Computational Linguistics. Transferring data to numeric vectors is a c…
Tokenization
The TNT Team System Descriptions of Cantonese, Mongolian and Kazakh for IARPA OpenASR21 Challenge
2022Kai Tang, Jing Zhao, Jinghao Yan et al.0Asia-Pacific Signal and Information Processing Association Annual Summit and Conference
This paper presents our systems and experimental analyses for the OpenASR21 Challenge. We describe the systems in the constrained condition, constrained-plus condition, and unconstrained condition, and our post-evaluation analyses for the Challenge. The syste…
Speech (ASR / TTS)Evaluation / benchmarks
KSC2: An Industrial-Scale Open-Source Kazakh Speech Corpus
2022Saida Mussakhojayeva, Yerbolat Khassanov, H. A. Varol24Interspeech
We present the first industrial-scale open-source Kazakh speech corpus for automatic speech recognition research and development. Our corpus subsumes two previously presented corpora: 1) Kazakh speech corpus (KSC) and 2) Kazakh text-to-speech 2 (KazakhTTS2). …
Language models / LLMsSpeech (ASR / TTS)Datasets / corporaEvaluation / benchmarks
Multilingual Bidirectional Unsupervised Translation Through Multilingual Finetuning and Back-Translation
2022Bryan Li, Mohammad Sadegh Rasooli, Ajay Patel et al.arXiv
We propose a two-stage approach for training a single NMT model to translate unseen languages both to and from English. For the first stage, we initialize an encoder-decoder model to pretrained XLM-R and RoBERTa weights, then perform multilingual fine-tuning …
Language models / LLMsMachine translationNER / extractionDatasets / corpora
Semantic Connections in the Complex Sentences for Post-Editing Machine Translation in the Kazakh Language
2022A. Turganbayeva, D. Rakhimova, V. Karyukin et al.10Inf.
The problems of machine translation are constantly arising. While the most advanced translation platforms, such as Google and Yandex, allow for high-quality translations of languages with simple grammatical structures, more morphologically rich languages stil…
Morphology / segmentationMachine translationNER / extraction
Hybrid end-to-end model for Kazakh speech recognition
2022O. Mamyrbayev, D. Oralbekova, K. Alimhan et al.17International Journal of Speech Technology
Speech (ASR / TTS)
ResNet50+Transformer: Kazakh offline handwritten text recognition
2022Y. Amirgaliyev, Mateus Mendes, K. Mukhtar et al.0Bulletin of the National Engineering Academy of the Republic of Kazakhstan
Nowadays, due to the transition to digital data storage, there is a need to implement handwritten text recognition (HTR), which is an automatic translation of handwritten characters into a machine format. Handwriting recognition is complicated by the fact tha…
Machine translation
Refining Low-Resource Unsupervised Translation by Language Disentanglement of Multilingual Model
2022Xuan-Phi Nguyen, Shafiq Joty, Wu Kui et al.4arXiv
Numerous recent work on unsupervised machine translation (UMT) implies that competent unsupervised translations of low-resource and unrelated languages, such as Nepali or Sinhala, are only possible if the model is trained in a massive multilingual environment…
Machine translationDatasets / corpora
Descartes: Generating Short Descriptions of Wikipedia Articles
2022Marija Sakota, Maxime Peyrard, Robert WestarXiv
Wikipedia is one of the richest knowledge sources on the Web today. In order to facilitate navigating, searching, and maintaining its content, Wikipedia's guidelines state that all articles should be annotated with a so-called short description indicating the…
Machine translationNER / extractionDatasets / corporaEvaluation / benchmarks
A study of transformer-based end-to-end speech recognition system for Kazakh language
2022Mamyrbayev Orken, Oralbekova Dina, Alimhan Keylan et al.44Scientific Reports
Today, the Transformer model, which allows parallelization and also has its own internal attention, has been widely used in the field of speech recognition. The great advantage of this architecture is the fast learning speed, and the lack of sequential operat…
Morphology / segmentationLanguage models / LLMsSpeech (ASR / TTS)Datasets / corporaClassification / sentiment
Emotional Speech Recognition Method Based on Word Transcription
2022G. Bekmanova, B. Yergesh, A. Sharipbay et al.25Italian National Conference on Sensors
The emotional speech recognition method presented in this article was applied to recognize the emotions of students during online exams in distance learning due to COVID-19. The purpose of this method is to recognize emotions in spoken speech through the know…
Speech (ASR / TTS)Datasets / corpora
Identifying the influence of transfer learning method in developing an end-to-end automatic speech recognition system with a low data level
2022O. Mamyrbayev, K. Alimhan, D. Oralbekova et al.15Eastern-European Journal of Enterprise Technologies
Ensuring the best quality and performance of modern speech technologies, today, is possible based on the widespread use of machine learning methods. The idea of this project is to study and implement an end-to-end system of automatic speech recognition using …
Morphology / segmentationSpeech (ASR / TTS)Datasets / corpora
KazakhTTS2: Extending the Open-Source Kazakh TTS Corpus With More Data, Speakers, and Topics
2022Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan VarolarXiv
We present an expanded version of our previously released Kazakh text-to-speech (KazakhTTS) synthesis corpus. In the new KazakhTTS2 corpus, the overall size has increased from 93 hours to 271 hours, the number of speakers has risen from two to five (three fem…
Morphology / segmentationLanguage models / LLMsSpeech (ASR / TTS)Datasets / corporaEvaluation / benchmarks
KazNERD: Kazakh Named Entity Recognition Dataset
2021Rustem Yeshpanov, Yerbolat Khassanov, Huseyin Atakan VarolarXiv
We present the development of a dataset for Kazakh named entity recognition. The dataset was built as there is a clear need for publicly available annotated corpora in Kazakh, as well as annotation guidelines containing straightforward--but rigorous--rules an…
NER / extractionDatasets / corpora
The Development of the Light Post-editing Module for English-Kazakh Translation
2021D. Rakhimova, V. Karyukin, Aidana Karibayeva et al.2The 7th International Conference on Engineering & MIS 2021
Applied intelligent systems play an important role in the modern world. One of their tasks is machine translation (MT) from one language into another one. MT allows people to freely communicate despite language barriers. This new technology is a special step …
Machine translationEvaluation / benchmarks
FooDI-ML: a large multi-language dataset of food, drinks and groceries images and descriptions
2021David Amat Olóndriz, Ponç Palau Puigdevall, Adrià Salvador PalauarXiv
In this paper we introduce the FooDI-ML dataset. This dataset contains over 1.5M unique images and over 9.5M store names, product names descriptions, and collection sections gathered from the Glovo application. The data made available corresponds to food, dri…
NER / extractionDatasets / corporaEvaluation / benchmarks
KOHTD: Kazakh Offline Handwritten Text Dataset
2021Nazgul Toiganbayeva, Mahmoud Kasem, Galymzhan Abdimanap et al.arXiv
Despite the transition to digital information exchange, many documents, such as invoices, taxes, memos and questionnaires, historical data, and answers to exam questions, still require handwritten inputs. In this regard, there is a need to implement Handwritt…
Morphology / segmentationSpeech (ASR / TTS)Datasets / corpora
THE TRANSLATION QUALITY PROBLEMS OF MACHINE TRANSLATION SYSTEMS FOR THE KAZAKH LANGUAGE
2021Asem Turarbek0Journal of Mathematics Mechanics and Computer Science
Machine translation
A Study of Multilingual End-to-End Speech Recognition for Kazakh, Russian, and English
2021Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol20arXiv
We study training a single end-to-end (E2E) automatic speech recognition (ASR) model for three languages used in Kazakhstan: Kazakh, Russian, and English. We first describe the development of multilingual E2E ASR based on Transformer networks and then perform…
Speech (ASR / TTS)Datasets / corporaEvaluation / benchmarks
A baseline model for computationally inexpensive speech recognition for Kazakh using the Coqui STT framework
2021Ilnar Salimzianov0arXiv
Mobile devices are transforming the way people interact with computers, and speech interfaces to applications are ever more important. Automatic Speech Recognition systems recently published are very accurate, but often require powerful machinery (specialised…
TokenizationLanguage models / LLMsSpeech (ASR / TTS)NER / extractionDatasets / corpora
Error Correction Based on Transformer LM in Uyghur Speech Recognition
2021Yan Zhang, Mijit Ablimit, Askar Hamdulla12021 IEEE 2nd International Conference on Pattern Recognition and Machine Learning (PRML)
For Uyghur, Kazakh and other minority languages or dialects, it is difficult to collect a large-scale labeled corpus. In the case of low resources, reducing the recognition granularity by using phonemes or characters as the recognition unit can get good char…
Language models / LLMsSpeech (ASR / TTS)Datasets / corpora
End-to-End Model Based on RNN-T for Kazakh Speech Recognition
2021Orken J. Mamyrbayev, D. Oralbekova, A. Kydyrbekova et al.11International Conference on Computational Collective Intelligence
Automatic speech recognition is a rapidly developing area in machine learning. The most popular speech recognition systems today are end-to-end systems, especially those models that directly output a sequence of words taking into account the input sound in re…
Language models / LLMsSpeech (ASR / TTS)Datasets / corpora
MAIN PROBLEMS OF USING THE FULL POST-EDITING MODEL BASED ON MACHINE LEARNING FOR ENGLISH-KAZAKH TRANSLATION
2021D. Rakhimova, К. А. Zhakypbayeva0BULLETIN Series of Physics & Mathematical Sciences
Machine learning is one of the main branches of artificial intelligence. Its main idea is not only to use an algorithm written by a computer, but also to learn how to solve a problem on your own. Recently, in the field of translation, the issue of using machi…
Machine translationDatasets / corpora
KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset
2021Saida Mussakhojayeva, Aigerim Janaliyeva, Almas Mirzakhmetov et al.arXiv
This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide. The dataset consists of about 93 hours of transcribed audio recordings spoken by two professional speaker…
Language models / LLMsSpeech (ASR / TTS)Datasets / corporaEvaluation / benchmarks
"Wikily" Supervised Neural Translation Tailored to Cross-Lingual Tasks
2021Mohammad Sadegh Rasooli, Chris Callison-Burch, Derry Tanti WijayaarXiv
We present a simple but effective approach for leveraging Wikipedia for neural machine translation as well as cross-lingual tasks of image captioning and dependency parsing without using any direct supervision from external parallel data or supervised models …
Machine translationDatasets / corporaEmbeddings
Low-Resource Machine Translation Training Curriculum Fit for Low-Resource Languages
2021Garry Kuwanto, Afra Feyza Akyürek, Isidora Chara Tourni et al.8arXiv
We conduct an empirical study of neural machine translation (NMT) for truly low-resource languages, and propose a training curriculum fit for cases when both parallel training data and compute resource are lacking, reflecting the reality of most of the world'…
Machine translationDatasets / corpora
The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation
2021Jonne Sälevä, Constantine Lignos28arXiv
This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting. We compare segmentations produced by applying BPE at the token or sentence level with morphologically-based segmentations…
TokenizationMorphology / segmentationMachine translationDatasets / corpora
The Development and Construction of Bilingual Machine Translation Auxiliary Tool between Chinese and Kazakh Languages
2021M. Niyazbek, Kuenssaule Talp, Jing Sun4IOP Conference Series: Earth and Environment
This paper introduces the design and construction process of a bilingual machine translation auxiliary tool between Chinese and Kazakh languages. The tool uses the Jieba word segmentation tool to segment the input sentence, and then translates it according to…
Morphology / segmentationMachine translation
Development of a model and software solution for the problem of determining unknown words in post-editing machine translation
2021D. Rakhimova, N. M. Pazylkhan, A. Kulzhanova et al.0
Machine translation is the technology of consecutive translation of texts from one language to another by a computer program. As a result of machine translation, there are always certain disadvantages that can be solved by post-editing. Post-editing-human pro…
Machine translation
Classification of Handwritten Names of Cities and Handwritten Text Recognition using Various Deep Learning Models
2021Daniyar Nurseitov, Kairat Bostanbekov, Maksat Kanatov et al.arXiv
This article discusses the problem of handwriting recognition in Kazakh and Russian languages. This area is poorly studied since in the literature there are almost no works in this direction. We have tried to describe various approaches and achievements of re…
Datasets / corporaClassification / sentiment
Impact of Statistical Language Model on Example Based Machine Translation System between Kazakh and Turkish Languages
2020Gulshat Kessikbayeva, I. Çiçekli1International Conference on Natural Language Processing and Information Retrieval
In this paper a hybrid example based machine translation system between Kazakh and Turkish languages is presented. The system is mainly based on an example based machine translation method which is supported by a statistical language model for the target language. …
Morphology / segmentationLanguage models / LLMsMachine translationDatasets / corpora
ETHICAL ASPECT OF SPEECH CULTURE
2020G. Abdirasilova, М. Berkutbayeva, M. Student0
The basis of word culture is the language norm. Speech culture is " the degree of reproduction, maturation of language techniques. In addition, he has not only kindness, literacy, but also the skills of accurate and correct application of language techniques,…
Morphology / segmentationSpeech (ASR / TTS)Evaluation / benchmarks
Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages
2020Xavier Garcia, Aditya Siddhant, Orhan Firat et al.35arXiv
Unsupervised translation has reached impressive performance on resource-rich language pairs such as English-French and English-German. However, early studies have shown that in more realistic settings involving low-resource, rare languages, unsupervised trans…
Machine translationDatasets / corpora
A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline
2020Yerbolat Khassanov, Saida Mussakhojayeva, Almas Mirzakhmetov et al.43arXiv
We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups, as well as both gen…
Speech (ASR / TTS)Datasets / corpora
Attention-based Fully Gated CNN-BGRU for Russian Handwritten Text
2020Abdelrahman Abdallah, Mohamed Hamada, Daniyar NurseitovarXiv
This research approaches the task of handwritten text with attention encoder-decoder networks that are trained on Kazakh and Russian language. We developed a novel deep neural network model based on Fully Gated CNN, supported by Multiple bidirectional GRU and…
Datasets / corpora
Neural Named Entity Recognition for Kazakh
2020Gulmira Tolegen, Alymzhan Toleu, Orken Mamyrbayev et al.arXiv
We present several neural networks to address the task of named entity recognition for morphologically complex languages (MCL). Kazakh is a morphologically complex language in which each root/stem can produce hundreds or thousands of variant word forms. This …
Morphology / segmentationNER / extractionDatasets / corporaEmbeddings
HKR For Handwritten Kazakh & Russian Database
2020Daniyar Nurseitov, Kairat Bostanbekov, Daniyar Kurmankhojayev et al.arXiv
In this paper, we present a new Russian and Kazakh database (with about 95% of Russian and 5% of Kazakh words/sentences respectively) for offline handwriting recognition. A few pre-processing and segmentation procedures have been developed together with the d…
Morphology / segmentationNER / extractionDatasets / corpora
Method of Sentiment Preservation in the Kazakh-Turkish Machine Translation
2020L. Zhetkenbay, G. Bekmanova, B. Yergesh et al.2Communication Systems and Applications
This paper describes characteristics which affect the sentiment analysis in the Kazakh language texts, models of morphological rules and morphological analysis algorithms, formal models of simple sentence structures in the Kazakh-Turkish combination, models a…
Morphology / segmentationMachine translationClassification / sentiment
BASIC CONCEPTS AND PARAMETERS OF KAZAKH GRAMMATOLOGY
2020N. Amirzhanova0
Grammatology is traditionally a field of linguistics that establishes and studies the relationship between the letters of the alphabet and the sounds of speech. Grammatology as a branch of linguistics appeared long ago, almost simultaneously with linguistics.…
Speech (ASR / TTS)
Cross-Lingual Word Embeddings for Turkic Languages
2020Elmurod Kuriyozov, Yerai Doval, Carlos Gómez-RodríguezarXiv
There has been an increasing interest in learning cross-lingual word embeddings to transfer knowledge obtained from a resource-rich language, such as English, to lower-resource languages for which annotated data is scarce, such as Turkish, Russian, and many o…
Datasets / corporaEvaluation / benchmarksClassification / sentimentEmbeddings
Multimodal systems for speech recognition
2020Orken J. Mamyrbayev, K. Alimhan, B. Amirgaliyev et al.9International Journal of Mobile Communications
In this article, we have implemented a system of multimodal recognition of Kazakh speech, based on speech and lip recognition. During the feature extraction phase, several methods have been used, such as voice activity detection (VAD), mel-frequency cepstral …
Morphology / segmentationSpeech (ASR / TTS)Evaluation / benchmarksClassification / sentimentEmbeddings
The solution of the problem of unknown words under neural machine translation of the Kazakh language
2020A. Turganbayeva, U. Tukeyev7Asian Conference on Intelligent Information and Database Systems
The paper proposes a solution to the problem of unknown words for neural machine translation (NMT). The proposed solution is shown by the example of NMT of the Kazakh-English language pair. The novelty of the proposed technology for solving the probl…
TokenizationMachine translationDatasets / corpora
Development of Automatic Speech Recognition for Kazakh Language using Transfer Learning
2020Amirgaliyev E. N., Kuanyshbay D. N., Baimuratov O14arXiv
Development of Automatic Speech Recognition system for Kazakh language is very challenging due to a lack of data. Existing data of Kazakh speech with its corresponding transcriptions are heavily accessed and not enough to gain a worth mentioning results. For th…
Language models / LLMsSpeech (ASR / TTS)Datasets / corpora
Morphological segmentation method for Turkic language neural machine translation
2020U. Tukeyev, Aidana Karibayeva, Z. Zhumanov et al.24Cogent Engineering
Dictionaries play an important role in neural machine translation (NMT). However, a large dictionary requires a significant amount of memory, which limits the application of NMT and can cause a memory error. This limitation can be solved by segmentin…
TokenizationMorphology / segmentationMachine translationEvaluation / benchmarks
Speech Emotion Recognition For Kazakh And Russian Languages
20202Applied Mathematics & Information Sciences
Speech (ASR / TTS)
Neural Machine Translation for English–Kazakh with Morphological Segmentation and Synthetic Data
2019Antonio Toral, Lukas Edman, Galiya Yeshmagambetova et al.11Conference on Machine Translation
This paper presents the systems submitted by the University of Groningen to the English–Kazakh language pair (both translation directions) for the WMT 2019 news translation task. We explore the potential benefits of (i) morphological segmentation (both unsup…
Morphology / segmentationMachine translationEvaluation / benchmarks
NICT’s Unsupervised Neural and Statistical Machine Translation Systems for the WMT19 News Translation Task
2019Benjamin Marie, Haipeng Sun, Rui Wang et al.22Conference on Machine Translation
This paper presents the NICT’s participation in the WMT19 unsupervised news translation task. We participated in the unsupervised translation direction: German-Czech. Our primary submission to the task is the result of a simple combination of our unsupervised…
Machine translationDatasets / corporaEvaluation / benchmarks
NICT’s Supervised Neural Machine Translation Systems for the WMT19 News Translation Task
2019Raj Dabre, Kehai Chen, Benjamin Marie et al.16Conference on Machine Translation
In this paper, we describe our supervised neural machine translation (NMT) systems that we developed for the news translation task for Kazakh↔English, Gujarati↔English, Chinese↔English, and English→Finnish translation directions. We focused on leveraging mult…
Machine translationNER / extractionDatasets / corpora
The TALP-UPC Machine Translation Systems for WMT19 News Translation Task: Pivoting Techniques for Low Resource MT
2019Noe Casas, José A. R. Fonollosa, Carlos Escolano et al.16Conference on Machine Translation
In this article, we describe the TALP-UPC research group participation in the WMT19 news translation shared task for Kazakh-English. Given the low amount of parallel training data, we resort to using Russian as pivot language, training subword-based statistic…
TokenizationMachine translationDatasets / corpora
The RWTH Aachen University Machine Translation Systems for WMT 2019
2019Jan Rosendahl, Christian Herold, Yunsu Kim et al.4Conference on Machine Translation
This paper describes the neural machine translation systems developed at the RWTH Aachen University for the German-English, Chinese-English and Kazakh-English news translation tasks of the Fourth Conference on Machine Translation (WMT19). For all tasks, the f…
Morphology / segmentationLanguage models / LLMsMachine translation
Towards Interlingua Neural Machine Translation
2019Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa26arXiv
Common intermediate language representation in neural machine translation can be used to extend bilingual to multilingual systems by incremental training. In this paper, we propose a new architecture based on introducing an interlingual loss as an additional …
Machine translationDatasets / corporaEvaluation / benchmarksEmbeddings
Automatic Recognition of Kazakh Speech Using Deep Neural Networks
2019Orken J. Mamyrbayev, Mussa Turdalyuly, N. Mekebayev et al.20Asian Conference on Intelligent Information and Database Systems
Speech (ASR / TTS)
Automated rating of recorded classroom presentations using speech analysis in Kazakh
2018Akzharkyn Izbassarova, Aidana Irmanova, A. P. JamesarXiv
Effective presentation skills can help to succeed in business, career and academy. This paper presents the design of speech assessment during the oral presentation and the algorithm for speech evaluation based on criteria of optimal intonation. As the pace of…
Speech (ASR / TTS)Evaluation / benchmarks
A free Kazakh speech database and a speech recognition baseline
2017Ying Shi, Askar Hamdullah, Zhiyuan Tang et al.6Asia-Pacific Signal and Information Processing Association Annual Summit and Conference
Speech (ASR / TTS)
On Various Approaches to Machine Translation from Russian to Kazakh
2017Aibek Makazhanov, Bagdat Myrzakhmetov, Zhanibek et al.5
Machine translation
Regarding the impact of Kazakh phonetic transcription on the performance of automatic speech recognition systems
2017Muslima Karabalayeva, Zhandos Yessenbayev, Zhanibek Kozhirbayev1
Speech (ASR / TTS)
Complex Technology of Machine Translation Resources Extension for the Kazakh Language
2017D. Rakhimova, Z. Zhumanov2Asian Conference on Intelligent Information and Database Systems
Machine translationDatasets / corpora
Learning Word Alignment Models for Kazakh-English Machine Translation
2015A. Kartbayev7International Symposium on Integrated Uncertainty in Knowledge Modelling
Machine translation
Refining Kazakh Word Alignment Using Simulation Modeling Methods for Statistical Machine Translation
2015A. Kartbayev6Natural Language Processing and Chinese Computing
Word alignment plays an important role in the training of statistical machine translation systems. We present a technique to refine word alignments at phrase level after the collection of sentences from the Kazakh-English parallel corpora. The estimation techn…
Morphology / segmentationMachine translation
A Bilingual Kazakh-Russian System for Automatic Speech Recognition and Synthesis
2015Olga Khomitsevich, Valentin Mendelev, N. Tomashenko et al.17International Conference on Speech and Computer
Speech (ASR / TTS)
Kazakh Vowel Recognition at the Beginning of Words
2015Aigerim K. Buribayeva, A. Sharipbay0
This paper describes the method of recognition of Kazakh vowels at the beginning of the words using Dynamic Time Warping algorithm. This can be used for acceleration of recognition since word’s first sound identification can significantly decrease the list of…
Speech (ASR / TTS)
Initial explorations in Kazakh to English statistical machine translation
2014Z. Assylbekov, Assulan Nurkas10
Machine translation
Parametric Representation of Kazakh Gestural Speech
2014Saule Kudubayeva, Gulmira Yermagambetova2International Conference on Speech and Computer
Speech (ASR / TTS)Embeddings
A study of certain morphological structures of Kazakh and their impact on the machine translation quality
2014Eldar Bekbulatov, A. Kartbayev7Advanced Industrial Conference on Telecommunications
Morphology / segmentationMachine translation
Perceptual MVDR-based unsupervised built-in speaker normalization for Kazakh speech recognition
2014Zhandos Yessenbayev, Umit Yapanel0Advanced Industrial Conference on Telecommunications
Speech (ASR / TTS)
ENGLISH-KAZAKH PARALLEL CORPUS FOR STATISTICAL MACHINE TRANSLATION
2014A. Kuandykova, A. Kartbayev, Tannur Kaldybekov3
Machine translationDatasets / corpora
STRUCTURAL TRANSFER RULES FOR KAZAKH-TO-ENGLISH MACHINE TRANSLATION IN THE FREE/OPEN-SOURCE PLATFORM APERTIUM
2014A. Sundetova, Aidana Karibayeva, U. Tukeyev8
Machine translation
LEXICAL SELECTION IN MACHINE TRANSLATION OF RUSSIAN-TO-KAZAKH
2014D. Rakhimova, M. Abakan0
Machine translation
Methods for applying VAD in Kazakh speech recognition systems
2013M. Kalimoldayev, K. Alimhan, Orken J. Mamyrbayev5International Journal of Speech Technology
Speech (ASR / TTS)
Machine translation of different systemic languages using a Apertium platform (with an example of English and Kazakh languages)
2013S. Assem, S. Aida3International Conference on Computer Applications Technology
Machine translation
Improving Low-Resource Kazakh-English and Turkish-English Neural Machine Translation Using Transfer Learning and Part of Speech Tags
2025Bilge Kagan Yazar, Erdal Kiliç2IEEE Access
This study presents a novel translation framework by combining transfer learning and part-of-speech (POS) tagging methods to improve the performance of low-resource neural machine translation models using Kazakh-English and Turkish-English language pairs. It …
Machine translationSpeech (ASR / TTS)NER / extractionDatasets / corporaEvaluation / benchmarks
Maximum Entropy Model of Synonym Selection in Post-editing Machine Translation into Kazakh Language
2024A. Shormakova, U. Tukeyev0International Conference on Computational Collective Intelligence
Machine translation
Kazakh-Uzbek Speech Cascade Machine Translation on Complete Set of Endings
2023Tolganay Balabekova, Bauyrzhan Kairatuly, U. Tukeyev4International Conference on Computational Collective Intelligence
Machine translationSpeech (ASR / TTS)
Multi-Source Transformer for Kazakh-Russian-English Neural Machine Translation
2019Patrick Littell, Chi-kiu (羅致翹) Lo, Samuel Larkin et al.17Conference on Machine Translation
We describe the neural machine translation (NMT) system developed at the National Research Council of Canada (NRC) for the Kazakh-English news translation task of the Fourth Conference on Machine Translation (WMT19). Our submission is a multi-source NMT takin…
Machine translation
Development Kazakh-Turkish Machine Translation on the Base of Complete Set of Endings Model
2022Aitan Qamet, Kamila Zhakypbayeva, A. Turganbayeva et al.0Asian Conference on Intelligent Information and Database Systems
Machine translation
Kazakh Text Normalization using Machine Translation Approaches
2020Kozhirbaev Zhanibek, Yessenbayev Zhandos2Workshop on Cognitive Modeling and Computational Linguistics
Machine translation
Neural machine translation system for the Kazakh language based on synthetic corpora
2019U. Tukeyev, Aidana Karibayeva, B. Abduali10MATEC Web of Conferences
The lack of big parallel data is present for the Kazakh language. This problem seriously impairs the quality of machine translation from and into Kazakh. This article considers the neural machine translation of the Kazakh language on the basis of synthetic co…
Morphology / segmentationMachine translationNER / extraction
The University of Maryland’s Kazakh-English Neural Machine Translation System at WMT19
2019Eleftheria Briakou, Marine Carpuat15Conference on Machine Translation
This paper describes the University of Maryland’s submission to the WMT 2019 Kazakh-English news translation task. We study the impact of transfer learning from another low-resource but related language. We experiment with different ways of encoding lexical u…
Machine translationDatasets / corpora
Neural machine translation system for the Kazakh language
2019U. Tukeyev, Z. Zhumanov0Machine Translation Summit
Machine translation
Rule-based machine translation from Kazakh to Turkish
2018S. Bayatli, S. Kurnaz, Ilnar Salimzianov et al.3European Association for Machine Translation Conferences/Workshops
Machine translation
Rule-weight learning for Kazakh-Turkish machine translation
2020S. M. Taha0
Machine translation
Development and Study of a Post-editing Model for Russian-Kazakh and English-Kazakh Translation Based on Machine Learning
2021 · D. Rakhimova, Kamila Sagat, Kamila Zhakypbaeva et al. · 1 · International Conference on Computational Collective Intelligence
Machine translation
A Comparative Evaluation of Open-Source Models for Russian-Kazakh Translation
2026 · Gleb Shanshin · 0 · Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026)
We describe an evaluation of several open-source models under identical inference conditions without task-specific training. Despite covering a wide range of available models, including both multilingual systems and models specifically designed for Russian– K…
Machine translation · Evaluation / benchmarks
Lexical selection rules for Kazakh-to-English machine translation in the free/open-source platform Apertium
2015 · Aidana Karibayeva · 0
Machine translation
Example based machine translation system between kazakh and turkish supported by statistical language model (Kazakça ve türkçe dilleri arasında örnek tabanlı ve istatistik model destekli makine çeviri sistemi)
2016 · Gulshat Kessikbayeva · 0
Language models / LLMs · Machine translation
A Free/Open-source Kazakh-Tatar Machine Translation System
2013 · Ilnar Salimzyanov, Jonathan North Washington, Francis M. Tyers · 22 · Machine Translation Summit
Machine translation
A free/open-source machine translation system for English to Kazakh
5 · 3rd International Conference on Computer Processing in Turkic Languages (TURKLANG 2015)
Machine translation
Initial explorations in Kazakh to English statistical machine translation
2014 · Zhenisbek Assylbekov, Assulan Nurkas · 0 · Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 and of the Fourth International Workshop EVALITA 2014, 9-11 December 2014, Pisa
Machine translation
The Universitat d’Alacant Submissions to the English-to-Kazakh News Translation Task at WMT 2019
2019 · V. M. Sánchez-Cartagena, Juan Antonio Pérez-Ortiz, F. Sánchez-Martínez · 9 · Conference on Machine Translation
This paper describes the two submissions of Universitat d’Alacant to the English-to-Kazakh news translation task at WMT 2019. Our submissions take advantage of monolingual data and parallel data from other language pairs by means of iterative backtranslation,…
Morphology / segmentation · Machine translation
Do LLMs Speak Kazakh? A Pilot Evaluation of Seven Models
2024 · Akylbek Maxutov, Ayan Myrzakhmet, Pavel Braslavski · 17 · SIGTURK
We conducted a systematic evaluation of seven large language models (LLMs) on tasks in Kazakh, a Turkic language spoken by approximately 13 million native speakers in Kazakhstan and abroad. We used six datasets corresponding to different tasks – questions ans…
Language models / LLMs · Machine translation · NER / extraction · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
Kazakh Speech Recognition: Wav2vec2.0 vs. Whisper
2023 · Zhanibek Kozhirbayev · 11 · Journal of Advances in Information Technology
In recent years, the progress made in neural models trained on extensive multilingual text or speech data has shown great potential for improving the status of underresourced languages. This paper focuses on experimenting with three state-of-the-art speech r…
Language models / LLMs · Machine translation · Speech (ASR / TTS) · Datasets / corpora
Leveraging Wav2Vec2.0 for Kazakh Speech Recognition: An Experimental Study
2024 · Zhanibek Kozhirbayev · 0 · Communication Systems and Applications
Speech (ASR / TTS)
A Study of Kazakh Speech Recognition in Hiformer Model
2024 · O. Mamyrbayev, Turdybek Kurmetkan, D. Oralbekova et al. · 0 · Asian Conference on Intelligent Information and Database Systems
Speech (ASR / TTS)
Development of CRF and CTC Based End-To-End Kazakh Speech Recognition System
2022 · D. Oralbekova, Orken J. Mamyrbayev, M. Othman et al. · 2 · Asian Conference on Intelligent Information and Database Systems
Speech (ASR / TTS)
Review of methods of end-to-end automatic recognition of Kazakh speech
2024 · Yerlan Karabaliyev, K. Kolesnikova, Nurkhan Batyrkhan · 0 · EUSPN/ICTH
Speech (ASR / TTS)
Investigating a Kazakh Speech Recognition System Using an End-to-End Model Based on CRF and CTC
D. O. Oralbekova, O. Zh. Mamyrbayev, A. B. Imansakipova et al. · 0
Speech (ASR / TTS)
Speech recognition for Kazakh language: a research paper
2024 · Galym Kapyshev, M. Nurtas, Aizhan Altaibek · 7 · Procedia Computer Science
Speech (ASR / TTS)
Automatic Speech Recognition Improvement for Kazakh Language with Enhanced Language Model
2023 · A. Bekarystankyzy, O. Mamyrbayev, Mateus Mendes et al. · 1 · Asian Conference on Intelligent Information and Database Systems
Language models / LLMs · Speech (ASR / TTS)
Speech recognition for Kazakh language: a research paper
2023 · Galym Kapyshev, M. Nurtas, Aizhan Altaibek · 0 · EUSPN/ICTH
Speech (ASR / TTS)
Continuous Speech Recognition of Kazakh Language
2019 · Orken Mamyrbayev, Mussa Turdalyuly, N. Mekebayev et al. · 13 · ITM Web of Conferences
This article describes the methods of creating a system of recognizing the continuous speech of Kazakh language. Studies on recognition of Kazakh speech in comparison with other languages began relatively recently, that is after obtaining independence of the …
Tokenization · Speech (ASR / TTS) · Datasets / corpora
Impact of Using a Bilingual Model on Kazakh-Russian Code-Switching Speech Recognition
2019 · Dmitrii Ubskii, Yuri N. Matveev, W. Minker · 2 · Majorov International Conference on Software Engineering and Computer Systems
Speech (ASR / TTS)
Automatic Speech Recognition System for Kazakh Language Using Connectionist Temporal Classifier
2020 · Y. Amirgaliyev, Darkhan Kuanyshbay, D. Yedilkhan · 0
Speech (ASR / TTS)
QazNLP: Constraint-Aware Multi-Task Sequence Labeling for Morphologically Rich Low-Resource Languages
2026 · A. Aitim · 0 · IEEE Access
Automatic processing of morphologically rich, agglutinative, and low-resource languages remains challenging because productive affixation increases lexical sparsity, weakens statistical generalization, and often produces inconsistent predictions across relate…
Tokenization · Morphology / segmentation · Language models / LLMs · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks

How this atlas was built

The corpus contains 222 papers, gathered by a Python scraper from two sources: the arXiv API and the Semantic Scholar Graph API. The scope is Kazakh only: a paper is kept if it mentions Kazakh/Qazaq and is relevant to NLP/ML. Graph edges are real citations pulled via the Semantic Scholar batch API. The editorial layer (flagships, open territory, breakthrough timeline) is hand-verified against primary sources.
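
The scope filter described above can be sketched roughly as follows. This is a minimal illustration, not the atlas's actual scraper code: the function name `in_scope` and both keyword patterns are assumptions, since the real filter terms are not published here.

```python
import re

# Assumed patterns: the atlas's real filter terms are not published.
LANGUAGE_RE = re.compile(r"\b(?:kazakh|qazaq)", re.IGNORECASE)
RELEVANCE_RE = re.compile(
    r"\b(?:nlp|natural language|machine translation|speech recognition|"
    r"tokeniz|morpholog|language model|neural|machine learning)",
    re.IGNORECASE,
)


def in_scope(title: str, abstract: str = "") -> bool:
    """Return True when a paper mentions Kazakh/Qazaq and an NLP/ML term."""
    text = f"{title} {abstract}"
    return bool(LANGUAGE_RE.search(text)) and bool(RELEVANCE_RE.search(text))
```

Requiring both matches keeps out papers that mention Kazakhstan in a non-NLP context, at the cost of missing papers whose title and abstract use neither keyword family.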

⚠ Limitations, honestly: Semantic Scholar returned HTTP 429 (Too Many Requests) for part of the requests, so non-arXiv coverage in the tokenization/LLM/morphology categories is incomplete. Auto-tags are an abstract-based heuristic, not manual annotation. The flagships' "claim" column reflects the authors' assertions, not verified truth. The scrapers are idempotent, so a re-run will top up the corpus.
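
One way the rate-limit handling and the idempotent top-up could look is sketched below. It is an assumption-laden illustration, not the atlas's scraper: `fetch_with_backoff`, `top_up`, the `(status, payload)` return shape, and the `"id"` key are all hypothetical, and the network call is abstracted behind a callable so the logic can be exercised offline.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[], tuple[int, Optional[list]]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[list]:
    """Call fetch() until it stops returning HTTP 429 or retries run out."""
    for attempt in range(max_retries + 1):
        status, payload = fetch()
        if status != 429:
            return payload if status == 200 else None
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # exponential backoff: 1s, 2s, 4s, ...
    return None  # still rate-limited: the gap stays until the next run


def top_up(corpus: dict[str, dict], papers: list[dict]) -> int:
    """Add papers not yet in the corpus, keyed by 'id'; return how many were new."""
    added = 0
    for paper in papers:
        if paper["id"] not in corpus:
            corpus[paper["id"]] = paper
            added += 1
    return added
```

Because `top_up` only inserts unseen ids, running it twice with the same batch leaves the corpus unchanged, which is the property that lets a later run fill gaps left by earlier 429 responses.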

Sources: arXiv API · Semantic Scholar · morphology gold standard: UD_Kazakh-KTB, apertium-kaz.