Research cartography · 2013–2026

Atlas of Kazakh NLP

The State of Kazakh NLP Research

Kazakh NLP research surged in 2024–2026, but the growth is concentrated in speech and machine translation. Tokenization and morphology remain a thin, under-mapped frontier. That is where contributions are open.

222 papers in the corpus
2013–2026 years covered
47% published in 2024–2026
21 on tokenization
01

Timeline of the field

papers per year
2013: 3
2014: 8
2015: 5
2016: 1
2017: 4
2018: 2
2019: 14
2020: 18
2021: 18
2022: 17
2023: 25
2024: 35
2025: 45
2026: 25
▮ in gold: 2024–2026, the LLM era reaches Kazakh
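The 47% headline can be re-derived from the per-year counts. The sketch below copies the counts from the chart and divides by the stated corpus size of 222; the function name `share_of_period` is illustrative, not part of the page. Note that the chart's bars sum to 220, two fewer than the headline total, and the page does not say why.

```python
# Papers per year, copied from the timeline chart.
papers_per_year = {
    2013: 3, 2014: 8, 2015: 5, 2016: 1, 2017: 4, 2018: 2, 2019: 14,
    2020: 18, 2021: 18, 2022: 17, 2023: 25, 2024: 35, 2025: 45, 2026: 25,
}

CORPUS_SIZE = 222  # headline corpus size stated on the page


def share_of_period(counts, first_year, last_year, total):
    """Return (papers published in [first_year, last_year], their share of total)."""
    in_period = sum(n for year, n in counts.items() if first_year <= year <= last_year)
    return in_period, in_period / total


recent, share = share_of_period(papers_per_year, 2024, 2026, CORPUS_SIZE)
charted = sum(papers_per_year.values())

print(f"2024-2026: {recent} papers, {share:.0%} of {CORPUS_SIZE}")  # 105 papers, 47%
print(f"papers on the chart: {charted}")  # 220
```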
02

Breakthrough timeline

world ⟷ Kazakh · the lag is visible
🌍 World · 🇰🇿 Kazakhstan
03

Territories

in descending order of openness
Tokenization · 21
A thin frontier. The boom came in 2024–2026. No independent audit of morpheme alignment exists.
Morphology / segmentation · 48
New segmenters get built, but nobody audits what existing tokenizers actually do.
Language models / LLM · 50
mapped
Evaluation / benchmarks · 76
Benchmarks are few and fragmented. Til-Core shipped without a single downstream benchmark.
Embeddings · 22
mapped
Classification / sentiment · 26
mapped
NER / extraction · 55
mapped
Machine translation · 105
dense
Speech (ASR / TTS) · 110
dense
Datasets / corpora · 146
dense
04

LLM architecture

a diagram for newcomers · and a contribution map
The Kazakh LLM stack

The same skeleton as any LLM stack, but look at where Kazakh has nothing, and at which layers of the usual stack are not needed at all. Color = status, number = volume of work, bar = how well the layer is covered.

Legend: many papers · active · incomplete · almost no papers
01 · Data · 154 papers
02 · Tokenization · 59 papers
03 · Representations · 28 papers
04 · Model · 50 papers
05 · Adaptation · 28 papers
06 · Evaluation · 76 papers
07 · Inference · 4 papers
08 · Applications · 210 papers
05

Citation graph

size = influence · color = topic
104 connected papers + 15 global hubs · 312 citation edges · method: s2-batch. Another 118 papers with no edges are in the list below.
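The split into connected and edge-free papers, and a node size driven by influence, can be sketched from an edge list. This is a minimal illustration under assumptions: the page does not say how influence is computed, so in-degree (incoming citations) is used as a stand-in, and all ids and function names below are made up.

```python
from collections import Counter


def split_by_connectivity(paper_ids, edges):
    """Split corpus papers into those with at least one edge and those with none.

    edges is an iterable of (citing_id, cited_id) pairs; ids outside the
    corpus (for example external hub papers) are allowed.
    """
    corpus = set(paper_ids)
    touched = {paper for edge in edges for paper in edge}
    return corpus & touched, corpus - touched


def in_degree(edges):
    """Count incoming citations per paper (an assumed proxy for influence)."""
    return Counter(cited for _citing, cited in edges)


# Toy data with made-up ids; "hub" stands in for a paper outside the corpus.
papers = ["p1", "p2", "p3", "p4"]
edges = [("p1", "p2"), ("p3", "p2"), ("p1", "hub")]

connected, isolated = split_by_connectivity(papers, edges)
influence = in_degree(edges)
print(sorted(connected), sorted(isolated), influence.most_common(1))

# Consistency check on the quoted figures: connected + edge-free = corpus size.
assert 104 + 118 == 222
```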
06

Flagship models

claim ≠ verified
Columns: Model · Year · Parameters · Base · Vocab · Tokenizer · Morphology? · Benchmarks?

Til-Core-0.5B (Тіл Қазына, state organization)
  Year: 2026
  Parameters: 497M
  Base: Qwen2 architecture (trained from scratch)
  Vocab: 256,000
  Tokenizer: morphBPE, i.e. BPE with merges forbidden across morpheme boundaries (BiLSTM segmenter)
  Morphology?: YES, but the segmenter has not been released
  Benchmarks?: NO, validation perplexity only
  Note: A loud morphology claim, but the only reported metric is validation perplexity. The family covers 0.5B/1B (+Instruct). No independent verification exists.

Sherkala-Chat-8B (Inception / MBZUAI)
  Year: 2025
  Parameters: 8B
  Base: Llama-3.1
  Vocab: 159,766
  Tokenizer: extended BPE (+25% over Llama-3.1)
  Morphology?: no (fertility-driven)
  Benchmarks?: yes
  Note: Kazakh fertility drops from 4.73 to 2.04. Morpheme alignment is not discussed.

SozKZ (50M–600M) (S. Tukenov)
  Year: 2026
  Parameters: 50–600M
  Base: Llama-arch
  Vocab: 50,000
  Tokenizer: ByteLevel BPE trained from scratch on Kazakh
  Morphology?: no
  Benchmarks?: partial
  Note: The argument rests on fertility, not on morpheme boundaries.

KazByte (R. Akylzhanov)
  Year: 2026
  Parameters: adapter → Qwen2.5-7B
  Base: Qwen2.5
  Vocab: none (byte-level)
  Tokenizer: bypasses the tokenizer entirely (byte-level adapter)
  Morphology?: n/a, no tokenizer
  Benchmarks?: NO, position paper
  Note: A counterpoint to the whole field: the "tokenizer tax" is solved by removing the tokenizer. "Validation is ongoing"; no results have been published.

KazLLM (8B / 70B) (ISSAI / NU)
  Year: 2024
  Parameters: 8B, 70B
  Base: Llama-3.1
  Vocab: 128,256 (Llama-3.1)
  Tokenizer: inherits Llama-3.1; no extension is documented
  Morphology?: no
  Benchmarks?: yes (task performance)
  Note: 150B+ tokens, 4 languages. No separate tokenizer paper.

Kaz-RoBERTa (kz-transformers)
  Year: 2023
  Parameters: ~83M
  Base: RoBERTa
  Vocab: 52,000
  Tokenizer: byte-level BPE (Kazakh + code-switched Russian dialogues)
  Morphology?: no
  Benchmarks?: partial
  Note: An early baseline. Used in hybrid morphological analyzers.
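To make the "BPE with merges forbidden across morpheme boundaries" idea concrete, here is a toy sketch. It is not Til-Core's morphBPE implementation, which the table only describes in one line. It assumes words arrive already split into morphemes (hand-segmented here in place of a BiLSTM segmenter) and enforces the constraint by counting symbol pairs only inside a morpheme. All function names and the tiny corpus are illustrative.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_constrained_bpe(segmented_words, num_merges):
    """Learn BPE merges that never cross a morpheme boundary.

    segmented_words: list of words, each given as a list of morpheme strings.
    Every morpheme is treated as its own unit, so pairs are only counted
    between symbols of the same morpheme.
    """
    units = Counter(tuple(m) for word in segmented_words for m in word)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in units.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every morpheme is already a single symbol
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        merged_units = Counter()
        for symbols, freq in units.items():
            merged_units[apply_merge(symbols, best)] += freq
        units = merged_units
    return merges


def encode(morphemes, merges):
    """Tokenize one pre-segmented word by replaying the merges inside each morpheme."""
    tokens = []
    for morpheme in morphemes:
        symbols = tuple(morpheme)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens


# Hand-made illustrative segmentations, not a gold standard.
corpus = [
    ["кітап", "тар", "ымыз"],
    ["кітап", "тар"],
    ["бала", "лар"],
    ["бала", "лар", "ымыз"],
    ["мектеп", "тер"],
]
merges = train_constrained_bpe(corpus, num_merges=20)
tokens = encode(["кітап", "тар", "ымыз"], merges)
print(tokens)
```

Note that `encode` needs the morpheme split at inference time too, which is why an unreleased segmenter makes such a tokenizer hard to audit from the outside.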
07

Unclaimed territory

where a contribution is open
◆ YOU ARE HERE

Independent audit of morpheme alignment in Kazakh tokenizers

Nobody has compared several KAZAKH tokenizers (Kaz-RoBERTa, SozKZ, Sherkala, Til-Core) on morpheme boundaries against a single gold standard. Arnett 2025 covers Kazakh as 1 of 70 languages and only with generic tokenizers; Duisenova 2026 builds a new tokenizer but does not audit the existing ones.

SHIPPABLE this week
◆ YOU ARE HERE

Empirical check of Til-Core's morphology claim

Til-Core shipped without a single downstream benchmark (validation perplexity only) and with a loud claim of "Kazakh morphology support". Be the first to measure it independently.

part of the audit
◆ YOU ARE HERE

Precision/F1 of morpheme alignment for Kazakh tokenizers

The original MorphScore (2024) measures only boundary recall; Arnett 2025 added precision/recall for Kazakh, but only for generic tokenizers (BLOOM, Llama, Gemma). Nobody has computed precision and F1 for KAZAKH tokenizers (Kaz-RoBERTa, SozKZ, Sherkala, Til-Core).

small add-on to the audit
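A boundary-level precision/recall/F1 can be computed from two segmentations of the same word. The sketch below is a generic formulation, not the MorphScore code: it compares the internal cut positions of a tokenizer's pieces with those of a gold morpheme split. The function names and the example tokenization are hypothetical.

```python
def boundaries(pieces):
    """Character offsets of the internal cuts in a segmentation of one word."""
    offsets, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        offsets.add(pos)
    return offsets


def boundary_prf(token_pieces, gold_morphemes):
    """Precision, recall and F1 of token boundaries against gold morpheme boundaries."""
    if "".join(token_pieces) != "".join(gold_morphemes):
        raise ValueError("both segmentations must spell the same word")
    pred = boundaries(token_pieces)
    gold = boundaries(gold_morphemes)
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical tokenizer output vs. a hand-made morpheme split of the same word.
tokens = ["кітап", "тары", "мыз"]
gold = ["кітап", "тар", "ымыз"]
print(boundary_prf(tokens, gold))  # (0.5, 0.5, 0.5): one of two cuts matches
```

Recall alone rewards over-segmentation (a character-level tokenizer hits every gold boundary), which is why reporting precision and F1 alongside it changes the picture.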
○ open

Joint fertility × morpheme-alignment table

Sherkala reports fertility, the MorphScore paper reports alignment, but nobody has put both axes for Kazakh tokenizers into one table.

medium
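A joint table only needs two numbers per tokenizer. The sketch below assumes fertility means average tokens per word (a common definition; the page does not define it) and uses a micro-averaged boundary F1 for alignment. The three toy tokenizers and the hand-made gold splits are placeholders for real tokenizers and a real gold standard.

```python
def cuts(pieces):
    """Character offsets of the internal cuts in a segmentation of one word."""
    offsets, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        offsets.add(pos)
    return offsets


def fertility(tokenize, words):
    """Average number of tokens per word."""
    return sum(len(tokenize(w)) for w in words) / len(words)


def alignment_f1(tokenize, gold):
    """Micro-averaged F1 of token boundaries against gold morpheme boundaries."""
    hits = n_pred = n_gold = 0
    for word, morphemes in gold.items():
        pred, ref = cuts(tokenize(word)), cuts(morphemes)
        hits += len(pred & ref)
        n_pred += len(pred)
        n_gold += len(ref)
    if hits == 0:
        return 0.0
    precision, recall = hits / n_pred, hits / n_gold
    return 2 * precision * recall / (precision + recall)


# Hand-made illustrative segmentations, not a gold standard.
GOLD = {
    "кітаптарымыз": ["кітап", "тар", "ымыз"],
    "балалар": ["бала", "лар"],
    "үйлеріміз": ["үй", "лер", "іміз"],
}

# Toy stand-ins for real tokenizers.
TOKENIZERS = {
    "chars": lambda w: list(w),
    "chunks-of-3": lambda w: [w[i:i + 3] for i in range(0, len(w), 3)],
    "whole-word": lambda w: [w],
}

rows = {
    name: (fertility(tok, list(GOLD)), alignment_f1(tok, GOLD))
    for name, tok in TOKENIZERS.items()
}
for name, (fert, f1) in rows.items():
    print(f"{name:<12} fertility={fert:5.2f}  boundary-F1={f1:.2f}")
```

The toy rows already show why both axes matter: the whole-word tokenizer has the best possible fertility and zero alignment, while the character tokenizer finds every boundary at a high token cost.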
◆ YOU ARE HERE

Usage-vs-morphology divergence (what native speakers actually say)

The morphologically correct form is not always the form a native speaker actually uses (e.g. «біздің кітаптар» instead of «кітаптарымыз», or «неге» treated as a single unit). No paper covers this methodologically. A survey of native speakers would open a new angle.

mini-survey, 30–50 responses
08

Paper corpus

arXiv + Semantic Scholar

222 papers shown

KZ-SafetyPrompts: A Kazakh Safety Evaluation Prompt Dataset for Large Language Models
2026 · Wajdi Zaghouani, Shimaa Amer Ibrahim, Aruzhan Muratbek et al. · arXiv
Kazakh is underrepresented in resources for evaluating the safety behavior of large language models. We present KZ-SafetyPrompts, a Kazakh prompt dataset for safety evaluation across eleven categories covering common risk areas such as self-harm, violence, ch…
Language models / LLM · Machine translation · Datasets / corpora · Evaluation / benchmarks
Bidirectional Kazakh Sign Language prosody-aware translation using computer vision and speech recognition techniques
2026 · M. Zhassuzak, Zholdas Buribayev, Maria Aouani et al. · 0 citations · Frontiers in Artificial Intelligence
Introduction: This study presents a bidirectional communication system designed to enhance interaction between hearing-impaired and hearing individuals using gesture recognition. Methods: The proposed framework integrates multiple components, including Kazakh S…
Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
Sign Language Recognition and Translation for Low-Resource Languages: Challenges and Pathways Forward
2026 · Nigar Alishzade, Gulchin Abdullayeva · arXiv
Sign languages are natural, visual-gestural languages used by Deaf communities worldwide. Over 300 distinct sign languages remain severely low-resource due to limited documentation, sparse datasets, and insufficient computational tools. This systematic review…
Machine translation · NER / extraction · Datasets / corpora · Evaluation / benchmarks · Embeddings
100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts
2026 · Rustem Yeshpanov · arXiv
We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001-2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switch…
Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
Effects of Cross-lingual Evidence in Multilingual Medical Question Answering
2026 · Anar Yeginbergen, Maite Oronoz, Rodrigo Agerri · arXiv
This paper investigates Multilingual Medical Question Answering across high-resource (English, Spanish, French, Italian) and low-resource (Basque, Kazakh) languages. We evaluate three types of external evidence sources across models of varying size: curated r…
Language models / LLM · Datasets / corpora · Evaluation / benchmarks
Development of a Crowdsourcing-Based Adaptive Kazakh—English Translation System for the Kazakh Language
2026 · Almat Begaidarov, A. Serek, Aisaule Bazarkulova et al. · 0 citations · International Conference on Electronics, Computer and Computation
The Kazakh language is hard for learners and machine translation because of its complicated morphology, regional differences, and subtle differences in meaning. This research presents a crowdsourcing-driven adaptive translation system that amalgamates automat…
Morphology / segmentation · Machine translation · NER / extraction · Evaluation / benchmarks
Development of Hybrid LLM-ASR Methodology for Improving Process of Kazakh Language Learning
2026 · Dinmukhammed Zhassulanov, A. Serek, Aisaule Bazarkulova · 0 citations · International Conference on Electronics, Computer and Computation
This paper suggests a structure that can guide the enhancement of the Kazakh language learning process by using the large language models (LLMs) and automatic speech recognition (ASR). To fill the gap of high-quality speech materials on the Kazakh language, a …
Language models / LLM · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks
LLM-Assisted Weak Supervision for Low-Resource Kazakh Sequence Labeling: Synthetic Annotation and CRF-Refined NER/POS Models
2026 · A. Aitim · 1 citation · Applied Sciences
Kazakh sequence labeling is constrained by limited annotated resources, while its agglutinative morphology and productive suffixation increase data sparsity and exacerbate label inconsistency in part-of-speech (POS) tagging and named entity recognition (NER).…
Tokenization · Morphology / segmentation · Language models / LLM · Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Layer-Wise Probing of Paralinguistic Attributes in Fine-Tuned Whisper for Kazakh Speech
2026 · Aimoldir Aldabergen, B. Kynabay, S. Kadyrov · 0 citations · Engineering, Technology & Applied Science Research
Large pre-trained speech models similar to Whisper are now commonly used for speech recognition and related tasks. The distribution of paralinguistic features, which include emotions and speaker characteristics across model layers, remains uncertain, particul…
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks · Embeddings
An Empirical Comparison of Cascade and Direct End-to-End Speech Translation for Low-Resource Language Pair
2026 · Zhanibek Kozhirbayev · 0 citations · Computers
Speech-to-text translation (S2TT) for low-resource languages remains challenging due to the scarcity of parallel speech translation data and the susceptibility of modular pipelines to error propagation. This paper presents a controlled empirical comparison of…
Morphology / segmentation · Machine translation · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Unifying Kazakh proper names in English-language texts: the role of translation technologies and translator training
2026 · D. Popič · 0 citations · Bulletin of L.N. Gumilyov Eurasian National University. PHILOLOGY Series
The rapid expansion of translation technologies has transformed both professional translation practice and translator education. In languages with developing digital infrastructures, such as Kazakh, the integration of machine translation and computer-assisted…
Machine translation · NER / extraction · Embeddings
Multi-lingual meeting minutes-taking system: design and implementation
2026 · B. Kumalakov, A. Mazhitova · 0 citations · Bulletin of the National Engineering Academy of the Republic of Kazakhstan
This study examines the challenge of automatically transcribing multilingual institutional speech in Kazakhstan, where speakers frequently switch between Kazakh, Russian, and English. While modern automatic speech recognition (ASR) systems achieve high acc…
Tokenization · Morphology / segmentation · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
KazByte: Adapting Qwen models to Kazakh via Byte-level Adapter
2026 · Rauan Akylzhanov · arXiv
Large language models fragment Kazakh text into many more tokens than equivalent English text, because their tokenizers were built for high-resource languages. This tokenizer tax inflates compute, shortens the effective context window, and weakens the model's…
Tokenization · Morphology / segmentation · Language models / LLM · Datasets / corpora · Evaluation / benchmarks
Left Behind: Cross-Lingual Transfer as a Bridge for Low-Resource Languages in Large Language Models
2026 · Abdul-Salem Beibitkhan · arXiv
We investigate how large language models perform on low-resource languages by benchmarking eight LLMs across five experimental conditions in English, Kazakh, and Mongolian. Using 50 hand-crafted questions spanning factual, reasoning, technical, and culturally…
Language models / LLM · Datasets / corpora · Evaluation / benchmarks
SozKZ: Training Efficient Small Language Models for Kazakh from Scratch
2026 · Saken Tukenov · arXiv
Kazakh, a Turkic language spoken by over 22 million people, remains underserved by existing multilingual language models, which allocate minimal capacity to low-resource languages and employ tokenizers ill-suited to agglutinative morphology. We present SozKZ,…
Tokenization · Morphology / segmentation · Language models / LLM · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
Speech-to-Sign Gesture Translation for Kazakh: Dataset and Sign Gesture Translation System
2026 · Akdaulet Mnuarbek, A. Bekarystankyzy, M. Turdalyuly et al. · 0 citations · Computers
This paper presents the first prototype of a speech-to-sign language translation system for Kazakh Sign Language (KRSL). The proposed pipeline integrates the NVIDIA FastConformer model for automatic speech recognition (ASR) in the Kazakh language and addresse…
Machine translation · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Cross-Lingual Transfer and Parameter-Efficient Adaptation in the Turkic Language Family: A Theoretical Framework for Low-Resource Language Models
2026 · O. Ibrahimzade, K. Tabasaransky · arXiv
Large language models (LLMs) have transformed natural language processing, yet their capabilities remain uneven across languages. Most multilingual models are trained primarily on high-resource languages, leaving many languages with large speaker populations …
Morphology / segmentation · Language models / LLM · Datasets / corpora · Evaluation / benchmarks · Embeddings
Enhancing Post-Editing of Kazakh Translations Using Fine-Tuned Large Language Models
2026 · A. Bekarystankyzy, D. Rakhimova, Aliya Zhiger et al. · 0 citations · Algorithms
Machine translation for low-resource languages such as Kazakh remains a complex task due to the scarcity of training data, intricate morphological structures, and culturally specific linguistic characteristics. This study presents the first extensive explorat…
Morphology / segmentation · Language models / LLM · Machine translation · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
Using Songs to Improve Kazakh Automatic Speech Recognition
2026 · Rustem Yeshpanov · 0 citations · arXiv
Developing automatic speech recognition (ASR) systems for low-resource languages is hindered by the scarcity of transcribed corpora. This proof-of-concept study explores songs as an unconventional yet promising data source for Kazakh ASR. We curate a dataset …
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR
2026 · Henry Gagnier, Sophie Gagnier, Ashwin Kirubakaran · arXiv
Kazakh is a Turkic language using the Arabic, Cyrillic, and Latin scripts, making it unique in terms of optical character recognition (OCR). Work on OCR for low-resource Kazakh scripts is very scarce, and no OCR benchmarks or images exist for the Arabic and L…
Language models / LLM · Datasets / corpora · Evaluation / benchmarks
Pedagogical features of using artificial intelligence programs in teaching the kazakh language
2026 · A. A. Mukhametkali, A. Z. Mazinova · 0 citations · Eurasian Journal of Current Research in Psychology and Pedagogy
The article examines the pedagogical features of integrating artificial intelligence (AI) technologies into the process of teaching Kazakh at higher education institutions. Artificial intelligence is viewed as a tool for enhancing the effectiveness of languag…
Speech (ASR / TTS) · NER / extraction · Datasets / corpora
No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data
2026 · Dmitry Karpov · 1 citation · arXiv
We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bash…
Machine translation · Datasets / corpora
An Integrated Approach to Adapting Open-Source AI Models for Machine Translation of Low-Resource Turkic Languages
2026 · U. Tukeyev, A. Shormakova, Aidana Karibayeva et al. · 2 citations · Computers
This study presents the application of free, open-source artificial intelligence (AI) techniques to advance machine translation for low-resource Turkic languages such as Kazakh, Azerbaijani, Kyrgyz, Turkish, Turkmen, and Uzbek. This machine translation proble…
Morphology / segmentation · Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks
Synthetic data generation for Kazakh speech separation and diarization based on the use of neural networks
2025 · D. Oralbekova, Orken J. Mamyrbayev, L. Azarova et al. · 0 citations · Symposium on Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments (WILGA)
This paper explores the impact of various synthetic data generation methods on the performance of speech separation and diarization models. Three approaches are considered: simple audio track overlay, synthetic dialogue generation, and acoustic condition mode…
Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Development and increase of noise immunity of a model of biometric identification of a speaker based on metal-frequency cepstral coefficients and a convolutional neural network
2025 · M. Khizirova, K. Chezhimbayeva, Abdurazak Kassimov et al. · 0 citations · Eastern-European Journal of Enterprise Technologies
This study is focused on improving the noise robustness of a biometric speaker identification system based on mel-frequency cepstral coefficients (MFCC) and a convolutional neural network (CNN). The object of analysis is the acoustic structure of the Kazakh l…
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
Natural Language Processing and Speech Technologies for Central Asian Turkic Languages: A Review of Current Methods, Resources, and Challenges
2025 · Palidan Muhetaer · 0 citations · Actual Problems of the Present
This article provides a comprehensive review of contemporary research in the field of natural language processing (NLP) and speech technologies for Central Asian Turkic languages, including Kazakh, Kyrgyz, and Uzbek. Although a number of theoretical and appli…
Tokenization · Morphology / segmentation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment · Embeddings
COMPARATIVE ANALYSIS OF LOCAL AND CLOUD-BASED SPEECH RECOGNITION MODELS FOR THE KAZAKH LANGUAGE
2025 · М.Г. Оспанов, К.С. Мауленов, А.Т. Байманкулов · 0 citations · Bulletin of D. Serikbayev EKTU
Developing automatic speech recognition systems for Kazakh remains a pressing task given limited language resources and the high morphological complexity of agglutinative languages. The aim of the study is a comparative analysis of local and …
Speech (ASR / TTS)
APPLICATION OF NON-AUTOREGRESSIVE DECODING FOR KAZAKH SPEECH RECOGNITION
2025 · D. Oralbekova, O. Mamyrbayev, A. Yerimbetova et al. · 0 citations · Herald of Kazakh-British technical university
In the field of speech recognition, end-to-end models are gradually replacing traditional and hybrid approaches. Their main principle is autoregressive decoding, where the output sequence is formed from left to right. However, it has not yet been proven that …
Morphology / segmentation · Speech (ASR / TTS) · NER / extraction · Classification / sentiment
DEVELOPMENT OF A MODEL FOR REAL-TIME RECOGNITION OF KAZAKH SIGN LANGUAGE USING MEDIAPIPE AND DEEP LEARNING METHODS
2025 · A. Yerimbetova, U. Berzhanova, E. Daiyrbayeva et al. · 0 citations · Herald of Kazakh-British technical university
This article discusses the process of developing a Kazakh sign language recognition system using the MediaPipe platform. The platform allows for efficient real-time gesture recognition. The main focus is on creating models for gesture recognition, training ne…
Speech (ASR / TTS) · Datasets / corpora
DEVELOPMENT OF A MODEL FOR REAL-TIME RECOGNITION OF KAZAKH SIGN LANGUAGE USING MEDIAPIPE AND DEEP LEARNING METHODS
2025 · N. Amangeldy, A. Yerimbetova, N. Gazizova et al. · 0 citations · Herald of Kazakh-British technical university
Technologies for automatic processing of sign language have become an urgent need for members of society with hearing and speech impairments who face inequality in the era of digital transformation. In recent years, the issue of considering sign language as a…
Morphology / segmentation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Embeddings
Artificial Intelligence and the Scientific Development of the Kazakh Language: Corpus, Terminology, and Content Automation
2025 · F. Orazbayeva, A. Ryskulova · 0 citations · Iasaýı ýnıversıtetіnіń habarshysy
This article provides a comprehensive analysis of effective strategies for enhancing the scientific and theoretical development of the Kazakh language through the integration of artificial intelligence technologies and linguistic corpora. The primary aim of t…
Morphology / segmentation · Speech (ASR / TTS) · Datasets / corpora
Low-Resource Speech Recognition by Fine-Tuning Whisper with Optuna-LoRA
2025 · Huan Wang, Jie Bin, Chunyan Gou et al. · 2 citations · Applied Sciences
In low-resource speech recognition, the performance of the Whisper model is often limited by the size of the available training data. To address this challenge, this paper proposes a training optimization method for the Whisper model that integrates Low-Rank …
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
The Development and Experimental Evaluation of a Multilingual Speech Corpus for Low-Resource Turkic Languages
2025 · Aidana Karibayeva, V. Karyukin, U. Tukeyev et al. · 1 citation · Applied Sciences
The development of parallel audio corpora for Turkic languages, such as Kazakh, Uzbek, and Tatar, remains a significant challenge in the development of multilingual speech synthesis, recognition systems, and machine translation. These languages are low-resour…
Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks
TRANSLATING THE UNTRANSLATABLE IN KAZAKH FAIRY TALES: HUMAN vs AI
2025 · B. Mizamkhan, G. Dyussembina, A. Yezmakhunova · 0 citations · Вестник "Филологичекие науки"
While the roles of human translators and machines have shifted with the development of such engines as Google Translate and DeepL, the issue of translating the untranslatable or culture-specific items from Kazakh into English still remain challenging. The lac…
Language models / LLM · Machine translation · NER / extraction · Datasets / corpora · Embeddings
A Kazakh language Dataset of Lip Movements for Command Recognition
2025 · Batyr Kenzheakhmetov, Alissultan Amankos, B. Amirgaliyev et al. · 0 citations · Scientific Data
Lip reading systems determine the content of speech based on the visual tracking of lips of the speaker and therefore serve to offer communicative substitutes when acoustic information is not available in the environment. The training of strong lip reading mo…
Speech (ASR / TTS) · Datasets / corpora
Beyond Ranked Lists: The SARAL Framework for Cross-Lingual Document Set Retrieval
2025 · Shantanu Agarwal, Joel Barry, Elizabeth Boschee et al. · arXiv
Machine Translation for English Retrieval of Information in Any Language (MATERIAL) is an IARPA initiative targeted to advance the state of cross-lingual information retrieval (CLIR). This report provides a detailed description of Information Sciences Institu…
Machine translation · Evaluation / benchmarks
MULTILINGUAL AUTOMATIC SPEECH RECOGNITION INTERFACE FOR TYPING: USABILITY STUDY AND PERFORMANCE EVALUATION FOR KAZAKH, RUSSIAN, AND ENGLISH
2025 · Z. Makhataeva, Nursultan Atymtay, Rakhat Meiramov et al. · 0 citations · Scientific Journal of Astana IT University
We present a multilingual automatic speech recognition (ASR) system for Kazakh, Russian, and English designed for the trilingual community of Kazakhstan. Although prior research has shown that speech-based text entry can outperform conventional keyboard typin…
Language models / LLM · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Multilingual Speech Command Recognition with Language Identification
2025 · Artur Muratov, Askat Kuzdeuov, H. A. Varol · 1 citation · Annual Conference of the IEEE Industrial Electronics Society
Multilingual Speech Command Recognition (SCR) facilitates voice interaction in environments where multiple languages are used interchangeably, a common characteristic of multilingual regions. In such settings, SCR and language identification (LID) are handled…
Language models / LLM · Speech (ASR / TTS) · NER / extraction
Speech Recognition and Synthesis Models and Platforms for the Kazakh Language
2025 · Aidana Karibayeva, V. Karyukin, B. Abduali et al. · 2 citations · Inf.
With the rapid development of artificial intelligence and machine learning technologies, automatic speech recognition (ASR) and text-to-speech (TTS) have become key components of the digital transformation of society. The Kazakh language, as a representative …
Morphology / segmentation · Language models / LLM · Speech (ASR / TTS) · Datasets / corpora
Improving post-editing of Kazakh translations with fine-tuned large language models: Dataset and evaluation
2025 · D. Rakhimova, Aliya Zhiger, Madina Mansurova et al. · 0 citations · International journal of innovative research and scientific studies
Machine translation for low-resource languages like Kazakh faces significant challenges due to limited training data, complex morphology, and cultural-linguistic nuances. This paper presents the first comprehensive study on fine-tuning large language models f…
Morphology / segmentation · Language models / LLM · Machine translation · Datasets / corpora · Evaluation / benchmarks
Improved Kazakh Named Entity Transcription Using Synthetic Speech
2025 · Rakhat Meiramov, H. A. Varol · 0 citations · 2025 10th International Conference on Computer Science and Engineering (UBMK)
Named entity transcription remains a challenge for low-resource languages such as Kazakh, often causing errors in proper nouns, locations, and other named entities. This paper presents a fine-tuning approach using OpenAI’s Whisper model, KazakhTTS2 and KazEmo…
Language models / LLM · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks
Fine-Tuning Neural ASR Models for Low-Resource Kazakh Children’s Speech with Preprocessing Enhancements
2025 · Zhansaya Duisenbekkyzy, D. Rakhimova, A. Yessirkepova et al. · 0 citations · 2025 10th International Conference on Computer Science and Engineering (UBMK)
This article presents a comprehensive study of automatic speech recognition (ASR) for Kazakh children's speech, with a particular focus on the impact of signal preprocessing techniques in a low-resource linguistic environment. Although commercial ASR systems …
Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Restoring Punctuation and Capitalization in Kazakh: A BERT-Based Approach for Text Normalization
2025 · Sanzhar Umbet, Zhanibek Kozhirbayev · 1 citation · 2025 10th International Conference on Computer Science and Engineering (UBMK)
This paper introduces a punctuation and capitalization (PC) restoration model for Kazakh, developed using the bert-base-multilingual-uncased model within NVIDIA’s NeMo framework. The model was trained on a curated dataset of preprocessed Kazakh Wikipedia arti…
Speech (ASR / TTS) · Datasets / corpora
A Multimodal Framework for Speech Emotion Recognition in Low-Resource Languages
2025 · Mamyr Altaibek, Altanbek Zulkazhav, B. Yergesh et al. · 1 citation · Journal of Artificial Intelligence and Technology
Speech emotion recognition (SER) plays a crucial role in enhancing human–computer interaction by identifying emotional states in speech. However, low-resource languages like Kazakh face challenges due to limited datasets and linguistic tools. To address this …
Speech (ASR / TTS) · Datasets / corpora · Classification / sentiment · Embeddings
Deploying Multilingual ASR in Digital Twin Systems: A Performance and Efficiency Analysis of Whisper and SeamlessM4T
2025 · B. Amirkhanov, G. Tyulepberdinova, G. Amirkhanova et al. · 1 citation · 2025 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA)
This study presents a systematic comparison of two leading automatic speech recognition (ASR) model families—OpenAI's Whisper and Meta's SeamlessM4T—across three typologically diverse languages: English (Germanic), Russian (Slavic), and Kazakh (Turkic). Motiv…
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Fine-Tuning Methods and Dataset Structures for Multilingual Neural Machine Translation: A Kazakh–English–Russian Case Study in the IT Domain
2025 · Zhanibek Kozhirbayev, Zhandos Yessenbayev · 1 citation · Electronics
This study explores fine-tuning methods and dataset structures for multilingual neural machine translation using the No Language Left Behind model, with a case study on Kazakh, English, and Russian. We compare single-stage and two-stage fine-tuning approaches…
Machine translation · Datasets / corpora · Evaluation / benchmarks
Technology of Distinguishing Homonyms Using Probabilistic and Statistical Methods in Compiling a Frequency Dictionary of Texts of the National Corpus of the Kazakh Language
2025 · Y. Bessirov, A. Zhanabekova · 0 citations · Tiltanym
The article presents a comprehensive overview of a probabilistic-statistical method for distinguishing homonyms in the process of compiling a frequency dictionary based on the Kazakh National Corpus. The research focuses on words that share identical written …
Morphology / segmentation · Speech (ASR / TTS) · Datasets / corpora · Classification / sentiment
NEURAL MACHINE TRANSLATION FOR ENGLISH-KAZAKH LANGUAGE PAIR
2025 · D. Rakhimova, A. Zhiger, V. Malykh et al. · 0 citations · Herald of Kazakh-British technical university
Currently, information technology is rapidly developing and one of its branches can be called machine translation. The use of machine translation in the process of understanding each other by people from different countries is increasing every year. At the mo…
Machine translation · Datasets / corpora · Evaluation / benchmarks
FEATURES OF USING EXTENDED FORMS OF THE TRANSFORMER MODEL IN KAZAKH SPEECH RECOGNITION
2025 · Kurmetkan Turdybek, Orken J. Mamyrbayev · 0 citations · Bulletin of D. Serikbayev EKTU
In recent years, speech recognition technologies have significantly advanced due to artificial intelligence and machine learning methods. These technologies enable the automatic understanding of human speech and its conversion into text. The field of speech r…
Morphology / segmentation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora
AI-Based Offline Speech Recognition for Kazakh, Russian and English Languages
2025 · Nursultan Nyssanov, Zuleikha Syzdykova, K. Niyazaliyev et al. · 0 citations · Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing
This paper presents a novel fully autonomous multilingual audio transcription system tailored to Kazakh, Russian, and English. The proposed solution integrates a language detection module based on SpeechBrain with a transcription engine using Vosk, and employ…
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
A SYSTEMATIC REVIEW ON TRANSLATING KAZAKH SIGN LANGUAGE TO SPEECH USING DYNAMIC GESTURE RECOGNITION
2025 · A. Aitim, Dariga Sattarkhuzhayeva, Aisulu Khairullayeva · 0 citations · Вестник КазАТК
This systematic review explores the importance of the Kazakh Sign Language (KSL) as the main method of communication for deaf and hard-of-hearing people in Kazakhstan. Despite its critical role, the automatic translation of KSL into spoken language remains un…
Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Development and Evaluation of a Small Kazakh Language Corpus to Improve the Efficiency of Multilingual NLP Systems in Low-Resource Environments
2025 · Arailym Tleubayeva, Sultan Aubakirov, Aisultan Tabuldin et al. · 0 citations · 2025 IEEE 5th International Conference on Smart Information Systems and Technologies (SIST)
This study tackles NLP challenges in low-resource settings by developing the Small Kazakh Language Corpus—a high-quality, annotated collection of Kazakh texts sourced from news, scientific publications, and Wikipedia. The corpus was used to fine-tune two mode…
Language models / LLM · Machine translation · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
Development of a Translator for Kazakh Sign Language to Speech Using Gesture Recognition
2025 · A. Aitim, Dariga Sattarkhuzhayeva, Aisulu Khairullayeva · 0 citations · 2025 IEEE 5th International Conference on Smart Information Systems and Technologies (SIST)
This paper presents the development of "DauysYm," an AI-powered translator that converts Kazakh Sign Language (KSL) into spoken language using dynamic gesture recognition. The project addresses communication barriers faced by people with hearing impairments i…
Machine translation · Speech (ASR / TTS) · Datasets / corpora
SEMANTIC ROLE LABELING FOR KAZAKH: MODELS AND DATASETS
2025 · A. Aitim · 0 · Bulletin of Abai KazNPU. Series of Physical and Mathematical sciences
A fundamental component of natural language understanding, semantic role labeling (SRL) clarifies the relationship between predicates and their arguments, therefore enabling activities including information extraction, machine translation, and question answer…
Morphology / segmentation · Machine translation · NER / extraction · Datasets / corpora · Embeddings
Multilingual Speech Command Recognition for Voice Controlled Robots and Smart Systems
2025 · Askat Kuzdeuov, H. A. Varol · 1 · 2025 11th International Conference on Control, Automation and Robotics (ICCAR)
Speech Command Recognition (SCR) has many applications in smart home systems, voice-controlled robots, and voice assistants. The modern SCR systems employ deep learning models trained on speech command datasets. Nowadays, the Google Speech Commands (GSC) data…
Language models / LLM · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
From Characters to Subwords: Modeling Unit Conversion for Low-resource Speech Recognition
2025 · Yizhi Wang, Haofei Zhang, Huiqiong Wang et al. · 0 · IEEE International Conference on Acoustics, Speech, and Signal Processing
Multilingual automatic speech recognition (ASR) models greatly facilitate recognizing low-resource languages by sharing representations across similar languages. However, the commonly adopted modeling units, e.g., character-level modeling, lack language-speci…
Tokenization · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Embeddings
Do Chinese models speak Chinese languages?
2025 · Andrea W Wen-Yi, Unso Eun Seo Jo, David Mimno · arXiv
The release of top-performing open-weight LLMs has cemented China's role as a leading force in AI development. Do these models support languages spoken in China? Or do they support the same languages as models developed in the United States or in Europe? Comp…
Language models / LLM · Datasets / corpora · Evaluation / benchmarks
Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair
2025 · Maksim Borisov, Zhanibek Kozhirbayev, Valentin Malykh · 5 · arXiv
Machine translation for low resource language pairs is a challenging task. This task could become extremely difficult once a speaker uses code switching. We propose a method to build a machine translation model for code-switched Kazakh-Russian language pair w…
Machine translation · NER / extraction · Datasets / corpora · Evaluation / benchmarks
APPLICATION OF BLEU AND SARI METRICS IN EVALUATING SIMPLIFIED TEXTS IN KAZAKH: ANALYSIS AND EFFECTIVENESS
2025 · S. T. Nursapa, I. Ualiyeva · 0 · Herald of Kazakh-British technical university
This article explores the methodology for evaluating the quality of simplified texts in Kazakh using BLEU and SARI metrics. Text simplification is an important aspect for ensuring information accessibility and facilitating the learning process in Kazakh langu…
Machine translation · Evaluation / benchmarks
Creating a Parallel Corpus for the Kazakh Sign Language and Learning
2025 · A. Yerimbetova, B. Sakenov, M. Sambetbayeva et al. · 6 · Applied Sciences
Kazakh Sign Language (KSL) is a crucial communication tool for individuals with hearing and speech impairments. Deep learning, particularly Transformer models, offers a promising approach to improving accessibility in education and communication. This study a…
Machine translation · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting
2025 · Fajri Koto, Rituraj Joshi, Nurdaulet Mukhituly et al. · arXiv
Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Ad…
Tokenization · Language models / LLM · NER / extraction · Datasets / corpora · Evaluation / benchmarks
MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages
2025 · Chen Zhang, Mingxu Tao, Zhiyuan Liao et al. · arXiv
Large language models (LLMs) excel in high-resource languages but struggle with low-resource languages (LRLs), particularly those spoken by minority communities in China, such as Tibetan, Uyghur, Kazakh, and Mongolian. To systematically track the progress in …
Language models / LLM · Datasets / corpora · Evaluation / benchmarks
Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh
2025 · Nurkhan Laiyk, Daniil Orel, Rituraj Joshi et al. · arXiv
Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, cov…
Language models / LLM · NER / extraction · Datasets / corpora
Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts
2025 · Maiya Goloburda, Nurkhan Laiyk, Diana Turmakhan et al. · arXiv
Large language models (LLMs) are known to have the potential to generate harmful content, posing risks to users. While significant progress has been made in developing taxonomies for LLM risks and safety evaluation prompts, most studies have focused on monoli…
Language models / LLM · NER / extraction · Datasets / corpora · Evaluation / benchmarks
KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan
2025 · Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan et al. · arXiv
Despite having a population of twenty million, Kazakhstan's culture and language remain underrepresented in the field of natural language processing. Although large language models (LLMs) continue to advance worldwide, progress in Kazakh language has been lim…
Language models / LLM · Datasets / corpora · Evaluation / benchmarks
TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages
2025 · Jafar Isbarov, Arofat Akhundjanova, Mammad Hajili et al. · arXiv
Being able to thoroughly assess massive multi-task language understanding (MMLU) capabilities is essential for advancing the applicability of multilingual language models. However, preparing such benchmarks in high quality native language is often costly and …
Language models / LLM · Machine translation · Datasets / corpora · Evaluation / benchmarks
Design of QazSL Sign Language Recognition System for Physically Impaired Individuals
2025 · L. Zholshiyeva, T. Zhukabayeva, D. Baumuratova et al. · 13 · Journal of Robotics and Control (JRC)
Automating real-time sign language translation through deep learning and machine learning techniques can greatly enhance communication between the deaf community and the wider public. This research investigates how these technologies can change the way indivi…
Machine translation · Speech (ASR / TTS) · Datasets / corpora
Analysis of the use of the hiformer model for kazakh speech recognition
2024 · O. Mamyrbayev, T. Kurmetkan · 0 · Bulletin of the National Engineering Academy of the Republic of Kazakhstan
This article presents an overview of automatic speech recognition (ASR) technologies and describes the use of an advanced version of the Transformer model, the Hiformer model, in Kazakh speech recognition. A literature review of Kazakh speech recognition syst…
Speech (ASR / TTS)
Machine Learning Methods for Kazakh Morphology: A Comprehensive Overview
2024 · I. Akhmetov, S. Aubakirov, T. Saparov et al. · 2 · 2024 IEEE 3rd International Conference on Problems of Informatics, Electronics and Radio Engineering (PIERE)
Kazakh is an agglutinative language, where the sequential attachment of morphemes forms words, each bearing specific grammatical information. The complexity of this morphological structure presents significant challenges for computational linguistics, particu…
Morphology / segmentation · Machine translation · Speech (ASR / TTS) · Datasets / corpora · Embeddings
KAZAKH SPEECH AND RECOGNITION METHODS: ERROR ANALYSIS AND IMPROVEMENT PROSPECTS
2024 · Yerlan Karabaliyev, Kateryna Kolesnikova · 3 · Scientific Journal of Astana IT University
This study offers a detailed evaluation of automatic speech recognition (ASR) systems for the Kazakh, examining their performance in recognizing the phonetic and linguistic features unique to the language. The Kazakh language presents specific challenges for …
Morphology / segmentation · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Machine Learning-Based Synonymous Word Detection in Kazakh
2024 · D. Rakhimova, Madina Mansurova, Nurakhmet Matanov et al. · 0 · 2024 9th International Conference on Computer Science and Engineering (UBMK)
This paper considers the problem of synonymous word detection in Kazakh using machine learning methods. The problem of automatic synonym detection is key to natural language processing tasks such as machine translation, context search, and semantic text analy…
Machine translation · Classification / sentiment
Aligning Sentences for Kazakh-Turkish Parallel Corpora
2024 · D. Rakhimova, E. Adalı, Aidana Karibayeva · 0 · 2024 9th International Conference on Computer Science and Engineering (UBMK)
This paper presents a hybrid approach to sentence alignment for the Kazakh-Turkish parallel corpus, addressing the challenges posed by linguistic and structural differences between the two languages. The system is divided into two modules: the Hunaling module…
Machine translation · Datasets / corpora
Named Entity Recognition from Kazakh Speech
2024 · Bauyrzhan Kairatuly, Madina Mansurova · 0 · 2024 9th International Conference on Computer Science and Engineering (UBMK)
This paper addresses the challenges of Named Entity Recognition (NER) in Kazakh speech, a critical task in Natural Language Processing (NLP). The integration of Automatic Speech Recognition (ASR) and NER technologies is explored to improve recognition accurac…
Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Research on Low-Resource Neural Machine Translation Methods Based on Explicit Sparse Attention and Pre-trained Language Model
2024 · Tao Ning, Xin Xie, Jin Zhang · 0 · International Conference on Computer Science and Network Technology
The development of machine translation is driven by the need for global communication through the automatic translation of words, sentences, and texts from one language to another. This paper proposes an improved Transformer-based model to enhance the perform…
Language models / LLM · Machine translation · NER / extraction · Datasets / corpora
DEVELOPMENT OF METHODS AND ALGORITHMS TO BUILD ASPEAKER VERIFICATION IN KAZAKH LANGUAGE
2024 · S. Rashid, D. Kuanyshbay, A. Nurkey · 0 · Suleyman Demirel University Bulletin Natural and Technical Sciences
Speaker verification interfaces are gaining more and more popularity in both academic and commercial industries. It's connected with the latest advances in this area, which can be seen firsthand in our daily life: voice interfaces in computers, robots, cell p…
Speech (ASR / TTS)
Application of the conformer model for kazakh speech recognition
2024 · O. Mamyrbayev, T. Kurmetkan, R. Arslan · 0 · Bulletin of the National Engineering Academy of the Republic of Kazakhstan
The article describes the application of a transformer-based conformer model and a convolutional neural network (CNN) in Kazakh speech recognition, with an overview of automatic speech recognition (ASR) technologies. By exploring ways to combine convolutional…
Speech (ASR / TTS)
Research and processing of the kazakh children’s acoustic corpus
2024 · D. Rakhimova, Zh. Duisenbekkyzy, E. Adali et al. · 1 · Bulletin of the National Engineering Academy of the Republic of Kazakhstan
Recent advancements in speech recognition technology have significantly enhanced accessibility and functionality across various sectors. Nonetheless, the task of recognizing kid’s speech presents considerable challenges. Children from different age groups exh…
Tokenization · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
A SYSTEMATIC REVIEW OF EXISTING TOOLS TO AUTOMATED PROCESSING SYSTEMS FOR KAZAKH LANGUAGE
2024 · A. Aitim, R. Satybaldiyeva · 8 · BULLETIN Series of Physics & Mathematical Sciences
The development of automated systems for the Kazakh language has gained significant momentum in recent years, driven by the growing need for natural language processing (NLP) tools tailored to underrepresented languages. This systematic review aims to critica…
Tokenization · Morphology / segmentation · Machine translation · Speech (ASR / TTS) · Datasets / corpora
AI-Based IVR
2024 · Gassyrbek Kosherbay, Nurgissa Apbaz · arXiv
The use of traditional IVR (Interactive Voice Response) methods often proves insufficient to meet customer needs. This article examines the application of artificial intelligence (AI) technologies to enhance the efficiency of IVR systems in call centers. A pr…
Language models / LLM · Speech (ASR / TTS) · Datasets / corpora · Classification / sentiment
Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text
2024 · Jinpeng Li, Yu Pu, Qi Sun et al. · 5 · arXiv
Whisper and other large-scale automatic speech recognition models have made significant progress in performance. However, their performance on many low-resource languages, such as Kazakh, is not satisfactory. It is worth researching how to utilize low-cost da…
Tokenization · Language models / LLM · Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Recent Advancements and Challenges of Turkic Central Asian Language Processing
2024 · Yana Veitsman, Mareike Hartmann · arXiv
Research in NLP for Central Asian Turkic languages - Kazakh, Uzbek, Kyrgyz, and Turkmen - faces typical low-resource language challenges like data scarcity, limited linguistic resources and technology development. However, recent advancements have included th…
Datasets / corpora
SCIENTIFIC ASPECTS OF MODERN APPROACHES TO MACHINE TRANSLATION FOR SIGN LANGUAGE
2024 · Dana Nurgazina, Saule Kudubayeva, A. Ismailov · 1 · Scientific Journal of Astana IT University
Scientific research in the field of automated sign language translation represents a crucial stage in the development of technologies supporting the hearing-impaired and deaf communities. This article presents a comprehensive approach to addressing semantic a…
Machine translation
The link between translation difficulty and the quality of machine translation: a literature review and empirical investigation
2024 · S. Araghi, A. Palangkaraya · 2 · Language Resources and Evaluation
We survey the relevant literature on translation difficulty and automatic evaluation of machine translation (MT) quality and investigate whether source text’s translation difficulty features contain any information about MT quality. We analyse the 2017–2019 C…
Machine translation · Evaluation / benchmarks
Data Augmentation Based Unsupervised Pre-Training for Low-resource Speech Recognition
2024 · Hong Luo, Xiao Xie, Penghua Li et al. · 2 · Chinese Control and Decision Conference
This paper proposes SpecWav2vec-F, a novel model built upon the Wav2vec 2.0 baseline. The model demonstrates enhanced effectiveness in low-resource speech recognition tasks by preserving relationships between different time steps in the latent speech space. I…
Speech (ASR / TTS) · Datasets / corpora
Review of Hierarchical Transfer Learning Architecture in Low-Resource Machine Translation
2024 · Bilge Kagan Yazar, E. Kılıç · 0 · Signal Processing and Communications Applications Conference
Machine translation is a field of study that has attracted significant attention in recent years. The success of a model built on a language pair depends mainly on the number of parallel sentences between languages. Unlike high-resource languages, low-resourc…
Machine translation · Datasets / corpora
Integrated End-to-End Automatic Speech Recognition for Languages for Agglutinative Languages
2024 · A. Bekarystankyzy, O. Mamyrbayev, Tolganay Anarbekova · 6 · ACM Trans. Asian Low Resour. Lang. Inf. Process.
The relevance of the problem of automatic speech recognition lies in the lack of research for low-resource languages, stemming from limited training data and the necessity for new technologies to enhance efficiency and performance. The purpose of this work wa…
Morphology / segmentation · Language models / LLM · Speech (ASR / TTS) · Datasets / corpora · Classification / sentiment
DEVELOPING METHODS FOR AUTOMATIC PROCESSING SYSTEMS OF KAZAKH LANGUAGE
2024 · A. Aitim · 10 · Вестник КазАТК
The linguistic diversity of the Kazakh language poses unique challenges and opportunities for the development of automatic processing systems. This research explores a comprehensive array of methodologies employed in advancing automatic processing systems tai…
Morphology / segmentation · Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Classification / sentiment
Research on the Construction of Low-Resource Parallel Corpus Based on Translation Plug-in Technology
2024 · Zulkar Iskander, Azragul Yusup · 0 · International Journal of Educational Curriculum Management and Research
Parallel corpora play a crucial role in the field of natural language processing, especially in tasks such as machine translation and cross-language information retrieval. However, with the increasing demand for more languages, the challenges are becoming i…
Machine translation · Datasets / corpora
KazQAD: Kazakh Open-Domain Question Answering Dataset
2024 · Rustem Yeshpanov, Pavel Efimov, Leonid Boytsov et al. · 14 · arXiv
We introduce KazQAD -- a Kazakh open-domain question answering (ODQA) dataset -- that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with ex…
Language models / LLM · Machine translation · Datasets / corpora
Integration AI Techniques in Low-Resource Language: The Case of Kazakh Language
2024 · Asmaganbetova Kamshat, Ulanbek Auyeskhan, Nurzhanova Zarina et al. · 3 · 2024 IEEE AITU: Digital Generation
Sentiment analysis is an extensively explored domain within natural language processing (NLP); nevertheless, a significant emphasis has been placed on languages possessing ample resources, such as English. This paper delves into the transformative capabilitie…
Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
KazEmoTTS: A Dataset for Kazakh Emotional Text-to-Speech Synthesis
2024 · Adal Abilbekov, Saida Mussakhojayeva, Rustem Yeshpanov et al. · arXiv
This study focuses on the creation of the KazEmoTTS dataset, designed for emotional Kazakh text-to-speech (TTS) applications. KazEmoTTS is a collection of 54,760 audio-text pairs, with a total duration of 74.85 hours, featuring 34.23 hours delivered by a fema…
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
TEXT GENERATION MODELS FOR PARAPHRASE ON KAZAKH LANGUAGE
2024 · A. Kassenkhan, N. Mukazhanov, S. Nuralykyzy et al. · 0 · КазУТБ
This study delves into the relatively unexplored domain of natural language processing for the Kazakh language—a language with limited computational resources. The paper dissects the effectiveness of diffusion models and transformers in generating text, speci…
Tokenization · Morphology / segmentation · Machine translation · NER / extraction · Datasets / corpora
KazParC: Kazakh Parallel Corpus for Machine Translation
2024 · Rustem Yeshpanov, Alina Polonskaya, Huseyin Atakan Varol · 16 · arXiv
We introduce KazParC, a parallel corpus designed for machine translation across Kazakh, English, Russian, and Turkish. The first and largest publicly available corpus of its kind, KazParC contains a collection of 371,902 parallel sentences covering different …
Machine translation · Datasets / corpora · Evaluation / benchmarks
KazSAnDRA: Kazakh Sentiment Analysis Dataset of Reviews and Attitudes
2024 · Rustem Yeshpanov, Huseyin Atakan Varol · arXiv
This paper presents KazSAnDRA, a dataset developed for Kazakh sentiment analysis that is the first and largest publicly available dataset of its kind. KazSAnDRA comprises an extensive collection of 180,064 reviews obtained from various sources and includes nu…
Datasets / corpora · Evaluation / benchmarks · Classification / sentiment · Embeddings
Parallel texts dataset for Uzbek-Kazakh machine translation
2024 · B. Allaberdiev, G. Matlatipov, Elmurod Kuriyozov et al. · 12 · Data in Brief
This paper presents a parallel corpus of raw texts between the Uzbek and Kazakh languages as a dataset for machine translation applications, focusing on the data collection process, dataset description, and its potential for reuse. The dataset-building proces…
Machine translation · NER / extraction · Datasets / corpora
The Task of Post-Editing Machine Translation for the Low-Resource Language
2024 · D. Rakhimova, Aidana Karibayeva, A. Turarbek · 12 · Applied Sciences
In recent years, machine translation has made significant advancements; however, its effectiveness can vary widely depending on the language pair. Languages with limited resources, such as Kazakh, Uzbek, Kalmyk, Tatar, and others, often encounter challenges i…
Morphology / segmentation · Machine translation · NER / extraction · Datasets / corpora · Evaluation / benchmarks
Neurorecognition visualization in multitask end-to-end speech
2023 · Orken J. Mamyrbayev, Sergii Pavlov, A. Bekarystankyzy et al. · 0 · Optical Fibers and Their Applications
Nowadays, speech-processing technologies with different language systems are successfully used in mobile and stationary devices. Kazakh is considered a low-resource language, which poses various challenges for conventional speech recognition methods. This pap…
Speech (ASR / TTS) · Datasets / corpora
An Empirical study of Unsupervised Neural Machine Translation: analyzing NMT output, model's behavior and sentences' contribution
2023 · Isidora Chara Tourni, Derry Wijaya · 0 · arXiv
Unsupervised Neural Machine Translation (UNMT) focuses on improving NMT results under the assumption there is no human translated parallel data, yet little work has been done so far in highlighting its advantages compared to supervised methods and analyzing i…
Machine translation · NER / extraction · Datasets / corpora · Embeddings
Speech Command Recognition: Text-to-Speech and Speech Corpus Scraping Are All You Need
2023 · Askat Kuzdeuov, Shakhizat Nurgaliyev, Diana Turmakhan et al. · 5 · 2023 3rd International Conference on Robotics, Automation and Artificial Intelligence (RAAI)
Speech Command Recognition (SCR) is rapidly gaining prominence due to its diverse applications, such as virtual assistants, smart homes, hands-free navigation, and voice-controlled industrial machinery. In this paper, we present a data-centric approach to cre…
Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks
Relevance-guided Neural Machine Translation
2023 · Isidora Chara Tourni, Derry Wijaya · 0 · arXiv
With the advent of the Transformer architecture, Neural Machine Translation (NMT) results have shown great improvement lately. However, results in low-resource conditions still lag behind in both bilingual and multilingual setups, due to the limited amount of…
Machine translation · Datasets / corpora
MC$^2$: Towards Transparent and Culturally-Aware NLP for Minority Languages in China
2023 · Chen Zhang, Mingxu Tao, Quzhe Huang et al. · arXiv
Current large language models demonstrate deficiencies in understanding low-resource languages, particularly the minority languages in China. This limitation stems from the scarcity of available pre-training data. To address this accessibility challenge, we p…
Language models / LLM · Datasets / corpora
Noise-Robust Automatic Speech Recognition for Industrial and Urban Environments
2023 · D. Orel, H. A. Varol · 4 · Annual Conference of the IEEE Industrial Electronics Society
Automatic Speech Recognition (ASR) models can achieve human parity, but their performance degrades significantly when used in noisy industrial and urban environments. In this paper, we present monolingual and multilingual ASR models, which can perform effecti…
Speech (ASR / TTS) · NER / extraction · Datasets / corpora
A Chinese–Kazakh Translation Method That Combines Data Augmentation and R-Drop Regularization
2023 · Cang-Rong Liu, Wushouer Silamu, Yanbing Li · 3 · Applied Sciences
Low-resource languages often face the problem of insufficient data, which leads to poor quality in machine translation. One approach to address this issue is data augmentation. Data augmentation involves creating new data by transforming existing data through…
Machine translation · NER / extraction · Datasets / corpora
Machine Translation Shortcomings and Teaching Translation
2023 · L. Mirzoyeva · 4 · Revista Romaneasca pentru Educatie Multidimensionala
Nowadays, machine translation is considered to be a frequently used tool to render various types of texts related to such different spheres as science, film industry, etc. Statement of the problem: currently, as the higher school system in Kazakhstan starts i…
Machine translation · NER / extraction
Cascade Speech Translation for the Kazakh Language
2023 · Zhanibek Kozhirbayev, T. Islamgozhayev · 11 · Applied Sciences
Speech translation systems have become indispensable in facilitating seamless communication across language barriers. This paper presents a cascade speech translation system tailored specifically for translating speech from the Kazakh language to Russian. The…
Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks
The Development of a Kazakh Speech Recognition Model Using a Convolutional Neural Network with Fixed Character Level Filters
2023 · N. Kadyrbek, Madina Mansurova, A. Shomanov et al. · 10 · Big Data and Cognitive Computing
This study is devoted to the transcription of human speech in the Kazakh language in dynamically changing conditions. It discusses key aspects related to the phonetic structure of the Kazakh language, technical considerations in collecting the transcribed aud…
Tokenization · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks · Embeddings
Continuous Sign Language Recognition and Its Translation into Intonation-Colored Speech
2023 · N. Amangeldy, Aru Ukenova, G. Bekmanova et al. · 25 · Italian National Conference on Sensors
This article is devoted to solving the problem of converting sign language into a consistent text with intonation markup for subsequent voice synthesis of sign phrases by speech with intonation. The paper proposes an improved method of continuous recognition …
Morphology / segmentation · Machine translation · Speech (ASR / TTS) · Evaluation / benchmarks
Strategies in Transfer Learning for Low-Resource Speech Synthesis: Phone Mapping, Features Input, and Source Language Selection
2023 · Phat Do, Matt Coler, Jelske Dijkstra et al. · 6 · arXiv
We compare using a PHOIBLE-based phone mapping method and using phonological features input in transfer learning for TTS in low-resource languages. We use diverse source languages (English, Finnish, Hindi, Japanese, and Russian) and target languages (Bulgaria…
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration
2023 · Rustem Yeshpanov, Saida Mussakhojayeva, Yerbolat Khassanov · arXiv
This work aims to build a multilingual text-to-speech (TTS) synthesis system for ten lower-resourced Turkic languages: Azerbaijani, Bashkir, Kazakh, Kyrgyz, Sakha, Tatar, Turkish, Turkmen, Uyghur, and Uzbek. We specifically target the zero-shot learning scena…
Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Kazakh-Chinese neural machine translation based on data augmentation
2023 · Hao Wu, Beiqiang Ma · 0 · Conference on Computer Graphics, Artificial Intelligence, and Data Processing
Machine translation is an important research field in natural language processing and artificial intelligence, which studies how to use computers to automatically convert languages. We experimented with attentional neural network machine translation for Chine…
Machine translation · Datasets / corpora
Extractive Question Answering for Kazakh Language
2023 · Magzhan Shymbayev, Yermek Alimzhanov · 5 · 2023 IEEE International Conference on Smart Information Systems and Technologies (SIST)
This article provides research and development of an extractive question answering system based on the BERT-like model for the Kazakh language. Developing an extractive question answering system requires large training datasets - tens of thousands of annotate…
Language models / LLM · Machine translation · NER / extraction · Datasets / corpora
Fine-Tuning the Wav2vec2 Model for Kazakh Speech: A Study on a Limited Corpus
2023 · Kairatuly Bauyrzhan, M. Madina, Ospan Assel · 3 · 2023 IEEE International Conference on Smart Information Systems and Technologies (SIST)
In this study, we developed a model for automatic recognition of Kazakh speech by fine-tuning the XLSR-Wav2Vec2 pre-trained model to a corpus of Kazakh speech. Our results show that fine-tuning the wav2vec2 model on a small corpus of Kazakh speech allows a si…
Speech (ASR / TTS) · Datasets / corpora
END-TO-END SPEECH RECOGNITION SYSTEMS FOR AGGLUTINATIVE LANGUAGES
2023 · A. Bekarystankyzy, O. Mamyrbayev · 1 · Scientific Journal of Astana IT University
With the improvement of intelligent systems, speech recognition technologies are being widely integrated into various aspects of human life. Speech recognition is applied to smart assistants, smart home infrastructure, the call center applications of banks, i…
Morphology / segmentation · Language models / LLM · Speech (ASR / TTS) · Datasets / corpora · Classification / sentiment
The neural machine translation models for the low-resource Kazakh–English language pair
2023 · V. Karyukin, D. Rakhimova, Aidana Karibayeva et al. · 24 · PeerJ Computer Science
The development of the machine translation field was driven by people’s need to communicate with each other globally by automatically translating words, sentences, and texts from one language into another. The neural machine translation approach has become on…
Tokenization · Machine translation · Datasets / corpora · Evaluation / benchmarks
Speech Recognition for Turkic Languages Using Cross-Lingual Transfer Learning from Kazakh
2023 · D. Orel, Rustem Yeshpanov, H. A. Varol · 3 · International Conference on Big Data and Smart Computing
This paper investigates the effectiveness of transfer learning in building automatic speech recognition models for nine Turkic languages (Azerbaijani, Bashkir, Chuvash, Kyrgyz, Sakha, Tatar, Turkish, Uyghur, and Uzbek), by leveraging large-scale training data…
Speech (ASR / TTS) · Datasets / corpora
Multilingual Speech Recognition for Turkic Languages
2023 · Saida Mussakhojayeva, Kaisar Dauletbek, Rustem Yeshpanov et al. · 25 · Inf.
The primary aim of this study was to contribute to the development of multilingual automatic speech recognition for lower-resourced Turkic languages. Ten languages—Azerbaijani, Bashkir, Chuvash, Kazakh, Kyrgyz, Sakha, Tatar, Turkish, Uyghur, and Uzbek—were co…
Speech (ASR / TTS) · Datasets / corpora
A Study of Speech Recognition for Kazakh Based on Unsupervised Pre-Training
2023 · Weijing Meng, Nurmemet Yolwas · 10 · Italian National Conference on Sensors
Building a good speech recognition system usually requires a lot of pairing data, which poses a big challenge for low-resource languages, such as Kazakh. In recent years, unsupervised pre-training has achieved good performance in low-resource speech recogniti…
Speech (ASR / TTS) · Datasets / corpora · Classification / sentiment · Embeddings
Automatic Speech Recognition for Uyghur, Kazakh, and Kyrgyz: An Overview
2022 · Wenqiang Du, Yikeremu Maimaitiyiming, Mewlude Nijat et al. · 17 · Applied Sciences
With the emergence of deep learning, the performance of automatic speech recognition (ASR) systems has remarkably improved. Especially for resource-rich languages such as English and Chinese, commercial usage has been made feasible in a wide range of applicat…
Speech (ASR / TTS) · NER / extraction · Datasets / corpora
MiLMo:Minority Multilingual Pre-trained Language Model
2022 · Junjie Deng, Hanru Shi, Xinhe Yu et al. · arXiv
Pre-trained language models are trained on large-scale unsupervised data, and they can fine-turn the model only on small-scale labeled datasets, and achieve good results. Multilingual pre-trained language models can be trained on multiple languages, and the m…
Языковые модели / LLMДатасеты / корпусаКлассификация / сентиментЭмбеддинги
Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF
2022 · Bakhyt Bakiyev · arXiv
The task of determining the similarity of text documents has received considerable attention in many areas such as Information Retrieval, Text Mining, Natural Language Processing (NLP) and Computational Linguistics. Transferring data to numeric vectors is a c…
Tokenization
The TNT Team System Descriptions of Cantonese, Mongolian and Kazakh for IARPA OpenASR21 Challenge
2022 · Kai Tang, Jing Zhao, Jinghao Yan et al. · 0 · Asia-Pacific Signal and Information Processing Association Annual Summit and Conference
This paper presents our systems and experimental analyses for the OpenASR21 Challenge. We describe the systems in the constrained condition, constrained-plus condition, and unconstrained condition, and our post-evaluation analyses for the Challenge. The syste…
Speech (ASR / TTS) · Evaluation / benchmarks
KSC2: An Industrial-Scale Open-Source Kazakh Speech Corpus
2022 · Saida Mussakhojayeva, Yerbolat Khassanov, H. A. Varol · 24 · Interspeech
We present the first industrial-scale open-source Kazakh speech corpus for automatic speech recognition research and development. Our corpus subsumes two previously presented corpora: 1) Kazakh speech corpus (KSC) and 2) Kazakh text-to-speech 2 (KazakhTTS2). …
Language models / LLM · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
Multilingual Bidirectional Unsupervised Translation Through Multilingual Finetuning and Back-Translation
2022 · Bryan Li, Mohammad Sadegh Rasooli, Ajay Patel et al. · arXiv
We propose a two-stage approach for training a single NMT model to translate unseen languages both to and from English. For the first stage, we initialize an encoder-decoder model to pretrained XLM-R and RoBERTa weights, then perform multilingual fine-tuning …
Language models / LLM · Machine translation · NER / extraction · Datasets / corpora
Semantic Connections in the Complex Sentences for Post-Editing Machine Translation in the Kazakh Language
2022 · A. Turganbayeva, D. Rakhimova, V. Karyukin et al. · 10 · Inf.
The problems of machine translation are constantly arising. While the most advanced translation platforms, such as Google and Yandex, allow for high-quality translations of languages with simple grammatical structures, more morphologically rich languages stil…
Morphology / segmentation · Machine translation · NER / extraction
Hybrid end-to-end model for Kazakh speech recognition
2022 · O. Mamyrbayev, D. Oralbekova, K. Alimhan et al. · 17 · International Journal of Speech Technology
Speech (ASR / TTS)
ResNet50+Transformer: kazakh offline handwritten text recognition
2022 · Y. Amirgaliyev, Mateus Mendes, K. Mukhtar et al. · 0 · Bulletin of the National Engineering Academy of the Republic of Kazakhstan
Nowadays, due to the transition to digital data storage, there is a need to implement handwritten text recognition (HTR), which is an automatic translation of handwritten characters into a machine format. Handwriting recognition is complicated by the fact tha…
Machine translation
Refining Low-Resource Unsupervised Translation by Language Disentanglement of Multilingual Model
2022 · Xuan-Phi Nguyen, Shafiq Joty, Wu Kui et al. · 4 · arXiv
Numerous recent work on unsupervised machine translation (UMT) implies that competent unsupervised translations of low-resource and unrelated languages, such as Nepali or Sinhala, are only possible if the model is trained in a massive multilingual environment…
Machine translation · Datasets / corpora
Descartes: Generating Short Descriptions of Wikipedia Articles
2022 · Marija Sakota, Maxime Peyrard, Robert West · arXiv
Wikipedia is one of the richest knowledge sources on the Web today. In order to facilitate navigating, searching, and maintaining its content, Wikipedia's guidelines state that all articles should be annotated with a so-called short description indicating the…
Machine translation · NER / extraction · Datasets / corpora · Evaluation / benchmarks
A study of transformer-based end-to-end speech recognition system for Kazakh language
2022 · Mamyrbayev Orken, Oralbekova Dina, Alimhan Keylan et al. · 44 · Scientific Reports
Today, the Transformer model, which allows parallelization and also has its own internal attention, has been widely used in the field of speech recognition. The great advantage of this architecture is the fast learning speed, and the lack of sequential operat…
Morphology / segmentation · Language models / LLM · Speech (ASR / TTS) · Datasets / corpora · Classification / sentiment
Emotional Speech Recognition Method Based on Word Transcription
2022 · G. Bekmanova, B. Yergesh, A. Sharipbay et al. · 25 · Italian National Conference on Sensors
The emotional speech recognition method presented in this article was applied to recognize the emotions of students during online exams in distance learning due to COVID-19. The purpose of this method is to recognize emotions in spoken speech through the know…
Speech (ASR / TTS) · Datasets / corpora
Identifying the influence of transfer learning method in developing an end-to-end automatic speech recognition system with a low data level
2022 · O. Mamyrbayev, K. Alimhan, D. Oralbekova et al. · 15 · Eastern-European Journal of Enterprise Technologies
Ensuring the best quality and performance of modern speech technologies, today, is possible based on the widespread use of machine learning methods. The idea of this project is to study and implement an end-to-end system of automatic speech recognition using …
Morphology / segmentation · Speech (ASR / TTS) · Datasets / corpora
KazakhTTS2: Extending the Open-Source Kazakh TTS Corpus With More Data, Speakers, and Topics
2022 · Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol · arXiv
We present an expanded version of our previously released Kazakh text-to-speech (KazakhTTS) synthesis corpus. In the new KazakhTTS2 corpus, the overall size has increased from 93 hours to 271 hours, the number of speakers has risen from two to five (three fem…
Morphology / segmentation · Language models / LLM · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
KazNERD: Kazakh Named Entity Recognition Dataset
2021 · Rustem Yeshpanov, Yerbolat Khassanov, Huseyin Atakan Varol · arXiv
We present the development of a dataset for Kazakh named entity recognition. The dataset was built as there is a clear need for publicly available annotated corpora in Kazakh, as well as annotation guidelines containing straightforward--but rigorous--rules an…
NER / extraction · Datasets / corpora
The Development of the Light Post-editing Module for English-Kazakh Translation
2021 · D. Rakhimova, V. Karyukin, Aidana Karibayeva et al. · 2 · The 7th International Conference on Engineering & MIS 2021
Applied intelligent systems play an important role in the modern world. One of their tasks is machine translation (MT) from one language into another one. MT allows people to freely communicate despite language barriers. This new technology is a special step …
Machine translation · Evaluation / benchmarks
FooDI-ML: a large multi-language dataset of food, drinks and groceries images and descriptions
2021 · David Amat Olóndriz, Ponç Palau Puigdevall, Adrià Salvador Palau · arXiv
In this paper we introduce the FooDI-ML dataset. This dataset contains over 1.5M unique images and over 9.5M store names, product names descriptions, and collection sections gathered from the Glovo application. The data made available corresponds to food, dri…
NER / extraction · Datasets / corpora · Evaluation / benchmarks
KOHTD: Kazakh Offline Handwritten Text Dataset
2021 · Nazgul Toiganbayeva, Mahmoud Kasem, Galymzhan Abdimanap et al. · arXiv
Despite the transition to digital information exchange, many documents, such as invoices, taxes, memos and questionnaires, historical data, and answers to exam questions, still require handwritten inputs. In this regard, there is a need to implement Handwritt…
Morphology / segmentation · Speech (ASR / TTS) · Datasets / corpora
THE TRANSLATION QUALITY PROBLEMS OF MACHINE TRANSLATION SYSTEMS FOR THE KAZAKH LANGUAGE
2021 · Asem Turarbek · 0 · Journal of Mathematics Mechanics and Computer Science
Machine translation
A Study of Multilingual End-to-End Speech Recognition for Kazakh, Russian, and English
2021 · Saida Mussakhojayeva, Yerbolat Khassanov, Huseyin Atakan Varol · 20 · arXiv
We study training a single end-to-end (E2E) automatic speech recognition (ASR) model for three languages used in Kazakhstan: Kazakh, Russian, and English. We first describe the development of multilingual E2E ASR based on Transformer networks and then perform…
Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
A baseline model for computationally inexpensive speech recognition for Kazakh using the Coqui STT framework
2021 · Ilnar Salimzianov · 0 · arXiv
Mobile devices are transforming the way people interact with computers, and speech interfaces to applications are ever more important. Automatic Speech Recognition systems recently published are very accurate, but often require powerful machinery (specialised…
Tokenization · Language models / LLM · Speech (ASR / TTS) · NER / extraction · Datasets / corpora
Error Correction Based on Transformer LM in Uyghur Speech Recognition
2021 · Yan Zhang, Mijit Ablimit, Askar Hamdulla · 1 · 2021 IEEE 2nd International Conference on Pattern Recognition and Machine Learning (PRML)
For Uyghur, Kazakh and other minority languages or dialects, it is difficult to collect large-scale labeled corpus. In the case of low resources, reducing the recognition granularity which using phonemes or characters as the recognition unit can get good char…
Language models / LLM · Speech (ASR / TTS) · Datasets / corpora
End-to-End Model Based on RNN-T for Kazakh Speech Recognition
2021 · Orken J. Mamyrbayev, D. Oralbekova, A. Kydyrbekova et al. · 11 · International Conference on Computational Collective Intelligence
Automatic speech recognition is a rapidly developing area in machine learning. The most popular speech recognition systems today are end-to-end systems, especially those models that directly output a sequence of words taking into account the input sound in re…
Language models / LLM · Speech (ASR / TTS) · Datasets / corpora
MAIN PROBLEMS OF USING THE FULL POST-EDITING MODEL BASED ON MACHINE LEARNING FOR ENGLISH-KAZAKH TRANSLATION
2021 · D. Rakhimova, K. A. Zhakypbayeva · 0 · BULLETIN Series of Physics & Mathematical Sciences
Machine learning is one of the main branches of artificial intelligence. Its main idea is not only to use an algorithm written by a computer, but also to learn how to solve a problem on your own. Recently, in the field of translation, the issue of using machi…
Machine translation · Datasets / corpora
KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset
2021 · Saida Mussakhojayeva, Aigerim Janaliyeva, Almas Mirzakhmetov et al. · arXiv
This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide. The dataset consists of about 93 hours of transcribed audio recordings spoken by two professional speaker…
Language models / LLM · Speech (ASR / TTS) · Datasets / corpora · Evaluation / benchmarks
"Wikily" Supervised Neural Translation Tailored to Cross-Lingual Tasks
2021 · Mohammad Sadegh Rasooli, Chris Callison-Burch, Derry Tanti Wijaya · arXiv
We present a simple but effective approach for leveraging Wikipedia for neural machine translation as well as cross-lingual tasks of image captioning and dependency parsing without using any direct supervision from external parallel data or supervised models …
Machine translation · Datasets / corpora · Embeddings
Low-Resource Machine Translation Training Curriculum Fit for Low-Resource Languages
2021 · Garry Kuwanto, Afra Feyza Akyürek, Isidora Chara Tourni et al. · 8 · arXiv
We conduct an empirical study of neural machine translation (NMT) for truly low-resource languages, and propose a training curriculum fit for cases when both parallel training data and compute resource are lacking, reflecting the reality of most of the world'…
Machine translation · Datasets / corpora
The Effectiveness of Morphology-aware Segmentation in Low-Resource Neural Machine Translation
2021 · Jonne Sälevä, Constantine Lignos · 28 · arXiv
This paper evaluates the performance of several modern subword segmentation methods in a low-resource neural machine translation setting. We compare segmentations produced by applying BPE at the token or sentence level with morphologically-based segmentations…
Tokenization · Morphology / segmentation · Machine translation · Datasets / corpora
The Development and Construction of Bilingual Machine Translation Auxiliary Tool between Chinese and Kazakh Languages
2021 · M. Niyazbek, Kuenssaule Talp, Jing Sun · 4 · IOP Conference Series: Earth and Environment
This paper introduces the design and construction process of a bilingual machine translation auxiliary tool between Chinese and Kazakh languages. The tool uses the Jieba word segmentation tool to segment the input sentence, and then translates it according to…
Morphology / segmentation · Machine translation
Development of a model and software solution for the problem of determining unknown words in post-editing machine translation
2021 · D. Rakhimova, N. M. Pazylkhan, A. Kulzhanova et al. · 0
Machine translation is the technology of consecutive translation of texts from one language to another by a computer program. As a result of machine translation, there are always certain disadvantages that can be solved by post-editing. Post-editing-human pro…
Machine translation
Classification of Handwritten Names of Cities and Handwritten Text Recognition using Various Deep Learning Models
2021 · Daniyar Nurseitov, Kairat Bostanbekov, Maksat Kanatov et al. · arXiv
This article discusses the problem of handwriting recognition in Kazakh and Russian languages. This area is poorly studied since in the literature there are almost no works in this direction. We have tried to describe various approaches and achievements of re…
Datasets / corpora · Classification / sentiment
Impact of Statistical Language Model on Example Based Machine Translation System between Kazakh and Turkish Languages
2020 · Gulshat Kessikbayeva, I. Çiçekli · 1 · International Conference on Natural Language Processing and Information Retrieval
In this paper a hybrid example based machine translation system between Kazakh and Turkish languages is presented. The system mainly based on example based machine translation method which is supported by a statistical language model for the target language. …
Morphology / segmentation · Language models / LLM · Machine translation · Datasets / corpora
ETHICAL ASPECT OF SPEECH CULTURE
2020 · G. Abdirasilova, M. Berkutbayeva, M. Student · 0
The basis of word culture is the language norm. Speech culture is " the degree of reproduction, maturation of language techniques. In addition, he has not only kindness, literacy, but also the skills of accurate and correct application of language techniques,…
Morphology / segmentation · Speech (ASR / TTS) · Evaluation / benchmarks
Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages
2020 · Xavier Garcia, Aditya Siddhant, Orhan Firat et al. · 35 · arXiv
Unsupervised translation has reached impressive performance on resource-rich language pairs such as English-French and English-German. However, early studies have shown that in more realistic settings involving low-resource, rare languages, unsupervised trans…
Machine translation · Datasets / corpora
A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline
2020 · Yerbolat Khassanov, Saida Mussakhojayeva, Almas Mirzakhmetov et al. · 43 · arXiv
We present an open-source speech corpus for the Kazakh language. The Kazakh speech corpus (KSC) contains around 332 hours of transcribed audio comprising over 153,000 utterances spoken by participants from different regions and age groups, as well as both gen…
Speech (ASR / TTS) · Datasets / corpora
Attention-based Fully Gated CNN-BGRU for Russian Handwritten Text
2020 · Abdelrahman Abdallah, Mohamed Hamada, Daniyar Nurseitov · arXiv
This research approaches the task of handwritten text with attention encoder-decoder networks that are trained on Kazakh and Russian language. We developed a novel deep neural network model based on Fully Gated CNN, supported by Multiple bidirectional GRU and…
Datasets / corpora
Neural Named Entity Recognition for Kazakh
2020 · Gulmira Tolegen, Alymzhan Toleu, Orken Mamyrbayev et al. · arXiv
We present several neural networks to address the task of named entity recognition for morphologically complex languages (MCL). Kazakh is a morphologically complex language in which each root/stem can produce hundreds or thousands of variant word forms. This …
Morphology / segmentation · NER / extraction · Datasets / corpora · Embeddings
HKR For Handwritten Kazakh & Russian Database
2020 · Daniyar Nurseitov, Kairat Bostanbekov, Daniyar Kurmankhojayev et al. · arXiv
In this paper, we present a new Russian and Kazakh database (with about 95% of Russian and 5% of Kazakh words/sentences respectively) for offline handwriting recognition. A few pre-processing and segmentation procedures have been developed together with the d…
Morphology / segmentation · NER / extraction · Datasets / corpora
Method of Sentiment Preservation in the Kazakh-Turkish Machine Translation
2020 · L. Zhetkenbay, G. Bekmanova, B. Yergesh et al. · 2 · Communication Systems and Applications
This paper describes characteristics which affect the sentiment analysis in the Kazakh language texts, models of morphological rules and morphological analysis algorithms, formal models of simple sentence structures in the Kazakh-Turkish combination, models a…
Morphology / segmentation · Machine translation · Classification / sentiment
BASIC CONCEPTS AND PARAMETERS OF KAZAKH GRAMMATOLOGY
2020 · N. Amirzhanova · 0
Grammatology is traditionally a field of linguistics that establishes and studies the relationship between the letters of the alphabet and the sounds of speech. Grammatology as a branch of linguistics appeared long ago, almost simultaneously with linguistics.…
Speech (ASR / TTS)
Cross-Lingual Word Embeddings for Turkic Languages
2020 · Elmurod Kuriyozov, Yerai Doval, Carlos Gómez-Rodríguez · arXiv
There has been an increasing interest in learning cross-lingual word embeddings to transfer knowledge obtained from a resource-rich language, such as English, to lower-resource languages for which annotated data is scarce, such as Turkish, Russian, and many o…
Datasets / corpora · Evaluation / benchmarks · Classification / sentiment · Embeddings
Multimodal systems for speech recognition
2020 · Orken J. Mamyrbayev, K. Alimhan, B. Amirgaliyev et al. · 9 · International Journal of Mobile Communications
In this article, we have implemented a system of multimodal recognition of Kazakh speech, based on speech and lip recognition. During the feature extraction phase, several methods have been used, such as voice activity detection (VAD), mel-frequency cepstral …
Morphology / segmentation · Speech (ASR / TTS) · Evaluation / benchmarks · Classification / sentiment · Embeddings
The solution of the problem of unknown words under neural machine translation of the Kazakh language
2020 · A. Turganbayeva, U. Tukeyev · 7 · Asian Conference on Intelligent Information and Database Systems
The paper proposes a solution to the problem of unknown words for neural machine translation (NMT). The proposed solution is shown by the example of NMT of the Kazakh-English language pair. The novelty of the proposed technology for solving the probl…
Tokenization · Machine translation · Datasets / corpora
Development of Automatic Speech Recognition for Kazakh Language using Transfer Learning
2020 · Amirgaliyev E. N., Kuanyshbay D. N., Baimuratov O · 14 · arXiv
Development of Automatic Speech Recognition system for Kazakh language is very challenging due to a lack of data. Existing data of Kazakh speech with its corresponding transcriptions are heavily accessed and not enough to gain a worth mentioning results. For th…
Language models / LLM · Speech (ASR / TTS) · Datasets / corpora
Morphological segmentation method for Turkic language neural machine translation
2020 · U. Tukeyev, Aidana Karibayeva, Z. Zhumanov et al. · 24 · Cogent Engineering
Dictionaries play an important role in neural machine translation (NMT). However, a large dictionary requires a significant amount of memory, which limits the application of NMT and can cause a memory error. This limitation can be solved by segmentin…
Tokenization · Morphology / segmentation · Machine translation · Evaluation / benchmarks
Speech Emotion Recognition For Kazakh And Russian Languages
2020 · 2 · Applied Mathematics & Information Sciences
Speech (ASR / TTS)
Neural Machine Translation for English–Kazakh with Morphological Segmentation and Synthetic Data
2019 · Antonio Toral, Lukas Edman, Galiya Yeshmagambetova et al. · 11 · Conference on Machine Translation
This paper presents the systems submitted by the University of Groningen to the English–Kazakh language pair (both translation directions) for the WMT 2019 news translation task. We explore the potential benefits of (i) morphological segmentation (both unsup…
Morphology / segmentation · Machine translation · Evaluation / benchmarks
NICT’s Unsupervised Neural and Statistical Machine Translation Systems for the WMT19 News Translation Task
2019 · Benjamin Marie, Haipeng Sun, Rui Wang et al. · 22 · Conference on Machine Translation
This paper presents the NICT’s participation in the WMT19 unsupervised news translation task. We participated in the unsupervised translation direction: German-Czech. Our primary submission to the task is the result of a simple combination of our unsupervised…
Machine translation · Datasets / corpora · Evaluation / benchmarks
NICT’s Supervised Neural Machine Translation Systems for the WMT19 News Translation Task
2019 · Raj Dabre, Kehai Chen, Benjamin Marie et al. · 16 · Conference on Machine Translation
In this paper, we describe our supervised neural machine translation (NMT) systems that we developed for the news translation task for Kazakh↔English, Gujarati↔English, Chinese↔English, and English→Finnish translation directions. We focused on leveraging mult…
Machine translation · NER / extraction · Datasets / corpora
The TALP-UPC Machine Translation Systems for WMT19 News Translation Task: Pivoting Techniques for Low Resource MT
2019 · Noe Casas, José A. R. Fonollosa, Carlos Escolano et al. · 16 · Conference on Machine Translation
In this article, we describe the TALP-UPC research group participation in the WMT19 news translation shared task for Kazakh-English. Given the low amount of parallel training data, we resort to using Russian as pivot language, training subword-based statistic…
Tokenization · Machine translation · Datasets / corpora
The RWTH Aachen University Machine Translation Systems for WMT 2019
2019 · Jan Rosendahl, Christian Herold, Yunsu Kim et al. · 4 · Conference on Machine Translation
This paper describes the neural machine translation systems developed at the RWTH Aachen University for the German-English, Chinese-English and Kazakh-English news translation tasks of the Fourth Conference on Machine Translation (WMT19). For all tasks, the f…
Morphology / segmentation · Language models / LLM · Machine translation
Towards Interlingua Neural Machine Translation
2019 · Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa · 26 · arXiv
Common intermediate language representation in neural machine translation can be used to extend bilingual to multilingual systems by incremental training. In this paper, we propose a new architecture based on introducing an interlingual loss as an additional …
Machine translation · Datasets / corpora · Evaluation / benchmarks · Embeddings
Automatic Recognition of Kazakh Speech Using Deep Neural Networks
2019 · Orken J. Mamyrbayev, Mussa Turdalyuly, N. Mekebayev et al. · 20 · Asian Conference on Intelligent Information and Database Systems
Speech (ASR / TTS)
Automated rating of recorded classroom presentations using speech analysis in kazakh
2018 · Akzharkyn Izbassarova, Aidana Irmanova, A. P. James · arXiv
Effective presentation skills can help to succeed in business, career and academy. This paper presents the design of speech assessment during the oral presentation and the algorithm for speech evaluation based on criteria of optimal intonation. As the pace of…
Speech (ASR / TTS) · Evaluation / benchmarks
A free Kazakh speech database and a speech recognition baseline
2017 · Ying Shi, Askar Hamdullah, Zhiyuan Tang et al. · 6 · Asia-Pacific Signal and Information Processing Association Annual Summit and Conference
Speech (ASR / TTS)
On Various Approaches to Machine Translation from Russian to Kazakh
2017 · Aibek Makazhanov, Bagdat Myrzakhmetov, Zhanibek et al. · 5
Machine translation
Regarding the impact of Kazakh phonetic transcription on the performance of automatic speech recognition systems
2017 · Muslima Karabalayeva, Zhandos Yessenbayev, Zhanibek Kozhirbayev · 1
Speech (ASR / TTS)
Complex Technology of Machine Translation Resources Extension for the Kazakh Language
2017 · D. Rakhimova, Z. Zhumanov · 2 · Asian Conference on Intelligent Information and Database Systems
Machine translation · Datasets / corpora
Learning Word Alignment Models for Kazakh-English Machine Translation
2015 · A. Kartbayev · 7 · International Symposium on Integrated Uncertainty in Knowledge Modelling
Machine translation
Refining Kazakh Word Alignment Using Simulation Modeling Methods for Statistical Machine Translation
2015 · A. Kartbayev · 6 · Natural Language Processing and Chinese Computing
Word alignment plays an important role in the training of statistical machine translation systems. We present a technique to refine word alignments at phrase level after the collection of sentences from the Kazakh-English parallel corpora. The estimation techn…
Morphology / segmentation · Machine translation
A Bilingual Kazakh-Russian System for Automatic Speech Recognition and Synthesis
2015 · Olga Khomitsevich, Valentin Mendelev, N. Tomashenko et al. · 17 · International Conference on Speech and Computer
Speech (ASR / TTS)
Kazakh Vowel Recognition at the Beginning of Words
2015 · Aigerim K. Buribayeva, A. Sharipbay · 0
This paper describes the method of recognition of Kazakh vowels at the beginning of the words using Dynamic Time Warping algorithm. This can be used for acceleration of recognition since word’s first sound identification can significantly decrease the list of…
Speech (ASR / TTS)
Initial explorations in Kazakh to English statistical machine translation
2014 · Z. Assylbekov, Assulan Nurkas · 10
Machine translation
Parametric Representation of Kazakh Gestural Speech
2014 · Saule Kudubayeva, Gulmira Yermagambetova · 2 · International Conference on Speech and Computer
Speech (ASR / TTS) · Embeddings
A study of certain morphological structures of Kazakh and their impact on the machine translation quality
2014 · Eldar Bekbulatov, A. Kartbayev · 7 · Advanced Industrial Conference on Telecommunications
Morphology / segmentation · Machine translation
Perceptual MVDR-based unsupervised built-in speaker normalization for Kazakh speech recognition
2014 · Zhandos Yessenbayev, Umit Yapanel · 0 · Advanced Industrial Conference on Telecommunications
Speech (ASR / TTS)
ENGLISH-KAZAKH PARALLEL CORPUS FOR STATISTICAL MACHINE TRANSLATION
2014 · A. Kuandykova, A. Kartbayev, Tannur Kaldybekov · 3
Machine translation · Datasets / corpora
STRUCTURAL TRANSFER RULES FOR KAZAKH-TO-ENGLISH MACHINE TRANSLATION IN THE FREE/OPEN-SOURCE PLATFORM APERTIUM
2014 · A. Sundetova, Aidana Karibayeva, U. Tukeyev · 8
Machine translation
LEXICAL SELECTION IN MACHINE TRANSLATION OF RUSSIAN-TO-KAZAKH
2014 · D. Rakhimova, M. Abakan · 0
Machine translation
Methods for applying VAD in Kazakh speech recognition systems
2013 · M. Kalimoldayev, K. Alimhan, Orken J. Mamyrbayev · 5 · International Journal of Speech Technology
Speech (ASR / TTS)
Machine translation of different systemic languages using a Apertium platform (with an example of English and Kazakh languages)
2013 · S. Assem, S. Aida · 3 · International Conference on Computer Applications Technology
Machine translation
Improving Low-Resource Kazakh-English and Turkish-English Neural Machine Translation Using Transfer Learning and Part of Speech Tags
2025 · Bilge Kagan Yazar, Erdal Kiliç · 2 · IEEE Access
This study presents a novel translation framework by combining transfer learning and part-of-speech (POS) tagging methods to improve the performance of low-resource neural machine translation models using Kazakh-English and Turkish-English language pairs. It …
Machine translation · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks
Maximum Entropy Model of Synonym Selection in Post-editing Machine Translation into Kazakh Language
2024 · A. Shormakova, U. Tukeyev · 0 · International Conference on Computational Collective Intelligence
Machine translation
Kazakh-Uzbek Speech Cascade Machine Translation on Complete Set of Endings
2023 · Tolganay Balabekova, Bauyrzhan Kairatuly, U. Tukeyev · 4 · International Conference on Computational Collective Intelligence
Machine translation · Speech (ASR / TTS)
Multi-Source Transformer for Kazakh-Russian-English Neural Machine Translation
2019 · Patrick Littell, Chi-kiu (羅致翹) Lo, Samuel Larkin et al. · 17 · Conference on Machine Translation
We describe the neural machine translation (NMT) system developed at the National Research Council of Canada (NRC) for the Kazakh-English news translation task of the Fourth Conference on Machine Translation (WMT19). Our submission is a multi-source NMT takin…
Machine translation
Development Kazakh-Turkish Machine Translation on the Base of Complete Set of Endings Model
2022 · Aitan Qamet, Kamila Zhakypbayeva, A. Turganbayeva et al. · 0 · Asian Conference on Intelligent Information and Database Systems
Machine translation
Kazakh Text Normalization using Machine Translation Approaches
2020 · Kozhirbaev Zhanibek, Yessenbayev Zhandos · 2 · Workshop on Cognitive Modeling and Computational Linguistics
Machine translation
Neural machine translation system for the Kazakh language based on synthetic corpora
2019 · U. Tukeyev, Aidana Karibayeva, B. Abduali · 10 · MATEC Web of Conferences
The lack of big parallel data is present for the Kazakh language. This problem seriously impairs the quality of machine translation from and into Kazakh. This article considers the neural machine translation of the Kazakh language on the basis of synthetic co…
Morphology / segmentation · Machine translation · NER / extraction
The University of Maryland’s Kazakh-English Neural Machine Translation System at WMT19
2019 · Eleftheria Briakou, Marine Carpuat · 15 · Conference on Machine Translation
This paper describes the University of Maryland’s submission to the WMT 2019 Kazakh-English news translation task. We study the impact of transfer learning from another low-resource but related language. We experiment with different ways of encoding lexical u…
Machine translation · Datasets / corpora
Neural machine translation system for the Kazakh language
2019 · U. Tukeyev, Z. Zhumanov · 0 · Machine Translation Summit
Machine translation
Rule-based machine translation from Kazakh to Turkish
2018 · S. Bayatli, S. Kurnaz, Ilnar Salimzianov et al. · 3 · European Association for Machine Translation Conferences/Workshops
Machine translation
Rule-weight learning for Kazakh-Turkish machine translation
2020 · S. M. Taha · 0
Machine translation
Development and Study of a Post-editing Model for Russian-Kazakh and English-Kazakh Translation Based on Machine Learning
2021 · D. Rakhimova, Kamila Sagat, Kamila Zhakypbaeva et al. · 1 · International Conference on Computational Collective Intelligence
Machine translation
A Comparative Evaluation of Open-Source Models for Russian-Kazakh Translation
2026 · Gleb Shanshin · 0 · Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026)
We describe an evaluation of several open-source models under identical inference conditions without task-specific training. Despite covering a wide range of available models, including both multilingual systems and models specifically designed for Russian–K…
Machine translation · Evaluation / benchmarks
Lexical selection rules for Kazakh-to-English machine translation in the free/open-source platform Apertium
2015 · Aidana Karibayeva · 0
Machine translation
Example based machine translation system between kazakh and turkish supported by statistical language model (Kazakça ve türkçe dilleri arasında örnek tabanlı ve istatistik model destekli makine çeviri sistemi)
2016 · Gulshat Kessikbayeva · 0
Language models / LLM · Machine translation
A Free/Open-source Kazakh-Tatar Machine Translation System
2013 · Ilnar Salimzyanov, Jonathan North Washington, Francis M. Tyers · 22 · Machine Translation Summit
Machine translation
A free/open-source machine translation system for English to Kazakh
5 · 3rd International Conference on Computer Processing in Turkic Languages (TURKLANG 2015)
Machine translation
Initial explorations in Kazakh to English statistical machine translation
2014 · Zhenisbek Assylbekov, Assulan Nurkas · 0 · Proceedings of the First Italian Conference on Computational Linguistics CLiC-it 2014 and of the Fourth International Workshop EVALITA 2014, 9-11 December 2014, Pisa
Machine translation
The Universitat d’Alacant Submissions to the English-to-Kazakh News Translation Task at WMT 2019
2019 · V. M. Sánchez-Cartagena, Juan Antonio Pérez-Ortiz, F. Sánchez-Martínez · 9 · Conference on Machine Translation
This paper describes the two submissions of Universitat d’Alacant to the English-to-Kazakh news translation task at WMT 2019. Our submissions take advantage of monolingual data and parallel data from other language pairs by means of iterative backtranslation,…
Morphology / segmentation · Machine translation
Do LLMs Speak Kazakh? A Pilot Evaluation of Seven Models
2024 · Akylbek Maxutov, Ayan Myrzakhmet, Pavel Braslavski · 17 · SIGTURK
We conducted a systematic evaluation of seven large language models (LLMs) on tasks in Kazakh, a Turkic language spoken by approximately 13 million native speakers in Kazakhstan and abroad. We used six datasets corresponding to different tasks – questions ans…
Language models / LLM · Machine translation · NER / extraction · Datasets / corpora · Evaluation / benchmarks · Classification / sentiment
Kazakh Speech Recognition: Wav2vec2.0 vs. Whisper
2023 · Zhanibek Kozhirbayev · 11 · Journal of Advances in Information Technology
In recent years, the progress made in neural models trained on extensive multilingual text or speech data has shown great potential for improving the status of underresourced languages. This paper focuses on experimenting with three state-of-the-art speech r…
Language models / LLM · Machine translation · Speech (ASR / TTS) · Datasets / corpora
Leveraging Wav2Vec2.0 for Kazakh Speech Recognition: An Experimental Study
2024 · Zhanibek Kozhirbayev · 0 · Communication Systems and Applications
Speech (ASR / TTS)
A Study of Kazakh Speech Recognition in Hiformer Model
2024 · O. Mamyrbayev, Turdybek Kurmetkan, D. Oralbekova et al. · 0 · Asian Conference on Intelligent Information and Database Systems
Speech (ASR / TTS)
Development of CRF and CTC Based End-To-End Kazakh Speech Recognition System
2022 · D. Oralbekova, Orken J. Mamyrbayev, M. Othman et al. · 2 · Asian Conference on Intelligent Information and Database Systems
Speech (ASR / TTS)
Review of methods of end-to-end automatic recognition of Kazakh speech
2024 · Yerlan Karabaliyev, K. Kolesnikova, Nurkhan Batyrkhan · 0 · EUSPN/ICTH
Speech (ASR / TTS)
INVESTIGATING A KAZAKH SPEECH RECOGNITION SYSTEM USING AN END-TO-END MODEL BASED ON CRF AND CTC
Д.О. Оралбекова, О.Ж. Мамырбаев, А.Б. Имансакипова et al. · 0
Speech (ASR / TTS)
Speech recognition for Kazakh language: a research paper
2024 · Galym Kapyshev, M. Nurtas, Aizhan Altaibek · 7 · Procedia Computer Science
Speech (ASR / TTS)
Automatic Speech Recognition Improvement for Kazakh Language with Enhanced Language Model
2023 · A. Bekarystankyzy, O. Mamyrbayev, Mateus Mendes et al. · 1 · Asian Conference on Intelligent Information and Database Systems
Language models / LLM · Speech (ASR / TTS)
Speech recognition for Kazakh language: a research paper
2023 · Galym Kapyshev, M. Nurtas, Aizhan Altaibek · 0 · EUSPN/ICTH
Speech (ASR / TTS)
Continuous Speech Recognition of Kazakh Language
2019 · Orken Mamyrbayev, Mussa Turdalyuly, N. Mekebayev et al. · 13 · ITM Web of Conferences
This article describes the methods of creating a system of recognizing the continuous speech of Kazakh language. Studies on recognition of Kazakh speech in comparison with other languages began relatively recently, that is after obtaining independence of the …
Tokenization · Speech (ASR / TTS) · Datasets / corpora
Impact of Using a Bilingual Model on Kazakh-Russian Code-Switching Speech Recognition
2019 · Dmitrii Ubskii, Yuri N. Matveev, W. Minker · 2 · Majorov International Conference on Software Engineering and Computer Systems
Speech (ASR / TTS)
AUTOMATIC SPEECH RECOGNITION SYSTEM FOR KAZAKH LANGUAGE USING CONNECTIONIST TEMPORAL CLASSIFIER
2020 · Y. Amirgaliyev, Darkhan Kuanyshbay, D. Yedilkhan · 0
Speech (ASR / TTS)
QazNLP: Constraint-Aware Multi-Task Sequence Labeling for Morphologically Rich Low-Resource Languages
2026 · A. Aitim · 0 · IEEE Access
Automatic processing of morphologically rich, agglutinative, and low-resource languages remains challenging because productive affixation increases lexical sparsity, weakens statistical generalization, and often produces inconsistent predictions across relate…
Tokenization · Morphology / segmentation · Language models / LLM · Speech (ASR / TTS) · NER / extraction · Datasets / corpora · Evaluation / benchmarks

How this atlas was built

The corpus of 222 papers was collected by a Python scraper from two sources: the arXiv API and the Semantic Scholar Graph API. Scope is limited to the Kazakh language (filter: a mention of Kazakh/Qazaq plus NLP/ML relevance). The graph edges are real citations pulled through the Semantic Scholar batch API. The editorial layer (flagship models, unclaimed territory, the breakthrough timeline) was verified by hand against primary sources.
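A minimal sketch of what such a scope filter could look like. The atlas states only the rule (a Kazakh/Qazaq mention plus NLP/ML relevance); the function name `in_scope` and the keyword list in `TASK_RE` are assumptions made for illustration, not the scraper's actual code.

```python
import re

# Language mention: the page names exactly these two spellings.
LANGUAGE_RE = re.compile(r"\b(kazakh|qazaq)\b", re.IGNORECASE)

# Hypothetical relevance keywords; the real scraper's list is not published here.
TASK_RE = re.compile(
    r"\b(nlp|language model|machine translation|speech recognition|"
    r"tokeni[sz]ation|morpholog\w*|named entity|neural|corpus|corpora)\b",
    re.IGNORECASE,
)


def in_scope(title: str, abstract: str = "") -> bool:
    """Keep a paper only if it mentions the language AND an NLP/ML topic."""
    text = f"{title} {abstract}"
    return bool(LANGUAGE_RE.search(text)) and bool(TASK_RE.search(text))


if __name__ == "__main__":
    print(in_scope("Neural machine translation system for the Kazakh language"))
    print(in_scope("A history of Kazakh poetry"))
```

The word boundaries keep "Kazakhstan" alone from counting as a language mention, which is one plausible reading of "scope is limited to the Kazakh language".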

⚠ Limitations, stated plainly: Semantic Scholar intermittently returned HTTP 429, so non-arXiv coverage in the tokenization/LLM/morphology categories is incomplete. Auto-tags are a heuristic over the abstract, not manual annotation. The "claim" column for the flagship models reflects the authors' statements, not verified truth. The scrapers are idempotent, so a rerun will extend the corpus.
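Two of these points, the HTTP 429 responses and the idempotent reruns, can be sketched together. This is an illustration under assumptions, not the atlas scraper itself: `fetch_with_backoff`, `upsert`, the `paper_id` key, and the exponential delay schedule are all hypothetical, and the network call is passed in as a plain callable so the sketch runs without touching any API.

```python
import time
from typing import Callable, Dict, Iterable, Optional, Tuple


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, Optional[dict]]],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() until it stops returning HTTP 429 or retries run out.

    fetch is any callable returning (status_code, payload).
    """
    for attempt in range(max_retries + 1):
        status, payload = fetch()
        if status != 429:
            return payload if status == 200 else None
        if attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x, ... base_delay
    return None  # still rate-limited: the record stays missing from the corpus


def upsert(corpus: Dict[str, dict], records: Iterable[dict]) -> int:
    """Merge records by id; feeding the same records again adds nothing new."""
    added = 0
    for record in records:
        key = record["paper_id"]  # hypothetical unique key
        if key not in corpus:
            added += 1
        corpus[key] = {**corpus.get(key, {}), **record}
    return added
```

Keying the corpus by a stable identifier is what makes a rerun safe: records fetched earlier are overwritten in place, and only papers that were previously lost to rate limiting increase the count.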

Sources: arXiv API · Semantic Scholar · gold standard for morphology: UD_Kazakh-KTB, apertium-kaz.