Myanmar-English Code-Switching Speech Dataset: MEASR

2025 · 0 citations · Journal Article

Authors

Theingi Aye · University of Computer Studies Yangon

Win Pa Pa · Ministry of Health and Sports

Hay Mar Soe Naing · University of Computer Studies Yangon

Abstract

Teachers and students often use intra-sentential code-switching while teaching Information Technology (IT) and other subjects in universities and colleges. Myanmar automatic speech recognition (ASR) systems still struggle with code-switched speech because switched utterances blur language boundaries, mix pronunciations, and expand the vocabulary, especially when English terms are pronounced with Myanmar sounds. These mismatches raise word error rates and cause many mixed-language phrases to be misrecognized. This paper presents the MEASR dataset, a spontaneous speech dataset of Myanmar-English intra-sentential code-switching collected from real online IT teaching sessions. It addresses the challenges that code-switching poses to Myanmar ASR systems, such as language boundary confusion, pronunciation mixing, and vocabulary expansion, which conventional monolingual models handle poorly. The MEASR dataset contains around 10 hours of speech recorded at 16 kHz, mono channel, with 16-bit resolution. The paper details the dataset's design and provides an analysis to support improved code-switching automatic speech recognition (CS-ASR) through bilingual training, pronunciation adaptation, and language identification.
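The recording format stated in the abstract (16 kHz, mono, 16-bit) can be checked per file with a short script. The following is a minimal sketch using only Python's standard `wave` module. The helper names `matches_stated_format`, `pcm_bytes`, and `make_silent_wav` are hypothetical and are not part of the MEASR release, and the sketch assumes the audio is distributed as WAV files, which the abstract does not state.

```python
import io
import wave

# Recording format stated in the abstract.
SAMPLE_RATE_HZ = 16_000
CHANNELS = 1
SAMPLE_WIDTH_BYTES = 2  # 16-bit samples


def matches_stated_format(wav_file) -> bool:
    """Return True if a WAV file (path or file object) is 16 kHz, mono, 16-bit."""
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getframerate() == SAMPLE_RATE_HZ
            and wav.getnchannels() == CHANNELS
            and wav.getsampwidth() == SAMPLE_WIDTH_BYTES
        )


def pcm_bytes(hours: float) -> int:
    """Uncompressed PCM payload size in bytes for a given duration."""
    return int(hours * 3600 * SAMPLE_RATE_HZ * CHANNELS * SAMPLE_WIDTH_BYTES)


def make_silent_wav(seconds: float) -> io.BytesIO:
    """Build an in-memory silent WAV in the stated format (illustrative clip only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        n_frames = int(seconds * SAMPLE_RATE_HZ)
        wav.writeframes(b"\x00" * (n_frames * SAMPLE_WIDTH_BYTES))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    clip = make_silent_wav(0.1)
    print(matches_stated_format(clip))
    print(pcm_bytes(10))
```

At this format, 10 hours of audio corresponds to roughly 1.15 GB of uncompressed PCM data. Because the abstract gives the duration only as "around 10 hours", this figure is an estimate rather than the dataset's actual size.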

Topics & Keywords

Speech Recognition and Synthesis · Natural Language Processing Techniques · ICT in Developing Communities

UN Sustainable Development Goals

Quality Education

Publication Details

DOI: 10.1109/o-cocosda68185.2025.11385063

Field-Weighted Citation Impact: 0.00