LaFresCat: A studio-quality Catalan multi-accent speech dataset for text-to-speech synthesis

2026 · 0 citations · Journal Article · Hybrid Open Access

Authors

Àlex Peiró Lilja · Barcelona Supercomputing Center

Carme Armentano-Oller · Barcelona Supercomputing Center

José Joaquín Montes Giraldo · Barcelona Supercomputing Center

Wendy Elvira-García · Universitat de Barcelona

Ignasi Belart Esquerrà · Barcelona Supercomputing Center

Rodolfo Zevallos · Barcelona Supercomputing Center

Cristina España-Bonet

Abstract

Current text-to-speech (TTS) systems are capable of learning the phonetics of a language accurately given that the speech data used to train such models covers all phonetic phenomena. For languages with different varieties, this includes all their richness and accents. This is the case of Catalan, a mid-resourced language with several dialects or accents. Although there are various publicly available corpora, there is a lack of high-quality open-access data for speech technologies covering its variety of accents. Common Voice includes recordings of Catalan speakers from different regions; however, accent labeling has been shown to be inaccurate, and artificially enhanced samples may be unsuitable for TTS. To address these limitations, we present LaFresCat, the first studio-quality Catalan multi-accent dataset. LaFresCat comprises 3.5 h of professionally recorded speech covering four of the most prominent Catalan accents: Balearic, Central, North-Western, and Valencian. In this work, we provide a detailed description of the dataset design: utterances were selected to be phonetically balanced, detailed speaker instructions were provided, native speakers from the regions corresponding to the Catalan accents were hired, and the recordings were formatted and post-processed. The resulting dataset, LaFresCat, is publicly available. To preliminarily evaluate the dataset, we trained and assessed a lightweight flow-based TTS system, which is also provided as a by-product. We also analyzed LaFresCat samples and the corresponding TTS-generated samples at the phonetic level, employing expert annotations and Pillai scores to quantify acoustic vowel overlap. Preliminary results suggest a significant improvement in predicted mean opinion score (UTMOS), with an increase of 0.42 points when the TTS system is fine-tuned on LaFresCat, starting from a version pre-trained on Central Catalan data from Common Voice, rather than trained from scratch.
Subsequent human expert annotations achieved nearly 90% accuracy in accent classification for LaFresCat recordings. Although the TTS tends to homogenize pronunciation, it still learns distinct dialectal patterns. This assessment offers key insights for establishing a baseline to guide future evaluations of Catalan multi-accent TTS systems and further studies of LaFresCat.

• Creation of the first studio-quality Catalan multi-accent dataset for text-to-speech.
• Training of Matxa-TTS, a flow-based text-to-speech (TTS) system, to evaluate accent traits from LaFresCat.
• Annotators can perceptually differentiate LaFresCat samples based on their accents.
• Phonetic analysis shows TTS shifts toward Central Catalan, mirroring current usage.
• Catalan multi-accent dataset and TTS models are publicly available.
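
The abstract mentions Pillai scores as a measure of acoustic vowel overlap but does not say how they were computed. As a rough illustration only, the sketch below uses one common formulation: Pillai's trace from a one-way MANOVA over two vowel categories measured in F1/F2 space, where values near 0 indicate heavily overlapping vowel clouds and values near 1 indicate well-separated ones. The function name `pillai_score` and the formant values are hypothetical and are not taken from the paper or from LaFresCat.

```python
from statistics import fmean


def pillai_score(group_a, group_b):
    """Pillai's trace for two groups of (F1, F2) points.

    Returns a value in [0, 1]: near 0 means the two vowel clouds overlap
    almost completely, near 1 means they are well separated.
    """
    groups = [list(group_a), list(group_b)]
    pooled = groups[0] + groups[1]
    grand = [fmean(p[k] for p in pooled) for k in range(2)]

    # 2x2 between-group (H) and within-group (E) sums of squares
    # and cross-products.
    H = [[0.0, 0.0], [0.0, 0.0]]
    E = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        mean = [fmean(p[k] for p in g) for k in range(2)]
        for i in range(2):
            for j in range(2):
                H[i][j] += len(g) * (mean[i] - grand[i]) * (mean[j] - grand[j])
                E[i][j] += sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in g)

    # Total scatter T = H + E, inverted with the closed-form 2x2 inverse.
    T = [[H[i][j] + E[i][j] for j in range(2)] for i in range(2)]
    det = T[0][0] * T[1][1] - T[0][1] * T[1][0]
    if det == 0:
        raise ValueError("degenerate data: total scatter matrix is singular")
    T_inv = [[T[1][1] / det, -T[0][1] / det],
             [-T[1][0] / det, T[0][0] / det]]

    # Pillai's trace is tr(H @ T^-1).
    return sum(H[i][j] * T_inv[j][i] for i in range(2) for j in range(2))


# Hypothetical (F1, F2) measurements in Hz, for illustration only.
vowel_close = [(420, 2050), (440, 2000), (410, 2100), (450, 1980)]
vowel_open = [(560, 1850), (590, 1800), (570, 1880), (600, 1790)]

same_score = pillai_score(vowel_close, vowel_close)  # identical clouds
diff_score = pillai_score(vowel_close, vowel_open)   # clearly separated clouds
```

In a comparison like the one the abstract describes, such a score could be computed once on recorded samples and once on the corresponding TTS output for the same vowel pair; a lower score on the synthesized side would be consistent with the homogenization the authors report. Real analyses typically use many more tokens per vowel and often normalize formants per speaker first.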

Topics & Keywords

Speech Recognition and Synthesis · Phonetics and Phonology Research · Voice and Speech Disorders

UN Sustainable Development Goals

Quality Education

Publication Details

Published in: Computer Speech & Language

Volume 100, Article 101945

DOI: 10.1016/j.csl.2026.101945

Field-Weighted Citation Impact: 0.00