LLM-based Translation and Topic Modelling of Folk Songs

2026 · 0 citations · Other · Green Open Access

Authors

Olha Petrovych · Estonian Literary Museum

Kaarel Veskis · Estonian Literary Museum

Abstract

This dataset accompanies the forthcoming article “LLM-based Translation and Topic Modelling of Folk Songs: A Computational Comparison of Ukrainian and Estonian Oral Traditions.” It supports a computational study of thematic structures in two folk song traditions: 2,852 Estonian runosongs from the Järvamaa region and 2,762 Ukrainian songs from the Podillia region. To enable cross-linguistic comparison, both corpora were translated into English using Claude 3.5 Sonnet with prompts designed to preserve lexical consistency and reflect genre and dialect features. The translated texts were processed using standard NLP techniques and analysed with topic modelling (LDA and BERTopic), Ward’s hierarchical clustering, and t-SNE visualisation methods in a Google Colab environment.

The uploaded ZIP archive contains four folders (code, data, tables, figures) and a license file.

Code. Topic_Modeling_of_Folk_Songs.py: the Python script written and executed in Google Colab. It contains the full workflow for text preprocessing, TF–IDF vectorisation, topic modelling (LDA and BERTopic), hierarchical clustering (Ward’s method), and t-SNE visualisation.

Data. Tolgitud_2852_valitud_Järva_laulu.csv: Estonian runosongs from Järvamaa with their LLM-based English translations. translated_podillia_songs.csv: Ukrainian folk songs from Podillia with their LLM-based English translations. Both files include the song texts and metadata used for computational analysis.

Tables. This folder contains all tables generated during the computational analysis. These tables form the quantitative basis for the results interpreted and discussed in the article.

Figures. This folder contains all figures produced during the computational analysis in Google Colab.

Together, these materials provide full transparency and reproducibility of the computational workflow underlying the study and allow other researchers to inspect, reuse, or extend the analysis.
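For readers unfamiliar with the TF–IDF vectorisation step named in the workflow, the following is a minimal, standard-library-only sketch of the idea. It is not taken from Topic_Modeling_of_Folk_Songs.py: the function names, the tokenisation rule, the unsmoothed weighting (term frequency times log(N / document frequency)), and the three example lines are all assumptions made for illustration, and library implementations typically use smoothed variants.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    # Lowercase, split on whitespace, strip surrounding punctuation.
    words = (w.strip(PUNCTUATION) for w in text.lower().split())
    return [w for w in words if w]


def tfidf(docs):
    """Return one {term: weight} dict per document (unsmoothed TF-IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented placeholder lines, not drawn from the dataset.
songs = [
    "the cuckoo sings in the green forest",
    "the cuckoo calls over the river",
    "mother bakes bread at dawn",
]
vectors = tfidf(songs)

# Songs sharing vocabulary score higher than songs with no terms in common.
print(cosine(vectors[0], vectors[1]))
print(cosine(vectors[0], vectors[2]))
```

Vectors of this kind are the usual input to the later stages the abstract lists (clustering and low-dimensional visualisation); the actual parameters used in the study are documented in the script inside the archive.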

Topics & Keywords

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.19352677
