Lemmatisation Models for Historical Variants of Italian

2026 · 0 citations · Dataset · Green Open Access

Authors

Simonetta Montemagni · Institute for Computational Linguistics “A. Zampolli”

Chiara Alzetta · Institute for Computational Linguistics “A. Zampolli”

Abstract

This repository provides two lemmatization models designed for use within the Stanza pipeline. Both models enable the automatic lemmatization of Italian texts, with a particular focus on historical variants of the language. The models differ in the annotation principles adopted in the training data.

Model 1 (it_isdtohistofull_nocharlm_lemmatizer.pt): This model is trained on manually lemmatized data drawn from three distinct corpora: Dante’s Divine Comedy, a collection of literary texts, and a set of letters from World War I. The annotation preserves lemmas that are closely aligned with the historical forms found in the texts, resulting in a richer but more variable lemma space.

Model 2 (it_isdtnhistnfull_nocharlm_lemmatizer.pt): This model is trained on harmonized versions of the same corpora. The harmonization leaves the surface orthography of the texts unchanged and affects only the annotation principles: lemmas are standardized toward modern Italian forms, reducing sparsity and increasing consistency across the dataset.

Together, these models enable comparative experiments on the impact of different lemmatization strategies (historically faithful vs. normalized) on the processing of historical Italian texts.
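As a rough illustration of how the two models might be plugged into Stanza and compared, the sketch below builds one pipeline per model and lists the tokens on which the lemmas differ. It relies on Stanza's `lemma_model_path` pipeline option for pointing the lemma processor at a custom model file. Everything else is an assumption made for the example rather than something the repository specifies: the `models` directory, the helper names (`pipeline_kwargs`, `lemmatize`, `lemma_disagreements`), the choice of processors, and the premise that the default Italian Stanza models have already been fetched with `stanza.download("it")`.

```python
from pathlib import Path

# File names as given in the abstract; the short keys are labels chosen for this sketch.
MODEL_FILES = {
    "historical": "it_isdtohistofull_nocharlm_lemmatizer.pt",
    "normalized": "it_isdtnhistnfull_nocharlm_lemmatizer.pt",
}


def pipeline_kwargs(variant, model_dir="models"):
    """Build stanza.Pipeline keyword arguments for one of the two lemmatizers.

    `model_dir` is a hypothetical local folder holding the downloaded .pt files.
    """
    return {
        "lang": "it",
        # Tokenization, multi-word token expansion and POS tagging use the
        # default Italian models; only the lemmatizer is swapped out.
        "processors": "tokenize,mwt,pos,lemma",
        "lemma_model_path": str(Path(model_dir) / MODEL_FILES[variant]),
    }


def lemmatize(text, variant, model_dir="models"):
    """Return (token, lemma) pairs for `text` using the chosen model."""
    import stanza  # imported lazily so the other helpers work without Stanza

    nlp = stanza.Pipeline(**pipeline_kwargs(variant, model_dir))
    doc = nlp(text)
    return [(w.text, w.lemma) for s in doc.sentences for w in s.words]


def lemma_disagreements(hist, norm):
    """List (token, historical_lemma, normalized_lemma) where the lemmas differ.

    Both inputs are (token, lemma) lists produced from the same text, so they
    are expected to be aligned token by token.
    """
    if len(hist) != len(norm):
        raise ValueError("token sequences are not aligned")
    return [(tok, l1, l2) for (tok, l1), (_, l2) in zip(hist, norm) if l1 != l2]
```

With both model files in place, a comparison on a single passage could look like `lemma_disagreements(lemmatize(text, "historical"), lemmatize(text, "normalized"))`. Building a pipeline on every call is wasteful for larger experiments, where the two pipelines would more sensibly be created once and reused.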

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.19371678
