This repository provides two lemmatization models designed for use within the Stanza pipeline. Both models enable the automatic lemmatization of Italian texts, with a particular focus on historical variants of the language. The models differ in the annotation principles adopted in the training data.

Model 1 (named it_isdtohistofull_nocharlm_lemmatizer.pt): This model is trained on manually lemmatized data drawn from three distinct corpora: Dante's Divine Comedy, a collection of literary texts, and a set of letters from World War I. The annotation preserves lemmas that are closely aligned with the historical forms found in the texts, resulting in a richer but more variable lemma space.

Model 2 (named it_isdtnhistnfull_nocharlm_lemmatizer.pt): This model is trained on harmonized versions of the same corpora. The harmonization leaves the surface orthography of the texts unchanged and affects only the annotation principles: lemmas are standardized toward modern Italian forms, reducing sparsity and increasing consistency across the dataset.

Together, these models enable comparative experiments on the impact of different lemmatization strategies (historically faithful vs. normalized) on the processing of historical Italian texts.
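As a rough illustration of how the two models might be loaded and compared, the sketch below points Stanza's `lemma_model_path` option at one of the two files and collects token/lemma pairs. Several things here are assumptions rather than part of the repository: the local directory holding the `.pt` files, the helper names (`pipeline_kwargs`, `lemmatize`, `lemma_differences`), the `"historical"`/`"normalized"` labels, and the choice of the default Italian tokenizer, MWT, and POS models (which would need to be fetched first, for example with `stanza.download("it")`).

```python
from pathlib import Path

# File names as given in the description above; the labels are hypothetical.
MODEL_FILES = {
    "historical": "it_isdtohistofull_nocharlm_lemmatizer.pt",
    "normalized": "it_isdtnhistnfull_nocharlm_lemmatizer.pt",
}


def pipeline_kwargs(variant, model_dir="."):
    """Build keyword arguments for stanza.Pipeline with a custom lemmatizer.

    `model_dir` is an assumed local folder containing the downloaded .pt files.
    """
    return {
        "lang": "it",
        # Assumption: default Italian tokenize/mwt/pos models feed the lemmatizer.
        "processors": "tokenize,mwt,pos,lemma",
        "lemma_model_path": str(Path(model_dir) / MODEL_FILES[variant]),
    }


def lemmatize(text, variant, model_dir="."):
    """Return (token, lemma) pairs for `text` using the chosen model."""
    import stanza  # imported lazily so the helpers above work without Stanza

    nlp = stanza.Pipeline(**pipeline_kwargs(variant, model_dir))
    doc = nlp(text)
    return [(w.text, w.lemma) for s in doc.sentences for w in s.words]


def lemma_differences(pairs_a, pairs_b):
    """List (token, lemma_a, lemma_b) where two runs over the same tokens disagree."""
    return [
        (tok, lem_a, lem_b)
        for (tok, lem_a), (_, lem_b) in zip(pairs_a, pairs_b)
        if lem_a != lem_b
    ]
```

Running `lemmatize` once per variant on the same passage and passing both results to `lemma_differences` would give a simple view of where the historically faithful and normalized annotations diverge. Because both pipelines share the same tokenizer in this sketch, the token sequences should align position by position.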