From Document-Level to Segment-Level: LLM-Based Terminology Extraction for Translation Workflows

2026 · 0 citations · Preprint · Green Open Access

Authors

Jakub Absolon · University of Ss. Cyril and Methodius in Trnava

Abstract

This preprint introduces an approach to terminology extraction in translation workflows called Segment-Level: LLM-Based Terminology Extraction. Traditional terminology extraction methods typically operate at document or corpus level, which may limit their usefulness in Computer-Assisted Translation (CAT) environments, where translators work primarily with individual segments. Drawing on practical experience with CAT tools such as Phrase, Trados, and Crowdin, as well as student-based experiments evaluating AI terminology extraction capabilities, this study identifies four limitations of document-level extraction: increased noise, reduced domain focus, over-extraction of general vocabulary, and limited usefulness for real-time translation workflows. To address these limitations, the paper proposes Segment-Level: LLM-Based Terminology Extraction, a prompt-based approach that uses Large Language Models (LLMs) to extract concept-based terminology candidates directly from individual source segments. The method is guided by ISO 704 terminology principles and emphasizes concept-oriented, domain-relevant, and translation-relevant terminology selection. A preliminary micro-study indicates promising results, suggesting that segment-level extraction (1) reduces noise in term candidate selection, (2) improves domain relevance, (3) reduces over-extraction, (4) enhances translation consistency, and (5) improves terminology relevance in translation workflows. The proposed method may support real-time terminology assistance in CAT tools, terminology management, machine translation customization, and human-in-the-loop translation workflows. This preprint presents the conceptual framework, methodology, prompt design, and preliminary observations supporting the feasibility of Segment-Level: LLM-Based Terminology Extraction.
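
As an illustration of what prompt-based extraction from a single source segment might look like, the following Python sketch builds one prompt per segment and filters the model's answer. It is not the prompt or pipeline from the preprint: the prompt wording, the function names (build_prompt, parse_terms, extract_terms), the JSON answer format, and the stand-in fake_llm are all assumptions made for this example. Any real LLM client could be passed in place of fake_llm.

```python
import json
from typing import Callable, List

# Hypothetical prompt wording; the preprint's actual prompt is not reproduced here.
PROMPT_TEMPLATE = (
    "You are a terminologist. From the source segment below, list only "
    "domain-specific terms that designate a concept (in the sense of ISO 704). "
    "Skip general vocabulary. Answer with a JSON array of strings.\n\n"
    "Domain: {domain}\nSegment: {segment}"
)


def build_prompt(segment: str, domain: str) -> str:
    """Fill the template for exactly one source segment."""
    return PROMPT_TEMPLATE.format(domain=domain, segment=segment)


def parse_terms(raw: str, segment: str) -> List[str]:
    """Parse the model answer and keep only candidates found in the segment."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # unusable answer: return no candidates rather than guess
    if not isinstance(items, list):
        return []
    lowered = segment.lower()
    seen = set()
    terms: List[str] = []
    for item in items:
        if not isinstance(item, str):
            continue
        term = item.strip()
        key = term.lower()
        # Drop empty strings, duplicates, and terms absent from the segment.
        if term and key in lowered and key not in seen:
            seen.add(key)
            terms.append(term)
    return terms


def extract_terms(segment: str, domain: str, llm: Callable[[str], str]) -> List[str]:
    """Run one segment through the supplied LLM callable."""
    return parse_terms(llm(build_prompt(segment, domain)), segment)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, returning a canned answer."""
    return '["torque wrench", "flange bolt", "carefully", "gasket"]'


if __name__ == "__main__":
    segment = "Tighten each flange bolt with a torque wrench."
    print(extract_terms(segment, "mechanical engineering", fake_llm))
```

Checking each candidate against the segment text is one simple way to limit noise: a term the model proposes that does not occur in the segment is discarded. In this sketch the check is a case-insensitive substring match, so inflected forms would need extra handling.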

Topics & Keywords

Biomedical Text Mining and Ontologies · Linguistics and Terminology Studies · Natural Language Processing Techniques

UN Sustainable Development Goals

Quality Education

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.19235259