Recent advancements in information availability and computational capabilities have transformed the analysis of annual reports, integrating traditional financial metrics with insights from textual data. To extract actionable insights from this wealth of textual data, automated review processes, such as topic modeling, are essential. This study benchmarks classical approaches against contemporary neural techniques and introduces FinTextSim, a sentence-transformer fine-tuned for financial text. Using Item 7 and Item 7A of 10-K filings from S&P 500 companies (2016-2023), we systematically evaluate these models qualitatively and quantitatively. BERTopic in combination with FinTextSim consistently outperforms all alternatives, producing notably clearer, more coherent, and more financially relevant topic clusters. Compared to the most widely used standard embedding models and financial baselines, FinTextSim improves intratopic similarity by up to 71% and reduces intertopic similarity by more than 108%, highlighting the importance of domain-specific embeddings. Crucially, these gains in topic quality translate into predictive benefits: incorporating FinTextSim-derived topic features into a logistic regression framework for corporate performance prediction leads to a statistically significant two-percentage-point increase in both ROC-AUC and F1-score over a purely financial baseline. In contrast, off-the-shelf sentence-transformers and classical topic models introduce noise that degrades predictive performance. For non-linear classifiers, several textual representations yield modest gains, reflecting their greater capacity to absorb noisier features. However, FinTextSim remains the most stable and consistently strong performer across both linear and non-linear settings.
Overall, FinTextSim acts as a domain-adapted information filter, translating unstructured financial text into structured, semantically rich representations that human analysts and generic models often overlook. By bridging interpretability and predictive utility, it enables the extraction of economically relevant information from corporate narratives and supports more effective decision-making, resource allocation, and corporate performance forecasting.
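The abstract reports intratopic and intertopic similarity without defining them. As a rough illustration only, the sketch below shows one common way such scores could be computed from sentence embeddings and topic assignments: intratopic similarity as the mean pairwise cosine similarity among sentences in the same topic, and intertopic similarity as the mean cosine similarity between topic centroids. These definitions and the function names `intratopic_similarity` and `intertopic_similarity` are assumptions made for illustration, not the paper's confirmed method.

```python
import numpy as np


def _unit_rows(x):
    # Scale each row to unit length so dot products equal cosine similarity.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def intratopic_similarity(embeddings, labels):
    """Mean pairwise cosine similarity within each topic, averaged over topics.

    Assumed definition (hypothetical); topics with fewer than two members are skipped.
    """
    emb = _unit_rows(np.asarray(embeddings, dtype=float))
    labels = np.asarray(labels)
    scores = []
    for topic in np.unique(labels):
        members = emb[labels == topic]
        n = len(members)
        if n < 2:
            continue
        sims = members @ members.T
        # Drop the diagonal (self-similarity) before averaging.
        scores.append((sims.sum() - np.trace(sims)) / (n * (n - 1)))
    return float(np.mean(scores))


def intertopic_similarity(embeddings, labels):
    """Mean pairwise cosine similarity between topic centroids.

    Assumed definition (hypothetical); requires at least two topics.
    """
    emb = _unit_rows(np.asarray(embeddings, dtype=float))
    labels = np.asarray(labels)
    centroids = np.stack([emb[labels == t].mean(axis=0) for t in np.unique(labels)])
    centroids = _unit_rows(centroids)
    sims = centroids @ centroids.T
    k = len(centroids)
    return float((sims.sum() - np.trace(sims)) / (k * (k - 1)))
```

Under these assumed definitions, a well-separated topic model would show high intratopic and low intertopic similarity, which is the direction of improvement the abstract describes.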
Published in: Frontiers in Artificial Intelligence
Volume 9, pp. 1752103-1752103