A practical guide to text mining with topic extraction

2015 · 53 citations · Review

Authors

Andrew T. Karl · Advanced Emissions Solutions (United States)

James Wisnowski · Advanced Emissions Solutions (United States)

W. Heath Rushing · Advanced Emissions Solutions (United States)

Abstract

Text analytics continue to proliferate as mass volumes of unstructured but highly useful data are generated at unbounded rates. Vector space models for text data, in which documents are represented by rows and words by columns, provide a translation of this unstructured data into a format that may be analyzed with statistical and machine learning techniques. This approach gives excellent results in revealing common themes, clustering documents, clustering words, and in translating unstructured text fields (such as an open-ended survey response) to usable input variables for predictive modeling. After discussing the collection and processing of text, we explore properties and transformations of the document-term matrix (DTM). We show how the singular value decomposition may be used to drastically reduce the size of the document space while also setting the stage for automatic topic extraction, courtesy of the varimax rotation. This latent semantic analysis (LSA) approach produces factors that are compatible with graphical exploration and advanced analytics. We also explore Latent Dirichlet Allocation for topic analysis. We reference published R packages to implement the methods and conclude with a summary of other popular open-source and commercial software packages.

WIREs Comput Stat 2015, 7:326–340. doi: 10.1002/wics.1361

This article is categorized under:

Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification

Statistical Learning and Exploratory Methods of the Data Sciences > Pattern Recognition

Statistical Learning and Exploratory Methods of the Data Sciences > Text Mining
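The pipeline named in the abstract (document-term matrix, singular value decomposition, varimax rotation) can be illustrated with a minimal sketch. The article itself references R packages; the Python/NumPy code below is not the authors' implementation. The function names `build_dtm`, `varimax`, and `lsa_topics` are hypothetical, and the whitespace tokenizer and raw term counts (no stop-word removal, stemming, or term weighting) are simplifying assumptions.

```python
import numpy as np


def build_dtm(docs):
    """Return (dtm, vocab): rows are documents, columns are terms (raw counts).

    Assumption: lowercase + whitespace tokenization, purely for illustration.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocab)}
    dtm = np.zeros((len(docs), len(vocab)))
    for i, toks in enumerate(tokenized):
        for tok in toks:
            dtm[i, index[tok]] += 1
    return dtm, vocab


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation; returns (rotated_loadings, rotation)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like target of the varimax criterion.
        target = rotated ** 3 - rotated * (rotated ** 2).sum(axis=0) / p
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt  # nearest orthogonal matrix to the target direction
        new_criterion = s.sum()
        if new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation, rotation


def lsa_topics(docs, n_topics=2, n_top_terms=3):
    """Truncated SVD of the DTM followed by varimax rotation of term loadings.

    Returns (topics, doc_scores, rotated_loadings), where topics is a list of
    the highest-|loading| terms for each rotated factor.
    """
    dtm, vocab = build_dtm(docs)
    u, s, vt = np.linalg.svd(dtm, full_matrices=False)
    term_loadings = vt[:n_topics].T * s[:n_topics]  # terms x topics
    rotated, rotation = varimax(term_loadings)
    doc_scores = u[:, :n_topics] @ rotation  # documents x topics
    topics = []
    for t in range(n_topics):
        order = np.argsort(-np.abs(rotated[:, t]))[:n_top_terms]
        topics.append([vocab[j] for j in order])
    return topics, doc_scores, rotated


if __name__ == "__main__":
    example_docs = [
        "cat dog cat pet",
        "dog cat dog",
        "stock bond stock market price",
        "bond stock bond",
        "pet dog",
    ]
    topics, doc_scores, _ = lsa_topics(example_docs, n_topics=2)
    for t, terms in enumerate(topics):
        print(f"topic {t}: {terms}")
```

Because the rotation is orthogonal, the product of the rotated document scores and the transposed rotated term loadings still equals the rank-`n_topics` approximation of the DTM; the rotation only redistributes the loadings so that each factor is dominated by fewer terms. The signs of the factors are arbitrary, which is why the sketch ranks terms by absolute loading.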

Topics & Keywords

Advanced Text Analysis Techniques

Text and Document Classification Technologies

Computational and Text Analysis Methods

Publication Details

Published in: Wiley Interdisciplinary Reviews Computational Statistics

Volume 7, Issue 5, pp. 326-340

DOI: 10.1002/wics.1361

Field-Weighted Citation Impact: 4.31
