Purpose: This study presents a framework for applying natural language processing techniques to analyze and classify hotel customer reviews.

Design/methodology/approach: Using a data set of over 500,000 hotel reviews, a supervised machine learning model is developed to predict whether a review is good or bad based on its textual content. The approach involves a comprehensive data preprocessing pipeline, including tokenization, stop-word removal and lemmatization. For feature engineering, sentiment analysis scores (from the Valence Aware Dictionary and sEntiment Reasoner, VADER) are combined with basic text metrics and text vectorization techniques such as Doc2Vec and TF-IDF.

Findings: Because the data set is heavily imbalanced, with a very low percentage of negative reviews, model performance is evaluated using the precision–recall curve and the average precision (AP) metric, which are better suited to such scenarios than the traditional receiver operating characteristic curve. The final model, a random forest classifier, achieved an AP of 0.37, demonstrating its effectiveness in identifying the minority class of negative reviews. The results indicate that sentiment analysis features are the most influential in predicting reviewer satisfaction.

Practical implications: The framework allows negative reviews to be recognized in time to act, facilitating immediate service recovery and proactive reputation management. At a strategic level, the model can also be applied to uncover recurring operational issues, track customer satisfaction trends and derive marketing insights from the reviews.

Originality/value: This paper provides a foundation for developing automated systems that enable hotels to better understand and respond to customer feedback in real time.
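To make the AP metric concrete, the following is a minimal sketch of how average precision can be computed from ranked model scores. It is not the authors' code: the function name `average_precision`, the toy labels and the toy scores are all hypothetical, and the formula used is the common step-wise definition, AP = Σ (Rₙ − Rₙ₋₁) · Pₙ, evaluated at each distinct score threshold.

```python
def average_precision(labels, scores):
    """Step-wise average precision: sum of (recall gain) * precision.

    labels: 1 for the minority class (here, a negative review), 0 otherwise.
    scores: higher means the model considers the review more likely negative.
    Both arguments are illustrative; the paper does not specify this interface.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")

    # Rank reviews from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    ap = 0.0
    true_pos = 0
    seen = 0
    prev_recall = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every review tied at this score before measuring.
        while i < len(ranked) and ranked[i][0] == threshold:
            true_pos += ranked[i][1]
            seen += 1
            i += 1
        precision = true_pos / seen
        recall = true_pos / total_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical toy data: 2 negative reviews among 10 (20% prevalence).
toy_labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_ap = average_precision(toy_labels, toy_scores)  # 0.5 * 1.0 + 0.5 * 0.5 = 0.75
```

When every review receives the same score, this function returns the share of positives in the data, so an AP value is most informative when read against the minority-class prevalence and not against a fixed 0.5 baseline.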