An Annotated Corpus of Uzbek Business Reviews for Aspect-Based Sentiment Analysis

2026 · 0 citations · Dataset · Green Open Access

Authors

Sanatbek Matlatipov · Epic Systems (United States)

Abstract

This dataset contains 5,038 annotated business reviews designed for Aspect-Based Sentiment Analysis (ABSA). The reviews were scraped from Commeta Sharh, a publicly accessible business review platform in Uzbekistan. The dataset captures the natural linguistic diversity of the region, featuring mixed-language text (including Russian and Uzbek) alongside Uzbek-language metadata categories. The corpus spans 630 unique businesses across 23 domains (e.g., Education/Ta'lim). It serves as a resource for evaluating low-resource and code-switched NLP models, in particular for extracting business aspects and their associated sentiment polarities.

Dataset Characteristics

Total Reviews: 5,038 (filtered from a larger pool, excluding entries with fewer than five words)
Businesses Covered: 630
Business Domains: 23
Task: Aspect-Based Sentiment Analysis (ABSA), comprising Aspect Term Extraction (ATE) and Aspect Polarity Classification (APC)

Data Structure

The dataset is provided in JSON format. Each entry represents a single user review and contains the following fields:

review_id: A unique identifier for the review.
text: The raw text of the user review.
business_name: The name of the reviewed business.
business_category: The domain/industry of the business (e.g., "Ta'lim" for Education).
user_rating: The numerical rating given by the user (typically 1-5).
aspects: A list of extracted aspects, where each aspect contains:
    term: The specific word or phrase from the text representing the aspect.
    category: The broader category of the aspect (e.g., "xizmat" for service, "boshqalar" for others).
    polarity: The sentiment expressed toward the aspect (positive, negative, or neutral).
num_aspects: The total count of aspects identified in the text.
annotation_source: The model used for the automated annotation pipeline (e.g., qwen2.5-7b-finetuned).
parse_success: A boolean indicating whether the model output was successfully parsed into the JSON structure.
raw_output: The raw JSON string generated by the fine-tuned LLM before parsing.
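
The field layout above can be illustrated with a minimal Python sketch. It assumes the file's top level is a JSON array of review objects, which the description does not state explicitly. The sample record, its values, and the helper names load_reviews and polarity_counts are hypothetical and serve only as illustration; they are not taken from the dataset.

```python
import json
from collections import Counter

# Hypothetical sample record that follows the documented fields.
# All values are invented for illustration; they are not from the corpus.
# Assumption: the real file is a JSON array of such objects.
SAMPLE = '''[
  {
    "review_id": "r-0001",
    "text": "Xizmat yaxshi, lekin narxlar qimmat.",
    "business_name": "Example Business",
    "business_category": "Ta'lim",
    "user_rating": 4,
    "aspects": [
      {"term": "Xizmat", "category": "xizmat", "polarity": "positive"},
      {"term": "narxlar", "category": "boshqalar", "polarity": "negative"}
    ],
    "num_aspects": 2,
    "annotation_source": "qwen2.5-7b-finetuned",
    "parse_success": true,
    "raw_output": "{}"
  }
]'''


def load_reviews(json_text):
    """Parse JSON text and keep only entries whose annotation parsed successfully."""
    records = json.loads(json_text)
    return [r for r in records if r.get("parse_success")]


def polarity_counts(reviews):
    """Count (aspect category, polarity) pairs across all reviews."""
    counts = Counter()
    for review in reviews:
        for aspect in review.get("aspects", []):
            counts[(aspect["category"], aspect["polarity"])] += 1
    return counts


reviews = load_reviews(SAMPLE)
counts = polarity_counts(reviews)
for (category, polarity), n in sorted(counts.items()):
    print(category, polarity, n)
```

Filtering on parse_success first is a design choice made here so that entries whose aspects list may be empty or unreliable do not skew the counts. To read the actual file, the SAMPLE string would be replaced by the file's contents.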

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.18790639
