Objective: This dataset is designed to benchmark and improve Visual Question Answering (VQA) systems in the context of Vietnamese tourism and cultural heritage. It addresses the lack of high-quality, regionally specific multimodal data for Southeast Asia.

Data Content: The dataset comprises thousands of images sourced from Wikimedia Commons, paired with human-verified question-answer sets in Vietnamese. The questions span five levels of complexity, from basic object identification to deep cultural reasoning.

Methodology:
1. Sourcing: Legally compliant images were filtered from Wikimedia Commons.
2. Annotation: Expert annotators generated QA pairs, focusing on architectural details, historical significance, and spatial reasoning.
3. Validation: The data was cleaned using automated scripts to ensure 100% synchronization between the metadata (JSON) and image files, with factual auditing via Large Multimodal Models (LMMs).

Usage: The data is split into train and test sets (.json). It is intended for training, fine-tuning, and evaluating Vision-Language Models (VLMs) on localized cultural contexts.
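The validation step described above (keeping the JSON metadata and the image files in sync) can be sketched as a small Python check. The JSON field name `image` and the `.jpg` extension are assumptions for illustration, not the dataset's documented schema:

```python
import json
from pathlib import Path


def load_qa(json_path):
    """Load QA annotations from a split file (e.g. train/test .json).

    Assumes a list of records; the exact schema (image, question,
    answer, level) is hypothetical.
    """
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


def check_sync(records, image_dir):
    """Return (referenced-but-missing, on-disk-but-unreferenced) sets.

    Both sets being empty corresponds to the "100% synchronization"
    between metadata and image files mentioned in the methodology.
    """
    referenced = {rec["image"] for rec in records}
    on_disk = {p.name for p in Path(image_dir).glob("*.jpg")}
    return referenced - on_disk, on_disk - referenced
```

A typical workflow would load a split with `load_qa("train.json")`, run `check_sync` against the image directory, and refuse to train unless both returned sets are empty.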