This repository contains a curated Bengali fake news detection dataset comprising 10,205 full-text news articles collected from Bangladeshi online sources and annotated with binary labels (Real and Fake) across two major domains: Politics and Sports. The dataset is designed to support research in low-resource Natural Language Processing (NLP), misinformation detection, and cross-domain text classification.

The dataset is provided in a single clean CSV file with three columns:

- Category: Domain of the article (Politics or Sports)
- Label: Authenticity label (Real or Fake)
- News_Article: Full Bengali Unicode news text

The corpus includes:

- 6,165 Real articles (60.4%)
- 4,040 Fake articles (39.6%)
- 5,962 Sports articles (58.4%)
- 4,243 Politics articles (41.6%)

All articles were collected through web scraping using BeautifulSoup and Selenium from reputable Bengali news portals for real news and from unreliable or satirical sources and public Facebook pages for fake news. Labels were assigned through source-based verification and cross-checking with fact-checking platforms.

Text length statistics show a strong linguistic contrast between real and fake news:

- Average length of real articles: 2,027 characters
- Average length of fake articles: 920 characters
- Total corpus size: around 16.2 million characters

This dataset is particularly valuable for:

- Binary fake news classification
- Cross-domain learning (Politics ↔ Sports)
- Low-resource language NLP research
- Transformer model evaluation
- Linguistic analysis of misinformation

The dataset focuses on Bangladesh-centric Bengali news content and does not include personal user data or private information. All content was collected from publicly accessible sources in compliance with platform redistribution policies.
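
The label, domain, and length statistics above can be recomputed from the CSV with a few lines of Python. The following is a minimal sketch using only the standard library, not the authors' own tooling. It assumes the file is UTF-8 encoded with a header row naming the three columns described above (Category, Label, News_Article), and that "characters" means Unicode code points as counted by Python's `len`. The helper names `load_rows` and `summarize` are illustrative.

```python
import csv
from collections import Counter, defaultdict


def load_rows(path):
    """Read the dataset CSV into a list of dicts keyed by column name."""
    # Assumption: UTF-8 with an optional byte-order mark ("utf-8-sig" accepts both).
    # newline="" lets the csv module handle line breaks inside quoted article text.
    with open(path, encoding="utf-8-sig", newline="") as f:
        return list(csv.DictReader(f))


def summarize(rows):
    """Return label counts, category counts, and mean article length per label."""
    label_counts = Counter()
    category_counts = Counter()
    char_totals = defaultdict(int)
    for row in rows:
        label = row["Label"].strip()
        label_counts[label] += 1
        category_counts[row["Category"].strip()] += 1
        # Length in Unicode code points; combining marks count separately.
        char_totals[label] += len(row["News_Article"])
    mean_chars = {lbl: char_totals[lbl] / n for lbl, n in label_counts.items()}
    return label_counts, category_counts, mean_chars
```

Calling `summarize(load_rows(path))` with the path to the downloaded CSV should give counts and averages that can be compared against the figures listed above. Small differences in average length are possible if the published statistics were computed with different whitespace handling.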