Dataset for LENS: Layered Exploration of Narrative Structures in Telegram

2026 · 0 citations · Dataset · Green Open Access

Authors

Alfonso de Paz · Consejo Superior de Investigaciones Científicas

Berta Chulvi · NortonLifeLock (United States)

David Arroyo · Consejo Superior de Investigaciones Científicas

Abstract

This dataset constitutes the empirical foundation for the validation of the LENS (Layered Exploration of Narrative Structures) framework, a multiscalar methodology designed to dissect information flows in semi-opaque digital environments like Telegram. The dataset contains 1,216,335 messages extracted from 53 Telegram channels focused on the Spanish-speaking disinformation ecosystem (covering narratives such as the COVID-19 pandemic, conspiratorial frames, and anti-vaccine discourses). The longitudinal series spans 289 weeks, with data collected up to September 7, 2025.

The purpose of this dataset is strictly illustrative and analytical. It is designed to demonstrate the operational viability of the LENS framework in detecting attention peaks, discursive silos, and potential coordinated campaigns through semantic topology, rather than constituting an exhaustive census of the ecosystem.

Data Collection and Methodology

To construct a robust and replicable sample dataset, the data collection process strictly applied a set of predefined filters based on the Phase 2 workflow of the LENS framework. A conservative strategy was adopted to isolate specific communication dynamics, resulting in a final sample of 53 Telegram channels and a total corpus of 1,216,335 posts.

The following inclusion criteria were applied to the monitored content:

Channel Type (Channels only, no groups): The dataset exclusively includes broadcasting spaces where the discursive agenda is strictly set by the administrators, intentionally excluding decentralized discussion groups.
Minimum Age: Only channels created before February 24, 2020 (pre-COVID-19) were included to capture the baseline and subsequent evolution of narratives from the onset of the pandemic.
Language: The content must be predominantly in Spanish (at least 90% of the detected messages) to ensure the linguistic robustness of the downstream semantic analysis.

Data Structure (Data Dictionary)

The dataset is provided in tabular format. The variables are detailed below:

id: Internal unique identifier for the database record.
link: Direct public URL to the specific Telegram message.
msg_id: Original numeric identifier of the message within the Telegram channel.
channel_link: URL or username (t.me/...) of the Telegram channel from which the message was extracted.
sender: Identifier or name of the sender.
text: Raw text content of the published message.
created_at: Timestamp indicating the exact date and time the message was published.
views: Total number of accumulated views for the message at the time of extraction.
lang: Detected language of the message text (mostly 'es' for Spanish).
is_reply: Boolean value (True/False) indicating whether the message is a direct reply to a previous post within the channel.
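As an illustration of how the data dictionary above might be used, the following Python sketch converts string-valued rows into typed records and computes the share of Spanish messages per channel, which can be compared against the 90% language criterion. It is a minimal sketch under stated assumptions, not the authors' pipeline: the record does not specify the file format or how values are serialized, so the ISO 8601 timestamp format, the "True"/"False" strings, the blank-views handling, and the sample rows (including the channel name) are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Column names taken from the data dictionary.
COLUMNS = ["id", "link", "msg_id", "channel_link", "sender",
           "text", "created_at", "views", "lang", "is_reply"]


def parse_row(row: dict) -> dict:
    """Convert one string-valued row into typed values."""
    missing = [c for c in COLUMNS if c not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return {
        **row,
        "msg_id": int(row["msg_id"]),
        # Assumption: a blank view count is treated as 0.
        "views": int(row["views"]) if row["views"] else 0,
        # Assumption: timestamps are stored in an ISO 8601 compatible form.
        "created_at": datetime.fromisoformat(row["created_at"]),
        # Assumption: booleans are serialized as the strings "True"/"False".
        "is_reply": row["is_reply"].strip().lower() == "true",
    }


def spanish_share(rows) -> dict:
    """Fraction of messages per channel whose detected language is 'es'."""
    total = defaultdict(int)
    spanish = defaultdict(int)
    for r in rows:
        total[r["channel_link"]] += 1
        spanish[r["channel_link"]] += r["lang"] == "es"
    return {ch: spanish[ch] / total[ch] for ch in total}


# Hypothetical sample rows, invented purely for illustration.
SAMPLE = [
    {"id": "1", "link": "https://t.me/example_channel/10", "msg_id": "10",
     "channel_link": "t.me/example_channel", "sender": "example_channel",
     "text": "hola", "created_at": "2021-03-01 12:00:00", "views": "1500",
     "lang": "es", "is_reply": "False"},
    {"id": "2", "link": "https://t.me/example_channel/11", "msg_id": "11",
     "channel_link": "t.me/example_channel", "sender": "example_channel",
     "text": "hello", "created_at": "2021-03-01 12:05:00", "views": "",
     "lang": "en", "is_reply": "True"},
]

rows = [parse_row(r) for r in SAMPLE]
shares = spanish_share(rows)
```

If the file turns out to be a CSV, rows in this shape could be obtained from the standard library's `csv.DictReader`; the typed `created_at` values can then be grouped by week to reproduce a longitudinal series.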

Topics & Keywords

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.18768873
