Dataset for paper "Algorithmic Audit of Personalisation Drift in Polarising Topics on TikTok"

2026 · 0 citations · Dataset · green Open Access

Authors

Branislav Pecher · Kempelen Institute of Intelligent Technologies

Adrian Bindas · Kempelen Institute of Intelligent Technologies

Jan Jakubcik · Kempelen Institute of Intelligent Technologies

Matus Tuna · Kempelen Institute of Intelligent Technologies

Matúš Tibenský · Kempelen Institute of Intelligent Technologies

Šimon Liška · Kempelen Institute of Intelligent Technologies

Abstract

This dataset accompanies the paper “Algorithmic Audit of Personalisation Drift in Polarising Topics on TikTok”. It is designed for analysing video interactions and user engagement patterns on the TikTok website, and it contains records of the interactions of social media auditing agents with the TikTok website over the timespan of the study.

The video excerpts included in this dataset are used solely as units of content for analytical purposes. They do not represent, reflect, or imply the personal views, intentions, or stance of the individuals who created them. Content should be interpreted as data artifacts, not as statements attributable to any person.

To minimise the risk of third-party misuse, the dataset is available only to researchers for non-commercial research purposes, upon verification of an email address associated with an academic organisation.

Paper: TBA

Preprint: TBA

GitHub repository: https://github.com/kinit-sk/ai-auditology-personalisation-drift-tiktok

References

If you use this dataset in any publication, project, tool, or in any other form, please cite the following paper: TBA

Dataset Description

The dataset consists of 3 CSV files:

ai-auditology-personalisation-drift-tiktok_32_agents_polarizing_plus_neutral.csv: data for the first user group (neutral+polarising), consisting of 30 users from runs that were seeded with both a polarising and a neutral topic.

ai-auditology-personalisation-drift-tiktok_32_agents_polarizing_only.csv: data for the second user group (polarising only), consisting of an additional 32 users (4 per topic+stance combination) that are seeded only with a polarising topic (representing maximum polarity) but interact with a neutral topic during the interaction phase.

ai-auditology-personalisation-drift-tiktok_US_politics_4_agents_mixed_polarity.csv: data for the third user group (mixed polarity), seeded in an equal manner with only the US politics topic.

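The three files listed above can be read into one collection while keeping track of which user group each row came from. The following is a minimal sketch using only Python's standard library. The file names are taken from the description; the `data_dir` argument, the `user_group` key, the short group labels, and the UTF-8 encoding are assumptions made for illustration, not properties documented by the dataset.

```python
import csv
from pathlib import Path

PREFIX = "ai-auditology-personalisation-drift-tiktok_"

# File names as listed in the dataset description; the short group labels
# used as keys are assumptions made for this sketch.
GROUP_FILES = {
    "neutral+polarising": PREFIX + "32_agents_polarizing_plus_neutral.csv",
    "polarising only": PREFIX + "32_agents_polarizing_only.csv",
    "mixed polarity": PREFIX + "US_politics_4_agents_mixed_polarity.csv",
}


def load_groups(data_dir):
    """Read every group file found in data_dir into a list of row dicts.

    Each row gets an extra "user_group" key (hypothetical, added here only
    for convenience). Files that are not present are skipped.
    """
    rows = []
    for group, filename in GROUP_FILES.items():
        path = Path(data_dir) / filename
        if not path.exists():
            continue
        # UTF-8 is an assumption about how the CSV files are encoded.
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["user_group"] = group
                rows.append(row)
    return rows
```

Calling `load_groups("path/to/downloaded/files")` would return one dictionary per interaction, with all values still as strings; type conversion is left to the caller.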
The CSV files contain 28 columns (29 for the data in ai-auditology-personalisation-drift-tiktok_US_politics_4_agents_mixed_polarity.csv), capturing details such as session and video identifiers, timestamps, ad classifications, visual indicators, user demographics, and video metadata.

| Column name | Data type | Description | Example |
| --- | --- | --- | --- |
| interaction_number | integer | Unique integer per interaction per agent | 1,2,3… |
| video_url | string | URL of video the agent interacted with | https://www.tiktok.com/@author123 |
| video_id | string | TikTok unique video ID | 1234 |
| video_author | string | TikTok author name | author123 |
| video_description | string | Video description written by the video author, plus hashtags | This video is about… |
| video_time_duration | integer | Duration of video in seconds | 67.9333 |
| video_transcript | string | Speech transcript produced by an in-house Whisper model | Welcome to my video about… |
| video_transcript_language | string | Code for language detected in transcript | en, fr, … |
| video_action_skip | bool | Decision by user interaction predictor, TRUE if video is to be skipped | TRUE, FALSE |
| video_action_watch | bool | Decision by user interaction predictor, TRUE if video is to be watched | TRUE, FALSE |
| video_action_like | bool | Decision by user interaction predictor, TRUE if video is to be liked | TRUE, FALSE |
| video_action_bookmark | bool | Decision by user interaction predictor, TRUE if video is to be bookmarked | TRUE, FALSE |
| video_time_watch_loop_start | integer | UNIX timestamp of time when agent started watching particular video | 1765302470.8245792 |
| video_time_watch_loop_end | integer | UNIX timestamp of time when agent finished watching particular video | 1765302470.8245792 |
| video_time_skip | integer | UNIX timestamp of time when agent skipped particular video | 1765302470.8245792 |
| video_time_like | integer | UNIX timestamp of time when agent liked particular video | 1765302470.8245792 |
| video_time_bookmark | integer | UNIX timestamp of time when agent bookmarked particular video | 1765302470.8245792 |
| video_time_predict_interaction | integer | UNIX timestamp of time when user interaction predictor predicted how to interact with particular video | 1765302470.8245792 |
| agent_id | string | Unique ID of agent | agent_id |
| topic | string | Topic of interest of given agent | Vaccines, US Politics, Flatearth, Climate change, Cooking |
| stance | string | Stance towards the topic of interest of given agent | support, oppose |
| gender | string | Gender set for given agent in TikTok | male, female |
| country_code | string | Country of origin set for given agent | US |
| date_of_birth | string | Date of birth set for given agent in TikTok | 1/2/2005 |
| run_id | string | ID of given agent run | 1759515058.941394_main |
| predicted_topic_match | bool | TRUE if predicted_topic == topic of interest | TRUE, FALSE |
| predicted_stance_match | bool | TRUE if predicted stance == stance of given agent | TRUE, FALSE |
| predicted_topic | string | Topic predicted by data annotator using these data fields: video_author, video_description, video_transcript | Vaccines, US Politics, Flatearth, Climate change, Cooking |
| predicted_stance | string | Predicted stance towards the topic of interest of given agent. Only in ai-auditology-personalisation-drift-tiktok_US_politics_4_agents_mixed_polarity.csv | support, oppose |

Ethical considerations

Most of the ethical, legal and societal issues tied to this dataset were already described in the Ethical Considerations section of the associated paper. The most severe risks were tied to a Terms of Service (ToS) violation, various types of privacy intrusions, the possibility of third-party misuse, or the erosion of some privacy rights such as the right to erasure. The research from which this dataset resulted was conducted as part of a research project that obtained approval from the organisational Ethics Committee (decision as of December 17, 2024). To minimise any potential legal and ethical issues, we directly involved legal and ethics experts as part of this project.
Researchers and research engineers conducting this auditing study also participated in four ethics assessment workshops together with ethics and legal experts, where relevant ethical and legal challenges were identified and appropriate mitigations were proposed.

The execution of sockpuppeting audits requires creating automated bots and using them for data collection, which is a potential violation of the terms of service of the social media platforms. However, this breach of ToS is permitted by Article 40(12) of the EU Digital Services Act (DSA) if the research concerns systemic risks. This work directly addresses such a systemic risk by assessing social media platforms' compliance with obligations imposed by legislation, specifically the prohibition of profiling-based advertising to minors stated in Article 28(2) of the DSA, as foreseen by Recital 83 of the DSA.

Second, the interaction of the bots with the content on the platform may impact the platform and society (e.g., by increasing the view or like count). To limit this impact, we minimise the number of bots that we run. When it comes to data, we collect only publicly available metadata.

The user interaction model, which we use for annotation purposes to determine the topic of a video and its stance towards that topic, is based on a large language model, so its mistakes may lead to biased or incorrect findings. We address this problem by ad-hoc as well as systematic manual annotation of a selected subset of the dataset. This requires human annotation, which is done solely by the authors of the study, following recommendations from ethics experts in order to minimise possible negative consequences and ensure well-being.
Labels in the dataset that are derived from the predictions of the above-mentioned annotation system (namely predicted_topic, predicted_topic_match, and predicted_stance_match), as well as the transcript of the speech in the video (video_transcript), are products of statistical machine learning systems. They might therefore be inaccurate and may differ from the video authors' opinions and stances towards the topics of interest.

Finally, to support users' rights to rectification and erasure in case of the publication of incorrect or sensitive information, we provide a procedure for them to request the removal of their posts from the dataset or to flag inaccuracies in the data. To do this, users can contact the authors using the contact form provided for accessing the dataset.
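Because predicted_topic_match is a model-derived label, one cautious way to use it is as an aggregate per agent rather than as per-video ground truth. The sketch below parses the TRUE/FALSE flags and UNIX timestamps named in the column descriptions and computes, for each agent, the share of videos whose predicted topic matched the agent's topic, together with the total watch time taken from the watch-loop timestamps. The sample rows are invented for illustration, the helper names are hypothetical, and treating an empty timestamp cell as "no value" is an assumption about how the CSV files encode missing data.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows that reuse column names from the dataset description.
SAMPLE_CSV = """interaction_number,agent_id,video_action_watch,video_time_watch_loop_start,video_time_watch_loop_end,predicted_topic_match
1,agent_a,TRUE,1765302470.5,1765302480.5,TRUE
2,agent_a,FALSE,,,FALSE
3,agent_a,TRUE,1765302500.0,1765302530.0,TRUE
1,agent_b,TRUE,1765302470.0,1765302475.0,FALSE
"""


def parse_bool(value):
    """Map the TRUE/FALSE strings used in the CSV to Python booleans."""
    return value.strip().upper() == "TRUE"


def parse_timestamp(value):
    """Return a float UNIX timestamp, or None for an empty cell (assumed)."""
    value = value.strip()
    return float(value) if value else None


def summarise(rows):
    """Per agent: share of topic-matching videos and total watch seconds."""
    counts = defaultdict(
        lambda: {"videos": 0, "matches": 0, "watch_seconds": 0.0}
    )
    for row in rows:
        stats = counts[row["agent_id"]]
        stats["videos"] += 1
        stats["matches"] += parse_bool(row["predicted_topic_match"])
        start = parse_timestamp(row["video_time_watch_loop_start"])
        end = parse_timestamp(row["video_time_watch_loop_end"])
        # Only rows with both timestamps contribute to watch time.
        if start is not None and end is not None:
            stats["watch_seconds"] += end - start
    return {
        agent: {
            "match_rate": stats["matches"] / stats["videos"],
            "watch_seconds": stats["watch_seconds"],
        }
        for agent, stats in counts.items()
    }


summary = summarise(csv.DictReader(io.StringIO(SAMPLE_CSV)))
```

The same `summarise` function accepts any iterable of row dictionaries, so it could be applied to rows read from the real files; whether the real files leave timestamp cells empty for actions that did not occur would need to be checked against the data.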

Publication Details

Published in: Zenodo (CERN European Organization for Nuclear Research)

DOI: 10.5281/zenodo.19144520
