We present a longitudinal dataset of 27.3 million URLs sampled from the Internet Archive’s (IA) Wayback Machine, comprising approximately 3.8 billion archived web pages (mementos) spanning 1996 to 2021. Our goal was to construct a sample for revisiting fundamental questions about the size, nature, and persistence of the publicly archivable web, in particular the question, “How long does a web page last?” Although every web collection inevitably carries some bias, our sampling aimed to reduce imbalance by accounting for temporal coverage, domain frequency, URL depth, and MIME type. The dataset originates from IA’s ZipNum index and includes URLs of HTML pages, selected by filename extension to target pages intended for human interaction. Using IA’s CDX API, we categorized URLs by their first capture date and MIME type, focusing on 92 million entries with text/html types. To balance archival representation across years, we extracted top-level URLs from deep links for earlier years, when archiving was sparse. We aimed for consistent yearly sampling, targeting 1 million URLs first archived in each of the 26 years of our study. Oversampling from popular domains such as Yahoo and Twitter was mitigated through logarithmic-scale downsampling, and several sampling strategies were combined to ensure balanced representation across domains and time periods. This dataset, featuring TimeMaps of 27.3 million URLs, serves as a valuable resource for web archive studies and invites further exploration of web page durability and archival practices over time. Researchers can use the dataset to replicate our sampling strategies and conduct new inquiries into the archived web.
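The logarithmic-scale downsampling of over-represented domains can be sketched as follows. The abstract does not give the exact formula, so the per-domain cap of roughly log(frequency) and the log base used here are illustrative assumptions, not the paper's actual parameters:

```python
import math
import random

def downsample_by_domain(urls_by_domain, base=10):
    """Downsample so each domain contributes on the order of
    log(its URL count) URLs rather than its raw count.

    NOTE: illustrative sketch; the cap formula and `base` are
    assumptions, since the abstract only states that downsampling
    was done on a logarithmic scale.
    """
    sample = []
    for domain, urls in urls_by_domain.items():
        # Cap each domain's contribution on a logarithmic scale,
        # keeping at least one URL per domain.
        cap = max(1, round(math.log(len(urls), base)))
        sample.extend(random.sample(urls, min(cap, len(urls))))
    return sample

# A domain with 1,000 URLs contributes ~3 URLs (log10 of 1000),
# while a domain with a single URL still contributes that one URL.
urls = {
    "yahoo.com": [f"https://yahoo.com/p{i}" for i in range(1000)],
    "rare.org": ["https://rare.org/"],
}
balanced = downsample_by_domain(urls)
```

In practice, the per-domain URL lists would be built from CDX API responses (e.g., queries against `web.archive.org/cdx/search/cdx` filtered to `text/html` MIME types), with the first-capture timestamp used to bucket URLs by year before downsampling.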