We present a longitudinal dataset of 27.3 million URLs sampled from the Internet Archive’s (IA) Wayback Machine, comprising approximately 3.8 billion archived web pages (mementos) spanning 1996 to 2021. Our goal was to construct a sample for revisiting fundamental questions about the size, nature, and persistence of the publicly archivable web, in particular the question, “How long does a web page last?” Although every web collection inevitably carries some bias, our sampling aimed to reduce imbalance by accounting for temporal coverage, domain frequency, URL depth, and MIME type. The dataset originates from IA’s ZipNum index and includes URLs of HTML pages, selected by filename extension to target pages intended for human interaction. Using IA’s CDX API, we categorized URLs by their first capture date and MIME type, focusing on 92 million entries with text/html types. To balance archival representation across years, we extracted top-level URLs from deep links for earlier years, when archiving was sparse. We aimed for consistent yearly sampling, targeting 1 million URLs first archived in each of the 26 years of our study. Oversampling from popular domains such as Yahoo and Twitter was mitigated through logarithmic-scale downsampling, and several sampling strategies were combined to ensure balanced representation across domains and time periods. This dataset, featuring TimeMaps of 27.3 million URLs, serves as a valuable resource for web archive studies and invites further exploration of web page durability and archival practices over time. Researchers can use the dataset to replicate our sampling strategies and conduct new inquiries into the archived web.
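The logarithmic-scale downsampling of over-represented domains can be sketched as follows. The abstract does not give the exact formula, so the per-domain cap of roughly log(frequency) and the log base used here are illustrative assumptions, not the paper's actual parameters:

```python
import math
import random

def downsample_by_domain(urls_by_domain, base=10):
    """Downsample so each domain contributes on the order of
    log(its URL count) URLs rather than its raw count.

    NOTE: illustrative sketch; the cap formula and `base` are
    assumptions, since the abstract only states that downsampling
    was done on a logarithmic scale.
    """
    sample = []
    for domain, urls in urls_by_domain.items():
        # Cap each domain's contribution on a logarithmic scale,
        # keeping at least one URL per domain.
        cap = max(1, round(math.log(len(urls), base)))
        sample.extend(random.sample(urls, min(cap, len(urls))))
    return sample

# A domain with 1,000 URLs contributes ~3 URLs (log10 of 1000),
# while a domain with a single URL still contributes that one URL.
urls = {
    "yahoo.com": [f"https://yahoo.com/p{i}" for i in range(1000)],
    "rare.org": ["https://rare.org/"],
}
balanced = downsample_by_domain(urls)
```

In practice, the per-domain URL lists would be built from CDX API responses (e.g., queries against `web.archive.org/cdx/search/cdx` filtered to `text/html` MIME types), with the first-capture timestamp used to bucket URLs by year before downsampling.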