Web archives preserve the evolving web, yet the extent and quality of this preservation remain uneven and insufficiently understood at scale. We present a comprehensive macro-level analysis of Internet Archive holdings, focusing on HTML resources, the core of user-facing web content. Our study examines a curated dataset of 27.3 million URIs and 3.8 billion mementos across 7 million domains, using archival span, archival rate, resilience (HTTP stability), and fixity (content consistency) to characterize temporal coverage and preservation quality. Our findings reveal substantial disparities. By 2021, 83% of URIs were not re-archived, underscoring widening coverage gaps. Archival practices evolved from emphasizing root URIs of major sites (1996-2002) to deep links and, more recently, service or framework endpoints. Archival activity followed a heavy-tailed distribution (Gini coefficient 0.928), with a small fraction of URIs accounting for most mementos. Root URIs accumulated more captures and longer spans (average 14 years) than deep links (average 3 years), which were underrepresented and more prone to loss. Status codes further complicated interpretation: 29% of mementos were redirects and 5% were errors, with deep links five times more likely to return 404 errors than roots. Further analysis revealed a moderate negative correlation between span and resilience ($r=-0.31$) and a moderate positive correlation between fixity and resilience ($r=0.29$). Resilience benefited more directly from higher archival rates, particularly for deep links. Newer content (2015 and later) exhibited higher resilience even at lower archival rates, suggesting improved stability and modern archiving practices. These insights help archivists and researchers identify coverage gaps, assess preservation risks, and better interpret the biases of web archives.
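To make the concentration measure concrete, the following is a minimal sketch of how a Gini coefficient over per-URI memento counts could be computed. It uses the standard rank-weighted formula and is not necessarily the exact procedure used in this study; the function name `gini` and the toy counts are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def gini(counts: Sequence[float]) -> float:
    """Gini coefficient of non-negative values.

    0 means every URI has the same number of mementos; values near 1
    mean a few URIs hold almost all mementos.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive value")
    # Standard form: G = sum_i (2i - n - 1) * x_i / (n * sum_i x_i),
    # with x sorted ascending and i running from 1 to n.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)


# Hypothetical memento counts per URI (not data from the study):
# many rarely captured deep links and one heavily captured root URI.
toy_counts = [1, 1, 1, 2, 2, 3, 5, 8, 40, 5000]
print(f"Gini = {gini(toy_counts):.3f}")
```

A value close to 1, such as the 0.928 reported here, indicates that a small fraction of URIs accounts for most mementos.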