Archiving Data Big and Small
Ilya Kreymer
12/15/2015
What is Web Archiving?
- Web = HTTP traffic
- Archiving = Preserving at high fidelity, saving it all
- Different from scraping and extraction
- 'Lossless' preservation: HTTP headers + full HTTP content
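A small sketch of what 'lossless' means here (the URL and payload below are made up for illustration): the archive keeps the raw HTTP response exactly as received, status line and headers included, not just extracted page content.

```python
# What a 'lossless' capture stores: the raw HTTP response as received --
# status line, headers, and body -- rather than scraped/extracted text.
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 26\r\n"
    b"\r\n"
    b"<html><p>Hello</p></html>\n"
)

# A scraper might keep only the text "Hello"; an archive keeps everything,
# so headers (content types, dates, redirects) stay inspectable later.
headers, _, body = raw_response.partition(b"\r\n\r\n")
print(headers.decode().splitlines()[0])  # the preserved status line
```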
Why Web Archive?
Web content changes and disappears over time
Internet Archive Wayback Machine
Crawling since 1996, public access since 2001
Created by Alexa Internet and the Internet Archive
What are the components?
- Crawler (mostly Heritrix) preserving HTTP traffic to WARC files
- An index of URLs and their locations in WARC files
- A web app performing URL rewriting and retrieval of content from WARC files
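The index component above can be sketched as a lookup table from a canonicalized URL key to a WARC filename and byte range. The field layout and filenames below are an illustrative simplification of a CDX-style index, not any tool's exact format.

```python
# A minimal sketch of the URL index: each line maps a canonicalized URL
# key (plus capture timestamp) to the WARC file, byte offset, and length
# of the record holding that capture. Layout is illustrative only.
SAMPLE_INDEX = """\
com,example)/ 20151215000000 example-crawl-00001.warc.gz 4521 1337
com,example)/about 20151215000102 example-crawl-00001.warc.gz 5858 2048
"""

def lookup(index_text, urlkey):
    """Return (warc_filename, offset, length) for a URL key, or None."""
    for line in index_text.splitlines():
        key, timestamp, filename, offset, length = line.split()
        if key == urlkey:
            return filename, int(offset), int(length)
    return None

print(lookup(SAMPLE_INDEX, "com,example)/about"))
```

With the filename and offset in hand, the replay web app can seek directly to the record instead of scanning whole WARC files.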
Common Crawl
Crawling HTML content, primarily for analysis
Publicly available since 2008
What does it provide?
- Crawling via the Apache Nutch crawler into WARC files
- A URL index of all URLs and their locations in WARC files
- Link and metadata files provided (WAT)
- Extracted text files provided (WET)
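The public URL index can be queried over HTTP via a CDX-style index server. A hedged sketch of building such a query URL follows; the collection name is illustrative, and no network request is made here.

```python
# Sketch: building a query URL for the Common Crawl URL index server.
# The collection name "CC-MAIN-2015-48" is illustrative; each crawl
# publishes its own collection. No network access is performed here.
from urllib.parse import urlencode

def cc_index_query(collection, url_pattern):
    """Construct an index query URL asking for JSON output."""
    base = f"https://index.commoncrawl.org/{collection}-index"
    return base + "?" + urlencode({"url": url_pattern, "output": "json"})

print(cc_index_query("CC-MAIN-2015-48", "example.com/*"))
```

Each JSON result line would carry the WARC filename, offset, and length, so a single range request can fetch just one record from the public dataset.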
Webrecorder
On-demand, high-fidelity web archiving
What does it do?
- Records all traffic in real time (to WARC) as the user interacts with a web page
- Immediately 'replays' what has been recorded
- Allows users to create public or private collections
- New project, a lot more coming soon!
These projects all share...
... a common format
WARC (Web ARChive)
The WARC (Web ARChive) Format
- Standardized, almost ubiquitous across web archiving initiatives
- Created in collaboration between the Internet Archive and many national libraries
- Improvement on the previous ARC format
- Designed to fully store HTTP request and response traffic, and to support deduplication, metadata, and other arbitrary resources
- WARC 1.0 an ISO standard (ISO 28500) since 2009
- WARC 1.1 revision in progress: https://github.com/iipc/warc-specifications
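To make the record layout concrete, here is a hand-rolled sketch of a single WARC 1.0 'response' record using only the standard library. Real archiving tools handle this (and required fields, digests, etc.); the fixed date and the helper name are assumptions for illustration.

```python
# Sketch of one WARC 1.0 'response' record: named WARC headers, a blank
# line, the captured HTTP payload, then two CRLFs separating records.
# The fixed WARC-Date is illustrative; Content-Length counts the payload.
import uuid

def make_response_record(target_uri, http_payload: bytes) -> bytes:
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        "WARC-Date: 2015-12-15T00:00:00Z\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http_payload)}\r\n"
        "\r\n"
    )
    return headers.encode() + http_payload + b"\r\n\r\n"

record = make_response_record("http://example.com/",
                              b"HTTP/1.1 200 OK\r\n\r\nhello")
print(record.decode().splitlines()[0])  # WARC/1.0
```

Note how the payload is the full raw HTTP response, which is what makes WARC-based archiving 'lossless'.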
WARC Format: Details
- A WARC file contains one or more concatenated records
- Each record can be (and often is) gzip compressed
- .warc.gz extension if records are gzip compressed
- .warc extension if not
- The entire file is NOT gzip compressed as a single stream; each record is its own gzip member
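This per-record compression can be sketched with the standard library alone: compress each record separately, then concatenate the gzip members. The record bytes below are placeholders, not real WARC records.

```python
# Sketch of the .warc.gz layout: each record is gzip-compressed on its
# own, and the resulting gzip members are simply concatenated.
import gzip

records = [b"WARC/1.0\r\n...record one...\r\n\r\n",
           b"WARC/1.0\r\n...record two...\r\n\r\n"]

# Compress each record separately, then concatenate -- NOT one big stream.
warc_gz = b"".join(gzip.compress(rec) for rec in records)

# Python's gzip module reads across member boundaries, so decompressing
# the whole file yields all records back-to-back.
assert gzip.decompress(warc_gz) == b"".join(records)
```

This layout is why the index matters: a reader can seek to a record's byte offset and decompress just that one member, without touching the rest of the file.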