Archiving Data Big and Small
Ilya Kreymer
12/15/2015
What is Web Archiving?
- Web = HTTP traffic
- Archiving = Preserving at high fidelity, saving it all
- Different from scraping and extraction
- 'Lossless' preservation: HTTP headers + full HTTP content
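A small sketch of what 'lossless' means here (the URL and payload below are made up for illustration): the archive keeps the raw HTTP response exactly as received, status line and headers included, not just extracted page content.

```python
# What a 'lossless' capture stores: the raw HTTP response as received --
# status line, headers, and body -- rather than scraped/extracted text.
raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Length: 26\r\n"
    b"\r\n"
    b"<html><p>Hello</p></html>\n"
)

# A scraper might keep only the text "Hello"; an archive keeps everything,
# so headers (content types, dates, redirects) stay inspectable later.
headers, _, body = raw_response.partition(b"\r\n\r\n")
print(headers.decode().splitlines()[0])  # the preserved status line
```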
Why Web Archive?
Web content changes and disappears over time
Internet Archive Wayback Machine
Crawling since 1996, public access since 2001
Created by Alexa Internet and the Internet Archive
What are the components?
- Crawler (mostly Heritrix) preserving HTTP traffic to WARC files
- An index of URLs and their locations in WARC files
- A web app performing URL rewriting and retrieval of content from WARC files
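The index component above can be sketched as a lookup table from a canonicalized URL key to a WARC filename and byte range. The field layout and filenames below are an illustrative simplification of a CDX-style index, not any tool's exact format.

```python
# A minimal sketch of the URL index: each line maps a canonicalized URL
# key (plus capture timestamp) to the WARC file, byte offset, and length
# of the record holding that capture. Layout is illustrative only.
SAMPLE_INDEX = """\
com,example)/ 20151215000000 example-crawl-00001.warc.gz 4521 1337
com,example)/about 20151215000102 example-crawl-00001.warc.gz 5858 2048
"""

def lookup(index_text, urlkey):
    """Return (warc_filename, offset, length) for a URL key, or None."""
    for line in index_text.splitlines():
        key, timestamp, filename, offset, length = line.split()
        if key == urlkey:
            return filename, int(offset), int(length)
    return None

print(lookup(SAMPLE_INDEX, "com,example)/about"))
```

With the filename and offset in hand, the replay web app can seek directly to the record instead of scanning whole WARC files.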
Common Crawl
Crawling HTML content, primarily for analysis
Publicly available since 2008
What does it provide?
- Crawling via the Apache Nutch crawler into WARC files
- A URL index of all URLs and their locations in WARC files
- Link and metadata files provided (WAT)
- Extracted text files provided (WET)
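The public URL index can be queried over HTTP via a CDX-style index server. A hedged sketch of building such a query URL follows; the collection name is illustrative, and no network request is made here.

```python
# Sketch: building a query URL for the Common Crawl URL index server.
# The collection name "CC-MAIN-2015-48" is illustrative; each crawl
# publishes its own collection. No network access is performed here.
from urllib.parse import urlencode

def cc_index_query(collection, url_pattern):
    """Construct an index query URL asking for JSON output."""
    base = f"https://index.commoncrawl.org/{collection}-index"
    return base + "?" + urlencode({"url": url_pattern, "output": "json"})

print(cc_index_query("CC-MAIN-2015-48", "example.com/*"))
```

Each JSON result line would carry the WARC filename, offset, and length, so a single range request can fetch just one record from the public dataset.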
Webrecorder
On-demand, high-fidelity web archiving
What does it do?
- Records all traffic in real time (to WARC) as the user interacts with a web page
- Immediately 'replays' what has been recorded
- Allows users to create public or private collections
- New project, a lot more coming soon!
These projects all share...
... a common format
WARC (Web ARChive)
The WARC (Web ARChive) Format
- Standardized, almost ubiquitous across web archiving initiatives
- Created in collaboration between the Internet Archive and many national libraries
- Improvement on the previous ARC format
- Designed to fully store HTTP request and response traffic, and to support deduplication, metadata, and other arbitrary resources
- WARC 1.0 an ISO standard (ISO 28500) since 2009
- WARC 1.1 revision in progress: https://github.com/iipc/warc-specifications
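To make the record layout concrete, here is a hand-rolled sketch of a single WARC 1.0 'response' record using only the standard library. Real archiving tools handle this (and required fields, digests, etc.); the fixed date and the helper name are assumptions for illustration.

```python
# Sketch of one WARC 1.0 'response' record: named WARC headers, a blank
# line, the captured HTTP payload, then two CRLFs separating records.
# The fixed WARC-Date is illustrative; Content-Length counts the payload.
import uuid

def make_response_record(target_uri, http_payload: bytes) -> bytes:
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        "WARC-Date: 2015-12-15T00:00:00Z\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        f"Content-Length: {len(http_payload)}\r\n"
        "\r\n"
    )
    return headers.encode() + http_payload + b"\r\n\r\n"

record = make_response_record("http://example.com/",
                              b"HTTP/1.1 200 OK\r\n\r\nhello")
print(record.decode().splitlines()[0])  # WARC/1.0
```

Note how the payload is the full raw HTTP response, which is what makes WARC-based archiving 'lossless'.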
WARC Format: Details
- A WARC file contains one or more concatenated records
- Each record can be (and often is) gzip compressed
- .warc.gz extension if records are gzip compressed
- .warc extension if not
- The entire file is NOT gzip compressed as a single stream; each record is its own gzip member
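This per-record compression can be sketched with the standard library alone: compress each record separately, then concatenate the gzip members. The record bytes below are placeholders, not real WARC records.

```python
# Sketch of the .warc.gz layout: each record is gzip-compressed on its
# own, and the resulting gzip members are simply concatenated.
import gzip

records = [b"WARC/1.0\r\n...record one...\r\n\r\n",
           b"WARC/1.0\r\n...record two...\r\n\r\n"]

# Compress each record separately, then concatenate -- NOT one big stream.
warc_gz = b"".join(gzip.compress(rec) for rec in records)

# Python's gzip module reads across member boundaries, so decompressing
# the whole file yields all records back-to-back.
assert gzip.decompress(warc_gz) == b"".join(records)
```

This layout is why the index matters: a reader can seek to a record's byte offset and decompress just that one member, without touching the rest of the file.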