talks




Collection of reveal.js slides from various talks

On Github ikreymer / talks

Archiving Data Big and Small

Ilya Kreymer

12/15/2015

About Me

What is Web Archiving

  • Web = HTTP traffic
  • Archiving = Preserving at high-fidelity, saving it all
  • Different from Scraping, Extraction
  • 'Lossless' preservation, HTTP headers + full HTTP content

Why Web Archive

Web content changes and disappears over time

Who is Web Archiving?

Wayback Machine

Crawling since 1996, public access since 2001

created by Alexa and Internet Archive

What are the components?

  • Crawler (mostly Heritrix) preserving HTTP traffic to WARC files
  • An index of urls and their locations in WARC files.
  • A web app performing url rewriting and retrieval of content from WARC files.
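How the index and the retrieval step fit together can be sketched roughly as follows. This is a minimal in-memory illustration, not how the Wayback Machine itself is implemented: `IndexEntry`, `build_index`, `fetch`, the file name, and the sample captures are all hypothetical names and values chosen for the example.

```python
import io
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """Where one capture lives (hypothetical layout for this sketch)."""
    filename: str
    offset: int
    length: int


def build_index(filename, captures):
    """Concatenate captures into one blob and record where each one landed."""
    blob = io.BytesIO()
    index = {}
    for url, timestamp, record in captures:
        offset = blob.tell()
        blob.write(record)
        index[(url, timestamp)] = IndexEntry(filename, offset, len(record))
    return blob.getvalue(), index


def fetch(archives, index, url, timestamp):
    """Look up a capture in the index and read only its byte range."""
    entry = index[(url, timestamp)]
    stream = io.BytesIO(archives[entry.filename])
    stream.seek(entry.offset)
    return stream.read(entry.length)


# Made-up captures: (url, timestamp, raw record bytes)
captures = [
    ("http://example.com/", "20131204164732", b"first capture"),
    ("http://example.com/about", "20131204164800", b"second capture"),
]
data, index = build_index("example.warc", captures)
archives = {"example.warc": data}

second = fetch(archives, index, "http://example.com/about", "20131204164800")
```

Because each index entry stores an offset and a length, the web app only needs to seek and read one record rather than scan the whole file.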

Other 'Wayback Machines'

Many other, lesser-known public web archives:

International Internet Preservation Consortium group

CommonCrawl

Crawling HTML content primarily for analysis

Publicly available, since 2008

What does it provide?

  • Crawling via Apache Nutch crawler into WARC files
  • A url index of all urls and their locations in WARC files
  • Link and Metadata files provided (WAT)
  • Extracted Text Files provided (WET).

Webrecorder

On-demand high fidelity web archiving

What does it do?

  • Records all traffic in real time (to WARC) as user interacts with a web page.
  • 'Replays' what has been recorded immediately.
  • Allows user to create public or private collections.
  • New project, a lot more coming soon!

These projects all share...

... a common format

WARC (Web ARChive)

The WARC (Web ARChive) Format

  • Standardized, almost ubiquitous across web archiving initiatives.
  • Created in collaboration between the Internet Archive and many national libraries
    • Improvement on previous ARC format
  • Designed to fully store HTTP request and response traffic, support deduplication, metadata, other arbitrary resources
  • WARC 1.0 ISO Standard (ISO 28500) since 2009
  • WARC 1.1 revision in progress: https://github.com/iipc/warc-specifications

WARC Format: Details

  • WARC file contains one or more concatenated records
  • Each record can be (often is) gzip compressed
  • .warc.gz extension if records are gzip compressed
  • .warc extension if not gzip compressed
  • The file as a whole is NOT gzip compressed
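A small sketch of what per-record compression means in practice: each record is its own gzip member, and the members are simply concatenated. The record contents below are placeholders rather than real WARC records, and `iter_gzip_members` is a hypothetical helper name.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed bytes of each gzip member in `data`."""
    while data:
        # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        member = decomp.decompress(data) + decomp.flush()
        yield member
        # Bytes left over after this member are the start of the next one
        data = decomp.unused_data


# Placeholder "records", each compressed on its own, then concatenated
records = [b"record one\r\n\r\n", b"record two\r\n\r\n"]
compressed = [gzip.compress(r) for r in records]
warc_gz = b"".join(compressed)

members = list(iter_gzip_members(warc_gz))

# Since every record is a separate member, a reader can start at a known
# byte offset and decompress just that record.
offset = len(compressed[0])
second = next(iter_gzip_members(warc_gz[offset:]))
```

This is why the whole file is not compressed as one stream: doing so would force a reader to decompress everything before the record it wants.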

WARC Format: Details

  • Each record contains MIME-style WARC headers, followed by HTTP headers, followed by HTTP payload
  • HTTP response record, WARC-Type: response

    WARC/1.0
    WARC-Type: response
    WARC-Date: 2013-12-04T16:47:32Z
    WARC-Record-ID: <>
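The three-part layout (WARC headers, then HTTP headers, then HTTP payload) can be sketched with a small parser for one uncompressed response record. This is an illustration under assumptions, not a complete WARC reader: the function names are hypothetical, and the sample record uses a made-up all-zero record ID, target URI, and HTTP response.

```python
def parse_headers(block: bytes):
    """Split a header block into its first line and a dict of fields."""
    lines = block.decode("utf-8").split("\r\n")
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return lines[0], fields


def parse_response_record(record: bytes):
    """Parse one uncompressed WARC response record into its parts."""
    # WARC headers end at the first blank line
    warc_head, _, rest = record.partition(b"\r\n\r\n")
    version, warc_fields = parse_headers(warc_head)
    # Content-Length counts the record block: HTTP headers plus payload
    block = rest[: int(warc_fields["Content-Length"])]
    http_head, _, payload = block.partition(b"\r\n\r\n")
    status_line, http_fields = parse_headers(http_head)
    return version, warc_fields, status_line, http_fields, payload


# Made-up HTTP response used as the record block
http = b"\r\n".join([
    b"HTTP/1.1 200 OK",
    b"Content-Type: text/plain",
    b"Content-Length: 5",
    b"",
    b"hello",
])

# Sample record; the record ID and target URI are placeholders
record = b"\r\n".join([
    b"WARC/1.0",
    b"WARC-Type: response",
    b"WARC-Date: 2013-12-04T16:47:32Z",
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>",
    b"WARC-Target-URI: http://example.com/",
    b"Content-Length: " + str(len(http)).encode("ascii"),
    b"",
    http,
]) + b"\r\n\r\n"

version, warc_fields, status_line, http_fields, payload = parse_response_record(record)
```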
