On GitHub: ablwr / bittorrent_amia15

Seeding and Leeching: Collaborative Preservation using BitTorrent

Ashley Blewer, #amia15

Torrents in archives

I’m ready to stop talking about how torrents can be archives. Let’s talk about how we can use torrents in archives, or as archives. I'm gonna introduce some technical concepts about this technology. Knowledge of technology is power. So let’s talk about how torrents work. Let’s talk about the benefits of this technology.

Torrents as archives

I’m not here to talk politics or academics. Reciting introductory-level copyright laws is not just boring, it’s detrimental to moving the conversation forward. Copyright isn’t a factor here.

Copy, right?

Wait a minute. Maybe I’m being brash. What if copyright is an issue? Implementing digital rights management into BitTorrent profiles is possible. Private trackers require a handshake to occur between the person trying to join the torrent stream and the swarm they want to join. There’s a gate here that says “you have to be part of this club if you want to join our swarm.”

Tech stacks on stacks

Technical stuff… the essence of BitTorrent is that it is just a protocol for sharing data on the internet. What do I mean by protocol?

PROTOCOL

It’s a system of rules just like HTTP or SFTP or IMAP that allows things to move between computers in a certain way. And there are a lot of benefits to the way this protocol works, specifically.

HTTP?

(GET, PUT, POST, DELETE, etc)

HTTP is a protocol that is really good at sending documents over the internet. It’s the protocol that REST, representational state transfer, a commonly used software architecture style, is usually built on (GET, PUT, POST, DELETE, etc.).

BitTorrent protocol requirements

BitTorrent is a protocol too. Protocols have rules. What does this protocol require?

  • an ordinary web server
  • a static 'metainfo' file
  • a BitTorrent tracker
  • an 'original' downloader
  • the end user web browsers
  • the end user downloaders

BitTorrent as a file

In a “physical of the digital” way, torrents look like this: either a .torrent file or a magnet URI. Both point to the information needed to connect to a peer network and information about the target media file. The target is the intended thing to be downloaded; it comes in as pieces/chunks, and it can also be a directory, a file with files within it. The .torrent file is structured using Bencode, a fancy and flexible binary encoding for “storing and transmitting loosely structured data.” The benefit and speed of this protocol come from downloading collaboratively between peers, randomly (or not truly randomly, but not linearly at least) grabbing data as it is most readily available.
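To make Bencode a little less abstract, here is a minimal sketch of an encoder in Python. It only covers the four Bencode types (integers, byte strings, lists, dictionaries). The function name `bencode` and the tracker URL and file name in the usage lines are made up for illustration; they are not from any particular library or real torrent.

```python
def bencode(value):
    """Encode an int, str/bytes, list, or dict as Bencode bytes."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; Bencode has no boolean type.
        raise TypeError("Bencode has no boolean type")
    if isinstance(value, int):
        # Integers look like i<digits>e, e.g. i42e
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        # Byte strings look like <length>:<data>, e.g. 4:spam
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, list):
        # Lists look like l<items>e
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionaries look like d<key><value>...e with keys sorted as bytes
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


# Hypothetical, heavily trimmed metainfo structure (values are placeholders).
metainfo = {
    "announce": "http://tracker.example.org/announce",
    "info": {"name": "oral_history.mov", "length": 10, "piece length": 4},
}
encoded = bencode(metainfo)
```

A real .torrent file also carries a `pieces` value inside `info` holding the piece hashes; it is left out here to keep the sketch short.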

BitTorrent as a tracker

Tracker GET requests have the following keys:

  • info_hash: the 20-byte SHA-1 hash of the bencoded form of the info value from the metainfo file
  • peer_id: a string that a downloader uses as its ID, randomly generated
  • ip: IP or DNS name (optional)
  • port: the port number this peer is listening on
  • uploaded: the total amount uploaded so far
  • downloaded: the total amount downloaded so far
  • left: the number of bytes this peer still has to download
  • event: maps to started, completed, or stopped (optional)
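As a rough illustration of those keys in action, this sketch builds the URL for a tracker GET request using only the Python standard library. The function name `build_announce_url`, the tracker address, the peer ID, and the sample bencoded info value are all assumptions made up for the example; the optional ip key is left out.

```python
import hashlib
import os
from urllib.parse import urlencode


def build_announce_url(announce, bencoded_info, port, left,
                       uploaded=0, downloaded=0, event="started",
                       peer_id=None):
    """Build a tracker GET URL from the keys listed above."""
    # info_hash: SHA-1 of the bencoded info value from the metainfo file.
    info_hash = hashlib.sha1(bencoded_info).digest()
    if peer_id is None:
        # peer_id: 20 randomly generated bytes identifying this downloader.
        peer_id = os.urandom(20)
    params = {
        "info_hash": info_hash,
        "peer_id": peer_id,
        "port": port,
        "uploaded": uploaded,
        "downloaded": downloaded,
        "left": left,
        "event": event,
    }
    # urlencode percent-escapes the raw bytes of info_hash and peer_id.
    return announce + "?" + urlencode(params)


# Hypothetical tracker and torrent, for illustration only.
url = build_announce_url(
    "http://tracker.example.org/announce",
    b"d6:lengthi10e4:name4:teste",
    port=6881,
    left=10,
    peer_id=b"-XX0001-123456789012",
)
```

Nothing here contacts a tracker; it only shows how the keys end up in the query string.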

Peers

BitTorrent's peer protocol operates over TCP or uTP (uTorrent Transport Protocol). TCP enables two hosts to establish a connection and exchange streams of data (unlike IP, which works with packets). It’s interesting because TCP ensures data is received in order, but the pieces of a torrent can come in “randomly.” Either way, TCP guarantees the data is received.

Handshake

We call it a handshake. Like gentlemen.
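For the curious, the peer handshake is a short fixed-layout message. The sketch below follows the layout given in the protocol specification on bittorrent.org: a length byte, the string "BitTorrent protocol", 8 reserved bytes, the 20-byte info hash, and the 20-byte peer ID. The function names and the sample hash and ID values are invented for this example.

```python
PROTOCOL_NAME = b"BitTorrent protocol"


def build_handshake(info_hash, peer_id):
    """Assemble the 68-byte handshake a peer sends when a connection opens."""
    if len(info_hash) != 20 or len(peer_id) != 20:
        raise ValueError("info_hash and peer_id must each be 20 bytes")
    reserved = bytes(8)  # eight zero bytes, set aside for extensions
    return bytes([len(PROTOCOL_NAME)]) + PROTOCOL_NAME + reserved + info_hash + peer_id


def parse_handshake(data):
    """Split a handshake back into (protocol name, info_hash, peer_id)."""
    name_length = data[0]
    name = data[1:1 + name_length]
    body = data[1 + name_length + 8:]  # skip the 8 reserved bytes
    return name, body[:20], body[20:40]


# Placeholder values; a real info_hash comes from the metainfo file.
message = build_handshake(b"\x01" * 20, b"-XX0001-123456789012")
name, info_hash, peer_id = parse_handshake(message)
```

If the info hash in the reply does not match the torrent being requested, the connection is dropped, which is the "club membership" check at the peer level.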

TCP

TCP is really good at splitting things into chunks. I could nerd out about TCP for a long time. I think it’s a pretty cool protocol. But the important part that matters for this talk is that TCP is very good at making sure data is received. This is good for torrents because a torrent client has to constantly check how much data is available, how much has been received, how much is left to download, and when something is at 100%. We talk about fixity a lot as archivists, checking an entire file for fixity using a checksum to validate. But TCP and torrents are constantly checking EACH LITTLE PIECE for fixity. AND the whole thing.
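Here is a small sketch of what piece-level fixity checking can look like on the torrent side. It assumes the usual arrangement where the metainfo file stores one 20-byte SHA-1 hash per piece, concatenated into a single `pieces` value. The helper names and the tiny 4-byte piece length are made up to keep the example readable.

```python
import hashlib

HASH_SIZE = 20  # a SHA-1 digest is 20 bytes


def split_pieces(data, piece_length):
    """Cut data into fixed-size pieces; the last one may be shorter."""
    return [data[i:i + piece_length] for i in range(0, len(data), piece_length)]


def piece_hashes(data, piece_length):
    """Concatenate the SHA-1 digest of every piece, in order."""
    return b"".join(hashlib.sha1(piece).digest()
                    for piece in split_pieces(data, piece_length))


def verify_piece(index, piece, pieces_field):
    """Check one received piece against its expected hash."""
    expected = pieces_field[index * HASH_SIZE:(index + 1) * HASH_SIZE]
    return hashlib.sha1(piece).digest() == expected


# Toy "file" and piece length; real pieces are far larger.
original = b"abcdefghij"
hashes = piece_hashes(original, 4)

good = verify_piece(1, b"efgh", hashes)       # an intact piece passes
corrupt = verify_piece(1, b"efgX", hashes)    # a damaged piece fails
```

A piece that fails the check gets thrown away and requested again, so bad data never makes it into the finished file.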

Two things

These connections know two things about themselves and they ask two things of other peers.

1. Are you busy? 2. Are you interested?

Peer connections are always looking for a match.
Are you busy, are you interested. That can only mean one thing. OK. If it’s all good, downloading will proceed. Remember that this is also happening at such a granular level of the file — every little chunk of information is doing this. So a match is found and data can start to be transferred.
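In the protocol's own vocabulary, "busy" is called choking and "interested" is, well, interested. A minimal sketch of the state each end of a connection keeps might look like the following; the class and method names are invented for this example rather than taken from any client.

```python
from dataclasses import dataclass


@dataclass
class PeerConnection:
    """The two things each side knows about itself and about the other peer."""
    am_choking: bool = True        # I am too busy to send to the peer
    am_interested: bool = False    # the peer has pieces I want
    peer_choking: bool = True      # the peer is too busy to send to me
    peer_interested: bool = False  # the peer wants pieces I have

    def can_download(self):
        # A match: I want something and the peer is not too busy to send it.
        return self.am_interested and not self.peer_choking

    def can_upload(self):
        # The same match in the other direction.
        return self.peer_interested and not self.am_choking


connection = PeerConnection()      # connections start out choked and uninterested
before = connection.can_download()

connection.am_interested = True    # I tell the peer I am interested
connection.peer_choking = False    # the peer tells me it is not busy
after = connection.can_download()
```

Data only flows once both answers line up, and either side can change its answer at any time.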

Metainfo file!

How does a downloader know the download is finished and complete? The metainfo (torrent) file! It holds the hash of every piece, so the client knows when it has them all.

Randomness

Randomness. So this is all happening wildly, madly. I say randomly because that’s how it can be generally perceived. A little bit here, a little bit there. But there’s an algorithm underlying this randomness, which I guess makes it not random. It’s more like “give me whatever you can send over the fastest.” It’s like being really, really hungry. Torrents are HUNGRY.
Supermarket Sweep hungry. Making a mad dash to grab everything, and popping all the pieces (which are identically sized, minus the final piece) into their proper places.
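The "identically sized, minus the final piece" part and the "proper places" part can be sketched in a few lines. The function names and the toy 10-byte file are assumptions for illustration, and the shuffle is only a stand-in for pieces arriving out of order; it is not how a real client decides what to ask for next.

```python
import random


def piece_sizes(total_length, piece_length):
    """Return the size of every piece; only the final one may be smaller."""
    if total_length <= 0 or piece_length <= 0:
        raise ValueError("lengths must be positive")
    full_pieces, remainder = divmod(total_length, piece_length)
    sizes = [piece_length] * full_pieces
    if remainder:
        sizes.append(remainder)
    return sizes


def assemble(pieces_by_index):
    """Put pieces back in index order, whatever order they arrived in."""
    return b"".join(pieces_by_index[i] for i in sorted(pieces_by_index))


original = b"abcdefghij"
piece_length = 4
sizes = piece_sizes(len(original), piece_length)

# Simulate pieces arriving in a scrambled order.
arrival_order = list(range(len(sizes)))
random.shuffle(arrival_order)
received = {}
for index in arrival_order:
    start = index * piece_length
    received[index] = original[start:start + piece_length]

rebuilt = assemble(received)
```

Because every piece knows its index, the arrival order does not matter; the file comes out the same.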

1-to-1

Instead of the traditional method which is ... blah blah blah one-to-one

Switching gears

Y'all, I lied about not getting academic or political.
Oh look, the NDSA levels of preservation! I don't expect you to be able to read this, but I expect that all of you have this completely memorized already, right? Thinking about the NDSA levels of preservation: It's pretty easy to cover all levels of preservation with this protocol, especially if there's institutional support behind it. What it means is that lots of people have verified copies of a digital file, which fulfills the archival mantra of LOCKSS (Lots of Copies Keep Stuff Safe). The more copies there are distributed, the stronger a file is. Fixity is sorta constantly being checked. Torrents use checksums to verify file fixity as well! So they are already archivally sound in that way! Torrents are validated via a SHA-1 hash. And in fact this hash must validate in order to work and continue to be part of the same torrent swarm. Security can be monitored by making the torrent private, metadata can be and usually is packaged, and media files are interoperable, using formats and codecs that allow for the largest possible audience.

“When a file is made available using HTTP, all upload cost is placed on the hosting machine. With BitTorrent, when multiple people are downloading the same file at the same time, they upload pieces of the file to each other. This redistributes the cost of upload to downloaders, (where it is often not even metered), thus making hosting a file with a potentially unlimited number of downloaders affordable.” This is from a white paper authored by the creator of this protocol, available at bittorrent.org: http://bittorrent.org/bittorrentecon.pdf

$ Bandwidth $

Bandwidth politics are icky much like copyright politics — lots of variables. I think this is interesting. I don't have time to talk about this. Much like copyright, there's not a clear answer here.

Start doing

So I hope my talk and the following talks prove that this isn’t a hypothetical situation. This is something that can actually be implemented, and I’d like to see it implemented. The technology is right there, so it just needs some technical and community support to help archives, particularly small archives or archives with little institutional support, become stronger and more community-driven. I want to emphasize that what we are talking about isn’t a theoretical concept, it’s a real concept that works for Internet Archive, with the Prelinger Archives and XFR Collective as examples of that, and Democracy Now! and private trackers.

XFR Collective

I’m not gonna talk about the Prelinger Archives because he is literally right here in this room, right next to me, but I do volunteer at XFR Collective, a nonprofit that transfers magnetic media, and our primary method for access and storage is via Internet Archive. We do the digitization, we send the files (in our case physically transport hard drives) to San Francisco, and one preservation copy and all access copies live at the Internet Archive.

Democracy Now!

Democracy Now! has a good thing going with using torrents as distribution: only the five most recent episodes, because they know distributed downloading only works really well when there are many people sharing. Which is why maybe it's a good idea for archives to grow out of communities, rather than archives trying to build a community. http://www.democracynow.org/pages/help/torrent

The end
