Don't know about you... but I'm feeling like SHA-2!
Checksumming with Taylor Swift
Ashley Blewer
24 April 2015
Personal Digital Archiving
"Nice to meet you. Where’ve you been? I can show you incredible things."
So this talk is trying to reach the lowest common denominator, using the most accessible pop star and perfectly nice person, Taylor Swift, to explain what the big deal is behind checksums.
This talk is "Don’t Know About You… But I’m feeling like Sha-2!"
Hello.
"Um, you don’t know about me? But I bet you’d like to."
And I wasn't kidding about the amount of Taylor Swift you're gonna get in the next five minutes.
Bro...
CHECKSUMMING! "It's difficult but it's real." So what do people mean when they are like "Bro, do you even checksum your digital preservation assets?"
Bro...
In archives, we use checksums a bit differently than how they are usually used in the tech world, so I want to give an overview of what it looks like "from the other side" to help archivists understand ~what it all means~. "This is the golden age!"
c29c11407b4b6600f6fa47018008b147
Answer: IT'S JUST A STRING OF GIBBERISH. That's all it is. "It's beautiful, wonderful, don't you ever change." And as long as that string of gibberish is the same right now as it is when you check up on that file in 10 years, you're good.
and you're good!
...you're good.
and you're good!*
Kinda.... you're good*
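To make "the same string of gibberish" concrete, here is a minimal Python sketch using the standard library's hashlib. The sample bytes are made up for illustration. The point is only that identical bytes give an identical MD5 string, and one changed letter gives a completely different one.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()

# Hypothetical "file contents" for illustration only.
original = b"Nice to meet you. Where've you been?"
altered = b"Nice to meet you. Where've you bean?"

print(md5_hex(original))
print(md5_hex(altered))
print(md5_hex(original) == md5_hex(original))  # same bytes, same gibberish
print(md5_hex(original) == md5_hex(altered))   # one changed letter, new gibberish
```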
Concepts
Um. "Next chapter".
Let's talk concepts.
Concepts: Security
Let's talk about... security.
“You go talk to your friends talk to my friends talk to me”
Security-wise, checksums are used to prevent a man-in-the-middle attack by using what I like to think of as the Taylor Swift method of breakup success. I made a diagram! To represent this better.
So, you can see here... encrypted passwords or other sensitive information can pass through a network to another location without that information being compromised. “You go talk to your friends talk to my friends talk to me”
The way this is successful is if information is passed through other channels but never directly. Even if hackers are "standing by you, waiting at your back door," your data will still be safe, because plaintext information is never transferred directly, and that obfuscation and abstraction happens just like this. Except instead of your friends and my friends, it's checksums and salting.
"Because we are never, ever getting back together. Like, ever."
Concepts: Salting
I mentioned salting. A salt is a random string that is combined with a password before hashing, so two people with the same password still end up with different hashes. The salt value and resulting hash are then stored in the database. You can check a password by hashing it again with the stored salt, but the hash can't be turned back into the password.
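A rough sketch of salting in Python, using only the standard library. The talk doesn't name a specific scheme, so the choice of PBKDF2 (hashlib.pbkdf2_hmac), the 16-byte salt, the iteration count, and the function names hash_password and verify_password are all assumptions made for illustration.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 100_000  # illustrative work factor, not a recommendation

def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, hash). A fresh random salt is generated when none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    """Re-hash the attempt with the stored salt and compare in constant time."""
    _, attempt = hash_password(password, salt)
    return hmac.compare_digest(attempt, stored)

salt_a, hash_a = hash_password("shake it off")
salt_b, hash_b = hash_password("shake it off")
print(hash_a != hash_b)  # same password, different salts, different hashes
print(verify_password("shake it off", salt_a, hash_a))
print(verify_password("bad blood", salt_a, hash_a))
```

Only the salt and the hash would go in the database. The password itself is never stored.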
Concepts: Collision Attack
A collision attack attempts to find two arbitrary inputs which produce the same hash value, hence, a collision (two different files existing with the same value). This does come up in archives and is a problem, and it seems to happen more than it should. "At least, that's what people say."
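Real collisions in MD5 or SHA-1 take serious work to find, so here is a toy version of the idea. This sketch deliberately cripples SHA-256 down to its first 4 hex characters (16 bits) so that a collision turns up almost immediately. The tiny_hash function and the "file-N" inputs are invented for illustration and are not a real attack on any real algorithm.

```python
import hashlib
from itertools import count
from typing import Dict, Tuple

def tiny_hash(data: bytes) -> str:
    """Deliberately weak toy hash: only the first 4 hex characters of SHA-256."""
    return hashlib.sha256(data).hexdigest()[:4]

def find_collision() -> Tuple[bytes, bytes]:
    """Try inputs one by one until two different inputs share a toy hash."""
    seen: Dict[str, bytes] = {}
    for i in count():
        candidate = f"file-{i}".encode()
        digest = tiny_hash(candidate)
        if digest in seen:
            return seen[digest], candidate
        seen[digest] = candidate
    raise RuntimeError("unreachable: only 65536 possible toy hashes exist")

a, b = find_collision()
print(a, b, tiny_hash(a), tiny_hash(b))
```

With only 65,536 possible outputs, a match is guaranteed within 65,537 tries. Full-length hashes make the same search astronomically longer, which is the whole point.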
Concepts: "Broken"
Security-wise, these things don't last "forever & always". So you hear, oh, “___ is broken.” MD5, SHA-1. It's because the math that keeps these file fingerprints secure stops holding up once computers can find collisions fast enough to be practical. “This slope is treacherous, this path is dangerous” What matters is whether the code can be broken at all, as well as the amount of time and money necessary to make that break happen. It follows a sort of Moore's Law pattern: it gets faster and cheaper every day. Info security is a moving target.
SHA
So we have SHA. SHA stands for Secure Hash Algorithm. It is published by the National Institute of Standards and Technology (NIST) as a U.S. Federal Information Processing Standard (FIPS). SHA is very common. The older, less secure versions (SHA-1) are used to generate small unique identifiers: Git uses SHA-1 so that every commit gets a unique ID. Stronger versions are used in security. HTTPS used to use SHA-1 and now uses SHA-2 (usually SHA-256) in the SSL certificate that tells you whether you are legitimately on the website you intend to be on.
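For a feel of the difference, here is a small Python sketch. Git names every object, commits included, by the SHA-1 of a short header plus the content. The blob (file) case is the simplest to reproduce, so that is what is shown. The git_blob_sha1 name and the sample bytes are just for illustration.

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """SHA-1 that Git assigns to a file: 'blob <size>' + NUL byte + the content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

data = b"Shake it off\n"  # hypothetical file contents

print(git_blob_sha1(data))                 # 40 hex characters (160 bits)
print(hashlib.sha256(data).hexdigest())    # 64 hex characters (256 bits)
```

The longer SHA-256 output is part of why it holds up better for security work.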
CRC
CRC (cyclic redundancy check) was invented by W. Wesley Peterson in 1961 but “32, and still growing up now.” CRC-32 is a common version; the 32 stands for the 32-bit length of the check value. Lots of limitations here: it's not good for security and can be faked easily. But it is used in file format verification, Matroska video wrapper files being an example of that.
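A quick look at CRC-32 in Python, using the zlib module from the standard library. The crc32_hex helper name is made up for this sketch. Notice how short the result is: just 8 hex characters, which is a big part of why it is fine for catching accidental damage but no good for security.

```python
import zlib

def crc32_hex(data: bytes) -> str:
    """Return the CRC-32 of data as an 8-character hex string (32 bits)."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08x")

print(crc32_hex(b"123456789"))  # cbf43926, the standard CRC-32 check value
print(len(crc32_hex(b"any other bytes")))  # always 8 hex characters
```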
MD5
MD5! "You look like my next mistake!" We see MD5 a lot in archives, at least in my experience. MD5 stands for Message Digest algorithm 5, and was invented by US cryptographer Professor Ronald Rivest in 1991 to replace the old MD4 standard.
Back to archives-land
Fortunately for archival purposes, usually security isn't an issue and it's okay if checksums are "broken" from a security standpoint as long as they work from a fixity standpoint (file integrity, file location authentication -- are the files there and are they what they used to be?). "Hey, isn't this easy?"
The core principle still stands in infosec and in digipres though: you want the assurance that a file has not changed in any way, even ways that seem invisible. And that's why implementing this strategy into a preservation workflow is important.
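Here is one way a fixity check could look in Python, as a sketch rather than a full preservation workflow. The function names file_checksum and check_fixity, the SHA-256 default, the chunk size, and the throwaway demo file are all assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path

def file_checksum(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large video files never have to fit in memory."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_fixity(path: Path, recorded: str, algorithm: str = "sha256") -> bool:
    """True if the file still matches the checksum recorded earlier."""
    return file_checksum(path, algorithm) == recorded.lower()

# Demo on a throwaway file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as folder:
    asset = Path(folder) / "home_movie.mkv"
    asset.write_bytes(b"frames and frames of 1989")
    recorded = file_checksum(asset)                   # store this alongside the file
    unchanged = check_fixity(asset, recorded)
    asset.write_bytes(b"frames and frames of 1988")   # simulate a silent change
    still_ok = check_fixity(asset, recorded)

print(unchanged, still_ok)  # True False
```

In practice the recorded value would live in a manifest or database, and the check would run on a schedule.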
But...
"But that doesn't mean we're out of the woods yet." I don't have time to get into this but faking checksums or random non-validation or incorrect validation even when a file does change is still a thing. It's "trouble trouble trouble". Or maybe "it's like trying to solve a crossword puzzle but there's no right answer". Sorry, that deserves much longer discussion. Next time. But in the meantime... "Back up, baby, back up".