On GitHub: classam / hash_presentation
Hi, I'm Curtis Lassam!
Or classam on GitHub or Twitter.
I'm a software developer for a University in B.C.
Not the one you're thinking of.
The other one.
We are going to talk about hash functions.
I know what you're thinking.
"Oh no"
"That's basic computing science. What does it have to do with the web?"
"I'm here to learn about frameworks that are 15 minutes old, not MATH."
"When's lunch?"
And so on.
But I'm going to do my best to show you why this fundamental technique belongs in your toolkit.
You can tell from my drawing of a toolkit: I am not a handyman.
You might be thinking:
And you'd be right.
So for you guys, I've hidden Portland Pete on one slide in this set. If you tell me where he was in the presentation after I'm done, you get a prize.
But before you get too excited, the prize is a high five.
Let's get started - what IS a hash function?
Here's an example. Let's write a function that takes a string
splits it into characters
converts every character into an integer
adds the integers together
and mods the result by 100.
This is a hash function.
Not every hash function works THIS way, but this - like many other functions - is a hash.
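In Python, the function I just described looks something like this - I'll call it cheese_hash, since the key "cheese" is going to come up again later:

```python
def cheese_hash(s):
    """Toy hash: split a string into characters, turn each character
    into an integer, add the integers up, and mod the total by 100."""
    return sum(ord(character) for character in s) % 100

print(cheese_hash("cheese"))  # 21, every single time
print(cheese_hash("color"))   # 43 - always between 0 and 99
```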
It has a number of useful properties.
Regardless of what sequence of characters you put in, you're guaranteed that the output will be between 0 and 99.
The output for a string will always be the same.
Many different inputs can produce the same output.
And given only the output, you can't work backwards to find the input that produced it.
I know, I know, invoking the Wikipedia definition of something is the presentation equivalent of opening a speech with "Webster's Dictionary defines", but it's a good definition.
So, that's a hash function - now let's talk about Hash Tables.
Hash tables are a data structure concept used to store keys and values.
Keys and values are important - you deal with them when you work with things like
MongoDB
Memcached
or Javascript objects
And if you're dealing with keys and values, chances are, behind the scenes, something clever with hash tables is happening.
It could be something else, like a tree or a trie, but those aren't what we're talking about right now, so I'm going to pretend that they don't exist.
So let's start with a big block of memory - let's say an array with 100 elements.
We want to store the value "hash two one five dee two nine" against the key "color".
So we run a hash function on the key, "color"
this produces a number
which we mod by the length of our table.
this gives us an index.
and then we can just store our value at the location pointed to by the index.
That's a hash table. Easy peasy.
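Roughly, in code - reusing the toy cheese_hash from before, and cheerfully ignoring collisions for the moment:

```python
TABLE_SIZE = 100
table = [None] * TABLE_SIZE                # our big block of memory

def store(key, value):
    index = cheese_hash(key) % TABLE_SIZE  # hash the key, mod by the table
    table[index] = value                   # length, store at that index

def lookup(key):
    return table[cheese_hash(key) % TABLE_SIZE]

store("color", "#215d29")   # the "hash two one five dee two nine" from the slide
print(lookup("color"))      # '#215d29'
```

(The extra mod is redundant here - cheese_hash already lands in 0..99 - but that's the general shape.)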
This has a lot of really positive qualities.
Insert? Constant time.
Delete? Constant time.
Lookup? Constant time.
Search? Uuuuh. Don't search a hash table.
But it is not quite so easy, nor quite so peasy.
As the array fills up with values,
it gets more and more likely that we'll have a hash value point to a spot in the array that's already full.
This is called a "collision".
What do we do when there's already a value in the spot where we want to store a value?
We can keep walking forward in the table until we find an available spot.
This is called 'linear probing'.
Alternatively, we could root a linked list at every space in the array. That way, our hash table can take as many values as we can throw at it.
Or a tree. We could also root a tree at every space in the table.
This - rooting a separate data structure at every spot in the table - is called a 'chained hash table'.
One problem:
Here, there's a bunch of different values stored where we're looking for the key, "cheese".
How do we know which is the right one?
We need to store the key with the value, so that we can make sure we're retrieving the right thing.
This is going to be the case any time we have to deal with collisions - if we can retrieve multiple values from our hash table, we need to be able to tell which is the correct one.
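Here's a minimal sketch of a chained table that stores the key alongside the value. (To let the table be any size, I mod the raw character sum by the table length instead of by 100.)

```python
class ChainedHashTable:
    """Each slot in the array roots a list (chain) of (key, value) pairs."""
    def __init__(self, size=100):
        self.slots = [[] for _ in range(size)]

    def _index(self, key):
        # character-sum hash, modded by the length of our table
        return sum(ord(c) for c in key) % len(self.slots)

    def set(self, key, value):
        chain = self.slots[self._index(key)]
        for i, (existing_key, _) in enumerate(chain):
            if existing_key == key:        # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))         # collision or not, just chain it on

    def get(self, key):
        for existing_key, value in self.slots[self._index(key)]:
            if existing_key == key:        # the stored key tells us which of the
                return value               # colliding entries is the right one
        return None

ht = ChainedHashTable()
ht.set("cheese", "gouda")
ht.set("color", "#215d29")
print(ht.get("cheese"))  # 'gouda', even if another key shares the slot
```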
Even with some strategy for collision resolution in place, it's possible for the table to get so full that it performs very sluggishly.
A crowded chained-hash is little better than a linked list.
or - in the case of hashing strategies that just shuffle addresses around - it's possible for the table to become completely full.
When this happens, it's time to rebuild the hash.
This is the time-consuming process of addressing an even bigger whack of memory,
Taking all of the keys out of the first array, re-hashing them, and putting them in the second array.
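Sketched against the chained table from a moment ago:

```python
def rebuild(old):
    """Address a bigger whack of memory, then re-home every key into it."""
    new = ChainedHashTable(size=len(old.slots) * 2)
    for chain in old.slots:
        for key, value in chain:   # every key gets re-hashed, because the
            new.set(key, value)    # table length (and thus the mod) changed
    return new
```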
There's one language I can think of, whose default HashTable implementation can perform this computationally intensive rebuild step unexpectedly any time an insert pushes the table above a pre-defined load limit.
But for the sake of politeness I'm not going to mention which language.
Of course, linear probing and chained hashing are not the only hash table management strategies - There are many, many hashing strategies.
Like robin-hood hashing
which steals data from your richer tables
and inserts it to your poorer tables.
Or hopscotch hashing,
where you implement the entire array in chalk, on the pavement.
I am definitely right about those last two. No need to check them on the internet or anything.
Okay, so, that first few minutes of the presentation was there to get you up to speed on the basics. Now let's get to some of the meat!
Bloom filters.
A bloom filter is a great way to keep balloons out of your face.
I'm not sure how this slide ended up in here.
A bloom filter is a data structure that's fast
and space efficient
used to test for membership in a set.
That's important - it tests for set membership, it doesn't store any data.
It can tell you if something is in a set, but you can't retrieve an item from the set.
Like, if we have three objects - banana, apple, and bowling ball.
And a bloom filter, representing the set of 'fruit'.
We can use the set to determine that banana and apple are 'fruit', and 'bowling ball' is not.
But we can't give you a list of all of the fruit that were used to populate the set.
We don't know them.
So, a lot of the time, Bloom Filters are used to answer questions like
Okay, let's look at this image problem a little bit.
Let's imagine we're running a forum and our itinerant users keep posting the same images again and again. There's gigabytes of the same few hundred images, over and over and over.
We want to detect when the user is sending us a repeat image, and instead, just use an image link that already exists.
let's say we run a hash function on the target image,
yes, you can hash images - you can hash most anything
mod the result by the length of an array
move to that index location in the array
and save a link to the image there.
Then, when we're checking a new image, we can hash it, and then check to see if it's in our array.
of course, this is just a bog-standard hash table, and I'm supposed to be talking about Bloom Filters
this works fine, but it takes up a lot of space
and I promised you space efficiency
Let's imagine there are 5000 images we want to keep out of our forums and they take up 100 kilobytes each.
That means about a 500 megabyte table of duplicate images. While it's not a tonne of space, it's enough that you probably won't want to keep the whole thing in RAM.
remember, though, that we don't need storage from our data structure
we're only interested in whether or not this image exists
let's imagine that we store zeroes in every space in our table
and we store ones where our hash functions land.
that way, we can check if an image is banned
by hashing it, modding it
and then checking if that spot in the array has a 1 in it.
if there's no 1 there, we can't possibly have seen that image before.
there's only one slight problem with this technique.
what happens when we have a different image that accidentally collides with an image that we've set earlier?
this creates a false positive, and unfairly takes pictures of Nicolas Cage out of circulation
How do we stop collisions like this from occurring?
Well, we can use a hash function that guarantees an astronomically low probability of collision.
MD5, for example.
is a hash function that guarantees an astronomically low probability of collision
But with an astronomically low probability of collision, you necessarily get an astronomically high number of potential outputs - 2 to the 128 of them, in MD5's case. With one bit for every output, you're suddenly saddled with
four times ten to the 25 terabytes, which is just a little bit unrealistic to implement as a bit array.
it does hit on one of the two strategies we can use to reduce collisions, though
the obvious one: more space.
now let's look at the less obvious one:
more hash functions.
instead of hashing our image just once, let's hash it twice, with two different hash functions
this gives us two different locations in the table, and we can store a one at both of them
another image
might share the result of one of the hash functions
but it's not as likely to share the result of all of them
and if any of the hash functions point to a location with a zero in it, we know that this object can never have been entered into the Bloom Filter
So, this is a bloom filter.
A bit field, and multiple hash functions to set the bits in that field.
We put items in by hashing them multiple times and setting the bits at all of those locations
And we check if items are in the set by hashing items multiple times and checking if the bits are set at all of those locations.
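A rough sketch, then. To get multiple hash functions out of one, I prefix a different seed byte for each - and I'm lazily using SHA-256 from the standard library, even though (as we'll get to) a cryptographic hash is overkill for this job:

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits        # one int per bit, for clarity; a real
                                          # implementation would pack these

    def _locations(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for location in self._locations(item):
            self.bits[location] = 1       # set the bit at every hash location

    def might_contain(self, item):
        # any zero bit => this item can never have been added
        return all(self.bits[location] for location in self._locations(item))

fruit = BloomFilter(num_bits=1024, num_hashes=4)   # sizing discussed in a moment
fruit.add(b"banana")
fruit.add(b"apple")
print(fruit.might_contain(b"banana"))        # True
print(fruit.might_contain(b"bowling ball"))  # False (or, rarely, a false positive)
```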
There are some downsides to this.
Because each item we put into the Bloom Filter sets multiple bits, and different items can share bits, we can never delete anything from the Bloom Filter
if we do
we run the risk of accidentally deleting something else from the filter.
on top of that, we've reduced the chance of collisions, but it's impractical to reduce that chance to effectively zero, so there's always some chance of a false positive.
however, this probability is at least a number that we have control over
if we know the desired probability of a collision
- in the case of our image filter, let's say 0.1%.
and the number of things we want to put in the filter
as we mentioned before, about 5000 images
we can use handwavy math to determine how much space we need
and how many hash functions we need
to get an optimal solution
so, we need an array of 71,888 bits - 8.8 kilobytes
that's small enough that we could keep it in RAM - heck we could send that bloom filter to the user and run it client-side. It's certainly easier to work with than 500 megabytes of image.
and 10 different hash functions.
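The handwavy math, for the curious, is the standard pair of Bloom filter sizing formulas: bits m = -n·ln(p) / (ln 2)², and hash functions k = (m/n)·ln 2.

```python
from math import ceil, log

n = 5000    # images we want to filter
p = 0.001   # false-positive probability we'll tolerate (0.1%)

m = ceil(-n * log(p) / log(2) ** 2)   # bits of space needed
k = round(m / n * log(2))             # hash functions needed

print(m, "bits =", round(m / 8 / 1024, 1), "KB")  # 71888 bits = 8.8 KB
print(k, "hash functions")                        # 10
```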
A lot of the time, Bloom Filters are paired up with other data structures that handle the storage of the items.
For example, they are useful when paired with data structures that exhibit worst-case performance when searching for items that are not in the data-structure.
In a linked list or a large unsorted array, for example, you get the worst possible performance when you're searching for an item that just isn't there.
The bloom filter can check, before you hit the data structure, if the data is in there.
they are also very useful when the retrieval step for data takes a long time - for example, when a network call needs to be made to a far away database, a local bloom filter is a wonderfully fast way to know if you are wasting everybody's time with a request for something that doesn't exist.
or, data with a very low hit rate - if you're dealing with the sort of data where you're getting 10 misses for every hit, you can catch all of those misses with a bloom filter.
Google Chrome does this -
when checking if a URL is malicious, it maintains a small, fast, local Bloom Filter seeded with malicious URLs. If the Bloom Filter flags a URL as a possible match,
only then will Chrome make the milliseconds-long round-trip to their servers to check the details.
This use case matches up pretty well with our list of times when a bloom filter is useful - a call to a remote server takes a huge amount of time compared to running a few hash functions, and maybe only one in 100 websites you visit will actually be a hit
Let's talk about how to pick which hash functions to use.
I mean, I showed you my awesome cheesehash function earlier, but it's terrible, and on top of that it only really works on strings.
Now, there are countless different hash functions - each one a unique snowflake.
only three of these are made-up
So which ones do we pick for our data structures?
well, for data-structures, the two properties we're looking for in a hash are that the hash should be fast, and well-distributed
When we say 'fast', what we mean is 'non-cryptographic'.
Cryptographic hashes are awesome. They have a bunch of properties that make them totally bad-ass for security functionality.
The most important one being that they're "collision resistant" - it's astronomically unlikely that you will be able to find two different items that hash to the same value.
But these cryptographic features also make them a lot more processor-hungry.
So we should avoid hashes like SHA or MD5 when we're working on data structure stuff, because we don't need awesome security features in our Bloom Filter. It's just a waste of CPU cycles.
So, non-cryptographic.
And well-distributed, which means that no matter how similar your data is going into the hash function,
the output appears all over the spectrum.
Hash functions that exhibit this quality are known as 'avalanching' hashes
because small changes in the input
lead to large changes in the output.
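You can see the difference using the toy hash from the start of the talk, which emphatically does not avalanche:

```python
import hashlib

print(cheese_hash("cheese"))  # 21
print(cheese_hash("cheesf"))  # 22 - a tiny change in, a tiny change out. Bad.

# An avalanching hash scatters similar inputs across the whole output space
# (MD5 here only because it's in the standard library):
print(hashlib.md5(b"cheese").hexdigest())
print(hashlib.md5(b"cheesf").hexdigest())  # a wildly different digest
```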
Of course, this is also a desirable property in cryptographic hashes, but this one we're willing to blow our precious processor time on, because it's really important for data structures that depend on hash functions to have well-distributed hash functions.
A common hash used for this purpose is the non-cryptographic, well-avalanching, public domain Murmur3 - implementations of which exist for most modern languages
- and which has appeared in numerous open-source products, including Hadoop, Cassandra, and nginx.
It also takes a seed value, so you can create dozens of different hash functions out of Murmur3
just by changing the seed value.
There are about five different implementations of it in NPM
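In Python, for instance, the mmh3 package exposes that seed directly - ten seeds, ten hash functions for your Bloom filter:

```python
# pip install mmh3
import mmh3

print(mmh3.hash("cheese", 0))  # one hash function...
print(mmh3.hash("cheese", 1))  # ...a second, completely different one...
print(mmh3.hash("cheese", 2))  # ...and a third, just by changing the seed
```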
One other thing - we hashed an image to put in our data structure.
What sort of hash function works on an image?
Well, most of them, really, but a perceptual hash, or pHash, is designed specifically to cluster very similar images together in the hash output.
For example, if it's the same image, but sized larger, or skewed to the left, it should end up with a hash location very close to or identical to the hash location of the original image.
Of course, by nature, this hash function won't be distributed in the way that we would need for an optimal general-purpose data structure
It's the opposite of an 'avalanching' hash - small changes in the input lead to almost no changes in the output.
But we can abuse that property so that false positives are unfairly clustered on images that look very similar to our banned images. Which would actually probably be a good thing.
You can try it out with npm install phash
Okay, so, that concludes the data-structures portion of the presentation.
Now, let's talk about how hash functions contribute to the security of your application.
Okay, you're running a web application, and the worst-case scenario happens.
Some hacker makes off with the user table from your database.
"How"?
I don't know, for the sake of argument let's say SQL Injection.
At this point your users are all shit out of luck, right?
I mean, some brigand has made off with all of their passwords.
But no, because our developers have cleverly obscured the passwords before saving them.
Yeah, we're back here at "The Stuff Every Developer Knows" junction, but I think it's useful to establish a baseline before getting into the meatier details.
So, in order to hide passwords this way, when the user creates a password, we don't save the password.
Instead, we save the result of a hash function.
Later, when the user tries to log in, they provide a password. We hash the password, and compare it with our hashed password for that user.
If the two match, the user has provided the correct password.
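As a baseline sketch - with MD5 and no salt, which is exactly the version we're about to tear apart:

```python
import hashlib

def hash_password(password):
    return hashlib.md5(password.encode()).hexdigest()

stored_hash = hash_password("hunter2")   # this is all we ever save

def login(attempt):
    return hash_password(attempt) == stored_hash   # same hash => same password

print(login("hunter2"))  # True
print(login("12345"))    # False
```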
Of course, if two different passwords ever collide - if two strings hash to the same output value - then it would be possible for someone to log in to our site with the wrong password. That's bad.
Do you remember earlier when I said that cryptographic hashes - like MD5, for example - have a feature called collision resistance, which means that two inputs are astronomically unlikely to collide? Here is where that's super important.
Now, our passwords are protected.
Yup. Totally protected. Nothing could possibly go wrong.
Rainbow Tables: the way to break hashed passwords.
So, we've stolen a whole database full of usernames and hashed passwords, and we want to get at the raw passwords so that we can try them out on banking sites and such.
What we can do, is create a list of everything that we think of as a possible password. An enormous, comprehensive list.
Then, we use the same hash that the programmer used to hash the original passwords, and we run that on every single item in our gigantic password list.
Now we have a giant collection of hash-to-password pairs, which we can compare against the original data.
Any time we find a match with one of the hashes in our set, we know what the password must have been.
Naively, checking every hash in our set against every hash in the database is quadratic, but the work is inherently very parallelizable - and with sorting or indexing, the lookups become nearly free - so with a bit of optimization this can run VERY fast.
This table of precomputed hash-to-password pairs is called a Rainbow Table. (Strictly, a true rainbow table also compresses the list using chains of hashes, but the idea is the same.)
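The whole attack, in miniature (a real password list has billions of entries, not four):

```python
import hashlib

def md5(password):
    return hashlib.md5(password.encode()).hexdigest()

# precompute hash -> password for everything we think might be a password
rainbow = {md5(pw): pw for pw in ["123456", "password", "qwerty", "hunter2"]}

# the stolen user table: username -> hashed password
stolen = {"alice": md5("qwerty"), "bob": md5("correct horse battery staple")}

for user, hashed in stolen.items():
    if hashed in rainbow:
        print(user, "uses the password:", rainbow[hashed])  # alice is busted
```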
The MD5 hash function is so common that Juuso Salonen (I hope I pronounced his name right) released a utility called Bozocrack that works by taking a hashed password, searching for it on Google, and then MD5-hashing everything that comes back in the google results until it finds a match.
It's not as comprehensive as a MD5 Rainbow Table, but it still manages to be depressingly effective.
Part of the reason the Rainbow Table attack is so effective is that, because every user's password has been hashed with the same hash function, it's possible for us to test passwords against the entire table.
While it is unlikely that we will ever crack all of the passwords, we're able to suss out the users who have simple passwords very quickly.
Just testing the whole database against the 1000 most common passwords shouldn't take more than an hour, and will provide us with a wealth of potentially useful data.
So what we want to do is reduce the effectiveness of this kind of attack, by using a different hash function for every single person in the database.
That doesn't mean we need to enlist thousands of different hash implementations - what it actually means is that we want to use one hash function that we seed with a different value for each user in the table, before we hash the password for that user.
It has to be a value that we have access to - because we need to be able to recreate this custom hash function every time we check the user's password.
It's quite common to use the username for this value.
That doesn't mean you should use the username.
Cryptographers recommend that you create a large block of randomized data and save it against the user to use as a seed.
This hash-seeding value is called a salt.
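Sketched out - still with MD5 as the running example, which the next few slides will fix:

```python
import hashlib, os

def create_user(password):
    salt = os.urandom(16)   # a random per-user seed, saved alongside the hash
    hashed = hashlib.md5(salt + password.encode()).hexdigest()
    return salt, hashed

def check_password(salt, hashed, attempt):
    # re-create this user's "custom" hash function by re-using their salt
    return hashlib.md5(salt + attempt.encode()).hexdigest() == hashed

salt, hashed = create_user("hunter2")
print(check_password(salt, hashed, "hunter2"))  # True
print(check_password(salt, hashed, "12345"))    # False
```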
Okay, so that covers the very basics.
And when I say basics I mean very most basics. This isn't exactly state of the art. Unix's 'crypt' function first used salted hashes to protect against rainbow tables in 1976, a decade before I was born.
Okay, don't use MD5. Let's talk about that.
I've been using MD5 as my de-facto cryptographic hash example for this presentation, because it's well-known, and well-understood, and it's been in use for a good long time.
Unfortunately, in that good long time, cracks have begun to show in this venerable hashing algorithm. Collisions have been forced. It is growing less and less capable of standing up against modern hardware.
The biggest reason to avoid MD5 is simply that it is too fast.
It's possible to MD5 hash millions of values per second.
This makes brute force attacks against MD5-hashed passwords very easy.
This MD5-hashing python script I wrote runs, on a cheap little virtual machine, one million MD5 hashes in 1.5 seconds.
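Something in this shape reproduces the measurement, if you want to try it on your own machine:

```python
import hashlib, time

start = time.time()
for i in range(1_000_000):
    hashlib.md5(str(i).encode()).hexdigest()
print(time.time() - start, "seconds")   # about 1.5, on a cheap little VM
```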
Now, BCrypt is designed to be slow. Using the default settings on the same virtual machine, that run would take me 3.4 days.
This, combined with a salt for each individual user, means that brute forcing passwords out of a database could take days or months per user, rather than an hour or two for the whole thing.
BCrypt also comes with a work variable that you can crank up to make things even slower as hardware gets faster. When I turn it up to 15, my million-hash script goes from 3.4 days to run, to 26 days to run.
At this point, though, just logging a single user in to my site could take a couple of seconds. I'm not sure if they're willing to wait that long.
One of the key rules of security is "don't roll your own". Don't try to be clever and implement a solution that seems like it should work - do a little bit of research on the problem space and use something that is designed with your problem in mind.
So, for password security, you should use something that's specifically designed for password security - like Bcrypt, or PBKDF2
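With the Python bcrypt package, for instance, the per-user salt and the work factor are both handled for you:

```python
# pip install bcrypt
import bcrypt

hashed = bcrypt.hashpw(b"hunter2", bcrypt.gensalt())  # salt baked right in

print(bcrypt.checkpw(b"hunter2", hashed))  # True
print(bcrypt.checkpw(b"12345", hashed))    # False

# the work factor from a few slides back - higher means slower, on purpose
slower = bcrypt.hashpw(b"hunter2", bcrypt.gensalt(rounds=15))
```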
But even for general purposes, if you're looking for a secure hash algorithm, MD5 is kind of old and broken.
SHA means "Secure Hash Algorithm", and SHA-512 has become a new standard for hashing, one that's cryptographically much more secure.
So, listening to this, you might be thinking - why send the password to the server-side at all?
The user could just hash their own password and send the hashed password to the server for verification.
Heck, we could go totally stateless - we could start including a hashed verification code with every instruction we send to the server.
That way, nobody can tamper with our instructions to the server!
Well, there are two different possibilities, here.
Either we're trying this from a browser, or we're not trying this from a browser.
Let's look at the problems of doing this with browser code.
What are we trying to accomplish, here?
We don't want people who can inspect our traffic to see a password.
We don't want people who can inspect our traffic to be able to act on behalf of our users.
So we send our users a bit of code that will allow them to obscure their credentials when they communicate with us.
Except if people can inspect our traffic, they can also alter our traffic.
Which means they can replace our crypto code with equivalent code that steals credentials.
Javascript browser crypto is exactly as secure as the transport layer providing it.
So Javascript browser crypto over HTTP can't possibly be more secure than just using HTTP - which is not secure at all.
And Javascript browser crypto over HTTPS is as secure as just using TLS, which is ... more or less secure.
I trust that you can solve for the value of Javascript Browser crypto in this equation.
Let's imagine, then, that we're not doing this from the browser, but from a trusted client library that we have total control over - and we still want to keep attackers away from user data and out of our application.
Well, if we just send a token containing a hash of our password, all an attacker has to do is steal that token and they can start running rainbow-table attacks on it. Plus, with our token they can still pretend to be our user pretty effectively. That's bad.
So, what we need to do is hash the password and a timestamp together?
Should the server keep track of the timestamp?
Couldn't the attacker just steal our timestamped message and replace the command with their own, malicious command?
This gets really complicated, really fast, and it is SO easy to do it wrong.
We shouldn't be in the business of designing our own security algorithms.
The hash function is a building block, but we need more guidance when it comes to problems like this.
Which brings me to the final section of the security topic: HMAC.
HMAC stands for "Hash-based Message Authentication Code", and it's sort of what we're trying to generate, here.
A cryptographic hash function can be used to generate a document signature, but some of the common cryptographic hashes - MD5 and SHA, for example - have a property where, once you have a signature for a secret+message combination, it is trivially easy to generate a valid signature for secret+message+foo. This is called a length-extension attack.
So if you send the message "Yo!" Somebody else could steal that message and use it to send the message, "Yo! Take your wallet to the dark alley behind East Hastings at 11PM tomorrow."
Which would be bad.
Using a HMAC protects against this attack, and a few other varieties of attack - it's a good way to generate a signature for a message.
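Python ships an hmac module for exactly this, so the sketch is short:

```python
import hashlib, hmac

secret = b"a-key-shared-with-the-server"

def sign(message):
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify(message, signature):
    # compare_digest avoids leaking information through comparison timing
    return hmac.compare_digest(sign(message), signature)

signature = sign(b"Yo!")
print(verify(b"Yo!", signature))  # True
# the extended message fails verification - no length-extension shenanigans:
print(verify(b"Yo! Take your wallet to the dark alley.", signature))  # False
```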
But it doesn't solve all of our problems. People can still snoop on our traffic.
With no timestamp, people can still replay our messages to the server with impunity...
So the lesson, here, is that you also shouldn't be using HMAC, really, for this problem.
What we want is encryption, so what we need here is a full-blown encryption algorithm, like RSA, or a full-blown encryption tool, like GPG.
The take-home message, though, is that you really, really shouldn't try to build your own encryption protocol from scratch out of hash functions.
It's like trying to build a car by yourself - you can probably do it, but the result probably won't be that safe, unless you did a lot of research, first.