On GitHub: LawnGnome/non-cryptographic-hashing
Not all hash functions have to be cryptographic, though. There's also a place for functions that are fast and reasonably well distributed, but provide no guarantee of security. Some can also minimise memory usage.
This is the world of the non-cryptographic hash function.
There are two key goals: we want something that's quick, and we want something that distributes its output well. A common use of non-cryptographic hash functions is in implementing hash tables. While writing your own hash table isn't usually a concern in Python, the distribution goal still matters: clumps in the output lead directly to collisions.
def hash_function(data: bytes) -> int:
    hash = initial_state
    for chunk in md_pad(data):
        hash = cleverness(hash, chunk)
    return hash
Broadly, this is how every hash function is constructed. Easy, huh? (Python provides a cleverness function, I'm sure.) Cleverness is actually a "mixing function": it takes the current state and mutates it with the chunk.
Chunk sizes vary, but are most commonly bytes or 32-bit words.
As an aside, cryptographic hash functions usually use a Merkle-Damgård (DAM-gour(d)) construction, which adds padding in a specific way, but is otherwise a fancy (and hard to pronounce) way of saying exactly this.
0xffffffff + 0x00000001 = 0x00000000

Integer overflow behaviour is used heavily. In a higher-level language like Python you may run into trouble with the integer type being arbitrary precision, but most of this kind of work is done in C anyway.
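If you do want to play along in pure Python, the usual workaround (an illustrative sketch, not part of the original code) is to mask back down to 32 bits after each operation so values wrap the way a uint32_t would:

MASK32 = 0xffffffff

# Python ints never overflow, so we wrap them by hand.
print(hex((0xffffffff + 0x00000001) & MASK32))   # 0x0

def mul32(a: int, b: int) -> int:
    # 32 bit multiply with C-style wraparound.
    return (a * b) & MASK32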
uint32_t fnv1a32(const char *data, size_t len)
{
    uint32_t hash = 2166136261;

    for (const char *byte = data; byte < (data + len); byte++) {
        hash ^= *byte;
        hash *= 16777619;
    }

    return hash;
}

Here's the C implementation of our first algorithm, FNV-1a. Conceptually it's very simple: for each byte (it's a bytewise algorithm), we xor the byte into the hash value, then multiply. Note that we'll see lots of magic numbers.
Stepping through it line by line (the > marks the statement we're on):

> uint32_t hash = 2166136261;

hash:  1000 0001 0001 1100 1001 1101 1100 0101
*byte: (unset)

First, we set our hash value to a seemingly arbitrary number, the offset basis. (FNV-0 did not have this step.) It isn't a prime chosen for its dispersal characteristics: it's simply a non-zero value so that the initial hash state isn't all zeroes. (In fact, it's the FNV-0 hash of the e-mail signature line of Landon Curt Noll!)
> for (const char *byte = data; byte < (data + len); byte++) {

hash:  1000 0001 0001 1100 1001 1101 1100 0101
*byte: 0110 0001

Pick a byte: a lowercase "a".
> hash ^= *byte;

hash:  1000 0001 0001 1100 1001 1101 1010 0100
*byte: 0110 0001

Now we do the exclusive-or. Note that only three bits in the hash actually changed. Not very avalanchey!
> hash *= 16777619;

hash:  1110 0100 0000 1100 0010 1001 0010 1100
*byte: 0110 0001

Now we see some changes! The sharp-eyed will have noticed that this prime is close to 2^24, which means we're effectively performing a bit shift with some added noise in the lower bits. There's some avalanche.
> for (const char *byte = data; byte < (data + len); byte++) {

hash:  1110 0100 0000 1100 0010 1001 0010 1100
*byte: 0110 0010

Now we do it again with "b", which is only two bits different from "a".
> hash ^= *byte;

hash:  1110 0100 0000 1100 0010 1001 0100 1110
*byte: 0110 0010

Again, only three bits change when we do the xor.
> hash *= 16777619;

hash:  0100 1101 0010 0101 0000 0101 1100 1010
*byte: 0110 0010

You can see how well distributed the changes are, even after only two bytes of input. Every nibble has one or two bits modified. A small change in the input resulted in a big change in the output.
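Before moving on, here's the same algorithm as a quick Python port (my own sketch; the & 0xffffffff mask stands in for the uint32_t overflow we get for free in C):

def fnv1a32(data: bytes) -> int:
    hash_ = 2166136261                            # the FNV offset basis
    for byte in data:
        hash_ ^= byte                             # xor the byte in...
        hash_ = (hash_ * 16777619) & 0xffffffff   # ...then multiply by the FNV prime
    return hash_

print(hex(fnv1a32(b"ab")))   # 0x4d2505ca, the value we just stepped through by hand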
A a aa aal aalii aam Aani aardvark

This is the OS X /usr/share/dict/words. This is a good test for collisions, and an interesting one for performance with lots of varied, short inputs.
0 1 2 3 4 5 6 7

This is a file of numbers from 0 to 235885 (the same number of lines as the words input). Again, this is really interesting for collisions: a less varied input.
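A minimal sketch of the kind of collision test being run here (my own harness, using the fnv1a32 port from above; any of the hashes in this talk could be dropped in):

from collections import Counter

def count_collisions(inputs, hash_function):
    # Two different inputs landing on the same 32 bit value is a collision.
    counts = Counter(hash_function(item) for item in set(inputs))
    return sum(n - 1 for n in counts.values() if n > 1)

with open("/usr/share/dict/words", "rb") as f:
    words = [line.strip() for line in f]

numbers = [str(i).encode() for i in range(len(words))]

print(count_collisions(words, fnv1a32))
print(count_collisions(numbers, fnv1a32))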
Another way to visualise the distribution besides just the number of collisions is to map it out. I've shamelessly stolen this approach from Ian Boyd, who wrote a seminal StackExchange essay on hashing functions three years ago.
White dots indicate unused values within the output.
Basically, you want this to look like noise. FNV-1a is pretty good on this count.
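Here's a rough sketch of one way to build such a map yourself (my approximation of the idea, not Ian Boyd's actual code, reusing the words list and fnv1a32 from the sketches above): hash each input, fold the 32 bit output onto a fixed-size grid, and darken the cells you hit. Anything left white is an output region the inputs never reached.

def coverage_map(inputs, hash_function, width=512, height=512):
    # Mark which cells of a width x height grid the hash outputs land in.
    hit = [[False] * width for _ in range(height)]
    for item in inputs:
        cell = hash_function(item) % (width * height)
        hit[cell // width][cell % width] = True
    return hit

def write_pgm(hit, path):
    # Plain PGM (P2): 0 = black (value used), 255 = white (unused).
    with open(path, "w") as f:
        f.write("P2\n{} {}\n255\n".format(len(hit[0]), len(hit)))
        for row in hit:
            f.write(" ".join("0" if used else "255" for used in row) + "\n")

write_pgm(coverage_map(words, fnv1a32), "fnv1a32-words.pgm")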
static const uint32_t crc32tab[256] = { /* ... */ };

uint32_t crc32(const char *data, size_t len)
{
    uint32_t crc = 0xffffffff;
    const char *ptr;

    for (ptr = data; ptr < (data + len); ptr++) {
        crc = (crc >> 8) ^ crc32tab[(crc ^ (*ptr)) & 0xff];
    }

    /* the standard CRC-32 also complements (~crc) the result before returning */
    return crc;
}

CRC is interesting because it uses very basic primitives, but combines them with a lookup table that provides extra mixing. The mathematics are interesting too: in effect we're doing binary long division by a polynomial, and the check works because a message with its CRC appended divides evenly by that polynomial.
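As a sanity check, here's the same construction in Python (my sketch), with the table generated from the standard reflected polynomial and the final complement added; that final step is what the stdlib's zlib.crc32 does, so the two should agree:

import zlib

def make_crc32_table():
    # Standard reflected CRC-32 polynomial (0xEDB88320).
    table = []
    for i in range(256):
        c = i
        for _ in range(8):
            c = ((c >> 1) ^ 0xEDB88320) if (c & 1) else (c >> 1)
        table.append(c)
    return table

CRC32_TABLE = make_crc32_table()

def crc32(data: bytes) -> int:
    crc = 0xFFFFFFFF
    for byte in data:
        crc = (crc >> 8) ^ CRC32_TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF   # the final complement the C above leaves out

assert crc32(b"hello, world") == zlib.crc32(b"hello, world")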
uint32_t murmurhash3_32(const char *data, size_t len, uint32_t seed)
{
    static const uint32_t c1 = 0xcc9e2d51;
    static const uint32_t c2 = 0x1b873593;
    static const uint32_t r1 = 15;
    static const uint32_t r2 = 13;
    static const uint32_t m = 5;
    static const uint32_t n = 0xe6546b64;

    uint32_t hash = seed;

    const int nblocks = len / 4;
    const uint32_t *blocks = (const uint32_t *) data;

    for (int i = 0; i < nblocks; i++) {
        uint32_t k = blocks[i];

        k *= c1;
        k = (k << r1) | (k >> (32 - r1));
        k *= c2;

        hash ^= k;
        hash = ((hash << r2) | (hash >> (32 - r2))) * m + n;
    }

    const uint8_t *tail = (const uint8_t *) (data + nblocks * 4);
    uint32_t k1 = 0;

    switch (len & 3) {
        case 3:
            k1 ^= tail[2] << 16;
        case 2:
            k1 ^= tail[1] << 8;
        case 1:
            k1 ^= tail[0];

            k1 *= c1;
            k1 = (k1 << r1) | (k1 >> (32 - r1));
            k1 *= c2;

            hash ^= k1;
    }

    hash ^= len;
    hash ^= (hash >> 16);
    hash *= 0x85ebca6b;
    hash ^= (hash >> 13);
    hash *= 0xc2b2ae35;
    hash ^= (hash >> 16);

    return hash;
}

Yikes! It's not that bad, though. Let's break it down: this is typical of a more modern hash function.
uint32_t murmurhash3_32(const char *data, size_t len, uint32_t seed);

Here's the prototype. A lot of modern hash functions are initialised with a seed. This makes hash-flooding DoS attacks harder to construct: if the caller seeds with a random number, an attacker can't pre-compute inputs that will collide. (PHP, for example, had to ship mitigations for exactly this kind of attack in 5.3.9, back in January 2012.)
static const uint32_t c1 = 0xcc9e2d51;
static const uint32_t c2 = 0x1b873593;
static const uint32_t r1 = 15;
static const uint32_t r2 = 13;
static const uint32_t m = 5;
static const uint32_t n = 0xe6546b64;

uint32_t hash = seed;

We define a bunch of constants, and set up the initial hash value from the seed.
const size_t nblocks = len / 4;
const uint32_t *blocks = (const uint32_t *) data;

for (size_t i = 0; i < nblocks; i++) {
    uint32_t k = blocks[i];

    k *= c1;
    k = (k << r1) | (k >> (32 - r1));
    k *= c2;

    hash ^= k;
    hash = ((hash << r2) | (hash >> (32 - r2))) * m + n;
}

Here's the heart of the mixing function. Note that it consumes four bytes at a time instead of one (word-sized reads: fewer memory accesses, cache friendly), that it leans on C's reinterpret-cast semantics to treat the input as words, and that the arithmetic primitives are the same ones we've already seen. Any trailing bytes are left over by the integer division that computes nblocks. Note the constants, too: n is fairly close to 2^32, so the addition overflows and wraps regularly.
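There's no direct Python equivalent of that reinterpret cast; the closest sketch (assuming little-endian reads, which is what the C does on x86) is unpacking the bytes as words:

import struct

data = b"abcdefgh"
blocks = struct.unpack_from("<2I", data)   # two little-endian uint32_t "blocks"
print([hex(block) for block in blocks])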
const uint8_t *tail;

tail = (const uint8_t *) (data + nblocks * 4);

Go through some gymnastics for the last few bytes to ensure they're properly mixed into the hash value. Note that we're now interpreting the tail as bytes, rather than words.
uint32_t k1 = 0;

switch (len & 3) {
    case 3:
        k1 ^= tail[2] << 16;
    case 2:
        k1 ^= tail[1] << 8;
    case 1:
        k1 ^= tail[0];

        k1 *= c1;
        k1 = (k1 << r1) | (k1 >> (32 - r1));
        k1 *= c2;

        hash ^= k1;
}

Using a different mixing function for the last few bytes is common. Each byte gets mixed into k1 (note the deliberate fall-through: a 3 byte tail runs all three cases), then some additional mixing takes place before k1 is mixed into the main hash.
hash ^= len;
hash ^= (hash >> 16);
hash *= 0x85ebca6b;
hash ^= (hash >> 13);
hash *= 0xc2b2ae35;
hash ^= (hash >> 16);

return hash;

Finally, after all the mixing, some post-processing occurs. This is common in newer hash functions: with carefully chosen values you can force some additional avalanching, but badly chosen values can also be a source of weakness (clumping).
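To make the structure concrete, here's a rough Python port of the function above (my own sketch: masking to 32 bits to emulate the C overflow behaviour, and reading each block little-endian, which matches what the C code does on x86):

def rotl32(x, r):
    x &= 0xffffffff
    return ((x << r) | (x >> (32 - r))) & 0xffffffff

def murmurhash3_32(data: bytes, seed: int = 0) -> int:
    c1, c2 = 0xcc9e2d51, 0x1b873593
    r1, r2 = 15, 13
    m, n = 5, 0xe6546b64

    hash_ = seed & 0xffffffff
    nblocks = len(data) // 4

    # Body: mix four bytes (one little-endian word) at a time.
    for i in range(nblocks):
        k = int.from_bytes(data[i * 4:i * 4 + 4], "little")
        k = (k * c1) & 0xffffffff
        k = rotl32(k, r1)
        k = (k * c2) & 0xffffffff

        hash_ ^= k
        hash_ = (rotl32(hash_, r2) * m + n) & 0xffffffff

    # Tail: mix the last 1-3 bytes, mirroring the C switch fall-through.
    tail = data[nblocks * 4:]
    k1 = 0
    if len(tail) >= 3:
        k1 ^= tail[2] << 16
    if len(tail) >= 2:
        k1 ^= tail[1] << 8
    if len(tail) >= 1:
        k1 ^= tail[0]
        k1 = (k1 * c1) & 0xffffffff
        k1 = rotl32(k1, r1)
        k1 = (k1 * c2) & 0xffffffff
        hash_ ^= k1

    # Finalisation: xor in the length, then the avalanche steps.
    hash_ ^= len(data)
    hash_ ^= hash_ >> 16
    hash_ = (hash_ * 0x85ebca6b) & 0xffffffff
    hash_ ^= hash_ >> 13
    hash_ = (hash_ * 0xc2b2ae35) & 0xffffffff
    hash_ ^= hash_ >> 16
    return hash_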
const uint32_t xxhash32_prime1 = 2654435761U;
const uint32_t xxhash32_prime2 = 2246822519U;
const uint32_t xxhash32_prime3 = 3266489917U;
const uint32_t xxhash32_prime4 = 668265263U;
const uint32_t xxhash32_prime5 = 374761393U;

inline uint32_t xxhash32_rotl(uint32_t x, int r)
{
    return (x << r) | (x >> (32 - r));
}

inline uint32_t xxhash32_mix(uint32_t v, uint32_t chunk)
{
    v += chunk * xxhash32_prime2;
    v = xxhash32_rotl(v, 13);
    return v * xxhash32_prime1;
}

uint32_t xxhash32(const char *data, size_t len, uint32_t seed)
{
    const char *p = data;
    const char *end = data + len;
    uint32_t hash = 0;

    if (len >= 16) {
        uint32_t v1 = seed + xxhash32_prime1 + xxhash32_prime2;
        uint32_t v2 = seed + xxhash32_prime2;
        uint32_t v3 = seed;
        uint32_t v4 = seed - xxhash32_prime1;

        for (; p <= (end - 16); p += 16) {
            const uint32_t *chunk = (const uint32_t *) p;

            v1 = xxhash32_mix(v1, chunk[0]);
            v2 = xxhash32_mix(v2, chunk[1]);
            v3 = xxhash32_mix(v3, chunk[2]);
            v4 = xxhash32_mix(v4, chunk[3]);
        }

        hash = xxhash32_rotl(v1, 1) + xxhash32_rotl(v2, 7) +
               xxhash32_rotl(v3, 12) + xxhash32_rotl(v4, 18);
    } else {
        hash = seed + xxhash32_prime5;
    }

    for (; (p + 4) <= end; p += 4) {
        hash += *((const uint32_t *) p) * xxhash32_prime3;
        hash = xxhash32_rotl(hash, 17) * xxhash32_prime4;
    }

    for (; p < end; p++) {
        hash += ((uint8_t) *p) * xxhash32_prime5;
        hash = xxhash32_rotl(hash, 11) * xxhash32_prime1;
    }

    hash ^= hash >> 15;
    hash *= xxhash32_prime2;
    hash ^= hash >> 13;
    hash *= xxhash32_prime3;
    hash ^= hash >> 16;

    return hash;
}

This one's even bigger (probably as big as you'd ever really want to implement by hand), but the principle is the same. (This is simplified.) Some primes are defined up top, some utility functions are defined, and then there are three mixing loops: one for when you can still read 16 bytes, one for 4 bytes, and one for 1 byte.
uint32_t v1 = seed + xxhash32_prime1 + xxhash32_prime2;
uint32_t v2 = seed + xxhash32_prime2;
uint32_t v3 = seed;
uint32_t v4 = seed - xxhash32_prime1;

The 16 byte case starts by using the seed and two of the five primes to set up four working variables.
for (; p <= (end - 16); p += 16) {
    const uint32_t *chunk = (const uint32_t *) p;

    v1 = xxhash32_mix(v1, chunk[0]);
    v2 = xxhash32_mix(v2, chunk[1]);
    v3 = xxhash32_mix(v3, chunk[2]);
    v4 = xxhash32_mix(v4, chunk[3]);
}

Then it reads 16 bytes at a time and mixes each 4 byte chunk into those working variables.
inline uint32_t xxhash32_rotl(uint32_t x, int r)
{
    return (x << r) | (x >> (32 - r));
}

inline uint32_t xxhash32_mix(uint32_t v, uint32_t in)
{
    v += in * xxhash32_prime2;
    v = xxhash32_rotl(v, 13);
    return v * xxhash32_prime1;
}

The mixing function itself is very similar to MurmurHash3: multiply by a prime, rotate, and multiply again.
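In Python, that round would look something like this (my sketch, with masking to keep everything within 32 bits):

def xxhash32_rotl(x, r):
    x &= 0xffffffff
    return ((x << r) | (x >> (32 - r))) & 0xffffffff

def xxhash32_mix(v, chunk):
    # One xxHash32 round: multiply-accumulate, rotate, multiply.
    v = (v + chunk * 2246822519) & 0xffffffff   # prime2
    v = xxhash32_rotl(v, 13)
    return (v * 2654435761) & 0xffffffff        # prime1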
hash = xxhash32_rotl(v1, 1) + xxhash32_rotl(v2, 7) +
       xxhash32_rotl(v3, 12) + xxhash32_rotl(v4, 18);

Finally, the four working variables are folded together into a single hash value, and we move on to the 4 byte case to handle any trailing data.
if (len >= 16) {
    /* what you just saw */
} else {
    hash = seed + xxhash32_prime5;
}

There's a separate initialisation case for small inputs.
for (; (p + 4) <= end; p += 4) {
    hash += *((const uint32_t *) p) * xxhash32_prime3;
    hash = xxhash32_rotl(hash, 17) * xxhash32_prime4;
}

for (; p < end; p++) {
    hash += ((uint8_t) *p) * xxhash32_prime5;
    hash = xxhash32_rotl(hash, 11) * xxhash32_prime1;
}

There are then the 4 byte and 1 byte mixers, which are used until all the data has been mixed in.
hash ^= hash >> 15;
hash *= xxhash32_prime2;
hash ^= hash >> 13;
hash *= xxhash32_prime3;
hash ^= hash >> 16;

return hash;

Finally, a post-processing step. Notice how these all look pretty similar now that you know how to break them down?
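If you just want xxHash from Python rather than reimplementing it, there are bindings on PyPI that wrap the reference C implementation; if I'm remembering the API correctly, usage looks roughly like this:

import xxhash

# The xxh32 object wraps the (non-simplified) 32-bit reference implementation.
h = xxhash.xxh32(b"hello, world", seed=42)
print(h.intdigest())   # the hash as a 32-bit integer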
20 byte blocks
More work per block compared to MurmurHash3
7 rotations and 2 swaps per block
20 byte blocks are an unusual choice: they were used to optimise read-ahead caching against the size of the SSE pipeline on CPUs of the time. The extra work per block is a cost when dealing with small inputs.

inline uint16_t get16bits(const char *p)
{
    return *(const uint16_t *) p;
}

uint32_t superfasthash(const char *data, size_t len)
{
    uint32_t hash = len;   /* the length doubles as the initial state */
    uint32_t tmp;
    size_t rem;

    rem = len & 3;
    len >>= 2;

    for (; len > 0; len--) {
        hash += get16bits(data);
        tmp = (get16bits(data + 2) << 11) ^ hash;
        hash = (hash << 16) ^ tmp;
        data += 2 * sizeof(uint16_t);
        hash += hash >> 11;
    }

    switch (rem) {
        case 3:
            hash += get16bits(data);
            hash ^= hash << 16;
            hash ^= data[sizeof(uint16_t)] << 18;
            hash += hash >> 11;
            break;
        case 2:
            hash += get16bits(data);
            hash ^= hash << 11;
            hash += hash >> 17;
            break;
        case 1:
            hash += *data;
            hash ^= hash << 10;
            hash += hash >> 1;
    }

    hash ^= hash << 3;
    hash += hash >> 5;
    hash ^= hash << 4;
    hash += hash >> 17;
    hash ^= hash << 25;
    hash += hash >> 6;

    return hash;
}

Interesting: there's no initial state besides the length of the input (and, in seeded variants, the seed).
for (; len > 0; len--) {
    hash += get16bits(data);
    tmp = (get16bits(data + 2) << 11) ^ hash;
    hash = (hash << 16) ^ tmp;
    data += 2 * sizeof(uint16_t);
    hash += hash >> 11;
}

This is the main mixing function. It deals in four byte chunks, but actually handles the high and low 16 bits separately. There are some gymnastics around bit shifting, but note that the shift amounts are always the same.
if (rem == 3) {
    hash += get16bits(data);
    hash ^= hash << 16;
    hash ^= data[sizeof(uint16_t)] << 18;
    hash += hash >> 11;
} else if (rem == 2) {
    hash += get16bits(data);
    hash ^= hash << 11;
    hash += hash >> 17;
} else if (rem == 1) {
    hash += *data;
    hash ^= hash << 10;
    hash += hash >> 1;
}

Trailing bytes are handled differently based on how many there are, which is interesting.
hash ^= hash << 3;
hash += hash >> 5;
hash ^= hash << 4;
hash += hash >> 17;
hash ^= hash << 25;
hash += hash >> 6;

return hash;

Probably because of those similar shift amounts, there's a lot of shifting in the post-processing. The problem is that the damage has already been done by this point: if the 4 byte loop collapses different inputs into the same state, no amount of post-processing will separate them, and you end up with basically the same output.
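If you want to poke at that behaviour yourself, here's a rough Python port of the routine above (my own sketch: 32 bit masking throughout, little-endian 16 bit reads as on x86, and the length as the initial state):

def superfasthash(data: bytes) -> int:
    mask = 0xffffffff

    def get16bits(b, i):
        # Little-endian 16 bit read, matching the C cast on x86.
        return b[i] | (b[i + 1] << 8)

    hash_ = len(data) & mask
    rem = len(data) & 3
    i = 0

    # Main loop: four bytes per iteration, handled as two 16 bit halves.
    for _ in range(len(data) >> 2):
        hash_ = (hash_ + get16bits(data, i)) & mask
        tmp = ((get16bits(data, i + 2) << 11) ^ hash_) & mask
        hash_ = ((hash_ << 16) ^ tmp) & mask
        i += 4
        hash_ = (hash_ + (hash_ >> 11)) & mask

    # Trailing bytes, handled differently depending on how many are left.
    if rem == 3:
        hash_ = (hash_ + get16bits(data, i)) & mask
        hash_ ^= (hash_ << 16) & mask
        hash_ ^= (data[i + 2] << 18) & mask
        hash_ = (hash_ + (hash_ >> 11)) & mask
    elif rem == 2:
        hash_ = (hash_ + get16bits(data, i)) & mask
        hash_ ^= (hash_ << 11) & mask
        hash_ = (hash_ + (hash_ >> 17)) & mask
    elif rem == 1:
        hash_ = (hash_ + data[i]) & mask
        hash_ ^= (hash_ << 10) & mask
        hash_ = (hash_ + (hash_ >> 1)) & mask

    # Post-processing "avalanche".
    hash_ ^= (hash_ << 3) & mask
    hash_ = (hash_ + (hash_ >> 5)) & mask
    hash_ ^= (hash_ << 4) & mask
    hash_ = (hash_ + (hash_ >> 17)) & mask
    hash_ ^= (hash_ << 25) & mask
    hash_ = (hash_ + (hash_ >> 6)) & mask

    return hash_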