Non-cryptographic hashing – Adam Harvey @LGnome New Relic – What?

A talk on non-cryptographic hashing.

On Github LawnGnome / non-cryptographic-hashing

Non-cryptographic hashing

Adam Harvey @LGnome New Relic

What?

Hash function

Let's start with the obvious: what is a hash function? (Most of you likely know this already, but let's make sure we're all on the same page.)
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer vitae ex nec nisl fermentum sodales in ut lectus.
61df8894 0ec572f8 19e37640 372ecf1b
Fundamentally, a hash function takes any amount of data and returns a number based on that data.

MD5 SHA-1 SHA-256 SHA-512

We use them every day. They underpin SSL, block encryption, password storage, and many other things that we take for granted.

Non-Cryptographic hash function

All of those are cryptographic hash functions, though. They're designed with specific qualities in mind: to make it as difficult as possible to generate collisions, and as difficult as possible to work back to find out what the input was. Key derivation functions extend this for use with passwords: bcrypt uses the Blowfish block cipher in conjunction with a basic hashing function to generate CPU-hard hashes based on salted passwords.

Non-Cryptographic hash function

Not all hash functions have to be cryptographic, though. There's also a place for functions that are fast and reasonably well distributed, but provide no guarantee of security. Some can also minimise memory usage.

This is the world of the non-cryptographic hash function.

Why?

Hash tables

You need some way of distributing things into buckets. Taking a hash of the key is usually the easiest way: in this case, you're not interested in security, but you are interested in speed. (Why would you want to implement a hash table nowadays?)

What have I seen?

Many languages provide set types, but when you're dealing with a lot of data you may not want to keep everything in memory. Hashing it first can help, particularly if you're doing something that can withstand the odd false positive.

Is this valid?

While you'd often need to use a cryptographic hash function for checksumming, there are scenarios where you really do just want the 2010s version of a parity bit: a quick, simple message digest.

Interoperability

People spec weird things all the time. New Relic CAT example.

It's fun?

How?

Goals

  • Fast
  • Well distributed output

There are two key goals. We want something that's quick, and we want something that distributes its output well. A common use of non-cryptographic hash functions is in implementing hash tables — while that's not usually a huge concern in Python, you don't want clumps in your output, because that often leads to collisions.

Avalanche effect

00010000
00001000
00011000
00000000
000??000
This is what you don't want. You want each change to the data to have a bigger effect on the output.

Avalanche effect

00010000
00001000
00011000
00000000
????????
Having each change have a large change on the output is called the avalanche effect.
def hash_function(data: bytes) -> int:
  hash = initial_state
  for chunk in md_pad(data):
    hash = cleverness(hash, chunk)

  return hash

Broadly, this is how every hash function is constructed. Easy, huh? (Python provides a cleverness function, I'm sure.) Cleverness is actually a "mixing function": it takes the current state and mutates it with the chunk.

Chunk sizes vary, but are most commonly bytes or 32-bit words.

As an aside, cryptographic hash functions usually use a Merkle-Damgård (DAM-gour(d)) construction, which adds padding in a specific way, but is otherwise a fancy (and hard) way of saying this.

Cleverness

  • Unsigned, fixed length integers
  • Bitwise operations (xor; bitshifts)
  • Multiplication
The basic constructs that tend to be used. Bitshifts and xor are extremely useful when the input has little variance, as they can be used to force lots of changes to otherwise unaffected bits.

Integers

  0xffffffff
+ 0x00000001
= 0x00000000
Integer overflow behaviour is used heavily. In higher level languages like Python, you may have issues with the integer type being arbitrary precision, but most of this kind of work is usually done in C anyway.

Primes

Most hashing algorithms include one or more (usually more) prime numbers as constants. This is mostly about bucketing behaviour: if you're using the output to place things into buckets, you don't want the hash and the bucket count to share common factors. If your hash function is biased towards multiples of eight and you're implementing something with eight buckets, seven of them may go largely unused.

FNV-1a

Let's walk slowly through an example: in this case, FNV-1a. FNV was created by Glenn Fowler, Landon Curt Noll and Phong Vo in 1991, and later revised to improve its avalanche behaviour. It's still widely used today, and is simple enough that we can step through it and see the principles noted above in action.

FNV-1a

uint32_t fnv1a32(const char *data, size_t len) {
  uint32_t hash = 2166136261;
  for (const char *byte = data; byte < (data + len); byte++) {
    hash ^= *byte;
    hash *= 16777619;
  }
  return hash;
}
Here's the C implementation of the algorithm. Conceptually, it's very simple: for each byte (since it's a bytewise algorithm), we xor the hash value, then multiply it. Note that we'll see lots of magic numbers.

fnv1a32("ab", 2)

  uint32_t fnv1a32(const char *data, size_t len) {
>   uint32_t hash = 2166136261;
    for (const char *byte = data; byte < (data + len); byte++) {
      hash ^= *byte;
      hash *= 16777619;
    }
    return hash;
  }
hash:  1000 0001 0001 1100 1001 1101 1100 0101
*byte: (unset)
First, we set our hash value to a carefully chosen constant, the "offset basis". (FNV-0 did not have this step.) This isn't actually a prime: it's not chosen for its dispersal characteristics, but rather is simply a non-zero value so that the initial hash value isn't all zeroes. (In fact, it's the FNV-0 hash of the e-mail signature line of Landon Curt Noll!)

fnv1a32("ab", 2)

  uint32_t fnv1a32(const char *data, size_t len) {
    uint32_t hash = 2166136261;
>   for (const char *byte = data; byte < (data + len); byte++) {
      hash ^= *byte;
      hash *= 16777619;
    }
    return hash;
  }
hash:  1000 0001 0001 1100 1001 1101 1100 0101
*byte:                               0110 0001
Pick a byte (lowercase "a").

fnv1a32("ab", 2)

  uint32_t fnv1a32(const char *data, size_t len) {
    uint32_t hash = 2166136261;
    for (const char *byte = data; byte < (data + len); byte++) {
>     hash ^= *byte;
      hash *= 16777619;
    }
    return hash;
  }
hash:  1000 0001 0001 1100 1001 1101 1010 0100
*byte:                               0110 0001
Now we do the exclusive-or. Note that only three bits in hash actually changed. Not very avalanchey!

fnv1a32("ab", 2)

  uint32_t fnv1a32(const char *data, size_t len) {
    uint32_t hash = 2166136261;
    for (const char *byte = data; byte < (data + len); byte++) {
      hash ^= *byte;
>     hash *= 16777619;
    }
    return hash;
  }
hash:  1110 0100 0000 1100 0010 1001 0010 1100
*byte:                               0110 0001
Now we see some changes! The sharp-eyed will have noticed that this prime is close to 2^24, which means that we're effectively performing a bitshift with some added noise in the lower bits. There's some avalanche.

fnv1a32("ab", 2)

  uint32_t fnv1a32(const char *data, size_t len) {
    uint32_t hash = 2166136261;
>   for (const char *byte = data; byte < (data + len); byte++) {
      hash ^= *byte;
      hash *= 16777619;
    }
    return hash;
  }
hash:  1110 0100 0000 1100 0010 1001 0010 1100
*byte:                               0110 0010
Now we do it again with b, which is only two bits different from a.

fnv1a32("ab", 2)

  uint32_t fnv1a32(const char *data, size_t len) {
    uint32_t hash = 2166136261;
    for (const char *byte = data; byte < (data + len); byte++) {
>     hash ^= *byte;
      hash *= 16777619;
    }
    return hash;
  }
hash:  1110 0100 0000 1100 0010 1001 0100 1110
*byte:                               0110 0010
Again, only three bits change when we do the xor.

fnv1a32("ab", 2)

  uint32_t fnv1a32(const char *data, size_t len) {
    uint32_t hash = 2166136261;
    for (const char *byte = data; byte < (data + len); byte++) {
      hash ^= *byte;
>     hash *= 16777619;
    }
    return hash;
  }
hash:  0100 1101 0010 0101 0000 0101 1100 1010
*byte:                               0110 0010
You can see how well distributed the changes are, even after only two bytes of input. Every nibble has one or two bits modified. A small change resulted in a big change.

Shootout

Let's look at a few functions (specifically, the 32-bit versions where appropriate). I've chosen functions that are feasible to implement in a variety of languages (so they don't rely on anything in particular) and are generally considered to be among the state of the art. First, though, we need a way to compare them.

Goals

  • Fast
  • Well distributed output
Let's look at our goals again. How can we judge them?

Fast

Fast is easy. CPU time pretty much tells us that.

Distribution

We can look at collisions. So let's talk about inputs.

Input: words

A
a
aa
aal
aalii
aam
Aani
aardvark
This is the OS X /usr/share/dict/words. This is a good test for collisions, and an interesting one for performance with lots of varied, short inputs.

Input: numbers

0
1
2
3
4
5
6
7
This is a file of numbers from 0 to 235885 (the same size as the words input). Again, this is really interesting for collisions: a less varied input.

Input: Holmes

The complete Sherlock Holmes canon, with each adventure or novel as a separate input. This is a good performance test for larger (40k-200k) inputs. (Less likely for hashtable keys, but likely for "have I seen this" cases.) What we'll see is that some algorithms are better optimised for larger inputs than others.

FNV-1a

Since we've already looked at FNV-1a in some detail, let's look at some results for it.

FNV-1a

Corpus    CPU time (ms)   Collisions
Numbers   55.40           0
Words     68.12           5 (0.002%)
Holmes    3.13            0

There are collisions in the word corpus. This isn't really unexpected: we're only getting 32-bit values, and given the birthday paradox, there's a greater than 75% chance of a collision for a naïve hash function with more than 200000 inputs.

FNV-1a

Another way to visualise the distribution besides just the number of collisions is to map it out. I've shamelessly stolen this approach from Ian Boyd, who wrote a seminal StackExchange essay on hashing functions three years ago.

White dots indicate unused values within the output.

Basically, you want this to look like noise. FNV-1a is pretty good on this count.

CRC32

Another classic algorithm is CRC32, which dates from 1961. It's another byte-wise algorithm.

CRC32

static const uint32_t crc32tab[256] = { /* ... */ };

uint32_t crc32(const char *data, size_t len) {
  uint32_t crc = 0xffffffff;
  const char *ptr;

  for (ptr = data; ptr < (data + len); ptr++) {
    crc = (crc >> 8) ^ crc32tab[(crc ^ (*ptr)) & 0xff];
  }

  return crc;
}
CRC is interesting because it uses very basic primitives, but combines them with a lookup table that is used for extra mixing. The mathematics are interesting: in effect, we're doing binary long division of the message, treated as a polynomial, by a generator polynomial, and the remainder is the check value.

CRC32

Corpus    CPU time (ms)   % FNV-1a   Collisions
Numbers   61.42           111%       0
Words     76.82           113%       8 (0.003%)
Holmes    6.23            199%       0

It's a bit slower than FNV-1a, but it's still pretty good for a byte-wise algorithm.

CRC32

You can see some clumping in here. This is because of the lookup table: values will tend towards certain buckets.

MurmurHash3

MurmurHash3 is another relatively simple to implement hashing function. As the name implies, it's the third revision of a function designed by Austin Appleby. The name is inspired by two basic operations: MUltiply and Rotate.

MurmurHash3

uint32_t murmurhash3_32(const char *data, size_t len, uint32_t seed) {
  static const uint32_t c1 = 0xcc9e2d51;
  static const uint32_t c2 = 0x1b873593;
  static const uint32_t r1 = 15;
  static const uint32_t r2 = 13;
  static const uint32_t m = 5;
  static const uint32_t n = 0xe6546b64;

  uint32_t hash = seed;

  const int nblocks = len / 4;
  const uint32_t *blocks = (const uint32_t *) data;
  for (int i = 0; i < nblocks; i++) {
    uint32_t k = blocks[i];
    k *= c1;
    k = (k << r1) | (k >> (32 - r1));
    k *= c2;

    hash ^= k;
    hash = ((hash << r2) | (hash >> (32 - r2))) * m + n;
  }

  const uint8_t *tail = (const uint8_t *) (data + nblocks * 4);
  uint32_t k1 = 0;

  switch (len & 3) {
  case 3:
    k1 ^= tail[2] << 16;
  case 2:
    k1 ^= tail[1] << 8;
  case 1:
    k1 ^= tail[0];

    k1 *= c1;
    k1 = (k1 << r1) | (k1 >> (32 - r1));
    k1 *= c2;
    hash ^= k1;
  }

  hash ^= len;
  hash ^= (hash >> 16);
  hash *= 0x85ebca6b;
  hash ^= (hash >> 13);
  hash *= 0xc2b2ae35;
  hash ^= (hash >> 16);

  return hash;
}
Yikes! It's not that bad, though. Let's break it down: this is typical of a more modern hash function.

MurmurHash3

uint32_t murmurhash3_32(const char *data,
                        size_t      len,
                        uint32_t    seed);
Here's the prototype. A lot of modern hash functions are initialised with seeds. This makes it harder to construct DoS attacks: if the caller can use a random number to seed, you can't pre-compute what will collide. (Talk about PHP DoS attack fixed in 5.3.9 in January 2012.)

MurmurHash3

static const uint32_t c1 = 0xcc9e2d51;
static const uint32_t c2 = 0x1b873593;
static const uint32_t r1 = 15;
static const uint32_t r2 = 13;
static const uint32_t m = 5;
static const uint32_t n = 0xe6546b64;

uint32_t hash = seed;
We define a bunch of constants, and set up the initial hash value from the seed.

MurmurHash3

const size_t nblocks = len / 4;
const uint32_t *blocks = (const uint32_t *) data;

for (size_t i = 0; i < nblocks; i++) {
  uint32_t k = blocks[i];

  k *= c1;
  k = (k << r1) | (k >> (32 - r1));
  k *= c2;

  hash ^= k;
  hash = ((hash << r2) | (hash >> (32 - r2))) * m + n;
}
Here's the heart of the mixing function. Note the use of four bytes at a time instead of one (no unaligned access; cache friendly), the use of C reinterpret-cast semantics, and the same mathematical primitives as before. Trailing bytes are left off by the nblocks integer division. Note the constants; n is pretty close to 2^32 (to enforce overflow).

MurmurHash3

const uint8_t *tail;
tail = (const uint8_t *) (data + nblocks * 4);
Go through some gymnastics for the last few bytes to ensure they're properly mixed into the hash value. Note that we're interpreting the tail as bytes, rather than words.

MurmurHash3

uint32_t k1 = 0;
switch (len & 3) {
  case 3:
    k1 ^= tail[2] << 16;
  case 2:
    k1 ^= tail[1] << 8;
  case 1:
    k1 ^= tail[0];

    k1 *= c1;
    k1 = (k1 << r1) | (k1 >> (32 - r1));
    k1 *= c2;
    hash ^= k1;
}
The use of a different mixing function for the last few bytes is common. Each byte gets mixed into k1, then additional mixing takes place before it's mixed into the main hash.

MurmurHash3

hash ^= len;
hash ^= (hash >> 16);
hash *= 0x85ebca6b;
hash ^= (hash >> 13);
hash *= 0xc2b2ae35;
hash ^= (hash >> 16);

return hash;
Finally, after all mixing, some postprocessing occurs. This is common in newer hash functions: with carefully chosen values, you can attempt to invoke additional avalanching, but this can also be a source of weakness (clumping).

MurmurHash3

Corpus    CPU time (ms)   % FNV-1a   Collisions
Numbers   69.10           125%       4 (0.002%)
Words     69.23           102%       6 (0.003%)
Holmes    0.95            30%        0

A few more collisions, which is unfortunate. It's slower for small inputs than FNV-1a (which isn't surprising, given the extra setup work), but way faster for larger inputs because it's word-wise.

MurmurHash3

xxHash

xxHash is another relatively new hash function that has been quickly adopted by a number of projects, including pfSense and TeamViewer. (For what, I don't know.) It has a similar construction to MurmurHash3, except it reads 16 bytes at a time.

xxHash

const uint32_t xxhash32_prime1 = 2654435761U;
const uint32_t xxhash32_prime2 = 2246822519U;
const uint32_t xxhash32_prime3 = 3266489917U;
const uint32_t xxhash32_prime4 = 668265263U;
const uint32_t xxhash32_prime5 = 374761393U;

inline uint32_t xxhash32_rotl(uint32_t x, int r) {
  return (x << r) | (x >> (32 - r));
}

inline uint32_t xxhash32_mix(uint32_t v, uint32_t chunk) {
  v += chunk * xxhash32_prime2;
  v = xxhash32_rotl(v, 13);
  return v * xxhash32_prime1;
}

uint32_t xxhash32(const char *data, size_t len, uint32_t seed) {
  const char *p = data;
  const char *end = data + len;

  uint32_t hash = 0;

  if (len >= 16) {
    uint32_t v1 = seed + xxhash32_prime1 + xxhash32_prime2;
    uint32_t v2 = seed + xxhash32_prime2;
    uint32_t v3 = seed;
    uint32_t v4 = seed - xxhash32_prime1;

    for (; p <= (end - 16); p += 16) {
      const uint32_t *chunk = (const uint32_t *) p;

      v1 = xxhash32_mix(v1, chunk[0]);
      v2 = xxhash32_mix(v2, chunk[1]);
      v3 = xxhash32_mix(v3, chunk[2]);
      v4 = xxhash32_mix(v4, chunk[3]);
    }

    hash = xxhash32_rotl(v1, 1) +
           xxhash32_rotl(v2, 7) +
           xxhash32_rotl(v3, 12) +
           xxhash32_rotl(v4, 18);
  } else {
    hash = seed + xxhash32_prime5;
  }

  for (; (p + 4) <= end; p += 4) {
    hash += *((const uint32_t *) p) * xxhash32_prime3;
    hash = xxhash32_rotl(hash, 17) * xxhash32_prime4;
  }

  for (; p < end; p++) {
    hash += ((uint8_t) *p) * xxhash32_prime5;
    hash = xxhash32_rotl(hash, 11) * xxhash32_prime1;
  }

  hash ^= hash >> 15;
  hash *= xxhash32_prime2;
  hash ^= hash >> 13;
  hash *= xxhash32_prime3;
  hash ^= hash >> 16;

  return hash;
}
This one's even bigger — probably as big as you'd really ever want to implement — but the principle is the same. (This is simplified.) Some primes are defined atop, some utility functions are defined, and then there are three mixing functions: one for when you can read 16 bytes, one for 4 bytes, and one for 1 byte.

xxHash

uint32_t v1 = seed + xxhash32_prime1 + xxhash32_prime2;
uint32_t v2 = seed + xxhash32_prime2;
uint32_t v3 = seed;
uint32_t v4 = seed - xxhash32_prime1;
The 16 byte case starts by using the seed and two of the five primes to set up four working variables.

xxHash

for (; p <= (end - 16); p += 16) {
  const uint32_t *chunk = (const uint32_t *) p;

  v1 = xxhash32_mix(v1, chunk[0]);
  v2 = xxhash32_mix(v2, chunk[1]);
  v3 = xxhash32_mix(v3, chunk[2]);
  v4 = xxhash32_mix(v4, chunk[3]);
}
Then it reads 16 bytes at a time and mixes each 4 byte chunk into those working variables.

xxHash

inline uint32_t xxhash32_rotl(uint32_t x, int r) {
  return (x << r) | (x >> (32 - r));
}

inline uint32_t xxhash32_mix(uint32_t v, uint32_t in) {
  v += in * xxhash32_prime2;
  v = xxhash32_rotl(v, 13);
  return v * xxhash32_prime1;
}
The mixing function itself is very similar to MurmurHash3: multiply by a prime, rotate, and multiply again.

xxHash

hash = xxhash32_rotl(v1, 1) +
       xxhash32_rotl(v2, 7) +
       xxhash32_rotl(v3, 12) +
       xxhash32_rotl(v4, 18);
Finally, we set the hash, then move onto the 4 byte case to handle trailing data.

xxHash

if (len >= 16) {
  /* what you just saw */
} else {
  hash = seed + xxhash32_prime5;
}
There's a separate initialisation case for small inputs.

xxHash

for (; (p + 4) <= end; p += 4) {
  hash += *((const uint32_t *) p) * xxhash32_prime3;
  hash = xxhash32_rotl(hash, 17) * xxhash32_prime4;
}

for (; p < end; p++) {
  hash += ((uint8_t) *p) * xxhash32_prime5;
  hash = xxhash32_rotl(hash, 11) * xxhash32_prime1;
}
There are then the 4 and 1 byte mixers, which are used until all the data has been mixed.

xxHash

hash ^= hash >> 15;
hash *= xxhash32_prime2;
hash ^= hash >> 13;
hash *= xxhash32_prime3;
hash ^= hash >> 16;

return hash;
Finally, a post processing step. Notice how these all look pretty similar now you know how to break them down?

xxHash

Corpus    CPU time (ms)   % FNV-1a   Collisions
Numbers   69.41           125%       2 (0.001%)
Words     73.97           109%       6 (0.003%)
Holmes    0.396           12.6%      0

Minimal collisions. It's slightly slower than MurmurHash3 on small inputs, but even quicker on large inputs because of the hoops it jumps through for speed (reading 16 bytes at a time).

xxHash

CityHash

I'm not going to go through CityHash in great detail. It's a hash function developed at Google a few years ago, and was probably one of the earliest of the modern generation of hash functions. It's loosely based on early MurmurHash.

CityHash32

CityHash64

CityHash128

It comes in three variants, with more effort having gone into the 64 and 128 bit variants. We'll only look at the 32 bit version here.

CityHash32

20 byte blocks

More per block work compared to MurmurHash32

7 rotations and 2 swaps per block

20 byte blocks are an unusual choice: they were chosen to optimise readahead caching given the size of the SSE pipeline on CPUs at the time. There's more work per block, which is a cost when dealing with small inputs.

CityHash32

Corpus    CPU time (ms)   % FNV-1a   Collisions
Numbers   74.23           134%       5 (0.002%)
Words     68.92           101%       7 (0.003%)
Holmes    0.68            22%        0

It's a little disappointing compared to xxHash. Collisions are slightly up, and it's noticeably slower for all input sizes due to the extra work. I suspect it's also over-optimised for the x86 CPUs in use in 2008-2010: decisions around block sizes and operations were made with great reference to hardware.

CityHash32

This is pretty good.

SuperFastHash

All these good hash functions are boring. Let's look at what happens when you use a bad function. To quote the NonCryptoHashZoo, SuperFastHash is "the quintessential example of how easy it is to go wrong when making your own hash function even if you're incredibly smart".

SuperFastHash

inline uint16_t get16bits(const char *p) {
  return *(const uint16_t*)p;
}

uint32_t superfasthash(const char *data, size_t len) {
  uint32_t hash = (uint32_t) len;
  uint32_t tmp;
  size_t rem;

  rem = len & 3;
  len >>= 2;

  for (; len > 0; len--) {
    hash  += get16bits (data);
    tmp    = (get16bits (data+2) << 11) ^ hash;
    hash   = (hash << 16) ^ tmp;
    data  += 2*sizeof (uint16_t);
    hash  += hash >> 11;
  }

  switch (rem) {
    case 3:
      hash += get16bits (data);
      hash ^= hash << 16;
      hash ^= data[sizeof (uint16_t)] << 18;
      hash += hash >> 11;
      break;
    case 2:
      hash += get16bits (data);
      hash ^= hash << 11;
      hash += hash >> 17;
      break;
    case 1:
      hash += *data;
      hash ^= hash << 10;
      hash += hash >> 1;
  }

  hash ^= hash << 3;
  hash += hash >> 5;
  hash ^= hash << 4;
  hash += hash >> 17;
  hash ^= hash << 25;
  hash += hash >> 6;

  return hash;
}
Interesting: there's almost no initial state; the original seeds the hash with nothing but the length of the input.

SuperFastHash

for (; len > 0; len--) {
  hash  += get16bits(data);
  tmp    = (get16bits(data + 2) << 11) ^ hash;
  hash   = (hash << 16) ^ tmp;
  data  += 2 * sizeof(uint16_t);
  hash  += hash >> 11;
}
This is the main mixing function. It deals in four byte chunks, but actually handles the high and low 16 bits separately. There are some gymnastics around bitshifting, but note that the same shift amounts keep recurring.

SuperFastHash

if (rem == 3) {
  hash += get16bits(data);
  hash ^= hash << 16;
  hash ^= data[sizeof(uint16_t)] << 18;
  hash += hash >> 11;
} else if (rem == 2) {
  hash += get16bits(data);
  hash ^= hash << 11;
  hash += hash >> 17;
} else if (rem == 1) {
  hash += *data;
  hash ^= hash << 10;
  hash += hash >> 1;
}
Trailing bytes are handled differently based on the size, which is interesting.

SuperFastHash

hash ^= hash << 3;
hash += hash >> 5;
hash ^= hash << 4;
hash += hash >> 17;
hash ^= hash << 25;
hash += hash >> 6;

return hash;
Probably because of the similar bitshifts, there's a lot of shifting in the post processing. The problem is that the damage has been done by now: the 4 byte loop often collapses into the same state, and then you're going to end up with basically the same output.

SuperFastHash

Corpus    CPU time (ms)   % FNV-1a   Collisions
Numbers   66.13           119%       23946 (10.152%)
Words     82.01           120%       54 (0.023%)
Holmes    9.68            310%       0

Oh. Oh dear. Hell, it's even slower than FNV-1a, so you're not even trading speed for distribution.

SuperFastHash

See that stripe down the right hand side? That's not good. That's a lot of values collapsing into the same buckets.

JSHash

Of course, it could be worse. I won't go through the implementation of this one, but this is the output of JSHash (by Justin Sobel) with the numbers input. In an unrelated project, we used it in a bog-standard chained hash table implementation, then switched to MurmurHash3 and saw a 0.2% speedup from that change alone.

Conclusions

Know your environment

MurmurHash3 and xxHash smoke every other algorithm when implemented in C or C++, because they exploit how CPUs behave when you're writing at that level. They use cache locality and read alignment to their advantage. In Python, though, that's just overhead: the byte at a time approach is actually faster because of the way the bytes type is implemented, and other algorithms can be simpler.

Respect your standard library

If you're working in an interpreted language, chances are whatever your language provides in its standard library will be written in C and quicker. In Python and PHP, you may as well use CRC32, since they're provided. Less work, and faster!

Know your inputs

xxHash makes some compromises that don't make sense if your average input is under 16 bytes.

Small inputs

For the words corpus, which has a mean size of 9 bytes, the block reading algorithms don't do any better. At that point, you might as well pick something extremely simple like FNV-1a.

Large inputs

For the Holmes corpus, which has a mean of 67k, the block reading algorithms shine. xxHash is crazy fast.

Know your requirements

Understand how many collisions you're willing to bear. The best way to figure this out is to implement and test with sample data: hash functions are easy to implement, and you can test output using sets. You may also want to consider complexity of implementation: xxHash is robust, but more complicated.

Questions?

@LGnome

Resources

The NonCryptoHashZoo is now down, but fortunately the Internet Archive comes to the rescue.