One of the funny things about hash functions is that everyone uses them and almost nobody thinks about them. They’re the plumbing of computer science: invisible until something goes wrong, at which point your database is on fire and you’re reading a Wikipedia article about “avalanche effects” at 2am.
A hash function, in its simplest form, takes some data and produces a fixed-size number. That’s it. You give it “hello” and it gives you 0x5d41402a. You give it the entire works of Shakespeare and it gives you… also a fixed-size number, just a different one. Hopefully a different one. We’ll get to that.
Where hash functions live
Hash functions are everywhere. Hash tables use them to decide which bucket your data goes in. Databases use them to index things quickly. Git uses them to identify commits. Blockchains use them to chain blocks together. Password systems use them to store credentials without storing credentials. Compilers use them. Caches use them. Your Python dictionaries use them. If you’ve ever wondered how your computer does anything quickly, the answer is usually “a hash table, somewhere.”
The basic promise is this: instead of searching through everything to find what you want, you compute a number that tells you exactly where to look. It’s like the difference between looking for a book by walking through a library versus looking it up in the catalogue. Except the catalogue is mathematics, and sometimes it lies to you.
Speed, mostly
Hash functions need to be fast. This is the first and most important property for most use cases. If you’re inserting a million items into a hash table, you’re calling the hash function a million times. If each call takes a microsecond instead of a nanosecond, you’ve just added a second to your operation. Users notice seconds.
The exception is passwords, where you actually want the hash function to be slow, because you’re trying to make life difficult for attackers who want to try billions of guesses. But that’s a different article, and frankly a different kind of paranoia.
What makes a good hash?
Before we look at how hash functions work, we need to know what we’re aiming for. A hash function isn’t just “some arithmetic” – it’s arithmetic designed to achieve specific properties. Understanding these properties tells us whether a hash function is doing its job.
Uniform distribution
A good hash function spreads its outputs evenly across all possible values. If you hash a million different inputs with a 64-bit hash function, you want them scattered uniformly across all 2^64 possible outputs, not clustered in some corner.
Why does this matter? In a hash table with 1024 buckets, you typically take the hash modulo 1024 to pick a bucket. If your hash function favours certain bit patterns, some buckets will be overcrowded while others sit empty. Your O(1) lookups quietly become O(n).
The tricky part is that real-world inputs are not uniformly distributed. People’s names cluster around common patterns. File sizes cluster around powers of two. IP addresses cluster around certain ranges. A good hash function takes this structured, clumpy input and produces uniform output anyway.
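Here’s a toy illustration (the 4096-byte clumping and the bucket count are invented for the example): feed clumpy input to an “identity hash” and almost every bucket goes unused.

fn main() {
    const BUCKETS: usize = 1024;
    let mut counts = [0usize; BUCKETS];

    // Clumpy but realistic-looking input: sizes that are multiples of 4096.
    for size in (0..1000u64).map(|i| i * 4096) {
        // "Identity hash" + modulo: every input lands in bucket 0.
        counts[(size as usize) % BUCKETS] += 1;
    }

    let used = counts.iter().filter(|&&c| c > 0).count();
    println!("{used} of {BUCKETS} buckets used"); // prints "1 of 1024 buckets used"
}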
The avalanche criterion
This is the key property that separates good hash functions from bad ones. The avalanche criterion says: if you flip a single bit in the input, roughly half the bits in the output should flip. Every input bit should influence every output bit, and changing one should make the output look completely different.
Why “avalanche”? Because a tiny change cascades through the computation until it’s affected everything. Change one letter in a document and the hash should be unrecognisably different – not just slightly different in one corner, but different everywhere.
Here’s a simple test. Take any input, hash it, then flip one input bit and hash again. Compare the outputs bit by bit. A good hash function will flip about 50% of the output bits, with each bit having an independent 50% chance of changing. A bad hash function might only change a few output bits, or change them in predictable patterns.
fn test_avalanche(hash: impl Fn(u64) -> u64, input: u64) {
    let original = hash(input);
    for bit in 0..64 {
        let flipped_input = input ^ (1u64 << bit);
        let flipped_hash = hash(flipped_input);
        let changed_bits = (original ^ flipped_hash).count_ones();
        // For a good hash, changed_bits should be around 32 (half of 64)
        println!("Flip bit {}: {} bits changed", bit, changed_bits);
    }
}

If you run this on a bad hash function – say, the identity function, or XOR with a constant – you’ll see that flipping input bit N only changes output bit N. That’s catastrophic for hash tables. It means inputs that differ only in their high bits will cluster in the same buckets.
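To see the contrast, you might run it on the identity function and on a single multiply (the constant below is the 64-bit golden-ratio constant, used here purely for illustration):

fn main() {
    // Identity "hash": flipping input bit N changes exactly output bit N.
    test_avalanche(|x| x, 0x0123_4567_89ab_cdef);

    // One multiply by an odd constant does far better for low input bits,
    // but flipping a high input bit still changes only the top few output bits.
    test_avalanche(|x| x.wrapping_mul(0x9e37_79b9_7f4a_7c15), 0x0123_4567_89ab_cdef);
}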
Diffusion
Diffusion is the mechanism behind the avalanche property. It means spreading the influence of each input bit across all output bits. Without diffusion, bits stay isolated – what happens in bit 47 stays in bit 47 – and the hash function can’t do its job.
Different operations provide different amounts of diffusion, as we’ll see. Some operations (like XOR) provide no diffusion at all. Others (like multiplication) provide excellent diffusion. The art of hash function design is combining operations to get complete diffusion as quickly as possible.
The birthday paradox
Here’s an uncomfortable fact about collisions. With an n-bit hash, how many items can you hash before you’re likely to see a collision? Your intuition might say 2^n. The correct answer is roughly √(2^n) = 2^(n/2) for a 50% probability.
This is the birthday paradox. In a room of 23 people, there’s a 50% chance two share a birthday – not because 23 is close to 365, but because there are 253 pairs of people. Collisions are about pairs, and pairs grow quadratically.
For hash tables, this is usually fine. You’re not storing 2^32 items, and a few collisions are handled gracefully. But for content-addressed storage or deduplication, where you’re comparing potentially billions of items, a 64-bit hash is dangerously small. Git uses 160-bit hashes for a reason.
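A quick back-of-the-envelope check, using the standard birthday approximation p ≈ 1 − e^(−k(k−1)/2N) (the item counts are arbitrary examples):

fn collision_probability(items: f64, space: f64) -> f64 {
    // exp_m1 keeps tiny probabilities from rounding to zero.
    let exponent = -(items * (items - 1.0)) / (2.0 * space);
    -exponent.exp_m1()
}

fn main() {
    // 23 people, 365 birthdays: about 50%.
    println!("{:.3}", collision_probability(23.0, 365.0));
    // A billion items under a 64-bit hash: a few percent – not rare at all.
    println!("{:.3}", collision_probability(1e9, 2f64.powi(64)));
    // The same billion items under a 160-bit hash: vanishingly small.
    println!("{:e}", collision_probability(1e9, 2f64.powi(160)));
}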
The toolkit
Hash functions are built from a small set of primitive operations. Each operation has strengths and weaknesses for mixing bits. Understanding them explains why hash functions look the way they do – which is to say, like someone threw random arithmetic at a wall and kept what stuck.
XOR: combining without bias
XOR is the workhorse of hashing. It combines two values bit by bit: if the bits are different, the result is 1; if they’re the same, the result is 0. XOR has a beautiful property: it combines information without adding bias. If one input is uniformly random, the output is uniformly random regardless of the other input.
fn xor_combine(a: u64, b: u64) -> u64 {
    a ^ b
}

The problem with XOR is that it provides zero diffusion. Bit 0 of the output depends only on bit 0 of the inputs. Bit 47 depends only on bit 47. If you XOR two values that differ in one bit, the output differs in exactly that one bit. This is terrible for avalanche – flip one input bit, exactly one output bit changes.
XOR is also its own inverse: a ^ b ^ b = a. This is elegant mathematically but means XOR can’t make anything “harder to reverse.” It combines entropy nicely, but contributes nothing to confusion.
XOR is best used to combine the outputs of operations that do provide diffusion. You’ll see hash functions do something like: multiply, then XOR, then multiply again. The multiplications spread bits around; the XOR combines the results.
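A sketch of that shape (the constants are arbitrary odd 64-bit values borrowed from the usual grab-bag of hash constants, not a recommendation):

fn mix(x: u64, y: u64) -> u64 {
    // The multiplies spread each input's bits upward;
    // the XOR combines the two results without bias;
    // the final multiply spreads the combined bits again.
    let a = x.wrapping_mul(0x9e37_79b9_7f4a_7c15);
    let b = y.wrapping_mul(0xc2b2_ae3d_27d4_eb4f);
    (a ^ b).wrapping_mul(0x1656_67b1_9e37_79f9)
}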
Rotation and shifts: repositioning bits
Since XOR can’t mix bit positions, we need another way to move bits around. Rotation does exactly this – it slides all the bits, wrapping around at the ends:
fn example_rotate(x: u64) -> u64 {
    x.rotate_left(17) // Bit 0 becomes bit 17, bit 47 becomes bit 0
}

Now if we XOR after rotating, bit positions can interact:
fn xor_rotate(a: u64, b: u64) -> u64 {
    a ^ b.rotate_left(17) // Bit 0 of a XORs with bit 47 of b
}

This is the XOR-rotate pattern you’ll see in many hash functions. Rotate to realign bits, then XOR to combine them. It’s cheap (rotation is one CPU instruction), reversible, and provides controlled diffusion.
Bit shifts (>> and <<) are similar but discard bits instead of wrapping them. Shifting right by 32 positions throws away the lower 32 bits. This can be useful when you want to extract information from high bits without keeping low bits around.
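The classic use is the shift-XOR step you’ll find in many finalisers (the shift amount 33 is just a typical choice):

fn shift_xor(x: u64) -> u64 {
    // Fold the high 31 bits down onto the low bits. A plain shift would
    // discard them; XORing the shifted copy back in keeps their influence.
    x ^ (x >> 33)
}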
Addition: the carry chain
Addition with overflow (modular addition) is more interesting than it first appears. Unlike XOR, addition creates dependencies between bit positions through the carry chain.
fn add_example(a: u64, b: u64) -> u64 {
    a.wrapping_add(b)
}

When you add two bits and both are 1, you get 0 with a carry of 1 to the next position. This carry can cascade: in the worst case, adding 1 to a value that is all 1s flips every bit and wraps around to 0.
The carry chain means output bit n depends on input bits 0 through n – all the lower bits can influence it through their carries. This is vertical diffusion: influence flows upward from low bits to high bits.
The limitation is that carries only flow one direction. The high bits of the inputs cannot influence the low bits of the output. This is why you’ll often see addition combined with rotation – rotate to bring high bits down, then add to spread influence upward again.
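A minimal add-then-rotate step in that spirit (the rotation amount is arbitrary):

fn add_rotate(state: u64, input: u64) -> u64 {
    // The addition pushes influence upward through the carries;
    // the rotation brings high bits back down so the next addition
    // can push them upward again.
    state.wrapping_add(input).rotate_left(23)
}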
Multiplication: comprehensive mixing
Multiplication is the heavy hitter. When you multiply two 64-bit numbers, each output bit depends on many input bits from both operands. This is because multiplication is essentially “shift-and-add” repeated for each set bit – every bit position interacts with every other.
fn multiply_example(a: u64, b: u64) -> u64 {
    a.wrapping_mul(b)
}

Specifically, output bit n of a×b depends on input bits 0 through n from both a and b. The high bits of the result depend on every input bit. This is excellent diffusion – one operation accomplishes what would take many XOR-rotates.
The catch is that multiplication constants matter. Multiplying by 2 just shifts left by one bit – no mixing at all. Multiplying by a constant with few set bits doesn’t spread influence well. Good multiplication constants are odd (so they’re coprime to 2^64 and form a bijection), have roughly half their bits set, and don’t have obvious patterns.
Averaged over all possible multipliers, multiplication looks ideal – that’s the theoretical potential of the operation. But hash functions don’t use random multipliers; they use fixed constants. Multiply by 2 and you just get a shift. A good constant does much better, but high input bits still can’t reach low output bits, because carries only propagate upward. Picture the influence of each input bit on each output bit as a grid: the corner where high input bits should reach low output bits stays empty. That empty triangle is a problem.
There’s also a more fundamental problem: zero. Multiply anything by zero and you get zero – all that lovely diffusion collapses instantly. If an attacker can arrange for one operand to be zero at a critical point in the hash computation, the output becomes predictable regardless of the other operand. Zero collapse has been the basis of real attacks against hash functions that weren’t careful about what gets multiplied.
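A sketch of the failure and one common-style guard (the constants are illustrative; real designs think hard about which operands an attacker can steer):

fn main() {
    let input: u64 = 0;

    // Zero collapse: whatever the other operand was, the product is zero.
    assert_eq!(input.wrapping_mul(0x9e37_79b9_7f4a_7c15), 0);

    // One mitigation: fold a seed or non-zero constant into the operand
    // before multiplying, so a raw zero never reaches the multiply.
    let guarded = (input ^ 0xc2b2_ae3d_27d4_eb4f).wrapping_mul(0x9e37_79b9_7f4a_7c15);
    assert_ne!(guarded, 0);
}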
Folded multiply mix
This is a more recent weapon in the fast-hashing arsenal. Modern CPUs let you multiply two 64-bit numbers into a full 128-bit result. Instead of throwing away the top half as you would in traditional 64-bit multiplication, the folded multiply mix XORs the upper and lower halves together:
fn folded_multiply(a: u64, b: u64) -> u64 {
    let product = (a as u128) * (b as u128);
    let high = (product >> 64) as u64;
    let low = product as u64;
    high ^ low
}

Why is this so good? The low 64 bits of the product capture interactions between low bits of the inputs. The high 64 bits capture interactions involving high bits – (high bits of a) × (high bits of b), plus carry-overs from lower products. By XORing them together, every output bit is influenced by nearly every input bit from both operands.
The improvement is dramatic. Averaged over all possible multipliers, the folded multiply spreads influence almost uniformly: nearly every input bit reaches nearly every output bit. With a fixed constant, the folding fills in most of the empty triangle left by regular multiplication – though some positions still carry weaker influence than others. One folded multiply is good; two or three in sequence are better.
This gives you near-complete diffusion in a single operation. Modern CPUs have instructions specifically for producing the full 128-bit product (mulx on x86), making this cheap. Many of the fastest non-cryptographic hash functions are essentially a sequence of folded multiplies and XORs with a careful construction.
The folded multiply inherits the zero-collapse problem from regular multiplication – if either operand is zero, the 128-bit product is zero, and XORing zero with zero still gives zero. Hash functions using folded multiplies need to ensure attackers can’t force a zero operand at a point where it matters.
Putting it together
Here’s a minimal hash function that demonstrates these pieces working together:
const MULTIPLIER: u64 = 0x517cc1b727220a95; // An odd constant with good bit patterns

fn folded_multiply(a: u64, b: u64) -> u64 {
    let product = (a as u128) * (b as u128);
    ((product >> 64) ^ product) as u64
}

fn simple_hash(data: &[u8], seed: u64) -> u64 {
    let mut state = seed;
    for &byte in data {
        // Mix in the byte
        state ^= byte as u64;
        // Diffuse with a folded multiply
        state = folded_multiply(state, MULTIPLIER);
    }
    state
}

Each piece has a job:
- The seed makes outputs unpredictable to attackers who don’t know it
- The XOR combines each byte with the state without bias
- The folded multiply spreads that byte’s influence across all state bits
This is not a production hash function – it processes bytes one at a time, which is slow, suffers from zero-collapse, and lacks finalization – but it illustrates the pattern. Real hash functions often process larger chunks and use multiple mixing rounds to fill in those diffusion gaps. The building blocks are the same.
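Used like this (the seeds are arbitrary example values), the same input hashes identically under the same seed and unrecognisably differently under another:

fn main() {
    let h1 = simple_hash(b"hello world", 0xdead_beef);
    let h2 = simple_hash(b"hello world", 0xcafe_f00d);
    // Same data, different seeds: the outputs share no useful structure.
    println!("{h1:#018x}");
    println!("{h2:#018x}");
}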
Different jobs, different hashes
Not all hash functions want the same things. The requirements for a hash table are different from the requirements for a digital signature. Picking the wrong hash function for your use case is a time-honoured way to create bugs that only manifest in production, on Fridays, during an outage.
Hash tables and collisions
For hash tables, collisions are annoying but not catastrophic. Two items land in the same bucket; you check both. Performance degrades gracefully from O(1) towards O(n) as collisions accumulate. You want uniform distribution and good avalanche, but you don’t need cryptographic security.
What you do need is speed. Hash tables call the hash function constantly – every insert, every lookup, every delete. A hash function that’s 10% faster makes your entire program faster. This is why hash tables don’t use SHA-256.
HashDoS and keyed hashing
In 2011, researchers demonstrated that you could crash web servers by sending carefully crafted requests that all hashed to the same bucket. The hash table degrades from O(1) to O(n), your server starts doing quadratic work on attacker-controlled input, and everything falls over.
The fix is keyed hashing: mix in a random secret (the “key” or “seed”) so attackers can’t predict which inputs collide. Every time your program starts, it picks a new random seed, and the attacker’s carefully crafted collision inputs stop working.
Some hash functions were designed with this in mind. SipHash, for instance, was specifically built to resist HashDoS while being much faster than cryptographic hashes. Others, like some versions of MurmurHash, had seeds that could be attacked – researchers found “seed-independent collisions” that collided regardless of the key. The defence you thought you had wasn’t there.
Stable vs ephemeral
Sometimes you want the same input to always produce the same output, forever, across different machines and program runs. This is stable or portable hashing. You need it for distributed systems (all nodes must agree on which server owns which key) or content-addressed storage (the hash is the identity).
Other times you explicitly want different runs to produce different hashes. This is ephemeral hashing, and it’s what protects you from HashDoS. Rust’s standard library uses a randomised seed by default for exactly this reason.
The mistake is using a stable hash where you needed an ephemeral one, or vice versa. Use a randomised hash for your distributed database sharding key and your nodes will disagree about who owns what. Use a stable hash for your web server’s request parameters and you’re back to being vulnerable to HashDoS.
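You can watch the ephemeral behaviour directly with the standard library’s RandomState (a minimal sketch; each state draws its own random keys):

use std::collections::hash_map::RandomState;
use std::hash::BuildHasher;

fn main() {
    let a = RandomState::new();
    let b = RandomState::new();
    // Same key, independently seeded states: the hashes disagree –
    // exactly what foils precomputed HashDoS inputs, and exactly why
    // you can't use this for cross-machine sharding.
    println!("{:#x}", a.hash_one("hello"));
    println!("{:#x}", b.hash_one("hello"));
}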
Cryptographic vs non-cryptographic
Non-cryptographic hash functions (xxHash, wyhash, rapidhash) are designed to be fast. They’re fine for hash tables and checksums but will absolutely fall over if an attacker is specifically trying to find collisions. They might do 3-6 total operations where a cryptographic hash does 64 rounds.
Cryptographic hash functions (SHA-256, BLAKE3) are designed for security. They achieve this through many rounds of mixing operations – SHA-256 does 64 rounds of compression – and they need to satisfy three specific properties that go beyond what we’ve discussed so far. These properties aren’t just “good to have.” They’re formal requirements, and violating any of them breaks real applications in ways that get their own Wikipedia articles.
Preimage resistance means that given a hash output, you can’t find any input that produces it. Not the original input – any input. If I give you the SHA-256 hash 2c26b46b68ffc68ff99b453c1d30413413422d706483bfa0f98a5e886266e7ae, you shouldn’t be able to find any string that produces this hash without essentially trying inputs at random.
Why does this matter? Passwords. When you log in, the server hashes what you typed and compares it to the stored hash. If attackers can reverse hashes, they can recover passwords from leaked databases. They don’t need to find your exact password – any input that produces the same hash will work. The server can’t tell the difference.
Second preimage resistance is subtler. Given a specific input and its hash, you shouldn’t be able to find a different input with the same hash. This sounds similar to preimage resistance, but it’s actually a different problem. You now have a target input to study, not just a target output.
This matters for integrity verification. When you download software and check its hash against the published value, you’re trusting that no attacker could have found malicious code that happens to hash identically. A second preimage attack would let someone study the legitimate file and construct a malicious replacement with a matching hash.
Collision resistance is the strongest property. You shouldn’t be able to find any two different inputs that produce the same hash. Not a specific collision – any collision at all. The attacker gets to choose both inputs, which gives them much more freedom than in the preimage cases.
This is harder to achieve than it sounds. Remember the birthday paradox? With an n-bit hash, you expect to find a collision after about 2^(n/2) attempts, not 2^n. A 128-bit hash should resist collisions for 2^64 attempts – which sounds like a lot until you realise that modern GPUs can compute billions of hashes per second. This is why serious cryptographic hashes use 256 bits or more.
Collision resistance breaks digital signatures. When you sign a document, you’re actually signing its hash. If an attacker can find two documents with the same hash – say, a contract saying “I owe you £100” and another saying “I owe you £100,000” – they can get you to sign the first one and then substitute the second. The signature verifies because the hash matches. You just signed away a lot of money.
These three properties form a hierarchy. Collision resistance is the hardest to maintain, because the attacker has the most freedom – which is why it’s usually the first property to fall. Historically, hash functions have failed in roughly this order: first someone finds theoretical collision attacks, then practical ones, then maybe preimage attacks follow. MD5’s collision resistance was broken in 2004; its preimage resistance is merely “weakened” but not fully broken to this day.
The distinction between cryptographic and non-cryptographic hashes matters because cryptographic hash functions are 5-20x slower. Using SHA-256 for your hash table is like hiring a security guard to watch your sock drawer – technically effective, but expensive and rather missing the point.
Hash functions and random numbers
Hash functions and random number generators are deeply intertwined, sharing fundamental mathematical properties.
The properties that make a hash function good – the avalanche effect (small changes produce dramatically different outputs), uniform distribution (outputs statistically indistinguishable from random), and one-wayness (computational infeasibility of reversal) – are exactly the properties you want in a pseudorandom number generator. This isn’t coincidence. Both are trying to produce outputs that appear random and resist analysis. And the connection isn’t purely academic – it shapes how operating systems actually work.
ChaCha20, widely used for secure random number generation in operating systems and cryptographic libraries, is literally a keyed hash function operating in counter mode. You give it a key and a counter value, it produces output that looks random. Increment the counter, get more random-looking output. The core uses only Add-Rotate-XOR operations, with a critical feed-forward step that adds the initial state to the output after mixing. This addition makes the function non-invertible – the essential one-way property of a hash.
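For flavour, this is the ChaCha quarter-round – nothing but additions, XORs, and rotations over four 32-bit words (sketched from the published specification, not lifted from any particular library):

fn quarter_round(a: &mut u32, b: &mut u32, c: &mut u32, d: &mut u32) {
    // Add for carries, XOR to combine, rotate to move the mixed bits.
    *a = a.wrapping_add(*b);
    *d = (*d ^ *a).rotate_left(16);
    *c = c.wrapping_add(*d);
    *b = (*b ^ *c).rotate_left(12);
    *a = a.wrapping_add(*b);
    *d = (*d ^ *a).rotate_left(8);
    *c = c.wrapping_add(*d);
    *b = (*b ^ *c).rotate_left(7);
}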
The Linux kernel’s random number generation reflects this connection. Version 4.8 (2016) switched to ChaCha20 for generating random numbers. Version 5.17 (2022) adopted BLAKE2s – a hash function from the BLAKE family – for the entropy pool that feeds ChaCha20. Hash functions all the way down.
This relationship runs both directions. When cryptographers design a new hash function, they often evaluate whether it would make a good PRNG. When they design a PRNG, they borrow techniques from hash function design. The folded multiply we discussed earlier? It shows up in both contexts, because “mix bits thoroughly and non-reversibly” is a shared goal.
Anyway
Hash functions are one of those things that seem simple until you look closely, at which point they become complicated, and then if you keep looking they become simple again but in a different way. The core idea – turn data into a number, quickly – is trivial. Making that number uniformly distributed, unpredictably different for similar inputs, and efficient to compute is where it gets interesting.
The avalanche criterion and diffusion are the heart of it. Everything else – the choice of operations, the structure of the computation, the careful selection of constants – serves those goals. Once you understand that a hash function is trying to spread each input bit’s influence across all output bits as quickly as possible, the designs start to make sense.