A history of hash functions (1953-2004)

From IBM's scatter storage to Xiaoyun Wang's standing ovation at CRYPTO 2004, covering the birth of hashing and the slow collapse of MD5.

One of the recurring patterns in computer security is that every algorithm eventually falls. Not because it was badly designed – usually quite the opposite – but because mathematicians are patient, resourceful, and occasionally brilliant in ways that make previous brilliance look quaint. The hash functions that underpin digital security have followed this arc with remarkable consistency: enter service with confidence, attract academic attention, develop theoretical cracks, suffer practical breaks, and then linger in production systems for years while everyone argues about migration timelines.

This is the story of how we got here. It starts, as many computing stories do, with IBM.

The accidental birth of hashing

The hash function concept emerged not from cryptography but from a prosaic problem: searching for chemical compounds. In 1947, two IBM clients asked engineer Hans Peter Luhn to help them search coded chemical structures efficiently. Luhn was an interesting character – a German immigrant who had previously invented a thread-counting device for the textile industry, patented a foldable raincoat, and built something called a “Cocktail Oracle” for mixing drinks. He would go on to produce an idea that changed computing forever.

In January 1953, Luhn wrote an internal IBM memo proposing what Donald Knuth later documented as the first hash function concept. His “bucket method” transformed keys into addresses through mathematical operations – paired digits from phone numbers would be summed, creating derived values that determined where data was stored. The insight was revolutionary: instead of searching through every record sequentially, you could calculate where data should be and look there directly.
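
To make the bucket method concrete, here is a minimal sketch of the digit-folding idea in Python – an illustration of the general approach, not Luhn’s exact 1953 scheme:

```python
def bucket_address(key: str, num_buckets: int = 100) -> int:
    """Fold a numeric key into a storage address by summing digit pairs.
    A loose illustration of Luhn-style digit folding, not his exact scheme."""
    digits = [int(c) for c in key if c.isdigit()]
    # Pair adjacent digits and sum each pair (an odd trailing digit is
    # ignored in this sketch), then combine the pair sums into an address.
    pair_sums = [a + b for a, b in zip(digits[::2], digits[1::2])]
    return sum(pair_sums) % num_buckets

print(bucket_address("555-867-5309"))  # always the same bucket for this key
```

Instead of scanning every record, you compute the bucket and look there directly – the whole trick in four lines.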

Working in parallel, another IBM team independently implemented similar ideas. Gene Amdahl, Elaine M. McGraw, Nathaniel Rochester, and Arthur Samuel built hashing into the IBM 701 assembler in 1953-54. McGraw deserves particular recognition for co-inventing open addressing and linear probing – collision resolution methods for hash tables still used today. The term “hash” itself wouldn’t appear in published literature until the late 1960s, by which time it was already widespread jargon among practitioners.
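
McGraw’s idea is simple enough to sketch. Below is a deliberately stripped-down open-addressing table with linear probing – fixed capacity, no deletion, no resizing – just to show the collision-resolution strategy, not anything resembling the 701 assembler’s implementation:

```python
class LinearProbingTable:
    """Open addressing with linear probing: on a collision, scan forward
    slot by slot until an empty slot (or the matching key) is found."""

    def __init__(self, capacity: int = 8):
        self.slots = [None] * capacity  # each slot holds (key, value) or None

    def _probe(self, key):
        index = hash(key) % len(self.slots)
        for offset in range(len(self.slots)):
            slot = (index + offset) % len(self.slots)
            if self.slots[slot] is None or self.slots[slot][0] == key:
                return slot
        raise RuntimeError("table is full")

    def put(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def get(self, key):
        entry = self.slots[self._probe(key)]
        return entry[1] if entry else None

table = LinearProbingTable()
table.put("MD5", 1991)
table.put("SHA-1", 1995)
print(table.get("MD5"))  # 1991
```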

Early hash tables went by evocative names like “scatter storage” before standardising on modern terminology. The culinary metaphor stuck because hash functions genuinely chop up their input, scrambling data into seemingly random output. But these early hashes served only to organise data quickly – the cryptographic applications that would make hash functions critical to internet security lay decades in the future.

Error detection versus adversaries

While IBM engineers worked on hash tables, the telecommunications world needed something different: ways to detect transmission errors. W. Wesley Peterson answered this need with his 1961 paper “Cyclic Codes for Error Detection”, introducing the Cyclic Redundancy Check.

Peterson’s CRC treated data as polynomials over finite fields, with the remainder from polynomial division serving as a check value. This elegant mathematical approach proved remarkably efficient in hardware – implementable with simple shift registers and XOR gates. The 32-bit CRC standardised in 1975 became integral to Ethernet, and CRC variants now protect everything from ZIP files to 5G network transmissions.
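
The shift-and-XOR structure is visible even in software. Here is a bitwise sketch of the reflected CRC-32 used by Ethernet and ZIP, checked against Python’s zlib:

```python
import zlib

def crc32(data: bytes) -> int:
    """Bitwise CRC-32 (reflected polynomial 0xEDB88320, as used by
    Ethernet and ZIP). Real implementations are table-driven for speed;
    this version spells out the shift-and-XOR structure."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # If the low bit is set, "divide out" the polynomial via XOR.
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

assert crc32(b"hash history") == zlib.crc32(b"hash history")
```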

CRC wasn’t designed for security. Its purpose was detecting accidental corruption, not deliberate manipulation. This distinction – accidents versus adversaries – would become critical as the field evolved. CRC is quite good at catching random bit flips from cosmic rays or electrical noise. It is rather less good when someone is actively trying to fool it.
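
That weakness is easy to demonstrate: CRC is linear over XOR. Combine any three equal-length messages with XOR and their CRCs combine the same way – exactly the kind of structure an adversary exploits to alter data while keeping the checksum valid. A quick check in Python:

```python
import zlib

# Three equal-length messages; XOR-combining them combines their CRCs too.
a, b, c = b"accidents", b"adversary", b"bit noise"
combined = bytes(x ^ y ^ z for x, y, z in zip(a, b, c))
assert zlib.crc32(combined) == zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
```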

Cryptographic hashing emerges

The demand for cryptographic hash functions crystallised with Whitfield Diffie and Martin Hellman’s landmark 1976 paper “New Directions in Cryptography”. Their revolutionary public-key concepts required one-way functions – operations easy to compute forward but practically impossible to reverse. Hash functions fit this need: you could verify that data matched a hash, but recovering the original data from just the hash should be infeasible.

Ralph Merkle – a Stanford PhD student who co-invented public-key cryptography independently of Diffie and Hellman – formalised these requirements in 1979, defining the properties that cryptographic hash functions needed: collision resistance (hard to find two inputs producing the same output), preimage resistance (hard to reverse), and second preimage resistance (given one input, hard to find another with the same output). Merkle also described what became known as the Merkle-Damgård construction – independently of Ivan Damgård, hence the shared name – a method of building collision-resistant hash functions from compression functions that would underpin MD5, SHA-1, and SHA-2.
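
The construction is compact enough to sketch. Here is its skeleton in Python, with a toy (decidedly non-collision-resistant) compression function standing in for a real one – the padding and chaining structure is the point, not the arithmetic:

```python
import struct

def merkle_damgard(message: bytes, block_size: int = 64) -> int:
    """Skeleton of the Merkle-Damgård construction: pad the message with
    its own length (so-called MD-strengthening), then fold fixed-size
    blocks through a compression function, chaining the state along."""

    def compress(state: int, block: bytes) -> int:
        # Toy stand-in for a real compression function like MD5's.
        for byte in block:
            state = ((state * 31) ^ byte) & 0xFFFFFFFFFFFFFFFF
        return state

    # MD-strengthening: append a 1 bit, zero padding, then the bit length.
    bit_length = len(message) * 8
    padded = message + b"\x80"
    padded += b"\x00" * (-(len(padded) + 8) % block_size)
    padded += struct.pack(">Q", bit_length)

    state = 0x6A09E667F3BCC908  # arbitrary fixed IV for this sketch
    for i in range(0, len(padded), block_size):
        state = compress(state, padded[i:i + block_size])
    return state

print(hex(merkle_damgard(b"any length of input at all")))
```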

That same year, Gideon Yuval’s paper “How to Swindle Rabin” demonstrated how the birthday paradox could generate unexpected collisions. With an n-bit hash, you don’t need 2^n attempts to find a collision – you need roughly 2^(n/2), because collisions are about pairs and pairs grow quadratically. This insight would haunt hash function designers for decades.
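
The effect is easy to reproduce by truncating a real hash to a small n. This sketch finds a collision on a 32-bit truncation of SHA-256 in around 2^16 attempts rather than 2^32:

```python
import hashlib
from itertools import count

def truncated_hash(data: bytes, bits: int = 32) -> int:
    """First `bits` bits of SHA-256, standing in for a small n-bit hash."""
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)

def birthday_collision(bits: int = 32):
    """Hash successive inputs until two produce the same value. For a
    32-bit hash this takes around 2^16 = 65,536 attempts, not 2^32."""
    seen = {}
    for i in count():
        h = truncated_hash(str(i).encode(), bits)
        if h in seen:
            return seen[h], i, h
        seen[h] = i

a, b, h = birthday_collision()
print(f"inputs {a} and {b} both hash to {h:#010x}")
```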

The theoretical foundations were laid. The next decade would produce the algorithms that built the internet.

The MD family: Rivest’s designs

Ronald Rivest – the “R” in RSA, who had co-invented the eponymous public-key cryptosystem (published in 1978), legend has it after drinking wine at a Passover celebration – created a family of Message Digest algorithms that would define an era. His naming was admirably direct: MD2, MD4, MD5. No marketing department involved.

MD2 (1989) targeted 8-bit computers, using constants derived from the digits of pi to demonstrate no hidden backdoors. If you’re wondering why pi specifically: it’s a “nothing up my sleeve” number. Anyone can verify those are really the digits of pi, which means Rivest couldn’t have chosen them to create a secret weakness. The paranoia here is productive.

MD4 (1990) redesigned everything for 32-bit machines, trading security margins for speed. The RFC itself acknowledged MD4 was “at the edge” of risking successful attack. This is the kind of admission that ages poorly.

MD5 (1991) represented Rivest’s attempt at a “conservative” replacement. It added a fourth round of processing (64 total operations versus MD4’s 48), unique constants for each step, and optimised shift amounts for faster avalanche effects. The design was solid. The 128-bit output seemed adequate. MD5 became ubiquitous – SSL certificates, password storage, software verification, and countless protocols depended on it.
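
Those unique constants were derived rather than chosen, in the same nothing-up-my-sleeve spirit as MD2’s pi digits: RFC 1321 defines them as the integer parts of 2^32·|sin(i)| for i = 1 to 64. One line of Python reproduces the table:

```python
import math

# RFC 1321: MD5's 64 per-step constants are floor(2^32 * |sin(i)|),
# with i in radians -- derived, not chosen, so they can hide nothing.
T = [int(2**32 * abs(math.sin(i))) for i in range(1, 65)]

assert T[0] == 0xD76AA478  # matches the first constant listed in the RFC
```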

The cracks appear

The first concerns emerged in 1993, when Bert den Boer and Antoon Bosselaers found pseudo-collisions in MD5 – not real collisions, but weaknesses in the compression function that suggested the mathematical foundations weren’t as solid as hoped.

Then in 1996, Hans Dobbertin – a German cryptographer who had already broken MD4 – attacked MD5’s compression function. Not a full collision, but close enough that he published a warning in RSA’s newsletter: “MD5 should no longer be implemented where a collision-resistant hash function is required.”

This is the part of the story where you might expect the industry to transition away from MD5. Reader, it did not.

MD5 remained standard practice. The warning was noted and largely ignored. In March 2004, a distributed computing project called MD5CRK launched, attempting to find practical collisions through brute force. Thousands of volunteers donated CPU cycles. The project estimated years of work ahead.

Then came August 17, 2004.

The standing ovation

The CRYPTO 2004 conference held its traditional rump session in Santa Barbara – an informal evening where researchers present late-breaking results. Xiaoyun Wang, a mathematician from Shandong University in China, took the stage with results that would reshape the field.

Wang demonstrated practical collision attacks against MD5, MD4, HAVAL-128, and RIPEMD (at the same session, Antoine Joux announced a full collision for SHA-0). Her team had found MD5 collisions in about one hour on an IBM p690 cluster. Not years. Not months. An hour.

Contemporary accounts describe what happened next as “one of the most dramatic moments in cryptographic history”. Wang received a standing ovation from the assembled cryptographers – not something that happens often at academic conferences, and essentially never at rump sessions.

The MD5CRK distributed computing project shut down shortly after. Its goal had been achieved far more elegantly by mathematical insight than by brute force. All those CPU cycles, rendered obsolete by cleverness.

Wang wasn’t finished. Six months later, in February 2005, she and her colleagues announced an attack on SHA-1 – the 160-bit successor that the security community had been recommending as MD5’s replacement – requiring fewer than 2^69 operations. Still a large number, but well below the roughly 2^80 birthday-bound effort that finding a collision in a secure 160-bit hash should cost.

The standing ovation was earned. Within six months, Wang had demonstrated that the two most widely deployed hash functions in the world were fundamentally compromised. August 17, 2004 was the day the cryptographic community learned that hash function security was more fragile than anyone had assumed.

What the break meant

A collision attack on a hash function means you can find two different inputs that produce identical outputs. This sounds abstract until you consider the applications.

Digital signatures work by signing the hash of a document rather than the document itself. If you can find two documents with the same hash – say, a legitimate contract and a fraudulent one – you can get someone to sign the legitimate one and then substitute the fraudulent version. The signature remains valid because it matches the hash, and the hash matches both documents.

SSL certificates are signed by Certificate Authorities. If you can create a collision between a legitimate certificate request and a malicious one, you can potentially get a CA to sign something dangerous.

Software distribution relies on hash verification. Users download a file and check its hash against a published value. If attackers can create malware that hashes identically to legitimate software…
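
A sketch of that last check, with a hypothetical file name and published digest – the point being that verification can only compare digests, so two files sharing an MD5 are indistinguishable to it:

```python
import hashlib

def verify_download(path: str, published_md5_hex: str) -> bool:
    """Compare a downloaded file's MD5 against a published value.
    A collision attack yields two different files that both pass."""
    with open(path, "rb") as f:
        actual = hashlib.md5(f.read()).hexdigest()
    return actual == published_md5_hex

# Hypothetical usage: a legitimate installer and a colliding malicious one
# would both return True against the same published digest.
# verify_download("installer.bin", "79054025255fb1a26e4bc422aef54eb4")
```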

These weren’t theoretical concerns for long. But that’s a story for the next article.

The aftermath, briefly

Wang had fired the starting gun on a decade of chaos. MD5 was demonstrably broken, yet it would linger in production systems for years. As late as 2011, Chrome couldn’t remove MD5 support “due to its widespread use on the Internet”. The Flame malware of 2012 used MD5 collisions to forge Microsoft code-signing certificates – nation-state attackers exploiting a vulnerability that had been public for eight years.

SHA-1’s 160-bit promise suddenly looked fragile. NIST began recommending transition away from it, formally deprecating it in 2011 and disallowing it for digital signatures in 2013. The practical collision attack – SHAttered – wouldn’t arrive until 2017, but the trajectory was clear.

The cryptographic community responded by pushing migration to SHA-2 and running a public competition to produce a structurally different alternative, SHA-3. The industry responded with… reluctance. Git’s default remains SHA-1. The migration has taken over twenty years and isn’t complete.

Breaking an algorithm, it turns out, is the easy part. Getting people to stop using it is hard.

In the next article, we’ll see what happened after Wang’s standing ovation: the exploits that weaponised her research, the SHA-3 competition drama, the HashDoS attacks that proved even non-cryptographic hashes carry security implications, and why Linus Torvalds has described Git’s SHA-1 situation as “pointless churn” two decades later.