
A history of hash functions (2004–2025)

The exploits that weaponised broken hash functions, the SHA-3 competition drama, HashDoS attacks, and Git's twenty-year migration from SHA-1.


The previous article ended with Xiaoyun Wang receiving a standing ovation at CRYPTO 2004. She had just demonstrated practical collision attacks against MD5, MD4, and SHA-0, while also announcing theoretical attacks on SHA-1. The cryptographic community understood immediately what this meant.

The rest of the world took somewhat longer.

This is the story of what happened after the algorithms broke: the exploits that weaponised the research, the competitions to build replacements, the surprising revelation that even non-cryptographic hash functions carry security implications, and the slow, painful reality of getting people to actually stop using broken algorithms.

The exploits arrive

Wang’s 2004 attack showed that collisions could be found. The question was whether anyone could actually do anything malicious with them. The answer came quickly.

By March 2005, researchers demonstrated X.509 certificates with different public keys sharing the same MD5 hash. This was concerning – if you could get a Certificate Authority to sign one certificate, the signature would also be valid for the other.

In 2006, Vlastimil Klíma refined the attack using a technique called “tunnelling”. What had required an IBM cluster for Wang now took one minute on a single notebook computer. Collision generation was becoming accessible.

The real wake-up call came in December 2008, at the Chaos Communication Congress in Berlin. A team of researchers used 200 PlayStation 3 consoles to perform an MD5 chosen-prefix collision attack. They created a fraudulent Certificate Authority certificate – one that browsers would trust to sign any website on the internet. The equivalent compute would have cost approximately $20,000 on Amazon EC2.

Bruce Schneier – the cryptographer and security researcher whose blog and books have made him perhaps the field’s most prominent public voice – was characteristically direct: “No one should be using MD5 anymore.”

Yet people continued using MD5. The most sophisticated attack came in 2012 with Flame, a 20-megabyte cyber-espionage tool linked to Stuxnet and attributed to US/Israeli intelligence. Flame used MD5 collisions to forge Microsoft code-signing certificates, allowing it to spread via Windows Update. Marc Stevens of CWI Amsterdam – who would later lead the SHAttered research – analysed the malware and confirmed the collision attack was “entirely new and unknown”, demonstrating “world-class cryptanalysis” capabilities. Nation-state attackers were exploiting a vulnerability that had been public for eight years.

Microsoft issued an emergency patch, revoking the Terminal Server licensing certificate hierarchy the forgery had exploited. The embarrassing detail: in 2009, US Cyber Command had embedded an MD5 hash in their official emblem – the hash of their mission statement – as a piece of clever symbolism. By then, MD5 was thoroughly broken.

SHA-1’s longer fall

MD5 was the headline, but SHA-1 was the bigger concern. It was more widely deployed, had a longer output (160 bits versus 128), and was the recommended replacement for MD5. Wang’s 2005 announcement that SHA-1 could be attacked in fewer than 2^69 operations started a clock.

The deprecation timeline stretched across a decade: NIST disallowed SHA-1 for digital signature generation after 2013, certificate authorities stopped issuing SHA-1 TLS certificates in 2016, and major browsers rejected SHA-1 certificates outright from early 2017.

The practical collision attack arrived on February 23, 2017. Google and CWI Amsterdam announced SHAttered: two PDF files displaying completely different content with identical SHA-1 hashes. The research team, led by Marc Stevens, produced the collision after 9.2 quintillion SHA-1 computations – equivalent to 6,500 CPU-years or 110 GPU-years, costing approximately $110,000 on cloud infrastructure.
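
The collision is simple to verify first-hand – a minimal check in Python, assuming the two published files from shattered.io are in the working directory:

    import hashlib

    # Two different files, one SHA-1 digest -- and SHA-256 still tells them apart.
    a = open("shattered-1.pdf", "rb").read()
    b = open("shattered-2.pdf", "rb").read()

    assert a != b
    assert hashlib.sha1(a).hexdigest() == hashlib.sha1(b).hexdigest()
    assert hashlib.sha256(a).hexdigest() != hashlib.sha256(b).hexdigest()
    print("shared SHA-1:", hashlib.sha1(a).hexdigest())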

That sounds expensive. It was also roughly 100,000 times faster than brute force: a generic birthday attack needs about 2^80 SHA-1 computations, against SHAttered’s roughly 2^63. And costs only go down.

The consequences were immediate. One day after the announcement, a WebKit developer uploaded the two colliding PDFs to test cache protections. The WebKit Subversion repository – which used SHA-1 for deduplication – immediately corrupted and stopped accepting commits. Two files with identical hashes but different contents created an irrecoverable conflict. Apache SVN rushed out emergency patches within days.

GitHub implemented collision detection using counter-cryptanalysis code, rejecting any Git content showing evidence of collision attacks. The practical break had arrived, and infrastructure scrambled to defend against it.

Git’s ongoing saga

Git’s relationship with SHA-1 provides an instructive case study in migration difficulty.

Linus Torvalds initially downplayed SHAttered. “There’s a big difference between using a cryptographic hash for security signing, and using one for generating a ‘content identifier’,” he noted. Git prepends type and length information to content before hashing, which complicates collision attacks. The SHAttered PDFs wouldn’t directly work as Git objects.
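
Torvalds’s point is concrete: Git never hashes raw file bytes. A minimal sketch of how Git derives a blob’s object ID, matching what git hash-object computes:

    import hashlib

    def git_blob_id(content: bytes) -> str:
        # Git hashes "blob <size>\0" plus the content, so a file's Git
        # object ID differs from the plain SHA-1 of its bytes -- which is
        # why the SHAttered PDFs don't collide as Git blobs.
        header = f"blob {len(content)}\0".encode()
        return hashlib.sha1(header + content).hexdigest()

    # Matches `git hash-object` on the same bytes:
    print(git_blob_id(b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a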

This is true, but also somewhat beside the point. Chosen-prefix collision attacks continued improving. In January 2020, the “SHA-1 is a Shambles” attack by Gaëtan Leurent and Thomas Peyrin reduced the cost of a chosen-prefix collision to approximately $45,000 and demonstrated PGP key impersonation. The theoretical threat to Git became increasingly concrete.

Git has supported SHA-256 repositories since version 2.29 in October 2020. Five years later, none of the major hosting providers – GitHub, GitLab, Bitbucket – supports SHA-256 repositories. The transition requires maintaining bidirectional SHA-1/SHA-256 translation tables, and the ecosystem hasn’t coordinated.

In April 2025, commemorating Git’s 20th anniversary, Torvalds reflected with nuanced regret: “SHA-1 I regret in the sense that I think it caused a lot of pointless churn with the whole ‘trying to support SHA-256 as well as SHA-1.’ And I understand why it happened, but I do think it was mostly pointless.” Git 3.0, targeting 2026, aims to complete the transition – over 20 years after SHA-1’s first theoretical break.

The Firefox team experienced a related problem: when mozilla.org moved to SHA-2-only certificates, downloads dropped significantly. Users with legacy browsers couldn’t download modern browsers – a chicken-and-egg problem that slowed the entire industry’s migration.

The SHA-3 competition

The 2004–2005 crisis convinced NIST that depending solely on algorithms sharing the same basic structure was dangerous. MD5, SHA-1, and SHA-2 all use the Merkle-Damgård construction. If a breakthrough cryptanalytic technique compromised SHA-2, the internet would have no fallback. Structural diversity became the goal.
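
The shared shape is easy to see in code. A toy sketch of Merkle-Damgård – chaining a compression function over padded blocks – with the compression function f and the IV left as parameters, since at this level those are all that distinguish MD5 from SHA-1 from SHA-256:

    import struct

    def merkle_damgard(message: bytes, f, iv: bytes, block_size: int = 64) -> bytes:
        # Length-strengthening padding: a 1 bit, zeros, then the message
        # length in bits -- the scheme MD5, SHA-1 and SHA-256 all use,
        # modulo byte order.
        bit_len = len(message) * 8
        message += b"\x80"
        message += b"\x00" * ((block_size - 8 - len(message)) % block_size)
        message += struct.pack(">Q", bit_len)

        # Chain a fixed-size state through one compression call per block.
        # Everything interesting -- and everything Wang attacked -- lives in f.
        state = iv
        for i in range(0, len(message), block_size):
            state = f(state, message[i:i + block_size])
        return state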

On November 2, 2007, NIST announced a public competition to develop SHA-3. The process was deliberately open – anyone could submit, the analysis would be public, and the winner would emerge from years of scrutiny rather than government secrecy.

The competition received 64 submissions, with 51 accepted for the first round. After five years of conferences, papers, and cryptanalysis, five finalists remained: BLAKE, Grøstl, JH, Keccak, and Skein. Skein’s team included Bruce Schneier, giving it considerable credibility and media attention.

On October 2, 2012, NIST declared Keccak the winner. The algorithm was created by Belgian cryptographers Guido Bertoni, Joan Daemen, Michaël Peeters, and Gilles Van Assche. Joan Daemen had previously co-designed Rijndael, which became AES – making him the only cryptographer to have designed both major modern NIST standards.

Keccak used a “sponge construction” – deliberately, fundamentally different from the Merkle-Damgård approach. NIST explicitly wanted an algorithm where attacks against SHA-2 would most likely not work against SHA-3.
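
A toy sketch of the sponge, with permute standing in for the real Keccak-f permutation: input only ever touches a public “rate” section of the state, while a hidden “capacity” section – the parameter NIST later proposed shrinking – never directly meets input or output.

    def sponge(message: bytes, permute, rate: int, capacity: int, out_len: int) -> bytes:
        state = bytearray(rate + capacity)

        # pad10*1: append a 1 bit, zeros as needed, then a final 1 bit.
        message += b"\x01" + b"\x00" * (-(len(message) + 1) % rate)
        message = message[:-1] + bytes([message[-1] | 0x80])

        # Absorb: XOR each block into the rate portion, then permute.
        for i in range(0, len(message), rate):
            for j, byte in enumerate(message[i:i + rate]):
                state[j] ^= byte
            state = permute(state)

        # Squeeze: read output from the rate portion, permuting for more.
        out = b""
        while len(out) < out_len:
            out += bytes(state[:rate])
            state = permute(state)
        return out[:out_len]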

The competition generated memorable drama. Just before the announcement, Schneier argued for “no award” – a legitimate if provocative position: “It’s not that the new hash functions aren’t any good, it’s that we don’t really need one. When we started this process back in 2006, it looked as if we would be needing a new hash function soon… But it’s 2012, and SHA-512 is still looking good.”

He was gracious in defeat: “Yes, I would have rather my own Skein had won, but it was a good choice.”

The bitterest controversy erupted in August 2013 – during the Snowden revelations about NSA interference in cryptographic standards. NIST proposed reducing Keccak’s security capacity parameter for performance gains. In any other year, this might have been a reasonable engineering tradeoff. In 2013, with revelations that the NSA had deliberately weakened cryptographic standards, it looked suspicious.

Schneier responded harshly: “In light of the Snowden documents that reveal that the NSA has attempted to intentionally weaken cryptographic standards, this is a huge deal. NIST risks publishing an algorithm that no one will trust.”

Daniel J. Bernstein – the cryptographer who had sued the US government at age 24, establishing that software code is protected speech under the First Amendment – intervened on NIST’s mailing list, advocating for maintaining original security levels. The Keccak team defended the changes but supported reverting to original parameters.

NIST ultimately backpedalled. The final standard (August 2015) preserved the original security margins. The controversy illustrated how damaged trust in cryptographic institutions had become.

HashDoS: when speed becomes a weapon

The previous sections concerned cryptographic hash functions – algorithms designed for security. A parallel story was unfolding with non-cryptographic hash functions, the ones used for hash tables. These were optimised for speed, not security. The assumption was that speed was all that mattered.

The assumption was wrong.

On December 28, 2011, at the 28th Chaos Communication Congress, researchers Alexander Klink and Julian Wälde demonstrated that virtually every major web framework was vulnerable to denial-of-service attacks exploiting hash table collisions.

The attack was devastatingly simple. Web frameworks automatically convert HTTP POST parameters into hash tables. If an attacker submits form data with hundreds of thousands of keys that all hash to the same value, every key lands in the same bucket, and inserting n keys degrades from O(n) total work to O(n²). A carefully crafted 2MB POST request containing 200,000 colliding strings could trigger 40 billion string comparisons and lock up a server’s CPU for over 10 seconds.
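
The degradation is easy to reproduce with a toy separate-chaining table and the classic unseeded multiply-by-31 string hash (the shape of Java’s String.hashCode). The pair “Aa”/“BB” hashes identically under it, so every string built by concatenating those two blocks lands in one bucket:

    import time
    from itertools import product

    class WeakTable:
        """Separate chaining over a deliberately weak, unseeded hash."""

        def __init__(self, buckets: int = 1024):
            self.buckets = [[] for _ in range(buckets)]

        def _hash(self, key: str) -> int:
            h = 0
            for ch in key:  # h*31 + c, with no secret seed
                h = (h * 31 + ord(ch)) & 0xFFFFFFFF
            return h % len(self.buckets)

        def insert(self, key: str, value) -> None:
            chain = self.buckets[self._hash(key)]
            for i, (k, _) in enumerate(chain):  # linear scan of the chain
                if k == key:
                    chain[i] = (key, value)
                    return
            chain.append((key, value))

    # "Aa" and "BB" contribute identically to h*31+c, so all 2**12
    # concatenations of them land in a single bucket.
    keys = ["".join(p) for p in product(["Aa", "BB"], repeat=12)]  # 4096 keys

    table = WeakTable()
    start = time.perf_counter()
    for k in keys:
        table.insert(k, None)  # each insert rescans the whole chain: O(n^2)
    print(f"{len(keys)} colliding inserts: {time.perf_counter() - start:.2f}s")

Scale 4,096 keys up to 200,000 and the quadratic blow-up is what pinned servers’ CPUs.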

Minimal bandwidth, maximum damage. With just 6 kilobits per second of upload bandwidth, an attacker could keep one CPU core fully occupied. A gigabit connection could theoretically overwhelm 100,000 cores.

The disclosure affected nearly every major platform. PHP, Java, Python, Ruby, ASP.NET, and V8 were all vulnerable. CVEs rained across the industry. Microsoft responded with exceptional speed, releasing a patch on December 29, 2011 – just one day after public disclosure, 30 days after notification. The patch limited POST parameters to 1,000 by default.

One notable exception: Perl wasn’t vulnerable. It had implemented randomised hash seeds back in 2003, following an earlier academic paper by Crosby and Wallach. The rest of the industry scrambled to add randomisation.

Many implementations switched to MurmurHash with random seeds, assuming the problem was solved.

They were wrong.

The 29C3 sequel

One year later, at 29C3, Jean-Philippe Aumasson – a Swiss cryptographer who would later co-design BLAKE2 and BLAKE3 – and Bernstein demonstrated that the industry’s patches were inadequate.

Using differential cryptanalysis – techniques from the cryptographic world – they proved that MurmurHash2, MurmurHash3, and CityHash remained vulnerable even with random seeds. By introducing specific XOR differences (like 0x80 in every 8th byte), they generated “universal multicollisions” – strings that collided regardless of the seed value.

The randomisation that everyone had implemented provided no protection against an attacker who understood the hash function’s structure. Ruby (which had switched to MurmurHash after 28C3), JRuby, and numerous other systems remained vulnerable.
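
Here is the structural problem in miniature, using the toy multiply-by-31 hash from above rather than the real MurmurHash cryptanalysis. For equal-length inputs the seed shifts every hash value by the same amount, so block-swapped strings still collide under every seed:

    def seeded_h31(s: str, seed: int) -> int:
        h = seed  # the "defence": a random per-process starting value
        for ch in s:
            h = (h * 31 + ord(ch)) & 0xFFFFFFFF
        return h

    # The seed contributes seed * 31**6 to both sides, so these strings
    # collide for *every* seed -- a universal multicollision in miniature.
    for seed in (0, 1, 0xDEADBEEF):
        assert seeded_h31("AaAaBB", seed) == seeded_h31("BBBBAa", seed)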

The solution came from Aumasson and Bernstein themselves: SipHash, a cryptographically-strong hash function designed specifically for hash table use. SipHash provided security against differential attacks while remaining much faster than full cryptographic hashes – the right tradeoff for protecting hash tables from malicious input.

SipHash adoption spread across the industry: Ruby in November 2012, Python 3.4 in 2014, Rust’s standard library, and eventually most security-conscious hash table implementations.
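
The Python entry on that list is visible from the interpreter itself. Since PEP 456 (Python 3.4), str and bytes hashing is keyed SipHash with a per-process random key – a sketch of what that means in practice:

    # Run this script twice: the value changes between runs, because each
    # interpreter start draws a fresh SipHash key. Setting the PYTHONHASHSEED
    # environment variable pins it (useful for reproducible tests, dangerous
    # for anything attacker-facing).
    print(hash("attacker-controlled-key"))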

The lesson was clear: when hash tables process untrusted input, “fast” and “non-cryptographic” are not sufficient specifications. You need something designed to resist adversaries, not just accidents.

As recently as 2025, CVE-2025-27209 documented a HashDoS vulnerability in Node.js v24.0.0 because the V8 engine was using rapidhash with static secrets. The class of vulnerability keeps recurring.

History repeats itself

MD5 was theoretically weak by 1996, practically broken by 2004, weaponised by 2008, and still used for password hashing by a quarter of CMS systems in 2019. SHA-1 followed the same trajectory on a similar timescale. The industry’s ability to break algorithms consistently outpaces its ability to stop using them.

This isn’t just inertia. Migration is genuinely hard. Changing hash algorithms can break compatibility, require coordinated ecosystem-wide updates, and surface in unexpected places. But the pattern suggests we should assume that every hash function will eventually fall, and plan accordingly.