Rapidhash is an extremely fast, high-quality non-cryptographic hash function. If you want to understand why – the folded multiplies, the parallel lanes, the tiered input handling – I wrote about the algorithm itself. This article is about something different: what it took to make rapidhash fast in Rust specifically, and why the Rust crate I maintain doesn’t produce the same hashes as the C++ original (though it can, when you need it to).
The algorithm translates directly – Rust has the same integer types and multiply semantics. The problem is Rust’s Hasher trait, which sits between your hash function and HashMap. The trait was designed for streaming hashers that process bytes incrementally; rapidhash is a block hasher that wants 16 or 48 bytes at a time. Making these worldviews coexist without sacrificing speed required some undignified compiler negotiations, and some deliberate divergence from the C++ design. Here’s what that looks like.
The Hasher problem
In C++, rapidhash is just a function: hand it a pointer and a length, get a 64-bit hash back. Rust’s standard library has a different model. The Hash and Hasher traits separate “what to hash” from “how to hash it”:
/// Implemented on types that can be hashed, like String.
pub trait Hash {
fn hash<H: Hasher>(&self, state: &mut H);
}
/// Implemented by hash functions, like rapidhash.
pub trait Hasher {
fn finish(&self) -> u64;
fn write(&mut self, bytes: &[u8]);
// Default implementations for primitive types
fn write_u8(&mut self, i: u8) { ... }
fn write_u16(&mut self, i: u16) { ... }
fn write_u32(&mut self, i: u32) { ... }
fn write_u64(&mut self, i: u64) { ... }
fn write_usize(&mut self, i: usize) { ... }
// ... etc
}
A type implements Hash to describe how to serialise itself for hashing. A hasher implements Hasher to accept those bytes and produce a hash. The #[derive(Hash)] macro generates the serialisation automatically.
This separation is elegant, but it creates problems for fast hash functions. The Hasher trait was designed around bytewise streaming, where each byte can be processed as it arrives. This works beautifully for hash functions like SipHash that maintain rolling state and mix incrementally. But rapidhash is a block hasher. It wants to consume 16 or 48 or 96 bytes at a time, not dribbles of 4 bytes here and 8 bytes there. As purplesyringa explains, Rust’s hashing traits simply weren’t designed with block hashers in mind.
A struct with four u32 fields will call write_u32 four times. Each call has overhead: function call, state update, and importantly for block hashers like rapidhash, four separate mixing operations where one would suffice. We want to buffer those 16 bytes and mix them all at once, but the trait doesn’t know we’re going to receive more data. It can’t tell us “here are all four fields, hash them together.”
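To make that concrete, here’s a sketch you can run yourself – Point4 and TraceHasher are made-up names for illustration – showing that #[derive(Hash)] on four u32 fields turns into four separate write_u32 calls:
use std::hash::{Hash, Hasher};

// Hypothetical struct, purely for illustration.
#[derive(Hash)]
struct Point4 { x: u32, y: u32, z: u32, w: u32 }

// A toy hasher that records which Hasher methods get called.
#[derive(Default)]
struct TraceHasher { calls: Vec<&'static str> }

impl Hasher for TraceHasher {
    fn finish(&self) -> u64 { 0 }
    fn write(&mut self, _bytes: &[u8]) { self.calls.push("write"); }
    fn write_u32(&mut self, _i: u32) { self.calls.push("write_u32"); }
}

fn main() {
    let mut hasher = TraceHasher::default();
    Point4 { x: 1, y: 2, z: 3, w: 4 }.hash(&mut hasher);
    // Four separate calls where one 16-byte mix would do:
    assert_eq!(hasher.calls, ["write_u32"; 4]);
}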
There’s also the string delimiter problem. In bytewise hashes, if you hash the tuple ("ab", "c"), you get the same bytes as ("a", "bc"). To distinguish them, the default Hash implementation for strings writes a 0xFF byte after each string as a separator. This is unnecessary for rapidhash, which already mixes in the length of each string and naturally distinguishes them.
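A minimal sketch of the ambiguity (the bytewise helper is made up, standing in for any hasher that only sees a raw byte stream):
// Without a separator or a length, concatenating the raw bytes of each field
// makes ("ab", "c") and ("a", "bc") indistinguishable.
fn bytewise(fields: &[&str]) -> Vec<u8> {
    fields.iter().flat_map(|s| s.bytes()).collect()
}

fn main() {
    assert_eq!(bytewise(&["ab", "c"]), bytewise(&["a", "bc"]));
    // The standard library breaks the tie by writing 0xFF after each string;
    // rapidhash already mixes in each string's length, so it doesn't need to.
}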
You might think this could be solved with a write_str method that hashers can override to handle strings specially. And you’d be right. There’s been a hasher_prefixfree_extras feature on nightly since 2022 that would add exactly this. Four years later, it’s not much closer to stabilisation. So block hashers remain stuck emitting and separately mixing the superfluous 0xFF byte on every string, unable to opt out through any stable API.
Buffering small writes
The Rust rapidhash crate borrows a trick from Orson Peters’ foldhash: the sponge buffer. Instead of mixing every dribble of data the moment it arrives, we collect bytes in a holding pen until there’s enough to justify the expensive multiply.
pub struct RapidHasher<'s> {
seed: u64,
secrets: &'s [u64; 7],
sponge: u128, // Buffer for small writes
sponge_len: u8, // How many bits are used
}
When you call write_u32, the bytes go into the sponge. When you call it again, those bytes go in too. Only when the sponge is full (128 bits) do we actually perform the expensive folded multiply mix:
fn write_u32(&mut self, i: u32) {
let bytes = i as u128;
if self.sponge_len + 32 <= 128 {
// Hot path: just add to sponge
self.sponge |= bytes << self.sponge_len;
self.sponge_len += 32;
} else {
// Cold path: sponge full, mix it
let a = self.sponge as u64;
let b = (self.sponge >> 64) as u64;
self.seed = rapid_mix(a ^ self.seed, b ^ self.secrets[0]);
self.sponge = bytes;
self.sponge_len = 32;
}
}
Now that struct with four u32 fields does one mix instead of four. The overhead of #[derive(Hash)] calling write_u32 separately for each field is absorbed.
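The buffered bytes still have to reach the hash eventually. Here’s a hypothetical sketch of flushing the sponge at finish time – not the crate’s actual code, with rapid_mix standing in for the folded multiply:
// Folded multiply: take the full 128-bit product, xor the two halves together.
fn rapid_mix(a: u64, b: u64) -> u64 {
    let r = (a as u128) * (b as u128);
    (r as u64) ^ ((r >> 64) as u64)
}

// Hypothetical flush, run once when the hash is finalised: fold whatever is
// left in the sponge (plus how many bits it held) into the running seed.
fn flush_sponge(seed: u64, sponge: u128, sponge_len: u8, secret: u64) -> u64 {
    let a = sponge as u64;
    let b = (sponge >> 64) as u64;
    // Mixing in sponge_len keeps "no writes" distinct from "wrote zeros".
    rapid_mix(a ^ seed ^ sponge_len as u64, b ^ secret)
}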
There’s a catch, because there’s always a catch. The sponge only buffers the small integer writes – write_u32, write_u64, and friends. Strings bypass it entirely.
fn write_str(&mut self, s: &str) {
self.write(s.as_bytes());
self.write_u8(0xFF);
}
This means the 0xFF suffix that #[derive(Hash)] appends to strings doesn’t benefit from the buffer at all. Worse, the buffer logic is still present at the call site, generating extra LLVM-IR even when it’s not used.
For single-string hashing, you want the compiler to inline the hasher and optimise away the unused buffer entirely. But adding the sponge originally prevented exactly that: the method had grown too large for LLVM’s inlining heuristics to consider it worthwhile. This is what spurred the inlining work in the next section.
The sponge is a net win for structs with multiple integer fields, or tuples of integers, where the buffering genuinely prevents multiple mix operations. For everything else, the goal is to make the buffer cheap enough that LLVM will inline through it and then delete it. We’re optimising for code that doesn’t run.
The inlining dance
Compilers have opinions about inlining, and they’re not always right. Inline too little and you’re paying for function calls on every lookup. Inline too much and your code balloons, shoving other hot paths out of the instruction cache. The compiler’s heuristics are tuned for average code, and hash functions are not average.
For hash functions used in hash tables, this matters a lot. The hash function is called on every insert, every lookup, every delete. If your hash function’s code bloats the call site, everything around it gets slower too.
The Rust crate uses stratified inlining:
- Tiny path, 16 bytes or fewer: #[inline(always)]. This path is small and hot. Inlining it at every call site is worth it.
- Medium path, 17-400 bytes: #[inline(never)]. This path has more code (three lanes, loop handling). Inlining it would bloat call sites. Instead, we make one copy and call into it. We try to avoid register clobbering (overwriting registers that the caller expects to be preserved) for <48 length inputs, to ensure the function call overhead is mostly negligible compared to the actual hashing work, especially at the larger input sizes.
- Large path, 400+ bytes: #[cold] and #[inline(never)]. This path is rare and large. Marking it cold tells the compiler not to optimise branch prediction in its favour. We separate this from the medium tier because it’s a much larger method, and keeping it separate reduces register clobbering on the medium path, making it faster to call.
/// Called rapidhash_core in the real code
#[inline(always)]
fn rapidhash(data: &[u8], seed: u64) -> u64 {
if data.len() <= 16 {
// tiny path
} else {
return rapidhash_medium(data, seed);
}
}
/// Called rapidhash_core_17_288 in the real code (the name is outdated; the cutoff is now 400)
#[cold]
#[inline(never)]
fn rapidhash_medium(data: &[u8], seed: u64) -> u64 {
if data.len() <= 400 {
// medium path
} else {
return rapidhash_large(data, seed);
}
}
/// rapidhash_core_cold in the real code
#[cold]
#[inline(never)]
fn rapidhash_large(data: &[u8], seed: u64) -> u64 {
// cold path
}
The trick is convincing the compiler that this stratification is correct. Rust’s likely() intrinsics help, but the real control comes from marking functions as inline-never or inline-always and trusting LTO to do the right thing, which it mostly does, except when it doesn’t, and you have to check the assembly to find out which.
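For reference, here’s one common way to get likely()/unlikely() hints on stable Rust – a well-known sketch, not necessarily what the crate does – by routing the unexpected branch through a #[cold] function so LLVM weights the branch accordingly:
// Calling a #[cold] function marks the path that reaches it as unlikely,
// so LLVM lays out the expected path as the fall-through.
#[cold]
#[inline]
fn cold_path() {}

#[inline]
fn likely(b: bool) -> bool {
    if !b {
        cold_path();
    }
    b
}

#[inline]
fn unlikely(b: bool) -> bool {
    if b {
        cold_path();
    }
    b
}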
Function boundary overhead
When you call a non-inlined function, the calling convention takes over. Arguments go into specific registers (or onto the stack if there are too many). Return values come back in specific registers (or via memory if they’re too large). The compiler has no choice in the matter – these conventions are fixed by the ABI.
You might hope LLVM would be clever about this. If a function is private, only called from one place, surely the compiler could specialise it? Use a custom calling convention, avoid the register shuffle? In practice, no – at least not reliably enough to depend on.
This matters because every value you pass or return has a cost. Arguments consume registers: on x86-64 you get six (rdi, rsi, rdx, rcx, r8, r9), on ARM64 you get eight (x0 through x7). Exceed that and values spill to the stack. Return values are worse – you typically get one or two registers for free, and anything larger goes through memory.
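Here’s a minimal pair to paste into Compiler Explorer to see the difference – the function names are illustrative:
// On x86-64, a (u64, u64, u64) return is larger than two registers, so the
// caller passes a hidden pointer and the callee stores the tuple to memory.
#[inline(never)]
pub fn returns_three(a: u64, b: u64, c: u64) -> (u64, u64, u64) {
    (a ^ b, b ^ c, c ^ a)
}

// A single u64 comes back in rax: no hidden pointer, no stores.
#[inline(never)]
pub fn returns_one(a: u64, b: u64, c: u64) -> u64 {
    a ^ b ^ c
}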
Early versions of the Rust crate passed intermediate state between tiers:
#[inline(always)]
pub(super) const fn rapidhash_core(
mut a: u64, mut b: u64, mut seed: u64,
secrets: &[u64; 7], data: &[u8]
) -> (u64, u64, u64) {
if data.len() <= 16 {
// ... tiny path
} else if data.len() <= 288 {
(a, b, seed) = rapidhash_core_16_288(a, b, seed, secrets, data);
} else {
return rapidhash_core_cold(a, b, seed, secrets, data);
}
// Final mixing
seed = seed.wrapping_add(data.len() as u64);
a ^= RAPID_SECRET[1];
b ^= seed;
(a, b) = rapid_mum(a, b);
// Returns a tuple, again!
(a, b, seed)
}
That three-element tuple return is expensive. On x86-64, three u64 values don’t fit in the two return registers (rax, rdx), so the caller passes a hidden pointer to stack memory, the callee writes the tuple there, and the caller reads it back. That’s three stores, three loads, and pointer arithmetic – all for values that were sitting in registers moments ago.
The current implementation restructures so each tier computes and returns the final hash directly, as a single u64:
#[inline(always)]
pub(super) const fn rapidhash_core(mut seed: u64, secrets: &[u64; 7], data: &[u8]) -> u64 {
if likely(data.len() <= 16) {
// ... tiny path, returns final hash
rapidhash_finish(a, b, seed, secrets)
} else {
unsafe { rapidhash_core_17_288(seed, secrets, data) }
}
}
#[cold]
#[inline(never)]
const unsafe fn rapidhash_core_17_288(mut seed: u64, secrets: &[u64; 7], data: &[u8]) -> u64 {
// ...
if unlikely(data.len() > 400) {
return rapidhash_core_cold(seed, secrets, data);
}
// ...
}
Now each non-inlined function takes four arguments (seed, secrets pointer, data pointer, length) and returns one value (the hash). Everything fits in registers. No stack dance, no hidden pointers, no store-load round trips. The branches become tail calls.
Avoiding stack spills on hot paths
When a function uses more registers than the calling convention allows it to clobber, it has to save the old values somewhere. That somewhere is the stack. Each push at function entry preserves a register; each pop at function exit restores it. This is the function prologue and epilogue, and for a function that needs six callee-saved registers, you’re looking at six pushes on the way in and six pops on the way out.
For large inputs, who cares? Twelve memory operations spread across a function that processes thousands of bytes is rounding error. But for small inputs, this fixed overhead dominates. Consider hashing a 24-byte key: you read three 8-byte chunks, do a couple of folded multiplies, maybe 10–15 cycles of actual work. If you’re also pushing and popping six registers at 1–2 cycles each, plus the stack pointer arithmetic, you’ve just doubled your execution time on bookkeeping.
This is the tyranny of fixed costs. The stack spill doesn’t scale with input size; it’s the same whether you’re hashing 20 bytes or 200. And since most hash table keys are small – short strings, integer IDs, compact structs – you’re paying this overhead on nearly every operation.
The Rust crate uses an architecture-specific hack to defer the stack spill until it’s actually needed:
#[cold]
#[inline(never)]
const unsafe fn rapidhash_core_17_288(mut seed: u64, secrets: &[u64; 7], data: &[u8]) -> u64 {
assume(data.len() > 16);
// This branch is a hack to move the function prologue/epilogue (stack spilling)
// into the >48 path as it otherwise unnecessarily hurts the <48 path.
#[cfg(not(target_arch = "aarch64"))]
if likely(data.len() <= 48) {
return rapidhash_final_48(seed, secrets, data, data);
}
// ... 48+ byte path with loop, which now gets the stack spill
}
The trick is to structure the code so the compiler places the register saves after the early-exit branch. Inputs of 17–48 bytes check the length, find they don’t need the loop, and jump straight to rapidhash_final_48 (which is #[inline(always)]) without ever touching the stack. Only inputs over 48 bytes continue past the branch and hit the prologue. The compiler sees that the registers are only needed in the loop path, so it only saves them there.
On ARM64, the compiler handles register allocation differently – it’s less aggressive about using callee-saved registers in the first place – and this hack actually slows things down. So it’s disabled with #[cfg(not(target_arch = "aarch64"))]. The right optimisation depends on the target, which is one of those mildly annoying facts about writing fast code. The annotated x86-64 assembly below shows where the register saves actually end up:
example::rapidhash_core_cold::h653c7a8ed349a5fa:
push rbp ; push six callee-saved registers onto stack
push r15
push r14
push r13
push r12
push rbx
... ; code for >400 inputs
example::rapidhash_core_17_288::hfd584601588cf559:
mov r8, rdx ; no pushing of callee-saved registers onto stack
mov rax, rdi
cmp rcx, 49
jae .LBB1_1
... ; code for <48 inputs, returning before .LBB1_1
.LBB1_1:
cmp rcx, 400
ja .LBB1_12 ; jump to cold path if >400 bytes
push r15 ; otherwise, push five callee-saved registers onto stack, only on the >48 path
push r14
push r13
push r12
push rbx
... ; code for >48 inputs, including the call to rapidhash_core_cold
example::main::hbaa888d9139c15b4:
... ; code for <=16 byte inputs are inlined
.LBB1_12: ; >16 byte inputs have a simple call, with the returning code
call qword ptr [rip + example::rapidhash_core_17_288::hfd584601588cf559@GOTPCREL]
mov rdx, rax
jmp .LBB2_2
Bounds check elision
Rust’s safety guarantees come with bounds checks attached. Usually that’s fine – a branch predictor that’s right 99% of the time adds negligible overhead. But when you’re already inside a function that only gets called for inputs over 16 bytes, checking whether the input is over 16 bytes is pure waste.
The #[inline(never)] paths use std::hint::assert_unchecked() to tell the compiler it can remove bounds checks. The function for 17-288 byte inputs will never be called with 16 or fewer bytes, but LLVM isn’t able to specialise the function without this assertion:
/// SAFETY: caller guarantees len > 16
#[cold]
#[inline(never)]
const unsafe fn rapidhash_core_17_288(mut seed: u64, secrets: &[u64; 7], data: &[u8]) -> u64 {
// SAFETY: caller guarantees len > 16
std::hint::assert_unchecked(data.len() > 16);
// any <=16 byte reads can now elide bounds checks...
}
The bounds checks still exist in the inlined entry point; they just don’t appear in the deeper tiers where they’d be redundant. The tiered methods are private to the crate, and each tier trusts that its caller has already validated the length.
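The same pattern in isolation, as a standalone sketch (not the crate’s code) you can drop into Compiler Explorer to watch the bounds checks disappear:
/// SAFETY: caller guarantees data.len() > 16.
pub unsafe fn read_first_two_words(data: &[u8]) -> (u64, u64) {
    // With this hint, LLVM can prove the slice indexing below never panics
    // and removes the bounds-check branches entirely.
    unsafe { std::hint::assert_unchecked(data.len() > 16) };
    let a = u64::from_le_bytes(data[0..8].try_into().unwrap());
    let b = u64::from_le_bytes(data[8..16].try_into().unwrap());
    (a, b)
}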
Struct size matters
The RapidHasher struct needs to be small. Every time you create a hasher, you’re initialising this struct. Every time you hash something, you’re passing it around. If the struct is bigger, there’s more to initialise, more to copy, more pressure on registers and stack.
The sponge buffer itself adds 16 bytes to the struct, plus the length tracking and padding. Thankfully, the sponge is frequently optimised away when hashing less complex types.
struct RapidHasher<'s> {
seed: u64,
secrets: &'s [u64; 7], // 8 bytes on 64-bit
sponge: u128,
sponge_len: u8,
}
Foldhash originally stored its secrets as individual u64 fields. One of the optimisations we ported back into foldhash was to store a reference to a secrets array instead, so both crates’ hashers got smaller.
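A quick way to see why the secrets-by-reference change matters – the type names below are illustrative, not the real ones:
use std::mem::size_of;

// Secrets stored inline, as foldhash originally did.
struct InlineSecrets { seed: u64, secrets: [u64; 7], sponge: u128, sponge_len: u8 }

// Secrets behind a reference to a shared array.
struct SharedSecrets<'s> { seed: u64, secrets: &'s [u64; 7], sponge: u128, sponge_len: u8 }

fn main() {
    // On a typical 64-bit target this prints roughly 96 vs 48 bytes.
    println!("inline: {}, shared: {}", size_of::<InlineSecrets>(), size_of::<SharedSecrets>());
}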
Reliable benchmarking
Benchmarking hash functions in Rust is surprisingly tricky. We used Criterion, but found that the default Cargo settings produced inconsistent results. By default, Rust compiles with lto=false and codegen-units=16, which means the compiler optimises each codegen unit separately and may make different inlining decisions between builds. Run your benchmark twice, get two different numbers. Very scientific.
For hashing, this is particularly problematic. A single hash operation can involve inlining across three crate boundaries: the type being hashed (implementing Hash), the standard library’s Hash trait machinery, and the hasher itself. Whether these get inlined together or stay as separate function calls depends on LTO and codegen unit boundaries. To get reliable, reproducible benchmarks, we had to set lto=true and codegen-units=1 on our bench profile. The numbers finally stopped jumping around.
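In Cargo.toml terms that’s just the following, assuming a standard Criterion setup where benches build under the bench profile:
[profile.bench]
lto = true
codegen-units = 1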
We borrowed Orson Peters’ foldhash benchmark suite for comprehensive testing. It’s designed specifically to exercise hash functions through Rust’s Hasher trait, which matters more than you’d think. Many hasher benchmarks, including those for gxhash, museair, and others, measure raw throughput by passing byte slices directly to the hash function. This completely bypasses the Hasher trait overhead we discussed earlier: the small writes, the sponge buffering, the delimiter bytes.
The foldhash suite measures what actually happens when you use a hash function in a real Rust program. It tests raw throughput across various input sizes, but also hashmap lookups (both hits and misses), and insertion performance. This is closer to how hash functions are actually used: not hashing 10KB blobs in isolation, but powering hash tables with typical keys.
Here’s a summary of the results on Apple M1 Max. The “avg_rank” column shows the average position across all tests (lower is better), and “geometric_mean” shows the mean time in nanoseconds across throughput tests (again, lower is better):
| Hasher | Avg Rank | Geometric Mean (ns) |
|---|---|---|
| rapidhash-fast | 2.11 | 4.29 |
| foldhash-fast | 2.84 | 4.83 |
| fxhash | 2.88 | 5.50 |
| rapidhash-quality | 3.53 | 4.82 |
| foldhash-quality | 4.62 | 5.24 |
| ahash | 5.05 | 5.94 |
| siphash | 6.97 | 22.17 |
Full benchmark results across ARM and x86 platforms are available in the rapidhash repository.
Rapidhash consistently ranks near the top across input sizes and platforms. For small inputs (1-64 bytes), which dominate real-world workloads, it matches or beats almost everything. For large inputs, the multi-lane approach pays off and rapidhash pulls ahead of functions that use fewer parallel accumulators, only beaten by some hashes that depend on accelerated AES instructions such as gxhash.
The “quality” variant (closer to the C++ rapidhash) is slightly slower but still competitive. The “fast” variant trades a tiny amount of hash quality for speed by skipping the final avalanche mixing step (an extra folded multiply); in almost all practical hashmap uses, the marginally lower-quality output is a worthwhile trade for the speed.
Contributions and acknowledgements
The sponge buffer trick came from Orson Peters’ foldhash, which pioneered the design of buffering small integer writes to avoid per-field mixing overhead in Rust hashers. Many of the optimisations developed for rapidhash, including the stratified inlining and struct size reductions, made their way back to foldhash. The benchmarks above were run before we merged those improvements back into foldhash; today foldhash benchmarks roughly on par with rapidhash, only just missing out at specific input sizes. Foldhash also isn’t reported on SMHasher3’s quality benchmark.
This work would have been much more difficult without Compiler Explorer. When you’re reasoning about whether a bounds check will be elided or why something is a nanosecond faster, just look.
What’s next
The next article covers how we mitigate seed-independent collisions in rapidhash: making sure certain input patterns can’t collide regardless of which seed you use.