
Racing DRAM refresh

Every ~7.8 microseconds your DRAM controller stops serving reads and refreshes a row of cells. A read that lands in that window jumps from ~80ns to ~300-600ns. That is the tail in your tail latency, and it is structural: the refresh is going to happen, the only question is whether your read collides with it.

The trick to dodge it is not mine. It comes from LaurieWired, whose original C++ implementation lives at LaurieWired/tailslayer. The idea: replicate your data across N independent DRAM channels, pin one worker thread per core, and take whichever read finishes first. Refresh schedules are uncorrelated across channels, so at any instant at least one channel is probably not refreshing.

I ported it to Rust. The repo is vimoppa/tailslayer-rs, Apache 2.0. This post is about what the port forced me to get right, and the one section that is genuinely hard: turning a physical address into a channel index on AMD.

The 7.8us stall

The mechanism is documented in the DDR4 spec, not measured by me. The refresh interval (tREFI) is ~7.8us; a refresh blocks the rank for ~350ns on DDR4. The library encodes those as constants:

// crates/tailslayer-rs-hw/src/mem/mod.rs
impl MemoryTech for Ddr4Tech {
    fn refresh_interval_ns(&self) -> u64 { 7800 }
    fn refresh_duration_ns(&self) -> u64 { 350 }
    // ...
}
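Those two constants are enough for a back-of-the-envelope estimate of how often a read collides with a refresh, and how much replication helps. The sketch below is mine, not part of the library, and it assumes reads arrive at uniformly random times and that refresh phases are independent across channels. The function names are hypothetical.

```rust
/// Fraction of time a rank spends refreshing: refresh duration / refresh interval.
fn refresh_duty_cycle(refresh_duration_ns: u64, refresh_interval_ns: u64) -> f64 {
    refresh_duration_ns as f64 / refresh_interval_ns as f64
}

/// Probability that every one of `replicas` channels is refreshing at the same
/// instant. ASSUMPTION: refresh schedules are independent and uniformly phased,
/// so the per-channel probabilities simply multiply.
fn all_replicas_stalled(duty_cycle: f64, replicas: u32) -> f64 {
    duty_cycle.powi(replicas as i32)
}
```

With 350ns out of every 7800ns, a single channel is refreshing about 4.5% of the time. Under the independence assumption, two replicas are both stalled about 0.2% of the time. Real controllers can postpone or batch refreshes, so treat this as an order-of-magnitude argument, not a prediction.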

To check whether the stall is actually visible on a given box, there is a trefi-probe binary. It flushes a cache line, reads it, times the read with rdtsc, and looks for periodic spikes:

// bins/trefi-probe/src/main.rs (trimmed)
unsafe fn timed_probe(addr: *const u8) -> u64 {
    tailslayer_rs_hw::cache_flush(addr);
    tailslayer_rs_hw::fence_full();
    tailslayer_rs_hw::fence_load();
    let t0 = tailslayer_rs_hw::read_timestamp();
    core::ptr::read_volatile(addr);
    let t1 = tailslayer_rs_hw::read_timestamp_end();
    t1 - t0
}

It then bins the inter-spike intervals against the expected tREFI and its harmonics (1T, 2T, 3T), finds the histogram peak, and reports how far it drifted from the expected interval:

// bins/trefi-probe/src/main.rs (trimmed)
let peak_cyc = bin_lo + (peak_bin as f64 + 0.5) * bin_width;
let peak_us = peak_cyc / (tsc_ghz * 1000.0);
eprintln!("\n  Histogram peak: {peak_cyc:.0} cycles ({peak_us:.2} us), count={peak_count}");
eprintln!("  Expected:       {t:.0} cycles ({:.2} us)", cli.trefi_us);
eprintln!("  Deviation:      {:.1}%", (peak_cyc - t).abs() / t * 100.0);

The probe prints whatever your hardware produces. I'm not going to quote a peak or a spike percentage as if it were a fixed property of the world, because it isn't. Run it on your box and read the verdict.
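For readers who want the binning step spelled out, here is a minimal sketch of it under my own assumptions: spike timestamps are already collected as TSC cycle counts, bins are equal width, and the helper names are hypothetical rather than taken from trefi-probe.

```rust
/// Differences between consecutive spike timestamps, in TSC cycles.
fn inter_spike_intervals(spike_cycles: &[u64]) -> Vec<f64> {
    spike_cycles
        .windows(2)
        .map(|w| w[1].wrapping_sub(w[0]) as f64)
        .collect()
}

/// Bin `intervals` into `n_bins` equal-width bins starting at `bin_lo` and
/// return (peak_bin, peak_count), or None if nothing landed in range.
fn histogram_peak(
    intervals: &[f64],
    bin_lo: f64,
    bin_width: f64,
    n_bins: usize,
) -> Option<(usize, usize)> {
    if bin_width <= 0.0 {
        return None;
    }
    let mut counts = vec![0usize; n_bins];
    for &iv in intervals {
        if iv < bin_lo {
            continue;
        }
        let bin = ((iv - bin_lo) / bin_width) as usize;
        if bin < n_bins {
            counts[bin] += 1;
        }
    }
    let (peak_bin, &peak_count) = counts.iter().enumerate().max_by_key(|&(_, &c)| c)?;
    if peak_count == 0 {
        None
    } else {
        Some((peak_bin, peak_count))
    }
}

/// Same deviation formula the probe prints: |peak - expected| / expected, in percent.
fn deviation_pct(peak_cyc: f64, expected_cyc: f64) -> f64 {
    (peak_cyc - expected_cyc).abs() / expected_cyc * 100.0
}
```

Feed it synthetic timestamps first (for example, spikes spaced 23400 cycles apart, which is 7.8us at a hypothetical 3 GHz TSC) to check the binning before pointing it at real measurements.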

Physical address to channel

This is the part that matters, and the part where the original C++ and AMD's address hashing disagree.

To replicate across channels, you have to know which channel a given physical address lands on. On Intel it's one bit:

// crates/tailslayer-rs-hw/src/strategy/mod.rs
impl ChannelStrategy for BitExtract {
    fn compute_channel(&self, phys_addr: u64) -> u32 {
        ((phys_addr >> self.channel_bit) & 1) as u32
    }
    fn num_channels(&self) -> usize { 2 }
    fn min_channel_offset(&self) -> usize { 1 << self.channel_bit }
}

Shift, mask, done. Two channels, commonly selected by bit 8.

AMD Zen does not do this. Each channel-select bit is the parity of many address bits XOR-folded together. The original C++ advertises AMD support, but it places the channel at a fixed bit and offset rather than computing the XOR hash. On a part where the channel really is XOR-folded, fixed bit extraction does not track the actual channel, so two replicas can land on the same channel, which defeats the hedge. The disagreement is between a fixed-offset model and AMD's XOR-based one, not a flat bug.

The fold is one function, shared between the channel mapper and the address decomposer:

// crates/tailslayer-rs-hw/src/strategy/mod.rs
pub fn xor_fold(addr: u64, masks: &[u64]) -> u32 {
    let mut result = 0u32;
    for (i, mask) in masks.iter().enumerate() {
        let bit = (addr & mask).count_ones() & 1;
        result |= bit << i;
    }
    result
}

Each mask yields one output bit, popcount(addr & mask) mod 2. N masks give N bits, i.e. 2^N channels. The AMD strategy wraps this with a physical offset, because AMD hoists a chunk of the address space and you have to subtract it back out before hashing:

// crates/tailslayer-rs-hw/src/strategy/mod.rs
impl ChannelStrategy for XorHash {
    fn compute_channel(&self, phys_addr: u64) -> u32 {
        let addr = phys_addr.wrapping_sub(self.phys_offset);
        xor_fold(addr, &self.masks)
    }
    fn num_channels(&self) -> usize { 1 << self.masks.len() }
}
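To make the earlier point about fixed-bit extraction concrete, here is a small comparison. The mask is HYPOTHETICAL, chosen only so the arithmetic is easy to follow: it folds bits 8 and 13 into one channel bit. It is not taken from any real part.

```rust
/// Same fold as above: one output bit per mask, each the parity of `addr & mask`.
fn xor_fold(addr: u64, masks: &[u64]) -> u32 {
    let mut result = 0u32;
    for (i, mask) in masks.iter().enumerate() {
        let bit = (addr & mask).count_ones() & 1;
        result |= bit << i;
    }
    result
}

/// The fixed-bit model: the channel is a single address bit.
fn bit_extract(addr: u64, channel_bit: u32) -> u32 {
    ((addr >> channel_bit) & 1) as u32
}

/// HYPOTHETICAL mask for illustration only: bits 8 and 13 XOR-folded together.
const DEMO_MASK: u64 = (1 << 8) | (1 << 13);
```

Take addresses 0x0000 and 0x2100. The fixed-bit model looks at bit 8 and puts them on channels 0 and 1. The fold sees two set bits under the mask in 0x2100, so the parity is 0 and both addresses hash to channel 0. Two replicas placed by the fixed-bit model would share a channel and refresh together.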

The same fold drives AddressDecomposer, which goes further and splits an address into channel, sub-channel, and bank group for DDR5 hedging:

// crates/tailslayer-rs-hw/src/mem/addr.rs
pub fn decompose(&self, phys_addr: u64) -> DramAddress {
    let addr = phys_addr.wrapping_sub(self.phys_offset);
    DramAddress {
        channel: xor_fold(addr, &self.channel_masks),
        subchannel: xor_fold(addr, &self.subchannel_masks),
        bank_group: xor_fold(addr, &self.bank_group_masks),
    }
}

The masks are the entire ballgame, and they are not mine. They come out of reverse-engineering papers: ZenHammer (USENIX Security '24) and Sudoku (DRAMSec '25). Here is Zen 4, DDR5, two channels, taken straight from the Sudoku tables:

// crates/tailslayer-rs-hw/src/channel.rs
// From Sudoku paper Table 2: Ryzen 9 7950X 2Ch-1DPC
// Channel masks: 0x0000000100 (bit 8), 0x0000080000 (bit 19)
Self::AmdZen4Ddr5_2ch => ChannelMapper::amd(
    vec![0x100, 0x80000], // bits 8 and 19
    0,
),

Two masks, so two channel-select bits, so four addressable channel indices. The Zen 5 twelve-channel profile is on shakier ground, and the honest caveat is in the code, not buried in a footnote:

// crates/tailslayer-rs-hw/src/channel.rs
// 12 channels needs 4 mask bits (2^4=16 ⊇ 12).
// WARNING: estimated from Zen 4 patterns. Use --discover on real hardware.
Self::AmdZen5Ddr5_12ch => ChannelMapper::amd(vec![0x100, 0x80000, 0x200, 0x100000], 0),

These hashes vary by BIOS version, DIMM population, and SKU. A profile that is correct on one Ryzen 9 7950X is a guess on the next board with different memory. That is the nature of reverse-engineered address mapping, and the Zen 5 masks are openly labelled as estimated.
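As a worked example of what the Zen 4 two-mask profile implies for replica placement, the sketch below scans cache-line-aligned offsets until the hash lands on a target index. The helper is hypothetical and is not how the library necessarily places replicas. It assumes the base is a 1GB-aligned physical address (so offsets within the hugepage map one-to-one onto low physical address bits) and a phys_offset of 0, as in the profile.

```rust
/// Same fold as in the post: one output bit per mask.
fn xor_fold(addr: u64, masks: &[u64]) -> u32 {
    let mut result = 0u32;
    for (i, mask) in masks.iter().enumerate() {
        let bit = (addr & mask).count_ones() & 1;
        result |= bit << i;
    }
    result
}

/// HYPOTHETICAL helper: smallest cache-line-aligned offset below `limit`
/// whose address hashes to channel index `target`.
fn first_offset_for_channel(base: u64, masks: &[u64], target: u32, limit: u64) -> Option<u64> {
    const CACHE_LINE: u64 = 64;
    (0..limit)
        .step_by(CACHE_LINE as usize)
        .find(|off| xor_fold(base.wrapping_add(*off), masks) == target)
}
```

With masks 0x100 and 0x80000 and a base of 0, the first offsets for indices 0 through 3 are 0x0, 0x100, 0x80000, and 0x80100. If the masks are wrong for your board, so are these offsets.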

The benchmark does not trust the profile blindly. Before it runs, it translates each replica's virtual address through /proc/self/pagemap, computes the channel, and bails if two replicas collide:

// bins/hedged-bench/src/main.rs (trimmed)
let phys_i = tailslayer_rs_hw::virt_to_phys(replicas[i] as usize)?;
let ch_i = config.channel_mapper.compute(phys_i);
// ... compare against every other replica ...
if ch_i == ch_j {
    eprintln!("ERROR: Replicas {i} and {j} on same channel ({ch_i})!");
    std::process::exit(1);
}

If the mask is wrong for your hardware, you find out here, not after collecting five million bogus samples.
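For context, here is a minimal sketch of the two pieces that check relies on: decoding a /proc/self/pagemap entry and scanning for a channel collision. This is my own illustration, not the body of virt_to_phys. It assumes the Linux pagemap layout (one 64-bit entry per base page, PFN in bits 0-54, present flag in bit 63) and a 4 KiB base page. The function names are hypothetical.

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

const PFN_MASK: u64 = (1 << 55) - 1; // bits 0..=54 hold the page frame number
const PRESENT: u64 = 1 << 63;        // bit 63: page is present in RAM

/// Turn a raw pagemap entry into a physical address, if one is available.
/// Without CAP_SYS_ADMIN the kernel reports a PFN of 0, which we treat as "unknown".
fn decode_pagemap_entry(entry: u64, vaddr: u64, page_size: u64) -> Option<u64> {
    if (entry & PRESENT) == 0 {
        return None;
    }
    let pfn = entry & PFN_MASK;
    if pfn == 0 {
        return None;
    }
    Some(pfn * page_size + vaddr % page_size)
}

/// Read the 8-byte pagemap entry covering `vaddr` (Linux only at runtime).
fn read_pagemap_entry(vaddr: u64, page_size: u64) -> io::Result<u64> {
    let mut f = File::open("/proc/self/pagemap")?;
    f.seek(SeekFrom::Start(vaddr / page_size * 8))?;
    let mut buf = [0u8; 8];
    f.read_exact(&mut buf)?;
    Ok(u64::from_ne_bytes(buf))
}

/// First pair of replica indices that share a channel, if any.
fn first_collision(channels: &[u32]) -> Option<(usize, usize)> {
    for i in 0..channels.len() {
        for j in (i + 1)..channels.len() {
            if channels[i] == channels[j] {
                return Some((i, j));
            }
        }
    }
    None
}
```

The PFN-of-zero case matters in practice: run unprivileged and every replica appears to sit at the same physical page, which a careless check would misread as a hash problem.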

What the Rust rewrite bought

The mechanism is identical to the C++. The safety story is not.

The original allocated hugepages with mmap and released them with munmap by hand on every exit path. Miss one and you leak a 1GB locked page. In Rust the allocation is an RAII type, and Drop is the only cleanup path:

// crates/tailslayer-rs-hw/src/lib.rs
impl Drop for HugePage {
    fn drop(&mut self) {
        unsafe {
            libc::munmap(self.ptr as *mut libc::c_void, self.len);
        }
    }
}

There is no "exit path" to forget. The page unmaps when it goes out of scope, including on the error branch inside alloc itself, where mlock can fail after the mmap succeeded.
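The error-branch behaviour is easy to demonstrate without touching mmap. The sketch below swaps the real mapping for a Vec and counts live regions, so the names (Region, LockFailed, LIVE_REGIONS) are stand-ins of mine, not types from the crate.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts regions that have been "mapped" and not yet "unmapped".
static LIVE_REGIONS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for HugePage: a Vec plays the role of the mmap'd region.
struct Region {
    buf: Vec<u8>,
}

impl Region {
    fn map(len: usize) -> Region {
        LIVE_REGIONS.fetch_add(1, Ordering::SeqCst);
        Region { buf: vec![0u8; len] }
    }
}

impl Drop for Region {
    fn drop(&mut self) {
        // Stand-in for munmap: the only place cleanup happens.
        LIVE_REGIONS.fetch_sub(1, Ordering::SeqCst);
    }
}

#[derive(Debug)]
struct LockFailed;

/// `lock_ok` simulates whether the mlock step succeeds after the mapping does.
fn alloc(len: usize, lock_ok: bool) -> Result<Region, LockFailed> {
    let region = Region::map(len);
    if !lock_ok {
        // Early return: `region` goes out of scope here, so Drop runs.
        return Err(LockFailed);
    }
    Ok(region)
}
```

Calling alloc with lock_ok set to false leaves the live count at zero. No cleanup code appears on the error branch, because the early return is the cleanup.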

mmap failure is a Result, not a perror followed by assert:

// crates/tailslayer-rs-hw/src/lib.rs (trimmed)
if ptr == libc::MAP_FAILED {
    return Err(HwError::HugePageAlloc {
        size,
        source: std::io::Error::last_os_error(),
    });
}

You cannot accidentally read from a page that failed to allocate. The type system will not let you reach the pointer.

Thread safety is enforced at compile time. HugePage is Send but deliberately not Sync, and the reasoning is written next to the unsafe impl:

// crates/tailslayer-rs-hw/src/lib.rs
// SAFETY: HugePage owns its allocation exclusively. It can be moved between
// threads (Send) because the raw pointer is valid for the struct's lifetime
// and munmap is called in Drop. NOT Sync: concurrent access to the pointer
// from multiple threads requires external synchronization.
unsafe impl Send for HugePage {}

The hedged reader joins all its worker threads in Drop before the region is freed, so the raw pointers the workers hold can never outlive the memory they point at. That ordering is the invariant the whole design rests on, and it is encoded in the destructor.
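Here is a minimal sketch of that ordering using only std. It is not the library's reader: the workers just spin on a stop flag instead of reading memory, the region is a Vec, and the type name is hypothetical. What it does show is the part that matters: Drop::drop runs to completion before any field is dropped, so joining inside it guarantees the workers are gone before the region is freed.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

struct HedgedReaderSketch {
    stop: Arc<AtomicBool>,
    exited: Arc<AtomicUsize>,
    workers: Vec<JoinHandle<()>>,
    region: Vec<u8>, // stand-in for the hugepage the real workers point into
}

impl HedgedReaderSketch {
    fn spawn(n_workers: usize, region_len: usize) -> Self {
        let stop = Arc::new(AtomicBool::new(false));
        let exited = Arc::new(AtomicUsize::new(0));
        let workers: Vec<JoinHandle<()>> = (0..n_workers)
            .map(|_| {
                let stop = Arc::clone(&stop);
                let exited = Arc::clone(&exited);
                thread::spawn(move || {
                    // A real worker would be issuing reads here.
                    while !stop.load(Ordering::Acquire) {
                        thread::yield_now();
                    }
                    exited.fetch_add(1, Ordering::Release);
                })
            })
            .collect();
        Self { stop, exited, workers, region: vec![0u8; region_len] }
    }
}

impl Drop for HedgedReaderSketch {
    fn drop(&mut self) {
        self.stop.store(true, Ordering::Release);
        for w in self.workers.drain(..) {
            let _ = w.join();
        }
        // Only after this body returns are the fields, including `region`, dropped.
    }
}
```

If the join lived anywhere other than the destructor, a panic or early return could free the region while a worker was still reading it.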

All the inline assembly and the syscalls live in one crate, tailslayer-rs-hw. The library on top, tailslayer-rs, is mostly ordinary Rust, but not entirely safe: it still has the raw-pointer striping in replica.rs, a read_volatile in reader.rs, and an unsafe impl Send/Sync. So unsafe exists on both sides of the crate boundary. What the boundary buys you is narrower: every instruction that talks to the hardware, the asm and the syscalls, is on one side. If you want to audit the parts that touch the machine, you read one crate; if you want to audit all the unsafe, you read two.

One binary, two architectures

The library runs on x86_64 and aarch64 (Graviton, Ampere) from the same source. The arch-specific intrinsics are four functions behind cfg, identical signatures, different instructions.

x86_64 uses rdtsc for the clock, clflush to evict, lfence/mfence to fence:

// crates/tailslayer-rs-hw/src/platform/x86_64.rs (trimmed)
pub unsafe fn read_timestamp() -> u64 {
    let lo: u32; let hi: u32;
    core::arch::asm!(
        "lfence", "rdtsc",
        out("eax") lo, out("edx") hi,
        options(nostack, preserves_flags),
    );
    ((hi as u64) << 32) | (lo as u64)
}

pub unsafe fn cache_flush(addr: *const u8) {
    core::arch::asm!("clflush [{addr}]", addr = in(reg) addr,
        options(nostack, preserves_flags));
}

aarch64 uses the generic timer cntvct_el0, dc civac to clean-and-invalidate, and dmb/dsb to fence:

// crates/tailslayer-rs-hw/src/platform/aarch64.rs (trimmed)
pub unsafe fn read_timestamp() -> u64 {
    let val: u64;
    core::arch::asm!(
        "isb", "mrs {val}, cntvct_el0",
        val = out(reg) val,
        options(nostack, preserves_flags),
    );
    val
}

pub unsafe fn cache_flush(addr: *const u8) {
    core::arch::asm!("dc civac, {addr}", addr = in(reg) addr,
        options(nostack, preserves_flags));
}

The timed_probe code above never names an architecture, and neither does measurement_loop (not shown in this post). They call read_timestamp, cache_flush, and fence_full, and the right asm gets compiled in.
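The wiring behind that is plain cfg. A minimal sketch of the pattern, covering only the timestamp function: the module layout, the CLOCK constant, the elapsed_ticks helper, and the Instant-based fallback are my own additions for illustration, and the crate itself has no such fallback.

```rust
#[cfg(target_arch = "x86_64")]
mod platform {
    pub const CLOCK: &str = "rdtsc";
    pub fn read_timestamp() -> u64 {
        // rdtsc is available on every x86_64 CPU and touches no memory.
        #[allow(unused_unsafe)]
        unsafe {
            std::arch::x86_64::_rdtsc()
        }
    }
}

#[cfg(target_arch = "aarch64")]
mod platform {
    pub const CLOCK: &str = "cntvct_el0";
    pub fn read_timestamp() -> u64 {
        let val: u64;
        // Reads the virtual counter; isb keeps it from being hoisted early.
        unsafe {
            std::arch::asm!(
                "isb", "mrs {val}, cntvct_el0",
                val = out(reg) val,
                options(nostack, preserves_flags),
            );
        }
        val
    }
}

// ASSUMPTION: a portable fallback so the sketch builds elsewhere.
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
mod platform {
    pub const CLOCK: &str = "std::time::Instant";
    pub fn read_timestamp() -> u64 {
        use std::sync::OnceLock;
        use std::time::Instant;
        static START: OnceLock<Instant> = OnceLock::new();
        START.get_or_init(Instant::now).elapsed().as_nanos() as u64
    }
}

/// Arch-agnostic caller: it only ever sees `platform::read_timestamp`.
fn elapsed_ticks<F: FnOnce()>(work: F) -> u64 {
    let t0 = platform::read_timestamp();
    work();
    let t1 = platform::read_timestamp();
    t1.wrapping_sub(t0)
}
```

Exactly one platform module survives compilation, so callers such as elapsed_ticks need no cfg of their own. The tick unit differs per architecture, which is why anything that reports time has to convert with the right frequency.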

Why it won't run on the machine I wrote it on

I wrote this on an Apple Silicon Mac, a machine it cannot run on.

It's Linux-only by construction. The hardware crate opens with compile_error!("tailslayer-rs-hw requires Linux"), so it won't even compile on macOS. Past that it needs MAP_HUGETLB 1GB pages, /proc/self/pagemap for virtual-to-physical translation, sched_setaffinity pinning, and chrt -f 99 real-time scheduling. None of those exist on macOS.

Apple Silicon closes the two doors the whole technique walks through. First, there are no userspace physical addresses: there is no /proc/self/pagemap equivalent, and SIP keeps physical mapping in the kernel. Without a physical address the channel hash has nothing to compute on. Second, the LPDDR channel and bank interleave is proprietary and not publicly reverse-engineered. The ZenHammer and Sudoku work this port relies on is Intel and AMD DDR only. So even with a physical address there is no published hash to map it to a channel.

A few more things stack up. Apple Silicon uses 16KB base pages with no large pages or superpages. The userspace timer cntvct_el0 runs at ~24 MHz (about 41.7ns per tick), too coarse to time a single read. And unified LPDDR uses per-bank refresh, which already softens the stall this whole thing chases.
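The timer point is simple arithmetic, sketched here with hypothetical helper names. The 24 MHz figure is the one quoted above.

```rust
/// Nanoseconds represented by one tick of a timer running at `timer_hz`.
fn ns_per_tick(timer_hz: u64) -> f64 {
    1e9 / timer_hz as f64
}

/// How many ticks a span of `duration_ns` covers at `timer_hz`.
fn ticks_for(duration_ns: f64, timer_hz: u64) -> f64 {
    duration_ns / ns_per_tick(timer_hz)
}
```

At 24 MHz one tick is about 41.7ns, so an ~80ns read spans fewer than two ticks and a ~300ns stalled read about seven. That is enough to tell a stall from a normal read in aggregate, but far too coarse to resolve the shape of a single access.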

What is logically hardware-free is the pure computation: the channel math, the address decomposer, the DDR/LPDDR constant tables, and the roughly 40 tests that exercise them. None of it touches the machine. But "hardware-free" is not "builds on a Mac." That compile_error! fires on any non-Linux target, so as written none of the crate compiles here at all, --no-default-features or not. To actually run the pure core off Linux you would first have to lift the Linux gate and sort out the no_std test dependencies. Until you do, it is hardware-free in principle and uncompilable on a Mac in practice.

You could also write a coarse demo: a separate small program of your own, not part of the crate. Time reads with mach_absolute_time (or cntvct_el0), evict with sys_dcache_flush(), and plot the read-latency distribution idle versus under a memory-bandwidth stressor. That shows a tail exists. It cannot do channel-aware hedging, because you cannot know which channel an address lands on. That is observing the DRAM tail on Apple Silicon, not slaying it.

What I would not claim

This is a port. The mechanism is LaurieWired's, the AMD masks are from the papers, and the Zen 5 and Oxide profiles are extrapolated guesses that the code itself flags as estimated. If you run it on a SKU I have not seen, the channel hash may simply be wrong, and the collision check will tell you so.

What I would build next is the part I left as CLI flags rather than working code: a real --auto-detect that derives masks on the box instead of looking them up in a table, and a --verify that proves replicas sit on different channels with a timing test rather than trusting /proc/self/pagemap plus a profile. Until that exists, treat every profile as a hypothesis the hardware gets to reject.