Lock-Free SPSC Ring Buffer: A High-Throughput Queue Between Two Threads

Build a correct single-producer single-consumer ring buffer using atomics, and understand why SPSC is simpler than MPMC. Includes C, Zig, and Rust implementations.

A single-producer single-consumer (SPSC) queue is one of the most useful concurrency building blocks:

  • a producer thread generates work
  • a consumer thread processes work
  • both operate concurrently

SPSC is special because each index has exactly one writer:

  • only the producer writes tail
  • only the consumer writes head

That ownership eliminates many races and lets us build a fast queue without locks.

1) The mental model

We store items in a fixed-size array (capacity is typically a power of two).

  • head points to the next element to read
  • tail points to the next slot to write

The queue is empty when head == tail.

To distinguish full from empty, a common trick is to keep one slot permanently unused, so usable capacity is CAP - 1:

  • full when advancing tail would make it equal to head

2) Correctness: what ordering do we need?

Producer steps:

  1. write item into buf[tail]
  2. publish new tail

Consumer steps:

  1. read tail
  2. read item from buf[head]
  3. publish new head

We need:

  • producer’s item write becomes visible before consumer sees updated tail
  • consumer’s head update becomes visible before producer checks for full

This is exactly what release/acquire ordering provides: the release store publishes an index update, and the matching acquire load observes everything written before it.

3) C (C11 atomics) implementation

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CAP 1024u

typedef struct {
    uint64_t buf[CAP];
    atomic_uint head;
    atomic_uint tail;
} spsc_t;

static inline void spsc_init(spsc_t *q) {
    atomic_store_explicit(&q->head, 0u, memory_order_relaxed);
    atomic_store_explicit(&q->tail, 0u, memory_order_relaxed);
}

static inline int spsc_push(spsc_t *q, uint64_t v) {
    // Producer owns tail, so a relaxed load of our own index suffices.
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    // Acquire pairs with the consumer's release store to head.
    unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);

    unsigned next = (tail + 1u) % CAP;
    if (next == head) return 0; // full

    q->buf[tail] = v;

    // Publish the element.
    atomic_store_explicit(&q->tail, next, memory_order_release);
    return 1;
}

static inline int spsc_pop(spsc_t *q, uint64_t *out) {
    // Consumer owns head, so a relaxed load of our own index suffices.
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    // Acquire pairs with the producer's release store to tail.
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (head == tail) return 0; // empty

    *out = q->buf[head];

    unsigned next = (head + 1u) % CAP;
    // Release pairs with the producer's acquire load of head: the read
    // above happens before the producer can reuse this slot.
    atomic_store_explicit(&q->head, next, memory_order_release);
    return 1;
}

Why relaxed sometimes works:

  • each index has exactly one writer, so a thread can load its own index with relaxed ordering (nobody else modifies it) and never needs read-modify-write operations
  • the cross-thread loads still need acquire, paired with the release stores, so that data writes become visible before the index update that publishes them

4) Zig implementation

const std = @import("std");

pub fn SpscQueue(comptime T: type, comptime CAP: usize) type {
    return struct {
        buf: [CAP]T = undefined,
        head: std.atomic.Value(usize) = std.atomic.Value(usize).init(0),
        tail: std.atomic.Value(usize) = std.atomic.Value(usize).init(0),

        pub fn push(self: *@This(), v: T) bool {
            const tail = self.tail.load(.monotonic);
            const head = self.head.load(.acquire);
            const next = (tail + 1) % CAP;
            if (next == head) return false;
            self.buf[tail] = v;
            self.tail.store(next, .release);
            return true;
        }

        pub fn pop(self: *@This()) ?T {
            const head = self.head.load(.monotonic);
            const tail = self.tail.load(.acquire);
            if (head == tail) return null;
            const v = self.buf[head];
            const next = (head + 1) % CAP;
            self.head.store(next, .release);
            return v;
        }
    };
}

5) Rust implementation

use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct Spsc<T, const CAP: usize> {
    buf: [UnsafeCell<Option<T>>; CAP],
    head: AtomicUsize,
    tail: AtomicUsize,
}

unsafe impl<T: Send, const CAP: usize> Send for Spsc<T, CAP> {}
unsafe impl<T: Send, const CAP: usize> Sync for Spsc<T, CAP> {}

impl<T, const CAP: usize> Spsc<T, CAP> {
    pub fn new() -> Self {
        let buf: [UnsafeCell<Option<T>>; CAP] = std::array::from_fn(|_| UnsafeCell::new(None));
        Self { buf, head: AtomicUsize::new(0), tail: AtomicUsize::new(0) }
    }

    pub fn push(&self, v: T) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        let next = (tail + 1) % CAP;
        if next == head { return false; }

        unsafe { *self.buf[tail].get() = Some(v); }
        self.tail.store(next, Ordering::Release);
        true
    }

    pub fn pop(&self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail { return None; }

        let v = unsafe { (*self.buf[head].get()).take() };
        let next = (head + 1) % CAP;
        self.head.store(next, Ordering::Release);
        v
    }
}

Rust note:

  • This uses UnsafeCell to allow mutation through a shared reference.
  • The unsafe Send/Sync impls are sound only under the SPSC contract: exactly one thread calls push and exactly one thread calls pop.

6) Performance notes

  • Prefer power-of-two capacities and use idx & (CAP-1).
  • Avoid false sharing: place head and tail on separate cache lines (padding) when you optimize further.
  • The queue is not MPMC; don’t share it across multiple producers/consumers.
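The false-sharing point can be sketched with alignment directives. This is a layout sketch, not a benchmarked claim: a 64-byte line is typical on x86-64 and many ARM cores, but the right constant depends on your target.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64 /* assumed line size; verify for your target */
#define CAP 1024u

typedef struct {
    uint64_t buf[CAP];
    /* Producer-owned index on its own cache line... */
    alignas(CACHELINE) atomic_uint tail;
    /* ...and the consumer-owned index on another, so the producer's
       stores to tail never invalidate the line holding head. */
    alignas(CACHELINE) atomic_uint head;
} spsc_padded_t;

/* Compile-time check that the indices really land on different lines. */
static_assert(offsetof(spsc_padded_t, head) - offsetof(spsc_padded_t, tail)
                  >= CACHELINE,
              "head and tail share a cache line");
```

Without the padding, head and tail typically sit on the same line, and every index update by one thread forces the other core to refetch it even though the value it cares about did not change.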
