Lock-Free SPSC Ring Buffer: A High-Throughput Queue Between Two Threads

Build a correct single-producer single-consumer ring buffer using atomics, and understand why SPSC is simpler than MPMC. Includes C, Zig, and Rust implementations.

A single-producer single-consumer (SPSC) queue is one of the most useful concurrency building blocks:

  • a producer thread generates work
  • a consumer thread processes work
  • both operate concurrently

SPSC is special because each index has exactly one writer:

  • only the producer writes tail
  • only the consumer writes head

That ownership eliminates many races and lets us build a fast queue without locks.

1) The mental model

We store items in a fixed-size array (capacity is typically a power of two).

  • head points to the next element to read
  • tail points to the next slot to write

The queue is empty when head == tail.

To distinguish full from empty, a common trick is to keep one slot permanently unused, so usable capacity is CAP - 1:

  • full when advancing tail would make it equal to head

2) Correctness: what ordering do we need?

Producer steps:

  1. write item into buf[tail]
  2. publish new tail

Consumer steps:

  1. read tail
  2. read item from buf[head]
  3. publish new head

We need:

  • producer’s item write becomes visible before consumer sees updated tail
  • consumer’s head update becomes visible before producer checks for full

This is exactly what release/acquire ordering provides: the release store publishes an index update, and the matching acquire load observes everything written before it.

3) C (C11 atomics) implementation

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CAP 1024u

typedef struct {
    uint64_t buf[CAP];
    atomic_uint head;
    atomic_uint tail;
} spsc_t;

static inline void spsc_init(spsc_t *q) {
    atomic_store_explicit(&q->head, 0u, memory_order_relaxed);
    atomic_store_explicit(&q->tail, 0u, memory_order_relaxed);
}

static inline int spsc_push(spsc_t *q, uint64_t v) {
    // Producer owns tail, so a relaxed load of our own index suffices.
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    // Acquire pairs with the consumer's release store to head.
    unsigned head = atomic_load_explicit(&q->head, memory_order_acquire);

    unsigned next = (tail + 1u) % CAP;
    if (next == head) return 0; // full

    q->buf[tail] = v;

    // Publish the element.
    atomic_store_explicit(&q->tail, next, memory_order_release);
    return 1;
}

static inline int spsc_pop(spsc_t *q, uint64_t *out) {
    // Consumer owns head, so a relaxed load of our own index suffices.
    unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
    // Acquire pairs with the producer's release store to tail.
    unsigned tail = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (head == tail) return 0; // empty

    *out = q->buf[head];

    unsigned next = (head + 1u) % CAP;
    // Release pairs with the producer's acquire load of head: the read
    // above happens before the producer can reuse this slot.
    atomic_store_explicit(&q->head, next, memory_order_release);
    return 1;
}

Why relaxed sometimes works:

  • each index has exactly one writer, so a thread can load its own index with relaxed ordering (nobody else modifies it) and never needs read-modify-write operations
  • the cross-thread loads still need acquire, paired with the release stores, so that data writes become visible before the index update that publishes them

4) Zig implementation

const std = @import("std");

pub fn SpscQueue(comptime T: type, comptime CAP: usize) type {
    return struct {
        buf: [CAP]T = undefined,
        head: std.atomic.Value(usize) = std.atomic.Value(usize).init(0),
        tail: std.atomic.Value(usize) = std.atomic.Value(usize).init(0),

        pub fn push(self: *@This(), v: T) bool {
            const tail = self.tail.load(.monotonic);
            const head = self.head.load(.acquire);
            const next = (tail + 1) % CAP;
            if (next == head) return false;
            self.buf[tail] = v;
            self.tail.store(next, .release);
            return true;
        }

        pub fn pop(self: *@This()) ?T {
            const head = self.head.load(.monotonic);
            const tail = self.tail.load(.acquire);
            if (head == tail) return null;
            const v = self.buf[head];
            const next = (head + 1) % CAP;
            self.head.store(next, .release);
            return v;
        }
    };
}

5) Rust implementation

use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct Spsc<T, const CAP: usize> {
    buf: [UnsafeCell<Option<T>>; CAP],
    head: AtomicUsize,
    tail: AtomicUsize,
}

unsafe impl<T: Send, const CAP: usize> Send for Spsc<T, CAP> {}
unsafe impl<T: Send, const CAP: usize> Sync for Spsc<T, CAP> {}

impl<T, const CAP: usize> Spsc<T, CAP> {
    pub fn new() -> Self {
        let buf: [UnsafeCell<Option<T>>; CAP] = std::array::from_fn(|_| UnsafeCell::new(None));
        Self { buf, head: AtomicUsize::new(0), tail: AtomicUsize::new(0) }
    }

    pub fn push(&self, v: T) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        let next = (tail + 1) % CAP;
        if next == head { return false; }

        unsafe { *self.buf[tail].get() = Some(v); }
        self.tail.store(next, Ordering::Release);
        true
    }

    pub fn pop(&self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail { return None; }

        let v = unsafe { (*self.buf[head].get()).take() };
        let next = (head + 1) % CAP;
        self.head.store(next, Ordering::Release);
        v
    }
}

Rust note:

  • This uses UnsafeCell to allow mutation through a shared reference.
  • The unsafe Send/Sync impls are sound only under the SPSC contract: exactly one thread calls push and exactly one thread calls pop.

6) Performance notes

  • Prefer power-of-two capacities and use idx & (CAP-1).
  • Avoid false sharing: place head and tail on separate cache lines (padding) when you optimize further.
  • The queue is not MPMC; don’t share it across multiple producers/consumers.
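The false-sharing point can be sketched with alignment directives. This is a layout sketch, not a benchmarked claim: a 64-byte line is typical on x86-64 and many ARM cores, but the right constant depends on your target.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64 /* assumed line size; verify for your target */
#define CAP 1024u

typedef struct {
    uint64_t buf[CAP];
    /* Producer-owned index on its own cache line... */
    alignas(CACHELINE) atomic_uint tail;
    /* ...and the consumer-owned index on another, so the producer's
       stores to tail never invalidate the line holding head. */
    alignas(CACHELINE) atomic_uint head;
} spsc_padded_t;

/* Compile-time check that the indices really land on different lines. */
static_assert(offsetof(spsc_padded_t, head) - offsetof(spsc_padded_t, tail)
                  >= CACHELINE,
              "head and tail share a cache line");
```

Without the padding, head and tail typically sit on the same line, and every index update by one thread forces the other core to refetch it even though the value it cares about did not change.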
