Cache Locality and Data Layout: Turning Memory Into Throughput
CPU caches reward predictable access. This article covers spatial/temporal locality, AoS vs SoA, prefetching, and how to measure improvements in C, Zig, and Rust.
A surprising amount of “performance work” is memory work.
Modern CPUs can do multiple arithmetic operations per cycle, but a cache miss can cost tens to hundreds of cycles. On a 4-wide core, a 200-cycle miss forfeits on the order of 800 arithmetic operations, which is why your data layout and access patterns often matter more than micro-optimizing instructions.
1) The cache hierarchy in one page
Typical hierarchy:
- L1: very small, very fast (per core)
- L2: larger, still fast (per core)
- L3: much larger, shared among cores
- DRAM: huge, slow
Key idea: memory is moved in cache lines (often 64 bytes). When you touch one byte, the CPU fetches the whole line.
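A quick way to feel line granularity is to compare a dense sweep with a stride that touches one float per line. Here is a minimal C sketch; it assumes 64-byte lines, and the 256 MiB array size and stride are illustrative:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void) {
    const size_t n = 1u << 26;                     /* 64 Mi floats = 256 MiB */
    float *a = malloc(n * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < n; i++) a[i] = 1.0f;    /* fault every page in */

    volatile float sink = 0.0f;
    float s = 0.0f;
    uint64_t t0 = now_ns();
    for (size_t i = 0; i < n; i++) s += a[i];      /* dense: uses every byte fetched */
    sink = s;
    uint64_t t1 = now_ns();

    s = 0.0f;
    for (size_t i = 0; i < n; i += 16) s += a[i];  /* one float per 64-byte line */
    sink = s;
    uint64_t t2 = now_ns();

    /* The strided loop does 1/16 the additions but fetches the same lines,
       so its cost per touched element is far higher. */
    printf("dense   ns/elem: %.2f\n", (double)(t1 - t0) / (double)n);
    printf("strided ns/elem: %.2f\n", (double)(t2 - t1) / (double)(n / 16));
    free(a);
    return 0;
}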
2) Locality
- Spatial locality: if you touch a[i], you’ll likely touch a[i+1] soon.
- Temporal locality: if you touch x, you’ll likely touch x again soon.
Cache-friendly code maximizes both.
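To make spatial locality concrete: summing a row-major matrix in row order walks memory contiguously, while column order jumps a whole row per step and wastes most of each fetched line. A minimal C sketch (the matrix size is illustrative):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t rows = 8192, cols = 8192;
    /* Row-major layout: element (r, c) lives at m[r * cols + c]. */
    float *m = malloc(rows * cols * sizeof *m);
    if (!m) return 1;
    for (size_t i = 0; i < rows * cols; i++) m[i] = 1.0f;

    double s = 0.0;
    /* Cache-friendly: the inner loop strides by one element. */
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            s += m[r * cols + c];

    /* Cache-hostile: the inner loop strides by `cols` elements,
       touching a new cache line on almost every access. */
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            s += m[r * cols + c];

    printf("%f\n", s); /* use s so neither loop is optimized out */
    free(m);
    return 0;
}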
3) AoS vs SoA
Two common layouts for collections:
Array of Structs (AoS)
[{x,y,z, mass}, {x,y,z, mass}, ...]
Great when you always use all fields together.
Struct of Arrays (SoA)
x[] y[] z[] mass[]
Great when you process one field at a time over many elements (SIMD-friendly too).
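As a sketch of what the two layouts look like as C types (names are illustrative; the SoA variant is just parallel arrays plus a shared count):

#include <stddef.h>

/* AoS: fields interleaved; an array of these is the layout above. */
typedef struct {
    float x, y, z, mass;
} ParticleAoS;

/* SoA: one dense array per field; particle i is slot i of each array. */
typedef struct {
    float *x, *y, *z, *mass;
    size_t count;
} ParticlesSoA;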
4) Mini benchmark: sum of a field
We’ll compare AoS vs SoA by summing mass across many particles.
C: AoS vs SoA
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
typedef struct {
    float x, y, z;
    float mass;
} Particle;

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void) {
    const size_t n = 10 * 1000 * 1000;
    /* aligned_alloc requires the size to be a multiple of the alignment;
       both sizes here are multiples of 64. */
    Particle *aos = aligned_alloc(64, n * sizeof(Particle));
    float *mass = aligned_alloc(64, n * sizeof(float));
    if (!aos || !mass) return 1;
    /* Initialize both layouts; this also faults every page in up front. */
    for (size_t i = 0; i < n; i++) {
        aos[i].mass = 1.0f;
        mass[i] = 1.0f;
    }
    /* volatile sink: the sums are observably used, so the loops survive -O2. */
    volatile float sink = 0.0f;
    uint64_t t0 = now_ns();
    float s0 = 0.0f;
    for (size_t i = 0; i < n; i++) s0 += aos[i].mass; /* uses 4 of every 16 bytes fetched */
    sink = s0;
    uint64_t t1 = now_ns();
    uint64_t t2 = now_ns();
    float s1 = 0.0f;
    for (size_t i = 0; i < n; i++) s1 += mass[i];     /* uses every byte fetched */
    sink = s1;
    uint64_t t3 = now_ns();
    printf("AoS ns: %llu\n", (unsigned long long)(t1 - t0));
    printf("SoA ns: %llu\n", (unsigned long long)(t3 - t2));
    free(aos);
    free(mass);
    return 0; /* sink is volatile, so its stores cannot be elided */
}
What you’re learning:
- In AoS, each mass load drags x, y, z into the cache with it, even though they go unused.
- In SoA, the mass[] stream is dense and predictable.
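To reproduce this, build the C version with optimizations on, e.g. cc -O2 bench.c -o bench (file name illustrative); without optimization, loop overhead swamps the layout effect.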
Zig: measuring and alignment
const std = @import("std");

// extern struct: guaranteed C-compatible field order and size (16 bytes).
// A packed struct would be bit-packed into a backing integer, which is
// not what a layout benchmark wants.
const Particle = extern struct {
    x: f32,
    y: f32,
    z: f32,
    mass: f32,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const a = gpa.allocator();

    const n: usize = 10 * 1000 * 1000;
    const aos = try a.alignedAlloc(Particle, 64, n);
    defer a.free(aos);
    const mass = try a.alignedAlloc(f32, 64, n);
    defer a.free(mass);

    // Initialize both layouts; this also warms the pages.
    for (aos, 0..) |*p, i| {
        p.* = .{ .x = 0, .y = 0, .z = 0, .mass = 1 };
        mass[i] = 1;
    }

    var timer = try std.time.Timer.start();
    var s0: f32 = 0;
    for (aos) |p| s0 += p.mass;
    const t_aos = timer.read();

    timer.reset();
    var s1: f32 = 0;
    for (mass) |m| s1 += m;
    const t_soa = timer.read();

    std.debug.print("AoS ns: {}\n", .{t_aos});
    std.debug.print("SoA ns: {}\n", .{t_soa});
    // Printing the sums keeps the loops from being optimized away.
    std.debug.print("sink: {d}\n", .{s0 + s1});
}
Rust: slices, iterators, and “don’t optimize the benchmark away”
use std::hint::black_box;
use std::time::Instant;

#[derive(Clone, Copy)]
#[allow(dead_code)] // x, y, z are written but never read
struct Particle {
    x: f32,
    y: f32,
    z: f32,
    mass: f32,
}

fn main() {
    let n = 10_000_000usize;
    let aos = vec![Particle { x: 0.0, y: 0.0, z: 0.0, mass: 1.0 }; n];
    let mass = vec![1.0f32; n];

    // black_box makes each result opaque to the optimizer, so neither
    // summing loop can be elided or precomputed.
    let t0 = Instant::now();
    let s0: f32 = black_box(aos.iter().map(|p| p.mass).sum());
    let aos_ns = t0.elapsed().as_nanos();

    let t1 = Instant::now();
    let s1: f32 = black_box(mass.iter().copied().sum());
    let soa_ns = t1.elapsed().as_nanos();

    println!("AoS ns: {aos_ns}");
    println!("SoA ns: {soa_ns}");
    println!("sink: {}", s0 + s1);
}
5) Measuring correctly
- Run optimized builds (-O2 for C, -O ReleaseFast for Zig, --release for Rust); debug builds say nothing about cache behavior.
- Pin the CPU frequency if possible and close background applications.
- Use perf stat to observe cache effects, e.g.
  perf stat -e cache-misses,cache-references,cycles,instructions,branches ./bench
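Noise from frequency scaling, interrupts, and other processes skews single runs upward, so one common discipline is best-of-N: repeat the measurement and keep the minimum. A minimal C sketch, where bench is a placeholder for any kernel that returns its own elapsed nanoseconds:

#include <stdint.h>

/* Returns the fastest of `trials` runs: the minimum is the run that was
   least disturbed by outside noise. */
static uint64_t best_of(int trials, uint64_t (*bench)(void)) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t t = bench();
        if (t < best) best = t;
    }
    return best;
}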
6) Practical rules of thumb
- Prefer tight, contiguous arrays for hot paths.
- Avoid pointer-chasing (linked lists, trees of tiny nodes) in tight loops when possible.
- Batch work: process many items per function call.
- Consider SoA when you only use a subset of fields.
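On prefetching (promised in the intro): hardware prefetchers already track sequential and strided streams, so explicit hints rarely help there. They can pay off when addresses are irregular but computable ahead of time, as with an index array. A sketch using the GCC/Clang builtin __builtin_prefetch; the lookahead distance of 8 is an illustrative guess to tune, not a constant:

#include <stddef.h>

/* Gather-style sum: vals[idx[i]] is unpredictable to the hardware
   prefetcher, but we can compute the address a few iterations early. */
float sum_indexed(const float *vals, const size_t *idx, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&vals[idx[i + 8]], /*rw=*/0, /*locality=*/1);
        s += vals[idx[i]];
    }
    return s;
}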
References
- “What Every Programmer Should Know About Memory” (Drepper): https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
- perf documentation: https://perf.wiki.kernel.org/
- Intel® SDM (cache and memory ordering background): https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html