Cache Locality and Data Layout: Turning Memory Into Throughput
CPU caches reward predictable access. This article covers spatial/temporal locality, AoS vs SoA, prefetching, and how to measure improvements in C, Zig, and Rust.
A surprising amount of “performance work” is memory work.
Modern CPUs can do multiple arithmetic operations per cycle, but a cache miss can cost tens to hundreds of cycles. On a 4-wide core, a 200-cycle miss forfeits on the order of 800 arithmetic operations, which is why your data layout and access patterns often matter more than micro-optimizing instructions.
1) The cache hierarchy in one page
Typical hierarchy:
- L1: very small, very fast (per core)
- L2: larger, still fast (per core)
- L3: much larger, shared among cores
- DRAM: huge, slow
Key idea: memory is moved in cache lines (often 64 bytes). When you touch one byte, the CPU fetches the whole line.
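A quick way to feel line granularity is to compare a dense sweep with a stride that touches one float per line. Here is a minimal C sketch; it assumes 64-byte lines, and the 256 MiB array size and stride are illustrative:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void) {
    const size_t n = 1u << 26;                     /* 64 Mi floats = 256 MiB */
    float *a = malloc(n * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < n; i++) a[i] = 1.0f;    /* fault every page in */

    volatile float sink = 0.0f;
    float s = 0.0f;
    uint64_t t0 = now_ns();
    for (size_t i = 0; i < n; i++) s += a[i];      /* dense: uses every byte fetched */
    sink = s;
    uint64_t t1 = now_ns();

    s = 0.0f;
    for (size_t i = 0; i < n; i += 16) s += a[i];  /* one float per 64-byte line */
    sink = s;
    uint64_t t2 = now_ns();

    /* The strided loop does 1/16 the additions but fetches the same lines,
       so its cost per touched element is far higher. */
    printf("dense   ns/elem: %.2f\n", (double)(t1 - t0) / (double)n);
    printf("strided ns/elem: %.2f\n", (double)(t2 - t1) / (double)(n / 16));
    free(a);
    return 0;
}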
2) Locality
- Spatial locality: if you touch a[i], you’ll likely touch a[i+1] soon.
- Temporal locality: if you touch x, you’ll likely touch x again soon.
Cache-friendly code maximizes both.
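To make spatial locality concrete: summing a row-major matrix in row order walks memory contiguously, while column order jumps a whole row per step and wastes most of each fetched line. A minimal C sketch (the matrix size is illustrative):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const size_t rows = 8192, cols = 8192;
    /* Row-major layout: element (r, c) lives at m[r * cols + c]. */
    float *m = malloc(rows * cols * sizeof *m);
    if (!m) return 1;
    for (size_t i = 0; i < rows * cols; i++) m[i] = 1.0f;

    double s = 0.0;
    /* Cache-friendly: the inner loop strides by one element. */
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            s += m[r * cols + c];

    /* Cache-hostile: the inner loop strides by `cols` elements,
       touching a new cache line on almost every access. */
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            s += m[r * cols + c];

    printf("%f\n", s); /* use s so neither loop is optimized out */
    free(m);
    return 0;
}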
3) AoS vs SoA
Two common layouts for collections:
Array of Structs (AoS)
[{x,y,z, mass}, {x,y,z, mass}, ...]
Great when you always use all fields together.
Struct of Arrays (SoA)
x[] y[] z[] mass[]
Great when you process one field at a time over many elements (SIMD-friendly too).
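As a sketch of what the two layouts look like as C types (names are illustrative; the SoA variant is just parallel arrays plus a shared count):

#include <stddef.h>

/* AoS: fields interleaved; an array of these is the layout above. */
typedef struct {
    float x, y, z, mass;
} ParticleAoS;

/* SoA: one dense array per field; particle i is slot i of each array. */
typedef struct {
    float *x, *y, *z, *mass;
    size_t count;
} ParticlesSoA;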
4) Mini benchmark: sum of a field
We’ll compare AoS vs SoA by summing mass across many particles.
C: AoS vs SoA
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
typedef struct {
    float x, y, z;
    float mass;
} Particle;

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void) {
    const size_t n = 10 * 1000 * 1000;
    /* aligned_alloc requires the size to be a multiple of the alignment;
       both sizes here are multiples of 64. */
    Particle *aos = aligned_alloc(64, n * sizeof(Particle));
    float *mass = aligned_alloc(64, n * sizeof(float));
    if (!aos || !mass) return 1;
    /* Initialize both layouts; this also faults every page in up front. */
    for (size_t i = 0; i < n; i++) {
        aos[i].mass = 1.0f;
        mass[i] = 1.0f;
    }
    /* volatile sink: the sums are observably used, so the loops survive -O2. */
    volatile float sink = 0.0f;
    uint64_t t0 = now_ns();
    float s0 = 0.0f;
    for (size_t i = 0; i < n; i++) s0 += aos[i].mass; /* uses 4 of every 16 bytes fetched */
    sink = s0;
    uint64_t t1 = now_ns();
    uint64_t t2 = now_ns();
    float s1 = 0.0f;
    for (size_t i = 0; i < n; i++) s1 += mass[i];     /* uses every byte fetched */
    sink = s1;
    uint64_t t3 = now_ns();
    printf("AoS ns: %llu\n", (unsigned long long)(t1 - t0));
    printf("SoA ns: %llu\n", (unsigned long long)(t3 - t2));
    free(aos);
    free(mass);
    return 0; /* sink is volatile, so its stores cannot be elided */
}
What you’re learning:
- In AoS, each mass load drags x, y, z into the cache with it, even though they go unused.
- In SoA, the mass[] stream is dense and predictable.
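To reproduce this, build the C version with optimizations on, e.g. cc -O2 bench.c -o bench (file name illustrative); without optimization, loop overhead swamps the layout effect.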
Zig: measuring and alignment
const std = @import("std");

// extern struct: guaranteed C-compatible field order and size (16 bytes).
// A packed struct would be bit-packed into a backing integer, which is
// not what a layout benchmark wants.
const Particle = extern struct {
    x: f32,
    y: f32,
    z: f32,
    mass: f32,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const a = gpa.allocator();

    const n: usize = 10 * 1000 * 1000;
    const aos = try a.alignedAlloc(Particle, 64, n);
    defer a.free(aos);
    const mass = try a.alignedAlloc(f32, 64, n);
    defer a.free(mass);

    // Initialize both layouts; this also warms the pages.
    for (aos, 0..) |*p, i| {
        p.* = .{ .x = 0, .y = 0, .z = 0, .mass = 1 };
        mass[i] = 1;
    }

    var timer = try std.time.Timer.start();
    var s0: f32 = 0;
    for (aos) |p| s0 += p.mass;
    const t_aos = timer.read();

    timer.reset();
    var s1: f32 = 0;
    for (mass) |m| s1 += m;
    const t_soa = timer.read();

    std.debug.print("AoS ns: {}\n", .{t_aos});
    std.debug.print("SoA ns: {}\n", .{t_soa});
    // Printing the sums keeps the loops from being optimized away.
    std.debug.print("sink: {d}\n", .{s0 + s1});
}
Rust: slices, iterators, and “don’t optimize the benchmark away”
use std::hint::black_box;
use std::time::Instant;

#[derive(Clone, Copy)]
#[allow(dead_code)] // x, y, z are written but never read
struct Particle {
    x: f32,
    y: f32,
    z: f32,
    mass: f32,
}

fn main() {
    let n = 10_000_000usize;
    let aos = vec![Particle { x: 0.0, y: 0.0, z: 0.0, mass: 1.0 }; n];
    let mass = vec![1.0f32; n];

    // black_box makes each result opaque to the optimizer, so neither
    // summing loop can be elided or precomputed.
    let t0 = Instant::now();
    let s0: f32 = black_box(aos.iter().map(|p| p.mass).sum());
    let aos_ns = t0.elapsed().as_nanos();

    let t1 = Instant::now();
    let s1: f32 = black_box(mass.iter().copied().sum());
    let soa_ns = t1.elapsed().as_nanos();

    println!("AoS ns: {aos_ns}");
    println!("SoA ns: {soa_ns}");
    println!("sink: {}", s0 + s1);
}
5) Measuring correctly
- Run optimized builds (-O2 for C, -O ReleaseFast for Zig, --release for Rust); debug builds say nothing about cache behavior.
- Pin the CPU frequency if possible and close background applications.
- Use perf stat to observe cache effects, e.g.
  perf stat -e cache-misses,cache-references,cycles,instructions,branches ./bench
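Noise from frequency scaling, interrupts, and other processes skews single runs upward, so one common discipline is best-of-N: repeat the measurement and keep the minimum. A minimal C sketch, where bench is a placeholder for any kernel that returns its own elapsed nanoseconds:

#include <stdint.h>

/* Returns the fastest of `trials` runs: the minimum is the run that was
   least disturbed by outside noise. */
static uint64_t best_of(int trials, uint64_t (*bench)(void)) {
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t t = bench();
        if (t < best) best = t;
    }
    return best;
}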
6) Practical rules of thumb
- Prefer tight, contiguous arrays for hot paths.
- Avoid pointer-chasing (linked lists, trees of tiny nodes) in tight loops when possible.
- Batch work: process many items per function call.
- Consider SoA when you only use a subset of fields.
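On prefetching (promised in the intro): hardware prefetchers already track sequential and strided streams, so explicit hints rarely help there. They can pay off when addresses are irregular but computable ahead of time, as with an index array. A sketch using the GCC/Clang builtin __builtin_prefetch; the lookahead distance of 8 is an illustrative guess to tune, not a constant:

#include <stddef.h>

/* Gather-style sum: vals[idx[i]] is unpredictable to the hardware
   prefetcher, but we can compute the address a few iterations early. */
float sum_indexed(const float *vals, const size_t *idx, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&vals[idx[i + 8]], /*rw=*/0, /*locality=*/1);
        s += vals[idx[i]];
    }
    return s;
}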
References
- “What Every Programmer Should Know About Memory” (Drepper): https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
- perf documentation: https://perf.wiki.kernel.org/
- Intel® SDM (cache and memory ordering background): https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html