Append-Only Logs: The Storage Pattern Behind Databases and Durable Systems

This article covers Append-Only Logs: The Storage Pattern Behind Databases and Durable Systems. Append-only logs are simple, fast, and crash-friendly. Learn record formats, checksums, compaction, and how WAL-like designs work, with C, Zi...

Many durable systems are built around one idea:

  • never overwrite data in place
  • append new records

This is the append-only log pattern.

It shows up in:

  • database write-ahead logs (WAL)
  • Kafka-like systems
  • journaling filesystems (conceptually)

1) Why append-only is powerful

  • sequential writes are fast
  • avoids torn overwrites
  • crash recovery is scanning forward

2) A minimal log record format

Common fields:

  • magic/version
  • length
  • checksum
  • payload

Recovery:

  • scan from beginning
  • validate checksum
  • stop at first invalid record

3) C: write a length+checksum record

#include <stdint.h>
#include <unistd.h>

typedef struct {
    uint32_t len;
    uint32_t crc;
} hdr_t;

static uint32_t fake_crc32(const unsigned char *p, uint32_t n) {
    uint32_t x = 0;
    for (uint32_t i = 0; i < n; i++) x = (x * 33) ^ p[i];
    return x;
}

int log_append(int fd, const void *data, uint32_t len) {
    hdr_t h = { .len = len, .crc = fake_crc32((const unsigned char*)data, len) };
    if (write(fd, &h, sizeof(h)) != (ssize_t)sizeof(h)) return -1;
    if (write(fd, data, len) != (ssize_t)len) return -1;
    return 0;
}

4) Zig: encode a record

const std = @import("std");

pub fn appendRecord(w: anytype, payload: []const u8) !void {
    var hdr: [4]u8 = undefined;
    std.mem.writeInt(u32, &hdr, @intCast(payload.len), .little);
    try w.writeAll(&hdr);
    try w.writeAll(payload);
}

5) Rust: scan until first invalid record

fn scan(mut bytes: &[u8]) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    loop {
        if bytes.len() < 4 { break; }
        let len = u32::from_le_bytes(bytes[..4].try_into().unwrap()) as usize;
        bytes = &bytes[4..];
        if bytes.len() < len { break; }
        out.push(bytes[..len].to_vec());
        bytes = &bytes[len..];
    }
    out
}

6) Compaction and snapshots

Because logs grow, systems add:

  • periodic snapshots
  • compaction/segment deletion

High level:

  • snapshot captures current state
  • older segments can be deleted

7) Durability: appends are not durable until you say so

An append-only design is crash-friendly, but it is not automatically durable.

On most systems:

  • write() copies bytes into the kernel page cache.
  • the kernel later flushes those dirty pages to disk.
  • a crash or power loss can drop recent writes.

If you need the log to survive a crash, you need a durability point:

  • fsync(fd) / fdatasync(fd) on a file descriptor
  • sometimes also fsync(dirfd) if you need directory entry durability (new file / rename)

This is why many systems use "group commit": batch a set of log appends, then fsync once.

8) Record framing and partial writes

Real logs must handle:

  • short writes
  • torn writes (crash in the middle)
  • trailing garbage (old bytes at end of file)

Common techniques:

  • length prefix (so the reader knows how many bytes belong to the record)
  • checksum (so the reader can detect corruption)
  • magic/version (to detect wrong files and format upgrades)

On recovery:

  • scan forward
  • stop at the first record that fails validation
  • truncate the file at the last known-good offset

The key invariant: never treat unvalidated tail bytes as a record.

9) Segments: keeping the log manageable

Instead of one giant file, many systems use segments:

  • log-00000001.seg
  • log-00000002.seg

Benefits:

  • easier deletion after compaction
  • bounded recovery time (you can skip old segments after a snapshot)
  • better operational tooling (rotate, archive)

Typical layout:

  • CURRENT pointer (or manifest) says which segment is active
  • a snapshot describes the last applied position
  • segments after that position are replayed

10) Ordering rules: what must happen before what

Durable systems care about ordering:

  • Write record bytes
  • Flush record bytes
  • Then publish a durable pointer that says "record is committed"

That "pointer" might be:

  • a separate metadata file
  • a record in the log itself
  • an atomically renamed manifest

Atomic rename is a common building block because it provides a single, well-defined commit point.

References