Append-Only Logs: The Storage Pattern Behind

This article covers Append-Only Logs: The Storage Pattern Behind Databases and Durable Systems. Append-only logs are simple, fast, and crash-friendly. Learn record formats, checksums, compaction, and how WAL-like designs work, with C, Zi...

Many durable systems are built around one idea:

never overwrite data in place
append new records

This is the append-only log pattern.

It shows up in:

database write-ahead logs (WAL)
Kafka-like systems
journaling filesystems (conceptually)

1) Why append-only is powerful

sequential writes are fast
avoids torn overwrites
crash recovery is scanning forward

2) A minimal log record format

Common fields:

magic/version
length
checksum
payload

Recovery:

scan from beginning
validate checksum
stop at first invalid record

3) C: write a length+checksum record

#include <stdint.h>
#include <unistd.h>

typedef struct {
    uint32_t len;
    uint32_t crc;
} hdr_t;

static uint32_t fake_crc32(const unsigned char *p, uint32_t n) {
    uint32_t x = 0;
    for (uint32_t i = 0; i < n; i++) x = (x * 33) ^ p[i];
    return x;
}

int log_append(int fd, const void *data, uint32_t len) {
    hdr_t h = { .len = len, .crc = fake_crc32((const unsigned char*)data, len) };
    if (write(fd, &h, sizeof(h)) != (ssize_t)sizeof(h)) return -1;
    if (write(fd, data, len) != (ssize_t)len) return -1;
    return 0;
}

4) Zig: encode a record

const std = @import("std");

pub fn appendRecord(w: anytype, payload: []const u8) !void {
    var hdr: [4]u8 = undefined;
    std.mem.writeInt(u32, &hdr, @intCast(payload.len), .little);
    try w.writeAll(&hdr);
    try w.writeAll(payload);
}

5) Rust: scan until first invalid record

fn scan(mut bytes: &[u8]) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    loop {
        if bytes.len() < 4 { break; }
        let len = u32::from_le_bytes(bytes[..4].try_into().unwrap()) as usize;
        bytes = &bytes[4..];
        if bytes.len() < len { break; }
        out.push(bytes[..len].to_vec());
        bytes = &bytes[len..];
    }
    out
}

6) Compaction and snapshots

Because logs grow, systems add:

periodic snapshots
compaction/segment deletion

High level:

snapshot captures current state
older segments can be deleted

7) Durability: appends are not durable until you say so

An append-only design is crash-friendly, but it is not automatically durable.

On most systems:

write() copies bytes into the kernel page cache.
the kernel later flushes those dirty pages to disk.
a crash or power loss can drop recent writes.

If you need the log to survive a crash, you need a durability point:

fsync(fd) / fdatasync(fd) on a file descriptor
sometimes also fsync(dirfd) if you need directory entry durability (new file / rename)

This is why many systems use "group commit": batch a set of log appends, then fsync once.

8) Record framing and partial writes

Real logs must handle:

short writes
torn writes (crash in the middle)
trailing garbage (old bytes at end of file)

Common techniques:

length prefix (so the reader knows how many bytes belong to the record)
checksum (so the reader can detect corruption)
magic/version (to detect wrong files and format upgrades)

On recovery:

scan forward
stop at the first record that fails validation
truncate the file at the last known-good offset

The key invariant: never treat unvalidated tail bytes as a record.

9) Segments: keeping the log manageable

Instead of one giant file, many systems use segments:

log-00000001.seg
log-00000002.seg

Benefits:

easier deletion after compaction
bounded recovery time (you can skip old segments after a snapshot)
better operational tooling (rotate, archive)

Typical layout:

CURRENT pointer (or manifest) says which segment is active
a snapshot describes the last applied position
segments after that position are replayed

10) Ordering rules: what must happen before what

Durable systems care about ordering:

Write record bytes
Flush record bytes
Then publish a durable pointer that says "record is committed"

That "pointer" might be:

a separate metadata file
a record in the log itself
an atomically renamed manifest

Atomic rename is a common building block because it provides a single, well-defined commit point.

References

SQLite WAL: https://www.sqlite.org/wal.html
PostgreSQL WAL intro: https://www.postgresql.org/docs/current/wal-intro.html
LevelDB log format: https://github.com/google/leveldb/blob/main/doc/log_format.md