Append-Only Logs: The Storage Pattern Behind Databases and Durable Systems
This article covers Append-Only Logs: The Storage Pattern Behind Databases and Durable Systems. Append-only logs are simple, fast, and crash-friendly. Learn record formats, checksums, compaction, and how WAL-like designs work, with C, Zi...
Many durable systems are built around one idea:
- never overwrite data in place
- append new records
This is the append-only log pattern.
It shows up in:
- database write-ahead logs (WAL)
- Kafka-like systems
- journaling filesystems (conceptually)
1) Why append-only is powerful
- sequential writes are fast
- avoids torn overwrites
- crash recovery is scanning forward
2) A minimal log record format
Common fields:
- magic/version
- length
- checksum
- payload
Recovery:
- scan from beginning
- validate checksum
- stop at first invalid record
3) C: write a length+checksum record
#include <stdint.h>
#include <unistd.h>
typedef struct {
uint32_t len;
uint32_t crc;
} hdr_t;
static uint32_t fake_crc32(const unsigned char *p, uint32_t n) {
uint32_t x = 0;
for (uint32_t i = 0; i < n; i++) x = (x * 33) ^ p[i];
return x;
}
int log_append(int fd, const void *data, uint32_t len) {
hdr_t h = { .len = len, .crc = fake_crc32((const unsigned char*)data, len) };
if (write(fd, &h, sizeof(h)) != (ssize_t)sizeof(h)) return -1;
if (write(fd, data, len) != (ssize_t)len) return -1;
return 0;
}
4) Zig: encode a record
const std = @import("std");
pub fn appendRecord(w: anytype, payload: []const u8) !void {
var hdr: [4]u8 = undefined;
std.mem.writeInt(u32, &hdr, @intCast(payload.len), .little);
try w.writeAll(&hdr);
try w.writeAll(payload);
}
5) Rust: scan until first invalid record
fn scan(mut bytes: &[u8]) -> Vec<Vec<u8>> {
let mut out = Vec::new();
loop {
if bytes.len() < 4 { break; }
let len = u32::from_le_bytes(bytes[..4].try_into().unwrap()) as usize;
bytes = &bytes[4..];
if bytes.len() < len { break; }
out.push(bytes[..len].to_vec());
bytes = &bytes[len..];
}
out
}
6) Compaction and snapshots
Because logs grow, systems add:
- periodic snapshots
- compaction/segment deletion
High level:
- snapshot captures current state
- older segments can be deleted
7) Durability: appends are not durable until you say so
An append-only design is crash-friendly, but it is not automatically durable.
On most systems:
write()copies bytes into the kernel page cache.- the kernel later flushes those dirty pages to disk.
- a crash or power loss can drop recent writes.
If you need the log to survive a crash, you need a durability point:
fsync(fd)/fdatasync(fd)on a file descriptor- sometimes also
fsync(dirfd)if you need directory entry durability (new file / rename)
This is why many systems use "group commit": batch a set of log appends, then fsync once.
8) Record framing and partial writes
Real logs must handle:
- short writes
- torn writes (crash in the middle)
- trailing garbage (old bytes at end of file)
Common techniques:
- length prefix (so the reader knows how many bytes belong to the record)
- checksum (so the reader can detect corruption)
- magic/version (to detect wrong files and format upgrades)
On recovery:
- scan forward
- stop at the first record that fails validation
- truncate the file at the last known-good offset
The key invariant: never treat unvalidated tail bytes as a record.
9) Segments: keeping the log manageable
Instead of one giant file, many systems use segments:
log-00000001.seglog-00000002.seg
Benefits:
- easier deletion after compaction
- bounded recovery time (you can skip old segments after a snapshot)
- better operational tooling (rotate, archive)
Typical layout:
CURRENTpointer (or manifest) says which segment is active- a snapshot describes the last applied position
- segments after that position are replayed
10) Ordering rules: what must happen before what
Durable systems care about ordering:
- Write record bytes
- Flush record bytes
- Then publish a durable pointer that says "record is committed"
That "pointer" might be:
- a separate metadata file
- a record in the log itself
- an atomically renamed manifest
Atomic rename is a common building block because it provides a single, well-defined commit point.
References
- SQLite WAL: https://www.sqlite.org/wal.html
- PostgreSQL WAL intro: https://www.postgresql.org/docs/current/wal-intro.html
- LevelDB log format: https://github.com/google/leveldb/blob/main/doc/log_format.md