Build Git From Scratch — A Complete Checklist & Guide

This article covers Build Git From Scratch — A Complete Checklist & Guide. Language-agnostic. Every step describes what to implement and why, with pseudocode and concrete data-format references so you can implement it in any language...

Table of Contents

  1. Understand the Git Object Model
  2. Repository Initialization (git init)
  3. Hashing & the Object Store (git hash-object)
  4. Reading Objects (git cat-file)
  5. Tree Objects — Snapshots of Directories
  6. Commit Objects
  7. References — Branches & HEAD
  8. The Index / Staging Area (git add)
  9. Writing a Tree from the Index (git write-tree)
  10. Creating a Commit (git commit)
  11. Reading the Working Tree (git status)
  12. Checkout & Restoring Files (git checkout)
  13. Diffing (git diff)
  14. Log (git log)
  15. Branching (git branch, git switch)
  16. Merging (git merge)
  17. Pack Files & Compression (Advanced)
  18. Remote Protocol Basics (git clone, git fetch, git push)
  19. Test Suites
  20. Recommended Build Order

1. Understand the Git Object Model

What it is

Git's entire history is stored as a content-addressed object store: every piece of data (file, directory snapshot, commit, tag) is identified by the SHA-1 (or SHA-256 in newer Git) hash of its content. The store lives in .git/objects/.

Four object types

Type Contains
blob Raw file contents
tree A directory listing (entries pointing to blobs and trees)
commit A pointer to a tree + metadata (author, message, parent)
tag A named pointer to any object (usually a commit)

Object storage format

Every object is stored as:

<type> <content-length-in-bytes>\0<raw-content>

This concatenated byte sequence is then zlib-compressed and written to:

.git/objects/<first-2-hex-chars-of-hash>/<remaining-38-hex-chars>

Checklist

  • Understand that Git is purely a key-value store keyed on SHA-1 hashes.
  • Understand that the hash is computed before compression on the uncompressed header+content.
  • Understand the four object types and their roles.
  • Decide whether to use SHA-1 (classic) or SHA-256 (modern). SHA-1 is simpler to start with.

2. Repository Initialization (git init)

What it does

Creates the hidden .git/ directory with an initial directory structure, making the current folder a Git repository.

Directory structure to create

.git/
├── HEAD                  ← points to current branch (e.g. "ref: refs/heads/main")
├── config                ← repository config (can be minimal at first)
├── description           ← human-readable description (ignored by most tools)
├── objects/
│   ├── info/             ← empty for now
│   └── pack/             ← empty for now
└── refs/
    ├── heads/            ← branch refs live here
    └── tags/             ← tag refs live here

Implementation steps

  1. Detect if a .git/ directory already exists; if so, abort or reinitialize.
  2. Create the directories listed above.
  3. Write .git/HEAD with the contents ref: refs/heads/main\n (or master, depending on your default).
  4. Write a minimal .git/config:
    [core]
        repositoryformatversion = 0
        filemode = true
        bare = false
    
  5. Write .git/description with a placeholder string.

Checklist

  • Create .git/ and all sub-directories.
  • Write initial HEAD file pointing to refs/heads/main.
  • Write a minimal config file.
  • Write description file.
  • Handle the --bare flag (omit the working directory; root IS .git/).

3. Hashing & the Object Store (git hash-object)

What it does

Takes raw bytes, prepends the Git object header, computes SHA-1, zlib-compresses the result, and writes it to .git/objects/.

Precise algorithm

function hash_object(type, data):
    header = "{type} {len(data)}\0"         # e.g. "blob 14\0"
    store  = header + data                   # concatenate as bytes
    sha1   = SHA1(store)                     # 40-char hex string
    compressed = zlib_compress(store, level=1)
    path = ".git/objects/" + sha1[0:2] + "/" + sha1[2:]
    create_dirs_if_needed(path)
    write_file(path, compressed)
    return sha1

Important details

  • The SHA-1 is computed on the uncompressed header + data.
  • Use zlib deflate (level 1 is fine; real Git uses level 1 by default).
  • If the object already exists (same hash), you can skip writing.
  • For blobs, data is the raw file contents (bytes, not text—preserve encoding).

Checklist

  • Implement SHA-1 (or use your language's crypto library).
  • Implement zlib compress/decompress.
  • Implement hash_object(type, data) -> sha1_hex that writes to the object store.
  • Implement write_object(sha1, compressed_bytes) that creates the two-level directory path.
  • Handle the -w (write) flag — by default hash-object only prints the hash; -w writes it.
  • Implement reading a file from disk and calling hash_object("blob", contents).

4. Reading Objects (git cat-file)

What it does

Reads a compressed object from .git/objects/, decompresses it, parses the header, and returns the type and content.

Precise algorithm

function read_object(sha1):
    path = ".git/objects/" + sha1[0:2] + "/" + sha1[2:]
    compressed = read_file(path)
    raw = zlib_decompress(compressed)
    null_pos = raw.index('\0')
    header = raw[0:null_pos]          # e.g. "blob 14"
    content = raw[null_pos+1:]        # everything after the null byte
    type, size = header.split(' ')
    assert len(content) == int(size)
    return type, content

Checklist

  • Implement read_object(sha1) -> (type, content).
  • Support short SHA-1s by scanning the objects directory for a prefix match.
  • Implement the -t flag (print type), -s flag (print size), -p flag (pretty-print content).
  • Pretty-printing for blobs: print raw content as text.
  • Pretty-printing for trees and commits: handled in later steps.

5. Tree Objects — Snapshots of Directories

What a tree is

A tree object is a binary-encoded listing of directory entries. Each entry describes one file or sub-directory.

Binary format (per entry)

<mode> <name>\0<20-bytes-raw-sha1>

Where:

  • mode is an ASCII string (not binary): 100644 for regular file, 100755 for executable, 040000 for directory, 120000 for symlink.
  • name is the filename (no path separators), ASCII/UTF-8.
  • \0 is a literal null byte.
  • The 20-byte SHA-1 is raw binary, not hex.

Entries are concatenated together with no separator between them, and sorted in a specific order (explained below).

Sorting rule (important!)

Git sorts tree entries as if directories have a trailing /. This means a directory named foo is sorted as foo/, which affects ordering when you have entries like foo (file) and foo-bar (file) and foo (directory).

function sort_key(entry):
    if entry.type == "tree":
        return entry.name + "/"
    else:
        return entry.name

Sort entries by sort_key before writing.

Parsing a tree

function parse_tree(data):
    entries = []
    i = 0
    while i < len(data):
        space_pos = data.index(' ', i)
        mode = data[i:space_pos]
        null_pos = data.index('\0', space_pos)
        name = data[space_pos+1:null_pos]
        sha1_raw = data[null_pos+1 : null_pos+21]   # 20 raw bytes
        sha1_hex = to_hex(sha1_raw)
        entries.append((mode, name, sha1_hex))
        i = null_pos + 21
    return entries

Writing a tree

function write_tree(entries):
    # entries: list of (mode_str, name, sha1_hex)
    sorted_entries = sort_tree_entries(entries)
    content = b""
    for mode, name, sha1 in sorted_entries:
        content += f"{mode} {name}\0".encode()
        content += bytes.fromhex(sha1)   # raw 20-byte SHA-1
    return hash_object("tree", content)

Checklist

  • Implement parse_tree(data) -> list of (mode, name, sha1_hex).
  • Implement write_tree(entries) -> sha1.
  • Implement sorting logic using the "directory gets trailing slash" rule.
  • Handle the four modes: regular file (100644), executable (100755), symlink (120000), directory (040000).
  • Pretty-print trees with cat-file -p: show each entry as <mode> <type> <sha1>\t<name>.

6. Commit Objects

What a commit is

A commit is a plain-text object that points to a tree (the snapshot of all files at that point in time) and to zero or more parent commits.

Commit format

tree <tree-sha1>
parent <parent-sha1>         ← zero or more parent lines
author <name> <email> <unix-timestamp> <timezone-offset>
committer <name> <email> <unix-timestamp> <timezone-offset>
gpgsig <optional-signature>skip for basic implementation

<blank line>
<commit message>

Example:

tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
parent a1b2c3d4e5f6...
author Jane Doe <jane@example.com> 1700000000 +0000
committer Jane Doe <jane@example.com> 1700000000 +0000

Initial commit

Parsing a commit

function parse_commit(data):
    lines = data.split('\n')
    headers = {}
    i = 0
    while lines[i] != '':       # blank line separates headers from message
        key, _, value = lines[i].partition(' ')
        headers.setdefault(key, []).append(value)
        i += 1
    message = '\n'.join(lines[i+1:])
    return headers, message

Writing a commit

function write_commit(tree_sha1, parent_sha1s, author, committer, message):
    lines = [f"tree {tree_sha1}"]
    for p in parent_sha1s:
        lines.append(f"parent {p}")
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    lines.append("")
    lines.append(message)
    content = "\n".join(lines) + "\n"
    return hash_object("commit", content.encode())

Author/Committer format

Name <email> <unix-timestamp-seconds> <+HHMM or -HHMM>

Example: Jane Doe <jane@example.com> 1700000000 +0000

Checklist

  • Implement parse_commit(data) -> (headers_dict, message).
  • Implement write_commit(tree, parents, author, committer, message) -> sha1.
  • Support multiple parent lines (for merge commits).
  • Read author name/email from .git/config or environment variables (GIT_AUTHOR_NAME, etc.).
  • Use current Unix timestamp and local timezone offset.
  • Pretty-print commit with cat-file -p.

7. References — Branches & HEAD

What refs are

A reference (ref) is simply a named pointer to a SHA-1 hash, stored as a text file in .git/refs/. Branches live under .git/refs/heads/, tags under .git/refs/tags/.

File contents

A direct ref file contains just the 40-character hex SHA-1 followed by a newline:

a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2\n

HEAD is a special ref that either:

  1. Symbolic ref: points to another ref: ref: refs/heads/main\n
  2. Detached HEAD: contains a direct SHA-1 (when you checked out a specific commit).

Reading HEAD

function read_head():
    content = read_file(".git/HEAD").strip()
    if content.startswith("ref: "):
        return ("symbolic", content[5:])   # e.g. "refs/heads/main"
    else:
        return ("detached", content)       # raw SHA-1

Resolving a ref to a commit SHA

function resolve_ref(ref_name):
    # ref_name can be "HEAD", "main", "refs/heads/main", or a sha1
    # Try direct SHA-1 first
    if is_valid_sha1(ref_name):
        return ref_name
    # Try standard locations
    for prefix in ["", "refs/", "refs/heads/", "refs/tags/", "refs/remotes/"]:
        path = ".git/" + prefix + ref_name
        if file_exists(path):
            content = read_file(path).strip()
            if content.startswith("ref: "):   # packed symref
                return resolve_ref(content[5:])
            return content
    raise Error(f"unknown ref: {ref_name}")

Packed refs

When there are many refs, Git stores them in .git/packed-refs for efficiency. Lines look like:

# pack-refs with: peeled fully-peeled sorted
a1b2c3... refs/heads/feature-x
^c4d5e6... # optional: peeled tag

Your resolver should fall back to checking packed-refs if a loose ref file isn't found.

Checklist

  • Implement read_ref(ref_name) -> sha1 (resolve through symrefs).
  • Implement write_ref(ref_name, sha1) (create/update a ref file).
  • Implement read_head() -> (type, value).
  • Implement write_head(symbolic_or_sha1).
  • Implement list_refs(prefix) -> list of (name, sha1) (scan refs/ directory).
  • Parse packed-refs and include those results in lookups.

8. The Index / Staging Area (git add)

What the index is

The index (also called the staging area or cache) is a binary file at .git/index that stores the state of the working tree that will be used for the next commit. It tracks every file, its mode, its SHA-1, and metadata used to efficiently detect changes.

Index file format (Version 2)

Header:
    4 bytes: magic "DIRC"
    4 bytes: version number (2)
    4 bytes: number of entries (big-endian uint32)

Entries (one per staged file):
    4 bytes: ctime seconds
    4 bytes: ctime nanoseconds
    4 bytes: mtime seconds
    4 bytes: mtime nanoseconds
    4 bytes: dev
    4 bytes: ino
    4 bytes: mode (e.g. 0x000081A4 = 0100644 octal)
    4 bytes: uid
    4 bytes: gid
    4 bytes: file size
    20 bytes: SHA-1 of blob
    2 bytes: flags (top 4 bits: assume-valid, extended, stage; bottom 12: filename length)
    variable: filename (null-terminated, padded to 8-byte boundary)

Footer:
    20 bytes: SHA-1 of entire index content up to this point

Tip: For a minimal implementation, you can store the index as a simple JSON/TOML/binary structure of {path: sha1_hex, mode: int} and convert to the real format later.

git add algorithm

function git_add(paths):
    index = read_index()
    for path in expand_paths(paths):     # handle globs, directories recursively
        if is_directory(path):
            for file in walk_files(path):
                add_file_to_index(index, file)
        else:
            add_file_to_index(index, path)
    write_index(index)

function add_file_to_index(index, path):
    contents = read_file(path)
    sha1 = hash_object("blob", contents)   # writes to object store
    stat = os.stat(path)
    index[path] = IndexEntry(sha1, stat.mode, stat.ctime, stat.mtime, ...)

Checklist

  • Define an IndexEntry data structure.
  • Implement read_index() -> dict of path -> IndexEntry (parse .git/index).
  • Implement write_index(entries) (serialize and write .git/index).
  • Implement the index checksum (SHA-1 of all bytes before the trailing hash).
  • Implement git_add(paths) that hashes blobs and updates the index.
  • Handle removing files from the index (git rm).
  • Handle .gitignore patterns when adding files.

9. Writing a Tree from the Index (git write-tree)

What it does

Converts the flat list of path -> sha1 entries in the index into a nested tree of tree objects (one per directory level).

Algorithm

function write_tree_from_index(index_entries):
    # Build a nested dict representing the directory tree
    tree = {}
    for path, entry in index_entries.items():
        parts = path.split('/')
        node = tree
        for part in parts[:-1]:       # directories
            node = node.setdefault(part, {})
        filename = parts[-1]
        node[filename] = entry        # leaf: IndexEntry

    return build_tree(tree)

function build_tree(node):
    entries = []
    for name, child in node.items():
        if isinstance(child, dict):                 # sub-directory
            subtree_sha1 = build_tree(child)
            entries.append(("040000", name, subtree_sha1))
        else:                                       # blob
            mode = format_mode(child.mode)          # e.g. "100644"
            entries.append((mode, name, child.sha1))
    return write_tree(entries)       # from step 5

Checklist

  • Implement build_nested_tree(index_entries) -> nested_dict.
  • Implement build_tree(nested_dict) -> sha1 (recursive).
  • Correctly format file modes from stat mode integers to Git mode strings.

10. Creating a Commit (git commit)

What it does

Takes the current index, creates a tree from it, and creates a commit object pointing to that tree (and to HEAD as the parent).

Algorithm

function git_commit(message):
    index = read_index()
    tree_sha1 = write_tree_from_index(index)

    # Determine parent(s)
    head_type, head_value = read_head()
    if head_type == "symbolic":
        current_branch_ref = head_value           # e.g. "refs/heads/main"
        try:
            parent_sha1 = resolve_ref(current_branch_ref)
            parents = [parent_sha1]
        except:
            parents = []                          # first commit, no parent
    else:
        parents = [head_value]                    # detached HEAD

    # Build author/committer strings
    author = get_author_string()                  # "Name <email> timestamp tz"
    committer = get_committer_string()

    # Write commit object
    commit_sha1 = write_commit(tree_sha1, parents, author, committer, message)

    # Update branch ref or HEAD
    if head_type == "symbolic":
        write_ref(current_branch_ref, commit_sha1)
    else:
        write_head(commit_sha1)                   # update detached HEAD

    return commit_sha1

Checklist

  • Read config for user.name and user.email.
  • Handle first commit (no parent).
  • Handle detached HEAD.
  • Update the branch ref after writing the commit.
  • Support --author and --message / -m flags.
  • Support reading the commit message from $EDITOR when -m is not given.

11. Reading the Working Tree (git status)

What it does

Compares three states: (1) the last commit (HEAD), (2) the index (staging area), (3) the working directory. Reports differences as staged changes, unstaged changes, and untracked files.

Three-way comparison

HEAD tree  ↔  Index  →  "staged changes"    (what will be in next commit)
Index      ↔  Disk   →  "unstaged changes"  (what is modified but not yet staged)
Disk only             →  "untracked files"

Algorithm

function git_status():
    index = read_index()
    head_tree = read_tree_recursive(resolve_commit_tree(read_head()))
    working_files = walk_all_files(".")   # excluding .git/

    staged_changes = []
    for path in union(head_tree.keys(), index.keys()):
        if path in index and path not in head_tree:
            staged_changes.append(("new file", path))
        elif path in head_tree and path not in index:
            staged_changes.append(("deleted", path))
        elif index[path].sha1 != head_tree[path].sha1:
            staged_changes.append(("modified", path))

    unstaged_changes = []
    untracked = []
    for path in working_files:
        if is_gitignored(path): continue
        if path not in index:
            untracked.append(path)
        else:
            # Fast check: compare mtime before hashing
            if mtime_changed(path, index[path]):
                disk_sha1 = hash_object_no_write("blob", read_file(path))
                if disk_sha1 != index[path].sha1:
                    unstaged_changes.append(("modified", path))

    return staged_changes, unstaged_changes, untracked

Checklist

  • Implement read_tree_recursive(sha1) -> flat dict of path -> (mode, sha1).
  • Implement walk_all_files(dir) -> list of relative paths (skip .git/).
  • Implement .gitignore pattern matching (at minimum: exact names, * glob, leading /, trailing /).
  • Implement mtime-based fast-path to avoid hashing every file on every status call.
  • Format and print the three sections of output.

12. Checkout & Restoring Files (git checkout)

What it does

Updates the working tree and index to reflect a given commit, branch, or file state.

Checking out a branch or commit

function git_checkout(target):
    target_sha1 = resolve_ref(target)
    commit = parse_commit(read_object(target_sha1)[1])
    target_tree = commit.headers["tree"][0]

    current_tree = read_tree_recursive(current_head_tree())
    target_tree_flat = read_tree_recursive(target_tree)

    # Remove files not in target
    for path in current_tree:
        if path not in target_tree_flat:
            delete_file(path)

    # Write files from target
    for path, (mode, sha1) in target_tree_flat.items():
        content = read_object(sha1)[1]
        write_file(path, content)
        set_file_mode(path, mode)

    # Update index
    new_index = build_index_from_tree(target_tree_flat)
    write_index(new_index)

    # Update HEAD
    if target is a branch name:
        write_head_symbolic("refs/heads/" + target)
    else:
        write_head(target_sha1)   # detached HEAD

Checking out a specific file

function git_checkout_file(path, ref="HEAD"):
    sha1 = resolve_ref(ref)
    tree_flat = read_tree_recursive(get_commit_tree(sha1))
    blob_sha1 = tree_flat[path].sha1
    content = read_object(blob_sha1)[1]
    write_file(path, content)
    # update index entry too

Checklist

  • Implement read_tree_recursive(sha1) -> flat dict.
  • Implement the three-way file reconciliation (add, remove, update).
  • Handle dirty working tree (warn or abort if uncommitted changes would be overwritten).
  • Support checking out a specific file by path.
  • Update HEAD to be a symbolic ref (branch checkout) or detached (commit/tag checkout).
  • Set correct file permissions based on mode.

13. Diffing (git diff)

What it does

Shows the line-by-line differences between two versions of a file. Git uses the Myers diff algorithm by default.

Diff output format (unified diff)

diff --git a/file.txt b/file.txt
index abc1234..def5678 100644
--- a/file.txt
+++ b/file.txt
@@ -1,4 +1,5 @@
 unchanged line
-removed line
+added line
 another unchanged line

Myers diff algorithm (simplified)

function diff_lines(a_lines, b_lines):
    # Returns list of operations: ('equal', line), ('insert', line), ('delete', line)
    # Standard shortest-edit-script (SES) algorithm
    n, m = len(a_lines), len(b_lines)
    # V is an array indexed from -(n+m) to (n+m)
    V = {}; V[1] = 0
    trace = []
    for d in range(0, n + m + 1):
        snapshot = dict(V)
        trace.append(snapshot)
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and V.get(k-1, -1) < V.get(k+1, -1)):
                x = V.get(k+1, 0)
            else:
                x = V.get(k-1, 0) + 1
            y = x - k
            while x < n and y < m and a_lines[x] == b_lines[y]:
                x += 1; y += 1
            V[k] = x
            if x >= n and y >= m:
                return backtrack(trace, a_lines, b_lines)

Tip: For a first implementation, use your language's built-in diff library (Python's difflib, for example) to produce a working diff command, then replace it with Myers later.

Checklist

  • Implement (or wrap) a line-diff algorithm.
  • Implement git diff (index vs working tree).
  • Implement git diff --staged (HEAD vs index).
  • Implement git diff <commit1> <commit2> (two arbitrary commits).
  • Format output as unified diff with @@ ... @@ hunks.
  • Color output (red for deletions, green for additions) when outputting to a terminal.

14. Log (git log)

What it does

Walks the commit graph backwards from HEAD (or a given ref), printing each commit.

Algorithm

function git_log(start_ref="HEAD", limit=None):
    sha1 = resolve_ref(start_ref)
    visited = set()
    queue = [sha1]          # use a priority queue for merge commits (by timestamp)
    count = 0
    while queue:
        current = queue.pop(0)
        if current in visited: continue
        visited.add(current)
        type_, data = read_object(current)
        commit = parse_commit(data)
        print_commit(current, commit)
        count += 1
        if limit and count >= limit: break
        for parent in commit.headers.get("parent", []):
            queue.append(parent)

Output format

commit a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2
Author: Jane Doe <jane@example.com>
Date:   Mon Nov 13 12:34:56 2023 +0000

    Commit message here

Checklist

  • Implement git_log(start, n) that follows parent pointers.
  • Handle merge commits (two parents) — visit both.
  • Implement --oneline format.
  • Implement --graph (ASCII art branch graph) — complex, save for last.
  • Implement --decorate (show branch/tag names next to commits).
  • Implement -- <path> filtering (only show commits that touched a specific file).

15. Branching (git branch, git switch)

What a branch is

A branch is just a ref file in .git/refs/heads/ containing the SHA-1 of the tip commit. Creating a branch is cheap — it's just writing a file.

git branch

function git_branch_create(name, start_point="HEAD"):
    sha1 = resolve_ref(start_point)
    write_ref("refs/heads/" + name, sha1)

function git_branch_list():
    current = read_head()
    for ref_file in list_files(".git/refs/heads/"):
        name = ref_file.name
        sha1 = read_ref("refs/heads/" + name)
        marker = "*" if current == name else " "
        print(f"  {marker} {name}")

function git_branch_delete(name):
    path = ".git/refs/heads/" + name
    if not file_exists(path):
        raise Error(f"branch {name} not found")
    delete_file(path)

git switch / git checkout <branch>

Alias for the checkout algorithm in step 12, switching to a named branch.

Checklist

  • Implement git branch <name> (create at HEAD).
  • Implement git branch (list all branches, mark current with *).
  • Implement git branch -d <name> (delete, warn if unmerged).
  • Implement git switch <name> / git checkout <name>.
  • Implement git switch -c <name> (create and switch).

16. Merging (git merge)

Types of merge

  1. Fast-forward: The current branch is an ancestor of the target. Just move the branch pointer forward.
  2. Three-way merge: Both branches have diverged. Find the common ancestor (merge base), then combine changes.
  3. Merge conflict: The same region of a file was changed on both sides.

Fast-forward detection

function is_ancestor(maybe_ancestor, descendant):
    # Walk backwards from descendant; if we reach maybe_ancestor, it's an ancestor
    visited = set()
    queue = [descendant]
    while queue:
        sha1 = queue.pop()
        if sha1 == maybe_ancestor: return True
        if sha1 in visited: continue
        visited.add(sha1)
        commit = parse_commit(read_object(sha1)[1])
        queue.extend(commit.headers.get("parent", []))
    return False

Finding the merge base (Lowest Common Ancestor)

function find_merge_base(sha1_a, sha1_b):
    ancestors_a = all_ancestors(sha1_a)    # BFS, collect all reachable commits
    queue = [sha1_b]
    visited = set()
    while queue:
        sha1 = queue.pop(0)
        if sha1 in ancestors_a:
            return sha1                    # first common ancestor found
        if sha1 in visited: continue
        visited.add(sha1)
        commit = parse_commit(read_object(sha1)[1])
        queue.extend(commit.headers.get("parent", []))
    return None

Three-way merge algorithm

function three_way_merge(base_tree, ours_tree, theirs_tree):
    all_paths = union(base_tree.keys(), ours_tree.keys(), theirs_tree.keys())
    conflicts = []
    for path in all_paths:
        base = base_tree.get(path)
        ours = ours_tree.get(path)
        theirs = theirs_tree.get(path)

        if ours == theirs:
            result = ours                           # same on both sides, take either
        elif base == ours:
            result = theirs                         # we didn't change, take theirs
        elif base == theirs:
            result = ours                           # they didn't change, take ours
        else:
            # Both sides changed — attempt line-level merge
            merged, ok = merge_file_content(base, ours, theirs)
            if ok:
                result = merged
            else:
                result = merged                     # write conflict markers
                conflicts.append(path)
        write_to_index_and_disk(path, result)
    return conflicts

Conflict markers

<<<<<<< HEAD
our version of the line(s)
=======
their version of the line(s)
>>>>>>> branch-name

Checklist

  • Implement is_ancestor(a, b) -> bool.
  • Implement fast-forward merge (move branch pointer, update working tree).
  • Implement find_merge_base(a, b) -> sha1.
  • Implement three-way file merge with conflict markers.
  • Write merge result to index (with stage entries for conflicts: stages 1, 2, 3).
  • Write MERGE_HEAD file with the SHA-1 of the branch being merged in.
  • After resolving conflicts, git commit reads MERGE_HEAD to create a merge commit with two parents.

17. Pack Files & Compression (Advanced)

What pack files are

Instead of storing one object per file (which is slow for repos with millions of objects), Git packs objects into a single .pack file with a .idx index.

Pack file format (overview)

Header: "PACK" + version (4 bytes) + object count (4 bytes)
Objects: each object is stored as:
    - Type + size (variable-length encoding)
    - Compressed data (zlib) OR delta from another object
Footer: SHA-1 of pack content

Delta compression

Objects can be stored as deltas (diffs) against a base object. This dramatically reduces size for similar blobs (e.g. consecutive versions of a file).

Checklist

  • Implement git gc or git pack-objects to pack loose objects.
  • Implement pack index (.idx) parsing and writing.
  • Implement reading objects from a pack file.
  • Implement delta encoding/decoding (OBJ_OFS_DELTA and OBJ_REF_DELTA types).

Note: Pack file support is needed for cloning and fetching from remotes. You can defer this until step 18.


18. Remote Protocol Basics (git clone, git fetch, git push)

Remotes config

Stored in .git/config:

[remote "origin"]
    url = https://github.com/user/repo.git
    fetch = +refs/heads/*:refs/remotes/origin/*

Smart HTTP protocol (upload-pack / receive-pack)

Git communicates with servers using the pkt-line format:

# A pkt-line: 4-byte hex length prefix (includes the 4 bytes) + data
0015# service=git-upload-pack\n
0000  # flush packet (end of section)

git clone high-level flow

1. Send HTTP GET to /info/refs?service=git-upload-pack
2. Server responds with available refs (branches, tags) in pkt-line format
3. Client sends "want <sha1>" for desired refs + "done"
4. Server responds with a PACK file containing all required objects
5. Unpack the PACK file into .git/objects/
6. Write remote-tracking refs: .git/refs/remotes/origin/main = sha1
7. Checkout the default branch

git fetch and git push

  • fetch: Same as clone step 1–4, but only download new objects.
  • push: Send receive-pack request, upload a pack with new objects, update remote refs.

Checklist

  • Implement pkt-line encoding/decoding.
  • Implement HTTP transport: GET /info/refs, POST /upload-pack.
  • Parse the initial ref advertisement from the server.
  • Implement git fetch: compute missing objects (negotiation), receive pack.
  • Implement git clone: fetch + checkout.
  • Implement git push: send pack + update ref commands to receive-pack.
  • Store remote-tracking refs under refs/remotes/<remote>/.

19. Test Suites

Test Suite 1: Core Object Model

The following tests cover steps 2–10 and can be written in any test framework.

# test_object_model.py  (using Python + pytest as an example)
import os, shutil, tempfile, subprocess, hashlib, zlib

# ---- Fixtures ----

def setup_repo(tmp_path):
    """Initialize a fresh repo in tmp_path and return its path."""
    os.chdir(tmp_path)
    run("git init")        # replace with your implementation's CLI call
    return tmp_path

def run(cmd):
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    assert result.returncode == 0, result.stderr
    return result.stdout.strip()

# ---- Test 1: init creates correct directory structure ----

def test_init_creates_structure(tmp_path):
    setup_repo(tmp_path)
    assert os.path.isdir(tmp_path / ".git")
    assert os.path.isdir(tmp_path / ".git" / "objects")
    assert os.path.isdir(tmp_path / ".git" / "refs" / "heads")
    assert os.path.isfile(tmp_path / ".git" / "HEAD")
    head = (tmp_path / ".git" / "HEAD").read_text()
    assert head.startswith("ref: refs/heads/")

# ---- Test 2: hash-object produces correct SHA-1 ----

def test_hash_object_blob(tmp_path):
    setup_repo(tmp_path)
    content = b"hello world\n"
    (tmp_path / "file.txt").write_bytes(content)
    sha1 = run("mygit hash-object -w file.txt")
    # Verify manually
    header = f"blob {len(content)}\0".encode()
    expected = hashlib.sha1(header + content).hexdigest()
    assert sha1 == expected, f"Expected {expected}, got {sha1}"
    # Verify object file exists
    obj_path = tmp_path / ".git" / "objects" / sha1[:2] / sha1[2:]
    assert obj_path.exists()
    # Verify decompresses correctly
    raw = zlib.decompress(obj_path.read_bytes())
    assert raw == header + content

# ---- Test 3: cat-file reads back what hash-object wrote ----

def test_cat_file_blob(tmp_path):
    setup_repo(tmp_path)
    (tmp_path / "hello.txt").write_text("Hello, Git!\n")
    sha1 = run("mygit hash-object -w hello.txt")
    assert run(f"mygit cat-file -t {sha1}") == "blob"
    assert run(f"mygit cat-file -p {sha1}") == "Hello, Git!"

# ---- Test 4: add + commit produces valid tree and commit objects ----

def test_add_and_commit(tmp_path):
    setup_repo(tmp_path)
    (tmp_path / "a.txt").write_text("file a\n")
    (tmp_path / "b.txt").write_text("file b\n")
    run("mygit add .")
    commit_sha1 = run('mygit commit -m "initial"')
    # HEAD should now point to a commit
    head_sha1 = run("mygit rev-parse HEAD")
    assert head_sha1 == commit_sha1
    # Commit should have a tree
    tree_sha1 = run(f"mygit cat-file -p {commit_sha1}").split('\n')[0].split(' ')[1]
    tree_type = run(f"mygit cat-file -t {tree_sha1}")
    assert tree_type == "tree"
    # Tree should contain a.txt and b.txt
    tree_content = run(f"mygit cat-file -p {tree_sha1}")
    assert "a.txt" in tree_content
    assert "b.txt" in tree_content

# ---- Test 5: second commit has parent ----

def test_second_commit_has_parent(tmp_path):
    setup_repo(tmp_path)
    (tmp_path / "a.txt").write_text("v1\n")
    run("mygit add .")
    c1 = run('mygit commit -m "commit 1"')
    (tmp_path / "a.txt").write_text("v2\n")
    run("mygit add .")
    c2 = run('mygit commit -m "commit 2"')
    commit_data = run(f"mygit cat-file -p {c2}")
    assert f"parent {c1}" in commit_data

# ---- Test 6: tree sort order is correct ----

def test_tree_sort_order(tmp_path):
    setup_repo(tmp_path)
    os.mkdir(tmp_path / "foo")
    (tmp_path / "foo" / "bar.txt").write_text("bar\n")
    (tmp_path / "foo-baz.txt").write_text("baz\n")   # "foo-baz" sorts before "foo/" in git
    run("mygit add .")
    run('mygit commit -m "test sort"')
    head = run("mygit rev-parse HEAD")
    commit_data = run(f"mygit cat-file -p {head}")
    tree_sha1 = commit_data.split('\n')[0].split()[1]
    tree_entries = run(f"mygit cat-file -p {tree_sha1}").split('\n')
    names = [e.split('\t')[1] for e in tree_entries if e]
    # Real git: "foo-baz.txt" comes before "foo" directory
    assert names.index("foo-baz.txt") < names.index("foo")

# ---- Test 7: checkout restores files ----

def test_checkout_restores_file(tmp_path):
    setup_repo(tmp_path)
    (tmp_path / "readme.txt").write_text("version 1\n")
    run("mygit add .")
    c1 = run('mygit commit -m "v1"')
    (tmp_path / "readme.txt").write_text("version 2\n")
    run("mygit add .")
    run('mygit commit -m "v2"')
    run(f"mygit checkout {c1}")
    assert (tmp_path / "readme.txt").read_text() == "version 1\n"

Test Suite 2: Branching & Merging

# test_branching_merging.py

# ---- Test 8: create and list branches ----

def test_branch_create_and_list(tmp_path):
    setup_repo(tmp_path)
    (tmp_path / "f.txt").write_text("x\n")
    run("mygit add .")
    run('mygit commit -m "init"')
    run("mygit branch feature")
    branches = run("mygit branch")
    assert "feature" in branches
    assert "main" in branches or "master" in branches

# ---- Test 9: fast-forward merge ----

def test_fast_forward_merge(tmp_path):
    setup_repo(tmp_path)
    (tmp_path / "base.txt").write_text("base\n")
    run("mygit add ."); run('mygit commit -m "base"')
    run("mygit branch feature")
    run("mygit switch feature")
    (tmp_path / "feature.txt").write_text("feature\n")
    run("mygit add ."); run('mygit commit -m "feature work"')
    feature_tip = run("mygit rev-parse HEAD")
    run("mygit switch main")
    run("mygit merge feature")
    main_tip = run("mygit rev-parse HEAD")
    assert main_tip == feature_tip      # fast-forward: main tip = feature tip
    assert (tmp_path / "feature.txt").exists()

# ---- Test 10: three-way merge, no conflict ----

def test_three_way_merge_no_conflict(tmp_path):
    setup_repo(tmp_path)
    (tmp_path / "shared.txt").write_text("line1\nline2\n")
    (tmp_path / "main_only.txt").write_text("main\n")
    run("mygit add ."); run('mygit commit -m "base"')
    run("mygit branch feature")
    # Commit on main
    (tmp_path / "main_only.txt").write_text("main updated\n")
    run("mygit add ."); run('mygit commit -m "main change"')
    # Commit on feature
    run("mygit switch feature")
    (tmp_path / "feature_only.txt").write_text("feature\n")
    run("mygit add ."); run('mygit commit -m "feature change"')
    run("mygit switch main")
    run("mygit merge feature")
    assert (tmp_path / "feature_only.txt").exists()
    assert (tmp_path / "main_only.txt").read_text() == "main updated\n"

# ---- Test 11: merge conflict is detected ----

def test_merge_conflict_detected(tmp_path):
    setup_repo(tmp_path)
    (tmp_path / "conflict.txt").write_text("original\n")
    run("mygit add ."); run('mygit commit -m "base"')
    run("mygit branch feature")
    (tmp_path / "conflict.txt").write_text("main version\n")
    run("mygit add ."); run('mygit commit -m "main edit"')
    run("mygit switch feature")
    (tmp_path / "conflict.txt").write_text("feature version\n")
    run("mygit add ."); run('mygit commit -m "feature edit"')
    run("mygit switch main")
    result = subprocess.run("mygit merge feature", shell=True, capture_output=True, text=True)
    assert result.returncode != 0 or "CONFLICT" in result.stdout + result.stderr
    content = (tmp_path / "conflict.txt").read_text()
    assert "<<<<<<" in content
    assert "=======" in content
    assert ">>>>>>>" in content

# ---- Test 12: log shows correct history ----

def test_log_history(tmp_path):
    setup_repo(tmp_path)
    messages = ["first commit", "second commit", "third commit"]
    for i, msg in enumerate(messages):
        (tmp_path / f"f{i}.txt").write_text(f"content {i}\n")
        run("mygit add .")
        run(f'mygit commit -m "{msg}"')
    log = run("mygit log --oneline")
    for msg in messages:
        assert msg in log
    # Most recent first
    lines = [l for l in log.split('\n') if l]
    assert "third commit" in lines[0]
    assert "first commit" in lines[-1]

# ---- Test 13: status correctly categorizes changes ----

def test_status_categories(tmp_path):
    setup_repo(tmp_path)
    (tmp_path / "tracked.txt").write_text("v1\n")
    run("mygit add ."); run('mygit commit -m "init"')
    # Modify tracked file without staging
    (tmp_path / "tracked.txt").write_text("v2\n")
    # Stage a new file
    (tmp_path / "staged_new.txt").write_text("new\n")
    run("mygit add staged_new.txt")
    # Untracked file
    (tmp_path / "untracked.txt").write_text("untracked\n")
    status = run("mygit status")
    assert "staged_new.txt" in status    # staged
    assert "tracked.txt" in status       # unstaged modification
    assert "untracked.txt" in status     # untracked

# ---- Test 14: gitignore works ----

def test_gitignore(tmp_path):
    setup_repo(tmp_path)
    (tmp_path / ".gitignore").write_text("*.log\nbuild/\n")
    (tmp_path / "app.py").write_text("code\n")
    (tmp_path / "debug.log").write_text("log\n")
    os.mkdir(tmp_path / "build")
    (tmp_path / "build" / "output.bin").write_bytes(b"\x00\x01")
    status = run("mygit status")
    assert "app.py" in status
    assert "debug.log" not in status
    assert "output.bin" not in status

Follow this sequence to always have a working (partial) implementation at each stage:

Phase 1 — Core Storage
  [x] Step 2: git init
  [x] Step 3: hash-object (blobs)
  [x] Step 4: cat-file
  [x] Step 5: tree objects
  [x] Step 6: commit objects (manually constructed)
  [x] Step 7: refs + HEAD

Phase 2 — Staging & Committing
  [x] Step 8: index / git add
  [x] Step 9: write-tree from index
  [x] Step 10: git commit (end-to-end)

Phase 3 — Inspection
  [x] Step 11: git status
  [x] Step 13: git diff
  [x] Step 14: git log

Phase 4 — Branching & Navigation
  [x] Step 12: git checkout / restore
  [x] Step 15: git branch / switch
  [x] Step 16: git merge

Phase 5 — Remotes & Optimization
  [x] Step 17: pack files
  [x] Step 18: remote protocol (clone, fetch, push)

Quick Reference: Key Data Formats

Format Example
Object header blob 42\0
Loose object path .git/objects/ab/cdef1234...
Tree entry 100644 file.txt\0<20-byte-sha1>
Ref file a1b2c3...40hex...\n
Symref ref: refs/heads/main\n
Author string Jane <j@ex.com> 1700000000 +0000
pkt-line 0017# service=...\n

Useful Resources