Build Git From Scratch — A Complete Checklist & Guide
This article covers Build Git From Scratch — A Complete Checklist & Guide. Language-agnostic. Every step describes what to implement and why, with pseudocode and concrete data-format references so you can implement it in any language...
Table of Contents
- Understand the Git Object Model
- Repository Initialization (
git init) - Hashing & the Object Store (
git hash-object) - Reading Objects (
git cat-file) - Tree Objects — Snapshots of Directories
- Commit Objects
- References — Branches & HEAD
- The Index / Staging Area (
git add) - Writing a Tree from the Index (
git write-tree) - Creating a Commit (
git commit) - Reading the Working Tree (
git status) - Checkout & Restoring Files (
git checkout) - Diffing (
git diff) - Log (
git log) - Branching (
git branch,git switch) - Merging (
git merge) - Pack Files & Compression (Advanced)
- Remote Protocol Basics (
git clone,git fetch,git push) - Test Suites
- Recommended Build Order
1. Understand the Git Object Model
What it is
Git's entire history is stored as a content-addressed object store: every piece of data (file, directory snapshot, commit, tag) is identified by the SHA-1 (or SHA-256 in newer Git) hash of its content. The store lives in .git/objects/.
Four object types
| Type | Contains |
|---|---|
blob |
Raw file contents |
tree |
A directory listing (entries pointing to blobs and trees) |
commit |
A pointer to a tree + metadata (author, message, parent) |
tag |
A named pointer to any object (usually a commit) |
Object storage format
Every object is stored as:
<type> <content-length-in-bytes>\0<raw-content>
This concatenated byte sequence is then zlib-compressed and written to:
.git/objects/<first-2-hex-chars-of-hash>/<remaining-38-hex-chars>
Checklist
- Understand that Git is purely a key-value store keyed on SHA-1 hashes.
- Understand that the hash is computed before compression on the uncompressed header+content.
- Understand the four object types and their roles.
- Decide whether to use SHA-1 (classic) or SHA-256 (modern). SHA-1 is simpler to start with.
2. Repository Initialization (git init)
What it does
Creates the hidden .git/ directory with an initial directory structure, making the current folder a Git repository.
Directory structure to create
.git/
├── HEAD ← points to current branch (e.g. "ref: refs/heads/main")
├── config ← repository config (can be minimal at first)
├── description ← human-readable description (ignored by most tools)
├── objects/
│ ├── info/ ← empty for now
│ └── pack/ ← empty for now
└── refs/
├── heads/ ← branch refs live here
└── tags/ ← tag refs live here
Implementation steps
- Detect if a
.git/directory already exists; if so, abort or reinitialize. - Create the directories listed above.
- Write
.git/HEADwith the contentsref: refs/heads/main\n(ormaster, depending on your default). - Write a minimal
.git/config:[core] repositoryformatversion = 0 filemode = true bare = false - Write
.git/descriptionwith a placeholder string.
Checklist
- Create
.git/and all sub-directories. - Write initial
HEADfile pointing torefs/heads/main. - Write a minimal
configfile. - Write
descriptionfile. - Handle the
--bareflag (omit the working directory; root IS.git/).
3. Hashing & the Object Store (git hash-object)
What it does
Takes raw bytes, prepends the Git object header, computes SHA-1, zlib-compresses the result, and writes it to .git/objects/.
Precise algorithm
function hash_object(type, data):
header = "{type} {len(data)}\0" # e.g. "blob 14\0"
store = header + data # concatenate as bytes
sha1 = SHA1(store) # 40-char hex string
compressed = zlib_compress(store, level=1)
path = ".git/objects/" + sha1[0:2] + "/" + sha1[2:]
create_dirs_if_needed(path)
write_file(path, compressed)
return sha1
Important details
- The SHA-1 is computed on the uncompressed
header + data. - Use zlib deflate (level 1 is fine; real Git uses level 1 by default).
- If the object already exists (same hash), you can skip writing.
- For blobs,
datais the raw file contents (bytes, not text—preserve encoding).
Checklist
- Implement SHA-1 (or use your language's crypto library).
- Implement zlib compress/decompress.
- Implement
hash_object(type, data) -> sha1_hexthat writes to the object store. - Implement
write_object(sha1, compressed_bytes)that creates the two-level directory path. - Handle the
-w(write) flag — by defaulthash-objectonly prints the hash;-wwrites it. - Implement reading a file from disk and calling
hash_object("blob", contents).
4. Reading Objects (git cat-file)
What it does
Reads a compressed object from .git/objects/, decompresses it, parses the header, and returns the type and content.
Precise algorithm
function read_object(sha1):
path = ".git/objects/" + sha1[0:2] + "/" + sha1[2:]
compressed = read_file(path)
raw = zlib_decompress(compressed)
null_pos = raw.index('\0')
header = raw[0:null_pos] # e.g. "blob 14"
content = raw[null_pos+1:] # everything after the null byte
type, size = header.split(' ')
assert len(content) == int(size)
return type, content
Checklist
- Implement
read_object(sha1) -> (type, content). - Support short SHA-1s by scanning the objects directory for a prefix match.
- Implement the
-tflag (print type),-sflag (print size),-pflag (pretty-print content). - Pretty-printing for blobs: print raw content as text.
- Pretty-printing for trees and commits: handled in later steps.
5. Tree Objects — Snapshots of Directories
What a tree is
A tree object is a binary-encoded listing of directory entries. Each entry describes one file or sub-directory.
Binary format (per entry)
<mode> <name>\0<20-bytes-raw-sha1>
Where:
modeis an ASCII string (not binary):100644for regular file,100755for executable,040000for directory,120000for symlink.nameis the filename (no path separators), ASCII/UTF-8.\0is a literal null byte.- The 20-byte SHA-1 is raw binary, not hex.
Entries are concatenated together with no separator between them, and sorted in a specific order (explained below).
Sorting rule (important!)
Git sorts tree entries as if directories have a trailing /. This means a directory named foo is sorted as foo/, which affects ordering when you have entries like foo (file) and foo-bar (file) and foo (directory).
function sort_key(entry):
if entry.type == "tree":
return entry.name + "/"
else:
return entry.name
Sort entries by sort_key before writing.
Parsing a tree
function parse_tree(data):
entries = []
i = 0
while i < len(data):
space_pos = data.index(' ', i)
mode = data[i:space_pos]
null_pos = data.index('\0', space_pos)
name = data[space_pos+1:null_pos]
sha1_raw = data[null_pos+1 : null_pos+21] # 20 raw bytes
sha1_hex = to_hex(sha1_raw)
entries.append((mode, name, sha1_hex))
i = null_pos + 21
return entries
Writing a tree
function write_tree(entries):
# entries: list of (mode_str, name, sha1_hex)
sorted_entries = sort_tree_entries(entries)
content = b""
for mode, name, sha1 in sorted_entries:
content += f"{mode} {name}\0".encode()
content += bytes.fromhex(sha1) # raw 20-byte SHA-1
return hash_object("tree", content)
Checklist
- Implement
parse_tree(data) -> list of (mode, name, sha1_hex). - Implement
write_tree(entries) -> sha1. - Implement sorting logic using the "directory gets trailing slash" rule.
- Handle the four modes: regular file (
100644), executable (100755), symlink (120000), directory (040000). - Pretty-print trees with
cat-file -p: show each entry as<mode> <type> <sha1>\t<name>.
6. Commit Objects
What a commit is
A commit is a plain-text object that points to a tree (the snapshot of all files at that point in time) and to zero or more parent commits.
Commit format
tree <tree-sha1>
parent <parent-sha1> ← zero or more parent lines
author <name> <email> <unix-timestamp> <timezone-offset>
committer <name> <email> <unix-timestamp> <timezone-offset>
gpgsig <optional-signature> ← skip for basic implementation
<blank line>
<commit message>
Example:
tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
parent a1b2c3d4e5f6...
author Jane Doe <jane@example.com> 1700000000 +0000
committer Jane Doe <jane@example.com> 1700000000 +0000
Initial commit
Parsing a commit
function parse_commit(data):
lines = data.split('\n')
headers = {}
i = 0
while lines[i] != '': # blank line separates headers from message
key, _, value = lines[i].partition(' ')
headers.setdefault(key, []).append(value)
i += 1
message = '\n'.join(lines[i+1:])
return headers, message
Writing a commit
function write_commit(tree_sha1, parent_sha1s, author, committer, message):
lines = [f"tree {tree_sha1}"]
for p in parent_sha1s:
lines.append(f"parent {p}")
lines.append(f"author {author}")
lines.append(f"committer {committer}")
lines.append("")
lines.append(message)
content = "\n".join(lines) + "\n"
return hash_object("commit", content.encode())
Author/Committer format
Name <email> <unix-timestamp-seconds> <+HHMM or -HHMM>
Example: Jane Doe <jane@example.com> 1700000000 +0000
Checklist
- Implement
parse_commit(data) -> (headers_dict, message). - Implement
write_commit(tree, parents, author, committer, message) -> sha1. - Support multiple
parentlines (for merge commits). - Read author name/email from
.git/configor environment variables (GIT_AUTHOR_NAME, etc.). - Use current Unix timestamp and local timezone offset.
- Pretty-print commit with
cat-file -p.
7. References — Branches & HEAD
What refs are
A reference (ref) is simply a named pointer to a SHA-1 hash, stored as a text file in .git/refs/. Branches live under .git/refs/heads/, tags under .git/refs/tags/.
File contents
A direct ref file contains just the 40-character hex SHA-1 followed by a newline:
a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2\n
HEAD
HEAD is a special ref that either:
- Symbolic ref: points to another ref:
ref: refs/heads/main\n - Detached HEAD: contains a direct SHA-1 (when you checked out a specific commit).
Reading HEAD
function read_head():
content = read_file(".git/HEAD").strip()
if content.startswith("ref: "):
return ("symbolic", content[5:]) # e.g. "refs/heads/main"
else:
return ("detached", content) # raw SHA-1
Resolving a ref to a commit SHA
function resolve_ref(ref_name):
# ref_name can be "HEAD", "main", "refs/heads/main", or a sha1
# Try direct SHA-1 first
if is_valid_sha1(ref_name):
return ref_name
# Try standard locations
for prefix in ["", "refs/", "refs/heads/", "refs/tags/", "refs/remotes/"]:
path = ".git/" + prefix + ref_name
if file_exists(path):
content = read_file(path).strip()
if content.startswith("ref: "): # packed symref
return resolve_ref(content[5:])
return content
raise Error(f"unknown ref: {ref_name}")
Packed refs
When there are many refs, Git stores them in .git/packed-refs for efficiency. Lines look like:
# pack-refs with: peeled fully-peeled sorted
a1b2c3... refs/heads/feature-x
^c4d5e6... # optional: peeled tag
Your resolver should fall back to checking packed-refs if a loose ref file isn't found.
Checklist
- Implement
read_ref(ref_name) -> sha1(resolve through symrefs). - Implement
write_ref(ref_name, sha1)(create/update a ref file). - Implement
read_head() -> (type, value). - Implement
write_head(symbolic_or_sha1). - Implement
list_refs(prefix) -> list of (name, sha1)(scanrefs/directory). - Parse
packed-refsand include those results in lookups.
8. The Index / Staging Area (git add)
What the index is
The index (also called the staging area or cache) is a binary file at .git/index that stores the state of the working tree that will be used for the next commit. It tracks every file, its mode, its SHA-1, and metadata used to efficiently detect changes.
Index file format (Version 2)
Header:
4 bytes: magic "DIRC"
4 bytes: version number (2)
4 bytes: number of entries (big-endian uint32)
Entries (one per staged file):
4 bytes: ctime seconds
4 bytes: ctime nanoseconds
4 bytes: mtime seconds
4 bytes: mtime nanoseconds
4 bytes: dev
4 bytes: ino
4 bytes: mode (e.g. 0x000081A4 = 0100644 octal)
4 bytes: uid
4 bytes: gid
4 bytes: file size
20 bytes: SHA-1 of blob
2 bytes: flags (top 4 bits: assume-valid, extended, stage; bottom 12: filename length)
variable: filename (null-terminated, padded to 8-byte boundary)
Footer:
20 bytes: SHA-1 of entire index content up to this point
Tip: For a minimal implementation, you can store the index as a simple JSON/TOML/binary structure of
{path: sha1_hex, mode: int}and convert to the real format later.
git add algorithm
function git_add(paths):
index = read_index()
for path in expand_paths(paths): # handle globs, directories recursively
if is_directory(path):
for file in walk_files(path):
add_file_to_index(index, file)
else:
add_file_to_index(index, path)
write_index(index)
function add_file_to_index(index, path):
contents = read_file(path)
sha1 = hash_object("blob", contents) # writes to object store
stat = os.stat(path)
index[path] = IndexEntry(sha1, stat.mode, stat.ctime, stat.mtime, ...)
Checklist
- Define an
IndexEntrydata structure. - Implement
read_index() -> dict of path -> IndexEntry(parse.git/index). - Implement
write_index(entries)(serialize and write.git/index). - Implement the index checksum (SHA-1 of all bytes before the trailing hash).
- Implement
git_add(paths)that hashes blobs and updates the index. - Handle removing files from the index (
git rm). - Handle
.gitignorepatterns when adding files.
9. Writing a Tree from the Index (git write-tree)
What it does
Converts the flat list of path -> sha1 entries in the index into a nested tree of tree objects (one per directory level).
Algorithm
function write_tree_from_index(index_entries):
# Build a nested dict representing the directory tree
tree = {}
for path, entry in index_entries.items():
parts = path.split('/')
node = tree
for part in parts[:-1]: # directories
node = node.setdefault(part, {})
filename = parts[-1]
node[filename] = entry # leaf: IndexEntry
return build_tree(tree)
function build_tree(node):
entries = []
for name, child in node.items():
if isinstance(child, dict): # sub-directory
subtree_sha1 = build_tree(child)
entries.append(("040000", name, subtree_sha1))
else: # blob
mode = format_mode(child.mode) # e.g. "100644"
entries.append((mode, name, child.sha1))
return write_tree(entries) # from step 5
Checklist
- Implement
build_nested_tree(index_entries) -> nested_dict. - Implement
build_tree(nested_dict) -> sha1(recursive). - Correctly format file modes from stat mode integers to Git mode strings.
10. Creating a Commit (git commit)
What it does
Takes the current index, creates a tree from it, and creates a commit object pointing to that tree (and to HEAD as the parent).
Algorithm
function git_commit(message):
index = read_index()
tree_sha1 = write_tree_from_index(index)
# Determine parent(s)
head_type, head_value = read_head()
if head_type == "symbolic":
current_branch_ref = head_value # e.g. "refs/heads/main"
try:
parent_sha1 = resolve_ref(current_branch_ref)
parents = [parent_sha1]
except:
parents = [] # first commit, no parent
else:
parents = [head_value] # detached HEAD
# Build author/committer strings
author = get_author_string() # "Name <email> timestamp tz"
committer = get_committer_string()
# Write commit object
commit_sha1 = write_commit(tree_sha1, parents, author, committer, message)
# Update branch ref or HEAD
if head_type == "symbolic":
write_ref(current_branch_ref, commit_sha1)
else:
write_head(commit_sha1) # update detached HEAD
return commit_sha1
Checklist
- Read config for
user.nameanduser.email. - Handle first commit (no parent).
- Handle detached HEAD.
- Update the branch ref after writing the commit.
- Support
--authorand--message/-mflags. - Support reading the commit message from
$EDITORwhen-mis not given.
11. Reading the Working Tree (git status)
What it does
Compares three states: (1) the last commit (HEAD), (2) the index (staging area), (3) the working directory. Reports differences as staged changes, unstaged changes, and untracked files.
Three-way comparison
HEAD tree ↔ Index → "staged changes" (what will be in next commit)
Index ↔ Disk → "unstaged changes" (what is modified but not yet staged)
Disk only → "untracked files"
Algorithm
function git_status():
index = read_index()
head_tree = read_tree_recursive(resolve_commit_tree(read_head()))
working_files = walk_all_files(".") # excluding .git/
staged_changes = []
for path in union(head_tree.keys(), index.keys()):
if path in index and path not in head_tree:
staged_changes.append(("new file", path))
elif path in head_tree and path not in index:
staged_changes.append(("deleted", path))
elif index[path].sha1 != head_tree[path].sha1:
staged_changes.append(("modified", path))
unstaged_changes = []
untracked = []
for path in working_files:
if is_gitignored(path): continue
if path not in index:
untracked.append(path)
else:
# Fast check: compare mtime before hashing
if mtime_changed(path, index[path]):
disk_sha1 = hash_object_no_write("blob", read_file(path))
if disk_sha1 != index[path].sha1:
unstaged_changes.append(("modified", path))
return staged_changes, unstaged_changes, untracked
Checklist
- Implement
read_tree_recursive(sha1) -> flat dict of path -> (mode, sha1). - Implement
walk_all_files(dir) -> list of relative paths(skip.git/). - Implement
.gitignorepattern matching (at minimum: exact names,*glob, leading/, trailing/). - Implement mtime-based fast-path to avoid hashing every file on every
statuscall. - Format and print the three sections of output.
12. Checkout & Restoring Files (git checkout)
What it does
Updates the working tree and index to reflect a given commit, branch, or file state.
Checking out a branch or commit
function git_checkout(target):
target_sha1 = resolve_ref(target)
commit = parse_commit(read_object(target_sha1)[1])
target_tree = commit.headers["tree"][0]
current_tree = read_tree_recursive(current_head_tree())
target_tree_flat = read_tree_recursive(target_tree)
# Remove files not in target
for path in current_tree:
if path not in target_tree_flat:
delete_file(path)
# Write files from target
for path, (mode, sha1) in target_tree_flat.items():
content = read_object(sha1)[1]
write_file(path, content)
set_file_mode(path, mode)
# Update index
new_index = build_index_from_tree(target_tree_flat)
write_index(new_index)
# Update HEAD
if target is a branch name:
write_head_symbolic("refs/heads/" + target)
else:
write_head(target_sha1) # detached HEAD
Checking out a specific file
function git_checkout_file(path, ref="HEAD"):
sha1 = resolve_ref(ref)
tree_flat = read_tree_recursive(get_commit_tree(sha1))
blob_sha1 = tree_flat[path].sha1
content = read_object(blob_sha1)[1]
write_file(path, content)
# update index entry too
Checklist
- Implement
read_tree_recursive(sha1) -> flat dict. - Implement the three-way file reconciliation (add, remove, update).
- Handle dirty working tree (warn or abort if uncommitted changes would be overwritten).
- Support checking out a specific file by path.
- Update HEAD to be a symbolic ref (branch checkout) or detached (commit/tag checkout).
- Set correct file permissions based on mode.
13. Diffing (git diff)
What it does
Shows the line-by-line differences between two versions of a file. Git uses the Myers diff algorithm by default.
Diff output format (unified diff)
diff --git a/file.txt b/file.txt
index abc1234..def5678 100644
--- a/file.txt
+++ b/file.txt
@@ -1,4 +1,5 @@
unchanged line
-removed line
+added line
another unchanged line
Myers diff algorithm (simplified)
function diff_lines(a_lines, b_lines):
# Returns list of operations: ('equal', line), ('insert', line), ('delete', line)
# Standard shortest-edit-script (SES) algorithm
n, m = len(a_lines), len(b_lines)
# V is an array indexed from -(n+m) to (n+m)
V = {}; V[1] = 0
trace = []
for d in range(0, n + m + 1):
snapshot = dict(V)
trace.append(snapshot)
for k in range(-d, d + 1, 2):
if k == -d or (k != d and V.get(k-1, -1) < V.get(k+1, -1)):
x = V.get(k+1, 0)
else:
x = V.get(k-1, 0) + 1
y = x - k
while x < n and y < m and a_lines[x] == b_lines[y]:
x += 1; y += 1
V[k] = x
if x >= n and y >= m:
return backtrack(trace, a_lines, b_lines)
Tip: For a first implementation, use your language's built-in diff library (Python's
difflib, for example) to produce a workingdiffcommand, then replace it with Myers later.
Checklist
- Implement (or wrap) a line-diff algorithm.
- Implement
git diff(index vs working tree). - Implement
git diff --staged(HEAD vs index). - Implement
git diff <commit1> <commit2>(two arbitrary commits). - Format output as unified diff with
@@ ... @@hunks. - Color output (red for deletions, green for additions) when outputting to a terminal.
14. Log (git log)
What it does
Walks the commit graph backwards from HEAD (or a given ref), printing each commit.
Algorithm
function git_log(start_ref="HEAD", limit=None):
sha1 = resolve_ref(start_ref)
visited = set()
queue = [sha1] # use a priority queue for merge commits (by timestamp)
count = 0
while queue:
current = queue.pop(0)
if current in visited: continue
visited.add(current)
type_, data = read_object(current)
commit = parse_commit(data)
print_commit(current, commit)
count += 1
if limit and count >= limit: break
for parent in commit.headers.get("parent", []):
queue.append(parent)
Output format
commit a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2
Author: Jane Doe <jane@example.com>
Date: Mon Nov 13 12:34:56 2023 +0000
Commit message here
Checklist
- Implement
git_log(start, n)that followsparentpointers. - Handle merge commits (two parents) — visit both.
- Implement
--onelineformat. - Implement
--graph(ASCII art branch graph) — complex, save for last. - Implement
--decorate(show branch/tag names next to commits). - Implement
-- <path>filtering (only show commits that touched a specific file).
15. Branching (git branch, git switch)
What a branch is
A branch is just a ref file in .git/refs/heads/ containing the SHA-1 of the tip commit. Creating a branch is cheap — it's just writing a file.
git branch
function git_branch_create(name, start_point="HEAD"):
sha1 = resolve_ref(start_point)
write_ref("refs/heads/" + name, sha1)
function git_branch_list():
current = read_head()
for ref_file in list_files(".git/refs/heads/"):
name = ref_file.name
sha1 = read_ref("refs/heads/" + name)
marker = "*" if current == name else " "
print(f" {marker} {name}")
function git_branch_delete(name):
path = ".git/refs/heads/" + name
if not file_exists(path):
raise Error(f"branch {name} not found")
delete_file(path)
git switch / git checkout <branch>
Alias for the checkout algorithm in step 12, switching to a named branch.
Checklist
- Implement
git branch <name>(create at HEAD). - Implement
git branch(list all branches, mark current with*). - Implement
git branch -d <name>(delete, warn if unmerged). - Implement
git switch <name>/git checkout <name>. - Implement
git switch -c <name>(create and switch).
16. Merging (git merge)
Types of merge
- Fast-forward: The current branch is an ancestor of the target. Just move the branch pointer forward.
- Three-way merge: Both branches have diverged. Find the common ancestor (merge base), then combine changes.
- Merge conflict: The same region of a file was changed on both sides.
Fast-forward detection
function is_ancestor(maybe_ancestor, descendant):
# Walk backwards from descendant; if we reach maybe_ancestor, it's an ancestor
visited = set()
queue = [descendant]
while queue:
sha1 = queue.pop()
if sha1 == maybe_ancestor: return True
if sha1 in visited: continue
visited.add(sha1)
commit = parse_commit(read_object(sha1)[1])
queue.extend(commit.headers.get("parent", []))
return False
Finding the merge base (Lowest Common Ancestor)
function find_merge_base(sha1_a, sha1_b):
ancestors_a = all_ancestors(sha1_a) # BFS, collect all reachable commits
queue = [sha1_b]
visited = set()
while queue:
sha1 = queue.pop(0)
if sha1 in ancestors_a:
return sha1 # first common ancestor found
if sha1 in visited: continue
visited.add(sha1)
commit = parse_commit(read_object(sha1)[1])
queue.extend(commit.headers.get("parent", []))
return None
Three-way merge algorithm
function three_way_merge(base_tree, ours_tree, theirs_tree):
all_paths = union(base_tree.keys(), ours_tree.keys(), theirs_tree.keys())
conflicts = []
for path in all_paths:
base = base_tree.get(path)
ours = ours_tree.get(path)
theirs = theirs_tree.get(path)
if ours == theirs:
result = ours # same on both sides, take either
elif base == ours:
result = theirs # we didn't change, take theirs
elif base == theirs:
result = ours # they didn't change, take ours
else:
# Both sides changed — attempt line-level merge
merged, ok = merge_file_content(base, ours, theirs)
if ok:
result = merged
else:
result = merged # write conflict markers
conflicts.append(path)
write_to_index_and_disk(path, result)
return conflicts
Conflict markers
<<<<<<< HEAD
our version of the line(s)
=======
their version of the line(s)
>>>>>>> branch-name
Checklist
- Implement
is_ancestor(a, b) -> bool. - Implement fast-forward merge (move branch pointer, update working tree).
- Implement
find_merge_base(a, b) -> sha1. - Implement three-way file merge with conflict markers.
- Write merge result to index (with stage entries for conflicts: stages 1, 2, 3).
- Write
MERGE_HEADfile with the SHA-1 of the branch being merged in. - After resolving conflicts,
git commitreadsMERGE_HEADto create a merge commit with two parents.
17. Pack Files & Compression (Advanced)
What pack files are
Instead of storing one object per file (which is slow for repos with millions of objects), Git packs objects into a single .pack file with a .idx index.
Pack file format (overview)
Header: "PACK" + version (4 bytes) + object count (4 bytes)
Objects: each object is stored as:
- Type + size (variable-length encoding)
- Compressed data (zlib) OR delta from another object
Footer: SHA-1 of pack content
Delta compression
Objects can be stored as deltas (diffs) against a base object. This dramatically reduces size for similar blobs (e.g. consecutive versions of a file).
Checklist
- Implement
git gcorgit pack-objectsto pack loose objects. - Implement pack index (
.idx) parsing and writing. - Implement reading objects from a pack file.
- Implement delta encoding/decoding (OBJ_OFS_DELTA and OBJ_REF_DELTA types).
Note: Pack file support is needed for cloning and fetching from remotes. You can defer this until step 18.
18. Remote Protocol Basics (git clone, git fetch, git push)
Remotes config
Stored in .git/config:
[remote "origin"]
url = https://github.com/user/repo.git
fetch = +refs/heads/*:refs/remotes/origin/*
Smart HTTP protocol (upload-pack / receive-pack)
Git communicates with servers using the pkt-line format:
# A pkt-line: 4-byte hex length prefix (includes the 4 bytes) + data
0015# service=git-upload-pack\n
0000 # flush packet (end of section)
git clone high-level flow
1. Send HTTP GET to /info/refs?service=git-upload-pack
2. Server responds with available refs (branches, tags) in pkt-line format
3. Client sends "want <sha1>" for desired refs + "done"
4. Server responds with a PACK file containing all required objects
5. Unpack the PACK file into .git/objects/
6. Write remote-tracking refs: .git/refs/remotes/origin/main = sha1
7. Checkout the default branch
git fetch and git push
- fetch: Same as clone step 1–4, but only download new objects.
- push: Send
receive-packrequest, upload a pack with new objects, update remote refs.
Checklist
- Implement pkt-line encoding/decoding.
- Implement HTTP transport: GET
/info/refs, POST/upload-pack. - Parse the initial ref advertisement from the server.
- Implement
git fetch: compute missing objects (negotiation), receive pack. - Implement
git clone: fetch + checkout. - Implement
git push: send pack + update ref commands toreceive-pack. - Store remote-tracking refs under
refs/remotes/<remote>/.
19. Test Suites
Test Suite 1: Core Object Model
The following tests cover steps 2–10 and can be written in any test framework.
# test_object_model.py (using Python + pytest as an example)
import os, shutil, tempfile, subprocess, hashlib, zlib
# ---- Fixtures ----
def setup_repo(tmp_path):
"""Initialize a fresh repo in tmp_path and return its path."""
os.chdir(tmp_path)
run("git init") # replace with your implementation's CLI call
return tmp_path
def run(cmd):
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
assert result.returncode == 0, result.stderr
return result.stdout.strip()
# ---- Test 1: init creates correct directory structure ----
def test_init_creates_structure(tmp_path):
setup_repo(tmp_path)
assert os.path.isdir(tmp_path / ".git")
assert os.path.isdir(tmp_path / ".git" / "objects")
assert os.path.isdir(tmp_path / ".git" / "refs" / "heads")
assert os.path.isfile(tmp_path / ".git" / "HEAD")
head = (tmp_path / ".git" / "HEAD").read_text()
assert head.startswith("ref: refs/heads/")
# ---- Test 2: hash-object produces correct SHA-1 ----
def test_hash_object_blob(tmp_path):
setup_repo(tmp_path)
content = b"hello world\n"
(tmp_path / "file.txt").write_bytes(content)
sha1 = run("mygit hash-object -w file.txt")
# Verify manually
header = f"blob {len(content)}\0".encode()
expected = hashlib.sha1(header + content).hexdigest()
assert sha1 == expected, f"Expected {expected}, got {sha1}"
# Verify object file exists
obj_path = tmp_path / ".git" / "objects" / sha1[:2] / sha1[2:]
assert obj_path.exists()
# Verify decompresses correctly
raw = zlib.decompress(obj_path.read_bytes())
assert raw == header + content
# ---- Test 3: cat-file reads back what hash-object wrote ----
def test_cat_file_blob(tmp_path):
setup_repo(tmp_path)
(tmp_path / "hello.txt").write_text("Hello, Git!\n")
sha1 = run("mygit hash-object -w hello.txt")
assert run(f"mygit cat-file -t {sha1}") == "blob"
assert run(f"mygit cat-file -p {sha1}") == "Hello, Git!"
# ---- Test 4: add + commit produces valid tree and commit objects ----
def test_add_and_commit(tmp_path):
setup_repo(tmp_path)
(tmp_path / "a.txt").write_text("file a\n")
(tmp_path / "b.txt").write_text("file b\n")
run("mygit add .")
commit_sha1 = run('mygit commit -m "initial"')
# HEAD should now point to a commit
head_sha1 = run("mygit rev-parse HEAD")
assert head_sha1 == commit_sha1
# Commit should have a tree
tree_sha1 = run(f"mygit cat-file -p {commit_sha1}").split('\n')[0].split(' ')[1]
tree_type = run(f"mygit cat-file -t {tree_sha1}")
assert tree_type == "tree"
# Tree should contain a.txt and b.txt
tree_content = run(f"mygit cat-file -p {tree_sha1}")
assert "a.txt" in tree_content
assert "b.txt" in tree_content
# ---- Test 5: second commit has parent ----
def test_second_commit_has_parent(tmp_path):
setup_repo(tmp_path)
(tmp_path / "a.txt").write_text("v1\n")
run("mygit add .")
c1 = run('mygit commit -m "commit 1"')
(tmp_path / "a.txt").write_text("v2\n")
run("mygit add .")
c2 = run('mygit commit -m "commit 2"')
commit_data = run(f"mygit cat-file -p {c2}")
assert f"parent {c1}" in commit_data
# ---- Test 6: tree sort order is correct ----
def test_tree_sort_order(tmp_path):
setup_repo(tmp_path)
os.mkdir(tmp_path / "foo")
(tmp_path / "foo" / "bar.txt").write_text("bar\n")
(tmp_path / "foo-baz.txt").write_text("baz\n") # "foo-baz" sorts before "foo/" in git
run("mygit add .")
run('mygit commit -m "test sort"')
head = run("mygit rev-parse HEAD")
commit_data = run(f"mygit cat-file -p {head}")
tree_sha1 = commit_data.split('\n')[0].split()[1]
tree_entries = run(f"mygit cat-file -p {tree_sha1}").split('\n')
names = [e.split('\t')[1] for e in tree_entries if e]
# Real git: "foo-baz.txt" comes before "foo" directory
assert names.index("foo-baz.txt") < names.index("foo")
# ---- Test 7: checkout restores files ----
def test_checkout_restores_file(tmp_path):
setup_repo(tmp_path)
(tmp_path / "readme.txt").write_text("version 1\n")
run("mygit add .")
c1 = run('mygit commit -m "v1"')
(tmp_path / "readme.txt").write_text("version 2\n")
run("mygit add .")
run('mygit commit -m "v2"')
run(f"mygit checkout {c1}")
assert (tmp_path / "readme.txt").read_text() == "version 1\n"
Test Suite 2: Branching & Merging
# test_branching_merging.py
# ---- Test 8: create and list branches ----
def test_branch_create_and_list(tmp_path):
setup_repo(tmp_path)
(tmp_path / "f.txt").write_text("x\n")
run("mygit add .")
run('mygit commit -m "init"')
run("mygit branch feature")
branches = run("mygit branch")
assert "feature" in branches
assert "main" in branches or "master" in branches
# ---- Test 9: fast-forward merge ----
def test_fast_forward_merge(tmp_path):
setup_repo(tmp_path)
(tmp_path / "base.txt").write_text("base\n")
run("mygit add ."); run('mygit commit -m "base"')
run("mygit branch feature")
run("mygit switch feature")
(tmp_path / "feature.txt").write_text("feature\n")
run("mygit add ."); run('mygit commit -m "feature work"')
feature_tip = run("mygit rev-parse HEAD")
run("mygit switch main")
run("mygit merge feature")
main_tip = run("mygit rev-parse HEAD")
assert main_tip == feature_tip # fast-forward: main tip = feature tip
assert (tmp_path / "feature.txt").exists()
# ---- Test 10: three-way merge, no conflict ----
def test_three_way_merge_no_conflict(tmp_path):
setup_repo(tmp_path)
(tmp_path / "shared.txt").write_text("line1\nline2\n")
(tmp_path / "main_only.txt").write_text("main\n")
run("mygit add ."); run('mygit commit -m "base"')
run("mygit branch feature")
# Commit on main
(tmp_path / "main_only.txt").write_text("main updated\n")
run("mygit add ."); run('mygit commit -m "main change"')
# Commit on feature
run("mygit switch feature")
(tmp_path / "feature_only.txt").write_text("feature\n")
run("mygit add ."); run('mygit commit -m "feature change"')
run("mygit switch main")
run("mygit merge feature")
assert (tmp_path / "feature_only.txt").exists()
assert (tmp_path / "main_only.txt").read_text() == "main updated\n"
# ---- Test 11: merge conflict is detected ----
def test_merge_conflict_detected(tmp_path):
setup_repo(tmp_path)
(tmp_path / "conflict.txt").write_text("original\n")
run("mygit add ."); run('mygit commit -m "base"')
run("mygit branch feature")
(tmp_path / "conflict.txt").write_text("main version\n")
run("mygit add ."); run('mygit commit -m "main edit"')
run("mygit switch feature")
(tmp_path / "conflict.txt").write_text("feature version\n")
run("mygit add ."); run('mygit commit -m "feature edit"')
run("mygit switch main")
result = subprocess.run("mygit merge feature", shell=True, capture_output=True, text=True)
assert result.returncode != 0 or "CONFLICT" in result.stdout + result.stderr
content = (tmp_path / "conflict.txt").read_text()
assert "<<<<<<" in content
assert "=======" in content
assert ">>>>>>>" in content
# ---- Test 12: log shows correct history ----
def test_log_history(tmp_path):
setup_repo(tmp_path)
messages = ["first commit", "second commit", "third commit"]
for i, msg in enumerate(messages):
(tmp_path / f"f{i}.txt").write_text(f"content {i}\n")
run("mygit add .")
run(f'mygit commit -m "{msg}"')
log = run("mygit log --oneline")
for msg in messages:
assert msg in log
# Most recent first
lines = [l for l in log.split('\n') if l]
assert "third commit" in lines[0]
assert "first commit" in lines[-1]
# ---- Test 13: status correctly categorizes changes ----
def test_status_categories(tmp_path):
setup_repo(tmp_path)
(tmp_path / "tracked.txt").write_text("v1\n")
run("mygit add ."); run('mygit commit -m "init"')
# Modify tracked file without staging
(tmp_path / "tracked.txt").write_text("v2\n")
# Stage a new file
(tmp_path / "staged_new.txt").write_text("new\n")
run("mygit add staged_new.txt")
# Untracked file
(tmp_path / "untracked.txt").write_text("untracked\n")
status = run("mygit status")
assert "staged_new.txt" in status # staged
assert "tracked.txt" in status # unstaged modification
assert "untracked.txt" in status # untracked
# ---- Test 14: gitignore works ----
def test_gitignore(tmp_path):
setup_repo(tmp_path)
(tmp_path / ".gitignore").write_text("*.log\nbuild/\n")
(tmp_path / "app.py").write_text("code\n")
(tmp_path / "debug.log").write_text("log\n")
os.mkdir(tmp_path / "build")
(tmp_path / "build" / "output.bin").write_bytes(b"\x00\x01")
status = run("mygit status")
assert "app.py" in status
assert "debug.log" not in status
assert "output.bin" not in status
20. Recommended Build Order
Follow this sequence to always have a working (partial) implementation at each stage:
Phase 1 — Core Storage
[x] Step 2: git init
[x] Step 3: hash-object (blobs)
[x] Step 4: cat-file
[x] Step 5: tree objects
[x] Step 6: commit objects (manually constructed)
[x] Step 7: refs + HEAD
Phase 2 — Staging & Committing
[x] Step 8: index / git add
[x] Step 9: write-tree from index
[x] Step 10: git commit (end-to-end)
Phase 3 — Inspection
[x] Step 11: git status
[x] Step 13: git diff
[x] Step 14: git log
Phase 4 — Branching & Navigation
[x] Step 12: git checkout / restore
[x] Step 15: git branch / switch
[x] Step 16: git merge
Phase 5 — Remotes & Optimization
[x] Step 17: pack files
[x] Step 18: remote protocol (clone, fetch, push)
Quick Reference: Key Data Formats
| Format | Example |
|---|---|
| Object header | blob 42\0 |
| Loose object path | .git/objects/ab/cdef1234... |
| Tree entry | 100644 file.txt\0<20-byte-sha1> |
| Ref file | a1b2c3...40hex...\n |
| Symref | ref: refs/heads/main\n |
| Author string | Jane <j@ex.com> 1700000000 +0000 |
| pkt-line | 0017# service=...\n |
Useful Resources
- Git Internals chapter of the Pro Git book: https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain
- Git pack format: https://git-scm.com/docs/pack-format
- Git wire protocol: https://git-scm.com/docs/http-protocol
- Gitlet (JavaScript reference impl): http://gitlet.maryrosecook.com/
- Write yourself a Git (Python walkthrough): https://wyag.thb.lt/
- Myers diff paper: "An O(ND) Difference Algorithm and Its Variations" — Eugene Myers, 1986