Thursday, January 22

deleting files from git history

Working on a project where we included some built files that took up a bunch of space, and decided we should get rid of those. The git repository isn’t public yet and is only shared by a handful of users, so it seemed worth thinking about rewriting the history a bit.

There’s reasonably good documentation for this in the usual places if you look, but I ran into some trouble.

First, what seemed to work: David Underhill has a good short script from back in 2009 for using git filter-branch to eliminate particular files from history:

I recently had a need to rewrite a git repository’s history. This isn’t generally a very good idea, though it is useful if your repository contains files it should not (such as unneeded large binary files or copyrighted material). I also am using it because I had a branch where I only wanted to merge a subset of files back into master (though there are probably better ways of doing this). Anyway, it is not very hard to rewrite history thanks to the excellent git-filter-branch tool which comes with git.

I’ll reproduce the script here, in the not-unlikely event that his writeup goes away:

#!/bin/bash
set -o errexit

# Author: David Underhill
# Script to permanently delete files/folders from your git repository.  To use 
# it, cd to your repository's root and then run the script with a list of paths
# you want to delete, e.g., git-delete-history path1 path2

if [ $# -eq 0 ]; then
    exit 0
fi

# make sure we're at the root of git repo
if [ ! -d .git ]; then
    echo "Error: must run this script from the root of a git repository"
    exit 1
fi

# remove all paths passed as arguments from the history of the repo
files=$@
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" HEAD

# remove the temporary history git-filter-branch otherwise leaves behind for a long time
rm -rf .git/refs/original/ && git reflog expire --all &&  git gc --aggressive --prune

A big thank you to Mr. Underhill for documenting this one. filter-branch seems really powerful, and not as brain-hurting as some things in git land. The docs are currently pretty good, and worth a read if you’re trying to solve this problem.

Lets you rewrite Git revision history by rewriting the branches mentioned in the <rev-list options>, applying custom filters on each revision. Those filters can modify each tree (e.g. removing a file or running a perl rewrite on all files) or information about each commit. Otherwise, all information (including original commit times or merge information) will be preserved.
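
A minimal sketch of how those pieces can fit together, under a few assumptions that are not from Underhill's script or the docs quoted above. The function name `purge_paths` is made up. Passing `-- --all` with `--tag-name-filter cat` rewrites every branch and tag rather than only HEAD. The `--expire=now` and `--prune=now` settings ask git to drop old objects immediately instead of waiting out its default grace periods.

```shell
#!/bin/sh
# purge_paths: hypothetical helper, not part of git or of Underhill's script.
# Removes the given paths from the history of every branch and tag, then
# drops the backup refs and expires unreferenced objects immediately.
purge_paths() {
    [ "$#" -gt 0 ] || { echo "usage: purge_paths path..." >&2; return 1; }

    # Assumption: paths contain no whitespace or shell metacharacters, because
    # filter-branch hands the --index-filter string to the shell for each commit.
    paths="$*"

    # FILTER_BRANCH_SQUELCH_WARNING silences the deprecation notice that newer
    # git versions print; older versions ignore it.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force --prune-empty \
        --index-filter "git rm -r --cached --ignore-unmatch --quiet $paths" \
        --tag-name-filter cat -- --all || return 1

    # filter-branch keeps the pre-rewrite refs under refs/original/; delete them.
    git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do
        git update-ref -d "$ref"
    done

    # Expire reflog entries and prune unreachable objects now, not after the
    # default grace periods.
    git reflog expire --expire=now --all &&
    git gc --prune=now --aggressive
}
```

Run it from the root of a throwaway clone first, e.g. `purge_paths build dist`, since there is no undo once the backup refs and reflog are gone.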

After this, things got muddier. The script seemed to work fine, and after running it I was able to see all the history I expected, minus some troublesome files. (A version with --prune-empty added to the git filter-branch invocation got rid of some empty commits.) But then:

brennen@exuberance 20:05:00 /home/brennen/code $  du -hs pi_bootstrap 
218M    pi_bootstrap
brennen@exuberance 20:05:33 /home/brennen/code $  du -hs experiment
199M    experiment

That second repo is a clone of the original with the script run against it. Why is it only 19 megabytes smaller when, minus the big binaries I zapped, it should come in somewhere under 10 megs?

I will spare you, dear reader, the contortions I went through arriving at a solution for this, partially because I don’t have the energy left to reconstruct them from the tattered history of my googling over the last few hours. What I figured out was that for some reason, a bunch of blobs were persisting in a pack file, despite not being referenced by any commits, and no matter what, I couldn’t get git gc or git repack to zap them.

I more or less got this far with commands like:

brennen@exuberance 20:49:10 /home/brennen/code/experiment2/.git (master) $  git count-objects -v
count: 0
size: 0
in-pack: 2886
packs: 1
size-pack: 202102
prune-packable: 0
garbage: 0
size-garbage: 0

And:

git verify-pack -v ./objects/pack/pack-b79fc6e30a547433df5c6a0c6212672c5e5aec5f > ~/what_the_fuck

…which gives a list of all the stuff in a pack file, including super-not-human-readable sizes that you can sort on, and many permutations of things like:

brennen@exuberance 20:49:12 /home/brennen/code/experiment2/.git (master) $  git log --pretty=oneline | cut -f1 -d' ' | xargs -L1 git cat-file -s | sort -nr | head
589
364
363
348
341
331
325
325
322
320

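…where cat-file is a bit of a Swiss army knife for looking at objects, with -s meaning “tell me a size”. (Fed commit hashes like this, it reports the size of each commit object itself, not of the files in that commit, which is why the numbers are so small.)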

(An aside: If you are writing software that outputs a size in bytes, blocks, etc., and you do not provide a “human readable” option to display this in comprehensible units, the innumerate among us quietly hate your guts. This is perhaps unjust of us, but I’m just trying to communicate my experience here.)
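
In that spirit, here is a small sketch for making the `git verify-pack -v` listing readable. It assumes the object lines have the type in the second field and the size in the third, and it skips the summary lines at the end. The name `largest_in_pack` is invented for illustration.

```shell
#!/bin/sh
# largest_in_pack: hypothetical helper. Reads `git verify-pack -v` output on
# stdin and prints the N largest objects (default 10) with readable sizes.
largest_in_pack() {
    # Keep only object lines (field 2 is the type), biggest size (field 3) first.
    awk '$2 == "blob" || $2 == "tree" || $2 == "commit" || $2 == "tag"' |
    sort -k3,3nr |
    head -n "${1:-10}" |
    awk '{
        size = $3
        n = split("B KiB MiB GiB TiB", unit, " ")
        i = 1
        # Divide down by 1024 until the number fits the next unit.
        while (size >= 1024 && i < n) { size /= 1024; i++ }
        printf "%8.1f %-3s  %-6s  %s\n", size, unit[i], $2, $1
    }'
}
```

Something like `git verify-pack -v .git/objects/pack/pack-*.idx | largest_in_pack 20` would then show the twenty biggest objects in units a person can read.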

And finally, Aristotle Pagaltzis’s script for figuring out which commit has a given blob (the answer is fucking none of them, in my case):

#!/bin/sh
obj_name="$1"
shift
git log "$@" --pretty=format:'%T %h %s' \
| while read tree commit subject ; do
    if git ls-tree -r $tree | grep -q "$obj_name" ; then
        echo $commit "$subject"
    fi
done

Also somewhere in there I learned how to use git bisect (which is really cool and likely something I will use again) and went through and made entirely certain there was nothing in the history with a bunch of big files in it.
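
For that last kind of check, a sketch that lists the largest blobs reachable from a set of revisions may be quicker than walking commits one at a time. It assumes a git new enough to support a format string for `cat-file --batch-check`, and `biggest_blobs` is an invented name.

```shell
#!/bin/sh
# biggest_blobs: hypothetical helper. Lists the ten largest blobs reachable
# from the given revisions (default: every ref) as "size-in-bytes path".
biggest_blobs() {
    [ "$#" -gt 0 ] || set -- --all

    # rev-list prints "<hash> <path>"; batch-check looks up each hash and
    # passes the path through as %(rest).
    git rev-list --objects "$@" |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { sub(/^blob /, ""); print }' |
    sort -k1,1nr |
    head -n 10
}
```

Comparing `biggest_blobs HEAD` with plain `biggest_blobs` shows whether the big files are gone from the current branch but still reachable from some other branch or tag.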

So eventually I got to thinking ok, there’s something here that is keeping these objects from getting expired or pruned or garbage collected or whatever, so how about doing a clone that just copies the stuff in the commits that still exist at this point. Which brings us to:

brennen@exuberance 19:03:08 /home/brennen/code/experiment2 (master) $  git help clone
brennen@exuberance 19:06:52 /home/brennen/code/experiment2 (master) $  cd ..
brennen@exuberance 19:06:55 /home/brennen/code $  git clone --no-local ./experiment2 ./experiment2_no_local
Cloning into './experiment2_no_local'...
remote: Counting objects: 2874, done.
remote: Compressing objects: 100% (1611/1611), done.
remote: Total 2874 (delta 938), reused 2869 (delta 936)
Receiving objects: 100% (2874/2874), 131.21 MiB | 37.48 MiB/s, done.
Resolving deltas: 100% (938/938), done.
Checking connectivity... done.
brennen@exuberance 19:07:15 /home/brennen/code $  du -hs ./experiment2_no_local
133M    ./experiment2_no_local
brennen@exuberance 19:07:20 /home/brennen/code $  git help clone
brennen@exuberance 19:08:34 /home/brennen/code $  git clone --no-local --single-branch ./experiment2 ./experiment2_no_local_single_branch
Cloning into './experiment2_no_local_single_branch'...
remote: Counting objects: 1555, done.
remote: Compressing objects: 100% (936/936), done.
remote: Total 1555 (delta 511), reused 1377 (delta 400)
Receiving objects: 100% (1555/1555), 1.63 MiB | 0 bytes/s, done.
Resolving deltas: 100% (511/511), done.
Checking connectivity... done.
brennen@exuberance 19:08:47 /home/brennen/code $  du -hs ./experiment2_no_local_single_branch
3.0M    ./experiment2_no_local_single_branch

What’s going on here? Well, git clone --no-local:

--local
-l

    When the repository to clone from is on a local machine, this flag
    bypasses the normal "Git aware" transport mechanism and clones the
    repository by making a copy of HEAD and everything under objects and
    refs directories. The files under .git/objects/ directory are
    hardlinked to save space when possible.

    If the repository is specified as a local path (e.g., /path/to/repo),
    this is the default, and --local is essentially a no-op. If the
    repository is specified as a URL, then this flag is ignored (and we
    never use the local optimizations). Specifying --no-local will override
    the default when /path/to/repo is given, using the regular Git
    transport instead.

And --single-branch:

--[no-]single-branch

    Clone only the history leading to the tip of a single branch, either
    specified by the --branch option or the primary branch remote’s HEAD
    points at. When creating a shallow clone with the --depth option, this
    is the default, unless --no-single-branch is given to fetch the
    histories near the tips of all branches. Further fetches into the
    resulting repository will only update the remote-tracking branch for
    the branch this option was used for the initial cloning. If the HEAD at
    the remote did not point at any branch when --single-branch clone was
    made, no remote-tracking branch is created.

I have no idea why --no-local by itself reduced the size but didn’t really do the job.
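
One plausible reading of those numbers, untested against this repository: the script above rewrites only HEAD, so any other branches or tags would keep pointing at the old commits, big files and all. A `--no-local` clone goes through the normal transport, which sends only objects reachable from the refs being cloned, and `--single-branch` narrows that to one branch's history. A sketch for checking this counts the objects reachable from each ref; `objects_per_ref` is an invented name.

```shell
#!/bin/sh
# objects_per_ref: hypothetical helper. For each branch and tag, prints how
# many objects are reachable from it, then the total reachable from all of them.
objects_per_ref() {
    git for-each-ref --format='%(refname)' refs/heads refs/tags |
    while read -r ref; do
        count=$(git rev-list --objects "$ref" | wc -l | tr -d ' ')
        printf '%6d  %s\n' "$count" "$ref"
    done

    total=$(git rev-list --objects --branches --tags | wc -l | tr -d ' ')
    printf '%6d  %s\n' "$total" "(all branches and tags)"
}
```

If one ref reaches far more objects than the rewritten branch does, that ref is a good candidate for what is keeping the old blobs alive.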

It’s possible the lingering blobs would have been garbage collected eventually. At any rate, it seems likely that pushing to a remote repository would have bypassed whatever lazy local file copy operation was causing everything to persist on cloning, rendering all this head-scratching entirely pointless. But then, who knows. At least I understand git file structure a little better than I did before.

For good measure, I just remembered how old much of the software on this machine is, and I feel like kind of an ass:

brennen@exuberance 21:20:50 /home/brennen/code $  git --version
git version 1.9.1

This is totally an old release. If there’s a bug here, maybe it’s fixed by now. I will not venture a strong opinion as to whether there is a bug. Maybe this is entirely expected behavior. It is time to drink a beer.

postscript: on finding bugs

The first thing you learn, by way of considerable personal frustration and embarrassment, goes something like this:

Q: My stuff isn’t working. I think there is probably a bug in this mature and widely-used (programming language | library | utility software).

A: Shut up shut up shut up shut up there is not a bug. Now go and figure out what is wrong with your code.

The second thing goes something like this:

Oh. I guess that’s actually a bug.

Which is to say: I have learned that I’m probably wrong, but sometimes I’m also wrong about being wrong.