Thursday, January 22
deleting files from git history
Working on a project where we included some built files that took up a bunch of space, and decided we should get rid of those. The git repository isn’t public yet and is only shared by a handful of users, so it seemed worth thinking about rewriting the history a bit.
There’s reasonably good documentation for this in the usual places if you look, but I ran into some trouble.
First, what seemed to work: David Underhill has a good short script from back in 2009 for using git filter-branch to eliminate particular files from history. As he explains it:

> I recently had a need to rewrite a git repository’s history. This isn’t generally a very good idea, though it is useful if your repository contains files it should not (such as unneeded large binary files or copyrighted material). I also am using it because I had a branch where I only wanted to merge a subset of files back into master (though there are probably better ways of doing this). Anyway, it is not very hard to rewrite history thanks to the excellent git-filter-branch tool which comes with git.
I’ll reproduce the script here, in the not-unlikely event that his writeup goes away:
```bash
#!/bin/bash
set -o errexit

# Author: David Underhill
# Script to permanently delete files/folders from your git repository.  To use
# it, cd to your repository's root and then run the script with a list of paths
# you want to delete, e.g., git-delete-history path1 path2

if [ $# -eq 0 ]; then
    exit 0
fi

# make sure we're at the root of git repo
if [ ! -d .git ]; then
    echo "Error: must run this script from the root of a git repository"
    exit 1
fi

# remove all paths passed as arguments from the history of the repo
files=$@
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" HEAD

# remove the temporary history git-filter-branch otherwise leaves behind for a long time
rm -rf .git/refs/original/ && git reflog expire --all && git gc --aggressive --prune
```
A big thank you to Mr. Underhill for documenting this one.
git filter-branch seems really powerful, and not as brain-hurting as some things in git land. The docs are currently pretty good, and worth a read if you’re trying to solve this problem:
> Lets you rewrite Git revision history by rewriting the branches mentioned in the &lt;rev-list options&gt;, applying custom filters on each revision. Those filters can modify each tree (e.g. removing a file or running a perl rewrite on all files) or information about each commit. Otherwise, all information (including original commit times or merge information) will be preserved.
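To see the index-filter approach in miniature, here's a toy run in a throwaway repository — file names made up, not the repo from this post. (The FILTER_BRANCH_SQUELCH_WARNING bit just skips the deprecation warning that newer gits print; older versions ignore it.)

```shell
set -e
dir=$(mktemp -d) && cd "$dir" && git init -q
git config user.email me@example.com && git config user.name me
echo keep > keep.txt && git add keep.txt && git commit -qm one
echo unwanted > big.bin && git add big.bin && git commit -qm two
# rewrite every commit, dropping big.bin from each tree
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --index-filter \
  'git rm -rf --cached --ignore-unmatch big.bin' HEAD
git log --stat --oneline   # big.bin no longer appears anywhere in history
```

--ignore-unmatch is what lets the filter succeed on commits that never contained the file in the first place.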
After this, things got muddier. The script seemed to work fine, and after running it I was able to see all the history I expected, minus some troublesome files. (A version with --prune-empty added to the git filter-branch invocation got rid of some empty commits.) But then:
```
brennen@exuberance 20:05:00 /home/brennen/code $  du -hs pi_bootstrap
218M	pi_bootstrap
brennen@exuberance 20:05:33 /home/brennen/code $  du -hs experiment
199M	experiment
```
That second repo is a clone of the original with the script run against it. Why is it only tens of megabytes smaller when, minus the big binaries I zapped, it should come in somewhere under 10 megs?
I will spare you, dear reader, the contortions I went through arriving at a solution for this, partially because I don’t have the energy left to reconstruct them from the tattered history of my googling over the last few hours. What I figured out was that for some reason, a bunch of blobs were persisting in a pack file, despite not being referenced by any commits, and no matter what I couldn’t get git gc or git repack to zap them.
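One tool that will at least enumerate those orphans directly is git fsck. A toy repro in a scratch repository (made-up content, not the repo in question):

```shell
set -e
dir=$(mktemp -d) && cd "$dir" && git init -q
git config user.email me@example.com && git config user.name me
git commit -qm initial --allow-empty
# write a blob into the object store without referencing it from any commit
echo stray-data | git hash-object -w --stdin
git fsck --unreachable   # reports: unreachable blob <sha>
```

Anything fsck lists as unreachable is exactly the category of object that no ref or commit can see but that still takes up space on disk.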
I more or less got this far with commands like:
```
brennen@exuberance 20:49:10 /home/brennen/code/experiment2/.git (master) $  git count-objects -v
count: 0
size: 0
in-pack: 2886
packs: 1
size-pack: 202102
prune-packable: 0
garbage: 0
size-garbage: 0
```
```
git verify-pack -v ./objects/pack/pack-b79fc6e30a547433df5c6a0c6212672c5e5aec5f > ~/what_the_fuck
```
…which gives a list of all the stuff in a pack file, including super-not-human-readable sizes that you can sort on, and many permutations of things like:
brennen@exuberance 20:49:12 /home/brennen/code/experiment2/.git (master) $ git log --pretty=oneline | cut -f1 -d' ' | xargs -L1 git cat-file -s | sort -nr | head 589 364 363 348 341 331 325 325 322 320
cat-file is a bit of a Swiss army knife for looking at objects, with -s meaning “tell me a size”.
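For what it’s worth, the verify-pack output can be made directly sortable: the third column of each object line is the uncompressed size, so a grep/sort/head pipeline surfaces the biggest blobs. A toy version (scratch repo; in the real case you’d point it at the existing pack file, as above):

```shell
set -e
dir=$(mktemp -d) && cd "$dir" && git init -q
git config user.email me@example.com && git config user.name me
head -c 100000 /dev/zero > big.bin && echo small > small.txt
git add . && git commit -qm demo
git gc --quiet   # pack the objects so there's a pack file to inspect
git verify-pack -v .git/objects/pack/pack-*.idx \
  | grep ' blob ' | sort -k3 -nr | head   # largest blobs first
```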
(An aside: If you are writing software that outputs a size in bytes, blocks, etc., and you do not provide a “human readable” option to display this in comprehensible units, the innumerate among us quietly hate your guts. This is perhaps unjust of us, but I’m just trying to communicate my experience here.)
And finally, Aristotle Pagaltzis’s script for figuring out which commit has a given blob (the answer is fucking none of them, in my case):
```sh
#!/bin/sh
obj_name="$1"
shift
git log "$@" --pretty=format:'%T %h %s' \
| while read tree commit subject ; do
    if git ls-tree -r $tree | grep -q "$obj_name" ; then
        echo $commit "$subject"
    fi
done
```
Also somewhere in there I learned how to use git bisect (which is really cool and likely something I will use again) and went through and made entirely certain there was nothing left in the history with a bunch of big files in it.
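The shape of git bisect, for the record: you give it a known-good commit, a known-bad commit, and a test, and it binary-searches the history in between. A toy run in a scratch repository (names made up), using git bisect run to find which commit first added a file:

```shell
set -e
dir=$(mktemp -d) && cd "$dir" && git init -q
git config user.email me@example.com && git config user.name me
git commit -qm one --allow-empty
good=$(git rev-parse HEAD)
echo unwanted > big.bin && git add big.bin && git commit -qm two
git commit -qm three --allow-empty
git bisect start HEAD "$good"
# run the test at each step; exit 0 means "good", nonzero means "bad"
git bisect run sh -c 'test ! -f big.bin'
git bisect reset
```

When it finishes, bisect names the first bad commit — here, the one that introduced big.bin.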
So eventually I got to thinking ok, there’s something here that is keeping these objects from getting expired or pruned or garbage collected or whatever, so how about doing a clone that just copies the stuff in the commits that still exist at this point. Which brings us to:
```
brennen@exuberance 19:03:08 /home/brennen/code/experiment2 (master) $  git help clone
brennen@exuberance 19:06:52 /home/brennen/code/experiment2 (master) $  cd ..
brennen@exuberance 19:06:55 /home/brennen/code $  git clone --no-local ./experiment2 ./experiment2_no_local
Cloning into './experiment2_no_local'...
remote: Counting objects: 2874, done.
remote: Compressing objects: 100% (1611/1611), done.
remote: Total 2874 (delta 938), reused 2869 (delta 936)
Receiving objects: 100% (2874/2874), 131.21 MiB | 37.48 MiB/s, done.
Resolving deltas: 100% (938/938), done.
Checking connectivity... done.
brennen@exuberance 19:07:15 /home/brennen/code $  du -hs ./experiment2_no_local
133M	./experiment2_no_local
brennen@exuberance 19:07:20 /home/brennen/code $  git help clone
brennen@exuberance 19:08:34 /home/brennen/code $  git clone --no-local --single-branch ./experiment2 ./experiment2_no_local_single_branch
Cloning into './experiment2_no_local_single_branch'...
remote: Counting objects: 1555, done.
remote: Compressing objects: 100% (936/936), done.
remote: Total 1555 (delta 511), reused 1377 (delta 400)
Receiving objects: 100% (1555/1555), 1.63 MiB | 0 bytes/s, done.
Resolving deltas: 100% (511/511), done.
Checking connectivity... done.
brennen@exuberance 19:08:47 /home/brennen/code $  du -hs ./experiment2_no_local_single_branch
3.0M	./experiment2_no_local_single_branch
```
What’s going on here? Well, git clone --no-local:
> --local, -l
>
> When the repository to clone from is on a local machine, this flag bypasses the normal "Git aware" transport mechanism and clones the repository by making a copy of HEAD and everything under objects and refs directories. The files under .git/objects/ directory are hardlinked to save space when possible.
>
> If the repository is specified as a local path (e.g., /path/to/repo), this is the default, and --local is essentially a no-op. If the repository is specified as a URL, then this flag is ignored (and we never use the local optimizations). Specifying --no-local will override the default when /path/to/repo is given, using the regular Git transport instead.
> --[no-]single-branch
>
> Clone only the history leading to the tip of a single branch, either specified by the --branch option or the primary branch remote’s HEAD points at. When creating a shallow clone with the --depth option, this is the default, unless --no-single-branch is given to fetch the histories near the tips of all branches. Further fetches into the resulting repository will only update the remote-tracking branch for the branch this option was used for the initial cloning. If the HEAD at the remote did not point at any branch when --single-branch clone was made, no remote-tracking branch is created.
I have no idea why --no-local by itself reduced the size but didn’t really do the whole job.
It’s possible the lingering blobs would have been garbage collected eventually, and at any rate it seems likely that in pushing them to a remote repository I would have bypassed whatever lazy local file copy operation was causing everything to persist on cloning, thus rendering all this head-scratching entirely pointless, but then who knows. At least I understand git file structure a little better than I did before.
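That lazy-local-copy hypothesis is easy to check in miniature, for what it’s worth: a plain local-path clone copies (or hardlinks) the whole object store, unreachable loose objects included, while --no-local goes through the regular transport, which only sends reachable objects. A toy sketch in a scratch directory:

```shell
set -e
dir=$(mktemp -d) && cd "$dir" && git init -q src && cd src
git config user.email me@example.com && git config user.name me
git commit -qm keep --allow-empty
echo unwanted > big.bin && git add big.bin && git commit -qm oops
sha=$(git rev-parse HEAD:big.bin)
git reset -q --hard HEAD^              # big.bin's blob is now unreachable
cd ..
git clone -q src local-copy            # fast local path
git clone -q --no-local src transported   # regular transport
git -C local-copy cat-file -e "$sha" && echo "blob survived local clone"
git -C transported cat-file -e "$sha" || echo "blob gone via --no-local"
```

cat-file -e exits zero only if the object exists, so the unreachable blob shows up in the local-path clone but not in the transported one.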
For good measure, I just remembered how old much of the software on this machine is, and I feel like kind of an ass:
```
brennen@exuberance 21:20:50 /home/brennen/code $  git --version
git version 1.9.1
```
This is totally an old release. If there’s a bug here, maybe it’s fixed by now. I will not venture a strong opinion as to whether there is a bug. Maybe this is entirely expected behavior. It is time to drink a beer.
postscript: on finding bugs
The first thing you learn, by way of considerable personal frustration and embarrassment, goes something like this:
Q: My stuff isn’t working. I think there is probably a bug in this mature and widely-used (programming language | library | utility software).
A: Shut up shut up shut up shut up there is not a bug. Now go and figure out what is wrong with your code.
The second thing goes something like this:
Oh. I guess that’s actually a bug.
Which is to say: I have learned that I’m probably wrong, but sometimes I’m also wrong about being wrong.