Repairing and recovering broken git repositories
Posted on Tue 08 December 2015 in Troubleshooting
Whether it's filesystem corruption due to a power outage, an overactive virus scanner, or a simple slip of the keyboard, it is not uncommon to hear about corruption inside the .git directory. It is much rarer to hear about such corruption being caused by git. I personally have never seen it, and it would surely be considered a critical bug if it were to happen.
So, what can we remove while still having something to recover? Well, pretty much everything except the objects directory. And even if you remove files from there, all other objects will be recoverable.
Make backups and try in a copy first
Your repository is already broken, so don't break it any further. First make sure nobody can access it except you, then make a backup (tar, rsync) of the repository, and try every command in a copy of the repository first.
$ tar zcvf myrepo.tar.gz myrepo
$ rsync -av myrepo/ myrepo-copy/
$ cd myrepo-copy/
All the files in .git are gone!
One of the more interesting cases of corruption I've seen is someone losing all the files directly inside the .git directory, while the subdirectories and the files inside them were left intact. We never did find out how it happened, but it was surprisingly easy to fix.
The gitrepository-layout manpage can tell you which files git expects to exist. Below you will find out how to restore them when they've gone missing.
HEAD
When .git/HEAD is gone, git doesn't even think your repository is a repository. So really, we must fix this first or else we will not be able to use any git commands to salvage the rest.
$ rm .git/HEAD
$ git status
fatal: Not a git repository (or any of the parent directories): .git
This is one of the very few times where touching files inside .git is OK. If you know which branch you had checked out, you can simply put that information inside .git/HEAD. I had the master branch checked out before deleting the HEAD file.
$ echo 'ref: refs/heads/master' > .git/HEAD
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean
If you don't know which branch (or even commit in detached HEAD state) you had checked out, try a few. If you picked the wrong one, git status (or git diff HEAD) will tell you that there are many uncommitted changes.
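Trying branches one by one gets tedious when there are many of them. Here is a rough sketch of how the guessing could be scripted. The function name rank_head_candidates is made up for this example, and it assumes that .git/HEAD already contains some valid guess (so git recognizes the repository) and that the index is still intact. It compares the worktree against every local branch and lists the branches with the fewest differing files first.

```shell
#!/bin/sh
# Rank local branches by how many files differ between the worktree and
# the tip of each branch. The branch you had checked out should end up
# at (or near) the top of the list.
# Assumption: .git/HEAD exists and the index is intact.
rank_head_candidates() {
    git for-each-ref --format '%(refname:short)' refs/heads |
    while IFS= read -r branch; do
        # Count the files that differ between the worktree and this branch.
        changed=$(git diff --name-only "$branch" -- | wc -l)
        # $((...)) strips the padding some wc implementations add.
        printf '%s %s\n' "$((changed))" "$branch"
    done | sort -n
}
```

Calling rank_head_candidates from the top of the worktree prints one line per branch, and the first line is the most likely candidate to put in .git/HEAD. If you had uncommitted changes, the count for the right branch will not be zero, but it should still be the smallest.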
index
Should you misplace your index, git thinks that all your files have been deleted from the repository with git rm --cached.
$ rm .git/index
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes to be committed:
(use "git reset HEAD <file>..." to unstage)
deleted: .gitignore
deleted: docs/index.rst
deleted: setup.py
Untracked files:
(use "git add <file>..." to include in what will be committed)
.gitignore
docs/
setup.py
To rebuild the index, you can simply do git reset --mixed. This makes the index look like the last commit while leaving the worktree alone. If you had any local changes that you had already added with git add but did not commit yet, you will need to re-add those.
$ git reset --mixed
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean
config
When the configuration is gone, you can't really get it back. But you can set the default configuration variables and re-add your remotes. This should get you into a workable state. So let's first do the default configuration.
$ git init
Git's init command will create a configuration if none exists. It will not wipe any objects, so it's safe to run in an existing repository. If your repository was a shared repository, you will need to tell git that manually though, using git config core.sharedRepository true.
With the configuration in place, we can re-add our remotes. My corrupt repository has one remote that lives on GitHub, so I'll add it.
$ git remote add origin git@github.com:seveas/whelk.git
$ git fetch
Are we done? Almost! git branch -vva will tell you that your branches are no longer tracking their remote counterparts. If you have only a master branch, a simple git branch -u origin/master master is enough to set up tracking. If you have many branches, you will want to script this.
for head in $(git for-each-ref --format '%(refname:short)' refs/heads); do
    if git rev-parse -q --verify "origin/$head" >/dev/null; then
        git branch -u "origin/$head" "$head"
    fi
done
packed-refs
If the packed-refs file is gone, you might have lost an awful lot of refs. Try a git fetch to see if some of them come back (tags and remote refs). For local refs, see the recipe below that discusses losing the refs directory.
A folder in .git is gone!
Not even two weeks after the case of the missing files, a user popped into #git who had lost everything except .git/objects/ (seriously, how do people do this?!). We managed to recreate everything else, which was of course made easier because he had only a single remote and a single branch. But it just goes to show that you can lose a lot of things and still keep git happy.
In this case it is important though to recover things in the correct order. The order we used was:
- HEAD
- refs/
- HEAD again and index
- config
refs
The refs directory contains all your branches, tags and other refs, except for the ones stored in .git/packed-refs. When you lose refs, there are a few strategies to get them back. The simplest one is to fetch from a remote repository (if you have any). This will bring back refs in refs/remotes, and tags that the remote has. If you lost all refs, you will first need to manually mkdir -p .git/refs/heads to get git to recognize the repository at all.
For local refs, there are two locations where you can recover the previous values of refs: the reflog and the output of fsck. If you still have your reflogs, you will find the correct value of a ref on the last line of its reflog, in the second column. Here's an example of recovering the master branch:
$ tail -n1 .git/logs/refs/heads/master
54bc41416c5d3ecb978acb0df80d57aa3e54494c 2c78628255b8cc7f0b5d47981acea138db8716d2 Dennis Kaarsemaker <dennis@kaarsemaker.net> 1446765968 +0100 merge upstream/master: Fast-forward
$ git update-ref refs/heads/master 2c78628255b8cc7f0b5d47981acea138db8716d2
The reflog in .git/logs/HEAD can show you which branch you had last checked out. This can help you update the HEAD ref.
$ tail -n1 .git/logs/HEAD
7f79f6a992b11aaaf2592075346d83b1ba0f4ff8 a5e28dbe709a544f51b9c44752e14e5cd007a815 Dennis Kaarsemaker <dennis@kaarsemaker.net> 1448810920 +0100 checkout: moving from 7f79f6a992b11aaaf2592075346d83b1ba0f4ff8 to master
$ git symbolic-ref HEAD refs/heads/master
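With more than a handful of branches, you will not want to tail every reflog by hand. The following is a minimal sketch of how that could be automated, not a polished tool. The function name restore_branches_from_reflogs is invented for this example, and it assumes you run it from the top of the worktree, that .git/HEAD exists, and that the reflogs under .git/logs/refs/heads survived. Refs that still exist are left alone.

```shell
#!/bin/sh
# Recreate every missing local branch from the last line of its reflog.
# Assumptions: run from the top of the worktree, .git/HEAD exists and
# the reflogs in .git/logs/refs/heads are intact.
restore_branches_from_reflogs() {
    # Git needs the refs directory to recognize the repository at all.
    mkdir -p .git/refs/heads
    find .git/logs/refs/heads -type f | while IFS= read -r log; do
        # .git/logs/refs/heads/foo is the reflog of refs/heads/foo.
        ref=${log#.git/logs/}
        # Leave refs that still exist untouched.
        git rev-parse -q --verify "$ref" >/dev/null && continue
        # The second field of a reflog line is the new value of the ref.
        new=$(tail -n1 "$log" | cut -d' ' -f2)
        # Only restore the ref if that commit is still in the object store.
        if git cat-file -e "$new^{commit}" 2>/dev/null; then
            git update-ref "$ref" "$new"
            echo "restored $ref -> $new"
        fi
    done
}
```

Afterwards, git log --graph --all --oneline --decorate is a quick way to check that the restored branches point where you expect them to.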
If you do not have any reflogs, you can still recover refs by looking at your commit objects. If a commit has no descendants, it could be at the tip of a branch, so a ref should point to it. It could also be a commit that was amended, rebased or simply discarded, so this method may give you some false positives to sort through. So how do you find commits without descendants? Fortunately you don't need to do this manually: git fsck is here to help.
Let's break a simple repository to show it.
$ git clone https://github.com/seveas/whelk.git
[output omitted]
$ cd whelk/
$ rm .git/packed-refs .git/refs/heads/master
$ git fsck
notice: HEAD points to an unborn branch (master)
Checking object directories: 100% (256/256), done.
Checking objects: 100% (589/589), done.
error: refs/remotes/origin/HEAD: invalid sha1 pointer 0000000000000000000000000000000000000000
notice: No default references
dangling tag 92d0fe18f9a55177d955edf58048b49db7987d5b
dangling commit aa7856977e80d11833e97b4151f400a516316179
dangling commit 16e449da82ec8bb51aed56c0c4c05473442db90a
dangling commit 864c345397fcb3bdb902402e17148e19b3f263a8
dangling tag be9471e1263a78fd765d4c72925c0425c90d3d64
These dangling commits are the tips of the branches. But which one is which? There's no way to know without looking, so let's create some temporary branches and have a look:
$ git update-ref refs/heads/recovery-1 aa7856977e80d11833e97b4151f400a516316179
$ git update-ref refs/heads/recovery-2 16e449da82ec8bb51aed56c0c4c05473442db90a
$ git update-ref refs/heads/recovery-3 864c345397fcb3bdb902402e17148e19b3f263a8
$ git log --graph --all --oneline --decorate
In the resulting log, you'll see where these temporary branches point, and you can use git branch -m to give them their correct names back. And those dangling tags? Those are tag objects that you can now recover, the tag object even has the tag name in it!
$ git cat-file tag be9471e1263a78fd765d4c72925c0425c90d3d64
object 34555e0e3315f60ca5810562a36269187c2ced46
type commit
tag 2.5
tagger Dennis Kaarsemaker <dennis@kaarsemaker.net> 1428783307 +0200
Version 2.5
$ git update-ref refs/tags/2.5 be9471e1263a78fd765d4c72925c0425c90d3d64
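If fsck reports dozens of dangling objects, typing these update-ref commands gets old quickly. Below is a sketch of how the two recipes above might be combined in one loop. The function name recover_dangling and the recovery-N branch names are placeholders of my own choosing, so pick a different prefix if you already have branches named like that. I pass --no-reflogs to fsck so that commits which are only referenced by a leftover reflog are listed as well, and tags whose name is already taken are skipped.

```shell
#!/bin/sh
# Give every dangling commit a temporary branch and give every dangling
# annotated tag its own name back.
# Assumption: no existing branches are named recovery-1, recovery-2, ...
recover_dangling() {
    n=0
    # Collect the fsck report first, so no refs change while it runs.
    report=$(git fsck --no-reflogs 2>/dev/null || true)
    printf '%s\n' "$report" | while read -r state type sha; do
        [ "$state" = dangling ] || continue
        case $type in
        commit)
            n=$((n + 1))
            git update-ref "refs/heads/recovery-$n" "$sha"
            ;;
        tag)
            # The header of the tag object carries the original tag name.
            name=$(git cat-file tag "$sha" | awk '$1 == "tag" { print $2; exit }')
            [ -n "$name" ] || continue
            # Do not overwrite a tag that still exists under that name.
            git rev-parse -q --verify "refs/tags/$name" >/dev/null ||
                git update-ref "refs/tags/$name" "$sha"
            ;;
        esac
    done
}
```

The temporary branches still need the same manual inspection and git branch -m treatment as before. The script only saves you the typing.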
logs
If the reflogs are gone, they cannot be recovered. Fortunately, these logs aren't necessary for the normal operation of git and losing them only makes recovering refs harder.
objects
If the objects directory is gone, it's time to give up. This is where your data lives, and with it gone, what's left is useless. If you still have your worktree, you can use it to start a new repository. Delete the .git directory and git init to start over.
If the directory is not gone completely, but you have some corrupt or missing objects, see below for tips and tricks on how to recover from this.
info
The info/ directory is mostly useless these days: apart from the local ignore patterns in info/exclude, which you will have to re-enter, it is only used for the obsolete dumb http protocol. If you still use this protocol and lost the info/ directory of the repository that is being pulled from, you can recreate it with git update-server-info.
modules
If the modules directory is gone, git can get quite upset. To fix this, move the submodules' worktrees out of the way (or delete them if you're sure you have no changes) and simply run git submodule update to reclone them. Then put your worktrees back if you had local changes, and you can commit those.
worktrees
A feature still under heavy development is support for multiple worktrees for a single repository. Information about these worktrees is stored in the worktrees directory. For each worktree, there is a separate directory containing at least HEAD, index, logs/HEAD, gitdir and commondir.
HEAD, index and logs/HEAD can be recovered as above. gitdir should contain the path to the .git file inside the separate worktree and commondir should contain the path to the original .git dir of the repository, usually ../..
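As a rough illustration, the gitdir and commondir files could be recreated like this. The function name restore_worktree_admin is hypothetical, and the sketch assumes that the .git file inside the separate worktree survived and contains an absolute path, since that file tells us where the matching directory under worktrees/ belongs. HEAD and index still have to be recovered as described earlier.

```shell
#!/bin/sh
# Recreate gitdir and commondir for one linked worktree.
# Usage: restore_worktree_admin /path/to/linked/worktree
# Assumption: the worktree's .git file survived and holds an absolute path.
restore_worktree_admin() {
    wt=$(cd "$1" && pwd) || return 1
    # The worktree's .git file reads "gitdir: <directory under worktrees/>".
    admin=$(sed -n 's/^gitdir: //p' "$wt/.git")
    if [ -z "$admin" ]; then
        echo "no gitdir line found in $wt/.git" >&2
        return 1
    fi
    mkdir -p "$admin"
    # gitdir points back at the .git file inside the linked worktree.
    printf '%s\n' "$wt/.git" > "$admin/gitdir"
    # commondir is the main .git directory, relative to this directory.
    printf '%s\n' '../..' > "$admin/commondir"
}
```

If the .git file in the worktree is gone as well, you will have to work out the directory name under worktrees/ yourself and recreate that file by hand before this helps.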
Object corruption
The worst kind of corruption in a git repository is corrupt or missing objects. Corrupt objects are incredibly tricky to recover if you do not have a copy of them, so we will focus on restoring missing objects from another copy of the repository so any local-only work is not lost.
So let's first find out which objects are corrupt and remove them (you did read the first section of this article, saying to try this first in a copy of the repository, right?).
$ git fsck --full
error: inflate: data stream error (incorrect header check)
error: unable to unpack 27c221b1620b8414de002b00aa990fd8e0d768a7 header
error: inflate: data stream error (incorrect header check)
fatal: loose object 27c221b1620b8414de002b00aa990fd8e0d768a7 (stored in .git/objects/27/c221b1620b8414de002b00aa990fd8e0d768a7) is corrupt
error: .git/objects/pack/pack-0672bd01813664b80248dbe8330bf52da9c02b9f.pack SHA1 checksum mismatch
error: index CRC mismatch for object 66e007c864c1460986af0993698234f4442882f1 from .git/objects/pack/pack-0672bd01813664b80248dbe8330bf52da9c02b9f.pack at offset 1485
error: inflate: data stream error (incorrect data check)
error: cannot unpack 66e007c864c1460986af0993698234f4442882f1 from .git/objects/pack/pack-0672bd01813664b80248dbe8330bf52da9c02b9f.pack at offset 1485
Checking objects: 100% (441/441), done.
dangling commit fe3af8c7274267a4262bc093adcee57511e13211
This repository was intentionally broken by modifying some files with a hex editor. git fsck detects this and tells you which files have been tampered with. Any corrupt loose objects can simply be removed, but corrupt packfiles probably also contain some recoverable objects, so we try to recover those before removing the file.
$ mv .git/objects/pack/pack-0672bd01813664b80248dbe8330bf52da9c02b9f.pack .
$ git unpack-objects -r < pack-0672bd01813664b80248dbe8330bf52da9c02b9f.pack
$ rm pack-0672bd01813664b80248dbe8330bf52da9c02b9f.pack
$ rm .git/objects/27/c221b1620b8414de002b00aa990fd8e0d768a7
With those files now out of the way, git fsck will report all missing objects. We can try recovering those from a fresh clone.
$ for pack in ../fresh-clone/.git/objects/pack/pack-*.pack; do git unpack-objects < "$pack"; done
If there are still missing objects, you can try adding the current contents of the work directory to your repository:
$ find . -path ./.git -prune -o -type f -print0 | xargs -0 git hash-object -w
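After each recovery attempt, it helps to check what is still missing. Here is a small sketch that reduces the fsck report to just the missing objects. The function name list_missing_objects is my own, and it relies on fsck printing such objects as "missing <type> <id>" lines.

```shell
#!/bin/sh
# Print the type and id of every object that fsck reports as missing.
list_missing_objects() {
    # Errors and progress go to stderr; the "missing" lines go to stdout.
    git fsck --full 2>/dev/null | awk '$1 == "missing" { print $2, $3 }'
}
```

An empty result means every object that the remaining history refers to is present again. Anything still listed only existed in your broken repository and will have to come from another backup.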