Repairing and recovering broken git repositories

Posted on Tue 08 December 2015 in Troubleshooting

Whether it's filesystem corruption due to a power outage, an overactive virus scanner, or a simple slip of the keyboard, it is not uncommon to hear about corruption inside the .git directory. It is much rarer to hear about such corruption being caused by git. I personally have never seen it, and it would surely be considered a critical bug if it were to happen.

So, what can we remove while still having something to recover? Well, pretty much everything except the objects directory. And even if you remove files from there, all other objects will be recoverable.

Make backups and try in a copy first

Your repository is already broken. Don't break it any further: first make sure nobody but you can access it, then make a backup (tar, rsync) of the repository, and try every command in a copy before running it on the original.

$ tar zcvf myrepo.tar.gz myrepo
$ rsync -av myrepo/ myrepo-copy/
$ cd myrepo-copy/

All the files in .git are gone!

One of the more interesting cases of corruption I've seen is someone losing all the files directly inside the .git directory, while the subdirectories and the files inside them survived. We never did find out how it happened, but it was surprisingly easy to fix.

The gitrepository-layout manpage can tell you which files git expects to exist. Below you'll find out how to restore them when they've gone missing.

HEAD

When .git/HEAD is gone, git doesn't even think your repository is a repository. So really, we must fix this first or else we will not be able to use any git commands to salvage the rest.

$ rm .git/HEAD
$ git status
fatal: Not a git repository (or any of the parent directories): .git

This is one of the very few times where touching files inside .git is OK. If you know which branch you had checked out, you can simply put that information inside .git/HEAD. I had the master branch checked out before deleting the HEAD file.

$ echo 'ref: refs/heads/master' > .git/HEAD
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean

If you don't know which branch (or even commit in detached HEAD state) you had checked out, try a few. If you picked the wrong one, git diff will tell you that there are many uncommitted changes.
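Trying a few branches by hand gets tedious when there are many. Here is a minimal sketch of how the guessing could be scripted, assuming the index survived and the branches are still loose files under .git/refs/heads. The guess_head name is hypothetical, not a git command. It points HEAD at each candidate in turn and counts how many paths differ from the worktree, so the most likely branch ends up at the top of the list.

```shell
# Hypothetical helper, not a git command. Run it from the top of the worktree.
# Assumptions: .git/index survived, and the branches are loose refs in
# .git/refs/heads (branches that live only in packed-refs are not seen).
guess_head() {
    find .git/refs/heads -type f | sed 's|^\.git/||' | while read -r ref; do
        # Point HEAD at this candidate branch.
        echo "ref: $ref" > .git/HEAD
        # Count the paths that differ between the candidate and the worktree.
        changed=$(git diff --name-only HEAD | wc -l | tr -d ' ')
        echo "$changed $ref"
    done | sort -n
}
```

The function leaves HEAD pointing at whichever branch it tried last, so once you have picked a winner from the list, write it to .git/HEAD yourself as shown above.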

index

Should you misplace your index, git thinks that all your files have been deleted from the repository with git rm --cached.

$ rm .git/index
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

    deleted:    .gitignore
    deleted:    docs/index.rst
    deleted:    setup.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)

    .gitignore
    docs/
    setup.py

To rebuild the index, you can simply do git reset --mixed. This makes the index look like the last commit while leaving the worktree alone. If you had any local changes you git added but did not commit yet, you will need to re-add those.

$ git reset --mixed
$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean

config

When the configuration is gone, you can't really get it back. But you can set the default configuration variables and re-add your remotes. This should get you into a workable state. So let's first do the default configuration.

$ git init

Git's init command will create a configuration if none exists. It will not wipe any objects, so it's safe to run in an existing repository. If your repository was a shared repository, you will need to tell git that manually though, using git config core.sharedRepository true.

With the configuration in place, we can re-add our remotes. My corrupt repository has one remote that lives on GitHub, so I'll add it.

$ git remote add origin git@github.com:seveas/whelk.git
$ git fetch

Are we done? Almost! git branch -vva will tell you that your branches are no longer tracking their remote counterparts. If you have only a master branch, a simple git branch -u origin/master master is enough to set up tracking. If you have many branches, you will want to script this.

for head in $(git for-each-ref --format '%(refname:short)' refs/heads); do
    if git rev-parse -q --verify "origin/$head" >/dev/null; then
        git branch -u "origin/$head" "$head"
    fi
done

packed-refs

If the packed-refs file is gone, you might have lost an awful lot of refs. Try a git fetch to see if some of them come back (tags and remote refs). For local refs, see the recipe below that discusses losing the refs directory.

A folder in .git is gone!

Not even two weeks after the case of the missing files, a user popped into #git who had lost everything except .git/objects/ (seriously, how do people do this?!). We managed to recreate everything else, which was of course made easier because he had only a single remote and a single branch. But it just goes to show that you can lose a lot of things and still keep git happy.

In this case it is important though to recover things in the correct order. The order we used was:

  • HEAD
  • refs/
  • HEAD again and index
  • config

refs

The refs directory contains all your branches, tags and other refs, except for the ones stored in .git/packed-refs. When you lose refs, there are a few strategies to get them back. The simplest one is to fetch from a remote repository (if you have any). This will bring back refs in refs/remotes, and tags that the remote has. If you have lost all refs, you will first need to manually mkdir -p .git/refs/heads to get git to recognize the repository at all.

For local refs, there are two locations where you can recover the previous values of refs: the reflog and the output of fsck. If you still have your reflogs, you will find the correct value of a ref on the last line of its reflog. Here's an example of recovering the master branch:

$ tail -n1 .git/logs/refs/heads/master
54bc41416c5d3ecb978acb0df80d57aa3e54494c 2c78628255b8cc7f0b5d47981acea138db8716d2 Dennis Kaarsemaker <dennis@kaarsemaker.net> 1446765968 +0100  merge upstream/master: Fast-forward
$ git update-ref refs/heads/master 2c78628255b8cc7f0b5d47981acea138db8716d2

The reflog in .git/logs/HEAD can show you which branch you had last checked out. This can help you update the HEAD ref.

$ tail -n1 .git/logs/HEAD
7f79f6a992b11aaaf2592075346d83b1ba0f4ff8 a5e28dbe709a544f51b9c44752e14e5cd007a815 Dennis Kaarsemaker <dennis@kaarsemaker.net> 1448810920 +0100  checkout: moving from 7f79f6a992b11aaaf2592075346d83b1ba0f4ff8 to master
$ git symbolic-ref HEAD refs/heads/master
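If many branches are missing, the reflog lookup above can be scripted. What follows is a rough sketch rather than a polished tool: restore_heads_from_reflogs is a hypothetical name, and it assumes the reflogs under .git/logs/refs/heads are intact. For every branch reflog it takes the second field of the last line (the new value of the ref) and recreates the branch, skipping branches that still exist and values whose commit is no longer in the object store.

```shell
# Hypothetical helper, not a git command. Run it from the top of the worktree.
# Assumption: the reflogs in .git/logs/refs/heads survived.
restore_heads_from_reflogs() {
    find .git/logs/refs/heads -type f | while read -r log; do
        # .git/logs/refs/heads/foo -> refs/heads/foo
        ref=${log#.git/logs/}
        # The second field of the last reflog line is the ref's latest value.
        new=$(tail -n1 "$log" | cut -d' ' -f2)
        # Leave branches alone that still exist.
        if git rev-parse -q --verify "$ref" >/dev/null; then
            continue
        fi
        # Only restore the ref if the commit it pointed to is still present.
        if git cat-file -e "$new^{commit}" 2>/dev/null; then
            git update-ref "$ref" "$new"
            echo "restored $ref -> $new"
        fi
    done
}
```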

If you do not have any reflogs, you can still recover refs by looking at your commit objects. If a commit has no descendants, it could be at the tip of a branch, so a ref should point to it. It could also be a commit that was amended, rebased or simply discarded, so this method may give you some false positives to sort through. So how do you find commits without descendants? Fortunately you don't need to do this manually: git fsck is here to help. Let's break a simple repository to show it.

$ git clone https://github.com/seveas/whelk.git
[output omitted]
$ cd whelk/
$ rm .git/packed-refs .git/refs/heads/master
$ git fsck
notice: HEAD points to an unborn branch (master)
Checking object directories: 100% (256/256), done.
Checking objects: 100% (589/589), done.
error: refs/remotes/origin/HEAD: invalid sha1 pointer 0000000000000000000000000000000000000000
notice: No default references
dangling tag 92d0fe18f9a55177d955edf58048b49db7987d5b
dangling commit aa7856977e80d11833e97b4151f400a516316179
dangling commit 16e449da82ec8bb51aed56c0c4c05473442db90a
dangling commit 864c345397fcb3bdb902402e17148e19b3f263a8
dangling tag be9471e1263a78fd765d4c72925c0425c90d3d64

These dangling commits are the tips of the branches. But which one is which? There's no way to know without looking, so let's create some temporary branches and have a look:

$ git update-ref refs/heads/recovery-1 aa7856977e80d11833e97b4151f400a516316179
$ git update-ref refs/heads/recovery-2 16e449da82ec8bb51aed56c0c4c05473442db90a
$ git update-ref refs/heads/recovery-3 864c345397fcb3bdb902402e17148e19b3f263a8
$ git log --graph --all --oneline --decorate

In the resulting log, you'll see where these temporary branches point, and you can use git branch -m to give them their correct names back. And those dangling tags? Those are tag objects that you can now recover. The tag object even has the tag name in it!

$ git cat-file tag be9471e1263a78fd765d4c72925c0425c90d3d64
object 34555e0e3315f60ca5810562a36269187c2ced46
type commit
tag 2.5
tagger Dennis Kaarsemaker <dennis@kaarsemaker.net> 1428783307 +0200

Version 2.5
$ git update-ref refs/tags/2.5 be9471e1263a78fd765d4c72925c0425c90d3d64
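With more than a handful of dangling objects, the two recipes above can be automated. This is a sketch under a few assumptions: the function names are hypothetical, and I pass --no-reflogs to git fsck so that commits which are only mentioned in a leftover reflog are reported as dangling too. That choice gives you more false positives to sort through. Also note that a commit an annotated tag points to is reachable from the dangling tag object, so it will not be listed as a dangling commit.

```shell
# Hypothetical helpers, not git commands. Run them inside the broken repository.

# Create a temporary branch recovery-N for every dangling commit.
recover_dangling_commits() {
    n=0
    for sha in $(git fsck --no-reflogs 2>/dev/null |
                 awk '$1 == "dangling" && $2 == "commit" {print $3}'); do
        n=$((n + 1))
        git update-ref "refs/heads/recovery-$n" "$sha"
        echo "recovery-$n $sha"
    done
}

# Recreate refs/tags/<name> for every dangling tag object, using the
# name stored in the "tag" header line of the object itself.
recover_dangling_tags() {
    for sha in $(git fsck --no-reflogs 2>/dev/null |
                 awk '$1 == "dangling" && $2 == "tag" {print $3}'); do
        name=$(git cat-file tag "$sha" | sed -n 's/^tag //p' | head -n1)
        # Skip objects without a usable name.
        if [ -n "$name" ] && git check-ref-format "refs/tags/$name"; then
            git update-ref "refs/tags/$name" "$sha"
            echo "restored tag $name"
        fi
    done
}
```

The temporary branches still need to be inspected and renamed by hand with git branch -m, exactly as described above.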

logs

If the reflogs are gone, they cannot be recovered. Fortunately, these logs aren't necessary for the normal operation of git and losing them only makes recovering refs harder.

objects

If the objects directory is gone, it's time to give up. This is where your data lives, and with it gone, what's left is useless. If you still have your worktree, you can use it to start a new repository: delete the .git directory and run git init to start over.

If the directory is not gone completely, but you have some corrupt or missing objects, see below for tips and tricks on how to recover from this.

info

The info/ directory is mostly useless these days, as it is only used for the obsolete dumb http protocol. If you still use this protocol and lost the info/ directory of the repository that is being pulled from, you can recreate it with git update-server-info.

modules

If the modules directory is gone, git can get quite upset. To fix this, move the submodules' worktrees out of the way (or delete them if you're sure you have no changes) and simply run git submodule update to reclone them. Then put your worktrees back if you had local changes, and you can commit those.

worktrees

A feature still under heavy development is support for multiple worktrees for a single repository. Information about these worktrees is stored in the worktrees directory. For each worktree, there is a separate directory containing at least HEAD, index, logs/HEAD, gitdir and commondir.

HEAD, index and logs/HEAD can be recovered as above. gitdir should contain the path to the .git file inside the separate worktree and commondir should contain the path to the original .git dir of the repository, usually ../..

Object corruption

The worst kind of corruption in a git repository is corrupt or missing objects. Corrupt objects are incredibly tricky to recover if you do not have a copy of them. We will therefore focus on restoring missing objects from another copy of the repository, so that any local-only work is not lost.

So let's first find out which objects are corrupt and remove them (you did read the first section of this article, saying to try this first in a copy of the repository, right?).

$ git fsck --full
error: inflate: data stream error (incorrect header check)
error: unable to unpack 27c221b1620b8414de002b00aa990fd8e0d768a7 header
error: inflate: data stream error (incorrect header check)
fatal: loose object 27c221b1620b8414de002b00aa990fd8e0d768a7 (stored in .git/objects/27/c221b1620b8414de002b00aa990fd8e0d768a7) is corrupt
error: .git/objects/pack/pack-0672bd01813664b80248dbe8330bf52da9c02b9f.pack SHA1 checksum mismatch
error: index CRC mismatch for object 66e007c864c1460986af0993698234f4442882f1 from .git/objects/pack/pack-0672bd01813664b80248dbe8330bf52da9c02b9f.pack at offset 1485
error: inflate: data stream error (incorrect data check)
error: cannot unpack 66e007c864c1460986af0993698234f4442882f1 from .git/objects/pack/pack-0672bd01813664b80248dbe8330bf52da9c02b9f.pack at offset 1485
Checking objects: 100% (441/441), done.
dangling commit fe3af8c7274267a4262bc093adcee57511e13211

This repository was intentionally broken by modifying some files with a hex editor. git fsck detects this and tells you which files have been tampered with. Any corrupt loose objects can simply be removed, but corrupt packfiles probably also contain some recoverable objects, so we try to recover those before removing the file.

$ mv .git/objects/pack/pack-0672bd01813664b80248dbe8330bf52da9c02b9f.pack .
$ git unpack-objects -r < pack-0672bd01813664b80248dbe8330bf52da9c02b9f.pack
$ rm pack-0672bd01813664b80248dbe8330bf52da9c02b9f.pack
$ rm .git/objects/27/c221b1620b8414de002b00aa990fd8e0d768a7

With those files now out of the way, git fsck will report all missing objects. We can try recovering those from a fresh clone.

$ for pack in ../fresh-clone/.git/objects/pack/pack-*.pack; do git unpack-objects < "$pack"; done
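Unpacking whole packfiles copies every object the clone has. If you would rather copy only what is actually missing, something along these lines might work. It is a sketch with hypothetical function names, and it assumes the path you pass in is an intact copy of the repository: it collects the "missing" lines from git fsck, then reads each object from the other copy with git cat-file and writes it into the broken repository with git hash-object -w.

```shell
# Hypothetical helpers, not git commands. Run them inside the broken repository.

# Print "<type> <sha>" for every object git fsck reports as missing.
list_missing_objects() {
    git fsck --full 2>/dev/null | awk '$1 == "missing" {print $2, $3}' | sort -u
}

# Copy each missing object from another copy of the repository.
# $1 is the path to that copy (assumed to be intact).
copy_missing_objects() {
    other=$1
    list_missing_objects | while read -r type sha; do
        # Only try objects the other repository really has.
        if git -C "$other" cat-file -e "$sha" 2>/dev/null; then
            # Same type and same content give the same object id.
            git -C "$other" cat-file "$type" "$sha" |
                git hash-object -w -t "$type" --stdin >/dev/null
            echo "copied $type $sha"
        fi
    done
}
```

You would call it as copy_missing_objects ../fresh-clone and then rerun git fsck. A restored commit or tree can reference further objects that are also missing, so it may take a few rounds before the list is empty.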

If there are still missing objects, you can try adding the current contents of the work directory to your repository:

$ find . -path ./.git -prune -o -type f -print0 | xargs -0 git hash-object -w