As any good systems administrator will tell you, backups are an essential part of properly managing any system. And anyone who has ever tried to back up busy git repositories naively with rsync or tar will also complain about corrupt repositories after restoring them. (And if they don't complain about this, they may never have tried a restore...)
So why is this? Mostly because a filesystem offers no support for database-like transactions spanning many files, and git uses the filesystem like a database. In other words, to properly back up a repository you need to temporarily disable all access to it. That is of course highly undesirable, as it stops people from committing to the repository while you do the backup.
This can be worked around by knowing how git manipulates the repository, which we will discuss later. But first I want to show you some backup strategies that are common, but do not work for busy repositories. Of course, if you can stop all access to a repository or if you're the only person using the repository, you can ignore all this and just rsync and be done with it.
git clone
A clone is not a backup at all. It is merely a copy of some or all objects and refs, and does not back up hooks, configuration, reflogs or dangling objects.
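A quick way to convince yourself of this is to add a hook to a repository and then clone it: the hook (like local configuration) does not come along. A minimal sketch, using throwaway paths:

```shell
set -e
demo=$(mktemp -d)
git init -q "$demo/original"
cd "$demo/original"
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'initial commit'
# install a (do-nothing) hook in the original repository
printf '#!/bin/sh\nexit 0\n' > .git/hooks/post-receive
chmod +x .git/hooks/post-receive
git clone -q "$demo/original" "$demo/copy"
# the clone has the commit, but the hook did not come along:
ls "$demo/copy/.git/hooks/post-receive" 2>/dev/null || echo "hook not cloned"
```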
Dropbox and other sync mechanisms
Occasionally people think it's a good idea to store git repositories in Dropbox. Not only is this not a good way to back up, it can also seriously damage the original repository, especially when the Dropbox'ed repository is also used on multiple computers.
So why is it bad? For backups it's mostly that recovering an older version is done per file, making it really cumbersome or impossible to recover a complete older state of the repository. The corruption of the original is mostly because Dropbox ignores permissions and handles frequent updates to small files (think refs) rather poorly, causing HEAD to point to the wrong place.
rsync and tar
rsync and tar (and other file-based archiving programs) run afoul of git's ref update strategy. When you push to a repository, or pull into it, git does its updates in two phases: first the objects and then the refs. The objects phase is safe: as the name of the object is derived from the content, objects never change, and since git writes to temporary files which it renames into place, this is automatically atomic.
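The write-to-tempfile-and-rename trick that makes the objects phase safe works for any file you want to replace atomically on a POSIX filesystem; a minimal sketch of the pattern (paths are illustrative):

```shell
set -e
dir=$(mktemp -d)
# write the new content to a temporary file on the same filesystem...
printf 'new content\n' > "$dir/ref.tmp"
# ...then rename it into place: a reader sees either the old file or the
# complete new one, never a half-written mix (rename(2) is atomic)
mv "$dir/ref.tmp" "$dir/ref"
cat "$dir/ref"
```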
The refs phase is where all the problems are. While git uses the same write-to-tempfile-and-rename strategy as for objects, the refs point to objects that must be present. Imagine the following timeline of a git operation running while your backup is halfway through copying:
| Time | Git                                        | rsync or tar          |
|------|--------------------------------------------|-----------------------|
| 416  | write objects/c2/a1f5...                   | copies objects/d5/... |
| 417  | write refs/heads/master pointing to c2a1f5 | copies objects/d6/... |
At this point your backup has a refs/heads/master that points to an object that's not in the backup, so it's not suitable to recover from.
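You can detect this kind of damage in a restored copy with git fsck before trusting it. A sketch that simulates a backup which copied the refs but missed the objects (throwaway paths):

```shell
set -e
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m 'initial commit'
# simulate a backup that caught the refs but not the objects:
rm -rf "$repo"/.git/objects/??
# fsck now reports the ref pointing to a missing object;
# on a healthy repository it exits 0
if ! git -C "$repo" fsck --full >/dev/null 2>&1; then
    echo "backup is broken"
fi
```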
So how does one properly back up?
So are there no good ways to back up a git repository? Of course there are, you just need to take git's ref update strategy into account.
Filesystem snapshot and a normal archiver
If your filesystem or block device supports atomic snapshots, those can be used to make backups from. A possible backup strategy with LVM and tar could be as simple as:
lvcreate -L1G -s -n gitsnapshot /dev/vg00/git
mount /dev/vg00/gitsnapshot /mnt/git
tar zcf /var/backups/git-$(date +%Y-%m-%d).tar.gz -C /mnt git
umount /mnt/git
lvremove /dev/vg00/gitsnapshot
Being aware of git's ref update strategy
But even if you do not have the capability to do snapshots, you can still make consistent backups, as long as you back up objects after refs. You might end up with some dangling objects, but your repository will be consistent and all refs will point to existing objects. With rsync, that could look like:
rsync -av --delete --exclude objects /srv/git/ /var/backups/git/git-$(date +%Y-%m-%d)/
rsync -av --delete --include objects --exclude '/*/*' /srv/git/ /var/backups/git/git-$(date +%Y-%m-%d)/
The objects-and-refs consistency issue is by far the most common one, and the only one that can be influenced from the outside. But there are other repository maintenance tasks you may have scripted, such as regular gc or repack runs. If you have such maintenance tasks, incorporate the backups into the same process so they don't run simultaneously. Also make sure the backup runs before any gc, cleanup or expire job, so that you still have a good backup in case one of them goes haywire.