Friday, February 12, 2010

Got git?

One of the reasons I haven't been blogging much lately is that I've been studying Git because it seems to be what all the cool kids are using nowadays. It is taking me a while to wrap my brain around it, but I think it will prove to be well worth the effort. There's a lot of material out there and I don't have a whole lot to add to it, but there are a couple of key ideas that I wish someone had told me from the get-go. So here they are for the benefit of anyone who wants to follow me down this rabbit hole.

1. Figuring out revision control systems and deciding which one to use for your own work seems to be a rite of passage for all programmers. Arguably the most crucial task of any non-trivial RCS is doing a three-way merge. Three-way merge is a bit of a misnomer. When you do a three-way merge you are really still merging only two files. But you are doing it with the aid of a third file, which is the "common ancestor" of the two files that you are merging, that is, it's the common file that two divergent lines of development started with before any changes were made. Figuring out which revision to use as the common ancestor for a merge is one of the most complex tasks that an RCS has to do, particularly if the files being merged are the products of previous merges. There are lots of different approaches. This is one of the key features that distinguishes one RCS from another.
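To make the role of the common ancestor concrete, here is a minimal sketch in Python. It is a toy and not how any particular RCS implements merging: each version is modelled as a dictionary (think of each key as a line or a setting in the file), and the name `three_way_merge` is made up for illustration. The rule is simply that a side which still matches the ancestor has not changed, so the other side wins; only when both sides changed the same key differently is there a conflict.

```python
def three_way_merge(base, ours, theirs):
    """Merge two versions of a mapping with the help of their common ancestor.

    Returns (merged, conflicts). A key is a conflict when both sides changed
    it relative to base and ended up with different results.
    """
    missing = object()  # sentinel meaning "key is absent in this version"
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, missing)
        o = ours.get(key, missing)
        t = theirs.get(key, missing)
        if o == t:        # both sides agree (including both deleting the key)
            result = o
        elif o == b:      # only "theirs" changed it
            result = t
        elif t == b:      # only "ours" changed it
            result = o
        else:             # both changed it, in different ways
            conflicts.append(key)
            result = o    # arbitrary choice for this sketch: keep "ours"
        if result is not missing:
            merged[key] = result
    return merged, conflicts


base = {"a": "1", "b": "2", "c": "3"}
ours = {"a": "1 (ours)", "b": "2", "c": "3 (ours)"}
theirs = {"a": "1", "b": "2 (theirs)", "c": "3 (theirs)", "d": "new"}
merged, conflicts = three_way_merge(base, ours, theirs)
```

Without `base` there would be no way to tell that "a" was changed only on one side; a plain two-way comparison would see a difference and have to ask a human.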

2. In order to find common ancestors, most RCS systems explicitly store metadata about the history of a file. In other words, in the RCS database/repository there will be information along the lines of, "The original version of file foo.c was... it was then changed to..." This is the reason that you generally have to use separate commands to inform the RCS when files are created, renamed, and deleted.

3. Most RCS systems store file histories as sequences of changes rather than complete snapshots in order to save space.
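As a rough illustration of what "a sequence of changes" means, here is a small Python sketch that stores a first version in full and every later version as a delta against its predecessor. The delta format and the names `make_delta`, `apply_delta`, and `rebuild` are invented for this example; real systems use their own formats. It relies only on `difflib.SequenceMatcher` from the standard library.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Describe how to turn the list of lines `old` into `new`."""
    delta = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))        # reuse lines from old
        elif tag in ("replace", "insert"):
            delta.append(("insert", new[j1:j2]))  # only new lines are stored
        # "delete" needs no entry: the old lines are simply not copied
    return delta


def apply_delta(old, delta):
    """Rebuild the newer version from `old` and a delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


def rebuild(first_version, deltas, count):
    """Apply the first `count` deltas, in order, to the stored first version."""
    lines = first_version
    for delta in deltas[:count]:
        lines = apply_delta(lines, delta)
    return lines


v1 = ["a\n", "b\n", "c\n"]
v2 = ["a\n", "b2\n", "c\n"]
v3 = ["a\n", "b2\n", "c\n", "d\n"]
deltas = [make_delta(v1, v2), make_delta(v2, v3)]
latest = rebuild(v1, deltas, 2)
```

Note that `rebuild` has to walk every delta in order, so each version depends on all the ones before it.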

4. Git is unusual among RCS systems in that it stores complete snapshots rather than changes, and it does not store explicit metadata about file histories. Git avoids becoming horribly inefficient by using a content-based storage system, so identical file contents are stored only once no matter how many snapshots contain them. Also, Git finds the common ancestor for a merge by walking the commit graph, and it follows files across renames using heuristics based on content similarity rather than explicit metadata. This has a number of important consequences.
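Here is a minimal Python sketch of content-based storage, to show why full snapshots need not be wasteful. The `ObjectStore` class and the snapshot dictionaries are hypothetical stand-ins, not Git's actual data structures; the one detail taken from Git is the key itself, which for a file ("blob") is the SHA-1 of a short header followed by the content.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Key for a file's content: SHA-1 over 'blob <size>' + NUL + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Toy in-memory store keyed by content hash (an assumption of this sketch)."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        key = git_blob_id(content)
        self._objects.setdefault(key, content)  # identical content is stored once
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self):
        return len(self._objects)


store = ObjectStore()
# Each snapshot is a complete listing: file name -> key of that file's content.
snapshot_1 = {
    "foo.c": store.put(b"int main() { return 0; }\n"),
    "README": store.put(b"Hello\n"),
}
snapshot_2 = {
    "foo.c": store.put(b"int main() { return 1; }\n"),
    "README": store.put(b"Hello\n"),  # unchanged, so nothing new is stored
}
```

Two complete snapshots of two files each end up costing three stored objects, because the unchanged README hashes to the same key both times.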

First, Git is useful for more than just revision control. It can be used as a back-end storage system for a wide variety of applications.

Second, in a "normal" RCS, which stores file histories as a sequence of deltas, these sequences form a chain of dependencies. This makes the repository very sensitive to data corruption; if you lose a delta, every downstream snapshot that depends on it can no longer be reconstructed. It also makes it difficult or impossible to make retroactive changes to the commit history. But Git stores every commit as a complete snapshot, so there is no chain of dependencies. If part of the repository becomes corrupt, that corruption doesn't "spread downstream" the way it does in a delta-based RCS. Furthermore, the content-based storage system that makes all this efficient uses SHA-1 hashes as keys. This means that if an object's data still hashes to its key, the object is, for all practical purposes, not corrupt. So not only is Git able to contain repository corruption when it happens, it is able to detect it as well. And, as a corollary, it can also tell when corruption has been successfully repaired.
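The detect-and-repair argument can be sketched in a few lines of Python. This is a simplification under stated assumptions: the key here is a plain SHA-1 of the raw bytes, whereas Git's real objects carry a header and are compressed on disk, and `find_corrupt` is a made-up helper name.

```python
import hashlib


def object_id(data: bytes) -> str:
    """Key for an object: here simply the SHA-1 of its bytes."""
    return hashlib.sha1(data).hexdigest()


def find_corrupt(objects):
    """Return the keys whose stored data no longer hashes to the key."""
    return sorted(key for key, data in objects.items() if object_id(data) != key)


snapshots = [b"version 1\n", b"version 2\n", b"version 3\n"]
objects = {object_id(data): data for data in snapshots}
backup = dict(objects)  # e.g. another clone of the repository

# Simulate damage to exactly one stored object.
damaged_key = object_id(b"version 2\n")
objects[damaged_key] = b"version 2 with a flipped bit\n"
corrupt_after_damage = find_corrupt(objects)

# Repair from the backup, then re-check: a clean result confirms the repair.
for key in corrupt_after_damage:
    objects[key] = backup[key]
corrupt_after_repair = find_corrupt(objects)
```

Only the damaged object is flagged; the snapshots stored under the other two keys are untouched and still readable, which is the "no spreading downstream" property described above.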

Third, there is no distinction between a commit and a branch except for a little bit of bookkeeping. This means that creating a new branch is no more expensive than creating a new commit. In most RCS systems creating a new branch is a relatively expensive operation. But in Git it's cheap, so creating new branches can become an ordinary part of day-to-day workflow.
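One way to picture the "little bit of bookkeeping": a branch is just a name that points at a commit. The Python sketch below is a toy model under that assumption; the `Repo` and `Commit` classes and the way commit ids are computed here are invented for illustration and are not Git's actual formats.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    snapshot: str   # stand-in for a complete snapshot of the project
    parents: tuple  # ids of the parent commits


class Repo:
    def __init__(self):
        self.commits = {}   # commit id -> Commit
        self.branches = {}  # branch name -> commit id (the bookkeeping)

    def commit(self, branch, snapshot):
        """Record a new commit and move the branch name forward to it."""
        parent = self.branches.get(branch)
        parents = (parent,) if parent is not None else ()
        raw = repr((snapshot, parents)).encode("utf-8")
        commit_id = hashlib.sha1(raw).hexdigest()
        self.commits[commit_id] = Commit(snapshot, parents)
        self.branches[branch] = commit_id
        return commit_id

    def create_branch(self, name, start):
        """Creating a branch copies one pointer; no commit data is touched."""
        self.branches[name] = self.branches[start]


repo = Repo()
repo.commit("main", "v1")
main_tip = repo.commit("main", "v2")

commits_before = len(repo.commits)
repo.create_branch("experiment", "main")
commits_after_branch = len(repo.commits)

experiment_tip = repo.commit("experiment", "v3")
```

Creating `experiment` adds one dictionary entry and nothing else, and committing on it moves only that name, leaving `main` where it was.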

5. The underlying machinery of Git (which is called the "plumbing") is pretty simple and easy to understand. By way of contrast, the UI layer that is built on top of the plumbing (which is called the "porcelain") is horrifically complex. It's that complexity combined with the unorthodox nature of Git's design that makes it intimidating for many people. I would recommend learning the plumbing first and then tackling the porcelain. (Here's another handy reference.) I would leave the actual manual for last. It's a good reference once you know what you're doing, but I found it a less than optimal way of climbing the learning curve.

Finally, the Git community is very helpful and supportive. So if you've been thinking about taking the Git plunge, I recommend it. It takes a little getting used to, but once you understand it, it's very powerful. Besides, it's better than anything else. :-)


Sebastian said...

This is a good summary, thanks for sharing it. It's probably too late, but I found Randal's Tech Talk very helpful when I first started using git:

Anonymous said...

I think the most useful document for me was Git for Computer Scientists. (The name is misleading, actually; the only bit of computer science you need to know is what a directed acyclic graph is.)

I found git surprisingly difficult to learn and get used to, even though I'm quite intimately familiar with CVS and Subversion. This was mostly related to the user interface and the not-very-well-explained concept of the index.

But I'm pretty happy with Git now, and even abusing it for purposes for which it's not well suited, such as distributed storage of my book, music and photo collections.

XO said...

Thanks for the insight Ron.

I have been considering git as a back-end storage system on a RAID 5 for quite some time.

XO said...

Interesting tool: