On Arch and Subversion

26 February, 2004

With Subversion 1.0 just released, the version control debate within the free software community has been reopened. The main point of contention is the divide between the two most mainstream open-source revision control projects: Arch and Subversion. Both projects are very ambitious and have significant followings, but they differ in deep and fundamental ways. This leads to many debates within the free software community about which is better and which one people should use.

Notable in this debate are a few documents that could be seen as "position statements" from the leaders of Arch and Subversion. In February 2003 Tom Lord (author of Arch) posted a message to the gnu-arch-users list titled "Diagnosing Subversion," in which he gives his reasons for why Subversion is a failure. One interesting part of the Arch culture is that Tom Lord frequently writes long essays to this list addressing everything from the responsibility of engineers in society to how free software R&D ought to be funded. In any case, Greg Hudson of Subversion wrote a reply entitled "Undiagnosing Subversion," in which he defends Subversion and argues that it does what it was designed to do.

I don't presume to be part of this series of manifestos, but I want to elaborate on this exchange from the point of view of a disinterested third party. As someone who is interested in revision control but not a guru, and who has some familiarity with both projects, I am writing this essay to weigh in on the debate. I will explain what I see to be the trade-offs and values inherent in both systems, and hopefully make clear why these two projects are so different that they cannot cooperate and focus their efforts on a single project.

I should mention that I am not too familiar with the Subversion community. I have never been subscribed to a Subversion mailing list (though I have read some of the archives), so much of my impression of the Subversion community is second-hand.

Changeset vs. Snapshot Model of Revision Control

The following quotes serve to illustrate a major difference between the way Arch and Subversion approach revision control.

WHAT IS REVISION CONTROL?

It is the role of revision control systems to provide two broad classes of functionality:

  1. The cataloging and archival of unitary changes made (primarily) to software and documentation source code, deliberately and explicitly, by contributors to that code.
  2. The quasi-algebraic manipulation of unitary changes to form useful combinations of changes; the combination and coordination of work made by single or multiple contributors acting across space and time.

--Tom Lord

What is Subversion?

Subversion is a free/open-source version control system. That is, Subversion manages files and directories over time. A tree of files is placed into a central repository. The repository is much like an ordinary file server, except that it remembers every change ever made to your files and directories. This allows you to recover older versions of your data, or examine the history of how your data changed. In this regard, many people think of a version control system as a sort of "time machine".

--The Subversion Book

Arch and Subversion handle version control in fundamentally different ways. Arch is based on storing changesets, which are like patches except that they can also change permissions and rename files in a tree. The way Arch stores a versioned tree is to import the first revision as a .tar.gz archive, and for each revision thereafter it stores a changeset. To get to any revision, you start with the base revision and apply all the changesets in order until you get the revision you want (this is how checkouts work in Arch). The only other prominent version control system that is changeset-based is the commercial BitKeeper, which is what Linus Torvalds uses for Linux.
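To make the changeset model concrete, here is a minimal sketch in Python. It is not Arch's actual code or on-disk format: the Changeset class, its fields, and the checkout function are names I made up for illustration, a tree is simplified to a mapping from path to contents, and permission changes are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# A tree is modelled as a mapping from path to file contents (an assumption
# made for this sketch; a real tree also has directories and permissions).
Tree = Dict[str, str]


@dataclass
class Changeset:
    """Hypothetical changeset: renames, then deletions, then content writes."""
    renames: Dict[str, str] = field(default_factory=dict)  # old path -> new path
    deletes: Set[str] = field(default_factory=set)         # paths to remove
    writes: Dict[str, str] = field(default_factory=dict)   # path -> new contents


def apply_changeset(tree: Tree, cs: Changeset) -> Tree:
    """Return a new tree with one changeset applied; the input is not modified."""
    result = dict(tree)
    for old, new in cs.renames.items():
        result[new] = result.pop(old)   # KeyError if the changeset does not fit
    for path in cs.deletes:
        del result[path]
    result.update(cs.writes)
    return result


def checkout(base: Tree, changesets: List[Changeset], revision: int) -> Tree:
    """Rebuild revision N by replaying the first N changesets onto the base."""
    tree = dict(base)
    for cs in changesets[:revision]:
        tree = apply_changeset(tree, cs)
    return tree


# Usage: revision 0 is the imported base, every later revision is one changeset.
base = {"README": "hello"}
history = [
    Changeset(writes={"main.c": "int main(void) { return 0; }"}),
    Changeset(renames={"README": "README.txt"}),
]
latest = checkout(base, history, len(history))
```

The point of the sketch is only that the archive holds the base plus the list of changesets, and every tree other than the base is computed by replaying them.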

Subversion, on the other hand, uses the snapshot model of version control. Instead of storing changes, it stores the state of the tree at every revision; its repository model is basically a sequence of complete copies of the entire repository. Of course, it optimizes so that it only copies data when a file changes, but the model is still based on having a complete filesystem tree for every revision.
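The snapshot model can be sketched the same way. This is an illustration only: the SnapshotStore class and the idea of keying file contents by a SHA-1 hash are my own assumptions for showing how unchanged files can be shared between revisions, and they are not a description of how Subversion's repository is actually laid out.

```python
import hashlib


class SnapshotStore:
    """Hypothetical store: every revision is a full tree, contents are shared."""

    def __init__(self):
        self.blobs = {}      # content hash -> contents, each stored only once
        self.revisions = []  # index is the revision number -> {path: content hash}

    def commit(self, tree):
        """Record a complete snapshot of the tree and return its revision number."""
        snapshot = {}
        for path, contents in tree.items():
            digest = hashlib.sha1(contents.encode("utf-8")).hexdigest()
            self.blobs.setdefault(digest, contents)  # unchanged files add nothing
            snapshot[path] = digest
        self.revisions.append(snapshot)
        return len(self.revisions) - 1

    def checkout(self, revision):
        """Any revision is a direct lookup; no earlier revision is consulted."""
        return {path: self.blobs[d] for path, d in self.revisions[revision].items()}


# Usage: two revisions that differ in a single file.
store = SnapshotStore()
r0 = store.commit({"a.txt": "one", "b.txt": "two"})
r1 = store.commit({"a.txt": "one", "b.txt": "three"})
```

Here every revision is a whole tree in the model, yet the contents of the unchanged file are stored only once.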

One way to picture this difference is to think of revision control as a directed graph (in the graph theory sense) where nodes are revisions and edges are changesets. Arch stores the edges, Subversion stores the nodes.

Really, these two models are theoretically equivalent; with changesets you can generate any snapshot, and with snapshots you can compute any changeset. Indeed, as an optimization Arch builds what it calls "revision libraries," which are snapshots of intermediate revisions. This way, the intermediate revisions are directly accessible for operations that require them. These revision libraries even use a similar optimization of sharing data on disk between files that don't change by using hard links.
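The equivalence is easy to demonstrate on a toy scale. In this sketch (the function names diff and apply, and the dictionary layout of a changeset, are mine and not taken from either project) a tree is a mapping from path to contents, the edges are computed from a list of snapshots, and the snapshots are then rebuilt from the first one plus the edges.

```python
def diff(old, new):
    """Compute a changeset (a plain dict) that turns tree `old` into tree `new`."""
    return {
        "deletes": sorted(p for p in old if p not in new),
        "writes": {p: c for p, c in new.items() if old.get(p) != c},
    }


def apply(tree, changeset):
    """Apply a changeset produced by diff() and return the resulting tree."""
    result = {p: c for p, c in tree.items() if p not in changeset["deletes"]}
    result.update(changeset["writes"])
    return result


# Snapshots (the nodes) ...
snapshots = [
    {"a.txt": "one"},
    {"a.txt": "one", "b.txt": "two"},
    {"b.txt": "two, edited"},
]

# ... give changesets (the edges) ...
edges = [diff(a, b) for a, b in zip(snapshots, snapshots[1:])]

# ... and the first node plus the edges give back every node.
rebuilt = [snapshots[0]]
for changeset in edges:
    rebuilt.append(apply(rebuilt[-1], changeset))
```

Renames are not modelled here; a rename shows up as a delete plus a write, which is exactly the kind of information a real changeset format keeps and a plain diff of snapshots loses.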

However, when the system is focused around changesets as Arch is, it becomes more natural to support some of the features Arch boasts: distributed development and powerful merging. Creating a private branch of someone else's line of development is as simple as storing a sequence of private changesets on your own computer. The other person can merge your changes by obtaining your changesets and applying them. Part of an Arch working directory is a set of patch logs which are records of what changesets have been applied to this tree. This makes it easy to support repeatedly merging from a branch without generating conflicts (something CVS never aspired to do) by looking to see what changesets have already been applied and skipping those.
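The role of the patch log in repeated merging can be sketched as follows. This is a simplification under my own assumptions: the merge function, the changeset names, and the representation of a changeset as a name plus a dictionary of file writes are all hypothetical, and real Arch merges also have to deal with conflicts, which this ignores.

```python
def merge(tree, patch_log, incoming):
    """Apply only those incoming changesets not already recorded in the patch log.

    `incoming` is a list of (name, writes) pairs, where `writes` maps a path
    to its new contents. Returns the new tree, the new patch log, and the
    names of the changesets that were actually applied.
    """
    tree = dict(tree)
    patch_log = set(patch_log)
    applied = []
    for name, writes in incoming:
        if name in patch_log:
            continue  # already merged earlier, so skip it
        tree.update(writes)
        patch_log.add(name)
        applied.append(name)
    return tree, patch_log, applied


# Usage: merge from the same branch twice; the branch gains one changeset
# between the two merges.
branch = [("patch-1", {"a.txt": "one"}), ("patch-2", {"b.txt": "two"})]
tree, log, first = merge({}, set(), branch)

branch.append(("patch-3", {"c.txt": "three"}))
tree, log, second = merge(tree, log, branch)
```

On the second merge only the new changeset is applied, because the first two are already in the log.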

Greg Hudson goes into the disadvantages of the changeset-based approach in "Undiagnosing Subversion." It requires grokking a more complicated model, and using it even in a limited way requires creating local archives and branches, mirroring them when necessary, and keeping in mind what should be merged where. The Arch mantra is that branches are cheap and should be used without hesitation, but this introduces more complexity. One problem he doesn't mention is that the changeset-based approach is less efficient than the snapshot-based approach. For example, checking out the most recent revision requires downloading the base revision and applying however many changesets exist in the archive. To deal with this limitation Arch introduces a number of optimizations, such as revision libraries, but performance will suffer for anyone who does not understand how to use them properly.
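As a rough way to count the work involved, the sketch below computes how many changesets must be applied to reach a given revision, with and without cached intermediate snapshots. It is a counting exercise under my own assumptions (the function name replay_cost and the idea of measuring cost as a number of changesets are mine), not a measurement of Arch.

```python
def replay_cost(target, cached_revisions=()):
    """Number of changesets to apply to reach revision `target`.

    Revision 0 is the base. Replay starts from the nearest cached
    revision at or before the target, or from the base if there is none.
    """
    usable = [r for r in cached_revisions if 0 <= r <= target]
    start = max(usable, default=0)
    return target - start


# Usage: an archive with 500 changesets after the base revision.
without_cache = replay_cost(500)
with_cache = replay_cost(500, cached_revisions=[100, 450])
```

Under this simple count, a checkout with no cached snapshot costs as many changeset applications as there are revisions, while a cached snapshot near the target reduces it to the distance from that snapshot.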

The snapshot-based model is more intuitive to the many developers who are familiar with CVS. Checking out the latest code is a simple download, and everyone shares the same repository.

In "Diagnosing Subversion" Tom Lord argues that the snapshot model of version control is fundamentally wrong. It is fairer to see two approaches, each with inherent tradeoffs. If you value decentralized development and powerful merge operations, Arch is for you. If you value simplicity and efficiency (caveat: I don't have hard data showing Subversion is faster; I am judging based on the number of operations required to perform common tasks), choose Subversion.

POSIX vs. a Portable System

Arch is very closely tied to POSIX. It was originally implemented as a series of shell scripts, and even now that it is written in C it calls many POSIX programs and relies on POSIX-specific optimizations. Portability to non-POSIX systems is not a priority for Tom Lord, and though some Arch users work on portability to Cygwin and such, it will not be practical to use Arch on non-POSIX platforms in the foreseeable future. Subversion, in its goal of replacing CVS and being usable by the greatest number of people, is portable to all the major platforms.

Tied to this is an Arch philosophy of never building into Arch what already exists in POSIX. For example, the idea of building a permissions system into Arch is rejected on principle. "The filesystem already provides this," is the general retort. Arch uses standard POSIX formats such as .tar.gz for storing data and standard directory structures for storing metadata. There is a strong resistance to any kind of binary "blob" that separates a man from his data.

In contrast, Subversion uses cross-platform libraries such as libneon and BerkeleyDB, essentially bringing its own portable code to each platform. This makes Subversion a self-contained system that demands very little of the operating system, whereas Arch implements only what POSIX does not already provide.

CVS Replacement vs. Revolution

Subversion has never aimed to do more than reimplement CVS in a way that is more efficient and eliminates its most obvious limitations (the inability to move files and the inability to version directories). It doesn't pretend to be chasing BitKeeper. It doesn't hope to change the way software development works. The development of Subversion is fairly pragmatic, and as I understand it Subversion's main developers are funded.

Tom Lord, however, whom I single out because he is the driving force behind Arch's design and evolution, frequently sees Arch as part of a bigger picture, for things such as A Free Software Industry. Tom has no shortage of big ideas about the interplay of business, free software, and engineering. Arch's model and its features play a role in achieving his bigger goals. Tom is not funded and frequently "passes the hat" around the Arch community, asking for support so that he can work on Arch full-time.


Last modified 17 June 2004.