UNIX Unleashed, Internet Edition

- 23 -

Introduction to Revision Control

Eric Goebelbecker

Web sites, programming projects, and even networks revolve around collections of files. Many of these files depend upon information that is stored in other files, such as targets for hypertext links, arguments to functions, or network names and addresses. These relationships can be very difficult to manage, especially when more than one person is involved or as small projects evolve into large systems.

One of the tools commonly found on a UNIX system for managing those relationships is a Revision Control System (also called a Source Control System; this chapter uses both terms interchangeably). These systems allow a person (or group of people) to track the changes made to a set of files, quickly and accurately undo a set of changes, and maintain an audit trail regarding why changes were made.

This chapter will explore the common characteristics and concepts behind these systems and how you can use them to help manage your projects more effectively, without going too far into the specifics of any particular system. RCS, SCCS, and CVS, three of the most widely used source control systems, are covered fully in the next three chapters.

Source control is often closely associated with software development. While it is an indispensable tool for any programming project, this chapter will illustrate how it can also be useful for many other projects.

This chapter will:

Explain what revision control is and what it is frequently used for.
Demonstrate essential revision control concepts, such as creating revisions, checking changes in and out of the system, organizing file changes logically, and moving easily between file revisions.
Cover advanced topics such as using revision control to prevent conflicts created by a group of people working on a single set of files, documenting changes to files, and creating revision branches.

What Is Revision Control?

Managing change is a common part of computing. Programmers have to manage bug fixes while producing new versions of applications that are frequently based on the code that contains what is being fixed. System administrators have to manage a variety of configuration changes, such as adding new users to systems and adding new systems to networks, without interfering with day-to-day operations. Web authors have to make continuous revisions to documents in order to keep up with the constantly growing and improving Internet competition. Just about any computer related job (or any job that can use a computer, for that matter) goes through a seemingly endless cycle of revision, refinement, and renewal.

Fortunately for UNIX users, most of the files used in these processes are text files: files that consist of (mostly) human-readable characters. (A more technical description would be files that are limited to the ASCII character set.) Programs in C/C++, Perl, and Java are written in text files, as are HTML and JavaScript documents. UNIX configuration files for system and network management are usually human readable, as are many of the languages used for document creation and formatting, such as troff and PostScript.

Why is this fortunate? Because revision control systems can manage any text file. They are sets of utilities that allow users to manage the creation and maintenance of any document, either alone or in groups. The systems covered in this book are SCCS, RCS, and CVS.

These systems provide some common features:

The ability to save multiple versions of a file, and easily select between them.
The ability to resolve (and prevent) conflicts caused by more than one person altering a file simultaneously.
The ability to review the history of changes made to a file.
The ability to link versions of different files together.

Revision Control Concepts--an Example

In order to illustrate the concepts behind version control, let's use an example HTML project. Concepts will be introduced without actually demonstrating any commands or utilities. Instead we will simply describe the operations that we could perform in order to maintain our project.

Our project will start with the following file, hello.html.

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<title>An Html Page</title>
</head>
<body>
<h1>Hello World!</h1>
<hr>
<address><a href="mailto:eric@prophet">Eric Goebelbecker</a></address>
</body>
</html>

Registering the Initial Revision

The first step is to register hello.html. When a file is registered, a control file is created, the revision is numbered, and the original file is marked read-only if we specify that we want a copy to stay behind.

Revisions (or deltas in SCCS terminology) are the building blocks of source control projects. Files (and groups of files) are stored and retrieved in terms of the changes made to them. Each time a file is changed and checked in, a new revision is created.
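To make the idea of storing changes rather than whole files more concrete, here is a minimal Python sketch. It uses the standard difflib module to record the difference between two revisions as a single delta and then rebuilds either revision from that delta alone. This is only an illustration of the concept; RCS and SCCS use their own control file formats, and the variable names are our own.

```python
import difflib

# Two revisions of a small file, as lists of lines.
rev_1_1 = ["<h1>Hello World!</h1>\n", "<hr>\n"]
rev_1_2 = ["<h1>Hello World!</h1>\n", "<p>Welcome.</p>\n", "<hr>\n"]

# Record the change between the two revisions as a single delta.
delta = list(difflib.ndiff(rev_1_1, rev_1_2))

# Either revision can be rebuilt from the delta alone.
old_lines = list(difflib.restore(delta, 1))
new_lines = list(difflib.restore(delta, 2))
```

Because both revisions can be recovered from the delta, a system built this way only needs to keep one full copy of the file plus the chain of changes.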

Since this is the original file, it is referred to as the root of the revision tree. It would typically be numbered version 1.1. Revision control systems allow these numbers to be overridden when files are registered or checked out. (We'll explain how and when files are checked out in the next section.)

Revision numbers, such as 1.1, are used as names for versions of files. (Actual names can also be used in some situations; see the "Symbolic Names, Baselines, and Releases" section later in this chapter.) The leftmost number usually signifies a major release for a product. If we were working on a new version of an existing product, we might override this number to be 2 or 3, depending upon what internal policies exist for version numbers. The second number represents the minor version, where 2.5 might represent the fifth revision of a file within version 2. (Revision numbers have taken on a life of their own since the early days of RCS and SCCS, and really don't mean as much as they used to.)
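The numbering scheme just described can be sketched in a few lines of Python. The function names below are hypothetical and exist only for this illustration; each real system has its own commands and options for choosing revision numbers.

```python
def parse_revision(rev):
    """Split a trunk revision such as "2.5" into (major, minor) integers."""
    major, minor = rev.split(".")
    return int(major), int(minor)


def next_minor(rev):
    """Default numbering: the next check-in bumps the minor number."""
    major, minor = parse_revision(rev)
    return f"{major}.{minor + 1}"


def next_major(rev):
    """Overriding the default: start a new major release at minor 1."""
    major, _ = parse_revision(rev)
    return f"{major + 1}.1"
```

For example, next_minor("1.1") gives "1.2", while next_major("1.7") gives "2.1".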

It is significant that the version control system marks any remaining copies of the file as read-only. A version control system is only as accurate as the changes it's aware of, and registering changes is very important. On a superficial level the file's permissions act as a reminder to us to keep the file in sync with the revision control system. More importantly, the file permissions play a crucial part when more than one person is involved in working on a project.

Edits to a file cannot be saved if the file is marked read-only, and the permissions on the file can only be changed by the owner (or by the super-user). The right way to edit the file is to check it out from the control system, which marks the file as only being writeable by the person who has checked it out. Therefore, if one user checks a file out, others will not be able to alter it until it is checked back in. This is the most fundamental operation in what is called file locking.

NOTE: When a group is working together on, for instance, a set of Web pages, an application development project, or any other project not related to system administration, all of the users should have a proper account and should be using it. No one should be working as root, since file locking essentially becomes useless when a user can override it at will.
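The permission changes described above can be sketched with Python's standard os and stat modules. This is a rough illustration of the idea, not what any particular revision control system does internally; the helper names are our own, and the file lives in a temporary directory so nothing real is touched.

```python
import os
import stat
import tempfile

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def mark_read_only(path):
    """Clear every write bit, as a system might after a check-in."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~WRITE_BITS)


def mark_writable_by_owner(path):
    """Restore the owner's write bit, as a check-out for editing might."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IWUSR)


def has_write_bit(path):
    """Report whether any write bit is set (root can write regardless)."""
    return bool(stat.S_IMODE(os.stat(path).st_mode) & WRITE_BITS)


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "hello.html")
    with open(path, "w") as handle:
        handle.write("<h1>Hello World!</h1>\n")
    mark_read_only(path)
    locked = has_write_bit(path)
    mark_writable_by_owner(path)
    unlocked = has_write_bit(path)
```

Note that the sketch inspects the permission bits rather than attempting a write, precisely because the super-user can write to a file whatever its permissions say.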

Revision control systems store the series of changes to objects in control files. (Each system has different options and stores these files differently. See Chapter 24 for details on RCS files and Chapter 26 for details on SCCS files. CVS, which is covered in Chapter 25, uses RCS files.) These files contain complete histories of the project, which allows them to serve as both a backup and an audit trail. In fact, keeping a current copy of the file isn't really necessary, as long as the history file is available. Many programming utilities, such as make and emacs, are aware of revision control and can automatically retrieve the latest version of a file.

Registering hello.html starts the revision control process. This process essentially enforces a discipline on users who are working on that project. Files cannot be altered unless they are checked out, and others cannot work on them unless they are checked in. If you do not check a file in, your coworkers will most likely tell you to. Also, as you will see in the next section, when files are checked in, the system allows you to add comments regarding the changes you made. If the comments are missing or incomplete, trouble frequently ensues, especially when the changes are implicated in a problem.

Creating a New Revision

The e-mail address on line #9 in hello.html will not work for external systems because the domain name is incomplete, so we must update the file. (Otherwise, how are people going to tell us what they think of our masterpiece?)

The file is still marked read-only from when we registered it with the revision control system. In order to edit it, we need to check out the latest version of hello.html.

Checking a file out (or getting it, in SCCS terms) provides us with a modifiable working copy of the file. It also marks the file as being edited within the revision control system, locking other users out from checking in revisions that could conflict with ours. (Files can also be checked out read-only, so the file can be examined at any time, but only one user can lock it at a time.)

After checking out the file, the line is modified:

<address><a href="mailto:eric@niftydomain.com">Eric Goebelbecker</a></address>

Then we check in (or delta) the file. As a part of the check in process, the system prompts us for a comment. (The SCCS request prompt is shown.)

comments? Fixed e-mail address.

The project now has a second revision, which is numbered version 1.2 since we didn't override the default.
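The register, check out, edit, and check in cycle we just walked through can be modeled with a toy in-memory history. The RevisionStore class below is entirely hypothetical and keeps full copies of each revision for simplicity; it is meant to show the bookkeeping, not how RCS, SCCS, or CVS are implemented.

```python
class RevisionStore:
    """Toy in-memory history for a single file (for illustration only)."""

    def __init__(self, initial_text):
        # Registering the file creates the root revision, 1.1.
        self.revisions = [("1.1", initial_text, "Initial revision")]

    def latest(self):
        return self.revisions[-1][0]

    def check_out(self, revision=None):
        """Return the text of a revision (the latest one by default)."""
        wanted = revision or self.latest()
        for number, text, _comment in self.revisions:
            if number == wanted:
                return text
        raise KeyError(f"no such revision: {wanted}")

    def check_in(self, text, comment):
        """Store a new revision with its comment and return its number."""
        major, minor = self.latest().split(".")
        number = f"{major}.{int(minor) + 1}"
        self.revisions.append((number, text, comment))
        return number


store = RevisionStore('<a href="mailto:eric@prophet">Eric Goebelbecker</a>')
working_copy = store.check_out()
working_copy = working_copy.replace("eric@prophet", "eric@niftydomain.com")
new_revision = store.check_in(working_copy, "Fixed e-mail address.")
```

After this runs, new_revision is "1.2", and store.check_out("1.1") still returns the original text with the incomplete address.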

The Revision Tree

Let's imagine that this process continues and hello.html grows into a more sophisticated HTML page.

Figure 23.1.
A Simple Revision Tree.

Each revision is a node on the revision tree. The node labeled version 1.1 (root) in Figure 23.1 represents the initial revision of hello.html. The node labeled version 1.2 represents the version with the corrected e-mail address; version 1.3 could represent a version with some graphics added, and so on.

NOTE: As you may have figured out already, version control systems use a tree metaphor, much like UNIX directories.

For a simple file such as hello.html, viewing the history of revisions as a tree may seem like a bit of a stretch. Later, when we cover revision branching in the "Advanced Concepts" section, the metaphor will have more meaning.

Returning to an Earlier Revision

Version 1.3 contained a very large graphic, which worked fine on our local LAN, but took too long to download elsewhere on the Internet.

When the large graphic was added to the page, a lot of formatting was also added, so simply removing the graphic or adding a smaller one would seriously affect the page. In order to make the page usable quickly, use the revision control system to retrieve version 1.2 until you have time to solve the problem with version 1.3.

The systems make this easy, because the file can be checked out at a specific revision level. You can also check out revision 1.2 as a read-only file so users can view it, while addressing the problem with revision 1.3.

Advanced Concepts

Now that we've covered the basic concepts, let's move on to some more advanced applications of revision control, such as how to use it to resolve problems, how to maintain more than one version of a project, and how it makes a project that involves more than one person much easier to manage than e-mail and those sticky-pad notes.

Revision History

Having only three versions of hello.html made the transition back to an earlier version too easy. Let's move on to a more comprehensive example.

An accounting package has a major new feature added. (Let's imagine that now it calculates the value of a customer's account in U.S. dollars and German Marks.) Following the addition of that enhancement, a few minor features are added and a pair of bugs are fixed.

One day a customer points out that the calculation in German currency has a problem. Since the program has gone through some changes since that feature was added, how can the bug be isolated quickly? Viewing the revision history could help. Below is a theoretical revision history from SCCS.

D 1.5 97/08/03 16:23:32 fred 5 4        00024/00025/00200
MRs:
COMMENTS:
Added compatibility with fvwm
D 1.4 97/08/03 16:23:32 fred 4 3        00024/00025/00200
MRs:
COMMENTS:
Fixed divide by zero bug in entry module
D 1.3 97/07/15 19:14:21 mike 3 2        00002/00002/00223
MRs:
COMMENTS:
Added report formatting features and support for HP680C
D 1.2 97/06/27 19:03:26 melvin 2 1        00012/00003/00213
MRs:
COMMENTS:
Added Deutsche Mark valuation module

The bug was introduced back in version 1.2 when Melvin added the support for Deutsche Marks. Since then, however, Mike and Fred have added reporting features and support for fvwm and fixed another bug. Revision comments help here by showing when and where a problem may have been introduced. The "How do I use RCS?" section in Chapter 24 explains the use of the rlog command for viewing revision histories in RCS; CVS provides a similar log command. "Examining Revision Details and History" in Chapter 26 explains how to view this information in SCCS.
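When a history grows beyond a handful of entries, searching it by hand becomes tedious. As a rough sketch, the Python below parses an excerpt of a listing laid out like the one above and reports which revisions mention a given word in their comments. The parsing rules are assumptions based only on this sample layout, not a full description of SCCS output, and the function names are our own.

```python
import re

HISTORY = """\
D 1.4 97/08/03 16:23:32 fred 4 3 00024/00025/00200
MRs:
COMMENTS:
Fixed divide by zero bug in entry module
D 1.3 97/07/15 19:14:21 mike 3 2 00002/00002/00223
MRs:
COMMENTS:
Added report formatting features and support for HP680C
D 1.2 97/06/27 19:03:26 melvin 2 1 00012/00003/00213
MRs:
COMMENTS:
Added Deutsche Mark valuation module
"""

# Header lines look like: D <revision> <date> <time> <author> ...
HEADER = re.compile(r"^D (\S+) (\S+) (\S+) (\S+) ")


def parse_history(text):
    """Return (revision, author, comment) tuples in listing order."""
    entries = []
    current = None
    in_comments = False
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            revision, _date, _time, author = match.groups()
            current = [revision, author, []]
            entries.append(current)
            in_comments = False
        elif line == "COMMENTS:":
            in_comments = True
        elif in_comments and current is not None:
            current[2].append(line)
    return [(rev, author, " ".join(lines)) for rev, author, lines in entries]


def revisions_mentioning(text, word):
    """List (revision, author) pairs whose comments contain the word."""
    return [(rev, author) for rev, author, comment in parse_history(text)
            if word.lower() in comment.lower()]
```

Calling revisions_mentioning(HISTORY, "Mark") points straight at revision 1.2 and its author, which is only possible because the comment was written carefully in the first place.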

Multiple Versions of a Single File or Project

In the previous examples, you only needed a revision tree with a single path, the trunk. Let's look at a situation where a project needs more advanced solutions.

A small ISP (Internet Service Provider) provides two varieties of service to its customers. One is a shell account where a customer can dial in and log into a UNIX host. The other is a PPP account, where the customer dials in for a network connection, but never logs into one of the ISP's systems. (Note to nitpickers: the PPP login is handled by a terminal server.)

All users do, however, need to have accounts on the POP mail server, because all of them will receive mail and the mail must be saved with proper ownership and file permissions until the users retrieve it, either with a mail agent from their shell account or to their systems at home. Therefore, the ISP needs to maintain two UNIX passwd files, one for shell users only and one with all users. (Second note to nitpickers: yes, if the terminal server uses a passwd file, we need three. It's only an example!)

The initial revision of the passwd file, prior to the ISP offering PPP accounts, might have looked like this:

abe:x:200:200:Abraham Lincoln:/export/home/abe:/bin/sh
ben:x:201:200:Benjamin Franklin:/export/home/ben:/bin/ksh
sue:x:202:200:Susan B Anthony:/export/home/sue:/bin/ksh
ike:x:203:200:Dwight D Eisenhower:/export/home/ike:/bin/ksh
fdr:x:204:200:Franklin D Roosevelt:/export/home/fdr:/bin/ksh
harry:x:205:200:Harry S Truman:/export/home/harry:/bin/sh
john:x:206:200:John Galt:/export/home/john:/bin/csh

At a certain point, however, the ISP administrator needed to add users to the passwd file who did not belong on the shell host, only on the POP host:

abe:x:200:200:Abraham Lincoln:/export/home/abe:/bin/sh
ben:x:201:200:Benjamin Franklin:/export/home/ben:/bin/ksh
sue:x:202:200:Susan B Anthony:/export/home/sue:/bin/ksh
ike:x:203:200:Dwight D Eisenhower:/export/home/ike:/bin/ksh
fdr:x:204:200:Franklin D Roosevelt:/export/home/fdr:/bin/ksh
harry:x:205:200:Harry S Truman:/export/home/harry:/bin/sh
john:x:206:200:John Galt:/export/home/john:/bin/csh
bill:x:207:200:William Clinton:/tmp:/bin/nosuchshell
hillary:x:208:200:Hillary Clinton:/tmp:/bin/nosuchshell
al:x:209:200:Albert Gore:/tmp:/bin/nosuchshell
hank:x:210:200:Hank Reardon:/tmp:/bin/nosuchshell

(The users with nosuchshell only have access to POP mail.)

Revision control provides two possible solutions for this problem.

Branching the Revision Tree

If the administrator just wanted to use the shell accounts as a base for the POP mail file, she could add a branch to the revision tree.

Figure 23.2.
A revision tree with branches.

As Figure 23.2 shows, a branch creates a new development path for the project. It also has an impact on revision numbers. The branch that extends from revision 1.2 is labeled 1.2.1.1, because it is the initial revision derived from number 1.2. The second pair of numbers works much like the first: the third number identifies the branch, and the fourth counts the revisions along it.

NOTE: Revision numbers can be thought of as extending revision control's similarity to UNIX file systems. The revision numbers label versions much the same way directory names identify subdirectories.
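The branch numbering described above follows a simple pattern that can be sketched in Python. The helpers below are hypothetical and assume the four-part numbers used in this chapter's example (such as 1.2.1.1).

```python
def first_branch_revision(parent, branch_number=1):
    """First revision on a branch from `parent`, e.g. "1.2" -> "1.2.1.1"."""
    return f"{parent}.{branch_number}.1"


def next_on_same_line(revision):
    """Bump the last number: "1.2" -> "1.3", "1.2.1.1" -> "1.2.1.2"."""
    parts = revision.split(".")
    parts[-1] = str(int(parts[-1]) + 1)
    return ".".join(parts)


def is_on_trunk(revision):
    """Trunk revisions have two numbers; branch revisions have more."""
    return len(revision.split(".")) == 2


def branch_point(revision):
    """Revision that a branch revision was derived from."""
    parts = revision.split(".")
    if len(parts) < 4:
        raise ValueError(f"{revision} is not on a branch")
    return ".".join(parts[:-2])
```

For instance, branch_point("1.2.1.3") is "1.2", much as the parent of a subdirectory can be read from its path.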

By branching, the administrator is able to include the contents of the existing file in the new version without adding unneeded entries in the original tree. But what happens when a new shell user signs up? The administrator still has to add the same information in two places.

Merges

No one wants to do the same thing twice, least of all a probably already overloaded system administrator. But what mechanism would allow users who are added to the shell system to show up on the POP system without inadvertently adding POP users to the list of shell users?

Most revision control systems support merging branches in order to avoid having to manually add changes. This process allows the administrator to add entries from the main tree to the branch, without also adding them back to the main tree. In Figure 23.3, version 1.4 is merged with 1.2.1.2 to create version 1.2.1.3.

Figure 23.3.
Branched revision tree with a One-way merge.

Merging files can be a very intricate process, and it is a powerful feature that can be used in many more ways than the one we just covered. For more information, see Chapter 24's "How do I use RCS?" for details on merging files managed by RCS, the "Merging" section of Chapter 25 for CVS information, and the Chapter 26 "Merging Revisions" heading for a method used in SCCS.
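As a simplified picture of the one-way merge in the passwd example, the Python sketch below copies lines that were added on the trunk since the branch point into the branch, and leaves the trunk alone. It deliberately ignores deletions and edited lines, which real merge tools must handle, and the shortened passwd entries and the function name are hypothetical.

```python
def one_way_merge(base, trunk, branch):
    """Copy lines added on the trunk since `base` into `branch`.

    Lines that exist only on the branch are kept, and nothing is written
    back to the trunk. Deletions and edits are ignored in this sketch.
    """
    added_on_trunk = [line for line in trunk if line not in base]
    merged = list(branch)
    for line in added_on_trunk:
        if line not in merged:
            merged.append(line)
    return merged


# Hypothetical passwd contents at each revision (entries shortened).
rev_1_2 = ["abe:x:200:200::/export/home/abe:/bin/sh"]
rev_1_4 = rev_1_2 + ["ike:x:203:200::/export/home/ike:/bin/ksh"]
rev_1_2_1_2 = rev_1_2 + ["bill:x:207:200::/tmp:/bin/nosuchshell"]

# New shell user ike reaches the POP file; POP-only bill stays off the trunk.
rev_1_2_1_3 = one_way_merge(rev_1_2, rev_1_4, rev_1_2_1_2)
```

The result contains abe, bill, and ike, while rev_1_4 (the shell host's file) is unchanged.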

File Locking

We've already covered how checking out a file for editing prior to making changes prevents conflicts. Let's examine a situation where files are changed without the benefit of file locking. We'll refer to Figure 23.4, where Arthur and Beverly are trying to finish a web project for a major client.

Figure 23.4.
Two-person Web project without file locking.

Arthur grabs a copy of revision 1.5 of index.html and begins editing it. While he is making changes, Beverly also grabs a copy of revision 1.5 of index.html and begins making her changes, independently of Arthur. Arthur checks in his changes as revision 1.6, reports to his manager that the changes are complete, and confidently flies to Belize for his two-week scuba diving vacation. Beverly checks in her changes as revision 1.7, which now contains none of Arthur's changes! Charlie, their manager, discovers that Arthur's changes are not in the weekly release and calls Arthur to find out why, completely ruining Arthur's vacation. Note that even though revision 1.7 is the descendant of 1.6, it doesn't contain the changes Arthur made, since the revision control system simply replaced 1.6 with 1.7. (The system has no way of evaluating what changes should be applied.)

One way to resolve this conflict is to check out both versions 1.6 and 1.7 (to different filenames, of course) and merge them. Arthur's vacation, however, is still ruined.

Figure 23.5.
Two-person Web project with file locking.

Compare this with the second timeline (Figure 23.5). Arthur grabs a locked copy of revision 1.5 of index.html and begins editing it. While he is making changes, Beverly tries to grab a copy of revision 1.5 of index.html, but the source control system informs her that the revision is locked by Arthur and that she cannot check it out for editing. Beverly waits for Arthur to finish, or if her changes are urgent, she contacts Arthur to work out a way to get her changes done quickly. Arthur checks in his changes as revision 1.6, reports to his manager that the changes are complete, and blissfully flies to Australia for his four-week scuba diving vacation (on which he is spending the bonus he received for implementing a source control system for the company). Beverly learns that index.html is no longer locked and checks out revision 1.6. Beverly checks in her changes as revision 1.7, which contains both her modifications and Arthur's. Charlie notices that Arthur's changes are in the weekly release and remembers what a great thing it was that they finally implemented that source control system after Arthur's previous vacation. (Beverly tours Spain for two weeks, and Charlie goes home to play golf, leaving the new developer in charge.)
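The second timeline can be replayed with a toy lock-aware store. The LockingStore class and the user names below are hypothetical and serve only to show why the lock forces Beverly to start from Arthur's finished revision.

```python
class LockedError(Exception):
    """Raised when a file is locked by another user."""


class LockingStore:
    """Toy single-file store that allows one editor at a time."""

    def __init__(self, text):
        self.revisions = {"1.5": text}
        self.latest = "1.5"
        self.locked_by = None

    def check_out(self, user):
        """Lock the latest revision for `user` and return its text."""
        if self.locked_by is not None and self.locked_by != user:
            raise LockedError(f"{self.latest} is locked by {self.locked_by}")
        self.locked_by = user
        return self.revisions[self.latest]

    def check_in(self, user, text):
        """Store a new revision and release the lock."""
        if self.locked_by != user:
            raise LockedError(f"{user} does not hold the lock")
        major, minor = self.latest.split(".")
        self.latest = f"{major}.{int(minor) + 1}"
        self.revisions[self.latest] = text
        self.locked_by = None
        return self.latest


store = LockingStore("index page\n")
arthur_copy = store.check_out("arthur")
try:
    store.check_out("beverly")
    beverly_blocked = False
except LockedError:
    beverly_blocked = True  # Beverly must wait for Arthur

store.check_in("arthur", arthur_copy + "Arthur's change\n")
beverly_copy = store.check_out("beverly")  # now based on Arthur's revision
final = store.check_in("beverly", beverly_copy + "Beverly's change\n")
```

Here final is "1.7", and that revision contains both sets of changes because Beverly's working copy could only be taken after Arthur's check-in.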

Keywords

RCS and SCCS enable you to embed codes into working files that are expanded (converted) into information about the file when it is checked out. These codes can help identify the file once it has left the revision control system and also help you figure out what state the file is in without having to resort to revision control commands.

Some of the options available are:

Branch and version information--The system will insert the version of the file and any applicable branch information.
Line number--The line number where the keyword is placed, which can be very useful for debugging languages that do not have a preprocessor, such as Perl.
Date and time information--The date and/or time that the file was checked out and the date that the latest revision was created.
Module name--The name of the file.
Author--The author of the file and also the name of the last person to lock it (not in SCCS).
Log message--The revision comments (not in SCCS).

The codes available differ for different systems by a wide margin. See the specific chapter (and manual pages) for the system you are using for more information.
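As an illustration of what keyword expansion does, here is a small Python sketch that expands markers written in the RCS style, where a keyword such as $Revision$ becomes $Revision: 1.2 $ on check-out. The function is our own simplification; SCCS uses a different notation, and the real systems support more keywords than the two shown here.

```python
import re

# Matches $Keyword$ as well as an already expanded $Keyword: value $.
KEYWORD = re.compile(r"\$(\w+)(?::[^$\n]*)?\$")


def expand_keywords(text, values):
    """Replace each known keyword marker with its current value.

    Keywords that are missing from `values` are left untouched.
    """
    def substitute(match):
        name = match.group(1)
        if name not in values:
            return match.group(0)
        return f"${name}: {values[name]} $"

    return KEYWORD.sub(substitute, text)


source = "<!-- $Revision$ by $Author$ -->"
expanded = expand_keywords(source, {"Revision": "1.2", "Author": "eric"})
```

Because the pattern also matches an already expanded marker, running the function again with new values updates the text in place rather than nesting the expansions.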

Symbolic Names, Baselines, and Releases

A symbolic name is a name that is attached to a particular revision of a file that can be used to refer to it without having to know the revision number. Therefore, a major milestone in a file's history can be referred to with a name.

NOTE: SCCS does not support symbolic names. See the section on releases in this chapter and Chapter 26 for a possible workaround.

A baseline is a captured set of revisions that have some special association, such as "submitted to editor," "compiles successfully," "ran for two hours without crashing," "released for beta testing." (Of course, the last two might mean the same thing for some development organizations.)

The ability to create symbolic names is probably the most compelling reason to use a more sophisticated revision control system, such as RCS or CVS, instead of SCCS, although SCCS does provide a workaround that should satisfy most situations.
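One way to picture symbolic names and baselines is as a table that maps a label to one revision of each file. The Python sketch below uses a hypothetical two-file project and our own function names; it does not reflect how RCS or CVS actually store this information.

```python
# Revisions that exist for each file (hypothetical project).
history = {
    "index.html": ["1.1", "1.2", "1.3"],
    "hello.html": ["1.1", "1.2"],
}

# Symbolic names: one label mapped to a specific revision of each file.
symbolic_names = {}


def tag(name, revisions):
    """Attach `name` to one revision per file, checking each one exists."""
    for filename, revision in revisions.items():
        if revision not in history.get(filename, []):
            raise ValueError(f"{filename} has no revision {revision}")
    symbolic_names[name] = dict(revisions)


def resolve(name, filename):
    """Look up the revision of `filename` recorded under a symbolic name."""
    return symbolic_names[name][filename]


# A baseline: the set of revisions that were released for beta testing.
tag("beta-test", {"index.html": "1.2", "hello.html": "1.2"})
```

With the baseline recorded, resolve("beta-test", "index.html") returns "1.2" even after index.html has moved on to 1.3.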

Using Releases to Replace Symbolic Names

Without symbolic names, you can achieve a similar effect using release numbers. A release is a baseline, usually with the property of being released for distribution, which, depending upon the type of file, is a program that has been provided to customers in either binary or source form, a document that has been printed and sold or distributed, or perhaps a document that has simply been submitted to someone for approval.

Symbolic names can be replaced by manipulating the revision numbers. When the project hits a milestone, you can either synchronize all of the files' revision numbers (bring them all to the same level, such as 1.7) or increase the major version number of the next revision (the next change for all of the files is checked in at 2.1).

The second method works quite well, since most systems will automatically retrieve the highest minor revision when only a major revision number is specified. So if a project was released with three files at versions 1.1, 1.5, and 1.7, the system will automatically retrieve those versions the next time the major revision number 1 is retrieved since no minor number was specified.
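The retrieval rule behind the second method can be sketched in Python: given only a major number, pick the highest minor revision that exists within it. The file names and histories below are hypothetical, chosen to match the 1.1, 1.5, and 1.7 example.

```python
def latest_in_release(revisions, major):
    """Return the highest minor revision within a major release, or None."""
    minors = [int(rev.split(".")[1]) for rev in revisions
              if rev.count(".") == 1 and int(rev.split(".")[0]) == major]
    if not minors:
        return None
    return f"{major}.{max(minors)}"


# Each file's trunk history after work has continued at 2.x.
project = {
    "a.html": ["1.1", "2.1", "2.2"],
    "b.html": ["1.1", "1.2", "1.3", "1.4", "1.5", "2.1"],
    "c.html": ["1.5", "1.6", "1.7"],
}

# Asking for major revision 1 recovers the released set of files.
release_1 = {name: latest_in_release(revs, 1) for name, revs in project.items()}
```

Even though two of the files have since moved on to major revision 2, asking for release 1 still yields versions 1.1, 1.5, and 1.7.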

Summary

In this chapter we've covered the basic concepts behind revision control, and how it can be used to manage a variety of activities. We demonstrated how users first register a file with the system, then check it out for editing and then check it back in when the changes are done so the system becomes aware of the file's new state. We then discussed how this series of revisions can be viewed as a revision tree, and how files can be extracted from the system at any point on that tree. From there we covered advanced concepts, such as "branching" the tree in order to create more than one version of a project, and how to view a file's revision history.

The advanced section also covered file locking in order to prevent editing conflicts and how to have the version control system automatically add annotations to files when they are checked out. We also touched on the process of merging file revisions and the use of symbolic names and baselines for versions of projects. By understanding these concepts you should not only be able to pick a source control system and learn it rapidly, but also be able to identify situations where adopting a revision control system will help make you more productive.

RCS, which is covered in depth in Chapter 24, has become the most widely used "free" revision control system, primarily because of its advanced features, such as symbolic names, and its availability on all UNIX variants. It is also the basis for CVS, which is covered in depth in Chapter 25. CVS is found in many networked development environments because it simplifies the process of distributing files in a controlled manner while tracking changes.

Chapter 26 covers SCCS, which is the simplest of the revision control systems to learn, and is the system that is most frequently bundled with UNIX variants. It is commonly used for one- or two-person projects that need basic file locking and backup capabilities.