
UNIX Unleashed, Internet Edition

- 4 -

Awk

By Ann Marshall and David B. Horvath

The UNIX utility awk is a pattern-matching and processing language with considerably more power than you might realize. It searches one or more specified files, checking for records that match a specified pattern. If awk finds a match, the corresponding action is performed. Awk is a simple concept, but it is a powerful tool. Often, an awk program is only a few lines long, and because of this, an awk program is often written, used, and discarded. A traditional programming language, such as Pascal or C, would take more thought, more lines of code, and hence, more time.

Short awk programs arise from two of awk's built-in features: the amount of predefined flexibility and the number of details automatically handled by the language. Together, these features allow the manipulation of large data files in short (often single-line) programs and make awk stand apart from other programming languages. Certainly, any time you spend learning awk will pay dividends in improved productivity and efficiency.

Uses

The uses for awk vary from the simple to the complex. Originally, awk was intended for various kinds of data manipulation. Intentionally omitting parts of a file, counting occurrences in a file, and writing reports are natural uses for awk.

Awk uses the syntax of the C programming language; so if you know C, you have an idea of awk syntax. If you are new to programming or don't know C, learning awk will familiarize you with many of the C constructs.

Examples of where awk can be helpful abound. Computer-aided manufacturing, for example, is plagued with nonstandardization, so the output of a computer that's running a particular tool is quite likely to be incompatible with the input required for a different tool. This type of simple data transformation is a perfect awk task, with no need to write a complex C program.

One problem of computer-aided manufacturing today is that no standard format yet exists for the program running the machine. Therefore, the output from computer A running machine A probably is not the input needed for computer B running machine B. Although machine A is finished with the material, machine B is not ready to accept it. Production halts while someone edits the file so it meets computer B's needed format. This is a perfect and simple awk task.

Due to the amount of built-in automation within awk, it is also useful for rapid prototyping or trying out an idea that could later be implemented in another language.

Awk works with text files, not binary files. Because binary data can contain values that look like record terminators (newline characters)--or not have any at the end of the record--awk will get confused. If you need to process binary files, look into Perl or use a traditional programming language such as C.

Features

Reflecting the UNIX environment, awk features resemble the structures of both C and shell scripts. Highlights include flexibility, predefined variables, automation, standard program constructs, conventional variable types, powerful output formatting borrowed from C, and ease of use.

The flexibility means that most tasks may be done more than one way in awk. With the application in mind, the programmer chooses which method to use. The built-in variables already provide many of the tools to do what is needed. Awk is highly automated. For instance, awk automatically retrieves each record, separates it into fields, and does type conversion when needed, without programmer's request. Furthermore, there are no variable declarations. Awk includes the usual programming constructs for the control of program flow: an if statement for two-way decisions and do, for, and while statements for looping. Awk also includes its own notational shorthand to ease typing. (This is UNIX after all!) Awk borrows the printf() statement from C to allow "pretty" and versatile formats for output. These features combine to make awk user-friendly.

A Brief History

Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan created awk in 1977. The name comes from the last initials of the creators. These are some of the same people who created the UNIX operating system and the C programming language. You will see many similarities between awk and C, largely for that reason.

In 1985, more features were added, creating nawk (new awk). For quite a while, nawk remained exclusively the property of AT&T Bell Labs. Although it became part of System V for Release 3.1, some versions of UNIX, such as SunOS, keep both awk and nawk due to a syntax incompatibility. Others, such as System V, run nawk under the name awk (although System V has nawk too). The Free Software Foundation's GNU project introduced its own version of awk--gawk--based on the IEEE POSIX awk standard (Institute of Electrical and Electronics Engineers, Inc., IEEE Standard for Information Technology, Portable Operating System Interface, Part 2: Shell and Utilities Volume 2, ANSI approved 4/5/93), which is different from awk or nawk. Linux, the freely distributed UNIX-like system for PCs, uses gawk rather than awk or nawk. Throughout this chapter, the word awk is used when any of the three (new awk, POSIX awk, or gawk) will do. The versions are mostly upwardly compatible. Awk is the oldest, then nawk, then POSIX awk, and then gawk, as shown in Figure 4.1. I have used the notation version++ to denote a concept that began in that version and continues through any later versions.


NOTE: Due to different syntax, not all code written in the original awk language will run under nawk, POSIX awk, or gawk. However, except when noted, all the concepts of awk are implemented in nawk and gawk. Where it matters, the version is specified. If an example does not work using the awk command, try nawk.

Figure 4.1.
The evolution of awk.

Refer to the end of the chapter for more information and further resources on awk and its derivatives.

Fundamentals

This section introduces the basics of the awk programming language. One feature of awk that almost continually holds true is this: You can do most tasks more than one way. The command line exemplifies this. First, I explain the variety of ways awk can be called from the command line--using files for input, the program file, and possibly an output file. Next, I introduce the main construct of awk, which is the pattern action statement. Then, I explain the fundamental ways awk can read and transform input. I conclude the section with a look at the format of an awk program.

Entering Awk from the Command Line

In its simplest form, awk takes the material you want to process from standard input and displays the results to standard output (the monitor). You write the awk program on the command line.

You can either specify explicit awk statements on the command line, or, with the -f flag, specify an awk program file that contains a series of awk commands. In addition to the standard UNIX design allowing for standard input and output, you can, of course, use file redirection in your shell, too; so awk < inputfile is functionally identical to awk inputfile. To save the output in a file, the file redirection awk > outputfile does the trick. Awk can work with multiple input files at once if they are specified on the command line.

The most common way to use awk is as part of a command pipe, where it filters the output of a command. An example is ls -l | awk '{print $3}', which prints just the third column of each line of the ls output. Awk scripts can become quite complex, so if you have a standard set of filter rules that you would like to apply to a file, with the output sent directly to the printer, you could use something like awk -f myawkscript inputfile | lp.
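The three ways of feeding input to awk described above can be compared side by side. The following is a minimal sketch, assuming a small made-up data set and a temporary file created with mktemp; neither comes from the chapter.

```shell
# Hypothetical sample data: three whitespace-separated columns.
data=$(printf 'alpha 1 red\nbeta 2 green\n')

# Put the same data in a temporary file for the file-based forms.
tmp=$(mktemp)
printf '%s\n' "$data" > "$tmp"

# 1. Program on the command line, input arriving through a pipe.
from_pipe=$(printf '%s\n' "$data" | awk '{print $3}')

# 2. Same program, input named as a file after the program.
from_file=$(awk '{print $3}' "$tmp")

# 3. Same program, input supplied by shell redirection.
from_redirect=$(awk '{print $3}' < "$tmp")

rm -f "$tmp"
printf '%s\n' "$from_pipe"
```

All three forms should print the third column (red, then green); which one you pick is mostly a matter of where the data already lives.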


TIP: To specify your awk script on the command line, it is best to use single quotes to let you embed spaces and to ensure that the command shell does not interpret any special characters in the awk script.

Files for Input

Input and output places can be changed. You can specify an input file by typing the name of the file after the program with a blank space between the two. If you name no input file, the input enters the awk environment from your workstation keyboard (standard input). To signal the end of that input, type Ctrl-D. The program on the command line executes on the input you just entered and the results are displayed on the monitor (the standard output).

Here's a simple little awk command that echoes all lines I type, prefacing each with the number of words (or fields, in awk parlance, hence the NF variable for number of fields) in the line.

Note that Ctrl-D means that while holding down the Control key, you should press the D key.

$ awk '{print NF ": " $0}'
I am testing my typing.
A quick brown fox jumps when vexed by lazy ducks.
Ctrl+D
5: I am testing my typing.
10: A quick brown fox jumps when vexed by lazy ducks.
$ _

You can also name more than one input file on the command line, causing the combined files to act as one input. Naming the same file twice is one way of making multiple passes through one input file.


TIP: Keep in mind that the correct ordering on the command line is crucial for your program to work correctly; files are read from left to right, so if you want to have file1 and file2 read in that order, you'll need to specify them as such on the command line.
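To see several input files acting as one input, here is a small sketch. The file names file1 and file2 match the tip above, but their contents and the temporary directory are assumptions made for illustration.

```shell
# Create two small hypothetical input files in a scratch directory.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/file1"
printf 'c\n'    > "$dir/file2"

# NR keeps counting across files, so the two files behave as one input
# read from left to right.
combined=$(awk '{print NR ": " $0}' "$dir/file1" "$dir/file2")

rm -rf "$dir"
printf '%s\n' "$combined"
```

Swapping the two file names on the command line would put c first, which is why the ordering matters.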

The Program File

With awk's automatic type conversion, a file of names and a file of numbers entered in the reverse order at the command line generate strange-looking output rather than an error message. For longer programs, it is simpler to put the program in a file and specify the name of that file on the command line. The -f option does this. Notice that -f takes the name of the program file as its argument, and that any input files still come last on the command line, as in awk -f programfile inputfile.


NOTE: Versions of awk that meet the POSIX awk specifications are allowed to have multiple -f options. You can use this capability for running multiple programs using the same input.
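As a quick illustration of the -f option, the sketch below writes a one-rule program to a temporary file and runs it. The program text and the use of mktemp are assumptions for the example, not something taken from the chapter.

```shell
# Write a tiny awk program to a (hypothetical) program file.
prog=$(mktemp)
cat > "$prog" <<'EOF'
# Number each input line.
{ print NR ": " $0 }
EOF

# -f names the program file; input here comes from a pipe.
numbered=$(printf 'first\nsecond\n' | awk -f "$prog")

rm -f "$prog"
printf '%s\n' "$numbered"
```

Keeping the program in a file also sidesteps shell quoting problems, because nothing in the program passes through the command line.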

Specifying Output on the Command Line

Output from awk may be redirected to a file or piped to another program. (See Chapter 4, Volume I, "The UNIX File System.") The command awk '/^5/ {print $0}' | grep 3, for example, will result in just those lines that start with the digit 5 (that's what the awk part does) and also contain the digit 3 (the grep command). If you wanted to save that output to a file, by contrast, you could use awk '/^5/ {print $0}' > results and the file results would contain all lines prefaced by the digit 5. If you opt for neither of these courses, the output of awk will be displayed on your screen directly, which can be quite useful in many instances, particularly when you're developing or fine-tuning your awk script.

Patterns and Actions

Awk programs are divided into three main blocks: the BEGIN block, the per-record processing block, and the END block. Unless explicitly stated, all statements to awk appear in the per-record block. (You'll see later where the other blocks can come in particularly handy for programming, though.)

Statements within awk are divided into two parts: a pattern, telling awk what to match, and a corresponding action, telling awk what to do when a line matching the pattern is found. The action part of a pattern-action statement is enclosed in curly braces ({}) and can be multiple statements. Either part of a pattern-action statement may be omitted. An action with no specified pattern matches every record of the input file you want to search. (That's how the earlier example of {print $0} worked.) A pattern without an action indicates that you want the matching input records to be copied to the output as they are (that is, printed).

/^5/ {print $0} is an example of a two-part statement. The pattern is all lines that begin with the digit 5. (The ^ indicates that it should appear at the beginning of the line; without this modifier, the pattern would say any line that includes the digit 5.) The action prints the entire line, verbatim. ($0 is shorthand for the entire line.)
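The three shapes a statement can take (pattern only, action only, and both) can be tried on a few lines of made-up input. This is a sketch; the sample data is an assumption.

```shell
# Hypothetical input: a number and a word on each line.
input=$(printf '51 apples\n17 pears\n5 plums\n')

# Pattern only: matching records are printed unchanged.
pattern_only=$(printf '%s\n' "$input" | awk '/^5/')

# Action only: the action runs for every record.
action_only=$(printf '%s\n' "$input" | awk '{print $2}')

# Pattern and action: the action runs only for matching records.
both=$(printf '%s\n' "$input" | awk '/^5/ {print $2}')

printf '%s\n---\n%s\n---\n%s\n' "$pattern_only" "$action_only" "$both"
```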

Input

Awk automatically scans, in order, each record of the input file, testing it against each pattern-action statement in the awk program. Unless otherwise set, awk assumes each record is a single line. (See the sections "Advanced Concepts," "Multiline Records" in this chapter for how to change this.) If the input file has blank lines in it, each blank line counts as a record too. Awk automatically retrieves each record for analysis; there is no read statement in awk.

A programmer can also disrupt the automatic input order in one of two ways: with the next and exit statements. The next statement tells awk to retrieve the next record from the input file and continue, without running the current input record through the remaining pattern-action statements in the program. For example, if you are doing a crossword puzzle and all the letters of a word are formed by previous words, most likely you wouldn't even bother to read that clue but simply skip to the clue below; this is how the next statement would work, if your list of clues were the input. The other method of disrupting the usual flow of input is through the exit statement. The exit statement transfers control to the END block--if one is specified--or quits the program, as if all the input has been read. Suppose the arrival of a friend ends your interest in the crossword puzzle, but you still put the paper away. Within the END block, an exit statement causes the program to quit.
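A short sketch of next and exit working together follows. The input lines, the STOP marker, and the variable name count are all made up for the example.

```shell
result=$(printf 'keep 1\n# comment\nkeep 2\nSTOP\nkeep 3\n' | awk '
/^#/   { next }              # skip comment lines; later rules never see them
/STOP/ { exit }              # stop reading input; control passes to END
       { count++; print }    # everything else is counted and printed
END    { print count " records printed" }
')
printf '%s\n' "$result"
```

The record after STOP is never read, yet the END block still runs and reports the two records that were printed.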

An input record refers to the entire line of a file including any characters, spaces, or tabs. The spaces and tabs are called whitespace.


TIP: If you think that your input file includes both spaces and tabs, you can save yourself a lot of confusion by ensuring that all tabs become spaces with the expand command. It works like this: expand filename | awk '{ stuff }'. If your system does not have expand, you can use pr -e.

The whitespace in the input file and the whitespace in the output file are not related; you must explicitly put whitespace in your output file.

Fields

A group of characters in the input record or output file is called a field. Fields are predefined in awk: $1 is the first field, $2 is the second, $3 is the third, and so on. $0 indicates the entire line. Fields are separated by a field separator (any single character including Tab) held in the variable FS. Unless you change it, FS has a space as its value; with this default, any run of spaces or tabs separates fields. You can change FS by either starting the program file with the following statement:

BEGIN {FS = "c" }

or by setting the -Fc command-line option where "c" and c are the single selected field separator characters you want to use.

One file that you might have viewed, which demonstrates where changing the field separator could be helpful, is the /etc/passwd file that defines all user accounts. Rather than having the different fields separated by spaces or tabs, the password file is structured with lines that look like:

news:?:6:11:USENET News:/usr/spool/news:/bin/ksh

Each field is separated by a colon. You could change each colon to a space (with sed, for example), but that wouldn't work too well. The fifth field, USENET News, contains a space already. You should change the field separator. If you wanted to have a list of the fifth fields in each line, for example, you could use the simple awk command awk -F: '{print $5}' /etc/passwd.
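Both ways of changing the field separator can be checked against a single passwd-style line. The sketch below uses a made-up account line rather than a real /etc/passwd, so it runs the same way on any system.

```shell
# Hypothetical passwd-style record; the fifth field contains a space.
line='jdoe:x:1001:100:Jane Doe:/home/jdoe:/bin/sh'

# Separator set with the -F command-line option.
with_flag=$(printf '%s\n' "$line" | awk -F: '{print $5}')

# Separator set inside the program, in a BEGIN block.
with_begin=$(printf '%s\n' "$line" | awk 'BEGIN {FS = ":"} {print $5}')

printf '%s\n%s\n' "$with_flag" "$with_begin"
```

Either form should print Jane Doe with the embedded space intact, because only the colon now separates fields.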

Likewise, the built-in variable OFS holds the value of the output field separator. OFS also has a default value of a space. It, too, may be changed by placing the following line at the start of a program.

BEGIN {OFS = "c" }

If you wanted to automatically translate the /etc/passwd file so that it listed only the first and fifth fields, separated by a tab, you would use the awk script:

BEGIN { FS=":" ; OFS="\t" }     # "\t" sets OFS to a tab character
{ print $1, $5 }

The script contains two blocks: the BEGIN block and the main per-input line block. Also, most of the work is done automatically.

Program Format

With a few noted exceptions, awk programs are free format. The interpreter ignores any blank lines in a program file (also known as an awk script). Add blank lines to improve the readability of your program. The same is true for tabs and spaces between operators and the parts of a program. Therefore, these two lines are treated identically by the awk interpreter:

$4 == 2               {print "Two"}
$4     ==     2     {     print     "Two"     }

If more than one action appears on a line, you'll need to separate the actions with a semicolon, as shown previously in the BEGIN block for the /etc/passwd file translator. If you stick with one command per line, you won't need to worry too much about the semicolons. There are a couple of spots, however, in which the semicolon must always be used: before an else statement or when included in the syntax of a statement. (See the "Loops" or "The Conditional Statement" sections in this chapter.)

Putting a semicolon at the end of a statement is useful when you have a C language background or convert your awk code to a compiled C program.

The other format restriction for awk programs is that at least the opening curly bracket of the action (of a pattern-action statement) must be on the same line as the accompanying pattern. Thus, the following examples all do the same thing.

The first shows all statements on one line:

$2==0     {print ""; print ""; print "";}

The second example puts the first statement on the same line as the pattern to match and the remaining statements on the following lines:

$2==0     {     print ""
          print ""
          print ""}

You can spread out the statements even more by moving the first statement to its own line. Only the initial (opening) curly bracket has to be on the same line as the pattern:

$2==0     {
          print ""
          print ""
          print ""
     }

When the second field of the input file is equal to 0, awk prints three blank lines to the output file.


NOTE: Notice that print "" prints a blank line to the output file, whereas the statement print alone prints the current input line.

An awk program file might have commentary within. Anything typed from a # to the end of the line is considered a comment and is ignored by awk. Comments are notes explaining what is going on in words, not computerese.

A Note on awk Error Messages

Awk error messages (when they appear) tend to be cryptic. Often, due to the brevity of the program, a typo is easily found. Not all errors are as obvious; I have scattered some examples of errors throughout this chapter.

Print Selected Fields

Awk includes three ways to specify printing. The first is implied. A pattern without an action assumes that the action is to print. The two ways of actively commanding awk to print are print and printf(). For simplicity, only implied printing and the print statement are shown here. printf is discussed in a later section titled "Input/Output" and is used mainly for precise output. This section demonstrates the first two types of printing through some step-by-step examples.

Program Components

If I wanted to look for a particular user in the /etc/passwd file, I could enter an awk command to find a match but omit an action. The following command line puts a list on-screen.

$ awk '/Ann/' /etc/passwd

amarshal:oPWwC9qVWI/ps:2005:12:Ann Marshall:/usr/grad/amarshal:/bin/csh
andhs26:0TFnZSVwcua3Y:2488:23:DeAnn O'Neal:/usr/lstudent/andhs26:/bin/csh
alewis:VYfz4EatT4OoA:2623:22:Annie Lewis:/usr/lteach/alewis:/bin/csh
cmcintyr:0FciKEDDMkauU:2630:22:Carol Ann McIntyre:/usr/lteach/cmcintyr:/bin/csh
jflanaga:ShrMnyDwLI/mM:2654:22:JoAnn Flanagan:/usr/lteach/jflanaga:/bin/csh
lschultz:mic35ZiFj9zWk:3060:22:Lee Ann Schultz, :/usr/lteach/lschultz:/bin/csh
akestle:job57Lb5/ofoE:3063:22:Ann Kestle.:/usr/lteach/akestle:/bin/csh
bakehs59:yRYV6BtcW7wFg:3075:23:DeAnna Adlington, Baker :/usr/bakehs59:/bin/csh
ahernan:AZZPQNCkw6ffs:3144:23:Ann Hernandez:/usr/lstudent/ahernan:/bin/csh
$ _

I look on the monitor and see the correct spelling.


NOTE: For the sake of making a point, suppose I had chosen the pattern /Anne/. A quick glance above shows that there would be no matches. Entering awk '/Anne/' /etc/passwd would produce nothing but another system prompt to the monitor. This can be confusing if you expect output. The same goes the other way; above, I wanted the name Ann, but the names DeAnn, Annie, JoAnn, and DeAnna matched, too. Sometimes choosing a pattern too long or too short can cause an unneeded headache.

The grep command can perform the same search performed using awk in the above example. The real power of awk searching comes from searching specific fields like this:

$ awk -F: '$5 ~ /^Ann/' /etc/passwd

amarshal:oPWwC9qVWI/ps:2005:12:Ann Marshall:/usr/grad/amarshal:/bin/csh
alewis:VYfz4EatT4OoA:2623:22:Annie Lewis:/usr/lteach/alewis:/bin/csh
akestle:job57Lb5/ofoE:3063:22:Ann Kestle.:/usr/lteach/akestle:/bin/csh
ahernan:AZZPQNCkw6ffs:3144:23:Ann Hernandez:/usr/lstudent/ahernan:/bin/csh
$ _

I'll discuss more about advanced search strings in the "Patterns" section.


TIP: If a pattern match is not found, look for a typo in the pattern you are trying to match.

The Input File and Program

Printing specified fields of an ASCII (plain text) file is a straightforward awk task. Because this program example is so short, only the input is in a file. The first input file, sales, is a file of car sales by month. Each line consists of a salesperson's name, followed by three monthly sales figures. The last field is that person's total sales for the three months.

$ cat sales
John Anderson,12,23,7,42
Joe Turner,10,25,15,50
Susan Greco,15,13,18,46
Bob Burmeister,8,21,17,46

The following command line prints the salesperson's name and the total sales for the first quarter.

$ awk -F, '{print $1,$5}' sales

John Anderson 42
Joe Turner 50
Susan Greco 46
Bob Burmeister 46

A comma (,) between field variables indicates that I want OFS applied between output fields, as shown in a previous example. Remember, without the comma, no field separator will be used and the displayed output fields (or output file) will all run together.
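The effect of the comma is easy to check on one line of the sales data. This is a sketch; the third variant, which sets OFS to a colon and a space, is an assumption added for contrast.

```shell
record='John Anderson,12,23,7,42'

# With a comma, OFS (a space by default) is placed between the fields.
with_comma=$(printf '%s\n' "$record" | awk -F, '{print $1, $5}')

# Without a comma, the two values are concatenated and run together.
without_comma=$(printf '%s\n' "$record" | awk -F, '{print $1 $5}')

# Changing OFS changes what the comma inserts.
with_ofs=$(printf '%s\n' "$record" | awk -F, 'BEGIN {OFS = ": "} {print $1, $5}')

printf '%s\n%s\n%s\n' "$with_comma" "$without_comma" "$with_ofs"
```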


TIP: Putting two field separators in a row inside a print statement creates a syntax error with the print statement; however, using the same field twice in a single print statement is valid syntax. For example:

awk '{print($1,$1)}'



			

Patterns

A pattern is the first half of an awk program statement. In awk, there are six accepted pattern types. You have already seen a couple of them, including BEGIN, and a specified, slash-delimited pattern, in use. Awk has many string-matching capabilities arising from patterns and uses regular expressions in patterns. A range pattern locates a sequence. All patterns except range patterns may be combined in a compound pattern.

This section explores exactly what is meant by a pattern match. What kind of pattern you can match depends on exactly how you're using the awk pattern-specification notation.

BEGIN and END

The two special patterns BEGIN and END may be used to indicate a match, either before the first input record is read or after the last input record is read, respectively. Some versions of awk require that, if used, BEGIN must be the first pattern of the program and, if used, END must be the last pattern of the program. This is good practice to follow even if the version you use does not require it. Examples in this chapter will follow this practice. Using the BEGIN pattern for initializing variables is common (although variables can be passed from the command line to the program too; see the section "Command-Line Arguments"). The END pattern is used for things which are input-dependent, such as totals.

If I wanted to know how many lines were in a given program, I would type the following line:

$ awk 'END {print "Total lines: " NR}' myprogram

I see Total lines: 256 on the monitor and therefore know that the file myprogram has 256 lines. At any point while awk is processing the file, the variable NR counts the number of records read so far. NR at the end of a file has a value equal to the number of lines in the file.

How might you see a BEGIN block in use? Your first thought might be to initialize variables, but if something is a numeric value, it's automatically initialized to 0 before its first use. Instead, perhaps you're building a table of data and want to have some columnar headings. With this in mind, here's a simple awk script that shows you all the accounts that people named Dave have on your computer:

BEGIN {
     FS=":"        # remember that the passwd file uses colons
     OFS="\t"      # we're setting the output to a TAB
     print "Account", "Username"
     }
/Dav/     {print $1, $5}

Here's what it looks like in action (I've called this file daves.awk, although the program matches Dave and David):

$ awk -f daves.awk /etc/passwd
Account     Username
andrews     Dave Andrews
d3          David Douglas Dunlap
daves       Dave Smith
taylor      Dave Taylor

Note that you could also easily have a summary of the total number of matched accounts by adding a variable that's incremented for each match and then printing it in the END block. Here's one way to do it:

BEGIN {  FS=":" ; OFS="\t"  # input colon separated, output tab separated
     print "Account", "Username"
     }
/Dav/     {print $1, $5 ; matches++ }
END     {print "A total of " matches " matches."}

Here, you can see how awk allows you to shorten the length of programs by having multiple items on a single line, which is particularly useful for initialization. Also, notice the C increment notation: matches++ is functionally identical to matches = matches + 1 and matches += 1. Finally, also note that I did not initialize the variable matches to 0 because it was done automatically by the awk system.

Expressions

Any expression can be used with any operator in awk. An expression consists of any operator in awk and its corresponding operand in the form of a pattern-match statement. Type conversion--variables being interpreted as numbers at one point, but strings at another--is automatic but never explicit. The type of operand needed is decided by the operator type. If a numeric operator is given a string operand, it is converted, and vice versa.


TIP: To force a conversion from string to number, add 0 (+ 0). To explicitly convert a number to a string, concatenate "" (the null string) to the variable.

Two quick examples are these: num=3; num=num "" creates a new numeric variable and sets it to the number three; appending a null string to it converts it to a string (the string containing the character 3). Conversely, str="3"; str=str + 0 creates a string and then, by adding 0 to it, forces it to a numeric value.
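The two forced conversions can be watched in action with a BEGIN-only program, which needs no input at all. This is a sketch; the variable names and the string "3 apples" are made up.

```shell
conv=$(awk 'BEGIN {
    str  = "3 apples"
    num  = str + 0       # numeric context: the leading digits give 3
    text = (2 + 2) ""    # string context: the number 4 becomes "4"
    # Print the number, the length of the string, and a concatenation.
    print num, length(text), (text "2")
}')
printf '%s\n' "$conv"
```

This should print 3 1 42: the string yields the number 3, the converted number is one character long, and joining it to "2" produces 42 rather than 6.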


Any expression can be a pattern. If the pattern, in this case the expression, evaluates to a non-zero or non-null value, the pattern matches that input record. Patterns often involve comparison. Table 4.1 shows the valid awk comparison operators.

Table 4.1. Comparison operators in awk.

Operator Meaning
== Equal to
< Less than
> Greater than
<= Less than or equal to
>= Greater than or equal to
!= Not equal to
~ Matched by
!~ Not matched by

In awk, as in C, the logical equality operator is == rather than =. The single = assigns values, whereas == compares values. When the pattern is a comparison, the pattern matches if the comparison is true (non-null or non-zero). Here's an example. What if you wanted to only print lines wherein the first field had a numeric value of less than 20? Here's how:

$1 < 20 {print $0}

If the expression is arithmetic, it is matched when it evaluates to a non-zero number. For example, here's a small program that will print the first 10 lines that have exactly 7 words:

BEGIN  {i=0}
NF==7 { print $0 ; i++ }
i==10 {exit}

There's another way that you could use these comparisons too, because awk understands collation orders (that is, whether words are greater or lesser than other words in a standard dictionary ordering). Consider the situation wherein you have a phone directory--a sorted list of names--in a file and you want to print all the names that would appear in the corporate phone book before a certain person, say D. Hughes. You could do this quite succinctly:

$1 >= "Hughes,D" { exit }
{ print $0 }

When the pattern is a string, a match occurs if the expression is non-null. In the earlier example with the pattern /Ann/, it was assumed to be a string because it was enclosed in slashes. In a comparison expression, if both operands have a numeric value, the comparison is based on the numeric value. Otherwise, the comparison is made using string ordering, which is why this simple example works.


TIP: You can write more than two comparisons to a line in awk.

The pattern $2 <= $1 could involve either a numeric comparison or a string comparison. Whichever it is, it will vary from file to file or even from record to record within the same file.


TIP: Know your input file well when using such patterns, particularly since awk will often silently assume a type for the variable and work with it, without error messages or other warnings.
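How the same comparison switches between numeric and string ordering can be seen on two made-up records. In this sketch, LC_ALL=C is set as an assumption so that string ordering is plain byte order.

```shell
# Record 1: both fields look numeric, so 10 < 9 is tested as numbers (false).
# Record 2: "9x" is not a number, so "10" < "9x" is tested as strings (true).
mixed=$(printf '10 9\n10 9x\n' | LC_ALL=C awk '$1 < $2')
printf '%s\n' "$mixed"
```

Only the second record is printed, even though its first field is the same as in the first record. Nothing warns you that the kind of comparison changed.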

String Matching

There are three forms of string matching. The simplest is to surround a string by slashes (/). No quotation marks are used. Hence /"Ann"/ is actually the string "Ann" with the quotation marks included, not the string Ann--and /"Ann"/ matches only records that contain those quotation marks. The entire input record is returned if the expression within the slashes is anywhere in the record. The other two matching operators have a more specific scope. The operator ~ means "is matched by," and the pattern matches when the input field being tested for a match contains the substring on the right side.

$2 ~ /mm/

This example matches every input record containing mm somewhere in the second field. It could also be written as $2 ~ "mm".

The other operator !~ means "is not matched by."

$2 !~ /mm/

This example matches every input record not containing mm anywhere in the second field.

Armed with that explanation, you can now see that /Ann/ is really just shorthand for the more complex statement $0 ~ /Ann/.
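The three forms of matching can be compared on a few made-up records. In this sketch, the second record has mm in its first field only, which is what separates a whole-record match from a field match.

```shell
input=$(printf 'a hammer\nmm nail\nc comma\n')

# Whole-record match: /mm/ is shorthand for $0 ~ /mm/.
whole=$(printf '%s\n' "$input" | awk '/mm/')
explicit=$(printf '%s\n' "$input" | awk '$0 ~ /mm/')

# Field match: only the second field is tested.
field=$(printf '%s\n' "$input" | awk '$2 ~ /mm/')

# Negated field match: records whose second field lacks mm.
not_field=$(printf '%s\n' "$input" | awk '$2 !~ /mm/')

printf '%s\n---\n%s\n' "$field" "$not_field"
```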

Regular expressions are common to UNIX, and they come in two main flavors. You have probably used them subconsciously on the command line as wildcards, where * matches zero or more characters and ? matches any single character. For instance, entering the line below results in the command interpreter matching all files with the suffix abc and the rm command deleting them.

rm *abc

Awk works with regular expressions that are similar to those used with grep, sed, and other editors but subtly different from the wildcards used with the command shell. In particular, . matches any single character and * matches zero or more of the previous character in the pattern. (A pattern of x*y will match anything that has any number of the letter x, including none, followed by a y. To force at least one x to appear, you'd need to use the regular expression xx*y instead.) By default, patterns can appear anywhere on the line, so to have them tied to an edge, you need to use ^ to indicate the beginning of the word or line and $ for the end. If you wanted to match all lines where the first word ends in abc, for example, you could use $1 ~ /abc$/. The following line matches all records where the fourth field begins with the letter a:

$4 ~ /^a.*/
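The difference between x*y and xx*y shows up clearly when the patterns are anchored to the whole line. This is a sketch with made-up input; the ^ and $ anchors are added here so that each line must match in full.

```shell
input=$(printf 'y\nxy\nxxy\nxz\n')

# x*y: zero or more x, then y, so a lone y also matches.
star=$(printf '%s\n' "$input" | awk '/^x*y$/')

# xx*y: one x, then zero or more x, then y, so at least one x is required.
at_least_one=$(printf '%s\n' "$input" | awk '/^xx*y$/')

printf '%s\n---\n%s\n' "$star" "$at_least_one"
```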

Range Patterns

The pattern portion of a pattern/action pair can also consist of two patterns separated by a comma (,); the action is performed for all lines between the first occurrence of the first pattern and the next occurrence of the second.

At most companies, employees receive different benefits according to their respective hire dates. It so happens that I have a file listing all employees in my company, including their hire dates. If I wanted to write an awk program that just lists the employees hired between 1980 and 1985, I could use the following script, if the first field is the employee's name and the third field is the year hired. Here's how that data file might look. (Notice that I use : to separate fields so that we don't have to worry about the spaces in the employee names.)

$ cat emp.data.
John Anderson:sales:1980
Joe Turner:marketing:1982
Susan Greco:sales:1985
Ike Turner:pr:1988
Bob Burmeister:accounting:1991

The program could then be invoked:

$ awk -F: '$3 == 1980,$3 == 1985 {print $1, $3}' emp.data

With the output:

John Anderson 1980
Joe Turner 1982
Susan Greco 1985


TIP: The preceding example works because the input is already in order according to hire year. Range patterns often work best with presorted input. This particular data file would be a bit tricky to sort within UNIX, but you could use the rather complex command sort -t: -k3,3n emp.data > new.emp.data to sort things correctly. (See Chapter 3, "Text Editing with vi and EMACS," for more details on using the powerful sort command.)

Range patterns are inclusive; they include both the first item matched and the end data indicated in the pattern. The range pattern matches all records from the first occurrence of the first pattern to the first occurrence of the second. This is a subtle point, but it has a major effect on how range patterns work. First, if the second pattern is never found, all remaining records match. So given the input file here:

$ cat sample.data
1
3
5
7
9
11

The following output appears on the monitor, totally disregarding that 9 and 11 are out of range.

$ awk '$1==3, $1==8' sample.data
3
5
7
9
11

The end pattern of a range is not equivalent to a <= operator, although liberal use of these patterns can alleviate the problem, as shown in the employee hire date example. Using compound patterns is one way to get around this limitation.

Second, after the end pattern closes a range, awk goes back to looking for the first pattern, and a new range opens only if that pattern matches again. Records that lie numerically inside the range but appear while no range is open are not selected. That's why you have to make sure that the data is sorted as you expect.
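
The compound-pattern workaround can be sketched as follows. It feeds awk the same six values as sample.data (supplied inline here so the snippet is self-contained) and selects by value rather than by start and stop markers, so the missing 8 no longer matters.

```shell
# Same values as sample.data, piped in rather than read from a file.
out=$(printf '1\n3\n5\n7\n9\n11\n' | awk '$1 >= 3 && $1 <= 8')
printf '%s\n' "$out"
```

Because each record is tested on its own, this form does not depend on the order of the input.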


CAUTION: Range patterns cannot be parts of a larger pattern.

A more useful example of the range pattern comes from awk's capability to handle multiple input files. I have a function finder program that finds code segments I know exist and tells me where they are. The code segments for a particular function X, for example, are bracketed by the phrase "function X" at the beginning and } /* end of X at the end. It can be expressed as the awk pattern range:

'/function functionname/,/} \/\* end of functionname/'

Compound Patterns

Patterns can be combined using the logical operators and parentheses as needed. (See Table 4.2.)

Table 4.2. The logical operators in awk.

Operator Meaning
! Not
|| Or (you can also use | in regular expressions)
&& And

The pattern can be simple or quite complicated: (NF<4) || (NF>4). This matches all input records not having exactly four fields. As is usual in awk, there are a wide variety of ways to do the same thing (specify a pattern). Regular expressions are allowed in string matching, but their use is not forced. To form a pattern that matches strings beginning with a or b or c or d, there are several pattern options:

/^[a-d].*/
/^a.*/ || /^b.*/ || /^c.*/ || /^d.*/


NOTE: When using range patterns, $1==2, $1==4 and $1>=2 && $1<=4 are not the same ranges. First, the range pattern depends on the occurrence of the second pattern as a stop marker, not on the value indicated in the range. Second, a range opens only when a record matches the first pattern exactly; records whose values fall inside the numeric interval but appear when no range is open are not selected.

For instance, consider the following simple input file:

$ cat mydata
1     0
3     1
4     1
5     1
7     0
4     2
5     2
1     0
4     3

The first range I try, '$1==3,$1==5', produces

$ awk '$1==3,$1==5' mydata
3     1
4     1
5     1

Compare this to the following pattern and output:

$ awk '$1>=3 && $1<=5' mydata
3     1
4     1
5     1
4     2
5     2
4     3

Range patterns cannot be parts of a combined pattern.

Actions

As the name suggests, the action part tells awk what to do when a pattern is found. Patterns are optional. An awk program built solely of actions looks like a program in other iterative programming languages. But looks are deceptive; even without patterns, awk runs each input record through every pattern-action statement in turn before it reads the next record.

Actions must be enclosed in curly braces ({}), whether accompanied by a pattern or alone. An action part can consist of multiple statements. When consecutive actions have no pattern and are single statements (no compound loops or conditions), you can either give each statement its own pair of braces or group them all inside one pair, as long as the group begins with a left curly brace and ends with a right curly brace. Consider the following three action pieces:

{
   name = $1;
   print name;
}

and

{name = $1
print name}

and

{name = $1}
{print name}

These three produce identical output. Personally, I use the first because I find it more readable (and I code my C programs the same way).

Variables

Variables are an integral part of any programming language: the virtual boxes within which you can store values, count things, and more. In this section, I talk about variables in awk. Awk has three types of variables: user-defined variables, field variables, and predefined variables that are provided by the language automatically. Awk doesn't have variable declarations. A variable comes to life the first time it is mentioned.


CAUTION: Because there are no declarations, be doubly careful to initialize all the variables you use, although you can always be sure that they automatically start with the value 0 (or the empty string, when used as text).

Naming

The rule for naming user-defined variables is that they can be any combination of letters, digits, and underscores, as long as the name starts with a letter. It is helpful to give a variable a name indicative of its purpose in the program. Variables already defined by awk are written in all uppercase. Because awk is case-sensitive, ofs is not the same variable as OFS and capitalization (or lack thereof) is a common error. You have already seen field variables--variables beginning with $, followed by a number, and indicating a specific input field.

A variable is a number, string, or both. There is no type declaration, and type conversion is automatic if needed. Recall the employee file used earlier. For illustration, suppose I entered the program awk -F: '{ print $1 * 10}' emp.data; awk obligingly provides the rest:

0
0
0
0
0

Of course, this makes no sense. The point is that awk did exactly what it was asked without complaint: It multiplied the name of the employee times 10, and when it tried to translate the name into a number for the mathematical operation it failed, resulting in a zero. Ten times zero is still zero.
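
Here is a small sketch of the same conversion with one extra, made-up record whose first field starts with digits. That second line is an assumption added for illustration and is not part of emp.data. When a string begins with a number, awk uses that leading number and ignores the rest.

```shell
# First line is from emp.data; the second is a hypothetical record.
out=$(printf 'John Anderson:sales:1980\n12 units:sales:1980\n' |
      awk -F: '{ print $1 * 10 }')
printf '%s\n' "$out"
```

The name converts to 0, while "12 units" converts to 12, which is why it pays to keep an eye on what is actually in a field before doing arithmetic on it.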

Awk in a Shell Script

Before examining the next example, review what you know about shell programming (Chapters 8-13 of Volume I). Remember, every file containing shell commands needs to be changed to an executable file before you can run it as a shell script. To do this, enter chmod +x filename from the command line.

Sometimes, awk's automatic type conversion benefits you. Imagine that I'm still trying to build an office system with awk scripts and this time I want to be able to maintain a running monthly sales total based on a data file that contains individual monthly sales. It looks like this:

$ cat monthly.sales
John Anderson,12,23,7
Joe Turner,10,25,15
Susan Greco,15,13,18
Bob Burmeister,8,21,17

These need to be added together to calculate the running totals for each person's sales. Let a program do it!

$ cat total.awk
BEGIN      {FS=",";     #Input fields are separated by commas
            OFS=",";}   #Put a comma in the output
{print $1, " monthly sales summary: " $2+$3+$4 }

That's the awk script, so let's see how it works:

$ awk -f total.awk monthly.sales
John Anderson, monthly sales summary: 42
Joe Turner, monthly sales summary: 50
Susan Greco, monthly sales summary: 46
Bob Burmeister, monthly sales summary: 46


CAUTION: Always run your program once to be sure it works before you make it part of a complicated shell script.

The shell script used to run the awk script would look like this:

#!/bin/ksh
# always specify your shell
awk -f total.awk monthly.sales
exit $?         # return awk's return code

Your task has been reduced to entering the monthly sales figures in the sales file and editing the program file total.awk to include the correct number of fields. (You could put in a for loop like for(i=2; i<=NF; i++) so that the number of fields is correctly calculated--but printing is a hassle and needs an if statement with 12 else if clauses.)

In this case, not having to wonder whether a digit is part of a string or a number is helpful. Just keep an eye on the input data, because awk performs whatever actions you specify, regardless of the actual data type with which you're working.
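
The for loop mentioned above can be sketched like this. It is only an illustration, not a replacement for total.awk: the variable name total is my own, and the second input line carries a made-up fourth month so that the two records have different numbers of fields.

```shell
# Sum every field after the name, however many months there are.
out=$(printf 'John Anderson,12,23,7\nJoe Turner,10,25,15,4\n' |
      awk -F, '{ total = 0
                 for (i = 2; i <= NF; i++) total += $i
                 print $1 ", monthly sales summary: " total }')
printf '%s\n' "$out"
```

Resetting total at the start of each record matters; without it, the sum would carry over from one salesperson to the next.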

Built-In Variables

The built-in variables found in awk provide useful data to your program. The ones available vary with each version of awk; for that reason, notes are included for those variables found in nawk, POSIX awk, and gawk. As before, unless otherwise noted, the variables of earlier releases can be found in the later implementations. The built-in variables are summarized in Table 4.3 at the end of this section.

Awk was released first and contains the core set of built-in variables used by all updates. Nawk expands the set. The POSIX awk specification encompasses all variables defined in nawk plus one additional variable. Gawk applies the POSIX awk standards and then adds some built-in variables that are found in gawk alone; the built-in variables noted when discussing gawk are unique to gawk. This list is a guideline, not a hard and fast rule. For instance, the built-in variable ENVIRON is formally introduced in the POSIX awk specifications; it exists in gawk; it is also in the System V implementation of nawk, but not in SunOS. (See Chapter 5, Volume I, "General Commands," for more information on how to use man pages.)

In all implementations of awk, built-in variables are written entirely in uppercase.

Built-In Variables for Awk

When awk first became a part of UNIX, the built-in variables were the bare essentials. As the name indicates, the variable FILENAME holds the name of the current input file. Recall the function finder code, and add a new line:

/function functionname/,/} \/\* end of functionname/ {print $0}
END     {print ""; print "Found in the file " FILENAME}

This adds the finishing touch.

The value of the variable FS determines the input field separator. FS has a space as its default value. The built-in variable NF contains the number of fields in the current record. (Remember, fields are akin to words, and records are input lines.) This value can change for each input record.

What happens if within an awk script I have the following statement?

$3 = "Third field"

It reassigns $3, and awk rebuilds the record $0 from the fields, joining them with OFS; if the record had fewer than three fields, NF grows to 3. The total number of records read so far can be found in the variable NR. The variable OFS holds the value for the output field separator. The default value of OFS is a space. The value for the output format for numbers resides in the variable OFMT, which has a default value of %.6g. This is the format the print statement uses for numbers; its syntax comes from the C printf format string. ORS is the output record separator. Unless changed, the value of ORS is newline (\n).
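
A quick sketch of what a field assignment does to the record and to NF. The input strings and the OFS value of "-" are made up for illustration; the dash just makes the rebuilt record easy to spot.

```shell
# Assigning to an existing field: $0 is rebuilt with OFS, NF is unchanged.
out=$(echo 'a b c d' |
      awk 'BEGIN { OFS = "-" } { $3 = "Third field"; print; print NF }')

# Assigning past the last field: the gap is filled with an empty field
# and NF grows to match.
out2=$(echo 'a b' |
       awk 'BEGIN { OFS = "-" } { $4 = "x"; print; print NF }')

printf '%s\n%s\n' "$out" "$out2"
```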

Built-In Variables for Nawk


NOTE: When awk was expanded in 1985, part of the expansion included adding more built-in variables.


CAUTION: Some implementations of UNIX simply put the new code in the spot for the old code and didn't bother keeping both awk and nawk. System V and SunOS have both available. Linux has neither awk nor nawk but uses gawk. The book The Awk Programming Language (see the "Further Reading" section at the end of this chapter) by the awk authors speaks of awk throughout the book, but the programming language it describes is called nawk on many systems.

The built-in variable ARGC holds the value for the number of command-line arguments. The variable ARGV is an array containing the command-line arguments. Subscripts for ARGV begin with 0 and continue through ARGC-1. ARGV[0] is always awk. The available UNIX options do not occupy ARGV. The variable FNR represents the number of the current record within that input file. Like NR, this value changes with each new record. FNR is always <= NR. The built-in variable RLENGTH holds the value of the length of string matched by the match function. The variable RS holds the value of the input record separator. The default value of RS is a newline. The start of the string matched by the match function resides in RSTART. Between RSTART and RLENGTH, it is possible to determine what was matched. The variable SUBSEP contains the value of the subscript separator. It has a default value of "\034" (octal 034, a nonprinting ASCII control character that is unlikely to appear in ordinary data).
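
As a small sketch of how RSTART and RLENGTH work together, the following snippet searches a made-up input string and uses substr to pull out exactly what match found. It needs nawk or a later awk, because match is not part of original awk.

```shell
# match() returns nonzero on success and sets RSTART and RLENGTH.
out=$(echo 'hello world' |
      awk '{ if (match($0, /o w/))
                 print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }')
printf '%s\n' "$out"
```

The match starts at the fifth character and is three characters long, so substr($0, RSTART, RLENGTH) returns the matched text itself.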

Built-In Variables for POSIX Awk

The POSIX awk specification introduces a new built-in variable beyond those in nawk. The built-in variable ENVIRON is an array that holds the values of the current environment variables. The subscript values for ENVIRON are the names of the environment variables themselves, and each ENVIRON element is the value of that variable.

Here's an example of how you could work with the environment variables:

ENVIRON["EDITOR"] == "vi"  {print NR,$0}

This program prints program listings with line numbers if I am using vi as my default editor.
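
Here is a self-contained sketch of the same idea. It sets EDITOR for the awk command only, so the result does not depend on your own environment, and the two input lines are made up. Note that the subscript is the quoted string "EDITOR"; an unquoted EDITOR would be treated as an (empty) awk variable.

```shell
# EDITOR=vi applies only to this one awk invocation.
out=$(printf 'first line\nsecond line\n' |
      EDITOR=vi awk 'ENVIRON["EDITOR"] == "vi" { print NR, $0 }')
printf '%s\n' "$out"
```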

Built-In Variables in Gawk

The GNU group further enhanced awk by adding four new variables to gawk, its public reimplementation of awk. Gawk does not differ between UNIX versions as much as awk and nawk do, fortunately. These built-in variables are in addition to those mentioned in the POSIX specification as described in the previous section. The variable CONVFMT contains the conversion format used when a number is converted to a string. The default value of CONVFMT is "%.6g", and it is for internal use only. The variable FIELDWIDTHS allows a programmer the option of having fixed field widths rather than a single character field separator. The values of FIELDWIDTHS are numbers separated by a space or tab (\t), so fields don't need to be the same width. When the FIELDWIDTHS variable is set, each field is expected to have a fixed width. Gawk separates the input record using the FIELDWIDTHS values for field widths. If FIELDWIDTHS is set, the value of FS is disregarded. Assigning a new value to FS overrides the use of FIELDWIDTHS; it restores the default behavior.

To see where this could be useful, imagine that you've just received a data file from accounting that indicates the different employees in your group and their ages. It might look like this:

$ cat gawk.datasample
1Swensen, Tim  24
1Trinkle, Dan  22
0Mitchel, Carl 27

The very first character, you find out, indicates if the employees are hourly or salaried. A value of 1 means that they're salaried, and a value of 0 refers to hourly. How do you split that character out from the rest of the data field? You can with the FIELDWIDTHS statement. Here's a simple gawk script that could attractively list the data:

BEGIN {FIELDWIDTHS = "1 8 1 4 1 2"}
{ if ($1 == 1) print "Salaried employee "$2,$4" is "$6" years old.";
  else         print "Hourly   employee "$2,$4" is "$6" years old."
}

The output would look like this:

Salaried employee Swensen, Tim  is 24 years old.
Salaried employee Trinkle, Dan  is 22 years old.
Hourly   employee Mitchel, Carl is 27 years old.


TIP: When calculating the different FIELDWIDTHS values, don't forget any field separators; the spaces between words do count in this case.
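
FIELDWIDTHS exists only in gawk. If you are stuck with another awk, a rough equivalent can be sketched with substr, which cuts fixed columns out of $0. This is my own workaround rather than anything FIELDWIDTHS does internally; the variable names kind, name, age, and label are made up, and the column positions are counted from the sample data above.

```shell
# Two of the sample records, cut apart by column position.
out=$(printf '1Swensen, Tim  24\n0Mitchel, Carl 27\n' |
      awk '{ kind = substr($0, 1, 1)     # column 1: 1 = salaried, 0 = hourly
             name = substr($0, 2, 13)    # columns 2-14: last name, first name
             age  = substr($0, 16, 2)    # columns 16-17: age
             label = (kind == "1") ? "Salaried" : "Hourly  "
             print label " employee " name " is " age " years old." }')
printf '%s\n' "$out"
```

The same caution applies as with FIELDWIDTHS: the blanks between columns count when you work out the positions.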

The variable IGNORECASE controls the case sensitivity of gawk's regular expressions. If IGNORECASE has a non-zero value, pattern matching ignores case for regular expression operations. The default value of IGNORECASE is zero; all regular expression operations are normally case sensitive.

Table 4.3 summarizes the built-in variables and the first awk version in which they appeared:

Table 4.3. Built-in variables in awk.

V  Variable     Meaning                                        Default (if any)
N  ARGC         The number of command-line arguments
N  ARGV         An array of command-line arguments
A  FS           The input field separator                      Space
A  NF           The number of fields in the current record
G  CONVFMT      The conversion format for numbers              %.6g
G  FIELDWIDTHS  A whitespace-separated list of field widths
G  IGNORECASE   Controls the case sensitivity                  Zero (case-sensitive)
P  ENVIRON      An array of the current environment variables
N  FNR          The current record number in the current file
A  FILENAME     The name of the current input file
A  NR           The number of records already read
A  OFS          The output field separator                     Space
A  ORS          The output record separator                    Newline
A  OFMT         The output format for numbers                  %.6g
N  RLENGTH      Length of string matched by match function
A  RS           Input record separator                         Newline
N  RSTART       Start of string matched by match function
N  SUBSEP       Subscript separator                            "\034"

NOTE: V is the first implementation using the variable. A = awk, G = gawk, P = POSIX awk, N = nawk

Conditions (No IFs, &&s, or Buts)

Awk program statements are, by their very nature, conditional; if a pattern matches, a specified action or actions occurs. Actions, too, have a conditional form. This section discusses conditional flow. It focuses on the syntax of the if statement, but, as usual in awk, there are multiple ways to do something.

A conditional statement does a test before it performs the action. One test, the pattern match, has already happened; this test is an action. The last two sections introduced variables; now you can begin putting them to practical uses.

The if Statement

An if statement takes the form of a typical iterative programming language control structure, where E1 is an expression, as mentioned in the "Patterns" section earlier in this chapter:

if (E1) S2; else S3

Although E1 is always a single expression, S2 and S3 can be either single- or multiple-action statements. (Conditions in conditions are legal syntax.) Returns and indentation are, as usual in awk, entirely up to you. However, if S2 and the else statement are on the same line and S2 is a single statement, a semicolon must separate S2 from the else statement. The parentheses around E1 are required. When awk encounters an if statement, evaluation occurs as follows: E1 is evaluated, and if E1 is non-zero or non-null (true), S2 is executed; if E1 is zero or null (false) and there's an else clause, S3 is executed. For instance, if you wanted to print a blank line when the third field has the value 25 and the entire line in all other cases, you could use a program snippet like this:

{ if ($3 == 25)
     print ""
else
     print $0 }

The else S3 portion of the if statement is completely optional because sometimes your choice is limited to whether or not to have awk execute S2:

{ if ($3 == 25)
     print "" }

Although the if statement is an action, E1 can test for a pattern match using the pattern-match operator ~. As you have already seen, you can use it to look for my name in the password file another way. The first way is shorter, but they do the same thing.

$ awk '/Ann/' /etc/passwd
$ awk '{if ($0 ~ /Ann/) print $0}' /etc/passwd

One use of the if statement combined with a pattern match is to further filter the screen input. For example, here I'm going to print only the lines in the password file that contain both Ann and an M character:

$ awk '/Ann/ { if ($0 ~ /M/) print}' /etc/passwd
amarshal:oPWwC9qVWI/ps:2005:12:Ann Marshall:/usr/grad/amarshal:/bin/csh
cmcintyr:0FciKEDDMkauU:2630:22:Carol Ann McIntyre:/usr/lteach/cmcintyr:/bin/csh
jflanaga:ShrMnyDwLI/mM:2654:22:JoAnn Flanagan:/usr/lteach/jflanaga:/bin/csh

S2, S3, or both can consist of multiple-action statements. If either does, the group of statements is enclosed in curly braces. You may put curly braces wherever you want as long as they enclose the action. The rule of thumb is that if it's one statement, the braces are optional; with more than one, they're required.

You can also use multiple else clauses. The car sales example gets one field longer each month. The first two fields are always the salesperson's name, and the last field is the accumulated annual total, so it is possible to calculate the month by the value of NF:

if (NF == 4) month = "Jan."
else if (NF == 5) month = "Feb."
else if (NF == 6) month = "March"
else if (NF == 7) month = "April"
else if (NF == 8) month = "May" # and so on


NOTE: Whatever the value of NF, at most one branch of the block executes. As soon as one test succeeds, awk skips the remaining else clauses.

The Conditional Statement

Nawk++ also has a conditional statement--really just shorthand for an if statement. It takes the format shown and uses the same conditional operator found in C:

E1 ? S2 : S3

Here, E1 is an expression, and S2 and S3 are single-action statements. When awk encounters a conditional statement, it evaluates it in the same order as an if statement: E1 is evaluated; if E1 is non-zero or non-null (true), S2 is executed; if E1 is zero or null (false), S3 is executed. Only one statement, S2 or S3, is chosen, never both.

The conditional statement is a good place for the programmer to provide error messages. Return to the FIELDWIDTHS example. When we wanted to differentiate between hourly and salaried employees, we had a big if-else statement:

{ if ($1 == 1) print "Salaried employee "$2,$4" is "$6" years old.";
  else         print "Hourly   employee "$2,$4" is "$6" years old."
}

In fact, there's an easier way to do this with conditional statements:

{ print (($1==1) ? "Salaried" : "Hourly  ") " employee "$2,$4" is "$6" years old." }


CAUTION: Remember, the conditional statement is not part of original awk!
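
A stripped-down sketch of the conditional at work, assuming nawk or a later awk. The two input lines and the variable name kind are made up for illustration; the first field plays the role of the salaried/hourly flag.

```shell
# Pick one of two strings with ?: and store it before printing.
out=$(printf '1 Tim\n0 Carl\n' |
      awk '{ kind = ($1 == 1) ? "Salaried" : "Hourly"
             print $2 " is " kind }')
printf '%s\n' "$out"
```

Storing the result in a variable first keeps the print statement simple and avoids any confusion about where the parenthesized expression ends.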

At first glance, and for short statements, the if statement appears identical to the conditional statement. On closer inspection, the statement you should use in a specific case differs. Either is fine for use when choosing between either of two single statements, but the if statement is required for more complicated situations, such as when S2 and S3 are multiple statements. Use if for a group of statements (the first example here) or for a condition inside a condition, like the second example here:

{ if (NR == 100)
     { print $(NF-1)
       print "This is the 100th record"
       print $0
       print
     }
}
{ if ($1 == 0)
     if (name ~ /Fred/)
          print "Fred is broke" }

Patterns as Conditions

The program relying on pattern matching (had I chosen that method) produces the same output. Look at the program and its output.

$ cat seniority.awk
$(NF-1) <= 7    {print $1, $(NF-1), "Long timer" }
$(NF-1) > 7     {print $1, $(NF-1), "Is new"     }
$ awk -f seniority.awk emp.data
John Anderson     1980  Long timer
Joe Turner        1982  Long timer
Susan Greco       1985  Long timer
Ike Turner        1988  Long timer
Bob Burmeister    1991  Is new

Because the two patterns are nonoverlapping and one immediately follows the other, the two programs accomplish the same thing. Which to use is a matter of programming style. I find the conditional statement or the if statement more readable than two patterns in a row. When you are choosing whether to use the nawk conditional statement or the if statement because you're concerned about printing two long messages, remember that the if statement is cleaner. Above all, if you choose to use the conditional statement, keep in mind you can't use awk; you must use nawk or gawk.

Loops

People often write programs to perform a repetitive task or several repeated tasks. These repetitions are called loops, and they are the subject of this section. The loop structures of awk are very similar to those found in C. First, let's look at a shortcut in counting by ones. Then, I'll show you the ways to program loops in awk. The looping constructs of awk are the do (nawk), for, and while statements. As with multiple-action groups in an if statement, curly braces ({}) surround a group of action statements associated in a loop. Without curly braces, only the statement immediately following the keyword is considered part of the loop.


TIP: Forgetting curly braces is a common looping error.

Increment and Decrement

As stated earlier, assignment statements take the form x = y, where the value y is being assigned to x. Awk has some shorthand methods of writing this. For example, to add a monthly sales total to the car sales file, you'll need to add a variable to keep a running total of the sales figures. Call it total. You need to start total at zero and add each $(NF-1) as read. In standard programming practice, that would be written total = total + $(NF-1). This is okay in awk, too. However, a shortened format of total += $(NF-1) is also acceptable.

Two especially common cases are line += 1 and line -= 1 (awk shorthand for line = line + 1 and line = line - 1). They are called increment and decrement, respectively, and can be further shortened to the simpler line++ and line--. At any reference to a variable, you can not only use this notation but even vary whether the action is performed immediately before or after the value is used in that statement. These forms are called prefix and postfix notation and are written ++line and line++.

Focus on increment for a moment. Decrement functions the same way using subtraction. Using the ++line notation tells awk to do the addition before doing the operation indicated in the line. Using the postfix form says to do the operation in the line and then do the addition. Sometimes, the choice does not matter; keeping a counter of the number of sales people (to later calculate a sales average at the end of the month) requires a counter of names. The statements totalpeople++ and ++totalpeople do the same thing and are interchangeable when they occupy a line by themselves. But suppose I decided to print the person's number along with his or her name and sales. Adding either of the second two lines below to the previous example produces different results based on starting both at totalpeople=1.

$ cat awkscript.v1
BEGIN { totalpeople = 1 }
{print ++totalpeople, $1, $(NF-1)     }

$ cat awkscript.v2
BEGIN { totalpeople = 1 }
{print totalpeople++, $1, $(NF-1)     }

The first example will actually have the first employee listed as #2, because the totalpeople variable is incremented before it's used in the print statement. By contrast, the second version will do what we want because it will use the variable value and then afterward increment it to the next value.
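
The difference is easy to check without any input file. This sketch runs entirely in a BEGIN block; the counters n and m are made-up names, and both start at 1 as in the scripts above.

```shell
out=$(awk 'BEGIN {
    n = 1; print ++n    # prefix: add first, then use the value
    m = 1; print m++    # postfix: use the value, then add
    print n, m          # both counters now hold the same value
}')
printf '%s\n' "$out"
```

Only the value seen by the surrounding statement differs; after the statement finishes, both forms have added one.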


TIP: Be consistent. Either prefix or postfix is fine, but stick with one or the other, and there is less likelihood that you will accidentally enter a loop an unexpected number of times.

The while Statement

Awk provides the while statement for general looping. It has the following form:

while(E1)
     S1

Here, E1 is an expression (a condition), and S1 is either one action statement or a group of action statements enclosed in curly braces. When awk meets a while statement, E1 is evaluated. If E1 is true, S1 executes from start to finish and then E1 is again evaluated. If E1 is true, S1 again executes. The process continues until E1 is evaluated to false. When it does, the execution continues with the next action statement after the loop. Consider this program:

{ while ($0~/M/)
     print
}

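Nothing in the body of this loop changes $0, so for any record that contains an M the condition stays true and the loop never ends. Typically, the condition (E1) tests a variable, and the variable is changed in the while loop.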

{ i=1
  while (i<20)
     {  print i
      i++
     }
}

This second code snippet will print the numbers from 1 to 19; after the while loop tests with i=20, the condition of i<20 will become false and the loop will be done.
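
As a slightly more practical sketch, here is a while loop that walks across the fields of a record and adds them up. The input numbers and the variable name sum are made up for illustration.

```shell
# i is both tested by the condition and changed inside the loop.
out=$(echo '4 8 15 16' |
      awk '{ i = 1; sum = 0
             while (i <= NF) {
                 sum += $i
                 i++
             }
             print sum }')
printf '%s\n' "$out"
```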

The do Statement

Nawk++ provides the do statement for looping in addition to the while statement. The do statement takes the following form:


do
     S
while (E)

Here, S is either a single statement or a group of action statements enclosed in curly braces, and E is the test condition. When awk comes to a do statement, S is executed once and then condition E is tested. If E evaluates to non-zero or non-null, S executes again, and so on, until the condition E becomes false. The difference between the do and the while statement rests in their order of evaluation. The while statement checks the condition first and executes the body of the loop if the condition is true. Use the while statement to check conditions that could be initially false. For instance, while (not end-of-file(input)) is a common example. The do statement executes the loop first and then checks the condition. Use the do statement when testing a condition that depends on the first execution to meet the condition.

You can imitate the do statement using the while statement. Put the code that is in the body of the loop before the while statement as well as inside it.
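
The difference in evaluation order shows up when the condition is false from the start. In this sketch (assuming nawk or later, because of do), both loops test the same kind of condition against made-up counters i and j, but only the do body runs.

```shell
out=$(awk 'BEGIN {
    i = 5
    do { print "do body ran with i = " i; i++ } while (i < 5)
    j = 5
    while (j < 5) { print "while body ran"; j++ }
}')
printf '%s\n' "$out"
```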

The for Statement

The for statement is a compacted while loop designed for counting. Use it when you know ahead of time that S is a repetitive task and the number of times it executes can be expressed as a single variable. The for loop has the following form:

for(pre-loop-statements; TEST; post-loop-statements)
     S

Here, pre-loop-statements usually initialize the counting variable, TEST is the test condition, and post-loop-statements indicate any loop variable increments.

For example:

{ for(i=1; i<=30; i++) print i }

This is a succinct way of saying initialize i to 1, and then continue looping while i<=30, and incrementing i by one each time through. The statement executed each time simply prints the value of i. The result of this statement is a list of the numbers 1 through 30.


TIP: The condition test should either be < 21 or <= 20 to execute the loop 20 times. The equality operator == is not a good test condition. It will be false the first time checked (or shortly thereafter if i is initialized to 20).

{ for (i=1; i==20; i++) print i }     # prints nothing: i==20 is false on the first test

The for loop can also be used for loops of unknown size:

for (i=1; i<=NF; i++)
     print $i

This prints each field on its own line. You don't know what the number of fields will be, but you do know NF will contain that number.

The for loop does not have to be incremented; it could be decremented instead:

$ awk -F: '{ for (i = NF; i > 0; --i) print $i }' sales.data

This prints the fields in reverse order, one per line.
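
If you would rather keep the reversed fields on one line, a decrementing loop can build the line up before printing it. This is only a sketch: the variable name line is made up, and the input is a single record from emp.data supplied inline.

```shell
# Start with the last field, then append the others from right to left.
out=$(echo 'John Anderson:sales:1980' |
      awk -F: '{ line = $NF
                 for (i = NF - 1; i > 0; --i) line = line ":" $i
                 print line }')
printf '%s\n' "$out"
```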

Loop Control

The only restriction of the loop control value is that it must be an integer. Because of the desire to create easily readable code, most programmers try to avoid branching out of loops midway. Awk does offer two ways to do this: break and continue. Sometimes, unexpected or invalid input leaves little choice but to exit the loop or have the program crash--something a programmer strives to avoid. The break statement gives you a way to cope with such input errors. For instance, when reading the car sales data into the array name, I wrote the program expecting five fields on every line. If something happens and a line has the wrong number of fields, the program is in trouble. A way to protect your program from this is to have code like this:

{ for(i=1; i<=NF; i++)
     if (NF != 5) {
          print "Error on line " NR ": invalid input...leaving loop."
          break  }
     else {
          # continue with program code...
     }
}

The break statement terminates the loop only. It is not equivalent to the exit statement, which transfers control to the END statement of the program. A solution to this problem is shown on the CD-ROM in file LIST15_1.
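
To watch break leave a loop while the rest of the action carries on, here is a small sketch that is separate from the CD-ROM listing. The input line and the rule that every field must be a whole number are assumptions made up for illustration.

```shell
# Stop at the first field that is not a whole number.
out=$(echo '12 23 oops 7' |
      awk '{ for (i = 1; i <= NF; i++) {
                 if ($i !~ /^[0-9]+$/) {
                     print "Error on line " NR ": field " i " is not a number"
                     break
                 }
                 print $i
             }
             print "after the loop" }')
printf '%s\n' "$out"
```

The final print still runs, which shows that break ended only the for loop; an exit at the same spot would have skipped it and gone straight to any END action.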


TIP: The ideal error message depends, of course, on your application, the knowledge of the end users, and the likelihood they will be able to correct the error.

As another use for the break statement, consider do S while (1). It is an infinite loop depending on another way out. Suppose your program begins by displaying a menu on screen. (See the LIST15_2 file on the CD-ROM.)

The previous example shows an infinite loop controlled with the break statement giving the end user a way out.


NOTE: In the CD-ROM file LIST15_2, the getline function is used to get entire lines. The substr function extracts the first character from that line.

The continue statement causes execution to skip the remainder of the current iteration in both the do and the while statements. Control transfers to the evaluation of the test condition. In the for loop, control goes to the post-loop statements. When is this of use? Consider computing a true sales ratio by calculating the amount sold and dividing that number by hours worked.

Because this is all kept in separate files, the simplest way to handle the task is to read the first list into an array, calculate the figure for the report, and do whatever else is needed.

FILENAME=="total"          read each $(NF-1) into monthlytotal[i]
FILENAME=="per"            with each i
                              monthlytotal[i]/$2
whatever else

But what if $2 is 0? The program will fail because dividing by 0 is an illegal operation. Although it is unlikely that an employee will miss an entire month of work, it is possible. So, it is a good idea to allow for the possibility. This is one use for the continue statement. The preceding program segment expands to Listing 4.1.

Listing 4.1. Using the continue statement.

BEGIN         { star = 0
                # other stuff...
}

FILENAME=="total"         { for(i=1;i<=NF;i++)
                               monthlyttl[i]=$(NF-1)
                   }

FILENAME=="per"           { for(i=1;i<=NF;i++) {
                              if($2 == 0)   {
                                  print "*"
                                  star++
                                  continue }
                              else
                                  print monthlyttl[i]/$2
                              # whatever else
                            }
                         }

END   { if(star>=1)
         print "* indicates employee did not work all month."
        # else whatever
}

The preceding program makes some assumptions about the data, in addition to assuming valid input data. What are these assumptions and more importantly, how do you fix them? The data in both files is assumed to be the same length, and the names are assumed to be in the same order.

Recall that in awk, array subscripts are stored as strings. Because each list contains a name and its associated figure, you can match names. Before running this program, run the UNIX sort utility to ensure the files have the names in alphabetical order. (See the section titled "Sorting Text Files" in Chapter 3, "Text Editing with vi and EMACS.") After making changes, use file LIST15_3 on the CD-ROM.

Strings

There are two primary types of data that awk can work with: numeric values and sequences of characters and digits that comprise words, phrases, or sentences. The latter are called strings in awk and most other programming languages. For instance, "now is the time for all good men" is a string. A string constant is always enclosed in double quotes (""). It can be almost any length; the exact limit varies from UNIX version to version.

One of the important string operations is called concatenation, which means putting together. When you concatenate two strings, you create a third string that is the combination of the first string immediately followed by the second. To perform concatenation in awk simply leave a space between two strings.

print "Her name is" "Ann."

This prints the line:

Her name isAnn.

(To ensure that a space is included, use a comma in the print statement or simply add a space to one of the strings: print "Her name is " "Ann.")
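Here is a small sketch of the difference between concatenation and a comma in print. It reuses the sample strings from above; the variable name full is my own, and the output field separator is assumed to be left at its default of a single space.

```shell
# Adjacent strings are concatenated; a comma inserts the output field separator.
awk 'BEGIN {
    print "Her name is" "Ann."       # concatenation: no space is added
    print "Her name is", "Ann."      # comma: OFS (a space by default) is added
    full = "Her name is " "Ann."     # concatenation also works in an assignment
    print full
}'
```

The first line comes out as Her name isAnn. and the other two as Her name is Ann.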

Built-In String Functions

As a rule, awk's matching functions return the leftmost, longest match. The leftmost rule comes first: the match that starts earliest in the string wins. The longest rule then applies among the matches that start at that position: awk collects as many characters as the pattern allows. For instance, if the pattern you are looking for is /y+/ in the string "any of the guyys knew it", the match is the single "y" in "any", because it appears earliest, even though "yy" is longer. Matching /y+/ against "guyys" alone returns "yy" rather than "y".
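The leftmost-longest rule can be checked with the match function (described under Nawk later in this section), which records where a match starts and how long it is. This is only a sketch; the pattern /y+/ and the variable s are chosen for illustration.

```shell
awk 'BEGIN {
    s = "any of the guyys knew it"
    # Leftmost wins: the first run of y characters is the single y in "any".
    if (match(s, /y+/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
    # Longest wins among matches that start at the same position.
    if (match("guyys", /y+/))
        print RSTART, RLENGTH, substr("guyys", RSTART, RLENGTH)
}'
```

The first print shows position 3 and length 1; the second shows position 3 and length 2.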

The different awk string functions available are organized by version.

Awk

The original awk contained few built-in functions for handling strings. The length function returns the length of the string. It has an optional argument. If you use the argument, it must follow the keyword and be enclosed in parentheses: length(string). If there is no argument, the length of $0 is the value. For example, it is difficult to determine from some screen editors if a line of text stops at 80 characters or wraps around. The following invocation of awk aids by listing just those lines that are longer than 80 characters in the specified file.

$ awk '{ if (length > 80) print NR ": " $0 }' file-with-long-lines

The other string function available in the original awk is substring, which takes the form substr(string, position, len) and returns the len length substring of the string starting at position.


NOTE: A disagreement exists over which functions originated in awk and which originated in nawk. Consult your system for the final word on awk string functions. The functions in nawk are fairly standard.

Nawk

When awk was expanded to nawk, many built-in functions were added for string manipulation while keeping the two from awk. The function gsub(r, s, t) substitutes string s into target string t every time the regular expression r occurs, and returns the number of substitutions. The target t must be a variable, field, or array element because it is changed in place; if t is not given, gsub() uses $0. For instance, if name holds "Randall", gsub(/l/, "y", name) turns name into "Randayy" and returns 2. The g in gsub means global because all occurrences in the target string change.

The function sub(r, s, t) works like gsub(), except the substitution occurs only once. Thus, with name again holding "Randall", sub(/l/, "y", name) changes name to "Randayl" and returns 1. The place the substring t occurs in string s is returned by the function index(s, t): index("Chris", "i") returns 4. As you might expect, the return value is 0 if substring t is not found. The function match(s, r) returns the position in s where the regular expression r occurs. It returns the index where the substring begins or 0, if there is no substring. It sets the values of RSTART and RLENGTH.

The split function separates a string into parts. For example, if your program reads a date as 5-10-94 and later you want it written May 10, 1994, the first step is to divide the date appropriately. The built-in function split does this: split("5-10-94", store, "-") divides the date, sets store["1"] = "5", store["2"] = "10" and store["3"] = "94", and returns 3, the number of pieces. Notice that here the subscripts start with "1" not "0".
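The nawk string functions just described can be tried together in one BEGIN action. This is a sketch using the sample values from the text; the variable names n and count and the pattern /r.s/ are mine.

```shell
awk 'BEGIN {
    name = "Randall"
    n = gsub(/l/, "y", name)          # replace every l; n is the count
    print n, name                     # 2 Randayy

    name = "Randall"
    n = sub(/l/, "y", name)           # replace only the first l
    print n, name                     # 1 Randayl

    print index("Chris", "i")         # 4

    if (match("Chris", /r.s/))        # sets RSTART and RLENGTH
        print RSTART, RLENGTH         # 3 3

    count = split("5-10-94", store, "-")
    print count, store[1], store[2], store[3]   # 3 5 10 94
}'
```

Note that gsub and sub change the variable passed as the third argument and return a count, so the new string is read from the variable, not from the function's return value.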

POSIX Awk

The POSIX awk specification added two built-in functions for use with strings. They are tolower(str) and toupper(str). Both functions return a copy of the string str with the alphabetic characters converted to the appropriate case. Non-alphabetic characters are left alone.

Gawk

Gawk provides two functions returning time-related information. The systime() function returns the current time of day as the number of seconds since midnight UTC (Coordinated Universal Time, the successor to Greenwich Mean Time), January 1, 1970, on POSIX systems. The function strftime(f, t), where f is a format and t is a timestamp of the same form as returned by systime(), returns a formatted timestamp similar to the ANSI C function strftime().

String Constants

String constants are the way awk represents characters that are essential but cannot easily be typed at the keyboard. Because these constants are part of strings, when you use one, you must enclose it in double quotes (""). These constants can appear in printing or in patterns involving regular expressions. For instance, the following command prints all lines shorter than 80 characters that begin with a tab. See Table 4.4.

awk 'length < 80 && /^\t/' a-file-with-long-lines

Table 4.4. Awk string constants.

Expression   Meaning
\\           Indicates that a backslash gets printed
\a           The "alert" character, usually the ASCII BEL
\b           A backspace character
\f           A formfeed character
\n           A newline character
\r           Carriage return character
\t           Horizontal tab character
\v           Vertical tab character
\x           Indicates the following value is a hexadecimal number
\0           Indicates the following value is an octal number

Arrays

An array is a method of storing pieces of similar data in the computer for later use. Suppose your boss asks for a program that reads in the name, social security number, and other personnel data to print check stubs and the detachable check. For three or four employees keeping name1, name2, and so on might be feasible, but at 20 employees, it is tedious and at 200, impossible. This is a use for arrays! See file LIST15_4 on the CD-ROM for an example of how to handle this.


NOTE: The sample awk script assumes the data has the check date as the first input record; because of this, the total lines (NR) is not the number of checks to issue. I could have used NR-1, but I chose clarity over brevity.

Using arrays is much easier, cleaner, and quicker than spelling out individual variables for each element used. It means that you do not have to change the code when the number of elements (employees for example) changes. Awk supports only single-dimension arrays. (See the section "Advanced Concepts" for how to simulate multiple-dimensional arrays.) That and a few other things set awk arrays apart from the arrays of other programming languages. This section focuses on arrays, their uses, special properties, and the three features of awk (a built-in function, a built-in variable, and an operator) designed to help you work with arrays.

Arrays in awk, like variables, don't need to be declared. Furthermore, no indication of size must be given ahead of time; in programming terms, you might say arrays in awk are dynamic. To create an array, give it a name and put its subscript after the name in square brackets ([ ]). Instead of name2, you use name[2], for instance. Array subscripts are also called the indexes of the array; in name[2], 2 is the index to the array name, and it accesses the one name stored at location 2.


NOTE: One peculiarity in awk is that elements are not stored in the order they are entered. A for (index in array) loop visits the elements in an order that depends on the implementation, so do not rely on it.

Awk arrays are different from those of other programming languages because in awk, array subscripts are stored as strings, not numbers. Technically, the term is associative arrays, and it's unusual in programming languages. Be aware that the use of strings as subscripts can confuse you if you think purely in numeric terms. Compared as strings, "3" > "15" even though numerically 3 < 15, so an element with the subscript "15" sorts before one with the subscript "3".

Because subscripts are strings, a subscript can be a field value. grade[$1]=$2 is a valid statement, as is salary["John"].
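As a sketch of a field value used as a subscript, the following reads a made-up list of names and grades. The sample data and the second array, seen, are assumptions for illustration only.

```shell
printf 'John 88\nAnn 95\nJohn 92\n' | awk '
    { grade[$1] = $2; seen[$1]++ }      # the name in field 1 is the subscript
END {
    print "Ann:", grade["Ann"]
    print "John:", grade["John"], "(" seen["John"] " records)"
}'
```

Because John appears twice, his second grade overwrites the first, and seen["John"] counts both records: the output is Ann: 95 followed by John: 92 (2 records).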

Array Specialties

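Nawk++ has additions specifically intended for use with arrays. The first is a test for membership: the in operator checks whether a subscript exists in an array (it looks at the subscripts, not at the stored values). Suppose Mark Turner enrolled late in a class I teach, and I don't remember if I added his name to the list I keep on my computer. The following program, which assumes each record begins with a first and last name, checks the list for me:

{ name[$1 " " $2] = 1 }     # the full name is the subscript

END { if ("Mark Turner" in name)
      print "He's enrolled in the course!"
    }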

The delete statement removes array elements from computer memory. To remove an element, for example, you could use the command delete name[1].


CAUTION: After you remove an element from memory, it's gone, and it isn't coming back! When in doubt, keep it.
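The membership test and delete work together, as this sketch shows. The class list and the array name enrolled are invented for the example, and the full name is used as the subscript because in examines subscripts.

```shell
printf 'Mark Turner\nPat Jones\n' | awk '
    { enrolled[$1 " " $2] = 1 }          # full name as the subscript
END {
    if ("Mark Turner" in enrolled)
        print "before delete: enrolled"
    delete enrolled["Mark Turner"]
    if (!("Mark Turner" in enrolled))
        print "after delete: not enrolled"
}'
```

Testing with in does not create the element, which is why it is safer than comparing enrolled["Mark Turner"] with an empty string.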

Although technology is advancing and memory is not the precious commodity it once was considered to be, it is still a good idea to clean up after yourself when you write a program. Think of the check printing program. Two hundred names won't fill the memory. If your program controls personnel activity, however, it writes checks and checkstubs, adds and deletes employees, and charts sales. It's better to update each file to disk and remove the arrays not in use. There is less chance of reading obsolete data. It also consumes less memory and minimizes the chance of using an array of old data for a new task. The clean-up can be easily done:

END  {i= totalemps
     while(i>0) {
          delete name[i]
          delete data[i]
          i-- }
     }

Nawk++ creates another built-in variable for use when simulating multidimensional arrays. The section titled "Advanced Concepts" discusses more about it. It is called SUBSEP and has a default value of "\034". To add this variable to awk, just use the name in your program:

BEGIN { SUBSEP = "\034" }

Recall that in awk, array subscripts are stored as strings. Because each list contains a name and its associated figure, you can match names and match files.

Arithmetic

Although awk is primarily a language for pattern matching, and hence, text and strings pop into mind more readily than math and numbers, awk also has a good set of math tools. In this section, first I show the basics and then discuss the math functions built into awk.

Operators

Awk supports the usual math operations. The expression x^y is x superscript y, that is, x to the y power. The % operator calculates remainders in awk: x%y is the remainder of x divided by y, and the result is machine-dependent. All math uses floating-point arithmetic, and numbers are equivalent no matter which format they are expressed in, so 100 = 1.00e+02.

The math operators in awk consist of the four basic operations: + (addition), - (subtraction), / (division), and * (multiplication), plus ^ and % for exponentiation and remainder.
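A quick sketch of these operators follows; the numbers are arbitrary. The parentheses around the comparison keep awk from reading the relational operator as anything else.

```shell
awk 'BEGIN {
    print 2 ^ 10                 # exponentiation: 1024
    print 17 % 5                 # remainder: 2
    print 7 / 2                  # division is floating-point: 3.5
    print (100 == 1.00e+02)      # same number in two notations: 1
}'
```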

As you saw in the most recent sales example, fields can be used in arithmetic too. If, in the middle of the month, my boss asks for a list of the names and latest monthly sales totals, I don't need to panic over the discarded figures; I can just print a new list. My first shot seems simple enough. (See Listing 4.2.)

Listing 4.2. Print sales totals for May.

BEGIN      {OFS="\t"}
{          print $1, $2, $6 }          # field #6 = May

Then a thought hits. What if my boss asks for the same thing next month? Sure, changing a field number each month is not a big deal, but is it really necessary?

I look at the data. No matter what month it is, the current month's totals are always the next to last field. I start over with the program in Listing 4.3.

Listing 4.3. Printing the previous month's sales totals.

BEGIN      {OFS="\t"}
{          print $1,$2, $(NF-1) }


TIP: Again, be careful, because awk lets you get away with murder. If I forgot the parentheses on the last statement above, rather than get a monthly total, I would print the last field (the running total) minus 1! Also, rather than generate an error, if I mistype $(NF-1) and get $(NF+1) (not hard to do using the number pad), awk assigns nonexistent variables (here the number of fields plus 1) to the null string. In this case, it prints an empty field in place of the total.

Another use for arithmetic is assignment. You can change field variables by assignment. Given the following file, the statement $3 = 7 is a valid statement and produces these results:

$ cat inputfile
1 2
3 4
5 6
7 8
9 10

$ awk '$3 = 7' inputfile
1 2 7
3 4 7
5 6 7
7 8 7
9 10 7


NOTE: The preceding statement forces $0 and NF values to change. Awk recalculates them as it runs. The original awk will produce an error message; the other versions produce the result shown.

If I run the following program, five lines appear on the monitor, showing the new values:

     {   if(NR==1)
          print $0, NF  }
     { if (NR >= 2 && NR <= 4) { $3=7; print $0, NF } }
END {print $0, NF }

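Now when I run the data file through awk, here's what I see. (The last line comes from the END action; an awk that follows POSIX keeps the final record there and prints 9 10 2 instead.)

$ awk -f newsample.awk inputfile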
1 2 2
3 4 7 3
5 6 7 3
7 8 7 3
0

Numeric Functions

Awk has a well-rounded selection of built-in numeric functions. As before in the sections on "Built-in Variables" and "Strings," the functions build on each other, beginning with those found in awk.

Awk

Awk has built-in functions exp(x), log(x), sqrt(x), and int(x), where int() truncates its argument to an integer.

Nawk

Nawk added further arithmetic functions to awk. It added atan2(y,x), which returns the arctangent of y/x. It also added two random number generator functions: rand() and srand(x). Some disagreement exists over which functions originated in awk and which in nawk. Most versions have all the trigonometric functions in nawk, regardless of where they first appeared.
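The numeric functions can be sampled in a single BEGIN action. This is a sketch with arbitrary arguments; the seed 42 is an assumption, and the random sequence it produces differs from one awk to another, so only the range of rand() is checked.

```shell
awk 'BEGIN {
    print int(7.9), int(-7.9)        # truncation toward zero: 7 -7
    print sqrt(16), exp(0)           # 4 1
    print log(exp(2))                # 2
    printf "%.4f\n", atan2(0, -1)    # pi to four places: 3.1416
    srand(42)                        # fixed seed for a repeatable run
    r = rand()
    print (r >= 0 && r < 1)          # rand() is always in [0, 1): 1
}'
```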

Input and Output

This section takes a closer look at the way input and output function in awk. It introduces input with the getline function of nawk++; output is shown through print and printf.

Input

Awk handles the majority of input automatically; there is no explicit read statement, unlike in most programming languages. Each line of the program is applied to each input record in the order the records appear in the input file. If the input file has 20 records, the first pattern-action statement in the program looks for a match 20 times. The next statement causes awk to read the next input record immediately and start over with the first pattern-action statement, without trying the current record against the rest of the program. The exit statement acts as if all input has been processed. When awk encounters an exit statement, control goes to the END pattern-action statement, if there is one.

The Getline Statement

One addition, when awk was expanded to nawk, was the built-in function getline. It is also supported by the POSIX awk specification. The function can take several forms. At its simplest, it's written getline. When written alone, getline retrieves the next input record and splits it into fields as usual, setting FNR, NF, and NR. The function returns 1 if the operation is successful, 0 if it is at the end of the file (EOF), and -1 if the function encounters an error. Thus:

while (getline == 1)

simulates awk's automatic input.

Writing getline variable reads the next record into variable (getline char from the earlier menu example, for instance). Field splitting does not take place, and NF is left unchanged, but FNR and NR are incremented. Either of the previous two can be written using input from a file besides the one containing the input records by appending < filename on the end of the command. Furthermore, getline char < "-" takes the input from standard input, normally the keyboard. As you might expect, neither FNR nor NR is affected when the input is read from another file. You can also write either of the two forms taking the input from a command: command | getline.

If you omit the variable when using getline with a file (getline < filename), $0, the field variables, and NF are set; FNR and NR are not affected.
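The getline forms can be compared side by side. The sketch below builds two throwaway files in a temporary directory; the file names main and side, the variable names, and the echo command are all invented for the example. The path of the second file is handed to awk as a command-line assignment.

```shell
tmp=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$tmp/main"
printf 'extra one\nextra two\n' > "$tmp/side"

awk '
NR == 1 {
    getline                          # plain getline: new $0, NR and FNR go up
    print "after getline:", $0, "NR=" NR
    getline line < side              # from a file into a variable: NR untouched
    print "from side file:", line, "NR=" NR
    close(side)
    "echo hello" | getline word      # from a command into a variable
    close("echo hello")
    print "from command:", word
}
END { print "records read from main:", NR }' side="$tmp/side" "$tmp/main"

rm -rf "$tmp"
```

The plain getline consumes the record beta, so NR is already 2 when the first print runs, and the END action reports 3 records in all.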

Output

There are two forms of printing in awk: the print and the printf statements. Until now, I have used the print statement. It is the fallback. There are two forms of the print statement. One has parentheses; one doesn't. So, print $0 is the same as print($0). In awk shorthand, the statement print by itself is equivalent to print $0. As shown in an earlier example, a blank line is printed with the statement print "". Use the format you prefer.


NOTE: print() is not accepted shorthand; it generates a syntax error.

Nawk requires parentheses only if the print statement involves a relational operator.


For a simple example, consider file1:

$ cat file1
1     10
3     8
5     6
7     4
9     2
10    0

The command line

$ nawk 'BEGIN {FS="\t"}; {print($1>$2)}' file1

shows

0
0
0
1
1
1

on the monitor.

Knowing that 0 indicates false and 1 indicates true, the result shown above is what you might expect, but most programming languages won't print the result of a relation directly. Nawk and C will.


NOTE: Printing the value of a relational expression requires nawk or later. Trying the example above in awk results in a syntax error.

Nawk prints the results of relations with both print and printf. Both print and printf require the use of parentheses when a relation is involved, however, to distinguish between > meaning greater than and > meaning the redirection operator.

The printf Statement

printf is used when the use of formatted output is required. It closely resembles C's printf. Like the print statement, it comes in two forms: with and without parentheses. Either may be used, except the parentheses are required when using a relational operator.

printf format-specifier, variable1,variable2, variable3,..variablen
printf(format-specifier, variable1,variable2, variable3,..variablen)

The format specifier is always required with printf. It contains both any literal text and the specific format for displaying any variables you want to print. Each display format begins with a %. Any combination of three modifiers can follow: -, a number, and .number. A - indicates the variable should be left-justified within its field. A number indicates the total width of the field (if the number begins with a 0, as in %05d, the field is 5 wide and padded with 0s as needed). The last modifier is .number; its meaning depends on the type of variable: the number indicates either the maximum string width or the number of digits to the right of the decimal point. After zero or more modifiers, the display format ends with a single character indicating the type of variable to display.
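A short sketch of the three modifiers applied to numbers and a string follows. The values are arbitrary, and the square brackets are there only to make the field width visible.

```shell
awk 'BEGIN {
    printf "[%5d]\n", 42          # width 5, right-justified: [   42]
    printf "[%-5d]\n", 42         # width 5, left-justified:  [42   ]
    printf "[%05d]\n", 42         # width 5, zero-padded:     [00042]
    printf "[%.2f]\n", 3.14159    # two digits after the point: [3.14]
    printf "[%8.3f]\n", 3.14159   # width 8, three decimals: [   3.142]
    printf "[%.3s]\n", "awkward"  # at most three characters: [awk]
}'
```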


TIP: Numbers can be displayed as characters, and nondigit strings can be displayed as numbers. With printf, anything goes!

Remember, the format specifier has a string value and because it does, it must always be enclosed in double quotes ("), whether it is a literal string such as

printf("This is an example of a string in the display format.")

or a combination:

printf("This is the %d example", occurrence)

or just a variable:

printf("%d", occurrence)


NOTE: The POSIX awk specification (and hence gawk) supports the dynamic field width and precision modifiers like ANSI C printf() routines do. To use this feature, place an * in place of either of the actual display modifiers, and the value will be substituted from the argument list following the format string. Neither awk nor nawk has this feature.

Table 4.5 shows the format specifiers that determine how an awk variable is printed, and there are format modifiers to modify the behavior of the format specifiers.

Table 4.5. The format specifiers in awk.

Format   Meaning
%c       An ASCII character.
%d       A decimal number (an integer, no decimal point involved).
%i       Just like %d (remember i for integer).
%e       A floating-point number in scientific notation (1.000000e+01).
%f       A floating-point number (10001010.434).
%g       Awk chooses between %e and %f display format; the one producing a shorter string is selected. Insignificant zeros are not printed.
%o       An unsigned octal (base-eight) number.
%s       A string.
%x       An unsigned hexadecimal (base-sixteen) number.
%X       Same as %x, but letters are uppercase rather than lowercase.


NOTE: If the argument used for %c is numeric, it is treated as a character and printed. Otherwise, the argument is assumed to be a string, and only the first character of that string is printed.

Look at some examples without display modifiers. When the file file1 looks like this:

$ cat file1
34
99
-17
2.5
-.3

the command line

awk '{printf("%c %d %e %f\n", $1, $1, $1, $1)}' file1

produces the following output:

" 34 3.400000e+01 34.000000
c 99 9.900000e+01 99.000000
  -17 -1.700000e+01 -17.000000
  2 2.500000e+00 2.500000
 0 -3.000000e-01 -0.300000

By contrast, a slightly different format string produces dramatically different results with the same input:

$ awk '{printf("%g %o %x\n", $1, $1, $1)}' file1
34 42 22
99 143 63
-17 37777777757 ffffffef
2.5 2 2
-0.3 0 0

Now let's change file1 to contain just a single word:

$ cat file1
Example

This string has seven characters. For clarity, I have used * instead of a blank space so the total field width is visible on paper.

printf("%s\n", $1)
     Example
printf("%9s\n", $1)
     **Example
printf("%-9s\n", $1)
     Example**
printf("%.4s\n", $1)
     Exam
printf("%9.4s\n", $1)
     *****Exam
printf("%-9.4s\n", $1)
     Exam*****

One topic pertaining to printf remains. The function printf was written so that it writes exactly what you tell it to write--and how you want it written, no more and no less. That is acceptable until you realize that you can't enter every character you might want to use from the keyboard. Awk uses the same escape sequences found in C for nonprinting characters. The two most important to remember are \n for a newline and \t for a tab character.


TIP: There are two ways to print a double quote, neither of which is that obvious. One way around this problem is to print the character by its ASCII value using the %c format:

doublequote = 34
printf("%c", doublequote)

The other strategy is to use a backslash to escape the default interpretation of the double quote as the end of the string:

     printf("Joe said \"undoubtedly\" and hurried along.\n")

This second approach works in most versions.


Closing Files and Pipes

Unlike most programming languages, there is no way to explicitly open a file in awk; opening files is implicit. However, you must close a file if you intend to read from it after writing to it. Suppose your awk program sent output through a pipe to the command cat file1 > file2. Before you can read file2, you must close the pipe. To do this, use the statement close("cat file1 > file2"), giving exactly the same string that opened the pipe. You may also do the same for a file: close("file2").

You can implicitly open a file using getline:

getline < filename;

which is used to read data from the file filename. When done with the file, you should use the close(filename) command.
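The write, close, and read-back sequence looks like this in practice. This is a sketch run in a temporary directory; the file name scratch.txt and the variable names are made up for the example.

```shell
tmp=$(mktemp -d)
( cd "$tmp" && awk 'BEGIN {
    file = "scratch.txt"                   # hypothetical file name
    print "first line"  > file             # implicit open for writing
    print "second line" > file             # same open file, so this appends
    close(file)                            # close before reading it back
    while ((getline line < file) > 0)      # implicit open for reading
        print "read back:", line
    close(file)
}' )
rm -rf "$tmp"
```

Without the first close(file), the getline loop may find the file empty or incomplete, because the output is still buffered.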

Command-Line Arguments

As you have probably noticed, awk presents a programmer with a variety of ways to accomplish the same thing. This section focuses on the command line. You will see how to pass command-line arguments to your program from the command line and how to set the value of built-in variables on the command line. A summary of command-line options concludes the section.

Passing Command-Line Arguments

Command-line arguments are available in awk through a built-in array called, as in C, ARGV. Again echoing C semantics, the value of the built-in ARGC is the number of elements in ARGV, counting the command name itself. Given the command line awk -f programfile infile1, ARGC has a value of 2. ARGV[0] = awk and ARGV[1] = infile1.


NOTE: The subscripts for ARGV start with 0 not 1.

programfile is not considered an argument. (No option argument is.) Had -F, been in the command line, ARGV would not contain the comma either. Note that this is different from C, where argv and argc include every option and option argument.
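To see what awk keeps in ARGV, print it from a BEGIN action. In this sketch the file names infile1 and infile2 are placeholders; they need not exist because BEGIN runs before any input is opened. The value of ARGV[0] depends on how your awk was invoked.

```shell
awk -F, 'BEGIN {
    print "ARGC =", ARGC                  # the command name plus two arguments
    for (i = 0; i < ARGC; i++)
        print "ARGV[" i "] =", ARGV[i]    # the -F, option does not appear
}' infile1 infile2
```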


Setting Variables on the Command Line

It is possible to pass variable values from the command line to your awk program by stating the variable and its value, as in the command line awk -f programfile infile x=1 FS=,. Normally, command-line arguments are filenames, but the equal sign indicates an assignment. This lets variables change value before and after a file is read. For instance, when the input is from multiple files, the order they are listed on the command line becomes very important because the first named input file is the first input read. Consider the command line awk -f program file2 file1 and this program segment:

# FILENAME is not set in BEGIN, so test it on the first record instead
NR == 1 { if ( FILENAME != "file1") {
               print "Unexpected input...Abandon ship!"
               exit
      }
      }

The programmer has written this program to accept one file as first input, and anything else causes the program to do nothing except print the error message.

awk -f program x=1 file1 x=2 file2

The change in variable values can also be used to check the order of files. Because you (the programmer) know their correct order, you can check for the appropriate value of x.
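The effect of assignments placed between filenames can be seen with two small files. This is a sketch; the file contents are invented, and the files are created in a temporary directory so that FILENAME prints without a path.

```shell
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/file1"
printf 'c\n'    > "$tmp/file2"
# Each assignment takes effect when awk reaches it in the argument list.
( cd "$tmp" && awk '{ print FILENAME, "x=" x, $0 }' x=1 file1 x=2 file2 )
rm -rf "$tmp"
```

Every record from file1 is printed with x=1 and the record from file2 with x=2, so a program can test x to confirm which file it is reading.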


TIP: Awk allows only two command-line options. The -f option indicates the file containing the awk program. When no -f option is used, the program is expected to be a part of the command line. The POSIX awk specification adds the option of using more than one -f option. This is useful when running more than one awk program on the same input. The other option is the -Fchar option, where char is the single character chosen as the input field separator. Without a specified -F option, the input field separator is a space, until the variable FS is otherwise set.

Functions

User-defined functions provide a means of combining code into blocks that can be executed from different parts of a program. In some languages, they are known as subroutines. The capability to add, define, and use functions was not originally part of awk. It was added in 1985 when awk was expanded. Technically, this means you must use either nawk or gawk if you intend to write awk functions; but again, because some systems use the nawk implementation and call it awk, check your man pages before writing any code.

Function Definition

An awk function definition statement appears like the following:

function functionname(list of parameters) {
     the function body
}

A function can exist anywhere a pattern-action statement can be. As most of awk is, functions are free format but must be separated with either a semicolon or a newline. Like the action part of a pattern-action statement, newlines are optional anywhere after the opening curly brace. The list of parameters is a list of variables separated by commas that are used within the function. The function body consists of one or more statements, just like the action part of a pattern-action statement.

A function is invoked with a function call from inside the action part of a regular pattern-action statement. The left parenthesis of the function call must immediately follow the function name, without any space between them to avoid a syntactic ambiguity with the concatenation operator. This restriction does not apply to the built-in functions.

Parameters

Scalar arguments in awk are passed to the function by value (arrays are passed by reference). Actual parameters listed in the function call of the program are copied and passed to the formal parameters declared in the function. For instance, let's define a new function called isdigit, as shown:

function isdigit(x) {
     x=8
}
BEGIN {  x=5
   print x
   isdigit(x)
   print x
}

Now let's use this simple program:


$ awk -f isdigit.awk
5
5

The call isdigit(x) copies the value of x into the local variable x within the function itself. The initial value of x is 5, as shown in the first print statement, and is not reset to a higher value after the isdigit function is finished. Note that if there were a print statement at the end of the isdigit function itself, however, the value would be 8, as expected. Call by value ensures you don't accidentally clobber an important value.

Variables

Local variables in a function are acceptable. However, because functions were not a part of awk until awk was expanded, handling local variables in functions was not a concern. Local variables must be listed in the parameter list and can't just be created as used within a routine. By convention, extra spaces separate local variables from program parameters; commas are still required between all the names. For example, function isdigit(x,   a, b) indicates that x is a program parameter, whereas a and b are local variables; they have life and meaning only as long as isdigit is active.

Global variables are any variables used throughout the program, including inside functions. Any changes to global variables at any point in the program affect the variable for the entire program. In awk, to make a variable global, just exclude it from the parameter list entirely.

Let's see how this works with a sample script:

function isdigit(x) {
     x=8
     a=3
 }
  BEGIN { x=5 ; a = 2
  print "x = " x " and a = " a
  isdigit(x)
  print "now x = " x " and a = " a
 }

The output is


x = 5 and a = 2
now x = 5 and a = 3
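Local variables declared as extra parameters behave as this sketch shows. The function total and all the variable names are invented for the example; the run of spaces in the parameter list is only a convention that marks where the locals begin.

```shell
awk '
function total(list, n,    i, sum) {    # i and sum are local to the function
    for (i = 1; i <= n; i++)
        sum += list[i]
    return sum
}
BEGIN {
    i = 99; sum = "untouched"           # globals with the same names
    n = split("10 20 30", nums, " ")
    print total(nums, n)                # 60
    print i, sum                        # 99 untouched
}'
```

The globals i and sum keep their values because the function works on its own copies. Had i and sum been left out of the parameter list, the call would have overwritten both.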

Function Calls

Functions can call each other. A function can also be recursive (call itself multiple times). The best example of recursion is factorial numbers: factorial(n) is computed as n * factorial(n-1) down to n=1, which has a value of one. The value factorial(5) is 5 * 4 * 3 * 2 * 1 = 120 and could be written as an awk program:

function factorial(n) {
  if (n <= 1) return 1;       # stops the recursion, also for n = 0
  else return ( n * factorial(n-1) )
}

For a more in-depth look at the fascinating world of recursion, you should see either a programming or data structure book.
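To watch the recursion at work, the function can be called from a BEGIN action. This is a sketch; the loop variable k and the output format are my own, and the base case uses n <= 1 so that a call with 0 also terminates.

```shell
awk '
function factorial(n) {
    if (n <= 1) return 1                # base case ends the recursion
    return n * factorial(n - 1)
}
BEGIN {
    for (k = 1; k <= 5; k++)
        print k "! = " factorial(k)
}'
```

The last line printed is 5! = 120, matching the hand calculation above.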

Gawk follows the POSIX awk specification in almost every aspect. There is a difference, though, in function declarations. In gawk, the word func may be used instead of the word function. The POSIX specification mentions that the original awk authors asked that this shorthand be omitted, and it is.

The return Statement

A function body can (but doesn't have to) end with a return statement. A return statement has two forms. The statement can consist of the direction alone: return. The other form is return E, where E is some expression. In either case, the return statement gives control back to the calling function. The return E statement gives control back and also gives a value to the function.


TIP: If the function is supposed to return a value and doesn't explicitly use the return statement, the results returned to the calling program are undefined.

Let's revisit the isdigit() function to see how to make it finally ascertain whether the given character is a digit or not:

function isdigit(x) {
     if (x >= "0" && x <= "9")
          return 1;
     else
          return 0
}

As with C programming, I use a value of 0 to indicate false and a non-zero value to indicate true. A return statement often is used when a function cannot continue due to some error. Note also that with inline conditionals--as explained earlier--this routine can be shrunk down to a single line: function isdigit(x) { return (x >= "0" && x <= "9") }
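The returned value can be used anywhere an expression is allowed. As a small sketch, the following calls the one-line isdigit on the first character of each input line; the sample input and the wording of the messages are invented for illustration.

```shell
# Classify the first character of each line using the value isdigit returns.
out=$(printf '7 apples\npears\n' | awk '
function isdigit(x) { return (x >= "0" && x <= "9") }
{
     c = substr($0, 1, 1)                         # first character of the line
     print c, (isdigit(c) ? "digit" : "not a digit")
}')
echo "$out"
```

Because substr returns a string, the comparisons inside isdigit are string comparisons, which is what the function expects.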

Writing Reports

Generating a report in awk is a sequence of steps, with each step producing the input for the next step. Report writing is usually a three-step process: pick the data, sort the data, and make the output pretty.

BEGIN and END Revisited

The section titled "Patterns" discussed the BEGIN and END patterns as pre- and post-input processing sections of a program. Along with initializing variables, the BEGIN pattern serves another purpose. BEGIN is awk's provided place to print headers for reports. Indeed, it is the only chance. Remember the way awk input works automatically. The lines

{ print "                     Total Sales"
  print "  Salesperson       for the Month"
  print "  ------------------------------" }

would print a header for each input record rather than a single header at the top of the report. The same is true for the END pattern--only it follows the last input record.

{print "------------------------------"
 print "                Total sales", ttl }

should only be in the END pattern.

Better yet:

BEGIN { print "                     Total Sales"
        print "  Salesperson       for the Month"
        print "  ------------------------------" }
{ # per person processing statements
}
END { print "------------------------------"
      print "               Total sales", ttl }
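To watch the three sections cooperate, here is a minimal runnable sketch of such a report. The two salespeople, their figures, and the column widths are invented for illustration; ttl is the running total used in the listings above.

```shell
# Sample data and column widths are invented for illustration.
out=$(printf 'Smith 1200\nJones 800\n' | awk '
BEGIN { printf "%-12s %6s\n", "Salesperson", "Sales" }   # header, once
      { printf "%-12s %6d\n", $1, $2; ttl += $2 }        # every record
END   { printf "%-12s %6d\n", "Total", ttl }             # footer, once
')
echo "$out"
```

The header and the total each appear exactly once, no matter how many input records there are.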

UNIX users are split roughly in half over which text editor they use--vi or emacs. I began using UNIX and the vi editor, so I prefer vi. The vi editor has no easy way to set off a block of text and do some operation, such as move or delete, to the block, and so falls back on the common measure, the line; a specified number of lines are deleted or copied.

When dealing with long programs, I don't like to guess about the line numbers in a block or take the time to count them either. So, I have a short script that adds line numbers to my printouts for me. It is centered around the awk program in file LIST15_5 on the CD-ROM.
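The listing on the CD-ROM is not reproduced here, but the core idea can be sketched in one line, assuming a four-column number followed by two spaces is an acceptable layout. NR is awk's built-in count of records read so far.

```shell
# Prefix every input line with its line number (NR).
out=$(printf 'first line\nsecond line\n' | awk '{ printf "%4d  %s\n", NR, $0 }')
echo "$out"
```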

Complex Reports

Using awk, it is possible to quickly create complex reports. It is much easier to perform string comparisons, build arrays on-the-fly, and take advantage of associative arrays than to code in another language (like C). Instead of having to search through an array for a match with a text key, you can use that key as the array subscript.
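As a brief sketch of using a text key as the subscript, the following totals a numeric column by region. The region names and amounts are invented sample data, and total is a hypothetical array name.

```shell
# Accumulate the second field into an array indexed by the first field.
out=$(printf 'east 10\nwest 5\neast 7\n' | awk '
{ total[$1] += $2 }                  # the region name is the subscript
END {
     print "east", total["east"]
     print "west", total["west"]
}')
echo "$out"
```

No search loop is needed; total["east"] goes straight to the right element.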

I have produced reports using awk with three levels of control breaks, multiple sections of report in the same control break, and multiple totaling pages. The totaling pages were for each level of control break plus a final page; if the control break did not have a particular type of data, the total page did not. If there was only one member of a control break, the total page for that level was not created. (This saved a lot of paper when there was really only one level of control break--the highest.)

This report ended up being more than 1,000 lines of awk (nawk to be specific) code. It takes a little longer to run than the equivalent C program, but it took a lot less programmer time to create. Because it was easy to create and modify, it was developed using prototypes. The user briefly described what he wanted, and I produced a report. The user decided he needed more control breaks, so I added them; the user realized a lot of paper was wasted on total pages, so I made the necessary modifications.

Being able to develop the report incrementally, without knowing the final result, made the work easier and more fun for me. My being responsive to the user's changes made the user happy!

Extracting Data

As mentioned earlier in this chapter, many systems do not produce data in the desired format. When working with data stored in relational databases, there are two main ways to get data out: Use a query tool with SQL or write a program to get the data from the database and output it in the desired form. SQL query tools have limited formatting capability but can provide quick, easy access to the data.

One technique I have found very useful is to extract the data from the database into a file that is then manipulated by an awk script to produce the exact format required. When required, an awk script can create the SQL statements used to query the database (specifying the key values for the rows to select).

The following example is used when the query tool places a space before a numeric field, and that space must be removed for the program that will use the data on another system (mainframe COBOL):

{   printf("%s%s%-25.25s\n", $1, $2, $3);   }

Awk automatically removes the field separator (the space character), and the format specifiers in the printf are contiguous (do not have any spaces between them).
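Here is that printf applied to one invented record. The trailing | in the format is an addition of this sketch; it only marks where the padded 25-character field ends.

```shell
# Join fields 1 and 2 with no space, then pad or cut field 3 to 25 characters.
out=$(echo 'AB 00123 Widget' | awk '{ printf("%s%s%-25.25s|\n", $1, $2, $3) }')
echo "$out"
```

The line begins AB00123Widget, and the | lands in column 33 (2 + 5 + 25 characters precede it).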

Commands On-the-Fly

The ability to pipe the output of a command into another is very powerful because the output of the first becomes the input that the second can manipulate. A frequent use of one-line awk programs is the creation of commands based on a list.

The find command (see Chapter 4, Volume I, "The UNIX File System," for more information) can produce a list of files that match its conditions, or it can execute a single command that takes a single command-line argument. I could see files in a directory (and subdirectories) that match specific conditions with the following:

$ find . -name "*.prn" -print
./exam2.prn
./exam1.prn
./exam3.prn

Or, I could print the contents of those files with the following:

$ find . -name "*.prn" -exec lp {} \;

The find command will insert the individual filenames that it locates in place of the {} and execute the lp command. But if I wanted to execute a command that required two arguments (to copy files to a new name) or execute multiple commands at once, I could not do it with find alone. I could create a shell script that would accept the single argument and use it in multiple places, or I could create an awk single-line program:

$ find . -name "*.prn" -print | awk '{print "echo bak" $1; print "cp " $1 " " $1".bak";}'
echo bak./exam2.prn
cp ./exam2.prn ./exam2.prn.bak
echo bak./exam1.prn
cp ./exam1.prn ./exam1.prn.bak
echo bak./exam3.prn
cp ./exam3.prn ./exam3.prn.bak

To get the commands to actually execute, pipe them into one of the shells. The following example uses the Korn shell; you can use the one you prefer:

$ find . -name "*.prn" -print |
     awk '{print "echo bak" $1; print "cp " $1 " " $1".bak";}' |
     ksh
bak./exam2.prn
bak./exam1.prn
bak./exam3.prn

Before each copy takes place, the message is shown. This is also handy if you wanted to search for a string (using the grep command) in the files of multiple subdirectories. Many versions of the grep command do not show the name of the file searched unless you use wildcards (or specify multiple file names on the command line). The following uses find to search for C source files, awk to create grep commands to look for an error message, and the shell echo command to show the file being searched:

$ find . -name "*.c" -print |
     awk '{print "echo " $1; print "grep error-message " $1;}' |
     ksh

The same technique can be used to perform lint checks on source code in a series of subdirectories. I execute the following in a shell script periodically to check all C code:

$ find . -name "*.c" -print |
     awk '{print "lint " $1 " > " $1".lint"}' |
     ksh

Take a look at LIST15_6 on the CD-ROM for an advanced search awk script. The lint version on one system prints the code error as a heading line and then the parts of code in question as a list below. Grep will show the heading but not the detail lines. The awk script prints all lines from the heading until the first blank line (end of the lint section).
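The CD-ROM listing is not reproduced here, but the central technique can be sketched with a range pattern, which selects every line from one match through another. The heading text warning: and the sample lines are invented stand-ins for real lint output.

```shell
# Print from a line starting with "warning:" through the next blank line.
out=$(printf 'file.c:\nwarning: unused variable\n    x  line 4\n\nother text\n' |
     awk '/^warning:/,/^$/')
echo "$out"
```

With no action given, awk prints each line in the range, so the heading and its detail lines appear while the text after the blank line does not.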

When in doubt, pipe the output into more or pg to view the created commands before you pipe them into a shell for execution.

Advanced Concepts

As you spend more time with awk, you might yearn to explore some of the more complex facets of the programming language. I highlight some of the key ones in this section.

The Built-in System Function

Although awk allows you to accomplish quite a few tasks with a few lines of code, it's still helpful sometimes to be able to tie in the many other features of UNIX. Fortunately, most versions after the original awk provide the built-in function system(value), where value is a string that you would enter from the UNIX command line.

The text is enclosed in double quotes, and the variables are written using a space for concatenating. For example, if I made a packet of files to e-mail to someone and I created a list of the files to send, I would put a file list in a file called sendrick:

$ cat sendrick
/usr/anne/ch1.doc
/usr/informix/program.4gl
/usr/anne/pics.txt

Then, awk can build the concatenated file with

$ nawk '{system("cat " $1)}' sendrick > forrick

which creates a file called forrick containing a full copy of each file. A shell script could be written to do the same thing, but shell scripts don't do the pattern matching that awk does, and they are not great at writing reports either.
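The system function also returns the exit status of the command it ran, which lets a program notice a failure. A minimal sketch follows, assuming the path /nonexistent/file does not exist on your system; the message text is invented.

```shell
# Report any file in the list that cat could not read.
out=$(echo '/nonexistent/file' | awk '{
     status = system("cat " $1 " 2>/dev/null")   # space after cat is required
     if (status != 0) print "could not read " $1
}')
echo "$out"
```

The string passed to system is built by concatenation, so the space inside "cat " is what keeps the command name and the file name apart.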

Multiline Records

By default, the input record separator RS recognizes a newline as the marker between records. As is the norm in awk, this can be changed to allow for multiline records. When RS is set to the null string, one or more blank lines separate records, and the newline character always acts as a field separator, in addition to whatever value FS might have.
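As a short sketch, the following reads two-line records separated by a blank line. The names and cities are invented, and FS is set to a newline here so that each line of a record becomes exactly one field.

```shell
# Each record is a block of lines; each line of the block is one field.
out=$(printf 'Pat Smith\nDenver\n\nLee Jones\nBoston\n' | awk '
BEGIN { RS = ""; FS = "\n" }         # blank lines separate records
{ print NR ": " $1 " lives in " $2 }')
echo "$out"
```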

Multidimensional Arrays

Although awk does not directly support multidimensional arrays, it can simulate them using the single dimension array type awk does support. Why do this? An array can be compared to a bunch of books. Different people access them in different ways. Someone who doesn't have many books might keep them on a shelf in the room, which is analogous to a single-dimension array with each book at location[i]. Time passes, and you buy a bookcase. Now each book is in location[shelf, i]. The comparison goes as far as you wish. Consider the intercounty library with each book at location[branchnum, floor, room, bookcasenum, shelf, i]. The appropriate dimensions for the array depend very much on the type of problem you are solving. If the intercounty library kept track of all their books by using a catalog number rather than location, a single dimension of book[catalog_num] = title would make more sense than location[branchnum, floor, room, bookcasenum, shelf, i] = title. Awk allows either choice.

Awk stores array subscripts as strings rather than as numbers, so adding another dimension is actually only a matter of concatenating another subscript value to the existing subscript. Suppose you designed a program to inventory jeans at Levis. You could set up the inventory so that item[inventorynum]=itemnum or item[style, size, color] = itemnum. The built-in variable SUBSEP is put between subscripts when a comma appears between subscripts. SUBSEP defaults to the value \034, a value with little chance of being in a subscript. Because SUBSEP marks the end of each subscript, subscript names do not have to be the same length. For example:

item["501","12w","stone washed blue"]
item["dockers","32m","black"]
item["relaxed fit", "9j", "indigo"]

are all valid examples of the inventory. Determining the existence of an element is done just as it is for a single-dimension array, with the addition of parentheses around the subscript list. Every subscript must be given; awk has no wildcard for a missing one. Your program should reorder when a certain size gets low.

if (("501", "12w", "stone washed blue") in item) print "tag needed"


NOTE: The in keyword is nawk (and later) syntax.

The price increases on 501s, and your program is responsible for printing new price tags for the items which need a new tag:

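for (key in item) {
     split(key, part, SUBSEP)
     if (part[1] == "501") print "new tag for item " item[key]
}

Recall the string function split; split(key, part, SUBSEP) breaks a combined subscript back into its components, so part[1] holds the first subscript. Awk cannot loop over only the elements whose first subscript is "501"; the loop visits every element and tests each one.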
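Putting the pieces of this section together, here is a minimal runnable sketch. The item numbers 1001 and 2002 are invented, and key and part are hypothetical variable names. Every subscript is spelled out in the membership test, and split with SUBSEP recovers the components of each combined subscript.

```shell
# Simulated three-dimensional array: subscripts are joined with SUBSEP.
out=$(awk 'BEGIN {
     item["501", "12w", "blue"] = 1001
     item["dockers", "32m", "black"] = 2002
     if (("501", "12w", "blue") in item) print "have it"
     for (key in item) {                  # visits every element
          split(key, part, SUBSEP)        # part[1] is the style
          if (part[1] == "501") print part[2], part[3], item[key]
     }
}')
echo "$out"
```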

Summary

The awk programming language is useful in many ways--with and without full programs. You can use it to search for data, extract data from files, create commands on-the-fly, or even create entire programs.

Awk is very useful as a prototyping language. You can create reports very quickly. After showing them to the user, you can make changes quickly also. Although awk is less efficient than a comparable program written in C, it is not so inefficient that you cannot create production programs with it. If efficiency is a concern with an awk program, it can be converted into C.

The capability to search with patterns and to use associative arrays (which perform easy array lookups based on strings) are features that set awk apart from other programming languages.

The next chapter covers Perl, a language related to awk.

Further Reading

For further reading:

Aho, Alfred V., Brian W. Kernighan, and Peter J. Weinberger. The Awk Programming Language. Reading, Mass.: Addison-Wesley, 1988 (copyright AT&T Bell Lab).

IEEE Standard for Information Technology, Portable Operating System Interface (POSIX), Part 2: Shell and Utilities, Volume 2. Std. 1003.2-1992. New York: IEEE, 1993.

See also the man pages for awk, nawk, or gawk on your system.

GNU awk, gawk, is available on the CD-ROM.



©Copyright, Macmillan Computer Publishing. All rights reserved.