Essentials for Scientific Computing: Stream editing with sed and awk Day 6

Author: Justina Smith
Ershaad Ahamed
TUE-CMS, JNCASR

May 2012

One powerful way of specifying addresses is to use a pattern. We have been using patterns to specify what strings to match for the s command. We can use the same patterns as addresses, and the commands that follow are then executed only for lines that match the pattern. Pattern addresses are written between / characters. Using this, we can write a script to delete empty lines from a file.

    cat comfile.txt | sed -e '/^$/d'

Here the sed command being executed is d (delete line), which is executed for lines that match the address /^$/. The address may look cryptic to you, so we’ll see what it means next. The pattern ^ matches the beginning of a line. Unlike literal strings or patterns like \w, ^ matches a position rather than a character. Similarly, $ matches the end of a line. The combination ^$ thus matches an empty line, that is, one with no characters between the ^ and the $. Looking at the output, we get

    C   3.102166   11.5549   0.0000
    C   4.343029   10.8749   0.0000
    C   4.343243   9.41218   0.0000
    C   3.102143   8.71322   0.0000
    B   3.100137   7.30638   0.0000
    N   4.341568   6.57610   0.0000
    B   4.345228   5.13343   0.0000
    N   3.103911   4.39795   0.0000

    B   3.100340   2.95305   0.0000
    N   4.341533   2.21948   0.0000
    C   0.620442   8.71323   0.0000
    B   0.618437   7.30639   0.0000
    N   1.859867   6.57611   0.0000
    B   1.863528   5.13344   0.0000
    N   0.622211   4.39797   0.0000
    B   0.618640   2.95306   0.0000
    N   1.859832   2.21949   0.0000

    B   1.863132   0.75964   0.0000
    N   0.622276   0.00000   0.0000

We find that some of the empty lines have been removed and others have not. This means that those lines do not match the address we gave. On closer examination, we find that those lines are not really empty: they contain so-called whitespace characters, which are real characters but do not display on screen or print. Whitespace characters are commonly spaces and tabs. The lines above have tabs and spaces between the beginning and end of the line. We need to modify our address pattern to match spaces and tabs too. Let’s move a step at a time and start with spaces only.

    cat comfile.txt | sed -e '/^ *$/d'

The change is that a space followed by a * has been added. If we included only a space, the pattern would match a line containing exactly one space. The * modifies the meaning of the space (in much the same way as \+ does) and means zero or more repetitions of the preceding character. Thus ‘ *’ (notice the space) means zero or more consecutive spaces. The script above now also removes lines that are empty except for spaces. Changing the script to do the same for lines that contain only tabs, we get

    cat comfile.txt | sed -e '/^\t*$/d'

where \t is the tab character. This script does not work for lines that contain spaces, so we need to tell the sed script to look for either a space or a tab character. The alternation operator \| can be used for this. A pattern like word1\|word2 matches an occurrence of either word1 or word2 in the input, and a pattern of this type is called an alternation. Modifying our script,

    cat comfile.txt | sed -e '/^\( \|\t\)*$/d'

In this example, we enclose the pattern in parentheses not to capture the match, but to group the alternation as one unit. This means that the * modifies the meaning of the whole group and not only the single character preceding it, so that the pattern means zero or more repetitions of \( \|\t\).

Thus the script now removes empty lines, lines with only spaces, lines with only tabs and lines with both tabs and spaces. Since we’re only alternating between single characters and not longer patterns or strings, we can use the more compact pattern

    cat comfile.txt | sed -e '/^[ \t]*$/d'

A pattern of the form [] matches any one of the enclosed characters, or any one character in an enclosed range such as 0-9. For example, [14ab] matches any one of 1, 4, a or b.
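The final pattern can be checked on a small input whose contents are known in advance. The sketch below is an illustration and not part of the original exercise: the sample lines are made up to resemble comfile.txt, and it uses the POSIX character class [[:space:]] in place of [ \t], on the assumption that not every sed implementation understands the \t escape.

```shell
# Made-up sample: two data lines separated by an empty line,
# a spaces-only line and a tab-only line.
sample=$(printf 'C 3.102166 11.5549 0.0000\n\n   \n\t\nB 3.100137 7.30638 0.0000\n')

# Delete every line that holds nothing but whitespace.
cleaned=$(printf '%s\n' "$sample" | sed -e '/^[[:space:]]*$/d')

printf '%s\n' "$cleaned"
```

Only the two data lines should remain; the three blank-looking lines in between are all matched by the address and deleted.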


1 More sed

The sed command supports many more operations than we have covered here, but the examples above cover the most common use cases. sed commands that need to be used again can be saved in a text file and reused. This is done by specifying the -f SCRIPTFILE option instead of the -e option. We have also just scratched the surface of regular expression syntax and uses, which is a vast topic in itself. A summary of some of the metacharacters that we haven’t covered is below.

• ‘.’ (period) This matches any single character, including whitespace. Depending on the implementation and flags, it can also match a newline.
• ‘?’ This matches zero or one occurrence of the preceding character. Compare with ‘+’, which matches one or more. In the basic regular expressions that sed uses by default, GNU sed writes these as \? and \+.
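As a minimal sketch of the -f option, the commands below save two sed commands in a script file and apply them to a small sample. The directory, the file names sample.txt and cleanup.sed, and the sample data are all hypothetical and chosen only for illustration.

```shell
# Work in a throwaway directory; both file names are hypothetical.
workdir=$(mktemp -d)
printf 'C 3.102166 11.5549 0.0000\n\nB 3.100137 7.30638 0.0000\n' > "$workdir/sample.txt"

# The script file holds one sed command per line, written exactly
# as it would appear after -e on the command line.
printf '%s\n' '/^[[:space:]]*$/d' 's/ 0\.0000$/ 0/' > "$workdir/cleanup.sed"

# Run every command in the script file against each input line.
result=$(sed -f "$workdir/cleanup.sed" "$workdir/sample.txt")
printf '%s\n' "$result"

rm -r "$workdir"
```

The first command deletes the blank line and the second shortens the trailing 0.0000 to 0, so the same file can be reused on any input with this layout.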

2 awk

Awk, like sed, is a powerful stream editing and processing language. The program gawk, which you will find on most Linux distributions, is the GNU implementation of the AWK programming language developed by Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan. Although it is a complete programming language, awk is particularly suited to processing and reformatting structured text data. An example of structured data is a file generated by a scientific measurement, consisting of several lines of text where each line consists of a number of columns or readings. In awk, manipulations like rearranging columns, aggregating data and filtering can be conveniently programmed. By default awk, like sed, reads its input one line at a time and applies commands to each line that is read in. This behaviour can be changed to suit data that is structured differently. Let’s begin with an example file we used before.

    C   3.102166   11.5549   0.0000
    C   4.343029   10.8749   0.0000
    C   4.343243   9.41218   0.0000
    C   3.102143   8.71322   0.0000
    B   3.100137   7.30638   0.0000
    N   4.341568   6.57610   0.0000
    B   4.345228   5.13343   0.0000
    N   3.103911   4.39795   0.0000
    B   3.100340   2.95305   0.0000
    N   4.341533   2.21948   0.0000

A simple awk script that prints all lines to stdout is below.

    cat comfilens.txt | awk '{ print }'

Awk automatically splits input lines into fields (columns) separated by whitespace and assigns each of these fields to a special variable: $1 is the first field, $2 the second and so on. The script

    cat comfilens.txt | awk '{ print $1 }'

is an easy way to extract the first column.

    C
    C
    C
    C
    B
    N
    B
    N
    B
    N

If we wanted to print the columns in reverse order, we would use

    cat comfilens.txt | awk '{ print $4,$3,$2,$1 }'

and the output will be

    0.0000 11.5549 3.102166 C
    0.0000 10.8749 4.343029 C
    0.0000 9.41218 4.343243 C
    0.0000 8.71322 3.102143 C
    0.0000 7.30638 3.100137 B
    0.0000 6.57610 4.341568 N
    0.0000 5.13343 4.345228 B
    0.0000 4.39795 3.103911 N
    0.0000 2.95305 3.100340 B
    0.0000 2.21948 4.341533 N
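The field splitting described above can be tried without the original data file. The following sketch uses a made-up two-line stand-in for comfilens.txt in which the columns are deliberately separated by a mix of tabs and repeated spaces, to show that awk treats any run of whitespace as a single separator by default.

```shell
# Made-up stand-in for comfilens.txt; columns are separated by a
# mix of tabs and repeated spaces.
sample=$(printf 'C\t3.102166   11.5549 0.0000\nB  3.100137\t7.30638\t0.0000\n')

# Extract the first column.
first=$(printf '%s\n' "$sample" | awk '{ print $1 }')

# Print the four columns in reverse order.
reversed=$(printf '%s\n' "$sample" | awk '{ print $4,$3,$2,$1 }')

printf '%s\n' "$first"
printf '%s\n' "$reversed"
```

Whatever the spacing in the input, the reversed output has exactly one space between columns, because the commas in the print statement insert the separator.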

Printing in this way places a single space between each variable. If we want more spaces or even a tab character, we can place literal strings in the print command. Placing literal strings or variables next to each other, without a comma, causes them to be concatenated. For example, the script

    cat comfilens.txt | awk '{ print $4"   "$3"\t\t"$2$1 }'

produces the output

    0.0000   11.5549         3.102166C
    0.0000   10.8749         4.343029C
    0.0000   9.41218         4.343243C
    0.0000   8.71322         3.102143C
    0.0000   7.30638         3.100137B
    0.0000   6.57610         4.341568N
    0.0000   5.13343         4.345228B
    0.0000   4.39795         3.103911N
    0.0000   2.95305         3.100340B
    0.0000   2.21948         4.341533N

Notice how $2 and $1 are concatenated directly, while the literal strings containing spaces and tabs separate the other fields.
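The difference between separating items with a comma and placing them next to each other is easiest to see on a single line. This is a small sketch with one made-up input line; the variable names are ours.

```shell
line='C 3.102166 11.5549 0.0000'

# A comma between the items inserts a single space.
with_commas=$(printf '%s\n' "$line" | awk '{ print $2,$1 }')

# Items placed side by side are concatenated with nothing in between.
joined=$(printf '%s\n' "$line" | awk '{ print $2$1 }')

# A literal string between the items supplies a separator of our choice.
with_tab=$(printf '%s\n' "$line" | awk '{ print $2"\t"$1 }')

printf '%s\n' "$with_commas" "$joined" "$with_tab"
```

The three results are 3.102166 C, 3.102166C, and 3.102166 followed by a tab and C.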

2.1 Using Variables, the BEGIN and END Blocks

Like any other programming language, awk lets us use variables.

    cat comfilens.txt | awk '{ var1 = $2; var2 = $3; var3 = $4; \
        average = ( var1 + var2 + var3 )/3 ; print average }'

Notice that, since the program is too long to fit on a single line, we continue it on the next line by placing a backslash as the last character of the line. Inside the single quotes the shell passes the backslash and the newline through to awk unchanged, and awk treats a backslash at the end of a line as a line continuation. Each consecutive statement is separated by a semicolon ;. When awk programs are placed in files for reuse (in exactly the same way as sed or bash scripts), the semicolons can be omitted by putting each statement on its own line. In the awk script above, which calculates and prints the average of the three values on each line, the statements between a set of braces {} make up a block of the awk script. The block is executed once for each line of the input. Thus, for each line, the second field (column), that is $2, is assigned to the variable var1, the third field to var2 and the last field to var3. Then the average of the three values is computed and assigned to the variable average. In the last statement of the script, the variable average is printed. The output is

    4.88569
    5.07264
    4.58514
    3.93845
    3.46884
    3.63922
    3.15955
    2.50062
    2.0178
    2.187

Suppose we would like to compute the totals for each of the columns containing numbers. To do this, we create a variable with an initial value of zero and add successive values from each line to it. We could initialise a variable with a statement like var1_tot = 0, but we have a problem. If we place that statement inside the same block, it will be executed for each line of input, and thus var1_tot will be reset to zero on every line. The solution is a special block called the BEGIN block, which is executed only once, before any other block is executed.

Let’s add the BEGIN block to our script above, remove the statements that calculate and print the average, and instead of storing each of the fields in var1, var2 and var3, accumulate them in var1_tot, var2_tot and var3_tot (var1 += var2 being equivalent to var1 = var1 + var2). Since our program is getting longer, we can put the script in a file. To do this we write our awk script into a file using an editor like Vim. We can then execute the script using

    cat comfilens.txt | awk -f SCRIPTFILE

where SCRIPTFILE is the file with our awk script. The new script, now in a script file, is below. Notice that there are no semicolons, since each statement is on a separate line.

    BEGIN {
        var1_tot = 0
        var2_tot = 0
        var3_tot = 0
    }
    {
        var1_tot += $2
        var2_tot += $3
        var3_tot += $4
        print var1_tot, var2_tot, var3_tot
    }

The output is

    3.10217 11.5549 0
    7.4452 22.4298 0
    11.7884 31.842 0
    14.8906 40.5552 0
    17.9907 47.8616 0
    22.3323 54.4377 0
    26.6775 59.5711 0
    29.7814 63.9691 0
    32.8818 66.9221 0
    37.2233 69.1416 0

In our script var1_tot, var2_tot and var3_tot are printed for each line of the input. There is a special block called the END block, which is analogous to the BEGIN block but is executed only once, after every other block has been executed. By putting the print command in the END block, we print just the final totals. The new script will look like

    BEGIN {
        var1_tot = 0
        var2_tot = 0
        var3_tot = 0
    }
    {
        var1_tot += $2
        var2_tot += $3
        var3_tot += $4
    }
    END {
        print var1_tot, var2_tot, var3_tot
    }
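The same three-block structure extends naturally from totals to column averages. The sketch below is our own variation, not a script from the text above: it adds a counter n (a name chosen here for illustration) that is incremented once per line, and divides the totals by it in the END block. The three-line sample is a made-up subset of the example data.

```shell
sample=$(printf 'C 3.102166 11.5549 0.0000\nC 4.343029 10.8749 0.0000\nB 3.100137 7.30638 0.0000\n')

means=$(printf '%s\n' "$sample" | awk '
BEGIN {
    # Runs once, before any input is read.
    n = 0
    var1_tot = 0
    var2_tot = 0
}
{
    # Runs once per input line.
    n += 1
    var1_tot += $2
    var2_tot += $3
}
END {
    # Runs once, after the last line; skip the division on empty input.
    if (n > 0) print var1_tot / n, var2_tot / n
}
')

printf '%s\n' "$means"
```

For the three sample lines this prints 3.51511 9.91206, the means of the second and third columns.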

2.2 Selective Execution with Patterns

Remember how we could cause sed to execute commands selectively for certain lines by specifying addresses. Although awk does not support line-number addresses directly, it supports expressions called patterns. Line numbers can be selected, albeit a bit differently than in sed. The next examples will illustrate that. An awk pattern is specified just before the opening brace of a block. Naturally, this doesn’t apply to the BEGIN and END blocks. When the pattern is a regular expression, the actions within the block are executed if the pattern matches the current line (more accurately, record) that has been read. Suppose that, for our initial example, we would like to print only the lines that have a ‘C’ as the first character.

    cat comfilens.txt | awk '/^C/{ print }'

We get the output

    C 3.102166 11.5549 0.0000
    C 4.343029 10.8749 0.0000
    C 4.343243 9.41218 0.0000
    C 3.102143 8.71322 0.0000
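A pattern can guard any block, not only one that prints. As a small sketch combining this with the totals from the previous section, the script below accumulates the second column only for lines that begin with ‘C’. The variable names count and total and the four-line sample are our own choices for illustration.

```shell
sample=$(printf 'C 3.102166 11.5549 0.0000\nC 4.343029 10.8749 0.0000\nB 3.100137 7.30638 0.0000\nN 4.341568 6.57610 0.0000\n')

c_summary=$(printf '%s\n' "$sample" | awk '
BEGIN { count = 0; total = 0 }
# This block runs only for lines whose first character is C.
/^C/ { count += 1; total += $2 }
END { print count, total }
')

printf '%s\n' "$c_summary"
```

Only the two ‘C’ lines contribute, so the output is 2 7.4452.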

The pattern can be negated by placing a ! before it, so the command

    cat comfilens.txt | awk '! /^C/{ print }'

will print all lines that do not begin with a ‘C’. A pattern can also be an expression that evaluates to true (non-zero) or false (zero). Awk has certain special variables whose values can be read. For instance, the NR variable contains the line number of the line currently being processed. For a block to be executed only for even line numbers, we would use

    cat comfilens.txt | awk 'NR % 2 == 0 { print }'

which prints only even-numbered lines. % is the modulo (remainder) operator. The == comparison operator evaluates to 1 (true) if the expressions on its left and right are equal, and 0 (false) if they are not. We can rewrite the earlier example that printed the lines with ‘C’ in the first column using an expression rather than a regular expression, as follows.

    cat comfilens.txt | awk '$1 == "C" { print }'

Another type of pattern is a range of the form pat1,pat2. In this case the block is executed for every line from the first line that matches pat1 up to the next line that matches pat2, both lines included. For example, to start printing at the first line beginning with the ‘C’ character and to stop printing after the first line that begins with an ‘N’ character, we would use

    cat comfilens.txt | awk '/^C/,/^N/ { print }'

The patterns in a range need not be regular expressions, as in the following, which prints from the first line beginning with ‘C’ up to line 7.

    cat comfilens.txt | awk '/^C/,NR==7 { print }'
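The pattern types above can be compared side by side on a short input. In this sketch the six sample lines are made up: the first column mimics the labels in comfilens.txt and the second column is simply the line number, so that the selected lines are easy to recognise.

```shell
sample=$(printf 'C 1\nC 2\nB 3\nN 4\nB 5\nN 6\n')

# Expression pattern: even-numbered lines only.
even=$(printf '%s\n' "$sample" | awk 'NR % 2 == 0 { print }')

# Negated regular expression: lines that do not begin with C.
not_c=$(printf '%s\n' "$sample" | awk '! /^C/ { print }')

# Range: from the first line beginning with C through the next
# line beginning with N, both included.
c_to_n=$(printf '%s\n' "$sample" | awk '/^C/,/^N/ { print }')

# Range ending on a line number instead of a regular expression.
c_to_3=$(printf '%s\n' "$sample" | awk '/^C/,NR==3 { print }')

printf '%s\n' "$even" "$not_c" "$c_to_n" "$c_to_3"
```

Here the first range covers lines 1 to 4 and the second covers lines 1 to 3; once a range has ended, later lines are printed only if pat1 matches again.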
