Command line tools - bash, awk and sed

Author: Aubrie Russell
We have only explored a small fraction of the capabilities of the bash shell and command-line utilities in Linux. An entire course could be taught about bash alone, but that is not the central topic of interest for this class. There are, however, a few more commands and techniques that are useful enough to merit mention in this session. Previous exercises have used both awk and sed, but this class will devote more attention to general principles of these tools, as well as some additional bash functions. We will also discuss "regular expressions" and file "globs" as part of learning more about command-line tools, because these are generally applicable to many functions. The document FileGlobbing.pdf in the /media/lubuntu/data/documents folder on the USB system contains the information from the Ubuntu manpage on file globbing. Read this document, then interpret the following globs. What would each of the following patterns match? Test them using the ls command on the /media/lubuntu/data/data directory:

ls /media/lubuntu/data/data/*.gz
ls /media/lubuntu/data/data/[ct][ns][1-3]ln[1-8].*
ls /media/lubuntu/data/data/[!a-m]*.fastq*
ls /media/lubuntu/data/data/[a-m]*.fastq*
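As a sketch of how the bracket patterns behave, the globs can also be tried against a throwaway directory of hypothetical filenames, modeled on the cn/ts sample names used later in this session (this is an illustration, not an actual listing of the data directory):

```shell
# Hypothetical filenames for testing the globs; the real data directory
# may contain different files.
mkdir -p /tmp/globdemo && cd /tmp/globdemo
touch cn1ln7.fastq.gz ts2ln6.fastq.gz athal_chr5.fa.gz notes.txt

ls *.gz                     # all three .gz files
ls [ct][ns][1-3]ln[1-8].*   # cn1ln7.fastq.gz ts2ln6.fastq.gz
ls [!a-m]*.fastq*           # ts2ln6.fastq.gz (first character NOT in a-m)
ls [a-m]*.fastq*            # cn1ln7.fastq.gz (first character in a-m)
```

Note that notes.txt and athal_chr5.fa.gz fail the bracketed patterns at the first character, and that a glob must describe the entire filename to match it.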

Regular expressions are similar to file globs in the sense that both are ways of writing general "patterns" that can match many different targets. File globs apply specifically to file and directory names, while regular expressions apply to any text. Unfortunately, there are multiple "flavors" of regular expressions: basic, extended, Perl-format, and so forth. Different tools use slightly different syntax and different flavors, so getting familiar enough with regular expressions to make them work for you is not a trivial task. The RegularExpressions.pdf document in the /media/lubuntu/data/documents folder on the USB drive has an overview of regular expressions, although this is by no means a complete explanation of the topic. Read the document, then interpret the following regular expressions, and note what backreferences are available for use in a substitution command. NOTE: grouping using () and alternative search terms using | are capabilities of "extended regular expressions" rather than "basic regular expressions", so test these regular expressions using the egrep command on a listing of the files in the /media/lubuntu/data/data directory:

ls /media/lubuntu/data/data | egrep "(fastq|fasta|fq|fa)\.gz"
ls /media/lubuntu/data/data | egrep "(cn|ts)[1-3]ln[^3a-zA-Z]\."
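The parenthesized groups in an extended regular expression also become numbered backreferences (\1, \2, ...) when the pattern is used in a substitution command. A minimal sketch with sed -E, using a hypothetical filename rather than a real directory listing:

```shell
# Each (...) group is captured and can be reused as \1, \2, \3 in the
# replacement; the unmatched trailing text (.fastq.gz) is left intact.
echo "cn1ln7.fastq.gz" | sed -E 's/(cn|ts)([1-3])ln([1-8])/condition=\1 rep=\2 lane=\3/'
# condition=cn rep=1 lane=7.fastq.gz
```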

Note that grep finds matches to the search pattern regardless of the presence of other characters around the pattern - this is another difference between regular expressions and file globs. A file glob matches only filenames that are completely described by the pattern, while a regular expression matches any text that contains the pattern. Another version of the grep program is zgrep, which can search within gzipped files without saving a decompressed version of the file. This is useful if it is necessary to find subsets of data within a large file of sequence data stored in fastq.gz compressed format. One challenge with using zgrep on fastq.gz files is that it cannot readily distinguish between the four types of lines found in FASTQ-format files. We will come back to this point later, when we discuss awk in more depth.

With a basic understanding of file globs and regular expressions, we can look at how these are useful. We will start with commands for locating things. Linux system architecture is complex enough that it is not always obvious what software is installed on a given system, or, if it is not installed, whether it is available in pre-packaged form from the distribution repository. The Boost library required for compiling TopHat is an example - how can we find out if our Linux system already has a copy of the Boost library installed? If it does not, how do we know if the library is available in an Ubuntu package from the repository, or if we should download the source code, compile it, and install the library? The software for managing packages in Lubuntu is called dpkg, short for 'Debian package' - Lubuntu is a version of Ubuntu, which in turn is a version of the Linux distribution Debian. The command to search for packages with names containing the word 'boost' is:

dpkg-query -l *boost*

Try this and see what happens. The information returned provides the complete names of packages that match the search pattern, along with the status of those packages on the local system (in the first three columns of each line). If the line begins with 'ii', the package is installed and functional on the local system; if the line begins with 'un', the package is not installed. Try the search again using different patterns. For example, leave out one or both asterisks: dpkg-query -l boost*

The characters 'boost' are surrounded by other characters in the names of the software packages libboost-system and libboost-thread, so both asterisks are required for the search to be successful - this behaves like a file glob rather than like a regular expression search. The dpkg-query command only searches the names of packages, which may or may not include the pattern of interest. To search for packages that contain the pattern 'boost' in either the package name or the description, use the command apt-cache search:

apt-cache search boost

The apt-cache is a list of the names and descriptions of all available packages, installed or not, so this search returns a long list of packages that include the pattern 'boost' somewhere in the name or description. Note that leading and trailing asterisks are not needed, unlike with dpkg-query - this behaves like a regular expression search rather than a file glob. Another command for finding software on the local system is 'locate' - this uses a local database of installed programs, so it is wise to make sure the database is up to date before doing a search. Execute the following two commands to update the database and then search it:

sudo updatedb
locate boost

Note that the locate command, similar to apt-cache search, does not require leading or trailing asterisk characters to find the pattern 'boost', even when it is surrounded by other characters. This highlights a reality of Linux systems - different parts of the same system don't always follow the same rules, so it is often important to try different versions of a command, or to search online for tips using keywords that include the specific command you are trying to use and the specific distribution you are working with. An even more general tool for searching the local system is the bash command find, which will return a longer list of files and directories containing the pattern 'boost' than the locate command does. Try:

sudo find / -iname '*boost*'

This should be run with sudo privileges if you are searching (as in this example) from the root directory / rather than from the /home/lubuntu directory, because the lubuntu user does not have read privileges on many of the directories searched by the find command, so running the command without sudo privileges results in many 'Permission denied' error messages. The -iname option tells find to search for the pattern '*boost*' in filenames in a case-insensitive manner. Other options allow the find command to search based on file type, file size, creation date, last modification date, or almost any other characteristic. Try the search using boost without the flanking asterisks - is the find function more similar to file globs or to regular expressions?

Loops and Conditionals

DNA sequence analysis often requires carrying out the same steps of data processing and analysis separately for multiple sequence files. For example, even our small example RNA-seq experiment, with only two conditions and three replicates per condition, requires processing six sequence files separately up to the point where the data from the six files are combined into a single object containing read counts for each transcript from all six samples. There are two general strategies for simplifying repetitive sets of tasks that need to be completed on multiple files, and multiple methods for implementing either strategy. The two strategies are serial processing and parallel processing - one either carries out the tasks on one file at a time, repeating the same tasks on one file after another until the whole set is finished, or one carries out the tasks in parallel on all files at the same time. The choice of strategy is based on the amount of system resources required per task, the number of files to be processed, and the total amount of resources available. A workstation with 24 processors and 128 GB of RAM could carry out parallel processing of the six RNA-seq data files from the example dataset at the same time, but the BIT laptops with 4 processors and 8 GB of RAM probably could not. Command-line options for parallel processing include the commands xargs with tee, or the parallel command; see the manpages for more information on xargs and tee, and the webpage for more information on parallel.
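As a minimal sketch of the parallel strategy, xargs can launch several processes at once with the -P option. Here echo is a stand-in for a real per-file command such as an aligner - an illustration, not the actual pipeline used in this class:

```shell
# Run up to 3 jobs at once; each {} is replaced by one sample name read
# from standard input. With real workloads the completion order of the
# parallel jobs is not guaranteed.
printf '%s\n' cn1ln7 cn2ln8 cn3ln1 | xargs -P 3 -I {} echo "processing {}"
```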
Loops

One way to carry out serial processing is to create a loop of commands. This is a sequence of commands provided to the interpreter at the same time that (1) gives a list of things to be done, and (2) specifies what is to be done with them. The "list of things to be done" can be as simple as counting through numbers in a sequence or processing each file in turn from a list of filenames, while the specification of "what is to be done with them" can be as simple or as complex as desired. Loops are a general concept that can be applied in many ways using different command interpreters; we will focus on examples of loops in the bash shell and in the awk text processing language. Bash loops can have any of three general forms: for loops, while loops, or until loops (see the Bash Beginners Guide). The for loop takes the form 'for X in Y; do Z; done'. X is a variable, Y is a list of values that will each be assigned in turn to the variable X, and Z is an action or set of actions to be taken. A simple example which simply echoes the numbers 1 to 10 to the screen is:

for x in {1..10}; do echo $x; done

This example demonstrates that numerical sequences can be specified using curly brackets and two dots separating the beginning and ending values. A list of text items, such as file names, can be provided as a space-separated list with no brackets or quotes (provided the file names do not contain spaces):

for x in cn1ln7 cn2ln8 cn3ln1 ts1ln4 ts2ln6 ts3ln2; do echo $x; done
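The for loop is the form used throughout this session, but the while and until forms named above follow the same do/done structure. A minimal sketch of each, counting to 3 rather than doing real work:

```shell
# while: repeat as long as the test succeeds
i=1
while [ $i -le 3 ]; do echo "while $i"; i=$((i+1)); done

# until: repeat as long as the test fails
j=1
until [ $j -gt 3 ]; do echo "until $j"; j=$((j+1)); done
```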

In order to help the bash shell recognize the boundaries of the string used as the variable name, we can put curly brackets around the string, so that adjacent characters are clearly marked as not part of the name. For example, if we wanted to use the same strategy as before to list the names of the sequence files, but used the variable only for the parts that are different, we could write:

for x in cn1ln7 cn2ln8 cn3ln1 ts1ln4 ts2ln6 ts3ln2; do echo ${x}.fastq.gz; done

Instead of just echoing the filename to the screen, we could just as easily use this loop to carry out serial alignment of all six RNA-seq data files to the chromosome 5 reference sequence to produce BAM output files:

for x in cn1ln7 cn2ln8 cn3ln1 ts1ln4 ts2ln6 ts3ln2; do bwa mem -t 3 Atchr5 ${x}.fastq.gz | samtools view -SbuF4 - | samtools sort - ${x}; done
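A useful habit before launching a long-running loop like this is a "dry run" that echoes each assembled command line instead of executing it, to confirm that the variable substitution produces the pipelines you expect (a sketch using two of the sample names; nothing is aligned):

```shell
# Print the command lines instead of running them; note ${x} appearing
# twice, once for the input file and once for the output prefix.
for x in cn1ln7 ts1ln4; do
  echo "bwa mem -t 3 Atchr5 ${x}.fastq.gz | samtools view -SbuF4 - | samtools sort - ${x}"
done
```

Once the echoed lines look right, remove the echo and quotes to run the loop for real.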

Note that the value of the variable $x is used twice, once to name the input file to be aligned, and a second time to name the BAM file produced as output. This points out the advantages of a file-naming system that keeps all the relevant information together in one part of the filename, so the name can easily be subdivided into different parts, one specific to the sample identity and another specific to the file format or stage of processing. There are also advantages to naming files so that the sample identity segment always consists of the same number of characters. For example, in analysis of genotyping-by-sequencing data using the STACKS pipeline, sequence reads for each sample are expected to be stored in separate files. This can mean dozens or hundreds of different sequence files, so typing out a list of all the filenames as demonstrated above would be very tedious. If the files are named so that the sample identity always occupies the first 6 characters (e.g. p01A01 to p10H12 for ten 96-well plates of samples, each named for the plate and well position where it is stored), then a loop that processes all 960 files can be written in a single line:

for file in *.fastq.gz; do name=${file:0:6}; bwa mem -t 3 refindex/refname ${name}.fastq.gz | samtools view -SbuF4 - | samtools sort - ${name}; done

It is critical, in using loops of this sort, to make sure that the path information is correct relative to the working directory from which the command is executed. This example would be executed from within a directory containing the sequence files, and also containing a sub-directory called refindex in which the genome reference index called refname is stored. Two more points are worth mentioning here. First, the list of file names assigned to the $file variable is produced using a file glob - this example is simple, but it can be as complex as necessary to achieve the desired result. Second, the first 6 characters of each filename are extracted using the syntax ${variable:position:length}, where "position" is 0-indexed. The first character of the filename in the $file variable is position 0, and a sub-string of length 6 beginning at that first character is assigned to the variable $name. We can store the BAM output alignment files in a different directory than the input sequence files, if desired, by providing more path information in addition to filenames. For example:

for file in seqdata/*.fastq.gz; do name=${file:8:6}; bwa mem -t 3 refindex/refname seqdata/${name}.fastq.gz | samtools view -SbuF4 - | samtools sort - bamfiles/${name}; done

This bash loop would be executed from a directory containing three sub-directories: the seqdata/ directory with the fastq.gz sequence files for all samples, the refindex/ directory containing the reference genome index, and the bamfiles/ directory in which the BAM output is to be stored. Note that the substring assigned to the $name variable now starts at position 8 of the filename stored in the $file variable, because the path string "seqdata/" occupies positions 0 to 7. Change to the /media/lubuntu/data directory and try the command below to see how files are listed:

for file in data/*gz; do echo $file; done
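The substring extraction used in the loops above can be checked quickly at the prompt. The correct bash form uses colons, ${variable:position:length}, with the position counted from 0 (the filename here is hypothetical):

```shell
file=seqdata/p01A01.fastq.gz
echo ${file:0:7}   # seqdata   (7 characters starting at position 0)
echo ${file:8:6}   # p01A01    (skip the 8-character "seqdata/" prefix)
```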

We discussed the text processing language awk in the context of manipulating and summarizing SAM alignment files, but it is a powerful tool for other kinds of text processing as well. For example, consider the problem of summarizing a file containing several FASTA-format sequences, each with a single header line and multiple lines of sequence, like the file test.fa available on the website. If we want to know how long each of the DNA sequences is, we can use an awk loop to condense all the sequence information onto a single line, then determine the length of that line. awk 'BEGIN {RS = ">"; FS = "\n"} {printf "%s\n", ">"$1} {for(i=2;i