Login
• Requesting a tak account: http://iona.wi.mit.edu/bio/software/unix/bioinfoaccount.php
• Windows: PuTTY. Need to set up X-windows for graphical display
• Macs: access the Terminal: Go > Utilities > Terminal
Login using Secure Shell
$ ssh -Y user@tak
• PuTTY on Windows
• Terminal on Macs
Command prompt: user@tak ~$
Hot Topics website: http://jura.wi.mit.edu/bio/education/hot_topics/
• Please log in to our tak server, create a directory for the exercises in your lab share, and use it as your working directory:
$ mkdir unix2
$ cd unix2
• Download all files into your working directory
• You should have the files below in your working directory:
– foo.txt, sample1.txt, exercise.txt, join2filesByFirstColumn.pl, datasets folder
– you can check with the ls command
select only these fields
select 1st and 5th fields
select 1st, 2nd, 3rd, 4th, and 5th fields
$ wc -l foo.txt
How many lines are in this file? 5
What you will learn…
• sed
• awk
• groupBy (bedtools)
• loops
• join files together
• scripting
sed: stream editor for filtering and transforming text
• Print lines 10 - 15:
$ sed -n '10,15p' bigFile > selectedLines
• Delete 5 header lines at the beginning of a file:
$ sed '1,5d' file > fileNoHeader
• Remove all version numbers (ex: '.1') from the end of a list of sequence accessions:
$ sed 's/\.[0-9]\+//g' accsWithVersion > accsOnly
s: substitute
g: global modifier
d: delete line
p: print the current pattern space
-n: suppress automatic printing, so only the lines selected with p are printed
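The three sed commands above can be tried on throwaway data. This is a minimal sketch: the file names bigFile.demo and accs.demo and their contents are made up for illustration. The substitution here is written as `[0-9][0-9]*$`, which is an assumption on our part: it anchors the match to the end of the line and avoids `\+`, a GNU sed extension that other sed versions may not accept.

```shell
# Scratch directory, so nothing in the working directory is overwritten.
tmpdir=$(mktemp -d)

# Hypothetical input: 20 lines, "line1" through "line20".
seq 1 20 | sed 's/^/line/' > "$tmpdir/bigFile.demo"

# Print only lines 10-15 (-n turns off the default printing of every line).
selected=$(sed -n '10,15p' "$tmpdir/bigFile.demo")

# Delete the first 5 lines, then count what is left.
remaining=$(sed '1,5d' "$tmpdir/bigFile.demo" | wc -l | tr -d ' ')

# Hypothetical accessions, two with a version suffix and one without.
printf 'ACC0001.1\nACC0002.12\nACC0003\n' > "$tmpdir/accs.demo"

# Strip a trailing ".<digits>"; the $ ties the match to the end of the line.
accs=$(sed 's/\.[0-9][0-9]*$//' "$tmpdir/accs.demo")

echo "$selected"
echo "$remaining"
echo "$accs"
```

Lines without a version number (ACC0003 here) pass through unchanged, because the substitution simply does not match.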
awk
• Name comes from the original authors: Alfred V. Aho, Peter J. Weinberger, Brian W. Kernighan
• A simple programming language
• Good for short programs, including parsing files
awk
• Print the 2nd and 1st fields of the file:
$ awk ' { print $2"\t"$1 } ' foo.tab
• Convert sequences from tab delimited format to fasta format:
$ awk ' { print ">"$1"\n"$2 } ' foo.tab > foo.fa
$ head -1 foo.tab
Seq1 ACTGCATCAC
$ head -2 foo.fa
>Seq1
ACTGCATCAC
#: comment, ignored by awk
By default, awk splits each line on whitespace (runs of spaces and tabs)
Character    Description
\n           newline
\r           carriage return
\t           horizontal tab
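The two awk commands on this slide can be checked end to end on a small file. A minimal sketch, assuming a hypothetical two-column input named foo.demo.tab (name, sequence) created on the spot; the second record is made up for illustration:

```shell
tmpdir=$(mktemp -d)

# Hypothetical two-column, tab-delimited input: name, then sequence.
printf 'Seq1\tACTGCATCAC\nSeq2\tGGTTACCA\n' > "$tmpdir/foo.demo.tab"

# Swap the two columns, joined by a tab.
swapped=$(awk '{ print $2"\t"$1 }' "$tmpdir/foo.demo.tab")

# Each record becomes a ">name" header line followed by the sequence line.
awk '{ print ">"$1"\n"$2 }' "$tmpdir/foo.demo.tab" > "$tmpdir/foo.demo.fa"
fasta=$(cat "$tmpdir/foo.demo.fa")

echo "$swapped"
echo "$fasta"
```

The fasta file ends up with twice as many lines as the input, since "\n" inside the print statement starts a new line for every record.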
awk: field separator
• Issues with default separator (white space):
– one field is a gene description with multiple words
– consecutive empty cells
• To use tab as the separator:
$ awk -F "\t" '{ print NF }' foo.txt
or
$ awk 'BEGIN {FS="\t"} { print NF }' foo.txt
BEGIN: action before reading input
NF: number of fields in the current record
FS: input field separator
OFS: output field separator
END: action after reading input
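Both issues with the default separator show up in the field count. A minimal sketch, assuming a hypothetical file desc.demo.txt with three tab-separated cells per line: one line has a multi-word description, the other an empty middle cell. The last command, which also sets OFS, is our addition to show output staying tab-delimited.

```shell
tmpdir=$(mktemp -d)

# Hypothetical record: 3 cells, the 2nd is a description with several words.
printf 'geneA\ttumor protein p53\t7.5\n' > "$tmpdir/desc.demo.txt"
# Hypothetical record with an empty middle cell (two consecutive tabs).
printf 'geneB\t\t3.2\n' >> "$tmpdir/desc.demo.txt"

# Default separator: any run of spaces and tabs, so the counts are off.
nf_default=$(awk '{ print NF }' "$tmpdir/desc.demo.txt")

# Tab separator: every cell, including the empty one, is its own field.
nf_tab=$(awk -F "\t" '{ print NF }' "$tmpdir/desc.demo.txt")

# Same separator set in a BEGIN block, plus OFS so the output is tab-delimited.
cols=$(awk 'BEGIN { FS="\t"; OFS="\t" } { print $1, $3 }' "$tmpdir/desc.demo.txt")

echo "$nf_default"   # 5 and 2
echo "$nf_tab"       # 3 and 3
echo "$cols"
```

With the default separator the description is split into three fields and the empty cell disappears, so $3 would point at different columns on the two lines.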
awk: arithmetic operations
Add average values of 4th and 5th fields to the file:
$ awk '{ print $0"\t"($4+$5)/2 }' foo.tab
$0: all fields

Operator    Description
+           Addition
-           Subtraction
*           Multiplication
/           Division
%           Modulo
^           Exponentiation
**          Exponentiation
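The averaging command and a couple of the operators from the table can be tried on a small table. A minimal sketch, assuming a hypothetical five-column file arith.demo.tab whose 4th and 5th columns are numeric; the values are made up, and the -F "\t" option is added here so that the whole example stays tab-delimited:

```shell
tmpdir=$(mktemp -d)

# Hypothetical 5-column, tab-delimited table; columns 4 and 5 hold numbers.
printf 'g1\tchr1\t+\t2\t4\ng2\tchr2\t-\t5\t8\n' > "$tmpdir/arith.demo.tab"

# Append the mean of fields 4 and 5 as a new last column.
with_mean=$(awk -F "\t" '{ print $0"\t"($4+$5)/2 }' "$tmpdir/arith.demo.tab")

# Two more operators applied to field 5: remainder after dividing by 3, and the square.
others=$(awk -F "\t" '{ print $5 % 3, $5 ^ 2 }' "$tmpdir/arith.demo.tab")

echo "$with_mean"
echo "$others"
```

The parentheses around $4+$5 matter: without them, only $5 would be divided by 2 before the addition.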
awk: making comparisons
Print out records if values in 4th or 5th field are above 4:
$ awk '{ if( $4>4 || $5>4 ) print $0 } ' foo.tab
Sequence