Data Manipulation with AWK
Evangelos Pournaras, Izabela Moise, Dirk Helbing
AWK
A "Swiss Army knife" for data manipulation, retrieval, formatting, processing, transformation, prototyping and more.
About AWK
Check www.awk.info.
• A pattern scanning and processing language.
• AWK is named after its creators: Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan.
• An evolving yet stable, cross-platform language.
• Written in 1977 at AT&T Bell Laboratories.
• Data-driven language.
  – Standardized by POSIX
  – Various implementations: gawk, nawk, mawk, spawk, etc.
"AWK is a convenient and expressive programming language that can be applied to a wide variety of computing and data-manipulation tasks."
What you can do with AWK
• Manage small databases
• Validate data
• Produce indexes and perform document preparation tasks
• Experiment with algorithms you can adapt later to other programming languages
Implementations
• GAWK
  – Extract bits and pieces of data for processing
  – Sort data
  – Perform simple network communications
• MAWK
  – Efficiency, byte code interpreter
• JAWK
  – Java support
• NAWK, XGAWK, SPAWK, QTAWK, RunAWK, etc.
AWK Advantages
• Very simple
• Easy learning curve
• Standardized
• On-the-fly calculations
• No need to open/close files
• Interpreted, not compiled
  – Avoiding the edit-compile-test-debug lifecycle
Programming Philosophy
• Programming in AWK: building a list of rules
• Rules consist of a pattern and an action
  – (pattern-1){action} (pattern-2){action} ...
• Linear scans, handling one data element at a time
  – Resembling the Hadoop philosophy
  – Random access seek times vs. hard drive sizes
• Manipulating delimited text files in a single pass
• By design, a file is divided into records and fields
  – Each line is a record
  – Fields are delimited by a special character
Every clause is a potential action performed on the current record!
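The rule list described above can be sketched as follows. This is a minimal illustration, not material from the lecture: the sample data (name, department, salary) and the two rules are hypothetical. Each record is read once, and every rule whose pattern matches runs its action on that record.

```shell
# Hypothetical sample data: name, department, salary (whitespace-delimited).
data='alice sales 3000
bob dev 4500
carol dev 5200'

# Two rules; each is tested against every record in one linear pass.
#   rule 1: pattern $2 == "dev"  -> action prints a department note
#   rule 2: pattern $3 > 4000    -> action prints a salary note
result=$(printf '%s\n' "$data" | awk '
$2 == "dev" { print $1, "is a developer" }
$3 > 4000   { print $1, "earns more than 4000" }
')
printf '%s\n' "$result"
```

A record that matches both patterns (here bob and carol) triggers both actions, in the order the rules are written.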
Comparison with other Languages
A case study of converting triplets to sparse matrices.
Source: https://github.com/brendano/awkspeed
Running an AWK program
Three ways to run an AWK program from the command line:
1. >awk 'program' input-file1 input-file2 ...
2. >awk -f program-file input-file1 input-file2 ...
3. Executable Unix script, e.g. my-awk-script.sh:
     #!/usr/bin/awk -f
     # awk rules go here
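As a rough sketch of the first two invocation styles, the snippet below runs the same one-rule program inline and from a program file. The file names input.txt and second-field.awk are made up for the illustration. The third style is the second one in disguise: when an executable script starts with `#!/usr/bin/awk -f`, the system runs `awk -f` on the script file.

```shell
# Work in a scratch directory so no existing files are touched.
tmpdir=$(mktemp -d)
printf 'a b\nc d\n' > "$tmpdir/input.txt"

# 1. Program text given inline on the command line.
out1=$(awk '{ print $2 }' "$tmpdir/input.txt")

# 2. The same program text read from a file with -f.
printf '{ print $2 }\n' > "$tmpdir/second-field.awk"
out2=$(awk -f "$tmpdir/second-field.awk" "$tmpdir/input.txt")

printf '%s\n' "$out1" "$out2"
rm -rf "$tmpdir"
```

Both invocations print the second field of every line, so the two results are identical.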
Program Structure
    # Initialization body
    BEGIN {
        # initialization actions
    }
    # Main execution body
    {
        # main program actions
    }
    # Finalization body
    END {
        # final actions
    }
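A small sketch of this three-part structure, using made-up numeric input: the BEGIN body runs once before any input is read, the main body runs once per record, and the END body runs once after the last record.

```shell
result=$(printf '4\n8\n15\n' | awk '
BEGIN { sum = 0; print "start" }                   # before any input is read
      { sum += $1 }                                # once per input record
END   { print "records:", NR; print "sum:", sum }  # after the last record
')
printf '%s\n' "$result"
```

The explicit `sum = 0` is optional, since AWK variables start out empty and behave as 0 in arithmetic. It is kept here only to show a typical use of BEGIN.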
AWK Demonstration example-01.awk, example-02.awk
AWK Regular Expressions
A regular expression enclosed in slashes ('/') is a pattern that is tested against each input record.
• It may contain letters, numbers, or both, e.g. /foo/
• ~ matches
• !~ does not match
• | alternation
• ^ matches the beginning of a string
• $ matches the end of a string
• . matches any single character
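The operators listed above can be tried out on a few lines of invented data (a name followed by an e-mail address). This is only a sketch. The addresses are placeholders and not the lecture's mail-list.txt.

```shell
# Hypothetical sample data: name and e-mail address.
data='ann ann@uni.edu
bob bob@firm.com
eve eve@lab.be'

# A bare /regex/ tests the whole record; ~ and !~ test one field.
whole=$(printf '%s\n' "$data" | awk '/edu/ { print $1 }')
ends=$(printf '%s\n' "$data" | awk '$2 ~ /(edu|be)$/ { print $1 }')   # alternation, end anchor
nomatch=$(printf '%s\n' "$data" | awk '$2 !~ /^a/ { print $1 }')      # start anchor, negated
dot=$(printf '%s\n' "$data" | awk '$1 ~ /^.v.$/ { print $1 }')        # any char, v, any char

printf '%s\n' "$whole" "$ends" "$nomatch" "$dot"
```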
AWK Demonstration
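Scripts
# Print lines containing ".edu"
>awk '/\.edu/ {print $0}' mail-list.txt
# Print lines whose first field contains "J"
>awk '$1 ~ /J/' inventory-shipped.txt
# Print lines whose third field ends in "edu" or "be"
>awk '$3 ~ /edu$|be$/' mail-list.txt
# Print the length of the longest line
>awk '{if (length($0)>max) max=length($0)} END{print max}' mail-list.txt
# Print every line that has at least one field (skips blank lines)
>awk 'NF>0' inventory-shipped.txt
# Count the input lines (reads standard input when no file is given)
>awk 'END{print NR}'
# Print the even-numbered lines
>awk 'NR%2==0' mail-list.txt
# Sum the fifth field over lines whose first field is "Jan"
>awk '$1=="Jan" {sum+=$5} END{print sum}' inventory-shipped.txt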
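To see what some of these one-liners produce without the lecture's data files, the sketch below feeds them a few invented lines in the same month-plus-four-numbers layout. The values are placeholders, and the blank third line is deliberate.

```shell
# Hypothetical stand-in for inventory-shipped.txt; the third line is blank on purpose.
data='Jan 13 25 15 115
Feb 15 32 24 226

Jan 21 36 64 620'

# Length of the longest line.
longest=$(printf '%s\n' "$data" | awk '{ if (length($0) > max) max = length($0) } END { print max }')
# Number of non-blank lines: filter with NF > 0, then count with NR.
nonblank=$(printf '%s\n' "$data" | awk 'NF > 0' | awk 'END { print NR }')
# Even-numbered lines only.
even=$(printf '%s\n' "$data" | awk 'NR % 2 == 0')
# Sum of the fifth field over the "Jan" lines.
jan_sum=$(printf '%s\n' "$data" | awk '$1 == "Jan" { sum += $5 } END { print sum }')

printf '%s\n' "$longest" "$nonblank" "$even" "$jan_sum"
```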
Variables
• No variable declaration is needed.
• No type declaration is needed.
• Built-in variables:
  – NF: number of fields in the current record
  – NR: current record number
  – FS: field separator
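A short sketch of these points on invented colon-separated input: `count` and `last` are used without any declaration, FS is set in BEGIN, and NR and NF are read inside the main rule.

```shell
result=$(printf 'x:y:z\np:q\n' | awk '
BEGIN { FS = ":" }      # split fields on colons instead of whitespace
{
    count++             # created on first use; an unset variable acts as 0
    last = $NF          # NF is the field count, so $NF is the last field
    print NR, NF, last
}
END { print "lines:", count }
')
printf '%s\n' "$result"
```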
Functions
Specified as follows:
    function awkFunction(a, b, c, d) {
        return a + b + c + d
    }
Built-in functions:
• Numeric:
  – sqrt, log, sin, cos, rand, etc.
• String:
  – index, length, match, split, substr, etc.
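The sketch below combines a user-defined function with a few of the built-ins named above. The function name `hypot` and the input line are illustrative choices, not taken from the lecture examples.

```shell
result=$(printf 'hello world\n' | awk '
# User-defined function (illustrative name): length of a hypotenuse.
function hypot(a, b) { return sqrt(a * a + b * b) }
{
    print hypot(3, 4)           # numeric built-in sqrt inside a user function
    print length($1)            # number of characters in the first field
    print index($0, "world")    # 1-based position of a substring
    print substr($0, 1, 5)      # five characters starting at position 1
    n = split($0, parts, " ")   # fill the array parts, return the piece count
    print n, parts[2]
}')
printf '%s\n' "$result"
```

String positions in AWK start at 1, which is why `index` reports 7 for "world" and `substr($0, 1, 5)` yields the first word.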
Arrays
Associative arrays:
• Indices are strings rather than numbers
• arrayname[string]=value
• Multi-dimensional arrays:
  – Supported by concatenating the indices into one string
  – foo[5,12]="value"
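As a minimal sketch of both points, the program below stores values under string indices and then under a two-part index. The array names `count` and `grid` are made up. AWK joins the parts of a multi-dimensional index with the built-in variable SUBSEP, so the combined key can be split apart again.

```shell
result=$(awk 'BEGIN {
    count["apple"] = 3       # string index
    count["pear"]++          # a missing entry starts empty, so this yields 1
    grid[5, 12] = "value"    # stored under the single key 5 SUBSEP 12
    print count["apple"], count["pear"]
    if ((5, 12) in grid)
        print "grid[5,12] is set"
    for (key in grid) {      # recover the two indices from the combined key
        split(key, idx, SUBSEP)
        print idx[1], idx[2], grid[key]
    }
}')
printf '%s\n' "$result"
```

The whole program sits in a BEGIN rule, so it runs without reading any input.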
AWK Demonstration example-03.awk, example-04.awk
AWK Example - Arrays
    BEGIN {}
    {
        # Count how often each value of the fourth field occurs
        letters[$4]++
    }
    END {
        for (var in letters)
            print var, "exists", letters[var], "times."
        if ("A" in letters)
            print "A exists"
        else
            print "A does not exist"
    }
Proposed Literature
AWK scripts: https://github.com/data-science-course/lectures/tree/master/awk
A. D. Robbins. Gawk: Effective AWK Programming. Free Software Foundation, Inc., 4.1 edition, April 2014.
How to read the user guide:
• Fast reading: Chapters 1-10
• Practical examples: Chapter 11
What is next?
• SQL and relational databases
• Plotting and visualizing data