Data Manipulation with AWK. Evangelos Pournaras, Izabela Moise, Dirk Helbing

Author: Vivian Wheeler

AWK

A "Swiss knife" for data manipulation, retrieval, formatting, processing, transformation, prototyping and more...


About AWK

Check www.awk.info.

• A pattern scanning and processing language.
• AWK name: Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan (creators).
• An evolving, yet stable, cross-platform language.
  – POSIX standard for AWK
  – Various implementations: gawk, nawk, mawk, spawk, etc.
• Written in 1977 at AT&T Bell Laboratories.
• Data-driven language.

"AWK is a convenient and expressive programming language that can be applied to a wide variety of computing and data-manipulation tasks."

What you can do with AWK

• Manage small databases
• Validate data
• Produce indexes & perform document preparation tasks
• Experiment with algorithms you can adapt later to other programming languages

Implementations

• GAWK
  – Extract bits and pieces of data for processing
  – Sort data
  – Perform simple network communications
• MAWK
  – Efficiency, byte code interpreter
• JAWK
  – Java support
• NAWK, XGAWK, SPAWK, QTAWK, RunAWK, etc.

AWK Advantages

• Very simple
• Easy to learn
• Standardized
• On-the-fly calculations
• No need to open/close files
• Interpreted, not compiled
  – Avoids the edit-compile-test-debug lifecycle

Programming Philosophy

• Programming in AWK: building a list of rules
• Rules consist of a pattern and an action
  – (pattern-1){action} (pattern-2){action} ...
• Linear scans, handling one data element at a time
  – Resembles the Hadoop philosophy
  – Random-access seek times vs. hard drive sizes
• Manipulates delimited text files in a single pass
• By design, a file is divided into records and fields
  – Each line is a record
  – Fields are delimited by a special character

Every clause is a potential action performed on the current record!
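
The rule-based model described here can be sketched with a small example. The sample data, the ':' delimiter, and the field meanings are assumptions made up for illustration; they are not taken from the lecture material.

```shell
# Hypothetical sample data: one record per line, fields separated by ':'.
data='alice:engineering:5200
bob:sales:3100
carol:engineering:6100'

# Two rules; each one is tested against every record in a single pass.
out=$(printf '%s\n' "$data" | awk -F: '
    $2 == "engineering" { print $1 " is an engineer" }        # pattern + action
    $3 > 6000           { print $1 " earns more than 6000" }  # pattern + action
')
echo "$out"
```

Both rules are evaluated for each record in turn, so a record that satisfies both patterns (carol here) triggers both actions before the next line is read.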

Comparison with other Languages

A case study on converting triplets to sparse matrices.

Source: https://github.com/brendano/awkspeed

Running an AWK program

Three ways to run an AWK program from the command line:

1. >awk 'program' input-file1 input-file2 ...
2. >awk -f program-file input-file1 input-file2 ...
3. Unix script: my-awk-script.sh

   #!/usr/bin/awk -f
   #awk rules go here
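
As a minimal sketch, the first two invocation styles can be compared on the same input. The file names input.txt and second-field.awk are hypothetical, and the example assumes mktemp is available on the system.

```shell
# Work in a temporary directory so nothing else is touched.
dir=$(mktemp -d)
printf 'one two\nthree four\n' > "$dir/input.txt"

# Way 1: the program is given inline on the command line.
inline=$(awk '{ print $2 }' "$dir/input.txt")

# Way 2: the same program is stored in a file and loaded with -f.
printf '{ print $2 }\n' > "$dir/second-field.awk"
from_file=$(awk -f "$dir/second-field.awk" "$dir/input.txt")

echo "$inline"
echo "$from_file"
rm -r "$dir"
```

Both commands print the second field of each line. The third style puts the same rules in a file that starts with the #! line and is made executable, so the file can be run directly.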

Program Structure

# Initialization body
BEGIN{
  # initialization actions
}
# Main execution body
{
  # main program actions
}
# Finalization body
END{
  # Final actions
}
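
A small sketch of the three bodies at work, computing the mean of a column of numbers. The input values are made up for illustration.

```shell
out=$(printf '4\n8\n6\n' | awk '
    BEGIN { sum = 0 }                    # runs once, before any input is read
          { sum += $1 }                  # runs for every record
    END   { print "mean:", sum / NR }    # runs once, after the last record
')
echo "$out"
```

This prints "mean: 6". The explicit initialization in BEGIN is optional here, since an unset AWK variable already behaves as 0 in arithmetic.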

AWK Demonstration

example-01.awk, example-02.awk

AWK Regular Expressions

A regular expression enclosed in slashes ('/') is a pattern that is checked against each input record.

• Can contain letters, numbers, or both, e.g. /foo/
• ~ matches
• !~ does not match
• | alternation
• ^ matches the beginning of a string
• $ matches the end of a string
• . matches any single character
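
The operators listed above can be tried on a short list. This is only a sketch: the names and addresses are hypothetical sample data.

```shell
# Hypothetical address list: a name, then an e-mail address.
data='ann ann@example.edu
bob bob@example.com
eve eve@example.be
education-office office@example.org'

# ~ with the $ anchor and | alternation: second field ends in "edu" or "be".
ends=$(printf '%s\n' "$data" | awk '$2 ~ /edu$|be$/ { print $1 }')

# !~ with the ^ anchor: first field does not start with "e".
not_e=$(printf '%s\n' "$data" | awk '$1 !~ /^e/ { print $1 }')

# . matches any single character: "b", any one character, "b".
dot=$(printf '%s\n' "$data" | awk '$1 ~ /^b.b$/ { print $1 }')

echo "$ends"
echo "$not_e"
echo "$dot"
```

Note that "education-office" is not selected by the first command, because the match is restricted to the second field and anchored at its end.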

AWK Demonstration


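Scripts

>awk '/\.edu/ {print $0}' mail-list.txt                # records containing ".edu"
>awk '$1 ~ /J/' inventory-shipped.txt                  # first field contains "J"
>awk '$3 ~ /edu$|be$/' mail-list.txt                   # third field ends in "edu" or "be"
>awk '{if (length($0)>max) max=length($0)} END{print max}' mail-list.txt   # length of the longest record
>awk 'NF>0' inventory-shipped.txt                      # records with at least one field
>awk 'END{print NR}'                                   # number of records read from standard input
>awk 'NR%2==0' mail-list.txt                           # even-numbered records
>awk '$1=="Jan" {sum+=$5} END{print sum}' inventory-shipped.txt   # sum of field 5 over "Jan" records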
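
To see what some of these one-liners produce without the course data files, they can be fed a few lines inline. The data below is a hypothetical stand-in for inventory-shipped.txt, not the real file.

```shell
# Hypothetical stand-in for inventory-shipped.txt: a month name, then four counts.
data='Jan 13 25 15 115
Feb 15 32 24 226

Jan 21 36 64 620'

# Total number of records, including the blank line.
records=$(printf '%s\n' "$data" | awk 'END{print NR}')
# Number of records that survive the NF>0 filter.
nonblank=$(printf '%s\n' "$data" | awk 'NF>0' | awk 'END{print NR}')
# Sum of the fifth field over the "Jan" records.
jan_total=$(printf '%s\n' "$data" | awk '$1=="Jan" {sum+=$5} END{print sum}')

echo "$records $nonblank $jan_total"
```

With this input the three values are 4, 3 and 735.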

Variables

• No variable declaration is needed.
• No type declaration is needed.
• Built-in variables:
  – NF: number of fields in the current record
  – NR: current record number
  – FS: field separator
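
A brief sketch of these variables in use. The comma-separated input and the variable name total are chosen only for illustration.

```shell
out=$(printf 'a,b,c\nd,e\n' | awk '
    BEGIN { FS = "," }                         # FS: split fields on commas
    { total += NF                              # total is created on first use
      print "record " NR " has " NF " fields" }
    END { print "fields in total: " total }
')
echo "$out"
```

The variable total is never declared or given a type; it starts out empty, is treated as 0 by the first += and holds a number from then on.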

Functions

Specified as follows:

function awkFunction(a,b,c,d){
  return a+b+c+d
}

Built-in functions:
• Numeric:
  – sqrt, log, sin, cos, rand, etc.
• String:
  – index, length, match, split, substr, etc.
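
The following sketch calls a user-defined function of the form described in this section alongside a few of the listed built-ins. The argument values are arbitrary.

```shell
out=$(awk '
    # User-defined function with four parameters.
    function awkFunction(a, b, c, d) { return a + b + c + d }

    BEGIN {
        print awkFunction(1, 2, 3, 4)          # 10
        print sqrt(16)                         # 4
        print length("hello")                  # 5
        print index("hello", "ll")             # 3
        print substr("hello", 2, 3)            # ell
        n = split("x:y:z", parts, ":")         # splits into 3 pieces
        print n, parts[3]                      # 3 z
    }
')
echo "$out"
```

Because the program consists of a BEGIN rule only, awk runs it without reading any input. String positions in index and substr are counted from 1.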

Arrays

Associative arrays:
• Indices are strings rather than numbers
• arrayname[string]=value
• Multi-dimensional arrays:
  – Supported by concatenation of indices into one string
  – foo[5,12]="value"
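
A minimal sketch of how the concatenated index behaves. The array names count and idx are hypothetical; SUBSEP is the built-in variable that AWK places between the index parts.

```shell
out=$(awk '
    BEGIN {
        count["apple"] = 3                     # string index
        foo[5, 12] = "value"                   # stored under the key 5 SUBSEP 12
        if ((5, 12) in foo) print "foo[5,12] =", foo[5, 12]
        if (!("pear" in count)) print "no pear"
        for (key in foo) {                     # recover the two index parts
            split(key, idx, SUBSEP)
            print idx[1], idx[2]
        }
    }
')
echo "$out"
```

The in operator tests for membership without creating the element, which a plain reference such as count["pear"] would do.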

AWK Demonstration

example-03.awk, example-04.awk

AWK Example - Arrays

BEGIN{}
{
  letters[$4]++;
}
END{
  for(var in letters)
    print var, "exists", letters[var], "times."
  if("A" in letters)
    print "A exists"
  else
    print "A does not exist"
}
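
To try this counting example, it can be run on a few lines whose fourth field holds a letter. The input is hypothetical, and the output is piped through sort because the order in which for (var in letters) visits the indices is not defined.

```shell
# Hypothetical input: the fourth field holds a letter.
data='1 x y A
2 x y B
3 x y A'

out=$(printf '%s\n' "$data" | awk '
    { letters[$4]++ }                          # count each value of field 4
    END {
        for (var in letters)
            print var, "exists", letters[var], "times."
        if ("A" in letters) print "A exists"
        else                print "A does not exist"
    }
' | LC_ALL=C sort)
echo "$out"
```

The empty BEGIN{} rule is left out here because it has no effect on the result.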

Proposed Literature

AWK scripts: https://github.com/data-science-course/lectures/tree/master/awk

A. D. Robbins. Gawk: Effective AWK Programming. Free Software Foundation, Inc., 4.1 edition, April 2014.

How to read the user guide:
• Fast reading: Chapters 1-10
• Practical examples: Chapter 11

What is next?

• SQL and relational databases
• Plotting and visualizing data