Python in Action
Presented at USENIX LISA Conference, November 16, 2007
David M. Beazley (http://www.dabeaz.com)
Part II - Systems Programming
Copyright (C) 2007, http://www.dabeaz.com

Section Overview
• In this section, we're going to get dirty
• Systems programming:
  • Files, I/O, file-system
  • Text parsing, data decoding
  • Processes and IPC
  • Networking
  • Threads and concurrency

Commentary
• I personally think Python is a fantastic tool for systems programming.
• Modules provide access to most of the major system libraries I used to access via C
• No enforcement of "morality"
• Decent performance
• It just "works" and it's fun

Approach
• I've thought long and hard about how I would present this part of the class.
• A reference manual approach would probably be long and very boring.
• So instead, we're going to focus on building something more in tune with the times.

"To Catch a Slacker" • Write a collection of Python programs that can

quietly monitor Firefox browser caches to find out who has been spending their day reading Slashdot instead of working on their TPS reports.

• Oh yeah, and be a real sneaky bugger about it.

Copyright (C) 2007, http://www.dabeaz.com

2- 5

Why this Problem?
• Involves a real-world system and data
• Firefox is probably already installed on your machine
• Cross platform (Linux, Mac, Windows)
• Example of tool building
• Related to a variety of practical problems
• A good tour of "Python in Action"

Disclaimers
• I am not involved in browser forensics (or spyware, for that matter).
• I am in no way affiliated with Firefox/Mozilla, nor have I ever seen the Firefox source code.
• I had never worked with the cache data prior to preparing this tutorial.
• I have never used any third-party tools for looking at this data.

More Disclaimers
• All of the code in this tutorial works with a standard Python installation
• No third-party modules
• All code is cross-platform
• Code samples are available online at http://www.dabeaz.com/action/
• Please look at that code and follow along

Assumptions
• This is not a tutorial on systems concepts
• You should be generally familiar with the background material (files, filesystems, file formats, processes, threads, networking, protocols, etc.)
• Hopefully you can "extrapolate" from the material presented here to construct more advanced Python applications.

The Big Picture
• We want to write a tool that allows someone to locate, inspect, and perform queries across a distributed collection of Firefox caches.
• For example, the cache directories on all machines on the LAN of a quasi-evil corporation.

The Firefox Cache
• The Firefox browser keeps a disk cache of recently visited sites

    % ls Cache/
    -rw-------  1 beazley   111169 Sep 25 17:15 01CC0844d01
    -rw-------  1 beazley   104991 Sep 25 17:15 01CC3844d01
    -rw-------  1 beazley    47233 Sep 24 16:41 021F221Ad01
    ...
    -rw-------  1 beazley    26749 Sep 21 11:19 FF8AEDF0d01
    -rw-------  1 beazley    58172 Sep 25 18:16 FFE628C6d01
    -rw-------  1 beazley  1939456 Sep 25 19:14 _CACHE_001_
    -rw-------  1 beazley  2588672 Sep 25 19:14 _CACHE_002_
    -rw-------  1 beazley  4567040 Sep 25 18:44 _CACHE_003_
    -rw-------  1 beazley    33044 Sep 23 21:58 _CACHE_MAP_

• A bunch of cryptically named files.

Problem : Finding Files
• Find the Firefox cache

  Write a program findcache.py that takes a directory name as input and recursively scans that directory and all subdirectories looking for Firefox/Mozilla cache directories.

• Example:

    % python findcache.py /Users/beazley
    /Users/beazley/Library/.../qs1ab616.default/Cache
    /Users/beazley/Library/.../wxuoyiuf.slt/Cache
    %

• Use case: Searching for things on the filesystem.

findcache.py

    # findcache.py
    # Recursively scan a directory looking for
    # Firefox/Mozilla cache directories

    import sys
    import os

    if len(sys.argv) != 2:
        print >>sys.stderr,"Usage: python findcache.py dirname"
        raise SystemExit(1)

    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

    for name in caches:
        print name

The sys module
• The sys module has basic information related to the execution environment.
• sys.argv is a list of the command line options:

    sys.argv = ['findcache.py', '/Users/beazley']

• sys.stdin, sys.stdout, and sys.stderr are the standard I/O files.

Program Termination
• The SystemExit exception forces Python to exit. The value is the return code:

    raise SystemExit(1)

os Module
• The os module contains useful OS-related functions (files, processes, etc.)

os.walk()
• os.walk(topdir) recursively walks a directory tree and generates a sequence of tuples (path,dirs,files), where:

    path  = The current directory name
    dirs  = List of all subdirectory names in path
    files = List of all regular files (data) in path

  (a usage sketch follows below)
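A minimal standalone sketch of os.walk() (the starting directory and the .py filter are arbitrary choices for illustration):

    # Print the full path of every .py file under a directory
    import os
    for path, dirs, files in os.walk("/tmp"):
        for name in files:
            if name.endswith(".py"):
                print os.path.join(path, name)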

A Sequence of Caches
• This statement generates a sequence of directory names where '_CACHE_MAP_' is contained in the file list:

    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

• path is the directory name that is generated as a result; the if clause is the file name check.

Printing the Result
• This prints the sequence of cache directories that is generated by the previous statement:

    for name in caches:
        print name

Commentary
• Our solution is strongly based on a "declarative" programming style (again)
• We simply write out a sequence of operations that produce what we want
• Not focused on the underlying mechanics of how to traverse all of the directories.

Big Idea : Iteration
• Python allows iteration to be captured as a kind of object.

    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

• This de-couples iteration from the code that uses the iteration:

    for name in caches:
        print name

• Another usage example:

    for name in caches:
        print len(os.listdir(name)), name

Big Idea : Iteration
• Compare to this:

    for path,dirs,files in os.walk(sys.argv[1]):
        if '_CACHE_MAP_' in files:
            print len(os.listdir(path)),path

• This code is simple, but the loop and the code that executes in the loop body are coupled together
• Not as flexible, but this is somewhat subtle to wrap your brain around at first.

Mini-Reference : sys, os
• sys module

    sys.argv        # List of command line options
    sys.stdin       # Standard input
    sys.stdout      # Standard output
    sys.stderr      # Standard error
    sys.executable  # Full path of Python executable
    sys.exc_info()  # Information on current exception

• os module

    os.walk(dir)    # Recursively walk dir producing a
                    # sequence of tuples (path,dlist,flist)
    os.listdir(dir) # Return a list of all files in dir

• SystemExit exception

    raise SystemExit(n) # Exit with integer code n

Problem: Searching for Text
• Extract all URL requests from the cache

  Write a program requests.py that scans the contents of the _CACHE_00n_ files and prints a list of URLs for documents stored in the cache.

• Example:

    % python requests.py /Users/.../qs1ab616.default/Cache
    http://www.yahoo.com/
    http://us.js2.yimg.com/us.yimg.com/a/1-/java/promotions/js/ad_eo_1.1.js
    http://us.i1.yimg.com/us.yimg.com/i/ww/atty_hsi_free.gif
    http://us.i1.yimg.com/us.yimg.com/i/ww/thm/1/search_1.1.png
    ...
    %

• Use case: Searching the contents of files for text patterns.

The Firefox Cache
• The cache directory holds two types of data:
  • Metadata (URLs, headers, etc.)
  • Raw data (HTML, JPEG, PNG, etc.)
• This data is stored in two places:
  • Cryptic files in the Cache directory
  • Blocks inside the _CACHE_00n_ files
• Metadata is almost always in _CACHE_00n_

Possible Solution : Regex
• The _CACHE_00n_ files are encoded in a binary format, but URLs are embedded inside as null-terminated text:

\x00\x01\x00\x08\x92\x00\x02\x18\x00\x00\x00\x13F\xff\x9f \xceF\xff\x9f\xce\x00\x00\x00\x00\x00\x00H)\x00\x00\x00\x1a \x00\x00\x023HTTP:http://slashdot.org/\x00request-method\x00 GET\x00request-User-Agent\x00Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7\x00 request-Accept-Encoding\x00gzip,deflate\x00response-head\x00 HTTP/1.1 200 OK\r\nDate: Sun, 30 Sep 2007 13:07:29 GMT\r\n Server: Apache/1.3.37 (Unix) mod_perl/1.29\r\nSLASH_LOG_DATA: shtml\r\nX-Powered-By: Slash 2.005000176\r\nX-Fry: How can I live my life if I can't tell good from evil?\r\nCache-Control:

• Maybe the requests could just be ripped out using a regular expression.

A Regex Solution

    # requests.py
    import re
    import os
    import sys

    cachedir   = sys.argv[1]
    cachefiles = [ '_CACHE_001_', '_CACHE_002_', '_CACHE_003_' ]

    # A regex for embedded URL strings
    request_pat = re.compile(r'([a-z]+://.*?)\x00')

    # Loop over all files and search for URLs
    for name in cachefiles:
        data = open(os.path.join(cachedir,name),"rb").read()
        index = 0
        while True:
            m = request_pat.search(data,index)
            if not m: break
            print m.group(1)
            index = m.end()

The re module
• The re module contains all functionality related to regular expression pattern matching, searching, replacing, etc.
• Features are strongly influenced by Perl, but regexes are not directly integrated into the Python language.

Using re
• Patterns are first specified as strings and compiled into a regex object:

    pat = re.compile(pattern [,flags])

• The pattern syntax is "standard":

    literal    pat*       pat+      pat?
    pat1|pat2  (pat)      [chars]   [^chars]
    .          pat{n}     pat{n,m}

Using re
• All subsequent operations are methods of the compiled regex pattern:

    m = pat.match(data [,start])    # Check for match at start
    m = pat.search(data [,start])   # Search for match
    newdata = pat.sub(repl, data)   # Pattern replace

Searching for Matches
• pat.search(text [,start]) searches the string text for the first occurrence of the regex pattern, starting at position start. Returns a MatchObject if a match is found.
• In the code below, we're finding matches one at a time:

    while True:
        m = request_pat.search(data,index)
        if not m: break
        print m.group(1)
        index = m.end()

Match Objects
• Regex matches are represented by a MatchObject:

    m.group([n])   # Text matched by group n
    m.start([n])   # Starting index of group n
    m.end([n])     # End index of group n

• In requests.py, m.group(1) is the matching text for just the URL, and m.end() is the end of the match.

Groups
• In patterns, parentheses () define groups, which are numbered left to right:

    group 0   # The entire pattern
    group 1   # Text in first ()
    group 2   # Text in next ()
    ...

Mini-Reference : re
• re pattern compilation

    pat = re.compile(r'patternstring')

• Pattern syntax

    literal     # Match literal text
    pat*        # Match 0 or more repetitions of pat
    pat+        # Match 1 or more repetitions of pat
    pat?        # Match 0 or 1 repetitions of pat
    pat1|pat2   # Match pat1 or pat2
    (pat)       # Match pat (group)
    [chars]     # Match characters in chars
    [^chars]    # Match characters not in chars
    .           # Match any character except \n
    \d          # Match any digit
    \w          # Match alphanumeric character
    \s          # Match whitespace

Mini-Reference : re
• Common pattern operations

    pat.search(text)    # Search text for a match
    pat.match(text)     # Search start of text for match
    pat.sub(repl,text)  # Replace pattern with repl

• Match objects

    m.group([n])  # Text matched by group n
    m.start([n])  # Starting position of group n
    m.end([n])    # Ending position of group n

• How to loop over all matches of a pattern (see the sketch below):

    for m in pat.finditer(text):
        # m is a MatchObject that you process
        ...
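As an aside, the scanning loop in requests.py could also be written with finditer(); a minimal sketch of this alternative to the manual search()/index loop:

    # requests.py inner loop, rewritten with finditer()
    for name in cachefiles:
        data = open(os.path.join(cachedir,name),"rb").read()
        for m in request_pat.finditer(data):
            print m.group(1)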

Mini-Reference : re
• An example of pattern replacement

    # This replaces American dates of the form 'mm/dd/yyyy'
    # with European dates of the form 'dd/mm/yyyy'.

    # This function takes a MatchObject as input and returns
    # replacement text as output.  (Groups are strings, so we
    # format with %s.)
    def euro_date(m):
        month = m.group(1)
        day   = m.group(2)
        year  = m.group(3)
        return "%s/%s/%s" % (day,month,year)

    # Date re pattern and replacement operation
    datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
    newdata = datepat.sub(euro_date,text)

Mini-Reference : re
• There are many more features of the re module
• Strongly influenced by Perl (feature set)
• Regexes are a library in Python, not integrated into the language.
• A book on regular expressions may be essential for advanced use.

File Handling
• What is going on in this statement?

    data = open(os.path.join(cachedir,name),"rb").read()

os.path module
• os.path has portable file-related functions:

    os.path.join(name1,name2,...)  # Join path names
    os.path.getsize(filename)      # Get the file size
    os.path.getmtime(filename)     # Get modification date

• There are many more functions, but this is the preferred module for basic filename handling.

os.path.join()
• Creates a fully-expanded pathname:

    dirname  = '/foo/bar'
    filename = 'name'
    os.path.join(dirname,filename)   # -> '/foo/bar/name'

• Aware of platform differences ('/' vs. '\')

Mini-Reference : os.path
• Common functions (a usage sketch follows below):

    os.path.join(s1,s2,...)   # Join pathname parts together
    os.path.getsize(path)     # Get file size of path
    os.path.getmtime(path)    # Get modify time of path
    os.path.getatime(path)    # Get access time of path
    os.path.getctime(path)    # Get creation time of path
    os.path.exists(path)      # Check if path exists
    os.path.isfile(path)      # Check if regular file
    os.path.isdir(path)       # Check if directory
    os.path.islink(path)      # Check if symbolic link
    os.path.basename(path)    # Return file part of path
    os.path.dirname(path)     # Return dir part of path
    os.path.abspath(path)     # Get absolute path
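A small sketch combining a few of these functions (cachedir as in requests.py):

    # Report size and modification time of each file in a cache directory
    import os, time
    for name in os.listdir(cachedir):
        fullname = os.path.join(cachedir, name)
        if os.path.isfile(fullname):
            print os.path.getsize(fullname), \
                  time.ctime(os.path.getmtime(fullname)), name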

Binary I/O
• For all binary files, use modes "rb", "wb", etc.:

    data = open(os.path.join(cachedir,name),"rb").read()

• Disables newline translation (critical on Windows)

Common I/O Shortcuts
• Read an entire file into a string:

    data = open(filename).read()

• Write a string out to a file:

    open(filename,"w").write(text)

• Loop over all lines in a file:

    for line in open(filename):
        ...

Commentary on Solution
• This regex approach is mostly a hack for this particular application.
• Reads entire cache files into memory as strings (may be quite large)
• Only finds URLs, no other metadata
• Some risk of false positives, since URLs could also be embedded in data.

Commentary
• We have started to build a collection of very simple command-line tools
• Very much in the "Unix tradition"
• Python makes it easy to create such tools
• More complex applications could be assembled by simply gluing scripts together

Working with Processes
• It is common to write programs that run other programs, collect their output, etc.:
  • Pipes
  • Interprocess communication
• Python has a variety of modules for supporting this.

subprocess Module
• A module for creating and interacting with subprocesses (see the sketch below)
• Consolidates a number of low-level OS functions such as system(), execv(), spawnv(), pipe(), popen2(), etc. into a single module
• Cross platform (Unix/Windows)
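A minimal sketch of launching one command and collecting its output (the 'ls' command is an arbitrary example):

    # Run a command and capture its output through a pipe
    import subprocess
    p = subprocess.Popen(["ls","-l"], stdout=subprocess.PIPE)
    out, err = p.communicate()      # Wait for exit, collect output
    print out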

Example : Slackers
• Find slacker cache entries

  Using the programs findcache.py and requests.py as subprocesses, write a program that inspects cache directories and prints out all entries that contain the word 'slashdot' in the URL.

slackers.py

    # slackers.py
    import sys
    import subprocess

    # Run findcache.py as a subprocess
    finder = subprocess.Popen(
                [sys.executable,"findcache.py",sys.argv[1]],
                stdout=subprocess.PIPE)
    dirlist = [line.strip() for line in finder.stdout]

    # Run requests.py as a subprocess
    for cachedir in dirlist:
        searcher = subprocess.Popen(
                [sys.executable,"requests.py",cachedir],
                stdout=subprocess.PIPE)
        for line in searcher.stdout:
            if 'slashdot' in line:
                print line,

Launching a subprocess
• This is launching a Python script as a subprocess, connecting its stdout stream to a pipe:

    finder = subprocess.Popen(
                [sys.executable,"findcache.py",sys.argv[1]],
                stdout=subprocess.PIPE)

• The list comprehension collects the output, stripping the newline from each line:

    dirlist = [line.strip() for line in finder.stdout]

Python Executable
• sys.executable is the full pathname of the python interpreter:

    finder = subprocess.Popen(
                [sys.executable,"findcache.py",sys.argv[1]],
                stdout=subprocess.PIPE)

Subprocess Arguments
• The list of arguments to the subprocess corresponds to what would appear on a shell command line:

    [sys.executable,"requests.py",cachedir]

slackers.py
• More of the same idea. For each directory we found in the last step, we run requests.py to produce the requests:

    for cachedir in dirlist:
        searcher = subprocess.Popen(
                [sys.executable,"requests.py",cachedir],
                stdout=subprocess.PIPE)

Commentary
• subprocess is a large module with many options.
• However, it takes care of a lot of annoying platform-specific details for you.
• Currently the "recommended" way of dealing with subprocesses.

Low Level Subprocesses
• Running a simple system command:

    os.system("shell command")

• Connecting to a subprocess with pipes:

    pout, pin = popen2.popen2("shell command")

• Exec/spawn:

    os.execv(), os.execl(), os.execle(), ...
    os.spawnv(), os.spawnl(), os.spawnle(), ...

• Unix fork():

    os.fork(), os.wait(), os.waitpid(), os._exit(), ...

Interactive Processes
• Python does not have built-in support for controlling interactive subprocesses (e.g., "Expect")
• Must install third-party modules for this
• Example: pexpect (http://pexpect.sourceforge.net); a sketch follows below
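For flavor, a sketch of what pexpect code looks like (a hypothetical session; pexpect is a third-party module, so the exact API may differ by version):

    # Drive an interactive ftp login (hypothetical host and account)
    import pexpect
    child = pexpect.spawn("ftp ftp.example.com")
    child.expect("Name .*: ")
    child.sendline("anonymous")
    child.expect("Password:")
    child.sendline("guest@example.com")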

Commentary
• Writing small Unix-like utilities is fairly straightforward in Python
• Support for standard kinds of operations (files, regular expressions, pipes, subprocesses, etc.)
• However, our solution is also kind of clunky:
  • Only returns some information
  • Not particularly memory efficient (reads large files into memory)

Interlude
• Python is well-suited to building libraries and frameworks.
• In the next part, we're going to take a totally different approach than simply writing simple utilities.
• We will build libraries for manipulating cache data and use those libraries to build tools.

Problem : Parsing Data
• Extract the cache data (for real)

  Write a module ffcache.py that contains a set of functions for reading Firefox cache data into useful data structures that can be used by other programs. Capture all available information, including URLs, timestamps, sizes, locations, content types, etc.

• Use case: Blood and guts. Writing programs that can process foreign file formats. Processing binary-encoded data. Creating code for later reuse.

The Firefox Cache
• There are four critical files:

    _CACHE_MAP_   # Cache index
    _CACHE_001_   # Cache data
    _CACHE_002_   # Cache data
    _CACHE_003_   # Cache data

• All files are binary-encoded
• _CACHE_MAP_ is used by Firefox to locate data, but it is not updated until Firefox exits.
• We will ignore _CACHE_MAP_ since we want to observe caches of live Firefox sessions.

Firefox _CACHE_ Files
• _CACHE_00n_ file organization:

    Free/used block bitmap   4096 bytes
    Blocks                   Up to 32768 blocks

• The block size varies according to the file:

    _CACHE_001_   256 byte blocks
    _CACHE_002_   1024 byte blocks
    _CACHE_003_   4096 byte blocks
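Assuming blocks begin immediately after the 4096-byte bitmap (an inference from the layout above, not something stated explicitly), the file offset of block n would be:

    # Hypothetical: file offset of block n in a _CACHE_00n_ file
    BITMAP_SIZE = 4096
    def block_offset(n, blocksize):
        # blocksize: 256, 1024, or 4096 depending on the file
        return BITMAP_SIZE + n*blocksize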

Cache Entries
• Each cache entry:
  • Uses a maximum of 4 cache blocks
  • Can be either data or metadata
  • If >16K, written to a separate file instead
• Notice how all the "cryptic" files are >16K:

    -rw-------  1 beazley  111169 Sep 25 17:15 01CC0844d01
    -rw-------  1 beazley  104991 Sep 25 17:15 01CC3844d01
    -rw-------  1 beazley   47233 Sep 24 16:41 021F221Ad01
    ...
    -rw-------  1 beazley   26749 Sep 21 11:19 FF8AEDF0d01
    -rw-------  1 beazley   58172 Sep 25 18:16 FFE628C6d01

Cache Metadata
• Metadata is encoded as a binary structure:

    Header           36 bytes
    Request String   Variable length (in header)
    Request Info     Variable length (in header)

• Header encoding (binary, big-endian):

    0-3     magic (???)   unsigned int  (0x00010008)
    4-7     location      unsigned int
    8-11    fetchcount    unsigned int
    12-15   fetchtime     unsigned int  (system time)
    16-19   modifytime    unsigned int  (system time)
    20-23   expiretime    unsigned int  (system time)
    24-27   datasize      unsigned int  (byte count)
    28-31   requestsize   unsigned int  (byte count)
    32-35   infosize      unsigned int  (byte count)

Solution Outline
• Part 1: Parsing metadata headers
• Part 2: Getting request information (URL)
• Part 3: Extracting additional content info
• Part 4: Scanning of individual cache files
• Part 5: Scanning an entire directory
• Part 6: Scanning a list of directories

Part I - Reading Headers
• Write a function that can parse the binary metadata header and return the data in a useful format

Reading Headers

    import struct

    # This function parses a cache metadata header into a dict
    # of named fields (listed in _headernames below)

    _headernames = ['magic','location','fetchcount',
                    'fetchtime','modifytime','expiretime',
                    'datasize','requestsize','infosize']

    def parse_meta_header(headerdata):
        head = struct.unpack(">9I",headerdata)
        meta = dict(zip(_headernames,head))
        return meta

Reading Headers
• How this is supposed to work:

    >>> f = open("Cache/_CACHE_001_","rb")
    >>> f.seek(4096)                 # Skip the bit map
    >>> headerdata = f.read(36)      # Read 36 byte header
    >>> meta = parse_meta_header(headerdata)
    >>> meta
    {'fetchtime': 1190829792, 'requestsize': 27, 'magic': 65544,
     'fetchcount': 3, 'expiretime': 0, 'location': 2449473536L,
     'modifytime': 1190829792, 'datasize': 29448, 'infosize': 531}
    >>>

• Basically, we're parsing the header into a useful Python data structure (a dictionary)

struct module
• The struct module parses binary-encoded data into Python objects. You would use this module to pack/unpack raw binary data from Python strings. (A round-trip sketch follows below.)
• In parse_meta_header(), struct.unpack(">9I",headerdata) unpacks 9 unsigned 32-bit big-endian integers.
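A minimal round-trip sketch of pack() and unpack() (the values are arbitrary):

    # Pack three big-endian unsigned 32-bit ints, then unpack them
    import struct
    data = struct.pack(">3I", 1, 2, 3)     # 12-byte raw string
    values = struct.unpack(">3I", data)    # (1, 2, 3)
    print repr(data), values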

struct module
• The result of struct.unpack() is always a tuple of converted values:

    head = (65544, 0, 1, 1191682051, 1191682051,
            0, 8645, 190, 218)

Dictionary Creation
• zip(s1,s2) makes a list of tuples:

    zip(_headernames,head)

    [('magic',head[0]),
     ('location',head[1]),
     ('fetchcount',head[2]),
     ...
    ]

• dict() then makes a dictionary out of the (key,value) pairs:

    meta = dict(zip(_headernames,head))

Commentary
• Dictionaries as data structures:

    meta = {
        'fetchtime'   : 1190829792,
        'requestsize' : 27,
        'magic'       : 65544,
        'fetchcount'  : 3,
        'expiretime'  : 0,
        'location'    : 2449473536L,
        'modifytime'  : 1190829792,
        'datasize'    : 29448,
        'infosize'    : 531
    }

• Useful if data has many parts:

    data = f.read(meta[8])            # Huh?!?

  vs.

    data = f.read(meta['infosize'])   # Better

Mini-reference : struct
• struct module

    items = struct.unpack(fmt,data)
    data  = struct.pack(fmt,item1,...,itemn)

• Sample format codes

    'c'   # char
    'b'   # signed byte
    'B'   # unsigned byte
    'h'   # short
    'H'   # unsigned short
    'i'   # int
    'I'   # unsigned int
    'f'   # float
    'd'   # double
    's'   # string
    '<'   # little-endian
    '>'   # big-endian

slack.py

    # slack.py
    import sys, os, ffcache

    if len(sys.argv) != 2:
        print >>sys.stderr,"Usage: python slack.py dirname"
        raise SystemExit(1)

    caches = (path for path,dirs,files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files)

    for meta in ffcache.scan(caches):
        if 'slashdot' in meta['request']:
            print meta['request']
            print meta['cachedir']
            print

Intermission
• We have written a simple library, ffcache.py
• The library takes a moderately complex data processing problem and breaks it up into pieces.
• About 100 lines of code.
• Now, let's build an application...

Problem : CacheSpy
• Big Brother (make an evil sound here)

  Write a program that first locates all of the Firefox cache directories under a given directory. Then have that program run forever as a network server, waiting for connections. On each connection, send back all of the current cache metadata.

• Big Picture

  We're going to write a daemon that will find and quietly report on browser cache contents.

cachespy.py

    # cachespy.py
    import sys, os, pickle, SocketServer, ffcache

    SPY_PORT = 31337

    caches = [path for path,dname,files in os.walk(sys.argv[1])
                   if '_CACHE_MAP_' in files]

    def dump_cache(f):
        for meta in ffcache.scan(caches):
            pickle.dump(meta,f)

    class SpyHandler(SocketServer.BaseRequestHandler):
        def handle(self):
            f = self.request.makefile()
            dump_cache(f)
            f.close()

    SocketServer.TCPServer.allow_reuse_address = True
    serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
    print "CacheSpy running on port %d" % SPY_PORT
    serv.serve_forever()

SocketServer Module
• SocketServer is a module for easily creating low-level internet applications using sockets.

SocketServer Handlers
• You define a simple class that implements handle(). This implements the server logic:

    class SpyHandler(SocketServer.BaseRequestHandler):
        def handle(self):
            f = self.request.makefile()
            dump_cache(f)
            f.close()

SocketServer Servers
• Next, you just create a Server object, hook the handler up to it, and run the server:

    SocketServer.TCPServer.allow_reuse_address = True
    serv = SocketServer.TCPServer(("",SPY_PORT),SpyHandler)
    print "CacheSpy running on port %d" % SPY_PORT
    serv.serve_forever()

Data Serialization
• Here, we are turning a socket into a file and dumping cache data onto it:

    def dump_cache(f):
        for meta in ffcache.scan(caches):
            pickle.dump(meta,f)
    ...
    f = self.request.makefile()

• self.request is the socket corresponding to the client that connected.

pickle Module
• The pickle module takes any Python object and serializes it into a byte string.
• There are really only two ops (a round-trip sketch follows below):

    pickle.dump(obj,f)     # Dump object to file f
    obj = pickle.load(f)   # Load object from file f
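A minimal round-trip sketch with an ordinary file (the filename is arbitrary):

    # Serialize a dict to a file, then load it back
    import pickle
    meta = {'request': 'http://slashdot.org/', 'datasize': 29448}
    f = open("meta.p","wb")
    pickle.dump(meta,f)
    f.close()
    restored = pickle.load(open("meta.p","rb"))
    print restored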

Running our Server
• Example:

    % python cachespy.py /Users
    CacheSpy running on port 31337

• The server is just sitting there waiting for connections
• You can try connecting with telnet:

    % telnet localhost 31337
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    (dp0
    S'info'
    p1
    ... bunch of cryptic data ...

Problem : CacheMon
• The Evil Overlord (make a more evil sound)

  Write a program cachemon.py that contains a function for retrieving the cache contents from a remote machine.

• Big Picture

  Writing network clients. Programs that make outgoing connections to internet services.

cachemon.py

    # cachemon.py
    import pickle, socket

    def scan_remote_cache(host):
        s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
        s.connect(host)
        f = s.makefile()
        try:
            while True:
                meta = pickle.load(f)
                meta['host'] = host     # Add host to metadata
                yield meta
        except EOFError:
            pass
        f.close()
        s.close()

Solution : Socket Module
• The socket module provides direct access to the low-level socket API:

    s = socket.socket(family, type)
    s.connect(host)
    s.bind(addr)
    s.listen(n)
    s.accept()
    s.recv(n)
    s.send(data)
    ...

Unpickling a Sequence
• Here we use pickle to repeatedly load objects off of the socket. We use yield to generate a sequence of received objects:

    try:
        while True:
            meta = pickle.load(f)
            meta['host'] = host     # Add host to metadata
            yield meta
    except EOFError:
        pass

Example Usage
• Example: Find all JPEG images > 100K on a remote machine

    >>> rcache = scan_remote_cache(("localhost",31337))
    >>> jpegs = (meta for meta in rcache
    ...             if meta['content-type'] == 'image/jpeg'
    ...             and meta['datasize'] > 100000)
    >>> for j in jpegs:
    ...     print j['request']
    ...
    http://images.salon.com/ent/video_dog/comedy/2007/09/27/cereal/story.jpg
    http://images.salon.com/ent/video_dog/ifc/2007/09/28/apocalypse/story.jpg
    http://www.lakesideinns.com/images/fallroadphoto2006.jpg
    ...

• This looks almost identical to the old code

Code Similarity
• A remote scan:

    rcache = scan_remote_cache(("localhost",31337))
    jpegs = (meta for meta in rcache
                if meta['content-type'] == 'image/jpeg'
                and meta['datasize'] > 100000)
    for j in jpegs:
        print j['request']

• A local scan:

    cache = ffcache.scan(cachedirs)
    jpegs = (meta for meta in cache
                if meta['content-type'] == 'image/jpeg'
                and meta['datasize'] > 100000)
    for j in jpegs:
        print j['request']

Big Picture

    cachespy.py                              cachemon.py
    for meta in ffcache.scan(dirs):          while True:
        pickle.dump(meta,f)   --(socket)-->      meta = pickle.load(f)
                                                 yield meta

                                             for meta in remote_scan(host):
                                                 # ...

Problem : Clusters
• Scan a whole cluster of machines

  Write a function that can easily scan the caches of an entire collection of remote hosts.

• Big Picture

  Collecting data from a group of machines on the network.

cachemon.py

    # cachemon.py
    ...
    def scan_cluster(hostlist):
        for host in hostlist:
            try:
                for meta in scan_remote_cache(host):
                    yield meta
            except (EnvironmentError,socket.error):
                pass

• A bit of exception handling to deal with dead machines and other problems (would probably need to be expanded)

Example Usage
• Example: Find all JPEG images > 100K on a set of remote machines

    >>> hosts = [('host1',31337),('host2',31337),...]
    >>> rcaches = scan_cluster(hosts)
    >>> jpegs = (meta for meta in rcaches
    ...             if meta['content-type'] == 'image/jpeg'
    ...             and meta['datasize'] > 100000)
    >>> for j in jpegs:
    ...     print j['request']
    ...

• Think about the abstraction of "iteration" here. The query code is exactly the same.

Problem : Concurrency
• Collect data from a large set of machines

  In the last section, the scan_cluster() function retrieves data from one machine at a time. However, a world-wide quasi-evil organization is likely to have at least several dozen machines.

• Your task

  Modify the scanner so that it can manage concurrent client connections, reading data from multiple sources at once.

Concurrency
• Python provides full support for threads
• They are real threads (pthreads, system threads, etc.)
• However, a lock within the Python interpreter (the Global Interpreter Lock) prevents concurrency across more than one CPU.

Programming with Threads
• The threading module provides a Thread object.
• A variety of synchronization primitives are provided (locks, semaphores, condition variables, events, etc.)
• Can program very traditional kinds of threaded programs (multiple threads, lots of locking, race conditions, horrible debugging, etc.).

Threads with Queues
• One technique for thread programming is to have independent threads that share data via thread-safe message queues.
• Variations of "producer-consumer" problems.
• We will use this in our solution. Keep in mind, it's not the only way to program threads.

A Cache Scanning Thread

    # cachemon.py
    ...
    import threading

    class ScanThread(threading.Thread):
        def __init__(self,host,msg_q):
            threading.Thread.__init__(self)
            self.host  = host
            self.msg_q = msg_q
        def run(self):
            for meta in scan_remote_cache(self.host):
                self.msg_q.put(meta)

threading Module
• The threading module contains most functionality related to threads.

Thread Base Class
• Threads are defined by inheriting from the Thread base class:

    class ScanThread(threading.Thread):
        ...

Thread Initialization
• __init__() performs initialization and setup:

    def __init__(self,host,msg_q):
        threading.Thread.__init__(self)
        self.host  = host
        self.msg_q = msg_q

Thread Execution
• The run() method contains code that executes in the thread. Here, the thread performs a scan of a single host:

    def run(self):
        for meta in scan_remote_cache(self.host):
            self.msg_q.put(meta)

Launching a Thread
• You create a thread object and start it:

    t1 = ScanThread(("host1",31337),msg_q)
    t1.start()
    t2 = ScanThread(("host2",31337),msg_q)
    t2.start()

• .start() starts the thread and calls .run()

Thread Safe Queues
• The Queue module provides a thread-safe queue (see the producer/consumer sketch below):

    import Queue
    msg_q = Queue.Queue()

• Queue insertion:

    msg_q.put(obj)

• Queue removal:

    obj = msg_q.get()

• A queue can be shared by as many threads as you want without worrying about locking.
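A minimal producer/consumer sketch with a None sentinel (the same shutdown convention our monitor uses shortly):

    # One producer thread feeding a consumer through a Queue
    import threading, Queue

    def producer(q):
        for i in range(5):
            q.put(i)
        q.put(None)                  # Sentinel: no more data

    msg_q = Queue.Queue()
    threading.Thread(target=producer, args=(msg_q,)).start()
    while True:
        item = msg_q.get()
        if item is None: break
        print item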

Use of a Queue Object
• msg_q is a Queue object where incoming objects are placed.
• run() gets data from the remote machine and puts it into the queue:

    def run(self):
        for meta in scan_remote_cache(self.host):
            self.msg_q.put(meta)

Primitive Use of a Queue
• You first create a queue, then launch the threads to insert data into it:

    msg_q = Queue.Queue()
    t1 = ScanThread(("host1",31337),msg_q)
    t1.start()
    t2 = ScanThread(("host2",31337),msg_q)
    t2.start()
    while True:
        meta = msg_q.get()    # Get metadata
        ...

Monitor Architecture

    Host --socket--> Thread ---.
    Host --socket--> Thread ---+--.put()--> msg_q --.get()--> Consumer ????
    Host --socket--> Thread ---'

  (Monitor: one scanning thread per host; every thread puts metadata into a
  shared message queue, which a consumer reads off with .get())

Concurrent Monitor

    import threading, Queue

    def concurrent_scan(hostlist, msg_q):
        thr_list = []
        for host in hostlist:
            thr = ScanThread(host,msg_q)
            thr.start()
            thr_list.append(thr)
        for thr in thr_list:
            thr.join()
        msg_q.put(None)     # Sentinel

    def scan_cluster(hostlist):
        msg_q = Queue.Queue()
        threading.Thread(target=concurrent_scan,
                         args=(hostlist,msg_q)).start()
        while True:
            meta = msg_q.get()
            if meta:
                yield meta
            else:
                break

Launching Threads
• concurrent_scan() runs as a thread that launches ScanThreads. It then waits for the threads to terminate by joining with them. After all threads have terminated, a sentinel is dropped into the queue:

    def concurrent_scan(hostlist, msg_q):
        thr_list = []
        for host in hostlist:
            thr = ScanThread(host,msg_q)
            thr.start()
            thr_list.append(thr)
        for thr in thr_list:
            thr.join()
        msg_q.put(None)     # Sentinel

Collecting Results
• scan_cluster() creates a Queue and launches a thread that launches all of the scanning threads. It then produces a sequence of cache data until the sentinel (None) is pulled off of the queue:

    def scan_cluster(hostlist):
        msg_q = Queue.Queue()
        threading.Thread(target=concurrent_scan,
                         args=(hostlist,msg_q)).start()
        while True:
            meta = msg_q.get()
            if meta:
                yield meta
            else:
                break

More on Threads
• There are many more issues to thread programming that we could discuss.
• All issues concerning locking, synchronization, event handling, and race conditions apply to Python.
• Because of the global interpreter lock, threads are generally not a way to achieve higher performance.

Thread Synchronization
• The threading module has various primitives:

    Lock()         # Mutex Lock
    RLock()        # Reentrant Mutex Lock
    Semaphore(n)   # Semaphore

• Example use:

    x = value          # Some kind of shared object
    x_lock = Lock()    # A lock associated with x
    ...
    x_lock.acquire()
    # Modify or do something with x (critical section)
    ...
    x_lock.release()

Story so Far
• Wrote a module ffcache.py that parses the contents of caches (~100 lines)
• Wrote cachespy.py, which allows cache data to be retrieved by a remote client (~25 lines)
• Wrote a concurrent monitor for getting that data (~50 lines)

A subtle observation
• In none of our programs have we read the entire contents of any Firefox cache into memory.
• In cachespy.py, the contents are read iteratively and piped through a socket (not stored in memory).
• In cachemon.py, contents are received and routed through message queues, and processed iteratively (no temporary lists of results).

Another Observation
• For every connection, cachespy sends the entire contents of the Firefox cache metadata back to the monitor.
• Given that caches are ~50 MB by default, this could result in large network traffic.
• Question: Given that we're normally performing queries on the data, could we do any of this work on the remote machines?

Remote Filtering
• Distribute the work

  Modify the cachespy program so that some of the query work can be performed remotely on each of the machines. Only send back a subset of the data to the monitor program.

• Big Picture

  Distributed computation. Massive security nightmare.

The idea
• Modify scan_cluster() and all related functions to accept an optional filter specification. Pass this on to the remote machine and use it to process the data remotely before returning results:

    filter = """
    if meta['content-type'] == 'image/jpeg'
    and meta['datasize'] > 100000
    """
    rcaches = scan_cluster(hostlist,filter)

Changes to the Monitor
• A filter parameter is added to scan_remote_cache(). The filter is sent to the remote host right after connecting:

    # cachemon.py
    def scan_remote_cache(host,filter=""):
        s = socket.socket(socket.AF_INET,socket.SOCK_STREAM)
        s.connect(host)
        f = s.makefile()
        pickle.dump(filter,f)     # Send the filter to the
        f.flush()                 # remote host
        try:
            while True:
                meta = pickle.load(f)
                meta['host'] = host
                yield meta
        except EOFError:
            pass

Changes to the Monitor
• The filter is also added to the thread data:

    # cachemon.py
    ...
    class ScanThread(threading.Thread):
        def __init__(self,host,msg_q,filter=""):
            threading.Thread.__init__(self)
            self.host   = host
            self.msg_q  = msg_q
            self.filter = filter
        def run(self):
            try:
                for meta in scan_remote_cache(self.host,self.filter):
                    self.msg_q.put(meta)
            except (EnvironmentError,socket.error):
                pass

Changes to the Monitor
• The filter is passed along to thread creation:

    def concurrent_scan(hostlist, msg_q, filter):
        thr_list = []
        for host in hostlist:
            thr = ScanThread(host,msg_q,filter)
            thr.start()
            thr_list.append(thr)
        for thr in thr_list:
            thr.join()
        msg_q.put(None)     # Sentinel

Changes to the Monitor
• The filter is added to scan_cluster() as well:

    # cachemon.py
    ...
    def scan_cluster(hostlist,filter=""):
        msg_q = Queue.Queue()
        threading.Thread(target=concurrent_scan,
                         args=(hostlist,msg_q,filter)).start()
        while True:
            meta = msg_q.get()
            if not meta: break
            yield meta

Commentary
• We have modified the cache monitor program to accept a filter string and to pass that string to remote clients upon connecting.
• Next: how to use the filter in the spy server.

Changes to CacheSpy

    # cachespy.py
    ...
    def dump_cache(f,filter):
        values = """(meta for meta in ffcache.scan(caches)
                          %s)""" % filter
        try:
            for meta in eval(values):
                pickle.dump(meta,f)
        except:
            pickle.dump({'error' : traceback.format_exc()},f)

Changes to CacheSpy
• The filter is added and used to create an expression string. For example:

    filter = "if meta['datasize'] > 100000"

  produces:

    values = """(meta for meta in ffcache.scan(caches)
                      if meta['datasize'] > 100000)"""

Eval()
• eval(s) evaluates s as a Python expression. Here, it evaluates the generated generator-expression string (a small standalone sketch follows):

    for meta in eval(values):
        pickle.dump(meta,f)

• A bit of error handling: the traceback module creates stack traces for exceptions:

    except:
        pickle.dump({'error' : traceback.format_exc()},f)
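A minimal sketch of eval() on a generated generator-expression string (the sample records are made up for illustration):

    # Build a generator expression from a filter string and evaluate it
    records = [{'datasize': 500}, {'datasize': 200000}]
    filter  = "if meta['datasize'] > 100000"
    values  = "(meta for meta in records %s)" % filter
    for meta in eval(values):
        print meta                   # Only the large record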

Changes to the Server
• The handler now gets the filter from the monitor before dumping the cache:

    # cachespy.py
    ...
    class SpyHandler(SocketServer.BaseRequestHandler):
        def handle(self):
            f = self.request.makefile()
            filter = pickle.load(f)     # Get filter from the monitor
            dump_cache(f,filter)
            f.close()

Putting it all Together
• A remote query to find slackers:

    # Find all of those slashdot slackers
    import cachemon

    hosts = [('host1',31337),('host2',31337),
             ('host3',31337),...]

    filter = "if 'slashdot' in meta['request']"
    rcaches = cachemon.scan_cluster(hosts,filter)
    for meta in rcaches:
        print meta['request']
        print meta['host'],meta['cachedir']
        print

Putting it all Together
• Queries run remotely on all the hosts
• Only data of interest is sent back
• No temporary lists or large data structures
• Concurrent execution on the monitor
• Concurrency is hidden from the user

The Power of Iteration
• Loop over all entries in a cache file:

    for meta in scan_cache_file(f,256):
        ...

• Loop over all entries in a cache directory:

    for meta in scan_cache(dirname):
        ...

• Loop over all cache entries on a remote host:

    for meta in scan_remote_cache(host):
        ...

• Loop over all cache entries on many hosts:

    for meta in scan_cluster(hostlist):
        ...

Wrapping Up
• A lot of material has been presented
• Again, the goal was to do something interesting with Python, not to be just a reference manual.
• This is only a small taste of what's possible
• And it's only a small taste of why people like programming in Python

Other Python Examples
• Python makes many annoying tasks relatively easy.
• We will end by showing very simple examples of other modules.

Fetching a Web Page
• urllib and urllib2 modules:

    import urllib
    w = urllib.urlopen("http://www.foo.com")
    for line in w:
        # ...

    page = urllib.urlopen("http://www.foo.com").read()

• Additional options support uploading of form values, cookies, passwords, proxies, etc.

A Web Server with CGI
• Serve files and allow CGI scripts:

    from BaseHTTPServer import HTTPServer
    from CGIHTTPServer import CGIHTTPRequestHandler
    import os
    os.chdir("/home/docs/html")
    serv = HTTPServer(("",8080),CGIHTTPRequestHandler)
    serv.serve_forever()

• Can easily throw up a server with just a few lines of Python code.

A Custom HTTP Server
• BaseHTTPServer module:

    from BaseHTTPServer import BaseHTTPRequestHandler,HTTPServer

    class MyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ...
        def do_POST(self):
            ...
        def do_HEAD(self):
            ...
        def do_PUT(self):
            ...

    serv = HTTPServer(("",8080),MyHandler)
    serv.serve_forever()

• Could use this to put a web server in an application

XML-RPC Server/Client
• How to create a stand-alone server:

    from SimpleXMLRPCServer import SimpleXMLRPCServer

    def add(x,y):
        return x+y

    s = SimpleXMLRPCServer(("",8080))
    s.register_function(add)
    s.serve_forever()

• How to test it (xmlrpclib):

    >>> import xmlrpclib
    >>> s = xmlrpclib.ServerProxy("http://localhost:8080")
    >>> s.add(3,5)
    8
    >>> s.add("Hello","World")
    "HelloWorld"
    >>>

Where to go from here?
• Network/Internet programming. Python has a large user base developing network applications, web frameworks, and internet data handling tools.
• C/C++ extension building. Python is easily extended with C/C++ code. Can use Python as a high-level control application for existing systems software.

Where to go from here?
• GUI programming. There are several major GUI packages for Python (Tkinter, wxPython, PyQt, etc.).
• Jython and IronPython. Implementations of the Python interpreter for Java and .NET.

Where to go from here?
• Everything Pythonic: http://www.python.org
• Get involved: PyCon'2008 (Chicago)
• Have an on-site course (shameless plug): http://www.dabeaz.com/python.html

Thanks for Listening!
• Hope you got something out of the class
• Please give me feedback!