Introduction to: Computers & Programming: File Input and Output (IO)

Author: Shona Parks
Introduction to: Computers & Programming: File Input and Output (IO) Adam Meyers New York University

Intro to: Computers & Programming: File Input & Output in Python CSCI-UA.0002

Summary
• What is Input and Output?
• What kinds of Input and Output have we covered so far?
  – print (to the console)
  – input (from the keyboard)
• File handling
  – input from files
  – output to files
  – Text files vs. 'pickled' binary files
• URL handling


Input
• Input is any information provided to the program
  – Keyboard input
  – Mouse input
  – File input
  – Sensor input (microphone, camera, photo cell, etc.)

• Output is any information (or effect) that a program produces:
  – sounds, lights, pictures, text, motion, etc.
  – on a screen, in a file, on a disk or tape, etc.

Types of Input Covered in This Class
• So Far
  – Input: keyboard input only
  – Output: graphical and text output transmitted to the computer screen

• This Unit expands our repertoire to include:
  – File Input: Python can read in the contents of files
  – File Output: Python can write text to files


Files
• File = named data collection stored on a storage device
  – Different types of data: text, binary, etc.
  – Accessible by name or address
  – Has a start and an end point
  – Programs read, create, modify (and do other things to) files
• A text file can be treated like a (big) string
  – Human readable
  – ASCII/UTF-8/etc. encoding
  – Can be plain text or can contain markup (e.g., HTML)
• Binary files: not human readable, usually require specific programs to read

Use Text Editors for Text Files
• A text file (.txt) can be created or edited with a text editor
• Text Editors
  – Apple: TextEdit
  – Windows: Wordpad (preferred) and Notepad
  – Unix Systems: emacs (available for most systems), vi or ex
  – Program development tools (including IDLE) bundle text editing with other features

Folders/Directories and Paths
• A Folder or a Directory is a named stored item that contains other folders and/or files
• The root directory of a storage device:
  – no other directory contains it
  – it contains all other directories/files on that storage device
• A sequence from directory to directory to directory … ending in a directory or file is called a path.
  – Each item n in the path (except the root) is contained by item n-1.
  – There is at least one path from the root to every file & directory, i.e., paths can be used to identify/locate files
  – Each path uniquely identifies a single directory or file (ignoring shortcuts, aka symbolic links)


Slash Notation for Representing Paths
• Unix-like operating systems (linux, Apple, Android, etc.) use the forward slash to connect directories in a path, e.g., the path for this file could be: /Users/Adam/Desktop/Class Talks/Input-Output.odp

• MSDOS, Windows and related systems use the backslash \ instead
• The root directory in UNIX systems is labeled /
• In Windows it is a letter and a colon, e.g., C:

The Directory Tree Including this File (figure: directory tree diagram, not reproduced)


Filename Conventions
• Different systems vary in which characters are legal in filenames
• Conservative assumptions:
  – Use letters, numbers and underscore
    • Dashes are OK, but can be problematic for Python
  – Use conventional file extensions = filename endings beginning with a period: .py, .jpg, .doc, etc.
  – Some file extensions we are likely to use with Python:
    • Python = .py
    • Text = .txt
    • Comma separated values = .csv
    • Tab separated values = .tsv

Path Conventions
• The full pathname of a file is the path from the root to that file
• A relative pathname of a file is a path from some other point to that file
• Usually, paths are described relative to some working directory, commonly called the “current working directory” or the “present working directory”

The os module
• Interface between Python and the Operating System
• http://docs.python.org/py3k/library/os.html
• For performing OS-dependent operations
  – handling files, checking for system features, getting root or administrative permission, etc.

• As with other modules, needs to be imported
  – import os, help(os), etc.

• Other system info is in the platform module
  – e.g., platform.system() distinguishes Windows, Apple (Darwin), linux, etc.
  – Most of this info would not be needed for the types of programs we can reasonably expect to write this term

Global Variables from the os Module
• os.name
  – 'posix' (linux, current apple) or 'nt' (most Windows)
  – Others, mostly from older Python versions: 'os2' (old; IBM's answer to MSDOS), 'ce' (Windows embedded), 'java' (Java os), 'riscos' (Raspberry pi)

• os.environ: all environment variables (as a dictionary)
• os.linesep: '\n' for most systems, '\r\n' for Windows
• os.sep: '/' for most systems, '\\' for Windows


Listing, Renaming & Removing Paths and Creating Directories
• os.getcwd(): gets the current working directory
• os.listdir(directory): gets the children of directory
• os.chdir(path): changes the current working directory to path
• os.mkdir(path): makes a directory called path
• os.remove(path): removes the file (not directory) called path
• os.rmdir(path): removes the directory (not file) called path
• os.rename(oldpath,newpath): renames (or moves) oldpath to newpath
• os.path.isfile(path), os.path.isdir(path): Boolean functions indicating whether a particular pathname refers to a real file/directory
• os.system(command): executes a terminal command as indicated by the string command
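The functions listed above can be tried out safely inside a throwaway directory. The following is only a sketch: the directory and file names ('notes', 'draft.txt', 'final.txt') are made up for illustration, and tempfile is used so that nothing outside the scratch directory is touched.

```python
import os
import tempfile

# Work inside a temporary directory that is deleted at the end.
with tempfile.TemporaryDirectory() as scratch:
    subdir = os.path.join(scratch, 'notes')         # hypothetical directory name
    os.mkdir(subdir)                                # create the directory

    old_path = os.path.join(subdir, 'draft.txt')    # hypothetical file names
    new_path = os.path.join(subdir, 'final.txt')
    outstream = open(old_path, 'w')
    outstream.write('hello\n')
    outstream.close()

    os.rename(old_path, new_path)                   # rename draft.txt to final.txt
    listing = os.listdir(subdir)                    # children of the directory
    was_file = os.path.isfile(new_path)
    was_dir = os.path.isdir(subdir)

    os.remove(new_path)                             # remove the file first ...
    os.rmdir(subdir)                                # ... then the now-empty directory
    still_there = os.path.isdir(subdir)

print(listing, was_file, was_dir, still_there)
```

Note that os.rmdir only removes an empty directory, which is why the file is removed first.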

File Permissions
• There are certain directories and files that require root or administrative permission to open or read.
• It is possible to view and change these permission properties.
• For simplicity, we are only going to deal with files which our user has permission to create, remove and/or change


Files and Streams
• A stream is a continuous sequence of data that ends at EOF (end of file)
  – Python's “file objects” are instances of streams.
  – A computer program can read from a stream
  – A computer program can write to a stream
  – Other similar operations (e.g., append) are also possible
• An input stream can be created (opened) containing data found in a file. A program can then read data from this stream.
• A program can create (open) an output stream and add (write) data to it. Once the stream is closed, all of the data written to the stream is in the file. This writing can either overwrite an existing file or create a new one.
• Other streams exist. 2 notable examples are:
  – standard input (the words you type)
    • What input statements in Python read from
    • sys.stdin.read() (reading from standard input) is a lot like input
  – standard output (what you see when words appear on the screen)
    • What print statements in Python write to
    • sys.stdout.write() (writing to standard output) is a lot like print

Reading and Writing to Files in Python
• instream = open(path, 'r')
  – creates an input stream containing the contents of the file named path and makes it the value of the variable instream.

• outstream = open(path,'w')
  – creates an output stream for writing data and sets the variable outstream to this stream. path names the file, which is created as soon as the stream is opened (a previous file with that name is emptied and overwritten).

• When the program is finished with a stream, it should close it as follows:
  – stream.close()
  – If stream is an output stream, closing it makes sure that everything written to the stream has reached the file
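A minimal round trip, writing a file and then reading it back, might look like the sketch below. The file name 'example.txt' and the use of a temporary directory are assumptions made so the example can run anywhere.

```python
import os
import tempfile

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Open an output stream, write two lines, and close it.
outstream = open(path, 'w')
outstream.write('first line\n')
outstream.write('second line\n')
outstream.close()

# Open an input stream on the same file and collect its lines.
instream = open(path, 'r')
lines = []
for line in instream:
    lines.append(line)
instream.close()

print(lines)
```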

Options for open (the mode argument)
• Direction
  – r: read (previous slide, also default)
  – w: write (previous slide)
  – a: append (add to the end)
  – +: open for read and write

• File Type
  – b: binary mode
  – t: text mode (default)
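The mode letters can be combined, e.g. 'rb' for reading in binary mode. The sketch below (with a made-up file name, 'modes.txt') shows 'w' followed by 'a', and then reads the same file once in text mode and once in binary mode.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'modes.txt')  # hypothetical file name

out = open(path, 'w')      # 'w' starts a new (empty) file
out.write('one\n')
out.close()

out = open(path, 'a')      # 'a' keeps the old contents and adds to the end
out.write('two\n')
out.close()

text_stream = open(path, 'r')     # text mode (default): read() returns a str
text = text_stream.read()
text_stream.close()

binary_stream = open(path, 'rb')  # binary mode: read() returns bytes
raw = binary_stream.read()
binary_stream.close()

print(repr(text), repr(raw))
```

In binary mode no line-ending conversion takes place, so on Windows the bytes would contain '\r\n' where the text-mode string contains '\n'.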

Sample function that reads and prints a text file
• In IO-examples.py:

    def read_story(file):
        story = open(file,'r')
        for line in story:
            print(line,end='')
        story.close()

• Sample calls:
  – read_story('/Users/adam/Documents/short_story.txt')
  – read_story('short_story.txt')

• The file 'short_story.txt'
  – is in the current working directory
  – os.getcwd() → '/Users/meyers/Desktop/Python-Class/Python-programs/'
  – os.listdir(os.getcwd()) → a big list of files
  – 'short_story.txt' in os.listdir(os.getcwd()) → True

More about reading files
• The for loop treats the input stream (story) as a sequence of lines, each line being a string.

    for line in story:
        print(line,end='')

• The call to print uses end='' so that print does not add a newline of its own after each string
  – Each line read from the file is a string that already ends with '\n' (in text mode, Python converts the platform's line ending, such as '\r\n' on Windows, to '\n')
  – Leaving out end='' results in an additional blank line being printed after each line
• At the beginning of the read_story function, we open a stream, which we call story, as follows:

    story = open(file,'r')

• At the end of the function we close the stream as follows:

    story.close()

Reading from Streams: Details
• for loop: treats the stream as a list of lines
• Method readline reads one line at a time, moving to the next position in the stream
  – At the end of the stream, readline returns the empty string, which is what ends the loop below

    def read_story2(file):
        story = open(file,'r')
        line = '*start*'
        while line != '':
            line = story.readline()
            print(line,end='')
        story.close()
    ## equivalent to read_story

Reading from Streams: More Details
• stream.read() method: reads the whole stream as one big string

    def read_story3(file):
        story = open(file,'r')
        big_string = story.read()
        ## text mode has already converted line endings to '\n'
        line_list = big_string.split('\n')
        for line in line_list:
            print(line)
        story.close()
    ## equivalent to read_story, except that a file ending in '\n'
    ## produces one extra empty string, printed as a final blank line

Reading from Streams: More Details
• stream.read(1) method: reads one character at a time

    def read_story4(file):
        story = open(file,'r')
        char = '***'
        while char != '':
            char = story.read(1)
            print(char,end='')
        story.close()
    ## equivalent to read_story


Alternatives for Reading from a Stream
• input_string = stream.read()
  – Creates one large string consisting of all characters in a file.
  – Flexible: the program can easily divide this up in any way, e.g., for some texts, splitting at tabs will make a list of paragraphs.

• next_line = stream.readline()
  – Gets the string starting at the current stream position and ending with the next newline ('\n'). Advances the position to just after this newline.

• next_char = stream.read(1)
  – Reads the character at the current stream position and advances the stream position
  – next_string = stream.read(N) reads N characters

• for line in stream: ## loop through stream and treat as list of lines
• Other stream methods are listed under class IOBase in https://docs.python.org/3.1/library/io.html
  – Testing for types of streams, changing stream position, returning portions of stream, etc.
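The reading methods above all share one stream position, so they can be mixed. The sketch below uses a small made-up file ('greek.txt') to show how each call picks up where the previous one stopped.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'greek.txt')  # hypothetical file name
outstream = open(path, 'w')
outstream.write('alpha\nbeta\ngamma\n')
outstream.close()

stream = open(path, 'r')
first_line = stream.readline()  # everything up to and including the first '\n'
next_char = stream.read(1)      # the single character after that line
next_three = stream.read(3)     # the following three characters
rest = stream.read()            # everything that is left
at_end = stream.read()          # nothing left: the empty string
stream.close()

print(repr(first_line), repr(next_char), repr(next_three), repr(rest), repr(at_end))
```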

Reading txt files on your machine
• Put a plain text file in the current working directory and/or use an absolute path name
• I am basing the description so far on Posix (linux, Apple, Solaris, BSD, etc.) paths
• Windows paths are different
  – My windows cwd is 'C:\\Python33' by default
  – There are backslashes instead of slashes
  – By default the file system does not display the file type (.txt)
    • So we have to be extra careful that we have the right filename
• In a Python string, a backslash is indicated by using 2 backslashes
  – For Windows, it may be convenient to use the notation for a 'raw' string: r'Z:\2015-class-websites\Python-programs\short_story.txt'
• Thus, on my Windows machine, the following commands work:
  – read_story('Z:\\2015-class-websites\\Python-programs\\short_story.txt')
  – read_story(r'Z:\2015-class-websites\Python-programs\short_story.txt')
• If I change directories first, I can just use the relative path (just the filename)
  – os.chdir(r'Z:\2015-class-websites\Python-programs') ## no trailing backslash: a raw string cannot end with a single backslash
  – read_story('short_story.txt')
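The two ways of writing a Windows path produce exactly the same string, which can be checked directly. The sketch also shows os.path.join, a standard library function not mentioned in the slides, which inserts the right separator for whichever system the program runs on.

```python
import os

# Doubled backslashes and a raw string describe the same path.
doubled = 'Z:\\2015-class-websites\\Python-programs\\short_story.txt'
raw = r'Z:\2015-class-websites\Python-programs\short_story.txt'
same = doubled == raw

# os.path.join puts os.sep between the parts ('/' or '\\' as appropriate).
portable = os.path.join('Python-programs', 'short_story.txt')

print(same, portable)
```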

Function that Writes User Input to a File
• In IO-examples.py:

    def take_dictation(outfile):
        dictation = open(outfile,'w')
        line = 'Empty'
        while line != '':
            line = input('Please give next line or hit enter if you are done. ')
            dictation.write(line+'\n')  ## use newline char instead of os.linesep
        dictation.close()

• Sample call: take_dictation('ClassNotesTues_4_12.txt')

Notes about Dictation Function
• Since I did not provide an absolute path, I know that the file will be in the current working directory.
• os.getcwd() → identifies the cwd
• The output file (e.g., 'ClassNotesTues_4_12.txt') is located there.
• The function initializes an output stream using the 'w' (write) option of open
• The variable line is initialized as 'Empty'
• Then a while loop keeps going as long as line is not equal to the empty string.
• This kind of while loop is called a sentinel loop because we use a sentinel string (the empty string) to indicate when it is done.

More on the Dictation Function
• Other sentinel strings are possible.
  – The empty string is not ideal as it could be entered by accident.
  – Perhaps an explicit **stop** would make sure the user only stops when they mean to do so.

• The while loop writes each user input on a new line using the function (method) called write, which is specific to streams.
• '\n' is added to the end of the string.
  – '\n' is used for all operating systems for writing to files (it is converted automatically for Windows)
• After the loop, we close the stream (which makes sure everything written reaches the file)
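One way to use an explicit sentinel is sketched below. This is not the version in IO-examples.py: the function name write_until_sentinel is hypothetical, and it takes its lines from a list rather than from input() so that it can run unattended.

```python
import os
import tempfile

def write_until_sentinel(lines, outfile, sentinel='**stop**'):
    """Write each line to outfile until the sentinel appears; return how many were written."""
    count = 0
    dictation = open(outfile, 'w')
    for line in lines:
        if line == sentinel:      # the sentinel itself is not written
            break
        dictation.write(line + '\n')
        count += 1
    dictation.close()
    return count

path = os.path.join(tempfile.mkdtemp(), 'dictation.txt')  # hypothetical file name
spoken = ['Dear class,', '', 'See you Tuesday.', '**stop**', 'ignored']
count = write_until_sentinel(spoken, path)

check = open(path, 'r')
text = check.read()
check.close()
print(count, repr(text))
```

With this sentinel, an empty line no longer ends the dictation; it is simply written as a blank line.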

Comparing stream.write & print
• The stream.write method takes one string as an argument
• It is like a print function, but it prints to a stream, instead of the screen (but the stream can be sys.stdout).
• write does not behave like print in these ways:
  – Unlike print, it does not take more than one argument
  – There are no sep and end keywords
  – It does not work with non-strings (you have to convert items to strings)
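The differences can be seen side by side in the sketch below. It uses io.StringIO, an in-memory text stream from the standard library, as a stand-in for a file, and the file keyword of print, which sends print's output to a stream other than the screen.

```python
import io
import sys

buffer = io.StringIO()   # an in-memory text stream, standing in for a file

# write takes exactly one string: convert the number and add '\n' yourself.
buffer.write('total: ' + str(42) + '\n')

# print converts each argument, joins them with sep (a space by default)
# and adds end ('\n' by default); file= redirects it to the stream.
print('total:', 42, file=buffer)

# Writing to sys.stdout puts text on the screen, much like print.
sys.stdout.write('hello\n')

written = buffer.getvalue()
print(repr(written))
```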

A Shortcut for Opening and Closing Streams
• The with statement opens a stream for an indented block of code and closes it automatically when the block ends
• Reading:

    with open(filename,'r') as instream:
        for line in instream:
            print(line,end='')

• Writing:

    with open(filename,'w') as outstream:
        for line in list_of_output_lines:
            outstream.write(line)

• Two streams at once:

    with open(infile,'r') as instream, open(outfile,'w') as outstream:
        for line in instream:
            outstream.write(line)

• Equivalent to:

    instream = open(infile,'r')
    outstream = open(outfile,'w')
    for line in instream:
        outstream.write(line)
    instream.close()
    outstream.close()
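One practical difference from calling close by hand is that with also closes the stream when the block is cut short by an error. The sketch below checks this using the closed attribute of file objects; the file name and the simulated error are made up for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'saved.txt')  # hypothetical file name

with open(path, 'w') as outstream:
    outstream.write('saved\n')
closed_after_block = outstream.closed   # the stream was closed for us

try:
    with open(path, 'r') as instream:
        raise ValueError('simulated problem while reading')
except ValueError:
    pass
closed_after_error = instream.closed    # closed even though the block failed

print(closed_after_block, closed_after_error)
```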

A Simple Spam Filter
• There are a bunch of email messages stored as files in a directory
• One at a time, the program reads these files and checks to see which ones pass a spam test.
• If the test says a file is spam, the program moves it into the spam directory; otherwise the program moves it into the to-read directory.
• In IO-examples.py

The implemented version of filter_spam
• Function call: filter_spam('letters','spam','to-read')
  – Sorts through the files in 'letters' and distributes them to 'spam' and 'to-read'.
  – Function call assumes that all directories are subdirectories of cwd

• filter_spam uses objects from the os module
  – os.path.isdir(): checks to see if the output directories exist
  – os.mkdir(): makes the directories if they don't exist
  – os.rename(file,destination): 2 equivalent interpretations
    • moves a file from one path to another path
    • renames a file from one pathname to another
  – os.sep: global variable, '/' for UNIX and '\\' for Windows

• The function is_spam determines which directory a file is moved to

The function: is_spam
• A function that returns True or False
• The current version returns True if:
  – The file is too big (more than 25K bytes)
    • It uses the .st_size slot of the object os.stat(file)
  – Or the subject line has no lowercase letter
    • The subject line begins with subject: (ignoring case)
  – Or the subject line includes a word from a list of spam words
  – Or the subject line is over 15 characters with no spaces
• There are some errors
  – Some of the mail classified as SPAM is really NOT SPAM
  – Some of the mail classified as NOT SPAM is really SPAM
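The rules described above could be put together roughly as follows. This is a sketch written from the slide descriptions, not the code in IO-examples.py: the SPAM_WORDS list, the helper get_subject, and the exact threshold of 25000 bytes are assumptions, and os.path.join is used to build paths.

```python
import os

SPAM_WORDS = {'free', 'winner', 'prize'}   # hypothetical list, for illustration only

def get_subject(path):
    """Return the text after 'subject:' on the first subject line, or '' if there is none."""
    with open(path, 'r', encoding='utf-8', errors='ignore') as instream:
        for line in instream:
            if line.lower().startswith('subject:'):
                return line[len('subject:'):].strip()
    return ''

def is_spam(path):
    """Apply the four tests from the slide; True means 'treat as spam'."""
    if os.stat(path).st_size > 25000:                  # too big (assumed threshold)
        return True
    subject = get_subject(path)
    if not any(ch.islower() for ch in subject):        # no lowercase letter at all
        return True
    words = [word.strip('.,!?:;').lower() for word in subject.split()]
    if any(word in SPAM_WORDS for word in words):      # contains a spam word
        return True
    if len(subject) > 15 and ' ' not in subject:       # long subject with no spaces
        return True
    return False

def filter_spam(indir, spamdir, readdir):
    """Move each file in indir to spamdir or readdir, creating them if needed."""
    for directory in (spamdir, readdir):
        if not os.path.isdir(directory):
            os.mkdir(directory)
    for name in os.listdir(indir):
        source = os.path.join(indir, name)
        if not os.path.isfile(source):
            continue                                   # skip subdirectories
        target = spamdir if is_spam(source) else readdir
        os.rename(source, os.path.join(target, name))
```

A call such as filter_spam('letters','spam','to-read') would then sort the files in 'letters'. Under this sketch a message with no subject line at all counts as spam, because an empty subject has no lowercase letter.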

A State-of-the-Art Version of is_spam Might Have the Following Features
• It might pay attention to more of the letter than just the subject line (the email address, the body of the letter)
• It might look for (characteristics of) images and weird character sets
• It would probably incorporate large statistics on words that are more likely to be found in spam than in normal emails (it would not use a simple list)
• It would combine statistics, rather than basing the determination on the presence/absence of items in a list
• It would include user feedback

More about Spam Program
• For simplicity, we treated letters as files
  – Actually the important issue is that they are streams, a more general concept that includes files, letters, transmissions of different kinds, etc.

• This program uses the os module to be platform independent
  – The specifics of file handling largely depend on the computer, operating system, etc. that you are using

• Weird characters were also a factor
  – open(file,'r', encoding='utf-8', errors='ignore')
    • encoding='utf-8' ensures that most characters are accepted
    • errors='ignore' makes it so the program does not bomb on bad characters (important for email, which mixes text and binary)
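The effect of errors='ignore' can be seen with a file that contains a byte that is not valid UTF-8. In this sketch the file name and its contents are made up; the byte 0xff is used because it can never appear in valid UTF-8.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'message.txt')  # hypothetical file name

# Write raw bytes, including one (0xff) that is never valid in UTF-8.
outstream = open(path, 'wb')
outstream.write(b'caf\xff ok\n')
outstream.close()

# With the default error handling, reading the bad byte raises an error.
instream = open(path, 'r', encoding='utf-8')
try:
    instream.read()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
finally:
    instream.close()

# With errors='ignore', the bad byte is silently dropped.
instream = open(path, 'r', encoding='utf-8', errors='ignore')
text = instream.read()
instream.close()

print(strict_failed, repr(text))
```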

Binary Files (not covered in detail)
• For our purposes, a binary file is any non-text file:
  – exe, jpg, gif, mp3, etc.

• The open function can read them using mode 'br' and write to them using mode 'bw' (more commonly written 'rb' and 'wb')
• Pickling is a Python process for saving Python data in binary form and retrieving it
  – import pickle ## loads the pickle module
  – pickle.dump(python_object,outstream) ## sends python_object to the binary output stream outstream
  – python_object = pickle.load(instream) ## reads a pickled object back from the binary input stream instream and returns it
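A pickle round trip might look like the sketch below. The file name 'phone_book.pickle' and the sample data are made up; the key points are that both streams are opened in binary mode and that pickle.load returns a Python object, not a stream.

```python
import os
import pickle
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'phone_book.pickle')  # hypothetical file name
phone_book = [['Ada', '555-0100'], ['Grace', '555-0199']]     # sample data

# Save the list in binary form.
with open(path, 'wb') as outstream:
    pickle.dump(phone_book, outstream)

# Load it back: the result is a new list with the same contents.
with open(path, 'rb') as instream:
    restored = pickle.load(instream)

print(restored)
```

Only unpickle files you trust, since loading a pickle can run arbitrary code.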

Processing Webpages
• import urllib.request
  – Loads module for creating streams by reading in webpages
• Note that these streams will have many of the same properties as file streams
  – E.g., they can be treated as lists of strings
• http://docs.python.org/py3k/library/urllib.request
• For example, I have recently used this package to process Yahoo (Bing) Searches.
  – Additional work to “process” the html output of search:
    • separate out the top 10 search results
    • identify URL link, title and abstract
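The sketch below shows the stream-like behaviour of urllib.request.urlopen. So that it can run without a network connection, it reads a small local HTML file through a file: URL as a stand-in for a real webpage; the file name and contents are made up. One difference from text files is that the lines arrive as bytes and need to be decoded.

```python
import os
import pathlib
import tempfile
import urllib.request

# Create a tiny local "webpage" (hypothetical name and contents).
path = os.path.join(tempfile.mkdtemp(), 'page.html')
with open(path, 'w', encoding='utf-8') as outstream:
    outstream.write('<html>\n<title>Demo</title>\n</html>\n')

url = pathlib.Path(path).as_uri()     # a file: URL pointing at the local file

# urlopen returns a stream; looping over it yields one line of bytes at a time.
with urllib.request.urlopen(url) as page:
    lines = [raw_line.decode('utf-8') for raw_line in page]

for line in lines:
    print(line, end='')
```

For a real webpage the url string would be an http or https address instead.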

Reading Data from Files
• Files can store structured information that can be output or input by programs
• Ex 1: phone_list.txt
  – Each Record = consecutive lines with no blanks
  – Each line contains a feature and a value (split at ':')

• Ex 2 and 3: comma or tab separated files (.csv & .tsv)
  – Each line represents 1 record
  – Each line = values separated by commas or tabs
  – Position (or column) determines feature
  – Column labels can be used as a 1st line
  – Can be read by standard spreadsheet programs
    • LibreOffice Calc, Google Docs, Microsoft Excel, etc.
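Reading a comma separated file with the tools from this unit could look like the sketch below. The file contents and column labels are invented for illustration, and this is not the code in IO-examples.py. It relies only on readline, the for loop and the string method split.

```python
import os
import tempfile

# Create a small .csv file (made-up records) to read back.
path = os.path.join(tempfile.mkdtemp(), 'phone_list.csv')
with open(path, 'w') as outstream:
    outstream.write('name,phone\n')            # 1st line: column labels
    outstream.write('Ada,555-0100\n')
    outstream.write('Grace,555-0199\n')

records = []
with open(path, 'r') as instream:
    labels = instream.readline().rstrip('\n').split(',')
    for line in instream:
        values = line.rstrip('\n').split(',')  # position determines the feature
        records.append(values)

print(labels)
print(records)
```

For a .tsv file, split at '\t' instead of ','. Splitting this way breaks if a value itself contains a comma; the standard library's csv module handles such cases.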


IO-examples.py programs for reading/writing out records
• Programs for same records in different formats
  – read_in_phone_records('address_list.txt')
  – read_in_phone_database_file(inputfile) ## 'phone_list.tsv' or 'phone_list.csv'

• Program for adding manually entered records to a file in the .csv or .tsv format
  – add_phone_record_from_user_input()

• Some problems with this program:
  – Difficult to change entries & prevent duplicate/conflicting entries
  – Dictionaries are better in this way (a data structure covered next week)

Summary
• Python can use files for input and output (I/O)
  – This class will only deal with text file I/O

• It is possible to read in unstructured text (e.g., a story) and it is possible to output unstructured text
• It is also possible to use text files to store different sorts of structured input; .csv and .tsv files are standard examples
• We covered the use of Python stream objects, including reading from and writing to streams
• We also discussed the os module, including variables and functions that deal with files as expected by different operating systems.

Homework 8 – Due 25th Class • http://cs.nyu.edu/courses/spring16/CSCI-UA.0002-004/hw8.html
