Data Analysis with Python for Modern Lab: The modlab.py Package

Werner Boeglin

1 Introduction

Python is a very high-level scripting language that is largely independent of the operating system used. Python runs on Unix, Mac OS X and a whole array of Windows variations. It also allows one to include software libraries written in other languages such as C, C++, Fortran etc. In this way the speed of C and Fortran is available from within Python. The main web page on Python can be found at http://www.python.org. Do NOT download it from there, see below! In what follows I only provide a very basic introduction to programming with Python, to make it possible for you to use the software provided in the modlab package for all the data analysis tasks that you will encounter in Modern Lab I and II.

2 What is necessary to do data analysis using Python?

If you have never programmed before, this will be a bit different from using application programs such as EXCEL or WORD. You will mostly work in command mode, which means that you will not control what is happening with menus and the mouse but with commands and a text editor. Important: wherever you need to hit the return key I added (return).

2.1 Text Editor

A text editor is in a way like a simple word processor that can only produce plain ASCII text; an example in Windows would be NOTEPAD. While you can indeed write programs using NOTEPAD I do not recommend it, as there are much more powerful editors available (for free) which will help you with the programming syntax, e.g. using syntax highlighting. A very powerful editor which is also available for almost every operating system is EMACS. If you have a Mac it is the standard system editor. Also on Linux distributions such as Ubuntu, EMACS is already installed. For Windows you can obtain download and installation instructions from http://www.claremontmckenna.edu/math/alee/emacs/emacs.html (use the file emacs-22.3-bin-i386.zip). A much nicer version of EMACS for the Mac is Aquamacs, which you can download from http://aquamacs.org/. I strongly suggest you do this on your personal computers. These editors are also installed on the computers in the lab and I can help you with the installation.

2.2 Numpy, Scipy, Matplotlib, IPython

These are software packages for scientific computation (Numpy and Scipy) and visualization (Matplotlib) for Python. These packages contain much more than what you need for modern lab, but they will be very useful for your later studies. The easiest way to install Python and include them all in one step is to use the Enthought Python Distribution (EPD), which is free for academic use. You can download it from http://www.enthought.com/products/edudownload.php (the 32-bit version should be fine; if you have a problem let me know). In this way you have all that is necessary. Again, let me know if you have any problems.

3 Modern Lab Software Package: modlab.py

Now that you have installed the basic software tools you need the lab-specific software. You can get it from http://wanda.fiu.edu/teaching/downloads. You will find 2 files there:

PYTHON.zip       This is the package containing the modlab module.
setup_modlab.py  This is a Python script to set the environment variables.

If you use Windows download PYTHON.zip only. If you use Mac OS X or Linux download both files. When you unpack (unzip) PYTHON.zip you will get a directory (folder) named PYTHON. It is important that this directory is placed at the right location. Where depends on your operating system:

3.1 Windows

1. Download PYTHON.zip: e.g. double-click it and select open. Open a second window (double-click on My Computer and double-click Local Disk (C:)). Go back to the first window and drag the PYTHON folder onto the second window (the one that shows the content of the Local Disk C:).

2. Now you need to add the environment variable PYTHONPATH. Open the control panel (in classic view).

3. Select System.

4. Select Advanced system settings or similar. There you will find a button called Environment Variables ....

5. Select Environment Variables ... and a new window will open with two panels. The top one shows the user-defined variables.

6. Click on the New button just below it, which will open another window.

7. In the top field you enter the new environment variable name: PYTHONPATH. In the field below (Variable Value:) you enter C:\PYTHON\. Now click OK and close all the windows.

3.2 Mac OS X, Linux

When you download a file it is typically unpacked and saved in your Downloads folder. Note: all instructions here assume that you use the bash shell.

1. Look for a folder named PYTHON there. All you have to do then is move it to your home directory (indicated by a little house icon with your user name).

2. Download setup_modlab.py. Move this also to your home directory.

3. Open a terminal (on a Mac you find Terminal in the Utilities folder in the Applications folder). Then you type:

   python setup_modlab.py (return)

Now you should be ready to use the software. The next few sections describe the fundamentals of Python and how to use the various modules in the package.

4 Python Fundamentals

Python is case sensitive. This is different from other languages such as FORTRAN or BASIC. A very good Python shell to use is ipython. On Mac and Linux open a terminal window and type ipython -pylab; this will also activate the plotting system (there is also a pylab icon in EPD). On a Windows machine double-click the pylab icon. It will also open a command window with ipython running in it. The output should look similar to:

>ipython -pylab (return)
Enthought Python Distribution -- http://code.enthought.com
Python 2.5.4 |EPD_Py25 4.2.30201| (r254:67916, Apr 28 2009, 15:27:34)
Type "copyright", "credits" or "license" for more information.

IPython 0.9.1 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object'. ?object also works, ?? prints more.

In [1]:

You are now ready to enter commands. First you need to enter some data to work with. To quit ipython enter Ctrl-D (hit the D key while you press the Control key) on Mac or Linux or enter Ctrl-Z on Windows.
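Before working with real data you can check that the interactive shell behaves as expected with a trivial command or two (purely illustrative; any Python expression will do):

In [1]: 2 + 3
Out[1]: 5

In [2]: print 'hello modern lab'
hello modern lab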

5 Entering Data using Emacs

First you need some data to enter. Open EMACS and start entering data as follows. Emacs can be controlled either by commands or by the icons; there are many more commands than icons. The most important (keyboard) commands are:

Ctrl-x Ctrl-f   Open a new file.
Ctrl-x Ctrl-s   Save the current file.
Ctrl-x Ctrl-w   Save to a new file.
Ctrl-x Ctrl-c   Quit Emacs.

With the Windows version of EMACS you can create a new file by clicking on the leftmost icon. It will display a file dialog and ask for a file name. Just enter the name for your new file. If the file does not exist it will be created and you are ready to enter data. Enter first a few comments to remind you later what these data were. Remember that every comment line MUST start with #. Enter the data as shown below:

# this is a file with test data
# it is used to show how to enter data
# and how to work with them later
# assume they represent time and
# distance (dist) measurements
# and each has its own error
# add the header lines below,
# notice you can have an empty
# line in between

#! time[f,0]/ dist[f,1]/ d_err[f,2]/
0.0 3.5 1.1
1.0 4.6 1.2
2.0 8.0 1.
3.0 11.4 1.5
4.0 9.5 1.7
5.0 14.0 1.2
6.0 16.3 1.8
7.0 17.1 1.3
8.0 19.5 1.8
9.0 20.3 1.2
10.0 21.2 1.5
# that is the end of the data

Now save the file as my_exp_1.data. This is what one of you can do during the experiment and/or what you should do when you start analyzing the data. These data can now be used in ipython for further analysis.

6 Loading Modules in Python

Python software is organized in so-called modules. You can think of them like toolboxes. For instance if you need mathematical functions you could load the numpy (numerical Python) module. Since you need to work with data taken in modern lab all you really need to do is load the modlab module. There are several ways of loading modules and you can consult the Python documentation for detailed information. Here is just a short summary with an example. The simplest version is (here the numpy module is used):

import numpy

In order to access the sin function of the numpy module you need to also give the name of the module:

x = numpy.sin(5.)

Another possibility is to give numpy a new name after you have imported it:

import numpy as np
x = np.sin(5.)

or, if you do not want to have to enter the module name, you can do:

from numpy import *
x = sin(5.)

The danger here is that if you have defined sin in another module previously it is now redefined. For the modlab module I suggest you do:

from modlab import *

as the danger of redefining other functions is small.
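To make this redefinition danger concrete, here is a tiny illustrative example using numpy (hypothetical, nothing to do with modlab):

sin = 5.              # suppose you had already used the name sin for something else
from numpy import *   # this import silently replaces your sin with numpy's sin function
print sin(0.5)        # sin now refers to the numpy function again; your value 5. is gone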

7 The mydata object

This command loads a data file and is used as follows:

md = mydata('my_exp_1.data')

my_exp_1.data is the name of the data file that you just created and that resides in the directory in which you started ipython. mydata is a (member) function that makes your data available to Python. In this line of code mydata('my_exp_1.data') creates an object, in this case a datafile object, to which we assign the name md. Now you can start to 'play' with it. As an example you can find the column names of your data by doing the following:

md.show_keys() (return)

The () is important! To look at the values associated with each name you can do:

md.show_data('time') (return)
md.show_data('dist') (return)
md.show_data('d_err') (return)

Or you can display them together:

md.show_data('time:dist:d_err') (return)

This may be useful for showing them, but in order to work with the data you need to get them into ipython as arrays. This is achieved by using the function get_data:

t = get_data(md, 'time') (return)

It needs 2 arguments: the name of the datafile object (md) and the name of the variable (in this case 'time'). What you get is a so-called numpy array which I called t (for time). You can again display the array by just typing its name and hitting return:

t (return)

You should see something like (the number in the brackets will be different):

Out[34]: array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])

To access a single element of an array you enter:

t[1] (return)

Try out other values for the index! The integer (whole number) in the bracket is the index and in this case runs from 0 to 10, since there are 11 elements in the array. You can find the length of an array by typing:

len(t) (return)

And you should get back 11. You can now manipulate this array. For instance you can multiply it by a number. In the example you will multiply all values by 1000. to convert to milliseconds and assign the result to tms:

tms = t*1000 (return)
tms (return)

The output now looks like this:

Out[35]: array([     0.,   1000.,   2000.,   3000.,   4000.,   5000.,   6000.,
          7000.,   8000.,   9000.,  10000.])

Almost any mathematical operation is possible, check the documentation. Now get the distance data and the corresponding errors by doing:

dexp = get_data(md, 'dist') (return)
derr = get_data(md, 'd_err') (return)

Now all your data are in the form of arrays that can be used for computation. For instance you might want to know what percentage error each data point has. This can be done as follows:

p_err = derr/dexp * 100. (return)
p_err (return)

And the output should be:

Out[39]: array([ 31.42857143,  26.08695652,  12.5       ,  13.15789474,
        17.89473684,   8.57142857,  11.04294479,   7.60233918,
         9.23076923,   5.91133005,   7.0754717 ])

To have a bit nicer output you can use a for loop. First some information on loops. The simple for loop works as follows:

for D in dexp: (return)

The cursor will have moved to the right by about 4 spaces, the prompt has changed and the cursor is typically just below the D:

In [43]: for D in dexp:
   ....:     _

Now enter:

print 'distance = ', D (return)

The output should look like:

In [43]: for D in dexp:
   ....:     print 'distance = ', D
   ....:     _

The cursor is now just below the p of print. Now press (return) twice and the loop starts to run. Your output will look like (with the first part of the loop):

In [42]: for D in dexp:
   ....:     print "distance = ", D
   ....:
   ....:
distance =  3.5
distance =  4.6
distance =  8.0
distance =  11.4
distance =  9.5
distance =  14.0
distance =  16.3
distance =  17.1
distance =  19.5
distance =  20.3
distance =  21.2

What happened here: you created a for loop, where each element (one after another) of dexp gets assigned the name D. In the loop body (what comes below the for... statement and is indented) the current value D is printed together with the string 'distance = '. The loop ends where the indentation ends. Interactively this is accomplished by the two returns.

In order to print all values of t, dexp and derr in one for loop I use enumerate. First I check what enumerate does:

In [111]: for i,D in enumerate(dexp): (return)
   .....:     print 'i = ', i, 'D = ', D

and end the last line again with 2 returns. You should then see:

In [111]: for i,D in enumerate(dexp):
   .....:     print 'i = ', i, 'D = ', D
   .....:
   .....:
i =  0 D =  3.5
i =  1 D =  4.6
i =  2 D =  8.0
i =  3 D =  11.4
i =  4 D =  9.5
i =  5 D =  14.0
i =  6 D =  16.3
i =  7 D =  17.1
i =  8 D =  19.5
i =  9 D =  20.3
i =  10 D =  21.2

In this variation i contains the index of D in dexp. Since the corresponding values in t, dexp and derr all have the same index, I can print them all in one loop as follows:

In [112]: for i,D in enumerate(dexp):
   .....:     print 'time = ', t[i], 'dist = ', D, ' error = ', derr[i]

Again I close the loop with 2 returns. And the output now is:

time =  0.0  dist =  3.5   error =  1.1
time =  1.0  dist =  4.6   error =  1.2
time =  2.0  dist =  8.0   error =  1.0
time =  3.0  dist =  11.4  error =  1.5
time =  4.0  dist =  9.5   error =  1.7
time =  5.0  dist =  14.0  error =  1.2
time =  6.0  dist =  16.3  error =  1.8
time =  7.0  dist =  17.1  error =  1.3
time =  8.0  dist =  19.5  error =  1.8
time =  9.0  dist =  20.3  error =  1.2
time =  10.0 dist =  21.2  error =  1.5

Now you have learned how to get the data and how to loop over data. There are many more loop possibilities in Python that you can find in the documentation. For your needs in modern lab the for loop is enough.
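One more common variant (plain Python, shown only as an illustration): you can also loop over the indices directly with range, which produces the same output as the enumerate version above:

for i in range(len(dexp)):
    print 'time = ', t[i], 'dist = ', dexp[i], ' error = ', derr[i]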

8 Plotting the Data

There are several ways to plot data. Most important is to include error bars. The plotting used here is based on matplotlib. In modern lab you typically plot data or calculated values and error bars together with the result of a fit. If you have a counting experiment then histogramming is also useful. The various functions included in modlab are described in the following.

8.1 mydata Columns

Very often you would like to just quickly have a look at the data in your file. For this the function dplot_exp is very useful. You can plot the values of the various columns in your datafile versus each other. As an example I plot the time versus position of the data in your mydata object called md, including the errors. Remember time was called 'time', position was called 'dist' and its error was called 'd_err'. You can always get the names of all the variables in your mydata object by doing md.show_keys(). The plotting command then looks like the following:

dplot_exp(md, 'time', 'dist', 'd_err') (return)

If you do not have errors or do not want to plot them just leave them off. If you need a log scale enter:

pl.yscale('log') (return)

To switch back to a linear scale enter:

pl.yscale('linear')

You can save the plot as a pdf file by giving the command:

pl.savefig('my_plot.pdf')

Note that all the commands that adjust the plot start with pl. The reason is that the modlab module itself imports the matplotlib plotting module and gives it the name pl. You can change the labels as follows:

x-label: pl.xlabel('blablabla')
y-label: pl.ylabel('yakyakyak')
Title:   pl.title('A new Plot')

You can also change the axis limits as follows:

x-axis: pl.xlim( (xmin, xmax) )
y-axis: pl.ylim( (ymin, ymax) )

Note that the two parentheses are needed since the limits are entered as so-called tuples. These are basically a sequence of objects (numbers, strings etc.) which cannot be changed. If you enter the command pl.xlim() you will get back the current limits. If you want to know more about tuples have a look at the Python documentation.
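If tuples are new to you, this tiny example (plain Python, independent of modlab) shows the essential behaviour:

limits = (0., 10.)           # a tuple holding two numbers
print limits[0], limits[1]   # elements are read by index, like an array
# limits[0] = 5.             # this would fail: the elements of a tuple cannot be changed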

8.2 Arrays

Since I have the data also as (numpy) arrays (see above) you should also be able to plot them. The names of the arrays were t, dexp and derr. To plot those you do:

pl.clf() (return)
plot_exp(t, dexp, derr) (return)

The command pl.clf() clears the figure. You notice this time the title and the labels are None. You can add the labels later as shown above. The plot looks the same as before. You can leave off the errors if you leave off derr in the arguments to plot_exp. There are many so-called keyword arguments to further control the appearance. You can get more information by doing help(plot_exp) and even more if you do help(pl.plot), since this is the function that the modlab functions are based upon. You should also familiarize yourself with how keywords are used.
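Keyword arguments themselves are a plain Python feature, so a small standalone example may help before you use them with the plotting functions (the function greet below is purely hypothetical):

def greet(name, greeting='Hello', mark='!'):
    # greeting and mark are keyword arguments with default values
    print greeting + ', ' + name + mark

greet('Modern Lab')                      # uses the defaults: Hello, Modern Lab!
greet('Modern Lab', greeting='Welcome')  # overrides only the greeting keyword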

8.3 Histograms

Histograms are typically used in counting experiments, especially in nuclear physics experiments. One experiment you do very early on is to count the number of decays from a radioactive source during a certain amount of time. This experiment is then repeated many times. I have simulated the result of such an experiment and the data are stored in the file 'counts.data'. To work with these data I first load them in the same way as I did before:

mcd = mydata('counts.data') (return)

To see what data are defined in there you do (as before):

mcd.show_keys() (return)

And the output should be something like:

In [34]: mcd.show_keys()
['n', 'counts', 'indx']

Here n is the number of the measurement (the first one would be 0) and counts is the number of counts you obtained in this experiment. The additional variable indx is used internally and is not interesting to you (DO NOT TOUCH IT!). To get an idea what the data look like, I could just plot counts as a function of n. Since I want to plot the data that are part of a mydata object I will use the dplot_exp function:

dplot_exp(mcd, 'n', 'counts') (return)

You will notice that the data scatter around a value of 150. To analyze them further we need to convert them to numpy arrays:

ne = get_data(mcd, 'n') (return)
counts = get_data(mcd, 'counts') (return)

Now let's calculate the average. To do this you need a for loop:

N = 0 (return)
sum = 0. (return)
for cc in counts: (return)
    N += 1 (return)
    sum += cc (return)

Close the loop with two returns. Now you can get the average by doing:

N_av = sum/N

This was a little programming. First I created a variable called N to count the number of terms in the sum and, in the first statement (N = 0), I set it to 0. Then I did the same for another variable, called sum. sum should contain the running sum of the values, which is carried out in a for loop (for cc in counts:). In the loop, cc contains the current value of counts. There are two strange looking lines. The first, N += 1, means add 1 to the current value of N, or increment N by one. The second does the same for the variable sum, but instead of adding 1 we add the current value of the counts stored in cc. As a consequence, at the end of the loop N contains the number of measurements and sum the total sum of the values. To calculate the average all I have to do is take the ratio of the two numbers and store it in the variable N_av. This was done in the last line (N_av = sum/N). To look at its value hit:

N_av (return)

and the output should be:

In [61]: N_av
Out[61]: 150.41499999999999

This was a small programming example. Since the average is such an important quantity this procedure has been built into numpy:

np.average( counts ) (return)

will give you the same result. There are many useful functions defined for arrays. Here are just a few useful ones, using counts as an example:

counts.sum()   returns the sum of all elements
counts.mean()  returns the mean (average) of all elements
counts.var()   returns the variance of all elements
counts.min()   returns the smallest value of the array
counts.max()   returns the largest value of the array

Try them out! When you create a histogram you want to count how often a certain value occurs. To do this there is a special histogram object in modlab. The simplest way to use it is as follows:

hh = myhisto(counts) (return)

It looks as if nothing has happened, but all the work has already been completed. You can now make a plot of the histogram by doing:

hh.plot() (return)

And you will get a (blue) graph that represents the frequency with which certain counts values occur. What you see is a relatively coarse representation of this distribution since, in its simplest form, myhisto automatically determines the range of values that you provided. This range is divided into 10 equally spaced regions, called bins. It then goes through all the data in the array counts and checks in which region (bin) each value falls. To each bin belongs a counter that is incremented each time a value falls within the associated region. The value of each counter is called the bin content. The graph now shows the value of the bin content as a function of the counts value. The width of each step corresponds to the width of a bin.

You have much more control over how your histogram is set up. Let's re-define it, and this time give it a range and also the number of bins. From the first graph we see that most values of counts lie between 120 and 180. If we want to get a count of how often a specific value occurs we need to make the bin width equal to one, as follows:

hh = myhisto(counts, range = (120., 180.), bins = 60) (return)

This is another example of using keywords to control what an object/function is doing. The range = (120., 180.) determines the range of values accepted, where range is the keyword. The other statement, bins = 60, sets the number of bins to 60. In that way the width of a bin is 1. Now plot the new histogram again.

clf() (return)
hh.plot()

(Remember the first line clf() clears the figure.) Now you have a much better representation of the histogram. The central values, the content (number of occurrences) and the associated errors of each bin can be accessed as follows:

hh.bin_center (return)

produces the output:

In [14]: hh.bin_center
Out[14]:
array([ 120.5,  121.5,  122.5,  123.5,  124.5,  125.5,  126.5,  127.5,
        128.5,  129.5,  130.5,  131.5,  132.5,  133.5,  134.5,  135.5,
        136.5,  137.5,  138.5,  139.5,  140.5,  141.5,  142.5,  143.5,
        144.5,  145.5,  146.5,  147.5,  148.5,  149.5,  150.5,  151.5,
        152.5,  153.5,  154.5,  155.5,  156.5,  157.5,  158.5,  159.5,
        160.5,  161.5,  162.5,  163.5,  164.5,  165.5,  166.5,  167.5,
        168.5,  169.5,  170.5,  171.5,  172.5,  173.5,  174.5,  175.5,
        176.5,  177.5,  178.5,  179.5])

Below is a summary of the most important information stored in myhisto. I use hh as the name for the histogram here, but you can use any other variable name when you create it:

hh.bin_center   Central value of each bin.
hh.bin_content  Content of each bin (the value of the counter).
hh.bin_error    Error of the content of each bin.
hh.bin_width    Width of each bin.
hh.title        Histogram title. You can store a short description of the data (used in plotting).
hh.xlabel       Description of the bin center values (used in plotting).
hh.ylabel       Description of the bin content (used in plotting).
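Since hh.title, hh.xlabel and hh.ylabel are described as stored text used in plotting, you can presumably just assign strings to them before plotting (a sketch under that assumption):

hh.title  = 'Simulated decay experiment'   # short description of the data
hh.xlabel = 'counts per time interval'     # meaning of the bin center values
hh.ylabel = 'frequency'                    # meaning of the bin content
hh.plot()                                  # re-draw the histogram with the new labels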

And here are the most important (member) functions:

myhisto  Create a histogram. There are several ways to create one depending on the arguments to myhisto; here are the main possibilities:

  • h = myhisto( data_array, bins = 20, range = (10., 30.) ), where data_array is a numpy array containing the data to be histogrammed, bins is the number of bins (in this case 20) and range is the range of values to be sorted into bins (in this example between 10 and 30).

  • h = myhisto(file = 'histo.data') reads the histogram data from the file 'histo.data' and creates a new histogram. The 'histo.data' file should have been created previously by a myhisto.save('histo.data') command (see below).

myhisto.plot(ymin = 0., filled = True, color = 'b')  Plot the histogram. There are several keywords that can be set to control the plot. You can assign a minimal value ymin for the smallest content to be drawn (default is 0). With filled = False the area below the curve is not filled. color = 'b' draws the line in blue; other colors are red (r), green (g), magenta (m), yellow (y) or any other matplotlib color.

myhisto.plot_exp  Plot the bin content and the errors as data points with error bars.

myhisto.clear()  Clear the content of the histogram.

myhisto.fill(data_array, add = False)  Fill the histogram using the numpy array data_array. If add = True then continue incrementing the bins, otherwise start fresh. This function is mostly used to add more data (add = True).

myhisto.save(filename = 'histo.data')  Save the histogram to the file histo.data.

myhisto.sum(xmin=10., xmax=20.)  Add the content of the histogram for the values ranging between (and including) xmin = 10 and xmax = 20. Returns a tuple with the value and the error. If no arguments are given it returns the total of all bins.

9 Data Fitting

Very frequently one wants to extract an interesting quantity from the experimental data. I assume you know the functional relationship between the different quantities but you do not know the parameters. As an example I take the example data: distance as a function of time. If the object was moving with constant velocity the distance should be a linear function of time, with the slope being the velocity and the y-axis offset the position at t = 0. To do this I want to perform a least-squares fit of a line to the data.
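As a reminder (standard least squares, not specific to modlab): the fit finds the parameters that minimize the chi-square of the data with respect to the fit function f,

    chisq = sum over i of ( d_i - f(t_i) )**2 / err_i**2

where d_i are the measured distances, err_i their errors and f(t_i) the value of the fitted function at time t_i. The quantity chisq/dof printed by the fit routines below is this chi-square divided by the number of degrees of freedom (number of data points minus number of fit parameters); values around 1 indicate that the fit describes the data within the quoted errors.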

9.1 Fitting a Line

In order to fit a line to your data you use the linefit function of modlab. This function needs the data as arrays, so I will use the array t for the time, dexp for the distance and derr for the errors in the distance. Do the following:

fit = linefit(t, dexp, derr) (return)

You should see the output:

In [4]: fit = linefit(t, dexp, derr)
chisq/dof = 0.661037211585
offset = 3.88359753039 +/- 0.540669925884
slope  = 1.86063659417 +/- 0.098317559634

Remember the slope is the velocity, which in this case would be 1.86 with an uncertainty of ±0.1, and the position at t = 0 is 3.88 ± 0.54. The function returns much more information that is stored in the variable fit. To see it just enter fit.res (return) and you will get:

In [5]: fit.res
Out[5]:
{'chisquare': 5.9493349042607102,
 'cov. matrix': array([[ 0.44222014, -0.06453747],
       [-0.06453747,  0.01462299]]),
 'deg. of freedom': 9,
 'difference': array([ 0.38359753,  1.14423412, -0.39512928, -1.93449269,  1.82614391,
       -0.8132195 , -1.2525829 , -0.19194631, -0.73130972,  0.32932688,
        1.28996347]),
 'fitted values': array([  3.88359753,   5.74423412,   7.60487072,   9.46550731,
        11.32614391,  13.1867805 ,  15.0474171 ,  16.90805369,
        18.76869028,  20.62932688,  22.48996347]),
 'parameters': array([ 3.88359753,  1.86063659]),
 'red. chisquare': 0.66103721158452333,
 'xpl': array([  0.,   5.,  10.]),
 'ypl': array([  3.88359753,  13.1867805 ,  22.48996347])}

fit.res is a Python dictionary, a kind of array that uses keywords (or keys) instead of indices. To see the keys only you can do:

fit.res.keys() (return)

and you will get:

In [6]: fit.res.keys()
Out[6]:
['red. chisquare', 'parameters', 'fitted values', 'deg. of freedom', 'xpl',
 'cov. matrix', 'chisquare', 'difference', 'ypl']

To get the number of degrees of freedom in this fit, you would type:

fit.res['deg. of freedom'] (return)

And you would get back the number 9, try it! A very important quantity is the covariance matrix for error propagation. It is also there and you can assign it to a variable named e.g. cov and display it by doing:

cov = fit.res['cov. matrix'] (return)
cov (return)

cov is a 2d array that can be used in error propagation later. It is also interesting to display the fitted line together with the data. First I draw the data using plot_exp with the data arrays:

pl.clf()
plot_exp(t, dexp, derr) (return)

Then I draw the fitted line. The data are stored in fit.res with the names (keys) 'xpl' and 'ypl'. For this I use the plotting function plot_line as follows:

plot_line( fit.res['xpl'], fit.res['ypl'] )

And you will see the data together with the fitted line. The linear fit is the one you will be mostly using in the analysis of your experiments. If you want to evaluate the fitted function (in this case a line) for any value x, you use it as follows:

y = fit.line(x) (return)

y now contains the value of the fit function for the argument x.
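As an aside on how the covariance matrix enters error propagation (a standard result, not specific to modlab): for a straight line y(x) = offset + slope*x, and assuming the parameters in cov are ordered (offset, slope) as in fit.res['parameters'], the variance of the fitted value at a point x is

    var_y = cov[0,0] + 2*x*cov[0,1] + x**2 * cov[1,1]

A minimal sketch of this in code, using the cov array extracted above:

x = 5.0
var_y = cov[0,0] + 2.*x*cov[0,1] + x*x*cov[1,1]
sig_y = np.sqrt(var_y)                          # uncertainty of the fitted line at x = 5
print 'y(5) = ', fit.line(x), ' +/- ', sig_y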

9.2 Fitting a Polynomial

Maybe your data are not really for an object moving with constant velocity. If there is for instance an acceleration present, the position would vary quadratically with time and the coefficient of the quadratic term would be proportional to the acceleration. You can try to determine this by fitting a polynomial of 2nd degree to the data using the function polyfit:

fitp = polyfit(t, dexp, derr, 2) (return)

You could also have done:

fitp = polyfit(t, dexp, yerr=derr, order=2) (return)

yerr and order are keyword arguments. The number 2 indicates the polynomial order. The result looks like:

In [9]: fitp = polyfit(t, dexp, derr, 2)
chisq/dof = 0.519062589345
parameter [ 0 ] =  3.12859470397  +/- 0.627859702779
parameter [ 1 ] =  2.45054883094  +/- 0.3288132167
parameter [ 2 ] = -0.061125765893 +/- 0.0328533890728

As you can see there could indeed be a small negative acceleration. Also note that the fit results are stored here in a dictionary named fitp.res. In that way I have preserved the results of the linear fit. You can plot the result of the quadratic fit by doing:

plot_line( fitp.res['xpl'], fitp.res['ypl'] )

which looks exactly like what you did for the linear fit; all that has changed is the name of the dictionary (here it is fitp.res, before it was fit.res). As for the line, the fitted polynomial can be evaluated by:

y = fitp.poly(x) (return)

y now contains the value of the fit function for the argument x. Note that in this case y might be a 1-element array. You do not need to worry about this.
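To make the remark about the acceleration concrete: for constant acceleration the distance is dist = x0 + v*t + 0.5*a*t**2, so the acceleration is twice the quadratic coefficient. A small sketch, assuming polyfit stores its parameters under the same 'parameters' key (and in the printed order) as linefit does:

c2 = fitp.res['parameters'][2]   # quadratic coefficient (assumed key and ordering)
a  = 2.*c2                       # dist = x0 + v*t + 0.5*a*t**2  =>  a = 2*c2
print 'acceleration = ', a       # roughly -0.12 in units of distance/time**2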

9.3 Histogram Fitting

The most frequently used fitting task in histograms is fitting a Gaussian to a peak. This is useful to determine the location of a peak and other information on the peak shape such as width and height. Fitting a Gaussian requires a non-linear fitting algorithm and is therefore a bit harder to use. The myhisto object has a built-in fitting function. You can fit a Gaussian on top of a quadratic background. You can also just fit a background or just a Gaussian. All this is controlled by which parameters you want to fit. Another important part of non-linear fitting is that you need to provide reasonable guesses for the fit parameters. The parameters in myhisto are not just numbers but objects with their own properties and functions. The full fit function is:

f(x) = b0 + b1·x + b2·x² + A·exp( −(x − µ)² / σ² )

where

b0    = myhisto.b0
b1    = myhisto.b1
b2    = myhisto.b2
A     = myhisto.A
µ     = myhisto.mean
σ     = myhisto.sigma

You can access a parameter value (e.g. for the parameter A) in two ways (I assume you have created a histogram with the name h): h.A() or h.A.value. A parameter also has a name (h.A.name) and an error (h.A.err). You can get all 3 values returned as a tuple by the function h.A.get(). To set up a fit you should set the initial parameters. For a Gaussian the approximate position is very important. You can guess this by looking at the plot:

h.plot() (return)

From the plot you can see that the location is around 150. For now set the position of the peak to 140, the height to 2 and sigma to 10:

h.mean.set(140) (return)
h.A.set(2) (return)
h.sigma.set(10) (return)

Then let the fit run:

h.fit() (return)

The output should now be:

----------------------------------------------------------------------
Fit results:
A     = 5.55018222684 +/- 0.569280666662
mean  = 151.109462102 +/- 1.0413638285
sigma = 16.4156848452 +/- 1.30318794874
Chi square  = 43.4234928935
Chi sq./DoF = 0.761815664798
----------------------------------------------------------------------

You can plot the curve using:

h.plot_fit()

The final fit parameters are stored in the same variables (objects) as the initial parameters.
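For example, after the fit you can read the fitted peak position and its uncertainty back from the parameter objects described above (a sketch using the accessors named in the text):

peak_pos = h.mean.value            # fitted peak position (same as h.mean())
peak_err = h.mean.err              # its uncertainty
print 'peak at ', peak_pos, ' +/- ', peak_err
print h.A.get()                    # all three pieces of information about A (name, value, error) as a tuple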

9.4 Working with MCA Data

When you do experiments that use a Multi Channel Analyzer (MCA), such as the Maestro program of Ortec (e.g. in the Compton scattering experiment), you save the spectra as ASCII (.SPE) files. You can load these files into Python as histograms as follows (I assume the spectrum names are co_60_cal.SPE and cs_137_cal.SPE):

co = get_spectrum('data/co_60_cal.SPE') (return)
cs = get_spectrum('data/cs_137_cal.SPE') (return)

This function returns a histogram, which I named co in this example. You can plot it using the standard function (co.plot()). Often you want to focus on a smaller region of the spectrum. This can be done by setting a window with the function set_window as follows:

co.set_window(200., 310.) (return)
co.plot() (return)

Now you see only the region between channels 200 and 310. The window can be cleared with the command clear_window(). You can perform all the functions described above for histograms. Fitting a Gaussian is especially useful in order to precisely determine the location of a peak. This is useful for calibrating the spectra that you have taken and is described below.

9.4.1 Energy Calibration

An energy calibration is necessary in order to convert the channel number in your spectrum (histogram) to energy. For this you need to determine the parameters of a function (linear or polynomial) using calibration spectra where the energies of the various peaks are known. In most cases one can assume a linear relationship between channel number and energy. The procedure is as follows:

• For each known peak fit a Gaussian to the top part of the peak. This determines the peak position.

• Create an array containing the fitted peak positions and call it pos:

  pos = np.array([ 145.1, 244.1, 274.7]) (return)

• Create an array containing the corresponding known energies of the peaks and call this array energy. The energies you can get from the literature or the source box. Make sure that the elements in the two arrays correspond to each other and that the arrays are numpy arrays. Below is an example for the energies in keV:

  energy = np.array([ 661.7, 1173.2, 1332.5]) (return)

• Perform a linear fit as follows:

  cal = linefit(pos, energy) (return)

  You now have an object called cal that contains the calibration fit and the function that converts the channel number to an energy.

• Load the spectrum again (preferably with a different name) and apply the new calibration:

  co_cal = get_spectrum('data/co_60_cal.SPE', \
                        calibration = cal.line) (return)
  cs_cal = get_spectrum('data/cs_137_cal.SPE', \
                        calibration = cal.line) (return)

  (The backslash tells Python that the next line belongs to the first one.) The two new histograms co_cal and cs_cal now have calibrated x-axes. Also note that the statement calibration = cal.line tells the function get_spectrum to use the function cal.line to convert channels to energy. If you leave this statement off no calibration will be applied.
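Since cal is the same kind of fit object as in the line-fitting section, you can also use it directly to convert a single channel number to an energy (a small sketch using the cal object from the linefit call above):

e_250 = cal.line(250.)    # energy (in keV) corresponding to channel 250
print 'channel 250 corresponds to ', e_250, ' keV'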

9.4.2 Background Subtraction

From the Compton experiment you should always have two spectra: one taken with the target and one without. We call the spectrum taken without the target the background spectrum. In order to determine the effect of the target you need to subtract the contribution of the background from the spectrum with the target. First you load the two spectra and apply the energy calibration:

cs_tar = get_spectrum('data/cs_45deg_target.SPE', \
                      calibration = cal.line) (return)
cs_bck = get_spectrum('data/cs_45deg_empty.SPE', \
                      calibration = cal.line) (return)

Here I assumed that data/cs_45deg_target.SPE is the file containing the data with the target and data/cs_45deg_empty.SPE is the one for the background. Now do:

cs_sig = cs_tar - cs_bck (return)
clf() (return)
cs_sig.plot() (return)

cs_sig contains the signal due to the target. This is the subtracted spectrum. clf() clears the plotting window and cs_sig.plot() draws the result. Now you are ready to determine the energy of the scattered photons, again by fitting the peak position using a Gaussian as described in the histogram section.

9.5 General Non-Linear Fitting

Will be covered another time

10 Writing a Python Script

You have seen the typical steps in an analysis: reading the data, manipulating arrays, fitting and plotting. So that you do not always have to type in everything, you can type all the commands into a simple text file using EMACS or another editor, save it with the extension .py (e.g. as analysis.py) and then execute it with the command run or %run. A simple script for a few of the things I did in this short introduction looks like:

# import all from modlab
from modlab import *

# read the data
md = mydata('my_exp_1.data')

# get the data into arrays
t = get_data(md, 'time')
dexp = get_data(md, 'dist')
derr = get_data(md, 'd_err')

# print the values in a for loop
for i, D in enumerate(dexp):
    # indentation is important
    print 'time = ', t[i], 'distance = ', D, 'error = ', derr[i]
# end of the loop = end of indentation

# plot the data
plot_exp(t, dexp, derr)

# fit a line
fit = linefit(t, dexp, derr)

# draw the fit result
plot_line(fit.res['xpl'], fit.res['ypl'])

# add the labels
pl.xlabel('t (sec)')
pl.ylabel('Distance (m)')
pl.title('Distance vs time exp.')

# save the plot
pl.savefig('my_plot.pdf')

# assign the covariance matrix
cov = fit.res['cov. matrix']

# print the covariance matrix:
print "The covariance matrix is : ", cov

You would run it by doing:

run analysis.py

If the plot does not show after a short while you need to enter pl.show().
