Why use spreadsheets?

Statistics by spreadsheet 1 Statistics by spreadsheet Colin Tredoux After studying this tutorial you should be able to: * Explain the central design...

Author: Beverly Robinson

36 downloads 0 Views 1MB Size

Report

Download PDF

Recommend Documents

Why Spreadsheets are Inadequate for Uncertainty Analysis

Citations why use them?

TPU. Why use KRYSTALFLEX?

Spreadsheets in Financial Departments:

PC*MILER Spreadsheets

Appendix B. Spreadsheets 101

Computational Spreadsheets: An Introduction

Tableur: Handwritten Spreadsheets

PRINTING LARGE EXCEL SPREADSHEETS

WHY USE SYNERGY BUSINESS EVENTS?

Why Americans use social media

Why Use Antifoams and Defoamers?

16. Why we use lasers

Location problems solving by spreadsheets

COMI 1420 Introduction to Spreadsheets

Visual Specifications of Correct Spreadsheets

PRINCIPLES AND INDICATIONS FOR USE. Why use Diphoterine or Hexafluorine?

Why Use OneNote for Your Lesson Plans?

Why Do People Use Social Networks?

Why and how to use arbitrary precision

Project planning template. Why use this tool?

Why use Project-based Learning? PBL-FAQ

INTRODUCTION: WHY YOU SHOULD USE A SITEMAP

About Rubrics and why to use them

Statistics by spreadsheet

1

Statistics by spreadsheet Colin Tredoux After studying this tutorial you should be able to: * Explain the central design concept of spreadsheet application software. * Set the data up for a simple statistical analysis via spreadsheet. * Calculate simple descriptive statistics for a set of data with Microsoft Excel. * Construct simple graphical displays in Microsoft Excel * Identify some of the advantages and limitations of spreadsheet software.

Why use spreadsheets? One of the problems with conventional statistics packages like STATISTICA, SPSS, and SAS, is that the calculations and computations, which constitute a particular statistical analysis, are opaque to the user. That is, the computer does the computations, and 'spits' the answer out at you, like a large virtual 'black box', whose inner workings are mysterious and all-powerful. For example, we collect the data shown in Table 1, which presents depression and income scores for 10 CEO's of large South African companies (0 = no depression, 100 = a lot of depression). Manager 1 2 3 4 5

Income (R 1000's) 350 410 470 199 987

Depression 60 70 80 30 90

Manager 6 7 8 9 10

Income (R 1000's) 1320 560 210 230 760

Depression 95 50 40 30 50

Table 1 Depression and income in a sample of Chief Executive Officers

If we wish to calculate the correlation between depression and income of the CEO's, it will be comparatively easy in any of the standard statistical packages. Thus, in STATISTICA, we simply open the appropriate dialog box and check one or two boxes, and the output is generated (see Figure 1).

Figure 1 Calculating the correlation coefficient with Statistica

The problem with this comparatively effortless procedure is that we are far removed from the calculations going on behind the dialog boxes. This is quite acceptable if we are able to accurately

Tutorial x : Statistics by spreadsheet

2

judge the results computed and displayed by the program, and if we have a sound understanding of the statistical analysis we have requested. If we don't, we will have to accept the results at face value. The potential for mistakes, and for bad statistical analysis, are obvious. For this reason, many statisticians are sceptical of all-purpose statistical packages, particularly when used in a libertine manner by researchers. One attractive solution to this problem is to use a spreadsheet program for the statistical analysis. These programs are widely available (Microsoft Excel, Quattro Pro, Lotus 1-2-3 are virtually cyberhold names) at bargain prices (usually less than 10% of statistical software packages). 'Spreadsheet' computer programs display data and allow calculations in a form very close to what one does when doing calculations by hand. There is an uncanny resemblance in particular between using a spreadsheet and using the old-fashioned 'grid' or arithmetic paper you probably used at school. What makes spreadsheets really useful, and highly popular amongst statisticians, accountants, and the like, is the combination of traditional layout and automation. Data can be entered in a form that allows easy visual inspection and understanding, and formulas, functions and calculations can be applied and updated automatically as new data is entered. The way in which calculations are set out (i.e. in a completely apparent and manifest manner), and the clever automation built into the programs make them highly attractive for researchers who want to understand and directly oversee their statistical analyses. Box 1 Commercially available spreadsheets There are a great many spreadsheet programs available, ranging from downloadable freeware to specialist, proprietary software. If you walk into your local computer store, though, you are likely to find the brands discussed below. We will not make recommendations here about which brand to buy, since they are very similar to each other, and comparative advantages are small and short-lived in the competitive and changing world of software applications. Each will do the particular job under consideration - entering and analysing research data - to about the same level of competence. Microsoft Excel

Lotus -1-2-3

Quattro Pro

This program is bundled with the Microsoft Office suite. At the time of writing, the suite sells for approximately R 3000.00, but there is usually a substantial education discount if purchased through a university or technikon. It is presently the leading spreadsheet in terms of sales, and we will base our discussion in this tutorial on it.

Right through the 1980's, Lotus 1-2-3 dominated the spreadsheet market, and turned its founding organization into a billion dollar giant. However, times have changed, and the program exists now as part of the Lotus Smartsuite office suite. The suite sells for about R 2500.00

Quattro Pro was the flagship spreadsheet program of the innovative software company Turbo Borland. It is now incorporated into the Corel WordPerfect Office suite, which is owned by the Corel organization. This suite sells for about R 2500.00.

http://www.microsoft.com/office/excel/d efault.htm

http://www.lotus.com/home.nsf/tabs/ lotus123/

http://www.corel.com/products/word perfect/cqp8/

>

Tutorial x : Statistics by spreadsheet

3

Doing summary statistical analysis on spreadsheets As we indicated earlier, spreadsheet software is generally not a replacement for serious statistical software, particularly when the analytic problem at hand is complex. It is best suited for summary statistical analysis, and simple inferential tests, where its ease of use is remarkable.

The basic structure of a spreadsheet A spreadsheet is a tabular layout of cells, which have unique addresses. The addresses are determined as the intersection of rows (typically numbered 1, 2, 3 … n) and columns (typically labelled A, B, C …AA, AB… etc). The data in Table 1 is shown in such a layout in Figure 2.

Figure 2 A typical spreadsheet layout of the data in Table 1

Figure 3 Formula view of spreadsheet of Figure 2

Cells can contain numeric or alphabetic entries. These entries, in turn, can either be values, or they can be formulas/functions. Thus, cell B2 in Figure 2 has a value entry of 350. In cell B12 however, we have placed a formula, which calculates the average of the cells in B2..B11. We can choose to see the result displayed on the screen, or the formula itself (Figure 3). In Excel, selecting ‘Options’ from the ‘Tools’ menu, and choosing Formulas ‘under’ Window options can do this. Notice also that Excel has a special formula bar, where the formula will be displayed for the selected cell (see Figure 4). If you place the cursor in the formula bar, you can edit the formula there. Alternatively, pressing the F2 key while the cursor is on top of a formula cell will reveal the formula and make it ‘editable’ (Figure 5).

Figure 4 The formula bar reflects formula cell entries

Figure 5 Using F2 to reveal a cell formula

Tutorial x : Statistics by spreadsheet

4

The workspace is really a kind of giant scratchpad, with the added advantage that it has possible operations built into each cell in the space. Thus, if we want to do arithmetic on any of the numbers in the spreadsheet, we simply type in the ‘=’ sign, and then a calculation – e.g. to add the first two percentages in column C, we place the cursor on any destination cell of our choice (say D3), and type =C2+C3. Similarly, to divide these numbers by 2 to get an average, we can edit the formula to be =(C2+C3)/2, or we can enter the formula =D4/2 into another cell (see Figure 5). The arithmetic operators are subject to the same rules of precedence as an ordinary piece of grid-paper! This is the essential idea behind a spreadsheet. Although all commercial computer spreadsheets augment the basic idea with a plethora of features, automation tools, and so on, spreadsheets derive their identity from the basic cell-addressable configuration, and operations performed on entities that have addresses in this configuration. >

Entering data The first task in analysing and summarising a set of data with a computer package is to enter it into a data file. In the case of spreadsheets, there are many options available to us. Firstly, we can simply enter the data straight into the spreadsheet. Microsoft Excel has a great many features which allow the user to make the entry of data easier, including 'freezing' particular rows or columns (so that variable names and case numbers are always displayed), data forms (a dialog box that accepts data, and does some error checking to ensure that only valid data is entered e.g. a date variable accepts only dates), auto-completion (entries such as codes (e.g. 'male') are completed as soon the user types the first unique character of the code). Half the fun in learning computer programs is to find out what is available, and to experiment.

Tutorial x : Statistics by spreadsheet

5

Exercise 1 Table 2 shows some data tabulating Gross Domestic Product of a number of countries, and the relative proportions they spend on Education, Health Defense and Interest payments. Enter this data set into an Excel spreadsheet, and complete 1 – 4 below. High-Income

Middle-Income

Low-Income

Total Total Total GDP expenditure GDP expenditure GDP expenditure Defense

4.4

14.2

2.0

8.2

2.8

15.3

Education

1.6

5.0

2.8

11.4

1.0

5.4

Health

3.7

12.0

1.2

5.0

0.4

2.2

Economic services

2.8

8.9

4.8

19.6

4.6

25.0

Interest

3.4

11.1

5.0

20.2

3.0

16.3

Source: IMF, Government Finance Statistics Yearbook, 1993. Entries are percentages.

Table 2 Comparative economic data for 89 countries, grouped into income categories 1 2 3 4

Format the Income titles so that all are in boldface Format the next line (GDP, Total expenditure) so that all entries are in italics Add two columns to the right of the low-income group Make each added column reflect the average ‘GDP’ or average ‘Total expenditure’ for that row. Hint: type an expression of the form ‘(B1+D1+E1)/3’ in the cell where you want the average to appear.

Most spreadsheet programs recognise and translate between a great many different formats. You can enter the data in a word processing program, a database program, a different spreadsheet program, or even a statistical package, and the spreadsheet program is almost certain to be able to translate the data and turn it into a spreadsheet. For example, to translate a data file that was entered in your word processor, and saved as a space-delimited text file (i.e. data points are separated by spaces), simply use the File Open command, and load the file. Follow the Import Wizard instructions, as shown in Figure 6.

Figure 6 Text import wizard in Excel

Tutorial x : Statistics by spreadsheet

6

>

Summarising data Spreadsheets can be treated at the simplest level as automated scratchpads, so if you are happier taking full control of the calculations, you can use the spreadsheet to perform the most ordinary of arithmetic calculations. Alternatively, you can use the large number of built-in formulae to make the task even easier. Imagine that we have some data on the relationship between the rate of crowd disturbances in South Africa in major cities, and the ambient temperature of those major cities at the times when the disturbances occurred. We set the data out in a Microsoft Excel worksheet, much as it would be in SPSS - cases in rows, and variables in columns, with suitable headings (see Figure 7). Blank columns and rows are left for calculation results, just as you would do if you were computing by hand.

Figure 7 Spreadsheet of ambient temperature data (see Tyson & Turnbull, 1990)

Figure 8 Corrected spreadsheet of ambient temperature data (see Tyson & Turnbull, 1990)

We now want to know the average, or mean, rate of disturbance for the data, and the associated standard deviation. We can calculate this in a number of ways, by entering formulas into the cells immediately below the data, and the results will be displayed immediately. Thus, to calculate the mean temperature, we type =average(b2:b9) into the cell B11. Notice the basic structure of this formula (or function): =function(startingcell:endingcell) where ‘=’ denotes the start of an evaluative expression, the brackets encase what is to be operated on, and the starting and ending cells indicate the range of contiguous cells to be operated on. (The colon separator is important!)

If you can remember this basic structure, entering formulas in Excel is a cinch. If we now discover that the data in cell B8 is incorrect, we can correct this in the spreadsheet. The summary information is immediately updated - we do not need to recalculate, as we would have to by hand, or by calculator. It is in this sense that spreadsheets are said to be dynamic, and it is what makes them very popular with computer users.

Tutorial x : Statistics by spreadsheet

7

Now you can enter formulas to calculate all the information you need to determine the correlation coefficient between rate of crowd disturbance and ambient temperature. These are shown in Figure 9 below, but note that you will not need to type each of the formulas - spreadsheets typically have excellent facilities for automating tasks like this. You simply copy the formula to the position, you want, and the spreadsheet cleverly uses its relative cell referencing to correctly complete the formula.

Figure 9 Spreadsheet formula for the correlation Figure 10 Spreadsheet result display for the calculation of temperature data correlation calculation of temperature data

This spreadsheet gives you all the information you need to calculate the means, standard deviations, covariances, and correlation coefficient - and you retain the ability to inspect the results visually, with all the interim calculations clearly exhibited. >

Using functions Virtually all spreadsheets offer a set of built-in formulae and functions frequently used in statistics, mathematics, finance, and a number of other disciplines. In Microsoft Excel, for example, the elementary statistical functions shown in Table 3 are available (among others):

FUNCTION AVERAGE CONFIDENCE COUNT DEVSQ INTERCEPT MAX MEDIAN MIN MODE

DESCRIPTION Returns the average of its arguments Returns the two-sided confidence interval for a population mean Counts how many numbers are in the list of arguments Returns the sum of squares of deviations Returns the intercept of the linear regression line Returns the maximum value in a list of arguments Returns the median of the given numbers Returns the minimum value in a list of arguments Returns the most common value in a data set

Tutorial x : Statistics by spreadsheet PEARSON PERCENTILE PROB QUARTILE RANK RSQ SKEW SLOPE STANDARDIZE STDEV STDEVP TTEST VAR ZTEST

8

Returns the Pearson product moment correlation coefficient Returns the k-th percentile of values in a range Returns the probability that values in a range are between two limits Returns the quartile of a data set Returns the rank of a number in a list of numbers Returns the square of the Pearson product moment correlation coefficient Returns the skewness of a distribution Returns the slope of the linear regression line Returns a normalized value Estimates standard deviation based on a sample Calculates standard deviation based on the entire population Returns the probability associated with a Student's t-test Estimates variance based on a sample Returns the two-tailed P-value of a z-test Note: for details of usage and syntax, refer to the online help in Excel.

Table 3 Some elementary statistical functions available in Microsoft Excel. These functions are used as we illustrated in the discussion regarding Figure 9. In general, they take an argument, and they return a result. Thus, to calculate the average of a set of cells we enter (in a new cell) the function, with its required argument - =AVERAGE (b2:b10) in B12, and this returns the average of the values contained in the cells b2 to b10. Similarly, to calculate the standard deviation of the set of values in the cells b2:b11, we enter =STDEV(b2:b11) in b13, and the (sample) standard deviation is returned. The wonderful thing about a cell-addressable structure like a spreadsheet is that we can simply copy formulae into a new location, and they will adjust themselves for the changes in cell addresses. Thus, we simply copy the formula presently in b11 and b12 of the spreadsheet in Figure 11, into cells c12 and c13, and voila! (If you inspect the formula in the cell C11 you will see that it is now =AVERAGE(c2:c11)). >

Exercise 2 Open the Excel file you created for the comparative economic data of Exercise 1. 1 Find the mean, standard deviation (population and sample), maximum and minimum for ‘defense’ expense categories using Excel’s built-in functions. 2 Use Excel’s cut and paste capabilities to repeat the calculations in 1 for each of the other expense categories. Open the Excel file you created for the data on CEO’s (at the beginning of the tutorial). 4

Use the ‘standardize’ function to create standardised scores for the stress variable.

Tutorial x : Statistics by spreadsheet

9

Figure 11 Copying formulas to other cells in the spreadsheet

Sometimes the functions are a little more complicated, but most spreadsheets have very good context sensitive help. Indeed, Excel has a function wizard, which will prompt you for the details, rather than requiring you to remember the required arguments, and to enter them precisely. Thus, you can use the Insert menu to insert a function, or the button on the standard toolbar, or the key combination SHIFT F3. Let's enter the function for the Pearson correlation coefficient in cell b14. We use the Insert Function command, which generates a dialog box (Figure 12), and we choose the appropriate function category (Statistics) and the appropriate function within that (Pearson). This generates a dialog box, which asks for the set of data we wish to apply the correlation to (Figure 13).

Figure 12 Dialog boxes controlling the insertion of statistical formulae

Figure 13 Spreadsheet result display for insertion of correlation coefficient formula

We complete the requested entries, and this returns the correlation coefficient in cell b14 (see Figure 10).

Tutorial x : Statistics by spreadsheet

10

Box 2 Working with random numbers One of the most useful things about spreadsheets is their capacity to generate (quasi) random numbers with comparatively little effort. Typically there are two situations where one might want to do this. A

Drawing a random sample.

Imagine that we wish to draw a random sample of three managers from the ten referred to in Table 1, in order to interview them. You can do this in Excel by generating a list of random numbers between 1 and 10, and choosing the first three nonduplicates. Type the function =RAND() in any empty cell in your worksheet This returns a random number between 0 and 1. Now edit this formula to be =(RAND()*9+1) – notice how the multiplier creates the range (0 to 9), and the addendum modifies the endpoints of the range (0 to 1, 9 to 10). To neaten the output, add a rounding function, thus =ROUND(RAND()*9+1,0). Copy the formula you have created to a blank range, and select the first three non-duplicates. The process is shown in Figure 14

Entering the formula

Produces a random number

Which is cut and paste.

Select the first 3 unique

Figure 14 Demonstration of random number generation B

Randomizing cases to conditions.

In experimental design, one often wants to randomise cases into conditions e.g. randomising patients into treatment and non-treatment groups. In Excel this can be done by generating a random number for each case, and then sorting the cases according to the random numbers associated with the cases. For a two-group randomisation, the top half of the list will be one group, and the bottom the other. This can be extended to any number of group categories, and with a little ingenuity, to multiple grouping variables. Figure 15 shows this process for a randomisation of the managers into a two group experiment.

Figure 15 Demonstration of randomisation

Tutorial x : Statistics by spreadsheet

11

More advanced functions are also available. There are those which use standard statistical distributions to generate probability values, there are others which generate random numbers (you can specify the underlying generating distribution), and there are special type of functions known as array functions, which allow the quick calculation of standard functions on arrays of data, and which can return results in arrays, rather than the single cell returns we have seen thus-far. Some of the more advanced functions are shown in Table 4 FUNCTION BINOMDIST CHIDIST CHITEST FDIST FISHER HARMEAN NORMDIST NORMINV TDIST

DESCRIPTION Returns the individual term binomial distribution probability Returns the one-tailed probability of the chi-squared distribution Returns the test for independence Returns the F probability distribution Returns the Fisher transformation Returns the harmonic mean Returns the normal cumulative distribution Returns the inverse of the normal cumulative distribution Returns the Student's t-distribution

Table 4. Some statistical distribution functions available in Microsoft Excel.

The distribution functions are tremendously useful if we need to apply inferential statistical reasoning to a particular problem - as we indeed often do in the social sciences. For the data shown in Table 1 (the executive stress example), we may want to know what an unusually high stress level is, assuming that the underlying population distribution is normal. We can calculate this, using the sample estimates as estimates of the population parameter, and assuming an unusually high stress level to be over the 95th percentile score for the distribution. We enter into cell b15 the formula for obtaining this score, using the wizard for the inverse form of the normal distribution (i.e. to calculate a criterion score, rather than a probability):

Figure 16 Dialog box controlling the insertion of Figure 17 Spreadsheet result display for calculation with the inverse normal distribution statistical formulae

We see that the calculated value is 98.36. It is quite clear from the sample data that none of the executives reaches this high level of stress

Tutorial x : Statistics by spreadsheet

12

>

Exercise 3 Open the Excel file you created for the data of Table 1 1

Use the random number function in Excel to randomly select three managers (Hint: read the box on working with random numbers)

2

Use the random number function to randomly assign the managers to three groups (Hint: read the box on working with random numbers, and notice that you have an even number of managers – think about solutions to this problem) Use the Normal distribution functions in Excel to find the 95% and 5% cut-off points on a population of depression scores, assuming it has the same mean and standard deviation as the sample of 10 managers.

3

Integrated statistical modules/assistants Along with the elementary statistical and mathematical functions available in Excel, the distributional functions make it possible, in principle, to conduct many statistical analytic problems. However, there will be a fair amount of work in doing so, especially when the problem at hand has complexities. This is frequently the case in the social sciences, where missing data, unbalanced designs, and other perturbations of the social world are common. Even when the problems are absent, the analysis can often require a good deal of computation. For this reason, most contemporary spreadsheet programs have added automated 'statistics assistants'. In Microsoft Excel, for instance, you can install a 'data analysis tool' as an add-in. This allows you to conduct a number of standard analyses, as shown in Table 5. Analysis of variance: Single factor Analysis of variance: Two-factor with replication Analysis of variance: Two-factor without replication Correlation Covariance Descriptive statistics Exponential smoothing F-test Two-sample for variances Fourier analysis Histogram

Moving average Random number generation Rank and percentile Regression Sampling t-Test: Paired Two sample for means t-Test: Two sample assuming equal variances t-Test: Two sample assuming unequal variances z-Test: Two sample for means

Table 5 Statistical procedures available in Microsoft Excel (the ’data analysis’ add-in)

However, the addition of these 'add-ins' to a large extent detracts from the inherent appeal of

Tutorial x : Statistics by spreadsheet

13

spreadsheet software: the intuitive layout and design makes the calculational procedure transparent to the user, and allows complete control over the analysis. The 'add-ins' simply offer what standard statistical packages such as SPSS and STATISTICA provide - and, it must be said, with much less flexibility and reliability. We cannot recommend them over that provided by statistical packages.

Displaying data with spreadsheets Graphical displays of data are often as important to conducting statistical analysis as calculations of descriptive and inferential indices. This is so in the beginning stages of the analysis, since visual displays are often easier to understand and more efficient than summary statistics, and is usually so at the end of the project, since graphs or diagrams of one kind or another are usually included in the study report or article. With the advent of faster, more powerful computers, it has become easier for statistical packages to generate graphic displays, and some packages offer highly sophisticated and user-friendly graphic abilities. All of the major spreadsheet programs have graphic capabilities, but they do not offer the statistics user much to write home about. The construction of all but the simplest graphs is generally quite cumbersome, and many of the standard statistics graphs (e.g. boxplots) are not available. The principal reason for this is probably that the essential design element of the spreadsheet (the cell-addressable tabular layout) does not lend itself easily to graphing applications. It is an easy thing to instruct a statistics program to graph two variables, since the notion of a 'variable' is an element in the statistics program design, whereas spreadsheets recognise cell addresses as elements, but not variables. Nevertheless, in the spirit of the 'all-purpose' nature of modern ‘office’ software, graphs are available in Excel, and simple exemplars can be constructed reasonably easily with 'wizards' and other automation tools. Let's see what is involved if we want to create a scatter plot of the relation between depression and income, using the data of Figure 11. >

In Microsoft Excel, graphs are created by using the Insert menu (sub-menu Chart), or the icon. Using either of these options starts the Chart wizard, which takes the user through a number of steps. Firstly, a gallery of chart types is displayed, and the user selects one (see Figure 18). In the present instance, we select scatter, and choose the first of the sub-types in the right hand panel. We are then required to define or select a 'data range', containing the xy data. Once we have done this, the penultimate wizard screen asks us to provide a chart title, and labels for the x and y-axes. We can also set options for gridlines, legends, and data labels (see Figure 19). The final graph is generated (see Figure 20), and we can place it in a new worksheet or within the active worksheet containing the data. Additional touching up can be effected at any stage by double clicking the graphic and using the flying menus (accessed by right-clicking).

Tutorial x : Statistics by spreadsheet

14

Figure 18 Dialog box controlling the selection of chart type

Figure 19 Dialog box controlling the selection of chart options

Depression

Sample scatterplot for executive stress example 100 80 60 40 20 0 0

500

1000

1500

Income

Figure 20 Scatterplot chart of data in Table 1, constructed with Excel Exercise 4 Open the Excel file you created for the comparative economic data of Exercise 1. Complete the exercises below, and try to interpret any patterns or differences you might see. 1 2 3

Produce a bar chart comparing income areas on expenditure categories as a % of GDP. Add titles, axis labels, and formatting until you are happy with the result. Produce a bar chart comparing income areas on expenditure categories as a % of Total expenditure. Add titles, axis labels, and formatting until you are happy with the result. Produce a pie chart for each of the income areas, reflecting expenditure categories as pie-slices.

Conclusion Spreadsheet programs such as Microsoft Excel, Corel Quattro, and Lotus 1-2-3 can indeed be useful alternatives to standard statistical packages. The underlying design concept is extraordinarily elegant, and this makes them both intuitive to use, and transparent to inspection. For simple, relatively straightforward statistical analysis they are very useful. They are particularly useful to entry-level

Tutorial x : Statistics by spreadsheet

15

students, since the analysis occurs in a demystified, manifest manner. In this sense, it is worth learning to use one of the standard spreadsheet packages: the present tutorial is only an introduction to the topic, and mastery of a package like Excel represents a considerable investment of time. On the other hand, spreadsheet programs are not much good at typical analytic tasks like graph construction, or analyses that involve complexity. Unfortunately, many analytic problems in the social sciences are complex, and hence not suitable for spreadsheet programs. We cannot recommend their usage as general-purpose statistical programs.

Exercises 1

A psychologist, who is interested in Health care in rural communities, wants to know whether children in those areas are undernourished. One indication of undernourishment is a sub-normative weight. The psychologist thus weighs 25 8 yr old children (boys and girls) in a rural community, and obtains the following set of data (weight in kilograms, rounded to the nearest kilogram): 20 22 27 19 29 30 19 23 20 21 28 28 28 23 27 29 29 20 23 28 29 21 23 21 29 Enter the data in a spreadsheet in Microsoft Excel, and produce the following: a)

a frequency distribution of the weights (Hint: there is a suitable statistical function in Excel for this)

b)

a histogram of the weights (Hint: insert a column or bar chart)

Examine what you have produced and describe the distribution of the weight of children in this sample. Can you identify any feature of the psychologist's research design, which might explain the nature of the distribution?

2

Consider the following sample of scores, obtained from the administration of a psychological test to 40 first year psychology students: 98 102 120 129 95 150 120 123 143 92 112 109 103 115 128 113 89 124 97 120 143 189 143 126 121 100 98 103 105 110 113 132 104 112 84 132 120 112 111 110 Enter the data into Microsoft Excel, and calculate descriptive statistics (range, mean, median, standard deviation, variance etc.) for the sample. Also construct a frequency distribution and a histogram. Finally, express each decile in standard deviation units, i.t.o. of its distance from the mean. (i.e. the 4th decile might be -1.32 standard deviations from the mean).

3

A group of children is given a test in which they are required to solve an arithmetic problem as quickly as they can. Here are the times of 25 such children (in seconds): 100 234 156 230 178 190 256 210 298 124 163 172 183 320 123 234 321 395 354 182 119 102 103 176 243 Use Excel to calculate the deciles and quartiles of this distribution. Express these as Z-scores. What is the probability of a child who is randomly selected from the 25 above taking longer than 300s. to solve the problem? Use Excel's built in function to determine the probability of this occurring, assuming that the sample mean and standard deviation are equal to the corresponding population values.

Tutorial x : Statistics by spreadsheet 4

16

In each of the following cases pairs of data are reported. Decide for each case whether it is appropriate to calculate a correlation coefficient - describe the nature of each relationship and estimate its strength. (HINT: use scatter diagrams to assist you). Where appropriate, calculate the correlation coefficient. 1

X: 1 3 5 7 9 11 13 15 17 Y: 100 93 86 79 72 65 58 51 44

2

X: 2 3 4 5 6 7 8 9 10 Y: 4 9 16 25 36 49 64 81 100

3

X: 3 6 2 8 3 9 12 5 Y: 1 7 4 3 5 6 11 3

References Tyson, G. & Turnbull, O. (1990). Ambient temperature and the occurrence of collective violence: A South African replication. South African Journal of Psychology, 20(3), 159-162.

Further reading Berk, K.N. & Carey, P. (1998). Data analysis with Microsoft Excel. Duxbury Press. Piele, D. (1990). Introductory statistics with spreadsheets. Addison-Wesley.