Introduction to Statistical Analysis Using SPSS Statistics

Introduction to Statistical Analysis Using SPSS® Statistics 33373-001

SPSS v17.0.1;1/2009 nm

For more information about SPSS Inc. software products, please visit our Web site at http://www.spss.com or contact SPSS Inc. 233 South Wacker Drive, 11th Floor Chicago, IL 60606-6412 Tel: (312) 651-3000 Fax: (312) 651-3668 SPSS is a registered trademark and the other product names are the trademarks of SPSS Inc. for its proprietary computer software. No material describing such software may be produced or distributed without the written permission of the owners of the trademark and license rights in the software and the copyrights in the published materials. The SOFTWARE and documentation are provided with RESTRICTED RIGHTS. Use, duplication, or disclosure by the Government is subject to restrictions as set forth in subdivision (c)(1)(ii) of The Rights in Technical Data and Computer Software clause at 52.227-7013. Contractor/manufacturer is SPSS Inc., 233 South Wacker Drive, 11th Floor, Chicago, IL 60606-6412. Patent No. 7,023,453 General notice: Other product names mentioned herein are used for identification purposes only and may be trademarks or registered trademarks of their respective companies in the United States and other countries. Windows is a registered trademark of Microsoft Corporation. Apple, Mac, and the Mac logo are trademarks of Apple Computer, Inc., registered in the U.S. and other countries.

Introduction to Statistical Analysis Using SPSS Statistics Copyright © 2009 by SPSS Inc. All rights reserved. Printed in the United States of America.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.


CHAPTER 1: INTRODUCTION TO STATISTICAL ANALYSIS .... 1-1 1.1 COURSE GOALS AND METHODS .............................................................................. 1-1 1.2 BASIC STEPS OF RESEARCH PROCESS ...................................................................... 1-2 1.3 POPULATIONS AND SAMPLES .................................................................................. 1-3 1.4 RESEARCH DESIGN .................................................................................................. 1-4 1.5 INDEPENDENT AND DEPENDENT VARIABLES........................................................... 1-5 1.6 LEVELS OF MEASUREMENT AND STATISTICAL METHODS ....................................... 1-5

CHAPTER 2: DATA CHECKING.......................................................... 2-1 2.1 INTRODUCTION........................................................................................................ 2-1 2.2 VIEWING A FEW CASES ........................................................................................... 2-3 2.3 MINIMUM, MAXIMUM, AND NUMBER OF VALID CASES .......................................... 2-5 2.4 DATA VALIDATION: DATA PREPARATION ADD-ON MODULE ................................. 2-8 2.5 DATA VALIDATION RULES ...................................................................................... 2-9 2.6 WHEN DATA ERRORS ARE DISCOVERED .............................................................. 2-25 SUMMARY EXERCISES ................................................................................................ 2-27

CHAPTER 3: DESCRIBING CATEGORICAL DATA ....................... 3-1 3.1 WHY SUMMARIES OF SINGLE VARIABLES? ............................................................. 3-1 3.2 FREQUENCY ANALYSIS ........................................................................................... 3-2 3.3 STANDARDIZING THE CHART AXIS........................................................................ 3-11 3.4 PIE CHARTS ........................................................................................................... 3-16 SUMMARY EXERCISES ................................................................................................ 3-19

CHAPTER 4: EXPLORATORY DATA ANALYSIS: SCALE DATA 4-1 4.1 SUMMARIZING SCALE VARIABLES .......................................................................... 4-1 4.2 MEASURES OF CENTRAL TENDENCY AND DISPERSION ............................................ 4-2 4.3 NORMAL DISTRIBUTIONS ........................................................................................ 4-3 4.4 HISTOGRAMS AND NORMAL CURVES ...................................................................... 4-4 4.5 USING THE EXPLORE PROCEDURE: EDA................................................................. 4-7 4.6 STANDARD ERROR OF THE MEAN AND CONFIDENCE INTERVALS .......................... 4-12 4.7 SHAPE OF THE DISTRIBUTION ................................................................................ 4-12 4.8 BOXPLOTS ............................................................................................................. 4-15 4.9 APPENDIX: STANDARDIZED (Z) SCORES................................................................ 4-21 SUMMARY EXERCISES ................................................................................................ 4-25

CHAPTER 5: PROBABILITY AND INFERENTIAL STATISTICS . 5-1 5.1 THE NATURE OF PROBABILITY ................................................................................ 5-1 5.2 MAKING INFERENCES ABOUT POPULATIONS FROM SAMPLES .................................. 5-2 5.3 INFLUENCE OF SAMPLE SIZE ................................................................................... 5-2 5.4 HYPOTHESIS TESTING ........................................................................................... 5-10 5.5 TYPES OF STATISTICAL ERRORS ............................................................................ 5-11 5.6 STATISTICAL SIGNIFICANCE AND PRACTICAL IMPORTANCE .................................. 5-12


CHAPTER 6: COMPARING CATEGORICAL VARIABLES ........... 6-1 6.1 TYPICAL APPLICATIONS .......................................................................................... 6-1 6.2 CROSSTABULATION TABLES ................................................................................... 6-2 6.3 TESTING THE RELATIONSHIP: CHI-SQUARE TEST .................................................... 6-5 6.4 REQUESTING THE CHI-SQUARE TEST ...................................................................... 6-7 6.5 INTERPRETING THE OUTPUT .................................................................................... 6-8 6.6 ADDITIONAL TWO-WAY TABLES .......................................................................... 6-12 6.7 GRAPHING THE CROSSTABS RESULTS ................................................................... 6-16 6.8 ADDING CONTROL VARIABLES ............................................................................. 6-18 6.9 EXTENSIONS: BEYOND CROSSTABS....................................................................... 6-21 6.10 APPENDIX: ASSOCIATION MEASURES ................................................................. 6-21 SUMMARY EXERCISES ................................................................................................ 6-28

CHAPTER 7: MEAN DIFFERENCES BETWEEN GROUPS: T TEST ...................................................................................................................... 7-1 7.1 INTRODUCTION........................................................................................................ 7-1 7.2 LOGIC OF TESTING FOR MEAN DIFFERENCES .......................................................... 7-1 7.3 EXPLORING THE GROUP DIFFERENCES .................................................................... 7-6 7.4 TESTING THE DIFFERENCES: INDEPENDENT SAMPLES T TEST ............................... 7-14 7.5 INTERPRETING THE T TEST RESULTS ..................................................................... 7-16 7.6 GRAPHING MEAN DIFFERENCES............................................................................ 7-20 7.7 APPENDIX: PAIRED T TEST.................................................................................... 7-22 7.8 APPENDIX: NORMAL PROBABILITY PLOTS ............................................................ 7-25 SUMMARY EXERCISES ................................................................................................ 7-29

CHAPTER 8: BIVARIATE PLOTS AND CORRELATIONS: SCALE VARIABLES .............................................................................................. 8-1 8.1 INTRODUCTION........................................................................................................ 8-1 8.2 READING THE DATA ................................................................................................ 8-1 8.3 EXPLORING THE DATA ............................................................................................ 8-2 8.4 SCATTERPLOTS........................................................................................................ 8-6 8.5 CORRELATIONS ..................................................................................................... 8-11 SUMMARY EXERCISES ................................................................................................ 8-16

CHAPTER 9: INTRODUCTION TO REGRESSION .......................... 9-1 9.1 INTRODUCTION AND BASIC CONCEPTS .................................................................... 9-1 9.2 THE REGRESSION EQUATION AND FIT MEASURE .................................................... 9-2 9.3 RESIDUALS AND OUTLIERS ..................................................................................... 9-2 9.4 ASSUMPTIONS ......................................................................................................... 9-3 9.5 SIMPLE REGRESSION ............................................................................................... 9-3 SUMMARY EXERCISES .................................................................................................. 9-8


APPENDIX A: MEAN DIFFERENCES BETWEEN GROUPS: ONEFACTOR ANOVA .................................................................................... A-1 A.1 INTRODUCTION ...................................................................................................... A-1 A.2 EXTENDING THE LOGIC BEYOND TWO GROUPS .................................................... A-1 A.3 EXPLORING THE DATA .......................................................................................... A-3 A.4 ONE-FACTOR ANOVA ......................................................................................... A-4 A.5 POST HOC TESTING OF MEANS............................................................................ A-10 A.6 GRAPHING THE MEAN DIFFERENCES ................................................................... A-18 A.7 APPENDIX: GROUP DIFFERENCES ON RANKS....................................................... A-20 SUMMARY EXERCISES ............................................................................................... A-23

APPENDIX B: INTRODUCTION TO MULTIPLE REGRESSION . B-1 B.1 MULTIPLE REGRESSION ......................................................................................... B-1 B.2 MULTIPLE REGRESSION RESULTS .......................................................................... B-4 B.3 RESIDUALS AND OUTLIERS .................................................................................... B-7 SUMMARY EXERCISES ............................................................................................... B-10

REFERENCES.......................................................................................... R-1 INTRODUCTORY STATISTICS BOOKS ............................................................................ R-1 ADDITIONAL REFERENCES ........................................................................................... R-1

ALTERNATIVE EXERCISES FOR CHAPTERS 8 & 9 AND APPENDIX B ............................................................................................ E-1 SUMMARY EXERCISES FOR CHAPTER 8........................................................................ E-2 SUMMARY EXERCISES FOR CHAPTER 9........................................................................ E-3 SUMMARY EXERCISES FOR APPENDIX B...................................................................... E-4


Chapter 1: Introduction to Statistical Analysis

Topics:
• Course Goals and Methods
• Basic Steps of Research Process
• Populations and Samples
• Research Design
• Independent and Dependent Variables
• Level of Measurement and Statistical Methods

Data
This course uses the data file GSS2004Intro.sav for most of the examples. These data are a subset of variables from the General Social Survey. The General Social Survey 2004 (GSS, produced by the National Opinion Research Center, Chicago) is a survey involving demographic, attitudinal and behavioral items that include views on government and satisfaction with various facets of life. Approximately 2,800 U.S. adults were included in the study. However, not all questions were asked of each respondent, so most analyses will be based on a reduced number of cases. The survey has been administered nearly every year since 1972 (it is now administered in even years).

1.1 Course Goals and Methods

In this chapter, we begin by briefly reviewing the basic elements of quantitative research and issues that should be considered in data analysis. We will then discuss some SPSS Statistics facilities that can be used to check your data. In the remainder of the course, we will cover a number of statistical procedures that SPSS Statistics performs. This is an application-oriented course and the approach will be practical. We will discuss:

1) The situations in which you would use each technique,
2) The assumptions made by the method,
3) How to set up the analysis using SPSS Statistics,
4) Interpretation of the results.

We will not derive proofs, but rather focus on the practical matters of data analysis in support of answering research questions. For example, we will discuss what correlation coefficients are, when to use them, and how to produce and interpret them, but will not formally derive their properties. This course is not a substitute for a course in statistics. You will benefit if you have had such a course in the past, but even if not, you will understand the basics of each technique after completion of this course. We will cover descriptive statistics and exploratory data analysis, and then examine relationships between categorical variables using crosstabulation tables and chi-square tests. Testing for mean differences between groups using T Tests and analysis of variance (ANOVA) will be considered. Correlation and regression will be used to investigate the relationships between interval scale variables. Graphics comprise an integral part of the analyses.


This course assumes you have a working knowledge of SPSS Statistics in your computing environment. Thus the basic use of menu systems, data definition and labeling will not be considered in any detail. The analyses in this course will show the locations of the menu choices and dialog boxes within the overall menu system, and the dialog box selections will be detailed.

Scenario for Analyses

We will perform many analyses on the GSS data. As we review and apply the statistical methods to these data, it is crucial that you think about how these methods might be used with information you collect. Although survey data is used for our examples, the same methods can be used with experimental and archival data. In part to simplify the course through minimizing the number of data sets, we will produce more types of analyses on one data set than are typically done. In practice the number and variety of analyses performed in a study is a function of the research design: the questions that you, the analyst, want to ask of the data.

1.2 Basic Steps of Research Process

All research projects can be broken down into a number of discrete components. These components can be categorized in a variety of ways. We might summarize the main steps as:

1. Specify exactly the aims and objectives of the research along with the main hypotheses
2. Define the population and sample design
3. Choose a method of data collection, design the research and decide upon an appropriate sampling strategy
4. Collect the data
5. Prepare the data for analysis
6. Analyze the data
7. Report the findings

Some of these points may seem obvious, but it is surprising how often some of the most basic principles are overlooked, potentially resulting in data that is impossible to analyze with any confidence. Each step is crucial for a successful research project and it is never too early in the process to consider the methods that you intend to use for your data analysis. In order to place the statistical techniques that we will discuss in this course in the broader framework of research design, we will briefly review some of the considerations of the first steps. Statistics and research design are highly interconnected disciplines and you should have a thorough grasp of both before embarking on a research project. This introductory chapter merely skims the surface of the issues involved in research design. If you are unfamiliar with these principles, we recommend that you refer to the research methodology literature for more thorough coverage of the issues.



Research Objectives

It is important that a research project begin with a set of well-defined objectives. Yet, this step is often overlooked or not well defined. The specific aims and objectives may not be addressed because those commissioning the research do not know exactly which questions they would like answered. This rather vague approach can be a recipe for disaster and may result in a completely wasted opportunity, as the most interesting aspects of the subject matter under investigation could well be missed. If you do not identify the specific objectives, you will fail to collect the necessary information or ask the necessary question in the correct form. You can end up with a data file that does not contain the information that you need for your data analysis step.

For example, you may be asked to conduct a survey "to find out about alcohol consumption and driving". This general objective could lead to a number of possible survey questions. Rather than proceeding with this general objective, you need to uncover more specific hypotheses that are of interest to your organization. This example could lead to a number of very specific research questions, such as:

"What proportion of people admits to driving while above the legal alcohol limit?"
"What demographic factors (e.g. age/sex/social class) are linked with a propensity to drunk-driving?"
"Does having a conviction for drunk-driving affect attitudes towards driving while over the legal limit?"

These specific research questions would then define the questionnaire items. Additionally, the research questions will affect the definition of the population and the sampling strategy. For example, the third question above requires that the respondent have a drunk-driving conviction. Given that a relatively small proportion of the general population has such a conviction, you would need to take that into consideration when defining the population and sampling design. For example, a simple random sample of the general population would not be recommended, although several other approaches beyond the scope of this course would be considered. Therefore, it is essential to state formally the main aims and objectives at the outset of the research so the subsequent stages can be done with these specific questions in mind.

1.3 Populations and Samples In studies involving statistical analysis it is important to be able to characterize accurately the population under investigation. The population is the group to which you wish to generalize your conclusions, while the sample is the group you directly study. In some instances the sample and population are identical or nearly identical; consider the Census of any country in this regard. In the majority of studies, the sample represents a small proportion of the population. In the example above, the population might be defined as those people with registered drivers' licenses. We could select a sample from the drivers' license registration list for our survey. Other common examples are: membership surveys in which a small percentage of members are sent questionnaires, medical experiments in which samples of patients with a disease are given different treatments, marketing studies in which users and non users of a product are compared, and political polling.


The problem is to draw valid inferences from data summaries in the sample so that they apply to the larger population. In some sense you have complete information about the sample, but you want conclusions that are valid for the population. An important component of statistics and a large part of what we cover in the course involves statistical tests used in making such inferences. Because the findings can only be generalized to the population under investigation, you should give careful thought to defining the population of interest to you and making certain that the sample reflects this population. The survey research literature—for example Sudman (1976) or Salant and Dillman (1994)—reviews these issues in detail. To state it in a simple way, statistical inference provides a method of drawing conclusions about a population of interest based on sample results.

1.4 Research Design With specific research goals and a target population in mind, it is then possible to begin the design stage of the research. There are many things to consider at the design stage. We will consider a few issues that relate specifically to data analysis and statistical techniques. This is not meant as a complete list of issues to consider. For example, for survey projects, the mode of data collection, question selection and wording, and questionnaire design are all important considerations. Refer to the survey research literature mentioned above as well as general research methodology literature for discussion of these and other research design issues. First, you must consider the type of research that will be most appropriate to the research aims and objectives. Two main alternatives are survey research and experimental research. The data may be recorded using either objective or subjective techniques. The former includes items measured by an instrument and by computer such as physiological measures (e.g. heart-rate) while the latter includes observational techniques such as recordings of a specific behavior and responses to questionnaire surveys. Most research goals lend themselves to one particular form of research, although there are cases where more than one technique may be used. For example, a questionnaire survey would be inappropriate if the aim of the research was to test the effectiveness of different levels of a new drug to relieve high blood pressure. This type of work would be more suited to a tightly controlled experimental study in which the levels of the drug administered could be carefully controlled and objective measures of blood pressure could be accurately recorded. On the other hand, this type of laboratory-based work would not be a suitable means of uncovering people’s voting intentions. The classic experimental design consists of two groups: the experimental group and the control group. They should be equivalent in all respects other than that those in the former group are subjected to an effect or treatment and the latter is not. Therefore, any differences between the two groups can be directly attributed to the effect of this treatment. The treatment variables are usually referred to as independent variables, and the quantity being measured as the effect is the dependent variable. There are many other research designs, but most are more elaborate variations on this basic theme. In survey research, you rarely have the opportunity to implement such a rigorously controlled design. However, the same general principles apply to many of the analyses you perform.



1.5 Independent and Dependent Variables In general, the dependent (sometimes referred to as the outcome) variable is the one we wish to study as a function of other variables. Within an experiment, the dependent variable is the measure expected to change as a result of the experimental manipulation. For example, a drug experiment designed to test the effectiveness of different sleeping pills might employ the number of hours of sleep as the dependent variable. In surveys and other non-experimental studies, the dependent variable is also studied as a function of other variables. However, no direct experimental manipulation is performed; rather the dependent variable is hypothesized to vary as a result of changes in the other (independent) variables. Correspondingly, independent (sometimes referred to as predictor) variables are those used to measure features manipulated by the experimenter in an experiment. In a non-experimental study, they represent variables believed to influence or predict a dependent measure. Thus terms (dependent, independent) reasonably applied to experiments have taken on more general meanings within statistics. Whether such relations are viewed causally, or as merely predictive, is a matter of belief and reasoning. As such, it is not something that statistical analysis alone can resolve. To illustrate, we might investigate the relationship between starting salary (dependent) and years of education, based on survey data, and then develop an equation predicting starting salary from years of education. Here starting salary would be considered the dependent variable although no experimental manipulation of education has been performed. One way to think of the distinction is to ask yourself which variable is likely to influence the other? In summary, the dependent variable is believed to be influenced by, or be predicted by, the independent variable(s). Finally, in some studies, or parts of studies, the emphasis is on exploring and characterizing relationships among variables with no causal view or focus on prediction. In such situations there is no designation of dependent and independent. For example, in crosstabulation tables and correlation matrices the distinction between dependent and independent variables is not necessary. It rather resides in the eye of the beholder (researcher).

1.6 Levels of Measurement and Statistical Methods The term, levels of measurement, refers to the properties and meaning of numbers assigned to observations for each item. Many statistical techniques are only appropriate for data measured at particular levels or combinations of levels. Therefore, when possible, you should determine the analyses you will be using before deciding upon the level of measurement to use for each of your variables. For example, if you want to report and test the mean age of your sample, you will need to ask their age in years (or year of birth) rather than asking them to choose an age group into which their age falls. Because measurement type is important when choosing test statistics, we briefly review the common taxonomy of level of measurement. For an interesting discussion of level of measurement and statistics see Velleman and Wilkinson (1993). The four major classifications that follow are found in many introductory statistics texts. They are presented beginning with the weakest and ending with those having the strongest measurement properties. Each successive level can be said to contain the properties of the preceding types and to record information at a higher level.


• Nominal — In nominal measurement each numeric value represents a category or group identifier, only. The categories cannot be ranked and have no underlying numeric value. An example would be marital status coded 1 (Married), 2 (Widowed), 3 (Divorced), 4 (Separated) and 5 (Never Married); each number represents a category and the matching of specific numbers to categories is arbitrary. Counts and percentages of observations falling into each category are appropriate summary statistics (a brief syntax sketch follows this list). Such statistics as means (the average marital status?) would not be appropriate, but the mode would be appropriate (the biggest category: Married?).

• Ordinal — For ordinal measures the data values represent ranking or ordering information. However, the difference between the data values along the scale is not equal. An example would be specifying how happy you are with your life, coded 1 (Very Happy), 2 (Happy), and 3 (Not Happy). There are specific statistics associated with ranks; SPSS Statistics provides a number of them mostly within the Nonparametric and Ordinal Regression procedures. The mode and median can be used as summary statistics.

• Interval — In interval measurement, a unit increase in numeric value represents the same change in quantity regardless of where it occurs on the scale. For interval scale variables such summaries as means and standard deviations are appropriate. Statistical techniques such as regression and analysis of variance assume that the dependent (or outcome) variable is measured on an interval scale. Examples might be temperature in degrees Fahrenheit or IQ score.

• Ratio — Ratio measures have interval scale properties with the addition of a meaningful zero point; that is, zero indicates complete absence of the characteristic measured. For statistics such as ANOVA and regression only interval scale properties are assumed, so ratio scales have stronger properties than necessary for most statistical analyses. Health care researchers often use ratio scale variables (number of deaths, admissions, discharges) to calculate rates. The ratio of two variables with ratio scale properties can thus be directly interpreted. Money is an example of a ratio scale, so someone with $10,000 has ten times the amount as someone with $1,000.
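To make the point concrete, the following is a minimal syntax sketch of the summaries appropriate for a nominal variable: counts and percentages from a frequency table, plus the mode. The variable name marital is an assumption based on the marital status example above; substitute any nominal variable from your own file.

* Counts, percentages, and the mode for a nominal variable (marital is assumed).
FREQUENCIES VARIABLES=marital
  /STATISTICS=MODE.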

The distinction between the four types is summarized in Table 1.1.

Table 1.1 Level of Measurement Properties

Level of Measurement    Categories    Ranks    Equal Intervals    True Zero Point
Nominal                 ✓
Ordinal                 ✓             ✓
Interval                ✓             ✓        ✓
Ratio                   ✓             ✓        ✓                  ✓

These four levels of measurement are often combined into two main types, categorical consisting of nominal and ordinal measurement levels and continuous (scale) consisting of interval and ratio measurement levels.


The measurement level variable attribute in SPSS Statistics recognizes three measurement levels: Nominal, Ordinal and Scale. The icon indicating the measurement level is displayed preceding the variable name or label in the variable lists of all dialog boxes. Table 1.2 shows the most common icons used for the measurement levels. Special data types, such as Date and Time variables, have distinct icons not shown in this table.

Table 1.2 Variable List Icons

Measurement Level    Numeric data type    String data type
Nominal              (nominal icon)       (nominal icon)
Ordinal              (ordinal icon)       Not Applicable
Scale                (scale icon)         Not Applicable
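If the measurement level attribute needs to be corrected, it can be set in the Variable View of the Data Editor or with the VARIABLE LEVEL command. The sketch below is illustrative only; the variable names are assumptions based on the GSS file used in this course, and you would substitute your own variables and levels.

* Set the measurement level attribute (variable names are assumed).
VARIABLE LEVEL marital (NOMINAL).
VARIABLE LEVEL happy (ORDINAL).
VARIABLE LEVEL age educ (SCALE).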

Note: Rating Scales and Dichotomous Variables

A common scale used in surveys and market research is an ordered rating scale, usually consisting of five- or seven-point scales. Such ordered scales are also called Likert scales and might be coded 1 (Strongly Agree, or Very Satisfied), 2 (Agree, or Satisfied), 3 (Neither agree nor disagree, or Neutral), 4 (Disagree, or Dissatisfied), and 5 (Strongly Disagree, or Very Dissatisfied). There is an ongoing debate among researchers as to whether such scales should be considered ordinal or interval. SPSS Statistics contains procedures capable of handling such variables under either assumption. When in doubt about the measurement scale, some researchers run their analyses using two separate methods, since each makes different assumptions about the nature of the measurement. If the results agree, the researcher has greater confidence in the conclusion.

Dichotomous (binary) variables containing two possible responses (often coded 0 and 1) are often considered to fall into all of the measurement levels except ratio. As we will see, this flexibility allows them to be used in a wide range of statistical procedures.

Statistics are available for variables at all measurement levels, and it is important to match the proper statistic to a given level of measurement. In practice your choice of statistical method depends on the questions you are interested in asking of the data and the nature of the measurements you make. Table 1.3 suggests which statistical techniques are most appropriate, based on the measurement level of the dependent and independent variables. Much more extensive diagrams and discussion are found in Andrews et al. (1981). Recall that ratio variables can be considered as interval scale for analysis purposes. If in doubt about the measurement properties of your variables, you can apply a statistical technique that assumes weaker measurement properties and compare the results to methods making stronger assumptions. A consistent answer provides greater confidence in the conclusions.


Table 1.3 Statistical Methods and Level of Measurement

                      Independent Variables
Dependent Variable    Nominal                                    Ordinal                      Interval/Ratio
Nominal               Crosstabs                                  Crosstabs                    Discriminant, Logistic Regression
Ordinal               Nonparametric tests, Ordinal Regression    Nonparametric correlation    Ordinal Regression
Interval/Ratio        T Test, ANOVA                              Nonparametric correlation    Correlation, Regression


Chapter 2: Data Checking

Topics:
• Viewing a Few Cases
• Basic Data Validation: Minimum, maximum and number of cases
• Data Validation Rules
  o Creating single-variable rules
  o Creating cross-variable rules
  o Applying validation rules
• When Data Errors are Discovered

Data
This chapter uses the data file GSS2004PreClean.sav. This data set is a version of the GSS2004Intro.sav file into which we have introduced an out-of-range value in CONFINAN and an inconsistent response in HAPMAR to allow us to demonstrate some data checking features of SPSS Statistics. All other values are unchanged. This course assumes that the training files are located in the c:\Train\Stats folder. If you are not working in an SPSS Training center, the training files may be in a different folder structure. If you are running SPSS Statistics Server, then these files can be located on the server or a machine that can be accessed (mapped from) the server.

2.1 Introduction

When working with data it is important to verify their validity before proceeding with the analysis. Web surveys are often collected using software, such as SPSS Dimensions, that automatically checks whether a response is valid (for example, within an acceptable range) and consistent with previous information. Methods such as double-entry verification, a technique in which two people independently enter the data into a computer and the values are compared for discrepancies, can be implemented using data entry software such as SPSS Data Entry. If you are reading your data from some other source, you can use the SPSS Data Preparation add-on module to construct validation rules to check for values of single variables and consistency across variables. We will demonstrate this feature in this chapter. You can also use some basic features of SPSS Statistics Base as a first step in examining your data and checking for inconsistencies. Although mundane, time spent examining data in this way early on will reduce false starts, misleading analyses, and make-up work later. For this reason data checking is a critical prelude to statistical analysis.

Note about Default Startup Folder and Variable Display in Dialog Boxes

In this course, all of the files used for the demonstrations and exercises are located in the folder c:\Train\Stats. You can set the startup folder that will appear in all Open and Save dialog boxes. We will use this option to set the startup folder.

Click Edit...Options, then click the File Locations tab
Click the Browse button to the right of the Data Files text box


Select Train from the Look In: drop down list, then select Stats from the list of folders and click Set button
Click the Browse button to the right of the Other Files text box
Move to the Train\Stats folder (as above) and click Set button

Figure 2.1 Set Default File Location in the Edit Options Dialog Box

Note: If the course files are stored in a different location, your instructor will give you instructions specific to that location.

Either variable names or longer variable labels will appear in list boxes in dialog boxes. Additionally, variables in list boxes can be ordered alphabetically or by their position in the file. In this course, we will display variable names in alphabetical order within list boxes. Since the default setting within SPSS Statistics is to display variable labels in file order, we will change this before accessing data.

Click General tab (Not shown)
Select Display names in the Variable Lists group on the General tab
Select Alphabetical
Click OK and OK in the information box to confirm the change



2.2 Viewing a Few Cases

Often the first step in checking data previously entered on the computer is to view the first few observations and compare their data values to the original data worksheets, survey forms, or database records. This will detect many gross errors of data definition, such as reading alphabetic characters as numeric data fields or incorrectly formatted spreadsheet data. Viewing the first few cases can be easily accomplished using the Data Editor window in SPSS Statistics or the Case Summaries procedure. Below we view part of the 2004 General Social Survey data in the Data Editor window.

Click File…Open…Data
Double-click GSS2004PreClean.sav and click Open

Figure 2.2 General Social Survey 2004 Data in Data Editor Window

The first few responses can be compared to the original data source or surveys as a preliminary test of data entry. If errors are found, corrections can be made directly within the Data Editor window. (If you do not see the data values but labels instead, click on the Value Labels tool button on the Toolbar.) The Case Summaries procedure can list values of individual cases for selected variables. This allows you to more easily check variables that may be separated in the Data Editor, or request additional statistics.

Click Analyze…Reports…Case Summaries
Move HEALTH, RINCOME, and SIBS into the Variables list box
Type 20 into the Limit cases to first text box


Note we limit the listing to the first 20 cases (the default is 100). The Case Summaries procedure can also display case listings and summary statistics for groups of cases as defined by the Grouping Variable(s).

Figure 2.3 Case Summaries Dialog Box

Click OK
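For reference, the dialog box choices above can also be expressed in command syntax. The block below is approximately what the Paste button would generate for this Case Summaries request; exact subcommands may differ slightly by version, so treat it as a sketch of the equivalent syntax rather than a required step.

SUMMARIZE
  /TABLES=HEALTH RINCOME SIBS
  /FORMAT=VALIDLIST NOCASENUM TOTAL LIMIT=20
  /TITLE='Case Summaries'
  /MISSING=VARIABLE
  /CELLS=COUNT.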


Figure 2.4 Case Summaries Listing of First Twenty Cases

Case Summaries(a)

Case    In general, how is your health?    RESPONDENTS INCOME    NUMBER OF BROTHERS AND SISTERS
1       GOOD                               .                     3
2       .                                  .                     7
3       EXCELLENT                          NA                    7
4       .                                  $25000 OR MORE        10
5       .                                  $25000 OR MORE        2
6       GOOD                               $15000 - 19999        3
7       EXCELLENT                          $25000 OR MORE        2
8       .                                  .                     5
9       GOOD                               .                     3
10      .                                  $25000 OR MORE        1
11      .                                  .                     2
12      .                                  $25000 OR MORE        8
13      .                                  $10000 - 14999        3
14      GOOD                               $25000 OR MORE        4
15      EXCELLENT                          $20000 - 24999        1
16      .                                  $3000 TO 3999         7
17      GOOD                               $3000 TO 3999         18
18      EXCELLENT                          $15000 - 19999        5
19      .                                  .                     4
20      .                                  $25000 OR MORE        3
Total N 9                                  13                    20

a. Limited to first 20 cases.

By default, SPSS Statistics displays value labels in case listings; this can be modified within the Options dialog box (click Edit…Options, then move to the Output Labels tab). We use the case listing to look for potential problems, such as too much missing data. Looking at Figure 2.4 (reformatted for presentation), there certainly is a lot of system-missing data (the period) for General Health, but not all questions are asked of all respondents in the General Social Survey, so this is most likely not of concern. The value "NA," meaning "No Answer," appears as a response to the Respondent's Income question. There are no missing data in the first 20 cases for the number of siblings question.

2.3 Minimum, Maximum, and Number of Valid Cases

A second simple data check that can be done within SPSS Statistics is to request descriptive statistics on all numeric variables. By default, the Descriptives procedure will report the mean, standard deviation, minimum, maximum and number of valid cases for each numeric variable. While the mean and standard deviation are not relevant for nominal variables (see Chapter 1), the minimum and maximum values will signal any out-of-range data values. In addition, if the number of valid observations is suspiciously small for a variable, it should be explored carefully. Since Descriptives provides only summary statistics, it will not indicate which observation contains an out-of-range value, but that can be easily determined once the data value is known. The Data Validation procedure in the Data Preparation add-on module can be used to check for specific values or ranges of values in a variable and list the violating cases. We will demonstrate that procedure in the next section.


Click Analyze…Descriptive Statistics…Descriptives
Move all variables except ID into the Variable(s) list box (use shift-click to select all variables and then ctrl-click on ID to de-select it)

Figure 2.5 Descriptives Dialog Box

Only numeric variables appear in the variable list box. Click OK
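As with Case Summaries, the same request can be written in syntax. A minimal sketch appears below; note that the variable list is abbreviated here to a few of the GSS variables discussed in this chapter, whereas the dialog box request above includes all variables except ID.

DESCRIPTIVES VARIABLES=AGE EDUC EMAILHR TVHOURS CONFINAN
  /STATISTICS=MEAN STDDEV MIN MAX.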


Figure 2.6 Descriptives Output (Beginning)

We can see the minimum, maximum and number of valid cases for each variable in the data set. By examining such variables as EDUC (Highest Year of School Completed), EMAILHR (Email Hours per Week) and AGE (Age of Respondent) we can determine if there are any unexpected values. The maximum for EMAILHR looks rather high (50) and we might want to investigate this further. Note that all of the "Confidence" variables have a maximum value of 3 except for the Confidence in banks & financial institutions. We will investigate this further later in the chapter.


Figure 2.7 Descriptives Output (End) Showing Valid Listwise

The valid number of observations (Valid N) is listed for each variable. The number of valid observations listwise indicates how many observations have complete data for all variables, a useful bit of information. Here it is zero because not all questions are asked of, nor are relevant to, any single individual. If unusual or unexpected values are discovered in these summaries we can locate the problem observations using data selection (Data...Select Cases on the menu) or the Find function (under Edit menu) in the Data Editor window. Or, we can use the Data Preparation add-on module data validation feature to define rules and clean the data.
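For example, if Descriptives shows an impossible maximum for CONFINAN (whose valid codes are 1, 2, and 3), a short syntax sketch such as the following would list the offending case or cases. This is only an illustration, under the assumption that the bad value lies above the valid range; the TEMPORARY command limits the selection to the LIST procedure that follows it, so the working data are unchanged.

TEMPORARY.
SELECT IF (CONFINAN > 3).
LIST VARIABLES=ID CONFINAN.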

2.4 Data Validation: Data Preparation Add-On Module

The task of data checking and fixing becomes more complicated and time-consuming as data files, and data warehouses, grow ever larger, so more automatic methods to create a "clean" data file are helpful. The SPSS Data Preparation add-on module allows you to identify errors in data values/variables, excessive missing data in variables or cases, or unusual data values in your data file. Both data errors and unusual values can influence reporting and data analysis, depending on their frequency of occurrence and actual values. The Data Preparation module contains two procedures for data checking:

• Validate Data helps you define rules and conditions to run checks to identify invalid values, variables, and cases.
• Anomaly Detection searches for unusual cases based on deviations from the norms (of cluster groups). The procedure is designed to quickly detect unusual cases in the exploratory data analysis step, prior to any inferential data analysis.

Note: The Data Preparation module also includes the Optimal Binning procedure which defines optimal bins for one or more scale variables based on a categorical variable that “supervises” the process. The binned variable can then be used instead of the original data values for further analysis. This procedure is discussed in the Data Management and Manipulation with SPSS Statistics course.


In this chapter we will use the Validate Data procedure (VALIDATEDATA in syntax), which is the basis for all data cleaning. As with the data transformation facilities in SPSS Statistics, Validate Data requires user input from you to be effective. You need to review the variable definitions in your file and determine valid values. This also includes cross-variable rules (e.g., no customers should rate a product they don't own), or combinations of values that are commonly miscoded. You then need to create the equivalent rules in the Validate Data dialog box and apply them to your data file. The more effort you put into the rules, the more the payoff in cleaner data.

2.5 Data Validation Rules

A rule is used to determine whether a case is valid. There are two types of validation rules:

• Single-variable rules – Single-variable rules consist of a fixed set of checks that apply to a single variable, such as checks for out-of-range values. For single-variable rules, valid values can be expressed as a range of values or a list of acceptable values.
• Cross-variable rules – Cross-variable rules are user-defined rules that are commonly applied to a combination of variables. Cross-variable rules are defined by a logical expression that flags invalid values.

There are also basic checks available which look for problems in individual variables. These checks detect excessive missing data, a minimal amount of variation in values (small standard deviation), or many cases with the same value (data “heaping”). Validation rules are saved to the data dictionary of your data file. This allows you to specify a rule once and then reuse it. Rules can also be used in other data files (through the Copy Data Properties facility).

Creating Single-Variable Rules

We'll open the Validate Data dialog box, review its options, and then create some single-variable rules in this section. Validate Data is accessed from the Data menu.

Click Data…Validation

There are three menu selections under Validation. The last choice (Validate Data) opens the complete dialog box that allows you to define rules and then apply them to the active dataset. The first two choices (Load Predefined Rules and Define Rules) allow you to load rules from an existing data file shipped with SPSS Statistics, or to simply define rules for a set of variable/cases without necessarily applying them to the data. We’ll say a bit more about these near the end of the chapter.


Figure 2.8 Validation Menu Choices

Click Validate Data

The first step in using Validate Data is to specify the variables we wish to check (Analysis Variables:) with single variable rules and basic variable checks. Optionally, you can also select one or more case identification variables to check for duplicate or incomplete IDs, and to label casewise output. Figure 2.9 Validate Data Variable Tab

In this example we’ll define rules for only a few variables, but we’ll do basic checks on all of them.


Move all the variables except ID to the Analysis Variables list
Move the variable ID to the Case Identifier Variables: list (not shown)
Click Basic Checks tab

The Basic Checks tab allows you to select basic checks for analysis variables, case identifiers, and whole cases. You can perform the following data checks on the variables selected on the Variables tab.

• Maximum percentage of missing values
• Maximum percentage of cases in a single category for categorical (nominal, ordinal) variables
• Maximum percentage of categories with a count of 1 for categorical variables
• Minimum coefficient of variation for scale variables
• Minimum standard deviation for scale variables

Additionally, if you selected any case identifier variables on the Variables tab, you can flag incomplete IDs (values for ID variables which are missing or blank). You can also flag duplicate IDs in the file. Finally, you can flag empty cases, where all variables are empty or blank. Figure 2.10 Validate Data Basic Checks Tab

We’ll use the default settings for the basic checks. Click the Single-Variable Rules tab


Figure 2.11 Validate Data Single-Variable Rules Tab

The Single-Variable Rules tab displays available single-variable validation rules and allows you to apply them to analysis variables. There are none defined yet. The list shows analysis variables, summarizes their distributions (with a bar chart or histogram), lists the minimum and maximum values, and shows the number of rules applied to each variable. Note that user- and system-missing values are not included in the summaries. The Display drop-down list controls which variables are shown. You can choose from all variables, numeric variables, string variables, or date variables. We need to define some rules so we can apply them to the analysis variables, and we do this in the Define Rules dialog box.

Click Define Rules…


Figure 2.12 Validate Data Define Single-Variable Rules Dialog

When the dialog box is opened it shows a placeholder rule named SingleVarRule 1 (you can have spaces in rule names). Rules can be defined for numeric, string, or date variables. The selections change somewhat depending on the variable type. With the exception of the variable GENDER, the GSS2004 data has only numeric variables, so we'll concentrate on that type in this example. Rules must have a unique name (including the set of cross-variable rules). Valid values can be defined as either falling within a range, or in a specific list of values (selected from the Valid Values: dropdown). Noninteger values are acceptable by default. Also by default, missing values will be included as valid values. This doesn't imply that they aren't flagged as missing. Otherwise, though, missing values would be flagged as invalid, which is probably inappropriate for user-missing values. If you don't expect any blank values in a numeric variable, you might wish to uncheck the Allow system-missing values check box. In practice, you would check all of your variables. To illustrate, we make checks for some of the variables:

Number of TV hours watched per day: logically shouldn't be above 12 hours
"Confidence" variables (CONARMY to CONSCI): should be 1, 2, or 3

Note: The General Social Survey has been "cleaned" so we would not expect to find coding or entry errors.


Before proceeding, note that you should make a similar list for all the variables in the file you wish to validate. Then you can see which rules need to be defined, and which rules can be used for multiple variables. We could use the Within a range choice for all these, but to demonstrate the In a list option we'll use that for the "Confidence" variables.

Change the Name text to TVHours Outliers
Enter 0 for the Minimum value and 12 for the Maximum value

Figure 2.13 TVHours Outliers Rule Defined

The rule is automatically stored; you don't need to click OK to create it. We can simply define the next rule.

Click New
Change the Name text to Confidence
Click the Valid Values dropdown and select In a list
Enter the values 1, 2, and 3 on successive rows (shown in Figure 2.14)


Figure 2.14 Confidence Rule Defined

Once you have defined all the single-variable rules you need, you must select the variables to which each rule applies. Click Continue

To apply the rules, select one or more variables and check all rules that you want to apply in the Rules list in the Single-Variable Rules tab. We see the two rules that we just defined in the Rules list. They are applied by selecting the variable(s) to which they apply, and then clicking on the check box. More than one rule can be applied to a variable and a rule can be applied to more than one variable. The Rules list shows only rules that are appropriate for the selected analysis variables. These rules are available for all of the numeric variables, but none of them will be listed for the string variable, GENDER. We will now set the rules for the variables of interest.

Click on the variable TVHOURS in the Analysis Variables list
With TVHOURS selected, click check box for TVHours Outliers rule


Figure 2.15 TVHours Outlier Rule Applied

We want to apply the Confidence rule to the set of variables asking about confidence with various organizations.

Select all variables from CONARMY to CONSCI
Click the Confidence rule (not shown)
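Before running Validate Data, the same two checks can be approximated with Base syntax alone, which is sometimes a useful cross-check. The sketch below is only an illustration, not part of the course steps, and the flag variable names are invented for this example.

* Flag cases outside the 0-12 range for TVHOURS.
COMPUTE tv_viol = (~MISSING(TVHOURS) AND (TVHOURS < 0 OR TVHOURS > 12)).
* Flag cases where CONFINAN is present but is not 1, 2, or 3.
COMPUTE conf_viol = (~MISSING(CONFINAN) AND ~ANY(CONFINAN,1,2,3)).
FREQUENCIES VARIABLES=tv_viol conf_viol.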

Creating Cross-Variable Rules

In most data sets certain relations must hold among variables if the data are recorded properly. This is especially true with surveys containing filter questions or skip patterns. For example, if someone is not currently married, then his/her happiness with marriage should have a missing code. Such relations can be easily checked with cross-variable rules. A cross-variable rule is defined by creating a logical expression that will evaluate to true or false (1 or 0). The expression should evaluate to 1 for the invalid cases. The logic of the cross-variable rule will depend on whether some of the key data are missing or not.

Figure 2.16 depicts the relationship between marital status and happiness with marriage. Unlike normal crosstab tables, the missing data categories are also displayed. In order to display this table, we needed to include the missing data in the crosstab. To accomplish this, we recoded the system-missing values in HAPMAR to zero (0). To include user-missing values, you would need to temporarily remove the property of user-missing or paste the syntax for CROSSTABS and add the subcommand MISSING=INCLUDE, as in:


CROSSTABS
  /TABLES=MARITAL BY HAPMAR
  /MISSING=INCLUDE
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT
  /COUNT ROUND CELL.
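A hedged sketch of the recoding step described above: rather than overwriting HAPMAR, it is safer to copy the recoded values into a new variable and crosstabulate that copy. The variable name hapmar_chk and its value label are invented for this illustration.

RECODE HAPMAR (SYSMIS=0) (ELSE=COPY) INTO hapmar_chk.
VALUE LABELS hapmar_chk 0 'Not asked (system-missing)'.
CROSSTABS /TABLES=MARITAL BY hapmar_chk /CELLS=COUNT.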

For ease of interpretation, we also displayed the data values along with the value labels by changing the Edit…Options, Output Labels. Figure 2.16 Relationship between Marital Status and Happiness with Marriage (All Cases)

The Happiness with Marriage question was only asked of a subset (682) of the married respondents. Thus, several married respondents are also system-missing on the happiness question because it wasn't asked. Those married respondents who were asked the question but didn't answer should be coded "No Answer" on the happiness question; there are six married respondents who didn't answer. But, if a respondent is not currently married, they cannot provide a valid response to the happiness question. Note that we have introduced an error, circled on the table; a never married person is incorrectly coded as "Very Happy" for happiness with marriage. This cross-variable check is more easily accomplished by defining a cross-variable rule.

Click Cross-Variables Rule tab
Click Define Rules pushbutton

Figure 2.17 Validate Data: Define Cross-Variable Rules Dialog

The dialog box looks somewhat similar to the Compute dialog box. Although this tab is intended for cross-variable rules, a rule defined here can also reference only a single variable; this can be necessary when a rule is more complex than the options on the Single-Variable Rules tab allow. The same functions available in Compute are available here. There is a placeholder rule named CrossVarRule 1 by default. The logical expression is created in the Logical Expression text box, either by typing it in directly or by using the variable list, calculator buttons, and functions. Change the text in the Name box to MaritalHappiness

The married respondents are coded 1 on the variable MARITAL. As shown above, the married respondents can have any value on the HAPMAR variable. However, all other values of MARITAL must be system-missing on HAPMAR. Thus the logical condition that finds invalid cases is: MARITAL is not equal to 1 and HAPMAR is not system-missing. In order to express this, we will use the SYSMIS function as in MARITAL ~= 1 & ~(SYSMIS(HAPMAR)), where the symbol ~ means "not".
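Before defining the rule in the dialog, you could also test the same logic with a quick ad hoc syntax check; this is just a sketch outside the Validate Data procedure (the flag name badmarhap is ours). Note that cases where MARITAL itself is missing will come out missing, not 0, on the flag.

* Flag cases that are not married but have a non-missing happiness response.
COMPUTE badmarhap = (MARITAL ~= 1 & ~SYSMIS(HAPMAR)).
FREQUENCIES VARIABLES=badmarhap.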

Click the variable MARITAL
Click Insert (Alternatively, drag and drop MARITAL to the Logical Expression text box.)
Click the Not Equal button or type ~=
Leave a space and type 1
Click the Ampersand (And) button or type & (be sure to leave spaces around the ampersand)
Click the Not button or type ~

Select Missing Values from the Display dropdown list in the Functions and Special Variables area
Click Sysmis in the Functions and Special Variables list
Click Insert below the Function: list
Click the variable HAPMAR
Click Insert below the Variables: list

The dialog box should now look like Figure 2.18 below. Figure 2.18 Cross-Variable Rule for MaritalHappiness

Click Continue

When you return to the Cross-Variable Rules tab, the rule you just defined is listed (not shown), with the Apply check box turned on (so the rule will be applied).

We could run the procedure now, but we first examine the settings on the Output and Save tabs. Click Output tab

Figure 2.19 Validate Data Output Tab

This tab controls what output is created in the Viewer window. Violations can be listed for each case (the default), using both single- and cross-variable rules. In large files, the maximum setting for the number of cases in the report should be increased above 100 (the upper limit is 1,000). Also by default, violations will be displayed by variable for all single-variable rules. You can also, by checking Summarize violations by rule, request a summary of violations by rule. A check box (Move cases with validation rule…) will move cases with violations to the top of the data file so they are easier to locate. In a large file this may be especially helpful. We’ll use the default settings. You can also request to create new variables that will flag cases with violations. This is requested from the Save tab. Click Save tab

Figure 2.20 Validate Data Save Tab

Options are available to save a flag variable (coded 0 and 1) that identifies a case with no data, duplicate IDs, and incomplete IDs. Another choice creates a variable that counts the number of rule violations (single- and cross-variable) for each case. This can be useful in detecting cases that have major problems. The Save indicator variables that record all validation rule violations check box will create a flag variable for every rule you have applied. It will record whether that rule was violated for each case in the data. Although this option can create a large number of new variables, it makes it easy to locate cases with violations. We’ll select this option to see its effect as well as the variable of the count of rule violations for each case. Click Validation rule violations check box in Summary Variables list Click Save indicator variables that record all validation rule violations check box

We are ready to validate the GSS2004 data. Click OK

The first table of output, displayed in Figure 2.21, is seen in almost every application of Validate Data. It simply tells us that not every possible problem was found in the data. Recall that we asked for the basic checks, including such problems as excessive missing data (above 70%), standard deviations of 0, and so forth.

Figure 2.21 Warning Message from Validate Data

Warnings
Some or all requested output is not displayed because all cases, variables, or data values passed the requested checks.

The next table reports the violations of these basic variable checks.

Figure 2.22 Variable Checks Table of Basic Check Violations

Variable Checks
Categorical - Cases Missing > 70%:
  Taking things all together, how would you describe your marriage? Would you say that your marriage
  Visited web site for News and current events in past 30 days
  DOES GUN BELONG TO R
  Visited web site for Travel Information in past 30 days
  How would you rate your ability to use the World Wide Web?
Categorical - Cases Constant > 95%:
  A DEATH OF SPOUSE
  A DEATH OF CHILD
  A DEATH OF PARENTS
  INFERTILITY, UNABLE TO HAVE A BABY
  DRINKING PROBLEM
  CHILD ON DRUGS, DRINKING PROBLEM
Scale - Cases Missing > 70%:
  EMAIL HOURS PER WEEK
  EMAIL MINUTES PER WEEK

Each variable is reported with every check it fails.

Because several questions were asked of only a subset of cases, a few variables are in violation of the basic rule of excessive missing data (above 70%). Note that the violations of this rule are reported separately for categorical and scale variables as you have defined them in the Measurement Level variable property. Additionally, six variables were reported in violation of the basic constant rule (greater than 95% of the cases in one category). These six variables asked people to report (Yes or No) whether they had each of these events occur in the last year. You would expect these events to be relatively rare in the general population, as reported. The next table, Rule Description, (not shown) lists the single-variable rules we applied that had at least one rule violation.

The Variable Summary table summarizes single-variable rule violations. We see that two variables have rule violations. The Confidence rule was violated by one case on one variable. Eleven cases watched TV more than 12 hours a day. The details of the violations are not listed. If these data had not been previously cleaned, we would likely want to check these cases to determine whether the values were misentered. Or, we might want to investigate whether the values make sense given the values of other key variables for each case. You should expect more violations if your data have not been previously checked against rules like these.

Figure 2.23 Variable Summary of Rule Violations

Variable Summary
Variable                                    Rule               Number of Violations
CONFID IN BANKS & FINANCIAL INSTITUTIONS    Confidence                 1
  Total                                                                1
HOURS PER DAY WATCHING TV                   TVHours Outliers          11
  Total                                                               11

The next table, Cross-Variable Rules, shows that the MaritalHappiness rule was violated for one case.

Figure 2.24 Cross-Variable Rules Violations

Cross-Variable Rules
Rule                Number of Violations    Rule Expression
MaritalHappiness            1               MARITAL ~= 1 & ~ SYSMIS(HAPMAR)

The last table, shown in Figure 2.25, summarizes violations by case. There were 13 total rule violations, and they all occurred on separate cases (so no case had more than one violation). The MaritalHappiness violation, for instance, occurred on case ID 3. The Confidence rule is violated for case ID 4, and so forth. We can now easily review those cases to see the problems.

Figure 2.25 Case Report Summary of All Rule Violations

Case Report: Validation Rule Violations
(Identifier: RESPONDENT ID NUMBER; here the identifier equals the case number)

Case     Single-Variable(a)       Cross-Variable
3                                 MaritalHappiness
4        Confidence (1)
114      TVHours Outliers (1)
661      TVHours Outliers (1)
1260     TVHours Outliers (1)
1512     TVHours Outliers (1)
1570     TVHours Outliers (1)
1604     TVHours Outliers (1)
1947     TVHours Outliers (1)
2100     TVHours Outliers (1)
2280     TVHours Outliers (1)
2401     TVHours Outliers (1)
2679     TVHours Outliers (1)

a. The number of variables that violated the rule follows each rule.

Now we can view the new variables added to the file.

Return to the Data Editor
Click the Data View tab if necessary
Scroll to the right until the end of the variables
Click on the Value Labels tool to turn off the value labels (if necessary)

As shown in Figure 2.26, eleven variables were added to the file. They are coded 0 (no violation) or 1 (violation). We applied single-variable rules to 9 variables, and there is a new flag variable for each of these 9. Then there is a flag variable for the one cross-variable rule, and another variable (ValidationRuleViolations) which provides a total count of the number of violations for each case. We have marked two of the cases in violation on the Figure; case 3 is in violation on the MaritalHappiness cross-variable rule and case 4 is in violation of the Confidence rule on the CONFINAN variable. And, each case had only one rule violation. In this way, you can easily locate cases with particular problems to continue data cleaning.
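To pull up just the problem cases, you can temporarily select on the violation-count variable and list the variables involved. A hedged sketch (we assume the respondent identifier is named ID and the count variable is named ValidationRuleViolations, as described above; check the actual names in the Variable View):

* Restrict the next procedure to cases with at least one violation.
TEMPORARY.
SELECT IF (ValidationRuleViolations > 0).
* List the identifier and the variables involved in the rules we applied.
LIST VARIABLES=ID MARITAL HAPMAR CONFINAN TVHOURS.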

Figure 2.26 New Flag Rule Violation Variables

Applying Validation Rules Validation rules are stored in the data dictionary of a data file. This means that they will be available in the future if a file is saved after defining a set of rules. In addition, rules can be applied from one file to another using the Copy Data Properties facility. SPSS Inc. supplies a file along with the software (in the Statistics17/lang sub-directories) that contains many predefined validation rules for common problems such as range specifications for variables coded 0 and 1. You can load these rules into your file first, then supplement them with additional rules you define. Or you could add additional rules that you commonly use to the predefined file so they will be immediately available for all files. The predefined rule file is accessed from the Data...Validation…Load Predefined Rules menu. The file name is Predefined Validation Rules.sav.

2.6 When Data Errors Are Discovered

If errors are found, the first step is to return to the original survey or data source. Simple clerical errors are merely corrected. In some instances errors on the part of respondents can be corrected based on their answers to other questions. Or, systematic errors can be discovered and recoded appropriately. If your data have been retrieved from an organizational database, the database administrator is often helpful in identifying the reasons for the problem. If these approaches are not possible, the offending items can be coded as missing responses and will be excluded from analyses. While beyond the scope of this course, there are techniques that substitute estimated data values for missing responses. For a discussion of such methods see Allison (2002) or Burke and Clark (1992).
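Two common syntax patterns for the last resort of coding offending items as missing are sketched below; the variable names, codes, and cutoffs are purely hypothetical and should come from your own codebook, not from this sketch.

* Declare specific codes as user-missing (the values remain visible in the data).
MISSING VALUES ETHNIC (97 98 99).
* Or overwrite an impossible value with the system-missing value.
IF (TVHOURS > 24) TVHOURS = $SYSMIS.
EXECUTE.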

Having cleaned the data, we can now move to the more interesting part of the process, data analysis.

SPSS Missing Values Add-on Module The SPSS Missing Values Add-on Module is very useful in both describing missing data patterns across cases and variables as well as imputing (substituting) values for missing data. You can use this add-on option to produce various reports and graphs describing the frequency and pattern of missing data. It also provides methods for estimating (imputing) values for missing data. As of SPSS Statistics 17, you can perform multiple imputation of missing values which allows you to use multiple variables to more accurately estimate (impute) missing data.


Summary Exercises

These exercises use the GSS2004Intro.sav data file. Open this data file and close the GSS2004PreClean.sav file. Or, exit SPSS Statistics and read this data file before beginning the exercises.

1. Examine the value labels (click Utilities…Variables) for a few of the variables in GSS2004Intro.sav and compare these ranges to the results in the Descriptives table. For example, ETHNIC has a maximum value of 97. Is this a valid value?

2. Using the Validate Data procedure, write a single-variable rule to list all cases with values greater than 30 on the EMAILHR variable. How many cases are identified? Examine values of other variables for these cases (use the Data Editor display). Do these values seem reasonable?

3. Define a cross-variable rule to check that the total number of persons in the household (HHSIZE) is equal to the sum of HHBABIES, HHPRETEEN, HHTEENS, and HHADULTS. Save the rule violation indicator variables. HINT: Using the SUM function in the expression has different results than using arithmetic addition. Why? And why would you use one method versus the other? Try both!


Chapter 3: Describing Categorical Data

Topics:
• Why Summaries of Single Variables?
• Frequency Analysis
• Standardizing the Bar Chart Axis
• Pie Charts

Data
This chapter uses the data file GSS2004Intro.sav.

Scenario
We are interested in exploring relationships between some demographic variables (highest educational degree attained, gender) and some belief/attitudinal/behavioral variables (belief in an afterlife, use of a computer). Prior to running these two-way analyses (considered in Chapter 6) we will look at the distribution of responses for several of these variables. This can be regarded as a preliminary step before performing the main crosstabulation analysis of interest, or as an analysis in its own right. There might be considerable interest in documenting what percentage of the U.S. (non-institutionalized) adult population believes in an afterlife. In addition, we will look at the frequency distributions of work status and marital happiness.

3.1 Why Summaries of Single Variables? Summaries of individual variables provide the basis for more complex analyses. There are a number of reasons for performing single variable (univariate) analyses. One would be to establish base rates for the population sampled. These rates may be of immediate interest: What percentage of our customers is satisfied with services this year? In addition, studying a frequency table containing many categories might suggest ways of collapsing groups for a more succinct, striking and statistically appropriate table. When studying relationships between variables, the base rates of the separate variables indicate whether there is a sufficient sample size (discussed in more detail in Chapter 5) in each group to proceed with the analysis. A second use of such summaries would be as a data-checking device—unusual values would be apparent in a frequency table. The Level of Measurement of a variable determines the appropriate summary statistics, tables, and graphs to describe the data. Table 3.1 summarizes the most common summary measures and graphs for each of the measurement levels and SPSS Statistics procedures that can produce them.

Table 3.1 Summary of Descriptive Statistics and Graphs

                               NOMINAL                   ORDINAL                      SCALE
Definition                     Unordered Categories      Ordered Categories           Metric/Numeric Values
Examples                       Labor force status,       Satisfaction ratings,        Income, height, weight
                               gender, marital status    degree of education
Measures of Central Tendency   Mode                      Mode, Median                 Mode, Median, Mean
Measures of Dispersion         N/A                       Min/Max/Range,               Min/Max/Range, IQR,
                                                         InterQuartile Range (IQR)    Standard Deviation/Variance
Graph                          Pie or Bar                Pie or Bar                   Histogram, Box & Whisker,
                                                                                      Stem & Leaf
Procedures                     Frequencies               Frequencies                  Frequencies, Descriptives,
                                                                                      Explore

In this chapter, we will review tables and graphs appropriate for describing categorical (nominal and ordinal) variables. Techniques for exploring scale (interval and ratio) variables will be reviewed in Chapter 4. The most common technique for describing categorical data is a frequency analysis which provides a summary table indicating the number and percentage of cases falling into each category of a variable as well as the number of valid and missing cases. To represent this information graphically we use bar or pie charts. In this chapter we run frequency analyses on several questions from the General Social Survey and construct charts to accompany the tables. We discuss the information in the tables and consider the advantages and disadvantages in standardizing bar charts when making comparisons across charts.

3.2 Frequency Analysis We begin by requesting frequency tables and bar charts for five variables: labor force status, WRKSTAT; highest education degree earned, DEGREE; belief in an afterlife, POSTLIFE; computer use, COMPUSE and happiness with marriage, HAPMAR. Requests for bar charts can be made from the Frequencies dialog box, or through the Graphs menu. We begin by opening the 2004 General Social Survey data in the Data Editor:

Click File…Open…Data
Click GSS2004Intro.sav and click Open (not shown)
Click Analyze…Descriptive Statistics…Frequencies
Move WRKSTAT, DEGREE, POSTLIFE, COMPUSE, and HAPMAR into the Variable(s): list box

Figure 3.1 Frequencies Dialog Box

Note that three of these variables, WRKSTAT, POSTLIFE, and COMPUSE are defined as nominal variables; DEGREE and HAPMAR are defined as ordinal variables. After placing the variables in the list box, we use the Charts option button to request bar charts based on percentages. Click the Charts option button Select Bar charts in the Chart Type area Select Percentages in the Chart Values area

Figure 3.2 Frequencies: Charts Dialog Box

Click Continue Click the Format option button on the Frequencies dialog Click Organize output by variables in the Multiple Variables area

Figure 3.3 Frequencies Format Dialog Box

We choose to organize the output by variables which will display the frequency table followed by the bar chart for each variable. The default would display the frequency tables for all of the variables followed by the bar charts for all of the variables. Other options on the Format dialog box allow you to change the display order of the categories in the frequency tables and charts and suppress large frequency tables for variables with more than the specified number of categories. You also can make additional format changes to the tables and charts by using the pivot table editor and chart editor. Click Continue Click OK
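If you prefer syntax, pasting these dialog choices produces a FREQUENCIES command along the following lines (the pasted version may include additional default subcommands):

FREQUENCIES VARIABLES=WRKSTAT DEGREE POSTLIFE COMPUSE HAPMAR
  /BARCHART PERCENT
  /ORDER=VARIABLE.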

We now examine the tables and charts looking for anything interesting or unusual.

Frequencies Output

We begin with a table based on labor force status. By default, value labels appear in the first column and, if labels were not supplied, the data values display. Tables involving nominal and ordinal variables usually benefit from the inclusion of value labels. Without value labels we wouldn't be able to tell from the output which number (data value) stands for which work status category. Sometimes you want to display both the data value and the label. You can do this under Edit…Options by changing the Pivot Table Labeling option on the Output Labels tab. The Frequency column contains counts, i.e., the number of occurrences of each data value. The Percent column shows the percentage of cases in each category relative to the number of cases in the entire data set, including those with missing values. One case did not answer this question (NA) and has been flagged as a user-missing value. This case is excluded from the Valid Percent calculation. Thus the Valid Percent column contains the percentage of cases in each category relative to the number of valid (non-missing) cases. Cumulative percentage, the percentage of cases whose values are less than or equal to the indicated value, appears in the Cumulative Percent column. With only one case containing a missing value, the Percent and Valid Percent columns are nearly identical; editing the frequencies pivot table to display the percentages with greater precision would reveal the difference in the second decimal position.

Figure 3.4 Frequency of Labor Force Status

LABOR FRCE STATUS
                               Frequency   Percent   Valid Percent   Cumulative Percent
Valid    WORKING FULLTIME         1466       52.1        52.2              52.2
         WORKING PARTTIME          320       11.4        11.4              63.5
         TEMP NOT WORKING           80        2.8         2.8              66.4
         UNEMPL, LAID OFF           99        3.5         3.5              69.9
         RETIRED                   403       14.3        14.3              84.2
         SCHOOL                    115        4.1         4.1              88.3
         KEEPING HOUSE             266        9.5         9.5              97.8
         OTHER                      62        2.2         2.2             100.0
         Total                    2811      100.0       100.0
Missing  NA                          1         .0
Total                             2812      100.0

Examine the table. Note the disparate category sizes. Over half of the sample is working full time. And, there are four categories that each has less than 5% of the cases. Before using this variable in a crosstabulation analysis, you should consider combining some of the categories with fewer cases. Decisions about collapsing categories usually have to do with which groups need to be kept distinct in order to answer the research question asked, and the sample sizes for the groups. For example, you might combine the "temporarily not working" and "unemployed laid off" depending on the intent of your analysis. However, if those temporarily not working are of specific interest to your study, you would want to leave them as a separate group. What are some other meaningful ways in which you might combine or compare categories?
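As an illustration of one possible (not the only) collapsing, the RECODE below combines the two working groups, the two not-working/unemployed groups, and everyone not in the labor force into a new three-category variable. The numeric codes assumed here (1 through 8 in the order shown in Figure 3.4) and the grouping itself are our assumptions; check the actual codes with Utilities…Variables before recoding.

RECODE WRKSTAT (1 2=1) (3 4=2) (5 THRU 8=3) INTO WRKSTAT3.
VALUE LABELS WRKSTAT3 1 'Working' 2 'Temporarily not working/unemployed' 3 'Not in labor force'.
FREQUENCIES VARIABLES=WRKSTAT3.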

Next we view a bar chart based on the labor force variable. Does the picture make it easier to understand the distribution?

Figure 3.5 Bar Chart of Labor Force Status

The disparities among the labor force status categories are brought into focus by the bar chart. We next turn to highest education degree earned.

Figure 3.6 Frequency Table of Educational Degree

RS HIGHEST DEGREE
                           Frequency   Percent   Valid Percent   Cumulative Percent
Valid    LT HIGH SCHOOL        364       12.9        12.9              12.9
         HIGH SCHOOL          1435       51.0        51.0              64.0
         JUNIOR COLLEGE        224        8.0         8.0              72.0
         BACHELOR              507       18.0        18.0              90.0
         GRADUATE              281       10.0        10.0             100.0
         Total                2811      100.0       100.0
Missing  DK                      1         .0
Total                         2812      100.0

Figure 3.7 Bar Chart of Highest Educational Degree

There are some interesting peaks and valleys in the distribution of the respondent's highest degree. Again over half of the people fall into one category, high school graduates. Depending on your research, you might want to collapse some of the categories. Can you think of a sensible way of collapsing DEGREE into fewer categories? Or reasons why you would not? Next, we look at the dichotomous variable, POSTLIFE.

Figure 3.8 Frequency Table of Belief in Afterlife

BELIEF IN LIFE AFTER DEATH
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid    YES              958       34.1        81.8              81.8
         NO               213        7.6        18.2             100.0
         Total           1171       41.6       100.0
Missing  DK               154        5.5
         System          1487       52.9
         Total           1641       58.4
Total                    2812      100.0

Figure 3.9 Bar Chart of Belief in Afterlife

The great majority of respondents (81.8%) do believe in life after death (though we suspect that “life after death” means different things to different people). It might be interesting to look at the relationship between this variable and educational degree: to what extent is belief in an afterlife related to level of education? The frequency tables we are viewing display each variable independently. To investigate the relationship between two categorical variables we will turn to crosstabulation tables in Chapter 6. POSTLIFE has two missing value categories. The first missing code (DK) represents a response of "Don't Know" and is very commonly used as a response in survey questions. This question was not asked of all respondents so they were left blank in the data and SPSS Statistics converted the blanks to the system-missing value. So, the second missing category represents those who were not asked the question. These codes are excluded from consideration in the “Valid Percent” column of the frequency table, as well as from the bar chart, and would also be ignored if any additional statistics were requested.

Figure 3.10 Frequency Table of Computer Use

R USE COMPUTER
                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid    YES              723       25.7        73.9              73.9
         NO               255        9.1        26.1             100.0
         Total            978       34.8       100.0
Missing  NA                 6         .2
         System          1828       65.0
         Total           1834       65.2
Total                    2812      100.0

Figure 3.11 Bar Chart of Computer Use

Almost three-quarters of the respondents asked use the computer regularly. However, the percentage of computer users might well be related to other demographic variables such as degree or gender. Or others? Note that this question was asked of only 35% of the respondents. Of those who were asked the question, 6 people did not give an answer (NA). This group is flagged as user-missing and both groups are excluded from the bar chart.

Figure 3.12 Frequency Table for Happiness of Marriage

Taking things all together, how would you describe your marriage? Would you say that your marriage
                          Frequency   Percent   Valid Percent   Cumulative Percent
Valid    VERY HAPPY           417       14.8        61.7              61.7
         PRET.HAPPY           234        8.3        34.6              96.3
         NOT TOO               25         .9         3.7             100.0
         Total                676       24.0       100.0
Missing  NO ANSWER              6         .2
         System              2130       75.7
         Total               2136       76.0
Total                        2812      100.0

Figure 3.13 Bar Chart for Happiness of Marriage

About two-thirds of those married are very happily married. Of the rest, most say they are pretty happy. A very small percentage (3.7%) of respondents is not too happy. However, remember this question was only asked of a sample of those currently married. How might this influence your interpretation of the percentages?

3.3 Standardizing the Chart Axis

If we glance back over the last few bar charts we notice that the scale axis, which displays percents, varies across charts. This is because the maximum value displayed in each bar chart depends on the percentage of respondents in the most popular category. Such scaling permits better use of the space within each bar chart but makes comparison across charts more difficult. Percentaging is itself a form of standardization, and bar charts displaying percentages as the scale axis were requested in our analyses. Charts can be further normed by forcing the scale axis (the axis showing the percents) in each chart to have the same maximum value. This facilitates comparisons across charts, but can make the details of individual charts more difficult to see. There are at least two methods within SPSS Statistics for standardizing the scale axis.

1. Chart Template: You can edit the scale axis range and other characteristics of a chart, save the edited chart as a chart template, then apply the chart template to existing charts or to all charts being built.

2. Chart Builder: You can edit the properties of the scale axis within the Chart Builder when you build the chart.

We will illustrate this by reviewing two of the previous bar charts (labor force status and belief in afterlife) and requesting that the maximum scale value be set to 100 (100%).

Creating and Using a Chart Template

First, we edit the COMPUSE chart, save a chart template, and apply it to the WRKSTAT chart. To edit the scale axis maximum to 100%:

Double click the R Use Computer chart to open the Chart Editor
Click on the scale axis values (Y axis)
Click Edit…Properties or the tool button to open the Properties dialog (if necessary)
Click the Scale tab in the Properties dialog box
Click the Maximum checkbox to deselect it, then set its value to 100
Click Apply, then Close

The scale axis of the chart will change to the range of 0 to 100.

Figure 3.14 Bar Chart of Belief in Afterlife with Edited Scale Axis

We could edit many other elements of the scale axis, such as displaying a percent sign or displaying decimal positions. Or we could edit other elements of the chart, all of which can be saved in a chart template. We will limit our editing to this one change and save the chart template. Click File…Save Chart Template

Figure 3.15 Save Chart Template Dialog Box

You specify the specific elements that you want to save on the chart template. In our case, we will save just the Scale axes characteristics. Click Reset Select Scale axes Click Continue In the Save Template dialog, move to the C:\Train\Stats folder in the Look in: dropdown list (not shown) Type Ch3_bar100.sgt in the File Name textbox Click Save Close the Chart Editor window

To apply this chart template to the Labor Force Status chart, Double click the Labor Force Status chart to open the Chart Editor Click File…Apply Chart Template in the Chart Editor window In the Apply Template dialog, move to the c:\Train\Stats folder, then select Ch3_bar100.sgt Click Open

Figure 3.16 Bar Chart of Labor Force Status with Chart Template Applied

Note that the scale axes of both bar charts are now in comparable units so we can make direct comparisons based on the bar heights. This is the advantage of the percentage standardization. However, it works best if the variables have a similar number of categories or size of the most popular (largest) category. For example, multiple charts of satisfaction rating questions are best presented with a standardized scale. On the other hand, we can see that applying the 0 to 100 scale to the labor force variable with eight categories results in the same shape as the previous one but has shrunken the bars. As a result some detail is lost. Thus the advantage of standardizing the percentage scale must be weighed against potential loss of detail. In practice it is usually quite easy to decide which approach is better. Close the Chart Editor window

Using Chart Builder to Create the Bar Chart You can use the Chart Builder (Graphs…Chart Builder) to create bar charts and set the maximum value of the scale axis percentages to 100% initially. For example, to directly create a bar chart with a scale axis range 0 to 100 of the COMPUSE variable, we use the Chart Builder.

Click Graphs…Chart Builder
Click OK in the Information box (if necessary)
Click Reset (if necessary)
Click Gallery tab (if it's not already selected)
Click Bar in the Choose from: list
Select the first icon for Simple Bar Chart and drag it to the Chart Preview canvas
Drag and drop COMPUSE from the Variables: list to the X-Axis? area in the Chart Preview canvas
In the Element Properties dialog box (click the Element Properties button if this dialog box did not automatically open):
Select Percentage(?) from the Statistic: dropdown list
Click Apply
Select Y-Axis1 (Bar1) from the Edit Properties of: list
Uncheck the Maximum checkbox in the Scale Range area
Set the Maximum to 100
Click Apply

Figure 3.17 Chart Builder and Element Properties to Create Bar Chart

Click OK in the Chart Builder to build the bar chart

Figure 3.18 Bar Chart Created with Chart Builder

Note that the default display for the scale axis labels is to display the percents with one decimal place and the percent sign. We could have achieved this in the chart template as well by editing these features.

3.4 Pie Charts

Pie charts provide a second way of picturing information in a frequency table. You can produce pie charts from the Chart Builder or in the Frequencies Charts dialog box. To create a pie chart for labor force status from the Graphs menu:

Click Graphs…Chart Builder
Click OK in the Information box (if necessary)
Click Reset
Click Pie/Polar in the Choose from: list on the Gallery tab
Select the Pie chart icon and drag it to the Chart Preview canvas
Drag and drop WRKSTAT from the Variables: list to the Slices by? area in the Chart Preview canvas
In the Element Properties dialog box (click the Element Properties button if this dialog box did not automatically open):
Select Percentage(?) from the Statistic: dropdown list
Click Apply

Figure 3.19 Chart Builder and Element Properties to Create Pie Chart

Click OK in the Chart Builder to build the pie chart

Figure 3.20 Pie Chart of Labor Force Status

While the pie and bar charts are based on the same information, the structure of the pie chart draws attention to the relation between a given slice (here a group) and the whole. On the other hand, a bar chart leads one to make comparisons among the bars, rather than of any single bar to the total. You might keep these different emphases in mind when deciding which to use in your presentations. See Cleveland (1994), Tufte (2001), and Few (2004) for additional considerations in displaying data graphically.


Summary Exercises

Suppose we are interested in looking at the relationship between race (RACE) and HLTH1, whether you were ill enough to go to the doctor last year; NATAID, attitude toward spending on foreign aid; and NEWS, how frequently you read the newspaper. In addition, we wish to determine whether gender (GENDER) is related to these same variables.

1. First run a frequency analysis on these variables. Look at the distributions. Do you see any difficulties using these variables in a crosstabulation analysis? If so, is there an adjustment you would make?

2. Run Frequencies on the NATAID, NATENVIR, and NATCITY variables and request bar charts displaying percentages. Standardize the percentage scales to 0 to 100 with appropriate tick marks for all of the charts.

For those with extra time:

1. Run a frequency on WEBYR, the year in which you began using the web. This is coded with years in category ranges. How might you recode this variable before using it in a crosstab?

2. Create a new variable with the collapsed categories of WEBYR. If you wish, save the modified data in a data file named MyGSS2004.sav.


Chapter 4: Exploratory Data Analysis: Scale Data

Topics
• Summarizing Scale Variables
• Measures of Central Tendency and Dispersion
• Normal Distributions
• Histograms and Normal Curves
• Using the Explore Procedure: EDA
• Standard Error of the Mean and Confidence Intervals
• Shape of the Distribution
• Boxplots
• Appendix: Standardized (Z) Scores

Data
In this chapter, we continue to use the GSS2004Intro.sav file.

Scenario
One of the aims of our overall analysis is to compare demographic groups on hours per week using the Internet, the number of hours worked last week, and the amount of time spent watching TV each day. Before proceeding with the group comparisons in Chapter 7, we discuss basic concepts of the normal distribution and statistical measures to describe the distribution and look at summaries of these measures across the entire sample.

4.1 Summarizing Scale Variables In Chapter 3 we used frequency tables containing counts and percentages as the appropriate summaries for individual categorical (nominal) variables. If the variables of interest are interval scale we can expand the summaries to include means, standard deviations and other statistical measures. Counts and percentages may still be of interest, especially when the variables can take only a limited number of distinct values. For example, when working with a one to five point rating scale we might be very interested in knowing the percentage of respondents who reply “Strongly Agree.” However, as the number of possible response values increases, frequency tables based on interval scale variables become less useful. Suppose we asked respondents for their family income to the nearest dollar? It is likely that each response would have a different value and so a frequency table would be quite lengthy and not particularly helpful as a summary of the variable. In data cleaning, you might find a frequency table useful for examining possible clustering of cases on specific values or looking at cumulative percentages. But, beware of using frequency tables for scale variables with many values as they can be very long. In short, while there is nothing incorrect about a frequency table based on an interval scale variable with many values, it is neither a very effective nor efficient summary of the variable. For interval scale variables such statistics as means, medians and standard deviations are often used. Several procedures within SPSS Statistics (Frequencies, Case Summaries and Explore) can

produce these summaries and other summaries of the distribution. We will define each of these measures and use exploratory data analysis to produce them in SPSS Statistics. In addition such graphs as histograms, stem & leaf, and box & whisker plots are designed to display information about interval scale variables. We will see examples of each. Exploratory data analysis (EDA) was developed by John Tukey, a statistician at Princeton and Bell Labs. Seeing limitations in the standard set of summary statistics and plots, he devised a collection of additional plots and graphs. These are implemented in SPSS Statistics in the Explore procedure and we will include them in our discussion and examples.

4.2 Measures of Central Tendency and Dispersion Measures of central tendency and dispersion are the most common measures used to summarize the distribution of variables. We give a brief description of each of these measures below.

Measures of Central Tendency

Statistical measures of central tendency give that one number that is often used to summarize the distribution of a variable. They may be referred to generically as the "average". There are three main central tendency measures: mode, median, and mean. In addition, Tukey devised the 5% trimmed mean.

• Mode: The mode for any variable is merely the group or class that contains the most cases. If two or more groups contain the same highest number of cases, the distribution is said to be 'multimodal'. This measure is more typically used on nominal or ordinal data and can easily be determined by examining the frequency table.

• Median: If all the cases for a variable are arranged in order according to their value, the median is the value that splits the cases into two equally sized groups. The median is the same as the 50th percentile. Medians are resistant to extreme scores, and so are considered robust measures of central tendency.

• Mean: The mean is the simple arithmetic average of all the values in the distribution (i.e., the sum of the values of all cases divided by the total number of cases). It is the most commonly reported measure of central tendency. The mean, along with the associated measures of dispersion, is the basis for many statistical techniques.

• 5% trimmed mean: The 5% trimmed mean is the mean calculated after the extreme upper 5% and the extreme lower 5% of the data values are dropped. Such a measure is resistant to small numbers of extreme values.

The specific measure that you choose will depend on a number of factors, most importantly the level of measurement of the variable. The mean is considered the most "powerful" measure of the three classic measures. However, it is good practice to compare the median, mean, and 5% trimmed mean to get a more complete understanding of a distribution.

Measures of Dispersion

Measures of dispersion or variability describe the degree of spread, i.e., dispersion or variability, around the central tendency measure. You might think of this as a measure of the extent to which observations cluster within the distribution. There are a number of measures of dispersion, including simple measures such as the maximum, minimum, and range; common statistical measures such as the standard deviation and variance; and the exploratory data analysis measure, the interquartile range (IQR).

• Maximum: Simply the highest value observed for a particular variable. By itself, it can tell us nothing about the shape of the distribution, merely how 'high' the top value is.

• Minimum: The lowest value in the distribution and, like the maximum, is only useful when reported in conjunction with other statistics.

• Range: The difference between the maximum and minimum values gives a general impression of how broad the distribution is. It says nothing about the shape of a distribution and can give a distorted impression of the data if just one case has an extreme value.

• Variance: Both the variance and standard deviation provide information about the amount of spread around the mean value. They are overall measures of how clustered around the mean the data values are. The variance is calculated by summing the square of the difference between the value and the mean for each case and dividing this quantity by the number of cases minus 1 (a small worked example follows this list). If all cases had the same value, the variance (and standard deviation) would be zero. The variance is expressed in the units of the variable squared. This can cause difficulty in interpretation, so more often the standard deviation is used. In general terms, the larger the variance, the more spread there is in the data; the smaller the variance, the more the data values are clustered around the mean.

• Standard Deviation: The standard deviation is the square root of the variance, which restores the value of variability to the units of measurement of the original variable's values. It is therefore easier to interpret. Either the variance or the standard deviation is often used, in conjunction with the mean, as a basis for a wide variety of statistical techniques.

• Interquartile Range (IQR): This measure of variation is the range of values between the 25th and 75th percentile values. Thus, the IQR represents the range of the middle 50 percent of the sample and is more resistant to extreme values than the standard deviation.
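As a quick check on the variance and standard deviation definitions, here is the promised worked example using four made-up values (2, 4, 6, and 8), not data from the GSS file:

mean = (2 + 4 + 6 + 8) / 4 = 5
variance = ((2-5)² + (4-5)² + (6-5)² + (8-5)²) / (4 - 1) = (9 + 1 + 1 + 9) / 3 ≈ 6.67
standard deviation = √6.67 ≈ 2.58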

Like the measures of central tendency, these measures differ in their usefulness with variables of different measurement levels. The variability measures, variance and standard deviation, are used in conjunction with the mean for statistical evaluation of the distribution of a scale variable. The other measures of dispersion, although less useful statistically, can provide useful descriptive information about a variable.

4.3 Normal Distributions

An important statistical concept is that of the normal distribution. This is a frequency (or probability) distribution which is symmetrical and is often referred to as the normal bell-shaped curve. The histogram in Figure 4.1 illustrates a normal distribution. The mean, median and mode exactly coincide in a perfectly normal distribution. And the proportion of cases contained within any portion of the normal curve can be exactly calculated mathematically or, more usually, read from tables of the normal distribution. Its symmetry means that 50% of cases lie to either side of the central point as defined by the mean. Two of the other most frequently used representations are the portion lying between plus and minus one standard deviation of the mean (containing approximately 68% of cases, see Figure 4.1) and that between plus and minus 1.96 standard deviations (containing approximately 95% of cases, also shown in Figure 4.1), sometimes rounded up to 2.00 for convenience. Thus, if a variable is normally distributed, we expect 95% of the cases to be within roughly 2 standard deviations of the mean.

Figure 4.1 Normal Distribution: Plus or Minus 1 SD and 1.96 SD

Many naturally occurring phenomena, such as height, weight and blood pressure, are distributed normally. Random errors also tend to conform to this type of distribution. It is important to understand the properties of normal distributions and how to assess the normality of particular distributions because of their theoretical importance in many inferential statistical procedures. We will discuss these issues later in this course. In this chapter, we will review descriptive statistics and graphs that allow us to assess the distribution of our data in comparison to the normal distribution.

4.4 Histograms and Normal Curves The histogram is designed to display the distribution (range and concentration) of a scale variable that takes many different values. A bar chart contains one bar for each distinct data value. When there are many possible data values and few observations at any given value, a bar chart is less useful than a histogram. In a histogram, adjacent data values are grouped together so that each bar represents the same range of data values. SPSS Statistics automatically chooses the range of data values for each bin, but you can specify the range of values or number of bins. With this chart we can see the general distribution of data regardless of how many distinct data values are present. While a bar chart is appropriate for an ordinal variable such as a one to five or one to seven point rating scale, a bar chart of hours worked last week would contain too many bars, some of them with few cases, and gaps in hours (values which no one worked) would not be displayed. For these reasons, a histogram is a better choice.

You can request histogram plots from the Chart Builder or as options from the Frequencies and Explore procedures.

Note
If you request histograms and summary statistics from the Frequencies procedure, you might want to uncheck (turn off) Display frequency tables on the Frequencies dialog box.

We will begin by requesting a histogram of hours worked last week, HRS1, using the Chart Builder. In addition, we will ask that a normal curve be superimposed on the histogram. Should we expect it to be normally distributed?

Open the GSS2004Intro.sav data file (if necessary)
Click Graphs…Chart Builder
Click OK in the Information box (if necessary)
Click Reset
Click Gallery tab (if it's not already selected)
Click Histogram in the Choose from: list

Select the first icon for Simple Histogram and drag it to the Chart Preview canvas Drag and drop HRS1 from the Variables: list to the X-Axis? area in the Chart Preview canvas In the Element Properties dialog box (Click Element Properties button if this dialog box did not automatically open), Click (check on) Display normal curve Click Apply

See Figure 4.2 for the completed Chart Builder dialogs. This will produce a histogram displaying frequencies for HRS1 with the bell shaped curve for normal distribution of the HRS1 mean and standard deviation. The other icons in the gallery request special types of histograms such as a stacked histogram and population pyramid; both of which allow you to display histogram distributions of a scale variable for subgroups of cases.
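The same chart can also be requested with the classic graph syntax; a one-line sketch roughly equivalent to the Chart Builder request above:

GRAPH /HISTOGRAM(NORMAL)=HRS1.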

Figure 4.2 Chart Builder and Element Properties to Create Histogram with Normal Curve

Click OK in the Chart Builder

Figure 4.3 Histogram with Normal Curve of Hours Worked Last Week

The mean, standard deviation, and number of valid cases for HRS1 are automatically displayed in the chart legend. The mean hours worked last week is 42.26, slightly above the norm of 40. The standard deviation is about 15 hours, which indicates there is a fair amount of variation among respondents in hours worked. Does the histogram seem useful in describing hours worked? If HRS1 were normally distributed, the data bars would align perfectly with the normal curve. We easily see that values near 40 are by far the most common. This is because most people who work do so full-time. Given this clustering, we might expect that the median (50th percentile) value is 40, and we will verify that next. The distribution is basically symmetric. Technically speaking, this distribution would be described as not being skewed. However, we can see that the actual distribution is more "peaked" than the normal curve. There are formal statistical measures to describe these deviations from the normal curve which we will discuss shortly.

4.5 Using the Explore Procedure: EDA As we mentioned earlier, John Tukey devised several statistical measures and plots designed to reveal data features that might not be readily apparent from standard statistical summaries. In his book describing these methods, Exploratory Data Analysis, Tukey (1977) described the work of a data analyst to be similar to that of a detective, the goal being to discover surprising, interesting and unusual things about the data. To further this effort, Tukey developed both plots and data summaries. These methods, called exploratory data analysis and abbreviated EDA, have become

very popular in applied statistics and data analysis. Exploratory data analysis can be viewed either as an analysis in its own right, or as a set of data checks and investigations performed before applying inferential testing procedures. These methods are best applied to variables with at least ordinal (more commonly interval) scale properties and which can take many different values. The plots and summaries would be less helpful for a variable that takes on only a few values (for example, a five point scale). The Explore procedure produces many of the EDA statistical measures and plots along with the classic statistics and histograms. We will use the Explore procedure to examine hours worked last week, hours spent using the Internet per week, and number of hours of TV viewed per day. To run the Explore procedure,

Click Analyze…Descriptive Statistics…Explore
Move HRS1, WWWHR, and TVHOURS to the Dependent List: box

The scale variables to be summarized appear in the Dependent list box. The Factor list box can contain one or more categorical (for example, demographic) variables, and if used would cause the procedure to present summaries for each category of the factor variable(s). We will use this feature in later chapters when we look at mean differences between groups. By default, both plots and statistical summaries will appear. While not discussed here, the Explore procedure can produce robust mean estimates (M-estimators), percentile values, and lists of extreme values, as well as normal probability and homogeneity plots. Figure 4.4 Explore Dialog Box

We can request specific statistical summaries and plots using the Statistics and Plots option buttons. We will accept the default statistics; but request a histogram rather than a stem and leaf plot. Click Plots Click off Stem-and-leaf Click on Histogram

Figure 4.5 Plots Dialog Box in Explore

The stem & leaf plot (devised by Tukey) is modeled after the histogram, but contains more information. Although not requested here, we will briefly discuss them later. For most purposes, the histogram is easier to interpret and more useful. By default, a boxplot will be displayed for each scale variable. Click Continue

Options with Missing Values Ordinarily SPSS Statistics excludes any observations with missing values when running a procedure like Explore. When several variables are used (as here) you have a choice as to whether the analysis should be based on only those observations with valid values for all variables in the analysis (called listwise deletion), or whether missing values should be excluded separately for each variable (called pairwise deletion). When only a single variable is considered both methods yield the same result, but they will not give identical answers when multiple variables are analyzed in the presence of missing values. The default method is listwise deletion. In our example, each of the variables was asked of only a subset of cases; but a different subset for each variable. Thus, the listwise option is not appropriate. We will specifically request the alternative pairwise method using the Options button. Click Options Click Exclude cases pairwise

Figure 4.6 Missing Value Options in Explore

Rarely used, the Report values option includes cases with user-defined missing values in frequency analyses, but excludes them from summary statistics and charts. Click Continue Click OK
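Pasting these choices produces an EXAMINE command roughly like the following (the pasted version may include additional default subcommands such as /CINTERVAL 95):

EXAMINE VARIABLES=HRS1 WWWHR TVHOURS
  /PLOT BOXPLOT HISTOGRAM
  /STATISTICS DESCRIPTIVES
  /MISSING PAIRWISE.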

The Explore procedure produces two tables followed by the requested charts for each variable. The first table, a Case Processing Summary pivot table, displays the number of valid cases and missing cases for each variable. Each variable has a considerable amount of missing data. For example, 1763 cases (respondents) had valid values for HRS1, while 37.3% were missing. Ordinarily such a large percentage of missing data would set off alarm bells for the analyst. However we know that people who did not work were not asked this question. TVHOURS has the most missing data, 68% of the cases, because it was asked of only a subset of the respondents. Notice the large differences in the number of valid cases among the three variables.

Figure 4.7 Case Processing Summary

Case Processing Summary
                                                       Cases
                                       Valid             Missing            Total
                                       N      Percent    N      Percent     N      Percent
NUMBER OF HOURS WORKED LAST WEEK       1763   62.7%      1049   37.3%       2812   100.0%
WWW HOURS PER WEEK                     1701   60.5%      1111   39.5%       2812   100.0%
HOURS PER DAY WATCHING TV               899   32.0%      1913   68.0%       2812   100.0%

Note The statistics for all three variables are displayed in the one Descriptive table. We edited this pivot table, moving the "Dependent Variables" from the row to the layer dimension. We will present and discuss the summaries and plots for each variable (layer) separately. The Descriptives table in Figure 4.8 displays a series of descriptive statistics for HRS1. From the previous table, we know that these statistics are based on 1763 respondents.

Figure 4.8 EDA Summaries for Hours Worked Last Week

Descriptives
NUMBER OF HOURS WORKED LAST WEEK             Statistic   Std. Error
Mean                                           42.26        .358
95% Confidence Interval   Lower Bound          41.56
  for Mean                Upper Bound          42.96
5% Trimmed Mean                                42.01
Median                                         40.00
Variance                                      225.651
Std. Deviation                                 15.022
Minimum                                            1
Maximum                                           89
Range                                             88
Interquartile Range                               13
Skewness                                         .275       .058
Kurtosis                                        1.099       .117

First, several measures of central tendency appear: the Mean, 5% Trimmed Mean, and Median. As we discussed earlier, these statistics attempt to describe with a single number where data values are typically found, or the center of the distribution. Useful information about the distribution can be gained by comparing these values to each other. Here the mean, median, and 5% trimmed mean are very close, and their values (40.00 to 42.26) suggest either that there are not many extreme scores (not true in this case), or that the number of high and low scores is balanced (which we will see does seem to be the case). If the mean were considerably above or below the median and trimmed mean, it would suggest a skewed or asymmetric distribution. A perfectly symmetric distribution—the normal—would produce identical expected means, medians and trimmed means. The measures of central tendency are followed in the table by several measures of dispersion or variability. As we discussed earlier, these indicate to what degree observations tend to cluster or be widely separated. Both the standard deviation and variance (standard deviation squared) appear. The standard deviation of 15.022 indicates a variation around the mean of 15 hours; a modest amount of variation. The standard error is an estimate of the standard deviation of the mean if repeated samples of the same size (here 1763) were taken. It is used in calculating the 95% confidence band for the sample mean discussed below. Also appearing is the interquartile range (IQR), which is essentially the range between the 25th and the 75th percentile values. It is a variability measure more resistant to extreme scores than the standard deviation. The interquartile range of 13 indicates that the middle 50% of the sample lies within a range of 13 hours. The fact that the IQR is lower than the standard deviation suggests that the distribution may be "peaked" in the center. We also see the minimum, maximum and range. It is useful to check the minimum and maximum in order to make sure no impossible data values are recorded.


4.6 Standard Error of the Mean and Confidence Intervals

As stated earlier, the standard error of the mean is an estimate of the standard deviation of the sample mean and is a function of the sample standard deviation and the sample size:

$$SE_{\bar{x}} = \frac{s}{\sqrt{n}}$$

where $s$ is the sample standard deviation and $n$ is the sample size.

The larger the sample size, the smaller the standard error, given the same sample standard deviation. The standard error of the mean is used to calculate the 95% confidence interval. The 95% confidence interval has a technical definition: if we were to repeatedly perform the study, on average we would expect the 95% confidence bands to include the true population mean 95% of the time. It is useful in that it combines measures of both central tendency (the mean) and variation (the standard error of the mean) to provide information about where we should expect the population mean to fall.

The confidence band is based on the sample mean, plus or minus 1.96 times the standard error of the mean. Recall from our earlier discussion about the normal distribution that 95% of the area under a normal curve is within 1.96 standard deviations of the mean. Since the sample standard error of the mean is simply the sample standard deviation divided by the square root of the sample size, the 95% confidence band for the mean is equal to the sample mean plus or minus 1.96 * (sample standard deviation / square root(sample size)). Thus if you have the sample mean, standard deviation, and number of observations, you can easily calculate the 95% confidence band.

As we can see in Figure 4.8, the confidence band for the mean of hours worked is very narrow (41.56 to 42.96), so we have a fairly precise idea of the population mean for hours worked last week; we expect the population mean to fall within this range 95% of the time.
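For example, using the values reported in Figure 4.8 (mean 42.26, standard deviation 15.022, n = 1763), the calculation works out as follows:

$$SE_{\bar{x}} = \frac{15.022}{\sqrt{1763}} \approx .358$$

$$95\%\ CI = 42.26 \pm 1.96 \times .358 \approx (41.56,\ 42.96)$$

which matches the lower and upper bounds shown in the Descriptives table.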

4.7 Shape of the Distribution

In Figure 4.3, we displayed the histogram showing the distribution of hours worked last week. This same histogram, minus the normal curve line, was requested as part of the Explore output. The final two statistical measures in the Descriptives table in Figure 4.8, skewness and kurtosis, provide numeric summaries of the shape of the distribution of the data. Since most analysts are content to view histograms in order to make judgments regarding the distribution of a variable, these measures are infrequently used.

Skewness is a measure of the symmetry of a distribution. It measures the degree to which cases are clustered towards one end of the distribution. It is normed so that a symmetric distribution has zero skewness. A positive skewness value indicates bunching on the left and a longer tail on the right (for example, income distribution in the U.S.); negative skewness follows the reverse pattern. The standard error of skewness also appears in the Descriptives table, and we can use it to determine whether the data are significantly skewed. One method is to use the standard error to calculate a 95% confidence interval around the skewness value; if zero is not in this range, we could conclude that the distribution is skewed. A second method is to check whether the skewness value is more than 1.96 standard errors away from zero.
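Applying both methods to the hours worked values from Figure 4.8 (skewness .275 with a standard error of .058):

$$95\%\ CI = .275 \pm 1.96 \times .058 \approx (.161,\ .389)$$

$$\frac{.275}{.058} \approx 4.7 > 1.96$$

The interval excludes zero and the skewness value is well over 1.96 standard errors from zero, so both methods lead to the conclusion discussed below.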


Although the skewness value for hours worked is close to 0 (.275) (see Figure 4.8), using either of the methods above we would determine that the distribution is slightly positively skewed. However, the histogram does not show visually striking evidence of skewness.

Kurtosis also has to do with the shape of a distribution and is a measure of how much of the data is concentrated near the center, as opposed to the tails, of the distribution. It is normed to the normal curve (for which kurtosis is zero). As an example, a distribution with longer tails and a more peaked middle than the normal is referred to as a leptokurtic distribution and would have a positive kurtosis measure. On the other hand, a platykurtic distribution is a flattened distribution and has negative kurtosis values. A standard error for kurtosis also appears, and the same methods used for evaluating skewness can be used to evaluate the kurtosis value. Since the kurtosis value for hours worked is 1.099 (Figure 4.8), which is well beyond two standard errors (1.96 * .117) from zero, hours worked has a leptokurtic distribution and is not normally distributed.

The shape of the distribution can be of interest in its own right. Also, assumptions are made about the shape of the data distribution within each group when performing significance tests on mean differences between groups. This aspect will be covered in later chapters.

Note: Stem & Leaf Plot of Hours Worked Last Week

The stem & leaf plot (devised by Tukey) is modeled after the histogram, but is designed to provide more information. We requested the histogram rather than a stem & leaf plot, but provide the stem & leaf plot for hours worked last week in Figure 4.9 as an example. The overall distribution is reflected in the length of the lines. Instead of using a standard symbol (for example, an asterisk '*' or block character) to display a case or group of cases, the stem & leaf plot uses data values as the plot symbols on each line. Thus the shape of the distribution appears, and the plot can be read to obtain specific data values.

Figure 4.9 Stem & Leaf Plot of Number of Hours Worked Last Week

In a stem & leaf plot the stem is the vertical axis and the leaves branch horizontally from the stem. The stem width indicates the number of units in which the stem value is measured; in this case a stem unit represents 10 hours. This means that the stem value must be multiplied by 10 to reproduce the original units of analysis. The leaf values in the chart indicate the value of the next unit down from the stem. To illustrate, the third row from the bottom of the chart contains a stem value of 6 with one leaf of 4 and four leaves of 5. Each leaf represents 6 cases, so these represent six individuals who worked 64 hours last week and 24 respondents who worked 65 hours. Values in a stem that did not have at least 6 cases are represented by a fractional leaf, denoted by an ampersand (&). Notice that there are a total of 30 cases in this stem.

The first and last lines identify outliers. These are data points far enough from the center (defined more exactly under Box & Whisker plots below) that they might merit more careful checking. Extreme points might be data errors or possibly represent a separate subgroup. Outliers are those cases with values less than or equal to 17 hours (there are 104 of these) and greater than or equal to 70 hours (there are 99 of these). Thus besides viewing the shape of the distribution we can pick out individual scores.
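To recover an original data value from the plot, combine the stem and a leaf. For the row just described:

$$\text{value} = \text{stem} \times \text{stem width} + \text{leaf} = 6 \times 10 + 5 = 65 \text{ hours}$$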

4.8 Boxplots

Boxplots, also referred to as box & whisker plots, are an easily interpreted way to convey much of the same information about the distribution of a variable. In addition, the boxplot graphically identifies outliers. Below we see the boxplot for hours worked.

Figure 4.10 Boxplot of Hours Worked

The vertical axis represents the scale for the number of hours worked. The solid line inside the box represents the median, or 50th percentile. The top and bottom borders of the box (referred to as "hinges") correspond to the 75th and 25th percentile values of hours worked and thus define the interquartile range (IQR). In other words, the middle 50% of data values fall within the box. The "whiskers" (vertical lines extending from the top and bottom of the box) extend to the last data values that lie within 1.5 box lengths (or IQRs) of the respective hinges. Tukey considers data points more than 1.5 box lengths from a hinge to be "outliers"; these points are marked with a circle. Points more than 3 box lengths (IQRs) from a hinge are considered by Tukey to be "far out" points and are marked with an asterisk symbol (there are none here). This plot has many outliers. If a single outlier exists at a data value, the case sequence number appears beside it (an ID variable can be substituted), which aids data checking.

If the distribution were symmetric, the median would be centered between the hinges and between the whiskers. In the plot above, the median is toward the bottom of the box, indicating a positively skewed distribution.

Boxplots are particularly useful for obtaining an overall 'feel' for a distribution in an instant. The median tells us the location or central tendency of the data (40 hours for hours worked). The length of the box indicates the amount of spread within the data, and the position of the median in relation to the box tells us something about the nature of the distribution. Boxplots are also useful when comparing several groups, as we will see in later chapters.
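In terms of the quartiles, the outlier rules just described correspond to the following fences, where Q1 and Q3 denote the 25th and 75th percentiles:

$$\text{outlier:}\quad x < Q_1 - 1.5 \times IQR \ \ \text{or}\ \ x > Q_3 + 1.5 \times IQR$$

$$\text{far out:}\quad x < Q_1 - 3 \times IQR \ \ \text{or}\ \ x > Q_3 + 3 \times IQR$$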


Note: Boxplots, like all charts in SPSS Statistics, can be edited. For ease in interpretation or presentation, you might want to delete some of the outlier and extreme data points after you have studied them. You could also change the range or ticks on the scale axis and make other enhancements to the chart attributes.

Exploring Hours Using the Internet

We now use this same approach to consider hours using the Internet per week.

Figure 4.11 Exploratory Summaries of Hours Using the Internet Per Week

Descriptives: WWW HOURS PER WEEK
                                                      Statistic    Std. Error
Mean                                                     7.46        .255
95% Confidence Interval for Mean    Lower Bound          6.96
                                    Upper Bound          7.96
5% Trimmed Mean                                          5.92
Median                                                   4.00
Variance                                               110.793
Std. Deviation                                          10.526
Minimum                                                   0
Maximum                                                 130
Range                                                   130
Interquartile Range                                       9
Skewness                                                 3.569       .059
Kurtosis                                                21.097       .119

The mean of 7.46 hours is much greater than the median (4.00). This suggests a positive skew to the data, confirmed by the skewness statistic, which is much larger than 0. Examine the minimum and maximum values; do they suggest data errors? You might look at other variables, such as hours worked per week, for those who claim to use the Internet 130 hours a week. After all, there are only 168 hours in a week! Which other variables might you look at in order to investigate the validity of these responses?

We have valid data for 1,701 observations, with 1,111 missing (these numbers appear in the Case Processing Summary table in Figure 4.7). This is a large amount of missing data, but some people don't use a computer (and so wouldn't answer this question), and it is also possible that a subset of respondents was simply not asked this and other questions about computer usage. Notice that the standard deviation is larger than the mean, a sign of great variation in the values. The kurtosis is also very large.
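Applying the skewness check from Section 4.7 to the values in Figure 4.11 makes the asymmetry obvious:

$$\frac{3.569}{.059} \approx 60 \gg 1.96$$

The skewness is roughly 60 standard errors from zero, so the positive skew is unmistakable.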

Figure 4.12 Histogram for Hours Using the Internet Per Week

The histogram shows an extremely positively skewed distribution, with roughly 800 cases having values close to zero. Although the bars for the high values are too short to see in this rendering, we know there is at least one case with a value of 130. Finally, do you notice any pattern of peaks and valleys in the plot? For example, the concentration of cases around 20 hours per week looks odd. This might be an example of data heaping, which occurs when respondents can't estimate precisely how often they do something and so choose a convenient round number.

Figure 4.13 Box & Whisker Plot for Hours Using the Internet Per Week

Notice that all the extreme values occur at the high end of the scale, unlike for hours worked last week. Individuals who use the Internet less than an hour (a value of 0) are not outliers. This is because a value of 0 is not that far from the bulk of the observations, while the largest values lie far above the 75th percentile (10 hours). The box is squashed because of the outliers and so is difficult to use, but it is clear how far some of the outliers are from the bulk of cases. The positive skewness is apparent from the outliers at the high end. Some of these are marked as extreme points (with an asterisk). While unusual relative to the data, certainly people can work online for many hours. However, we begin to wonder whether values above 60 or so hours are valid.

If suspicious outliers appear in your data, you should check whether they are data errors. If not, you need to consider whether you wish them included in your analysis. This is especially problematic when dealing with a small sample (not the case here), since an outlier can substantially influence the analysis. We say more about outliers when we discuss ANOVA and Multiple Regression in Appendices A and B.


Exploring Hours of TV Viewed

We now move to hours of TV watched per day. The mean (2.87) is very near 3 hours, the trimmed mean is 2.56, and the median is 2. This suggests skewness. Do you notice anything surprising about the minimum or maximum? Watching 20 hours of TV a day is possible, if barely, but unlikely, so perhaps it is a result of misunderstanding the question. The trimmed mean is closer to the mean than to the median, indicating that the difference between the mean and median is not solely due to the presence of outliers. The histogram in Figure 4.15, showing a heavy concentration of respondents at 1 and 2 hours of TV viewing, suggests why the median is at 2. There is also positive skewness and kurtosis.

Figure 4.14 Exploratory Summaries of Daily TV Hours

Descriptives: HOURS PER DAY WATCHING TV
                                                      Statistic    Std. Error
Mean                                                     2.87        .087
95% Confidence Interval for Mean    Lower Bound          2.69
                                    Upper Bound          3.04
5% Trimmed Mean                                          2.56
Median                                                   2.00
Variance                                                 6.849
Std. Deviation                                           2.617
Minimum                                                   0
Maximum                                                  20
Range                                                    20
Interquartile Range                                       3
Skewness                                                 2.589       .082
Kurtosis                                                 9.823       .163

Figure 4.15 Histogram of Daily TV Hours

This histogram identifies outliers on the high side. Other than that, it is of limited use, since TVHOURS is recorded as a whole number of hours and only a relatively small number of distinct values occur. The same point would apply when considering use of Explore for five-point rating scales.

Figure 4.16 Box & Whisker Plot of Daily TV Hours

In addition to the asymmetry created by the large outliers, we see that the median is not centered in the box: it is closer to the lower edge (the 25th percentile value). This is due to the heavy concentration of respondents viewing 0 through 2 hours of TV per day.

We would not argue that something of interest always appears through use of the methods of exploratory data analysis. However, you can quickly glance over these results and, if anything strikes your attention, pursue it in more detail. The possibility of detecting something unusual encourages the use of these techniques.

4.9 Appendix: Standardized (Z) Scores

The properties of the normal distribution allow us to calculate a standardized score, often referred to as a z-score, which indicates the number of standard deviations above or below the sample mean each value falls. Standardized (Z) scores can be used to calculate the relative position of each value in the distribution. Z-scores are most often used in statistics to standardize variables measured in unequal scale units for statistical comparisons or for use in multivariate procedures. We'll have more to say about this later.

For example, if you obtain a score of 68 out of 100 on a word test, this information alone is not enough to tell how well you did in relation to others taking the test. However, if you know the mean score is 52.32, the standard deviation is 8.00, and the scores are normally distributed, you can calculate the proportion of people who achieved a score at least as high as your own.


The standardized score is calculated by subtracting the mean from the value of the observation in question (68 - 52.32 = 15.68) and dividing by the standard deviation for the sample (15.68 / 8 = 1.96):

$$z = \frac{\text{case score} - \text{sample mean}}{\text{standard deviation}} = \frac{68 - 52.32}{8.00} = 1.96$$

Therefore, the mean of a standardized distribution is 0 and its standard deviation is 1. In this case, your score of 68 is 1.96 standard deviations above the mean. The histogram of the normal distribution in Figure 4.1 displays the distribution as Z-scores, so the values on the x-axis are standard deviation units. From this figure, we can see that only 2.5% of the cases are likely to have a score above 68 (a standardized score of 1.96). The normal distribution table (see Table 4.1), found in an appendix of most statistics books, shows proportions for z-score values.

Table 4.1 Normal Distribution Table

A score of 1.96, for example, corresponds to a value of .025 in the ‘one-tailed’ column and .050 in the ‘two-tailed’ column. The former means that the probability of obtaining a z-score at least as large as +1.96 is .025 (or 2.5%), the latter that the probability of obtaining a z-score of more than +1.96 or less than -1.96 is .05 (or 5%) or 2.5% at each end of the distribution. You can see these cutoffs in the histogram displayed in Figure 4.1 Normal Distribution: Plus or Minus 1 SD and 1.96 SD.
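In symbols, the two table entries just described are:

$$P(Z > 1.96) = .025 \qquad\qquad P(|Z| > 1.96) = P(Z > 1.96) + P(Z < -1.96) = .05$$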


Whether we choose a one- or two-tailed probability depends upon whether we wish to consider both ends of the distribution (two-tailed) or just one end (one-tailed). We will say more about one-tailed and two-tailed probability when we discuss mean difference tests in Chapter 7.

As we mentioned, another advantage of standardized scores is that they allow for comparisons on variables measured in different units. For example, in addition to the word test score, you might have a mathematics test score of 150 out of 200 (or 75%). Although it appears from the percentages alone that you did better on the mathematics test, you would need to calculate the z-score for the mathematics test and compare the z-scores in order to answer the question.

You might want to compute z-scores for a series of variables and determine whether certain subgroups of your sample are, on average, above or below the mean on these variables by requesting descriptive statistics or using the Case Summaries procedure. For example, you might want to compare a respondent's education and socioeconomic index (SEI) values using z-scores.

The Descriptives procedure has an option to calculate standardized score variables. A new variable containing the standardized values is calculated for each of the specified variables. To create z-scores for education and socioeconomic index,

Click Analyze…Descriptive Statistics…Descriptives
Move EDUC and SEI to the Variable(s): list box
Click Save standardized values as variables
Click OK

Figure 4.17 Descriptives Dialog Box to Create Z-Scores

By default, the new variable name is the old variable name prefixed with the letter "Z". Two new variables, zeduc and zsei, containing the z-scores of the two variables are created at the end of the data.

Figure 4.18 Two Z-score Variables in the Data Editor

These variables can be saved in your file and used in any statistical procedure.

Note: You can assign specific names to the z-score variables by using the DESCRIPTIVES syntax command. Paste the syntax and add the z-score variable name in parentheses after the variable name in the VARIABLES subcommand, as in:

DESCRIPTIVES VARIABLES=EDUC (Zscore_EDUC) SEI (Zscore_SEI)
  /SAVE
  /STATISTICS=MEAN STDDEV MIN MAX .

This would name the new variables Zscore_EDUC and Zscore_SEI.


Summary Exercises

We will later compare different groups on the average number of children they have (CHILDS), their age (AGE), and the number of hours spent on email per week (EMAILHR).

1. In anticipation of this, run an exploratory data analysis on these variables. Use a histogram rather than a stem & leaf plot. Review the results. Keep in mind that this is a U.S. adult sample; do you see anything unusual about the age range?

2. Using Chart Builder, run a histogram of AGE with the normal curve. Looking at this chart and the information from Explore, how would you describe the distribution of AGE? Given this information, how might you group years of age into a category variable?

3. Use Visual Binning (Transform…Visual Binning) to create a new grouped AGE variable. If you wish, save the modification in a data file named MYGSS2004.sav.

For those with extra time:

1. Number of children (CHILDS) is coded 0 through 8, where 8 indicates eight or more children. Look at the exploratory output, or run a frequency analysis on the variable. Would you expect the truncation of CHILDS to have much influence on an analysis?


Chapter 5: Probability and Inferential Statistics

Topics
• The Nature of Probability
• Making Inferences about Populations from Samples
• Influence of Sample Size
• Hypothesis Testing
• Type I and Type II Statistical Errors
• Statistical Significance and Practical Importance

Data
Data files showing the same percentages or means for samples of 100, 400, and 1,600. A data file containing 10,000 observations drawn from a normal population with mean 70 and standard deviation of 10.

Scenario
In this chapter, we overview some basic statistical concepts and principles that are needed to understand the assumptions and interpretation of inferential statistical techniques. We then display a series of analyses in which only the sample size varies and see which outcome measures change. Finally, we discuss scenarios in which statistical significance and practical importance do not coincide.

5.1 The Nature of Probability

Up to this point, we have used descriptive statistics (that is, literally describing the data in our sample through the use of a number of summary procedures and statistics). The statistical methods described later in this course are termed inferential in that the data we have collected will be used to provide more generalized conclusions. In other words, we want to infer the results from the sample on which we have data to the population which the sample represents. To do this, we use procedures that involve the calculation of probabilities.

The fundamental issue with inferential statistical tests concerns whether any 'effects' (i.e., relationships or differences between groups) we have found are 'genuine' or the result of sampling variability (in other words, mere 'chance'). A probability value can be defined as 'the mathematical likelihood of a given event occurring', and as such we can use these values to assess the likelihood that any differences we have found are the result of random chance.

Consider the following example. Suppose we have conducted a study and found that there is a slight difference between the mean blood pressure of left-handed people and the mean blood pressure of right-handed people. We would be naive to expect both means to be exactly the same, so has this difference occurred due to chance, or does it reflect a real difference in the mean blood pressure of these two groups in the population? In later chapters, we will see just how researchers answer such a question.


5.2 Making Inferences about Populations from Samples

Ideally, we would have data about everyone we wished to study (i.e., the whole population). In practice, we rarely have information about all members of our population and instead collect information from a representative sample of the population. However, our goal is to make generalizations about various characteristics of the population based on the known facts about the sample. We choose the sample with the intention of using the data from that sample to make inferences about the 'true' values in the population. These population measures are referred to as parameters, while the equivalent measures from samples are known as statistics. It is unlikely that we will know the population parameters; therefore we use the sample statistics to infer what these population values will be. As noted in Section 5.1, these statistical techniques are known as inferential, in contrast to the purely descriptive analyses we have considered so far. We have already considered a number of statistics and parameters such as means, proportions, and standard deviations.

An important distinction between parameters and statistics is that parameters are fixed (although often not known) while statistics vary from one sample to another. Due to the effects of random variability, it is unlikely that any two samples drawn from the same population will produce the same statistics. By plotting the values of a particular statistic (e.g., the mean) from a large number of samples, it is possible to obtain a sampling distribution of the statistic. For a small number of samples, the mean of the sampling distribution may not closely resemble that of the population. However, as the number of samples taken increases, the mean of the sampling distribution (the mean of all the means, if you like) gets closer to the population mean. For an infinitely large number of samples, it will be exactly the same as the population mean.

Additionally, as the sample size increases, the amount of variability in the distribution of sample means decreases. If you think of the variability in terms of the error made in estimating the mean, it should be clear that the more evidence you have (i.e., the more cases in your sample), the smaller will be the error in estimating the mean. Of course, it is unlikely you will ever be able to take repeated samples; you usually get just one chance and must therefore base your conclusions on the data from this one sample. If repeated random samples of size N are drawn from any population, then as N becomes large, the sampling distribution of sample means approaches normality, a phenomenon known as the Central Limit Theorem. This is an extremely useful statistical result because it does not require that the original population distribution be normal.

In the next section, we'll take a closer look at the influence of sample size on the precision of the statistics.
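The key quantity behind these statements parallels the standard error formula from Section 4.6: the variability of the sampling distribution of the mean is

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$$

so, for example, quadrupling the sample size cuts the variability of the sample mean in half. We will see this relationship at work in the histograms later in this chapter.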

5.3 Influence of Sample Size

In statistical analysis, sample size plays an important role, but one that can easily be overlooked since a minimum sample size is not required for the most commonly used statistical tests. Workers in some areas of applied statistics (engineering, medical research) routinely estimate the effects of sample size on their analyses (termed power analysis). This is less frequently done in social science and market research. The formulas for standard errors describe the effect of sample size. Here we will demonstrate the effect in two common data analysis situations: crosstabulation tables and mean summaries.


Precision of Percentages

Precision is strongly influenced by the sample size. In the figures below we present a series of crosstabulation tables containing identical percentages, but with varying sample sizes. We will observe how the test statistics change with sample size and relate this result to the precision of the measurement. The results below assume a population of infinite size, or at least one much larger than the sample. For precision calculations involving percentages with finite populations, see Cochran (1977), Kish (1965), or other survey sampling texts.

Note: The chi-square test of independence will be presented for each table as part of the presentation of the effect of changing sample size. This test assumes that each sample is representative of the entire population. A detailed discussion of the chi-square statistic, its assumptions, and its interpretation can be found in Chapter 6.

Sample Size of 100

The table below displays responses of men and women to a question asking for which candidate they would vote. The table was constructed by adjusting case weights to reflect a sample of 100.

Figure 5.1 Crosstab Table with Sample of 100

We see that 46 percent of the men and 54 percent of the women choose candidate A, an 8% difference. Since this sample of 100 people incompletely reflects the population, we turn to the chi-square test to assess whether men differ from women in the population. (As noted above, the chi-square test will be examined closely in Chapter 6.) Here we simply note the chi-square value (.640) and state that the significance value of .424 indicates that men and women do not differ significantly concerning candidate choice. The significance value of .424 suggests that if men and women in the population had identical attitudes toward the candidates, with a sample of 100 we could observe a gender difference of 8 or more percentage points about 42% of the time. Thus we are fairly likely to find such a difference (8%) in a small sample even if there is no gender difference in the population.

Sample Size of 400

Now we view a table with percentages identical to the previous one, but based on a sample of 400 people, four times as large as before.

Figure 5.2 Crosstabulation Table with Sample of 400

The gender difference remains at 8%, with fewer men choosing Candidate A. Although the percentages are identical, the chi-square value has increased by a factor of four (from .640 to 2.56) and the significance value is smaller (.11). This significance value of .11 suggests that if men and women in the population had identical attitudes toward the candidates, with a sample of 400 we would observe a gender difference of 8 or more percentage points about 11% of the time. Thus with a bigger sample, we are much less likely to find such a large (8%) percentage difference if men's and women's attitudes are identical. Since much statistical testing uses a cutoff value of .05 when judging whether a difference is significant, this result is close to being judged statistically significant.

Sample Size of 1,600

Finally we present the same table of percentages, but increase the sample size to 1,600; the increase is once again by a factor of four.

Figure 5.3 Crosstabulation Table with Sample of 1,600

The percentages are identical to the previous tables, and so the gender difference remains at 8%. The chi-square value (10.24) is four times that of the previous table and sixteen times that of the first table. Notice that the significance value is quite small (.001), indicating a statistically significant difference between men and women. With a sample as large as 1,600 it is very unlikely (.001, or 1 chance in 1,000) that we would observe a difference of 8 or more percentage points between men and women if they did not differ in the population.

Thus the 8% sample difference between the two groups is highly significant if the sample is 1,600, but not significant (testing at the .05 level) with a sample of 100. This is because the precision with which we measure candidate preference increases with the sample size, and as our measurement grows more precise the 8% sample difference looms large. This relationship is quantified in the next section.
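The pattern in the chi-square values is no accident. With the cell percentages held fixed, every observed and expected count scales with the sample size N, so the Pearson statistic scales with N as well:

$$\chi^2 = \sum \frac{(O - E)^2}{E} \propto N \quad\text{(for fixed cell proportions)}$$

which is why quadrupling the sample from 100 to 400 to 1,600 moves the statistic from .640 to 2.56 to 10.24.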

Sample Size and Precision

In the series of crosstabulation tables we saw that, as the sample size increased, we were more likely to conclude there was a statistically significant difference between two groups even though the magnitude of the sample difference was constant (8%). This is because the precision with which we estimate the population percentage increases with increasing sample size. This relation can be approximated (see the note at the end of this chapter for the exact relationship) by a simple equation: the precision of a sample proportion is approximately equal to one divided by the square root of the sample size. Table 5.1 displays the precision for the sample sizes used in our examples.

Table 5.1 Sample Size and Precision for Different Sample Sizes

Sample Size     Precision
100             1/sqrt(100)  = 1/10     .10  or 10%
400             1/sqrt(400)  = 1/20     .05  or 5%
1600            1/sqrt(1600) = 1/40     .025 or 2.5%

To obtain a precision of 1%, we would need a sample of 10,000 (1/sqrt(10,000) = 1/100). We can understand now why surveys don't often state that their results are accurate to within ±1%. Since precision increases with the square root of the sample size, in order to double the precision we must increase the sample size by a factor of four. This is an unfortunate and expensive fact of survey research. In practice, samples between 500 and 1,500 are often selected for national studies.

Precision of Means

The same basic relation, that precision increases with the square root of the sample size, applies to sample means as well. To illustrate this we display histograms based on different samples from a normally distributed population with mean 70 and standard deviation 10. We first view a histogram based on a sample of 10,000 individual observations. Next we will view a histogram of 1,000 sample means, where each mean is composed of 10 observations. The third histogram is composed of 100 sample means, but here each mean is based on 100 observations. We will focus our attention on how the standard deviation changes when sample means are the units of observation. To aid such comparisons, the scale is kept constant across the histograms.


A Large Sample of Individuals

Below is a histogram of 10,000 observations drawn from a normal distribution with mean 70 and standard deviation 10.

Figure 5.4 Histogram of 10,000 Observations

We see that a sample of this size closely matches its population. The sample mean is very close to 70, the sample standard deviation is near 10, and the shape of the distribution is normal.


Means Based on Samples of 10

The second histogram displays 1,000 sample means drawn from the same population (mean 70, standard deviation 10). Here each observation is a mean based on 10 data points. In other words, we pick samples of ten each and plot their means in the histogram below.

Figure 5.5 Histogram of Means Based on Samples of 10

The overall average of the sample means is about 70, while the standard deviation of the sample means (the standard error) is reduced to 3.11. Comparing the two histograms, we see there is less variation (standard deviation of 3.11 versus 10) among means based on groups of observations than among the observations themselves.

Recall the rule of thumb that precision is a function of the square root of the sample size. If the population standard deviation were 10, we would expect the standard deviation of means based on samples of 10 to be the population figure reduced by a factor of 1/square root(N), or 1/square root(10), which is .316. If we multiply this factor (.316) by the population standard deviation (10), the theoretical value we get (3.16) is very close to what we observe in our sample (3.11). Thus by increasing the sample size by a factor of ten (from single observations to means of ten observations each) we reduce the imprecision (increase the precision) by the factor 1/square root(10). The shape of the distribution remains normal.


Means Based on Samples of 100

The next histogram is based on a sample of 100 means, where each mean represents 100 observations.

Figure 5.6 Histogram of Means Based on Samples of 100

While quite compressed, the distribution still resembles a normal curve. The overall mean remains at 70 while the standard deviation is very close to 1 (1.00). This is what we expect since, with samples of 100, the expected value of the standard deviation of the sample mean (the standard error) is the population standard deviation divided by the square root of 100. The theoretical sample standard deviation would therefore be 10/square root(100), or 1.00, which is quite close to our observed value. Thus with means, as well as percents, precision increases with the square root of the sample size.

Statistical Power Analysis

With increasing precision we are better able to detect small differences that exist between groups and small relationships between variables. Power analysis was developed to aid researchers in determining the minimum sample size required in order to have a specified chance of detecting a true difference or relationship of a given size. To put it more simply, power quantifies your ability to reject the null hypothesis when it is false. For example, suppose a researcher hopes to find a mean difference of .8 standard deviation units between two populations. A power calculation can determine the sample size necessary to have a 90% chance that a significant difference will be found between the sample means when performing a statistical test at a specified significance level. Thus a researcher can evaluate whether the sample is large enough for the purpose of the study.

The SPSS SamplePower program performs power analysis. Also, books by Cohen (1988) and Kraemer & Thiemann (1987) discuss power analysis and present tables used to perform the calculation for common statistical tests. In addition, specialty software is available for such analyses. Power analysis can be very useful when planning a study, but it does require such information as the magnitude of the hypothesized effect and an estimate of the variance.

5.4 Hypothesis Testing

Whenever we wish to make an inference about a population from our sample, we must specify a hypothesis to test. It is common practice to state two hypotheses: the null hypothesis (also known as H0) and the alternative hypothesis (H1). The null hypothesis being tested is conventionally the one in which no effect is present. For example, we might be looking for differences in mean income between males and females, but the (null) hypothesis we are testing is that there is no difference between the groups. If the evidence is such that this null hypothesis is unlikely to be true, the alternative hypothesis should be accepted. Another way of thinking about the problem is to make a comparison with the criminal justice system. Here, a defendant is treated as innocent (i.e., the null hypothesis is accepted) until there is enough evidence to suggest that they perpetrated the crime beyond any reasonable doubt (i.e., the null hypothesis is rejected).

The alternative hypothesis is generally (although not exclusively) the one we are really interested in and can take any form. In the above example, we might hypothesize that males will have a higher mean income than females. When the alternative hypothesis has a 'direction' (i.e., we expect a specific result), the test is referred to as one-tailed. Often, you do not know in which direction to expect a difference and may simply wish to leave the alternative hypothesis open-ended. This is a two-tailed test, and the alternative hypothesis would simply be that the mean incomes of males and females differ. Whichever option you choose will have implications when interpreting the probability levels. In general, the probability of the occurrence of a particular statistic for a one-tailed test will be half that of a two-tailed test, as only one extreme of the distribution is being considered in the former type of test. You will see an example of this when we demonstrate the T Test procedure in Chapter 7.

Significance Criterion Level

Having formally stated your hypotheses, you must then select a criterion for acceptance or rejection of the null hypothesis. With probability tests such as the chi-square test or the t test, you are testing the likelihood that a statistic of the magnitude obtained would have occurred by chance, assuming that the null hypothesis (i.e., that there is no difference in the population) is true. In other words, we only wish to reject the null hypothesis when we can say that the result would have been extremely unlikely under the conditions set by the null hypothesis. In this case, the alternative hypothesis should be accepted. It is worth noting that this does not 'prove' the alternative hypothesis beyond doubt; it merely tells us that the null hypothesis is unlikely to be true.

But what criterion (or alpha level, as it is often known) should we use? Unfortunately, there is no easy answer! Traditionally, a 5% level is chosen, indicating that a statistic of the size obtained would be likely to occur on only 5% of occasions (or once in twenty) should the null hypothesis be true. This also means that, by choosing a 5% criterion, you are accepting that when the null hypothesis is true you will mistakenly reject it 5% of the time.


The 5% cut-off point is not a hard and fast rule, however. Some prefer to choose a 10% level, others a far more conservative 1% or even 0.1%. In this last case, a statistic would only be accepted as significant if it was shown to occur on 0.1%, or one in a thousand, of all occasions under the null hypothesis. The level you choose will, to a large extent, depend upon the importance of getting the answer correct. If performing more exploratory research, where the outcome is not so critical, you may decide upon a more liberal region of rejection, such as 10%. Alternatively, if carrying out potentially life-or-death clinical trials, you will wish to be as certain as possible that you have made the correct choice. In these cases, the more conservative the criterion, the 'safer' you will be should you achieve a significant result.

5.5 Types of Statistical Errors

Recall that when performing statistical tests we are generally attempting to draw conclusions about the larger population based on information collected in the sample. There are two major types of errors in this process.

False positives, or Type I errors, occur when no difference (or relation) exists in the population, but the sample tests indicate there are significant differences (or relations). Thus the researcher falsely concludes a positive result. This type of error is explicitly taken into account when performing statistical tests. When testing for statistical significance using a .05 criterion (alpha level), we acknowledge that if there is no effect in the population then the sample statistic will exceed the criterion on average 5 times in 100 (.05).

Type II errors, or false negatives, are mistakes in which there is a true effect in the population (difference or relation) but the sample test statistic is not significant, leading to a false conclusion of no effect. To put it briefly, a true effect remains undiscovered. The probability of making this type of error is often referred to as the beta level. Whereas you can select your own alpha level, the beta level depends upon such things as the alpha level and the size of the sample. It is helpful to note that statistical power, the probability of detecting a true effect, equals 1 minus the Type II error rate, and the higher the power the better.

Table 5.2 Types of Statistical Errors in Hypothesis Testing

                                                  Statistical Test Outcome
Population                               Not Significant                      Significant
No Difference (H0 is True)               Correct                              Type I error (α), false positive
True Difference (H1 is True)             Type II error (β), false negative    Correct

When other factors are held constant, there is a tradeoff between the two types of errors; thus Type II error can be reduced at the price of increasing Type I error. In certain disciplines, for example in statistical quality control when destructive testing is done, the relationship between the two error types is explicitly taken into account and an optimal balance determined based on cost considerations. In social science research, the tradeoff is acknowledged but rarely taken into account (the exception being power analysis); instead, emphasis is usually placed on maintaining a steady Type I error rate at some criterion level, commonly .05 (5%). This discussion merely touches the surface of these issues; researchers working with small samples or studying small effects should be very aware of them.


5.6 Statistical Significance and Practical Importance

A related issue involves drawing a distinction between statistical significance and practical importance. When an effect is found to be statistically significant, we conclude that the population effect (difference or relation) is not zero. However, an effect can be statistically different from zero yet so small as to be unimportant from a practical or policy perspective. This notion of practical or real-world importance is also called ecological significance.

Recalling our discussion of precision and sample size, very large samples yield increased precision, and in such samples very small effects may be found to be statistically significant. In such situations, the question arises as to whether the effects make any practical difference. For example, suppose a company is interested in customer ratings of one of its products and obtains rating scores from several thousand customers. Furthermore, suppose mean ratings on a 1 to 5 satisfaction scale are 3.25 for male and 3.15 for female customers, and this difference is found to be significant. Would such a small difference be of any practical interest or use?

When sample sizes are small (say under 30), precision tends to be poor, and so only relatively large (and ecologically significant) effects are found to be statistically significant. With moderate samples (say 50 to one or two hundred), small effects tend to show modest significance while large effects are highly significant. For very large samples, several hundreds or thousands, small effects can be highly significant; thus an important aspect of the analysis is to examine the effect size and determine whether it is important from a practical, policy, or ecological perspective.

In summary, the statistical tests we cover in this course provide information as to whether there are non-zero effects. Estimates of the effect size should be examined to determine whether the effects are important.

Computational Note: Precision of Percentage Estimates

In this chapter we suggested, as a rule of thumb, that the precision of a sample proportion is approximately equal to one divided by the square root of the sample size. Formally, for a binomial or multinomial distribution (a variable measured on a nominal or ordinal scale), the standard error of the sample proportion (P) is equal to

$$\mathrm{StdErr}(P) = \sqrt{\frac{P(1-P)}{N}}$$

Thus the standard error is a maximum when P = .5 and reaches a minimum of 0 when P = 0 or 1. A 95% confidence band is usually determined by taking the sample estimate plus or minus twice the standard error. Precision (pre) here is simply two times the standard error. Thus precision (pre) is

$$\mathrm{pre}(P) = 2\sqrt{\frac{P(1-P)}{N}}$$

If we substitute for P the value .5, which maximizes the expression (and is therefore conservative), we have

$$\mathrm{pre}(0.5) = 2\sqrt{\frac{0.5(1-0.5)}{N}} = 2 \times \frac{0.5}{\sqrt{N}} = \frac{1}{\sqrt{N}}$$

This validates the rule of thumb used in the chapter. Since the rule of thumb employs the value of P = .5, which maximizes the standard deviation and thus the standard error, in practice greater precision would be obtained when P departs from .5. It is important to note that this calculation assumes the population size is infinite or, as an approximation, much larger than the sample. Formulations that take finite population values into account can be found in Kish (1965) and other texts discussing sampling. When applied to survey data, the calculation also assumes that the survey was carried out in a methodologically sound manner. Otherwise, the validity of the sample proportion itself is called into question.
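As a quick check of Table 5.1, plugging N = 400 into the conservative formula gives:

$$\mathrm{pre}(0.5) = 2\sqrt{\frac{0.5 \times 0.5}{400}} = 2 \times \frac{0.5}{20} = .05 \text{, or } 5\%$$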


Chapter 6: Comparing Categorical Variables

Topics
• Typical Applications
• Crosstabulation Tables
• Testing the Relationship: Chi-Square Statistic
• Additional Two-Way Crosstabs
• Graphing the Crosstabs Results
• Adding Control Variables
• Extensions: Beyond Crosstabs
• Appendix: Measures of Association

Data
In this chapter, we continue to use the GSS2004Intro.sav file.

Scenario
Using the GSS 2004 data, our interest is in investigating whether men differ from women in their belief in an afterlife and in their use of the computer. In addition, we will explore whether education relates to these measures.

6.1 Typical Applications

Thus far we have examined each variable in isolation from the others. A main component of most studies is to look for relationships among variables or to compare groups on some measure. Here our choice of variables is based on our view of which questions might be interesting to investigate; more often, a study is designed to answer specific questions of interest to the researcher. These may be theoretical, as in an academic project, or quite applied, as is often the case in market research.

The crosstabulation table is the basic technique used to examine relationships among categorical variables. Crosstabs are used in practically all areas of research. A crosstabulation table is a co-frequency table of counts, where each row or column is a frequency table of one variable for observations falling within the same category of the other variable. When one of the variables identifies groups of interest (for example, a demographic variable), crosstabulation tables permit comparisons between groups. In survey work, two attitudinal measures are often displayed in a crosstab to assess their relationship. While the most common tables involve two variables, crosstabulations are general enough to handle additional variables, and we will discuss a simple three-variable analysis.

A crosstabulation table can serve several purposes. It might be used descriptively; that is, the emphasis is on providing some information about the state of things and not on inferential statistical testing. For example, demographic information about members of an organization (company employees, students at a college, members of a professional group) or recipients of a service (hospital patients, season ticket holders) can be displayed using crosstabulation tables.


Here the point is to provide summary information describing the groups and not to make explicit comparisons that generalize to larger populations. For example, an educational institution might publish a crosstabulation table reporting student outcome (dropout, return) for its different divisions. For this purpose, the crosstabulation table is descriptive.

Crosstabulation tables are also used in research studies where the goal is to draw conclusions about relationships in the population based on sample data (recall our discussion in Chapter 5). Many survey studies and experiments have this as their goal. In order to make such inferences, statistical tests (usually the chi-square test of independence) are applied to the tables.

In this chapter we will begin by discussing a simple table displaying gender and belief in an afterlife. We will then outline the logic of applying a statistical test to the data, perform the test, and interpret the results. To provide reinforcement, other two-way tables will be considered. In addition to the statistical tests, researchers occasionally desire a numeric summary of the strength of the association between the two variables in a crosstabulation table; we provide a brief review of some of these measures. Another aspect of data analysis involves graphical display of the results. We will see how bar charts can be used to present the data in crosstabulation tables. Finally, we will explore a three-way table and point in the direction of more advanced methods. We begin, however, with a simple table.

6.2 Crosstabulation Tables

The Crosstabs procedure in SPSS Statistics produces crosstabulation tables of two or more categorical variables. These tables are most useful when there are a relatively small number of categories. As we noted in Chapter 2, you might want to combine categories, especially those with a small number of cases, before running the crosstab. This is easily done using the Recode or Visual Binning facilities on the Transform menu.

To request a crosstabulation table we need to specify the row variable and the column variable. We will specify POSTLIFE as the row variable and GENDER as the column variable. Note that multiple variables can be given for both.

Open the GSS2004Intro.sav data file (if necessary)
Click Analyze…Descriptive Statistics…Crosstabs…
Move POSTLIFE into the Row(s): box
Move GENDER into the Column(s): box

A checkbox option is available to graph the crosstabulation table results as a clustered bar chart based on counts. Rather than request a bar chart of counts now, we will later use the Graphs menu to construct a clustered bar chart based on percents. The Suppress tables option is available if you want to see the crosstabulation statistical measures but not the crosstabulation tables. A button labeled Exact will appear if the SPSS Exact Tests add-on module is installed.

Figure 6.1 Crosstabs Dialog Box

Because GENDER is designated as the column variable, each gender group will appear as a separate column in the table, and the categories of POSTLIFE will define the rows of the table. The Layer box can be used to build three-way and higher-order tables; we will see this feature later in the chapter.

By default the Crosstabs procedure will display only counts in the cells of the table. For interpretive purposes we want percentages as well. The Cells option button controls the summaries appearing in the cells of the table.

Click Cells
Click the Column check box in the Percentages area to obtain column percentages

The completed Cells dialog box is shown in Figure 6.2.

Figure 6.2 Crosstab Cell Display Dialog

Row, column, and total table percentages can be requested. Row percentages are computed within each row of the table, so that the percentages across a row sum to 100%. Column percentages sum to 100% down each column, and total percentages sum to 100% across all cells of the table. While we can request any or all of these percentages, the column percentage best suits our purpose. If there is a variable that can be considered the independent variable (which GENDER is here), then the appropriate table percentage is based on that dimension. Since GENDER is our column variable, column percentages allow immediate comparison of the percentages of men and women who believe in an afterlife, which is the question we wish to explore. We will not request row percents because we are not directly interested in them and wish to keep the table simple.

Notice that Observed Counts is checked by default. The other choices (Expected Count and Residuals) are more technical summaries and will be considered in the next example.

Click Continue
Click OK
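For readers who prefer command syntax, the dialog selections above correspond roughly to the pasted syntax below (a sketch, not part of the original click-steps; POSTLIFE and GENDER are the variable names used in this course's data file):

CROSSTABS
  /TABLES=POSTLIFE BY GENDER
  /CELLS=COUNT COLUMN.

Pasting syntax from the dialog is a convenient way to document the exact table request for later reuse.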

Figure 6.3 Crosstabulation of Belief in an Afterlife by Gender

BELIEF IN LIFE AFTER DEATH * GENDER OF RESPONDENT Crosstabulation

                                              GENDER OF RESPONDENT
BELIEF IN LIFE AFTER DEATH                     Female     Male     Total
YES    Count                                      541      417       958
       % within GENDER OF RESPONDENT            86.0%    76.9%     81.8%
NO     Count                                       88      125       213
       % within GENDER OF RESPONDENT            14.0%    23.1%     18.2%
Total  Count                                      629      542      1171
       % within GENDER OF RESPONDENT           100.0%   100.0%    100.0%

The statistics labels appear in the row dimension of the table. The two numbers in each cell are counts and column percentages. We see that about 77% of the men and 86% of the women said they believe in an afterlife. The table includes only those respondents who had valid values on both questions. Although gender is known for all respondents, the belief in life after death question was not asked of all respondents. The Case Processing Summary table (not shown) tells us that 58.4% (1641 cases) were excluded from the table.

The row and column totals, often referred to as marginals, show the frequency counts for each variable. The column percentages total to 100%. The percentages in the column labeled "Total" give the percentage of all respondents in the table falling into each category of belief in an afterlife. On the descriptive level we can say that most of the sample believed in the afterlife (look at the percentages in the column labeled "Total"). If we wish to draw conclusions about the population, for example differences between men and women, we would need to perform statistical tests. Row percentages, if requested, would indicate what percentage of believers is male and what percentage of believers is female. In other words, the percentages would sum to 100% across each row.

Your choice of row versus column percents determines your view of the data. In survey research, independent variables, such as demographics, are often positioned as column variables (or as a banner variable in the stub and banner tables of market research), and since there is much interest in comparing these groups, column percents are displayed. If you prefer to interpret row percentages in this context, or wish both percentages to appear, feel free to do so. The important point is that the percentages help answer the question of interest in a direct fashion. Having examined the basic two-way table, we move on to ask questions about the larger population.
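As an aside, if you ever need to reproduce this kind of table outside SPSS Statistics, the same counts and column percentages can be computed with a few lines of Python. This is only an illustrative sketch; the CSV file name and the assumption that POSTLIFE and GENDER are columns in it are hypothetical, not part of the course data setup.

import pandas as pd

# Hypothetical CSV export of the GSS file with POSTLIFE and GENDER columns.
gss = pd.read_csv("gss2004.csv")

# Counts: rows = belief in afterlife, columns = gender (with marginals).
counts = pd.crosstab(gss["POSTLIFE"], gss["GENDER"], margins=True)

# Column percentages: each gender column sums to 100%.
col_pct = pd.crosstab(gss["POSTLIFE"], gss["GENDER"], normalize="columns") * 100

print(counts)
print(col_pct.round(1))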

6.3 Testing the Relationship: Chi-Square Test

In the table viewed in Figure 6.3, 77% of the men in the sample and 86% of the women believed in an afterlife. There is a difference in the sample of approximately 9% with a higher proportion of women believing. Can we conclude from this that there is a population difference between men and women on this issue (statistical significance)? And if there is a difference in the population, is it large enough to be important to us (ecological significance)?


The difficulty we face is that the sample is an incomplete and imperfect reflection of the population. We use statistical tests to draw conclusions about the population from the sample data.

Basic Logic of Statistical Tests

In general, we assume that there is no effect (null hypothesis) in the population. In our example, Ho (the null hypothesis) assumes that gender and belief in an afterlife are independent of each other. We then calculate how likely it is that a sample could show as large (or larger) an effect as what we observe (here a 9% difference) if there were truly no effect in the population. If the probability of obtaining so large a sample effect by chance alone is very small (a cutoff of less than 5 chances in 100, or 5%, is often used), we reject the null hypothesis and conclude there is an effect in the population. While this approach may seem backward, in that we assume no effect when we wish to demonstrate an effect, it provides a valid method of forming conclusions about the population. The details of how this logic is applied will vary depending on the type of data (counts, means, other summary measures) and the question asked (differences, association). So we will use a chi-square test in this chapter, and t and F tests in later chapters.

Logic of Chi-Square

Applying the testing logic to the crosstabulation table, we first calculate the number of people expected to fall into each cell of the table assuming the null hypothesis (no relationship between gender and belief in an afterlife), then compare these numbers to what we actually obtained in the sample. If there is a close match we retain (fail to reject) the null hypothesis of no effect. If the actual cell counts differ dramatically from what is expected under the null hypothesis we will conclude there is a gender difference in the population. The chi-square statistic summarizes the discrepancy between what is observed and what we expect under the null hypothesis. In addition, the sample chi-square value can be converted into a probability that can be readily interpreted by the analyst.

Assumptions of the Chi-Square Test
• Each observation is independent of all other observations; that is, each individual contributes only one observation to the data set.
• Each observation can fall into only one cell in the table.
• The sample size should be large. The larger the sample size, the more accurate the chi-square approximation. Although there is no definitive guide governing what size a sample must be to achieve this criterion, some useful guidelines are given in the output and we discuss these later.

The chi-square statistic is calculated by:
1. Computing the difference between the observed and expected frequencies for each cell.
2. Squaring the difference and dividing by the expected frequency of that cell.
3. Summing these values across all cells.
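To make these three steps concrete, the following short NumPy sketch (illustrative only, and not part of the SPSS workflow) applies them to the observed counts from Figure 6.3:

import numpy as np

# Observed counts from Figure 6.3 (rows: YES/NO belief; columns: Female/Male).
observed = np.array([[541, 417],
                     [ 88, 125]])

# Expected count for each cell under independence:
# row total * column total / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# Steps 1-3: difference, square and divide by expected, sum over all cells.
chi_square = ((observed - expected) ** 2 / expected).sum()
print(expected.round(1))
print(round(chi_square, 2))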


6.4 Requesting the Chi-Square Test

To demonstrate how this works in practice, we will rerun the same analysis as before, but request the chi-square statistic. We will also ask that some supplementary information appear in the cells of the table to better envision the actual chi-square calculation. In practice, you would rarely ask for this latter information to be displayed. The chi-square statistic is requested with the Crosstabs Statistics option button. We return to the previous Crosstabs dialog box to request the chi-square statistic.

Click the Dialog Recall tool, and then click Crosstabs
Click the Statistics button
Click the Chi-square check box

Figure 6.4 Crosstab Statistics Dialog Box

The first choice is the chi-square test of independence of the row and column variables. Most of the remaining statistics are association measures that attempt to assign a single number to represent the strength of the relationship between the two variables. We will briefly discuss them later in this chapter.

Click Continue

To illustrate the chi-square calculation we also request some technical results (expected values and residuals) in the cells of the table. Once again, you would not typically display these statistics. We request expected counts and unstandardized residuals using the Cells option button.

Click the Cells button
Check Expected in the Counts area and Unstandardized in the Residuals area

Figure 6.5 Cell Display Dialog Box

By displaying the expected counts we can see how many observations are expected to fall into each cell, assuming no relationship between the row and column variables. The unstandardized residual is the difference between the observed count and the expected count in the cell. As such it is a measure of the discrepancy between what we expect under the null hypothesis and what we actually observe. In this course we will not explore the other residuals listed, but note that they can be used with large tables to quickly identify cells that exhibit the greatest deviations from independence.

Click Continue
Click OK

6.5 Interpreting the Output

As we can see in Figure 6.6, the counts and percentages are the same as before; the expected counts and residuals will aid in explaining the calculation of the chi-square statistic. Recall that our testing logic assumes no relation between the row and column variables (here gender and belief in an afterlife) in the population, and then determines how consistent the data are with this assumption. In this table there are 417 males who say they believe in an afterlife.

Figure 6.6 Crosstab with Expected Values and Residuals

BELIEF IN LIFE AFTER DEATH * GENDER OF RESPONDENT Crosstabulation

                                              GENDER OF RESPONDENT
BELIEF IN LIFE AFTER DEATH                     Female     Male     Total
YES    Count                                      541      417       958
       Expected Count                           514.6    443.4     958.0
       % within GENDER OF RESPONDENT            86.0%    76.9%     81.8%
       Residual                                  26.4    -26.4
NO     Count                                       88      125       213
       Expected Count                           114.4     98.6     213.0
       % within GENDER OF RESPONDENT            14.0%    23.1%     18.2%
       Residual                                 -26.4     26.4
Total  Count                                      629      542      1171
       Expected Count                           629.0    542.0    1171.0
       % within GENDER OF RESPONDENT           100.0%   100.0%    100.0%

We now need to calculate how many observations should fall into this cell if gender and belief in an afterlife were independent of each other. First, note (we calculate this from the counts in the cells and in the margins of the table) that 46.3% (542 of 1171, or .463) of the sample is male and 81.8% (shown in the row total) of the sample believes in an afterlife. If gender is unrelated to belief in an afterlife, the probability of picking someone from the sample who is both a male and a believer would be the product of the probability of picking a male and the probability of picking a believer, that is, .463*.818 or .379 (37.9%). This rests on the rule that the probability of a joint event equals the product of the probabilities of the separate events when the events are independent (for example, the probability of obtaining two heads when flipping two coins). Taking this a step further, if the probability of picking a male believer is 37.9% and our sample is composed of 1171 people, then we would expect to find 443.4 male believers in the sample (.379*1171). This number is the expected count for the male-believer cell, assuming no relation between gender and belief. We observed 417 male believers while we expected to find 443.4, and so the discrepancy or residual is -26.4. Small residuals indicate agreement between the data and the null hypothesis of no relationship; large residuals suggest the data are inconsistent with the null hypothesis. Expected counts and residuals are calculated in this manner for each cell in the table.

Since simply summing the residuals has the disadvantage of negative and positive residuals (discrepancies) canceling each other out, the residuals are squared so all values are positive. A second consideration is that a residual of 50 would be large relative to an expected count of 15, but small relative to an expected count of 2,000. To compensate for this the squared residual from each cell is divided by the expected count of the cell. The sum of these cell summaries, (Observed count - Expected count)^2 / Expected count, taken over all cells constitutes the Pearson chi-square statistic.
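The expected counts and unstandardized residuals shown in Figure 6.6 can also be reproduced with SciPy. This is a hedged illustration outside SPSS, not the procedure the course uses:

from scipy.stats import chi2_contingency
import numpy as np

observed = np.array([[541, 417],    # YES: Female, Male
                     [ 88, 125]])   # NO:  Female, Male

# correction=False requests the ordinary Pearson chi-square (no Yates correction).
chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

residuals = observed - expected     # unstandardized residuals, as in Figure 6.6
print(expected.round(1))
print(residuals.round(1))
print(dof, round(chi2, 2), p_value)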

Figure 6.7 Chi-Square Test Results

The chi-square is a measure of the discrepancy between the observed cell counts and what we expect if the row and column variables were unrelated. Clearly a chi-square of zero would indicate perfect agreement (and no relationship between the variables); a small chi-square would indicate agreement while a large chi-square would signal disagreement between the data and the null hypothesis. One final consideration is that since the chi-square statistic is the sum of positive values from each cell in the table, other things being equal, it will have greater values in tables with more cells. The chi-square value itself is not adjusted for this, but an accompanying statistic called degrees of freedom, based on the number of cells (technically the number of rows minus one multiplied by the number of columns minus one), is taken into account when the statistic is evaluated.

In order to assess the magnitude of the sample chi-square, the calculated value is compared to the theoretical chi-square distribution and an easily interpreted probability is returned (column labeled Asymp. Significance (2-Sided)). The chi-square we have been discussing, the Pearson chi-square, has a significance value of .000 (which means it is less than .0005). This means that if there were no relation between gender and belief in an afterlife in the population, the probability of obtaining discrepancies as large (or larger) as we see in our sample (percentage differences of 9% between men and women) would be no greater than .0005 (fewer than 5 chances in 10,000). In other words, it is quite unlikely that we would obtain this large a sample difference between men and women if there were no differences in the population. If we consider as significant those effects that would occur less than 5% of the time by chance alone (as many researchers do), we would claim this is a statistically significant effect: U.S. adult women are more likely to believe in an afterlife than men. (We can display the significance value to greater precision by double-clicking on the pivot table to open the Pivot Table editor, then double-clicking on the significance value. Or we can select the cell and format the cell value to display greater precision.)

The Continuity Correction will appear only in two-row by two-column tables when the chi-square test is requested. In such small tables it was known that the standard chi-square calculation did not closely approximate the theoretical distribution, which meant that the significance value was not quite correct. A statistician named Frank Yates published an adjusted chi-square calculation specifically for two-row by two-column tables, and it typically appears labeled as the "Continuity Correction" or as "Yates' correction." It was applied routinely for many years, but more recent Monte Carlo simulation work indicates that it over-adjusts. As a result it is no longer automatically used in two-by-two tables, but it is certainly useful to compare the two significance values to make sure they agree (which they do here).

A more recent chi-square approximation than the standard Pearson chi-square is the likelihood ratio chi-square test. It tests the same null hypothesis, independence of the row and column variables, but uses a different chi-square formulation. It has some technical advantages that largely show up when dealing with higher-order tables (three variables or more). In the vast majority of cases, both the Pearson and likelihood ratio chi-square tests lead to identical conclusions. In most introductory statistics courses, and when reporting results of two-variable crosstab tables, the Pearson chi-square is commonly used. For more complex tables, and more advanced statistical applications, the likelihood ratio chi-square is almost exclusively applied.

Fisher's exact test will also appear for crosstabulation tables containing two rows and two columns (a 2x2 table); exact tests are available for larger tables through the SPSS Exact Tests add-on module. Fisher's test calculates the proportion of all table arrangements that are more extreme than the one observed, while keeping the same marginal totals. Exact tests have the advantage of not depending on approximations (as do the Pearson and likelihood ratio chi-square tests). Although the computational effort required to evaluate exact tests in all but simple situations (such as a 2x2 table) was once large, improvements in algorithms now allow exact tests to be calculated much more efficiently. You should consider using exact tests when your sample size is small, or when some cells in large crosstabulation tables are empty or have small cell counts. As the sample size increases (for all cells), exact tests and asymptotic (Pearson, likelihood ratio) results converge.
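For readers who want to see these variants side by side outside SPSS, SciPy offers the Pearson chi-square with and without the continuity correction, a likelihood-ratio (G-test) version, and Fisher's exact test for 2x2 tables. This sketch is illustrative only and is not a substitute for the SPSS Exact Tests module:

from scipy.stats import chi2_contingency, fisher_exact
import numpy as np

table = np.array([[541, 417],
                  [ 88, 125]])

# Pearson chi-square with and without Yates' continuity correction (2x2 tables).
chi2_yates, p_yates, _, _ = chi2_contingency(table, correction=True)
chi2_pearson, p_pearson, _, _ = chi2_contingency(table, correction=False)

# Likelihood-ratio (G-test) form of the chi-square.
g_stat, p_lr, _, _ = chi2_contingency(table, correction=False,
                                      lambda_="log-likelihood")

# Fisher's exact test (two-sided) for a 2x2 table.
odds_ratio, p_fisher = fisher_exact(table, alternative="two-sided")

print(p_pearson, p_yates, p_lr, p_fisher)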

Ecological Significance

While our significance test indicates that U.S. adult men and women differ in their belief in an afterlife, we now consider the matter of practical importance. Recall that majorities of both men and women believe, and the sample difference between them was about 9%. At this point the researcher should consider whether a 9% difference is large enough to be of practical importance. For example, if these were dropout rates for students in two groups (no intervention, a dropout intervention program), would a 9% difference in dropout rate justify the cost of the program? These are the more practical and policy decisions that often have to be made during the course of an applied statistical analysis.

Small Sample Considerations

The crosstabulation table viewed above was based on a large sample. When sample sizes are small and expected cell counts approach zero, the chi-square approximation may break down, with the result that the probabilities (significance values) cannot be trusted. Although there are no definitive answers, some rules of thumb have been developed to warn the analyst away from potentially misleading results.

Minimum expected cell count: A conservative rule of thumb is that each expected cell count should be 4 or 5 or greater. Studies have shown that the minimum expected cell count could be as low as 1 or 2 without adverse results in some situations. In the presence of many small expected cell counts, you should be concerned that the chi-square test is no longer behaving as it should (matching its theoretical expectations). A footnote to the chi-square table displays the number (and percent) of cells having an expected count less than 5 and the minimum expected count in the crosstab. In our crosstab above, none of the cells had an expected count less than 5.


Observed cell count: You should monitor the number and proportion of zero cells (cells with no observations). Some researchers say that no more than 20% of your observed cell counts should be less than 5 in the situation where your expected counts are well behaved (see above). Too many zero cells, or a particular pattern of zero cells, invalidate the usual interpretation of many measures of association. Zero cells also contribute to a loss of sensitivity in your analysis: two subgroups, which might be distinguishable given enough data, are not when a small sample makes cell counts small and near zero.

What to do when rules of thumb are violated? In practice, when expected or observed counts become small, and if it makes conceptual sense, researchers often collapse several rows or columns together to increase the sample sizes for the now broader groups. Another possibility is to drop a row or column category from the analysis, essentially giving up information about that group in order to obtain stability when investigating the others. In recent years efficient algorithms have been developed to perform "exact tests," which permit low or zero expected cell counts in crosstabulation tables. SPSS has implemented such algorithms in the SPSS Exact Tests add-on module.

6.6 Additional Two-Way Tables

We will examine two additional two-variable crosstabulation tables and apply the chi-square test. Specifically, we will look at the relationship between education degree (DEGREE) and belief in life after death, and gender and computer use (COMPUSE). We return to the Crosstabs dialog box to request the additional tables. We also drop the expected counts and residuals from the tables.

Click the Dialog Recall tool and then click Crosstabs
Move DEGREE into the Column(s) box
Move COMPUSE into the Row(s) box
Click the Cells button
Click to uncheck the Expected cell counts and Unstandardized Residuals (not shown)
Click Continue

Figure 6.8 Multiple Crosstab Tables Request

Multiple tables can be obtained by naming several row or column variables. Each variable in the Row(s) box is run against each variable in the Column(s) box. Our request will produce four two-variable tables.

Click OK

Figure 6.9 Belief in Afterlife by Education Degree

Across the different education degrees, belief in an afterlife ranges from a high of 88% (Junior College) to a low of 73% (Less than High School). The Pearson and likelihood ratio chi-squares indicate a significant result at the .05 level, but not at the .01 level (a sample with differences this large would occur about 26 times in 1000 (.026) by chance alone if there were no differences in the population). No continuity correction or Fisher's exact test appears because this is not a two-row by two-column table. The Linear-by-Linear chi-square (not displayed for 2x2 tables) tests the very specific hypothesis of linear association between the row and column variables. This assumes that both variables are interval scale measures and you are interested in testing for straight-line association. This is rarely the case in crosstabulation tables (unless working with rating scales) and the test is not often used. No cells have an expected count less than 5, and the minimum expected frequency for a cell is 18.74. This satisfies the sample-size guidelines for using the chi-square test.

Figure 6.10 Computer Use and Gender

We see there is a 5.6% difference between men and women in the percentage who use the computer. One could argue that the relationship is significant if we use the .05 level as the cutoff for the chi-square significance level, since the significance for the Pearson chi-square statistic is .049. However, the significance values of both the continuity correction chi-square and Fisher's exact test are slightly above .05. Even though there are no small-sample problems (low expected counts), the conservative approach would be to conclude from this sample that there is no difference between male and female computer use in the U.S. adult population. Since the results are somewhat inconclusive, this would also suggest that further study would be warranted if this were a relationship of great interest. We will look at the fourth table that we ran, the table of computer use with education degree, in the appendix of this chapter.

In this set of three tables, two were statistically significant while the third would conservatively be evaluated as not significant at the .05 level. To repeat, if a significance value were above .05, say for example .60, it would imply that, under the null hypothesis of independence between the row and column variables in the population, it is quite likely (a 60% chance) that we would obtain differences at least as large as those observed in our sample. In other words, the sample is consistent with the assumption of independence of the variables.


6.7 Graphing the Crosstabs Results

Percentages in a crosstabulation table can be displayed using clustered bar charts. You can request bar charts based on counts directly from the Crosstabs dialog box, but since we wish to display percentages, we instead use the Chart Builder. A simple rule to apply for bar charts is that the percentages represented in the bar chart should be consistent with those displayed in the crosstabulation table. Typically, the variable on which you based the percentages is used as the cluster variable and the percentages are based on the categories of that variable. However, using Chart Builder, you have a choice in how you organize the variables on the chart. As an example, we will graph the percentages for gender and belief in life after death. Figure 6.11 shows the completed Chart Builder.

Click Graphs…Chart Builder…
Click OK in the Information box (if necessary)
Click Reset
Click the Gallery tab (if it's not already selected)
Click Bar in the Choose from: list

Select the second icon for Clustered Bar Chart and drag it to the Chart Preview canvas
Drag and drop POSTLIFE from the Variables: list to the X-Axis? area in the Chart Preview canvas
Drag and drop GENDER from the Variables: list to the Cluster: Set Color area in the Chart Preview canvas
In the Element Properties dialog box (click the Element Properties button if this dialog box did not automatically open), choose Bar1 in the Edit Properties of: list
Select Percentage(?) from the Statistics: dropdown list

We can now select the variable in the chart to use for the base or denominator of the percentage. The choices are Grand Total to base the percentages on the total cases in the chart, Total for each X-Axis Category to base the percentages on the x-axis variable, and Total for Each Legend Variable Category (same fill color) to base the percentages on the cluster (legend) variable. The crosstabulation percentages are based on gender, so we want the bar chart percentages to be based on the cluster variable.

Click the Set Parameters button
Select Total for Each Legend Variable Category (same fill color)

Figure 6.11 Chart Builder and Element Properties for Clustered Bar Chart

Click Continue in the Element Properties: Set Parameters dialog box
Click Apply in the Element Properties dialog box
Click OK in the Chart Builder to build the chart

Figure 6.12 Bar Chart of Belief in Afterlife by Gender

We now have a direct visual comparison between the men and women to supplement the crosstabulation table and significance tests. This graph might be useful in a final presentation or report.

Hint
You can create a bar chart directly from the values in the Crosstabs pivot table. To do so, double-click on the crosstabs pivot table to activate the Pivot Table Editor, then select (Ctrl-click) all the table values that you wish to plot, for example the column percents except for totals. Next, right-click and select Create Graph…Bar from the context menu. A bar chart will be inserted in the Viewer window, following the pivot table.
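If you prefer to build the equivalent chart outside SPSS, a clustered bar chart of column percentages takes only a few lines with pandas and matplotlib. The file name and column names below are hypothetical placeholders for an exported copy of the data:

import pandas as pd
import matplotlib.pyplot as plt

gss = pd.read_csv("gss2004.csv")   # hypothetical export with POSTLIFE and GENDER

# Percentages within each gender (the legend/cluster variable), as in Figure 6.12.
pct = pd.crosstab(gss["POSTLIFE"], gss["GENDER"], normalize="columns") * 100

# One cluster of bars per belief category, one colour per gender.
ax = pct.plot(kind="bar", rot=0)
ax.set_xlabel("Belief in life after death")
ax.set_ylabel("Percent within gender")
plt.tight_layout()
plt.show()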

6.8 Adding Control Variables

To investigate more complex interactions, you may want to explore the relationship of three or more variables. Within the Crosstabs procedure, we can specify one or more layer variables. Adding one layer variable produces a three-way table, which is displayed as a series of two-way tables, one for each category of the third variable. This third variable is sometimes called the control variable since it determines the composition of each sub-table.

We will illustrate a three-way table using the table of belief in an afterlife by degree as a basis (see Figure 6.9). In this table, we discovered that those with less than a high school education had the lowest belief in an afterlife and that the relationship was significant at the .05 level. Suppose we are interested in seeing how gender might interact with this observed relationship. To explore this question we specify GENDER as the control (or layer) variable in the crosstabulation analysis. In this way we can view a table of belief in afterlife by degree separately for males and females. We will request a chi-square test of independence for each sub-table.

Click on the Dialog Recall tool and then click Crosstabs
Click the Reset pushbutton
Move POSTLIFE into the Row(s): list box
Move DEGREE into the Column(s): list box
Move GENDER into the Layer list box
Click on the Cells pushbutton and click the Column check box under Percentages
Click Continue
Click Statistics and click the Chi-square check box
Click Continue

Figure 6.13 Crosstabs Dialog Box for Three-Way Table

Click OK

As before, POSTLIFE and DEGREE are, respectively, the row and column variables; but GENDER is added as a layer (or control) variable. Note that GENDER is in the first layer. Additional control variables can be added as higher-level layers by clicking the Next button. Although not shown, we asked for column percentages in the Cells dialog box and the chi-square test from the Statistics dialog box.
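The layering idea is easy to mimic outside SPSS as well: split the data by the control variable and build one two-way table (with its own chi-square test) per category. The sketch below is illustrative only, with a hypothetical CSV export and column names:

import pandas as pd
from scipy.stats import chi2_contingency

gss = pd.read_csv("gss2004.csv")   # hypothetical export

# One POSTLIFE x DEGREE sub-table, and chi-square test, per GENDER category,
# mirroring the layer (control) variable in Crosstabs.
for gender, subset in gss.groupby("GENDER"):
    sub_table = pd.crosstab(subset["POSTLIFE"], subset["DEGREE"])
    chi2, p_value, dof, _ = chi2_contingency(sub_table, correction=False)
    print(gender, "chi-square =", round(chi2, 2), "df =", dof, "p =", round(p_value, 3))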

Figure 6.14 Belief in Afterlife, Degree, and Gender

BELIEF IN LIFE AFTER DEATH * RS HIGHEST DEGREE * GENDER OF RESPONDENT Crosstabulation

GENDER OF RESPONDENT: Female
                                                    RS HIGHEST DEGREE
BELIEF IN LIFE                      LT HIGH     HIGH    JUNIOR
AFTER DEATH                          SCHOOL   SCHOOL   COLLEGE  BACHELOR  GRADUATE    Total
YES    Count                             61      293        54        90        43      541
       % within RS HIGHEST DEGREE    77.2%    85.4%     94.7%     90.0%     86.0%    86.0%
NO     Count                             18       50         3        10         7       88
       % within RS HIGHEST DEGREE    22.8%    14.6%      5.3%     10.0%     14.0%    14.0%
Total  Count                             79      343        57       100        50      629
       % within RS HIGHEST DEGREE   100.0%   100.0%    100.0%    100.0%    100.0%   100.0%

GENDER OF RESPONDENT: Male
                                                    RS HIGHEST DEGREE
BELIEF IN LIFE                      LT HIGH     HIGH    JUNIOR
AFTER DEATH                          SCHOOL   SCHOOL   COLLEGE  BACHELOR  GRADUATE    Total
YES    Count                             44      210        37        74        52      417
       % within RS HIGHEST DEGREE    67.7%    78.9%     80.4%     75.5%     77.6%    76.9%
NO     Count                             21       56         9        24        15      125
       % within RS HIGHEST DEGREE    32.3%    21.1%     19.6%     24.5%     22.4%    23.1%
Total  Count                             65      266        46        98        67      542
       % within RS HIGHEST DEGREE   100.0%   100.0%    100.0%    100.0%    100.0%   100.0%

The crosstabulation has two sub-tables, one for females and one for males. The percentages saying "yes" vary more across degree groups for females than for males, and the patterns are slightly different. Among the females, there is a drop-off of 4% from the Bachelor's to the Graduate degree group, while for the males, the percentage of believers actually increases slightly between these same two degree groups.

Figure 6.15 Chi-Square Statistics for Three-Way Table

The chi-square results are intriguing because they indicate that the relationship between afterlife and degree is significant for females (p=.039) at least at the .05 level but not for males (p=.382). This difference suggests a possible interaction effect: the effect of one variable (DEGREE) on another (POSTLIFE) depends on the value of a third variable (GENDER). However, given the large size of the sample and the significance levels, we would be cautious in our interpretation.


As suggested in the next section, we could use more advanced techniques to further test the significance of the interaction. These new results don't mean that the two-way table is wrong or inaccurate. That table does present the relationship between highest educational degree and belief in an afterlife. The next step for the analyst would be to determine whether this newfound difference is important for their purposes, and perhaps to look at other variables that might provide more information about this relationship.

6.9 Extensions: Beyond Crosstabs

Decision tree analysis is often used by data analysts who need to predict, as accurately as possible, into which outcome group an individual will fall, based on potentially many nominal or ordinal background variables. For example, an insurance company is interested in the combination of demographics that best predicts whether a client is likely to make a claim. Or a direct mail analyst is interested in the combinations of background characteristics that yield the highest return rates. Here the emphasis is less on testing a hypothesis and more on a heuristic method of finding the optimal set of characteristics for prediction purposes. CHAID (chi-square automatic interaction detection), a commonly used decision-tree technique, is available along with other decision-tree methods in the SPSS Decision Trees add-on module.

A technique called loglinear modeling can also be used to analyze multi-way tables. This method requires statistical sophistication and is beyond the scope of our course. SPSS Statistics has several procedures (Genlog, Loglinear and Hiloglinear) to perform such analyses. They provide a way of determining which variables relate to which others in the context of a multi-way crosstab (also called contingency) table. These procedures could be used to explicitly test for the three-way interaction suggested above. For an introduction to this methodology see Fienberg (1977). Academic researchers often use such models to test hypotheses based on survey data.

Occasionally there is interest in testing whether a frequency table based on sample data is consistent with a distribution specified by the analyst. This test (the one-sample chi-square) is available within the SPSS Statistics Base nonparametric procedures (click Analyze…Nonparametric Tests…Chi-Square).
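As a small illustration of that last idea (again outside SPSS, with made-up counts), a one-sample chi-square goodness-of-fit test compares observed category frequencies with frequencies the analyst specifies:

from scipy.stats import chisquare

# Hypothetical observed counts for a four-category frequency table (n = 200).
observed = [48, 52, 61, 39]

# Null hypothesis: all four categories are equally likely (expected = 50 each).
stat, p_value = chisquare(f_obs=observed, f_exp=[50, 50, 50, 50])
print(round(stat, 2), round(p_value, 3))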

6.10 Appendix: Association Measures

We have discussed the chi-square test of independence and how to use it to determine whether there is a statistically significant relationship between the row and column variables in the population. And we viewed the percentages in the table to describe the relation and determine the magnitude of the differences. It would be useful to have a single number to quantify the strength of the association between the row and column variables. Measures of association have been developed to allow you to compare different tables and speak of relative strength of association or effect. They are typically normed to range between 0 (no association) and 1 (perfect association) for variables on a nominal scale. Those assuming ordinal measurement are scaled from -1 to +1, the extremes representing perfect negative and positive association, respectively; here zero would indicate no ordinal association.

One reason for the large number of measures developed is that there are many ways two variables can be related in a large crosstabulation table. In addition, depending on the level of measurement (for example, nominal versus ordinal), different aspects of association might be relevant. Association measures tend to be used in academic and medical research studies, less so in applied work such as market research. In market research you typically display the crosstabulation table for examination, rather than focus on a single summary. We will review some general characteristics of the association measures, but not consider them in great detail. For a more involved discussion of association measures for nominal variables see Gibbons (1993), while a more complete but technical reference is Bishop, Fienberg and Holland (1975). First, some general points:

• Some measures of association are based on the chi-square value; others are based on probabilistic considerations. The latter class is usually preferred since chi-square based values have no direct, intuitive interpretation.
• Some measures of association assume a certain level of measurement (for example, dichotomous, nominal, or ordinal). Consider this when choosing a particular measure.
• Some measures are symmetric, that is, they do not vary if the row and column variables are interchanged. Others are asymmetric and must be interpreted in light of a causal or predictive ordering that you conceive between your variables.
• Measures of association for crosstabulation tables are bivariate (two-variable). In general, multivariate (three or more variable) extensions do not exist. To explore association in higher-order tables you must turn to a method called loglinear modeling (implemented in the Genlog and Hiloglinear procedures of the SPSS Advanced Statistics add-on module; see the Loglinear choice under the Analyze menu). Such analyses were briefly mentioned in this chapter (section 6.9), but are beyond the scope of this course.

Association Measures Available within Crosstabs

Several common measures are available in the Crosstabs procedure. They can be classified in the following groups.

Chi-Square Based: Phi, V, and the Contingency Coefficient are measures of association based on the chi-square value. Their early advantage was convenience: they could be readily derived from the already calculated chi-square. Values range from 0 to 1. Their disadvantage is that there is not a simple, intuitive interpretation of the numeric values.

Nominal Probabilistic Measures: Lambda and Goodman & Kruskal's Tau (both produced by selecting Lambda in the dialog) are probabilistic or PRE (proportional reduction in error) measures suitable for nominal scale data. They are measures attempting to summarize the extent to which the category value of one variable can be predicted from the value of the other. These measures are asymmetric and are reported with each variable predicted from the other.

Ordinal Probabilistic Measures: Kendall's tau-b, tau-c, Gamma and Somers' d are all probabilistic measures appropriate for ordinal tables. Values range from -1 to +1, and reflect the extent to which higher categories (based on the data codes used) of one variable are associated with higher categories of the second variable. Some of these are asymmetric (e.g., Somers' d).

Correlations produces Pearson's r, the standard correlation coefficient, which assumes both variables are interval scaled. If this association were the main interest in the analysis, such correlations can be obtained directly from the Correlation procedure.

Eta is asymmetric and assumes the dependent measure is interval scale while the independent variable is nominal. It measures the reduction in variation of the dependent measure when the value of the independent variable is known. It can also be produced by the Means (Analyze…Compare Means…Means) and GLM (Analyze…General Linear Model…Univariate) procedures.

The McNemar statistic is used to test for equality of correlated proportions, as opposed to general independence of the row and column variables (as does the chi-square test). For example, if we ask people, before and after viewing a political commercial, whether they would vote for candidate A, the McNemar test would test whether the proportion choosing candidate A changed.

The Cochran's and Mantel-Haenszel statistics test whether a dichotomous response variable is conditionally independent of a dichotomous explanatory variable when adjusting for a control variable. For example, is there an association between instruction method (treatment vs. control) and exam performance (pass vs. fail), controlling for school area (urban vs. rural)?

An association measure often used when coding open-end responses to survey questions is Kappa, which measures the agreement between two raters of the same information.

The relative risk association measure is often used in health research; it assesses the odds of the occurrence of some outcome in the presence of an event (the use of a drug, a medical condition). It is not bounded as the other association measures are.

These association measures are found in the Crosstab Statistics dialog box. We will request several measures for the computer use by education degree table. Here both nominal and ordinal measures of association might be desirable, as we will explain.

Click on the Dialog Recall tool, then click Crosstabs
Click Reset
Move COMPUSE to the Row(s) box
Move DEGREE to the Column(s) box
Click Cells
Click to check Column in the Percentages area
Click Continue
Click the Statistics pushbutton
Click to check Chi-square, Lambda, Gamma, and Kendall's tau-c

Figure 6.16 Association Measures in Crosstabs

The association measures are grouped by the level of measurement assumed for the variables. We checked Lambda (which will also produce Goodman & Kruskal's Tau) along with Kendall's tau-c and the Gamma statistic.

Click Continue
Click OK

Figure 6.17 Computer Use and Education Degree

The significance level of the chi-square test is well below .01, so there is a statistically significant relationship, and respondents with higher degrees are more likely to use the computer. The majority of people in all of the degree groups, except the less than high school group, use the computer, and the percentage difference between that group and the Graduate degree group is well over 50%. We move next to the association measures.

Figure 6.18 Association Measures for Computer Use and Education Degree

The association measures are in two tables, one for the nominal and the other for the ordinal statistics. The column labeled Value contains the actual association measures. In the first table (Directional Measures), we focus on the values for computer use as the dependent variable since that is our assumption. Keeping in mind that zero would be the weakest association, the values of .2 for both measures are well above zero and statistically significant. We have a situation in which there is a statistically significant result, but the level of association is lower than we might expect given the differences in the percentages among the degree groups. This is often the case with nominal measures of association.

In the second table, note that the ordinal measures (gamma and tau-c) are larger in magnitude. For an ordinal measure to be non-zero, the proportion of respondents using the computer needs to increase (or decrease) as education degree increases. This is indeed the case, and gamma indicates a strong association (-.739). The two measures have a negative sign because as education increases, the computer use percentage also increases, but a "yes" for COMPUSE is coded with a lower value (1) than "no" (2). Thus higher data values on degree are associated with lower data values on COMPUSE. We have used COMPUSE, which would seem to be coded on a nominal scale, with ordinal measures of association. Dichotomous variables, for purposes of crosstabulation (and some other techniques), can be considered as measured on an ordinal scale.
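For readers curious how two of these measures are constructed, the sketch below computes a chi-square based measure (Cramer's V) and Goodman & Kruskal's gamma directly from a table of counts. The formulas are the standard textbook ones, the example table is hypothetical, and this is not meant to reproduce the Crosstabs output exactly:

import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Chi-square based association measure scaled to the 0-1 range."""
    table = np.asarray(table, dtype=float)
    chi2 = chi2_contingency(table, correction=False)[0]
    return np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))

def goodman_kruskal_gamma(table):
    """Gamma = (concordant - discordant) / (concordant + discordant) pairs."""
    table = np.asarray(table, dtype=float)
    concordant = discordant = 0.0
    for i in range(table.shape[0]):
        for j in range(table.shape[1]):
            concordant += table[i, j] * table[i + 1:, j + 1:].sum()
            discordant += table[i, j] * table[i + 1:, :j].sum()
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical 2 x 5 table: rows = computer use (yes, no), columns = degree levels.
example = [[30, 120, 45, 80, 60],
           [70,  90, 20, 15, 10]]
print(round(cramers_v(example), 3), round(goodman_kruskal_gamma(example), 3))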

The other columns in the two tables are somewhat technical and we will not pursue them here (see the references cited earlier in this section). However, they are used when you wish to perform statistical significance tests on the association measures themselves, to determine whether an association measure differs from zero in the population.


Summary Exercises

We want to study the relationship between two demographic variables, race and gender, and three attitude and behavior variables: HLTH1, whether you were ill enough to go to the doctor last year; NATAID, attitude toward spending on foreign aid; and NEWS, how frequently you read the newspaper. Before running the analysis, think about the variables involved in these tables. What relations would you expect to find here and why?

1. Run crosstabulations of race (RACE) against the measures: HLTH1, whether you were ill enough to go to the doctor last year; NATAID, attitude toward spending on foreign aid; and NEWS, how often the newspaper is read. Request the appropriate percentage within racial categories and run the chi-square test of independence.
2. Then repeat the analysis after substituting GENDER in place of RACE. How would you summarize each finding in a paragraph?
3. Run a three-way table of HLTH1 by GENDER by RACE and NATAID by RACE by GENDER. Request the chi-square. Do these findings affect your summaries from above? If so, how?
4. Create a clustered bar chart displaying the results of one of your tables.

For those with extra time:
1. Request appropriate measures of association for the HLTH1 by GENDER table and the NEWS by RACE by GENDER table. Are the results consistent with your interpretation up to this point? Based on either the association measures or percentage differences, would you say the results have practical (or ecological) significance?
2. If you created a collapsed version of the WEBYR variable in the Chapter 3 exercise, run a crosstab with NEWS30, whether you accessed a news website in the last 30 days. Would you expect to see a relationship?


Chapter 7: Mean Differences Between Groups: T Test

Topics
• Logic of Testing for Mean Differences
• Exploring the Group Differences
• Testing the Differences: Independent Samples T Test
• Interpreting the T Test Results
• Graphing the Mean Differences
• Appendix: Paired T Test
• Appendix: Normal Probability Plots

Data
In this chapter, we continue to use the GSS2004Intro.sav file.

Scenario
Using the GSS 2004 data, we want to investigate whether men differ from women on two types of behavior: the hours spent watching TV every day and the number of hours each week using the Internet. Since both measures are scale variables, we will summarize the groups using means. Our goal is to draw conclusions about population differences based on our sample.

7.1 Introduction

In Chapter 6 we performed statistical tests in order to draw conclusions about population group differences on categorical variables using crosstabulation tables and applying the chi-square test of independence. When our purpose is to examine group differences on interval scale outcome measures, we turn to the mean as the summary statistic since it provides a single measure of central tendency. Also, from a statistical perspective, the properties of sample means are well known, which facilitates testing. For example, we will compare men and women in their mean number of hours using the Internet each week. In this chapter, we outline the logic involved when testing for mean differences between groups, state the assumptions, and then perform an analysis comparing two groups. Appendix A will generalize the method to analysis involving more than two groups.

7.2 Logic of Testing for Mean Differences

The goal of statistical tests on means is to draw conclusions about population differences based on the observed sample means. To provide a context for this discussion, we view a series of boxplots, such as those discussed in Chapter 4, showing three groups (A, B and C) sampled from different populations with different distributions on a scale variable. The first boxplot, in Figure 7.1, displays the case in which the three groups are distinctly different in mean level on the scale variable.

Figure 7.1 Samples from Three Very Different Populations

We see that the groups are well separated: there is no overlap between any sample group and either of the remaining two. In this case, a statistical test is almost superfluous since the groups are so disparate, but if performed we would find highly significant differences. Next we turn to a case in which the groups are samples from the same population and show no differences.

Figure 7.2 Three Samples from the Same Population

Here there is considerable overlap of the three samples; the medians (and means) and other summaries match almost identically across the groups. If there are any true differences between the population groups they are likely to be extremely small and not have any practical importance. When there are modest population differences, we might obtain the result below.

Figure 7.3 Samples from Three Modestly Different Populations


There is some overlap among the three groups, but the sample means (medians in the boxplot) are different. In this instance a statistical test would be valuable to assess whether the sample mean differences are large enough to justify the conclusion that the population means differ. This last plot represents the typical situation facing a data analyst.

As we did when we performed the chi-square test, we formulate a null hypothesis and use the data to evaluate it. Ho (the null hypothesis) assumes that the population means are identical. We then determine if the differences in sample means are consistent with this assumption. If the probability of obtaining sample means as far (or further) apart as we find in our sample is very small (less than 5 chances in 100, or .05), assuming no population differences, we reject our null hypothesis and conclude the populations are different.

We implement this logic by comparing the variation among sample means relative to the variation of individuals within each sample group. The core idea is that if there were no differences between the population means, then the only source for differences in the sample means would be the variation among individual observations (since the samples contain different observations), which we assume is random. We then compute a ratio of the variance among sample means (referred to as between-group variance) divided by the variance among individual observations within each group (referred to as within-group variance). This ratio will be close to 1 if there are no population differences. If there are true population differences, we would expect the ratio of variances to be greater than 1. This ratio is referred to as the F value, although in the two-group comparison it is typically reported as a t value (the square root of F). Under the assumptions made in analysis of variance (discussed below), this variance ratio follows a known statistical distribution, the F distribution. Thus the result of performing the test will be a probability indicating how likely we are to obtain sample means as far apart (or further) as we observe in our sample if the null hypothesis were true. If this probability is very small, we reject the null hypothesis and conclude there are true population differences.

This concept of taking a ratio of between-group variation (of means) to within-group variation (of individuals) is fundamental to the statistical method called analysis of variance. It is implicit in the simple two-group case (t test), and appears explicitly in more complex analyses (general ANOVA).
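The following sketch, using simulated data rather than the GSS file, shows the two-group version of this logic with SciPy: the pooled t test compares the mean difference to within-group variation, and squaring the t statistic gives the equivalent F ratio.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Two hypothetical samples drawn from populations with modestly different means.
group_a = rng.normal(loc=10.0, scale=3.0, size=60)
group_b = rng.normal(loc=11.5, scale=3.0, size=60)

# Pooled (equal-variance) t test of the null hypothesis of equal population means.
t_stat, p_value = ttest_ind(group_a, group_b, equal_var=True)
print(round(t_stat, 2), round(p_value, 4))

# In the two-group case, the squared t statistic equals the one-way ANOVA F ratio.
print(round(t_stat ** 2, 2))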

Assumptions

When statistical tests of mean differences are applied, that is, the t test for two-group differences and the F test for the more general case, at least two assumptions are made. First, that the distribution of the dependent measure within each population subgroup follows the normal distribution (normality). Second, that its variation is the same within each population subgroup (homogeneity of variance). When these assumptions are met, the t and F tests can be used to draw inferences about population means. We will discuss each of these assumptions as it applies in practice and see whether they hold in our data.


Normality Assumption

Normality of the dependent variable within each group is formally required when statistical tests (t, F) involving mean differences are performed. In reality, these tests are not much influenced by moderate departures from normality. This robustness of the significance tests holds especially when the sample sizes are moderate to large (over 25 cases) and the dependent measure has the same distribution (for example, skewed to the right) within each comparison group. Thus while normality is assumed when performing the significance tests, the results are not much affected by moderate departures from normality (for discussion and references, see Kirk (1964); for an opposing view see Wilcox (2004)). In practice, researchers often examine histograms and box plots for each group in order to make this determination. If a more formal approach is preferred, the Explore procedure can produce more technical plots (normal probability plots) and statistical tests of normality (see the second appendix to this chapter). In situations where the sample sizes are small or there are gross deviations from normality, researchers often shift to nonparametric tests. An example is given in Appendix A.

Homogeneity of Variance

The second assumption, homogeneity of variance, indicates that the variance of the dependent variable is the same for each population subgroup. Under the null hypothesis we assume the variation in sample means is due to the variation of individual scores, and if different groups show disparate individual variation, it is difficult to interpret the overall ratio of between-group to pooled within-group variation. This directly affects significance tests. Based on simulation work, it is known that significance tests of mean differences are not much influenced by a moderate lack of homogeneity of variance if the sample sizes of the groups are about the same. If the sample sizes are quite different, then lack of homogeneity (heterogeneity) is a problem in that the significance test probabilities are not correct. SPSS Statistics performs a formal test, the Levene test of homogeneity, to test the homogeneity assumption. We will discuss this with the example results. When comparing means from two groups (t test) and in one-factor ANOVA (see Appendix A) there are corrections for lack of homogeneity. In the more general ANOVA analysis a simple correction does not exist.

It is beyond the scope of this course, but it should be mentioned that if there is a relationship or pattern between the group means and standard deviations (for example, if groups with higher mean levels also have larger standard deviations), there are sometimes data transformations that, when applied to the dependent variable, will result in homogeneity of variance. Such transformations can entail additional complications, but provide a method of meeting the homogeneity of variance requirement. The Explore procedure's Spread & Level plot can provide information as to whether this approach is appropriate and can suggest the optimal data transformation to apply to the dependent measure.

To oversimplify: when dealing with moderate or large samples and testing for mean differences, moderate departures from normality are usually not a serious problem, but gross departures from homogeneity of variance do affect significance tests when the sample sizes are disparate.
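A brief illustration of how these two checks look in code (simulated data loosely echoing the Internet-use summaries reported later in this chapter; the numbers are not the GSS values): Levene's test probes the homogeneity assumption, and the Welch (unequal-variance) t test is a common fallback when that assumption is in doubt.

import numpy as np
from scipy.stats import levene, ttest_ind

rng = np.random.default_rng(1)
# Simulated groups with unequal spread and unequal sample sizes.
women = rng.normal(loc=6.3, scale=9.9, size=900)
men = rng.normal(loc=8.8, scale=11.0, size=790)

# Levene's test of the homogeneity of variance assumption.
levene_stat, levene_p = levene(women, men)

# Welch t test: does not assume equal variances in the two groups.
t_welch, p_welch = ttest_ind(women, men, equal_var=False)
print(round(levene_p, 3), round(t_welch, 2), round(p_welch, 4))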

Sample Size

Generally speaking, tests involving comparisons of sample means do not require any specific minimal sample size. Formally, there must be at least one observation in each group of interest and at least one group with two or more observations in order to obtain an estimate of the within-group variation. While these requirements are quite modest, the more important point regarding sample size is that of statistical power: your ability to detect differences that truly exist in the population. As your sample size increases, the precision with which means and standard deviations are estimated increases, as does the probability of finding true population differences (power). Thus larger samples are desirable from the perspectives of statistical power and robustness (recall our discussion of normality), but are not formally required.

These analyses do not require that the group sizes be equal. However, analyses involving tests of mean differences are more resistant to violation of the homogeneity of variance assumption when the sample sizes are equal (or near equal). In the more general ANOVA analysis, equal (or proportional) group sample sizes bring assurance that the various factors under investigation can be looked at independently. Finally, equal sample size conveys greater statistical power when looking for differences among groups. In summary, equal group sample sizes are not required, but do carry advantages. This is not to suggest that you should drop observations from the analysis in order to obtain equal numbers in each group, since this would throw away information. Rather, think of equal group sample size as an advantageous situation you should strive for when possible. In experiments equal sample size is usually part of the design, while in survey work it is rarely seen.

7.3 Exploring the Group Differences In this analysis we wish to determine if there are population gender differences in hours using the Internet each week and hours watching TV each day. Before performing these tests, we will use the exploratory data analysis procedures we discussed in Chapter 4 to look at group differences on these variables and check for violations of the assumptions above. Click Analyze…Descriptive Statistics…Explore Move WWWHR and TVHOURS into the Dependent List: box Move GENDER into the Factor List: box

Figure 7.4 Explore Dialog Box Comparing Groups

Explore will perform a separate analysis on each dependent variable for each category of the Factor variable. The Factor variable, GENDER, will produce statistical summaries separately for males and females. We will use the Options button to request that missing values should be


treated separately for each dependent variable (Pairwise option). We mentioned in Chapter 4 that Explore's default is to exclude a case from analysis if it contains a missing value for any of the dependent variables. Click Options button Click Exclude cases pairwise Click Continue

Finally, we will request histograms rather than stem & leaf plots. Click Plots button Click off (uncheck) Stem & leaf in the Descriptives area Click on (check) Histograms Click Continue

Figure 7.5 Explore Options and Plots Dialog Boxes

Click OK
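If you prefer to keep a record of the analysis, the Paste button in the Explore dialog generates syntax along the lines of the sketch below. Treat it as an illustration rather than the literal pasted output; exact subcommands can vary slightly by version.

* Explore WWWHR and TVHOURS separately for each gender;
* pairwise handling of missing values, histograms rather than stem & leaf plots.
EXAMINE VARIABLES=wwwhr tvhours BY gender
  /PLOT=BOXPLOT HISTOGRAM
  /COMPARE=GROUPS
  /STATISTICS=DESCRIPTIVES
  /MISSING=PAIRWISE
  /NOTOTAL.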


Figure 7.6 Summaries of Hours of Internet Use per Week

Descriptives: WWW HOURS PER WEEK by GENDER OF RESPONDENT

                                      Statistic              Std. Error
                                   Female      Male        Female    Male
Mean                                 6.30       8.79         .329     .392
95% CI for Mean, Lower Bound         5.65       8.02
95% CI for Mean, Upper Bound         6.94       9.56
5% Trimmed Mean                      4.83       7.24
Median                               3.00       5.00
Variance                           98.341    121.877
Std. Deviation                      9.917     11.040
Minimum                                 0          0
Maximum                               130        100
Range                                 130        100
Interquartile Range                     6          8
Skewness                            4.602      2.757         .081     .087
Kurtosis                           35.922     10.990         .162     .173

Note: Editing Descriptives Table The original output for Figure 7.6 was edited using the Pivot Table editor to facilitate the male to female comparisons (steps outlined below). Double-click on the Descriptives pivot table Click Pivot…Pivoting Trays to activate the Pivoting Trays window (if necessary) Drag Gender of Respondent from the Row dimension tray to the Column dimension tray below Stat Type already there Drag Dependent Variables from the Row dimension tray to the Layer dimension tray Close the Pivot Table Editor

In Figure 7.6, we can see that the means (male 8.79; female 6.30) are higher than both the median and trimmed mean for each gender, which suggests some skewness to the data. This is confirmed by the positive skewness measures and the histograms in Figure 7.7. The mean for males is about 2.5 hours greater than that for females. Also the sample standard deviation for males is larger (11.04 to 9.92), and the IQR for males is also greater (8 to 6). Thus, it is unclear whether the homogeneity of variance assumption has been met. Both genders have some very high maximum values.


Figure 7.7 Histograms for Females and Males of Hours of Internet Use



Viewing the histograms with normality in mind, it is very obvious that both distributions are positively skewed and not normal. However, keeping in mind our earlier discussion about the assumptions for statistical testing, since both gender groups show a similar skewed pattern, we will not be concerned since the sample sizes are fairly large (793 for males and 908 females). These numbers are found in the Case Processing Summary table (not shown). The box plot provides some visual confirmation of the mean (actually median) differences between the two groups. Note that the difference would be made more apparent by editing the range of the vertical scale of the chart. The side-by-side comparison shows that the groups have a similar pattern of positive skewness. The height of the box (the IQR) is smaller for women confirming the smaller dispersion for females. Outliers are identified and might be checked against the original data for errors; we considered this issue when we performed exploratory data analysis on Internet use for the entire sample. Based on these plots and summaries we might expect to find a significant mean difference between men and women. Also, since the two groups have a similar distribution of data values (positively skewed) with large samples, we feel comfortable about the normality assumption to be made when testing for mean differences. Figure 7.8 Boxplot of Internet Use Per Week for Males and Females

Next we turn to the summaries for the hours watching television each day.


Figure 7.9 Summaries of Hours Per Day Watching TV

Descriptives: HOURS PER DAY WATCHING TV by GENDER OF RESPONDENT

                                      Statistic              Std. Error
                                   Female      Male        Female    Male
Mean                                 2.99       2.72         .132     .111
95% CI for Mean, Lower Bound         2.73       2.51
95% CI for Mean, Upper Bound         3.25       2.94
5% Trimmed Mean                      2.63       2.47
Median                               2.00       2.00
Variance                            8.291      5.184
Std. Deviation                      2.879      2.277
Minimum                                 0          0
Maximum                                20         14
Range                                  20         14
Interquartile Range                     3          2
Skewness                            2.696      2.112         .112     .119
Kurtosis                           10.183      5.874         .223     .238

The mean is 2.99 for females and 2.72 for males, so each gender watched about 3 hours per day. The means are quite similar, suggesting that there may be no difference in the population between the two genders. The means are above their respective trimmed means and the medians; the skewness measures are several standard errors from zero, so both groups have positive skewness. And the kurtosis values are far from zero, especially for females. All these signs indicate that the distributions are not normal. However, the sample sizes are large. Notice also that the standard deviations for the groups are reasonably similar (2.88, 2.28) although the IQR is different for each group. These may hint at the possible lack of homogeneity of variance between the groups. The histograms in Figure 7.10 show somewhat similar distributions and have outliers at larger positive values for both genders.


Figure 7.10 Histograms for Females and Males of Hours Watching TV

To compare the groups directly, we move to the boxplot.


Figure 7.11 Boxplot of Hours Watched TV Per Day For Males and Females

The median for males and females is the same. However, the IQR is higher (3) for females than males (2) and the outliers are more widely spread for females. Both groups show positive skewness. Since both groups follow a similar skewed distribution and the samples are large, the normality assumption will not be a problem. There is some evidence of not meeting the homogeneity of variance assumptions which we will confirm or deny in the next step of our analysis. Having explored the data focusing on group comparisons, we now perform tests for mean differences between the populations. Note: Producing Stacked and Paneled Histograms with Chart Builder The Chart Builder provides other types of histograms that you can use to compare distributions of a scale variable on multiple groups. Both a stacked histogram and a population pyramid will show the distribution of multiple groups in one histogram. These types of charts are available in the Gallery of histogram icons. Or, you can produce paneled histograms, separate histograms for each group shown on the same scale range, arranged in either columns or rows. The panel charts are available on the Groups/Point ID tab of the Chart Builder. Figure 7.12 shows a stacked histogram by gender of hours of Internet use.


Figure 7.12 Stacked Histogram of Hours of Internet Use by Gender (Chart Builder)

7.4 Testing the Differences: Independent Samples T Test The t test is used to test the differences in means between two populations. If more than two populations are involved, a generalization of this method, called analysis of variance (ANOVA), can be used. The T Test procedure is found on the Compare Means menu. Click Analyze…Compare Means

Figure 7.13 Compare Means Menu


There are three available t tests: one-sample t test, independent-samples t test, and paired-samples t test. The One-Sample T Test compares a value you supply (it can be the known value for some population, or a target value) to the sample mean in order to determine whether the population represented by your sample differs from the specified value. The other t tests involve comparison of two sample means. The Independent-Samples T Test applies when there are two separate populations to compare (for example, males and females). An observation can only fall into one of the two groups. The Paired-Samples T Test is appropriate when comparing two measures (variables) for a single population. For example, a paired t test would be used to compare pretreatment to post-treatment scores in a medical study. If the same observation contributes to both means, the paired t test takes advantage of this fact and can provide a more powerful analysis. An example applying the paired t test to compare the formal education of the respondent's mother and father appears in the appendix at the end of this chapter. In our present example, an observation (individual interviewed) can fall into only one of the two groups (male or female), so we choose the independent-samples t test. Click Analyze…Compare Means…Independent-Samples T Test… Move WWWHR and TVHOURS into the Test Variable(s): box Move GENDER into the Grouping Variable: box

We first indicate the dependent measures or “Test variable.” We specify both WWWHR and TVHOURS, which will yield two separate analyses. The “Grouping” or independent variable is GENDER. Figure 7.14 Independent-Samples T Test Dialog Box

Notice the question marks following GENDER. The T Test dialog requires that you indicate which groups are to be compared, which is usually done by providing the data values for the two groups. Since GENDER is a string variable with females coded "F" and males coded "M", we must supply these codes using the Define Groups dialog box. Be sure to type upper case for both as shown in Figure 7.15. Click the Define Groups pushbutton Enter F as the first and M as the second group code


Figure 7.15 T Test Define Groups Dialog Box

We have identified the values defining the two groups to be compared. If the independent variable is a numeric variable, you can specify a single cut point value to define the two groups. Those cases less than or equal to the cut point go into the first group, and those greater than the cut point fall into the second group. This option is rarely used though. Click Continue

Figure 7.16 Completed T Test Dialog Box

Our specifications are complete. By default, the procedure will use all valid responses for each dependent variable (pairwise deletion) in the analysis. Click OK
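The equivalent syntax, roughly what the Paste button would produce, is sketched below; ANALYSIS requests the default analysis-by-analysis (pairwise) treatment of missing values described above.

* Independent-samples t tests of WWWHR and TVHOURS by gender (F vs. M).
T-TEST GROUPS=gender('F' 'M')
  /VARIABLES=wwwhr tvhours
  /CRITERIA=CI(.95)
  /MISSING=ANALYSIS.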

7.5 Interpreting the T Test Results We will first look at the output for hours using the Internet each week. NOTE: In the original output, the test results for both dependent variables were displayed in a single pivot table, but for discussion purposes we present them separately and edit them for better display.


In Figure 7.17 we can see some of the same summaries as the Explore procedure displayed: sample sizes, means, standard deviations, and standard errors for the two groups. The mean for males is about 2.5 hours more per week than for females. The actual sample mean difference is displayed in the Independent Samples Test table in Figure 7.18.

Figure 7.17 Summaries for Hours of Internet Use

Group Statistics: WWW HOURS PER WEEK

GENDER OF RESPONDENT      N       Mean    Std. Deviation   Std. Error Mean
Female                   908      6.30        9.917             .329
Male                     793      8.79       11.040             .392

Figure 7.18 T Test Output for Hours of Internet Use

Independent Samples Test: WWW HOURS PER WEEK

Levene's Test for Equality of Variances:  F = 15.182, Sig. = .000

t-test for Equality of Means
                                      Equal variances    Equal variances
                                          assumed          not assumed
t                                         -4.902             -4.866
df                                          1699            1605.39
Sig. (2-tailed)                             .000               .000
Mean Difference                           -2.491             -2.491
Std. Error Difference                       .508               .512
95% CI of the Difference, Lower           -3.487             -3.495
95% CI of the Difference, Upper           -1.494             -1.487

Homogeneity of Variance Test: Levene's Test

Next, we consider the Levene test of homogeneity of variance for the two groups. With this test we can assess whether the data meet the homogeneity assumption before examining the t test results. There are several tests of homogeneity (Bartlett-Box, Cochran's C, Levene). Levene's test has the advantage of being sensitive to lack of homogeneity, but relatively insensitive to nonnormality. Bartlett-Box and Cochran's C are sensitive to both lack of homogeneity and nonnormality. Since nonnormality (recall our discussion in the assumptions section) is not necessarily an important problem for t tests and analysis of variance, the Levene test is directed toward the more critical issue of homogeneity. Homogeneity tests evaluate the null hypothesis that the dependent variable's standard deviation is the same in the two populations. Since homogeneity of variance is assumed when performing the t test, the analyst hopes to find this test to be nonsignificant. The probability from Levene's test (labeled Sig. in the table) is the probability of obtaining sample standard deviations as far apart as those observed in our data (11.0 versus 9.9; technically, variances are tested), if the standard deviations were identical in the two populations. This probability is quite low (the Sig. value is .000), below the common .05 cut-off (some researchers suggest using a cutoff of .01 for larger samples), so we conclude the standard deviations are not identical in the two population groups, and the homogeneity requirement is not met. Since this is an important assumption, we will need to use an adjusted t test.


If this procedure seems too complicated, some authors suggest the following simplified rules: (1) If the sample sizes are about the same, don't worry about the homogeneity of variance assumption; (2) If the sample sizes are quite different, then take the ratio of the standard deviations in the two groups and round it to the nearest whole number. If this rounded number is 1, don't worry about lack of homogeneity of variance. Using this simplified test, the ratio of the group standard deviations rounds to 1. This illustrates how sensitive Levene's test becomes with large samples: it can flag differences in variability that are too small to matter in practice.

T Test Finally two versions of the t test appear in Figure 7.18. The row labeled “Equal variances assumed” contains results of the standard t test, which assumes homogeneity of variance. The second row labeled “Equal variances not assumed” contains an adjusted t test that corrects for lack of homogeneity in the data. You would choose one or the other based on your evaluation of the homogeneity of variance question, so we choose the bottom row. However, as we can see the two values are very similar in this example. The actual t value and df (degrees of freedom) are technical summaries measuring the magnitude of the group differences and a value related to the sample sizes, respectively. The degrees of freedom, equal to the number of sample cases in the analysis minus 2, is used in calculating the probability (significance) of the t value. To interpret the results, move to the column labeled “Sig. (2-tailed).” This is the probability (rounded to .000, meaning there is less than .0005), of our obtaining sample means as far or further apart (2.49 hours), by chance alone, if the two populations (males and females) actually have the same Internet use each week. Thus the probability of obtaining such a large difference by chance alone is quite small (less than 5 in 10,000), so we would conclude there is a significant difference in Internet use between men and women, with men using the Internet more. The term “2-tailed” significance indicates that we are interested in testing for any differences in Internet use between men and women, either in the positive or negative direction (ergo the two tails). Researchers with hypotheses that are directional—for example, that men use the Internet more than women—can use one-tailed tests to address such questions in a more sensitive fashion. Recall our discussion in Chapter 5. Broadly speaking, two-tailed tests look for any difference between groups, while a one-tailed test focuses on a difference in a specific direction. Two-tailed tests are more commonly done since the researcher is usually interested in any differences between the groups, regardless which is higher. If interested, you can obtain the one-tailed t test result directly from the two-tailed significance value that is displayed. For example, suppose you wish to test the directional hypothesis that in the population men do use the Internet more than women, the null hypothesis being that either women use it more than men or that there is no gender difference. You would simply divide the two-tailed significance value by 2 to obtain the one-tailed probability, and verify that the pattern of sample means is consistent with your hypothesized direction. Thus if the two-tailed significance value were .0005, then the one-tailed significance value would be half that value (.00025), if the direction of the sample means violates the null hypothesis (otherwise it is 1 – p/2, where p is the two-tailed value). To learn more about the differences and logic behind one and two-tailed testing, see SPSS 16.0 Guide to Data Analysis (Norusis, 2008) or an introductory statistics book.



Confidence Band for Mean Difference The T Test procedure provides an additional bit of useful information: the 95% confidence band for the sample mean difference. Recalling our earlier discussion, the 95% confidence band for the difference provides a measure of the precision with which we have estimated the true population difference. In the output shown in Figure 7.18, the 95% confidence band for the mean difference between groups is from -3.5 to -1.5 hours (use the Equal variances not assumed row). Note that the difference values are negative because "Males", the second group, has the highest mean hours using the internet. Thus we expect that the population mean difference could easily be a number like 1.9 or 2.5 hours but would not be a number as large as 5 or 6 hours. So the 95% confidence band indicates the likely range within which we expect the population mean difference to fall. Speaking in the technically correct fashion, if we were to continually repeat this study, we would expect the true population difference to fall within the confidence bands 95% of the time. While the technical definition is not illuminating, the 95% confidence band provides a useful precision indicator of our estimate of the group difference.

Summary for Internet Use Per Week Our analysis indicated that the assumption of homogeneity of variance is not satisfied, and that there is a significant difference in Internet use per week between men and women. Our sample indicates that, in the population of adults, on average men use the Internet about 2.5 hours more than women per week and the 95% confidence band on this difference ranges from 1.5 to 3.5 hours.

T Test Results for Television Viewing

We will now compare men and women on their daily amount of television viewing.

Figure 7.19 Summaries for Hours Per Day Watching TV

Group Statistics: HOURS PER DAY WATCHING TV

GENDER OF RESPONDENT      N       Mean    Std. Deviation   Std. Error Mean
Female                   479      2.99        2.879             .132
Male                     420      2.72        2.277             .111

Figure 7.20 T Test Output for Hours Per Day Watching TV

Independent Samples Test: HOURS PER DAY WATCHING TV

Levene's Test for Equality of Variances:  F = 5.620, Sig. = .018

t-test for Equality of Means
                                      Equal variances    Equal variances
                                          assumed          not assumed
t                                          1.520              1.543
df                                           897            887.775
Sig. (2-tailed)                             .129               .123
Mean Difference                             .266               .266
Std. Error Difference                       .175               .172
95% CI of the Difference, Lower            -.077              -.072
95% CI of the Difference, Upper             .609               .604


Reviewing Figure 7.19, we see that the sample means are very close. As we discussed in Chapter 4, the standard errors are the expected standard deviations of the sample means if the study were to be repeated with the same sample sizes. The difference (shown in Figure 7.20) between men and women is small (.266 hours). Note that although the standard deviations of the two groups are fairly close, the Levene test returns a probability value of .018, which is below the conventional .05 cutoff, although it is above the .01 cutoff we might prefer with a sample this large. It is a good idea to keep the sample size in mind when evaluating the homogeneity test, because with increasing sample size the sample standard deviations are estimated more precisely, and so smaller differences become statistically significant. Thus if the Levene test is significant, but the sample sizes are large and the ratio of the sample standard deviations is near 1, then the equal variance t test should be quite adequate (and in this situation the two t values almost invariably give the same result). Proceeding to the t test itself, the significance value of .129 (in the equal variances assumed line) indicates that if the null hypothesis of no gender difference in the amount of TV viewing were true, then there is about a 13% chance of obtaining sample means as far (or further) apart as we observe in our data. This is not significant (well above .05) and we conclude there is no evidence of men differing from women in the number of hours watching TV each day. Notice that the 95% confidence band of the male-female difference includes 0. This is another reflection of the fact that we cannot conclude the populations are different.

7.6 Graphing Mean Differences Although the T Test procedure displays the appropriate statistical test information, a summary chart is often preferred as a way to present significant results. Bar charts displaying the group sample means with 95% confidence bands can be produced using Chart Builder. However, many people prefer an error bar chart instead. It is a cleaner chart that focuses more on the precision of the estimated mean for each group than the mean itself. We will produce an error bar chart showing the gender difference in Internet use. Click Graphs…Chart Builder… Click OK in the Information box (if necessary) Click Reset Click Gallery tab (if it's not already selected) Click Bar in the Choose from: list

Select the icon for Simple Error Bar and drag it to the Chart Preview canvas Drag and drop GENDER from the Variables: list to the X-Axis? area in the Chart Preview canvas Drag and drop WWWHR from the Variables: list to the Y-Axis? area in the Chart Preview canvas

The completed Chart Builder is shown in Figure 7.21.


Figure 7.21 Chart Builder for Simple Error Bar Chart

By default, the sample mean will be displayed for each group with the error bars representing the 95% confidence interval for the sample means. In the Element Properties dialog box, you can choose to display error bars representing Standard error or Standard deviation of the mean. From the Statistics dropdown list, you can choose to display a statistic other than the mean. We will display the default choices. Click OK
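Chart Builder pastes a relatively long GGRAPH/GPL specification for this chart. As a compact, approximate alternative, the legacy GRAPH command below requests a simple error bar chart with 95% confidence intervals; it is offered here as a sketch rather than the exact syntax Chart Builder generates.

* Error bar chart (95% CI of the mean) of Internet hours per week by gender.
GRAPH
  /ERRORBAR(CI 95)=wwwhr BY gender.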


Figure 7.22 Error Bar Chart of Internet Use Per Week

The small circle in the middle of each error bar represents the sample group mean of Internet use per week, and the attached bars are the upper and lower limits for the 95% confidence interval on the sample mean. Thus we can directly compare groups and view the precision with which the group means have been estimated. Notice that the lower bound for men is well above the upper bound for women indicating these groups are well separated and that the population difference is statistically significant. Such charts are especially useful when more than two groups are displayed, since one can quickly make informal comparisons between any groups of interest.

7.7 Appendix: Paired T Test The paired t test is used to test for statistical significance between two population means when each observation (respondent) contributes to both means. In medical research a paired t test would be used to compare means on a measure administered both before and after some type of treatment. Here each patient is tested twice and is used in calculating both the pre- and posttreatment means. In market research, if a subject were to rate the product they usually purchase and a competing product on some attribute, a paired t test would be needed to compare the mean ratings. In an industrial experiment, the same operators might run their machines using two different sets of guidelines in order to compare average performance scores. Again, the paired t test is appropriate. Each of these examples differs from the independent groups t test in which an observation falls into one and only one of the two groups. The paired t test entails a slightly different statistical model since when a subject appears in each condition, he acts as his own control. To the extent that an individual’s outcomes across the two conditions are related, the


paired t test provides a more powerful statistical analysis (greater probability of finding true effects) than the independent groups t test. To demonstrate a paired t test using the General Social Survey data, we will compare mean education levels of the mothers and fathers of the respondents. The paired t test is appropriate because we will obtain data from a single respondent regarding his/her parents' education. We are interested in testing whether there is a significant difference in education between fathers and mothers of respondents in the population. Keep in mind that while the population we sample from is the U.S. adult population, the questions pertain to their parents' education. Thus the population our conclusion directly generalizes to would be parents of U.S. adults. To test directly for differences between men and women in the U.S. population, we could run an independent-groups t test comparing mean education level for men and women. While not pursued here, we would recommend running exploratory data analysis on the two variables to be tested. The homogeneity of variance assumption does not apply since we are dealing with one group. Normality is assumed and applies to the difference scores, obtained by subtracting the two measures to be compared. To investigate this assumption using SPSS Statistics, compute a new variable that is the difference between the two measures, then run Explore on this variable to examine the descriptive statistics and plots. To run the paired samples t test on mother's and father's years of education, Click Analyze…Compare Means…Paired-Samples T Test Click on MAEDUC and drag it to the Variable1 box for Pair 1 Click on PAEDUC and drag it to the Variable2 box for Pair 1

Figure 7.23 Paired-Samples T Test Dialog Box

Click OK
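A syntax sketch for the paired test is shown below; it assumes the General Social Survey file used throughout the chapter is the active data set.

* Paired-samples t test comparing mother's and father's years of education.
T-TEST PAIRS=maeduc WITH paeduc (PAIRED)
  /CRITERIA=CI(.95)
  /MISSING=ANALYSIS.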


Figure 7.24 Summaries of Differences in Parents' Education

The first table displays the mean, standard deviation and standard error for each of the variables. We see that the means for mothers and fathers are virtually the same. This might indicate very close educational matching of people who marry. Another possibility is incorrect reporting of parents' formal education by the respondent, with a bias toward reporting the same value for both. In the next table, the sample size (number of pairs) appears along with the correlation between mother's and father's education. Correlations and their significance tests will be studied in a later chapter, but we note that the correlation (.648) is positive, high, and statistically significant (differs from zero in the population). This suggests that the power to detect a difference between the two means is substantial.

Figure 7.25 Paired T Test of Differences in Parents' Education

Paired Samples Test

Pair 1: HIGHEST YEAR SCHOOL COMPLETED, MOTHER - HIGHEST YEAR SCHOOL COMPLETED, FATHER

Paired Differences
  Mean                                      .049
  Std. Deviation                           3.159
  Std. Error Mean                           .071
  95% CI of the Difference, Lower          -.091
  95% CI of the Difference, Upper           .188
t                                           .683
df                                          1976
Sig. (2-tailed)                             .494

The mean formal education difference, .049 years, is reported along with the sample standard deviation and standard error (based on the parents’ education difference score computed for each respondent). Not surprisingly, with this small mean difference, the significance value (.494) indicates that if mothers and fathers in the population had the same formal education (null hypothesis) then there is almost a 50% chance of obtaining as large (or larger) a difference as we obtained in our sample. Using the standard cut-off probability of .05, we accept the null hypothesis and conclude that mothers and fathers have the same level of education.




7.8 Appendix: Normal Probability Plots The histogram is useful for evaluating the shape of the distribution of the dependent measure within each group. Since one of the t test assumptions is that these distributions are normal, we implicitly compare the histograms to the well-known normal bell-shaped curve that we discussed in Chapter 4. If a more direct assessment of normality is desired, the Explore procedure can produce a normal probability plot and a fit test of normality. In this section we return to the Explore dialog box and request these features. Earlier in the chapter we used the Explore dialog box to investigate Internet use and hours watching TV for males and females. If we return to this dialog box by clicking the Dialog Recall tool, we see that it retains the settings from our last analysis. Click the Dialog Recall tool, and then click Explore

To request the normal probability plots and test statistics, Click the Plots pushbutton Check Normality plots with tests

Figure 7.26 Explore Plots Dialog Box

As mentioned in the discussion concerning homogeneity of variance, the spread & level plot can be used to find a variance stabilizing transformation for the dependent measure. Click Continue Click OK
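In syntax, the same request can be sketched as follows; NPPLOT adds the normal probability plots and the accompanying normality tests to the earlier Explore specification (again, the Paste button will show the exact form for your version).

* Explore by gender with normal probability plots and tests of normality.
EXAMINE VARIABLES=wwwhr tvhours BY gender
  /PLOT=BOXPLOT HISTOGRAM NPPLOT
  /MISSING=PAIRWISE
  /NOTOTAL.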

We ignore the other output which we have seen before and proceed to the Normal Q-Q plots.


Figure 7.27 Normal Probability Plot - Females

Figure 7.27 displays the normal probability plot on females of hours of internet use. To produce a normal probability plot, the data values (here hours using the Internet each week) are first ranked in ascending order. Then the normal deviate corresponding to each rank (compared to the sample size) is calculated (based on the standard normal curve) and plotted against the observed value. The vertical axis of the normal probability plot represents normal deviates (based on the rank of the observation). The actual data values appear along the horizontal axis. The individual points (circles) represent the data values (females only in Figure 7.27). The straight line indicates the pattern we would see if the data were perfectly normal. In Figure 7.27, the line passes through the point Expected Normal=0 (the center of the normal curve) and Observed Value=6.30, which corresponds to the mean of Internet use for females. If Internet use followed a normal distribution for females, the plotted values would closely approximate the straight line. Notice the large deviations of the higher data values. The advantage of a normal probability plot is that instead of comparing a histogram or stem & leaf plot to the normal curve (more complicated), you need only compare the plot to a straight line. The plot above confirms what we concluded earlier: that for females, Internet hours per week does not follow a normal distribution.



Tests of Normality

Accompanying the normal probability plot are a modified version of the Kolmogorov-Smirnov test (Lilliefors correction) and the Shapiro-Wilk test, which address whether the sample can be viewed as originating from a population following a normal distribution. The null hypothesis is that the sample comes from a normal population with unknown mean and variance. The significance value is the probability of obtaining a sample as far (or further) from normal as the one we observe in our data, if our sample truly came from a normal population.

Figure 7.28 Tests of Normality

                                           Kolmogorov-Smirnov(a)          Shapiro-Wilk
                                         Statistic    df    Sig.     Statistic    df    Sig.
WWW HOURS PER WEEK          Female         .263       908   .000       .578       908   .000
                            Male           .215       793   .000       .712       793   .000
HOURS PER DAY WATCHING TV   Female         .200       479   .000       .737       479   .000
                            Male           .220       420   .000       .791       420   .000

a. Lilliefors Significance Correction

For both tests the significance value is at .000 (rounded to 3 decimals) in all cases. If we assume we have sampled from a normal population, the probability of obtaining a sample as far (or further) from a normal as what we have found is less than .0005 (or 5 chances in 10,000). So we would conclude that for females in the population, the distribution of Internet hours per week is not normal. Please recall our discussion during which we outlined when normality might not be that important. Also keep in mind that since our sample is large, we have a powerful test of normality so relatively small departures from normality would be significant.

Detrended Normal Plot

The detrended normal probability plot (as shown in Figure 7.29) focuses attention on those areas of the data exhibiting the greatest deviation from the normal. This plot displays the deviation of each point in the normal probability plot from the straight line corresponding to the normal. The vertical axis represents the difference between each point in the normal probability plot and the straight line representing the perfect normal. The horizontal axis represents the observed value. This serves to visually magnify the areas where there is greatest deviation from the normal. If the data in the sample were normal, all the data points in the detrended normal plot would appear on the horizontal line centered at 0. Figure 7.29 shows the detrended normal plot of Internet use for females. We see that the major deviations from the normal occur in the right tail of the distribution. The same conclusion could have been made from a histogram or the normal probability plot. The detrended normal plot is a more technical plot, which allows the researcher to focus in detail on the specific locus and form of deviations from normality.


Figure 7.29 Detrended Normal Plot of Internet Use - Females

A normal probability plot and a detrended normal plot also appear for the males. These will not be displayed here since our aim was to demonstrate the purpose and use of these charts, and not to repeat the investigation of normality.



Summary Exercises

We want to see whether men and women differ in the average number of children, their use of email, and their age.

1. Run exploratory data analysis on CHILDS, EMAILHR, and AGE by GENDER. (Hint: Don't forget to select the "Pairwise Deletion" option.) Are any of the variables normally distributed? What differences between males and females do you notice from the boxplot of CHILDS?
2. Use Chart Builder to produce a paneled histogram for number of children by gender.
3. Run t tests looking at mean differences by gender for these three variables. Interpret the results. Which variables met the homogeneity of variance assumption? Are the means of any of the variables significantly different for males and females? Can you explain why?
4. Use Chart Builder to display an error bar chart of number of children by gender. The analysis suggests that women have a greater number of children than men. Can you suggest reasons for this seemingly odd result?

For those with extra time:
1. Other measures that might be of interest are the age when their first child was born (AGEKDBRN) and number of household members (HHSIZE). Are you surprised by any of these results?





Chapter 8: Bivariate Plots and Correlations: Scale Variables

Topics
• Scatterplots: Plotting Relationships between Scale Variables
• Types of Relationships
• The Pearson Correlation Coefficient

Data
This chapter uses the Bank.sav data file: a personnel file containing demographic, salary, and work-related data from 474 employees at a bank in the early 1970s. The salary information has not been converted to current dollars. Demographic variables include sex, race, age, and education in years (edlevel). Work-related variables are job classification (jobcat), previous work experience recorded in years (work), and time (in months) spent in the current job position (time). Current salary (salnow) and starting salary (salbeg) are also available.

8.1 Introduction In previous chapters we explored relations among categorical variables (using crosstab tables), and between categorical variables and interval scale variables (t-test). Here we focus on studying two interval scale measures: starting salary and formal education. We wish to determine if there is a relationship, and if so, quantify it. Starting salary is recorded in dollars and formal education is reported in years; thus both variables can be interval scales or stronger (actually ratio scales). Note, that education level has been defined as nominal measurement level, so we will need to change this prior to completing our analysis. Each variable can take on many different values. If we tried to present these variables (beginning salary and education) in a crosstabulation table, the table could contain hundreds of rows. In order to view the relation between these measures we must either recode salary and education into categories and run a crosstab (the appropriate graph is a clustered bar chart), or alternatively, present the original variables in a scatterplot. Both approaches are valid and you would choose one depending on your interests. Since we hope to build an equation relating amount of education to beginning salary, we will stick to the original scales and begin with a scatterplot. But first we will take a quick look at the relevant variables using exploratory data analysis methods.

8.2 Reading the Data The data are stored as the SPSS Statistics file named Bank.sav. Click File…Open…Data Double-click Bank.sav


Figure 8.1 The Bank Data

We see the data values for several employees in the Data Editor window.

8.3 Exploring the Data As this is the first time seeing the Bank data, we will explore the data before performing more formal analysis (for example, regression). While the scatterplot itself provides much useful information about each of the variables displayed, we begin by examining each variable separately. We will run exploratory data analysis on beginning salary (salbeg) and education (edlevel). The id variable will be used to label cases. Click Analyze…Descriptive Statistics…Explore Move salbeg and edlevel into the Dependent List: box Move id into the Label Cases by: box


Figure 8.2 Explore Dialog Box

There are no Factor variables in this analysis since we are looking at the two variables over the entire sample. Outliers in the box plot will be identified by their employee ID number. This file has no missing data, so we need not change the option for handling missing data. However, we will suppress the stem & leaf plot and examine the boxplot. Click Plots button Click off (uncheck) Stem & leaf (not shown) Click OK
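A syntax sketch of this Explore run appears below; the ID subcommand labels outliers in the boxplot with the employee id rather than the case number. As before, pasting from the dialog will show the exact form for your version.

* Explore beginning salary and education for the whole sample, labeling outliers by id.
EXAMINE VARIABLES=salbeg edlevel
  /ID=id
  /PLOT=BOXPLOT
  /STATISTICS=DESCRIPTIVES.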

This procedure will create a number of tables and charts in the Output Viewer which we discuss below.


Figure 8.3 Statistics for Beginning Salary

The descriptive statistics for beginning salary are displayed in Figure 8.3. The mean ($6,806) is considerably higher than the median ($6,000), suggesting a skewed distribution. This is confirmed by the skewness value compared to its standard error. Starting salaries range from $3,600 to $31,992 (recall that these are salaries from the 1960s and early 1970s in unadjusted dollars) The extreme values at the high salary end result in a skewed distribution. Since several different job classifications are represented in this data, the skewness may be due to a relatively small number of people in high paying jobs. The positive kurtosis is partially caused by the large peak of values in the $6,000 salary range.


Figure 8.4 Boxplot of Beginning Salary

In the box plot above, all outliers are at the high end, and the employee numbers for some of them can be read from the plot (changing the font size of these numbers in the Chart Editor window would make more of them legible). It might be useful to look at the job classification (jobcat) of some of the higher salaried individuals as a check for data errors. Figure 8.5 Statistics for Formal Education (in years)


The mean for education is again above the median, but the skewness value is very near zero (suggesting a symmetric distribution). Here the mean exceeding the median is not due to the presence of outliers. We will see that there are only a few extreme observations, and they are at high education values. The mean is above the median because of the concentration of employees with education of 15 to 19 years. Figure 8.6 Box & Whisker Plot of Education

In the boxplot (Figure 8.6), the median or 50th percentile (dark line within box) falls on the lower edge of the box (25th percentile) indicating a large number of people with 12th grade education. There are very few outliers. Having explored each variable separately, we will now view them jointly with a scatterplot.

8.4 Scatterplots A scatterplot displays individual observations in an area determined by a vertical and a horizontal axis, each of which represent an interval scale variable of interest. In a scatterplot you look for a relationship between the two variables and note any patterns or extreme points. The scatterplot visually presents the relation between two variables, while correlation and regression summaries quantify the relationships. To request a simple scatterplot, we will use the Chart Builder. Click Graphs…Chart Builder


The first step is to select a chart from the Gallery or individual elements from Basic Elements; then drag and drop them onto the canvas. For most charts, you will want to use the Gallery as your starting point. Click the Gallery tab if it is not selected Click Scatter/Dot in the Choose from list if it is not selected

There are a number of options for the scatter/dot charts. Since we are dealing with only two variables, a simple scatterplot will suffice. Drag the icon for Simple Scatter (the first icon) into the canvas

Next we indicate that beginning salary (salbeg) and education (edlevel) are the Y and X variables, respectively. Move salbeg into the Y Axis: area

Traditionally, the vertical axis is called the Y axis, while the horizontal axis is referred to as the X axis. Also, if one of the variables is viewed as dependent and the other as independent, by convention the dependent variable is specified as the Y axis variable. Notice the measurement level of the variable edlevel. As seen in the dialog, it is defined as nominal (you can see this from the three balls icon next to edlevel and also from the Categories section below the variable list). Scatterplots are normally run on scale variables in order to determine the type of relationship between the two variables. To change the measurement level of Education level [edlevel], we could cancel out of the dialog and return to the Variable View of the Data Editor or we can change it within the dialog. The change made in the dialog box affects only this chart. The measurement level specification on the file (viewed in the Data Editor) is unchanged. In the Variables: list, select and right-click on edlevel Select Scale from the resulting pop-up menu.
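If you would rather make the change permanent in the working data file (not just for this one chart), the measurement level can also be set with a single syntax command, sketched below; save the file afterward if you want the setting to persist.

* Declare edlevel as a scale (continuous) variable in the active file.
VARIABLE LEVEL edlevel (SCALE).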

Figure 8.7 Changing the measurement level of Education level


We can now move the edlevel variable into the X-axis and click OK to build the chart. The completed Chart Builder dialog is shown in Figure 8.8. Move edlevel into the X Axis: area Click OK
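As with the error bar chart earlier, Chart Builder pastes GGRAPH/GPL syntax for this plot; the compact legacy command below is an approximate equivalent for a simple scatterplot of beginning salary against education.

* Simple scatterplot with edlevel on the X axis and salbeg on the Y axis.
GRAPH
  /SCATTERPLOT(BIVAR)=edlevel WITH salbeg.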

Figure 8.8 Chart Builder with Simple Scatterplot


Figure 8.9 Scatterplot of Beginning Salary and Education

Each circle represents at least one observation. We see there are many points (fairly dense) at several values, including 8 and 12 years of education. The plot has gaps because education is recorded in integers. Overall, there seems to be a positive relation between the two variables, since higher values of education are associated with higher salaries. Notice there is no one with little education and a high salary, nor is there anyone with high education and a very low salary. This will be explored in more detail shortly. There is one individual at a salary considerably higher than the rest. If this were your study, you might check this observation to make sure it wasn’t in error. While we can describe the pattern to an interested party by saying that to some extent greater education is associated with higher salary levels, or simply show them the chart, there would be an advantage if we could quantify the relation using some simple function. We will pursue this aspect later in this chapter and in the next chapter. If we wish to overlay our plot with a best-fitting straight line, we can do so using the Chart Editor. Double click on the chart to open the Chart Editor Click Elements…Fit Line at Total


Figure 8.10 Chart Editor Elements Menu

Figure 8.11 Element Properties for Fit Line

By default a straight line (linear) will be fit to the data. However, you can use the Properties dialog box to specify lines with other fit methods such as quadratic and cubic. The Loess choice applies a robust regression technique to the data. Such methods produce a result that is more resistant to outliers than the traditional least-squares regression. Note that 95% confidence bands around the best-fitting line can be added to the plot. Although not obvious, the r-square measure will also be displayed. We will leave the defaults.


Close the Chart Editor

Figure 8.12 Scatterplot with Best Fitting Line

The straight line tracks the positive relationship between beginning salary and education. How well do you think it describes or models the relationship? We use scatterplots to get a sense of whether or not it is appropriate to use correlation coefficients and, later, regression. Both of these techniques assume a linear relationship (although regression can be used to model curvilinear relationships with suitable modifications). The other fit choices available can be used to determine what type of nonlinearity might exist. It would be helpful if we could quantify the strength of the relationship, and furthermore to describe it mathematically. If a simple function (for instance a straight line) does a fair job of representing the relationship, then we can very easily describe a straight line with the equation, Y = a + b*X. Here b is the slope (or average change in Y per unit change in X) and a is the intercept. Methods are available to perform both tasks: correlation for assigning a number to the strength of the straight-line relationship, and regression to describe the best-fitting straight line.

8.5 Correlations A correlation coefficient can be used to quantify the strength and direction of the relationship between variables in a scatterplot. The correlation coefficient (formally named the Pearson product-moment correlation coefficient) is a measure of the extent to which there is a linear (or straight line) relationship between two variables. It is normed so that a correlation of +1 indicates that the data fall on a perfect straight line sloping upwards (positive relationship), while a correlation of –1 would represent data forming a straight line sloping downwards (negative relationship). A correlation of 0 indicates there is no straight-line relationship at all. Correlations


Correlations falling between 0 and either extreme indicate some degree of linear relation: the closer to +1 or –1, the stronger the relation. In social science and market research, when straight-line relationships are found, significant correlation values are often in the range of .3 to .6. Below we display four scatterplots with their accompanying correlations, all based on simulated data following normal distributions. Four different correlations appear (1.0, .8, .4, 0). All are positive, but represent the full range in strength of linear association (from 0 to 1). As an aid in interpretation, a best-fitting straight line is superimposed on each chart.

Figure 8.13 Scatterplots Based on Various Correlations

For the perfect correlation of 1.0, all points fall on the straight line trending upwards. In the scatterplot with a correlation of .8 the strong positive relation is apparent, but there is some variation around the line. Looking at the plot of data with correlation of .4, the positive relation is suggested by the absence of points in the upper left and lower right of the plot area. The association is clearly less pronounced than with the data correlating .8 (note greater scatter of points around the line). The final chart displays a correlation of 0: there is no linear association present. This is fairly clear to the eye (the plot most resembles a blob), and the best-fitting straight line is a horizontal line. While we have stressed the importance of looking at the relationships between variables using scatterplots, you should be aware that human judgment studies indicate that people tend to overestimate the degree of correlation when viewing scatterplots. Thus obtaining the numeric correlation is a useful adjunct to viewing the plot. Correspondingly, since correlations only capture the linear relation between variables, viewing a scatterplot allows you to detect nonlinear relationships present. Additionally, statistical significance tests can be applied to correlation coefficients. Assuming the variables follow normal distributions, you can test whether the correlation differs from zero (zero indicates no linear association) in the population, based on your sample results. The significance value is the probability that you would obtain as large (or larger in absolute value) a correlation as you find in your sample, if there were no linear association (zero correlation) between the two variables in the population.


In SPSS Statistics, correlations can be easily obtained along with an accompanying significance test. If you have grossly non-normal data, or only ordinal scale data, the Spearman rank correlation coefficient (or Spearman correlation) can be calculated. It evaluates the linear relationship between two variables after ranks have been substituted for the original scores. Another, less common, rank association measure is Kendall's tau-b (not to be confused with Kendall's coefficient of concordance, which is a different statistic). We will obtain the correlation (Pearson) between beginning salary and education, and will also include age, current salary, and work experience in the analysis. Click Analyze…Correlate…Bivariate Move salbeg, salnow, edlevel, age and work to the Variables: list box

Figure 8.14 Correlations Dialog Box

Notice that we simply list the variables to be analyzed; there is no designation of dependent and independent. Correlations will be calculated on all pairs of the variables listed. By default, Pearson correlations will be calculated, which is what we want. However, both Kendall’s tau-b and Spearman nonparametric correlation coefficients can be requested as well. A two-tailed significance test will be performed on each correlation. This will posit as the null hypothesis that in the population there is no linear association between the two variables. Thus any straight-line relationship, either positive or negative, is of interest. If you prefer a one-tailed test, one in which you specify the direction (or sign) of the relation you expect and any relation in the opposite direction (opposite sign) is bundled with the zero (or null) effect, you can obtain it though the One-tailed option button. This issue was discussed earlier in Chapter 5 and in the context of one versus two-tailed t tests. A one-tailed test gives you greater power to detect a correlation of the sign you propose, at the price of giving up the ability to detect a significant correlation of the opposite sign. In practice, researchers are usually interested in all linear relations, positive and negative, and so two-tailed tests are most common. The Flag significant


Introduction to Statistical Analysis Using SPSS Statistics correlations check box is checked by default. When checked, asterisks appearing beside the correlations will identify significant correlations. The Options button opens a dialog box in which you can request a table of descriptive statistics for the variables used in the analysis. There is also a choice for missing values. The default missing setting is Pairwise, which means that a case is dropped from a correlation coefficient if one of the two variables is missing. However, the case will be used in all other pairs of variables that are not missing. In this way, SPSS Statistics will still use the valid information from all pairs of variables. The alternative is Listwise, in which a case is dropped from the entire correlation analysis if any of its analysis variables have missing values. Neither method provides an ideal solution; in practice, pairwise deletion is often chosen when a large number of cases are dropped by the listwise method. This is an area of statistics in which considerable progress has been made in the last decade, and the SPSS Missing Values add-on module incorporates some of these improvements. Click OK
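The syntax corresponding to these choices can be sketched as below; TWOTAIL requests two-tailed significance tests, NOSIG (despite its name) corresponds to the Flag significant correlations option, and PAIRWISE is the default missing-value treatment described above.

* Pearson correlations among the five variables, pairwise deletion of missing values.
CORRELATIONS
  /VARIABLES=salbeg salnow edlevel age work
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.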

SPSS Statistics displays the correlations, sample sizes, and significance values together in each cell of the Correlations table. Looking at the table in Figure 8.15, we see that the variable labels run down the first column and across the top row. Each cell (intersection of a row and column) in the matrix contains the correlation between the relevant row and column variable, along with its significance value and sample size. The correlation (Pearson Correlation) is listed first in each cell, followed by the probability value of the significance test ("Sig. (2-tailed)"), and finally the sample size (N).

The Pearson correlation coefficient (generally abbreviated as r) is closely tied to the least-squares criterion used to fit a straight line: it can be calculated by converting each variable to standardized (z) scores and averaging the products of the paired standardized values, so it summarizes how well a best-fitting straight line would predict scores on one variable from the other. The correlation coefficient can take on values between +1 and -1, where:



r = +1.00 if there is a perfect positive linear relationship between the two variables



r = -1.00 if there is a perfect negative linear relationship between the two variables



r = 0.00 if there is no linear relationship between the two variables

Note that the sign does not reveal anything about the strength of the relationship, just its direction. One extremely important consideration when using Pearson’s correlation coefficient is exactly what is being measured by the statistic. Remember, r is simply a measure of the linear relationship between the variables, hence a value of 0 does not necessarily mean that the two variables are not related, simply that there is no evidence for a linear relationship.
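The two points above can be made concrete with a minimal sketch (Python, outside SPSS): it computes r from z-scores and then shows a case where two strongly, but nonlinearly, related variables have r near 0.

# Illustrative sketch: r as the average product of z-scores, and an example
# where two strongly (but nonlinearly) related variables have r near 0.
import numpy as np

def pearson_r(x, y):
    zx = (x - x.mean()) / x.std(ddof=1)   # standardized scores
    zy = (y - y.mean()) / y.std(ddof=1)
    return (zx * zy).sum() / (len(x) - 1)

x = np.linspace(-3, 3, 101)
y_linear = 2 * x + 1          # perfect positive linear relation -> r = +1
y_quadratic = x ** 2          # strong relation, but not linear -> r near 0

print(round(pearson_r(x, y_linear), 3))     # 1.0
print(round(pearson_r(x, y_quadratic), 3))  # approximately 0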


Figure 8.15 Correlation Matrix

Note that all correlations along the major (upper-left to lower-right) diagonal are 1. This is because each variable correlates perfectly with itself (no significance tests are performed for these correlations). Also, the correlation matrix is symmetric: the correlation between beginning salary and education is the same as the correlation between education and beginning salary. Thus you need only view the part of the matrix above (or below) the diagonal to see all the correlations.

There is, not surprisingly, a strong (.88) correlation between beginning salary and current salary. Its significance value rounded to three decimals is .000 (thus less than .0005). This means that if beginning salary and current salary had no linear association in the population, the probability of obtaining a sample with such a strong (or stronger) linear association would be less than .0005. The sample size is nearly 500, which should provide fairly sensitive (powerful) tests of whether the correlations are nonzero. Formal education and beginning salary have a substantial (.63) positive correlation, while age has essentially no linear association with beginning salary (correlation = –.01; probability value of .81, meaning an 81% chance of obtaining a sample correlation this far from zero if it were truly zero in the population). Do you see any other large correlations in the table, and if so, can you explain them? Also note that asterisks mark the correlations that are significant at the .05 and .01 levels.

A correlation provides a concise numerical summary of the degree of linear association between a pair of variables. However, outliers can strongly influence a correlation, and the coefficient itself offers no visible evidence of this; such outliers would often be apparent in a scatterplot. Also, a scatterplot might suggest that a function other than a straight line be fit to the data, whereas a correlation simply measures straight-line fit. For these reasons, serious analysts look at scatterplots. If the number of variables is so large as to make looking at all scatterplots impracticable, then at least view those involving the most important variables.



Summary Exercises

Suppose you are interested in predicting current salary (salnow), based on age, education in years (edlevel), minority status (minority), beginning salary (salbeg), gender (sex), and work experience before coming to the bank (work).

1. Run frequencies on minority so you understand its distribution. Run descriptive statistics on the other variables for the same reason.
2. Now produce correlations with all these predictors and current salary.
3. Then create scatterplots of the predictors and current salary.

Which variables have strong correlations with current salary? Did you find any potential problems with using linear regression? Did you find any potential outliers?



Chapter 9: Introduction to Regression

Topics
• Introduction and Basic Concepts
• Regression Equation and Fit Measure
• Assumptions
• Simple Regression
• Interpreting Standard Results

Data
This chapter uses the Bank.sav data file.

9.1 Introduction and Basic Concepts

We found in Chapter 8, based on a scatterplot and correlation coefficient, that beginning salary and education are positively related for the bank employees. We wish to further quantify this relation by developing an equation predicting starting salary from education. Regression analysis is a statistical method used to predict a variable (an interval scale dependent measure) from one or more predictor (interval scale) variables. Commonly, straight lines are used, although other forms of regression allow nonlinear functions. In this chapter we will focus on linear regression, which fits straight-line relations between variables. To aid our discussion, let's revisit the scatterplot of beginning salary and education.

Figure 9.1 Scatterplot of Beginning Salary and Education


Earlier we pointed out that to the eye there seems to be a positive relation between education and beginning salary, that is, higher education is associated with greater starting salaries. This was confirmed by the two variables having a significant positive correlation (.63). While the correlation does provide a single numeric summary of the relation, something that would be more useful in practice is some form of prediction equation. Specifically, if some simple function can approximate the pattern shown in the plot, then the equation for the function would concisely describe the relation, and could be used to predict values of one variable given knowledge of the other. A straight line is a very simple function, and is usually what researchers start with, unless there are reasons (theory, previous findings, or a poor linear fit) to suggest another. Also, since the point of much research involves prediction, a prediction equation is valuable. However, the value of the equation would be linked to how well it actually describes or fits the data, and so part of the regression output includes fit measures.

9.2 The Regression Equation and Fit Measure

In the scatterplot (Figure 9.1), beginning salary is placed on the Y axis and education appears along the X axis. Since formal education is typically completed before starting at the bank, we consider beginning salary to be the dependent variable and education the independent or predictor variable (this assumption was more true in the 1960s than today). A straight line is superimposed on the scatterplot; the line is represented in general form by the equation

Y = a + b*X

where b is the slope (the change in Y per unit change in X) and a is the intercept (the value of Y when X is zero). Given this, how would one go about finding a best-fitting straight line? In principle, various criteria might be used: minimizing the mean deviation, the mean absolute deviation, or the median deviation. Due to technical considerations, and with a dose of tradition, the best-fitting straight line is defined as the one that minimizes the sum of the squared deviations of the points about the line.

Returning to the plot of beginning salary and education, we might wish to quantify the extent to which the straight line fits the data. The fit measure most often used, the r-square measure, has the dual advantages of falling on a standardized scale and having a practical interpretation. The r-square measure (which is the squared correlation, or r², when there is a single predictor variable, hence its name) runs from 0 (no linear association) to 1 (perfect linear prediction). The r-square value can be interpreted as the proportion of variation in one variable that can be predicted from the other. Thus an r-square of .50 indicates that we can account for 50% of the variation in one variable if we know the values of the other. You can think of this value as a measure of the improvement in your ability to predict one variable from the other (or others, if there are multiple independent variables).
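To make the least-squares fit and r-square concrete, here is a minimal sketch (Python, with simulated data; the numbers are illustrative and are not the Bank.sav results):

# Illustrative sketch: least-squares slope, intercept, and r-square
# for a single predictor, using simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
education = rng.integers(8, 21, size=150).astype(float)     # hypothetical years of education
salary = -2500 + 700 * education + rng.normal(0, 2400, size=150)

fit = stats.linregress(education, salary)
print(f"slope b     = {fit.slope:.1f}")       # change in Y per unit change in X
print(f"intercept a = {fit.intercept:.1f}")   # predicted Y when X = 0
print(f"r-square    = {fit.rvalue**2:.3f}")   # squared correlation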

9.3 Residuals and Outliers

Viewing the scatterplot, we see that many points fall near the line, but some are quite a distance from it. For each point, the difference between the value of the dependent variable and the value predicted by the equation (the value on the line) is called the residual (also known as the error). Points above the line have positive residuals (they were under-predicted), those below the line have negative residuals (they were over-predicted), and a point falling exactly on the line has a residual of zero (perfect prediction). Points with relatively large residuals are of interest because they represent instances where the prediction line did poorly. For example, one case has a beginning salary of about $30,000 while the predicted value (based on the line) is about $10,000, yielding a residual, or miss, of about $20,000. If budgets were based on such predictions, this would be a substantial discrepancy. The Regression procedure can provide information about large residuals, and can also present them in standardized form.

Outliers, or points far from the mass of the others, are of interest in regression because they can exert considerable influence on the equation (especially if the sample size is small). Outliers can also have large residuals and would be of interest for this reason as well. While not covered in this class, SPSS Statistics can provide influence statistics to aid in judging whether the equation was strongly affected by an observation and, if so, to identify the observation.
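Here is a minimal sketch of computing residuals and screening for large standardized residuals (Python, simulated data; the cutoff of 3 is just one common screening convention, not an SPSS default):

# Illustrative sketch: residuals and a simple standardized-residual screen
# after a least-squares fit (simulated data, hypothetical names).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(8, 20, size=100)
y = -2500 + 700 * x + rng.normal(0, 2400, size=100)
y[0] = 30000                                 # plant one extreme case

fit = stats.linregress(x, y)
predicted = fit.intercept + fit.slope * x
residuals = y - predicted                    # positive = under-predicted
std_resid = residuals / residuals.std(ddof=2)

large = np.abs(std_resid) > 3                # one common screening rule
print("cases with |standardized residual| > 3:", np.where(large)[0])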

9.4 Assumptions

Regression is usually performed on data in which the dependent and independent variables are interval scale. In addition, when statistical significance tests are performed, it is assumed that the deviations of points around the line (residuals) follow the normal bell-shaped curve. The residuals are also assumed to be unrelated to the predicted values, and their variation around the line is assumed to be homogeneous. SPSS Statistics can provide summaries and plots useful in evaluating these latter issues.

One special case of the assumptions involves the interval scale nature of the independent variable(s). A variable coded as a dichotomy (say 0 and 1) can technically be considered interval scale. An interval scale assumes that a one-unit change has the same meaning throughout the range of the scale. If a variable's only possible codes are 0 and 1 (or 1 and 2, etc.), then a one-unit change does mean the same thing throughout the scale. Thus dichotomous variables, for example sex, can be used as predictor variables in regression. This also permits the use of nominal predictor variables if they are first converted into a series of dichotomous variables; this technique is called dummy coding and is covered in most regression texts (Draper and Smith, 1998; Cohen and Cohen, 2002). A brief sketch of dummy coding appears below. The multiple regression analysis (multiple independent variables) performed in Appendix B uses a dichotomous predictor (sex).
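Here is a brief, hypothetical sketch of dummy coding (Python/pandas; the region variable is invented for illustration and is not in Bank.sav):

# Illustrative sketch: dummy coding a nominal predictor into 0/1 variables.
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "east", "west", "south"]})

# One dichotomous column per category, dropping one category as the reference.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True, dtype=int)
print(dummies)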

9.5 Simple Regression

A regression involving a single independent variable is the simplest case and is called simple regression. We will develop a regression equation predicting beginning salary from education. A number of regression techniques are available on the SPSS Statistics Regression menu, such as Linear, which is used to perform simple and multiple linear regression, and Logistic, which is used to predict nominal outcome variables such as purchase/not purchase. Logistic regression and many of the other regression techniques are discussed in the Advanced Techniques: Regression course. We will select Linear to perform simple linear regression, then name beginning salary (salbeg) as the dependent variable and education (edlevel) as the independent variable.

Click Analyze…Regression…Linear from the menu
Move salbeg to the Dependent: box
Move edlevel to the Independent(s): box


Figure 9.2 Linear Regression Dialog Box

In this first analysis we will limit ourselves to producing the standard regression output. In the multiple regression example in Appendix B, we will ask for residual plots and information about cases with large residuals. The Regression dialog box allows many specifications; here we discuss only the most important features. However, if you will be running regression often, some time spent reviewing the additional features and controls described in the Help system will be well worth it.

The Independent(s) list box permits more than one independent variable, so this dialog box can be used for both simple and multiple regression. The block controls permit an analyst to build a series of regression models, with the variables entered at each stage (block) specified by the user. By default, the Method is Enter, which means that all independent variables in the block will be entered into the regression equation simultaneously. This method is chosen to run one regression based on all the variables you specify. If you wish the program to select, from a larger set of independent variables, those that are the best predictors in some statistical sense, you can request the Stepwise method.

The Selection Variable option permits cross-validation of regression results. Only cases whose values meet the rule specified for a selection variable will be used in the regression analysis, yet the resulting prediction equation will be applied to the other cases. Thus you can evaluate the regression on cases not used in the analysis, or apply the equation derived from one subgroup of your data to other groups.

While SPSS Statistics presents standard regression output by default, many additional (and some quite technical) statistics can be requested via the Statistics dialog box. The Plots dialog box is used to generate various diagnostic plots used in regression, including residual plots; we will request such plots in the analysis in Appendix B. The Save dialog box permits you to add new variables to the data file, containing such statistics as the predicted values from the regression equation, various residuals, and influence measures. Finally, the Options dialog box controls the criteria used when running stepwise regression and the choices for handling missing data. By default, SPSS Statistics excludes a case from the regression if it has one or more values missing on the variables used in the analysis.

Note: The SPSS Missing Values add-on module provides more sophisticated methods for handling missing values. This module includes procedures for displaying patterns of missing data and imputing (estimating) missing values using multiple-variable imputation methods.

We perform the analysis by finishing the dialog box. The output in the Output Viewer will be a series of tables described below.

Click OK

Figure 9.3 Model Summary and Overall Significance Tests

After listing the dependent and independent variables, Regression provides several measures of how well the model fits the data. These fit measures are displayed in the Model Summary table. First is the multiple R, which is a generalization of the correlation coefficient. If there is a single predictor variable (as in our case), the multiple R is simply the unsigned (positive) correlation between the independent and dependent variable—recall the correlation between beginning salary and education was .63. If there are several independent variables, the multiple R represents the unsigned (positive) correlation between the dependent measure and the optimal linear combination of the independent variables. Thus the closer the multiple R is to 1, the better the fit.

As mentioned earlier, the R Square measure can be interpreted as the proportion of variance of the dependent measure that can be predicted from the independent variable(s). Here it is about 40% (.40), which is far from perfect prediction, but still substantial. The Adjusted R Square represents a technical improvement over the r-square in that it explicitly adjusts for the number of predictor variables relative to the sample size, and as such is preferred by many analysts. Generally, the two are very close in value; in fact, if they differ dramatically in a multiple regression, it is a sign that you have used too many predictor variables for your sample size, and the adjusted r-square value should be trusted more. In our results, they are very close.

The Standard Error of the Estimate is a standard-deviation-type summary that measures the deviation of observations around the best-fitting straight line. As such it provides, in the scale of the dependent variable, an estimate of how much variation remains to be accounted for after the line is fit. The reference number for comparison is the original standard deviation of the dependent variable, which measures the original amount of unaccounted-for variation. Regression can display such descriptive statistics as the standard deviation, but since we did not request them, we will simply note that the original standard deviation of beginning salary was $3,148. Thus the uncertainty surrounding individual beginning salaries has been reduced from $3,148 (standard deviation) to $2,439 (standard error of the estimate). If the straight line perfectly fit the data, the standard error would be 0.

While the fit measures indicate how well we can expect to predict the dependent variable, they do not tell us whether there is a statistically significant relationship between the dependent and independent variables. The analysis of variance table (ANOVA in the Output Viewer) presents technical summaries (sums of squares and mean square statistics) of the variation accounted for by the prediction equation. Our main interest is in determining whether there is a statistically significant (non-zero) linear relation between the dependent variable and the independent variable(s) in the population. Since in simple regression there is a single independent variable, we are testing a single relationship; in multiple regression, we test whether any of the linear relations differ from zero. The significance value accompanying the F test gives the probability that we could obtain one or more sample slope coefficients (which measure the straight-line relationships) as far from zero as those we obtained, if there were no linear relations in the population. The result is highly significant (significance probability less than .0005, or 5 chances in 10,000).

Now that we have established that there is a significant relationship between beginning salary and education, and obtained fit measures, we turn to the next table, Coefficients, to interpret the regression coefficients.

Figure 9.4 Regression Coefficients

The first column contains a list of the independent variables plus the intercept (constant term). The column labeled B contains the estimated regression coefficients we would use in a prediction equation. The coefficient for education level indicates that, on average, each year of education is associated with a beginning salary increase of $691.01. The constant or intercept of –2,516.39 indicates that the predicted beginning salary of someone with 0 years of education is negative $2,516.39—they would pay the bank to work! This is clearly impossible. This odd result stems in part from the fact that no one in the sample had fewer than 8 years of education, so the intercept projects well beyond the region containing data. When using regression it can be very risky to extrapolate beyond where the data are observed; the assumption is that the same pattern continues, and here it clearly cannot.

The Standard Error (of B) column contains the standard errors of the regression coefficients. These provide a measure of the precision with which we estimate the B coefficients, and can be used to create a 95% confidence band around the B coefficients (available as a Statistics option). In our example, the regression coefficient is $691 and the standard error is about $39. Thus we would not be surprised if in the population the true regression coefficient were $650 or $730 (within one standard error of our sample estimate), but it is very unlikely that the true population coefficient would be $300 or $2,000.

Betas are standardized regression coefficients and are used to judge the relative importance of each of several independent variables; we will use these measures when discussing multiple regression. Finally, the t statistics provide a significance test for each B coefficient, testing whether it differs from zero in the population. Since we have only one independent variable, this is the same result as the F test provided earlier. In multiple regression, the F statistic tests whether any of the independent variables are significantly related (non-zero coefficient) to the dependent variable, while the t statistics test each independent variable separately. The significance test on the constant assesses whether the intercept coefficient differs from zero in the population (it does).

Thus, if we wish to predict beginning salary based on education for new employees, we would use the B coefficients in the formula: Beginning Salary = $691 * Education – $2,516. Even when running simple regression, the analyst would probably take a look at some residual plots and check for outliers; we will follow through on this aspect in the Appendix B example.
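For readers who want to see the same kinds of quantities computed outside SPSS, here is a minimal sketch using Python's statsmodels on simulated data (the coefficients will not reproduce the Bank.sav values; the variable names are reused only for readability):

# Illustrative sketch: simple linear regression with fit measures,
# coefficient estimates, and a prediction, on simulated data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
edlevel = rng.integers(8, 21, size=474).astype(float)
salbeg = -2500 + 700 * edlevel + rng.normal(0, 2400, size=474)

X = sm.add_constant(pd.DataFrame({"edlevel": edlevel}))  # adds the intercept term
model = sm.OLS(salbeg, X).fit()

print(model.rsquared, model.rsquared_adj)   # R Square and Adjusted R Square
print(model.params)                         # B coefficients (constant and slope)
print(model.bse)                            # standard errors of the B coefficients
print(model.predict(pd.DataFrame({"const": [1.0], "edlevel": [16.0]})))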



Summary Exercises

1. Run a simple linear regression using beginning salary to predict current salary.
2. How well did you do in predicting current salary from this one variable? (Hint: What is the R-square, and how would you interpret it?)
3. Interpret the constant (intercept) and B values.
4. Use the prediction equation to predict the value of current salary for a person who had a beginning salary of $6,400.
5. Run a simple linear regression using one of the other variables from the set you explored in the last chapter. (Hint: You might want to try the one with the next highest correlation coefficient.) Answer Questions 2–3 above.



Appendix A: Mean Differences Between Groups: One-Factor ANOVA

Topics
• Extending the Logic of the T Test Beyond Two Groups
• Exploring the Differences
• One-Factor ANOVA
• Post Hoc Testing of Means
• Graphing the Mean Differences
• Appendix: Group Differences on Ranks

Data
In this appendix, we use the GSS2004Intro.sav file.

Scenario
Using the GSS 2004 data, we want to investigate the relation between level of education and amount of TV viewing. One approach is to group people according to their type of degree, and then compare these groups on the average amount of daily TV watched. In the General Social Survey, the question about highest degree completed (DEGREE) contains five categories: less than high school, high school, junior college, bachelor, and graduate. Assuming we retain these categories, we might first ask whether there are any population differences in TV viewing among these groups. If there are significant mean differences overall, we next want to know specifically which groups differ from which others.

A.1 Introduction

Analysis of variance (ANOVA) is a general method for drawing conclusions about differences in population means when two or more comparison groups are involved. The independent-groups t test (Chapter 7) applies only to the simplest instance (two groups), while ANOVA can accommodate more complex situations. In fact, the t test can be viewed as a special case of ANOVA, and the two yield the same result in a two-group situation (the same significance value, with the t statistic squared equal to ANOVA's F statistic). We will compare five groups composed of people with different education degrees and evaluate whether the populations they represent differ in the average amount of daily TV viewing. Before performing the analysis we will look at an exploratory data analysis plot.

A.2 Extending the Logic Beyond Two Groups

The basic logic of significance testing for comparing means across more than two groups is the same as that for the t test reviewed in Chapter 7. To summarize: the null hypothesis (Ho) assumes the population groups have the same means, and we determine the probability of obtaining a sample with group mean differences as large as (or larger than) what we find in our data. To make this assessment, the amount of variation among group means (between-group variation) is compared to the amount of variation among observations within each group (within-group variation). If in the population the group means are identical (the null hypothesis), the only source of variation among sample means would be the fact that the groups are composed of different individual observations. Thus a ratio of the two sources of variation (between-group/within-group) should be about 1 when there are no population differences. When the distribution of individual observations within each group follows the normal curve, the statistical distribution of this ratio is known (the F distribution) and we can make a probability statement about the consistency of our data with the null hypothesis. The final result is the probability of obtaining sample differences as large as (or larger than) what we found if there were no population differences. If this probability is sufficiently small (usually less than 5 chances in 100, or .05), we conclude the population groups differ.

The assumptions of normality within each group and homogeneity of variance that we discussed in Chapter 7, along with the considerations of sample size, apply to all ANOVA models. Likewise, the "rules of thumb" approaches we considered for violations of these assumptions carry over to this extended ANOVA model.
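The between-group/within-group logic can be made concrete with a small sketch (Python, three made-up groups); it computes the F ratio by hand and checks it against scipy:

# Illustrative sketch: the between-group / within-group variance ratio (F),
# computed by hand and checked against scipy, on three made-up groups.
import numpy as np
from scipy import stats

groups = [np.array([4.0, 5.0, 3.0, 6.0]),
          np.array([2.0, 3.0, 2.5, 3.5]),
          np.array([1.0, 2.0, 1.5, 2.5])]

grand_mean = np.concatenate(groups).mean()
k = len(groups)
n_total = sum(len(g) for g in groups)

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)            # between-group variance
ms_within = ss_within / (n_total - k)        # within-group variance
f_ratio = ms_between / ms_within

p_value = stats.f.sf(f_ratio, k - 1, n_total - k)
print(f_ratio, p_value)
print(stats.f_oneway(*groups))               # same F and p from scipy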

Factors

When performing a t test comparing two groups, there is only one comparison that can be made: group 1 versus group 2. For this reason, the groups are constructed so their members systematically vary in only one aspect: for example, males versus females, or drug A versus drug B. If the two groups differed on more than one characteristic (for example, males given drug A versus females given drug B), it would be impossible to differentiate between the two effects (gender, drug).

When the data can be partitioned into more than two groups, additional comparisons can be made. These might involve one aspect or dimension, for example, four groups each representing a region of the country. Or the groups might vary along several dimensions, for example, eight groups each composed of a gender (two categories) by region (four categories) combination. In this latter case, we can ask additional questions: (1) Is there a gender difference? (2) Is there a region difference? (3) Do gender and region interact? Each aspect or dimension on which the groups differ is called a factor. Thus one might discuss a study or experiment involving one, two, or even three or more factors. A factor is represented in the data set as a categorical (nominal) variable and would be considered an independent variable.

SPSS Statistics allows multiple factors to be analyzed, and has different procedures available depending on how many factors are involved and their degree of complexity. If only one factor is to be studied, use the One-Way ANOVA procedure; this is the procedure we demonstrate in this appendix to study how education degree relates to average daily TV viewing. When you have two or more factors, the Univariate procedure on the General Linear Model menu can be used. Other procedures on that menu, such as Multivariate and Repeated Measures, as well as the Linear Mixed Models procedure, can be used for more complex models. These models are beyond the scope of this course, but are covered in the Advanced Topics: ANOVA course and, to some degree, in the Advanced Statistical Analysis Using SPSS course.



A.3 Exploring the Data

As in Chapter 7, we begin by applying exploratory data analysis procedures to the variables of interest. In practice, you would check each group's summary statistics, looking at the pattern of the data and noting any unusual points. For brevity in our presentation we will examine only the boxplot.

Open GSS2004Intro.sav if it is not already open
Click Analyze…Descriptive Statistics…Explore
Move TVHOURS to the Dependent List: box
Move DEGREE to the Factor List: box

Figure A.1 Explore Dialog Box to Compare TV Hours for Degree Groups

The dependent variable is the scale variable of interest, and the factor variable is the nominal or ordinal variable that defines the groups we want to compare. Since we are comparing different formal education degree groups, we designate DEGREE as the factor (or nominal independent variable) and TVHOURS as the dependent variable. We accept the default output. As we have seen in earlier chapters, you might choose to run histograms rather than stem & leaf plots, or request additional statistics.

Click OK

An exploratory analysis of TV hours will appear for each degree group. For brevity in this presentation we move directly to the boxplot.


Figure A.2 Boxplot of TV Hours by Degree Groups

The median hours of daily TV watched appear higher for those with less than a high school degree and lower for those with graduate degrees. Each group exhibits a positive skew, which is more exaggerated for those with a high school degree or less. Some individuals report watching rather large amounts of daily TV; one might want to examine the original surveys to check for data errors or evidence of misunderstanding the question. Also, based on the box heights (interquartile ranges), those with a high school degree or less show greater within-group variation than the others. This suggests a potential problem with the homogeneity of variance assumption, especially since the sample sizes are quite disparate (see the Case Processing Summary table).

We also note that there is no apparent pattern between the median level and the interquartile range (for example, one increasing as the other does) that might suggest a data transformation to stabilize the within-group variance. We will come back to this point after testing for homogeneity of variance. Let's move on to the actual ANOVA analysis.

A.4 One-Factor ANOVA

To run the analysis:

Click Analyze…Compare Means…One-Way ANOVA
Move TVHOURS to the Dependent List: box
Move DEGREE to the Factor: box


Figure A.3 One-Way ANOVA Dialog Box

We have provided the minimum information needed to run the basic analysis: one dependent variable and one factor variable. You could use One-Way ANOVA on more than one dependent variable for the same factor groups by placing multiple variables in the Dependent List. The Contrasts button is used to request statistical tests for planned group comparisons of interest. The Post Hoc button produces multiple comparison tests that compare each group mean against every other one; such tests help determine just which groups differ from which others, and are usually performed after the overall analysis establishes that some significant differences exist. We will use these tests in the next section. Finally, the Options button controls such diverse features as missing value handling and whether to display optional output such as descriptive statistics, means plots, and homogeneity tests. We want to display both descriptive statistics (although, having just run Explore, they are not strictly necessary) and the homogeneity of variance test.

Click the Options button
Check the Descriptive check box
Check the Homogeneity of variance test check box
Check the Brown-Forsythe and Welch check boxes

The completed dialog box is shown in Figure A.4. As mentioned earlier, ANOVA assumes homogeneity of within-group variance. However, when homogeneity does not hold there are several adjustments that can be made to the F test. We request these optional statistics because the boxplot indicates that the homogeneity of variance assumption may not hold.


Figure A.4 One-Way ANOVA Options Dialog Box

The missing value choices deal with how missing data are handled if you specify several dependent variables. By default, cases with missing values on a particular dependent variable are dropped only from the specific analysis involving that variable. Since we are looking at a single dependent variable, this choice has no bearing on our analysis. The Means plot option produces a line chart displaying the group means, which is one way to present the results. However, we will request an error bar plot later because it shows more information than the Means line plot.

Click Continue
Click OK

We now turn to interpretation of the results.

One-Factor ANOVA Results

The One-Way ANOVA output includes the descriptive statistics table, the analysis of variance summary table, robust tests that do not assume homogeneity of variance, and the probability value(s) we will use to judge statistical significance. We first review the ANOVA summary table (the default output) to determine whether any of the group means differ from any of the others.


Figure A.5 One-Factor ANOVA Summary Table

ANOVA
HOURS PER DAY WATCHING TV

                  Sum of Squares     df    Mean Square       F      Sig.
Between Groups          483.655       4        120.914    19.075    .000
Within Groups          5667.059     894          6.339
Total                  6150.714     898

Most of the information in the ANOVA table (Figure A.5) is technical in nature and is not directly interpreted. Rather, the summaries are used to obtain the F statistic and, more importantly, the probability value we use in evaluating the population differences. The table contains a row for between-groups variation and a row for within-groups variation. The "df" column contains the degrees of freedom, related to the number of groups and the number of individual observations within each group. The degrees of freedom are not interpreted directly, but are used in calculating the between-group and within-group variances. Similarly, the sums of squares are intermediate summary numbers used in calculating the between- and within-group variances. Technically, the Between Groups Sum of Squares is the sum of the squared deviations of the individual group means around the total sample mean (weighted by group size), and the Within Groups Sum of Squares is the sum of the squared deviations of individual observations around their respective group means. These numbers are rarely interpreted in their own right and are reported largely because it is traditional to do so.

The Mean Squares are the measures of between-group and within-group variation (variances); technically, they are the Sums of Squares divided by their respective degrees of freedom. Recall from our discussion of the logic of testing that under the null hypothesis both variances would have the same source, and the ratio of between to within would be about 1. This ratio of the mean square values is the sample F statistic. The F value in our example is 19.075, very far from 1!

Finally, and most readily interpretable, the column labeled "Sig." provides the probability of obtaining a sample F ratio as large as (or larger than) 19.075 (taking into account the number of groups and the sample size), assuming the null hypothesis that in the population all degree groups watch the same amount of TV. The probability of obtaining an F value this large (in other words, of obtaining sample means as far apart as we did), if the null hypothesis were true, displays as .000. This number is rounded, so the actual probability is less than .0005, or fewer than 5 chances in 10,000 of obtaining sample mean differences this far apart by chance alone. Thus we have a highly significant difference. In practice, most researchers move directly to the significance value, since the columns containing the sums of squares, degrees of freedom, mean squares, and F statistic are all needed for the probability calculation but are rarely interpreted in their own right.

To interpret the results we move to the descriptive information shown in the Descriptives table (Figure A.6) that we requested as optional output.
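Note: for readers who like to verify the arithmetic, the F and Sig. values in Figure A.5 follow directly from the sums of squares and degrees of freedom; a minimal sketch (Python, using the values shown in the table):

# Illustrative sketch (outside SPSS): recomputing F and its probability
# from the Figure A.5 summaries.
from scipy import stats

ss_between, df_between = 483.655, 4
ss_within, df_within = 5667.059, 894

ms_between = ss_between / df_between      # 120.914
ms_within = ss_within / df_within         # 6.339
f_ratio = ms_between / ms_within          # about 19.075

sig = stats.f.sf(f_ratio, df_between, df_within)
print(round(f_ratio, 3), sig)             # sig is well below .0005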


Figure A.6 Descriptive Statistics for Groups

Descriptives
HOURS PER DAY WATCHING TV

                    N    Mean   Std. Deviation   Std. Error   95% CI Lower   95% CI Upper   Minimum   Maximum
LT HIGH SCHOOL    111    4.50        3.697           .351          3.80           5.19          0        20
HIGH SCHOOL       476    2.96        2.561           .117          2.73           3.19          0        20
JUNIOR COLLEGE     79    2.53        1.894           .213          2.11           2.96          0        12
BACHELOR          141    2.17        2.080           .175          1.82           2.52          0        15
GRADUATE           92    1.78        1.333           .139          1.51           2.06          0         6
Total             899    2.87        2.617           .087          2.69           3.04          0        20
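Note: the confidence bands in this table can be reproduced from each group's mean, standard error, and sample size; a minimal sketch (Python, using the GRADUATE row) follows. The small discrepancy from the table (1.50 versus 1.51 for the lower bound) reflects rounding of the displayed mean and standard error.

# Illustrative sketch: a 95% confidence interval for a group mean,
# using the GRADUATE row of Figure A.6 (mean 1.78, std. error .139, n 92).
from scipy import stats

mean, std_error, n = 1.78, 0.139, 92
t_crit = stats.t.ppf(0.975, df=n - 1)      # two-sided 95% critical value

lower = mean - t_crit * std_error
upper = mean + t_crit * std_error
print(round(lower, 2), round(upper, 2))    # approximately 1.50 and 2.06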

The pattern of means is consistent with the boxplot in that those with less formal education watch more TV than those with more formal education. The 95% confidence bands for the group means gauge the precision with which we have estimated these values, and we can informally compare groups by comparing their confidence bands. The minimum and maximum values for each group are valuable as a data check; we again note some surprisingly large numbers.

Often at this point we are interested in making a statement about which of the five groups differ significantly from which others. This is because the overall F statistic simply tested the null hypothesis that all population means were the same. Typically, you now want to make more specific statements than merely that the five groups are not identical. Post hoc tests permit these pairwise group comparisons, and we will pursue them later. But first, we must check the homogeneity of variance assumption by reviewing the tests that we requested under the Options.

Homogeneity of Variance and What to Do If Violated

We also requested the Levene test of homogeneity of variance. This is the same test we saw in Chapter 7 in the t test table, and it is interpreted in the same way.

Figure A.7 Homogeneity of Within-Group Variance

Test of Homogeneity of Variances
HOURS PER DAY WATCHING TV

Levene Statistic    df1    df2    Sig.
          12.015      4    894    .000

Unfortunately, the null hypothesis of homogeneity of within-group variance is rejected at the rounded .000 (less than .0005) level. Our sample sizes are quite disparate (see Figure A.6), so we cannot count on robustness due to equal sample sizes. For this reason we turn to the Brown-Forsythe and Welch tests, which test for equality of group means without assuming homogeneity of variance. Since these results are not calculated by default, you would request them based on homogeneity tests done in the Explore or One-Way ANOVA procedures. These tests are shown in Figure A.8.


Figure A.8 Robust Tests of Mean Differences

Robust Tests of Equality of Means
HOURS PER DAY WATCHING TV (a)

                  Statistic    df1       df2     Sig.
Welch                19.259      4    269.315    .000
Brown-Forsythe       20.508      4    350.785    .000

a. Asymptotically F distributed.

Both of these tests mathematically adjust for the lack of homogeneity of variance. When calculating the between-group to within-group variance ratio, the Brown-Forsythe test explicitly adjusts for heterogeneity of variance by weighting each group's contribution to the between-group variation by a quantity related to its within-group variation. The Welch test adjusts the denominator of the F ratio so that it has the same expectation as the numerator when the null hypothesis is true, despite the heterogeneity of within-group variance. Both tests indicate there are highly significant differences in average TV viewing between the education degree groups, which is consistent with the conclusions we drew from the standard ANOVA. As noted in the table footnote, these robust tests are asymptotic, meaning their properties improve as the sample size increases; both still assume that the distribution within groups is normal. Simulation work (Brown and Forsythe, 1974) indicates the tests perform well with group sample sizes as small as 10, and possibly even 5.

As one alternative, a statistically sophisticated analyst might apply transformations to the dependent measure in order to stabilize the within-group variances (variance-stabilizing transforms). These are beyond the scope of this course, but interested readers might turn to Emerson in Hoaglin, Mosteller and Tukey (1991) for a discussion from the perspective of exploratory data analysis, and note that the spread & level plot in Explore will suggest a variance-stabilizing transform. Box, Hunter and Hunter (1978) contains a brief discussion of such transformations, and the original (technical) paper is Box and Cox (1964).

A second alternative is to perform the analysis using a nonparametric statistical method that assumes neither normality nor homogeneity of variance (recall that the Brown-Forsythe and Welch tests assume normality of error). A one-factor analysis of group differences that treats the dependent measure as only an ordinal (rank) variable is available as a Nonparametric Tests procedure within SPSS Statistics Base. When this analysis was run (see the appendix to this chapter if interested), the group differences were found to be highly significant. This serves as another confirmation of our result, but corresponding nonparametric procedures are not available for all analysis of variance models.

In situations in which robust or nonparametric equivalents are not available, many researchers accept the ANOVA results with the caveat that the reported probability levels are not exactly correct. In our example, since the significance value was less than .0005, even if we discounted the value by an order or two of magnitude, the result would still be significant at the .05 level. While these approaches are not entirely satisfactory, and statisticians may disagree as to which is best in a given situation, they are the common ways of dealing with the problem.
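For reference, the same kinds of checks can be sketched outside SPSS. The example below (Python, three made-up groups) runs Levene's homogeneity test and the Kruskal-Wallis rank test, a common nonparametric counterpart to one-factor ANOVA; it illustrates the general technique rather than the exact SPSS procedures.

# Illustrative sketch: Levene's homogeneity test and the Kruskal-Wallis
# rank-based comparison of group locations, on three made-up groups.
from scipy import stats

group_a = [5, 7, 6, 9, 12, 4, 8]
group_b = [3, 4, 2, 5, 3, 4, 6]
group_c = [1, 2, 2, 3, 1, 2, 4]

lev_stat, lev_p = stats.levene(group_a, group_b, group_c, center="mean")
kw_stat, kw_p = stats.kruskal(group_a, group_b, group_c)

print(f"Levene: statistic={lev_stat:.3f}, p={lev_p:.3f}")
print(f"Kruskal-Wallis: H={kw_stat:.3f}, p={kw_p:.3f}")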


Having concluded that there are differences in amount of TV viewed among different educational degree groups, we probe to find specifically which groups differ from which others.

A.5 Post Hoc Testing of Means

Post hoc tests are typically performed only after the overall F test indicates that population differences exist, although for a broader view see Milliken and Johnson (1984). At this point there is usually interest in discovering just which group means differ from which others. In one respect the procedure is quite straightforward: every possible pair of group means is tested for population differences and a summary table is produced. However, a problem arises in that as more tests are performed, the probability of obtaining at least one false-positive result increases. Recall our discussion of Type I and Type II errors in Chapter 5. As an extreme example, if there are ten groups, then 45 pairwise group comparisons (n*(n-1)/2) can be made. If we test at the .05 level, we would expect to obtain, on average, about 2 (.05 * 45 = 2.25) false-positive tests. In an attempt to reduce the false-positive rate when multiple tests of this type are done, statisticians have developed a number of methods.

Why So Many Tests?

The ideal post hoc test would demonstrate tight control of Type I (false-positive) error, have good statistical power (probability of detecting true population differences), and be robust to assumption violations (failure of homogeneity of variance, non-normal error distributions). Unfortunately, there are implicit tradeoffs among some of these desired features (Type I error and power), and no current post hoc procedure is best in all these areas. Add to this the facts that pairwise tests can be based on different statistical distributions (t, F, studentized range, and others) and that Type I error can be controlled at different levels (per individual test, per family of tests, and variations in between), and you have a large collection of post hoc tests. We will briefly compare post hoc tests from the perspective of being liberal or conservative regarding control of the false-positive (Type I error) rate, and apply several to our data.

There is a full literature (and several books) devoted to the study of post hoc tests (also called multiple comparison or multiple range tests, although there is a technical distinction between the two). Some books (Toothaker, 1991) summarize simulation studies that compare multiple comparison tests on their power (probability of detecting true population differences) as well as their performance under different patterns of group means and assumption violations (homogeneity of variance). The existence of numerous post hoc tests suggests that there is no single approach that statisticians agree is optimal in all situations. In some research areas, publication reviewers require a particular post hoc method, simplifying the researcher's decision. For more detailed discussion and recommendations, the short books by Klockars and Sax (1986), Toothaker (1991), and Hsu (1996) are useful. Also, for some thinking on what post hoc tests ought to be doing, see Tukey (1991) or Milliken and Johnson (1984). Below we present some tests available within SPSS Statistics, roughly ordered from the most liberal (greater statistical power and greater false-positive rate) to the most conservative (smaller false-positive rate, less statistical power), and also mention some designed to adjust for lack of homogeneity of variance.


LSD
The LSD or least significant difference method simply applies standard t tests to all possible pairs of group means. No adjustment is made based on the number of tests performed. The argument is that since an overall difference in group means has already been established at the selected criterion level (say .05), no additional control is necessary. This is the most liberal of the post hoc tests.

SNK, REGWF, REGWQ & Duncan
The SNK (Student-Newman-Keuls), REGWF (Ryan-Einot-Gabriel-Welsh F), REGWQ (Ryan-Einot-Gabriel-Welsh Q, based on the studentized range statistic) and Duncan methods involve sequential testing. After ordering the group means from lowest to highest, the two most extreme means are tested for a significant difference using a critical value adjusted for the fact that these are the extremes from a larger set of means. If these means are found not to be significantly different, the testing stops; if they are different, the testing continues with the next most extreme set, and so on. All are more conservative than the LSD. REGWF and REGWQ improve on the traditionally used SNK in that they adjust for the slightly elevated false-positive (Type I error) rate that SNK has when the set of means tested is much smaller than the full set.

Bonferroni & Sidak
The Bonferroni (also called the Dunn procedure) and Sidak (also called Dunn-Sidak) methods perform each test at a stringent significance level to ensure that the family-wise (applying to the set of tests) false-positive rate does not exceed the specified value. They are based on inequalities relating the probability of a false-positive result on each individual test to the probability of one or more false positives for a set of independent tests. For example, the Bonferroni is based on an additive inequality, so the criterion level for each pairwise test is obtained by dividing the original criterion level (say .05) by the number of pairwise comparisons made. Thus with five means, and therefore ten pairwise comparisons, each Bonferroni test will be performed at the .05/10, or .005, level.

Tukey (b)
The Tukey (b) test is a compromise test, combining the Tukey (see next test) and SNK criteria to produce a test result that falls between the two.

Tukey
Tukey's HSD (Honestly Significant Difference; also called the Tukey HSD, WSD, or Tukey(a) test) controls the false-positive rate family-wise. This means that if you are testing at the .05 level and performing all pairwise comparisons, the probability of obtaining one or more false positives is .05. It is more conservative than the Duncan and SNK. If all pairwise comparisons are of interest, which is usually the case, Tukey's test is more powerful than the Bonferroni and Sidak.

Scheffe
Scheffe's method also controls the family-wise error rate. It adjusts not only for the pairwise comparisons, but for any possible comparison the researcher might make. As such it is the most conservative of the available methods (the false-positive rate is smallest), but it has less statistical power.
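To make the idea of adjusting for multiple comparisons concrete, here is a minimal sketch (Python, made-up groups) of Bonferroni-adjusted pairwise t tests and Tukey's HSD; it illustrates the general technique, not SPSS's exact computations or output.

# Illustrative sketch: pairwise t tests with a Bonferroni adjustment,
# and Tukey's HSD, on three made-up groups.
from itertools import combinations
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

groups = {"a": np.array([4.0, 5.0, 3.0, 6.0, 5.5]),
          "b": np.array([2.0, 3.0, 2.5, 3.5, 2.0]),
          "c": np.array([1.0, 2.0, 1.5, 2.5, 1.0])}

pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)                      # Bonferroni: .05 divided by number of tests
for g1, g2 in pairs:
    t, p = stats.ttest_ind(groups[g1], groups[g2])
    print(g1, g2, round(p, 4), "significant" if p < alpha else "not significant")

values = np.concatenate(list(groups.values()))
labels = np.repeat(list(groups.keys()), [len(g) for g in groups.values()])
print(pairwise_tukeyhsd(values, labels, alpha=0.05))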



Specialized Post Hoc Tests

Hochberg's GT2 & Gabriel: Unequal Ns
Most of the post hoc procedures mentioned above (excepting LSD, Bonferroni & Sidak) were derived assuming equal group sample sizes, in addition to homogeneity of variance and normality of error. When the subgroup sizes are unequal, SPSS Statistics substitutes a single value (the harmonic mean) for the sample size. Hochberg's GT2 and Gabriel's post hoc tests explicitly allow for unequal sample sizes.

Waller-Duncan
The Waller-Duncan test takes a Bayesian approach that adjusts the criterion value based on the size of the overall F statistic, in order to be sensitive to the types of group differences associated with that F (for example, large or small). Also, you can specify the ratio of Type I (false positive) to Type II (false negative) error costs in the test. This feature allows for adjustments when there are differential costs to the two types of errors.

Unequal Variances and Unequal Ns
Tamhane T2, Dunnett's T3, Games-Howell, Dunnett's C: each of these post hoc tests adjusts for unequal variances and sample sizes in the groups. Simulation studies (summarized in Toothaker, 1991) suggest that although Games-Howell can be too liberal when the group variances are equal and the sample sizes are unequal, it is more powerful than the others.

An approach some analysts take is to run both a liberal (say LSD) and a conservative (Scheffe or Tukey HSD) post hoc test. Group differences that show up under both criteria are considered solid findings, while those found different only under the liberal criterion are viewed as tentative results. To illustrate the differences among the post hoc tests we will request three: one liberal (LSD), one midrange (REGWF), and one conservative (Scheffe). In addition, since homogeneity of variance does not hold in these data, we also request the Games-Howell test and would pay serious attention to its results. Ordinarily, a researcher would not run this many different tests, although running several to compare the results can be informative, as we will see. Given the homogeneity of variance violation in our data, in practice only the Games-Howell might be run.

Click on the Dialog Recall tool, then click One-Way ANOVA
Click on the Post Hoc button
Click the LSD (Least Significant Difference), R-E-G-W-F (Ryan-Einot-Gabriel-Welsh F), Scheffe and Games-Howell check boxes

The completed dialog box is shown in Figure A.9.


Figure A.9 Post Hoc Testing Dialog Box

Click Continue
Click OK

By default, statistical tests will be done at the .05 level. If you prefer a different alpha value (for example, .01), you can specify it in the Significance level box. The beginning part of the output contains the ANOVA table, robust tests of mean differences, descriptive statistics, and the homogeneity test, which we have already reviewed. We will move directly to the post hoc test results.

Note: Some of the pivot tables shown in this section were edited (column widths changed; only one post hoc method shown in some figures) to better display in the course guide.


Figure A.10 Least Significant Difference Post Hoc Results

Multiple Comparisons
Dependent Variable: HOURS PER DAY WATCHING TV
LSD

(I) RS HIGHEST DEGREE   (J) RS HIGHEST DEGREE   Mean Difference (I-J)   Std. Error   Sig.   95% CI Lower   95% CI Upper
LT HIGH SCHOOL          HIGH SCHOOL                     1.540*             .265      .000       1.02           2.06
                        JUNIOR COLLEGE                  1.964*             .371      .000       1.24           2.69
                        BACHELOR                        2.325*             .319      .000       1.70           2.95
                        GRADUATE                        2.713*             .355      .000       2.02           3.41
HIGH SCHOOL             LT HIGH SCHOOL                 -1.540*             .265      .000      -2.06          -1.02
                        JUNIOR COLLEGE                   .424              .306      .166       -.18           1.02
                        BACHELOR                         .786*             .241      .001        .31           1.26
                        GRADUATE                        1.173*             .287      .000        .61           1.74
JUNIOR COLLEGE          LT HIGH SCHOOL                 -1.964*             .371      .000      -2.69          -1.24
                        HIGH SCHOOL                     -.424              .306      .166      -1.02            .18
                        BACHELOR                         .361              .354      .307       -.33           1.06
                        GRADUATE                         .749              .386      .053       -.01           1.51
BACHELOR                LT HIGH SCHOOL                 -2.325*             .319      .000      -2.95          -1.70
                        HIGH SCHOOL                     -.786*             .241      .001      -1.26           -.31
                        JUNIOR COLLEGE                  -.361              .354      .307      -1.06            .33
                        GRADUATE                         .388              .337      .251       -.27           1.05
GRADUATE                LT HIGH SCHOOL                 -2.713*             .355      .000      -3.41          -2.02
                        HIGH SCHOOL                    -1.173*             .287      .000      -1.74           -.61
                        JUNIOR COLLEGE                  -.749              .386      .053      -1.51            .01
                        BACHELOR                        -.388              .337      .251      -1.05            .27

*. The mean difference is significant at the .05 level.

The rows are formed by every possible pairing of groups. For example, at the top of the pivot table the "Less than High School" group is paired with each of the other four. The column labeled "Mean Difference (I-J)" contains the sample mean difference for each pairing of groups; we see the "Less than High School" and Graduate groups have a mean difference of 2.71 hours of daily TV viewing. If this difference is statistically significant at the specified level after applying the post hoc adjustments (none for LSD), an asterisk (*) appears beside the mean difference. The actual significance value for the test appears in the column labeled "Sig.". Thus, the first block of LSD results indicates that, in the population, those with less than a high school degree differ significantly in daily TV viewing from each of the other four groups.

In addition, the standard errors and 95% confidence intervals for each mean difference appear. These provide information on the precision with which we have estimated the mean differences. Note that, as you would expect, if a mean difference is not significant, its confidence interval includes 0. Also notice that each pairwise comparison appears twice (for example, high school versus junior college and junior college versus high school). For each such duplicate pair the significance value is the same, but the signs are reversed for the mean difference and the confidence interval limits.


Summarizing the entire table in Figure A.10, we would say that the lowest degree group (less than high school) differs in amount of TV viewed daily from all other groups, and that those with a high school degree differ from all other groups except junior college; in each case, the groups with higher degrees watch less TV. The three highest degree groups do not differ from each other. Since LSD is the most liberal of the post hoc tests, we are interested to learn whether the same results hold under more conservative criteria.

Figure A.11 Homogeneous Subsets Results for REGWF Post Hoc Tests

The REGWF results in Figure A.11 are not presented in the same format as we saw for the LSD. This is because for some of the post hoc methods (for example, the sequential or multiple-range tests) standard errors and 95% confidence intervals for all pairwise comparisons are not defined. Rather than display pivot tables with empty columns, a different format, homogeneous subsets, is used. A homogeneous subset is a set of groups for which no pair of group means differs significantly (the Sig. value at the bottom of the column will be above the alpha criterion of .05). This format is closer in spirit to the nature of the sequential tests actually performed by REGWF. Depending on the post hoc tests requested, SPSS Statistics will display a multiple comparison table, a homogeneous subset table, or both.

Recall that REGWF tests the most extreme means first, then the less extreme means, adjusting for the number of means in the comparison set. Viewing the REGWF portion of the table, we see three homogeneous subsets (three columns). The first is composed of the graduate, bachelor, and junior college groups; they do not differ from one another, although at least one of them differs from each of the two remaining groups. This result is consistent with the LSD tests. The second subset is composed of the junior college and high school groups (they do not differ significantly), which is also consistent with the LSD results. Notice that the third homogeneous subset contains only one group (less than high school). This is because that group of respondents differs from each of the others on television viewing (again consistent with the LSD results). The homogeneous subset pivot table thus displays where population differences do not exist (and, by inference, where they do).


A homogeneous subset summary appears for the Scheffe test as well (Figure A.11). The results are similar, except for subset 2, where bachelor is added to the homogeneous subset containing junior college and high school. This is consistent with Scheffe being a more conservative test (smaller false-positive rate) than LSD or REGWF. Thus, under the Scheffe test, the high school and bachelor populations are not found to be significantly different.

Figure A.12 Scheffe Post Hoc Results

Multiple Comparisons
Dependent Variable: HOURS PER DAY WATCHING TV
Scheffe

(I) RS HIGHEST DEGREE   (J) RS HIGHEST DEGREE   Mean Diff. (I-J)   Std. Error   Sig.    95% Confidence Interval
LT HIGH SCHOOL          HIGH SCHOOL              1.540*             .265         .000    (  .72,  2.36)
LT HIGH SCHOOL          JUNIOR COLLEGE           1.964*             .371         .000    (  .82,  3.11)
LT HIGH SCHOOL          BACHELOR                 2.325*             .319         .000    ( 1.34,  3.31)
LT HIGH SCHOOL          GRADUATE                 2.713*             .355         .000    ( 1.62,  3.81)
HIGH SCHOOL             LT HIGH SCHOOL          -1.540*             .265         .000    (-2.36,  -.72)
HIGH SCHOOL             JUNIOR COLLEGE            .424              .306         .750    ( -.52,  1.37)
HIGH SCHOOL             BACHELOR                  .786*             .241         .032    (  .04,  1.53)
HIGH SCHOOL             GRADUATE                 1.173*             .287         .002    (  .29,  2.06)
JUNIOR COLLEGE          LT HIGH SCHOOL          -1.964*             .371         .000    (-3.11,  -.82)
JUNIOR COLLEGE          HIGH SCHOOL              -.424              .306         .750    (-1.37,   .52)
JUNIOR COLLEGE          BACHELOR                  .361              .354         .903    ( -.73,  1.45)
JUNIOR COLLEGE          GRADUATE                  .749              .386         .440    ( -.44,  1.94)
BACHELOR                LT HIGH SCHOOL          -2.325*             .319         .000    (-3.31, -1.34)
BACHELOR                HIGH SCHOOL              -.786*             .241         .032    (-1.53,  -.04)
BACHELOR                JUNIOR COLLEGE           -.361              .354         .903    (-1.45,   .73)
BACHELOR                GRADUATE                  .388              .337         .858    ( -.65,  1.43)
GRADUATE                LT HIGH SCHOOL          -2.713*             .355         .000    (-3.81, -1.62)
GRADUATE                HIGH SCHOOL             -1.173*             .287         .002    (-2.06,  -.29)
GRADUATE                JUNIOR COLLEGE           -.749              .386         .440    (-1.94,   .44)
GRADUATE                BACHELOR                 -.388              .337         .858    (-1.43,   .65)

*. The mean difference is significant at the .05 level.

A careful observer will notice that the Scheffe multiple comparison results in Figure A.12 are not completely consistent with the homogeneous subset results (Figure A.11). The multiple comparison results indicate that the high school group differs significantly from the bachelor group (p=.03), while homogeneous subset 2 indicates they do not. Here a slightly different sample size adjustment produces different conclusions: for homogeneous subsets, the sample size is set to the harmonic mean of all groups, while for multiple comparison tables the default is to compute harmonic means on a two-group (pairwise) basis. This is not an uncommon result when doing post hoc testing, because of the different assumptions and methods of the various tests, and it is one reason why investigators often request multiple tests to compare and contrast the results.


Figure A.13 Games-Howell Post Hoc Results

Multiple Comparisons
Dependent Variable: HOURS PER DAY WATCHING TV
Games-Howell

(I) RS HIGHEST DEGREE   (J) RS HIGHEST DEGREE   Mean Diff. (I-J)   Std. Error   Sig.    95% Confidence Interval
LT HIGH SCHOOL          HIGH SCHOOL              1.540*             .370         .001    (  .52,  2.56)
LT HIGH SCHOOL          JUNIOR COLLEGE           1.964*             .411         .000    (  .83,  3.10)
LT HIGH SCHOOL          BACHELOR                 2.325*             .392         .000    ( 1.24,  3.41)
LT HIGH SCHOOL          GRADUATE                 2.713*             .377         .000    ( 1.67,  3.76)
HIGH SCHOOL             LT HIGH SCHOOL          -1.540*             .370         .001    (-2.56,  -.52)
HIGH SCHOOL             JUNIOR COLLEGE            .424              .243         .411    ( -.25,  1.10)
HIGH SCHOOL             BACHELOR                  .786*             .211         .002    (  .21,  1.36)
HIGH SCHOOL             GRADUATE                 1.173*             .182         .000    (  .67,  1.67)
JUNIOR COLLEGE          LT HIGH SCHOOL          -1.964*             .411         .000    (-3.10,  -.83)
JUNIOR COLLEGE          HIGH SCHOOL              -.424              .243         .411    (-1.10,   .25)
JUNIOR COLLEGE          BACHELOR                  .361              .276         .685    ( -.40,  1.12)
JUNIOR COLLEGE          GRADUATE                  .749*             .254         .031    (  .05,  1.45)
BACHELOR                LT HIGH SCHOOL          -2.325*             .392         .000    (-3.41, -1.24)
BACHELOR                HIGH SCHOOL              -.786*             .211         .002    (-1.36,  -.21)
BACHELOR                JUNIOR COLLEGE           -.361              .276         .685    (-1.12,   .40)
BACHELOR                GRADUATE                  .388              .224         .416    ( -.23,  1.00)
GRADUATE                LT HIGH SCHOOL          -2.713*             .377         .000    (-3.76, -1.67)
GRADUATE                HIGH SCHOOL             -1.173*             .182         .000    (-1.67,  -.67)
GRADUATE                JUNIOR COLLEGE           -.749*             .254         .031    (-1.45,  -.05)
GRADUATE                BACHELOR                 -.388              .224         .416    (-1.00,   .23)

*. The mean difference is significant at the .05 level.

The Games-Howell multiple comparison test (Figure A.13) adjusts for both unequal variances (determined to be present by the Levene test earlier) and unequal sample sizes. The overall pattern of results is similar to that of the other tests, with one notable difference: Games-Howell finds the junior college and graduate groups to be statistically distinct (p=.03). These results are surprisingly consistent given the lack of homogeneity and the unequal cell sizes, but we have a large sample.

When results differ across tests (and they often differ more than in our example), what is the true situation? We don't know. Your original choice of a post hoc test would be based on how you want to balance power and the false-positive rate. Here, under the other tests, even the liberal LSD, we would conclude that there is no significant difference between the junior college and graduate groups in the amount of TV they watch. But only Games-Howell adjusts for the unequal variances across the degree groups, and it found that these two groups differ. As a consequence, some researchers would report the junior college - graduate difference as a tentative result; others might state that there is a difference, preferring the Games-Howell test.

On the other hand, there are other comparisons in which we can have more confidence. All four tests found that those with a bachelor's or graduate degree watch less TV than those with a high school degree or less. And in all these comparisons, never lose sight of practical or substantive significance. Are the amounts of television viewing by each group different enough to be of practical importance?


The bottom line is that your choice of post hoc test should reflect your preference in the power/false-positive tradeoff and your evaluation of how well the data meet the assumptions of the analysis; you then live with the results of that choice. Books such as Toothaker (1991) or Hsu (1996), and their references, evaluate the various post hoc tests on the basis of theoretical and Monte Carlo considerations.

A.6 Graphing the Mean Differences

For presentations it is helpful to display the sample group means along with their 95% confidence bands. We saw such error bar charts in the two-group case and they are useful here as well. To create an error bar chart for TV hours grouped by respondent's highest degree:

Click Graphs…Chart Builder
Click OK in the Information box
Click Reset
Click the Gallery tab (if it's not already selected)
Click Bar in the Choose from: list

Select the icon for Simple Error Bar (usually the third icon in the second row) and drag it to the Chart Preview canvas
Drag and drop DEGREE from the Variables: list to the X-Axis? area in the Chart Preview canvas
Drag and drop TVHOURS from the Variables: list to the Y-Axis? area in the Chart Preview canvas

The completed Chart Builder is shown in Figure A.14.


Figure A.14 Chart Builder for Error Bar Chart

Click OK
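If you prefer syntax, a similar chart can be produced with the legacy graphics command sketched below. Note that Chart Builder itself pastes a more verbose GGRAPH/GPL specification, so treat this as an approximate equivalent rather than what the dialog generates.

* Error bar chart of mean TV hours with 95% confidence bands, by degree group.
GRAPH
  /ERRORBAR(CI 95)=tvhours BY degree.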

The chart, shown in Figure A.15, provides a visual sense of how far the groups are separated. The confidence bands are determined for each group separately and no adjustment is made based on the number of groups that are compared or their (unequal) variances. From the graph we have a clear sense of the relation between formal education degree and TV viewing.


Figure A.15 Error Bar Chart of TV Hours by Degree Group

A.7 Appendix: Group Differences on Ranks

Analysis of variance assumes that the distribution of the dependent measure within each group follows a normal curve and that the within-group variation is homogeneous across groups. If either of these assumptions fails in a gross way, you can sometimes apply techniques that make fewer assumptions about the data. We saw such a variation when we applied tests that did not assume homogeneity of variance but did assume normality (Brown-Forsythe and Welch). However, what if neither the homogeneity nor the normality assumption is met? In this case, nonparametric statistics are available; they do not assume specific data distributions described by parameters such as the mean and standard deviation. Since these methods make few if any distributional assumptions, they can often be applied when the usual assumptions are not met. They can also be used with ordinal-level measures.

The downside of such methods is that if the stronger data assumptions do hold, nonparametric techniques are generally less powerful (less likely to detect true differences) than the appropriate parametric method. Second, some parametric statistical analyses currently have no corresponding nonparametric method. It is fair to say that the boundaries separating where one would use parametric versus nonparametric methods are in practice somewhat vague, and statisticians can and often do disagree about which approach is optimal in a specific situation. For more discussion of the common nonparametric tests see Daniel (1978), Siegel and Castellan (1988) or Wilcox (1997).


Because of our concerns about the lack of homogeneity of variance and normality of TV hours viewed for the different degree groups, we will use a nonparametric procedure, the Kruskal-Wallis test, which assumes only that the dependent measure has ordinal (rank order) properties. The basic logic behind this test is straightforward. If we rank order the dependent measure across the entire sample, then under the null hypothesis (no population differences) we would expect the mean rank (technically, the sum of the ranks adjusted for sample size) to be the same for each sample group. The Kruskal-Wallis test calculates the ranks, the sample group mean ranks, and the probability of obtaining average ranks (weighted summed ranks) as far apart as, or further apart than, those observed in the sample if the population groups were identical.

To run the Kruskal-Wallis test, we declare TVHOURS as the test variable (from which ranks are calculated) and DEGREE as the independent or grouping variable.

Click Analyze…Nonparametric Tests…K Independent Samples
Move TVHOURS into the Test Variable List: box
Move DEGREE into the Grouping Variable: box

Note that the minimum and maximum values of the grouping variable must be specified using the Define Range pushbutton.

Click the Define Range pushbutton
Enter 0 as the Minimum and 4 as the Maximum

Figure A.16 Analysis of Ranks Dialog Box

Click Continue
Click OK

By default, the Kruskal-Wallis test will be performed. The Kruskal-Wallis is the most commonly used nonparametric test for this situation. However, two additional statistical tests are available and you can choose to run all three if you want.
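For reference, the dialog selections above correspond roughly to the syntax sketched below for our example variables; pasting from the dialog will confirm the exact form for your version.

* Kruskal-Wallis test of TV hours across the five degree groups (coded 0 through 4).
NPAR TESTS
  /K-W=tvhours BY degree(0 4).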


Figure A.17 Results of Kruskal-Wallis Nonparametric Analysis

The results are displayed in the two tables shown in Figure A.17. In the first table, we see that the pattern of mean ranks (remember, smaller ranks imply less TV watched) follows that of the original means of TVHOURS: the higher the degree, the less TV watched. The chi-square statistic used in the Kruskal-Wallis test indicates that we would be very unlikely (less than .0005, or 5 chances in 10,000) to obtain samples with average ranks so far apart if the null hypothesis (the same distribution of TV hours in each group) were true. Based on this result, we are now much more confident in our original conclusion about overall mean differences, because we were able to confirm that population differences exist without making all the assumptions required for analysis of variance. As we noted earlier, there are no nonparametric equivalents for the post hoc tests, so if we are interested in which groups differ from each other, we would still have to rely on the post hoc tests in the One-Way ANOVA procedure.


Summary Exercises

We will continue our investigation (from Chapter 7) of TV watching hours (TVHOURS) and the number of hours using the web (WWWHR). We want to see whether the average number of hours watching TV or the number of hours using the web differs by marital status.

1. Run exploratory data analysis on TVHOURS and WWWHR by MARITAL. (Hint: Don't forget to select the "Pairwise Deletion" option.) Are any of the variables normally distributed? What differences do you notice in the means and standard deviations for each group? Is the homogeneity of variance assumption likely to be met? You might use Chart Builder to produce a paneled histogram for number of children by marital status.

2. Run a One-way ANOVA looking at mean differences by marital status for these two variables. Request the test for homogeneity and the robust measures. Interpret the results. Which variables met the homogeneity of variance assumption? Are the means of any of the variables significantly different for marital status groups?

3. Run Post Hoc Tests, selecting a liberal test, such as LSD, a more conservative test, such as Scheffe, and the Games-Howell if the variables did not meet the homogeneity of variance criteria. Which groups are significantly different from which other groups? Do the tests agree? If not, how might you summarize the results?

4. Use Chart Builder to display an error bar chart for each of these analyses.

For those with extra time: In the Chapter 7 exercises, you looked at the age when the first child was born (AGEKDBRN) and number of household members (HHSIZE) by gender. Would you expect the average of either of these variables to be different for education degree groups or marital status? Test your assumption by running the appropriate ANOVA and interpret your results.


Appendix B: Introduction to Multiple Regression

Topics
• Running Multiple Regression
• Interpreting the Results
• Residuals and Outliers

Data
This chapter uses the Bank.sav data file.

B.1 Multiple Regression

In Chapter 9, we produced a predictive equation for beginning salary based on education level. Compensation analysts more often build such equations using several predictor variables instead of the single-variable approach we have used. This approach is called multiple regression. Additionally, we want to assess how well the equation fits the data and view diagnostics to check the regression assumptions. Since we have measured several variables that might be related to beginning salary, we will add additional predictor (independent) variables to the equation, evaluate the improvement in fit, and interpret the equation coefficients. In a successful analysis we would obtain an equation useful in predicting starting salary from background information, and understand the relative contribution of each predictor variable.

Multiple regression represents a direct extension of simple regression. Instead of a single predictor variable, multiple regression allows for more than one independent variable (Y = a + b1*X1 + b2*X2 + b3*X3 + …) in the prediction equation. While we are limited in the number of dimensions we can view in a single plot (SPSS Statistics can build a 3-dimensional scatterplot), the regression equation allows for many independent variables. When we run multiple regression we will again be concerned with how well the equation fits the data, whether there are any significant linear relations, and estimating the coefficients of the best-fitting prediction equation. In addition, we are interested in the relative importance of the independent variables in predicting the dependent measure.

In our example, we expand our prediction model of beginning salary to include years of formal education (edlevel), years of previous work experience (work), age, and gender (sex). Gender is a dichotomous variable coded 0 for males and 1 for females. As such (recall our earlier discussion), it can be included as an independent variable in regression. Its regression coefficient will indicate the relation between gender and beginning salary, adjusting for the effects of the other independent variables. To run the analysis, we open the Linear Regression dialog box and add the additional independent variables to the Independent Variables list.


Click Analyze…Regression…Linear…
Move salbeg to the Dependent: box
Move edlevel, sex, work, and age to the Independent(s): box

Figure B.1 Setting Up Multiple Regression

Since the four independent variables will be entered as a single block (we are at block 1 of 1), the order in which we list the variables will not affect the analysis, but Regression will maintain this order when presenting results.

Residual Plots

While we can run the multiple regression at this point, we will request some diagnostic plots involving residuals and information about outliers. By default no residual plots will appear. These options are explained below.

Click Plots

Within the Plots dialog box:

Click the Histogram check box in the Standardized Residual Plots area
Move *ZRESID into the Y: box
Move *ZPRED into the X: box


Figure B.2 Regression Plots Dialog Box

The options in the Standardized Residual Plots area of the dialog box all involve plots of standardized residuals. Ordinary residuals are useful if the scale of the dependent variable is meaningful, as it is here (beginning salary in dollars). Standardized residuals are helpful if the scale of the dependent variable is not familiar (say, a 1 to 10 customer satisfaction scale). By this we mean that it may not be clear to the analyst just what constitutes a large residual: is an overprediction of 1.5 units a large miss on a 1 to 10 scale? In such situations, standardized residuals (residuals expressed in standard deviation units) are very useful because large prediction errors can be easily identified. If the errors follow a normal distribution, then standardized residuals greater than 2 (in absolute value) should occur about 5% of the time, and those greater than 3 (in absolute value) should occur less than 1% of the time. Thus standardized residuals provide a norm against which one can judge what constitutes a large residual. We requested a histogram of the standardized residuals; note that a normal probability plot is available as well. Recall that the F and t tests in regression assume that the residuals follow a normal distribution.

Regression can produce summaries concerning various types of residuals. Without going into all these possibilities, we request a scatterplot of the standardized residuals (*ZRESID) versus the standardized predicted values (*ZPRED). An assumption of regression is that the residuals are independent of the predicted values, so if we see any pattern (as opposed to a random blob) in this plot, it might suggest a way of adjusting and improving the analysis.

Click Continue

Next we will look at the Statistics dialog box. The Casewise Diagnostics choice appears here. When this option is checked, Regression will list information about all cases whose standardized residuals are more than 3 standard deviations from the line. This outlier criterion is under your control.

Click Statistics
Click the Casewise diagnostics check box in the Residuals area


Figure B.3 Regression Statistics Dialog Box

Other statistics, such as the 95% confidence interval for the B (regression) coefficients, can be requested.

Click Continue
Click OK
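For readers who prefer syntax, the dialog selections in this section (the predictors entered as one block, a histogram of standardized residuals, the residual scatterplot, and casewise diagnostics for outliers beyond 3 standard deviations) correspond roughly to the command sketched below for the Bank.sav example. Paste from your own dialog boxes to confirm the exact subcommands in your session.

* Multiple regression of beginning salary on education, gender, work experience, and age,
* with residual diagnostics requested.
REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /DEPENDENT salbeg
  /METHOD=ENTER edlevel sex work age
  /RESIDUALS HISTOGRAM(ZRESID)
  /SCATTERPLOT=(*ZRESID, *ZPRED)
  /CASEWISE PLOT(ZRESID) OUTLIERS(3).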

B.2 Multiple Regression Results

We now turn to the results of our multiple regression run. Recall that listwise deletion of missing data has occurred; that is, if a case is missing data on any of the five variables used in the regression, it is dropped from the analysis. If this results in heavy data loss, other choices for handling missing values are available in the Regression Options dialog box (see also the SPSS Missing Values add-on module for multiple-variable imputation methods for estimating missing data values).

In Figure B.4 the dependent and independent variables are listed, followed by the Model Summary table. The R-square statistic is about .49, indicating that with these four predictor variables we can account for about 49% of the variation in beginning salaries. Education alone had an R-square of .40, so the additional set of three predictors added only about 9%: an improvement, but a modest one. The Adjusted R-square is quite close to the R-square. The standard error has dropped from $2,439 (with just education as a predictor) to $2,260: again an improvement, but not an especially large one.


Figure B.4 Variable Summary and Fit Measures

Next we turn to the ANOVA table.

Figure B.5 ANOVA Table

Since there are four independent variables, the F statistic tests whether any of the variables have a linear relationship with beginning salary. Not surprisingly, since we already know from the analysis in Chapter 9 that education is significantly related to beginning salary, the result is highly significant.


Figure B.6 Multiple Regression Coefficients, Bs and Betas

In the Coefficients table, the independent variables appear in the order they were given in the Regression dialog box, not in order of importance. Although the B coefficients are important for prediction and interpretive purposes, analysts usually look first to the t test at the end of each line to determine which independent variables are significantly related to the outcome measure. Since four variables are in the equation, we are testing whether there is a linear relationship between each independent variable and the dependent measure after adjusting for the effects of the three other independent variables. Looking at the significance values we see that education and gender are highly significant (less than .0005), age is significant at the .05 level (p=.035), while work experience is not linearly related to beginning salary (after controlling for the other predictors). Thus we can drop work experience as a predictor. It may seem odd that work experience is not related to salary, but since many of the positions were clerical, work experience may not play a large role. Typically, you would rerun the regression after removing variables not found to be significant, but here we will proceed and interpret this output.

The estimated regression (B) coefficient for education is about $651, similar but not identical to the coefficient ($691) found in the simple regression using formal education alone. In the simple regression we estimated the B coefficient for education ignoring any other effects, since none were included in the model. Here we evaluate the effect of education after controlling (statistically adjusting) for age, work experience, and gender. If the independent variables are correlated, the change in a B coefficient from simple to multiple regression can be substantial. So, after controlling for (holding constant) age, work experience, and gender, a year of formal education, on average, was worth $651 in starting salary.

The gender variable has a B coefficient of -$1,526. This means that a one-unit change in gender (moving from male to female), controlling for the other independent variables in the equation, is associated with a drop (negative coefficient) in beginning salary of $1,526. To put it more plainly, females had a beginning salary $1,526 lower than males, controlling for the other three variables in the equation. Age has a B coefficient of $33, so a one-year increase in age (controlling for the other variables) was associated with a $33 increase in beginning salary. Since we found the work experience coefficient not to be significantly different from zero, we treat it as if it were zero. The constant or intercept term is still negative, and would correspond to the predicted salary for a male (sex=0) with 0 years of education, 0 years of work experience, and an age of 0, which is not a realistic combination. The standard errors again provide precision measures for the regression coefficient estimates.

If we simply looked at the estimated B coefficients, we might think that gender is the most important variable. However, the magnitude of a B coefficient is influenced by the standard deviation of its independent variable. For example, sex takes on only two values (0 and 1), while education values range from 8 years to over 20 years. The Beta coefficients explicitly adjust for such standard deviation differences in the independent variables. They indicate what the regression coefficients would be if all variables were standardized to have means of 0 and standard deviations of 1. A Beta coefficient thus indicates the expected change (in standard deviation units) of the dependent variable per one standard deviation unit increase in the independent variable (after adjusting for the other predictors). This provides a means of assessing the relative importance of the different predictor variables in multiple regression. The Betas are normed so that the maximum should be less than or equal to one in absolute value (if any Betas are above 1 in absolute value, it suggests a problem with the data: multicollinearity). Examining the Betas, we see that education is the most important predictor, followed by gender, and then age. The Beta for work experience is very near zero, as we would expect.

If we needed to predict beginning salary from these background variables (dropping work experience), we would use the B coefficients. Rounding to whole numbers, we would say: salbeg = 651*edlevel - 1526*sex + 33*age - 2666.
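As a quick illustration of how the equation is used (with the rounded coefficients above, so the figure is only approximate), a 30-year-old male with 16 years of education would have a predicted beginning salary of about 651*16 + 33*30 - 2666 = $8,740. If you wanted this prediction for every case in the file, one possible approach is the syntax sketch below; predsal is a hypothetical helper variable, not part of the Bank.sav file.

* Predicted beginning salary from the rounded equation above (predsal is a new,
* illustrative variable; e.g., edlevel=16, sex=0, age=30 gives about 8740).
COMPUTE predsal = 651*edlevel - 1526*sex + 33*age - 2666.
EXECUTE.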

B.3 Residuals and Outliers

In Figure B.7, we see those observations more than three standard deviations from the regression fit line; assuming a normal distribution, this would happen less than 1% of the time by chance alone. In this data file that would be about 5 outliers (.01*474), so the seven cases listed do not seem excessive. However, it is interesting to note that all the large residuals are positive, and some of them are quite substantial. Residuals should normally be balanced between positive and negative values; when they are not, you should investigate the data further. The next step would be to see whether these observations have anything in common (perhaps the same job category, which may be out of line with the others regarding salary). Since we know their case numbers (an ID variable can be substituted), we could look at them more closely.

Figure B.7 Casewise Listing of Outliers

We see the distribution of the residuals with a normal bell-shaped curve superimposed in Figure B.8. The residuals are a bit too concentrated in the center (notice the peak) and are skewed; notice the long tail to the right. Given this pattern, a technical analyst might try a data transformation on the dependent measure (taking logs), which might improve the properties of the residual distribution. However, just as with ANOVA, larger sample sizes protect against moderate departures from normality, and our sample size here should be adequate. Overall, the distribution is not too bad, but there are clearly some outliers in the tail; these also show up in the casewise outlier summary.

Figure B.8 Histogram of the Residuals

In the scatterplot of residuals (Figure B.9), we hope to see a horizontally oriented blob of points with the residuals showing the same spread across different predicted values. Unfortunately, we see a hint of a curving pattern: the residuals seem to slowly decrease then swing up at the end. This type of pattern can emerge if the relationship is curvilinear, but a straight line is fit to the data. Also, the spread of the residuals is much more pronounced at higher predicted values than at the lower ones. This suggests lack of homogeneity of variance. Such a pattern is common with economic data: there is greater variation at larger values. At this point, the analyst should think about adjustments to the equation. Given the lack of homogeneity, the suggestion of curvilinearity, and knowing that the dependent measure is income, an experienced regression user would probably perform a log transform on beginning salary and rerun the analysis. This is not to suggest that such an adjustment should occur to you at this stage; the main point is that you should look at residual plots to check the assumptions of regression, and you may find hints there on how to improve your model.
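Should you decide to pursue the log transformation mentioned above, one possible approach is sketched below; this is optional, and lnsalbeg is a hypothetical variable name used only for illustration. After creating the logged variable, you would rerun the regression with it as the dependent measure and re-examine the residual plots.

* Natural log of beginning salary, to be used as an alternative dependent variable.
COMPUTE lnsalbeg = LN(salbeg).
EXECUTE.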


Figure B.9 Scatterplot of Residuals and Predicted Values

Summary of Regression Results

Overall, the regression analysis was successful in that we can predict about 49% of the variation in beginning salary from education, gender, and age. We found that education is the best predictor, but that gender played a substantial role. Examination of the residual summaries suggested that a straight line may not be the best function to fit (this was also evident from the scatterplot in Figure 9.1), and there were several large positive residuals that should be checked more carefully.


Summary Exercises

Using the Bank data:

1. Use linear regression to develop a prediction equation for current salary (salnow), using as predictors: age, edlevel, minority, salbeg, sex, time, and work. Request a histogram of the errors, the scatterplot of residuals, and casewise diagnostics. What variables are significant predictors of current salary? Which variable is the strongest predictor? The weakest? Do the results make sense? What is the prediction equation?

2. Are there any problems with the assumptions for linear regression, such as homogeneity of variance? Does a linear model fit the data?

For those with extra time:

1. Rerun the regression with only the significant variables. Do the coefficients (B) change much or not?


References

Introductory Statistics Books

Burns, Robert P. and Burns, Richard. 2008 (forthcoming). Business Research Methods and Statistics Using SPSS. London: Sage Publications Ltd.
Field, Andy. 2005. Discovering Statistics Using SPSS (2nd ed.). London: Sage Publications Ltd.
Hays, William L. 2007. Statistics (6th ed.). New York: Wadsworth Publishing.
Kendrick, Richard J. 2004. Social Statistics: An Introduction Using SPSS (2nd ed.). Allyn & Bacon.
Knoke, David, Bohrnstedt, George W. and Mee, Alisa Potter. 2002. Statistics for Social Data Analysis (4th ed.). Wadsworth Publishing.
Moore, David S. 2005. The Practice of Business Statistics with SPSS. W.H. Freeman.
Norusis, Marija J. 2008. SPSS 16.0 Guide to Data Analysis (2nd ed.). New York: Prentice-Hall.
Norusis, Marija J. 2008. SPSS 16.0 Statistical Procedures Companion (2nd ed.). New York: Prentice-Hall.

Additional References

Agresti, Alan. 2007. An Introduction to Categorical Data Analysis (2nd ed.). New York: Wiley-Interscience.
Allison, Paul D. 1998. Multiple Regression: A Primer. Thousand Oaks, CA: Pine Forge.
Allison, Paul D. 2001. Missing Data. Thousand Oaks, CA: Sage.
Andrews, Frank M., Klem, L., Davidson, T.N., O'Malley, P.M. and Rodgers, W.L. 1981. A Guide for Selecting Statistical Techniques for Analyzing Social Science Data. Ann Arbor, MI: Institute for Social Research, University of Michigan.
Bishop, Yvonne M., Fienberg, S. and Holland, P.W. 1975. Discrete Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
Box, George E. P. and Cox, D.R. 1964. "An Analysis of Transformations," Journal of the Royal Statistical Society, Series B, 26, pp. 211-252.
Box, George E. P., Hunter, W.G. and Hunter, J.S. 2005. Statistics for Experimenters (2nd ed.). New York: Wiley.


Brown, Morton B. and Forsythe, A. 1974. "The Small Sample Behavior of Some Statistics Which Test the Equality of Several Means," Technometrics, pp. 129-132.
Cleveland, William S. 1994. The Elements of Graphing Data (2nd ed.). Chapman & Hall/CRC.
Cochran, William G. 1977. Sampling Techniques (3rd ed.). New York: Wiley.
Cohen, Jacob. 1988. Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Cohen, Jacob and Cohen, P., et al. 2002. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences (3rd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Daniel, Cuthbert and Wood, Fred S. 1999. Fitting Equations to Data (2nd ed.). New York: Wiley.
Daniel, Wayne W. 2000. Applied Nonparametric Statistics (2nd ed.). Boston: Duxbury Press.
Draper, Norman and Smith, Harry. 1998. Applied Regression Analysis (3rd ed.). New York: Wiley.
Few, Stephen. 2004. Show Me the Numbers: Designing Tables and Graphs to Enlighten. Analytics Press.
Fienberg, Stephen E. 1980. The Analysis of Cross-Classified Categorical Data (2nd ed.). Cambridge, MA: MIT Press.
Gibbons, Jean D. 2005. Nonparametric Measures of Association. Newbury Park, CA: Sage.
Hoaglin, David C., Mosteller, F. and Tukey, J.W. 1985. Exploring Data Tables, Trends and Shapes. New York: Wiley.
Hoaglin, David C., Mosteller, F. and Tukey, J.W. 1991. Fundamentals of Exploratory Analysis of Variance. New York: Wiley.
Hsu, Jason C. 1996. Multiple Comparisons: Theory and Methods. London: Chapman & Hall.
Huff, Darrell and Geis, Irving. 1993. How to Lie with Statistics (Reissue ed.). W.W. Norton & Company.
Kirk, Roger E. 1994. Experimental Design: Procedures for the Behavioral Sciences (3rd ed.). Belmont, CA: Brooks/Cole Publishing.
Kish, Leslie. 1965. Survey Sampling. New York: Wiley.
Klockars, Alan J. and Sax, G. 1986. Multiple Comparisons. Newbury Park, CA: Sage.
Kraemer, H.K. and Thiemann, S. 1987. How Many Subjects? Statistical Power Analysis in Research. Newbury Park, CA: Sage.


Milliken, George A. and Johnson, D.E. 2004. Analysis of Messy Data, Volume 1: Designed Experiments. Chapman & Hall/CRC.
Mosteller, Frederick and Tukey, John W. 1977. Data Analysis and Regression. Reading, MA: Addison-Wesley.
Salant, Priscilla and Dillman, Don A. 1994. How to Conduct Your Own Survey. New York: Wiley.
Searle, Shayle R. 2005. Linear Models for Unbalanced Data. New York: Wiley-Interscience.
Siegel, Sidney and Castellan, N.J. 1988. Nonparametric Statistics for the Behavioral Sciences (2nd ed.). New York: McGraw-Hill.
Sudman, Seymour. 1976. Applied Sampling. New York: Academic Press.
Toothaker, Larry E. 1991. Multiple Comparisons for Researchers. Newbury Park, CA: Sage.
Tufte, Edward R. 2001. The Visual Display of Quantitative Information (2nd ed.). Graphics Press.
Tukey, John W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley.
Tukey, John W. 1991. "The Philosophy of Multiple Comparisons," Statistical Science, vol. 6, 1, pp. 100-116.
Velleman, Paul F. and Wilkinson, L. 1993. "Nominal, Ordinal and Ratio Typologies are Misleading for Classifying Statistical Methodology," The American Statistician, vol. 47, pp. 65-72.
Wilcox, Rand R. 2004. Introduction to Robust Estimation and Hypothesis Testing (2nd ed.). New York: Academic Press.
Wilkinson, Leland. 2005. The Grammar of Graphics (2nd ed.). Springer.


Alternative Exercises for Chapters 8 & 9 and Appendix B

This appendix contains an alternative set of exercises for Chapter 8, Chapter 9, and Appendix B. The exercises in the main text are based on the Bank.sav data file, which is discussed in the respective chapters. These exercises are based on the GSS2004Intro.sav file, which is used in other chapter examples and exercises. We present this as an option for the instructor.


Summary Exercises For Chapter 8

Using the GSS2004Intro.sav file: Suppose you are interested in predicting the age when a respondent's first child was born. This might be of interest if you were looking at programs targeted toward teenage parents. The outcome variable is agekdbrn. Consider age, education (educ), spouse's education (speduc) (note: by using this variable you are limiting the analysis to those currently married), household size (hhsize), number of children (childs), and sex (a numeric version of Gender).

1. Create a numeric version of Gender using the "Recode into Different Variables" procedure. Name the new variable Sex and recode "M" to 0 and "F" to 1. Assign value labels to the new variable. (A syntax sketch follows this exercise list.)

2. Run frequencies on sex so you understand its distribution. Run descriptive statistics on the other variables for the same reason.

3. Now produce correlations with all these predictors and agekdbrn.

4. Then create scatterplots of the predictors and agekdbrn. Which variables have strong correlations with agekdbrn? Do you find any potential problems with using linear regression? Did you find any potential outliers?
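One possible syntax approach to the recode in exercise 1 is sketched below. It assumes the original string variable is named gender with values "M" and "F"; check the actual variable name and codes in your copy of the file before running it.

* Create a numeric version of Gender (0 = Male, 1 = Female) and label it.
RECODE gender ('M'=0) ('F'=1) INTO sex.
VARIABLE LABELS sex 'Respondent sex (numeric)'.
VALUE LABELS sex 0 'Male' 1 'Female'.
EXECUTE.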


Summary Exercises For Chapter 9

Using the GSS2004Intro.sav file:

1. Run a simple linear regression using education (educ) to predict age when first child was born (agekdbrn).

2. How well did you do in predicting age when first child was born from this one variable? (Hint: What is the R-square and how would you interpret it?)

3. Interpret the constant (intercept) and B values.

4. Use the predictive equation to predict the value of age when first child was born for a person with 12 years of education.

5. Run a simple linear regression using one of the other variables from the set you explored in the last chapter. (Hint: You might want to try the one with the next highest correlation coefficient.) Answer Questions 2-3 above.


Summary Exercises For Appendix B

Using the GSS2004Intro.sav file:

1. Use linear regression to develop a prediction equation for age when first child was born (agekdbrn), using as predictors: age, educ, speduc, hhsize, childs, and sex (the numeric version of gender that you need to create). Request a histogram of the errors, the scatterplot of residuals, and casewise diagnostics. What variables are significant predictors of age when first child was born? Which variable is the strongest predictor? The weakest? Do the results make sense? What is the prediction equation?

2. Are there any problems with the assumptions for linear regression, such as homogeneity of variance? Does a linear model fit the data?

3. Are there other variables that you believe might be good predictors? Consider recoding Race to two categories, combining "Black" and "Other", and use it as a predictor. Is it statistically significant?
