Chapter 1

Introduction to SAS

The Statistical Analysis System [7], or SAS, is the package of record, as it is probably the most widely-used statistical program. Despite its inauspicious beginning (see [2]) in the early 1970's at North Carolina State University, SAS is now practically the benchmark for statistical packages. It is powerful, able to handle and manipulate very large data sets with relative ease, and versatile, containing routines with methodology only recently put forward in the research literature.

1.1 About SAS Programs and Files

SAS programs are organized into what might be thought of as paragraphs. Usually, the first “paragraph” enters the data. Subsequent “paragraphs” are often procedures for analyzing the data. The files containing SAS programs are typically named with .sas as the extension (e.g., program.sas). When SAS executes a program, there is an accompanying “log” file; when it is saved, its extension is usually .log. The results of the program are referred to as the “output”. Output files are usually given the extension .lst.

When these files are saved on a PC, double-clicking them will sometimes produce an error message along the lines of “There is no program associated with this extension” (this is most frequently the case when you e-mail one of these files to someone). However, all of these files (i.e., the log, the output, and the program) are simple text files. They can therefore be edited or viewed in any word processor (e.g., Word, WordPerfect, Notepad) using its “open” command (usually CTRL-O). In the more advanced word processors, a fixed-width font (e.g., Courier New or SAS Monospace) is preferable in order to preserve spacing and appearance. This is particularly the case with the log and output files.

SAS programmers, particularly those just learning the language, should look over their log files. This is where SAS talks directly to the programmer.


When a program does not run, this is where SAS gives the programmer some insight (sometimes, very little insight) as to why it did not run. So, even if it is a typo you notice after you run the program, look over the log to see how SAS notes it. That way, when you don't know why your program did not run, you will have some familiarity with the log file and how to tell what might have gone wrong.
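As a small illustration (this example is mine, not part of the original notes), the program below misspells the input keyword. SAS will not run the data step; instead, the log will contain an ERROR message pointing at the offending statement, along with NOTE lines describing what SAS did and did not create.

data first;     /* a deliberately broken data step */
inptu x @@;     /* typo: should be 'input'; the log flags this line with an ERROR */
cards;
17.2 4.7 15.9
run;

Running this and reading the log is a safe way to get acquainted with how SAS reports problems before a real one arises.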

1.2 Why SAS?

At the beginning, I noted that SAS is “the package of record”. SAS is the clichéd 800-pound gorilla. It has a well-earned reputation for incorporating practically every data-analytic methodology within a few years of its introduction to the statistical community (though the clock on the non-nested model selection methodology from my dissertation is still ticking). The legends told to me say that SAS was the first, and for decades the only, package deemed “acceptable” by the US Food and Drug Administration (FDA). If a pharmaceutical firm performed its data analyses using some other package or in-house programs, it had to convince the FDA of the worthiness and accuracy of the program before the FDA would even consider the results. So, the drug makers essentially had to use SAS. As long as SAS “kept up with the times”, the companies had no reason to look elsewhere. Obviously, then, statisticians who wanted to work in industry (and the pharmaceutical industry was and still is one of the main sources of demand for data analysts) had better opportunities for high-paying jobs if they could program in SAS. Since SAS has “kept up with the times” and so many statisticians knew how to program in SAS, it spread into other industries' research and development groups. Future data analysts who look at job listings will see that SAS programming is the skill employers most frequently ask of them.

1.3 Why Not SAS?

The terse answer to this is another question: Have you ever seen their manuals? This is a short way of saying that learning SAS is not an easy or quick proposition. The cliché above is particularly appropriate because, at some point, most people trying to learn SAS' language conclude that SAS has the accompanying personality of an 800-pound gorilla. Earlier, I said that you should look over the log files after running your SAS programs, even if nothing is apparently wrong or if you know what the problem is. Hopefully, you will see some of the cryptic error messages while they are laughable, rather than frustrating or annoying. (My personal favorite message in the log is, in its entirety, “Import unsuccessful. See Log for details.”)


To a beginner, the SAS manuals (or on-line help) read about as well as uncommented Shakespeare in Olde English. In fact, I am of the opinion that only when one can read the manual for a new (to him/her) “procedure” and understand it can one remove the “novice” or “student” adjective from the description of one's SAS skills. Perhaps a reason for the manuals' lack of readability (at least, according to the legends I've heard) is that the original manuals were written after the original programmers were no longer around. So, for the age-old parts of the program (namely, the classic materials, like regression, ANOVA, data manipulation, and so on), the manuals have more of a “this is how we get it to do this” tone, rather than explaining why it works the way it works. Readers, therefore, are left to work out the underlying logic on their own, which, in the end, makes the language harder to learn.

1.4 SAS and Me

My first exposure to SAS came in Dr. Rahman Azari's classes (“multivariate analysis” and “categorical data analysis”). In those classes, SAS was just a tool that would get us results. The teaching assistants would come in with dittos (since that was in the 1990's, I believe that the University of California at Davis Division of Statistics had the last working ditto machine in the US that was not in a museum) of SAS (or BMDP) programs and say “type this in to do the analyses for your homework”. There was never any explanation, and since I was a TA myself, I knew that there were only a couple of people (four, including Dr. Azari, at the most at any one time) who really understood SAS in the Division. There was no point in asking the actual TA, since he did not know SAS and, besides, we had other things to learn, like the underlying theory.

I got my Ph.D. and went off to teach at the University of Missouri-Rolla. There, my knack for stonewalling other departments' attempts to teach Statistics on their own in the Curriculum Committee kept me in trouble with various “big-wigs” on campus. One such department I stopped was the Management Science Department. Now, the only way to really stop any department from offering a new course is for the department that should teach the course to refuse to offer it. So, when I managed to prevent the Management Science Department from teaching a course entitled “Introductory Statistics for Management Science”, that meant that my department, the Mathematics and Statistics Department, had to teach it, as the Dean told my Chairman in no uncertain terms. Even though my bachelor's degree is in “Quantitative Economics” (which is now called “Management Science” at my alma mater, UC San Diego), I believe that the Chairwoman of the Management Science Department, whose husband is an economist and had worked on a couple of papers with me, really requested me to teach the course because she had a great sense of humor. Oh, and perhaps I should mention,


the Management Science Department wanted their students to be able to fit models in SAS because that skill was in demand for management scientists.

So, there I was, having to teach a course centered around a package for which my whole knowledge base consisted of maybe a dozen ditto sheets. I discovered all of the terrible things I hinted at in the “Why Not SAS?” section at this point in time and, even though I bought a couple of highly-recommended “how-to” books, I found myself broadcasting anti-social language across the third floor of the Rolla Building. Fortunately, my colleague Prof. Tom Kirchoff was a more-than-capable SAS programmer with an uncanny ability to discern expletive-laden screams as cries for help (and, since we shared a windowed vestibule behind our offices, to which we usually kept our doors open, he undoubtedly heard these tirades). It is from this experience (and, from discussions with others, I know my experience is not singular) that I conclude that SAS is typically an “inherited trait”. Namely, somebody has to teach SAS to you; it is rarely learned on one's own. Tom pulled me over a number of obstacles on my way not only to learning SAS, but to teaching it just days (and sometimes, hours) after learning it. I even won a teaching award for that class. At the end of that school year, Tom took a job with Capital One, and I also left Rolla, returning to my graduate school alma mater as a “Senior Statistician” (which, in English, means “Ph.D. data analyst”). So, both of our SAS skills have come in handy for us.

Chapter 2

Entering Data Into SAS

There are numerous ways to get data into SAS. In later chapters, I will give different methods, which will be more appropriate to those chapters' topics, but I shall begin with what SAS manuals call “the data step”. The data step is certainly the most versatile part of the SAS programming language. There is likely to be at least one data step in any SAS program, as this is where data manipulation and other database management techniques (such as merging or splitting) are often performed. For my first example of the data step, I simply consider entering data as part of the program. Suppose there are 8 observations of a variable called x: 17.2, 4.7, 15.9, 4.1, 11.0, 7.9, 5.1, −3.7. To get them into SAS in a data set called “first” (the name of the data set must be alphanumeric, with the exception of an underscore, i.e., _, and begin with a letter; to ensure that the name will be acceptable in all versions of SAS, it is best to limit the names to 8 characters), the data step might be:

data first;     /* first is the name of the data set */
input x @@;     /* @@ tells SAS to remain on the current line for more data
                   (otherwise, SAS looks to the next line for the next
                   data point) */
cards;          /* indicates data follows; sometimes datalines is used instead */
17.2 4.7 15.9 4.1 11.0 7.9 5.1 -3.7
run;            /* indicates the end of the data step */

Note that SAS uses /* and */ to indicate the beginning and end of an extended comment (i.e., a part of a program that SAS ignores), respectively. Such comments are used alongside the code throughout these notes to give some explanation of new commands.
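SAS also accepts a second, statement-style comment: an asterisk at the start of a statement turns everything up to the next semicolon into a comment. A minimal sketch (my own example, not from the original notes):

* this whole statement is a comment and is ignored by SAS;
data first;
input x @@;     /* block comments can also sit at the end of a line */
cards;
17.2 4.7
run;

The asterisk form cannot contain semicolons, so the /* ... */ form is the safer habit inside complicated programs.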

A simple check of whether SAS read the data in correctly might be done using the “print” procedure:

proc print data=first;   /* proc is a way of abbreviating procedure;
                            data= tells SAS to use the data set named 'first' */
run;                     /* indicates the end of the procedure */
quit;                    /* at the end of a program, it is best, though not
                            necessary, to use this command */

SAS has procedures to do simple tasks like getting the mean and standard deviation or performing one-sample t-tests, of course. One such procedure is the “univariate” procedure:

proc univariate data=first;   /* procedure which gives means, medians,
                                 percentiles, etc. */
var x;                        /* tells SAS to use only the variable x */
run;
quit;
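For the one-sample t-test mentioned above, proc univariate reports a test of the mean being zero, but the dedicated procedure is proc ttest. A brief sketch (mine, not from the original notes; the h0= option sets the null value of the mean):

proc ttest data=first h0=0;   /* test H0: the mean of x equals 0 */
var x;
run;
quit;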

Chapter 3

Balanced ANOVA

The easiest analysis of variance (ANOVA) model to deal with is the “balanced ANOVA”. Recall that a “balanced ANOVA” in the one-way setting means that there are the same number of observations for every level of the factor. Consider the following example (from Lecture 13):

              reps
         1    2    3    4
    1   12    7    9    6
A   2   15   10   12    9
    3   10    5    7    4

Here, there are 4 “repetitions” for each of the three levels of factor ‘A’. Reading data like this into SAS is typically best done using “do loops”.

data balance;
do A=1 to 3;        /* perform the commands between do and end below the
                       stated number of times */
do rep=1 to 4;
input y @@;
output;             /* save the data from the input; try this program
                       without the output command - sometimes it will not
                       retain all of the data */
end;
end;
datalines;          /* could have used cards; instead */
12 7 9 6
15 10 12 9
10 5 7 4
run;

proc print;
run;

The reason for the proc print above is to give you an idea of how the data should appear to SAS when preparing to run an ANOVA. In general, the


data should consist of a column for the outcome and columns with the level(s) of the factor(s).

In this lecture, I introduce the glm (or ‘general linear models’) procedure. I will use this procedure to obtain ANOVA models and perform parameter estimates and tests, but this is a mere fraction of the procedure's potential use. There is an anova procedure in SAS that can handle balanced models, but I will not use it. The reason is that it only handles balanced models and will give inaccurate results if the data are not balanced. That is because it uses shortcut formulae, which are faster to compute. Thirty years ago, this meant a difference of minutes in the time it takes to get the analysis. Today, the difference between proc anova and proc glm is measured in hundredths of a second. So, there is practically never a reason to use proc anova today.

proc glm data=balance;   /* the general linear models procedure */
class A;                 /* tells which variable denotes the different
                            populations or groups */
model y=A;               /* tells SAS that the outcome and population
                            variables are in this line. Notice that the
                            outcome variable is on the left-hand side of
                            the equation and the explanatory variable is
                            on the right-hand side. */
run;
quit;

In the output, SAS gives the ANOVA table and two parameter tables. Both of these tables appear to be redundant to the ANOVA table and each other in the one-way ANOVA situation, but, in multi-factor unbalanced ANOVA models, there will be a noticeable difference.

Chapter 4

More Data Issues

4.1 Preserving Data for Use in Future SAS Programs

There are numerous ways to “save data” for SAS. Saving data sets is desirable inasmuch as it allows the programmer to condense the data entry to just a few lines. In my experience, the “best” way of saving data sets is to use the comma separated values (or “CSV”) format. The reason for this is that CSV files can be moved to other computers easily (because a CSV file is a text file) and can be viewed in other programs, particularly spreadsheets like Microsoft Excel, easily.

data balance;
do A=1 to 3;
do rep=1 to 4;
input y @@;
output;
end;
end;
datalines;
12 7 9 6
15 10 12 9
10 5 7 4
run;

proc export data=balance
            datafile='c:/my documents/balance.csv'
            dbms=csv replace;
          /* export is a procedure which has SAS write data to a file on
             the hard drive; datafile names the file - be sure that the
             directory, in this case, 'my documents', exists; dbms is the
             format of the file; and replace tells SAS to replace the
             existing file, if one exists */
run;
quit;

Other dbms's that SAS will accept include (but are not limited to) txt (tab-delimited text file) and excel (Excel files, limited to SAS for Windows). Note that the whole proc export statement was spread over several lines. SAS uses semi-colons as its way of denoting the end of a single command. In other words, SAS ignores the fact that part of the command is given on one line and part is given on the next; it treats the statement as if it were entirely on one line. This also means that the indenting that I use is cosmetic and has no effect in SAS whatsoever.
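To make the point concrete, the two fragments below are identical to SAS; only the semicolon matters (a toy illustration of my own):

proc export data=balance datafile='c:/my documents/balance.csv' dbms=csv replace;

proc export
     data=balance
     datafile='c:/my documents/balance.csv'
     dbms=csv
     replace;

Each is one statement, ended by its single semicolon.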

4.2 Comments on Excel

4.2.1 Saving Data From Excel

SAS and Excel do not “get along” with one another. If Excel has a file opened in it, SAS will not be able to access that file. So, it is a good idea to close Excel entirely while a SAS program that uses external files is running. Even when Excel is closed, SAS and Excel can have disagreements. Both programs like to do some of your thinking for you at times and, besides the occasions where those packages do that thinking along some path undesirable to you, they sometimes do their thinking along different paths from each other. Take a data entry like “7/29”. Excel, left to its own “thinking”, will convert that to a date. SAS will, depending on how that entry is input, probably view it as either a mistake or a character value with no intrinsic meaning. It is particularly interesting when SAS is importing an Excel file (I'll get to proc import later) in which Excel has decided that entries such as this should be dates. Often, the entry will not be recognizable to the user in SAS.

Saving Excel spreadsheets as PRN (space-delimited) or TXT (tab-delimited) files has a peculiar set of risks, too. Take, for instance, the following example:

ab           EF
cdef         GHIJ
ghijkl       KLMNOP
mnopqrst     QRSTUVWX
uvwxyzABCD   YZ12345678

Type that “data” into Excel and then save it as PRN. Then, close Excel and reopen it. Open that PRN file and see if it is the same as the file you thought you saved!


While Excel can jumble CSV files, that format is the most likely to have a seamless transition to SAS and other programs. In any event, it is best to view any data set imported into SAS before trusting the results as being accurate.
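One quick way to do that inspection (my own suggestion, using standard procedures) is to pair proc contents, which lists each variable's name and type, with a short proc print:

proc import out=check
            datafile='c:/my documents/balance.csv'
            dbms=csv replace;
run;
proc contents data=check;        /* shows variable names, types, and lengths */
run;
proc print data=check (obs=10);  /* prints only the first 10 observations */
run;
quit;

A numeric column that arrives as character (or a date where none was intended) shows up immediately in the proc contents listing.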

4.2.2 Excel's Statistical Functions

Excel has a bad reputation in the “statistical community”. This reputation comes from Microsoft's use of shortcuts in formulae and unwillingness to change from these shortcuts (see, for example, [6]). At the Joint Statistical Meetings in 2002 in New York City, I attended a session that was a panel “discussion” on how to get technical books published. At the end of the session, a question came from the floor: “Suppose I am writing a book ‘Introductory Statistics Using Package X’ and that package is known to have some errors. Is it my responsibility to fully discuss those errors in the text?” The person who asked that question was in the back of the room and the panel could not hear it. So, after unsuccessfully repeating the question, someone nearer to the front of the room relayed it as he (and everyone else in the room who heard the question) understood it: “Suppose I am writing a book ‘Introductory Statistics Using Excel’....”

Excel can be trusted to correctly calculate means, but its shortcuts begin to make it vulnerable in calculating variances and standard deviations. Obviously, then, it could have problems with ANOVA or regression models. So, even if you give up on SAS, you probably should use some package other than Excel to perform statistical calculations. Packages which do such calculations accurately from a point-and-click set of menus are a dime a dozen.

4.3 Importing Files to SAS

Importing data files is done much like exporting data files, from the programmer's perspective.

proc import out=balance
            datafile='c:/my documents/balance.csv'
            dbms=csv replace;
          /* import is a procedure which has SAS read data from a file on
             the hard drive; out gives the name of the data set within the
             current SAS program */
run;
proc print;
run;
quit;


As I noted previously, it is always a good idea to look over imported data. The print procedure is a handy way of doing exactly that.

Chapter 5

Checking Assumptions

The two main assumptions that should be checked during the process of obtaining and evaluating a model are the assumptions of normality and constant variance in the errors (another important assumption is independence, but it should be accounted for in the model design process). Namely, suppose $y_{ij} = \mu + \theta_i + \epsilon_{ij}$ is the model being fit, where $\mu$ is the overall mean of the population, $\mu + \theta_i$ is the mean of the $i$th group (sometimes $\theta_i$ is referred to as the effect of the $i$th group), and $\epsilon_{ij}$ is the “random error”. The assumptions are that $\epsilon_{ij}$ has a normal distribution and that the variance of $\epsilon_{ij}$ is the same for every observation.

A practical mistake many researchers make is to check these assumptions on the raw outcome variable. In fact, these assumptions are supposed to be checked using the residuals because, if the $\theta_i$s are not all equal to zero, a histogram of the $y_{ij}$ values will have multiple distinct modes. Namely, if the null is rejected, it is almost always the case that the raw values of the outcome are not normally distributed. Residuals are defined as $e_{ij} = y_{ij} - \hat{y}_{ij}$, where $\hat{y}_{ij} = \hat{\mu} + \hat{\theta}_i$, with the latter two being the estimates of the population mean and group effects, respectively.

5.1 Checking Normality

Using the previous ANOVA example, I insert some code to instruct SAS to provide the residuals and then test them for normality.

proc import out=balance
            datafile='c:/my documents/balance.csv'
            dbms=csv replace;
run;
proc glm data=balance;
class A;
model y=A;
output out=resids r=res;   /* instructs SAS to create a new data set called
                              'resids', put the residuals from the model
                              into that data set, and call them 'res'; you
                              can view what else is in that data set by
                              using proc print immediately below the
                              following run statement */
run;
proc univariate normal plot;   /* tells SAS to run tests of normality and
                                  give a QQ-plot */
var res;
run;
quit;

In most cases, SAS will give 4 tests of normality, the first two being the Shapiro-Wilk and Kolmogorov-Smirnov tests. While these tests are accompanied by p-values (with ‘conforming to normality’ being the null), it is worth remembering that the Central Limit Theorem makes it reasonable to allow some non-normality to be exhibited in the residuals if the sample sizes (here, meaning the number of replications for each group) are “large”.

The QQ-plot (which is also referred to as the “normal probability plot” or “ZZ-plot”) has +s where the data should be, if they were normally distributed, and *s where the residuals actually fall. Some forms of non-normality require more care than others (e.g., “heavy tails” vs. “truncated tails”), and the QQ-plot (as well as the stem-and-leaf plot, which can be viewed as a sideways histogram) will indicate the shape of the distribution, if normality is rejected.

In the case of this example, the residuals are decidedly non-normal (the Shapiro-Wilk test has a p-value of 0.0237). The QQ-plot indicates a skewed distribution. So, a transformation is in order. I will try a natural log.

proc import out=first
            datafile='c:/my documents/balance.csv'
            dbms=csv replace;
run;
data balance;   /* take the data set "first" and copy its contents to
                   create a new data set called "balance" */
set first;
logy=log(y);    /* create a new variable "logy" by taking the natural log
                   of the existing variable y */


run;
proc glm data=balance;
class A;
model logy=A;
output out=resids r=res;
run;
proc univariate normal plot;
var res;
run;
quit;

Notably, the test is run on the residuals, but the transformation is applied to the original outcome variable (here, that is y). Also, note that the ANOVA results, especially including the p-values, change. At times, a transformation causes dramatic changes in the model! One of the changes, of course, was the change in the tests of normality on the residuals. Those now indicate that the residuals are consistent with the null hypothesis of normality.
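Beyond the QQ-plot and the stem-and-leaf plot, proc univariate can overlay a normal density on a histogram of the residuals, which gives another view of the shape of the distribution. A short sketch (my own addition, not from the original notes):

proc univariate data=resids noprint;
histogram res / normal;   /* histogram of the residuals with a fitted
                             normal curve overlaid */
run;
quit;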

5.2 Checking Constant Variance

In the one-way ANOVA setting, checking the assumption of constant variance is easier than checking the assumption of normality of the residuals. In the example below, I use the Levene test.

proc glm data=balance;
class A;
model logy=A;
means A / hovtest=levene;   /* performs the Levene test for the factor A.
                               The slash (/) in that statement indicates
                               that "options", meaning non-standard
                               commands, are requested in the subsequent
                               part of the program line */
output out=resids r=res;
run;
proc univariate normal plot;
var res;
run;
quit;

The hovtest=levene option above only works in one-way ANOVA models, though. A quick-and-dirty way of running the Levene test can be constructed using the residuals.

proc glm data=balance;
class A;
model logy=A;
output out=resids r=res;
run;
data levene;
set resids;
res2=res*res;   /* res2=res**2 has the same effect */
run;
proc glm;
class A;
model res2=A;
run;
quit;

Actually, this “quick-and-dirty” way of doing the test is essentially what SAS does: namely, it just checks the squared residuals (the residuals themselves necessarily having mean 0 for each group). The formal “modified Levene test” uses medians (instead of means) and is often given as using absolute values of the differences between the residuals and the median (rather than the squares); a sketch of that version appears at the end of this section. The reason for using medians and absolute values is to make the test less susceptible to violations of the assumption of normality. Since I have already checked the assumption of normality, the easier approach can be used.

Even though the test shows the model as being consistent with the assumption of constant variance, I shall use this data set in the example of how to do a weighted ANOVA, which is one of the “remedial measures” for the violation of that assumption. Recall that the weights should be one over the variance for that group.

proc import out=first
            datafile='c:/my documents/balance.csv'
            dbms=csv replace;
run;
data balance;
set first;
logy=log(y);
run;
proc sort;      /* this procedure sorts a data set the way the user wants */
by A;           /* in this case, by the group */
run;
proc means n mean var data=balance;   /* procedure gives a concise summary
                                         of variables. n requests the
                                         sample size, mean the mean, and
                                         var the variance */
by A;                      /* give statistics separately for each group */
var logy;
output out=vars var=var;   /* creates a variable called var to hold the
                              variance for each group */
run;
data weights;
merge balance vars;   /* combine the two data sets, giving every
                         observation in the first data set the same value
                         of the new variables when they match on the
                         levels of the by variable(s) */
by A;                 /* by using a proc print below, you will see how the
                         variances were added to the corresponding
                         observations */
wt=1/var;
run;
proc glm data=weights;   /* note: the weights live in the data set
                            'weights', so that is the data set to use */
class A;
weight wt;    /* run a weighted ANOVA using the weights given in the
                 variable wt */
model logy=A;
output out=resids r=res;
run;
quit;

Of note about proc means is that, without the n, mean, and var added to the end of the proc means statement, SAS will give the sample size, mean, standard deviation, minimum (min), and maximum (max). Other statistics proc means will provide upon request include, but are not limited to, the standard error (stderr), the standard deviation (std), the median (median or p50), the 5th and 95th percentiles (p5 and p95, respectively), and the t-statistic and p-value for the test of the mean being zero (t and probt, respectively). Also, if you are not interested in seeing the group-by-group statistics in the output, noprint can be substituted for n mean var in the program.

Checking for normality after running a weighted ANOVA model requires some thinking. There is a weight statement in proc univariate, but its use will prevent the use of the Shapiro-Wilk test. Without using the weight statement in proc univariate, though, the residuals from the highly variable groups will likely appear as outliers or heavy tails. I will typically look at the QQ-plots in that situation both with and without the weights and make a judgement from both. Since, as a rule, I will work to correct departures from normality before lack of constant variance, I usually find that the normality is “close enough”.
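For completeness, here is a sketch of the “modified Levene test” described above: absolute deviations of the residuals from the group medians, followed by an ANOVA on those deviations. The code is my own construction of that recipe, not part of the original notes.

proc sort data=resids;
by A;
run;
proc means data=resids noprint;
by A;
var res;
output out=meds median=med;   /* group medians of the residuals */
run;
data absdev;
merge resids meds;
by A;
d=abs(res-med);   /* absolute deviation from the group median */
run;
proc glm data=absdev;
class A;
model d=A;        /* a significant F suggests non-constant variance */
run;
quit;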


Chapter 6

Unbalanced ANOVA

6.1 Entering the Data

Suppose I want to read in the Weight Loss Data. Since this data set is unbalanced, the do loop approach does not lend itself here. One way of doing it is to have a separate data step for each of the groups.

data de;   /* de is the name of the data set */
input x @@;
treatment='Diet and Exercise';   /* sets the variable 'treatment' to have
                                    the same value for every point in this
                                    data set. In older versions of SAS,
                                    variables had to have names less than
                                    8 characters long. */
cards;
17.2 11.0 4.9 21.6 17.6
run;
data diet;
input x @@;
treatment='Diet';
cards;
4.7 7.9 0.5 -2.1 7.1 7.9 3.6
run;
data exercise;
input x @@;
treatment='Exercise';
cards;
15.9 5.1 6.7 19.1 9.2 7.7 14.7
run;
data control;
input x @@;
treatment='Control';
cards;
4.1 -3.7 -2.1 7.7 5.9 -4.1 8.2 2.0 -4.4 -2.1
run;
data weightloss;   /* in versions 6 and earlier, data sets could only have
                      8 characters in their names */
set de exercise diet control;   /* combines the data sets named into the
                                   current data set. NOTE THAT THE ORDER
                                   MAY BE IMPORTANT HERE! */

run;
proc print;   /* without the 'data=', SAS will use the last-used data set */
run;
quit;

Regarding the order used in the set command, SAS typically defines its variables based upon the first time it comes across the variable in a data set. At issue in this case is the variable treatment. If, for example, the set command were typed in as “set diet control de exercise;”, SAS would not be able to distinguish between the ‘Diet’ group and the ‘Diet and Exercise’ group, as it would only consider the first four characters in the variable treatment, since that was the length of the variable in the diet data set. It is a good idea to print out the data set before doing any analyses to ensure that the data set in SAS is exactly what it is supposed to be. A defensive fix for the truncation issue is sketched below.
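The truncation pitfall can be avoided outright with a length statement, which fixes the width of a character variable before it is first used. A hedged sketch (the length statement is standard SAS; the value 17 is simply the number of characters in 'Diet and Exercise'):

data diet;
length treatment $ 17;   /* reserve 17 characters, so a later set statement
                            can never truncate 'Diet and Exercise' */
input x @@;
treatment='Diet';
cards;
4.7 7.9 0.5 -2.1 7.1 7.9 3.6
run;

With the same length statement in each of the four data steps, the order in the set command no longer matters for the variable's width.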

Now, to obtain the means for each group, like in the table on page 2, I use proc means.

proc means n mean std data=weightloss;
class treatment;   /* give statistics separately for each treatment; class
                      does not require proc sort and gives the output more
                      concisely than by does */
var x;
run;
quit;

This information could have been obtained using proc univariate instead of proc means, but, since my interest is only in the 3 statistics, the output from the latter is easier to handle.


6.2 Running the ANOVA Model and Estimating Group Means

I note that, to use the notation of Lecture 6, SAS estimates, for example, $\theta_D$, the effect of diet only, as a parameter of interest. Since I view each of the population means as interesting parameters, I will use the noint option below.

proc glm data=weightloss;
class treatment;
model x=treatment/noint;   /* noint tells SAS to consider the four
                              population means separately */
estimate 'control group' treatment 1 0 0 0;
          /* tells SAS that I wish to estimate the mean of the control
             group; the term in single quotes is merely a title to help
             read the output; I am looking to estimate the mean of the
             first (in alphabetical order) treatment population only */
estimate 'exercise only vs. diet only' treatment 0 -1 0 1;
          /* I want to compare the difference between exercise only (the
             last treatment alphabetically) and diet only (the second
             treatment alphabetically) */
run;
quit;

There is a minor point to be made here. The differences between the p-values given by SAS and those calculated earlier in the Lecture are due to the rounding off done in the Lecture. In the SAS output, you will see that my $\hat{\sigma}^2$ is in the “Mean Square” column and the “Error” row of the table, just to the right of the SSE.

The noint “option” in the model statement above is one that should be used only with great caution. The model that accompanies it does not give the usual ANOVA tests and results. Using the notation of the “Balanced ANOVA” chapter, my one-way model is $y_{ij} = \mu + \theta_i + \epsilon_{ij}$. Where the usual model tests the null hypothesis $\theta_1 = \theta_2 = \cdots = \theta_I = 0$ (namely, that none of the groups has a different mean from the others), the null hypothesis tested with the noint option is $\mu = \theta_1 = \theta_2 = \cdots = \theta_I = 0$. This leads me to consider how SAS “parameterizes” its models.


Chapter 7

SAS' Parameterization

In the parameterization used in Prof. Fenech's notes, the one-way ANOVA model is defined as $y_{ij} = \mu + \theta_i + \epsilon_{ij}$. This definition gives rise to unique estimates of $\mu$ and $\theta_1, \ldots, \theta_I$ only if some restriction is placed upon the $\theta$s. In his notes, that restriction is $\sum_i \theta_i = 0$. That is a common restriction, and it has the desirable property that $\mu$ represents the mean of the entire population (i.e., composed of all of the groups).

However, that is not the parameterization, or perhaps more accurately, the restriction SAS uses to define its variables. Define SAS' parameterization of the one-way model as $y_{ij} = \alpha + \beta_i + \epsilon_{ij}$. SAS has the restriction that $\beta_I = 0$. Therefore, $\alpha$ is the mean of the $I$th group and $\beta_i$ is the difference between the means of the $i$th and $I$th groups. This parameterization is desirable when there is a “baseline” (often called the “control”) group to which all of the others are to be compared. This also is typically the parameterization used by regression model specialists (like me) who use “dummy variable coding”.

Just because SAS does not use the same parameterization as Prof. Fenech does not mean all is lost. Denote the mean of the $i$th group as $\mu_i$. Then, $\mu_i = \mu + \theta_i = \alpha + \beta_i$. If I am interested in comparing the means of the $i$th and $k$th groups, then $\mu_i - \mu_k = (\mu + \theta_i) - (\mu + \theta_k) = \theta_i - \theta_k$ in Prof. Fenech's parameterization and $\mu_i - \mu_k = (\alpha + \beta_i) - (\alpha + \beta_k) = \beta_i - \beta_k$ in SAS' parameterization. So, the relative differences are preserved!
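A small numeric illustration (my own, with made-up group means): suppose $I = 3$ and the group means are $\mu_1 = 2$, $\mu_2 = 5$, $\mu_3 = 11$.

\[
\begin{aligned}
\text{Fenech: } & \mu = \tfrac{2+5+11}{3} = 6, \quad (\theta_1, \theta_2, \theta_3) = (-4, -1, 5), \quad \textstyle\sum_i \theta_i = 0;\\
\text{SAS: } & \alpha = \mu_3 = 11, \quad (\beta_1, \beta_2, \beta_3) = (-9, -6, 0).
\end{aligned}
\]

Either way, $\mu_1 - \mu_2 = \theta_1 - \theta_2 = \beta_1 - \beta_2 = -3$, which is exactly the point about relative differences.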


It turns out that everything is relative in ANOVA models, anyway. Typically, researchers are interested in whether one group stands out from the others in the population and, if it does, by how much. That is, there is always some “baseline” by which the groups are compared, whether it is one group's mean (as is the case with the $\alpha$ parameterization of SAS) or the population mean (as is the case with the $\mu$ parameterization of Prof. Fenech). Regardless of which parameterization is used, though, the contrasts and estimates of groups' means will be the same, if done correctly. Let's try the weight loss example without the noint option and contrast it with the use of the noint option.

proc glm data=weightloss;
class treatment;
model x=treatment / solution;   /* solution tells SAS to give the
                                   estimates of α and the βs in the
                                   output */
estimate 'control group' intercept 1 treatment 1 0 0 0;
          /* tells SAS to include α in the estimate of the mean of the
             control group */
estimate 'exercise only vs. diet only' treatment 0 -1 0 1;
estimate 'mu-hat?' intercept 1 treatment 0.25 0.25 0.25 0.25;
estimate 'mu-hat?' intercept 29 treatment 10 7 5 7 / divisor=29;
          /* divide all of the weights in the statement by 29 */
run;
quit;

Notably, I get the same answers, including the t and p values of the tests, as with the noint option. As pointed out in the last chapter, the MSE is also equivalent in the two parameterizations. In the case of the contrast between “exercise only” and “diet only”, no changes were necessary, since the baseline group is not at issue (i.e., $\mu$ or $\alpha$ gets cancelled out).

If I wanted the “baseline” group to be the “control” group here, SAS could be tricked by using treatment='control'; instead of treatment='Control'; in the corresponding data step. Since SAS treats lower-case letters as coming alphabetically after capital letters, the control group would then be SAS' baseline group. However, the estimate statements above would need to be changed, as well, to reflect the new ordering! Regarding the last two estimate statements, under what assumptions is each one the appropriate way of estimating $\mu$ in Prof. Fenech's parameterization?

Chapter 8

Multiple ANOVA

8.1 Entering the Data

Earlier, the weight loss data set was treated as though it had one factor. The data set can be analyzed as a two-factor model, however, with “diet” and “exercise” being the factors, each with levels of “yes” and “no”. Even though the data set weightloss could easily be manipulated from its current form to accommodate the purpose here, I give the following example starting from scratch.

data dande;
input x diet $ exercise $;   /* the $ indicates that the variable will
                                have characters, rather than numbers,
                                in it */
cards;
17.2 yes yes
11.0 yes yes
4.9 yes yes
21.6 yes yes
17.6 yes yes
4.7 yes no
7.9 yes no
0.5 yes no
-2.1 yes no
7.1 yes no
7.9 yes no
3.6 yes no
15.9 no yes
5.1 no yes
6.7 no yes
19.1 no yes
9.2 no yes
7.7 no yes
14.7 no yes
4.1 no no
-3.7 no no
-2.1 no no
7.7 no no
5.9 no no
-4.1 no no
8.2 no no
2.0 no no
-4.4 no no
-2.1 no no
run;
proc print;   /* without the 'data=', SAS will use the last-used data set */

run;
quit;

Similar to the original weightloss data set, changing the order of the data entry (say, starting with the controls rather than the group that exercises and is on the diet) would change how SAS stores the variables exercise and diet. Unlike the other example, here even one character would be enough to distinguish between the levels of those variables.

8.2 The Model

Entering the commands to tell SAS to obtain a two-factor model is fairly straightforward.

proc glm data=dande;
title 'D & E model';   /* changes the top line in the output */
class diet exercise;
model x = diet exercise diet*exercise;   /* diet*exercise denotes the use
                                            of an interaction; note that
                                            model x = diet|exercise; can
                                            be used instead */
run;
quit;

Compare these results to the ones using the weightloss data set, using the commands below.

proc glm data=weightloss;
title 'Weight loss model';


class treatment;
model x = treatment;
run;
quit;

As should be the case, the ANOVA tables are equivalent. However, the rest of the output differs somewhat. This is because the two-factor model gives a breakdown of the factors. Still, there is a difference between the last two tables in the two-factor model:

Source      DF   Type I SS     Mean Square    F Value   Pr > F
diet         1   72.1876690     72.1876690       2.74   0.1106
exercise     1   721.1593294   721.1593294      27.34   <.0001

Source      DF   Type III SS   Mean Square    F Value   Pr > F
diet         1   68.5957108     68.5957108       2.60   0.1194
exercise     1   702.2815645   702.2815645      26.62   <.0001

[The interaction rows of these tables, the close of this chapter, and the opening of Chapter 9 were lost in extraction.]

Chapter 9

Means Comparisons in Multi-Way ANOVA Models

[...]

if length>0.69 then size='m';   /* define the variable size as 'm' when the
                                   quantitative length is more than 0.69 */
if length>0.89 then size='l';
run;
proc print;
run;
quit;

9.1 The Means Command

The model I will use here has lifespan as the outcome and size, type, number, and the interaction between the latter two as the explanatory variables.

proc glm data=fruitfly;
title 'Fruitfly Data with size type and number';
class size type number;
model lifespan = size type|number;
means size;   /* calculate the means and standard deviations of the
                 outcome for each level of the factor size */
run;
quit;

Before discussing the means output, it is worth noting that the degrees of freedom for the factor number differ between the “Type I” and “Type III” tables. In addition, the degrees of freedom in the latter table do not sum to the degrees of freedom for “model” in the ANOVA table. This is indicative of some confounding of one of the levels of number with a level of another factor (in this case, type).

The above use of the means command under proc glm is not particularly special, though. The same task can be accomplished by using proc means, as shown below.

proc means data=fruitfly mean std;
title 'Fruitfly Data under proc means';
class size;
var lifespan;
run;
quit;

Unlike proc means, however, the means command under proc glm has numerous options to facilitate multiple comparisons. The most frequently-used multiple comparison techniques are Bonferroni adjustments, Fisher's “least significant differences”, and Tukey's “honestly significant differences”. SAS performs these by using the corresponding options in the means statements below.

proc glm data=fruitfly;
title 'Multiple comparisons with the Fruitfly Data';
class size type number;
model lifespan = size type|number;
means size / bon;
means size / lsd;     /* can use t equivalently */
means size / tukey;
run;
quit;

Notably, the Bonferroni test SAS performs assumes that the experiment-wise error rate should be fixed. This would be because Fisher's approach gives


equivalent results to Bonferroni when the latter is used comparison-wise. This also explains why t is an alias for lsd and not for bon.

A frequent way of presenting means comparisons is by using a lettering scheme, wherein levels sharing a letter are not significantly different. This is accomplished by using the lines option, as shown below.

proc glm data=fruitfly;
title 'Multiple comparisons with the Fruitfly Data';
class size type number;
model lifespan = size type|number;
means size / bon lines alpha=0.008;   /* alpha= changes the value of α */
run;
quit;

In that example, sizes ‘l’ and ‘m’ are not significantly different from one another, but both have significantly longer lifespans than size ‘s’. This conclusion is at the experiment-wise error rate of 0.8%.

9.2 The LSMeans Command

When an experiment is not balanced, using the sample means for a factor can be misleading. This can be particularly true when the interaction is significant. Consider, for example, the artificial data set below (a sketch reproducing this example in SAS appears at the end of this section).

A\B     B1                     B2
A1      1 2 3                  101 102 103 104 105 106 107 108 109
A2      1 2 3 4 5 6 7 8 9      101 102 103

The sample mean for level 1 of factor A is $\bar{y}_{1\cdot} = 951/12 = 79.25$. The mean for level 2 of factor A is $\bar{y}_{2\cdot} = 351/12 = 29.25$. There is clearly an effect due to B, though. Consider $\bar{y}_{11} = 6/3 = 2$; $\bar{y}_{12} = 945/9 = 105$; $\bar{y}_{21} = 45/9 = 5$; and $\bar{y}_{22} = 306/3 = 102$. An estimate for $\mu_{1\cdot}$ is the average of the cell means. In this case, $\hat{\mu}_{1\cdot} = (\bar{y}_{11} + \bar{y}_{12})/2 = 107/2 = 53.5$. Similarly, $\hat{\mu}_{2\cdot} = 107/2 = 53.5$. The command lsmeans uses this approach. Let's try it with the fruitfly data.

proc glm data=fruitfly;
title 'lsmeans with the Fruitfly Data';
class size type number;
model lifespan = size type|number;
lsmeans size;
run;
quit;


The results of this have to be disheartening! Rather than a number, SAS claims that these means are not estimable. Why is there no $\hat{\mu}$? It is because the cells with number equal to 0 and type equal to either 0 or 1 are empty. Also, the cells with type equal to 9 and number equal to 1 or 8 are empty. In reading about this data set, Hanley, et al. [4] note that there are really five “groups”: {Type = 0, Number = 1}; {0, 8}; {1, 1}; {1, 8}; and {9, 0}. The confounding mentioned before is now obviously with level ‘9’ of type and level ‘0’ of number.

A single factor, group, can be created out of the two factors type and number. That would account for all of the groups. Since none of the empty cells is possible (and the interaction between number and type is significant), this is a reasonable way of accounting for the data set.

data combineff;
set fruitfly;
if type=0 then do;
   if number=1 then group='0 and 1';
   if number=8 then group='0 and 8';
end;
if type=1 then do;
   if number=1 then group='1 and 1';
   if number=8 then group='1 and 8';
end;
if type=9 then group='9 and 0';
run;
proc glm;
title 'lsmeans with the grouped Fruitfly Data';
class size group;
model lifespan = size group;
lsmeans size / adjust=bon pdiff;   /* adjust instructs SAS to use an
                                      adjustment; pdiff instructs SAS to
                                      give p-values for all of the
                                      comparisons */
means size / bon;
run;
quit;

I included the means command above to highlight that the estimated means from the two methods (i.e., means and lsmeans) are different. In this case, the difference is not overly substantial. Unlike the means command, lsmeans does not have an option to give a table with shared letters to indicate non-significance.

This approach relies upon the appropriateness of combining the two factors into one factor to eliminate the missing cells. If there are empty cells


in the study that could exist in the population, then the levels of the factors with the missing cells probably should not be compared using lsmeans.
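As promised above, here is a sketch that enters the artificial A-by-B data set and contrasts means with lsmeans (my own code; the data set name toy is arbitrary):

data toy;
input A B y @@;
cards;
1 1 1    1 1 2    1 1 3
1 2 101  1 2 102  1 2 103  1 2 104  1 2 105
1 2 106  1 2 107  1 2 108  1 2 109
2 1 1    2 1 2    2 1 3    2 1 4    2 1 5
2 1 6    2 1 7    2 1 8    2 1 9
2 2 101  2 2 102  2 2 103
run;
proc glm data=toy;
class A B;
model y = A B A*B;
means A;     /* reports the raw sample means, 79.25 and 29.25 */
lsmeans A;   /* reports the averages of the cell means, 53.5 and 53.5 */
run;
quit;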

9.3 Means or LSMeans: Which to Use?

At the end of the “SAS' Parameterization” chapter, I posed the question of which estimate is the best estimate of the overall mean and gave two choices: one wherein each level of the factor gets the same weight and one wherein each level of the factor gets weight proportional to its sample size. The analogous question is being asked here. In my opinion, the answer depends upon how the sample relates to the population. If the data set is from a random sample from the population and both the explanatory factor(s) and the outcome variable are merely recorded, it is reasonable to think that the cell sizes are estimates of the relative sizes of the cells in the population. In that scenario, it makes sense to use the sample size weights, or the means command. If, on the other hand, the data set was generated from an experiment in which the explanatory factors were designed (or ended up) with the imbalance for some practical reason and there is no reason to believe that the population has this imbalance, it is usually best to give each cell equal weight. In this case, then, it makes sense to use the lsmeans command in proc glm.

Finally, in comparing means and lsmeans, I note that it is NEVER acceptable to justify the use of the means command with “it always gives an answer”! It is better to have no answer than a wrong one.

9.4 An LSMeans-type Comparison

It is reasonable to ask how to get an lsmeans-type contrast between levels of a factor that was combined with another in order to be able to do lsmeans. A way of performing the contrast is given below.

proc glm data=combineff;
title 'lsmeans-type contrasts with the Fruitfly Data';
class size group;
model lifespan = size group;
estimate 'type 0 vs. type 1' group 1 1 -1 -1;
estimate 'type 0 vs. type 9' group 0.5 0.5 0 0 -1;
run;
quit;

This approach gives lsmeans-type results since, by default with the estimate command, SAS balances the levels of group with respect to the number of observations in the cells with the different levels of size. This can be seen by inserting a line with lsmeans group above and verifying by hand that the group estimates are exactly weighted combinations of the lsmeans estimates. Better yet, code that will verify it literally is below.

proc glm data=combineff;
title 'lsmeans-type contrasts with the Fruitfly Data';
class size group;
model lifespan = size group;
estimate 'size l vs. size m' size 1 -1 0;
estimate 'size l vs. size s' size 1 0 -1;
estimate 'size m vs. size s' size 0 1 -1;
lsmeans size / pdiff;
run;
quit;

Chapter 10

Random and Nested Effects

In this chapter, the “Eggs” data set from the Statlib (http://lib.stat.cmu.edu) “Datasets and Stories Library” [1] will be considered. The data set, itself, is located at http://lib.stat.cmu.edu/DASL/Datafiles/Eggs.html, while the underlying story and references are at http://lib.stat.cmu.edu/DASL/Stories/Eggs.html.

10.1 Nested Effects

In this example, the design is “hierarchical”. That is, there are different degrees of nesting. The factor tech is nested within lab, and sample is nested within tech. The following data step assumes that the data set is in the “c:/my documents/” directory, is called “eggs.txt”, and is tab-delimited.

data eggs;
infile 'c:/my documents/eggs.txt' dlm='09'x firstobs=2;
     /* dlm, which is short for delimiter, indicates the delimiter used
        (why '09'x is equivalent to TAB is beyond me); firstobs tells SAS
        to look in the listed row for the first data value */
input Fat_Content Lab $ Tech Sample $;
run;
proc glm;
title 'nesting factors with the Eggs Data';
class lab tech sample;
model fat_content = lab tech(lab) sample(tech*lab);
     /* tech(lab) indicates that tech is nested within lab */
run;
quit;


Note that the above and below uses of proc glm give equivalent results.

proc glm data=eggs;
title 'nesting factors with the Eggs Data';
class lab tech sample;
model fat_content = lab tech*lab sample*tech*lab;
run;
quit;

Also, it might be interesting to try these programs with the order of the factors in the model statement shuffled. For example, if the model statement is changed to read

model fat_content = sample(tech*lab) tech(lab) lab;

then the “Type I” table will not only give very different sums of squares, but the degrees of freedom for the latter two factors will be 0! This is because of confounding, of course. In this case, however, the confounding is known in advance, as it is part of the nesting process. As always, the “Type III” table should give the same results regardless of the order in the model statement. Here, unlike the fruitfly example, the degrees of freedom in the “Type III” table do sum to the degrees of freedom for “model” in the ANOVA table.

10.2 Random Effects

As is frequently the case, the nested effects in the Eggs data set could be considered as random effects. The technicians were selected from a larger pool of technicians within each lab, and the samples were selected from the larger pool of samples for each technician. So, that can be indicated to SAS using the command structure below.

proc glm data=eggs;
title 'random effect nesting with the Eggs Data';
class lab tech sample;
model fat_content = lab tech(lab) sample(tech*lab);
random sample(tech*lab) tech(lab) / test;
     /* test indicates that SAS should run appropriate tests of the
        effects, given the random effects listed */
run;
quit;

Notably, when the mixed effect model is used, lab goes from being highly significant to relatively insignificant (p=0.1895). This would appear to confirm the speculation in the “story” that one technician gave particularly poor measurements.
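As an aside of mine (not in the original notes), the same mixed model can be fit with proc mixed, which estimates the variance components directly by REML rather than constructing F-tests from expected mean squares:

proc mixed data=eggs;
class lab tech sample;
model fat_content = lab;             /* lab remains the fixed effect */
random tech(lab) sample(tech*lab);   /* random nested effects */
run;

The fixed-effect test for lab and the variance component estimates make for an instructive comparison with the proc glm output above.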

Chapter 11

Analysis of Covariance

Analysis of covariance (ANCOVA) is a combination of regression and ANOVA techniques. Namely, ANOVA is a type of model to use when there is a quantitative outcome and categorical explanatory variables (usually called “factors”). Regression is a type of model to use when there is a quantitative outcome and quantitative explanatory variables (often called “predictors”). ANCOVA is a type of model to use when there is a quantitative outcome and both quantitative and categorical explanatory variables (in this context, the quantitative explanatory variables are frequently referred to as the “covariates”). In this chapter, I will again use the fruitfly data.

11.1 Fruitfly Example

In the description of the data set [4], the length of the thorax was measured. In previous uses of this data set, I made this categorical by dividing the lengths up into three groups. Furthermore, I treated the number of companions as categorical, as there were only 3 unique values. Here, though, I shall treat both number and length as quantitative.

proc glm data=fruitfly;
title 'ANCOVA with the Fruitfly Data';
class type;   /* note that number is no longer in this list! The class
                 statement is only for factors */
model lifespan = length type|number / solution;
run;
quit;

The output for this type of model has a slightly different interpretation than did the ANOVA model results.


Parameter         t Value
Intercept          -4.71
length             10.92
type 0              0.77
type 1             -1.61
type 9               .
number             -4.30
number*type 0       3.38
number*type 1        .
number*type 9        .

[The Estimate, Standard Error, and Pr > |t| columns of this solution table, along with the discussion that followed (“So, for type ‘0’, the ...”), were lost in extraction.]
