SAS Users Guide to accompany Statistics: Unlocking the Power of Data by Lock, Lock, Lock, Lock, and Lock


Getting Started

Statistical Analysis System (SAS) is text-based statistical software, as opposed to point-and-click software. Throughout this users guide, written code is provided for most applications, along with directions to more information (e.g. links to documents at the SAS support website) and options for the procedures. Text such as DataName, VarName, Yvar, and Xvar marks locations that must be replaced with specific data names or variable names, and is written in italics.

To enter data: Remember that each column is a different variable and the rows are the cases.

1. If your data already exist in some format, such as Excel, you can import them into SAS by selecting File → Import. This opens the import wizard. The following steps outline using the import wizard to enter data:
   a. Select a data source from the drop-down list and click Next. Example: For Excel, select Microsoft Excel Workbook.
   b. Browse for the location of your file; once it is selected, click OK.
   c. Select the appropriate worksheet and click Next.
   d. Name the dataset in the Member: box and click Finish.
2. If you are typing the data in yourself, use a data statement (see The Data Statement Guide). Variable names in the input statement are separated by spaces, each case goes on its own line after cards, and a semicolon on its own line ends the data. Example:

    data DataName;
        input VarName VarName2;
        cards;
    1 2
    3 4
    5 6
    ;
    run;

Warning: SAS has quantitative variables, which can contain only numbers, and categorical variables, which can contain anything. If a column being read in contains anything other than a number, it is considered categorical. This includes things like dollar signs, units, etc. To enter a categorical variable in the data statement, place a $ after the variable name (e.g. VarName $). If you enter something other than a number in a quantitative variable column by mistake, SAS will give you the error: Invalid data for VarName in line #
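As an illustration of the $ marker described in the warning, the sketch below reads one categorical and one quantitative variable. The dataset name Pets, the variable names, and the values are hypothetical and chosen only for this example.

```sas
/* Hypothetical example: Name is categorical ($), Weight is quantitative */
data Pets;
    input Name $ Weight;
    datalines;
Rex 30
Tom 12
Bella 22
;
run;

/* Print the dataset to check that both columns were read as intended */
proc print data = Pets;
run;
```

Typing a non-numeric value such as 30lb in the Weight column would trigger the "Invalid data" message, because Weight was not declared with a $.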


Using SAS in Chapter 2

Note: For most tasks in SAS there are multiple approaches; we present at least one option for each. This manual attempts to present the easiest approach, which may not always be the best.

Categorical Variables

Tables for categorical variables: For most tasks involving categorical variables we use the frequency procedure.

Creating a frequency table for one categorical variable:

    proc freq data = DataName;
        table VarName;
    run;

This provides both the count and the percent for each category.

For a relationship between categorical variables:

    proc freq data = DataName;
        table VarName*VarName2;
    run;

This provides the count, percent, row percent, and column percent for each combination of categories. For more information about the frequency procedure: The Frequency Procedure Guide

Graphs for categorical variables: We use the gchart procedure for graphical presentations.

For a bar chart:

    proc gchart data = DataName;
        vbar VarName;
    run;

For a pie chart:

    proc gchart data = DataName;
        pie VarName;
    run;

For a relationship between categorical variables, side-by-side bar charts:

    proc gchart data = DataName;
        vbar VarName / group = VarName2;
    run;

We use the gchart procedure often throughout Chapter 2. For more information: The gchart Procedure Guide


One Quantitative Variable

Statistics for a single quantitative variable: For statistics and graphs involving a single quantitative variable we use the univariate procedure:

    proc univariate data = DataName;
        var VarName;
    run;

Graphs for a single quantitative variable: We again use the univariate procedure to produce a histogram:

    proc univariate data = DataName;
        histogram VarName;
    run;

Boxplots can also be created using the univariate procedure, by adding the plot option:

    proc univariate data = DataName plot;
        var VarName;
    run;

For more information: The Univariate Procedure Guide

One Quantitative Variable by Groups in One Categorical Variable

Statistics for a quantitative variable by groups in a categorical variable: We again use the univariate procedure here. Note that the by statement requires the data to be sorted by CatVarName beforehand (for example with the sort procedure); otherwise SAS reports an error.

    proc univariate data = DataName;
        by CatVarName;
        var QuantVarName;
    run;

Graphs for a quantitative variable by categories in a categorical variable: Use the gchart procedure to produce side-by-side histograms:

    proc gchart data = DataName;
        vbar QuantVarName / group = CatVarName;
    run;
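Because by-group processing expects sorted data, a sorting step usually comes first. The sketch below shows one way to do this; the dataset Scores and its variables Section and Score are hypothetical and exist only for this example.

```sas
/* Hypothetical example data: a grouping variable and a measurement,
   deliberately entered out of order */
data Scores;
    input Section $ Score;
    datalines;
B 71
A 85
B 64
A 90
A 78
B 82
;
run;

/* Sort by the categorical variable so that the by statement can be used */
proc sort data = Scores;
    by Section;
run;

/* Summary statistics for Score, computed separately for each Section */
proc univariate data = Scores;
    by Section;
    var Score;
run;
```

An alternative that avoids sorting is to replace the by statement with a class statement (class Section;), which the univariate procedure also accepts.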


Two Quantitative Variables

Statistics for two quantitative variables:

Correlation: Use the correlation procedure:

    proc corr data = DataName;
        var VarName VarName2;
    run;

This provides some summary statistics for each variable (mean, standard deviation, etc.) as well as the correlation between the two variables, titled “Pearson Correlation Coefficients.”

Linear Regression: Use the regression procedure:

    proc reg data = DataName;
        model Yvar = Xvar;
    run;

The two parameter estimates, y-intercept and slope, are provided in the “Parameter Estimates” section of the output. For more information on either of these procedures: The Corr Procedure Guide, The Reg Procedure Guide

Graphs for two quantitative variables: We can produce a scatterplot by using the gplot procedure:

    proc gplot data = DataName;
        plot Yvar*Xvar;
    run;


Using SAS for Chapters 3 and 4

The current version of SAS (9.2) has no easy procedures for creating bootstrap or randomization distributions. We believe the StatKey tools at lock5stat.com/statkey provide better options for these procedures. However, for those instructors wishing to do so, we include sample code to create bootstrap distributions and perform randomization tests in SAS.

Chapter 3: Bootstrapping

Creating a set of bootstrap samples: The code below is one way to generate a set of bootstrap samples (currently set up to generate 1,000). Each case is drawn at random, with replacement, from the rows of DataName; ceil is used so that the row number x always falls between 1 and nobs:

    data bootsamp;
        do sampnum = 1 to 1000;    /* 1,000 replicates */
            do i = 1 to nobs;
                x = ceil(ranuni(0) * nobs);
                set DataName nobs = nobs point = x;
                output;
            end;
        end;
        stop;
    run;

Finding a statistic for each bootstrap sample: The following code takes a set of bootstrap samples and finds the mean of each, saved in the data table “bootmeans”:

    proc univariate data = bootsamp noprint;
        var VarName;
        by sampnum;
        output out = bootmeans mean = means;
    run;

To find a different statistic, use whatever procedure is appropriate for the statistic of interest (see Chapter 2). For example, for a correlation we would use the corr procedure on the bootstrap samples, but much of the code would look exactly the same.

Finding the confidence interval: Once we have our set of means, we can use the univariate procedure again to check that the distribution is symmetric and bell-shaped (histogram), find the standard error for a confidence interval (the standard deviation of the means), or find the percentiles for a confidence interval (percentiles are provided in the output):

    proc univariate data = bootmeans;
        var means;
        histogram means;
    run;
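The default percentile table does not include the 2.5th and 97.5th percentiles, so a 95% percentile interval has to be requested explicitly. The sketch below is one possible way to do that, using the pctlpts= and pctlpre= options of the output statement. The dataset Sample, the variable Height, the data values, and the output names bootci, se, and pct are all hypothetical and chosen for this example.

```sas
/* Hypothetical data set with one quantitative variable */
data Sample;
    input Height @@;
    datalines;
61 64 66 67 68 70 71 72 74 75
;
run;

/* 1,000 bootstrap samples; ceil() keeps the row number between 1 and nobs */
data bootsamp;
    do sampnum = 1 to 1000;
        do i = 1 to nobs;
            x = ceil(ranuni(0) * nobs);
            set Sample nobs = nobs point = x;
            output;
        end;
    end;
    stop;
run;

/* Mean of each bootstrap sample */
proc univariate data = bootsamp noprint;
    var Height;
    by sampnum;
    output out = bootmeans mean = means;
run;

/* Standard error (se) and the 2.5th and 97.5th percentiles (pct2_5, pct97_5) */
proc univariate data = bootmeans noprint;
    var means;
    output out = bootci std = se pctlpts = 2.5 97.5 pctlpre = pct;
run;

proc print data = bootci;
run;
```

Because ranuni(0) seeds from the system clock, the printed values change from run to run; replacing 0 with a positive integer seed makes the result reproducible.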


Chapter 4: Randomization Tests

Difference in Means: We can use the npar1way procedure to perform a randomization test for a difference in means:

    proc npar1way scores = data data = DataName;
        class CatVar;
        var QuantVar;
        exact scores = data / N = 1000;
    run;

Anything else: To do a randomization test for anything other than a difference in means, we need to use sampling methods. Below is one example, using sampling methods to test whether p = 0.50 with a sample size of n = 100.

Create a set of 1,000 proportions assuming the null hypothesis is true:

    data NullSamp;
        do samp = 1 to 1000;
            p = (rand('Binomial', 0.5, 100)) / 100;
            output;
        end;
    run;

Plot a histogram of these proportions to make sure the distribution is symmetric and bell-shaped:

    proc univariate data = NullSamp;
        histogram p;
    run;

Find how many of your proportions are beyond your sample p (for the example we’ll assume the sample proportion = 0.55):

    data more;
        set NullSamp;
        if p >= 0.55;
    run;

The p-value is then calculated as the number beyond (the number of rows in the dataset more above) divided by the number created (in this case 1,000). Note: which values count as "beyond" is determined by your alternative hypothesis of interest.

Randomization tests using sampling methods for other parameters (one mean, correlation, etc.) can be conducted with sampling procedures similar to the bootstrapping methods in the Chapter 3 section.
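Rather than counting rows by hand, the p-value can be computed directly. The sketch below is one way to do so for the upper-tail example above (sample proportion 0.55): it stores a 0/1 indicator for each simulated proportion and takes its mean. The variable name beyond and the seed value 12345 are hypothetical choices made for this example.

```sas
data NullSamp;
    call streaminit(12345);          /* fixed seed so the run is reproducible */
    do samp = 1 to 1000;
        p = rand('Binomial', 0.5, 100) / 100;
        beyond = (p >= 0.55);        /* 1 if at least as extreme as the sample, else 0 */
        output;
    end;
run;

/* The mean of the 0/1 indicator is the proportion of simulated
   samples beyond 0.55, i.e. the one-sided p-value */
proc means data = NullSamp mean;
    var beyond;
run;
```

For a lower-tail or two-tailed alternative, the condition defining beyond would need to change accordingly.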


Using SAS for Theoretical Distributions in Chapters 5 – 10

Finding Values for Theoretical Distributions

To find a probability from a theoretical distribution, we use the prob functions, which give the probability of being less than the value supplied.

Normal Distribution: We use probnorm(z-value), where z-value denotes the specific z-value of interest.

Example: Find the probability that a standard normal (Z) is less than 1.4, less than -2, and between -2 and 1.4.

    data DataName;
        pn1 = probnorm(1.4);
        pn2 = probnorm(-2);
        pn3 = probnorm(1.4) - probnorm(-2);
    run;
    proc print;
    run;

t Distribution: We use probt(t-value, df), where t-value denotes the specific t-value of interest and df denotes the degrees of freedom.

Example: Find the probability that a t with 10 df is less than 1.4, a t with 9 df is more than 2.3, and a t with 24 df is between -2 and 1.4.

    data DataName;
        pt1 = probt(1.4, 10);
        pt2 = 1 - probt(2.3, 9);
        pt3 = probt(1.4, 24) - probt(-2, 24);
    run;
    proc print;
    run;
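Percentiles work in the opposite direction: a probability goes in and the cutoff value comes out. The sketch below uses the SAS functions probit (standard normal) and tinv (t distribution) to find the 97.5th percentiles commonly needed for 95% confidence intervals. The dataset and variable names are hypothetical.

```sas
data Percentiles;
    z975 = probit(0.975);       /* standard normal 97.5th percentile, about 1.96 */
    t975 = tinv(0.975, 10);     /* t with 10 df, 97.5th percentile, about 2.228 */
run;

proc print data = Percentiles;
run;
```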


Using SAS in Chapter 6

Inference for Means: t-Intervals and t-Tests

For both hypothesis tests and confidence intervals involving means we use the ttest procedure.

Confidence Intervals:

A confidence interval for one mean with confidence 1 - alpha:

    proc ttest data = DataName alpha = 0.05;
        var QuantVar;
    run;

A confidence interval for a difference in two means with confidence 1 - alpha:

    proc ttest data = DataName alpha = 0.05;
        var QuantVar;
        class CatVar;
    run;

This gives the confidence interval for each sample individually as well as for the difference.

Hypothesis Tests:

A hypothesis test for one mean with a specific null value (Example: H0: mu = 3):

    proc ttest data = DataName H0 = 3;
        var QuantVar;
    run;

A hypothesis test for a difference in two means:

    proc ttest data = DataName H0 = 0;
        var QuantVar;
        class CatVar;
    run;

The output for the two-sample t-test has a “Pooled” row (which assumes equal variances) and a “Satterthwaite” row (which does not). The Satterthwaite t-statistic should match the one from the text, but with different degrees of freedom.

For more information on the ttest procedure: The ttest Procedure Guide


Inference for Proportions: Intervals and Tests

Hypothesis Test and Confidence Interval for One Proportion: The code for a hypothesis test and confidence interval for a single proportion using the freq procedure:

    proc freq data = DataName;
        table VarName / binomial(Wald level = 2 p = 0.50) alpha = 0.05;
    run;

The level = … option determines which category of the categorical variable you wish to run inference on (1 or 2), and p = … is the p0 value under the null hypothesis.

Hypothesis Test and Confidence Interval for a Difference in Proportions: We use the genmod procedure for a hypothesis test and confidence interval for a difference in proportions (this assumes Yvar is coded numerically as 0 and 1):

    proc genmod data = DataName;
        class CatVar;
        model Yvar = CatVar;
    run;

This produces a large amount of output; the important section is the “Analysis of Maximum Likelihood Parameter Estimates”. The second row of this section provides the estimated difference in proportions, a standard error, 95% confidence limits, and a p-value.

For more information on the genmod procedure: The genmod Procedure Guide
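As a cross-check on the genmod output, the freq procedure can also report a difference in proportions for a 2x2 table through its riskdiff option. This is a different technique from the genmod approach described above, shown only as a sketch; the dataset Trial, its variables, and the counts are hypothetical.

```sas
/* Hypothetical 2x2 table entered as counts rather than one row per case */
data Trial;
    input Group $ Outcome $ Count;
    datalines;
A Yes 30
A No 20
B Yes 18
B No 32
;
run;

/* weight tells proc freq that each row stands for Count cases;
   riskdiff reports the difference in row proportions with confidence limits */
proc freq data = Trial;
    weight Count;
    tables Group*Outcome / riskdiff;
run;
```

With one row per case instead of counts, the weight statement would simply be omitted.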


Using SAS in Chapter 7

Chi-Square Tests

Chi-Square Goodness-of-Fit Test for a Single Categorical Variable

We again use the freq procedure to perform a chi-square test:

    proc freq data = DataName;
        tables VarName / chisq testp = (0.5 0.3 0.2);
    run;

The values in parentheses following testp= are the expected proportions under the null hypothesis. Removing the testp= option performs a chi-square test for equally likely categories.

Chi-Square Test for Association for Two Categorical Variables

To use the freq procedure for a chi-square test for association between two categorical variables:

    proc freq data = DataName;
        tables VarName*VarName2 / chisq;
    run;

Alternatively, one could calculate the chi-square test statistic by hand for either test and compare the value to the chi-square distribution with the appropriate degrees of freedom. In the CDF function the value comes first and the degrees of freedom second:

    data ChiSquare;
        px = 1 - CDF('Chisquare', X-value, df);
    run;
    proc print;
    run;

This gives the probability of being greater than a chi-square value (X-value) for a given degrees of freedom (df); replace both placeholders with numbers.

For more information on conducting a chi-square test in the freq procedure: The freq Procedure: Chi-square Tests and Statistics Guide
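To make the by-hand calculation concrete, the sketch below fills in the placeholders with a hypothetical test statistic of 5.991 on 2 degrees of freedom. The variable names stat, df, and px are chosen only for this example.

```sas
data ChiSquare;
    stat = 5.991;                           /* hypothetical chi-square statistic */
    df = 2;                                 /* degrees of freedom */
    px = 1 - cdf('Chisquare', stat, df);    /* upper-tail probability, about 0.05 */
run;

proc print data = ChiSquare;
run;
```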


Using SAS in Chapter 8

Analysis of Variance for Difference in Means

An ANOVA to compare means is performed using the GLM procedure:

    proc glm data = DataName;
        class FactorVar;
        model Yvar = FactorVar;
        lsmeans FactorVar / stderr CL pdiff;
    run;

The first page of output gives the analysis of variance table with degrees of freedom, sums of squares, mean squares, F statistic, and p-value. The second page gives summary statistics for the specific groups, 95% confidence limits for each mean and each difference in means, and a matrix of p-values for pairwise comparisons of means.

For more information on the GLM procedure: The GLM Procedure Guide


Using SAS in Chapters 9 and 10

Correlation

The code for correlation here is the same as the code in Chapter 2:

    proc corr data = DataName;
        var VarName VarName2;
    run;

This provides summary statistics for each variable, the correlation, and a p-value for testing the correlation.

Linear Regression

The code for a simple or multiple linear regression is also the same as in Chapter 2. More variables can be added to the model statement for a multiple regression:

    proc reg data = DataName;
        model Yvar = Xvar1 Xvar2;
    run;

Within the output you will see:
• The ANOVA for regression table, including the p-value for the ANOVA test
• The estimate, standard error, test statistic (t Value), and p-value (Pr > |t|) for the slope(s) and intercept
• The value of R-squared
• Miscellaneous other pieces of information

See Chapters 9 and 10 for an explanation of all output.