Introduction to Statistical Data Analysis

Statistics is the science (and art) of making inferences and decisions given uncertain information. Given a problem, how should we proceed?

Model Building

Where should we start?
1. What are the scientific questions or objectives?
2. What is known about the problem? Are there similar experiments/studies?
3. How will/were the data collected?

Study Design

- Probability models are based on some underlying chance mechanism.
- Probability models form the basis for both frequentist and Bayesian methods.
- The conclusions/generalizations that we can make using statistical methods depend heavily on how the data were obtained (more on drawing conclusions later).

Types of study:
- Randomized Experiments (gold standard)
- Observational Studies
- Hybrid Studies

Randomized Experiments

In a Randomized Experiment the investigator controls the assignment of experimental units to groups (treatments) using a chance mechanism.

Causal conclusions are possible because the chance mechanism ensures that treatment differences are not due to systematic differences in the features of the subjects.
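
The chance mechanism described above can be sketched in a few lines. This is only an illustration, not a procedure taken from the course: the function name `randomize`, the subject labels, and the two group names are all hypothetical, and the sketch assumes a completely randomized design with (near) equal group sizes.

```python
import random

def randomize(units, treatments, seed=None):
    """Assign units to treatments by chance, in (near) equal-sized groups."""
    rng = random.Random(seed)      # seeded so the assignment can be reproduced
    shuffled = list(units)
    rng.shuffle(shuffled)          # the chance mechanism: a random permutation
    groups = {t: [] for t in treatments}
    for i, unit in enumerate(shuffled):
        # Deal the shuffled units out to the treatments in turn.
        groups[treatments[i % len(treatments)]].append(unit)
    return groups

# Hypothetical experimental units.
subjects = [f"subject_{i:02d}" for i in range(1, 13)]
assignment = randomize(subjects, ["control", "treatment"], seed=1)

for group, members in assignment.items():
    print(group, sorted(members))
```

Because the investigator's judgment plays no part in who lands in which group, any feature of the subjects (observed or not) is balanced across groups on average.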

Observational Studies

In an Observational Study the investigator observes group characteristics and outcomes for subjects; the assignment of subjects to groups is beyond the control of the investigator. The data are often a sample of convenience.

Confounding variables (which may be unobserved) are variables that are related to both group membership and the outcome variables, and that make it impossible to draw cause-and-effect conclusions from the group comparison alone.

If confounding variables are observed, we may use statistical methods to try to account for them.
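
A small simulation may make the role of a confounder concrete. Everything in it is an assumption made for illustration: the binary confounder `older`, the exposure probabilities, and the outcome model are invented, and the exposure is deliberately given no effect at all on the outcome.

```python
import random
from statistics import mean

rng = random.Random(0)
records = []
for _ in range(20000):
    older = rng.random() < 0.5                     # hypothetical confounder
    # Older subjects are more likely to end up in the exposed group ...
    exposed = rng.random() < (0.8 if older else 0.2)
    # ... and have higher outcomes, but exposure itself has NO effect.
    outcome = (2.0 if older else 0.0) + rng.gauss(0, 1)
    records.append((older, exposed, outcome))

def diff_in_means(rows):
    """Mean outcome of exposed minus mean outcome of unexposed subjects."""
    exposed = [y for _, e, y in rows if e]
    unexposed = [y for _, e, y in rows if not e]
    return mean(exposed) - mean(unexposed)

# Naive comparison that ignores the confounder.
naive = diff_in_means(records)

# Comparison within each level of the (observed) confounder.
within = {z: diff_in_means([r for r in records if r[0] == z])
          for z in (True, False)}

print("naive difference:", round(naive, 2))
print("within-stratum differences:", {z: round(d, 2) for z, d in within.items()})
```

The naive difference is clearly positive even though exposure does nothing, while the differences within each stratum of the confounder are close to zero. Stratifying like this is only possible when the confounder has actually been observed.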

Observational Studies are Still Useful!

- Randomized experiments cannot always be conducted (for ethical reasons, or because they are physically impossible).
- Observational studies may provide evidence to suggest or support causal theories that can be tested in subsequent experiments or through other research.
- Causation is not always a goal of the analysis.

Know the Data!

- Are there missing data? Why? How are they coded?
- In surveys, was there non-response?
- Which variables are quantitative and which are qualitative?
- How are qualitative variables coded? Are they ordered?
- What are the units of measurement?
- Carry out “sanity checks” to look for data entry errors (do not underestimate the importance of this!)
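
The questions above translate naturally into automated checks. The following is a minimal sketch on made-up records: the field names, the missing-value code `-9`, the plausible ranges, and the allowed category levels are all assumptions chosen for the example, not part of any real data set.

```python
# Hypothetical records; -9 is assumed to be the code for a missing value.
MISSING = -9
records = [
    {"id": 1, "age": 34, "height_cm": 172.0, "smoker": "no"},
    {"id": 2, "age": -9, "height_cm": 165.5, "smoker": "yes"},
    {"id": 3, "age": 41, "height_cm": 1.80, "smoker": "no"},   # metres, not cm?
    {"id": 4, "age": 290, "height_cm": 158.0, "smoker": "Y"},  # typo; odd coding
]

# Assumed plausible ranges for quantitative variables (in their stated units).
PLAUSIBLE = {"age": (0, 120), "height_cm": (50, 250)}
# Assumed coding for qualitative variables.
ALLOWED = {"smoker": {"yes", "no"}}

def sanity_check(rows):
    """Return (id, field, problem) for every suspicious value."""
    problems = []
    for row in rows:
        for field, (low, high) in PLAUSIBLE.items():
            value = row[field]
            if value == MISSING:
                problems.append((row["id"], field, "missing"))
            elif not low <= value <= high:
                problems.append((row["id"], field, "out of range"))
        for field, levels in ALLOWED.items():
            if row[field] not in levels:
                problems.append((row["id"], field, "unknown level"))
    return problems

problems = sanity_check(records)
for problem in problems:
    print(problem)
```

Checking for the missing-value code before the range check matters: otherwise a coded missing value such as `-9` would be reported as an implausible age, or worse, silently averaged into a summary.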

Modern Statistical Data Analysis

Modern statistical analysis is based on a combination of techniques:
- Exploratory Data Analysis
- Formal Model Building
- Likelihood Based Methods
- Bayesian Approaches (these build on the likelihood based approach, but provide different interpretations)
- Model Checking/Validation

Exploratory Data Analysis

EDA is an approach to statistical analysis, heavily graphical in nature, that attempts to maximize insight into data. EDA may
- lead to rejection of current scientific beliefs about models
- uncover underlying structure and suggest how data should be modelled
- detect obvious errors in the data and check that assumptions underlying formal analyses are plausible
- provide new directions for scientific inquiries

Allow the data to speak for themselves, postponing assumptions and formal analyses.

EDA Uses

- check for “outliers” or data errors or surprises
- check that features of subjects are randomly distributed between treatment groups (in randomized experiments)
- look for clusters or unexpected patterns in the data
- check for unexpected time trends (with order of data collection)
- check for missing data and possible relationships to other variables
- check assumptions for formal statistical procedures that provide quantitative support of conclusions
- provide visual evidence to tell the data’s story
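
As one concrete instance of the first use, the sketch below flags potential outliers with the boxplot rule (values more than 1.5 interquartile ranges beyond the quartiles). The rule is a common EDA convention rather than something prescribed in these notes, and the measurements are invented, with one value standing in for a misplaced decimal point.

```python
from statistics import quantiles

def boxplot_outliers(values, k=1.5):
    """Return the values lying more than k IQRs outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)    # sample quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements; 51.0 was presumably meant to be 5.10.
measurements = [4.1, 4.3, 4.4, 4.6, 4.7, 4.9, 5.0, 5.2, 5.3, 51.0]
outliers = boxplot_outliers(measurements)
print(outliers)
```

A flagged value is a prompt to go back to the data source, not an automatic reason to delete the observation.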

Historical References

- Exploratory Data Analysis, Tukey (1977)
- Data Analysis and Regression, Mosteller & Tukey (1977)
- Interactive Data Analysis, Hoaglin (1977)
- The ABC’s of EDA, Velleman & Hoaglin (1981)
- The Visual Display of Quantitative Information, Tufte (1983)

Historical References

- Graphical Methods for Data Analysis, Chambers, Cleveland, Kleiner & Tukey (1983)
- Elements of Graphing Data, Cleveland (1985)
- Dynamic Graphics for Statistics, Cleveland & McGill (1988)
- Visualizing Data, Cleveland (1993)

EDA often relies on simple statistical and graphical procedures that are as assumption-free as possible. EDA is more than just statistical graphics!

Modern Statistical Software

Modern statistical software packages
- create static and dynamic graphics
- fit and critique complex models
- provide interactive environments for exploration of data and models
- are extensible high-level programming languages
- allow speed-ups by creating compiled code or rewriting slow parts in C/FORTRAN

These features are present in Matlab, S/S-Plus, R, and Xlispstat.

Modern Analysis

Combine the computer’s speed for model fitting and its graphical capabilities with human pattern recognition abilities to refine and iterate model building.

Statistics
- addresses uncertainties in data using probabilistic models
- quantifies uncertainty in stating conclusions

This is the key difference between Statistics and other quantitative fields.

Good statistical practice is often both a science and an art.
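
To illustrate what quantifying uncertainty can look like in practice, here is a minimal sketch that attaches an interval to a sample mean. It uses a nonparametric bootstrap percentile interval, which is a technique chosen for this example because it needs only the standard library; it is not the Bayesian machinery the course goes on to develop. The measurements and the function name `bootstrap_interval` are hypothetical.

```python
import random
from statistics import mean

def bootstrap_interval(sample, n_resamples=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(sample, k=n)) for _ in range(n_resamples))
    tail = (1 - level) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[int((1 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical measurements.
measurements = [9.8, 10.4, 10.1, 9.5, 10.9, 10.2,
                9.9, 10.6, 10.0, 10.3, 9.7, 10.5]
estimate = mean(measurements)
lower, upper = bootstrap_interval(measurements)
print(f"estimate {estimate:.2f}, 95% interval ({lower:.2f}, {upper:.2f})")
```

Reporting the interval alongside the point estimate states not just what was observed but how far the conclusion could plausibly move with a different sample.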

Tentative Outline of Course

- EDA (Notes, R Intro)
- Why Bayes? (Hoff Ch 1)
- Specifying Bayesian Models (Hoff Ch 2)
- One Parameter Models (Hoff Ch 3)
- Monte Carlo Approximations & Normal Models (Hoff Ch 4-5)
- Gibbs Sampling (Hoff Ch 6)
- Missing Data and Imputation (Hoff Ch 7)
- Group Comparisons & Hierarchical Models (Hoff Ch 8)
- Linear Regression (Hoff Ch 9)
- Generalized Linear Models (Hoff Ch 11)

For Next Class

- Assignment 0: Introduction to R (on course Calendar)
- Download/Install R to personal computer
- Update email at Blackboard
- Review handouts on emacs, ESS, and LaTeX on Computing Page
