Statistical Genomics and Bioinformatics Workshop 8/16/2013
Statistical Genomics and Bioinformatics Workshop: Genetic Association and RNA-Seq Studies
Overview of Genetics, Data Resources, Terminology and Statistics
Brooke L. Fridley, PhD
University of Kansas Medical Center
Schedule for the Day
• 9:45 – 10 am: Morning Break
• 11:30 – 12:30 pm: Lunch Break
• 2:30 – 2:45 pm: Afternoon Break
Schedule of Topics:
• Overview of Genetics and Genomics
  – Genetics
  – Technologies for genotyping
  – Databases and publicly available resources
• Review of Statistical Aspects
  – Study Design
  – Power/Sample Size
  – Hypothesis Testing
Schedule for the Day (cont.)
• Population Genetics (LD)
• Genetic Association Studies
  – Study Design
  – Quality Control
  – Genetic Models and Association Methods
  – Haplotypes
  – Power / Sample Size
  – Population Stratification
  – Genotype Imputation
  – Multiple locus methods
• Example: GWAS for Hormone Levels
Schedule for the Day (cont.)
• Multiple Testing
  – FWER
  – FDR
  – Permutation based p-values
• Example: Acetaminophen toxicity GWAS
• Limitations and Common Errors with GWAS
• RNA-Seq
  – Goals and review of types of RNAs
  – NGS and Experimental Design
  – Bioinformatics and processing RNA-Seq data
  – Quality Control
  – Differential Expression Testing Methods
• Clustering
  – Goals
  – Methods
  – Validation
REVIEW OF GENETICS
Individualized Medicine
Anticipated benefits of Individualized Medicine
• More powerful medicines
• Better, safer drugs the first time
• More accurate methods of determining appropriate drug dosages
• Advanced screening for diseases
• Better vaccines
• Improvements in the drug discovery and approval process
• Decrease in the overall cost of health care
From Human Genome Project Website: http://www.ornl.gov/sci/techresources/Human_Genome/medicine/pharma.shtml#whatis
Integrative ‘Omics
• Genome: DNA
• Epigenome: Regulatory Elements
• Transcriptome: RNA
• Proteome: Proteins
• Metabolome: Metabolites (e.g. Lipids)
• Phenome: Phenotype & Function
DNA‐mRNA‐Protein
Humans have 46 chromosomes
22 pairs of autosomes, 1 pair of sex chromosomes
[Figure: chromosome 5 with the p arm, centromere, q arm, and telomere labeled]
Gene Structure
[Figure: a gene drawn from the 5' end to the 3' end: promoter, start site, exon 1, intron, exon 2, intron, exon 3, stop site, with splice sites marked; the messenger RNA below it contains exon 1, exon 2, and exon 3]
The exons encode the actual “blueprint” for a protein
The DNA double helix
[Figure: double helix with the sugar phosphate backbone, nucleotide bases, and a base pair labeled]
Bases: Adenine (A), Cytosine (C), Thymine (T), Guanine (G)
DNA (uncoiled)
A T T A T G A G T A A C C C A G
T A A T A C T C A T T G G G T C
Bases: Adenine (A), Cytosine (C), Thymine (T), Guanine (G)

DNA base pairs are read by threes (codons)
A U U A U G A G U A A C C C A G
T A A T A C T C A T T G G G T C
Bases: Adenine (A), Thymine (T), Uracil (U), Cytosine (C), Guanine (G)
Genetic Code
• A codon is made of 3 consecutive bases
• There are 64 possible codons
  – 1 codon (AUG) encodes methionine and starts translation of all proteins
  – 61 codons encode 20 amino acids (redundant code)
  – 3 codons stop protein translation
Example: A U G | G C A | U A A reads as Met, Ala, stop
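The codon counts above can be checked with a short Python sketch. It is only an illustration: the codon table here is deliberately partial (just the two sense codons drawn on the slide), and the names `PARTIAL_TABLE` and `translate` are hypothetical helpers, not part of any library.

```python
from itertools import product

BASES = "ACGU"

# Every ordered triple of the four RNA bases: 4**3 = 64 codons.
ALL_CODONS = ["".join(triple) for triple in product(BASES, repeat=3)]

# The three stop codons of the standard genetic code.
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Assumption: a partial table holding only the codons shown on the slide.
PARTIAL_TABLE = {"AUG": "Met", "GCA": "Ala"}


def translate(mrna):
    """Read an mRNA string three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            break
        # Codons missing from the partial table are marked as unknown.
        protein.append(PARTIAL_TABLE.get(codon, "???"))
    return protein


print(len(ALL_CODONS))                     # 64 possible codons
print(len(ALL_CODONS) - len(STOP_CODONS))  # 61 codons that encode amino acids
print(translate("AUGGCAUAA"))              # ['Met', 'Ala']
```

A full translation routine would need all 64 entries; the partial table keeps the sketch tied to what the slide actually shows.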
DNA Mutation • A mutation is a change in the normal DNA base pair sequence
Mutations can cause disease
Functional protein
Nonfunctional or missing protein
Proteins are chains of amino acids
SNP Markers
• SNP (single nucleotide polymorphism):
  AATGCAGGTGCAATCGATTTC
  AATGCAGGTGCAATTGATTTC
• SNPs make up about 90% of all human genetic variation
• SNPs with a minor allele frequency of ≥ 1% occur, on average, every 100 to 300 bases along the 3-billion-base human genome.
• Variations in the DNA sequences of humans can affect how humans develop disease or respond to drug treatments (pharmacogenomics)
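As a small illustration of what "single nucleotide" means, the Python sketch below locates the position at which the two example sequences differ. The function name `snp_positions` is a hypothetical helper, and the comparison assumes the two sequences are already aligned and of equal length.

```python
def snp_positions(allele_a, allele_b):
    """Return the 0-based positions where two aligned sequences differ."""
    if len(allele_a) != len(allele_b):
        raise ValueError("sequences must have the same length")
    return [i for i, (a, b) in enumerate(zip(allele_a, allele_b)) if a != b]


# The two example sequences from the slide.
allele_a = "AATGCAGGTGCAATCGATTTC"
allele_b = "AATGCAGGTGCAATTGATTTC"

positions = snp_positions(allele_a, allele_b)
for i in positions:
    # Report the position and the base carried by each allele.
    print(i, allele_a[i], allele_b[i])
```

Real variant calling works on aligned sequencing reads with quality scores; this sketch only shows the single-base difference itself.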
Some types of mutations
Normal                   THE BIG RED DOG RAN OUT.
Missense                 THE BIG RAD DOG RAN OUT.
Nonsense                 THE BIG RED.
Frameshift (deletion)    THE BRE DDO GRA.
Frameshift (insertion)   THE BIG RED ZDO GRA.
Polymorphisms
• A change in the normal DNA base pair sequence
• Mutations that do not alter protein function can become common in the population
• A polymorphism is defined as a ‘common’ genetic change; a frequency of >1% is usually considered common.
Marker Types
Alternative forms of a DNA sequence or gene
SNP
  allele A …….AATGCAGGTGCAATCGATTTC…….
  allele B …….AATGCAGGTGCAATTGATTTC…….
Insertion/Deletion
  allele A …….AATGCAGGTGCAATCGATTTC…….
  allele B …….AATGCAGGATTTC…….
Microsatellite
  allele A …….AATGCGAGAGAGAGAGATTTC…….
  allele B …….AATGCGAGAGAGATTTC…….
SNPs in the Human Genome GAAATAATTAATGTTTTCCTTCCTTCTCCTATTTTGTCCTTTACTTCAATTTATTTATTTATTATTAATATTATTATTTTTTGAGACGG AGTTTCACTCTTGTTGCCAACCTGGAGTGCAGTGGCGTGATCTCAGCTCACTGCACACTCCGCTTTC[C/T]GGTTTCAAGCGATTC TCCTGCCTCAGCCTCCTGAGTAGCTGGGACTACAGTCACACACCACCACGCCCGGCTAATTTTTGTATTTTTAGTAGAGTTGGGG TTTCACCATGTTGGCCAGACTGGTCTCGAACTCCTGACCTTGTGATCCGCCAGCCTCTGCCTCCCAAAGAGCTGGGATTACAGG CGTGAGCCACCGCGCTCGGCCCTTTGCATCAATTTCTACAGCTTGTTTTCTTTGCCTGGACTTTACAAGTCTTACCTTGTTCTGCC TTCAGATATTTGTGTGGTCTCATTCTGGTGTGCCAGTAGCTAAAAATCCATGATTTGCTCTCATCCCACTCCTGTTGTTCATCTCCT CTTATCTGGGGTCACTTTTATCTCTTCGTGATTGCATTCTGATCCCCAGTACTTAGCATGTGCGTAACAACTCTGCCTCTGCTTTCC CAGGCTGTTGATGGGGTGCTGTTCATGCCTCAGAAAAATGCATTGTAAGTTAAATTATTAAAGATTTTAAATATAGGAAAAAAGT AAGCAAACATAAGGAACAAAAAGGAAAGAACATGTATTCTAATCCATTATTTATTATACAATTAAGAAATTTGGAAACTTTAGATT ACACTGCTTTTAGAGATGGAGATGTAGTAAGTCTTTTACTCTTTACAAAATACATGTGTTAGCAATTTTGGGAAGAATAGTAACTC ACCCGAACAGTGTAATGTGAATATGTCACTTACTAGAGGAAAGAAGGCACTTGAAAAACATCTCTAAACCGTATAAAAACAATT ACATCATAATGATGAAAACCCAAGGAATTTTTTTAGAAAACATTACCAGGGCTAATAACAAAGTAGAGCCACATGTCATTTATCT TCCCTTTGTGTCTGTGTGAGAATTCTAGAGTTATATTTGTACATAGCATGGAAAAATGAGAGGCTAGTTTATCAACTAGTTCATTTT TTAACAAAGTAGAGCCACATGTCATTTATCTTCCCTTTGTGTCTGTGTGTAACAAAGTAGAGCCACATGTCATTTATCTTCCCTTTG TGTCTGTGTGAAA[A/C]AGTCTAACACATCCTAGGTATAGGTGAACTGTCCTCCTGCCAATGTATTGCACATTTGTGCCCAGATCC AGCATAGGGTATGTTTGCCATTTACAAACGTTTATGTCTTAAGAGAGGAAATATGAAGAGCAAAACAGTGCATGCTGGAGAGAG AAAGCTGATACAAATATAAATGAAACAATAATTGGAAAAATTGAGAAACTACTCATTTTCTAAATTACTCATGTATTTTCCTAGAA TTTAAGTCTTTTAATTTTTGATAAATCCCAATGTGAGACAAGATAAGTATTAGTGATGGTATGAGTAATTAATATCTGTTATATAAT ATTCATTTTCATAGTGGAAGAAATAAAATTTAAGTCTTTTAATTTTTGATATAAAGGTTGTGATGATTGTTGATTATTTTTTCTAGAG GGGTTGTCAGGGAAAGAAATTGCTTTTTTTCATTCTTGATTGCATTCTGATCCCCAGTACTTAGCATGTGCGTAACAACTCTGCCT CTGCTTTCCCAGGCTGTTGATGGGGTGCTGTTCATGCCTCAGAAAAACTCTTTCCACTAAGAAAGTTCAACTATTAATTTAGGCAC ATACAATAATTACTCCATTCTAAAATGCCAAAAAGGTAATTTGTGAGACAAGATAAGTATTAGTGATGGTATGAGTAATTAATATC 
TGTTATATAATATTCATTTTCATAGTGGAAGAAATAAAATTTAAGTCTTTTAATTTTTGATATAAAGGTTGTGATGATTGTTGATTAT TTTTTCTAGAGGGGTTGTCAGGGAAAGAAATTGCTTTTTTTCATTCTTGATTGCATTCTGATCCCCATAAGAGACTTAAAACTGAA AACCTTGTGATCCGCCAGCCTCTGCCTCCCAAAGAGCCCTTGTGATCCGCCAGCCTCTGCCTCCCAAAGAGCGTTTAAGATAGT CACACTGAACTATATTAAAAAATCCACAGGGTGGTTGGAACTAGGCCTTATATTAAAGAGGCTAAAAATTGCAATAAGACCACA GGCTTTAAATATGGCTTTAAACTGTGAAAGGTGAAACTAGAATGAATAAAATCCTATAAATTTAAATCAAAAGAAAGAAACAAAC T[A/G]AAATTAAAGTTATTATACAAGAATATGGTGGCCTGGATCTAGTGAACATATAGTAAAGATAAAACAGAATATTTCTGAAAA ATCCTGGAAAATCTTTTGGGCTAACCTGAAAACAGTATATTTGAAACTATTTTTAAAATGCAGTGATACTAGAAATATTTTAGAAT CATATGTATTTTCATAGTGGAAGAAATAAAATTTAAGTCTTTTAAAATTTCGA
Terminology
• Locus (plural: loci): May also be called a polymorphism, marker, variant, mutation
• Allele: variant forms of the same locus, e.g., A, C
  • “wildtype” vs “variant”
• Genotype: Pair of alleles
• Phenotype: Expressed trait
• Homozygote: AA, CC
• Heterozygote: AC
• Carrier: AA + AC
• Phase: do alleles occur together on the same chromosome?
• Haplotype: a collection of closely linked alleles, usually inherited as a unit, e.g., CTG
• Penetrance: P(Phenotype|Genotype)
A T A
G T C
Variant Types
Type        Effect                           Freq     RR of phenotype
Nonsense    stop AA seq.                     v. low   v. high
Missense    change AA                        low      low - v. high
Frameshift  change frame of protein coding   low      v. high
Intronic    No known function                Med      v. low
Intergenic  No known function                High     v. low
Typical Steps in Most Genomic Studies
• Hypotheses
• Tissue/sample processing
• Study Design
  – Focused/candidate regions vs whole genome
  – ‘Omic data type
  – Array vs NGS
  – Sample size and power
  – Confounding issues, covariates (epi, drug/trt)
• Bioinformatics processing of raw data
• Statistical Analysis
• Annotation of results and relationship (IPA, etc)
• Validation studies (replication, functional studies)
Evolution of Genomics Research Candidate Gene Studies < 2005 Genotyping
Genome-wide Association Studies 2005-2010
Next-Gen Sequencing 2010-Present
3rd Generation Sequencing Single Molecule Sequencing
RT-PCR
SNP arrays
DNA (Exome & WGS)
Resequencing genes (exons) with Sanger Sequencing
mRNA arrays
RNA-seq
PacBio, Complete Genomics, etc.
Methylation arrays
Bisulfite or RRBS (methylation)
Translation to clinical practice
Events leading up to Candidate Studies
1953 – Structure of DNA
1970s – Sanger Sequencing
1983 – PCR
1990 – HGP begins
1997 – NHGRI formed
Events leading up to GWAS
2000-1 – Draft version of human genome sequence completed
2002 – HapMap begins
2003 – HGP ends
Events leading up to Next-Gen Sequencing
2005 – 1st Commercial platform (Roche 454)
2006 – Illumina’s Genome Analyzer (GA) IIx
2008 – 1KGP begins
Human Genome Project
• Completed in 2003; 13 year project
• Goals:
  – Identify all ~25,000 genes in human DNA
  – Determine the sequences of the 3 billion bp
  – Store this information in databases
  – Improve tools for data analysis
  – Address ethical, legal, social issues (ELSI)
[Figure: Nature, February 2001]
http://www.ornl.gov/sci/techresources/Human_Genome/project/info.shtml
Human Genome Facts
• 3 billion base pairs
• Around 25,000 genes
  – Functions unknown for ~50%
• Average gene size is 3000 nucleotides
• Coding is about 1.5% of genome
[Figure: Nature, February 2001]
High Throughput Methods for Measuring DNA • Many approaches for genotyping – Hybridization Methods (Affymetrix, TaqMan) – Primer extension (Pyrosequencing) – Ligation (Illumina)
• Custom Content / Design – GoldenGate, Infinium at Illumina – Disease Specific panels (PGx, Cancer, Carbo‐Metabo)
• Standard large arrays – Genome‐wide arrays (> 1 million SNPs) – Exome Arrays (rare variants)
• Next-Generation Sequencing
NGS Technologies • Illumina (Solexa) HiSeq 2000 (2500) & MiSeq, Life Technologies SOLiD, PacBio, Ion Torrent PGM, Roche 454, ... , and many more to come – No one-size-fits-all solution – Each has pros and cons
Integrative Genomic Viewer (IGV)
Thorvaldsdottir, Robinson, Mesirov (2012) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration Briefings in Bioinformatics
ENCODE (Encyclopedia of DNA Elements) • Goal to build a comprehensive parts list of functional elements in the human genome.
Mouse ENCODE Project • Enhance the human ENCODE Project through relevant comparative studies • Access cell types, tissues, and developmental time points that are not addressable by the human project • Provide a general resource to inform and accelerate ongoing efforts in mouse genomics and disease modeling with human translational potential 33
Cancer Cell Line Encyclopedia J Barretina et al. Nature 483, 603-607 (2012) doi:10.1038/nature11003
The Cancer Genome Atlas (TCGA)
• Began in 2006 as a three-year pilot project (NCI & NHGRI) for three tumors.
• NIH is now committed to characterizing more than 20 additional tumors.
• Extensive data available on 17 cancers
• Tumor and normal tissue being analyzed on multiple levels, such as:
  – nucleotide variation (SNP, Indel, SNV)
  – gene copy number variation
  – gene expression levels
  – DNA methylation levels
Other Public Data and Information
Public databases
The entire human genome sequence can be found in several public databases.
– National Center for Biotechnology Information (NCBI)
  http://www.ncbi.nlm.nih.gov
  Entrez – NCBI's search and retrieval system; Build 37
– University of California at Santa Cruz (UCSC)
  http://genome.ucsc.edu/
  Genome Browser; hg19
– Ensembl Genome Browser
  http://www.ensembl.org/index.html
Public databases
• Compare NCBI Build to UCSC assembly (hg18)

Species  UCSC Release  Date       Release Name                         Status
Human    hg19          Feb. 2009  Genome Reference Consortium GRCh37   Available
         hg18          Mar. 2006  NCBI Build 36.1                      Available
         hg17          May 2004   NCBI Build 35                        Available
         hg16          Jul. 2003  NCBI Build 34                        Available
         hg15          Apr. 2003  NCBI Build 33                        Archived

http://genome.ucsc.edu/FAQ/FAQreleases.html
UCSC Genome Browser
Haplotype Map of the Human Genome
Goals:
• Define patterns of genetic variation across human genome
• Guide selection of SNPs efficiently to “tag” common variants
• Public release of all data (assays, genotypes)
Phase I: 1.3 M markers in 269 people
Phase II: +2.8 M markers in 270 people
Phase III: 1.6 M markers on 1,184 people (11 populations)
1000 Genomes Project (1KGP) • International project to construct a foundational data set for human genetics – Discover virtually all common human variations by investigating many genomes at the base pair level – Consortium with multiple centers, platforms, funders • Aims • Discover population level human genetic variations of all types (95% of variation > 1% frequency) • Define haplotype structure in the human genome • Develop sequence analysis methods, tools, and other reagents that can be transferred to other sequencing projects 41
3 pilot coverage strategies
1KGP Projects
• 1000 Genomes Pilot project
  • Started in 2008
  • Paper release contained ~14 million SNPs
  • 179 individuals
  • 4 populations
  • Low coverage next generation sequencing
• 1000 Genomes Phase 1
  • Started in 2009
  • Phase 1 release has 36.6 million SNPs, 3.8 million indels and 14K deletions
  • 1094 individuals
  • 14 populations
  • Low coverage and exome next generation sequencing
• 1000 Genomes Phase 2
  • Started in 2011
  • 1715 individuals
  • 19 populations
  • Low coverage and exome next generation sequencing
Methodological Impact of 1000 Genomes
• 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing.
• Developed methods to integrate information across several algorithms and diverse data sources.
• Joint calling and phasing of haplotypes
Flannick J, Korn JM, Fontanillas P, Grant GB, et al. (2012) Efficiency and Power as a Function of Sequence Coverage, SNP Array Density, and Imputation. PLoS Comput Biol 8(7): e1002604. doi:10.1371/journal.pcbi.1002604
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002604
Bioinformatics and Statistical Genomics
[Figure: diagram relating the fields Statistics, Informatics, Computer Science, and Biology & Medicine to Biostatistics, Bioinformatics, Statistical Genomics, Computational Genomics, and Genomics]
Bioinformatics-Statistics “continuum”
[Figure: topics placed along a continuum spanning Informatics, Bioinformatics, and Statistical Genomics]
• Experimental Design
• Processing of data via computers
• Biological knowledge/annotation
• Data mining
• Association Analysis
• Algorithms to determine function, structure
• Clustering/Profile
• Differential Analysis
• Network and Interactions
• GWAS & Haplotype
• Gene set and pathway analysis
• Modeling & Prediction
• New algorithms for processing next-generation sequence data
• Pedigree Studies (Linkage)
• New statistical methods
Questions?
Statistics Overview
What is Statistics/Biostatistics?
• It is the science of gaining information from data (i.e., collecting, analyzing, and interpreting data)
• Statistics is mainly used in practice for evaluating data to gain an understanding of some subject matter.
3 Parts of Statistics • Collecting Data – Experiments and Experimental design – Sampling and observational studies
• Analyzing Data – Graphs and numerical summaries – Estimation and confidence intervals – Hypothesis Testing – Statistical Modeling (i.e. fitting lines) 50
3 Parts of Statistics (cont.)
• Interpreting Results
  – Was the statistical analysis appropriate?
  – Were the data reliable?
  – What do the results tell about the research question?
  – What do the results tell about the estimate of an effect?
Statistics • Useful definitions: – population: all objects, individuals, etc. in which we are interested – sample: the subset of a population that is actually measured – data: information collected on objects, individuals, etc. of the sample
Statistical Inference Sample Population
Research question, hypothesis
measure, question, read, record, etc.
Data
summary statistics inferences
Statistical Analysis
data manipulation
Types of Data/Variables • Qualitative Variables / Data – Categorical • These variables (data) classify subjects or objects into groups. The data can be character or numeric. If numeric, the numbers have no inherent meaning.
Types of Data/Variables • Qualitative Variables/Data: Types – Nominal • Qualitative variables (data) in which the classifications/groups/categories are unordered. • Examples – blood group: A, B, O, AB – group: 0—control, 1—study – gender: 0—female, 1—male
Types of Data/Variables • Qualitative Variables/Data: Types – Ordinal • Qualitative variables (data) in which the classifications/groups/categories are ordered. • Examples – smoking status: 0-never, 1-former, 2-current – cancer stage: 1, 2, 3 – class: I, II, III, IV
Types of Data/Variables • Quantitative Variables/Data – These variables (data) are numeric with inherent numeric meaning. They typically arise from measurements or counts.
Types of Data/Variables • Quantitative Variables/Data: Types – Count or Discrete • Quantitative variables (data) that arise from a counting process (only integers). • Examples – number of affected individuals in a family – number of renal arteries with more than 50% stenosis – number of bacterial colonies on a slide
Types of Data/Variables
• Quantitative Variables/Data: Types (cont.)
  – Continuous
    • Quantitative variables (data) for which, if measured with sufficient accuracy, there would be no gaps between possible values (a continuum of values).
    • Examples
      – height
      – systolic blood pressure
      – time from diagnosis to last date alive or end of study
Graphical Displays of Data
• Examples of Data Distribution Shapes
[Figure: histograms labeled skewed right; symmetric; skewed left; uni-modal; bi-modal; symmetric; symmetric with outlier; somewhat symmetric]
Comparing sample mean and median • In a perfectly symmetric distribution, the mean and median are always the same. • Sample mean is influenced by outliers and skewed data, while the median is not. • Mean will move away from the median toward the tail of skewed data or outlier.
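The effect of an outlier on the two summaries can be seen in a few lines of Python. This is a toy illustration using the standard-library `statistics` module; the data values are made up for the example.

```python
from statistics import mean, median

# Made-up data: a symmetric sample, and the same sample with the largest
# value replaced by an outlier.
symmetric = [2, 4, 6, 8, 10]
with_outlier = [2, 4, 6, 8, 100]

# Symmetric data: mean and median agree (both 6).
print(mean(symmetric), median(symmetric))

# With the outlier: the mean is pulled toward the tail (24),
# while the median stays at 6.
print(mean(with_outlier), median(with_outlier))
```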
Measures of Spread (Variability)
• Sample Variance
  – idea: a measure of variability that depends on all the observations; looks at the amount of variation about the mean
  – notation
    • population variance: $\sigma^2$
    • sample variance: $s^2$
  – formula
    $$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$$
Measures of Spread (Variability)
• Characteristics of variance
  – $s^2 = 0$ means no spread in the data
  – $s^2$ is never negative
  – As variability increases, so does $s^2$
  – squared units of the values of the variable
  – Influenced by outliers
Measures of Spread (Variability)
• Sample Standard Deviation (SD)
  – notation
    • population standard deviation: $\sigma$
    • sample standard deviation: $s$
  – formula
    $$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}}$$
  – characteristics
    • square root of the variance
    • has the same units as the values of the variable
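The sample variance and standard deviation can be computed directly from their definitions (sum of squared deviations from the mean, divided by N − 1). The Python sketch below does this by hand and cross-checks against the standard-library `statistics` module; the function names and the data are illustrative only.

```python
import math
import statistics


def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """Square root of the sample variance (same units as the data)."""
    return math.sqrt(sample_variance(xs))


data = [4.0, 7.0, 13.0, 16.0]   # made-up values; mean is 10

print(sample_variance(data))    # 30.0 = (36 + 9 + 9 + 36) / 3
print(sample_sd(data))          # square root of 30
print(statistics.variance(data), statistics.stdev(data))  # library cross-check
```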
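Boxplots
[Figure: boxplots with labeled parts: outliers, QL (lower quartile), median, QU (upper quartile), and whiskers]
• Lower whisker: maximum of (1) minimum value not flagged as outlier, (2) QL – 1.5*IQR
• Upper whisker: minimum of (1) maximum value not flagged as outlier, (2) QU + 1.5*IQR
• Extremely useful for comparing groups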
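The whisker and outlier rule for a boxplot can be sketched in Python as follows. This is a sketch under assumptions: quartiles are taken from `statistics.quantiles` with its default method, and other software may use slightly different quartile conventions, so QL and QU can differ a little between packages. The function name and data are made up for the example.

```python
from statistics import quantiles


def boxplot_summary(xs):
    """Quartiles, 1.5*IQR fences, whisker ends, and flagged outliers."""
    q_l, med, q_u = quantiles(xs, n=4)   # lower quartile, median, upper quartile
    iqr = q_u - q_l
    low_fence = q_l - 1.5 * iqr
    high_fence = q_u + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "QL": q_l,
        "median": med,
        "QU": q_u,
        # Whiskers end at the most extreme values not flagged as outliers.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [x for x in xs if x < low_fence or x > high_fence],
    }


values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 50]   # made-up data with one outlier
summary = boxplot_summary(values)
print(summary)
```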
Data Collection • Two General Ways to get data – Observational study: gathers information about individuals through response to questions or observations of an individual's "normal" actions – sampling, surveys, retrospective studies – Will not be able to conclude a “causative” effect
– Experiment: deliberately imposes some treatment in order to observe a response – Completely randomized design, clinical trials – Will be able to conclude a “causative” effect 66
Experimental Terms
• Experimental Unit: object on which the experiment is performed
• Measurement Unit: object on which a measurement is taken; usually the same as the experimental unit, but not always
  Example: Apply a fertilizer (treatment) to an orange tree (experimental unit); measure the acid level in the oranges (measurement unit).
• Treatment: specific experimental conditions applied to the units
Experimental Terms • Experimental Error: – Natural differences in experimental units – Variation in the measuring device – Variation in setting the experimental/treatment conditions – The effect on the response variable of all extraneous factors other than the experimental factors – WISH TO MINIMIZE THE EXPERIMENTAL ERROR 68
Design Of Experiments (DOE) Principles 1. Control Group / Comparison Group(s) a. Controls for lurking variables
2. Randomization of experimental units to treatment groups a. Avoids bias due to assignment b. Produces similar treatment groups
3. Replication of experiment on many experimental units (n) a. Better able to find differences in treatments 69
Completely randomized design
[Figure: random allocation of experimental units to TRT 1 (n1 units) and TRT 2 (n2 units), followed by a comparison of responses]
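Random allocation in a two-arm completely randomized design can be sketched in Python as below. The function name, the subject IDs, and the seed are all hypothetical; the seed is there only so the example is reproducible.

```python
import random


def completely_randomized(units, n1, seed=None):
    """Randomly allocate n1 units to TRT 1 and the remaining units to TRT 2."""
    if not 0 <= n1 <= len(units):
        raise ValueError("n1 must be between 0 and the number of units")
    rng = random.Random(seed)    # seeded only for reproducibility
    shuffled = list(units)       # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    return {"TRT 1": shuffled[:n1], "TRT 2": shuffled[n1:]}


# Hypothetical subject identifiers.
subjects = [f"subject_{i:02d}" for i in range(1, 13)]
groups = completely_randomized(subjects, n1=6, seed=2013)
for trt, members in groups.items():
    print(trt, sorted(members))
```

Because every unit has the same chance of landing in either group, assignment cannot be biased by the investigator, which is the point of the randomization principle.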
Variation in Experiments
• If we redo the experiment, we will have a different randomization and a different outcome
• Some differences are due to chance differences in the groups
• Statistically significant differences are differences that are too large to plausibly occur by chance alone (will study later with hypothesis / significance testing)
• The larger the sample, the better we are able to detect differences between treatment groups
Study Design 1. What is the Question of Interest? Objectives? 2. Determine the Scope of the Inference a. Will this be a randomized experiment or an observational study? b. What experimental or sampling units will be used? c. What are the populations of interest?
3. Understand the system under study. 72
Study Design 4. Decide how to measure a response. 5. List Factors that can affect the response. a. Design Factors i. Factors to vary (treatments and controls) ii. Factors to fix
b. Confounding Factors i. Factors to control by design (blocking) ii. Factors to control by analysis (covariates) iii. Factors to control by randomization 73
Study Design 6. Plan the conduct of the experiment (time line) 7. Outline the statistical analysis 8. Determine the sample size / power
Some Other Experimental Designs
• Block Designs
• Factorial Designs
• Cross-over
• Matched / Paired Design – special case of blocked design
• Latin Square
• Split-plot
• Fractional factorial
• Randomized incomplete block design
Probability and Statistics
[Figure: Probability (deductive) reasons from the population to the sample; Statistics (inference) reasons from the sample to the population]
Probability Distribution • A probability distribution tells us what the possible outcomes are and the probability assigned to each outcome. – Example: Table with blood type probabilities
• Examples: – Uniform distribution – Normal distribution – T-distribution
What makes a good estimator? If you have 6 darts; what locations of the darts on the dart board would represent: 1) Unbiased & Low Variability? 2) Biased & High Variability?
Questions?
Probability for a Statistic
• A sampling distribution is a probability distribution for a statistic
• We use statistics to estimate unknown population parameters.
• Sampling distribution will be centered around the true value of the parameter (if the statistic is unbiased)
• As sample size increases, the statistic gets closer to the parameter (less spread in the distribution)
  – Larger the sample size, the more precise the estimate
• For many statistics (e.g., the sample mean), the sampling distribution looks approximately normal (i.e. symmetrical/bell-shaped) and has no outliers.
  – Will be “more normal” for larger samples.
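A small simulation makes the shrinking spread of a sampling distribution concrete. The Python sketch below repeatedly draws samples from a normal population and records the sample mean each time. The population values (mean 100, SD 15), the number of repetitions, and the seed are arbitrary assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev


def sd_of_sample_means(n, reps=2000, seed=1):
    """Simulate `reps` samples of size n and return the SD of their means."""
    rng = random.Random(seed)
    means = [
        mean([rng.gauss(100, 15) for _ in range(n)])   # assumed population
        for _ in range(reps)
    ]
    return stdev(means)


small_n = sd_of_sample_means(10)
large_n = sd_of_sample_means(100)

# Theory says the SD of the sample mean is 15 / sqrt(n): about 4.7 for
# n = 10 and 1.5 for n = 100. The simulated values should be close to these.
print(small_n, large_n)
```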
Significance Testing • Often called “Hypothesis” Testing • Statistical Inferences: two most common types 1. Confidence Interval: used when your goal is to estimate a population parameter. 2. Hypothesis Testing: used to assess the evidence provided by the data in favor of some claim about the population. • Reasoning for both types is based on asking what would happen if we repeated the sample or experiment many times. 81
Idea of Hypothesis Testing • Does our sample statistic indicate a TRUE effect? OR • Could we easily get this sample statistic by chance alone? – That is, taking into account variability in samples (and thus the statistic) is our observed value not an uncommon value?
• We would like to simply prove our alternative hypothesis is true, but statistics can never prove anything. Instead, we accumulate evidence against the null hypothesis. 82
Null Hypothesis • Status quo – usually not the hypothesis believed by the investigator • population parameter does not differ from established value • two population parameters do not differ
– indirect method for ascertaining whether data support researcher’s belief
• Why status quo? makes it possible to calculate probabilities (p-values) 83
Alternative Hypothesis
• Research hypothesis: typically the hypothesis the investigator believes is true
• Format
  – two-sided (two-tailed) Ha: parameter ≠ hypothesized quantity
  – upper-tailed (one-sided) Ha: parameter > hypothesized quantity
  – lower-tailed (one-sided) Ha: parameter < hypothesized quantity
• Generally will use a two-sided hypothesis
p-value
• Informally: The p-value helps answer the question, “Is the observed difference real or merely the result of chance?”
• It does not answer this question directly. Rather, it indicates how likely a difference at least as large as the one observed would be if chance alone were at work (i.e., assuming H0 is true).
• Formally: The p-value is the probability of observing the statistic value you got (or a value more extreme) if the null hypothesis is true.
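As a concrete instance of the formal definition, the Python sketch below computes a two-sided p-value for a z statistic. It assumes the test statistic follows a standard normal distribution when the null hypothesis is true; the function name is illustrative, and `NormalDist` comes from the standard-library `statistics` module.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, i.e. assuming H0 is true."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A statistic near zero is entirely unsurprising under H0 ...
print(two_sided_p_value(0.0))
# ... while |z| = 1.96 sits right at the conventional 0.05 cut-off.
print(two_sided_p_value(1.96))
print(two_sided_p_value(-1.96))   # same p-value: the test is two-sided
```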
Reasoning of Hypothesis Testing
• We assume Ho is true.
• Look at data to see if evidence is against Ho (Ho false)
• Results that are very unlikely if Ho is true have very small p-values and are evidence against the null hypothesis (Ho)
  – Small p-value = evidence in favor of Ha
  – Large p-value = not enough evidence for Ha
• Cut-off for p-value is significance level α
Interpreting p-values
• Use p-value to determine which possibility is supported by data
  – p-value ≤ 0.001
    • if the null hypothesis is true, there is a 1 in 1000 chance or less of observing our data or data more extreme
    • strong evidence against the null hypothesis
Interpreting p-values
• Use p-value to determine which possibility is supported by data (cont.)
  – p-value ≤ 0.05
    • if the null hypothesis is true, there is a 1 in 20 chance or less of observing our data or data more extreme
    • evidence against the null hypothesis
Interpreting p-values
• Use p-value to determine which possibility is supported by data (cont.)
  – p-value ≥ 0.1
    • if the null hypothesis is true, there is a 1 in 10 chance or more of observing our data or data more extreme
    • no evidence against the null hypothesis
• NOTE:
  • the p-value is based on the assumption that the null hypothesis is true, so it cannot tell you whether the null hypothesis is really true
  • not having enough evidence against the null hypothesis does not “prove” the null hypothesis is true
Practical Significance • When the sample size is large, you are more likely to get a significant p-value. • This is because the spread in the sampling distribution is getting very small and thus, the test statistic is getting large in magnitude. • Don't confuse statistical significance with practical significance.
Type I and II errors
             Decision: Reject Ho    Decision: Fail to Reject Ho
Ho: True     Type I error (α)       OK
Ho: False    OK                     Type II error (β)
• Type I error: reject the null hypothesis when it is true; α = Prob(Type I error)
• Type II error: fail to reject the null hypothesis when it is false; β = Prob(Type II error)
• We control Type I error by setting α as low as possible
• α and β act inversely: for a fixed sample size, as α gets smaller β gets larger.
• Make n large to control for both types of error
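The two error rates can be estimated by simulation. The Python sketch below repeatedly runs a two-sided z-test of H0: mean = 0 on simulated data and counts rejections. Everything about the setup is an assumption chosen for illustration: a known SD of 1, n = 25, 4000 repetitions, a fixed seed, and a true mean of 0.5 for the "H0 false" case.

```python
import random
from statistics import NormalDist, mean


def rejection_rate(true_mean, n=25, alpha=0.05, reps=4000, seed=7):
    """Fraction of simulated z-tests of H0: mean = 0 (known SD = 1) that reject."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    rejections = 0
    for _ in range(reps):
        xbar = mean([rng.gauss(true_mean, 1.0) for _ in range(n)])
        z = xbar * n ** 0.5                        # (xbar - 0) / (1 / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / reps


# H0 true: every rejection is a Type I error, so the rate should be near alpha.
type_i_rate = rejection_rate(true_mean=0.0)

# H0 false: rejections are correct decisions, so 1 - rate estimates beta.
type_ii_rate = 1 - rejection_rate(true_mean=0.5)

print(type_i_rate, type_ii_rate)
```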
Power of a Test
• Power
  – 1 – β = 1 – Prob(Type II error)
  – If the true parameter equals a specific value under Ha, what is the chance of obtaining a significant result?
• Larger samples have greater power to obtain a significant result. In other words, when you increase sample size, you increase power.
• For a given sample size, we have greater power to detect larger effects.
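Both statements about power can be checked with a closed-form calculation. The Python sketch below gives the power of a two-sided, two-sample z-test for a difference in means. It is a normal-approximation sketch that assumes equal group sizes and a common, known standard deviation; the function and parameter names are illustrative.

```python
from statistics import NormalDist


def power_two_sample_z(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    delta: true difference in means; sd: common SD (assumed known).
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    se = sd * (2 / n_per_group) ** 0.5     # SE of the difference in means
    shift = abs(delta) / se
    # Probability that the test statistic lands in either rejection region.
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)


# Larger samples give greater power for the same effect ...
print(power_two_sample_z(delta=5, sd=10, n_per_group=30))
print(power_two_sample_z(delta=5, sd=10, n_per_group=63))
# ... and, for a given sample size, larger effects are easier to detect.
print(power_two_sample_z(delta=10, sd=10, n_per_group=30))
```

The same function can be used to explore the other levers: raising `sd` or lowering `alpha` both reduce the returned power.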
Power and Sample Size: Why important
• You are planning an experiment and you want to give yourself the best possible chance of determining the truth.
  – Incorrect decisions:
    • Failure to reject Ho when we should: Type II error (probability = 1 – Power)
    • Reject Ho when we shouldn’t: Type I error
• Planning stage:
  • What effects do you think are possible?
  • What is a clinically meaningful effect?
    – What result do you need to proceed to the next stage?
    – What result do you need to recommend a change in clinical practice?
  • What sample size is required to make it all work?
Ways to Determine Sample Size
Two ways to determine sample size
1. Estimate n based on precision of confidence interval
   – Studies should be designed with sample size sufficient to estimate the parameter of interest precisely
2. Estimate n based on power of study
   – Studies should be designed with sample size sufficient to provide good power (0.8 or greater) to detect the smallest effect that would be clinically meaningful.
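[Figure: null and alternative sampling distributions, split at the critical value into “Fail to reject H0” and “Reject H0” regions]
• α = P(reject H0 | H0 is true)
• β = P(fail to reject H0 | HA is true)
• Power = P(reject H0 | HA is true)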
Estimating the Sample Size • We know the location of the null and alternative curves, but we do not know the shape because the sample size determines the shape. • We need to find the sample size that will give the curves the shape so that the a level and power equal the specified values.
[Figure: null and alternative curves with Alpha = 0.025, Power = 0.8, Beta = 0.2]
96
Sample Size Determination for Test of Significance • Necessary components – α, level of significance – 1 – β, power – the minimum difference between population parameters that is clinically useful – s, the standard deviation of each group (better to overestimate than underestimate) • Cautions: – Formulas provided are only an approximation – Based on many assumptions • Need to be clear in presenting how the power/sample size estimates were computed – Need to inflate the computed sample size to account for loss to follow-up, dropouts, etc. 97
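One common approximation that uses exactly these components is the two-group formula n per group = 2 (z₁₋α/₂ + z₁₋β)² s² / δ², where δ is the minimum clinically useful difference. The sketch below assumes a two-sided comparison of two means with equal group sizes and a normal approximation; the function names, the symbol δ, and the planning values are illustrative and not from the slides.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided, two-sample comparison of means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile corresponding to power = 1 - beta
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


def inflate_for_dropout(n, dropout_rate):
    """Inflate n so that about n subjects remain after the expected dropout."""
    return math.ceil(n / (1 - dropout_rate))


# Hypothetical planning values: detect a difference of 0.5 SD with 80% power
n_base = n_per_group(delta=0.5, sd=1.0)
n_enroll = inflate_for_dropout(n_base, dropout_rate=0.20)
print(n_base, n_enroll)
```

Overestimating s makes the computed n larger, which is why the slide recommends erring on the high side.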
http://www.stat.uiowa.edu/~rlenth/Power/index.html
98
• What impact does variance in the population have on power? Higher variability, lower power
• What impact does effect size have on power? Smaller effect size, lower power
• What impact does the type I error level (α) have on power? Lower α, lower power 99
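These three relationships can be checked numerically. The sketch below assumes a two-sided, two-sample z-test with equal group sizes; the function name and the input values are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def power_two_sample(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test for a mean difference."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    se = sd * math.sqrt(2 / n_per_group)  # standard error of the difference in means
    shift = abs(delta) / se               # standardized true difference
    # Probability that the test statistic lands in either rejection tail
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


base = power_two_sample(delta=0.5, sd=1.0, n_per_group=63)
more_variable = power_two_sample(delta=0.5, sd=2.0, n_per_group=63)
smaller_effect = power_two_sample(delta=0.25, sd=1.0, n_per_group=63)
stricter_alpha = power_two_sample(delta=0.5, sd=1.0, n_per_group=63, alpha=0.01)
print(round(base, 3), round(more_variable, 3),
      round(smaller_effect, 3), round(stricter_alpha, 3))
```

Each of the three modified scenarios gives lower power than the baseline, matching the answers on the slide.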
• Linear models
• Logistic regression (binary outcome with covariates)
• Poisson regression (count data; RNA-seq)
• Time to event (TTR, OS, DFS)
– Two-group comparison (no covariates): Kaplan-Meier (KM) curves and logrank test
– Regression framework: Cox proportional hazards models
100
Simple Linear Regression & Correlation • Goals: 1. Describe the nature of the relationship between two variables. 2. Find out whether some variables help explain, predict, or even cause the value of another variable.
• Response Variable: the result, effect, or outcome that we are interested in; also called the dependent variable. • Explanatory Variable(s): explains, causes, or helps to predict the response; also called the independent variable. • A relationship between two variables does not always imply that one variable causes a change in the other variable 101
Correlation • Correlation (r): a numerical measure for the strength and direction of a linear relationship between two quantitative variables.
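$$
r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{(n-1)\, S_x S_y} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}}, \qquad -1 \le r \le 1
$$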
• Values of r close to 0 indicate a weak linear relationship (r = 0 indicates no linear relationship) • Values of r close to ‐1 or +1 indicate a strong linear relationship • r has no units of measurement
– r will not change if we change weight from lbs. to kg. or height from inches to cm. 102
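A small sketch of the correlation calculation, including a check that r is unchanged by a change of units. The height and weight values are made-up illustrative data, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(x, y):
    """Pearson correlation from sums of cross-products and squared deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: height in inches, weight in pounds
height_in = [60, 62, 65, 68, 70, 72]
weight_lb = [115, 120, 140, 155, 165, 180]
r_imperial = pearson_r(height_in, weight_lb)

# Same subjects after converting to centimeters and kilograms
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]
r_metric = pearson_r(height_cm, weight_kg)

print(round(r_imperial, 4), round(r_metric, 4))
```

The two values agree because the scale factors cancel between the numerator and the denominator.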
Least Squares Regression
• We will use the line to predict y from x, so we want the line that is as close as possible to the points in the vertical (y) direction
[Figure: scatterplot with fitted line; vertical axis: Y variable (Response), horizontal axis: X variable (Explanatory)]
103
Least Squares Regression • A "good" regression line is one that makes the errors / residuals (ε) or distances as small as possible • A Least Squares regression line of y on x is the line that makes the sum of the squared vertical distances (errors) of the data points from the line as small as possible, i.e., it minimizes Σ(errors)²
104
LSR Line • Ŷ = b0 + b1 (X), Ŷ = predicted response – Based on data / sample
• b1 = slope = r (Sy/Sx) = rate of change = amount of predicted change in Y when X is increased by 1 unit • b0 = y‐intercept = $\bar{Y} - b_1 \bar{X}$ = value of Ŷ when X = 0 (statistically meaningful only when X can take values close to 0) 105
Prediction • Prediction: substitute an x‐value into the equation to get Ŷ, the predicted response value for that x‐value. • The predicted point (X, Ŷ) is always on the line. • Not all the observed values (Y) will fall on the line unless r=1.0 or r=‐1.0.
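The slope, intercept, and prediction steps can be sketched as follows. The data are made up for illustration and the function names are hypothetical. The slope is computed as Sxy/Sxx, which is algebraically the same as r (Sy/Sx).

```python
def least_squares_line(x, y):
    """Return (b0, b1) for the least squares line Y-hat = b0 + b1 * X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx               # slope, equal to r * (Sy / Sx)
    b0 = mean_y - b1 * mean_x    # intercept: the line passes through (X-bar, Y-bar)
    return b0, b1


def predict(b0, b1, x_new):
    """Predicted response for a given x value."""
    return b0 + b1 * x_new


# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares_line(x, y)
y_hat_at_mean = predict(b0, b1, sum(x) / len(x))
print(round(b0, 2), round(b1, 2), round(y_hat_at_mean, 2))
```

Predicting at the mean of X returns the mean of Y, which follows directly from the intercept formula.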
106
Interpreting correlation and regression • Know limitations: – Correlation and simple linear regression describe only linear relationships – Both r and the LSR line are influenced by extreme observations (outliers/influential points) – One outlier can change r and the LSR line dramatically – Always plot your data before you interpret correlation and regression • Influential Point: a point that, when removed, changes the position of the LSR line and affects the correlation. [Figure: influential point]
107
Categorical Data Analysis • When looking at categorical data, one often looks at proportions as opposed to means. • Testing that a proportion differs from a given value • Test that proportions for 2 populations differ • Test for relationship/association between two categorical variables. – Ex. Disease status and genotype frequency
108
Chi‐Square Tests • Uses: – Comparison of Several (2 or more) Proportions – Test for relationship/association/independence between two categorical variables. • Ex. Disease status and genotype frequency – Test that k subpopulations are the same (homogeneity) – Goodness of fit 109
Example: Genetic Association Testing
Observed counts, with expected counts in parentheses:

Genotype   Case        Control     Total
aa         10 (7.5)    5 (7.5)     15
aA         25 (22.5)   20 (22.5)   45
AA         50 (55)     60 (55)     110
Total      85          85          170

Expected Count = (Row Total)(Column Total) / (Table Total)
• If the expected counts are far away from the observed counts, this is evidence against H0.
• Chi-square test statistic: $X^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
• Under the null hypothesis, $X^2 \sim \chi^2$ with df = (R-1)(C-1)
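The calculation for this table can be sketched in a few lines. The counts are the ones in the example; the function name is hypothetical. The p-value line relies on the fact that, for df = 2 only, the chi-square upper-tail probability has the closed form exp(−X²/2); other degrees of freedom would need a statistical library.

```python
import math

# Observed counts from the example: genotype -> (cases, controls)
observed = {"aa": (10, 5), "aA": (25, 20), "AA": (50, 60)}


def chi_square_stat(table):
    """Pearson chi-square statistic and degrees of freedom for an R x C table."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    total = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for obs, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            stat += (obs - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df


x2, df = chi_square_stat(observed)
# Valid only because df == 2 here: P(chi-square with 2 df > x) = exp(-x / 2)
p_value = math.exp(-x2 / 2)
print(round(x2, 3), df, round(p_value, 3))
```

With these counts the statistic is about 3.13 on 2 df, giving a p-value of about 0.21, so the table provides little evidence against H0.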
110
Chi‐Square Distribution • Takes only positive values • Skewed distribution – A standard normal random variable squared is a chi‐square with 1 df (i.e. Z² ~ χ² with df = 1)
[Figure: chi-square distribution (df = 1); the p-value is marked relative to the X² test statistic] 111
Logistic Regression • Used when the response (dependent) variable has only two possible outcomes, “success” (y=1) or “failure” (y=0). • Interested in which explanatory variables explain the response variable in terms of P(success). • A type of nonlinear model (generalized linear model). • Poisson regression is used when the response variable is a count (0, 1, 2, …).
112
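Logistic Regression • Probability of success = $\pi = P(y = 1)$
• Relate a function of $\pi$, $\log\left(\frac{\pi}{1 - \pi}\right)$, to a linear combination of explanatory (independent) variables or predictors. • Simple logistic regression model:
$$
\log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_i
$$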
113
Logistic Regression • Thus, the probability in terms of $X_i$ (independent variable) is
$$
\pi_i = \frac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)}
$$
• β1 measures the degree of association between the probability of success and the value of the explanatory or predictor variable. • $\exp(\beta_1)$ is referred to as the ODDS RATIO. 114
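To see why exp(β1) is an odds ratio, the sketch below evaluates the model at two predictor values one unit apart and compares the odds. The coefficient values are hypothetical and chosen only for illustration; the function names are not from the slides.

```python
import math


def prob_success(x, beta0, beta1):
    """P(success) under a simple logistic model: exp(eta) / (1 + exp(eta))."""
    eta = beta0 + beta1 * x
    return 1.0 / (1.0 + math.exp(-eta))  # algebraically the same expression


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


# Hypothetical coefficients
beta0, beta1 = -1.0, 0.7

p_at_0 = prob_success(0, beta0, beta1)
p_at_1 = prob_success(1, beta0, beta1)

# Ratio of odds for a one-unit increase in the predictor
odds_ratio = odds(p_at_1) / odds(p_at_0)
print(round(odds_ratio, 4), round(math.exp(beta1), 4))
```

The ratio of odds equals exp(β1) regardless of the starting value of the predictor, whereas the change in probability itself depends on where you start.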
Questions?
115