Schedule for the Day

Statistical Genomics and Bioinformatics Workshop 8/16/2013 Statistical Genomics and Bioinformatics Workshop: Genetic Association and RNA-Seq Studies ...
Author: Conrad Higgins

Statistical Genomics and Bioinformatics Workshop: Genetic Association and RNA-Seq Studies
Overview of Genetics, Data Resources, Terminology and Statistics
Brooke L. Fridley, PhD
University of Kansas Medical Center

Schedule for the Day
• 9:45 – 10 am: Morning Break
• 11:30 – 12:30 pm: Lunch Break
• 2:30 – 2:45 pm: Afternoon Break
Schedule of Topics:
• Overview of Genetics and Genomics
  – Genetics
  – Technologies for genotyping
  – Databases and publicly available resources
• Review of Statistical Aspects
  – Study Design
  – Power/Sample Size
  – Hypothesis Testing


Schedule for the Day (cont.)
• Population Genetics (LD)
• Genetic Association Studies
  – Study Design
  – Quality Control
  – Genetic Models and Association Methods
  – Haplotypes
  – Power / Sample Size
  – Population Stratification
  – Genotype Imputation
  – Multiple locus methods
• Example: GWAS for Hormone Levels

Schedule for the Day (cont.)
• Multiple Testing
  – FWER
  – FDR
  – Permutation based p-values
• Example: Acetaminophen toxicity GWAS
• Limitations and Common Errors with GWAS
• RNA-Seq
  – Goals and review of types of RNAs
  – NGS and Experimental Design
  – Bioinformatics and processing RNA-Seq data
  – Quality Control
  – Differential Expression Testing Methods
• Clustering
  – Goals
  – Methods
  – Validation


REVIEW OF GENETICS


Individualized Medicine


Anticipated benefits of Individualized Medicine
• More powerful medicines
• Better, safer drugs the first time
• More accurate methods of determining appropriate drug dosages
• Advanced screening for diseases
• Better vaccines
• Improvements in the drug discovery and approval process
• Decrease in the overall cost of health care
From Human Genome Project Website: http://www.ornl.gov/sci/techresources/Human_Genome/medicine/pharma.shtml#whatis

Integrative ‘Omics
• Genome: DNA
• Epigenome: Regulatory Elements
• Transcriptome: RNA
• Proteome: Proteins
• Metabolome: Metabolites (e.g. Lipids)
• Phenome: Phenotype & Function


DNA‐mRNA‐Protein


Humans have 46 chromosomes
• 22 pairs of autosomes, 1 pair of sex chromosomes
[Figure: chromosome 5, labeled with the p arm, centromere, q arm, and telomere]


Gene Structure
[Figure: a gene drawn from the 5' end to the 3' end: promoter, start site, exon 1, intron, exon 2, intron, exon 3, stop site, with splice sites marked. The messenger RNA below it joins exon 1, exon 2, and exon 3.]
The exons encode the actual “blueprint” for a protein

The DNA double helix
[Figure labels: sugar phosphate backbone; nucleotide bases; base pair]
Bases: Adenine (A), Cytosine (C), Thymine (T), Guanine (G)


DNA (uncoiled)
A T T A T G A G T A A C C C A G
T A A T A C T C A T T G G G T C
Bases: Adenine (A), Cytosine (C), Thymine (T), Guanine (G)

DNA base pairs are read by threes (codons)
mRNA: A U U A U G A G U A A C C C A G
DNA:  T A A T A C T C A T T G G G T C
Bases: Adenine (A), Thymine (T), Uracil (U), Cytosine (C), Guanine (G)

Genetic Code
• A codon is made of 3 bases
• There are 64 possible codons
  – 1 codon (AUG) encodes methionine and starts translation of all proteins
  – 61 codons encode 20 amino acids (redundant code)
  – 3 codons stop protein translation
• Example: AUG GCA UAA reads as Met, Ala, stop
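The codon rules above can be sketched in a few lines of Python. This is only an illustration: the table holds just the three codons shown on the slide (a full table has 64 entries), and the function name `translate` is hypothetical.

```python
# Partial codon table: only the three codons shown on the slide.
# A complete genetic code table has 64 entries.
CODON_TABLE = {"AUG": "Met", "GCA": "Ala", "UAA": "Stop"}


def translate(mrna):
    """Translate an mRNA string codon by codon, starting at the first AUG."""
    start = mrna.find("AUG")
    if start == -1:
        return []  # no start codon, nothing is translated
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "?")  # "?" = codon not in this partial table
        if residue == "Stop":
            break
        peptide.append(residue)
    return peptide


print(translate("AUGGCAUAA"))  # ['Met', 'Ala']
```

Translation stops at UAA, so the stop codon itself contributes no amino acid.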

DNA Mutation • A mutation is a change in the normal DNA base pair sequence


Mutations can cause disease

Functional protein

Nonfunctional or missing protein

Proteins are chains of amino acids


SNP Markers
• SNP (single nucleotide polymorphism):
  AATGCAGGTGCAATCGATTTC
  AATGCAGGTGCAATTGATTTC
• SNPs make up about 90% of all human genetic variation
• SNPs with a minor allele frequency of ≥ 1% occur, on average, every 100 to 300 bases along the 3-billion-base human genome
• Variations in the DNA sequences of humans can affect how humans develop disease or respond to drug treatments (pharmacogenomics)
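As a small illustration, the single-base difference between the two example sequences can be located by comparing them position by position. The helper name `snp_positions` is hypothetical, and positions are 0-based.

```python
# The two example alleles from the slide.
allele_a = "AATGCAGGTGCAATCGATTTC"
allele_b = "AATGCAGGTGCAATTGATTTC"


def snp_positions(seq1, seq2):
    """Return (0-based position, base in seq1, base in seq2) for each mismatch."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq1, seq2)) if a != b]


print(snp_positions(allele_a, allele_b))  # [(14, 'C', 'T')]
```

This simple comparison only works for aligned sequences of equal length; insertions and deletions need a real alignment step.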


Some types of mutations
• Normal: THE BIG RED DOG RAN OUT.
• Missense: THE BIG RAD DOG RAN OUT.
• Nonsense: THE BIG RED.
• Frameshift (deletion): THE BRE DDO GRA.
• Frameshift (insertion): THE BIG RED ZDO GRA.

Polymorphisms
• A change in the normal DNA base pair sequence
• Mutations that do not alter protein function can become common in the population
• A polymorphism is defined as a ‘common’ genetic change; usually a frequency > 1% is considered common.


Marker Types
Alternative forms of a DNA sequence or gene

SNP
allele A …….AATGCAGGTGCAATCGATTTC…….
allele B …….AATGCAGGTGCAATTGATTTC…….

Insertion / Deletion
allele A …….AATGCAGGTGCAATCGATTTC…….
allele B …….AATGCAGGATTTC…….

Microsatellite
allele A …….AATGCGAGAGAGAGAGATTTC…….
allele B …….AATGCGAGAGAGATTTC…….

SNPs in the Human Genome GAAATAATTAATGTTTTCCTTCCTTCTCCTATTTTGTCCTTTACTTCAATTTATTTATTTATTATTAATATTATTATTTTTTGAGACGG AGTTTCACTCTTGTTGCCAACCTGGAGTGCAGTGGCGTGATCTCAGCTCACTGCACACTCCGCTTTC[C/T]GGTTTCAAGCGATTC TCCTGCCTCAGCCTCCTGAGTAGCTGGGACTACAGTCACACACCACCACGCCCGGCTAATTTTTGTATTTTTAGTAGAGTTGGGG TTTCACCATGTTGGCCAGACTGGTCTCGAACTCCTGACCTTGTGATCCGCCAGCCTCTGCCTCCCAAAGAGCTGGGATTACAGG CGTGAGCCACCGCGCTCGGCCCTTTGCATCAATTTCTACAGCTTGTTTTCTTTGCCTGGACTTTACAAGTCTTACCTTGTTCTGCC TTCAGATATTTGTGTGGTCTCATTCTGGTGTGCCAGTAGCTAAAAATCCATGATTTGCTCTCATCCCACTCCTGTTGTTCATCTCCT CTTATCTGGGGTCACTTTTATCTCTTCGTGATTGCATTCTGATCCCCAGTACTTAGCATGTGCGTAACAACTCTGCCTCTGCTTTCC CAGGCTGTTGATGGGGTGCTGTTCATGCCTCAGAAAAATGCATTGTAAGTTAAATTATTAAAGATTTTAAATATAGGAAAAAAGT AAGCAAACATAAGGAACAAAAAGGAAAGAACATGTATTCTAATCCATTATTTATTATACAATTAAGAAATTTGGAAACTTTAGATT ACACTGCTTTTAGAGATGGAGATGTAGTAAGTCTTTTACTCTTTACAAAATACATGTGTTAGCAATTTTGGGAAGAATAGTAACTC ACCCGAACAGTGTAATGTGAATATGTCACTTACTAGAGGAAAGAAGGCACTTGAAAAACATCTCTAAACCGTATAAAAACAATT ACATCATAATGATGAAAACCCAAGGAATTTTTTTAGAAAACATTACCAGGGCTAATAACAAAGTAGAGCCACATGTCATTTATCT TCCCTTTGTGTCTGTGTGAGAATTCTAGAGTTATATTTGTACATAGCATGGAAAAATGAGAGGCTAGTTTATCAACTAGTTCATTTT TTAACAAAGTAGAGCCACATGTCATTTATCTTCCCTTTGTGTCTGTGTGTAACAAAGTAGAGCCACATGTCATTTATCTTCCCTTTG TGTCTGTGTGAAA[A/C]AGTCTAACACATCCTAGGTATAGGTGAACTGTCCTCCTGCCAATGTATTGCACATTTGTGCCCAGATCC AGCATAGGGTATGTTTGCCATTTACAAACGTTTATGTCTTAAGAGAGGAAATATGAAGAGCAAAACAGTGCATGCTGGAGAGAG AAAGCTGATACAAATATAAATGAAACAATAATTGGAAAAATTGAGAAACTACTCATTTTCTAAATTACTCATGTATTTTCCTAGAA TTTAAGTCTTTTAATTTTTGATAAATCCCAATGTGAGACAAGATAAGTATTAGTGATGGTATGAGTAATTAATATCTGTTATATAAT ATTCATTTTCATAGTGGAAGAAATAAAATTTAAGTCTTTTAATTTTTGATATAAAGGTTGTGATGATTGTTGATTATTTTTTCTAGAG GGGTTGTCAGGGAAAGAAATTGCTTTTTTTCATTCTTGATTGCATTCTGATCCCCAGTACTTAGCATGTGCGTAACAACTCTGCCT CTGCTTTCCCAGGCTGTTGATGGGGTGCTGTTCATGCCTCAGAAAAACTCTTTCCACTAAGAAAGTTCAACTATTAATTTAGGCAC ATACAATAATTACTCCATTCTAAAATGCCAAAAAGGTAATTTGTGAGACAAGATAAGTATTAGTGATGGTATGAGTAATTAATATC 
TGTTATATAATATTCATTTTCATAGTGGAAGAAATAAAATTTAAGTCTTTTAATTTTTGATATAAAGGTTGTGATGATTGTTGATTAT TTTTTCTAGAGGGGTTGTCAGGGAAAGAAATTGCTTTTTTTCATTCTTGATTGCATTCTGATCCCCATAAGAGACTTAAAACTGAA AACCTTGTGATCCGCCAGCCTCTGCCTCCCAAAGAGCCCTTGTGATCCGCCAGCCTCTGCCTCCCAAAGAGCGTTTAAGATAGT CACACTGAACTATATTAAAAAATCCACAGGGTGGTTGGAACTAGGCCTTATATTAAAGAGGCTAAAAATTGCAATAAGACCACA GGCTTTAAATATGGCTTTAAACTGTGAAAGGTGAAACTAGAATGAATAAAATCCTATAAATTTAAATCAAAAGAAAGAAACAAAC T[A/G]AAATTAAAGTTATTATACAAGAATATGGTGGCCTGGATCTAGTGAACATATAGTAAAGATAAAACAGAATATTTCTGAAAA ATCCTGGAAAATCTTTTGGGCTAACCTGAAAACAGTATATTTGAAACTATTTTTAAAATGCAGTGATACTAGAAATATTTTAGAAT CATATGTATTTTCATAGTGGAAGAAATAAAATTTAAGTCTTTTAAAATTTCGA


Terminology
• Locus (plural: loci): may also be called a polymorphism, marker, variant, mutation
• Allele: variant forms of the same locus, e.g., A, C
  – “wildtype” vs “variant”
• Genotype: pair of alleles
• Phenotype: expressed trait
• Homozygote: AA, CC
• Heterozygote: AC
• Carrier: AA + AC (at least one copy of the A allele)
• Phase: do alleles occur together on the same chromosome?
• Haplotype: a collection of closely linked alleles, usually inherited as a unit, e.g. CTG
• Penetrance: P(Phenotype | Genotype)
[Figure: a pair of chromosomes carrying the haplotypes A-T-A and G-T-C]
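A minimal sketch of how the terms above map onto data, assuming genotypes are stored as two-letter strings such as "AC". The genotype list is made up for illustration and the function name `summarize_genotypes` is hypothetical.

```python
from collections import Counter


def summarize_genotypes(genotypes):
    """Count homozygotes/heterozygotes and compute allele frequencies."""
    zygosity = Counter(
        "homozygote" if g[0] == g[1] else "heterozygote" for g in genotypes
    )
    alleles = Counter("".join(genotypes))  # every genotype contributes two alleles
    total = sum(alleles.values())
    freqs = {allele: count / total for allele, count in alleles.items()}
    return zygosity, freqs


# Illustrative data only: 6 AA, 3 AC, 1 CC.
genotypes = ["AA", "AC", "AC", "CC", "AA", "AA", "AC", "AA", "AA", "AA"]
zygosity, freqs = summarize_genotypes(genotypes)
print(zygosity["heterozygote"])  # 3
print(freqs)                     # {'A': 0.75, 'C': 0.25}
```

Here C is the minor allele, with a frequency of 0.25.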

Variant Types

Type        Effect                           Freq     RR of phenotype
Nonsense    stop AA seq.                     v. low   v. high
Missense    change AA                        low      low - v. high
Frameshift  change frame of protein coding   low      v. high
Intronic    No known function                Med      v. low
Intergenic  No known function                High     v. low


Typical Steps in Most Genomic Studies
• Hypotheses
• Tissue/sample processing
• Study Design
  – Focused/candidate regions vs whole genome
  – ‘Omic data type
  – Array vs NGS
  – Sample size and power
  – Confounding issues, covariates (epi, drug/trt)
• Bioinformatics processing of raw data
• Statistical Analysis
• Annotation of results and relationship (IPA, etc.)
• Validation studies (replication, functional studies)

Evolution of Genomics Research
• Candidate Gene Studies (< 2005): genotyping; RT-PCR; resequencing genes (exons) with Sanger Sequencing
• Genome-wide Association Studies (2005-2010): SNP arrays; mRNA arrays; methylation arrays
• Next-Gen Sequencing (2010-Present): DNA (Exome & WGS); RNA-seq; bisulfite or RRBS (methylation)
• 3rd Generation Sequencing (Single Molecule Sequencing): PacBio, Complete Genomics, etc.
• Translation to clinical practice

Events leading up to Candidate Studies
• 1953 – Structure of DNA
• 1970s – Sanger Sequencing
• 1983 – PCR
• 1990 – HGP begins
• 1997 – NHGRI formed

Events leading up to GWAS
• 2000-1 – Draft version of human genome sequence completed
• 2002 – HapMap begins
• 2003 – HGP ends

Events leading up to Next-Gen Sequencing
• 2005 – 1st Commercial platform (Roche 454)
• 2006 – Illumina’s Genome Analyzer (GA) IIx
• 2008 – 1KGP begins


Human Genome Project
• Completed in 2003; 13 year project
• Goals:
  – Identify all ~25,000 genes in human DNA
  – Determine the sequences of the 3 billion bp
  – Store this information in databases
  – Improve tools for data analysis
  – Address ethical, legal, social issues (ELSI)
[Figure: cover of Nature, February 2001]

http://www.ornl.gov/sci/techresources/Human_Genome/project/info.shtml

Human Genome Facts
• 3 billion base pairs
• Around 25,000 genes
  – Functions unknown for ~50%
• Average gene size is 3000 nucleotides
• Coding is about 1.5% of genome
[Figure: cover of Nature, February 2001]


High Throughput Methods for Measuring DNA
• Many approaches for genotyping
  – Hybridization Methods (Affymetrix, TaqMan)
  – Primer extension (Pyrosequencing)
  – Ligation (Illumina)
• Custom Content / Design
  – GoldenGate, Infinium at Illumina
  – Disease Specific panels (PGx, Cancer, Carbo-Metabo)
• Standard large arrays
  – Genome-wide arrays (> 1 million SNPs)
  – Exome Arrays (rare variants)
• Next-Generation Sequencing

NGS Technologies
• Illumina (Solexa) HiSeq 2000 (2500) & MiSeq, Life Technologies SOLiD, PacBio, Ion Torrent PGM, Roche 454, ..., and many more to come
  – No one-size-fits-all solution
  – Each has pros and cons


Integrative Genomic Viewer (IGV)

Thorvaldsdottir, Robinson, Mesirov (2012) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration Briefings in Bioinformatics


ENCODE (Encyclopedia of DNA Elements) • Goal to build a comprehensive parts list of functional elements in the human genome.


Mouse ENCODE Project
• Enhance the human ENCODE Project through relevant comparative studies
• Access cell types, tissues, and developmental time points that are not addressable by the human project
• Provide a general resource to inform and accelerate ongoing efforts in mouse genomics and disease modeling with human translational potential

Cancer Cell Line Encyclopedia J Barretina et al. Nature 483, 603-607 (2012) doi:10.1038/nature11003


The Cancer Genome Atlas (TCGA)
• Began in 2006 as a three-year pilot project (NCI & NHGRI) for three tumors.
• NIH is now committed to characterizing more than 20 additional tumors.
• Extensive data available on 17 cancers
• Tumor and normal tissue being analyzed on multiple levels, such as:
  – nucleotide variation (SNP, Indel, SNV)
  – gene copy number variation
  – gene expression levels
  – DNA methylation levels

Other Public Data and Information


Public databases
The entire human genome sequence can be found in several public databases.
– National Center for Biotechnology Information (NCBI)
  http://www.ncbi.nlm.nih.gov
  Entrez: NCBI's search and retrieval system; Build 37
– University of California at Santa Cruz (UCSC)
  http://genome.ucsc.edu/
  Genome Browser; hg19
– Ensembl Genome Browser
  http://www.ensembl.org/index.html

Public databases
• Compare NCBI Build to UCSC assembly (hg18)

Species  UCSC Release  Date       Release Name                         Status
Human    hg19          Feb. 2009  Genome Reference Consortium GRCh37   Available
         hg18          Mar. 2006  NCBI Build 36.1                      Available
         hg17          May 2004   NCBI Build 35                        Available
         hg16          Jul. 2003  NCBI Build 34                        Available
         hg15          Apr. 2003  NCBI Build 33                        Archived

http://genome.ucsc.edu/FAQ/FAQreleases.html


UCSC Genome Browser

Haplotype Map of the Human Genome
Goals:
• Define patterns of genetic variation across human genome
• Guide selection of SNPs efficiently to “tag” common variants
• Public release of all data (assays, genotypes)
Phases:
• Phase I: 1.3 M markers in 269 people
• Phase II: +2.8 M markers in 270 people
• Phase III: 1.6 M markers on 1,184 people (11 populations)


1000 Genomes Project (1KGP)
• International project to construct a foundational data set for human genetics
  – Discover virtually all common human variations by investigating many genomes at the base pair level
  – Consortium with multiple centers, platforms, funders
• Aims
  – Discover population level human genetic variations of all types (95% of variation > 1% frequency)
  – Define haplotype structure in the human genome
  – Develop sequence analysis methods, tools, and other reagents that can be transferred to other sequencing projects

3 pilot coverage strategies


1KGP Projects
• 1000 Genomes Pilot project
  – Started in 2008
  – Paper release contained ~14 million SNPs
  – 179 individuals
  – 4 populations
  – Low coverage next generation sequencing
• 1000 Genomes Phase 1
  – Started in 2009
  – Phase 1 release has 36.6 million SNPs, 3.8 million indels and 14K deletions
  – 1094 individuals
  – 14 populations
  – Low coverage and exome next generation sequencing
• 1000 Genomes Phase 2
  – Started in 2011
  – 1715 individuals
  – 19 populations
  – Low coverage and exome next generation sequencing

Methodological Impact of 1000 Genomes
• 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing.
• Developed methods to integrate information across several algorithms and diverse data sources.
• Joint calling and phasing of haplotypes
Flannick J, Korn JM, Fontanillas P, Grant GB, et al. (2012) Efficiency and Power as a Function of Sequence Coverage, SNP Array Density, and Imputation. PLoS Comput Biol 8(7): e1002604. doi:10.1371/journal.pcbi.1002604
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002604


Bioinformatics and Statistical Genomics
[Figure: overlapping fields: Statistics, Informatics, Computer Science, Biostatistics, Bioinformatics, Computational Genomics, Statistical Genomics, Genomics, and Biology & Medicine]

Bioinformatics-Statistics “continuum”
[Figure: tasks arranged along a continuum from Informatics through Bioinformatics to Statistical Genomics: processing of data via computers; biological knowledge/annotation; data mining; algorithms to determine function, structure; new algorithms for processing next-generation sequence data; network and interactions; gene set and pathway analysis; clustering/profile; modeling & prediction; differential analysis; association analysis; GWAS & haplotype; pedigree studies (linkage); experimental design; new statistical methods]


Questions?


Statistics Overview


What is Statistics/Biostatistics?
• It is the science of gaining information from data (i.e., collecting, analyzing, and interpreting data)
• Statistics is mainly used in practice for evaluating data to gain an understanding of some subject matter.

3 Parts of Statistics
• Collecting Data
  – Experiments and Experimental design
  – Sampling and observational studies
• Analyzing Data
  – Graphs and numerical summaries
  – Estimation and confidence intervals
  – Hypothesis Testing
  – Statistical Modeling (e.g. fitting lines)


3 Parts of Statistics (cont.)
• Interpreting Results
  – Was the statistical analysis appropriate?
  – Were the data reliable?
  – What do the results tell about the research question?
  – What do the results tell about the estimate of an effect?

Statistics
• Useful definitions:
  – population: all objects, individuals, etc. in which we are interested
  – sample: the subset of a population that is actually measured
  – data: information collected on objects, individuals, etc. of the sample


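Statistical Inference
[Figure: cycle linking Population and Sample. A research question or hypothesis about the population leads to a sample; measuring, questioning, reading, recording, etc. produces data; data manipulation and statistical analysis produce summary statistics; inferences are then drawn about the population.]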

Types of Data/Variables
• Qualitative Variables / Data
  – Categorical
    • These variables (data) classify subjects or objects into groups. The data can be character or numeric. If numeric, the numbers have no inherent meaning.


Types of Data/Variables
• Qualitative Variables/Data: Types
  – Nominal
    • Qualitative variables (data) in which the classifications/groups/categories are unordered.
    • Examples
      – blood group: A, B, O, AB
      – group: 0 = control, 1 = study
      – gender: 0 = female, 1 = male

Types of Data/Variables
• Qualitative Variables/Data: Types
  – Ordinal
    • Qualitative variables (data) in which the classifications/groups/categories are ordered.
    • Examples
      – smoking status: 0 = never, 1 = former, 2 = current
      – cancer stage: 1, 2, 3
      – class: I, II, III, IV


Types of Data/Variables
• Quantitative Variables/Data
  – These variables (data) are numeric with inherent numeric meaning. They typically arise from measurements or counts.

Types of Data/Variables
• Quantitative Variables/Data: Types
  – Count or Discrete
    • Quantitative variables (data) that arise from a counting process (only integers).
    • Examples
      – number of affected individuals in a family
      – number of renal arteries with more than 50% stenosis
      – number of bacterial colonies on a slide


Types of Data/Variables
• Quantitative Variables/Data: Types (cont.)
  – Continuous
    • Quantitative variables (data) for which, if measured with sufficient accuracy, there would be no gaps between possible values (continuum of values).
    • Examples
      – height
      – systolic blood pressure
      – time from diagnosis to last date alive or end of study

Graphical Displays of Data
• Examples of Data Distribution Shapes
[Figure: example histograms labeled skewed right; symmetric; skewed left; uni-modal; bi-modal; symmetric; symmetric with outlier; somewhat symmetric]


Comparing sample mean and median
• In a perfectly symmetric distribution, the mean and median are always the same.
• The sample mean is influenced by outliers and skewed data, while the median is resistant to them.
• The mean will move away from the median toward the tail of skewed data or toward an outlier.
[Figure: two skewed distributions with the positions of the mean and the median marked]
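A quick numerical illustration of the points above, using made-up values and Python's standard `statistics` module: adding one large outlier drags the mean toward it while the median barely moves.

```python
from statistics import mean, median

# Illustrative data: symmetric around 4, then the same data plus one outlier.
values = [2, 3, 4, 5, 6]
with_outlier = values + [40]

print(mean(values), median(values))              # mean 4, median 4
print(mean(with_outlier), median(with_outlier))  # mean 10, median 4.5
```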

Measures of Spread (Variability)
• Sample Variance
  – idea: a measure of variability that depends on all the observations; looks at amount of variation about the mean
  – notation
    • population variance: σ^2
    • sample variance: s^2
  – formula

$$ s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1} $$


Measures of Spread (Variability)
• Characteristics of variance
  – s^2 = 0 means no spread in the data
  – s^2 is never negative
  – As variability increases, so does s^2
  – Expressed in squared units of the values of the variable
  – Influenced by outliers

Measures of Spread (Variability)
• Sample Standard Deviation (SD)
  – notation
    • population standard deviation: σ
    • sample standard deviation: s
  – formula

$$ s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}} $$

  – characteristics
    • square root of the variance
    • has same units as the value of the variable


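Boxplots
[Figure: boxplot with the box spanning QL to QU, a line at the median, whiskers, and points flagged as outliers]
• Lower whisker: maximum of (1) minimum value not flagged as outlier, (2) QL – 1.5*IQR
• Upper whisker: minimum of (1) maximum value not flagged as outlier, (2) QU + 1.5*IQR
• Extremely useful for comparing groups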
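The whisker rule can be sketched as follows, using made-up data. Note the assumption: quartiles are computed with `statistics.quantiles(..., method="inclusive")`, and other software may use slightly different quartile conventions, so QL and QU can differ a little between packages. The function name `boxplot_summary` is hypothetical.

```python
from statistics import quantiles


def boxplot_summary(xs):
    """Quartiles, whisker ends, and outliers using the 1.5 * IQR rule."""
    q_l, med, q_u = quantiles(xs, n=4, method="inclusive")
    iqr = q_u - q_l
    lower_fence = q_l - 1.5 * iqr
    upper_fence = q_u + 1.5 * iqr
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    return {
        "QL": q_l,
        "median": med,
        "QU": q_u,
        "lower_whisker": min(inside),  # smallest value not flagged as an outlier
        "upper_whisker": max(inside),  # largest value not flagged as an outlier
        "outliers": sorted(x for x in xs if x < lower_fence or x > upper_fence),
    }


summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary["lower_whisker"], summary["upper_whisker"])  # 1 9
print(summary["outliers"])                                 # [30]
```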

Data Collection
• Two General Ways to get data
  – Observational study: gathers information about individuals through responses to questions or observations of an individual's "normal" actions
    • Sampling, surveys, retrospective studies
    • Will not be able to conclude a “causative” effect
  – Experiment: deliberately imposes some treatment in order to observe a response
    • Completely randomized design, clinical trials
    • Will be able to conclude a “causative” effect


Experimental Terms
• Experimental Unit: object on which the experiment is performed
• Measurement Unit: object on which a measurement is taken; usually the same as the experimental unit, but not always
  Example: Apply a fertilizer (treatment) to an orange tree (experimental unit); measure the acid level in the oranges (measurement unit).
• Treatment: specific experimental conditions applied to the units

Experimental Terms
• Experimental Error:
  – Natural differences in experimental units
  – Variation in the measuring device
  – Variation in setting the experimental/treatment conditions
  – The effect on the response variable of all extraneous factors other than the experimental factors
  – WISH TO MINIMIZE THE EXPERIMENTAL ERROR


Design Of Experiments (DOE) Principles
1. Control Group / Comparison Group(s)
   a. Controls for lurking variables
2. Randomization of experimental units to treatment groups
   a. Avoids bias due to assignment
   b. Produces similar treatment groups
3. Replication of experiment on many experimental units (n)
   a. Better able to find differences in treatments

Completely randomized design
[Figure: experimental units are split by random allocation into TRT 1 (n1 units) and TRT 2 (n2 units); the responses of the two groups are then compared]
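Random allocation in a completely randomized design can be sketched as a shuffle followed by a split. The unit labels, group sizes, seed, and the function name `completely_randomized` are illustrative assumptions, not part of the slides.

```python
import random


def completely_randomized(units, n1, seed=None):
    """Randomly allocate n1 units to TRT 1 and the remainder to TRT 2."""
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {"TRT 1": shuffled[:n1], "TRT 2": shuffled[n1:]}


units = [f"unit{i:02d}" for i in range(1, 13)]  # 12 hypothetical experimental units
groups = completely_randomized(units, n1=6, seed=2013)
print(groups["TRT 1"])
print(groups["TRT 2"])
```

Rerunning with a different seed gives a different allocation, which is one source of the run-to-run variation in experimental outcomes.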


Variation in Experiments
• If we redo the experiment, we will have a different randomization and a different outcome
• Some differences are due to chance differences in the groups
• Statistically significant differences are differences that are too large to plausibly occur by chance alone (will study later with hypothesis / significance testing)
• The larger the sample, the better we are able to detect differences in treatment groups

Study Design
1. What is the Question of Interest? Objectives?
2. Determine the Scope of the Inference
   a. Will this be a randomized experiment or an observational study?
   b. What experimental or sampling units will be used?
   c. What are the populations of interest?
3. Understand the system under study.


Study Design
4. Decide how to measure a response.
5. List Factors that can affect the response.
   a. Design Factors
      i. Factors to vary (treatments and controls)
      ii. Factors to fix
   b. Confounding Factors
      i. Factors to control by design (blocking)
      ii. Factors to control by analysis (covariates)
      iii. Factors to control by randomization

Study Design
6. Plan the conduct of the experiment (time line)
7. Outline the statistical analysis
8. Determine the sample size / power


Some Other Experimental Designs
• Block Designs
• Factorial Designs
• Cross-over
• Matched / Paired Design
  – special case of blocked design
• Latin Square
• Split-plot
• Fractional factorial
• Randomized incomplete block design

Probability and Statistics
[Figure: Probability (deductive) reasons from the Population to the Sample; Statistics (inference) reasons from the Sample back to the Population]


Probability Distribution
• A probability distribution tells us what the possible outcomes are and the probability assigned to each outcome.
  – Example: Table with blood type probabilities
• Examples:
  – Uniform distribution
  – Normal distribution
  – T-distribution

What makes a good estimator?
If you have 6 darts, what locations of the darts on the dart board would represent:
1) Unbiased & Low Variability?
2) Biased & High Variability?


Questions?


Probability for a Statistic
• A sampling distribution is a probability distribution for a statistic
• We use statistics to estimate unknown population parameters.
• The sampling distribution will be centered around the true value of the parameter (if the statistic is unbiased)
• As sample size increases, the statistic gets closer to the parameter (less spread in the distribution)
  – The larger the sample size, the more precise the estimate
• For many statistics (such as the sample mean), the sampling distribution looks approximately normal (i.e. symmetrical/bell-shaped) and has no outliers.
  – Will be “more normal” for larger samples.
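A small simulation can make the shrinking spread of a sampling distribution concrete. The population (normal with mean 100 and SD 15), the number of replicates, the seed, and the function name `sd_of_sample_means` are all assumptions chosen for illustration.

```python
import random
import statistics


def sd_of_sample_means(n, reps=2000, seed=1):
    """Spread of the sampling distribution of the mean for samples of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sample = [rng.gauss(100, 15) for _ in range(n)]  # assumed population
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


# Theory says the spread is about 15 / sqrt(n), so it halves when n quadruples.
for n in (10, 40, 160):
    print(n, round(sd_of_sample_means(n), 2))
```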


Significance Testing
• Often called “Hypothesis” Testing
• Statistical Inferences: two most common types
  1. Confidence Interval: used when your goal is to estimate a population parameter.
  2. Hypothesis Testing: used to assess the evidence provided by the data in favor of some claim about the population.
• Reasoning for both types is based on asking what would happen if we repeated the sample or experiment many times.

Idea of Hypothesis Testing
• Does our sample statistic indicate a TRUE effect? OR
• Could we easily get this sample statistic by chance alone?
  – That is, taking into account variability in samples (and thus the statistic), is our observed value not an uncommon value?
• We would like to simply prove our alternative hypothesis is true, but statistics can never prove anything. Instead, we accumulate evidence against the null hypothesis.


Null Hypothesis
• Status quo
  – usually not the hypothesis believed by the investigator
    • population parameter does not differ from established value
    • two population parameters do not differ
  – indirect method for ascertaining whether data support researcher’s belief
• Why status quo? It makes it possible to calculate probabilities (p-values)

Alternative Hypothesis
• Research hypothesis: typically the hypothesis the investigator believes is true
• Format
  – two-sided (two-tailed) Ha: parameter ≠ hypothesized quantity
  – upper-tailed (one-sided) Ha: parameter > hypothesized quantity
  – lower-tailed (one-sided) Ha: parameter < hypothesized quantity
• Generally will use a two-sided hypothesis


p-value
• Informally: The p-value helps answer the question, “Is the observed difference real or merely the result of chance?”
• It does not answer this question directly. Rather, it indicates how likely a difference at least as large as the one observed would be if chance alone were at work (that is, if H0 is true).
• Formally: The p-value is the probability of observing the statistic value you got (or a value more extreme) if the null hypothesis is true.
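To make the formal definition concrete, here is a minimal sketch for one specific case: a test statistic z that follows a standard normal distribution under H0 (an assumption made for this example; other tests use other reference distributions). The function name `two_sided_p_value` is hypothetical.

```python
import math


def two_sided_p_value(z):
    """P(|Z| >= |z|) when Z is standard normal under the null hypothesis."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(z) / math.sqrt(2))


for z in (0.5, 1.96, 3.29):
    print(z, two_sided_p_value(z))
```

A statistic of 1.96 gives a two-sided p-value of about 0.05, and larger |z| gives smaller p-values.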

Reasoning of Hypothesis Testing
• We assume Ho is true.
• Look at the data to see if the evidence is against Ho (Ho false)
• Results that are very unlikely if Ho is true have very small p-values and are evidence against the null hypothesis (Ho)
  – Small p-value = evidence for Ha (reject Ho)
  – Large p-value = not enough evidence for Ha (fail to reject Ho)
• Cut-off for p-value is significance level α


Interpreting p-values
• Use p-value to determine which possibility is supported by data
  – p-value ≤ 0.001
    • if the null hypothesis is true, there is a 1 in 1000 chance or less of observing our data or data more extreme
    • strong evidence against the null hypothesis

Interpreting p-values
• Use p-value to determine which possibility is supported by data (cont.)
  – p-value ≤ 0.05
    • if the null hypothesis is true, there is a 1 in 20 chance or less of observing our data or data more extreme
    • evidence against the null hypothesis


Interpreting p-values
• Use p-value to determine which possibility is supported by data (cont.)
  – p-value ≥ 0.1
    • if the null hypothesis is true, there is a 1 in 10 chance or more of observing our data or data more extreme
    • no evidence against the null hypothesis
• NOTE:
  – the p-value is based on the assumption that the null hypothesis is true, so it cannot tell you if the null hypothesis is really true
  – not having enough evidence against the null hypothesis does not “prove” the null hypothesis is true

Practical Significance
• When the sample size is large, you are more likely to get a significant p-value.
• This is because the spread in the sampling distribution is getting very small and thus the test statistic is getting large in magnitude.
• Don't confuse statistical significance with practical significance.


Type I and II errors

            Decision: Reject Ho   Decision: Fail to Reject Ho
Ho: True    Type I error (α)      OK
Ho: False   OK                    Type II error (β)

• Type I error: reject the null hypothesis when it is true; α = Prob(Type I error)
• Type II error: fail to reject the null hypothesis when it is false; β = Prob(Type II error)
• We control Type I error by choosing a small α in advance
• For a fixed sample size, α and β act inversely: as α gets smaller, β gets larger.
• Make n large to control both types of error

Power of a Test
• Power
  – Power = 1 – β = 1 – P(Type II error)
  – If the true parameter takes a specific alternative value, what is the chance of obtaining a significant result?
• Larger samples have greater power to obtain a significant result. In other words, when you increase sample size, you increase power.
• For a given sample size, there is greater power to detect larger effects.
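Both statements about power can be checked numerically. The sketch below assumes a two-sided, two-sample z-test with a known common standard deviation and equal group sizes; this normal approximation is chosen only for illustration and is not necessarily the method used in the workshop. The name `power_two_sample_z` and its parameters are hypothetical.

```python
from statistics import NormalDist


def power_two_sample_z(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test (known common SD)."""
    z = NormalDist()
    se = sigma * (2 / n_per_group) ** 0.5  # SE of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)      # two-sided critical value
    shift = abs(delta) / se                # true difference in SE units
    # Probability that the test statistic falls in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Larger samples and larger effects both increase power.
for n in (20, 64, 200):
    print(n, round(power_two_sample_z(delta=0.5, sigma=1.0, n_per_group=n), 3))
```

With a true difference of zero the function returns α, which matches the definition of the Type I error rate.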


Power and Sample Size: Why important
• You are planning an experiment and you want to give yourself the best possible chance of determining the truth.
  – Incorrect decisions:
    • Failure to reject Ho when we should: Type II error (1 – Power)
    • Reject Ho when we shouldn’t: Type I error
• Planning stage:
  – What effects do you think are possible?
  – What is a clinically meaningful effect?
    • What result do you need to proceed to the next stage?
    • What result do you need to recommend a change in clinical practice?
  – What sample size is required to make it all work?

Ways to Determine Sample Size
Two ways to determine sample size:
1. Estimate n based on the precision of a confidence interval
  – Studies should be designed with a sample size sufficient to estimate the quantity of interest precisely
2. Estimate n based on the power of the study
  – Studies should be designed with a sample size sufficient to provide good power (0.8 or greater) to detect the smallest effect that would be clinically meaningful.
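The precision-based approach can be sketched for the simplest case, a confidence interval for a single mean with known standard deviation, where n = (z · sd / margin)². The function name `n_for_margin` and the example values are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def n_for_margin(sd, margin, conf_level=0.95):
    """Smallest n so that a z-based confidence interval for a mean has
    half-width (margin of error) no larger than `margin`, assuming sd is known."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return math.ceil((z * sd / margin) ** 2)

# Hypothetical planning values: sd = 10, desired margin of error 2 or 1 units.
print(n_for_margin(10, 2))
print(n_for_margin(10, 1))
```

Halving the margin of error roughly quadruples the required sample size, because n depends on 1 / margin².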


[Figure: sampling distributions under H0 and HA, split by the cutoff into "Fail to reject H0" and "Reject H0" regions. α = P(reject H0 | H0 is true); β = P(fail to reject H0 | HA is true); Power = P(reject H0 | HA is true).]

Estimating the Sample Size
• We know the location of the null and alternative curves, but we do not know the shape because the sample size determines the shape.
• We need to find the sample size that gives the curves the shape for which the α level and power equal the specified values.

[Figure: null and alternative curves with Alpha = 0.025, Power = 0.8, Beta = 0.2]


Sample Size Determination for Test of Significance
• Necessary components
  – α, level of significance
  – 1 – β, power
  – δ, the minimum difference between population parameters that is of clinical usefulness
  – s, the standard deviation of each group (better to overestimate than underestimate)
• Cautions:
  – Formulas provided are only an approximation
  – Based on many assumptions
    • Need to be clear in presenting how the power/sample size estimates were computed
  – Need to inflate the sample size you compute to account for loss to follow-up, dropouts, etc.
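As a rough sketch of how these components combine, the usual normal-approximation formula for comparing two group means is n per group = 2 (z for 1 – α/2 + z for power)² s² / δ². The code below assumes that formula, a two-sided test, and equal group sizes; the function names, the symbol `delta` for the minimum difference, and the 20% dropout rate are illustrative assumptions.

```python
import math
from statistics import NormalDist

def n_per_group(delta, s, alpha=0.05, power=0.80):
    """Approximate n per group for a two-sided comparison of two means
    (normal approximation, equal group sizes, common standard deviation s)."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * s ** 2 / delta ** 2)

def inflate_for_dropout(n, dropout_rate):
    """Enroll enough subjects that about n remain after the expected dropout."""
    return math.ceil(n / (1 - dropout_rate))

n = n_per_group(delta=0.5, s=1.0)          # hypothetical planning values
print(n, inflate_for_dropout(n, 0.20))     # before and after a 20% dropout allowance
```

Because s enters as s², overestimating the standard deviation gives a larger, more conservative sample size, which is the reason for the advice to overestimate rather than underestimate it.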

http://www.stat.uiowa.edu/~rlenth/Power/index.html


• What impact does variance in the population have on power? Higher variability, lower power
• What impact does effect size have on power? Smaller effect size, lower power
• What impact does the type I level (α) have on power? Lower α, lower power

• Time To Event (TTR, OS, DFS)
• Linear models
• Two group comparison (no covariates): KM curves and logrank test
• Regression framework: Cox Proportional Hazards models
• Logistic regression (binary outcome with covariates)
• Poisson regression (count data; RNA-seq)


Simple Linear Regression & Correlation
• Goals:
  1. Describe the nature of the relationship between two variables.
  2. Find out whether some variables help explain, predict or even cause the value of another variable.
• Response Variable: the result, effect, or outcome that we are interested in; also called the dependent variable.
• Explanatory Variable(s): explains, causes, or helps to predict the response; also called the independent variable.
• A relationship between two variables does not necessarily imply that one variable causes a change in the other variable.

Correlation
• Correlation (r): a numerical measure for the strength and direction of a linear relationship between two quantitative variables.
• $r = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{(n - 1)\, S_x S_y} = \dfrac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \, \sum (Y - \bar{Y})^2}}$, with $-1 \le r \le 1$
• Values of r close to 0 indicate a weak linear relationship (r = 0 indicates no linear relationship)
• Values of r close to -1 or +1 indicate a strong linear relationship
• r has no units of measurement
  – r will not change if we change weight from lbs. to kg. or height from inches to cm.
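The unit-free property of r can be verified directly from the formula. The sketch below uses only the standard library; the height and weight values are made-up illustration data, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation computed from sums of squares and cross-products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up data for illustration only.
height_in = [60, 62, 65, 68, 70, 72]
weight_lb = [115, 120, 140, 155, 160, 185]

r_imperial = pearson_r(height_in, weight_lb)

# Convert inches to cm and lbs to kg; r should be unchanged up to rounding error.
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]
r_metric = pearson_r(height_cm, weight_kg)

print(r_imperial, r_metric)
```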


Least Squares Regression
• We will use the line to predict y from x, so we want the line that is as close as possible to the points in the vertical (y) direction

[Figure: scatterplot with fitted line; horizontal axis: X variable (Explanatory); vertical axis: Y variable (Response)]

Least Squares Regression
• A "good" regression line is one that makes the errors / residuals (ε) or distances as small as possible
• A Least Squares regression line of y on x is the line that makes the sum of the squared vertical distances (errors) of the data points from the line as small as possible, or minimizes Σ(errors)²


LSR Line
• Ŷ = b0 + b1 (X), Ŷ = predicted response
  – Based on data / sample
• b1 = slope = r (Sy/Sx) = rate of change
  = amount of predicted change in Y when X is increased by 1 unit
• b0 = y-intercept = Ȳ – b1 X̄ = value of Ŷ when X = 0
  = statistically meaningful only when X can take values close to 0
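The slope and intercept formulas can be worked through on a small data set. This is a minimal sketch with made-up numbers; it computes b1 as r (Sy/Sx) and confirms that the same value comes from dividing the sum of cross-products by the sum of squares of X.

```python
import math

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)
s_x = math.sqrt(sxx / (n - 1))    # sample standard deviation of X
s_y = math.sqrt(syy / (n - 1))    # sample standard deviation of Y

b1 = r * s_y / s_x                # slope
b0 = mean_y - b1 * mean_x         # intercept

def predict(x_new):
    """Predicted response from the fitted least squares line."""
    return b0 + b1 * x_new

print(b0, b1, predict(6))
```

Note that `predict(mean_x)` returns `mean_y`: the least squares line always passes through the point of means.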

Prediction
• Prediction: substitute an x-value into the equation to get Ŷ, the predicted response value for that x value.
• The predicted value/point (X, Ŷ) is always on your line.
• Not all the observed values (Y) will fall on the line unless r = 1.0 or r = -1.0.


Interpreting correlation and regression
• Know limitations:
  – Correlation and simple linear regression describe only linear relationships
  – Both r and the LSR line are influenced by extreme observations (outliers/influential points)
  – One outlier can change r and the LSR line dramatically
  – Always plot your data before you interpret correlation and regression
• Influential Point: a point that, when removed, changes the position of the LSR line and affects the correlation.

[Figure: Influential Point]
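How much one point can matter is easy to see numerically. The sketch below fits the line and computes r on five made-up points, then adds a single extreme observation; the data and the helper name `slope_and_r` are hypothetical.

```python
import math

def slope_and_r(x, y):
    """Return (least squares slope, Pearson correlation) for paired data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)

# Made-up data with a clear positive trend.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope_without, r_without = slope_and_r(x, y)

# Add one extreme point far from the rest of the data.
slope_with, r_with = slope_and_r(x + [20], y + [0])

print("without outlier:", slope_without, r_without)
print("with outlier:   ", slope_with, r_with)
```

In this example the single added point is enough to flip the sign of both the slope and the correlation, which is why plotting the data first is recommended.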

Categorical Data Analysis
• When looking at categorical data, one often looks at proportions as opposed to means.
• Test that a proportion differs from a given value
• Test that proportions for 2 populations differ
• Test for relationship/association between two categorical variables.
  – Ex. Disease status and genotype frequency


Chi-Square Tests
• Uses:
  – Comparison of several (2 or more) proportions
  – Test for relationship/association/independence between two categorical variables.
    • Ex. Disease status and genotype frequency
  – Test that k subpopulations are the same (homogeneity)
  – Goodness of fit

Example: Genetic Association Testing

Observed counts (expected counts in parentheses):

          aa         aA          AA        Total
Case      10 (7.5)   25 (22.5)   50 (55)   85
Control    5 (7.5)   20 (22.5)   60 (55)   85
Total     15         45          110       170

Expected Count = (Row Total)(Column Total) / (Table Total)

• If the expected counts are far away from the observed counts, this is evidence against Ho.
• Chi-square test statistic: $X^2 = \sum \dfrac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$
• Under the null hypothesis, $X^2 \sim \chi^2$ with df = (R-1)(C-1)
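The test for the genotype table can be carried out by hand in a few lines. The sketch below uses the observed counts from the example and only the standard library; it relies on the fact that for df = 2 the upper-tail chi-square probability has the closed form exp(-x/2), so it would not give the p-value for tables of other dimensions. The function name `chi_square` is an illustrative choice.

```python
import math

# Observed counts from the example: columns are genotypes aa, aA, AA.
observed = [
    [10, 25, 50],   # cases
    [5, 20, 60],    # controls
]

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an R x C table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

stat, df = chi_square(observed)

# Closed-form upper tail, valid only because df = 2 here.
p_value = math.exp(-stat / 2)

print(stat, df, p_value)
```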


Chi-Square Distribution
• Takes only positive values
• Skewed distribution
  – A standard normal random variable squared is a Chi-square with 1 df (i.e. Z² ~ χ² with df = 1)

[Figure: chi-square distribution (df = 1); the p-value is the area to the right of the X² test statistic]
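The relationship between Z and the chi-square with 1 df means a two-sided z-test and a 1-df chi-square test give the same p-value. A minimal check, assuming the standard identity that the upper tail of a chi-square with 1 df at x equals erfc(sqrt(x/2)); the function name and the value z = 1.96 are illustrative.

```python
import math
from statistics import NormalDist

def chi2_df1_upper_tail(x):
    """P(chi-square with 1 df > x), using the complementary error function."""
    return math.erfc(math.sqrt(x / 2))

z = 1.96
two_sided_normal_p = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided z-test p-value
chi2_p = chi2_df1_upper_tail(z ** 2)                       # p-value for X^2 = z^2, df = 1

print(two_sided_normal_p, chi2_p)
```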

Logistic Regression
• Used when the response (dependent) variable has only two possible outcomes, "success" (y=1) or "failure" (y=0).
• Interested in what explanatory variables explain the response variable in terms of P(success).
• Type of nonlinear model (generalized linear model).
• Poisson Regression is used when the response variable is a count from 0, …, ∞.


Logistic Regression
• Probability of success = π
• Relate a function of π, $\log\left(\dfrac{\pi}{1 - \pi}\right)$, to a linear combination of explanatory (independent) variables or predictors.
• Simple logistic regression model: $\log\left(\dfrac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 X_i$

Logistic Regression
• Thus, the probability in terms of Xi (independent variable) is
• $\pi_i = \dfrac{\exp(\beta_0 + \beta_1 X_i)}{1 + \exp(\beta_0 + \beta_1 X_i)}$
• β1 measures the degree of association between the probability of success and the value of the explanatory or predictor variable.
• $e^{\beta_1}$ is referred to as the ODDS RATIO.
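The odds ratio interpretation can be confirmed numerically: under the simple logistic model, increasing X by one unit multiplies the odds of success by exp(β1), whatever the starting value of X. The coefficient values below are hypothetical, chosen only for illustration.

```python
import math

def prob_success(x, b0, b1):
    """P(success) under the simple logistic model log(p / (1 - p)) = b0 + b1 * x."""
    eta = b0 + b1 * x
    return math.exp(eta) / (1 + math.exp(eta))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

b0, b1 = -1.0, 0.7   # hypothetical coefficients

# Odds ratio for a one-unit increase in X, at two different starting points.
ratio_at_2 = odds(prob_success(3, b0, b1)) / odds(prob_success(2, b0, b1))
ratio_at_5 = odds(prob_success(6, b0, b1)) / odds(prob_success(5, b0, b1))

print(ratio_at_2, ratio_at_5, math.exp(b1))
```

The change in probability for a one-unit increase does depend on the starting value of X; only the ratio of odds is constant.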


Questions?
