Copy Number Data Analysis

Author: Maria Hines

Xiaowen Wang Field Applications Specialist [email protected]

Topics

Copy Number Analysis
• Data import and copy number creation
• Detect regions of amplification and deletion
• Detect copy number variation among different populations
• Overlap genes with copy number regions
• Chromosome visualization

LOH Analysis
• Data import
• Detect LOH
• Integration of LOH with copy number

Allele Specific Copy Number Analysis
• Data import
• ASCN creation
• Detect allelic imbalance

Copyright © Partek Inc.


Copy Number Analysis

Standard Copy Number Processing Workflow
• Import allele intensity (Affy .cel files), or import from Affymetrix, Agilent, Illumina, NimbleGen, etc.
• Copy number/log ratio
• Detect regions in each sample
• Analyze regions across samples
• Find genes that overlap with regions
• Biological interpretation
• Genomic integration

Visualize data at any of the steps

Assay Vendors for Copy Number/aCGH
• Affymetrix 10K through SNP6, Cyto2.7M, MIP: CEL and text file
• Illumina 317K, 550K, & 1M bead arrays: GenomeStudio plug-in, text
• Agilent aCGH: Feature Extraction output, gpr output
• Roche/NimbleGen aCGH: paired files

Import Samples
• Import allele intensity to create copy number
• Import copy number/log ratio

Import Allele Intensity from Affymetrix .CEL Files
• Specify .CEL files to import
• Probes are adjusted for fragment length and sequence bias
• Needed library files are downloaded automatically
• Output is normalized allele intensity in log scale
• Two columns per SNP and one column per CNV probeset

Import Allele Intensity from Illumina GenomeStudio
• Use the Partek plug-in for GenomeStudio
  • Analysis > Reports > Report Wizard
  • Choose Custom Report and select Partek Report Plug-in
  • Specify Type as X & Y
• No normalization is performed
• Outputs three spreadsheets in the project:
  • Allele intensity
  • B allele frequency
  • Genotype call

Paired/Unpaired Copy Number Creation

UNPAIRED

PAIRED
• Two samples (case/control) taken from each subject
• The normal sample is the baseline for the case sample in each subject
• Copy number values are output only for the case sample in each subject

Affy: SNP6, SNP5, 100K, 500K
Illumina: 1M, Omni1-Quad

Estimating Copy Number from Allele Intensities
• The allele intensities are compared to intensities from normal subjects
• The allele intensity of the normal sample(s) serves as the baseline
• If a probe's intensity is twice the baseline, the sample has twice as much DNA at the genomic location that the probe targets
  • Normal intensity = 2 copies
  • 2 times normal intensity = 4 copies
  • ½ of normal intensity = 1 copy
  • No intensity = 0 copies
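
As a rough illustration of the proportionality described above, the sketch below converts a probe intensity into a copy number estimate by scaling against the baseline. It assumes intensities on a linear (not log) scale and a diploid baseline. The function names are hypothetical and this is not Partek's implementation.

```python
import math


def estimate_copy_number(sample_intensity, baseline_intensity, baseline_copies=2.0):
    """Scale the baseline copy number by the sample-to-baseline intensity ratio."""
    if baseline_intensity <= 0:
        raise ValueError("baseline intensity must be positive")
    return baseline_copies * sample_intensity / baseline_intensity


def log2_ratio(copy_number, baseline_copies=2.0):
    """Express a positive copy number as a log2 ratio against the baseline."""
    return math.log2(copy_number / baseline_copies)


print(estimate_copy_number(200.0, 100.0))  # twice the baseline intensity
print(estimate_copy_number(50.0, 100.0))   # half the baseline intensity
print(log2_ratio(4.0))                     # 4 copies against a diploid baseline
```

The log2 helper is included only because several import paths in this deck produce log ratios rather than copy numbers.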

Baseline Choices

Better ability to detect true copy number
• Paired: DNA and reference from same patient
• Unpaired Experimental Reference: reference from similar samples run in same lab
• Unpaired Lab Reference: reference from larger unrelated group of samples run in same lab
• Unpaired Universal Reference: large HapMap baseline run in third-party lab
More robustness to sources of noise

GC Wave Correction on Copy Number
• Adjust copy number/log ratio based on local GC content
• Needs a reference genome in .2bit format

Diskin et al., Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms, Nucleic Acids Res., 2008, 36(19)
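
To make the idea of adjusting log ratios by local GC content concrete, here is a deliberately simplified sketch: it fits an ordinary least-squares line of log ratio against GC fraction and subtracts the GC-dependent trend. This is an assumed stand-in for illustration only; it is neither the method of Diskin et al. nor Partek's implementation, and the function name and inputs are hypothetical.

```python
def gc_correct(log_ratios, gc_fractions):
    """Remove a linear GC trend from log ratios (simplified illustration)."""
    n = len(log_ratios)
    mean_gc = sum(gc_fractions) / n
    mean_lr = sum(log_ratios) / n
    sxx = sum((g - mean_gc) ** 2 for g in gc_fractions)
    if sxx == 0:
        # All markers share the same GC content, so there is no trend to remove.
        return list(log_ratios)
    sxy = sum((g - mean_gc) * (r - mean_lr)
              for g, r in zip(gc_fractions, log_ratios))
    slope = sxy / sxx
    # Subtract only the GC-dependent part so the overall mean is preserved.
    return [r - slope * (g - mean_gc) for g, r in zip(gc_fractions, log_ratios)]


gc = [0.3, 0.4, 0.5, 0.6]
waved = [0.5 * g - 0.2 for g in gc]  # synthetic log ratios with a pure GC trend
print(gc_correct(waved, gc))
```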

Import Samples
• Import allele intensity to create copy number
• Import copy number/log ratio

Import Agilent aCGH
• Import from Feature Extraction
  • Select the FE output .txt files
  • Choose LogRatio to import
  • Change the log base to 2
  • Annotation will be generated during import
• Output is a LogRatio spreadsheet

Import from Illumina GenomeStudio
• Use the Partek plug-in for GenomeStudio
  • Analysis > Reports > Report Wizard
  • Choose Custom Report and select Partek Report Plug-in
  • Specify Type as Illumina Copy Number Analysis
• No normalization is performed
• Output is a Partek project containing three spreadsheets:
  • LogRRatio
  • B allele frequency
  • Genotype call

Import from Illumina Text Report
• Choose the text file output from Illumina
• Select fields to import:
  • Sample ID
  • Probeset ID
  • Data
• Only one type of data can be imported at a time
• Specify output file
• Annotation
  • Automatically linked to the annotation if it is in the text file
  • Choose File > Properties to manually specify the text format of the annotation file

Import from NimbleGen
• Specify either project folder or specific files (*normalized.txt or .pair)
• Specify annotation file (.pos)
• Two file formats:
  • .pair in raw data folder
  • *normalized.txt in processed data folder
• Output of normalized.txt files is the corrected log ratio in the text file
• Partek uses the .pair files and LOESS normalization to create the log ratio of one color to the other

Import Affymetrix MIP Chip
• File > Import > Affymetrix > MIP Copy Number text file
• Choose input file, annotation file
• Three files to choose from:
  • ASCN
  • Total Copy Number
  • Allele Ratio
• Values are pre-calculated by Affymetrix
• Copy number values below zero must be adjusted before using the analysis options
  • Set values below zero to a small number

Import from NimbleGen
• Paired files in raw data folder
  • Need to specify baseline channel
  • Output is LOESS-normalized log2 ratio
• .normalized.txt in processed data folder
  • Output is corrected log ratio

Assign Sample Attributes
• There are many ways to assign sample information (treatments, phenotype, other clinical information):
  1) From a “sampleInfo” file (.cel import only)
  2) By creating treatment/phenotype groups and dragging the samples into the appropriate group
  3) By splitting apart the filename
  4) By manually adding columns and filling them in (similar to Excel)

2) “Drag & Drop” Specification of Groups
• Name the new attribute and all categories of that attribute
• Group samples by dragging and dropping (e.g. the attribute name is “Type”, and the categories are “Down Syndrome” and “Normal”)

Exploratory Analysis - PCA Scatterplot & Histogram
• Identify outliers
• Clustering pattern
• Identify the distribution of the data

Chromosome View on Copy Number
• View one chromosome at a time
• Change the order of tracks
• Add/remove tracks

Heatmap track
• Displays the copy number/log ratio for all the samples in the spreadsheet
• Color represents copy number value

Profile track
• Displays the selected sample
• Y axis is copy number value
• Raw and smoothed copy number

How to find regions of CNV? (Amplifications & Deletions)
• Monitor trends across multiple adjacent markers
• Define chromosomal breakpoints where these trends in chromosomal abundance change
• Methods in Partek:
  • Hidden Markov Model
  • Genomic segmentation

Partek Genomic Segmentation
• Find breakpoints that produce different neighboring regions

Segmentation Parameters
• Specify minimum number of genomic markers
• A two-sided t-test compares two neighboring regions
• Whether a breakpoint is inserted depends on the significance and the amount of change

Region Report
• Two one-sided t-tests compare the mean of the region with the expected range to determine aberration status
• Expected range: the range around each expected copy number. In a diploid region, the expected range would be 2 +/- 0.3, i.e. from 1.7 to 2.3.

Signal to Noise (figure): 2.6 vs. 2.2; (2.6 - 2.2) = 0.4 > 0.3
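
The two ingredients named above, a t-test between neighboring regions and an expected range around the diploid value, can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a Welch t statistic without converting it to a p-value, and it calls aberration status from the region mean alone rather than with one-sided t-tests. The function names and example marker values are hypothetical, and this is not Partek's algorithm.

```python
import math
import statistics


def welch_t(left, right):
    """Welch t statistic comparing the means of two neighboring regions."""
    se = math.sqrt(statistics.variance(left) / len(left)
                   + statistics.variance(right) / len(right))
    return (statistics.mean(left) - statistics.mean(right)) / se


def region_status(markers, expected=2.0, tolerance=0.3):
    """Label a region by comparing its mean with the expected range."""
    mean = statistics.mean(markers)
    if mean > expected + tolerance:
        return "amplification"
    if mean < expected - tolerance:
        return "deletion"
    return "unchanged"


left = [2.6, 2.7, 2.5]   # hypothetical markers left of a candidate breakpoint
right = [2.2, 2.1, 2.3]  # hypothetical markers right of it
print(welch_t(left, right))
print(region_status(left), region_status(right))
```

A large absolute t statistic together with a mean difference above the chosen threshold would argue for placing a breakpoint between the two regions.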

Hidden Markov Model
• Specify expected states (copy number/log ratio)
• Specify the maximum probability of retaining the same state between neighboring markers
• Genomic decay describes how quickly the state-retention probability decays to the initial probability
• Find the most likely state sequence given the data
• Compare each state with the normal state of 2 to determine aberration status
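
Finding the most likely state sequence is commonly done with the Viterbi algorithm. The sketch below is a minimal version under assumptions that are not from the slides: Gaussian emissions centered on each copy number state with a shared standard deviation, a uniform starting distribution, and a fixed stay probability (genomic decay with distance is ignored). All names and parameter values are hypothetical.

```python
import math


def viterbi_copy_number(values, states=(1, 2, 3), stay_prob=0.98, sigma=0.3):
    """Return the most likely copy number state for each marker."""
    n_states = len(states)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_states - 1))

    def log_emit(x, state):
        # Gaussian log-likelihood up to a constant shared by all states.
        return -((x - state) ** 2) / (2.0 * sigma ** 2)

    def log_trans(i, j):
        return log_stay if i == j else log_switch

    score = [log_emit(values[0], s) for s in states]  # uniform start
    back = []
    for x in values[1:]:
        new_score, pointers = [], []
        for j, s in enumerate(states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans(i, j))
            pointers.append(best_i)
            new_score.append(score[best_i] + log_trans(best_i, j) + log_emit(x, s))
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [states[i] for i in path]


print(viterbi_copy_number([2.0, 2.1, 1.9, 3.1, 2.9, 3.0, 2.0, 2.1]))
```

With a high stay probability, a single noisy marker is absorbed into the surrounding state, while a run of shifted markers is called as a separate region.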

Result Spreadsheet
• One row per segment per sample
• First 3 columns are the genomic location: chromosome, start, end
• Mean is the average copy number of all the markers in the region
• Copy number status is based on the report parameters
• HMM has a stat column
• Segmentation has a p-value column

HMM vs Segmentation
• HMM: good for homogeneous samples with anticipated states (copy number)
• Segmentation: good for heterogeneous samples when you don't know the copy number state

Segmentation Result Spreadsheet
• One row per segment per sample
• First 3 columns are the genomic location: chromosome, start, end
• Copy number status is based on the report parameters
• Right-click on a region row header > Browse to location

Plot Detected Regions
• Karyoview (Histogram View): sample frequency on aberration regions
• Classification View: view each region in each sample separately

Analyze Detected Segments
Analyze regions across multiple samples

(Figure: regions of samples 1 to 5 aligned, with the resulting regions below)

• Regions can be smaller than the default number of markers chosen during segmentation
• Lack of defined borders in copy number
• You can apply a filter on small and less commonly shared regions

Detect CN Variation on Different Categories
• A chi-square test is used to detect copy number changes among different categories
• Imbalance between samples will increase the significance of the categorical contribution to aberration
• Right-clicking on a region header can invoke an HTML report
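
For one region, the chi-square test can be pictured as a contingency table of category by aberration status. The sketch below computes the Pearson chi-square statistic and degrees of freedom for such a table. The counts are made up for illustration, the function name is hypothetical, and every row and column total is assumed to be non-zero.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Rows: two sample categories; columns: samples with / without the aberration.
counts = [[8, 2],
          [2, 8]]
print(chi_square_statistic(counts))
```

The statistic would then be compared against a chi-square distribution with the returned degrees of freedom to obtain a p-value.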

Create Region List
• Specify criteria to filter down to interesting regions, based on:
  • p-value
  • Length
  • Number of markers
  • Chromosome
  • Number of aberration samples

Find Overlapping Genes
• Overlap with RefSeq, AceView, Ensembl, CNV or a custom database
• Output format:
  • A new column in the region spreadsheet
  • A new spreadsheet
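
Conceptually, overlapping genes with regions is an interval intersection on matching chromosomes. A minimal sketch follows, assuming inclusive start/end coordinates and a small in-memory gene list. The gene names and coordinates are placeholders, not entries from any real annotation database.

```python
def overlapping_genes(region, genes):
    """Return names of genes that overlap a (chromosome, start, end) region."""
    chrom, start, end = region
    return [name
            for name, g_chrom, g_start, g_end in genes
            if g_chrom == chrom and g_start <= end and g_end >= start]


# Placeholder annotation rows: (name, chromosome, start, end).
genes = [
    ("GENE_A", "chr1", 100, 200),
    ("GENE_B", "chr1", 500, 600),
    ("GENE_C", "chr2", 150, 250),
]
print(overlapping_genes(("chr1", 150, 550), genes))
```

For genome-scale annotation, a sorted list or interval tree would be the usual replacement for this linear scan.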

Test for Known Abnormalities
• Input files:
  • Filtered segmentation/HMM result spreadsheet
  • Abnormality database
• An overlap is reported as positive
• Each output row is one feature tested in one sample

Cluster Genome
• The copy number spreadsheet is used to check how the samples cluster on the whole genome or on selected chromosomes
  • The default shows the clustering on chromosome 1
  • Click the Show All button to cluster on the whole genome
  • Combine left and right clicks on chromosome numbers to select chromosomes

Chromosome View

Tracks
• Can be easily added and removed
• Drag and drop to change the order of the tracks
• Select a track to change its configuration at the bottom
• Heatmap track on copy number data
  • Click on the color tab to change the heatmap color
• Profile track on copy number data
  • By default, only the selected sample is shown
  • Click on the color tab to change the dot color
  • Click on the samples table to select samples to display
• Add New Track
  • Any annotation information
  • Any spreadsheet with genomic location

Loss of Heterozygosity


LOH Workflow
• Import genotype calls
• Both paired and unpaired analyses are available
• Paired is preferred (if possible) because it focuses only on genomic regions specific to the disease phenotype and minimizes differences between individuals

Import Samples
• Platform:
  • Affymetrix CHP (500K, SNP6, etc.)
  • Affymetrix text file (Mouse Diversity Genotyping Array)
  • Illumina output from GenomeStudio
  • Illumina text file
• Specify input file(s)
• Specify output file(s)

QA/QC

Sample QC
• Report the rate of NC (no call) and heterozygous calls in each sample

SNP QC
• Hardy-Weinberg equilibrium: allele frequencies in a population remain constant
• A chi-square test is performed on expected vs. observed genotype frequencies
• Output the frequencies of each allele
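
The Hardy-Weinberg check for a single SNP can be sketched as follows: estimate allele frequencies from the genotype counts, derive the expected genotype counts, and compare them with the observed counts by chi-square. The function name is hypothetical, no-calls are assumed to have been excluded beforehand, and the conversion of the statistic to a p-value is omitted.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Return (freq_A, freq_B, chi-square statistic) for one SNP."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype call is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    # Skip genotype classes with zero expectation (monomorphic SNPs).
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return p, q, stat


print(hwe_chi_square(25, 50, 25))  # counts that match the expectation
print(hwe_chi_square(50, 0, 50))   # no heterozygotes at all
```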

Create LOH
• A Hidden Markov Model is used to find the LOH regions based on genotype error and expected heterozygous frequency at each SNP
• Paired and unpaired
• Paired is preferred when possible, as it is more accurate in its expected genotype frequencies

Paired LOH
• Specify the tumor/normal pair information
• Specify parameters for HMM
• Homozygous SNPs in the normal sample are excluded

Unpaired LOH
• Need to create a baseline using normal samples
• Specify baseline file
• Specify parameters for HMM
• The default heterozygous frequency is used if no baseline is provided

LOH creates a segment table
• One row per sample per LOH region
• Heterozygous rate is the number of AB calls divided by the total number of genotype calls in the region
• Paired LOH displays only the LOH regions of the case sample

Intro to CN & LOH merge
• Sample IDs should match
• Input files are the region spreadsheets detected on each sample
• Output is the union of the regions, in different categories:
  • Amplification with LOH
  • Amplification without LOH
  • Deletion with LOH
  • Deletion without LOH
  • Copy-neutral LOH
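
The five categories follow from combining a region's copy number status with whether it overlaps an LOH region. A small sketch of that mapping is shown below. The status strings and the function name are hypothetical, and how Partek labels regions with neither an aberration nor LOH is not stated, so the sketch returns None for them.

```python
def classify_region(cn_status, has_loh):
    """Map a copy number status and an LOH flag to a merged category."""
    if cn_status in ("amplification", "deletion"):
        label = cn_status.capitalize()
        return f"{label} with LOH" if has_loh else f"{label} without LOH"
    if has_loh:
        return "Copy-neutral LOH"
    return None  # neither a copy number change nor LOH


for status, loh in [("amplification", True), ("deletion", False), ("unchanged", True)]:
    print(status, loh, "->", classify_region(status, loh))
```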

LOH & CN Overlay

(Figure: LOH, amplification and deletion regions overlaid; the overlaps are labeled Amp w/ LOH, Copy-neutral LOH and Del w/ LOH)

Output of CN overlap with LOH
• Genomic location of the region
• Sample ID
• Description
• Average copy number
• Heterozygous rate

LOH and Copy Number Overlap
• Chromosome view of the regions in the five categories

Find regions in multiple samples
• Output regions that are common to a specified number of samples
• Output the number of samples in each region and their sample IDs

Visualization
• Histogram of the number of samples in each region for each category

Allele Specific Copy Number

Copyright © 2009 Partek Incorporated. All rights reserved.


Why ASCN
• Estimate the number of copies of each allele
• Able to detect imbalance in copy number between alleles, which is important for mixed tissue, e.g. tumor samples
• Helps interpret total copy number analysis results

ASCN Workflow
• Import samples
  • Genotype calls
  • Allele intensity
  • Sample IDs should match in both spreadsheets
  • Normal sample(s) are required in the spreadsheet
• Analyze both paired and unpaired designs
• Detect allelic imbalance
• Overlap with copy number analysis

Create ASCN
• Not all SNPs are used
• Paired is preferred (if possible)
  • Minimizes differences between individuals
  • Informative SNPs are heterozygous calls in the normal sample
• Unpaired analysis
  • Informative SNPs are heterozygous calls in the tumor sample
  • LOH regions in the tumor sample mean no informative SNPs (missing values)

ASCN Result
• Each row is one allele copy number in each sample
• “?” means the SNP is not informative in the sample
• Min and max are assigned by value (the smaller and the larger allele copy number)
• In a diploid region, min and max should both be around 1

Detect Allelic Imbalance
• The proportion score of each informative SNP is: Proportion = (Max - Min)/(Max + Min)
• The score ranges from 0 to 1; a large proportion score means allelic imbalance
• Genomic segmentation is performed
• The average proportion score of the informative SNPs is reported
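
The proportion score above is simple enough to check by hand. The sketch below computes it for individual SNPs and averages it over the informative SNPs of a region. It assumes non-negative allele copy numbers and represents non-informative SNPs (shown as “?” in the result spreadsheet) as None. The function names are hypothetical.

```python
def proportion_score(allele_a, allele_b):
    """Proportion = (Max - Min) / (Max + Min) for one informative SNP."""
    high, low = max(allele_a, allele_b), min(allele_a, allele_b)
    if high + low == 0:
        return None  # no signal on either allele, so the score is undefined
    return (high - low) / (high + low)


def mean_proportion(scores):
    """Average the scores of informative SNPs, skipping None entries."""
    informative = [s for s in scores if s is not None]
    if not informative:
        return None
    return sum(informative) / len(informative)


print(proportion_score(1.0, 1.0))  # balanced diploid SNP
print(proportion_score(2.0, 0.0))  # complete imbalance
print(mean_proportion([proportion_score(3.0, 1.0), None, proportion_score(1.0, 1.0)]))
```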

Questions? Hands On & Demo

Data Set
• 20 paired tumor/normal samples
• Kindly provided by Ian Campbell, Peter MacCallum Cancer Centre
• Could easily be run paired
• Run on the Affymetrix Human SNP 6.0 arrays
  • 900K SNPs & 900K CNVs for 1.8M total genetic markers