DNA Sequencing by MALDI-TOF MS using alkali cleavage of RNA / DNA Chimera LIX Bioinformatics Colloquium 2010, Ecole Polytechnique, November 8-10, 2010

Picaud Vincent [email protected] CEA/LIST (lab. LOAD)

November 8, 2010

Introduction

Algorithm, some details

Conclusion

Bibliography

Outline

1. Introduction
   - ReaDNA project
   - DNA re-sequencing by MALDI-TOF MS
2. Algorithm, some details
   - Spectra processing
   - Distance fragment-sequence
   - Fragment generation
   - Sequence modification
   - Sequence merit factor
3. Conclusion
4. Bibliography


ReaDNA project

REvolutionary Approaches and Devices for Nucleic Acid analysis (ReaDNA)
Coordinator: Dr. Ivo Glynne GUT ([email protected])
European project FP7-HEALTH-2007-1, 48 months, 19 organizations

1. The objectives of ReaDNA are to provide solutions for several currently unmet needs of the very diverse aspects of nucleic acid analysis.
2. DNA re-sequencing by MALDI-TOF MS is only a small part of the whole project (WP2).
3. In this project, Florence Mauger, CEA/CNG Evry ([email protected]), is my contact for the chemical aspects of the work and for experimental data.


DNA (re)sequencing (approx. 400 bp)

Mutations we want to discover:

reference: GGGT-GATTGCTGTACTTGCTTGTTAGCATGGGG
analyzed:  GGGTTGATTGCTGTCCTTGCTTG-CAGCATGGGG

Inputs:
- A reference sample: DNA sequence (∼ 400 bp) and mass spectra
- A sample to analyze: mass spectra

Output: Mutations present in the analyzed sequence

Method [1, 2]: Search for the differences in the analyzed sequence that explain the differences in the mass spectra peak patterns.


Experimental setup

[Figure: base-specific cleavage of the example sequence 5' GGGTTGATTGCTGTACTTGCTTGTAAGCATGGGGAGG 3' and of its complementary sequence CCCAACTAACGACATGAACGAACATTCGTACCCCTCC. Each cleavage reaction (A, C, T and G) is measured by MALDI-ToF and gives one spectrum (intensity I versus m/z). A cleavage yields GGGTTGA, TTGCTGTA, CTTGCTTGTA, A, GCA, TGGGGA, ...; C cleavage yields GGGTTGATTGC, TGTAC, TTGC, TTGTAAGC, ...]

For each sample we can get 4 spectra, or 8 spectra if we also consider the complementary strand: the information is redundant.
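The complementary sequence shown on this slide can be recomputed base by base. The following is a minimal sketch, not part of the original demonstrator; the function names `complementBase` and `complementStrand` are hypothetical, and only the four letters A, C, G, T are assumed to occur.

```cpp
#include <cassert>
#include <string>

// Watson-Crick partner of one base (A<->T, C<->G).
char complementBase(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:
            assert(false && "unexpected base");
            return base;
    }
}

// Complementary strand written position by position under the input,
// i.e. it reads 3'->5' when the input reads 5'->3' (as on the slide).
std::string complementStrand(const std::string& sequence) {
    std::string result;
    result.reserve(sequence.size());
    for (char base : sequence) result += complementBase(base);
    return result;
}
```

Cleaving this second strand with the same four reactions is what gives the 4 additional, redundant spectra.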


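Tiny example

reference sequence: TTCCTCTACTTGCTTGTA
- cleavage A: TTCCTCTA, CTTGCTTGTA
- cleavage G: TTCCTCTACTTG, CTTG, ...

sample (1 SNP): TTCCTCTACTTTCTTGTA
- cleavage A: TTCCTCTA, CTTTCTTGTA
- cleavage G: TTCCTCTACTTTCTTG, ...

[Figure: intensity I versus m/z for each cleavage reaction; a peak present only in the reference spectrum is a "peak death", a peak present only in the sample spectrum is a "peak birth".]

A SNP in the sample can be discovered by studying the differences in peak patterns.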
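The tiny example can be replayed in a few lines. This is a minimal sketch under stated assumptions: the sequence is cut after every occurrence of the cleavage base, and a peak is identified by the composition of its fragment (sorted letters) because a mass does not encode letter order. The names `cleaveAfter`, `composition` and `peakChanges` are hypothetical and are not taken from the author's code.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <vector>

// Cut `sequence` after every occurrence of `base`.
std::vector<std::string> cleaveAfter(const std::string& sequence, char base) {
    std::vector<std::string> fragments;
    std::string current;
    for (char c : sequence) {
        current += c;
        if (c == base) {
            fragments.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) fragments.push_back(current);  // trailing piece
    return fragments;
}

// Letter composition of a fragment, as a sorted string.
std::string composition(std::string fragment) {
    std::sort(fragment.begin(), fragment.end());
    return fragment;
}

// composition -> (occurrences in sample) - (occurrences in reference);
// entries with no change are dropped.
std::map<std::string, int> peakChanges(const std::string& reference,
                                       const std::string& sample, char base) {
    std::map<std::string, int> delta;
    for (const auto& f : cleaveAfter(sample, base)) ++delta[composition(f)];
    for (const auto& f : cleaveAfter(reference, base)) --delta[composition(f)];
    for (auto it = delta.begin(); it != delta.end();) {
        if (it->second == 0) it = delta.erase(it); else ++it;
    }
    return delta;
}
```

For the G cleavage of the two sequences above, a positive entry corresponds to a peak birth (or height increase) and a negative entry to a peak death (or height decrease).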


Experimental constraints

[Figure: reference and sample spectra over m/z 1750 to 2100 with the potential masses marked, showing a peak death and an example of a parasite peak.]

- Mass spectra contain some parasite peaks.
- With the current experimental protocol, T cleavage is not possible; peaks not containing T are hardly detectable.
- The mass window of the mass spectra restricts the fragment size from 3 b (∼ 900 m/z) to ≈ 12 b (∼ 4500 m/z).
- The relation between peak height and number of occurrences of the fragment is not linear.
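Two of these constraints can be turned into a rough pre-filter on fragments. The sketch below is only an illustration: it uses the fragment length as a stand-in for the mass window (3 to 12 bases, the approximate figures quoted above) instead of computing masses, and the names `insideMassWindow` and `likelyDetectable` are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Assumed bounds, taken from the approximate 3 b to 12 b window quoted above.
constexpr std::size_t kMinBases = 3;
constexpr std::size_t kMaxBases = 12;

// Length-based approximation of "inside the mass window".
bool insideMassWindow(const std::string& fragment) {
    return fragment.size() >= kMinBases && fragment.size() <= kMaxBases;
}

// A fragment is considered likely detectable when it is inside the window
// and contains at least one T.
bool likelyDetectable(const std::string& fragment) {
    const bool containsT = fragment.find('T') != std::string::npos;
    return insideMassWindow(fragment) && containsT;
}
```

A real filter would compare the theoretical mass of each fragment with the actual window of the instrument.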


Experimental constraints: sequence and fragments view

[Figure legend]
- undetectable fragment (out of the mass window, T-cleavage...)
- fragments with occurrence = 1
- fragments with occurrence > 1
- cleavage letter
- no redundancy
- good redundancy (x4)
- known potential SNP in the literature


Algorithm overview

The main steps of the algorithm are:

1. Search for peak pattern changes in spectra.
2. For fragments associated with peak births (or increases), search for those which are "close" to the analyzed sequence.
3. Modify the analyzed sequence to explain those new fragments: this generates a lot of potential candidates.
4. Use the redundancy to select the best sequences.
5. Restart and iterate.


Algorithm overview

[Figure: the analyzed sequence and the cleaved fragments (from spectra).]

- All positions corresponding to new fragments close to the sequence are located.
- For each position, the set of all possible variations is computed.


Algorithm overview

The set of all "compatible" configurations is generated (cardinal of this set [...]


Fragment generator

Here $\alpha_\#(X)$ denotes the number of occurrences of the letter $\alpha$ in $X$.

1. Substitution $y_i \Leftrightarrow \alpha$:

\[ R_{i+1} - R_i = \begin{cases} 1 & \text{if } y_i = \alpha \\ 0 & \text{otherwise} \end{cases} \]

   1. Case $y_i = \alpha$: no extra condition.

   2. Case $y_i \neq \alpha$:

\[ M_{i+1} - M_i + R_{i+1} - R_i = \min(\alpha_\#(S'_i) - 1, \alpha_\#(y_{i..n})) + \min({y_i}_\#(S'_i), {y_i}_\#(y_{i..n}) - 1) - \min(\alpha_\#(S'_i), \alpha_\#(y_{i..n})) - \min({y_i}_\#(S'_i), {y_i}_\#(y_{i..n})) + 0 \]

      The condition to fulfill $M_{i+1} - M_i + R_{i+1} - R_i = 0$ is

\[ (\alpha_\#(S'_i) > \alpha_\#(y_{i..n})) \wedge ({y_i}_\#(y_{i..n}) > {y_i}_\#(S'_i)) \]

      This means: (we have enough $\alpha$ for future needs) AND (we won't be able to realize all perfect matches for $y_i$).

2. Insertion of $\alpha$ before $y_i$ (if $|y| < |S'_i|$): $R_{i+1} - R_i = 0$ and

\[ M_{i+1} - M_i + R_{i+1} - R_i = \min(\alpha_\#(S'_i) - 1, \alpha_\#(y_{i..n})) - \min(\alpha_\#(S'_i), \alpha_\#(y_{i..n})) \]

   The condition to fulfill $M_{i+1} - M_i + R_{i+1} - R_i = 0$ is

\[ \alpha_\#(S'_i) > \alpha_\#(y_{i..n}) \]

We can now write down the complete algorithm.


Fragment generator

FragmentGenerator(i, y, y', S'_i)
    // End of recursion?
    if (|S'_i| = 0) then
        Record y' and exit
    // For each α ∈ S'_i try all moves...
    for α ∈ S'_i do
        // Insertion of α?
        if (|y| < |S'_i|) ∧ (α#(S'_i) > α#(y_i..n)) then
            Call FragmentGenerator(i, y, y' + α, S' − {α})
        // Substitution y_i ⇔ α?
        if (y_i = α) ∨ ((α#(S'_i) > α#(y_i..n)) ∧ (y_i#(y_i..n) > y_i#(S'_i))) then
            Call FragmentGenerator(i + 1, y, y' + α, S' − {α})
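The recursion above can be sketched in C++ as follows. This is an independent illustration, not the author's demonstrator. All names (`LetterBag`, `fragmentGenerator`, `alignments`, `distinctResults`) are hypothetical. Two readings are assumptions: the length test of the insertion move is applied to the part of y still to be processed, and S' is padded with '-' when y' is shorter than y. Inserted letters are recorded with a leading '+'.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Multiset of letters: letter -> remaining count (zero counts are erased).
using LetterBag = std::map<char, std::size_t>;

std::size_t countIn(const LetterBag& bag, char x) {
    auto it = bag.find(x);
    return it == bag.end() ? 0 : it->second;
}

std::size_t bagSize(const LetterBag& bag) {
    std::size_t n = 0;
    for (const auto& kv : bag) n += kv.second;
    return n;
}

// Occurrences of x in the suffix y[i..] (0-based i).
std::size_t countFrom(const std::string& y, std::size_t i, char x) {
    std::size_t n = 0;
    for (std::size_t k = i; k < y.size(); ++k) n += (y[k] == x);
    return n;
}

// Copy of the bag with one occurrence of x removed.
LetterBag without(LetterBag bag, char x) {
    auto it = bag.find(x);
    assert(it != bag.end());
    if (--it->second == 0) bag.erase(it);
    return bag;
}

void fragmentGenerator(std::size_t i, const std::string& y, const std::string& coded,
                       const LetterBag& bag, std::vector<std::string>& out) {
    if (bag.empty()) { out.push_back(coded); return; }  // end of recursion
    const std::size_t remaining = y.size() - i;         // assumed meaning of |y|
    for (const auto& kv : bag) {
        const char a = kv.first;
        const bool enough = kv.second > countFrom(y, i, a);
        // Insertion of a before y[i]
        if (remaining < bagSize(bag) && enough)
            fragmentGenerator(i, y, coded + '+' + a, without(bag, a), out);
        // Substitution y[i] <-> a (only while letters of y remain)
        if (i < y.size()) {
            const char yi = y[i];
            if (yi == a || (enough && countFrom(y, i, yi) > countIn(bag, yi)))
                fragmentGenerator(i + 1, y, coded + a, without(bag, a), out);
        }
    }
}

// All coded alignments of y with the letters of yp.
std::vector<std::string> alignments(const std::string& y, const std::string& yp) {
    LetterBag bag;
    for (char c : yp) ++bag[c];
    if (yp.size() < y.size()) bag['-'] = y.size() - yp.size();  // deletions
    std::vector<std::string> out;
    fragmentGenerator(0, y, "", bag, out);
    return out;
}

// Same results with insertion and deletion marks stripped.
std::set<std::string> distinctResults(const std::string& y, const std::string& yp) {
    std::set<std::string> results;
    for (const auto& coded : alignments(y, yp)) {
        std::string plain;
        for (char c : coded) if (c != '+' && c != '-') plain += c;
        results.insert(plain);
    }
    return results;
}
```

With these assumptions, `alignments("ACGGT", "AAG")` and `alignments("AT", "ACTT")` enumerate the cases of the two examples discussed in this talk.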


Fragment generator

1. y = ACGGT and y' = AAG ⇒ S'_1 = {(A, 2), (G, 1), ('-', 2)}

   6 ways to keep d(y, y') = 3:

    #  | y     | y'
    1  | ACGGT | A--GA
    2  | ACGGT | A-AG-
    3  | ACGGT | A-G-A
    4  | ACGGT | A-GA-
    5  | ACGGT | AA-G-
    6  | ACGGT | AAG--

   If we do not care about the insert positions, only 2 different results:

    1 | AAG
    2 | AGA

2. y = AT and y' = ACTT ⇒ S'_1 = {(A, 1), (C, 1), (T, 2)}

   12 ways to keep d(y, y') = 2:

    #  | y    | y'   | coded
    1  | --AT | CTAT | +C+TAT
    2  | -A-T | CATT | +CA+TT
    3  | -AT- | CATT | +CAT+T
    4  | --AT | TCAT | +T+CAT
    5  | -A-T | TACT | +TA+CT
    6  | -AT- | TATC | +TAT+C
    7  | A--T | ACTT | A+C+TT
    8  | A-T- | ACTT | A+CT+T
    9  | A--T | ATCT | A+T+CT
    10 | A-T- | ATTC | A+TT+C
    11 | AT-- | ATCT | AT+C+T
    12 | AT-- | ATTC | AT+T+C

   If we do not care about the insert positions, only 8 different results:

    1 | ACTT
    2 | ATCT
    3 | ATTC
    4 | CATT
    5 | CTAT
    6 | TACT
    7 | TATC
    8 | TCAT


Sequence modification

For each configuration the resulting sequences are generated.

[Figure: generated sequences.]

For each generated sequence, a merit factor is computed (using the information redundancy); only the best candidates are preserved.


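Sequence modification

Why do we try to process all fragments in parallel?

Independent modifications do not allow the use of redundancy and are likely to be discarded when only the best sequences are preserved.

[Figure: the sequence with its A, C and G cleavage fragments, modifications processed independently.]

This phenomenon cannot happen if we are able to process all the potential modifications in parallel.

[Figure: the sequence with its A, C and G cleavage fragments, modifications processed in parallel.]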


Best generated sequences

For each configuration the resulting sequences are generated.

[Figure: generated sequences.]

For each generated sequence, a merit factor is computed (using the information redundancy); only the best candidates are preserved.


Example: algorithm inputs

All the generated sequences are sorted and pruned according to a merit factor; only the best candidates are reused for the next iteration. The merit factor can be computed by:

1. checking peaks in the raw spectra data (likelihood computation)
2. a simple count of the unexplained fragments (only the peak list is used)

In both cases this is the most time-consuming operation.

The inputs are all the observed peak pattern changes in the spectra, with their associated "fragments". The number of sequences examined by the algorithm is part of the output.

INPUTS:
*******
Observed changes (from spectra)
Theo. mass  Compomer          Reference  Variation  Dist/Ref  Type
2496.58     (A)CCGGTTT(A)     0           1         1         PEAK BIRTH
2511.59     (A)CGGTTTT(A)     2          -1                   height decrease
2247.42     (A)GGGTTT(A)      1          -1                   PEAK DEATH
1887.2      (C)AAGTT(C)       0           1         1         PEAK BIRTH
2799.79     (C)AAGTTTTT(C)    1          -1                   PEAK DEATH
2167.37     (G)ACCTTT(G)      0           1         1         PEAK BIRTH
2182.38     (G)ACTTTT(G)      1          -1                   PEAK DEATH
1284.82     (G)ATT(G)         2           1         1         height increase

OUTPUTS, number of examinated sequences #10561
**********************************************
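The "Type" column of the input listing follows from the "Reference" and "Variation" columns, and the second merit factor is a plain count. The sketch below illustrates both under stated assumptions: the names `PeakChange`, `classify`, `ObservedChange` and `unexplainedCount` are hypothetical, and an observed change is counted as explained when the candidate predicts exactly reference + variation occurrences of the compomer. It is not the scoring formula of the tool, whose scores (such as -0.166667) are not derived here.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class PeakChange { None, Birth, Death, HeightIncrease, HeightDecrease };

// reference: occurrences of the compomer in the reference sample.
// variation: observed change of that count in the analyzed sample.
PeakChange classify(int reference, int variation) {
    assert(reference >= 0 && reference + variation >= 0);
    if (variation == 0) return PeakChange::None;
    if (reference == 0) return PeakChange::Birth;              // 0 -> some
    if (reference + variation == 0) return PeakChange::Death;  // some -> 0
    return variation > 0 ? PeakChange::HeightIncrease
                         : PeakChange::HeightDecrease;
}

struct ObservedChange {
    std::string compomer;
    int reference;
    int variation;
};

// Number of observed changes that a candidate sequence does not reproduce.
// candidateCounts maps a compomer to its occurrences in the candidate.
int unexplainedCount(const std::vector<ObservedChange>& observed,
                     const std::map<std::string, int>& candidateCounts) {
    int unexplained = 0;
    for (const auto& change : observed) {
        auto it = candidateCounts.find(change.compomer);
        const int predicted = (it == candidateCounts.end()) ? 0 : it->second;
        if (predicted != change.reference + change.variation) ++unexplained;
    }
    return unexplained;
}
```

A candidate with a smaller count explains more of the observed peak pattern changes, so candidates could be ranked by this value before pruning.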


Example: algorithm output #1

OUTPUT # 1, score -0.166667
=====================================
SNP # 2
16234 : C -> T (for reverse strand : G -> A )
16316 : A -> G (for reverse strand : T -> C )
Configuration (fragment modifications):
[16232,16236] : GTGTG -> GTATG        (G)ATT(G)
[16310,16316] : CATGTAT -> CATGTAC    (C)AAGTT(C)

Theo. mass  Compomer                              Reference  Candidate  Expected
2496.58     (A)CCGGTTT(A)                         0          Expected   1         : OK
2511.59     (A)CGGTTTT(A)                         2          Expected   1         : OK
2247.42     (A)GGGTTT(A)                          1          Expected   0         : OK
980.622     (A)GT(A)                              Expected   4          3         : #
1284.82     (A)GTT(A)                             Expected   6          5         : U HEIGHT VARIATION +1
9787.24     (C)AAAAAAAGGGGGGGGGGGTTTTTTTTTTTT(C)  Expected   1          0         : #
9803.24     (C)AAAAAAGGGGGGGGGGGGTTTTTTTTTTTT(C)  Expected   0          1         : #
1887.2      (C)AAGTT(C)                           0          Expected   1         : OK
2799.79     (C)AAGTTTTT(C)                        1          Expected   0         : OK
931.584     (C)TT(C)                              Expected   2          1         : #
2167.37     (G)ACCTTT(G)                          0          Expected   1         : OK
2182.38     (G)ACTTTT(G)                          1          Expected   0         : OK
1284.82     (G)ATT(G)                             2          Expected   3         : OK
667.415     (G)T(G)                               Expected   10         12        : #
338.208     (T)(T)                                Expected   35         36        : #
940.597     (T)AC(T)                              Expected   8          7         : #
667.415     (T)G(T)                               Expected   14         15        : #

How to read this output:
- First line: first output, with its score (highest is best).
- SNP lines: the discovered mutations (2 SNP), with their positions in the analyzed sequence.
- Configuration lines (not important): fragments that initiated the modification during the recursive search.
- Table: all the modifications induced by the 2 SNP.
- Rows marked OK: modifications induced by the 2 SNP and confirmed by the input (peak pattern changes).
- Rows marked #: modifications induced by the 2 SNP that cannot be confirmed because they are not detectable (out of the mass window for instance).
- Row marked U HEIGHT VARIATION +1: modification induced by the 2 SNP in contradiction with the observed peaks. Remark: here it is only a small peak height variation.


Example: algorithm output #2

OUTPUT # 2, score -1
=====================================
SNP # 2
16236 : C -> - (for reverse strand : G -> - )
16316 : A -> G (for reverse strand : T -> C )
Configuration (fragment modifications):
[16234,16239] : GTGTAG -> GTTAG       (G)ATT(G)
[16310,16316] : CATGTAT -> CATGTAC    (C)AAGTT(C)

Theo. mass  Compomer                              Reference  Candidate  Expected
2496.58     (A)CCGGTTT(A)                         0          Expected   1         : OK
2511.59     (A)CGGTTTT(A)                         2          Expected   1         : OK
2247.42     (A)GGGTTT(A)                          1          Expected   0         : OK
1918.22     (A)GGTTT(A)                           Expected   1          0         : FALSE NEG PEAK
9803.24     (C)AAAAAAGGGGGGGGGGGGTTTTTTTTTTTT(C)  Expected   0          1         : #
9474.03     (C)AAAAAAGGGGGGGGGGGTTTTTTTTTTTT(C)   Expected   1          0         : #
1887.2      (C)AAGTT(C)                           0          Expected   1         : OK
2799.79     (C)AAGTTTTT(C)                        1          Expected   0         : OK
931.584     (C)TT(C)                              Expected   2          1         : #
2167.37     (G)ACCTTT(G)                          0          Expected   1         : OK
2182.38     (G)ACTTTT(G)                          1          Expected   0         : OK
980.622     (G)AT(G)                              Expected   3          4         : #
1284.82     (G)ATT(G)                             2          Expected   3         : OK
667.415     (G)T(G)                               Expected   11         12        : #
651.415     (T)A(T)                               Expected   8          9         : #
940.597     (T)AC(T)                              Expected   8          7         : #
667.415     (T)G(T)                               Expected   14         15        : #

How to read this output:
- Second best solution, with its score and with 1 different SNP (the other being equal to the one suggested in the first solution).
- Row marked FALSE NEG PEAK: false negative peak that DISCARDS this solution (can be confirmed by visual inspection of the spectrum).


Conclusion

1. The interpretation of experimental spectra can be problematic.

   [Figure: reference and sample spectra over m/z 1750 to 2100 with the potential masses marked, showing a peak death and an example of a parasite peak.]

2. But artificial tests of the reconstruction algorithm have shown good behavior (for instance, 21 SNP correctly found for a 160 bp long sequence).

    01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
sol:TCAGCTGAGGCAAGTCTAAAGTACGAAGTAGATAAGCCCAGGGCTTTCAAGTAATGCTTCAAAAGGT-TTTCAACTATATGCTTTCTACTTTCTCCGAGAAGCGGCA-TT
ref:TCAGCTGAGGCAAG-CTAAAG-ACGAAGTAGATAAG-CCAGGGC-TTCAAGTAATGCTTCAAAAGGTGTTGCAACTATATGCTTCCAACTTTCTCCGA-AAGC-GCAGTT
                  !      !              !       !                      !  !             ! !           !    !   !

    01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
sol:AGCGGCA-TTACGCTTTATAAAGGGGGAGGGGTTGTATTGGCCCGAAAAGTTTCTTGAATCGATAATAGAGAGCTCCCGAAAATGA-TGGAATAGGCTGTGAGCTTGATG
ref:AGC-GCAGTTACGC-TTATAAAGGGGGAGAGGTTGTAGT-GCCTGAAAAGTTTCTTGAATCGATAATAGAG-GCTCCCGAAAATGACTGGAATAGGCTGT-TGCTTGATG
       !   !      !              !       ! !   !                           !              !             !!

    012345678901234567890123456789012345678901234567890123456789
sol:TGATGGACTTGCAGTTGGAAAGGGAGATGTTTCACCTGAAGAATTTTACGCTGTTACCAA
ref:TGATGGACTTACAGTTGGAAAGGGAGATGTTTCACCTGAAGAATTTTACGCTGTTACCAA
              !

3. Currently working on a GUI to make the tools more usable by our partners (CEA/CNG Evry).

4. All presented algorithms have been extended to support "masks", i.e. letter groups for which no modification is allowed.

5. Parallelization is easy.


Bibliography

[1] S. Böcker. SNP and mutation discovery using base-specific cleavage and MALDI-TOF mass spectrometry. Bioinformatics, 19(Suppl 1):i44, 2003.

[2] F. Mauger, O. Jaunay, V. Chamblain, F. Reichert, K. Bauer, I.G. Gut, and D.H. Gelfand. SNP genotyping using alkali cleavage of RNA/DNA chimeras and MALDI time-of-flight mass spectrometry. Nucleic Acids Research, 34(3):e18, 2006.


Appendix (C++ demonstrator)

Fragment generator: main

// not coded to be efficient but to follow as close as possible
// the 'slide' presentation
#include [...]
#include [...]
#include [...]
#include [...]
#include [...]
using namespace std;

int main()
{
  const std::string y("AT"),yp("ACTT");
  // const std::string y("ACGGT"),yp("AAG");
  // const std::string y(""),yp("AAC");
  // const std::string y("AAC"),yp("");
  std::cerr [...]

  [...] second;
    else
      return 0;
  }

  MultiSet operator-(char x) const
  {
    MultiSet buffer(*this);
    Data::iterator iter=buffer.data_.find(x);
    assert( (iter!=end())&&(iter->second>0) );
    if( (--iter->second)==0) buffer.data_.erase(iter);
    return buffer;
  }
};

size_t countLetter(const char x,const size_t i,const std::string& y)
{
  return std::count(y.begin()+i,y.end(),x);
}
