QPalma - Optimal Spliced Alignments of Short Sequence Reads
Fabio De Bona (1)
Stephan Ossowski (2)
Korbinian Schneeberger (2)
Gunnar Rätsch (1)
(1) Friedrich Miescher Laboratory of the Max Planck Society
(2) Max Planck Institute for Developmental Biology
Tübingen, Germany
September 25, 2008
Fabio De Bona (FML, Tubingen) QPalma September 25, 2008 1 / 18
Introduction
Next Generation (NG) sequencing technologies produce huge amounts of sequencing data
Differences from Sanger sequencing:
  - Much cheaper and faster
  - Many more, and shorter, fragments ⇒ “reads”
  - Many more errors
Genome / transcriptome sequencing
Data can be used via local alignments:
  - Discovery of new genes
  - Identification of alternative splice forms, ...
Transcriptome Analysis with NG Sequencing
QPalma’s aim: accurately align transcriptome reads to genomic sequences
Genomic read mapping is already challenging
Transcriptome read mapping is even more difficult:
  - Spliced reads stem from several genomic regions
  - Short reads get even shorter due to splicing
  ⇒ Alignments are more error-prone
Improve alignments by using more information:
  - Accurate splice site models
  - Intron length model
  - Quality score model
Idea: Use a machine learning method to infer optimal parameters
Spliced vs. Unspliced Alignments
Find matching region on genome with a few mismatches
Efficient data structures for mapping many reads
Most current mapping techniques are limited to unspliced reads
Extended Smith-Waterman Algorithm
Classical scoring f : Σ × Σ → ℝ
Sources of information:
  - Sequence matches
  - Computational splice site predictions
  - Intron length model
  - Read quality information
Quality scoring f : (Σ × ℝ) × Σ → ℝ
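The change of signature, from a score on (read base, genome base) to a score on ((read base, quality), genome base), can be sketched as follows. This is a minimal illustration and not QPalma's learned scoring: the function names, the default `match` and `mismatch` values, and the Phred interpretation of the quality value are all assumptions made for the example.

```python
# Hypothetical illustration of the two scoring signatures.

def classical_score(read_base: str, genome_base: str,
                    match: float = 1.0, mismatch: float = -1.0) -> float:
    """f : Sigma x Sigma -> R. Depends only on the two bases."""
    return match if read_base == genome_base else mismatch


def quality_score(read_base: str, quality: float, genome_base: str,
                  match: float = 1.0, mismatch: float = -1.0) -> float:
    """f : (Sigma x R) x Sigma -> R. Also depends on the base quality.

    Assumption: `quality` is a Phred score, so the probability that the
    base call is wrong is 10 ** (-quality / 10). The classical score is
    scaled by the probability that the call is right, so that low-quality
    bases contribute less to the alignment score.
    """
    p_correct = 1.0 - 10.0 ** (-quality / 10.0)
    return p_correct * classical_score(read_base, genome_base, match, mismatch)
```

With this toy function a mismatch at a low-quality position is penalized less than a mismatch at a high-quality position, which is the kind of behaviour a quality-aware score is meant to allow.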
Scoring Parameter Inference
What are optimal parameters?
How do we jointly optimize 336 parameters?
Cartoon: Maximize the Margin
[Figure: scores of the true alignment and of false alignments]
Correct alignment is not the highest-scoring one
Correct alignment is the highest-scoring one
Can we do better?
Use a technique motivated by “large-margin” methods
Idea: Enforce a margin between correct and incorrect examples
One has to solve a huge quadratic problem
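One common way to write such a large-margin problem is sketched below. The notation is an assumption and is not taken from the talk: $\theta$ collects the scoring parameters, $s_\theta(A)$ is the score of an alignment $A$, $A_i^{+}$ is the true alignment of training read $i$, $A^{-}$ ranges over its incorrect alignments, $\xi_i$ are slack variables, and $C$ is a regularization constant.

```latex
\begin{aligned}
\min_{\theta,\;\xi \ge 0}\quad
  & \tfrac{1}{2}\,\lVert \theta \rVert^{2} + C \sum_{i=1}^{N} \xi_i \\
\text{s.t.}\quad
  & s_\theta(A_i^{+}) - s_\theta(A^{-}) \;\ge\; 1 - \xi_i
  \qquad \text{for all } i \text{ and all } A^{-} \ne A_i^{+}.
\end{aligned}
```

If the score is linear in $\theta$, the objective is quadratic and the constraints are linear. There is one constraint per incorrect alignment of every training read, which is one way to see why the problem becomes huge.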
How Can We Generate Data for Training?
How do we obtain true alignments for training QPalma?
Simulate realistic transcriptome reads
  - Consider a well-annotated genome: Arabidopsis thaliana
  - Use short genomic reads (38 nt) with natural quality scores
  - Use the reference annotation (TAIR7)
  - Generate artificially spliced reads:
⇒ Identify reads overlapping into introns
⇒ Remove overlapping parts
⇒ Combine both exonic parts
⇒ Truncate to desired length (38nt)
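The four steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' simulation code: the `Read` tuple layout, the half-open intron interval, and truncation from the 3' end are all choices made for the example, and it assumes one genomic read overlapping each side of the intron.

```python
from typing import List, Tuple

# Hypothetical read representation: (genomic start, bases, per-base qualities).
Read = Tuple[int, str, List[int]]


def exonic_part(read: Read, intron_start: int,
                intron_end: int) -> Tuple[str, List[int]]:
    """Remove the bases of a genomic read that fall inside the intron.

    The intron is the half-open interval [intron_start, intron_end).
    """
    start, bases, quals = read
    kept = [(b, q) for offset, (b, q) in enumerate(zip(bases, quals))
            if not (intron_start <= start + offset < intron_end)]
    return "".join(b for b, _ in kept), [q for _, q in kept]


def make_spliced_read(upstream: Read, downstream: Read,
                      intron_start: int, intron_end: int,
                      length: int = 38) -> Tuple[str, List[int]]:
    """Combine the exonic parts of two reads and truncate to `length`."""
    up_bases, up_quals = exonic_part(upstream, intron_start, intron_end)
    down_bases, down_quals = exonic_part(downstream, intron_start, intron_end)
    bases = up_bases + down_bases
    quals = up_quals + down_quals
    # Assumption: truncate from the end; the natural qualities stay attached
    # to their bases.
    return bases[:length], quals[:length]
```

Because the intron position is taken from the annotation, the true spliced alignment of every generated read is known, which is what makes these reads usable as training labels.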
Results on Artificially Spliced Reads
Given: Short reads and corresponding genomic regions
Task: Find the correct spliced alignment
Trained on 10,000 artificially spliced genome reads
Tested on 30,000 spliced reads
Alignment error rate:
  SmithW                   14.19%
  Intron                    9.96%
  Intron+Splice             1.94%
  Intron+Splice+Quality     1.78%
Error Rate Depends on Intron Position
1. Trust introns confirmed by a spliced read with ≥ 6 nt overlap
2. Pure Smith-Waterman algorithm would need longer overlaps and would still perform worse
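The first rule can be read as a simple filter on spliced alignments. The sketch below assumes that "overlap" means the number of read bases aligned on each side of the intron; that reading, and the function and parameter names, are assumptions for illustration.

```python
MIN_OVERLAP = 6  # threshold stated on the slide, in nucleotides


def confirms_intron(left_exon_len: int, right_exon_len: int,
                    min_overlap: int = MIN_OVERLAP) -> bool:
    """True if the read covers at least `min_overlap` bases on both sides."""
    return min(left_exon_len, right_exon_len) >= min_overlap
```

For a 38 nt read, a 32 + 6 split would confirm the intron under this rule, while a 35 + 3 split would not.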
Can We Use QPalma for Whole-Genome Alignments?
So far we had two assumptions:
1. All reads are spliced
2. Genomic region is known
Neither holds for whole-genome data:
1. Many reads will be fully contained in an exon
2. Direct alignment to the genome is too expensive (O(n · m))
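The O(n · m) cost comes from the dynamic-programming table, which has one cell per pair of read position and genome position. The sketch below is the plain, unspliced Smith-Waterman score with a linear gap penalty, not QPalma's extended algorithm; the scoring values are illustrative assumptions.

```python
def smith_waterman_score(read: str, genome: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of `read` against `genome`.

    Fills an (n + 1) x (m + 1) table row by row, keeping only two rows
    in memory, so time is O(n * m) and memory is O(m).
    """
    n, m = len(read), len(genome)
    prev = [0] * (m + 1)
    best = 0
    for i in range(1, n + 1):
        curr = [0] * (m + 1)
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if read[i - 1] == genome[j - 1]
                                  else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

For a 38 nt read and a genome of the order of 10^8 bases this is several 10^9 cell updates per read, which is why the reads are first restricted to seed regions.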
A Pipeline for Efficient Large Scale Alignment
Pipeline Workflow
1. Map unspliced reads & find seed regions
  - Use a suffix-array based method to find short read matches with at most two mismatches (Vmatch, Kurtz et al.; ShoRe, Ossowski et al. 2008)
  - ≈ 4 h for 2.6 × 10^6 reads
2. Recover potentially spliced reads from the first mapping
  - Use a fast approximation of QPalma to decide which reads may be spliced (even if they can be mapped well)
  - ≈ 17 min for 2.2 × 10^6 reads
3. Identify accurate alignments for the candidate spliced reads
  - Use the full QPalma model to align the remaining reads
  - ≈ 8 h for ∼ 442,000 reads
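The control flow of the three steps can be sketched as a toy pipeline. Everything here is a stand-in: `hamming_hits` replaces the suffix-array mapper with a brute-force scan, and `may_be_spliced` and `align_spliced` are hypothetical callables standing for the fast QPalma approximation and the full QPalma model. Sending unmapped reads to the candidate set is also an assumption of this sketch.

```python
from typing import Callable, Dict, List, Tuple


def hamming_hits(read: str, genome: str, max_mismatches: int = 2) -> List[int]:
    """Step 1 stand-in: positions where `read` matches with few mismatches."""
    hits = []
    for pos in range(len(genome) - len(read) + 1):
        mismatches = 0
        for a, b in zip(read, genome[pos:pos + len(read)]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:  # inner loop finished without exceeding the mismatch limit
            hits.append(pos)
    return hits


def run_pipeline(reads: Dict[str, str], genome: str,
                 may_be_spliced: Callable[[str, List[int]], bool],
                 align_spliced: Callable[[str, str], object]
                 ) -> Tuple[Dict[str, List[int]], Dict[str, object]]:
    unspliced: Dict[str, List[int]] = {}
    candidates: List[str] = []
    for name, seq in reads.items():
        hits = hamming_hits(seq, genome)          # step 1: unspliced mapping
        if hits and not may_be_spliced(seq, hits):
            unspliced[name] = hits
        else:                                     # step 2: recover candidates
            candidates.append(name)
    # Step 3: only the candidates go through the expensive aligner.
    spliced = {name: align_spliced(reads[name], genome) for name in candidates}
    return unspliced, spliced
```

The point of the staging is that the expensive step runs on a small fraction of the reads; in the timings above, roughly 442,000 of the initial 2.6 × 10^6.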
Application to Natural Transcriptome Reads
RNA-seq reads for A. thaliana (provided by Weigel lab, MPI Devel. Biology)
4 lanes from Illumina Genome Analyzer 1G (≈ 50× coverage)
38 nt reads, polyA enriched
Strand unspecific, young leaves
Initial read mapping by ShoRe (http://1001genomes.org)
Spliced alignment by QPalma (http://fml.mpg.de/raetsch/projects/qpalma)
⇒ ≈ 30 million unspliced reads
⇒ ≈ 1 million spliced reads
[Figure: tracks for exon coverage, annotation, and intron coverage]
What Can We Do With It?
For instance:
Assemble alignments to obtain splice graphs (describing alternative splicing)
  - Only works well for highly covered genes
Use exon/intron coverage tracks to improve gene finding
  - Considerable improvements appear possible
Estimate relative abundances of alternative transcripts
  - Spliced reads help as they can connect exons
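As a rough illustration of the first item, spliced alignments can be collapsed into a junction graph. This is a simplified sketch, not a method from the talk: each spliced read is reduced to a hypothetical (donor position, acceptor position) pair, and a full splice graph would also carry the exons as nodes.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

Junction = Tuple[int, int]  # (donor position, acceptor position)


def build_junction_graph(junctions: Iterable[Junction], min_support: int = 1
                         ) -> Tuple[Dict[int, Set[int]], Counter]:
    """Keep junctions seen in at least `min_support` spliced reads.

    Returns a donor -> {acceptors} mapping and the per-junction read counts.
    """
    support = Counter(junctions)
    graph = defaultdict(set)
    for (donor, acceptor), count in support.items():
        if count >= min_support:
            graph[donor].add(acceptor)
    return dict(graph), support
```

A donor linked to more than one acceptor is a candidate alternative splicing event. The `min_support` threshold reflects the caveat above: weakly covered genes yield few supporting reads per junction.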
Summary
New method for accurate de novo spliced alignments of short reads
Challenging data: many short & error-prone reads
Proposed algorithm for parameter estimation
Successfully combined many information sources
Can be applied to data from other sequencing platforms (ABI, ...)
Fast enough for whole genome/transcriptome sequencing
  - Currently, very good for smaller genomes
  - When excluding indels, also fast for mouse/human (in progress)
Potential improvements
  - Speed up the dynamic program
  - Integrate seed region finding and spliced alignments
Acknowledgements
Coauthors
  - Stephan Ossowski
  - Korbinian Schneeberger
  - Gunnar Rätsch
Weigel Lab – DNA- and RNA-seq Data
  - Jun Cao
  - Richard Clark
  - Christa Lanz
  - Detlef Weigel
Stipend
  - Friedrich Miescher Laboratory of the Max Planck Society
Thank you
[email protected]
http://www.fml.tuebingen.mpg.de/raetsch/projects/qpalma
Poster D19
Results on Artificially Spliced Reads
Given: Short reads and corresponding genomic regions
Task: Find the correct spliced alignment
Trained on 10,000 artificially spliced genome reads
Tested on 30,000 spliced reads

Quality information   Splice site pred.   Intron length   Error rate
-                     -                   -               14.19 %
+                     -                   -               13.49 %
-                     -                   +                9.96 %
+                     -                   +                9.68 %
-                     +                   -                3.16 %
+                     +                   -                2.81 %
-                     +                   +                1.94 %
+                     +                   +                1.78 %
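
To make the table easier to read, the following sketch computes how much each information source lowers the error rate. It is not part of the talk: the names `ERROR_RATE`, `relative_reduction` and `mean_error` are hypothetical, and it assumes that "+" means the information source was used and "-" that it was not. The only data are the eight error rates listed in the table.

```python
# Error rates (%) copied from the results table, keyed by
# (quality information, splice site prediction, intron length).
# Assumption: True corresponds to "+" (information used), False to "-".
ERROR_RATE = {
    (False, False, False): 14.19,
    (True, False, False): 13.49,
    (False, False, True): 9.96,
    (True, False, True): 9.68,
    (False, True, False): 3.16,
    (True, True, False): 2.81,
    (False, True, True): 1.94,
    (True, True, True): 1.78,
}
FEATURES = ("quality information", "splice site pred.", "intron length")


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers the error compared with `baseline`."""
    return (baseline - improved) / baseline


def mean_error(feature_index, enabled):
    """Average error rate over the four settings where one feature is on or off."""
    rates = [rate for key, rate in ERROR_RATE.items()
             if key[feature_index] == enabled]
    return sum(rates) / len(rates)


# Full model (all three sources) versus the model using none of them.
full_vs_none = relative_reduction(ERROR_RATE[(False, False, False)],
                                  ERROR_RATE[(True, True, True)])
print(f"all sources vs. none: {100 * full_vs_none:.1f} % fewer errors")

# Average error with each source switched off and on.
for index, name in enumerate(FEATURES):
    off = mean_error(index, False)
    on = mean_error(index, True)
    print(f"{name}: {off:.2f} % without, {on:.2f} % with")
```

On these numbers, splice site predictions account for most of the improvement, intron length information helps noticeably, and quality information gives a smaller but consistent gain.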