20140711 4 e_tseng_ercc2.0_workshop

FIND MEANING IN COMPLEXITY For Research Use Only. Not for use in diagnostic procedures. © Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved. Elizabeth Tseng / 2014.07.11 Staff Scientist Technical Variability in PacBio® Full-length cDNA (Iso-Seq™) Sequencing

Transcript of 20140711 4 e_tseng_ercc2.0_workshop

Page 1: 20140711 4 e_tseng_ercc2.0_workshop

FIND MEANING IN COMPLEXITY For Research Use Only. Not for use in diagnostic procedures.

© Copyright 2014 by Pacific Biosciences of California, Inc. All rights reserved.

Elizabeth Tseng / 2014.07.11 Staff Scientist

Technical Variability in PacBio® Full-length cDNA (Iso-Seq™) Sequencing

Page 2: 20140711 4 e_tseng_ercc2.0_workshop

SampleNet: Iso-Seq Method with Clontech® cDNA Synthesis Kit

PacBio’s Iso-Seq™ Method for High-quality, Full-length Transcripts

Figure 1. Iso-Seq experimental pipeline (a) and informatics pipeline (b)

Experimental pipeline:
polyA mRNA → cDNA synthesis with adapters → Size partitioning & PCR amplification → SMRTbell™ ligation → PacBio® RS II sequencing

Informatics pipeline:
PacBio raw sequence reads → Remove adapters / remove artifacts → Clean sequence reads → Reads clustering → Isoform clusters → Consensus calling → Nonredundant transcript isoforms → Quality filtering → Final isoforms → Map to reference genome → Evidence-based gene models

Read structure legend: SMRT® adapter, 5’ primer, coding sequence, polyA tail, 3’ primer; reads of insert

DevNet: Iso-Seq wiki page

Page 3: 20140711 4 e_tseng_ercc2.0_workshop

Iso-Seq Full-length cDNA Library Protocol

•  Total RNA
•  Optional Poly-A Selection → polyA+ RNA
•  Reverse Transcription (SMARTScribe RT) → Full-length 1st Strand cDNA
•  PCR Optimization
•  Large-scale Amplification → Amplified cDNA
•  Size Selection (1-2 kb, 2-3 kb, 3-6 kb)
•  Re-Amplification (1-2 kb, 2-3 kb, 3-6 kb)
•  SMRTbell™ Template Preparation (1-2 kb, 2-3 kb, 3-6 kb)
•  Optional Size Selection (3-6 kb)
•  SMRT® Sequencing

Page 4: 20140711 4 e_tseng_ercc2.0_workshop

Iso-Seq Informatics Pipeline

Per-molecule reads: Full-length (FL) reads and Non-FL reads

Isoform-level clusters: Transcript 1, Transcript 2, Transcript 3

Clusters of transcript alignments using FL + nFL reads: Transcript 1, Transcript 2, Transcript 3

Final transcript consensus: Transcript 1, Transcript 2, Transcript 3

Page 5: 20140711 4 e_tseng_ercc2.0_workshop

Key Features of Current Iso-Seq Bioinformatics

•  Non-redundant, full-length, transcript consensus sequences
   –  No assembly
   –  De novo
   –  Achieves high-quality consensus (≥ 99%)
   –  Universal PacBio features: robust to GC%, repeat structure, etc.

•  Applications
   –  Alternative splicing
   –  Fusion transcripts
   –  Alternative polyadenylation
   –  Alternative start sites (possible w/ proper protocol)

Page 6: 20140711 4 e_tseng_ercc2.0_workshop

Disclaimer

•  Everything shown from now on is at the transcript/isoform level, not the gene level

•  Data shown is preliminary, very unbaked

•  Concept Analysis

Page 7: 20140711 4 e_tseng_ercc2.0_workshop

Count Information Associated with Each Unique Transcript

Clusters of transcript alignments using FL + nFL reads

Transcript 1 Transcript 2 Transcript 3

Final transcript consensus

Transcript 1 Transcript 2 Transcript 3

Count matrix

Transcript   Count   Norm_Count
1            8       0.08
2            5       0.05
3            7       0.07
…            …       …
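The slide does not say how Norm_Count is derived from Count. A minimal sketch, assuming the normalized count is simply each transcript's count divided by the total count over all transcripts; the `normalize_counts` name and the `other_transcripts` placeholder (standing in for the "…" rows so the total comes to 100) are hypothetical:

```python
def normalize_counts(counts):
    """Divide each transcript's count by the total count.

    ASSUMPTION: the slide does not define Norm_Count; dividing by the
    total is one simple choice that reproduces the example values.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    return {tid: c / total for tid, c in counts.items()}


# Counts from the slide, plus a hypothetical remainder for the "..." rows.
counts = {"1": 8, "2": 5, "3": 7, "other_transcripts": 80}
norm = normalize_counts(counts)

for tid in ("1", "2", "3"):
    print(tid, counts[tid], round(norm[tid], 2))
```

Dividing by the total keeps counts comparable across runs of different depth, but it is only one possible normalization.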

Page 8: 20140711 4 e_tseng_ercc2.0_workshop

Count Information from non-FL reads

For non-FL reads:
•  If uniquely associated with a transcript, assume it came from that transcript
•  If ambiguously associated, it is most likely because the read is a partial match
•  For now, the weight of an ambiguous nFL read is just 1 / (number of associated transcripts)

read_count = # of FL + # of unique nFL + weighted # of ambiguous nFL

In the current dataset, about 40-60% of nFL reads partially match multiple isoforms (FL reads are always fully and uniquely associated)
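The counting rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Iso-Seq pipeline's own code: the function name and the input layout (one transcript ID per FL read, one list of candidate transcript IDs per nFL read) are hypothetical.

```python
from collections import defaultdict


def transcript_read_counts(fl_assignments, nfl_assignments):
    """Per-transcript read count: FL + unique nFL + weighted ambiguous nFL.

    fl_assignments:  list of transcript IDs, one per FL read
                     (FL reads are always uniquely associated).
    nfl_assignments: list of lists of transcript IDs, one list per nFL read
                     (hypothetical input layout).
    """
    counts = defaultdict(float)

    # Each FL read contributes 1 to its transcript.
    for tid in fl_assignments:
        counts[tid] += 1.0

    # Each nFL read contributes 1 / (number of associated transcripts)
    # to every transcript it matches; a unique nFL read has weight 1.
    for hits in nfl_assignments:
        associated = set(hits)
        if not associated:
            continue  # read matched nothing; ignored in this sketch
        weight = 1.0 / len(associated)
        for tid in associated:
            counts[tid] += weight

    return dict(counts)


fl = ["T1", "T1", "T2"]
nfl = [["T1"], ["T1", "T2"], ["T2", "T3"]]
result = transcript_read_counts(fl, nfl)
print(result)
```

With this weighting every read contributes a total of 1, so the counts sum to the number of reads used (6 in the example).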

Page 9: 20140711 4 e_tseng_ercc2.0_workshop

Read Count Variation in Technical Replicates

Rat Heart
•  Technical replicates (same starting RNA & protocol)
•  3 size libraries (1 – 2 kb, 2 – 3 kb, 3 – 6 kb)
•  Runs from different size libraries pooled for bioinformatics pipeline

Boxplot of log2 read counts

Scatterplot of log2 read count for each transcript

Rat Heart, technical replicates

Page 10: 20140711 4 e_tseng_ercc2.0_workshop

Read Count Variation in Technical Replicates

Rat Lung, technical replicates

All technical replicates were sequenced with ~8 SMRT® Cells in total (low depth)

Most NA transcripts have low counts
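A comparison like the replicate scatterplots could be sketched as follows. The helper names and toy counts are hypothetical, and the sketch assumes that "NA" means a transcript with no reads in one of the two replicates, so its log2 count is undefined; the slides do not spell this out.

```python
import math


def log2_counts(counts, transcripts):
    """log2 read count per transcript; None (NA) when there are no reads."""
    return [math.log2(counts[t]) if counts.get(t, 0) > 0 else None
            for t in transcripts]


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists of numbers."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two shared transcripts")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("zero variance; correlation undefined")
    return sxy / math.sqrt(sxx * syy)


def compare_replicates(rep1, rep2):
    """Return (correlation over shared transcripts, list of NA transcripts)."""
    transcripts = sorted(set(rep1) | set(rep2))
    a = log2_counts(rep1, transcripts)
    b = log2_counts(rep2, transcripts)
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    na = [t for t, x, y in zip(transcripts, a, b) if x is None or y is None]
    r = pearson([x for x, _ in shared], [y for _, y in shared])
    return r, na


# Toy counts (not from the slides): T4 is seen in only one replicate.
rep1 = {"T1": 8, "T2": 4, "T3": 16, "T4": 1}
rep2 = {"T1": 16, "T2": 8, "T3": 32}
r, na_transcripts = compare_replicates(rep1, rep2)
print(r, na_transcripts)
```

In the toy data the NA transcript is also the lowest-count one, mirroring the observation that most NA transcripts have low counts.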

Page 11: 20140711 4 e_tseng_ercc2.0_workshop

Choice of Chemistry Does Not Bias Sequencing

Rat Brain

Same 3-size library (not technical replicate)
•  Sequenced with P4-C2 chemistry
•  Sequenced with P5-C3 chemistry

However, for longer (> 3 kb) transcripts, P5-C3 chemistry increases the chance of seeing FL reads

Page 12: 20140711 4 e_tseng_ercc2.0_workshop

Choice of PCR Enzyme May Bias Amplification

Human Brain, 2 – 3 kb library

Human Brain, 3 – 6 kb library

Page 13: 20140711 4 e_tseng_ercc2.0_workshop

Current Iso-Seq Protocol Amplifies Sample Twice

•  Total RNA
•  Optional Poly-A Selection → polyA+ RNA
•  Reverse Transcription (SMARTScribe RT) → Full-length 1st Strand cDNA
•  PCR Optimization
•  Large-scale Amplification → Amplified cDNA
•  Size Selection (1-2 kb, 2-3 kb, 3-6 kb)
•  Re-Amplification (1-2 kb, 2-3 kb, 3-6 kb)
•  SMRTbell™ Template Preparation (1-2 kb, 2-3 kb, 3-6 kb)
•  Optional Size Selection (3-6 kb)
•  SMRT® Sequencing

Page 14: 20140711 4 e_tseng_ercc2.0_workshop

2nd Amplification Does Not Introduce Strong Bias

FL Read Length Distribution

Std. vs. skipping 2nd amp

Std. vs. skipping 1st amp

Skipping 1st amplification results in size selection of first-strand cDNA, which may be hard to optimize

Page 15: 20140711 4 e_tseng_ercc2.0_workshop

Expected Transcript Variability in Different Rat Tissues

Rat Heart vs Rat Lung

Rat Heart vs Rat Brain

Page 16: 20140711 4 e_tseng_ercc2.0_workshop

Conclusion

•  Technical variation not a big issue
   –  If done with same library protocol
   –  Different (PCR) enzymes bias amplification
   –  Amplification can be tolerated if kept at reasonable # of cycles

•  Potential for DE (differential expression)
   –  Still many unknown factors
   –  Everything shown in previous slides merely “proof of concept”
   –  With control comes better modeling

Page 17: 20140711 4 e_tseng_ercc2.0_workshop

Looking Ahead

Wet Lab
•  Detection limit
•  Amplification bias
   –  Adding control at known %
   –  Factors: GC? Length? Enzyme?

Bioinformatics
•  Account for library pooling
•  Ambiguous mapping
•  Modeling bias
•  DE isoform detection
•  Combining short-read data

Page 18: 20140711 4 e_tseng_ercc2.0_workshop

For Research Use Only. Not for use in diagnostic procedures. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq are trademarks of Pacific Biosciences in the United States and/or other countries. All other trademarks are the sole property of their respective owners.