Single Molecule, Real-Time Sequencing of Full-length cDNA ...



For Research Use Only. Not for use in diagnostic procedures. Pacific Biosciences, the Pacific Biosciences logo, PacBio, SMRT, SMRTbell and Iso-Seq are trademarks of Pacific Biosciences of California, Inc. All other trademarks are the property of their respective owners. © 2015 Pacific Biosciences of California, Inc. All rights reserved.


Single Molecule, Real-Time Sequencing of Full-length cDNA Transcripts Uncovers Novel Alternatively Spliced Isoforms

Tyson A. Clark, Ting Hon, and Elizabeth Tseng

Pacific Biosciences, 1380 Willow Road, Menlo Park, CA 94025

Abstract

In higher eukaryotic organisms, the majority of multi-exon genes are alternatively spliced. Different mRNA isoforms from the same gene can produce proteins that have distinct properties such as structure, function, or subcellular localization. Thus, the importance of understanding the full complement of transcript isoforms with potential phenotypic impact cannot be overstated. While microarrays and NGS-based methods have become useful for studying transcriptomes, these technologies yield short transcript fragments, which makes accurate, complete reconstruction of splice variants a challenge. The Iso-Seq™ protocol developed at PacBio offers the only solution for direct sequencing of full-length, single-molecule cDNA sequences to survey transcriptome isoform diversity, which is useful for gene discovery and annotation. Knowledge of the complete isoform repertoire is also key for accurate quantification of isoform abundance. As most transcripts range from 1 to 10 kb, full-length cDNA copies of intact RNA molecules can be sequenced using SMRT® Sequencing (avg. read length: 10-15 kb) without requiring fragmentation or post-sequencing assembly. Our open-source computational pipeline delivers high-quality, non-redundant sequences for unambiguous identification of alternative splicing events, alternative transcriptional start sites, polyA tails, and gene fusion events.

The standard Iso-Seq protocol workflow, available to all researchers, is presented using a deep dataset of full-length cDNA sequences from the MCF-7 cancer cell line and multiple tissues (brain, heart, and liver). Novel transcripts approaching 10 kb and alternative splicing events are highlighted. Even in extensively profiled samples, the method uncovered large numbers of novel alternatively spliced isoforms and previously unannotated genes.

Sample Prep Improvements

SageELF™ Size Fractionation

Iso-Seq Sample Preparation Workflow

[Figure: workflow diagram. Input is polyA+ RNA, or total RNA with optional polyA selection. Steps: reverse transcription (Clontech® SMARTer® PCR cDNA Synthesis Kit) to full-length 1st strand cDNA; PCR optimization; large-scale amplification; size selection of the amplified cDNA (BluePippin™ System or gel) into 1-2 kb, 2-3 kb, and 3-6 kb fractions; re-amplification of each fraction; SMRTbell™ template preparation; optional size selection (BluePippin System; the figure labels 3-6 kb and 5-10 kb at this step); SMRT Sequencing.]

Size Distribution of Amplified cDNA From Multiple Tissues

[Figure: panels for brain, heart, and liver.]

Sample Preparation Methods

Summary and Resources

Targeted Full-Length cDNA Sequencing

Full-Length Human Tissue Transcriptomes

PacBio Sequencing of Iso-Seq Libraries From 3 Human Tissues

Full-Length Non-Redundant Transcript Sequences

Sequencing of Full-Length RT-PCR Products Shows Differential Alternative Splicing Across Three Tissues

SageELF Allows For Collection of cDNA Molecules in 12 Fractions Across the Entire Size Distribution

Bioanalyzer® Traces of SageELF Size-Selected cDNA from Human Brain

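Protocol Adjustments Improve Representation of Longer Transcripts

[Figure: comparison of three PCR enzymes: Phusion, Kapa Hifi, and SeqAmp.]

[Figure: cDNA amplified with Kapa Hifi, with panels for brain, heart, and liver. Size-marker labels: 500, 800, 1250, 2000, and 4000. Size fractions labeled for brain: 1-2, 2-3, 3-6, 5-10, 6-10, 8-12, and 10-15 kb. Size fractions labeled for heart and liver: 1-2, 2-3, 3-6, and 5-10 kb, plus 8-12 kb in one of the two panels.]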

PacBio human three tissue dataset available here: http://blog.pacificbiosciences.com/2014/10/data-release-whole-human-transcriptome.html

PacBio MCF-7 transcriptome dataset available here: http://blog.pacificbiosciences.com/2013/12/data-release-human-mcf-7-transcriptome.html

Additional information and Iso-Seq protocols: http://www.pacb.com/applications/isoseq/index.html

Details on data analysis of Iso-Seq data can be found here: https://github.com/PacificBiosciences/cDNA_primer/wiki

Sage Science’s BluePippin Size Fractionation

Summary:

• The Iso-Seq method provides full-length cDNA sequences without the need for assembly.

• Improved sample prep and size-selection methods allow for sequencing of transcripts up to 10 kb.

• Alternatively spliced transcripts can be easily identified from either whole transcriptome or targeted sequencing.

Example Bioanalyzer trace of four size-selected Iso-Seq libraries

Changing the PCR enzyme allows for amplification of transcripts in the 5-10 kb size range from tissue samples that have significant expression of cDNAs in that size range.

Two examples of genes with differential alternative splicing across the three tissues

Overview of the dataset showing numbers of transcripts of various sizes and the number of isoforms per gene
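The isoforms-per-gene summary described in this caption can be illustrated with a minimal sketch. It assumes a transcript-to-gene mapping is already available; the mapping, the function names, and the gene labels below are hypothetical and are not taken from the poster or from the Iso-Seq pipeline.

```python
from collections import Counter


def isoforms_per_gene(transcript_to_gene):
    """Count how many transcripts are assigned to each gene."""
    return Counter(transcript_to_gene.values())


def isoform_count_histogram(transcript_to_gene):
    """Map 'number of isoforms' to 'number of genes with that many isoforms'."""
    return Counter(isoforms_per_gene(transcript_to_gene).values())


# Made-up mapping for illustration only.
example = {
    "tx1": "GENE_A",
    "tx2": "GENE_A",
    "tx3": "GENE_B",
    "tx4": "GENE_A",
    "tx5": "GENE_C",
}

if __name__ == "__main__":
    print(dict(isoforms_per_gene(example)))
    print(dict(isoform_count_histogram(example)))
```

In this toy input, one gene has three isoforms and two genes have one isoform each; a real summary would be computed over the full non-redundant transcript set.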

SageELF increases the flexibility of size selection and allows for isolation of amplified cDNAs from several hundred bp up to more than 10 kb in size.

Amplified cDNAs after size selection on either SageELF or BluePippin.

PacBio sequencing of full-length RT-PCR products simplifies identification of alternatively spliced isoforms and allows for relative quantification of isoform abundance.
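As a rough illustration of relative quantification, the sketch below turns per-read isoform assignments into fractions for each tissue. It assumes each full-length read has already been assigned to one isoform; the function name, isoform labels, and read counts are made up for the example and do not come from the poster's data.

```python
from collections import Counter


def relative_abundance(isoform_reads):
    """Return each isoform's share of the full-length reads (empty dict if no reads)."""
    counts = Counter(isoform_reads)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {isoform: n / total for isoform, n in counts.items()}


# Hypothetical read assignments for one gene in two tissues.
reads_by_tissue = {
    "brain": ["exon_included"] * 3 + ["exon_skipped"] * 1,
    "liver": ["exon_included"] * 1 + ["exon_skipped"] * 3,
}

if __name__ == "__main__":
    for tissue, reads in reads_by_tissue.items():
        print(tissue, relative_abundance(reads))
```

Because each read spans the whole RT-PCR product, counting reads per isoform is enough for this kind of within-gene comparison; comparing absolute expression across genes would need additional normalization that this sketch does not attempt.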

RNA is converted into first strand cDNA using the Clontech SMARTer PCR cDNA Synthesis Kit followed by universal amplification. Amplified cDNA is size fractionated and converted into SMRTbell templates for sequencing on the PacBio® RS II.
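The size-fractionation step can be pictured with a small sketch that assigns a cDNA length to the fractions named on the poster. The bin boundaries in base pairs, the half-open ranges, and the function name are assumptions made for illustration; physical size selection on a gel or BluePippin System is not a sharp cutoff like this.

```python
from typing import List

# Assumed bin boundaries (bp) mirroring the fraction labels on the poster.
SIZE_BINS_BP = {
    "1-2 kb": (1_000, 2_000),
    "2-3 kb": (2_000, 3_000),
    "3-6 kb": (3_000, 6_000),
    "5-10 kb": (5_000, 10_000),
}


def assign_size_bins(length_bp: int) -> List[str]:
    """Return every bin whose half-open range [low, high) contains length_bp."""
    return [
        name
        for name, (low, high) in SIZE_BINS_BP.items()
        if low <= length_bp < high
    ]


if __name__ == "__main__":
    for length in (1_500, 2_000, 5_500, 12_000):
        print(length, assign_size_bins(length))
```

The 3-6 kb and 5-10 kb labels overlap, so a 5.5 kb molecule falls into both bins here; returning a list rather than a single label keeps that overlap visible.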