Applications of NEXT GENERATION SEQUENCING Technologies on Biomedical Research
Overview and Applications of Next-Generation Sequencing Technologies
Stéphane Deschamps
Analytical & Genomic Technologies
DuPont Agriculture & Nutrition
Pioneer Hi-Bred International
Outline
1. Next-Generation Sequencing Platforms
   1. 454 FLX technology
   2. Solexa/Illumina technology
2. Applications of Next-Generation Sequencing Technologies
   1. Overview
   2. Variant detection with Illumina platform
   3. Open-source tools for bioinformatics
3. Third-Generation Sequencing technologies: what’s next?
Sanger sequencing
Successive improvements now allow 96 reads of 800-900 bases each to be sequenced in less than 2 h
Sanger sequencing
Sanger sequencing has been, and still is, very useful...
...but it remains slow and expensive
Sequencing Platform Comparisons
                             ABI 3730xl   454 FLX Titanium   Illumina GA II
Read length                  ~750 bp      ~450 bp            18-75 bp
Number of reads/run          96           500K               100MM
Max yield/run                ~70 Kbp      ~1 Gbp             ~10 Gbp
Cost/1 Gbp                   $3.5MM       $7,000             $1,000
Run time/machine to 1 Gbp    8 years      1 day              <1 day
Next-Generation Sequencing
Third-generation platforms:
•Complete Genomics
•BioNanomatrix
•VisiGen
•Pacific Biosciences
•Intelligent Bio-Systems
•ZS Genetics
•Reveo
•LightSpeed Genomics
•NABsys
•Oxford Nanopore Technologies
Second-generation platforms:
•454/Roche
•Solexa/Illumina
•SOLiD/ABI
•Helicos BioSciences
•Dover Systems
454 FLX Titanium
• First next-generation sequencing platform launched (October 2005)
• Titanium chemistry for the 454 FLX launched in September 2008
• Sequencing By Synthesis
– Pyrosequencing
– Chemiluminescent signal
• Long read technology (~450 nucleotides)
• Possibility of sequencing both ends of
DNA fragments (FLX platform)
• Generates up to 0.5 Gbp per run
• Max cost is ~$10,000/run
454 FLX Titanium
• DNA Library Construction
• Emulsion PCR
• Sequencing
DNA Library Construction
• DNA fragmentation via nebulization
• Size-selection
• Ligation of adapters A & B
• Selection of A/B fragments via biotin selection
• Denaturation to select single-stranded A/B fragments
• No cloning!
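[Figure: library construction. End repair and adapter ligation yield A/A, B/B and A/B fragments; streptavidin selection followed by denaturation releases single-stranded A/B DNA for emulsion PCR]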
Emulsion PCR
• Add DNA to capture beads (needs titration)
• Add PCR reagents to DNA and capture beads
• Transfer sample to oil tube or cup
• Emulsify DNA capture beads in PCR reagents to form
water-in-oil “microreactors”
– Emulsion with Qiagen TissueLyser (high-speed
shaker)
• Clonal amplification in microreactors
– Careful not to break the emulsion!
– ~10MM copies per capture bead
• Break emulsion and enrich for DNA positive beads
– Use biotinylated oligo to capture enriched beads then
denature
www.roche-applied-science.com
Bead deposition into plates
• Deposition of enriched beads into
PicoTiter plate
• Well diameter = 29 µm, allowing for a
single bead (20 µm diameter) per well
• Chambers are filled with enzyme
beads, DNA beads and packing beads.
www.roche-applied-science.com
Pyrosequencing
1. Polymerase adds a nucleotide
(sequential flow of dNTPs)
2. PPi is released
3. Sulfurylase creates ATP from
PPi
4. Luciferase hydrolyzes ATP
and uses luciferin to make light
www.roche-applied-science.com
Image and signal processing
1. Raw data are a series of images (one image per base per cycle).
2. Data are extracted, quantified and normalized.
3. Read data are converted into “flowgrams”.
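As a rough illustration of how a flowgram encodes sequence, the sketch below turns normalized flow signals into bases by rounding each signal to the nearest whole number of incorporated nucleotides. The cyclic flow order TACG, the function name and the example signal values are assumptions made for this illustration; the instrument's actual base caller is more elaborate.

```python
def flowgram_to_bases(signals, flow_order="TACG"):
    """Convert normalized flow signals to a base string.

    Each signal is rounded to the nearest integer, which is taken as the
    number of identical nucleotides incorporated during that flow
    (the homopolymer length).
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        count = int(signal + 0.5)  # round half up; signals are non-negative
        bases.append(nucleotide * count)
    return "".join(bases)


# Hypothetical signals for the flows T, A, C, G, T, A, C, G
example_signals = [1.02, 0.05, 2.10, 0.97, 0.03, 1.04, 0.00, 2.95]
print(flowgram_to_bases(example_signals))
```

Because the signal grows with homopolymer length, a value near 3 is called as three identical bases, which is also why long homopolymers are the hardest calls for this chemistry.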
Post-processing
1. Output = flowgrams, basecalls, Phred-equivalent scores
2. Basecalls & flowgrams can be used in the following applications:
   1. De novo assembler – consensus sequences assembled into contigs with quality scores and ACE file (works best with genomic DNA).
   2. Reference mapper – contigs mapped to reference sequence + list of high-confidence mutations
   3. Amplicon variant analyzer – identification of sequence variants in amplicon libraries
Illumina Genome Analyzer
• Successor to MPSS (Massively Parallel Signature Sequencing)
• Single molecule array (“flow cell”) with millions of amplified
clusters
• Sequencing By Synthesis
– Removable fluorescence
– Reversible terminators
• Short read technology (16 - 75 nucleotides)
• Possibility of sequencing both ends of DNA fragments
• Generates up to 20 Gbp per run
• Max cost is ~$10,000/run
= $500/Gbp!
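The per-Gbp figure is simply the run cost divided by the run yield. A minimal sketch of that arithmetic, using the two numbers quoted on this slide (the function name is illustrative):

```python
def cost_per_gbp(run_cost_usd, yield_gbp):
    """Cost in US dollars of one gigabase, given run cost and run yield."""
    return run_cost_usd / yield_gbp


# Slide values: ~$10,000 per run, up to 20 Gbp per run
print(cost_per_gbp(10_000, 20))
```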
[Figure: Illumina workflow. Sample prep (prepare DNA fragments, ligate adapters), cluster synthesis (Cluster Station), sequencing (Genome Analyzer), analysis pipeline]
Illumina Genome Analyzer
[Figure: instruments used for sequencing. Cluster Station; Genome Analyzer (fluidics and electronics, flow cell & detection, laser optics)]
Cluster Generation
[Figure: cluster generation from DNA or RNA libraries, showing the anneal and extension steps]
DNA Clusters
• ~1,000 copies of DNA in each cluster
• 1-2 microns in diameter
Reversible Terminator Chemistry
[Figure: template strand (5’ end marked) with labeled G, T, C and A nucleotides]
Cycle 1: Add sequencing reagents
– First base incorporated
– Remove unincorporated bases
– Detect signal
– Deblock (removal of fluorescent dye and protecting group)
Cycle 2 - n: Add sequencing reagents and repeat
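The cycle described above (incorporate one blocked, labeled base, detect it, then deblock) can be sketched as a simple loop. This is a toy model under stated assumptions: the template is given in the order the polymerase reads it, every incorporation succeeds, and the function and variable names are illustrative rather than part of any Illumina software.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template, n_cycles):
    """Simulate reading a template one base per cycle.

    The returned read is the complement of the template, built in the
    order the bases are incorporated.
    """
    read = []
    for cycle in range(min(n_cycles, len(template))):
        # Add reagents: only the complementary terminator is incorporated,
        # and its blocking group prevents a second incorporation.
        incorporated = COMPLEMENT[template[cycle]]
        # Detect signal: the dye colour identifies the incorporated base.
        read.append(incorporated)
        # Deblock: dye and protecting group are removed, so the next
        # cycle can extend the strand by one more base.
    return "".join(read)


print(sequence_by_synthesis("GTACCA", 4))
```

One base is read per cycle, so the read length equals the number of cycles, which is why run time scales with read length on this platform.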
Sequencing by Synthesis (SBS)
[Figure: sequencing by synthesis. Labeled C, A, T and G nucleotides extending the strand, 5’ to 3’]
Data Analysis Workflow - Illumina
Illumina Analysis Pipeline
Images (.tif)
Image Analysis
Base calling
Sequence Analysis: alignment (ELAND), filtering (chastity)
1 image per dye
x 4 dyes/cycle
x 75 cycles
x 50 tiles/column
x 2 columns/lane
x 8 lanes/flowcell
= 240,000 images per flowcell
x 8 MB per image = 1.92 TB of image data
x 2 for PE run = 3.8 TB of image data
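The image-volume estimate is a straight product of the run parameters. The sketch below reproduces it with the slide's values as defaults; it assumes decimal units (1 TB = 1,000,000 MB), and the function and parameter names are illustrative.

```python
def image_data_volume(dyes=4, cycles=75, tiles_per_column=50,
                      columns_per_lane=2, lanes=8, mb_per_image=8,
                      paired_end=False):
    """Return (number of images, terabytes of image data) for one run."""
    images = dyes * cycles * tiles_per_column * columns_per_lane * lanes
    if paired_end:
        images *= 2  # a paired-end run images every cluster twice
    terabytes = images * mb_per_image / 1_000_000  # 1 TB = 1,000,000 MB
    return images, terabytes


print(image_data_volume())                 # single-read run
print(image_data_volume(paired_end=True))  # paired-end run
```

At this scale, the question of whether to keep raw images at all becomes a storage decision rather than an analysis decision.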
Alignments, Assemblies, Normalization, Annotations & Post-processing Evaluations
Data transfer and Storage
• Cluster Intensities
• Cluster Noise
• Cluster Sequence
• Cluster Probabilities (Scores)
• Corrected Cluster Intensities
– cross-talk correction
– phasing correction
•Image analysis module is Firecrest
•Base calling module is Bustard
•Sequence analysis module is Gerald
Other platforms
Platform            Sequencing Chemistry       Run Time   Read Length (bp)   Reads per Run (million)   Throughput per Run (Gbp)
Roche 454 FLX       Pyrosequencing             10h        400-500            ~1                        0.4-0.5
Illumina GAIIx      Sequencing by Synthesis    9.5 days   100                250                       25
ABI SOLiD           Sequencing by Ligation     8 days     50                 400                       20
Helicos HeliScope   Sequencing by Synthesis    8 days     25-55              600-800                   15-45
Polonator           Sequencing by Ligation     80h        28                 300-400                   10
Data Storage & Quality
Images?
Phred score 20 = 1% error rate
Quality vs. Read Length? Trimming?
Lower sequence quality than Sanger sequencing but offset by deeper coverage
~Phred 20
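The relation between a Phred score and an error rate is Q = -10 log10(P), so Phred 20 corresponds to a 1% chance that a base call is wrong. A minimal sketch of the conversion in both directions (function names are illustrative):

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 15, 20, 30):
    print(q, phred_to_error_prob(q))
```

Each step of 10 in the score is a tenfold drop in error rate, which is why a quality threshold of 15 versus 25 makes a large difference in downstream variant calling.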
Single short read uniqueness
~4MM reads
Illumina 35 base reads aligned to A. thaliana genome
Applications of Next-Generation Sequencing
– Tag count & Alignments
– Digital Gene Expression Tag Profiling
• Short cDNA fragments mapping to 3’ ends of transcripts
• SAGE-like approach (1 short tag/transcript)
• 20 base tag output (RE site + 16 bases) aligned to a reference genome
• Identify, quantify and annotate expressed genes
– Transcriptome Profiling (RNA-Seq)
• cDNA fragments generated via random priming
• 36-75 base output aligned to a reference genome
• Assemble entire transcript sequence
• Identify, quantify and annotate expressed genes
• Identify SNPs, alleles and alternative splice variants
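For the tag-based approach above, quantifying expression amounts to counting how often each tag is observed and normalizing by library size. The sketch below assumes the tags have already been extracted as strings; the tags-per-million normalization, the function name and the example tags are illustrative choices, not a description of any specific pipeline.

```python
from collections import Counter


def tags_per_million(tags):
    """Count identical tags and scale counts to tags per million reads."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n * 1_000_000 / total for tag, n in counts.items()}


# Hypothetical 20-base tags: restriction site (CATG) + 16 bases
reads = ["CATG" + "A" * 16] * 3 + ["CATG" + "C" * 16]
for tag, tpm in tags_per_million(reads).items():
    print(tag, tpm)
```

Normalizing by library size lets counts from runs of different depth be compared directly.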
Gene Expression Profiling
Tag Profiling – Sample Prep (Illumina)
[Figure: workflow steps shown: total RNA (5 µg), mRNA isolation, 1st and 2nd strand cDNA synthesis, restriction enzyme digestion (DpnII or NlaIII), GEX Adaptor 1 ligation, MmeI digestion, GEX Adaptor 2 ligation, PCR amplification (PCR Primers 1 and 2), cluster generation (sequencing primer)]
Transcriptome Profiling – Sample Prep (Illumina)
[Figure: workflow steps shown: tissue, total RNA isolation (10 µg), mRNA isolation, fragmentation (random), 1st and 2nd strand cDNA synthesis (N6 primer), adaptor ligations, PCR amplification (PCR Primers 1 and 2), cluster generation (sequencing primers 1 and 2)]
– Small RNA Identification and Profiling
• Small RNA size is well suited to discovery by next-generation sequencing
– Deep assessment of alternative splicing isoforms
• Deep coverage allows discovery of rare isoforms
Novel Transcript Discovery
Mortazavi et al. (2008), Nat. Methods
– Whole Genome Sequencing
• Small genomes that are not too complex (microbial)
• The longer the reads, the better – 454 chemistry most suitable
• Paired-End sequencing
– Whole Transcriptome Sequencing
– Targeted Sequencing
• Pooled PCR products
– Raindance Technologies (~4,000 amplicons in one tube)
– Padlock probes
• Pooled BAC clones
• Sequence Capture (Solid phase, Liquid phase)
– Agilent, Febit & Nimblegen
– Metagenomics & Microbial diversity
De novo Sequencing
– ChIP-Seq (chromatin immunoprecipitation sequencing)
• Capture regions of the genome bound by proteins (transcription factors,
histones)
• Sequences need to be aligned to a reference sequence
• Requires complex algorithm to determine differential levels of coverage
throughout the genome
– Methyl-Seq (methylation status) – Bisulfite Sequencing
• Sequences aligned and compared to reference sequence
– DNAseI Hypersensitivity Site Sequencing
Gene Regulation
Mikkelsen et al. (2007), Nature
– Coverage & Alignment
– Paired-End Sequencing
– Whole Genome Resequencing
• Small genomes that are not too complex (repeats, duplications...)
• The longer the reads, the better
– Targeted Resequencing
• Complex genomes (crops)
– Reduced representation libraries (methyl-sensitive enzymes)
– Transcriptome
• Sequence Capture (Microarrays)
» Agilent, Febit & Nimblegen
» CGH arrays
Variant & Structural Variation
Challenges in variant discovery
1. Base quality & filtering (scoring threshold)
2. Sequencing errors vs. SNPs
   1. To differentiate true polymorphisms from sequencing errors
   2. Coverage of a given SNP region and redundancy of reads (coverage vs. number of samples)
3. Availability of a reference sequence (genome)
   1. To separate unique vs. duplicated sequences
   2. Duplication in one line but not another
   3. Polymorphism rate in one line vs. another = need to set conditions for alignment
4. Paired-end sequencing can help unique read placement
5. Complex genomes = need to reduce complexity prior to sequencing
   1. High repeat content (ex: ~80% in maize, ~70% in soy, 90% in sunflower…)
   2. Gene duplications and genome plasticity (polyploidy, partial or whole genome duplications...)
Reduced-representation libraries
[Figure: genomic region with transposons and PstI sites; PstI digestion; recovery of the digested fraction (gel, column)]
1. DNA methylation in plants occurs as 5-methylcytosine within CpG dinucleotides and CpNpG trinucleotides
2. Transposons and other repeats comprise the largest fraction of methylated DNA. Studies in Arabidopsis have shown that CG sites in the 3’ end of the transcribed regions of more than one third of all genes also are methylated (Zhang, Science, 320, 489, 2008).
3. Methylation is critically important in silencing transposons and regulating plant development (methylation in promoters appears to reduce transcription)
Library Construction
Genomic DNA
1. Digestion with one methyl-sensitive restriction enzyme (RE) and fractionation
2. Ligation of biotinylated RE-specific adapters 1
3. Digestion with 4-bp cutter (DpnII, site GATC)
4. Ligation of DpnII-specific adapter
5. Binding to streptavidin column and digestion with RE
6. Ligation of RE-specific adapters 2
7. PCR enrichment, gel purification, size selection (150-500 bp fragments), cluster synthesis and sequencing (36 cycles)
Deschamps et al. The Plant Genome (in press)
SNP detection flowchart
Basecalling, cropping last 4 bases & initial base-quality filter (for individual tags)
Condensing & optional consensus base-quality filter (for unitags sequences)
Creating HQ unitag datasets (removing singlets)
Comparing HQ unitag datasets from genotype “A” and genotype “B” using Vmatch
Filtering, to accept clusters with only two members (A, B) with exactly one mismatch
Recovering matched HQ unitag sequences and SNP sites from Vmatch alignments
Mapping SNP-containing HQ unitags to reference sequence (genome), using a k-mer table (k=length of trimmed tags), and find copy numbers and locations.
Capturing single-copy HQ unitags with up to a single-base mismatch to the reference sequence at the exact location of the putative SNP site for one or both genotypes.
Filtering and Condensing
Comparing two genotypes
Mapping to genome
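The condensing and genotype-comparison stages of this flowchart can be sketched in a few lines. This is a simplified stand-in, not the published pipeline: it uses exact string matching and enumeration of single-base substitutions in place of Vmatch, skips quality filtering and genome mapping, and all function names and example reads are hypothetical.

```python
from collections import Counter

BASES = "ACGT"


def hq_unitags(reads, min_count=2):
    """Condense identical reads into unitags and drop singlets."""
    counts = Counter(reads)
    return {tag for tag, n in counts.items() if n >= min_count}


def one_mismatch_neighbors(tag):
    """Yield (neighbor, position) for every single-base substitution."""
    for pos, base in enumerate(tag):
        for alt in BASES:
            if alt != base:
                yield tag[:pos] + alt + tag[pos + 1:], pos


def putative_snps(tags_a, tags_b):
    """Pair unitags from genotypes A and B differing at exactly one base.

    A pair is kept only when each tag has no other one-mismatch partner
    in the other genotype, approximating "clusters with only two members".
    """
    snps = []
    for a in sorted(tags_a - tags_b):
        hits_b = [(n, pos) for n, pos in one_mismatch_neighbors(a)
                  if n in tags_b]
        if len(hits_b) != 1:
            continue
        b, pos = hits_b[0]
        hits_a = [n for n, _ in one_mismatch_neighbors(b) if n in tags_a]
        if b not in tags_a and hits_a == [a]:
            snps.append((a, b, pos))
    return snps


# Hypothetical 8-base reads for two genotypes
reads_a = ["ACGTACGT"] * 3 + ["TTTTCCCC"] * 2 + ["GGGGAAAA"]
reads_b = ["ACGTACCT"] * 2 + ["TTTTCCCC"] * 4
print(putative_snps(hq_unitags(reads_a), hq_unitags(reads_b)))
```

Requiring each unitag to be seen at least twice removes most random sequencing errors before the comparison, and rejecting clusters with more than two members screens out many duplicated or repetitive sequences.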
Example: one flow cell in soybean (Williams82 vs. Pintado)
† Filtered total reads defined as having a quality value for individual base greater than or equal to 15
‡ HQ unitag reads defined as having a quality value for each base greater than or equal to 15, and with an individual read count greater than or equal to 2.
§ Best match to reference sequence of HQ unitag reads aligning uniquely or multiple times to the reference sequence
[Figure: frequency plotted against depth, logarithmic axes]
Run Metrics                                                                  Williams82    Pintado
Number of total reads generated (after initial basecalling)                  37,666,279    38,000,474
Number of filtered total reads †                                             24,519,484    23,101,973
Number of unitags (generated from filtered total reads)                      965,610       885,429
Number of high quality (HQ) unitags ‡                                        255,918       246,102
Alignment of HQ unitags against the reference sequence:
  Zero mismatch §                                                            208,923       197,015
  One mismatch §                                                             27,770        27,699
  Two or more mismatches §                                                   19,225        21,388
HQ unitags aligning uniquely to the reference sequence with zero mismatch    152,185       144,559
Results & Validation
Experiments                        Putative SNPs   Confirmed*   Not Confirmed*   Validation rate
Q Score threshold: 15
  Soy: Williams82 vs. Pintado      1,682           163          5                97.0%
  Rice: Kasalath vs. Taichung65    2,618           162          6                96.4%
Q Score threshold: 25
  Soy: Williams82 vs. Pintado      702             168          2                98.8%
  Rice: Kasalath vs. Taichung65    2,148           174          1                99.4%
*SNPs confirmed/not confirmed via Sanger sequencing of PCR products for both genotypes
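The validation rate is the share of Sanger-tested SNPs that were confirmed, i.e. confirmed / (confirmed + not confirmed). A short sketch that recomputes the four rates from the counts reported above (the function name and dictionary layout are illustrative):

```python
def validation_rate(confirmed, not_confirmed):
    """Percentage of tested SNPs confirmed by Sanger sequencing."""
    return 100 * confirmed / (confirmed + not_confirmed)


# (confirmed, not confirmed) counts as reported in the table
results = {
    ("Q15", "Soy"): (163, 5),
    ("Q15", "Rice"): (162, 6),
    ("Q25", "Soy"): (168, 2),
    ("Q25", "Rice"): (174, 1),
}

for (threshold, crop), (ok, failed) in results.items():
    print(threshold, crop, round(validation_rate(ok, failed), 1))
```

Note that the rate is computed over the subset of putative SNPs that were tested by Sanger sequencing, not over all putative SNPs.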
Distribution of HQ unitags & SNPs related to annotated gene density (soybean)
Gene Density (excluding TEs) in 500Kb window
Coverage by HQ unitags in 70Kb window
SNP Density in 70Kb window
Distribution of HQ unitags & SNPs related to distance to annotated genes (excluding TEs) in soybean
Intron, CDS and UTR coordinates determined from GFF annotation files
Bioinformatic tools
Alignment and Polymorphism Detection
1. SOAP – Short Oligonucleotide Alignment Program
• Ruiqiang Li, Beijing Genomics Institute
• http://soap.genomics.org.cn
2. MAQ – Mapping and Assembly with Quality
• Heng Li, Sanger Centre
• http://maq.sourceforge.net/maq-man.shtml
3. Bowtie - An ultrafast memory-efficient short read aligner
• Ben Langmead and Cole Trapnell, University of Maryland
• http://bowtie-bio.sourceforge.net/
4. ssahaSNP – Tool to detect homozygous SNPs and indels
• Adam Spargo and Zemin Ning, Sanger Centre
• http://www.sanger.ac.uk/Software/analysis/ssahaSNP
Bioinformatic tools
Genomic Assembly
1. Velvet – De novo assembly of short reads
• Daniel Zerbino and Ewan Birney, EMBL-EBI
• http://www.ebi.ac.uk/~zerbino/velvet/
2. SSAKE – Assembly of short reads
• Rene Warren, et al, British Columbia Cancer Agency
• http://bioinformatics.oxfordjournals.org/cgi/content/full/23/4/500
3. Euler – Genomic Assembly
• Pavel Pevzner and Mark Chaisson, University of California, San Diego
• http://nbcr.sdsc.edu/euler/
www.illumina.com
ChIP Sequencing
1. ChIP-Seq Peak Finder
• Barbara Wold, Caltech and Rick Myers, Stanford University
• http://woldlab.caltech.edu/html/software/
Digital Gene Expression
1. Comparative Count Display
• Alex Lash, NIH
• ftp://ftp.ncbi.nlm.nih.gov/pub/sage/obsolete/bin/ccd/
2. SAGE DGED Tool
• Cancer Genome Anatomy Project
• http://cgap.nci.nih.gov/SAGE/SDGED_Wizard?METHOD=SS10,LS10&ORG=Hs
Bioinformatic tools
www.illumina.com
Overview
1. Obtain Bustard reads and align against Genome with Eland
2. Aggregate and SNP call data with CASAVA
3. GenomeStudio™ wizard import of data
4. Examine coverage and quality in stacked alignment graphs for a selected region/chromosome
5. Export table of SNPs and consensus sequence
Bioinformatic tools - Illumina
Third-Generation Sequencing technologies: what’s next?
Pacific Biosciences
• SMRT™ Technology (to be commercially
launched Fall 2010)
• A single DNA polymerase, attached to the bottom
surface of a nanometer-scale hole, incorporates
fluorescently labeled nucleotides into the
elongating strand of DNA in real time
• The elongated strand can be several thousand
nucleotides in length
www.pacificbiosciences.com
Pacific Biosciences
1. The small size of the hole favors rapid in-and-out diffusion of nucleotides, and of the dye following
its cleavage. Meanwhile, an incorporated nucleotide is held within the detection volume
for tens of milliseconds, orders of magnitude longer than the time it takes for nucleotides
to diffuse in and out of the hole, which decreases background noise
2. The fluorescent dye is attached to the phosphate chain, rather than the base, and is
cleaved when the nucleotide is incorporated into the DNA strand.
=> Decreased background noise and the use of phospholinked nucleotides circumvent the need
for successive cycles of incorporation, washing, scanning and removal of the label,
therefore optimizing processivity of the enzyme and allowing longer read lengths
=> No need for washing decreases the consumption of reagents
Nanopore Sequencing = the real $100 genome?
1. Sequencing-by-Synthesis requires lots of preparation, lots of reagents (polymerase,
nucleotides, fluorescent labels...) and expensive detection systems.
2. Nanopore sequencing does not rely on amplification or labeling, and provides a direct
electrical signal for base calling. It is based on the simple idea of passing DNA
fragments through a nanometer-scale pore and detecting, in real time, the change in
signal as the DNA blocks the electrical current that runs through the pore
3. Oxford Nanopore: Protein nanopore
   1. Long read lengths (1000’s)
   2. High read accuracy
   3. Current technical issues:
      • Attachment of the exonuclease to the pore
      • Parallelization (1,000’s of pores per chip)
[Figure: protein nanopore. Exonuclease, alpha-hemolysin pore, cyclodextrin (encapsulates nucleotide)]
www.nanoporetech.com
Questions?