Overview and Applications of Next-Generation Sequencing Technologies


Stéphane Deschamps

Analytical & Genomic Technologies
DuPont Agriculture & Nutrition
Pioneer Hi-Bred International

Outline

1. Next-Generation Sequencing Platforms
   1. 454 FLX technology
   2. Solexa/Illumina technology

2. Applications of Next-Generation Sequencing Technologies
   1. Overview
   2. Variant detection with Illumina platform
   3. Open-source tools for bioinformatics

3. Third-Generation Sequencing technologies: what’s next?

Sanger sequencing

Successive improvements now allow 96 reads of 800-900 bases each to be sequenced in less than 2h

Sanger sequencing

Sanger sequencing has been, and still is, very useful...

...but it remains slow and expensive

Sequencing Platform Comparisons

Metric | ABI3730xl | 454 FLX Titanium | Illumina GA II
Read length | ~750 bp | ~450 bp | 18-75 bp
Number of reads/run | 96 | 500K | 100MM
Max yield/run | ~70 Kbp | ~1 Gbp | ~10 Gbp
Cost/1 Gbp | $3.5MM | $7,000 | $1,000
Run time/machine to 1 Gbp | 8 years | 1 day | <1 day

Next-Generation Sequencing

Third-generation platforms:

•Complete Genomics

•BioNanomatrix

•VisiGen

•Pacific Biosciences

•Intelligent Bio-Systems

•ZS Genetics

•Reveo

•LightSpeed Genomics

•NABsys

•Oxford Nanopore Technologies

Second-generation platforms:

•454/Roche

•Solexa/Illumina

•SOLiD/ABI

•Helicos BioSciences

•Dover Systems

454 FLX Titanium

• First next-generation sequencing platform launched (October 2005)

• Titanium chemistry for the 454 FLX launched in September 2008

• Sequencing By Synthesis

– Pyrosequencing

– Chemiluminescent signal

• Long read technology (~450 nucleotides)

• Possibility of sequencing both ends of DNA fragments (FLX platform)

• Generates up to 0.5Gbps per run

• Max cost is ~$10,000/run

454 FLX Titanium

• DNA Library Construction

• Emulsion PCR

• Sequencing

DNA Library Construction

• DNA fragmentation via nebulization

• Size-selection

• Ligation of adapters A & B

• Selection of A/B fragments via biotin selection

• Denaturation to select single-stranded A/B fragments

• No cloning!

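[Figure: library construction. End repair and adapter ligation yield A/A, B/B and A/B fragments; streptavidin capture followed by denaturation releases single-stranded A/B DNA for emulsion PCR]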

Emulsion PCR

• Add DNA to capture beads (needs titration)
• Add PCR reagents to DNA and capture beads
• Transfer sample to oil tube or cup
• Emulsify DNA capture beads in PCR reagents to form water-in-oil “microreactors”
  – Emulsion with Qiagen TissueLyser (high-speed shaker)
• Clonal amplification in microreactors
  – Careful not to break the emulsion!
  – ~10MM copies per capture bead
• Break emulsion and enrich for DNA positive beads
  – Use biotinylated oligo to capture enriched beads then denature

www.roche-applied-science.com

Bead deposition into plates

• Deposition of enriched beads into PicoTiter plate
• Well diameter = 29 µm, allowing for a single bead (20 µm diameter) per well
• Chambers are filled with enzyme beads, DNA beads and packing beads.

www.roche-applied-science.com

Pyrosequencing

1. Polymerase adds a nucleotide (sequential flow of dNTPs)
2. PPi is released
3. Sulfurylase creates ATP from PPi
4. Luciferase hydrolyzes ATP and uses luciferin to make light

www.roche-applied-science.com
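The four steps above can be sketched as a toy simulation. This is a minimal illustration, not the instrument's software: it assumes an idealized light signal of exactly 1.0 per incorporated base and a repeating T, A, C, G flow order (both are assumptions, as are the function and parameter names).

```python
def simulate_flowgram(new_strand, flow_order="TACG", n_flows=16):
    """Return a list of (nucleotide, signal) pairs, one per dNTP flow.

    new_strand: sequence of the strand being synthesized (5'->3').
    Each flow offers one dNTP; the polymerase incorporates it as many
    times as the strand requires (homopolymer), releasing one PPi and
    thus one unit of light per incorporated base (idealized).
    """
    flows = []
    pos = 0
    for i in range(n_flows):
        nuc = flow_order[i % len(flow_order)]
        run = 0
        # Incorporate while the next required base matches the flowed dNTP.
        while pos < len(new_strand) and new_strand[pos] == nuc:
            run += 1
            pos += 1
        flows.append((nuc, float(run)))
    return flows


if __name__ == "__main__":
    for nuc, signal in simulate_flowgram("TTAGGC", n_flows=8):
        print(nuc, signal)
```

Note that a homopolymer such as TT produces one flow with a doubled signal rather than two separate events, which is why signal intensity has to be quantified accurately.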

Image and signal processing

1. Raw data are a series of images (one image per base per cycle).

2. Data are extracted, quantified and normalized.

3. Read data are converted into “flowgrams”.
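To illustrate how a flowgram relates to a read, the sketch below converts normalized flow signals into bases by rounding each signal to the nearest whole number of incorporations. The flow order and the simple rounding rule are assumptions made for illustration; the actual base caller is more sophisticated.

```python
def call_bases(signals, flow_order="TACG"):
    """Convert normalized flow signals into a base sequence.

    Each signal is rounded to the nearest non-negative integer, taken
    as the number of bases incorporated during that flow.
    """
    bases = []
    for i, signal in enumerate(signals):
        n = max(0, int(signal + 0.5))  # round half up
        bases.append(flow_order[i % len(flow_order)] * n)
    return "".join(bases)


if __name__ == "__main__":
    noisy = [2.04, 0.93, 0.08, 1.97, 0.12, 0.05, 1.10, 0.02]
    print(call_bases(noisy))
```

Signals near x.5 are the ambiguous cases, which is one reason long homopolymers are harder to call with this chemistry.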

Post-processing

1. Output = flowgrams, basecalls, Phred-equivalent scores

2. Basecalls and flowgrams can be used in the following applications:
   1. De novo assembler – consensus sequences assembled into contigs with quality scores and ACE file (works best with genomic DNA).
   2. Reference mapper – contigs mapped to reference sequence + list of high-confidence mutations
   3. Amplicon variant analyzer – identification of sequence variants in amplicon libraries

Illumina Genome Analyzer

• Successor to MPSS (Massively Parallel Signature Sequencing)

• Single molecule array (“flow cell”) with millions of amplified clusters

• Sequencing By Synthesis

– Removable fluorescence

– Reversible terminators

• Short read technology (16 - 75 nucleotides)

• Possibility of sequencing both ends of DNA fragments

• Generates up to 20Gbps per run

• Max cost is ~$10,000/run

= $500/Gbp!

Illumina Genome Analyzer

[Figure: workflow from sample prep (prepare DNA fragments, ligate adapters) to cluster synthesis on the Cluster Station, sequencing on the Genome Analyzer, and the Analysis Pipeline. Instrument labels: fluidics and electronics, flow cell and detection, laser optics]

Cluster Generation

[Figure: cluster generation by repeated annealing and extension]

DNA Clusters
• ~1,000 copies of DNA in each cluster
• 1-2 microns in diameter

Reversible Terminator Chemistry

[Figure: template strand (5’ to 3’) with labeled A, C, G and T reversible-terminator nucleotides]

Sequencing by Synthesis (SBS)

• Cycle 1: Add sequencing reagents
• First base incorporated
• Remove unincorporated bases
• Detect signal
• Deblock (removal of fluorescent dye and protecting group)
• Cycle 2 - n: Add sequencing reagents and repeat
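The cycle structure of reversible-terminator sequencing can be sketched as a loop in which exactly one base is read per cycle. This is a toy model under stated assumptions: perfect incorporation, no phasing, and a template given in the 3'->5' direction so that position i is read in cycle i + 1. All names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template_3to5, n_cycles):
    """Return the read produced after n_cycles of idealized SBS."""
    read = []
    for cycle in range(min(n_cycles, len(template_3to5))):
        # Add reagents: the complementary labeled terminator is
        # incorporated, and its blocking group prevents a second addition.
        base = COMPLEMENT[template_3to5[cycle]]
        # Remove unincorporated bases, then detect the dye of this base.
        read.append(base)
        # Deblock: dye and protecting group are removed, so the next
        # cycle can extend the strand by one more base.
    return "".join(read)


if __name__ == "__main__":
    print(sequence_by_synthesis("GTAC", 3))
```

Unlike pyrosequencing, a homopolymer is read one base per cycle here, so read length is set by the number of cycles.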

Data Analysis Workflow - Illumina

Illumina Analysis Pipeline

1. Image Analysis (input: images, .tif)
2. Base calling
3. Sequence Analysis: alignment (ELAND), filtering (chastity)

Image data volume per flowcell:
• 1 image per dye × 4 dyes/cycle × 75 cycles × 50 tiles/column × 2 columns/lane × 8 lanes/flowcell = 240,000 images per flowcell
• × 8 MB per image = 1.92 TB of image data
• × 2 for PE run = 3.8 TB of image data
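The image-volume arithmetic on this slide can be reproduced with a few lines. The function name and parameter names are illustrative; the default values are the ones quoted on the slide, and TB is taken as 1,000,000 MB.

```python
def image_data_volume(dyes=4, cycles=75, tiles_per_column=50,
                      columns_per_lane=2, lanes=8,
                      mb_per_image=8, paired_end=False):
    """Return (number of images, terabytes of image data) for one run."""
    images = dyes * cycles * tiles_per_column * columns_per_lane * lanes
    if paired_end:
        images *= 2  # a paired-end run images every cluster twice
    terabytes = images * mb_per_image / 1_000_000
    return images, terabytes


if __name__ == "__main__":
    print(image_data_volume())
    print(image_data_volume(paired_end=True))
```

Changing the number of cycles or tiles shows how quickly storage needs scale with read length and flow cell density.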

Data transfer and storage are followed by alignments, assemblies, normalization, annotations and post-processing evaluations.

• Image analysis module is Firecrest
  – Output: cluster intensities, cluster noise
• Base calling module is Bustard
  – Cross-talk correction, phasing correction
  – Output: cluster sequence, cluster probabilities (scores), corrected cluster intensities
• Sequence analysis module is Gerald
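The slides name chastity filtering without defining it. A commonly documented definition is the brightest channel intensity divided by the sum of the brightest and second-brightest intensities. The sketch below assumes that definition; the 0.6 threshold, the 25-cycle window and the one-failure allowance are assumptions that vary between pipeline versions, and the function names are hypothetical.

```python
def chastity(intensities):
    """Chastity of one cluster in one cycle from its 4 channel intensities."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def passes_filter(cluster_cycles, threshold=0.6, first_n=25, max_failures=1):
    """Keep a cluster if at most max_failures early cycles are below threshold.

    cluster_cycles: list of per-cycle [A, C, G, T] intensity lists.
    """
    failures = sum(1 for cycle in cluster_cycles[:first_n]
                   if chastity(cycle) < threshold)
    return failures <= max_failures


if __name__ == "__main__":
    clean = [[900, 100, 50, 20]] * 25   # one dominant channel per cycle
    mixed = [[500, 480, 10, 5]] * 25    # two channels nearly equal
    print(passes_filter(clean), passes_filter(mixed))
```

A low chastity value typically indicates overlapping clusters, where two templates contribute signal to the same spot.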

Other platforms

Platform | Sequencing chemistry | Run time | Read length (bp) | Reads per run (million) | Throughput per run (Gbp)
Roche 454 FLX | Pyrosequencing | 10h | 400-500 | ~1 | 0.4-0.5
Illumina GAIIx | Sequencing by Synthesis | 9.5 days | 100 | 250 | 25
ABI SOLiD | Sequencing by Ligation | 8 days | 50 | 400 | 20
Helicos HeliScope | Sequencing by Synthesis | 8 days | 25-55 | 600-800 | 15-45
Polonator | Sequencing by Ligation | 80h | 28 | 300-400 | 10

Data Storage & Quality

Images?

Phred score 20 = 1% error rate

Quality vs. Read Length? Trimming?

Lower sequence quality than Sanger sequencing but offset by deeper coverage

~Phred 20
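The relation between Phred score and error rate quoted on this slide follows the standard definition Q = -10 log10(p). The sketch below shows the conversion in both directions and a simple 3' quality trim. The trimming rule (remove trailing bases below a cutoff) is only one of several possible strategies and is an assumption here.

```python
import math


def phred_to_error(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score corresponding to error probability p."""
    return -10 * math.log10(p)


def trim_3prime(seq, quals, min_q=20):
    """Remove trailing bases whose quality is below min_q."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


if __name__ == "__main__":
    print(phred_to_error(20))
    print(trim_3prime("ACGTAC", [30, 30, 25, 22, 15, 10]))
```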

Single short read uniqueness

~4MM reads

Illumina 35 base reads aligned to A. thaliana genome
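How often a short read maps to a single location can be estimated from the reference alone, by counting how many of its k-mers occur exactly once. The following is a minimal sketch that considers the forward strand only and holds all k-mers in memory, so it is suited to small test sequences rather than a full genome. The function name is hypothetical.

```python
from collections import Counter


def unique_kmer_fraction(genome, k):
    """Fraction of k-mer start positions whose k-mer occurs exactly once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    if not kmers:
        return 0.0
    counts = Counter(kmers)
    return sum(1 for kmer in kmers if counts[kmer] == 1) / len(kmers)


if __name__ == "__main__":
    toy_genome = "ACGTACGTTT"
    for k in (2, 5):
        print(k, unique_kmer_fraction(toy_genome, k))
```

Longer reads raise the unique fraction, which is the trade-off behind choosing read length for alignment-based applications.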

Applications of Next-Generation Sequencing

Gene Expression Profiling

– Tag count & Alignments
– Digital Gene Expression Tag Profiling
  • Short cDNA fragments mapping to 3’ ends of transcripts
  • SAGE-like approach (1 short tag/transcript)
  • 20 base tag output (RE site + 16 bases) aligned to a reference genome
  • Identify, quantify and annotate expressed genes
– Transcriptome Profiling (RNA-Seq)
  • cDNA fragments generated via random priming
  • 36-75 base output aligned to a reference genome
  • Assemble entire transcript sequence
  • Identify, quantify and annotate expressed genes
  • Identify SNPs, alleles and alternative splice variants
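The tag-profiling idea (one 20-base tag per transcript, made of the restriction site plus 16 bases) can be sketched in silico. The example below assumes NlaIII (CATG) as the restriction enzyme and takes the 3'-most site of each transcript. The normalization to tags per million is an illustrative choice, not a step taken from the slides.

```python
from collections import Counter


def extract_tag(transcript, site="CATG", tag_len=16):
    """Return the 3'-most site plus tag_len downstream bases, or None."""
    pos = transcript.rfind(site)
    end = pos + len(site) + tag_len
    if pos == -1 or end > len(transcript):
        return None  # no site, or too close to the 3' end
    return transcript[pos:end]


def tags_per_million(tags):
    """Count observed tags and scale the counts to tags per million."""
    counts = Counter(t for t in tags if t is not None)
    total = sum(counts.values())
    return {tag: n * 1_000_000 / total for tag, n in counts.items()}


if __name__ == "__main__":
    transcript = "GGGCATGAAAA" + "CATG" + "ACGTACGTACGTACGT" + "AAAAAA"
    print(extract_tag(transcript))
```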

Tag Profiling – Sample Prep (Illumina)

[Figure: tag profiling workflow, step order reconstructed from the slide labels. Total RNA (5ug); mRNA isolation; 1st and 2nd strand cDNA synthesis; restriction enzyme digestion (DpnII or NlaIII); GEX Adaptor 1 ligation; MmeI digestion; GEX Adaptor 2 ligation; PCR amplification (PCR Primers 1 and 2); cluster generation (sequencing primer)]

Transcriptome Profiling – Sample Prep (Illumina)

[Figure: transcriptome profiling workflow, step order reconstructed from the slide labels. Tissue; total RNA isolation (10ug); mRNA isolation; fragmentation (random); 1st and 2nd strand cDNA synthesis (N6 primer); adaptor ligations; PCR amplification (PCR Primers 1 and 2); cluster generation (sequencing primers 1 and 2)]

Novel Transcript Discovery

– Small RNA Identification and Profiling
  • Small RNA size is well suited to discovery by next-generation sequencing
– Deep assessment of alternative splicing isoforms
  • Deep coverage allows discovery of rare isoforms

Mortazavi et al. (2008), Nat. Methods

De novo Sequencing

– Whole Genome Sequencing
  • Small genomes that are not too complex (microbial)
  • The longer the reads, the better – 454 chemistry most suitable
  • Paired-End sequencing
– Whole Transcriptome Sequencing
– Targeted Sequencing
  • Pooled PCR products
    – Raindance Technologies (~4,000 amplicons in one tube)
    – Padlock probes
  • Pooled BAC clones
  • Sequence Capture (Solid phase, Liquid phase)
    – Agilent, Febit & Nimblegen
– Metagenomics & Microbial diversity

Gene Regulation

– ChIP-Seq (immunoprecipitate sequencing)
  • Capture regions of the genome bound by proteins (transcription factors, histones)
  • Sequences need to be aligned to a reference sequence
  • Requires complex algorithm to determine differential levels of coverage throughout the genome
– Methyl-Seq (methylation status) – Bisulfite Sequencing
  • Sequences aligned and compared to reference sequence
– DNAseI Hypersensitivity Site Sequencing

Mikkelsen et al. (2007), Nature
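For bisulfite sequencing, comparing reads to the reference works because bisulfite treatment converts unmethylated cytosines to uracil (read as T) while methylated cytosines remain C. The sketch below makes strong simplifying assumptions: the read is already aligned without gaps to the forward strand of the reference, and sequencing errors are ignored. The function name is hypothetical.

```python
def methylation_calls(reference, read):
    """Classify each reference cytosine covered by an ungapped aligned read.

    Returns {position: "methylated" | "unmethylated" | "ambiguous"}.
    """
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = "methylated"      # protected from conversion
        elif read_base == "T":
            calls[i] = "unmethylated"    # converted C -> U, read as T
        else:
            calls[i] = "ambiguous"       # SNP or sequencing error
    return calls


if __name__ == "__main__":
    print(methylation_calls("ACGTCCGA", "ACGTTCGA"))
```

In practice, many reads per site are pooled to estimate a methylation level rather than relying on a single read.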

Variant & Structural Variation

– Coverage & Alignment
– Paired-End Sequencing
– Whole Genome Resequencing
  • Small genomes that are not too complex (repeats, duplications...)
  • The longer the reads, the better
– Targeted Resequencing
  • Complex genomes (crops)
    – Reduced representation libraries (methyl-sensitive enzymes)
    – Transcriptome
  • Sequence Capture (Microarrays)
    » Agilent, Febit & Nimblegen
    » CGH arrays

Challenges in variant discovery

1. Base quality & filtering (scoring threshold)
2. Sequencing errors vs. SNPs
   1. To differentiate true polymorphisms from sequencing errors
   2. Coverage of a given SNP region and redundancy of reads (coverage vs. number of samples)
3. Availability of a reference sequence (genome)
   1. To separate unique vs. duplicated sequences
   2. Duplication in one line but not another
   3. Polymorphism rate in one line vs. another = need to set conditions for alignment
4. Paired-end sequencing can help unique read placement
5. Complex genomes = need to reduce complexity prior to sequencing
   1. High repeat content (e.g., ~80% in maize, ~70% in soy, 90% in sunflower…)
   2. Gene duplications and genome plasticity (polyploidy, partial or whole genome duplications...)
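The trade-off between sequencing errors and read redundancy (item 2) can be made concrete with a binomial model. The sketch below assumes independent errors at a uniform per-base rate, which is a simplification, and computes the chance that at least k of n reads covering a position are wrong purely by error.

```python
from math import comb


def prob_at_least_k_errors(n, k, error_rate):
    """P(at least k of n independent base calls are erroneous)."""
    return sum(comb(n, i) * error_rate ** i * (1 - error_rate) ** (n - i)
               for i in range(k, n + 1))


if __name__ == "__main__":
    # Depth 10 at Phred 20 (1% error): compare requiring 1 vs. 2 reads.
    for k in (1, 2, 3):
        print(k, prob_at_least_k_errors(10, k, 0.01))
```

Requiring the variant base in two or more reads sharply lowers the chance of calling an error as a SNP, which is the rationale for discarding singlet reads.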

Reduced-representation libraries

[Figure: genomic region with transposons and PstI sites (P); PstI digestion, then recovery of the digested fraction (gel, column)]

1. DNA methylation in plants occurs at cytosines (as 5-methylcytosine) within CpG dinucleotides and CpNpG trinucleotides.
2. Transposons and other repeats comprise the largest fraction of methylated DNA. Studies in Arabidopsis have shown that CG sites in the 3’ end of the transcribed regions of more than one third of all genes are also methylated (Zhang, Science, 320, 489, 2008).
3. Methylation is critically important in silencing transposons and regulating plant development (methylation in promoters appears to reduce transcription).

Library Construction

Genomic DNA

1. Digestion with one methyl-sensitive restriction enzyme (RE) and fractionation
2. Ligation of biotinylated RE-specific adapters 1
3. Digestion with 4-bp cutter (DpnII, GATC)
4. Ligation of DpnII-specific adapter
5. Binding to streptavidin column and digestion with RE
6. Ligation of RE-specific adapters 2
7. PCR enrichment, gel purification, size selection (150-500bp fragments), cluster synthesis and sequencing (36 cycles)

Deschamps et al. The Plant Genome (in press)
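The first digestion and the size selection can be sketched in silico. This example assumes PstI (recognition site CTGCAG, cut after the A) as the methyl-sensitive enzyme, since PstI appears on the reduced-representation slide. It ignores methylation entirely, so it shows the mechanics of digestion rather than the enrichment effect. Function names are hypothetical.

```python
def digest(sequence, site="CTGCAG", cut_offset=5):
    """Cut sequence at every occurrence of site; return the fragments.

    cut_offset is the number of site bases left on the upstream fragment
    (5 for PstI, which cuts CTGCA^G).
    """
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])
    return fragments


def size_select(fragments, min_len, max_len):
    """Keep fragments whose length lies within [min_len, max_len]."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


if __name__ == "__main__":
    toy = "AAAA" + "CTGCAG" + "TTTTTTTT" + "CTGCAG" + "CC"
    print(digest(toy))
```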

SNP detection flowchart

Filtering and Condensing
1. Basecalling, cropping last 4 bases & initial base-quality filter (for individual tags)
2. Condensing & optional consensus base-quality filter (for unitag sequences)
3. Creating HQ unitag datasets (removing singlets)

Comparing two genotypes
4. Comparing HQ unitag datasets from genotype “A” and genotype “B” using Vmatch
5. Filtering, to accept clusters with only two members (A, B) with exactly one mismatch
6. Recovering matched HQ unitag sequences and SNP sites from Vmatch alignments

Mapping to genome
7. Mapping SNP-containing HQ unitags to reference sequence (genome), using a k-mer table (k=length of trimmed tags), and finding copy numbers and locations
8. Capturing single-copy HQ unitags with up to a single-base mismatch to the reference sequence at the exact location of the putative SNP site for one or both genotypes
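The condensing and genotype-comparison steps can be sketched as follows. This is not the published pipeline: a naive all-against-all Hamming comparison stands in for Vmatch, quality filtering and genome mapping are omitted, and all function names are hypothetical. It only illustrates the logic of removing singlets and accepting two-member clusters with exactly one mismatch.

```python
from collections import Counter


def hq_unitags(reads, min_count=2):
    """Condense identical reads into unitags and drop singlets."""
    counts = Counter(reads)
    return {seq for seq, n in counts.items() if n >= min_count}


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def near(tag, pool):
    """Tags in pool of the same length within one mismatch of tag."""
    return [t for t in pool if len(t) == len(tag) and hamming(tag, t) <= 1]


def putative_snps(tags_a, tags_b):
    """Return (tag_a, tag_b, position) for clean one-mismatch pairs."""
    snps = []
    for a in sorted(tags_a):
        b_hits = near(a, tags_b)
        # Cluster must hold exactly one tag from each genotype.
        if len(b_hits) != 1 or near(a, tags_a) != [a]:
            continue
        b = b_hits[0]
        if hamming(a, b) != 1 or near(b, tags_b) != [b] or near(b, tags_a) != [a]:
            continue
        pos = next(i for i, (x, y) in enumerate(zip(a, b)) if x != y)
        snps.append((a, b, pos))
    return snps


if __name__ == "__main__":
    reads_a = ["ACGTACGT"] * 3 + ["TTTTCCCC"] * 2 + ["GGGGGGGG"]
    reads_b = ["ACGTACCT"] * 2 + ["TTTTCCCC"] * 2
    print(putative_snps(hq_unitags(reads_a), hq_unitags(reads_b)))
```

The quadratic comparison is only workable for toy data; a dedicated matching tool is needed at the scale of hundreds of thousands of unitags.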

Example: one flow cell in soybean (Williams82 vs. Pintado)

Run Metrics | Williams82 | Pintado
Number of total reads generated (after initial basecalling) | 37,666,279 | 38,000,474
Number of filtered total reads † | 24,519,484 | 23,101,973
Number of unitags (generated from filtered total reads) | 965,610 | 885,429
Number of high quality (HQ) unitags ‡ | 255,918 | 246,102
Alignment of HQ unitags against the reference sequence:
  Zero mismatch § | 208,923 | 197,015
  One mismatch § | 27,770 | 27,699
  Two or more mismatches § | 19,225 | 21,388
HQ unitags aligning uniquely to the reference sequence with zero mismatch | 152,185 | 144,559

† Filtered total reads defined as having a quality value for individual base greater than or equal to 15
‡ HQ unitag reads defined as having a quality value for each base greater than or equal to 15, and with an individual read count greater than or equal to 2.
§ Best match to reference sequence of HQ unitag reads aligning uniquely or multiple times to the reference sequence

[Figure: log-log plot of frequency vs. depth, axes spanning 1 to 100,000]
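As a consistency check on the run metrics, the three mismatch classes should add up to the number of HQ unitags for each genotype. The short sketch below verifies this and computes the fraction of reads kept at a given step. The dictionary keys are illustrative names; the values are copied from the table.

```python
RUN_METRICS = {
    "Williams82": {
        "total_reads": 37_666_279, "filtered_reads": 24_519_484,
        "unitags": 965_610, "hq_unitags": 255_918,
        "zero_mismatch": 208_923, "one_mismatch": 27_770,
        "two_plus_mismatch": 19_225, "unique_zero_mismatch": 152_185,
    },
    "Pintado": {
        "total_reads": 38_000_474, "filtered_reads": 23_101_973,
        "unitags": 885_429, "hq_unitags": 246_102,
        "zero_mismatch": 197_015, "one_mismatch": 27_699,
        "two_plus_mismatch": 21_388, "unique_zero_mismatch": 144_559,
    },
}


def classes_sum_to_hq(m):
    """True if the mismatch classes account for every HQ unitag."""
    total = m["zero_mismatch"] + m["one_mismatch"] + m["two_plus_mismatch"]
    return total == m["hq_unitags"]


def fraction(m, part, whole):
    """Share of `whole` retained in `part`."""
    return m[part] / m[whole]


if __name__ == "__main__":
    for name, m in RUN_METRICS.items():
        print(name, classes_sum_to_hq(m),
              round(fraction(m, "filtered_reads", "total_reads"), 3))
```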

Results & Validation

Experiments | Putative SNPs | Confirmed* | Not Confirmed* | Validation rate
Q Score threshold: 15
Soy: Williams82 vs. Pintado | 1,682 | 163 | 5 | 97.0%
Rice: Kasalath vs. Taichung65 | 2,618 | 162 | 6 | 96.4%
Q Score threshold: 25
Soy: Williams82 vs. Pintado | 702 | 168 | 2 | 98.8%
Rice: Kasalath vs. Taichung65 | 2,148 | 174 | 1 | 99.4%

*SNPs confirmed/not confirmed via Sanger sequencing of PCR products for both genotypes
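The reported validation rates are consistent with confirmed / (confirmed + not confirmed), computed over the subset of putative SNPs that were tested by Sanger sequencing. The sketch below reproduces them from the table values. The formula is inferred from the numbers rather than stated on the slide.

```python
def validation_rate(confirmed, not_confirmed):
    """Percentage of Sanger-tested SNPs that were confirmed."""
    return 100 * confirmed / (confirmed + not_confirmed)


RESULTS = {
    ("Q15", "Soy"): (163, 5),
    ("Q15", "Rice"): (162, 6),
    ("Q25", "Soy"): (168, 2),
    ("Q25", "Rice"): (174, 1),
}

if __name__ == "__main__":
    for key, (confirmed, not_confirmed) in RESULTS.items():
        print(key, round(validation_rate(confirmed, not_confirmed), 1))
```

Raising the Q score threshold from 15 to 25 reduces the number of putative SNPs but increases the validation rate in both species.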

Distribution of HQ unitags & SNPs related to annotated gene density (soybean)

Gene Density (excluding TEs) in 500Kb window

Coverage by HQ unitags in 70Kb window

SNP Density in 70Kb window

Distribution of HQ unitags & SNPs related to distance to annotated genes (excluding TEs) in soybean

Intron, CDS and UTR coordinates determined from GFF annotation files

Bioinformatic tools

Alignment and Polymorphism Detection

1. SOAP – Short Oligonucleotide Alignment Program

• Ruiqiang Li, Beijing Genomics Institute

• http://soap.genomics.org.cn

2. MAQ – Mapping and Assembly with Quality

• Heng Li, Sanger Centre

• http://maq.sourceforge.net/maq-man.shtml

3. Bowtie - An ultrafast memory-efficient short read aligner

• Ben Langmead and Cole Trapnell, University of Maryland

• http://bowtie-bio.sourceforge.net/

4. ssahaSNP – Tool to detect homozygous SNPs and indels

• Adam Spargo and Zemin Ning, Sanger Centre

• http://www.sanger.ac.uk/Software/analysis/ssahaSNP

Bioinformatic tools

Genomic Assembly

1. Velvet – De novo assembly of short reads

• Daniel Zerbino and Ewan Birney, EMBL-EBI

• http://www.ebi.ac.uk/~zerbino/velvet/

2. SSAKE – Assembly of short reads

• Rene Warren, et al, British Columbia Cancer Agency

• http://bioinformatics.oxfordjournals.org/cgi/content/full/23/4/500

3. Euler – Genomic Assembly

• Pavel Pevzner and Mark Chaisson, University of California, San Diego

• http://nbcr.sdsc.edu/euler/

www.illumina.com

Bioinformatic tools

ChIP Sequencing

1. ChIP-Seq Peak Finder
• Barbara Wold, Caltech and Rick Myers, Stanford University
• http://woldlab.caltech.edu/html/software/

Digital Gene Expression

1. Comparative Count Display
• Alex Lash, NIH
• ftp://ftp.ncbi.nlm.nih.gov/pub/sage/obsolete/bin/ccd/

2. SAGE DGED Tool
• Cancer Genome Anatomy Project
• http://cgap.nci.nih.gov/SAGE/SDGED_Wizard?METHOD=SS10,LS10&ORG=Hs

www.illumina.com

Bioinformatic tools - Illumina

Overview

1. Obtain Bustard reads and align against the genome with Eland
2. Aggregate data and call SNPs with CASAVA
3. Import data with the GenomeStudio™ wizard
4. Examine coverage and quality in stacked alignment graphs for a selected region/chromosome
5. Export table of SNPs and consensus sequence

Third-Generation Sequencing technologies: what’s next?


Pacific Biosciences

• SMRT™ Technology (to be commercially launched Fall 2010)
• A single DNA polymerase, attached at the bottom surface of a nanometer-scale hole, incorporates fluorescently labeled nucleotides into the elongating DNA strand in real time
• The elongated strand can be several thousand nucleotides in length

www.pacificbiosciences.com

Pacific Biosciences

1. The small size of the hole favors rapid in-and-out diffusion of nucleotides, and of dyes following their cleavage. Meanwhile, an incorporated nucleotide is held within the detection volume for tens of milliseconds, orders of magnitude longer than the time it takes for nucleotides to diffuse in and out of the hole, which decreases background noise.
2. The fluorescent dye is attached to the phosphate chain, rather than the base, and is cleaved when the nucleotide is incorporated into the DNA strand.

=> Decreased background noise and the use of phospholinked nucleotides circumvent the need for successive cycles of incorporation, washing, scanning and removal of the label, therefore optimizing processivity of the enzyme and allowing longer read lengths

=> No need for washing decreases the consumption of reagents

Nanopore Sequencing = the real $100 genome?

1. Sequencing-by-Synthesis requires lots of preparation, lots of reagents (polymerase, nucleotides, fluorescent labels...) and expensive detection systems.
2. Nanopore sequencing does not rely on amplification or labeling, and provides a direct electrical signal for base calling. It is based on the simple idea of passing DNA fragments through a nanometer-scale pore and detecting, in real time, the signal produced as the DNA blocks the electrical current running through the pore.
3. Oxford Nanopore: protein nanopore
   1. Long read lengths (1000’s)
   2. High read accuracy
   3. Current technical issues:
      • Attachment of the exonuclease to the pore
      • Parallelization (1,000’s of pores per chip)

[Figure: protein nanopore showing exonuclease, alpha-hemolysin pore and cyclodextrin (encapsulates nucleotide)]

www.nanoporetech.com

Questions?
