Introduction to next generation sequencing
Introduction to NGS http://ueb.ir.vhebron.net/NGS
Introduction to Next Generation Sequencing
Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona
Statistics and Bioinformatics Unit, Vall d’Hebron Institut de Recerca
Alex Sánchez
Outline
• Introduction, Presentation, Goals.
• Next generation sequencing technologies.
  – Evolution, Description, Comparison.
• Applications of NGS.
• Bioinformatics challenges.
• Some aspects of NGS data analysis.
  – NGS data, and data preprocessing (QC)
  – Types of analyses, workflows, tools
• Conclusions and perspectives
Who, where, what?
Introduction
Why is NGS revolutionary?
• NGS has not only brought high speed to genome sequencing and personalized medicine;
• it has also changed the way we do genome research
Got a question on genome organization?
SEQUENCE IT !!!
Ana Conesa, bioinformatics researcher at Principe Felipe Research Center
Sequencing: from DNA to Genomes
• Sanger chain termination (1977)
• Hierarchical and shotgun sequencing (1996)
The human genome project
Next generation sequencing
The future is here, now
Next generation sequencing
• By the middle of the decade, new technologies had consolidated, allowing the massive production of tens of millions of short sequenced fragments.
• These techniques could be used to
  – deal with the same kinds of problems as microarrays,
  – but also with many others.
• “Again” they raised the promise of personalized medicine.
NGS technologies
Next-generation DNA sequencing
Sanger sequencing vs. cyclic-array (next-generation) sequencing
Advantages of NGS:
- Construction of a sequencing library, followed by clonal amplification to generate sequencing features
  No in vivo cloning, transformation, colony picking...
- Array-based sequencing
  Higher degree of parallelism than capillary-based sequencing
NGS means high sequencing capacity
• GS FLX 454 (Roche)
• HiSeq 2000 (Illumina)
• 5500xl SOLiD (ABI)
• Ion Torrent
• GS Junior
NGS Platforms Performance
[Figure: platform performance comparison; surviving label: 454 GS Junior, 35MB]
Next-generation DNA sequencing
454
SOLiD
SOLEXA
Workflow?
454 Sequencing
ABI SOLID Sequencing
Solexa sequencing
Comparison of second-generation NGS platforms
Some numbers
The sequencing process, in detail
1. Library preparation: DNA fragmentation and in vitro adaptor ligation
2. Clonal amplification: emulsion PCR or bridge PCR
3. Cyclic array sequencing:
  – 454 sequencing: pyrosequencing
  – SOLiD platform: sequencing-by-ligation
  – Solexa technology: sequencing-by-synthesis
Next next generation sequencing
• Pacific Biosciences
  – Real-time DNA synthesis
  – Up to 12,000 nt (?)
  – 50 bases/second (?)
• Promises delivery of a human genome in minutes?
  – Company on track for 2013
NGS Applications
Bioinformatics challenges of NGS
NGS pushes bioinformatics needs up
• Need for large amounts of CPU power
  – Informatics groups must manage compute clusters
  – Challenges in parallelizing existing software or redesigning algorithms to work in a parallel environment
  – Another level of software complexity and challenges to interoperability
• VERY large text files (~10 million lines long)
  – Can’t do ‘business as usual’ with familiar tools such as Perl/Python
  – Impossible memory usage and execution time
  – Impossible to browse for problems
• Need for sequence quality filtering
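One common way around the memory problem with very large text files is to stream them line by line instead of loading them whole. The sketch below is only an illustration under that assumption; the helper names (open_text, count_lines, head) are hypothetical and not part of any NGS tool.

```python
import gzip
import os
import tempfile
from itertools import islice


def open_text(path):
    """Open a plain or gzip-compressed text file for reading (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_lines(path):
    """Count lines by streaming; only one line is held in memory at a time."""
    with open_text(path) as handle:
        return sum(1 for _ in handle)


def head(path, n=4):
    """Return the first n lines without reading the rest of the file."""
    with open_text(path) as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


if __name__ == "__main__":
    # Tiny stand-in for a multi-gigabyte file, written to a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.txt")
        with open(demo, "w") as fh:
            fh.write("line1\nline2\nline3\n")
        print(count_lines(demo))
        print(head(demo, 2))
```

The same pattern applies unchanged to files with tens of millions of lines, since memory use does not grow with file size.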
Data management issues
• Raw data are large. How long should they be kept?
• Processed data are manageable for most people
  – 20 million reads (50 bp) ~ 1 Gb
• More of an issue for a facility: for the HiSeq, 32 CPU cores, each with 4 GB RAM, are recommended
• Certain studies are much more data intensive than others
  – Whole genome sequencing
    • A 30X coverage genome pair (tumor/normal) ~ 500 GB
    • 50 genome pairs ~ 25 TB
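The figures above can be checked with a few lines of arithmetic. This is a rough sketch: it assumes "~1 Gb" counts sequenced bases (reads × read length) and uses decimal units (1 TB = 1000 GB). The function names are illustrative only.

```python
def total_bases(n_reads, read_length):
    """Number of sequenced bases in a run."""
    return n_reads * read_length


def project_storage_tb(n_pairs, gb_per_pair=500):
    """Storage for a set of tumor/normal genome pairs, in TB (decimal units assumed)."""
    return n_pairs * gb_per_pair / 1000


print(total_bases(20_000_000, 50))   # 1000000000 bases, i.e. about 1 Gb
print(project_storage_tb(50))        # 25.0 TB for 50 pairs at ~500 GB each
```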
So what?
• In NGS we have to process really large amounts of data, which is not trivial in computing terms.
• Big NGS projects require supercomputing infrastructures.
• Or, put another way: not everyone can study everything.
  – Small facilities must choose their projects carefully, scaling them to their computing capabilities.
Computational infrastructure for NGS
• There is great variety, but a good starting point is:
  – Computing cluster
    • Multiple nodes (servers), each with multiple cores
    • High performance storage (TB, PB level)
    • Fast networks (10 Gb Ethernet, InfiniBand)
  – Enough space and suitable conditions for the equipment ("server room")
  – Skilled people (sysadmins, developers)
    • CNAG, in Barcelona: 30 people, more than 50% of them informaticians
Big computing infrastructure
• Distributed memory cluster
  – Starting at 20 computing nodes
  – 160 to 240 cores
  – amd64 (x86_64) is the most used CPU architecture
  – At least 48 GB RAM per node
• Fast networks
  – 10 Gbit
  – InfiniBand
• Batch queue system (sge, condor, pbs, slurm)
• Optional MPI and GPU environment, depending on project requirements
Big infrastructure is expensive
• Starting at 200,000 €
  – 200,000 € is just the hardware
  – Plus data center (computer room)
  – Plus informaticians' salaries
• Not every partner knows about supercomputing.
  – SGI
  – Bull
  – IBM
  – HP
Middle size infrastructure
• “Small” distributed filesystem (around 50 TB).
• “Small” cluster (around 10 nodes, 80 to 120 cores).
• At least a gigabit Ethernet network.
• Price range: 50,000 – 100,000 € (just hardware)
  – plus data center and informaticians' salaries
Small infrastructure
• At least 2 machines recommended
  – 8 or 12 cores per machine.
  – 48 GB RAM minimum per machine.
  – BIG local disk: at least 4 TB per machine
• As many local disks as we can afford
• Price range: starting at 8,000 € – 10,000 € (2 machines)
Alternatives (1): Cloud Computing
• Pros
  – Flexibility.
  – You pay for what you use.
  – No need to maintain a data center.
• Cons
  – Transferring big datasets over the internet is slow.
  – You pay for consumed bandwidth. That is a problem with big datasets.
  – Lower performance, especially in disk read/write.
  – Privacy/security concerns.
  – More expensive for big and long-term projects.
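To get a feel for the "slow transfer" point, the sketch below estimates an upload time. The 500 GB dataset size is taken from the tumor/normal genome pair mentioned under data management; the 100 Mbit/s link speed is purely an assumed value, and real throughput is usually lower than the nominal rate.

```python
def transfer_time_hours(size_gb, bandwidth_mbit_s, efficiency=1.0):
    """Hours needed to move size_gb (decimal GB) over a link of bandwidth_mbit_s.

    efficiency is the assumed fraction of nominal bandwidth actually achieved.
    """
    size_bits = size_gb * 1e9 * 8
    seconds = size_bits / (bandwidth_mbit_s * 1e6 * efficiency)
    return seconds / 3600


# One ~500 GB genome pair over an assumed 100 Mbit/s connection.
print(round(transfer_time_hours(500, 100), 1))  # 11.1 (hours)
```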
Alternatives (2): Grid Computing
• Pros
  – Cheaper.
  – More resources available.
• Cons
  – Heterogeneous environment.
  – Slow connectivity (especially in Spain).
  – Much time required to find good resources in the grid.
NGS data analysis
NGS data analysis stages
A typical workflow (sequence-to-variant workflow)
NGS Applications are sequencing applications
• Whole Genome Sequencing
• Resequencing
• Transcriptome Analysis
• Gene Regulation
• Epigenetic Changes
• Metagenomics
• Paleogenomics
Metagenomics and other community-based “omics”
Zoetendal E G et al. Gut 2008;57:1605-1615
De novo sequencing
Transcriptomics by NGS: RNASeq
• Digital Signal
  – Harder to achieve & interpret
  – Read counts: discrete values
  – Weak background or no noise
• Analog Signal
  – Easy to convey the signal’s information
  – Continuous strength
  – Signal loss and distortion
Quality control and preprocessing of NGS data
Preprocessing sequences improves results
Why QC and preprocessing
• Sequencer output:
  – Reads + quality
• Natural questions
  – Is the quality of my sequenced data OK?
  – If something is wrong, can I fix it?
• Problem: HUGE files... What do they look like?
• Files are flat files and are big: tens of GB (hard even to browse)
How is quality measured?
• A quality score is assigned to each peak
• The frequently used Phred scores are log10-transformed error probabilities, Q = -10 · log10(P):
  – score = 20 corresponds to a 1% error rate
  – score = 30 corresponds to a 0.1% error rate
  – score = 40 corresponds to a 0.01% error rate
• Base calling (A, T, G or C) is performed based on Phred scores.
• Ambiguous positions with Phred scores <= 20 are labeled with N.
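The relation between Phred scores and error rates listed above can be reproduced with the standard definition Q = -10 · log10(P). A minimal sketch; the function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred quality score q."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score for an error probability p."""
    return -10 * math.log10(p)


for q in (20, 30, 40):
    print(q, f"{phred_to_error_prob(q):.2%}")
# 20 1.00%
# 30 0.10%
# 40 0.01%
```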
Sequence formats
• FastA format (everybody knows about it)
  – Header line starts with “>” followed by a sequence ID
  – Sequence (string of nt).
• FastQ format (http://maq.sourceforge.net/fastq.shtml)
  – First come the header and the sequence (as in FastA, but the header starts with “@”)
  – Then “+” and the sequence ID (optional), and on the following line the QVs, encoded as single-byte ASCII codes
  – Several quality encoding variants exist
• Nearly all downstream analyses take FastQ as input
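The FastQ layout described above can be read with a small streaming parser. This is a sketch under two assumptions: each record occupies exactly four lines, and qualities use the Sanger-style encoding (ASCII code minus 33). Other encoding variants use a different offset, so the offset is a parameter. The function name read_fastq is hypothetical.

```python
import io


def read_fastq(handle, offset=33):
    """Yield (read_id, sequence, qualities) from a 4-line-per-record FastQ stream.

    offset=33 assumes Sanger-style quality encoding; other variants differ.
    """
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FastQ record near: " + header.strip())
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ for " + header.strip())
        yield header[1:].strip(), seq, [ord(c) - offset for c in qual]


example = io.StringIO("@read1\nACGT\n+\nII5#\n")
for read_id, seq, quals in read_fastq(example):
    print(read_id, seq, quals)  # read1 ACGT [40, 40, 20, 2]
```

Because it is a generator, the parser never holds more than one record in memory, which matters for files of tens of GB.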
Some tools to deal with QC
• Use FastQC to see your starting state.
• Use Fastx-toolkit to optimize different datasets and then visualize the result with FastQC to prove your success!
• Hints:
  – Trimming, clipping and filtering may improve quality
  – But beware of removing too many sequences…
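The idea behind trimming and filtering can be illustrated in a few lines. This is a simplified sketch and not the algorithm used by Fastx-toolkit: it cuts low-quality bases from the 3' end and discards reads that become too short. The thresholds (min_q=20, min_len=3) and the function names are arbitrary assumptions for the example.

```python
def trim_3prime(seq, quals, min_q=20):
    """Remove bases from the 3' end until one with quality >= min_q is reached."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def filter_reads(reads, min_q=20, min_len=3):
    """Trim every read, then keep only those still at least min_len long."""
    kept = []
    for seq, quals in reads:
        trimmed_seq, trimmed_quals = trim_3prime(seq, quals, min_q)
        if len(trimmed_seq) >= min_len:
            kept.append((trimmed_seq, trimmed_quals))
    return kept


reads = [
    ("ACGTAC", [40, 40, 38, 30, 12, 5]),  # good read with a poor 3' tail
    ("GGCA", [10, 8, 5, 2]),              # poor quality throughout
]
kept = filter_reads(reads)
print(kept)                                      # [('ACGT', [40, 40, 38, 30])]
print(f"kept {len(kept)} of {len(reads)} reads")  # kept 1 of 2 reads
```

Reporting how many reads survive, as in the last line, is a simple guard against removing too many sequences.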
Go to the tutorial and try the exercises...
Acknowledgements
Statistics and Bioinformatics Research Group, Statistics Department, Universitat de Barcelona.
Xavier de Pedro and Ferran Briansó (but also Jose Luis Mosquera and Israel Ortega) of the Unitat d’Estadística i Bioinformàtica at VHIR (Vall d’Hebron Institut de Recerca)
Unitat de Serveis Científico Tècnics (UCTS) at VHIR (Vall d’Hebron Institut de Recerca)
People whose materials have been borrowed: Manel Comabella, Rosa Prieto, Paqui Gallego, Javier Santoyo, Ana Conesa, Pablo Escobar, Thomas Girke…
Thank you for your attention and patience