The Data Tsunami in Biomedical Research
Guillaume Bourque
McGill University and Genome Quebec Innovation Center, Dept. of Human Genetics, McGill University
June 5th, 2013
Next-generation sequencing (NGS)
Stein, Genome Biol. 2010
Falling cost of sequencing
DeWitt, Nat. Biotechnol. 2012
Sequencing human genomes
• The Human Genome (2001): ~ 3 Billion $
• 1000 Genomes Project (2011): ~ 10 000 $
• Your Genome (2013?): 100 - 1000 $
Outline
• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions
Sequencing Revolution
http://www.brusselsgenetics.be
• Sanger sequencing: 100s of reactions… 10000s of base pairs…
• Next-Generation sequencing: Millions of reactions! Billions of base pairs!
Metzker, Nat. Rev. Genet. 2010
High-throughput Sequencing
• 2009: 36 bp X 20M reads X 8 lanes ≈ 6 Gbases
• 2013: 2 X 150 bp X 250M reads X 8 lanes = 600 Gbases
200 Human Genomes in 1 run!!!
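The yield figures above follow from multiplying read length, reads per lane and lanes per run. The sketch below reproduces that arithmetic. It assumes a human genome of about 3 billion bp counted at 1X coverage for the "200 genomes" figure, and the helper name `run_yield_bases` is illustrative, not something from the talk.

```python
HUMAN_GENOME_BP = 3_000_000_000  # assumption: ~3 billion bp, counted at 1X coverage


def run_yield_bases(read_length_bp, reads_per_lane, lanes, paired=False):
    """Total bases produced by one sequencing run (illustrative helper)."""
    ends = 2 if paired else 1  # paired-end runs read each fragment from both ends
    return ends * read_length_bp * reads_per_lane * lanes


# 2009: single-end 36 bp reads, 20M reads per lane, 8 lanes
yield_2009 = run_yield_bases(36, 20_000_000, 8)
# 2013: paired-end 2 x 150 bp reads, 250M reads per lane, 8 lanes
yield_2013 = run_yield_bases(150, 250_000_000, 8, paired=True)

print(f"2009: {yield_2009 / 1e9:.2f} Gbases")
print(f"2013: {yield_2013 / 1e9:.0f} Gbases")
print(f"Human genomes per 2013 run at 1X: {yield_2013 // HUMAN_GENOME_BP}")
```

The 2009 product comes to 5.76 Gbases, which the slide rounds to 6 Gbases.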
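NGS Technology Comparison
• PacBio
– Method: Single-molecule in real-time
– Read length: 3 kb average
– Error type: indel
– Single-pass error rate: 13%
– Reads per run: 35000–75000
– Time per run: 30 minutes to 2 hours
– Cost per 1 million bases (in US$): $2
– Advantages: Longest read length. Fast.
– Disadvantages: Low yield at high accuracy. Equipment can be very expensive.
• Ion Torrent
– Method: Ion semiconductor
– Read length: 200 bp
– Error type: indel
– Single-pass error rate: ~1%
– Reads per run: up to 4M
– Time per run: 2 hours
– Cost per 1 million bases (in US$): $1
– Advantages: Less expensive equipment. Fast.
– Disadvantages: Homopolymer errors.
• 454
– Method: Pyrosequencing
– Read length: 700 bp
– Error type: indel
– Single-pass error rate: ~0.1%
– Reads per run: 1M
– Time per run: 24 hours
– Cost per 1 million bases (in US$): $10
– Advantages: Long read size. Fast.
– Disadvantages: Runs are expensive. Homopolymer errors.
• Illumina
– Method: Synthesis
– Read length: 50 to 250 bp
– Error type: substitution
– Single-pass error rate: ~0.1%
– Reads per run: up to 3.2G
– Time per run: 1 to 10 days
– Cost per 1 million bases (in US$): $0.05 to $0.15
– Advantages: High sequence yield, cost, accuracy
– Disadvantages: Equipment can be very expensive.
• SOLiD
– Method: Ligation
– Read length: 50+35 or 50+50 bp
– Error type: A-T bias
– Single-pass error rate: ~0.1%
– Reads per run: 1.2 to 1.4G
– Time per run: 1 to 2 weeks
– Cost per 1 million bases (in US$): $0.13
– Advantages: Low cost per base.
– Disadvantages: Slower than other methods, read length, longevity of the platform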
Genome Canada
• > $915M investment and > $900M in co-funding
• 100s of large-scale genomics projects
• 5 Innovation centers
Outline
• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions
Applications (I)
• De novo sequencing
– From the human genome…
– To all model organisms…
– To all relevant organisms (e.g. extreme genomes)…
– To “all” organisms?
Human Genome
• 3 Billion DNA base pairs (bp)
• Two human genomes are ~99.9% identical
• There are about 3M bp differences between you and me
• Some of these differences explain variation in:
– Disease susceptibility
– Drug metabolism
– …
www.dnacenter.com
Applications (II)
• Genome re-sequencing
– Genetic disorders
– Cancer genome sequencing
– Map genomic structural variations across individuals
– Genealogy and migration
– Agricultural crops
– …
1000 Genomes Project
The Cancer Genome Atlas
Exome sequencing for Mendelian disease
“… about one-half to one-third (~3,000) of all known or suspected Mendelian disorders (for example, cystic fibrosis and sickle cell anaemia) have been discovered. However, there is a substantial gap in our knowledge about the genes that cause many rare Mendelian phenotypes.”
“Accordingly, we can realistically look towards a future in which the genetic basis of all Mendelian traits is known, …”
Exome sequencing
Cancer genome sequencing
Can obtain a full catalogue of mutations
Michael Stromberg, bioinformatics.ca
Mutations in paediatric glioblastoma
Jabado, Pfister and Majewski
Sequenced the exomes of 48 paediatric glioblastoma (GBM) samples, found:
• Somatic mutations in the H3.3-ATRX-DAXX chromatin remodelling pathway in 44% of tumours
• Recurrent mutations in H3F3A, which encodes the replication-independent histone 3 variant H3.3, in 31% of tumours
Applications (III)
• Quantitative biology of complex systems
– New high-throughput technologies in functional genomics: ChIP-Seq, RNA-Seq, ChIA-PET, RIP-Seq, …
– From single-gene measurements, to thousands of probes on arrays, to profiles covering all 3B bases of the genome
– Important systems: Stem cells, Cancer, Infectious diseases…
Outline
• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions
Big Data
2013:
• Image files: 70 TBytes
• Intensity files: 2 X 10 TBytes
• Reads + qualities: 1 TBytes
[Other figure labels: 12 TBytes, 240 TBytes]
• 25 TB of raw data / month
• 300 TB of raw data / year
From: Alexandre Montpetit
Subject: news from Illumina
Date: 4 June, 2013 2:15:16 PM EDT
To: Guillaume Bourque
From Mark Van Oene (Illumina VP of sales): over the next year we should expect 2x more reads in half the time (and reads 2x longer). Does that cause a problem?
Alex
Large NGS project
Cancer project with whole genome data:
• 500 tumors: 500 X 3 lanes = 500 X 250 GB = 125 TB raw
vs
• 500 matched-normal: 500 X 3 lanes = 500 X 250 GB = 125 TB raw
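The storage totals above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 1000 GB), which the slide does not specify. The helper name `cohort_raw_tb` is hypothetical.

```python
GB_PER_TB = 1000  # assumption: decimal units; the slide does not say


def cohort_raw_tb(samples, gb_per_sample):
    """Raw storage in TB for a cohort of equally sized samples (illustrative)."""
    return samples * gb_per_sample / GB_PER_TB


# 500 tumors and 500 matched-normal samples, 3 lanes (about 250 GB) each
tumor_tb = cohort_raw_tb(500, 250)
normal_tb = cohort_raw_tb(500, 250)
project_tb = tumor_tb + normal_tb

print(f"Tumors: {tumor_tb:.0f} TB, matched-normal: {normal_tb:.0f} TB")
print(f"Whole project: {project_tb:.0f} TB raw")
```

Under that assumption, the two cohorts together need twice the per-cohort figure before any derived files are stored.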
DNA bases sequenced at the Innovation Center
[Figure; axis label: DNA bases]
• 12 HiSeqs
• 72 Trillion!
• Or 800 genomes at 30X
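The "800 genomes at 30X" equivalence is a division of total bases by coverage times genome size. A minimal sketch, assuming the 3 billion bp human genome size quoted earlier in the talk (the function name `genomes_at_coverage` is illustrative):

```python
def genomes_at_coverage(total_bases, coverage, genome_size_bp=3_000_000_000):
    """Number of genomes that total_bases would cover at the given depth."""
    return total_bases / (coverage * genome_size_bp)


total_bases = 72_000_000_000_000  # 72 trillion bases, from the slide
equivalent_genomes = genomes_at_coverage(total_bases, coverage=30)
print(f"{equivalent_genomes:.0f} human genomes at 30X")
```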
adventure.nationalgeographic.com
Biomedical research is built on data integration
[Figure labels: Your data; 100X]
Challenges
• NGS instruments generate TBs of data
• NGS instruments are getting faster, cheaper and will increasingly be found in small research labs and hospitals
• Data sharing and integration is critical in biomedical research
• Sequencing data represents sensitive private data and is identifiable
Outline
• Overview of Next-Generation Sequencing (NGS)
• Applications
• Challenges
• Solutions
Nanuq software
Has tracked data and meta-data for more than:
• 2.6 million sample aliquots
• 20,500 reagents
• 17,000 plates
• 140,000 tubes
• Multiple platforms, technologies and workflows (sequencing, genotyping, microarray, etc.)
• 3,900 external users
Standardized analysis pipelines
• ChIP-Seq … Analysis report
• RNA-Seq … Analysis report
• Methylation … Analysis report
• …
Data center at the Innovation Center
• > 1200 cores
• > 2 PB disk
• > 5 PB tape
Need more!
McGill Guillimin – 16000 cores
UdeS Mammouth – 39168 cores
Data processing issues
• We have many different projects all needing space and processing.
• We want to use the Compute Canada clusters for scalability but also to facilitate data distribution (we have >800 users).
• This brings uniformity problems:
– Different setups (hardware and software)
– Different configurations
– Etc.
Our strategy
• We wrote analysis pipelines to be easily configurable across clusters.
• Same code, one ini file to customize (we already have templates for 3 cluster sites).
• We install Linux modules readable by all on all these clusters so we know exactly what is available everywhere
• We also deploy common genomes across sites.
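The "same code, one ini file" idea can be sketched with Python's standard `configparser`. Everything specific below is an assumption made for illustration: the section names, keys, paths, module name and the `align` command are hypothetical and are not taken from the Innovation Center's pipelines. The sketch also keeps all sites in one ini text for brevity, whereas the talk describes one template per site.

```python
import configparser

# Hypothetical settings: shared values in DEFAULT, one section per cluster site.
INI_TEXT = """
[DEFAULT]
genome_dir = /path/to/genomes
aligner_module = aligner/1.0

[guillimin]
scratch_dir = /scratch/guillimin
threads = 12

[mammouth]
scratch_dir = /scratch/mammouth
threads = 24
"""


def load_cluster_settings(ini_text, cluster):
    """Return the settings for one cluster, with DEFAULT values filled in."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return dict(parser[cluster])


def build_align_command(settings, sample):
    """Build a shell command from settings only, so the code is site-agnostic."""
    return (
        f"module load {settings['aligner_module']} && "
        f"align --threads {settings['threads']} "
        f"--genome {settings['genome_dir']}/reference.fa "
        f"--out {settings['scratch_dir']}/{sample}.bam {sample}.fastq"
    )


for site in ("guillimin", "mammouth"):
    print(build_align_command(load_cluster_settings(INI_TEXT, site), "sampleA"))
```

Because the command builder reads every site-specific value from the settings, moving to another cluster means editing the ini file and leaving the code untouched.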
Usage on Compute Canada
Canadian Epigenetics, Environment and Health Research Consortium (CEEHRC)
$1.5M (2012-2017)
PORTal for the Analysis of Genetics and Genomics Experiments (PORTAGGE)
Conclusions
• NGS offers a variety of technologies and numerous exciting applications
• Many areas of NGS data analysis are still under active development (e.g. RNA-Seq)
• A major challenge is to ensure sufficient compute and storage capacity so as not to limit more advanced analyses
• Need to work together to avoid duplication of efforts in installing tools but also to develop efficient ways to use HPC in biomedical research
Acknowledgements
• IT team: Terrance Mcquilkin, Marc-André Labonté, Genevieve Dancausse, Andras Frankel, Alexandru Guja
• Development team: Nathalie Émond, David Bujold, Francois Cantin, Catherine Côté, Burak Demirtas, Daniel Guertin, Louis Dumond Joseph, Francois Korbuly, Marc Michaud, Thuong Ngo
• Analysis team: Louis Letourneau, Mathieu Bourgey, Maxime Caron, Gary Lévesque, Robert Eveleigh, Francois Lefebvre, Johanna Sandoval, Pascale Marquis
• EDCC team: David Morais (UdeS), Carol Gauthier (UdeS), Bryan Caron (McGill), Alain Veilleux (UdeS), ME Rousseau (McGill)
Questions?