Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.


Page 1: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

Next Generation Sequencing Data Analysis

Nadia Pisanti, University of Pisa

Page 2: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

Why sequencing?

The knowledge of DNA and RNA sequences has become a crucial tool for:

• Basic research in biology, pharmacology and medicine.

• Many applied fields: diagnostics (detection of genetic diseases), pharmacogenomics (influence of genetic variation on drug response) and personalized medicine, forensic biology, gene therapies, biological systematics (the study of the diversification of living forms)…

Page 3: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

Sequencing: some history

• "Rapid DNA sequencing", developed by Frederick Sanger (UK) in the 1970s, became the method of choice for DNA sequencing and earned him his second Nobel Prize in Chemistry in 1980.

• The Sanger method for sequencing DNA was used in the Human Genome Project (HGP), which produced the first reference sequence of the human genome.

• The HGP started in 1990 and was expected to take 15 years.

• A first "rough draft" was finished in 2000 and announced in a press conference by… Bill Clinton and Tony Blair!

• The complete genome was announced in 2003.

– Why announce the rough draft in 2000?

– Why did the HGP take less time than expected?

Page 4: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

Celera Genomics

- cut and paste from Wikipedia and my memory -

• In 1998, the American researcher Craig Venter, formerly at the NIH, announced that his private company Celera Genomics would sequence the human genome at a fraction of the cost of the public project.

• A significant portion of the human genome had already been sequenced when Celera entered the field and was freely available to the public from GenBank.

• Celera used a technique called whole genome shotgun sequencing. This novelty spurred the HGP to change its own strategy, leading to a rapid acceleration of the public effort.

• Celera filed preliminary ("place-holder") patent applications on 6,500 whole or partial genes. Celera also promised to publish their findings in accordance with the terms of the 1996 "Bermuda Statement," by releasing new data annually (the HGP released its new data daily), although, unlike the publicly funded project, they would not permit free redistribution or scientific use of the data. For this reason, the public competitor was compelled to publish the first draft of the human genome before Celera.

• In 2000, the HGP released a first working draft on the web. The scientific community downloaded one-half trillion bytes of information from the UCSC genome server in the first 24 hours of free and unrestricted access to the first ever assembled blueprint of our human species.

• Also in 2000, President Clinton announced that the genome sequence could not be patented, and should be made freely available to all researchers. The statement sent Celera's stock plummeting and dragged down the biotechnology-heavy Nasdaq. The biotechnology sector lost about $50 billion in market capitalization in two days. But the public release of the data ensured its fair use and availability.

Page 5: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

shotgun sequencing

• Ever since Celera adopted whole genome shotgun sequencing, it has been the method of choice for large-scale sequencing.

• The Sanger sequencing technology could only be used for short DNA fragments (from 100 to 1000 bases): DNA must thus be divided into small pieces, and then be re-assembled.

• This can be done in two ways:

– Chromosome walking: sequencing consecutive fragments piece by piece.

– Shotgun sequencing: break several copies of the DNA strand into random overlapping fragments, sequence them, and then reassemble them in silico by exploiting the overlaps.
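
The fragmentation step of shotgun sequencing can be mimicked in a few lines. This is only a toy sketch: the function name `shotgun_reads`, the fixed read length, and the error-free uniform sampling are illustrative assumptions, not a model of any real sequencer.

```python
import random

def shotgun_reads(genome, read_len, coverage, seed=0):
    """Sample random, possibly overlapping, fixed-length fragments of genome.

    coverage is the average number of reads covering each base.
    """
    rng = random.Random(seed)  # fixed seed so the toy run is reproducible
    n_reads = coverage * len(genome) // read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads

# Hypothetical 30-base "genome", sampled at 5x coverage with 8-base reads.
genome = "ACGTTGCATGCCGATAGCTTAGGCTAACGT"
reads = shotgun_reads(genome, read_len=8, coverage=5)
print(len(reads), reads[:3])
```

The original positions of the reads are thrown away on purpose: recovering them from the overlaps alone is exactly the assembly problem.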

Page 6: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

Shotgun sequencing & assembly

• Wikipedia, about shotgun sequencing: "faster but more complex".

• The "complexity" of the approach is due to algorithmic issues…

• (Eu)gene Myers, a string algorithms expert, led the computer scientists at Celera: he made the difference…

• Challenges in the assembly phase: finding prefix/suffix overlaps, data structures for storing the fragments and the "overlap graph", and an assembly algorithm that copes with duplications.
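
The two ingredients named above, prefix/suffix overlaps and the overlap graph, can be sketched as follows. This is a naive quadratic illustration with exact matching; the names `overlap` and `overlap_graph` and the `min_len` threshold are assumptions made for the example, and real assemblers use indexed data structures and tolerate errors.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b.

    Returns 0 if no overlap of at least min_len characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def overlap_graph(reads, min_len=3):
    """Directed graph as a dict: (i, j) -> overlap length between reads i and j."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b, min_len)
                if k:
                    graph[(i, j)] = k
    return graph

reads = ["ACGTTG", "TTGCAT", "CATGGA"]
print(overlap_graph(reads))  # {(0, 1): 3, (1, 2): 3}
```

Comparing all pairs of reads costs quadratic time, which is one reason why the choice of data structure matters so much at genome scale.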

Page 7: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

Fragment Assembly

The problem of sequence assembly can be compared to taking many copies of a book, passing them all through a shredder, and piecing the text of the book back together just by looking at the shredded pieces.

Besides the obvious difficulty of this task, there are some extra practical issues: the original may have many repeated paragraphs, and some shreds may be modified during shredding to have typos. Excerpts from another book may also be added in, and some shreds may be completely unrecognizable.
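
To make the shredded-book picture concrete, here is a deliberately simplistic greedy strategy: repeatedly glue together the two shreds with the longest exact overlap. It is a sketch under strong assumptions (no typos, no foreign excerpts, no repeats) and is not the method used by Celera or the HGP; the function names are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Longest suffix of a that is also a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap of min_len remains."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge: several contigs remain
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["ACGTTG", "TTGCAT", "CATGGA"]))  # ['ACGTTGCATGGA']
```

With repeated paragraphs the longest overlap is no longer a reliable guide, and a greedy choice can glue together shreds that were never adjacent in the original.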

Page 8: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

What is NGS?

• Next/New Generation Sequencing

• Massively Parallel Sequencing

• Third Generation Sequencing

• High Throughput Sequencing

millions of fragments (reads) in a single run !!

by means of new technologies developed mainly by:

• Lynx Therapeutics merged with Solexa and they were bought by Illumina.

• ABI SOLiD

• ION Torrent Systems

• 454 Life Sciences, acquired by Roche Diagnostics

These platforms actually differ quite a lot in performance and characteristics.

Page 9: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

What's new with NGS?

• Sequencing the whole human genome took the HGP:

– 3,000,000,000 dollars

– 13 years

• Sequencing a whole human genome now with NGS techniques takes:

– about 1,000 dollars

– 4-5 days

Sequencing is much faster and (thus) cheaper !!

Page 10: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

What is NGS great for

• re-sequencing: no assembly, just mapping on a known reference genome.

• Metagenomics

• Transcriptome Sequencing: RNA-Seq

• Chromatin immunoprecipitation combined with DNA sequencing: ChIP-Seq

Page 11: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

re-sequencing

• Sequencing a new individual of a species for which the reference genome is known (and well annotated).

• Important applications:

– Medicine

– Building datasets of several strains of the same organism to investigate intra-species evolution.

Page 12: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

re-sequencing: medical applications

[we will get back to this later]

• Genotyping: testing for known mutations (sequencing can be possibly targeted to specific regions).

• Variation analysis: scanning for any mutation such as Single Nucleotide Polymorphisms (SNPs), Copy Number Variations (CNVs) or other Structural Variants (SVs) that can be associated with congenital diseases, predisposition for certain pathologies, or drug response.

• Most NGS platforms come with their own software to detect mutations.

• With NGS these tests can be made on a large scale… and back in time: Roche 454 technology was used to sequence Neanderthal DNA in 2006!
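
As a rough illustration of variation analysis, the sketch below piles up reads that have already been mapped to a reference and reports positions where a confident majority disagrees with it. Everything here is an assumption made for the example: ungapped alignments, the thresholds `min_depth` and `min_fraction`, and the function name `call_snps`; real variant callers also model base qualities and diploid genotypes.

```python
from collections import Counter

def call_snps(reference, aligned_reads, min_depth=3, min_fraction=0.8):
    """aligned_reads is a list of (start_position, read_sequence) pairs.

    Returns a list of (position, reference_base, observed_base).
    """
    pileup = [Counter() for _ in reference]
    for pos, read in aligned_reads:
        for offset, base in enumerate(read):
            if pos + offset < len(reference):
                pileup[pos + offset][base] += 1
    snps = []
    for i, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        base, n = counts.most_common(1)[0]
        if base != reference[i] and n / depth >= min_fraction:
            snps.append((i, reference[i], base))
    return snps

reference = "ACGTACGTAC"
aligned = [(0, "ACGAAC"), (2, "GAACGT"), (3, "AACGTA")]
print(call_snps(reference, aligned))  # [(3, 'T', 'A')]
```

The depth threshold is what separates a true variant from a sequencing error seen in a single read.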

Page 13: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

re-sequencing

• Challenges for computer science:

– Indexing data and (quickly) mapping on the reference genome

– SNPs and SVs calling.

– Mind the repeats up there!

• Challenges for informatics:

– Build tools for geneticists.

– Interpreting SNPs and SVs by cross-checking with DB information.

– DB management…
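
Indexing and mapping can be illustrated with a simple k-mer index used in a seed-and-extend fashion. This is a minimal sketch, assuming a hash-based index, a seed taken from the start of the read, and mismatches only (no indels); production mappers typically rely on compressed full-text indexes instead. The names `build_kmer_index` and `map_read` are made up for the example.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=1):
    """Look up the first k bases of the read, then verify each candidate position."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run past the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, k=4)
print(index["ACGT"])                                  # [0, 4]
print(map_read("ACGTTA", reference, index, k=4))      # [(4, 0)]
```

The seed ACGT occurs twice in this toy reference: that is the repeat problem in miniature, and only the verification step tells the two candidate positions apart.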

Page 14: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

metagenomics

Metagenomics essentially entails brute force sequencing of DNA fragments obtained from an uncultured, unpurified, microbial and/or viral population, followed by bioinformatics-based analyses that attempt to answer the question "Who's there?" [E.R. Mardis, Trends in Genetics, 2008]

• Characterizing the human microbiome: we live in symbiosis with millions of microbial species. One theory holds that these symbiotic microbes provide an extension of the human genome and hence contribute to its genetic potential in terms of protective immunity, added enzymatic capability…

• Metagenomics applies not only to the human body, but also to important ecosystems such as oceans, soil, and deep mines.

• The costs of metagenomics have become affordable only now, with NGS (mostly Roche 454, since its longer reads better allow de novo sequencing).

Page 15: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

What is RNA-Seq

NGS opened a new phase in transcriptomics (aka expression profiling) thanks to low requirements of nucleotide sequence product and deep coverage.

Page 16: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

Why RNA-Seq

• Among the goals of the HGP was the mapping of the genotypes associated with (the predisposition for) diseases.

• It is now very clear (and it was not then) that reading the genome is not enough…

• Same genome, different phenotypes and different diseases: how come?

• Environmental effects (food, pollution, lifestyle) act on gene transcription.

• We ought to investigate the transcriptome!

• The transcriptome is the set of genes that are being actively expressed at a given time.

• The role of miRNA in gene regulation.

Page 17: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

RNA-Seq

Sequencing the transcriptome to investigate differentially expressed genes:

- under different conditions, or

- in different tissues

- in different alleles

The different expression can be in quantitative terms or in alternative splicing terms (eukaryotes only).

de novo transcriptome assembly

Page 18: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

RNA-Seq

Sequencing the transcriptome to investigate differentially expressed genes:

- under different conditions, or

- in different tissues

- in different alleles

The different expression can be in quantitative terms or in alternative splicing terms (eukaryotes only).

transcriptome re-sequencing

Page 19: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

RNA-Seq quantification

RNA-Seq (Quantification) is used to analyze gene expression of certain biological objects under specific conditions.
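
The slides do not say how expression is quantified. One widely used normalization, shown here purely as an illustrative assumption, is RPKM (reads per kilobase of transcript per million mapped reads), which corrects raw read counts for gene length and sequencing depth. The gene names and numbers below are invented.

```python
def rpkm(read_counts, gene_lengths):
    """RPKM = reads * 10^9 / (gene length in bases * total mapped reads)."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no mapped reads")
    return {
        gene: count * 1e9 / (gene_lengths[gene] * total)
        for gene, count in read_counts.items()
    }

# Hypothetical counts: both genes attract 500 reads, but geneB is twice as long.
counts = {"geneA": 500, "geneB": 500}
lengths = {"geneA": 1000, "geneB": 2000}
print(rpkm(counts, lengths))
```

With equal read counts, the longer gene gets half the normalized expression value, which is the point of dividing by length.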

Page 20: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

Alternative Splicing

[we will get back to this later]

• AS is when several mRNAs can be produced from a single pre-mRNA.

• E.g. in humans there are approximately 30,000 genes and it is estimated that 70% of human protein-coding genes undergo alternative splicing to generate up to 150,000-200,000 mRNAs and proteins through alternative splice site usage.

• In 2008, an experiment revealed that 34% of human transcripts were not from known genes [Science 321]

Page 21: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

non coding RNA

• ncRNA includes a wide class of regulatory RNA molecules whose function is as crucial as it is poorly understood.

• Discovering their sequences and (hence) genomic locations is hard because they are (mostly) small and poorly conserved over evolutionary time.

• In silico prediction methods are of high importance and very promising, but so far of little use.

• Currently, ncRNAs are mostly discovered by sequencing small RNA fragments, a task for which NGS tools are ideal!

• In silico analysis of such data will be crucial for understanding it (secondary structure prediction, putative functions prediction based on learning methods).

• A new class of miRNA (or small RNA) is being discovered every day…

Page 22: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

ChIP-Seq

• ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins.

• The goal is to analyze protein interactions with DNA (e.g. how transcription factors, that are proteins, regulate gene expression).

Page 23: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

The bad side of NGS

• Even shorter fragments: from the 1000 bases of Sanger technology to 25, then 50, then 75, now 100 bases.

• Even more errors (whenever a new read length is released).

Fragment assembly is even harder !!

Page 24: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

From: M.L. Metzker "Sequencing technologies — the next generation", Nature Reviews Genetics 11, 31-46, 2010

What is best depends on what you need it for and how much money you have.

Page 25: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

Roche 454 Genome Sequencer

• It was the first to be introduced on the market, in 2005.

• Its technology produces relatively long reads (400-700 bases).

• Its base calling cannot handle long (>6) stretches of the same nucleotide, resulting in insertion and deletion errors there…

• On the other hand, it has a very low substitution error rate.

• Overall error rate at 1%.
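
Since indel errors concentrate in long runs of the same nucleotide, it can be useful to flag such runs in a read or in a reference region. The sketch below reports every run longer than 6 bases, following the threshold quoted above; the function name and the output format are illustrative assumptions.

```python
import re

def long_homopolymers(seq, min_run=7):
    """Return (start, base, length) for every run of min_run or more identical bases."""
    # (.) captures one base, \1{n,} requires at least n more copies of it.
    pattern = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    return [(m.start(), m.group(1), m.end() - m.start()) for m in pattern.finditer(seq)]

seq = "ACG" + "T" * 8 + "ACG" + "A" * 6 + "C"
print(long_homopolymers(seq))  # [(3, 'T', 8)]
```

The run of six A's is not reported because it does not exceed the threshold.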

Page 26: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

Illumina Genome Analyzer (aka Solexa sequencer)

• The most widely available NGS technology.

• Reads up to 100 bases long.

• Error rate at 1-1.5%, mostly substitutions (indels are much less common).

Page 27: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

ABI's SOLiD

• Probably the second most widely used.

• The workflow is similar to Solexa/Illumina's.

• An interesting difference: SOLiD uses a di-base sequencing technique in which two nucleotides are read simultaneously. The 16 di-bases are still represented by only 4 "colors", but the one-base shift resolves the redundancy.

• As a consequence:

– A sequencing error may propagate.

– Read alignment can be sped up.

• Error rate around 2-4%
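
The di-base idea can be sketched with a common textbook description of the color code: give each base a 2-bit number and take the XOR of two adjacent bases as the color. The numeric mapping below (A=0, C=1, G=2, T=3) is an assumption chosen for the illustration, not taken from the slides. Decoding needs the first base to be known, and a single wrong color corrupts every base decoded after it.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def encode_colors(seq):
    """One color (0-3) per pair of adjacent bases."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]

def decode_colors(first_base, colors):
    """Rebuild the bases from a known first base and the color sequence."""
    bases = [first_base]
    for color in colors:
        bases.append(BASE[CODE[bases[-1]] ^ color])
    return "".join(bases)

colors = encode_colors("ATGGCA")
print(colors)                                   # [3, 1, 0, 3, 1]
print(decode_colors("A", colors))               # ATGGCA
print(decode_colors("A", [3, 2, 0, 3, 1]))      # ATCCGT: one bad color, four wrong bases
```

A true SNP changes two adjacent colors while a single miscalled color changes only one, which is how mapping directly in color space can tell the two apart.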

Page 28: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

Paired-end and Mate-pairs

• Two very different objects from the point of view of the technology, as they are obtained with very different procedures.

• Available from all NGS platforms.

• From the computational point of view, they are the same: two sequences at an approximately known distance from each other in the genome (insert size).

• They are crucial to:

– Correctly map/assemble repeated fragments

– Detect Structural Variants and Copy Number Variations.
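
The use of the insert size for detecting structural variants can be sketched as a simple consistency check on where the two reads of a pair map. The function name, the tolerance, and the labels are illustrative assumptions; real SV callers also use read orientation and require support from many pairs.

```python
def classify_pair(pos1, pos2, read_len, expected_insert, tolerance):
    """Compare the distance spanned on the reference with the expected insert size.

    pos1 and pos2 are the leftmost mapping positions of the two reads.
    """
    observed = abs(pos2 - pos1) + read_len
    if observed > expected_insert + tolerance:
        # The pair maps too far apart: the sample may lack a piece of the reference.
        return "possible deletion"
    if observed < expected_insert - tolerance:
        # The pair maps too close: the sample may carry extra sequence.
        return "possible insertion"
    return "concordant"

print(classify_pair(1000, 1400, 100, expected_insert=500, tolerance=50))  # concordant
print(classify_pair(1000, 2400, 100, expected_insert=500, tolerance=50))  # possible deletion
print(classify_pair(1000, 1100, 100, expected_insert=500, tolerance=50))  # possible insertion
```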

Page 29: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

Fragment Assembly with NGS data

It is like a diabolical sudoku:

- with very few initial numbers

- many solutions satisfy the constraints: the choice is arbitrary

- only one of the many solutions is the good one, and there is no clue as to which…

Page 30: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

NGS and Informatics: the challenges 1

• Massive image processing and base calling within sequencing technology.

• Growing need to manage big data:

– Indexing issues.

– Efficient mapping and alignments.

– Parallel and High Performance computing.

– New emphasis on efficient data structures and algorithms with special care on memory usage.

Page 31: Next Generation Sequencing Data Analysis Nadia Pisanti, University of Pisa.

NGS and Informatics: the challenges 2

• Designing and producing tools for data analysis that integrate information from different sources (e.g. genome browsers).

• Designing and producing tools for assembly.

• Designing and producing tools for genotyping: a new one every day, hard to compare...

• Customized analysis: informatics is needed for any project and in any lab.

• "Curiously": back to old-style stuff such as the command line, machine language programming…