Introduction to taxonomic analysis of metagenomic amplicon and shotgun data with QIIME

Peter Sterk, EBI Metagenomics Course 2014



Taxonomic analysis using next-generation sequencing

Objective

•  We want to obtain samples from a particular environment to find out what lives in it.

Know your sample

•  What kind of samples do we have? Soil, water, host-associated (e.g. gut), etc.

•  What do we expect to find in those samples? Prokaryotes, eukaryotic microorganisms (e.g. protists, fungi), viruses?

Decide what you want to find out, e.g.

•  bacteria/archaea populations

•  all microbes (including eukaryotic ones)

Design your experiment around that.


Some terminology

•  Amplicon: a DNA fragment that is amplified with PCR, e.g. one or more 16S rRNA variable regions, or other marker genes. Most researchers will make use of standard PCR primers.

•  Clustering: grouping sequences in bins (or clusters) based on a percent similarity threshold.

•  Operational Taxonomic Unit (OTU): a working stand-in for species in microbiology. Microbes are classified into the same or different OTUs, typically using rDNA and a percent similarity threshold. Note that an OTU is distinct from a species. For bacteria/archaea, OTUs are typically clusters of reads that are at least 97% identical.

•  Barcode: a short DNA sequence that is added to each read during amplification and that is specific for a given sample. This allows samples to be mixed (multiplexed) to reduce sequencing cost. During analysis sequences need to be demultiplexed, i.e. separated by sample.


Common approaches: amplicon-based

•  Sequencing of (regions of) target genes (amplicons) obtained by PCR using gene-specific primers. For bacteria/archaea, the target is usually a 16S rRNA gene fragment containing one or more variable regions; for fungi, the internal transcribed spacer (ITS); for eukaryotes, 18S rRNA gene fragments.

–  Analysis usually requires a reference database that is searched to find the closest match to an OTU from which a taxonomic lineage is inferred. Some examples:

•  Greengenes (http://greengenes.lbl.gov): 16S

•  Ribosomal Database Project (http://rdp.cme.msu.edu): 16S

•  Silva (http://www.arb-silva.de): 16S + 18S

•  Unite (http://unite.ut.ee): ITS

–  Less suitable for certain groups of organisms such as protists: these are extremely diverse and only a few have sequence information. The same goes for viruses.

•  We will mainly focus on 16S analysis during the hands-on as this is most common, but you must decide whether this is suitable for your work.

•  We will also spend a little time on taxonomic analysis of Illumina shotgun data


Hands-on QIIME tutorial

•  QIIME is an open source software package for comparison and analysis of microbial communities, primarily based on high-throughput amplicon sequencing data (such as SSU rRNA) generated on a variety of platforms. It is widely used and supported.

•  We will use the latest version of QIIME (Quantitative Insights Into Microbial Ecology; qiime.org; version 1.8), pronounced ‘chime’, to analyze 26 soil samples from a diesel-contaminated railway site (Sutton et al. 2013). You will have an electronic copy of the paper with your training materials. We have randomly picked 5000 reads from the original Roche 454 dataset to speed up the analysis. We also provide a pre-computed analysis of the full dataset.

•  QIIME is used in the EBI metagenomics pipeline with whole genome shotgun data. EBI metagenomics currently does not analyze amplicon data as standard. However, with the help of this tutorial you could soon be analyzing your own amplicon data sets. We will spend some time on the analysis of an Illumina shotgun dataset, a metagenome of a microbial consortium obtained from the Tuna oil field in the Gippsland Basin, Australia (Dongmei et al. 2013 and Sutcliffe et al. 2013).


OTU picking strategies in QIIME

•  De novo

–  Use for amplicons that overlap

–  Use if you do not have a reference sequence collection

–  Clusters all reads without using a reference

–  Not very suitable for very large data sets (cannot be run in parallel)

(I will explain this strategy in more detail)

•  Closed-reference

–  Use if amplicons (or shotgun reads) do not overlap

–  And you have a reference sequence collection

–  Note: reads that do not hit a reference sequence are discarded

•  Open-reference

–  Use for amplicons that overlap

–  Reads are first clustered against a reference sequence collection

–  Reads that do not match are clustered de novo


Common approaches: metagenomic analysis

•  Identification of reads with 16S sequence (e.g. using rRNASelector) and closed-reference OTU picking in QIIME.

–  We will analyze an artificially small Illumina dataset during the hands-on.

•  Blast-based analysis, e.g. blasting reads against the NCBI non-redundant nucleotide or protein databases and inferring taxonomic lineage from the best hit.

–  The tool MEGAN requires Blast output. A major drawback is that without preprocessing of NGS datasets and access to a major computational resource, this is not an option for most.

•  MetaPhlAn approach (http://huttenhower.sph.harvard.edu/metaphlan)

–  relies on unique clade-specific marker genes identified from 3,000 reference genomes

–  fast, but limited to certain types of study (mainly human microbiome)


De novo OTU picking in detail

We will now go through the de novo OTU picking steps in more detail and focus on the diesel-contaminated railway line study. We will perform the actual analysis during the hands-on session today. We will largely follow the QIIME 454 overview tutorial at http://qiime.org/tutorials/tutorial.html

Aim of our study: Understand the interrelationship among microbial community composition, pollution level, and soil geochemical and physical properties.

Sequencing technology/chemistry: Roche 454 FLX Titanium

Amplicon: V3 + V4 region of the 16S rRNA gene


Overview of the diesel-contaminated railway site

In 2010, 26 samples were taken from 9 locations at different depths:

A1: Fill; Polluted
A2: Fill; Polluted
B1: Fill; Clean
B2: Clay; Polluted
B3: Peat; Polluted
B4: Peat; Polluted
C1: Fill; Clean
C2: Peat; Clean
C3: Peat; Polluted
D1: Fill; Clean
D2: Clay; Clean
D3: Clay; Polluted
D4: Peat; Polluted
D5: Sand; Polluted
E1: Fill; Clean
E2: Fill; Polluted
F1: Sand; Clean
F2: Sand; Polluted
G1: Fill; Clean
G2: Fill; Clean
G3: Fill; Clean
H1: Peat; Clean
H2: Peat; Clean
H3: Sand; Clean
I1: Sand; Clean
I2: Sand; Clean


The targeted 16S rRNA gene region

•  The targeted region is a 466 bp fragment containing the 16S rRNA V3 and V4 hypervariable regions

•  Each sample has a sequence primer adapter and 10 nucleotide barcode to allow multiplexing (sequencing all samples on the same plate mainly to reduce sequencing cost)

•  The sequence file is in Roche 454 SFF format


The analysis in detail (1)

File preparation

–  The standard 454 data format is SFF. We need to extract the fasta sequences and quality scores into two separate files. We will use the tool sffinfo from Roche.

>GW6RNWL02GKV5K length=463 xy=2581_0822 region=2 run=R_2011_02_04_06_15_22_

ACATACGCGTCCTATGGGATGCAGCAGGCGCGAAAACTTTACAATGCCGGCAACGGCGAT

>GW6RNWL02HFI7P length=418 xy=2930_0883 region=2 run=R_2011_02_04_06_15_22_

ACATACGCGTCCTATGGGATGCAGCAGGCGCGAAAACTTTACAATGCTGGCAACAGCGAT...

AAGGGAACCTCGAGTGCCAGGTTACAAATCTGGCTGTCGAGATGCCTAAAAAGCATTTCA...

>GW6RNWL02GKV5K length=463 xy=2581_0822 region=2 run=R_2011_02_04_06_15_22_

40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 29 29 29 29 40 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 ...

>GW6RNWL02HFI7P length=418 xy=2930_0883 region=2 run=R_2011_02_04_06_15_22_

40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 38 21 21 21 21 38 39 39 39 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 ...
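The fasta and quality files pair up by read identifier. The following is a minimal Python sketch of that pairing, assuming both files use the header layout shown above. The function names `parse_records` and `pair_reads` and the shortened toy records are illustrative assumptions, not part of QIIME or the Roche tools.

```python
def parse_records(text):
    """Parse fasta-style text into {read_id: [payload lines]}."""
    records = {}
    read_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The read identifier is the first word after '>'.
            read_id = line[1:].split()[0]
            records[read_id] = []
        elif read_id is not None:
            records[read_id].append(line)
    return records


def pair_reads(fasta_text, qual_text):
    """Return {read_id: (sequence, [quality scores])} for reads in both files."""
    seqs = parse_records(fasta_text)
    quals = parse_records(qual_text)
    paired = {}
    for read_id, seq_lines in seqs.items():
        if read_id not in quals:
            continue
        sequence = "".join(seq_lines)
        scores = [int(q) for q in " ".join(quals[read_id]).split()]
        paired[read_id] = (sequence, scores)
    return paired


# Toy records shortened to 10 bases (illustrative values, not the real reads).
FASTA = ">read1 length=10\nACATACGCGT\n>read2 length=10\nACGCGAGTAT\n"
QUAL = (
    ">read1 length=10\n40 40 40 39 39\n39 40 40 29 29\n"
    ">read2 length=10\n40 40 40 40 40 40 38 21 21 21\n"
)

paired = pair_reads(FASTA, QUAL)
for read_id, (sequence, scores) in paired.items():
    mean_quality = sum(scores) / len(scores)
    print(read_id, len(sequence), len(scores), mean_quality)
```

Each base should have exactly one quality score, so comparing `len(sequence)` with `len(scores)` is a cheap sanity check before any quality filtering.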


The analysis in detail (2)

Assign reads to samples using barcode information and perform some quality control

–  We need to provide a tab-delimited mapping file that provides at a minimum the name of each sample, the barcode to identify the different samples, the linker/primer sequence used to amplify the DNA, and a description of the sample

#SampleID BarcodeSequence LinkerPrimerSequence Description

A1 ACATACGCGT CCTAYGGGRBGCASCAG A1_Fill_Polluted

A2 ACGCGAGTAT CCTAYGGGRBGCASCAG A2_Fill_Polluted

B1 ACTACTATGT CCTAYGGGRBGCASCAG B1_Fill_Clean

etc.

For example, sequence reads that have the sequence ACATACGCGT near the start will be assigned to sample ‘A1’. The procedure we use will rename headers in the fasta and quality files accordingly. It also removes the barcode and primer sequences from the reads as these interfere with the OTU picking.
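The assignment step can be sketched in a few lines of Python. This is a simplified illustration, not QIIME's implementation: it assumes the barcode sits at the very start of the read, requires exact barcode matches, and treats the primer as an IUPAC-degenerate pattern. The function names are hypothetical; the mapping rows and the read are the ones shown in these slides.

```python
# IUPAC nucleotide codes: each primer symbol maps to the bases it may match.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

MAPPING = (
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tDescription\n"
    "A1\tACATACGCGT\tCCTAYGGGRBGCASCAG\tA1_Fill_Polluted\n"
    "A2\tACGCGAGTAT\tCCTAYGGGRBGCASCAG\tA2_Fill_Polluted\n"
    "B1\tACTACTATGT\tCCTAYGGGRBGCASCAG\tB1_Fill_Clean\n"
)


def read_mapping(text):
    """Return {barcode: (sample_id, primer)} from a tab-delimited mapping file."""
    table = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        sample_id, barcode, primer, _description = line.split("\t")
        table[barcode] = (sample_id, primer)
    return table


def primer_matches(primer, fragment):
    """True if the fragment is compatible with the degenerate primer."""
    return len(fragment) == len(primer) and all(
        base in IUPAC[symbol] for symbol, base in zip(primer, fragment)
    )


def assign_read(read, table, barcode_length=10):
    """Return (sample_id, trimmed_read), or None if barcode or primer fail."""
    barcode = read[:barcode_length]
    if barcode not in table:
        return None
    sample_id, primer = table[barcode]
    start = barcode_length
    end = start + len(primer)
    if not primer_matches(primer, read[start:end]):
        return None
    # Barcode and primer are stripped; only the amplified region is kept.
    return sample_id, read[end:]


table = read_mapping(MAPPING)
read = "ACATACGCGTCCTATGGGATGCAGCAGGCGCGAAAACTTTACAATGCCGGCAACGGCGAT"
result = assign_read(read, table)
print(result)
```

Production demultiplexers usually tolerate a small number of barcode or primer mismatches and also apply quality filters; the exact-match rule here is only meant to make the logic easy to follow.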


Optional: Denoising 454 data (flowgram clustering)

•  A small number of reads from Roche 454 pyrosequencing runs have characteristic errors when longer homopolymer runs are present. These reads give rise to erroneous OTUs.

•  A procedure called denoising or flowgram clustering removes problematic reads and increases the accuracy of the taxonomic analysis

•  Denoising is computationally expensive and we will therefore skip this procedure in the hands-on. If you work with 454 amplicon data and your file uses the older regular flow pattern, consider denoising. See http://qiime.org/tutorials/denoising_454_data.html. Read the warning about the new random flow patterns.

•  Remember that denoising does not make sense with shotgun data.


The analysis in detail (3)

•  Pick Operational Taxonomic Units. These are collections of sequences that are highly similar (here 97% or more). Taxonomic assignments are done on these OTUs. We will perform de novo OTU picking.

•  The QIIME workflow will produce a number of output files.

–  A list of OTUs with taxonomic assignments with the hierarchy: kingdom, phylum, class, order, family, genus, species. Most OTUs cannot be classified down to species level. E.g.:

denovo745 k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhizobiales; f__Rhizobiaceae; g__Agrobacterium; s__ 1.00 3

–  A representation of a phylogenetic tree in Newick format. The tree can be visualized in applications like FigTree.

–  A file in biom (Biological Observation Matrix) format representing OTU tables. We will import this file into MEGAN 5 to visualize our results.
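To make the taxonomy assignment line concrete, here is a small Python sketch that splits it into ranks and reports the deepest rank that was classified. It assumes the columns are tab-separated; the meaning of the two trailing columns is not stated on the slide, so they are kept as unparsed strings. The function names are illustrative.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def parse_assignment(line):
    """Split one assignment line into (otu_id, {rank: name or None}, extra columns)."""
    fields = line.rstrip("\n").split("\t")
    otu_id, lineage = fields[0], fields[1]
    taxa = {}
    for rank, token in zip(RANKS, lineage.split(";")):
        # Tokens look like 'g__Agrobacterium'; an empty name ('s__') means unclassified.
        _prefix, _sep, name = token.strip().partition("__")
        taxa[rank] = name or None
    return otu_id, taxa, fields[2:]


def deepest_rank(taxa):
    """Return the most specific rank that carries a name, or None."""
    deepest = None
    for rank in RANKS:
        if taxa.get(rank):
            deepest = rank
    return deepest


line = (
    "denovo745\tk__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
    "o__Rhizobiales; f__Rhizobiaceae; g__Agrobacterium; s__\t1.00\t3"
)
otu_id, taxa, extra = parse_assignment(line)
print(otu_id, deepest_rank(taxa), taxa["genus"])
```

For this OTU the deepest classified rank is the genus, which illustrates why most OTUs stop short of species level.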


De novo OTU picking in detail (1)

•  Generate OTUs by clustering reads based on similarity (default is 97%)

–  Sort reads according to size (long -> short)

–  Cluster

[Figure: sorted reads grouped into clusters OTU1 to OTU5]
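The sort-then-cluster idea can be sketched as a greedy procedure in Python. This is a simplified illustration under stated assumptions, not the algorithm QIIME runs: identity is computed position by position without gaps over the shorter sequence, whereas real clustering tools align the sequences. The names `identity` and `pick_otus_de_novo` are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (ungapped)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def pick_otus_de_novo(reads, threshold=0.97):
    """Greedy clustering of {read_id: sequence} into {otu_id: [read_ids]}."""
    # Longest reads first, so that long reads become the cluster seeds.
    ordered = sorted(reads.items(), key=lambda item: len(item[1]), reverse=True)
    seeds = []  # list of (otu_id, seed_sequence)
    otus = {}
    for read_id, seq in ordered:
        for otu_id, seed in seeds:
            if identity(seq, seed) >= threshold:
                otus[otu_id].append(read_id)
                break
        else:
            # No existing seed is similar enough: this read starts a new OTU.
            otu_id = "denovo%d" % len(seeds)
            seeds.append((otu_id, seq))
            otus[otu_id] = [read_id]
    return otus


base = "ACGT" * 25                       # 100 bases
variant = base[:50] + "A" + base[51:]    # one substitution: 99% identical
other = "TTGCA" * 20                     # unrelated 100-base sequence
reads = {"r1": base, "r2": variant, "r3": other, "r4": other[:90]}

otus = pick_otus_de_novo(reads)
print(otus)
```

Because every read is compared with the seeds found so far, the result depends on the read order, which is one reason the reads are sorted before clustering.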


De novo OTU picking in detail (2)

•  Pick representative sequence for each OTU

•  Assign taxonomy to each OTU

[Figure: the representative sequence of each of OTU1 to OTU5 is matched against a reference database, giving lineage 1 to lineage 5]
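A toy Python sketch of these two steps follows. Both choices in it are assumptions made for illustration: the representative is taken to be the longest read in the OTU, and taxonomy is assigned from the single best ungapped match in a tiny made-up reference. The lineage strings, the 0.90 cut-off and the function names are hypothetical and do not describe QIIME's defaults.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (ungapped)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n


def pick_representative(members):
    """Return the id of the longest read in {read_id: sequence} (ties: lowest id)."""
    return min(members, key=lambda read_id: (-len(members[read_id]), read_id))


def assign_taxonomy(seq, reference, min_identity=0.90):
    """Return the lineage of the best-matching reference, or 'Unassigned'."""
    best_lineage, best_score = "Unassigned", 0.0
    for ref_seq, lineage in reference:
        score = identity(seq, ref_seq)
        if score >= min_identity and score > best_score:
            best_lineage, best_score = lineage, score
    return best_lineage


# Hypothetical reference database: (sequence, lineage) pairs.
reference = [
    ("ACGT" * 10, "k__Bacteria; p__ToyPhylumA"),
    ("TTGCA" * 8, "k__Bacteria; p__ToyPhylumB"),
]
otu_members = {"r1": "ACGT" * 10, "r2": "ACGT" * 8}

rep_id = pick_representative(otu_members)
lineage = assign_taxonomy(otu_members[rep_id], reference)
unknown = assign_taxonomy("G" * 40, reference)
print(rep_id, lineage, unknown)
```

Only one sequence per OTU is searched against the reference, which is what keeps taxonomy assignment affordable for large read sets.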


De novo OTU picking in detail (3)

•  Align OTU sequences (if you want to do further phylogenetic analysis)

•  Optional: remove chimaeras from your alignment

•  Filter alignment

•  Create tree file in Newick format

•  Create OTU table in biom format

We can now visualize the results and do further analysis, such as alpha-diversity analysis (diversity within a sample) and beta-diversity analysis (diversity across samples)

•  We will first have a quick look at Megan 5, a tool we will use to visualize our results.


A quick look at MEGAN 5

•  MEGAN stands for MEtaGenome ANalyzer and was written to help understand the composition and operation of complex microbial consortia. It is free for academic users and can be downloaded from http://ab.inf.uni-tuebingen.de/software/megan5/.

•  In order to use MEGAN for both functional analysis and taxonomic analysis, a Blast step needs to be performed whereby a metagenomic dataset is Blast-ed against e.g. one of NCBI’s non-redundant nucleotide or protein databases. This step is extremely computationally expensive and not an option for many users.

•  Recently support for the BIOM format was added, which allows us to visualize and analyze taxonomic analysis results from QIIME. Select import BIOM from the File menu.


Taxonomic tree display in MEGAN5


Rarefaction curves in MEGAN 5


Taxonomic composition of samples in MEGAN5


Selecting 16S rDNA sequence with rRNASelector from shotgun data and closed-reference OTU picking with QIIME

•  Amplicon studies offer insight into taxonomic diversity of samples, but they cannot be used to study function (or coding potential). Instead we need shotgun data.

•  In an ideal world, to get the most out of your physical samples you’d prepare multiple libraries (amplicon, metagenomic, transcriptomic). In practice most people don’t.

•  It is possible to get taxonomic information out of shotgun data. We’ll discuss how we have approached this at the EBI.

[Figure: rRNASelector workflow. (1) Select reads with rDNA sequence; (2) remove non-rDNA.]


Closed-reference OTU picking

•  The set of clipped rDNA reads obtained with rRNASelector is clustered against a reference database.

[Figure: rDNA reads are clustered with uclust against a 16S rDNA reference set; reads that do not match (marked X) are discarded]
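A minimal Python sketch of closed-reference picking, under stated assumptions: each read is slid along each reference sequence without gaps and assigned to the best-matching reference if it reaches the threshold, otherwise it is discarded. The reference sequences, read names and function names are made up for illustration; uclust itself uses a different, alignment-based search.

```python
def best_identity(read, ref):
    """Best ungapped identity of read against any window of ref."""
    if not read or len(read) > len(ref):
        return 0.0
    best = 0.0
    for offset in range(len(ref) - len(read) + 1):
        window = ref[offset:offset + len(read)]
        matches = sum(1 for x, y in zip(read, window) if x == y)
        best = max(best, matches / len(read))
    return best


def pick_otus_closed_reference(reads, reference, threshold=0.97):
    """Return ({ref_id: [read_ids]}, [discarded read_ids])."""
    otus, discarded = {}, []
    for read_id, seq in reads.items():
        best_ref, best_score = None, 0.0
        for ref_id, ref_seq in reference.items():
            score = best_identity(seq, ref_seq)
            if score >= threshold and score > best_score:
                best_ref, best_score = ref_id, score
        if best_ref is None:
            # Closed-reference picking drops reads with no reference hit.
            discarded.append(read_id)
        else:
            otus.setdefault(best_ref, []).append(read_id)
    return otus, discarded


# Hypothetical reference set and shotgun reads.
reference = {
    "ref_A": "ACGTTGCAAGGCTTAACCGGTTAAGGCCTTAGCATCGATCG",
    "ref_B": "GGGCCCAAATTTGGGCCCAAATTTGGGCCCAAATTTGGGC",
}
reads = {
    "read1": reference["ref_A"][5:35],
    "read2": "T" * 30,
    "read3": reference["ref_B"][2:32],
}

otus, discarded = pick_otus_closed_reference(reads, reference)
print(otus, discarded)
```

Each read is handled independently of the others, so the reads can be split across processors, and reads from different parts of the gene can still land in the same OTU because they are compared with the reference rather than with each other.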


Further phylogenetic analyses: taxa summary

•  We can visualize the taxonomic composition of our samples. We will reproduce this figure during the hands-on session. We are looking at the composition at phylum level. A legend is also produced (not shown)


Further phylogenetic analyses: alpha diversity and rarefaction curves

•  Alpha diversity looks at the species diversity within samples

•  If you produced more sequence from your sample, you would expect the number of species to increase until a point where producing more sequence does not significantly increase the number of observed species. You can perform rarefaction analysis on your sample to find out whether you have sequenced at sufficient depth.

•  Rarefaction analysis involves in silico repeated subsampling of your data at different intervals. For example, if your sample consists of 1000 sequences, you could randomly sample 100 reads (with e.g. 10 repetitions), then 200, 300 etc. You can then plot these subsamples against the number of observed species. If curves flatten, then you have sequenced at sufficient depth.
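The subsampling procedure described above can be sketched in Python as follows. The sample is represented as a list with one OTU label per read, and the community of 1000 reads spread over 20 OTUs is invented for illustration; the function name `rarefaction_curve` is hypothetical.

```python
import random


def rarefaction_curve(read_labels, depths, repetitions=10, seed=0):
    """Mean number of distinct OTUs observed at each subsampling depth."""
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        observed = [
            # Draw 'depth' reads without replacement and count distinct OTUs.
            len(set(rng.sample(read_labels, depth)))
            for _ in range(repetitions)
        ]
        curve.append((depth, sum(observed) / repetitions))
    return curve


# Toy sample: 1000 reads over 20 OTUs with very uneven abundances.
abundances = [400, 200, 100, 80, 60, 40, 30, 20, 15, 10,
              10, 8, 7, 5, 5, 4, 3, 1, 1, 1]
read_labels = []
for index, count in enumerate(abundances):
    read_labels.extend(["otu%d" % index] * count)

curve = rarefaction_curve(read_labels, range(100, 1001, 100))
for depth, mean_observed in curve:
    print(depth, mean_observed)
```

Plotting depth against the mean number of observed OTUs gives the rarefaction curve; a curve that is still climbing at the full depth suggests that rare taxa are being missed.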


Divergence measurements between organisms

•  Divergence-based diversity measures estimate the degree to which pairs of organisms differ

–  Sequence distance: measure of sequence identity

–  Phylogenetic distance: sum of branch lengths that separate two organisms in a phylogenetic tree (see fig A)

–  Topological distance: as phylogenetic distance, but all branch lengths set the same (usually 1)

–  Taxonomic distance: the taxonomic level separating two organisms (e.g. same species = 1, same genus = 2, same family = 3, etc.)

•  Usually, where sequence data is available (e.g. 16S rRNA), sequence or phylogenetic distance measurements are most powerful

•  If phylogenetic trees with meaningful branch lengths are not available, but taxonomic relationships are well defined, topological or taxonomic distance measures can be used (most commonly used for macroorganisms)

[Figure A: phylogenetic tree. PD for grey is the sum of the grey branches.]


Measures of alpha diversity

•  A community that contains taxa that are more divergent from each other is more diverse

•  There are many ways to measure alpha diversity. A few examples:

–  Phylogenetic Diversity (PD): measures the total sum of branch lengths in a phylogenetic tree that lead to each community member. Qualitative measure of divergence

–  Theta: measures the average divergence between two randomly chosen sequences (individuals). Quantitative as it accounts for both evenness and divergence between taxa (low evenness: numerical dominance of a few species)

–  Chao 1: species-based qualitative measure

–  Shannon: species-based quantitative measure
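The two species-based measures are easy to compute from a vector of OTU counts. The Python sketch below uses the standard Shannon index, H = -Σ p_i log(p_i), and the bias-corrected form of Chao1, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of OTUs seen exactly once and twice. The choice of log base 2 and of the bias-corrected variant are assumptions of this sketch; check which variants your tool reports.

```python
import math


def shannon(counts, base=2):
    """Shannon index: rises with both richness and evenness."""
    total = sum(counts)
    return -sum(
        (c / total) * math.log(c / total, base) for c in counts if c > 0
    )


def chao1(counts):
    """Bias-corrected Chao1 richness estimate from OTU counts."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    # Many singletons relative to doubletons imply many unseen OTUs.
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


even = [25, 25, 25, 25]
dominated = [97, 1, 1, 1]

print(shannon(even), shannon(dominated))
print(chao1(even), chao1(dominated))
```

Both toy communities contain four OTUs, yet the even community has the higher Shannon value, while Chao1 is higher for the dominated one because its three singletons hint at undetected rare OTUs.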


Further phylogenetic analyses: beta diversity

•  Beta diversity analysis compares diversity between the samples in your study.

•  We calculate the distance between a pair of samples and we do this for all samples.

•  We obtain a distance matrix that we can visualize in a number of ways, e.g. as a tree, a network or a principal coordinates (PCoA) plot. During the hands-on we will generate PCoA plots to visualize the distances between our samples in 3-dimensional space. We’ll have a separate tutorial on visualization with Emperor.

•  As our samples show variation in sequencing depth, we will use the number of reads from the smallest sample as our sequencing depth and rarefy all other samples to this depth.
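Rarefying to the depth of the smallest sample can be sketched in Python as below. The OTU table is represented as a nested dictionary with made-up counts, and `rarefy` is a hypothetical helper, not a QIIME function.

```python
import random


def rarefy(otu_table, seed=0):
    """Subsample every sample in {sample: {otu: count}} to the smallest depth."""
    rng = random.Random(seed)
    depth = min(sum(counts.values()) for counts in otu_table.values())
    rarefied = {}
    for sample, counts in otu_table.items():
        # Expand counts into one label per read, then draw without replacement.
        pool = [otu for otu, n in sorted(counts.items()) for _ in range(n)]
        picked = rng.sample(pool, depth)
        rarefied[sample] = {otu: picked.count(otu) for otu in sorted(set(picked))}
    return depth, rarefied


otu_table = {
    "A1": {"otu1": 50, "otu2": 30, "otu3": 20},
    "B1": {"otu1": 10, "otu2": 25, "otu3": 5},
}

depth, rarefied = rarefy(otu_table)
print(depth, rarefied)
```

After this step every sample has the same number of reads, so differences between samples are not simply a reflection of how deeply each one was sequenced. The cost is that reads from the larger samples are thrown away.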


Measures of community distance: UniFrac

•  There are many ways to measure beta diversity (see e.g. Lozupone and Knight, 2009, for a summary)

•  Divergence-based measures: communities are considered more related if the taxa they contain are more closely related.

•  UniFrac (qualitative): measures phylogenetic distance between sets of taxa in a tree.

•  Weighted UniFrac (quantitative): variation of UniFrac that accounts for changes in relative abundance of lineages between communities.

•  Quantitative measures depend on accurate information about the relative abundance of sequences (which could be biased by lab procedures)

•  UniFrac allows you to:

–  Determine if the environments in the input phylogenetic tree have significantly different microbial communities.

–  Determine if community differences are concentrated within particular lineages of the phylogenetic tree.

–  Cluster environments to determine whether there are environmental factors (such as temperature or salinity) that group communities together.

–  Determine whether the environments were sampled sufficiently to support cluster nodes.
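The unweighted (qualitative) UniFrac distance is the fraction of branch length in the tree that leads to taxa from only one of the two communities. The Python sketch below assumes a pre-digested tree representation, a list of (branch length, set of descendant tips) pairs, rather than parsing a Newick file; the toy tree and the function name are illustrative.

```python
def unweighted_unifrac(branches, community_a, community_b):
    """Unique branch length divided by total branch length for two communities."""
    unique = 0.0
    shared = 0.0
    for length, tips in branches:
        in_a = bool(tips & community_a)
        in_b = bool(tips & community_b)
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
        # Branches leading to neither community are ignored.
    total = unique + shared
    return unique / total if total else 0.0


# Toy tree ((t1:1,t2:1):1,(t3:1,t4:1):1) as (length, descendant tips) pairs.
branches = [
    (1.0, {"t1"}), (1.0, {"t2"}), (1.0, {"t1", "t2"}),
    (1.0, {"t3"}), (1.0, {"t4"}), (1.0, {"t3", "t4"}),
]

disjoint = unweighted_unifrac(branches, {"t1", "t2"}, {"t3", "t4"})
overlapping = unweighted_unifrac(branches, {"t1", "t2"}, {"t2", "t3"})
identical = unweighted_unifrac(branches, {"t1", "t2"}, {"t1", "t2"})
print(disjoint, overlapping, identical)
```

Computing this value for every pair of samples yields the distance matrix that feeds the PCoA plots. Weighted UniFrac additionally weights each branch by the difference in relative abundance between the two communities, which this sketch does not do.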


QIIME analysis of Illumina amplicon data

•  Data preparation differs from 454 analysis

•  Closed-reference OTU picking can be parallelized and is therefore preferred

•  For demultiplexing you need a mapping file (as discussed for 454), the fastq file containing the barcode sequence and the fastq file containing the reads. It is also possible to demultiplex samples if your data is from multiple lanes.

•  For details see the following QIIME tutorial: http://qiime.org/tutorials/processing_illumina_data.html

Note: for a full HiSeq2000 run, this process can take up to 500 CPU hours!


Finally

This concludes the introduction to taxonomic analysis with QIIME.

If taxonomic analysis is important to your work, then do spend time going through the different QIIME tutorials at http://qiime.org/tutorials/.

Thank you