Inferring microbial community function from taxonomic composition



Abstract

It is often most efficient to characterize microbial communities using taxonomic markers such as the 16S ribosomal small subunit rRNA gene. The 16S gene is typically used to describe the organisms or taxonomic units present in a sample, but data from such markers do not inherently reveal the molecular functions or ecological roles of members of a microbial community. We have developed and validated a novel computational method that takes a set of observed taxonomic abundances and infers abundance profiles of enzymes and pathways from multiple functional classification schemes (KEGG, PFAM, COG, etc.). We use ancestral state reconstruction to determine approximate genomic content, taking into account 16S copy number and known functional abundance profiles from all currently available microbial genomes. We have evaluated the accuracy of this inference for different groups of taxa and for different areas of biological function. Our method, implemented as the PI-CRUST software (Phylogenetic Investigation of Communities by Reconstruction of Unobserved STates), allows 16S metagenomic based studies to be extended to predict the functional abilities of microbiomes as well as to compare expected versus observed functions in shotgun based metagenomic experiments.

1. PI-CRUST Software Pipeline

1.2 PI-CRUST: Genome Functional Predictions

1.1 Starting Data Sources (Internally used by PI-CRUST)

• Entire GreenGenes 16S reference tree.
• A functional “Trait Table” for all completed genomes (e.g. KEGG, PFAM, etc.). This contains abundances of each functional category for each genome in the IMG database.
• 16S copy number information for each completed genome in IMG (used to normalize OTU tables).
• GreenGenes identifier to IMG completed genomes map (to link information we have about completed genomes to tips in our reference tree).

Acknowledgements

• MGIL is the recipient of an IHMC travel award funded by the NIH.
• MGIL and RGB are supported by a CIHR emerging team grant.

1.3 User Input

• “OTU table”: abundance of each OTU (with GreenGenes identifiers) in each sample.

1.4 PI-CRUST: Metagenome Functional Predictions

Morgan G.I. Langille 1,*, Jesse R.R. Zaneveld 2, J Gregory Caporaso 3, Joshua Reyes 4, Dan Knights 5, Daniel McDonald 6, Rob Knight 5, Robert G. Beiko 1, Curtis Huttenhower 4

1 Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada; 2 Dept. of Microbiology, Oregon State University, Corvallis, OR, USA; 3 Dept. of Computer Science, Northern Arizona University, Flagstaff, AZ, USA; 4 Dept. of Biostatistics, Harvard School of Public Health, Boston, MA, USA; 5 Dept. Computer Science, University of Colorado, Boulder, CO, USA; 6 Biofrontiers Institute, University of Colorado, Boulder, CO, USA; * [email protected]

[Figure labels (workflow diagram): OTU Table; 16S Copy Number Predictions; Normalized OTU Table; Functional Trait Predictions.]
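As an illustration of the metagenome prediction step (1.4), the sketch below divides each OTU count by a 16S copy number and then sums per-OTU functional abundances weighted by the normalized counts. This is a minimal sketch under assumptions, not the PI-CRUST implementation: the function names, the dictionary layout, the OTU and function identifiers, and all numbers are hypothetical, and the weighted-sum step is one reading of the workflow labels (OTU table, 16S copy number predictions, normalized OTU table, functional trait predictions).

```python
def normalize_otu_table(otu_counts, copy_numbers):
    """Divide each OTU count by that OTU's (predicted) 16S copy number."""
    return {
        sample: {otu: count / copy_numbers[otu] for otu, count in otus.items()}
        for sample, otus in otu_counts.items()
    }


def predict_metagenome(normalized, trait_predictions):
    """Sum each OTU's functional abundances, weighted by its normalized abundance."""
    result = {}
    for sample, otus in normalized.items():
        totals = {}
        for otu, abundance in otus.items():
            for function, copies in trait_predictions[otu].items():
                totals[function] = totals.get(function, 0.0) + abundance * copies
        result[sample] = totals
    return result


# Hypothetical toy inputs: identifiers, counts, and trait values are invented.
otu_counts = {"sample_A": {"otu1": 10, "otu2": 6}}
copy_numbers = {"otu1": 5, "otu2": 2}
trait_predictions = {
    "otu1": {"func_1": 1, "func_2": 2},
    "otu2": {"func_1": 3},
}

normalized = normalize_otu_table(otu_counts, copy_numbers)
metagenome = predict_metagenome(normalized, trait_predictions)
print(metagenome)
```

Normalizing first means an organism with many 16S copies is not counted as many organisms when its functions are summed.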

3. Genome Validation

3.1 Method

1) Remove a single genome from our reference dataset (pretending it has not been sequenced).
2) Use PI-CRUST to predict the functional abundances for our “unknown” genome using only its 16S gene.
3) Compare PI-CRUST predictions vs. the known functional abundances of our genome.
4) Repeat for all completed genomes (>2000).
5) Plot the distribution of accuracy values for each genome (3.2) or each functional group (3.3).
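The leave-one-out procedure in 3.1 can be sketched as a loop over genomes. This is a minimal sketch under assumptions: the poster does not state which accuracy measure is used, so Pearson correlation is assumed here, and `mean_profile` is only a stand-in predictor (PI-CRUST itself would predict from the held-out genome's 16S gene and the reference tree). All names and numbers are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def leave_one_out(trait_table, predict):
    """Hold out each genome in turn, predict it from the rest, and score it."""
    scores = {}
    for held_out, known in trait_table.items():
        reference = {g: t for g, t in trait_table.items() if g != held_out}
        predicted = predict(held_out, reference)
        scores[held_out] = pearson(predicted, known)
    return scores


def mean_profile(held_out, reference):
    """Stand-in predictor (assumption): average profile of the remaining genomes."""
    profiles = list(reference.values())
    return [sum(column) / len(profiles) for column in zip(*profiles)]


# Hypothetical trait table: genome -> abundance of three functions.
trait_table = {"g1": [1, 2, 3], "g2": [2, 4, 6], "g3": [3, 2, 1]}
scores = leave_one_out(trait_table, mean_profile)
print(scores)
```

Any prediction method with the same `predict(held_out, reference)` signature can be passed in, which makes it straightforward to compare methods on identical hold-outs.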

3.2 PI-CRUST accuracy for completed genomes

• “Random”: Each functional abundance is drawn at random from that function's distribution across all genomes.
• “Nearest Neighbour”: The functional profile of the genome with the closest 16S distance is used.
• “PIC”: Ancestral state reconstruction using least squares regression (APE R package).
• “WAGNER”: Ancestral state reconstruction using Wagner parsimony (Count package).
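The two baseline methods listed in 3.2 are simple enough to sketch directly. This is an illustrative sketch under assumptions: “Random” is interpreted as drawing each function's value from the values observed for that function across reference genomes, and 16S distances are supplied as a precomputed lookup. Function names, identifiers, and numbers are hypothetical.

```python
import random


def random_baseline(reference, rng):
    """For each function, take the abundance seen in one randomly chosen genome."""
    genomes = sorted(reference)
    n_functions = len(reference[genomes[0]])
    return [reference[rng.choice(genomes)][i] for i in range(n_functions)]


def nearest_neighbour(query, reference, distance):
    """Copy the profile of the reference genome with the smallest 16S distance."""
    closest = min(reference, key=lambda genome: distance[(query, genome)])
    return reference[closest]


# Hypothetical reference profiles and 16S distances to an unsequenced query "q".
reference = {"g1": [1, 0, 2], "g2": [0, 3, 1]}
distance = {("q", "g1"): 0.02, ("q", "g2"): 0.15}

nn_profile = nearest_neighbour("q", reference, distance)
random_profile = random_baseline(reference, random.Random(0))
print(nn_profile, random_profile)
```

The ancestral state reconstruction methods (“PIC”, “WAGNER”) are not sketched here because they depend on the tree structure and on the external packages named above.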

3.3 PI-CRUST accuracy for various functional groups

[Figure title: Using Various Ancestral State Reconstruction]

The ability to predict functions from 16S varies depending on the functional class. Functions that are well conserved and evolve similarly to 16S have higher accuracy, such as “RNA metabolism” and “Cell Division and Cell Cycle”. Other groups that tend not to be inherited by vertical descent such as “Phages, Prophages, Transposable Elements, Plasmids” are not predicted as accurately.

2. Metagenome Validation

2.1 Method

1) Obtain microbiome samples with both whole metagenomic and 16S sequencing.
2) Use PI-CRUST with 16S data to predict functions for samples.
3) Compare PI-CRUST predictions with functions observed from sequencing.
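Step 3 of the method above, comparing predicted with observed functions, can be illustrated on relative abundances. This is a minimal sketch under assumptions: the poster does not name the comparison statistic, so Spearman rank correlation is assumed, and the function identifiers and counts are invented.

```python
def relative_abundance(counts):
    """Scale counts so that they sum to 1."""
    total = sum(counts.values())
    return {function: value / total for function, value in counts.items()}


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical abundances for one sample (function -> count).
predicted = relative_abundance({"func_1": 40, "func_2": 30, "func_3": 20, "func_4": 10})
observed = relative_abundance({"func_1": 35, "func_2": 38, "func_3": 15, "func_4": 12})

shared = sorted(set(predicted) & set(observed))
rho = spearman([predicted[f] for f in shared], [observed[f] for f in shared])
print(round(rho, 3))
```

Working with relative abundances keeps the comparison independent of sequencing depth, which usually differs between 16S and shotgun data from the same sample.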

2.2 PI-CRUST accuracy on HMP samples

Distance to nearest genome affects accuracy

4. Concluding Remarks

4.1 Discussion

• Genome content has been shown in the past to vary widely even in closely related species. However, this may not be typical for the majority of bacterial and archaeal species. Our ability to predict the functions encoded in an organism based solely on its 16S gene and knowledge from the thousands of completed genomes suggests that gene content often has good phylogenetic correlation with 16S.

• PI-CRUST allows 16S-only studies to be expanded to include information about functional abundances.

• Studies with full metagenomic sequencing can use PI-CRUST to identify functions that are observed but not expected based on their 16S profiles (i.e., the taxa that are present in the sample).
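The last point, finding functions that are observed but not expected, could be approached by comparing relative abundances. The following is only a sketch under assumptions: the fold-change rule, the `min_fold` threshold, and all identifiers and counts are hypothetical and are not taken from PI-CRUST.

```python
def unexpected_functions(observed, predicted, min_fold=2.0):
    """Flag functions whose observed share is at least min_fold times the predicted share."""
    obs_total = sum(observed.values())
    pred_total = sum(predicted.values())
    flagged = []
    for function, count in observed.items():
        obs_rel = count / obs_total
        # Functions absent from the prediction are treated as predicted at zero.
        pred_rel = predicted.get(function, 0.0) / pred_total
        if obs_rel > 0 and obs_rel >= min_fold * pred_rel:
            flagged.append(function)
    return sorted(flagged)


# Hypothetical counts: shotgun-observed vs. 16S-predicted functions for one sample.
observed = {"func_1": 50, "func_2": 30, "func_3": 15, "func_4": 5}
predicted = {"func_1": 60, "func_2": 35, "func_3": 5}

flagged = unexpected_functions(observed, predicted)
print(flagged)
```

Functions flagged this way would be candidates for follow-up, for example as genes carried on mobile elements rather than inherited with the 16S lineage.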

4.2 Availability & Future Plans

• PI-CRUST is still under development but will be freely available under the GPL at: http://picrust.sourceforge.net
• Various methods of ancestral state reconstruction and confidence weighting are still being evaluated.
• Evaluation of PI-CRUST on other paired metagenomic and 16S datasets is underway.

[Figure caption: Each point represents the predicted vs. observed relative abundance for a single KEGG category. Axis label: PI-CRUST predicted abundance based on 16S data.]

[Figure labels: 16S phylogenetic distance to nearest species; Endosymbionts & Reduced Genomes; PI-CRUST Accuracy (for each SEED function).]

[Figure labels (pipeline diagram): Reference 16S Tree (greengenes); Genome Functional Table (completed genomes only); 16S Copy Number (completed genomes only); Prune taxa with no genome information; Infer ancestral genome traits; Predict functional compositions; 16S Copy Number Predictions; Functional Trait Predictions; Metagenome Functional Predictions. Legend: Known functional composition (from sequenced genome); Inferred ancestral functional composition; Predicted functional composition (for unsequenced genome).]