
Annotating Metagenomes Using the NMPDR

Rob Edwards

Department of Computer Sciences, San Diego State University

Mathematics and Computer Sciences Division, Argonne National Laboratory

ASM General Meeting, Boston.

www.nmpdr.org www.theseed.org

See also poster: B-179 (126B), Aziz et al.

How much has been sequenced?

[Figure: number of known sequences by year. Marked milestones: first bacterial genome, 100 bacterial genomes, 1,000 bacterial genomes, environmental sequencing.]

How much will be sequenced?

[Figure: possible sequencing targets: 100 people; everybody in Boston; everybody in the USA; all cultured bacteria; one genome from every species; most major microbial environments.]

The Problem

How do you generate consistent and accurate annotations for metagenomes?

The SEED Family

Annotations using subsystems

FIG has developed the notion of a subsystem: a generalization of “pathway” as a collection of functional roles jointly involved in a biological process or complex.

Subsystems have been extended into FIGfams: protein families whose members perform the same function.
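The subsystem idea above can be pictured as a small data structure. The following is a minimal sketch, not the SEED implementation: it assumes a populated subsystem can be modeled as a table of genomes against functional roles, with gene identifiers in the cells. The class name `Subsystem`, its methods, and all genome, role, and gene names are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Subsystem:
    """Hypothetical model: a named collection of functional roles."""
    name: str
    roles: List[str]
    # genome -> functional role -> gene identifiers assigned to that role
    cells: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)

    def assign(self, genome: str, role: str, gene_id: str) -> None:
        """Record that gene_id performs role in genome."""
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a role of subsystem {self.name!r}")
        self.cells.setdefault(genome, {}).setdefault(role, []).append(gene_id)

    def missing_roles(self, genome: str) -> List[str]:
        """Roles of the subsystem with no gene assigned in this genome."""
        present = self.cells.get(genome, {})
        return [role for role in self.roles if not present.get(role)]


# Placeholder names only; none of these come from the talk.
example = Subsystem("example_subsystem", ["role_1", "role_2", "role_3"])
example.assign("genome_X", "role_1", "gene_001")
example.assign("genome_X", "role_3", "gene_007")
print(example.missing_roles("genome_X"))
```

Under this model, a row with empty cells points at roles that still need a gene assigned in that genome, which is one reason a table per subsystem is convenient for keeping annotations consistent across genomes.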


Subsystems make up metabolism

[Figure: overview map of metabolism. Source: Wikipedia Metabolism portal, http://en.wikipedia.org/wiki/Portal:Metabolism]

SEED Viewer

Populated Subsystem

Subsystems Are Not Just Pathways

• predicted or measured co-regulation

• genome context (virulence islands, prophages, conserved gene clusters)

• virulence mechanism

• cellular localization

• enzymatic activity

• common phenotype

• combinations of criteria

Automated Annotations of Complete Genomes

• Automated, user-originated processing

• Takes 1-7 hours depending on size and complexity of the genome

• ~1,500 external submissions, including 150 genomes not yet publicly released.

• Reannotation of >500 genomes complete

• 789 users, 160 organizations, 25 countries.

http://rast.nmpdr.org/

Automated Annotations of Complete Metagenomes

MG-RAST Server

Accurate and consistent annotations in a few days

Automatic metabolic reconstruction

Freely available after registration

http://metagenomics.theseed.org/

Metagenome Annotation

Automated pipeline

– upload sequences in FASTA, with or without Q-scores

– removes exact duplicates (454 artefact)

– renumbers sequences (mapping provided)

– BLAST against SEED nr, 16S rDNA

– Annotations and metabolic reenactment

– Taxonomic summary
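The first pipeline steps (read FASTA, drop exact duplicates, renumber with a mapping back to the original names) can be sketched as follows. This is an illustration under stated assumptions, not the MG-RAST code: "exact duplicate" is taken here to mean an identical sequence string ignoring case, and the function names and the `seq` identifier prefix are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_and_renumber(
    records: Iterable[Tuple[str, str]], prefix: str = "seq"
) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Keep the first copy of each sequence and assign sequential IDs.

    Assumption: duplicates are sequences that are identical ignoring case.
    Returns the kept records and a mapping new_id -> original header.
    """
    seen = set()
    kept: List[Tuple[str, str]] = []
    mapping: Dict[str, str] = {}
    for header, seq in records:
        key = seq.upper()
        if key in seen:
            continue  # exact duplicate of an earlier read
        seen.add(key)
        new_id = f"{prefix}{len(kept) + 1}"
        mapping[new_id] = header
        kept.append((new_id, seq))
    return kept, mapping


fasta_lines = [">readA", "ACGT", "ACGT", ">readB", "acgtacgt", ">readC", "TTTT"]
kept, mapping = dedupe_and_renumber(read_fasta(fasta_lines))
for new_id, seq in kept:
    print(new_id, mapping[new_id], seq)
```

Returning the mapping alongside the renumbered records lets every downstream annotation be traced back to the name the user uploaded.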

Metagenome Metabolic Reenactment

Phylogenomics

Comparing Metagenomes to Genomes (or other metagenomes!)

Metabolic potential in environments

MG-RAST computation

[Figure: hours of compute time versus input size (MB).]

~19 hours of compute per input megabyte

How much so far

676 metagenomes

10,012,793,995 bp (10 Gbp)

Average: ~15 Mbp per metagenome

Compute time (on a single CPU):

190,243 hours = 7,926 days = 21 years
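The totals on this slide can be re-derived with a few lines of arithmetic. The sketch below uses only the figures quoted here; the conversion of base pairs to megabytes (one byte per base pair, 1 MB = 10^6 bytes) and the 365.25-day year are assumptions made for the check, not statements from the talk.

```python
# Figures quoted on the slide.
TOTAL_BP = 10_012_793_995
N_METAGENOMES = 676
CPU_HOURS = 190_243

# Average size of one metagenome, in base pairs.
avg_bp = TOTAL_BP / N_METAGENOMES

# Single-CPU compute time expressed in days and years.
cpu_days = CPU_HOURS / 24
cpu_years = cpu_days / 365.25  # assumption: average year length

# Assumption: 1 bp is stored as 1 byte and 1 MB = 10**6 bytes.
total_mb = TOTAL_BP / 1e6
hours_per_mb = CPU_HOURS / total_mb

print(f"average size: {avg_bp / 1e6:.1f} Mbp per metagenome")
print(f"compute time: {cpu_days:.1f} days, {cpu_years:.1f} years")
print(f"compute rate: {hours_per_mb:.1f} hours per MB")
```

Under these assumptions the average comes to a little under 15 Mbp, and the total corresponds to about 19 hours per megabyte, in line with the per-megabyte rate quoted for MG-RAST.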

~200 GS20, ~200 FLX, ~200 Sanger

Lots of sequences, all pyrosequencing

From Sequences To Environments

[Figure: CDA plot with axes CDA 60.2% and CDA 21.7%. Labels: Sulfur, Respiration, Capsule, Motility, Membrane transport, Stress, Signaling, Phosphorus, RNA. Environments: Mine, Saltern, Marine, Microbialites, Coral, Fish, Animals, Freshwater.]

Dinsdale et al., Nature 2008

Upcoming Features

• More user options (removing sequences, E-values, percent identities, etc.)

• More databases (ACLAME, human, etc.)

• More user-generated content (mash-ups) via web services and a published API

Accessing Data via Web Services

Thanks: Bahador Nosrat, SDSU

Workshops

Free workshops on NMPDR, RAST, MG-RAST, SEED

Upcoming workshops: Greece, Argonne, Urbana-Champaign, San Diego

Contact Leslie McNeil [email protected] or visit http://www.nmpdr.org/

Acknowledgements

Environmental Genomics: Forest Rohwer and the labs that provided sequence

Metagenomics Annotation Server: Rick Stevens, Daniel Paarman, Folker Meyer, Bob Olsen, Mark D'Souza

Statistics & Web services: Liz Dinsdale, Dana Hall, Beltran Rodriguez-Brito, Bahador Nosrat

FIG: Ross Overbeek, Veronika Vonstein, Annotators