The Integrated Microbial Genome (IMG) systems


Page 1: The Integrated Microbial Genome (IMG) systems

The Integrated Microbial Genome (IMG) systems

Nikos Kyrpides

Page 2: The Integrated Microbial Genome (IMG) systems

OMICS GROUP

Ken Chu

Krishna Palaniappan

Ernest Szeto

Yuri Grechkin

Amy Chen

Victor Markowitz

Biju Jacob

ANNOTATION GROUP

STANDARDS GROUP

Kostas

Marcel Peter Billis

Natalia Dino

Amrita Denis Iain Bahador Ioanna Reddy

Page 3: The Integrated Microbial Genome (IMG) systems

Science driven data generation and analysis

Science Goals

ANALYSIS

User Facility

Page 5: The Integrated Microbial Genome (IMG) systems

Data Integration

Comparative Analysis

Data analysis

Page 6: The Integrated Microbial Genome (IMG) systems

Data management system for comparative analysis of biological data

IMG

Genes, Genomes, Functions, Metadata, Clusters, SNPs, Proteomics, Regulons, Transcriptomes

What is the Matrix?

Page 7: The Integrated Microbial Genome (IMG) systems

Become the HOME of Microbial Genomes and Metagenomes

IMG’s Mission

• support comparative genome analysis
• support community functional annotation
• provide a user-friendly interface

Page 8: The Integrated Microbial Genome (IMG) systems

What is IMG: IMG is a data management system for comparative analysis and annotation of all publicly available genomes from the three domains of life in a uniquely integrated context.

Mission: To become the Home of Microbial Genome and Metagenome Analysis

Background:
• Launched in March 2005
• 3 releases/year, 20 releases so far
• >5,000 unique visitors per month
• >350 citations

Current Status: 6891 genomes, 11.6 million genes
• Bacteria: 2780
• Archaea: 107
• Eukarya: 121
• Plasmids: 1186
• Viruses: 2697

• http://img.jgi.doe.gov/

USERS CAN:
• Search data
• Browse data
• Compare data
• Export data

Integrated Microbial Genomes (IMG) [It’s easier to analyze 1000 genomes than a single one]

Page 9: The Integrated Microbial Genome (IMG) systems

Why more data are needed: faster and more accurate function prediction

Ribokinase family

Fructokinase family

2-dehydro-3-deoxyglucokinase family

Page 10: The Integrated Microbial Genome (IMG) systems

Metagenomic Analysis

[Figure: species complexity axis (1, 10, 100, 1000, 1000s, 10000) with example environments: Acid Mine Drainage, Sargasso Sea, Soil, Human Gut, Termite Hindgut. Figure labels: Binning, Reference Genomes.]

The road to success in Metagenomics is through Microbial Genomics

Source: Susannah Tringe, JGI

Page 11: The Integrated Microbial Genome (IMG) systems

Availability of Reference Genomes

[Figure: availability of reference genomes by environment (values shown: 100%, 60%, 50%, 40%, 20%, 1%, ?). Environments: Acid Mine Drainage, Human Gut, Soil, Termite Gut, Marine.]

Page 12: The Integrated Microbial Genome (IMG) systems

Data Model Abstraction Example:

IMG Operations

Genes, Functions/Pathways, Genomes

• Gene occurrence profile across genomes
• Gene occurrence profiles across pathways
• Pathways shared by genomes
• Genes present in G1 and absent from G2, G3, G4 and G5

G1 G2 G3 G4 G5

g3

g2

g1 + + + + + + + - + + + - - - -
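
The query "genes present in G1 and absent from G2, G3, G4 and G5" can be sketched with plain set operations over a presence/absence profile. This is only an illustration, not IMG's implementation: the gene names, the occurrence data, and both function names are hypothetical.

```python
# Hypothetical presence/absence data: gene -> set of genomes that contain it.
# These values are made up for illustration and are not taken from the slide.
occurrence = {
    "geneA": {"G1", "G2", "G3", "G4", "G5"},
    "geneB": {"G1", "G2", "G4", "G5"},
    "geneC": {"G1"},
}


def genes_present_absent(occurrence, present_in, absent_from):
    """Return genes found in every genome of `present_in` and in none of `absent_from`."""
    present_in = set(present_in)
    absent_from = set(absent_from)
    return sorted(
        gene
        for gene, genomes in occurrence.items()
        if present_in <= genomes and not (absent_from & genomes)
    )


def occurrence_profile(occurrence, gene, genomes):
    """Return a '+'/'-' string for one gene across an ordered list of genomes."""
    return " ".join("+" if g in occurrence[gene] else "-" for g in genomes)


if __name__ == "__main__":
    genomes = ["G1", "G2", "G3", "G4", "G5"]
    for gene in occurrence:
        print(gene, occurrence_profile(occurrence, gene, genomes))
    print(genes_present_absent(occurrence, ["G1"], ["G2", "G3", "G4", "G5"]))
```

Storing each gene's genomes as a set keeps the "present in / absent from" test to one subset check and one intersection per gene.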

Page 13: The Integrated Microbial Genome (IMG) systems

IMG Data Integration

Genomes, Functions, Genes

Functions
• COG
• GO
• Pfam
• TIGRfam
• InterPro
• KEGG
• BioCyc
• SEED
• Protein product
• MyIMG
• IMG Terms
• IMG Pathways
• IMG Networks

Groupings
• Phylogenetic
• Phenotypic
• Ecotypic
• Disease
• Geographical
• Isolation

Genes
• RNAs, Proteins
• Sequence clusters
• Positional clusters
• Regulatory clusters
• Fusions
• Operons
• Expression

6891

11.6M

1.1M

Page 14: The Integrated Microbial Genome (IMG) systems

IMG Toolkit

• Chromosome Map
• Function Profile
• Gene Synteny
• Abundance Profiles
• Functional Categories
• Projects Map
• IMG Pathway Profile
• Metadata Search
• Phylogenetic Profile
• Genome Clustering
• Compare Annotations
• KEGG Maps
• Phylogenetic Distribution
• Chromosomal Map
• Artemis
• VISTA
• Recruitment Plot
• Fragment Recruitment

WRITE PAPER

Page 15: The Integrated Microbial Genome (IMG) systems

USERS CAN:
• Search data
• Browse data
• Compare data
• Export data

USERS CAN:
• Submit data
• Annotate data

APRIL 2011
• Users: 1370
• Submissions: 2626
• Private Genes: 188 M

UNIQUE VISITS: ~5,000 / month

Page 16: The Integrated Microbial Genome (IMG) systems
Page 17: The Integrated Microbial Genome (IMG) systems

Informatics Steps & Services: support of a new user community

• NEW PROJECT
• SEQUENCING
• ASSEMBLY
• ANNOTATION
• DATA RELEASE
• INTEGRATION & COMPARATIVE ANALYSIS

METADATA

EXTERNAL PROJECTS
• 2012: ASSEMBLY
• 2008: IMG-ER
• 2005: IMG

Page 18: The Integrated Microbial Genome (IMG) systems

Data Challenges & Opportunities

• Metadata
• Gene calling
• Annotation

• Quantity
• Quality

• Number of Genes
• All vs all Blast

• Number of Datasets
• How do we navigate through a sea of data

Data Analysis

Integration

Page 19: The Integrated Microbial Genome (IMG) systems

Challenges we face

DATA SIZE DATA QUALITY DATA STANDARDS

Page 20: The Integrated Microbial Genome (IMG) systems

Challenges we face

1. DATA SIZE
• Number of Genes
• Number of Datasets

a. How do we compare data
b. How do we find data
c. How do we navigate through data

Page 21: The Integrated Microbial Genome (IMG) systems

2. Computation of similarities

ii. Method dev for data reduction & comparison
- Computation of Similarities

[Figure labels: Metagenome, Reference genomes, Metagenome, Metagenome, Use clusters]

Clusters
• Common/unique genes
• Rapid identification of best hit(s)
• ….

Page 22: The Integrated Microbial Genome (IMG) systems

SCALING: Computation of Similarities

IMG
• OLD: BLAST, ~30 days for 8 million genes
• NEW: CLUSTERS, ~3 days for 8 million genes

IMG/M
• OLD: BLAST, not possible
• NEW: CLUSTERS, ~10 days for 80 million genes
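
A rough back-of-the-envelope sketch of why all-vs-all comparison stops scaling: the number of gene pairs grows quadratically, so going from 8 million to 80 million genes multiplies the pair count by about 100. The slides do not describe how IMG builds or uses its clusters, so the "compare each gene against one representative per cluster" model and the cluster count below are assumptions made only for illustration.

```python
def all_vs_all_pairs(n_genes: int) -> int:
    """Number of unordered gene pairs in an all-vs-all comparison."""
    return n_genes * (n_genes - 1) // 2


def representative_comparisons(n_genes: int, n_clusters: int) -> int:
    """Upper bound on comparisons if each gene is compared only against
    one representative per cluster (an assumed, simplified model)."""
    return n_genes * n_clusters


if __name__ == "__main__":
    img_genes = 8_000_000        # gene count quoted on the slide for IMG
    imgm_genes = 80_000_000      # gene count quoted on the slide for IMG/M
    assumed_clusters = 100_000   # hypothetical value, not from the slides

    growth = all_vs_all_pairs(imgm_genes) / all_vs_all_pairs(img_genes)
    print(f"all-vs-all pairs, 8M genes:  {all_vs_all_pairs(img_genes):.3e}")
    print(f"all-vs-all pairs, 80M genes: {all_vs_all_pairs(imgm_genes):.3e}")
    print(f"growth factor for 10x more genes: {growth:.1f}")
    print(f"representative comparisons, 8M genes: "
          f"{representative_comparisons(img_genes, assumed_clusters):.3e}")
```

The timings on the slide are measured wall-clock figures; this sketch only shows the growth in comparison counts, not a prediction of run time.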

Page 23: The Integrated Microbial Genome (IMG) systems

Strain / species diversity

Page 24: The Integrated Microbial Genome (IMG) systems

[Figure: Prochlorococcus marinus Pangenome, Listeria monocytogenes Pangenome, Staphylococcus aureus Pangenome. Figure labels: 10, 17, 15.]

Pangenomes

We need better ways to
• represent and browse through thousands of genomes
• represent an organism
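
One common way to represent an organism by its pangenome is to split its gene families into those shared by every strain (core), those found in only some strains (accessory), and those unique to a single strain. The sketch below assumes each strain is already reduced to a set of gene-family identifiers; the strain names, family names, and function name are hypothetical, and nothing here is claimed to be how IMG computes pangenomes.

```python
# Hypothetical input: strain -> set of gene-family identifiers.
strains = {
    "strain_1": {"famA", "famB", "famC"},
    "strain_2": {"famA", "famB", "famD"},
    "strain_3": {"famA", "famC", "famE"},
}


def summarize_pangenome(strains):
    """Split gene families into pan, core, accessory and strain-unique sets."""
    if not strains:
        raise ValueError("at least one strain is required")
    gene_sets = list(strains.values())
    pan = set().union(*gene_sets)            # families seen in any strain
    core = set.intersection(*gene_sets)      # families seen in every strain
    accessory = pan - core                   # families missing from some strain
    unique = {}
    for name, genes in strains.items():
        others = set().union(*(g for other, g in strains.items() if other != name))
        unique[name] = genes - others        # families seen only in this strain
    return {"pan": pan, "core": core, "accessory": accessory, "unique": unique}


if __name__ == "__main__":
    summary = summarize_pangenome(strains)
    print("pan-genome size:", len(summary["pan"]))
    print("core families:", sorted(summary["core"]))
    print("accessory families:", sorted(summary["accessory"]))
    for name, families in summary["unique"].items():
        print(name, "unique:", sorted(families))
```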

Page 25: The Integrated Microbial Genome (IMG) systems

Metagenome Analysis with Pangenomes

[Figure labels: Reference Genome, Pangenome, Best Blast Hit]

Page 26: The Integrated Microbial Genome (IMG) systems

Challenges we face

2. DATA QUALITY
a. Did we generate enough data to support biological conclusions?
b. Did we introduce any biases during sequencing?
c. Is the quality of assembly comparable between different datasets?
d. Is the quality of predicted genes comparable between different datasets?
e. Is the quality of functional annotation comparable between different datasets?

Page 27: The Integrated Microbial Genome (IMG) systems

Microbial Genomes: Gene Prediction Quality Assurance

GenePRIMP (Gene Prediction Improvement Pipeline)
http://geneprimp.jgi-psf.org

GenePRIMP is a pipeline that consists of a series of computational units that identify erroneous gene calls and missed genes and correct a subset of the identified defective features.

APPLICATIONS
• Identify gene prediction anomalies
• Benchmark the quality of gene prediction algorithms
• Benchmark the quality of combination / coverage of sequencing platforms
• Improve the sequence quality

Pati A. et al. (2010) Nature Methods

Natalia, Amrita

Page 28: The Integrated Microbial Genome (IMG) systems

Challenges we face

3. DATA STANDARDS
a. Assembly
b. Gene Finding
c. Functional Annotation
d. Metadata

Page 29: The Integrated Microbial Genome (IMG) systems

Project Catalog & Metadata: Genomes OnLine Database

D. Liolios, I. Pagani

Page 30: The Integrated Microbial Genome (IMG) systems

COMPUTATIONS
M5: Pilot Project with ANL

Building a roadmap for a scalable and sustainable computing MetaInfrastructure for the metagenomics community

innovation through collaboration

GSC, CAMERA, JGI, ANL

• develop standards to share and process data more effectively
• run data-intensive workflows once (reduce wasted cycles)

• Develop a single QC data processing pipeline
• Develop a single data submission entry
• Develop a single data processing pipeline
• Develop a common project catalog

Page 31: The Integrated Microbial Genome (IMG) systems

Standards in Genomic Sciences: http://standardsingenomics.org

Page 32: The Integrated Microbial Genome (IMG) systems

Ongoing Developments

New Data & Tools for Visualization & Analysis
• Integration of Expression data
• Integration of Regulatory Data
• Resequencing data (strain variation)
• Pangenomes

Data Processing
• Short Read annotation
• Bypass the all vs all Blast bottleneck