Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.


Page 1: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Deployment and Preparation of Metagenomic Analysis on the EELA Grid

Gabriel Aparício, et al.

Page 2: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Topic summary

Global Process Topics

• Introduction

• Cases of study

• Deployment

• Automation System

• Results and Performances Analysis

• Future Plans

Page 3: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Introduction

• What is a Metagenome?

• What is a Metagenome Analysis?

• Why is the Grid a Good Solution?

• What is the Proposed Structure?

• What are the Future Plans?

Page 4: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

What is a Metagenome?

Introduction

• The term was first used by Jo Handelsman and others at the University of Wisconsin in 1998.

• A metagenome is a collection of genes.

• It can be studied as a single gene.

• Analysis can be done without isolating genes and lab-cultivating them.

Page 5: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

What is a Metagenome Analysis?

Introduction

• A Metagenome Analysis is the set of steps needed to transform a file containing an encoded metagenome into another file containing the information of interest.

• This can include:
– Database filtering.
– BLAST alignments.
– BLAST output filtering.
– Creation of Phylogenetic Trees.
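As an illustration of the "BLAST output filtering" step, here is a minimal sketch that keeps only the hits below an e-value threshold. It assumes BLAST was run with tabular output (the standard 12 tab-separated columns); the thresholds, the function name `filter_hits` and the sample rows are made up for the example and are not taken from the slides.

```python
import csv
import io

# Column order of BLAST's standard 12-column tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def filter_hits(handle, max_evalue=1e-5, min_identity=0.0):
    """Yield hits (as dicts) whose e-value and percent identity pass the thresholds.

    The default thresholds are illustrative assumptions, not values from the slides.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        if float(hit["evalue"]) <= max_evalue and float(hit["pident"]) >= min_identity:
            yield hit

# Made-up sample rows, only to exercise the filter.
sample = (
    "read1\tgi|111\t98.5\t200\t3\t0\t1\t200\t50\t249\t1e-50\t350\n"
    "read1\tgi|222\t60.0\t80\t32\t0\t10\t89\t5\t84\t0.5\t30.1\n"
    "read2\tgi|333\t85.0\t150\t22\t1\t1\t150\t1\t149\t2e-20\t120\n"
)
kept = list(filter_hits(io.StringIO(sample)))
print([(h["qseqid"], h["sseqid"]) for h in kept])
```

Only the first and third sample rows pass the default threshold; the second is dropped because its e-value of 0.5 is too high.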

Page 6: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Why is the Grid a Good Solution?

Introduction

• A Metagenome can be encoded as several hundred thousand sequences.

• Sequential processing can take more than a year.

• Public databases are continuously changing.

• Several coarse-grained steps can be done in parallel.

• In a Grid, the global job can be divided into subjobs.

• A Metagenome Analysis can be processed in a few days with a Grid Infrastructure.

Page 7: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Farm Soil Metagenome

Cases of Study

• This is a sample from a nutrient-rich, moderately contaminated soil environment.

• This community is very diverse and complex.

• Many as-yet-unknown enzymes are probably present there.

• Its study is very interesting from the biotechnological point of view.

Page 8: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Whale fall Metagenome

Cases of Study

• Whale carcasses are known to be a nutrient-rich environment at the bottom of the ocean.

• A heterogeneous mixture of bacteria flourishes there.

• It is one of the best examples of marine bacterial communities.

Page 9: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Sargasso Sea Metagenome

Cases of Study

• These oceanic samples are taken from surface waters.

• They represent the diversity of bacteria that live planktonically.

Page 10: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Gut Metagenomes

Cases of Study

• Several metagenomes of the human intestinal microbiota.

• This consortium of bacteria helps its host to metabolize many nutrients that would otherwise be indigestible.

• It is involved in other functions:
– Maturation and modulation of the immune response of the host.
– Prevention of infection by bacterial pathogens.

Page 11: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Sequential or Parallel jobs? (I)

Deployment

• There are around 150 CEs (Computing Elements) in the BIOMED and EELA VOs (Virtual Organizations).

• Only around 30 of these CEs are able to run MPICH jobs.

• The number of usable CEs decreases as the number of required nodes increases.

• Full efficiency in MPICH jobs is achieved only occasionally.

Page 12: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Sequential or Parallel jobs? (II)

Deployment

Page 13: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Selecting CEs (I)

Deployment

• Several jobs are needed
– A single job can take more than a year.
– The Analysis must be split into several subjobs (often more than 100 subjobs).

• Several CEs are needed
– To decentralize processing, storage and network bandwidth.

• A Metagenome Analysis job has requirements
– On software, hardware and configuration.

Page 14: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Selecting CEs (II)

Deployment

• Not all available CEs are able to produce results.

• Not all available CEs have the same performance.

• CEs must be selected, and jobs distributed among them, according to their performance.
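The selection criteria above can be sketched as a simple filter and ranking. This is only an illustration: the `CEReport` record, the idea of a per-CE test job with a measured throughput, and the host names are assumptions made for the example, not the authors' actual tooling.

```python
from dataclasses import dataclass

@dataclass
class CEReport:
    name: str             # CE identifier (hypothetical host names below)
    test_passed: bool     # did a test job on this CE produce correct results?
    seqs_per_hour: float  # throughput measured on that test job

def select_ces(reports, min_seqs_per_hour=0.0):
    """Keep CEs that produced correct results, fastest first."""
    good = [r for r in reports
            if r.test_passed and r.seqs_per_hour >= min_seqs_per_hour]
    return sorted(good, key=lambda r: r.seqs_per_hour, reverse=True)

reports = [
    CEReport("ce-a.example.org", True, 120.0),
    CEReport("ce-b.example.org", False, 300.0),  # fast, but results were wrong
    CEReport("ce-c.example.org", True, 450.0),
    CEReport("ce-d.example.org", True, 15.0),    # works, but too slow
]
selected = select_ces(reports, min_seqs_per_hour=50.0)
print([r.name for r in selected])
```

A CE that fails the test job is discarded regardless of its speed, which mirrors the point that not every available CE is able to produce results.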

Page 15: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Selecting CEs (III)

Deployment

Page 16: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Selecting SEs and Replicating Files

Deployment

• All jobs need certain common files.

• These files have to be replicated to increase performance and to distribute network bandwidth.

• SEs (Storage Elements) are selected according to their geographical and administrative proximity to the selected CEs, their performance and their configuration.

Page 17: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Splitting global job

Deployment

• The global job has to be broken down into subjobs.

• The subjob lifetime will decrease, which will:
– Increase interactivity.
– Improve monitoring capabilities.
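One way to break the global job down is to cut the input sequence file into fixed-size chunks, one per subjob. The sketch below assumes the metagenome is stored as FASTA text; the function names and the chunk size are illustrative choices, not details given in the slides.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def split_into_subjobs(records, seqs_per_subjob):
    """Group records into lists of at most seqs_per_subjob sequences."""
    if seqs_per_subjob < 1:
        raise ValueError("seqs_per_subjob must be at least 1")
    chunk = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == seqs_per_subjob:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # last, possibly smaller, subjob

# Made-up input: five short sequences, two per subjob.
fasta = """>s1
ACGT
>s2
GGCC
TTAA
>s3
AAAA
>s4
CCCC
>s5
GGGG
""".splitlines()
subjobs = list(split_into_subjobs(read_fasta(fasta), seqs_per_subjob=2))
print([len(chunk) for chunk in subjobs])
```

Smaller chunks give shorter subjobs, which is what makes the finer-grained monitoring described on this slide possible, at the cost of more submissions to manage.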

Page 18: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Submitting Jobs

Automation System

• Subjobs are assigned to a list of CEs.

• These CEs have been tested.

• Assignment is done according to the performance obtained in previous experiments.
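A performance-based assignment could, for instance, give each CE a share of the subjobs proportional to its previously measured throughput. The sketch below uses largest-remainder rounding so that the shares add up exactly; the proportional rule itself, the CE names and the throughput figures are assumptions for illustration, since the slides do not specify the exact policy.

```python
def assign_subjobs(n_subjobs, throughput):
    """Split n_subjobs among CEs in proportion to their measured throughput.

    throughput maps a CE name to sequences per hour from earlier runs.
    Returns a dict mapping each CE name to its number of subjobs.
    """
    total = sum(throughput.values())
    if total <= 0:
        raise ValueError("total throughput must be positive")
    exact = {ce: n_subjobs * t / total for ce, t in throughput.items()}
    counts = {ce: int(share) for ce, share in exact.items()}
    # Hand out the subjobs lost to rounding, largest fractional part first.
    leftover = n_subjobs - sum(counts.values())
    by_remainder = sorted(exact, key=lambda ce: exact[ce] - counts[ce], reverse=True)
    for ce in by_remainder[:leftover]:
        counts[ce] += 1
    return counts

plan = assign_subjobs(100, {"ce-a": 300.0, "ce-b": 150.0, "ce-c": 50.0})
print(plan)
```

With these made-up throughputs, the fastest CE receives 60 of the 100 subjobs and the slowest receives 10.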

Page 19: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Monitoring Jobs

Automation System

• Job status is monitored periodically.

• In case of errors (aborted job, bad results, etc.), the job is automatically resubmitted.

• If a job runs for too long, it is cancelled and resubmitted.

• If a job finishes successfully, its CE is recorded for later submissions.
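The monitoring rules above amount to a small decision table, sketched below. The status strings (`"running"`, `"done"`, `"aborted"`) and the time limit are placeholders chosen for the example; they are not the actual states reported by the Grid middleware.

```python
from enum import Enum

class Action(Enum):
    WAIT = "wait"
    RESUBMIT = "resubmit"
    CANCEL_AND_RESUBMIT = "cancel and resubmit"
    RECORD_CE = "record CE for later submissions"

def decide(status, hours_running, max_hours, results_ok=True):
    """Map one periodic status check to the action described on the slide."""
    if status == "aborted" or (status == "done" and not results_ok):
        return Action.RESUBMIT             # error: aborted job or bad results
    if status == "done":
        return Action.RECORD_CE            # success: remember this CE
    if status == "running" and hours_running > max_hours:
        return Action.CANCEL_AND_RESUBMIT  # running too long
    return Action.WAIT                     # still queued or running normally

checks = [
    decide("running", hours_running=2, max_hours=24),
    decide("running", hours_running=30, max_hours=24),
    decide("aborted", hours_running=1, max_hours=24),
    decide("done", hours_running=5, max_hours=24, results_ok=False),
    decide("done", hours_running=5, max_hours=24),
]
print([c.name for c in checks])
```

In a real system this function would be called from a loop that polls every submitted subjob at a fixed interval.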

Page 20: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Resubmitting Jobs

Automation System

• Each correctly finished job records its CE in a list.

• Jobs are resubmitted to a random CE from this list.

• If the list does not exist, the job is submitted to a random CE.
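The resubmission rule can be sketched in a few lines. The CE names are hypothetical, and the `rng` parameter is an addition for the example so that the random choice can be made reproducible.

```python
import random

def pick_ce(good_ces, all_ces, rng=random):
    """Pick a random CE from the list of known-good CEs.

    Falls back to a random CE among all candidates when no job has
    finished correctly yet (the list is empty or missing).
    """
    pool = good_ces if good_ces else all_ces
    return rng.choice(list(pool))

all_ces = ["ce-a", "ce-b", "ce-c", "ce-d"]
good_ces = ["ce-b", "ce-d"]   # CEs recorded by correctly finished jobs

rng = random.Random(0)        # seeded only to make the example repeatable
print(pick_ce(good_ces, all_ces, rng))  # always one of the good CEs
print(pick_ce([], all_ces, rng))        # any CE, since nothing is known yet
```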

Page 21: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Retrieving Results

Automation System

• Once results are available, they are downloaded and the standard outputs are scanned for errors.

• A retrieved job is no longer monitored.
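Checking a retrieved standard output for errors could be as simple as a pattern scan, as in this sketch. The error patterns and the sample output are invented for the example; the messages that actually matter depend on the tools run inside each subjob.

```python
import re

# Hypothetical patterns; a real list would be tuned to the tools in the job.
ERROR_PATTERNS = [re.compile(p, re.IGNORECASE)
                  for p in (r"\berror\b", r"segmentation fault", r"no such file")]

def find_errors(stdout_text):
    """Return (line_number, line) pairs for lines matching any error pattern."""
    return [(n, line)
            for n, line in enumerate(stdout_text.splitlines(), 1)
            if any(p.search(line) for p in ERROR_PATTERNS)]

# Made-up job output.
stdout_text = """Starting subjob 17
Copying database from SE
cp: cannot stat 'nr.tar.gz': No such file or directory
ERROR: database not available
Finished
"""
problems = find_errors(stdout_text)
print(problems)
```

A subjob whose output yields a non-empty list would be treated as a "bad results" case and resubmitted.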

Page 22: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

First conclusions

Results and Performances

• Jobs are too long to run sequentially
– The Sargasso Sea Metagenome takes 512 days.

• The same job on the Grid takes 13 days to finish completely.
– Speedup is around 40.

• Most jobs finish quickly (90% in 7 days)
– Speedup is around 80.
– There is no need to wait for all jobs to finish before beginning new stages.
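As a quick check of the quoted speedups, the ratio of sequential time to Grid time can be computed from the rounded day counts given above. Because those day counts are rounded, the ratios are only approximate.

```python
def speedup(sequential_days, grid_days):
    """Speedup as the ratio of sequential run time to Grid run time."""
    return sequential_days / grid_days

full = speedup(512, 13)  # all jobs finished
most = speedup(512, 7)   # point at which 90% of the jobs had finished
print(f"{full:.1f} {most:.1f}")
```

The first ratio is consistent with the quoted speedup of around 40. The second comes out somewhat below 80 with these rounded inputs, so the quoted figure presumably rests on more precise timings than the slide shows (this is an assumption; the slides do not say how it was computed).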

Page 23: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Correctly finished jobs percentage

Results and Performances

Page 24: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Sequences processed per hour

Results and Performances

Page 25: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Future plans

Future plans

• To create several shell scripts with different stages depending on the desired results.

• To increase the number of case studies.

• To improve automation performance.

• To write a report on the issues and lessons learnt in the EGEE and EELA infrastructures.

Page 26: Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Contact

Contact

Gabriel Aparício i Pla

Ignacio Blanquer Espert

Vicente Hernández García

Universitat Politècnica de València

Camí de Vera, s/n

46022 València, Spain

Emails: [email protected]

[email protected]@itaca.upv.es