Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Post on 01-Apr-2015

216 views 1 download

Tags:

Transcript of Deployment and Preparation of Metagenomic Analysis on the EELA Grid Gabriel Aparício, et al.

Deployment and Preparation of Metagenomic Analysis on the EELA

GridGabriel Aparício, et al.

Topic summary

Global Process Topics

• Introduction• Cases of study• Deployment• Automation System• Results and Performances Analysis• Future Plans

Introduction

• What is a Metagenome?• What is a Metagenome Analysis?• Why Grid is a Good Solution?• Which is the Proposed Structure?• Which are the Future Plans?

What is a Metagenome?

Introduction

• Term first used by Jo Handelsman and others in the University of Wisconsin in 1998.

• A metagenome is a collection of genes.

• It can be studied as a single gene.• Analysis can be done without

isolating genes and lab-cultivating them.

What is a Metagenome Analysis?

Introduction

• A Metagenome Analysis is the group of necessary steps to transform a file of a coded metagenome into another file with some interest information.

• This can include:– Database filtering.– BLAST alignments.– BLAST output filtering.– Creation of Phylogenetic Trees.

Why Grid is a Good Solution?

Introduction

• A Metagenome can be coded into several hundred of thousand sequences.

• Sequential time can take more than a year.• Public databases are continuously changing.• Several coarse steps can be done in parallel.• In a Grid, the global job can be divided into

subjobs.• A Metagenome Analysis can be processed in

a few days with a Grid Infrastructure.

Farm Soil Metagenome

Cases of Study

• This is a sample from a nutrient rich, moderately contaminated soil environment.

• This community is very diverse and complex.

• Many yet unknown enzymes are probably present there.

• Its study is very interesting from the biotechnological point of view.

Whale fall Metagenome

Cases of Study

• Whale carcasses are known to be a nutrient-rich environment in the bottom of the ocean.

• A heterogeneous mixture of bacteria flourish there.

• It is one of the best examples of marine bacterial communities.

Sargasso Sea Metagenome

Cases of Study

• These oceanic samples are taken from surface waters.

• They represent the diversity of bacteria that live planktonically.

Gut Metagenomes

Cases of Study

• Several metagenomes of the human intestinal microbiota.

• This consortia of bacteria helps its host to metabolize many nutrients that would be indigestible otherwise.

• It is involved in other functions–Maturation and modulation of the

immune response of the host.– Prevention of infection by bacterial

pathogens.

Sequential or Parallel jobs? (I)

Deployment

• There are around 150 CEs in BIOMED and EELA VOs.

• There are only around 30 CEs able to run MPICH jobs.

• The number of CEs decreases when the number of required nodes increases.

• Full efficiency in MPICH jobs is achieved occasionally.

Sequential or Parallel jobs? (II)

Deployment

Selecting CEs (I)

Deployment

• Several jobs are needed– A single job can take more than a year.– It is needed to split the Analysis into

several subjobs (often more than 100 subjobs).

• Several CEs are needed– To decentralize processing, storing and

network bandwidth.

• A Metagenome Analysis job has requirements– On software, hardware and configuration.

Selecting CEs (II)

Deployment

• Not all available CEs are able to produce results.

• Not all available CEs have the same performance.

• It is needed to select CEs and to distribute jobs according to their performance.

Selecting CEs (III)

Deployment

Selecting SEs and Replicating Files

Deployment

• All jobs need certain common files.• These files have to be replicated to

increase performance and to distribute network bandwidth.

• SEs selected will be located according to their geographical and administrative nearness to selected CEs, their performance and their configuration.

Splitting global job

Deployment

• The global job has to be broken down into subjobs.

• The subjob lifetime will decrease– Increase interactivity.– Improve monitoring capabilities.

Submitting Jobs

Automation System

• Subjobs are assigned to a list of CEs• These CEs have been tested.• Assignation is done according to

obtained performances in previous experiments.

Monitoring Jobs

Automation System

• Periodically, jobs status are monitored.

• In case of errors (aborted job, bad results, etc.), the job is automatically resubmitted.

• In case the job is running too long, the job is cancelled and resubmitted.

• In case the job has finished successfully, its CEs is annotated for later submissions.

Resubmitting Jobs

Automation System

• Each correctly finished job annotates its CEs and puts it into a list.

• The jobs are resubmitted to a random CE of this list.

• If the list does not exist, the job is submitted to a random CE.

Retrieving Results

Automation System

• Once results are available, they are downloaded and the standard outputs are explored to find any error.

• A retrieved job is no longer monitored.

First conclusions

Results and Performances

• Jobs are too long to run sequentially– Sargasso Sea Metagenome takes 512 days.

• The same job in Grid takes 13 days to be fully finished.– Speedup is around 40.

• High speed for most jobs (90% in 7 days)– Speedup is around 80.– No need to finish all jobs to begin with new

stages.

Correctly finished jobs percentage

Results and Performances

Sequences processed per hour

Results and Performances

Future plans

Future plans

• To create several shell-scripts with different stages depending on the desired results.

• To increase cases of study.• To improve automation

performances.• To make a report with the issues and

lessons learnt in EGEE and EELA infrastructures.

Contact

Contact

Gabriel Aparício i PlaIgnacio Blanquer Espert

Vicente Hernández García

Universitat Politècnica de ValènciaCamí de Vera, s/n

46022 València, SpainEmails: gaparicio@itaca.upv.es

iblanque@itaca.upv.esvhernand@itaca.upv.es