INFSO-RI-508833 Enabling Grids for E-sciencE BioDCV: a grid-enabled complete validation setup for...

INFSO-RI-508833

Enabling Grids for E-sciencE

www.eu-egee.org

BioDCV: a grid-enabled complete validation setup for functional profiling

EGEE User Forum

CERN, 01-03.03.2006

S. Paoli, D. Albanese, G. Jurman, A. Barla, S. Merler, R. Flor, S. Cozzini, J. Reid, C. Furlanello

http://mpa.itc.it


Summary

1. Predictive profiling from microarray data.

2. A complete validation environment in grid: BioDCV.

3. Test: Cluster vs Grid.




Predictive Profiling

Questions for a discriminating molecular signature:

• predict disease state

• identify patterns regarding subclasses of patients

[Figure: gene expression array (Affy), genes versus samples, for Group A and Group B. Marked regions: over-expression in group A, over-expression in group B, under-expression in group B.]

A PANEL OF DISCRIMINATING GENES?




The BioDCV system

A software setup for predictive molecular profiling of gene expression data:

• A set-up based on the E-RFE algorithm for Support Vector Machines (SVM).

• Control of selection bias, outlier detection

• Subtype identification

• Written in C, coupled with SQLite database libraries.

• It implements a complete validation procedure on distributed systems: MPI or OpenMosix clusters.

• Since March 2005: ported as a grid application, with MPI execution through LCG middleware and data storage in SE.




The BioDCV setup (E-RFE SVM)

To avoid selection bias (p >> n): a COMPLETE VALIDATION SCHEME*

• externally, a stratified random partitioning,

• internally, a model selection based on a K-fold cross-validation.

3 x 10^5 SVM models (+ random labels: 2 x 10^6)**

* Ambroise & McLachlan, 2002; Simon et al., 2003; Furlanello et al., 2003

** Binary classification, on a 20000 genes x 45 cDNA array, 400 runs

OFS-M: Model tuning and Feature ranking

ONF: Optimal gene panel estimator

ATE: Average Test Error

B=400
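The two levels of the complete validation scheme can be sketched as index bookkeeping. This is only an illustrative sketch, not the BioDCV implementation (which is written in C): the function names, the 25/20 class split, the test fraction and K=3 are all assumptions made for the example, and no SVM or E-RFE step is actually run.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, rng):
    """External level: random train/test split that preserves class proportions."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for idx in by_class.values():
        idx = idx[:]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def k_folds(indices, k, rng):
    """Internal level: yield (fit, validation) index lists for K-fold cross-validation."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]


def complete_validation_plan(labels, runs, k, test_fraction=0.25, seed=0):
    """Build B external splits, each with its own internal K-fold partition.

    Model selection would use only the 'inner' folds of the training part;
    the 'test' part is kept aside to estimate the test error of each run.
    """
    rng = random.Random(seed)
    plan = []
    for _ in range(runs):
        train, test = stratified_split(labels, test_fraction, rng)
        plan.append({"train": train, "test": test,
                     "inner": list(k_folds(train, k, rng))})
    return plan


# 45 samples as in the slide's example; the class sizes are hypothetical.
labels = ["A"] * 25 + ["B"] * 20
plan = complete_validation_plan(labels, runs=400, k=3)
print(len(plan), "external runs,", len(plan[0]["inner"]), "inner folds each")
```

Keeping the test indices out of every inner fold is what guards against selection bias: feature ranking and model tuning never see the samples used for the final error estimate.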




Implementation

[Figure: deployment of the BioDCV system on the Egrid infrastructure, showing the UI, CE, SE and WNs, with data transfers labelled 50-400 MB and 2-50 MB.]




Experiments

We present two experiments designed to measure the performance of BioDCV.

• Resources: a Linux cluster of 8 Xeon 3.0 GHz CPUs, and the Egrid infrastructure (within the Italian Grid-it), ranging from 1 to 64 Xeon 3.0 GHz CPUs.

• Data: a set of 6 different microarray datasets.

• Tests:

– Benchmark 1: footprint

– Benchmark 2: scalability




Datasets

#  Dataset name        Samples  Genes  DB (MB)  dN / 10^6  T_tns (s)
1  BRCA                     62   4057      2.2        0.2       2534
2  Sarcoma                  35   7143      2.2        0.3       2887
3  Liver Cancer            213   1993      3.7        0.4       6894
4  Pediatric Leukemia      327  12625       32        4.1      27831
5  Wang                    286  17816       40        5.0     138335
6  Chang                   295  24481       57        7.2     114546

Sources:

1-2  IFOM-INT, Milan (Italy), 2005

3  ATAC-PCR: Sese et al., Bioinformatics 2000

4  Yeoh et al., NCBI 2002

5  Wang et al., Lancet 2005

6  Chang et al., PNAS 2005

Benchmark 1: datasets 1-6. Benchmark 2: datasets 3 and 4.

Footprint: dN = Samples x Genes
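As a small illustration of the footprint definition, the sketch below computes dN = Samples x Genes for the six datasets. The sample and gene counts are taken from the table; the variable and function names are chosen for this example only.

```python
# (name, samples, genes) as listed in the Datasets table.
datasets = [
    ("BRCA", 62, 4057),
    ("Sarcoma", 35, 7143),
    ("Liver Cancer", 213, 1993),
    ("Pediatric Leukemia", 327, 12625),
    ("Wang", 286, 17816),
    ("Chang", 295, 24481),
]


def footprint(samples, genes):
    """Dataset footprint dN: number of samples times number of genes."""
    return samples * genes


footprints = {name: footprint(s, g) for name, s, g in datasets}

for name, dn in footprints.items():
    # Report the footprint in units of 10^6, as on the slides.
    print(f"{name:20s} dN = {dn:>9d}  ({dn / 1e6:.2f} x 10^6)")
```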




Benchmark 1

• We characterize the BioDCV application with respect to different datasets, for a fixed number of CPUs in grid.

• This benchmark looks for the factor, called footprint, that relates the execution time of the application to its input data.

• Applied to the set of 6 microarray datasets, with a fixed number of 32 CPUs in grid.

Evaluation metrics:

T_tns = L_i + U + E_g + D + S

T_tns: effective execution time, i.e. total execution time excluding time spent in queue

L_i: experiment setup time

U: time for uploading data and application to the grid, including delivery on the CE

E_g: computing time, without latency time

D: time for data retrieval and download. This includes copying all results from the WNs to the starting SE, and their transfer to the local site

S: semisupervised analysis time
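The decomposition of the effective execution time can be written down directly. The sketch below is illustrative only: the class name, field names and the timing values are hypothetical and are not measurements from the experiments.

```python
from dataclasses import dataclass


@dataclass
class RunTiming:
    """Timing components of one grid run, all in seconds."""
    setup: float      # L_i: experiment setup
    upload: float     # U: upload of data and application to the grid
    compute: float    # E_g: computing time without latency
    download: float   # D: retrieval and download of results
    semisup: float    # S: semisupervised analysis

    def effective_time(self):
        """T_tns = L_i + U + E_g + D + S (queue time is not included)."""
        return (self.setup + self.upload + self.compute
                + self.download + self.semisup)

    def compute_share(self):
        """Fraction of the effective time spent in pure computation."""
        return self.compute / self.effective_time()


# Hypothetical values, for illustration only.
run = RunTiming(setup=10, upload=20, compute=1000, download=30, semisup=40)
print(run.effective_time(), round(run.compute_share(), 3))
```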




Benchmark 1 - Footprint

FOOTPRINT dN: #genes x #samples

[Figure: time (s) versus dataset footprint dN / 10^6, for a fixed number of 32 CPUs in grid. Curves: T_tns, E_g, 10 x L_i, 10 x U, 10 x S. Points labelled BRCA, Sarcoma, Morishita, PL, Wang, Chang.]

T_tns: effective execution time

E_g: computing time

S: semisupervised analysis

L_i: experiment setup

U: upload of data to grid
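One way to examine how T_tns depends on the footprint is an ordinary least-squares line through the six (dN / 10^6, T_tns) pairs of the Datasets table. This is a sketch of that check, not the analysis performed by the authors; the function names are chosen for this example.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# dN / 10^6 and T_tns (s) columns of the Datasets table.
dn_millions = [0.2, 0.3, 0.4, 4.1, 5.0, 7.2]
t_tns = [2534, 2887, 6894, 27831, 138335, 114546]

slope, intercept = linear_fit(dn_millions, t_tns)
r2 = r_squared(dn_millions, t_tns, slope, intercept)
print(f"slope = {slope:.0f} s per 10^6 footprint units, R^2 = {r2:.2f}")
```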




Benchmark 2

We study the scalability of our application as a function of the number of CPUs, through a speed-up measure on different computational environments.

• Resources: Linux cluster (ranging from 1 to 8 CPUs) and grid (from 1 to 32 CPUs).

• Data:

Dataset name        Samples  Genes  DB (MB)  dN / 10^6
Liver Cancer            213   1993      3.7        0.4
Pediatric Leukemia      327  12625       32        4.1

Speed-up metric

Def: if E_g[N] is the user time of a program, as reported by the shell command “time”, for N CPUs:

Speed-up(N) = E_g[1] / E_g[N]
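The speed-up definition translates into a few lines of code. In this sketch the function names and the user times are hypothetical, chosen only to show the calculation; the parallel efficiency (speed-up divided by N) is an extra quantity added for illustration and is not used on the slides.

```python
def speed_up(user_times):
    """Speed-up(N) = E_g[1] / E_g[N] for every measured CPU count N.

    user_times maps the number of CPUs N to the user time E_g[N] in seconds
    and must contain the single-CPU measurement (key 1).
    """
    base = user_times[1]
    return {n: base / t for n, t in sorted(user_times.items())}


def efficiency(user_times):
    """Speed-up divided by N; 1.0 corresponds to linear speed-up."""
    return {n: s / n for n, s in speed_up(user_times).items()}


# Hypothetical user times (seconds), not measured values.
example_times = {1: 8000.0, 2: 4100.0, 4: 2100.0, 8: 1100.0}

for n, s in speed_up(example_times).items():
    print(f"N = {n}: speed-up = {s:.2f}, efficiency = {efficiency(example_times)[n]:.2f}")
```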




Benchmark 2 Cluster

[Figure: speed-up versus number of CPUs (1, 2, 4, 8) on the cluster, in two panels: LiverCanc and PedLeuk. Experimental data are plotted against the linear speed-up line.]





Benchmark 2 Grid

[Figure: speed-up versus number of CPUs on the grid, in two panels: PedLeuk (4, 8, 16, 32 CPUs) and LiverCanc (1, 2, 4, 8, 16, 32 CPUs).]





Discussion

• Two experiments, for a total of 139 CPU days on the Egrid infrastructure

• In Benchmark 1, effective execution time increases linearly with the dataset footprint, i.e. the product of the number of genes and the number of samples

• In Benchmark 2, the speed-up curve is very close to linear

• The BioDCV system on the LCG/EGEE computational grid can be used in practical large-scale experiments

• The BioDCV system will soon be run on proteomic data in grid

• The next step is porting our system to EGEE’s Biomed VO




BioDCV SubVersion Homepage

http://biodcv.itc.it

• C. Furlanello, M. Serafini, S. Merler and G. Jurman. Semi-supervised learning for molecular profiling. IEEE Trans. Comp. Biology and Bioinformatics, 2(2):110-118, 2005.

• More on http://mpa.itc.it




Acknowledgments

ICTP E-GRID Project, Trieste

Angelo Leto, Riccardo Murri, Ezio Corso, Alessio Terpin, Antonio Messina, Riccardo Di Meo

INFN GRID

Roberto Barbera, Mirco Mazzuccato

IFOM-FIRC and INT, Milano

Manuela Gariboldi, Marco A. Pierotti

Grants:

BICG (AIRC), Democritos

Data:

IFOM-FIRC, Cardiogenomics PGA
