INFSO-RI-508833
Enabling Grids for E-sciencE
www.eu-egee.org
BioDCV: a grid-enabled complete validation setup
for functional profiling
EGEE User Forum
CERN, 01-03.03.2006
S. Paoli, D. Albanese, G. Jurman, A. Barla, S. Merler, R. Flor, S. Cozzini, J. Reid, C. Furlanello
http://mpa.itc.it
Summary
1. Predictive profiling from microarray data.
2. A complete validation environment in grid: BioDCV.
3. Test: Cluster vs Grid.
Predictive Profiling
Questions for a discriminating molecular signature:
• predict disease state
• identify patterns regarding subclasses of patients
[Figure: gene expression array (Affy), genes x samples, with samples split into Group A and Group B; regions marked as over-expression in group A, over-expression in group B, and under-expression in group B]
A panel of discriminating genes?
The BioDCV system
A software setup for predictive molecular profiling of gene expression data:
• A set-up based on the E-RFE algorithm for Support Vector Machines (SVM).
• Control of selection bias, outlier detection
• Subtype identification
• C language coupled with SQLite database libraries.
• It implements a complete validation procedure on distributed systems: MPI or openMosix clusters.
• Since March 2005: ported as a grid application, with MPI execution through the LCG middleware and data storage in the SE.
The BioDCV setup (E-RFE SVM)
To avoid selection bias (p >> n): a COMPLETE VALIDATION SCHEME*
• externally, a stratified random partitioning,
• internally, a model selection based on a K-fold cross-validation
3 x 10^5 SVM models (+ random labels: 2 x 10^6) **
[Figure: validation workflow, with the following blocks]
OFS-M: Model tuning and Feature ranking
ONF: Optimal gene panel estimator
ATE: Average Test Error
B=400
* Ambroise & McLachlan, 2002; Simon et al., 2003; Furlanello et al., 2003
** Binary classification, on a 20000 genes x 45 cDNA array, 400 runs
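The two nested loops of the complete validation scheme can be illustrated with a minimal sketch. It is not the BioDCV code: the feature ranking (difference of class means) and the classifier (nearest centroid) are simple stand-ins for E-RFE and SVM, and all function names, the toy data and the default parameters are assumptions made for illustration. The point it shows is that feature ranking and panel-size selection only ever see the training part of each external partition.

```python
import random
from statistics import mean

def stratified_folds(labels, idx, k, rng):
    """Deal the indices of each class round-robin into k folds."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i in idx if labels[i] == cls]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

def rank_features(X, y, idx):
    """Stand-in ranking (NOT E-RFE): absolute difference of class means."""
    def score(j):
        a = [X[i][j] for i in idx if y[i] == 0]
        b = [X[i][j] for i in idx if y[i] == 1]
        return abs(mean(a) - mean(b))
    return sorted(range(len(X[0])), key=score, reverse=True)

def centroid_error(X, y, train, test, feats):
    """Stand-in classifier (NOT an SVM): nearest class centroid on `feats`."""
    cents = {c: [mean(X[i][j] for i in train if y[i] == c) for j in feats]
             for c in (0, 1)}
    wrong = 0
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cents[c]))
                for c in (0, 1)}
        wrong += min(dist, key=dist.get) != y[i]
    return wrong / len(test)

def complete_validation(X, y, B=20, K=3, sizes=(1, 2, 4), seed=0):
    """Return the average test error over B external stratified partitions."""
    rng = random.Random(seed)
    test_errors = []
    for _ in range(B):
        # External loop: hold out one of 4 stratified folds as the test set.
        outer = stratified_folds(y, range(len(y)), 4, rng)
        test, train = outer[0], [i for f in outer[1:] for i in f]
        # Internal loop: K-fold CV on the training part only, so the test
        # samples never influence feature ranking or panel-size selection.
        inner = stratified_folds(y, train, K, rng)
        cv_err = dict.fromkeys(sizes, 0.0)
        for k in range(K):
            fit = [i for f in inner[:k] + inner[k + 1:] for i in f]
            ranking = rank_features(X, y, fit)
            for s in sizes:
                cv_err[s] += centroid_error(X, y, fit, inner[k], ranking[:s]) / K
        best = min(sizes, key=cv_err.get)  # selected panel size
        ranking = rank_features(X, y, train)
        test_errors.append(centroid_error(X, y, train, test, ranking[:best]))
    return mean(test_errors)

def toy_data(n_per_class=20, n_genes=20, seed=1):
    """Synthetic data: only the first two 'genes' differ between classes."""
    rng = random.Random(seed)
    X, y = [], []
    for cls in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            row[0] += 3.0 * cls
            row[1] += 3.0 * cls
            X.append(row)
            y.append(cls)
    return X, y
```

Calling `complete_validation(*toy_data())` returns an average test error in the spirit of the ATE above; the selected panel size loosely corresponds to the ONF. Ranking genes once on all samples before splitting would reintroduce the selection bias that the scheme is designed to avoid.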
Implementation
[Figure: the BioDCV system on the Egrid infrastructure, showing the UI, CE, SE and WNs; data transfers are labelled 50-400 MB and 2-50 MB]
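The slides do not say how the work is divided among the MPI processes. As a hedged illustration only, assuming the 400 resampling runs are independent of each other, they could be split into near-equal contiguous blocks, one per process. The function name `split_runs` and the block layout are hypothetical.

```python
def split_runs(n_runs, n_workers):
    """Assign run indices 0..n_runs-1 to workers in near-equal contiguous blocks.

    Assumption: runs are independent, so any assignment is valid; this is
    not taken from the BioDCV sources.
    """
    base, extra = divmod(n_runs, n_workers)
    blocks, start = [], 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional run each.
        size = base + (1 if worker < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

if __name__ == "__main__":
    for rank, block in enumerate(split_runs(400, 32)):
        print(f"process {rank:2d}: runs {block.start}..{block.stop - 1}")
```

With an assignment of this kind no process holds more than one run above any other, which is one way near-linear scaling could be approached when the runs have similar cost.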
Experiments
We present two experiments designed to measure the performance of BioDCV.
• Resources: a Linux cluster of 8 Xeon 3.0 GHz CPUs, and the Egrid infrastructure (within the Italian Grid-it), ranging from 1 to 64 Xeon 3.0 GHz CPUs.
• Data: a set of 6 different microarray datasets.
• Tests:
– Benchmark 1: footprint
– Benchmark 2: scalability
Datasets
#  Dataset name        Samples  Genes  DB (MB)  dN / 10^6  T_tns (s)
1  BRCA                62       4057   2.2      0.2        2534
2  Sarcoma             35       7143   2.2      0.3        2887
3  Liver Cancer        213      1993   3.7      0.4        6894
4  Pediatric Leukemia  327      12625  32       4.1        27831
5  Wang                286      17816  40       5.0        138335
6  Chang               295      24481  57       7.2        114546
Footprint: dN = Samples x Genes
Benchmark 1: datasets 1-6. Benchmark 2: datasets 3 and 4.
Sources:
1-2 IFOM-INT, Milan (Italy), 2005
3 ATAC-PCR: Sese et al., Bioinformatics 2000
4 Yeoh et al., NCBI 2002
5 Wang et al., Lancet 2005
6 Chang et al., PNAS 2005
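As a small worked example, the footprint of each dataset can be recomputed from the Samples and Genes columns of the table. The sample and gene counts are copied from the slide; the names `DATASETS` and `footprint` are illustrative. The slide reports the footprint to one decimal in units of 10^6, so the recomputed values need not agree with that column digit for digit.

```python
# (dataset name, samples, genes), copied from the Datasets slide
DATASETS = [
    ("BRCA", 62, 4057),
    ("Sarcoma", 35, 7143),
    ("Liver Cancer", 213, 1993),
    ("Pediatric Leukemia", 327, 12625),
    ("Wang", 286, 17816),
    ("Chang", 295, 24481),
]

def footprint(samples, genes):
    """Dataset footprint dN: number of samples times number of genes."""
    return samples * genes

if __name__ == "__main__":
    for name, samples, genes in DATASETS:
        dn = footprint(samples, genes)
        print(f"{name:20s} dN = {dn:8d}  ({dn / 1e6:.2f} x 10^6)")
```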
Benchmark 1
• We characterize the BioDCV application with respect to different datasets, for a fixed number of CPUs in grid.
• This benchmark looks for the factor, called the footprint, that relates the execution time of the application to its input data.
• Applied to the set of 6 microarray datasets, with a fixed number of 32 CPUs in grid.
Evaluation metrics:
T_tns = L_i + U + E_g + D + S
T_tns: effective execution time, i.e. total execution time without the time spent in queue
L_i: experiment setup
U: time for uploading data and application to the grid, including delivery on the CE
E_g: computing time, without latency time
D: time for data retrieval and download. This includes copying all results from the WNs to the starting SE, and their transfer to the local site
S: semisupervised analysis time
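The decomposition of T_tns can be written down as a short sketch. The class and field names are illustrative, and the numbers in the usage lines are made up for the example; they are not measurements from the benchmark.

```python
from dataclasses import dataclass

@dataclass
class RunTiming:
    """Phase timings, in seconds, of one grid run (names are illustrative)."""
    setup: float      # L_i: experiment setup
    upload: float     # U: upload of data and application to the grid
    compute: float    # E_g: computing time, without latency
    download: float   # D: retrieval and download of results
    semisup: float    # S: semisupervised analysis

    def effective_time(self):
        """T_tns = L_i + U + E_g + D + S (queue time is excluded)."""
        return (self.setup + self.upload + self.compute
                + self.download + self.semisup)

    def compute_share(self):
        """Fraction of the effective time spent in actual computation."""
        return self.compute / self.effective_time()

if __name__ == "__main__":
    timing = RunTiming(setup=10, upload=20, compute=900, download=50, semisup=20)
    print(timing.effective_time(), timing.compute_share())
```

Keeping the phases separate makes it possible to see whether a run is dominated by computation (E_g) or by grid overhead (L_i, U and D).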
Benchmark 1 - Footprint
Footprint dN: #genes x #samples
[Figure: time (s) versus dataset footprint dN / 10^6, for a fixed 32 CPUs in grid. Curves: T_tns, E_g, 10 x L_i, 10 x U, 10 x S. Datasets marked: BRCA, Sarcoma, Morishita, PL, Wang, Chang.]
T_tns: effective execution time
E_g: computing time
S: semisupervised analysis
L_i: experiment setup
U: upload of data to the grid
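One way to examine the relation between T_tns and the footprint is an ordinary least-squares line through the six (dN, T_tns) points of the Datasets table. This is a sketch of such a check, not the analysis used for the figure; the function names are illustrative and the data are copied from the table.

```python
# (dataset, samples, genes, T_tns in seconds), copied from the Datasets slide
RUNS = [
    ("BRCA", 62, 4057, 2534),
    ("Sarcoma", 35, 7143, 2887),
    ("Liver Cancer", 213, 1993, 6894),
    ("Pediatric Leukemia", 327, 12625, 27831),
    ("Wang", 286, 17816, 138335),
    ("Chang", 295, 24481, 114546),
]

def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

if __name__ == "__main__":
    footprints = [s * g / 1e6 for _, s, g, _ in RUNS]  # dN / 10^6
    times = [t for _, _, _, t in RUNS]
    slope, intercept = linear_fit(footprints, times)
    print(f"T_tns ~ {slope:.0f} s per 10^6 of footprint + {intercept:.0f} s")
    print(f"R^2 = {r_squared(footprints, times, slope, intercept):.3f}")
```

With only six points, the fit gives an indication of the trend rather than a precise cost model.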
Benchmark 2
We study the scalability of our application as a function of the number of CPUs, through a speed-up measure on different computational environments.
• Resources: Linux cluster (ranging from 1 to 8 CPUs) and grid (from 1 to 32 CPUs).
• Data:
Dataset name        Samples  Genes  DB (MB)  dN / 10^6
Liver Cancer        213      1993   3.7      0.4
Pediatric Leukemia  327      12625  32       4.1
Speed-up metric
Def: if E_g[N] is the user time of a program, as reported by the shell command "time", for N CPUs:
Speed-up(N) = E_g[1] / E_g[N]
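The speed-up definition, Speed-up(N) = E_g[1] / E_g[N], translates directly into code. In this sketch the function names are illustrative, the timings in the usage lines are invented for the example, and the `efficiency` helper (speed-up divided by N, i.e. the fraction of linear speed-up achieved) is an addition that does not appear in the slides.

```python
def speedup(user_time):
    """Speed-up(N) = E_g[1] / E_g[N], for a dict {N: user time in seconds}."""
    base = user_time[1]
    return {n: base / t for n, t in sorted(user_time.items())}

def efficiency(user_time):
    """Fraction of linear speed-up achieved: Speed-up(N) / N."""
    return {n: s / n for n, s in speedup(user_time).items()}

if __name__ == "__main__":
    # Hypothetical timings, not measurements from the benchmark.
    timings = {1: 800.0, 2: 410.0, 4: 215.0, 8: 120.0}
    for n, s in speedup(timings).items():
        print(f"N = {n}: speed-up {s:.2f} (linear speed-up would be {n})")
```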
Benchmark 2 Cluster
[Figure: two panels, "LiverCanc: cluster" and "PedLeuk: cluster". Speed-up versus number of CPUs (1, 2, 4, 8); experimental data compared with linear speed-up.]
Benchmark 2 Grid
[Figure: two panels. "PedLeuk: Grid": speed-up versus number of CPUs (4, 8, 16, 32). "LiverCanc: Grid": speed-up versus number of CPUs (1, 2, 4, 8, 16, 32). Experimental data compared with linear speed-up.]
Discussion
• Two experiments, for a total of 139 CPU days on the Egrid infrastructure
• In Benchmark 1, the effective execution time increases linearly with the dataset footprint, i.e. the product of the number of genes and the number of samples
• In Benchmark 2, the speed-up curve is very close to linear
• The BioDCV system on the LCG/EGEE computational grid can be used in practical large-scale experiments
• The BioDCV system will soon be run on proteomic data in grid
• The next step is porting our system to EGEE's Biomed VO
BioDCV SubVersion Homepage
http://biodcv.itc.it
• C. Furlanello, M. Serafini, S. Merler and G. Jurman. Semi-supervised learning for molecular profiling. IEEE Trans. Comp. Biology and Bioinformatics, 2(2):110-118, 2005.
• More on http://mpa.itc.it
Acknowledgments
ICTP E-GRID Project, Trieste: Angelo Leto, Riccardo Murri, Ezio Corso, Alessio Terpin, Antonio Messina, Riccardo Di Meo
INFN GRID: Roberto Barbera, Mirco Mazzuccato
IFOM-FIRC and INT, Milano: Manuela Gariboldi, Marco A. Pierotti
Grants: BICG (AIRC), Democritos
Data: IFOM-FIRC, Cardiogenomics PGA