The HPN-DREAM breast cancer network inference challenge: Scoring and results
Steven Hill, The Netherlands Cancer Institute
RECOMB/ISCB Conference on Regulatory and Systems Genomics, with DREAM Challenges
8th November 2013
SC1A: Network inference from experimental data
SC1A: Scoring
• No definitive “gold standard” causal networks
• Use a novel held-out validation approach, emphasizing the causal aspect of the challenge
[Figure: the N treatments are split into training data (4 treatments: DMSO, AKTi, AKTi+MEKi, FGFR1/3i) and held-out test data (N−4 treatments: Test1, Test2, …, Test(N−4)).]
Participants infer 32 networks using the training data; the inferred networks are then assessed using the test data.
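A minimal sketch of this held-out split, assuming treatments are identified by name (only the four training treatments are named on the slide; the test treatments are simply the remaining N−4):

```python
# Sketch of the SC1A held-out validation split (assumed interface).
TRAINING_TREATMENTS = {"DMSO", "AKTi", "AKTi+MEKi", "FGFR1/3i"}

def split_treatments(all_treatments):
    """Split the N treatments into 4 training and N-4 held-out test treatments."""
    train = [t for t in all_treatments if t in TRAINING_TREATMENTS]
    test = [t for t in all_treatments if t not in TRAINING_TREATMENTS]
    return train, test
```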
SC1A: Scoring metric
Assessment: How well do inferred causal networks agree with the effects observed under inhibition in the test data?
Step 1: Identify a “gold standard” set of observed effects: for each phosphoprotein and each cell line/stimulus regime, a paired t-test compares the DMSO and test-inhibitor time courses.
[Figure: example for UACC812/Serum under Test1. Phospho1 (a.u.) differs clearly between DMSO and Test1 over time (p-value = 3.2×10⁻⁵), while Phospho2 does not (p-value = 0.45). Thresholding the p-values across phosphoproteins yields a binary “gold standard” vector of observed effects, e.g. 0 1 1 0 1 0 0 1 0 0.]
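A minimal sketch of Step 1, assuming matched DMSO and inhibitor time courses per phosphoprotein (the array layout and the 0.05 threshold are assumptions, not the challenge's exact pipeline):

```python
import numpy as np
from scipy.stats import ttest_rel

def gold_standard_effects(dmso, inhibitor, alpha=0.05):
    """Binary "gold standard" vector of observed effects for one
    cell line/stimulus regime and one test inhibitor.

    dmso, inhibitor: arrays of shape (n_phosphoproteins, n_timepoints)
    holding matched time courses. Returns 1 where the paired t-test
    finds a significant DMSO-vs-inhibitor difference.
    """
    effects = np.zeros(dmso.shape[0], dtype=int)
    for p in range(dmso.shape[0]):
        _, pval = ttest_rel(dmso[p], inhibitor[p])
        effects[p] = int(pval < alpha)
    return effects
```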
Step 2: Score submissions: compare the descendants of the test inhibitor's target in the inferred network with the “gold standard” list of observed effects in the held-out data, giving true-positive and false-positive counts #TP(τ) and #FP(τ) at each edge-score threshold τ.
SC1A: Scoring metric (continued)
Each submission provides a matrix of predicted edge scores for each cell line/stimulus regime; thresholding at τ gives a binary network:
$$\begin{pmatrix} 0.67 & \cdots & 0.43 \\ \vdots & \ddots & \vdots \\ 0.58 & \cdots & 0.87 \end{pmatrix} \xrightarrow{\;\text{threshold }\tau\;} \begin{pmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{pmatrix}$$
From the thresholded network, obtain the protein descendants downstream of the test inhibitor's target and compare them with the “gold standard” vector (e.g. 1 0 1 0 1 0 1 1 0 0 for Test1), giving #TP and #FP. Varying the threshold τ traces out a ROC curve, summarized by an AUROC score.
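A sketch of this threshold-and-compare computation; networkx and scikit-learn are convenience choices here, not mandated by the challenge, and the tie handling is an assumption:

```python
import networkx as nx
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_for_regime(edge_scores, target, gold_standard):
    """AUROC for one cell line/stimulus regime and one test inhibitor.

    edge_scores: (n, n) matrix of predicted edge scores.
    target: index of the test inhibitor's target protein.
    gold_standard: binary vector of observed effects (from the paired t-tests).
    """
    n = edge_scores.shape[0]
    # Score each protein by the highest threshold tau at which it is
    # still a descendant of the target in the thresholded network.
    protein_scores = np.zeros(n)
    for tau in np.unique(edge_scores)[::-1]:
        adj = (edge_scores >= tau).astype(int)
        g = nx.from_numpy_array(adj, create_using=nx.DiGraph)
        for d in nx.descendants(g, target):
            if protein_scores[d] == 0.0:
                protein_scores[d] = tau
    keep = np.arange(n) != target  # the target itself is not scored
    return roc_auc_score(gold_standard[keep], protein_scores[keep])
```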
SC1A: AUROC scores & nulls
• 74 final submissions
• Each submission has 32 AUROC scores (one for each cell line/stimulus regime)
[Figure: AUROC scores across submissions, marking non-significant AUROCs, significant AUROCs, and the best performer; annotated p-values include 3.58×10⁻⁶, 8.98×10⁻⁶, 9.19×10⁻⁴, and 4.18×10⁻⁶.]
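The slides do not spell out how the null distribution is constructed; one standard option, shown purely as an assumed illustration, is a permutation null obtained by shuffling the gold-standard labels:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_pvalue(gold_standard, protein_scores, n_perm=10_000, seed=0):
    """Empirical p-value of an observed AUROC against a label-permutation
    null. An assumed construction for illustration; the challenge's actual
    null model may differ."""
    rng = np.random.default_rng(seed)
    observed = roc_auc_score(gold_standard, protein_scores)
    null = np.array([
        roc_auc_score(rng.permutation(gold_standard), protein_scores)
        for _ in range(n_perm)
    ])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```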
Scoring procedure:
1. For each submission and each cell line/stimulus pair, compute AUROC score
2. Submissions ranked for each cell line/stimulus pair
3. Mean rank across cell line/stimulus pairs calculated for each submission
4. Rank submissions according to mean rank
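Steps 1 to 4 can be written compactly, assuming an AUROC matrix of shape (submissions × regimes); scipy's rankdata handles ties:

```python
import numpy as np
from scipy.stats import rankdata

def final_ranking(auroc):
    """auroc: (n_submissions, n_regimes) array of AUROC scores.
    Higher AUROC is better, so rank 1 goes to the highest score."""
    per_regime_ranks = np.apply_along_axis(lambda col: rankdata(-col), 0, auroc)
    mean_rank = per_regime_ranks.mean(axis=1)   # step 3: mean rank per submission
    return rankdata(mean_rank)                  # step 4: lower mean rank is better
```

Applied to the worked example on the next slide, this yields final ranks 3, 2, 1, 4.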
SC1A: Final ranking
Worked example (4 submissions, showing 3 of the 32 cell line/stimulus pairs):

              AUROC scores      AUROC ranks   mean rank   final rank
Submission 1: 0.5  0.5  0.7  →  4  3  2    →  3.00     →  3
Submission 2: 0.7  0.8  0.6  →  2  1  3    →  2.00     →  2
Submission 3: 0.9  0.7  0.8  →  1  2  1    →  1.33     →  1
Submission 4: 0.6  0.4  0.5  →  3  4  4    →  3.66     →  4
SC1A: Robustness analysis
• Verify that the final ranking is robust
Procedure:
1. Mask 50% of phosphoproteins in each AUROC calculation
2. Re-calculate the final ranking
3. Repeat (1) and (2) 100 times
[Figure: rank distributions for the top 10 teams across the 100 re-calculated rankings; annotated p-value 5.40×10⁻¹⁰.]
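A sketch of the masking loop; only the 50%/100-repeat scheme comes from the slide, and the scoring callback interface is hypothetical:

```python
import numpy as np

def robustness_ranks(auroc_fn, n_phospho, n_repeats=100, seed=0):
    """Re-compute the final ranking with half the phosphoproteins masked.

    auroc_fn: hypothetical callable taking a boolean keep-mask over
    phosphoproteins and returning the (n_submissions, n_regimes) AUROC
    matrix computed on the kept phosphoproteins only.
    Returns one final ranking per repeat (rows).
    """
    rng = np.random.default_rng(seed)
    rankings = []
    for _ in range(n_repeats):
        keep = rng.permutation(n_phospho) < n_phospho // 2  # random 50% mask
        rankings.append(final_ranking(auroc_fn(keep)))      # final_ranking: see sketch above
    return np.vstack(rankings)
```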
SC1B: Network inference from in silico data
• Gold standard available: the data-generating causal network
• Participants submitted a single set of edge scores
• Edge scores compared against the gold standard → AUROC score
• Participants ranked based on AUROC score
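With a true network available, scoring reduces to a standard ROC analysis of edge scores against gold-standard edges; a minimal sketch (the matrix layout and exclusion of self-loops are assumptions):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def sc1b_auroc(edge_scores, gold_adjacency):
    """AUROC of predicted edge scores against the data-generating network.

    edge_scores, gold_adjacency: (n, n) arrays; off-diagonal entries only.
    """
    off_diag = ~np.eye(edge_scores.shape[0], dtype=bool)
    return roc_auc_score(gold_adjacency[off_diag], edge_scores[off_diag])
```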
SC1B: Scoring, AUROCs, null & robustness
[Figure: AUROC scores across submissions: 51 non-significant AUROCs, 14 significant AUROCs; best performer, p = 3.11×10⁻¹¹.]
Robustness analysis:
1. Mask 50% of edges in the calculation of the AUROC
2. Re-calculate the final ranking
3. Repeat (1) and (2) 100 times
[Figure: rank distributions for the top 10 teams across the 100 re-calculations; annotated p-value 3.90×10⁻¹⁴.]
Combined score for SC1A and SC1B
• 59 teams participated in both SC1A and SC1B
• Rewards consistently good performance across both parts of SC1
• Combined score: average of SC1A rank and SC1B rank
• Top team ranked robustly first
SC2A: Timecourse prediction from experimental data
SC2A: Scoring
[Figure: as in SC1A, the N treatments are split into training data (4 treatments: DMSO, AKTi, AKTi+MEKi, FGFR1/3i) and held-out test data (N−4 treatments: Test1, Test2, …, Test(N−4)).]
Participants build dynamical models using the training data and make predictions for phosphoprotein trajectories under inhibitions not in the training data; predictions are assessed using the test data.
SC2A: Scoring metric
• Participants made predictions for all phosphoproteins, for each cell line/stimulus pair, under inhibition with each of 5 test inhibitors
• Assessment: How well do predicted trajectories agree with the corresponding trajectories in the test data?
• Scoring metric: root-mean-squared error (RMSE), calculated for each cell line/phosphoprotein/test inhibitor combination
$$\mathrm{RMSE}_{p,c,i} = \sqrt{\frac{1}{TS}\sum_{t=1}^{T}\sum_{s=1}^{S}\left(x_{p,c,i,s,t} - \hat{x}_{p,c,i,s,t}\right)^{2}}$$
where $x$ is the observed and $\hat{x}$ the predicted level for phosphoprotein $p$, cell line $c$, test inhibitor $i$, stimulus $s$, and time point $t$ (e.g. UACC812, Phospho1, Test1).
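A direct NumPy transcription of the formula (the array layout is an assumption):

```python
import numpy as np

def rmse(observed, predicted):
    """RMSE for one cell line/phosphoprotein/test inhibitor combination.

    observed, predicted: (n_stimuli, n_timepoints) arrays of phosphoprotein
    levels, e.g. UACC812 / Phospho1 / Test1.
    """
    return np.sqrt(np.mean((observed - predicted) ** 2))
```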
SC2A: RMSE scores, nulls & final ranking
• 14 final submissions
[Figure: RMSE scores across submissions, marking non-significant scores, significant scores, and the best performer; annotated p-values include 1.35×10⁻⁴, 3.70×10⁻⁸, 1.21×10⁻⁶, and 1.49×10⁻⁵.]
Final ranking: analogously to SC1A, submissions are ranked for each regime and the mean rank is calculated.
SC2A: Robustness analysis
• Verify that the final ranking is robust
Procedure:
1. Mask 50% of data points in each RMSE calculation
2. Re-calculate the final ranking
3. Repeat (1) and (2) 100 times
[Figure: rank distributions for the top 10 teams across the 100 re-calculations; the 2 best performers are marked, and one submission is incomplete; annotated p-values 0.99, 3.04×10⁻¹⁸, and 6.97×10⁻⁵.]
SC2B: Timecourse prediction from in silico data
• Participants made predictions for all phosphoproteins for each stimulus regime, under inhibition of each phosphoprotein in turn
• Scoring metric is RMSE; the procedure follows that of SC2A
SC2B: Scoring metric, nulls & robustness
$$\mathrm{RMSE}_{p,i} = \sqrt{\frac{1}{TS}\sum_{t=1}^{T}\sum_{s=1}^{S}\left(x_{p,i,s,t} - \hat{x}_{p,i,s,t}\right)^{2}}$$
where $x$ is the observed and $\hat{x}$ the predicted level for phosphoprotein $p$, inhibited protein $i$, stimulus $s$, and time point $t$.
[Figure: RMSE scores across submissions, marking non-significant scores, significant scores, and the best performer; annotated p-values include 1.0, 1.68×10⁻¹⁴, 2.89×10⁻⁷, and 0.015.]
Robustness analysis:
1. Mask 50% of data points in each RMSE calculation
2. Re-calculate the final ranking
3. Repeat (1) and (2) 100 times
[Figure: rank distributions for the top 10 teams across the 100 re-calculations; one submission is incomplete; annotated p-values 7.71×10⁻¹⁹ and 0.99.]
Combined score for SC2A and SC2B
• 10 teams participated in both SC2A and SC2B
• Rewards consistently good performance across both parts of SC2
• Combined score: average of SC2A rank and SC2B rank
• Top team ranked robustly first
SC3: Visualization
SC3: Scoring and results
• 14 submissions
• 36 HPN-DREAM participants voted, assigning ranks 1 to 3
• Final score = mean rank (unranked submissions assigned rank 4)
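A small sketch of this vote aggregation (the ballot data structure is assumed): each voter contributes ranks 1 to 3 for their top three submissions, and every other submission counts as rank 4.

```python
import numpy as np

def sc3_scores(ballots, n_submissions):
    """ballots: one dict per voter mapping submission index -> rank (1-3).
    Unranked submissions receive rank 4. Returns the mean rank per
    submission (lower is better)."""
    ranks = np.full((len(ballots), n_submissions), 4.0)
    for v, ballot in enumerate(ballots):
        for submission, rank in ballot.items():
            ranks[v, submission] = rank
    return ranks.mean(axis=0)
```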
Conclusions and Observations
• Submissions rigorously assessed using held-out test data; for SC1A, a novel procedure was used to assess network inference performance in a setting with no true “gold standard”
• Many statistically significant predictions submitted
For further investigation:
• Explore why some regimes (e.g. cell line/stimulus pairs) are easier to predict than others
• Determine why different teams performed well in the experimental and in silico challenges
• Identify the methods/approaches that yield the best predictions
• Wisdom of crowds: does aggregating submissions improve performance and lead to the discovery of biological insights?