The HPN-DREAM breast cancer network inference challenge: Scoring and results
Steven Hill, The Netherlands Cancer Institute
RECOMB/ISCB Conference on Regulatory and Systems Genomics, with DREAM Challenges
8th November 2013
SC1A: Network inference from experimental data
SC1A: Scoring
• No definitive “gold standard” causal networks
• Use a novel held-out validation approach, emphasizing the causal aspect of the challenge
[Figure: the N treatments are split into training data (4 treatments: DMSO, AKTi, AKTi+MEKi, FGFR1/3i) and held-out test data (N−4 treatments: Test1, Test2, …, Test(N−4)).]
Participants infer 32 networks using the training data; the inferred networks are then assessed using the test data.
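A minimal sketch of this held-out split, assuming treatments are identified by name (only the four training treatments are named on the slide; the test treatments are simply the remaining N−4):

```python
# Sketch of the SC1A held-out validation split (assumed interface).
TRAINING_TREATMENTS = {"DMSO", "AKTi", "AKTi+MEKi", "FGFR1/3i"}

def split_treatments(all_treatments):
    """Split the N treatments into 4 training and N-4 held-out test treatments."""
    train = [t for t in all_treatments if t in TRAINING_TREATMENTS]
    test = [t for t in all_treatments if t not in TRAINING_TREATMENTS]
    return train, test
```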
SC1A: Scoring metric
Assessment: How well do inferred causal networks agree with the effects observed under inhibition in the test data?
Step 1: Identify a “gold standard” set of observed effects: for each phosphoprotein and each cell line/stimulus regime, a paired t-test compares the DMSO and test-inhibitor time courses.
[Figure: example for UACC812/Serum under Test1. Phospho1 (a.u.) differs clearly between DMSO and Test1 over time (p-value = 3.2×10⁻⁵), while Phospho2 does not (p-value = 0.45). Thresholding the p-values across phosphoproteins yields a binary “gold standard” vector of observed effects, e.g. 0 1 1 0 1 0 0 1 0 0.]
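A minimal sketch of Step 1, assuming matched DMSO and inhibitor time courses per phosphoprotein (the array layout and the 0.05 threshold are assumptions, not the challenge's exact pipeline):

```python
import numpy as np
from scipy.stats import ttest_rel

def gold_standard_effects(dmso, inhibitor, alpha=0.05):
    """Binary "gold standard" vector of observed effects for one
    cell line/stimulus regime and one test inhibitor.

    dmso, inhibitor: arrays of shape (n_phosphoproteins, n_timepoints)
    holding matched time courses. Returns 1 where the paired t-test
    finds a significant DMSO-vs-inhibitor difference.
    """
    effects = np.zeros(dmso.shape[0], dtype=int)
    for p in range(dmso.shape[0]):
        _, pval = ttest_rel(dmso[p], inhibitor[p])
        effects[p] = int(pval < alpha)
    return effects
```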
Step 2: Score submissions: compare the descendants of the test inhibitor's target in the inferred network with the “gold standard” list of observed effects in the held-out data, giving true-positive and false-positive counts #TP(τ) and #FP(τ) at each edge-score threshold τ.
SC1A: Scoring metric (continued)
Each submission provides a matrix of predicted edge scores for each cell line/stimulus regime; thresholding at τ gives a binary network:
$$\begin{pmatrix} 0.67 & \cdots & 0.43 \\ \vdots & \ddots & \vdots \\ 0.58 & \cdots & 0.87 \end{pmatrix} \xrightarrow{\;\text{threshold }\tau\;} \begin{pmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{pmatrix}$$
From the thresholded network, obtain the protein descendants downstream of the test inhibitor's target and compare them with the “gold standard” vector (e.g. 1 0 1 0 1 0 1 1 0 0 for Test1), giving #TP and #FP. Varying the threshold τ traces out a ROC curve, summarized by an AUROC score.
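A sketch of this threshold-and-compare computation; networkx and scikit-learn are convenience choices here, not mandated by the challenge, and the tie handling is an assumption:

```python
import networkx as nx
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_for_regime(edge_scores, target, gold_standard):
    """AUROC for one cell line/stimulus regime and one test inhibitor.

    edge_scores: (n, n) matrix of predicted edge scores.
    target: index of the test inhibitor's target protein.
    gold_standard: binary vector of observed effects (from the paired t-tests).
    """
    n = edge_scores.shape[0]
    # Score each protein by the highest threshold tau at which it is
    # still a descendant of the target in the thresholded network.
    protein_scores = np.zeros(n)
    for tau in np.unique(edge_scores)[::-1]:
        adj = (edge_scores >= tau).astype(int)
        g = nx.from_numpy_array(adj, create_using=nx.DiGraph)
        for d in nx.descendants(g, target):
            if protein_scores[d] == 0.0:
                protein_scores[d] = tau
    keep = np.arange(n) != target  # the target itself is not scored
    return roc_auc_score(gold_standard[keep], protein_scores[keep])
```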
SC1A: AUROC scores & nulls
• 74 final submissions
• Each submission has 32 AUROC scores (one for each cell line/stimulus regime)
[Figure: AUROC scores across submissions, marking non-significant AUROCs, significant AUROCs, and the best performer; annotated p-values include 3.58×10⁻⁶, 8.98×10⁻⁶, 9.19×10⁻⁴, and 4.18×10⁻⁶.]
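The slides do not spell out how the null distribution is constructed; one standard option, shown purely as an assumed illustration, is a permutation null obtained by shuffling the gold-standard labels:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_pvalue(gold_standard, protein_scores, n_perm=10_000, seed=0):
    """Empirical p-value of an observed AUROC against a label-permutation
    null. An assumed construction for illustration; the challenge's actual
    null model may differ."""
    rng = np.random.default_rng(seed)
    observed = roc_auc_score(gold_standard, protein_scores)
    null = np.array([
        roc_auc_score(rng.permutation(gold_standard), protein_scores)
        for _ in range(n_perm)
    ])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```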
Scoring procedure:
1. For each submission and each cell line/stimulus pair, compute AUROC score
2. Submissions ranked for each cell line/stimulus pair
3. Mean rank across cell line/stimulus pairs calculated for each submission
4. Rank submissions according to mean rank
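Steps 1 to 4 can be written compactly, assuming an AUROC matrix of shape (submissions × regimes); scipy's rankdata handles ties:

```python
import numpy as np
from scipy.stats import rankdata

def final_ranking(auroc):
    """auroc: (n_submissions, n_regimes) array of AUROC scores.
    Higher AUROC is better, so rank 1 goes to the highest score."""
    per_regime_ranks = np.apply_along_axis(lambda col: rankdata(-col), 0, auroc)
    mean_rank = per_regime_ranks.mean(axis=1)   # step 3: mean rank per submission
    return rankdata(mean_rank)                  # step 4: lower mean rank is better
```

Applied to the worked example on the next slide, this yields final ranks 3, 2, 1, 4.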
SC1A: Final ranking
Worked example (4 submissions, showing 3 of the 32 cell line/stimulus pairs):

              AUROC scores      AUROC ranks   mean rank   final rank
Submission 1: 0.5  0.5  0.7  →  4  3  2    →  3.00     →  3
Submission 2: 0.7  0.8  0.6  →  2  1  3    →  2.00     →  2
Submission 3: 0.9  0.7  0.8  →  1  2  1    →  1.33     →  1
Submission 4: 0.6  0.4  0.5  →  3  4  4    →  3.66     →  4
SC1A: Robustness analysis
• Verify that the final ranking is robust
Procedure:
1. Mask 50% of phosphoproteins in each AUROC calculation
2. Re-calculate the final ranking
3. Repeat (1) and (2) 100 times
[Figure: rank distributions for the top 10 teams across the 100 re-calculated rankings; annotated p-value 5.40×10⁻¹⁰.]
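A sketch of the masking loop; only the 50%/100-repeat scheme comes from the slide, and the scoring callback interface is hypothetical:

```python
import numpy as np

def robustness_ranks(auroc_fn, n_phospho, n_repeats=100, seed=0):
    """Re-compute the final ranking with half the phosphoproteins masked.

    auroc_fn: hypothetical callable taking a boolean keep-mask over
    phosphoproteins and returning the (n_submissions, n_regimes) AUROC
    matrix computed on the kept phosphoproteins only.
    Returns one final ranking per repeat (rows).
    """
    rng = np.random.default_rng(seed)
    rankings = []
    for _ in range(n_repeats):
        keep = rng.permutation(n_phospho) < n_phospho // 2  # random 50% mask
        rankings.append(final_ranking(auroc_fn(keep)))      # final_ranking: see sketch above
    return np.vstack(rankings)
```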
SC1B: Network inference from in silico data
• Gold standard available: the data-generating causal network
• Participants submitted a single set of edge scores
• Edge scores compared against the gold standard → AUROC score
• Participants ranked based on AUROC score
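With a true network available, scoring reduces to a standard ROC analysis of edge scores against gold-standard edges; a minimal sketch (the matrix layout and exclusion of self-loops are assumptions):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def sc1b_auroc(edge_scores, gold_adjacency):
    """AUROC of predicted edge scores against the data-generating network.

    edge_scores, gold_adjacency: (n, n) arrays; off-diagonal entries only.
    """
    off_diag = ~np.eye(edge_scores.shape[0], dtype=bool)
    return roc_auc_score(gold_adjacency[off_diag], edge_scores[off_diag])
```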
SC1B: Scoring, AUROCs, null & robustness
[Figure: AUROC scores across submissions: 51 non-significant AUROCs, 14 significant AUROCs; best performer, p = 3.11×10⁻¹¹.]
Robustness analysis:
1. Mask 50% of edges in the calculation of the AUROC
2. Re-calculate the final ranking
3. Repeat (1) and (2) 100 times
[Figure: rank distributions for the top 10 teams across the 100 re-calculations; annotated p-value 3.90×10⁻¹⁴.]
Combined score for SC1A and SC1B
• 59 teams participated in both SC1A and SC1B
• Rewards consistently good performance across both parts of SC1
• Combined score: average of SC1A rank and SC1B rank
• Top team ranked robustly first
SC2A: Timecourse prediction from experimental data
SC2A: Scoring
[Figure: as in SC1A, the N treatments are split into training data (4 treatments: DMSO, AKTi, AKTi+MEKi, FGFR1/3i) and held-out test data (N−4 treatments: Test1, Test2, …, Test(N−4)).]
Participants build dynamical models using the training data and make predictions for phosphoprotein trajectories under inhibitions not in the training data; predictions are assessed using the test data.
SC2A: Scoring metric
• Participants made predictions for all phosphoproteins, for each cell line/stimulus pair, under inhibition with each of 5 test inhibitors
• Assessment: How well do predicted trajectories agree with the corresponding trajectories in the test data?
• Scoring metric: root-mean-squared error (RMSE), calculated for each cell line/phosphoprotein/test inhibitor combination
$$\mathrm{RMSE}_{p,c,i} = \sqrt{\frac{1}{TS}\sum_{t=1}^{T}\sum_{s=1}^{S}\left(x_{p,c,i,s,t} - \hat{x}_{p,c,i,s,t}\right)^{2}}$$
where $x$ is the observed and $\hat{x}$ the predicted level for phosphoprotein $p$, cell line $c$, test inhibitor $i$, stimulus $s$, and time point $t$ (e.g. UACC812, Phospho1, Test1).
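A direct NumPy transcription of the formula (the array layout is an assumption):

```python
import numpy as np

def rmse(observed, predicted):
    """RMSE for one cell line/phosphoprotein/test inhibitor combination.

    observed, predicted: (n_stimuli, n_timepoints) arrays of phosphoprotein
    levels, e.g. UACC812 / Phospho1 / Test1.
    """
    return np.sqrt(np.mean((observed - predicted) ** 2))
```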
SC2A: RMSE scores, nulls & final ranking
• 14 final submissions
[Figure: RMSE scores across submissions, marking non-significant scores, significant scores, and the best performer; annotated p-values include 1.35×10⁻⁴, 3.70×10⁻⁸, 1.21×10⁻⁶, and 1.49×10⁻⁵.]
Final ranking: analogously to SC1A, submissions are ranked for each regime and the mean rank is calculated.
SC2A: Robustness analysis
• Verify that the final ranking is robust
Procedure:
1. Mask 50% of data points in each RMSE calculation
2. Re-calculate the final ranking
3. Repeat (1) and (2) 100 times
[Figure: rank distributions for the top 10 teams across the 100 re-calculations; the 2 best performers are marked, and one submission is incomplete; annotated p-values 0.99, 3.04×10⁻¹⁸, and 6.97×10⁻⁵.]
SC2B: Timecourse prediction from in silico data
• Participants made predictions for all phosphoproteins for each stimulus regime, under inhibition of each phosphoprotein in turn
• Scoring metric is RMSE; the procedure follows that of SC2A
SC2B: Scoring metric, nulls & robustness
$$\mathrm{RMSE}_{p,i} = \sqrt{\frac{1}{TS}\sum_{t=1}^{T}\sum_{s=1}^{S}\left(x_{p,i,s,t} - \hat{x}_{p,i,s,t}\right)^{2}}$$
where $x$ is the observed and $\hat{x}$ the predicted level for phosphoprotein $p$, inhibited protein $i$, stimulus $s$, and time point $t$.
[Figure: RMSE scores across submissions, marking non-significant scores, significant scores, and the best performer; annotated p-values include 1.0, 1.68×10⁻¹⁴, 2.89×10⁻⁷, and 0.015.]
Robustness analysis:
1. Mask 50% of data points in each RMSE calculation
2. Re-calculate the final ranking
3. Repeat (1) and (2) 100 times
[Figure: rank distributions for the top 10 teams across the 100 re-calculations; one submission is incomplete; annotated p-values 7.71×10⁻¹⁹ and 0.99.]
Combined score for SC2A and SC2B
• 10 teams participated in both SC2A and SC2B
• Rewards consistently good performance across both parts of SC2
• Combined score: average of SC2A rank and SC2B rank
• Top team ranked robustly first
SC3: Visualization
SC3: Scoring and results
• 14 submissions
• 36 HPN-DREAM participants voted, assigning ranks 1 to 3
• Final score = mean rank (unranked submissions assigned rank 4)
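A small sketch of this vote aggregation (the ballot data structure is assumed): each voter contributes ranks 1 to 3 for their top three submissions, and every other submission counts as rank 4.

```python
import numpy as np

def sc3_scores(ballots, n_submissions):
    """ballots: one dict per voter mapping submission index -> rank (1-3).
    Unranked submissions receive rank 4. Returns the mean rank per
    submission (lower is better)."""
    ranks = np.full((len(ballots), n_submissions), 4.0)
    for v, ballot in enumerate(ballots):
        for submission, rank in ballot.items():
            ranks[v, submission] = rank
    return ranks.mean(axis=0)
```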
Conclusions and Observations
• Submissions rigorously assessed using held-out test data; for SC1A, a novel procedure was used to assess network inference performance in a setting with no true “gold standard”
• Many statistically significant predictions submitted
For further investigation:
• Explore why some regimes (e.g. cell line/stimulus pairs) are easier to predict than others
• Determine why different teams performed well in the experimental and in silico challenges
• Identify the methods/approaches that yield the best predictions
• Wisdom of crowds: does aggregating submissions improve performance and lead to the discovery of biological insights?