The sbv IMPROVER species translation challenge Sometimes you can trust a rat Sahand Hormoz Adel...

The sbv IMPROVER species translation challenge

Sometimes you can trust a rat

Sahand Hormoz, Adel Dayarian (KITP, UC Santa Barbara)

Gyan Bhanot (Rutgers University)

Michael Biehl (University of Groningen, Johann Bernoulli Institute)

www.cs.rug.nl/biehl


Winning the rat race

sbv IMPROVER species translation challenge

systems biology verification combined with industrial methodology for process verification in research

IBM Research, Yorktown Heights

Philip Morris International Research and Development

www.sbvimprover.com


protein phosphorylation

reversible protein phosphorylation

addition or removal of a phosphate group

alters shape and function of proteins


protein phosphorylation

chemical stimuli

gene expression




chemical stimuli

phosphorylation status (measured)

gene expression (Δ measured)

complex network (incomplete snapshot)


challenge data

• normal bronchial epithelial cells, derived from human and rat

• 52 different chemical stimuli (26 in set A + 26 in set B), plus additional controls

• phosphorylation status after 5 minutes and 25 minutes

• gene expression after 6 hours

phosphorylation data:

• rather low noise levels

• subtract control, median of replicates

• challenge organizers' definition of activation: abs(P) > 3 at 5 min. or at 25 min.

• ~ 10% positive examples

gene expression data:

• noisy data (microarray)

• correct for saturation effects

• N = 20110 genes (human), N = 13841 genes (rat)



challenge set-up and goals

1) intra-species prediction of phosphorylation from gene expression

2) predict the response in human using data available for rat cells

3) predict gene expression response across species


intra-species phosphorylation prediction

sub-challenge 1

combination of two approaches:

• voter method: gene selection based on mutual information

• machine learning analysis: Principal Components representation + Linear Discriminant Analysis

• weighted combination based on Leave-One-Out cross validation


voter method

binarize data by thresholding

gene expression: G=1 if p < 0.01 (p-value for differential expression)

phosphorylation : P=1 if abs(P) > 3 (@5min. or @25 min.)

for all pairs of genes and proteins:

calculate separate and joint entropies H(G), H(P), H(G,P) using frequencies over stimuli

mutual information I(G;P) = H(G) + H(P) - H(G,P)

assumption: high I indicates that a gene is predictive for the corresponding protein status
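The entropy and mutual information computation for one gene/protein pair can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy 0/1 vectors are hypothetical, and the identity I(G;P) = H(G) + H(P) - H(G,P) is the standard definition in terms of separate and joint entropies.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence, from relative frequencies."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def mutual_information(g, p):
    """I(G;P) = H(G) + H(P) - H(G,P) for two equally long binary sequences."""
    if len(g) != len(p):
        raise ValueError("g and p must cover the same stimuli")
    joint = list(zip(g, p))  # one (G, P) pair per stimulus
    return entropy(g) + entropy(p) - entropy(joint)


# Hypothetical toy data over 8 stimuli (not challenge data).
gene = [1, 1, 0, 0, 1, 0, 0, 0]       # G = 1 if p-value < 0.01
protein = [1, 1, 0, 0, 1, 0, 0, 0]    # P = 1 if abs(P) > 3
unrelated = [1, 0, 1, 0, 1, 0, 1, 0]  # a gene that barely tracks the protein

mi_matching = mutual_information(gene, protein)
mi_unrelated = mutual_information(unrelated, protein)
```

In this toy case the matching gene reaches the maximum possible value, I = H(P), while the second gene scores close to zero, which is the ranking the voter method relies on.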


example:

SYNPR level predictive of AKT1 activation

green = significant phosphorylation, red = significant gene expression

SYNPR under-expressed, AKT1 phosphorylated

voter method

for each protein:

- determine a set of most predictive genes (varying number ~ 30-70)

- vote according to the presence of significant gene expressions

relative frequency of positive votes determines certainty score in [0,1]

Leave-One-Out (L-1-O) validation:

consider mutual information only over 25 stimuli, predict the 26th

performance estimate with respect to predicting novel data
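The voting step can be sketched as below, assuming each selected gene casts one positive vote when its expression is significant for the stimulus in question. The function names, gene labels and mutual information values are hypothetical placeholders, and the slides do not state the exact selection rule, so picking the top n genes by mutual information is an assumption.

```python
def select_top_genes(mi_by_gene, n):
    """Names of the n genes with the highest mutual information (assumed rule)."""
    return sorted(mi_by_gene, key=mi_by_gene.get, reverse=True)[:n]


def vote_certainty(selected_genes, expression_flags):
    """Relative frequency of positive votes among the selected genes.

    expression_flags maps gene name -> 1 if significantly expressed, else 0.
    """
    if not selected_genes:
        raise ValueError("need at least one selected gene")
    votes = [expression_flags.get(g, 0) for g in selected_genes]
    return sum(votes) / len(votes)


# Hypothetical mutual information values for one protein.
mi_by_gene = {"gene_a": 0.40, "gene_b": 0.05, "gene_c": 0.31, "gene_d": 0.22}
voters = select_top_genes(mi_by_gene, n=3)

# Hypothetical binarized expression of a new stimulus.
new_stimulus = {"gene_a": 1, "gene_b": 1, "gene_c": 0, "gene_d": 1}
certainty = vote_certainty(voters, new_stimulus)
```

In a Leave-One-Out run, `mi_by_gene` would be recomputed from 25 stimuli and `new_stimulus` would be the held-out 26th.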


voter method prediction

[figure: certainties, proteins 1 ... 16 vs. stimuli 27 ... 52]

• voting schemes obtained from examples in A, applied to the 26 new stimuli of data set B: 416 predictions w.r.t. data set B

• certainties in [0,1], on average over the 26 L-1-O runs


machine learning approach

low-dimensional representation of gene expression data

• omit all genes with zero variation or only insignificant (p>0.05) expression values over all 26 training stimuli (13841 -> 6033 genes)

• Principal Component Analysis (PCA) (pcascat, www.mloss.org, c/o Marc Strickert)

- error free representation of all data possible by max. 52 PCs

- here: use k ≤ 22 leading PCs only (remove small variations due to noise)

• Linear Discriminant Analysis (LDA) (Matlab, Statistics: classify)

- identifies discriminative directions in k-dim. space based on within-class and between-class variation

- probabilistic output provided, interpreted as certainty score

- if all training examples negative, score 0 is assigned
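The PCA + LDA pipeline above can be sketched with NumPy. This is an illustrative stand-in, not the pcascat or Matlab `classify` code used by the authors: the data are synthetic, the function names are hypothetical, and the small ridge term and the all-positive fallback are assumptions added to keep the sketch numerically safe.

```python
import numpy as np


def pca_project(X, k):
    """Project the rows of X onto the k leading principal components."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:k]
    return (X - mean) @ components.T, mean, components


def lda_certainty(Z, y, z_new):
    """Two-class LDA posterior P(y=1 | z_new) with a pooled covariance."""
    if y.sum() == 0:
        return 0.0  # all training examples negative: score 0
    if y.sum() == len(y):
        return 1.0  # all positive: assumption, not stated in the slides
    Z0, Z1 = Z[y == 0], Z[y == 1]
    m0, m1 = Z0.mean(axis=0), Z1.mean(axis=0)
    # Pooled within-class scatter; the ridge keeps it invertible (assumption).
    S = (Z0 - m0).T @ (Z0 - m0) + (Z1 - m1).T @ (Z1 - m1)
    S = S / max(len(Z) - 2, 1) + 1e-6 * np.eye(Z.shape[1])
    w = np.linalg.solve(S, m1 - m0)  # discriminative direction
    b = -0.5 * w @ (m0 + m1) + np.log(len(Z1) / len(Z0))
    return float(1.0 / (1.0 + np.exp(-(w @ z_new + b))))


# Synthetic stand-in: 26 stimuli, 50 "genes", 6 positive examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(26, 50))
y = np.array([1] * 6 + [0] * 20)
X[y == 1, :5] += 3.0  # positives differ in the first five features

Z, mean, comps = pca_project(X, k=5)
score_pos = lda_certainty(Z, y, (X[0] - mean) @ comps.T)
```

New stimuli are projected with the same `mean` and `comps` before being scored, so the low-dimensional representation is fixed by the training data.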



• Leave-One-Out procedure with varying number k of PC projections

for each of the 16 target proteins, for k = 1, ..., 22:

- repeat 26 times: LDA based on 25 stimuli, predict the 26th

yields probabilistic prediction 0 ≤ c(k) ≤ 1 (crisp threshold 0.5)

- compute Matthews Correlation Coefficient (0 ≤ mcc ≤ 1)

- determine the number of false positives (fp), true positives (tp), false negatives (fn), true negatives (tn)
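Counting tp, fp, fn, tn at the crisp threshold 0.5 and turning them into a Matthews Correlation Coefficient can be sketched as follows. The standard coefficient lies in [-1, 1]; clipping negative values to 0 is an assumed reading of the range 0 ≤ mcc ≤ 1 quoted above, and the toy labels and certainties are hypothetical.

```python
from math import sqrt


def confusion_counts(y_true, certainties, threshold=0.5):
    """Return (tp, fp, fn, tn) after thresholding the certainties."""
    y_pred = [1 if c > threshold else 0 for c in certainties]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; 0 if a marginal count is empty."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def clipped_mcc(tp, fp, fn, tn):
    """MCC with negative values set to 0 (assumption, see lead-in)."""
    return max(0.0, mcc(tp, fp, fn, tn))


# Hypothetical Leave-One-Out results for one protein and one value of k.
y_true = [1, 1, 0, 0, 0, 0]
certainties = [0.9, 0.4, 0.2, 0.6, 0.1, 0.3]
counts = confusion_counts(y_true, certainties)
score = clipped_mcc(*counts)
```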



• perform protein-specific weighted average to obtain certainties

• prediction: apply to test set (B) (binarized)

[figure: two panels, proteins vs. stimuli 27 ... 52]



• for fair comparison with voter method: Nested Leave-One-Out procedure

for each protein, repeat 26 times:

L-1-O using 24 out of 25 stimuli, varying k

mcc-weighted prediction for the 26th stimulus

• averaged certainties as weighted means (unweighted mean if both mcc=0)
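The weighted mean with its fallback can be sketched as below, assuming the weights are the Leave-One-Out mcc values of the individual predictors. The function name and the example numbers are hypothetical.

```python
def weighted_certainty(certainties, weights):
    """Weighted mean of certainty scores; plain mean if all weights are 0."""
    if not certainties or len(certainties) != len(weights):
        raise ValueError("need one non-negative weight per certainty")
    total = sum(weights)
    if total == 0:
        # e.g. both mcc values are 0: fall back to the unweighted mean
        return sum(certainties) / len(certainties)
    return sum(w * c for w, c in zip(weights, certainties)) / total


# Hypothetical values for one protein and one stimulus.
voter_c, voter_mcc = 0.8, 0.6
lda_c, lda_mcc = 0.4, 0.2

combined = weighted_certainty([voter_c, lda_c], [voter_mcc, lda_mcc])
fallback = weighted_certainty([voter_c, lda_c], [0.0, 0.0])
```

With non-negative weights the combined value always stays between the individual certainties, so it remains a valid score in [0,1].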


combined prediction

[figure: combined prediction, proteins 1 ... 16 vs. stimuli 27 ... 52]


111


LDA 0.34 0.71 0.67 2

voting 0.40 0.67 0.65 2


combination improved the performance!


inter-species phosphorylation prediction

sub-challenge 2


sub-challenge 2 set-up

restrict ourselves to the use of phosphorylation data only

reasoning: immediate response to stimuli should be comparable between species


data

[figure: data layout, proteins 1 ... 16 vs. stimuli 1 ... 52]

stimuli 1 ... 26 (known): rat data set A (ratP), human data set A (humP)

stimuli 27 ... 52 (prediction): rat data set B (ratP), human data set B (| humP | > 3 ?)


naïve prediction

assume similar activation in both species: “human ≈ rat”

prediction score, corresponding to threshold 3 for activation

- precise (monotonic!) form is irrelevant for ROC, PR etc.

- threshold 0.5 for crisp classification

- here: scaling factor yields values well-spread in [0,1]
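One monotonic score with these properties is a logistic function of | ratP | centred at the activation threshold 3. The sketch below assumes that form purely for illustration: the function name `naive_score` and the value of `scale` are hypothetical and not taken from the slides.

```python
from math import exp


def naive_score(rat_p, threshold=3.0, scale=1.0):
    """Map a rat phosphorylation value to a certainty in [0,1].

    The score is 0.5 exactly at abs(rat_p) == threshold and grows
    monotonically with abs(rat_p); `scale` controls the spread.
    """
    return 1.0 / (1.0 + exp(-(abs(rat_p) - threshold) / scale))


# Hypothetical rat phosphorylation values for one protein.
rat_values = [0.2, -1.5, 3.0, -4.2, 7.9]
scores = [naive_score(v) for v in rat_values]
crisp = [1 if s > 0.5 else 0 for s in scores]  # predicted human activation
```

Any other strictly increasing function of | ratP | that passes through 0.5 at the threshold would give the same crisp labels and the same ranking of predictions.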


naïve prediction

[figure: ROC curve, sensitivity vs. 1-specificity]

AUC ≈ 0.83 with respect to the full panel (416 predictions) of | humP | > 3
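The area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so it can be computed directly from pairwise comparisons. The following is a minimal sketch with hypothetical labels and scores; it also illustrates why only the ranking of the scores matters.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical activation labels and certainty scores.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]

auc = roc_auc(labels, scores)
# A strictly increasing transformation leaves the ranking, and the AUC, unchanged.
auc_squared = roc_auc(labels, [s ** 2 for s in scores])
```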


naïve prediction

[figure: color-coded certainty for | humP | > 3 in data set B, proteins 1 ... 16 vs. stimuli 27 ... 52]



[figure: data layout, proteins 1 ... 16 vs. stimuli 1 ... 52]

stimuli 1 ... 26 (training): rat data set A (ratP), human data set A (| humP | > 3 ?)

stimuli 27 ... 52 (prediction): rat data set B (ratP), human data set B (| humP | > 3 ?)

16-dim. vectors, 16 separate binary classification problems


LVQ prediction

LVQ1, one prototype per class

Nearest prototype classification:

here: 16-dim. data
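LVQ1 with one prototype per class can be sketched as follows: the closest prototype is moved towards a training example if their labels agree and away from it otherwise, and classification assigns the label of the nearest prototype. The learning rate, number of epochs, class-mean initialisation and the 2-dim. toy data are assumptions for illustration (the challenge data are 16-dim.).

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_lvq1(X, y, epochs=30, lr=0.05, seed=0):
    """LVQ1 with one prototype per class (labels 0 and 1)."""
    rng = random.Random(seed)
    protos = {}
    for label in (0, 1):
        members = [x for x, t in zip(X, y) if t == label]
        if not members:
            raise ValueError("both classes must be present in the training set")
        # Initialise each prototype at its class mean (assumption).
        protos[label] = [sum(col) / len(members) for col in zip(*members)]
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            winner = min(protos, key=lambda c: squared_distance(X[i], protos[c]))
            sign = 1.0 if winner == y[i] else -1.0  # attract or repel
            protos[winner] = [
                w + sign * lr * (x - w) for w, x in zip(protos[winner], X[i])
            ]
    return protos


def classify(protos, x):
    """Nearest prototype classification."""
    return min(protos, key=lambda c: squared_distance(x, protos[c]))


# Hypothetical, well separated toy data.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]
y = [0, 0, 0, 1, 1, 1]
prototypes = train_lvq1(X, y)
```

For example, `classify(prototypes, [0.1, 0.1])` returns 0 and `classify(prototypes, [3.0, 3.0])` returns 1 on this toy set.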


LVQ prediction

prediction score / certainty for activation

- precise (monotonic!) form is irrelevant for ROC, PR etc.

- crisp classification for threshold 0.5

- here: scaling factor yields range of values similar to naïve prediction

validation: 26 Leave-One-Out training processes:

split data set A in 25 training / 1 test sample

(if training set is all negative: accept naïve prediction)

prediction: ensemble average of certainties over the 26 LVQ systems


[figure: ROC curves (sensitivity vs. 1-specificity) for | humP | > 3, obtained in the Leave-One-Out validation scheme]

LVQ prediction: AUC ≈ 0.88

naïve prediction: AUC ≈ 0.83


combined prediction: weighted average according to protein-specific performance (AUROC)

[figure: color-coded certainty for | humP | > 3 in data set B, proteins 1 ... 16 vs. stimuli 27 ... 52]


naïve (rat) 0.45 0.74 0.79 1

LVQ 0.37 0.69 0.76 3

naïve scheme: best individual prediction

• L-1-O result (LVQ ahead of naïve) not confirmed in the test set

combination improves performance!

confirmed in “wisdom of the crowd” analysis


inter-species pathway perturbation prediction

sub-challenge 3


additional data / domain knowledge

1) mapping of rat genes to human orthologs

HGNC Comparison of Ortholog Predictions, HCOP

www.genenames.org/cgi-bin/hcop.pl

2) annotation of gene sets representing known pathways and function

246 gene sets from the C2CP collection (Broad Institute)

www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP

3) gene set enrichment analysis

www.broadinstitute.org/gsea/index.jsp

NES: normalized enrichment scores, representing expression

FDR: false discovery rate, i.e. statistical significance

threshold: FDR < 0.25


rat vs. human

[figure: gene sets vs. stimuli (set A), FDR < 0.25]

frequent observation: negative correlations between significant rat and human gene sets

biology? data (pre-)processing?


training

training data: 26 stimuli in rat data set A, 246-dim. vectors of rat NES

targets: binarized human FDR (<0.25?), 246 classification problems

• PCA: dimension and noise reduction, rat gene set data A and B represented by k (≤52) projections

• LDA: linear classifier using k projections as features (probabilistic output)

• Leave-One-Out validation: determine optimal k from data set A

• use k=8 to make predictions for data set B (averaged over 26 L-1-O runs)



human gene set prediction

[figure: final prediction, certainties for significant gene sets, gene sets vs. stimuli 27 ... 52]


summary

sc-1: intra-species prediction of phosphorylation

gene expression is predictive for phosphorylation status

sc-2: inter-species prediction of phosphorylation

rat phosphorylation is predictive for human cell response

sc-3: inter-species prediction of gene sets

weakly predictive; presence of negative correlations between rat and human genes and gene sets


outlook

• more sophisticated learning schemes / classifiers, e.g. feature weighting schemes, Matrix Relevance LVQ

• ‘joint’ predictions of protein or gene set tableaus, e.g. predict 1 protein from 16 + 15 values in set A, two-step procedure for set B

• include gene expression in sub-challenge 2

• investigate difficult to predict proteins / gene sets

• infer and enhance network models from experimental data

on-going, new challenge (runs until February 2014): Network Verification Challenge (NVC), www.sbvimprover.com


take home messages

• team work works (and skype is great)

• in case of doubt: PCA

• the smaller the data set, the simpler the method

• committees can be useful!

• if you have won the rat race, you might be a rat
