The sbv IMPROVER species translation challenge
Sometimes you can trust a rat
Sahand Hormoz, Adel Dayarian (KITP, UC Santa Barbara)
Gyan Bhanot (Rutgers Univ.)
Michael Biehl (University of Groningen, Johann Bernoulli Institute)
www.cs.rug.nl/biehl
Winning the rat race 2
sbv IMPROVER species translation challenge
systems biology verification combined with industrial methodology for process verification in research
IBM Research, Yorktown Heights
Philip Morris International Research and Development
www.sbvimprover.com
protein phosphorylation
reversible protein phosphorylation
addition or removal of a phosphate group
alters shape and function of proteins
chemical stimuli
phosphorylation status (measured)
gene expression (Δ measured)
complex network (incomplete snapshot)
challenge data
• normal bronchial epithelial cells, derived from human and rat
• 52 different chemical stimuli (26 (A) + 26 (B)), additional controls
• phosphorylation status after 5 minutes and 25 minutes
• gene expression after 6 hours
phosphorylation:
• rather low noise levels
• subtract control, median of replicates
• challenge organizers: activation defined as abs(P) > 3 @5min. or @25min.
• ~ 10% positive examples
gene expression:
• noisy data (microarray)
• correct for saturation effects
• N = 20110 genes (human), N = 13841 genes (rat)
challenge set-up and goals
1 intra-species prediction of phosphorylation from gene expression
2 predict the response in human using data available for rat cells
3 predict gene expression response across species
intra-species phosphorylation prediction
sub-challenge 1
combination of two approaches:
• voter method: gene selection based on mutual information
• machine learning analysis: Principal Components representation + Linear Discriminant Analysis
• weighted combination based on Leave-One-Out cross validation
voter method
binarize data by thresholding
gene expression: G=1 if p < 0.01 (p-value for differential expression)
phosphorylation: P=1 if abs(P) > 3 (@5min. or @25min.)
for all pairs of genes and proteins:
calculate separate and joint entropies H(G), H(P), H(G,P)
using frequencies over stimuli
mutual information: I(G;P) = H(G) + H(P) − H(G,P)
assumption: high I indicates that a gene is predictive for the corresponding protein status
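The entropy-based gene/protein score described above could be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy vectors are hypothetical, and the entropies are estimated from plain frequencies over stimuli.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given by raw counts."""
    p = np.asarray(counts, dtype=float).ravel()
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(g, p):
    """I(G;P) = H(G) + H(P) - H(G,P) for two binary vectors over stimuli."""
    g = np.asarray(g, dtype=int)
    p = np.asarray(p, dtype=int)
    joint = np.zeros((2, 2))
    np.add.at(joint, (g, p), 1)  # 2x2 joint frequency table
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

# Toy example over 8 stimuli (hypothetical values, not challenge data).
gene = [1, 1, 0, 0, 1, 1, 0, 0]
prot_same = gene                        # protein status identical to gene status
prot_indep = [1, 0, 1, 0, 1, 0, 1, 0]   # statistically independent of gene
mi_same = mutual_information(gene, prot_same)
mi_indep = mutual_information(gene, prot_indep)
```

In this toy case the identical pair reaches the maximum of 1 bit, while the independent pair gives 0 bits.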
example: SYNPR level predictive of AKT1 activation
green = significant phosphorylation, red = significant gene expression
SYNPR under-expressed, AKT1 phosphorylated
for each protein:
- determine a set of most predictive genes (varying number ~ 30-70)
- vote according to the presence of significant gene expressions
relative frequency of positive votes determines certainty score in [0,1]
Leave-One-Out (L-1-O) validation:
consider mutual information only over 25 stimuli, predict the 26th
performance estimate with respect to predicting novel data
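A rough sketch of the voting and Leave-One-Out steps, under simplifying assumptions that are not taken from the slides: a gene casts a positive vote whenever it is significantly expressed (the direction of the association is ignored), only 3 genes are selected instead of ~30-70, and all names and data are hypothetical.

```python
import numpy as np

def mutual_info(g, p):
    """Mutual information (bits) between two binary vectors."""
    joint = np.zeros((2, 2))
    np.add.at(joint, (g, p), 1)
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / outer[nz])).sum())

def voter_certainty(G_train, p_train, g_new, n_genes):
    """Relative frequency of positive votes among the most informative genes."""
    mi = np.array([mutual_info(G_train[:, j], p_train)
                   for j in range(G_train.shape[1])])
    top = np.argsort(mi)[::-1][:n_genes]  # genes with highest mutual information
    return float(g_new[top].mean())       # certainty score in [0, 1]

def leave_one_out(G, p, n_genes):
    """Select genes on all stimuli but one, then predict the held-out stimulus."""
    n = len(p)
    scores = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        scores[i] = voter_certainty(G[keep], p[keep], G[i], n_genes)
    return scores

# Toy data (hypothetical): 26 stimuli, 50 binarized genes,
# genes 0-2 are constructed to copy the protein status exactly.
rng = np.random.default_rng(0)
p = np.array([1, 0] * 13)
G = rng.integers(0, 2, size=(26, 50))
G[:, :3] = p[:, None]
scores = leave_one_out(G, p, n_genes=3)
```

Because gene selection is repeated inside every fold, the held-out stimulus never influences which genes vote on it.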
voter method prediction
[Figure: certainties for proteins 1 … 16 vs. stimuli 27 … 52]
• voting schemes obtained from examples in A, applied to the 26 new stimuli of data set B
  416 predictions w.r.t. data set B
• certainties in [0,1] on average over the 26 L-1-O runs
machine learning approach
low-dimensional representation of gene expression data
• omit all genes with zero variation or only insignificant (p>0.05) expression values over all 26 training stimuli (13841 -> 6033 genes)
• Principal Component Analysis (PCA) (pcascat, www.mloss.org, c/o Marc Strickert)
- error free representation of all data possible by max. 52 PCs
- here: use k ≤ 22 leading PCs only (remove small variations due to noise)
• Linear Discriminant Analysis (LDA) (Matlab, Statistics: classify)
- identifies discriminative directions in k-dim. space
based on within-class and between-class variation
- probabilistic output provided, interpreted as certainty score
- if all training examples negative, score 0 is assigned
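The PCA + LDA pipeline can be sketched in plain NumPy as below. The slides used pcascat and Matlab's classify; this is a stand-in under stated assumptions: two-class LDA with pooled covariance, a small ridge term for numerical stability, and a guard for the all-positive case, none of which are given in the slides. Data and names are hypothetical.

```python
import numpy as np

def pca_fit(X, k):
    """Mean and k leading principal directions of X (stimuli x genes)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]

def lda_fit(Z, y, ridge=1e-6):
    """Two-class LDA with pooled covariance; returns weights w and offset b."""
    Z0, Z1 = Z[y == 0], Z[y == 1]
    m0, m1 = Z0.mean(axis=0), Z1.mean(axis=0)
    pooled = ((Z0 - m0).T @ (Z0 - m0) + (Z1 - m1).T @ (Z1 - m1)) / (len(Z) - 2)
    pooled += ridge * np.eye(Z.shape[1])       # assumption: tiny regularizer
    w = np.linalg.solve(pooled, m1 - m0)
    b = -0.5 * w @ (m0 + m1) + np.log(len(Z1) / len(Z0))
    return w, b

def certainty(X_train, y_train, X_new, k):
    """Posterior probability of activation for each row of X_new."""
    if y_train.sum() == 0:                     # all negative: score 0 (as on the slide)
        return np.zeros(len(X_new))
    if y_train.all():                          # assumption: all positive gives score 1
        return np.ones(len(X_new))
    mean, comps = pca_fit(X_train, k)
    w, b = lda_fit((X_train - mean) @ comps.T, y_train)
    logit = ((X_new - mean) @ comps.T) @ w + b
    return 1.0 / (1.0 + np.exp(-np.clip(logit, -50, 50)))

# Toy data (hypothetical): 26 stimuli, 200 genes, 20 genes shifted in class 1.
rng = np.random.default_rng(1)
y = np.arange(26) % 2
X = rng.normal(size=(26, 200))
X[y == 1, :20] += 3.0
c = certainty(X[:25], y[:25], X[25:], k=5)     # train on 25, score the 26th
c_neg = certainty(X[:25], np.zeros(25, dtype=int), X[25:], k=5)
```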
• Leave-One-Out procedure with varying number k of PC projections
for each of the 16 target proteins for k=1:22
- repeat 26 times: LDA based on 25 stimuli, predict the 26th
yields probabilistic prediction 0 ≤ c(k) ≤ 1 (crisp threshold 0.5)
- compute Matthews Correlation Coefficient (0 ≤ mcc ≤ 1)
- determine the number of false positives (fp), true positives (tp),
false negatives (fn), true negatives (tn)
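The confusion counts and the Matthews Correlation Coefficient for one Leave-One-Out run could be computed as in this sketch. In general the coefficient lies in [-1, 1]; clipping negative values to 0 in order to match the range stated above is an assumption here, as are the function names, the strict ">" at the threshold, and the toy values.

```python
import math

def confusion(y_true, certainties, threshold=0.5):
    """Count tp, fp, fn, tn after crisp thresholding of the certainties."""
    tp = fp = fn = tn = 0
    for y, c in zip(y_true, certainties):
        pred = c > threshold
        if pred and y:
            tp += 1
        elif pred and not y:
            fp += 1
        elif not pred and y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; defined as 0 if a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Toy example (hypothetical values): 6 held-out predictions.
y_true = [1, 1, 0, 0, 0, 0]
cert = [0.9, 0.4, 0.2, 0.6, 0.1, 0.3]
counts = confusion(y_true, cert)
value = mcc(*counts)
weight = max(value, 0.0)  # assumption: negative mcc is clipped to 0
```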
• perform protein-specific weighted average to obtain certainties
• prediction: apply to test set (B) (binarized)
[Figure: two panels, proteins vs. stimuli 27 … 52]
• for fair comparison with voter method:
Nested Leave-One-Out procedure
for each protein, repeat 26 times:
L-1-O using 24 out of 25 stimuli, varying k
mcc-weighted prediction for the 26th stimulus
• averaged certainties as weighted means (unweighted mean if both mcc=0)
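One way to read the weighted mean of the two methods is sketched below. The exact weighting formula is not given in the slides, so using the mcc values directly as weights is an assumption, and the function name is hypothetical.

```python
def combine(c_voter, c_lda, mcc_voter, mcc_lda):
    """mcc-weighted mean of two certainties; unweighted mean if both weights are 0."""
    w_v = max(mcc_voter, 0.0)
    w_l = max(mcc_lda, 0.0)
    if w_v + w_l == 0:
        return 0.5 * (c_voter + c_lda)
    return (w_v * c_voter + w_l * c_lda) / (w_v + w_l)

# Toy values (hypothetical) for one protein and one stimulus.
fallback = combine(0.8, 0.4, 0.0, 0.0)   # both mcc = 0: plain average
weighted = combine(0.8, 0.4, 0.3, 0.1)   # voter trusted three times as much
```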
combined prediction
[Figure: proteins 1 … 16 vs. stimuli 27 … 52]
LDA 0.34 0.71 0.67 2
voting 0.40 0.67 0.65 2
111
combination improved the performance!
inter-species phosphorylation prediction
sub-challenge 2
sub-challenge 2 set-up
restrict ourselves to the use of phosphorylation data only
reasoning: immediate response to stimuli should be comparable between species
data
[Figure: 16 proteins vs. 52 stimuli. Known: ratP for rat data sets A and B (stimuli 1 … 52), humP for human data set A (stimuli 1 … 26). Prediction: | humP | > 3 ? for human data set B (stimuli 27 … 52)]
naïve prediction
assume similar activation in both species: “human ≈ rat”
prediction score, corresponding to threshold 3 for activation
- precise (monotonic!) form is irrelevant for ROC, PR etc.
- threshold 0.5 for crisp classification
- here: scaling factor yields values well-spread in [0,1]
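The exact score function did not survive in the transcript. As an illustration only, any monotonic map that sends |ratP| = 3 to 0.5 fits the description; the logistic form and the scale parameter below are assumptions, not the authors' formula.

```python
import numpy as np

def naive_score(rat_p, threshold=3.0, scale=1.0):
    """Monotonic certainty in [0, 1]; |rat_p| == threshold maps to 0.5."""
    return 1.0 / (1.0 + np.exp(-scale * (np.abs(rat_p) - threshold)))

# Toy rat phosphorylation values (hypothetical).
rat_p = np.array([0.0, -1.5, 3.0, -4.0, 8.0])
s = naive_score(rat_p)
crisp = s > 0.5  # crisp classification: predicted human activation
```

Since ROC and PR curves depend only on the ranking of the scores, replacing the logistic by another monotonic function of |ratP| would leave them unchanged.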
naïve prediction
ROC (sensitivity vs. 1-specificity): AUC ≈ 0.83
with respect to the full panel (416 predictions) of | humP | > 3
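For reference, the area under the ROC curve can be computed from certainties and binary labels without tracing the curve explicitly. The sketch below uses the pairwise (Mann-Whitney) formulation; names and toy values are hypothetical.

```python
import numpy as np

def roc_auc(labels, scores):
    """Probability that a random positive scores higher than a random negative."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]          # all positive/negative pairs
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy example (hypothetical): 2 activated and 3 non-activated cases.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = roc_auc(labels, scores)  # 5 of 6 pairs are ordered correctly
```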
[Figure: naïve prediction, color-coded certainty for | humP | > 3 in data set B; proteins 1 … 16 vs. stimuli 27 … 52]
machine learning approach
[Figure: 16 proteins vs. 52 stimuli. Training: ratP from rat data set A (16-dim. vectors) with targets | humP | > 3 ? from human data set A. Prediction: ratP from rat data set B gives | humP | > 3 ? for human data set B]
16 separate binary classification problems
LVQ prediction
LVQ1, one prototype per class
Nearest prototype classification:
here: 16-dim. data
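A minimal LVQ1 sketch with one prototype per class and nearest-prototype classification follows. The learning rate, number of epochs, initialization at the class means, and the toy data are assumptions for illustration, not settings reported in the slides.

```python
import numpy as np

def train_lvq1(X, y, epochs=50, lr=0.05, seed=0):
    """LVQ1: move the winning prototype towards (same class) or away from (other class) the sample."""
    rng = np.random.default_rng(seed)
    protos = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])  # init at class means
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            d = ((protos - X[i]) ** 2).sum(axis=1)   # squared Euclidean distances
            w = int(np.argmin(d))                    # index of the winner = its class
            sign = 1.0 if w == y[i] else -1.0
            protos[w] += sign * lr * (X[i] - protos[w])
    return protos

def predict(protos, X):
    """Nearest prototype classification."""
    d = ((X[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

# Toy data (hypothetical): two well-separated classes of 16-dim. vectors.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 16)),
               rng.normal(4.0, 1.0, size=(20, 16))])
y = np.array([0] * 20 + [1] * 20)
protos = train_lvq1(X, y)
pred = predict(protos, X)
```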
prediction score / certainty for activation
- precise (monotonic!) form is irrelevant for ROC, PR etc.
- crisp classification for threshold 0.5
- here: scaling factor yields range of values similar to naïve prediction
validation: 26 Leave-One-Out training processes:
split data set A in 25 training / 1 test sample
(if training set is all negative: accept naïve prediction)
prediction: ensemble average of certainties over the 26 LVQ systems
LVQ prediction
ROC (sensitivity vs. 1-specificity): AUC ≈ 0.88
with respect to the full panel (416 predictions) of | humP | > 3, obtained in the Leave-One-Out validation scheme
combined prediction: weighted average according to protein-specific performance (AUROC)
[Figure: color-coded certainty for | humP | > 3 in data set B; proteins 1 … 16 vs. stimuli 27 … 52]
naïve (rat) 0.45 0.74 0.79 1
LVQ 0.37 0.69 0.76 3
naïve scheme: best individual prediction
• L-1-O estimate not confirmed in the test set
combination improves performance!
confirmed in “wisdom of the crowd” analysis
inter-species pathway perturbation prediction
sub-challenge 3
additional data / domain knowledge
1) mapping of rat genes to human orthologs
   HGNC Comparison of Ortholog Predictions, HCOP
   www.genenames.org/cgi-bin/hcop.pl
2) annotation of gene sets representing known pathways and function
   246 gene sets from the C2CP collection (Broad Institute)
   www.broadinstitute.org/gsea/msigdb/genesets.jsp?collection=CP
3) gene set enrichment analysis
   www.broadinstitute.org/gsea/index.jsp
   NES: normalized enrichment scores, representing expression
   FDR: false discovery rate, i.e. statistical significance
   threshold: FDR < 0.25
rat vs. human
[Figure: gene sets with FDR < 0.25 vs. stimuli (set A), rat vs. human]
frequent observation: negative correlations between significant rat and human gene sets
biology? data (pre-)processing?
machine learning approach
training data: 26 stimuli in rat data set A
246-dim. vectors of rat NES
targets: binarized human FDR (<0.25?)
246 classification problems
• PCA: dimension and noise reduction
  rat gene set data A and B represented by k (≤52) projections
• LDA: linear classifier using k projections as features (probabilistic output)
• Leave-One-Out validation: determine optimal k from data set A
• use k=8 to make predictions for data set B (averaged over 26 L-1-O runs)
human gene set prediction
[Figure: final prediction, certainties; gene sets vs. stimuli 27 … 52]
summary
sc-1: intra-species prediction of phosphorylation
  gene expression is predictive for phosphorylation status
sc-2: inter-species prediction of phosphorylation
  rat phosphorylation is predictive for human cell response
sc-3: inter-species prediction of gene sets
  weakly predictive, presence of negative correlations between rat and human genes and gene sets
outlook
• more sophisticated learning schemes / classifiers
  e.g. feature weighting schemes, Matrix Relevance LVQ
• ‘joint’ predictions of protein or gene set tableaus
  e.g. predict 1 protein from 16 + 15 values in set A, two-step procedure for set B
• include gene expression in sub-challenge 2
• investigate difficult to predict proteins / gene sets
• infer and enhance network models from experimental data
  on-going, new challenge (runs until February 2014): Network Verification Challenge (NVC), www.sbvimprover.com
take home messages
• team work works (and skype is great)
• in case of doubt: PCA
• the smaller the data set, the simpler the method
• committees can be useful!
• if you have won the rat race, you might be a rat