Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

16
Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU

Transcript of Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

Page 1: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

Stabilization matrix method(Rigde regression)

Morten NielsenDepartment of Systems Biology,

DTU

Page 2: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

• A prediction method contains a very large set of parameters

– A matrix for predicting binding for 9meric peptides has 9x20=180 weights

• Over fitting is a problem

Data driven method training

yearsTe

mperature

Page 3: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

Model over-fitting (early stopping)

Evaluate on 600 MHC:peptide binding dataPCC=0.89

Stop training

Page 4: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

Stabilization matrix method The mathematics

y = ax + b2 parameter model

Good description, poor fit

y = ax6+bx5+cx4+dx3+ex2+fx+g

7 parameter modelPoor description, good fit

Page 5: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

Stabilization matrix method The mathematics

y = ax + b2 parameter model

Good description, poor fit

y = ax6+bx5+cx4+dx3+ex2+fx+g

7 parameter modelPoor description, good fit

Page 6: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

SMM training

Evaluate on 600 MHC:peptide binding dataL=0: PCC=0.70L=0.1 PCC = 0.78

Page 7: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

Stabilization matrix method.The analytic solution

Each peptide is represented as 9*20 number (180)H is a stack of such vectors of 180 valuest is the target value (the measured binding)l is a parameter introduced to suppress the effect of noise in the experimental data and lower the effect of overfitting

Page 8: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

SMM - Stabilization matrix method

I1 I2

w1 w2

Linear function

o

Sum over weights

Sum over data points

Page 9: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

SMM - Stabilization matrix method

I1 I2

w1 w2

Linear function

o

Per target error:

Global error:

Sum over weights

Sum over data points

Page 10: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

SMM - Stabilization matrix methodDo it yourself

I1 I2

w1 w2

Linear function

o

l per target

Page 11: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

And now you

Page 12: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

SMM - Stabilization matrix method

I1 I2

w1 w2

Linear function

o

l per target

Page 13: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

SMM - Stabilization matrix method

I1 I2

w1 w2

Linear function

o

Page 14: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

SMM - Stabilization matrix methodMonte Carlo

I1 I2

w1 w2

Linear function

o

Global:

• Make random change to weights

• Calculate change in “global” error

• Update weights if MC move is accepted

Note difference between MC and GD in the use of “global” versus “per target” error

Page 15: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

Training/evaluation procedure

• Define method• Select data• Deal with data redundancy

– In method (sequence weighting)– In data (Hobohm)

• Deal with over-fitting either– in method (SMM regulation term) or– in training (stop fitting on test set

performance)• Evaluate method using cross-validation

Page 16: Stabilization matrix method (Rigde regression) Morten Nielsen Department of Systems Biology, DTU.

A small doit tcsh script/home/projects/mniel/ALGO/code/SMM

#! /bin/tcsh -fset DATADIR = /home/projects/mniel/ALGO/data/SMM/

foreach a ( A0101 A3002 )mkdir -p $acd $a

foreach n ( 0 1 2 3 4 )mkdir -p n.$ncd n.$n

cat $DATADIR/$a/f00$n | gawk '{print $1,$2}' > f00$ncat $DATADIR/$a/c00$n | gawk '{print $1,$2}' > c00$ntouch $$.perfrm -f $$.perf

foreach l ( 0 0.05 0.1 0.5 1.0 2.0 5.0 10.0 20.0 50.0 )smm -nc 500 -l $l f00$n > mat.$lpep2score -mat mat.$l c00$n > c00$n.pred

echo $l `cat c00$n.pred | grep -v "#" | args 2,3 | gawk 'BEGIN{n+0; e=0.0}{n++; e += ($1-$2)*($1-$2)}END{print e/n}' ` >> $$.perf

end

set best_l = `cat $$.perf | sort -nk2 | head -1 | gawk '{print $1}'`echo "Allele" $a "n" $n "best_l" $best_l

pep2score -mat mat.$best_l c00$n > c00$n.pred

cd ..end

echo $a `cat n.?/c00?.pred | grep -v "#" | args 2,3 | xycorr | grep -v "#"` \ `cat n.?/c00?.pred | grep -v "#" | args 2,3 | \ gawk 'BEGIN{n+0; e=0.0}{n++; e += ($1-$2)*($1-$2)}END{print e/n}' `

cd ..end