Presented at the Albany Chapter of the ASA, February 25, 2004, Washington DC
Direct Kernel Methods for the Detection of Ischemia from Magnetocardiograms: Support Vector Machines for the Rest of Us
Mark J. Embrechts ([email protected])
Department of Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute, Troy, NY 12180
Supported by NSF Grant SBIR Phase I # 0232215 and KDI # IIS-9979860
Magnetocardiography at CardioMag Imaging Inc.
With Bolek Szymanski and Karsten Sternickel
Left: Filtered and averaged temporal MCG traces for one cardiac cycle in 36 channels (the 6x6 grid). Right Upper: Spatial map of the cardiac magnetic field, generated at an instant within the ST interval. Right Lower: T3-T4 sub-cycle in one MCG signal trace.
Classical (Linear) Regression Analysis: Predict y from X

X w = y   (X is n×m, w is m×1, y is n×1)
ŵ = (Xᵀ X)⁻¹ Xᵀ y   (pseudo-inverse)
NAME  PIE    PIF    DGR    SAC    MR     Lam    Vol    DDGTS
Ala    0.23   0.31  -0.55  254.2  2.126  -0.02   82.2   8.5
Asn   -0.48  -0.6    0.51  303.6  2.994  -1.24  112.3   8.2
Asp   -0.61  -0.77   1.2   287.9  2.994  -1.08  103.7   8.5
Cys    0.45   1.54  -1.4   282.9  2.933  -0.11    9.1  11
Gln   -0.11  -0.22   0.29  335    3.458  -1.19  127.5   6.3
Glu   -0.51  -0.64   0.76  311.6  3.243  -1.43  120.5   8.8
Gly    0      0      0     224.9  1.662   0.03   65     7.1
His    0.15   0.13  -0.25  337.2  3.856  -1.06  140.6  10.1
Ile    1.2    1.8   -2.1   322.6  3.35    0.04  131.7  16.8
Leu    1.28   1.7   -2     324    3.518   0.12  131.5  15
Lys   -0.77  -0.99   0.78  336.6  2.933  -2.26  144.3   7.9
Met    0.9    1.23  -1.6   336.3  3.86   -0.33  132.3  13.3
Phe    1.56   1.79  -2.6   366.1  4.638  -0.05  155.8  11.2
Pro    0.38   0.49  -1.5   288.5  2.876  -0.31  106.7   8.2
Ser    0     -0.04   0.09  266.7  2.279  -0.4    88.5   7.4
Thr    0.17   0.26  -0.58  283.9  2.743  -0.53  105.3   8.8
Trp    1.85   2.25  -2.7   401.8  5.755  -0.31  185.9   9.9
Tyr    0.89   0.96  -1.7   377.8  4.791  -0.84  162.7   8.8
Val    0.71   1.22  -1.6   295.1  3.054  -0.13  115.6  12
X w = y   (X is n×m, w is m×1, y is n×1)
Prediction model: ŷ_test = x_test ŵ
Can we extract "wisdom" w from the data and use it to forecast correctly?
(n = 19 data records and m = 7 attributes, 1 response)
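The pseudo-inverse recipe above can be sketched in a few lines of NumPy. This is a minimal sketch on hypothetical toy data (not the amino-acid table):

```python
import numpy as np

# Toy data: n = 6 records, m = 2 attributes (hypothetical, for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
w_true = np.array([1.5, -0.7])
y = X @ w_true                        # noise-free response

# "Wisdom" from the pseudo-inverse: w_hat = (X^T X)^(-1) X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Prediction model: y_hat = x_test @ w_hat
x_test = np.array([0.3, -0.2])
y_hat = x_test @ w
```

With noise-free data and more records than attributes, the pseudo-inverse recovers the true weights exactly.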
Fundamental Machine Learning Paradox
ŵ = (Xᵀ X)⁻¹ Xᵀ y = K_F⁻¹ Xᵀ y,  with feature kernel K_F = Xᵀ X (m×m);  prediction: ŷ_test = x_test ŵ

• Learning occurs because of redundancy (patterns) in the data
• Machine Learning Paradox: if data contain redundancies, then (i) we can learn from the data, but (ii) the "feature kernel matrix" K_F is ill-conditioned
• How to resolve the Machine Learning Paradox?
  (i) fix the rank deficiency of K_F with principal components (PCA)
  (ii) regularization: use K_F + λI instead of K_F (ridge regression)
  (iii) local learning
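A minimal numerical illustration of the paradox: duplicating a column (a redundancy) makes K_F = XᵀX rank-deficient, and adding a small ridge restores invertibility (toy data assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(10, 1))
X = np.hstack([x, x])              # redundant (duplicated) feature
KF = X.T @ X                       # feature kernel, ill-conditioned

rank = np.linalg.matrix_rank(KF)   # rank 1, not 2: K_F is singular

# Regularization: K_F + lambda*I is invertible again (ridge regression)
lam = 1e-3
KF_ridge = KF + lam * np.eye(2)
KF_inv = np.linalg.inv(KF_ridge)   # no longer singular
```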
Principal Component Regression (PCR): Replace X (n×m) by T (n×h)

T = X B   (B is m×h),   X ≈ T Bᵀ
ŵ = (Tᵀ T)⁻¹ Tᵀ y,   ŷ_test = t_test ŵ

T (n×h) holds the principal-component projections of the n data records on the h "most important" eigenvectors of the feature kernel K_F.
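A sketch of PCR, with NumPy's eigendecomposition standing in for the NIPALS iteration described later in the talk (toy data assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(19, 7))          # n = 19 records, m = 7 attributes
y = X @ rng.normal(size=7)

# Eigenvectors B of the feature kernel K_F = X^T X, largest eigenvalues first
KF = X.T @ X
eigval, B = np.linalg.eigh(KF)
order = np.argsort(eigval)[::-1]
B = B[:, order]

h = 3                                 # keep the h "most important" components
T = X @ B[:, :h]                      # T (n x h): principal-component scores
w = np.linalg.inv(T.T @ T) @ T.T @ y  # regression on the scores

t_test = X[0] @ B[:, :h]              # project a record, then predict
y_hat = t_test @ w
```

Because B holds eigenvectors of XᵀX, TᵀT is diagonal, so the inversion is trivial.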
Ridge Regression in Data Space

ŵ = (Xᵀ X + λI)⁻¹ Xᵀ y = Xᵀ (X Xᵀ + λI)⁻¹ y = Xᵀ (K_D + λI)⁻¹ y,  with data kernel K_D = X Xᵀ

• "Wisdom" is now obtained from the right-hand inverse or Penrose inverse
• The ridge term λI is added to resolve the learning paradox
• Prediction needs kernels only:

ŷ_test = X_test Xᵀ (X Xᵀ + λI)⁻¹ y = K_test (K_D + λI)⁻¹ y
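The feature-space and data-space ridge solutions can be checked against each other numerically; the data-space form needs only kernels (toy data assumed):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
lam = 0.1

# Feature-space ridge: w = (X^T X + lam*I)^(-1) X^T y
w_feature = np.linalg.inv(X.T @ X + lam * np.eye(3)) @ X.T @ y

# Data-space ridge: w = X^T (K_D + lam*I)^(-1) y, with K_D = X X^T
KD = X @ X.T
w_data = X.T @ np.linalg.inv(KD + lam * np.eye(8)) @ y

# Prediction needs kernels only: y_hat = K_test (K_D + lam*I)^(-1) y
x_test = rng.normal(size=3)
K_test = x_test @ X.T
y_hat = K_test @ np.linalg.inv(KD + lam * np.eye(8)) @ y
```

The two weight vectors agree by the matrix identity (XᵀX + λI)⁻¹Xᵀ = Xᵀ(XXᵀ + λI)⁻¹.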
Implementing Direct Kernel Methods
Linear Model:
- PCA model
- PLS model
- Ridge Regression
- Self-Organizing Map
- . . .
What have we learned so far?
• There is a “learning paradox” because of redundancies in the data
• We resolved this paradox by “regularization” - In the case of PCA we used the eigenvectors of the feature kernel - In the case of ridge regression we added a ridge to the data kernel
• So far prediction models involved only linear algebra → strictly linear
• What is in a kernel?
k_ij = x_i x_jᵀ

K_D = X Xᵀ = [ k_ij ]  (an n×n matrix)

The data kernel contains linear similarity measures (correlations) of data records x_i and x_j.
Kernels
• What is a kernel?
  - The data kernel expresses a similarity measure between data records
  - So far, the kernel contains linear similarity measures → linear kernel:
    K_D = X Xᵀ = [ k_ij ],   k_ij = x_i x_jᵀ
• We actually can make up nonlinear similarity measures as well, e.g. the Radial Basis Function kernel, which is nonlinear in the distance (or difference) between records:

k_ij = exp( −‖x_i − x_j‖² / (2σ²) )
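The RBF kernel matrix just defined can be built in a few lines; `rbf_kernel` is a hypothetical helper name and the points are toy data:

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Entries k_ij = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

# Three toy 2-D records
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, X, sigma=1.0)   # symmetric, ones on the diagonal
```

Identical records give similarity 1; the similarity decays with squared distance, controlled by σ.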
Review: What is in a Kernel?
• A kernel can be considered as a (nonlinear) data transformation
  - Many different choices for the kernel are possible
  - The Radial Basis Function (RBF) or Gaussian kernel is an effective nonlinear kernel
• The RBF or Gaussian kernel is a symmetric matrix
  - Entries reflect nonlinear similarities amongst data descriptions
- As defined by:
k_ij = exp( −‖x_i − x_j‖² / (2σ²) )

K = [ k_ij ]  (a symmetric n×n matrix)
Direct Kernel Methods for Nonlinear Regression/Classification
• Consider the kernel as a (nonlinear) data transformation
  - This is the so-called "kernel trick" (Hilbert, early 1900s)
  - The Radial Basis Function (RBF) or Gaussian kernel is an efficient nonlinear kernel
• Linear regression models can be "tricked" into nonlinear models by applying such regression models on kernel-transformed data
  - PCA → DK-PCA
  - PLS → DK-PLS (Partial Least Squares Support Vector Machines)
  - (Direct) Kernel Ridge Regression → Least Squares Support Vector Machines
  - Direct Kernel Self-Organizing Maps (DK-SOM)
• These methods work in the same space as SVMs
  - DK models can usually also be derived from an optimization formulation (similar to SVMs)
  - Unlike the original SVMs, DK methods are not sparse (i.e., all data are support vectors)
  - Unlike SVMs, there is no patent on direct kernel methods
  - Performance on hundreds of benchmark problems compares favorably with SVMs
• Classification can be considered as a special case of regression
• Data pre-processing: data are usually Mahalanobis-scaled first
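The direct-kernel recipe (scale, kernel-transform, then apply a linear model) can be sketched as kernel ridge regression. Plain column standardization stands in for Mahalanobis scaling here, and the two-class toy target is assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 2))
y = np.sign(X[:, 0] * X[:, 1])       # nonlinear two-class toy target (+1/-1)

# 1. Scaling (column standardization as a stand-in for Mahalanobis scaling)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Kernel transform (RBF)
def rbf(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

K = rbf(Xs, Xs)

# 3. Ridge regression on the kernel-transformed data
lam = 0.1
alpha = np.linalg.inv(K + lam * np.eye(len(K))) @ y

# Classification as a special case of regression: threshold the prediction
y_hat = np.sign(K @ alpha)
accuracy = (y_hat == y).mean()
```

All 30 training records act as "support vectors" here, which is exactly the non-sparsity noted above.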
Nonlinear PCA in Kernel Space
• Like PCA
• Consider a nonlinear data kernel transformation up front: the data kernel
• Derive principal components for that kernel (e.g. with NIPALS)
• Examples:
  - Haykin's Spiral
  - Cherkassky's nonlinear function model

K = [ k_ij ]  (n×n),   k_ij = exp( −‖x_i − x_j‖² / (2σ²) )
K ŵ = y,   ŵ = (Kᵀ K)⁻¹ Kᵀ y
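A sketch of nonlinear PCA in kernel space: build the RBF data kernel, center it, and take its leading eigenvectors (NumPy's `eigh` standing in for NIPALS; toy data assumed):

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(12, 2))

def rbf(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

K = rbf(X, X)

# Center the kernel, then do PCA on it
n = len(K)
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J                        # double-centered kernel

eigval, V = np.linalg.eigh(Kc)
order = np.argsort(eigval)[::-1]      # largest eigenvalues first
T = Kc @ V[:, order[:2]]              # projections on 2 leading kernel PCs
```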
PCA Example: Haykin’s Spiral
REM HAYKINS SPIRAL
REM GENERATE DATA (6, 500, 2)
analyze num_eg.txt 3301
REM GENERATE LABELS
analyze spiral.txt 116
REM SPLIT DATA (400, 2)
analyze spiral.txt 20
REM FOR SAFEKEEPING
copy cmatrix.txt spiral.pat > embrex.log
copy dmatrix.txt spiral.tes >> embrex.log
REM SCALE DATA
analyze spiral.pat 3314159
copy spiral.pat.txt a.pat >> embrex.log
REM SCALE TEST SET CONSISTENTLY (n)
analyze spiral.tes 314159
copy spiral.tes.txt a.tes >> embrex.log
pause
REM DO PCA (2)
analyze num_eg.txt 105
analyze a.pat 36
analyze a.tes 3636
copy a.tes.txt dmatrix.txt
REM JAVA PLOT
analyze a.pat.txt 3308
pause

(demo: haykin1)
PCA
Linear PCR Example: Haykin's Spiral

REM HAYKINS SPIRAL
REM GENERATE DATA (6, 500, 2)
analyze num_eg.txt 3301
REM GENERATE LABELS
analyze spiral.txt 116
REM SPLIT DATA (400, 2)
analyze spiral.txt 20
REM FOR SAFEKEEPING
echo off
copy cmatrix.txt spiral.pat > embrex.log
copy dmatrix.txt spiral.tes >> embrex.log
echo on
REM SCALE DATA
analyze spiral.pat 3314159
copy spiral.pat.txt a.pat >> embrex.log
REM SCALE TEST SET CONSISTENTLY (n)
analyze spiral.tes 314159
echo off
copy spiral.tes.txt a.tes >> embrex.log
echo on
pause
REM RUN PCR (2)
analyze num_eg.txt 105
analyze a.pat 17
analyze a.tes 18
pause
REM DESCALE
analyze resultss.xxx 4
echo off
copy results.ttt results.xxx >> embrex.log
echo on
analyze resultss.ttt 4
REM JAVA PLOT
analyze num_eg.txt 3354
pause
REM ROC CURVE
analyze results.ttt -42
pause
analyze results.ttt 40
analyze results.ttt 3310
pause

(demo: haykin2)
K-PCR Example: Haykin's Spiral

REM HAYKINS SPIRAL
REM GENERATE DATA (6, 500, 2)
analyze num_eg.txt 3301
REM GENERATE LABELS
analyze spiral.txt 116
REM SPLIT DATA (400, 2)
analyze spiral.txt 20
REM FOR SAFEKEEPING
echo off
copy cmatrix.txt spiral.pat > embrex.log
copy dmatrix.txt spiral.tes >> embrex.log
echo on
REM SCALE DATA
analyze spiral.pat 3314159
copy spiral.pat.txt a.pat >> embrex.log
REM SCALE TEST SET CONSISTENTLY (n)
analyze spiral.tes 314159
echo off
copy spiral.tes.txt a.tes >> embrex.log
echo on
pause
REM RUN K-PCA (12 1)
analyze num_eg.txt 105
analyze a.pat 4531
analyze a.tes 4518
pause
REM DESCALE
analyze resultss.xxx 4
copy results.ttt results.xxx >> embrex.log
analyze resultss.ttt 4
REM JAVA PLOT
analyze num_eg.txt 3354
pause
REM ROC CURVE
analyze results.ttt -42
pause
analyze results.ttt 40
analyze results.ttt 3310
pause

3 PCAs / 12 PCAs

(demo: haykin3)
[Flow diagram] Training Data → Mahalanobis-scaled Training Data (store the Mahalanobis scaling factors) → Kernel-Transformed Training Data → Centered Direct Kernel (Training Data) (store the vertical kernel centering factors). Test Data → Mahalanobis-scaled Test Data (using the stored scaling factors) → Kernel-Transformed Test Data → Centered Direct Kernel (Test Data) (using the stored centering factors).
Scaling, centering & making the test kernel centering consistent
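One common way to keep test-kernel centering consistent is to store the training kernel's column means and subtract those same means from the test kernel. A sketch of that vertical-centering step (the slide's exact recipe may differ; toy data assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
Xtr, Xte = rng.normal(size=(10, 3)), rng.normal(size=(4, 3))

def linear_kernel(A, B):
    return A @ B.T

Ktr = linear_kernel(Xtr, Xtr)     # training kernel (10 x 10)
Kte = linear_kernel(Xte, Xtr)     # test kernel (4 x 10), same training columns

# Store the training kernel's column means (the "centering factors")
col_means = Ktr.mean(axis=0)

# Vertical centering: subtract the SAME training means from both kernels
Ktr_c = Ktr - col_means
Kte_c = Kte - col_means           # test kernel centered consistently
```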
36 MCG T3-T4 Traces
Preprocessing:
- horizontal Mahalanobis scaling
- D4 wavelet transform
- vertical Mahalanobis scaling (features and response)
[Plot] Error plot for test data: target and predicted values versus sorted sequence number.
SVMLib
Linear PCA
Direct Kernel PLS

SVMLib:
7 (TN)   0 (FP)
3 (FN)  15 (TP)
[3-D scatter plot] PharmaPlot: 'negative' and 'positive' cases plotted against the first, second, and third PLS components.

Direct Kernel PLS with 3 Latent Variables
[ROC curve] True positives versus false positives; AZ_area = 0.879.
Predictions on Test Cases with K-PLS
[Plot] Target and predicted values versus sorted sequence number; q2 = 0.568, Q2 = 0.575, RMSE = 0.751.
[Plot] Target and predicted values versus sorted sequence number; q2 = 0.529, Q2 = 0.542, RMSE = 0.729.
[ROC curve] True positives versus false positives; AZ_area = 0.903.
K-PLS Predictions After Removing 14 Outliers
Method   Domain      q2     Q2      RMSE   %correct  #misses  time (s)  comment
SVMLib   time        0.767  0.842   0.852  74        4+5      10        lambda = 0.011, sigma = 10
K-PLS    time        0.779  0.849   0.856  74        4+5      6         5 latent variables, sigma = 10
K-PCA    D4-wavelet  0.783  0.812   0.87   71        7+3      5         5 principal components
PLS      D4-wavelet  0.841  0.142   1.146  63        2+11     3         5 latent variables
K-PLS    D4-wavelet  0.591  0.694   0.773  80        2+5      6         5 latent variables, sigma = 10
DK-PLS   D4-wavelet  0.608  0.708   0.781  80        2+5      5         5 latent variables, sigma = 10
SVMLib   D4-wavelet  0.591  0.697   0.775  80        2+5      10        lambda = 0.011, sigma = 10
LS-SVM   D4-wavelet  0.554  0.662   0.75   83        1+5      0.5       lambda = 0.011, sigma = 10
SOM      D4-wavelet  0.866  1.304   1.06   63        3+10     960       9x18 hexagonal grid
DK-SOM   D4-wavelet  0.855  1.0113  0.934  71        5+5      28        9x18 hex grid, sigma = 10
DK-SOM   D4-wavelet  0.755  0.859   0.861  77        3+5      28        18x18 hexagonal, sigma = 8
Benchmark Predictions on Test Cases
Direct Kernel:
7 (TN)   0 (FP)
4 (FN)  14 (TP)
with Robert Bress and Thanakorn Naenna
www.drugmining.com
Kristin Bennett and Mark Embrechts
Docking Ligands is a Nonlinear Problem
GATCAATGAGGTGGACACCAGAGGCGGGGACTTGTAAATAACACTGGGCTGTAGGAGTGA
TGGGGTTCACCTCTAATTCTAAGATGGCTAGATAATGCATCTTTCAGGGTTGTGCTTCTA
TCTAGAAGGTAGAGCTGTGGTCGTTCAATAAAAGTCCTCAAGAGGTTGGTTAATACGCAT
GTTTAATAGTACAGTATGGTGACTATAGTCAACAATAATTTATTGTACATTTTTAAATAG
CTAGAAGAAAAGCATTGGGAAGTTTCCAACATGAAGAAAAGATAAATGGTCAAGGGAATG
GATATCCTAATTACCCTGATTTGATCATTATGCATTATATACATGAATCAAAATATCACA
CATACCTTCAAACTATGTACAAATATTATATACCAATAAAAAATCATCATCATCATCTCC
ATCATCACCACCCTCCTCCTCATCACCACCAGCATCACCACCATCATCACCACCACCATC
ATCACCACCACCACTGCCATCATCATCACCACCACTGTGCCATCATCATCACCACCACTG
TCATTATCACCACCACCATCATCACCAACACCACTGCCATCGTCATCACCACCACTGTCA
TTATCACCACCACCATCACCAACATCACCACCACCATTATCACCACCATCAACACCACCA
CCCCCATCATCATCATCACTACTACCATCATTACCAGCACCACCACCACTATCACCACCA
CCACCACAATCACCATCACCACTATCATCAACATCATCACTACCACCATCACCAACACCA
CCATCATTATCACCACCACCACCATCACCAACATCACCACCATCATCATCACCACCATCA
CCAAGACCATCATCATCACCATCACCACCAACATCACCACCATCACCAACACCACCATCA
CCACCACCACCACCATCATCACCACCACCACCATCATCATCACCACCACCGCCATCATCA
TCGCCACCACCATGACCACCACCATCACAACCATCACCACCATCACAACCACCATCATCA
CTATCGCTATCACCACCATCACCATTACCACCACCATTACTACAACCATGACCATCACCA
CCATCACCACCACCATCACAACGATCACCATCACAGCCACCATCATCACCACCACCACCA
CCACCATCACCATCAAACCATCGGCATTATTATTTTTTTAGAATTTTGTTGGGATTCAGT
ATCTGCCAAGATACCCATTCTTAAAACATGAAAAAGCAGCTGACCCTCCTGTGGCCCCCT
TTTTGGGCAGTCATTGCAGGACCTCATCCCCAAGCAGCAGCTCTGGTGGCATACAGGCAA
CCCACCACCAAGGTAGAGGGTAATTGAGCAGAAAAGCCACTTCCTCCAGCAGTTCCCTGT
WORK IN PROGRESS
Direct Kernel Partial Least Squares (K-PLS)
[Diagram] Inputs x1, x2, x3 → latent variables t1, t2 → output y.
• Direct Kernel PLS is PLS with the kernel transform as a preprocessing step
• Consider K-PLS as a "better" nonlinear PLS
• Consider PLS as a "better" PCA
• K-PLS gives almost identical (but more stable) results to SVMs
  - PLS is the method of choice for chemometrics and QSAR drug design
  - hyper-parameters are easy to tune (5 latent variables)
  - unlike SVMs, there is no patent on K-PLS
PCR in Feature Space

[Network diagram] Inputs x1 ... xi ... xm feed a layer of Σ nodes whose weights B (m×h) correspond to the h eigenvectors of Xᵀ X with the largest eigenvalues. The next layer's weights Tᵀ (h×n) correspond to the scores (PCAs) for the entire training set; this layer gives a weighted similarity score with each data point, a kind of nearest-neighbor weighted prediction score. The output layer's weights y1 ... yi ... yn correspond to the dependent variable for the entire training data.

ŵ = B (Tᵀ T)⁻¹ Tᵀ y
ŷ = xᵀ B (Tᵀ T)⁻¹ Tᵀ y

The (Tᵀ T)⁻¹ factor means that the projections on the eigenvectors are divided by the corresponding variance (cf. Mahalanobis scaling).
PCR in Feature Space
[Diagram] Scores t1, t2, ..., th (with weights w1, w2, ..., wh) feed the output y.

• Principal components can be thought of as a data pre-processing step
• Rather than building a model for an m-dimensional input vector x, we now have an h-dimensional t vector:

t = x B   (t is 1×h, x is 1×m, B is m×h)
Use of a direct kernel self-organizing map in testing mode for the detection of patients with ischemia (red patient IDs). The darker hexagons, colored during a separate training phase, represent nodes corresponding with ischemia cases.
Predictions on Test Cases with DK-SOM
Outlier Detection Procedure in Analyze

start
→ One-class SVM on training data (proprietary regularization mechanism)
→ Determine the number of outliers from the elbow plot; eliminate outliers from the training set (list of outlier pattern IDs)
→ Run K-PLS for the new training/test data
→ See whether outliers make sense on pharmaplots (outliers are flagged in the pharmaplots); inspect outlier clusters on SOMs
→ end
[Plot] Target and predicted values versus sorted sequence number; q2 = 0.818, Q2 = 2.042, RMSE = 1.364.
[Plot] Outlier detection plot (1/C): response versus sorted index number ('outliers.txt' using 1:3).
Tagging Outliers on Pharmaplot with Analyze Code
[Plot] Outlier detection plot (1/C): response versus sorted index number; "elbows" suggest 7-14 outliers.
“Elbow” Plot for Specifying # Outliers
One-Class SVM Results for MCG Data
[Plot] Target and predicted values versus sorted sequence number; q2 = 0.818, Q2 = 2.042, RMSE = 1.364.
Outlier/Novelty Detection Methods in Analyze: Hypotheses
• One-class SVMs are commonly cited for outlier detection (e.g., Suykens)
  - used publicly available SVM code (LibSVM)
  - Analyze has user-friendly interface operators for using LibSVM
• Proprietary heuristic tuning for C in SVMs
  - heuristic tuning method explained in previous publications
  - heuristic tuning is essential to make outlier detection work properly
• "Elbow" curves for indicating the number of outliers
• Pharmaplot justifies/validates detection from different methods
• Pharmaplots extended to PLS, K-PCA, and K-PLS pharmaplots
One-Class SVM: Brief Theory
• Well-known method for outlier and novelty detection in the SVM literature (e.g., see Suykens)
• LibSVM, a publicly available SVM code for general use, has a one-class SVM option built in (see Chih-Chung Chang and Chih-Jen Lin)
• Analyze has operators to interface with LibSVM
• Theory:
  - One-class SVM ignores the response (assumes all zeros for responses)
  - Maximizes spread and subtracts a regularization term
  - Suykens (p. 203) has the following formulation
  - γ is a regularization parameter; Analyze has a proprietary way to determine it
• Application:
  - Analyze combines one-class SVMs with pharmaplots to see whether outliers can be explained and make sense
  - Analyze has elbow curves to assist the user in determining the number of outliers
  - The combination of one-class SVMs with pharmaplots gave excellent results on several industrial (non-pharmaceutical) data sets
max over w, e:   J_p(w, e) = (1/2) Σ_{k=1..N} e_k² − (γ/2) wᵀ w
such that  e_k = wᵀ x_k,   k = 1, ..., N
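Setting ∂J_p/∂w = 0 after substituting e_k = wᵀx_k gives (XᵀX) w = γ w, so candidate directions w are eigenvectors of XᵀX. A minimal sketch of that reading, scoring records by their projection along the direction of maximal spread (illustrative only; not Analyze's proprietary mechanism, and real one-class SVM scoring differs in detail):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3)) * np.array([3.0, 1.0, 0.5])  # anisotropic toy data

# Stationarity of J_p: (X^T X) w = gamma w  ->  w is an eigenvector of X^T X
eigval, W = np.linalg.eigh(X.T @ X)
w = W[:, -1]                       # direction of maximal spread

e = X @ w                          # projections e_k = w^T x_k
scores = np.abs(e)                 # large |e_k| -> candidate outliers
outlier_ids = np.argsort(scores)[::-1][:5]
```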
NIPALS ALGORITHM FOR PLS (with just one response variable y)
• Start for a PLS component:
  w = Xᵀ y / (yᵀ y),  then normalize: ŵ = w / √(wᵀ w)
• Calculate the score t:
  t = X ŵ
• Calculate c':
  c' = tᵀ y / (tᵀ t)
• Calculate the loading p:
  p = Xᵀ t / (tᵀ t)
• Store t in T, store p in P, store w in W
• Deflate the data matrix and the response variable:
  X ← X − t pᵀ,   y ← y − t c'
• Repeat for h latent variables
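The steps above can be sketched directly in NumPy (`nipals_pls` is a hypothetical helper name; toy data assumed):

```python
import numpy as np

def nipals_pls(X, y, h):
    """NIPALS PLS for a single response y, following the steps above."""
    X, y = X.copy().astype(float), y.copy().astype(float)
    T, P, W, C = [], [], [], []
    for _ in range(h):
        w = X.T @ y / (y @ y)           # start: w from X^T y
        w = w / np.sqrt(w @ w)          # normalize w
        t = X @ w                       # score
        c = (t @ y) / (t @ t)           # c'
        p = X.T @ t / (t @ t)           # loading
        X = X - np.outer(t, p)          # deflate the data matrix
        y = y - t * c                   # deflate the response
        T.append(t); P.append(p); W.append(w); C.append(c)
    return np.array(T).T, np.array(P).T, np.array(W).T, np.array(C)

rng = np.random.default_rng(8)
X = rng.normal(size=(19, 7))
y = X @ rng.normal(size=7)
T, P, W, C = nipals_pls(X, y, h=3)
y_fit = T @ C                           # fit from h latent variables
```

The deflation step makes successive scores orthogonal, which is what keeps the component-wise regressions independent.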
Outlier/Novelty Detection Methods in Analyze
• Outlier detection methods were extensively tested:
  - on a variety of different UCI data sets
  - models sometimes showed significant improvement after removal of outliers
  - models were rarely worse
  - outliers could be validated on pharmaplots and led to enhanced insight
• The pharmaplots confirm the validity of outlier detection with one-class SVM
• Prediction on the test set for the albumin data improves the model
• A non-pharmaceutical (medical) data set actually shows two data points in the training set that probably were given wrong labels (Appendix A)
[Figure] P, Q, R, S, T waves of the cardiac cycle.
Innovations in Analyze for Outlier Detection
• User-friendly procedure with automated processes
• Interface for one-class SVM from LibSVM
• Automated tuning for regularization parameters
• Elbow plots to determine the number of outliers
• Combination of LibSVM outliers with pharmaplots
  - efficient visualization of outliers
  - facilitates interpretation of outliers
• Extended pharmaplots: PCA, K-PCA, PLS, K-PLS
• User-friendly and efficient SOM with outlier identification
• Direct-kernel-based outlier detection as an alternative to LibSVM
Principal Component Analysis (PCA)
T = X B   (T is n×h, X is n×m, B is m×h),   X ≈ T Bᵀ
ŷ = T b,   b = (Tᵀ T)⁻¹ Tᵀ y

• We introduce a modest set of the h most important principal components, T (n×h)
• Replace the data X (n×m) by the most important principal components T (n×h)
• The most important T's are the ones corresponding to the largest eigenvalues of Xᵀ X
• The B's are the eigenvectors of Xᵀ X, ordered from largest to smallest eigenvalue
• In practice the calculation of the B's and T's proceeds iteratively with the NIPALS algorithm
• NIPALS: nonlinear iterative partial least squares (Herman Wold)
[Diagram] Inputs x1, x2, x3 → principal components t1, t2 → output y.
Partial Least Squares (PLS)
• Similar to PCA
• PLS: Partial Least Squares / Projection to Latent Structures / "Please Listen to Svante"
• The t's are now called scores or latent variables, and the p's are the loading vectors
• The loading vectors are no longer orthogonal and are influenced by the y vector
• A special version of NIPALS is also used to build up the t's
[Diagram] Inputs x1, x2, x3 → latent variables t1, t2 → output y.

T = X W*,   W* = W (Pᵀ W)⁻¹,   X = T Pᵀ
ŷ = T b = X W (Pᵀ W)⁻¹ b,   b = (Tᵀ T)⁻¹ Tᵀ y
Kernel PLS (K-PLS)
[Diagram] Inputs x1, x2, x3 → latent variables t1, t2 → output y.

• Invented by Rosipal and Trejo (Journal of Machine Learning Research, December 2001)
• Consider K-PLS as a better and nonlinear PLS
• K-PLS gives almost identical results to SVMs for the QSAR data we tried
• K-PLS is a lot faster than SVMs
Validation Model: 100x leave 10% out validations
PLS, K-PLS, SVM, ANN
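The 100x leave-10%-out validation can be sketched as follows, assuming the press-style definition Q² = Σ(y − ŷ)² / Σ(y − ȳ)² (smaller is better) and a plain ridge model as a stand-in for PLS, K-PLS, SVM, or ANN:

```python
import numpy as np

def Q2(y_true, y_pred):
    """Press-style Q^2: squared error over total sum of squares
    (a common definition, assumed here; smaller is better)."""
    return ((y_true - y_pred) ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()

rng = np.random.default_rng(9)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

# 100x leave-10%-out: hold out a random 10%, fit a ridge model on the rest
lam, scores = 0.01, []
for _ in range(100):
    idx = rng.permutation(50)
    test, train = idx[:5], idx[5:]
    w = np.linalg.inv(X[train].T @ X[train] + lam * np.eye(3)) @ X[train].T @ y[train]
    scores.append(Q2(y[test], X[test] @ w))

mean_Q2 = float(np.mean(scores))
```

Averaging over many random holdouts gives a more stable error estimate than a single train/test split.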
[Plot] 1 − Q2 versus number of features on the validation set ('evolve.txt' using 1:2).
Feature Selection (data strip mining)