Kernel Methods (Support Vector Machines)
for Environmental and Geo-Sciences
Alexei Pozdnoukhov
Lecturer
National Centre for Geocomputation
National University of Ireland, Maynooth
+353 (0)1 7086146
Machine Learning
• Environmental monitoring: the current rate of data acquisition is about 0.5Tb/day (increasing at 82% per year)
• Remote Sensing Data: NASA holds more than 10Pb of data, increasing by 10x every 5 years. The ESA data stream is about 0.5Tb/year, likely to increase by 20x in the next 5 years.
• GIS, DEM
• Sensor Networks
• Field Measurements
Learning From Data
Clustering
Cluster 1
Cluster 2
Dimensionality Reduction
Classification
Binary Multi-Class
Regression
Input, x
y
Curse of Dimensionality
Sensor Network
Geographical Information
Wireless Sensor Network
Remote Sensing
Batteries Recharged at WSN
Need more data?
Human activity
Detecting Events
Observed environment: high-dimensional input space
Events: Very Rare, Extreme
• High-dimensional spaces: risk of overfitting
• Robust to noise in both inputs/outputs
• Non-linear and non-parametric
• Computationally effective for real-time processing and LBS dissemination
Curse of Dimensionality
Statistical Learning Theory
• Models that can generalise from data
• Good predictive abilities
• Complexity can be controlled
Statistical Learning Theory
• Occam’s Razor Principle (14th century)
One should not increase, beyond what is necessary, the number of entities required to explain anything
• When many solutions are available for a given problem, we should select the simplest one.
• But what do we mean by simple?
• We will use prior knowledge of the problem at hand to define what is a simple solution (example of a prior: smoothness).
Occam’s Razor and Classification
                 Model 1   Model 2   Model 3
Complexity       √√        √         ××
Training error   ××        √         √√
Overall          -         √         -
Structural Risk Minimization
• Define a set of learning functions, {S}
• Order it in terms of complexity, {S1, …, SN}
• Select the optimal S*
F = {f(x,α), α∈Λ}
Classification
Support Vector Machine
SVM
Separating Hyperplane
x - input patterns
w - weight vector
b - threshold
$f_{w,b}(x) = \mathrm{sign}(w \cdot x + b)$
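As a minimal sketch of the decision function above, the following evaluates sign(w · x + b) for a few points. The values of `w`, `b` and `X` are made up for illustration, and assigning points that lie exactly on the hyperplane to class +1 is an assumption of this sketch.

```python
import numpy as np

def linear_decision(X, w, b):
    """Return f(x) = sign(w . x + b) for each row of X, as +1 or -1."""
    scores = X @ w + b
    # Points exactly on the hyperplane (score 0) go to +1 here;
    # this tie-breaking rule is an assumption of the sketch.
    return np.where(scores >= 0, 1, -1)

w = np.array([1.0, -1.0])   # hypothetical weight vector
b = 0.5                     # hypothetical threshold
X = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
labels = linear_decision(X, w, b)
print(labels)
```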
How powerful are linear decision functions?
VC-dimension in classification
Shattering
• The VC-dimension is the largest number of samples which can be discriminated by the function for all possible class memberships; such a set of samples is said to be shattered.
[Figure: point configurations for 3 samples and for 4 samples]
The VC-dimension h of the linear decision functions in R^N equals N+1.
That is, the power of linear decision functions is beyond our control…?
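The statement h = N+1 can be checked numerically in the plane (N = 2). The sketch below tests every labelling of a point set for linear separability with a perceptron. Using the perceptron with an epoch cap as the separability test is a choice made here for brevity, not something taken from the slides, and it is only a heuristic for larger sets. The point coordinates are illustrative.

```python
import itertools
import numpy as np

def linearly_separable(X, y, max_epochs=1000):
    """Perceptron check: True if a separating hyperplane is found.

    The perceptron converges on separable data; reaching max_epochs is
    taken as evidence of non-separability (adequate for tiny point sets).
    """
    Xa = np.hstack([X, np.ones((len(X), 1))])  # extra 1 absorbs the threshold b
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:      # misclassified or on the boundary
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def is_shattered(X):
    """True if every +1/-1 labelling of the rows of X is linearly separable."""
    return all(
        linearly_separable(X, np.array(labels))
        for labels in itertools.product([-1, 1], repeat=len(X))
    )

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
shattered_three = is_shattered(three)
shattered_four = is_shattered(four)
print(shattered_three, shattered_four)
```

Three points in general position are shattered, while the four corners of a square are not: the labelling that puts opposite corners in the same class (XOR) has no separating line.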
Support Vector Machine
Intuition:
Large Margin is good.
Decision function is a margin hyperplane (*):
$f(x, \{w, b\}) = \begin{cases} 1, & \text{if } (w \cdot x - b) \ge 1 \\ -1, & \text{if } (w \cdot x - b) \le -1 \end{cases}$
Lemma: Given that the N-dimensional data {x1, x2, …, xL} lie inside a finite enclosing sphere of radius R, the VC-dimension h of the margin-based decision functions (*) satisfies the inequality:
$h \le \min\left(R^2 \|w\|^2,\ N\right) + 1$
The complexity (VC-dimension) can be controlled with ||w||^2 !!
Separating Hyperplane: Max Margin
$f_{w,b}(x) = \mathrm{sign}(w \cdot x + b)$
To maximize the margin ρ, one would like to minimize ||w||, or ||w||^2.
$f_{w,b}(x) = \begin{cases} 1, & \text{if } (w \cdot x - b) \ge 1 \\ -1, & \text{if } (w \cdot x - b) \le -1 \end{cases}$
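To see why a small ||w|| means a large margin, the sketch below computes the geometric margin of a hyperplane on two made-up points. It assumes the sign convention w · x + b (the margin definition above is written with w · x − b). When the hyperplane is in canonical form, that is min y_i (w · x_i + b) = 1, the margin equals 1/||w||.

```python
import numpy as np

def geometric_margin(X, y, w, b):
    """Smallest signed distance y_i (w . x_i + b) / ||w|| over the samples."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

X = np.array([[2.0, 0.0], [0.0, 0.0]])   # illustrative samples
y = np.array([1, -1])
w = np.array([1.0, 0.0])                 # canonical form: min y_i (w . x_i + b) = 1
b = -1.0
rho = geometric_margin(X, y, w, b)
# Rescaling (w, b) leaves the hyperplane, and hence the margin, unchanged.
rho_scaled = geometric_margin(X, y, 3 * w, 3 * b)
print(rho, rho_scaled, 1.0 / np.linalg.norm(w))
```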
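Optimization Problem, Lagrangian
$\min \tfrac{1}{2}\|w\|^2$
subject to $y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, L.$
$L_P = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{L} \alpha_i \left( y_i (w \cdot x_i + b) - 1 \right)$
⇒ $\sum_{i=1}^{L} \alpha_i y_i = 0, \qquad w = \sum_{i=1}^{L} \alpha_i y_i x_i$
KKT conditions:
$\alpha_i \left( y_i (w \cdot x_i + b) - 1 \right) = 0, \quad \forall i$
⇒ $\alpha_i = 0$, or $\alpha_i > 0$ -- Support Vectors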
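As a worked step, substituting the two stationarity conditions (the sum of α_i y_i is zero, and w is the α-weighted sum of y_i x_i) back into L_P eliminates w and b. This is a sketch of the standard derivation, written with the same symbols:

```latex
\begin{aligned}
L_P &= \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{L} \alpha_i \bigl( y_i (w \cdot x_i + b) - 1 \bigr) \\
    &= \tfrac{1}{2}\, w \cdot w \;-\; w \cdot \sum_{i=1}^{L} \alpha_i y_i x_i \;-\; b \sum_{i=1}^{L} \alpha_i y_i \;+\; \sum_{i=1}^{L} \alpha_i \\
    &= \tfrac{1}{2}\, w \cdot w \;-\; w \cdot w \;-\; 0 \;+\; \sum_{i=1}^{L} \alpha_i \\
    &= \sum_{i=1}^{L} \alpha_i \;-\; \tfrac{1}{2} \sum_{i,j=1}^{L} \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j)
\end{aligned}
```

The result depends on the data only through the dot products x_i · x_j.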
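Optimization Problem: Dual Variables.
$L_D = \sum_{i=1}^{L} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{L} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
subject to $\sum_{i=1}^{L} \alpha_i y_i = 0, \qquad \alpha_i \ge 0, \quad i = 1, \dots, L$
$f(x) = \mathrm{sign}(w \cdot x + b) = \mathrm{sign}\left( \sum_{i=1}^{L} y_i \alpha_i (x_i \cdot x) + b \right)$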
• inputs are presented as dot products
• Quadratic Programming
• convex problem, nice theoretical field
• unique solution, good solvers
Soft margin hyperplane: allowing for the training error.
$\min \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{L} \xi_i$
subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, L.$
$L_D = \sum_{i=1}^{L} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{L} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
subject to $\sum_{i=1}^{L} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, L$
C - regularization parameter:
trade-off between margin maximization & training error
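The role of the slack variables can be illustrated by evaluating the soft-margin objective for a fixed hyperplane. In this sketch each ξ_i is set to the smallest value that satisfies both constraints, max(0, 1 − y_i (w · x_i + b)). The data, `w`, `b` and `C` are made-up values.

```python
import numpy as np

def soft_margin_objective(X, y, w, b, C):
    """Return (objective, slacks) for 1/2 ||w||^2 + C * sum(xi_i)."""
    # Smallest slack satisfying y_i (w . x_i + b) >= 1 - xi_i and xi_i >= 0.
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * (w @ w) + C * slacks.sum(), slacks

X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]])
y = np.array([1, -1, 1, 1])        # the last sample lies on the wrong side
w = np.array([1.0, 0.0])
b = 0.0
objective, slacks = soft_margin_objective(X, y, w, b, C=10.0)
print(slacks, objective)
```

Samples outside the margin get zero slack, a sample inside the margin gets a slack below 1, and a misclassified sample gets a slack above 1. A larger C makes each unit of slack more expensive relative to the margin term.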
Support Vector Terminology
$f(x) = \mathrm{sign}\left( \sum_{i=1}^{L} y_i \alpha_i (x_i \cdot x) + b \right)$
αi = 0: Normal Samples
0 < αi < C: Support Vectors
αi = C: Support Vectors, untypical or noisy
C - regularization parameter:
trade-off between margin maximization & training error
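The three cases can be read directly off the dual weights once a model is trained. The sketch below sorts samples by their α_i. The α values are invented for illustration, the tolerance `tol` is an assumption needed for floating-point comparisons, and the label "bounded support vector" is a name chosen here for the αi = C case.

```python
import numpy as np

def categorize(alphas, C, tol=1e-8):
    """Label each sample by the value of its dual weight alpha_i."""
    labels = []
    for a in alphas:
        if a <= tol:
            labels.append("normal sample")            # alpha_i = 0
        elif a >= C - tol:
            labels.append("bounded support vector")   # alpha_i = C: untypical or noisy
        else:
            labels.append("support vector")           # 0 < alpha_i < C
    return labels

alphas = np.array([0.0, 0.37, 1.0, 0.0, 1.0])  # made-up values for illustration
kinds = categorize(alphas, C=1.0)
print(kinds)
```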
Support Vector Algorithm: Kernel Trick
If data is not linearly separable, it can be projected into a (sufficiently) high-dimensional space. There it is much easier to separate!
$x \to \Phi(x)$ ? The algorithm was formulated in terms of dot products!
$x \cdot x' \to \Phi(x) \cdot \Phi(x')$
$x \cdot x' \to K(x, x')$
Example.
$\Phi: (x_1, x_2) \to (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)$
$K(x, x') = (x \cdot x')^2$
$K(x, x') = \Phi(x) \cdot \Phi(x')$ for some Φ ⇔
• K is symmetric
• K is positive-definite
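The quadratic example can be verified numerically: evaluating the kernel in the 2-D input space gives the same number as the dot product of the explicitly mapped 3-D vectors. The sketch assumes the feature map (x1, x2) → (x1², √2·x1·x2, x2²); the two test vectors are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit feature map (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def k_poly2(x, x_prime):
    """Kernel K(x, x') = (x . x')^2, evaluated in the 2-D input space."""
    return float(np.dot(x, x_prime)) ** 2

x = np.array([1.0, 2.0])
x_prime = np.array([3.0, -1.0])
lhs = k_poly2(x, x_prime)                    # (1*3 + 2*(-1))^2
rhs = float(np.dot(phi(x), phi(x_prime)))    # same quantity via the feature space
print(lhs, rhs)
```

The kernel never builds the mapped vectors, which is what makes very high-dimensional feature spaces affordable.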
Nonlinear SVM. Kernel trick.
$f(x) = w \cdot x + b \;\to\; f(x) = \sum_{i=1}^{L} y_i \alpha_i K(x, x_i) + b$
Any linear algorithm formulated in terms of dot products of input data can be modified into a non-linear one using the kernel trick.
• Support Vector Machine
• Kernel Ridge Regression
• Kernel Principal Component Analysis
• Kernel Fisher Discriminant Analysis
• etc.
Nonlinear SVM. Kernel types.
• Polynomial kernel: $K(x, y) = (x \cdot y + 1)^p$
• Radial Basis Function kernel: $K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}}$
$f(x) = \mathrm{sign}\left( \sum_{i \in SV} y_i \alpha_i K(x, x_i) + b \right)$
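Both kernels are short to write down in code. The following sketch computes them for all pairs of rows of two sample matrices and checks, on three made-up points, that the RBF Gram matrix is symmetric with positive eigenvalues. The function names and default parameter values are choices made for this example.

```python
import numpy as np

def polynomial_kernel(X, Y, p=2):
    """K(x, y) = (x . y + 1)^p for all pairs of rows of X and Y."""
    return (X @ Y.T + 1.0) ** p

def rbf_kernel(X, Y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])   # illustrative samples
K = rbf_kernel(X, X, sigma=1.0)
K_poly = polynomial_kernel(X, X, p=2)
# A symmetric positive-definite Gram matrix has only positive eigenvalues.
min_eigenvalue = np.linalg.eigvalsh(K).min()
print(K)
print(min_eigenvalue)
```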
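Nonlinear SVM. Optimization problem.
$L_D = \sum_{i=1}^{L} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{L} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
subject to $\sum_{i=1}^{L} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, L$
$f(x) = \mathrm{sign}\left( \sum_{i \in SV} y_i \alpha_i K(x, x_i) + b \right)$
$b = y_j - \sum_{i=1}^{L} y_i \alpha_i K(x_i, x_j)$, for any j with $0 < \alpha_j < C$
K is positive-definite, so this is still a QP problem, hence unique solution!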
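A small numerical sketch of the kernelized dual follows. It is not a general QP solver and not the method used by any particular SVM package. To keep it short, the threshold b is absorbed by using K + 1 as the kernel, which is a simplification assumed here: it removes the equality constraint and leaves only the box constraints 0 ≤ α_i ≤ C, so plain coordinate ascent applies. The XOR data, `C`, `sigma` and the number of sweeps are illustrative choices.

```python
import numpy as np

def rbf_gram(X, Y, sigma):
    """RBF kernel values for all pairs of rows of X and Y."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def train_dual(X, y, C=10.0, sigma=0.5, sweeps=200):
    """Maximise L_D over 0 <= alpha_i <= C by coordinate ascent.

    Simplification (assumption of this sketch): the threshold b is absorbed
    by using K + 1 as the kernel, which removes the constraint
    sum_i alpha_i y_i = 0 and leaves only the box constraints.
    """
    K = rbf_gram(X, X, sigma) + 1.0
    alpha = np.zeros(len(y))
    for _ in range(sweeps):
        for i in range(len(y)):
            grad_i = 1.0 - y[i] * np.sum(alpha * y * K[i])   # dL_D / d alpha_i
            # Exact maximiser along coordinate i, clipped to the box.
            alpha[i] = np.clip(alpha[i] + grad_i / K[i, i], 0.0, C)
    return alpha

def predict(X_train, y, alpha, X_new, sigma=0.5):
    """sign(sum_i y_i alpha_i K'(x, x_i)) with the bias-absorbing kernel K' = K + 1."""
    K = rbf_gram(X_new, X_train, sigma) + 1.0
    return np.where(K @ (alpha * y) >= 0, 1, -1)

X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])   # XOR labels: not linearly separable
alpha = train_dual(X, y)
predictions = predict(X, y, alpha, X)
print(alpha, predictions)
```

With a linear kernel the XOR labelling cannot be separated, while the RBF kernel classifies all four points correctly.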
Support Vector Machine
http://www.geokernels.org/teaching/svm
SVM: Software.
Examples
SV Porosity Mapping
Data description
200 training samples
“+” 94 validation samples
minimum = 0.0
median = 0.515
max = 1.000
mean = 0.53
variance = 0.048
The original continuous data were transformed into 2-class data according to the
0.5 threshold:
If fpor ≥ 0.5, then y = +1
If fpor < 0.5, then y = -1
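The 2-class transformation is a single thresholding step. A minimal sketch, with `fpor` values invented for illustration rather than taken from the data set:

```python
import numpy as np

def to_two_class(fpor, threshold=0.5):
    """Map continuous values to labels: +1 if fpor >= threshold, else -1."""
    return np.where(np.asarray(fpor) >= threshold, 1, -1)

fpor = np.array([0.0, 0.49, 0.5, 0.515, 1.0])   # illustrative values, not the real data
y = to_two_class(fpor)
print(y)
```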
SV Porosity Mapping
Data: 2-class transformation
• class “+1”, ≥ 0.5
o class “-1”, < 0.5
+ validation data
SV Porosity Mapping
Data loading
150 training samples
50 testing samples
Prediction Grid
SV Porosity Mapping
Hyper-parameters tuning
• Gaussian RBF kernel is selected.
• Two hyper-parameters: C and σ.
• Grid search: testing error analysis for every pair of parameters.
$K(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}}$
The range of σ
The range of log(C)
Start calculation using testing data
min(σ) - minimum distance between data samples
max(σ) - max distance between data samples
min(C) - some small value, 1 or less
max(C) – depends on data, 1e3-1e6
Save results to file
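The ranges listed above can be turned into a search grid as follows. This sketch only builds the grid of (σ, C) pairs; training a model and recording the testing error for each pair is left out. The number of grid steps, the linear spacing in σ, and the random stand-in for the training locations are assumptions of the example.

```python
import numpy as np

def parameter_grid(X, n_sigma=10, n_C=7, C_min=1.0, C_max=1e6):
    """Build candidate sigma and C values for a grid search."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    pair = dist[np.triu_indices(len(X), k=1)]       # each pair of samples once
    # sigma runs from the minimum to the maximum distance between samples.
    sigmas = np.linspace(pair.min(), pair.max(), n_sigma)
    # C is spaced evenly in log(C).
    Cs = np.logspace(np.log10(C_min), np.log10(C_max), n_C)
    return sigmas, Cs

rng = np.random.default_rng(0)
X = rng.random((150, 2))            # stand-in for the 150 training locations
sigmas, Cs = parameter_grid(X)
grid = [(s, C) for s in sigmas for C in Cs]   # every (sigma, C) pair to evaluate
print(len(grid), sigmas[0], sigmas[-1], Cs)
```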
SV Porosity Mapping
Hyper-parameters tuning
Gaussian RBF kernel bandwidth
Log(C)
Training error surface
• increases with kernel bandwidth
• decreases with C
SV Porosity Mapping
Hyper-parameters tuning
Gaussian RBF kernel bandwidth
Log(C)
Testing error surface
Complex structure, but generally, if the range is selected reasonably and
data splitting is correct, there exists a region of minima: the optimal values.
SV Porosity Mapping
Hyper-parameters tuning
Gaussian RBF kernel bandwidth
Log(C)
Normalized number of Support Vectors
Represents the complexity of the model: a more complex model has more SVs.
Hyper-parameters selection
What are the parameters for the final model?
Training error
Testing error
Normalized NSV
C = 3, σ = 0.09
Hyper-parameters selection
What are the parameters for the final model?
Training error
Testing error
Normalized NSV
C = 18, σ = 0.13
SV Porosity Mapping
Dependence on Parameters
C = 10
σ = 0.02, 0.06, 0.1, 0.2, 0.3, 0.4, 0.5
SV Porosity Mapping
Dependence on Parameters
σ = 0.1
C=0.1
C=1
C=10
C=100
SV Porosity Mapping
Predictive Mapping and Support Vectors
Predictive mapping + MARGIN
+ Normal SV, 0<α<C.
+ Critical SV, α=C.
Applications for Natural Hazards
• Topo-climatic mapping
• Landslides
• Snow avalanches prediction
Weather observations
• 110 meteo stations
• Measurements, up to every 10min
• Altitude: 270m-3580m
• Temperature
• Precipitation
• Humidity
• Air Pressure
• Wind Speed
• Insolation
• Etc.
Spatio-temporal prediction mapping?
Temperature Inversion
Can only be explained using terrain surface characteristics (convexity, slope, etc.)
Physical Models at local scales
• Terrain roughness is too high for physical models, computational speed,
precision, uncertainty estimation…
PDE on smoothed terrain + empirical correction
$v_{Model}(x, y) = v_{Physical} + c_{Ridges} + c_{Canyons} + c_{Values} + c_{FlatAreas} + c_{Sea} + \dots$
Can this information be extracted directly from data?
Modelling Scheme
Data
DEM
FEATURES
….
Non-linear dependencies
Noise, Outliers
Feature
Selection/Extraction
Predictive Modeling
with
Machine Learning
Spatio-Temporal Mapping
Analysis Decision Support
Temperature vs. Elevation
• Mean Monthly: Linear
• Mean Daily: Locally Linear, Regionalized
• Mean Hourly: Non-linear, Regionalized
• Mean Hourly: Explained Temperature Inversion
DEM Features
• Large Scale Difference of Gaussians
• Short Scale Difference of Gaussians
• Slope
• Local Variance
Temperature Inversion Mapping
Probability of Inversion
Temperature
Visual Validation
Operational setting
http://www.geokernels.org/services/meteo
Applications
• Topo-climatic mapping
• Landslides
• Snow avalanches prediction
• Remote Sensing
SFI (SRC-ID 07/SRC/I1168)
Landslide inventory
Method I: Probability density estimation
[Figure axes: Factor 1, Factor 2]
Model vs. Training Data
What is wrong with this susceptibility map?
Method II: Classification
[Figure axes: Factor 1, Factor 2]
Stable
Unstable
Predictive models
A model should fit the observed landslides, and …
Applications
• Topo-climatic mapping
• Landslides
• Snow avalanches prediction
• Remote Sensing
• 1842 days of recorded weather conditions (11 features), 1991-2007
• 1135 days with documented avalanche events
• 797 safe days, 245 with avalanches
• 260 days unknown (mainly bad weather)
Lochaber, Scotland
Validation data: 72 events, winters 2006-2007
Training data: 722 events, winters 1991-2005
Spatial Data
• 47 avalanche paths, x, y, z, slope, aspect, date
• DEM, 10m resolution, 5km x 5km
• Snow index 0-10
• No-settle cumulative Snow over a season
• Rain at 900m binary [0, 1]
• Snow drift binary [0, 1]
• Air temperature -10,… +10
• Wind speed 0, … 25 m/s
• Wind Direction 0°-360°
• Cloudiness [25, 50, 75, 100]
• Foot penetration 0, … 50
• Snow temperature 0, … -10
• Insolation cumulative over season
Lochaber weather
observations
Z Slope Aspect: SN-WE [Spatialized Weather Features] +1
…over all the documented avalanche events…
…over all the 47 gullies for documented days without avalanches…
720
44000
4 + 22 = 26
Z Slope Aspect: SN-WE [Spatialized Weather Features] +1
Z Slope Aspect: SN-WE [Spatialized Weather Features] -1
Z Slope Aspect: SN-WE [Spatialized Weather Features] -1
Classification Problem
Wind Speed and Direction
Terrain-corrected wind direction:
Wind speed weighting:
Correction for slope:
Correction for curvature:
Snow accumulation
If Snow index > 0
If Snow drift = 1
Snow accumulation = F(Wind Speed, Wind Direction)
Simple heuristics based on wind speed gradients
Results
DEM Avalanche Danger
Results
wind
Animation in 3D
Applications
• Topo-climatic mapping
• Landslides
• Snow avalanches prediction
• Remote Sensing
Inhabited areas
Testing Training
Ground truth is known: population census
Inhabited areas: examples
Pre-processing and Features
Mathematical morphology (image closing)
Pre-processing and Features
SIFT
Pre-processing and Features
Gaussian Mixture Model
Pre-processing and Features
Testing: inhabited areas
Inhabited areas
Summary and Conclusions
• Statistical Learning Theory
• Classification Problem
• Support Vector Machines and Kernel Methods
• GeoSpatial Data Classification with SVM