Kernel based models for geo- and environmental sciences. Alexei Pozdnoukhov, National Centre for Geocomputation, National University of Ireland, Maynooth (Ireland)

Kernel Methods

(Support Vector Machines)

for

Environmental and Geo-Sciences

Alexei Pozdnoukhov

Lecturer

National Centre for Geocomputation

National University of Ireland, Maynooth

+353 (0)1 7086146

[email protected]

Machine Learning

• Environmental monitoring: the current rate of data acquisition is about 0.5 Tb/day (increasing at 82% per year).

• Remote Sensing Data: NASA holds more than 10 Pb of data, increasing by 10x every 5 years. The ESA data stream is about 0.5 Tb/year, likely to increase by 20x in the next 5 years.

• GIS, DEM

• Sensor Networks

• Field Measurements

Learning From Data

• Clustering (Cluster 1, Cluster 2)

• Dimensionality Reduction

• Classification (Binary, Multi-Class)

• Regression (input x, output y)

Curse of Dimensionality

Sensor Network

Geographical Information

Wireless Sensor Network

Remote Sensing

Batteries Recharged at WSN

Need more data?

Human activity

Detecting Events

Observed environment: high-dimensional input space

Events: very rare, extreme

• High-dimensional spaces: risk of overfitting

• Robust to noise in both inputs/outputs

• Non-linear and non-parametric

• Computationally effective for real-time processing and LBS dissemination

Curse of Dimensionality

Statistical Learning Theory

• Models that can generalise from data

• Good predictive abilities

• Complexity can be controlled

Statistical Learning Theory

• Occam’s Razor Principle (14th century)

One should not increase, beyond what is necessary, the number of entities required to explain anything.

• When many solutions are available for a given problem, we should select the simplest one.

• But what do we mean by simple?

• We will use prior knowledge of the problem at hand to define what a simple solution is (example of a prior: smoothness).

Occam’s Razor and Classification

                 Model 1   Model 2   Model 3

Complexity       √√        √         ××

Training error   ××        √         √√

Overall          -         √         -

Structural Risk Minimization

• Define a set of learning functions, {S}

• Order it in terms of complexity, {S1, …, SN}

• Select the optimal S*

F = {f(x,α), α∈Λ}
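The trade-off behind this ordering is usually written as a bound on the expected risk. The slide does not show it; the form below is the standard bound from statistical learning theory, given here as a sketch. The symbols are assumptions for illustration: $R$ is the expected risk, $R_{\mathrm{emp}}$ the training (empirical) risk for a 0/1 loss, $L$ the number of samples, $h$ the VC-dimension of the chosen subset $S_k$, and the bound holds with probability $1 - \eta$.

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  + \sqrt{\frac{h\left(\ln\frac{2L}{h} + 1\right) - \ln\frac{\eta}{4}}{L}}
```

A richer subset lowers the first term and raises the second, so the optimal S* is the one that minimizes the sum.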

Classification

Support Vector Machine

SVM

Separating Hyperplane

x - input patterns

w - weight vector

b - threshold

$$f_{w,b}(x) = \mathrm{sign}(w \cdot x + b)$$
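As a minimal sketch, the linear decision function can be written in a few lines of Python. The helper names `dot` and `decide`, the example line and the tie-breaking rule at zero are assumptions made for illustration, not part of the lecture.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def decide(w, b, x):
    """Return +1 or -1 according to sign(w . x + b); a value of 0 goes to +1."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical 2-D example: the line x1 + x2 = 1 splits the plane.
w, b = (1.0, 1.0), -1.0
print(decide(w, b, (2.0, 2.0)))   # point on the positive side
print(decide(w, b, (0.0, 0.0)))   # point on the negative side
```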

How powerful are linear decision functions?

VC-dimension in classification

Shattering

• The VC-dimension is the largest number of samples that the function can discriminate for all possible class memberships; such a set of samples is said to be shattered.

[Figure: shattering examples with 3 samples and with 4 samples]

VC-dimension h of the linear decision functions in $\mathbb{R}^N$ equals N + 1.

That is, the power of linear decision functions is beyond our control…?
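The N + 1 statement can be checked numerically in the plane (N = 2). The sketch below tries every labelling of a point set and looks for a separating line with a plain perceptron. The perceptron, the point sets and the epoch limit are assumptions for illustration; failing to find a separator within the limit is evidence, not a proof.

```python
from itertools import product


def perceptron(points, labels, max_epochs=1000):
    """Return (w, b) separating the labelled points, or None if none was found."""
    w, b = [0.0] * len(points[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None


def is_shattered(points):
    """True if a separating line was found for every +1/-1 labelling."""
    return all(perceptron(points, labels) is not None
               for labels in product((-1, 1), repeat=len(points)))


three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]               # not collinear
four = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]    # corners of a square
print(is_shattered(three), is_shattered(four))
```

The four corners fail on the XOR labelling, for which no separating line exists.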

Support Vector Machine

Intuition:

Large Margin is good.

Decision function is a margin hyperplane (*):

$$f(x, \{w, b\}) = \begin{cases} 1, & (w \cdot x) - b \ge 1 \\ -1, & (w \cdot x) - b \le -1 \end{cases} \qquad (*)$$

Lemma: Given that the N-dimensional data $\{x_1, x_2, \dots, x_L\}$ lie inside a finite enclosing sphere of radius R, the VC-dimension h of the margin-based decision functions (*) satisfies the inequality:

$$h \le \min\left(R^2 \|w\|^2,\; N\right) + 1$$

The complexity (VC-dimension) can be controlled with $\|w\|^2$!

Separating Hyperplane: Max Margin

$$f_{w,b}(x) = \mathrm{sign}((w \cdot x) + b)$$

To maximize the margin ρ, one would like to minimize $\|w\|$, or $\|w\|^2$.

$$f_{w,b}(x) = \begin{cases} 1, & (w \cdot x) - b \ge 1 \\ -1, & (w \cdot x) - b \le -1 \end{cases}$$
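The link between the margin and the norm of w follows in one step. The sketch below uses the margin hyperplanes of this slide; the points $x_{+}$ and $x_{-}$ are assumed names for two samples lying exactly on the two margin hyperplanes.

```latex
(w \cdot x_{+}) - b = 1, \qquad (w \cdot x_{-}) - b = -1
\;\Rightarrow\; w \cdot (x_{+} - x_{-}) = 2
\;\Rightarrow\; \rho = \frac{w}{\|w\|} \cdot (x_{+} - x_{-}) = \frac{2}{\|w\|}
```

The margin is the projection of $x_{+} - x_{-}$ onto the unit normal $w / \|w\|$, so a larger margin corresponds to a smaller $\|w\|$.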

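Optimization Problem, Lagrangian

$$\min_{w, b} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, L$$

$$L_p = \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{L} \alpha_i \left( y_i (w \cdot x_i + b) - 1 \right)$$

$$\Rightarrow \quad \sum_{i=1}^{L} \alpha_i y_i = 0, \qquad w = \sum_{i=1}^{L} \alpha_i y_i x_i$$

KKT conditions:

$$\alpha_i \left( y_i (w \cdot x_i + b) - 1 \right) = 0, \quad \forall i$$

$$\Rightarrow \quad \alpha_i = 0 \quad \text{or} \quad \alpha_i > 0 \ \text{(Support Vectors)}$$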

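Optimization Problem: Dual Variables.

$$\max_{\alpha} \; L_D(\alpha) = \sum_{i=1}^{L} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{L} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$

$$\text{subject to} \quad \sum_{i=1}^{L} \alpha_i y_i = 0, \qquad \alpha_i \ge 0, \quad i = 1, \dots, L$$

$$f(x) = \mathrm{sign}(w \cdot x + b) = \mathrm{sign}\left( \sum_{i=1}^{L} y_i \alpha_i (x \cdot x_i) + b \right)$$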

• inputs are presented as dot products

• Quadratic Programming

• convex problem, nice theoretical field

• unique solution, good solvers
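A tiny worked case shows how the dual variables give back the primal solution. The sketch below does not solve the QP. It takes a two-point toy problem whose optimum α = (0.5, 0.5) can be found by hand, then recovers w and b and evaluates the dual objective. The data and the function names are assumptions for illustration.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, X, y):
    """L_D = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    n = len(X)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


def primal_from_dual(alpha, X, y):
    """Recover w = sum_i alpha_i y_i x_i, and b from one support vector."""
    dim = len(X[0])
    w = [sum(alpha[i] * y[i] * X[i][d] for i in range(len(X))) for d in range(dim)]
    sv = next(i for i, a in enumerate(alpha) if a > 0)
    b = y[sv] - dot(w, X[sv])   # from y_sv (w . x_sv + b) = 1 with y_sv = +1 or -1
    return w, b


# Toy problem: two points on the first axis, one per class.
X = [(1.0, 0.0), (-1.0, 0.0)]
y = [1, -1]
alpha = [0.5, 0.5]

w, b = primal_from_dual(alpha, X, y)
print(w, b)
print(dual_objective(alpha, X, y))   # equals 0.5 * ||w||^2 at the optimum
```

Only dot products of the inputs appear in `dual_objective`, which is the property the kernel trick relies on.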

Soft margin hyperplane: allowing for the training error.

$$\min_{w, b, \xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{L} \xi_i$$

$$\text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, L$$

Dual problem:

$$\max_{\alpha} \; L_D(\alpha) = \sum_{i=1}^{L} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{L} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$

$$\text{subject to} \quad \sum_{i=1}^{L} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, L$$

C - regularization parameter: trade-off between margin maximization and training error
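The role of C is easy to see by evaluating the soft margin objective for a fixed hyperplane. In the sketch below the slack of each sample is the smallest value that satisfies its constraint, max(0, 1 - y_i (w · x_i + b)). The 1-D data set and the fixed (w, b) are hypothetical.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for 1/2 ||w||^2 + C * sum of slacks.

    For fixed (w, b) the smallest feasible slack of sample i is
    max(0, 1 - y_i (w . x_i + b)).
    """
    slacks = [max(0.0, 1.0 - y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + b))
              for x_i, y_i in zip(X, y)]
    norm_sq = sum(wj * wj for wj in w)
    return 0.5 * norm_sq + C * sum(slacks), slacks


# Hypothetical 1-D data; the last sample lies on the wrong side of the boundary.
X = [(2.0,), (1.0,), (-1.0,), (-2.0,), (0.5,)]
y = [1, 1, -1, -1, -1]
for C in (0.1, 10.0):
    value, slacks = soft_margin_objective((1.0,), 0.0, X, y, C)
    print(C, value, slacks)
```

With a small C the misplaced sample costs little and a wide margin is kept; with a large C the same slack dominates the objective.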

Support Vector Terminology

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{L} y_i \alpha_i (x \cdot x_i) + b \right)$$

$0 < \alpha_i < C$: Support Vectors

$\alpha_i = 0$: Normal Samples

$\alpha_i = C$: Support Vectors, untypical or noisy

C - regularization parameter: trade-off between margin maximization and training error
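The three cases can be read directly from the dual weights returned by a solver. A minimal sketch, assuming a hypothetical list of weights and a small numerical tolerance `tol` (solvers rarely return exact zeros):

```python
def sv_category(alpha_i, C, tol=1e-8):
    """Label a training sample by its dual weight alpha_i."""
    if alpha_i <= tol:
        return "normal sample"                         # alpha_i = 0
    if alpha_i >= C - tol:
        return "support vector (untypical or noisy)"   # alpha_i = C
    return "support vector"                            # 0 < alpha_i < C


# Hypothetical dual weights, with C = 1.
alphas = [0.0, 0.3, 1.0, 0.0, 0.75]
print([sv_category(a, C=1.0) for a in alphas])
```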

Support Vector Algorithm: Kernel Trick

If data is not linearly separable, it can be projected into a (sufficiently) high dimensional space. There it is much easier to separate!

$x \to \Phi(x)$? The algorithm was formulated in terms of dot products!

$$x \cdot x' \to \Phi(x) \cdot \Phi(x'), \qquad x \cdot x' \to K(x, x')$$

Example.

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \to \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}, \qquad K(x, x') = (x \cdot x')^2$$

$K(x, x') = \Phi(x) \cdot \Phi(x')$ ⇔

• K is symmetric

• K is positive-definite
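The example can be verified numerically: the dot product of the mapped vectors equals the squared dot product of the original ones. A minimal sketch; the function names and the two test points are assumptions for illustration.

```python
import math


def phi(x):
    """Explicit feature map for 2-D inputs: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def kernel(x, z):
    """Homogeneous quadratic kernel K(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
lhs = sum(a * b for a, b in zip(phi(x), phi(z)))   # dot product in feature space
rhs = kernel(x, z)                                 # same value without the mapping
print(lhs, rhs)
```

The kernel evaluates the feature-space dot product at the cost of one 2-D dot product, which is why the mapping never has to be computed explicitly.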

Nonlinear SVM. Kernel trick.

$$f(x) = w x + b \quad \to \quad f(x) = \sum_{i=1}^{L} y_i \alpha_i K(x, x_i) + b$$

Any linear algorithm formulated in terms of dot products of input data can be modified into a non-linear one using the kernel trick.

• Support Vector Machine

• Kernel Ridge Regression

• Kernel Principal Component Analysis

• Kernel Fisher Discriminant Analysis

• etc.

Nonlinear SVM. Kernel types.

• Polynomial kernel: $K(x, y) = (x \cdot y + 1)^p$

• Radial Basis Function kernel: $K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}}$

$$f(x) = \mathrm{sign}\left( \sum_{i \in SV} y_i \alpha_i K(x, x_i) + b \right)$$
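Both kernels are short functions in code. The sketch below also builds the matrix of pairwise kernel values (the Gram matrix) that a QP solver would consume. The function names, default parameters and sample points are assumptions for illustration.

```python
import math


def polynomial_kernel(x, y, p=2):
    """K(x, y) = (x . y + 1)^p."""
    return (sum(a * b for a, b in zip(x, y)) + 1.0) ** p


def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(kernel, X):
    """Matrix of pairwise kernel values K(x_i, x_j)."""
    return [[kernel(xi, xj) for xj in X] for xi in X]


X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
for row in gram_matrix(rbf_kernel, X):
    print([round(v, 3) for v in row])
```

The RBF Gram matrix has ones on the diagonal and is symmetric; entries decay towards zero as samples move apart relative to σ.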

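Nonlinear SVM. Optimization problem.

$$\max_{\alpha} \; L_D(\alpha) = \sum_{i=1}^{L} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{L} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

$$\text{subject to} \quad \sum_{i=1}^{L} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \dots, L$$

$$f(x) = \mathrm{sign}\left( \sum_{i \in SV} y_i \alpha_i K(x, x_i) + b \right)$$

$$b = y_j - \sum_{i=1}^{L} y_i \alpha_i K(x_i, x_j) \quad \text{for any support vector } x_j \text{ with } 0 < \alpha_j < C$$

K is positive-definite, so this is still a QP problem, hence unique solution!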
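The kernel decision function and the threshold b can be illustrated on the XOR problem, which no line separates in the plane. The sketch below does not run a QP solver. The dual weights α_i = 1/8 are taken as given (an assumption for this example); they satisfy Σ α_i y_i = 0 and place every sample exactly on the margin, so the KKT conditions hold for any C > 1/8.

```python
def poly2(x, z):
    """Polynomial kernel of degree 2: K(x, z) = (x . z + 1)^2."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** 2


def bias(alpha, X, y, kernel, j):
    """b = y_j - sum_i alpha_i y_i K(x_i, x_j) for a support vector x_j."""
    return y[j] - sum(a * yi * kernel(xi, X[j]) for a, yi, xi in zip(alpha, y, X))


def decision(x, alpha, X, y, b, kernel):
    """f(x) = sign(sum_i alpha_i y_i K(x_i, x) + b), summed over support vectors."""
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X) if a > 0)
    return 1 if s + b >= 0 else -1


# XOR labelling of the four corners of a square.
X = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
y = [-1, 1, 1, -1]
alpha = [0.125] * 4   # assumed dual solution, not computed here

b = bias(alpha, X, y, poly2, j=0)
print(b, [decision(x, alpha, X, y, b, poly2) for x in X])
```

All four samples are support vectors here, and the resulting decision function reproduces the XOR labels.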

Support Vector Machine

http://www.geokernels.org/teaching/svm

SVM: Software.

Examples

SV Porosity Mapping

Data description

200 training samples

“+” 94 validation samples

minimum = 0.0

median = 0.515

max = 1.000

mean = 0.53

variance = 0.048

The original continuous data were transformed into 2-class data according to the 0.5 threshold:

If fpor ≥ 0.5, then y = +1

If fpor < 0.5, then y = -1
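The transformation is a one-line rule in code. A minimal sketch; the values below are hypothetical and are not the lecture's porosity data.

```python
def to_two_class(values, threshold=0.5):
    """Map continuous values to labels: +1 if value >= threshold, else -1."""
    return [1 if v >= threshold else -1 for v in values]


# Hypothetical porosity-like values in [0, 1].
fpor = [0.0, 0.49, 0.5, 0.515, 1.0]
print(to_two_class(fpor))   # [-1, -1, 1, 1, 1]
```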


Data: 2-class transformation

• class “+1”, ≥ 0.5

o class “-1”, < 0.5

+ validation data


Data loading

150 training samples

50 testing samples

Prediction Grid


Hyper-parameters tuning

• Gaussian RBF kernel is selected.

• Two hyper-parameters: C and σ.

• Grid search: testing error analysis for every pair of parameters.

$$K(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}}$$

The range of σ

The range of log(C)

Start calculation using testing data

min(σ) - minimum distance between data samples

max(σ) - max distance between data samples

min(C) - some small value, 1 or less

max(C) – depends on data, 1e3-1e6
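These rules of thumb translate into a parameter grid as follows. The sketch only builds the grid; training and testing an SVM for each pair is left to the software used. The function names, the number of σ steps, the upper limit 1e6 for C and the sample locations are assumptions for illustration.

```python
import math
from itertools import combinations


def sigma_range(X, steps=5):
    """Evenly spaced bandwidths between the min and max pairwise distance."""
    dists = [math.dist(a, b) for a, b in combinations(X, 2)]
    lo, hi = min(dists), max(dists)
    return [lo + k * (hi - lo) / (steps - 1) for k in range(steps)]


def c_range(log10_min=0, log10_max=6):
    """Powers of ten from 10^log10_min to 10^log10_max."""
    return [10.0 ** e for e in range(log10_min, log10_max + 1)]


# Hypothetical sample locations; in practice X is the training set.
X = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
grid = [(C, s) for C in c_range() for s in sigma_range(X)]
print(len(grid), grid[0], grid[-1])
```

Each (C, σ) pair in `grid` would then be used to train a model and record its training error, testing error and number of support vectors.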

Save results to file



Training error surface (axes: Gaussian RBF kernel bandwidth, log(C))

• increases with kernel bandwidth

• decreases with C




Testing error surface (axis label: log(C))

Complex structure, but generally, if the range is selected reasonably and the data splitting is correct, there exists a region of minima: the optimal values.




Normalized number of Support Vectors (axis label: log(C))

Represents the complexity of the model: the more complex the model, the more SVs it has.

Hyper-parameters selection

What are the parameters for the final model?

Training error

Testing error

Normalized NSV

C = 3, σ = 0.09

Hyper-parameters selection

What are the parameters for the final model?

Training error

Testing error

Normalized NSV

C = 18, σ = 0.13


Dependence on Parameters

C = 10

σ = 0.02, 0.06, 0.1, 0.2, 0.3, 0.4, 0.5


Dependence on Parameters

σ = 0.1

C=0.1

C=1

C=10

C=100


Predictive Mapping and Support Vectors

[Figure: predictive mapping with margin. Legend: normal SV, 0 < α < C; critical SV, α = C]

Applications for Natural Hazards

• Topo-climatic mapping

• Landslides

• Snow avalanches prediction

Weather observations

• 110 meteo stations

• Measurements, up to every 10min

• Altitude: 270m-3580m

• Temperature

• Precipitation

• Humidity

• Air Pressure

• Wind Speed

• Insolation

• Etc.

Spatio-temporal prediction mapping?

Temperature Inversion

Can only be explained using terrain surface characteristics (convexity, slope, etc.)

Physical Models at local scales

• Terrain roughness is too high for physical models, computational speed, precision, uncertainty estimation…

PDE on smoothed terrain + empirical correction

$$v_{\mathrm{Model}}(x, y) = v_{\mathrm{Physical}} + c_{\mathrm{Ridges}} + c_{\mathrm{Canyons}} + c_{\mathrm{Values}} + c_{\mathrm{FlatAreas}} + c_{\mathrm{Sea}} + \dots$$

Can this information be extracted directly from data?

Modelling Scheme

Data, DEM

FEATURES …

Non-linear dependencies

Noise, Outliers

Feature Selection/Extraction

Predictive Modeling with Machine Learning

Spatio-Temporal Mapping

Analysis, Decision Support

Temperature vs. Elevation

• Mean Monthly: Linear

• Mean Daily: Locally Linear, Regionalized

• Mean Hourly: Non-linear, Regionalized

• Mean Hourly: Explained

Temperature Inversion

DEM Features

• Large Scale Difference of Gaussians

• Short Scale Difference of Gaussians

• Slope

• Local Variance
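Two of these features can be sketched directly on a gridded DEM. The definitions below (central differences for slope, a 3x3 window for local variance) are assumptions for illustration; the lecture does not state which definitions or window sizes were used, and the difference of Gaussians features are not covered here.

```python
import math


def slope(dem, cell=1.0):
    """Slope magnitude at interior cells from central differences; edges stay 0."""
    rows, cols = len(dem), len(dem[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2.0 * cell)
            dz_dy = (dem[r + 1][c] - dem[r - 1][c]) / (2.0 * cell)
            out[r][c] = math.hypot(dz_dx, dz_dy)
    return out


def local_variance(dem, r, c):
    """Variance of elevation in the 3x3 window centred on an interior cell."""
    window = [dem[i][j] for i in (r - 1, r, r + 1) for j in (c - 1, c, c + 1)]
    mean = sum(window) / len(window)
    return sum((v - mean) ** 2 for v in window) / len(window)


# Hypothetical 4x4 DEM: an inclined plane z = 2*col + 3*row.
dem = [[2.0 * c + 3.0 * r for c in range(4)] for r in range(4)]
print(slope(dem)[1][1], local_variance(dem, 1, 1))
```

On the inclined plane the slope is the same at every interior cell, which gives a quick sanity check of the finite differences.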

Temperature Inversion Mapping

Probability of Inversion

Temperature

Visual Validation

Operational setting

http://www.geokernels.org/services/meteo

Applications


• Landslides


• Remote Sensing

SFI (SRC-ID 07/SRC/I1168)

Landslide inventory

Method I: Probability density estimation

[Figure axes: Factor 1, Factor 2]



Model vs. Training Data


What is wrong with this susceptibility map?

Method II: Classification

[Figure axes: Factor 1, Factor 2; classes: Stable, Unstable]


Predictive models



A model should fit the observed landslides, and …

Applications


• Landslides


• Remote Sensing

• 1842 days of recorded weather conditions (11 features), 1991-2007

• 1135 days with documented avalanche events

• 797 safe days, 245 with avalanches

• 260 days unknown (mainly bad weather)

Lochaber, Scotland

Validation data: 72 events, winters 2006-2007

Training data: 722 events, winters 1991-2005

Spatial Data

• 47 avalanche paths, x, y, z, slope, aspect, date

• DEM, 10m resolution, 5km x 5km

• Snow index 0-10

• No-settle cumulative Snow over a season

• Rain at 900m binary [0, 1]

• Snow drift binary [0, 1]

• Air temperature -10,… +10

• Wind speed 0, … 25 m/s

• Wind Direction 0°-360°

• Cloudiness [25, 50, 75, 100]

• Foot penetration 0, … 50

• Snow temperature 0, … -10

• Insolation cumulative over season

Lochaber weather observations

Z, Slope, Aspect: SN-WE, [Spatialized Weather Features] → +1

…over all the documented avalanche events… (720)

Z, Slope, Aspect: SN-WE, [Spatialized Weather Features] → -1

…over all the 47 gullies for documented days without avalanches… (44000)

4 + 22 = 26

Classification Problem

Wind Speed and Direction

Terrain-corrected wind direction:

Wind speed weighting:

Correction for slope:

Correction for curvature:

Snow accumulation

If Snow index > 0

If Snow drift = 1

Snow accumulation = F(Wind Speed, Wind Direction)

Simple heuristics based on wind speed gradients

Results

DEM Avalanche Danger

Results

wind

Animation in 3D

Applications


• Landslides


• Remote Sensing

Inhabited areas

Testing Training

Ground truth is known: population census


Inhabited areas: examples

Pre-processing and Features

Mathematical morphology (image closing)


SIFT


Gaussian Mixture Model

Testing: inhabited areas

Inhabited areas

Summary and Conclusions

• Statistical Learning Theory

• Classification Problem

• Support Vector Machines and Kernel Methods

• GeoSpatial Data Classification with SVM

Thank you!

Alexei Pozdnoukhov

[email protected]


Open PhD positions at NCG