An Introduction to Machine Learning with...
![Page 1: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/1.jpg)
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 1
An Introduction to Machine Learning with Kernels
Lecture 1
Alexander J. Smola
Statistical Machine Learning Program
National ICT Australia, Canberra
![Page 2: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/2.jpg)
Day 1
Machine learning and probability theory
Introduction to pattern recognition, classification, regression, novelty detection, probability theory, Bayes rule, inference

Density estimation and Parzen windows
Kernels and density estimation, Silverman's rule, Watson-Nadaraya estimator, crossvalidation

Perceptron and kernels
Hebb's rule, perceptron algorithm, convergence, feature maps, kernel trick, examples

Support Vector classification
Geometrical view, dual problem, convex optimization, kernels and SVM
![Page 3: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/3.jpg)
Day 2
Text analysis and bioinformatics
Text categorization, biological sequences, kernels on strings, efficient computation, examples

Optimization
Sequential minimal optimization, convex subproblems, convergence, SVMLight, SimpleSVM

Regression and novelty detection
SVM regression, regularized least mean squares, adaptive margin width, novel observations

Practical tricks
Crossvalidation, ν-trick, median trick, data scaling, smoothness and kernels
![Page 4: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/4.jpg)
L1 Introduction to Machine Learning
Data
Texts, images, vectors, graphs

What to do with data
Unsupervised learning (clustering, embedding, etc.)
Classification, sequence annotation
Regression, autoregressive models, time series
Novelty detection

What is not machine learning
Artificial intelligence
Rule based inference

Statistics and probability theory
Probability of an event
Dependence, independence, conditional probability
Bayes rule, Hypothesis testing
![Page 5: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/5.jpg)
Data
Vectors
Collections of features (e.g. height, weight, blood pressure, age, ...)
Can map categorical variables into vectors (useful for mixed objects)

Matrices
Images, Movies
Remote sensing and satellite data (multispectral)

Strings
Documents
Gene sequences

Structured Objects
XML documents
Graphs
![Page 6: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/6.jpg)
Optical Character Recognition
![Page 7: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/7.jpg)
Reuters Database
![Page 8: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/8.jpg)
Faces
![Page 9: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/9.jpg)
More Faces
![Page 10: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/10.jpg)
Microarray Data
![Page 11: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/11.jpg)
Biological Sequences
![Page 12: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/12.jpg)
Graphs
![Page 13: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/13.jpg)
Missing Variables
Incomplete Data
Measurement devices may fail (e.g. dead pixels on camera)
Measuring things may be expensive (diagnosis for patients)
Data may be censored

How to fix it
Clever algorithms (not this course)
Simple mean imputation (substitute in the average from other observations)
Works amazingly well (for starters) ...
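Mean imputation as described above can be sketched in a few lines. This is a minimal illustration, assuming missing entries are encoded as NaN in a NumPy array with one column per feature; the function name `mean_impute` and the toy data are hypothetical and not part of the lecture.

```python
import numpy as np

def mean_impute(X):
    """Replace each NaN by the mean of the observed values in its column."""
    X = np.asarray(X, dtype=float)
    col_means = np.nanmean(X, axis=0)      # per-feature mean, ignoring NaN
    missing = np.isnan(X)                  # boolean mask of missing entries
    filled = X.copy()
    # np.nonzero(missing)[1] lists the column index of every missing entry
    filled[missing] = np.take(col_means, np.nonzero(missing)[1])
    return filled

# Hypothetical toy data: rows are patients, columns are (height, weight).
X = np.array([[170.0, 65.0],
              [np.nan, 80.0],
              [180.0, np.nan]])
X_filled = mean_impute(X)
```

A column with no observed values at all has no mean to substitute, so a real pipeline would need to handle that case separately.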
![Page 14: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/14.jpg)
What to do with data
Unsupervised Learning
Find clusters of the data
Find low-dimensional representation of the data (e.g. unroll a swiss roll)
Find interesting directions in data
Interesting coordinates and correlations
Find novel observations / database cleaning

Supervised Learning
Classification (distinguish apples from oranges)
Speech recognition
Regression (tomorrow's stock value)
Predict time series
Annotate strings
![Page 15: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/15.jpg)
Clustering
![Page 16: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/16.jpg)
Linear Subspace
![Page 17: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/17.jpg)
Principal Components
![Page 18: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/18.jpg)
Classification
Data
Pairs of observations $(x_i, y_i)$ generated from some distribution $P(x, y)$, e.g., (blood status, cancer), (credit transaction information, fraud), (sound profile of jet engine, defect)

Goal
Estimate $y \in \{\pm 1\}$ given $x$ at a new location. Or find a function $f(x)$ that does the trick.
![Page 19: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/19.jpg)
Regression
![Page 20: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/20.jpg)
Regression
Data
Pairs of observations $(x_i, y_i)$ generated from some joint distribution $\Pr(x, y)$, e.g.,
market index, SP100
fab parameters, yield
user profile, price

Task
Estimate $y$, given $x$, such that some loss $c(x, y, f(x))$ is minimized.

Examples
Quadratic error between $y$ and $f(x)$, i.e. $c(x, y, f(x)) = \frac{1}{2}(y - f(x))^2$.
Absolute value, i.e., $c(x, y, f(x)) = |y - f(x)|$.
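The two example losses can be written down directly. The sketch below is only an illustration: the function names and the toy targets and predictions are made up, and averaging the per-point loss over a sample is one common way to summarize it, not something the slide prescribes.

```python
import numpy as np

def quadratic_loss(y, fx):
    """c(x, y, f(x)) = 1/2 (y - f(x))^2, evaluated elementwise."""
    return 0.5 * (y - fx) ** 2

def absolute_loss(y, fx):
    """c(x, y, f(x)) = |y - f(x)|, evaluated elementwise."""
    return np.abs(y - fx)

# Hypothetical targets y and predictions f(x).
y = np.array([1.0, 2.0, 3.0])
fx = np.array([1.5, 2.0, 1.0])

mean_quadratic = quadratic_loss(y, fx).mean()
mean_absolute = absolute_loss(y, fx).mean()
```

The quadratic loss penalizes the large error on the third point much more heavily than the absolute loss does, which is why the latter is less sensitive to outliers.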
![Page 21: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/21.jpg)
Annotating Strings
![Page 22: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/22.jpg)
Annotating Audio
Goal
Possible meaning of an audio sequence
Give confidence measure

Example (from Australian Prime Minister's speech)
a stray alien
Australian
![Page 23: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/23.jpg)
Novelty Detection
Data
Observations $(x_i)$ generated from some $P(x)$, e.g.,
network usage patterns
handwritten digits
alarm sensors
factory status

Task
Find unusual events, clean database, distinguish typical examples.
![Page 24: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/24.jpg)
What Machine Learning is not
Logic
If A meets B and B meets C, does A know C?
Rule satisfaction
Logical rules from data

Artificial Intelligence
Understanding of the world
Meet Sonny from I, Robot
Go and get me a bottle of beer (robot need not understand what it is doing)

Biology and Neuroscience
Understand the brain by building neural networks?!?
Model brain and build good systems with that
Get inspiration from biology but no requirement to build systems like that (e.g. jet planes don't flap wings)
![Page 25: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/25.jpg)
How the brain doesn’t work
![Page 26: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/26.jpg)
Statistics and Probability Theory
Why do we need it?
We deal with uncertain events
Need mathematical formulation for probabilities
Need to estimate probabilities from data (e.g. for coin tosses, we only observe number of heads and tails, not whether the coin is really unbiased).

How do we use it?
Statement about probability that an object is an apple (rather than an orange)
Probability that two things happen at the same time
Find unusual events (= low density events)
Conditional events (e.g. what happens if A, B, and C are true)
![Page 27: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/27.jpg)
Probability
Basic Idea
We have events in a space $\mathcal{X}$ of possible outcomes. Then $\Pr(X)$ tells us how likely it is that an event $x \in X$ will occur.

Basic Axioms
$\Pr(X) \in [0, 1]$ for all $X \subseteq \mathcal{X}$
$\Pr(\mathcal{X}) = 1$
$\Pr\left(\bigcup_i X_i\right) = \sum_i \Pr(X_i)$ if $X_i \cap X_j = \emptyset$ for all $i \neq j$

Simple Corollary
$$\Pr(X \cup Y) = \Pr(X) + \Pr(Y) - \Pr(X \cap Y)$$
![Page 28: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/28.jpg)
Example
![Page 29: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/29.jpg)
Multiple Variables
Two Sets
Assume that $x$ and $y$ are drawn from a probability measure on the product space of $\mathcal{X}$ and $\mathcal{Y}$. Consider the space of events $(x, y) \in \mathcal{X} \times \mathcal{Y}$.

Independence
If $x$ and $y$ are independent, then for all $X \subset \mathcal{X}$ and $Y \subset \mathcal{Y}$
$$\Pr(X, Y) = \Pr(X) \cdot \Pr(Y).$$
![Page 30: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/30.jpg)
Independent Random Variables
![Page 31: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/31.jpg)
Dependent Random Variables
![Page 32: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/32.jpg)
Bayes Rule
Dependence and Conditional Probability
Typically, knowing $x$ will tell us something about $y$ (think regression or classification). We have
$$\Pr(Y|X) \Pr(X) = \Pr(Y, X) = \Pr(X|Y) \Pr(Y).$$
Hence $\Pr(Y, X) \leq \min(\Pr(X), \Pr(Y))$.

Bayes Rule
$$\Pr(X|Y) = \frac{\Pr(Y|X) \Pr(X)}{\Pr(Y)}.$$

Proof using conditional probabilities
$$\Pr(X, Y) = \Pr(X|Y) \Pr(Y) = \Pr(Y|X) \Pr(X)$$
![Page 33: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/33.jpg)
Example
$$\Pr(X \cap X') = \Pr(X|X') \Pr(X') = \Pr(X'|X) \Pr(X)$$
![Page 34: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/34.jpg)
AIDS Test
How likely is it to have AIDS if the test says so?
Assume that roughly 0.1% of the population is infected.
$$p(X = \text{AIDS}) = 0.001$$
The AIDS test reports positive for all infections.
$$p(Y = \text{test positive} \mid X = \text{AIDS}) = 1$$
The AIDS test reports positive for 1% of healthy people.
$$p(Y = \text{test positive} \mid X = \text{healthy}) = 0.01$$
We use Bayes rule to infer $\Pr(\text{AIDS} \mid \text{test positive})$ via
$$\frac{\Pr(Y|X) \Pr(X)}{\Pr(Y)} = \frac{\Pr(Y|X) \Pr(X)}{\Pr(Y|X) \Pr(X) + \Pr(Y|\mathcal{X} \setminus X) \Pr(\mathcal{X} \setminus X)} = \frac{1 \cdot 0.001}{1 \cdot 0.001 + 0.01 \cdot 0.999} = 0.091$$
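The calculation above is easy to check numerically. The following is a minimal sketch for a two-outcome hypothesis (infected vs. healthy); the function name `posterior` and its argument names are illustrative choices, not notation from the slides.

```python
def posterior(prior, p_pos_given_true, p_pos_given_false):
    """Bayes rule for a binary hypothesis given one positive test.

    prior              -- Pr(X), e.g. fraction of the population infected
    p_pos_given_true   -- Pr(Y | X), positive rate among the infected
    p_pos_given_false  -- Pr(Y | not X), positive rate among the healthy
    """
    # Pr(Y) by summing over both cases of the hypothesis
    evidence = p_pos_given_true * prior + p_pos_given_false * (1.0 - prior)
    return p_pos_given_true * prior / evidence

# Numbers from the slide: 0.1% infected, no false negatives, 1% false positives.
p_infected = posterior(prior=0.001, p_pos_given_true=1.0, p_pos_given_false=0.01)
```

Despite the seemingly accurate test, the small prior keeps the posterior below 10%, because the healthy population is so much larger than the infected one.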
![Page 35: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/35.jpg)
Eye Witness
Evidence from an Eye-Witness
A witness is 90% certain that a certain customer committed the crime. There were 20 people in the bar ...

Would you convict the person?
Everyone is presumed innocent until proven guilty, hence
$$p(X = \text{guilty}) = 1/20$$
Eyewitness has equal confusion probability
$$p(Y = \text{eyewitness identifies} \mid X = \text{guilty}) = 0.9$$
and
$$p(Y = \text{eyewitness identifies} \mid X = \text{not guilty}) = 0.1$$
Bayes Rule
$$\Pr(X|Y) = \frac{0.9 \cdot 0.05}{0.9 \cdot 0.05 + 0.1 \cdot 0.95} = 0.3214 \approx 32\%$$
But most judges would convict him anyway ...
![Page 36: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/36.jpg)
Improving Inference
Follow up on the AIDS test:
The doctor performs a second, conditionally independent test which has the following properties:
The second test reports positive for 90% of infections.
The second test reports positive for 5% of healthy people.
$$\Pr(T_1, T_2 \mid \text{Health}) = \Pr(T_1 \mid \text{Health}) \Pr(T_2 \mid \text{Health}).$$
A bit more algebra reveals (assuming that both tests are conditionally independent) that the probability of being healthy after two positive tests is
$$\frac{0.01 \cdot 0.05 \cdot 0.999}{0.01 \cdot 0.05 \cdot 0.999 + 1 \cdot 0.9 \cdot 0.001} = 0.357,$$
so the probability of infection rises from 0.091 to $1 - 0.357 = 0.643$.

Conclusion:
Adding extra observations can improve the confidence of the test considerably.
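Under the conditional independence assumption, combining several positive tests just multiplies their likelihoods before normalizing. The sketch below illustrates this for the two tests discussed here; `posterior_after_tests` is a hypothetical helper name, and the lists hold the positive rates of each test among the infected and among the healthy.

```python
def posterior_after_tests(prior, rates_infected, rates_healthy):
    """Pr(infected | all tests positive), tests conditionally independent."""
    joint_infected = prior           # Pr(infected) times product of Pr(T_k | infected)
    joint_healthy = 1.0 - prior      # Pr(healthy) times product of Pr(T_k | healthy)
    for p_inf, p_healthy in zip(rates_infected, rates_healthy):
        joint_infected *= p_inf
        joint_healthy *= p_healthy
    return joint_infected / (joint_infected + joint_healthy)

# First test only, then both tests, using the rates given on the slides.
p_one = posterior_after_tests(0.001, [1.0], [0.01])
p_two = posterior_after_tests(0.001, [1.0, 0.9], [0.01, 0.05])
p_healthy_two = 1.0 - p_two
```

Here `p_healthy_two` corresponds to the 0.357 on the slide, and `p_two` is the complementary probability of infection.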
![Page 37: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/37.jpg)
Different Contexts
Hypothesis Testing:
Is solution A or B better to solve the problem (e.g. in manufacturing)?
Is a coin tainted?
Which parameter setting should we use?

Sensor Fusion:
Evidence from sensors A and B (cf. AIDS test 1 and 2).
We have different types of data.

More Data:
We obtain two sets of data, so we get more confident
Each observation can be seen as an additional test
![Page 38: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/38.jpg)
Estimating Probabilities from Data
Rolling a dice:
Roll the dice many times and count how many times each side comes up. Then assign empirical probability estimates according to the frequency of occurrence.

Maximum Likelihood for Multinomial Distribution:
We match the empirical probabilities via
$$\Pr_{\mathrm{emp}}(i) = \frac{\#\text{occurrences of } i}{\#\text{trials}}$$
In plain English this means that we take the number of occurrences of a particular event (say 7 times heads) and divide this by the total number of trials (say 10 times). This yields 0.7.
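The frequency estimate takes only a few lines of code. This sketch uses the coin example from the text (7 heads in 10 tosses); the function name and the `"H"`/`"T"` encoding are illustrative assumptions.

```python
from collections import Counter

def empirical_probabilities(outcomes):
    """Estimate Pr(i) as (#occurrences of i) / (#trials)."""
    counts = Counter(outcomes)
    m = len(outcomes)
    return {outcome: count / m for outcome, count in counts.items()}

# 7 heads and 3 tails in 10 tosses, as in the example above.
tosses = ["H"] * 7 + ["T"] * 3
p_emp = empirical_probabilities(tosses)
```

Outcomes that never occur in the sample do not appear in the result at all, which amounts to assigning them probability zero.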
![Page 39: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/39.jpg)
Maximum Likelihood Proof
Goal
We want to estimate the parameter $\pi \in \mathbb{R}^n$ such that
$$\Pr(X|\pi) = \prod_{j=1}^m \Pr(X_j|\pi) = \prod_{i=1}^n \pi_i^{\#_i}$$
is maximized while $\pi$ is a probability (reparameterize $\pi_i = e^{\theta_i}$). Here $\#_i$ denotes the number of occurrences of outcome $i$.

Constrained Optimization Problem
$$\text{minimize} \sum_{i=1}^n -\#_i \cdot \theta_i \quad \text{subject to} \quad \sum_{i=1}^n e^{\theta_i} = 1$$

Lagrange Function
$$L(\pi, \gamma) = \sum_{i=1}^n -\#_i \cdot \theta_i + \gamma \left( \sum_{i=1}^n e^{\theta_i} - 1 \right)$$
![Page 40: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/40.jpg)
Maximum Likelihood Proof
First Order Optimality Conditions
$$L(\pi, \gamma) = \sum_{i=1}^n -\#_i \cdot \theta_i + \gamma \left( \sum_{i=1}^n e^{\theta_i} - 1 \right)$$
$$\partial_{\theta_i} L(\pi, \gamma) = -\#_i + \gamma e^{\theta_i} \text{ vanishes} \implies \pi_i = e^{\theta_i} = \frac{\#_i}{\gamma}$$
Finally, the sum constraint is satisfied if $\gamma = \sum_i \#_i$.
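The result $\pi_i = \#_i / \sum_j \#_j$ can be sanity-checked numerically: no other point on the probability simplex should achieve a higher log-likelihood. The sketch below does this with randomly drawn candidates; the counts are hypothetical toy values, and random search is only a plausibility check, not a proof.

```python
import numpy as np

def log_likelihood(counts, pi):
    """log Pr(X | pi) = sum_i #_i * log(pi_i) for a multinomial model."""
    return float(np.sum(counts * np.log(pi)))

# Hypothetical counts for six outcomes.
counts = np.array([3, 6, 2, 1, 4, 4], dtype=float)
pi_mle = counts / counts.sum()            # closed-form solution derived above

# Draw random probability vectors and compare their likelihoods.
rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(len(counts)), size=1000)
best_random = max(log_likelihood(counts, pi) for pi in candidates)
mle_value = log_likelihood(counts, pi_mle)
```

Comparing `mle_value` with `best_random` shows that none of the sampled candidates beats the closed-form estimate.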
![Page 41: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/41.jpg)
Practical Example
![Page 42: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/42.jpg)
Properties of MLE
Hoeffding's Bound
The probability estimates converge exponentially fast
$$\Pr\{|\pi_i - p_i| > \epsilon\} \leq 2 \exp(-2m\epsilon^2)$$

Problem
For small $\epsilon$ this can still take a very long time. In particular, for a fixed confidence level $\delta$ we have
$$\delta = 2 \exp(-2m\epsilon^2) \implies \epsilon = \sqrt{\frac{-\log \delta + \log 2}{2m}}$$
The above bound holds only for single $\pi_i$, not uniformly over all $i$.

Improved Approach
If we know something about $\pi_i$, we should use this extra information: use priors.
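To get a feeling for how slowly the bound tightens, it helps to plug in numbers. The sketch below evaluates the formula for $\epsilon$ and also inverts it for the sample size $m$; the function names and the chosen values of $\delta$, $\epsilon$ and $m$ are illustrative assumptions.

```python
import math

def hoeffding_epsilon(m, delta):
    """Deviation eps such that 2 exp(-2 m eps^2) = delta."""
    return math.sqrt((math.log(2.0) - math.log(delta)) / (2.0 * m))

def hoeffding_samples(epsilon, delta):
    """Smallest integer m with 2 exp(-2 m eps^2) <= delta."""
    return math.ceil((math.log(2.0) - math.log(delta)) / (2.0 * epsilon ** 2))

# Accuracy guaranteed by 100 samples at confidence level delta = 0.05 ...
eps_100 = hoeffding_epsilon(m=100, delta=0.05)
# ... and samples needed to pin a probability down to +/- 0.01.
m_needed = hoeffding_samples(epsilon=0.01, delta=0.05)
```

Because $m$ scales with $1/\epsilon^2$, halving the tolerated deviation quadruples the number of samples required.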
![Page 43: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/43.jpg)
Summary
Data

What to do with data
Unsupervised learning (clustering, embedding, etc.), Classification, sequence annotation, Regression, ...

Statistics and probability theory
Probability of an event
Dependence, independence, conditional probability
Bayes rule, Hypothesis testing
Maximum Likelihood Estimation
Confidence bounds
![Page 44: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/44.jpg)
An Introduction to Machine Learning with Kernels
Lecture 2
Alexander J. Smola
Statistical Machine Learning Program
National ICT Australia, Canberra
![Page 45: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/45.jpg)
![Page 46: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/46.jpg)
L2 Density estimation
Density estimation
empirical frequency, bin counting
priors and Laplace rule

Parzen windows
Smoothing out the estimates
Examples

Adjusting parameters
Cross validation
Silverman's rule

Classification and regression with Parzen windows
Watson-Nadaraya estimator
Nearest neighbor classifier
![Page 47: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/47.jpg)
Tossing a dice (again)
![Page 48: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/48.jpg)
Priors to the Rescue
Big Problem
Only sampling many times gets the parameters right.

Rule of Thumb
We need at least 10-20 times as many observations.

Priors
Often we know what we should expect. Using a conjugate prior helps. There we insert fake additional data which we assume comes from the prior.

Practical Example
If we assume that the dice is even, then we can add $m_0$ observations to each event $1 \leq i \leq 6$. This yields
$$\pi_i = \frac{\#\text{occurrences of } i + u_i - 1}{\#\text{trials} + \sum_j (u_j - 1)},$$
where $u_i - 1 = m_0$ is the number of fake observations added to outcome $i$.
For $m_0 = 1$ this is the famous Laplace Rule.
![Page 49: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/49.jpg)
Example: Dice
20 tosses of a dice

| Outcome | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Counts | 3 | 6 | 2 | 1 | 4 | 4 |
| MLE | 0.15 | 0.30 | 0.10 | 0.05 | 0.20 | 0.20 |
| MAP (m0 = 6) | 0.15 | 0.27 | 0.12 | 0.08 | 0.19 | 0.19 |
| MAP (m0 = 100) | 0.16 | 0.19 | 0.16 | 0.15 | 0.17 | 0.17 |

In this table m0 is the total number of fake observations, spread evenly over the six outcomes.

Consequences
Stronger prior brings the estimate closer to uniform distribution.
More robust against outliers
But: Need more data to detect deviations from prior
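The rows of the dice example can be recomputed from the counts. The sketch below assumes that m0 in the table is the total number of fake observations, split evenly across the six outcomes (so m0 = 6 adds one fake observation per outcome); the function name `map_estimate` is a hypothetical choice.

```python
import numpy as np

def map_estimate(counts, m0_total):
    """Smoothed estimate: add m0_total / n fake observations to each outcome."""
    counts = np.asarray(counts, dtype=float)
    fake_per_outcome = m0_total / len(counts)
    return (counts + fake_per_outcome) / (counts.sum() + m0_total)

counts = [3, 6, 2, 1, 4, 4]             # 20 tosses, as in the example
mle = map_estimate(counts, 0)           # no prior: plain relative frequencies
map_6 = map_estimate(counts, 6)         # one fake observation per outcome
map_100 = map_estimate(counts, 100)     # strong prior, close to uniform 1/6
```

With `m0_total=0` the function reduces to the maximum likelihood estimate, and as `m0_total` grows every entry approaches 1/6.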
![Page 50: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/50.jpg)
Correct dice
![Page 51: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/51.jpg)
Tainted dice
![Page 52: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/52.jpg)
Density Estimation
Data
Continuous valued random variables.

Naive Solution
Apply the bin-counting strategy to the continuum. That is, we discretize the domain into bins.

Problems
We need lots of data to fill the bins
In more than one dimension the number of bins grows exponentially:
Assume 10 bins per dimension, so we have 10 in $\mathbb{R}^1$
100 bins in $\mathbb{R}^2$
$10^{10}$ bins (10 billion bins) in $\mathbb{R}^{10}$ ...
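Bin counting in one dimension, and the growth of the bin count with dimension, can be illustrated as follows. This is a rough sketch: the uniform toy sample, the range [0, 1] and the helper name `histogram_density` are assumptions made for the example.

```python
import numpy as np

def histogram_density(samples, bins=10, value_range=(0.0, 1.0)):
    """Piecewise-constant density estimate from bin counts."""
    counts, edges = np.histogram(samples, bins=bins, range=value_range)
    widths = np.diff(edges)
    # Divide by sample size and bin width so the estimate integrates to one.
    density = counts / (counts.sum() * widths)
    return density, edges

# Hypothetical sample: 1000 draws from the uniform distribution on [0, 1).
rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=1000)
density, edges = histogram_density(samples)

# Number of bins needed at 10 bins per dimension.
bins_needed = {d: 10 ** d for d in (1, 2, 10)}
```

With 1000 samples each of the 10 bins holds about 100 points, but in ten dimensions the same sample would leave almost all of the $10^{10}$ bins empty.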
![Page 53: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/53.jpg)
Mixture Density
![Page 54: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/54.jpg)
Sampling from p(x)
![Page 55: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/55.jpg)
Bin counting
![Page 56: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/56.jpg)
Parzen Windows
Naive approach
Use the empirical density
$$p_{\mathrm{emp}}(x) = \frac{1}{m} \sum_{i=1}^{m} \delta(x, x_i),$$
which has a delta peak for every observation.
Problem
What happens when we see slightly different data?
Idea
Smear out $p_{\mathrm{emp}}$ by convolving it with a kernel $k(x, x')$. Here $k(x, x')$ satisfies
$$\int_X k(x, x')\, dx' = 1 \quad \text{for all } x \in X.$$
![Page 57: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/57.jpg)
Parzen Windows
Estimation Formula
Smooth out $p_{\mathrm{emp}}$ by convolving it with a kernel $k(x, x')$.
$$p(x) = \frac{1}{m} \sum_{i=1}^{m} k(x_i, x)$$
Adjusting the kernel width
Range of data should be adjustable.
Use kernel function $k(x, x')$ which is a proper kernel.
Scale kernel by radius $r$. This yields
$$k_r(x, x') := r^n k(rx, rx')$$
Here $n$ is the dimensionality of $x$.
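The estimation formula and the scaling rule can be tried out directly. The sketch below assumes one-dimensional data and a unit-width Gaussian as the base kernel; the helper names (`scaled_kernel`, `parzen_density`, `integrate`) and the toy data are illustrative choices, not part of the slides.

```python
import math

def gaussian_kernel(x, xp):
    """Unit-width Gaussian kernel on the real line; integrates to 1 in xp."""
    return math.exp(-0.5 * (x - xp) ** 2) / math.sqrt(2.0 * math.pi)

def scaled_kernel(k, r, n=1):
    """Build k_r(x, x') = r^n * k(r x, r x') for scalar inputs (n = 1 here)."""
    return lambda x, xp: (r ** n) * k(r * x, r * xp)

def parzen_density(x, data, k):
    """Parzen window estimate p(x) = (1/m) * sum_i k(x_i, x)."""
    return sum(k(xi, x) for xi in data) / len(data)

def integrate(f, lo, hi, steps=20000):
    """Midpoint rule, used only to check the normalisation numerically."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))

data = [-1.2, -0.8, -0.1, 0.3, 0.4, 1.9]
k_narrow = scaled_kernel(gaussian_kernel, r=4.0)  # standard deviation 1/4
k_wide = scaled_kernel(gaussian_kernel, r=0.5)    # standard deviation 2

for name, k in [("narrow", k_narrow), ("wide", k_wide)]:
    mass = integrate(lambda x: parzen_density(x, data, k), -25.0, 25.0)
    print(name, "p(0) =", round(parzen_density(0.0, data, k), 4), "mass =", round(mass, 4))
```

The factor $r^n$ is what keeps the rescaled kernel normalised: for the Gaussian, $k_r$ is again a Gaussian with standard deviation $1/r$, so a larger $r$ gives a narrower window.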
![Page 58: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/58.jpg)
Discrete Density Estimate
![Page 59: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/59.jpg)
Smoothing Function
![Page 60: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/60.jpg)
Density Estimate
![Page 61: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/61.jpg)
Examples of Kernels
Gaussian Kernel
$$k(x, x') = (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2} \|x - x'\|^2\right)$$
Laplacian Kernel
$$k(x, x') = \lambda^n 2^{-n} \exp\left(-\lambda \|x - x'\|_1\right)$$
Indicator Kernel
$$k(x, x') = 1_{[-0.5, 0.5]}(x - x')$$
Important Issue
Width of the kernel is usually much more important than type.
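As a sanity check on the normalisation constants, the three kernels can be written down and integrated numerically. This is a sketch under stated assumptions: points are passed as tuples, the function names are made up for the example, and the extension of the indicator kernel to a unit cube in more than one dimension is an assumption (the slide only gives the interval form).

```python
import math

def gaussian_kernel(x, xp, sigma=1.0):
    """(2 pi sigma^2)^(-n/2) * exp(-||x - x'||^2 / (2 sigma^2))."""
    n = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, xp))
    return (2.0 * math.pi * sigma ** 2) ** (-n / 2.0) * math.exp(-sq / (2.0 * sigma ** 2))

def laplacian_kernel(x, xp, lam=1.0):
    """lambda^n * 2^(-n) * exp(-lambda * ||x - x'||_1)."""
    n = len(x)
    l1 = sum(abs(a - b) for a, b in zip(x, xp))
    return (lam / 2.0) ** n * math.exp(-lam * l1)

def indicator_kernel(x, xp):
    """1 if every coordinate of x - x' lies in [-0.5, 0.5], else 0."""
    return 1.0 if all(abs(a - b) <= 0.5 for a, b in zip(x, xp)) else 0.0

def mass_1d(k, lo=-30.0, hi=30.0, steps=60000):
    """Midpoint-rule integral of k(0, x') over x' in one dimension."""
    h = (hi - lo) / steps
    return h * sum(k((0.0,), (lo + (i + 0.5) * h,)) for i in range(steps))

for name, k in [("gaussian", gaussian_kernel),
                ("laplacian", laplacian_kernel),
                ("indicator", indicator_kernel)]:
    print(name, round(mass_1d(k), 4))
```

All three integrate to one, which is the property required of a kernel used in the Parzen window estimate.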
![Page 62: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/62.jpg)
Gaussian Kernel
![Page 63: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/63.jpg)
Laplacian Kernel
![Page 64: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/64.jpg)
Indicator Kernel
![Page 65: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/65.jpg)
Gaussian Kernel
![Page 66: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/66.jpg)
Laplacian Kernel
![Page 67: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/67.jpg)
Laplacian Kernel
![Page 68: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/68.jpg)
Selecting the Kernel Width
Goal
We need a method for adjusting the kernel width.
Problem
The likelihood keeps on increasing as we narrow the kernels.
Reason
The likelihood estimate we see is distorted (we are being overly optimistic through optimizing the parameters).
Possible Solution
Check the performance of the density estimate on an unseen part of the data. This can be done e.g. by
Leave-one-out crossvalidation
Ten-fold crossvalidation
![Page 69: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/69.jpg)
Expected log-likelihood
What we really want
A parameter such that in expectation the likelihood of the data is maximized
$$p_r(X) = \prod_{i=1}^{m} p_r(x_i)$$
or equivalently
$$\frac{1}{m} \log p_r(X) = \frac{1}{m} \sum_{i=1}^{m} \log p_r(x_i).$$
However, if we optimize $r$ for the seen data, we will always overestimate the likelihood.
Solution: Crossvalidation
Test on unseen data.
Remove a fraction of data from $X$, say $X'$, estimate using $X \setminus X'$ and test on $X'$.
![Page 70: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/70.jpg)
Crossvalidation Details
Basic Idea
Compute $p(X' \mid \theta(X \setminus X'))$ for various subsets of $X$ and average over the corresponding log-likelihoods.
Practical Implementation
Generate subsets $X_i \subset X$ and compute the log-likelihood estimate
$$\frac{1}{n} \sum_{i=1}^{n} \frac{1}{|X_i|} \log p(X_i \mid \theta(X \setminus X_i))$$
Pick the parameter which maximizes the above estimate.
Special Case: Leave-one-out Crossvalidation
$$p_{X \setminus x_i}(x_i) = \frac{m}{m-1}\, p_X(x_i) - \frac{1}{m-1}\, k(x_i, x_i)$$
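The leave-one-out formula makes width selection cheap, because the full-sample estimate only needs to be corrected for the point's own kernel contribution. The following sketch assumes one-dimensional data and a Gaussian kernel scaled by $r$; the function names, the toy data and the candidate grid are illustrative assumptions.

```python
import math

def gaussian_kernel(x, xp, r):
    """One-dimensional Gaussian scaled by r, i.e. standard deviation 1/r."""
    return r * math.exp(-0.5 * (r * (x - xp)) ** 2) / math.sqrt(2.0 * math.pi)

def full_density(data, r, i):
    """p_X(x_i): Parzen estimate at x_i using all m points."""
    return sum(gaussian_kernel(xj, data[i], r) for xj in data) / len(data)

def loo_density(data, r, i):
    """p_{X \\ x_i}(x_i) = m/(m-1) * p_X(x_i) - k(x_i, x_i)/(m-1)."""
    m = len(data)
    return m / (m - 1) * full_density(data, r, i) - gaussian_kernel(data[i], data[i], r) / (m - 1)

def train_log_likelihood(data, r):
    """Average log-likelihood on the data used for fitting (over-optimistic)."""
    return sum(math.log(full_density(data, r, i)) for i in range(len(data))) / len(data)

def loo_log_likelihood(data, r):
    """Average leave-one-out log-likelihood."""
    total = 0.0
    for i in range(len(data)):
        # The floor avoids log(0) if a point has no neighbour within reach
        total += math.log(max(loo_density(data, r, i), 1e-300))
    return total / len(data)

data = [-2.1, -1.9, -1.7, -0.2, 0.0, 0.1, 0.3, 1.8, 2.0, 2.4]
candidates = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
best_r = max(candidates, key=lambda r: loo_log_likelihood(data, r))

for r in candidates:
    print(r, round(train_log_likelihood(data, r), 3), round(loo_log_likelihood(data, r), 3))
print("selected r:", best_r)
```

On the training data the narrowest kernel scores better than a moderate one, which is the distortion described above; the leave-one-out score penalises kernels so narrow that held-out points fall between the peaks.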
![Page 71: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/71.jpg)
Cross Validation
![Page 72: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/72.jpg)
Best Fit (λ = 1.9)
![Page 73: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/73.jpg)
Application: Novelty Detection
Goal
Find the least likely observations $x_i$ from a dataset $X$. Alternatively, identify low-density regions, given $X$.
Idea
Perform density estimate $p_X(x)$ and declare all $x_i$ with $p_X(x_i) < p_0$ as novel.
Algorithm
Simply compute $f(x_i) = \sum_j k(x_i, x_j)$ for all $i$ and sort according to their magnitude.
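The algorithm amounts to scoring every point by its summed kernel values and sorting. A minimal sketch, assuming one-dimensional data and an unnormalised Gaussian kernel (the constant factor does not affect the ordering); the names and the toy data are illustrative.

```python
import math

def gaussian_kernel(x, xp, sigma=0.5):
    """Unnormalised Gaussian; a constant factor would not change the ranking."""
    return math.exp(-0.5 * ((x - xp) / sigma) ** 2)

def novelty_ranking(data, k):
    """Score each x_i by f(x_i) = sum_j k(x_i, x_j); lowest scores come first."""
    scores = [sum(k(xi, xj) for xj in data) for xi in data]
    order = sorted(range(len(data)), key=lambda i: scores[i])
    return order, scores

data = [0.10, 0.20, 0.15, 0.30, 0.25, 5.00, 0.05]
order, scores = novelty_ranking(data, gaussian_kernel)

print("most novel index:", order[0], "value:", data[order[0]])
print("scores:", [round(s, 3) for s in scores])
```

Thresholding the sorted scores at a chosen level plays the role of the cutoff $p_0$.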
![Page 74: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/74.jpg)
Applications
Network Intrusion Detection
Detect whether someone is trying to hack the network, downloading tons of MP3s, or doing anything else unusual on the network.
Jet Engine Failure Detection
You can't destroy jet engines just to see how they fail.
Database Cleaning
We want to find out whether someone stored bogus information in a database (typos, etc.), mislabelled digits, ugly digits, bad photographs in an electronic album.
Fraud Detection
Credit Cards, Telephone Bills, Medical Records
Self calibrating alarm devices
Car alarms (adjusts itself to where the car is parked), home alarm (furniture, temperature, windows, etc.)
![Page 75: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/75.jpg)
Order Statistic of Densities
![Page 76: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/76.jpg)
Typical Data
![Page 77: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/77.jpg)
Outliers
![Page 78: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/78.jpg)
Silverman’s Automatic Adjustment
Problem
One 'width fits all' does not work well whenever we have regions of high and of low density.
Idea
Adjust width such that neighbors of a point are included in the kernel at a point. More specifically, adjust range $h_i$ to yield
$$h_i = \frac{r}{k} \sum_{x_j \in \mathrm{NN}(x_i, k)} \|x_j - x_i\|$$
where $\mathrm{NN}(x_i, k)$ is the set of $k$ nearest neighbors of $x_i$ and $r$ is typically chosen to be 0.5.
Result
State of the art density estimator, regression estimator and classifier.
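The width rule can be computed with a brute-force neighbor search. The sketch below assumes one-dimensional data and that $x_i$ is not counted among its own neighbors. How $h_i$ enters the density estimate is not spelled out on the slide, so `adaptive_density`, which centres a Gaussian of standard deviation $h_i$ on each $x_i$, is an assumption made for illustration.

```python
import math

def knn_widths(data, k, r=0.5):
    """h_i = (r / k) * sum of distances from x_i to its k nearest neighbours."""
    widths = []
    for i, xi in enumerate(data):
        dists = sorted(abs(xj - xi) for j, xj in enumerate(data) if j != i)
        widths.append(r / k * sum(dists[:k]))
    return widths

def adaptive_density(x, data, widths):
    """Assumed form: average of Gaussians with one width h_i per data point."""
    total = 0.0
    for xi, h in zip(data, widths):
        total += math.exp(-0.5 * ((x - xi) / h) ** 2) / (h * math.sqrt(2.0 * math.pi))
    return total / len(data)

data = [0.0, 0.1, 0.2, 0.3, 5.0, 9.0]
h = knn_widths(data, k=2)

print([round(v, 3) for v in h])
print(round(adaptive_density(0.15, data, h), 3), round(adaptive_density(7.0, data, h), 3))
```

Points in the dense cluster receive narrow kernels, while the isolated points at 5.0 and 9.0 receive wide ones, which is the behaviour the adjustment is meant to produce.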
![Page 79: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/79.jpg)
Sampling from p(x)
![Page 80: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/80.jpg)
Uneven Scales
![Page 81: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/81.jpg)
Neighborhood Scales
![Page 82: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/82.jpg)
Adjusted Width
![Page 83: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/83.jpg)
Watson-Nadaraya Estimator
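Goal
Given pairs of observations $(x_i, y_i)$ with $y_i \in \{\pm 1\}$, find an estimator for the conditional probability $\Pr(y|x)$.
Idea
Use the definition $p(x, y) = p(y|x)\,p(x)$ and estimate both $p(x)$ and $p(x, y)$ using Parzen windows. Using Bayes rule this yields
$$\Pr(y = 1|x) = \frac{P(y = 1, x)}{P(x)} = \frac{m^{-1} \sum_{y_i = 1} k(x_i, x)}{m^{-1} \sum_i k(x_i, x)}$$
Bayes optimal decision
We want to classify $y = 1$ for $\Pr(y = 1|x) > 0.5$. This is equivalent to checking the sign of
$$\Pr(y = 1|x) - \Pr(y = -1|x) = \frac{\sum_i y_i k(x_i, x)}{\sum_i k(x_i, x)},$$
that is, the sign of $\sum_i y_i k(x_i, x)$, since the denominator is positive for a nonnegative kernel.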
![Page 84: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/84.jpg)
Training Data
![Page 85: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/85.jpg)
Watson Nadaraya Classifier
![Page 86: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/86.jpg)
Difference in Signs
![Page 87: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/87.jpg)
Watson Nadaraya Regression
Decision Boundary
Picking $y = 1$ or $y = -1$ depends on the sign of
$$\Pr(y = 1|x) - \Pr(y = -1|x) = \frac{\sum_i y_i k(x_i, x)}{\sum_i k(x_i, x)}$$
Extension to Regression
Use the same equation for regression. This means that
$$f(x) = \frac{\sum_i y_i k(x_i, x)}{\sum_i k(x_i, x)}$$
where now $y_i \in \mathbb{R}$.
We get a locally weighted version of the data.
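Because classification and regression share one formula, a single function covers both. The sketch below assumes one-dimensional inputs and an unnormalised Gaussian kernel (its normalisation constant cancels in the ratio); the function name and the toy data are illustrative. It does not guard against a query so far from all data that every weight underflows to zero.

```python
import math

def gaussian_kernel(x, xp, sigma=0.5):
    """Unnormalised Gaussian; the constant cancels in the ratio below."""
    return math.exp(-0.5 * ((x - xp) / sigma) ** 2)

def watson_nadaraya(x, xs, ys, k):
    """f(x) = sum_i y_i k(x_i, x) / sum_i k(x_i, x)."""
    weights = [k(xi, x) for xi in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

xs = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]

# Classification: labels in {-1, +1}, decide by the sign of f(x)
labels = [-1, -1, -1, 1, 1, 1]
f_left = watson_nadaraya(-1.2, xs, labels, gaussian_kernel)
f_right = watson_nadaraya(1.2, xs, labels, gaussian_kernel)
prob_right = 0.5 * (1.0 + f_right)  # Pr(y = 1 | x), since the two probabilities sum to 1

# Regression: real-valued targets, same formula
targets = [x * x for x in xs]
f_reg = watson_nadaraya(0.0, xs, targets, gaussian_kernel)

print(round(f_left, 3), round(f_right, 3), round(prob_right, 3), round(f_reg, 3))
```

The regression output is a weighted average of the targets, so it always lies between the smallest and largest observed $y_i$.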
![Page 88: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/88.jpg)
Regression Problem
![Page 89: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/89.jpg)
Watson Nadaraya Regression
![Page 90: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/90.jpg)
Nearest Neighbor Classifier
Extension of Silverman's trick
Use the density estimator for classification and regression.
Simplification
Rather than computing a weighted combination of labels to estimate the label, use an unweighted combination over the nearest neighbors.
Result
k-nearest neighbor classifier. Often used as baseline to compare a new algorithm.
Nice Properties
Given enough data, k-nearest neighbors converges to the best estimator possible (it is consistent).
![Page 91: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/91.jpg)
Practical Implementation
Nearest Neighbor Rule
Need distance measure between data.
Given data $x$, find nearest point $x_i$.
Classify according to the label $y_i$.
k-Nearest Neighbor Rule
Find $k$ nearest neighbors of $x$.
Decide class of $x$ according to majority of labels $y_i$. Hence prefer odd $k$.
Neighborhood Search Algorithms
Brute force search (OK if data not too large)
Random projection tricks (fast but difficult)
Neighborhood trees (very fast, implementation tricky)
Baseline
Use k-NN as reference before fancy algorithms.
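The brute-force variant of the k-nearest neighbor rule fits in a few lines. This sketch assumes Euclidean distance (via `math.dist`, available from Python 3.8) and points given as tuples; `knn_classify` and the toy data are names chosen for the example.

```python
import math
from collections import Counter

def knn_classify(x, xs, ys, k=3):
    """Majority vote among the k training points closest to x (brute force)."""
    neighbours = sorted(zip(xs, ys), key=lambda pair: math.dist(x, pair[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

xs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
ys = [-1, -1, -1, 1, 1, 1]

print(knn_classify((0.5, 0.5), xs, ys, k=3))
print(knn_classify((5.5, 5.5), xs, ys, k=3))
print(knn_classify((0.2, 0.1), xs, ys, k=1))  # k = 1 is the nearest neighbor rule
```

Each query costs one pass over the training set, which is why this version is only reasonable when the data is not too large.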
![Page 92: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/92.jpg)
Summary
Density estimation: empirical frequency, bin counting; priors and Laplace rule
Parzen windows: smoothing out the estimates; examples
Adjusting parameters: cross validation; Silverman's rule
Classification and regression with Parzen windows: Watson-Nadaraya estimator; nearest neighbor classifier
![Page 93: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/93.jpg)
An Introduction to Machine Learning with Kernels
Lecture 3
Alexander J. Smola
Statistical Machine Learning Program
National ICT Australia, Canberra
![Page 94: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/94.jpg)
Day 1
Machine learning and probability theory: Introduction to pattern recognition, classification, regression, novelty detection, probability theory, Bayes rule, inference
Density estimation and Parzen windows: Kernels and density estimation, Silverman's rule, Watson-Nadaraya estimator, crossvalidation
Perceptron and kernels: Hebb's rule, perceptron algorithm, convergence, feature maps, kernel trick, examples
Support Vector classification: Geometrical view, dual problem, convex optimization, kernels and SVM
![Page 95: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/95.jpg)
L3 Perceptron and Kernels
Hebb's rule: positive feedback; perceptron convergence rule
Hyperplanes: linear separability; inseparable sets
Features: explicit feature construction; implicit features via kernels
Kernels: examples; kernel perceptron
![Page 96: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/96.jpg)
Biology and Learning
Basic Idea
Good behavior should be rewarded, bad behavior punished (or not rewarded). This improves the fitness of the system.
Example: hitting a sabertooth tiger over the head should be rewarded ...
Correlated events should be combined.
Example: Pavlov's salivating dog.
Training Mechanisms
Behavioral modification of individuals (learning): successful behavior is rewarded (e.g. food).
Hard-coded behavior in the genes (instinct): the wrongly coded animal dies.
![Page 97: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/97.jpg)
Neurons
Soma
Cell body. Here the signals are combined ("CPU").
Dendrite
Combines the inputs from several other nerve cells ("input bus").
Synapse
Interface between two neurons ("connector").
Axon
This may be up to 1m long and will transport the activation signal to nerve cells at different locations ("output cable").
![Page 98: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/98.jpg)
Perceptron
![Page 99: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/99.jpg)
Perceptrons
Weighted combination
The output of the neuron is a linear combination of the inputs (from the other neurons via their axons) rescaled by the synaptic weights.
Often the output does not directly correspond to the activation level but is a monotonic function thereof.
Decision Function
At the end the results are combined into
$$f(x) = \sigma\left(\sum_{i=1}^{n} w_i x_i + b\right).$$
![Page 100: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/100.jpg)
Separating Half Spaces
Linear Functions
An abstract model is to assume that
$$f(x) = \langle w, x \rangle + b$$
where $w, x \in \mathbb{R}^n$ and $b \in \mathbb{R}$.
Biological Interpretation
The weights $w_i$ correspond to the synaptic weights (activating or inhibiting), the multiplication corresponds to the processing of inputs via the synapses, and the summation is the combination of signals in the cell body (soma).
Applications
Spam filtering (e-mail), echo cancellation (old analog overseas cables)
Learning
Weights are "plastic": they are adapted via the training data.
![Page 101: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/101.jpg)
Linear Separation
![Page 102: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/102.jpg)
Perceptron Algorithm
argument: X := {x_1, ..., x_m} ⊂ 𝒳 (data)
          Y := {y_1, ..., y_m} ⊂ {±1} (labels)
function (w, b) = Perceptron(X, Y, η)
    initialize w, b = 0
    repeat
        Pick (x_i, y_i) from data
        if y_i(w · x_i + b) ≤ 0 then
            w ← w + y_i x_i
            b ← b + y_i
    until y_i(w · x_i + b) > 0 for all i
end
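A direct Python transcription of the pseudocode is given here as a minimal sketch. Two details are assumptions not fixed by the slide: observations are visited cyclically in their given order, and a `max_epochs` cap is added so that the loop terminates on data that is not linearly separable. The step size η is taken to be 1, as in the update shown.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def perceptron(xs, ys, max_epochs=1000):
    """Return (w, b, updates) such that y_i (w . x_i + b) > 0 for all i."""
    w = [0.0] * len(xs[0])
    b = 0.0
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            if y * (dot(w, x) + b) <= 0:
                # Mistake: move (w, b) by y_i (x_i, 1)
                w = [wj + y * xj for wj, xj in zip(w, x)]
                b += y
                mistakes += 1
        updates += mistakes
        if mistakes == 0:
            return w, b, updates
    raise RuntimeError("no separating hyperplane found within max_epochs")

xs = [(2.0, 1.0), (3.0, 2.0), (-1.0, -2.0), (-2.0, -1.0)]
ys = [1, 1, -1, -1]
w, b, updates = perceptron(xs, ys)
print(w, b, updates)
```

The returned hyperplane separates the training set, but it is only one of many separating solutions; which one is found depends on the order in which points are visited.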
![Page 103: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/103.jpg)
Interpretation
Algorithm
Nothing happens if we classify $(x_i, y_i)$ correctly.
If we see an incorrectly classified observation we update $(w, b)$ by $y_i(x_i, 1)$.
Positive reinforcement of observations.
Solution
Weight vector is a linear combination of observations $x_i$:
$$w \leftarrow w + y_i x_i$$
Classification can be written in terms of dot products:
$$w \cdot x + b = \sum_{j \in E} y_j\, x_j \cdot x + b$$
where $E$ collects the indices of the observations on which an update was made (with repetition).
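The dot-product form suggests storing, for each observation, how often it triggered an update instead of storing $w$ itself. The sketch below does exactly that; the counter list `alpha`, the cyclic visiting order and the `max_epochs` cap are assumptions introduced for the example.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def decision(x, xs, ys, alpha, b):
    """w . x + b with w expanded as sum_j alpha_j y_j x_j (dot products only)."""
    return sum(a * y * dot(xj, x) for a, y, xj in zip(alpha, ys, xs)) + b

def dual_perceptron(xs, ys, max_epochs=1000):
    """Perceptron that records update counts alpha_j instead of w."""
    alpha = [0] * len(xs)
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(xs, ys)):
            if y * decision(x, xs, ys, alpha, b) <= 0:
                alpha[i] += 1   # index i joins E once more
                b += y
                mistakes += 1
        if mistakes == 0:
            return alpha, b
    raise RuntimeError("no separating hyperplane found within max_epochs")

xs = [(1.0, 1.0), (2.0, 0.5), (-1.0, -0.5), (-0.5, -2.0)]
ys = [1, 1, -1, -1]
alpha, b = dual_perceptron(xs, ys)

# Recover the explicit weight vector to confirm both forms agree
w = [sum(a * y * xj[d] for a, y, xj in zip(alpha, ys, xs)) for d in range(2)]
query = (0.3, -0.7)
print(alpha, b, w, dot(w, query) + b, decision(query, xs, ys, alpha, b))
```

Since the data enters only through `dot`, replacing that function by another similarity measure leaves the rest of the code unchanged.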
![Page 104: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/104.jpg)
Theoretical Analysis
Incremental Algorithm
Already while the perceptron is learning, we can use it.
Convergence Theorem (Rosenblatt and Novikoff)
Suppose that there exists a $\rho > 0$, a weight vector $w^*$ satisfying $\|w^*\| = 1$, and a threshold $b^*$ such that
$$y_i\left(\langle w^*, x_i \rangle + b^*\right) \ge \rho \quad \text{for all } 1 \le i \le m.$$
Then the hypothesis maintained by the perceptron algorithm converges to a linear separator after no more than
$$\frac{(b^{*2} + 1)(R^2 + 1)}{\rho^2}$$
updates, where $R = \max_i \|x_i\|$.
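To get a feeling for the size of the bound, it can be evaluated for a small data set and a known separator. The function name `perceptron_bound`, the toy data and the choice $w^* = (1, 0)$, $b^* = 0$ are assumptions made for this worked example.

```python
import math

def perceptron_bound(xs, ys, w_star, b_star):
    """Return (bound, rho, R) with bound = (b*^2 + 1)(R^2 + 1) / rho^2."""
    norm = math.sqrt(sum(c * c for c in w_star))
    if abs(norm - 1.0) > 1e-9:
        raise ValueError("the theorem assumes ||w*|| = 1")
    # rho is the smallest margin achieved by (w*, b*) on the data
    rho = min(y * (sum(wc * xc for wc, xc in zip(w_star, x)) + b_star)
              for x, y in zip(xs, ys))
    if rho <= 0:
        raise ValueError("(w*, b*) does not separate the data")
    R = max(math.sqrt(sum(c * c for c in x)) for x in xs)
    return (b_star ** 2 + 1) * (R ** 2 + 1) / rho ** 2, rho, R

xs = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-3.0, -1.0)]
ys = [1, 1, -1, -1]
bound, rho, R = perceptron_bound(xs, ys, w_star=(1.0, 0.0), b_star=0.0)
print(bound, rho, R)
```

Here $\rho = 2$ and $R^2 = 10$, so the bound is $11/4 = 2.75$: the perceptron can make at most two updates on this data, regardless of the order in which the points are presented.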
![Page 105: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/105.jpg)
Solutions of the Perceptron
![Page 106: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/106.jpg)
Proof, Part I
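Starting Point
We start from $w_1 = 0$ and $b_1 = 0$.
Step 1: Bound on the increase of alignment
Denote by $w_j$ the value of $w$ at step $j$ (analogously $b_j$), and let $\eta > 0$ be the step size passed to the perceptron (the update $w \leftarrow w + y_i x_i$, $b \leftarrow b + y_i$ corresponds to $\eta = 1$).
Alignment: $\langle (w_j, b_j), (w^*, b^*) \rangle$
For an error on observation $(x_i, y_i)$ we get
$$\begin{aligned}
\langle (w_{j+1}, b_{j+1}), (w^*, b^*) \rangle
&= \langle (w_j, b_j) + \eta y_i (x_i, 1), (w^*, b^*) \rangle \\
&= \langle (w_j, b_j), (w^*, b^*) \rangle + \eta y_i \langle (x_i, 1), (w^*, b^*) \rangle \\
&\ge \langle (w_j, b_j), (w^*, b^*) \rangle + \eta\rho \\
&\ge j\eta\rho.
\end{aligned}$$
Alignment increases with the number of errors.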
![Page 107: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/107.jpg)
Proof, Part II
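Step 2: Cauchy-Schwarz for the Dot Product
$$\langle (w_{j+1}, b_{j+1}), (w^*, b^*) \rangle \le \|(w_{j+1}, b_{j+1})\|\, \|(w^*, b^*)\| = \sqrt{1 + (b^*)^2}\, \|(w_{j+1}, b_{j+1})\|$$
Step 3: Upper Bound on $\|(w_j, b_j)\|$
If we make a mistake we have $y_i \langle (x_i, 1), (w_j, b_j) \rangle \le 0$, so with step size $\eta$ ($\eta = 1$ in the algorithm as stated)
$$\begin{aligned}
\|(w_{j+1}, b_{j+1})\|^2
&= \|(w_j, b_j) + \eta y_i (x_i, 1)\|^2 \\
&= \|(w_j, b_j)\|^2 + 2\eta y_i \langle (x_i, 1), (w_j, b_j) \rangle + \eta^2 \|(x_i, 1)\|^2 \\
&\le \|(w_j, b_j)\|^2 + \eta^2 \|(x_i, 1)\|^2 \\
&\le j\eta^2 (R^2 + 1).
\end{aligned}$$
Step 4: Combination of first three steps
$$j\eta\rho \le \sqrt{1 + (b^*)^2}\, \|(w_{j+1}, b_{j+1})\| \le \eta \sqrt{j (R^2 + 1)((b^*)^2 + 1)}$$
Solving for $j$ gives $j \le (b^{*2} + 1)(R^2 + 1)/\rho^2$, which proves the theorem.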
![Page 108: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/108.jpg)
What does it mean?
Learning Algorithm
We perform an update only if we make a mistake.
Convergence Bound
Bounds the maximum number of mistakes in total.
We will make at most $(b^{*2} + 1)(R^2 + 1)/\rho^2$ mistakes in the case where a "correct" solution $w^*, b^*$ exists.
This also bounds the expected error (if we know $\rho$, $R$, and $|b^*|$).
Dimension Independent
Bound does not depend on the dimensionality of $X$.
Sample Expansion
We obtain $w$ as a linear combination of the $x_i$.
![Page 109: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/109.jpg)
Realizable and Non-realizable Concepts
Realizable Concept
Here some $w^*, b^*$ exists such that $y$ is generated by $y = \operatorname{sgn}(\langle w^*, x \rangle + b^*)$. In general, realizable means that the exact functional dependency is included in the class of admissible hypotheses.

Unrealizable Concept
In this case, the exact concept does not exist or it is not included in the function class.
![Page 110: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/110.jpg)
The XOR Problem
![Page 111: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/111.jpg)
Training data
![Page 112: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/112.jpg)
Perceptron algorithm (i=7)
![Page 113: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/113.jpg)
Perceptron algorithm (i=16)
![Page 114: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/114.jpg)
Perceptron algorithm (i=2)
![Page 115: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/115.jpg)
Perceptron algorithm (i=4)
![Page 116: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/116.jpg)
Perceptron algorithm (i=16)
![Page 117: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/117.jpg)
Perceptron algorithm (i=2)
![Page 118: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/118.jpg)
Perceptron algorithm (i=16)
![Page 119: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/119.jpg)
Perceptron algorithm (i=12)
![Page 120: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/120.jpg)
Perceptron algorithm (i=16)
![Page 121: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/121.jpg)
Perceptron algorithm (i=20)
![Page 122: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/122.jpg)
Stochastic Gradient Descent, 1
Linear Function

$$f(x) = \langle w, x \rangle + b$$

Objective Function

$$R[f] := \frac{1}{m} \sum_{i=1}^{m} \max(0, -y_i f(x_i)) = \frac{1}{m} \sum_{i=1}^{m} \max\left(0, -y_i (\langle w, x_i \rangle + b)\right)$$

Stochastic Gradient
We use each term in the sum as a stochastic approximation of the overall objective function:

$$w \leftarrow w - \eta \, \partial_w \max\left(0, -y_i (\langle w, x_i \rangle + b)\right)$$

$$b \leftarrow b - \eta \, \partial_b \max\left(0, -y_i (\langle w, x_i \rangle + b)\right)$$
![Page 123: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/123.jpg)
Stochastic Gradient Descent, 2
Details

$$\partial_w \max\left(0, -y_i (\langle w, x_i \rangle + b)\right) = \begin{cases} -y_i x_i & \text{for } y_i f(x_i) < 0 \\ 0 & \text{otherwise} \end{cases}$$

$$\partial_b \max\left(0, -y_i (\langle w, x_i \rangle + b)\right) = \begin{cases} -y_i & \text{for } y_i f(x_i) < 0 \\ 0 & \text{otherwise} \end{cases}$$

Overall Strategy
Have a complicated function consisting of lots of terms.
Want to minimize this monster.
Solve it by performing descent in one direction at a time.
Randomly pick directions and converge.
Often need to adjust the learning rate $\eta$.
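A single stochastic step on one term of the objective might look as follows. This is a sketch under the assumption that a term counts as active when $y_i f(x_i) < 0$; the function name `sgd_step` and the numbers are made up for illustration.

```python
def sgd_step(w, b, x, y, eta):
    """One stochastic subgradient step on the single term max(0, -y * (<w, x> + b))."""
    f_x = sum(wi * xi for wi, xi in zip(w, x)) + b
    if y * f_x < 0:
        # Active term: the subgradient is (-y * x, -y), so we step along (+y * x, +y).
        w = [wi + eta * y * xi for wi, xi in zip(w, x)]
        b = b + eta * y
    return w, b


# Made-up starting point that misclassifies x = (1, 0) with label +1.
w, b = sgd_step([-1.0, 0.0], 0.0, x=(1.0, 0.0), y=1, eta=0.5)
print(w, b)  # [-0.5, 0.0] 0.5

# A correctly classified example leaves the parameters unchanged.
w_same, b_same = sgd_step([1.0, 0.0], 0.0, x=(1.0, 0.0), y=1, eta=0.5)
```

With a strict inequality the step never fires at $w = 0, b = 0$, where every term is exactly zero; the perceptron updates on $y_i f(x_i) \le 0$ instead, which is what lets it start from zero.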
![Page 124: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/124.jpg)
Nonlinearity via Preprocessing
Problem
Linear functions are often too simple to provide good estimators.

Idea
Map to a higher dimensional feature space via $\Phi: x \to \Phi(x)$ and solve the problem there.
Replace every $\langle x, x' \rangle$ by $\langle \Phi(x), \Phi(x') \rangle$ in the perceptron algorithm.

Consequence
We have nonlinear classifiers.
Solution lies in the choice of features $\Phi(x)$.
![Page 125: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/125.jpg)
Nonlinearity via Preprocessing
Features
Quadratic features correspond to circles, hyperbolas and ellipsoids as separating surfaces.
![Page 126: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/126.jpg)
Constructing Features
Idea
Construct features manually. E.g. for OCR we could use
![Page 127: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/127.jpg)
More Examples
Two Interlocking Spirals
If we transform the data $(x_1, x_2)$ into a radial part ($r = \sqrt{x_1^2 + x_2^2}$) and an angular part ($x_1 = r \cos\phi$, $x_2 = r \sin\phi$), the problem becomes much easier to solve (we only have to distinguish different stripes).

Japanese Character Recognition
Break down the images into strokes and recognize the character from them (the strokes follow a predefined order).

Medical Diagnosis
Include physician’s comments, knowledge about unhealthy combinations, features in EEG, ...

Suitable Rescaling
If we observe, say, the weight and the height of a person, rescale to zero mean and unit variance.
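Two of these preprocessing steps are easy to write down. The sketch below shows the polar transform and the rescaling to zero mean and unit variance; the helper names and the sample heights are invented for illustration.

```python
import math
import statistics


def to_polar(x1, x2):
    """Map (x1, x2) to (r, phi) with x1 = r cos(phi) and x2 = r sin(phi)."""
    return math.hypot(x1, x2), math.atan2(x2, x1)


def standardize(values):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


r, phi = to_polar(0.0, 2.0)
heights_cm = [160.0, 170.0, 180.0]  # made-up measurements
print(r, phi, standardize(heights_cm))
```

In practice the mean and standard deviation would be estimated on the training data and reused unchanged on new observations.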
![Page 128: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/128.jpg)
Perceptron on Features
argument: $X := \{x_1, \ldots, x_m\} \subset \mathcal{X}$ (data)
argument: $Y := \{y_1, \ldots, y_m\} \subset \{\pm 1\}$ (labels)
function $(w, b)$ = Perceptron$(X, Y, \eta)$
  initialize $w, b = 0$
  repeat
    Pick $(x_i, y_i)$ from data
    if $y_i (w \cdot \Phi(x_i) + b) \le 0$ then
      $w \leftarrow w + y_i \Phi(x_i)$
      $b \leftarrow b + y_i$
    end if
  until $y_i (w \cdot \Phi(x_i) + b) > 0$ for all $i$
end

Important detail
$w = \sum_j y_j \Phi(x_j)$ and hence $f(x) = \sum_j y_j (\Phi(x_j) \cdot \Phi(x)) + b$, where the sums run over the examples on which an update was made.
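As an illustration, the perceptron on features can solve XOR once a product feature is added. This is a sketch only: the feature map `phi` below (the two inputs plus their product) and the -1/+1 input coding are assumptions chosen for the example, not taken from the slides.

```python
def phi(x):
    """Hypothetical feature map: the two inputs plus their product."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def perceptron_on_features(data, labels, phi):
    """Perceptron run on phi(x) instead of x; loops until a pass makes no mistake."""
    w = [0.0] * len(phi(data[0]))
    b = 0.0
    while True:
        mistakes = 0
        for x, y in zip(data, labels):
            features = phi(x)
            if y * (sum(wi * fi for wi, fi in zip(w, features)) + b) <= 0:
                w = [wi + y * fi for wi, fi in zip(w, features)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b


# XOR with inputs coded as -1/+1: the label is +1 exactly when the inputs differ.
data = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [-1, 1, 1, -1]
w, b = perceptron_on_features(data, labels, phi)
predictions = [1 if sum(wi * fi for wi, fi in zip(w, phi(x))) + b > 0 else -1
               for x in data]
print(w, b, predictions)
```

The loop terminates here because the data are separable in feature space; on non-separable data it would run forever, so a real implementation would cap the number of passes.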
![Page 129: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/129.jpg)
Problems with Constructing Features
Problems
Need to be an expert in the domain (e.g. Chinese characters).
Features may not be robust (e.g. postman drops letter in dirt).
Can be expensive to compute.

Solution
Use shotgun approach.
Compute many features and hope a good one is among them.
Do this efficiently.
![Page 130: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/130.jpg)
Polynomial Features
Quadratic Features in $\mathbb{R}^2$

$$\Phi(x) := \left(x_1^2, \sqrt{2} x_1 x_2, x_2^2\right)$$

Dot Product

$$\langle \Phi(x), \Phi(x') \rangle = \left\langle \left(x_1^2, \sqrt{2} x_1 x_2, x_2^2\right), \left(x_1'^2, \sqrt{2} x_1' x_2', x_2'^2\right) \right\rangle = \langle x, x' \rangle^2.$$

Insight
Trick works for any polynomials of order $d$ via $\langle x, x' \rangle^d$.
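The identity can be verified numerically. The following sketch evaluates both sides for two arbitrarily chosen points; the helper names are illustrative.

```python
import math


def phi(x):
    """Quadratic feature map in R^2: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


x, x_prime = (1.0, 2.0), (3.0, -1.0)  # arbitrary test points
explicit = dot(phi(x), phi(x_prime))  # dot product in feature space
implicit = dot(x, x_prime) ** 2       # squared dot product in input space
print(explicit, implicit)
```

Both values agree up to floating-point rounding, although the second never forms the feature vectors.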
![Page 131: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/131.jpg)
Kernels
Problem
Extracting features can sometimes be very costly.
Example: second order features in 1000 dimensions. This leads to about $5 \cdot 10^5$ numbers. For higher order polynomial features it is much worse.

Solution
Don’t compute the features, try to compute dot products implicitly. For some features this works ...

Definition
A kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric function in its arguments for which the following property holds:

$$k(x, x') = \langle \Phi(x), \Phi(x') \rangle \text{ for some feature map } \Phi.$$

If $k(x, x')$ is much cheaper to compute than $\Phi(x)$ ...
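To get a feel for the cost, one can count the features directly. The sketch below assumes that "order $d$ features" means all distinct monomials of degree exactly $d$, which gives $\binom{n + d - 1}{d}$ of them in $n$ dimensions; the function name is made up.

```python
import math


def num_monomials(n, d):
    """Number of distinct monomials of degree exactly d in n variables."""
    return math.comb(n + d - 1, d)


for d in (2, 3, 4):
    print(d, num_monomials(1000, d))
```

For $n = 1000$ the count is 500500 at $d = 2$ and already above $10^8$ at $d = 3$, whereas evaluating $\langle x, x' \rangle^d$ needs only about 1000 multiplications.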
![Page 132: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/132.jpg)
Polynomial Kernels in $\mathbb{R}^n$
Idea
We want to extend $k(x, x') = \langle x, x' \rangle^2$ to

$$k(x, x') = (\langle x, x' \rangle + c)^d \text{ where } c > 0 \text{ and } d \in \mathbb{N}.$$

Prove that such a kernel corresponds to a dot product.

Proof strategy
Simple and straightforward: compute the explicit sum given by the kernel, i.e.

$$k(x, x') = (\langle x, x' \rangle + c)^d = \sum_{i=0}^{d} \binom{d}{i} \langle x, x' \rangle^i \, c^{d-i}$$

Individual terms $\langle x, x' \rangle^i$ are dot products for some $\Phi_i(x)$.
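The binomial expansion used in the proof strategy can be checked numerically. This is a small sketch with arbitrarily chosen points and parameters; the function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def poly_kernel(x, x_prime, c, d):
    """Polynomial kernel (<x, x'> + c)^d."""
    return (dot(x, x_prime) + c) ** d


def poly_kernel_expanded(x, x_prime, c, d):
    """Binomial expansion: sum over i of C(d, i) * <x, x'>^i * c^(d - i)."""
    s = dot(x, x_prime)
    return sum(math.comb(d, i) * s ** i * c ** (d - i) for i in range(d + 1))


x, x_prime = (1.0, 2.0, 3.0), (0.5, -1.0, 2.0)  # arbitrary test points
direct = poly_kernel(x, x_prime, c=1.0, d=3)
expanded = poly_kernel_expanded(x, x_prime, c=1.0, d=3)
print(direct, expanded)
```

Since $c > 0$ and the binomial coefficients are positive, each summand is a positively weighted kernel $\langle x, x' \rangle^i$, which is why the sum is again a dot product.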
![Page 133: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/133.jpg)
Kernel Perceptron
argument: $X := \{x_1, \ldots, x_m\} \subset \mathcal{X}$ (data)
argument: $Y := \{y_1, \ldots, y_m\} \subset \{\pm 1\}$ (labels)
function $f$ = Perceptron$(X, Y, \eta)$
  initialize $f = 0$
  repeat
    Pick $(x_i, y_i)$ from data
    if $y_i f(x_i) \le 0$ then
      $f(\cdot) \leftarrow f(\cdot) + y_i k(x_i, \cdot) + y_i$
    end if
  until $y_i f(x_i) > 0$ for all $i$
end

Important detail
$w = \sum_j y_j \Phi(x_j)$ and hence $f(x) = \sum_j y_j k(x_j, x) + b$.
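The kernel perceptron can be sketched in a few lines by storing the examples on which updates happened. The polynomial kernel with $c = 1$, $d = 2$, the XOR data, and the `max_epochs` safety cap are assumptions made for this example.

```python
def poly_kernel(x, x_prime, c=1.0, d=2):
    """Polynomial kernel (<x, x'> + c)^d."""
    return (sum(a * b for a, b in zip(x, x_prime)) + c) ** d


def decision(x, expansion, b, kernel):
    """f(x) = sum_j y_j k(x_j, x) + b over the stored mistakes (x_j, y_j)."""
    return sum(y_j * kernel(x_j, x) for x_j, y_j in expansion) + b


def kernel_perceptron(data, labels, kernel, max_epochs=100):
    expansion, b = [], 0.0
    for _ in range(max_epochs):  # safety cap, an addition to the slide's loop
        mistakes = 0
        for x, y in zip(data, labels):
            if y * decision(x, expansion, b, kernel) <= 0:
                expansion.append((x, y))  # f <- f + y_i k(x_i, .)
                b += y                    # ... + y_i
                mistakes += 1
        if mistakes == 0:
            break
    return expansion, b


# XOR with inputs coded as -1/+1.
data = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [-1, 1, 1, -1]
expansion, b = kernel_perceptron(data, labels, poly_kernel)
predictions = [1 if decision(x, expansion, b, poly_kernel) > 0 else -1 for x in data]
print(len(expansion), b, predictions)
```

The weight vector $w$ is never formed; the classifier is represented entirely by the list of stored examples and the offset.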
![Page 134: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/134.jpg)
Are all k(x, x′) good Kernels?
Computability
We have to be able to compute $k(x, x')$ efficiently (much cheaper than computing the feature-space dot products explicitly).

“Nice and Useful” Functions
The features themselves have to be useful for the learning problem at hand. Quite often this means smooth functions.

Symmetry
Obviously $k(x, x') = k(x', x)$ due to the symmetry of the dot product $\langle \Phi(x), \Phi(x') \rangle = \langle \Phi(x'), \Phi(x) \rangle$.

Dot Product in Feature Space
Is there always a $\Phi$ such that $k$ really is a dot product?
![Page 135: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/135.jpg)
Mercer’s Theorem
The Theorem
For any symmetric function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ which is square integrable in $\mathcal{X} \times \mathcal{X}$ and which satisfies

$$\int_{\mathcal{X} \times \mathcal{X}} k(x, x') f(x) f(x') \, dx \, dx' \ge 0 \text{ for all } f \in L_2(\mathcal{X})$$

there exist $\phi_i: \mathcal{X} \to \mathbb{R}$ and numbers $\lambda_i \ge 0$ where

$$k(x, x') = \sum_i \lambda_i \phi_i(x) \phi_i(x') \text{ for all } x, x' \in \mathcal{X}.$$

Interpretation
Double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have

$$\sum_i \sum_j k(x_i, x_j) \alpha_i \alpha_j \ge 0$$
![Page 136: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/136.jpg)
Properties of the Kernel
Distance in Feature Space
Distance between points in feature space via

$$\begin{aligned}
d(x, x')^2 &:= \|\Phi(x) - \Phi(x')\|^2 \\
&= \langle \Phi(x), \Phi(x) \rangle - 2 \langle \Phi(x), \Phi(x') \rangle + \langle \Phi(x'), \Phi(x') \rangle \\
&= k(x, x) + k(x', x') - 2 k(x, x')
\end{aligned}$$

Kernel Matrix
To compare observations we compute dot products, so we study the matrix $K$ given by

$$K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle = k(x_i, x_j)$$

where $x_i$ are the training patterns.

Similarity Measure
The entries $K_{ij}$ tell us the overlap between $\Phi(x_i)$ and $\Phi(x_j)$, so $k(x_i, x_j)$ is a similarity measure.
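Both quantities need only kernel evaluations. The sketch below uses a Gaussian RBF kernel with an arbitrarily chosen width parameter; the function names and sample points are invented for illustration.

```python
import math


def gaussian_kernel(x, x_prime, lam=0.5):
    """Gaussian RBF kernel exp(-lam * ||x - x'||^2); lam = 0.5 is an arbitrary choice."""
    return math.exp(-lam * sum((a - b) ** 2 for a, b in zip(x, x_prime)))


def feature_distance_sq(kernel, x, x_prime):
    """||Phi(x) - Phi(x')||^2 computed from kernel evaluations only."""
    return kernel(x, x) + kernel(x_prime, x_prime) - 2.0 * kernel(x, x_prime)


def kernel_matrix(kernel, points):
    """K[i][j] = k(x_i, x_j) for the given list of points."""
    return [[kernel(xi, xj) for xj in points] for xi in points]


points = [(0.0, 0.0), (1.0, 1.0), (3.0, 0.0)]
K = kernel_matrix(gaussian_kernel, points)
d_sq = feature_distance_sq(gaussian_kernel, points[0], points[1])
print(d_sq)
```

For the Gaussian kernel $k(x, x) = 1$, so the squared feature-space distance reduces to $2 - 2 k(x, x')$ and never exceeds 2.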
![Page 137: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/137.jpg)
Properties of the Kernel Matrix
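K is Positive Semidefinite
Claim: $\alpha^\top K \alpha \ge 0$ for all $\alpha \in \mathbb{R}^m$ and all kernel matrices $K \in \mathbb{R}^{m \times m}$. Proof:

$$\sum_{i,j=1}^{m} \alpha_i \alpha_j K_{ij} = \sum_{i,j=1}^{m} \alpha_i \alpha_j \langle \Phi(x_i), \Phi(x_j) \rangle = \left\langle \sum_{i=1}^{m} \alpha_i \Phi(x_i), \sum_{j=1}^{m} \alpha_j \Phi(x_j) \right\rangle = \left\| \sum_{i=1}^{m} \alpha_i \Phi(x_i) \right\|^2$$

Kernel Expansion
If $w$ is given by a linear combination of $\Phi(x_i)$ we get

$$\langle w, \Phi(x) \rangle = \left\langle \sum_{i=1}^{m} \alpha_i \Phi(x_i), \Phi(x) \right\rangle = \sum_{i=1}^{m} \alpha_i k(x_i, x).$$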
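The proof can be replayed numerically for a kernel whose feature map is known. The sketch below uses the quadratic kernel $\langle x, x' \rangle^2$ in two dimensions with its explicit feature map; the random points and coefficients are made up.

```python
import math
import random


def phi(x):
    """Quadratic feature map in R^2, so that <phi(x), phi(x')> = <x, x'>^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def k(x, x_prime):
    return sum(a * b for a, b in zip(x, x_prime)) ** 2


random.seed(1)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(5)]
alpha = [random.uniform(-1, 1) for _ in points]

K = [[k(xi, xj) for xj in points] for xi in points]
quad_form = sum(alpha[i] * alpha[j] * K[i][j] for i in range(5) for j in range(5))

# Squared norm of the weighted sum of feature vectors, computed explicitly.
combo = [sum(a * phi(x)[t] for a, x in zip(alpha, points)) for t in range(3)]
norm_sq = sum(c * c for c in combo)
print(quad_form, norm_sq)
```

The two printed numbers agree up to rounding, and the second is a sum of squares, so the quadratic form cannot be negative.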
![Page 138: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/138.jpg)
A Counterexample
A Candidate for a Kernel

$$k(x, x') = \begin{cases} 1 & \text{if } \|x - x'\| \le 1 \\ 0 & \text{otherwise} \end{cases}$$

This is symmetric and gives us some information about the proximity of points, yet it is not a proper kernel ...

Kernel Matrix
We use three points, $x_1 = 1$, $x_2 = 2$, $x_3 = 3$, and compute the resulting “kernel matrix” $K$. This yields

$$K = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 1 \end{pmatrix}$$

with eigenvalues $(\sqrt{2} - 1)^{-1}$, $1$ and $(1 - \sqrt{2})$. Since $1 - \sqrt{2} < 0$, $K$ is not positive semidefinite. Hence $k$ is not a kernel.
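A negative eigenvalue means some coefficient vector makes the quadratic form negative. The sketch below builds the matrix for the three points and evaluates the form for one hand-picked vector `alpha`, which is an assumption of this example rather than part of the slides.

```python
def candidate_kernel(x, x_prime):
    """1 if the two (scalar) points are at most 1 apart, 0 otherwise."""
    return 1.0 if abs(x - x_prime) <= 1 else 0.0


points = [1, 2, 3]
K = [[candidate_kernel(a, b) for b in points] for a in points]

alpha = [1.0, -1.0, 1.0]  # hand-picked witness vector
quad_form = sum(alpha[i] * alpha[j] * K[i][j] for i in range(3) for j in range(3))
print(K, quad_form)  # the quadratic form evaluates to -1.0
```

A single vector with a negative quadratic form is enough to rule out positive semidefiniteness, without computing any eigenvalues.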
![Page 139: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/139.jpg)
Some Good Kernels
Examples of kernels $k(x, x')$

| Kernel | $k(x, x')$ |
| --- | --- |
| Linear | $\langle x, x' \rangle$ |
| Laplacian RBF | $\exp(-\lambda \|x - x'\|)$ |
| Gaussian RBF | $\exp(-\lambda \|x - x'\|^2)$ |
| Polynomial | $(\langle x, x' \rangle + c)^d$, $c \ge 0$, $d \in \mathbb{N}$ |
| B-Spline | $B_{2n+1}(x - x')$ |
| Cond. Expectation | $\mathbf{E}_c[p(x \mid c) \, p(x' \mid c)]$ |

Simple trick for checking Mercer’s condition
Compute the Fourier transform of the kernel and check that it is nonnegative (this applies to kernels of the form $k(x - x')$).
![Page 140: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/140.jpg)
Linear Kernel
![Page 141: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/141.jpg)
Laplacian Kernel
![Page 142: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/142.jpg)
Gaussian Kernel
![Page 143: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/143.jpg)
Polynomial (Order 3)
![Page 144: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/144.jpg)
B3-Spline Kernel
![Page 145: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/145.jpg)
Summary
Hebb’s rule
Positive feedback
Perceptron convergence rule, kernel perceptron

Features
Explicit feature construction
Implicit features via kernels

Kernels
Examples
Mercer’s theorem
![Page 146: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/146.jpg)
An Introduction to Machine Learning with Kernels
Lecture 4
Alexander J. Smola
Statistical Machine Learning Program
National ICT Australia, Canberra
![Page 147: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/147.jpg)
Day 1
Machine learning and probability theory
Introduction to pattern recognition, classification, regression, novelty detection, probability theory, Bayes rule, inference

Density estimation and Parzen windows
Kernels and density estimation, Silverman’s rule, Watson Nadaraya estimator, crossvalidation

Perceptron and kernels
Hebb’s rule, perceptron algorithm, convergence, feature maps, kernel trick, examples

Support Vector classification
Geometrical view, dual problem, convex optimization, kernels and SVM
![Page 148: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/148.jpg)
L4 Support Vector Classification
Support Vector Machine
Problem definition
Geometrical picture
Optimization problem

Optimization Problem
Hard margin
Convexity
Dual problem
Soft margin problem
![Page 149: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/149.jpg)
Classification
Data
Pairs of observations $(x_i, y_i)$ generated from some distribution $P(x, y)$, e.g., (blood status, cancer), (credit transaction, fraud), (profile of jet engine, defect)

Task
Estimate $y$ given $x$ at a new location.
Modification: find a function $f(x)$ that does the task.
![Page 150: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/150.jpg)
So Many Solutions
![Page 151: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/151.jpg)
One to rule them all . . .
![Page 152: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/152.jpg)
Optimal Separating Hyperplane
![Page 153: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/153.jpg)
Optimization Problem
Margin to Norm
Separation of sets is given by $\frac{2}{\|w\|}$, so maximize that.
Equivalently minimize $\|w\|$.
Equivalently minimize $\|w\|^2$.

Constraints
Separation with margin, i.e.

$$\langle w, x_i \rangle + b \ge 1 \text{ if } y_i = 1$$

$$\langle w, x_i \rangle + b \le -1 \text{ if } y_i = -1$$

Equivalent constraint

$$y_i (\langle w, x_i \rangle + b) \ge 1$$
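The separation $2/\|w\|$ follows from a short calculation. As a worked step (the points $x_+$ and $x_-$ are introduced here for illustration), take one point on each margin hyperplane:

$$\langle w, x_+ \rangle + b = 1, \qquad \langle w, x_- \rangle + b = -1.$$

Subtracting the two equations gives $\langle w, x_+ - x_- \rangle = 2$. The distance between the hyperplanes is the length of $x_+ - x_-$ along the unit normal $w / \|w\|$:

$$\left\langle \frac{w}{\|w\|}, x_+ - x_- \right\rangle = \frac{2}{\|w\|}.$$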
![Page 154: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/154.jpg)
Optimization Problem
Mathematical Programming Setting
Combining the above requirements we obtain

$$\text{minimize } \frac{1}{2} \|w\|^2 \quad \text{subject to } y_i (\langle w, x_i \rangle + b) - 1 \ge 0 \text{ for all } 1 \le i \le m$$

Properties
Problem is convex.
Hence it has a unique minimum.
Efficient algorithms for solving it exist.
![Page 155: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/155.jpg)
Lagrange Function
Objective Function
We have $\frac{1}{2} \|w\|^2$.

Constraints

$$c_i(w, b) := 1 - y_i (\langle w, x_i \rangle + b) \le 0$$

Lagrange Function

$$L(w, b, \alpha) = \text{PrimalObjective} + \sum_i \alpha_i c_i = \frac{1}{2} \|w\|^2 + \sum_{i=1}^{m} \alpha_i \left(1 - y_i (\langle w, x_i \rangle + b)\right)$$

Saddle Point Condition
Partial derivatives of $L$ with respect to $w$ and $b$ need to vanish.
![Page 156: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/156.jpg)
Solving the Equations
Lagrange Function

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 + \sum_{i=1}^{m} \alpha_i \left(1 - y_i (\langle w, x_i \rangle + b)\right)$$

Saddle Point Condition

$$\partial_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \iff w = \sum_{i=1}^{m} \alpha_i y_i x_i$$

$$\partial_b L(w, b, \alpha) = -\sum_{i=1}^{m} \alpha_i y_i = 0 \iff \sum_{i=1}^{m} \alpha_i y_i = 0$$

To obtain the dual optimization problem we have to substitute the values of $w$ and $b$ into $L$. Note that the dual variables $\alpha_i$ have the constraint $\alpha_i \ge 0$.
![Page 157: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/157.jpg)
Solving the Equations
Dual Optimization Problem
After substituting in terms for $b$, $w$ the Lagrange function becomes

$$-\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_{i=1}^{m} \alpha_i$$

$$\text{subject to } \sum_{i=1}^{m} \alpha_i y_i = 0 \text{ and } \alpha_i \ge 0 \text{ for all } 1 \le i \le m$$

Practical Modification
Need to maximize dual objective function. Rewrite as

$$\text{minimize } \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_{i=1}^{m} \alpha_i$$

subject to the above constraints.
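The substitution can be written out term by term. This is a routine expansion using the two saddle point conditions $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$, added as a worked step:

$$\frac{1}{2} \|w\|^2 = \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$

$$\sum_{i=1}^{m} \alpha_i \left(1 - y_i (\langle w, x_i \rangle + b)\right) = \sum_{i=1}^{m} \alpha_i - \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - b \sum_{i=1}^{m} \alpha_i y_i$$

The last term vanishes because $\sum_i \alpha_i y_i = 0$. Adding the two lines gives

$$L = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle,$$

which depends on the data only through the dot products $\langle x_i, x_j \rangle$.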
![Page 158: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/158.jpg)
Support Vector Expansion
Solution in $w = \sum_{i=1}^{m} \alpha_i y_i x_i$
$w$ is given by a linear combination of training patterns $x_i$. Independent of the dimensionality of $x$.
$w$ depends on the Lagrange multipliers $\alpha_i$.

Kuhn-Tucker Conditions
At optimal solution Constraint · Lagrange Multiplier = 0
In our context this means

$$\alpha_i \left(1 - y_i (\langle w, x_i \rangle + b)\right) = 0.$$

Equivalently we have

$$\alpha_i \ne 0 \implies y_i (\langle w, x_i \rangle + b) = 1$$

Only points on the margin (the ones closest to the decision boundary) can contribute to the solution.
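These conditions can be checked on a tiny example. The sketch below uses three one-dimensional points with hand-computed multipliers; the data and the values of `alphas` and `b` are assumptions of this example, not the output of a solver.

```python
# Toy one-dimensional problem with hand-computed dual variables.
xs = [1.0, -1.0, 3.0]
ys = [1, -1, 1]
alphas = [0.5, 0.5, 0.0]
b = 0.0

# Support vector expansion: w = sum_i alpha_i * y_i * x_i.
w = sum(a * y * x for a, y, x in zip(alphas, ys, xs))

# y_i * (w * x_i + b) must be at least 1 for every point.
margins = [y * (w * x + b) for x, y in zip(xs, ys)]

# Kuhn-Tucker: alpha_i * (1 - y_i * (w * x_i + b)) must vanish for every i.
slackness = [a * (1.0 - m) for a, m in zip(alphas, margins)]

support = [i for i, a in enumerate(alphas) if a > 0]
print(w, margins, slackness, support)
```

The third point lies well beyond the margin, has $\alpha_3 = 0$, and could be removed without changing $w$.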
![Page 159: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/159.jpg)
Kernels
Nonlinearity via Feature Maps
Replace $x_i$ by $\Phi(x_i)$ in the optimization problem.

Equivalent optimization problem

$$\text{minimize } \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j) - \sum_{i=1}^{m} \alpha_i$$

$$\text{subject to } \sum_{i=1}^{m} \alpha_i y_i = 0 \text{ and } \alpha_i \ge 0 \text{ for all } 1 \le i \le m$$

Decision Function
From $w = \sum_{i=1}^{m} \alpha_i y_i \Phi(x_i)$ we conclude

$$f(x) = \langle w, \Phi(x) \rangle + b = \sum_{i=1}^{m} \alpha_i y_i k(x_i, x) + b.$$
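Evaluating the decision function needs only the support vectors, their labels and multipliers, the offset, and the kernel. A minimal sketch, with hand-picked toy values and a linear kernel standing in for $k$:

```python
def linear_kernel(x, x_prime):
    return sum(a * b for a, b in zip(x, x_prime))


def svm_decision(x, support_x, support_y, alphas, b, kernel):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b."""
    return sum(a * y * kernel(x_i, x)
               for a, y, x_i in zip(alphas, support_y, support_x)) + b


# Hand-picked toy values (one-dimensional inputs), not the output of a solver.
support_x = [(1.0,), (-1.0,)]
support_y = [1, -1]
alphas = [0.5, 0.5]
b = 0.0

f_value = svm_decision((2.0,), support_x, support_y, alphas, b, linear_kernel)
label = 1 if f_value > 0 else -1
print(f_value, label)  # 2.0 1
```

Swapping `linear_kernel` for any other valid kernel changes the classifier without touching `svm_decision`.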
![Page 160: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/160.jpg)
Examples and Problems
Advantage
Works well when the data is noise free.

Problem
Already a single wrong observation can ruin everything: we require $y_i f(x_i) \ge 1$ for all $i$.

Idea
Limit the influence of individual observations by making the constraints less stringent (introduce slacks).
![Page 161: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/161.jpg)
Optimization Problem (Soft Margin)
Recall: Hard Margin Problem

$$\text{minimize } \frac{1}{2} \|w\|^2 \quad \text{subject to } y_i (\langle w, x_i \rangle + b) - 1 \ge 0$$

Softening the Constraints

$$\text{minimize } \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{subject to } y_i (\langle w, x_i \rangle + b) - 1 + \xi_i \ge 0 \text{ and } \xi_i \ge 0$$
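For a fixed $(w, b)$ the smallest slack satisfying both constraints is $\xi_i = \max(0, 1 - y_i (\langle w, x_i \rangle + b))$. The sketch below computes these slacks and the resulting objective; the function name, the data and the value of $C$ are made up for illustration.

```python
def soft_margin_objective(w, b, data, labels, C):
    """Return (objective, slacks) using the smallest feasible slacks for fixed (w, b)."""
    slacks = []
    for x, y in zip(data, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - margin))  # xi_i >= 0 and xi_i >= 1 - margin
    objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks


data = [(2.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
labels = [1, 1, -1]
objective, slacks = soft_margin_objective((1.0, 0.0), 0.0, data, labels, C=10.0)
print(slacks, objective)  # [0.0, 0.5, 2.0] 25.5
```

The first point satisfies the hard constraint, the second lies inside the margin, and the third is misclassified; a larger $C$ penalizes the latter two more heavily.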
![Page 162: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/162.jpg)
Linear SVM C = 1
![Page 163: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/163.jpg)
Linear SVM C = 2
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 18
![Page 164: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/164.jpg)
Linear SVM C = 5
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 19
![Page 165: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/165.jpg)
Linear SVM C = 10
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 20
![Page 166: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/166.jpg)
Linear SVM C = 20
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 21
![Page 167: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/167.jpg)
Linear SVM C = 50
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 22
![Page 168: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/168.jpg)
Linear SVM C = 100
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 23
![Page 169: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/169.jpg)
Linear SVM C = 1
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 24
![Page 170: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/170.jpg)
Linear SVM C = 2
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 25
![Page 171: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/171.jpg)
Linear SVM C = 5
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 26
![Page 172: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/172.jpg)
Linear SVM C = 10
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 27
![Page 173: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/173.jpg)
Linear SVM C = 20
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 28
![Page 174: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/174.jpg)
Linear SVM C = 50
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 29
![Page 175: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/175.jpg)
Linear SVM C = 100
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 30
![Page 176: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/176.jpg)
Linear SVM C = 1
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 31
![Page 177: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/177.jpg)
Linear SVM C = 2
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 32
![Page 178: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/178.jpg)
Linear SVM C = 5
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 33
![Page 179: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/179.jpg)
Linear SVM C = 10
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 34
![Page 180: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/180.jpg)
Linear SVM C = 20
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 35
![Page 181: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/181.jpg)
Linear SVM C = 50
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 36
![Page 182: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/182.jpg)
Linear SVM C = 100
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 37
![Page 183: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/183.jpg)
Linear SVM C = 1
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 38
![Page 184: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/184.jpg)
Linear SVM C = 2
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 39
![Page 185: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/185.jpg)
Linear SVM C = 5
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 40
![Page 186: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/186.jpg)
Linear SVM C = 10
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 41
![Page 187: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/187.jpg)
Linear SVM C = 20
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 42
![Page 188: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/188.jpg)
Linear SVM C = 50
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 43
![Page 189: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/189.jpg)
Linear SVM C = 100
Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page 44
![Page 190: An Introduction to Machine Learning with Kernelsusers.cecs.anu.edu.au/~daa/courses/GSAC6017/day_1.pdf · Alexander J. Smola: An Introduction to Machine Learning with Kernels, Page](https://reader034.fdocuments.net/reader034/viewer/2022042103/5e80fa1c880708383a57ccde/html5/thumbnails/190.jpg)
Insights
Changing C
For clean data C doesn’t matter much.
For noisy data, large C leads to narrow margin (SVM tries to do a good job at separating, even though it isn’t possible).
Noisy data
Clean data has few support vectors.
Noisy data leads to data in the margins.
More support vectors for noisy data.
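The effect of C on the margin can be reproduced on a tiny problem. The following is only a sketch under strong assumptions that are not part of the lecture: the data are one-dimensional and hypothetical, the offset b is fixed at 0 (the toy data are symmetric), and the objective is minimized by brute force over a grid of w values instead of a proper solver.

```python
def soft_margin_objective(w, X, y, C):
    """0.5 * w^2 + C * sum of slacks for the 1-D classifier f(x) = w * x (b = 0)."""
    slack = sum(max(0.0, 1.0 - yi * w * xi) for xi, yi in zip(X, y))
    return 0.5 * w * w + C * slack


def best_w(X, y, C, grid):
    """Brute-force minimizer of the objective over a grid of candidate w values."""
    return min(grid, key=lambda w: soft_margin_objective(w, X, y, C))


# Hypothetical 1-D data: two clean points and two mislabeled (noisy) points.
X = [2.0, -2.0, 0.5, -0.5]
y = [+1, -1, -1, +1]
grid = [i / 100.0 for i in range(1, 201)]  # candidate w in (0, 2]

margins = {}
for C in (0.01, 0.1, 1.0, 100.0):
    w = best_w(X, y, C, grid)
    margins[C] = 2.0 / w  # margin width is 2 / |w|
```

On this toy data the margin width shrinks from about 66.7 (C = 0.01) to about 6.67 (C = 0.1) and then stays at 4.0 for C = 1 and C = 100: a larger C pushes the classifier toward a narrower margin even though the noisy points can never be separated.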
Lagrange Function and Constraints
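Lagrange Function
We have m more constraints, namely those on the $\xi_i$, for which we will use $\eta_i$ as Lagrange multipliers.
$$L(w, b, \xi, \alpha, \eta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i + \sum_{i=1}^{m} \alpha_i \left(1 - \xi_i - y_i(\langle w, x_i \rangle + b)\right) - \sum_{i=1}^{m} \eta_i \xi_i$$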
Saddle Point Conditions
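$$\partial_w L(w, b, \xi, \alpha, \eta) = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \iff w = \sum_{i=1}^{m} \alpha_i y_i x_i.$$
$$\partial_b L(w, b, \xi, \alpha, \eta) = \sum_{i=1}^{m} -\alpha_i y_i = 0 \iff \sum_{i=1}^{m} \alpha_i y_i = 0.$$
$$\partial_{\xi_i} L(w, b, \xi, \alpha, \eta) = C - \alpha_i - \eta_i = 0 \iff \alpha_i \in [0, C]$$
The last equivalence uses $\alpha_i \geq 0$ and $\eta_i \geq 0$, so that $\alpha_i = C - \eta_i \leq C$.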
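As a worked step, the saddle point conditions can be substituted back into the Lagrange function. This is a sketch of the standard calculation, written with the plain inner product $\langle x_i, x_j \rangle$; grouping the terms of $L$ by $\xi_i$, $w$ and $b$ and then using $C - \alpha_i - \eta_i = 0$, $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$ gives:

```latex
\begin{align*}
L &= \tfrac{1}{2}\|w\|^2
     + \sum_{i=1}^{m} \xi_i \,(C - \alpha_i - \eta_i)
     + \sum_{i=1}^{m} \alpha_i
     - \Big\langle w, \sum_{i=1}^{m} \alpha_i y_i x_i \Big\rangle
     - b \sum_{i=1}^{m} \alpha_i y_i \\
  &= \tfrac{1}{2}\|w\|^2 + 0 + \sum_{i=1}^{m} \alpha_i - \|w\|^2 - 0 \\
  &= \sum_{i=1}^{m} \alpha_i
     - \tfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle .
\end{align*}
```

Maximizing this expression over $\alpha$ is the same as minimizing its negative, and replacing $\langle x_i, x_j \rangle$ by $k(x_i, x_j)$ gives the kernelized dual objective. The slacks $\xi_i$ and multipliers $\eta_i$ drop out entirely; their only trace is the upper bound $\alpha_i \leq C$.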
Dual Optimization Problem
Optimization Problem
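$$\text{minimize } \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j k(x_i, x_j) - \sum_{i=1}^{m} \alpha_i$$
$$\text{subject to } \sum_{i=1}^{m} \alpha_i y_i = 0 \text{ and } C \geq \alpha_i \geq 0 \text{ for all } 1 \leq i \leq m$$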
Interpretation
Almost same optimization problem as before.
Constraint on weight of each $\alpha_i$ (bounds influence of pattern).
Efficient solvers exist (more about that tomorrow).
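The dual objective and its constraints are easy to check numerically for a candidate α. The sketch below does not solve the problem; it only evaluates the objective and tests feasibility. The helper names `dual_objective` and `is_feasible`, the tolerance, and the two-point data set are assumptions made for illustration.

```python
def dot(u, v):
    """Linear kernel k(u, v) = <u, v>."""
    return sum(ui * vi for ui, vi in zip(u, v))


def dual_objective(alpha, X, y, kernel=dot):
    """0.5 * sum_ij alpha_i alpha_j y_i y_j k(x_i, x_j) - sum_i alpha_i."""
    m = len(alpha)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * kernel(X[i], X[j])
               for i in range(m) for j in range(m))
    return 0.5 * quad - sum(alpha)


def is_feasible(alpha, y, C, tol=1e-9):
    """Check sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C (up to a tolerance)."""
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    boxed = all(-tol <= a <= C + tol for a in alpha)
    return balanced and boxed


# Hypothetical two-point problem.
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [+1, -1]

value = dual_objective([0.25, 0.25], X, y)
ok = is_feasible([0.25, 0.25], y, C=1.0)
unbalanced = is_feasible([0.25, 0.5], y, C=1.0)   # violates sum_i alpha_i y_i = 0
out_of_box = is_feasible([2.0, 2.0], y, C=1.0)    # violates alpha_i <= C
```

For α = (0.25, 0.25) the objective is -0.25 and both constraints hold. The corresponding w = (0.5, 0.5) has primal value 0.5 * ||w||^2 = 0.25, which is the negative of the dual value, as expected at an optimum. The box constraint is the only place where C enters: it caps how much any single pattern can contribute.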
SV Classification Machine
[Figures: four sequences of Gaussian RBF plots, each shown for C = 1, 2, 5, 10, 20, 50, 100]
Insights
Changing C
For clean data C doesn’t matter much.
For noisy data, large C leads to more complicated margin (SVM tries to do a good job at separating, even though it isn’t possible).
Overfitting for large C.
Noisy data
Clean data has few support vectors.
Noisy data leads to data in the margins.
More support vectors for noisy data.
[Figures: four sequences of Gaussian RBF plots, each shown for σ = 1, 2, 5, 10]
Insights
Changing σ
For clean data σ doesn’t matter much.
For noisy data, small σ leads to more complicated margin (SVM tries to do a good job at separating, even though it isn’t possible).
Lots of overfitting for small σ.
Noisy data
Clean data has few support vectors.
Noisy data leads to data in the margins.
More support vectors for noisy data.
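One way to see why a small σ invites overfitting is to look at the kernel matrix itself. The sketch below assumes the common parameterization k(x, x') = exp(-||x - x'||^2 / (2σ^2)); this section does not restate the kernel's formula, so that form, the function names and the three sample points are all assumptions.

```python
import math


def rbf_kernel(u, v, sigma):
    """Gaussian RBF, assumed here as exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def kernel_matrix(X, sigma):
    """Matrix of all pairwise kernel values k(x_i, x_j)."""
    return [[rbf_kernel(a, b, sigma) for b in X] for a in X]


# Hypothetical sample points.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]

K_small = kernel_matrix(X, sigma=0.1)   # nearly the identity matrix
K_large = kernel_matrix(X, sigma=10.0)  # nearly all entries close to 1
```

With σ = 0.1 the off-diagonal entries are essentially zero, so every training point is similar only to itself and the decision function can wrap around individual noisy points. With σ = 10 all points look alike to the kernel and the resulting function is much smoother.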
Summary
Support Vector Machine
Problem definition, geometrical picture, optimization problem
Optimization Problem
Hard margin, convexity, dual problem, soft margin problem
Today’s Summary
Machine learning and probability theory
Introduction to pattern recognition, classification, regression, novelty detection, probability theory, Bayes rule, inference
Density estimation and Parzen windows
Kernels and density estimation, Silverman’s rule, Watson Nadaraya estimator, cross-validation
Perceptron and kernels
Hebb’s rule, perceptron algorithm, convergence, feature maps, kernel trick, examples
Support Vector classification
Geometrical view, dual problem, convex optimization, kernels and SVM