Boosting
Instructor: Pranjal Awasthi
Motivating Example
• Want to predict the winner of the 2016 general election.
• Step 1: Collect training data
– 2012, 2008, 2004, 2000, 1996, 1992, …
• Step 2: Find a function that gets high accuracy (~99%) on the training set.
• What if, instead, we have good "rules of thumb" that often work well?
• h_1 = If the sitting president is running, go with his/her party; otherwise, predict randomly.
– h_1 will do well on a subset of the training set, and better than half overall, say 51% accuracy.
– But if we focus only on a subset of the examples (e.g., 2008 and 2000), h_1 will do poorly.
• h_2 = If a CNN poll gives a 20-point lead to someone, go with his/her party; otherwise, predict randomly.
– On that subset, h_2 will do well, say 51% accuracy.
• Suppose we can consistently produce rules h_1, h_2, h_3, … from the class H that do slightly better than random guessing.
• Goal: Output a single rule that gets high accuracy on the training set.
Boosting
• Weak Learning Algorithm → Strong Learning Algorithm
• Boosting: a general method for converting "rules of thumb" into highly accurate predictions.
History of Boosting
• Originally studied from a purely theoretical perspective
– Is weak learning equivalent to strong learning?
– Answered positively by Rob Schapire in 1989 via the boosting algorithm.
• AdaBoost – a highly practical version developed in 1995
– One of the most popular ML (meta-)algorithms
History of Boosting
• Has connections to many different areas
– Empirical risk minimization
– Game theory
– Convex optimization
– Computational complexity theory
Strong Learning
• To get strong learning, we need to:
– Design an algorithm that can achieve any arbitrarily small error ε > 0 on the training set S_m
– Do a VC-dimension analysis to see how large m needs to be for generalization
Theory of Boosting
• Weak Learning Assumption:
– There exists an algorithm A that can consistently produce weak classifiers on the training set S_m = (x_1, y_1), (x_2, y_2), …, (x_m, y_m)
– Weak classifier: for every weighting W of S_m, A outputs h_i such that
err_{S,W}(h_i) ≤ 1/2 − γ, for some γ > 0
where err_{S,W}(h_i) = ( Σ_{j=1}^m w_j · I[h_i(x_j) ≠ y_j] ) / ( Σ_{j=1}^m w_j )
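As a concrete reading of this definition, here is a minimal NumPy sketch of the weighted training error; the function and array names (weighted_error, preds, y, w) are illustrative, not from the lecture.

```python
import numpy as np

def weighted_error(preds, y, w):
    """err_{S,W}(h): total weight of the examples h misclassifies,
    normalized by the total weight (labels and predictions in {-1, +1})."""
    mistakes = (preds != y)                 # indicator I[h_i(x_j) != y_j]
    return np.dot(w, mistakes) / np.sum(w)
```

The weak learning assumption says that, no matter which weighting w we hand to A, the returned classifier's weighted error is at most 1/2 − γ.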
• Boosting Guarantee: Using A, we can build an algorithm B such that
– For every ε > 0, B outputs g with err_S(g) ≤ ε
AdaBoost
• Run A on S_m to get h_1
• Use h_1 to get weights W_2
• Run A on S_m with weights W_2 to get h_2
• Repeat T times
• Combine h_1, h_2, …, h_T to get the final classifier
Q1: How to choose the weights? Q2: How to combine the classifiers?
Choosing weights
• Idea 1: Let W_t be uniform over the examples that h_{t−1} got incorrect.
• Idea 2: Let W_t be such that the error of h_{t−1} under W_t becomes exactly 1/2:
– If err_{S,W_{t−1}}(h_{t−1}) = 1/2 − γ, then set
– W_{t,i} = W_{t−1,i}, if h_{t−1} is incorrect on x_i
– W_{t,i} = W_{t−1,i} · (1/2 − γ)/(1/2 + γ), if h_{t−1} is correct on x_i
• Scaling down the correctly classified examples makes their total weight equal to that of the misclassified ones, so the weighted error of h_{t−1} under W_t is exactly 1/2. A sketch of this update appears below.
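Continuing the NumPy sketch from above (array names illustrative; the rule assumes the previous round's weighted error was exactly 1/2 − γ, as on the slide):

```python
def reweight(w, preds, y, gamma):
    """W_{t-1} -> W_t: keep the weights of misclassified examples and
    scale the correctly classified ones by (1/2 - gamma)/(1/2 + gamma),
    so that h_{t-1}'s weighted error under W_t becomes exactly 1/2."""
    w_new = w.copy()
    w_new[preds == y] *= (0.5 - gamma) / (0.5 + gamma)
    return w_new
```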
Combining Classifiers
• Take a majority vote: g(x) = MAJ(h_1(x), h_2(x), …, h_T(x))
Theorem: If the error of each h_t is ≤ 1/2 − γ, then err_S(g) ≤ e^(−2Tγ²).
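Putting the pieces together, here is a minimal sketch of the whole loop under the slides' assumptions; weak_learner is an assumed callable standing in for A (it takes (X, y, w) and returns a classifier whose weighted error is ≤ 1/2 − γ), and labels are in {−1, +1}.

```python
import numpy as np

def boost(X, y, weak_learner, gamma, T):
    """Boosting scheme from the slides: each round, train a weak
    classifier on the current weighting, then reweight so that its
    weighted error becomes exactly 1/2."""
    m = len(y)
    w = np.ones(m) / m                     # W_1: start uniform
    hs = []
    for _ in range(T):
        h = weak_learner(X, y, w)          # assumed: err_{S,W}(h) <= 1/2 - gamma
        hs.append(h)
        w[h(X) == y] *= (0.5 - gamma) / (0.5 + gamma)
    def g(X):
        # Majority vote over {-1, +1} predictions; odd T avoids ties.
        votes = np.sum([h(X) for h in hs], axis=0)
        return np.sign(votes)
    return g
```

Note the simplification inherited from the slides: every round is reweighted as if its error were exactly 1/2 − γ. Practical AdaBoost instead uses each round's measured error ε_t, scaling correct examples by ε_t/(1 − ε_t), and combines the classifiers with a weighted rather than plain majority vote.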
Analysis
Theorem: If the error of each h_t is ≤ 1/2 − γ, then err_S(g) ≤ e^(−2Tγ²).
• Recall: W_t is chosen so that the error of h_{t−1} becomes 1/2.
• Recall: the final classifier is a (weighted) majority vote.
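To connect the theorem back to the strong-learning goal, solve e^(−2Tγ²) ≤ ε for T (simple algebra, shown below in LaTeX):

```latex
e^{-2T\gamma^{2}} \le \epsilon
\;\Longleftrightarrow\; 2T\gamma^{2} \ge \ln(1/\epsilon)
\;\Longleftrightarrow\; T \ge \frac{\ln(1/\epsilon)}{2\gamma^{2}}
```

So T = O(log(1/ε)/γ²) rounds suffice to reach training error ε, which is exactly the guarantee that algorithm B was required to provide.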
Strong Learning
• To get strong learning, we need to:
– Design an algorithm that can achieve any arbitrarily small error ε > 0 on the training set S_m
– Do a VC-dimension analysis to see how large m needs to be for generalization
Generalization Bound
[Figure: weak rules h_1, h_2, h_3, … drawn from the hypothesis class H]
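Only the figure survives on this slide; as a hedged sketch of the standard statement such slides carry (stated here as an assumption, the exact form in the lecture may differ): the final classifier g is a majority vote of T functions from H, a class whose capacity grows roughly linearly with T, so the usual VC bound takes the shape below.

```latex
% Hedged sketch, not from the slides: if VCdim(H) = d, the class of
% majority votes of T functions from H has VC dimension O(Td \log(Td)),
% so with high probability (log factors suppressed in the O-tilde):
\mathrm{err}_{D}(g) \;\le\; \mathrm{err}_{S}(g) \,+\, \tilde{O}\!\left(\sqrt{\frac{T\,d}{m}}\right)
```

The tension is visible here: larger T drives err_S(g) down exponentially, but it also makes the voting class richer, so m must grow with T for the second term to stay small.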
Applications: Decision Stumps
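A decision stump is a one-level decision tree: it thresholds a single feature. Stumps are the canonical weak learners used with AdaBoost. Below is a minimal brute-force sketch of a stump learner that could play the role of weak_learner in the loop above (all names hypothetical; O(d·m²) time, written for clarity rather than speed).

```python
import numpy as np

def stump_learner(X, y, w):
    """Pick the decision stump (feature j, threshold theta, sign s)
    with the smallest weighted error; a stump predicts
    sign(s * (x[j] - theta)) in {-1, +1}."""
    m, d = X.shape
    best, best_err = None, float("inf")
    for j in range(d):
        for theta in np.unique(X[:, j]):
            for s in (-1.0, 1.0):
                preds = np.where(s * (X[:, j] - theta) >= 0, 1.0, -1.0)
                err = np.dot(w, preds != y) / np.sum(w)
                if err < best_err:
                    best_err, best = err, (j, theta, s)
    j, theta, s = best
    return lambda X: np.where(s * (X[:, j] - theta) >= 0, 1.0, -1.0)
```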
Advantages of AdaBoost
• Helps learn complex models from simple ones without overfitting
• A general paradigm, easy to implement
• Parameter-free, no tuning
• Often not very sensitive to the number of rounds T
• Cons:
– Does not help much if the weak learners are already complex.
– Sensitive to noise in the data.