Experts and Boosting Algorithms
Experts: Motivation
• Given a set of experts:
  – No prior information
  – No consistent behavior
  – Goal: predict as well as the best expert
• Model:
  – online model
  – Input: historical results
Experts: Model
• N strategies (experts)
• At time t:
  – Learner A chooses a distribution over the N experts.
  – Let p_t(i) be the probability of the i-th expert.
  – Clearly Σ_i p_t(i) = 1.
  – A receives a loss vector l_t.
  – Loss at time t: Σ_i p_t(i) l_t(i).
• Assume bounded losses: l_t(i) ∈ [0,1].
Experts: Goal
• Match the loss of the best expert.
• Loss:
  – L_A: cumulative loss of the learner A
  – L_i: cumulative loss of expert i
• Can we hope to do better?
Example: Guessing letters
• Setting:
  – Alphabet of k letters
• Loss:
  – 1 for an incorrect guess
  – 0 for a correct guess
• Experts:
  – Each expert always guesses one fixed letter.
• Game: guess the most popular letter online.
Example 2: Rock-Paper-Scissors
• Two-player game.
• Each player chooses: Rock, Paper, or Scissors.
• Goal: play as well as we can against the given opponent.
• Loss matrix (row = our choice, column = opponent's choice):

              Rock   Paper  Scissors
  Rock        1/2    1      0
  Paper       0      1/2    1
  Scissors    1      0      1/2
Example 3: Placing a point
• Action: choosing a point d.
• Loss (given the true location y): ||d - y||.
• Experts: one for each point.
• Important: the loss is convex:
  ||λd_1 + (1-λ)d_2 - y|| ≤ λ||d_1 - y|| + (1-λ)||d_2 - y||
• Goal: find a "center".
Experts Algorithm: Greedy
• For each expert define its cumulative loss:

  L_i^t = Σ_{j=1}^t l_j(i)

• Greedy: at time t choose the expert with minimum cumulative loss so far, namely arg min_i L_i^{t-1}.
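As a concrete illustration, a minimal Python sketch of Greedy (assuming the stream of loss vectors is given as a list of arrays; the function name is ours):

```python
import numpy as np

def greedy(loss_vectors):
    """Follow the leader: at each step play the expert whose
    cumulative loss so far is smallest (lowest index on ties)."""
    N = len(loss_vectors[0])
    cum_loss = np.zeros(N)            # L_i^{t-1} for each expert i
    total = 0.0                       # the loss of Greedy
    for l_t in loss_vectors:          # each l_t(i) is in [0, 1]
        i = int(np.argmin(cum_loss))  # arg min_i L_i^{t-1}
        total += l_t[i]
        cum_loss += np.asarray(l_t)   # L_i^t = L_i^{t-1} + l_t(i)
    return total
```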
Greedy Analysis
• Theorem: Let L_G^T be the loss of Greedy at time T; then

  L_G^T ≤ N (min_i L_i^T + 1)

• Proof!
Better Expert Algorithms
• Would like to bound

  L_A^T - min_i L_i^T
Expert Algorithm: Hedge(b)
• Maintains a weight vector w_t
• Probabilities: p_t(k) = w_t(k) / Σ_j w_t(j)
• Initialization: w_1(i) = 1/N
• Updates:
  – w_{t+1}(k) = w_t(k) · U_b(l_t(k))
  – where b ∈ [0,1] and
  – b^r ≤ U_b(r) ≤ 1 - (1-b)r
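A minimal sketch of Hedge(b), using the extreme admissible update U_b(r) = b^r (any U_b between b^r and 1-(1-b)r would do):

```python
import numpy as np

def hedge(loss_vectors, b):
    """Hedge(b) with U_b(r) = b**r, which satisfies
    b**r <= U_b(r) <= 1 - (1 - b) * r for r in [0, 1]."""
    N = len(loss_vectors[0])
    w = np.full(N, 1.0 / N)          # w_1(i) = 1/N
    total = 0.0                      # L_H, the loss of Hedge
    for l_t in loss_vectors:
        l_t = np.asarray(l_t, dtype=float)
        p = w / w.sum()              # p_t(k) = w_t(k) / sum_j w_t(j)
        total += float(p @ l_t)      # loss at time t: sum_i p_t(i) l_t(i)
        w = w * b ** l_t             # w_{t+1}(k) = w_t(k) * U_b(l_t(k))
    return total, w
```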
Hedge Analysis
• Lemma: For any sequence of losses,

  ln( Σ_{j=1}^N w_{T+1}(j) ) ≤ -(1-b) L_H

• Proof!
• Corollary:

  L_H ≤ -ln( Σ_{j=1}^N w_{T+1}(j) ) / (1-b)
Hedge: Properties
• Bounding the weights:

  w_{T+1}(i) = w_1(i) Π_{t=1}^T U_b(l_t(i)) ≥ w_1(i) Π_{t=1}^T b^{l_t(i)} = w_1(i) b^{L_i^T}

• Similarly for a subset of experts.
Hedge: Performance
• Let k be the expert with minimal loss. Then

  Σ_{j=1}^N w_{T+1}(j) ≥ w_{T+1}(k) ≥ w_1(k) b^{L_k^T}

• Therefore

  L_H ≤ ( ln N + L_k^T ln(1/b) ) / (1-b)
Hedge: Optimizing b
• For b = 1/2 we have

  L_H ≤ 2 ln N + 2 L_k^T ln 2

• Better selection of b:

  L_H ≤ min_i L_i^T + √( 2 (min_i L_i^T) ln N ) + ln N
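For reference, a sketch of where the better bound comes from, following the standard Hedge analysis (here L̃ is an assumed known upper bound on min_i L_i^T):

```latex
\text{Choosing}\quad
b = \frac{1}{1 + \sqrt{2\ln N/\tilde{L}}}
\quad\text{in}\quad
L_H \le \frac{L_k^T \ln(1/b) + \ln N}{1-b}
\quad\text{gives}\quad
L_H \le \min_i L_i^T + \sqrt{2\tilde{L}\ln N} + \ln N .
```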
Occam Razor
• Finding the shortest consistent hypothesis.
• Definition: (α,β)-Occam algorithm
  – α > 0 and β < 1
  – Input: a sample S of size m
  – Output: hypothesis h
  – for every (x,b) in S: h(x) = b
  – size(h) ≤ size(c_t)^α · m^β
• Efficiency.
Occam algorithm and compression
  [Diagram: A, holding the labeled sample S = {(x_i, b_i)}, sends a message to B, who holds only x_1, …, x_m — compression.]

• Option 1:
  – A sends B the values b_1, …, b_m
  – m bits of information
• Option 2:
  – A sends B the hypothesis h
  – Occam: for large enough m, size(h) < m
• Option 3 (MDL):
  – A sends B a hypothesis h and "corrections"
  – complexity: size(h) + size(errors)
Occam Razor Theorem
• A: an (α,β)-Occam algorithm for C using H
• D: a distribution over inputs X
• c_t in C: the target function
• Sample size:

  m ≥ (2/ε) ln(1/δ) + ( (2 n^α ln 2) / ε )^{1/(1-β)}

• Then with probability 1-δ, A(S) = h has error(h) < ε.
Occam Razor Theorem
• Use the bound for a finite hypothesis class.
• Effective hypothesis class size: 2^{size(h)}
• size(h) ≤ n^α m^β
• Sample size: need

  m ≥ (1/ε) ( n^α m^β ln 2 + ln(1/δ) )
Weak and Strong Learning
PAC Learning model
• There exists a distribution D over the domain X
• Examples: <x, c(x)>
  – use c for the target function (rather than c_t)
• Goal:
  – With high probability (1-δ),
  – find h in H such that
  – error(h,c) < ε,
  – for ε arbitrarily small.
Weak Learning Model
• Goal: error(h,c) ≤ 1/2 - γ
• The parameter γ is small:
  – a constant, or
  – 1/poly
• Intuitively: a much easier task
• Question:
  – Assume C is weakly learnable;
  – is C PAC (strongly) learnable?
Majority Algorithm
• Hypothesis: h_M(x) = MAJ[ h_1(x), ..., h_T(x) ]
• size(h_M) ≤ T · size(h_t)
• Using Occam Razor
Majority: outline
• Sample m examples.
• Start with a distribution of 1/m per example.
• Modify the distribution and get h_t.
• The hypothesis is the majority vote.
• Terminate on perfect classification
  – of the sample.
Majority: Algorithm
• Use the Hedge algorithm.
• The "experts" are associated with the sample points.
• The loss is 1 on a correct classification:
  – l_t(i) = 1 - | h_t(x_i) - c(x_i) |
• Setting: b = 1 - γ
• h_M(x) = MAJORITY( h_1(x), ..., h_T(x) ), as sketched below.
• Q: How do we set T?
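A minimal sketch of this booster (the weak_learner callable is an assumption: given the sample and a distribution p over it, it returns a 0/1 hypothesis with weighted error at most 1/2 - γ):

```python
import numpy as np

def majority_boost(X, y, weak_learner, gamma, T):
    """Hedge with one 'expert' per example, fixed b = 1 - gamma,
    and loss 1 on a CORRECT classification."""
    y = np.asarray(y)
    m = len(X)
    w = np.full(m, 1.0 / m)          # distribution 1/m per example
    b = 1.0 - gamma
    hs = []
    for _ in range(T):
        p = w / w.sum()
        h = weak_learner(X, y, p)    # weighted error <= 1/2 - gamma
        hs.append(h)
        correct = np.array([h(x) for x in X]) == y
        # l_t(i) = 1 - |h_t(x_i) - c(x_i)|: correct points shrink by b,
        # so misclassified points gain relative weight
        w = w * np.where(correct, b, 1.0)
    def h_M(x):                      # majority vote over h_1, ..., h_T
        return int(2 * sum(h(x) for h in hs) >= len(hs))
    return h_M
```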
Majority: Analysis
• Consider the set of errors S:
  – S = { i | h_M(x_i) ≠ c(x_i) }
• For every i in S:
  – L_i / T ≤ 1/2 (Proof!)
• From the Hedge properties:

  ln( Σ_{i∈S} D(x_i) ) ≤ -γ²T/2
MAJORITY: Correctness
• Error probability:

  Σ_{i∈S} D(x_i) ≤ e^{-γ²T/2}

• Number of rounds: terminate when the error is less than 1/m, i.e.

  T = (2/γ²) ln m
AdaBoost: Dynamic Boosting
• Better bounds on the error
• No need to "know" γ
• Each round uses a different b_t
  – set as a function of the error ε_t
AdaBoost: Input
• Sample of size m: < x_i, c(x_i) >
• A distribution D over the examples
  – We will use D(x_i) = 1/m
• A weak learning algorithm
• A constant T (number of iterations)
AdaBoost: Algorithm
• Initialization: w_1(i) = D(x_i)
• For t = 1 to T DO:
  – p_t(i) = w_t(i) / Σ_j w_t(j)
  – Call the Weak Learner with p_t
  – Receive h_t
  – Compute the error ε_t of h_t on p_t
  – Set b_t = ε_t / (1 - ε_t)
  – w_{t+1}(i) = w_t(i) · (b_t)^e, where e = 1 - |h_t(x_i) - c(x_i)|
• Output:

  h_A(x) = I[ Σ_{t=1}^T log(1/b_t) h_t(x) ≥ (1/2) Σ_{t=1}^T log(1/b_t) ]
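A minimal sketch of the loop above (weak_learner is again an assumed callable returning a 0/1 hypothesis whose weighted error satisfies 0 < ε_t < 1/2):

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """AdaBoost as on the slide: b_t adapts to the error of each h_t."""
    y = np.asarray(y)
    m = len(X)
    w = np.full(m, 1.0 / m)              # w_1(i) = D(x_i) = 1/m
    hs, votes = [], []
    for _ in range(T):
        p = w / w.sum()                  # p_t(i) = w_t(i) / sum_j w_t(j)
        h = weak_learner(X, y, p)
        preds = np.array([h(x) for x in X])
        eps = float(p @ (preds != y))    # error eps_t of h_t on p_t
        b = eps / (1.0 - eps)            # b_t; assumes 0 < eps_t < 1/2
        w = w * b ** (1.0 - np.abs(preds - y))  # e = 1 - |h_t(x_i) - c(x_i)|
        hs.append(h)
        votes.append(np.log(1.0 / b))    # vote weight log(1/b_t)
    def h_A(x):                          # the weighted-majority output rule
        return int(sum(v * h(x) for v, h in zip(votes, hs))
                   >= 0.5 * sum(votes))
    return h_A
```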
AdaBoost: Analysis
• Theorem:
  – Given ε_1, ..., ε_T,
  – the error ε of h_A is bounded by

  ε ≤ 2^T Π_{t=1}^T √( ε_t (1 - ε_t) )
AdaBoost: Proof
• Let l_t(i) = 1 - |h_t(x_i) - c(x_i)|
• By definition: p_t · l_t = 1 - ε_t
• Upper bound the sum of the weights
  – from the Hedge analysis.
• An error occurs only if

  Σ_{t=1}^T log(1/b_t) |h_t(x) - c(x)| ≥ (1/2) Σ_{t=1}^T log(1/b_t)
AdaBoost Analysis (cont.)
• Bounding the weight of a point
• Bounding the sum of the weights
• Final bound as a function of b_t
• Optimizing b_t:
  – b_t = ε_t / (1 - ε_t)
AdaBoost: Fixed bias
• Assume ε_t = 1/2 - γ. We bound:

  2^T Π_{t=1}^T √( ε_t (1 - ε_t) ) = (1 - 4γ²)^{T/2} ≤ e^{-2γ²T}
Learning OR with few attributes
• Target function: OR of k literals
• Goal: learn in time
  – polynomial in k and log n,
  – with ε and δ constant
• ELIM makes "slow" progress:
  – it disqualifies one literal per round
  – and may remain with O(n) literals
Set Cover - Definition
• Input: S_1, …, S_t with S_i ⊆ U
• Output: S_{i_1}, …, S_{i_k} such that ∪_j S_{i_j} = U
• Question: are there k sets that cover U?
• NP-complete
Set Cover: Greedy algorithm
• j = 0; U_j = U; C = ∅
• While U_j ≠ ∅:
  – Let S_i = arg max_i |S_i ∩ U_j|
  – Add S_i to C
  – Let U_{j+1} = U_j - S_i
  – j = j + 1
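A minimal sketch of the greedy loop (sets are given as Python sets over a finite universe U):

```python
def greedy_set_cover(U, sets):
    """Repeatedly take the set covering the most still-uncovered elements."""
    uncovered = set(U)                   # U_j
    cover = []                           # C
    while uncovered:
        S = max(sets, key=lambda s: len(s & uncovered))  # arg max |S_i ∩ U_j|
        if not S & uncovered:
            raise ValueError("the given sets do not cover U")
        cover.append(S)                  # add S_i to C
        uncovered -= S                   # U_{j+1} = U_j - S_i
    return cover

# Example: an optimal cover of size 2 exists, and greedy finds one.
print(greedy_set_cover({1, 2, 3, 4}, [{1, 2}, {2, 3}, {3, 4}]))
```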
Set Cover: Greedy Analysis
• At termination, C is a cover.
• Assume there is a cover C’ of size k.
• C' is a cover for every U_j.
• Some S in C' covers at least |U_j|/k elements of U_j.
• Analysis of U_j: |U_{j+1}| ≤ |U_j| - |U_j|/k
• Solving the recursion: |U_j| ≤ |U| (1 - 1/k)^j < |U| e^{-j/k}
• Number of sets: j ≤ k ln |U|
Building an Occam algorithm
• Given a sample S of size m:
  – Run ELIM on S.
  – Let LIT be the resulting set of literals.
  – There exist k literals in LIT that classify all of S correctly.
• Negative examples:
  – any subset of LIT classifies them correctly.
Building an Occam algorithm
• Positive examples:
  – Search for a small subset of LIT
  – which classifies S+ correctly.
  – For a literal z build T_z = { x ∈ S+ | z satisfies x }.
  – There are k sets T_z that cover S+,
  – so greedy finds at most k ln m sets that cover S+.
• Output h = the OR of these k ln m literals.
• size(h) ≤ k ln m · log 2n
• Sample size: m = O( k log n · log(k log n) )
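Putting the two slides together, a minimal sketch (the representation of literals as (index, negated) pairs and the helper names are ours; examples are (x, b) with x a 0/1 tuple):

```python
def lit_val(lit, x):
    """Value of a literal on the assignment x."""
    i, neg = lit
    return 1 - x[i] if neg else x[i]

def elim(sample):
    """ELIM: drop every literal satisfied by some negative example;
    such a literal cannot appear in the target OR."""
    n = len(sample[0][0])
    lits = {(i, neg) for i in range(n) for neg in (False, True)}
    for x, b in sample:
        if b == 0:
            lits = {l for l in lits if lit_val(l, x) == 0}
    return lits

def occam_or(sample):
    """Greedily cover the positive examples S+ with literals from LIT;
    terminates because k literals of LIT cover S+ (consistency)."""
    lits = list(elim(sample))
    pos = [x for x, b in sample if b == 1]   # S+
    uncovered = set(range(len(pos)))
    chosen = []
    while uncovered:
        # T_z = positives satisfied by literal z; take the largest
        z = max(lits, key=lambda l: sum(lit_val(l, pos[i]) for i in uncovered))
        chosen.append(z)
        uncovered -= {i for i in uncovered if lit_val(z, pos[i]) == 1}
    return lambda x: int(any(lit_val(l, x) for l in chosen))
```

For instance, occam_or([((1, 0), 1), ((0, 0), 0)]) returns a hypothesis consisting of the single literal on the first variable, which is consistent with that sample.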