
Page 1:

The Rate of Convergence of AdaBoost

Indraneel Mukherjee

Cynthia Rudin

Rob Schapire

Pages 2–3:

AdaBoost (Freund and Schapire 97)

Pages 4–5:

Basic properties of AdaBoost’s convergence are still not fully understood.

We address one of these basic properties: convergence rates with no assumptions.

Pages 6–10:

• AdaBoost is known for its ability to combine “weak classifiers” into a “strong” classifier.

• AdaBoost iteratively minimizes the “exponential loss” (Breiman, 1999; Frean and Downs, 1998; Friedman et al., 2000; Friedman, 2001; Mason et al., 2000; Onoda et al., 1998; Rätsch et al., 2001; Schapire and Singer, 1999).

Examples: {(x_i, y_i)}_{i=1,...,m}, with each (x_i, y_i) ∈ X × {−1, 1}

Hypotheses: H = {h_1, ..., h_N}, where h_j : X → [−1, 1]

Combination: F(x) = λ_1 h_1(x) + … + λ_N h_N(x)

Misclassification error ≤ exponential loss:

  (1/m) Σ_{i=1}^m 1[y_i F(x_i) ≤ 0] ≤ (1/m) Σ_{i=1}^m exp(−y_i F(x_i))

Exponential loss:

  L(λ) = (1/m) Σ_{i=1}^m exp(−Σ_{j=1}^N λ_j y_i h_j(x_i))
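As a concrete illustration of this view of AdaBoost as greedy coordinate descent on L(λ), here is a minimal sketch (the 2-D toy dataset and the small set of decision stumps below are made up for this example and are not from the talk; the step size assumes ±1-valued stumps):

```python
import numpy as np

# Hypothetical toy data: m examples in the plane with labels in {-1, +1}.
rng = np.random.default_rng(0)
m = 20
x = rng.uniform(-1, 1, size=(m, 2))
y = np.where(x[:, 0] + x[:, 1] > 0, 1, -1)

# Hypothesis set H = {h_1, ..., h_N}: axis-aligned decision stumps with values in {-1, +1}.
# Precompute A[i, j] = y_i * h_j(x_i), so that the margins are A @ lam.
stumps = []
for d in range(2):
    for thr in np.linspace(-1, 1, 9):
        for sign in (+1, -1):
            stumps.append(sign * np.where(x[:, d] > thr, 1, -1))
A = y[:, None] * np.column_stack(stumps)
N = A.shape[1]

def exp_loss(lam):
    """L(lambda) = (1/m) sum_i exp(-sum_j lambda_j y_i h_j(x_i))."""
    return float(np.mean(np.exp(-A @ lam)))

lam = np.zeros(N)
for t in range(100):
    # AdaBoost's distribution over examples: weights proportional to exp(-y_i F(x_i)).
    w = np.exp(-A @ lam)
    w /= w.sum()
    # Choose the hypothesis with the largest edge delta_j = sum_i w_i y_i h_j(x_i) ...
    edges = w @ A
    j = int(np.argmax(np.abs(edges)))
    delta = float(np.clip(edges[j], -1 + 1e-12, 1 - 1e-12))
    # ... and take the usual AdaBoost step for a {-1,+1}-valued hypothesis.
    lam[j] += 0.5 * np.log((1 + delta) / (1 - delta))

margins = A @ lam
print("misclassification error:", np.mean(margins <= 0))
print("exponential loss       :", exp_loss(lam))
# The bound above holds pointwise, since 1[z <= 0] <= exp(-z):
assert np.mean(margins <= 0) <= exp_loss(lam)
```

With ±1-valued stumps and this step size, each round multiplies the loss by exactly √(1 − δ_t²), the “old fact” used later in the talk.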

Pages 11–12:

[Figure: the exponential loss L(λ) plotted over two coordinates λ1 and λ2, for the examples and hypotheses defined above.]

Pages 13–16:

Known:

• AdaBoost converges asymptotically to the minimum of the exponential loss (Collins et al., 2002; Zhang and Yu, 2005).

• Convergence rates are known under assumptions:
  – the “weak learning” assumption holds, i.e., the hypotheses are better than random guessing (Freund and Schapire, 1997; Schapire and Singer, 1999);
  – a finite minimizer of the loss exists (Rätsch et al., 2002, and many classic results).

• Schapire (2010) conjectured that fast convergence rates hold without any assumptions.

• The convergence rate is relevant for the consistency of AdaBoost (Bartlett and Traskin, 2007).

Page 17:

Outline

• Convergence Rate 1: Convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”

• Convergence Rate 2: Convergence to the optimal loss. “Can we get within ε of an optimal solution?”

Page 18:

Main Messages

• Usual approaches assume a finite minimizer
  – it is much more challenging not to assume this!

• We separate two different modes of analysis
  – comparison to a reference solution vs. comparison to the optimum
  – different rates of convergence are possible in each

• Analyses of convergence rates often ignore the “constants”
  – we show they can be extremely large in the worst case

Page 19:

• Convergence Rate 1: Convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”

Based on a conjecture that says...

Pages 20–27:

“At iteration t, L(λ_t) will be at most ε more than that of any parameter vector of ℓ1-norm bounded by B, in a number of rounds that is at most a polynomial in log N, m, B, and 1/ε.”

[Figure: a ball of radius B containing the reference vector λ*, together with AdaBoost’s iterate λ_t; the gap between L(λ_t) and L(λ*) shrinks to ε.]

This happens at:

  t ≤ poly(log N, m, B, 1/ε)

Pages 28–30:

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖₁⁶ ε⁻⁵ rounds.

This is poly(log N, m, B, 1/ε).

The best previously known result is that it takes at most order e^(1/ε²) rounds (Bickel et al.).

Page 31:

Intuition behind the proof of Theorem 1

• Old fact: if AdaBoost takes a large step, it makes a lot of progress:

  L(λ_t) ≤ L(λ_{t−1}) √(1 − δ_t²)

  δ_t is called the “edge.” It is related to the step size.

[Figure: the exponential loss over two coordinates λ1 and λ2.]
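For ±1-valued weak hypotheses and AdaBoost’s usual step size, this “old fact” is a one-line computation (a sketch under that simplifying assumption; the talk’s setting allows [−1, 1]-valued hypotheses, where the same bound holds as an inequality):

```latex
% Let D_t(i) \propto \exp(-y_i F_{t-1}(x_i)) be AdaBoost's weights and
% \delta_t = \sum_i D_t(i)\, y_i h_{j_t}(x_i) the edge of the chosen hypothesis.
% With step size \alpha_t = \tfrac{1}{2}\ln\tfrac{1+\delta_t}{1-\delta_t},
\frac{L(\lambda_t)}{L(\lambda_{t-1})}
  \;=\; \sum_i D_t(i)\, e^{-\alpha_t y_i h_{j_t}(x_i)}
  \;=\; \frac{1+\delta_t}{2}\, e^{-\alpha_t} + \frac{1-\delta_t}{2}\, e^{\alpha_t}
  \;=\; \sqrt{1-\delta_t^{\,2}}.
```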

Page 32:

[Figure: the radius-B ball containing λ*, together with the iterate λ_t; R_t is the gap in log-loss and S_t the ℓ1-distance from λ_t to the set of points with loss at most L(λ*).]

  R_t := ln L(λ_t) − ln L(λ*)   (measures progress)

  S_t := inf { ‖λ − λ_t‖₁ : L(λ) ≤ L(λ*) }   (measures distance)

Pages 33–34:

Intuition behind the proof of Theorem 1

• Old fact: L(λ_t) ≤ L(λ_{t−1}) √(1 − δ_t²). If the δ_t’s are large, we make progress.

• First lemma: if S_t is small, then δ_t is large.

• Second lemma: S_t remains small (unless R_t is already small).

• Combining: the δ_t’s are large at each t (unless R_t is already small); specifically, δ_t ≥ R_{t−1}³ / B³ in each round t.
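A rough back-of-the-envelope version of how such a combined bound yields a rate of the form B⁶ ε⁻⁵ (constants dropped; this is only a sketch, not the paper’s actual argument):

```latex
% Take logs in the old fact, use \ln\sqrt{1-u} \le -u/2, then plug in \delta_t \ge R_{t-1}^3/B^3:
R_t \;\le\; R_{t-1} + \tfrac{1}{2}\ln\!\bigl(1-\delta_t^2\bigr)
    \;\le\; R_{t-1} - \tfrac{1}{2}\,\delta_t^{2}
    \;\le\; R_{t-1} - \frac{R_{t-1}^{6}}{2B^{6}}.
% A nonnegative sequence satisfying R_t \le R_{t-1} - R_{t-1}^{6}/(2B^{6}) falls below \varepsilon
% within O(B^{6}\varepsilon^{-5}) steps, and R_t \le \varepsilon gives
% L(\lambda_t) \le e^{\varepsilon} L(\lambda^*) \le L(\lambda^*) + O(\varepsilon), since L(\lambda^*) \le 1.
```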

Pages 35–38:

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖₁⁶ ε⁻⁵ rounds.

• The dependence on ‖λ*‖₁ is necessary for many datasets.

Lemma: There are simple datasets for which the number of rounds required to achieve loss L* is at least

  inf { ‖λ‖₁ : L(λ) ≤ L* } / (2 ln m).

Lemma: There are simple datasets for which the norm of the smallest solution achieving loss L* is exponential in the number of examples:

  inf { ‖λ‖₁ : L(λ) ≤ 2/m + ε } ≥ (2^(m−2) − 1) ln(1/(3ε)).

Pages 39–40:

Theorem 1: For any λ* ∈ ℝ^N, AdaBoost achieves loss at most L(λ*) + ε in at most 13 ‖λ*‖₁⁶ ε⁻⁵ rounds.

Conjecture: AdaBoost achieves loss at most L(λ*) + ε in at most O(B²/ε) rounds.

Page 41:

[Figure: “Rate on a Simple Dataset (log scale)”; x-axis: number of rounds (10 to 1e+05), y-axis: loss minus optimal loss (3e-06 to 3e-02).]

Page 42:

Outline

• Convergence Rate 1: Convergence to a target loss. “Can we get within ε of a ‘reference’ solution?”

• Convergence Rate 2: Convergence to the optimal loss. “Can we get within ε of an optimal solution?”

Page 43:

Theorem 2: AdaBoost reaches within ε of the optimal loss in at most C/ε rounds, where C depends only on the data.

Page 44:

Theorem 2: AdaBoost reaches within ε of the optimal loss in at most C/ε rounds, where C depends only on the data.

• Better dependence on ε than Theorem 1; in fact, the dependence on ε is optimal.

• Does not depend on the size of the best solution within a ball.

• Cannot be used to prove the conjecture, because in some cases C > 2^m. (Usually C is much smaller.)

Page 45:

Theorem 2: AdaBoost reaches within ε of the optimal loss in at most C/ε rounds, where C depends only on the data.

• The main tool is the “decomposition lemma”
  – it says that the examples fall into two categories: the zero-loss set Z and the finite-margin set F
  – a similar approach was taken independently by Telgarsky (2011)

Pages 46–48:

Theorem 2: AdaBoost reaches within ε of the optimal loss in at most C/ε rounds, where C depends only on the data.

[Figure: a training set of positive (+) and negative (−) examples, with one group of examples highlighted as the finite-margin set F and the rest as the zero-loss set Z.]

Page 49:

Decomposition Lemma: For any dataset, there exists a partition of the training examples into Z and F such that the following hold simultaneously:

1.) For some γ > 0, there exists a vector η⁺ with ‖η⁺‖₁ = 1 such that:
    ∀ i ∈ Z: Σ_j η⁺_j y_i h_j(x_i) ≥ γ   (margins are at least γ on Z)
    ∀ i ∈ F: Σ_j η⁺_j y_i h_j(x_i) = 0   (examples in F have zero margin)

2.) The optimal loss considering only the examples in F is achieved by some finite η*.

Pages 50–55:

[Figure: the same training set with the direction η⁺ drawn in; the examples in Z have margin at least γ under η⁺, the examples in F have zero margin, and the optimal loss restricted to F is attained by a finite vector η*.]
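A tiny synthetic instance (made up for illustration, not from the talk) makes the two sets concrete. With the y_i h_j(x_i) matrix below, example 0 can be given an arbitrarily large margin, so its loss can be driven to zero (it belongs to Z), while examples 1 and 2 have margins that always sum to zero, so their losses stay bounded away from zero (they form F):

```python
import numpy as np

# Rows are examples, columns are hypotheses; entry A[i, j] = y_i * h_j(x_i).
# Example 0 agrees with both hypotheses; examples 1 and 2 disagree in opposite
# ways, so no lambda can give both of them a positive margin.
A = np.array([[ 1,  1],
              [ 1, -1],
              [-1,  1]], dtype=float)
m, N = A.shape

lam = np.zeros(N)
for t in range(2000):
    w = np.exp(-A @ lam)
    w /= w.sum()
    edges = w @ A
    j = int(np.argmax(np.abs(edges)))
    delta = float(np.clip(edges[j], -1 + 1e-12, 1 - 1e-12))
    lam[j] += 0.5 * np.log((1 + delta) / (1 - delta))

margins = A @ lam
print("per-example losses exp(-margin):", np.exp(-margins).round(4))
print("total exponential loss         :", np.exp(-margins).mean().round(4))
# Over many rounds, example 0's loss tends to 0 while examples 1 and 2 stay near 1,
# so the total loss approaches 2/3 = (0 + 1 + 1)/3, all of it contributed by F.
# Here eta_plus = (1/2, 1/2) has margin 1 on Z and margin 0 on F (gamma = 1),
# and restricted to F the optimal loss is attained at the finite eta_star = 0.
```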

Page 56:

• We provide a conjecture about the dependence on m.

Lemma: There are simple datasets for which the constant C is doubly exponential: at least 2^(Ω(2^m / m)).

Conjecture: If the hypotheses are {−1, 0, +1}-valued, AdaBoost converges to within ε of the optimal loss within 2^(O(m ln m)) ε^(−(1+o(1))) rounds.

• This would give optimal dependence on m and ε simultaneously.

Pages 57–58:

To summarize

• Two rate bounds. One depends on the size of the best solution within a ball and has ε⁻⁵ dependence.

• The other is C/ε, but the constant C can be doubly exponential in m.

• Many lower bounds and conjectures are in the paper.

Thank you

Pages 59–60:

Intuition behind the proof of Theorem 2

• Old fact: L(λ_t) ≤ L(λ_{t−1}) √(1 − δ_t²). If the δ_t’s are large, we make progress.

• First lemma: the δ_t’s are large whenever the loss on Z is large.

• Second lemma: the δ_t’s are large whenever the loss on F is large; this translates into the δ_t’s being large whenever the loss on Z is small (the remaining excess loss must then sit on F).

Page 61:

• See notes.