Regression trees and regression graphs: Efficient estimators for Generalized Additive Models
Regression trees and regression graphs: Efficient estimators for Generalized Additive Models
Adam Tauman Kalai
TTI-Chicago
Outline
• Generalized Additive Models (GAM)
• Computationally efficient regression (model) [Valiant] [Kearns&Schapire]
• Thm: Regression graph algorithm efficiently learns GAMs
• Regression tree algorithm
• Regression graph algorithm [Mansour&McAllester]
• Correlation boosting
Generalized Additive Models [Hastie & Tibshirani]
• Distribution over X × Y = R^d × R
• f(x) = E[y|x] = u(f1(x(1)) + f2(x(2)) + … + fd(x(d)))
  – monotonic u: R → R, arbitrary fi: R → R
• e.g., generalized linear models: u(w·x) with monotonic u (linear/logistic models)
• e.g., f(x) = e^(-||x||²) = e^(-x(1)²-x(2)²-…-x(d)²)
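As an illustration (helper names are hypothetical, not from the slides), a minimal Python sketch of evaluating a GAM, checked against the Gaussian example above by taking u(z) = e^(-z) (monotone) and fi(z) = z²:

```python
import numpy as np

def gam(x, u, fs):
    """Evaluate f(x) = u(f_1(x(1)) + f_2(x(2)) + ... + f_d(x(d)))."""
    return u(sum(f(xi) for f, xi in zip(fs, x)))

# Slide's example f(x) = e^(-||x||^2): u(z) = exp(-z), f_i(z) = z^2.
d = 3
u = lambda z: np.exp(-z)
fs = [lambda z: z ** 2] * d

x = np.array([0.5, -1.0, 2.0])
print(gam(x, u, fs), np.exp(-np.sum(x ** 2)))  # the two values agree
```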
Non-Hodgkin's Lymphoma International Prognostic Index [NEJM '93]
Risk factors: age > 60, # sites > 1, performance status > 1, LDH > normal, stage > 2

# Risk factors | Relapse < 5 years | Relapse < 2 years | Death < 5 years | Death < 2 years
0, 1           | 30%               | 21%               | 30%             | 16%
2              | 50%               | 34%               | 50%             | 34%
3              | 51%               | 41%               | 51%             | 46%
4, 5           | 60%               | 42%               | 60%             | 66%
[Figure: sample points labeled 0 or 1, with conditional-expectation values between 0 and 1 marked at several points]
Setup
• Distribution over X × Y, with X = R^d, Y = [0,1]
• Training sample: (x1,y1),…,(xn,yn)
• Regression algorithm outputs h: X → [0,1]
[Figure: the regression algorithm's predicted values in [0,1] shown on sample points labeled 0/1]
• "true error": ε(h) = E[(h(x)-y)²]
• "training error": ε(h, train) = (1/n) Σi (h(xi)-yi)²
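A minimal sketch (hypothetical helper, assuming h is a plain Python callable) of the training-error computation above; the true error is the same quantity taken in expectation over the distribution rather than over the sample.

```python
import numpy as np

def training_error(h, xs, ys):
    """Empirical squared error: (1/n) * sum_i (h(x_i) - y_i)^2."""
    preds = np.array([h(x) for x in xs])
    return float(np.mean((preds - np.asarray(ys)) ** 2))
```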
Computationally-efficient regression [Kearns&Schapire]
• Learning algorithm A: given n examples from a distribution over X × [0,1] with f(x) = E[y|x] ∈ F (a family of target functions), output h: X → [0,1]
• Definition: A efficiently learns F if, ∀ such distributions and ∀ δ > 0, with probability 1-δ the true error satisfies
  ε(h) = E[(h(x)-y)²] ≤ E[(f(x)-y)²] + poly(|f|, 1/δ)/n^c
  and A's runtime is poly(n, |f|)
Results for GAMs
• Regression graph learner: n samples ∈ X × [0,1], X ⊆ R^d → h: R^d → [0,1]
[Figure: regression graph predictions on sample points labeled 0/1]
Thm: the regression graph learner efficiently learns GAMs
• ∀ dist over X × Y with E[y|x] = f(x) ∈ GAM, with probability 1-δ:
  – E[(h(x)-y)²] ≤ E[(f(x)-y)²] + O(LV·log(dn/δ)/n^(1/7))
  – runtime = poly(n, d)
Results for GAMs
• f(x) = u(Σi fi(x(i)))
  – u: R → R, monotonic, L-Lipschitz (L = max |u′(z)|)
  – fi: R → R, bounded total variation V = Σi ∫ |fi′(z)| dz
Thm: the regression graph learner efficiently learns GAMs
• ∀ dist over X × Y with E[y|x] = f(x) ∈ GAM, with probability 1-δ:
  – E[(h(x)-y)²] ≤ E[(f(x)-y)²] + O(LV·log(dn/δ)/n^(1/7))
  – runtime = poly(n, d)
Results for GAMs
Thm: the regression tree learner inefficiently learns GAMs
• ∀ dist over X × Y with E[y|x] = f(x) ∈ GAM:
  – E[(h(x)-y)²] ≤ E[(f(x)-y)²] + O(LV·(log(d)/log(n))^(1/4))
  – runtime = poly(n, d)
• Regression tree learner: n samples ∈ X × [0,1], X ⊆ R^d → h: R^d → [0,1]
[Figure: regression tree predictions on sample points labeled 0/1]
Regression Tree Algorithm
• Regression tree RT: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: a single leaf holding all the data and predicting avg(y1,y2,…,yn)]
Regression Tree Algorithm
• Regression tree RT: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: root split on x(j) ≥ θ; the left leaf holds (xi,yi) with xi(j) < θ and predicts avg(yi : xi(j) < θ); the right leaf holds (xi,yi) with xi(j) ≥ θ and predicts avg(yi : xi(j) ≥ θ)]
Regression Tree Algorithm
• Regression tree RT: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: root split on x(j) ≥ θ; the left leaf predicts avg(yi : xi(j) < θ); the right child splits again on x(j′) ≥ θ′, with its leaves predicting avg(yi : xi(j) ≥ θ ∧ xi(j′) < θ′) and avg(yi : xi(j) ≥ θ ∧ xi(j′) ≥ θ′)]
Regression Tree Algorithm
• n = amount of training data
• Put all data into one leaf
• Repeat until size(RT) = n/log²(n):
  – Greedily choose a leaf and a split x(j) ≤ θ to minimize ε(RT, train) = Σi (RT(xi)-yi)²/n
  – Divide the data in the split node into two new leaves
• Splitting criterion equivalent to "Gini"
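A minimal, illustrative Python sketch (not the talk's code) of this greedy loop, with a hypothetical max_leaves parameter standing in for the n/log²(n) size bound; each candidate axis-aligned split x(j) ≥ θ is scored by the decrease in training squared error.

```python
import numpy as np

def sse(y):
    """Sum of squared deviations from the leaf mean (the leaf's contribution to training error)."""
    return 0.0 if len(y) == 0 else float(np.sum((y - y.mean()) ** 2))

def grow_tree(X, y, max_leaves):
    """Greedy regression tree: leaves are index sets, each predicting the mean label it holds."""
    leaves = [np.arange(len(y))]               # start with all data in a single leaf
    while len(leaves) < max_leaves:
        best = None                            # (gain, leaf position, left indices, right indices)
        for li, idx in enumerate(leaves):
            if len(idx) < 2:
                continue
            base = sse(y[idx])
            for j in range(X.shape[1]):        # try every axis-aligned split x(j) >= theta
                for theta in np.unique(X[idx, j]):
                    left, right = idx[X[idx, j] < theta], idx[X[idx, j] >= theta]
                    if len(left) == 0 or len(right) == 0:
                        continue
                    gain = base - sse(y[left]) - sse(y[right])
                    if best is None or gain > best[0]:
                        best = (gain, li, left, right)
        if best is None:                       # no leaf can be split further
            break
        _, li, left, right = best
        leaves[li:li + 1] = [left, right]      # replace the split leaf by its two children
    return leaves                              # prediction in a leaf: y[idx].mean()
```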
Regression Graph Algorithm [Mansour&McAllester]
• Regression graph RG: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: root split on x(j) ≥ θ; the x(j) < θ branch splits on x(j″) ≥ θ″ and the x(j) ≥ θ branch splits on x(j′) ≥ θ′; each of the four leaves predicts avg(yi) over its region]
Regression Graph Algorithm [Mansour&McAllester]
• Regression graph RG: R^d → [0,1]
• Training sample (x1,y1),(x2,y2),…,(xn,yn) ∈ R^d × [0,1]
[Figure: same structure after a merge; the two middle leaves are combined into one node covering (x(j) < θ ∧ x(j″) ≥ θ″) ∨ (x(j) ≥ θ ∧ x(j′) < θ′), which predicts avg(yi) over the merged region]
Regression Graph Algorithm [Mansour&McAllester]
• Put all n training data into one leaf
• Repeat until size(RG) = n^(3/7):
  – Split: greedily choose a leaf and a split x(j) ≤ θ to minimize ε(RG, train) = Σi (RG(xi)-yi)²/n
    • Divide the data in the split node into two new leaves
    • Let Δ be the decrease in ε(RG, train) from this split
  – Merge(s): greedily choose two leaves whose merger increases ε(RG, train) as little as possible
    • Repeat merging while the total increase in ε(RG, train) from merges is ≤ Δ/2
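Continuing the same illustrative sketch (assumptions as above, reusing sse from the tree sketch), a compact version of the split-then-merge loop; max_size is a hypothetical stand-in for the n^(3/7) budget, and the gain Δ from each split caps how much error the subsequent merges may give back.

```python
import numpy as np
# reuses sse() from the regression-tree sketch above

def grow_graph(X, y, max_size):
    """Greedy regression graph: each split is followed by merges costing at most half its gain."""
    leaves = [np.arange(len(y))]
    for _ in range(10 * max_size):              # safeguard against split/merge cycling
        if len(leaves) >= max_size:
            break
        # Split: same greedy criterion as the regression tree
        best = None
        for li, idx in enumerate(leaves):
            if len(idx) < 2:
                continue
            base = sse(y[idx])
            for j in range(X.shape[1]):
                for theta in np.unique(X[idx, j]):
                    left, right = idx[X[idx, j] < theta], idx[X[idx, j] >= theta]
                    if len(left) == 0 or len(right) == 0:
                        continue
                    gain = base - sse(y[left]) - sse(y[right])
                    if best is None or gain > best[0]:
                        best = (gain, li, left, right)
        if best is None:
            break
        delta, li, left, right = best
        leaves[li:li + 1] = [left, right]
        # Merge(s): combine cheapest pairs while the total increase stays <= delta / 2
        spent = 0.0
        while len(leaves) > 2:
            cost, a, b = min(
                (sse(y[np.concatenate((leaves[a], leaves[b]))])
                 - sse(y[leaves[a]]) - sse(y[leaves[b]]), a, b)
                for a in range(len(leaves)) for b in range(a + 1, len(leaves)))
            if spent + cost > delta / 2:
                break
            merged = np.concatenate((leaves[a], leaves[b]))
            leaves = [l for k, l in enumerate(leaves) if k not in (a, b)] + [merged]
            spent += cost
    return leaves                               # prediction in a node: y[idx].mean()
```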
Two useful lemmas
• Uniform generalization bound: for any n, with high probability over training sets (x1,y1),…,(xn,yn), the bound holds simultaneously for every regression graph R
• Existence of a correlated split: there always exists a split I(x(i) ≤ θ) s.t. …
Motivating natural example
• X = {0,1}^d, f(x) = (x(1)+x(2)+…+x(d))/d, uniform distribution
• Size(RT) ≈ exp(Size(RG)^c), e.g. d = 4:
[Figure: for d = 4, the full regression tree splits on x(1)>½, x(2)>½, x(3)>½, x(4)>½ along every path, with leaf values 0, .25, .5, .75, 1; the regression graph merges nodes that share the same partial sum into a much smaller diagram]
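A small illustrative calculation (not from the slides) of this size gap: an ordered decision tree for this f needs 2^d leaves, while merging nodes that share the same partial sum x(1)+…+x(i) leaves only i+1 nodes at level i, so the graph grows roughly like d².

```python
def tree_vs_graph_size(d):
    """Size of the full ordered tree vs. the merged graph for f(x) = (x(1)+...+x(d))/d."""
    tree_leaves = 2 ** d
    # Level i of the graph only needs one node per partial sum 0..i,
    # plus d + 1 leaves, one per possible value of f.
    graph_nodes = sum(i + 1 for i in range(d)) + (d + 1)
    return tree_leaves, graph_nodes

for d in (4, 8, 16):
    print(d, tree_vs_graph_size(d))   # e.g. d = 16: 65536 leaves vs. 153 nodes
```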
Regression boosting
• Incremental learning
  – Suppose you find something with positive correlation with y; then regression graphs make progress
  – "Weak regression" implies strong regression, i.e., small correlations can efficiently be combined to get correlation near 1 (error near 0)
  – Generalizes binary classification boosting [Kearns&Valiant, Schapire, Mansour&McAllester, …]
Conclusions
• Generalized additive models are very general
• Regression graphs, i.e., regression trees with merging, provably estimate GAMs using polynomial data and runtime
• Regression boosting generalizes binary classification boosting
• Future work
  – Improve the algorithm/analysis
  – Room for interesting work in statistics ∩ computational learning theory