Stochastic Tree Ensembles
for Regularized Supervised Learning
Jingyu He, Saar Yalov, Jared Murray and P. Richard Hahn
November 5, 2019
The University of Chicago Booth School of Business
1
Elements of supervised learning
• x ∈ R^p is a vector of covariates; the outcome y is continuous (regression) or discrete (classification).
• Goal: predict y given x.
• We want to learn the functional form f(·),
y = f(x) (1)
• Linear model: y = x^t β (or logistic regression).
• Non-parametric models: trees, deep learning.
2
Simulation and Empirical Results
Regression Simulation
We draw 30 covariates x from U(0, 1) and set y = f(x) + ε, where ε ∼ N(0, σ²) and σ² = κ Var(f), so κ is the noise-to-signal ratio.

Name          Function
Linear        x^t γ;  γ_j = −2 + 4(j − 1)/(d − 1)
Single index  10√a + sin(5a);  a = Σ_{j=1}^{10} (x_j − γ_j)²;  γ_j = −1.5 + (j − 1)/3
Trig + poly   5 sin(3 x_1) + 2 x_2^2 + 3 x_3 x_4
Max           max(x_1, x_2, x_3)

We compare with neural networks, random forests, and XGBoost with and without cross-validation.
3
Simulation, signal to noise = 1:1, κ = 1 and n = 10K
Function Method RMSE Seconds
Linear XBART 1.74 20
Linear XGBoost Tuned 2.63 64
Linear XGBoost Untuned 3.23 < 1
Linear Random Forest 3.56 6
Linear BART 1.50 117
Linear Neural Network 1.39 26
Trig + Poly XBART 1.31 17
Trig + Poly XGBoost Tuned 2.08 61
Trig + Poly XGBoost Untuned 2.70 < 1
Trig + Poly Random Forest 3.04 6
Trig + Poly BART 1.30 115
Trig + Poly Neural Network 3.96 26
Max XBART 0.39 16
Max XGBoost Tuned 0.42 62
Max XGBoost Untuned 0.79 < 1
Max Random Forest 0.41 6
Max BART 0.44 114
Max Neural Network 0.40 30
Single Index XBART 2.27 17
Single Index XGBoost Tuned 2.65 61
Single Index XGBoost Untuned 3.65 < 1
Single Index Random Forest 3.45 6
Single Index BART 2.03 116
Single Index Neural Network 2.76 28
4
Simulation, signal to noise = 1:10, κ = 10 and n = 10K
Function Method RMSE Seconds
Linear XBART 5.07 16
Linear XGBoost Tuned 8.04 61
Linear XGBoost Untuned 21.25 < 1
Linear Random Forest 6.52 6
Linear BART 6.64 111
Linear Neural Network 7.39 12
Trig + Poly XBART 4.94 16
Trig + Poly XGBoost Tuned 7.16 61
Trig + Poly XGBoost Untuned 17.97 < 1
Trig + Poly Random Forest 6.34 7
Trig + Poly BART 6.15 110
Trig + Poly Neural Network 8.20 13
Max XBART 1.94 16
Max XGBoost Tuned 2.76 60
Max XGBoost Untuned 7.18 < 1
Max Random Forest 2.30 6
Max BART 2.46 111
Max Neural Network 2.98 15
Single Index XBART 7.13 16
Single Index XGBoost Tuned 10.61 61
Single Index XGBoost Untuned 28.68 < 1
Single Index Random Forest 8.99 6
Single Index BART 8.69 111
Single Index Neural Network 9.43 14
5
Larger Simulation, κ = 1, signal to noise = 1:1
RMSE, with running time in seconds in parentheses
n XBART XGB+CV XGB NN
Linear
10k 1.74 (20) 2.63 (64) 3.23 (0) 1.39 (26)
50k 1.04 (180) 1.99 (142) 2.56 (4) 0.66 (28)
250k 0.67 (1774) 1.50 (1399) 2.00 (55) 0.28 (40)
Max
10k 0.39 (16) 0.42 (62) 0.79 (0) 0.40 (30)
50k 0.25 (134) 0.29 (140) 0.58 (4) 0.20 (32)
250k 0.14 (1188) 0.21 (1554) 0.41 (60) 0.16 (44)
Single Index
10k 2.27 (17) 2.65 (61) 3.65 (0) 2.76 (28)
50k 1.54 (153) 1.61 (141) 2.81 (4) 1.93 (31)
250k 1.14 (1484) 1.18 (1424) 2.16 (55) 1.67 (41)
Trig + Poly
10k 1.31 (17) 2.08 (61) 2.70 (0) 3.96 (26)
50k 0.74 (147) 1.29 (141) 1.67 (4) 3.33 (29)
250k 0.45 (1774) 0.82 (1474) 1.11 (59) 2.56 (41)
6
Larger Simulation, κ = 10, signal to noise = 1:10
RMSE, with running time in seconds in parentheses
n XBART XGB+CV XGB NN
Linear
10k 5.07 (16) 8.04 (61) 21.25 (0) 7.39 (12)
50k 3.16 (135) 5.47 (140) 16.17 (4) 3.62 (14)
250k 2.03 (1228) 3.15 (1473) 11.49 (54) 1.89 (19)
Max
10k 1.94 (16) 2.76 (60) 7.18 (0) 2.98 (15)
50k 1.22 (133) 1.85 (139) 5.49 (4) 1.63 (16)
250k 0.75 (1196) 1.05 (1485) 3.85 (54) 0.85 (22)
Single Index
10k 7.13 (16) 10.61 (61) 28.68 (0) 9.43 (14)
50k 4.51 (133) 6.91 (139) 21.18 (4) 6.42 (16)
250k 3.06 (1214) 4.10 (1547) 14.82 (54) 4.72 (21)
Trig + Poly
10k 4.94 (16) 7.16 (61) 17.97 (0) 8.20 (13)
50k 3.01 (132) 4.92 (139) 13.30 (4) 5.53 (14)
250k 1.87 (1216) 3.17 (1462) 9.37 (49) 4.13 (20)
7
Motivation
Grow a tree recursively

[Figure, built up over several frames: a scatter of points in the (x1, x2) plane is partitioned recursively. Candidate thresholds (0.3, 0.5, 0.7) are tried for the first split on x1, settling on x1 < 0.5 with left leaf µ1; the right child is then split on x2, settling on x2 < 0.7, which gives leaves µ2 and µ3 (yes/no branches at each internal node).]

Prediction of a new observation is the leaf parameter of the node it falls in, here µ3.
8
Grow a tree recursively
[Figure 9.2 of Friedman, Hastie, and Tibshirani (2001): a two-dimensional feature space partitioned by recursive binary splitting (splits at t1, ..., t4 on X1 and X2 giving regions R1, ..., R5), a general partition that cannot be obtained by recursive binary splitting, the tree corresponding to the recursive partition, and a perspective plot of the prediction surface.]

A tree is essentially a step function: it partitions on one variable at a time and predicts with a constant in each region. Picture from Friedman, Hastie, and Tibshirani (2001).
9
Sum of trees
[Figure: two trees added together. The first tree, with splits on c and d, has leaves µ1, µ2, µ3; the second, with splits on e and f, has leaves θ1, θ2, θ3. Their sum partitions the space into cells taking values θ1 + µ1, θ1 + µ2, θ2 + µ2, θ2 + µ3, θ3 + µ1, θ3 + µ2, θ3 + µ3.]
A sum of trees (a tree ensemble, or forest) implies an extra level of smoothing.
10
Why a tree / forest
• Widely used: more than half of the winners of data mining competitions on Kaggle use variants of tree ensemble methods.
• Invariant to scaling of the input variables, so there is no need to worry about feature normalization.
• Learns high-order interactions between features.
• Random forest: takes the average of multiple trees, all grown independently.
• Boosting: takes the sum of multiple trees, grown sequentially.
11
Random Forest

[Figure, built up over several frames: a random forest fit to a one-dimensional example, y plotted against x on [0, 1].]

12
Boosting

[Figure, built up over several frames: a boosting fit to a one-dimensional example, y plotted against x on [0, 1], improving tree by tree.]

13
Classification and Regression Trees (CART)
• Probably the most popular tree-growing algorithm.
• Grow the tree until it is very large, then prune it back.
• The CART split criterion minimizes the L2 loss
Σ_{i ∈ left} (y_i − ȳ_left)² + Σ_{i ∈ right} (y_i − ȳ_right)².
• The split point and the leaf parameters might conspire to make a bad split point look better than it is.
• CART optimizes the split criterion. What if two split points have very close evaluations, say 10.00 and 9.99?
14
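As a concrete illustration of the criterion above, here is a minimal sketch (our own illustrative code, not the authors' implementation) that evaluates the CART L2 loss at every candidate split of a single covariate:

    import numpy as np

    def cart_l2_losses(x, y):
        # L2 loss of every candidate split on one covariate:
        # sum of squared deviations from the child means, left child + right child.
        y = y[np.argsort(x)]
        losses = []
        for k in range(1, len(y)):                      # split after the k-th smallest x
            left, right = y[:k], y[k:]
            losses.append(((left - left.mean()) ** 2).sum()
                          + ((right - right.mean()) ** 2).sum())
        return np.array(losses)                          # CART takes the argmin

Two candidates can score almost identically (the 10.00 versus 9.99 situation above), yet CART commits to the argmin; XBART instead samples among the candidates, as the next slides describe.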
Intuition for XBART
• A randomized algorithm: split points are sampled rather than optimized.
• A new split criterion with a probabilistic interpretation.
• Early stopping, rather than pruning after over-growing.
• Splitting and leaf-parameter estimation are carried out separately.
• Tree ensemble.
15
Comparison of tree-based algorithms
                 CART         RF           XGB          XBART                 BART
Leaf parameters  optimized    optimized    optimized    integrated out at     integrated out at
                 with splits  with splits  with splits  split, then sampled   split, then sampled
Criterion        likelihood   likelihood   likelihood   marginal likelihood   marginal likelihood
Aggregation      no           of trees     no           of forests            of forests
Sequential fit   no           no           yes          yes                   yes
Iterations       no           no           no           yes                   no
Recursion        yes          yes          yes          yes                   no
16
XBART regression
Split criterion of Gaussian regression
• Assume a Gaussian likelihood on one leaf node, N(µ, σ²).
• Prior µ ∼ N(0, τ).
The integrated likelihood is
p(y | τ, σ²) = ∫ N(y | µJ, σ²I_n) N(µ | 0, τ) dµ = N(0, τJJ^t + σ²I_n),
where J is the vector of ones. Re-arranging terms, ignoring constants in the density, and taking the logarithm,
log m(s | τ, σ²) = log( σ² / (σ² + τn) ) + ( τ / (σ²(σ² + τn)) ) s²,
where s = Σ y is the sum of all y in the leaf node and n is the number of observations in it.
17
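A minimal sketch of the log integrated likelihood, written directly from the formula on this slide (the function and argument names are ours):

    import numpy as np

    def log_marginal(s, n, tau, sigma2):
        # log m(s | tau, sigma^2) for a leaf with n observations and
        # sufficient statistic s = sum of the responses in the leaf.
        return np.log(sigma2 / (sigma2 + tau * n)) + tau / (sigma2 * (sigma2 + tau * n)) * s ** 2

On the log scale, the split criterion of the next slide is then log_marginal(s_left, n_left, tau, sigma2) + log_marginal(s_right, n_right, tau, sigma2).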
Split criterion of a single tree
• A split point candidate c_jk partitions the current node into left and right child nodes.
• The two child nodes have sufficient statistics s^l_jk and s^r_jk.
• Assume the data in the two child nodes are independent.
The joint integrated likelihood of the two children is
l(c_jk) := m(s^l_jk | τ, σ²) m(s^r_jk | τ, σ²).
This is the split criterion for c_jk.
18
Why integrated-likelihood
• Most tree algorithms optimize the split point and the leaf parameters simultaneously.
• A bad split point can look better than it is, in collusion with the leaf parameters, because of random noise in the data.
• We split nodes and estimate leaf parameters separately.
19
No-split option (Regularization)
Furthermore, the split criterion for the no-split option is defined as
l_stop = |C| ( (1 + d)^β / α − 1 ) m(s_all | Φ, Ψ)
• d is the depth of the current node.
• s_all is the sufficient statistic of all the data in the node.
• α, β are hyper-parameters.
20
Weight of no-split option
• Each split point candidate has prior weight proportional to 1.
• The no-split option has prior weight proportional to |C| ( (1 + d)^β / α − 1 ).
• There are |C| split candidates in total.
Therefore the implied prior probability of splitting is
split weight / (split weight + no-split weight) = |C| / [ |C| ( (1 + d)^β / α − 1 ) + |C| ] = α (1 + d)^{−β},
which is the same as the prior probability of a split in BART.
21
Sample one split point (or stop)
Sample one of them according to the probabilities
P(c_jk) = l(c_jk) / ( Σ_{c_jk ∈ C} l(c_jk) + l_stop ),   P(stop) = l_stop / ( Σ_{c_jk ∈ C} l(c_jk) + l_stop ).
22
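A sketch of this sampling step, working on the log scale for numerical stability (illustrative helper, not the package API); log_l holds the log split criterion of each candidate and log_l_stop the log of the no-split weight:

    import numpy as np

    def sample_split(log_l, log_l_stop, rng=None):
        # Return the index of the sampled cut point, or None if no-split is drawn,
        # with probabilities proportional to exp(log_l) and exp(log_l_stop).
        rng = rng or np.random.default_rng()
        logw = np.append(np.asarray(log_l, dtype=float), log_l_stop)
        w = np.exp(logw - logw.max())          # subtract the max for stability
        p = w / w.sum()
        k = int(rng.choice(len(p), p=p))
        return None if k == len(logw) - 1 else k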
Split criterion of a single tree
Consider a split point candidate; it partitions the space into a left and a right child.
[Figure, over three frames: the (x1, x2) plane is cut by a vertical line; the left region contributes m(s^l_jk | Φ, Ψ) and the right region m(s^r_jk | Φ, Ψ) to the criterion.]
23
No-split option
If we stop splitting, it is equivalent to leaving all the data in one node.
[Figure: the whole (x1, x2) plane kept as a single node, with weight |C| ( (1 + d)^β / α − 1 ) m(s_∅ | Φ, Ψ).]
24
GrowFromRoot algorithm
In addition to the no-split option, we keep the usual stopping conditions: maximum tree depth, minimum leaf size, etc. The algorithm for growing a single tree (GrowFromRoot) is:
1. Start from the root node.
2. Evaluate the split criterion for all split point candidates and the no-split option, and sample one of them.
If no-split is sampled or another stopping condition is reached, update the leaf parameter and return.
Else, split the current node into left and right children and repeat step 2 for both child nodes.
25
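Putting the pieces together, here is a compact single-covariate sketch of GrowFromRoot under the Gaussian criterion. This is our own illustrative code, not the authors' implementation; the criterion helper is repeated so the snippet is self-contained.

    import numpy as np

    def log_marginal(s, n, tau, sigma2):
        # log m(s | tau, sigma^2) from the Gaussian split-criterion slide
        return np.log(sigma2 / (sigma2 + tau * n)) + tau / (sigma2 * (sigma2 + tau * n)) * s ** 2

    def grow_from_root(x, r, depth, tau, sigma2, alpha=0.95, beta=1.25,
                       max_depth=10, min_leaf=1, rng=None):
        # Grow one tree on a single covariate x with target r by sampling splits.
        rng = rng or np.random.default_rng()
        order = np.argsort(x)
        x, r = x[order], r[order]
        n, total = len(r), r.sum()
        if depth < max_depth and n > 2 * min_leaf:
            csum = np.cumsum(r)
            cuts = np.arange(min_leaf, n - min_leaf + 1)       # left child gets the first k points
            log_l = np.array([log_marginal(csum[k - 1], k, tau, sigma2)
                              + log_marginal(total - csum[k - 1], n - k, tau, sigma2)
                              for k in cuts])
            log_stop = (np.log(len(cuts)) + np.log((1 + depth) ** beta / alpha - 1)
                        + log_marginal(total, n, tau, sigma2))  # weighted no-split option
            logw = np.append(log_l, log_stop)
            p = np.exp(logw - logw.max())
            p /= p.sum()
            pick = int(rng.choice(len(p), p=p))
            if pick < len(cuts):                                # a split was sampled
                k = int(cuts[pick])
                return {"cut": x[k - 1],
                        "left": grow_from_root(x[:k], r[:k], depth + 1, tau, sigma2,
                                               alpha, beta, max_depth, min_leaf, rng),
                        "right": grow_from_root(x[k:], r[k:], depth + 1, tau, sigma2,
                                                alpha, beta, max_depth, min_leaf, rng)}
        prec = 1.0 / tau + n / sigma2                           # leaf: sample mu from its conditional
        return {"mu": rng.normal((total / sigma2) / prec, np.sqrt(1.0 / prec))}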
XBART forest
Next is the algorithm for growing a forest. We fit a sum of trees (regression) or a product of trees (classification).
Suppose we sample I forests with L trees in each forest.
For iter in 1 to I:
    For h in 1 to L:
        GrowFromRoot fits the target r_h^{iter}.
        Update r_{h+1}^{iter}, the target for the next tree to fit.
    Update the other non-tree parameters Ψ.
26
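A sketch of this sweep structure for the additive (regression) case. To keep it self-contained, a deliberately crude stand-in replaces GrowFromRoot, a root-only "tree" that predicts the mean of its target, so only the residual bookkeeping of the sweeps is illustrated; names and structure are ours.

    import numpy as np

    def fit_stub_tree(x, target):
        # Stand-in for GrowFromRoot: a root-only "tree" predicting the target mean.
        return np.full_like(target, target.mean())

    def xbart_sweeps(x, y, num_trees=3, num_sweeps=5):
        # Each tree is (re-)fit to the residual of y after subtracting the fits of
        # all other trees, mirroring the following slides.
        fits = np.zeros((num_trees, len(y)))
        for it in range(num_sweeps):
            for h in range(num_trees):
                if it == 0:
                    target = y / num_trees                   # first sweep: each tree fits Y / L
                else:
                    target = y - fits.sum(axis=0) + fits[h]  # residual target r_h^{iter}
                fits[h] = fit_stub_tree(x, target)
            # non-tree parameters (e.g. sigma^2) would be updated here, between sweeps
        return fits.sum(axis=0)                              # the fit of the current forest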
XBART forest
Take regression (additive trees) as an example with three trees. In the first sweep, the first tree fits the target R^1_1 = Y/3 and its fitted values are denoted R^1_1; the second tree fits R^1_2 = Y/3 with fitted values R^1_2; and the third tree fits R^1_3 = Y/3 with fitted values R^1_3.

In the next sweep of the forest, the first tree is re-fit to the residual target R^2_1 = Y − R^1_2 − R^1_3, with fitted values R^2_1; the second tree is re-fit to R^2_2 = Y − R^2_1 − R^1_3, with fitted values R^2_2; and the third tree is re-fit to R^2_3 = Y − R^2_1 − R^2_2, with fitted values R^2_3.
27
Update leaf parameter µ and residual variance
We assume a µ_lb ∼ N(0, τ) prior on each leaf, and update it by
µ_lb ∼ N( (Σ y / σ²) / (1/τ + n_lb/σ²), 1 / (1/τ + n_lb/σ²) ),
where Σ y sums over all data in the leaf and n_lb is the number of observations in it.
Assume a standard inverse-Gamma prior on σ² and update it in between trees:
σ² ∼ inverse-Gamma( N + a, r^{(iter)t} r^{(iter)} + b ),
where r^{(iter)} is the total residual of all trees.
28
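The two conditional draws above in NumPy (our own sketch; note that NumPy's Gamma sampler is parameterized by shape and scale, so the inverse-Gamma draw uses the reciprocal of a Gamma draw with scale 1/(r^t r + b)):

    import numpy as np

    rng = np.random.default_rng()

    def sample_leaf_mean(sum_y, n_lb, tau, sigma2):
        # mu_lb ~ N( (sum_y / sigma2) / (1/tau + n_lb/sigma2), 1 / (1/tau + n_lb/sigma2) )
        prec = 1.0 / tau + n_lb / sigma2
        return rng.normal((sum_y / sigma2) / prec, np.sqrt(1.0 / prec))

    def sample_sigma2(resid, a, b):
        # sigma^2 ~ inverse-Gamma(N + a, r'r + b), drawn as the reciprocal of a Gamma
        shape = len(resid) + a
        rate = resid @ resid + b
        return 1.0 / rng.gamma(shape, 1.0 / rate)   # numpy's gamma takes (shape, scale)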
Default Hyperparameters
We recommend
• L = 50, 100, or 200 trees,
• α = 0.95,
• β = 1.25, and
• τ = Var(y)/L.
A lower β permits deeper trees (BART's default is β = 2).
This τ dictates that a priori each tree accounts for 1/L of the observed variance.
Our default suggestion is just 40 sweeps through the data, discarding the first 15 as burn-in.
29
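For concreteness, the same defaults as they might be collected before fitting (a hypothetical interface; the actual package arguments may differ):

    import numpy as np

    y = np.random.default_rng().normal(size=1000)   # placeholder response

    L = 100                                         # number of trees: 50, 100, or 200
    defaults = {
        "num_trees": L,
        "alpha": 0.95,
        "beta": 1.25,                               # lower than BART's default of 2
        "tau": np.var(y) / L,                       # each tree explains 1/L of Var(y) a priori
        "num_sweeps": 40,
        "burnin": 15,
    }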
Final prediction
Given I iterations of the algorithm, keep the final I − I_0 forests for prediction, where I_0 < I denotes the length of the burn-in period. Average over the draws of the Markov chain:
f̄(X) = ( 1 / (I − I_0) ) Σ_{k > I_0} f^{(k)}(X),
where f^{(k)} denotes the k-th sampled forest, i.e. the sum of the trees in that forest.
30
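In code, the post-burn-in average is simply (illustrative names):

    import numpy as np

    def posterior_mean_prediction(forest_preds, burnin):
        # forest_preds: array of shape (I, n_test), one row per sampled forest f^(k)(X)
        return forest_preds[burnin:].mean(axis=0)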
Variable importance
We also keep track of variable importance. Count how many times each variable is used as a split variable in tree l at iteration k, collected in the vector w_l^{(k)}. Update the count parameter
w ← w − w_l^{(k−1)} + w_l^{(k)}.
The weight parameter is then resampled as w̄ ∼ Dirichlet(w). Then, for the next tree, we subsample mtry variables for consideration with probabilities w̄.
w gives a natural measure of variable importance.
31
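A sketch of this bookkeeping (our names): the split counts of the tree being replaced are swapped out of w, the new counts swapped in, and the sampling probabilities for the next tree are a Dirichlet draw with parameter w.

    import numpy as np

    rng = np.random.default_rng()

    def update_variable_weights(w, old_counts, new_counts, mtry):
        # w: running split counts per variable (keep it strictly positive,
        # e.g. by initializing with ones). Returns the updated counts, the
        # Dirichlet draw, and the mtry variables offered to the next tree.
        w = w - old_counts + new_counts
        probs = rng.dirichlet(w)
        candidates = rng.choice(len(w), size=mtry, replace=False, p=probs)
        return w, probs, candidates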
Generic XBART and Classification
Generic split criterion
It is natural to extend the split criterion to other likelihoods.
• Define a likelihood L(y; µ, Ψ) on one leaf node.
• µ is the leaf parameter; Ψ are other model parameters (given and fixed while the tree is grown).
• Assume a prior π(µ | Φ) on µ.
The integrated likelihood is
m(s | Φ, Ψ) := ∫ L(y; µ, Ψ) π(µ | Φ) dµ,
where s is a sufficient statistic of the data y falling in the current node.
32
Split criterion of multi-class classification
• Classification with C categories. Suppose each x_i is observed n_i times in the data.
• The response y_ij is the number of observations with covariate x_i in category j, where 1 ≤ i ≤ n and 1 ≤ j ≤ C.
• So Σ_{j=1}^C y_ij = n_i. If x_i is continuous, then n_i = 1.
• The probability that a response with covariate vector x_i belongs to category j is
π_j(x_i) = f^{(j)}(x_i) / Σ_{h=1}^C f^{(h)}(x_i).
33
Split criterion of multi-class classification
We assume the logarithm of each regression function is a sum of trees, log( f^{(j)}(x) ) = Σ_{l=1}^L g( x; T_l^{(j)}, µ_l^{(j)} ), which leads to a multinomial logistic trees model
π_j(x_i) = exp[ Σ_{l=1}^L g( x; T_l^{(j)}, µ_l^{(j)} ) ] / Σ_{h=1}^C exp[ Σ_{l=1}^L g( x; T_l^{(h)}, µ_l^{(h)} ) ].
Let λ_lb = exp(µ_lb); then
f(x) = exp[ Σ_{l=1}^L g(x; T_l, µ_l) ] = Π_{l=1}^L g(x; T_l, Λ_l),
where g(x; T_l, Λ_l) = λ_lb if x ∈ A_lb, for 1 ≤ b ≤ B_l.
34
Split criterion of multi-class classification
The likelihood of each covariate value is
p_MN(y_i) = ( n_i choose y_i1 y_i2 ··· y_iC ) Π_{j=1}^C f^{(j)}(x_i)^{y_ij} / [ Σ_{j=1}^C f^{(j)}(x_i) ]^{n_i}.
We apply the data augmentation strategy of Murray (2017): introducing a latent variable φ_i, the augmented likelihood is
p_MN(y_i, φ_i) = ( n_i choose y_i1 ··· y_iC ) Π_{j=1}^C f^{(j)}(x_i)^{y_ij} · ( φ_i^{n_i − 1} / Γ(n_i) ) exp[ −φ_i Σ_{j=1}^C f^{(j)}(x_i) ]
= ( n_i choose y_i1 ··· y_iC ) ( φ_i^{n_i − 1} / Γ(n_i) ) Π_{j=1}^C f^{(j)}(x_i)^{y_ij} exp[ −φ_i f^{(j)}(x_i) ].
The augmentation introduces one latent variable φ_i per observation, 1 ≤ i ≤ N.
35
Split criterion of multi-class classification
Assume an independent conjugate prior λ_lb ∼ Gamma(a_1, a_2) for each leaf parameter in Λ_l; the integrated likelihood is
L(T_l; T_(l), Λ_(l), θ, y) = ∫ L(T_l, Λ_l; T_(l), Λ_(l), θ, y) p(Λ_l) dΛ_l
= ∫ c_l Π_{b=1}^{B_l} λ_lb^{r_lb} exp[ −s_lb λ_lb ] · λ_lb^{a_1 − 1} a_2^{a_1} e^{−a_2 λ_lb} / Γ(a_1) dλ_lb
∝ Π_{b=1}^{B_l} Γ(a_1 + r_lb) / (a_2 + s_lb)^{a_1 + r_lb}.
The integrated likelihood of one leaf is
m( s_lb | a_1, a_2, {φ_i}_{i=1}^N ) = Γ(a_1 + r_lb) / (a_2 + s_lb)^{a_1 + r_lb}.
36
Sampling non-tree parameters in multi-class classification
The sampling steps of leaf and non-tree parameters are
• For 1 ≤ j ≤ C, update each leaf parameter of f^{(j)} independently in the leaf-parameter update of the GrowFromRoot algorithm:
λ_lb ∼ Gamma( a_1 + r_lb, a_2 + s_lb ).
• For 1 ≤ i ≤ N, update the latent variables after each tree is updated in the XBART forest algorithm:
φ_i ∼ Gamma( n_i, Σ_{j=1}^C f^{(j)}(x_i) ).
37
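The two Gamma draws above in NumPy (our sketch; NumPy's gamma sampler takes shape and scale, i.e. one over the rate):

    import numpy as np

    rng = np.random.default_rng()

    def sample_leaf_lambda(r_lb, s_lb, a1, a2):
        # lambda_lb ~ Gamma(a1 + r_lb, rate = a2 + s_lb)
        return rng.gamma(a1 + r_lb, 1.0 / (a2 + s_lb))

    def sample_phi(n_i, f_sum_i):
        # phi_i ~ Gamma(n_i, rate = sum_j f^(j)(x_i))
        return rng.gamma(n_i, 1.0 / f_sum_i)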
Some Theory
Markov chain
The algorithm sampling forests is a finite-state Markov chain with a stationary distribution.
• Each iteration relies only on the previous iteration, not on all the forests before it.
• The forest states are finite because of the maximum depth and the finite set of split point candidates.
• The probability of drawing a single tree is defined as a product of integrated likelihoods over its cut points, which is non-zero.
• There is at least one way to move from any forest to any other: re-grow the trees one by one.
38
Consistency of a single regression tree
• We prove consistency of a single regression tree based on the consistency result for CART (Scornet et al. 2015).
• The framework of the proof is the same; we verify that XBART satisfies the key lemmas involving the specific split criterion function.
• Randomly sampling the split point can be converted to optimization by the perturb-max lemma.
39
Perturb max lemma
Lemma
Suppose at a specific node there are |C| finite split-point candidates c_jk, and we draw one of them with probability
P(c_jk) = exp( l(c_jk) ) / Σ_{c_jk ∈ C} exp( l(c_jk) ).
Then
P( c_jk = argmax_{c_jk ∈ C} { l(c_jk) + γ_jk } ) = exp( l(c_jk) ) / Σ_{c_jk ∈ C} exp( l(c_jk) ),
where the γ_jk are independent draws from the Gumbel(0, 1) distribution with density p(x) = exp( −x − exp(−x) ).
Random sampling is equivalent to optimization with an additional randomly drawn perturbation.
40
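This is the Gumbel-max trick. A quick numerical check (our own sketch) that perturbing the criterion values with Gumbel(0, 1) noise and taking the argmax reproduces the softmax sampling probabilities:

    import numpy as np

    rng = np.random.default_rng(0)
    log_l = np.array([1.0, 0.5, -0.3, 2.0])              # l(c_jk) for four candidates
    softmax = np.exp(log_l) / np.exp(log_l).sum()        # target sampling probabilities

    draws = 200_000
    gumbel = rng.gumbel(size=(draws, len(log_l)))        # independent Gumbel(0, 1) noise
    picks = np.argmax(log_l + gumbel, axis=1)            # perturb, then maximize
    freq = np.bincount(picks, minlength=len(log_l)) / draws

    print(np.round(softmax, 3), np.round(freq, 3))       # the two should agree closely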
Perturb max lemma
Following the perturb-max lemma, we optimize
argmax_{c_jk ∈ C} { l(c_jk) + γ_jk },
which is equivalent to
argmax_{c_jk ∈ C} { l(c_jk)/n + γ_jk/n }.
Letting n → ∞, our empirical split criterion function L_n(x) converges to the theoretical version
L*(j, c_jk) = (1/σ²) P( x^{(j)} ≤ c_jk ) [ E( y | x^{(j)} ≤ c_jk ) ]² + (1/σ²) P( x^{(j)} > c_jk ) [ E( y | x^{(j)} > c_jk ) ]².
This theoretical split criterion is the same as that of CART.
41
Assumption
We prove consistency of a single tree in the regression setting.
Assumption (A1)
y = Σ_{j=1}^p f_j( x^{(j)} ) + ε,
where x = ( x^{(1)}, ..., x^{(p)} ) is uniformly distributed on [0, 1]^p and ε ∼ N(0, σ²).
42
Main theorem
Suppose d_n is a sequence of maximum tree depths and each tree is fully grown. Then we have
Theorem
Assume (A1) holds. Let n → ∞, d_n → ∞ and (2^{d_n} − 1)(log n)^9 / n → 0. Then a single XBART tree is consistent in the sense that
lim_{n→∞} E[ f_n(x) − f(x) ]² = 0.
43
An important proposition
The total variation of the true function f within leaf node A is
∆(f, A) = sup_{x, x′ ∈ A} | f(x) − f(x′) |.
A_n(x, Θ) is the leaf node that x falls in.
Proposition
Assume (A1) holds. For all ρ, ξ > 0, there exists N ∈ N* such that, for all n > N,
P[ ∆( f, A_n(x, Θ) ) ≤ ξ ] ≥ 1 − ρ.
As n → ∞, the variation of the true function within every leaf node becomes arbitrarily small, either because the leaf shrinks to a point or because the true function is flat on it.
44
Proof of proposition
The proof of the proposition relies on three key lemmas (stated in the extra slides); in the paper we verify that they hold for XBART. The two main steps are:
1. The proposition is true for a tree grown with the theoretical split criterion.
2. A tree grown with the empirical split criterion is close enough to the theoretical tree as n → ∞.
45
Future Research
• XBART achieves strong prediction accuracy and is fast.
• Application to Bayesian causal forests, including future empirical work in causal inference.
• Prove consistency of the XBART forest.
46
Extra slides
48
Classification Results
We compare XBART with other methods on 20 datasets from the UCI machine learning repository. All datasets have 3 to 6 categories and 100 to 3,000 observations.
The goal is to demonstrate that the default settings of XBART perform reasonably well compared to other approaches.
49
Classification Results
rf gbm mno svm nnet xbart
balance-scale 0.848 (0.023) 0.925 (0.010) 0.897 (0.021) 0.909 (0.025) 0.961 (0.019)* 0.912 (0.011)
car 0.983 (0.006)* 0.979 (0.008) 0.834 (0.019) 0.774 (0.033) 0.947 (0.015) 0.938 (0.018)
cardiotocography-3clases 0.937 (0.009) 0.949 (0.009)* 0.894 (0.011) 0.911 (0.011) 0.909 (0.013) 0.931 (0.011)
contrac 0.546 (0.024) 0.557 (0.023) 0.516 (0.028) 0.551 (0.024) 0.556 (0.028)* 0.324 (0.058)
dermatology 0.970 (0.016) 0.972 (0.020)* 0.968 (0.020) 0.759 (0.024) 0.970 (0.022) 0.972 (0.018)*
glass 0.798* (0.062) 0.771 (0.055) 0.622 (0.066) 0.679 (0.054) 0.673 (0.064) 0.702 (0.076)
heart-cleveland 0.578 (0.033) 0.586 (0.039) 0.587 (0.039) 0.620 (0.038)* 0.603 (0.052) 0.583 (0.034)
heart-va 0.357 (0.071)* 0.320 (0.067) 0.349 (0.069) 0.315 (0.069) 0.302 (0.08) 0.308 (0.064)
iris 0.948 (0.034) 0.945 (0.034) 0.965 (0.029)* 0.947 (0.034) 0.954 (0.045) 0.954 (0.033)
lymphography 0.866 (0.063)* 0.853 (0.057) 0.821 (0.069) 0.850 (0.062) 0.818 (0.077) 0.835 (0.058)
pittsburg-bridges-MATERIAL 0.840 (0.058) 0.844 (0.049) 0.834 (0.061) 0.860 (0.048)* 0.824 (0.066) 0.849 (0.046)
pittsburg-bridges-REL-L 0.725 (0.084)* 0.681 (0.093) 0.650 (0.087) 0.692 (0.082) 0.659 (0.091) 0.680 (0.083)
pittsburg-bridges-SPAN 0.637 (0.098) 0.648 (0.101) 0.675 (0.100) 0.681 (0.099)* 0.647 (0.102) 0.628 (0.100)
pittsburg-bridges-TYPE 0.609 (0.088)* 0.581 (0.089) 0.549 (0.089) 0.540 (0.075) 0.565 (0.092) 0.585 (0.093)
seeds 0.940 (0.030) 0.940 (0.031) 0.948 (0.035)* 0.929 (0.029) 0.944 (0.034) 0.945 (0.042)
synthetic-control 0.984 (0.012) 0.971 (0.015) 0.984 (0.012) 0.716 (0.024) 0.987 (0.011)* 0.983 (0.017)
teaching 0.622 (0.085)* 0.557 (0.086) 0.526 (0.072) 0.547 (0.073) 0.525 (0.087) 0.491 (0.086)
vertebral-column-3clases 0.847 (0.038) 0.829 (0.038) 0.861 (0.033)* 0.839 (0.042) 0.859 (0.033) 0.842 (0.039)
wine-quality-red 0.702 (0.021)* 0.631 (0.026) 0.597 (0.024) 0.576 (0.026) 0.597 (0.025) 0.613 (0.022)
wine 0.985 (0.018)* 0.979 (0.021) 0.979 (0.026) 0.980 (0.025) 0.977 (0.027) 0.969 (0.032)
50
Tree is Prone to Overfitting
[Figure: four panels of single-tree fits to the Boston housing data, medv plotted against lstat, illustrating how a single tree overfits as it is grown larger.]
51
Regularization
Three secrets of a successful tree model.
Regularization, regularization and regularization.
The strategies for preventing overfitting are diverse, but the basic
idea is to favor smaller trees.
52
Split criterion of multi-class classification
Let f_{−l}^{(j)}(x) = Π_{h ≠ l} g^{(j)}(x; T_h, Λ_h) be the total fit of all trees except the l-th one. The likelihood of all n covariate values has the form
L(T_l, Λ_l; T_(l), Λ_(l), y) = Π_{i=1}^n w_i f^{(j)}(x_i)^{y_ij} exp[ v_i f^{(j)}(x_i) ]
= c_l Π_{b=1}^{B_l} λ_lb^{r_lb} exp[ −s_lb λ_lb ],
where B_l is the number of leaf nodes of the l-th tree and
c_l = Π_{i=1}^n ( n_i choose y_i1 y_i2 ··· y_iC ) ( φ_i^{n_i − 1} / Γ(n_i) ) f_{−l}^{(j)}(x_i)^{y_ij},
r_lb = Σ_{i: x_i ∈ A_lb} y_ij,   s_lb = Σ_{i: x_i ∈ A_lb} φ_i f_{−l}^{(j)}(x_i).
53
Adaptive Cut-points
Cut-points are defined via evenly spaced quantiles of the observed data X.
At each recursion, the cut-points are redefined on the data in the current node.
[Figure, over three frames: f(x) plotted against x on [0, 8]; after a split, the candidate cut-points "zoom in" on the child node's range.]
"Zooming in" is easy because the data are kept ordered.
54
54
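A sketch of the quantile rule for one covariate (illustrative; the number of cut-points is a tuning choice):

    import numpy as np

    def adaptive_cutpoints(x_node, num_cutpoints=20):
        # Candidate cut-points for the current node: evenly spaced quantiles of the
        # covariate values that fall in the node, recomputed at every recursion.
        qs = np.linspace(0.0, 1.0, num_cutpoints + 2)[1:-1]   # interior quantiles only
        return np.unique(np.quantile(x_node, qs))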
Prune CART
Loop over all possible sub-trees T and calculate the loss
C_α(T) = C(T) + α|T|,
where C(T) is the prediction error of sub-tree T, |T| is a measure of tree complexity, and α is a tuning parameter. Pick the sub-tree with the lowest loss.
55
Gradient boosting trees
Obj = Σ_{i=1}^n loss( y_i, ŷ_i ) + Σ_{l=1}^L Ω( g(T_l, µ_l) )
Obj^{(t)} = Σ_{i=1}^n [ 2( ŷ^{(t−1)}_i − y_i ) g_t(x_i) + g_t(x_i)² ] + Ω(g_t) + const
56
Gyorfi et al. (2006)
Theorem (Györfi et al. 2006)
Assume that
1. lim_{n→∞} β_n = ∞;
2. lim_{n→∞} E[ inf_{m ∈ M_n(Θ), ||m||_∞ ≤ β_n} E_X[ m(x) − f(x) ]² ] = 0;
3. for all L > 0,
lim_{n→∞} E[ sup_{m ∈ M_n(Θ), ||m||_∞ ≤ β_n} | (1/a_n) Σ_{i=1}^n [ m(x_i) − y_{i,L} ]² − E[ m(x) − y_L ]² | ] = 0.
Then
lim_{n→∞} E[ T_{β_n} f_n(X, Θ) − f(X) ]² = 0.
57
Lemma 1
Lemma
Assume that (A1) holds. Then for all x ∈ [0, 1]^p, ∆( f, A*_k(x, Θ) ) → 0 almost surely as k → ∞.
58
Lemma 2
Lemma
Assume that (A1) holds. Fix x ∈ [0, 1]^p, k ∈ N*, and let ξ > 0. Then L_{n,k}(x, ·) is stochastically equicontinuous on A_k^ξ(x); that is, for all α, ρ > 0, there exists δ > 0 such that
lim_{n→∞} P[ sup_{ ||c_k − c′_k||_∞ ≤ δ; c_k, c′_k ∈ A_k^ξ(x) } | L_{n,k}(x, c_k) − L_{n,k}(x, c′_k) | > α ] ≤ ρ.
59
Lemma 3
Lemma
Assume that (A1) holds. Fix ξ, ρ > 0 and k ∈ N*. Then there exists N ∈ N* such that, for all n ≥ N,
P[ c_∞( c_{k,n}(x, Θ), A*_k(x, Θ) ) ≤ ξ ] ≥ 1 − ρ.
60
References
Jingyu He, Saar Yalov, and P. Richard Hahn. XBART: Accelerated Bayesian Additive Regression Trees. AISTATS 2019.
Jingyu He, Saar Yalov, Jared Murray, and P. Richard Hahn. Stochastic Tree Ensembles for Regularized Supervised Learning. Technical report.
61