High-Dimensional Sparse Econometric Models, an Introduction
Alexandre Belloni
ICE, July 2011
Alexandre Belloni High-Dimensional Sparse Econometrics
Outline
1. High-Dimensional Sparse Econometric Models (HDSM)
◮ Models
◮ Motivating examples for linear/nonparametric regression

2. Estimation methods for linear/nonparametric regression
◮ ℓ1-penalized estimators: formulations, penalty parameter, computational methods
◮ post-ℓ1-penalized estimators
◮ summary of rates of convergence

3. Plenty of model selection applications in econometrics
◮ Results for IV estimators
◮ Results for quantile regression
◮ Results for variable length Markov chains
High-Dimensional Sparse Models
◮ We have a large number of parameters p, potentially larger than the sample size n.
◮ The idea is that a low-dimensional submodel captures very accurately all essential features of the full-dimensional model.
◮ The key assumption is that the number of relevant parameters is smaller than the sample size.
◮ Fundamentally, this is a model selection problem.
◮ One question is whether sparse models make sense. Another is how to compute such models. A third is what guarantees we can establish.
1. High-Dimensional Sparse Econometric Model
HDSM. A response variable yi obeys
y_i = x_i′β0 + ε_i, ε_i ∼ N(0, σ²), i = 1, ..., n,

where the x_i are fixed p-dimensional regressors, x_i = (x_ij, j = 1, ..., p), normalized so that E_n[x_ij²] = 1, and p is possibly much larger than n.

The key assumption is sparsity:

s := ‖β0‖0 = Σ_{j=1}^p 1{β0j ≠ 0} ≪ n.

This generalizes the standard large parametric econometric model by letting the identity

T := support(β0)

of the relevant s regressors be unknown.
High p motivation
◮ transformations of basic regressors z_i,
x_i = (P1(z_i), ..., Pp(z_i));
◮ for example, in wage regressions, the P_j's are polynomials or B-splines in education and experience;
◮ or simply a very large list of regressors:
◮ a list of country characteristics in cross-country growth regressions (Barro & Lee);
◮ housing characteristics in hedonic regressions (American Housing Survey);
◮ price and product characteristics at the point of purchase (scanner data, TNS).
Sparseness
◮ The key assumption is that the number of non-zero regression coefficients is smaller than the sample size:

s := ‖β0‖0 = Σ_{j=1}^p 1{β0j ≠ 0} ≪ n.

◮ The idea is that a low s-dimensional submodel accurately approximates the full p-dimensional model. Here the approximation error is in fact zero.
◮ We can easily relax this assumption and allow a non-zero approximation error.
◮ We can also substantially relax the Gaussianity of the error terms ε_i, based on results from self-normalized moderate deviation theory.
High-Dimensional Sparse Econometric Model
HDSM′. For y_i the response variable and z_i the fixed elementary regressors,

y_i = g(z_i) + ε_i, E[ε_i] = 0, E[ε_i²] = σ², i = 1, ..., n.

Consider a p-dimensional dictionary of series terms,

x_i = (P1(z_i), ..., Pp(z_i)),

for example B-splines, polynomials, and interactions, where p is possibly much larger than n.

Expand via a generalized series: g(z_i) = x_i′β0 + r_i, where
◮ x_i′β0 is the best s-dimensional parametric approximation, with

s := ‖β0‖0 = Σ_{j=1}^p 1{β0j ≠ 0} ≪ n;

◮ r_i is the approximation error.
High-Dimensional Sparse Econometric Model
HDSM′. As a result we have an approximately sparse model:

y_i = x_i′β0 + r_i + ε_i, i = 1, ..., n.

The parametric part x_i′β0 is the target, with s non-zero coefficients. We choose s to balance the approximation error with the oracle estimation error:

E_n[r_i²] ∝ σ² s/n.

◮ This model generalizes the exact sparse model by letting in approximation error and non-Gaussian errors.
◮ It also generalizes the standard series regression model by letting the identity T = support(β0) of the most important s series terms be unknown.
Example 1: Series Models of Wage Function
◮ In this example, we abstract away from the estimation questions.
◮ Consider a series expansion of the conditional expectation E[y_i | z_i] of wage y_i given education z_i.
◮ A conventional series approximation to the regression function is, for example,

E[y_i | z_i] = β1 P1(z_i) + β2 P2(z_i) + β3 P3(z_i) + β4 P4(z_i) + r_i,

where P1, ..., P4 are the first low-order polynomials or splines with a few knots.
[Figure: Traditional approximation of the expected wage function using polynomials; wage plotted against education.]

◮ In the figure, the true regression function E[y_i | z_i] is computed using U.S. Census data.
◮ With the same number of parameters, can we find a much better series approximation?
◮ For this to be possible, some high-order series terms in the expansion

E[y_i | z_i] = Σ_{k=1}^p β0k Pk(z_i) + r_i′

must have large coefficients.
◮ This indeed happens in our example, since E[y_i | z_i] exhibits sharp oscillatory behavior in some regions, which low-degree polynomials fail to capture.
◮ And we find the "right" series terms by using ℓ1-penalization methods.
[Figure: Lasso approximation of the expected wage function using polynomials and dummies; wage plotted against education.]
[Figure: Traditional vs. Lasso approximation of expected wage functions, with an equal number of parameters.]
Errors of Conventional and lasso-Based Sparse Approximations

                                       L2 error   L∞ error
Conventional series approximation        0.135      0.290
lasso-based series approximation         0.031      0.063
Example 2: Cross-Country Growth Regression
◮ Relation between the growth rate and initial per capita GDP:

GrowthRate = α0 + α1 log(GDP) + ε.

◮ Test the convergence hypothesis, α1 < 0: poor countries catch up with richer countries. This is the prediction of the classical Solow growth model.
◮ Linear regression with log(GDP) alone yields:

GrowthRate = 0.0352 + 0.0013 log(GDP) + ε.

◮ Test the convergence hypothesis conditional on covariates describing institutions and technological factors:

GrowthRate = α0 + α1 log(GDP) + Σ_{j=1}^p βj Xj + ε.
Example 2: Cross-Country Growth Regression
In the Barro-Lee data we have p = 60 covariates and n = 90 observations; we need to select the significant covariates among the 60. Real GDP per capita (log) is included in all models.

Penalization parameter λ   Additional selected variables
2.3662                     Black Market Premium (log)
1.3935                     Black Market Premium (log); Political Instability
0.929                      Black Market Premium (log); Political Instability;
                           ratio of nominal government expenditure on defense to nominal GDP;
                           ratio of imports to GDP
Example 2: Cross-Country Growth Regression
◮ We find a sparse model for the conditioning effects using ℓ1-penalization, and we re-fit the resulting model using least squares.
◮ We find that the coefficient on lagged GDP is negative, and the confidence interval, adjusted for model selection, excludes zero.

Coefficient    Confidence set
lagged GDP     [−0.0261, −0.0007]

◮ Our findings fully support the conclusions reached by Barro and Lee and by Barro and Sala-i-Martin.
2. Estimation via ℓ1-penalty methods
The canonical least squares estimator, which minimizes

Q̂(β) = E_n[(y_i − x_i′β)²],

is not consistent when dim(β) = p ≥ n. We can try minimizing an AIC/BIC-type criterion function

Q̂(β) + λ‖β‖0, where ‖β‖0 = Σ_{j=1}^p 1{βj ≠ 0},

but this is not computationally feasible when p is large. Essentially it is a search over too many (non-nested) models; technically, it is a combinatorial problem known to be NP-hard.
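The scale of that combinatorial search is easy to make concrete with the standard library (p = 500 and s = 5 are illustrative values, not from the slides):

```python
import math

# An exhaustive AIC/BIC-type search must examine every size-s submodel;
# even a modest p makes this infeasible.
p, s = 500, 5
print(math.comb(p, s))  # 255244687600 candidate submodels of size 5
```

And this counts only the submodels of one fixed size; the full search ranges over all sizes.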
2. Estimation via ℓ1-penalty methods
It was necessary to balance and understand simultaneously computational issues (a hard combinatorial search) and statistical issues (lack of global identification).

Understanding that statistical aspects can help computational methods, and vice versa, was crucial.

Further developments in model selection estimation will rely heavily on computational advances.

(MCMC is an example of a computational method with a profound impact on the statistics/econometrics literature.)
2. Estimation via ℓ1-penalty methods
Avoid computing Q̂(β) + λ‖β‖0, but still get something meaningful.

A solution (Tibshirani, JRSS B, 1996) is to replace the ℓ0-"norm" by the closest convex function, the ℓ1-norm:

‖β‖1 = Σ_{j=1}^p |βj|.

◮ The lasso estimator β̂ then minimizes

Q̂(β) + λ‖β‖1.

◮ This change makes the regularization function convex.
◮ Good computational methods exist for large-scale problems.
◮ There are many ways to combine the fit and regularization functions.
Formulations
Idea: balance the fit Q̂(β) and the penalty ‖β‖1 using a parameter λ.

Several ways to do it:

min_β Q̂(β) + λ‖β‖1                 (lasso)
min_β Q̂(β) : ‖β‖1 ≤ λ              (sieve)
min_β ‖β‖1 : Q̂(β) ≤ λ              (ℓ1-minimization method)
min_β ‖β‖1 : ‖∇Q̂(β)‖∞ ≤ λ         (Dantzig selector)
min_β √Q̂(β) + λ‖β‖1                (√lasso)
Penalty parameter λ
The motivating heuristic is to choose the parameter λ so that β0 is feasible for the problem, but not too many more points are.

sieve: ideally set λ = ‖β0‖1.
ℓ1-minimization: ideally set λ = σ².
Dantzig selector: ideally set λ = ‖∇Q̂(β0)‖∞ = ‖2E_n[ε_i x_i]‖∞.

Note that lasso and √lasso are unrestricted minimizations. Their penalties are chosen to majorate the score at the true value, respectively

λ > ‖∇Q̂(β0)‖∞ and λ > ‖∇√Q̂(β0)‖∞.
Heuristics for lasso via Convex Geometry
[Figure: the criterion Q̂(β) plotted against a coordinate β_{T^c}, with the true value β_{0T^c} = 0 marked.]
Heuristics for lasso via Convex Geometry
[Figure: the linearization Q̂(β0) + ∇Q̂(β0)′β_{T^c}, tangent to Q̂(β) at the true value β_{0T^c} = 0.]
Heuristics for lasso via Convex Geometry
[Figure: the penalty λ‖β‖1 with λ = ‖∇Q̂(β0)‖∞, whose slope exactly matches the score at the true value β_{0T^c} = 0.]
Heuristics for lasso via Convex Geometry
[Figure: the penalty λ‖β‖1 with λ > ‖∇Q̂(β0)‖∞, whose slope dominates the score at the true value β_{0T^c} = 0.]
Heuristics via Convex Geometry
[Figure: the penalized objective Q̂(β) + λ‖β‖1 with λ > ‖∇Q̂(β0)‖∞, minimized at β_{T^c} = 0.]
Penalty parameters for lasso and √lasso

◮ In the case of lasso, the optimal choice of penalty level λ, guaranteeing near-oracle performance, is

λ = 2σ√(2 log p / n) ≥_P ‖2E_n[ε_i x_i]‖∞ = ‖∇Q̂(β0)‖∞.

◮ This choice relies on knowing σ, which may be a priori hard to estimate when p ≫ n.
◮ In the case of √lasso, the score is a pivotal random variable, with distribution independent of unknown parameters:

‖∇√Q̂(β0)‖∞ = ‖E_n[ε_i x_i]‖∞ / √(E_n[ε_i²]) = ‖E_n[(ε_i/σ) x_i]‖∞ / √(E_n[(ε_i/σ)²]),

◮ hence for √lasso the penalty λ does not depend on σ.
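A small simulation makes the contrast concrete (illustrative sizes n = 100, p = 500, chosen for this sketch): the lasso score scales linearly with σ, while the √lasso score is numerically identical across values of σ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 500
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))      # normalize E_n[x_ij^2] = 1
e = rng.standard_normal(n)               # eps_i / sigma

for sigma in (1.0, 10.0):
    eps = sigma * e
    lasso_score = np.abs(2 * X.T @ eps / n).max()       # ||grad Qhat(beta_0)||_inf
    sqrt_lasso_score = (np.abs(X.T @ eps / n).max()
                        / np.sqrt((eps ** 2).mean()))    # pivotal: sigma cancels
    print(sigma, lasso_score, sqrt_lasso_score)
```

The printed √lasso score is the same for both values of σ, while the lasso score is 10 times larger under the larger σ.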
Computational Methods for lasso
Big Picture of Computational Methods

Method           Problem size                Computational complexity
Componentwise    extremely large             convergence
First-order      very large (up to 100000)   pn log(p/ε)
Interior-point   large (up to 5000)          (p + n)^3.5 log((p + n)/ε)

Remark: several implementations of these methods are available:
◮ SDPT3 or SeDuMi for interior-point methods;
◮ TFOCS for first-order methods;
◮ several downloadable codes for componentwise methods (easy to implement).
Computational Methods for lasso: Componentwise
Componentwise Method. A common approach to solving multivariate optimization problems:

(i) pick a component and fix all remaining components;
(ii) improve the objective function along the chosen component;
(iii) loop steps (i)-(ii) until convergence is achieved.

This approach has been widely used when the minimization over a single component can be done very efficiently, which is the case for lasso. Its simple implementation also contributed to the widespread use of this approach.
Computational Methods for lasso: Componentwise
For a current iterate β, let β_{−j} = (β1, ..., β_{j−1}, 0, β_{j+1}, ..., βp)′ and write c_j := 2E_n[x_ij(y_i − x_i′β_{−j})]:

◮ if c_j > λ, the minimum over βj is βj = (c_j − λ)/(2E_n[x_ij²]);
◮ if c_j < −λ, the minimum over βj is βj = (c_j + λ)/(2E_n[x_ij²]);
◮ if |c_j| ≤ λ, the minimum over βj is βj = 0.

Remark: similar formulas can be computed for other methods such as √lasso.
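The update rules above translate directly into a minimal componentwise (coordinate descent) solver; a sketch without convergence tolerances or active-set speed-ups:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Componentwise lasso for  mean((y - X b)**2) + lam * ||b||_1.
    Each pass applies the closed-form single-coordinate update."""
    n, p = X.shape
    b = np.zeros(p)
    sq = (X ** 2).mean(axis=0)                    # E_n[x_ij^2]
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]      # residual with beta_j zeroed out
            c = 2 * (X[:, j] * r_j).mean()        # 2 E_n[x_ij (y_i - x_i' b_{-j})]
            if c > lam:
                b[j] = (c - lam) / (2 * sq[j])
            elif c < -lam:
                b[j] = (c + lam) / (2 * sq[j])
            else:
                b[j] = 0.0
    return b
```

On a small simulated design with one strong regressor, the solver recovers the support and exhibits the shrinkage toward zero discussed later in the slides.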
Computational Methods for lasso: first-order methods
First-order methods. The new generation of first-order methods focuses on structured convex problems that can be cast as

min_w f(A(w) + b) + h(w)   or   min_w h(w) : A(w) + b ∈ K,

where
◮ f is a smooth function;
◮ h is a structured function that is possibly non-differentiable or takes extended values;
◮ K is a structured convex set (typically a convex cone).
Computational Methods for lasso: first-order methods
lasso is cast as

min_w f(A(w) + b) + h(w)

with

f(·) = ‖·‖²/n, h(·) = λ‖·‖1, A = X, and b = −Y.

Thus,

min_β ‖Xβ − Y‖²/n + λ‖β‖1.

Remark: a number of different first-order methods have been proposed and analyzed; we briefly mention just one.
Computational Methods for lasso: first-order methods
On every iteration, for a given µ > 0 and current point β^k, the update is

β(β^k) = argmin_β ∇Q̂(β^k)′β + (µ/2)‖β − β^k‖² + λ‖β‖1.

The minimization in β above is separable and can be solved by soft-thresholding: letting

m_j^k := β_j^k + 2E_n[x_ij(y_i − x_i′β^k)]/µ,

we update our iterate as

β_j(β^k) = sign(m_j^k) max{|m_j^k| − λ/µ, 0}.

Remark: the additional regularization term (µ/2)‖β − β^k‖² brings computational stability that allows the algorithm to enjoy better computational complexity properties.
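The soft-thresholding iteration above can be sketched as a short proximal-gradient loop. Here µ is set to the gradient's Lipschitz constant, 2·λ_max(X′X/n), a standard stability choice and an assumption of this sketch (the slides leave µ free):

```python
import numpy as np

def lasso_fo(X, y, lam, n_iter=1000):
    """First-order (soft-thresholding) sketch for
    min_b mean((y - X b)**2) + lam * ||b||_1."""
    n, p = X.shape
    mu = 2 * np.linalg.eigvalsh(X.T @ X / n).max()   # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        m = b + 2 * X.T @ (y - X @ b) / (n * mu)     # m_j = b_j + 2 E_n[x_ij (y - x'b)] / mu
        b = np.sign(m) * np.maximum(np.abs(m) - lam / mu, 0.0)
    return b
```

Each iteration costs O(np), which is what makes first-order methods attractive for very large problems.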
Computational Methods for lasso: ipm
Interior-point methods. Interior-point method (ipm) solvers typically focus on conic programming problems in standard form,

min_w c′w : Aw = b, w ∈ K.   (2.1)

The conic constraint will be binding at the optimal solution. These methods use a "barrier function" F_K for the cone K that is strictly convex, smooth, and satisfies F_K(w) → ∞ as w → ∂K, and for some µ > 0 they compute

w(µ) = argmin_w c′w + µF_K(w) : Aw = b.   (2.2)

As µ → 0, w(µ) converges to the solution of (2.1). We can recover an approximate solution of (2.1) efficiently by using an approximate solution for w(µ_k) as a warm start to approximate w(µ_{k+1}), µ_k > µ_{k+1}.
Computational Methods for lasso: ipm
Not all barriers are created equal: solving the problems

w(µ) = argmin_w c′w + µF_K(w) : Aw = b   (2.3)

will be tractable only if F_K is a special barrier (which depends on K). Three cones play a major role in optimization theory:

◮ Non-negative orthant: IR^p_+ = {w ∈ IR^p : w_j ≥ 0, j = 1, ..., p}, with F_K(w) = −Σ_{j=1}^p log(w_j);
◮ Second-order cone: Q^{n+1} = {(w, t) ∈ IR^n × IR_+ : t ≥ ‖w‖}, with F_K(w, t) = −log(t² − ‖w‖²);
◮ Positive semi-definite matrices: S^{k×k}_+ = {W ∈ IR^{k×k} : W = W′, v′Wv ≥ 0 for all v ∈ IR^k}, with F_K(W) = −log det(W).
Computational Methods for lasso: ipm
These cones (and three others) are called self-scaled cones, and their Cartesian products are also self-scaled.

Therefore, when using ipms, it is VERY important computationally to arrive at a formulation that is not only convex but also relies on self-scaled cones.

It turns out that many tractable penalization methods (including ℓ1, but others as well) can be formulated with these three cones.
Computational Methods for lasso: ipm
lasso: min_β Q̂(β) + λ‖β‖1.

To formulate the ‖·‖1 term, let β = β⁺ − β⁻, where β⁺, β⁻ ∈ IR^p_+.

For any vector v ∈ IR^n and any scalar t ≥ 0 we have that v′v ≤ t is equivalent to ‖(v, (t − 1)/2)‖ ≤ (t + 1)/2; write u := (t − 1)/2 and z := (t + 1)/2. Then lasso becomes the conic program

min_{t, β⁺, β⁻, u, z, v}  t/n + λ Σ_{j=1}^p (β⁺_j + β⁻_j)

subject to
v = Y − Xβ⁺ + Xβ⁻,
t = −1 + 2z,
t = 1 + 2u,
(v, u, z) ∈ Q^{n+2}, t ≥ 0, β⁺ ∈ IR^p_+, β⁻ ∈ IR^p_+.
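The equivalence v′v ≤ t ⟺ ‖(v, (t − 1)/2)‖ ≤ (t + 1)/2 used above is an exact algebraic identity, since ((t + 1)/2)² − ((t − 1)/2)² = t; a quick numerical check on random draws:

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(1000):
    v = rng.standard_normal(5)
    t = 5 * abs(rng.standard_normal())
    lhs = v @ v <= t
    # ||(v, (t-1)/2)|| <= (t+1)/2, compared with both sides squared
    rhs = v @ v + ((t - 1) / 2) ** 2 <= ((t + 1) / 2) ** 2
    assert lhs == rhs
print("equivalence holds on all draws")
```

This is the standard trick for rotating a quadratic constraint into a second-order cone constraint.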
Quick comparison on a very sparse model: s = 5, β0j = 1 for j = 1, ..., s, β0j = 0 for j > s.

Table: Average running time over 100 simulations (precisions differ across methods).

                     Componentwise   First-order   Interior-point
n = 100, p = 500
  lasso                  0.2173         10.99           2.545
  √lasso                 0.3268          7.345          1.645
n = 200, p = 1000
  lasso                  0.6115         19.84          14.20
  √lasso                 0.6448         19.96           8.291
n = 400, p = 2000
  lasso                  2.625          84.12         108.9
  √lasso                 2.687          77.65          62.86
Post-Model Selection Estimator
Define the post-lasso estimator as follows:

◮ Step 1: select the model using the lasso, T̂ := support(β̂).
◮ Step 2: apply ordinary least squares (OLS) to the selected model T̂.

Remark: we could use √lasso, the Dantzig selector, sieve, or ℓ1-minimization instead of lasso.
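The two steps can be sketched as follows; the first-step estimate is passed in as an argument, so any of the selectors above (lasso, √lasso, Dantzig selector, ...) can supply it:

```python
import numpy as np

def post_selection_ols(X, y, b_first_step):
    """Step 1: take the support of the first-step sparse estimate.
    Step 2: refit by OLS on the selected columns only."""
    T_hat = np.flatnonzero(b_first_step)
    b_post = np.zeros(X.shape[1])
    if T_hat.size:
        b_post[T_hat] = np.linalg.lstsq(X[:, T_hat], y, rcond=None)[0]
    return b_post, T_hat
```

When the selected support is correct, the refit removes the shrinkage bias of the penalized first step.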
Post-Model Selection Estimator Geometry
[Figure: level sets of Q̂ in (β1, β2) for min_{β∈R^p} ‖β‖1 : Q̂(β) ≤ γ; the estimate β̂ lies at a vertex of the ℓ1-ball near the true value β0.]

In this example the correct model is selected, but note the shrinkage bias. The shrinkage bias can be removed by the post-model-selection estimator, i.e., by applying OLS to the selected model.
Post-Model Selection Estimator Geometry
[Figure: the same geometry, min_{β∈R^p} ‖β‖1 : Q̂(β) ≤ γ, but with β̂ at a different vertex.]

In this example the correct model was not selected. The selected model contains a "junk" regressor, which can affect the post-model-selection estimator.
Issues on post model selection estimator
◮ lasso and other ℓ1-penalized methods will successfully "zero out" lots of irrelevant regressors;
◮ however, these procedures are not perfect at model selection: they might include "junk" regressors and exclude some relevant ones;
◮ and they bias/shrink the non-zero coefficient estimates towards zero;
◮ post-model-selection estimators are motivated by removing this bias towards zero;
◮ but the results need to account for possible misspecification of the selected model.
Regularity Condition
◮ A simple sufficient condition is as follows.

Condition RSE. Take any C > 1. Then, for all n large enough, the sC-sparse eigenvalues of the empirical design matrix E_n[x_i x_i′] are bounded away from zero and from above; namely,

0 < κ ≤ min_{‖δ‖0 ≤ sC} E_n[(x_i′δ)²]/‖δ‖² ≤ max_{‖δ‖0 ≤ sC} E_n[(x_i′δ)²]/‖δ‖² ≤ κ′ < ∞.   (2.4)

◮ This holds under i.i.d. sampling if E[x_i x_i′] has eigenvalues bounded away from zero and from above, and:
• x_i has light tails (e.g., log-concave) and s log p = o(n);
• or the regressors are bounded, max_{ij} |x_ij| ≤ K, and s log⁴ p = o(n).
Result 1: Rates for ℓ1-penalized methods
Theorem (Rates). Under condition RSE, with probability → 1,

√(E_n[(x_i′β̂ − x_i′β0)²]) ≲ σ √(s log(n ∨ p) / n).

◮ The rate is very close to the oracle rate √(s/n), obtainable when we know the "true" model T.
◮ p shows up only through the logarithmic term log p.
◮ For the Dantzig selector: Candès and Tao (Annals of Statistics, 2007).
◮ For lasso: Meinshausen and Yu (Annals of Statistics, 2009); Bickel, Ritov, and Tsybakov (Annals of Statistics, 2009).
◮ For √lasso: B., Chernozhukov, and Wang (Biometrika, 2011).
Result 2: Post-Model Selection Estimator
Theorem (Rate for the Post-Model-Selection Estimator). Under the conditions of Result 1, with probability → 1,

√(E_n[(x_i′β̂_PL − x_i′β0)²]) ≲ σ √((s/n) log(n ∨ p)),

and in some exceptional cases faster, up to σ√(s/n).

◮ Even though lasso does not in general perfectly select the relevant regressors, post-lasso performs at least as well.
◮ The analysis accounts for a selected model that can be misspecified.
◮ This result was first derived in the context of median regression by B. and Chernozhukov (Annals of Statistics, 2011), and extended to least squares by B. and Chernozhukov (R & R, Bernoulli, 2011).
Monte Carlo
◮ In this simulation we used s = 6, p = 500, n = 100, and

y_i = x_i′β0 + ε_i, ε_i ∼ N(0, σ²),
β0 = (1, 1, 1/2, 1/3, 1/4, 1/5, 0, ..., 0)′,
x_i ∼ N(0, Σ), Σ_jk = (1/2)^{|j−k|}, σ² = 1.

◮ Ideal benchmark: the oracle estimator, which runs OLS of y_i on x_i1, ..., x_i6. This estimator is not feasible outside the Monte Carlo.
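The design above can be reproduced directly; the sketch below also runs the infeasible oracle OLS on the first s columns and reports its in-sample prediction risk (the benchmark against which the figures compare LASSO and Post-LASSO):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s, sigma = 100, 500, 6, 1.0
beta0 = np.zeros(p)
beta0[:s] = [1, 1, 1/2, 1/3, 1/4, 1/5]
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # Toeplitz design
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
y = X @ beta0 + sigma * rng.standard_normal(n)

b_oracle = np.zeros(p)
b_oracle[:s] = np.linalg.lstsq(X[:, :s], y, rcond=None)[0]  # infeasible oracle OLS
risk = np.sqrt(np.mean((X @ (b_oracle - beta0)) ** 2))      # in-sample prediction risk
print(risk)
```

The oracle risk concentrates around σ√(s/n) ≈ 0.245 in this design, which is the yardstick in the rate results above.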
Monte Carlo Results: Risk
Risk: √(E[(x_i′β̂ − x_i′β0)²]), n = 100, p = 500.

[Figure: bar chart of estimation risk (scale 0 to 0.7) for LASSO, Post-LASSO, and the Oracle.]
Monte Carlo Results: Bias
Bias: ‖E[β̂] − β0‖2, n = 100, p = 500.

[Figure: bar chart of bias (scale 0 to 0.45) for LASSO and Post-LASSO.]
Applications of model selection
Gaussian Instrumental Variable
Gaussian IV model:

y_i = d_i′α + ε_i,
d_i = D(z_i) + v_i (first stage),
(ε_i, v_i)′ ∼ N(0, [σ_ε², σ_εv; σ_εv, σ_v²]).

◮ We can have additional controls w_i in both equations; they are suppressed here, see the references for details.
◮ We want to estimate the "optimal" instrument D(z_i) = E[d_i | z_i] using a large list of "technical instruments," x_i := (x_i1, ..., x_ip)′ := (P1(z_i), ..., Pp(z_i))′, and we can write

D(z_i) = x_i′β0 + r_i, with ‖β0‖0 = s, E_n[r_i²] ≲ σ_v² s/n,

where p is large, possibly much larger than n.
◮ We then use the estimated optimal instrument to estimate α.
2SLS with LASSO/Post-LASSO estimated Optimal IV

◮ Step 1: estimate D̂(z_i) = x_i′β̂ using the LASSO or Post-LASSO estimator.
◮ Step 2: α̂ = [E_n[d_i D̂(z_i)′]]^{-1} E_n[D̂(z_i) y_i].
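Step 2 is a one-liner for a scalar endogenous variable; a sketch in which the first-step fit D̂ is taken as given (e.g., from lasso or post-lasso of d on the technical instruments):

```python
import numpy as np

def iv_step2(d, y, D_hat):
    """alpha_hat = E_n[d_i D_hat_i]^{-1} E_n[D_hat_i y_i],
    for a scalar endogenous regressor d and fitted instrument D_hat."""
    n = len(y)
    return (D_hat @ y / n) / (D_hat @ d / n)
```

In the test, the true optimal instrument D(z) stands in for the lasso fit (an assumption of this sketch); the instrumented estimate is consistent while plain OLS is biased by the endogeneity.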
2SLS with LASSO-based IV
Theorem (2SLS with lasso-based IV). Under the conditions of Result 1, and s² log² p = o(n), we have

(Λ̂*)^{-1/2} √n (α̂ − α) →_d N(0, I),

where Λ̂* := σ̂² {E_n[D̂(z_i) D̂(z_i)′]}^{-1}.

◮ For more results, see B., Chernozhukov, and Hansen (submitted, ArXiv 2010).
◮ The requirement s² log² p = o(n) can be relaxed to s log p = o(n) if a split-sample IV estimator is used.
◮ The extension to non-Gaussian and heteroskedastic errors in B., Chen, Chernozhukov, and Hansen (submitted, ArXiv 2010) required adjustments in the penalty function.
Application of IV: Eminent Domain
Estimate the economic consequences of government take-over of property rights from individuals.

◮ y_i = economic outcome in a region i, e.g., a housing price index;
◮ d_i = number of property take-overs decided in a court of law, by panels of 3 judges;
◮ x_i = demographic characteristics of judges, who are randomly assigned to panels: education, political affiliations, age, experience, etc.;
◮ f_i = x_i plus various interactions of the components of x_i: a very large list, p = p(f_i) = 344.
Application of IV: Eminent Domain, Continued.
◮ The outcome is the log of a housing price index; the endogenous variable is government take-over.
◮ We can use 2 elementary instruments, suggested by real lawyers (Chen and Yeh, 2010).
◮ Or we can use all 344 instruments and select approximately the right set using LASSO.

Estimator          Instruments   Price effect   Robust std. error
2SLS                    2            .07             .032
2SLS / LASSO IVs        4            .05             .017
ℓ1-Penalized Quantile Regression
◮ We have a response variable y and p-dimensional covariates x.
◮ The u-th conditional quantile function of y given x is x′β(u), with β(u) ∈ R^p.
◮ The dimension p is large, possibly much larger than the sample size n.
◮ The true model β(u) is sparse, having only s < n non-zero components; we don't know which.
◮ In the population, β(u) minimizes

Q_u(β) = E[ρ_u(y − x′β)],

where ρ_u(t) = (u − 1{t ≤ 0})t = u t⁺ + (1 − u) t⁻.
ℓ1-Penalized Quantile Regression
◮ The canonical estimator, which minimizes

Q̂_u(β) = n^{-1} Σ_{i=1}^n ρ_u(y_i − x_i′β),

is not consistent when p ≥ n.
◮ The ℓ1-penalized quantile regression estimator β̂(u) solves

min_{β ∈ R^p} Q̂_u(β) + λ_u ‖β‖1.
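The penalized program above is a linear program once β and the residuals are split into positive and negative parts, so a minimal sketch can hand it to a generic LP solver (here scipy's linprog, an implementation choice of this sketch, not a method from the slides):

```python
import numpy as np
from scipy.optimize import linprog

def l1_qr(X, y, u, lam):
    """l1-penalized u-th quantile regression as a linear program.
    Split beta = b_plus - b_minus and the residual into t_plus - t_minus:
    min  (1/n) * sum(u*t_plus + (1-u)*t_minus) + lam * sum(b_plus + b_minus)
    s.t. X b_plus - X b_minus + t_plus - t_minus = y, all variables >= 0."""
    n, p = X.shape
    c = np.concatenate([np.full(p, lam), np.full(p, lam),
                        np.full(n, u / n), np.full(n, (1 - u) / n)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:2 * p]
```

At the optimum at most one of t_plus_i, t_minus_i is non-zero, so the linear cost equals the check function ρ_u of the residual.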
ℓ1-QR Rate Result
Theorem (Rates). Under standard conditions, letting U ⊂ (0, 1) be a compact set, with probability → 1,

sup_{u ∈ U} ‖β̂(u) − β(u)‖2 ≲_P √(s log(n ∨ p) / n).

◮ This is very similar to the uniform rate √(s log n / n) obtained when the support is known.
◮ The rates are similar to lasso's, but they also hold uniformly over u ∈ U.
◮ Although the rates are similar, the analysis is substantially different.
Application ℓ1-QR: Conditional Independence
Let X ∈ IR^m denote a random vector. We are interested in determining the conditional independence structure of X: for every pair of components i, j, determine whether

X_i ⊥⊥ X_j | X_{−i,−j}.

Conditional independence between two components of a Gaussian random vector X ∼ N(0, Σ) is well understood: for Σ = Var(X), X_i and X_j are independent conditional on the other components if and only if (Σ^{-1})_{ij} = 0.
Application ℓ1-QR: Conditional Independence
Let X ∈ IR^m denote a random vector. We are interested in determining the conditional independence structure of X: for every pair of components i, j, determine whether

X_i ⊥⊥ X_j | X_{−i,−j}.

We consider a non-Gaussian high-dimensional setting, where

X_i ⊥⊥ X_j | X_{−i,−j}

if and only if

Q_{X_i}(u | X_{−i}) = Q_{X_i}(u | X_{−i,−j}) for all u ∈ (0, 1).
Application ℓ1-QR: Conditional Independence
To approximate conditional independence, consider the following: we say that X_i and X_j are U-quantile uncorrelated if

β^i(u) ∈ argmin E[ρ_u(X_i − β′X_{−i})] and β^j(u) ∈ argmin E[ρ_u(X_j − β′X_{−j})]

are such that β^i_j(u) = 0 and β^j_i(u) = 0 for all u ∈ U.
Application ℓ1-QR: Conditional Independence
To approximate conditional independence, consider:
◮ U ⊂ (0, 1), a compact set of quantile indices;
◮ a class of functions F.

We will say that X_i ⊥⊥_{U,F} X_j | X_{−i,−j} if and only if, for every u ∈ U,

f^{i,j}_u(0, X_{−i,−j}) ∈ argmin_{f ∈ F} E[ρ_u(X_i − f(X_j, X_{−i,−j}))] and
f^{j,i}_u(0, X_{−i,−j}) ∈ argmin_{f ∈ F} E[ρ_u(X_j − f(X_i, X_{−i,−j}))].

This approach requires:
◮ a series approximation for F;
◮ uniform estimation across all quantile indices u ∈ U;
◮ model selection.
Application ℓ1-QR: Conditional Independence
Let X be a vector of log returns. We draw an edge (i, j) if X_i is not conditionally independent of X_j given X_{−i,−j}.

[Figure: estimated conditional dependence graph over the tickers EWST, BTFG, AEPI, JJSF, DGSE, PLXS, CAU, RMCF, HGR, SUNW, KDE, ORCL, MSFT, TW, OSIP, FUND, LLTC, DJCO, DELL, IBM.]
Application ℓ1-QR: Conditional Independence
Consider IBM returns first.

[Figure: the conditional dependence graph restricted to edges involving IBM.]
Application ℓ1-QR: Conditional Independence
Next consider all returns.

[Figure: the full estimated conditional dependence graph over all tickers.]
Application ℓ1-QR: Conditional Independence
Next we investigate how this dependence structure is affected by "downside movements" of the market. This is particularly relevant for hedging, since downside movements of the market are exactly when hedging is most needed.

We quantify downside movement by conditioning on the observations with the largest 25%, 50%, 75%, and 100% downside movements of the market (100% corresponds to the overall dependence).
Application ℓ1-QR: Conditional Independence
100% (no downside-movement conditioning)

[Figure: estimated conditional dependence graph using all observations.]
Application ℓ1-QR: Conditional Independence
75%

[Figure: the graph conditioning on the 75% largest downside movements.]
Application ℓ1-QR: Conditional Independence
50%

[Figure: the graph conditioning on the 50% largest downside movements.]
Application ℓ1-QR: Conditional Independence
25%

[Figure: the graph conditioning on the 25% largest downside movements.]
Application ℓ1-QR: Conditional Independence
100% vs. 25%

[Figure: the 100% and 25% graphs side by side.]

The right graph is much sparser than the left graph. Thus, when markets are moving down, several variables become (nearly) conditionally independent. Therefore, hedging strategies that do not account for downside movements might not offer the desired protection exactly when it is needed most.
Variable Length Markov Chains
Consider a dynamic model:
◮ dynamic discrete choice models;
◮ discrete stochastic dynamic programming.

The main feature we are interested in: the history of outcomes potentially impacts the probability distribution of the next outcome.
Variable Length Markov Chains
Consider that at each time t agents choose s_t ∈ A = {Low, High}. The typical assumption is a Markov chain of order 1, which requires 2 conditional probabilities.

[Diagram: a depth-1 tree with the histories L and H at t = −1.]
Variable Length Markov Chains
Consider that at each time t agents choose s_t ∈ A = {Low, High}. A Markov chain of order 2 requires estimating 4 conditional probabilities.

[Diagram: a depth-2 tree with the histories LL, LH, HL, HH at t = −2.]
Variable Length Markov Chains
Consider that at each time t agents choose s_t ∈ A = {Low, High}. A Markov chain of order 3 requires estimating 8 conditional probabilities.

[Diagram: a depth-3 tree with the histories LLL, LLH, LHL, LHH, HLL, HLH, HHL, HHH at t = −3.]
Variable Length Markov Chains
Consider that at each time t agents choose s_t ∈ A = {Low, High}. A Markov chain of order K requires estimating 2^K conditional probabilities.

[Diagram: a depth-K tree, growing exponentially with the order.]
Variable Length Markov Chains
Issues:
◮ K is unknown (potentially infinite);
◮ a full K-th order Markov chain becomes large quickly;
◮ similar histories may lead to similar transition probabilities.

Motivation:
◮ bias (small model) vs. variance (large model);
◮ make the order of the Markov chain depend on the history;
◮ model selection balances misspecification against the length of confidence intervals.
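The bookkeeping behind a context tree can be sketched by counting, for each history up to a maximum order, the symbol that follows it; pruning contexts is then a separate model selection step, which this sketch does not implement:

```python
from collections import Counter

def context_counts(seq, max_order):
    """For each context (history) of length 1..max_order, count the symbol
    that follows it.  Empirical transition probabilities are then
    counts[ctx][a] / sum(counts[ctx].values()); a context-tree method
    prunes contexts whose transition probabilities are close to their
    parent's."""
    counts = {}
    for t in range(max_order, len(seq)):
        for k in range(1, max_order + 1):
            ctx = tuple(seq[t - k:t])          # the k most recent symbols
            counts.setdefault(ctx, Counter())[seq[t]] += 1
    return counts
```

For a strictly alternating sequence such as L, H, L, H, ..., the context ('H',) is always followed by 'L' and vice versa, so the fitted chain is effectively of order 1.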
Variable Length Markov Chains
Variable length Markov chain: very flexible, but one needs to select the "context tree."

[Diagram: an unbalanced context tree; e.g., LL, HL, and HH are terminal contexts at depth 2, while the branch below LH is refined further (LHL, LHH, ...).]
Variable Length Markov Chains
A penalty method based on (uniform) confidence intervals.

Uniform estimation of transition probabilities over all possible histories x ∈ A^{−1}_{−∞}:

sup_{x ∈ A^{−1}_{−∞}} ‖P̂(· | x) − P(· | x)‖∞

(the bounds achieve the minimax rates for some classes).

The results allow us to establish:
◮ in dynamic discrete choice models: uniform estimation of dynamic marginal effects;
◮ in discrete stochastic dynamic programming: uniform estimation of the value function.
Conclusion
1. ℓ1- and post-ℓ1-penalization methods
◮ offer a good way to control for confounding factors;
◮ offer a good way to select the series terms;
◮ are especially useful when p ≥ n.

2. ℓ1-penalized estimators achieve near-oracle rates.

3. Post-ℓ1-penalized estimators achieve at least near-oracle rates.

4. There are plenty of applications of model selection to estimators in econometrics.
References
◮ Bickel, Ritov, and Tsybakov, "Simultaneous analysis of Lasso and Dantzig selector," Annals of Statistics, 2009.
◮ Candès and Tao, "The Dantzig selector: statistical estimation when p is much larger than n," Annals of Statistics, 2007.
◮ Meinshausen and Yu, "Lasso-type recovery of sparse representations for high-dimensional data," Annals of Statistics, 2009.
◮ Tibshirani, "Regression shrinkage and selection via the Lasso," J. Roy. Statist. Soc. Ser. B, 1996.
References
◮ Belloni and Chernozhukov, "ℓ1-Penalized Quantile Regression for High-Dimensional Sparse Models," Annals of Statistics, 2011.
◮ Belloni and Chernozhukov, "Post-ℓ1-Penalized Estimators in High-Dimensional Linear Regression Models," ArXiv, 2009.
◮ Belloni, Chernozhukov, and Wang, "√LASSO: Pivotal Recovery of Sparse Signals via Conic Programming," Biometrika, 2011.
◮ Belloni and Chernozhukov, "High Dimensional Sparse Econometric Models: An Introduction," Ill-Posed Inverse Problems and High-Dimensional Estimation, Springer Lecture Notes in Statistics, 2011.
◮ Belloni, Chernozhukov, and Hansen, "Estimation and Inference Methods for High-Dimensional Sparse Econometric Models: a Review," submitted, 2011.
◮ Belloni and Oliveira, "Approximate Group Context Tree: application to dynamic programming and dynamic choice models," submitted, 2011.