Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU...

17
Robust regression with imprecise data Marco Cattaneo and Andrea Wiencierz Department of Statistics, LMU Munich 17 November 2011

Transcript of Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU...

Page 1: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

Robust regression with imprecise data

Marco Cattaneo and Andrea WiencierzDepartment of Statistics, LMU Munich

17 November 2011

Page 2: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

imprecise data

example data set:

I in the literature, two kinds of general approaches to regression withimprecise data:

I midpoint regression: represent the observed imprecise data by few precisevalues (e.g., intervals by midpoint and length), and apply standardregression methods to those values (see, e.g., Domingues et al., 2010)

I set of precise regressions: apply standard regression methods to all possibleprecise data compatible with the observed imprecise data, and consider therange of outcomes as the imprecise result (see, e.g., Ferson et al., 2007)

I LIR (Likelihood-based Imprecise Regression): new regression methoddirectly applicable to imprecise data (Cattaneo and Wiencierz, 2011a,b)

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 3: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

linear regression for the example data set

midpoint regression (green and violet lines) and set of precise regressions(set of gray lines):

least squares least median of squares

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 4: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

nonparametric likelihood

I precise data (unobserved): random variables Vi = (Xi ,Yi ) ∈ X × R

I imprecise data (observed): random sets V ∗i ⊆ X × R

I nonparametric model: P is the set of all probability measures such that

I (V1,V∗1 ), . . . , (Vn,V

∗n ) i.i.d.

I P(Vi ∈ V ∗i ) ≥ 1− ε (where ε ∈ [0, 1] is fixed)

I the observed (imprecise) data V ∗1 = A1, . . . ,V

∗n = An induce the

(normalized) likelihood function lik : P → [0, 1] with

lik(P) =P(V ∗

1 = A1, . . . ,V∗n = An)

maxP′∈P P ′(V ∗1 = A1, . . . ,V ∗

n = An)

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 5: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

regression problem

I regression functions: F is a certain set of functions f : X → R

I absolute residuals: Rf ,i = |Yi − f (Xi )|

I for each function f ∈ F , the quantiles of the distribution of the absoluteresiduals Rf ,i can be estimated even under the nonparametric model P

I the regression problem can be interpreted as the minimization of thep-quantile of the distribution of the absolute residuals Rf ,i (wherep ∈ (0, 1) is fixed)

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 6: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

generalized LQS regression

I likelihood-based confidence interval for the p-quantile of the distributionof the absolute residuals Rf ,i (where Qf (P) is the interval of allp-quantiles of Rf ,i under P, and β ∈ (0, 1) is fixed):

Cf =⋃

P∈P : lik(P)>β

Qf (P)

I point estimate: fLRM is the function in F minimizing sup Cf(Likelihood-based Region Minimax: see Cattaneo, 2007)

I fLRM has a simple geometrical interpretation: B fLRM ,qLRMis the thinnest

band of the form B f ,q = {(x , y) ∈ X × R : |y − f (x)| ≤ q} containing atleast k imprecise data (where k > (p + ε) n depends on n, ε, p, β), for allf ∈ F and all q ∈ [0,+∞)

I when the observed data are in fact precise, fLRM corresponds to the LQS

(Least Quantile of Squares) estimate with quantile kn

I fLRM can be computed by generalizing the algorithm of Rousseeuw andLeroy (1987)

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 7: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

imprecise regression

I interval dominance: interval I (strictly) dominates interval J iff x < y forall x ∈ I and all y ∈ J

I imprecise regression: set of all undominated functions (i.e., all f ∈ Fsuch that qLRM ∈ Cf )

I the undominated functions have a simple geometrical interpretation: f isundominated iff B f ,qLRM

intersects at least k + 1 imprecise data (wherek < (p − ε) n depends on n, ε, p, β)

I complex uncertainty, consisting of two kinds of uncertainty:

I statistical uncertainty: decreases when n increases (reflected by the spread

between k+1n

and kn)

I indetermination: unavoidable under such weak assumptions (reflected bythe difference between containing and intersecting imprecise data)

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 8: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

LIR analysis of the example data set

n = 17, ε = 0, p = 0.5, β = 0.5 ⇒ k = 11, k = 6

fLRM and B fLRM ,qLRMundominated lines (k = 11, k = 6)

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 9: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

other values for β and ϵn = 17, ε = 0, p = 0.5, β = 0.8 ⇒ k = 10, k = 7

n = 17, ε = 0.05, p = 0.5, β = 0.8 ⇒ k = 11, k = 6

n = 17, ε = 0, p = 0.5, β = 0.26 ⇒ k = 12, k = 5

n = 17, ε = 0.05, p = 0.5, β = 0.5 ⇒ k = 12, k = 5

undominated lines (k = 10, k = 7) undominated lines (k = 12, k = 5)

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 10: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

precise data (midpoints of the example data set)

n = 17, ε = 0, p = 0.5, β = 0.5 ⇒ k = 11, k = 6

LIR analysis (undominated functions: gray lines), least squares regression(green line), and least median of squares regression (violet line):

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 11: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

example with social survey data

I data from the “ALLBUS — German General Social Survey” of 2008(provided by GESIS — Leibniz Institute for the Social Sciences)

I relationship between age Xi ∈ X = [18, 100) and personal income (onaverage per month) Yi ∈ [0,+∞), with n = 3469

I choice of regression functions: F = {fa,b1,b2 : a, b1, b2 ∈ R} is the set ofall quadratic functions fa,b1,b2(x) = a+ b1 x + b2 x

2

I choice of parameters:I ε = 0, p = 0.5, β = 0.15 ⇒ k = 1792, k = 1677I ε = 0.0107, p = 0.5, β = 0.8 ⇒ k = 1792, k = 1677

I in 4 different data situations, fLRM (violet solid line, with B fLRM ,qLRM

represented by the violet dashed lines) and the undominated functions(gray dotted curves) are compared with the results of the least squaresmidpoint regressions with upper income limit 15000 (blue curve) or 10000(green curve)

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 12: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

original data

I age data: 3457 “precise” (in years: 83 classes), 12 missing

I income data: 2406 precise, 381 categorized (22 classes), 682 missing

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 13: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

categorized income data

I age data: 3457 “precise” (in years: 83 classes), 12 missing

I income data: 2787 categorized (22 classes), 682 missing

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 14: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

categorized age data

I age data: 3457 categorized (6 classes), 12 missing

I income data: 2406 precise, 381 categorized (22 classes), 682 missing

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 15: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

categorized age and income data

I age data: 3457 categorized (6 classes), 12 missing

I income data: 2787 categorized (22 classes), 682 missing

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 16: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

conclusion and outlookI LIR: new line of approach to regression with imprecise data

I LIR is directly applicable to any kind of imprecise data (with precise dataas a special case), where the coarsening process can be informative (andeven wrong with a certain probability)

I the result of the regression is imprecise, reflecting the total uncertainty inthe data

I LIR without distributional assumptions leads to very robust results(generalized LQS regression)

I future work:I improve the implementation of LIRI study in more detail the statistical properties of the method (e.g., the

coverage probability of the imprecise result), even though the repeatedsampling evaluation is particularly problematic with imprecise data

I investigate the consequences of stronger assumptions (e.g., the existenceof a true regression function with an additive, homoscedastic, normal error)

I consider the minimization of other properties of the distribution of theabsolute residuals (besides the quantiles), in order to increase theefficiency of the method (e.g., generalized LTS regression)

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data

Page 17: Marco Cattaneo and Andrea Wiencierz - uni-muenchen.de · Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data LIR analysis of the example data set

referencesCattaneo, M. (2007). Statistical Decisions Based Directly on theLikelihood Function. PhD thesis, ETH Zurich.

Cattaneo, M., and Wiencierz, A. (2011a). Regression with ImpreciseData: A Robust Approach. In ISIPTA ’11, Proceedings of the SeventhInternational Symposium on Imprecise Probability: Theories andApplications, eds. F. Coolen, G. de Cooman, T. Fetz, andM. Oberguggenberger. SIPTA, 119–128.

Cattaneo, M., and Wiencierz, A. (2011b). Robust regression withimprecise data. Technical Report 114. Department of Statistics, LMUMunich.

Domingues, M. A. O., de Souza, R. M. C. R., and Cysneiros, F. J. A.(2010). A robust method for linear regression of symbolic interval data.Pattern Recognit. Lett. 31, 1991–1996.

Ferson, S., Kreinovich, V., Hajagos, J., Oberkampf, W., and Ginzburg, L.(2007). Experimental Uncertainty Estimation and Statistics for DataHaving Interval Uncertainty. Technical Report SAND2007-0939.Sandia National Laboratories.

Rousseeuw, P. J., and Leroy, A. M. (1987). Robust Regression andOutlier Detection. Wiley.

Marco Cattaneo and Andrea Wiencierz @ LMU Munich Robust regression with imprecise data