Post on 30-Dec-2015
Loss Functions for Detecting Outliers in Panel Data
Charles D. Coleman, Thomas Bryan, Jason E. Devine
U.S. Census Bureau
Prepared for the Spring 2000 meetings of the Federal-State Cooperative Program for Population Estimates, Los Angeles, CA,
March, 2000
Panel Data
A.k.a. “longitudinal data.”
$x_{it}$:
– i indexes cross-sectional units, which retain their identities over time. Examples: geographic areas, persons, households, companies, autos.
– t indexes time.
  – Chronological or nominal.
  – Chronological time measures time elapsed between two dates.
  – Nominal time indexes different sets of estimates; it can also index true values.
Notation
• $B_i$ is base value for unit i.
• $F_i$ is “future” value for unit i.
• $F_{it}$ is future value for unit i at time t.
• $B_i, F_i, F_{it} > 0$.
• $\varepsilon_i = |F_i - B_i|$ is absolute difference for unit i.
• Subscripts will be dropped when not needed.
What is an Outlier?
“[An outlier is] an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.”
D.M. Hawkins, Identification of Outliers, 1980, p. 1.
Meaning of an Outlier
• Either
– Indication of a problem with the data generation process.
• Or
– A true, but unusual, statement about reality.
Loss Functions
• Motivations: The absolute differences $\varepsilon_i$ come from unknown distributions. Want to compare multiple size classes on same basis.
• $L(F_i; B_i) \equiv \ell(\varepsilon_i, B_i)$ is loss function for observation i.
• Loss functions measure “badness.”
• Loss functions produce rankings of observations to be examined.
• Loss functions are empirically based, except for one special case in nominal time.
Assumption 1
Loss is symmetric in error:
$L(B + \varepsilon; B) = L(B - \varepsilon; B)$
Assumption 2
Loss increases in difference:
$\partial \ell / \partial \varepsilon > 0$
Assumption 3
Loss decreases in base value:
$\partial \ell / \partial B < 0$
Property 1
Loss associated with given absolute percentage difference ($|\varepsilon / B|$) increases in B.
Simplest Loss Function
$L(F;B) = |F - B|\,B^{q}$ (1a)
or
$\ell(\varepsilon, B) = \varepsilon B^{q}$ (1b)
with
$0 > q > -1$.
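A minimal Python sketch of loss function (1a) may help make the assumptions concrete. The function name `loss`, the default q = -0.5, and the example values are illustrative assumptions, not part of the presentation.

```python
def loss(future: float, base: float, q: float = -0.5) -> float:
    """Loss (1a): absolute difference scaled by base**q, with -1 < q < 0."""
    if base <= 0 or future <= 0:
        raise ValueError("base and future values must be positive")
    if not -1.0 < q < 0.0:
        raise ValueError("q must lie strictly between -1 and 0")
    return abs(future - base) * base ** q


# Hypothetical units with the same 10% difference but different sizes.
small = loss(110.0, 100.0)        # 10 * 100**-0.5
large = loss(11_000.0, 10_000.0)  # 1000 * 10000**-0.5

# Same absolute difference (10) measured against a much larger base.
diluted = loss(10_010.0, 10_000.0)
```

In this sketch `large > small` illustrates Property 1 (a given percentage difference costs more at a larger base), while `diluted < small` illustrates Assumption 3 (a given absolute difference costs less at a larger base).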
Loss as Weighted Combination of Absolute Difference and Absolute Percentage Difference
$\tilde{L}(F;B) = |F - B|^{r} \left| \frac{F - B}{B} \right|^{s}$
• This generates loss function with $q = -s/(r + s)$.
• Infinite number of pairs (r, s) correspond to any given q.
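The link between (r, s) and q can be sketched as follows, assuming r, s > 0 (an assumption added here so that the power transformation preserves rankings) and using B > 0:

```latex
\begin{align*}
\tilde{L}(F;B) &= |F-B|^{r}\left|\frac{F-B}{B}\right|^{s}
               = |F-B|^{r+s}\,B^{-s},\\
\tilde{L}(F;B)^{1/(r+s)} &= |F-B|\,B^{-s/(r+s)}
               = |F-B|\,B^{q}, \qquad q = -\frac{s}{r+s}.
\end{align*}
```

Raising to the positive power 1/(r + s) is monotone, so both forms rank observations identically. Multiplying r and s by the same positive constant leaves q unchanged, which is why infinitely many pairs give the same q.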
Outlier Criterion
• Outlier declared whenever $L(F;B) \equiv \ell(\varepsilon, B) > C$
• C is “critical value.”
• C can be determined in advance, or as function of data (e.g., quantile or multiple of scale measure).
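As a rough sketch of the criterion, the following Python flags observations whose loss exceeds C. Here C is either supplied in advance or, as one hypothetical data-driven choice, set to the upper quartile of the losses. The function names and the example pairs are assumptions for illustration.

```python
import statistics


def loss(future, base, q=-0.5):
    """Absolute difference scaled by base**q."""
    return abs(future - base) * base ** q


def flag_outliers(pairs, q=-0.5, critical=None):
    """Return (C, flags); flags[i] is True when the loss of pair i exceeds C.

    `pairs` holds (base, future) tuples. When `critical` is None, C is set to
    the upper quartile of the losses (an illustrative choice of quantile).
    """
    losses = [loss(f, b, q) for b, f in pairs]
    if critical is None:
        critical = statistics.quantiles(losses, n=4)[2]
    return critical, [value > critical for value in losses]


# Hypothetical (base, future) pairs; the last one differs by 30%.
pairs = [(100, 104), (400, 410), (900, 915),
         (1600, 1620), (2500, 2525), (10000, 13000)]
c, flags = flag_outliers(pairs)
```

With these made-up values only the last pair is flagged under the quantile rule. Passing a fixed `critical` instead reproduces the "determined in advance" case.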
Loss Function Variants
• Time-Invariant Loss Function
• Signed Loss Function
• Nominal Time
Time-Invariant Loss Function
• Idea: Compare multiple dates of data on same basis.
• Time need not be round number.
• $L(F_{it}; B_i, t) = |F_{it} - B_i|\,B_i^{tq}$
• Property 1 satisfied as long as t < –1/q.
• Thus, useful horizon is limited.
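The horizon limit can be illustrated with a short Python sketch. It assumes the exponent on the base value is t·q, which is the form consistent with the condition t < -1/q. The names `time_invariant_loss` and `horizon` and the value q = -0.25 are hypothetical.

```python
def time_invariant_loss(future, base, t, q=-0.5):
    """|F_it - B_i| * B_i**(t*q); t is elapsed time and need not be an integer."""
    if base <= 0 or t <= 0:
        raise ValueError("base and t must be positive")
    return abs(future - base) * base ** (t * q)


def horizon(q):
    """Upper bound on t (exclusive) for which Property 1 holds: t < -1/q."""
    return -1.0 / q


q = -0.25  # hypothetical choice; the horizon is then 4 time units

# Two units with the same 10% difference, inside the horizon (t = 2).
inside_small = time_invariant_loss(110.0, 100.0, t=2, q=q)
inside_large = time_invariant_loss(11_000.0, 10_000.0, t=2, q=q)

# The same two units beyond the horizon (t = 5).
beyond_small = time_invariant_loss(110.0, 100.0, t=5, q=q)
beyond_large = time_invariant_loss(11_000.0, 10_000.0, t=5, q=q)
```

Inside the horizon the larger unit has the larger loss, as Property 1 requires. Beyond it the ordering reverses, which is why the useful horizon is limited.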
Signed Loss Function
• Idea: Account for direction and magnitude of loss.
$S(F;B) = (F - B)\,B^{q}$
• Can use asymmetric critical values and “q”s:
– Declare outliers whenever
$S_{+}(F;B) = (F - B)\,B^{q_{+}} > C_{+}$
or
$S_{-}(F;B) = (F - B)\,B^{q_{-}} < C_{-}$
with $C_{+} \neq -C_{-}$, $q_{+} \neq q_{-}$.
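A small Python sketch of the asymmetric rule follows. The exponents and critical values used as defaults are made-up numbers chosen only to show the mechanics, and the labels "high" and "low" are hypothetical.

```python
def signed_loss(future, base, q):
    """Signed loss (F - B) * B**q: positive for increases, negative for decreases."""
    return (future - base) * base ** q


def classify(future, base, q_plus=-0.5, q_minus=-0.4, c_plus=5.0, c_minus=-3.0):
    """Return 'high', 'low', or None. All default parameter values are hypothetical."""
    if signed_loss(future, base, q_plus) > c_plus:
        return "high"
    if signed_loss(future, base, q_minus) < c_minus:
        return "low"
    return None


up = classify(11_000.0, 10_000.0)     # large increase
down = classify(9_800.0, 10_000.0)    # moderate decrease, stricter lower rule
quiet = classify(10_200.0, 10_000.0)  # increase too small to flag
```

Using separate exponents and critical values lets an analyst treat unexpected declines more (or less) strictly than unexpected growth.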
Nominal Time
• Compare 2 sets of estimates; one set can be actual values, $A_i$.
• Assumptions:
– Unbiased: $E[B_i] = E[F_i] = A_i$.
– Proportionate variance: $\mathrm{Var}(B_i) = \mathrm{Var}(F_i) = \sigma^2 A_i$.
• q = –1/2.
• Either set of estimates can be used for $B_i$, $F_i$.
– Exception: $A_i$ can only be substituted for $B_i$.
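One way to see where q = –1/2 comes from is to standardize the difference by its standard deviation. The sketch below additionally assumes that the two sets of estimates are uncorrelated, which the presentation does not state.

```latex
\begin{align*}
\operatorname{Var}(F_i - B_i) &= \operatorname{Var}(F_i) + \operatorname{Var}(B_i)
                               = 2\sigma^{2} A_i,\\
\frac{|F_i - B_i|}{\sqrt{2\sigma^{2} A_i}} &\propto |F_i - B_i|\,A_i^{-1/2}
                               \approx |F_i - B_i|\,B_i^{-1/2}.
\end{align*}
```

The last step replaces the unknown $A_i$ with its unbiased estimate $B_i$, which gives the form $|F - B|\,B^{q}$ with q = –1/2. The constant $\sqrt{2\sigma^{2}}$ does not affect rankings and can be absorbed into the critical value.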
How to Use: No Preexisting Outlier Criteria
• Start with q = –0.5.
– Adjust by increments of 0.1 to get “good” distribution of outliers.
• Alternative: Start with q = log(range)/25 – 1, where range is range of data. (Bryan, 1999)
– Can adjust.
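The alternative starting value can be computed as in the sketch below. The base of the logarithm is not stated in the presentation, so base 10 is an assumption here and is exposed as a parameter. The function name `starting_q` and the example values are hypothetical.

```python
import math


def starting_q(values, log_base=10.0):
    """Starting value q = log(range)/25 - 1, after Bryan (1999).

    ASSUMPTION: the logarithm base is not given in the source; base 10 is
    used by default and can be overridden via `log_base`.
    """
    data_range = max(values) - min(values)
    if data_range <= 0:
        raise ValueError("the data must span a positive range")
    return math.log(data_range, log_base) / 25.0 - 1.0


# Hypothetical base values spanning a range of one million.
q0 = starting_q([1_000.0, 250_000.0, 1_001_000.0])
```

With these values the sketch gives a starting q of about -0.76 under the base-10 assumption. A range of 1 or less would push q to -1 or below, outside the interval 0 > q > -1, so the result should be checked before use.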
How to Use: Preexisting Discrete Outlier Criteria
• Start with schedule of critical pairs $(\varepsilon_j, B_j)$.
– These pairs (approximately) satisfy equation $\varepsilon B^{q} = C$ for some q and C. They are the cutoffs between outliers and nonoutliers.
• Run regression $\log \varepsilon_j = -q \log B_j + K$
• Then, $C = e^{K}$.
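The regression step can be sketched in plain Python with ordinary least squares. The function name `fit_q_and_c` and the example schedule are hypothetical; the schedule is constructed to satisfy the cutoff equation exactly so that the fit can be checked by eye.

```python
import math


def fit_q_and_c(critical_pairs):
    """Fit log(eps_j) = -q*log(B_j) + K by least squares; return (q, C = exp(K)).

    `critical_pairs` is a list of (eps_j, B_j) cutoffs with positive entries.
    """
    xs = [math.log(b) for _, b in critical_pairs]
    ys = [math.log(e) for e, _ in critical_pairs]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # estimates -q
    intercept = y_bar - slope * x_bar  # estimates K
    return -slope, math.exp(intercept)


# Hypothetical schedule built so that eps * B**-0.5 = 2 holds exactly.
schedule = [(2.0 * b ** 0.5, b) for b in (100.0, 2_500.0, 40_000.0, 1_000_000.0)]
q, c = fit_q_and_c(schedule)
```

For this constructed schedule the fit recovers q close to -0.5 and C close to 2. A real schedule will satisfy the equation only approximately, so the residuals are worth inspecting.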
Loss Functions and GIS
• Loss functions can be used with GIS to focus analyst’s attention on problem areas.
• Maps compare tax method county population estimates to unconstrained housing unit method estimates.
• q = –0.5 in loss function map.
Map 1. Absolute Differences between the Two Sets of Population Estimates
Legend (persons): 0 - 5000; 5000 - 25000; 25000 - 50000; Over 50000; No Data.
Note: The tax method estimates are the base.
Map 2. Absolute Percent Differences between the Two Sets of Population Estimates
Legend (percent): 0 - 5; 5 - 10; 10 - 20; Above 20; No Data.
Note: The tax method estimates are the base.
Map 3. Loss Function Values
Legend (loss): 0 - 1000; 1000 - 2000; 2000 - 4000; Above 4000; No Data.
Note: The tax method estimates are the base.
Outliers Classified by Another Variable
• $D_i$ is function of 2 successive observations.
• $R_i$ is “reference” variable, used to classify outliers.
• Start with schedule of critical pairs $(D_j, R_j)$.
• Run regression $\log D_j = a + \beta \log R_j$
• Then, $L(D, R) = D R^{b}$ with $b = -\beta$, and $C = e^{a}$.
What to Do with Negative Data
• From Coleman and Bryan (2000):
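$$L(F,B) = \begin{cases} |F - B|\,(|F| + |B|)^{q}, & B \neq 0 \text{ or } F \neq 0, \\ 0, & B = F = 0. \end{cases}$$
$$S(F,B) = \begin{cases} (F - B)\,(|F| + |B|)^{q}, & B \neq 0 \text{ or } F \neq 0, \\ 0, & B = F = 0. \end{cases}$$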
• $0 > q > -1$. Suggest $q \approx -0.5$.
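A minimal Python sketch of the loss functions for data that may be zero or negative follows. The function names and the default q = -0.5 are illustrative assumptions.

```python
def loss_any_sign(future, base, q=-0.5):
    """|F - B| * (|F| + |B|)**q, defined as 0 when F = B = 0."""
    scale = abs(future) + abs(base)
    if scale == 0:
        return 0.0
    return abs(future - base) * scale ** q


def signed_loss_any_sign(future, base, q=-0.5):
    """(F - B) * (|F| + |B|)**q, defined as 0 when F = B = 0."""
    scale = abs(future) + abs(base)
    if scale == 0:
        return 0.0
    return (future - base) * scale ** q


# Hypothetical values that change sign between the two observations.
flip = loss_any_sign(-50.0, 50.0)
flip_signed = signed_loss_any_sign(-50.0, 50.0)
both_zero = loss_any_sign(0.0, 0.0)
```

Scaling by |F| + |B| rather than by B alone keeps the loss defined when the base value is zero or negative, and the special case avoids raising zero to a negative power.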
Summary
• Defined panel data.
• Defined outliers.
• Created several types of loss functions to detect outliers in panel data.
• Loss functions are empirical (except for one special case in nominal time).
• Showed several applications, including GIS.
URL for Presentation
http://chuckcoleman.home.dhs.org/fscpela.ppt