Loss Functions for Detecting Outliers in Panel Data

Charles D. Coleman, Thomas Bryan, Jason E. Devine

U.S. Census Bureau

Prepared for the Spring 2000 meetings of the Federal-State Cooperative Program for Population Estimates, Los Angeles, CA, March 2000

Panel Data

A.k.a. “longitudinal data.”

• i indexes cross-sectional units, which retain their identities over time. Examples: geographic areas, persons, households, companies, autos.

• t indexes time.
– Chronological or nominal.
– Chronological time measures time elapsed between two dates.
– Nominal time indexes different sets of estimates; it can also index true values.

Notation

• $B_i$ is base value for unit i.

• $F_i$ is “future” value for unit i.

• $F_{it}$ is future value for unit i at time t.

• $B_i, F_i, F_{it} > 0$.

• $\varepsilon_i = |F_i - B_i|$ is absolute difference for unit i.

• Subscripts will be dropped when not needed.

What is an Outlier?

“[An outlier is] an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism.”

D.M. Hawkins, Identification of Outliers, 1980, p. 1.

Meaning of an Outlier

• Either
– Indication of a problem with the data generation process.

• Or
– A true, but unusual, statement about reality.

Loss Functions

• Motivations: The $\varepsilon_i$ come from unknown distributions. Want to compare multiple size classes on same basis.

• $L(F_i; B_i) \equiv \ell(\varepsilon_i, B_i)$ is loss function for observation i.

• Loss functions measure “badness.”

• Loss functions produce rankings of observations to be examined.

• Loss functions are empirically based, except for one special case in nominal time.

Assumption 1

Loss is symmetric in error:

$L(B + \varepsilon; B) = L(B - \varepsilon; B)$

Assumption 2

Loss increases in difference:

$\partial \ell / \partial \varepsilon > 0$

Assumption 3

Loss decreases in base value:

$\partial \ell / \partial B < 0$

Property 1

Loss associated with given absolute percentage difference ($|\varepsilon / B|$) increases in B.

Simplest Loss Function

$L(F; B) = |F - B| B^q$ (1a)

$\ell(\varepsilon, B) = \varepsilon B^q$ (1b)

$0 > q > -1$.
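A minimal Python sketch of the simplest loss function, the absolute difference |F – B| multiplied by B raised to the power q. The function name `loss`, the default q = –0.5, and the sample values are illustrative assumptions, not part of the presentation.

```python
import numpy as np

def loss(future, base, q=-0.5):
    """Simple loss |F - B| * B**q for positive base values, with -1 < q < 0."""
    future = np.asarray(future, dtype=float)
    base = np.asarray(base, dtype=float)
    if not -1.0 < q < 0.0:
        raise ValueError("q must satisfy -1 < q < 0")
    if np.any(base <= 0):
        raise ValueError("base values must be positive")
    return np.abs(future - base) * base ** q

# Same 10% difference at two sizes: the larger unit gets the larger loss (Property 1).
small = loss(110.0, 100.0)        # 10 * 100**-0.5
large = loss(11000.0, 10000.0)    # 1000 * 10000**-0.5

# Same absolute difference of 10: the larger unit gets the smaller loss (Assumption 3).
same_abs = loss(10010.0, 10000.0)  # 10 * 10000**-0.5
```

With q = –0.5 the loss sits halfway, in the exponent, between the absolute difference (q = 0) and the absolute percentage difference (q = –1).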

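Loss as Weighted Combination of Absolute Difference and Absolute Percentage Difference

$\tilde{L}(F; B) = |F - B|^r \left| \frac{F - B}{B} \right|^s = \varepsilon^{r+s} B^{-s}$

• This generates loss function with $q = -s/(r + s)$, since raising $\tilde{L}$ to the power $1/(r + s)$ preserves the ranking and gives $\varepsilon B^{-s/(r+s)}$.
• Infinite number of pairs (r, s) correspond to any given q.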

Outlier Criterion

• Outlier declared whenever $L(F; B) \equiv \ell(\varepsilon, B) > C$.

• C is “critical value.”

• C can be determined in advance, or as function of data (e.g., quantile or multiple of scale measure).
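The criterion can be sketched in Python as follows. The function name `flag_outliers`, the default quantile of 0.95, and the sample data are hypothetical; the critical value is either fixed in advance or taken as a quantile of the observed losses, the two options listed above.

```python
import numpy as np

def flag_outliers(future, base, q=-0.5, critical=None, quantile=0.95):
    """Flag observations whose loss |F - B| * B**q exceeds the critical value C.

    If `critical` is None, C is set to the given quantile of the observed losses.
    """
    future = np.asarray(future, dtype=float)
    base = np.asarray(base, dtype=float)
    losses = np.abs(future - base) * base ** q
    if critical is None:
        critical = float(np.quantile(losses, quantile))
    return losses > critical, losses, critical

# Hypothetical data for five units of very different sizes.
base = np.array([100.0, 400.0, 2500.0, 10000.0, 40000.0])
future = np.array([105.0, 480.0, 2550.0, 10100.0, 40200.0])

# C fixed in advance at 2.0: only the second unit is flagged.
flags, losses, c = flag_outliers(future, base, critical=2.0)
```

Sorting the units by `losses` in descending order gives the ranking of observations to be examined.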

Loss Function Variants

• Time-Invariant Loss Function

• Signed Loss Function

• Nominal Time

Time-Invariant Loss Function

• Idea: Compare multiple dates of data on same basis.

• Time need not be round number.

• $L(F_{it}; B_i, t) = |F_{it} - B_i| B_i^{tq}$

• Property 1 satisfied as long as $t < -1/q$.

• Thus, useful horizon is limited.
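A sketch of the time-invariant variant in Python. It assumes the exponent on the base value is the product t·q, which is the reading under which the bound t < –1/q follows: at a fixed percentage difference p, the loss is p·B^(1 + tq), which increases in B only while 1 + tq > 0. The function name and the sample values are hypothetical.

```python
import numpy as np

def time_invariant_loss(future, base, t, q=-0.1):
    """Loss |F_t - B| * B**(t*q), valid only while t < -1/q (Property 1)."""
    if not -1.0 < q < 0.0:
        raise ValueError("q must satisfy -1 < q < 0")
    horizon = -1.0 / q
    if t >= horizon:
        raise ValueError(f"t must be below the useful horizon {horizon:g}")
    future = np.asarray(future, dtype=float)
    base = np.asarray(base, dtype=float)
    return np.abs(future - base) * base ** (t * q)

# With q = -0.1 the useful horizon is 10 time units; t need not be a round number.
value = time_invariant_loss(10500.0, 10000.0, t=2.5)  # 500 * 10000**-0.25
```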

Signed Loss Function

• Idea: Account for direction and magnitude of loss.

$S(F; B) = (F - B) B^q$

• Can use asymmetric critical values and “q”s:
– Declare outliers whenever

$S_+(F; B) = (F - B) B^{q_+} > C_+$

$S_-(F; B) = (F - B) B^{q_-} < C_-$

with $C_+ \ne -C_-$, $q_+ \ne q_-$.
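The asymmetric rule might look like this in Python. All names, critical values, and data are hypothetical; the sketch simply evaluates the signed loss twice, once with the parameters for positive differences and once with those for negative differences.

```python
import numpy as np

def signed_outliers(future, base, q_pos=-0.5, q_neg=-0.5, c_pos=2.0, c_neg=-3.0):
    """Return (high, low) flags from the signed loss (F - B) * B**q.

    high: signed loss with exponent q_pos exceeds c_pos (c_pos > 0).
    low:  signed loss with exponent q_neg falls below c_neg (c_neg < 0).
    """
    future = np.asarray(future, dtype=float)
    base = np.asarray(base, dtype=float)
    diff = future - base
    high = diff * base ** q_pos > c_pos
    low = diff * base ** q_neg < c_neg
    return high, low

# Hypothetical data: decreases must be larger than increases to be flagged.
base = np.array([100.0, 100.0, 100.0, 10000.0])
future = np.array([130.0, 75.0, 60.0, 9900.0])
high, low = signed_outliers(future, base)
```

Here the signed losses are 3.0, –2.5, –4.0 and –1.0, so the first unit is flagged as high and the third as low.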

Nominal Time

• Compare 2 sets of estimates; one set can be actual values, $A_i$.

• Assumptions:
– Unbiased: $E B_i = E F_i = A_i$.
– Proportionate variance: $\mathrm{Var}(B_i) = \mathrm{Var}(F_i) = \sigma^2 A_i$.

• $q = -1/2$.

• Either set of estimates can be used for $B_i$, $F_i$.

– Exception: $A_i$ can only be substituted for $B_i$.
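A short derivation suggests why these assumptions lead to q = –1/2. It is a sketch only, and it adds one assumption not stated above: that the two sets of estimates are independent of each other.

```latex
\begin{align*}
\operatorname{Var}(F_i - B_i)
  &= \operatorname{Var}(F_i) + \operatorname{Var}(B_i)
   = 2\sigma^2 A_i
  && \text{(independence assumed)} \\
\frac{|F_i - B_i|}{\sqrt{\operatorname{Var}(F_i - B_i)}}
  &= \frac{|F_i - B_i|}{\sigma\sqrt{2 A_i}}
  \;\propto\; |F_i - B_i|\, A_i^{-1/2}
\end{align*}
```

Because $E B_i = A_i$, using $B_i$ in place of the unknown $A_i$ gives the loss $|F_i - B_i| B_i^{-1/2}$, up to a constant factor that does not affect rankings. When one set consists of the actual values, the variance of the difference is $\sigma^2 A_i$, which is still proportional to $A_i$, so the same exponent applies.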

How to Use: No Preexisting Outlier Criteria

• Start with q = –0.5.
– Adjust by increments of 0.1 to get “good” distribution of outliers.

• Alternative: Start with q = log(range)/25 – 1, where range is range of data (Bryan, 1999).
– Can adjust.

How to Use: Preexisting Discrete Outlier Criteria

• Start with schedule of critical pairs $(\varepsilon_j, B_j)$.

– These pairs (approximately) satisfy equation $\varepsilon B^q = C$ for some q and C. They are the cutoffs between outliers and nonoutliers.

• Run regression $\log \varepsilon_j = -q \log B_j + K$.

• Then, $C = e^K$.
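The regression step can be sketched with an ordinary least-squares line fit in log-log space. The function name `fit_q_and_c` and the schedule of critical pairs are hypothetical; the schedule is constructed so that the critical absolute difference times B^(–0.5) equals 3 exactly, which lets the fit be checked.

```python
import numpy as np

def fit_q_and_c(diff_crit, base_crit):
    """Recover q and C from critical pairs (difference, base).

    Fits log(difference) = -q * log(base) + K by least squares,
    then returns q and C = exp(K).
    """
    log_diff = np.log(np.asarray(diff_crit, dtype=float))
    log_base = np.log(np.asarray(base_crit, dtype=float))
    slope, intercept = np.polyfit(log_base, log_diff, 1)  # slope is -q
    return float(-slope), float(np.exp(intercept))

# Hypothetical schedule: cutoff differences of 30, 150, 300 and 3000
# at base sizes of 100, 2500, 10000 and 1000000.
base_crit = np.array([100.0, 2500.0, 10000.0, 1000000.0])
diff_crit = 3.0 * base_crit ** 0.5
q, c = fit_q_and_c(diff_crit, base_crit)  # q close to -0.5, C close to 3
```

With a real schedule the pairs satisfy the equation only approximately, so the fitted q and C are a starting point that can then be adjusted.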

Loss Functions and GIS

• Loss functions can be used with GIS to focus analyst’s attention on problem areas.

• Maps compare tax method county population estimates to unconstrained housing unit method estimates.

• q = –0.5 in loss function map.

Map 1: Absolute Differences between the Two Sets of Population Estimates. Legend (persons): 0 - 5000, 5000 - 25000, 25000 - 50000, Over 50000, No Data. Note: The tax method estimates are the base.

Map 2: Absolute Percent Differences between the Two Sets of Population Estimates. Legend (percent): 0 - 5, 5 - 10, 10 - 20, Above 20, No Data.

Map 3: Loss Function Values. Legend: 0 - 1000, 1000 - 2000, 2000 - 4000, Above 4000, No Data.

Outliers Classified by Another Variable

• $D_i$ is function of 2 successive observations.

• $R_i$ is “reference” variable, used to classify outliers.

• Start with schedule of critical pairs $(D_j, R_j)$.

• Run regression $\log D_j = a - b \log R_j$.

• Then, $L(D, R) = D R^b$ and $C = e^a$.

What to Do with Negative Data

• From Coleman and Bryan (2000):

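$L(F, B) = \begin{cases} |F - B| \, (|F| + |B|)^q, & B \ne 0 \text{ or } F \ne 0, \\ 0, & B = F = 0. \end{cases}$

$S(F, B) = \begin{cases} (F - B) \, (|F| + |B|)^q, & B \ne 0 \text{ or } F \ne 0, \\ 0, & B = F = 0. \end{cases}$

• $0 > q > -1$. Suggest $q \approx -0.5$.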
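A Python sketch of the loss functions for data that may be negative or zero, scaling by |F| + |B| and returning 0 when both values are 0. The function name `loss_any_sign`, the `signed` switch, and the sample values are illustrative assumptions.

```python
import numpy as np

def loss_any_sign(future, base, q=-0.5, signed=False):
    """Loss |F - B| * (|F| + |B|)**q, or the signed version (F - B) * (|F| + |B|)**q.

    Returns 0 wherever F and B are both 0, so no zero is raised to a negative power.
    """
    future = np.atleast_1d(np.asarray(future, dtype=float))
    base = np.atleast_1d(np.asarray(base, dtype=float))
    scale = np.abs(future) + np.abs(base)
    diff = future - base
    if not signed:
        diff = np.abs(diff)
    out = np.zeros_like(diff)
    nonzero = scale > 0
    out[nonzero] = diff[nonzero] * scale[nonzero] ** q
    return out

# Hypothetical values: both negative, both zero, and a sign change.
future = np.array([-30.0, 0.0, 50.0])
base = np.array([-20.0, 0.0, -50.0])
unsigned = loss_any_sign(future, base)
with_sign = loss_any_sign(future, base, signed=True)
```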

Summary

• Defined panel data.

• Defined outliers.

• Created several types of loss functions to detect outliers in panel data.

• Loss functions are empirical (except for one special case in nominal time).

• Showed several applications, including GIS.

URL for Presentation

http://chuckcoleman.home.dhs.org/fscpela.ppt
