Post on 01-Jan-2016
Simple Method for Outlier Detection in Fitting Experimental Data under Interval Error
Sergei Zhilin, sergei@asu.ru
Altai State University, Barnaul, Russia
Plan
• Fitting under interval error
• Simple method for outlier detection
• Geometric correction of satellite images
• Connections between the proposed approach and other theories
• Conclusions
Fitting under Interval Error
• Black box approach: $y = f(x, \beta) + \varepsilon$
– Input variables $x = (x_1, \dots, x_p)$, measured without error
– Output variable $y$, measured with error
– $f$: modeling function with known structure
– $\beta$: model parameters to be estimated
– $\varepsilon$: measurement error
Fitting under Interval Error
• The classical statistical approach often assumes that the measurement error is normally distributed
• In real-life applications the error is interval rather than normal
• "Interval" means "unknown but bounded": the error of the $j$-th measurement lies in $[-\varepsilon_j, \varepsilon_j]$, where $\varepsilon_j$ is the error bound, $j = 1, \dots, n$
• There are no other assumptions about the error
Fitting under Interval Error
• The structure of the modeling function $f(x, \beta)$ is assumed fixed
• Each row $(x_j, y_j, \varepsilon_j)$ of the measurements table constrains the possible values of the parameter $\beta$ to the set
$$S_j = \{\beta : y_j - \varepsilon_j \le f(x_j, \beta) \le y_j + \varepsilon_j\}, \qquad j = 1, \dots, n$$
• Values of the parameter consistent with all constraints form the uncertainty set
$$A = \bigcap_{j=1}^{n} S_j$$
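The definition of the uncertainty set translates directly into a membership test. The sketch below is only an illustration: it assumes a straight-line model, the names `line` and `in_uncertainty_set` are made up for this example, and the three measurement rows are invented.

```python
def line(x, beta):
    """Illustrative model f(x, beta) = beta1 + beta2 * x (an assumption)."""
    return beta[0] + beta[1] * x


def in_uncertainty_set(beta, rows, f=line):
    """True if beta satisfies every constraint S_j, i.e. beta lies in A.

    rows holds (x_j, y_j, eps_j) triples.
    """
    return all(abs(y - f(x, beta)) <= eps for x, y, eps in rows)


# Made-up measurements, for illustration only.
rows = [(1.0, 2.0, 0.2), (2.0, 3.1, 0.2), (3.0, 3.9, 0.2)]
inside = in_uncertainty_set((1.0, 1.0), rows)   # all residuals within 0.2
outside = in_uncertainty_set((0.0, 1.0), rows)  # every residual is about 1
```

A candidate parameter vector belongs to the uncertainty set only if it passes the test for every row; a single violated row is enough to exclude it.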
Fitting under Interval Error
• Fitting data with the model $y = \beta_1 + \beta_2 x$
[Figure: two panels. In the $(x, y)$ domain: the measurements and the set of feasible models. In the $(\beta_1, \beta_2)$ domain: the uncertainty set $A$.]
• An unbounded uncertainty set $A$ means there is not enough data to build the model
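Fitting under Interval Error
• Problems that may be stated with respect to the uncertainty set $A$
– Model parameter estimation
• Interval estimates of $\beta$: $[\underline{\beta}_1, \overline{\beta}_1], \dots, [\underline{\beta}_p, \overline{\beta}_p]$, where
$$\underline{\beta}_i = \min_{\beta \in A} \beta_i, \qquad \overline{\beta}_i = \max_{\beta \in A} \beta_i, \qquad i = 1, \dots, p$$
• Point estimates of $\beta$: $\hat{\beta}_1, \dots, \hat{\beta}_p$, where
$$\hat{\beta}_i = \tfrac{1}{2}\left(\underline{\beta}_i + \overline{\beta}_i\right), \qquad i = 1, \dots, p$$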
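For a two-parameter straight-line model these estimates can be computed by brute force, because the extremes of each parameter over a bounded polygon are attained at its vertices. The sketch below is an illustration under that assumption only: the function names and the three measurement rows are made up, and for a general model that is linear in its parameters each bound would instead be found by linear programming.

```python
from itertools import combinations


def feasible_vertices(rows, tol=1e-9):
    """Vertices of the uncertainty set A for the line y = b1 + b2*x.

    rows holds (x_j, y_j, eps_j) triples. Each row contributes two boundary
    lines b1 + b2*x_j = y_j -/+ eps_j in the (b1, b2) plane.
    """
    bounds = [(x, y + s * eps) for x, y, eps in rows for s in (-1.0, 1.0)]
    vertices = []
    for (x1, c1), (x2, c2) in combinations(bounds, 2):
        if x1 == x2:
            continue  # parallel boundary lines do not intersect
        b2 = (c2 - c1) / (x2 - x1)
        b1 = c1 - b2 * x1
        # Keep the intersection only if it satisfies all constraints.
        if all(abs(y - (b1 + b2 * x)) <= eps + tol for x, y, eps in rows):
            vertices.append((b1, b2))
    return vertices


def parameter_estimates(rows):
    """(lower, upper, midpoint) for each parameter, or None if A is empty.

    Assumes at least two distinct x values, so that A is bounded.
    """
    vertices = feasible_vertices(rows)
    if not vertices:
        return None
    estimates = []
    for i in range(2):
        lo = min(v[i] for v in vertices)
        hi = max(v[i] for v in vertices)
        estimates.append((lo, hi, 0.5 * (lo + hi)))
    return estimates


# Made-up measurements, for illustration only.
rows = [(1.0, 2.0, 0.2), (2.0, 3.1, 0.2), (3.0, 3.9, 0.2)]
estimates = parameter_estimates(rows)

# Same data with an inconsistent third measurement: A becomes empty.
rows_bad = [(1.0, 2.0, 0.2), (2.0, 3.1, 0.2), (3.0, 6.0, 0.2)]
estimates_bad = parameter_estimates(rows_bad)
```

Returning `None` for an empty set is a design choice of this sketch; it makes the inconsistent case explicit instead of reporting meaningless bounds.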
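Fitting under Interval Error
• Problems that may be stated with respect to the uncertainty set $A$
– Prediction of the output variable value for fixed values of input variables
• Interval estimate of $y$: $[\underline{y}(x), \overline{y}(x)]$, where
$$\underline{y}(x) = \min_{\beta \in A} \beta^{T} x, \qquad \overline{y}(x) = \max_{\beta \in A} \beta^{T} x$$
• Point estimate of $y$:
$$\hat{y}(x) = \tfrac{1}{2}\left(\underline{y}(x) + \overline{y}(x)\right)$$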
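When the uncertainty set is a bounded polytope whose vertices are known, the prediction bounds reduce to evaluating the linear function $\beta^{T} x$ at each vertex. The sketch below assumes exactly that; the vertex list is a made-up example and `predict_interval` is a hypothetical helper name.

```python
def predict_interval(vertices, x):
    """Interval and point estimates of y = beta^T x over a bounded polytope.

    vertices: parameter vectors at the corners of the uncertainty set.
    x: input vector (use a leading 1.0 for a model with an intercept).
    """
    values = [sum(b * xi for b, xi in zip(beta, x)) for beta in vertices]
    lo, hi = min(values), max(values)
    return lo, hi, 0.5 * (lo + hi)


# Made-up vertices of a bounded uncertainty set in the (beta1, beta2) plane.
vertices = [(0.7, 1.1), (0.65, 1.15), (1.45, 0.75), (1.25, 0.95), (1.3, 0.8)]
y_lo, y_hi, y_mid = predict_interval(vertices, (1.0, 4.0))
```

The width of the returned interval is a direct measure of how uncertain the prediction is at the chosen input.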
Fitting under Interval Error
• All the above problems make sense only if the uncertainty set is not empty
• Possible reasons for an empty uncertainty set
– Presence of outliers in the data set
– Wrong structure assumed for the modeling function
Simple method for outlier detection
• Core idea
– An outlier may be treated as a measurement with an underestimated error (i.e. the actual measurement error is greater than the error $\varepsilon_j$ declared for it)
– What are the lower bounds $\varepsilon_j'$ on the actual errors that make the uncertainty set non-empty?
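Simple method for outlier detection
[Figure: two panels, "In variables domain" $(x, y)$ and "In parameters domain" $(\beta_1, \beta_2)$, showing the declared error $\varepsilon_j$ and the stretched error $\varepsilon_j'$.]
• How much must we stretch the declared error interval in order to "correct" an outlier?
• Let $\varepsilon_j' = w_j \cdot \varepsilon_j$; then $w_j = ?$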
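For one fixed candidate model the question has a direct answer: the interval must be stretched just enough to reach the model value, and never shrunk. The sketch below illustrates this with invented numbers; `stretch_factor` is a hypothetical helper name, not something defined in the slides.

```python
def stretch_factor(y, y_model, eps):
    """Smallest w >= 1 such that |y - y_model| <= w * eps."""
    return max(1.0, abs(y - y_model) / eps)


# Invented values: a measurement far from the model and one close to it.
w_outlier = stretch_factor(5.01, 4.0, 0.2)  # residual about 1.01
w_ok = stretch_factor(4.1, 4.0, 0.2)        # residual about 0.1
```

The hard part, addressed by the optimization problem of the method, is that the model itself is unknown, so the weights and the parameters have to be chosen together.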
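Simple method for outlier detection
• Weights $w_j$ may be found from the following optimization problem
$$\min_{\beta, w} \sum_{j=1}^{n} w_j \qquad (1)$$
$$y_j - w_j \varepsilon_j \le f(x_j, \beta) \le y_j + w_j \varepsilon_j, \quad j = 1, \dots, n \qquad (2)$$
$$w_j \ge 1, \quad j = 1, \dots, n \qquad (3)$$
– (2): uncertainty set constraints with movable bounds
– (3): we can only enlarge error intervals…
• …or "freeze" some of the error intervals: apply (3) only for $j = 1, \dots, k$ and add
$$w_j = 1, \quad j = k + 1, \dots, n \qquad (4)$$
• Some of the measurements are obtained with equal errors:
$$w_{j_1} = w_{j_2} = \dots, \quad \dots, \quad w_{m_1} = w_{m_2} = \dots \qquad (5)$$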
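If the model is linear in its parameters, $f(x, \beta) = \beta^{T} x$ as in the prediction formulas earlier in the talk, problem (1)-(3) is a linear program in the unknowns $(\beta, w)$. One possible way to write it in the usual inequality form, given here as a sketch under that linearity assumption, is:

```latex
\begin{aligned}
\min_{\beta,\, w}\quad & \sum_{j=1}^{n} w_j \\
\text{s.t.}\quad & \beta^{T} x_j - \varepsilon_j w_j \le y_j, && j = 1, \dots, n, \\
& -\beta^{T} x_j - \varepsilon_j w_j \le -y_j, && j = 1, \dots, n, \\
& -w_j \le -1, && j = 1, \dots, n.
\end{aligned}
```

The first two rows are the two halves of constraint (2), and the last row is constraint (3). Constraints (4) and (5) are linear equalities in $w$, so adding them keeps the problem a linear program.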
Simple method for outlier detection
• Example: data with outliers which give an empty uncertainty set; model $y = \beta_1 + \beta_2 x$
• 1st attempt: solution of LPP (1)-(3)

#    Measurement method    x     y        $\varepsilon_j$    $w_j$
1    A                     1     2.13     0.20               1.000
2    A                     2     2.95     0.20               1.000
3    A                     3     5.01     0.20               4.686
4    A                     4     4.99     0.20               1.000
5    A                     5     5.97     0.20               1.000
6    B                     6     7.04     0.40               1.000
7    B                     7     8.02     0.40               1.000
8    C                     8     8.15     0.40               1.343
9    C                     9     10.01    0.40               1.000
10   D                     10    10.98    0.50               1.000

[Figure: plot of the data and the fitted model in the $(x, y)$ domain.]
– #3 ($w_3 = 4.686$): looks like an outlier caused by a blunder. Let's try to exclude it.
– #8 ($w_8 = 1.343$): not so explicit. We need to examine the precision of method C.
Simple method for outlier detection
• Example, 2nd attempt: measurement #3 excluded; solution of (1) subject to (2)-(3) and $w_8 = w_9$

#    Measurement method    x     y        $\varepsilon_j$    $w_j$
1    A                     1     2.13     0.20               1.000
2    A                     2     2.95     0.20               1.000
3    A                     3     5.01     0.20               (excluded)
4    A                     4     4.99     0.20               1.000
5    A                     5     5.97     0.20               1.000
6    B                     6     7.04     0.40               1.000
7    B                     7     8.02     0.40               1.000
8    C                     8     8.15     0.40               1.143
9    C                     9     10.01    0.40               1.143
10   D                     10    10.98    0.50               1.000

[Figure: plot of the data and the fitted model $y = \beta_1 + \beta_2 x$ in the $(x, y)$ domain.]
– Is the precision of method C overestimated by ~14%?
Summary
In order to correct an inconsistent data set we have to answer the following questions:
1. Is the outlier #3 really caused by a blunder?
2. Is the outlier #8 caused by a blunder, OR is the precision of method C overestimated?
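Both attempts can be reproduced without an LP solver for this two-parameter example. For a fixed line the smallest admissible weights are known in closed form, so the objective becomes a convex piecewise-linear function of $(\beta_1, \beta_2)$ whose minimum is attained where two of its "kink" lines cross. The sketch below enumerates those crossings by brute force. This is a substitute for the linear programming formulation, valid only for a straight-line model, and `fit_weights` with its `tied` argument is a hypothetical interface invented for this illustration.

```python
from itertools import combinations


def fit_weights(rows, tied=()):
    """Brute-force solution of (1)-(3) for the line y = b1 + b2*x.

    rows: (x_j, y_j, eps_j) triples; tied: groups of row indices that must
    share one weight, as in constraint (5). Returns (sum_w, (b1, b2), w).
    """
    n = len(rows)
    grouped = {j for g in tied for j in g}
    groups = [list(g) for g in tied] + [[j] for j in range(n) if j not in grouped]

    # Lines p*b1 + q*b2 = c where the piecewise-linear objective has kinks.
    kinks = []
    for x, y, eps in rows:
        kinks += [(1.0, x, y - eps), (1.0, x, y + eps)]  # |residual| = eps
    for g in tied:
        for j, k in combinations(g, 2):
            (xj, yj, ej), (xk, yk, ek) = rows[j], rows[k]
            for s in (-1.0, 1.0):  # residual_j/eps_j = s * residual_k/eps_k
                kinks.append((1 / ej - s / ek, xj / ej - s * xk / ek,
                              yj / ej - s * yk / ek))

    def weights(b1, b2):
        """Smallest admissible weights for a fixed line."""
        w = [1.0] * n
        for g in groups:
            ratios = [abs(rows[j][1] - b1 - b2 * rows[j][0]) / rows[j][2] for j in g]
            shared = max(1.0, max(ratios))
            for j in g:
                w[j] = shared
        return w

    best = None
    for (p1, q1, c1), (p2, q2, c2) in combinations(kinks, 2):
        det = p1 * q2 - p2 * q1
        if abs(det) < 1e-12:
            continue  # parallel or degenerate lines
        b1 = (c1 * q2 - c2 * q1) / det
        b2 = (p1 * c2 - p2 * c1) / det
        w = weights(b1, b2)
        if best is None or sum(w) < best[0]:
            best = (sum(w), (b1, b2), w)
    return best


# (x, y, eps) for measurements #1..#10 of the example.
rows = [(1, 2.13, 0.2), (2, 2.95, 0.2), (3, 5.01, 0.2), (4, 4.99, 0.2),
        (5, 5.97, 0.2), (6, 7.04, 0.4), (7, 8.02, 0.4), (8, 8.15, 0.4),
        (9, 10.01, 0.4), (10, 10.98, 0.5)]

_, _, w_first = fit_weights(rows)                      # 1st attempt
rows_no3 = rows[:2] + rows[3:]                         # drop measurement #3
_, _, w_second = fit_weights(rows_no3, tied=[(6, 7)])  # #8 and #9 share a weight
print([round(w, 3) for w in w_first])
print([round(w, 3) for w in w_second])
```

The printed weights can be compared with the values reported in the two attempts (4.686 and 1.343 in the first, 1.143 for both method C measurements in the second). The enumeration grows quadratically with the number of measurements, so for larger data sets or more parameters a linear programming solver is the more practical choice.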
Geometric correction of satellite images
[Figure: distorted image in source coordinates $(x, y)$ with ground control points marked "+", and the target coordinate system $(u, v)$.]
• Ground Control Points

#     Source coordinates      Target coordinates
      x       y               u           v
1     2935    3072            14486.30    5991.49
2     2045    2745            14349.30    5927.55
14    1795    2714            14309.60    5919.30

– Target coordinates are obtained using high-precision methods (GPS, large-scale maps)
– Source coordinates are pointed by the operator on the screen with an error ≥ 1 pixel
• Geometric transformation
$$x = a_{00} + a_{10} u + a_{01} v + a_{11} u v + a_{20} u^2 + a_{02} v^2$$
$$y = b_{00} + b_{10} u + b_{01} v + b_{11} u v + b_{20} u^2 + b_{02} v^2$$
• Outliers are detected "on the fly" and the operator is notified of the error
• After the outliers are corrected and the transformation is built, the target image is built
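Each output coordinate of the second-order transformation is linear in its six coefficients, which is what allows the interval fitting and outlier detection machinery to be applied to $x$ and $y$ separately. The sketch below only evaluates the transformation; the coefficient ordering (constant, $u$, $v$, $uv$, $u^2$, $v^2$), the function names, and the coefficient values are assumptions made for illustration.

```python
def monomials(u, v):
    """Regressors of the second-order polynomial: 1, u, v, uv, u^2, v^2."""
    return (1.0, u, v, u * v, u * u, v * v)


def transform(a, b, u, v):
    """Map target coordinates (u, v) to source coordinates (x, y).

    a and b hold six coefficients each, ordered like monomials(u, v).
    """
    m = monomials(u, v)
    x = sum(ai * mi for ai, mi in zip(a, m))
    y = sum(bi * mi for bi, mi in zip(b, m))
    return x, y


# Made-up coefficients: the identity mapping and one with quadratic terms.
a_id = (0.0, 1.0, 0.0, 0.0, 0.0, 0.0)
b_id = (0.0, 0.0, 1.0, 0.0, 0.0, 0.0)
a_quad = (1.0, 0.0, 0.0, 0.5, 2.0, 0.0)

point_id = transform(a_id, b_id, 3.0, 4.0)
point_quad = transform(a_quad, b_id, 3.0, 4.0)
```

In a fitting context, `monomials(u, v)` plays the role of the input vector of one ground control point, and the pointed pixel coordinate is the output measured with interval error.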
Geometric correction of satellite images
[Figure: resulting image with ground control points.]
[Figure: resulting image with positional uncertainty map. Color scale: positional uncertainty $(\overline{x} - \underline{x}) + (\overline{y} - \underline{y})$, pixels.]
Connections with other theories
• Proposed approach and inconsistent linear programming problems
– When outliers are present in the data, most of the problems stated with respect to the uncertainty set become inconsistent linear programming problems
– The simple outlier detection method may be regarded as one possible way to correct an inconsistent linear programming problem: it builds a minimal-cost approximation by a proper linear programming problem
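Connections with other theories
• Proposed approach and robust estimation
$$\min_{\beta, w} \sum_{j=1}^{n} w_j \qquad (1)$$
$$y_j - w_j \varepsilon_j \le f(x_j, \beta) \le y_j + w_j \varepsilon_j, \quad j = 1, \dots, n \qquad (2)$$
$$w_j \ge 1, \quad j = 1, \dots, n \qquad (3)$$
– (2): uncertainty set constraints with movable bounds
– (3): we can only enlarge error intervals…
• Replacing (3) with
$$w_j \ge 0, \quad j = 1, \dots, n \qquad (3')$$
allows the error intervals to be scaled freely (expanded and contracted)
• The solution $(\beta^*, w^*)$ of (1), (2), (3') gives
– $\beta^*$, an M-estimator of the parameters (known as $L_1$)
– Weight function: $W(x) = 1/|x|$
– Residuals: $w_j^* \cdot \varepsilon_j$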
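The link to the $L_1$ estimator can be made explicit with a short derivation, given here as a sketch. With only non-negativity required of the weights, each $w_j$ is pushed down until constraint (2) becomes active, so for a fixed $\beta$ the optimal weight is the scaled absolute residual, and the objective collapses to a weighted sum of absolute residuals:

```latex
w_j^{*}(\beta) = \frac{\left| y_j - f(x_j, \beta) \right|}{\varepsilon_j}
\quad\Longrightarrow\quad
\min_{\beta,\, w} \sum_{j=1}^{n} w_j
= \min_{\beta} \sum_{j=1}^{n} \frac{\left| y_j - f(x_j, \beta) \right|}{\varepsilon_j}
```

This is least absolute deviations fitting with each residual scaled by $1/\varepsilon_j$, and the absolute residual of the $j$-th measurement at the optimum equals $w_j^{*} \cdot \varepsilon_j$.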
Conclusions
• Outlier detection is a necessary tool in fitting experimental data
• The interval error model provides an effective means of solving the outlier detection problem
• The proposed approach is based on a simple idea and is simple to implement
• The proposed approach provides a flexible way to express and take into account a priori information