Theory of Differentiation in Statistics. Mohammed Nasser, Department of Statistics.
1
Theory of Differentiation in Statistics
Mohammed Nasser
Department of Statistics
2
Relation between Statistics and Differentiation
Statistical Concepts/Techniques | Use of Differentiation Theory
Study of shapes of univariate pdfs | An easy application of first-order and second-order derivatives
Calculation/stabilization of variance of a random variable | An application of Taylor’s theorem
Calculation of moments from MGF/CF | Differentiating MGF/CF
3
Relation between Statistics and Differentiation
Statistical Concepts/Techniques | Use of Differentiation Theory
Description of a density/a model | dy/dx = k, dy/dx = kx
Optimize some risk functional/regularized functional/empirical risk functional with/without constraints | Needs heavy tools of nonlinear optimization techniques that depend on multivariate differential calculus and functional differential calculus
Influence function to assess robustness of a statistical functional | An easy application of directional derivative in function space
4
Relation between Statistics and Differentiation
Statistical Concepts/Techniques | Use of Differentiation Theory
Classical delta theorem to find asymptotic distribution | An application of ordinary Taylor’s theorem
Von Mises calculus | Extensive application of functional differential calculus
Relation between probability measures and probability density functions | Radon-Nikodym theorem
5
Monotone Function f(x)
Monotone Increasing: Strictly Increasing, Non Decreasing
Monotone Decreasing: Strictly Decreasing, Non Increasing
6
Increasing/Decreasing test
$f:\mathbb{R}\to\mathbb{R},\ f(x)=x^3$
7
Example of Monotone Increasing Function
$f:\mathbb{R}\to\mathbb{R},\ f(x)=x^3$
[Figure: graph of f, with the origin 0 marked]
8
Maximum/Minimum
[Figure: graph of a function on the interval from a to b]
Is there any sufficient condition that guarantees existence of global max/global min/both?
9
Some Results to Mention
• If the function is continuous and its domain is compact, the function attains its extrema. This is a very general result: it holds for any compact space, not only for compact subsets of R^n.
• For a convex (concave) function, any local min (max) is a global min (max).
• Some functions have a global min (max) without satisfying any of the above conditions.
Firstly, proof of existence of the extremum; then, calculation of the extremum.
10
What Does $f'$ Say about $f$?
$f'(x_0) = 0$
Fermat’s Theorem: if f has a local maximum or minimum at c, and if $f'(c)$ exists, then $f'(c) = 0$; but the converse is not true.
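The failure of the converse can be checked numerically. The following is a minimal sketch using the cubic $f(x)=x^3$ from the monotone-function slides; the step size `1e-3` and the variable names are illustrative choices, not part of the original slides.

```python
def f(x):
    return x ** 3

def f_prime(x):
    # Derivative of x^3, computed by hand.
    return 3 * x ** 2

c = 0.0
# The first derivative vanishes at c ...
stationary = f_prime(c) == 0.0
# ... but f is smaller just to the left of c and larger just to the right,
# so c is neither a local maximum nor a local minimum.
neighbours = (c - 1e-3, c + 1e-3)
is_local_max = all(f(c) >= f(x) for x in neighbours)
is_local_min = all(f(c) <= f(x) for x in neighbours)

print(stationary, is_local_max, is_local_min)
```

So $f'(c)=0$ only marks a candidate point; a second condition (for example the sign of $f''$) is needed to classify it.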
11
Concavity
[Figure: a curve with a concave part and a convex part meeting at a point of inflection c]
• If $f''(x) > 0$ for all x in (a,b), then the graph of f is convex on (a,b).
• If $f''(x) < 0$ for all x in (a,b), then the graph of f is concave on (a,b).
• If $f''(c) = 0$ and $f''$ changes sign at c, then f has a point of inflection at c.
12
Maximum/Minimum
Let f(x) be a twice differentiable function on an interval I
• f has a (local) maximum at $c \in I$ if $f'(c) = 0$ and $f''(c) < 0$.
• f has a (local) minimum at $c \in I$ if $f'(c) = 0$ and $f''(c) > 0$.
• If $f'(x) < 0$ for all x in an interval, then f is maximum at the first end point of the interval if the left side is closed, and minimum at the last end point if the right side is closed.
• If $f'(x) > 0$ for all x in an interval, then f is minimum at the first end point of the interval if the left side is closed, and maximum at the last end point if the right side is closed.
13
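Normal Distribution
[Figure: normal density curve, concave in the middle and convex in both tails, with a point of inflection on each side]
The probability density function is given as,
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
• f is continuous on R
• f(x) >= 0
• f is differentiable on R
• $\lim_{x\to\pm\infty} f(x) = 0$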
14
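Normal Distribution
Take log on both sides and put the first derivative equal to zero.
Now,
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$
$$\log f(x) = \log\frac{1}{\sigma\sqrt{2\pi}} - \frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2$$
$$\frac{f'(x)}{f(x)} = -\frac{x-\mu}{\sigma^2} \;\Rightarrow\; f'(x) = -\frac{x-\mu}{\sigma^2}\, f(x)$$
$$f'(x) = 0 \;\Rightarrow\; x = \mu$$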
15
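Normal Distribution
$$f''(x) = -\frac{1}{\sigma^2}\left[f(x) + (x-\mu)\, f'(x)\right]$$
Since at $x = \mu$,
$$f''(\mu) = -\frac{f(\mu)}{\sigma^2} < 0$$
Therefore f is maximum at $x = \mu$.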
16
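Normal Distribution
Put 2nd derivative equal to zero
$$f''(x) = 0 \;\Rightarrow\; -\frac{1}{\sigma^2}\left[f(x) + (x-\mu)\, f'(x)\right] = 0$$
$$\Rightarrow\; f(x) - \frac{(x-\mu)^2}{\sigma^2}\, f(x) = 0 \;\Rightarrow\; 1 - \frac{(x-\mu)^2}{\sigma^2} = 0$$
$$\Rightarrow\; (x-\mu)^2 - \sigma^2 = 0 \;\Rightarrow\; \{(x-\mu)-\sigma\}\{(x-\mu)+\sigma\} = 0$$
Therefore f has points of inflection at $x = \mu \pm \sigma$.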
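The two conclusions for the normal density (maximum at $\mu$, inflection at $\mu\pm\sigma$) can be checked numerically. This is a minimal sketch: the parameter values `mu = 2.0`, `sigma = 1.5`, the grid, and the function names are illustrative assumptions, and the closed form for $f''$ is the one obtained by substituting $f' = -(x-\mu)f/\sigma^2$ into $f''$.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_pdf_second_derivative(x, mu, sigma):
    # f''(x) = f(x) * ((x - mu)^2 - sigma^2) / sigma^4
    return normal_pdf(x, mu, sigma) * ((x - mu) ** 2 - sigma ** 2) / sigma ** 4

mu, sigma = 2.0, 1.5  # illustrative values

# Locate the maximiser on a grid centred at mu.
grid = [mu + k * 0.01 for k in range(-500, 501)]
x_max = max(grid, key=lambda x: normal_pdf(x, mu, sigma))

# The second derivative vanishes and changes sign at mu + sigma.
at_inflection = normal_pdf_second_derivative(mu + sigma, mu, sigma)
just_inside = normal_pdf_second_derivative(mu + sigma - 0.1, mu, sigma)
just_outside = normal_pdf_second_derivative(mu + sigma + 0.1, mu, sigma)

print(x_max, at_inflection, just_inside, just_outside)
```

The sign change from negative (concave, inside) to positive (convex, outside) is what makes $\mu+\sigma$ a genuine point of inflection; the same check works at $\mu-\sigma$.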
17
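Logistic Distribution
[Figure: logistic distribution function, convex on the left and concave on the right]
The distribution function is defined as,
$$F(x) = \frac{e^x}{1+e^x}; \quad -\infty < x < \infty$$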
18
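Logistic Distribution
Take first derivative with respect to x
$$F'(x) = \frac{e^x}{(1+e^x)^2} > 0; \quad -\infty < x < \infty$$
Therefore F is strictly increasing
Take 2nd derivative and put equal to zero
$$F''(x) = \frac{e^x(1-e^x)}{(1+e^x)^3} = 0$$
$$\Rightarrow\; e^x(1-e^x) = 0 \;\Rightarrow\; 1-e^x = 0 \;\Rightarrow\; e^x = 1 \;\Rightarrow\; x = \log(1) = 0$$
Therefore F has a point of inflection at x=0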
19
Logistic Distribution
Now we comment that F has no maximum and minimum.
Since,
$$F''(x) > 0; \quad x \in (-\infty, 0)$$
$$F''(x) < 0; \quad x \in (0, \infty)$$
Therefore F is convex on $(-\infty, 0)$ and concave on $(0, \infty)$.
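The shape analysis of the logistic distribution function can be verified with a few evaluations. A minimal sketch, assuming the standard form $F(x)=e^x/(1+e^x)$ and the hand-derived expressions for $F'$ and $F''$; the function names and the test grid are illustrative.

```python
import math

def logistic_cdf(x):
    return math.exp(x) / (1.0 + math.exp(x))

def logistic_cdf_d1(x):
    # F'(x) = e^x / (1 + e^x)^2
    e = math.exp(x)
    return e / (1.0 + e) ** 2

def logistic_cdf_d2(x):
    # F''(x) = e^x (1 - e^x) / (1 + e^x)^3
    e = math.exp(x)
    return e * (1.0 - e) / (1.0 + e) ** 3

grid = [k * 0.1 for k in range(-50, 51)]
strictly_increasing = all(logistic_cdf_d1(x) > 0 for x in grid)
convex_left = all(logistic_cdf_d2(x) > 0 for x in grid if x < 0)
concave_right = all(logistic_cdf_d2(x) < 0 for x in grid if x > 0)

print(strictly_increasing, convex_left, concave_right, logistic_cdf_d2(0.0))
```

Because $F' > 0$ everywhere, F has no interior stationary point, which is why it has neither a maximum nor a minimum on the whole real line.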
20
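Variance of a Function of Poisson Variate Using Taylor’s Theorem
We know that,
$$\text{Mean } E(Y) = \lambda, \quad \text{Variance } V(Y) = \lambda$$
We are interested to find the variance of $g(Y) = \sqrt{Y}$:
$$V(\sqrt{Y}) = ?$$
Given that,
$$g(Y) = Y^{1/2}, \quad g'(Y) = \tfrac{1}{2}\, Y^{-1/2}$$
$$g'(\lambda) = \tfrac{1}{2}\, \lambda^{-1/2}, \quad [g'(\lambda)]^2 = \tfrac{1}{4}\, \lambda^{-1}$$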
21
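Variance of a Function of Poisson Variate Using Taylor’s Theorem
The Taylor’s series is defined as,
$$g(Y) = g(\lambda) + g'(\lambda)(Y-\lambda) + o(|Y-\lambda|)$$
$$g(Y) - g(\lambda) \approx g'(\lambda)(Y-\lambda)$$
$$V(g(Y)) \approx [g'(\lambda)]^2\, V(Y) = \frac{1}{4\lambda}\, V(Y) = \frac{1}{4\lambda}\cdot\lambda = \frac{1}{4}$$
Therefore the variance of $\sqrt{Y}$ is approximately $\frac{1}{4}$.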
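The Taylor approximation $V(\sqrt{Y}) \approx 1/4$ can be compared against the variance computed directly from the Poisson probabilities. This is a minimal sketch: the truncation point of the sum and the values of $\lambda$ are illustrative assumptions, and the function names are not from the slides.

```python
import math

def poisson_pmf(k, lam):
    # Computed on the log scale to avoid overflow in lam**k and k!.
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def var_sqrt_poisson(lam):
    # Truncate the infinite sum far in the right tail (assumed cut-off).
    k_max = int(lam + 20 * math.sqrt(lam) + 50)
    mean_sqrt = sum(math.sqrt(k) * poisson_pmf(k, lam) for k in range(k_max + 1))
    mean_y = sum(k * poisson_pmf(k, lam) for k in range(k_max + 1))
    # V(sqrt(Y)) = E(Y) - (E sqrt(Y))^2
    return mean_y - mean_sqrt ** 2

for lam in (5.0, 50.0, 200.0):
    print(lam, var_sqrt_poisson(lam))
```

The computed variance approaches 1/4 as $\lambda$ grows and no longer depends on $\lambda$ to first order, which is the sense in which the square root stabilizes the variance.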
22
Risk Functional
Risk functional,
$$R_{L,P}(g) = \int_{X\times Y} L(x, y, g(x))\, dP(x,y) = \int_X \int_Y L(x, y, g(x))\, dP(y \mid x)\, dP_X$$
Population regression functional/classifier, g*:
$$R_{L,P}(g^*) = \inf_{g: X \to Y} R_{L,P}(g)$$
P is chosen by nature, L is chosen by the scientist
Both $R_{L,P}(g^*)$ and g* are unknown
From sample D, we will select $g_D$ by a learning method(???)
23
Problems of empirical risk minimization
Empirical risk minimization
Empirical risk functional,
$$R_{L,P_n}(g) = \int_{X\times Y} L(x, y, g(x))\, dP_n(x,y) = \int_X \int_Y L(x, y, g(x))\, dP_n(y \mid x)\, dP_{n,X} = \frac{1}{n}\sum_{i=1}^{n} L(x_i, y_i, g(x_i))$$
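The last form of the empirical risk functional is just a sample average of losses, which a short sketch makes concrete. The squared loss, the toy data, and the candidate function `g` are illustrative assumptions; the slides leave L and g general.

```python
def empirical_risk(loss, g, xs, ys):
    # (1/n) * sum_i L(x_i, y_i, g(x_i))
    n = len(xs)
    return sum(loss(x, y, g(x)) for x, y in zip(xs, ys)) / n

def squared_loss(x, y, prediction):
    # One possible choice of L; it ignores x and compares y with g(x).
    return (y - prediction) ** 2

# Toy sample D = {(x_i, y_i)} and a candidate function g(x) = x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]
g = lambda x: x

risk = empirical_risk(squared_loss, g, xs, ys)
print(risk)
```

Minimizing this quantity over an unrestricted class of functions can drive it to zero by interpolating the sample, which is one reason the function class or the criterion is restricted.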
24
What Can We Do?
• We can restrict the set of functions over which we minimize empirical risk functionals (structural risk minimization).
• We can modify the criterion to be minimized, e.g. by adding a penalty for `complicated' functions (regularization).
• We can combine the two.
25
Regularized Error Function
In linear regression, we minimize the error function:
$$\frac{1}{2}\sum_{i=1}^{l} (f(x_i) - y_i)^2 + \frac{\lambda}{2}\|w\|^2$$
Replace the quadratic error function by ε-insensitive error function:
$$C\sum_{i=1}^{l} E_\varepsilon(f(x_i) - y_i) + \frac{1}{2}\|w\|^2$$
An example of ε-insensitive error function:
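The formula for the example is not preserved in the transcript. A commonly used ε-insensitive error function is $E_\varepsilon(z) = \max(0, |z| - \varepsilon)$, and the sketch below assumes that form together with a linear model $f(x) = \langle w, x\rangle + b$; the function names and the toy numbers are illustrative.

```python
def eps_insensitive(z, eps):
    # Zero inside the tube |z| <= eps, linear outside it (assumed form).
    return max(0.0, abs(z) - eps)

def svr_objective(w, b, xs, ys, C, eps):
    # C * sum_i E_eps(f(x_i) - y_i) + 0.5 * ||w||^2, with f(x) = <w, x> + b
    def f(x):
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    data_term = sum(eps_insensitive(f(x) - y, eps) for x, y in zip(xs, ys))
    complexity_term = 0.5 * sum(wj * wj for wj in w)
    return C * data_term + complexity_term

# Toy data: the first point lies inside the tube, the second 0.5 away from the fit.
xs = [[0.0], [1.0]]
ys = [0.0, 1.5]
objective = svr_objective([1.0], 0.0, xs, ys, C=2.0, eps=0.1)
print(eps_insensitive(0.05, 0.1), objective)
```

Residuals smaller than ε cost nothing, so only points on or outside the tube influence the fit; C sets how heavily those errors are weighed against $\|w\|^2$.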
26
Linear SVR: Derivation
Meaning of equation 3
27
Linear SVR: Derivation
Complexity vs. sum of errors
Case I: “tube” complexity
Case II: “tube” complexity
[Figure: data points around the regression tube for each case]
28
Linear SVR: Derivation
Case I: “tube” complexity
Case II: “tube” complexity
• The role of C
[Figure: two panels of data points around the regression tube, labelled “C is small” and “C is big”]
29
Linear SVR: derivation
Subject to:
30
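Lagrangian
Minimize:
$$L = C\sum_{n=1}^{l}(\xi_n + \xi_n^*) + \frac{1}{2}\|w\|^2 - \sum_{n=1}^{l}(\mu_n\xi_n + \mu_n^*\xi_n^*) - \sum_{n=1}^{l}\alpha_n(\varepsilon + \xi_n - y_n + \langle w, x_n\rangle + b) - \sum_{n=1}^{l}\alpha_n^*(\varepsilon + \xi_n^* + y_n - \langle w, x_n\rangle - b)$$
Dual variables: $\alpha_n, \alpha_n^*, \mu_n, \mu_n^* \ge 0$
$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{n=1}^{l}(\alpha_n - \alpha_n^*)\, x_n$$
$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{n=1}^{l}(\alpha_n - \alpha_n^*) = 0$$
$$\frac{\partial L}{\partial \xi_n} = 0 \;\Rightarrow\; \alpha_n + \mu_n = C$$
$$\frac{\partial L}{\partial \xi_n^*} = 0 \;\Rightarrow\; \alpha_n^* + \mu_n^* = C$$
With $f(x) = \langle w, x\rangle + b$, the inner product becomes
$$\langle w, x\rangle = \left\langle \sum_{n=1}^{l}(\alpha_n - \alpha_n^*)\, x_n,\; x\right\rangle = \sum_{n=1}^{l}(\alpha_n - \alpha_n^*)\langle x_n, x\rangle$$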
31
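Dual Form of Lagrangian
Maximize:
$$W(\alpha, \alpha^*) = -\frac{1}{2}\sum_{n=1}^{l}\sum_{m=1}^{l}(\alpha_n - \alpha_n^*)(\alpha_m - \alpha_m^*)\langle x_n, x_m\rangle - \varepsilon\sum_{n=1}^{l}(\alpha_n + \alpha_n^*) + \sum_{n=1}^{l} y_n(\alpha_n - \alpha_n^*)$$
$$0 \le \alpha_n \le C, \quad 0 \le \alpha_n^* \le C, \quad \sum_{n=1}^{l}(\alpha_n - \alpha_n^*) = 0$$
Prediction can be made using:
$$f(x) = \sum_{n=1}^{l}(\alpha_n - \alpha_n^*)\langle x_n, x\rangle + b$$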
32
How to determine b?
Karush-Kuhn-Tucker (KKT) conditions imply that, at the optimal solutions:
$$\alpha_n(\varepsilon + \xi_n - y_n + \langle w, x_n\rangle + b) = 0$$
$$\alpha_n^*(\varepsilon + \xi_n^* + y_n - \langle w, x_n\rangle - b) = 0$$
$$(C - \alpha_n)\,\xi_n = 0$$
$$(C - \alpha_n^*)\,\xi_n^* = 0$$
Support vectors are points that lie on the boundary or outside the tube.
These equations imply many important things.
33
Important Interpretations
$\alpha_i\alpha_i^* = 0$, i.e. $\xi_i\xi_i^* = 0$ (why??)
* *
*
, 0
,
,
i n n n
n n n
n n
C y w x b
w x b y
w x b y
*
*
0 0,
and 0
0
i i
i
i
34
Support Vector: The Sparsity of SV Expansion
*
0 ( )
0 ( )
i i i
i i i
y f x
f x y
and
*
( ) 0
( ) 0
i i i
i i i
y f x
f x y
35
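Dual Form of Lagrangian (Nonlinear case)
Maximize:
$$W(\alpha, \alpha^*) = -\frac{1}{2}\sum_{n=1}^{l}\sum_{m=1}^{l}(\alpha_n - \alpha_n^*)(\alpha_m - \alpha_m^*)\, k(x_n, x_m) - \varepsilon\sum_{n=1}^{l}(\alpha_n + \alpha_n^*) + \sum_{n=1}^{l} y_n(\alpha_n - \alpha_n^*)$$
$$0 \le \alpha_n \le C, \quad 0 \le \alpha_n^* \le C, \quad \sum_{i=1}^{l}(\alpha_i - \alpha_i^*) = 0$$
Prediction can be made using:
$$f(x) = \sum_{n=1}^{l}(\alpha_n - \alpha_n^*)\, k(x_n, x) + b$$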
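Once the dual variables are available, the prediction rule is a weighted sum of kernel evaluations. The sketch below only evaluates that sum; it does not solve the dual problem. The kernels, the dual values, and the offset `b` are hypothetical numbers chosen for illustration (they satisfy the box and sum-to-zero constraints but are not the output of any optimizer).

```python
import math

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=1.0):
    # Gaussian kernel; gamma is an assumed bandwidth parameter.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svr_predict(x, train_x, alpha, alpha_star, b, kernel):
    # f(x) = sum_n (alpha_n - alpha_n^*) k(x_n, x) + b
    return sum((a - a_s) * kernel(xn, x)
               for xn, a, a_s in zip(train_x, alpha, alpha_star)) + b

# Hypothetical dual solution: sum_n (alpha_n - alpha_n^*) = 0.
train_x = [[1.0], [2.0]]
alpha = [0.5, 0.0]
alpha_star = [0.0, 0.5]
b = 0.1

pred_linear = svr_predict([2.0], train_x, alpha, alpha_star, b, linear_kernel)
pred_rbf = svr_predict([2.0], train_x, alpha, alpha_star, b, rbf_kernel)
print(pred_linear, pred_rbf)
```

With the linear kernel this reduces to the linear-case formula; training points with $\alpha_n = \alpha_n^* = 0$ drop out of the sum, which is the sparsity of the support vector expansion.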
36
Non-linear SVR: derivation
Subject to:
37
Non-linear SVR: derivation
Subject to:
Saddle point of L has to be found:
min with respect to
max with respect to
38
Non-linear SVR: derivation
...
39
What is Differentiation?
U, a Banach space
V, another Banach space
f, a nonlinear function from U to V
Differentiation is nothing but local linearization.
In differentiation we approximate a non-linear function locally by a (continuous) linear function.
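Local linearization can be seen numerically: the error of the linear approximation shrinks faster than the size of the perturbation. A minimal sketch, assuming the illustrative function $f(x_1,x_2) = x_1^2 + 3x_1x_2$ on $\mathbb{R}^2$ with the Euclidean norm; the function, the base point, and the direction are not from the slides.

```python
import math

def f(x):
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def df(x, h):
    # Linear map h -> <grad f(x), h>, with grad f(x) = (2*x1 + 3*x2, 3*x1).
    return (2.0 * x[0] + 3.0 * x[1]) * h[0] + 3.0 * x[0] * h[1]

def remainder_ratio(x, h):
    # |f(x + h) - f(x) - df(x)(h)| / ||h||
    shifted = [xi + hi for xi, hi in zip(x, h)]
    rem = f(shifted) - f(x) - df(x, h)
    return abs(rem) / math.sqrt(sum(hi * hi for hi in h))

x = [1.0, 2.0]
ratios = [remainder_ratio(x, [t, t]) for t in (1e-1, 1e-2, 1e-3)]
print(ratios)
```

Each tenfold reduction of h reduces the ratio about tenfold, so the ratio tends to zero; this limit is what the derivative definitions on the following slides formalize.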
40
Fréchet Derivative
Definition 1
$$\lim_{h\to 0}\frac{f(x+h) - f(x) - f'(x)h}{h} = 0 \iff \lim_{h\to 0}\frac{|f(x+h) - f(x) - f'(x)h|}{|h|} = 0$$
It can be easily generalized to Banach space valued functions, $f: (B_1, \|\cdot\|_1) \to (B_2, \|\cdot\|_2)$:
$$\lim_{\|h\|_1\to 0}\frac{\|f(x+h) - f(x) - f'(x)(h)\|_2}{\|h\|_1} = 0$$
where $f'(x): B_1 \to B_2$ is a linear map. It can be shown that not every linear map between infinite-dimensional spaces is continuous.
41
Fréchet Derivative
We have just mentioned that Fréchet recognized that Definition 1 could be easily generalized to normed spaces in the following way:
$$\lim_{\|h\|_1\to 0}\frac{\|f(x+h) - f(x) - df(x)(h)\|_2}{\|h\|_1} = 0 \qquad (2)$$
where $df(x) \in L(B_1, B_2)$, the set of all continuous linear functions between $B_1$ and $B_2$. Write the remainder of f at x+h as Rem(x+h) = f(x+h) - f(x) - df(x)(h).
42
Then (2) becomes
$$\lim_{\|h\|_1\to 0}\frac{\|\mathrm{Rem}(x+h)\|_2}{\|h\|_1} = 0 \qquad (3)$$
S Derivative
Soon the definition was generalized (S-differentiation) to general topological vector spaces in such a way that i) a particular case of the definition becomes equivalent to the previous definition when the domain of f is a normed space, and ii) the Gâteaux derivative remains the weakest derivative in all types of S-differentiation.
43
S Derivatives
Definition 2
Let S be a collection of subsets of $B_1$, and let $t \in \mathbb{R}$. Then f is S-differentiable at x with derivative $df(x) \in L(B_1, B_2)$ if, for every $A \in S$,
$$\frac{\mathrm{Rem}(x+th)}{t} \to 0 \text{ as } t \to 0, \text{ uniformly in } h \in A$$
Definition 3
When S = all singletons of $B_1$, f is called Gâteaux differentiable with Gâteaux derivative. When S = all compact subsets of $B_1$, f is called Hadamard or compactly differentiable with Hadamard or compact derivative. When S = all bounded subsets of $B_1$, f is called Fréchet or boundedly differentiable with Fréchet or bounded derivative.
44
Equivalent Definitions of Fréchet derivative
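(a) For each bounded set $E \subset B_1$,
$$\frac{R(x+th)}{t} \to 0 \text{ as } t \to 0 \text{ in } \mathbb{R}, \text{ uniformly in } h \in E$$
(b) For each bounded sequence $\{h_n\} \subset B_1$ and each sequence $\{t_n\} \subset \mathbb{R}\setminus\{0\}$ with $t_n \to 0$,
$$\frac{R(x+t_n h_n)}{t_n} \to 0 \text{ as } n \to \infty$$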
45
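(c) $$\frac{R(x+h)}{\|h\|_1} \to 0 \text{ as } \|h\|_1 \to 0$$
(d) $$\frac{R(x+th)}{t} \to 0 \text{ as } t \to 0, \text{ uniformly in } h \in \{h \in B_1 : \|h\|_1 \le 1\}$$
(e) $$\frac{R(x+th)}{t} \to 0 \text{ as } t \to 0, \text{ uniformly in } h \in \{h \in B_1 : \|h\|_1 = 1\}$$
Statisticians generally use this form or some slight modification of it.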
46
Relations among Usual Forms of Definitions
Set of Gâteaux differentiable functions at x, $D_G(x)$ ⊇ set of Hadamard differentiable functions at x, $D_H(x)$ ⊇ set of Fréchet differentiable functions at x, $D_F(x)$.
In applications, to find a Fréchet or Hadamard derivative we should generally first try to determine the form of the derivative by deducing the Gâteaux derivative acting on h, df(h), for a collection of directions h which span $B_1$. This reduces to computing the ordinary derivative (with respect to $t \in \mathbb{R}$) of the mapping $t \mapsto f(x+th)$ at $t = 0$, which is closely related to the influence function, one of the central concepts in robust statistics. It can be easily shown that,
(i) When $B_1 = \mathbb{R}$ with the usual norm, all three coincide.
(ii) When $B_1$ is a finite dimensional Banach space, the Fréchet and Hadamard derivatives are equal. The two coincide with the familiar total derivative.
47
Properties of Fréchet derivative
Hadamard diff. implies continuity but Gâteaux does not.
Hadamard diff. satisfies chain rule but Gâteaux does not.
Meaningful Mean Value Theorem, Inverse Function Theorem, Taylor’s Theorem and Implicit Function Theorem have been proved for Fréchet derivative
48
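$$IF(T, x; F) = \lim_{\varepsilon\to 0}\frac{T[(1-\varepsilon)F + \varepsilon\delta_x] - T(F)}{\varepsilon} = \lim_{\varepsilon\to 0}\frac{T[F + \varepsilon(\delta_x - F)] - T(F)}{\varepsilon}$$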
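The influence function is a directional derivative of T at F in the direction of a point mass, and for simple functionals the limit can be approximated by a finite difference. A minimal sketch for the mean functional $T(F) = \int x\,dF$ on a discrete distribution; the distribution, the contamination point, and the step `eps` are illustrative assumptions. For the mean, the exact answer is $x - T(F)$.

```python
def mean_functional(F):
    # T(F) = sum of x * F({x}) for a discrete distribution stored as {x: probability}.
    return sum(x * p for x, p in F.items())

def contaminate(F, x, eps):
    # (1 - eps) * F + eps * delta_x
    G = {pt: (1.0 - eps) * p for pt, p in F.items()}
    G[x] = G.get(x, 0.0) + eps
    return G

def influence_function(T, x, F, eps=1e-6):
    # Finite-difference approximation of the limit as eps -> 0.
    return (T(contaminate(F, x, eps)) - T(F)) / eps

F = {0.0: 0.2, 1.0: 0.5, 4.0: 0.3}   # illustrative distribution, mean 1.7
x = 10.0
approx = influence_function(mean_functional, x, F)
exact = x - mean_functional(F)
print(approx, exact)
```

The result grows without bound in x, which expresses the non-robustness of the mean: a single far-away observation can move it arbitrarily.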
49
$$\bar{X}_n = T(F_n) = \int x\, dF_n$$
$$T(F) = \int x\, dF$$
Lebesgue: $T(F) = \int x\, dF = \int x f(x)\, dx$
Counting: $\int x\, dF = \sum_{i=1}^{\infty} x_i\, f(x_i)$
50
Mathematical Foundations of Robust Statistics
$T(G) \approx T(F) + T'_F(G - F)$
$d_1(F, G) < \delta$
$d_2(T(F), T(G)) < \varepsilon$
$\sqrt{n}\,(T(G) - T(F)) \approx T'_F(\sqrt{n}\,(G - F))$
51
Mathematical Foundations of Robust Statistics
52
Mathematical Foundations of Robust Statistics
53
Mathematical Foundations of Robust Statistics
54
Given a Measurable Space (Ω, F),
There exist many measures on F.
If Ω is the real line, the standard measure is “length”. That is, the measure of each interval is its length. This is known as “Lebesgue measure”.
The σ-algebra must contain intervals. The smallest σ-algebra that contains all open sets (and hence intervals) is called the “Borel” σ-algebra and is denoted B.
A course in real analysis will deal a lot with the measurable space $(\mathbb{R}, B)$.
55
Given a Measurable Space (Ω, F),
A measurable space combined with a measure is called a measure space. If we denote the measure by μ, we would write the triple: (Ω, F, μ).
Given a measure space (Ω, F, μ), if we decide instead to use a different measure, say ν, then we call this a “change of measure”. (We should just call this using another measure!)
Let μ and ν be two measures on (Ω, F), then
ν is “absolutely continuous” with respect to μ if $\mu(A) = 0 \Rightarrow \nu(A) = 0$ (Notation: $\nu \ll \mu$)
ν and μ are “equivalent” if $\mu(A) = 0 \Leftrightarrow \nu(A) = 0$
56
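The Radon-Nikodym Theorem
If $\nu \ll \mu$ then ν is actually the integral of a function wrt μ.
$$d\nu = \frac{d\nu}{d\mu}\, d\mu = g\, d\mu$$
$$\nu(A) = \int_A d\nu = \int_A \frac{d\nu}{d\mu}\, d\mu = \int_A g\, d\mu$$
g is known as the Radon-Nikodym derivative and denoted:
$$g = \frac{d\nu}{d\mu}$$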
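On a finite sample space the Radon-Nikodym derivative is simply a ratio of point masses, which makes the statement easy to check. A minimal sketch: the two measures, the sample space, and the convention of setting g to 0 where μ vanishes are illustrative assumptions.

```python
def is_absolutely_continuous(nu, mu):
    # nu << mu: every point with mu-mass zero also has nu-mass zero.
    return all(nu[w] == 0.0 for w in mu if mu[w] == 0.0)

def radon_nikodym(nu, mu):
    # g = d(nu)/d(mu); the value on mu-null points is arbitrary, 0 is used here.
    return {w: (nu[w] / mu[w] if mu[w] > 0.0 else 0.0) for w in mu}

def integrate(g, mu, A):
    # Integral of g over A with respect to mu on a finite space.
    return sum(g[w] * mu[w] for w in A)

mu = {"a": 0.5, "b": 0.25, "c": 0.25, "d": 0.0}
nu = {"a": 0.1, "b": 0.6, "c": 0.3, "d": 0.0}

g = radon_nikodym(nu, mu)
A = {"a", "b"}
nu_of_A = sum(nu[w] for w in A)
print(is_absolutely_continuous(nu, mu), g, integrate(g, mu, A), nu_of_A)
```

If ν put mass on the point "d", where μ has none, no function g could reproduce ν by integrating against μ; that is exactly what the absolute continuity condition rules out.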
57
The Radon-Nikodym Theorem
If $\nu \ll \mu$ then ν is actually the integral of a function wrt μ.
Idea of proof: Create the function through its superlevel sets
Consider the set function (this is actually a signed measure)
$$(\nu - \alpha\mu)(A) = \nu(A) - \alpha\mu(A)$$
Choose α and let $A_\alpha$ be the largest set such that
$\nu(A) - \alpha\mu(A) \ge 0$ for all $A \subset A_\alpha$ (You must prove such an $A_\alpha$ exists.)
Then $A_\alpha$ is the α-superlevel set of g.
Now, given superlevel sets, we can construct a function by:
$$g(\omega) = \sup\{\alpha \mid \omega \in A_\alpha\}$$
58
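The Riesz Representation Theorem:
All continuous linear functionals on $L^p$ are given by integration against a function $g \in L^q$ with $\frac{1}{p} + \frac{1}{q} = 1$.
That is, let $L: L^p \to \mathbb{R}$, $f \mapsto L(f)$,
be a cts. linear functional.
Then: $L(f) = \int f g\, d\mu$
Note, in $L^2$ this becomes:
$L(f) = \int f g\, d\mu = \langle f, g\rangle$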
59
The Riesz Representation Theorem:
All continuous linear functionals on $L^p$ are given by integration against a function $g \in L^q$ with $\frac{1}{p} + \frac{1}{q} = 1$.
What is the idea behind the proof:
Linearity allows you to break things into building blocks, operate on them, then add them all together.
What are the building blocks of measurable functions?
Indicator functions! Of course!
Let’s define a set function from indicator functions: $\nu(A) = L(1_A)$
60
The Riesz Representation Theorem:
All continuous linear functionals on $L^p$ are given by integration against a function $g \in L^q$ with $\frac{1}{p} + \frac{1}{q} = 1$.
A set function: $\nu(A) = L(1_A)$
How does L operate on simple functions?
$$L(\varphi) = L\left(\sum_{i=1}^{n}\alpha_i 1_{A_i}\right) = \sum_{i=1}^{n}\alpha_i L(1_{A_i}) = \sum_{i=1}^{n}\alpha_i\, \nu(A_i)$$
This looks like an integral with ν the measure! $L(\varphi) = \int \varphi\, d\nu$
It is not too hard to show that ν is a (signed) measure (countable additivity follows from continuity). Furthermore, $\nu \ll \mu$. Radon-Nikodym then says $d\nu = g\, d\mu$.
61
The Riesz Representation Theorem:
All continuous linear functionals on $L^p$ are given by integration against a function $g \in L^q$ with $\frac{1}{p} + \frac{1}{q} = 1$.
A set function: $\nu(A) = L(1_A)$
How does L operate on simple functions?
$$L(\varphi) = L\left(\sum_{i=1}^{n}\alpha_i 1_{A_i}\right) = \sum_{i=1}^{n}\alpha_i L(1_{A_i}) = \sum_{i=1}^{n}\alpha_i\, \nu(A_i)$$
This looks like an integral with ν the measure! $L(\varphi) = \int \varphi\, g\, d\mu$
For measurable functions it follows from limits and continuity.
$L(f) = \int f g\, d\mu$
The details are left as an “easy” exercise for the reader...
62
A probability measure P is a measure that satisfies $P(\Omega) = 1$. That is, the measure of the whole space is 1.
A random variable is a measurable function, $X(\omega)$.
The expectation of a random variable is its integral:
$$E(X) = \int X\, dP$$
A density function is the Radon-Nikodym derivative wrt Lebesgue measure:
$$f_X = \frac{dP}{dx}$$
$$E(X) = \int X\, dP = \int x f_X(x)\, dx$$
63
In finance we will talk about expectations with respect to different measures.
A probability measure P is a measure that satisfies $P(\Omega) = 1$. That is, the measure of the whole space is 1.
P: $E_P(X) = \int X\, dP$
Q: $E_Q(X) = \int X\, dQ$
And write expectations in terms of the different measures:
$$E_P(X) = \int X\, dP = \int X \frac{dP}{dQ}\, dQ = \int X \xi\, dQ = E_Q(X\xi)$$
where $\xi = \frac{dP}{dQ}$ or $dP = \xi\, dQ$
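The change-of-measure identity $E_P(X) = E_Q(X\,dP/dQ)$ can be verified on a finite sample space. A minimal sketch: the outcomes, the two probability measures, and the random variable are illustrative assumptions, and Q is chosen to give positive mass to every outcome so that the ratio is defined everywhere.

```python
# Finite sample space with two probability measures P and Q.
outcomes = [1, 2, 3]
P = {1: 0.2, 2: 0.3, 3: 0.5}
Q = {1: 0.5, 2: 0.25, 3: 0.25}

def X(w):
    # A random variable: here simply the outcome itself.
    return float(w)

# Radon-Nikodym derivative xi = dP/dQ, a ratio of point masses.
xi = {w: P[w] / Q[w] for w in outcomes}

expectation_under_P = sum(X(w) * P[w] for w in outcomes)
reweighted_under_Q = sum(X(w) * xi[w] * Q[w] for w in outcomes)
expectation_under_Q = sum(X(w) * Q[w] for w in outcomes)

print(expectation_under_P, reweighted_under_Q, expectation_under_Q)
```

The plain expectation under Q differs from the one under P; only after weighting X by the derivative ξ do the two agree.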