Vector geometry: A visual tool for statistics
Sylvain Chartier
Laboratory for Computational Neurodynamics and Cognition
Centre for Neural Dynamics
Vector geometry
• How, using a vector (arrow), we can represent the concepts of
– Mean, variance (standard deviation), normalization and standardization.
• How, using two vectors, we can represent the concepts of
– Correlation and regression.
A datum
[Figure: single data values labelled (0), (8) and (16) marked on axes]
Principle of independence of observations: perfectly opposed direction
Two data
[Figure: the point (16, 8) plotted relative to the origin (0, 0)]
Starting point: Zero
[Figure: arrow x = (x1, x2) = (16, 8), from the starting point (0, 0) to the finish point (16, 8)]
Starting point: Mean
[Figure: arrow from the starting point $(\bar{x}, \bar{x}) = (12, 12)$ to the finish point x = (16, 8)]
One group
Many groups
Degrees of freedom
We remove the effect of the mean: we center the data
$x - \bar{x} = (16, 8) - (12, 12) = (4, -4)$
[Figure: starting point (mean) (12, 12), finish point x = (16, 8), and the origin (0, 0)]
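The centering step above can be checked with a few lines of Python. This is only a sketch using the slide's two data; the variable names are chosen here for illustration.

```python
# Center the two data from the slides by subtracting their mean.
x = [16, 8]
mean = sum(x) / len(x)                  # mean of the two observations
centered = [xi - mean for xi in x]      # deviations from the mean

print(mean)       # 12.0
print(centered)   # [4.0, -4.0]
```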
We remove the effect of the mean (many groups)
What is the real dimensionality?
We remove the effect of the mean
• If we have two data, we will get one dimension.
• If we have three data, we will get two dimensions.
...
• If we have n data, we will get n-1 dimensions.
In other words, degrees of freedom represent the true dimensionality of the data.
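One way to see why centering costs a dimension: once the mean is removed, the deviations must sum to zero, so the last one is fixed by the others. The sketch below uses a hypothetical three-value data set (not from the slides) and a helper named `center` introduced here for illustration.

```python
def center(data):
    """Return the deviations of each value from the mean of the data."""
    m = sum(data) / len(data)
    return [d - m for d in data]

data = [16.0, 8.0, 3.0]       # hypothetical data set with n = 3
dev = center(data)

# The deviations always sum to zero, so only n - 1 of them are free:
# the last deviation can be rebuilt from the first n - 1.
last_from_others = -sum(dev[:-1])

print(dev)
print(sum(dev), last_from_others, dev[-1])
```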
Variance
(1.5, -1.5)   (-0.5, 0.5)   (2.5, -2.5)
What is the difference between these three groups (composed of two data each)?
Length (distance): the higher the variability, the longer the length will be.
How do we measure the length (distance)?
Pythagoras: hypotenuse of a triangle
$? = \sqrt{4^2 + 3^2} = \sqrt{25} = 5$
[Figure: right triangle with sides 4 and 3 ending at the point (4, 3); hypotenuse ? = 5]
Therefore, the point (4,3) is at a distance of 5 from its starting point.
$5^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2$ = sum of squares = variance × (n-1)
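The length computation can be sketched as follows, using the point (4, 3) and the three centered groups shown earlier. The function name `length` is an assumption made for this example.

```python
import math

def length(v):
    """Euclidean length of a vector: square root of its sum of squares."""
    return math.sqrt(sum(c * c for c in v))

print(length([4, 3]))   # 5.0

# The three centered groups from the slides: more spread means a longer arrow.
groups = [(1.5, -1.5), (-0.5, 0.5), (2.5, -2.5)]
lengths = [length(g) for g in groups]
print(lengths)
```

The squared length is the sum of squares, which is why the most spread-out group has the longest arrow.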
What is the length of these three lines?
A) (1): length $\sqrt{1} = 1$
B) (1, 1): length $\sqrt{2}$
C) (1, 1, 1): length $\sqrt{3}$
The dimensionality inflates the variability.
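A short sketch of this inflation, assuming vectors whose components are all 1: the length grows with the number of components even though each component stays the same.

```python
import math

def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(c * c for c in v))

# Vectors of ones in 1, 2 and 3 dimensions.
lengths = {n: length([1.0] * n) for n in (1, 2, 3)}

for n, value in lengths.items():
    print(n, value)   # the length equals the square root of n
```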
To obtain a measure that takes the dimensionality into account, what do we need to do?
• We divide the (squared) length of the data set by its true dimensionality
$\text{Variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
= (quadratic) distance (from the mean) corrected by the (true) dimensionality of the data.
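As a minimal sketch, the variance can be computed as the squared length of the centered vector divided by n-1, then compared with the standard library's `statistics.variance`. The function name `variance` and the second data set are assumptions for illustration.

```python
import statistics

def variance(data):
    """Squared distance from the mean, divided by the n - 1 free dimensions."""
    n = len(data)
    mean = sum(data) / n
    sum_of_squares = sum((x - mean) ** 2 for x in data)
    return sum_of_squares / (n - 1)

x = [16.0, 8.0]                   # the two data from the slides
var_x = variance(x)               # (4^2 + (-4)^2) / 1
print(var_x)                      # 32.0

sample = [2.0, 4.0, 9.0, 5.0]     # hypothetical data set
print(variance(sample), statistics.variance(sample))
```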
Normalization and standardization
Normalization vs Standardization
• To normalize is to bring a given centered vector x (arrow, mean = 0) to a length of 1.
• Normalization: $z = x / \|x\|$ (x divided by its length), so $z^T z = 1$
• Standardization: $z_x = x / SD$, so $z_x^T z_x = n-1$
=> $z_x = z\sqrt{n-1}$
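The relation between the two rescalings can be checked numerically. The sketch below assumes a hypothetical data set and helper names (`center`, `dot`, `normalize`, `standardize`) introduced only for this example.

```python
import math

def center(x):
    """Deviations from the mean."""
    m = sum(x) / len(x)
    return [xi - m for xi in x]

def dot(a, b):
    """Inner product of two vectors of equal length."""
    return sum(ai * bi for ai, bi in zip(a, b))

def normalize(x):
    """Center x, then divide by its length so the result has length 1."""
    c = center(x)
    norm = math.sqrt(dot(c, c))
    return [ci / norm for ci in c]

def standardize(x):
    """Center x, then divide by the standard deviation (n - 1 denominator)."""
    c = center(x)
    sd = math.sqrt(dot(c, c) / (len(x) - 1))
    return [ci / sd for ci in c]

x = [2.0, 4.0, 9.0]          # hypothetical data set
n = len(x)
z = normalize(x)
zx = standardize(x)

print(dot(z, z))             # close to 1
print(dot(zx, zx))           # close to n - 1
print([zi * math.sqrt(n - 1) for zi in z], zx)
```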
Two groups
One group of three participants
Two groups of three participants
• They can be represented by a plane
• This is true whatever the number of participants
Correlation and regression
Relation between two vectors
• If two groups (u and v) have the same data, then the two vectors are superposed on each other.
• As the two vectors diverge from each other, the angle between them increases.
• If the angle reaches 90 degrees, then they share nothing in common.
• The cosine of the angle is the coefficient of correlation
$\cos\theta = \frac{u^T v}{\|u\|\,\|v\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\|u\|\,\|v\|} = \frac{\mathrm{cov}_{uv}}{s_u s_v} = r$
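A small numerical check of this identity, assuming that u and v are centered before the angle is taken (as in the earlier slides). The data and helper names are hypothetical.

```python
import math
import statistics

def center(x):
    """Deviations from the mean."""
    m = sum(x) / len(x)
    return [xi - m for xi in x]

def dot(a, b):
    """Inner product of two vectors of equal length."""
    return sum(ai * bi for ai, bi in zip(a, b))

u = [1.0, 2.0, 3.0]          # hypothetical group u
v = [2.0, 1.0, 4.0]          # hypothetical group v

uc, vc = center(u), center(v)

# Cosine of the angle between the two centered vectors.
cos_theta = dot(uc, vc) / (math.sqrt(dot(uc, uc)) * math.sqrt(dot(vc, vc)))

# Correlation from the covariance and the two standard deviations.
n = len(u)
cov_uv = dot(uc, vc) / (n - 1)
r = cov_uv / (statistics.stdev(u) * statistics.stdev(v))

print(cos_theta, r)          # the two values agree up to rounding
```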
• Regression: $\hat{v} = b_0 + b_1 u$
[Figure: vectors u and v with segments labelled b and e]
– The shortest distance is the one that crosses the vector u at 90°.
– By substitution, we can isolate the b1 coefficient.
• Regression: The formula to obtain the regression coefficients can be obtained directly from the geometry
$$
\begin{aligned}
u^T e &= 0 \\
u^T (v - b_1 u) &= 0 \\
u^T v - b_1 u^T u &= 0 \\
u^T v &= b_1 u^T u \\
(u^T u)^{-1} u^T v &= b_1 (u^T u)^{-1} (u^T u) \\
(u^T u)^{-1} u^T v &= b_1
\end{aligned}
$$
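The derivation can be verified on a small example. This sketch assumes centered vectors (so that the intercept drops out) and uses hypothetical data; it computes b1 and confirms that the residual e is at 90° to u.

```python
def center(x):
    """Deviations from the mean."""
    m = sum(x) / len(x)
    return [xi - m for xi in x]

def dot(a, b):
    """Inner product of two vectors of equal length."""
    return sum(ai * bi for ai, bi in zip(a, b))

u = center([1.0, 2.0, 3.0])      # hypothetical predictor, centered
v = center([2.0, 1.0, 4.0])      # hypothetical outcome, centered

# b1 = (u^T u)^{-1} u^T v; for a single predictor the inverse is a division.
b1 = dot(u, v) / dot(u, u)

# Residual vector e = v - b1 * u; it should be orthogonal to u.
e = [vi - b1 * ui for ui, vi in zip(u, v)]

print(b1)
print(dot(u, e))                 # close to 0
```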
If we generalize to any situation (multiple, multivariate)
$B = (X^T X)^{-1} X^T Y$
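The matrix formula can be illustrated with plain Python lists. This is a sketch under stated assumptions: the data are hypothetical, X holds a column of ones plus one predictor, and the helpers `transpose`, `matmul` and `inverse_2x2` are written here for the example, so the inverse only covers the 2 × 2 case.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix via its determinant."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("X^T X is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

u = [1.0, 2.0, 3.0]                  # hypothetical predictor
v = [2.0, 1.0, 4.0]                  # hypothetical outcome

X = [[1.0, ui] for ui in u]          # column of ones (intercept) and u
Y = [[vi] for vi in v]               # outcome as a column

Xt = transpose(X)
B = matmul(inverse_2x2(matmul(Xt, X)), matmul(Xt, Y))
b0, b1 = B[0][0], B[1][0]

print(b0, b1)                        # intercept and slope
```

With uncentered data and a column of ones, the slope matches the b1 obtained from the centered vectors, and the intercept absorbs the means.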