The Hidden Message
Some useful techniques for data analysis
Chihway Chang, Feb 18, 2009
A famous example…
Hubble’s law: v = H0 d
Expansion of the universe
What do we learn?
Seemingly crappy data can lead to astonishing discoveries
Insight + imagination
Nature's laws are usually simple
Most parts of our observable Universe are linear, spherically symmetric, Gaussian or Poisson
Data analysis should be easy! …theoretically
We all know how this happens
Process of data analysis:
Sampling: Central Limit Theorem (CLT), strategy of sampling
Model fitting: linear regression, maximum likelihood, chi-square
Correlations
Or…
[Figure: data plot]
???@#$
Collect lots of data
Stare at your data
Outline
Useful techniques in data analysis:
Correlations: linear correlation, cross-correlation, autocorrelation
Principal Component Analysis (PCA)
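Correlations: linear correlation
Data: $\{X_i\}, \{Y_i\}$
Linear model: $y = f(x) = Ax + B$
Standard scores:
$z_{X,i} = \frac{X_i - \bar{X}}{\sigma_X}, \quad z_{Y,i} = \frac{Y_i - \bar{Y}}{\sigma_Y}$
Correlation coefficient (Pearson product-moment):
$\mathrm{corr}(X,Y) = \frac{1}{N}\sum_i z_{X,i}\, z_{Y,i} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E(X)^2}\,\sqrt{E(Y^2) - E(Y)^2}}$
Coefficient of determination: $\mathrm{corr}(X,Y)^2$
Variance in common: $\mathrm{corr}(X,Y)^2 \times 100\%$
Correlation matrix: $\mathrm{Corr}_{ij} = \mathrm{corr}(X_i, X_j)$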
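The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this note, the standard deviation is the population one (divide by N), and the data are just the eleven rows of Hubble's table that are legible later in these slides (rows 1-5 and 19-24), so the result is not the value quoted for the full 24-point set.

```python
import math

def standard_scores(values):
    """z-scores using the population standard deviation (divide by N)."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return [(x - mean) / sigma for x in values]

def pearson(xs, ys):
    """Correlation coefficient as the mean product of standard scores."""
    zx, zy = standard_scores(xs), standard_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

def pearson_from_moments(xs, ys):
    """The same quantity written with expectation values E(.)."""
    n = len(xs)
    ex, ey = sum(xs) / n, sum(ys) / n
    exy = sum(x * y for x, y in zip(xs, ys)) / n
    ex2 = sum(x * x for x in xs) / n
    ey2 = sum(y * y for y in ys) / n
    return (exy - ex * ey) / (math.sqrt(ex2 - ex ** 2) * math.sqrt(ey2 - ey ** 2))

# Rows 1-5 and 19-24 of the Hubble table (distance in Mpc, velocity in km/s).
# The elided rows are deliberately left out, so this is a partial data set.
d = [0.032, 0.034, 0.214, 0.263, 0.275, 1.4, 1.7, 2.0, 2.0, 2.0, 2.0]
v = [170, 290, -130, -70, -185, 500, 960, 500, 850, 800, 1090]

r = pearson(d, v)
print(f"corr = {r:.3f}, variance in common = {100 * r * r:.1f}%")
```

Both functions should agree to rounding error, which is a quick way to check that the two forms of the formula are the same thing.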
Example – Hubble’s law
[Figure: Hubble's original 1929 data; velocity (km/s) versus distance (Mpc)]
We have 24 data points; we’d like to know how v and d correlate
  N       d (Mpc)   d*d        zd         v (km/s)  v*v       zv         zv*zd
  1       0.032     0.001024   -1.39163     170       28900   -0.5589    0.777778
  2       0.034     0.001156   -1.38846     290       84100   -0.22872   0.317567
  3       0.214     0.045796   -1.10361    -130       16900   -1.38435   1.527778
  4       0.263     0.069169   -1.02606     -70        4900   -1.21926   1.251038
  5       0.275     0.075625   -1.00707    -185       34225   -1.53568   1.546545
  ……
 19       1.4       1.96        0.773257    500      250000    0.349097  0.269942
 20       1.7       2.89        1.248012    960      921600    1.614788  2.015274
 21       2         4           1.722767    500      250000    0.349097  0.601412
 22       2         4           1.722767    850      722500    1.312122  2.260481
 23       2         4           1.722767    800      640000    1.174547  2.023471
 24       2         4           1.722767   1090     1188100    1.972483  3.398128
Mean      0.911375  1.229908                373.125  271309.4             0.789639
Variance  0.399304                          132087.1
Std. dev. 0.631905                          363.4379
Standard scores: columns zd and zv
Correlation coefficient: 0.789639 (mean of zv*zd)
Coefficient of determination: 0.623529
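As a check, the variances, standard deviations and coefficient of determination follow from the averages in the table. The lines below are a worked sketch using only the tabulated numbers; the angle-bracket notation for an average is an assumption of this note.

```latex
\begin{align*}
\sigma_d^2 &= \langle d^2 \rangle - \langle d \rangle^2 = 1.229908 - 0.911375^2 \approx 0.399304,
  & \sigma_d &\approx 0.631905 \\
\sigma_v^2 &= \langle v^2 \rangle - \langle v \rangle^2 = 271309.4 - 373.125^2 \approx 132087.1,
  & \sigma_v &\approx 363.4379 \\
\mathrm{corr}(d,v) &= \langle z_d\, z_v \rangle = 0.789639,
  & \mathrm{corr}(d,v)^2 &\approx 0.6235
\end{align*}
```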
Significance and likelihood
One-tailed table usage
What is the probability that 24 pairs of random numbers give corr(X,Y) ≥ 0.79 by chance?
What if we only have 5 samples?
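One way to get a feel for this question without a table is a permutation test: shuffle one variable many times and count how often the shuffled pairs correlate at least as strongly as the real ones. This is a different technique from the one-tailed table lookup on the slide, sketched here under stated assumptions: the function names, the number of trials and the seed are all illustrative, and the data are only the eleven legible rows of the Hubble table.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def chance_probability(xs, ys, trials=10000, seed=0):
    """Fraction of random re-pairings with correlation >= the observed one."""
    rng = random.Random(seed)      # fixed seed so the sketch is repeatable
    observed = pearson(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)      # break any real link between xs and ys
        if pearson(xs, shuffled) >= observed:
            hits += 1
    return observed, hits / trials

# Rows 1-5 and 19-24 of the Hubble table (Mpc, km/s); elided rows are omitted.
d = [0.032, 0.034, 0.214, 0.263, 0.275, 1.4, 1.7, 2.0, 2.0, 2.0, 2.0]
v = [170, 290, -130, -70, -185, 500, 960, 500, 850, 800, 1090]

observed, p = chance_probability(d, v)
print(f"observed corr = {observed:.3f}, one-tailed chance probability ~ {p:.4f}")
```

With only 5 samples there are just 5! = 120 distinct re-pairings and the original pairing is one of them, so this method can never report a chance probability below 1/120. That is one way to see why a high correlation from a handful of points carries little weight.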
Limitations
Only captures linear dependence
Sensitive to outliers
Affected by correlated errors
Cross-correlation
Signal processing: search a long series of data for a short feature signal
$(f \star g)(t) = \int f^{*}(\tau)\, g(t+\tau)\, d\tau$
[Figure: f, g, and $(f \star g)(t)$ plotted against t]
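A discrete version of this search can be sketched as follows. The feature and the series are made-up illustrative values, the function name is an assumption, and the signals are real-valued, so the complex conjugate drops out.

```python
def cross_correlate(feature, series):
    """c[t] = sum over tau of feature[tau] * series[t + tau].

    Only lags where the whole feature fits inside the series are evaluated.
    """
    m = len(feature)
    return [
        sum(feature[k] * series[t + k] for k in range(m))
        for t in range(len(series) - m + 1)
    ]

# Illustrative data: a short feature hidden at a known offset in a flat series.
feature = [1.0, 3.0, -2.0]
series = [0.0] * 20
offset = 7
for k, value in enumerate(feature):
    series[offset + k] += value

c = cross_correlate(feature, series)
best_lag = max(range(len(c)), key=c.__getitem__)
print(best_lag)  # 7, the offset where the feature was placed
```

The peak of the cross-correlation marks the lag at which the feature lines up with the series. On real data the series is noisy and the peak is broader, but the idea is the same.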
Autocorrelation
Finding repeating patterns
Identifying fundamental lengths or time scales in a noisy signal
Cross-correlation of a signal with itself
$R(\tau) = \frac{E[(X_t - \mu)(X_{t+\tau} - \mu)]}{\sigma^2}$
or simply
$R(\tau) = E[(X_t - \mu)(X_{t+\tau} - \mu)]$
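A sample estimate of the normalized form can be sketched like this. The signal is a made-up noise-free sine wave with a period of 25 samples, and the function name and the averaging over the n - lag available pairs are choices of this sketch, not something taken from the slides.

```python
import math

def autocorrelation(x, max_lag):
    """Estimate R(lag) = E[(X_t - mu)(X_{t+lag} - mu)] / sigma^2 for lag = 0..max_lag."""
    n = len(x)
    mu = sum(x) / n
    var = sum((value - mu) ** 2 for value in x) / n
    return [
        sum((x[t] - mu) * (x[t + lag] - mu) for t in range(n - lag)) / ((n - lag) * var)
        for lag in range(max_lag + 1)
    ]

# Illustrative signal: a sine wave that repeats every 25 samples.
period = 25
signal = [math.sin(2 * math.pi * t / period) for t in range(500)]

R = autocorrelation(signal, 60)
# Search the lags 1..40 for the strongest repeat.
first_peak = max(range(1, 41), key=R.__getitem__)
print(f"R(0) = {R[0]:.3f}, strongest repeat at lag {first_peak}")
```

R(0) is 1 by construction, and the next maximum sits at the repeat length of the signal, which is how autocorrelation picks out a fundamental time scale.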
Application
Correlation coefficient: well, um… everywhere?
Auto- & cross-correlation:
Optics: laser coherence, spectra measurement, ultra-short laser pulses
Signal processing: musical beats
Astronomy: pulsar frequency
Correlation in space: 2-point (n-point) correlation functions & power spectrum
Example: 2-point correlation in weak lensing
Assumption: galaxy shapes are entirely random
Correlation of shape parameter “e” is 0
Shear induces correlation at length scale ~arcmin
Atmosphere and systematics induce correlated noise
[Figure: typical 2-point correlation plots, no shear, but with noise and systematics; scales marked at 1 arcmin and 5 arcmin]
Shear signal is at the 1% level
Controlling systematics is the key!
Principal Component Analysis
Revealing the internal structure of data in a way that best explains its variance
Conceptually, it is a transformation of the coordinate system that rotates the data into its eigenspace, where the greatest variance of any projection of the data lies along the first coordinate
High-dimension analysis
[Figure: Hubble's original 1929 data; velocity (km/s) versus distance (Mpc)]
Mathematical operation
Recognize the important variance in the data: the Principal Components (PCs)
Reconstruct the data using only the low-order PCs, thus compressing the dimension of the data
Assumptions:
Data can be represented by a linear combination of a certain basis
Data is Gaussian
Example – Hubble’s Law again
Get data {(xi,yi)}, dimension 24*2
Subtract mean: {(Xi,Yi)} = {(xi-ave(x), yi-ave(y))}, dimension 24*2
Calculate covariance matrix C (2*2) = {(Xi,Yi)}T {(Xi,Yi)}/N
Calculate & normalize the 2 eigenvalues and 2 eigenvectors of C
The eigenvectors point to the 2 PCs and the eigenvalues indicate their relative weightings
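The first steps (mean subtraction, covariance matrix, eigen-decomposition) can be sketched in plain Python for the 2*2 case, where the eigenvalues and eigenvectors of a symmetric matrix have a closed form. The function names are made up for this sketch, and the data are only the eleven legible rows of the Hubble table, so the eigenvalues will differ from those of the full 24-point set.

```python
import math

def covariance_matrix(xs, ys):
    """Entries (cxx, cxy, cyy) of C = X^T X / N after subtracting the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    X = [x - mx for x in xs]
    Y = [y - my for y in ys]
    cxx = sum(a * a for a in X) / n
    cyy = sum(b * b for b in Y) / n
    cxy = sum(a * b for a, b in zip(X, Y)) / n
    return cxx, cxy, cyy

def eigen_symmetric_2x2(cxx, cxy, cyy):
    """Eigenvalues (largest first) and unit eigenvectors of [[cxx, cxy], [cxy, cyy]]."""
    half_trace = 0.5 * (cxx + cyy)
    radius = math.hypot(0.5 * (cxx - cyy), cxy)
    lam1, lam2 = half_trace + radius, half_trace - radius
    # Angle of the leading eigenvector measured from the x axis.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))   # perpendicular to v1
    return (lam1, v1), (lam2, v2)

# Rows 1-5 and 19-24 of the Hubble table (Mpc, km/s); elided rows are omitted.
d = [0.032, 0.034, 0.214, 0.263, 0.275, 1.4, 1.7, 2.0, 2.0, 2.0, 2.0]
v = [170, 290, -130, -70, -185, 500, 960, 500, 850, 800, 1090]

cxx, cxy, cyy = covariance_matrix(d, v)
(lam1, pc1), (lam2, pc2) = eigen_symmetric_2x2(cxx, cxy, cyy)
print(f"PC1: eigenvalue {lam1:.1f}, direction {pc1}")
print(f"PC2: eigenvalue {lam2:.4f}, direction {pc2}")
```

Because velocities in km/s are numerically far larger than distances in Mpc, the first eigenvalue is dominated by the velocity variance and PC1 lies almost along the velocity axis.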
[Figure: Hubble's original 1929 data; velocity (km/s) versus distance (Mpc)]
PC1, eigenvalue=132087
PC2, eigenvalue=0.1503
Recognize the important PC and ignore the others
Form a new basis of compressed dimension {V}, dimension 2*1
Reconstruct data using 1 eigenvector to rotate data back
{(X’i,Y’i)} = {(Xi,Yi)}{V}{V}T
Shift data back and get final reconstructed data
{(X,Y)}reconstruct = {(X’i+ave(x), Y’i+ave(y))}
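The reconstruction steps can be sketched as below: project the mean-subtracted data onto the first PC, map the result back to the original coordinates, and add the mean back. The function name is an assumption, the leading eigenvector is obtained from the closed-form angle for a symmetric 2*2 matrix, and the data are only the eleven legible rows of the Hubble table.

```python
import math

def reconstruct_with_pc1(xs, ys):
    """Rank-1 PCA reconstruction of 2-D data; returns (points, unit PC1 vector)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    X = [x - mx for x in xs]                    # subtract the mean
    Y = [y - my for y in ys]
    cxx = sum(a * a for a in X) / n             # covariance matrix entries
    cyy = sum(b * b for b in Y) / n
    cxy = sum(a * b for a, b in zip(X, Y)) / n
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    vx, vy = math.cos(theta), math.sin(theta)   # leading eigenvector V
    points = []
    for a, b in zip(X, Y):
        score = a * vx + b * vy                 # (X, Y) V: coordinate along PC1
        points.append((score * vx + mx, score * vy + my))  # times V^T, shift back
    return points, (vx, vy)

# Rows 1-5 and 19-24 of the Hubble table (Mpc, km/s); elided rows are omitted.
d = [0.032, 0.034, 0.214, 0.263, 0.275, 1.4, 1.7, 2.0, 2.0, 2.0, 2.0]
v = [170, 290, -130, -70, -185, 500, 960, 500, 850, 800, 1090]

reconstructed, pc1 = reconstruct_with_pc1(d, v)
for (x_new, y_new), x_old, y_old in zip(reconstructed, d, v):
    print(f"({x_old}, {y_old}) -> ({x_new:.3f}, {y_new:.1f})")
```

All reconstructed points lie on a single straight line through the mean, along PC1: the two-dimensional data have been compressed to one coordinate per point.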
[Figures: two plots of the Hubble data]
Example – characterize shape of CCD chips
Fit 27 chip shapes using 4th-order polynomials
Data matrix of dimension 27*15
15 eigenvalues and 15 eigenvectors
Choose 15, 5, 1 PCs to reconstruct shapes
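One plausible reading of the 15 columns, offered as an assumption rather than something the slides state: if each chip surface is fitted with a polynomial in two variables of total order 4, there are 15 coefficients per chip. A short count of the monomials x^i y^j with i + j ≤ 4:

```python
order = 4

# All exponent pairs (i, j) of the monomials x**i * y**j with i + j <= order.
terms = [(i, j) for i in range(order + 1) for j in range(order + 1 - i)]

print(len(terms))  # 15
```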
[Figures: chip shapes reconstructed using 15, 5, and 1 PCs]
Applications
Pattern recognition (http://icg.cityu.edu.hk/private/PowerPoint/PCA.ppt)
Multi-dimension data analysis
Noise reduction
Image analysis
Conclusion
Data are only useful if we know how to interpret them
Various statistical techniques have been developed
Analyzing correlations and PCA are the two common techniques I introduced today
“It can aid understanding reality, but it is no substitute for insight, reason, and imagination. It is a flashlight of the mind. It must be turned on and directed by our interests and knowledge; and it can help gratify and illuminate both. But like a flashlight, it can be uselessly turned on in the daytime, used unnecessarily beneath a lamp, employed to search for something in the wrong room, or become a play thing.”
R.J. Rummel, department of political science, University of Hawaii
Reference
A tutorial on Principal Components Analysis, Lindsay I Smith
Understanding Correlation, R.J. Rummel
…and yes, I learned everything from Wikipedia
FIN
Thank you for your attention!