The Hidden Message: Some useful techniques for data analysis
Chihway Chang, Feb 18, 2009


A famous example: Hubble's law
  $v = H_0 d$
  Expansion of the universe

What do we learn?
  Seemingly crappy data can lead to astonishing discoveries
  Insight + imagination

Nature's laws are usually simple
  Most parts of our observable Universe are linear, spherically symmetric, Gaussian, or Poisson
  Data analysis should be easy! …theoretically

CLT

We all know how this happens
Process of data analysis:
  Sampling
    Central Limit Theorem
    Strategy of sampling
  Model fitting
    Linear regression
    Maximum likelihood
    Chi-square
  Correlations
Or…

[Figure: unlabeled plot annotated "???@#$"]

Collect lots of data
Stare at your data

Outline
Useful techniques in data analysis:
  Correlations
    Linear correlation
    Cross-correlation
    Autocorrelation
  Principal Component Analysis (PCA)

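Correlations: linear correlation

Linear model: $y = f(x) = Ax + B$

Data: $\{X_i\}, \{Y_i\}$

Standard scores:
$$z_{X,i} = \frac{X_i - \bar{X}}{\sigma_X}, \qquad z_{Y,i} = \frac{Y_i - \bar{Y}}{\sigma_Y}$$

Correlation coefficient (Pearson product-moment):
$$\mathrm{corr}(X,Y) = \frac{1}{N}\sum_i z_{X,i}\, z_{Y,i} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)}\,\sqrt{E(Y^2) - E^2(Y)}}$$

Coefficient of determination: $\mathrm{corr}(X,Y)^2$

Variance in common: $\mathrm{corr}(X,Y)^2 \times 100\%$

Correlation matrix: $\mathrm{Corr}_{ij} = \mathrm{corr}(X_i, X_j)$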
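As a minimal sketch of the definitions on this slide, the correlation coefficient can be computed as the average product of standard scores. The function names and the toy data below are illustrative assumptions, not taken from the talk; the standard deviation divides by N, which is an assumption about the convention used.

```python
import math

def standard_scores(values):
    """Return z-scores, using the population standard deviation (divide by N)."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sigma for v in values]

def pearson_corr(xs, ys):
    """corr(X, Y) as the average of the products of standard scores."""
    zx = standard_scores(xs)
    zy = standard_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Toy data (illustrative only): y = 2x + 1 exactly, so the correlation is +1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

r = pearson_corr(xs, ys)
r_squared = r ** 2                      # coefficient of determination
variance_in_common = 100.0 * r_squared  # in percent
print(r, r_squared, variance_in_common)
```

Flipping the sign of one variable flips the sign of the coefficient while leaving the coefficient of determination unchanged.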

Example – Hubble's law

[Figure: Hubble's original 1929 data, velocity (km/s) versus distance (Mpc)]

We have 24 data points, and we'd like to know how v and d correlate.

Distances d are in Mpc and velocities v in km/s; z_d and z_v are the standard scores.

N     d       d*d        z_d        v      v*v       z_v        z_v*z_d
1     0.032   0.001024   -1.39163   170    28900     -0.5589    0.777778
2     0.034   0.001156   -1.38846   290    84100     -0.22872   0.317567
3     0.214   0.045796   -1.10361   -130   16900     -1.38435   1.527778
4     0.263   0.069169   -1.02606   -70    4900      -1.21926   1.251038
5     0.275   0.075625   -1.00707   -185   34225     -1.53568   1.546545
……
19    1.4     1.96       0.773257   500    250000    0.349097   0.269942
20    1.7     2.89       1.248012   960    921600    1.614788   2.015274
21    2       4          1.722767   500    250000    0.349097   0.601412
22    2       4          1.722767   850    722500    1.312122   2.260481
23    2       4          1.722767   800    640000    1.174547   2.023471
24    2       4          1.722767   1090   1188100   1.972483   3.398128

Summary:
  ave(d) = 0.911375, ave(d*d) = 1.229908, variance of d = 0.399304, standard deviation of d = 0.631905
  ave(v) = 373.125, ave(v*v) = 271309.4, variance of v = 132087.1, standard deviation of v = 363.4379
  Correlation coefficient: ave(z_v*z_d) = 0.789639
  Coefficient of determination: 0.789639^2 = 0.623529
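The summary values in the table can be cross-checked with a few lines of arithmetic. This sketch only uses numbers quoted in the table; the variable names are illustrative, and the variance is assumed to be the population form E(X^2) - E(X)^2, which is what reproduces the tabulated values.

```python
import math

# Averages quoted in the table (Hubble's 24 data points).
mean_d, mean_d2 = 0.911375, 1.229908   # Mpc, Mpc^2
mean_v, mean_v2 = 373.125, 271309.4    # km/s, (km/s)^2
corr = 0.789639                        # average of z_v * z_d

# Population variance and standard deviation: E(X^2) - E(X)^2.
var_d = mean_d2 - mean_d ** 2
var_v = mean_v2 - mean_v ** 2
sigma_d = math.sqrt(var_d)
sigma_v = math.sqrt(var_v)

# Standard scores of the first data point (d = 0.032 Mpc, v = 170 km/s).
z_d1 = (0.032 - mean_d) / sigma_d
z_v1 = (170 - mean_v) / sigma_v

# Coefficient of determination and variance in common.
r_squared = corr ** 2
print(var_d, sigma_d, var_v, sigma_v)
print(z_d1, z_v1, z_d1 * z_v1)
print(r_squared, 100 * r_squared)
```

Read this way, roughly 62% of the variance is shared between velocity and distance in this data set.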


Significance and likelihood
  One-tailed table usage
  What is the likelihood that 24 pairs of random numbers have corr(X,Y) ≥ 0.79 by chance?
  What if we only have 5 samples?
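Instead of a lookup table, the one-tailed probability can be estimated numerically. The sketch below assumes the two variables are independent and normally distributed, in which case the sample correlation r of N pairs has a density proportional to (1 - r^2)^((N - 4)/2); the function names and the simple midpoint integration are illustrative choices, and N is assumed to be at least 4.

```python
def null_density(r, n):
    """Unnormalized density of r for n uncorrelated, normally distributed pairs."""
    return (1.0 - r * r) ** ((n - 4) / 2.0)

def integrate(f, a, b, steps=20000):
    """Midpoint-rule integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def one_tailed_p(r_obs, n):
    """Probability of obtaining corr >= r_obs by chance with n pairs (n >= 4)."""
    upper_tail = integrate(lambda r: null_density(r, n), r_obs, 1.0)
    total = integrate(lambda r: null_density(r, n), -1.0, 1.0)
    return upper_tail / total

p_24 = one_tailed_p(0.79, 24)  # 24 data points, as in the Hubble example
p_5 = one_tailed_p(0.79, 5)    # same correlation, only 5 samples
print(p_24, p_5)
```

Under these assumptions the same correlation of 0.79 is overwhelmingly significant with 24 points but only marginal with 5, where the chance probability is a few percent.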

Limitations
  Only captures linear dependence
  Sensitive to outliers
  Affected by correlated errors
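Two of these limitations are easy to see on toy data. The numbers below are made up for illustration only: a purely quadratic relation gives zero linear correlation, and a single outlier can flip the sign of an otherwise perfect correlation.

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation using population (divide-by-N) moments."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sigma_x * sigma_y)

# 1) Nonlinear dependence: y is fully determined by x, yet corr is 0.
xs_sym = [-2, -1, 0, 1, 2]
ys_quadratic = [x * x for x in xs_sym]
r_quadratic = pearson_corr(xs_sym, ys_quadratic)

# 2) Outlier sensitivity: one bad point changes the sign of the correlation.
xs_line = [1, 2, 3, 4, 5]
ys_clean = [1, 2, 3, 4, 5]
ys_outlier = [1, 2, 3, 4, -20]
r_clean = pearson_corr(xs_line, ys_clean)
r_outlier = pearson_corr(xs_line, ys_outlier)

print(r_quadratic, r_clean, r_outlier)
```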

Cross-correlation
  Signal processing: search a long series of data for a short feature signal
  $$(f \star g)(t) = \int f^*(\tau)\, g(t + \tau)\, d\tau$$

[Figure: f and g versus t, and (f ⋆ g)(t) versus t]
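A discrete version of this search can be sketched as follows. The feature, the background values, and the insertion position are made up for illustration; for real-valued signals the complex conjugate in the definition has no effect, so it is omitted here.

```python
def cross_correlate(feature, series):
    """Discrete cross-correlation: sum over k of feature[k] * series[t + k]."""
    m = len(feature)
    return [
        sum(feature[k] * series[t + k] for k in range(m))
        for t in range(len(series) - m + 1)
    ]

# Short feature signal to search for (illustrative values).
feature = [1.0, 3.0, 1.0]

# Long series: a small alternating background with the feature added at t = 12.
series = [0.2 if i % 2 == 0 else -0.2 for i in range(20)]
for k, value in enumerate(feature):
    series[12 + k] += value

cc = cross_correlate(feature, series)

# The offset with the largest cross-correlation marks where the feature sits.
best_offset = max(range(len(cc)), key=lambda t: cc[t])
print(best_offset)
```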

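Autocorrelation
  Finding repeating patterns
  Identifying fundamental lengths or time scales in a noisy signal
  Cross-correlation with self:
  $$R(\tau) = \frac{E[(X_t - \mu)(X_{t+\tau} - \mu)]}{\sigma^2}$$
  or simply
  $$R(\tau) = E[(X_t - \mu)(X_{t+\tau} - \mu)]$$

[Figure: example signal, labeled "f0.1"]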
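As a rough illustration of finding a time scale in a noisy signal, the sketch below builds a synthetic sine wave of period 10 samples with Gaussian noise and reads the period off the first peak of the normalized autocorrelation. The signal, noise level, seed, and the first-local-maximum rule are all assumptions made for this example.

```python
import math
import random

def autocorrelation(xs, max_lag):
    """Normalized autocorrelation R(lag) for lag = 0..max_lag (R(0) = 1)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return [
        sum((xs[t] - mu) * (xs[t + lag] - mu) for t in range(n - lag)) / (n * var)
        for lag in range(max_lag + 1)
    ]

# Synthetic noisy signal: period of 10 samples (frequency 0.1 per sample).
rng = random.Random(0)
signal = [math.sin(2 * math.pi * t / 10) + rng.gauss(0.0, 0.2) for t in range(1000)]

R = autocorrelation(signal, 30)

# The first local maximum after lag 0 estimates the repetition period.
period = next(
    lag for lag in range(1, 30)
    if R[lag] > R[lag - 1] and R[lag] >= R[lag + 1]
)
print(period)
```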

Application
  Correlation coefficient: well, um… everywhere?
  Auto- and cross-correlation:
    Optics: laser coherence, spectra measurement, ultra-short laser pulses
    Signal processing: musical beats
    Astronomy: pulsar frequency
    Correlation in space: 2-point (n-point) correlation functions and power spectrum

Example: 2-point correlation in weak lensing
  Assumption: galaxy shapes are entirely random
  Correlation of shape parameter "e" is 0
  Shear induces correlation at length scale ~arcmin
  Atmosphere and systematics induce correlated noise

[Figure: typical 2-point correlation plots, no shear, but with noise and systematics; labels: 1 arcmin, 5 arcmin]

  Shear signal is at the 1% level
  Controlling systematics is the key!

Principal Component Analysis
  Revealing the internal structure of data in a way that best explains its variance
  Conceptually, it is a transformation of the coordinate system that rotates the data into its eigenspace, where the greatest variance of any projection of the data lies along the first coordinate
  High-dimension analysis


[Figure: Hubble data, velocity (km/s) versus distance (Mpc)]

Mathematical operation
  Recognize the important variance in the data: the Principal Components (PCs)
  Reconstruct the data using only the low-order PCs, thus compressing the dimension of the data
Assumptions:
  Data can be represented by a linear combination of a certain basis
  Data is Gaussian

Example – Hubble's law again
  Get data: $\{(x_i, y_i)\}_{24\times 2}$
  Subtract the mean: $\{(X_i, Y_i)\} = \{(x_i - \mathrm{ave}(x),\; y_i - \mathrm{ave}(y))\}_{24\times 2}$
  Calculate the covariance matrix: $C_{2\times 2} = \{(X_i, Y_i)\}^T \{(X_i, Y_i)\} / N$
  Calculate the 2 eigenvalues and the 2 normalized eigenvectors of C
  The eigenvectors point along the 2 PCs, and the eigenvalues indicate their relative weightings


[Figure: Hubble data, velocity (km/s) versus distance (Mpc), with PC1 and PC2 labeled]

PC1, eigenvalue = 132087
PC2, eigenvalue = 0.1503
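The two eigenvalues can be reproduced from the summary statistics of the earlier correlation table. This is a sketch that assumes the covariance matrix was built from the raw (unstandardized) distances in Mpc and velocities in km/s, and it uses the closed-form eigenvalues of a symmetric 2x2 matrix; variable names are illustrative.

```python
import math

# Values quoted earlier in the talk for Hubble's 24 data points.
var_d = 0.399304   # variance of distance, Mpc^2
var_v = 132087.1   # variance of velocity, (km/s)^2
corr = 0.789639    # correlation coefficient of d and v

# Off-diagonal element of the covariance matrix: cov = corr * sigma_d * sigma_v.
cov_dv = corr * math.sqrt(var_d) * math.sqrt(var_v)

# Eigenvalues of the symmetric matrix [[var_d, cov_dv], [cov_dv, var_v]].
half_trace = (var_d + var_v) / 2
root = math.sqrt(((var_d - var_v) / 2) ** 2 + cov_dv ** 2)
eigenvalue_pc1 = half_trace + root
eigenvalue_pc2 = half_trace - root

print(eigenvalue_pc1, eigenvalue_pc2)
```

Because the velocity variance is so much larger than the distance variance in these units, PC1 lies almost entirely along the velocity axis; standardizing both variables first would change that balance.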

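Recognize the important PC and ignore the others
  To form a new basis of compressed dimension: $\{V\}_{2\times 1}$
Reconstruct the data using 1 eigenvector to rotate the data back (project the 24×2 centered data onto V, then map back):
  $\{X'_i, Y'_i\} = \{X_i, Y_i\}\, V V^T$
Shift the data back and get the final reconstructed data:
  $\{X, Y\}_{\mathrm{reconstruct}} = \{X'_i + \mathrm{ave}(x),\; Y'_i + \mathrm{ave}(y)\}$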
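The whole procedure (subtract the mean, build the covariance matrix, find the eigenvectors, keep only PC1, rotate back, and shift back) can be sketched for two-dimensional data in plain Python. The function name and the toy points are assumptions for illustration; the toy data lie exactly on a line, so a single PC reconstructs them perfectly.

```python
import math

def pca_2d_reconstruct(points):
    """Reconstruct 2-D points from their first principal component only."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Covariance matrix C = X^T X / N, written as [[a, b], [b, c]].
    a = sum(cx * cx for cx, _ in centered) / n
    b = sum(cx * cy for cx, cy in centered) / n
    c = sum(cy * cy for _, cy in centered) / n

    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (a + c) / 2
    root = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = half_trace + root, half_trace - root

    # Eigenvector of the larger eigenvalue, normalized to unit length.
    if b != 0:
        vx, vy = b, lam1 - a
    elif a >= c:
        vx, vy = 1.0, 0.0
    else:
        vx, vy = 0.0, 1.0
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm

    # Project onto PC1, rotate back, then shift back by the means.
    reconstructed = []
    for cx, cy in centered:
        score = cx * vx + cy * vy
        reconstructed.append((score * vx + mean_x, score * vy + mean_y))
    return (lam1, lam2), (vx, vy), reconstructed

# Toy data (illustrative only): points exactly on y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
eigenvalues, pc1, reconstructed = pca_2d_reconstruct(points)
print(eigenvalues, pc1)
```

For scattered data the second eigenvalue is nonzero, and the reconstruction collapses every point onto the PC1 line through the mean.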

[Figure: two plots over the Hubble data range; axis labels not recoverable]

Example – characterize the shape of CCD chips
  Fit 27 chip shapes using 4th-order polynomials
  Data matrix of dimension 27*15 (consistent with the 15 coefficients of a 4th-order polynomial in two variables)
  15 eigenvalues and 15 eigenvectors
  Choose 15, 5, or 1 PCs to reconstruct the shapes

[Figure: panels of chip-shape plots reconstructed with 15, 5, and 1 PCs; only axis ticks survive]

Applications
  Pattern recognition (http://icg.cityu.edu.hk/private/PowerPoint/PCA.ppt)
  Multi-dimension data analysis
  Noise reduction
  Image analysis

Conclusion
  Data are only useful if we know how to interpret them
  Various statistical techniques have been developed
  Correlation analysis and PCA are the two common techniques I introduced today

“It can aid understanding reality, but it is no substitute for insight, reason, and imagination. It is a flashlight of the mind. It must be turned on and directed by our interests and knowledge; and it can help gratify and illuminate both. But like a flashlight, it can be uselessly turned on in the daytime, used unnecessarily beneath a lamp, employed to search for something in the wrong room, or become a play thing.”

R.J. Rummel, department of political science, University of Hawaii

Reference
  Lindsay I Smith, A Tutorial on Principal Components Analysis
  R.J. Rummel, Understanding Correlation
  …and yes, I learned everything from Wikipedia

FIN

Thank you for your attention!
