The Hidden Message Some useful techniques for data analysis Chihway Chang, Feb 18’ 2009.


Page 1: The Hidden Message Some useful techniques for data analysis Chihway Chang, Feb 18’ 2009.

The Hidden Message

Some useful techniques for data analysis

Chihway Chang, Feb 18’ 2009

Page 2

A famous example…
Hubble’s law: $v = H_0 d$
Expansion of the universe

Page 3

What do we learn?
Seemingly crappy data can lead to astonishing discoveries
Insight + imagination
Natural laws are usually simple
Most parts of our observable Universe are linear, spherically symmetric, Gaussian or Poisson
Data analysis should be easy! …theoretically

Page 4

We all know how this happens
Process of data analysis:
Sampling: Central Limit Theorem (CLT), strategy of sampling
Model fitting: linear regression, maximum likelihood, chi-square
Correlations
Or…
Collect lots of data
Stare at your data

[Figure: scatter plot of data, captioned “???@#$”]

Page 5

Outline
Useful techniques in data analysis:
Correlations: linear correlation, cross-correlation, autocorrelation
Principal Component Analysis (PCA)

Page 6

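Correlations: linear correlation

Linear relation: $y = f(x) = Ax + B$

Data: $\{X_i\}, \{Y_i\}$

Standard scores: $z_{X,i} = \frac{X_i - \bar{X}}{\sigma_X}$, $z_{Y,i} = \frac{Y_i - \bar{Y}}{\sigma_Y}$

Correlation coefficient (Pearson product-moment):
$\mathrm{corr}(X,Y) = \frac{1}{N}\sum_i z_{X,i}\, z_{Y,i} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E(X)^2}\,\sqrt{E(Y^2) - E(Y)^2}}$

Coefficient of determination: $\mathrm{corr}(X,Y)^2$

Variance in common: $\mathrm{corr}(X,Y)^2 \times 100\%$

Correlation matrix: $\mathrm{Corr}_{ij} = \mathrm{corr}(X_i, X_j)$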
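The definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's code: the function names and the toy data are hypothetical, and the standard deviation divides by N (population form) so that the average of the z-score products equals the correlation coefficient.

```python
import math

def standard_scores(values):
    """Return z-scores using the population standard deviation (divide by N)."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sigma for v in values]

def pearson_corr(xs, ys):
    """corr(X, Y) = (1/N) * sum over i of z_X,i * z_Y,i."""
    zx = standard_scores(xs)
    zy = standard_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Hypothetical toy data (not from the slides): y is roughly linear in x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_corr(xs, ys)
print("correlation coefficient:", r)
print("coefficient of determination:", r ** 2)
print("variance in common: %.1f%%" % (100 * r ** 2))
```

A correlation matrix follows by calling `pearson_corr` on every pair of variables.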

Page 7

Example – Hubble’s law

[Figure: Hubble’s original 1929 data; x-axis in Mpc, y-axis in km/s]

We have 24 data points; we’d like to know how v and d correlate.

Page 8

N     d         d*d        z_d         v        v*v        z_v         z_v*z_d
1     0.032     0.001024   -1.39163    170      28900      -0.5589     0.777778
2     0.034     0.001156   -1.38846    290      84100      -0.22872    0.317567
3     0.214     0.045796   -1.10361    -130     16900      -1.38435    1.527778
4     0.263     0.069169   -1.02606    -70      4900       -1.21926    1.251038
5     0.275     0.075625   -1.00707    -185     34225      -1.53568    1.546545
……
19    1.4       1.96       0.773257    500      250000     0.349097    0.269942
20    1.7       2.89       1.248012    960      921600     1.614788    2.015274
21    2         4          1.722767    500      250000     0.349097    0.601412
22    2         4          1.722767    850      722500     1.312122    2.260481
23    2         4          1.722767    800      640000     1.174547    2.023471
24    2         4          1.722767    1090     1188100    1.972483    3.398128
Ave   0.911375  1.229908   0.399304    373.125  271309.4   132087.1    0.789639
Ave2                       0.631905                        363.4379    0.623529

Standard scores: columns z_d and z_v
In the Ave row, the z_d and z_v columns hold the variances of d and v; in the Ave2 row they hold the square roots (the standard deviations)
Correlation coefficient: 0.789639 (Ave of z_v*z_d)
Coefficient of determination: 0.623529 (Ave2 row)
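As a quick consistency check, the summary rows of the table can be reproduced from each other. The sketch below uses only numbers that appear in the table (the Ave values and the first data row); the variable names are illustrative.

```python
import math

# Summary values copied from the table above.
mean_d, mean_d2 = 0.911375, 1.229908   # Ave of d and of d*d
mean_v, mean_v2 = 373.125, 271309.4    # Ave of v and of v*v
r = 0.789639                           # Ave of z_v*z_d

# Variance = E(X^2) - E(X)^2; the standard deviation is its square root.
sigma_d = math.sqrt(mean_d2 - mean_d ** 2)
sigma_v = math.sqrt(mean_v2 - mean_v ** 2)

# Standard scores of the first data point (d = 0.032 Mpc, v = 170 km/s).
z_d1 = (0.032 - mean_d) / sigma_d
z_v1 = (170 - mean_v) / sigma_v

print(sigma_d, sigma_v)          # compare with the Ave2 row
print(z_d1, z_v1, z_d1 * z_v1)   # compare with row 1
print(r ** 2)                    # coefficient of determination
```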

Page 9

Example – Hubble’s law

[Figure: Hubble’s original 1929 data; x-axis in Mpc, y-axis in km/s]

We have 24 data points; we’d like to know how v and d correlate.

Page 10

Significance and likelihood
One-tailed table usage
What is the likelihood for 24 random number sets to have corr(X,Y) ≥ 0.79 by chance?
What if we only have 5 samples?
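Instead of a printed table, the one-tailed probability can be estimated numerically. The sketch below is an assumption-laden illustration rather than the method used in the talk: it assumes both variables are independent and normally distributed, in which case the sample correlation r of n points has density proportional to (1 - r^2)^((n-4)/2), and it integrates that density from the observed value up to 1 with Simpson's rule. The function names are hypothetical.

```python
import math

def r_null_density(r, n):
    """Density of the sample correlation r for n points when the true
    correlation is 0 (assumes bivariate normal data, n >= 4)."""
    a = (n - 2) / 2.0
    beta = math.gamma(0.5) * math.gamma(a) / math.gamma(a + 0.5)  # B(1/2, (n-2)/2)
    return max(0.0, 1.0 - r * r) ** ((n - 4) / 2.0) / beta

def one_tailed_p(r0, n, steps=20000):
    """P(corr >= r0) by Simpson's rule on [r0, 1]; steps must be even."""
    h = (1.0 - r0) / steps
    total = r_null_density(r0, n) + r_null_density(1.0, n)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * r_null_density(r0 + k * h, n)
    return total * h / 3.0

p24 = one_tailed_p(0.79, 24)
p5 = one_tailed_p(0.79, 5)
print("N = 24:", p24)
print("N = 5: ", p5)
```

Under these assumptions the same correlation of 0.79 is far less convincing with 5 samples than with 24, which is the point of the second question on the slide.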

Page 11

Limitations
Only captures linear dependence
Sensitive to outliers
Affected by correlated errors

Page 12

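Cross-correlation
Signal processing: search a long series of data for a short feature signal

$(f \star g)(t) = \int f^*(\tau)\, g(t + \tau)\, d\tau$

[Figure: signals f and g plotted against t, and their cross-correlation (f*g)(t)]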
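A discrete version of this search can be sketched as follows. The feature shape, the noise level, and the hidden offset are all hypothetical values chosen for illustration; the signals are real-valued, so the complex conjugate in the definition has no effect.

```python
import random

def cross_correlate(feature, series):
    """Discrete (f * g)(t) = sum over k of f[k] * g[t + k], for every offset t
    at which the feature fits entirely inside the series."""
    m = len(feature)
    return [
        sum(feature[k] * series[t + k] for k in range(m))
        for t in range(len(series) - m + 1)
    ]

random.seed(0)
feature = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0]

# Long noisy series with the feature buried at a known (hypothetical) offset.
series = [random.gauss(0.0, 0.2) for _ in range(200)]
true_offset = 120
for k, value in enumerate(feature):
    series[true_offset + k] += value

scores = cross_correlate(feature, series)
best_offset = max(range(len(scores)), key=scores.__getitem__)
print("feature found at offset", best_offset)
```

The peak of the cross-correlation marks the offset at which the short feature best lines up with the long series.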

Page 13

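Autocorrelation
Finding repeating patterns
Identifying fundamental lengths or time scales in a noisy signal
Cross-correlation with self:

$R(\tau) = \frac{E[(X_t - \mu)(X_{t+\tau} - \mu)]}{\sigma^2}$

or simply

$R(\tau) = E[(X_t - \mu)(X_{t+\tau} - \mu)]$

[Figure: example signal; surviving label: f 0.1]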
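To illustrate how autocorrelation exposes a time scale hidden in noise, here is a small sketch. The test signal (a sine of frequency 0.1 cycles per sample plus Gaussian noise) is an assumption made for the example, as are the function name and the range of lags searched.

```python
import math
import random

def autocorrelation(xs, lag):
    """Normalized R(tau) = E[(X_t - mu)(X_{t+tau} - mu)] / sigma^2."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    cov = sum((xs[t] - mu) * (xs[t + lag] - mu) for t in range(n - lag)) / (n - lag)
    return cov / var

random.seed(1)
freq = 0.1  # hypothetical frequency: one period every 10 samples
xs = [math.sin(2 * math.pi * freq * t) + random.gauss(0.0, 0.3) for t in range(1000)]

# Look for the first repeat of the pattern among lags 5..15.
lags = range(5, 16)
best_lag = max(lags, key=lambda lag: autocorrelation(xs, lag))
print("strongest repeat at lag", best_lag)
```

The lag of the strongest peak is the period of the underlying pattern, i.e. 1/freq samples.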

Page 14

Application
Correlation coefficient: Well, um… everywhere?
Auto- & cross-correlation:
Optics: laser coherence, spectra measurement, ultra-short laser pulses
Signal processing: musical beats
Astronomy: pulsar frequency
Correlation in space: 2-point (n-point) correlation functions & power spectrum

Page 15

Example: 2-point correlation in weak lensing
Assumption: galaxy shapes are entirely random
Correlation of shape parameter “e” is 0
Shear induces correlation at length scale ~arcmin
Atmosphere and systematics induce correlated noise
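The assumption that random shapes give zero correlation can be illustrated with a toy estimator. This is a deliberately simplified sketch: the shape parameter is treated as a single scalar per galaxy (real weak-lensing analyses correlate tangential and cross ellipticity components), and the field size, galaxy count, shape scatter, and bins are hypothetical.

```python
import math
import random

def two_point_correlation(points, values, bin_edges):
    """Average of e_i * e_j over all pairs whose separation falls in each bin."""
    sums = [0.0] * (len(bin_edges) - 1)
    counts = [0] * (len(bin_edges) - 1)
    for i in range(len(points)):
        x1, y1 = points[i]
        for j in range(i + 1, len(points)):
            x2, y2 = points[j]
            sep = math.hypot(x1 - x2, y1 - y2)
            for b in range(len(sums)):
                if bin_edges[b] <= sep < bin_edges[b + 1]:
                    sums[b] += values[i] * values[j]
                    counts[b] += 1
                    break
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]

random.seed(2)
n_gal = 300
# Hypothetical 60 x 60 arcmin field with purely random shapes.
points = [(random.uniform(0, 60), random.uniform(0, 60)) for _ in range(n_gal)]
shapes = [random.gauss(0.0, 0.3) for _ in range(n_gal)]
bin_edges = [0, 5, 10, 20, 40]  # separation bins in arcmin

xi_values = two_point_correlation(points, shapes, bin_edges)
for lo, hi, xi in zip(bin_edges, bin_edges[1:], xi_values):
    print("%2d-%2d arcmin: %+.4f" % (lo, hi, xi))
```

With no shear and no systematics every bin scatters around zero; a coherent signal or correlated noise would show up as a nonzero value at the affected separations.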

Page 16

[Figure: typical 2-point correlation plots, no shear, but with noise and systematics; scale marks at 1 arcmin and 5 arcmin]

Shear signal is at the 1% level
Controlling systematics is the key!

Page 17

Principal Component Analysis
Revealing the internal structure of data in a way that best explains its variance
Conceptually, it is a coordinate transformation that rotates the data into its eigenspace, so that the direction of greatest variance lies along the first coordinate, the next greatest along the second, and so on
High-dimension analysis

[Figure: Hubble’s original 1929 data; x-axis in Mpc, y-axis in km/s]

Page 18

Mathematical operation
Recognize the important variance in the data: the Principal Components (PCs)
Reconstruct the data using only the lowest-order PCs, thus compressing the dimension of the data
Assumptions:
Data can be represented by a linear combination of certain basis vectors
Data are Gaussian

Page 19

Example – Hubble’s law again

Get data: $\{(x_i, y_i)\}$, a 24×2 matrix
Subtract mean: $\{(X_i, Y_i)\} = \{(x_i - \mathrm{ave}(x),\, y_i - \mathrm{ave}(y))\}$, 24×2
Calculate covariance matrix: $C_{2\times 2} = \{(X_i, Y_i)\}^T \{(X_i, Y_i)\} / N$
Calculate and normalize the 2 eigenvalues and 2 eigenvectors of C
The eigenvectors point along the 2 PCs, and the eigenvalues indicate their relative weightings
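These steps can be sketched in plain Python for the 2×2 case, where the eigenvalues of a symmetric matrix have a closed form. The data below are only the 11 (d, v) rows that survive in this transcript, not the full set of 24, so the resulting numbers are not expected to match the eigenvalues quoted in the talk; all names are illustrative.

```python
import math

# (d [Mpc], v [km/s]) pairs: only the 11 rows that survive in this transcript.
data = [(0.032, 170), (0.034, 290), (0.214, -130), (0.263, -70), (0.275, -185),
        (1.4, 500), (1.7, 960), (2, 500), (2, 850), (2, 800), (2, 1090)]

# Subtract the mean of each column.
n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
centered = [(x - mean_x, y - mean_y) for x, y in data]

# Covariance matrix C = X^T X / N, written out as [[a, b], [b, c]].
a = sum(X * X for X, _ in centered) / n
b = sum(X * Y for X, Y in centered) / n
c = sum(Y * Y for _, Y in centered) / n

# Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
half_trace = (a + c) / 2
radius = math.hypot((a - c) / 2, b)
eigenvalues = [half_trace + radius, half_trace - radius]

def unit_eigenvector(lam):
    """Eigenvector (b, lam - a) of C, normalized; valid when b != 0."""
    norm = math.hypot(b, lam - a)
    return (b / norm, (lam - a) / norm)

pcs = [unit_eigenvector(lam) for lam in eigenvalues]
for lam, pc in zip(eigenvalues, pcs):
    print("eigenvalue %.4f, direction (%.6f, %.6f)" % (lam, pc[0], pc[1]))
```

For more than two variables the same steps apply, but the eigen-decomposition would normally come from a linear-algebra library rather than a closed form.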

Page 20

[Figure: Hubble’s original 1929 data (x-axis in Mpc, y-axis in km/s) with the two principal components marked]

PC1, eigenvalue=132087

PC2, eigenvalue=0.1503
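A short piece of arithmetic shows how lopsided these two weightings are. The eigenvalues are the ones quoted on the slide; the interpretation in the comments is one reading, not a statement from the talk.

```python
# Eigenvalues quoted on the slide.
pc1, pc2 = 132087.0, 0.1503

# Fraction of the total variance carried by the first principal component.
fraction_pc1 = pc1 / (pc1 + pc2)
print("fraction of total variance along PC1:", fraction_pc1)

# Note: 132087 is also the variance of v (in km/s) in the earlier table.
# PCA on unscaled data is dominated by the variable with the largest numerical
# spread, so here PC1 lies almost along the velocity axis.
```

Converting both variables to standard scores before the analysis would put them on an equal footing.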

Page 21

Recognize the important PC and ignore the others
Form a new basis of compressed dimension, $\{V\}_{2\times 1}$
Reconstruct data using 1 eigenvector: project onto V, then rotate back
$\{(X'_i, Y'_i)\} = \{(X_i, Y_i)\}\, V V^T$
Shift data back to get the final reconstructed data
$\{(X, Y)\}_{\mathrm{reconstruct}} = \{(X'_i + \mathrm{ave}(x),\, Y'_i + \mathrm{ave}(y))\}$
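The reconstruction can be sketched as: project each mean-subtracted point onto the leading eigenvector, map that single number back into two dimensions, and add the means back. As before, the data are only the 11 rows that survive in this transcript and the names are illustrative, so this is a sketch of the procedure rather than a reproduction of the plots.

```python
import math

# (d [Mpc], v [km/s]) pairs: only the 11 rows that survive in this transcript.
data = [(0.032, 170), (0.034, 290), (0.214, -130), (0.263, -70), (0.275, -185),
        (1.4, 500), (1.7, 960), (2, 500), (2, 850), (2, 800), (2, 1090)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
centered = [(x - mean_x, y - mean_y) for x, y in data]

# Covariance matrix [[a, b], [b, c]] and its largest eigenvalue.
a = sum(X * X for X, _ in centered) / n
b = sum(X * Y for X, Y in centered) / n
c = sum(Y * Y for _, Y in centered) / n
lam1 = (a + c) / 2 + math.hypot((a - c) / 2, b)

# Leading unit eigenvector: the compressed basis {V} (2x1).
norm = math.hypot(b, lam1 - a)
v = (b / norm, (lam1 - a) / norm)

# Project onto V (one number per point), then map back and undo the shift.
scores = [X * v[0] + Y * v[1] for X, Y in centered]
reconstructed = [(s * v[0] + mean_x, s * v[1] + mean_y) for s in scores]

for original, approx in zip(data, reconstructed):
    print(original, "->", (round(approx[0], 3), round(approx[1], 1)))
```

Every reconstructed point lies on a single straight line through the mean, which is what compressing two dimensions down to one PC means.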

[Figure: two plots accompanying the reconstruction; axis tick labels omitted]

Page 22

Example – characterize shape of CCD chips
Fit 27 chip shapes using 4th-order polynomials
Data matrix of dimension 27×15
15 eigenvalues and 15 eigenvectors
Choose 15, 5, 1 PCs to reconstruct shapes

[Figure: 3D surface plots of chip shapes; axis tick labels omitted]

Page 23

[Figure: grid of 3D surface plots of chip shapes; axis tick labels omitted]

Page 24

Applications
Pattern recognition (http://icg.cityu.edu.hk/private/PowerPoint/PCA.ppt)
Multi-dimension data analysis
Noise reduction
Image analysis

Page 25

Conclusion
Data are only useful if we know how to interpret them
Various statistical techniques have been developed
Correlation analysis and PCA are the two common techniques I introduced today

“It can aid understanding reality, but it is no substitute for insight, reason, and imagination. It is a flashlight of the mind. It must be turned on and directed by our interests and knowledge; and it can help gratify and illuminate both. But like a flashlight, it can be uselessly turned on in the daytime, used unnecessarily beneath a lamp, employed to search for something in the wrong room, or become a play thing.”

R.J. Rummel, department of political science, University of Hawaii

Page 26

Reference
A Tutorial on Principal Components Analysis, Lindsay I Smith
Understanding Correlation, R.J. Rummel

…and yes, I learned everything from Wikipedia

Page 27

FIN

Thank you for your attention!