
(Review) Linear Algebra, Statistical Inference, Python & R

CS57300 - Data Mining, Spring 2016

Instructor: Bruno Ribeiro

© 2016 Bruno Ribeiro

Goals today

} Review some basic concepts
◦ Linear Algebra
◦ Statistical Inference

} Introduce useful tools in Python and R

} Introduce the Scholar cluster

But before…

} A woman is leading an environmental protest outside PMU today
} What is more likely?
a) That she is an investment banker?
b) That she is an investment banker studying Environmental Engineering at Purdue?

The law of total probability answers this:

P[A] = ∑_{b∈B} P[A, b]

Since each joint probability P[A, b] is at most the marginal P[A], option (b) can never be more likely than option (a).

Linear Algebra Review


Why Linear Algebra?

} Why Algebra?
◦ Computing is all about algebra
◦ A way to mathematically describe data

} Why Linear?
◦ Fast tools
◦ Easy to understand
◦ Many non-linear problems can be transformed into or approximated as linear systems

} The combination works well in practice

Example of Linear Algebra Application: Explain Relationships

} Marvel character appears in a comic book
} Described as an adjacency matrix (representing a graph)

[Figure: bipartite adjacency matrix: rows are Marvel characters, columns are comic books]

Matrix Transpose & Symmetry

3. (AB)ᵀ = BᵀAᵀ
3.1 Symmetric matrix: A = Aᵀ
3.2 (Bᵀ)ᵀ = B
3.3 (BBᵀ)ᵀ = ?

Cocitation networks: C = AᵀA (Newman's notation: C = AAᵀ)

Bibliographic coupling: B = AAᵀ
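As a minimal sketch of these products (the 3-character × 4-comic matrix below is made up for illustration, not the actual Marvel data):

import numpy as np

# Hypothetical appearance matrix: rows = Marvel characters, cols = comic books
# A[i, j] = 1 if character i appears in comic book j
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1]])

C = A.T @ A   # comic-to-comic: entry (j, k) counts characters shared by comics j and k
B = A @ A.T   # character-to-character: entry (i, l) counts comics shared by characters i and l

print(C)                     # 4 x 4
print(B)                     # 3 x 3
print(np.allclose(B, B.T))   # True: any product of the form A Aᵀ is symmetric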

Important Properties of Linear Algebra

1. Matrix multiplication is not commutative:

   [0 1] [0 0]   [1 0]        [0 0] [0 1]   [0 0]
   [0 0] [1 0] = [0 0]   but  [1 0] [0 0] = [0 1]

2. Trace: Tr(AB) = Tr(BA)

3. Inner product: ⟨x, y⟩ = xᵀy = ∑_i x_i y_i
3.1 ⟨x, x⟩ ≥ 0
3.2 ∥x∥² ≡ ⟨x, x⟩
3.3 ⟨x, y⟩ = 0 iff x ⊥ y
3.4 ⟨x, y⟩ = ∥x∥∥y∥cos θ; thus ⟨x, u⟩ = ∥x∥cos θ, where u = y/∥y∥
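A quick numerical check of these properties (a numpy sketch, not from the original slides):

import numpy as np

A = np.array([[0, 1],
              [0, 0]])
B = np.array([[0, 0],
              [1, 0]])

print(A @ B)                               # [[1, 0], [0, 0]]
print(B @ A)                               # [[0, 0], [0, 1]]  -- AB != BA
print(np.trace(A @ B) == np.trace(B @ A))  # True: Tr(AB) = Tr(BA) always holds

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(x @ y)                               # inner product <x, y> = sum_i x_i y_i = 1.0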

Common Matrix Representations: Adjacency Matrix

Undirected graph: A = Aᵀ

Bipartite graph:

A = [ 0   B ]
    [ Bᵀ  0 ]

k connected components (undirected):

A = [ A₁  ⋯  0  ]
    [ ⋮   ⋱  ⋮  ]
    [ 0   ⋯  Aₖ ]
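For instance, the bipartite block form can be assembled with np.block (a sketch; the matrix B here is an arbitrary 2 × 3 example):

import numpy as np

B = np.array([[1, 0, 1],
              [0, 1, 1]])           # 2 x 3 bipartite block
Z1 = np.zeros((2, 2), dtype=int)
Z2 = np.zeros((3, 3), dtype=int)

# Bipartite adjacency matrix A = [[0, B], [B^T, 0]]
A = np.block([[Z1, B],
              [B.T, Z2]])
print(np.array_equal(A, A.T))       # True: the bipartite graph is undirected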

Orthogonal Vectors (Linearly Independent)

} Consider vectors u₁, …, uₖ
} Normalized if uᵢuᵢᵀ = 1, also written ∥uᵢ∥ = 1, i = 1, …, k
} Orthogonal if uᵢuⱼᵀ = 0, also written uᵢ ⊥ uⱼ, i ≠ j
} Orthonormal if both

If U is orthonormal then U⁻¹ = Uᵀ, so UU⁻¹ = UUᵀ = I, where I is the identity matrix

We will eventually use this to show a drawback of Principal Component Analysis (PCA)

Orthogonal Projections

Note that I = UUᵀ = ∑_{i=1}^{k} uᵢuᵢᵀ

Thus a = Ia = U(Uᵀa), for a ∈ ℝᵏ

The vector (Uᵀa) is the projection of a onto U

Thus, a = ∑_{i=1}^{k} (uᵢᵀa) uᵢ
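A numerical sanity check of this identity (a sketch using an arbitrary rotation matrix as the orthonormal basis):

import numpy as np

# Orthonormal basis of R^2 (columns of U): a rotation by 0.3 radians
U = np.array([[np.cos(0.3), -np.sin(0.3)],
              [np.sin(0.3),  np.cos(0.3)]])

a = np.array([2.0, -1.0])
coords = U.T @ a                         # projection coefficients (U^T a)
print(np.allclose(U @ coords, a))        # True: a = U (U^T a)

# Equivalent sum of rank-1 projections: a = sum_i (u_i^T a) u_i
recon = sum((U[:, i] @ a) * U[:, i] for i in range(U.shape[1]))
print(np.allclose(recon, a))             # True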

Can we have a non-orthogonal basis?

} Yes
} For instance:

Let U = [u₁, u₂] be an orthonormal basis and let

v₁ = 2u₁ + u₂
v₂ = u₁ + 2u₂

The vectors v₁ and v₂ are not orthogonal, v₁ᵀv₂ = 4, but they still form a basis for ℝ²
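Checking this example numerically (a sketch that takes u₁, u₂ to be the standard basis of ℝ²):

import numpy as np

u1 = np.array([1.0, 0.0])
u2 = np.array([0.0, 1.0])
v1 = 2 * u1 + u2
v2 = u1 + 2 * u2

print(v1 @ v2)                                            # 4.0: not orthogonal
print(np.linalg.matrix_rank(np.column_stack([v1, v2])))   # 2: still a basis for R^2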

Eigenvalues and Eigenvectors

A > 0 is n×n full rank; λ is an eigenvalue if

det(A − λI) = 0

and x is a (right) eigenvector if

Ax = λx,  where we assume ∥x∥ = 1.

A[x₁, ⋯, xₙ] = [x₁, ⋯, xₙ] diag(λ₁, …, λₙ),

where xᵢ is the i-th eigenvector of A, ordered s.t. λ₁ ≥ ⋯ ≥ λₙ. If A has n linearly independent eigenvectors,

A = VΛV⁻¹

The inverse is A⁻¹ = VΛ⁻¹V⁻¹, where Λ⁻¹ = diag(1/λ₁, …, 1/λₙ)
The square is A² = VΛ²V⁻¹
If A is symmetric positive semidefinite, V⁻¹ = Vᵀ
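These identities are easy to verify numerically (a sketch with a small symmetric positive definite matrix):

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])             # symmetric positive definite

lam, V = np.linalg.eigh(A)             # eigenvalues (ascending) and orthonormal V
Lam = np.diag(lam)

print(np.allclose(A, V @ Lam @ V.T))                               # A = V Lambda V^T
print(np.allclose(np.linalg.inv(A), V @ np.diag(1 / lam) @ V.T))   # A^-1 = V Lambda^-1 V^T
print(np.allclose(A @ A, V @ Lam**2 @ V.T))                        # A^2 = V Lambda^2 V^T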

} X(i,j) = value of user i for property j

◦ X(Alice, cholesterol) = 10

◦ X(i,j) = number of times i buys j

◦ X(i,j) = how much i pays j

◦ X(i,j) = 1 if i and j are friends, 0 otherwise

◦ X(i,j) = temperature of sensor j at time i


Singular Value Decomposition (SVD)

[Figure: a data matrix X, with a highlighted entry at row i, column j]

SVD Dimensions

[Figure: X = UΣVᵀ: data X with columns x₁, …, xₘ; left singular vectors u₁, …, uₖ (columns of U); singular values σ₁, σ₂, …, σₖ (diagonal of Σ); right singular vectors v₁, …, vₖ (rows of Vᵀ)]

SVD Definition

} SVD gives the best rank-k approximation of X in the L2 and Frobenius norms

X = σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ + …   (sum of outer products)
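A short numpy sketch of the rank-k approximation (the matrix X here is random, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 5))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-k approximation: keep the k largest singular values/vectors
k = 2
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Equivalently, a sum of k rank-1 outer products sigma_i * u_i v_i^T
Xk2 = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))
print(np.allclose(Xk, Xk2))              # True
print(np.linalg.norm(X - Xk, 'fro'))     # Frobenius error of the best rank-2 fit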

SVD Properties (I)

} "Almost unique" decomposition
} There are two sources of ambiguity
◦ Orientation of singular vectors
  - Permute rows of the left singular vectors and the corresponding rows of the right singular vectors
◦ If I is the identity matrix: I = UIUᵀ for all orthonormal U
  - "Hypersphere ambiguity"
  - Related to the rotational ambiguity of PCA

SVD Properties (II)

} Theorem (Eckart-Young, 1936)
◦ σ₁u₁v₁ᵀ is the best rank-1 approximation of X, that is,
  ∥X − σ₁u₁v₁ᵀ∥₂ ≤ ∥X − Y∥₂ for every rank-1 matrix Y
◦ σ₁u₁v₁ᵀ + σ₂u₂v₂ᵀ is the best rank-2 approximation of X, that is,
  ∥X − σ₁u₁v₁ᵀ − σ₂u₂v₂ᵀ∥₂ ≤ ∥X − Y∥₂ for every rank ≤ 2 matrix Y
◦ and similarly for ranks 3, 4, …, r

SVD Properties (III)

} U and V are orthonormal (orthogonal & unit norm)

Singular Value Decomposition

} V* is the conjugate transpose of V; it is simply Vᵀ if V is real-valued (always the case for us)

} SVD is significantly more general:
◦ Applies to matrices of any shape, not just square matrices
◦ Applies to any matrix, not just invertible matrices

} The SVD factorization A = UΣV* is more general than the eigenvalue/eigenvector factorization A = VΛV⁻¹, where

Σ = diag(σ₁, …, σₙ)

} The columns of U are orthonormal and the columns of V are orthonormal, so

AAᵀ = (UΣVᵀ)(UΣVᵀ)ᵀ = (UΣVᵀ)(VΣᵀUᵀ) = UΣΣᵀUᵀ = U diag(σ₁², …, σₙ²) Uᵀ

} The Moore–Penrose pseudoinverse is A⁺ = VΣ⁺Uᵀ
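Both facts can be checked numerically (a sketch with a random rectangular matrix):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))          # rectangular, so no eigendecomposition exists

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A A^T = U diag(sigma_i^2) U^T
print(np.allclose(A @ A.T, U @ np.diag(s**2) @ U.T))    # True

# Moore-Penrose pseudoinverse: A+ = V Sigma+ U^T
A_pinv = Vt.T @ np.diag(1 / s) @ U.T
print(np.allclose(A_pinv, np.linalg.pinv(A)))           # True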

Understanding SVD Singular Vectors

Now you explain: what do V and U represent?

} If X(i,j) = user i buys product j
} What is XᵀX?
◦ Product-to-product similarity matrix
◦ What does V represent?
} What is XXᵀ?
◦ User-to-user similarity matrix
◦ What does U represent?

My "Help"

[Figure: the data matrix X again, highlighting row i (a user) and column j (a product)]
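One way to see the answer (a sketch with a random purchase matrix, not real data): the columns of V are eigenvectors of XᵀX, and the columns of U are eigenvectors of XXᵀ.

import numpy as np

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(5, 4)).astype(float)   # user x product purchase matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# X^T X (product-to-product) has eigenvalues sigma_i^2
lam, W = np.linalg.eigh(X.T @ X)
print(np.allclose(np.sort(lam)[::-1], s**2))        # eigenvalues = squared singular values

# Each column of V is an eigenvector of X^T X
v1 = Vt[0, :]
print(np.allclose((X.T @ X) @ v1, s[0]**2 * v1))    # True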

Statistical Inference Review

Data processing inequality: "No processing can increase the amount of statistical information already contained in the data"

Estimating characteristics from sampling

[Diagram: world → draw samples → sample; a summary characteristic computed on the world vs. a summary computed on the sample, linked by the data processing inequality]

} In data mining we often work with a sample of data from the population of interest

} Estimation techniques allow inferences about population properties from sample data

} If we had the population we could calculate the properties of interest

} Elementary units:

◦ Entities (e.g., persons, objects, events) that meet a set of specified criteria

◦ Example: all people who've purchased something at Walmart

} Population:

◦ Aggregate of elementary units (i.e., all items of interest)

} Sample:

◦ Sub-group of the population

◦ Serves as a reference group for estimating characteristics of the population and drawing conclusions

} Sampling is the main technique employed for data selection

◦ It is often used for both the preliminary investigation of the data and the final data analysis

} Reasons to sample

◦ Obtaining the entire set of data of interest is too expensive or time consuming

◦ Processing the entire set of data of interest is too expensive or time consuming

◦ Note: Even if you use an entire dataset for analysis, you should be aware of the sampling method that was used to gather the dataset

Tan, Steinbach, Kumar. Introduction to Data Mining, 2004.

} The key principle for effective sampling is the following:

◦ Using a sample will work almost as well as using the entire dataset, if the sample is representative

◦ A sample is representative if it has approximately the same property (of interest) as the original data

} Infer properties of an unknown distribution with sample data generated from that distribution

} Parameter estimation
◦ Infer the value of a population parameter based on a sample statistic (e.g., estimate the mean)

} Hypothesis testing
◦ Infer the answer to a question about a population parameter based on a sample statistic (e.g., is the mean non-zero?)

} Maximum likelihood estimate (MLE)

} Maximum a posteriori probability estimate (MAP)

MLE:  θ̂ = argmax_θ P[Data | θ]   s.t. f(θ) = 0

MAP:  θ̂ = argmax_θ P[Data | θ] P[θ]   s.t. f(θ) = 0
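As a concrete illustration (a sketch, not from the slides): for n Bernoulli draws with k successes, the MLE is k/n, while a MAP estimate with a Beta(α, β) prior pulls the estimate toward the prior mean.

import numpy as np

rng = np.random.default_rng(3)
theta_true = 0.7
data = rng.random(50) < theta_true      # 50 Bernoulli(0.7) draws
k, n = data.sum(), data.size

# MLE: argmax_theta P[Data | theta] = k / n for Bernoulli data
theta_mle = k / n

# MAP with a Beta(alpha, beta) prior: argmax_theta P[Data | theta] P[theta]
alpha, beta = 2.0, 2.0
theta_map = (k + alpha - 1) / (n + alpha + beta - 2)

print(theta_mle, theta_map)             # MAP is pulled slightly toward the prior mean 0.5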

import numpy as np
from matplotlib import use
# Avoid using the X terminal to create plots (needed if plotting on Scholar)
use("Agg")
import matplotlib.pyplot as plt

p = 0.8
MAX_DEGREE = 101
ECCDF = 1.0
x = []
y = []
for d in range(MAX_DEGREE):  # be careful with machine precision
    x.append(d)
    y.append(ECCDF)
    ECCDF = ECCDF - (1 - p) * p**d

plt.xlim([1, max(x)])
plt.xlabel("node degree", fontsize=18)
plt.ylabel("ECCDF", fontsize=18)
plt.loglog(x, y, "ro")
plt.savefig('ECCDF_plot.pdf')

# Read CSV into R
mydata = read.csv(file="input.csv", header=FALSE, sep=",")

# Generate a random matrix (elements exponentially distributed)
h = 100
k = 200
M = matrix(rexp(n=(h*k), rate=0.1), ncol=h, nrow=k)

print(max(M))
max_per_row = c()
for (i in 1:k) {
  max_per_row = c(max_per_row, max(M[i,]))
}
plot(max_per_row)
print(max(max_per_row)/min(max_per_row))


Scholar Cluster

} High Performance Computing cluster

} Jobs must be submitted with qsub (do not use the main terminal to run tasks or you will be banished)

} But you can also use your own computer (the Scholar learning curve pays off later, though)

Qsub Example File

#!/bin/bash -l
# Example submission file (myjob.sub)
# FILENAME: myjob.sub
# choose queue to use (e.g. standby or scholar-b)
#PBS -q standby
module load devel
module load gcc
module load anaconda/4.4.1-py35
module load r
cd $PBS_O_WORKDIR
unset DISPLAY

python matlibplot_example.py

Scholar Cluster Job Submission

} qsub myjob.sub
} qstat -u <your Purdue username>

} After the job finishes (the submission directory has the output files):
◦ myjob.sub.o<jobID> (standard output)
◦ myjob.sub.e<jobID> (standard error)
◦ ECCDF_plot.pdf (your plot)

Not ideal for someone learning Python and R

} Submission takes a while to run (a few minutes)
◦ How to iteratively debug code?
◦ qsub -I -q standby -l walltime=01:00:00
  # asks for a 1-hour interactive terminal to debug problems in your code
  # remember to load the modules you need

} Where to get help: https://www.rcac.purdue.edu/compute/scholar/guide/