Continuous Latent Variables - Bishop
Xue Tian
2
Continuous Latent Variables
• Explore models in which some, or all, of the latent variables are continuous
• Motivation: in many data sets
– the dimensionality of the original data space is very high
– the data points all lie close to a manifold of much lower dimensionality
3
Example
• data set: 100x100 pixel grey-level images
• dimensionality of the original data space: 100x100 = 10,000
• a digit 3 is embedded in each image; its location and orientation are varied at random
• 3 degrees of freedom of variability:
– vertical translation
– horizontal translation
– rotation
4
Outline
• PCA - principal component analysis
– maximum variance formulation
– minimum-error formulation
– application of PCA
– PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
two commonly used definitions of PCA give rise to the same algorithm
5
PCA-maximum variance formulation
• PCA can be defined as
– the orthogonal projection of the data onto a lower-dimensional linear space, the principal subspace
– s.t. the variance of the projected data is maximized
goal
6
red dots: data points; purple line: principal subspace; green dots: projected points
PCA-maximum variance formulation
7
• data set: {xn} n=1,2,…N
• xn: D dimensions
• goal:
– project the data onto a space of dimensionality M < D
– maximize the variance of the projected data
PCA-maximum variance formulation
8
• M=1
• u1: D-dimensional unit vector giving the direction of projection, with u1^T u1 = 1
• each data point xn projects to the scalar u1^T xn
• mean of the projected data: u1^T x̄, where x̄ = (1/N) Σn xn
• variance of the projected data: (1/N) Σn (u1^T xn − u1^T x̄)² = u1^T S u1
• S: the data covariance matrix, S = (1/N) Σn (xn − x̄)(xn − x̄)^T
PCA-maximum variance formulation
9
• goal: maximize the variance of the projected data
• maximize u1^T S u1 with respect to u1
• introduce a Lagrange multiplier λ1
– the maximization must be constrained to prevent ||u1|| → ∞
– the constraint comes from u1^T u1 = 1
• maximize u1^T S u1 + λ1(1 − u1^T u1); setting the derivative with respect to u1 equal to zero gives S u1 = λ1 u1
• u1 is therefore an eigenvector of S, and the projected variance is u1^T S u1 = λ1
• max variance: u1 is the eigenvector with the largest eigenvalue λ1 (the first principal component)
PCA-maximum variance formulation
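The eigenvector recipe above can be sketched in NumPy; the data below is synthetic and purely illustrative. The variance of the projection onto u1 equals the largest eigenvalue of S:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

x_bar = X.mean(axis=0)
Xc = X - x_bar                        # centred data
S = Xc.T @ Xc / len(X)                # covariance matrix S

eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order
u1 = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue

proj_var = u1 @ S @ u1                # equals lambda_1, the largest eigenvalue
```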
10
• define additional principal components in an incremental fashion
• choose each new direction to
– maximize the projected variance
– be orthogonal to the directions already considered
• general case: project onto an M-dimensional subspace
• the optimal linear projection is defined by
– the M eigenvectors u1, ..., uM of S
– corresponding to the M largest eigenvalues λ1, ..., λM
PCA-maximum variance formulation
11
Outline
• PCA - principal component analysis
– maximum variance formulation
– minimum-error formulation
– application of PCA
– PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
two commonly used definitions of PCA give rise to the same algorithm
12
PCA-minimum error formulation
• PCA can be defined as
– the linear projection that minimizes the average projection cost
– average projection cost: the mean squared distance between the data points and their projections
goal
13
red dots: data points; purple line: principal subspace; green dots: projected points; blue lines: projection error
PCA-minimum error formulation
14
• a complete orthonormal set of D-dimensional basis vectors {ui}, i=1,…,D, with ui^T uj = δij
• each data point can be represented as a linear combination of the basis vectors: xn = Σi αni ui
• taking the inner product with uj gives αnj = xn^T uj
PCA-minimum error formulation
15
• approximate each data point using an M-dimensional subspace:
x̃n = Σ(i=1..M) zni ui + Σ(i=M+1..D) bi ui
– the coefficients zni depend on the particular data point
– the bi are constants, the same for all data points
• goal: minimize the mean squared distance J = (1/N) Σn ||xn − x̃n||²
• setting the derivative of J with respect to znj equal to zero gives znj = xn^T uj, j=1,…,M
PCA-minimum error formulation
16
• setting the derivative of J with respect to bj equal to zero gives bj = x̄^T uj, j=M+1,…,D
• substituting back: J = (1/N) Σn Σ(i=M+1..D) (xn^T ui − x̄^T ui)² = Σ(i=M+1..D) ui^T S ui
• remaining task: minimize J with respect to the ui
PCA-minimum error formulation
17
• M=1, D=2: choose u2, the direction to discard, to minimize J = u2^T S u2
• introduce a Lagrange multiplier λ2
– the minimization must be constrained to prevent ||u2|| → 0
– the constraint comes from u2^T u2 = 1
• minimize u2^T S u2 + λ2(1 − u2^T u2); setting the derivative equal to zero gives S u2 = λ2 u2
• u2 is therefore an eigenvector of S, with J = λ2
• min error: u2 is the eigenvector with the smallest eigenvalue λ2
PCA-minimum error formulation
18
• general case: J = Σ(i=M+1..D) λi
• J is the sum of the eigenvalues of those eigenvectors that are orthogonal to the principal subspace
• to obtain the minimum value of J:
– discard the eigenvectors corresponding to the D − M smallest eigenvalues
– the eigenvectors defining the principal subspace are those corresponding to the M largest eigenvalues
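This result can be checked numerically: for the best M-dimensional projection, J equals the sum of the D − M smallest eigenvalues of S. The data and the choice M=2 below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
M = 2

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / len(X)

eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalues
U = eigvecs[:, -M:]                    # eigenvectors of the M largest eigenvalues

X_tilde = x_bar + (Xc @ U) @ U.T       # optimal rank-M approximation
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
discarded = eigvals[:-M].sum()         # sum of the D - M smallest eigenvalues
```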
PCA-minimum error formulation
19
Outline
• PCA - principal component analysis
– maximum variance formulation
– minimum-error formulation
– application of PCA
– PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
two commonly used definitions of PCA give rise to the same algorithm
20
PCA-application
• dimensionality reduction
• lossy data compression
• feature extraction
• data visualization
• example
PCA is unsupervised and depends only on the values xn
21
• go through the steps to perform PCA on a set of data
• Principal Components Analysis by Lindsay Smith
• http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
PCA-example
22
Step 1: get data set
D=2 N=10
PCA-example
23
Step 2: subtract the mean
PCA-example
(table of mean-subtracted x and y values not shown)
24
PCA-example
Step 3: calculate the covariance matrix S
S: 2x2
25
• Step 4: calculate the eigenvectors and eigenvalues of the covariance matrix S
• the eigenvector with the highest eigenvalue is the first principal component of the data set
PCA-example
26
• two eigenvectors
• they go through the middle of the points, like drawing a line of best fit
• the extracted lines characterize the data
PCA-example
27
• in general, once the eigenvectors are found
• the next step is to order them by eigenvalue, highest to lowest
• this gives us the PCs in order of significance
• we can then decide to ignore the less significant components
• this is where the notion of data compression and reduced dimensionality comes in
PCA-example
28
PCA-example
• Step 5: derive the new data set
newData^T = eigenvectors^T x originalDataAdjust^T
originalDataAdjust^T: the mean-subtracted data, transposed
with a single eigenvector kept, newData is 10x1
29
• newData
PCA-example
30
PCA-example
• newData: 10x2
31
• Step 6: get the old data back
this is where data compression happens
• if we took all the eigenvectors in the transformation, we get exactly the original data back
• otherwise, we lose some information
PCA-example
32
• newData^T = eigenvectors^T x originalDataAdjust^T
• so originalDataAdjust^T = (eigenvectors^T)^-1 x newData^T
• we took all the eigenvectors, and they are unit vectors, so the inverse of the eigenvector matrix equals its transpose: (eigenvectors^T)^-1 = eigenvectors
• originalDataAdjust^T = eigenvectors x newData^T
• originalData^T = eigenvectors x newData^T + mean
33
PCA-example
• newData: 10x1
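The six steps of this worked example can be sketched end-to-end in NumPy. The ten 2-D points below are illustrative stand-ins, not necessarily the tutorial's exact table:

```python
import numpy as np

# Step 1: get a data set (D=2, N=10; illustrative values)
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
                 [1.5, 1.6], [1.1, 0.9]])

mean = data.mean(axis=0)               # Step 2: subtract the mean
adj = data - mean
S = np.cov(adj.T)                      # Step 3: the 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)   # Step 4: eigenvalues in ascending order
feat = eigvecs[:, -1:]                 # keep only the first PC

new_data = adj @ feat                  # Step 5: newData, 10x1
restored = new_data @ feat.T + mean    # Step 6: approximate reconstruction
full = (adj @ eigvecs) @ eigvecs.T + mean   # keeping all PCs is exact
```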
34
Outline
• PCA - principal component analysis
– maximum variance formulation
– minimum-error formulation
– application of PCA
– PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
two commonly used definitions of PCA give rise to the same algorithm
35
PCA-high dimensional data
• the number of data points is smaller than the dimensionality of the data space: N < D
• example:
– data set: a few hundred images
– dimensionality: several million, corresponding to three color values for each pixel
36
• the standard algorithm for finding the eigenvectors of a DxD matrix is O(D^3); finding the first M eigenvectors costs O(M D^2)
• if D is very high, a direct PCA is computationally infeasible
PCA-high dimensional data
37
• N < D
• a set of N points defines a linear subspace whose dimensionality is at most N − 1
• there is little point in applying PCA with M > N − 1
• if M > N − 1:
– at least D − N + 1 of the eigenvalues are zero
– the corresponding eigenvectors capture zero variance of the data set
PCA-high dimensional data
38
solution:
• define X: the NxD centred data matrix
• whose nth row is (xn − x̄)^T
• then S = (1/N) X^T X (a DxD matrix), and the eigenvector equation is (1/N) X^T X ui = λi ui
39
PCA-high dimensional data
• define vi = X ui and pre-multiply the eigenvector equation by X
• this gives the eigenvector equation for the NxN matrix (1/N) X X^T: (1/N) X X^T vi = λi vi
• the NxN and DxD problems have the same N − 1 (potentially nonzero) eigenvalues; the DxD problem has an additional D − N + 1 zero eigenvalues
• cost: O(D^3) → O(N^3)
• recover the unit-length eigenvectors in data space by normalizing X^T vi
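A sketch of this trick, assuming N < D; the shapes are illustrative. The NxN problem yields the nonzero eigenvalues, and each data-space eigenvector is recovered as a normalised X^T vi:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 20, 500                       # far fewer points than dimensions
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)               # centred data matrix, rows (x_n - x_bar)^T

K_small = X @ X.T / N                # N x N: a cheap O(N^3) eigensolve
lam, V = np.linalg.eigh(K_small)     # same nonzero eigenvalues as S

u = X.T @ V[:, -1]                   # recover the top data-space eigenvector
u = u / np.linalg.norm(u)            # normalise to unit length

S = X.T @ X / N                      # D x D, built here only for verification
```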
40
Outline
• PCA - principal component analysis
– maximum variance formulation
– minimum-error formulation
– application of PCA
– PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
two commonly used definitions of PCA give rise to the same algorithm
41
Kernel
• kernel function: k(x, x') = φ(x)^T φ(x'), where φ(x) is the mapping of x into a feature space
• the kernel is an inner product in feature space
• the feature-space dimensionality can exceed the input-space dimensionality
• the feature-space mapping φ is implicit
42
PCA-linear
• maximum variance formulation
– the orthogonal projection of the data onto a lower-dimensional linear space
– s.t. the variance of the projected data is maximized
• minimum error formulation
– the linear projection that minimizes the average projection distance
• both formulations are linear
43
Kernel PCA
• data set: {xn} n=1,2,…N
• xn: D dimensions
• assume: the mean has been subtracted from xn (zero mean)
• the PCs are defined by the eigenvectors ui of the covariance matrix S = (1/N) Σn xn xn^T: S ui = λi ui, i=1,…,D
44
• consider a nonlinear transformation φ(x) into an M-dimensional feature space
• each point xn projects to φ(xn)
• perform standard PCA in the feature space
• this implicitly defines nonlinear principal components in the original data space
Kernel PCA
45
left: original data space; right: feature space
green lines: linear projection onto the first PC in feature space, corresponding to a nonlinear projection in the original data space
46
• assume: the projected data φ(xn) has zero mean
• feature-space covariance matrix (MxM): C = (1/N) Σn φ(xn) φ(xn)^T
• eigenvector equation: C vi = λi vi, i=1,…,M
• given C vi = λi vi, each vi is a linear combination of the φ(xn): vi = Σn ain φ(xn)
Kernel PCA
47
Kernel PCA
• substituting and expressing everything in terms of the kernel function k(xn, xm) = φ(xn)^T φ(xm) gives, in matrix notation, K^2 ai = λi N K ai
• it suffices to solve K ai = λi N ai, i=1,…,N, where ai is the column vector with elements ain
• the solutions of these two eigenvector equations differ only by eigenvectors of K having zero eigenvalues, which do not affect the projections
48
Kernel PCA
• normalization condition for ai: requiring vi^T vi = 1 gives ai^T K ai = λi N ai^T ai = 1
49
• in feature space, the projection of a point x onto principal component i is yi(x) = φ(x)^T vi = Σn ain k(x, xn)
Kernel PCA
50
Kernel PCA
• original data space
– dimensionality: D
– D eigenvectors
– at most D linear PCs
• feature space
– dimensionality: M, with M >> D (possibly infinite)
– M eigenvectors
– the number of nonlinear PCs can therefore exceed D
• the number of nonzero eigenvalues cannot exceed N
51
Kernel PCA
• so far we assumed the projected data φ(xn) has zero mean
• for nonzero mean:
– we cannot simply compute the feature-space mean and subtract it off
– we must avoid working directly in feature space
• instead, formulate the algorithm purely in terms of the kernel function
52
Kernel PCA
• the centred Gram matrix, in matrix notation: K~ = K − 1N K − K 1N + 1N K 1N
• 1N: the NxN matrix in which every element is 1/N
• K: the Gram matrix, with elements Knm = k(xn, xm)
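A sketch of kernel PCA using this centred Gram matrix with a Gaussian kernel; the data and the value of σ are illustrative. Here the centred matrix is diagonalized directly, and a1 is scaled so the corresponding feature-space eigenvector has unit length:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
N = len(X)
sigma = 1.0

# Gram matrix with a Gaussian kernel k(x, x') = exp(-||x - x'||^2 / 2 sigma^2)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / (2 * sigma ** 2))

one_N = np.full((N, N), 1.0 / N)      # the matrix 1_N, every element 1/N
K_c = K - one_N @ K - K @ one_N + one_N @ K @ one_N   # centred Gram matrix

lam, A = np.linalg.eigh(K_c)          # coefficient vectors a_i as columns
a1 = A[:, -1] / np.sqrt(lam[-1])      # normalise so that a1^T K_c a1 = 1
y1 = K_c @ a1                         # projection of each point onto PC 1
```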
53
Kernel PCA
• with a linear kernel k(x, x') = x^T x', kernel PCA recovers standard PCA
• Gaussian kernel: k(x, x') = exp(−||x − x'||² / 2σ²)
• example: kernel PCA with a Gaussian kernel
54
55
Kernel PCA
• contours: lines along which the projection onto the corresponding PC is constant
56
Kernel PCA
disadvantage:
• it requires determining the eigenvectors of the NxN Gram matrix, rather than the DxD covariance matrix
• for large data sets, approximations are used
57
Outline
• PCA - principal component analysis
– maximum variance formulation
– minimum-error formulation
– application of PCA
– PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
two commonly used definitions of PCA give rise to the same algorithm
58
Probabilistic PCA
• standard PCA: a linear projection of the data onto a lower dimensional subspace
• probabilistic PCA: the maximum likelihood solution of a probabilistic latent variable model
59
Probabilistic PCA
• the combination of a probabilistic model and EM allows us to deal with missing values in the data set
– EM: the expectation-maximization algorithm
– a method for finding maximum likelihood solutions for models with latent variables
60
Probabilistic PCA
• probabilistic PCA forms the basis for a Bayesian treatment of PCA
• in Bayesian PCA, the dimensionality of the principal subspace can be found automatically
61
Probabilistic PCA
• the probabilistic PCA model can be run generatively to provide samples from the distribution
• the simplest continuous latent variable model assumes
– Gaussian distributions for both the latent and observed variables
– a linear-Gaussian dependence of the observed variables on the state of the latent variables
62
Probabilistic PCA
• an explicit latent variable z (Mx1), corresponding to the principal-component subspace
• a Gaussian prior distribution over the latent variable: p(z) = N(z | 0, I)
• a Gaussian conditional distribution over the observed variable x (Dx1): p(x|z) = N(x | Wz + μ, σ²I)
– W: DxM matrix; the columns of W span the principal subspace
– μ: D-dimensional mean vector
63
Probabilistic PCA
• generate a sample value of the observed variable by
– choosing a value for the latent variable
– sampling the observed variable given that latent value
• x is defined by a linear transformation of z plus additive Gaussian noise: x = Wz + μ + ε
– ε: D-dimensional zero-mean Gaussian noise with covariance σ²I
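The generative process can be sketched directly; W, μ, σ and the dimensions below are arbitrary illustrative choices. The sample mean and covariance of the generated points should approach μ and WW^T + σ²I, the mean and covariance of the Gaussian marginal p(x):

```python
import numpy as np

rng = np.random.default_rng(4)
D, M = 2, 1
W = np.array([[2.0], [1.0]])        # D x M; columns span the principal subspace
mu = np.array([0.5, -0.5])          # D-dimensional mean
sigma = 0.1

n = 20000
Z = rng.normal(size=(n, M))                  # latent samples, z ~ N(0, I)
eps = sigma * rng.normal(size=(n, D))        # isotropic Gaussian noise
Xs = Z @ W.T + mu + eps                      # observed samples x = Wz + mu + eps

C = W @ W.T + sigma ** 2 * np.eye(D)         # covariance of the marginal p(x)
```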
64
Probabilistic PCA
data space: 2-dimensional; latent space: 1-dimensional
• draw a value for the latent variable z, then draw x from an isotropic Gaussian distribution centred on Wz + μ
• green ellipses: density contours of the marginal distribution p(x)
65
Probabilistic PCA
• probabilistic PCA defines a mapping from latent space to data space
• this is in contrast to standard PCA, which maps the data space onto a lower-dimensional subspace
66
Probabilistic PCA
• Gaussian conditional distribution: p(x|z) = N(x | Wz + μ, σ²I)
• maximum likelihood PCA: determine the 3 parameters W, μ and σ²
• we need an expression for the marginal distribution p(x)
67
Probabilistic PCA
• so far, we assumed the value of M is given
• in practice, we must choose a suitable value for M
– for visualization: M=2 or M=3
– plot the eigenvalue spectrum of the data set and seek a significant gap indicating a choice for M; in practice, such a gap is often not seen
– Bayesian PCA
– employ cross-validation to determine the value of M, by selecting the largest log likelihood on a validation data set
68
Probabilistic PCA
• the only clear break in the eigenvalue spectrum is between the 1st and 2nd PCs
• the 1st PC explains less than 40% of the variance
– more components are probably needed
• the first 3 PCs explain two thirds of the total variability
– 3 might be a reasonable value for M
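One way to sketch this spectrum heuristic: compute the cumulative fraction of variance explained and pick the smallest M that crosses a threshold. The data and the 90% threshold below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6)) * np.array([3.0, 2.0, 1.5, 0.3, 0.2, 0.1])

Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / len(X))[::-1]   # descending order

explained = np.cumsum(eigvals) / eigvals.sum()           # cumulative fraction
# smallest M whose cumulative explained-variance fraction reaches 90%
M = int(np.searchsorted(explained, 0.90) + 1)
```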