Continuous Latent Variables - Bishop
Xue Tian
2
Continuous Latent Variables
• Explore models in which some, or all, of the latent variables are continuous
• Motivation: in many data sets
– the dimensionality of the original data space is very high
– the data points all lie close to a manifold of much lower dimensionality
3
Example
• data set: 100x100 pixel grey-level images
• dimensionality of the original data space: 100x100 = 10,000
• a digit 3 is embedded in each image; its location and orientation are varied at random
• 3 degrees of freedom of variability:
– vertical translation
– horizontal translation
– rotation
4
Outline
• PCA - principal component analysis
– maximum variance formulation
– minimum-error formulation
– application of PCA
– PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
two commonly used definitions of PCA give rise to the same algorithm
5
PCA-maximum variance formulation
• PCA can be defined as
– the orthogonal projection of the data onto a lower-dimensional linear space, the principal subspace
– s.t. the variance of the projected data is maximized
goal
6
red dots: data points; purple line: principal subspace; green dots: projected points
PCA-maximum variance formulation
7
• data set: {xn} n=1,2,…N
• xn: D dimensions
• goal:
– project the data onto a space of dimensionality M < D
– maximize the variance of the projected data
PCA-maximum variance formulation
8
• M=1
• u1: D-dimensional unit vector giving the direction of projection, with u1^T u1 = 1
• each data point xn projects to the scalar u1^T xn
• mean of the projected data: u1^T x̄, where x̄ = (1/N) Σn xn
• variance of the projected data: (1/N) Σn (u1^T xn − u1^T x̄)² = u1^T S u1
• S: the data covariance matrix, S = (1/N) Σn (xn − x̄)(xn − x̄)^T
PCA-maximum variance formulation
9
• goal: maximize the variance of the projected data
• maximize u1^T S u1 with respect to u1
• introduce a Lagrange multiplier λ1
– the maximization must be constrained to prevent ||u1|| → ∞
– the constraint comes from u1^T u1 = 1
• maximize u1^T S u1 + λ1(1 − u1^T u1); setting the derivative with respect to u1 equal to zero gives S u1 = λ1 u1
• u1 is therefore an eigenvector of S, and the projected variance is u1^T S u1 = λ1
• max variance: u1 is the eigenvector with the largest eigenvalue λ1 (the first principal component)
PCA-maximum variance formulation
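The eigenvector recipe above can be sketched in NumPy; the data below is synthetic and purely illustrative. The variance of the projection onto u1 equals the largest eigenvalue of S:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

x_bar = X.mean(axis=0)
Xc = X - x_bar                        # centred data
S = Xc.T @ Xc / len(X)                # covariance matrix S

eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order
u1 = eigvecs[:, -1]                   # eigenvector of the largest eigenvalue

proj_var = u1 @ S @ u1                # equals lambda_1, the largest eigenvalue
```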
10
• define additional principal components in an incremental fashion
• choose each new direction to
– maximize the projected variance
– be orthogonal to the directions already considered
• general case: project onto an M-dimensional subspace
• the optimal linear projection is defined by
– the M eigenvectors u1, ..., uM of S
– corresponding to the M largest eigenvalues λ1, ..., λM
PCA-maximum variance formulation
11
Outline
• PCA - principal component analysis
– maximum variance formulation
– minimum-error formulation
– application of PCA
– PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
two commonly used definitions of PCA give rise to the same algorithm
12
PCA-minimum error formulation
• PCA can be defined as
– the linear projection that minimizes the average projection cost
– average projection cost: the mean squared distance between the data points and their projections
goal
13
red dots: data points; purple line: principal subspace; green dots: projected points; blue lines: projection error
PCA-minimum error formulation
14
• a complete orthonormal set of D-dimensional basis vectors {ui}, i=1,…,D, with ui^T uj = δij
• each data point can be represented as a linear combination of the basis vectors: xn = Σi αni ui
• taking the inner product with uj gives αnj = xn^T uj
PCA-minimum error formulation
15
• approximate each data point using an M-dimensional subspace:
x̃n = Σ(i=1..M) zni ui + Σ(i=M+1..D) bi ui
– the coefficients zni depend on the particular data point
– the bi are constants, the same for all data points
• goal: minimize the mean squared distance J = (1/N) Σn ||xn − x̃n||²
• setting the derivative of J with respect to znj equal to zero gives znj = xn^T uj, j=1,…,M
PCA-minimum error formulation
16
• setting the derivative of J with respect to bj equal to zero gives bj = x̄^T uj, j=M+1,…,D
• substituting back: J = (1/N) Σn Σ(i=M+1..D) (xn^T ui − x̄^T ui)² = Σ(i=M+1..D) ui^T S ui
• remaining task: minimize J with respect to the ui
PCA-minimum error formulation
17
• M=1, D=2: choose u2, the direction to discard, to minimize J = u2^T S u2
• introduce a Lagrange multiplier λ2
– the minimization must be constrained to prevent ||u2|| → 0
– the constraint comes from u2^T u2 = 1
• minimize u2^T S u2 + λ2(1 − u2^T u2); setting the derivative equal to zero gives S u2 = λ2 u2
• u2 is therefore an eigenvector of S, with J = λ2
• min error: u2 is the eigenvector with the smallest eigenvalue λ2
PCA-minimum error formulation
18
• general case: J = Σ(i=M+1..D) λi
• J is the sum of the eigenvalues of those eigenvectors that are orthogonal to the principal subspace
• to obtain the minimum value of J:
– discard the eigenvectors corresponding to the D − M smallest eigenvalues
– the eigenvectors defining the principal subspace are those corresponding to the M largest eigenvalues
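This result can be checked numerically: for the best M-dimensional projection, J equals the sum of the D − M smallest eigenvalues of S. The data and the choice M=2 below are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
M = 2

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / len(X)

eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalues
U = eigvecs[:, -M:]                    # eigenvectors of the M largest eigenvalues

X_tilde = x_bar + (Xc @ U) @ U.T       # optimal rank-M approximation
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
discarded = eigvals[:-M].sum()         # sum of the D - M smallest eigenvalues
```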
PCA-minimum error formulation
19
Outline
• PCA - principal component analysis
– maximum variance formulation
– minimum-error formulation
– application of PCA
– PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
two commonly used definitions of PCA give rise to the same algorithm
20
PCA-application
• dimensionality reduction
• lossy data compression
• feature extraction
• data visualization
• example
PCA is unsupervised and depends only on the values xn
21
• go through the steps to perform PCA on a set of data
• Principal Components Analysis by Lindsay Smith
• http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
PCA-example
22
Step 1: get data set
D=2 N=10
PCA-example
23
Step 2: subtract the mean
PCA-example
(table of mean-subtracted x and y values not shown)
24
PCA-example
Step 3: calculate the covariance matrix S
S: 2x2
25
• Step 4: calculate the eigenvectors and eigenvalues of the covariance matrix S
• the eigenvector with the highest eigenvalue is the first principal component of the data set
PCA-example
26
• two eigenvectors
• they go through the middle of the points, like drawing a line of best fit
• the extracted lines characterize the data
PCA-example
27
• in general, once the eigenvectors are found
• the next step is to order them by eigenvalue, highest to lowest
• this gives us the PCs in order of significance
• we can then decide to ignore the less significant components
• this is where the notion of data compression and reduced dimensionality comes in
PCA-example
28
PCA-example
• Step 5: derive the new data set
newData^T = eigenvectors^T x originalDataAdjust^T
originalDataAdjust^T: the mean-subtracted data, transposed
with a single eigenvector kept, newData is 10x1
29
• newData
PCA-example
30
PCA-example
• newData: 10x2
31
• Step 6: get the old data back
this is where data compression happens
• if we took all the eigenvectors in the transformation, we get exactly the original data back
• otherwise, we lose some information
PCA-example
32
• newData^T = eigenvectors^T x originalDataAdjust^T
• so originalDataAdjust^T = (eigenvectors^T)^-1 x newData^T
• we took all the eigenvectors, and they are unit vectors, so the inverse of the eigenvector matrix equals its transpose: (eigenvectors^T)^-1 = eigenvectors
• originalDataAdjust^T = eigenvectors x newData^T
• originalData^T = eigenvectors x newData^T + mean
33
PCA-example
• newData: 10x1
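The six steps of this worked example can be sketched end-to-end in NumPy. The ten 2-D points below are illustrative stand-ins, not necessarily the tutorial's exact table:

```python
import numpy as np

# Step 1: get a data set (D=2, N=10; illustrative values)
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
                 [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
                 [1.5, 1.6], [1.1, 0.9]])

mean = data.mean(axis=0)               # Step 2: subtract the mean
adj = data - mean
S = np.cov(adj.T)                      # Step 3: the 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)   # Step 4: eigenvalues in ascending order
feat = eigvecs[:, -1:]                 # keep only the first PC

new_data = adj @ feat                  # Step 5: newData, 10x1
restored = new_data @ feat.T + mean    # Step 6: approximate reconstruction
full = (adj @ eigvecs) @ eigvecs.T + mean   # keeping all PCs is exact
```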
34
Outline
• PCA - principal component analysis
– maximum variance formulation
– minimum-error formulation
– application of PCA
– PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
two commonly used definitions of PCA give rise to the same algorithm
35
PCA-high dimensional data
• the number of data points is smaller than the dimensionality of the data space: N < D
• example:
– data set: a few hundred images
– dimensionality: several million, corresponding to three color values for each pixel
36
• the standard algorithm for finding the eigenvectors of a DxD matrix is O(D^3); finding the first M eigenvectors costs O(M D^2)
• if D is very high, a direct PCA is computationally infeasible
PCA-high dimensional data
37
• N < D
• a set of N points defines a linear subspace whose dimensionality is at most N − 1
• there is little point in applying PCA with M > N − 1
• if M > N − 1:
– at least D − N + 1 of the eigenvalues are zero
– the corresponding eigenvectors capture zero variance of the data set
PCA-high dimensional data
38
solution:
• define X: the NxD centred data matrix
• whose nth row is (xn − x̄)^T
• then S = (1/N) X^T X (a DxD matrix), and the eigenvector equation is (1/N) X^T X ui = λi ui
39
PCA-high dimensional data
• define vi = X ui and pre-multiply the eigenvector equation by X
• this gives the eigenvector equation for the NxN matrix (1/N) X X^T: (1/N) X X^T vi = λi vi
• the NxN and DxD problems have the same N − 1 (potentially nonzero) eigenvalues; the DxD problem has an additional D − N + 1 zero eigenvalues
• cost: O(D^3) → O(N^3)
• recover the unit-length eigenvectors in data space by normalizing X^T vi
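A sketch of this trick, assuming N < D; the shapes are illustrative. The NxN problem yields the nonzero eigenvalues, and each data-space eigenvector is recovered as a normalised X^T vi:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 20, 500                       # far fewer points than dimensions
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)               # centred data matrix, rows (x_n - x_bar)^T

K_small = X @ X.T / N                # N x N: a cheap O(N^3) eigensolve
lam, V = np.linalg.eigh(K_small)     # same nonzero eigenvalues as S

u = X.T @ V[:, -1]                   # recover the top data-space eigenvector
u = u / np.linalg.norm(u)            # normalise to unit length

S = X.T @ X / N                      # D x D, built here only for verification
```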
40
Outline
• PCA - principal component analysis
– maximum variance formulation
– minimum-error formulation
– application of PCA
– PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
two commonly used definitions of PCA give rise to the same algorithm
41
Kernel
• kernel function: k(x, x') = φ(x)^T φ(x'), where φ(x) is the mapping of x into a feature space
• the kernel is an inner product in feature space
• the feature-space dimensionality can exceed the input-space dimensionality
• the feature-space mapping φ is implicit
42
PCA-linear
• maximum variance formulation
– the orthogonal projection of the data onto a lower-dimensional linear space
– s.t. the variance of the projected data is maximized
• minimum error formulation
– the linear projection that minimizes the average projection distance
• both formulations are linear
43
Kernel PCA
• data set: {xn} n=1,2,…N
• xn: D dimensions
• assume: the mean has been subtracted from xn (zero mean)
• the PCs are defined by the eigenvectors ui of the covariance matrix S = (1/N) Σn xn xn^T: S ui = λi ui, i=1,…,D
44
• consider a nonlinear transformation φ(x) into an M-dimensional feature space
• each point xn projects to φ(xn)
• perform standard PCA in the feature space
• this implicitly defines nonlinear principal components in the original data space
Kernel PCA
45
left: original data space; right: feature space
green lines: linear projection onto the first PC in feature space, corresponding to a nonlinear projection in the original data space
46
• assume: the projected data φ(xn) has zero mean
• feature-space covariance matrix (MxM): C = (1/N) Σn φ(xn) φ(xn)^T
• eigenvector equation: C vi = λi vi, i=1,…,M
• given C vi = λi vi, each vi is a linear combination of the φ(xn): vi = Σn ain φ(xn)
Kernel PCA
47
Kernel PCA
• substituting and expressing everything in terms of the kernel function k(xn, xm) = φ(xn)^T φ(xm) gives, in matrix notation, K^2 ai = λi N K ai
• it suffices to solve K ai = λi N ai, i=1,…,N, where ai is the column vector with elements ain
• the solutions of these two eigenvector equations differ only by eigenvectors of K having zero eigenvalues, which do not affect the projections
48
Kernel PCA
• normalization condition for ai: requiring vi^T vi = 1 gives ai^T K ai = λi N ai^T ai = 1
49
• in feature space, the projection of a point x onto principal component i is yi(x) = φ(x)^T vi = Σn ain k(x, xn)
Kernel PCA
50
Kernel PCA
• original data space
– dimensionality: D
– D eigenvectors
– at most D linear PCs
• feature space
– dimensionality: M, with M >> D (possibly infinite)
– M eigenvectors
– the number of nonlinear PCs can therefore exceed D
• the number of nonzero eigenvalues cannot exceed N
51
Kernel PCA
• so far we assumed the projected data φ(xn) has zero mean
• for nonzero mean:
– we cannot simply compute the feature-space mean and subtract it off
– we must avoid working directly in feature space
• instead, formulate the algorithm purely in terms of the kernel function
52
Kernel PCA
• the centred Gram matrix, in matrix notation: K~ = K − 1N K − K 1N + 1N K 1N
• 1N: the NxN matrix in which every element is 1/N
• K: the Gram matrix, with elements Knm = k(xn, xm)
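A sketch of kernel PCA using this centred Gram matrix with a Gaussian kernel; the data and the value of σ are illustrative. Here the centred matrix is diagonalized directly, and a1 is scaled so the corresponding feature-space eigenvector has unit length:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
N = len(X)
sigma = 1.0

# Gram matrix with a Gaussian kernel k(x, x') = exp(-||x - x'||^2 / 2 sigma^2)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / (2 * sigma ** 2))

one_N = np.full((N, N), 1.0 / N)      # the matrix 1_N, every element 1/N
K_c = K - one_N @ K - K @ one_N + one_N @ K @ one_N   # centred Gram matrix

lam, A = np.linalg.eigh(K_c)          # coefficient vectors a_i as columns
a1 = A[:, -1] / np.sqrt(lam[-1])      # normalise so that a1^T K_c a1 = 1
y1 = K_c @ a1                         # projection of each point onto PC 1
```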
53
Kernel PCA
• with a linear kernel k(x, x') = x^T x', kernel PCA recovers standard PCA
• Gaussian kernel: k(x, x') = exp(−||x − x'||² / 2σ²)
• example: kernel PCA with a Gaussian kernel
54
55
Kernel PCA
• contours: lines along which the projection onto the corresponding PC is constant
56
Kernel PCA
disadvantage:
• it requires determining the eigenvectors of the NxN Gram matrix, rather than the DxD covariance matrix
• for large data sets, approximations are used
57
Outline
• PCA - principal component analysis
– maximum variance formulation
– minimum-error formulation
– application of PCA
– PCA for high-dimensional data
• Kernel PCA
• Probabilistic PCA
two commonly used definitions of PCA give rise to the same algorithm
58
Probabilistic PCA
• standard PCA: a linear projection of the data onto a lower dimensional subspace
• probabilistic PCA: the maximum likelihood solution of a probabilistic latent variable model
59
Probabilistic PCA
• the combination of a probabilistic model and EM allows us to deal with missing values in the data set
– EM: the expectation-maximization algorithm
– a method for finding maximum likelihood solutions for models with latent variables
60
Probabilistic PCA
• probabilistic PCA forms the basis for a Bayesian treatment of PCA
• in Bayesian PCA, the dimensionality of the principal subspace can be found automatically
61
Probabilistic PCA
• the probabilistic PCA model can be run generatively to provide samples from the distribution
• the simplest continuous latent variable model assumes
– Gaussian distributions for both the latent and observed variables
– a linear-Gaussian dependence of the observed variables on the state of the latent variables
62
Probabilistic PCA
• an explicit latent variable z (Mx1), corresponding to the principal-component subspace
• a Gaussian prior distribution over the latent variable: p(z) = N(z | 0, I)
• a Gaussian conditional distribution over the observed variable x (Dx1): p(x|z) = N(x | Wz + μ, σ²I)
– W: DxM matrix; the columns of W span the principal subspace
– μ: D-dimensional mean vector
63
Probabilistic PCA
• generate a sample value of the observed variable by
– choosing a value for the latent variable
– sampling the observed variable given that latent value
• x is defined by a linear transformation of z plus additive Gaussian noise: x = Wz + μ + ε
– ε: D-dimensional zero-mean Gaussian noise with covariance σ²I
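The generative process can be sketched directly; W, μ, σ and the dimensions below are arbitrary illustrative choices. The sample mean and covariance of the generated points should approach μ and WW^T + σ²I, the mean and covariance of the Gaussian marginal p(x):

```python
import numpy as np

rng = np.random.default_rng(4)
D, M = 2, 1
W = np.array([[2.0], [1.0]])        # D x M; columns span the principal subspace
mu = np.array([0.5, -0.5])          # D-dimensional mean
sigma = 0.1

n = 20000
Z = rng.normal(size=(n, M))                  # latent samples, z ~ N(0, I)
eps = sigma * rng.normal(size=(n, D))        # isotropic Gaussian noise
Xs = Z @ W.T + mu + eps                      # observed samples x = Wz + mu + eps

C = W @ W.T + sigma ** 2 * np.eye(D)         # covariance of the marginal p(x)
```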
64
Probabilistic PCA
data space: 2-dimensional; latent space: 1-dimensional
• draw a value for the latent variable z, then draw x from an isotropic Gaussian distribution centred on Wz + μ
• green ellipses: density contours of the marginal distribution p(x)
65
Probabilistic PCA
• probabilistic PCA defines a mapping from latent space to data space
• this is in contrast to standard PCA, which maps the data space onto a lower-dimensional subspace
66
Probabilistic PCA
• Gaussian conditional distribution: p(x|z) = N(x | Wz + μ, σ²I)
• maximum likelihood PCA: determine the 3 parameters W, μ and σ²
• we need an expression for the marginal distribution p(x)
67
Probabilistic PCA
• so far, we assumed the value of M is given
• in practice, we must choose a suitable value for M
– for visualization: M=2 or M=3
– plot the eigenvalue spectrum of the data set and seek a significant gap indicating a choice for M; in practice, such a gap is often not seen
– Bayesian PCA
– employ cross-validation to determine the value of M, by selecting the largest log likelihood on a validation data set
68
Probabilistic PCA
• the only clear break in the eigenvalue spectrum is between the 1st and 2nd PCs
• the 1st PC explains less than 40% of the variance
– more components are probably needed
• the first 3 PCs explain two thirds of the total variability
– 3 might be a reasonable value for M
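One way to sketch this spectrum heuristic: compute the cumulative fraction of variance explained and pick the smallest M that crosses a threshold. The data and the 90% threshold below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6)) * np.array([3.0, 2.0, 1.5, 0.3, 0.2, 0.1])

Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / len(X))[::-1]   # descending order

explained = np.cumsum(eigvals) / eigvals.sum()           # cumulative fraction
# smallest M whose cumulative explained-variance fraction reaches 90%
M = int(np.searchsorted(explained, 0.90) + 1)
```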