Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT...

34
Chapter 3 Non-Negative Matrix Factorization Part 1: Introduction & computation

Transcript of Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT...

Page 1: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

Chapter 3 Non-Negative Matrix

FactorizationPart 1: Introduction & computation

Page 2: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Motivating NMF

2Skillicorn chapter 8; Berry et al. (2007)

Page 3: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Reminder

3

1

0

0

1

1

1

1

0

0

1

1

1

1

0

0

0.8

0.5

0.5

0.6

−0.5

−0.5

1.5

0

0

1.3

0.3

0.5

0.6

−0.3

0.3

0.5

0.6

−0.3

0.3

0.5

U Σ VTA

= +

0.6

0.3

0.3

1.3

0.8

0.8

0.6

0.3

0.3

1.3

0.8

0.8

0.6

0.3

0.3

0.4

−0.3

−0.3

−0.3

0.2

0.2

0.4

−0.3

−0.3

−0.3

0.2

0.2

0.4

−0.3

−0.3

U1Σ1,1V1T U2Σ2,2V2

T

The components of the SVD are not very interpretable

Page 4: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Non-negative factors

4

1

0

0

1

1

1

1

0

0

1

1

1

1

0

0

A

=

1

0

1

1

1

0

1

1

1

0

1

0

0

0

1

1

W H1

0

0

1

0

0

1

0

0

1

0

0

1

0

0

0

0

0

0

1

1

0

0

0

0

1

1

0

0

0

+

W1H1 W2H2

Forcing the factors to be non-negative can, and often will, improve the interpretability of the factorization

Page 5: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

The definition

5

Page 6: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Definition of NMF

6

Given a non-negative matrix and an integer k, find non-negative matrices and such thatis minimized.

12 kA �WHk2F

A 2 Rn⇥m+

W 2 Rn⇥k+ H 2 Rk⇥m+

Page 7: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Non-negative rank

• The non-negative rank of matrix A, rank+(A), is the size of the smallest exact non-negative factorization A = WH

• rank(A) ≤ rank+(A) ≤ min{n, m}

7

Page 8: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Some comments• NMF is not unique

• If X is nonnegative and with nonnegative inverse, then WXX–1H is equivalent valid decomposition

• Computing NMF (and non-negative rank) is NP-hard

• This was open until 2008

8

Page 9: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Example of non-uniqueness

9

1

0

0

1

1

1

1

0

0

1

1

1

1

0

0

1

0

0

1

0

0

1

0

0

1

0

0

1

0

0

0

0

0

0

1

1

0

0

0

0

1

1

0

0

0

1

0

0

0.5

0

0

1

0

0

0.5

0

0

1

0

0

0

0

0

0.5

1

1

0

0

0

0.5

1

1

0

0

0

1

0

0

0

0

0

1

0

0

0

0

0

1

0

0

0

0

0

1

1

1

0

0

0

1

1

1

0

0

0

=

=

=

+

+

+

Page 10: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

NMF has no order• The factors in NMF have no inherent order

• The first component is no more important than the second is no more important…

• NMF is not hierarchical

• The factors of rank-(k+1) decomposition can be completely different to those of rank-k decomposition

10

Page 11: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Example

11

1

0

0

1

1

1

1

0

0

1

1

1

1

0

0

0.8

0.5

0.5

0.7 1.5 0.7 1.5 0.7

≈ Rank-1}1

0

0

0

1

1

1

0

1

1

1

0

1

1

1

0= Rank-2}

Page 12: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Interpreting NMF

12

Page 13: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Parts-of-whole

• NMF works over anti-negative semiring

• There is no subtraction

• Each rank-1 component wihi explains a part of the whole

• This can yield to sparse factors

13

Page 14: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

NMF example: faces

14

Row of original

Row of reconstruction

= ×

PCA/SVD

Page 15: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

NMF example: faces

15

Row of original

= ×

NMF

Page 16: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

NMF example: digits

16

A H

NMF factors correspond to patterns and background

Page 17: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Some NMF applications• Text mining (more later)

• Bioinformatics

• Microarray analysis

• Mineral exploration

• Neuroscience

• Image understanding

• Air pollution research

• Weather forecasting

• …

17

Page 18: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Computing NMF

18

Page 19: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

General idea• NMF is not convex, but it is biconvex

• If W is fixed, is convex

• Start from random W and repeat

• Fix W and update H

• Fix H and update W

• until the error doesn’t decrease anymore

19

12 kA �WHk2F

Page 20: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Notes on the general idea• How to create a good random starting point?

• Is the algorithm robust to initial solutions?

• How to update W and H?

• When (and how quickly) has the process converged?

• Fixed number of iterations? Minimum change in error?

20

Page 21: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Alternating least squares

• Without the non-negativity constraint, this is the standard least-squares:

• wi ← argminw ||wH – ai||F

• we can update W ← AH+ and H ← W+A

• X+ is the pseudo-inverse of X which is LS-optimal

• The method is called alternating least-squares (ALS)

• This can introduce negative values

21

Page 22: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Enforcing non-negativity in ALS

• Least-squares optimal update of W (or H) with non-negativity constraints is convex optimization problem

• In theory in P, in practice slow, but subject to much research

• Simple approach: truncate all negative values to 0

• Update W ← [AH+]+

22

Page 23: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

The NMF-ALS algorithm

1. W ← random(n, k)

2. repeat

2.1. H ← [W+A]+

2.2. W ← [AH+]+

3. until convergence

23

Page 24: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

When has there been enough convergence?

• When the error doesn’t change too much

• ||A – W(k)H(k)||F – ||A – W(k+1)H(k+1)||F ≤ ε

• After some number of maximum iterations has been achieved

• Usually, whichever of these two happens first

24

Page 25: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Gradient descent• We can compute the gradient of the error function

(with one factor matrix fixed)

• We can move slightly towards the negative gradient

• How much is the step size and deciding it is a big problem

25

ƒ (H) = 12 kA �WHk2F =

12P

� k�� �Wh�k2F�H�j ƒ (H) = (W

TA)�j � (WTWH)�j

Page 26: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

The NMF gradient descent algorithm

1. W ← random(n, k)

2. H ← random(k, m)

3. repeat

3.1.

3.2.

4. until convergence

26

H H � �H�ƒ�H

W W � �W�ƒ�W

Page 27: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Oblique Projected Landweber (OPL) for NMF

• OPL provides one way to select the step size

• With updates, the convergence radius is 2/λmax(WTW), where λmax is the largest eigenvalue

• λmax ≤ max(rowSums(WTW))

• We can set the learning rates to 1/rowSums(WTW) for a good convergence

27

H H � �H�ƒ�H

Page 28: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

The OPL algorithm for updating H

1. η ← diag(1 / rowSums(WTW))

2. repeat

2.1. G ← WTWH – WTA

2.2. H ← [H – ηG]+

3. until a stopping criterion is met

28

(small) number of iterations

OR H doesn’t change much

Page 29: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Interior Point Gradient (IPG) for NMF

• In OPL, we might (temporarily) have negative values in W or H

• In Interior Point Gradient (IPG) algorithm, we set the step sizes so that we never update to negative

29

Page 30: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

The IPG algorithm for updating H

1. repeat until a stopping criterion is met

1.1. G ← WT(WH – A)

1.2. D ← H / (WTWH)

1.3. P ← –D * G

1.4. η* ← – ⟨vec(P), vec(G)⟩/⟨vec(WP), vec(WP)⟩

1.5. η’ ← max{η : H + ηP ≥ 0}

1.6. η ← min{τη’, η*}

1.7. H ← H + ηP

30

Gradient

Scaling

Update direction

Best step size

Positive step size

Update

/ and * are element-wise

Page 31: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Multiplicative updates• The KKT conditions for H in NMF are

• H ≥ 0; ∇H||A – WH||2/2 ≥ 0

• H * ∇H||A – WH||2/2 = 0

• Substituting ∇H||A – WH||2/2 = WTWH – WTA one gets H * (WTWH) = H * (WTA)

• This gives us an update rule for H

31

* is element-wise product

Page 32: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

The NMF multiplicative updates algorithm

1. W ← random(n, k)

2. H ← random(k, m)

3. repeat

3.1.

3.2.

4. until convergence

32

h�j h�j(WTA)�j

(WTWH)�j+�

��j ��j(AHT )�j

(WHHT )�j+�

Page 33: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Notes on multiplicative updates

• Proposed by Lee & Seung (Nature, 1999)

• Equivalent to gradient descent with dynamic step size

• Zeros in initial solutions will never turn into non-zeros; non-zeros will never turn into zeros

• Problems if the correct solution contains zeros

33

Page 34: Chapter 3 Non-Negative Matrix Factorization...2017/05/29  · P k Wh k 2 F Hj ƒ (H)=(W T A)j (WT WH)j DMM, summer 2017 Pauli Miettinen The NMF gradient descent algorithm 1. W ←

DMM, summer 2017 Pauli Miettinen

Summary• NMF can provide factorizations that are more

interpretable than those given by SVD

• Harder to compute than SVD, but many different approaches

• Or are they so different…

• In two weeks: Applications & alternations of NMF… Stay tuned!

34