COMP6237 Data Mining
Finding Independent Features
Jonathon Hare ([email protected])
Content based on material from slides on NMF from Derek Greene at UCD (http://derekgreene.com/slides/nmf_insight_workshop.pdf)
and David Blei’s MLSS slides on LDA (http://www.cs.columbia.edu/~blei/talks/Blei_MLSS_2012.pdf)
Introduction
• Topic Models
• Non-negative Matrix Factorisation
• Brief introduction to Probabilistic approaches
Problem statement
• When we looked at LSA, we saw that it created concepts that were linear mixtures of words
• But, the weightings were unconstrained, and could be negative
• This made it very difficult to interpret the concepts or give them semantic meaning
• Would be really nice if we could determine thematic topics for a corpus of documents
Topic Modelling
• Topic models uncover the hidden thematic structure in document collections.
• These algorithms help us develop new ways to
• search
• browse
• summarise
Latent Dirichlet allocation (LDA)
Simple intuition: documents exhibit multiple topics.
Key topic modelling techniques
• There are many!
• Probabilistic Latent Semantic Analysis (PLSA)
• Latent Dirichlet Allocation (LDA)
• Pachinko allocation (PAM)
• Non-Negative Matrix Factorisation ([N]NMF)
Probabilistic Models
Relationship to clustering
• Clearly topic modelling is related to clustering
• trying to group documents into similar sets
• Topic models in a sense perform soft clustering
• A document can belong to a weighted mixture of topics
Non-negative Matrix Factorisation
• NMF: an unsupervised family of algorithms that simultaneously perform dimension reduction and clustering.
• Sometimes referred to as NNMF
• Also known as positive matrix factorisation (PMF) and non-negative matrix approximation (NNMA).
• No strong statistical justification or grounding.
• But has been successfully applied in a range of areas:
• Bioinformatics (e.g. clustering gene expression networks).
• Image processing (e.g. face detection).
• Audio processing (e.g. source separation).
• Text analysis
NMF Overview
• NMF produces a “parts-based” decomposition of the latent relationships in a data matrix
• Like SVD/PCA, it can reduce the dimensionality of the data
• Best explained visually in comparison to PCA of a data matrix formed from images:
Excerpt from Lee & Seung (1999):

PCA constrains the columns of W to be orthonormal and the rows of H to be orthogonal to each other. This relaxes the unary constraint of VQ, allowing a distributed representation in which each face is approximated by a linear combination of all the basis images, or eigenfaces. A distributed encoding of a particular face is shown next to the eigenfaces in Fig. 1. Although eigenfaces have a statistical interpretation as the directions of largest variance, many of them do not have an obvious visual interpretation. This is because PCA allows the entries of W and H to be of arbitrary sign. As the eigenfaces are used in linear combinations that generally involve complex cancellations between positive and negative numbers, many individual eigenfaces lack intuitive meaning.

NMF does not allow negative entries in the matrix factors W and H. Unlike the unary constraint of VQ, these non-negativity constraints permit the combination of multiple basis images to represent a face. But only additive combinations are allowed, because the non-zero elements of W and H are all positive. In contrast to PCA, no subtractions can occur. For these reasons, the non-negativity constraints are compatible with the intuitive notion of combining parts to form a whole, which is how NMF learns a parts-based representation.

As can be seen from Fig. 1, the NMF basis and encodings contain a large fraction of vanishing coefficients, so both the basis images and image encodings are sparse. The basis images are sparse because they are non-global and contain several versions of mouths, noses and other facial parts, where the various versions are in different locations or forms. The variability of a whole face is generated by combining these different parts. Although all parts are used by at least one face, any given face does not use all the available parts. This results in a sparsely distributed image encoding, in contrast to the unary encoding of VQ and the fully distributed PCA encoding.

We implemented NMF with the update rules for W and H given in Fig. 2. Iteration of these update rules converges to a local maximum of the objective function

F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left[ V_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right] \quad (2)

subject to the non-negativity constraints described above. This objective function can be derived by interpreting NMF as a method for constructing a probabilistic model of image generation. In this model, an image pixel V_{iμ} is generated by adding Poisson noise to the product (WH)_{iμ}. The objective function in equation (2) is then related to the likelihood of generating the images in V from the basis W and encodings H.

The exact form of the objective function is not as crucial as the non-negativity constraints for the success of NMF in learning parts. A squared error objective function can be optimized with update rules for W and H different from those in Fig. 2 (refs 10, 11). These update rules yield results similar to those shown in Fig. 1, but have the technical disadvantage of requiring the adjustment of a parameter controlling the learning rate. This parameter is generally adjusted through trial and error, which can be a time-consuming process if the matrix V is very large. Therefore, the update rules described in Fig. 2 may be advantageous for applications involving large databases.
[Figure 1 from Lee & Seung (1999): Non-negative matrix factorization (NMF) learns a parts-based representation of faces, whereas vector quantization (VQ) and principal components analysis (PCA) learn holistic representations. The three learning methods were applied to a database of m = 2,429 facial images, each consisting of n = 19 × 19 pixels, constituting an n × m matrix V. All three find approximate factorizations of the form V ≈ WH, but with three different types of constraints on W and H. As shown in the 7 × 7 montages, each method has learned a set of r = 49 basis images. Positive values are illustrated with black pixels and negative values with red pixels. A particular instance of a face, shown at top right, is approximately represented by a linear superposition of basis images; the coefficients of the linear superposition are shown next to each montage in a 7 × 7 grid, and the resulting superpositions are shown on the other side of the equality sign. Unlike VQ and PCA, NMF learns to represent faces with a set of basis images resembling parts of faces.]
Use low-rank basis from PCA and NNMF to reconstruct this face:
Daniel D. Lee and H. Sebastian Seung (1999). "Learning the parts of objects by non-negative matrix factorization". Nature 401 (6755): 788–791
Red signifies negative values; stronger colours indicate larger values.
NMF Overview
• Given a non-negative matrix A, find a k-dimensional approximation in terms of non-negative factors W and H:
• Approximate each item (i.e. column of A) by a linear combination of k reduced dimensions or “basis vectors” in W.
• Each basis vector can be interpreted as a cluster; the memberships of items in these clusters are encoded by H.
A (m × n) ≈ W (m × k) × H (k × n), subject to W ≥ 0, H ≥ 0
A: data matrix (rows = features, cols = items)
W: basis matrix (rows = features)
H: coefficient matrix (cols = items)
NMF Algorithm Components
• Input: Non-negative data matrix (A), number of basis vectors (k), initial values for factors W and H (e.g. random matrices).
• Objective Function: Some measure of reconstruction error between A and the approximation WH, e.g. the Euclidean distance:

\frac{1}{2} \| A - WH \|_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \left( A_{ij} - (WH)_{ij} \right)^2

• Optimisation Process: Local EM-style optimisation to refine W and H in order to minimise the objective function.
• Common approach is to iterate between two multiplicative update rules until convergence (Lee & Seung, 1999):

1. Update H: H_{cj} \leftarrow H_{cj} \frac{(W^T A)_{cj}}{(W^T W H)_{cj}}
2. Update W: W_{ic} \leftarrow W_{ic} \frac{(A H^T)_{ic}}{(W H H^T)_{ic}}
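A minimal NumPy sketch of these multiplicative updates (the function name, iteration count, and the small eps guarding against division by zero are assumptions of this sketch, not from the slides):

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-9, seed=0):
    """Factorise non-negative A (m x n) into W (m x k) and H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))                    # random non-negative initialisation
    H = rng.random((k, n))
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)  # 1. update H
        W *= (A @ H.T) / (W @ H @ H.T + eps)  # 2. update W
    return W, H

# Usage: the Frobenius reconstruction error decreases over the iterations.
A = np.random.rand(10, 6)
W, H = nmf_multiplicative(A, k=3)
print(np.linalg.norm(A - W @ H))
```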
NMF Variants
• Different objective functions:
• KL divergence; Bregman divergences…
• More efficient optimisation:
• Alternating least squares with projected gradient method for sub-problems.
• Constraints:
• Enforcing sparseness in outputs.
• Incorporation of background information (Semi-NMF).
• Different inputs:
• Symmetric matrices - e.g. document-document cosine similarity matrix.
Topic Modelling with NMF
• Basic methodology:
1. Construct vector space model for documents (after stop-word filtering), resulting in a term-document matrix A.
2. Apply TF-IDF term weight normalisation to A.
3. Normalise TF-IDF vectors to unit length.
4. Initialise factors (randomly or using NNDSVD(A)).
5. Compute NMF of A.
• Interpreting NMF output:
• Basis vectors: the topics (clusters) in the data.
• Coefficient matrix: the membership weights for documents relative to each topic (cluster).
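A sketch of this methodology using scikit-learn (an assumed library choice; note that scikit-learn's convention is transposed relative to these slides: rows are documents, so the transformed matrix holds document memberships and model.components_ holds the basis vectors). The toy corpus is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [                                    # hypothetical toy corpus
    "bank announces money and finance results",
    "football club wins the sport cup",
    "tv show actor stars in new movie",
]

# Steps 1-3: stop-word filtering, TF-IDF weighting, unit-length (L2) vectors.
vectorizer = TfidfVectorizer(stop_words="english", norm="l2")
A = vectorizer.fit_transform(docs)          # documents x terms

# Steps 4-5: NNDSVD initialisation, then compute the NMF.
model = NMF(n_components=3, init="nndsvd")
W = model.fit_transform(A)                  # document-topic membership weights
H = model.components_                       # topic-term basis vectors

# Describe each topic by its top-ranked terms.
terms = vectorizer.get_feature_names_out()
for t, row in enumerate(H):
    top = row.argsort()[::-1][:3]
    print(f"Topic {t + 1}:", ", ".join(terms[i] for i in top))
```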
NMF Topic Modelling: Simple example
• Apply TF-IDF and unit length normalisation to rows of A.
• Run Euclidean NMF on normalised A (k=3, random initialisation).
Document-Term Matrix A (6 rows × 10 columns)
Rows: document1–document6. Columns: bank, money, finance, sport, club, football, tv, show, actor, movie.
[Figure: the example document-term matrix values]
NMF Topic Modelling: Simple example
Basis vectors W: topics (clusters)
Rows: bank, money, finance, sport, club, football, tv, show, actor, movie. Columns: Topic1, Topic2, Topic3.
Coefficients H: memberships for documents
Rows: document1–document6. Columns: Topic1, Topic2, Topic3.
[Figure: the example basis and coefficient matrix values]
Challenge: Selecting K
• The selection of the number of topics k is often performed manually.
• No definitive model selection strategy.
• Various alternatives comparing different models:
• Compare reconstruction errors for different parameters (see the sketch after this list).
• Natural bias towards larger values of k.
• Build a “consensus matrix” from multiple runs for each k, assess presence of block structure.
• Examine the stability (i.e. agreement between results) from multiple randomly initialised runs for each value of k.
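A minimal sketch of the reconstruction-error comparison, assuming scikit-learn and a stand-in non-negative matrix in place of a real TF-IDF matrix:

```python
import numpy as np
from sklearn.decomposition import NMF

A = np.abs(np.random.rand(50, 20))  # stand-in for a real TF-IDF matrix

# Fit NMF for a range of k and record the reconstruction error.
for k in range(2, 9):
    model = NMF(n_components=k, init="nndsvd", max_iter=500)
    model.fit(A)
    # The error shrinks as k grows -- the natural bias towards larger k.
    print(k, round(model.reconstruction_err_, 4))
```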
Challenge: initialisation
• Standard random initialisation of NMF factors can lead to instability
• i.e. significantly different results for different runs on the same data matrix.
• NNDSVD: Nonnegative Double Singular Value Decomposition
• Provides a deterministic initialisation with no random element.
• Chooses initial factors based on positive components of the first k dimensions of SVD of data matrix A.
• Often leads to significant decrease in number of NMF iterations required before convergence.
Experiment: BBC News Articles
• Collection of 2,225 BBC news articles from 2004-2005 with 5 manually annotated topics (http://mlg.ucd.ie/datasets/bbc.html).
• Applied Euclidean Projected Gradient NMF (k=5) to the 2,225 × 9,125 matrix.
• Extract topic “descriptions” based on top-ranked terms in basis vectors.
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
growth mobile england film labour
economy phone game best election
year music win awards blair
bank technology wales award brown
sales people cup actor party
economic digital ireland oscar government
oil users team festival howard
market broadband play films minister
prices net match actress tax
china software rugby won chancellor
Experiment: Irish Economy Dataset
• Collection of 21k news articles from 2009-2010 relating to the economy (Irish Times, Irish Independent & Examiner).
• Extracted all named entities from articles (person, org, location), and constructed a 21,496 × 3,014 article-entity matrix.
• Applied Euclidean Projected Gradient NMF (k=8) and examined topics on the basis of top-ranked entities.
Topic 1 Topic 2 Topic 3 Topic 4
nama european_union allied_irish_bank hse
brian_lenihan europe bank_of_ireland dublin
green_party greece anglo_irish_bank mary_harney
ntma lisbon_treaty dublin department_of_health
anglo_irish_bank ecb irish_life_permanent brendan_drumm
Topic 5 Topic 6 Topic 7 Topic 8
usa aer_lingus uk brian_cowen
asia ryanair dublin fine_gael
new_york dublin northern_ireland fianna_fail
federal_reserve daa bank_of_england green_party
china christoph_mueller london brian_lenihan
Experiment: IMDb Dataset
• Documents constructed from IMDb Keywords for a set of 21k movies (http://www.imdb.com/Sections/Keywords/).
• Applied NMF (k=10) to the 20,923 × 5,528 movie-keyword matrix.
• Topic “descriptions” based on top-ranked keywords in basis vectors appear to reveal genres and genre cross-overs.
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
cowboy bmovie martialarts police superhero
shootout atgunpoint combat detective basedoncomic
cowboyhat bwestern hero murder superheroine
cowboyboots stockfootage actionhero investigation dccomics
horse gangmember brawl policedetective secretidentity
revolver duplicity fistfight detectiveseries amazon
sixshotter gangleader disarming murderer culttv
outlaw deception warrior policeofficer actionheroine
rifle sheriff kungfu policeman twowordtitle
winchester povertyrow onemanarmy crime bracelet
Experiment: IMDb Dataset
Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
worldwartwo monster love newyorkcity shotinthechest
soldier alien friend manhattan shottodeath
battle cultfilm kiss nightclub shotinthehead
army supernatural adultery marriageproposal punchedintheface
1940s scientist infidelity jealousy corpse
nazi surpriseending restaurant engagement shotintheback
military demon extramaritalaffair party shotgun
combat occult photograph hotel shotintheforehead
warviolence possession tears deception shotintheleg
explosion slasher pregnancy romanticrivalry shootout
A brief overview of the key probabilistic models
Latent Dirichlet allocation (LDA)
[Figure: four example topics shown as word distributions, e.g. (gene 0.04, dna 0.02, genetic 0.01, …), (life 0.02, evolve 0.01, organism 0.01, …), (brain 0.04, neuron 0.02, nerve 0.01, …), (data 0.02, number 0.02, computer 0.01, …); alongside, a document with its topic proportions and per-word topic assignments]
• Each topic is a distribution over words
• Each document is a mixture of corpus-wide topics
• Each word is drawn from one of those topics
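These bullets describe LDA's generative story. A toy sketch of sampling one document, with hypothetical sizes (K topics, vocabulary of V words) and Dirichlet hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 8, 10                  # hypothetical sizes
alpha, eta = np.full(K, 0.5), np.full(V, 0.1)

beta = rng.dirichlet(eta, size=K)         # each topic: a distribution over words
theta = rng.dirichlet(alpha)              # the document: a mixture of corpus-wide topics
doc = []
for _ in range(doc_len):
    z = rng.choice(K, p=theta)            # choose a topic for this word position
    doc.append(rng.choice(V, p=beta[z]))  # draw the word from that topic
print(theta, doc)
```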
Typical probabilistic model
[Figure: the topics, documents, and topic proportions/assignments diagram again]
• In reality, we only observe the documents
• The other structures are hidden variables
• Our goal is to infer the hidden variables
• i.e. compute their distribution conditioned on the documents: p(topics, proportions, assignments | documents)
Probabilistic Latent Semantic Analysis
• Given a corpus, observations are produced in the form of pairs of words and documents (w, d)
• Each observation is associated with an unobserved latent class variable, c
• The PLSA model assumes that the probability of a co-occurrence P(w, d) is a mixture of conditionally independent multinomial distributions:
P(w, d) = \sum_c P(c) P(d|c) P(w|c) = P(d) \sum_c P(c|d) P(w|c)
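The mixture parameters are typically fitted with EM. A rough NumPy sketch of the symmetric formulation above (the function name and smoothing constants are assumptions of this sketch; N is a document-word count matrix):

```python
import numpy as np

def plsa_em(N, K, n_iter=50, seed=0):
    """Fit P(c), P(d|c) and P(w|c) to a (D x W) count matrix N via EM."""
    rng = np.random.default_rng(seed)
    D, W = N.shape
    Pc = np.full(K, 1.0 / K)
    Pd_c = rng.random((K, D))
    Pd_c /= Pd_c.sum(axis=1, keepdims=True)
    Pw_c = rng.random((K, W))
    Pw_c /= Pw_c.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(c|d,w) under the current parameters.
        joint = Pc[:, None, None] * Pd_c[:, :, None] * Pw_c[:, None, :]
        resp = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
        # M-step: re-estimate each distribution from expected counts.
        counts = resp * N[None, :, :]
        total = counts.sum(axis=(1, 2)) + 1e-12
        Pd_c = counts.sum(axis=2) / total[:, None]
        Pw_c = counts.sum(axis=1) / total[:, None]
        Pc = total / total.sum()
    return Pc, Pd_c, Pw_c
```

After fitting, Pd_c[c] and Pw_c[c] hold the learned multinomials for latent class c.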
Latent Dirichlet Allocation
• Popular Bayesian extension to PLSA
• Add a Dirichlet prior on the per-document topic distribution
• Makes the model fully generative (i.e. we can sample new documents)
• Parameters must be learned using Bayesian inference (e.g. variational Bayes, Gibbs sampling)
• In practice: better than PLSA for small datasets; with lots of data tends to perform similarly
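For instance, a sketch of fitting LDA with scikit-learn's variational implementation (an assumed tool choice; gensim is another common one), on a hypothetical toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [                                         # hypothetical toy corpus
    "gene dna genetic sequence genome",
    "brain neuron nerve cortex",
    "data number computer model",
]

counts = CountVectorizer().fit_transform(docs)   # LDA expects raw counts, not TF-IDF
lda = LatentDirichletAllocation(n_components=2, learning_method="batch", random_state=0)
theta = lda.fit_transform(counts)                # per-document topic proportions
print(theta)                                     # each row sums to ~1
# lda.components_ holds the (unnormalised) per-topic word distributions
```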
Experiment: Science Articles
• Data: The OCR’ed collection of Science from 1990–2000
• 17K documents, 11M words, 20K unique terms (stop words and rare words removed)
• Model: 100-topic LDA model using variational inference.
[Figure: top ten words from 20 of the 100 fitted topics, e.g. (dna, gene, genes, genetic, genome, human, sequence, …), (atmosphere, carbon, climate, global, ocean, temperature, …), (electron, energy, laser, light, magnetic, quantum, …), (astronomers, observations, solar, stars, sun, telescope, …), (cancer, disease, drug, medical, patients, …), (brain, cells, cortex, neurons, visual, …)]
Summary
• Topic modelling as an important part of data mining of unstructured data
• Key idea is that documents/items belong to or are made up of a number of topics
• Typically a small subset of the overall set of topics
• All the models we’ve looked at have one key parameter that must be manually tuned: the number of topics