COMP6237 Data Mining
Finding Independent Features
Jonathon Hare ([email protected])
Content based on material from slides on NMF from Derek Greene at UCD (http://derekgreene.com/slides/nmf_insight_workshop.pdf)
and David Blei’s MLSS slides on LDA (http://www.cs.columbia.edu/~blei/talks/Blei_MLSS_2012.pdf)
Introduction
• Topic Models
• Non-negative Matrix Factorisation
• Brief introduction to Probabilistic approaches
Problem statement
• When we looked at LSA, we saw that it created concepts that were linear mixtures of words
• But, the weightings were unconstrained, and could be negative
• This made it very difficult to interpret the concepts or give them semantic meaning
• Would be really nice if we could determine thematic topics for a corpus of documents
Topic Modelling
• Topic models uncover the hidden thematic structure in document collections.
• These algorithms help us develop new ways to
• search
• browse
• summarise
Latent Dirichlet allocation (LDA)
Simple intuition: documents exhibit multiple topics.
Key topic modelling techniques
• There are many!
• Probabilistic Latent Semantic Analysis (PLSA)
• Latent Dirichlet Allocation (LDA)
• Pachinko allocation (PAM)
• Non-Negative Matrix Factorisation ([N]NMF)
Probabilistic Models
Relationship to clustering
• Clearly topic modelling is related to clustering
• trying to group documents into similar sets
• Topic models in a sense perform soft clustering
• A document can belong to a weighted mixture of topics
Non-negative Matrix Factorisation
• NMF: an unsupervised family of algorithms that simultaneously perform dimension reduction and clustering.
• Sometimes referred to as NNMF
• Also known as positive matrix factorisation (PMF) and non-negative matrix approximation (NNMA).
• No strong statistical justification or grounding.
• But has been successfully applied in a range of areas:
• Bioinformatics (e.g. clustering gene expression networks).
• Image processing (e.g. face detection).
• Audio processing (e.g. source separation).
• Text analysis
NMF Overview
• NMF produces a “parts-based” decomposition of the latent relationships in a data matrix
• Like SVD/PCA, it can reduce the dimensionality of the data
• Best explained visually in comparison to PCA of a data matrix formed from images:
Excerpt from Lee & Seung (1999):

PCA constrains the columns of W to be orthonormal and the rows of H to be orthogonal to each other. This relaxes the unary constraint of VQ, allowing a distributed representation in which each face is approximated by a linear combination of all the basis images, or eigenfaces. A distributed encoding of a particular face is shown next to the eigenfaces in Fig. 1. Although eigenfaces have a statistical interpretation as the directions of largest variance, many of them do not have an obvious visual interpretation. This is because PCA allows the entries of W and H to be of arbitrary sign. As the eigenfaces are used in linear combinations that generally involve complex cancellations between positive and negative numbers, many individual eigenfaces lack intuitive meaning.

NMF does not allow negative entries in the matrix factors W and H. Unlike the unary constraint of VQ, these non-negativity constraints permit the combination of multiple basis images to represent a face. But only additive combinations are allowed, because the non-zero elements of W and H are all positive. In contrast to PCA, no subtractions can occur. For these reasons, the non-negativity constraints are compatible with the intuitive notion of combining parts to form a whole, which is how NMF learns a parts-based representation.

As can be seen from Fig. 1, the NMF basis and encodings contain a large fraction of vanishing coefficients, so both the basis images and image encodings are sparse. The basis images are sparse because they are non-global and contain several versions of mouths, noses and other facial parts, where the various versions are in different locations or forms. The variability of a whole face is generated by combining these different parts. Although all parts are used by at least one face, any given face does not use all the available parts. This results in a sparsely distributed image encoding, in contrast to the unary encoding of VQ and the fully distributed PCA encoding.

We implemented NMF with the update rules for W and H given in Fig. 2. Iteration of these update rules converges to a local maximum of the objective function

F = \sum_{i=1}^{n} \sum_{\mu=1}^{m} \left[ V_{i\mu} \log (WH)_{i\mu} - (WH)_{i\mu} \right] \quad (2)

subject to the non-negativity constraints described above. This objective function can be derived by interpreting NMF as a method for constructing a probabilistic model of image generation. In this model, an image pixel V_{iμ} is generated by adding Poisson noise to the product (WH)_{iμ}. The objective function in equation (2) is then related to the likelihood of generating the images in V from the basis W and encodings H.

The exact form of the objective function is not as crucial as the non-negativity constraints for the success of NMF in learning parts. A squared error objective function can be optimized with update rules for W and H different from those in Fig. 2 (refs 10, 11). These update rules yield results similar to those shown in Fig. 1, but have the technical disadvantage of requiring the adjustment of a parameter controlling the learning rate. This parameter is generally adjusted through trial and error, which can be a time-consuming process if the matrix V is very large. Therefore, the update rules described in Fig. 2 may be advantageous for applications involving large databases.
[Figure 1 from Lee & Seung (1999): Non-negative matrix factorization (NMF) learns a parts-based representation of faces, whereas vector quantization (VQ) and principal components analysis (PCA) learn holistic representations. The three learning methods were applied to a database of m = 2,429 facial images, each consisting of n = 19 × 19 pixels, constituting an n × m matrix V. All three find approximate factorizations of the form V ≈ WH, but with three different types of constraints on W and H. As shown in the 7 × 7 montages, each method has learned a set of r = 49 basis images. Positive values are illustrated with black pixels and negative values with red pixels. A particular instance of a face, shown at top right, is approximately represented by a linear superposition of basis images; the coefficients of the linear superposition are shown next to each montage in a 7 × 7 grid, and the resulting superpositions are shown on the other side of the equality sign. Unlike VQ and PCA, NMF learns to represent faces with a set of basis images resembling parts of faces.]
Use low-rank basis from PCA and NNMF to reconstruct this face:
Daniel D. Lee and H. Sebastian Seung (1999). "Learning the parts of objects by non-negative matrix factorization". Nature 401 (6755): 788–791
Red signifies negative values; stronger colours indicate larger values.
NMF Overview
• Given a non-negative matrix A, find a k-dimensional approximation in terms of non-negative factors W and H:
• Approximate each item (i.e. column of A) by a linear combination of k reduced dimensions or “basis vectors” in W.
• Each basis vector can be interpreted as a cluster; the memberships of items in these clusters are encoded by H.
A (m × n) ≈ W (m × k) × H (k × n), subject to W ≥ 0, H ≥ 0
A: data matrix (rows = features, cols = items)
W: basis matrix (rows = features)
H: coefficient matrix (cols = items)
NMF Algorithm Components
• Input: Non-negative data matrix (A), number of basis vectors (k), initial values for factors W and H (e.g. random matrices).
• Objective Function: Some measure of reconstruction error between A and the approximation WH, e.g. the Euclidean distance:

\frac{1}{2} \| A - WH \|_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \left( A_{ij} - (WH)_{ij} \right)^2

• Optimisation Process: Local EM-style optimisation to refine W and H in order to minimise the objective function.
• Common approach is to iterate between two multiplicative update rules until convergence (Lee & Seung, 1999):

1. Update H: H_{cj} \leftarrow H_{cj} \frac{(W^T A)_{cj}}{(W^T W H)_{cj}}
2. Update W: W_{ic} \leftarrow W_{ic} \frac{(A H^T)_{ic}}{(W H H^T)_{ic}}
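A minimal NumPy sketch of these multiplicative updates (the function name, iteration count, and the small eps guarding against division by zero are assumptions of this sketch, not from the slides):

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-9, seed=0):
    """Factorise non-negative A (m x n) into W (m x k) and H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))                    # random non-negative initialisation
    H = rng.random((k, n))
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)  # 1. update H
        W *= (A @ H.T) / (W @ H @ H.T + eps)  # 2. update W
    return W, H

# Usage: the Frobenius reconstruction error decreases over the iterations.
A = np.random.rand(10, 6)
W, H = nmf_multiplicative(A, k=3)
print(np.linalg.norm(A - W @ H))
```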
NMF Variants
• Different objective functions:
• KL divergence; Bregman divergences…
• More efficient optimisation:
• Alternating least squares with projected gradient method for sub-problems.
• Constraints:
• Enforcing sparseness in outputs.
• Incorporation of background information (Semi-NMF).
• Different inputs:
• Symmetric matrices - e.g. document-document cosine similarity matrix.
Topic Modelling with NMF
• Basic methodology:
1. Construct vector space model for documents (after stop-word filtering), resulting in a term-document matrix A.
2. Apply TF-IDF term weight normalisation to A.
3. Normalise TF-IDF vectors to unit length.
4. Initialise factors (randomly or using NNDSVD(A)).
5. Compute NMF of A.
• Interpreting NMF output:
• Basis vectors: the topics (clusters) in the data.
• Coefficient matrix: the membership weights for documents relative to each topic (cluster).
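A sketch of this methodology using scikit-learn (an assumed library choice; note that scikit-learn's convention is transposed relative to these slides: rows are documents, so the transformed matrix holds document memberships and model.components_ holds the basis vectors). The toy corpus is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [                                    # hypothetical toy corpus
    "bank announces money and finance results",
    "football club wins the sport cup",
    "tv show actor stars in new movie",
]

# Steps 1-3: stop-word filtering, TF-IDF weighting, unit-length (L2) vectors.
vectorizer = TfidfVectorizer(stop_words="english", norm="l2")
A = vectorizer.fit_transform(docs)          # documents x terms

# Steps 4-5: NNDSVD initialisation, then compute the NMF.
model = NMF(n_components=3, init="nndsvd")
W = model.fit_transform(A)                  # document-topic membership weights
H = model.components_                       # topic-term basis vectors

# Describe each topic by its top-ranked terms.
terms = vectorizer.get_feature_names_out()
for t, row in enumerate(H):
    top = row.argsort()[::-1][:3]
    print(f"Topic {t + 1}:", ", ".join(terms[i] for i in top))
```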
NMF Topic Modelling: Simple example
• Apply TF-IDF and unit length normalisation to rows of A.
• Run Euclidean NMF on normalised A (k=3, random initialisation).
Document-Term Matrix A (6 rows × 10 columns)
Rows: document1–document6. Columns: bank, money, finance, sport, club, football, tv, show, actor, movie.
[Figure: the example document-term matrix values]
NMF Topic Modelling: Simple example
Basis vectors W: topics (clusters)
Rows: bank, money, finance, sport, club, football, tv, show, actor, movie. Columns: Topic1, Topic2, Topic3.
Coefficients H: memberships for documents
Rows: document1–document6. Columns: Topic1, Topic2, Topic3.
[Figure: the example basis and coefficient matrix values]
Challenge: Selecting K
• The selection of the number of topics k is often performed manually.
• No definitive model selection strategy.
• Various alternatives comparing different models:
• Compare reconstruction errors for different parameters (see the sketch after this list).
• Natural bias towards larger values of k.
• Build a “consensus matrix” from multiple runs for each k, assess presence of block structure.
• Examine the stability (i.e. agreement between results) from multiple randomly initialised runs for each value of k.
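A minimal sketch of the reconstruction-error comparison, assuming scikit-learn and a stand-in non-negative matrix in place of a real TF-IDF matrix:

```python
import numpy as np
from sklearn.decomposition import NMF

A = np.abs(np.random.rand(50, 20))  # stand-in for a real TF-IDF matrix

# Fit NMF for a range of k and record the reconstruction error.
for k in range(2, 9):
    model = NMF(n_components=k, init="nndsvd", max_iter=500)
    model.fit(A)
    # The error shrinks as k grows -- the natural bias towards larger k.
    print(k, round(model.reconstruction_err_, 4))
```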
Challenge: initialisation
• Standard random initialisation of NMF factors can lead to instability
• i.e. significantly different results for different runs on the same data matrix.
• NNDSVD: Nonnegative Double Singular Value Decomposition
• Provides a deterministic initialisation with no random element.
• Chooses initial factors based on positive components of the first k dimensions of SVD of data matrix A.
• Often leads to significant decrease in number of NMF iterations required before convergence.
Experiment: BBC News Articles
• Collection of 2,225 BBC news articles from 2004-2005 with 5 manually annotated topics (http://mlg.ucd.ie/datasets/bbc.html).
• Applied Euclidean Projected Gradient NMF (k=5) to the 2,225 × 9,125 matrix.
• Extract topic “descriptions” based on top-ranked terms in basis vectors.
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
growth mobile england film labour
economy phone game best election
year music win awards blair
bank technology wales award brown
sales people cup actor party
economic digital ireland oscar government
oil users team festival howard
market broadband play films minister
prices net match actress tax
china software rugby won chancellor
Experiment: Irish Economy Dataset
• Collection of 21k news articles from 2009-2010 relating to the economy (Irish Times, Irish Independent & Examiner).
• Extracted all named entities from articles (person, org, location), and constructed a 21,496 × 3,014 article-entity matrix.
• Applied Euclidean Projected Gradient NMF (k=8) and examined topics on the basis of top-ranked entities.
Topic 1 Topic 2 Topic 3 Topic 4
nama european_union allied_irish_bank hse
brian_lenihan europe bank_of_ireland dublin
green_party greece anglo_irish_bank mary_harney
ntma lisbon_treaty dublin department_of_health
anglo_irish_bank ecb irish_life_permanent brendan_drumm
Topic 5 Topic 6 Topic 7 Topic 8
usa aer_lingus uk brian_cowen
asia ryanair dublin fine_gael
new_york dublin northern_ireland fianna_fail
federal_reserve daa bank_of_england green_party
china christoph_mueller london brian_lenihan
Experiment: IMDb Dataset
• Documents constructed from IMDb Keywords for a set of 21k movies (http://www.imdb.com/Sections/Keywords/).
• Applied NMF (k=10) to the 20,923 × 5,528 movie-keyword matrix.
• Topic “descriptions” based on top-ranked keywords in basis vectors appear to reveal genres and genre cross-overs.
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
cowboy bmovie martialarts police superhero
shootout atgunpoint combat detective basedoncomic
cowboyhat bwestern hero murder superheroine
cowboyboots stockfootage actionhero investigation dccomics
horse gangmember brawl policedetective secretidentity
revolver duplicity fistfight detectiveseries amazon
sixshotter gangleader disarming murderer culttv
outlaw deception warrior policeofficer actionheroine
rifle sheriff kungfu policeman twowordtitle
winchester povertyrow onemanarmy crime bracelet
Experiment: IMDb Dataset
Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
worldwartwo monster love newyorkcity shotinthechest
soldier alien friend manhattan shottodeath
battle cultfilm kiss nightclub shotinthehead
army supernatural adultery marriageproposal punchedintheface
1940s scientist infidelity jealousy corpse
nazi surpriseending restaurant engagement shotintheback
military demon extramaritalaffair party shotgun
combat occult photograph hotel shotintheforehead
warviolence possession tears deception shotintheleg
explosion slasher pregnancy romanticrivalry shootout
A brief overview of the key probabilistic models
Latent Dirichlet allocation (LDA)
[Figure: four example topics shown as word distributions, e.g. (gene 0.04, dna 0.02, genetic 0.01, …), (life 0.02, evolve 0.01, organism 0.01, …), (brain 0.04, neuron 0.02, nerve 0.01, …), (data 0.02, number 0.02, computer 0.01, …); alongside, a document with its topic proportions and per-word topic assignments]
• Each topic is a distribution over words
• Each document is a mixture of corpus-wide topics
• Each word is drawn from one of those topics
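These bullets describe LDA's generative story. A toy sketch of sampling one document, with hypothetical sizes (K topics, vocabulary of V words) and Dirichlet hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 8, 10                  # hypothetical sizes
alpha, eta = np.full(K, 0.5), np.full(V, 0.1)

beta = rng.dirichlet(eta, size=K)         # each topic: a distribution over words
theta = rng.dirichlet(alpha)              # the document: a mixture of corpus-wide topics
doc = []
for _ in range(doc_len):
    z = rng.choice(K, p=theta)            # choose a topic for this word position
    doc.append(rng.choice(V, p=beta[z]))  # draw the word from that topic
print(theta, doc)
```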
Typical probabilistic model
[Figure: the topics, documents, and topic proportions/assignments diagram again]
• In reality, we only observe the documents
• The other structures are hidden variables
• Our goal is to infer the hidden variables
• i.e. compute their distribution conditioned on the documents: p(topics, proportions, assignments | documents)
Probabilistic Latent Semantic Analysis
• Given a corpus, observations are produced in the form of pairs of words and documents (w, d)
• Each observation is associated with an unobserved latent class variable, c
• The PLSA model assumes that the probability of a co-occurrence P(w, d) is a mixture of conditionally independent multinomial distributions:
P(w, d) = \sum_c P(c) P(d|c) P(w|c) = P(d) \sum_c P(c|d) P(w|c)
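The mixture parameters are typically fitted with EM. A rough NumPy sketch of the symmetric formulation above (the function name and smoothing constants are assumptions of this sketch; N is a document-word count matrix):

```python
import numpy as np

def plsa_em(N, K, n_iter=50, seed=0):
    """Fit P(c), P(d|c) and P(w|c) to a (D x W) count matrix N via EM."""
    rng = np.random.default_rng(seed)
    D, W = N.shape
    Pc = np.full(K, 1.0 / K)
    Pd_c = rng.random((K, D))
    Pd_c /= Pd_c.sum(axis=1, keepdims=True)
    Pw_c = rng.random((K, W))
    Pw_c /= Pw_c.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(c|d,w) under the current parameters.
        joint = Pc[:, None, None] * Pd_c[:, :, None] * Pw_c[:, None, :]
        resp = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
        # M-step: re-estimate each distribution from expected counts.
        counts = resp * N[None, :, :]
        total = counts.sum(axis=(1, 2)) + 1e-12
        Pd_c = counts.sum(axis=2) / total[:, None]
        Pw_c = counts.sum(axis=1) / total[:, None]
        Pc = total / total.sum()
    return Pc, Pd_c, Pw_c
```

After fitting, Pd_c[c] and Pw_c[c] hold the learned multinomials for latent class c.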
Latent Dirichlet Allocation
• Popular Bayesian extension to PLSA
• Add a Dirichlet prior on the per-document topic distribution
• Makes the model fully generative (i.e. we can sample new documents)
• Parameters must be learned using Bayesian inference (e.g. variational Bayes, Gibbs sampling)
• In practice: better than PLSA for small datasets; with lots of data tends to perform similarly
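For instance, a sketch of fitting LDA with scikit-learn's variational implementation (an assumed tool choice; gensim is another common one), on a hypothetical toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [                                         # hypothetical toy corpus
    "gene dna genetic sequence genome",
    "brain neuron nerve cortex",
    "data number computer model",
]

counts = CountVectorizer().fit_transform(docs)   # LDA expects raw counts, not TF-IDF
lda = LatentDirichletAllocation(n_components=2, learning_method="batch", random_state=0)
theta = lda.fit_transform(counts)                # per-document topic proportions
print(theta)                                     # each row sums to ~1
# lda.components_ holds the (unnormalised) per-topic word distributions
```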
Experiment: Science Articles
• Data: The OCR’ed collection of Science from 1990–2000
• 17K documents, 11M words, 20K unique terms (stop words and rare words removed)
• Model: 100-topic LDA model using variational inference.
[Figure: top ten words from 20 of the 100 fitted topics, e.g. (dna, gene, genes, genetic, genome, human, sequence, …), (atmosphere, carbon, climate, global, ocean, temperature, …), (electron, energy, laser, light, magnetic, quantum, …), (astronomers, observations, solar, stars, sun, telescope, …), (cancer, disease, drug, medical, patients, …), (brain, cells, cortex, neurons, visual, …)]
Summary
• Topic modelling as an important part of data mining of unstructured data
• Key idea is that documents/items belong to or are made up of a number of topics
• Typically a small subset of the overall set of topics
• All the models we’ve looked at have one key parameter that must be manually tuned: the number of topics