Knowledge Discovery and Data Mining 2 (VU) (707.004)
Topic Models: Latent Dirichlet Allocation
Denis Helic
KTI, TU Graz
Apr 9, 2014
Outline
1 Introduction
2 Latent Dirichlet Allocation
3 Formal Description of LDA
4 LDA Computation
Slides
Slides are partially based on “Introduction to Probabilistic Topic Models” by David Blei
Introduction
Probabilistic topic models
Topic models are algorithms that aim to discover the hidden thematic structure in large collections of documents
From an information retrieval perspective
Nowadays, we use either keyword search or links to navigate, e.g., the Web
Suppose instead that we could explore documents based on the themes or topics of these documents
First, you select a topic and examine the documents related to thattopic
Probabilistic topic models
For example, categories in Wikipedia
Sports, Politics, Geography, etc.
These are editorially created, which is a huge effort
On the other hand, topic models are statistically sound algorithms that analyze the words of documents to discover topics
Topic models: short recap
Latent semantic analysis applies algebraic methods, e.g. matrix decomposition, to identify “concepts”
There is no explicit statistical model behind it → implicitly, it assumes a normal distribution over word occurrences
Probabilistic latent semantic indexing is the first probabilistic topic model
However, it does not have a probabilistic model at the level of documents
Latent Dirichlet Allocation
Latent Dirichlet Allocation: LDA
The intuition behind LDA is that documents exhibit multiple topics
For example, the Wikipedia article on Thomas Bayes: http://en.wikipedia.org/wiki/Thomas_Bayes
This document is, e.g., about a person, religion, probability, and statistics
It has words that are about those topics
Latent Dirichlet Allocation: LDA
Person: Thomas, Bayes, son, born
Religion: Presbyterian, minister, nonconformist, theology
Probability: distribution, binomial, theorem, events
Statistics: interpretations, belief, Bayesian, observable
Latent Dirichlet Allocation: LDA
[Figure 1, panels: Topics | Documents | Topic proportions and assignments. Example topics with word probabilities: gene 0.04, dna 0.02, genetic 0.01, ...; life 0.02, evolve 0.01, organism 0.01, ...; brain 0.04, neuron 0.02, nerve 0.01, ...; data 0.02, number 0.02, computer 0.01, ...]
Figure 1: The intuitions behind latent Dirichlet allocation. We assume that some number of “topics,” which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative; they are not fit from real data. See Figure 2 for topics fit from data.
… model assumes the documents arose. (The interpretation of LDA as a probabilistic model is fleshed out below in Section 2.1.)
We formally define a topic to be a distribution over a fixed vocabulary. For example, the genetics topic has words about genetics with high probability and the evolutionary biology topic has words about evolutionary biology with high probability. We assume that these topics are specified before any data has been generated.¹ Now for each document in the collection, we generate the words in a two-stage process.
1. Randomly choose a distribution over topics.
2. For each word in the document
(a) Randomly choose a topic from the distribution over topics in step #1.
(b) Randomly choose a word from the corresponding distribution over the vocabulary.
This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics with different proportion (step #1); each word in each document …
Footnote 1: Technically, the model assumes that the topics are generated first, before the documents.
Latent Dirichlet Allocation: LDA
LDA is a statistical model of document collections that tries to capture this intuition
It is defined as a generative model
We normally apply it in the opposite direction: statistical inference
To learn the parameters of the model, i.e. the topic distributions, the document-topic distributions, and the word-topic distributions
The only observations that we have are the words in the documents
LDA: generative model
Formally, a topic is a distribution over a fixed vocabulary
Vocabulary: all words that we observe
A topic: probability distribution of words
For example, a “Probability” topic:
Words such as distribution, Gaussian, conditional will have high probability for this topic
Words such as hiking, soccer, water will have low probability for this topic
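A minimal sketch of this idea (the mini-vocabulary and the probabilities below are made up for illustration, not taken from any fitted model): a topic is just a probability vector over the shared vocabulary, and drawing a word from it is a single categorical draw.

```python
import numpy as np

# Hypothetical mini-vocabulary and probabilities, chosen only for illustration.
vocab = ["distribution", "gaussian", "conditional", "hiking", "soccer", "water"]
probability_topic = np.array([0.40, 0.30, 0.25, 0.02, 0.02, 0.01])
assert np.isclose(probability_topic.sum(), 1.0)   # a topic is a proper distribution

# Drawing one word from the topic is a single categorical draw.
rng = np.random.default_rng(0)
word = vocab[rng.choice(len(vocab), p=probability_topic)]
print(word)   # most likely one of the statistics-related words
```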
LDA: generative model
First, we generate the topics and the word-topic distributions
Next, for each document we generate the words in a two-stage process:
1. Randomly choose a distribution over topics (what is this document about)
2. For each word in the document:
(a) Randomly choose a topic from the distribution over topics from step 1
(b) Randomly choose a word from the corresponding distribution over the vocabulary
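A small simulation of this two-stage process, as a sketch only (the toy sizes and the Dirichlet hyperparameters α and η below are assumptions for illustration; the hyperparameters are discussed on the later slides):

```python
import numpy as np

rng = np.random.default_rng(42)
K, V, N = 3, 8, 10        # toy sizes: topics, vocabulary, words per document
alpha, eta = 0.5, 0.1     # assumed Dirichlet hyperparameters

# The topics are generated first: each beta[k] is a distribution over the vocabulary.
beta = rng.dirichlet(np.full(V, eta), size=K)         # shape (K, V)

def generate_document():
    theta = rng.dirichlet(np.full(K, alpha))          # step 1: distribution over topics
    topics, words = [], []
    for _ in range(N):
        z = rng.choice(K, p=theta)                    # step 2a: choose a topic
        w = rng.choice(V, p=beta[z])                  # step 2b: choose a word from that topic
        topics.append(int(z))
        words.append(int(w))
    return theta, topics, words

theta, z, w = generate_document()
print(np.round(theta, 2), z, w)
```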
LDA: generative model
The statistical model reflects the intuition that documents exhibit multiple topics
Each document exhibits the topics with different proportions (step 1)
Each word in each document is drawn from one of the topics (step 2b)
The selected topic is chosen from the per-document distribution over topics (step 2a)
LDA: generative model
Figure 2, right panel (top 15 words of four of the inferred topics):
“Genetics”: human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
“Evolution”: evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common
“Disease”: disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
“Computers”: computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations
[Figure 2, left panel: bar chart of the inferred topic proportions (y-axis: Probability, 0.0–0.4) over the 100 topics (x-axis ticks 1–96).]
Figure 2: Real inference with LDA. We fit a 100-topic LDA model to 17,000 articles from the journal Science. At left is the inferred topic proportions for the example article in Figure 1. At right are the top 15 most frequent words from the most frequent topics found in this article.
is drawn from one of the topics (step #2b), where the selected topic is chosen from the per-document distribution over topics (step #2a).²
In the example article, the distribution over topics would place probability on genetics, data analysis and evolutionary biology, and each word is drawn from one of those three topics. Notice that the next article in the collection might be about data analysis and neuroscience; its distribution over topics would place probability on those two topics. This is the distinguishing characteristic of latent Dirichlet allocation: all the documents in the collection share the same set of topics, but each document exhibits those topics with different proportion.
As we described in the introduction, the goal of topic modeling is to automatically discover the topics from a collection of documents. The documents themselves are observed, while the topic structure (the topics, per-document topic distributions, and the per-document per-word topic assignments) is hidden structure. The central computational problem for topic modeling is to use the observed documents to infer the hidden topic structure. This can be thought of as “reversing” the generative process: what is the hidden structure that likely generated the observed collection?
Figure 2 illustrates example inference using the same example document from Figure 1. Here, we took 17,000 articles from Science magazine and used a topic modeling algorithm to infer the hidden topic structure. (The algorithm assumed that there were 100 topics.) We …
Footnote 2: We should explain the mysterious name, “latent Dirichlet allocation.” The distribution that is used to draw the per-document topic distributions in step #1 (the cartoon histogram in Figure 1) is called a Dirichlet distribution. In the generative process for LDA, the result of the Dirichlet is used to allocate the words of the document to different topics. Why latent? Keep reading.
LDA: generative model
In the example, the most probable topics would be genetics, data analysis, and evolutionary biology
Then each word is drawn from one of those topics
The next document might have different topics
LDA: all documents in the collection share the same topics
Each document exhibits those topics with different proportions
LDA: inference
The goal of topic modeling is to automatically discover the topics from a collection of documents
The documents and words are observed
The topic structure is hidden
The topic structure: the topics, the per-document topic distributions, and the per-document per-word topic assignments
We use the observed variables to infer the hidden structure
LDA: inference
[Figure 3 residue: word lists of the plotted topics from the Yale Law Journal model, e.g. a taxation topic (tax, taxation, taxes, revenue, income, estate, …), a criminal-trial topic (crime, defendant, jury, sentencing, trial, …), a labor topic (employment, employees, labor, union, bargaining, …), and a free-speech topic (amendment, speech, freedom, expression, …).]
Figure 3: A topic model fit to the Yale Law Journal. Here there are twenty topics (the top eight are plotted). Each topic is illustrated with its top most frequent words. Each word’s position along the x-axis denotes its specificity to the documents. For example “estate” in the first topic is more specific than “tax.”
then computed the inferred topic distribution for the example article (Figure 2, left), the distribution over topics that best describes its particular collection of words. Notice that this topic distribution, though it can use any of the topics, has only “activated” a handful of them. Further, we can examine the most probable terms from each of the most probable topics (Figure 2, right). On examination, we see that these terms are recognizable as terms about genetics, survival, and data analysis, the topics that are combined in the example article.
We emphasize that the algorithms have no information about these subjects and the articles are not labeled with topics or keywords. The interpretable topic distributions arise by computing the hidden structure that likely generated the observed collection of documents.³
For example, Figure 3 illustrates topics discovered from the Yale Law Journal. (Here the number of topics was set to be twenty.) Topics about subjects like genetics and data analysis are replaced by topics about discrimination and contract law.
The utility of topic models stems from the property that the inferred hidden structure resembles the thematic structure of the collection. This interpretable hidden structure annotates each document in the collection (a task that is painstaking to perform by hand), and these annotations can be used to aid tasks like information retrieval, classification, and …
Footnote 3: Indeed, calling these models “topic models” is retrospective; the topics that emerge from the inference algorithm are interpretable for almost any collection that is analyzed. The fact that these look like topics has to do with the statistical structure of observed language and how it interacts with the specific probabilistic assumptions of LDA.
LDA: inference
The example shows the topics from the Yale Law Journal
The algorithm assumed 20 topics
Each topic is illustrated with its top most frequent words
The position along the x-axis denotes a word’s specificity to the documents
Formal Description of LDA
LDA: formalism
We treat data as arising from a generative process that includes latent (hidden) variables
This generative process defines a joint probability distribution over both the observed and hidden random variables
When analyzing data, we use that joint distribution to compute the conditional distribution
We are interested in the distribution of the hidden variables given the observed variables
LDA: formalism
We are interested in the topic structure given the documents and words
In Bayesian terms, this conditional distribution is the posterior distribution
The computational problem is the problem of computing this posterior distribution
As opposed to “nice” settings, e.g. with conjugate priors, where we can calculate the posterior analytically, in LDA this is not possible
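For contrast, a standard conjugate case where the posterior can be calculated analytically (a Beta prior on a Bernoulli parameter; this worked example is not from the slides):

θ ~ Beta(a, b),  x_1, …, x_n | θ ~ Bernoulli(θ)  ⇒  θ | x_{1:n} ~ Beta(a + Σ_{i=1}^{n} x_i, b + n − Σ_{i=1}^{n} x_i)

In LDA the Dirichlet priors are conjugate to the multinomials only locally; the latent topic assignments couple the per-document proportions and the topics, so no such closed form exists for the full posterior.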
LDA: formalism
LDA is defined by:
1. The number of topics K, the number of documents D, and the length of a document N
2. K distributions over the vocabulary, where β_k is the distribution over words for topic k
3. D distributions over the topics, where θ_d is the distribution over topics for document d
4. The topic assignments for document d are z_d, and z_{d,n} is the topic assignment for the nth word in document d
5. The observed words for document d are w_d, and w_{d,n} is the nth word in document d
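In array terms (a sketch; the variable names and toy sizes below are mine, not from the slides), these quantities have the following shapes:

```python
import numpy as np

K, D, N, V = 20, 1000, 150, 5000     # toy sizes: topics, documents, words per document, vocabulary

beta  = np.zeros((K, V))             # beta[k]  : distribution over the V words for topic k
theta = np.zeros((D, K))             # theta[d] : distribution over the K topics for document d
z = np.zeros((D, N), dtype=int)      # z[d, n]  : topic assignment of the nth word in document d
w = np.zeros((D, N), dtype=int)      # w[d, n]  : observed vocabulary id of the nth word in document d
# Only w is observed; beta, theta and z are hidden and have to be inferred.
```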
LDA: formalism
Joint distribution of observed and hidden variables:
p(β, θ, z, w) = …
LDA: formalism
What is the probability of observing a word w_{d,n}?
On what other variables is the nth word in the dth document conditioned?
p(β, θ, z, w) = … p(w_{d,n} | z_{d,n}, β)
LDA: formalism
Let us apply the chain rule: what is the probability of the hidden topic z for the nth word in the dth document?
On what other variables is the nth topic in the dth document conditioned?
p(β, θ, z, w) = … p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, β)
LDA: formalism
Under the “bag-of-words” assumption, i.e. under the word independence assumption
What is the probability of observing a document with N words?
p(β, θ, z, w) = … ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, β)
LDA: formalism
We are still not done with the document
What is missing? (Chain rule)
p(β, θ, z, w) = … p(θ_d) ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, β)
LDA: formalism
Under the document independence assumption
What is the probability of observing all D documents?
p(β, θ, z, w) = … ∏_{d=1}^{D} p(θ_d) ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, β)
LDA: formalism
We are still not done with the document collection
What is missing?
p(β, θ, z, w) = ∏_{i=1}^{K} p(β_i) ∏_{d=1}^{D} p(θ_d) ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, β)
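A direct transcription of this joint into code, as a sketch only: it evaluates the log of the expression above for one concrete setting of the hidden variables, taking the priors p(β_i) and p(θ_d) to be Dirichlet with the hyperparameters η and α from the later slides (the default values and the use of SciPy here are my assumptions; all rows of beta and theta must be strictly positive probability vectors).

```python
import numpy as np
from scipy.stats import dirichlet

def log_joint(beta, theta, z, w, alpha=0.5, eta=0.1):
    """log p(beta, theta, z, w) for one concrete setting of the hidden variables (sketch)."""
    K, V = beta.shape
    D, N = w.shape
    lp = 0.0
    for k in range(K):                                         # prod_i p(beta_i), Dirichlet(eta) prior
        lp += dirichlet.logpdf(beta[k], np.full(V, eta))
    for d in range(D):
        lp += dirichlet.logpdf(theta[d], np.full(K, alpha))    # p(theta_d), Dirichlet(alpha) prior
        for n in range(N):
            lp += np.log(theta[d, z[d, n]])                    # p(z_{d,n} | theta_d)
            lp += np.log(beta[z[d, n], w[d, n]])               # p(w_{d,n} | z_{d,n}, beta)
    return lp
```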
LDA: formalism
[Figure 4 residue: graphical model with nodes α → θ_d → z_{d,n} → w_{d,n} ← β_k ← η; the z and w nodes sit in a plate of size N inside a plate of size D, and the β_k nodes sit in a plate of size K.]
Figure 4: The graphical model for latent Dirichlet allocation. Each node is a random variable and is labeled according to its role in the generative process (see Figure 1). The hidden nodes (the topic proportions, assignments and topics) are unshaded. The observed nodes (the words of the documents) are shaded. The rectangles are “plate” notation, which denotes replication. The N plate denotes the collection of words within documents; the D plate denotes the collection of documents within the collection.
… a graphical language for describing families of probability distributions.⁵ The graphical model for LDA is in Figure 4. These three representations are equivalent ways of describing the probabilistic assumptions behind LDA.
In the next section, we describe the inference algorithms for LDA. However, we first pause to describe the short history of these ideas. LDA was developed to fix an issue with a previously developed probabilistic model, probabilistic latent semantic analysis (pLSI) [21]. That model was itself a probabilistic version of the seminal work on latent semantic analysis [14], which revealed the utility of the singular value decomposition of the document-term matrix. From this matrix factorization perspective, LDA can also be seen as a type of principal component analysis for discrete data [11, 12].
2.2 Posterior computation for LDA
We now turn to the computational problem, computing the conditional distribution of the topic structure given the observed documents. (As we mentioned above, this is called the posterior.) Using our notation, the posterior is
p(β_{1:K}, θ_{1:D}, z_{1:D} | w_{1:D}) = p(β_{1:K}, θ_{1:D}, z_{1:D}, w_{1:D}) / p(w_{1:D})    (2)
The numerator is the joint distribution of all the random variables, which can be easily computed for any setting of the hidden variables. The denominator is the marginal probability of the observations, which is the probability of seeing the observed corpus under any topic model. In theory, it can be computed by summing the joint distribution over every possible instantiation of the hidden topic structure.
Footnote 5: The field of graphical models is actually more than a language for describing families of distributions. It is a field that illuminates the deep mathematical links between probabilistic independence, graph theory, and algorithms for computing with probability distributions [35].
LDA: formalism
The hidden nodes are unshaded
They include the topic proportions, assignments, and topics
The observed nodes are shaded
They are the words of the documents
LDA: formalism
α and η are the parameters of the Dirichlet priors
With α we determine the per-document distributions over topics
If we want to model the fact that each document is only about a few topics, we need to set α to produce sparse distributions
With η we determine the Dirichlet prior on the topics’ distributions over words (the β_k), i.e. how concentrated each topic is on the vocabulary
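A quick way to see the effect of α (a sketch; the values are chosen only for illustration): a small α concentrates the mass on a few topics, a large α spreads it out almost evenly.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10
sparse = rng.dirichlet(np.full(K, 0.1))    # small alpha: most mass on a few topics
dense  = rng.dirichlet(np.full(K, 10.0))   # large alpha: mass spread almost evenly
print(np.round(sparse, 2))
print(np.round(dense, 2))
```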
LDA Computation
Posterior computation
We want to compute the posterior
The conditional distribution of the topic structure given the observed documents
p(β, θ, z | w) = p(β, θ, z, w) / p(w)
Posterior computation
We can compute the joint distribution for any setting of the hidden variables
The denominator is the marginal probability of the observations (the Bayesian evidence)
This is the probability of observing the documents under any topic model
In theory, we need to sum the joint distribution over every possible instantiation of the hidden topic structure
Posterior computation
The number of possible topic structures is exponentially large (already K^N possible topic assignments for a single document of N words)
We cannot compute the denominator directly
We need to approximate it
Two possible approaches: sampling or variational algorithms (see the ICG lectures by Thomas Pock)
Posterior computation
Sampling approaches: Markov Chain Monte Carlo
Metropolis algorithm
Gibbs sampling
We construct a Markov chain with limiting distribution equal to the posterior
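For concreteness, a compact sketch of collapsed Gibbs sampling, one common MCMC scheme for LDA (this particular implementation is not from the lecture: β and θ are integrated out, and symmetric hyperparameters α and η are assumed).

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.5, eta=0.1, iters=200, seed=0):
    """docs: list of documents, each a list of word ids in [0, V). Returns assignments and counts."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))                 # document-topic counts
    nkw = np.zeros((K, V))                         # topic-word counts
    nk = np.zeros(K)                               # total words assigned to each topic
    z = [[int(rng.integers(K)) for _ in doc] for doc in docs]   # random initialization
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]                                       # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)   # full conditional
                k = int(rng.choice(K, p=p / p.sum()))             # resample the topic
                z[d][n] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z, ndk, nkw
```

After enough iterations, the counts nkw and ndk (plus the hyperparameters) give estimates of the topics and the per-document topic proportions.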
LDA: implementations
http://www.cs.princeton.edu/~blei/lda-c/
http://cran.r-project.org/web/packages/lda/
http://mallet.cs.umass.edu/
http://www.cs.princeton.edu/~blei/downloads/onlineldavb.tar
http://www.princeton.edu/~achaney/tmve/wiki100k/browse/topic-presence.html
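Besides the tools listed above, LDA is also available in common Python libraries (not mentioned on the slide); a minimal scikit-learn sketch with a toy corpus and arbitrarily chosen parameters:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "gene dna genetic sequencing genome",
    "evolution species organisms phylogenetic biology",
    "computer data network software model",
]

X = CountVectorizer().fit_transform(docs)                         # document-term counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

print(lda.transform(X))        # per-document topic proportions (theta)
print(lda.components_.shape)   # topic-word weights (rows correspond to the beta_k, unnormalized)
```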