Knowledge Discovery and Data Mining 2 (VU) (707.004)
Topic Models: Latent Dirichlet Allocation
Denis Helic
KTI, TU Graz
Apr 9, 2014
Outline
1 Introduction
2 Latent Dirichlet Allocation
3 Formal Description of LDA
4 LDA Computation
Slides
Slides are partially based on “Introduction to Probabilistic Topic Models” by David Blei
Introduction
Probabilistic topic models
Topic models are algorithms that aim to discover the hidden thematic structure in large collections of documents
From an information retrieval perspective
Nowadays, we use either keyword search or links to navigate, e.g., the Web
Suppose instead that we could explore documents based on the themes or topics of these documents
First, you select a topic and examine the documents related to thattopic
Probabilistic topic models
For example, categories in Wikipedia
Sports, Politics, Geography, etc.
These are editorially created, which is a huge effort
On the other hand, topic models are statistically sound algorithms that analyze the words of documents to discover topics
Topic models: short recap
Latent semantic analysis applies algebraic methods, e.g. matrix decomposition, to identify “concepts”
There is no explicit statistical model behind it → implicitly, it assumes a normal distribution over word occurrences
Probabilistic latent semantic indexing is the first probabilistic topic model
However, it does not have a probabilistic model at the level of documents
Latent Dirichlet Allocation
Latent Dirichlet Allocation: LDA
The intuition behind LDA is that documents exhibit multiple topics
For example, the Wikipedia article on Thomas Bayes: http://en.wikipedia.org/wiki/Thomas_Bayes
This document is, e.g., about a person, religion, probability, and statistics
It has words that are about those topics
Latent Dirichlet Allocation: LDA
Person: Thomas, Bayes, son, born
Religion: Presbyterian, minister, nonconformist, theology
Probability: distribution, binomial, theorem, events
Statistics: interpretations, belief, Bayesian, observable
Latent Dirichlet Allocation: LDA
[Figure 1, panels: Topics | Documents | Topic proportions and assignments. Example topics with word probabilities: gene 0.04, dna 0.02, genetic 0.01, ...; life 0.02, evolve 0.01, organism 0.01, ...; brain 0.04, neuron 0.02, nerve 0.01, ...; data 0.02, number 0.02, computer 0.01, ...]
Figure 1: The intuitions behind latent Dirichlet allocation. We assume that some number of “topics,” which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative; they are not fit from real data. See Figure 2 for topics fit from data.
… model assumes the documents arose. (The interpretation of LDA as a probabilistic model is fleshed out below in Section 2.1.)
We formally define a topic to be a distribution over a fixed vocabulary. For example, the genetics topic has words about genetics with high probability and the evolutionary biology topic has words about evolutionary biology with high probability. We assume that these topics are specified before any data has been generated.¹ Now for each document in the collection, we generate the words in a two-stage process.
1. Randomly choose a distribution over topics.
2. For each word in the document
(a) Randomly choose a topic from the distribution over topics in step #1.
(b) Randomly choose a word from the corresponding distribution over the vocabulary.
This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics with different proportion (step #1); each word in each document …
Footnote 1: Technically, the model assumes that the topics are generated first, before the documents.
Latent Dirichlet Allocation: LDA
LDA is a statistical model of document collections that tries to capture this intuition
It is defined as a generative model
We normally apply it in the opposite direction: statistical inference
To learn the parameters of the model, i.e. the topic distributions, the document-topic distributions, and the word-topic distributions
The only observations that we have are the words in the documents
LDA: generative model
Formally, a topic is a distribution over a fixed vocabulary
Vocabulary: all words that we observe
A topic: probability distribution of words
For example, a “Probability” topic:
Words such as distribution, Gaussian, conditional will have high probability for this topic
Words such as hiking, soccer, water will have low probability for this topic
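A minimal sketch of this idea (the mini-vocabulary and the probabilities below are made up for illustration, not taken from any fitted model): a topic is just a probability vector over the shared vocabulary, and drawing a word from it is a single categorical draw.

```python
import numpy as np

# Hypothetical mini-vocabulary and probabilities, chosen only for illustration.
vocab = ["distribution", "gaussian", "conditional", "hiking", "soccer", "water"]
probability_topic = np.array([0.40, 0.30, 0.25, 0.02, 0.02, 0.01])
assert np.isclose(probability_topic.sum(), 1.0)   # a topic is a proper distribution

# Drawing one word from the topic is a single categorical draw.
rng = np.random.default_rng(0)
word = vocab[rng.choice(len(vocab), p=probability_topic)]
print(word)   # most likely one of the statistics-related words
```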
LDA: generative model
First, we generate the topics and the word-topic distributions
Next, for each document we generate the words in a two-stage process:
1. Randomly choose a distribution over topics (what is this document about)
2. For each word in the document:
(a) Randomly choose a topic from the distribution over topics from step 1
(b) Randomly choose a word from the corresponding distribution over the vocabulary
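A small simulation of this two-stage process, as a sketch only (the toy sizes and the Dirichlet hyperparameters α and η below are assumptions for illustration; the hyperparameters are discussed on the later slides):

```python
import numpy as np

rng = np.random.default_rng(42)
K, V, N = 3, 8, 10        # toy sizes: topics, vocabulary, words per document
alpha, eta = 0.5, 0.1     # assumed Dirichlet hyperparameters

# The topics are generated first: each beta[k] is a distribution over the vocabulary.
beta = rng.dirichlet(np.full(V, eta), size=K)         # shape (K, V)

def generate_document():
    theta = rng.dirichlet(np.full(K, alpha))          # step 1: distribution over topics
    topics, words = [], []
    for _ in range(N):
        z = rng.choice(K, p=theta)                    # step 2a: choose a topic
        w = rng.choice(V, p=beta[z])                  # step 2b: choose a word from that topic
        topics.append(int(z))
        words.append(int(w))
    return theta, topics, words

theta, z, w = generate_document()
print(np.round(theta, 2), z, w)
```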
LDA: generative model
The statistical model reflects the intuition that documents exhibit multiple topics
Each document exhibits the topics with different proportions (step 1)
Each word in each document is drawn from one of the topics (step 2b)
The selected topic is chosen from the per-document distribution over topics (step 2a)
LDA: generative model
Figure 2, right panel (top 15 words of four of the inferred topics):
“Genetics”: human, genome, dna, genetic, genes, sequence, gene, molecular, sequencing, map, information, genetics, mapping, project, sequences
“Evolution”: evolution, evolutionary, species, organisms, life, origin, biology, groups, phylogenetic, living, diversity, group, new, two, common
“Disease”: disease, host, bacteria, diseases, resistance, bacterial, new, strains, control, infectious, malaria, parasite, parasites, united, tuberculosis
“Computers”: computer, models, information, data, computers, system, network, systems, model, parallel, methods, networks, software, new, simulations
[Figure 2, left panel: bar chart of the inferred topic proportions (y-axis: Probability, 0.0–0.4) over the 100 topics (x-axis ticks 1–96).]
Figure 2: Real inference with LDA. We fit a 100-topic LDA model to 17,000 articles from the journal Science. At left is the inferred topic proportions for the example article in Figure 1. At right are the top 15 most frequent words from the most frequent topics found in this article.
is drawn from one of the topics (step #2b), where the selected topic is chosen from the per-document distribution over topics (step #2a).²
In the example article, the distribution over topics would place probability on genetics, data analysis and evolutionary biology, and each word is drawn from one of those three topics. Notice that the next article in the collection might be about data analysis and neuroscience; its distribution over topics would place probability on those two topics. This is the distinguishing characteristic of latent Dirichlet allocation: all the documents in the collection share the same set of topics, but each document exhibits those topics with different proportion.
As we described in the introduction, the goal of topic modeling is to automatically discover the topics from a collection of documents. The documents themselves are observed, while the topic structure (the topics, per-document topic distributions, and the per-document per-word topic assignments) is hidden structure. The central computational problem for topic modeling is to use the observed documents to infer the hidden topic structure. This can be thought of as “reversing” the generative process: what is the hidden structure that likely generated the observed collection?
Figure 2 illustrates example inference using the same example document from Figure 1. Here, we took 17,000 articles from Science magazine and used a topic modeling algorithm to infer the hidden topic structure. (The algorithm assumed that there were 100 topics.) We …
Footnote 2: We should explain the mysterious name, “latent Dirichlet allocation.” The distribution that is used to draw the per-document topic distributions in step #1 (the cartoon histogram in Figure 1) is called a Dirichlet distribution. In the generative process for LDA, the result of the Dirichlet is used to allocate the words of the document to different topics. Why latent? Keep reading.
LDA: generative model
In the example, the most probable topics would be genetics, data analysis, and evolutionary biology
Then each word is drawn from one of those topics
The next document might have different topics
LDA: all documents in the collection share the same topics
Each document exhibits those topics with different proportions
LDA: inference
The goal of topic modeling is to automatically discover the topics from a collection of documents
The documents and words are observed
The topic structure is hidden
The topic structure: the topics, the per-document topic distributions, and the per-document per-word topic assignments
We use the observed variables to infer the hidden structure
LDA: inference
[Figure 3 residue: word lists of the plotted topics from the Yale Law Journal model, e.g. a taxation topic (tax, taxation, taxes, revenue, income, estate, …), a criminal-trial topic (crime, defendant, jury, sentencing, trial, …), a labor topic (employment, employees, labor, union, bargaining, …), and a free-speech topic (amendment, speech, freedom, expression, …).]
Figure 3: A topic model fit to the Yale Law Journal. Here there are twenty topics (the top eight are plotted). Each topic is illustrated with its top most frequent words. Each word’s position along the x-axis denotes its specificity to the documents. For example “estate” in the first topic is more specific than “tax.”
then computed the inferred topic distribution for the example article (Figure 2, left), the distribution over topics that best describes its particular collection of words. Notice that this topic distribution, though it can use any of the topics, has only “activated” a handful of them. Further, we can examine the most probable terms from each of the most probable topics (Figure 2, right). On examination, we see that these terms are recognizable as terms about genetics, survival, and data analysis, the topics that are combined in the example article.
We emphasize that the algorithms have no information about these subjects and the articles are not labeled with topics or keywords. The interpretable topic distributions arise by computing the hidden structure that likely generated the observed collection of documents.³
For example, Figure 3 illustrates topics discovered from the Yale Law Journal. (Here the number of topics was set to be twenty.) Topics about subjects like genetics and data analysis are replaced by topics about discrimination and contract law.
The utility of topic models stems from the property that the inferred hidden structure resembles the thematic structure of the collection. This interpretable hidden structure annotates each document in the collection (a task that is painstaking to perform by hand), and these annotations can be used to aid tasks like information retrieval, classification, and …
Footnote 3: Indeed, calling these models “topic models” is retrospective; the topics that emerge from the inference algorithm are interpretable for almost any collection that is analyzed. The fact that these look like topics has to do with the statistical structure of observed language and how it interacts with the specific probabilistic assumptions of LDA.
LDA: inference
The example shows the topics from the Yale Law Journal
The algorithm assumed 20 topics
Each topic is illustrated with its top most frequent words
The position along the x-axis denotes a word’s specificity to the documents
Formal Description of LDA
LDA: formalism
We treat data as arising from a generative process that includes latent (hidden) variables
This generative process defines a joint probability distribution over both the observed and hidden random variables
When analyzing data, we use that joint distribution to compute the conditional distribution
We are interested in the distribution of the hidden variables given the observed variables
LDA: formalism
We are interested in the topic structure given the documents and words
In Bayesian terms, this conditional distribution is the posterior distribution
The computational problem is the problem of computing this posterior distribution
As opposed to “nice” settings, e.g. with conjugate priors, where we can calculate the posterior analytically, in LDA this is not possible
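For contrast, a standard conjugate case where the posterior can be calculated analytically (a Beta prior on a Bernoulli parameter; this worked example is not from the slides):

θ ~ Beta(a, b),  x_1, …, x_n | θ ~ Bernoulli(θ)  ⇒  θ | x_{1:n} ~ Beta(a + Σ_{i=1}^{n} x_i, b + n − Σ_{i=1}^{n} x_i)

In LDA the Dirichlet priors are conjugate to the multinomials only locally; the latent topic assignments couple the per-document proportions and the topics, so no such closed form exists for the full posterior.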
LDA: formalism
LDA is defined by:
1. The number of topics K, the number of documents D, and the length of a document N
2. K distributions over the vocabulary, where β_k is the distribution over words for topic k
3. D distributions over the topics, where θ_d is the distribution over topics for document d
4. The topic assignments for document d are z_d, and z_{d,n} is the topic assignment for the nth word in document d
5. The observed words for document d are w_d, and w_{d,n} is the nth word in document d
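In array terms (a sketch; the variable names and toy sizes below are mine, not from the slides), these quantities have the following shapes:

```python
import numpy as np

K, D, N, V = 20, 1000, 150, 5000     # toy sizes: topics, documents, words per document, vocabulary

beta  = np.zeros((K, V))             # beta[k]  : distribution over the V words for topic k
theta = np.zeros((D, K))             # theta[d] : distribution over the K topics for document d
z = np.zeros((D, N), dtype=int)      # z[d, n]  : topic assignment of the nth word in document d
w = np.zeros((D, N), dtype=int)      # w[d, n]  : observed vocabulary id of the nth word in document d
# Only w is observed; beta, theta and z are hidden and have to be inferred.
```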
LDA: formalism
Joint distribution of observed and hidden variables:
p(β, θ, z, w) = …
LDA: formalism
What is the probability of observing a word w_{d,n}?
On what other variables is the nth word in the dth document conditioned?
p(β, θ, z, w) = … p(w_{d,n} | z_{d,n}, β)
LDA: formalism
Let us apply the chain rule: what is the probability of the hidden topic z for the nth word in the dth document?
On what other variables is the nth topic in the dth document conditioned?
p(β, θ, z, w) = … p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, β)
LDA: formalism
Under the “bag-of-words” assumption, i.e. under the word independence assumption
What is the probability of observing a document with N words?
p(β, θ, z, w) = … ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, β)
LDA: formalism
We are still not done with the document
What is missing? (Chain rule)
p(β, θ, z, w) = … p(θ_d) ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, β)
LDA: formalism
Under the document independence assumption
What is the probability of observing all D documents?
p(β, θ, z, w) = … ∏_{d=1}^{D} p(θ_d) ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, β)
LDA: formalism
We are still not done with the document collection
What is missing?
p(β, θ, z, w) = ∏_{i=1}^{K} p(β_i) ∏_{d=1}^{D} p(θ_d) ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | z_{d,n}, β)
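A direct transcription of this joint into code, as a sketch only: it evaluates the log of the expression above for one concrete setting of the hidden variables, taking the priors p(β_i) and p(θ_d) to be Dirichlet with the hyperparameters η and α from the later slides (the default values and the use of SciPy here are my assumptions; all rows of beta and theta must be strictly positive probability vectors).

```python
import numpy as np
from scipy.stats import dirichlet

def log_joint(beta, theta, z, w, alpha=0.5, eta=0.1):
    """log p(beta, theta, z, w) for one concrete setting of the hidden variables (sketch)."""
    K, V = beta.shape
    D, N = w.shape
    lp = 0.0
    for k in range(K):                                         # prod_i p(beta_i), Dirichlet(eta) prior
        lp += dirichlet.logpdf(beta[k], np.full(V, eta))
    for d in range(D):
        lp += dirichlet.logpdf(theta[d], np.full(K, alpha))    # p(theta_d), Dirichlet(alpha) prior
        for n in range(N):
            lp += np.log(theta[d, z[d, n]])                    # p(z_{d,n} | theta_d)
            lp += np.log(beta[z[d, n], w[d, n]])               # p(w_{d,n} | z_{d,n}, beta)
    return lp
```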
LDA: formalism
[Figure 4 residue: graphical model with nodes α → θ_d → z_{d,n} → w_{d,n} ← β_k ← η; the z and w nodes sit in a plate of size N inside a plate of size D, and the β_k nodes sit in a plate of size K.]
Figure 4: The graphical model for latent Dirichlet allocation. Each node is a random variable and is labeled according to its role in the generative process (see Figure 1). The hidden nodes (the topic proportions, assignments and topics) are unshaded. The observed nodes (the words of the documents) are shaded. The rectangles are “plate” notation, which denotes replication. The N plate denotes the collection of words within documents; the D plate denotes the collection of documents within the collection.
… a graphical language for describing families of probability distributions.⁵ The graphical model for LDA is in Figure 4. These three representations are equivalent ways of describing the probabilistic assumptions behind LDA.
In the next section, we describe the inference algorithms for LDA. However, we first pause to describe the short history of these ideas. LDA was developed to fix an issue with a previously developed probabilistic model, probabilistic latent semantic analysis (pLSI) [21]. That model was itself a probabilistic version of the seminal work on latent semantic analysis [14], which revealed the utility of the singular value decomposition of the document-term matrix. From this matrix factorization perspective, LDA can also be seen as a type of principal component analysis for discrete data [11, 12].
2.2 Posterior computation for LDA
We now turn to the computational problem, computing the conditional distribution of the topic structure given the observed documents. (As we mentioned above, this is called the posterior.) Using our notation, the posterior is
p(β_{1:K}, θ_{1:D}, z_{1:D} | w_{1:D}) = p(β_{1:K}, θ_{1:D}, z_{1:D}, w_{1:D}) / p(w_{1:D})    (2)
The numerator is the joint distribution of all the random variables, which can be easily computed for any setting of the hidden variables. The denominator is the marginal probability of the observations, which is the probability of seeing the observed corpus under any topic model. In theory, it can be computed by summing the joint distribution over every possible instantiation of the hidden topic structure.
Footnote 5: The field of graphical models is actually more than a language for describing families of distributions. It is a field that illuminates the deep mathematical links between probabilistic independence, graph theory, and algorithms for computing with probability distributions [35].
LDA: formalism
The hidden nodes are unshaded
They include the topic proportions, assignments, and topics
The observed nodes are shaded
They are the words of the documents
LDA: formalism
α and η are the parameters of the Dirichlet priors
With α we determine the per-document distributions over topics
If we want to model the fact that each document is only about a few topics, we need to set α to produce sparse distributions
With η we determine the Dirichlet prior on the topics’ distributions over words (the β_k), i.e. how concentrated each topic is on the vocabulary
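A quick way to see the effect of α (a sketch; the values are chosen only for illustration): a small α concentrates the mass on a few topics, a large α spreads it out almost evenly.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10
sparse = rng.dirichlet(np.full(K, 0.1))    # small alpha: most mass on a few topics
dense  = rng.dirichlet(np.full(K, 10.0))   # large alpha: mass spread almost evenly
print(np.round(sparse, 2))
print(np.round(dense, 2))
```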
LDA Computation
Posterior computation
We want to compute the posterior
The conditional distribution of the topic structure given the observed documents
p(β, θ, z | w) = p(β, θ, z, w) / p(w)
Posterior computation
We can compute the joint distribution for any setting of the hidden variables
The denominator is the marginal probability of the observations (the Bayesian evidence)
This is the probability of observing the documents under any topic model
In theory, we need to sum the joint distribution over every possible instantiation of the hidden topic structure
Posterior computation
The number of possible topic structures is exponentially large (already K^N possible topic assignments for a single document of N words)
We cannot compute the denominator directly
We need to approximate it
Two possible approaches: sampling or variational algorithms (see the ICG lectures by Thomas Pock)
Posterior computation
Sampling approaches: Markov Chain Monte Carlo
Metropolis algorithm
Gibbs sampling
We construct a Markov chain with limiting distribution equal to the posterior
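For concreteness, a compact sketch of collapsed Gibbs sampling, one common MCMC scheme for LDA (this particular implementation is not from the lecture: β and θ are integrated out, and symmetric hyperparameters α and η are assumed).

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.5, eta=0.1, iters=200, seed=0):
    """docs: list of documents, each a list of word ids in [0, V). Returns assignments and counts."""
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K))                 # document-topic counts
    nkw = np.zeros((K, V))                         # topic-word counts
    nk = np.zeros(K)                               # total words assigned to each topic
    z = [[int(rng.integers(K)) for _ in doc] for doc in docs]   # random initialization
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]                                       # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)   # full conditional
                k = int(rng.choice(K, p=p / p.sum()))             # resample the topic
                z[d][n] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return z, ndk, nkw
```

After enough iterations, the counts nkw and ndk (plus the hyperparameters) give estimates of the topics and the per-document topic proportions.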
LDA: implementations
http://www.cs.princeton.edu/~blei/lda-c/
http://cran.r-project.org/web/packages/lda/
http://mallet.cs.umass.edu/
http://www.cs.princeton.edu/~blei/downloads/onlineldavb.tar
http://www.princeton.edu/~achaney/tmve/wiki100k/browse/topic-presence.html
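Besides the tools listed above, LDA is also available in common Python libraries (not mentioned on the slide); a minimal scikit-learn sketch with a toy corpus and arbitrarily chosen parameters:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "gene dna genetic sequencing genome",
    "evolution species organisms phylogenetic biology",
    "computer data network software model",
]

X = CountVectorizer().fit_transform(docs)                         # document-term counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

print(lda.transform(X))        # per-document topic proportions (theta)
print(lda.components_.shape)   # topic-word weights (rows correspond to the beta_k, unnormalized)
```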