
Page 1: Topic Models

DIGITAL Institute for Information and Communication Technologies

Topic Models

Claudia Wagner, Graz, 16.9.2010

Page 2: Topic Models

2

Semantic Representation of Text

a) Network Model (nodes and edges)
b) Space Model (points and proximity)
c) Probabilistic Models (words belong to a set of probabilistic topics)

(Griffiths, 2007)

Page 3: Topic Models

3

Topic Models

= probabilistic models for uncovering the underlying semantic structure of a document collection based on a hierarchical Bayesian analysis of the original texts (Blei, 2003)

Aim: discover patterns of word-use and connect documents that exhibit similar patterns

Idea: documents are mixtures of topics and a topic is a probability distribution over words
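To make this concrete, here is a minimal sketch (Python/NumPy; the toy vocabulary, the two topics and all probabilities are invented for illustration) of exactly this generative view: pick a topic from the document's topic mixture, then pick a word from that topic's word distribution.

import numpy as np

rng = np.random.default_rng(0)
vocab = ["money", "bank", "loan", "river", "stream"]    # toy vocabulary (invented)

# Each topic is a probability distribution over words (rows sum to 1).
phi = np.array([[0.35, 0.35, 0.25, 0.03, 0.02],         # a "finance"-like topic
                [0.02, 0.33, 0.05, 0.30, 0.30]])        # a "water"-like topic

# Each document is a mixture (probability distribution) over topics.
theta_d = np.array([0.8, 0.2])

# Generate a short document: for every word, sample a topic, then a word.
doc = []
for _ in range(10):
    z = rng.choice(len(theta_d), p=theta_d)             # topic for this word
    w = rng.choice(vocab, p=phi[z])                     # word from that topic
    doc.append(w)
print(doc)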

Page 4: Topic Models

4

Topic Models

source: http://www.cs.umass.edu/~wallach/talks/priors.pdf

Speaker notes: Each topic is a distribution over words, i.e. the probabilities of all words of one topic sum to 1. If one word was sampled many times from topic z, then topic z becomes more likely to have generated word w; and if topic z becomes more likely to have generated word w, then all other topics become less likely for word w (explaining away). Each document is a distribution over topics, i.e. the topic probabilities of each document sum to 1. If one topic becomes more likely to have generated document d (here, e.g., the yellow topic), then all other topics become less likely. P(topic_yellow) in doc d depends on the number of words in doc d which have been generated by topic_yellow; in this example most words in d have been generated by topic_yellow. A word has been generated by topic_yellow if the word distribution of topic_yellow was more likely to produce word w than the word distributions of all other topics.
Page 5: Topic Models

5

Topic Models

[Figure: example with two topics, Topic 1 and Topic 2]

3 latent variables:
Word distribution per topic (word-topic matrix)
Topic distribution per doc (topic-doc matrix)
Topic-word assignment

(Steyvers, 2006)

Page 6: Topic Models

6

Summary

Observed variables: Word-distribution per document

3 latent variables:
Topic distribution per document: P(z) = θ(d)
Word distribution per topic: P(w | z) = φ(z)
Word-topic assignment: P(z | w)

Training: learn the latent variables on a training collection of documents

Test: Predict topic distribution θ(d) of an unseen document d
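As an illustration of this training/test setup, a minimal sketch using the gensim library (assuming gensim is installed; the toy documents and all variable names are invented, and the printed values will vary):

from gensim import corpora, models

# toy training collection (invented)
texts = [["money", "bank", "loan", "bank"],
         ["river", "bank", "stream", "river"],
         ["loan", "money", "money", "bank"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Training: learn the latent variables on the training collection
lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=50)

# Test: predict the topic distribution theta(d) of an unseen document d
unseen = dictionary.doc2bow(["stream", "river", "bank"])
print(lda.get_document_topics(unseen))   # theta(d), e.g. [(0, 0.91), (1, 0.09)]
print(lda.show_topic(0))                 # most probable words of topic 0, i.e. phi(0)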

Page 7: Topic Models

7

Topic Models

pLSA (Hofmann, 1999)
LDA (Blei, 2003)
Author Model (McCallum, 1999)
Author-Topic Model (Rosen-Zvi, 2004)
Author-Recipient-Topic Model (McCallum, 2004)
Group-Topic Model (Wang, 2005)
Community-Author-Recipient-Topic Model (Pathak, 2008)
Semi-Supervised Topic Models:
Labeled LDA (Ramage, 2009)

Page 8: Topic Models

8

pLSA (Hofmann, 1999)

Problem: not a proper generative model for new documents! Why? Because we do not learn any corpus-level parameters; we only learn a topic distribution for each document of the training set.

P(d, w) = P(d) · Σ_z P(w | z) · P(z | d)

[Plate diagram: the outer plate repeats over the number of documents, the inner plate over the number of words; P(z | θ), with θ the topic distribution of a document, and P(w | z).]

Page 9: Topic Models

9

Latent Dirichlet Allocation (LDA) (Blei, 2003)

Advantage: we learn the topics of the corpus (corpus-level parameters), so we can predict the topic distribution of an unseen document of this corpus by observing its words.

Hyper-parameters α and β are corpus-level parameters and are only sampled once.

P(d, w) = P(d) · ∫∫ P(θ^(d) | α) · P(φ^(z) | β) · Σ_z P(w | z, φ^(z)) · P(z | θ^(d)) dθ^(d) dφ^(z)

[Plate diagram: the outer plate repeats over the number of documents, the inner plate over the number of words.]

Page 10: Topic Models

10

Dirichlet Prior α

α is a prior on the topic distribution of documents (of a corpus)
α is a corpus-level parameter (is chosen once)
α is a force on the topic combinations
Amount of smoothing determined by α
Higher α: more smoothing, less "distinct" topics
Low α: the pressure is to pick for each document a topic distribution favoring just a few topics
Recommended value: α = 50/T (or less if T is very small)

High α: each doc's topic distribution θ is a smooth mix of all topics, e.g. topic distribution of Doc1 = (1/3, 1/3, 1/3)
Low α: each doc's topic distribution θ must favor few topics, e.g. topic distribution of Doc2 = (1, 0, 0)
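The effect of α can be seen directly by drawing document-topic distributions from the Dirichlet prior (NumPy sketch; the concrete α values are only for illustration):

import numpy as np

T = 3                                        # number of topics
rng = np.random.default_rng(0)

# High alpha: smooth mixes of all topics, close to (1/3, 1/3, 1/3)
print(rng.dirichlet([50 / T] * T, size=3))

# Low alpha: distributions that favor just a few topics, close to (1, 0, 0)
print(rng.dirichlet([0.1] * T, size=3))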

Page 11: Topic Models

11

Dirichlet Prior β

β is a prior on the word distribution of topics
β is a corpus-level parameter (is chosen once)
β is a force on the word combinations
Amount of smoothing determined by β
Higher β: more smoothing
Low β: the pressure is to pick for each topic a word distribution favoring just a few words
Recommended value: β = 0.01

High β: each topic's word distribution is a smooth mix of all words, e.g. word distribution of Topic1 = (1/3, 1/3, 1/3)
Low β: each topic's word distribution must favor few words, e.g. word distribution of Topic2 = (1, 0, 0)

Page 12: Topic Models

12

Matrix Representation of LDA

[Figure: the observed word-document matrix is factored into a latent word-topic matrix φ(z) and a latent topic-document matrix θ(d).]
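A minimal NumPy sketch of this factorization (all numbers invented): the word-document probabilities are approximately the product of the word-topic matrix φ(z) and the topic-document matrix θ(d).

import numpy as np

# phi: word-topic matrix (W x T), each column is a topic's word distribution
phi = np.array([[0.40, 0.05],
                [0.30, 0.35],
                [0.20, 0.05],
                [0.05, 0.30],
                [0.05, 0.25]])

# theta: topic-document matrix (T x D), each column is a document's topic distribution
theta = np.array([[0.9, 0.5, 0.1],
                  [0.1, 0.5, 0.9]])

# Approximate word-document matrix of probabilities P(w | d); each column sums to 1
p_w_given_d = phi @ theta
print(p_w_given_d.round(3))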

Page 13: Topic Models

13

Statistical Inference and Parameter Estimation

Key problem:

Compute posterior distribution of the hidden variables given a document

Posterior distribution is intractable for exact inference

(Blei, 2003)

[Equation from (Blei, 2003), annotated: the latent variables are conditioned on the observed variables and the priors.]
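For reference, the per-document posterior meant here can be written (following Blei, 2003) as

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

and the normalizing constant p(w | α, β) couples an integral over θ with a sum over all topic assignments z, which is why exact inference is intractable.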

Page 14: Topic Models

14

Statistical Inference and Parameter Estimation

How can we estimate the posterior distribution of the hidden variables given a corpus of training documents?

Direct (e.g. via expectation maximization, variational inference or expectation propagation algorithms)

Indirect, i.e. estimate the posterior distribution over z (i.e. P(z)): Gibbs sampling, a form of Markov chain Monte Carlo, is often used to estimate the posterior probability over a high-dimensional random variable z

Page 15: Topic Models

17

Gibbs sampling generates a sequence of samples from the joint probability distribution of two or more random variables.

Aim: compute the posterior distribution over the latent variable z
Prerequisite: we must know the conditional probability of z:

P(z_i = j | z_{-i}, w_i, d_i, ·)

Why do we need to estimate P(z|w) via random walk?

z is a high-dimensional random variable

If the number of topics is T = 50 and the number of words is 1000,

we would have to visit 50^1000 points and compute P(z) for all of them.

Page 16: Topic Models

18

Gibbs Sampling for LDA

Random start
Iterative: for each word we compute
How dominant is topic z in doc d? How often was topic z already used in doc d?
How likely is the word for topic z? How often was word w already assigned to topic z?

Page 17: Topic Models

19

Run Gibbs Sampling Example (1)

1. Random topic assignments: every word token in doc1, doc2 and doc3 is randomly assigned to topic 1 or topic 2.

2. 2 count matrices:

CWT (words per topic):
        topic1  topic2
money      3       2
bank       3       6
loan       2       1
river      2       2
stream     2       1

CDT (topics per document):
        doc1  doc2  doc3
topic1    4     4     4
topic2    4     4     4

Page 18: Topic Models

20

Gibbs Sampling for LDA

Probability that topic j is chosen for word wi, conditioned on all other assigned topics of the words in this doc and all other observed variables:

Count the number of times word token wi was assigned to topic j across all docs

Count the number of times topic j was already assigned to some word token in doc di

unnormalized!

=> divide the probability of assigning topic j to word wi by the sum over all topics T
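Written out in the notation of the count matrices above (following Steyvers, 2006), the conditional is

P(z_i = j | z_{-i}, w_i, d_i, ·) ∝ (C^WT_{w_i,j} + β) / (Σ_w C^WT_{w,j} + Wβ) · (C^DT_{d_i,j} + α) / (Σ_t C^DT_{d_i,t} + Tα)

A rough NumPy sketch of one Gibbs sweep over all word tokens (all names are invented; the document-side denominator is dropped in the code because it does not depend on j and cancels when normalizing):

import numpy as np

def conditional_topic_dist(w, d, CWT, CDT, alpha, beta):
    """Normalized P(z_i = j | z_-i, w_i = w, d_i = d) for all topics j.

    CWT: word-topic count matrix (W x T), current token already decremented.
    CDT: document-topic count matrix (D x T), current token already decremented.
    """
    W = CWT.shape[0]                                       # vocabulary size
    word_term = (CWT[w, :] + beta) / (CWT.sum(axis=0) + W * beta)
    doc_term = CDT[d, :] + alpha                           # how dominant is each topic in doc d
    p = word_term * doc_term
    return p / p.sum()                                     # divide by the sum over all topics T

def gibbs_sweep(docs, z, CWT, CDT, alpha, beta, rng):
    """One pass over every token; docs[d][i] is a word id, z[d][i] its topic."""
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            old = z[d][i]
            CWT[w, old] -= 1                               # decrement CWT and CDT
            CDT[d, old] -= 1
            p = conditional_topic_dist(w, d, CWT, CDT, alpha, beta)
            new = rng.choice(len(p), p=p)                  # sample a new topic
            z[d][i] = new
            CWT[w, new] += 1                               # add the new assignment back
            CDT[d, new] += 1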

Page 19: Topic Models

22

Run Gibbs Sampling Example (2)

CWT (words per topic):
        topic1  topic2
money      3       2
bank       3       6
loan       2       1
river      2       2
stream     2       1

CDT (topics per document):
        doc1  doc2  doc3
topic1    4     4     4
topic2    4     4     4

First iteration, for the current word token:
Decrement CDT and CWT for the current topic j
Sample a new topic from the current topic distribution of the doc
(Here a "money" token in doc1 is reassigned from topic 1 to topic 2; the updated counts, 2/3 for money and 3/5 for doc1, appear on the next slide.)

Page 20: Topic Models

23

Run Gibbs Sampling Example (2)

CWT (words per topic):
        topic1  topic2
money      2       3
bank       3       6
loan       2       1
river      2       2
stream     2       1

CDT (topics per document):
        doc1  doc2  doc3
topic1    3     4     4
topic2    5     4     4

First iteration (continued), for the next word token:
Decrement CDT and CWT for the current topic j
Sample a new topic from the current topic distribution of the doc

Page 21: Topic Models

24

Run Gibbs Sampling Example (3)

α = 50/T = 25 and β = 0.01

P(z_i = topic2 | z_{-i}, w_i = bank, d_i, ·) ∝ (5 + 0.01) / (7 + 5·0.01) · (4 + 25) / (3 + 2·25) ≈ 0.39

P(z_i = topic1 | z_{-i}, w_i = bank, d_i, ·) ∝ (3 + 0.01) / (8 + 5·0.01) · (3 + 25) / (4 + 2·25) ≈ 0.19

"Bank" is assigned to Topic 2

(In the second factor, the numerator counts how often topic j was used in doc di, the denominator how often all other topics were used in doc di.)
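The two unnormalized scores above can be checked with a few lines of Python (T = 2 topics, a vocabulary of W = 5 words, and the counts as read off the slide):

alpha, beta = 25.0, 0.01     # alpha = 50/T with T = 2
W, T = 5, 2                  # vocabulary size and number of topics

p_topic2 = (5 + beta) / (7 + W * beta) * (4 + alpha) / (3 + T * alpha)
p_topic1 = (3 + beta) / (8 + W * beta) * (3 + alpha) / (4 + T * alpha)
print(round(p_topic2, 2), round(p_topic1, 2))   # 0.39 and 0.19 (unnormalized)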

Page 22: Topic Models

25

Summary: Run Gibbs Sampling

Gibbs sampling is used to estimate topic assignment for each word of each doc

Factors affecting topic assignments:
How likely is word w for topic j? The probability of word w under topic j
How dominant is topic j in doc d? The probability of topic j under the current topic distribution for document d

Once many tokens of a word have been assigned to topic j (across documents), the probability of assigning any particular token of that word to topic j increases; all other topics become less likely for word w (explaining away).

Once a topic j has been used multiple times in one document, the probability increases that any word from that document will be assigned to topic j; all other topics become less likely for that document (explaining away).

Page 23: Topic Models

26

Gibbs Sampling Convergence

Random start, N iterations; each iteration updates the count matrices
Convergence: the count matrices stop changing
The Gibbs samples start to approximate the target distribution (i.e., the posterior distribution over z)

[Figure: topic assignments of the word tokens over the iterations; black = topic 1, white = topic 2]

Page 24: Topic Models

27

Gibbs Sampling Convergence

Ignore some number of samples at the beginning (Burn-In period)

Consider only every nth sample when averaging values to compute an expectation

Why? Successive Gibbs samples are not independent; they form a Markov chain with some amount of correlation.

The stationary distribution of the Markov chain is the desired joint distribution over the latent variables, but it may take a while for that stationary distribution to be reached.

Techniques that may reduce autocorrelation between several latent variables are simulated annealing, collapsed Gibbs sampling or blocked Gibbs sampling.
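A small sketch of how burn-in and thinning might be applied when averaging Gibbs samples into an estimate of θ (the estimator (C^DT + α) / (n_d + Tα) follows Steyvers, 2006; the function and parameter names are invented):

import numpy as np

def estimate_theta(CDT_samples, alpha, burn_in=500, thin=10):
    """Average theta over Gibbs samples, discarding burn-in and thinning.

    CDT_samples: list of document-topic count matrices, one per iteration.
    """
    kept = CDT_samples[burn_in::thin]        # drop burn-in, keep every thin-th sample
    thetas = []
    for CDT in kept:
        T = CDT.shape[1]
        theta = (CDT + alpha) / (CDT.sum(axis=1, keepdims=True) + T * alpha)
        thetas.append(theta)
    return np.mean(thetas, axis=0)           # expectation over the kept samples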

Page 25: Topic Models

29

Author-Topic (AT) Model (Rosen-Zvi, 2004)

Aim: discover patterns of word-use and connect authors that exhibit similar patterns

Idea/Intuition: words in a multi-author paper are assumed to be the result of a mixture of the individual authors' topic mixtures

Each author == distribution over topics
Each topic == distribution over words
Each document with multiple authors == distribution over topics that is a mixture of the distributions associated with the authors

Page 26: Topic Models

30

AT-Model Algorithm

Sample author: for each doc d and each word w of that doc, an author x is sampled from the doc's author distribution/set ad.
Sample topic: for each doc d and each word w of that doc, a topic z is sampled from the topic distribution θ(x) of the author x which has been assigned to that word.
Sample word: from the word distribution φ(z) of each sampled topic z, a word w is sampled.
(A code sketch of these three steps follows after the equation below.)

P(w, x | a_d) = P(x | a_d) · ∫∫ P(θ^(x) | α) · P(φ^(z) | β) · Σ_z P(w | z, φ^(z)) · P(z | x, θ^(x)) dθ^(x) dφ^(z)
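As referenced above, a rough generative sketch of the three sampling steps (Python; theta, phi, the author set and all names are invented placeholders, not the original implementation):

import numpy as np

def generate_document(authors_d, theta, phi, vocab, n_words, rng):
    """AT-model style generation of one document.

    authors_d: observed author set a_d (list of author indices)
    theta:     theta[x] = topic distribution of author x
    phi:       phi[z]   = word distribution of topic z
    """
    words = []
    for _ in range(n_words):
        x = rng.choice(authors_d)                 # sample an author from a_d
        z = rng.choice(len(phi), p=theta[x])      # sample a topic from theta(x)
        w = rng.choice(vocab, p=phi[z])           # sample a word from phi(z)
        words.append(w)
    return words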

Page 27: Topic Models

31

AT ModelLatent Variables

Latent Variables:

1) Author-topic assignment for each word

2) Author distribution of each topic: determines which topics are used by which authors (count matrix CAT)

3) Word distribution of each topic (count matrix CWT)

Page 28: Topic Models

32

Matrix Representation of Author-Topic-Model

source: http://www.ics.uci.edu/~smyth/kddpapers/UCI_KD-D_author_topic_preprint.pdf

[Figure: the observed word-document matrix and the observed document-author matrix ad are factored into a latent word-topic matrix φ(z) and a latent topic-author matrix θ(x).]

Page 29: Topic Models

33

Example (1)

1. Random topic-author assignments: every word token in doc1, doc2 and doc3 is randomly assigned a topic (1 or 2) and one of the doc's authors.

2. 2 count matrices:

CWT (words per topic):
        topic1  topic2
money      3       2
bank       3       6
loan       2       1
river      2       2
stream     2       1

CAT (authors per topic):
        author1  author2  author3
topic1      4        8        0
topic2      0        8        4

Page 30: Topic Models

34

Gibbs Sampling for Author-Topic-Model

Estimate the posterior distribution of 2 random variables: z and x.
For each word, we draw an author xi and a topic zi (or a pair (zi, xi) as a block) conditioned on all other variables.
Blocked Gibbs sampling improves convergence of the Gibbs sampler when the variables are highly dependent. (A sketch of the blocked update follows below.)

Terms of the sampling equation:
Count the number of times author k was already assigned to topic j
Count the number of times word token wi was assigned to topic j across all docs
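A rough sketch of the blocked update described above (the count-based form mirrors the LDA sampler sketched earlier; CWT and CAT are assumed to be already decremented for the current token, and all names are invented):

import numpy as np

def sample_author_topic(w, authors_d, CWT, CAT, alpha, beta, rng):
    """Jointly sample an (author, topic) pair for word token w."""
    W, T = CWT.shape
    word_term = (CWT[w, :] + beta) / (CWT.sum(axis=0) + W * beta)           # shape (T,)
    probs = []
    for x in authors_d:                                                     # only the doc's authors
        author_term = (CAT[x, :] + alpha) / (CAT[x, :].sum() + T * alpha)   # topics used by author x
        probs.append(word_term * author_term)
    probs = np.array(probs)
    probs /= probs.sum()                                                    # normalize over all pairs
    flat = rng.choice(probs.size, p=probs.ravel())
    x_idx, z_new = divmod(flat, T)
    return authors_d[x_idx], z_new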

Page 31: Topic Models

35

Problems of the AT Model

The AT model learns the authors' topic distributions for a document corpus
But we don't learn the topic distribution of documents
The AT model cannot model idiosyncratic aspects of a document

Page 32: Topic Models

36

AT Model with Fictitious Authors

Add one fictitious author for each document: ad + 1; uniform or non-uniform distribution over the authors (including the fictitious author)

Each word is either sampled from a real author's or the fictitious author's topic distribution, i.e., we learn topic distributions for real authors and for the fictitious "authors" (= documents).

Problem reported in (Hong, 2010): the topic distribution of each Twitter message learnt via the AT model was worse than LDA with the USER schema, because messages are sparse and not all words of one message are used to learn the document's topic distribution.

Speaker notes: LDA is one extreme: each document has its own topic distribution. The AT model is the other extreme: documents have no own topic distribution. Solution: the AT model with one fictitious author for each document!
Page 33: Topic Models

37

Predictive Power of different models (Rosen-Zvi, 2005)

Experiment: training data: 1,557 papers

Test data: 183 papers (102 are single-authored papers).

They choose test data documents in such a way that each author of a test set document also appears in the training set as an author.

Speaker notes: The LDA model learns a topic mixture for each document in the training data. Thus, on a new document with zero or just a few observed words, it is difficult for the LDA model to provide predictions that are tuned to that document. In contrast, the author-topic model performs better than LDA when few (or even zero) words have been observed from a document, by making use of the available side information about the authors of the document. Once enough words from a specific document have been observed, the predictive performance of the LDA model improves, since it can learn a more accurate predictive model for that specific document.
Page 34: Topic Models

38

Author-Recipient-Topic (ART) Model (McCallum, 2004)

Observed variables:
Words per message
Authors per message
Recipients per message

Sample for each word a recipient-author pair AND a topic, conditioned on the recipient-author pair's topic distribution θ(A,R)

Learn 2 corpus-level variables:
Author-recipient-pair distribution for each topic
Word distribution for each topic

2 count matrices:
Pair-topic
Word-topic

[Plate diagram annotations: R, x, P(z | x, ad, θ(A,R)), P(w | z, φ(z))]

Page 35: Topic Models

39

Gibbs Sampling for the ART Model

Random start: sample an author-recipient pair for each word; sample a topic for each word

Compute for each word wi (terms of the sampling equation):
Number of recipients of the message to which word wi belongs
Number of times topic t was assigned to an author-recipient pair
Number of times the current word token was assigned to topic t
Number of times all other topics were assigned to an author-recipient pair
Number of times all other words were assigned to topic t
Number of words * β

Page 36: Topic Models

40

Labeled LDA(Ramage, 2009)

Word-topic assignments are drawn from a document's topic distribution θ, which is restricted to the topic distribution Λ of the labels observed in d. The topic distribution of a label l is the same as the topic distribution of all documents containing label l.

The document's labels Λ are first generated using a Bernoulli coin toss for each topic k with a labeling prior φ.

Constraining the topic model to use only those topics that correspond to a document's (observed) label set.

Topic assignments are limited to the document's labels
One-to-one correspondence between LDA's latent topics and user tags/labels
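A minimal sketch of this constraint on top of the collapsed Gibbs sampler sketched earlier: before sampling a topic for a word, the probabilities of all topics outside the document's observed label set are set to zero (names invented):

import numpy as np

def constrained_topic_dist(w, d, doc_labels, CWT, CDT, alpha, beta):
    """LDA-style conditional for word w in doc d, restricted to d's labels.

    doc_labels[d]: set of topic/label indices allowed for document d
    (Labeled LDA's one-to-one correspondence between topics and labels).
    """
    W, T = CWT.shape
    p = (CWT[w, :] + beta) / (CWT.sum(axis=0) + W * beta) * (CDT[d, :] + alpha)
    mask = np.zeros(T)
    mask[list(doc_labels[d])] = 1.0          # zero out topics outside the label set
    p = p * mask
    return p / p.sum()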

Page 37: Topic Models

41

Group-Topic Model(Wang, 2005)

Discovery of groups is guided by the emerging topics
Discovery of topics is guided by the emerging groups

The GT model is an extension of the blockstructure model: group membership is conditioned on a latent variable associated with the attributes of the relation (i.e., the words); this latent variable represents the topics which have generated the words.

GT model discovers topics relevant to relationships between entities in the social network

Page 38: Topic Models

42

Group-Topic Model(Wang, 2005)

Generative process:
For each event (an interaction between entities), pick the topic t of the event and then generate all the words describing the event according to the topic's word distribution φ.
For each entity s which interacts within this event, the group assignment g is chosen conditionally from a particular multinomial (discrete) distribution θ over groups for each topic t.
For each event we have a matrix V which stores whether pairs of entities behaved the same or not during the event.

[Plate diagram labels: number of events (= interactions between entities); number of entities]

Page 39: Topic Models

43

CART Model (Pathak, 2008)

Generative process:
To generate email ed, a community cd is chosen uniformly at random
Based on the community cd, the author ad and the set of recipients ρd are chosen
To generate every word w(d,i) in that email, a recipient r(d,i) is chosen uniformly at random from the set of recipients ρd
Based on the community cd, author ad and recipient r(d,i), a topic z(d,i) is chosen
The word w(d,i) itself is chosen based on the topic z(d,i)

Gibbs sampling: alternates between updating latent communities c conditioned on the other variables, and updating recipient-topic tuples (r, z) for each word conditioned on the other variables.

Page 40: Topic Models

44

Copycat Model(Dietz, 2007)

Topics of a citing document are a "weighted sum" of the documents it cites. The weights of the terms capture the notion of influence.

Generative process:
For each word of the citing publication d, a cited publication c' is picked from the set of all cited publications γ.
For each word in the citing publication d, a topic is picked according to the current topic distribution, which is a mix of the topic distributions of the assigned cited documents c'.

Page 41: Topic Models

45

Copycat Model(Dietz, 2007)

Example: a publication c is cited by two publications d1 and d2.

The topic mixture of c is based not only on the words in the cited publication c, but also on those words in d1 and d2 which are associated with c.
This way, the topic mixture of c is influenced by the citing publications d1 and d2!
The topic distribution of the cited document c in turn influences the association of words in d1 and d2 with c.
All tokens that are associated with a cited publication are called the topical atmosphere of that cited publication.

[Figure: d1 and d2 cite c]

Page 42: Topic Models

46

Copycat Model(Dietz, 2007)

Bipartite citation graph: 2 disjoint node sets D and C
D contains only nodes with outgoing citation links (the citing publications)
C contains nodes with incoming links (the cited publications)
Documents in the original citation graph with both incoming and outgoing links are represented as two nodes

Page 43: Topic Models

47

Copycat Model(Dietz, 2007)

Problem: bidirectional interdependence of links and topics caused by the topical atmosphere

Publications that originated in one research area (such as Gibbs sampling, which originated in physics) will also be associated with topics they are often cited by (such as machine learning).

Problem: the model forces each word in a citing publication to be associated with a cited publication, which introduces noise.

Page 44: Topic Models

48

Citation InfluenceModel(Dietz, 2007)

The Copycat model forces each word in a citing publication to be associated with a cited publication; this introduces noise.

A citing publication may choose to draw a word's topic from the topic mixture of a cited publication θc (the topical atmosphere) or from its own topic mixture ψd.

The choice is modeled by a flip of an unfair coin s. The parameter λ of the coin is learned by the model, given an asymmetric beta prior which prefers the topic mixture θ of a cited publication.

The parameter λ yields an estimate of how well a publication fits its citations.

[Figure annotations: ψ = innovation topic mixture of a citing publication; the distribution of citation influences; λ = parameter of the coin flip, choosing to draw topics from θ or ψ]

Page 45: Topic Models

49

References

David M. Blei, Andrew Y. Ng, Michael I. Jordan: Latent Dirichlet Allocation. Journal of Machine Learning Research 3: 993-1022 (2003).

Laura Dietz, Steffen Bickel, Tobias Scheffer: Unsupervised prediction of citation influences. Proc. ICML (2007).

Thomas Hofmann: Probabilistic Latent Semantic Analysis. Proc. of Uncertainty in Artificial Intelligence, UAI'99 (1999).

Thomas L. Griffiths, Joshua B. Tenenbaum, Mark Steyvers: Topics in semantic representation (2007).

Michal Rosen-Zvi, Chaitanya Chemudugunta, Thomas Griffiths, Padhraic Smyth, Mark Steyvers: Learning author-topic models from text corpora (2010).

Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, Padhraic Smyth: The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (2004).

Andrew McCallum, Andres Corrada-Emmanuel, Xuerui Wang: The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email. Technical report (2004).

Nishith Pathak, Colin DeLong, Arindam Banerjee, Kendrick Erickson: Social Topic Models for Community Extraction. In The 2nd SNA-KDD Workshop '08 (2008).

Mark Steyvers, Tom Griffiths: Probabilistic Topic Models (2006).

Daniel Ramage, David Hall, Ramesh Nallapati, Christopher D. Manning: Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. EMNLP '09: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (2009).

Xuerui Wang, Natasha Mohanty, Andrew McCallum: Group and topic discovery from relations and text (2005).

Hanna M. Wallach, David Mimno, Andrew McCallum: Rethinking LDA: Why Priors Matter (2009).