Transcript: Web-Mining Agents – Topic Analysis: pLSI and LDA (78 slides)

Page 1

Web-Mining Agents
Topic Analysis: pLSI and LDA

Tanya Braun, Ralf Möller

Universität zu Lübeck, Institut für Informationssysteme

Page 2

Pilfered from:

Ramesh M. Nallapati, Machine Learning applied to Natural Language Processing
Thomas J. Watson Research Center, Yorktown Heights, NY, USA
from his presentation on Generative Topic Models for Community Analysis

&

David M. Blei, MLSS 2012
from his presentation on Probabilistic Topic Models

&

Sina Miran, CS 290D Paper Presentation on
Probabilistic Latent Semantic Indexing (PLSI)

Acknowledgments

2

Page 3

Recap

• Agents
  – Task/goal: Information retrieval
  – Environment: Document repository
  – Means:
    • Vector space (bag-of-words)
      – Dimension reduction (LSI)
    • Probability based retrieval (binary)
      – Language models
• Today: Topic models as special language models
  – Probabilistic LSI (pLSI)
  – Latent Dirichlet Allocation (LDA)
• Soon: What agents can take with them
  – What agents can leave at the repository (win-win)

3

Page 4

Objectives

• Topic Models: statistical methods that analyze the words of texts in order to
  – discover the themes that run through them (topics),
  – discover how those themes are connected to each other, and
  – discover how they change over time

4

[Figure: "Probabilistic topic models" – per-year occurrence plots (1880-2000) of the words RELATIVITY, LASER, FORCE ("Theoretical Physics") and NERVE, OXYGEN, NEURON ("Neuroscience")]

Page 5

Topic Modeling Scenario

• Each topic is a distribution over words
• Each document is a mixture of corpus-wide topics
• Each word is drawn from one of those topics

5

Page 6

Topic Modeling Scenario

• In reality, we only observe the documents
• The other structures are hidden variables
• Topic modeling algorithms infer these variables from data

6

[Figure: Latent Dirichlet allocation (LDA) – Topics, Documents, Topic proportions and assignments]

Page 7

Plate Notation

• Naïve Bayes Model: Compact representation
  – C = topic/class (name for a word distribution)
  – N = number of words in the considered document
  – W_i = one specific word in the corpus
  – M documents; W now ranges over the words in a document
  – Idea: Generate a document from P(W, C)

7

[Diagrams: unrolled model C → w_1, w_2, w_3, ..., w_N; plate notation: C → W with W inside plate N and both inside plate M]

Page 8

Generative vs. Descriptive Models

• Generative models: Learn P(x, y)
  – Tasks:
    • Transform P(x, y) into P(y | x) for classification
    • Use the model to predict (infer) new data
  – Advantages:
    • Assumptions and model are explicit
    • Use well-known algorithms (e.g., MCMC, EM)
• Descriptive models: Learn P(y | x)
  – Task: Classify data
  – Advantages:
    • Fewer parameters to learn
    • Better performance for classification

8

Page 9

Earlier Topic Models: Topics Known

• Unigram
  – No context information

9

[Plate diagram: W inside plate N]

$P(w_1, \dots, w_N) = \prod_i P(w_i)$

Simplest case: Unigram model (Dan Jurafsky)

$P(w_1 w_2 \dots w_n) \approx \prod_i P(w_i)$

Automatically generated sentences from a unigram model:
fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass
thrift, did, eighty, said, hard, 'm, july, bullish
that, or, limited, the
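To make the unigram model concrete, here is a minimal Python sketch (the vocabulary and probabilities are made up, and NumPy is an assumed dependency): every word position is sampled independently from one multinomial, which is why the generated "sentences" look like the word salad above.

import numpy as np

# Toy unigram model: a single multinomial over the vocabulary (hypothetical values)
vocab = ["the", "of", "inflation", "dollars", "said", "july", "bullish", "quarter"]
p     = [0.30, 0.20, 0.10, 0.10, 0.10, 0.08, 0.07, 0.05]

rng = np.random.default_rng(0)
words = rng.choice(vocab, size=12, p=p)   # every position is drawn i.i.d. from P(w)
print(" ".join(words))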

Page 10

Earlier Topic Models: Topics Known

• Unigram
  – No context information
• Mixture of Unigrams
  – One topic per document
  – How to specify $P(c_d)$?

10

[Plate diagrams: unigram model (W inside plate N); mixture of unigrams (C → W, W inside plate N, both inside plate M)]

Unigram: $P(w_1, \dots, w_N) = \prod_i P(w_i)$

Mixture of unigrams: $\prod_{d=1}^{M} P(w_1, \dots, w_N, c_d) = \prod_{d=1}^{M} P(c_d) \prod_i P(w_i \mid c_d)$

Page 11

Mixture of Unigrams: Known Topics

• Multinomial Naïve Bayes
  – For each document d = 1, ..., M
    • Generate c_d ~ Mult(· | π)
    • For each position i = 1, ..., N_d
      – Generate w_i ~ Mult(· | β, c_d)

11

[Plate diagram: π → C, β → W, C → W; W inside plate N, C and W inside plate M]

$\prod_{d=1}^{M} P(w_1, \dots, w_{N_d}, c_d \mid \beta, \pi) = \prod_{d=1}^{M} P(c_d \mid \pi) \prod_{i=1}^{N_d} P(w_i \mid \beta, c_d) = \prod_{d=1}^{M} \pi_{c_d} \prod_{i=1}^{N_d} \beta_{c_d, w_i}$

with $\pi_{c_d} := P(c_d \mid \pi)$ and $\beta_{c_d, w_i} := P(w_i \mid \beta, c_d)$ (multinomial)

Indicator vector $c = (z_1, \dots, z_K)^\top \in \{0, 1\}^K$, $P(C = c) = \prod_{k=1}^{K} \pi_k^{z_k}$
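A minimal sketch of this generative process (mixture of unigrams with known topics); the topic prior π and the word distributions β below are toy values chosen purely for illustration, not taken from the slides:

import numpy as np

rng = np.random.default_rng(1)

K, M = 2, 3                                   # topics, documents (toy sizes)
vocab = ["goal", "match", "vote", "election"]
pi = np.array([0.5, 0.5])                     # pi_c = P(c | pi)
beta = np.array([[0.45, 0.45, 0.05, 0.05],    # beta_{0,w} = P(w | beta, c=0)
                 [0.05, 0.05, 0.45, 0.45]])   # beta_{1,w} = P(w | beta, c=1)

for d in range(M):
    c_d = rng.choice(K, p=pi)                          # c_d ~ Mult(. | pi)
    words = rng.choice(vocab, size=6, p=beta[c_d])     # w_i ~ Mult(. | beta, c_d)
    print(f"document {d} (topic {c_d}):", " ".join(words))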

Page 12

Multinomial Distribution

• Generalization of the binomial distribution
  – K possible outcomes instead of 2
  – Probability mass function
• n = number of trials
• x_j = count of how often class j occurs
• p_j = probability of class j occurring
• Here, the input to Γ(·) is a positive integer, so Γ(n) = (n − 1)!

12

$f(x_1, \dots, x_K; p_1, \dots, p_K) = \frac{\Gamma\left(\sum_i x_i + 1\right)}{\prod_i \Gamma(x_i + 1)} \prod_{i=1}^{K} p_i^{x_i}$
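As a small worked check of the formula (a sketch; the counts and probabilities are made up), the pmf can be evaluated directly via the Gamma function, using Γ(n) = (n − 1)!:

from math import gamma, prod

def multinomial_pmf(x, p):
    # f(x_1..x_K; p_1..p_K) = Gamma(sum_i x_i + 1) / prod_i Gamma(x_i + 1) * prod_i p_i^x_i
    norm = gamma(sum(x) + 1) / prod(gamma(x_i + 1) for x_i in x)
    return norm * prod(p_i ** x_i for p_i, x_i in zip(p, x))

# n = 10 trials over K = 3 classes
print(multinomial_pmf([3, 2, 5], [0.3, 0.2, 0.5]))   # 10!/(3!2!5!) * 0.3^3 * 0.2^2 * 0.5^5 ≈ 0.085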

Page 13

Mixture of Unigrams: Unknown Topics

13

[Plate diagram: π → Z → W ← β; W inside plate N, Z and W inside plate M]

• Topics/classes are hidden
  – Joint probability of words and classes
  – Sum over topics (K = number of topics)

$\prod_{d=1}^{M} P(w_1, \dots, w_{N_d}, c_d \mid \beta, \pi) = \prod_{d=1}^{M} \pi_{c_d} \prod_{i=1}^{N_d} \beta_{c_d, w_i}$

$\prod_{d=1}^{M} P(w_1, \dots, w_{N_d} \mid \beta, \pi) = \prod_{d=1}^{M} \sum_{k=1}^{K} \pi_{z_k} \prod_{i=1}^{N_d} \beta_{z_k, w_i}$

with $\pi_{z_k} := P(z_k \mid \pi)$ and $\beta_{z_k, w_i} := P(w_i \mid \beta, z_k)$

Page 14

Mixture of Unigrams: Learning

• Learn parameters 𝜋 and 𝛽

• Use likelihood

• Solve

– Not a concave/convex function
– Note: a non-concave/non-convex function is not necessarily convex/concave
– Possibly no unique maximum; many saddle or turning points
→ No easy way to find the roots of the derivative

14

$\prod_{d=1}^{M} P(w_1, \dots, w_{N_d} \mid \beta, \pi) = \prod_{d=1}^{M} \sum_{k=1}^{K} \pi_{z_k} \prod_{i=1}^{N_d} \beta_{z_k, w_i}$

$\sum_{d=1}^{M} \log P(w_1, \dots, w_{N_d} \mid \beta, \pi) = \sum_{d=1}^{M} \log \sum_{k=1}^{K} \pi_{z_k} \prod_{i=1}^{N_d} \beta_{z_k, w_i}$

$\operatorname*{arg\,max}_{\beta, \pi} \sum_{d=1}^{M} \log \sum_{k=1}^{K} \pi_{z_k} \prod_{i=1}^{N_d} \beta_{z_k, w_i}$

Page 15

Trick: Optimize Lower Bound

15

[Figure: iteratively optimizing a lower bound over $(\gamma_{dk}, \theta)$]

Page 16

Mixture of Unigrams: Learning

• The problem

• Optimize w.r.t. each document
• Derive lower bound

The problem again:

$\operatorname*{arg\,max}_{\beta, \pi} \sum_{d=1}^{M} \log \sum_{k=1}^{K} \pi_{z_k} \prod_{i=1}^{N_d} \beta_{z_k, w_i}$

Jensen's inequality:

$\log \sum_i \gamma_i x_i \ge \sum_i \gamma_i \log x_i \quad \text{where } \gamma_i \ge 0 \wedge \sum_i \gamma_i = 1$

$\log \sum_i x_i = \log \sum_i \gamma_i \frac{x_i}{\gamma_i} \ge \sum_i \left( \gamma_i \log x_i - \gamma_i \log \gamma_i \right)$

Lower bound for one document:

$\log \sum_{k=1}^{K} \pi_k \prod_{i=1}^{N_d} \beta_{k, w_i} \ge \sum_{k=1}^{K} \gamma_k \log\!\left(\pi_k \prod_{i=1}^{N_d} \beta_{k, w_i}\right) + H(\gamma)$

16

Page 17

Mixture of Unigrams: Learning

• For each document d

• Chicken-and-egg problem:
  – If we knew $\gamma_{dk}$ we could find $\pi_k$ and $\beta_{k, w_i}$ with ML
  – If we knew $\pi_k$ and $\beta_{k, w_i}$ we could find $\gamma_{dk}$ with ML
• Finally we need $\pi_{z_k}$ and $\beta_{z_k, w_i}$
• Solution: Expectation Maximization (EM)
  – Iterative algorithm to find a local optimum
  – Guaranteed to maximize a lower bound on the log-likelihood of the observed data

17

$\log \sum_{k=1}^{K} \pi_k \prod_{i=1}^{N_d} \beta_{k, w_i} \ge \sum_{k=1}^{K} \gamma_{dk} \log\!\left(\pi_k \prod_{i=1}^{N_d} \beta_{k, w_i}\right) + H(\gamma)$

Page 18

Graphical Idea of the EM Algorithm

18

[Figure: EM alternates between improving the lower bound in $\gamma_{dk}$ and in $\theta$, where $\theta = (\pi_k, \beta_{k, w_i})$]

Page 19

Mixture of Unigrams: Learning

• For each document d

• EM solution
  – E step
  – M step

19

Lower bound (for each document d):

$\log \sum_{k=1}^{K} \pi_k \prod_{i=1}^{N_d} \beta_{k, w_i} \ge \sum_{k=1}^{K} \gamma_{dk} \log\!\left(\pi_k \prod_{i=1}^{N_d} \beta_{k, w_i}\right) + H(\gamma)$

E step:

$\gamma_{dk}^{(t+1)} = \frac{\pi_k^{(t)} \prod_{i=1}^{N_d} \beta_{k, w_i}^{(t)}}{\sum_{l=1}^{K} \pi_l^{(t)} \prod_{i=1}^{N_d} \beta_{l, w_i}^{(t)}}$

M step:

$\pi_k^{(t+1)} = \frac{\sum_{d=1}^{M} \gamma_{dk}^{(t+1)}}{M}, \qquad \beta_{k, w_i}^{(t+1)} = \frac{\sum_{d=1}^{M} \gamma_{dk}^{(t+1)}\, n(d, w_i)}{\sum_{d=1}^{M} \gamma_{dk}^{(t+1)} \sum_{j=1}^{N_d} n(d, w_j)}$

Page 20

Back to Topic Modeling Scenario

20

[Figure: Latent Dirichlet allocation (LDA) – Topics, Documents, Topic proportions and assignments. In reality we only observe the documents; the other structure is hidden variables that topic modeling algorithms infer from data.]

Page 21

Probabilistic LSI

• Select a document d with probability P(d)

• For each word of d in the training set
  – Choose a topic z with probability P(z | d)
  – Generate a word with probability P(w | z)

• Documents can have multiple topics

21

[Plate diagram: d → z → w, with z and w inside plate N, all inside plate M]

Thomas Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the 22nd Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999

$P(d, w_i) = P(d) \sum_{k=1}^{K} P(w_i \mid z_k)\, P(z_k \mid d)$

Page 22

pLSI

• Joint probability for all documents, words

• Distribution for document d, word wi

• Reformulate $P(z_k \mid d)$ with Bayes' rule

22

$P(d, w_i) = P(d) \sum_{k=1}^{K} P(w_i \mid z_k)\, P(z_k \mid d)$

$P(d, w_i) = \sum_{k=1}^{K} P(d \mid z_k)\, P(z_k)\, P(w_i \mid z_k)$

[Diagrams: asymmetric parameterization $d \to z_k \to w_i$ with $P(d)$, $P(z_k \mid d)$, $P(w_i \mid z_k)$; symmetric parameterization $d \leftarrow z_k \to w_i$ with $P(z_k)$, $P(d \mid z_k)$, $P(w_i \mid z_k)$]

Joint probability for all documents and words: $\prod_{d=1}^{M} \prod_{i=1}^{N_d} P(d, w_i)^{n(d, w_i)}$

Page 23

pLSI: Learning Using EM

• Model

• Likelihood

• Parameters to learn (M step)

• (E step)

23

Model: $\prod_{d=1}^{M} \prod_{i=1}^{N_d} P(d, w_i)^{n(d, w_i)}$ with $P(d, w_i) = \sum_{k=1}^{K} P(d \mid z_k)\, P(z_k)\, P(w_i \mid z_k)$

Likelihood: $L = \sum_{d=1}^{M} \sum_{i=1}^{N_d} n(d, w_i) \log P(d, w_i) = \sum_{d=1}^{M} \sum_{i=1}^{N_d} n(d, w_i) \log \sum_{k=1}^{K} P(d \mid z_k)\, P(z_k)\, P(w_i \mid z_k)$

Parameters to learn (M step): $P(w_i \mid z_k)$, $P(z_k)$, $P(d \mid z_k)$

(E step): $P(z_k \mid d, w_i)$

Page 24

pLSI: Learning Using EM

• EM solution
  – E step
  – M step

24

E step:

$P(z_k \mid d, w_i) = \frac{P(z_k)\, P(d \mid z_k)\, P(w_i \mid z_k)}{\sum_{l=1}^{K} P(z_l)\, P(d \mid z_l)\, P(w_i \mid z_l)}$

M step:

$P(w_i \mid z_k) = \frac{\sum_{d=1}^{M} n(d, w_i)\, P(z_k \mid d, w_i)}{\sum_{d=1}^{M} \sum_{j=1}^{N_d} n(d, w_j)\, P(z_k \mid d, w_j)}$

$P(d \mid z_k) = \frac{\sum_{i=1}^{N_d} n(d, w_i)\, P(z_k \mid d, w_i)}{\sum_{c=1}^{M} \sum_{i=1}^{N_c} n(c, w_i)\, P(z_k \mid c, w_i)}$

$P(z_k) = \frac{1}{R} \sum_{d=1}^{M} \sum_{i=1}^{N_d} n(d, w_i)\, P(z_k \mid d, w_i), \qquad R = \sum_{d=1}^{M} \sum_{i=1}^{N_d} n(d, w_i)$
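A minimal NumPy sketch of these pLSI EM updates on a document-term count matrix (the random initialization and the small stabilizing constants are assumptions of the sketch, not part of the slide):

import numpy as np

def plsi_em(n_dw, K, iters=50, seed=0):
    # n_dw: M x V count matrix n(d, w). Learns P(z), P(d|z), P(w|z) by EM.
    rng = np.random.default_rng(seed)
    M, V = n_dw.shape
    p_z = np.full(K, 1.0 / K)                       # P(z_k)
    p_d_z = rng.dirichlet(np.ones(M), size=K).T     # M x K, column k is P(d | z_k)
    p_w_z = rng.dirichlet(np.ones(V), size=K).T     # V x K, column k is P(w | z_k)
    for _ in range(iters):
        # E step: P(z_k | d, w_i) proportional to P(z_k) P(d | z_k) P(w_i | z_k)
        post = p_z[None, None, :] * p_d_z[:, None, :] * p_w_z[None, :, :]   # M x V x K
        post /= post.sum(axis=2, keepdims=True) + 1e-30
        # M step: weighted counts n(d, w) * P(z | d, w)
        nz = n_dw[:, :, None] * post                # M x V x K
        total_k = nz.sum(axis=(0, 1)) + 1e-30       # one normalizer per topic
        p_w_z = nz.sum(axis=0) / total_k            # V x K
        p_d_z = nz.sum(axis=1) / total_k            # M x K
        p_z = nz.sum(axis=(0, 1)) / n_dw.sum()
    return p_z, p_d_z, p_w_z

n_dw = np.array([[5, 3, 0, 0], [4, 4, 1, 0], [0, 1, 6, 3], [0, 0, 4, 5]])  # toy counts
p_z, p_d_z, p_w_z = plsi_em(n_dw, K=2)
print(np.round(p_w_z, 2))   # word distributions of the two learned topics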

Page 25

pLSI: Overview

• More realistic than the mixture model
  – Documents can discuss multiple topics!
• Problems
  – Very many parameters
  – Danger of overfitting

25

Page 26

pLSI Testrun

• pLSI topics (TDT-1 corpus)
  – Approx. 7 million words, 15,863 documents, K = 128

The two most probable topics that generate the term “flight” (left) and “love” (right).

List of most probable words per topic, with decreasing probability going down the list.

26

Page 27

Relation with LSI

• Difference:
  – LSI: minimize the Frobenius (L2) norm
  – pLSI: maximize the log-likelihood of the training data

27

$P = U_k \Sigma_k V_k^\top$ corresponding to $P(d, w_i) = \sum_{k=1}^{K} P(d \mid z_k)\, P(z_k)\, P(w_i \mid z_k)$

$U_k = \left(P(d \mid z_k)\right)_{d,k}, \quad \Sigma_k = \operatorname{diag}\left(P(z_k)\right)_k, \quad V_k = \left(P(w_i \mid z_k)\right)_{i,k}$

pLSI can thus be seen as a matrix factorization: $P_{PLSI}(d, w) = \sum_{z \in Z} P(d \mid z)\, P(z)\, P(w \mid z)$

Contrast to SVD:
• No orthonormality condition for U and V here.
• The elements of U and V are non-negative.

Page 28

pLSI with Multinomials

• Multinomial Naïve Bayes
  – Select document d ~ Mult(· | π)
  – For each position i = 1, ..., N_d
    • Generate z_i ~ Mult(· | d, θ_d)
    • Generate w_i ~ Mult(· | z_i, β_k)

28

[Plate diagram: π → d, θ_d → z → w ← β; z and w inside plate N, all inside plate M]

$\prod_{d=1}^{M} P(w_1, \dots, w_{N_d}, d \mid \beta, \theta, \pi) = \prod_{d=1}^{M} P(d \mid \pi) \prod_{i=1}^{N_d} \sum_{k=1}^{K} P(z_i = k \mid d, \theta_d)\, P(w_i \mid \beta_k) = \prod_{d=1}^{M} \pi_d \prod_{i=1}^{N_d} \sum_{k=1}^{K} \theta_{d,k}\, \beta_{k, w_i}$

Page 29

Prior Distribution for Topic Mixture

• Goal: topic mixture proportions for each document are drawn from some distribution
  – Distribution on multinomials (k-tuples of non-negative numbers that sum to one)
• The space of all of these multinomials can be interpreted geometrically as a (k-1)-simplex
  – Generalization of a triangle to (k-1) dimensions
• Criteria for selecting our prior:
  – It needs to be defined over a (k-1)-simplex
  – It should have nice properties

29

In Bayesian probability theory, if the posterior distributions p(θ|x) are in the same family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. [Wikipedia]

Page 30

Latent Dirichlet Allocation

• Document = mixture of topics (as in pLSI), but according to a Dirichlet prior

– When we use a uniform Dirichlet prior, pLSI=LDA

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003 30

Page 31

Dirichlet Distributions

• Defined over a (k-1)-simplex
  – Takes K non-negative arguments which sum to one
  – Consequently it is a natural distribution to use over multinomial distributions
• The Dirichlet parameter α_i can be thought of as a prior count of the i-th class
• Conjugate prior to the multinomial distribution
  – Conjugate prior: if our likelihood is multinomial with a Dirichlet prior, then the posterior is also a Dirichlet

31

Multinomial: $f(x_1, \dots, x_K; p_1, \dots, p_K) = \frac{\Gamma\left(\sum_i x_i + 1\right)}{\prod_i \Gamma(x_i + 1)} \prod_{i=1}^{K} p_i^{x_i}$

Dirichlet: $p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \prod_{i=1}^{K} \theta_i^{\alpha_i - 1}$
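The conjugacy and the prior-count reading of α_i can be checked numerically; a small sketch with made-up numbers (NumPy assumed): a Dir(α) prior combined with multinomial counts x yields a Dir(α + x) posterior.

import numpy as np

alpha = np.array([2.0, 2.0, 2.0])     # Dirichlet prior parameters, read as prior counts
counts = np.array([10, 3, 1])         # observed multinomial counts x

posterior = alpha + counts            # conjugacy: the posterior is Dirichlet(alpha + x)
print(posterior / posterior.sum())    # posterior mean of theta, pulled toward the observed proportions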

Page 32

LDA Model

[Figure: unrolled LDA graphical model, α → θ → z_1, ..., z_4 → w_1, ..., w_4 ← β, repeated for three documents]

32

Page 33

LDA Model – Parameters

α ← Proportions parameter (k-dimensional vector of real numbers)
θ ← Per-document topic distribution (k-dimensional vector of probabilities summing to 1)
z ← Per-word topic assignment (number from 1 to k)
w ← Observed word (number from 1 to v, where v is the number of words in the vocabulary)
β ← Word "prior" (v-dimensional)

33

[Plate diagram: α → θ → z → w ← β; z and w inside plate N, θ inside plate M]

Page 34

Dirichlet Distribution over a 2-Simplex

34

A panel illustrating probability density functions of a few Dirichlet distributions over a 2-simplex, for the following α vectors (clockwise, starting from the upper left corner): (1.3, 1.3, 1.3), (3, 3, 3), (7, 7, 7), (2, 6, 11), (14, 9, 5), (6, 2, 6). [Wikipedia]

Page 35

LDA Model – Plate Notation

• For each document d,
  – Generate θ_d ~ Dirichlet(· | α)
  – For each position i = 1, ..., N_d
    • Generate a topic z_i ~ Mult(· | θ_d)
    • Generate a word w_i ~ Mult(· | z_i, β)

35

[Plate diagram: α → θ → z → w ← β; z and w inside plate N, θ inside plate M]

$P(\beta, \theta, z_1, \dots, z_{N_d}, w_1, \dots, w_{N_d}) = \prod_{d=1}^{M} P(\theta_d \mid \alpha) \prod_{i=1}^{N_d} P(z_i \mid \theta_d)\, P(w_i \mid \beta, z_i)$
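A minimal sketch of exactly this generative process (toy sizes; α and the topic-word distributions β are made up, NumPy assumed):

import numpy as np

rng = np.random.default_rng(2)

K, V, M, N_d = 2, 5, 3, 8                    # topics, vocabulary size, documents, words per document
alpha = np.full(K, 0.5)
beta = rng.dirichlet(np.ones(V), size=K)     # K topic-word distributions (hypothetical)

for d in range(M):
    theta_d = rng.dirichlet(alpha)           # theta_d ~ Dirichlet(alpha)
    doc = []
    for i in range(N_d):
        z_i = rng.choice(K, p=theta_d)       # topic z_i ~ Mult(. | theta_d)
        w_i = rng.choice(V, p=beta[z_i])     # word  w_i ~ Mult(. | z_i, beta)
        doc.append(int(w_i))
    print(f"document {d}: theta = {np.round(theta_d, 2)}, word ids = {doc}")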

Page 36

Corpus-level Parameter $\alpha = K^{-1} \sum_i \alpha_i$

• Let α = 1
• Per-document topic distribution: K = 10, D = 15

36

[Figure: 15 sampled per-document topic distributions $\theta_d$ over K = 10 topics for α = 1]
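The qualitative effect shown in these panels (and on the next two slides) can be reproduced by sampling θ_d ~ Dirichlet(α, ..., α) for different scalar α; a small sketch, using the mean maximum component as a rough sparsity indicator:

import numpy as np

rng = np.random.default_rng(3)
K, D = 10, 15

for a in [0.01, 0.1, 1.0, 10.0, 100.0]:
    thetas = rng.dirichlet(np.full(K, a), size=D)    # D sampled per-document topic distributions
    # max component near 1.0 means peaked/sparse draws; near 1/K = 0.1 means nearly uniform draws
    print(f"alpha = {a:>6}: mean max component = {thetas.max(axis=1).mean():.2f}")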

Page 37

Corpus-level Parameter α

• α = 10
• α = 100

37

[Figure: sampled per-document topic distributions over K = 10 topics, D = 15 documents, for α = 10 and α = 100]

Page 38

Corpus-level Parameter α

• α = 0.1
• α = 0.01

38

[Figure: sampled per-document topic distributions over K = 10 topics, D = 15 documents, for α = 0.1 and α = 0.01]

Page 39

Back to Topic Modeling Scenario

39

[Figure: Latent Dirichlet allocation (LDA) – Topics, Documents, Topic proportions and assignments. In reality we only observe the documents; the other structure is hidden variables that topic modeling algorithms infer from data.]

Page 40

Smoothed LDA Model

• Give a different word distribution to each topic
  – β is a K×V matrix (V = vocabulary size); each row is the word distribution of one topic
• For each topic k = 1, ..., K
  – Choose β_k ~ Dirichlet(· | η)
• For each document d
  – Choose θ_d ~ Dirichlet(· | α)
  – For each position i = 1, ..., N_d
    • Generate a topic z_i ~ Mult(· | θ_d)
    • Generate a word w_i ~ Mult(· | β_{z_i})

40

[Plate diagram: α → θ → z → w ← β_k ← η; z and w inside plate N, θ inside plate M, β_k inside plate K]

Page 41

Smoothed LDA Model

41

[Plate diagram: α → θ → z → w ← β_k ← η; z and w inside plate N, θ inside plate M, β_k inside plate K]

$P(\beta, \theta, z_1, \dots, z_{N_d}, w_1, \dots, w_{N_d}) = \prod_{k=1}^{K} P(\beta_k \mid \eta) \cdot \prod_{d=1}^{M} P(\theta_d \mid \alpha) \prod_{i=1}^{N_d} P(z_{d,i} \mid \theta_d)\, P(w_{d,i} \mid \beta_{1:K}, z_{d,i})$

Page 42

Back to Topic Modeling Scenario

• But…

42

Page 43

Why does LDA “work”?

• Trade-off between two goals
  1. For each document, allocate its words to as few topics as possible.
  2. For each topic, assign high probability to as few terms as possible.
• These goals are at odds.
  – Putting a document in a single topic makes #2 hard: all of its words must have probability under that topic.
  – Putting very few words in each topic makes #1 hard: to cover a document's words, it must assign many topics to it.

• Trading off these goals finds groups of tightly co-occurring words

43

Page 44

Inference: The Problem (non-smoothed version)

• To which topics does a given document belong?
  – Compute the posterior distribution of the hidden variables given a document:

44

$P(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{P(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{P(\mathbf{w} \mid \alpha, \beta)}$

where

$P(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = P(\theta \mid \alpha) \prod_{n=1}^{N} P(z_n \mid \theta)\, P(w_n \mid z_n, \beta)$

and

$P(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta$

This not only looks awkward, it is also computationally intractable in general, because of the coupling between $\theta$ and $\beta_{ij}$. Solution: approximations.

Page 45

LDA Learning

• Parameter learning:
  – Variational EM
    • Numerical approximation using lower bounds
    • Results in biased solutions
    • Convergence has numerical guarantees
  – Gibbs Sampling
    • Stochastic simulation
    • Unbiased solutions
    • Stochastic convergence

D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, January 2003 45

Page 46

LDA: Variational Inference

• Replace LDA model with simpler one

• Minimize the Kullback-Leibler divergence between the two distributions.

46

[Diagrams: the LDA graphical model vs. a simpler, fully factorized variational model; the KL divergence between the two is minimized]

• Minimizing the Kullback-Leibler divergence between the two distributions is a standard trick in machine learning, often equivalent to maximum likelihood; it leads to a gradient-descent procedure.
• Variational approach in contrast to the often-used Markov chain Monte Carlo: variational inference is not a sampling approach where one samples from the posterior distribution; rather, we try to find the tightest lower bound.
• The problematic coupling between $\theta$ and $\beta$ is not present in the simpler graphical model.

Page 47

LDA: Gibbs Sampling

• MCMC algorithm
  – Fix all current values but one and sample that value, e.g., for z
  – Eventually converges to the true posterior

47

$P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w})$
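For illustration, a minimal collapsed Gibbs sampler sketch for LDA in Python. It uses the standard full conditional $P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \propto (N_{dk} + \alpha)\,(N_{kw} + \beta)/(N_k + V\beta)$, the same "LDA term" that reappears in the RTM sampler later in this deck; the hyperparameter values, sweep count, and toy corpus are assumptions of the sketch.

import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, beta=0.01, sweeps=200, seed=0):
    # docs: list of lists of word ids in [0, V). Returns assignments and count tables.
    rng = np.random.default_rng(seed)
    N_dk = np.zeros((len(docs), K))              # document-topic counts
    N_kw = np.zeros((K, V))                      # topic-word counts
    N_k = np.zeros(K)                            # topic totals
    z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):               # counts for the random initialization
        for i, w in enumerate(doc):
            k = z[d][i]
            N_dk[d, k] += 1; N_kw[k, w] += 1; N_k[k] += 1
    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                      # remove the current assignment
                N_dk[d, k] -= 1; N_kw[k, w] -= 1; N_k[k] -= 1
                # P(z_i = k | z_-i, w) up to normalization
                p = (N_dk[d] + alpha) * (N_kw[:, w] + beta) / (N_k + V * beta)
                k = int(rng.choice(K, p=p / p.sum()))
                z[d][i] = k                      # add the new assignment
                N_dk[d, k] += 1; N_kw[k, w] += 1; N_k[k] += 1
    return z, N_dk, N_kw

z, N_dk, N_kw = lda_gibbs([[0, 1, 0, 2], [3, 4, 3, 4]], K=2, V=5)
print(N_dk)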

Page 48

Variational Inference vs. Gibbs Sampling

• Gibbs sampling is slower (takes days for moderately sized datasets); variational inference takes a few hours.
• Gibbs sampling is more accurate.
• Gibbs sampling convergence is difficult to test, although quite a few machine-learning approximate inference techniques have the same problem.

48

Page 49

LDA Application: Reuters Data

• Setup
  – 100-topic LDA trained on a 16,000-document corpus of news articles by Reuters
  – Some standard stop words removed
• Top words from some of the P(w | z)

49

  "Arts"     "Budgets"   "Children"   "Education"
  new        million     children     school
  film       tax         women        students
  show       program     people       schools
  music      budget      child        education
  movie      billion     years        teachers
  play       federal     families     high
  musical    year        work         public

Page 50

LDA Application: Reuters Data

• Result

50

Inference on a held-out document. Again: "Arts", "Budgets", "Children", "Education".

The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. "Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants, an act every bit as important as our traditional areas of support in health, medical research, education and the social services," Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants.
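For reference, this is roughly how training such a topic model and listing its top words could look with the gensim library; a hedged usage sketch on a toy corpus, not the setup or data of the original Reuters experiment, and gensim itself is an assumed dependency:

from gensim import corpora, models

texts = [["new", "film", "show", "music", "movie"],
         ["million", "tax", "budget", "billion", "federal"],
         ["children", "women", "child", "families", "work"]]   # toy tokenized documents

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]                # bag-of-words counts
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for topic in lda.print_topics(num_words=5):                    # top words per P(w | z)
    print(topic)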

Page 51

Measuring Performance

• Perplexity of a probability model
• How well a probability distribution or probability model predicts a sample
  – q: model of an unknown probability distribution p, based on a training sample drawn from p
  – Evaluate q by asking how well it predicts a separate test sample $x_1, \dots, x_N$ also drawn from p
  – Perplexity of q w.r.t. the sample $x_1, \dots, x_N$ is defined below
  – A better model q will tend to assign higher probabilities $q(x_i)$ → lower perplexity ("less surprised by the sample")

51 [Wikipedia]

$2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i)}$
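A direct sketch of this definition (base-2 logarithm, as in the formula; the probabilities are made-up model outputs q(x_i)):

import numpy as np

def perplexity(q_probs):
    # 2 ** ( -(1/N) * sum_i log2 q(x_i) ): lower means the model is less surprised
    q = np.asarray(q_probs, dtype=float)
    return 2 ** (-np.mean(np.log2(q)))

print(perplexity([0.5, 0.25, 0.125]))   # 4.0
print(perplexity([1.0, 1.0, 1.0]))      # 1.0, a model that predicts the sample perfectly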

Page 52

Perplexity of Various Models

[Figure: test-set perplexity of various models, from Unigram to LDA]

52

Page 53

Use of LDA

• A widely used topic model (Griffiths, Steyvers, 04)
• Complexity is an issue
• Use in IR:
  – Ad hoc retrieval (Wei and Croft, SIGIR 06: TREC benchmarks)
  – Improvements over traditional LM (e.g., LSI techniques)
  – But no consensus on whether there is any improvement over the relevance model, i.e., the model with relevance feedback (relevance feedback is part of the TREC tests)

T. Griffiths, M. Steyvers, Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235. 2004

53

Xing Wei and W. Bruce Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '06). ACM, New York, NY, USA, 178-185. 2006.

TREC=Text REtrieval Conference

Page 54

Generative Topic Models for Community Analysis

Pilfered from: Ramesh Nallapati, http://www.cs.cmu.edu/~wcohen/10-802/lda-sep-18.ppt

&

Arthur Asuncion, Qiang Liu, Padhraic Smyth: Statistical Approaches to Joint Modeling of Text and Network Data

54

Page 55

What if the corpus has network structure?

55

CORA citation network. Figure from [Chang, Blei, AISTATS 2009]

J. Chang, and D. Blei. Relational Topic Models for Document Networks.AISTATS, volume 5 of JMLR Proceedings, page 81-88. JMLR.org, 2009.

Page 56

Outline

• Topic Models for Community Analysis
  – Citation Modeling
    • with pLSI
    • with LDA
  – Modeling influence of citations
  – Relational Topic Models

56

Page 57

Hyperlink Modeling Using pLSI

• Select document d ~ Mult(· | π)
  – For each position n = 1, ..., N_d
    • Generate z_n ~ Mult(· | θ_d)
    • Generate w_n ~ Mult(· | β_{z_n})
  – For each citation j = 1, ..., L_d
    • Generate z_j ~ Mult(· | θ_d)
    • Generate c_j ~ Mult(· | γ_{z_j})

57

[Plate diagram: π → d with topic mixture θ_d; words z → w with β inside plate N; citations z → c with γ inside plate L; all inside plate M]

D. A. Cohn, Th. Hofmann, The Missing Link - A Probabilistic Model of Document Content and Hypertext Connectivity, In: Proc. NIPS, pp. 430-436, 2000

Page 58

Hyperlink Modeling Using pLSI

• pLSI likelihood (below)
• New likelihood including citations (below)
• Learning using EM

58

pLSI likelihood:

$\prod_{d=1}^{M} P(w_1, \dots, w_{N_d}, d \mid \theta, \beta, \pi) = \prod_{d=1}^{M} \pi_d \prod_{i=1}^{N_d} \sum_{k=1}^{K} \theta_{dk}\, \beta_{k, w_i}$

New likelihood (words and citations):

$\prod_{d=1}^{M} P(w_1, \dots, w_{N_d}, c_1, \dots, c_{L_d}, d \mid \theta, \beta, \gamma, \pi) = \prod_{d=1}^{M} \pi_d \left( \prod_{i=1}^{N_d} \sum_{k=1}^{K} \theta_{dk}\, \beta_{k, w_i} \right) \left( \prod_{j=1}^{L_d} \sum_{k=1}^{K} \theta_{dk}\, \gamma_{k, c_j} \right)$

[Plate diagram as on the previous slide]

Page 59

Hyperlink Modeling Using pLSI

• Heuristic
  – 0 < α < 1 determines the relative importance of content and hyperlinks

59

$\prod_{d=1}^{M} P(w_1, \dots, w_{N_d}, c_1, \dots, c_{L_d}, d \mid \theta, \beta, \gamma, \pi) = \prod_{d=1}^{M} \pi_d \left( \prod_{i=1}^{N_d} \sum_{k=1}^{K} \theta_{dk}\, \beta_{k, w_i} \right)^{\alpha} \left( \prod_{j=1}^{L_d} \sum_{k=1}^{K} \theta_{dk}\, \gamma_{k, c_j} \right)^{1-\alpha}$

Page 60

Hyperlink Modeling Using pLSI

• Classification performance

60

[Figure: classification performance (hyperlink vs. content)]

Page 61

Hyperlink modeling using LDA

• For each document d,
  – Generate θ_d ~ Dirichlet(α)
  – For each position i = 1, ..., N_d
    • Generate a topic z_i ~ Mult(· | θ_d)
    • Generate a word w_i ~ Mult(· | β_{z_i})
  – For each citation j = 1, ..., L_d
    • Generate z_j ~ Mult(· | θ_d)
    • Generate c_j ~ Mult(· | γ_{z_j})
• Learning using variational EM, Gibbs sampling

61

[Plate diagram: α → θ; words z → w with β inside plate N; citations z → c with γ inside plate L; all inside plate M]

E. Erosheva, S Fienberg, J. Lafferty, Mixed-membership models of scientific publications. Proc National Academy Science U S A. 2004 Apr 6;101 Suppl 1:5220-7. Epub 2004 Mar 12.

Page 62

Link-pLSI-LDA: Topic Influence in Blogs

R. Nallapati, A. Ahmed, E. Xing, W.W. Cohen, Joint Latent Topic Models for Text and Citations, In: Proc. KDD, 2008. 62

[Figure: Link-pLSI-LDA model for topic influence in blogs]

Page 63

63

Page 64

Modeling Citation Influences - Copycat Model

• Each topic in a citing document is drawn from one of the topic mixtures of cited publications

64

L. Dietz, St. Bickel, and T. Scheffer, Unsupervised Prediction of Citation Influences, In: Proc. ICML 2007.

Page 65

Modeling Citation Influences

• Citation influence model: Combination of LDA and Copycat model

65

L. Dietz, St. Bickel, and T. Scheffer, Unsupervised Prediction of Citation Influences, In: Proc. ICML 2007.

Page 66

Modeling Citation Influences

• Citation influence graph for LDA paper

66

Page 67

Modeling Citation Influences

• Words in LDA paper assigned to citations

67

Page 68

Modeling Citation Influences

• Predictive Performance

68

Page 69

Relational Topic Model (RTM) [Chang, Blei 2009]

• Same setup as LDA, except now we have observed network information across documents

69

“Link probability function”

Documents with similar topics are more likely to be linked.

$y_{d,d'} \sim \psi(y_{d,d'} \mid z_d, z_{d'}, \eta)$

[Plate diagram: two LDA-style document plates ($\theta_d \to z_{d,n} \to w_{d,n}$ and $\theta_{d'} \to z_{d',n} \to w_{d',n}$, sharing α and the topics $\beta_k$), connected through the binary link variable $y_{d,d'}$ with parameter η]

J. Chang, and D. Blei. Relational Topic Models for Document Networks.AISTATS, volume 5 of JMLR Proceedings, page 81-88. JMLR.org, 2009.

Page 70

Relational Topic Model (RTM) [Chang, Blei 2009]

• For each document d
  – Draw topic proportions $\theta_d \mid \alpha \sim \mathrm{Dir}(\alpha)$
  – For each word $w_{d,n}$
    • Draw assignment $z_{d,n} \mid \theta_d \sim \mathrm{Mult}(\theta_d)$
    • Draw word $w_{d,n} \mid z_{d,n}, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_{d,n}})$
• For each pair of documents $d, d'$
  – Draw binary link indicator $y \mid z_d, z_{d'} \sim \psi(\cdot \mid z_d, z_{d'}, \eta)$

70

[Plate diagram as on the previous slide]

Page 71

Collapsed Gibbs Sampling for RTM

• Conditional distribution of each 𝑧:

• Using the exponential link probability function, it is computationally efficient to calculate the “edge” term.

• It is very costly to compute the “non-edge” term exactly.

71

$P(z_{d,n} = k \mid \mathbf{z}_{\neg d,n}, \cdot) \;\propto\; \underbrace{\left(N_{dk}^{\neg d,n} + \alpha\right) \frac{N_{kw}^{\neg d,n} + \beta}{N_{k}^{\neg d,n} + W\beta}}_{\text{LDA term}} \;\; \underbrace{\prod_{d' \neq d:\, y_{d,d'}=1} \psi\!\left(y_{d,d'}=1 \mid \mathbf{z}_d, \mathbf{z}_{d'}, \eta\right)}_{\text{"Edge" term}} \;\; \underbrace{\prod_{d' \neq d:\, y_{d,d'}=0} \psi\!\left(y_{d,d'}=0 \mid \mathbf{z}_d, \mathbf{z}_{d'}, \eta\right)}_{\text{"Non-edge" term}}$

Page 72

Approximating the Non-edges

1. Assume non-edges are “missing” and ignore the term entirely (Chang/Blei)

2. Make the following fast approximation:

3. Subsample non-edges and exactly calculate the term over subset.

4. Subsample non-edges but, instead of recalculating statistics for every $z_{d,n}$ token, calculate statistics once per document and cache them over each Gibbs sweep.

72

$\prod_i \left(1 - \exp(c_i)\right) \approx \left(1 - \exp(\bar{c})\right)^{N}, \quad \text{where } \bar{c} = \frac{1}{N} \sum_i c_i$
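A quick numerical illustration of approximation 2 (the per-non-edge terms c_i below are made up; working with log values avoids underflow): the approximation replaces N individual factors by one factor built from their mean, and is close when the c_i do not vary too much.

import numpy as np

rng = np.random.default_rng(4)
c = -rng.uniform(1.0, 1.5, size=100)                   # hypothetical per-non-edge terms (exp(c) < 1)
log_exact = np.sum(np.log(1 - np.exp(c)))              # log prod_i (1 - exp(c_i))
log_approx = len(c) * np.log(1 - np.exp(c.mean()))     # log (1 - exp(c_bar))^N
print(log_exact, log_approx)                           # similar values when the c_i are homogeneous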

Page 73

73

Document networks

                    # Docs   # Links   Ave. Doc-Length   Vocab-Size   Link Semantics
CORA                 4,000    17,000        1,200          60,000     Paper citation (undirected)
Netflix Movies      10,000    43,000          640          38,000     Common actor/director
Enron (Undirected)   1,000    16,000        7,000          55,000     Communication between person i and person j
Enron (Directed)     2,000    21,000        3,500          55,000     Email from person i to person j

Page 74

Link Rank

• "Link rank" on held-out data as evaluation metric
  – Lower is better
• How to compute link rank (simplified, see the sketch below):
  1. Train the model with {d_train}.
  2. Given the model, calculate the probability that d_test would link to each d_train. Rank {d_train} according to these probabilities.
  3. For each observed link between d_test and {d_train}, find the "rank", and average all these ranks to obtain the "link rank".

74

[Diagram: d_test and the training documents {d_train} are fed to a black-box predictor, which produces a ranking over {d_train}; the observed edges between d_test and {d_train} (and among {d_train}) yield the link ranks]
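A minimal sketch of the simplified link-rank computation described above, assuming some predictor has already produced a link probability between d_test and every training document (names and values are illustrative):

import numpy as np

def link_rank(link_probs, observed_links):
    # link_probs: predicted P(link) between d_test and each training document
    # observed_links: indices of the training documents d_test actually links to
    order = np.argsort(-np.asarray(link_probs))          # most probable training doc first
    rank_of = {int(doc): r + 1 for r, doc in enumerate(order)}
    return float(np.mean([rank_of[d] for d in observed_links]))

# 5 training documents; d_test truly links to documents 0 and 3
print(link_rank([0.9, 0.1, 0.2, 0.6, 0.05], observed_links=[0, 3]))   # (1 + 2) / 2 = 1.5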

Page 75

75

Results on CORA data

[Figure: "Comparison on CORA, K=20"; link rank (roughly 150-270) for Baseline (TFIDF/Cosine), LDA + Regression, Ignoring non-edges, Fast approximation of non-edges, and Subsampling non-edges (20%) + Caching]

8-fold cross-validation. Random guessing gives link rank = 2000.

Page 76

Results on CORA data

• Model does better with more topics
• Model does better with more words in each document

76

[Figures: link rank vs. number of topics (Baseline vs. RTM, fast approximation) and link rank vs. percentage of words per document (Baseline, LDA + Regression (K=40), Ignoring Non-Edges (K=40), Fast Approximation (K=40), Subsampling (5%) + Caching (K=40))]

Page 77

77

Timing Results on CORA

“Subsampling (20%) without caching” not shown since it takes 62,000 seconds for D=1000 and 3,720,150 seconds for D=4000

[Figure: runtime in seconds vs. number of documents (1,000-4,000) on CORA, K=20, for LDA + Regression, Ignoring Non-Edges, Fast Approximation, Subsampling (5%) + Caching, and Subsampling (20%) + Caching]

Page 78

78

Conclusion

• Relational topic modeling provides a useful start for combining text and network data in a single statistical framework

• RTM can improve over simpler approaches for link prediction

• Opportunities for future work:
  – Faster algorithms for larger data sets
  – Better understanding of non-edge modeling
  – Extended models