Web-Mining Agents
Topic Analysis: pLSI and LDA
Tanya Braun, Ralf Möller
Universität zu Lübeck, Institut für Informationssysteme
Acknowledgments
Pilfered from:
Ramesh M. Nallapati, Machine Learning applied to Natural Language Processing,
Thomas J. Watson Research Center, Yorktown Heights, NY, USA,
from his presentation on Generative Topic Models for Community Analysis
&
David M. Blei, from his presentation on Probabilistic Topic Models, MLSS 2012
&
Sina Miran, CS 290D paper presentation on Probabilistic Latent Semantic Indexing (PLSI)
2
Recap
• Agents
  – Task/goal: Information retrieval
  – Environment: Document repository
  – Means:
    • Vector space (bag-of-words)
      – Dimension reduction (LSI)
    • Probability-based retrieval (binary)
      – Language models
• Today: Topic models as special language models
  – Probabilistic LSI (pLSI)
  – Latent Dirichlet Allocation (LDA)
• Soon: What agents can take with them
  – What agents can leave at the repository (win-win)
3
Objectives
• Topic models: statistical methods that analyze the words of texts in order to
  – discover the themes that run through them (topics),
  – discover how those themes are connected to each other, and
  – discover how they change over time
4
Probabilistic topic models
[Figure: relative word frequencies from 1880 to 2000 in two inferred topics; "Theoretical Physics" (RELATIVITY, LASER, FORCE) and "Neuroscience" (NERVE, OXYGEN, NEURON)]
Topic Modeling Scenario
• Each topic is a distribution over words
• Each document is a mixture of corpus-wide topics
• Each word is drawn from one of those topics
5
Topic Modeling Scenario
• In reality, we only observe the documents
• The other structures are hidden variables
• Topic modeling algorithms infer these variables from data
6
Latent Dirichlet allocation (LDA)
[Figure: topics, documents, and topic proportions/assignments in LDA]
• In reality, we only observe the documents
• The other structures are hidden variables
• Topic modeling algorithms infer these variables from data
Plate Notation
• Naïve Bayes model: compact representation
  – C = topic/class (name for a word distribution)
  – N = number of words in the considered document
  – W_i = one specific word in the corpus
  – With M documents, W stands for a word at one position of a document, repeated N times
  – Idea: generate a document from P(W, C)
7
[Plate diagram: class node C with word nodes w_1, w_2, w_3, …, w_N; equivalently, C and W with W in a plate of size N, nested in a plate of size M]
Generative vs. Descriptive Models
• Generative models: learn P(x, y)
  – Tasks:
    • Transform P(x, y) into P(y | x) for classification
    • Use the model to predict (infer) new data
  – Advantages:
    • Assumptions and model are explicit
    • Use well-known algorithms (e.g., MCMC, EM)
• Descriptive (discriminative) models: learn P(y | x)
  – Task: classify data
  – Advantages:
    • Fewer parameters to learn
    • Better performance for classification
8
Earlier Topic Models: Topics Known
• Unigram
  – No context information
    P(w_1, …, w_N) = ∏_i P(w_i)
9
[Plate diagram: word node W in a plate of size N]

Example (Dan Jurafsky): sentences automatically generated from a unigram model, by sampling words independently:
  "fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass"
  "thrift, did, eighty, said, hard, 'm, july, bullish"
  "that, or, limited, the"
Earlier Topic Models: Topics Known
• Unigram
  – No context information
    P(w_1, …, w_N) = ∏_i P(w_i)
• Mixture of Unigrams
  – One topic per document
    ∏_{d=1}^M P(w_1, …, w_N, c_d) = ∏_{d=1}^M P(c_d) ∏_i P(w_i | c_d)
  – How to specify P(c_d)?
10
[Plate diagrams: unigram model (W in plate N); mixture of unigrams (class C with W in plate N, nested in plate M)]
Mixture of Unigrams: Known Topics
• Multinomial Naïve Bayes
  – For each document d = 1, …, M
    • Generate c_d ~ Mult(· | π)
    • For each position i = 1, …, N_d
      – Generate w_i ~ Mult(· | β, c_d)
11
[Plate diagram: π → C, β → W, with W in plate N nested in plate M]

    ∏_{d=1}^M P(w_1, …, w_{N_d}, c_d | β, π)
      = ∏_{d=1}^M P(c_d | π) ∏_{i=1}^{N_d} P(w_i | β, c_d)
      = ∏_{d=1}^M π_{c_d} ∏_{i=1}^{N_d} β_{c_d, w_i}

where π_{c_d} := P(c_d | π) and β_{c_d, w_i} := P(w_i | β, c_d).

"Multinomial" here: indicator vector c = (z_1, …, z_K)^⊤ ∈ {0, 1}^K with
    P(C = c) = ∏_{k=1}^K p_k^{z_k}
Multinomial Distribution
• Generalization of the binomial distribution
  – K possible outcomes instead of 2
  – Probability mass function
    f(x_1, …, x_K; p_1, …, p_K) = Γ(∑_i x_i + 1) / ∏_i Γ(x_i + 1) · ∏_{i=1}^K p_i^{x_i}
    • n = number of trials
    • x_j = count of how often class j occurs
    • p_j = probability of class j occurring
• Here, the input to Γ(·) is a positive integer, so Γ(n) = (n − 1)!
12
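The pmf above can be evaluated directly with the stated identity Γ(n) = (n − 1)!; a minimal sketch (the function name and die example are illustrative, not from the slides):

```python
from math import gamma, prod

def multinomial_pmf(x, p):
    """P(X_1 = x_1, ..., X_K = x_K) for n = sum(x) trials with class probabilities p.
    Uses the slide's identity Gamma(n) = (n - 1)!, so Gamma(sum x_i + 1) = n!."""
    coeff = gamma(sum(x) + 1) / prod(gamma(xi + 1) for xi in x)
    return coeff * prod(pi ** xi for pi, xi in zip(p, x))

# Ten rolls of a fair die with face counts (2, 1, 2, 2, 1, 2)
pmf = multinomial_pmf([2, 1, 2, 2, 1, 2], [1 / 6] * 6)
```

For K = 2 the formula reduces to the familiar binomial pmf, which makes a quick correctness check easy.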
Mixture of Unigrams: Unknown Topics
13
[Plate diagram: π → Z (hidden), β → W, with W in plate N nested in plate M]
• Topics/classes are hidden
  – Joint probability of words and classes
    ∏_{d=1}^M P(w_1, …, w_{N_d}, c_d | β, π) = ∏_{d=1}^M π_{c_d} ∏_{i=1}^{N_d} β_{c_d, w_i}
  – Sum over topics (K = number of topics)
    ∏_{d=1}^M P(w_1, …, w_{N_d} | β, π) = ∏_{d=1}^M ∑_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}

where π_{z_k} := P(z_k | π) and β_{z_k, w_i} := P(w_i | β, z_k).
Mixture of Unigrams: Learning
• Learn parameters π and β
• Use the likelihood
    ∏_{d=1}^M P(w_1, …, w_{N_d} | β, π) = ∏_{d=1}^M ∑_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}
  i.e., the log-likelihood
    ∑_{d=1}^M log P(w_1, …, w_{N_d} | β, π) = ∑_{d=1}^M log ∑_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}
• Solve
    argmax_{β,π} ∑_{d=1}^M log ∑_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}
  – Not a concave/convex function
  – Note: a non-concave/non-convex function is not necessarily convex/concave
  – Possibly no unique maximum, many saddle or turning points
  → No easy way to find the roots of the derivative
14
Trick: Optimize Lower Bound
15
[Figure: the log-likelihood and a lower bound as a function of (γ_dk, θ)]
Mixture of Unigrams: Learning
• The problem again
    argmax_{β,π} ∑_{d=1}^M log ∑_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}
• Optimize w.r.t. each document
• Derive a lower bound via Jensen's inequality:
    log ∑_i γ_i x_i ≥ ∑_i γ_i log x_i   where γ_i ≥ 0 ∧ ∑_i γ_i = 1
  Hence, for any such γ,
    log ∑_i x_i = log ∑_i γ_i (x_i / γ_i) ≥ ∑_i (γ_i log x_i − γ_i log γ_i)
  With the entropy H(γ) := −∑_i γ_i log γ_i, this gives
    log ∑_{k=1}^K π_k ∏_{i=1}^{N_d} β_{k,w_i} ≥ ∑_{k=1}^K γ_k log(π_k ∏_{i=1}^{N_d} β_{k,w_i}) + H(γ)
16
Mixture of Unigrams: Learning
• For each document d:
    log ∑_{k=1}^K π_k ∏_{i=1}^{N_d} β_{k,w_i} ≥ ∑_{k=1}^K γ_dk log(π_k ∏_{i=1}^{N_d} β_{k,w_i}) + H(γ)
• Chicken-and-egg problem:
  – If we knew γ_dk, we could find π_k and β_{k,w_i} with maximum likelihood
  – If we knew π_k and β_{k,w_i}, we could find γ_dk with maximum likelihood
• Finally, we need π_{z_k} and β_{z_k,w_i}
• Solution: Expectation Maximization (EM)
  – Iterative algorithm to find a local optimum
  – Guaranteed to maximize a lower bound on the log-likelihood of the observed data
17
Graphical Idea of the EM Algorithm
18
[Figure: EM alternates between fitting and maximizing the lower bound in (γ_dk, θ), where θ = (π_k, β_{k,w_i})]
Mixture of Unigrams: Learning
• For each document d:
    log ∑_{k=1}^K π_k ∏_{i=1}^{N_d} β_{k,w_i} ≥ ∑_{k=1}^K γ_dk log(π_k ∏_{i=1}^{N_d} β_{k,w_i}) + H(γ)
• EM solution
  – E step:
    γ_dk^(t+1) = π_k^(t) ∏_{i=1}^{N_d} β_{k,w_i}^(t) / ∑_{l=1}^K π_l^(t) ∏_{i=1}^{N_d} β_{l,w_i}^(t)
  – M step:
    π_k^(t+1) = ∑_{d=1}^M γ_dk^(t) / M
    β_{k,w_i}^(t+1) = ∑_{d=1}^M γ_dk^(t) n(d, w_i) / ∑_{d=1}^M γ_dk^(t) ∑_{j=1}^{N_d} n(d, w_j)
19
Back to Topic Modeling Scenario
20
[Figure: topics, documents, and topic proportions/assignments in LDA]
• In reality, we only observe the documents
• The other structures are hidden variables
• Topic modeling algorithms infer these variables from data
Probabilistic LSI
• Select a document d with probability P(d)
• For each word of d in the training set
  – Choose a topic z with probability P(z | d)
  – Generate a word with probability P(w | z)
• Documents can have multiple topics

    P(d, w_i) = P(d) ∑_{k=1}^K P(w_i | z_k) P(z_k | d)

21
[Plate diagram: d → z → w, with z and w in plate N nested in plate M]
Thomas Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the 22nd Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999
pLSI
• Joint probability for all documents and words
    ∏_{d=1}^M ∏_{i=1}^{N_d} P(d, w_i)^{n(d, w_i)}
• Distribution for document d, word w_i
    P(d, w_i) = P(d) ∑_{k=1}^K P(w_i | z_k) P(z_k | d)
• Reformulate P(z_k | d) with Bayes' rule
    P(d, w_i) = ∑_{k=1}^K P(d | z_k) P(z_k) P(w_i | z_k)
22
[Diagrams: asymmetric parameterization d → z_k → w_i with P(d), P(z_k | d), P(w_i | z_k); symmetric parameterization z_k → d and z_k → w_i with P(z_k), P(d | z_k), P(w_i | z_k)]
pLSI: Learning Using EM
• Model
    ∏_{d=1}^M ∏_{i=1}^{N_d} P(d, w_i)^{n(d, w_i)},   with   P(d, w_i) = ∑_{k=1}^K P(d | z_k) P(z_k) P(w_i | z_k)
• Likelihood
    L = ∑_{d=1}^M ∑_{i=1}^{N_d} n(d, w_i) log P(d, w_i)
      = ∑_{d=1}^M ∑_{i=1}^{N_d} n(d, w_i) log ∑_{k=1}^K P(d | z_k) P(z_k) P(w_i | z_k)
• Parameters to learn (M step): P(w_i | z_k), P(z_k), P(d | z_k)
• Hidden posterior (E step): P(z_k | d, w_i)
23
pLSI: Learning Using EM
• EM solution
  – E step:
    P(z_k | d, w_i) = P(z_k) P(d | z_k) P(w_i | z_k) / ∑_{l=1}^K P(z_l) P(d | z_l) P(w_i | z_l)
  – M step:
    P(w_i | z_k) = ∑_{d=1}^M n(d, w_i) P(z_k | d, w_i) / ∑_{d=1}^M ∑_{j=1}^{N_d} n(d, w_j) P(z_k | d, w_j)
    P(d | z_k) = ∑_{i=1}^{N_d} n(d, w_i) P(z_k | d, w_i) / ∑_{c=1}^M ∑_{i=1}^{N_c} n(c, w_i) P(z_k | c, w_i)
    P(z_k) = (1/R) ∑_{d=1}^M ∑_{i=1}^{N_d} n(d, w_i) P(z_k | d, w_i),   where   R = ∑_{d=1}^M ∑_{i=1}^{N_d} n(d, w_i)
24
pLSI: Overview
• More realistic than the mixture model
  – Documents can discuss multiple topics!
• Problems
  – Very many parameters
  – Danger of overfitting
25
pLSI Testrun
• pLSI topics (TDT-1 corpus)
  – Approx. 7 million words, 15,863 documents, K = 128
[Figure: the two most probable topics that generate the term "flight" (left) and "love" (right); most probable words per topic, with probability decreasing down the list]
26
Relation with LSI
• pLSI can be written as a matrix factorization, like LSI:
    P = U_k Σ_k V_k^⊤   with   P(d, w_i) = ∑_{k=1}^K P(d | z_k) P(z_k) P(w_i | z_k)
    U_k = [P(d | z_k)]_{d,k},   Σ_k = diag(P(z_k))_k,   V_k = [P(w_i | z_k)]_{i,k}
• Difference:
  – LSI: minimize the Frobenius (L2) norm
  – pLSI: maximize the log-likelihood of the training data
• Contrast to SVD:
  – No orthonormality condition on U and V here
  – The elements of U and V are non-negative
27
pLSI with Multinomials
• Multinomial Naïve Bayes view
  – Select document d ~ Mult(· | π)
    • For each position i = 1, …, N_d
      – Generate z_i ~ Mult(· | d, θ_d)
      – Generate w_i ~ Mult(· | z_i, β_k)
28

    ∏_{d=1}^M P(w_1, …, w_{N_d}, d | β, θ, π)
      = ∏_{d=1}^M P(d | π) ∏_{i=1}^{N_d} ∑_{k=1}^K P(z_i = k | d, θ_d) P(w_i | β_k)
      = ∏_{d=1}^M π_d ∏_{i=1}^{N_d} ∑_{k=1}^K θ_{d,k} β_{k,w_i}

[Plate diagram: π → d, θ_d → z → w ← β, with z and w in plate N nested in plate M]
Prior Distribution for Topic Mixture
• Goal: topic mixture proportions for each document are drawn from some distribution
  – A distribution on multinomials (k-tuples of non-negative numbers that sum to one)
• The space of all such multinomials can be interpreted geometrically as a (k−1)-simplex
  – Generalization of a triangle to (k−1) dimensions
• Criteria for selecting our prior:
  – It needs to be defined over a (k−1)-simplex
  – It should have nice computational properties
29
In Bayesian probability theory, if the posterior distributions p(θ|x) are in the same family as the prior probability distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. [Wikipedia]
Latent Dirichlet Allocation
• Document = mixture of topics (as in pLSI), but the mixture is drawn according to a Dirichlet prior
  – With a uniform Dirichlet prior, pLSI = LDA
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
30
Dirichlet Distributions
• Defined over a (k−1)-simplex
  – Takes K non-negative arguments which sum to one
  – Consequently, it is a natural distribution to use over multinomial distributions
• The Dirichlet parameter α_i can be thought of as a prior count of the i-th class
• Conjugate prior to the multinomial distribution
  – Conjugate prior: if our likelihood is multinomial with a Dirichlet prior, then the posterior is also a Dirichlet
31
Multinomial likelihood:
    f(x_1, …, x_K; p_1, …, p_K) = Γ(∑_i x_i + 1) / ∏_i Γ(x_i + 1) · ∏_{i=1}^K p_i^{x_i}
Dirichlet prior:
    p(θ | α) = Γ(∑_i α_i) / ∏_i Γ(α_i) · ∏_{i=1}^K θ_i^{α_i − 1}
LDA Model
[Figure: LDA as a graphical model; α → θ → z_1 … z_4 → w_1 … w_4 ← β, unrolled for three documents]
32
LDA Model – Parameters
α ← proportions parameter (k-dimensional vector of real numbers)
θ ← per-document topic distribution (k-dimensional vector of probabilities summing up to 1)
z ← per-word topic assignment (number from 1 to k)
w ← observed word (number from 1 to v, where v is the number of words in the vocabulary)
β ← word "prior" (v-dimensional)
33
[Plate diagram: α → θ → z → w ← β, with z and w in plate N nested in plate M]
Dirichlet Distribution over a 2-Simplex
34
A panel illustrating probability density functions of a few Dirichlet distributions over a 2-simplex, for the following α vectors (clockwise, starting from the upper left corner): (1.3, 1.3, 1.3), (3, 3, 3), (7, 7, 7), (2, 6, 11), (14, 9, 5), (6, 2, 6). [Wikipedia]
LDA Model – Plate Notation
• For each document d
  – Generate θ_d ~ Dirichlet(· | α)
  – For each position i = 1, …, N_d
    • Generate a topic z_i ~ Mult(· | θ_d)
    • Generate a word w_i ~ Mult(· | z_i, β)
35
[Plate diagram: α → θ → z → w ← β, with z and w in plate N nested in plate M]

    P(θ, z_1, …, z_{N_d}, w_1, …, w_{N_d} | α, β)
      = ∏_{d=1}^M P(θ_d | α) ∏_{i=1}^{N_d} P(z_i | θ_d) P(w_i | β, z_i)
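The generative process above can be run forward to sample a synthetic document (a minimal sketch; the topic matrix and the hypothetical sizes K = 5, V = 50 are invented for illustration):

```python
import numpy as np

def lda_generate(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process above.
    alpha: (K,) Dirichlet parameter; beta: (K, V) per-topic word distributions."""
    K, V = beta.shape
    theta = rng.dirichlet(alpha)                 # theta_d ~ Dirichlet(alpha)
    z = rng.choice(K, size=n_words, p=theta)     # z_i ~ Mult(theta_d)
    w = np.array([rng.choice(V, p=beta[zi]) for zi in z])  # w_i ~ Mult(beta_{z_i})
    return theta, z, w

rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(50), size=5)        # 5 hypothetical topics over 50 word types
theta, z, w = lda_generate(np.full(5, 0.5), beta, n_words=100, rng=rng)
```

Inference (below) inverts exactly this process: given only w, recover plausible θ and z.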
Corpus-level Parameter α (scalar summary: ᾱ = K⁻¹ ∑_i α_i)
• Let α = 1
• Per-document topic distribution: K = 10, D = 15
36
[Figure: 15 sampled per-document topic distributions over K = 10 topics for α = 1; the probability mass varies moderately across topics]
Corpus-level Parameter α
• α = 10 and α = 100
37
[Figure: sampled per-document topic distributions for α = 10 (fairly even) and α = 100 (nearly uniform over the K = 10 topics)]
Corpus-level Parameter α
• α = 0.1 and α = 0.01
38
[Figure: sampled per-document topic distributions for α = 0.1 and α = 0.01; the mass concentrates on one or very few topics per document]
Back to Topic Modeling Scenario
39
[Figure: topics, documents, and topic proportions/assignments in LDA]
• In reality, we only observe the documents
• The other structures are hidden variables
• Topic modeling algorithms infer these variables from data
Smoothed LDA Model
• Give a different word distribution to each topic
  – β is a K×V matrix (V = vocabulary size); each row denotes the word distribution of one topic
• Choose β_k ~ Dirichlet(· | η) for each topic k
• For each document d
  – Choose θ_d ~ Dirichlet(· | α)
  – For each position i = 1, …, N_d
    • Generate a topic z_i ~ Mult(· | θ_d)
    • Generate a word w_i ~ Mult(· | z_i, β_k)
40
[Plate diagram: α → θ → z → w; η → β_k in a plate of size K; z and w in plate N nested in plate M]
Smoothed LDA Model
41
[Plate diagram: α → θ → z → w; η → β_k in a plate of size K; z and w in plate N nested in plate M]

    P(β, θ, z, w | α, η)
      = ∏_{k=1}^K P(β_k | η) · ∏_{d=1}^M P(θ_d | α) ∏_{i=1}^{N_d} P(z_{d,i} | θ_d) P(w_{d,i} | β_{1:K}, z_{d,i})
Back to Topic Modeling Scenario
• But…
42
Why does LDA “work”?
• Trade-off between two goals:
  1. For each document, allocate its words to as few topics as possible.
  2. For each topic, assign high probability to as few terms as possible.
• These goals are at odds:
  – Putting a document in a single topic makes #2 hard: all of its words must have high probability under that topic.
  – Putting very few words in each topic makes #1 hard: to cover a document's words, many topics must be assigned to it.
• Trading off these goals finds groups of tightly co-occurring words
43
Inference: The Problem (non-smoothed version)
• To which topics does a given document belong?
  – Compute the posterior distribution of the hidden variables given a document:

    P(θ, z | w, α, β) = P(θ, z, w | α, β) / P(w | α, β)

  where

    P(θ, z, w | α, β) = P(θ | α) ∏_{n=1}^N P(z_n | θ) P(w_n | z_n, β)

  and

    P(w | α, β) = Γ(∑_i α_i) / ∏_i Γ(α_i) · ∫ (∏_{i=1}^k θ_i^{α_i − 1}) (∏_{n=1}^N ∑_{i=1}^k ∏_{j=1}^V (θ_i β_{ij})^{w_n^j}) dθ

• This not only looks awkward, but is also computationally intractable in general: θ and β_{ij} are coupled under the integral.
• Solution: approximations.
44
LDA Learning
• Parameter learning:
  – Variational EM
    • Numerical approximation using lower bounds
    • Results in biased solutions
    • Convergence has numerical guarantees
  – Gibbs sampling
    • Stochastic simulation
    • Unbiased solutions
    • Stochastic convergence
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
45
LDA: Variational Inference
• Replace the graphical model of LDA with a simpler one, in which the problematic coupling between θ and β is not present.
[Diagrams: the full LDA model vs. the simplified variational model with free variational parameters]
• Minimize the Kullback-Leibler (KL) divergence between the two distributions.
  – A standard trick in machine learning; often equivalent to maximum likelihood. Gradient descent procedure.
• The variational approach contrasts with the often-used Markov chain Monte Carlo: instead of sampling from the posterior distribution, we try to find the tightest lower bound.
46
LDA: Gibbs Sampling
• MCMC algorithm
  – Fix all current values but one and sample that value, e.g., for z:
    P(z_i = k | z_{−i}, w)
  – Eventually converges to the true posterior
47
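The resampling step P(z_i = k | z_{−i}, w) can be illustrated with a minimal collapsed Gibbs sampler for smoothed LDA (a sketch; the proportionality used is the standard collapsed-LDA formula with hyperparameters α and η, and the toy corpus and function name are invented):

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, sweeps=100, seed=0):
    """Minimal collapsed Gibbs sampler for smoothed LDA.
    Each z_i is resampled from P(z_i = k | z_-i, w), which is
    proportional to (n_dk + alpha) * (n_kw + eta) / (n_k + V * eta)."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(d)) for d in docs]
    n_dk = np.zeros((len(docs), K))   # topic counts per document
    n_kw = np.zeros((K, V))           # word counts per topic
    n_k = np.zeros(K)                 # total words per topic
    for d, (ws, zs) in enumerate(zip(docs, z)):
        for w_i, z_i in zip(ws, zs):
            n_dk[d, z_i] += 1; n_kw[z_i, w_i] += 1; n_k[z_i] += 1
    for _ in range(sweeps):
        for d, (ws, zs) in enumerate(zip(docs, z)):
            for i, w_i in enumerate(ws):
                k = zs[i]             # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w_i] -= 1; n_k[k] -= 1
                p = (n_dk[d] + alpha) * (n_kw[:, w_i] + eta) / (n_k + V * eta)
                k = rng.choice(K, p=p / p.sum())
                zs[i] = k             # add the new assignment back
                n_dk[d, k] += 1; n_kw[k, w_i] += 1; n_k[k] += 1
    return z, n_kw

# Six tiny documents over a 5-word vocabulary, with two obvious word groups
docs = [[0, 0, 1, 1, 0]] * 3 + [[3, 4, 3, 4, 4]] * 3
z_out, n_kw = lda_gibbs(docs, K=2, V=5)
```

The counts n_kw and n_dk collected during the sweeps are then normalized to estimates of β and θ.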
Variational Inference vs. Gibbs Sampling
• Gibbs sampling is slower (takes days for moderately sized datasets); variational inference takes a few hours.
• Gibbs sampling is more accurate.
• Gibbs sampling convergence is difficult to test, although quite a few approximate inference techniques in machine learning have the same problem.
48
LDA Application: Reuters Data
• Setup
  – 100-topic LDA trained on a corpus of 16,000 news articles by Reuters (the news agency)
  – Some standard stop words removed
• Top words from some of the P(w|z):

  "Arts"    "Budgets"  "Children"  "Education"
  new       million    children    school
  film      tax        women       students
  show      program    people      schools
  music     budget     child       education
  movie     billion    years       teachers
  play      federal    families    high
  musical   year       work        public

49
LDA Application: Reuters Data
• Result: inference on a held-out document; again the topics "Arts", "Budgets", "Children", "Education":

The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. "Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants, an act every bit as important as our traditional areas of support in health, medical research, education and the social services," Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants.
50
Measuring Performance
• Perplexity of a probability model: how well a probability distribution or probability model predicts a sample
  – q: model of an unknown probability distribution p, based on a training sample drawn from p
  – Evaluate q by asking how well it predicts a separate test sample x_1, …, x_N, also drawn from p
  – Perplexity of q w.r.t. the sample x_1, …, x_N is defined as
    2^{−(1/N) ∑_{i=1}^N log_2 q(x_i)}
  – A better model q will tend to assign higher probabilities q(x_i) to the test events → lower perplexity ("less surprised by the sample")
51
[Wikipedia]
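The definition above is one line of code (a small sketch; the uniform-die and coin examples are illustrative):

```python
import numpy as np

def perplexity(q_probs):
    """Perplexity 2^{-(1/N) sum_i log2 q(x_i)} of a model on a held-out sample,
    given the model probabilities q(x_i) of the N test events."""
    q = np.asarray(q_probs, dtype=float)
    return 2.0 ** (-np.mean(np.log2(q)))

# A uniform model over 6 outcomes is "6 ways perplexed" by any sample it fits:
pp_uniform = perplexity([1 / 6] * 10)
# A sharper model that assigns higher probability to what occurred scores lower:
pp_sharp = perplexity([0.5] * 10)
```

Perplexity can be read as the effective branching factor: a value of 6 means the model is, on average, as uncertain as a fair six-sided die.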
Perplexity of Various Models
[Figure: held-out perplexity as a function of the number of topics; LDA attains lower perplexity than the unigram and other baseline models]
52
Use of LDA
• A widely used topic model (Griffiths, Steyvers, 2004)
• Complexity is an issue
• Use in IR:
  – Ad hoc retrieval (Wei and Croft, SIGIR 06: TREC benchmarks)
  – Improvements over traditional LM (e.g., LSI techniques)
  – But no consensus on whether there is any improvement over the relevance model, i.e., the model with relevance feedback (relevance feedback is part of the TREC tests)
T. Griffiths, M. Steyvers, Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235. 2004
53
Xing Wei and W. Bruce Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '06). ACM, New York, NY, USA, 178-185. 2006.
TREC=Text REtrieval Conference
Generative Topic Models for Community Analysis
Pilfered from: Ramesh Nallapati, http://www.cs.cmu.edu/~wcohen/10-802/lda-sep-18.ppt
&
Arthur Asuncion, Qiang Liu, Padhraic Smyth: Statistical Approaches to Joint Modeling of Text and Network Data
54
What if the corpus has network structure?
55
CORA citation network. Figure from [Chang, Blei, AISTATS 2009]
J. Chang, and D. Blei. Relational Topic Models for Document Networks.AISTATS, volume 5 of JMLR Proceedings, page 81-88. JMLR.org, 2009.
Outline
• Topic Models for Community Analysis
  – Citation modeling
    • with pLSI
    • with LDA
  – Modeling influence of citations
  – Relational Topic Models
56
Hyperlink Modeling Using pLSI
• Select document d ~ Mult(· | π)
  – For each position n = 1, …, N_d
    • Generate z_n ~ Mult(· | θ_d)
    • Generate w_n ~ Mult(· | β_{z_n})
  – For each citation j = 1, …, L_d
    • Generate z_j ~ Mult(· | θ_d)
    • Generate c_j ~ Mult(· | γ_{z_j})
57
[Plate diagram: π → d; θ_d → z → w ← β in plate N; θ_d → z → c ← γ in plate L; both nested in plate M]
D. A. Cohn, Th. Hofmann, The Missing Link - A ProbabilisticModel of Document Content and Hypertext Connectivity, In: Proc. NIPS, pp. 430-436, 2000
Hyperlink Modeling Using pLSI
• pLSI likelihood
    ∏_{d=1}^M P(w_1, …, w_{N_d}, d | θ, β, π) = ∏_{d=1}^M π_d ∏_{i=1}^{N_d} ∑_{k=1}^K θ_{d,k} β_{k,w_i}
• New likelihood (words and citations)
    ∏_{d=1}^M P(w_1, …, w_{N_d}, c_1, …, c_{L_d}, d | θ, β, γ, π)
      = ∏_{d=1}^M π_d (∏_{i=1}^{N_d} ∑_{k=1}^K θ_{d,k} β_{k,w_i}) (∏_{j=1}^{L_d} ∑_{k=1}^K θ_{d,k} γ_{k,c_j})
• Learning using EM
58
[Plate diagram as before: words and citations share the per-document topic mixture θ_d]
Hyperlink Modeling Using pLSI
• Heuristic: weight the two likelihood parts
    ∏_{d=1}^M π_d (∏_{i=1}^{N_d} ∑_{k=1}^K θ_{d,k} β_{k,w_i})^a (∏_{j=1}^{L_d} ∑_{k=1}^K θ_{d,k} γ_{k,c_j})^{1−a}
  – 0 < a < 1 determines the relative importance of content and hyperlinks
59
Hyperlink Modeling Using pLSI
• Classification performance
[Figure: classification performance using hyperlink features vs. content features]
60
Hyperlink modeling using LDA
• For each document d
  – Generate θ_d ~ Dirichlet(α)
  – For each position i = 1, …, N_d
    • Generate a topic z_i ~ Mult(· | θ_d)
    • Generate a word w_i ~ Mult(· | β_{z_i})
  – For each citation j = 1, …, L_d
    • Generate z_j ~ Mult(· | θ_d)
    • Generate c_j ~ Mult(· | γ_{z_j})
• Learning using variational EM, Gibbs sampling
61
[Plate diagram: α → θ → z → w ← β in plate N; θ → z → c ← γ in plate L; both nested in plate M]
E. Erosheva, S Fienberg, J. Lafferty, Mixed-membership models of scientific publications. Proc National Academy Science U S A. 2004 Apr 6;101 Suppl 1:5220-7. Epub 2004 Mar 12.
Link-pLSI-LDA: Topic Influence in Blogs
R. Nallapati, A. Ahmed, E. Xing, W.W. Cohen, Joint Latent Topic Models for Text and Citations, In: Proc. KDD, 2008.
62
63
Modeling Citation Influences - Copycat Model
• Each topic in a citing document is drawn from one of the topic mixtures of cited publications
64
L. Dietz, St. Bickel, and T. Scheffer, Unsupervised Prediction of Citation Influences, In: Proc. ICML 2007.
Modeling Citation Influences
• Citation influence model: Combination of LDA and Copycat model
65
L. Dietz, St. Bickel, and T. Scheffer, Unsupervised Prediction of Citation Influences, In: Proc. ICML 2007.
Modeling Citation Influences
• Citation influence graph for LDA paper
66
Modeling Citation Influences
• Words in LDA paper assigned to citations
67
Modeling Citation Influences
• Predictive Performance
68
Relational Topic Model (RTM) [ChangBlei 2009]
• Same setup as LDA, except now we have observed network information across documents
• "Link probability function": documents with similar topics are more likely to be linked
    y_{d,d'} ~ ψ(y_{d,d'} | z_d, z_{d'}, η)
69
[Plate diagram: two LDA plates (θ_d, z_{d,n}, w_{d,n}) and (θ_{d'}, z_{d',n}, w_{d',n}) sharing the topics β_k, connected by the link indicator y_{d,d'} with parameter η]
J. Chang and D. Blei. Relational Topic Models for Document Networks. AISTATS, volume 5 of JMLR Proceedings, pages 81-88. JMLR.org, 2009.
Relational Topic Model (RTM) [ChangBlei 2009]
• For each document d
  – Draw topic proportions θ_d | α ~ Dir(α)
  – For each word w_{d,n}
    • Draw assignment z_{d,n} | θ_d ~ Mult(θ_d)
    • Draw word w_{d,n} | z_{d,n}, β_{1:K} ~ Mult(β_{z_{d,n}})
• For each pair of documents d, d'
  – Draw binary link indicator y | z_d, z_{d'} ~ ψ(· | z_d, z_{d'})
70
[Plate diagram as on the previous slide]
Collapsed Gibbs Sampling for RTM
• Conditional distribution of each z:
    P(z_{d,n} = k | z_{¬d,n}, ·) ∝ (N_{dk}^{¬d,n} + α) · (N_{kw}^{¬d,n} + β) / (N_k^{¬d,n} + Wβ)
        · ∏_{d'≠d: y_{d,d'}=1} ψ(y_{d,d'} = 1 | z_d, z_{d'}, η)
        · ∏_{d'≠d: y_{d,d'}=0} ψ(y_{d,d'} = 0 | z_d, z_{d'}, η)
  – The first factor is the usual LDA term; the two products are the "edge" term and the "non-edge" term.
• Using the exponential link probability function, it is computationally efficient to calculate the "edge" term.
• It is very costly to compute the "non-edge" term exactly.
71
Approximating the Non-edges
1. Assume non-edges are "missing" and ignore the term entirely (Chang/Blei)
2. Make the following fast approximation:
    ∏_i (1 − exp(c_i)) ≈ (1 − exp(c̄))^N,   where   c̄ = (1/N) ∑_i c_i
3. Subsample non-edges and exactly calculate the term over the subset.
4. Subsample non-edges, but instead of recalculating statistics for every z_{d,n} token, calculate statistics once per document and cache them over each Gibbs sweep.
72
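Option 2 can be sanity-checked numerically (a small sketch; the c_i values below are hypothetical stand-ins for the per-pair arguments of the exponential link function, chosen negative so each 1 − exp(c_i) is a valid non-edge probability):

```python
import numpy as np

rng = np.random.default_rng(1)
c = -1.0 - 0.05 * rng.random(200)        # N = 200 values clustered near -1.02

# Exact non-edge term vs. the fast approximation using the mean c-bar
exact = np.prod(1.0 - np.exp(c))
approx = (1.0 - np.exp(c.mean())) ** len(c)
rel_err = abs(exact - approx) / exact
```

The approximation is accurate when the c_i are tightly clustered, which is why it works well enough in practice while replacing N exponentials per token with a single one.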
Document networks

  Corpus              # Docs   # Links   Avg. Doc Length   Vocab Size   Link Semantics
  CORA                 4,000    17,000             1,200       60,000   Paper citation (undirected)
  Netflix Movies      10,000    43,000               640       38,000   Common actor/director
  Enron (undirected)   1,000    16,000             7,000       55,000   Communication between person i and person j
  Enron (directed)     2,000    21,000             3,500       55,000   Email from person i to person j

73
Link Rank
• "Link rank" on held-out data as evaluation metric
  – Lower is better
• How to compute link rank (simplified):
  1. Train the model with {d_train}.
  2. Given the model, calculate the probability that d_test would link to each d_train. Rank {d_train} according to these probabilities.
  3. For each observed link between d_test and {d_train}, find its rank; average all these ranks to obtain the "link rank".
74
[Diagram: d_test and {d_train} fed to a black-box predictor, producing a ranking over {d_train}; observed edges between d_test and {d_train} yield the link ranks]
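The three steps above reduce to a small ranking computation (a sketch; the function name and the toy scores are illustrative, not from the evaluation):

```python
import numpy as np

def link_rank(scores, true_links):
    """'Link rank' of one test document: rank the training documents by the
    model's link probability (best first, 1-based), then average the ranks of
    the training documents the test document actually links to. Lower is better."""
    order = np.argsort(-np.asarray(scores))                   # best-scored first
    rank_of = {int(d): r + 1 for r, d in enumerate(order)}    # 1-based ranks
    return float(np.mean([rank_of[d] for d in true_links]))

# Toy check: training docs 0..4; true links go to docs 2 (ranked 1st) and 0 (ranked 3rd)
lr = link_rank(scores=[0.3, 0.1, 0.9, 0.05, 0.4], true_links=[2, 0])
```

The final reported number is this quantity averaged over all held-out test documents.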
75
Results on CORA data
[Figure: link rank on CORA, K = 20, comparing Baseline (TF-IDF/Cosine), LDA + Regression, Ignoring non-edges, Fast approximation of non-edges, and Subsampling non-edges (20%) + Caching; 8-fold cross-validation; random guessing gives link rank = 2000]
Results on CORA data
• Model does better with more topics
• Model does better with more words in each document
76
[Figures: link rank vs. number of topics (Baseline vs. RTM with fast approximation), and link rank vs. percentage of words per document (Baseline, LDA + Regression, Ignoring Non-Edges, Fast Approximation, Subsampling (5%) + Caching, all at K = 40)]
77
Timing Results on CORA
[Figure: runtime in seconds vs. number of documents (CORA, K = 20) for LDA + Regression, Ignoring Non-Edges, Fast Approximation, Subsampling (5%) + Caching, and Subsampling (20%) + Caching]
"Subsampling (20%) without caching" is not shown, since it takes 62,000 seconds for D = 1000 and 3,720,150 seconds for D = 4000
78
Conclusion
• Relational topic modeling provides a useful start for combining text and network data in a single statistical framework
• RTM can improve over simpler approaches for link prediction
• Opportunities for future work:– Faster algorithms for larger data sets– Better understanding of non-edge modeling– Extended models