Web-Mining Agents
Topic Analysis: pLSI and LDA
Tanya Braun, Ralf Möller
Universität zu Lübeck, Institut für Informationssysteme
Acknowledgments
Pilfered from:
Ramesh M. Nallapati, Machine Learning applied to Natural Language Processing,
Thomas J. Watson Research Center, Yorktown Heights, NY, USA,
from his presentation on Generative Topic Models for Community Analysis
&
David M. Blei, from his presentation on Probabilistic Topic Models, MLSS 2012
&
Sina Miran, CS 290D paper presentation on Probabilistic Latent Semantic Indexing (PLSI)
2
Recap
• Agents
  – Task/goal: Information retrieval
  – Environment: Document repository
  – Means:
    • Vector space (bag-of-words)
      – Dimension reduction (LSI)
    • Probability-based retrieval (binary)
      – Language models
• Today: Topic models as special language models
  – Probabilistic LSI (pLSI)
  – Latent Dirichlet Allocation (LDA)
• Soon: What agents can take with them
  – What agents can leave at the repository (win-win)
3
Objectives
• Topic models: statistical methods that analyze the words of texts in order to
  – discover the themes that run through them (topics),
  – discover how those themes are connected to each other, and
  – discover how they change over time
4
Probabilistic topic models
[Figure: relative word frequencies from 1880 to 2000 in two inferred topics; "Theoretical Physics" (RELATIVITY, LASER, FORCE) and "Neuroscience" (NERVE, OXYGEN, NEURON)]
Topic Modeling Scenario
• Each topic is a distribution over words
• Each document is a mixture of corpus-wide topics
• Each word is drawn from one of those topics
5
Topic Modeling Scenario
• In reality, we only observe the documents
• The other structures are hidden variables
• Topic modeling algorithms infer these variables from data
6
Latent Dirichlet allocation (LDA)
[Figure: topics, documents, and topic proportions/assignments in LDA]
• In reality, we only observe the documents
• The other structures are hidden variables
• Topic modeling algorithms infer these variables from data
Plate Notation
• Naïve Bayes model: compact representation
  – C = topic/class (name for a word distribution)
  – N = number of words in the considered document
  – W_i = one specific word in the corpus
  – With M documents, W stands for a word at one position of a document, repeated N times
  – Idea: generate a document from P(W, C)
7
[Plate diagram: class node C with word nodes w_1, w_2, w_3, …, w_N; equivalently, C and W with W in a plate of size N, nested in a plate of size M]
Generative vs. Descriptive Models
• Generative models: learn P(x, y)
  – Tasks:
    • Transform P(x, y) into P(y | x) for classification
    • Use the model to predict (infer) new data
  – Advantages:
    • Assumptions and model are explicit
    • Use well-known algorithms (e.g., MCMC, EM)
• Descriptive (discriminative) models: learn P(y | x)
  – Task: classify data
  – Advantages:
    • Fewer parameters to learn
    • Better performance for classification
8
Earlier Topic Models: Topics Known
• Unigram
  – No context information
    P(w_1, …, w_N) = ∏_i P(w_i)
9
[Plate diagram: word node W in a plate of size N]

Example (Dan Jurafsky): sentences automatically generated from a unigram model, by sampling words independently:
  "fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass"
  "thrift, did, eighty, said, hard, 'm, july, bullish"
  "that, or, limited, the"
Earlier Topic Models: Topics Known
• Unigram
  – No context information
    P(w_1, …, w_N) = ∏_i P(w_i)
• Mixture of Unigrams
  – One topic per document
    ∏_{d=1}^M P(w_1, …, w_N, c_d) = ∏_{d=1}^M P(c_d) ∏_i P(w_i | c_d)
  – How to specify P(c_d)?
10
[Plate diagrams: unigram model (W in plate N); mixture of unigrams (class C with W in plate N, nested in plate M)]
Mixture of Unigrams: Known Topics
• Multinomial Naïve Bayes
  – For each document d = 1, …, M
    • Generate c_d ~ Mult(· | π)
    • For each position i = 1, …, N_d
      – Generate w_i ~ Mult(· | β, c_d)
11
[Plate diagram: π → C, β → W, with W in plate N nested in plate M]

    ∏_{d=1}^M P(w_1, …, w_{N_d}, c_d | β, π)
      = ∏_{d=1}^M P(c_d | π) ∏_{i=1}^{N_d} P(w_i | β, c_d)
      = ∏_{d=1}^M π_{c_d} ∏_{i=1}^{N_d} β_{c_d, w_i}

where π_{c_d} := P(c_d | π) and β_{c_d, w_i} := P(w_i | β, c_d).

"Multinomial" here: indicator vector c = (z_1, …, z_K)^⊤ ∈ {0, 1}^K with
    P(C = c) = ∏_{k=1}^K p_k^{z_k}
Multinomial Distribution
• Generalization of the binomial distribution
  – K possible outcomes instead of 2
  – Probability mass function
    f(x_1, …, x_K; p_1, …, p_K) = Γ(∑_i x_i + 1) / ∏_i Γ(x_i + 1) · ∏_{i=1}^K p_i^{x_i}
    • n = number of trials
    • x_j = count of how often class j occurs
    • p_j = probability of class j occurring
• Here, the input to Γ(·) is a positive integer, so Γ(n) = (n − 1)!
12
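The pmf above can be evaluated directly with the stated identity Γ(n) = (n − 1)!; a minimal sketch (the function name and die example are illustrative, not from the slides):

```python
from math import gamma, prod

def multinomial_pmf(x, p):
    """P(X_1 = x_1, ..., X_K = x_K) for n = sum(x) trials with class probabilities p.
    Uses the slide's identity Gamma(n) = (n - 1)!, so Gamma(sum x_i + 1) = n!."""
    coeff = gamma(sum(x) + 1) / prod(gamma(xi + 1) for xi in x)
    return coeff * prod(pi ** xi for pi, xi in zip(p, x))

# Ten rolls of a fair die with face counts (2, 1, 2, 2, 1, 2)
pmf = multinomial_pmf([2, 1, 2, 2, 1, 2], [1 / 6] * 6)
```

For K = 2 the formula reduces to the familiar binomial pmf, which makes a quick correctness check easy.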
Mixture of Unigrams: Unknown Topics
13
[Plate diagram: π → Z (hidden), β → W, with W in plate N nested in plate M]
• Topics/classes are hidden
  – Joint probability of words and classes
    ∏_{d=1}^M P(w_1, …, w_{N_d}, c_d | β, π) = ∏_{d=1}^M π_{c_d} ∏_{i=1}^{N_d} β_{c_d, w_i}
  – Sum over topics (K = number of topics)
    ∏_{d=1}^M P(w_1, …, w_{N_d} | β, π) = ∏_{d=1}^M ∑_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}

where π_{z_k} := P(z_k | π) and β_{z_k, w_i} := P(w_i | β, z_k).
Mixture of Unigrams: Learning
• Learn parameters π and β
• Use the likelihood
    ∏_{d=1}^M P(w_1, …, w_{N_d} | β, π) = ∏_{d=1}^M ∑_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}
  i.e., the log-likelihood
    ∑_{d=1}^M log P(w_1, …, w_{N_d} | β, π) = ∑_{d=1}^M log ∑_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}
• Solve
    argmax_{β,π} ∑_{d=1}^M log ∑_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}
  – Not a concave/convex function
  – Note: a non-concave/non-convex function is not necessarily convex/concave
  – Possibly no unique maximum, many saddle or turning points
  → No easy way to find the roots of the derivative
14
Trick: Optimize Lower Bound
15
[Figure: the log-likelihood and a lower bound as a function of (γ_dk, θ)]
Mixture of Unigrams: Learning
• The problem again
    argmax_{β,π} ∑_{d=1}^M log ∑_{k=1}^K π_{z_k} ∏_{i=1}^{N_d} β_{z_k, w_i}
• Optimize w.r.t. each document
• Derive a lower bound via Jensen's inequality:
    log ∑_i γ_i x_i ≥ ∑_i γ_i log x_i   where γ_i ≥ 0 ∧ ∑_i γ_i = 1
  Hence, for any such γ,
    log ∑_i x_i = log ∑_i γ_i (x_i / γ_i) ≥ ∑_i (γ_i log x_i − γ_i log γ_i)
  With the entropy H(γ) := −∑_i γ_i log γ_i, this gives
    log ∑_{k=1}^K π_k ∏_{i=1}^{N_d} β_{k,w_i} ≥ ∑_{k=1}^K γ_k log(π_k ∏_{i=1}^{N_d} β_{k,w_i}) + H(γ)
16
Mixture of Unigrams: Learning
• For each document d:
    log ∑_{k=1}^K π_k ∏_{i=1}^{N_d} β_{k,w_i} ≥ ∑_{k=1}^K γ_dk log(π_k ∏_{i=1}^{N_d} β_{k,w_i}) + H(γ)
• Chicken-and-egg problem:
  – If we knew γ_dk, we could find π_k and β_{k,w_i} with maximum likelihood
  – If we knew π_k and β_{k,w_i}, we could find γ_dk with maximum likelihood
• Finally, we need π_{z_k} and β_{z_k,w_i}
• Solution: Expectation Maximization (EM)
  – Iterative algorithm to find a local optimum
  – Guaranteed to maximize a lower bound on the log-likelihood of the observed data
17
Graphical Idea of the EM Algorithm
18
[Figure: EM alternates between fitting and maximizing the lower bound in (γ_dk, θ), where θ = (π_k, β_{k,w_i})]
Mixture of Unigrams: Learning
• For each document d:
    log ∑_{k=1}^K π_k ∏_{i=1}^{N_d} β_{k,w_i} ≥ ∑_{k=1}^K γ_dk log(π_k ∏_{i=1}^{N_d} β_{k,w_i}) + H(γ)
• EM solution
  – E step:
    γ_dk^(t+1) = π_k^(t) ∏_{i=1}^{N_d} β_{k,w_i}^(t) / ∑_{l=1}^K π_l^(t) ∏_{i=1}^{N_d} β_{l,w_i}^(t)
  – M step:
    π_k^(t+1) = ∑_{d=1}^M γ_dk^(t) / M
    β_{k,w_i}^(t+1) = ∑_{d=1}^M γ_dk^(t) n(d, w_i) / ∑_{d=1}^M γ_dk^(t) ∑_{j=1}^{N_d} n(d, w_j)
19
Back to Topic Modeling Scenario
20
[Figure: topics, documents, and topic proportions/assignments in LDA]
• In reality, we only observe the documents
• The other structures are hidden variables
• Topic modeling algorithms infer these variables from data
Probabilistic LSI
• Select a document d with probability P(d)
• For each word of d in the training set
  – Choose a topic z with probability P(z | d)
  – Generate a word with probability P(w | z)
• Documents can have multiple topics

    P(d, w_i) = P(d) ∑_{k=1}^K P(w_i | z_k) P(z_k | d)

21
[Plate diagram: d → z → w, with z and w in plate N nested in plate M]
Thomas Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the 22nd Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999
pLSI
• Joint probability for all documents and words
    ∏_{d=1}^M ∏_{i=1}^{N_d} P(d, w_i)^{n(d, w_i)}
• Distribution for document d, word w_i
    P(d, w_i) = P(d) ∑_{k=1}^K P(w_i | z_k) P(z_k | d)
• Reformulate P(z_k | d) with Bayes' rule
    P(d, w_i) = ∑_{k=1}^K P(d | z_k) P(z_k) P(w_i | z_k)
22
[Diagrams: asymmetric parameterization d → z_k → w_i with P(d), P(z_k | d), P(w_i | z_k); symmetric parameterization z_k → d and z_k → w_i with P(z_k), P(d | z_k), P(w_i | z_k)]
pLSI: Learning Using EM
• Model
    ∏_{d=1}^M ∏_{i=1}^{N_d} P(d, w_i)^{n(d, w_i)},   with   P(d, w_i) = ∑_{k=1}^K P(d | z_k) P(z_k) P(w_i | z_k)
• Likelihood
    L = ∑_{d=1}^M ∑_{i=1}^{N_d} n(d, w_i) log P(d, w_i)
      = ∑_{d=1}^M ∑_{i=1}^{N_d} n(d, w_i) log ∑_{k=1}^K P(d | z_k) P(z_k) P(w_i | z_k)
• Parameters to learn (M step): P(w_i | z_k), P(z_k), P(d | z_k)
• Hidden posterior (E step): P(z_k | d, w_i)
23
pLSI: Learning Using EM
• EM solution
  – E step:
    P(z_k | d, w_i) = P(z_k) P(d | z_k) P(w_i | z_k) / ∑_{l=1}^K P(z_l) P(d | z_l) P(w_i | z_l)
  – M step:
    P(w_i | z_k) = ∑_{d=1}^M n(d, w_i) P(z_k | d, w_i) / ∑_{d=1}^M ∑_{j=1}^{N_d} n(d, w_j) P(z_k | d, w_j)
    P(d | z_k) = ∑_{i=1}^{N_d} n(d, w_i) P(z_k | d, w_i) / ∑_{c=1}^M ∑_{i=1}^{N_c} n(c, w_i) P(z_k | c, w_i)
    P(z_k) = (1/R) ∑_{d=1}^M ∑_{i=1}^{N_d} n(d, w_i) P(z_k | d, w_i),   where   R = ∑_{d=1}^M ∑_{i=1}^{N_d} n(d, w_i)
24
pLSI: Overview
• More realistic than the mixture model
  – Documents can discuss multiple topics!
• Problems
  – Very many parameters
  – Danger of overfitting
25
pLSI Testrun
• pLSI topics (TDT-1 corpus)
  – Approx. 7 million words, 15,863 documents, K = 128
[Figure: the two most probable topics that generate the term "flight" (left) and "love" (right); most probable words per topic, with probability decreasing down the list]
26
Relation with LSI
• pLSI can be written as a matrix factorization, like LSI:
    P = U_k Σ_k V_k^⊤   with   P(d, w_i) = ∑_{k=1}^K P(d | z_k) P(z_k) P(w_i | z_k)
    U_k = [P(d | z_k)]_{d,k},   Σ_k = diag(P(z_k))_k,   V_k = [P(w_i | z_k)]_{i,k}
• Difference:
  – LSI: minimize the Frobenius (L2) norm
  – pLSI: maximize the log-likelihood of the training data
• Contrast to SVD:
  – No orthonormality condition on U and V here
  – The elements of U and V are non-negative
27
pLSI with Multinomials
• Multinomial Naïve Bayes view
  – Select document d ~ Mult(· | π)
    • For each position i = 1, …, N_d
      – Generate z_i ~ Mult(· | d, θ_d)
      – Generate w_i ~ Mult(· | z_i, β_k)
28

    ∏_{d=1}^M P(w_1, …, w_{N_d}, d | β, θ, π)
      = ∏_{d=1}^M P(d | π) ∏_{i=1}^{N_d} ∑_{k=1}^K P(z_i = k | d, θ_d) P(w_i | β_k)
      = ∏_{d=1}^M π_d ∏_{i=1}^{N_d} ∑_{k=1}^K θ_{d,k} β_{k,w_i}

[Plate diagram: π → d, θ_d → z → w ← β, with z and w in plate N nested in plate M]
Prior Distribution for Topic Mixture
• Goal: topic mixture proportions for each document are drawn from some distribution
  – A distribution on multinomials (k-tuples of non-negative numbers that sum to one)
• The space of all such multinomials can be interpreted geometrically as a (k−1)-simplex
  – Generalization of a triangle to (k−1) dimensions
• Criteria for selecting our prior:
  – It needs to be defined over a (k−1)-simplex
  – It should have nice computational properties
29
In Bayesian probability theory, if the posterior distributions p(θ|x) are in the same family as the prior probability distribution p(θ), the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior for the likelihood function. [Wikipedia]
Latent Dirichlet Allocation
• Document = mixture of topics (as in pLSI), but the mixture is drawn according to a Dirichlet prior
  – With a uniform Dirichlet prior, pLSI = LDA
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
30
Dirichlet Distributions
• Defined over a (k−1)-simplex
  – Takes K non-negative arguments which sum to one
  – Consequently, it is a natural distribution to use over multinomial distributions
• The Dirichlet parameter α_i can be thought of as a prior count of the i-th class
• Conjugate prior to the multinomial distribution
  – Conjugate prior: if our likelihood is multinomial with a Dirichlet prior, then the posterior is also a Dirichlet
31
Multinomial likelihood:
    f(x_1, …, x_K; p_1, …, p_K) = Γ(∑_i x_i + 1) / ∏_i Γ(x_i + 1) · ∏_{i=1}^K p_i^{x_i}
Dirichlet prior:
    p(θ | α) = Γ(∑_i α_i) / ∏_i Γ(α_i) · ∏_{i=1}^K θ_i^{α_i − 1}
LDA Model
[Figure: LDA as a graphical model; α → θ → z_1 … z_4 → w_1 … w_4 ← β, unrolled for three documents]
32
LDA Model – Parameters
α ← proportions parameter (k-dimensional vector of real numbers)
θ ← per-document topic distribution (k-dimensional vector of probabilities summing up to 1)
z ← per-word topic assignment (number from 1 to k)
w ← observed word (number from 1 to v, where v is the number of words in the vocabulary)
β ← word "prior" (v-dimensional)
33
[Plate diagram: α → θ → z → w ← β, with z and w in plate N nested in plate M]
Dirichlet Distribution over a 2-Simplex
34
A panel illustrating probability density functions of a few Dirichlet distributions over a 2-simplex, for the following α vectors (clockwise, starting from the upper left corner): (1.3, 1.3, 1.3), (3, 3, 3), (7, 7, 7), (2, 6, 11), (14, 9, 5), (6, 2, 6). [Wikipedia]
LDA Model – Plate Notation
• For each document d
  – Generate θ_d ~ Dirichlet(· | α)
  – For each position i = 1, …, N_d
    • Generate a topic z_i ~ Mult(· | θ_d)
    • Generate a word w_i ~ Mult(· | z_i, β)
35
[Plate diagram: α → θ → z → w ← β, with z and w in plate N nested in plate M]

    P(θ, z_1, …, z_{N_d}, w_1, …, w_{N_d} | α, β)
      = ∏_{d=1}^M P(θ_d | α) ∏_{i=1}^{N_d} P(z_i | θ_d) P(w_i | β, z_i)
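The generative process above can be run forward to sample a synthetic document (a minimal sketch; the topic matrix and the hypothetical sizes K = 5, V = 50 are invented for illustration):

```python
import numpy as np

def lda_generate(alpha, beta, n_words, rng):
    """Sample one document from the LDA generative process above.
    alpha: (K,) Dirichlet parameter; beta: (K, V) per-topic word distributions."""
    K, V = beta.shape
    theta = rng.dirichlet(alpha)                 # theta_d ~ Dirichlet(alpha)
    z = rng.choice(K, size=n_words, p=theta)     # z_i ~ Mult(theta_d)
    w = np.array([rng.choice(V, p=beta[zi]) for zi in z])  # w_i ~ Mult(beta_{z_i})
    return theta, z, w

rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(50), size=5)        # 5 hypothetical topics over 50 word types
theta, z, w = lda_generate(np.full(5, 0.5), beta, n_words=100, rng=rng)
```

Inference (below) inverts exactly this process: given only w, recover plausible θ and z.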
Corpus-level Parameter α (scalar summary: ᾱ = K⁻¹ ∑_i α_i)
• Let α = 1
• Per-document topic distribution: K = 10, D = 15
36
[Figure: 15 sampled per-document topic distributions over K = 10 topics for α = 1; the probability mass varies moderately across topics]
Corpus-level Parameter α
• α = 10 and α = 100
37
[Figure: sampled per-document topic distributions for α = 10 (fairly even) and α = 100 (nearly uniform over the K = 10 topics)]
Corpus-level Parameter α
• α = 0.1 and α = 0.01
38
[Figure: sampled per-document topic distributions for α = 0.1 and α = 0.01; the mass concentrates on one or very few topics per document]
Back to Topic Modeling Scenario
39
[Figure: topics, documents, and topic proportions/assignments in LDA]
• In reality, we only observe the documents
• The other structures are hidden variables
• Topic modeling algorithms infer these variables from data
Smoothed LDA Model
• Give a different word distribution to each topic
  – β is a K×V matrix (V = vocabulary size); each row denotes the word distribution of one topic
• Choose β_k ~ Dirichlet(· | η) for each topic k
• For each document d
  – Choose θ_d ~ Dirichlet(· | α)
  – For each position i = 1, …, N_d
    • Generate a topic z_i ~ Mult(· | θ_d)
    • Generate a word w_i ~ Mult(· | z_i, β_k)
40
[Plate diagram: α → θ → z → w; η → β_k in a plate of size K; z and w in plate N nested in plate M]
Smoothed LDA Model
41
[Plate diagram: α → θ → z → w; η → β_k in a plate of size K; z and w in plate N nested in plate M]

    P(β, θ, z, w | α, η)
      = ∏_{k=1}^K P(β_k | η) · ∏_{d=1}^M P(θ_d | α) ∏_{i=1}^{N_d} P(z_{d,i} | θ_d) P(w_{d,i} | β_{1:K}, z_{d,i})
Back to Topic Modeling Scenario
• But…
42
Why does LDA “work”?
• Trade-off between two goals:
  1. For each document, allocate its words to as few topics as possible.
  2. For each topic, assign high probability to as few terms as possible.
• These goals are at odds:
  – Putting a document in a single topic makes #2 hard: all of its words must have high probability under that topic.
  – Putting very few words in each topic makes #1 hard: to cover a document's words, many topics must be assigned to it.
• Trading off these goals finds groups of tightly co-occurring words
43
Inference: The Problem (non-smoothed version)
• To which topics does a given document belong?
  – Compute the posterior distribution of the hidden variables given a document:

    P(θ, z | w, α, β) = P(θ, z, w | α, β) / P(w | α, β)

  where

    P(θ, z, w | α, β) = P(θ | α) ∏_{n=1}^N P(z_n | θ) P(w_n | z_n, β)

  and

    P(w | α, β) = Γ(∑_i α_i) / ∏_i Γ(α_i) · ∫ (∏_{i=1}^k θ_i^{α_i − 1}) (∏_{n=1}^N ∑_{i=1}^k ∏_{j=1}^V (θ_i β_{ij})^{w_n^j}) dθ

• This not only looks awkward, but is also computationally intractable in general: θ and β_{ij} are coupled under the integral.
• Solution: approximations.
44
LDA Learning
• Parameter learning:
  – Variational EM
    • Numerical approximation using lower bounds
    • Results in biased solutions
    • Convergence has numerical guarantees
  – Gibbs sampling
    • Stochastic simulation
    • Unbiased solutions
    • Stochastic convergence
D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
45
LDA: Variational Inference
• Replace the graphical model of LDA with a simpler one, in which the problematic coupling between θ and β is not present.
[Diagrams: the full LDA model vs. the simplified variational model with free variational parameters]
• Minimize the Kullback-Leibler (KL) divergence between the two distributions.
  – A standard trick in machine learning; often equivalent to maximum likelihood. Gradient descent procedure.
• The variational approach contrasts with the often-used Markov chain Monte Carlo: instead of sampling from the posterior distribution, we try to find the tightest lower bound.
46
LDA: Gibbs Sampling
• MCMC algorithm
  – Fix all current values but one and sample that value, e.g., for z:
    P(z_i = k | z_{−i}, w)
  – Eventually converges to the true posterior
47
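The resampling step P(z_i = k | z_{−i}, w) can be illustrated with a minimal collapsed Gibbs sampler for smoothed LDA (a sketch; the proportionality used is the standard collapsed-LDA formula with hyperparameters α and η, and the toy corpus and function name are invented):

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, sweeps=100, seed=0):
    """Minimal collapsed Gibbs sampler for smoothed LDA.
    Each z_i is resampled from P(z_i = k | z_-i, w), which is
    proportional to (n_dk + alpha) * (n_kw + eta) / (n_k + V * eta)."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(d)) for d in docs]
    n_dk = np.zeros((len(docs), K))   # topic counts per document
    n_kw = np.zeros((K, V))           # word counts per topic
    n_k = np.zeros(K)                 # total words per topic
    for d, (ws, zs) in enumerate(zip(docs, z)):
        for w_i, z_i in zip(ws, zs):
            n_dk[d, z_i] += 1; n_kw[z_i, w_i] += 1; n_k[z_i] += 1
    for _ in range(sweeps):
        for d, (ws, zs) in enumerate(zip(docs, z)):
            for i, w_i in enumerate(ws):
                k = zs[i]             # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w_i] -= 1; n_k[k] -= 1
                p = (n_dk[d] + alpha) * (n_kw[:, w_i] + eta) / (n_k + V * eta)
                k = rng.choice(K, p=p / p.sum())
                zs[i] = k             # add the new assignment back
                n_dk[d, k] += 1; n_kw[k, w_i] += 1; n_k[k] += 1
    return z, n_kw

# Six tiny documents over a 5-word vocabulary, with two obvious word groups
docs = [[0, 0, 1, 1, 0]] * 3 + [[3, 4, 3, 4, 4]] * 3
z_out, n_kw = lda_gibbs(docs, K=2, V=5)
```

The counts n_kw and n_dk collected during the sweeps are then normalized to estimates of β and θ.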
Variational Inference vs. Gibbs Sampling
• Gibbs sampling is slower (takes days for moderately sized datasets); variational inference takes a few hours.
• Gibbs sampling is more accurate.
• Gibbs sampling convergence is difficult to test, although quite a few approximate inference techniques in machine learning have the same problem.
48
LDA Application: Reuters Data
• Setup
  – 100-topic LDA trained on a corpus of 16,000 news articles by Reuters (the news agency)
  – Some standard stop words removed
• Top words from some of the P(w|z):

  "Arts"    "Budgets"  "Children"  "Education"
  new       million    children    school
  film      tax        women       students
  show      program    people      schools
  music     budget     child       education
  movie     billion    years       teachers
  play      federal    families    high
  musical   year       work        public

49
LDA Application: Reuters Data
• Result: inference on a held-out document; again the topics "Arts", "Budgets", "Children", "Education":

The William Randolph Hearst Foundation will give $1.25 million to Lincoln Center, Metropolitan Opera Co., New York Philharmonic and Juilliard School. "Our board felt that we had a real opportunity to make a mark on the future of the performing arts with these grants, an act every bit as important as our traditional areas of support in health, medical research, education and the social services," Hearst Foundation President Randolph A. Hearst said Monday in announcing the grants.
50
Measuring Performance
• Perplexity of a probability model: how well a probability distribution or probability model predicts a sample
  – q: model of an unknown probability distribution p, based on a training sample drawn from p
  – Evaluate q by asking how well it predicts a separate test sample x_1, …, x_N, also drawn from p
  – Perplexity of q w.r.t. the sample x_1, …, x_N is defined as
    2^{−(1/N) ∑_{i=1}^N log_2 q(x_i)}
  – A better model q will tend to assign higher probabilities q(x_i) to the test events → lower perplexity ("less surprised by the sample")
51
[Wikipedia]
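The definition above is one line of code (a small sketch; the uniform-die and coin examples are illustrative):

```python
import numpy as np

def perplexity(q_probs):
    """Perplexity 2^{-(1/N) sum_i log2 q(x_i)} of a model on a held-out sample,
    given the model probabilities q(x_i) of the N test events."""
    q = np.asarray(q_probs, dtype=float)
    return 2.0 ** (-np.mean(np.log2(q)))

# A uniform model over 6 outcomes is "6 ways perplexed" by any sample it fits:
pp_uniform = perplexity([1 / 6] * 10)
# A sharper model that assigns higher probability to what occurred scores lower:
pp_sharp = perplexity([0.5] * 10)
```

Perplexity can be read as the effective branching factor: a value of 6 means the model is, on average, as uncertain as a fair six-sided die.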
Perplexity of Various Models
[Figure: held-out perplexity as a function of the number of topics; LDA attains lower perplexity than the unigram and other baseline models]
52
Use of LDA
• A widely used topic model (Griffiths, Steyvers, 2004)
• Complexity is an issue
• Use in IR:
  – Ad hoc retrieval (Wei and Croft, SIGIR 06: TREC benchmarks)
  – Improvements over traditional LM (e.g., LSI techniques)
  – But no consensus on whether there is any improvement over the relevance model, i.e., the model with relevance feedback (relevance feedback is part of the TREC tests)
T. Griffiths, M. Steyvers, Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235. 2004
53
Xing Wei and W. Bruce Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval (SIGIR '06). ACM, New York, NY, USA, 178-185. 2006.
TREC=Text REtrieval Conference
Generative Topic Models for Community Analysis
Pilfered from: Ramesh Nallapati, http://www.cs.cmu.edu/~wcohen/10-802/lda-sep-18.ppt
&
Arthur Asuncion, Qiang Liu, Padhraic Smyth: Statistical Approaches to Joint Modeling of Text and Network Data
54
What if the corpus has network structure?
55
CORA citation network. Figure from [Chang, Blei, AISTATS 2009]
J. Chang, and D. Blei. Relational Topic Models for Document Networks.AISTATS, volume 5 of JMLR Proceedings, page 81-88. JMLR.org, 2009.
Outline
• Topic Models for Community Analysis
  – Citation modeling
    • with pLSI
    • with LDA
  – Modeling influence of citations
  – Relational Topic Models
56
Hyperlink Modeling Using pLSI
• Select document d ~ Mult(· | π)
  – For each position n = 1, …, N_d
    • Generate z_n ~ Mult(· | θ_d)
    • Generate w_n ~ Mult(· | β_{z_n})
  – For each citation j = 1, …, L_d
    • Generate z_j ~ Mult(· | θ_d)
    • Generate c_j ~ Mult(· | γ_{z_j})
57
[Plate diagram: π → d; θ_d → z → w ← β in plate N; θ_d → z → c ← γ in plate L; both nested in plate M]
D. A. Cohn, Th. Hofmann, The Missing Link - A ProbabilisticModel of Document Content and Hypertext Connectivity, In: Proc. NIPS, pp. 430-436, 2000
Hyperlink Modeling Using pLSI
• pLSI likelihood
    ∏_{d=1}^M P(w_1, …, w_{N_d}, d | θ, β, π) = ∏_{d=1}^M π_d ∏_{i=1}^{N_d} ∑_{k=1}^K θ_{d,k} β_{k,w_i}
• New likelihood (words and citations)
    ∏_{d=1}^M P(w_1, …, w_{N_d}, c_1, …, c_{L_d}, d | θ, β, γ, π)
      = ∏_{d=1}^M π_d (∏_{i=1}^{N_d} ∑_{k=1}^K θ_{d,k} β_{k,w_i}) (∏_{j=1}^{L_d} ∑_{k=1}^K θ_{d,k} γ_{k,c_j})
• Learning using EM
58
[Plate diagram as before: words and citations share the per-document topic mixture θ_d]
Hyperlink Modeling Using pLSI
• Heuristic: weight the two likelihood parts
    ∏_{d=1}^M π_d (∏_{i=1}^{N_d} ∑_{k=1}^K θ_{d,k} β_{k,w_i})^a (∏_{j=1}^{L_d} ∑_{k=1}^K θ_{d,k} γ_{k,c_j})^{1−a}
  – 0 < a < 1 determines the relative importance of content and hyperlinks
59
Hyperlink Modeling Using pLSI
• Classification performance
[Figure: classification performance using hyperlink features vs. content features]
60
Hyperlink modeling using LDA
• For each document d
  – Generate θ_d ~ Dirichlet(α)
  – For each position i = 1, …, N_d
    • Generate a topic z_i ~ Mult(· | θ_d)
    • Generate a word w_i ~ Mult(· | β_{z_i})
  – For each citation j = 1, …, L_d
    • Generate z_j ~ Mult(· | θ_d)
    • Generate c_j ~ Mult(· | γ_{z_j})
• Learning using variational EM, Gibbs sampling
61
[Plate diagram: α → θ → z → w ← β in plate N; θ → z → c ← γ in plate L; both nested in plate M]
E. Erosheva, S Fienberg, J. Lafferty, Mixed-membership models of scientific publications. Proc National Academy Science U S A. 2004 Apr 6;101 Suppl 1:5220-7. Epub 2004 Mar 12.
Link-pLSI-LDA: Topic Influence in Blogs
R. Nallapati, A. Ahmed, E. Xing, W.W. Cohen, Joint Latent Topic Models for Text and Citations, In: Proc. KDD, 2008.
62
63
Modeling Citation Influences - Copycat Model
• Each topic in a citing document is drawn from one of the topic mixtures of cited publications
64
L. Dietz, St. Bickel, and T. Scheffer, Unsupervised Prediction of Citation Influences, In: Proc. ICML 2007.
Modeling Citation Influences
• Citation influence model: Combination of LDA and Copycat model
65
L. Dietz, St. Bickel, and T. Scheffer, Unsupervised Prediction of Citation Influences, In: Proc. ICML 2007.
Modeling Citation Influences
• Citation influence graph for LDA paper
66
Modeling Citation Influences
• Words in LDA paper assigned to citations
67
Modeling Citation Influences
• Predictive Performance
68
Relational Topic Model (RTM) [ChangBlei 2009]
• Same setup as LDA, except now we have observed network information across documents
• "Link probability function": documents with similar topics are more likely to be linked
    y_{d,d'} ~ ψ(y_{d,d'} | z_d, z_{d'}, η)
69
[Plate diagram: two LDA plates (θ_d, z_{d,n}, w_{d,n}) and (θ_{d'}, z_{d',n}, w_{d',n}) sharing the topics β_k, connected by the link indicator y_{d,d'} with parameter η]
J. Chang and D. Blei. Relational Topic Models for Document Networks. AISTATS, volume 5 of JMLR Proceedings, pages 81-88. JMLR.org, 2009.
Relational Topic Model (RTM) [ChangBlei 2009]
• For each document d
  – Draw topic proportions θ_d | α ~ Dir(α)
  – For each word w_{d,n}
    • Draw assignment z_{d,n} | θ_d ~ Mult(θ_d)
    • Draw word w_{d,n} | z_{d,n}, β_{1:K} ~ Mult(β_{z_{d,n}})
• For each pair of documents d, d'
  – Draw binary link indicator y | z_d, z_{d'} ~ ψ(· | z_d, z_{d'})
70
[Plate diagram as on the previous slide]
Collapsed Gibbs Sampling for RTM
• Conditional distribution of each z:
    P(z_{d,n} = k | z_{¬d,n}, ·) ∝ (N_{dk}^{¬d,n} + α) · (N_{kw}^{¬d,n} + β) / (N_k^{¬d,n} + Wβ)
        · ∏_{d'≠d: y_{d,d'}=1} ψ(y_{d,d'} = 1 | z_d, z_{d'}, η)
        · ∏_{d'≠d: y_{d,d'}=0} ψ(y_{d,d'} = 0 | z_d, z_{d'}, η)
  – The first factor is the usual LDA term; the two products are the "edge" term and the "non-edge" term.
• Using the exponential link probability function, it is computationally efficient to calculate the "edge" term.
• It is very costly to compute the "non-edge" term exactly.
71
Approximating the Non-edges
1. Assume non-edges are "missing" and ignore the term entirely (Chang/Blei)
2. Make the following fast approximation:
    ∏_i (1 − exp(c_i)) ≈ (1 − exp(c̄))^N,   where   c̄ = (1/N) ∑_i c_i
3. Subsample non-edges and exactly calculate the term over the subset.
4. Subsample non-edges, but instead of recalculating statistics for every z_{d,n} token, calculate statistics once per document and cache them over each Gibbs sweep.
72
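Option 2 can be sanity-checked numerically (a small sketch; the c_i values below are hypothetical stand-ins for the per-pair arguments of the exponential link function, chosen negative so each 1 − exp(c_i) is a valid non-edge probability):

```python
import numpy as np

rng = np.random.default_rng(1)
c = -1.0 - 0.05 * rng.random(200)        # N = 200 values clustered near -1.02

# Exact non-edge term vs. the fast approximation using the mean c-bar
exact = np.prod(1.0 - np.exp(c))
approx = (1.0 - np.exp(c.mean())) ** len(c)
rel_err = abs(exact - approx) / exact
```

The approximation is accurate when the c_i are tightly clustered, which is why it works well enough in practice while replacing N exponentials per token with a single one.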
Document networks

  Corpus              # Docs   # Links   Avg. Doc Length   Vocab Size   Link Semantics
  CORA                 4,000    17,000             1,200       60,000   Paper citation (undirected)
  Netflix Movies      10,000    43,000               640       38,000   Common actor/director
  Enron (undirected)   1,000    16,000             7,000       55,000   Communication between person i and person j
  Enron (directed)     2,000    21,000             3,500       55,000   Email from person i to person j

73
Link Rank
• "Link rank" on held-out data as evaluation metric
  – Lower is better
• How to compute link rank (simplified):
  1. Train the model with {d_train}.
  2. Given the model, calculate the probability that d_test would link to each d_train. Rank {d_train} according to these probabilities.
  3. For each observed link between d_test and {d_train}, find its rank; average all these ranks to obtain the "link rank".
74
[Diagram: d_test and {d_train} fed to a black-box predictor, producing a ranking over {d_train}; observed edges between d_test and {d_train} yield the link ranks]
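The three steps above reduce to a small ranking computation (a sketch; the function name and the toy scores are illustrative, not from the evaluation):

```python
import numpy as np

def link_rank(scores, true_links):
    """'Link rank' of one test document: rank the training documents by the
    model's link probability (best first, 1-based), then average the ranks of
    the training documents the test document actually links to. Lower is better."""
    order = np.argsort(-np.asarray(scores))                   # best-scored first
    rank_of = {int(d): r + 1 for r, d in enumerate(order)}    # 1-based ranks
    return float(np.mean([rank_of[d] for d in true_links]))

# Toy check: training docs 0..4; true links go to docs 2 (ranked 1st) and 0 (ranked 3rd)
lr = link_rank(scores=[0.3, 0.1, 0.9, 0.05, 0.4], true_links=[2, 0])
```

The final reported number is this quantity averaged over all held-out test documents.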
75
Results on CORA data
[Figure: link rank on CORA, K = 20, comparing Baseline (TF-IDF/Cosine), LDA + Regression, Ignoring non-edges, Fast approximation of non-edges, and Subsampling non-edges (20%) + Caching; 8-fold cross-validation; random guessing gives link rank = 2000]
Results on CORA data
• Model does better with more topics
• Model does better with more words in each document
76
[Figures: link rank vs. number of topics (Baseline vs. RTM with fast approximation), and link rank vs. percentage of words per document (Baseline, LDA + Regression, Ignoring Non-Edges, Fast Approximation, Subsampling (5%) + Caching, all at K = 40)]
77
Timing Results on CORA
[Figure: runtime in seconds vs. number of documents (CORA, K = 20) for LDA + Regression, Ignoring Non-Edges, Fast Approximation, Subsampling (5%) + Caching, and Subsampling (20%) + Caching]
"Subsampling (20%) without caching" is not shown, since it takes 62,000 seconds for D = 1000 and 3,720,150 seconds for D = 4000
78
Conclusion
• Relational topic modeling provides a useful start for combining text and network data in a single statistical framework
• RTM can improve over simpler approaches for link prediction
• Opportunities for future work:– Faster algorithms for larger data sets– Better understanding of non-edge modeling– Extended models