Topic Modeling - cs.jhu.edu/~jason/465/PDFSlides/lect31-topicmodels.pdf · 600.465 - Intro to NLP - J. Eisner


Page 1

Topic Modeling

Page 2

Embedding Documents
A trick from Information Retrieval: each document in the corpus is a length-k vector (or each paragraph, or whatever).

(0, 3, 3, 1, 0, 7, …, 1, 0)

a single document. Pretty sparse, so pretty noisy! Hard to tell which docs are similar.

Wish we could smooth the vector: what would the document look like if it rambled on forever?

Page 3

Latent Semantic Analysis
A trick from Information Retrieval: each document in the corpus is a length-k vector. Plot all documents in the corpus.

(Figure: true plot in k dimensions vs. reduced-dimensionality plot)

A clustering algorithm might discover this cluster of “similar” documents (similar thanks to smoothing!).

Page 4

Latent Semantic Analysis
The reduced plot is a perspective drawing of the true plot: it projects the true plot onto a few axes, a best choice of axes that shows the most variation in the data. Found by linear algebra: “Principal Components Analysis” (PCA).

(Figure: true plot in k dimensions vs. reduced-dimensionality plot)

Page 5

Latent Semantic Analysis
The SVD plot allows the best possible reconstruction of the true plot (i.e., we can recover the 3-D coordinates with minimal distortion). It ignores variation along the axes that it didn’t pick; hope that variation is just noise that we want to ignore.
(Figure: true plot in k dimensions vs. reduced-dimensionality plot, with reduced axes labeled topic A and topic B)

Page 6

Latent Semantic Analysis
SVD finds a small number of topic vectors and approximates each doc as a linear combination of topics. Coordinates in the reduced plot = the linear coefficients: how much of topic A is in this document? How much of topic B? Each topic is a collection of words that tend to appear together.
(Figure: true plot in k dimensions vs. reduced-dimensionality plot, with reduced axes labeled topic A and topic B)

Page 7

Latent Semantic Analysis
The new coordinates might actually be useful for Information Retrieval. To compare 2 documents, or a query and a document: project both into the reduced space and ask whether they have topics in common, even if they have no words in common!
(Figure: true plot in k dimensions vs. reduced-dimensionality plot, with reduced axes labeled topic A and topic B)

Page 8

Latent Semantic Analysis
Topics extracted for IR might help sense disambiguation.
Each word is like a tiny document: (0,0,0,1,0,0,…). Express the word as a linear combination of topics. Each topic corresponds to a sense? E.g., “Jordan” has Mideast and Sports topics (plus an Advertising topic, alas, which is the same sense as Sports).
A word’s sense in a document: which of its topics are strongest in the document?
This groups senses as well as splitting them: one word has several topics, and many words have the same topic.

Page 9

Latent Semantic Analysis
A perspective on Principal Components Analysis (PCA): imagine an electrical circuit that connects terms to docs …

(Figure: terms 1-9 connected to documents 1-7 by a matrix of strengths: how strong is each term in each document?)
Each connection has a weight given by the matrix.

Page 10

Latent Semantic Analysis
Which documents is term 5 strong in?

Docs 2, 5, 6 light up strongest.
(Figure: the term-document network, with term 5 activated)

Page 11

Latent Semantic Analysis
Which documents are terms 5 and 8 strong in?
This answers a query consisting of terms 5 and 8!
Really just matrix multiplication: term vector (query) × strength matrix = doc vector.
(Figure: the term-document network, with terms 5 and 8 activated)
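A minimal sketch of this matrix-multiplication view (not from the slides); the 9 × 7 strength matrix M below is made up, standing in for the one in the figure.

```python
# A sketch of "term vector (query) x strength matrix = doc vector" with made-up data.
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((9, 7))            # hypothetical strengths: 9 terms x 7 documents

query = np.zeros(9)
query[[4, 7]] = 1.0               # activate terms 5 and 8 (0-indexed as 4 and 7)

doc_scores = query @ M            # one score per document
print(doc_scores.argsort()[::-1]) # documents ranked by how strongly they answer the query
```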

Page 12

Latent Semantic Analysis
Conversely, what terms are strong in document 5?

This gives doc 5’s coordinates!
(Figure: the term-document network, with document 5 activated)

Page 13

Latent Semantic Analysis
SVD approximates the network by a smaller 3-layer network, forcing the sparse data through a bottleneck and thereby smoothing it.
(Figure: the original term-document network next to a 3-layer network in which terms connect to a small layer of topics, which connect to the documents)

Page 14

Latent Semantic Analysis
I.e., smooth the sparse data by a matrix approximation: M ≈ A B. A encodes the camera angle, and B gives each doc’s new coordinates.
(Figure: the matrix M of term-document strengths factored as A times B, drawn as the 3-layer terms-topics-documents network)

Page 15

Latent Semantic Analysis
Completely symmetric! Regard A, B as projecting terms and docs into a low-dimensional “topic space” where their similarity can be judged.
(Figure: the same factorization M ≈ A B, drawn as the 3-layer terms-topics-documents network)

Page 16

Latent Semantic Analysis
Completely symmetric. Regard A, B as projecting terms and docs into a low-dimensional “topic space” where their similarity can be judged.
Cluster documents (helps the sparsity problem!)
Cluster words
Compare a word with a doc
Identify a word’s topics with its senses: sense disambiguation by looking at the document’s senses

Identify a document’s topics with its categories: topic categorization

Page 17

If you’ve seen SVD before …

SVD actually decomposes M = A D B′ exactly. A = camera angle (orthonormal); D is diagonal; B′ is orthonormal.
(Figure: the matrix M factored as A D B′, drawn alongside the terms-topics-documents network)

Page 18

If you’ve seen SVD before …

Keep only the largest j < k diagonal elements of D. This gives the best possible approximation to M using only j blue units.
(Figure: the matrix M factored as A D B′, with D truncated to its largest j entries)


Page 20

If you’ve seen SVD before …

To simplify the picture, we can write M ≈ A (D B′) = A B.
(Figure: the matrix M factored as A B, where B = D B′, drawn as the terms-topics-documents network)

How should you pick j (the number of blue units)? Just like picking the number of clusters: how well does the system work with each j (on held-out data)?
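A minimal numpy sketch of the steps above (not from the slides): an exact SVD, truncation to j topics, folding D into B, and projecting a query into the same topic space. The count matrix and the choice j = 2 are made up for illustration.

```python
# A rough LSA sketch with made-up data: exact SVD, then keep only j topics.
import numpy as np

rng = np.random.default_rng(0)
M = rng.poisson(1.0, size=(9, 7)).astype(float)   # hypothetical term-document counts (9 terms x 7 docs)

A, d, Bt = np.linalg.svd(M, full_matrices=False)  # M = A @ diag(d) @ Bt exactly
j = 2                                             # number of topics ("blue units"), chosen on held-out data
A_j, D_j, Bt_j = A[:, :j], np.diag(d[:j]), Bt[:j, :]

B = D_j @ Bt_j               # fold D into B, so M is approximately A_j @ B
M_approx = A_j @ B           # best rank-j approximation to M

# Columns of B are the documents' coordinates in topic space; a query
# (a vector over terms) can be projected into the same space with A_j.T.
query = np.zeros(9); query[[4, 7]] = 1.0
query_coords = A_j.T @ query
sims = (B.T @ query_coords) / (np.linalg.norm(B, axis=0) * np.linalg.norm(query_coords) + 1e-12)
print(sims)                  # cosine similarity of the query to each document in topic space
```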

Page 21

Directed graphical models (Bayes nets)

slide thanks to Zoubin Ghahramani (modified)

Under any model: p(A, B, C, D, E) = p(A) p(B|A) p(C|A,B) p(D|A,B,C) p(E|A,B,C,D). The model above says: …

Page 22

Unigram model for generating text

w1 w2 w3…

p(w1) p(w2) p(w3) …

Page 23

Explicitly show model’s parameters

w1 w2 w3…

p(θ) p(w1 | θ) p(w2 | θ) p(w3 | θ) …

“θ is a vector that says which unigrams are likely”

Page 24

“Plate notation” simplifies diagram

p(θ) p(w1 | θ) p(w2 | θ) p(w3 | θ) …

(Plate diagram: θ → w, with w inside a plate of size N1)

“θ is a vector that says which unigrams are likely”
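A minimal sketch of this generative story (not from the slides): the vocabulary and the θ values below are made up.

```python
# Draw N1 words i.i.d. from a made-up unigram distribution theta.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "dog", "cat", "barks"]           # hypothetical vocabulary
theta = np.array([0.5, 0.2, 0.2, 0.1])           # "theta says which unigrams are likely"

N1 = 10
words = rng.choice(vocab, size=N1, p=theta)      # w1 ... wN1, each drawn from theta
print(" ".join(words))
```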

Page 25

Learn θ from observed words (rather than vice-versa)

p(θ) p(w1 | θ) p(w2 | θ) p(w3 | θ) …

(Plate diagram: θ → w, with w inside a plate of size N1; now the words are observed and θ is inferred)

Page 26

Explicitly show prior over θ (e.g., Dirichlet)

(Plate diagram: α → θ → w, with w inside a plate of size N1)
p(θ | α) p(w1 | θ) p(w2 | θ) p(w3 | θ) …
“Even if we didn’t observe word 5, the prior says that θ5 = 0 is a terrible guess”
α given; θ ~ Dirichlet(α); wi ~ θ

Page 27

Dirichlet Distribution
Each point on a k-dimensional simplex is a multinomial probability distribution: θi ≥ 0 for all i, and Σi θi = 1.
(Figure: the 3-simplex with corners for the words dog, the, cat and axes θ1, θ2, θ3; an example point such as (0.3, 0.5, 0.2))

slide thanks to Nigel Crook

Page 28

Dirichlet Distribution
A Dirichlet Distribution is a distribution over multinomial distributions in the simplex.
(Figure: a Dirichlet density plotted over the 3-simplex with axes θ1, θ2, θ3)

slide thanks to Nigel Crook

Page 29

slide thanks to Percy Liang and Dan Klein

Page 30

Dirichlet Distribution
Example draws from a Dirichlet Distribution over the 3-simplex:
(Figure: three panels of sampled points on the 3-simplex, one panel for each parameter setting below)

Dirichlet(5,5,5)

Dirichlet(0.2, 5, 0.2)

Dirichlet(0.5,0.5,0.5)

slide thanks to Nigel Crook
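A quick way to reproduce the flavor of these pictures is to sample from each parameter setting with numpy (a sketch, not from the slides):

```python
# Draw a few samples from each of the three Dirichlet parameter settings above.
import numpy as np

rng = np.random.default_rng(0)
for alpha in [(5, 5, 5), (0.2, 5, 0.2), (0.5, 0.5, 0.5)]:
    draws = rng.dirichlet(alpha, size=3)   # each row sums to 1: a point on the 3-simplex
    print(alpha)
    print(np.round(draws, 2))
# Roughly: (5,5,5) concentrates draws near the center of the simplex,
# (0.2, 5, 0.2) puts most mass on the middle coordinate,
# and (0.5, 0.5, 0.5) pushes draws toward the corners and edges.
```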

Page 31

Explicitly show prior over θ (e.g., Dirichlet)

(Plate diagram: α → θ → w, with w inside a plate of size N)

p(θ | α) p(w1 | θ) p(w2 | θ) p(w3 | θ) …

“Even if we didn’t observe word 5, the prior says that θ5 = 0 is a terrible guess”

The posterior distribution p(θ | α, w) is also a Dirichlet, just like the prior p(θ | α).

prior = Dirichlet(α); posterior = Dirichlet(α + counts(w)). The mean of the posterior is like the max-likelihood estimate of θ, but it smooths the corpus counts by adding “pseudocounts” α.

(But better to use whole posterior, not just the mean.)
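A minimal sketch of that update (not from the slides), with a made-up 4-word vocabulary and pseudocounts α = 0.5:

```python
# Dirichlet prior + observed counts -> Dirichlet posterior, whose mean is an add-alpha smoothed estimate.
import numpy as np

alpha = np.array([0.5, 0.5, 0.5, 0.5])       # hypothetical pseudocounts
counts = np.array([7, 2, 0, 1])              # counts(w) observed in the corpus

posterior = alpha + counts                   # posterior = Dirichlet(alpha + counts(w))
posterior_mean = posterior / posterior.sum()
ml_estimate = counts / counts.sum()          # unsmoothed maximum-likelihood estimate

print(ml_estimate)       # word 3 gets probability 0: "a terrible guess"
print(posterior_mean)    # every word keeps some probability, thanks to the pseudocounts
```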

Page 32

Training and Test Documents

(Plate diagram: a single θ generates both the training document, w inside a plate of size N1, and the test document, w inside a plate of size N2)
“Learn from document 1, use it to predict document 2”
What do good configurations look like if N1 is large? What if N1 is small?

Page 33

Many Documents

(Diagram: a shared α, with separate θ1, θ2, θ3 generating w1 (N1 words), w2 (N2 words), w3 (N3 words))
“Each document has its own unigram model”
Now does observing docs 1 and 3 still help predict doc 2? Only if α learns that all the θ’s are similar (low variance). And in that case, why even have separate θ’s?

Page 34

Many Documents

(Plate diagram: α → θd → wdi, with words inside a plate of size N and documents inside a plate of size D)
“Each document has its own unigram model”
α given, or tuned to maximize training or dev set likelihood; θd ~ Dirichlet(α); wdi ~ θd

Page 35

Bayesian Text Categorization

(Plate diagram: α → θk, one unigram model per topic inside a plate of size K; words wdi inside plates of size N and D)
“Each document chooses one of only K topics (unigram models)”
α given; θk ~ Dirichlet(α); wdi ~ θk, but which k?

Page 36

Bayesian Text Categorization

(Plate diagram: each document d has a topic zd; its words wdi are generated from that topic’s unigram model; plates of size N, D, and K)
“Each document chooses one of only K topics (unigram models)”
Allows documents to differ considerably while some still share parameters. And, we can infer the probability that two documents have the same topic z. Might observe some topics.
α given; θk ~ Dirichlet(α); wdi ~ θzd, where zd is a topic in 1…K
zd is drawn from a distribution over topics 1…K, which in turn has a given Dirichlet prior

Page 37

Latent Dirichlet Allocation (Blei, Ng & Jordan 2003)

(Plate diagram: each document d has its own topic mixture; each word wdi has its own topic zdi; plates of size N, D, and K)
“Each document chooses a mixture of all K topics; each word gets its own topic”
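A minimal sketch of LDA's generative story as described above (not from the slides); all sizes, the symmetric Dirichlet parameters, and the variable names are made up for illustration.

```python
# Generate D documents: per-document topic mixture, then a topic per word, then the word.
import numpy as np

rng = np.random.default_rng(0)
V, K, D, N = 20, 3, 5, 30                    # vocabulary, topics, documents, words per document
alpha, eta = 0.5, 0.1                        # hypothetical symmetric Dirichlet parameters

topics = rng.dirichlet([eta] * V, size=K)    # K unigram models, one per topic
for d in range(D):
    theta_d = rng.dirichlet([alpha] * K)     # this document's mixture of all K topics
    z_d = rng.choice(K, size=N, p=theta_d)   # each word gets its own topic
    w_d = [int(rng.choice(V, p=topics[z])) for z in z_d]
    print(z_d, w_d)
```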

Page 38

(Part of) one assignment to LDA’s variables

slide thanks to Dave Blei

Page 39

(Part of) one assignment to LDA’s variables

slide thanks to Dave Blei

Page 40

Latent Dirichlet Allocation: Inference?

(Diagram: the LDA network unrolled over words w1, w2, w3, … with topic variables z1, z2, z3, …; plates of size D and K)
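The slide leaves inference as a question. One standard answer (not from these slides) is collapsed Gibbs sampling, which resamples each word's topic given all the other assignments. A rough sketch under made-up hyperparameters:

```python
# Collapsed Gibbs sampling for LDA. docs is a list of documents, each a list of word ids in [0, V).
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndk = np.zeros((D, K))                 # topic counts per document
    nkw = np.zeros((K, V))                 # word counts per topic
    nk = np.zeros(K)                       # total words assigned to each topic
    z = []                                 # current topic assignment of every token
    for d, doc in enumerate(docs):         # random initialization
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                # remove this token's current assignment
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1
                # p(z = k | everything else) is proportional to
                # (ndk[d,k] + alpha) * (nkw[k,w] + eta) / (nk[k] + V*eta)
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                t = rng.choice(K, p=p / p.sum())
                z[d][i] = t                # put it back with the newly sampled topic
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    return ndk, nkw                        # document-topic and topic-word counts

# Example (toy data): ndk, nkw = lda_gibbs(docs=[[0, 1, 2], [2, 3, 3]], K=2, V=4)
```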

Page 41

Finite-State Dirichlet Allocation (Cui & Eisner 2006)

(Diagram: an HMM over topic states z1 → z2 → z3 → … emitting words w1, w2, w3, …; plates of size K and D)
“A different HMM for each document”

Page 42

Variants of Latent Dirichlet Allocation

Syntactic topic model: A word or its topic is influenced by its syntactic position.

Correlated topic model, hierarchical topic model, …: Some topics resemble other topics.

Polylingual topic model: All versions of the same document use the same topic mixture, even if they’re in different languages. (Why useful?)

Relational topic model: Documents on the same topic are generated separately but tend to link to one another. (Why useful?)

Dynamic topic model: We also observe a year for each document. The K topics used in 2011 have evolved slightly from their counterparts in 2010.

Page 43

Dynamic Topic Model

slide thanks to Dave Blei

Page 44

Dynamic Topic Model

slide thanks to Dave Blei

Page 45

Dynamic Topic Model

slide thanks to Dave Blei

Page 46

Dynamic Topic Model

slide thanks to Dave Blei

Page 47

Remember: Finite-State Dirichlet Allocation (Cui & Eisner 2006)

(Diagram: an HMM over topic states z1 → z2 → z3 → … emitting words w1, w2, w3, …; plates of size K and D)
“A different HMM for each document”

Page 48

Bayesian HMM

(Diagram: an HMM over states z1 → z2 → z3 → … emitting words w1, w2, w3, …; plates of size K and D)
“Shared HMM for all documents”

(or just have 1 document)

We have to estimate the transition parameters and the emission parameters.
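A minimal generative sketch of this shared Bayesian HMM (not from the slides): the transition and emission parameters are drawn from Dirichlet priors and then reused for every document. Estimating them from observed words, the problem the slide points to, would require something like EM or Gibbs sampling; all sizes and hyperparameters below are made up.

```python
# One shared HMM, with Dirichlet priors on its parameters, generates every document.
import numpy as np

rng = np.random.default_rng(0)
K, V, D, N = 3, 20, 4, 15                   # states, vocabulary size, documents, words per document

start = rng.dirichlet([1.0] * K)            # initial-state distribution
trans = rng.dirichlet([1.0] * K, size=K)    # transition parameters: one distribution per state
emit = rng.dirichlet([0.1] * V, size=K)     # emission parameters: one unigram model per state

for d in range(D):                          # the same shared HMM generates each document
    state = rng.choice(K, p=start)
    words = []
    for _ in range(N):
        words.append(int(rng.choice(V, p=emit[state])))
        state = rng.choice(K, p=trans[state])
    print(words)
```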