Page 1: Topic Modelling: Beyond Bag of Words

Topic Modelling: Beyond Bag of Words

By Hanna M. Wallach, ICML 2006

Presented by Eric Wang, April 25th 2008

Page 2: Topic Modelling: Beyond Bag of Words

Outline

• Introduction / Motivation

• Bigram language models (MacKay & Peto, 1995)

• N-gram topic models (LDA, Blei et al.)

• Bigram Topic Model

• Results

• Conclusion

Page 3: Topic Modelling: Beyond Bag of Words

Introduction

• Generative topic models fall into two major categories:

– Bigram language models: Generate words based on some measure of previous words.

– N-gram topic models: Generate words based on latent topics inferred from word or document correlations.

• N-gram topic models are independent of word order, while bigram models consider pairs of words with the leading word defining a “context”.

• Is word order important? Consider the following example:

– "The department chair couches offers."

– "The chair department offers couches."

• To an n-gram topic model the two sentences are identical, but to a reader they are not. A great deal of semantic information must therefore reside in word order. A bigram model would see the two sentences as different.

Page 4: Topic Modelling: Beyond Bag of Words

Bigram language models: Hierarchical Dirichlet Language Model

• Bigram language models are specified by the conditional distribution P(w_t = i | w_{t−1} = j) = φ_{i|j}.

• The matrix Φ, with entries φ_{i|j}, can be thought of as a transition probability matrix.

• Given a corpus w, the likelihood function is

  P(w | Φ) = ∏_t φ_{w_t | w_{t−1}} = ∏_j ∏_i φ_{i|j}^{N_{i|j}}    (1)

• And the prior on Φ is

  P(Φ | β m) = ∏_j Dirichlet(φ_{·|j} | β m)    (2)

  where m is a measure over the vocabulary and β is a concentration parameter.

Page 5: Topic Modelling: Beyond Bag of Words

Bigram language models: Hierarchical Dirichlet Language Model

• Combining (1) and (2), and integrating out the transition matrix Φ, yields the evidence of the corpus w conditioned on the hyperparameters β m:

  P(w | β m) = ∏_j [ Γ(β) / Γ(N_j + β) ] ∏_i [ Γ(N_{i|j} + β m_i) / Γ(β m_i) ]    (3)

• We can also obtain the predictive distribution

  P(i | j, w, β m) = (N_{i|j} + β m_i) / (N_j + β)    (4)

• Where N_{i|j} is the number of times word i follows word j in the corpus, and N_j is the number of times word j appears in the corpus. We say that word j, being the leading word of the word pair, sets a "context", which is analogous to factors and topics in other models.

• MacKay and Peto showed that the optimal hyperparameters β m are found using the empirical Bayes method, i.e. by maximizing the evidence (3).
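To make the predictive distribution (4) concrete, here is a minimal sketch (not part of the original slides) that estimates it from raw bigram counts; the uniform base measure m and the fixed β are simplifying assumptions, since MacKay and Peto actually fit β m by empirical Bayes.

```python
from collections import defaultdict

def bigram_predictive(corpus, beta=1.0):
    """Sketch of the hierarchical Dirichlet language model's predictive
    distribution (4): P(i | j) = (N_ij + beta*m_i) / (N_j + beta).
    `corpus` is a list of token lists; m is taken uniform here (assumption)."""
    vocab = sorted({w for doc in corpus for w in doc})
    m = {w: 1.0 / len(vocab) for w in vocab}          # uniform base measure (assumption)
    N_ij = defaultdict(lambda: defaultdict(int))      # N_ij[j][i]: times i follows j
    N_j = defaultdict(int)                            # N_j[j]: times context j occurs
    for doc in corpus:
        for j, i in zip(doc[:-1], doc[1:]):
            N_ij[j][i] += 1
            N_j[j] += 1
    def predict(i, j):
        return (N_ij[j][i] + beta * m[i]) / (N_j[j] + beta)
    return predict

# Usage: probability of "chair" following "department" in a toy corpus.
docs = [["the", "department", "chair", "couches", "offers"],
        ["the", "chair", "department", "offers", "couches"]]
p = bigram_predictive(docs)
print(p("chair", "department"))
```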

Page 6: Topic Modelling: Beyond Bag of Words

N-gram topic models: Latent Dirichlet Allocation

• Latent Dirichlet Allocation does not consider word order.

• The matrices Φ and Θ govern word emission conditioned on topic, φ_{i|k} = P(w_t = i | z_t = k), and topic emission conditioned on document, θ_{k|d} = P(z_t = k | d_t = d), respectively.

• Here t is the word index within the corpus, i is the word index in the dictionary, k is the topic index, and d is the document index.

Page 7: Topic Modelling: Beyond Bag of Words

N-gram topic models: Latent Dirichlet Allocation

• Therefore, the joint probability of the corpus w and the set of latent topics z is

  P(w, z | Φ, Θ) = ∏_k ∏_i φ_{i|k}^{N_{i|k}} ∏_d ∏_k θ_{k|d}^{N_{k|d}}    (5)

• Where N_{i|k} is the number of times topic k has generated word i, and N_{k|d} is the number of times topic k was generated in document d.

• We place Dirichlet priors on Φ and Θ:

  P(Φ | β n) = ∏_k Dirichlet(φ_{·|k} | β n)    (6)

  P(Θ | α m) = ∏_d Dirichlet(θ_{·|d} | α m)    (7)

Page 8: Topic Modelling: Beyond Bag of Words

N-gram topic models: Latent Dirichlet Allocation

• Combining (5), (6) and (7), and integrating out Φ and Θ, gives the evidence

  P(w | α m, β n) = ∑_z ∏_d [ Γ(α) / Γ(N_d + α) ] ∏_k [ Γ(N_{k|d} + α m_k) / Γ(α m_k) ] ∏_k [ Γ(β) / Γ(N_k + β) ] ∏_i [ Γ(N_{i|k} + β n_i) / Γ(β n_i) ]    (8)

• However, (8) is intractable, so approximation methods (MCMC, variational Bayes) are used to get around this issue.

• Assuming optimal hyperparameters α m and β n, the approximate predictive distributions over words given topic k and over topics given document d are

  P(i | k) ≈ (N_{i|k} + β n_i) / (N_k + β)    (9)

  P(k | d) ≈ (N_{k|d} + α m_k) / (N_d + α)    (10)

• Where N_{i|k} is the number of times topic k generates word i, N_k is the number of times topic k appears in z, N_{k|d} is the number of times topic k appears in document d, and N_d is the number of words in document d.
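To illustrate (9) and (10), here is a minimal sketch, not from the original slides, that evaluates the approximate predictive distributions from count matrices; the uniform base measures n and m, and the toy counts, are assumptions.

```python
import numpy as np

def lda_predictive(N_ik, N_kd, alpha=1.0, beta=1.0):
    """Sketch of LDA's approximate predictive distributions (9) and (10).
    N_ik[k, i]: number of times topic k generated word i.
    N_kd[d, k]: number of times topic k appears in document d.
    n and m are uniform here; in the paper they are fitted hyperparameters."""
    K, W = N_ik.shape
    n = np.full(W, 1.0 / W)                    # uniform base measure over words (assumption)
    m = np.full(K, 1.0 / K)                    # uniform base measure over topics (assumption)
    N_k = N_ik.sum(axis=1, keepdims=True)      # times topic k appears in z
    N_d = N_kd.sum(axis=1, keepdims=True)      # number of words in document d
    p_word_given_topic = (N_ik + beta * n) / (N_k + beta)    # eq. (9), rows sum to 1
    p_topic_given_doc = (N_kd + alpha * m) / (N_d + alpha)   # eq. (10), rows sum to 1
    return p_word_given_topic, p_topic_given_doc

# Toy counts: 2 topics, 4 words, 3 documents.
N_ik = np.array([[5, 1, 0, 2], [0, 3, 4, 1]])
N_kd = np.array([[4, 1], [2, 3], [1, 5]])
pw, pt = lda_predictive(N_ik, N_kd)
print(pw.sum(axis=1), pt.sum(axis=1))          # each row sums to 1
```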

Page 9: Topic Modelling: Beyond Bag of Words

Bigram Topic Model

• We would like to create a model which considers both topics (like LDA) and word order (like bigram language models).

• We accomplish this by using a simple extension of LDA.

• We specify a new conditional distribution for word generation:

  P(w_t = i | w_{t−1} = j, z_t = k) = φ_{i|j,k}    (11)

• These parameters form a W × W × K matrix Φ, where each "plane" can be thought of as the characteristic transition matrix for a topic.

• Here j, as before, defines the context or leading word of a word pair, i is the word index of the trailing word, and k is the topic plane index.

[Figure: Φ drawn as a stack of topic "planes" indexed by k, each a transition matrix over contexts j and trailing words i.]
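As a concrete picture of this parameterisation, here is a small numpy sketch, with illustrative sizes of my own choosing, of Φ stored as a K × W × W array whose k-th plane is a row-stochastic transition matrix.

```python
import numpy as np

K, W = 3, 5                                    # illustrative sizes: topics, vocabulary
rng = np.random.default_rng(0)

# Phi[k, j, i] = phi_{i|j,k}: probability that word i follows context j under topic k.
# One Dirichlet row is drawn per (topic, context) pair, so each plane is row-stochastic.
Phi = rng.dirichlet(np.ones(W), size=(K, W))

assert Phi.shape == (K, W, W)
assert np.allclose(Phi.sum(axis=2), 1.0)       # rows of every plane sum to 1
print(Phi[1, 2, 4])                            # P(w_t = 4 | w_{t-1} = 2, z_t = 1)
```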

Page 10: Topic Modelling: Beyond Bag of Words

Bigram Topic Model

• Topic generation is identical to LDA: each z_t is drawn from θ_{·|d_t}.

• We place a Dirichlet prior on the topic generation parameters Θ:

  P(Θ | α m) = ∏_d Dirichlet(θ_{·|d} | α m)    (12)

• Then the joint probability of the corpus w and a single set of latent topics z is

  P(w, z | Φ, Θ) = ∏_k ∏_j ∏_i φ_{i|j,k}^{N_{i|j,k}} ∏_d ∏_k θ_{k|d}^{N_{k|d}}    (13)

Page 11: Topic Modelling: Beyond Bag of Words

Bigram Topic Model

• In both the Hierarchical Dirichlet Language Model and LDA, the prior over Φ (either the context matrix or the topic matrix) is coupled, in the sense that the hyperparameter vector is shared between all possible contexts or topics.

• However, in this model, because we induced dependence on both topic k and context j, there are two possible priors on Φ:

1) Global sharing:  φ_{·|j,k} ~ Dirichlet(β n)

   Here, a single set of hyperparameters β n is shared across all contexts in all topics. This leads to a simpler formulation.

2) Topic-level sharing:  φ_{·|j,k} ~ Dirichlet(β_k n_k)

   More intuitively, we allow each topic k to have a set of hyperparameters β_k n_k shared by only the contexts in that topic.

Page 12: Topic Modelling: Beyond Bag of Words

Bigram Topic Model

• We are now in a position to describe the generative process of the Bigram Topic Model: draw each θ_{·|d} and each φ_{·|j,k} from their Dirichlet priors, then, for each word position t, draw a topic z_t from θ_{·|d_t} and draw the word w_t from φ_{·|w_{t−1}, z_t}. A sketch of this process is given below.
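The following is a minimal ancestral-sampling sketch of that generative process; the symmetric Dirichlet hyperparameters, the corpus sizes, and the uniform draw of each document's first word are my own assumptions.

```python
import numpy as np

def generate_corpus(D=4, K=3, W=6, doc_len=10, alpha=1.0, beta=1.0, seed=0):
    """Hedged sketch of the bigram topic model's generative process:
    draw Theta and Phi from Dirichlet priors, then for each position t
    draw z_t ~ theta_{.|d} and w_t ~ phi_{.|w_{t-1}, z_t}."""
    rng = np.random.default_rng(seed)
    Theta = rng.dirichlet(alpha * np.ones(K), size=D)        # theta_{k|d}
    Phi = rng.dirichlet(beta * np.ones(W), size=(K, W))      # phi_{i|j,k}, Prior-1 style
    corpus = []
    for d in range(D):
        prev = rng.integers(W)                               # first word drawn uniformly (assumption)
        doc = [prev]
        for _ in range(doc_len - 1):
            z = rng.choice(K, p=Theta[d])                    # topic from the document's theta
            w = rng.choice(W, p=Phi[z, prev])                # word from the topic/context plane
            doc.append(w)
            prev = w
        corpus.append(doc)
    return corpus

print(generate_corpus())
```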

Page 13: Topic Modelling: Beyond Bag of Words

Bigram Topic Model

• Combining (12), (13) and Prior 1, and integrating out Φ and Θ, we arrive at the evidence

  P(w | α m, β n) = ∑_z ∏_d [ Γ(α) / Γ(N_d + α) ] ∏_k [ Γ(N_{k|d} + α m_k) / Γ(α m_k) ] ∏_k ∏_j [ Γ(β) / Γ(N_{j,k} + β) ] ∏_i [ Γ(N_{i|j,k} + β n_i) / Γ(β n_i) ]    (14)

• Alternatively, combining (12), (13) and Prior 2, and integrating out Φ and Θ, gives

  P(w | α m, {β_k n_k}) = ∑_z ∏_d [ Γ(α) / Γ(N_d + α) ] ∏_k [ Γ(N_{k|d} + α m_k) / Γ(α m_k) ] ∏_k ∏_j [ Γ(β_k) / Γ(N_{j,k} + β_k) ] ∏_i [ Γ(N_{i|j,k} + β_k n_{k,i}) / Γ(β_k n_{k,i}) ]    (15)

• Here N_{i|j,k} is the number of times word i follows word j under topic k, and N_{j,k} = Σ_i N_{i|j,k}.

• Again, the summation over z is intractable, so as before, we utilize approximations.

Page 14: Topic Modelling: Beyond Bag of Words

Bigram Topic Model

• Given optimum hyperparameters α m and β n (or {β_k n_k}), the predictive distributions over words given the previous word j and topic k are

  Prior 1:  P(i | j, k) ≈ (N_{i|j,k} + β n_i) / (N_{j,k} + β)    (16)

  Prior 2:  P(i | j, k) ≈ (N_{i|j,k} + β_k n_{k,i}) / (N_{j,k} + β_k)    (17)

• The predictive distribution of topic k given document d is

  P(k | d) ≈ (N_{k|d} + α m_k) / (N_d + α)    (18)
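To make (16)–(18) concrete, the following sketch (mine, not from the slides) evaluates the predictive distributions from the count arrays; the uniform base measures and the toy counts are assumptions, and Prior 2 is selected by passing per-topic hyperparameters.

```python
import numpy as np

def btm_predictive(N_ijk, N_kd, alpha=1.0, beta=1.0, beta_k=None, n_k=None):
    """Sketch of the bigram topic model's predictive distributions (16)-(18).
    N_ijk[k, j, i]: times word i follows word j under topic k.
    N_kd[d, k]: times topic k appears in document d.
    With beta_k/n_k left as None we use Prior 1 (global beta*n, uniform n here);
    passing per-topic beta_k (shape K) and n_k (shape K x W) gives Prior 2."""
    K, W, _ = N_ijk.shape
    N_jk = N_ijk.sum(axis=2, keepdims=True)                  # N_{j,k}
    if beta_k is None:                                       # Prior 1, eq. (16)
        n = np.full(W, 1.0 / W)                              # uniform n (assumption)
        p_word = (N_ijk + beta * n) / (N_jk + beta)
    else:                                                    # Prior 2, eq. (17)
        p_word = (N_ijk + beta_k[:, None, None] * n_k[:, None, :]) / (N_jk + beta_k[:, None, None])
    m = np.full(K, 1.0 / K)                                  # uniform m (assumption)
    N_d = N_kd.sum(axis=1, keepdims=True)
    p_topic = (N_kd + alpha * m) / (N_d + alpha)             # eq. (18)
    return p_word, p_topic

# Toy example: 2 topics, 3-word vocabulary, 2 documents.
N_ijk = np.arange(2 * 3 * 3).reshape(2, 3, 3)
N_kd = np.array([[3, 2], [1, 4]])
pw, pt = btm_predictive(N_ijk, N_kd)
print(pw.sum(axis=2))                                        # each context row sums to 1
```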

Page 15: Topic Modelling: Beyond Bag of Words

Inference of Hyperparameters

• A Gibbs EM algorithm is employed to find the optimal hyperparameters α m and either β n (Prior 1) or {β_k n_k} (Prior 2).

• We summarize the EM algorithm below:

  E-step: draw samples of the latent topic assignments z by Gibbs sampling, conditioned on the current hyperparameters.

  M-step: maximize the evidence, (14) for Prior 1 or (15) for Prior 2, with respect to the hyperparameters using the sampled topic assignments.
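Below is a structural skeleton of such a Gibbs EM loop under Prior 1, offered as a sketch rather than the paper's exact procedure: the E-step resamples the topic assignments from their collapsed conditionals, and the M-step, which would maximize the evidence (14) with respect to the hyperparameters, is left as a commented placeholder.

```python
import numpy as np

def gibbs_em(corpus, K=3, W=None, alpha=1.0, beta=1.0, n_iter=20, seed=0):
    """Skeleton of Gibbs EM for the bigram topic model (sketch, Prior 1).
    The first word of each document is skipped since it has no context here."""
    rng = np.random.default_rng(seed)
    W = W or (max(w for doc in corpus for w in doc) + 1)
    D = len(corpus)
    n = np.full(W, 1.0 / W)                    # base measures assumed uniform
    m = np.full(K, 1.0 / K)
    # Random initial topic assignments and the corresponding counts.
    z = [rng.integers(K, size=len(doc)) for doc in corpus]
    N_ijk = np.zeros((K, W, W)); N_kd = np.zeros((D, K))
    for d, doc in enumerate(corpus):
        for t in range(1, len(doc)):
            N_ijk[z[d][t], doc[t - 1], doc[t]] += 1
            N_kd[d, z[d][t]] += 1
    for _ in range(n_iter):
        # E-step: Gibbs-sample each z_t given all other assignments.
        for d, doc in enumerate(corpus):
            for t in range(1, len(doc)):
                j, i, k_old = doc[t - 1], doc[t], z[d][t]
                N_ijk[k_old, j, i] -= 1; N_kd[d, k_old] -= 1
                N_jk = N_ijk[:, j, :].sum(axis=1)
                p = (N_ijk[:, j, i] + beta * n[i]) / (N_jk + beta) * (N_kd[d] + alpha * m)
                k_new = rng.choice(K, p=p / p.sum())
                z[d][t] = k_new
                N_ijk[k_new, j, i] += 1; N_kd[d, k_new] += 1
        # M-step (placeholder): re-estimate alpha*m and beta*n by maximizing the
        # evidence given the sampled z; the actual update is omitted in this sketch.
    return z, N_ijk, N_kd

docs = [[0, 1, 2, 1, 3], [3, 2, 1, 0, 2], [1, 1, 2, 3, 0]]
z, *_ = gibbs_em(docs, K=2, W=4)
print(z[0])
```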

Page 16: Topic Modelling: Beyond Bag of Words

Results

• Two datasets of 150 documents each were used:

– 150 random abstracts from the Psychological Abstracts dataset (100 training, 50 test), with a 1374-word dictionary.

– 150 random postings from the 20 Newsgroups dataset (100 training, 50 test), with a 2281-word dictionary.

• All words occurring only once were removed, along with punctuation and numbers.
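A minimal sketch of this preprocessing, assuming a simple regex tokenizer (the original tokenization details are not given):

```python
import re
from collections import Counter

def preprocess(raw_docs):
    """Strip punctuation and numbers, lowercase, and remove words that occur
    only once in the whole collection. The regex tokenizer is an assumption."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in raw_docs]
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if counts[w] > 1] for doc in tokenized]

docs = ["The chair department offers 12 couches today!",
        "The department chair couches offers."]
print(preprocess(docs))   # "today" and the number are dropped
```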

Page 17: Topic Modelling: Beyond Bag of Words

Results

Plot of information rate (bits per word) as a function of the number of topics, with results from the Psychological Abstracts dataset on the left and the 20 Newsgroups dataset on the right.

• Information rate is computed as

  R = − (1 / N_test) log₂ P(w_test | w_train)    (bits per word)

  where N_test is the number of words in the test corpus.
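A small sketch of this computation, assuming the model's per-word predictive probabilities on the test set are already available:

```python
import numpy as np

def information_rate(word_probs):
    """Information rate in bits per word:
    R = -(1/N) * sum_t log2 P(w_t | model), where word_probs holds the
    model's predictive probability of each test-set word."""
    word_probs = np.asarray(word_probs, dtype=float)
    return -np.log2(word_probs).mean()

# Example: a model that assigns these probabilities to 5 test words.
print(information_rate([0.1, 0.05, 0.2, 0.01, 0.08]))   # bits per word
```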

Page 18: Topic Modelling: Beyond Bag of Words

Future Work / Conclusion

• Another possible prior over Φ would be similar to Prior 2, but it would impose sharing of hyperparameters among contexts, that is, among all word pairs which share the same leading word.

• It is not entirely clear if this approach would result in any improvement. Further, the computational complexity of this approach is much greater than using prior 2.

• The bigram topic model shows improved performance compared to both the bigram language model and LDA, and is an encouraging direction of research.

• It is much more feasible to consider word level models when word order is not ignored.