Interactive, Multilingual Topic Models
Jordan Boyd-Graber, 2020

Transcript of slides: users.umiacs.umd.edu/~jbg/temp/2020_tm.pdf

  • Interactive, Multilingual Topic Models

    Jordan Boyd-Graber, 2020


  • The Challenge of Big Data

    Every second . . .

    - 600 new blog posts appear

    - 34,000 tweets are tweeted

    - 30 GB of data uploaded to Facebook

    Unstructured

    No XML, no semantic web, no annotation. Often just raw text.

    Common task: what's going on in this dataset.

    - Intelligence analysts

    - Brand monitoring

    - Journalists

    - Humanists

    Common solution: topic models

  • Topic Models as a Black Box

    From an input corpus and number of topics K → words to topics

    Forget the Bootleg, Just Download the Movie Legally
    Multiplex Heralded As Linchpin To Growth
    The Shape of Cinema, Transformed At the Click of a Mouse
    A Peaceful Crew Puts Muppets Where Its Mouth Is
    Stock Trades: A Better Deal For Investors Isn't Simple
    The three big Internet portals begin to distinguish among themselves as shopping malls
    Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens

    Corpus

  • Topic Models as a Black Box

    From an input corpus and number of topics K → words to topics

    TOPIC 1: computer, technology, system, service, site, phone, internet, machine

    TOPIC 2: play, film, movie, theater, production, star, director, stage

    TOPIC 3: sell, sale, store, product, business, advertising, market, consumer
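    As a concrete illustration of this black-box view, here is a minimal scikit-learn sketch (an illustrative stand-in, not the system used in the talk) that fits LDA to a few of the example headlines and prints the top words per topic; the tiny corpus and K = 3 are only for demonstration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "forget the bootleg, just download the movie legally",
        "multiplex heralded as linchpin to growth",
        "stock trades: a better deal for investors isn't simple",
        "internet portals begin to distinguish among themselves as shopping malls",
    ]
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(docs)                  # document-word count matrix
    lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
    vocab = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):            # unnormalized topic-word weights
        top = weights.argsort()[::-1][:5]
        print(f"TOPIC {k + 1}:", ", ".join(vocab[i] for i in top))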

  • Topic Models as a Black Box

    From an input corpus and number of topics K → words to topics

    Forget the Bootleg, Just Download the Movie Legally
    Multiplex Heralded As Linchpin To Growth
    The Shape of Cinema, Transformed At the Click of a Mouse
    A Peaceful Crew Puts Muppets Where Its Mouth Is
    Stock Trades: A Better Deal For Investors Isn't Simple
    The three big Internet portals begin to distinguish among themselves as shopping malls
    Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens

    [Figure: each headline is linked to its dominant topic (TOPIC 1, TOPIC 2, or TOPIC 3).]

  • Word Intrusion

    1. Take the highest probability words from a topic

    Original Topic

    dog, cat, horse, pig, cow

    2. Take a high-probability word from another topic and add it

    Topic with Intruder

    dog, cat, apple, horse, pig, cow

    3. We ask users to find the word that doesn’t belong

    Hypothesis

    If the topics are interpretable, users will consistently choose the true intruder
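    A minimal sketch of how such an intrusion item could be assembled; the function name and word lists below are hypothetical, not taken from the study.

    import random

    def intrusion_question(topic_words, other_topic_words, n=5, seed=0):
        """Top-n words of one topic plus one high-probability word from another
        topic, shuffled so annotators cannot spot the intruder by position."""
        intruder = next(w for w in other_topic_words if w not in topic_words)
        choices = list(topic_words[:n]) + [intruder]
        random.Random(seed).shuffle(choices)
        return choices, intruder

    choices, intruder = intrusion_question(
        ["dog", "cat", "horse", "pig", "cow"],      # original topic
        ["apple", "orange", "pear"])                # high-probability words from another topic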


  • Word Intrusion: Which Topics are Interpretable?

    New York Times, 50 Topics

    [Figure: histogram of model precision across the 50 topics. Example topics shown: "committee, legislation, proposal, republican, taxis"; "fireplace, garage, house, kitchen, list"; "americans, japanese, jewish, states, terrorist"; "artist, exhibition, gallery, museum, painting".]

    Model Precision: percentage of correct intruders found

  • Interpretability and Likelihood

    [Figure: model precision (top) and topic log odds (bottom) plotted against predictive held-out log likelihood (higher is better) on New York Times and Wikipedia corpora, for CTM, LDA, and pLSI with 50, 100, and 150 topics.]

    Within a model, higher likelihood ≠ higher interpretability.

  • Interactive Topic Modeling

    Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith. Interactive Topic Modeling. Machine Learning, 2014.

  • Topic Before

    1: election, yeltsin, russian, political, party, democratic, russia, president, democracy, boris, country, south, years, month, government, vote, since, leader, presidential, military
    2: new, york, city, state, mayor, budget, giuliani, council, cuomo, gov, plan, year, rudolph, dinkins, lead, need, governor, legislature, pataki, david
    3: nuclear, arms, weapon, defense, treaty, missile, world, unite, yet, soviet, lead, secretary, would, control, korea, intelligence, test, nation, country, testing
    4: president, bush, administration, clinton, american, force, reagan, war, unite, lead, economic, iraq, congress, america, iraqi, policy, aid, international, military, see
    ...
    20: soviet, lead, gorbachev, union, west, mikhail, reform, change, europe, leaders, poland, communist, know, old, right, human, washington, western, bring, party

    Suggestion

    boris, communist, gorbachev, mikhail, russia, russian, soviet, union, yeltsin

    Topic After

    1: election, democratic, south, country, president, party, africa, lead, even, democracy, leader, presidential, week, politics, minister, percent, voter, last, month, years
    2: new, york, city, state, mayor, budget, council, giuliani, gov, cuomo, year, rudolph, dinkins, legislature, plan, david, governor, pataki, need, cut
    3: nuclear, arms, weapon, treaty, defense, war, missile, may, come, test, american, world, would, need, lead, get, join, yet, clinton, nation
    4: president, administration, bush, clinton, war, unite, force, reagan, american, america, make, nation, military, iraq, iraqi, troops, international, country, yesterday, plan
    ...
    20: soviet, union, economic, reform, yeltsin, russian, lead, russia, gorbachev, leaders, west, president, boris, moscow, europe, poland, mikhail, communist, power, relations

  • Example: Negative Constraint

    Topic 318 (before): bladder, sci, spinal cord, spinal cord injury, spinal, urinary, urinary tract, urothelial, injury, motor, recovery, reflex, cervical, urothelium, functional recovery

    Negative Constraint: spinal cord, bladder

    Topic 318 (after): sci, spinal cord, spinal cord injury, spinal, injury, recovery, motor, reflex, urothelial, injured, functional recovery, plasticity, locomotor, cervical, locomotion

  • Multilingual Anchoring: Interactive Topic Modeling and Alignment Across Languages

    Michelle Yuan, Benjamin Van Durme, and Jordan Boyd-Graber. Neural Information Processing Systems, 2018.

  • (Source: National Geographic)

    - Large text collections often require quick topic triage in low-resource settings (e.g. natural disaster, political instability).

    - Analysts need to examine multilingual text collections, but analysts are scarce for one or more of the languages.

  • Generative Approaches

    - Polylingual Topic Model [Mimno et al. 2009]

    - JointLDA [Jagarlamudi and Daumé 2010]

    - Polylingual Tree-based Topic Model [Hu et al. 2014b]

    - mcta [Shi et al. 2016]

    These methods are slow, assume extensive knowledge about languages, and preclude human refinement.


  • Coral reefs have been damaged by sources of pollution, such as coastal development, deforestation, and agriculture. Destruction of coral reefs could impact food supply, protection, and income …

    farming, livestock, environment,


  • Anchor words

    Definition

    An anchor word is a word that appears with high probability in one topic but with low probability in all other topics.

  • From Co-occurrence to Topics

    - Normally, we want to find p(word | topic) [Blei et al. 2003b].

    - Instead, what if we can easily find p(word | topic) using anchor words and the conditional word co-occurrence p(word2 | word1)?

  • From Co-occurrence to Topics

    Q̄_{i,j} = p(w2 = j | w1 = i)

    [Figure: rows of Q̄ plotted in the simplex; the row for "liner" is a convex combination of the rows for the anchor words carburetor, concealer, and album, with coefficients C1, C2, C3.]

    Q̄_liner ≈ C1 Q̄_carburetor + C2 Q̄_concealer + C3 Q̄_album
            = 0.4 * [0.3, ..., 0.1] + 0.2 * [0.1, ..., 0.2] + 0.4 * [0.1, ..., 0.4]


  • Anchoring

    - If an anchor word appears in a document, then its corresponding topic is among the set of topics used to generate the document [Arora et al. 2012].

    - The anchoring algorithm uses word co-occurrence to find anchors and gradient-based inference to recover the topic-word distribution [Arora et al. 2013].

    - Runtime is fast because the algorithm scales with the number of unique word types, rather than the number of documents or tokens.

  • Anchoring

    1. Construct the co-occurrence matrix from documents with vocabulary of size V:

       Q̄_{i,j} = p(w2 = j | w1 = i).

    2. Given anchor words s_1, ..., s_K, approximate each co-occurrence distribution as a convex combination of the anchor rows:

       Q̄_i ≈ Σ_{k=1}^{K} C_{i,k} Q̄_{s_k}   subject to   Σ_{k=1}^{K} C_{i,k} = 1 and C_{i,k} ≥ 0.

    3. Find the topic-word matrix:

       A_{i,k} = p(w = i | z = k) ∝ p(z = k | w = i) p(w = i) = C_{i,k} Σ_{j=1}^{V} Q̄_{i,j}.
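    The NumPy sketch below mirrors steps 1-3 under the assumption that the corpus arrives as a document-word count matrix; the exponentiated-gradient loop is a simple stand-in for the constrained solver of Arora et al. [2013], and all names and step sizes are illustrative.

    import numpy as np

    def build_cooccurrence(doc_word_counts):
        """Step 1: return (Q_bar, p_w), the row-normalized co-occurrence matrix
        Q_bar[i, j] = p(w2 = j | w1 = i) and the word marginal p_w[i] = p(w1 = i)."""
        H = np.asarray(doc_word_counts, dtype=float)
        Q = np.zeros((H.shape[1], H.shape[1]))
        for h in H:
            n = h.sum()
            if n >= 2:                                  # single-token documents carry no pairs
                Q += (np.outer(h, h) - np.diag(h)) / (n * (n - 1))
        Q /= H.shape[0]                                 # joint distribution over word pairs
        p_w = Q.sum(axis=1)
        Q_bar = np.divide(Q, p_w[:, None], out=np.zeros_like(Q), where=p_w[:, None] > 0)
        return Q_bar, p_w

    def recover_topics(Q_bar, p_w, anchors, iters=500, lr=1.0):
        """Steps 2-3: write each row of Q_bar as a convex combination of the anchor
        rows (exponentiated gradient keeps C on the simplex), then rescale by the
        word marginal to obtain A[i, k] = p(w = i | z = k)."""
        Q_anchor = Q_bar[anchors]                       # (K, V) anchor co-occurrence rows
        C = np.full((Q_bar.shape[0], len(anchors)), 1.0 / len(anchors))
        for _ in range(iters):
            grad = 2.0 * (C @ Q_anchor - Q_bar) @ Q_anchor.T
            C *= np.exp(-lr * grad)                     # multiplicative update keeps C >= 0
            C /= C.sum(axis=1, keepdims=True)           # renormalize onto the simplex
        A = C * p_w[:, None]                            # p(w | z) ∝ p(z | w) p(w)
        return A / A.sum(axis=0, keepdims=True)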


  • Finding Anchor Words

    - So far, we assume that anchor words are given.

    - How do we find anchor words from documents?

  • Finding Anchor Words

    [Figure: words plotted in co-occurrence space. Concealer, carburetor, and album sit at the vertices of the convex hull; words such as forest and bourgeoisie sit inside it.]

    Anchor words are the vertices of the co-occurrence convex hull.
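    One way to approximate those vertices is a greedy farthest-point search in the spirit of Arora et al. [2013]; the sketch below is a simplified, unoptimized version of that idea, not the exact published routine.

    import numpy as np

    def find_anchors(Q_bar, k):
        """Greedily pick k rows of Q_bar that approximately span its convex hull:
        repeatedly choose the row farthest from the span of the rows chosen so far."""
        rows = np.asarray(Q_bar, dtype=float).copy()
        anchors = [int(np.argmax((rows * rows).sum(axis=1)))]   # farthest row from the origin
        for _ in range(k - 1):
            basis = rows[anchors[-1]]
            basis = basis / np.linalg.norm(basis)
            rows -= np.outer(rows @ basis, basis)               # project out the newest direction
            anchors.append(int(np.argmax((rows * rows).sum(axis=1))))
        return anchors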


  • Issues with Topic Models

    Topics

    music concert singer voice chorus songs album
    singer pop songs music album chorale jazz
    cosmetics makeup eyeliner lipstick foundation primer eyeshadow

    Duplicate topics.


  • Issues with Topic Models

    Topics

    music band art history literature books earth
    bts taehyung idol kpop jin jungkook jimin

    Ambiguous topics. Overly-specific topics.


  • Interactive Anchoring

    - Incorporating interactivity into topic modeling has been shown to improve the quality of the model [Hu et al. 2014a].

    - The anchoring algorithm offers the speed needed for interactive work, but single anchors are unintuitive to users.

    - Ankura is an interactive topic modeling system that allows users to choose multiple anchors for each topic [Lund et al. 2017].

    - After receiving human feedback, Ankura only takes a few seconds to update the topic model.

    These methods only work for monolingual document collections.


  • Linking Words

    Definition

    A language L is a set of word types w.

    Definition

    A bilingual dictionary B is a subset of the Cartesian product L(1) × L(2), where L(1) and L(2) are two different languages.

    Idea: If dictionary B contains an entry (w, v), create a link between w and v.


  • Finding Multilingual Anchors

    [Figure: English and Chinese words plotted in their respective co-occurrence spaces, with dictionary links between translations; the words shown include concealer, forest, carburetor, cosmopolitan, album, and bourgeoisie.]

    Bourgeoisie does not have a Chinese translation, so it cannot be picked as an anchor word even if it is the farthest word from the convex hull.

    Forest and its translation are not the farthest points from their respective convex hulls, but neither is too close. So, they are chosen as the next anchor words.

  • Multilingual Anchoring

    1. Given a dictionary, create links between words that are translations of each other.

    2. Select an anchor word for each language such that the words are linked and the span of the anchor words is maximized.

    3. Once anchor words are found, separately find topic-word distributions for each language.
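    A rough sketch of the anchor-selection step, assuming a row-normalized co-occurrence matrix per language and a dictionary given as (index in language 1, index in language 2) pairs; scoring a linked pair by the smaller of its two residual distances is one plausible way to balance the languages and is not necessarily the paper's exact criterion.

    import numpy as np

    def residual_norms(Q_bar, chosen):
        """Squared distance of every row of Q_bar from the span of the chosen rows."""
        R = np.asarray(Q_bar, dtype=float).copy()
        for a in chosen:                                # Gram-Schmidt over the chosen anchor rows
            b = R[a]
            norm = np.linalg.norm(b)
            if norm > 0:
                b = b / norm
                R -= np.outer(R @ b, b)
        return (R * R).sum(axis=1)

    def multilingual_anchors(Q1, Q2, links, k):
        """Pick k linked anchor pairs: each step takes the dictionary entry (i, j)
        whose rows are jointly farthest from the anchors chosen so far."""
        anchors1, anchors2 = [], []
        for _ in range(k):
            r1, r2 = residual_norms(Q1, anchors1), residual_norms(Q2, anchors2)
            i, j = max(links, key=lambda e: min(r1[e[0]], r2[e[1]]))
            anchors1.append(i)
            anchors2.append(j)
        return anchors1, anchors2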

  • - What if dictionary entries are scarce or inaccurate?

    - What if topics aren't aligned properly across languages?

    Incorporate human-in-the-loop topic modeling tools.


  • MTAnchor

  • Experiments

    Datasets:

    1. Wikipedia articles (en, zh)

    2. Amazon reviews (en, zh)

    3. lorelei documents (en, si)

  • Experiments

    Metrics:

    1. Classification accuracy
       - Intra-lingual: train the topic model on documents in one language and test on other documents in the same language.
       - Cross-lingual: train the topic model on documents in one language and test on other documents in a different language.

    2. Topic coherence [Lau et al. 2014]
       - Intrinsic: use the training documents as the reference corpus to measure local interpretability.
       - Extrinsic: use a large dataset (i.e. all of Wikipedia) as the reference corpus to measure global interpretability.
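    For concreteness, here is a minimal sketch of the kind of coherence score used by Lau et al. [2014]: average NPMI over topic word pairs, with probabilities estimated as document frequencies in a reference corpus (the published evaluation uses sliding windows and other details omitted here).

    import math
    from itertools import combinations

    def npmi_coherence(topic_words, reference_docs, eps=1e-12):
        """Average normalized PMI over all pairs of topic words, where probabilities
        are document frequencies in the reference corpus."""
        docs = [set(doc) for doc in reference_docs]
        n = len(docs)
        def p(*words):
            return sum(all(w in d for w in words) for d in docs) / n
        scores = []
        for w1, w2 in combinations(topic_words, 2):
            p1, p2, p12 = p(w1), p(w2), p(w1, w2)
            if p12 == 0:
                scores.append(-1.0)                    # never co-occur: minimum NPMI
            elif p12 == 1:
                scores.append(1.0)                     # every document contains both words
            else:
                pmi = math.log(p12 / (p1 * p2 + eps))
                scores.append(pmi / -math.log(p12))
        return sum(scores) / len(scores)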

  • Comparing Models

    Classification accuracy (i = intra-lingual test, c = cross-lingual test; the second language is zh for Wikipedia and Amazon, si for lorelei)

    Dataset     Method                    en-i     zh-i/si-i   en-c     zh-c/si-c
    Wikipedia   Multilingual anchoring    69.5%    71.2%       50.4%    47.8%
                MTAnchor (maximum)        80.7%    75.3%       57.6%    54.5%
                MTAnchor (median)         69.5%    71.4%       50.3%    47.2%
                mcta                      51.6%    33.4%       23.2%    39.8%
    Amazon      Multilingual anchoring    59.8%    61.1%       51.7%    53.2%
                mcta                      49.5%    50.6%       50.3%    49.5%
    lorelei     Multilingual anchoring    20.8%    32.7%       24.5%    24.7%
                mcta                      13.0%    26.5%       4.1%     15.6%

  • Comparing Models

    Topic coherence (i = intrinsic, e = extrinsic; the second language is zh for Wikipedia and Amazon, si for lorelei)

    Dataset     Method                    en-i     zh-i/si-i   en-e     zh-e/si-e
    Wikipedia   Multilingual anchoring    0.14     0.18        0.08     0.13
                MTAnchor (maximum)        0.20     0.20        0.10     0.15
                MTAnchor (median)         0.14     0.18        0.08     0.13
                mcta                      0.13     0.09        0.00     0.04
    Amazon      Multilingual anchoring    0.07     0.06        0.03     0.05
                mcta                     -0.03     0.02        0.02     0.01
    lorelei     Multilingual anchoring    0.08     0.00        0.03     n/a
                mcta                      0.13     0.00        0.04     n/a

  • Multilingual Anchoring Is Much Faster

    [Figure: F1 score vs. wall-clock time (0-300 minutes) for mcta, MTAnchor (max), and multilingual anchoring, with panels for training on Chinese or English and testing on Chinese or English.]

  • Improving Topics Through Interactivity

    [Figure: per-user F1 score over roughly 20 minutes of interaction, with panels for training on Chinese or English and evaluating on the Chinese and English development sets.]

  • Comparing Topics

    Dataset     Method                    Topic
    Wikipedia   mcta                      dog san movie mexican fighter novel california
                                          主演 改編 本 小說 拍攝 角色 戰士
                Multilingual anchoring    adventure daughter bob kong hong robert movie
                                          主演 改編 本片 飾演 冒險 講述 編劇
                MTAnchor                  kong hong movie office martial box reception
                                          主演 改編 飾演 本片 演員 編劇 講述
    Amazon      mcta                      woman food eat person baby god chapter
                                          來貨 頂頂 水 耳機 貨物 張傑 傑 同樣
                Multilingual anchoring    eat diet food recipe healthy lose weight
                                          健康 幫 吃 身體 全面 同事 中醫
    lorelei     mcta                      help need flood relief please families needed victim
                Multilingual anchoring    aranayake warning landslide site missing nbro areas

  • Why Not Use Deep Learning?

    - Neural networks are data-hungry and unsuitable for low-resource languages

    - Deep learning models take a long time to train

    - Pathologies of neural models make interpretation difficult [Feng et al. 2018]

  • Summary

    - Anchoring algorithm can be applied in multilingual settings.

    - People can provide helpful linguistic or cultural knowledge to construct better multilingual topic models.

  • Future Work

    - Apply human-in-the-loop algorithms to other tasks in NLP.

    - Better understand the effect of human feedback on cross-lingual representation learning.

  • ALTO: Active Learning with Topic Overviews for Speeding Label Induction and Document Labeling

    Forough Poursabzi-Sangdeh, Jordan Boyd-Graber, Leah Findlater, and Kevin Seppi. Association for Computational Linguistics, 2016.

  • Many Documents

  • Sort into Categories

  • Evaluation

    - User study

      - 40 minutes
      - Sort documents into categories
      - What information / interface helps best

    - Train a classifier on human examples (don't tell them how many labels)

    - Compare classifier labels to expert judgements (purity)

      purity(U, G) = (1/N) Σ_l max_j |U_l ∩ G_j|   (1)
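    A small sketch of equation (1), assuming the user and gold labels are given as parallel lists over the same documents:

    from collections import Counter

    def purity(user_labels, gold_labels):
        """For each user-created cluster U_l, count its overlap with the best-matching
        gold category G_j, then divide the total overlap by the number of documents N."""
        clusters = {}
        for user_label, gold_label in zip(user_labels, gold_labels):
            clusters.setdefault(user_label, []).append(gold_label)
        matched = sum(max(Counter(golds).values()) for golds in clusters.values())
        return matched / len(user_labels)

    # e.g. purity(["a", "a", "b", "b"], ["news", "news", "news", "sports"]) == 0.75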

  • Which is more Useful?

    Who should drive?

  • Which is more Useful?

    Active Learning / Topic Models

  • Direct users to document

  • ● ●

    ●●

    ● ● ●

    0.2

    0.3

    0.4

    0.5

    10 20 30 40Elapsed Time (min)

    Med

    ian

    (ove

    r pa

    rtic

    ipan

    ts)

    condition● No Organization with Active Learning

    Active learning if time is short

  • ● ●

    ●●

    ● ● ●

    0.2

    0.3

    0.4

    0.5

    10 20 30 40Elapsed Time (min)

    Med

    ian

    (ove

    r pa

    rtic

    ipan

    ts)

    condition● No Organization with Active Learning

    No Organization, no Active Learning

    Better than status quo

  • ● ●

    ●●

    ● ● ●

    0.2

    0.3

    0.4

    0.5

    10 20 30 40Elapsed Time (min)

    Med

    ian

    (ove

    r pa

    rtic

    ipan

    ts)

    condition● No Organization with Active Learning

    No Organization, no Active Learning

    Topics with Active Learning

    Active learning can help topic models

  • ● ●

    ●●

    ● ● ●

    0.2

    0.3

    0.4

    0.5

    Purity

    10 20 30 40Elapsed Time (min)

    Med

    ian

    (ove

    r pa

    rtic

    ipan

    ts)

    condition● No Organization with Active Learning

    No Organization, no Active Learning

    Topics with Active Learning

    Topics, no Active Learning

    Topic models help users understand the collection

  • ● ●

    ●●

    ● ● ●

    0.2

    0.3

    0.4

    0.5

    Purity

    10 20 30 40Elapsed Time (min)

    Med

    ian

    (ove

    r pa

    rtic

    ipan

    ts)

    condition● No Organization with Active Learning

    No Organization, no Active Learning

    Topics with Active Learning

    Topics, no Active Learning

    Moral: machines and humans together (if you let them)

  • Ongoing and Future Work

    - Embedding interactivity in applications

    - Visualizations to measure machine learning explainability

    - Using morphology in infinite representations

    - Multilingual analysis


  • Thanks

    Collaborators

    Yuening Hu (UMD), Ke Zhai (UMD), Viet-An Nguyen (UMD), Dave Blei (Princeton), Jonathan Chang (Facebook), Philip Resnik (UMD), Christiane Fellbaum (Princeton), Jerry Zhu (Wisconsin), Sean Gerrish (Sift), Chong Wang (CMU), Dan Osherson (Princeton), Sinead Williamson (CMU)

    Funders

  • References

    Alvin Grissom II, He He, Jordan Boyd-Graber, and John Morgan. 2014. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. In Empirical Methods in Natural Language Processing.

    Sanjeev Arora, Rong Ge, and Ankur Moitra. 2012. Learning topic models - going beyond SVD. In Foundations of Computer Science (FOCS).

    Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. 2013. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the International Conference of Machine Learning.

    David M. Blei, Andrew Ng, and Michael Jordan. 2003a. Latent Dirichlet allocation. Machine Learning Journal, 3.

    David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003b. Latent Dirichlet allocation. Machine Learning Journal.

    Jordan L. Boyd-Graber, Sonya S. Nikolova, Karyn A. Moffatt, Kenrick C. Kin, Joshua Y. Lee, Lester W. Mackey, Marilyn M. Tremaine, and Maria M. Klawe. 2006. Participatory design with proxies: Developing a desktop-PDA system to support people with aphasia. In International Conference on Human Factors in Computing Systems.

    Shi Feng, Eric Wallace, Alvin Grissom II, Pedro Rodriguez, Mohit Iyyer, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretation difficult. In Proceedings of Empirical Methods in Natural Language Processing.

    He He, Jordan Boyd-Graber, and Hal Daumé III. 2016. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning.

    Thomas Hofmann. 1999. Probabilistic latent semantic analysis. In Proceedings of Uncertainty in Artificial Intelligence.

    Yuening Hu, Jordan Boyd-Graber, Hal Daumé III, and Z. Irene Ying. 2013. Binary to bushy: Bayesian hierarchical clustering with the beta coalescent. In Neural Information Processing Systems.

    Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith. 2014a. Interactive topic modeling. Machine Learning.

    Yuening Hu, Ke Zhai, Vlad Eidelman, and Jordan Boyd-Graber. 2014b. Polylingual tree-based topic models for translation domain adaptation. In Proceedings of the Association for Computational Linguistics.

    Yuening Hu, Ke Zhai, Vlad Eidelman, and Jordan Boyd-Graber. 2014c. Polylingual tree-based topic models for translation domain adaptation. In Association for Computational Linguistics.

    Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Association for Computational Linguistics.

    Jagadeesh Jagarlamudi and Hal Daumé. 2010. Extracting multilingual topics from unaligned comparable corpora. In Proceedings of the European Conference on Information Retrieval.

    T. Landauer and S. Dumais. 1997. Solutions to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, (104).

    Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the European Chapter of the Association for Computational Linguistics.

    Jeffrey Lund, Connor Cook, Kevin Seppi, and Jordan Boyd-Graber. 2017. Tandem anchoring: A multiword anchor approach for interactive topic modeling. In Proceedings of the Association for Computational Linguistics.

    Xiaojuan Ma and Perry R. Cook. 2009. How well do visual verbs work in daily communication for young and old adults? In International Conference on Human Factors in Computing Systems, pages 361-364.

    David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of Empirical Methods in Natural Language Processing.

    Viet-An Nguyen, Jordan Boyd-Graber, and Stephen Altschul. 2013. Dirichlet mixtures, the Dirichlet process, and the structure of protein space. Journal of Computational Biology, 20(1).

    Sonya S. Nikolova, Jordan Boyd-Graber, Christiane Fellbaum, and Perry Cook. 2009. Better vocabularies for assistive communication aids: Connecting terms using semantic networks and untrained annotators. In ACM Conference on Computers and Accessibility.

    Bei Shi, Wai Lam, Lidong Bing, and Yinqing Xu. 2016. Detecting common discussion topics across culture from news reader comments. In Proceedings of the Association for Computational Linguistics.

  • Latent Dirichlet Allocation: A Generative Model

    - Focus in this talk: statistical methods
    - Model: story of how your data came to be
    - Latent variables: missing pieces of your story
    - Statistical inference: filling in those missing pieces

    - We use latent Dirichlet allocation (LDA) [Blei et al. 2003a], a fully Bayesian version of pLSI [Hofmann 1999], a probabilistic version of LSA [Landauer and Dumais 1997]


  • Latent Dirichlet Allocation: A Generative Model

    [Plate diagram: hyperparameter α → θ_d → z_n → w_n ← β_k ← λ, with n = 1..N word positions nested inside d = 1..M documents and k = 1..K topics.]

    - For each topic k ∈ {1, ..., K}, draw a multinomial distribution β_k from a Dirichlet distribution with parameter λ

    - For each document d ∈ {1, ..., M}, draw a multinomial distribution θ_d from a Dirichlet distribution with parameter α

    - For each word position n ∈ {1, ..., N}, select a hidden topic z_n from the multinomial distribution parameterized by θ.

    - Choose the observed word w_n from the distribution β_{z_n}.

    We use statistical inference to uncover the most likely unobserved variables given observed data.
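    The generative story can be simulated directly; the NumPy sketch below samples a small synthetic corpus, with all sizes and hyperparameter values chosen only for illustration.

    import numpy as np

    def generate_lda_corpus(M=100, N=50, K=3, V=1000, alpha=0.1, lam=0.01, seed=0):
        """Sample topics, per-document proportions, and documents from the LDA story."""
        rng = np.random.default_rng(seed)
        beta = rng.dirichlet([lam] * V, size=K)        # topic-word distributions beta_k
        theta = rng.dirichlet([alpha] * K, size=M)     # document-topic proportions theta_d
        docs = []
        for d in range(M):
            z = rng.choice(K, size=N, p=theta[d])      # hidden topic for each word position
            docs.append([int(rng.choice(V, p=beta[k])) for k in z])
        return beta, theta, docs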

  • [Figure: the LDA plate diagram applied to the example corpus. Three topics emerge: "TECHNOLOGY" (computer, technology, system, service, site, phone, internet, machine), "ENTERTAINMENT" (play, film, movie, theater, production, star, director, stage), and "BUSINESS" (sell, sale, store, product, business, advertising, market, consumer), and each headline is linked to its dominant topic.]

  • Inference

    - We are interested in the posterior distribution

      p(Z | X, Θ)   (2)

    - Here, the latent variables are the topic assignments z and topics θ. X is the words (divided into documents), and Θ are the hyperparameters of the Dirichlet distributions: α for topic proportions, λ for topics.

      p(z, β, θ | w, α, λ)   (3)

      p(w, z, θ, β | α, λ) = Π_k p(β_k | λ) Π_d p(θ_d | α) Π_n p(z_{d,n} | θ_d) p(w_{d,n} | β_{z_{d,n}})


  • Gibbs Sampling

    - A form of Markov chain Monte Carlo

    - The chain is a sequence of random variable states

    - Given a state {z_1, ..., z_N}, and under certain technical conditions, repeatedly drawing z_k ∼ p(z_k | z_1, ..., z_{k−1}, z_{k+1}, ..., z_N, X, Θ) for all k results in a Markov chain whose stationary distribution is the posterior.

    - For notational convenience, call z with z_{d,n} removed z_{−d,n}

  • Inference

    Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's iTunes music store and other online services ...

    computer, technology, system, service, site, phone, internet, machine

    play, film, movie, theater, production, star, director, stage

    sell, sale, store, product, business, advertising, market, consumer


  • Gibbs Sampling

    - For LDA, we will sample the topic assignments

    - Thus, we want:

      p(z_{d,n} = k | z_{−d,n}, w, α, λ) = p(z_{d,n} = k, z_{−d,n} | w, α, λ) / p(z_{−d,n} | w, α, λ)

    - The topics and per-document topic proportions are integrated out / marginalized / collapsed

    - Let n_{d,i} be the number of words taking topic i in document d. Let v_{k,w} be the number of times word w is used in topic k.

      = [ ∫_{θ_d} ( Π_{i≠k} θ_{d,i}^{α_i + n_{d,i} − 1} ) θ_{d,k}^{α_k + n_{d,k}} dθ_d · ∫_{β_k} ( Π_{i≠w_{d,n}} β_{k,i}^{λ_i + v_{k,i} − 1} ) β_{k,w_{d,n}}^{λ_{w_{d,n}} + v_{k,w_{d,n}}} dβ_k ]
        / [ ∫_{θ_d} ( Π_i θ_{d,i}^{α_i + n_{d,i} − 1} ) dθ_d · ∫_{β_k} ( Π_i β_{k,i}^{λ_i + v_{k,i} − 1} ) dβ_k ]


  • Gibbs Sampling

    - For LDA, we will sample the topic assignments

    - The topics and per-document topic proportions are integrated out / marginalized / Rao-Blackwellized

    - Thus, we want:

      p(z_{d,n} = k | z_{−d,n}, w, α, λ) = (n_{d,k} + α_k) / (Σ_{i}^{K} n_{d,i} + α_i) · (v_{k,w_{d,n}} + λ_{w_{d,n}}) / (Σ_i v_{k,i} + λ_i)

  • Gibbs Sampling

    - The integral is the normalizer of a Dirichlet distribution:

      ∫_{β_k} ( Π_i β_{k,i}^{λ_i + v_{k,i} − 1} ) dβ_k = Π_{i}^{V} Γ(λ_i + v_{k,i}) / Γ( Σ_{i}^{V} λ_i + v_{k,i} )

    - So we can simplify the ratio of integrals above to a ratio of Gamma functions:

      = [ Γ(α_k + n_{d,k} + 1) Π_{i≠k}^{K} Γ(α_i + n_{d,i}) / Γ( Σ_{i}^{K} α_i + n_{d,i} + 1 ) ] / [ Π_{i}^{K} Γ(α_i + n_{d,i}) / Γ( Σ_{i}^{K} α_i + n_{d,i} ) ]
        · [ Γ(λ_{w_{d,n}} + v_{k,w_{d,n}} + 1) Π_{i≠w_{d,n}}^{V} Γ(λ_i + v_{k,i}) / Γ( Σ_{i}^{V} λ_i + v_{k,i} + 1 ) ] / [ Π_{i}^{V} Γ(λ_i + v_{k,i}) / Γ( Σ_{i}^{V} λ_i + v_{k,i} ) ]


  • Gamma Function Identity

    Γ(z + 1) / Γ(z) = z   (4)

    Applying this identity to each Gamma ratio above collapses the expression to

    (n_{d,k} + α_k) / (Σ_{i}^{K} n_{d,i} + α_i) · (v_{k,w_{d,n}} + λ_{w_{d,n}}) / (Σ_i v_{k,i} + λ_i)

  • Gibbs Sampling Equation

    (n_{d,k} + α_k) / (Σ_{i}^{K} n_{d,i} + α_i) · (v_{k,w_{d,n}} + λ_{w_{d,n}}) / (Σ_i v_{k,i} + λ_i)   (5)

    - n_{d,k}: number of times document d uses topic k

    - v_{k,w_{d,n}}: number of times topic k uses word type w_{d,n}

    - α_k: Dirichlet parameter for the document-to-topic distribution

    - λ_{w_{d,n}}: Dirichlet parameter for the topic-to-word distribution

    - First factor: how much this document likes topic k

    - Second factor: how much this topic likes word w_{d,n}


  • Sample Document


  • Randomly Assign Topics


  • Total Topic Counts

    Sampling Equation

    (n_{d,k} + α_k) / (Σ_{i}^{K} n_{d,i} + α_i) · (v_{k,w_{d,n}} + λ_{w_{d,n}}) / (Σ_i v_{k,i} + λ_i)


  • We want to sample this word . . .


  • Decrement its count

  • What is the conditional distribution for this topic?

  • Part 1: How much does this document like each topic?


  • Part 2: How much does each topic like the word?


  • Geometric interpretation


  • Update counts


  • Details: how to sample from a distribution

    [Figure: the unnormalized sampling weight (n_{d,k} + α_k) / (Σ_{i}^{K} n_{d,i} + α_i) · (v_{k,w_{d,n}} + λ_{w_{d,n}}) / (Σ_i v_{k,i} + λ_i) is computed for each of Topic 1 through Topic 5, the weights are normalized to sum to one, and a topic is drawn from the resulting discrete distribution.]


  • Naïve Implementation

    Algorithm

    1. For each iteration i:
       1.1 For each document d and word n currently assigned to z_old:
           1.1.1 Decrement n_{d,z_old} and v_{z_old, w_{d,n}}
           1.1.2 Sample z_new = k with probability proportional to (n_{d,k} + α_k) / (Σ_{i}^{K} n_{d,i} + α_i) · (v_{k,w_{d,n}} + λ_{w_{d,n}}) / (Σ_i v_{k,i} + λ_i)
           1.1.3 Increment n_{d,z_new} and v_{z_new, w_{d,n}}
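    Read literally, the naïve algorithm is only a few lines of NumPy; the sketch below is an unoptimized collapsed Gibbs sampler with symmetric priors (variable names follow the slides; everything else is illustrative).

    import numpy as np

    def gibbs_lda(docs, K, V, alpha=0.1, lam=0.01, iters=200, seed=0):
        """Collapsed Gibbs sampling for LDA. docs: list of lists of word ids in [0, V)."""
        rng = np.random.default_rng(seed)
        n_dk = np.zeros((len(docs), K))                # times document d uses topic k
        v_kw = np.zeros((K, V))                        # times topic k uses word type w
        v_k = np.zeros(K)                              # total tokens assigned to topic k
        z = [rng.integers(K, size=len(doc)) for doc in docs]     # random initialization
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                n_dk[d, z[d][n]] += 1; v_kw[z[d][n], w] += 1; v_k[z[d][n]] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for n, w in enumerate(doc):
                    k_old = z[d][n]                    # decrement counts for the current assignment
                    n_dk[d, k_old] -= 1; v_kw[k_old, w] -= 1; v_k[k_old] -= 1
                    p = (n_dk[d] + alpha) * (v_kw[:, w] + lam) / (v_k + lam * V)
                    k_new = int(rng.choice(K, p=p / p.sum()))
                    z[d][n] = k_new                    # increment counts for the new assignment
                    n_dk[d, k_new] += 1; v_kw[k_new, w] += 1; v_k[k_new] += 1
        return n_dk, v_kw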

  • Desiderata

    - Hyperparameters: Sample them too (slice sampling)

    - Initialization: Random

    - Sampling: Until likelihood converges

    - Lag / burn-in: Difference of opinion on this

    - Number of chains: Should do more than one

  • Available implementations

    - Mallet (http://mallet.cs.umass.edu)

    - LDAC (http://www.cs.princeton.edu/~blei/lda-c)

    - Topicmod (http://code.google.com/p/topicmod)

  • SHLDA Model

    [Figure: SHLDA plate diagram.]

  • Infvoc Classification Accuracy

    Table: Classification accuracy based on 50 topic features extracted from 20 newsgroups data (S = 155, τ_0 = 64, κ = 0.6).

    Method        Setting                    Accuracy
    infvoc        αβ = 3k, T = 40k, U = 10   52.683
    fixvoc        vb-dict                    45.514
    fixvoc        vb-null                    49.390
    fixvoc        hybrid-dict                46.720
    fixvoc        hybrid-null                50.474
    fixvoc-hash   vb-dict                    52.525
    fixvoc-hash   vb-full, T = 30k           51.653
    fixvoc-hash   hybrid-dict                50.948
    fixvoc-hash   hybrid-full, T = 30k       50.948
    dtm-dict      tcv = 0.001                62.845

    Topics learned with hashing are no longer interpretable, they can only be used as features.

  • Unassign(d, n, w_{d,n}, z_{d,n} = k)

    1: T: T_{d,k} ← T_{d,k} − 1
    2: If w_{d,n} ∉ Ω_old:
         P: P_{k,w_{d,n}} ← P_{k,w_{d,n}} − 1
    3: Else, suppose w_{d,n} ∈ Ω_old_m:
         P: P_{k,m} ← P_{k,m} − 1
         W: W_{k,m,w_{d,n}} ← W_{k,m,w_{d,n}} − 1

  • SparseLDA

    p(z = k | Z_−, w) ∝ (α_k + n_{k|d}) (β + n_{w|k}) / (βV + n_{·|k})   (6)

                     ∝ α_k β / (βV + n_{·|k})                      [s_LDA]
                       + n_{k|d} β / (βV + n_{·|k})                [r_LDA]
                       + (α_k + n_{k|d}) n_{w|k} / (βV + n_{·|k})  [q_LDA]
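    A rough NumPy sketch of drawing one topic with this bucket decomposition; a real SparseLDA implementation caches s and r across tokens and iterates only over nonzero counts rather than recomputing dense vectors as done here.

    import numpy as np

    def sparse_lda_draw(n_kd, n_wk, n_k, alpha, beta, V, rng):
        """Sample a topic for one token by splitting its conditional into the
        smoothing (s), document (r), and topic-word (q) buckets."""
        denom = beta * V + n_k                       # shared per-topic denominator
        s = alpha * beta / denom                     # smoothing-only bucket
        r = n_kd * beta / denom                      # document bucket (sparse in n_kd)
        q = (alpha + n_kd) * n_wk / denom            # topic-word bucket (sparse in n_wk)
        u = rng.random() * (s.sum() + r.sum() + q.sum())
        for bucket in (s, r, q):
            if u < bucket.sum() or bucket is q:      # q absorbs any leftover mass
                return int(rng.choice(len(bucket), p=bucket / bucket.sum()))
            u -= bucket.sum()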

  • Tree-based sampling

    p(z_{d,n} = k, l_{d,n} = λ | Z_−, L_−, w_{d,n})   (7)

      ∝ (α_k + n_{k|d}) Π_{(i→j)∈λ} (β_{i→j} + n_{i→j|k}) / Σ_{j′} (β_{i→j′} + n_{i→j′|k})

  • Factorizing Tree-Based Prior

    p(z = k | Z_−, w) ∝ (α_k + n_{k|d}) (β + n_{w|k}) / (βV + n_{·|k})   (8)

                     ∝ α_k β / (βV + n_{·|k})                      [s_LDA]
                       + n_{k|d} β / (βV + n_{·|k})                [r_LDA]
                       + (α_k + n_{k|d}) n_{w|k} / (βV + n_{·|k})  [q_LDA]

    s = Σ_{k,λ} [ α_k Π_{(i→j)∈λ} β_{i→j} ] / [ Π_{(i→j)∈λ} Σ_{j′} (β_{i→j′} + n_{i→j′|k}) ]

      ≤ Σ_{k,λ} [ α_k Π_{(i→j)∈λ} β_{i→j} ] / [ Π_{(i→j)∈λ} Σ_{j′} β_{i→j′} ]  =  s′.   (9)


  • for word w in this document do
        sample = rand() * (s′ + r + q)
        if sample < s′ then
            compute s
            sample = sample * (s + r + q) / (s′ + r + q)
            if sample < s then
                return topic k and path λ sampled from s
            end if
            sample -= s
        else
            sample -= s′
        end if
        if sample < r then
            return topic k and path λ sampled from r
        end if
        sample -= r
        return topic k and path λ sampled from q
    end for

  • Number of Topics
                    T50      T100     T200     T500
    Naïve           5.700    12.655   29.200   71.223
    Fast            4.935    9.222    17.559   40.691
    Fast-RB         2.937    4.037    5.880    8.551
    Fast-RB-sD      2.675    3.795    5.400    8.363
    Fast-RB-sW      2.449    3.363    4.894    7.404
    Fast-RB-sDW     2.225    3.241    4.672    7.424

    Number of Correlations
                    C50      C100     C200     C500
    Naïve           11.166   12.586   13.000   15.377
    Fast            8.889    9.165    9.177    8.079
    Fast-RB         3.995    4.078    3.858    3.156
    Fast-RB-sD      3.660    3.795    3.593    3.065
    Fast-RB-sW      3.272    3.363    3.308    2.787
    Fast-RB-sDW     3.026    3.241    3.091    2.627