Interactive, Multilingual Topic Models
Jordan Boyd-Graber, 2020

Transcript of slides: users.umiacs.umd.edu/~jbg/temp/2020_tm.pdf

  • Interactive, Multilingual Topic Models

    Jordan Boyd-Graber, 2020


  • The Challenge of Big Data

    Every second . . .

    - 600 new blog posts appear

    - 34,000 tweets are tweeted

    - 30 GB of data uploaded to Facebook

    Unstructured

    No XML, no semantic web, no annotation. Often just raw text.

    Common task: what's going on in this dataset.

    - Intelligence analysts

    - Brand monitoring

    - Journalists

    - Humanists

    Common solution: topic models

  • Topic Models as a Black Box

    From an input corpus and number of topics K → words to topics

    Forget the Bootleg, Just Download the Movie Legally
    Multiplex Heralded As Linchpin To Growth
    The Shape of Cinema, Transformed At the Click of a Mouse
    A Peaceful Crew Puts Muppets Where Its Mouth Is
    Stock Trades: A Better Deal For Investors Isn't Simple
    The three big Internet portals begin to distinguish among themselves as shopping malls
    Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens

    Corpus

  • Topic Models as a Black Box

    From an input corpus and number of topics K → words to topics

    TOPIC 1: computer, technology, system, service, site, phone, internet, machine

    TOPIC 2: play, film, movie, theater, production, star, director, stage

    TOPIC 3: sell, sale, store, product, business, advertising, market, consumer
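    As a concrete illustration of this black-box view, here is a minimal scikit-learn sketch (an illustrative stand-in, not the system used in the talk) that fits LDA to a few of the example headlines and prints the top words per topic; the tiny corpus and K = 3 are only for demonstration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "forget the bootleg, just download the movie legally",
        "multiplex heralded as linchpin to growth",
        "stock trades: a better deal for investors isn't simple",
        "internet portals begin to distinguish among themselves as shopping malls",
    ]
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(docs)                  # document-word count matrix
    lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(counts)
    vocab = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):            # unnormalized topic-word weights
        top = weights.argsort()[::-1][:5]
        print(f"TOPIC {k + 1}:", ", ".join(vocab[i] for i in top))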

  • Topic Models as a Black Box

    From an input corpus and number of topics K → words to topics

    Forget the Bootleg, Just Download the Movie Legally
    Multiplex Heralded As Linchpin To Growth
    The Shape of Cinema, Transformed At the Click of a Mouse
    A Peaceful Crew Puts Muppets Where Its Mouth Is
    Stock Trades: A Better Deal For Investors Isn't Simple
    The three big Internet portals begin to distinguish among themselves as shopping malls
    Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens

    [Figure: each headline is linked to its dominant topic (TOPIC 1, TOPIC 2, or TOPIC 3).]

  • Word Intrusion

    1. Take the highest probability words from a topic

    Original Topic

    dog, cat, horse, pig, cow

    2. Take a high-probability word from another topic and add it

    Topic with Intruder

    dog, cat, apple, horse, pig, cow

    3. We ask users to find the word that doesn’t belong

    Hypothesis

    If the topics are interpretable, users will consistently choose the true intruder
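    A minimal sketch of how such an intrusion item could be assembled; the function name and word lists below are hypothetical, not taken from the study.

    import random

    def intrusion_question(topic_words, other_topic_words, n=5, seed=0):
        """Top-n words of one topic plus one high-probability word from another
        topic, shuffled so annotators cannot spot the intruder by position."""
        intruder = next(w for w in other_topic_words if w not in topic_words)
        choices = list(topic_words[:n]) + [intruder]
        random.Random(seed).shuffle(choices)
        return choices, intruder

    choices, intruder = intrusion_question(
        ["dog", "cat", "horse", "pig", "cow"],      # original topic
        ["apple", "orange", "pear"])                # high-probability words from another topic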


  • Word Intrusion: Which Topics are Interpretable?

    New York Times, 50 Topics

    [Figure: histogram of model precision across the 50 topics. Example topics shown: "committee, legislation, proposal, republican, taxis"; "fireplace, garage, house, kitchen, list"; "americans, japanese, jewish, states, terrorist"; "artist, exhibition, gallery, museum, painting".]

    Model Precision: percentage of correct intruders found

  • Interpretability and Likelihood

    [Figure: model precision (top) and topic log odds (bottom) plotted against predictive held-out log likelihood (higher is better) on New York Times and Wikipedia corpora, for CTM, LDA, and pLSI with 50, 100, and 150 topics.]

    Within a model, higher likelihood ≠ higher interpretability.

  • Interactive Topic Modeling

    Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith. Interactive Topic Modeling. Machine Learning, 2014.

  • Topic Before

    1: election, yeltsin, russian, political, party, democratic, russia, president, democracy, boris, country, south, years, month, government, vote, since, leader, presidential, military
    2: new, york, city, state, mayor, budget, giuliani, council, cuomo, gov, plan, year, rudolph, dinkins, lead, need, governor, legislature, pataki, david
    3: nuclear, arms, weapon, defense, treaty, missile, world, unite, yet, soviet, lead, secretary, would, control, korea, intelligence, test, nation, country, testing
    4: president, bush, administration, clinton, american, force, reagan, war, unite, lead, economic, iraq, congress, america, iraqi, policy, aid, international, military, see
    ...
    20: soviet, lead, gorbachev, union, west, mikhail, reform, change, europe, leaders, poland, communist, know, old, right, human, washington, western, bring, party

    Suggestion

    boris, communist, gorbachev, mikhail, russia, russian, soviet, union, yeltsin

    Topic After

    1: election, democratic, south, country, president, party, africa, lead, even, democracy, leader, presidential, week, politics, minister, percent, voter, last, month, years
    2: new, york, city, state, mayor, budget, council, giuliani, gov, cuomo, year, rudolph, dinkins, legislature, plan, david, governor, pataki, need, cut
    3: nuclear, arms, weapon, treaty, defense, war, missile, may, come, test, american, world, would, need, lead, get, join, yet, clinton, nation
    4: president, administration, bush, clinton, war, unite, force, reagan, american, america, make, nation, military, iraq, iraqi, troops, international, country, yesterday, plan
    ...
    20: soviet, union, economic, reform, yeltsin, russian, lead, russia, gorbachev, leaders, west, president, boris, moscow, europe, poland, mikhail, communist, power, relations

  • Example: Negative Constraint

    Topic 318 (before): bladder, sci, spinal cord, spinal cord injury, spinal, urinary, urinary tract, urothelial, injury, motor, recovery, reflex, cervical, urothelium, functional recovery

    Negative Constraint: spinal cord, bladder

    Topic 318 (after): sci, spinal cord, spinal cord injury, spinal, injury, recovery, motor, reflex, urothelial, injured, functional recovery, plasticity, locomotor, cervical, locomotion

  • Multilingual Anchoring: Interactive Topic Modeling and Alignment Across Languages

    Michelle Yuan, Benjamin Van Durme, and Jordan Boyd-Graber. Neural Information Processing Systems, 2018.

  • (Source: National Geographic)

    - Large text collections often require quick topic triage in low-resource settings (e.g. natural disaster, political instability).

    - Analysts need to examine multilingual text collections, but analysts are scarce for one or more of the languages.

  • Generative Approaches

    - Polylingual Topic Model [Mimno et al. 2009]

    - JointLDA [Jagarlamudi and Daumé 2010]

    - Polylingual Tree-based Topic Model [Hu et al. 2014b]

    - mcta [Shi et al. 2016]

    These methods are slow, assume extensive knowledge about languages, and preclude human refinement.


  • Coral reefs have been damaged by sources of pollution, such as coastal development, deforestation, and agriculture. Destruction of coral reefs could impact food supply, protection, and income …

    farming, livestock, environment,


  • Anchor words

    Definition

    An anchor word is a word that appears with high probability in one topic but with low probability in all other topics.

  • From Co-occurrence to Topics

    - Normally, we want to find p(word | topic) [Blei et al. 2003b].

    - Instead, what if we can easily find p(word | topic) using anchor words and the conditional word co-occurrence p(word2 | word1)?

  • From Co-occurrence to Topics

    Q̄_{i,j} = p(w2 = j | w1 = i)

    [Figure: rows of Q̄ plotted in the simplex; the row for "liner" is a convex combination of the rows for the anchor words carburetor, concealer, and album, with coefficients C1, C2, C3.]

    Q̄_liner ≈ C1 Q̄_carburetor + C2 Q̄_concealer + C3 Q̄_album
            = 0.4 * [0.3, ..., 0.1] + 0.2 * [0.1, ..., 0.2] + 0.4 * [0.1, ..., 0.4]


  • Anchoring

    - If an anchor word appears in a document, then its corresponding topic is among the set of topics used to generate the document [Arora et al. 2012].

    - The anchoring algorithm uses word co-occurrence to find anchors and gradient-based inference to recover the topic-word distribution [Arora et al. 2013].

    - Runtime is fast because the algorithm scales with the number of unique word types, rather than the number of documents or tokens.

  • Anchoring

    1. Construct the co-occurrence matrix from documents with vocabulary of size V:

       Q̄_{i,j} = p(w2 = j | w1 = i).

    2. Given anchor words s_1, ..., s_K, approximate each co-occurrence distribution as a convex combination of the anchor rows:

       Q̄_i ≈ Σ_{k=1}^{K} C_{i,k} Q̄_{s_k}   subject to   Σ_{k=1}^{K} C_{i,k} = 1 and C_{i,k} ≥ 0.

    3. Find the topic-word matrix:

       A_{i,k} = p(w = i | z = k) ∝ p(z = k | w = i) p(w = i) = C_{i,k} Σ_{j=1}^{V} Q̄_{i,j}.
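    The NumPy sketch below mirrors steps 1-3 under the assumption that the corpus arrives as a document-word count matrix; the exponentiated-gradient loop is a simple stand-in for the constrained solver of Arora et al. [2013], and all names and step sizes are illustrative.

    import numpy as np

    def build_cooccurrence(doc_word_counts):
        """Step 1: return (Q_bar, p_w), the row-normalized co-occurrence matrix
        Q_bar[i, j] = p(w2 = j | w1 = i) and the word marginal p_w[i] = p(w1 = i)."""
        H = np.asarray(doc_word_counts, dtype=float)
        Q = np.zeros((H.shape[1], H.shape[1]))
        for h in H:
            n = h.sum()
            if n >= 2:                                  # single-token documents carry no pairs
                Q += (np.outer(h, h) - np.diag(h)) / (n * (n - 1))
        Q /= H.shape[0]                                 # joint distribution over word pairs
        p_w = Q.sum(axis=1)
        Q_bar = np.divide(Q, p_w[:, None], out=np.zeros_like(Q), where=p_w[:, None] > 0)
        return Q_bar, p_w

    def recover_topics(Q_bar, p_w, anchors, iters=500, lr=1.0):
        """Steps 2-3: write each row of Q_bar as a convex combination of the anchor
        rows (exponentiated gradient keeps C on the simplex), then rescale by the
        word marginal to obtain A[i, k] = p(w = i | z = k)."""
        Q_anchor = Q_bar[anchors]                       # (K, V) anchor co-occurrence rows
        C = np.full((Q_bar.shape[0], len(anchors)), 1.0 / len(anchors))
        for _ in range(iters):
            grad = 2.0 * (C @ Q_anchor - Q_bar) @ Q_anchor.T
            C *= np.exp(-lr * grad)                     # multiplicative update keeps C >= 0
            C /= C.sum(axis=1, keepdims=True)           # renormalize onto the simplex
        A = C * p_w[:, None]                            # p(w | z) ∝ p(z | w) p(w)
        return A / A.sum(axis=0, keepdims=True)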


  • Finding Anchor Words

    - So far, we assume that anchor words are given.

    - How do we find anchor words from documents?

  • Finding Anchor Words

    [Figure: words plotted in co-occurrence space. Concealer, carburetor, and album sit at the vertices of the convex hull; words such as forest and bourgeoisie sit inside it.]

    Anchor words are the vertices of the co-occurrence convex hull.
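    One way to approximate those vertices is a greedy farthest-point search in the spirit of Arora et al. [2013]; the sketch below is a simplified, unoptimized version of that idea, not the exact published routine.

    import numpy as np

    def find_anchors(Q_bar, k):
        """Greedily pick k rows of Q_bar that approximately span its convex hull:
        repeatedly choose the row farthest from the span of the rows chosen so far."""
        rows = np.asarray(Q_bar, dtype=float).copy()
        anchors = [int(np.argmax((rows * rows).sum(axis=1)))]   # farthest row from the origin
        for _ in range(k - 1):
            basis = rows[anchors[-1]]
            basis = basis / np.linalg.norm(basis)
            rows -= np.outer(rows @ basis, basis)               # project out the newest direction
            anchors.append(int(np.argmax((rows * rows).sum(axis=1))))
        return anchors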


  • Issues with Topic Models

    Topics

    music concert singer voice chorus songs album
    singer pop songs music album chorale jazz
    cosmetics makeup eyeliner lipstick foundation primer eyeshadow

    Duplicate topics.


  • Issues with Topic Models

    Topics

    music band art history literature books earth
    bts taehyung idol kpop jin jungkook jimin

    Ambiguous topics. Overly-specific topics.


  • Interactive Anchoring

    - Incorporating interactivity into topic modeling has been shown to improve the quality of the model [Hu et al. 2014a].

    - The anchoring algorithm offers the speed needed for interactive work, but single anchors are unintuitive to users.

    - Ankura is an interactive topic modeling system that allows users to choose multiple anchors for each topic [Lund et al. 2017].

    - After receiving human feedback, Ankura only takes a few seconds to update the topic model.

    These methods only work for monolingual document collections.


  • Linking Words

    Definition

    A language L is a set of word types w.

    Definition

    A bilingual dictionary B is a subset of the Cartesian product L(1) × L(2), where L(1) and L(2) are two different languages.

    Idea: If dictionary B contains an entry (w, v), create a link between w and v.


  • Finding Multilingual Anchors

    [Figure: English and Chinese words plotted in their respective co-occurrence spaces, with dictionary links between translations; the words shown include concealer, forest, carburetor, cosmopolitan, album, and bourgeoisie.]

    Bourgeoisie does not have a Chinese translation, so it cannot be picked as an anchor word even if it is the farthest word from the convex hull.

    Forest and its translation are not the farthest points from their respective convex hulls, but neither is too close. So, they are chosen as the next anchor words.

  • Multilingual Anchoring

    1. Given a dictionary, create links between words that are translations of each other.

    2. Select an anchor word for each language such that the words are linked and the span of the anchor words is maximized.

    3. Once anchor words are found, separately find topic-word distributions for each language.
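    A rough sketch of the anchor-selection step, assuming a row-normalized co-occurrence matrix per language and a dictionary given as (index in language 1, index in language 2) pairs; scoring a linked pair by the smaller of its two residual distances is one plausible way to balance the languages and is not necessarily the paper's exact criterion.

    import numpy as np

    def residual_norms(Q_bar, chosen):
        """Squared distance of every row of Q_bar from the span of the chosen rows."""
        R = np.asarray(Q_bar, dtype=float).copy()
        for a in chosen:                                # Gram-Schmidt over the chosen anchor rows
            b = R[a]
            norm = np.linalg.norm(b)
            if norm > 0:
                b = b / norm
                R -= np.outer(R @ b, b)
        return (R * R).sum(axis=1)

    def multilingual_anchors(Q1, Q2, links, k):
        """Pick k linked anchor pairs: each step takes the dictionary entry (i, j)
        whose rows are jointly farthest from the anchors chosen so far."""
        anchors1, anchors2 = [], []
        for _ in range(k):
            r1, r2 = residual_norms(Q1, anchors1), residual_norms(Q2, anchors2)
            i, j = max(links, key=lambda e: min(r1[e[0]], r2[e[1]]))
            anchors1.append(i)
            anchors2.append(j)
        return anchors1, anchors2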

  • - What if dictionary entries are scarce or inaccurate?

    - What if topics aren't aligned properly across languages?

    Incorporate human-in-the-loop topic modeling tools.


  • MTAnchor

  • Experiments

    Datasets:

    1. Wikipedia articles (en, zh)

    2. Amazon reviews (en, zh)

    3. lorelei documents (en, si)

  • Experiments

    Metrics:

    1. Classification accuracy
       - Intra-lingual: train the topic model on documents in one language and test on other documents in the same language.
       - Cross-lingual: train the topic model on documents in one language and test on other documents in a different language.

    2. Topic coherence [Lau et al. 2014]
       - Intrinsic: use the training documents as the reference corpus to measure local interpretability.
       - Extrinsic: use a large dataset (i.e. all of Wikipedia) as the reference corpus to measure global interpretability.
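    For concreteness, here is a minimal sketch of the kind of coherence score used by Lau et al. [2014]: average NPMI over topic word pairs, with probabilities estimated as document frequencies in a reference corpus (the published evaluation uses sliding windows and other details omitted here).

    import math
    from itertools import combinations

    def npmi_coherence(topic_words, reference_docs, eps=1e-12):
        """Average normalized PMI over all pairs of topic words, where probabilities
        are document frequencies in the reference corpus."""
        docs = [set(doc) for doc in reference_docs]
        n = len(docs)
        def p(*words):
            return sum(all(w in d for w in words) for d in docs) / n
        scores = []
        for w1, w2 in combinations(topic_words, 2):
            p1, p2, p12 = p(w1), p(w2), p(w1, w2)
            if p12 == 0:
                scores.append(-1.0)                    # never co-occur: minimum NPMI
            elif p12 == 1:
                scores.append(1.0)                     # every document contains both words
            else:
                pmi = math.log(p12 / (p1 * p2 + eps))
                scores.append(pmi / -math.log(p12))
        return sum(scores) / len(scores)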

  • Comparing Models

    Classification accuracy (i = intra-lingual test, c = cross-lingual test; the second language is zh for Wikipedia and Amazon, si for lorelei)

    Dataset     Method                    en-i     zh-i/si-i   en-c     zh-c/si-c
    Wikipedia   Multilingual anchoring    69.5%    71.2%       50.4%    47.8%
                MTAnchor (maximum)        80.7%    75.3%       57.6%    54.5%
                MTAnchor (median)         69.5%    71.4%       50.3%    47.2%
                mcta                      51.6%    33.4%       23.2%    39.8%
    Amazon      Multilingual anchoring    59.8%    61.1%       51.7%    53.2%
                mcta                      49.5%    50.6%       50.3%    49.5%
    lorelei     Multilingual anchoring    20.8%    32.7%       24.5%    24.7%
                mcta                      13.0%    26.5%       4.1%     15.6%

  • Comparing Models

    Topic coherence (i = intrinsic, e = extrinsic; the second language is zh for Wikipedia and Amazon, si for lorelei)

    Dataset     Method                    en-i     zh-i/si-i   en-e     zh-e/si-e
    Wikipedia   Multilingual anchoring    0.14     0.18        0.08     0.13
                MTAnchor (maximum)        0.20     0.20        0.10     0.15
                MTAnchor (median)         0.14     0.18        0.08     0.13
                mcta                      0.13     0.09        0.00     0.04
    Amazon      Multilingual anchoring    0.07     0.06        0.03     0.05
                mcta                     -0.03     0.02        0.02     0.01
    lorelei     Multilingual anchoring    0.08     0.00        0.03     n/a
                mcta                      0.13     0.00        0.04     n/a

  • Multilingual Anchoring Is Much Faster

    [Figure: F1 score vs. wall-clock time (0-300 minutes) for mcta, MTAnchor (max), and multilingual anchoring, with panels for training on Chinese or English and testing on Chinese or English.]

  • Improving Topics Through Interactivity

    [Figure: per-user F1 score over roughly 20 minutes of interaction, with panels for training on Chinese or English and evaluating on the Chinese and English development sets.]

  • Comparing Topics

    Dataset     Method                    Topic
    Wikipedia   mcta                      dog san movie mexican fighter novel california
                                          主演 改編 本 小說 拍攝 角色 戰士
                Multilingual anchoring    adventure daughter bob kong hong robert movie
                                          主演 改編 本片 飾演 冒險 講述 編劇
                MTAnchor                  kong hong movie office martial box reception
                                          主演 改編 飾演 本片 演員 編劇 講述
    Amazon      mcta                      woman food eat person baby god chapter
                                          來貨 頂頂 水 耳機 貨物 張傑 傑 同樣
                Multilingual anchoring    eat diet food recipe healthy lose weight
                                          健康 幫 吃 身體 全面 同事 中醫
    lorelei     mcta                      help need flood relief please families needed victim
                Multilingual anchoring    aranayake warning landslide site missing nbro areas

  • Why Not Use Deep Learning?

    - Neural networks are data-hungry and unsuitable for low-resource languages

    - Deep learning models take a long time to train

    - Pathologies of neural models make interpretation difficult [Feng et al. 2018]

  • Summary

    - Anchoring algorithm can be applied in multilingual settings.

    - People can provide helpful linguistic or cultural knowledge to construct better multilingual topic models.

  • Future Work

    - Apply human-in-the-loop algorithms to other tasks in NLP.

    - Better understand the effect of human feedback on cross-lingual representation learning.

  • ALTO: Active Learning with Topic Overviews for Speeding Label Induction and Document Labeling

    Forough Poursabzi-Sangdeh, Jordan Boyd-Graber, Leah Findlater, and Kevin Seppi. Association for Computational Linguistics, 2016.

  • Many Documents

  • Sort into Categories

  • Evaluation

    - User study

      - 40 minutes
      - Sort documents into categories
      - What information / interface helps best

    - Train a classifier on human examples (don't tell them how many labels)

    - Compare classifier labels to expert judgements (purity)

      purity(U, G) = (1/N) Σ_l max_j |U_l ∩ G_j|   (1)
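    A small sketch of equation (1), assuming the user and gold labels are given as parallel lists over the same documents:

    from collections import Counter

    def purity(user_labels, gold_labels):
        """For each user-created cluster U_l, count its overlap with the best-matching
        gold category G_j, then divide the total overlap by the number of documents N."""
        clusters = {}
        for user_label, gold_label in zip(user_labels, gold_labels):
            clusters.setdefault(user_label, []).append(gold_label)
        matched = sum(max(Counter(golds).values()) for golds in clusters.values())
        return matched / len(user_labels)

    # e.g. purity(["a", "a", "b", "b"], ["news", "news", "news", "sports"]) == 0.75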

  • Which is more Useful?

    Who should drive?

  • Which is more Useful?

    Active Learning / Topic Models

  • Direct users to document

  • ● ●

    ●●

    ● ● ●

    0.2

    0.3

    0.4

    0.5

    10 20 30 40Elapsed Time (min)

    Med

    ian

    (ove

    r pa

    rtic

    ipan

    ts)

    condition● No Organization with Active Learning

    Active learning if time is short

  • ● ●

    ●●

    ● ● ●

    0.2

    0.3

    0.4

    0.5

    10 20 30 40Elapsed Time (min)

    Med

    ian

    (ove

    r pa

    rtic

    ipan

    ts)

    condition● No Organization with Active Learning

    No Organization, no Active Learning

    Better than status quo

  • ● ●

    ●●

    ● ● ●

    0.2

    0.3

    0.4

    0.5

    10 20 30 40Elapsed Time (min)

    Med

    ian

    (ove

    r pa

    rtic

    ipan

    ts)

    condition● No Organization with Active Learning

    No Organization, no Active Learning

    Topics with Active Learning

    Active learning can help topic models

  • ● ●

    ●●

    ● ● ●

    0.2

    0.3

    0.4

    0.5

    Purity

    10 20 30 40Elapsed Time (min)

    Med

    ian

    (ove

    r pa

    rtic

    ipan

    ts)

    condition● No Organization with Active Learning

    No Organization, no Active Learning

    Topics with Active Learning

    Topics, no Active Learning

    Topic models help users understand the collection

  • ● ●

    ●●

    ● ● ●

    0.2

    0.3

    0.4

    0.5

    Purity

    10 20 30 40Elapsed Time (min)

    Med

    ian

    (ove

    r pa

    rtic

    ipan

    ts)

    condition● No Organization with Active Learning

    No Organization, no Active Learning

    Topics with Active Learning

    Topics, no Active Learning

    Moral: machines and humans together (if you let them)

  • Ongoing and Future Work

    - Embedding interactivity in applications

    - Visualizations to measure machine learning explainability

    - Using morphology in infinite representations

    - Multilingual analysis


  • Thanks

    Collaborators

    Yuening Hu (UMD), Ke Zhai (UMD), Viet-An Nguyen (UMD), Dave Blei (Princeton), Jonathan Chang (Facebook), Philip Resnik (UMD), Christiane Fellbaum (Princeton), Jerry Zhu (Wisconsin), Sean Gerrish (Sift), Chong Wang (CMU), Dan Osherson (Princeton), Sinead Williamson (CMU)

    Funders

  • References

    Alvin Grissom II, He He, Jordan Boyd-Graber, and John Morgan. 2014. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. In Empirical Methods in Natural Language Processing.

    Sanjeev Arora, Rong Ge, and Ankur Moitra. 2012. Learning topic models - going beyond SVD. In Foundations of Computer Science (FOCS).

    Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. 2013. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the International Conference of Machine Learning.

    David M. Blei, Andrew Ng, and Michael Jordan. 2003a. Latent Dirichlet allocation. Machine Learning Journal, 3.

    David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003b. Latent Dirichlet allocation. Machine Learning Journal.

    Jordan L. Boyd-Graber, Sonya S. Nikolova, Karyn A. Moffatt, Kenrick C. Kin, Joshua Y. Lee, Lester W. Mackey, Marilyn M. Tremaine, and Maria M. Klawe. 2006. Participatory design with proxies: Developing a desktop-PDA system to support people with aphasia. In International Conference on Human Factors in Computing Systems.

    Shi Feng, Eric Wallace, Alvin Grissom II, Pedro Rodriguez, Mohit Iyyer, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretation difficult. In Proceedings of Empirical Methods in Natural Language Processing.

    He He, Jordan Boyd-Graber, and Hal Daumé III. 2016. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning.

    Thomas Hofmann. 1999. Probabilistic latent semantic analysis. In Proceedings of Uncertainty in Artificial Intelligence.

    Yuening Hu, Jordan Boyd-Graber, Hal Daumé III, and Z. Irene Ying. 2013. Binary to bushy: Bayesian hierarchical clustering with the beta coalescent. In Neural Information Processing Systems.

    Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith. 2014a. Interactive topic modeling. Machine Learning.

    Yuening Hu, Ke Zhai, Vlad Eidelman, and Jordan Boyd-Graber. 2014b. Polylingual tree-based topic models for translation domain adaptation. In Proceedings of the Association for Computational Linguistics.

    Yuening Hu, Ke Zhai, Vlad Eidelman, and Jordan Boyd-Graber. 2014c. Polylingual tree-based topic models for translation domain adaptation. In Association for Computational Linguistics.

    Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Association for Computational Linguistics.

    Jagadeesh Jagarlamudi and Hal Daumé. 2010. Extracting multilingual topics from unaligned comparable corpora. In Proceedings of the European Conference on Information Retrieval.

    T. Landauer and S. Dumais. 1997. Solutions to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, (104).

    Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the European Chapter of the Association for Computational Linguistics.

    Jeffrey Lund, Connor Cook, Kevin Seppi, and Jordan Boyd-Graber. 2017. Tandem anchoring: A multiword anchor approach for interactive topic modeling. In Proceedings of the Association for Computational Linguistics.

    Xiaojuan Ma and Perry R. Cook. 2009. How well do visual verbs work in daily communication for young and old adults? In International Conference on Human Factors in Computing Systems, pages 361-364.

    David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of Empirical Methods in Natural Language Processing.

    Viet-An Nguyen, Jordan Boyd-Graber, and Stephen Altschul. 2013. Dirichlet mixtures, the Dirichlet process, and the structure of protein space. Journal of Computational Biology, 20(1).

    Sonya S. Nikolova, Jordan Boyd-Graber, Christiane Fellbaum, and Perry Cook. 2009. Better vocabularies for assistive communication aids: Connecting terms using semantic networks and untrained annotators. In ACM Conference on Computers and Accessibility.

    Bei Shi, Wai Lam, Lidong Bing, and Yinqing Xu. 2016. Detecting common discussion topics across culture from news reader comments. In Proceedings of the Association for Computational Linguistics.

  • Latent Dirichlet Allocation: A Generative Model

    - Focus in this talk: statistical methods
    - Model: story of how your data came to be
    - Latent variables: missing pieces of your story
    - Statistical inference: filling in those missing pieces

    - We use latent Dirichlet allocation (LDA) [Blei et al. 2003a], a fully Bayesian version of pLSI [Hofmann 1999], a probabilistic version of LSA [Landauer and Dumais 1997]


  • Latent Dirichlet Allocation: A Generative Model

    [Plate diagram: hyperparameter α → θ_d → z_n → w_n ← β_k ← λ, with n = 1..N word positions nested inside d = 1..M documents and k = 1..K topics.]

    - For each topic k ∈ {1, ..., K}, draw a multinomial distribution β_k from a Dirichlet distribution with parameter λ

    - For each document d ∈ {1, ..., M}, draw a multinomial distribution θ_d from a Dirichlet distribution with parameter α

    - For each word position n ∈ {1, ..., N}, select a hidden topic z_n from the multinomial distribution parameterized by θ.

    - Choose the observed word w_n from the distribution β_{z_n}.

    We use statistical inference to uncover the most likely unobserved variables given observed data.
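    The generative story can be simulated directly; the NumPy sketch below samples a small synthetic corpus, with all sizes and hyperparameter values chosen only for illustration.

    import numpy as np

    def generate_lda_corpus(M=100, N=50, K=3, V=1000, alpha=0.1, lam=0.01, seed=0):
        """Sample topics, per-document proportions, and documents from the LDA story."""
        rng = np.random.default_rng(seed)
        beta = rng.dirichlet([lam] * V, size=K)        # topic-word distributions beta_k
        theta = rng.dirichlet([alpha] * K, size=M)     # document-topic proportions theta_d
        docs = []
        for d in range(M):
            z = rng.choice(K, size=N, p=theta[d])      # hidden topic for each word position
            docs.append([int(rng.choice(V, p=beta[k])) for k in z])
        return beta, theta, docs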

  • [Figure: the LDA plate diagram applied to the example corpus. Three topics emerge: "TECHNOLOGY" (computer, technology, system, service, site, phone, internet, machine), "ENTERTAINMENT" (play, film, movie, theater, production, star, director, stage), and "BUSINESS" (sell, sale, store, product, business, advertising, market, consumer), and each headline is linked to its dominant topic.]

  • Inference

    - We are interested in the posterior distribution

      p(Z | X, Θ)   (2)

    - Here, the latent variables are the topic assignments z and topics θ. X is the words (divided into documents), and Θ are the hyperparameters of the Dirichlet distributions: α for topic proportions, λ for topics.

      p(z, β, θ | w, α, λ)   (3)

      p(w, z, θ, β | α, λ) = Π_k p(β_k | λ) Π_d p(θ_d | α) Π_n p(z_{d,n} | θ_d) p(w_{d,n} | β_{z_{d,n}})


  • Gibbs Sampling

    - A form of Markov chain Monte Carlo

    - The chain is a sequence of random variable states

    - Given a state {z_1, ..., z_N}, and under certain technical conditions, repeatedly drawing z_k ∼ p(z_k | z_1, ..., z_{k−1}, z_{k+1}, ..., z_N, X, Θ) for all k results in a Markov chain whose stationary distribution is the posterior.

    - For notational convenience, call z with z_{d,n} removed z_{−d,n}

  • Inference

    Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's iTunes music store and other online services ...

    computer, technology, system, service, site, phone, internet, machine

    play, film, movie, theater, production, star, director, stage

    sell, sale, store, product, business, advertising, market, consumer


  • Gibbs Sampling

    - For LDA, we will sample the topic assignments

    - Thus, we want:

      p(z_{d,n} = k | z_{−d,n}, w, α, λ) = p(z_{d,n} = k, z_{−d,n} | w, α, λ) / p(z_{−d,n} | w, α, λ)

    - The topics and per-document topic proportions are integrated out / marginalized / collapsed

    - Let n_{d,i} be the number of words taking topic i in document d. Let v_{k,w} be the number of times word w is used in topic k.

      = [ ∫_{θ_d} ( Π_{i≠k} θ_{d,i}^{α_i + n_{d,i} − 1} ) θ_{d,k}^{α_k + n_{d,k}} dθ_d · ∫_{β_k} ( Π_{i≠w_{d,n}} β_{k,i}^{λ_i + v_{k,i} − 1} ) β_{k,w_{d,n}}^{λ_{w_{d,n}} + v_{k,w_{d,n}}} dβ_k ]
        / [ ∫_{θ_d} ( Π_i θ_{d,i}^{α_i + n_{d,i} − 1} ) dθ_d · ∫_{β_k} ( Π_i β_{k,i}^{λ_i + v_{k,i} − 1} ) dβ_k ]


  • Gibbs Sampling

    - For LDA, we will sample the topic assignments

    - The topics and per-document topic proportions are integrated out / marginalized / Rao-Blackwellized

    - Thus, we want:

      p(z_{d,n} = k | z_{−d,n}, w, α, λ) = (n_{d,k} + α_k) / (Σ_{i}^{K} n_{d,i} + α_i) · (v_{k,w_{d,n}} + λ_{w_{d,n}}) / (Σ_i v_{k,i} + λ_i)

  • Gibbs Sampling

    - The integral is the normalizer of a Dirichlet distribution:

      ∫_{β_k} ( Π_i β_{k,i}^{λ_i + v_{k,i} − 1} ) dβ_k = Π_{i}^{V} Γ(λ_i + v_{k,i}) / Γ( Σ_{i}^{V} λ_i + v_{k,i} )

    - So we can simplify the ratio of integrals above to a ratio of Gamma functions:

      = [ Γ(α_k + n_{d,k} + 1) Π_{i≠k}^{K} Γ(α_i + n_{d,i}) / Γ( Σ_{i}^{K} α_i + n_{d,i} + 1 ) ] / [ Π_{i}^{K} Γ(α_i + n_{d,i}) / Γ( Σ_{i}^{K} α_i + n_{d,i} ) ]
        · [ Γ(λ_{w_{d,n}} + v_{k,w_{d,n}} + 1) Π_{i≠w_{d,n}}^{V} Γ(λ_i + v_{k,i}) / Γ( Σ_{i}^{V} λ_i + v_{k,i} + 1 ) ] / [ Π_{i}^{V} Γ(λ_i + v_{k,i}) / Γ( Σ_{i}^{V} λ_i + v_{k,i} ) ]


  • Gamma Function Identity

    Γ(z + 1) / Γ(z) = z   (4)

    Applying this identity to each Gamma ratio above collapses the expression to

    (n_{d,k} + α_k) / (Σ_{i}^{K} n_{d,i} + α_i) · (v_{k,w_{d,n}} + λ_{w_{d,n}}) / (Σ_i v_{k,i} + λ_i)

  • Gibbs Sampling Equation

    (n_{d,k} + α_k) / (Σ_{i}^{K} n_{d,i} + α_i) · (v_{k,w_{d,n}} + λ_{w_{d,n}}) / (Σ_i v_{k,i} + λ_i)   (5)

    - n_{d,k}: number of times document d uses topic k

    - v_{k,w_{d,n}}: number of times topic k uses word type w_{d,n}

    - α_k: Dirichlet parameter for the document-to-topic distribution

    - λ_{w_{d,n}}: Dirichlet parameter for the topic-to-word distribution

    - First factor: how much this document likes topic k

    - Second factor: how much this topic likes word w_{d,n}


  • Sample Document


  • Randomly Assign Topics


  • Total Topic Counts

    Sampling Equation

    (n_{d,k} + α_k) / (Σ_{i}^{K} n_{d,i} + α_i) · (v_{k,w_{d,n}} + λ_{w_{d,n}}) / (Σ_i v_{k,i} + λ_i)


  • We want to sample this word . . .


  • Decrement its count

  • What is the conditional distribution for this topic?

  • Part 1: How much does this document like each topic?


  • Part 2: How much does each topic like the word?


  • Geometric interpretation


  • Update counts


  • Details: how to sample from a distribution

    [Figure: the unnormalized sampling weight (n_{d,k} + α_k) / (Σ_{i}^{K} n_{d,i} + α_i) · (v_{k,w_{d,n}} + λ_{w_{d,n}}) / (Σ_i v_{k,i} + λ_i) is computed for each of Topic 1 through Topic 5, the weights are normalized to sum to one, and a topic is drawn from the resulting discrete distribution.]


  • Naïve Implementation

    Algorithm

    1. For each iteration i:
       1.1 For each document d and word n currently assigned to z_old:
           1.1.1 Decrement n_{d,z_old} and v_{z_old, w_{d,n}}
           1.1.2 Sample z_new = k with probability proportional to (n_{d,k} + α_k) / (Σ_{i}^{K} n_{d,i} + α_i) · (v_{k,w_{d,n}} + λ_{w_{d,n}}) / (Σ_i v_{k,i} + λ_i)
           1.1.3 Increment n_{d,z_new} and v_{z_new, w_{d,n}}
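    Read literally, the naïve algorithm is only a few lines of NumPy; the sketch below is an unoptimized collapsed Gibbs sampler with symmetric priors (variable names follow the slides; everything else is illustrative).

    import numpy as np

    def gibbs_lda(docs, K, V, alpha=0.1, lam=0.01, iters=200, seed=0):
        """Collapsed Gibbs sampling for LDA. docs: list of lists of word ids in [0, V)."""
        rng = np.random.default_rng(seed)
        n_dk = np.zeros((len(docs), K))                # times document d uses topic k
        v_kw = np.zeros((K, V))                        # times topic k uses word type w
        v_k = np.zeros(K)                              # total tokens assigned to topic k
        z = [rng.integers(K, size=len(doc)) for doc in docs]     # random initialization
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                n_dk[d, z[d][n]] += 1; v_kw[z[d][n], w] += 1; v_k[z[d][n]] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for n, w in enumerate(doc):
                    k_old = z[d][n]                    # decrement counts for the current assignment
                    n_dk[d, k_old] -= 1; v_kw[k_old, w] -= 1; v_k[k_old] -= 1
                    p = (n_dk[d] + alpha) * (v_kw[:, w] + lam) / (v_k + lam * V)
                    k_new = int(rng.choice(K, p=p / p.sum()))
                    z[d][n] = k_new                    # increment counts for the new assignment
                    n_dk[d, k_new] += 1; v_kw[k_new, w] += 1; v_k[k_new] += 1
        return n_dk, v_kw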

  • Desiderata

    - Hyperparameters: Sample them too (slice sampling)

    - Initialization: Random

    - Sampling: Until likelihood converges

    - Lag / burn-in: Difference of opinion on this

    - Number of chains: Should do more than one

  • Available implementations

    - Mallet (http://mallet.cs.umass.edu)

    - LDAC (http://www.cs.princeton.edu/~blei/lda-c)

    - Topicmod (http://code.google.com/p/topicmod)

  • SHLDA Model

    [Figure: SHLDA plate diagram.]

  • Infvoc Classification Accuracy

    Table: Classification accuracy based on 50 topic features extracted from 20 newsgroups data (S = 155, τ_0 = 64, κ = 0.6).

    Method        Setting                    Accuracy
    infvoc        αβ = 3k, T = 40k, U = 10   52.683
    fixvoc        vb-dict                    45.514
    fixvoc        vb-null                    49.390
    fixvoc        hybrid-dict                46.720
    fixvoc        hybrid-null                50.474
    fixvoc-hash   vb-dict                    52.525
    fixvoc-hash   vb-full, T = 30k           51.653
    fixvoc-hash   hybrid-dict                50.948
    fixvoc-hash   hybrid-full, T = 30k       50.948
    dtm-dict      tcv = 0.001                62.845

    Topics learned with hashing are no longer interpretable, they can only be used as features.

  • Unassign(d, n, w_{d,n}, z_{d,n} = k)

    1: T: T_{d,k} ← T_{d,k} − 1
    2: If w_{d,n} ∉ Ω_old:
         P: P_{k,w_{d,n}} ← P_{k,w_{d,n}} − 1
    3: Else, suppose w_{d,n} ∈ Ω_old_m:
         P: P_{k,m} ← P_{k,m} − 1
         W: W_{k,m,w_{d,n}} ← W_{k,m,w_{d,n}} − 1

  • SparseLDA

    p(z = k | Z_−, w) ∝ (α_k + n_{k|d}) (β + n_{w|k}) / (βV + n_{·|k})   (6)

                     ∝ α_k β / (βV + n_{·|k})                      [s_LDA]
                       + n_{k|d} β / (βV + n_{·|k})                [r_LDA]
                       + (α_k + n_{k|d}) n_{w|k} / (βV + n_{·|k})  [q_LDA]
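    A rough NumPy sketch of drawing one topic with this bucket decomposition; a real SparseLDA implementation caches s and r across tokens and iterates only over nonzero counts rather than recomputing dense vectors as done here.

    import numpy as np

    def sparse_lda_draw(n_kd, n_wk, n_k, alpha, beta, V, rng):
        """Sample a topic for one token by splitting its conditional into the
        smoothing (s), document (r), and topic-word (q) buckets."""
        denom = beta * V + n_k                       # shared per-topic denominator
        s = alpha * beta / denom                     # smoothing-only bucket
        r = n_kd * beta / denom                      # document bucket (sparse in n_kd)
        q = (alpha + n_kd) * n_wk / denom            # topic-word bucket (sparse in n_wk)
        u = rng.random() * (s.sum() + r.sum() + q.sum())
        for bucket in (s, r, q):
            if u < bucket.sum() or bucket is q:      # q absorbs any leftover mass
                return int(rng.choice(len(bucket), p=bucket / bucket.sum()))
            u -= bucket.sum()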

  • Tree-based sampling

    p(z_{d,n} = k, l_{d,n} = λ | Z_−, L_−, w_{d,n})   (7)

      ∝ (α_k + n_{k|d}) Π_{(i→j)∈λ} (β_{i→j} + n_{i→j|k}) / Σ_{j′} (β_{i→j′} + n_{i→j′|k})

  • Factorizing Tree-Based Prior

    p(z = k | Z_−, w) ∝ (α_k + n_{k|d}) (β + n_{w|k}) / (βV + n_{·|k})   (8)

                     ∝ α_k β / (βV + n_{·|k})                      [s_LDA]
                       + n_{k|d} β / (βV + n_{·|k})                [r_LDA]
                       + (α_k + n_{k|d}) n_{w|k} / (βV + n_{·|k})  [q_LDA]

    s = Σ_{k,λ} [ α_k Π_{(i→j)∈λ} β_{i→j} ] / [ Π_{(i→j)∈λ} Σ_{j′} (β_{i→j′} + n_{i→j′|k}) ]

      ≤ Σ_{k,λ} [ α_k Π_{(i→j)∈λ} β_{i→j} ] / [ Π_{(i→j)∈λ} Σ_{j′} β_{i→j′} ]  =  s′.   (9)


  • for word w in this document do
        sample = rand() * (s′ + r + q)
        if sample < s′ then
            compute s
            sample = sample * (s + r + q) / (s′ + r + q)
            if sample < s then
                return topic k and path λ sampled from s
            end if
            sample -= s
        else
            sample -= s′
        end if
        if sample < r then
            return topic k and path λ sampled from r
        end if
        sample -= r
        return topic k and path λ sampled from q
    end for

  • Number of Topics
                    T50      T100     T200     T500
    Naïve           5.700    12.655   29.200   71.223
    Fast            4.935    9.222    17.559   40.691
    Fast-RB         2.937    4.037    5.880    8.551
    Fast-RB-sD      2.675    3.795    5.400    8.363
    Fast-RB-sW      2.449    3.363    4.894    7.404
    Fast-RB-sDW     2.225    3.241    4.672    7.424

    Number of Correlations
                    C50      C100     C200     C500
    Naïve           11.166   12.586   13.000   15.377
    Fast            8.889    9.165    9.177    8.079
    Fast-RB         3.995    4.078    3.858    3.156
    Fast-RB-sD      3.660    3.795    3.593    3.065
    Fast-RB-sW      3.272    3.363    3.308    2.787
    Fast-RB-sDW     3.026    3.241    3.091    2.627