Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic...

313
Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences University of Wisconsin–Madison 24th August 2010 Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 1 / 63

Transcript of Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic...

Page 1: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Incorporating Domain Knowledgein Latent Topic Models

Final oral examination

David Andrzejewski

Department of Computer SciencesUniversity of Wisconsin–Madison

24th August 2010

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 1 / 63

Page 2: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topic modeling overview

Documents

Politics

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 2 / 63

Page 3: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topic modeling overview

Documents Word Counts

= [5 2 ... 7]

= [3 8 ... 2]

= [0 6 ... 1]

Politics

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 2 / 63

Page 4: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topic modeling overview

Documents Word Counts

= [5 2 ... 7]

= [3 8 ... 2]

= [0 6 ... 1]

Politics

Algorithms

Sports

Topics

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 2 / 63

Page 5: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topic modeling overview

Documents Word Counts

= [5 2 ... 7]

= [3 8 ... 2]

= [0 6 ... 1]

Politics

Algorithms

Sports

TopicsWhich words in each topic?

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 2 / 63

Page 6: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topic modeling overview

Documents Word Counts

= [5 2 ... 7]

= [3 8 ... 2]

= [0 6 ... 1]

Politics

Algorithms

Sports

Topics

Which topics in each doc?

Which words in each topic?

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 2 / 63

Page 7: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topic modeling overview

Documents Word Counts

= [5 2 ... 7]

= [3 8 ... 2]

= [0 6 ... 1]

Politics

Algorithms

Sports

Topics

Which topics in each doc?

Which words in each topic?

LEARNED FROM DATA

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 2 / 63

Page 8: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topic modeling overview

Documents Word Counts

= [5 2 ... 7]

= [3 8 ... 2]

= [0 6 ... 1]

Politics

Algorithms

Sports

Topics

Which topics in each doc?

Which words in each topic?

LEARNED FROM DATA+

DOMAIN KNOWLEDGE

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 2 / 63

Page 9: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Exploratory browsing with topic modelsGoldberg et al., NAACL HLT 2009

Dataset: 89,574 New Year’s wishes (NYC Times Square website)Goal: understand/summarize common wish themesExample wishes:

Peace on earthown a breweryI hope I get into Univ. of Penn graduate school.The safe return of my friends in Iraqfind a cure for cancerTo lose weight and get a boyfriendI Hope Barack Obama Wins the PresidencyTo win the lottery!

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 3 / 63

Page 10: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Exploratory browsing with topic modelsGoldberg et al., NAACL HLT 2009

Dataset: 89,574 New Year’s wishes (NYC Times Square website)Goal: understand/summarize common wish themesExample wishes:

Peace on earthown a breweryI hope I get into Univ. of Penn graduate school.The safe return of my friends in Iraqfind a cure for cancerTo lose weight and get a boyfriendI Hope Barack Obama Wins the PresidencyTo win the lottery!

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 3 / 63

Page 11: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Corpus-wide word frequencies

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 4 / 63

Page 12: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Learned topics

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 4 / 63

Page 13: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Learned topics

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 4 / 63

Page 14: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Learned topics

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 4 / 63

Page 15: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Why latent topic modeling?

Author/document profilingMatch papers to reviewers(Mimno & McCallum, 2007)Assign developers to bugs(Linstead et al., 2007)Scientific impact/influence(Gerrish & Blei, 2009)

Networks (Henderson &Eliassi-Rad, 2009)Trends(Wang & McCallum, 2006)Info retrieval (http://rexa.info)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 5 / 63

Page 16: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Why latent topic modeling?

Author/document profilingMatch papers to reviewers(Mimno & McCallum, 2007)Assign developers to bugs(Linstead et al., 2007)Scientific impact/influence(Gerrish & Blei, 2009)

Networks (Henderson &Eliassi-Rad, 2009)Trends(Wang & McCallum, 2006)Info retrieval (http://rexa.info)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 5 / 63

Page 17: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Why latent topic modeling?

Author/document profilingMatch papers to reviewers(Mimno & McCallum, 2007)Assign developers to bugs(Linstead et al., 2007)Scientific impact/influence(Gerrish & Blei, 2009)

Networks (Henderson &Eliassi-Rad, 2009)Trends(Wang & McCallum, 2006)Info retrieval (http://rexa.info)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 5 / 63

Page 18: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Why latent topic modeling?

Author/document profilingMatch papers to reviewers(Mimno & McCallum, 2007)Assign developers to bugs(Linstead et al., 2007)Scientific impact/influence(Gerrish & Blei, 2009)

Networks (Henderson &Eliassi-Rad, 2009)Trends(Wang & McCallum, 2006)Info retrieval (http://rexa.info)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 5 / 63

Page 19: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Quick stats review

Dirichlet[20, 5, 5]

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 6 / 63

Page 20: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Quick stats review

Dirichlet[20, 5, 5]

Multinomial[0.6, 0.15, 0.25]

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 6 / 63

Page 21: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Quick stats review

Dirichlet[20, 5, 5]

Multinomial Observed counts

A, A, B, C, A, C

[3, 1, 2][0.6, 0.15, 0.25]

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 6 / 63

Page 22: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Quick stats review

Dirichlet[20, 5, 5]

Multinomial Observed counts

A, A, B, C, A, C

[3, 1, 2][0.6, 0.15, 0.25]

CONJUGACY!

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 6 / 63

Page 23: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

For each topic tφt ∼ Dirichlet(β)

For each document dθd ∼ Dirichlet(α)For each word w

Topic z ∼ Multinomial(θd )Word w ∼ Multinomial(φz)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 7 / 63

Page 24: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

For each topic tφt ∼ Dirichlet(β)

For each document dθd ∼ Dirichlet(α)For each word w

Topic z ∼ Multinomial(θd )Word w ∼ Multinomial(φz)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 7 / 63

Page 25: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

For each topic tφt ∼ Dirichlet(β)

For each document dθd ∼ Dirichlet(α)For each word w

Topic z ∼ Multinomial(θd )Word w ∼ Multinomial(φz)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 7 / 63

Page 26: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

For each topic tφt ∼ Dirichlet(β)

For each document dθd ∼ Dirichlet(α)For each word w

Topic z ∼ Multinomial(θd )Word w ∼ Multinomial(φz)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 7 / 63

Page 27: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

Ndw

D

αβT

zφ θ

For each topic tφt ∼ Dirichlet(β)

For each document dθd ∼ Dirichlet(α)For each word w

Topic z ∼ Multinomial(θd )Word w ∼ Multinomial(φz)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 7 / 63

Page 28: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

For each topic tφt ∼ Dirichlet(β)

For each document dθd ∼ Dirichlet(α)For each word w

Topic z ∼ Multinomial(θd )Word w ∼ Multinomial(φz)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 7 / 63

Page 29: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

For each topic tφt ∼ Dirichlet(β)

For each document dθd ∼ Dirichlet(α)For each word w

Topic z ∼ Multinomial(θd )Word w ∼ Multinomial(φz)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 7 / 63

Page 30: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

For each topic tφt ∼ Dirichlet(β)

For each document dθd ∼ Dirichlet(α)For each word w

Topic z ∼ Multinomial(θd )Word w ∼ Multinomial(φz)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 7 / 63

Page 31: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

Given observed words w, want the posterior over hidden topics zUse Bayes’ Rule to calculate P(z|w)? INTRACTABLEOne approach: Collapsed Gibbs Sampling

w = A B B A C B A Az = 0 2 2 1 2 1 0 0

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 8 / 63

Page 32: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

Given observed words w, want the posterior over hidden topics zUse Bayes’ Rule to calculate P(z|w)? INTRACTABLEOne approach: Collapsed Gibbs Sampling

w = A B B A C B A Az = 0 2 2 1 2 1 0 0

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 8 / 63

Page 33: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

Given observed words w, want the posterior over hidden topics zUse Bayes’ Rule to calculate P(z|w)? INTRACTABLEOne approach: Collapsed Gibbs Sampling

w = A B B A C B A Az = 0 2 2 1 2 1 0 0

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 8 / 63

Page 34: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

Given observed words w, want the posterior over hidden topics zUse Bayes’ Rule to calculate P(z|w)? INTRACTABLEOne approach: Collapsed Gibbs Sampling

w = A B B A C B A Az = 0 2 2 1 2 1 0 0

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 8 / 63

Page 35: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

Given observed words w, want the posterior over hidden topics zUse Bayes’ Rule to calculate P(z|w)? INTRACTABLEOne approach: Collapsed Gibbs Sampling

w = A B B A C B A Az = 2 2 2 1 2 1 0 0

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 8 / 63

Page 36: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

Given observed words w, want the posterior over hidden topics zUse Bayes’ Rule to calculate P(z|w)? INTRACTABLEOne approach: Collapsed Gibbs Sampling

w = A B B A C B A Az = 2 1 2 1 2 1 0 0

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 8 / 63

Page 37: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

Given observed words w, want the posterior over hidden topics zUse Bayes’ Rule to calculate P(z|w)? INTRACTABLEOne approach: Collapsed Gibbs Sampling

w = A B B A C B A Az = 2 1 2 1 2 1 0 0

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 8 / 63

Page 38: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent Dirichlet Allocation (LDA)Blei et al., JMLR 2003

θ

Ndw

D

αβT

Given observed words w, want the posterior over hidden topics zUse Bayes’ Rule to calculate P(z|w)? INTRACTABLEOne approach: Collapsed Gibbs Sampling

w = A B B A C B A Az = 2 1 2 2 2 1 0 0

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 8 / 63

Page 39: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Outline

1 Topic modelingLatent Dirichlet Allocation (LDA)IssuesRelated work

2 Preliminary work

3 New workLogicLDABiological text mining

4 ConclusionDiscussionFuture work

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 9 / 63

Page 40: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topics may not align with user goals / intuitions

Topic = {north south carolina korea korean...}(Newman et al, 2009)

20 news: “mac” vs “pc”

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 10 / 63

Page 41: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topics may not align with user goals / intuitions

Topic = {north south carolina korea korean...}(Newman et al, 2009)

20 news: “mac” vs “pc”

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 10 / 63

Page 42: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topics may not align with user goals / intuitions

Topic = {north south carolina korea korean...}(Newman et al, 2009)

20 news: “mac” vs “pc”

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 10 / 63

Page 43: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topics may not align with user goals / intuitions

Topic = {north south carolina korea korean...}(Newman et al, 2009)

20 news: “mac” vs “pc”

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 10 / 63

Page 44: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Popular demand

Posted question on machine learning website∗

“LDA is nice, but unpredictable as it does not always give me the topicsI wanted...I am looking for something like LDA, but...i can pick the seedwords for each topic...”

∗ http://metaoptimize.com/qa/

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 11 / 63

Page 45: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Extending LDA - existing work

LDA: generative probabilistic model→ extensibleAssociated observations (e.g., images-Blei and Jordan, 2003)Document-topic (e.g., correlated topics-Blei and Lafferty 2006)Topic-word (e.g., bigram-Wallach, 2006)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 12 / 63

Page 46: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Extending LDA - existing work

LDA: generative probabilistic model→ extensibleAssociated observations (e.g., images-Blei and Jordan, 2003)Document-topic (e.g., correlated topics-Blei and Lafferty 2006)Topic-word (e.g., bigram-Wallach, 2006)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 12 / 63

Page 47: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Extending LDA - existing work

LDA: generative probabilistic model→ extensibleAssociated observations (e.g., images-Blei and Jordan, 2003)Document-topic (e.g., correlated topics-Blei and Lafferty 2006)Topic-word (e.g., bigram-Wallach, 2006)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 12 / 63

Page 48: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Extending LDA - existing work

LDA: generative probabilistic model→ extensibleAssociated observations (e.g., images-Blei and Jordan, 2003)Document-topic (e.g., correlated topics-Blei and Lafferty 2006)Topic-word (e.g., bigram-Wallach, 2006)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 12 / 63

Page 49: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

How this work is different

Existing work: application-specific, richer structureThis work: general direct domain knowledge

“These words do not belong in the same topic”“I want a topic about X ”“This topic is incompatible with this document”“These two topics are incompatible and should not co-occur insame sentence”

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 13 / 63

Page 50: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

How this work is different

Existing work: application-specific, richer structureThis work: general direct domain knowledge

“These words do not belong in the same topic”“I want a topic about X ”“This topic is incompatible with this document”“These two topics are incompatible and should not co-occur insame sentence”

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 13 / 63

Page 51: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

How this work is different

Existing work: application-specific, richer structureThis work: general direct domain knowledge

“These words do not belong in the same topic”“I want a topic about X ”“This topic is incompatible with this document”“These two topics are incompatible and should not co-occur insame sentence”

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 13 / 63

Page 52: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

How this work is different

Existing work: application-specific, richer structureThis work: general direct domain knowledge

“These words do not belong in the same topic”“I want a topic about X ”“This topic is incompatible with this document”“These two topics are incompatible and should not co-occur insame sentence”

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 13 / 63

Page 53: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

How this work is different

Existing work: application-specific, richer structureThis work: general direct domain knowledge

“These words do not belong in the same topic”“I want a topic about X ”“This topic is incompatible with this document”“These two topics are incompatible and should not co-occur insame sentence”

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 13 / 63

Page 54: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

How this work is different

Existing work: application-specific, richer structureThis work: general direct domain knowledge

“These words do not belong in the same topic”“I want a topic about X ”“This topic is incompatible with this document”“These two topics are incompatible and should not co-occur insame sentence”

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 13 / 63

Page 55: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Specific aims from preliminary examination

Specific Aim 1Develop latent topic models capable of expressing novel forms ofdomain knowledge (LogicLDA).

Specific Aim 2Apply knowledge-augmented latent topic models to real-worldproblems (biological text mining application).

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 14 / 63

Page 56: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Specific aims from preliminary examination

Specific Aim 1Develop latent topic models capable of expressing novel forms ofdomain knowledge (LogicLDA).

Specific Aim 2Apply knowledge-augmented latent topic models to real-worldproblems (biological text mining application).

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 14 / 63

Page 57: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Overview

Standard LDA model

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 15 / 63

Page 58: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Overview

Special topics which only appear in certain documents

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 15 / 63

Page 59: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Overview

Influence latent topic assignment zi of individual tokens

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 15 / 63

Page 60: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Overview

Must-Link and Cannot-Link preferences over words

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 15 / 63

Page 61: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Overview

General domain knowledge

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 15 / 63

Page 62: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Overview

Application case study

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 15 / 63

Page 63: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Outline

1 Topic modelingLatent Dirichlet Allocation (LDA)IssuesRelated work

2 Preliminary work

3 New workLogicLDABiological text mining

4 ConclusionDiscussionFuture work

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 16 / 63

Page 64: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Restricted topics with ∆LDAAndrzejewski et al, ECML 2007

ContributionSpecial topics which only appear in certain documents

Document label determines α prior on document-topic θStatistical debugging: we know some runs fail(crash or bad output)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 17 / 63

Page 65: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Restricted topics with ∆LDAAndrzejewski et al, ECML 2007

ContributionSpecial topics which only appear in certain documents

Document label determines α prior on document-topic θStatistical debugging: we know some runs fail(crash or bad output)

α =

[α(s)

α(f )

]=

[1 1 1 0 01 1 1 1 1

]

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 17 / 63

Page 66: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Restricted topics with ∆LDAAndrzejewski et al, ECML 2007

ContributionSpecial topics which only appear in certain documents

Document label determines α prior on document-topic θStatistical debugging: we know some runs fail(crash or bad output)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 17 / 63

Page 67: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topic-in-set knowledgeAndrzejewski and Zhu, SSLNLP 2009

ContributionInfluence latent topic assignment zi of individual tokens

“Apple pie” vs “Apple iPod”Use “seed” words {translation, tRNA, ribosome, anticodon}

w = A B B A C B A Az = ? 0 0 ? ? 0 ? ?

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 18 / 63

Page 68: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topic-in-set knowledgeAndrzejewski and Zhu, SSLNLP 2009

ContributionInfluence latent topic assignment zi of individual tokens

“Apple pie” vs “Apple iPod”Use “seed” words {translation, tRNA, ribosome, anticodon}

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 18 / 63

Page 69: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topic-in-set knowledgeAndrzejewski and Zhu, SSLNLP 2009

ContributionInfluence latent topic assignment zi of individual tokens

“Apple pie” vs “Apple iPod”Use “seed” words {translation, tRNA, ribosome, anticodon}

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 18 / 63

Page 70: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Topic-in-set knowledgeAndrzejewski and Zhu, SSLNLP 2009

ContributionInfluence latent topic assignment zi of individual tokens

“Apple pie” vs “Apple iPod”Use “seed” words {translation, tRNA, ribosome, anticodon}

Topic 0

translation, ribosomal, trna, rrna, initiation, ribosome, proteinribosomes, is, factor, processing, translational, nucleolarpre-rrna, synthesis, small, 60s, eukaryotic, biogenesis, subunittrnas, subunits, large, nucleolus, factors, 40, synthetase, freemodification, rna, depletion, eif-2, initiator, 40s, ef-3anticodon, maturation 18s, eif2, mature, eif4e, synthetasesaminoacylation, snornas, assembly, eif4g, elongation

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 18 / 63

Page 71: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Forest Prior on topicsAndrzejewski et al, ICML 2009

ContributionMust-Link: tie words together (e.g., synonyms)Cannot-Link: keep words apart (e.g., antonyms)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 19 / 63

Page 72: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Must-Link (college, school)Inspired by constrained clustering (Basu et al, 2008)

∀t , we want P(college|t) ≈ P(school |t)Cannot be encoded by Dirichlet→ Dirichlet Tree

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 20 / 63

Page 73: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Must-Link (college, school)Inspired by constrained clustering (Basu et al, 2008)

∀t , we want P(college|t) ≈ P(school |t)Cannot be encoded by Dirichlet→ Dirichlet Tree

school college

lottery

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 20 / 63

Page 74: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Cannot-Link (school, cancer)Inspired by constrained clustering (Basu et al., 2008)

No topic-word multinomial φt = P(w |t) should have both:High probability P(school |t)High probability P(cancer |t)

Requires mixture of Dirichlet Trees (Dirichlet Forest)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 21 / 63

Page 75: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Cannot-Link (school, cancer)Inspired by constrained clustering (Basu et al., 2008)

No topic-word multinomial φt = P(w |t) should have both:High probability P(school |t)High probability P(cancer |t)

Requires mixture of Dirichlet Trees (Dirichlet Forest)

cancer school

cure

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 21 / 63

Page 76: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Outline

1 Topic modelingLatent Dirichlet Allocation (LDA)IssuesRelated work

2 Preliminary work

3 New workLogicLDABiological text mining

4 ConclusionDiscussionFuture work

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 22 / 63

Page 77: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

New work: LogicLDA(submitted to NIPS 2010)

Generalizes preliminary workSpecial topics which only appear in certain documents (∆LDA)Influence latent topic zi of tokens (topic-in-set)Must-Link and Cannot-Link between words (Dirichlet Forest)

New types of domain knowledgeArbitrary side informationRelational constraints among zi

Generalize other LDA variants

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 23 / 63

Page 78: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

New work: LogicLDA(submitted to NIPS 2010)

Generalizes preliminary workSpecial topics which only appear in certain documents (∆LDA)Influence latent topic zi of tokens (topic-in-set)Must-Link and Cannot-Link between words (Dirichlet Forest)

New types of domain knowledgeArbitrary side informationRelational constraints among zi

Generalize other LDA variants

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 23 / 63

Page 79: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

General domain knowledge

Express domain knowledge as first-order logic (FOL)Weighted knowledge base (KB) of rulesLearned topics influenced by both

Word-document statistics (as in LDA)Domain knowledge rules

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 24 / 63

Page 80: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

General domain knowledge

Express domain knowledge as first-order logic (FOL)Weighted knowledge base (KB) of rulesLearned topics influenced by both

Word-document statistics (as in LDA)Domain knowledge rules

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 24 / 63

Page 81: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

General domain knowledge

Express domain knowledge as first-order logic (FOL)Weighted knowledge base (KB) of rulesLearned topics influenced by both

Word-document statistics (as in LDA)Domain knowledge rules

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 24 / 63

Page 82: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

General domain knowledge

Express domain knowledge as first-order logic (FOL)Weighted knowledge base (KB) of rulesLearned topics influenced by both

Word-document statistics (as in LDA)Domain knowledge rules

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 24 / 63

Page 83: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

General domain knowledge

Express domain knowledge as first-order logic (FOL)Weighted knowledge base (KB) of rulesLearned topics influenced by both

Word-document statistics (as in LDA)Domain knowledge rules

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 24 / 63

Page 84: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Converting LDA variables to logic

Value Logical Predicate Description

LDAzi = t Z(i , t) Latent topicwi = v W(i , v) Observed worddi = j D(i , j) Observed document

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 25 / 63

Page 85: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Converting LDA variables to logic

Value Logical Predicate Description

LDAzi = t Z(i , t) Latent topicwi = v W(i , v) Observed worddi = j D(i , j) Observed document

AttributesHasLabel(j , `) Document label

S(i , k) Observed sentence

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 25 / 63

Page 86: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Domain knowledge in FOL

CNF Knowledge Base KB = {(λ1, ψ1), . . . , (λL, ψL)}Rule ψk : ∀i W(i ,endothelium)⇒ Z(i ,3)

Weight λk > 0 (“strength” of rule)

Example KBλk ∀ ψk

5 i W(i ,embryo)⇒ Z(i ,3)500 i , j D(i , j) ∧ HasLabel(j ,+)⇒ ¬Z(i ,3)

Can specify “contradictory” domain knowledge

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 26 / 63

Page 87: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Domain knowledge in FOL

CNF Knowledge Base KB = {(λ1, ψ1), . . . , (λL, ψL)}Rule ψk : ∀i W(i ,endothelium)⇒ Z(i ,3)

Weight λk > 0 (“strength” of rule)

Example KBλk ∀ ψk

5 i W(i ,embryo)⇒ Z(i ,3)500 i , j D(i , j) ∧ HasLabel(j ,+)⇒ ¬Z(i ,3)

Can specify “contradictory” domain knowledge

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 26 / 63

Page 88: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Domain knowledge in FOL

CNF Knowledge Base KB = {(λ1, ψ1), . . . , (λL, ψL)}Rule ψk : ∀i W(i ,endothelium)⇒ Z(i ,3)

Weight λk > 0 (“strength” of rule)

Example KBλk ∀ ψk

5 i W(i ,embryo)⇒ Z(i ,3)500 i , j D(i , j) ∧ HasLabel(j ,+)⇒ ¬Z(i ,3)

Can specify “contradictory” domain knowledge

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 26 / 63

Page 89: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Domain knowledge in FOL

CNF Knowledge Base KB = {(λ1, ψ1), . . . , (λL, ψL)}Rule ψk : ∀i W(i ,endothelium)⇒ Z(i ,3)

Weight λk > 0 (“strength” of rule)

Example KBλk ∀ ψk

5 i W(i ,embryo)⇒ Z(i ,3)500 i , j D(i , j) ∧ HasLabel(j ,+)⇒ ¬Z(i ,3)

Can specify “contradictory” domain knowledge

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 26 / 63

Page 90: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Domain knowledge in FOL

CNF Knowledge Base KB = {(λ1, ψ1), . . . , (λL, ψL)}Rule ψk : ∀i W(i ,endothelium)⇒ Z(i ,3)

Weight λk > 0 (“strength” of rule)

Example KBλk ∀ ψk

5 i W(i ,embryo)⇒ Z(i ,3)500 i , j D(i , j) ∧ HasLabel(j ,+)⇒ ¬Z(i ,3)

Can specify “contradictory” domain knowledge

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 26 / 63

Page 91: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Domain knowledge in FOL

CNF Knowledge Base KB = {(λ1, ψ1), . . . , (λL, ψL)}Rule ψk : ∀i W(i ,endothelium)⇒ Z(i ,3)

Weight λk > 0 (“strength” of rule)

Example KBλk ∀ ψk

5 i W(i ,embryo)⇒ Z(i ,3)500 i , j D(i , j) ∧ HasLabel(j ,+)⇒ ¬Z(i ,3)

Can specify “contradictory” domain knowledge

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 26 / 63

Page 92: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Adding logic to LDA via propositionalization

Example Cannot-Link rule ψCL

λk ∀ ψk

5 i , j , t W(i ,neural) ∧ W(j ,disorder)⇒ ¬Z(i , t) ∨ ¬Z(j , t)

G(ψCL) = set of ground formulas g for every (i , j , t)i ∈ {1,2, . . . ,N}j ∈ {1,2, . . . ,N}t ∈ {1,2, . . . ,T}

Each g ∈ G(ψCL) associated with λ1g(z) term(Markov Logic Networks (MLNs), Richardson & Domingos, 2006)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 27 / 63

Page 93: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Adding logic to LDA via propositionalization

Example Cannot-Link rule ψCL

λk ∀ ψk

5 i , j , t W(i ,neural) ∧ W(j ,disorder)⇒ ¬Z(i , t) ∨ ¬Z(j , t)

G(ψCL) = set of ground formulas g for every (i , j , t)i ∈ {1,2, . . . ,N}j ∈ {1,2, . . . ,N}t ∈ {1,2, . . . ,T}

Each g ∈ G(ψCL) associated with λ1g(z) term(Markov Logic Networks (MLNs), Richardson & Domingos, 2006)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 27 / 63

Page 94: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Adding logic to LDA via propositionalization

Example Cannot-Link rule ψCL

λk ∀ ψk

5 i , j , t W(i ,neural) ∧ W(j ,disorder)⇒ ¬Z(i , t) ∨ ¬Z(j , t)

G(ψCL) = set of ground formulas g for every (i , j , t)i ∈ {1,2, . . . ,N}j ∈ {1,2, . . . ,N}t ∈ {1,2, . . . ,T}

Each g ∈ G(ψCL) associated with λ1g(z) term(Markov Logic Networks (MLNs), Richardson & Domingos, 2006)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 27 / 63

Page 95: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Adding logic to LDA via propositionalization

Example Cannot-Link rule ψCL

λk ∀ ψk

5 i , j , t W(i ,neural) ∧ W(j ,disorder)⇒ ¬Z(i , t) ∨ ¬Z(j , t)

G(ψCL) = set of ground formulas g for every (i , j , t)i ∈ {1,2, . . . ,N}j ∈ {1,2, . . . ,N}t ∈ {1,2, . . . ,T}

Each g ∈ G(ψCL) associated with λ1g(z) term(Markov Logic Networks (MLNs), Richardson & Domingos, 2006)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 27 / 63

Page 96: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Adding logic to LDA via propositionalization

Example Cannot-Link rule ψCL

λk ∀ ψk

5 i , j , t W(i ,neural) ∧ W(j ,disorder)⇒ ¬Z(i , t) ∨ ¬Z(j , t)

G(ψCL) = set of ground formulas g for every (i , j , t)i ∈ {1,2, . . . ,N}j ∈ {1,2, . . . ,N}t ∈ {1,2, . . . ,T}

Each g ∈ G(ψCL) associated with λ1g(z) term(Markov Logic Networks (MLNs), Richardson & Domingos, 2006)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 27 / 63

Page 97: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Adding logic to LDA via propositionalization

Example Cannot-Link rule ψCL

λk ∀ ψk

5 i , j , t W(i ,neural) ∧ W(j ,disorder)⇒ ¬Z(i , t) ∨ ¬Z(j , t)

G(ψCL) = set of ground formulas g for every (i , j , t)i ∈ {1,2, . . . ,N}j ∈ {1,2, . . . ,N}t ∈ {1,2, . . . ,T}

Each g ∈ G(ψCL) associated with λ1g(z) term(Markov Logic Networks (MLNs), Richardson & Domingos, 2006)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 27 / 63

Page 98: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LDA

P ∝

(T∏t

p(φt |β)

) D∏j

p(θj |α)

( N∏i

φzi (wi)θdi (zi)

)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 28 / 63

Page 99: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LDA

P ∝

(T∏t

p(φt |β)

) D∏j

p(θj |α)

( N∏i

φzi (wi)θdi (zi)

)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 28 / 63

Page 100: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LDA→ LogicLDA

P ∝

(T∏t

p(φt |β)

) D∏j

p(θj |α)

( N∏i

φzi (wi)θdi (zi)

)

× exp

L∑l

∑g∈G(ψl )

λl1g(z,w,d,o)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 28 / 63

Page 101: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA MAP inference

Find most probable (z, φ, θ)

Q(z, φ, θ) =T∑t

log p(φt |β) +D∑j

log p(θj |α)

+N∑i

logφzi (wi)θdi (zi) +L∑l

∑g∈G(ψl )

λl1g(z,w,d,o)

LDA termsLogic terms

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 29 / 63

Page 102: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA MAP inference

Find most probable (z, φ, θ)

Q(z, φ, θ) =T∑t

log p(φt |β) +D∑j

log p(θj |α)

+N∑i

logφzi (wi)θdi (zi) +L∑l

∑g∈G(ψl )

λl1g(z,w,d,o)

LDA termsLogic terms

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 29 / 63

Page 103: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA MAP inference

Find most probable (z, φ, θ)

Q(z, φ, θ) =T∑t

log p(φt |β) +D∑j

log p(θj |α)

+N∑i

logφzi (wi)θdi (zi) +L∑l

∑g∈G(ψl )

λl1g(z,w,d,o)

LDA termsLogic terms

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 29 / 63

Page 104: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference techniques

1 Collapsed Gibbs (CGS)2 LDA followed by MaxWalkSAT (MWS)3 MaxWalkSAT with LDA (M+L)4 Mirror descent (Mir)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 30 / 63

Page 105: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference techniques

1 Collapsed Gibbs (CGS)2 LDA followed by MaxWalkSAT (MWS)3 MaxWalkSAT with LDA (M+L)4 Mirror descent (Mir)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 30 / 63

Page 106: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference techniques

1 Collapsed Gibbs (CGS)2 LDA followed by MaxWalkSAT (MWS)3 MaxWalkSAT with LDA (M+L)4 Mirror descent (Mir)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 30 / 63

Page 107: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference techniques

1 Collapsed Gibbs (CGS)2 LDA followed by MaxWalkSAT (MWS)3 MaxWalkSAT with LDA (M+L)4 Mirror descent (Mir)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 30 / 63

Page 108: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 1: Collapsed Gibbs Sampling (CGS)

Combine collapsed Gibbs for LDA and MLNFor each zi , must consider all affected ground formulas gSelect the sample which maximizes the log-posterior Q(z, φ, θ)

IssuesMust consider all ground gNot intended to maximize QSlow mixing

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 31 / 63

Page 109: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 1: Collapsed Gibbs Sampling (CGS)

Combine collapsed Gibbs for LDA and MLNFor each zi , must consider all affected ground formulas gSelect the sample which maximizes the log-posterior Q(z, φ, θ)

IssuesMust consider all ground gNot intended to maximize QSlow mixing

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 31 / 63

Page 110: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 1: Collapsed Gibbs Sampling (CGS)

Combine collapsed Gibbs for LDA and MLNFor each zi , must consider all affected ground formulas gSelect the sample which maximizes the log-posterior Q(z, φ, θ)

IssuesMust consider all ground gNot intended to maximize QSlow mixing

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 31 / 63

Page 111: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 1: Collapsed Gibbs Sampling (CGS)

Combine collapsed Gibbs for LDA and MLNFor each zi , must consider all affected ground formulas gSelect the sample which maximizes the log-posterior Q(z, φ, θ)

IssuesMust consider all ground gNot intended to maximize QSlow mixing

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 31 / 63

Page 112: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 1: Collapsed Gibbs Sampling (CGS)

Combine collapsed Gibbs for LDA and MLNFor each zi , must consider all affected ground formulas gSelect the sample which maximizes the log-posterior Q(z, φ, θ)

IssuesMust consider all ground gNot intended to maximize QSlow mixing

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 31 / 63

Page 113: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 2: LDA then MaxWalkSAT

1 (z, φ, θ)← MAP inference with respect to LDA2 z← post-process to maximize weight of satisfied g

MaxWalkSAT (MWS) - Kautz et al, 1997For each step, sample an unsatisified clause

with probability p, satisfy by flipping an atom randomlyelse satisfy by flipping an atom greedily(with respect to global satisfied weight)

IssuesMust consider all ground gMay decrease full LogicLDA objective Q(satisfying KB may hurt LDA objective)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 32 / 63

Page 114: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 2: LDA then MaxWalkSAT

1 (z, φ, θ)← MAP inference with respect to LDA2 z← post-process to maximize weight of satisfied g

MaxWalkSAT (MWS) - Kautz et al, 1997For each step, sample an unsatisified clause

with probability p, satisfy by flipping an atom randomlyelse satisfy by flipping an atom greedily(with respect to global satisfied weight)

IssuesMust consider all ground gMay decrease full LogicLDA objective Q(satisfying KB may hurt LDA objective)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 32 / 63

Page 115: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 2: LDA then MaxWalkSAT

1 (z, φ, θ)← MAP inference with respect to LDA2 z← post-process to maximize weight of satisfied g

MaxWalkSAT (MWS) - Kautz et al, 1997For each step, sample an unsatisified clause

with probability p, satisfy by flipping an atom randomlyelse satisfy by flipping an atom greedily(with respect to global satisfied weight)

IssuesMust consider all ground gMay decrease full LogicLDA objective Q(satisfying KB may hurt LDA objective)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 32 / 63

Page 116: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 3: MaxWalkSAT with LDA (M+L)

For each step1 (φ, θ)← argmax

φ,θQ(z, φ, θ) z fixed

2 z← argmaxz

Q(z, φ, θ) (φ, θ) fixed

Maximizing with respect to zAlso consider LDA objective on greedy step

IssuesMust consider all ground g

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 33 / 63

Page 117: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 3: MaxWalkSAT with LDA (M+L)

For each step1 (φ, θ)← argmax

φ,θQ(z, φ, θ) z fixed

2 z← argmaxz

Q(z, φ, θ) (φ, θ) fixed

Maximizing with respect to zAlso consider LDA objective on greedy step

IssuesMust consider all ground g

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 33 / 63

Page 118: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 3: MaxWalkSAT with LDA (M+L)

For each step1 (φ, θ)← argmax

φ,θQ(z, φ, θ) z fixed

2 z← argmaxz

Q(z, φ, θ) (φ, θ) fixed

Maximizing with respect to zAlso consider LDA objective on greedy step

IssuesMust consider all ground g

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 33 / 63

Page 119: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 3: MaxWalkSAT with LDA (M+L)

For each step1 (φ, θ)← argmax

φ,θQ(z, φ, θ) z fixed

2 z← argmaxz

Q(z, φ, θ) (φ, θ) fixed

Maximizing with respect to zAlso consider LDA objective on greedy step

IssuesMust consider all ground g

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 33 / 63

Page 120: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 3: MaxWalkSAT with LDA (M+L)

For each step1 (φ, θ)← argmax

φ,θQ(z, φ, θ) z fixed

2 z← argmaxz

Q(z, φ, θ) (φ, θ) fixed

Maximizing with respect to zAlso consider LDA objective on greedy step

IssuesMust consider all ground g

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 33 / 63

Page 121: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Combinatorial explosion in G(ψk)

First-Order Cannot-Link rule ψCL

λk ∀ ψk

5 i , j , t W(i ,neural) ∧ W(j ,disorder)⇒ ¬Z(i , t) ∨ ¬Z(j , t)

This rule has a grounding g for EVERY (i , j , t)N = 106, T = 100� “financial bailout” numbers (uh oh)Markov Logic Network tricks fail (e.g., “lifted” inference)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 34 / 63

Page 122: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Combinatorial explosion in G(ψk)

First-Order Cannot-Link rule ψCL

λk ∀ ψk

5 i , j , t W(i ,neural) ∧ W(j ,disorder)⇒ ¬Z(i , t) ∨ ¬Z(j , t)

This rule has a grounding g for EVERY (i , j , t)N = 106, T = 100� “financial bailout” numbers (uh oh)Markov Logic Network tricks fail (e.g., “lifted” inference)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 34 / 63

Page 123: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Combinatorial explosion in G(ψk)

First-Order Cannot-Link rule ψCL

λk ∀ ψk

5 i , j , t W(i ,neural) ∧ W(j ,disorder)⇒ ¬Z(i , t) ∨ ¬Z(j , t)

This rule has a grounding g for EVERY (i , j , t)N = 106, T = 100� “financial bailout” numbers (uh oh)Markov Logic Network tricks fail (e.g., “lifted” inference)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 34 / 63

Page 124: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Ignore trivial rule groundingsShavlik & Natarajan, 2009

λk ∀ ψk

5 i W(i ,apple)⇒ Z(i ,3)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 35 / 63

Page 125: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Ignore trivial rule groundingsShavlik & Natarajan, 2009

λk ∀ ψk

5 i W(i ,apple)⇒ Z(i ,3)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 35 / 63

Page 126: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Ignore trivial rule groundingsShavlik & Natarajan, 2009

λk ∀ ψk

5 i W(i ,apple)⇒ Z(i ,3)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 35 / 63

Page 127: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 4: Mirror descent (Mir)

For each step1 (φ, θ)← argmax

φ,θQ(z, φ, θ) z fixed

2 z← argmaxz

Q(z, φ, θ) (φ, θ) fixed

z \ zKB ← argmax with respect to (φ, θ) TRIVIALzKB ← mirror descent HARD

Scalable approach to optimize zKB

1 Relax discrete problem to continuous2 Optimize relaxed problem with stochastic gradient descent3 Round relaxed z to recover final assignment

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 36 / 63

Page 128: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 4: Mirror descent (Mir)

For each step1 (φ, θ)← argmax

φ,θQ(z, φ, θ) z fixed

2 z← argmaxz

Q(z, φ, θ) (φ, θ) fixed

z \ zKB ← argmax with respect to (φ, θ) TRIVIALzKB ← mirror descent HARD

Scalable approach to optimize zKB

1 Relax discrete problem to continuous2 Optimize relaxed problem with stochastic gradient descent3 Round relaxed z to recover final assignment

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 36 / 63

Page 129: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 4: Mirror descent (Mir)

For each step1 (φ, θ)← argmax

φ,θQ(z, φ, θ) z fixed

2 z← argmaxz

Q(z, φ, θ) (φ, θ) fixed

z \ zKB ← argmax with respect to (φ, θ) TRIVIALzKB ← mirror descent HARD

Scalable approach to optimize zKB

1 Relax discrete problem to continuous2 Optimize relaxed problem with stochastic gradient descent3 Round relaxed z to recover final assignment

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 36 / 63

Page 130: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 4: Mirror descent (Mir)

For each step1 (φ, θ)← argmax

φ,θQ(z, φ, θ) z fixed

2 z← argmaxz

Q(z, φ, θ) (φ, θ) fixed

z \ zKB ← argmax with respect to (φ, θ) TRIVIALzKB ← mirror descent HARD

Scalable approach to optimize zKB

1 Relax discrete problem to continuous2 Optimize relaxed problem with stochastic gradient descent3 Round relaxed z to recover final assignment

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 36 / 63

Page 131: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 4: Mirror descent (Mir)

For each step1 (φ, θ)← argmax

φ,θQ(z, φ, θ) z fixed

2 z← argmaxz

Q(z, φ, θ) (φ, θ) fixed

z \ zKB ← argmax with respect to (φ, θ) TRIVIALzKB ← mirror descent HARD

Scalable approach to optimize zKB

1 Relax discrete problem to continuous2 Optimize relaxed problem with stochastic gradient descent3 Round relaxed z to recover final assignment

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 36 / 63

Page 132: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 4: Mirror descent (Mir)

For each step1 (φ, θ)← argmax

φ,θQ(z, φ, θ) z fixed

2 z← argmaxz

Q(z, φ, θ) (φ, θ) fixed

z \ zKB ← argmax with respect to (φ, θ) TRIVIALzKB ← mirror descent HARD

Scalable approach to optimize zKB

1 Relax discrete problem to continuous2 Optimize relaxed problem with stochastic gradient descent3 Round relaxed z to recover final assignment

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 36 / 63

Page 133: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 4: Mirror descent (Mir)

For each step1 (φ, θ)← argmax

φ,θQ(z, φ, θ) z fixed

2 z← argmaxz

Q(z, φ, θ) (φ, θ) fixed

z \ zKB ← argmax with respect to (φ, θ) TRIVIALzKB ← mirror descent HARD

Scalable approach to optimize zKB

1 Relax discrete problem to continuous2 Optimize relaxed problem with stochastic gradient descent3 Round relaxed z to recover final assignment

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 36 / 63

Page 134: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference 4: Mirror descent (Mir)

For each step1 (φ, θ)← argmax

φ,θQ(z, φ, θ) z fixed

2 z← argmaxz

Q(z, φ, θ) (φ, θ) fixed

z \ zKB ← argmax with respect to (φ, θ) TRIVIALzKB ← mirror descent HARD

Scalable approach to optimize zKB

1 Relax discrete problem to continuous2 Optimize relaxed problem with stochastic gradient descent3 Round relaxed z to recover final assignment

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 36 / 63

Page 135: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Represent 1 as a polynomial

g = Z(i ,1) ∨ ¬Z(j ,2), and t ∈ {1,2,3}

1 Take complement ¬g ¬Z(i ,1) ∧ Z(j ,2)

2 Remove negations (¬g)+ (Z(i ,2) ∨ Z(i ,3)) ∧ Z(j ,2)

3 Numeric zit ∈ {0,1} (zi2 + zi3)zj2

4 Polynomial 1g(z) 1− (zi2 + zi3)zj2

5 Relax discrete zit zit ∈ {0,1} → zit ∈ [0,1]

1g(z) = 1−∏gi 6=∅

∑Z(i,t)∈(¬gi )+

zit

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 37 / 63

Page 136: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Represent 1 as a polynomial

g = Z(i ,1) ∨ ¬Z(j ,2), and t ∈ {1,2,3}

1 Take complement ¬g ¬Z(i ,1) ∧ Z(j ,2)

2 Remove negations (¬g)+ (Z(i ,2) ∨ Z(i ,3)) ∧ Z(j ,2)

3 Numeric zit ∈ {0,1} (zi2 + zi3)zj2

4 Polynomial 1g(z) 1− (zi2 + zi3)zj2

5 Relax discrete zit zit ∈ {0,1} → zit ∈ [0,1]

1g(z) = 1−∏gi 6=∅

∑Z(i,t)∈(¬gi )+

zit

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 37 / 63

Page 137: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Represent 1 as a polynomial

g = Z(i ,1) ∨ ¬Z(j ,2), and t ∈ {1,2,3}

1 Take complement ¬g ¬Z(i ,1) ∧ Z(j ,2)

2 Remove negations (¬g)+ (Z(i ,2) ∨ Z(i ,3)) ∧ Z(j ,2)

3 Numeric zit ∈ {0,1} (zi2 + zi3)zj2

4 Polynomial 1g(z) 1− (zi2 + zi3)zj2

5 Relax discrete zit zit ∈ {0,1} → zit ∈ [0,1]

1g(z) = 1−∏gi 6=∅

∑Z(i,t)∈(¬gi )+

zit

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 37 / 63

Page 138: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Represent 1 as a polynomial

g = Z(i ,1) ∨ ¬Z(j ,2), and t ∈ {1,2,3}

1 Take complement ¬g ¬Z(i ,1) ∧ Z(j ,2)

2 Remove negations (¬g)+ (Z(i ,2) ∨ Z(i ,3)) ∧ Z(j ,2)

3 Numeric zit ∈ {0,1} (zi2 + zi3)zj2

4 Polynomial 1g(z) 1− (zi2 + zi3)zj2

5 Relax discrete zit zit ∈ {0,1} → zit ∈ [0,1]

1g(z) = 1−∏gi 6=∅

∑Z(i,t)∈(¬gi )+

zit

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 37 / 63

Page 139: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Represent 1 as a polynomial

g = Z(i ,1) ∨ ¬Z(j ,2), and t ∈ {1,2,3}

1 Take complement ¬g ¬Z(i ,1) ∧ Z(j ,2)

2 Remove negations (¬g)+ (Z(i ,2) ∨ Z(i ,3)) ∧ Z(j ,2)

3 Numeric zit ∈ {0,1} (zi2 + zi3)zj2

4 Polynomial 1g(z) 1− (zi2 + zi3)zj2

5 Relax discrete zit zit ∈ {0,1} → zit ∈ [0,1]

1g(z) = 1−∏gi 6=∅

∑Z(i,t)∈(¬gi )+

zit

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 37 / 63

Page 140: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Represent 1 as a polynomial

g = Z(i ,1) ∨ ¬Z(j ,2), and t ∈ {1,2,3}

1 Take complement ¬g ¬Z(i ,1) ∧ Z(j ,2)

2 Remove negations (¬g)+ (Z(i ,2) ∨ Z(i ,3)) ∧ Z(j ,2)

3 Numeric zit ∈ {0,1} (zi2 + zi3)zj2

4 Polynomial 1g(z) 1− (zi2 + zi3)zj2

5 Relax discrete zit zit ∈ {0,1} → zit ∈ [0,1]

1g(z) = 1−∏gi 6=∅

∑Z(i,t)∈(¬gi )+

zit

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 37 / 63

Page 141: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Represent 1 as a polynomial

g = Z(i ,1) ∨ ¬Z(j ,2), and t ∈ {1,2,3}

1 Take complement ¬g ¬Z(i ,1) ∧ Z(j ,2)

2 Remove negations (¬g)+ (Z(i ,2) ∨ Z(i ,3)) ∧ Z(j ,2)

3 Numeric zit ∈ {0,1} (zi2 + zi3)zj2

4 Polynomial 1g(z) 1− (zi2 + zi3)zj2

5 Relax discrete zit zit ∈ {0,1} → zit ∈ [0,1]

1g(z) = 1−∏gi 6=∅

∑Z(i,t)∈(¬gi )+

zit

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 37 / 63

Page 142: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Technical contribution: scalable zKB inference

argmaxz

N∑i

T∑t

zit logφzi (wi)θdi (zi) +L∑l

∑g∈G(ψl )

λl1g(z)

1 Continuous relaxationzi = tRepresent indicator function 1g(z) as polynomial in zitCan calculate ∇Q

2 Stochastic gradient - sample a term from objective function QLogic: single ground formula gLDA: single corpus index i

3 Entropic Mirror Descent (Beck & Teboulle, 2003)

zit ←zit exp (η∇zit f )∑t ′ zit ′ exp (η∇zit′ f )

4 Recover discrete z: zi = argmaxt

zit for i = 1, . . . ,N

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 38 / 63

Page 143: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Technical contribution: scalable zKB inference

argmaxz

N∑i

T∑t

zit logφzi (wi)θdi (zi) +L∑l

∑g∈G(ψl )

λl1g(z)

1 Continuous relaxationzi = t → zit ∈ {0,1}Represent indicator function 1g(z) as polynomial in zitCan calculate ∇Q

2 Stochastic gradient - sample a term from objective function QLogic: single ground formula gLDA: single corpus index i

3 Entropic Mirror Descent (Beck & Teboulle, 2003)

zit ←zit exp (η∇zit f )∑t ′ zit ′ exp (η∇zit′ f )

4 Recover discrete z: zi = argmaxt

zit for i = 1, . . . ,N

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 38 / 63

Page 144: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Technical contribution: scalable zKB inference

argmaxz

N∑i

T∑t

zit logφzi (wi)θdi (zi) +L∑l

∑g∈G(ψl )

λl1g(z)

1 Continuous relaxationzi = t → zit ∈ {0,1} → zit ∈ [0,1]Represent indicator function 1g(z) as polynomial in zitCan calculate ∇Q

2 Stochastic gradient - sample a term from objective function QLogic: single ground formula gLDA: single corpus index i

3 Entropic Mirror Descent (Beck & Teboulle, 2003)

zit ←zit exp (η∇zit f )∑t ′ zit ′ exp (η∇zit′ f )

4 Recover discrete z: zi = argmaxt

zit for i = 1, . . . ,N

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 38 / 63

Page 145: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Technical contribution: scalable zKB inference

argmaxz

N∑i

T∑t

zit logφzi (wi)θdi (zi) +L∑l

∑g∈G(ψl )

λl1g(z)

1 Continuous relaxationzi = t → zit ∈ {0,1} → zit ∈ [0,1]Represent indicator function 1g(z) as polynomial in zitCan calculate ∇Q

2 Stochastic gradient - sample a term from objective function QLogic: single ground formula gLDA: single corpus index i

3 Entropic Mirror Descent (Beck & Teboulle, 2003)

zit ←zit exp (η∇zit f )∑t ′ zit ′ exp (η∇zit′ f )

4 Recover discrete z: zi = argmaxt

zit for i = 1, . . . ,N

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 38 / 63

Page 146: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Technical contribution: scalable zKB inference

argmaxz

N∑i

T∑t

zit logφzi (wi)θdi (zi) +L∑l

∑g∈G(ψl )

λl1g(z)

1 Continuous relaxationzi = t → zit ∈ {0,1} → zit ∈ [0,1]Represent indicator function 1g(z) as polynomial in zitCan calculate ∇Q

2 Stochastic gradient - sample a term from objective function QLogic: single ground formula gLDA: single corpus index i

3 Entropic Mirror Descent (Beck & Teboulle, 2003)

zit ←zit exp (η∇zit f )∑t ′ zit ′ exp (η∇zit′ f )

4 Recover discrete z: zi = argmaxt

zit for i = 1, . . . ,N

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 38 / 63

Page 147: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Technical contribution: scalable zKB inference

argmaxz

N∑i

T∑t

zit logφzi (wi)θdi (zi) +L∑l

∑g∈G(ψl )

λl1g(z)

1 Continuous relaxationzi = t → zit ∈ {0,1} → zit ∈ [0,1]Represent indicator function 1g(z) as polynomial in zitCan calculate ∇Q...but G(ψk ) may be very large

2 Stochastic gradient - sample a term from objective function QLogic: single ground formula gLDA: single corpus index i

3 Entropic Mirror Descent (Beck & Teboulle, 2003)

zit ←zit exp (η∇zit f )∑t ′ zit ′ exp (η∇zit′ f )

4 Recover discrete z: zi = argmaxt

zit for i = 1, . . . ,N

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 38 / 63

Page 148: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Technical contribution: scalable zKB inference

argmaxz

N∑i

T∑t

zit logφzi (wi)θdi (zi) +L∑l

∑g∈G(ψl )

λl1g(z)

1 Continuous relaxationzi = t → zit ∈ {0,1} → zit ∈ [0,1]Represent indicator function 1g(z) as polynomial in zitCan calculate ∇Q...but G(ψk ) may be very large

2 Stochastic gradient - sample a term from objective function QLogic: single ground formula gLDA: single corpus index i

3 Entropic Mirror Descent (Beck & Teboulle, 2003)

zit ←zit exp (η∇zit f )∑t ′ zit ′ exp (η∇zit′ f )

4 Recover discrete z: zi = argmaxt

zit for i = 1, . . . ,N

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 38 / 63

Page 149: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Technical contribution: scalable zKB inference

argmaxz

N∑i

T∑t

zit logφzi (wi)θdi (zi) +L∑l

∑g∈G(ψl )

λl1g(z)

1 Continuous relaxationzi = t → zit ∈ {0,1} → zit ∈ [0,1]Represent indicator function 1g(z) as polynomial in zitCan calculate ∇Q...but G(ψk ) may be very large

2 Stochastic gradient - sample a term from objective function QLogic: single ground formula gLDA: single corpus index i

3 Entropic Mirror Descent (Beck & Teboulle, 2003)

zit ←zit exp (η∇zit f )∑t ′ zit ′ exp (η∇zit′ f )

4 Recover discrete z: zi = argmaxt

zit for i = 1, . . . ,N

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 38 / 63

Page 150: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Technical contribution: scalable zKB inference

argmaxz

N∑i

T∑t

zit logφzi (wi)θdi (zi) +L∑l

∑g∈G(ψl )

λl1g(z)

1 Continuous relaxationzi = t → zit ∈ {0,1} → zit ∈ [0,1]Represent indicator function 1g(z) as polynomial in zitCan calculate ∇Q...but G(ψk ) may be very large

2 Stochastic gradient - sample a term from objective function QLogic: single ground formula gLDA: single corpus index i

3 Entropic Mirror Descent (Beck & Teboulle, 2003)

zit ←zit exp (η∇zit f )∑t ′ zit ′ exp (η∇zit′ f )

4 Recover discrete z: zi = argmaxt

zit for i = 1, . . . ,N

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 38 / 63

Page 151: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Technical contribution: scalable zKB inference

argmaxz

N∑i

T∑t

zit logφzi (wi)θdi (zi) +L∑l

∑g∈G(ψl )

λl1g(z)

1 Continuous relaxationzi = t → zit ∈ {0,1} → zit ∈ [0,1]Represent indicator function 1g(z) as polynomial in zitCan calculate ∇Q...but G(ψk ) may be very large

2 Stochastic gradient - sample a term from objective function QLogic: single ground formula gLDA: single corpus index i

3 Entropic Mirror Descent (Beck & Teboulle, 2003)

zit ←zit exp (η∇zit f )∑t ′ zit ′ exp (η∇zit′ f )

4 Recover discrete z: zi = argmaxt

zit for i = 1, . . . ,N

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 38 / 63

Page 152: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Technical contribution: scalable zKB inference

argmaxz

N∑i

T∑t

zit logφzi (wi)θdi (zi) +L∑l

∑g∈G(ψl )

λl1g(z)

1 Continuous relaxationzi = t → zit ∈ {0,1} → zit ∈ [0,1]Represent indicator function 1g(z) as polynomial in zitCan calculate ∇Q...but G(ψk ) may be very large

2 Stochastic gradient - sample a term from objective function QLogic: single ground formula gLDA: single corpus index i

3 Entropic Mirror Descent (Beck & Teboulle, 2003)

zit ←zit exp (η∇zit f )∑t ′ zit ′ exp (η∇zit′ f )

4 Recover discrete z: zi = argmaxt

zit for i = 1, . . . ,N

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 38 / 63

Page 153: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA experiments

Synthetic and benchmark corpora (e.g., 20news)Simple KBs: at most a few rules eachQuestions

Can we influence the learned topics?How do the inference techniques perform?

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 39 / 63

Page 154: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA experiments

Synthetic and benchmark corpora (e.g., 20news)Simple KBs: at most a few rules eachQuestions

Can we influence the learned topics?How do the inference techniques perform?

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 39 / 63

Page 155: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA experiments

Synthetic and benchmark corpora (e.g., 20news)Simple KBs: at most a few rules eachQuestions

Can we influence the learned topics?How do the inference techniques perform?

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 39 / 63

Page 156: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA experiments

Synthetic and benchmark corpora (e.g., 20news)Simple KBs: at most a few rules eachQuestions

Can we influence the learned topics?How do the inference techniques perform?

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 39 / 63

Page 157: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA experiments

Synthetic and benchmark corpora (e.g., 20news)Simple KBs: at most a few rules eachQuestions

Can we influence the learned topics?How do the inference techniques perform?

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 39 / 63

Page 158: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Qualitative results: movie review corpus (Pol)

Movie review corpus (Pang & Lee, 2005)“movie” overrepresented in negative reviews“film” overrepresented in positive reviews

Goal: investigate this difference with topic models

Logic Cannot-Link (movie,film)W(i ,movie) ∧ W(j , film)⇒ (¬Z(i , t) ∨ ¬Z(j , t))

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 40 / 63

Page 159: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Qualitative results: movie review corpus (Pol)

Movie review corpus (Pang & Lee, 2005)“movie” overrepresented in negative reviews“film” overrepresented in positive reviews

Goal: investigate this difference with topic models

Logic Cannot-Link (movie,film)W(i ,movie) ∧ W(j , film)⇒ (¬Z(i , t) ∨ ¬Z(j , t))

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 40 / 63

Page 160: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Qualitative results: movie review corpus (Pol)

Movie review corpus (Pang & Lee, 2005)“movie” overrepresented in negative reviews“film” overrepresented in positive reviews

Goal: investigate this difference with topic models

Logic Cannot-Link (movie,film)W(i ,movie) ∧ W(j , film)⇒ (¬Z(i , t) ∨ ¬Z(j , t))

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 40 / 63

Page 161: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Qualitative results: movie review corpus (Pol)

Movie review corpus (Pang & Lee, 2005)“movie” overrepresented in negative reviews“film” overrepresented in positive reviews

Goal: investigate this difference with topic models

Logic Cannot-Link (movie,film)W(i ,movie) ∧ W(j , film)⇒ (¬Z(i , t) ∨ ¬Z(j , t))

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 40 / 63

Page 162: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Qualitative results: movie review corpus (Pol)

Movie review corpus (Pang & Lee, 2005)“movie” overrepresented in negative reviews“film” overrepresented in positive reviews

Goal: investigate this difference with topic models

Logic Cannot-Link (movie,film)W(i ,movie) ∧ W(j , film)⇒ (¬Z(i , t) ∨ ¬Z(j , t))

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 40 / 63

Page 163: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Qualitative results: movie review corpus (Pol)

Movie review corpus (Pang & Lee, 2005)“movie” overrepresented in negative reviews“film” overrepresented in positive reviews

Goal: investigate this difference with topic models

Logic Cannot-Link (movie,film)W(i ,movie) ∧ W(j , film)⇒ (¬Z(i , t) ∨ ¬Z(j , t))

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 40 / 63

Page 164: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Standard LDA topic

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 41 / 63

Page 165: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Standard LDA topic and LogicLDA topics

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 41 / 63

Page 166: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference results - maximize log-posterior Q(means over 10 randomized runs)

LogicLDA

Mir M+L MWS CGS LDA Alchemy

S1 8.64 9.08 8.19 9.08 3.83 1.43S2 3.44 3.49 3.49 3.47 2.17 2.72

Mac 3.35 3.36 2.49 2.48 2.24 −5.7NC

Comp 2.48 2.48 0.15 2.20 −0.84 NCCon 18.52 19.04 −3.79 16.68 −4.07 NCPol 9.63 NC NC NC 9.56 NC

HDG 116.80 NC NC NC 64.04 NC

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 42 / 63

Page 167: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference results - maximize log-posterior Q(means over 10 randomized runs)

LogicLDA

Mir M+L MWS CGS LDA Alchemy

S1 8.64 9.08 8.19 9.08 3.83 1.43S2 3.44 3.49 3.49 3.47 2.17 2.72

Mac 3.35 3.36 2.49 2.48 2.24 −5.7NC

Comp 2.48 2.48 0.15 2.20 −0.84 NCCon 18.52 19.04 −3.79 16.68 −4.07 NCPol 9.63 NC NC NC 9.56 NC

HDG 116.80 NC NC NC 64.04 NC

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 42 / 63

Page 168: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference results - maximize log-posterior Q(means over 10 randomized runs)

LogicLDA

Mir M+L MWS CGS LDA Alchemy

S1 8.64 9.08 8.19 9.08 3.83 1.43S2 3.44 3.49 3.49 3.47 2.17 2.72

Mac 3.35 3.36 2.49 2.48 2.24 −5.7NC

Comp 2.48 2.48 0.15 2.20 −0.84 NCCon 18.52 19.04 −3.79 16.68 −4.07 NCPol 9.63 NC NC NC 9.56 NC

HDG 116.80 NC NC NC 64.04 NC

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 42 / 63

Page 169: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference results - maximize log-posterior Q(means over 10 randomized runs)

LogicLDA

Mir M+L MWS CGS LDA Alchemy

S1 8.64 9.08 8.19 9.08 3.83 1.43S2 3.44 3.49 3.49 3.47 2.17 2.72

Mac 3.35 3.36 2.49 2.48 2.24 −5.7NC

Comp 2.48 2.48 0.15 2.20 −0.84 NCCon 18.52 19.04 −3.79 16.68 −4.07 NCPol 9.63 NC NC NC 9.56 NC

HDG 116.80 NC NC NC 64.04 NC

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 42 / 63

Page 170: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Inference results - maximize log-posterior Q(means over 10 randomized runs)

LogicLDA

Mir M+L MWS CGS LDA Alchemy

S1 8.64 9.08 8.19 9.08 3.83 1.43S2 3.44 3.49 3.49 3.47 2.17 2.72

Mac 3.35 3.36 2.49 2.48 2.24 −5.7NC

Comp 2.48 2.48 0.15 2.20 −0.84 NCCon 18.52 19.04 −3.79 16.68 −4.07 NCPol 9.63 NC NC NC 9.56 NC

HDG 116.80 NC NC NC 64.04 NC

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 42 / 63

Page 171: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA can encode existing LDA variants

Example: Hidden Topic Markov Model (HTMM) - Gruber et al, 2007Each sentence uses only one topicTopic transitions possible between sentences with probability ε

FOL encoding of HTMM

λk ∀ ψk

∞ i , j , s, t S(i , s) ∧ S(j , s) ∧ Z(i , t)⇒ Z(j , t)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 43 / 63

Page 172: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA can encode existing LDA variants

Example: Hidden Topic Markov Model (HTMM) - Gruber et al, 2007Each sentence uses only one topicTopic transitions possible between sentences with probability ε

FOL encoding of HTMM

λk ∀ ψk

∞ i , j , s, t S(i , s) ∧ S(j , s) ∧ Z(i , t)⇒ Z(j , t)− log ε i , s, t S(i , s) ∧ ¬S(i + 1, s) ∧ Z(i , t)⇒ Z(i + 1, t)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 43 / 63

Page 173: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA summary

ContributionFOL domain knowledge

Generalize preliminary workSide informationDependencies among zi

Scalable inferenceAllows us to find z balancing LDA and logicNecessary for general KBs

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 44 / 63

Page 174: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA summary

ContributionFOL domain knowledge

Generalize preliminary workSide informationDependencies among zi

Scalable inferenceAllows us to find z balancing LDA and logicNecessary for general KBs

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 44 / 63

Page 175: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA summary

ContributionFOL domain knowledge

Generalize preliminary workSide informationDependencies among zi

Scalable inferenceAllows us to find z balancing LDA and logicNecessary for general KBs

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 44 / 63

Page 176: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA summary

ContributionFOL domain knowledge

Generalize preliminary workSide informationDependencies among zi

Scalable inferenceAllows us to find z balancing LDA and logicNecessary for general KBs

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 44 / 63

Page 177: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA summary

ContributionFOL domain knowledge

Generalize preliminary workSide informationDependencies among zi

Scalable inferenceAllows us to find z balancing LDA and logicNecessary for general KBs

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 44 / 63

Page 178: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA summary

ContributionFOL domain knowledge

Generalize preliminary workSide informationDependencies among zi

Scalable inferenceAllows us to find z balancing LDA and logicNecessary for general KBs

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 44 / 63

Page 179: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA summary

ContributionFOL domain knowledge

Generalize preliminary workSide informationDependencies among zi

Scalable inferenceAllows us to find z balancing LDA and logicNecessary for general KBs

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 44 / 63

Page 180: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Outline

1 Topic modelingLatent Dirichlet Allocation (LDA)IssuesRelated work

2 Preliminary work

3 New workLogicLDABiological text mining

4 ConclusionDiscussionFuture work

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 45 / 63

Page 181: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

New application: biological text mining(in preparation)

Ron Stewart (Thomson lab) is interested in connections betweenexperimentally interesting geneshuman development concepts of interest

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 46 / 63

Page 182: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Task setup

Given “seed” terms for each conceptDo discover other related terms

Concept Provided terms

Neural neur dendro(cyte), glia, synapse, neural crestEmbryo human embryonic stem cell, inner cell mass, pluripotentBlood hematopoietic, blood, endothel(ium)

Gastrulation organizer, gastru(late)Cardiac heart, ventricle, auricle, aorta

Limb limb, blastema, zeugopod, autopod, stylopod

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 47 / 63

Page 183: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Task setup

Given “seed” terms for each conceptDo discover other related terms

Concept Provided terms

Neural neur dendro(cyte), glia, synapse, neural crestEmbryo human embryonic stem cell, inner cell mass, pluripotentBlood hematopoietic, blood, endothel(ium)

Gastrulation organizer, gastru(late)Cardiac heart, ventricle, auricle, aorta

Limb limb, blastema, zeugopod, autopod, stylopod

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 47 / 63

Page 184: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Standard LDA: Neural concept

Do standard LDA, then find topics containing seed terms in Top 50

brain system nervous neurons neuronal central development ’s neuralhuman gene disease function cortex spinal disorders developing motorcerebral glial peripheral cortical cord disorder astrocytes nerveneurological regions suggest schizophrenia including syndromeneurodegenerative mental involved retardation behavior cerebellummigration behavioral abnormal cerebellar found precursor resultsamyloid hippocampus sclerosis neurotrophic present

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 48 / 63

Page 185: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Standard LDA: Neural concept

Do standard LDA, then find topics containing seed terms in Top 50

brain system nervous neurons neuronal central development ’s neuralhuman gene disease function cortex spinal disorders developing motorcerebral glial peripheral cortical cord disorder astrocytes nerveneurological regions suggest schizophrenia including syndromeneurodegenerative mental involved retardation behavior cerebellummigration behavioral abnormal cerebellar found precursor resultsamyloid hippocampus sclerosis neurotrophic present

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 48 / 63

Page 186: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Seed and n-gram rules

Neural→ “synapse”→ Topic 0W(i , synapse)⇒ Z(i ,0)

Embryo→ “inner cell mass”→ Topic 1

W(i , inner) ∧ W(i + 1, cell) ∧ W(i + 2,mass)⇒ Z(i ,1)

W(i − 1, inner) ∧ W(i , cell) ∧ W(i + 1,mass)⇒ Z(i ,1)

W(i − 2, inner) ∧ W(i − 1, cell) ∧ W(i ,mass)⇒ Z(i ,1)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 49 / 63

Page 187: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Seed and n-gram rules

Neural→ “synapse”→ Topic 0W(i , synapse)⇒ Z(i ,0)

Embryo→ “inner cell mass”→ Topic 1

W(i , inner) ∧ W(i + 1, cell) ∧ W(i + 2,mass)⇒ Z(i ,1)

W(i − 1, inner) ∧ W(i , cell) ∧ W(i + 1,mass)⇒ Z(i ,1)

W(i − 2, inner) ∧ W(i − 1, cell) ∧ W(i ,mass)⇒ Z(i ,1)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 49 / 63

Page 188: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Sentence inclusion

Create new development Topic 6{differentiation, maturation, develops, formation, differentiates}Development Topic 6 allows each seed Topic t in sentencet ∈ {0, . . . ,5}

Sentence(i , i1, . . . , iSk ) ∧ ¬Z(i1,6) ∧ . . . ∧ ¬Z(iSk ,6)⇒ ¬Z(i ,0)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 50 / 63

Page 189: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Sentence inclusion

Create new development Topic 6{differentiation, maturation, develops, formation, differentiates}Development Topic 6 allows each seed Topic t in sentencet ∈ {0, . . . ,5}

Sentence(i , i1, . . . , iSk ) ∧ ¬Z(i1,6) ∧ . . . ∧ ¬Z(iSk ,6)⇒ ¬Z(i ,0)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 50 / 63

Page 190: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Sentence exclusion

Create new disease Topic 7{patient, disease, parasite, chronic, virus, condition, disorder,symptom}Disease Topic 7 prevents each seed Topic t in sentencet ∈ {0, . . . ,5}

S(i , s) ∧ S(j , s) ∧ Z(i ,7)⇒ ¬Z(j ,0)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 51 / 63

Page 191: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Experimental setup

Different KBsLDA - Standard LDASEED - SeedINCL - Seed + InclusionEXCL - Seed + ExclusionALL - Seed + Inclusion + Exclusion

Labeled Top 50 words for each KB

Would knowing that this word is (statistically) associated with a geneincrease your belief that the gene is related to the target concept?

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 52 / 63

Page 192: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Experimental setup

Different KBsLDA - Standard LDASEED - SeedINCL - Seed + InclusionEXCL - Seed + ExclusionALL - Seed + Inclusion + Exclusion

Labeled Top 50 words for each KB

Would knowing that this word is (statistically) associated with a geneincrease your belief that the gene is related to the target concept?

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 52 / 63

Page 193: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Experimental setup

Different KBsLDA - Standard LDASEED - SeedINCL - Seed + InclusionEXCL - Seed + ExclusionALL - Seed + Inclusion + Exclusion

Labeled Top 50 words for each KB

Would knowing that this word is (statistically) associated with a geneincrease your belief that the gene is related to the target concept?

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 52 / 63

Page 194: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Accuracy at Top 50 threshold(means over 10 randomized runs)

LogicLDA KBs

ALL INCL EXCL SEED LDA

Neural 0.59 0.57 0.54 0.54 0.31Embryo 0.24 0.24 0.23 0.23 0.07Blood 0.46 0.47 0.40 0.39 0.13Gast. 0.18 0.18 0.16 0.16 0.00

Cardiac 0.36 0.37 0.34 0.35 0.08Limb 0.18 0.18 0.15 0.14 0.09

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 53 / 63

Page 195: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Accuracy at Top 50 threshold(means over 10 randomized runs)

LogicLDA KBs

ALL INCL EXCL SEED LDA

Neural 0.59 0.57 0.54 0.54 0.31Embryo 0.24 0.24 0.23 0.23 0.07Blood 0.46 0.47 0.40 0.39 0.13Gast. 0.18 0.18 0.16 0.16 0.00

Cardiac 0.36 0.37 0.34 0.35 0.08Limb 0.18 0.18 0.15 0.14 0.09

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 53 / 63

Page 196: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Accuracy at Top 50 threshold(means over 10 randomized runs)

LogicLDA KBs

ALL INCL EXCL SEED LDA

Neural 0.59 0.57 0.54 0.54 0.31Embryo 0.24 0.24 0.23 0.23 0.07Blood 0.46 0.47 0.40 0.39 0.13Gast. 0.18 0.18 0.16 0.16 0.00

Cardiac 0.36 0.37 0.34 0.35 0.08Limb 0.18 0.18 0.15 0.14 0.09

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 53 / 63

Page 197: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Accuracy at Top 50 threshold(means over 10 randomized runs)

LogicLDA KBs

ALL INCL EXCL SEED LDA

Neural 0.59 0.57 0.54 0.54 0.31Embryo 0.24 0.24 0.23 0.23 0.07Blood 0.46 0.47 0.40 0.39 0.13Gast. 0.18 0.18 0.16 0.16 0.00

Cardiac 0.36 0.37 0.34 0.35 0.08Limb 0.18 0.18 0.15 0.14 0.09

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 53 / 63

Page 198: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Precision-Recall Neural

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 54 / 63

Page 199: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Precision-Recall Blood

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 55 / 63

Page 200: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Precision-Recall Cardiac

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 56 / 63

Page 201: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Novel terms discovered

INCL and ALL for neural{dendritic, forebrain, hindbrain, microglial, motoneurons, neuroblasts,neurogenesis, retinal}

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 57 / 63

Page 202: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Biological application

ContributionApply topic modeling to real-world text mining problemLogicLDA allows us to outperform standard LDA

Exploit sentence informationBetter quantitative performanceDiscover novel terms related to target concept

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 58 / 63

Page 203: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Biological application

ContributionApply topic modeling to real-world text mining problemLogicLDA allows us to outperform standard LDA

Exploit sentence informationBetter quantitative performanceDiscover novel terms related to target concept

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 58 / 63

Page 204: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Biological application

ContributionApply topic modeling to real-world text mining problemLogicLDA allows us to outperform standard LDA

Exploit sentence informationBetter quantitative performanceDiscover novel terms related to target concept

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 58 / 63

Page 205: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Biological application

ContributionApply topic modeling to real-world text mining problemLogicLDA allows us to outperform standard LDA

Exploit sentence informationBetter quantitative performanceDiscover novel terms related to target concept

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 58 / 63

Page 206: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Biological application

ContributionApply topic modeling to real-world text mining problemLogicLDA allows us to outperform standard LDA

Exploit sentence informationBetter quantitative performanceDiscover novel terms related to target concept

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 58 / 63

Page 207: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Outline

1 Topic modelingLatent Dirichlet Allocation (LDA)IssuesRelated work

2 Preliminary work

3 New workLogicLDABiological text mining

4 ConclusionDiscussionFuture work

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 59 / 63

Page 208: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent topic models + domain knowledgeCode available http://pages.cs.wisc.edu/~andrzeje/software.html

∆LDA (ECML 2007)Restricted topicsStatistical debugging

Topic-in-set (SSLNLP 2009)Topic assignments zi

“Seed” words

Dirichlet Forest prior (ICML 2009)Must-Link and Cannot-Link over topics φComposite operations (split,merge,isolate)

LogicLDA (submitted to NIPS 2010)General FOL domain knowledgeScalable inference methodsBiological text mining application

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 60 / 63

Page 209: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent topic models + domain knowledgeCode available http://pages.cs.wisc.edu/~andrzeje/software.html

∆LDA (ECML 2007)Restricted topicsStatistical debugging

Topic-in-set (SSLNLP 2009)Topic assignments zi

“Seed” words

Dirichlet Forest prior (ICML 2009)Must-Link and Cannot-Link over topics φComposite operations (split,merge,isolate)

LogicLDA (submitted to NIPS 2010)General FOL domain knowledgeScalable inference methodsBiological text mining application

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 60 / 63

Page 210: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent topic models + domain knowledgeCode available http://pages.cs.wisc.edu/~andrzeje/software.html

∆LDA (ECML 2007)Restricted topicsStatistical debugging

Topic-in-set (SSLNLP 2009)Topic assignments zi

“Seed” words

Dirichlet Forest prior (ICML 2009)Must-Link and Cannot-Link over topics φComposite operations (split,merge,isolate)

LogicLDA (submitted to NIPS 2010)General FOL domain knowledgeScalable inference methodsBiological text mining application

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 60 / 63

Page 211: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Latent topic models + domain knowledgeCode available http://pages.cs.wisc.edu/~andrzeje/software.html

∆LDA (ECML 2007)Restricted topicsStatistical debugging

Topic-in-set (SSLNLP 2009)Topic assignments zi

“Seed” words

Dirichlet Forest prior (ICML 2009)Must-Link and Cannot-Link over topics φComposite operations (split,merge,isolate)

LogicLDA (submitted to NIPS 2010)General FOL domain knowledgeScalable inference methodsBiological text mining application

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 60 / 63

Page 212: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Other work

True vs computer-generated Mondrians (SPIE 2010)

Identify wishes in text (NAACL HLT 2009)Incorporate “diversity” in rankings (NAACL HLT 2007)Find text passages relevant to biomedical queries (TREC 2006)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 61 / 63

Page 213: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Other work

True vs computer-generated Mondrians (SPIE 2010)

Identify wishes in text (NAACL HLT 2009)Incorporate “diversity” in rankings (NAACL HLT 2007)Find text passages relevant to biomedical queries (TREC 2006)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 61 / 63

Page 214: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Other work

True vs computer-generated Mondrians (SPIE 2010)

Identify wishes in text (NAACL HLT 2009)Incorporate “diversity” in rankings (NAACL HLT 2007)Find text passages relevant to biomedical queries (TREC 2006)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 61 / 63

Page 215: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Other work

True vs computer-generated Mondrians (SPIE 2010)

Identify wishes in text (NAACL HLT 2009)Incorporate “diversity” in rankings (NAACL HLT 2007)Find text passages relevant to biomedical queries (TREC 2006)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 61 / 63

Page 216: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Potential future directions

Active learning: solicit user feedback which will be most “useful”Identify “weak” topics (split candidates)Develop better “interpretability” objective function(Chang et al., NIPS 2009)

Multimodal knowledgeText + Image (captions), Text + Experimental Data (biology)Interaction currently “hard-coded” into graphical modelAllow user to specify how data types interact

Domain knowledge for secondary tasksOften topics are to be used as input to another algorithmHow to best use domain knowledge to learn suitable topics

Statistical relational learning (SRL)Example: Citation(d ,d ′)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 62 / 63

Page 217: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Potential future directions

Active learning: solicit user feedback which will be most “useful”Identify “weak” topics (split candidates)Develop better “interpretability” objective function(Chang et al., NIPS 2009)

Multimodal knowledgeText + Image (captions), Text + Experimental Data (biology)Interaction currently “hard-coded” into graphical modelAllow user to specify how data types interact

Domain knowledge for secondary tasksOften topics are to be used as input to another algorithmHow to best use domain knowledge to learn suitable topics

Statistical relational learning (SRL)Example: Citation(d ,d ′)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 62 / 63

Page 218: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Potential future directions

Active learning: solicit user feedback which will be most “useful”Identify “weak” topics (split candidates)Develop better “interpretability” objective function(Chang et al., NIPS 2009)

Multimodal knowledgeText + Image (captions), Text + Experimental Data (biology)Interaction currently “hard-coded” into graphical modelAllow user to specify how data types interact

Domain knowledge for secondary tasksOften topics are to be used as input to another algorithmHow to best use domain knowledge to learn suitable topics

Statistical relational learning (SRL)Example: Citation(d ,d ′)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 62 / 63

Page 219: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Potential future directions

Active learning: solicit user feedback which will be most “useful”Identify “weak” topics (split candidates)Develop better “interpretability” objective function(Chang et al., NIPS 2009)

Multimodal knowledgeText + Image (captions), Text + Experimental Data (biology)Interaction currently “hard-coded” into graphical modelAllow user to specify how data types interact

Domain knowledge for secondary tasksOften topics are to be used as input to another algorithmHow to best use domain knowledge to learn suitable topics

Statistical relational learning (SRL)Example: Citation(d ,d ′)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 62 / 63

Page 220: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Potential future directions

Active learning: solicit user feedback which will be most “useful”Identify “weak” topics (split candidates)Develop better “interpretability” objective function(Chang et al., NIPS 2009)

Multimodal knowledgeText + Image (captions), Text + Experimental Data (biology)Interaction currently “hard-coded” into graphical modelAllow user to specify how data types interact

Domain knowledge for secondary tasksOften topics are to be used as input to another algorithmHow to best use domain knowledge to learn suitable topics

Statistical relational learning (SRL)Example: Citation(d ,d ′)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 62 / 63

Page 221: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Potential future directions

Active learning: solicit user feedback which will be most “useful”Identify “weak” topics (split candidates)Develop better “interpretability” objective function(Chang et al., NIPS 2009)

Multimodal knowledgeText + Image (captions), Text + Experimental Data (biology)Interaction currently “hard-coded” into graphical modelAllow user to specify how data types interact

Domain knowledge for secondary tasksOften topics are to be used as input to another algorithmHow to best use domain knowledge to learn suitable topics

Statistical relational learning (SRL)Example: Citation(d ,d ′)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 62 / 63

Page 222: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Potential future directions

Active learning: solicit user feedback which will be most “useful”Identify “weak” topics (split candidates)Develop better “interpretability” objective function(Chang et al., NIPS 2009)

Multimodal knowledgeText + Image (captions), Text + Experimental Data (biology)Interaction currently “hard-coded” into graphical modelAllow user to specify how data types interact

Domain knowledge for secondary tasksOften topics are to be used as input to another algorithmHow to best use domain knowledge to learn suitable topics

Statistical relational learning (SRL)Example: Citation(d ,d ′)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 62 / 63

Page 223: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Potential future directions

Active learning: solicit user feedback which will be most “useful”Identify “weak” topics (split candidates)Develop better “interpretability” objective function(Chang et al., NIPS 2009)

Multimodal knowledgeText + Image (captions), Text + Experimental Data (biology)Interaction currently “hard-coded” into graphical modelAllow user to specify how data types interact

Domain knowledge for secondary tasksOften topics are to be used as input to another algorithmHow to best use domain knowledge to learn suitable topics

Statistical relational learning (SRL)Example: Citation(d ,d ′)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 62 / 63

Page 224: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Potential future directions

Active learning: solicit user feedback which will be most “useful”Identify “weak” topics (split candidates)Develop better “interpretability” objective function(Chang et al., NIPS 2009)

Multimodal knowledgeText + Image (captions), Text + Experimental Data (biology)Interaction currently “hard-coded” into graphical modelAllow user to specify how data types interact

Domain knowledge for secondary tasksOften topics are to be used as input to another algorithmHow to best use domain knowledge to learn suitable topics

Statistical relational learning (SRL)Example: Citation(d ,d ′)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 62 / 63

Page 225: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Potential future directions

Active learning: solicit user feedback which will be most “useful”Identify “weak” topics (split candidates)Develop better “interpretability” objective function(Chang et al., NIPS 2009)

Multimodal knowledgeText + Image (captions), Text + Experimental Data (biology)Interaction currently “hard-coded” into graphical modelAllow user to specify how data types interact

Domain knowledge for secondary tasksOften topics are to be used as input to another algorithmHow to best use domain knowledge to learn suitable topics

Statistical relational learning (SRL)Example: Citation(d ,d ′)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 62 / 63

Page 226: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Potential future directions

Active learning: solicit user feedback which will be most “useful”Identify “weak” topics (split candidates)Develop better “interpretability” objective function(Chang et al., NIPS 2009)

Multimodal knowledgeText + Image (captions), Text + Experimental Data (biology)Interaction currently “hard-coded” into graphical modelAllow user to specify how data types interact

Domain knowledge for secondary tasksOften topics are to be used as input to another algorithmHow to best use domain knowledge to learn suitable topics

Statistical relational learning (SRL)Example: Citation(d ,d ′)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 62 / 63

Page 227: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Potential future directions

Active learning: solicit user feedback which will be most “useful”Identify “weak” topics (split candidates)Develop better “interpretability” objective function(Chang et al., NIPS 2009)

Multimodal knowledgeText + Image (captions), Text + Experimental Data (biology)Interaction currently “hard-coded” into graphical modelAllow user to specify how data types interact

Domain knowledge for secondary tasksOften topics are to be used as input to another algorithmHow to best use domain knowledge to learn suitable topics

Statistical relational learning (SRL)Example: Citation(d ,d ′)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 62 / 63

Page 228: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Acknowledgements

Prelim and thesis committee membersMark Craven (co-advisor)Xiaojin Zhu (co-advisor)Jude ShavlikBen LiblitMichael NewtonBen Recht

UW–Madison AI community: Zhu and Craven groups, [Mlgroup],AIRG,. . .

Biological collaborators: Brandi Gancarz and Ron Stewart

Talk feedback: Ameet, Andreas, Bryan, Debbie, and Yimin

FundingComputation and Informatics in Biology and Medicine trainingprogram (CIBM, NIH/NLM T15 LM07359)Wisconsin Alumni Research Foundation (WARF)NIH grant R01 LM07050

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 63 / 63

Page 229: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Statistical debugging with latent topic models

Predicates → VocabularyPredicate counts → Word counts

Program run → Bag-of-words documentDebugging → Latent topic analysis

Bug patterns → Topics

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 64 / 63

Page 230: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Statistical debugging with latent topic models

Predicates → VocabularyPredicate counts → Word counts

Program run → Bag-of-words documentDebugging → Latent topic analysis

Bug patterns → Topics

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 64 / 63

Page 231: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Statistical debugging with latent topic models

Predicates → VocabularyPredicate counts → Word counts

Program run → Bag-of-words documentDebugging → Latent topic analysis

Bug patterns → Topics

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 64 / 63

Page 232: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Statistical debugging with latent topic models

Predicates → VocabularyPredicate counts → Word counts

Program run → Bag-of-words documentDebugging → Latent topic analysis

Bug patterns → Topics

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 64 / 63

Page 233: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Statistical debugging with latent topic models

Predicates → VocabularyPredicate counts → Word counts

Program run → Bag-of-words documentDebugging → Latent topic analysis

Bug patterns → Topics

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 64 / 63

Page 234: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Experimental results

Rand index vs true clustering (1 = total agreement)

∆LDA [1] [2]exif 1.00 0.88 1.00grep 0.97 0.71 0.77gzip 0.89 1.00 1.00moss 0.93 0.93 0.96

Baselines1 Statistical Debugging: Simultaneous Isolation of Multiple Bugs

(Zheng et al., ICML 2006)2 Scalable Statistical Bug Isolation (Liblit et al., PLDI 2005)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 65 / 63

Page 235: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Why latent topic modeling?

Graphical model→ composability/extensibilityEmbed other side information in ∆LDA (e.g., static analysis)Use ∆LDA probabilities in other procedures

Interpretable topics (gzip)Usage topic 1: longest_match() function(many redundant strings)Usage topic 2: command-line parsing, exit code(a “dry” program run)Usage topic 3: deflate_fast()(fast algorithm command-line flag)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 66 / 63

Page 236: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Why latent topic modeling?

Graphical model→ composability/extensibilityEmbed other side information in ∆LDA (e.g., static analysis)Use ∆LDA probabilities in other procedures

Interpretable topics (gzip)Usage topic 1: longest_match() function(many redundant strings)Usage topic 2: command-line parsing, exit code(a “dry” program run)Usage topic 3: deflate_fast()(fast algorithm command-line flag)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 66 / 63

Page 237: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Why latent topic modeling?

Graphical model→ composability/extensibilityEmbed other side information in ∆LDA (e.g., static analysis)Use ∆LDA probabilities in other procedures

Interpretable topics (gzip)Usage topic 1: longest_match() function(many redundant strings)Usage topic 2: command-line parsing, exit code(a “dry” program run)Usage topic 3: deflate_fast()(fast algorithm command-line flag)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 66 / 63

Page 238: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Why latent topic modeling?

Graphical model→ composability/extensibilityEmbed other side information in ∆LDA (e.g., static analysis)Use ∆LDA probabilities in other procedures

Interpretable topics (gzip)Usage topic 1: longest_match() function(many redundant strings)Usage topic 2: command-line parsing, exit code(a “dry” program run)Usage topic 3: deflate_fast()(fast algorithm command-line flag)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 66 / 63

Page 239: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Why latent topic modeling?

Graphical model→ composability/extensibilityEmbed other side information in ∆LDA (e.g., static analysis)Use ∆LDA probabilities in other procedures

Interpretable topics (gzip)Usage topic 1: longest_match() function(many redundant strings)Usage topic 2: command-line parsing, exit code(a “dry” program run)Usage topic 3: deflate_fast()(fast algorithm command-line flag)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 66 / 63

Page 240: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Why latent topic modeling?

Graphical model→ composability/extensibilityEmbed other side information in ∆LDA (e.g., static analysis)Use ∆LDA probabilities in other procedures

Interpretable topics (gzip)Usage topic 1: longest_match() function(many redundant strings)Usage topic 2: command-line parsing, exit code(a “dry” program run)Usage topic 3: deflate_fast()(fast algorithm command-line flag)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 66 / 63

Page 241: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Why latent topic modeling?

Graphical model→ composability/extensibilityEmbed other side information in ∆LDA (e.g., static analysis)Use ∆LDA probabilities in other procedures

Interpretable topics (gzip)Usage topic 1: longest_match() function(many redundant strings)Usage topic 2: command-line parsing, exit code(a “dry” program run)Usage topic 3: deflate_fast()(fast algorithm command-line flag)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 66 / 63

Page 242: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

∆LDA toy example

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 67 / 63

Page 243: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

∆LDA toy example

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 67 / 63

Page 244: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

∆LDA toy example

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 67 / 63

Page 245: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

∆LDA toy example

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 67 / 63

Page 246: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Debugging Dataset

Runs

Program Lines of Code Bugs Successful Failing

exif 10,611 2 352 30grep 15,721 2 609 200gzip 8,960 2 29 186moss 35,223 8 1727 1228

Topics

Program Word Types Usage Bug

exif 20 7 2grep 2,071 5 2gzip 3,929 5 2moss 1,982 14 8

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 68 / 63

Page 247: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

CBI Results

grep

0 0.5 10

0.2

0.4

0.6

0.8

1

topic 6

topi

c 7

56 bug1 runs144 bug2 runs

moss

−1

0

1

−0.6−0.4−0.200.20.40.60.8

−1

−0.5

0

0.5

PC

A3

PCA2PCA1

254 bug1 runs106 bug3 runs147 bug4 runs329 bug5 runs206 bug8 runs186 other runs

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 69 / 63

Page 248: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

z-labels via undirected LDA

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 70 / 63

Page 249: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

z-labels via undirected LDA

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 70 / 63

Page 250: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

z-labels via undirected LDA

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 70 / 63

Page 251: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

z-labels via undirected LDA

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 70 / 63

Page 252: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

z-labels via undirected LDA

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 70 / 63

Page 253: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Translation

Given: “translation” seed wordsDo: use z-labels to build “translation” topicLots of relevant/seed termsOverall captures “translation” concept

Topic 0

translation, ribosomal, trna, rrna, initiation, ribosome, proteinribosomes, is, factor, processing, translational, nucleolarpre-rrna, synthesis, small, 60s, eukaryotic, biogenesis, subunittrnas, subunits, large, nucleolus, factors, 40, synthetase, freemodification, rna, depletion, eif-2, initiator, 40s, ef-3anticodon, maturation 18s, eif2, mature, eif4e, synthetasesaminoacylation, snornas, assembly, eif4g, elongation

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 71 / 63

Page 254: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Translation

Given: “translation” seed wordsDo: use z-labels to build “translation” topicLots of relevant/seed termsOverall captures “translation” concept

Topic 0

translation, ribosomal, trna, rrna, initiation, ribosome, proteinribosomes, is, factor, processing, translational, nucleolarpre-rrna, synthesis, small, 60s, eukaryotic, biogenesis, subunittrnas, subunits, large, nucleolus, factors, 40, synthetase, freemodification, rna, depletion, eif-2, initiator, 40s, ef-3anticodon, maturation 18s, eif2, mature, eif4e, synthetasesaminoacylation, snornas, assembly, eif4g, elongation

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 71 / 63

Page 255: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Translation

Given: “translation” seed wordsDo: use z-labels to build “translation” topicLots of relevant/seed termsOverall captures “translation” concept

Topic 0

translation, ribosomal, trna, rrna, initiation, ribosome, proteinribosomes, is, factor, processing, translational, nucleolarpre-rrna, synthesis, small, 60s, eukaryotic, biogenesis, subunittrnas, subunits, large, nucleolus, factors, 40, synthetase, freemodification, rna, depletion, eif-2, initiator, 40s, ef-3anticodon, maturation 18s, eif2, mature, eif4e, synthetasesaminoacylation, snornas, assembly, eif4g, elongation

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 71 / 63

Page 256: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Translation

Given: “translation” seed wordsDo: use z-labels to build “translation” topicLots of relevant/seed termsOverall captures “translation” concept

Topic 0

translation, ribosomal, trna, rrna, initiation, ribosome, proteinribosomes, is, factor, processing, translational, nucleolarpre-rrna, synthesis, small, 60s, eukaryotic, biogenesis, subunittrnas, subunits, large, nucleolus, factors, 40, synthetase, freemodification, rna, depletion, eif-2, initiator, 40s, ef-3anticodon, maturation 18s, eif2, mature, eif4e, synthetasesaminoacylation, snornas, assembly, eif4g, elongation

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 71 / 63

Page 257: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Concept Expansion: Yeast Corpus

Corpus: approx 9,000 yeast-related abstractsGoal: use “seed” words to build a topic around a conceptTarget biological concept: translationSeed words: translation, tRNA, anticodon, ribosomeC(i) = {Topic 0}, η = 1 (hard constraint)T = 100, α = 0.5, β = 0.1

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 72 / 63

Page 258: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Forest example

Topic 13 go school cancer into well free cure college. . . graduate . . . law . . . surgery recovery . . .

Topic 13 mixes college and illness wish topicsPrefer to split [go school into college] and [cancer free cure well]Goal: topics separate these words,

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 73 / 63

Page 259: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Forest example

Topic 13 go school cancer into well free cure college. . . graduate . . . law . . . surgery recovery . . .

Topic 13 mixes college and illness wish topicsPrefer to split [go school into college] and [cancer free cure well]Goal: topics separate these words,

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 73 / 63

Page 260: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Forest example

Topic 13 go school cancer into well free cure college. . . graduate . . . law . . . surgery recovery . . .

Topic 13(a) job go school great into good college. . . business graduate finish grades away law accepted . . .

Topic 13(b) mom husband cancer hope free son well. . . full recovery surgery pray heaven pain aids . . .

Topic 13 mixes college and illness wish topicsPrefer to split [go school into college] and [cancer free cure well]Goal: topics separate these words, as well as related words

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 73 / 63

Page 261: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Forest example

Topic 13 go school cancer into well free cure college. . . graduate . . . law . . . surgery recovery . . .

Topic 13(a) job go school great into good college. . . business graduate finish grades away law accepted . . .

Topic 13(b) mom husband cancer hope free son well. . . full recovery surgery pray heaven pain aids . . .

Topic 13 mixes college and illness wish topicsPrefer to split [go school into college] and [cancer free cure well]Goal: topics separate these words, as well as related words

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 73 / 63

Page 262: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Tree (“dice factory 2.0”)Dennis III 1991, Minka 1999

Control variance of subsets of variablesSample Dirichlet(γ) at parent, distribute mass to childrenMass reaching leaves are final multinomial parameters φ∆(s) = 0 for all internal node s → standard Dirichlet(for our trees, true when η = 1)Conjugate to multinomial→ integrate out (“collapse”) φ

BA C

2β β

ηβηβ{γ(β = 1, η = 50)

φ=

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 74 / 63

Page 263: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Tree (“dice factory 2.0”)Dennis III 1991, Minka 1999

Control variance of subsets of variablesSample Dirichlet(γ) at parent, distribute mass to childrenMass reaching leaves are final multinomial parameters φ∆(s) = 0 for all internal node s → standard Dirichlet(for our trees, true when η = 1)Conjugate to multinomial→ integrate out (“collapse”) φ

BA C

2β β

ηβηβ{γ(β = 1, η = 50)

0.090.91

0.09φ=

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 74 / 63

Page 264: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Tree (“dice factory 2.0”)Dennis III 1991, Minka 1999

Control variance of subsets of variablesSample Dirichlet(γ) at parent, distribute mass to childrenMass reaching leaves are final multinomial parameters φ∆(s) = 0 for all internal node s → standard Dirichlet(for our trees, true when η = 1)Conjugate to multinomial→ integrate out (“collapse”) φ

BA C

2β β

ηβηβ{γ(β = 1, η = 50)

0.09

0.420.58

0.91

0.090.53 0.38φ=

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 74 / 63

Page 265: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Tree (“dice factory 2.0”)Dennis III 1991, Minka 1999

Control variance of subsets of variablesSample Dirichlet(γ) at parent, distribute mass to childrenMass reaching leaves are final multinomial parameters φ∆(s) = 0 for all internal node s → standard Dirichlet(for our trees, true when η = 1)Conjugate to multinomial→ integrate out (“collapse”) φ

p(φ|γ) =

(L∏k

φ(k)γ(k)−1

) I∏s

Γ(∑C(s)

k γ(k))

∏C(s)k Γ

(γ(k)

)L(s)∑

k

φ(k)

∆(s)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 74 / 63

Page 266: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Tree (“dice factory 2.0”)Dennis III 1991, Minka 1999

Control variance of subsets of variablesSample Dirichlet(γ) at parent, distribute mass to childrenMass reaching leaves are final multinomial parameters φ∆(s) = 0 for all internal node s → standard Dirichlet(for our trees, true when η = 1)Conjugate to multinomial→ integrate out (“collapse”) φ

p(φ|γ) =

(L∏k

φ(k)γ(k)−1

) I∏s

Γ(∑C(s)

k γ(k))

∏C(s)k Γ

(γ(k)

)L(s)∑

k

φ(k)

∆(s)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 74 / 63

Page 267: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Tree (“dice factory 2.0”)Dennis III 1991, Minka 1999

Control variance of subsets of variablesSample Dirichlet(γ) at parent, distribute mass to childrenMass reaching leaves are final multinomial parameters φ∆(s) = 0 for all internal node s → standard Dirichlet(for our trees, true when η = 1)Conjugate to multinomial→ integrate out (“collapse”) φ

p(w|γ) =

I∏s

Γ(∑C(s)

k γ(k))

Γ(∑C(s)

k

(γ(k) + n(k)

)) C(s)∏k

Γ(γ(k) + n(k)

)Γ(γ(k))

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 74 / 63

Page 268: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Must-Link (school,college) via Dirichlet Tree

Place (school,college) beneath internal nodeLarge edge weights beneath this node (large η)

ηβ

2β β

ηβ

college lotteryschool

school college

lottery

(β = 1, η = 50)

school college

lottery

Dir(β = [50,50,1])

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 75 / 63

Page 269: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Must-Link (school,college) via Dirichlet Tree

Place (school,college) beneath internal nodeLarge edge weights beneath this node (large η)

ηβ

2β β

ηβ

college lotteryschoolschool college

lottery

(β = 1, η = 50)

school college

lottery

Dir(β = [50,50,1])

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 75 / 63

Page 270: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Must-Link (school,college) via Dirichlet Tree

Place (school,college) beneath internal nodeLarge edge weights beneath this node (large η)

ηβ

2β β

ηβ

college lotteryschoolschool college

lottery

(β = 1, η = 50)school college

lottery

Dir(β = [50,50,1])

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 75 / 63

Page 271: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Forest LDA

θ

Ndw

D

αβT

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 76 / 63

Page 272: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Forest LDA

θ

Nd

w

D

αT

β

η

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 76 / 63

Page 273: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Forest LDA

θ

Nd

w

D

αT

β

η

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 76 / 63

Page 274: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Forest LDA

θ

Nd

w

D

αT

β

η

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 76 / 63

Page 275: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Forest LDA

θ

Nd

w

D

αT

β

η

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 76 / 63

Page 276: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Dirichlet Forest with biological domain knowledge

Given: concepts from the Gene Ontology (GO)Biological processes: transcription, translation, replicationProcess phases: initiation, elongation, termination

Do: learn meaningful process+phase “composite” topics

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 77 / 63

Page 277: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LDA DF1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 9 10

transcription • • • • • •transcriptional • • • • • •template • • • •translation • • • •translational • • •tRNA • •replication • • • •cycle • • • • •division • • • •initiation • • • • • • • • •start • • • • • • •assembly • • • • • •elongation • • •termination • • •disassembly •release •stop • •

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 78 / 63

Page 278: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Maximal cliques

Maximal cliques of complement graph↔ independent setsWorst-case: 3

n3 (Moon & Moser 1965)

We are only concerned with connected graphs, but still O(3n3 )

(Griggs et al 1988)Find cliques with Bron-Kerbosch (branch-and-bound)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 79 / 63

Page 279: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Maximal cliques

Maximal cliques of complement graph↔ independent setsWorst-case: 3

n3 (Moon & Moser 1965)

We are only concerned with connected graphs, but still O(3n3 )

(Griggs et al 1988)Find cliques with Bron-Kerbosch (branch-and-bound)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 79 / 63

Page 280: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Maximal cliques

Maximal cliques of complement graph↔ independent setsWorst-case: 3

n3 (Moon & Moser 1965)

We are only concerned with connected graphs, but still O(3n3 )

(Griggs et al 1988)Find cliques with Bron-Kerbosch (branch-and-bound)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 79 / 63

Page 281: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Sampling a Tree from the Forest

Vocabulary [A,B,C,D,E ,F ,G]Must-Links (A,B)

Cannot-Links (A,D), (C,D), (E ,F )Cannot-Link-graph

G

C F

E

D

ABX

XX

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 80 / 63

Page 282: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Sampling a Tree from the Forest

Connected components

C

A D E FB C G

F

E

D

ABX

X

X G

C

A D E FB C G

F

E

D

AB

G

ABCD EF G

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 80 / 63

Page 283: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Sampling a Tree from the Forest

Subgraph complements

C

A D E FB C G

F

E

D

AB

G

C

A D E FB C G

F

E

D

AB

G

C

A D E FB C G

F

E

D

AB

G

ABCD EF G

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 80 / 63

Page 284: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Sampling a Tree from the Forest

Maximal cliques

C

A D E FB C G

F

E

D

AB

G

C

A D E FB C G

F

E

D

AB

G

C

A D E FB C G

F

E

D

AB

G

ABCD EF G

C

A D E FB C G

F

E

D

AB

G

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 80 / 63

Page 285: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Sampling a Tree from the Forest

Sample q(1) for first connected component

C

A D E FB C G

F

E

D

AB

G

q

AB DC AB C D

η η

(1)=?

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 80 / 63

Page 286: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Sampling a Tree from the Forest

q(1) = 1 (choose ABC)

C

A D E FB C G

F

E

D

AB

G

C

A D E FB C G

F

E

D

AB

G

AB DC AB C D

ηη ηη

=1q(1)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 80 / 63

Page 287: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Sampling a Tree from the Forest

Insert chosen Cannot-Link subtree

C

A D E FB C G

F

E

D

AB

G

=1(1)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 80 / 63

Page 288: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Sampling a Tree from the Forest

Put (A,B) under Must-Link subtree

C

A D E FB C G

F

E

D

AB

G

η η

=1(1)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 80 / 63

Page 289: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Sampling a Tree from the Forest

Sample q(2) for second connected component

C

A D E FB C G

F

E

D

AB

G

q

η η

η η=1(1)

=?(2)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 80 / 63

Page 290: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Sampling a Tree from the Forest

q(2) = 2 (choose F )

C

A D E FB C G

F

E

D

AB

G

q

η η

ηη

=2(2)

=1(1)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 80 / 63

Page 291: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Sampling a Tree from the Forest

Insert chosen Cannot-Link subtree

C

A D E FB C G

E

D

AB

G

q

q

F

ηη

ηη

=1(1)

=2(2)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 80 / 63

Page 292: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Original Wish Topics

Topic Top words sorted by φ = p(word|topic)0 love i you me and will forever that with hope1 and health for happiness family good my friends2 year new happy a this have and everyone years3 that is it you we be t are as not s will can4 my to get job a for school husband s that into5 to more of be and no money stop live people6 to our the home for of from end safe all come7 to my be i find want with love life meet man8 a and healthy my for happy to be have baby9 a 2008 in for better be to great job president

10 i wish that would for could will my lose can11 peace and for love all on world earth happiness12 may god in all your the you s of bless 200813 the in to of world best win 2008 go lottery14 me a com this please at you call 4 if 2 www

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 81 / 63

Page 293: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Original Wish Topics

Topic Top words sorted by φ = p(word|topic)0 love i you me and will forever that with hope1 and health for happiness family good my friends2 year new happy a this have and everyone years3 that is it you we be t are as not s will can4 my to get job a for school husband s that into5 to more of be and no money stop live people6 to our the home for of from end safe all come7 to my be i find want with love life meet man8 a and healthy my for happy to be have baby9 a 2008 in for better be to great job president

10 i wish that would for could will my lose can11 peace and for love all on world earth happiness12 may god in all your the you s of bless 200813 the in to of world best win 2008 go lottery14 me a com this please at you call 4 if 2 www

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 81 / 63

Page 294: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

isolate([to and for] . . .)

Topic Top words sorted by φ = p(word|topic)0 love forever marry happy together mom back1 health happiness good family friends prosperity2 life best live happy long great time ever wonderful3 out not up do as so what work don was like4 go school cancer into well free cure college5 no people stop less day every each take children6 home safe end troops iraq bring war husband house7 love peace true happiness hope joy everyone dreams8 happy healthy family baby safe prosperous everyone9 better job hope president paul great ron than person10 make money lose weight meet finally by lots hope married

Isolate and to for a the year in new all my 200812 god bless jesus loved know everyone love who loves13 peace world earth win lottery around save14 com call if 4 2 www u visit 1 3 email yahoo

Isolate i to wish my for and a be that the in

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 82 / 63

Page 295: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

isolate([to and for] . . .)

Topic Top words sorted by φ = p(word|topic)0 love forever marry happy together mom back1 health happiness good family friends prosperity2 life best live happy long great time ever wonderful3 out not up do as so what work don was like

MIXED go school cancer into well free cure college5 no people stop less day every each take children6 home safe end troops iraq bring war husband house7 love peace true happiness hope joy everyone dreams8 happy healthy family baby safe prosperous everyone9 better job hope president paul great ron than person10 make money lose weight meet finally by lots hope married

Isolate and to for a the year in new all my 200812 god bless jesus loved know everyone love who loves13 peace world earth win lottery around save14 com call if 4 2 www u visit 1 3 email yahoo

Isolate i to wish my for and a be that the in

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 82 / 63

Page 296: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

split([cancer free cure well],[go school into college])

0 love forever happy together marry fall1 health happiness family good friends2 life happy best live love long time3 as not do so what like much don was4 out make money house up work grow able5 people no stop less day every each take6 home safe end troops iraq bring war husband7 love peace happiness true everyone joy8 happy healthy family baby safe prosperous9 better president hope paul ron than person10 lose meet man hope boyfriend weight finally

Isolate and to for a the year in new all my 200812 god bless jesus loved everyone know loves13 peace world earth win lottery around save14 com call if 4 www 2 u visit 1 email yahoo 3

Isolate i to wish my for and a be that the in me getSplit job go school great into good collegeSplit mom husband cancer hope free son well

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 83 / 63

Page 297: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

split([cancer free cure well],[go school into college])

LOVE love forever happy together marry fall1 health happiness family good friends2 life happy best live love long time3 as not do so what like much don was4 out make money house up work grow able5 people no stop less day every each take6 home safe end troops iraq bring war husband7 love peace happiness true everyone joy8 happy healthy family baby safe prosperous9 better president hope paul ron than person

LOVE lose meet man hope boyfriend weight finallyIsolate and to for a the year in new all my 2008

12 god bless jesus loved everyone know loves13 peace world earth win lottery around save14 com call if 4 www 2 u visit 1 email yahoo 3

Isolate i to wish my for and a be that the in me getSplit job go school great into good collegeSplit mom husband cancer hope free son well

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 83 / 63

Page 298: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

merge([love . . . marry. . .],[meet . . . married. . .])

Topic Top words sorted by φ = p(word|topic)Merge love lose weight together forever marry meet

success health happiness family good friends prosperitylife life happy best live time long wishes ever years- as do not what someone so like don much he

money out make money up house work able pay own lotspeople no people stop less day every each other another

iraq home safe end troops iraq bring war returnjoy love true peace happiness dreams joy everyone

family happy healthy family baby safe prosperousvote better hope president paul ron than person bush

Isolate and to for a the year in new all mygod god bless jesus everyone loved know heart christ

peace peace world earth win lottery around savespam com call if u 4 www 2 3 visit 1

Isolate i to wish my for and a be that theSplit job go great school into good college hope moveSplit mom hope cancer free husband son well dad cure

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 84 / 63

Page 299: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Encoding LDA variants

Concept-Topic Model (Chemudugunta et. al., ISWC 2008)Special concept topics only emit certain wordsTopic t special words: wc1,wc2, . . . ,wcn

∀i Z(i , t)⇒ (W(i ,wc1) ∨ W(i ,wc2) ∨ . . . ∨ W(i ,wcn))

Hidden Markov Topic Model (Gruber et. al., AISTATS 2007)Same topic used for entire sentenceTopic transitions allowed at sentence boundary

∀i , j , s, t S(i , s) ∧ S(j , s) ∧ Z(i , t)⇒ Z(j , t)∀i , s S(i , s) ∧ ¬S(i + 1, s) ∧ Z(i , t)⇒ Z(j , t ′)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 85 / 63

Page 300: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Encoding LDA variants

Concept-Topic Model (Chemudugunta et. al., ISWC 2008)Special concept topics only emit certain wordsTopic t special words: wc1,wc2, . . . ,wcn

∀i Z(i , t)⇒ (W(i ,wc1) ∨ W(i ,wc2) ∨ . . . ∨ W(i ,wcn))

Hidden Markov Topic Model (Gruber et. al., AISTATS 2007)Same topic used for entire sentenceTopic transitions allowed at sentence boundary

∀i , j , s, t S(i , s) ∧ S(j , s) ∧ Z(i , t)⇒ Z(j , t)∀i , s S(i , s) ∧ ¬S(i + 1, s) ∧ Z(i , t)⇒ Z(j , t ′)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 85 / 63

Page 301: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Decomposing the z-step

Remove trivially true ground formulas(Shavlik and Natarajan, AAAI 2009)Example

W(i , taxes)⇒ Z(i ,3)true irrespective of z for all i where wi 6= taxes

Let zKB be all zi involved in ≥ 1 non-trivial groundingzi /∈ zKB can simply be set by argmax

tφt (wi)θdi (t)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 86 / 63

Page 302: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Decomposing the z-step

Remove trivially true ground formulas(Shavlik and Natarajan, AAAI 2009)Example

W(i , taxes)⇒ Z(i ,3)true irrespective of z for all i where wi 6= taxes

Let zKB be all zi involved in ≥ 1 non-trivial groundingzi /∈ zKB can simply be set by argmax

tφt (wi)θdi (t)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 86 / 63

Page 303: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Decomposing the z-step

Remove trivially true ground formulas(Shavlik and Natarajan, AAAI 2009)Example

W(i , taxes)⇒ Z(i ,3)true irrespective of z for all i where wi 6= taxes

Let zKB be all zi involved in ≥ 1 non-trivial groundingzi /∈ zKB can simply be set by argmax

tφt (wi)θdi (t)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 86 / 63

Page 304: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Decomposing the z-step

Remove trivially true ground formulas(Shavlik and Natarajan, AAAI 2009)Example

W(i , taxes)⇒ Z(i ,3)true irrespective of z for all i where wi 6= taxes

Let zKB be all zi involved in ≥ 1 non-trivial groundingzi /∈ zKB can simply be set by argmax

tφt (wi)θdi (t)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 86 / 63

Page 305: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Decomposing the z-step

Remove trivially true ground formulas(Shavlik and Natarajan, AAAI 2009)Example

W(i , taxes)⇒ Z(i ,3)true irrespective of z for all i where wi 6= taxes

Let zKB be all zi involved in ≥ 1 non-trivial groundingzi /∈ zKB can simply be set by argmax

tφt (wi)θdi (t)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 86 / 63

Page 306: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Decomposing the z-step

Remove trivially true ground formulas(Shavlik and Natarajan, AAAI 2009)Example

W(i , taxes)⇒ Z(i ,3)true irrespective of z for all i where wi 6= taxes

Let zKB be all zi involved in ≥ 1 non-trivial groundingzi /∈ zKB can simply be set by argmax

tφt (wi)θdi (t)

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 86 / 63

Page 307: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Collapsed Gibbs Sampling

P(zi = v |z−i ,w) ∝

n(d)−i,v + α∑T

u (n(d)−i,u + α)

n(wi )−i,v + β∑W

w ′(n(w ′)−i,v + β)

× exp (

Li∑`

η`f`(zi = v , z−i ,w))

MLN research suggests this will not work wellWe do not really care about the full posterior anyways

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 87 / 63

Page 308: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Collapsed Gibbs Sampling

P(zi = v |z−i ,w) ∝

n(d)−i,v + α∑T

u (n(d)−i,u + α)

n(wi )−i,v + β∑W

w ′(n(w ′)−i,v + β)

× exp (

Li∑`

η`f`(zi = v , z−i ,w))

MLN research suggests this will not work wellWe do not really care about the full posterior anyways

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 87 / 63

Page 309: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Using syntax to recover related classes of words

cells→ special topic 0 (T-cells, lymphocytes)cell states→ special topic 1 (activated, treated, transfected)∀i , j Dependency(AdjectivalModifier , i , j)⇒ (z(i ,0)⇔ z(j ,1))

Activated lymphocytes pass into the blood stream...Transformation of neuraminidase treated lymphocytes......shortening of process length in OA treated neurons.

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 88 / 63

Page 310: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Using syntax to recover related classes of words

cells→ special topic 0 (T-cells, lymphocytes)cell states→ special topic 1 (activated, treated, transfected)∀i , j Dependency(AdjectivalModifier , i , j)⇒ (z(i ,0)⇔ z(j ,1))

Activated lymphocytes pass into the blood stream...Transformation of neuraminidase treated lymphocytes......shortening of process length in OA treated neurons.

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 88 / 63

Page 311: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Using syntax to recover related classes of words

cells→ special topic 0 (T-cells, lymphocytes)cell states→ special topic 1 (activated, treated, transfected)∀i , j Dependency(AdjectivalModifier , i , j)⇒ (z(i ,0)⇔ z(j ,1))

Activated lymphocytes pass into the blood stream...Transformation of neuraminidase treated lymphocytes......shortening of process length in OA treated neurons.

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 88 / 63

Page 312: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

Using syntax to recover related classes of words

cells→ special topic 0 (T-cells, lymphocytes)cell states→ special topic 1 (activated, treated, transfected)∀i , j Dependency(AdjectivalModifier , i , j)⇒ (z(i ,0)⇔ z(j ,1))

Activated lymphocytes pass into the blood stream...Transformation of neuraminidase treated lymphocytes......shortening of process length in OA treated neurons.

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 88 / 63

Page 313: Incorporating Domain Knowledge in Latent Topic …...Incorporating Domain Knowledge in Latent Topic Models Final oral examination David Andrzejewski Department of Computer Sciences

LogicLDA datasets

Dataset N W D T | ∪k G(ψk )|

S1 16 3 4 2 64S2 32 4 4 3 192

Mac 153986 3652 2000 20 4388860Comp 482634 8285 5000 20 6295Con 422229 6156 2740 25 99847Pol 733072 13196 2000 20 1204960000

HDG 2903640 13817 21352 100 47236814

Andrzejewski (UW-Madison) Incorporating Domain Knowledge Final defense 89 / 63