
Wordnet-Enhanced Topic Models

Hsin-Min Lu

盧信銘

Department of Information Management

National Taiwan University

1

Outline

• Introduction

• Literature Review

• Wordnet-Enhanced Topic Model

• Experiments

2

Introduction

• Leveraging unstructured data is a challenging yet rewarding task

• Topic modeling, a family of unsupervised learning models, is useful for discovering latent topic structures in free-text data

• Topic models assume that a document is a mixture of topic distributions

• Each topic is a distribution over the vocabulary

3

Introduction (Cont'd.)

Statistical Topic Models for Text Mining

[Figure: text collections feed a probabilistic topic modeling step, which produces topics as multinomial distributions over words, e.g. {web 0.21, search 0.10, link 0.08, graph 0.05} and {term 0.16, relevance 0.08, weight 0.07, feedback 0.04, independ. 0.03, model 0.03}; downstream uses include subtopic discovery, opinion comparison, summarization, and topical pattern analysis.]

Topic models (multinomial distributions): PLSA [Hofmann 99], LDA [Blei et al. 03], Author-Topic [Steyvers et al. 04], Pachinko allocation [Li & McCallum 06], Topic over time [Wang et al. 06]

4

Introduction (Cont’d.)

• An on-going research stream incorporates meta-data variables into topic modeling

– Richer models

– Useful estimation results

• This study aims at incorporating Wordnet synset information into topic models

– A topic may be a combination of Wordnet synsets, and/or

– A hidden co-occurrence structure

5

Introduction (cont’d.)

• Wordnet-Enhanced Topic Model

– Incorporates Wordnet synsets into topic models

– Wordnet synsets affect the prior of topics

– Multinomial-probit-like setting for the prior

– Wordnet synsets influence topic inference at the token level

– Document-level random effects capture document-wide topic tendency

– Inference using Gibbs sampling

6

Literature Review

• Wordnet

• Latent Dirichlet Allocation (LDA)

• LDA with Dirichlet Forest Prior

• Concept-Topic Model

• LDA with Wordnet

7

Wordnet

• WordNet is a large lexical database of English

– POS: nouns, verbs, adjectives, and adverbs

• Words are organized into synsets

– A synset expresses a distinct concept

– Synsets are interlinked by conceptual-semantic and lexical relations

– Synsets form a network

– Useful for computational linguistics and natural language processing

8

Wordnet (Cont’d.)

• Important differences between Wordnet and a thesaurus

– WordNet interlinks not just word forms (strings of letters) but specific senses of words

– WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity

9

WordNet (Cont’d.)

• A lexical semantic network relating word forms and lexicalized concepts (i.e., concepts that speakers have adopted word forms to express)

• Main relations: hyponymy/troponymy (kind-of/way-to), meronymy (part-whole), synonymy, antonymy

• Predominantly hierarchical; few relations across grammatical classes; glosses and example sentences do not participate in the network

• Nouns are organized under 9 unique beginners

• Command-line interface and C library

• Prehistoric (but greppable!) db format

Lexical Matrix

Creation of Synsets

Three principles:
• Minimality
• Coverage
• Replaceability

Synsets

{house} is ambiguous. {house, home} has the sense of a social unit living together; is this the minimal unit?

{family, house, home} makes the unit completely unambiguous.

For coverage: {family, household, house, home}, ordered according to frequency.

Replaceability of the most frequent words is a requirement.

Synset creation

From first principles:

– Pick all the senses from good standard dictionaries.

– Obtain synonyms for each sense.

– Requires long hours of hard work.

Wordnet Statistics (Version 2.1)

POS        Unique Strings   Synsets   Total Word-Sense Pairs
Noun       117097           81426     145104
Verb       11488            13650     24890
Adjective  22141            18877     31302
Adverb     4601             3644      5720
Totals     155327           117597    207016

15

Wordnet Example

• Fake (n) has three senses:

– Something that is counterfeit; not what it seems to be (synonyms: sham, postiche)

– A person who makes deceitful pretenses (synonyms: imposter, impostor, pretender, faker, …)

– [Football] A deceptive move made by a football player (synonym: juke)

16
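These senses can be inspected programmatically. Below is a minimal sketch using NLTK's WordNet interface, which is an assumption of this example (NLTK bundles WordNet 3.x, so sense inventories and numbering differ slightly from the Wordnet 2.1 figures reported in this deck):

```python
# Minimal sketch: look up the noun senses of "fake" with NLTK's WordNet
# interface (assumes nltk and its `wordnet` corpus are installed).
from nltk.corpus import wordnet as wn

for synset in wn.synsets("fake", pos=wn.NOUN):
    print(synset.name(), "-", synset.definition())
    print("  lemmas:", [lemma.name() for lemma in synset.lemmas()])
    # Hypernym paths climb up to a unique beginner such as entity.n.01
    for path in synset.hypernym_paths():
        print("  path:", " -> ".join(s.name() for s in path))
```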

Wordnet Example (Cont'd.)

[Figure: WordNet hypernym tree for the three noun senses of "fake". The sham/postiche sense climbs through imitation, copy, representation, creation, artifact, and whole thing/unit; the imposter/impostor/pretender/faker sense (with siblings such as fraud, shammer, role player, pseudo/pseud) climbs through deceiver, wrongdoer, bad person, person, organism/being, living thing, causal agent, and physical object; the juke sense climbs through feint, tactical maneuver, move, decision, choice/selection, and action. The paths are rooted at the unique beginner synsets "entity" (first two senses) and "act, human action" (football sense).]

Topic Models

• Latent variable models are useful for discovering hidden structures in text data

– Latent Semantic Indexing using singular value decomposition (SVD) (Deerwester et al. 1990)

– Probabilistic Latent Semantic Indexing (pLSI) (Hofmann 1999)

– Latent Dirichlet allocation (LDA) (Blei et al. 2003)

19

Topic Models (Cont’d.)

• LDA addresses the shortcomings of its predecessors

– SVD may produce negative factor loadings, which makes the results hard to interpret

– pLSI (aspect model): the number of parameters grows linearly with the number of documents

• Leads to model overfitting

– LDA outperforms pLSI in terms of held-out probability (perplexity)

20

LDA Generative Process

21
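The standard LDA generative process [Blei et al. 03]: draw each topic's word distribution from a Dirichlet prior, draw a per-document topic mixture, then draw a topic and a word for every token. A minimal numpy sketch with illustrative sizes and hyperparameters (not the study's code):

```python
# Minimal numpy sketch of the standard LDA generative process (Blei et al. 03).
# All sizes and hyperparameters below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
D, J, W = 100, 25, 5000          # documents, topics, vocabulary size
alpha, eta = 0.1, 0.01           # symmetric Dirichlet hyperparameters
doc_lengths = rng.poisson(60, size=D)

phi = rng.dirichlet(np.full(W, eta), size=J)          # topic-word distributions
corpus = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(J, alpha))        # document-topic mixture
    z_d = rng.choice(J, size=doc_lengths[d], p=theta_d)          # topic per token
    w_d = np.array([rng.choice(W, p=phi[j]) for j in z_d])       # word per token
    corpus.append(w_d)
```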

LDA Inference Problem

22

LDA Model

23

LDA Model (Cont’d.)

24

LDA Model (Cont’d.)

25

LDA: Intractable Inference

26

Model Estimation Methods

Method                    Model               Latent Z             Latent θ                                                 Other Parameters
Collapsed Gibbs Sampling  LDA                 Sample               Integrate out                                            Integrate out φ_j
Stochastic EM             TOT                 Sample               Integrate out                                            Integrate out φ_j; maximize w.r.t. other parameters
Variational Bayes         LDA and DTM         Assume independent   Assume independent (retain sequential structure in DTM)  Maximize
Augmented Gibbs Sampling  WNTM (This Study)   Sample               Sample                                                   Integrate out φ_j; sample other parameters

27

Collapsed Gibbs Sampling

28

P(W \mid Z, \eta) = \prod_{j=1}^{J} \frac{\Gamma(W\eta)}{\Gamma(\eta)^{W}} \cdot \frac{\prod_{w} \Gamma(n_j^{(w)} + \eta)}{\Gamma(n_j^{(\cdot)} + W\eta)}

P(Z \mid \alpha) = \prod_{d=1}^{D} \frac{\Gamma(J\alpha)}{\Gamma(\alpha)^{J}} \cdot \frac{\prod_{j} \Gamma(n_d^{(j)} + \alpha)}{\Gamma(n_d^{(\cdot)} + J\alpha)}

(J: number of topics; W: vocabulary size; D: number of documents; n_j^{(w)}: count of word w assigned to topic j; n_d^{(j)}: count of tokens in document d assigned to topic j.)

Marginalize 𝜃𝑑

29

Marginalize 𝜃𝑑

30

Joint Probability

31

Posterior Probability

32

Posterior Probability (Cont’d.)

33
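Combining the two marginals above gives the familiar collapsed Gibbs update of Griffiths and Steyvers: the conditional for z_{di} is proportional to a topic-word term times a document-topic term. A sketch of one sweep over the corpus (variable names and the count bookkeeping are illustrative, not the study's implementation):

```python
# Sketch of one collapsed Gibbs sweep for LDA (standard Griffiths & Steyvers
# update). `docs` is a list of token-id arrays; n_jw (topic-word counts),
# n_j (topic totals), and n_dj (document-topic counts) are kept incrementally.
import numpy as np

def gibbs_sweep(docs, z, n_jw, n_j, n_dj, alpha, eta, rng):
    J, W = n_jw.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            j_old = z[d][i]
            n_jw[j_old, w] -= 1; n_j[j_old] -= 1; n_dj[d, j_old] -= 1
            # p(z_di = j | rest) ∝ (n_{-di,j}^w + eta)/(n_{-di,j} + W*eta)
            #                      * (n_{-di,d}^j + alpha)
            p = (n_jw[:, w] + eta) / (n_j + W * eta) * (n_dj[d] + alpha)
            j_new = rng.choice(J, p=p / p.sum())
            z[d][i] = j_new
            n_jw[j_new, w] += 1; n_j[j_new] += 1; n_dj[d, j_new] += 1
```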

Limitations of The LDA Model

• Additional meta-data information cannot be incorporated into the model

– Partially addressed by the author-topic model (AT) (Rosen-Zvi et al. 2010) and Dirichlet-multinomial regression (DMR) (Mimno and McCallum, 2008)

– The AT model delivers worse performance than the LDA model

• Except when testing articles are very short

– The AT model is not a general framework for including arbitrary document-level meta-data in the model

34

LDA with Dirichlet Forest Prior

• A Dirichlet Forest prior can be used to incorporate prior knowledge

– Mixture of Dirichlet tree distributions

– Two basic types of knowledge

• Must-Links: two words should have similar probability within any topic

• Cannot-Links: two words should not both have large probability within any topic

35 Andrzejewski, Zhu, and Craven, ICML 2009

LDA with Dirichlet Forest Prior (Cont’d.)

– Additional types of knowledge:

• Split: separate two or more sets of words from a single topic into different topics by placing must-links within the sets and cannot-links between them

• Merge: combine two or more sets of words using must-links

• Isolate: place must-links within the common set, and place cannot-links between the common set and the other high-probability words from all topics

36

Dirichlet Tree Distribution For Must-Link

• A Dirichlet tree distribution is a composition of Dirichlet distributions

(a) A, B, and C are the vocabulary; start sampling from the root node to model Must-Link(A, B)

(b) An instance with 𝛽 = 1 and 𝜂 = 50

37
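A rough numpy sketch of sampling word probabilities from such a must-link Dirichlet tree over the vocabulary {A, B, C}. The specific edge weights (2β into the must-link node, β to the stand-alone leaf, ηβ on the edges inside the node) follow my reading of Andrzejewski et al.'s construction and should be treated as an assumption; the β = 1, η = 50 values come from the slide:

```python
# Sketch: sample word probabilities from a Dirichlet tree encoding
# Must-Link(A, B) over vocabulary {A, B, C}. Edge weights are an assumption
# based on Andrzejewski et al. (ICML 2009): root->{A,B} node 2*beta,
# root->C beta, internal edges eta*beta.
import numpy as np

rng = np.random.default_rng(0)
beta, eta = 1.0, 50.0

def sample_dirichlet_tree():
    # Branch at the root: mass for the {A, B} subtree vs. the leaf C.
    p_ab, p_c = rng.dirichlet([2 * beta, beta])
    # Branch inside the {A, B} subtree: large, equal pseudo-counts keep
    # p(A) ≈ p(B) within every sampled topic.
    p_a, p_b = rng.dirichlet([eta * beta, eta * beta])
    return {"A": p_ab * p_a, "B": p_ab * p_b, "C": p_c}

for sample in (sample_dirichlet_tree() for _ in range(5)):
    print({k: round(v, 3) for k, v in sample.items()})   # note p(A) ≈ p(B)
```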

Dirichlet Tree Distribution

• The Dirichlet tree distribution can encode correlation structure that the standard Dirichlet distribution cannot

• (c) A large set of samples from the Dirichlet tree in (b); note 𝑝(𝐴) ≈ 𝑝(𝐵)

• (d) A Dirichlet distribution with parameters (50, 50, 1), for comparison

38

Combining Dirichlet Tree for Cannot-Link

• (e) Cannot-Link(A, B) and Cannot-Link(B, C)

• (f) The complementary graph of (e)

• (g) The Dirichlet subtree for clique {A, C}

• (h) The Dirichlet subtree for clique {B}

39

LDA with Dirichlet Forest Prior (Cont’d.)

• 𝑞~𝐷𝑖𝑟𝑖𝑐ℎ𝑙𝑒𝑡𝐹𝑜𝑟𝑒𝑠𝑡(𝛽, 𝜂)

• 𝜙~𝐷𝑖𝑟𝑖𝑐ℎ𝑙𝑒𝑡𝑇𝑟𝑒𝑒(𝑞)

• A Dirichlet Forest is a mixture of Dirichlet Trees

40

LDA with Dirichlet Forest Prior (Cont’d.)

41

Concept Topic Model

• Observed words are generated either from a set of hidden topics or from a set of fixed concepts

42

Steyvers, Smyth, and Chemudugunta, 2011

LDA with Wordnet

• Words are generated by walking down the tree of Wordnet synsets

43 Boyd-Graber, Blei, and Zhu, 2007, EMNLP

Research Gaps

• The Dirichlet forest prior can be used to "constrain" topic models

– However, the model cannot "turn off" the constraints when they are inappropriate

• LDAWN provides a model-driven word-sense disambiguation mechanism

– Not suitable for topic modeling, since LDAWN cannot handle words not in Wordnet

• CTM assumes that pre-existing concepts are "constant"

– Different concepts may emerge in different contexts

44

Developing the Wordnet-Enhanced Topic Model (WNTM)

• Need a more flexible framework to include Wordnet concepts in the latent topic model

• A topic in WNTM may be

– A combination of several WN synsets

– A new topic unrelated to existing synsets

– A combination of the above two

45

The WNTM

• 𝑥𝑑𝑖: the vector of Wordnet concept covariates for token 𝑖 in document 𝑑

• Token-level influence structure

• 𝑞𝑑,𝑗: document-specific topic tendency

• 𝑔𝑗: slope for 𝑥𝑑𝑖

46

The WNTM Model

• H_{di,j} = q_{d,j} + x_{di}' g_j + e_{di,j}

• e_{di} = (e_{di,1}, \ldots, e_{di,J})' \sim N(0, \Sigma)

• z_{di} = 0 if \max_j H_{di,j} < 0; \quad z_{di} = j if H_{di,j} = \max_k H_{di,k} \ge 0

47
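A small numpy sketch of this token-level assignment rule; the dimensions, parameter values, and the identity error covariance below are illustrative only:

```python
# Sketch of the WNTM token-level assignment rule from the slide:
# H_{di,j} = q_{d,j} + x_{di}' g_j + e_{di,j}, e_{di} ~ N(0, Sigma),
# z_{di} = 0 if max_j H_{di,j} < 0, else the argmax topic.
import numpy as np

rng = np.random.default_rng(0)
J, C = 25, 300                       # topics, Wordnet concepts (illustrative)
q_d = rng.normal(size=J)             # document-specific topic tendency
g = rng.normal(scale=0.1, size=(C, J))   # concept -> topic slopes
Sigma = np.eye(J)                    # error covariance (identity for the sketch)

x_di = np.zeros(C)                   # concept covariates for token (d, i)
x_di[[12, 87]] = 1.0                 # token belongs to two Wordnet concepts

H_di = q_d + x_di @ g + rng.multivariate_normal(np.zeros(J), Sigma)
z_di = 0 if H_di.max() < 0 else int(H_di.argmax()) + 1   # topics indexed 1..J
print(z_di)
```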

Inference: Gibbs Sampling

• Updating z

• p(z_{di} = j \mid z_{-di}, w_{di}, w_{-di}, X, q, g, \Sigma)
  \propto p(w_{di} \mid z_{di} = j, \cdot) \, p(z_{di} = j \mid q, g, \Sigma, X)
  = \frac{n_{-di,j}^{(w_{di})} + \beta}{n_{-di,j}^{(\cdot)} + W\beta} \, p(z_{di} = j \mid q_d, g, \Sigma, x_{di})

48
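A sketch of this update: the collapsed topic-word term multiplied by the probit-style prior term. How the prior p(z_{di} = j | q_d, g, Σ, x_{di}) is computed is left abstract here (it is simply passed in as a vector), and handling of the z_{di} = 0 outcome is omitted:

```python
# Sketch of the WNTM z update on this slide: collapsed topic-word term times
# the probit-style prior. `prior_dij` is assumed to hold
# p(z_di = j | q_d, g, Sigma, x_di) for topics j = 1..J, however the actual
# sampler computes it; n_jw_minus / n_j_minus are counts with token (d,i) removed.
import numpy as np

def sample_z_di(w_di, n_jw_minus, n_j_minus, prior_dij, beta, rng):
    W = n_jw_minus.shape[1]
    word_term = (n_jw_minus[:, w_di] + beta) / (n_j_minus + W * beta)
    p = word_term * prior_dij
    return rng.choice(len(p), p=p / p.sum()) + 1   # topics indexed 1..J
```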

Inference: Augmented Gibbs Sampling

• Updating H

• H_{di,j} \mid H_{di,-j} \sim \text{Truncated Normal}(\cdot)

– McCulloch and Rossi (1994), Imai and van Dyk (2004)

49
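A sketch of one such truncated-normal component draw in the spirit of McCulloch and Rossi's data augmentation. The conditional mean and standard deviation given H_{di,-j} and Σ are assumed precomputed, and the truncation bounds below are the usual multinomial-probit constraints rather than anything taken verbatim from the slides:

```python
# Sketch: draw H_{di,j} | H_{di,-j} from a truncated normal so the draw stays
# consistent with the current z_{di} (McCulloch & Rossi style augmentation).
# `mu` and `sd` are the conditional-normal mean and sd given H_{di,-j} and Sigma.
import numpy as np
from scipy.stats import truncnorm

def draw_H_component(mu, sd, j, z_di, H_other_max, rng):
    if z_di == j:
        lo, hi = max(0.0, H_other_max), np.inf   # H_{di,j} is the non-negative max
    elif z_di == 0:
        lo, hi = -np.inf, 0.0                    # all utilities are negative
    else:
        lo, hi = -np.inf, H_other_max            # some other topic attains the max
    a, b = (lo - mu) / sd, (hi - mu) / sd
    return truncnorm.rvs(a, b, loc=mu, scale=sd, random_state=rng)
```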

Inference: Augmented Gibbs Sampling

• Draw a^{2*} from \mathrm{trace}(\Sigma_{old}^{-1}) / \chi^2_{J-1}.

• Draw \tilde{H}^*_{di,j} by first drawing H^*_{di,j} conditional on z, q_{old}, g_{old}, \Sigma_{old}, H_{old}, and setting \tilde{H}^*_{di,j} = a^* H^*_{di,j}.

• Draw q_{new} and g_{new} by first drawing q^*, g^*, and a^{2**} conditional on \tilde{H}^*_{di,j}, a^{2*}, and \Sigma_{old}, and setting q_{new} = q^*/a^{**}, g_{new} = g^*/a^{**}.

• Draw \Sigma_{new} by first drawing \Sigma^* conditional on \tilde{H}^*_{di,j}, q^*, g^*, and setting \Sigma_{new} = \Sigma^*/\Sigma^*_{11}, H^{new}_{di,j} = \tilde{H}^*_{di,j}/\sqrt{\Sigma^*_{11}}.

50

Implementation

• C, C++, and OpenMP (core functions) + R (function interfaces) + Python (text pre-processing)

51

Research Testbed

• Reuters-21578

– 11,771 documents

– 775,553 words

– 26,898 unique words

• Wordnet 2.1 is used for concept construction

52

Wordnet Concept Construction

• Filter the Wordnet synsets, keeping those most relevant to the given corpus

• Definition of a concept

– A group of words with similar meanings, constructed from Wordnet synsets

• Consider nouns only

– Organized in a tree structure

53

Wordnet Concept Construction (Cont’d.)

• For each word,

– Find the root form using the morphy tool

– Identify the synsets for the word

– For each synset

• Construct a concept by merging the words in this synset, its descendants, its parent, its siblings, and the descendants of its siblings

• Delete a concept if it contains fewer than 5 distinct tokens

• A concept is not useful if it contains too few words

54
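A rough sketch of the merge step above using NLTK's WordNet interface. NLTK bundles WordNet 3.x rather than the 2.1 release used in the study, so concept membership will differ, and the function and variable names are illustrative:

```python
# Sketch of the concept-construction step: for each noun synset of a word,
# merge the synset, its descendants, its parent, its siblings, and the
# descendants of its siblings, dropping concepts with too few members.
from nltk.corpus import wordnet as wn

def build_concepts(word, min_size=5):
    concepts = {}
    root = wn.morphy(word, wn.NOUN) or word            # root form via morphy
    for synset in wn.synsets(root, pos=wn.NOUN):
        members = set(synset.lemma_names())
        for hypo in synset.closure(lambda s: s.hyponyms()):      # descendants
            members.update(hypo.lemma_names())
        for parent in synset.hypernyms():                        # parent
            members.update(parent.lemma_names())
            for sibling in parent.hyponyms():                    # siblings
                members.update(sibling.lemma_names())
                for hypo in sibling.closure(lambda s: s.hyponyms()):
                    members.update(hypo.lemma_names())            # their descendants
        if len(members) >= min_size:                   # drop tiny concepts
            concepts[synset.name()] = members
    return concepts

print(build_concepts("fake"))
```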

Wordnet Concept Construction (Cont'd.)

• For each concept

– Compute the average co-occurrence length

• The number of unique concept tokens appearing in a document

• Averaged over all positive values

– Delete concepts with average co-occurrence length <= 1.15

• Sort the concepts in descending order by a relevance score (average co-occurrence length / number of unique tokens)

• Delete the concepts in the bottom 25th percentile

55
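A sketch of these filtering statistics. The document representation (a set of token types per document) and the variable names are assumptions; the 1.15 threshold and the bottom-quartile cut come from the slide:

```python
# Sketch of the concept-filtering statistics: average co-occurrence length
# over documents that contain at least one concept word, then a relevance
# score (avg length / number of unique concept words) with a quartile cut.
import numpy as np

def filter_concepts(concepts, docs, min_len=1.15, drop_quantile=0.25):
    kept = {}
    for name, words in concepts.items():
        lens = [len(words & doc) for doc in docs]      # unique concept words per doc
        lens = [n for n in lens if n > 0]              # keep positive values only
        if not lens:
            continue
        avg_len = float(np.mean(lens))                 # average co-occurrence length
        if avg_len <= min_len:
            continue
        kept[name] = (avg_len, avg_len / len(words))   # (length, relevance score)
    if not kept:
        return kept
    cutoff = np.quantile([v[1] for v in kept.values()], drop_quantile)
    return {name: v for name, v in kept.items() if v[1] >= cutoff}
```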

Wordnet Concepts

Concept                         # Unique Words / Avg. Freq. / Avg. Co-occur. Len.   Words in the Concept (List at Most 10 Words)
proportion.n.01                 6 / 2372.7 / 1.24    scale, percent, pct, content, rate, percentage.
security.n.04                   7 / 1856.9 / 1.47    scrip, debenture, share, treasury, convertible, stock, bond.
offer.n.02                      9 / 842.4 / 1.24     price, question, proposition, prospectus, tender, proposal, reward, bid, special.
fossil fuel.n.01                6 / 838.8 / 1.64     oil, jet, gas, petroleum, coal, crude.
funds.n.01                      7 / 806.0 / 1.15     exchequer, pocket, till, trough, treasury, roll, bank.
sum.n.01                        49 / 736.3 / 2.21    figure, revenue, pool, win, purse, sales, profits, rent, proceeds, payoff (list truncated).
social science.n.01             5 / 688.2 / 1.17     econometrics, politics, economics, finance, government.
slope.n.01                      15 / 616.7 / 1.42    decline, upgrade, descent, waterside, rise, coast, uphill, steep, brae, fall (list truncated).
gregorian calendar month.n.01   20 / 612.3 / 1.51    february, feb, mar, march, august, aug, september, sept, december, dec (list truncated).

56

Summary Statistics of Wordnet Concepts

# of Wordnet Concepts Per Word   0     1     2     3     4 or more
Proportion                       45%   27%   12%   8%    8%

57

Perplexity at Different Sweeps

[Figure: perplexity over Gibbs sweeps; # of topics = 25]

58
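Perplexity here is the standard exponentiated negative average per-token log-likelihood on held-out text (lower is better). A minimal sketch, assuming per-token predictive log-probabilities are already available:

```python
# Standard perplexity: exp of the negative average log-likelihood per token.
# `log_probs` is assumed to hold the model's predictive log-probability of
# each held-out token.
import numpy as np

def perplexity(log_probs):
    return float(np.exp(-np.mean(log_probs)))

print(perplexity(np.log([0.01, 0.002, 0.03])))   # example; lower is better
```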

Estimated Topic “Statement”

Top Keywords:
estimate statement bill account action order coupon intervention review case usair accounting suit pass transfer

Wordnet Concepts:
commercial document.n.01 (5.43)*: estimate statement bill account order
proceeding.n.01 (4.29)*: action intervention review case suit
relationship.n.03 (0.86)*: account hold restraint trust confinement
advantage.n.01 (0.69)*: account leverage profitability expediency privilege
fact.n.01 (0.51)*: case observation score specific item

Matching LDA Topic
Top Keywords:
ct net loss shr profit rev note oper avg shrs mths qtr sales exclude gain

*Estimated coefficients for Wordnet concepts.

59

Estimated Topic “Earnings”

Top Keywords:
mln ct net loss dlrs shr profit rev note year gain oper include avg shrs

Wordnet Concepts:
advantage.n.01 (3.59)*: profit gain good leverage preference
subject.n.01 (-0.02)*: puzzle head precedent case question
push.n.01 (-0.03)*: pinch crunch nudge mill boost
legislature.n.01 (-0.06)*: diet congress house senate parliament

Matching LDA Topic
Top Keywords:
mln note net stg include profit tax extraordinary pretax operate full item making turnover income

60

Estimated Topic “Market Update”

Top Keywords:
week total end product period average amount demand supply line inflation term shipment number release

Wordnet Concepts:
quantity.n.03 (5.33)*: total product average amount term
part.n.09 (4.66)*: end period factor top beginning
work time.n.01 (4.38)*: week turn hours shift turnaround
economic process.n.01 (4.34)*: demand supply inflation consumption spiral
merchandise.n.01 (4.26)*: line shipment number release inventory cargo

Matching LDA Topic
Top Keywords:
union south area spokesman city ship strike port worker africa line week affect state southern

61

Estimated Topic “Macroeconomics”

Top Keywords:
dollar market currency west yen economic dealer central growth cut japan economy expect policy interest

Wordnet Concepts:
semite.n.01 (-0.03)*: palestinian arab saudi omani arabian
rational_number.n.01 (-0.11)*: thousandth fraction fourth eighth half
seed.n.01 (-0.12)*: soybean coffee hazelnut nut cob
fact.n.01 (-0.22)*: observation score specific item case

Matching LDA Topic
Top Keywords:
dollar currency yen west exchange market rates japan dealer central german germany intervention finance paris

62

The Effect of Wordnet Concepts

63

The Effect of Topic Number

64

Questions

65