47th TokyoWebMining: Sentiment Analysis via Topic Modeling


Transcript of the 47th TokyoWebMining talk, "Sentiment Analysis via Topic Modeling"

Page 1

Sentiment Analysis via Topic Modeling

@I_eric_Y, 47th TokyoWebMining


Page 2

Agenda

Goals:
1. Present examples of overcoming problems by extending topic models, and of controlling the qualitative visualization
2. Collect feedback on these techniques from a business perspective

• Overview of sentiment analysis of text
• Rating regression from text
• Merits of the topic-modeling approach
  – Qualitative visualization
  – Flexible model modification
• Worked example: Domain-Dependent/Independent Topic Switching Model

Page 3

Sentiment Analysis of Text

• Technology for extracting, organizing, aggregating, and presenting evaluative information from text
• Usable for analyzing surveys and customer feedback
  – Improving products and stores, understanding customer needs, marketing, and so on
• Often included as one component of commercial data-analysis solutions
  – IBM, NTT Communications, ...
• Drew rapid attention with the rise of Consumer Generated Media
  – Amazon, Twitter, Tabelog, ...
  – The web is now overflowing with data on individuals' opinions and evaluations

Page 4

Technologies That Make Up Sentiment Analysis

• Building sentiment lexicons
  – Dictionaries pairing evaluative expressions (words) with their polarity (positive/negative)
• Document classification from the sentiment viewpoint [the main focus of this talk]
  – Decides evaluative polarity at document granularity
• Extraction of sentences containing evaluative information
  – Decides evaluative polarity at sentence granularity
• Extraction of evaluative tuples
  – Decides evaluative polarity at tuple granularity
• There may be others as well

Page 5

Document Classification by Sentiment

• Can be divided further into two approaches:
  – Classify by the ratio of evaluative expressions
    • If expressions with positive polarity dominate, label the document positive
    • There seems to be plenty of research on this
  – Classify with machine learning (a minimal sketch follows this list)
    • First applied in 2002; at first mostly binary classification (positive/negative)
      – Achieved high classification accuracy
    • Interesting findings have also emerged
      – Using all words, not just adjectives, gives better results
      – Adding linguistic features on top of surface features also helps
    • Three-class classification including neutral is surprisingly recent: from around 2005
• Toward even finer-grained classification, and on to rating regression [covered in this talk]
  – Regress the rating of an Amazon review from its text
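A minimal sketch of the machine-learning route above, assuming scikit-learn; the two toy reviews, their labels, and the test sentence are hypothetical stand-ins for a real labeled corpus:

```python
# Binary polarity classification from bag-of-words features, using all
# words rather than adjectives only, per the finding cited above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["the story is good", "boring magazine"]   # toy labeled reviews
labels = [1, 0]                                   # 1 = positive, 0 = negative

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(docs, labels)
print(clf.predict(["the knife is sturdy"]))       # -> predicted polarity
```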

Page 6

Rating Regression from Text

• A simple task: look only at the text and predict its rating
(Screenshot: http://www.amazon.com/)

Page 7

Merits of the Topic-Modeling Approach

• Copes reasonably well with noisy web text
  – Dimensionality reduction rolls words up into topics; regression then uses the topic distribution as features
  – Even rough tokenization can be fine for topic features, as long as the splitting follows a consistent rule
• Qualitative visualization [a main focus of this talk]
  – A topic's word distribution gives a rough sense of its content
  – Modeling topics with evaluative polarity makes the word distributions reflect that polarity, visualizing "sentiment topics"
    → leads on to review summarization and opinion mining
• The model can be modified flexibly [a main focus of this talk]
  – Responding when some problem comes up, e.g. the domain-dependency problem
  – Controlling the qualitative visualization (topic word distributions), e.g. controlling the polarity of sentiment topics

Page 8

Merits of the Topic-Modeling Approach

• Copes reasonably well with noisy web text
  – Dimensionality reduction rolls words up into topics; regression then uses the topic distribution as features
  – Even rough tokenization can be fine for topic features, as long as the splitting follows a consistent rule
• Qualitative visualization
  – A topic's word distribution gives a rough sense of its content
  – Modeling topics with evaluative polarity makes the word distributions reflect that polarity, visualizing "sentiment topics"
    → leads on to review summarization and opinion mining
• The model can be modified flexibly
  – Responding when some problem comes up, e.g. the domain-dependency problem
  – Controlling the qualitative visualization (topic word distributions), e.g. controlling the polarity of sentiment topics

Page 9

Basic Method

• Supervised Latent Dirichlet Allocation [1]
  – A joint model of LDA and linear regression: y = η⊤z̄, with the topic distribution as the feature vector
[Bar chart: a document's distribution over topics A-D (values between 0 and 1) used as the regression features, weighted by the coefficients η.]
(Screenshot: http://www.amazon.com/)
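Before the joint model, a minimal two-stage sketch of the same idea (unsupervised LDA features fed to a linear regressor, the baseline that sLDA is later compared against), assuming scikit-learn; documents and ratings are hypothetical:

```python
# Stage 1: unsupervised LDA maps each document to a topic distribution.
# Stage 2: linear regression predicts the rating from that distribution.
# sLDA instead fits both stages jointly, as described on the next page.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import Ridge

docs = ["great story, well written", "waste of money, broke at once",
        "works fine, happy with it", "dull and boring"]   # toy reviews
ratings = [5.0, 1.0, 4.0, 2.0]                            # toy star ratings

counts = CountVectorizer().fit_transform(docs)
theta = LatentDirichletAllocation(n_components=2,
                                  random_state=0).fit_transform(counts)
reg = Ridge().fit(theta, ratings)    # y ~ eta . theta, topics as features
print(reg.predict(theta[:1]))        # predicted rating for the first review
```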

Page 10

Supervised LDA

• Improves accuracy over both LDA + linear regression and bag-of-words + Lasso
• Solving jointly gives each topic a polarity tied to its linear-regression coefficient, and the word distributions follow suit ["sentiment topics"]

[Figure 1 of the sLDA paper: (Left) a graphical-model representation of supervised latent Dirichlet allocation, with plates over documents d, words n, and topics k, and variables θ_d, Z_{d,n}, W_{d,n}, β_k, Y_d, η, σ². (Bottom) the topics of a 10-topic sLDA model fit to the movie review data of Section 3, laid out along their regression coefficients (roughly −30 to +20); strongly negative topics contain words such as "least, problem, unfortunately, supposed, worse, flat, dull" and strongly positive topics words such as "motion, simple, perfect, fascinating, power, complex, cinematography, screenplay, performances".]

By regressing the response on the empirical topic frequencies, we treat the response as non-exchangeable with the words. The document (i.e., words and their topic assignments) is generated first, under full word exchangeability; then, based on the document, the response variable is generated. In contrast, one could formulate a model in which y is regressed on the topic proportions θ. This treats the response and all the words as jointly exchangeable. But as a practical matter, our chosen formulation seems more sensible: the response depends on the topic frequencies which actually occurred in the document, rather than on the mean of the distribution generating the topics. Moreover, estimating a fully exchangeable model with enough topics allows some topics to be used entirely to explain the response variables, and others to be used to explain the word occurrences. This degrades predictive performance, as demonstrated in [2].

We treat $\alpha$, $\beta_{1:K}$, $\eta$, and $\sigma^2$ as unknown constants to be estimated, rather than random variables. We carry out approximate maximum-likelihood estimation using a variational expectation-maximization (EM) procedure, which is the approach taken in unsupervised LDA as well [4].

2.1 Variational E-step

Given a document and response, the posterior distribution of the latent variables is

$$
p(\theta, z_{1:N} \mid w_{1:N}, y, \alpha, \beta_{1:K}, \eta, \sigma^2)
= \frac{p(\theta \mid \alpha)\left(\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta_{1:K})\right) p(y \mid z_{1:N}, \eta, \sigma^2)}
{\int \mathrm{d}\theta\; p(\theta \mid \alpha) \sum_{z_{1:N}} \left(\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta_{1:K})\right) p(y \mid z_{1:N}, \eta, \sigma^2)}. \tag{1}
$$

The normalizing value is the marginal probability of the observed data, i.e., the document $w_{1:N}$ and response $y$. This normalizer is also known as the likelihood, or the evidence. As with LDA, it is not efficiently computable. Thus, we appeal to variational methods to approximate the posterior.

Variational objective function. We maximize the evidence lower bound (ELBO) $\mathcal{L}(\cdot)$, which for a single document has the form

$$
\log p(w_{1:N}, y \mid \alpha, \beta_{1:K}, \eta, \sigma^2) \ge \mathcal{L}(\gamma, \phi_{1:N}; \alpha, \beta_{1:K}, \eta, \sigma^2)
= \mathrm{E}[\log p(\theta \mid \alpha)] + \sum_{n=1}^{N} \mathrm{E}[\log p(Z_n \mid \theta)] + \sum_{n=1}^{N} \mathrm{E}[\log p(w_n \mid Z_n, \beta_{1:K})] + \mathrm{E}[\log p(y \mid Z_{1:N}, \eta, \sigma^2)] + \mathrm{H}(q). \tag{2}
$$

Here the expectation is taken with respect to a variational distribution q. We choose the fully factorized distribution,

$$
q(\theta, z_{1:N} \mid \gamma, \phi_{1:N}) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n), \tag{3}
$$


parametric analogs, such as an approach based on kernel ICA [6]. In text analysis, McCallum et al. developed a joint topic model for words and categories [8], and Blei and Jordan developed an LDA model to predict caption words from images [2]. In chemogenomic profiling, Flaherty et al. [5] proposed "labelled LDA," which is also a joint topic model, but for genes and protein function categories. It differs fundamentally from the model proposed here.

This paper is organized as follows. We first develop the supervised latent Dirichlet allocation model (sLDA) for document-response pairs. We derive parameter estimation and prediction algorithms for the real-valued response case. Then we extend these techniques to handle diverse response types, using generalized linear models. We demonstrate our approach on two real-world problems. First, we use sLDA to predict movie ratings based on the text of the reviews. Second, we use sLDA to predict the number of "diggs" that a web page will receive in the www.digg.com community, a forum for sharing web content of mutual interest. The digg count prediction for a page is based on the page's description in the forum. In both settings, we find that sLDA provides much more predictive power than regression on unsupervised LDA features. The sLDA approach also improves on the lasso, a modern regularized regression technique.

2 Supervised latent Dirichlet allocation

In topic models, we treat the words of a document as arising from a set of latent topics, that is, a set of unknown distributions over the vocabulary. Documents in a corpus share the same set of K topics, but each document uses a mix of topics unique to itself. Thus, topic models are a relaxation of classical document mixture models, which associate each document with a single unknown topic.

Here we build on latent Dirichlet allocation (LDA) [4], a topic model that serves as the basis for many others. In LDA, we treat the topic proportions for a document as a draw from a Dirichlet distribution. We obtain the words in the document by repeatedly choosing a topic assignment from those proportions, then drawing a word from the corresponding topic.

In supervised latent Dirichlet allocation (sLDA), we add to LDA a response variable associated with each document. As mentioned, this variable might be the number of stars given to a movie, a count of the users in an on-line community who marked an article interesting, or the category of a document. We jointly model the documents and the responses, in order to find latent topics that will best predict the response variables for future unlabeled documents.

We emphasize that sLDA accommodates various types of response: unconstrained real values, real values constrained to be positive (e.g., failure times), ordered or unordered class labels, nonnegative integers (e.g., count data), and other types. However, the machinery used to achieve this generality complicates the presentation. So we first give a complete derivation of sLDA for the special case of an unconstrained real-valued response. Then, in Section 2.3, we present the general version of sLDA, and explain how it handles diverse response types.

Focus now on the case $y \in \mathbb{R}$. Fix for a moment the model parameters: the K topics $\beta_{1:K}$ (each $\beta_k$ a vector of term probabilities), the Dirichlet parameter $\alpha$, and the response parameters $\eta$ and $\sigma^2$. Under the sLDA model, each document and response arises from the following generative process:

1. Draw topic proportions $\theta \mid \alpha \sim \mathrm{Dir}(\alpha)$.
2. For each word:
   (a) Draw topic assignment $z_n \mid \theta \sim \mathrm{Mult}(\theta)$.
   (b) Draw word $w_n \mid z_n, \beta_{1:K} \sim \mathrm{Mult}(\beta_{z_n})$.
3. Draw response variable $y \mid z_{1:N}, \eta, \sigma^2 \sim \mathcal{N}(\eta^{\top}\bar{z}, \sigma^2)$.

Here we define $\bar{z} := (1/N) \sum_{n=1}^{N} z_n$. The family of probability distributions corresponding to this generative process is depicted as a graphical model in Figure 1.

Notice the response comes from a normal linear model. The covariates in this model are the (unobserved) empirical frequencies of the topics in the document. The regression coefficients on those frequencies constitute $\eta$. Note that a linear model usually includes an intercept term, which amounts to adding a covariate that always equals one. Here, such a term is redundant, because the components of $\bar{z}$ always sum to one.
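A minimal NumPy simulation of this generative process, with all sizes and parameter values made up for illustration:

```python
# Sample one document and its response from the sLDA generative process.
import numpy as np

rng = np.random.default_rng(0)
K, V, N = 3, 8, 20                       # topics, vocabulary size, doc length
alpha = np.ones(K)                       # Dirichlet parameter (assumed)
beta = rng.dirichlet(np.ones(V), K)      # K topics: distributions over terms
eta = np.array([-2.0, 0.0, 2.0])         # regression coefficients (assumed)
sigma2 = 0.5                             # response variance (assumed)

theta = rng.dirichlet(alpha)                           # 1. topic proportions
z = rng.choice(K, size=N, p=theta)                     # 2a. topic assignments
w = np.array([rng.choice(V, p=beta[zn]) for zn in z])  # 2b. words
zbar = np.bincount(z, minlength=K) / N                 # empirical frequencies
y = rng.normal(eta @ zbar, np.sqrt(sigma2))            # 3. y ~ N(eta'zbar, s2)
print(w, y)
```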


Page 11

Sentiment-Analysis Topic Models

• Extensions: multi-lingual / per-aspect rating regression / assigning discrete polarity values / domain adaptation ...

Feature matrix (○ = supported; columns: ratings, multi-lingual, multi-aspects, polarities, domain dependency, observed labels, Bayesian nonparametrics):
  – supervised LDA: ○ ○
  – ML-sLDA: ○ ○ ○
  – MAS: ○ ○ ○
  – MG-LDA: ○
  – JST/ASUM: ○
  – JAS: ○ ○
  – Yoshida et al.: ○ ○
  – DDI-TSM: ○ ○ ○ ○ ○

Page 12

Merits of the Topic-Modeling Approach

• Copes reasonably well with noisy web text
  – Dimensionality reduction rolls words up into topics; regression then uses the topic distribution as features
  – Even rough tokenization can be fine for topic features, as long as the splitting follows a consistent rule
• Qualitative visualization
  – A topic's word distribution gives a rough sense of its content
  – Modeling topics with evaluative polarity makes the word distributions reflect that polarity, visualizing "sentiment topics"
    → leads on to review summarization and opinion mining
• The model can be modified flexibly
  – Responding when some problem comes up, e.g. the domain-dependency problem
  – Controlling the qualitative visualization (topic word distributions), e.g. controlling the polarity of sentiment topics

Page 13

An Example of Problem Solving by Extending the Model

• Explained using the Domain-Dependent/Independent Topic Switching Model as the example
• The problem: word domain dependency
• Domain = e-commerce site category: BOOK, CD, DVD, ...
(Screenshot: http://www.amazon.com/)

Page 14

The Domain-Dependency Problem

• A domain can contain both domain-dependent and domain-independent words.
BOOK: "The story is good" / "Too small letter" / "Boring magazine" / "Product was scratched"
KITCHEN: "The toaster doesn't work" / "The knife is sturdy" / "This dishcloth is easy to use" / "Customer support is not good"

Page 15

The Domain-Dependency Problem

• A domain can contain both domain-dependent and domain-independent words.
BOOK: "The story is good" / "Too small letter" / "Boring magazine" / "Product was scratched"
KITCHEN: "The toaster doesn't work" / "The knife is sturdy" / "This dishcloth is easy to use" / "Customer support is not good"

Page 16

The Domain-Dependency Problem

• A domain can contain both domain-dependent and domain-independent words.
BOOK: "The story is good" / "Too small letter" / "Boring magazine" / "Product was scratched"
KITCHEN: "The toaster doesn't work" / "The knife is sturdy" / "This dishcloth is easy to use" / "Customer support is not good"

Page 17

The Domain-Dependency Problem

• A domain can contain both domain-dependent and domain-independent words.
BOOK: "The story is good" / "Too small letter" / "Boring magazine" / "Product was scratched"
KITCHEN: "The toaster doesn't work" / "The knife is sturdy" / "This dishcloth is easy to use" / "Customer support is not good"

• Word distributions differ across product domains, but sLDA cannot account for this
  – Even though the evaluative expressions used differ from domain to domain
• We would like a remedy similar to domain adaptation

Page 18

The Domain-Dependency Problem

• A domain can contain both domain-dependent and domain-independent words.
BOOK: "The story is good" / "Too small letter" / "Boring magazine" / "Product was scratched"
KITCHEN: "The toaster doesn't work" / "The knife is sturdy" / "This dishcloth is easy to use" / "Customer support is not good"

Proposal
1. Introducing domain-dependent/independent topics into sLDA
2. Domain-Dependent/Independent Topic Switching Model (DDI-TSM)

Page 19

The Modeling Approach

[Diagram: a document carries an observed domain label (here ELECTRONICS). Each word is generated through a switch: either from the domain-dependent topics, which come in one group per domain (BOOK, CD, DVD, ELECTRONICS, ...), or from the shared domain-independent topics (A, B, C, D, E, F, G, H, I, ...).]

Page 20

The Modeling Approach

[Diagram as on Page 19: the switch decides, word by word, between the domain-dependent and the domain-independent topics.]

Page 21

The Modeling Approach

[Diagram: for a document in domain 'CD', a word such as "music" takes the domain-dependent route and is drawn from the CD domain-dependent topics.]

Page 22

The Modeling Approach

[Diagram as on Page 21: the word "music" in a 'CD' document is drawn via the domain-dependent route.]

Page 23

The Modeling Approach

[Diagram as on Page 21: the word "music" in a 'CD' document is drawn via the domain-dependent route.]

Page 24

The Modeling Approach

[Diagram: the word "good" takes the domain-independent route and is drawn from domain-independent topic C.]

Page 25

The Modeling Approach

• Domain-dependent vs. domain-independent is selected by a switching latent variable that takes the value 0 or 1
[Diagram: for each observed word W there is a topical latent variable Z (as in LDA) and a switch X. With x = 0 the word (e.g. "music") is drawn from the domain-dependent topic distributions φ_DD; with x = 1 the word (e.g. "good") is drawn from the domain-independent topic distributions φ_DI.]
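A minimal simulation of the switch, with made-up topic counts and distributions (pi, theta_dd, theta_di, phi_dd, phi_di are illustrative names and values, not the model's learned parameters):

```python
# Per word: draw the switch x, then a topic z from the selected group,
# then the word w from that topic, mirroring the diagram above.
import numpy as np

rng = np.random.default_rng(1)
V = 6                                    # toy vocabulary size
pi = 0.5                                 # P(x = 0): domain-dependent route
theta_dd = np.array([0.7, 0.3])          # doc's mix over its domain's topics
theta_di = np.array([0.2, 0.5, 0.3])     # doc's mix over shared topics
phi_dd = rng.dirichlet(np.ones(V), 2)    # domain-dependent topic-word dists
phi_di = rng.dirichlet(np.ones(V), 3)    # domain-independent topic-word dists

for _ in range(5):
    x = int(rng.random() >= pi)          # x = 0: dependent, 1: independent
    if x == 0:
        z = rng.choice(2, p=theta_dd)
        w = rng.choice(V, p=phi_dd[z])
    else:
        z = rng.choice(3, p=theta_di)
        w = rng.choice(V, p=phi_di[z])
    print(x, z, w)
```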

Page 26

Rating-Regression Experiment Results

• K: the number of topics in sLDA
• DDI-TSM can achieve this accuracy quickly with only 10 to 20 topics in total

Page 27

Merits of the Topic-Modeling Approach

• Copes reasonably well with noisy web text
  – Dimensionality reduction rolls words up into topics; regression then uses the topic distribution as features
  – Even rough tokenization can be fine for topic features, as long as the splitting follows a consistent rule
• Qualitative visualization
  – A topic's word distribution gives a rough sense of its content
  – Modeling topics with evaluative polarity makes the word distributions reflect that polarity, visualizing "sentiment topics"
    → leads on to review summarization and opinion mining
• The model can be modified flexibly
  – Responding when some problem comes up, e.g. the domain-dependency problem
  – Controlling the qualitative visualization (topic word distributions), e.g. controlling the polarity of sentiment topics

Page 28

An Example of Controlling the Visualization

[Plot: learned weight parameters per topic, ranging roughly from −16 to +4. Domain-dependent topics: Book−negative, Book−positive, DVD−negative, DVD−positive, Electronics−negative, Electronics−positive, Kitchen−negative, Kitchen−positive. Domain-independent topics fall into positive, neutral, and negative groups, alongside weights that did not correspond to labels and a bias term.]

The coefficients corresponding to the domain-dependent topics line up cleanly.

Page 29

An Example of Controlling the Visualization

• Domain-dependent topics are constrained by the domain information
  – As in Labeled LDA [3]
  – Observed labels: BOOK, CD
  – All observed labels: BOOK, CD, DVD, KITCHEN -> 4 domain-dependent topics
[Bar chart: a document's topic distribution (0 to 1) restricted to the topics of its observed labels.]
• Subdivided further by rating: a rating of 3 or above becomes BOOK-positive, 2 or below BOOK-negative (see the sketch below)
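A minimal sketch of this Labeled-LDA-style restriction, with hypothetical names: a document may only use the domain-dependent topic that matches its observed domain and its binarized rating.

```python
# Map (observed domain, rating) to the allowed domain-dependent topics,
# as in Labeled LDA; the 3-or-above threshold follows the slide above.
DOMAINS = ["BOOK", "CD", "DVD", "KITCHEN"]
TOPICS = [f"{d}-{p}" for d in DOMAINS for p in ("positive", "negative")]

def allowed_topics(domain, rating):
    polarity = "positive" if rating >= 3 else "negative"
    return [t for t in TOPICS if t == f"{domain}-{polarity}"]

print(allowed_topics("BOOK", 4))   # ['BOOK-positive']
print(allowed_topics("BOOK", 2))   # ['BOOK-negative']
```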

Page 30

Word Distributions of the Controlled Topics

[Weight-parameter plot from Page 28; highlighted here: Electronics-positive and Kitchen-positive.]

Table 4: Typical words in domain-dependent and -independent topics.

Domain-dependent:
• Book
  – Positive: book read author story character pages text historical chapter great good best interesting like love excellent better wonderful
  – Negative: book read author writing character page novel story no don't never nothing doesn't few didn't bad wrong disappointed waste good like better great interesting best
  – Neutral: -
• DVD
  – Positive: film movie dvd story character scene action watching horror episode story music video great good best better funny interesting nice like love wonderful
  – Negative: movie story character video actor music cast no never didn't worst waste poor boring wrong terrible like good best funny interesting
  – Neutral: -
• Electronics
  – Positive: product ipod work player printer sony phone battery keyboard audio button speaker monitor memory great good like better excellent perfect happy clear
  – Negative: product speaker work sound phone player software dvd radio tv device printer ipod computer battery sony button headphones nothing waste never didn't cannot problem disappointed doesn't good great
  – Neutral: -
• Kitchen
  – Positive: coffee water machine filter cooking food glass steel stainless ice rice espresso wine tea toaster wonderful sturdy sharp love great good well easy best better
  – Negative: product water coffee steel tank kitchen knives hot heat maker design machine work vacuum filter don't doesn't didn't never problem few broke less disappointed poor cheap nothing no good great better nice
  – Neutral: -
Domain-independent:
  – Positive: good great love work funny sexy greatly best fans amusing cool thrilling succinctly accurately satisfying gracious
  – Negative: amazon product service customer quality warranty support manufacturer vendor damage matter poor scratched blame wrong problem defective
  – Neutral: arthur harold bravo vincent moor america adventure comedy minelli john manhattan roxanne bob napoleon benjamin ghostbusters book dvd amazon

Process is utilized for domain-independent topics, and we can also determine whether those topics are positive or negative in the form of continuous values. Fourth, we showed that DDI-TSM outperforms the baseline model that does not consider domain dependence in the sentiment regression task, predicting numerical ratings from reviews or texts that lack ratings.

The experimental results showed two interesting findings. First, DDI-TSM converged more rapidly than the baseline model because of the strong constraint due to observed domain information. Second, domain-independent topics had positive, negative, and neutral polarities in the form of continuous values. Neutral domain-independent topics included proper nouns, and this means that proper nouns

Page 31

Word Distributions of the Controlled Topics

[Weight-parameter plot and Table 4 as on Page 30; highlighted here: Kitchen-negative.]

Page 32

Word Distributions of the Controlled Topics

[Weight-parameter plot and Table 4 as on Page 30; highlighted here: Domain-Independent-neutral and Domain-Independent-positive.]

Page 33

Word Distributions of the Controlled Topics

[Weight-parameter plot and Table 4 as on Page 30; highlighted here: Domain-Independent-negative.]

Complaints about the e-commerce site, customer support, and delivery.

Page 34

Remaining Issues

• Review summarization / reporting
  – Can topic models do this at all?
  – For example, take (noun, adjective) pairs as the input
• Handling complex structure
  – How to incorporate syntactic structure and the like
• Growing training time
  – Models getting larger and more complex
  – Data getting larger
  – See DSIRNLP #6, "Bayesian Learning for Large-Scale Data" @slideshare
• Hyperparameter tuning
  – The gap between quantitative metrics and qualitative visualization

Page 35

Summary

• Within sentiment analysis, we covered rating regression via topic modeling
• Merits:
  – Works reasonably well even on noisy text
  – Polarity is reflected in the word distributions, so the qualitative visualization can be controlled
  – The model can be flexibly modified to match the problem and the visualization you have in mind
• We used DDI-TSM as the example
  – Addressed a domain-adaptation problem through modeling
  – Roughly visualized per-domain sentiment (controlling the visualization)
• Quite a few issues remain
  – Neural networks are producing interesting results, so whether the topic-model approach is appropriate at all deserves examination

Page 36

References

[1] D. M. Blei and J. D. McAuliffe. Supervised topic models. In Neural Information Processing Systems, 20:121–128, 2008.
[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[3] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 1:248–256, 2009.
[4] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.
[5] J. Blitzer, M. Dredze, and F. Pereira. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Annual Meeting of the Association for Computational Linguistics, 45(1):440, 2007.