Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text...

30
Exploiting Associations between Word Clusters and Document Classes for Cross- domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong Xiong, Zhongzhi Shi

Transcript of Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text...

Page 1: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text

Categorization

Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong Xiong, Zhongzhi Shi

Page 2: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

OutlineIntroduction

• Problem Formulation

• Solution for Optimization Problem and Analysis of Algorithm Convergence

• Experimental Validation

• Related Works

• ConclusionsFuzhen Zhuang et al., SDM 2010 2

Page 3: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Introduction

• Many traditional learning techniques work well only under the assumption: Training and test data follow the same distribution

Fuzhen Zhuang et al., SDM 2010

Training (labeled)

Classifier

Test (unlabeled)

Enterprise News Classification: including the classes“Product Announcement”, “Business scandal”, “Acquisition”, … …Product announcement:

HP's just-released LaserJet Pro P1100

printer and the LaserJet Pro M1130 and M1210 multifunction printers, price … performance ...

Announcement for Lenovo ThinkPad ThinkCentre – price $150 off Lenovo K300

desktop using coupon code ... Lenovo ThinkPad ThinkCentre – price $200 off Lenovo IdeaPad U450p

laptop using. ...their performance

HP news Lenovo news

Different distribution

Fail !

3

Page 4: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Motivation (1)

• Example Analysis:

Fuzhen Zhuang et al., SDM 2010

Product announcement: HP's just-released LaserJet Pro P1100

printer and the LaserJet Pro M1130 and M1210 multifunction printers, price … performance ...

Announcement for Lenovo ThinkPad ThinkCentre – price $150 off Lenovo K300

desktop using coupon code ... Lenovo ThinkPad ThinkCentre – price $200 off Lenovo IdeaPad U450p

laptop using. ...their performance

HP news Lenovo news

Product

word concept:

LaserJet, printer,

announcement, price,

ThinkPad, ThinkCentre,

announcement, price

Related

Productannounceme

nt

document class:

4

Page 5: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Motivation (2)

• Example Analysis:

Fuzhen Zhuang et al., SDM 2010

HP LaserJet, printer, price, performance et al.

Lenovo Thinkpad, Thinkcentre, price, performance et al.

The words expressing the same word concept are domain-dependent

5

ProductProduct

announcement

word conceptindicates

The association between word concepts and document classes is domain-independent

• Can we model this observation for classification? • We study to model it for cross-domain classification

• Domain-dependent word concepts• Domain-independent association between word concepts and document classes

Page 6: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Motivation (3)

• Example Analysis:

Fuzhen Zhuang et al., SDM 2010

Product announcement: HP's just-released LaserJet Pro P1100

printer and the LaserJet Pro M1130 and M1210 multifunction printers, price … performance ...

Announcement for Lenovo ThinkPad ThinkCentre – price $150 off Lenovo K300

desktop using coupon code ... Lenovo ThinkPad ThinkCentre – price $200 off Lenovo IdeaPad U450p

laptop using. ...their performance

HP news Lenovo news

Product

word concept:

LaserJet, printer,

announcement, price…

ThinkPad, ThinkCentre

announcement price…

Related

Productannounceme

nt

document class:

6

Share some common words: announcement,

price, performance …

Page 7: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Outline• Introduction

Problem Formulation

• Solution for Optimization Problem and Analysis of Algorithm Convergence

• Experimental Validation

• Related Works

• ConclusionsFuzhen Zhuang et al., SDM 2010 7

Page 8: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Preliminary Knowledge

• Basic formula of matrix tri-factorization:

where the input X is the word-document co-occurrence matrix

Fuzhen Zhuang et al., SDM 2010

F

G

S

8

Page 9: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Problem Formulation (1)

• Input: source domain Xs, target domain Xt

• Matrix tri-factorization based classification framework• Two-step Optimization Framework

(MTrick0)• Joint Optimization Framework

(MTrick)

Fuzhen Zhuang et al., SDM 2010 9

Page 10: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Problem Formulation (2)

• Sketch map of two-step optimization

Fuzhen Zhuang et al., SDM 2010

Source domain Xs

Ss

Fs Gs

Ft Gt

Targetdomain Xt

Ss

10

First step

Second step

Page 11: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Problem Formulation (3)• The optimization problem in source domain (First

step)

• The optimization problem in target domain (Second step)

Fuzhen Zhuang et al., SDM 2010

G0 is used as the supervision

information for this optimization

Our goal:

to obtain Fs , Gs and Ss

11

Our goal: to obtain Ft ,

Gt

Ss is the solution obtained from the

source domain

Page 12: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Problem Formulation (4)

• Sketch map of joint optimization

Fuzhen Zhuang et al., SDM 2010

Source domain Xs

Fs Gs

Ft Gt

Targetdomain Xt

S Knowledge Transfer

12

Page 13: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Problem Formulation (5)

• The joint optimization problem over source and target domain:

Fuzhen Zhuang et al., SDM 2010

G0 is the supervision informatio

n

the association S is shared as bridge to

transfer knowledge

13

Page 14: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Outline• Introduction

• Problem Formulation

Solution for Optimization Problem and Analysis of Algorithm Convergence

• Experimental Validation

• Related Works

• ConclusionsFuzhen Zhuang et al., SDM 2010 14

Page 15: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Solution for Optimization

• Alternately iterative algorithm is developed and the updated formulas are as follows,

Fuzhen Zhuang et al., SDM 2010

This is the solution for

joint optimization problem

15

Page 16: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Analysis of Algorithm Convergence

• According to the methodology of convergence analysis in the two works [Lee et al., NIPS’01] and [Ding et al., KDD’06], the following theorem holds.

Fuzhen Zhuang et al., SDM 2010

Theorem (Convergence): After each round of calculating the iterative formulas, the objective function in the joint optimization will converge monotonically.

16

Page 17: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Outline• Introduction

• Problem Formulation

• Solution for Optimization Problem and Analysis of Algorithm Convergence

Experimental Validation

• Related Works

• ConclusionsFuzhen Zhuang et al., SDM 2010 17

Page 18: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Experimental Preparation (1)• Construct Classification Tasks

rec and sci denote the positive and negative classes, respectively

For source domain:

For target domain:

144 ( ) Tasks can be constructed from this data set rec vs. sci

2 24 4P P

rec.autos rec.motorcycles

rec.baseball

rec.hockey

sci.crypt sic.electronics

sci.med sci.space

Fuzhen Zhuang et al., SDM 2010

recsci

rec.autos + sci.med

rec.motorcycles + sci.space

(4 x 4 cases)

(3 x 3 cases)

18

Page 19: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Experimental Preparation (2)• Data Sets 20 Newsgroup (three top categories are selected)

– Two data sets for binary classification: rec vs. sci and sci vs. talk

rec vs. sci : 144 tasks

sci vs. talk : 144 tasks

Reuters-21578 (the problems constructed in [Gao et al., KDD’08])

rec.autos rec.motorcycles

rec.baseball

rec.hockey

sci.crypt sic.electronics

sci.med sci.space

talk.guns talk.mideast

talk.misc talk.religion

Fuzhen Zhuang et al., SDM 2010

recsci

talk

19

Page 20: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Experimental Preparation (3)

• Compared Algorithms– Supervised Learning:

Logistic Regression (LG) [David et al., 00] Support Vector Machine (SVM) [Joachims, ICML’99]

– Semi-supervised Learning: TSVM [Joachims, ICML’99]

– Cross-domain Learning: CoCC [Dai et al., KDD’07] LWE [Gao et al., KDD’08]

•Our Methods MTrick0 (Two-step optimization framework) MTrick (Joint optimization framework)

•Measure: classification accuracy

Fuzhen Zhuang et al., SDM 2010 20

Page 21: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Experimental Results (1)• Comparisons among MTrick, MTrick0, CoCC, TSVM,

SVM and LG on data set rec vs. sci

Fuzhen Zhuang et al., SDM 2010

MTrick can perform well

even the accuracy of LG is lower than 65%

21

Page 22: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Experimental Results (2)• Comparisons among MTrick, MTrick0, CoCC, TSVM,

SVM and LG on data set sci vs. talk

Fuzhen Zhuang et al., SDM 2010

Similar with rec vs. sci Mtrick also achieves the

best results in this data set

22

Page 23: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Experimental Results (3)

• The performance comparison of MTrick, LWE, CoCC, TSVM, SVM and LG on Reuters-21578

MTrick also performs very well on this data set

Fuzhen Zhuang et al., SDM 2010 23

Page 24: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Experimental Results Summary

• The systemic experiments show that MTrick outperforms all the compared algorithms

• Especially, MTrick can perform very well when the accuracy of LG is low (< 65%), which indicates that MTrick still works when the difficulty degree of transfer learning is great

• Also we can find that the joint optimization is better than the two-step optimization

Fuzhen Zhuang et al., SDM 2010 24

Page 25: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Overview• Introduction

• Problem Formulation

• Solution for Optimization Problem and Analysis of Algorithm Convergence

• Experimental Validation

Related Works

• ConclusionsFuzhen Zhuang et al., SDM 2010 25

Page 26: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Related Work (1)

• Cross-domain Learning Solve the distribution mismatch problems between the training

and testing data.– Instance weighting based approaches

Boosting based learning by Dai et al.[ICML’07] Instance weighting framework for NLP tasks by Jiang et al.

[ACL’07]

– Feature selection based approaches Two-phase feature selection framework by Jiang et al.[CIKM’07] Dimensionality reduction approach by Pan et al.[AAAI’08], which

focuses on finding out the latent feature space regarded as the bridge knowledge between the source and target domains

Co-Clustering based Classification method by Dai et al. [KDD’07]

Fuzhen Zhuang et al., SDM 2010 26

Page 27: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Related Work (2)

• Nonnegative Matrix Factorization (NMF) Weighted nonnegative matrix factorization (WNMF)

by Guillamet et al. [PRL’03] Incorporating word space knowledge for document

clustering by Li et al. [SigIR’08] Orthogonal constrained NMF by Ding et al.[KDD’06] Cross-domain collaborative filtering by Li et al.

[IJCAI’09] Transfer the label information by sharing the

information of word clusters, proposed by Li et al.[SigIR’09]. However, the word clusters are not exactly the same due to distribution difference cross domains

Fuzhen Zhuang et al., SDM 2010 27

Page 28: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Outline• Introduction

• Problem Formulation

• Solution for Optimization Problem and Analysis of Algorithm Convergence

• Experimental Validation

• Related Works

ConclusionsFuzhen Zhuang et al., SDM 2010 28

Page 29: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Conclusions

• Propose a nonnegative matrix factorization based classification framework (MTrick), which explicitly consider‒ the domain-dependent concepts‒ the domain-independent association between concepts and document classes

• Develop an alternately iterative algorithm to solve the optimization problem, and theoretically analyze the convergence

• Experiments on real-world text data sets show the effectiveness of the proposed approach

Fuzhen Zhuang et al., SDM 2010 29

Page 30: Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text Categorization Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong.

Thank you!

Q. & A.

Acknowledgement

Fuzhen Zhuang et al., SDM 2010 30