Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text...
-
Upload
eddy-christ -
Category
Documents
-
view
214 -
download
0
Transcript of Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text...
Exploiting Associations between Word Clusters and Document Classes for Cross-domain Text
Categorization
Fuzhen Zhuang, Ping Luo, Hui Xiong, Qing He, Yuhong Xiong, Zhongzhi Shi
OutlineIntroduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Works
• ConclusionsFuzhen Zhuang et al., SDM 2010 2
Introduction
• Many traditional learning techniques work well only under the assumption: Training and test data follow the same distribution
Fuzhen Zhuang et al., SDM 2010
Training (labeled)
Classifier
Test (unlabeled)
Enterprise News Classification: including the classes“Product Announcement”, “Business scandal”, “Acquisition”, … …Product announcement:
HP's just-released LaserJet Pro P1100
printer and the LaserJet Pro M1130 and M1210 multifunction printers, price … performance ...
Announcement for Lenovo ThinkPad ThinkCentre – price $150 off Lenovo K300
desktop using coupon code ... Lenovo ThinkPad ThinkCentre – price $200 off Lenovo IdeaPad U450p
laptop using. ...their performance
HP news Lenovo news
Different distribution
Fail !
3
Motivation (1)
• Example Analysis:
Fuzhen Zhuang et al., SDM 2010
Product announcement: HP's just-released LaserJet Pro P1100
printer and the LaserJet Pro M1130 and M1210 multifunction printers, price … performance ...
Announcement for Lenovo ThinkPad ThinkCentre – price $150 off Lenovo K300
desktop using coupon code ... Lenovo ThinkPad ThinkCentre – price $200 off Lenovo IdeaPad U450p
laptop using. ...their performance
HP news Lenovo news
Product
word concept:
LaserJet, printer,
announcement, price,
ThinkPad, ThinkCentre,
announcement, price
Related
Productannounceme
nt
document class:
4
Motivation (2)
• Example Analysis:
Fuzhen Zhuang et al., SDM 2010
HP LaserJet, printer, price, performance et al.
Lenovo Thinkpad, Thinkcentre, price, performance et al.
The words expressing the same word concept are domain-dependent
5
ProductProduct
announcement
word conceptindicates
The association between word concepts and document classes is domain-independent
• Can we model this observation for classification? • We study to model it for cross-domain classification
• Domain-dependent word concepts• Domain-independent association between word concepts and document classes
Motivation (3)
• Example Analysis:
Fuzhen Zhuang et al., SDM 2010
Product announcement: HP's just-released LaserJet Pro P1100
printer and the LaserJet Pro M1130 and M1210 multifunction printers, price … performance ...
Announcement for Lenovo ThinkPad ThinkCentre – price $150 off Lenovo K300
desktop using coupon code ... Lenovo ThinkPad ThinkCentre – price $200 off Lenovo IdeaPad U450p
laptop using. ...their performance
HP news Lenovo news
Product
word concept:
LaserJet, printer,
announcement, price…
ThinkPad, ThinkCentre
announcement price…
Related
Productannounceme
nt
document class:
6
Share some common words: announcement,
price, performance …
Outline• Introduction
Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Works
• ConclusionsFuzhen Zhuang et al., SDM 2010 7
Preliminary Knowledge
• Basic formula of matrix tri-factorization:
where the input X is the word-document co-occurrence matrix
Fuzhen Zhuang et al., SDM 2010
F
G
S
8
Problem Formulation (1)
• Input: source domain Xs, target domain Xt
• Matrix tri-factorization based classification framework• Two-step Optimization Framework
(MTrick0)• Joint Optimization Framework
(MTrick)
Fuzhen Zhuang et al., SDM 2010 9
Problem Formulation (2)
• Sketch map of two-step optimization
Fuzhen Zhuang et al., SDM 2010
Source domain Xs
Ss
Fs Gs
Ft Gt
Targetdomain Xt
Ss
10
First step
Second step
Problem Formulation (3)• The optimization problem in source domain (First
step)
• The optimization problem in target domain (Second step)
Fuzhen Zhuang et al., SDM 2010
G0 is used as the supervision
information for this optimization
Our goal:
to obtain Fs , Gs and Ss
11
Our goal: to obtain Ft ,
Gt
Ss is the solution obtained from the
source domain
Problem Formulation (4)
• Sketch map of joint optimization
Fuzhen Zhuang et al., SDM 2010
Source domain Xs
Fs Gs
Ft Gt
Targetdomain Xt
S Knowledge Transfer
12
Problem Formulation (5)
• The joint optimization problem over source and target domain:
Fuzhen Zhuang et al., SDM 2010
G0 is the supervision informatio
n
the association S is shared as bridge to
transfer knowledge
13
Outline• Introduction
• Problem Formulation
Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Works
• ConclusionsFuzhen Zhuang et al., SDM 2010 14
Solution for Optimization
• Alternately iterative algorithm is developed and the updated formulas are as follows,
Fuzhen Zhuang et al., SDM 2010
This is the solution for
joint optimization problem
15
Analysis of Algorithm Convergence
• According to the methodology of convergence analysis in the two works [Lee et al., NIPS’01] and [Ding et al., KDD’06], the following theorem holds.
Fuzhen Zhuang et al., SDM 2010
Theorem (Convergence): After each round of calculating the iterative formulas, the objective function in the joint optimization will converge monotonically.
16
Outline• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
Experimental Validation
• Related Works
• ConclusionsFuzhen Zhuang et al., SDM 2010 17
Experimental Preparation (1)• Construct Classification Tasks
rec and sci denote the positive and negative classes, respectively
For source domain:
For target domain:
144 ( ) Tasks can be constructed from this data set rec vs. sci
2 24 4P P
rec.autos rec.motorcycles
rec.baseball
rec.hockey
sci.crypt sic.electronics
sci.med sci.space
Fuzhen Zhuang et al., SDM 2010
recsci
rec.autos + sci.med
rec.motorcycles + sci.space
(4 x 4 cases)
(3 x 3 cases)
18
Experimental Preparation (2)• Data Sets 20 Newsgroup (three top categories are selected)
– Two data sets for binary classification: rec vs. sci and sci vs. talk
rec vs. sci : 144 tasks
sci vs. talk : 144 tasks
Reuters-21578 (the problems constructed in [Gao et al., KDD’08])
rec.autos rec.motorcycles
rec.baseball
rec.hockey
sci.crypt sic.electronics
sci.med sci.space
talk.guns talk.mideast
talk.misc talk.religion
Fuzhen Zhuang et al., SDM 2010
recsci
talk
19
Experimental Preparation (3)
• Compared Algorithms– Supervised Learning:
Logistic Regression (LG) [David et al., 00] Support Vector Machine (SVM) [Joachims, ICML’99]
– Semi-supervised Learning: TSVM [Joachims, ICML’99]
– Cross-domain Learning: CoCC [Dai et al., KDD’07] LWE [Gao et al., KDD’08]
•Our Methods MTrick0 (Two-step optimization framework) MTrick (Joint optimization framework)
•Measure: classification accuracy
Fuzhen Zhuang et al., SDM 2010 20
Experimental Results (1)• Comparisons among MTrick, MTrick0, CoCC, TSVM,
SVM and LG on data set rec vs. sci
Fuzhen Zhuang et al., SDM 2010
MTrick can perform well
even the accuracy of LG is lower than 65%
21
Experimental Results (2)• Comparisons among MTrick, MTrick0, CoCC, TSVM,
SVM and LG on data set sci vs. talk
Fuzhen Zhuang et al., SDM 2010
Similar with rec vs. sci Mtrick also achieves the
best results in this data set
22
Experimental Results (3)
• The performance comparison of MTrick, LWE, CoCC, TSVM, SVM and LG on Reuters-21578
MTrick also performs very well on this data set
Fuzhen Zhuang et al., SDM 2010 23
Experimental Results Summary
• The systemic experiments show that MTrick outperforms all the compared algorithms
• Especially, MTrick can perform very well when the accuracy of LG is low (< 65%), which indicates that MTrick still works when the difficulty degree of transfer learning is great
• Also we can find that the joint optimization is better than the two-step optimization
Fuzhen Zhuang et al., SDM 2010 24
Overview• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
Related Works
• ConclusionsFuzhen Zhuang et al., SDM 2010 25
Related Work (1)
• Cross-domain Learning Solve the distribution mismatch problems between the training
and testing data.– Instance weighting based approaches
Boosting based learning by Dai et al.[ICML’07] Instance weighting framework for NLP tasks by Jiang et al.
[ACL’07]
– Feature selection based approaches Two-phase feature selection framework by Jiang et al.[CIKM’07] Dimensionality reduction approach by Pan et al.[AAAI’08], which
focuses on finding out the latent feature space regarded as the bridge knowledge between the source and target domains
Co-Clustering based Classification method by Dai et al. [KDD’07]
Fuzhen Zhuang et al., SDM 2010 26
Related Work (2)
• Nonnegative Matrix Factorization (NMF) Weighted nonnegative matrix factorization (WNMF)
by Guillamet et al. [PRL’03] Incorporating word space knowledge for document
clustering by Li et al. [SigIR’08] Orthogonal constrained NMF by Ding et al.[KDD’06] Cross-domain collaborative filtering by Li et al.
[IJCAI’09] Transfer the label information by sharing the
information of word clusters, proposed by Li et al.[SigIR’09]. However, the word clusters are not exactly the same due to distribution difference cross domains
Fuzhen Zhuang et al., SDM 2010 27
Outline• Introduction
• Problem Formulation
• Solution for Optimization Problem and Analysis of Algorithm Convergence
• Experimental Validation
• Related Works
ConclusionsFuzhen Zhuang et al., SDM 2010 28
Conclusions
• Propose a nonnegative matrix factorization based classification framework (MTrick), which explicitly consider‒ the domain-dependent concepts‒ the domain-independent association between concepts and document classes
• Develop an alternately iterative algorithm to solve the optimization problem, and theoretically analyze the convergence
• Experiments on real-world text data sets show the effectiveness of the proposed approach
Fuzhen Zhuang et al., SDM 2010 29
Thank you!
Q. & A.
Acknowledgement
Fuzhen Zhuang et al., SDM 2010 30