1
Co-Training for Cross-Lingual Sentiment Classification
Xiaojun Wan (萬小軍)
Associate Professor, Peking University
ACL 2009
2
Research Gap
• Opinion mining has drawn much attention recently
– Sentiment classification (POS, NEG, NEU)
– Subjectivity analysis (subjective, objective)
• Annotated corpora are most important for training
• However, most of them are English data
• Corpora for other languages, including Chinese, are rare
3
Related Work
• Pilot studies on cross-lingual subjectivity classification
• Mihalcea et al. ACL 2007
– Bilingual lexicon and manually translated parallel corpus
• Banea et al. EMNLP 2008
– English annotation tool + MT
– Build Romanian annotation tool
– Not much loss compared to human translation
– Suggesting MT is a viable way
4
Problem Definition
• Perform cross-lingual sentiment classification
– Either positive or negative
• Source: English
• Target: Chinese
• Leverage
– 8000 labeled English product reviews
– 1000 unlabeled Chinese product reviews
– Machine translation (MT)
• Derive
– Sentiment classification tools for Chinese product reviews
5
Framework
• Training Phase
• Classification Phase
6
Training Phase (1): Machine Translation
7
Two Views
Chinese View English View
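The two views can be pictured as giving every review both an English and a Chinese version, with machine translation supplying whichever one is missing. The sketch below is only an illustration under that assumption: `TwoViewReview`, `build_views`, and the translator callables `en_to_cn` and `cn_to_en` are hypothetical names, and the dictionary lookups stand in for a real MT system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class TwoViewReview:
    english: str           # English view (original or machine-translated)
    chinese: str           # Chinese view (original or machine-translated)
    label: Optional[int]   # +1 positive, -1 negative, None if unlabeled


def build_views(
    labeled_en: List[Tuple[str, int]],
    unlabeled_cn: List[str],
    en_to_cn: Callable[[str], str],   # hypothetical MT function
    cn_to_en: Callable[[str], str],   # hypothetical MT function
):
    """Pair each review with its machine-translated counterpart."""
    labeled = [TwoViewReview(text, en_to_cn(text), y) for text, y in labeled_en]
    unlabeled = [TwoViewReview(cn_to_en(text), text, None) for text in unlabeled_cn]
    return labeled, unlabeled


# Toy "translators" (plain dictionary lookups) used only for this demo.
toy_en_to_cn = {"good": "好", "bad": "差"}
toy_cn_to_en = {"好": "good", "差": "bad"}

labeled, unlabeled = build_views(
    [("good", +1), ("bad", -1)],
    ["差"],
    en_to_cn=lambda s: toy_en_to_cn[s],
    cn_to_en=lambda s: toy_cn_to_en[s],
)
```

After this step each labeled review carries a label plus both views, and each unlabeled review carries both views with no label.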
8
Training Phase (2): The Co-Training Approach
English View
9
Label the unlabeled data (English)
• English classifier (SVM) labels the unlabeled data
• Een: top p positive and top n negative most confident reviews
10
Label the unlabeled data (Chinese)
• Chinese classifier (SVM) labels the unlabeled data
• Ecn: top p positive and top n negative most confident reviews
11
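Remove from Unlabeled Data, Finish One Iteration
• Een: top p positive and top n negative most confident reviews (English classifier)
• Ecn: top p positive and top n negative most confident reviews (Chinese classifier)
• Een ∪ Ecn is removed from the unlabeled data
• Both classifiers train again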
12
Setting
• #Iteration = 40
• p = n = 5
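The loop described on the preceding slides can be sketched roughly as follows, with the settings above as defaults. This is an illustration, not the authors' implementation: `ToyScorer` is a deliberately simple word-count scorer swapped in for the SVM so the example runs on the standard library alone, and dropping reviews on which the two views disagree is an assumption made here. The names `co_train` and `train_both` are hypothetical.

```python
from collections import Counter


class ToyScorer:
    """Stand-in for the SVM in the slides: signed word counts, NOT an SVM."""

    def fit(self, texts, labels):
        self.weights = Counter()
        for text, y in zip(texts, labels):
            for word in text.split():
                self.weights[word] += y
        return self

    def score(self, text):
        words = text.split()
        if not words:
            return 0.0
        return sum(self.weights[w] for w in words) / len(words)


def train_both(labeled, make_clf):
    """Train one classifier per view on (english, chinese, label) triples."""
    labels = [y for _, _, y in labeled]
    en_clf = make_clf().fit([e for e, _, _ in labeled], labels)
    cn_clf = make_clf().fit([c for _, c, _ in labeled], labels)
    return en_clf, cn_clf


def co_train(labeled, unlabeled, make_clf=ToyScorer, iterations=40, p=5, n=5):
    labeled, pool = list(labeled), list(unlabeled)  # pool holds (english, chinese) pairs
    for _ in range(iterations):
        if not pool:
            break
        en_clf, cn_clf = train_both(labeled, make_clf)
        votes = {}  # pool index -> set of labels proposed by the two views
        for clf, view in ((en_clf, 0), (cn_clf, 1)):
            ranked = sorted(range(len(pool)), key=lambda i: clf.score(pool[i][view]))
            for i in ranked[max(0, len(ranked) - p):]:  # top p most confident positive
                votes.setdefault(i, set()).add(+1)
            for i in ranked[:n]:                        # top n most confident negative
                votes.setdefault(i, set()).add(-1)
        # Assumption: keep only reviews whose proposed labels do not conflict.
        agreed = {i: next(iter(v)) for i, v in votes.items() if len(v) == 1}
        if not agreed:
            break
        labeled += [(pool[i][0], pool[i][1], y) for i, y in agreed.items()]
        pool = [pair for i, pair in enumerate(pool) if i not in agreed]
    return train_both(labeled, make_clf)


seed = [("good great", "好 棒", +1), ("bad poor", "差 烂", -1)]
raw = [("great screen", "棒 屏幕"), ("poor battery", "烂 电池")]
en_clf, cn_clf = co_train(seed, raw, iterations=2, p=1, n=1)
```

In the toy run, "screen" and "电池" never appear in the seed data, yet both final classifiers pick up a polarity for them from the reviews labeled during co-training.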
13
Classification Phase
Chinese Classifier
English Classifier
average [-1, 1]
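One way to read the slide above is that each classifier produces a score in [-1, 1] and the final decision comes from the average of the two. The sketch below assumes the scores are already scaled to that range and that an average above zero means positive; both the function name `classify` and the handling of a zero average are assumptions made for illustration.

```python
def classify(en_score: float, cn_score: float):
    """Average two per-view scores in [-1, 1] and return (label, average)."""
    for s in (en_score, cn_score):
        if not -1.0 <= s <= 1.0:
            raise ValueError("scores are assumed to be scaled to [-1, 1]")
    avg = (en_score + cn_score) / 2.0
    # Assumption: an average of exactly zero falls to the negative class.
    label = "positive" if avg > 0 else "negative"
    return label, avg


label, avg = classify(-0.5, 0.25)  # the two views disagree; the average decides
```

Averaging lets a confident view outweigh an uncertain one, which a simple majority vote between two classifiers could not do.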
14
Experiment Setting (Training)
• Labeled English data: 8000 Amazon product reviews
– 4000 positive, 4000 negative
– Books, DVDs, electronics
• Unlabeled Chinese data: 1000 product reviews from www.it168.com
– mp3 player, mobile phones, DC
15
Experiment Setting (Testing)
• 886 Chinese product reviews from www.it168.com
– 451 positive, 435 negative
– Different from unlabeled training data (outside testing)
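As a quick sanity check on the test set above, the class counts give the accuracy a trivial classifier would reach by always predicting the majority class. This reference point is not reported in the slides; it is computed here only from the counts they state.

```python
positive, negative = 451, 435          # test-set counts from the slide
total = positive + negative            # should match the stated 886 reviews
majority_accuracy = max(positive, negative) / total

print(f"total={total}, majority-class accuracy={majority_accuracy:.3f}")
```

The test set is nearly balanced, so accuracy is a reasonable metric here and any useful classifier should land well above roughly 0.51.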
16
Baseline
• SVM
– Use only labeled data
• TSVM (Transductive SVM)
– Joachims, 1999
– Use both labeled and unlabeled data
17
SVM Baselines
SVM(EN), SVM(CN)
18
SVM Baselines
SVM(ENCN1)
19
SVM Baselines
SVM(ENCN2)
average
20
TSVM Baselines
TSVM(EN), TSVM(CN)
21
TSVM Baselines
TSVM(ENCN1)
22
TSVM Baselines
TSVM(ENCN2)
average
23
Result: Method Comparison (1)
24
Result: Method Comparison (2): Performance on Each Side
SVM(EN)
TSVM(EN)
CoTrain(EN)
25
Result: Method Comparison (3)
Accuracy
SVM(EN) 0.738
TSVM(EN) 0.769
CoTrain(EN) 0.790
Accuracy
SVM(CN) 0.771
TSVM(CN) 0.767
CoTrain(CN) 0.775
CoTrain makes better use of unlabeled Chinese reviews than TSVM
26
Result: Iteration Number
• Outperforms TSVM(ENCN2) after 20 iterations
27
Result: Balance of (p, n)
• Unbalanced examples hurt the performance badly
28
Conclusion & Comment
• Co-Training approach for cross-lingual sentiment classification
• Future Work
– Translated and natural text have different feature distributions
– Domain adaptation algorithm (e.g. structural correspondence learning) for linking them
29
Comment
• Leverage word (phrase) alignment in translated text