1 Co-Training for Cross-Lingual Sentiment Classification Xiaojun Wan ( 萬小軍 ) Associate...

1

Co-Training for Cross-Lingual Sentiment Classification

Xiaojun Wan (萬小軍 )

Associate Professor, Peking University

ACL 2009

2

Research Gap

• Opinion mining has drawn much attention recently

– Sentiment classification (POS, NEG, NEU)

– Subjectivity analysis (subjective, objective)

• Annotated corpora are the most important resource for training

• However, most of them are in English

• Corpora for other languages, including Chinese, are rare

3

Related Work

• Pilot studies on cross-lingual subjectivity classification

• Mihalcea et al., ACL 2007

– Bilingual lexicon and manually translated parallel corpus

• Banea et al., EMNLP 2008

– English annotation tool + MT

– Build Romanian annotation tool

– Not much loss compared to human translation

– Suggests MT is a viable approach

4

Problem Definition

• Perform cross-lingual sentiment classification

– Either positive or negative

• Source: English

• Target: Chinese

• Leverage

– 8000 labeled English product reviews

– 1000 unlabeled Chinese product reviews

– Machine translation (MT)

• Derive

– Sentiment classification tools for Chinese product reviews

5

Framework

• Training Phase

• Classification Phase

6

Training Phase (1): Machine Translation

7

Two Views

Chinese View English View

8

Training Phase (2): The Co-Training Approach

English View

9

Label the unlabeled data (English)

• English classifier trained with SVM

• Label the unlabeled reviews in the English view

• E_en: the top p positive and top n negative most confident reviews

10

Label the unlabeled data (Chinese)

• Chinese classifier trained with SVM

• Label the unlabeled reviews in the Chinese view

• E_cn: the top p positive and top n negative most confident reviews

11

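Remove from Unlabeled Data, Finish One Iteration

• E_en: the top p positive and top n negative most confident reviews from the English classifier

• E_cn: the top p positive and top n negative most confident reviews from the Chinese classifier

• E_en ∪ E_cn is removed from the unlabeled data

• Both classifiers are trained again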

12

Setting

• Number of iterations = 40

• p = n = 5
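
The iteration described on the preceding slides can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: `CentroidClassifier` is a tiny pure-Python stand-in for the SVM so the example runs without extra libraries, the names `co_train` and `make_clf` are hypothetical, and skipping reviews that receive conflicting labels is an assumption of this sketch. "Most confident" is approximated here by the highest and lowest decision scores.

```python
class CentroidClassifier:
    """Stand-in for an SVM (assumption): scores a point by how much
    closer it is to the positive centroid than to the negative one."""

    def fit(self, X, y):
        dim = len(X[0])
        self.mu = {}
        for label in (1, -1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.mu[label] = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
        return self

    def decision_function(self, X):
        def d2(x, m):
            return sum((a - b) ** 2 for a, b in zip(x, m))
        return [d2(x, self.mu[-1]) - d2(x, self.mu[1]) for x in X]


def co_train(L_en, L_cn, y, U_en, U_cn, iterations=40, p=5, n=5,
             make_clf=CentroidClassifier):
    """L_en/L_cn: English and Chinese views of the labeled reviews.
    U_en/U_cn: the same two views of the unlabeled reviews."""
    L_en, L_cn, y = list(L_en), list(L_cn), list(y)  # do not mutate inputs
    pool = list(range(len(U_en)))  # indices of still-unlabeled reviews
    for _ in range(iterations):
        if not pool:
            break
        clf_en = make_clf().fit(L_en, y)
        clf_cn = make_clf().fit(L_cn, y)
        votes = {}  # review index -> set of labels proposed for it
        for clf, view in ((clf_en, U_en), (clf_cn, U_cn)):
            scores = clf.decision_function([view[i] for i in pool])
            ranked = [i for _, i in sorted(zip(scores, pool))]
            for i in ranked[-p:]:  # top p most confident positives
                votes.setdefault(i, set()).add(1)
            for i in ranked[:n]:   # top n most confident negatives
                votes.setdefault(i, set()).add(-1)
        for i, labels in votes.items():
            if len(labels) == 1:   # assumption: skip conflicting labels
                L_en.append(U_en[i])
                L_cn.append(U_cn[i])
                y.append(next(iter(labels)))
        # Remove every selected review from the unlabeled data.
        pool = [i for i in pool if i not in votes]
    # Train both classifiers one last time on the enlarged labeled set.
    return make_clf().fit(L_en, y), make_clf().fit(L_cn, y)
```

With the setting on this slide, the call would use `iterations=40, p=5, n=5`, which are also the defaults chosen for this sketch.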

13

Classification Phase

• The outputs of the Chinese classifier and the English classifier are averaged

• The averaged prediction value lies in [-1, 1]
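
A minimal sketch of this combination step, under stated assumptions: the slide only says the two outputs are averaged into [-1, 1], so the max-magnitude scaling in `normalize` and the decision threshold at 0 are illustrative choices, and both function names are hypothetical.

```python
def normalize(scores):
    """Scale raw classifier scores into [-1, 1] by the largest magnitude
    (an assumed normalization scheme, not taken from the slides)."""
    peak = max(abs(s) for s in scores) or 1.0  # avoid dividing by zero
    return [s / peak for s in scores]


def combine(scores_en, scores_cn):
    """Average the normalized English and Chinese scores per review and
    label a review positive when the average is above 0 (assumed threshold)."""
    averaged = [(a + b) / 2
                for a, b in zip(normalize(scores_en), normalize(scores_cn))]
    return [("positive" if v > 0 else "negative", v) for v in averaged]


if __name__ == "__main__":
    # Two reviews scored by each classifier (made-up numbers).
    print(combine([2.0, -1.0], [0.5, -1.0]))
```

Averaging after normalization keeps either classifier from dominating simply because its raw scores are on a larger scale.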

14

Experiment Setting (Training)

• Labeled English data: 8000 Amazon product reviews

– 4000 positive, 4000 negative

– Books, DVDs, electronics

• Unlabeled Chinese data: 1000 product reviews from www.it168.com

– mp3 players, mobile phones, digital cameras (DC)

15

Experiment Setting (Testing)

• 886 Chinese product reviews from www.it168.com

– 451 positive, 435 negative

– Different from unlabeled training data (outside testing)

16

Baseline

• SVM

– Use only labeled data

• TSVM (Transductive SVM)

– Joachims, 1999

– Use both labeled and unlabeled data

17

SVM Baselines

SVM(EN), SVM(CN)

18

SVM Baselines

SVM(ENCN1)

19

SVM Baselines

SVM(ENCN2): average

20

TSVM Baselines

TSVM(EN), TSVM(CN)

21

TSVM Baselines

TSVM(ENCN1)

22

TSVM Baselines

TSVM(ENCN2): average

23

Result: Method Comparison (1)

24

Result: Method Comparison (2), Performance on Each Side

SVM(EN)

TSVM(EN)

CoTrain(EN)

25

Result: Method Comparison (3)

Method        Accuracy
SVM(EN)       0.738
TSVM(EN)      0.769
CoTrain(EN)   0.790

Method        Accuracy
SVM(CN)       0.771
TSVM(CN)      0.767
CoTrain(CN)   0.775

• CoTrain makes better use of the unlabeled Chinese reviews than TSVM

26

Result: Iteration Number

• Co-training outperforms TSVM(ENCN2) after 20 iterations

27

Result: Balance of (p, n)

• Unbalanced examples (p ≠ n) hurt performance badly

28

Conclusion & Comment

• Co-Training approach for cross-lingual sentiment classification

• Future work

– Translated and natural text have different feature distributions

– Domain adaptation algorithms (e.g., structural correspondence learning) for linking them

29

Comment

• Leverage word (phrase) alignment in translated text
