Employing Active Learning to Cross-Lingual Sentiment Classification with Data Quality Controlling...

Employing Active Learning to Cross-Lingual Sentiment Classification with Data Quality Controlling Shoushan Li †‡ Rong Wang Huanhuan Liu Chu-Ren Huang Soochow University Hong Kong Polytechnic University


Page 1: Employing Active Learning to Cross-Lingual Sentiment Classification with Data Quality Controlling Shoushan Li †‡ Rong Wang † Huanhuan Liu † Chu-Ren Huang.

Employing Active Learning to Cross-Lingual Sentiment Classification with Data Quality Controlling

Shoushan Li†‡ Rong Wang† Huanhuan Liu† Chu-Ren Huang‡

† Soochow University

‡ Hong Kong Polytechnic University


Outline

Introduction

Inadequacies of the Existing Work

Our Methods

Experimental Results

Conclusion


Introduction

Sentiment classification is the task of predicting the sentiment orientation (e.g., positive or negative) of a given text.

However, labeled resources are highly imbalanced across languages.

For example, because most sentiment classification studies focus on English, labeled data in English is often available on a large scale, while labeled data in some other languages is very limited.

Introduction (Cont.)

Cross-lingual sentiment classification aims to predict the sentiment orientation of a text in one language (called the target language) with the help of resources from another language (called the source language).


Inadequacies of the Existing Work

Classification performance when using only the labeled data in the source language remains far from satisfactory because of the large differences in linguistic expression and social culture.

One challenge in active learning-based cross-lingual sentiment classification lies in the severe imbalance between the labeled data from the source and target languages.

With such an imbalance, the small amount of labeled target data is easily flooded by the abundant labeled source data, which largely reduces the contribution of the labeled data in the target language.


Our Methods

We propose a certainty-based quality measurement (the intra-quality measurement), together with cross-validation, to select high-quality samples in the source language.

We propose a similarity measurement (the extra-quality measurement) to select the samples in the source language that are similar to those in the target language.

For given data in the target language, these two kinds of measurements are integrated to select high-quality samples in the source language.

After obtaining the high-quality samples in the source language, we employ standard uncertainty sampling for active learning-based cross-lingual sentiment classification.
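The slides do not spell out the uncertainty sampling step, so the following is only a minimal sketch of the standard technique for a binary classifier. It assumes the classifier outputs a positive-class posterior for every unlabeled target sample; the function name `uncertainty_sampling` and the example values are hypothetical.

```python
def uncertainty_sampling(posteriors, batch_size):
    """Return indices of the `batch_size` samples whose positive-class
    posterior is closest to 0.5, i.e. the least certain predictions."""
    ranked = sorted(range(len(posteriors)),
                    key=lambda i: abs(posteriors[i] - 0.5))
    return ranked[:batch_size]


# Hypothetical posteriors for five unlabeled target-language reviews.
pool_posteriors = [0.97, 0.52, 0.10, 0.45, 0.80]

# Indices of the reviews that would be sent for manual annotation next.
to_annotate = uncertainty_sampling(pool_posteriors, batch_size=2)
```

In an active learning loop, the selected samples would be labeled, added to the training data, and the classifier retrained before the next round of selection.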


Intra-quality Measurement

This measurement employs only the data in the source language to measure the quality of the samples in the source language.

We first split the labeled data from the source language into two parts: one serves as the training data and the other serves as the validation data.

Then, we use the training data to train a classifier, which is used to predict the labels of the samples in the validation data.

After the prediction process, we assume that the samples with high posterior probabilities are capable of representing the classification knowledge in the training data.
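The split, train, and predict steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the slides do not name the classifier, so a small multinomial Naive Bayes with add-one smoothing is swapped in, the split is a simple two-fold split by index parity, and the certainty of a sample is taken to be the posterior of its predicted label. All function names are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Train a multinomial Naive Bayes model with add-one smoothing."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def posterior(model, tokens):
    """Return {label: P(label | tokens)} under the trained model."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for c, n_docs in class_counts.items():
        denom = sum(word_counts[c].values()) + len(vocab)
        score = math.log(n_docs / total_docs)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][w] + 1) / denom)
        log_scores[c] = score
    # Normalize in log space for numerical stability.
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}


def intra_quality(docs, labels):
    """Score each source sample by the posterior of its predicted label.

    Assumption: a two-fold split by index parity; each half is scored
    by a classifier trained on the other half.
    """
    scores = [0.0] * len(docs)
    folds = [list(range(0, len(docs), 2)), list(range(1, len(docs), 2))]
    for held_out, train_idx in (folds, folds[::-1]):
        model = train_nb([docs[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in held_out:
            scores[i] = max(posterior(model, docs[i]).values())
    return scores
```

Samples with the highest scores would then be treated as high-quality candidates; samples whose words carry no signal in the training half end up near 0.5 and rank last.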


Intra-quality Measurement


Extra-quality Measurement


Integrating Intra- and Extra-Quality Measurements

We consider the certainty measurement as the main ranking factor and leave the similarity measurement as a supplementary one when designing the way to integrate them.

• Input: Translated training data from the source language; testing data from the target language

• Output: The selected data set
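The slides do not give the exact integration procedure, so the following is only one plausible reading of "certainty as the main factor, similarity as a supplementary one": a two-stage filter that first keeps the most certain source samples and then, among those, keeps the ones most similar to the target data. The cosine similarity over bag-of-words counts, the function names, and the parameters `n_certain` and `n_keep` are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_source_samples(source_docs, certainty, target_docs,
                          n_certain, n_keep):
    """Two-stage selection: certainty first, similarity second."""
    # Aggregate word counts over all target-language data.
    target_profile = Counter(w for doc in target_docs for w in doc)
    # Stage 1 (main factor): keep the n_certain most certain samples.
    by_certainty = sorted(range(len(source_docs)),
                          key=lambda i: certainty[i],
                          reverse=True)[:n_certain]
    # Stage 2 (supplementary factor): rank survivors by similarity
    # to the target data and keep the top n_keep.
    by_similarity = sorted(
        by_certainty,
        key=lambda i: cosine(Counter(source_docs[i]), target_profile),
        reverse=True)
    return by_similarity[:n_keep]
```

Here `certainty` would come from the intra-quality measurement, and the returned indices identify the selected data set.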


Integrating Intra- and Extra-Quality Measurements


Active Learning-based Cross-lingual Sentiment Classification


Active Learning-based Cross-lingual Sentiment Classification


Experimental Settings

• Labeled Data in the Source Language: English reviews from four domains: Book (B), DVD (D), Electronics (E) and Kitchen (K). Each domain contains 1000 positive and 1000 negative reviews. All these labeled samples are translated into Chinese with Google Translate.

• Testing Data in the Target Language: Chinese reviews from IT168 and Chinese reviews from 360BUY, together with 2000 unlabeled reviews.

• Unlabeled Data in the Target Language: We select 500 positive and 500 negative reviews as the unlabeled samples for active learning.


Experimental Results

Table 1: The classification performance by using all 8000 samples in the source domain

Domain      IT168    360BUY
Accuracy    0.756    0.754

Four Approaches:

• Random + No_source

• Uncertainty + No_source

• Uncertainty + All_source

• Uncertainty + Selected_source

Experimental Results (Cont.)

[Figure: Accuracy versus size of the newly added samples (50 to 800) on 360BUY and IT168 for the four approaches: Random+No_source, Uncertainty+No_source, Uncertainty+All_source and Uncertainty+Selected_source.]


Conclusion

• We propose an active learning approach for cross-lingual sentiment classification and address the challenge of severe data imbalance by controlling data quality in the source language. Experiments verify the appropriateness of active learning for cross-lingual sentiment classification.

• In future work, we would like to improve the extra-quality measurement to make it more effective at selecting high-quality samples. Meanwhile, we will try data quality controlling in other cross-lingual NLP tasks.


Thank You!