Text Representation & Text Classification for Intelligent Information Retrieval
Ning Yu, School of Library and Information Science, Indiana University at Bloomington

Transcript of the presentation slides:

Page 1:

Text Representation & Text Classification

for Intelligent Information Retrieval

Ning Yu
School of Library and Information Science
Indiana University at Bloomington

Page 2:

Outline

The big picture

A specific problem – opinion detection

Page 3:

Intelligent information retrieval

Characteristics:
- Not restricted to keyword matching and Boolean search
- Handles natural language queries and advanced search criteria
- Coarse-to-fine levels of granularity
- Automatically organizes, evaluates, and interprets the solution space
- User-centered, e.g., adapts to the user's learning habits
- Etc.

Page 4:

Intelligent information retrieval

System preferences:
- Various sources of evidence
- Natural language processing
- Semantic web technologies
- Automatic text classification
- Etc.

Page 5:

Intelligent IR system diagram

Page 6:

A Specific Question: Semi-Supervised Learning for Identifying Opinions in Web Content (dissertation work)

Page 7:

Growing demand for online opinions

Enormous body of user-generated content

About anything, published anywhere and at any time

Useful for literature review, decision making, market monitoring, etc.

Page 8:

Major approaches for opinion detection

Page 9:

- To acquire a broad and comprehensive collection of opinion-bearing features (e.g., bag-of-words, POS words, n-grams with n > 1, linguistic collocations, stylistic features, contextual features); see the feature sketch after this list
- To generate complex patterns (e.g., "good amount") that can approximate the context of words
- To generate and evaluate opinion detection systems
- To allow evaluation of opinion detection strategies with high confidence
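The following is a minimal sketch of how the simplest of these lexical features (binary unigrams and bigrams) might be represented, assuming scikit-learn; the example sentences and the function name extract_features are illustrative only, and POS, collocation, and stylistic features would need additional tooling not shown here.

```python
# Minimal sketch: binary unigram + bigram presence features.
# Assumes scikit-learn; sentences and the function name are illustrative.
from sklearn.feature_extraction.text import CountVectorizer

def extract_features(sentences):
    """Return a binary (presence/absence) unigram + bigram term matrix."""
    vectorizer = CountVectorizer(ngram_range=(1, 2),  # unigrams and bigrams
                                 binary=True,         # presence, not counts
                                 lowercase=True)
    X = vectorizer.fit_transform(sentences)
    return X, vectorizer

X, vec = extract_features(["The plot was a good amount of fun.",
                           "The report lists quarterly figures."])
print(X.shape)  # (2 sentences, number of distinct unigrams + bigrams)
```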


What's Essential? Labeled Data! And lots of it!

Page 10:

Challenges for opinion detection

Shortage of opinion-labeled data: manual annotation is tedious, error-prone and difficult to scale up

Domain transfer: strategies designed for opinion detection in one data domain generally do not perform well in another domain

Page 11:

Motivations & research question

Easy to collect unlabeled user-generated content that contains opinions

Semi-Supervised Learning (SSL) requires only a limited amount of labeled data to automatically label unlabeled data, and it has achieved promising results in NLP studies

Is SSL effective in opinion detection both in sparse data situations and for domain adaptation?

Page 12:

Datasets & data split

Data splits (a sketch of producing these splits follows the table below):
- SSL: Labeled (1-5%) + Unlabeled (90%), Evaluation (5%)
- Full Supervised Learning (SL): Labeled (95%), Evaluation (5%)
- SL Baseline: Labeled (1-5%), Evaluation (5%)

Dataset (sentences)   Blog Posts   Movie Reviews   News Articles
Opinion               4,843        5,000           5,297
Non-opinion           4,843        5,000           5,174
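Below is a rough sketch of how the three pools could be carved out of one sentence collection. The percentages mirror the slide, but the random-partition procedure and the function name split_dataset are assumptions for illustration, not the authors' exact protocol.

```python
# Sketch of the splits described above:
#   SSL        : small labeled set + large unlabeled set + held-out evaluation
#   Full SL    : 95% labeled + 5% evaluation
#   SL baseline: same small labeled set + 5% evaluation
# The random partition and function name are illustrative assumptions.
import random

def split_dataset(sentences, labeled_frac=0.05, unlabeled_frac=0.90, seed=42):
    """Partition sentences into labeled, unlabeled, and evaluation pools."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_labeled = int(n * labeled_frac)
    n_unlabeled = int(n * unlabeled_frac)
    labeled = shuffled[:n_labeled]
    unlabeled = shuffled[n_labeled:n_labeled + n_unlabeled]
    evaluation = shuffled[n_labeled + n_unlabeled:]  # remaining ~5%
    return labeled, unlabeled, evaluation
```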

Page 13:

Two major SSL methods: Self-training

Assumption: Highly confident predictions made by an initial opinion classifier are reliable and can be added to the labeled set.

Limitation: Auto-labeled data may be biased by the particular opinion classifier.
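A minimal self-training loop is sketched below, assuming the binary unigram/bigram features shown earlier and a Naïve Bayes base classifier (consistent with the experimental settings on Page 15). The 0.9 confidence threshold and the number of rounds are arbitrary illustrative choices, not values reported in the slides.

```python
# Self-training sketch: iteratively add the classifier's most confident
# predictions on unlabeled data back into the labeled pool.
# Threshold and round count are illustrative assumptions.
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def self_train(X_labeled, y_labeled, X_unlabeled, confidence=0.9, rounds=10):
    """X_* are sparse feature matrices (e.g., from CountVectorizer)."""
    clf = MultinomialNB()
    X_l, y_l, X_u = X_labeled, np.asarray(y_labeled), X_unlabeled
    for _ in range(rounds):
        clf.fit(X_l, y_l)
        if X_u.shape[0] == 0:
            break
        proba = clf.predict_proba(X_u)
        confident = proba.max(axis=1) >= confidence      # "reliable" predictions
        if not confident.any():
            break
        pseudo_labels = clf.classes_[proba[confident].argmax(axis=1)]
        X_l = vstack([X_l, X_u[confident]])              # grow labeled pool
        y_l = np.concatenate([y_l, pseudo_labels])
        X_u = X_u[~confident]                            # shrink unlabeled pool
    return clf
```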

Page 14:

Two major SSL methods: Co-training

Assumption: Two opinion classifiers with different strengths and weaknesses can benefit from each other.

Limitation: It is not always easy to create two different classifiers.
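The sketch below shows one co-training round under the classic two-view setup, assuming two sparse feature views of the same sentences (e.g., unigrams vs. bigrams) and Naïve Bayes base learners. The per-round quota k, and the simplification of labeling all newly chosen examples with the first classifier's predictions, are illustrative choices rather than the configuration used in the dissertation.

```python
# Co-training sketch: two classifiers, each trained on a different view,
# nominate their most confident unlabeled examples for the shared pool.
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, k=20):
    """X1_* and X2_* are two feature views of the same sentences."""
    y_l = np.asarray(y_l)
    clf1, clf2 = MultinomialNB(), MultinomialNB()
    for _ in range(rounds):
        clf1.fit(X1_l, y_l)
        clf2.fit(X2_l, y_l)
        if X1_u.shape[0] == 0:
            break
        # Each classifier nominates its top-k most confident unlabeled rows.
        chosen = set()
        for clf, X_view in ((clf1, X1_u), (clf2, X2_u)):
            proba = clf.predict_proba(X_view)
            top = np.argsort(proba.max(axis=1))[::-1][:k]
            chosen.update(int(i) for i in top)
        chosen = np.array(sorted(chosen))
        # Simplification for this sketch: label chosen rows with clf1's
        # predictions, then move them from the unlabeled to the labeled pool.
        pseudo = clf1.predict(X1_u[chosen])
        X1_l = vstack([X1_l, X1_u[chosen]])
        X2_l = vstack([X2_l, X2_u[chosen]])
        y_l = np.concatenate([y_l, pseudo])
        keep = np.setdiff1d(np.arange(X1_u.shape[0]), chosen)
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    return clf1, clf2
```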

Page 15:

Experimental design

General settings for SSL:
- Naïve Bayes classifier for self-training
- Binary values for unigram and bigram features

Co-training strategies (see the sketch after this list):
- Unigrams and bigrams (content vs. context)
- Two randomly split feature/training sets
- A character-based language model (CLM) and a bag-of-words model (BOW)
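For concreteness, here is a sketch of how the three view pairings listed above might be constructed with scikit-learn. The character n-gram vectorizer merely stands in for the character-based language model, whose implementation the slides do not specify, so every parameter below is an assumption.

```python
# Sketch of the three co-training view pairings; all settings are illustrative.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["This movie is surprisingly good.", "The index fell two points."]

# Pairing 1: unigrams (content) vs. bigrams (context).
unigram_view = CountVectorizer(ngram_range=(1, 1), binary=True).fit_transform(sentences)
bigram_view = CountVectorizer(ngram_range=(2, 2), binary=True).fit_transform(sentences)

# Pairing 2: one feature space split at random into two halves.
X = CountVectorizer(ngram_range=(1, 2), binary=True).fit_transform(sentences)
perm = np.random.default_rng(0).permutation(X.shape[1])
half_a = X[:, perm[: X.shape[1] // 2]]
half_b = X[:, perm[X.shape[1] // 2:]]

# Pairing 3: character-level features (stand-in for a CLM) vs. bag-of-words.
char_view = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit_transform(sentences)
bow_view = CountVectorizer(binary=True).fit_transform(sentences)
```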

Page 16:

Results: Overall

For movie reviews and news articles, co-training proved to be most robust

For blog posts, SSL showed no benefits over SL due to the low initial accuracy

Page 17:

Results: Movie reviews

Both self-training and co-training can improve opinion detection performance

Co-training is more effective than self-training

Page 18:

Results: Movie reviews (cont.)

The more different the two classifiers, the better the performance

Page 19:

Results: Domain transfer (movie reviews → blog posts)

For a difficult domain (e.g., blog), simple self-training alone is promising for tackling the domain transfer problem.

Page 20:

Contributions

Comprehensive research expands the spectrum of SSL applications in opinion detection

Investigation of the SSL models that best fit the problem space extends understanding of opinion detection and provides a resource for knowledge-based representation

Generation of guidelines and evaluation baselines supports future studies applying SSL algorithms to opinion detection

Research extensible to other data domains, non-English texts, and other text mining tasks

Page 21:


www.CartoonStock.com

“All my opinions are posted on my online blog.”

“A grade of 85 or higher will get you favorable mention on my blog.”

“If you want a second opinion, I’ll ask my computer”

Thank you!