Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of...

Text Representation & Text Classification

for Intelligent Information Retrieval

Ning Yu

School of Library and Information Science

Indiana University at Bloomington

Outline

The big picture

A specific problem – opinion detection

Intelligent information retrieval

Characteristics:

Not restricted to keyword matching and Boolean search

Deals with natural language queries and advanced search criteria

Coarse-to-fine levels of granularity

Automatically organizes/evaluates/interprets the solution space

User-centered, e.g., adapts to the user’s learning habits

Etc.

Intelligent information retrieval

System preferences:

Various sources of evidence

Natural language processing

Semantic web technologies

Automatic text classification

Etc.

Intelligent IR system diagram

A Specific Question: Semi-Supervised Learning for Identifying Opinions in Web Content

Dissertation work

Growing demand for online opinions

Enormous body of user-generated content

About anything, published anywhere and at any time

Useful for literature review, decision making, market monitoring, etc.

Major approaches for opinion detection

To acquire a broad and comprehensive collection of opinion-bearing features (e.g., bag-of-words, POS words, N-grams (n>1), linguistic collocations, stylistic features, contextual features);

To generate complex patterns (e.g., “good amount”) that can approximate the context of words;

To generate and evaluate opinion detection systems;

To allow evaluation of opinion detection strategies with high confidence.


What’s Essential? Labeled Data! And lots of them!!!

Challenges for opinion detection

Shortage of opinion-labeled data: manual annotation is tedious, error-prone and difficult to scale up

Domain transfer: strategies designed for opinion detection in one data domain generally do not perform well in another domain

Motivations & research question

Easy to collect unlabeled user-generated content that contains opinions

Semi-Supervised Learning (SSL) requires only a limited amount of labeled data to automatically label unlabeled data, and it has achieved promising results in NLP studies

Is SSL effective in opinion detection both in sparse data situations and for domain adaptation?

Datasets & data split

[Figure: data split for each learning setting; SL = Supervised Learning]

SSL: Labeled (1-5%), Unlabeled (90%), Evaluation (5%)

Baseline SL: Labeled (1-5%), Evaluation (5%)

Full SL: Labeled (95%), Evaluation (5%)

Dataset (sentences)   Blog Posts   Movie Reviews   News Articles
Opinion               4,843        5,000           5,297
Non-opinion           4,843        5,000           5,174
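The split on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `split_for_ssl`, the random seed, and the use of a simple shuffle are hypothetical choices, not the procedure used in the dissertation. The fractions follow the slide (1-5% labeled, 90% unlabeled, 5% evaluation).

```python
import random

def split_for_ssl(sentences, labeled_frac=0.05, unlabeled_frac=0.90,
                  eval_frac=0.05, seed=0):
    """Shuffle a dataset and cut it into labeled, unlabeled, and evaluation parts."""
    if labeled_frac + unlabeled_frac + eval_frac > 1.0 + 1e-9:
        raise ValueError("fractions must sum to at most 1")
    items = list(sentences)
    random.Random(seed).shuffle(items)  # fixed seed: an assumption, for repeatability
    n = len(items)
    n_eval = round(n * eval_frac)
    n_unlabeled = round(n * unlabeled_frac)
    n_labeled = round(n * labeled_frac)
    evaluation = items[:n_eval]
    unlabeled = items[n_eval:n_eval + n_unlabeled]
    labeled = items[n_eval + n_unlabeled:n_eval + n_unlabeled + n_labeled]
    return labeled, unlabeled, evaluation

# Stand-in for the 10,000 movie-review sentences (5,000 + 5,000 in the table)
data = list(range(10000))
labeled, unlabeled, evaluation = split_for_ssl(data, labeled_frac=0.01)
print(len(labeled), len(unlabeled), len(evaluation))
```

With `labeled_frac=0.01`, a small remainder of the data is simply left unused, which matches the idea of a 1-5% labeled range on the slide.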

Two major SSL methods: Self-training

Assumption: Highly confident predictions made by an initial opinion classifier are reliable and can be added to the labeled set.

Limitation: Auto-labeled data may be biased by the particular opinion classifier.
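A minimal sketch of the self-training loop described above, assuming a small hand-written Naive Bayes over binary (present/absent) features. The function names, the confidence threshold, the round limit, and the toy sentences are all hypothetical and chosen for illustration; they are not taken from the study.

```python
import math
from collections import Counter

def train_nb(labeled):
    """Train a Naive Bayes model on (feature_set, label) pairs."""
    class_counts = Counter(label for _, label in labeled)
    feature_counts = {label: Counter() for label in class_counts}
    vocabulary = set()
    for features, label in labeled:
        feature_counts[label].update(features)
        vocabulary |= features
    return class_counts, feature_counts, vocabulary

def predict_nb(model, features):
    """Return (best_label, posterior probability of that label)."""
    class_counts, feature_counts, vocabulary = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        for f in features & vocabulary:
            # add-one smoothing so unseen features do not zero out a class
            score += math.log((feature_counts[label][f] + 1) / denom)
        log_scores[label] = score
    peak = max(log_scores.values())
    exp_scores = {l: math.exp(s - peak) for l, s in log_scores.items()}
    best = max(exp_scores, key=exp_scores.get)
    return best, exp_scores[best] / sum(exp_scores.values())

def self_train(labeled, unlabeled, threshold=0.9, max_rounds=10):
    """Repeatedly move confidently predicted examples into the labeled set."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb(labeled)
        confident, remaining = [], []
        for features in unlabeled:
            label, prob = predict_nb(model, features)
            (confident if prob >= threshold else remaining).append((features, label))
        if not confident:
            break  # nothing passed the threshold; stop early
        labeled.extend(confident)
        unlabeled = [features for features, _ in remaining]
    return train_nb(labeled), labeled

seed_set = [({"i", "love", "it"}, "opinion"),
            ({"terrible", "plot"}, "opinion"),
            ({"released", "in", "2009"}, "non-opinion"),
            ({"runs", "two", "hours"}, "non-opinion")]
pool = [{"love", "the", "plot"}, {"released", "two", "sequels"}]
model, final_labeled = self_train(seed_set, pool, threshold=0.75)
```

The limitation on the slide is visible in the loop: every auto-labeled example comes from the same classifier, so its early mistakes are fed back into its own training data.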

Two major SSL methods: Co-training

Assumption: Two opinion classifiers with different strengths and weaknesses can benefit from each other.

Limitation: It is not always easy to create two different classifiers.
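The co-training idea can be sketched in the same style. This is one simplified variant under stated assumptions: each example has two feature views (here, hypothetically, unigrams and bigrams), one Naive Bayes classifier is trained per view, and the label of whichever classifier is more confident is added to a shared labeled set. The names, the threshold, and the toy data are illustrative only and may differ from the setup used in the study.

```python
import math
from collections import Counter

def train_nb(examples):
    """Naive Bayes with add-one smoothing over (feature_set, label) pairs."""
    priors = Counter(label for _, label in examples)
    counts = {label: Counter() for label in priors}
    vocab = set()
    for features, label in examples:
        counts[label].update(features)
        vocab |= features
    return priors, counts, vocab

def predict_nb(model, features):
    """Return (best_label, posterior probability of that label)."""
    priors, counts, vocab = model
    scores = {}
    for label, n in priors.items():
        denom = sum(counts[label].values()) + len(vocab)
        scores[label] = math.log(n) + sum(
            math.log((counts[label][f] + 1) / denom) for f in features & vocab)
    peak = max(scores.values())
    probs = {label: math.exp(s - peak) for label, s in scores.items()}
    best = max(probs, key=probs.get)
    return best, probs[best] / sum(probs.values())

def co_train(labeled, unlabeled, threshold=0.9, max_rounds=10):
    """labeled: list of ((view_a, view_b), label); unlabeled: list of (view_a, view_b)."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        # One classifier per view, both trained on the shared labeled set
        model_a = train_nb([(views[0], y) for views, y in labeled])
        model_b = train_nb([(views[1], y) for views, y in labeled])
        newly_labeled, remaining = [], []
        for views in pool:
            label_a, prob_a = predict_nb(model_a, views[0])
            label_b, prob_b = predict_nb(model_b, views[1])
            # Take the label from whichever classifier is more confident
            label, prob = (label_a, prob_a) if prob_a >= prob_b else (label_b, prob_b)
            if prob >= threshold:
                newly_labeled.append((views, label))
            else:
                remaining.append(views)
        if not newly_labeled:
            break
        labeled.extend(newly_labeled)
        pool = remaining
    return labeled

seed_set = [(({"great", "film"}, {"great film"}), "opinion"),
            (({"film", "opens", "friday"}, {"film opens", "opens friday"}), "non-opinion")]
pool = [({"great", "cast"}, {"great cast"}),
        ({"film", "opens"}, {"film opens"})]
result = co_train(seed_set, pool, threshold=0.6)
```

In this toy run the first pooled sentence is labeled by the unigram view and the second by the bigram view, which is the sense in which two classifiers with different strengths can help each other.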

Experimental design

General settings for SSL:

Naïve Bayes classifier for self-training

Binary values for unigram and bigram features

Co-training strategies:

Unigrams and bigrams (content vs. context)

Two randomly split feature/training sets

A character-based language model (CLM) and a bag-of-words model (BOW)
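As an illustration of "binary values for unigram and bigram features", the sketch below turns a sentence into the set of unigrams and bigrams it contains. The function name and the lowercase whitespace tokenizer are assumptions made for this example; the slides do not specify the tokenization used.

```python
def binary_ngram_features(sentence, n_values=(1, 2)):
    """Return the set of unigrams and bigrams present in a sentence.

    A set encodes binary (presence/absence) values: a feature either
    occurs in the sentence or it does not, regardless of how often.
    """
    tokens = sentence.lower().split()  # whitespace tokenizer: an assumption
    features = set()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))
    return features

feats = binary_ngram_features("A good amount of good acting")
print(sorted(feats))
```

Note that "good" appears twice in the sentence but only once in the feature set, and that the bigram "good amount" is kept as its own feature, which is how bigrams can approximate the context of a word.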

Results: Overall

For movie reviews and news articles, co-training proved to be most robust

For blog posts, SSL showed no benefits over SL due to the low initial accuracy

Results: Movie reviews

Both self-training and co-training can improve opinion detection performance

Co-training is more effective than self-training

Results: Movie reviews (cont.)

The more different the two classifiers, the better the performance

Results: Domain transfer (movie reviews->blog posts)

For a difficult domain (e.g., blog), simple self-training alone is promising for tackling the domain transfer problem.

Contributions

Comprehensive research expands the spectrum of SSL application to opinion detection

Investigation of SSL model that best fits the problem space extends understanding of opinion detection and provides a resource for knowledge-based representation

Generation of guidelines and evaluation baselines advances later studies using SSL algorithms in opinion detection

Research extensible to other data domains, non-English texts, and other text mining tasks


www.CartoonStock.com

“All my opinions are posted on my online blog.”

“A grade of 85 or higher will get you favorable mention on my blog.”

“If you want a second opinion, I’ll ask my computer”

Thank you!
