Text Representation & Text Classification
for Intelligent Information Retrieval
Ning Yu
School of Library and Information Science
Indiana University at Bloomington
Outline
The big picture
A specific problem – opinion detection
Intelligent information retrieval
Characteristics
Not restricted to keyword matching and Boolean search
Deals with natural language queries and advanced search criteria
Coarse-to-fine levels of granularity
Automatically organizes/evaluates/interprets the solution space
User-centered, e.g., adapts to the user’s learning habits
Etc.
Intelligent information retrieval
System Preferences
Various sources of evidence
Natural language processing
Semantic web technologies
Automatic text classification
Etc.
Intelligent IR system diagram
A Specific Question: Semi-Supervised Learning for Identifying Opinions in Web Content
Dissertation work
Growing demand for online opinions
Enormous body of user-generated content
About anything, published anywhere and at any time
Useful for literature review, decision making, market monitoring, etc.
Major approaches for opinion detection
To acquire a broad and comprehensive collection of opinion-bearing features (e.g., bag-of-words, POS words, N-grams (n>1), linguistic collocations, stylistic features, contextual features);
To generate complex patterns (e.g., “good amount”) that can approximate the context of words;
To generate and evaluate opinion detection systems;
To allow evaluation of opinion detection strategies with high confidence.
What’s Essential?
Labeled Data! And lots of it!!!
Challenges for opinion detection
Shortage of opinion-labeled data: manual annotation is tedious, error-prone and difficult to scale up
Domain transfer: strategies designed for opinion detection in one data domain generally do not perform well in another domain
Motivations & research question
Easy to collect unlabeled user-generated content that contains opinions
Semi-Supervised Learning (SSL) requires only a limited amount of labeled data to automatically label unlabeled data, and has achieved promising results in NLP studies
Is SSL effective in opinion detection both in sparse data situations and for domain adaptation?
Datasets & data split
Data split (from the slide diagram):
SSL: Labeled (1-5%), Unlabeled (90%), Evaluation (5%)
Full Supervised Learning (SL): Labeled (95%), Evaluation (5%)
Baseline SL: Labeled (1-5%), Evaluation (5%)
Dataset (sentences)   Blog Posts   Movie Reviews   News Articles
Opinion               4,843        5,000           5,297
Non-opinion           4,843        5,000           5,174
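The SSL split on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_for_ssl`, its parameters, and the random shuffling are hypothetical, and the slides do not say how sentences were actually assigned to each part.

```python
import random


def split_for_ssl(sentences, labeled_frac=0.05, unlabeled_frac=0.90,
                  eval_frac=0.05, seed=0):
    """Shuffle sentences and cut them into labeled, unlabeled, and evaluation parts.

    The default fractions mirror the slide (1-5% labeled, 90% unlabeled,
    5% evaluation); the shuffling strategy is an assumption.
    """
    if labeled_frac + unlabeled_frac + eval_frac > 1.0 + 1e-9:
        raise ValueError("fractions must not sum to more than 1")
    rng = random.Random(seed)      # fixed seed keeps the split reproducible
    shuffled = list(sentences)
    rng.shuffle(shuffled)

    n = len(shuffled)
    n_eval = round(n * eval_frac)
    n_unlabeled = round(n * unlabeled_frac)
    n_labeled = round(n * labeled_frac)

    # Consecutive, non-overlapping slices of the shuffled list.
    evaluation = shuffled[:n_eval]
    unlabeled = shuffled[n_eval:n_eval + n_unlabeled]
    labeled = shuffled[n_eval + n_unlabeled:n_eval + n_unlabeled + n_labeled]
    return labeled, unlabeled, evaluation
```

With `labeled_frac=0.01` a few percent of the data is left unused, which matches the 1-5% range shown for the labeled part.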
Two major SSL methods: Self-training
Assumption: Highly confident predictions made by an initial opinion classifier are reliable and can be added to the labeled set.
Limitation: Auto-labeled data may be biased by the particular opinion classifier.
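A minimal sketch of the self-training loop, assuming a confidence threshold and a fixed number of rounds (neither is specified on the slides). The names `self_train` and `fit_centroid` are hypothetical, and `fit_centroid` is a toy one-dimensional classifier standing in for a real opinion classifier, used only so the loop can be run.

```python
def self_train(labeled, unlabeled, fit, threshold=0.9, max_rounds=10):
    """Grow the labeled set with the classifier's own confident predictions.

    labeled:   list of (example, label) pairs
    unlabeled: list of examples
    fit:       function taking labeled pairs and returning predict(example),
               which returns a (label, confidence) tuple
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        predict = fit(labeled)
        confident, remaining = [], []
        for x in pool:
            label, conf = predict(x)
            if conf >= threshold:
                confident.append((x, label))   # trust this prediction
            else:
                remaining.append(x)            # keep for a later round
        if not confident:
            break                              # nothing new to learn from
        labeled.extend(confident)
        pool = remaining
    return fit(labeled), labeled, pool


def fit_centroid(labeled):
    """Toy stand-in classifier: nearest class mean on a single number.

    Assumes both classes (0 and 1) appear in the labeled data.
    """
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)

    def predict(x):
        d_pos, d_neg = abs(x - mean_pos), abs(x - mean_neg)
        if d_pos + d_neg == 0:
            return 1, 0.5
        conf_pos = d_neg / (d_pos + d_neg)     # closer to positive mean -> higher
        return (1, conf_pos) if conf_pos >= 0.5 else (0, 1 - conf_pos)

    return predict
```

The limitation noted above shows up directly in the loop: every auto-labeled example comes from the same classifier, so its early mistakes are fed back into its own training data.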
Two major SSL methods: Co-training
Assumption: Two opinion classifiers with different strengths and weaknesses can benefit from each other.
Limitation: It is not always easy to create two different classifiers.
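The co-training idea can be sketched as follows. This is one simple variant, assumed for illustration: each classifier hands its confident predictions to the other classifier's training set. The names `co_train` and `make_centroid_fit` are hypothetical, and the toy classifiers (each looking at one coordinate of a two-number example) stand in for two genuinely different opinion classifiers.

```python
def co_train(labeled, unlabeled, fit_a, fit_b, threshold=0.9, max_rounds=10):
    """Let two classifiers label data for each other.

    fit_a / fit_b take labeled pairs and return predict(example),
    which returns a (label, confidence) tuple.
    """
    train_a, train_b = list(labeled), list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        predict_a, predict_b = fit_a(train_a), fit_b(train_b)
        remaining = []
        for x in pool:
            label_a, conf_a = predict_a(x)
            label_b, conf_b = predict_b(x)
            sure_a, sure_b = conf_a >= threshold, conf_b >= threshold
            if sure_a and not sure_b:
                train_b.append((x, label_a))   # A teaches B
            elif sure_b and not sure_a:
                train_a.append((x, label_b))   # B teaches A
            elif sure_a and sure_b and label_a == label_b:
                train_a.append((x, label_a))   # both agree: add to both
                train_b.append((x, label_a))
            else:
                remaining.append(x)            # unsure or conflicting
        if len(remaining) == len(pool):
            break                              # no example was moved
        pool = remaining
    return fit_a(train_a), fit_b(train_b), pool


def make_centroid_fit(view):
    """Toy classifier factory: nearest class mean on coordinate `view`.

    Assumes both classes (0 and 1) appear in the labeled data.
    """
    def fit(labeled):
        pos = [x[view] for x, y in labeled if y == 1]
        neg = [x[view] for x, y in labeled if y == 0]
        mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)

        def predict(x):
            d_pos, d_neg = abs(x[view] - mean_pos), abs(x[view] - mean_neg)
            if d_pos + d_neg == 0:
                return 1, 0.5
            conf_pos = d_neg / (d_pos + d_neg)
            return (1, conf_pos) if conf_pos >= 0.5 else (0, 1 - conf_pos)

        return predict
    return fit
```

Because each classifier sees a different view of the example, one can be confident exactly where the other is not, which is the benefit the assumption above describes.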
Experimental design
General settings for SSL
Naïve Bayes classifier for self-training
Binary values for unigram and bigram features
Co-training strategies:
Unigrams and bigrams (content vs. context)
Two randomly split feature/training sets
A character-based language model (CLM) and a bag-of-words model (BOW)
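The general settings above (binary unigram and bigram features fed to a Naïve Bayes classifier) might look roughly like this in plain Python. It is a simplified sketch, not the configuration used in the study: whitespace tokenization, add-one smoothing, scoring only the features that are present, and the names `binary_ngram_features` and `BinaryNaiveBayes` are all assumptions.

```python
import math
from collections import Counter, defaultdict


def binary_ngram_features(sentence):
    """Return the set of unigrams and bigrams present in a sentence.

    A set records presence only, which gives binary feature values.
    """
    tokens = sentence.lower().split()
    unigrams = set(tokens)
    bigrams = set(zip(tokens, tokens[1:]))   # e.g. ("good", "amount")
    return unigrams | bigrams


class BinaryNaiveBayes:
    """Small Naive Bayes over feature sets (present features only)."""

    def fit(self, examples):
        """examples: list of (feature_set, label) pairs."""
        self.label_counts = Counter(label for _, label in examples)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in examples:
            self.feature_counts[label].update(feats)
            self.vocab |= feats
        return self

    def predict_proba(self, feats):
        """Return a dict mapping each label to its normalized probability."""
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, n in self.label_counts.items():
            score = math.log(n / total)                    # class prior
            for f in feats & self.vocab:                   # skip unseen features
                # Smoothed estimate of P(feature present | label).
                p = (self.feature_counts[label][f] + 1) / (n + 2)
                score += math.log(p)
            log_scores[label] = score
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}
```

The highest probability returned by `predict_proba` can serve as the confidence score that a self-training or co-training loop thresholds on.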
Results: Overall
For movie reviews and news articles, co-training proved to be most robust
For blog posts, SSL showed no benefits over SL due to the low initial accuracy
Results: Movie reviews
Both self-training and co-training can improve opinion detection performance
Co-training is more effective than self-training
Results: Movie reviews (cont.)
The more different the two classifiers, the better the performance
Results: Domain transfer (movie reviews->blog posts)
For a difficult domain (e.g., blog), simple self-training alone is promising for tackling the domain transfer problem.
Contributions
Comprehensive research expands the spectrum of SSL application to opinion detection
Investigation of SSL model that best fits the problem space extends understanding of opinion detection and provides a resource for knowledge-based representation
Generation of guidelines and evaluation baselines advances later studies using SSL algorithms in opinion detection
Research extensible to other data domains, non-English texts, and other text mining tasks
“All my opinions are posted on my online blog.”
“A grade of 85 or higher will get you favorable mention on my blog.”
“If you want a second opinion, I’ll ask my computer.”
Thank you!