A Survey of Opinion Mining
Dongjoo Lee
Intelligent Database Systems Lab.
Dept. of Computer Science and Engineering
Seoul National University
Copyright © 2007 by CEBT
Introduction
The Web contains a wealth of opinions about products, politics, and more in newsgroup posts, review sites, and other web sites
A few problems
What is the general opinion on the proposed tax reform?
How is popular opinion on the presidential candidates evolving?
Which of our customers are unsatisfied? Why?
Opinion Mining (OM)
a recent discipline at the crossroads of information retrieval and computational linguistics that is concerned not with the subject of a document, but with the opinion it expresses
Related Areas
Data Mining (DM), Information Retrieval (IR), Text Classification (TC), Text Summarization (TS)
Center for E-Business Technology, IDS Lab.
Agenda
Introduction
Development of Linguistic Resource
– Conjunction Method
– PMI Method
– WordNet Expansion Method
– Gloss Use Method
Sentiment Classification
– PMI Method
– Machine Learning Method
– NLP Combined Method
Extracting and Summarizing Opinion Expression
– Statistical Approach
– NLP-Based Approach
Discussion
Development of Linguistic Resource (1)
Linguistic resources can be used to extract opinion and to classify the sentiment of text
Appraisal Theory
– Sentiment-related properties are well defined
– A framework of linguistic resources that describes how writers and speakers express inter-subjective and ideological positions
– The underlying linguistic foundation of OM
Tasks
– Determining the subjectivity of a term
– Determining term orientation
– Determining the strength of term attitude
Example
– Objective: vertical, yellow, liquid
– Subjective, positive: good < excellent
– Subjective, negative: bad < terrible
Development of Linguistic Resource (2)
Conjunction Method
PMI Method
– Orientation
– Subjectivity
WordNet Expansion Method
Gloss Use Method
– Orientation
– Subjectivity
– SentiWordNet
Conjunction Method - overview
Hatzivassiloglou and McKeown, 1997
Hypothesis
– Adjectives joined by ‘and’ usually have similar orientation, while ‘but’ joins adjectives of opposite orientation.
Process
Randomly selected adjectives with positive and negative orientation were used as seed terms to predict orientation.
1. All conjunctions of adjectives are extracted from the corpus.
2. A log-linear regression model combines information from different conjunctions to determine whether each pair of conjoined adjectives has the same or different orientation.
3. A clustering algorithm separates the adjectives into two subsets of different orientation, placing as many words of the same orientation as possible into the same subset.
4. The average frequencies in each group are compared, and the group with the higher frequency is labeled as positive.
[Figure: seed terms and a corpus feed the process; adjectives linked by ‘and’ or ‘but’ are separated into positive and negative groups]
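Step 1 of the conjunction method can be illustrated with a minimal Python sketch. The toy corpus, the adjective set, and the helper name `conjunction_links` are assumptions made for illustration only; the original study worked on a far larger corpus with richer linguistic processing than a regular expression.

```python
import re
from collections import Counter

# Hypothetical adjective list; a real system would use a POS tagger.
ADJECTIVES = {"fair", "legitimate", "corrupt", "brutal", "simple"}

def conjunction_links(corpus, adjectives=ADJECTIVES):
    """Count adjective pairs joined by 'and' (same) or 'but' (different)."""
    links = Counter()
    pattern = re.compile(r"\b(\w+) (and|but) (\w+)\b")
    for sentence in corpus:
        for left, conj, right in pattern.findall(sentence.lower()):
            if left in adjectives and right in adjectives:
                pair = tuple(sorted((left, right)))
                label = "same" if conj == "and" else "different"
                links[(pair, label)] += 1
    return links

corpus = [
    "The election was fair and legitimate.",
    "The regime was corrupt and brutal.",
    "The plan is simple but brutal.",
]
links = conjunction_links(corpus)
for (pair, label), count in sorted(links.items()):
    print(pair, label, count)
```

The resulting same/different counts are the kind of evidence that steps 2 and 3 would then combine and cluster.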
Conjunction Method – objective function and constraints
Select the partition p_min that minimizes the objective function Φ(p)
– |Ci| : the cardinality of cluster i
– d(x, y) : the dissimilarity between adjectives x and y
The dissimilarity between adjectives in the same cluster is minimized and the dissimilarity between adjectives in different clusters is maximized.
Experiments
HM term set: 1,336 adjectives
– 657 positive, 679 negative terms
Methods to improve the performance of orientation prediction
– But rule: most conjunctions link adjectives of the same orientation, while conjunctions with ‘but’ almost always link adjectives of opposite orientation
– Log-linear regression model
– Morphological relationships, e.g. adequate-inadequate or thoughtful-thoughtless
Log-linear model with morphological relationships: 82.5% accuracy
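The role of the objective function can be made concrete with a small Python sketch. It assumes that Φ sums, over the two clusters, the within-cluster pairwise dissimilarities d(x, y) divided by the cluster cardinality |Ci|, which matches the symbols defined on this slide. The toy dissimilarity values and the exhaustive search are illustrative assumptions; brute force stands in for the actual clustering algorithm and is only feasible for a handful of words.

```python
from itertools import combinations

def phi(partition, d):
    """Sum over clusters of (1/|C_i|) * sum of pairwise dissimilarities in C_i."""
    total = 0.0
    for cluster in partition:
        pair_sum = sum(d(x, y) for x, y in combinations(sorted(cluster), 2))
        total += pair_sum / len(cluster)
    return total

def best_two_way_partition(words, d):
    """Try every split into two non-empty clusters and keep the minimum of phi."""
    words = sorted(words)
    first, rest = words[0], words[1:]
    best = None
    # Fixing `first` in cluster 1 avoids counting each split twice.
    for mask in range(2 ** len(rest) - 1):
        c1 = {first} | {w for i, w in enumerate(rest) if (mask >> i) & 1}
        c2 = set(words) - c1
        score = phi((c1, c2), d)
        if best is None or score < best[0]:
            best = (score, frozenset(c1), frozenset(c2))
    return best

# Hypothetical dissimilarities: low within an orientation, high across.
TOY_ORIENTATION = {"good": 1, "nice": 1, "bad": -1, "poor": -1}

def toy_d(x, y):
    return 0.1 if TOY_ORIENTATION[x] == TOY_ORIENTATION[y] else 0.9

best = best_two_way_partition(TOY_ORIENTATION, toy_d)
print(best)
```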
PMI Method - overview
Pointwise Mutual Information (PMI)
a measure of association used in information theory and statistics
Orientation
– Turney and Littman, 2003
– terms with similar orientation tend to co-occur in documents
Subjectivity
– Baroni and Vegnaduzzo, 2004
– subjective adjectives tend to occur near other subjective adjectives
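For reference, PMI compares how often two items co-occur with how often they would co-occur if they were independent: PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ). The Python sketch below estimates it from raw document counts; the function name and the counts are illustrative assumptions.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts over 1,000 documents.
associated = pmi(count_xy=40, count_x=100, count_y=100, total=1000)
independent = pmi(count_xy=10, count_x=100, count_y=100, total=1000)
print(associated, independent)
```

A positive value means the two terms co-occur more often than chance would predict; a value near zero means they behave as if independent.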
PMI Method – predicting semantic orientation
A modified PMI was estimated from the number of results returned by the AltaVista search engine for queries using the NEAR operator
Predicting the semantic orientation of a term, SO(t)
– t : target term
– ti : paradigmatic term
Experiments
With the HM term set and three corpora
– With a small corpus, accuracy is not higher than that of the conjunction method.
– With a large corpus, accuracy is higher than that of the conjunction method.

Corpus | AV-ENG | AV-CA | TASA
Approx. # of words in corpus | 1×10^11 | 2×10^9 | 1×10^7
Accuracy | 87.13% | 80.31% | 61.83%
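A hedged Python sketch of how SO(t) can be computed from search-engine hit counts follows. It assumes SO(t) is the sum of PMI(t, ti) over positive paradigmatic terms minus the sum over negative ones; with equally many positive and negative terms, the corpus size and hits(t) cancel, so only NEAR hit counts and the hit counts of the paradigmatic terms remain. All hit counts are invented, and the smoothing constant `eps` is an assumption to avoid taking the logarithm of zero.

```python
import math

def so_pmi(term, pos_words, neg_words, hits, near_hits, eps=0.01):
    """Sum of PMI with positive paradigm words minus sum with negative ones."""
    so = 0.0
    for p in pos_words:
        so += math.log2((near_hits.get((term, p), 0) + eps) / hits[p])
    for n in neg_words:
        so -= math.log2((near_hits.get((term, n), 0) + eps) / hits[n])
    return so

# Hypothetical hit counts standing in for search-engine results.
hits = {"good": 1000, "excellent": 500, "bad": 800, "poor": 400}
near_hits = {
    ("superb", "good"): 50, ("superb", "excellent"): 40,
    ("superb", "bad"): 5, ("superb", "poor"): 2,
    ("awful", "good"): 4, ("awful", "excellent"): 1,
    ("awful", "bad"): 60, ("awful", "poor"): 30,
}
pos_words, neg_words = ["good", "excellent"], ["bad", "poor"]

so_superb = so_pmi("superb", pos_words, neg_words, hits, near_hits)
so_awful = so_pmi("awful", pos_words, neg_words, hits, near_hits)
print(so_superb, so_awful)
```

A positive score suggests positive orientation and a negative score suggests negative orientation.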
WordNet Expansion Method
Hu et al., 2004 used the synonym and antonym relationships between words
Hypothesis
– Adjectives usually share the same orientation as their synonyms and the opposite orientation of their antonyms
Starting from a set of seed adjectives, the orientation of all adjectives in WordNet can be assigned through a procedure that explores the cluster graphs.
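The propagation idea can be sketched in Python as a breadth-first traversal. The small synonym and antonym tables below are hand-made stand-ins for WordNet, and the function name is hypothetical; this illustrates the hypothesis rather than the procedure used in the cited work.

```python
from collections import deque

def propagate_orientation(seeds, synonyms, antonyms):
    """Spread +1/-1 labels from seed adjectives along synonym/antonym links."""
    orientation = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        for other in synonyms.get(word, ()):
            if other not in orientation:
                orientation[other] = orientation[word]   # same orientation
                queue.append(other)
        for other in antonyms.get(word, ()):
            if other not in orientation:
                orientation[other] = -orientation[word]  # opposite orientation
                queue.append(other)
    return orientation

# Hypothetical relation tables standing in for WordNet.
synonyms = {"good": ["fine"], "fine": ["good", "nice"], "bad": ["awful"]}
antonyms = {"good": ["bad"]}

labels = propagate_orientation({"good": 1}, synonyms, antonyms)
print(labels)
```

Words that are not reachable from any seed stay unlabeled, which is one reason the choice of seed adjectives matters.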
Gloss Use Method - overview
Esuli et al., 2005, 2006
Hypothesis
Orientation
– terms with similar orientation have similar glosses
Subjectivity
– terms with similar orientation have similar glosses
– terms without orientation have non-oriented glosses
SentiWordNet
Each synset in WordNet has three scores
– positivity, negativity, and objectivity
Each term sense is positioned within an inverted triangle according to these scores
Example glosses
good: that which is pleasing or valuable or useful; agreeable or pleasing
beautiful: aesthetically pleasing
pretty: pleasing by delicacy or grace; not imposing
yellow: similar to the color of an egg yolk
vertical: at right angles to the plane of the horizon or a base line
Gloss Use Method – classification process
Process
1. A seed set (Lp, Ln) is provided as input.
2. Lexical relations (e.g. synonymy) from a thesaurus or online dictionary are used to extend the seed set. Once added to the original terms, the new terms yield two new, richer sets Trp and Trn; together they form the training set for the learning phase of Step 4.
3. For each term ti in Trp∪Trn or in the test set, a textual representation of ti is generated by collating all the glosses of ti found in a machine-readable dictionary. Each such representation is converted into vectorial form by standard text indexing techniques.
4. A binary text classifier is trained on the terms in Trp∪Trn and then applied to the terms in the test set.
Experiments
– Classifiers: NB, SVM, PrTFIDF
– Accuracy: 87.38%
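Steps 3 and 4 can be illustrated with a minimal Python sketch: glosses are tokenized into bags of words and a small Naive Bayes classifier with add-one smoothing is trained on them. The positive glosses come from the overview slide; the negative glosses for "bad" and "ugly" are invented for illustration, and the hand-written classifier is an assumption standing in for the learners actually used.

```python
import math
from collections import Counter

def tokenize(gloss):
    """Lowercase the gloss and strip simple punctuation from each token."""
    return [w.strip(".,;:").lower() for w in gloss.split() if w.strip(".,;:")]

def train_nb(labelled_glosses):
    """Collect per-class word counts, class counts, and the vocabulary."""
    word_counts, class_counts, vocab = {}, Counter(), set()
    for gloss, label in labelled_glosses:
        tokens = tokenize(gloss)
        word_counts.setdefault(label, Counter()).update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def classify(gloss, model):
    """Return the class with the highest smoothed log-probability."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(gloss):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("that which is pleasing or valuable or useful; agreeable or pleasing", "positive"),
    ("aesthetically pleasing", "positive"),
    # The two glosses below are hypothetical, written for this example.
    ("that which is below standard or expectations; harmful or unpleasant", "negative"),
    ("displeasing to the senses", "negative"),
]
model = train_nb(training)
pretty_label = classify("pleasing by delicacy or grace; not imposing", model)
print(pretty_label)
```

Here the gloss of "pretty" shares the word "pleasing" with the positive training glosses, which is exactly the similarity the hypothesis relies on.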
Development of Linguistic Resource - Summary
Method | Intuition | Accuracy | Characteristics
Conjunction Method | Adjectives in ‘and’ conjunctions usually have similar orientation, whereas ‘but’ is used with opposite orientation | 78.08% | The first attempt; test data: 1,336 adjectives
PMI Method | Terms with similar orientation tend to co-occur in documents | 87.13% | No limitation; much time required
WordNet Expansion Method | Adjectives usually share the same orientation as their synonyms and the opposite orientation of their antonyms | N/A | Limited to WordNet
Gloss Use Method | Terms with similar orientation have similar glosses; terms without orientation have non-oriented glosses | 87.38% | SentiWordNet (all words in WordNet); accuracy depends on the quality of the thesaurus
Sentiment Classification
The process of identifying the sentiment, or polarity, of a piece of text or a document
Document-level
Sentence-level, phrase-level
Feature-level
– Define the target of the opinion and assign the sentiment of the target
Document-level Sentiment Classification Methods
PMI Method
Machine Learning Method
– Default Classifiers
– Enhanced Classifier
NLP Combined Method
– A Two-Step Classification
– Combining Appraisal Theory
PMI Method
Turney, 2002
Process
– Only two-word phrases containing adjectives or adverbs are extracted
– Semantic orientation of a phrase: SO(phrase) = PMI(phrase, “excellent”) - PMI(phrase, “poor”)
– The semantic orientation of a review is the average semantic orientation of its phrases
Experiments
– 410 reviews from Epinions (epinion.com): 170 positive, 240 negative
– Calculating the PMI of 10,658 phrases from the 410 reviews took about 30 hours

Domain of review | Accuracy | Domain of review | Accuracy
Automobiles | 84.00% | Movies | 65.83%
- Honda Accord | 83.78% | - The Matrix | 66.67%
- Volkswagen Jetta | 84.21% | - Pearl Harbor | 65.00%
Banks | 80.00% | Travel Destination | 70.53%
- Bank of America | 78.33% | - Cancun | 64.41%
- Washington Mutual | 81.67% | - Puerto Vallarta | 80.56%
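The review-level decision can be sketched in Python. Expanding SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor") in terms of hit counts leaves only the NEAR hit counts and the hit counts of the two reference words, because the corpus size and hits(phrase) cancel. The hit counts below are invented, and the smoothing constant `eps` and the zero threshold are assumptions of this sketch.

```python
import math

def so_phrase(near_excellent, near_poor, hits_excellent, hits_poor, eps=0.01):
    """SO(phrase) = PMI(phrase, 'excellent') - PMI(phrase, 'poor') from hit counts."""
    return math.log2(((near_excellent + eps) * hits_poor) /
                     ((near_poor + eps) * hits_excellent))

def classify_review(phrase_sos):
    """Average the phrase orientations; a positive average means a positive review."""
    average = sum(phrase_sos) / len(phrase_sos)
    return "positive" if average > 0 else "negative"

# Hypothetical hit counts for two phrases extracted from one review.
so_a = so_phrase(near_excellent=80, near_poor=10, hits_excellent=1000, hits_poor=1000)
so_b = so_phrase(near_excellent=5, near_poor=20, hits_excellent=1000, hits_poor=1000)
review_label = classify_review([so_a, so_b])
print(so_a, so_b, review_label)
```

Since every phrase needs its own search-engine queries, the cost grows with the number of phrases, which is consistent with the long running time reported above.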
ML - Default Classifier
Pang and Lee, 2002
A special case of text categorization with sentiment- rather than topic-based categories
Document modeling
standard bag-of-features framework
Experiments
Data : movie reviews (Internet Movie Database), rating -> negative, neutral, positive
Naïve Bayes, Maximum Entropy, Support Vector Machine
In terms of relative performance, Naïve Bayes tends to do the worst and SVM tends to do the best, although the differences aren’t very large.
Features | # of features | Frequency or presence? | NB | ME | SVM
unigrams | 16165 | freq. | 78.7 | N/A | 72.8
unigrams | 16165 | pres. | 81.0 | 80.4 | 82.9
unigrams+bigrams | 32330 | pres. | 80.6 | 80.8 | 82.7
bigrams | 16165 | pres. | 77.3 | 77.4 | 77.1
unigrams+POS | 16695 | pres. | 81.5 | 80.4 | 81.9
adjectives | 2633 | pres. | 77.0 | 77.7 | 75.1
top 2633 unigrams | 2633 | pres. | 80.3 | 81.0 | 81.4
unigrams+position | 22430 | pres. | 81.0 | 80.1 | 81.6
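The difference between the "freq." and "pres." settings in the table can be shown with a short Python sketch of a unigram bag-of-features representation. The two toy documents and the helper names are assumptions; whitespace tokenization is a simplification.

```python
from collections import Counter

def build_vocabulary(documents):
    """Sorted list of all unigrams seen in the documents."""
    return sorted({token for doc in documents for token in doc.lower().split()})

def featurize(document, vocabulary, presence=True):
    """Map a document to a vector of 0/1 presence flags or raw frequencies."""
    counts = Counter(document.lower().split())
    if presence:
        return [1 if counts[t] > 0 else 0 for t in vocabulary]
    return [counts[t] for t in vocabulary]

docs = ["great great movie", "boring movie"]
vocabulary = build_vocabulary(docs)
presence_vector = featurize(docs[0], vocabulary, presence=True)
frequency_vector = featurize(docs[0], vocabulary, presence=False)
print(vocabulary, presence_vector, frequency_vector)
```

Vectors of this kind are what a classifier such as Naïve Bayes, Maximum Entropy, or an SVM would be trained on.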
ML - Using Only Subjective Sentences
Pang and Lee, 2004
improved polarity classification by removing objective sentences
A subjectivity detector determines whether each sentence is subjective or not
Standard subjectivity classifier
Subjectivity classifier using proximity relationship
Using subjectivity extracts can improve polarity classification, or at least causes no loss of accuracy.
NLP Combined Method – A Two-Step Classification
Wilson et al., 2005
A Two-Step Contextual Polarity Classification
employ machine learning and 28 linguistic features
document polarity : the average polarity of phrases
Step 1. Neutral-polar classifier classifies each phrase containing a clue as neutral or polar
Step 2. Polarity classifier takes all phrases marked in step 1 as polar and disambiguates their contextual polarity (positive, negative, both, or neutral).
28 features were extracted using NLP techniques with a dependency parser
– 4 Word Features, 8 Modification Features, 11 Structure Features, 3 Sentence Features, 1 Document Feature
Experiments
Data : Multi-perspective Question Answering (MPQA) Opinion Corpus
Neutral-polar classification (%)
Features | Accuracy
Word token | 73.6
Word+priorpol | 74.2
28 features | 75.9

Polarity classification (%)
Features | Accuracy
Word token | 61.7
Word+priorpol | 63.0
10 features | 65.7
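The control flow of the two steps can be illustrated with a Python sketch. Everything in it is a stand-in: the tiny prior-polarity lexicon, the cue-word sets, and the hand-written rules replace the trained classifiers and the 28 linguistic features, and none of them is taken from the cited work. Only the structure (first neutral versus polar, then contextual polarity for polar phrases) follows the slide.

```python
# Hypothetical prior-polarity lexicon and cue words, for illustration only.
PRIOR = {"good": "positive", "great": "positive", "bad": "negative", "poor": "negative"}
NEGATORS = {"not", "never", "no"}
NEUTRAL_CONTEXT = {"if", "whether"}

def step1_neutral_or_polar(tokens, i):
    """Stand-in for the neutral-polar classifier: look at the 3 preceding tokens."""
    window = set(tokens[max(0, i - 3):i])
    return "neutral" if NEUTRAL_CONTEXT & window else "polar"

def step2_polarity(tokens, i):
    """Stand-in for the polarity classifier: a nearby negator flips the prior polarity."""
    prior = PRIOR[tokens[i]]
    window = set(tokens[max(0, i - 3):i])
    if NEGATORS & window:
        return "negative" if prior == "positive" else "positive"
    return prior

def contextual_polarity(sentence):
    """Run step 1 on every clue; run step 2 only on clues marked polar."""
    tokens = sentence.lower().split()
    results = []
    for i, token in enumerate(tokens):
        if token in PRIOR:
            if step1_neutral_or_polar(tokens, i) == "neutral":
                results.append((token, "neutral"))
            else:
                results.append((token, step2_polarity(tokens, i)))
    return results

print(contextual_polarity("the service was not good"))
print(contextual_polarity("they asked whether it was good"))
```

Splitting the decision this way lets the second step focus on clues that actually carry sentiment in context.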
NLP Combined Method - Combining Appraisal Theory
Whitelaw et al., 2005 applied the appraisal theory to the machine learning methods of Pang and Lee
Structure of an appraisal
An example “not very happy”
Experiments
– A lexicon of 1,329 appraisal entries was produced semi-automatically from 400 seed terms in around twenty man-hours
– Combining attitude type and orientation: accuracy 90.2%
Sentiment Classification - Summary
Method | Characteristics | Pros | Cons
PMI Method | Uses phrase PMI | Simple; no prior polarity dictionary needed | Loss of contextual meaning; slow (time to get PMI)
Machine Learning Method | Bag of words; unigram to bigram or n-gram; SVM, NB, MaxEnt | Simple; no prior polarity dictionary needed | Loss of contextual meaning; needs a learning phase
NLP Combined Method | Based on ML; parsing or syntactic analysis; prior polarity to contextual polarity | Considers contextual meaning; easily extendible for various purposes | Needs a prior polarity dictionary; syntactic analysis overhead
Extracting and Summarizing Opinion Expression
Goal
– Extract opinion expressions from a large number of reviews and present them in an effective way
Tasks
– Feature Extraction: sentiment classification at the feature level requires the extraction of features that are the targets of opinion words
– Sentiment Assignment: each feature is usually classified as being either favorable or unfavorable
– Visualization: extracted opinion expressions are summarized and visualized
Methods
Statistical Approaches
– ReviewSeer (2003)
– Opinion Observer (2004)
– Red Opal (2007)
NLP-Based Approaches
– Kanayama System (2004)
– WebFountain (2005)
– OPINE (2005)
[Figure: product reviews pass through Extract Features, Assign Sentiment, and Summarize]
Opinion Observer - Overview
Hu and Liu, 2005
Extract and summarize opinion expression from customer reviews on the Web.
Mines only the product features on which customers have expressed opinions, and whether those opinions are positive or negative
Overall process
1. Review crawling
2. Feature extraction
3. Sentiment assignment
– Opinion word extraction
– Opinion orientation identification
4. Summary generation
Opinion Observer - Tasks
Feature Extraction
– Product features are extracted from nouns and noun phrases by the association miner CBA
– Compactness pruning, redundancy pruning
Sentiment Assignment
– Opinion sentence: a sentence that contains one or more product features and one or more opinion words
– Adjectives are the only opinion words
– The prior polarity of adjectives is identified by the WordNet expansion method with seed terms
– Infrequent features are extracted by using frequent opinion words
– The polarity of a sentence is assigned as its dominant orientation
– Extracted form: (product feature, # of positive sentences, # of negative sentences)
Experiments
– Large collection of reviews of 15 electronic products
– 86.3% recall, 84.0% precision
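The sentiment assignment and the extracted form can be sketched in Python. The feature set, the opinion-word polarities, and the review sentences are invented for illustration; in the system described above the features come from association mining and the polarities from WordNet expansion. Sentences whose positive and negative opinion words balance out are skipped here, which is a simplification of this sketch.

```python
from collections import defaultdict

# Hypothetical product features and adjective polarities.
FEATURES = {"battery", "screen"}
OPINION_WORDS = {"great": 1, "good": 1, "poor": -1, "dim": -1}

def summarize(sentences):
    """Return (feature, # positive sentences, # negative sentences) tuples."""
    summary = defaultdict(lambda: [0, 0])
    for sentence in sentences:
        tokens = [t.strip(".,!?") for t in sentence.lower().split()]
        features = [t for t in tokens if t in FEATURES]
        # Dominant orientation: sign of the summed opinion-word polarities.
        score = sum(OPINION_WORDS.get(t, 0) for t in tokens)
        if not features or score == 0:
            continue  # not an opinion sentence, or no dominant orientation
        for feature in features:
            if score > 0:
                summary[feature][0] += 1
            else:
                summary[feature][1] += 1
    return [(f, pos, neg) for f, (pos, neg) in sorted(summary.items())]

reviews = [
    "The battery is great.",
    "The screen is dim.",
    "Battery life is poor.",
    "I bought it yesterday.",
]
feature_summary = summarize(reviews)
print(feature_summary)
```

Tuples of this shape are what a per-feature bar graph of positive and negative sentence counts can be drawn from.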
Opinion Observer - Visualization
Features of products are compared in a bar graph
The numbers of positive and negative sentences for each feature are normalized
[Figure: bar graph with a positive portion and a negative portion for each feature]
WebFountain - Overview
Yi et al., 2005
Extracts target features of the sentiment from the various resources and assigns polarity to the features
System Architecture
Sentiment Miner
Analyzes grammatical sentence structures and phrases by using NLP techniques
WebFountain – Tasks
Feature Extraction
Candidate features
– a part-of relationship with the given topic
– an attribute-of relationship with the given topic
– an attribute-of relationship with a known feature of the given topic
The bBNP (Beginning definite Base Noun Phrase) heuristic is used
Select the bNPs (base noun phrases) that have a high likelihood ratio
Experiments
– Precision: digital camera 97%, music reviews 100%
Sentiment Assignment
Parse and traverse with two linguistic resources
– Sentiment lexicon: defines the sentiment polarity of terms
– Sentiment pattern database: contains the sentiment assignment patterns of predicates
Experiments
– Product reviews
– Recall 56%, Precision 87%
WebFountain – Visualization
Web interface listing sentiment bearing sentences about a given product
Extracting and Summarizing Opinion Expression - Summary
Statistical
ReviewSeer (2003)
– Feature Extraction: N/A
– Sentiment Assignment: probabilistic model, Naïve Bayes; Accuracy: 85.3%
– Visualization: lists feature terms and their scores, and shows the sentences that contain each feature term
Opinion Observer (2004)
– Feature Extraction: CBA miner; infrequent feature selection
– Sentiment Assignment: WordNet expansion; prior polarity of adjectives
– Visualization: graph
– Reported results: Recall: 86.3%, Precision: 84.0%
Red Opal (2007)
– Feature Extraction: frequent nouns and noun phrases; Precision: 85%
– Sentiment Assignment: uses users' ratings; Precision: 80%
– Visualization: product list ordered by the score of each feature; the confidence of the scoring
NLP-based
Kanayama's system (2004)
– Feature Extraction and Sentiment Assignment: sentiment units, modifying the machine translation framework; Recall: 43%, Precision: 89%
– Visualization: N/A
WebFountain (2005)
– Feature Extraction: bBNP heuristics, likelihood ratio; Precision: 97%
– Sentiment Assignment: sentiment lexicon, sentiment pattern database; Recall: 56%, Precision: 87%
– Visualization: lists the sentiment-bearing sentences about a product
OPINE (2005)
– Feature Extraction: Web PMI; Recall: 76%, Precision: 79%
– Sentiment Assignment: relaxation labeling; Recall: 89%, Precision: 86%
– Visualization: N/A
Discussion
OM is a growing research discipline related to various research areas, such as IR, computational linguistics, TC, TS, and DM.
This survey covered three topics and summarized each of them.
For Korean OM?
There is no published research on Korean OM.
Language differences may impose some limits on the methods used in the OM subtasks.
– Structural differences between English and Korean may mean that the same heuristics cannot be applied to extract features from text
– The lack of a Korean thesaurus similar to WordNet limits the methods for obtaining the prior polarity of words to the PMI or conjunction methods
Research into Korean OM must be conducted in conjunction with other related areas.
Discussion - Research Map of OM
Thank you