Sentiment Classification with RapidMiner
Bruno Ohana and Brendan Tierney
DIT School of Computing
June 2011
Our Talk
• Introduction to Sentiment Analysis
• Supervised Learning Approaches
• Case Study with RapidMiner
Motivation
"81% of US internet users (60% of population) have used the internet to perform research on a product they intended to purchase, as of 2007."
"Over 30% of US internet users have at one time posted a comment or online review about a product or service they've purchased."
(Horrigan, 2008)
Motivation
• A lot of online content is subjective in nature.
• User Generated Content: product reviews, blog posts, twitter, etc.
  • epinions.com, Amazon, RottenTomatoes.com.
• Sheer volume of opinion data calls for automated analytical methods.
Why Are Automated Methods Relevant?
• Search and Recommendation Engines.
  • Show me only positive/negative/neutral.
• Market Research.
  • What is being said about brand X on Twitter?
• Contextual Ad Placement.
• Mediation of online communities.
A Growing Industry
• Opinion Mining offerings
  • Voice of Customer analytics
  • Social Media Monitoring
  • SaaS or embedded in data mining packages
Opinion Mining – Sentiment Classification
• For a given text document, determine its sentiment orientation.
  • Positive or Negative, Favorable or Unfavorable, etc.
  • Binary or along a scale (e.g. 1-5 stars).
• Data is in unstructured text format, from sentence to document level.
Ex: Positive or Negative?
"This is by far the worst hotel experience I've ever had. The owner overbooked while I was staying there (even though I booked the room two months in advance) and made me move to another room, but that room wasn't even a hotel room!"
Supervised Learning for Text
• Train a classifier algorithm based on a training data set.
• Raw data will be text.
• Approach: use term presence information as features.
• A plain text document becomes a word vector.
Supervised Learning for Text
• A word vector can be used to train a classifier.
• Building a Word Vector:
  • Unit of tokenization: uni/bi/n-gram
  • Term presence metric: binary, tf-idf, frequency
  • Stemming
  • Stop Words Removal
Pipeline: IMDB Data Set (Plain Text) → Tokenize → Stemming → Word Vector → Train Classifier
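The tokenize → stem → word-vector pipeline above can be sketched in plain Python. This is only an illustration of the idea, not the RapidMiner operators the deck uses, and the crude suffix-stripping stemmer is a stand-in for a real stemmer such as Porter's:

```python
import re

def tokenize(text):
    """Lowercase a document and split it into unigram tokens."""
    return re.findall(r"[a-z']+", text.lower())

def stem(token):
    """Toy suffix-stripping stemmer (a real pipeline would use Porter/Snowball)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def word_vector(doc, vocabulary):
    """Binary term-presence vector over a fixed vocabulary."""
    present = {stem(t) for t in tokenize(doc)}
    return [1 if term in present else 0 for term in vocabulary]

docs = ["A boring, poorly acted film.", "An excellent and imaginative comedy!"]
vocab = sorted({stem(t) for d in docs for t in tokenize(d)})
vectors = [word_vector(d, vocab) for d in docs]
```

Swapping the binary indicator for a term count or a tf-idf weight changes only the last step; the labeled vectors then feed any standard learner.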
Opinion Mining – Sentiment Classification
Challenges of Data Driven Approaches
• Domain dependence.
  • "chuck norris" might be a good sentiment predictor, but on movies only.
• We lose discourse information.
  • Ex: negation detection: "This comedy is not really funny."
• NLP techniques might help.
RapidMiner Case Study
• Sentiment Classification based on Word Vectors.
• Convert text data to Word Vectors using RapidMiner's Text Processing Extension.
• Use it to Train/Test a Learner Model using Cross-Validation.
• Use Correlation and Parameter Testing to pick better features.
• Our data set is a collection of film reviews from IMDB presented in (Pang et al, 2004).
RapidMiner Case Study
• Select the document collection from a directory.
• From text to a list of tokens.
• Convert word variations to their stem.
RapidMiner Case Study – Parameter Testing
• Filter the "top K" most correlated attributes.
• K is a macro iterated using Parameter Testing.
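Outside RapidMiner, the "top K most correlated attributes" filter amounts to ranking feature columns by their correlation with the class label and keeping the best K. A minimal sketch (illustrative helper names; Pearson correlation on binary word-vector columns):

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def top_k_attributes(rows, labels, k):
    """Keep the k feature columns most correlated (in absolute value) with the label."""
    n_cols = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels)) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Toy word-vector rows: column 0 tracks the label, column 1 is noise, column 2 anti-tracks it.
rows = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
labels = [1, 1, 0, 0]
keep = top_k_attributes(rows, labels, 2)  # → [0, 2]
```

Iterating k over a range of values is what the slide's Parameter Testing macro automates.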
RapidMiner Case Study – Cross Validation: Training Step
• Calculate Attribute Weights and Normalize.
• Pass models on "through port" to Testing.
• Select "top k" attributes by weight and train SVM.
RapidMiner Case Study – Cross Validation: Testing Step
Case Study – Adding More Features
• Pre-computed features based on text statistics.
  • Document, Word and Sentence Sizes, Part-of-speech Presence, Stop words ratio, Syllable Count.
• Features based on scoring using a sentiment lexicon (Ohana & Tierney '09).
  • Used SentiWordNet as the lexicon (Esuli et al, '09).
• In RapidMiner we can merge those data sets using a known unique ID (the file name in our case).
Opinion Lexicons
• A database of terms and the opinion information they carry.
• Some terms and expressions carry "a priori" opinion bias, relatively independent from context.
  • Ex: good, excellent, bad, poor.
• To build the data set:
  • Score document based on terms found.
  • Total positive/negative scores.
  • Per part-of-speech.
  • Per document section.
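A minimal sketch of lexicon-based scoring, assuming a toy lexicon of (positive, negative) term weights; the entries and values below are illustrative, not actual SentiWordNet scores:

```python
# Toy lexicon: (positive, negative) "a priori" scores per term
# (illustrative values, not actual SentiWordNet entries).
LEXICON = {
    "good": (0.6, 0.0), "excellent": (0.9, 0.0),
    "bad": (0.0, 0.7), "poor": (0.0, 0.6),
}

def score_document(text):
    """Sum positive and negative lexicon scores over the tokens found."""
    pos = neg = 0.0
    for token in text.lower().split():
        token = token.strip(".,!?\"'")
        p, n = LEXICON.get(token, (0.0, 0.0))
        pos += p
        neg += n
    return pos, neg

pos, neg = score_document("An excellent film with a good cast, but poor pacing.")
label = "positive" if pos >= neg else "negative"
```

The per-part-of-speech and per-section variants on the slide just maintain separate running totals keyed by POS tag or document region.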
Lexicon Based Approach
Pipeline: IMDB Data Set (Plain Text) → POS Tagger → Negation Detection → SentiWordNet Scoring → Document Scores / SWN Features
Part of Speech Tagging
The computer-animated comedy " shrek " is designed to be enjoyed on different levels by different groups . for children , it offers imaginative visuals , appealing new characters mixed with a host of familiar faces , loads of action and a barrage of big laughs

The/DT computer-animated/JJ comedy/NN ''/'' shrek/NN ''/'' is/VBZ designed/VBN to/TO be/VB enjoyed/VBN on/IN different/JJ levels/NNS by/IN different/JJ groups/NNS ./. for/IN children/NNS ,/, it/PRP offers/VBZ imaginative/JJ visuals/NNS ,/, appealing/VBG new/JJ characters/NNS mixed/VBN with/IN a/DT host/NN of/IN familiar/JJ faces/NNS ,/, loads/NNS of/IN action/NN and/CC a/DT barrage/NN of/IN big/JJ laughs/NNS
Negation Detection
• NegEx (Chapman et al '01).
• Look for negating expressions.
  • Pseudo-negations: "no wonder", "no change", "not only"
  • Forward and Backward Scope: "don't", "not", "without", "unlikely to", etc.
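A much-simplified NegEx-style sketch: skip pseudo-negations, then mark tokens falling inside a fixed forward window of a negating expression. The real algorithm also handles backward scope and a far larger phrase list; the window size and phrase lists here are illustrative:

```python
NEGATORS = ("not", "don't", "without", "no")
PSEUDO = ("no wonder", "no change", "not only")  # phrases that look negating but aren't
WINDOW = 3  # forward scope: how many following tokens a negator affects

def mark_negated(tokens):
    """Return tokens, prefixing NOT_ to those inside a negator's forward scope."""
    out, scope = [], 0
    for i, tok in enumerate(tokens):
        bigram = " ".join(tokens[i:i + 2]).lower()
        if bigram in PSEUDO:          # pseudo-negation: do not open a scope
            out.append(tok)
            continue
        if tok.lower() in NEGATORS:   # real negator: open a forward scope
            out.append(tok)
            scope = WINDOW
        elif scope > 0:               # inside a scope: tag the token
            out.append("NOT_" + tok)
            scope -= 1
        else:
            out.append(tok)
    return out

marked = mark_negated("this comedy is not really funny".split())
# → ['this', 'comedy', 'is', 'not', 'NOT_really', 'NOT_funny']
```

Tagging "funny" as "NOT_funny" lets a word-vector learner treat negated and plain occurrences as distinct features, recovering some of the discourse information noted earlier.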
Case Study – Adding More Features
• Data Set Merging
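The merge that RapidMiner performs here, combining the word-vector and lexicon-feature tables on the shared file-name ID, can be sketched as an inner join over dictionaries (the field names below are illustrative):

```python
def join_on_id(left, right, key="file"):
    """Inner-join two lists of feature dicts on a shared unique ID column."""
    index = {row[key]: row for row in right}
    merged = []
    for row in left:
        other = index.get(row[key])
        if other is not None:
            combined = dict(row)  # copy so inputs stay untouched
            combined.update({k: v for k, v in other.items() if k != key})
            merged.append(combined)
    return merged

word_vectors = [{"file": "cv001.txt", "funny": 1}, {"file": "cv002.txt", "funny": 0}]
swn_features = [{"file": "cv001.txt", "pos_score": 1.5}, {"file": "cv002.txt", "pos_score": 0.2}]
examples = join_on_id(word_vectors, swn_features)
```

The key must be unique on the right-hand side, which the file name guarantees here: one review per file, one feature row per file.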
Results – Accuracy
Average Accuracy using 10-fold Cross-validation:

Method                                                              | Accuracy % | Feature Count
Baseline word vector                                                | 85.39      | 6739
Baseline less uncorrelated attributes                               | 85.49      | 1800
Document Stats (S)                                                  | 68.73      | 22
SentiWordNet features (SWN)                                         | 67.40      | 39
Merging (S) + (SWN)                                                 | 72.79      | 61
Merging Baseline + (S) + (SWN) and removing uncorrelated attributes | 86.39      | 1800
Opinion Mining – Sentiment Classification
Some results from the field (IMDB data set):

Method                                           | Accuracy | Source
Support Vector Machines and Bigrams word vector  | 77.10%   | (Pang et al, 2002)
Word Vector Naïve Bayes + Parts of Speech        | 77.50%   | (Salvetti et al, 2004)
Support Vector Machines and Unigrams word vector | 82.90%   | (Pang et al, 2002)
Unigrams + Subjectivity Detection                | 87.15%   | (Pang et al, 2004)
SVM + stylistic features                         | 87.95%   | (Abbasi et al, 2008)
SVM + GA feature selection                       | 95.55%   | (Abbasi et al, 2008)
Results – Term Correlation
Terms (after Stemming):
Most Correlated: didn, georg, add, wast, bore, guess, bad, son, stupid, masterpiece, perform, stereotyp, if, adventur, oscar, worst, blond, mediocr
Least Correlated: already, face, which, put, same, without, someth, must, manag, someon, talent, get, goe, sinc, abrupt
Thank You