Sentiments Improvement


Research and development project on improving the accuracy of rule-based sentiment analysis.


Page 1: Sentiments Improvement

Sentiment improvements

Proposed ideas:

Part I. Data preprocessing

Part II. PMI-IR approach

Team members:

Denys Astanin, Mykhailo Kozik

Page 2: Sentiments Improvement

Data preprocessing

Raw data → Preprocessed data

Narrowing long words: goooood → good

Emoticons decoding: :'( → cry

Spell correction: I am shure that is realy exsellent plece → I am sure that is really excellent place

Abbreviations decoding: lol → laughing out loud

Tags detection: @Alex nice photo #photoworld

Page 3: Sentiments Improvement

Narrowing long words

This hotel so goooooooood! → This hotel so good!

NEUTRAL → POSITIVE

Use a regexp to narrow any run of more than 2 identical letters in a word down to just 2

goooooood → good (correct narrowing)

baaaaaad → baad (incorrect narrowing, but will be corrected with spell-checker)

This place not coooooool! → This place not cool!

NEUTRAL → NEGATIVE

Try this regexp: http://regexr.com?30abm
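The narrowing rule can be sketched in a few lines. This is a minimal Python illustration, not the regexp behind the link above; the function name `narrow_long_words` is hypothetical, and restricting the pattern to ASCII letters is an assumption.

```python
import re

# A letter followed by two or more repeats of itself (3+ identical letters in a row).
# Restricting to ASCII letters is an assumption made for this sketch.
_LONG_RUN = re.compile(r"([A-Za-z])\1{2,}")

def narrow_long_words(text):
    """Collapse every run of 3 or more identical letters down to exactly 2."""
    return _LONG_RUN.sub(r"\1\1", text)

print(narrow_long_words("This hotel so goooooooood!"))
print(narrow_long_words("baaaaaad"))
```

As the slide notes, a result such as "baad" is still wrong after narrowing and is left for the spell-checker.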

Page 4: Sentiments Improvement

Narrowing long words. Examples

dancing with the stars and two and a half men toniiiight

@BrunoMars you were AMAZINGGGGGG at the vma's need to see you!

RT @BriannaStull13: I hateeeeeee pandora ads....

It was sooo badddd

Woooooooooooow I Like that, very nice and big like

Thts cooool

i hack any thing but for moneyyyy

who know how hacked one add fb??? pleaseeee

Page 5: Sentiments Improvement

Narrowing long words. Performance

                10K           100K           1M
Long words      83.13 msec    828.30 msec    8370.97 msec (~8 sec.)
Normal words    31.92 msec    275.34 msec    2763.77 msec (~3 sec.)
Mixed words*    35.23 msec    339.23 msec    3370.31 msec (~3 sec.)

* assuming that 1% of words are long words

Page 6: Sentiments Improvement

Emoticons decoding

Using a map of smile meanings, convert each smile to the word it stands for

<3 → love

:( → sad

Look at her http://t.co/12345 <3 → Look at her http://t.co/12345 love

NEUTRAL → POSITIVE

I will be out of work tomorrow :( → I will be out of work tomorrow sad

NEUTRAL → NEGATIVE

List of emoticons: http://en.wikipedia.org/wiki/List_of_emoticons
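A minimal Python sketch of map-based decoding, assuming plain substring replacement. The table holds only the three mappings shown in this deck; the names `EMOTICONS` and `decode_emoticons` are hypothetical.

```python
# Mappings taken from the slides; a real table would be much larger.
EMOTICONS = {
    "<3": "love",
    ":(": "sad",
    ":'(": "cry",
}

def decode_emoticons(text, table=EMOTICONS):
    """Replace each known emoticon with the word it stands for."""
    # Longest emoticons first, so a short one never eats part of a longer one.
    for smile in sorted(table, key=len, reverse=True):
        text = text.replace(smile, table[smile])
    return text

print(decode_emoticons("Look at her http://t.co/12345 <3"))
print(decode_emoticons("I will be out of work tomorrow :("))
```

Each table entry costs one full pass over the text, so the running time grows with the size of the table.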

Page 7: Sentiments Improvement

Emoticons decoding. Examples

Awww He is Too cute :) Thanks bae next weekend..

@LenovoDoTour I have missed these two days in Belgrade :(

Katie Holmes <3 #VMA

ahaha just to warn you!! ;)

it's amazing how Oracle can do so much! I'm loving it <3

please someone help me i need to finish this im out of time!! thank!! :D

Boa noite, viajantes! Menos um diazinho nessa semana =)

:-( don't have my Mcard number required to fill out form

Page 8: Sentiments Improvement

Emoticons decoding. Performance

                 10K            100K                      1M
1 smile list     45.03 msec     444.62 msec               4426.74 msec (~4 sec.)
5 smile list     189.87 msec    1304.10 msec (~1 sec.)    12355.37 msec (~12 sec.)
10 smile list    227.26 msec    2325.23 msec (~2 sec.)    26954.26 msec (~27 sec.)

Performance degrades as the smile list grows because of the method that performs the replacements. Better results can be achieved by using state machines or regexps
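One way to read the regexp suggestion: compile the whole smile list into a single alternation, so the text is scanned once regardless of list size. This is a sketch under that assumption and was not measured against the numbers above. `build_decoder` is a hypothetical name, and the meaning given to ":-(" is assumed.

```python
import re

def build_decoder(table):
    """Return a function that decodes all emoticons in one pass over the text."""
    # Longest alternatives first, because the regex engine takes the first match.
    keys = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: table[m.group(0)], text)

# ":-(" -> "sad" is an assumption; the deck only gives meanings for "<3" and ":(".
decode = build_decoder({"<3": "love", ":(": "sad", ":-(": "sad"})
print(decode("I'm loving it <3"))
print(decode(":-( don't have my Mcard number"))
```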

Page 9: Sentiments Improvement

Abbreviations decoding

Using a map of abbreviations, convert each abbreviation to the words it stands for

lol → laughing out loud

thx → thanks

Got it! lol → Got it! laughing out loud

NEUTRAL → POSITIVE

I was DWI, haha → I was driving while intoxicated, haha

NEUTRAL → NEGATIVE

List of abbreviations: http://www.smartdefine.org/internet_slang/abbreviations/r
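A minimal Python sketch of abbreviation decoding. Unlike emoticons, abbreviations have to match whole words only, or "lol" would also fire inside "lollipop". The table uses only expansions shown in this deck; the names and the case-insensitive matching are assumptions.

```python
import re

# Expansions taken from the slides; keys are stored in lower case.
ABBREVIATIONS = {
    "lol": "laughing out loud",
    "thx": "thanks",
    "dwi": "driving while intoxicated",
}

def decode_abbreviations(text, table=ABBREVIATIONS):
    """Expand known abbreviations that appear as whole words, ignoring case."""
    alternatives = "|".join(re.escape(k) for k in table)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: table[m.group(0).lower()], text)

print(decode_abbreviations("Got it! lol"))
print(decode_abbreviations("I was DWI, haha"))
```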

Page 10: Sentiments Improvement

Abbreviation decoding. Examples

No offense though.. Lol

O lmao!

http://t.co/Evvh4hj ROFL

JFYI #blackcarpet

Nice code LOL

TNX you Rose! We appreciate it!

OMG, FML!

Wait me, i will be AFK

Page 11: Sentiments Improvement

Emoticons and Abbreviations

Alternative approach:

● Abbreviations, acronyms, slang words are already parsed as tokens

● Parse smiles as tokens in FX as well

● Now we can use "Tune sentiments" on these tokens

Page 12: Sentiments Improvement

Spell correction

Perform spell correction on data before sentiment calculation

I lov this hotel! → I love this hotel!

NEUTRAL → POSITIVE

They have terryble servic → They have terrible service

NEUTRAL → NEGATIVE

Page 13: Sentiments Improvement

Spell correction. Examples

i hope @ladygaga will take some rest now becauce of...

But its still also hilarioouss

Shoukd i wast my money?

Business eviroment

It's impossibru!

I like dansing! <3

You can dowload the data from http://to.download/file

Coleguaues, lets keep it clean.

Page 14: Sentiments Improvement

Spell correction. Edit distance

Edit types:

Deletion: beauetiful → beautiful

Insertion: speling → spelling

Substitution: performanse → performance

Swapping: yaer → year

Examples

unsucesful → unsuccesful → unsuccessful (2 edits)

wardoub → wardroub → wardrobu → wardrobe (3 edits)
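The edit counts above can be checked with a small dynamic-programming routine. This Python sketch computes the optimal string alignment distance, a variant of edit distance that counts the four edit types listed on this slide. It is an illustration only; the function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of deletions, insertions, substitutions and adjacent swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swap of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

print(edit_distance("unsucesful", "unsuccessful"))
print(edit_distance("wardoub", "wardrobe"))
```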

Page 15: Sentiments Improvement

Spell correction. Algorithm

Peter Norvig's spelling corrector

● Bayes rule approach

● Train data

● Simple implementation

● High performance

● Low accuracy

More theory: http://norvig.com/spell-correct.html

Train data: http://norvig.com/big.txt
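A condensed Python sketch in the style of Norvig's corrector: generate all candidates one edit away, keep those seen in the train data, and pick the most frequent. The tiny inline corpus stands in for big.txt and is an assumption, as are the function names; the project itself was written in Clojure, so this is not its code.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def train(text):
    """Count word frequencies in the train data."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def edits1(word):
    """All strings one deletion, swap, substitution or insertion away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    subs = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + swaps + subs + inserts)

def correct(word, counts):
    """Prefer the word itself, then known words 1 edit away, then 2 edits away."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        candidates = [w2 for w1 in edits1(word) for w2 in edits1(w1) if w2 in counts]
    # The most frequent known candidate wins; unknown words are returned unchanged.
    return max(candidates, key=counts.get) if candidates else word

# Toy train data (assumption); the slides use big.txt.
COUNTS = train("I love this hotel. They have terrible service. I love the service.")
print(correct("lov", COUNTS), correct("terryble", COUNTS), correct("servic", COUNTS))
```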

Page 16: Sentiments Improvement

Spell correction. Coverage

Edit1 + Edit2 covers 98%!!!

Page 17: Sentiments Improvement

Spell correction. Accuracy

           Test data 1    Test data 2
1 edit     61.8%          67.2%
2 edits    71.2%          74.1%

Test data 1: Wikipedia – Common misspelled words (~4k) http://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines

Test data 2: Birkbeck spelling error corpus (270) http://www.ota.ox.ac.uk/headers/0643.xml

Page 18: Sentiments Improvement

Spell correction. Performance

           10K                           100K                        1M
1 edit     11350.52 msec (~11 sec.)      117261.12 msec (~2 min.)    1252882.23 msec (~20 min.)
2 edits    4300631.29 msec (~70 min.)

Due to quadratic complexity these tests make no sense

Spell-check complexity for a word:

Edit distance 1: O(C·n)

Edit distance 2: O(C²·n²)

where n is the length of the word and C ≈ 50

Page 19: Sentiments Improvement

Spell correction. Improvements

Performance:

● Memoize correction (Best → O(1))

● Give the user the ability to perform spell correction

● Improve train data

Coverage & Accuracy:

● Use more edit candidates

● Use common misspelling rules

● Use weights for edit operations

● Take part of speech into account

● Take context into account

● Improve train data
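Memoizing the correction means each distinct word is corrected once and every repeat becomes a cache lookup. A minimal Python sketch using `functools.lru_cache`; `slow_correct` is a hypothetical stand-in for the real corrector and simply counts how often it is called.

```python
from functools import lru_cache

CALLS = {"count": 0}

def slow_correct(word):
    """Hypothetical stand-in for an expensive spelling corrector."""
    CALLS["count"] += 1
    return {"lov": "love", "servic": "service"}.get(word, word)

@lru_cache(maxsize=None)
def cached_correct(word):
    # The first call per word runs the corrector; later calls hit the cache.
    return slow_correct(word)

for token in ["lov", "lov", "servic", "lov"]:
    cached_correct(token)
print(CALLS["count"])  # the corrector ran once per distinct word
```

The cache only helps on repeated words, so the gain depends on how often the same misspellings recur in the data.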

Page 20: Sentiments Improvement

Tags detection

Process source-specific information (Twitter) differently

I say to @love hello! → I say to - hello!

NEUTRAL POSITIVE

I mean that i #hatetwitter → I mean that i hate twitter

NEUTRAL → NEGATIVE

● Hashtag (#music): use word splitter

● Username (@LadyGaga): just ignore it
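Both rules fit in a short Python sketch. Replacing a username with "-" follows the example on this slide. The splitter is passed in as a parameter because word splitting is a separate step; the default leaves the hashtag text unsplit. All names here are hypothetical.

```python
import re

def process_tags(text, split_hashtag=lambda tag: tag):
    """Neutralize @usernames and hand the text of each #hashtag to a word splitter."""
    # Usernames carry no sentiment, so they are replaced with a placeholder.
    text = re.sub(r"@\w+", "-", text)
    # Hashtags lose the "#" and their text goes through the splitter.
    return re.sub(r"#(\w+)", lambda m: split_hashtag(m.group(1)), text)

# Toy splitter for the demo (assumption); a real one segments unknown strings.
toy_splitter = lambda tag: {"hatetwitter": "hate twitter"}.get(tag, tag)

print(process_tags("I say to @love hello!"))
print(process_tags("I mean that i #hatetwitter", toy_splitter))
```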

Page 21: Sentiments Improvement

Tags detection. Examples

@INevaTrustEm ok :) we need to make a date for this

Watching @danieltosh #toofunny

#lovetolaugh

#sick

Avatar, #wasteofmoney

#soft #thissucks

#happytweet

RT @BriannaStull13: what do you mean?

Page 22: Sentiments Improvement

Tags detection. Words splitting

● Dynamic programming

● Statistical approach due to ambiguity

#orcore → [orc_ore], [or_core]

#expertsexchange → [expert_sex_change], [experts_exchange]

● Train data

● Dictionary (default linux ~100K words)
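A Python sketch of statistical word splitting with dynamic programming: every split point is tried, each segmentation is scored by the sum of log word probabilities, and sub-results are memoized. The word counts are invented for the example and the unknown-word penalty is an assumption; neither comes from the project's train data.

```python
import math
from functools import lru_cache

def make_segmenter(counts, max_word_len=20):
    """Build a splitter that picks the most probable segmentation of a string."""
    total = sum(counts.values())

    def log_prob(word):
        if word in counts:
            return math.log(counts[word] / total)
        # Unknown strings get a penalty that grows with length (assumption).
        return math.log(1.0 / (total * 10 ** len(word)))

    @lru_cache(maxsize=None)
    def best(text):
        """Return (score, words) for the best segmentation of text."""
        if not text:
            return 0.0, ()
        options = []
        for i in range(1, min(len(text), max_word_len) + 1):
            tail_score, tail_words = best(text[i:])
            options.append((log_prob(text[:i]) + tail_score, (text[:i],) + tail_words))
        return max(options)

    return lambda text: list(best(text)[1])

# Invented frequencies, for illustration only.
TOY_COUNTS = {"experts": 50, "exchange": 40, "expert": 60, "sex": 30, "change": 80,
              "hate": 20, "twitter": 25, "or": 500, "core": 40, "orc": 2, "ore": 3}
segment = make_segmenter(TOY_COUNTS)
print(segment("expertsexchange"))
print(segment("orcore"))
```

With these counts the frequent "or" outweighs the rare "orc", which is how word statistics resolve the ambiguity shown above.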

Page 23: Sentiments Improvement

Tags detection. Twitter hashtags

Twitter hashtags (~800) crawled from:

http://hashtags.org/

http://kingnetforums.weebly.com/twitter-hashtags-lists.html

http://edudemic.com/2011/10/twitter-hashtag-dictionary/

http://nicolehumphrey.net/60-favorite-twitter-hashtags-for-writers-clickable-list/

http://www.dailywritingtips.com/40-twitter-hashtags-for-writers/

http://greeneconomypost.com/green-twitter-hashtag-17290.htm

Page 24: Sentiments Improvement

Tags detection. Performance

            100                       400                      800
Time        4019.73 msec (~4 sec.)    6429.19 msec (~6 sec.)   7897.23 msec (~8 sec.)
Accuracy    83.00%                    86.25%                   84.88%

Main problems:

● The train set often fails to resolve the ambiguity problem

● Dictionary hits filter out a lot of the right candidates

#rapnotamusic → [ra_p_not_a_music]

Page 25: Sentiments Improvement

Words splitting. Improvements

Performance:

● Memoize splitting

● Prefix tree approach

● Viterbi algorithm (http://en.wikipedia.org/wiki/Viterbi_algorithm)

● Improve train data

Accuracy:

● Use famous names, geographic locations, slang, abbreviations, acronyms, ...

● Big dictionary

● Improve train data (Twitter-specific)
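The prefix tree idea can be sketched as follows: with the dictionary stored as a trie, one walk along the hashtag finds every dictionary word starting at a given position, with no separate lookup per candidate substring. This Python illustration uses a dict-of-dicts trie and assumes lowercase alphabetic words; the names and the "$" end-of-word marker are hypothetical.

```python
END = "$"  # end-of-word marker; assumes "$" never occurs in dictionary words

def build_trie(words):
    """Store the dictionary as nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def words_starting_at(trie, text, start=0):
    """All dictionary words that begin at position start of text."""
    node, found = trie, []
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break  # no dictionary word continues with this character
        if END in node:
            found.append(text[start:i + 1])
    return found

trie = build_trie(["expert", "experts", "exchange", "sex", "change"])
print(words_starting_at(trie, "expertsexchange"))
print(words_starting_at(trie, "expertsexchange", 7))
```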

Page 26: Sentiments Improvement

Preprocessing performance

Input conditions:

Data: 2.4K (incorrect) of 15.8K (total) from Omniture15K.xls file (15%)

Emoticons size: 14 most common smiles

Abbreviations size: 8 most common abbrs

Spell-correction distance: 1

Train data: big.txt

Dictionary: linux-words.txt

Results:

Sentence count: 2412

Preprocessing time: 29214.88 msec (~29 sec.)

Number of corrected sentences: 368

Percent of corrected to incorrect data: 15.28%

Percent of corrected to total data: 2.33%

Page 27: Sentiments Improvement

Data preprocessing. Future.

Sentence breaker

Page 28: Sentiments Improvement

Environment

Hardware:

CPU: 2 x Intel Pentium Dual T2370 @ 1.73GHz

RAM: 2.0 GB

Software:

OS: Ubuntu 11.04

Kernel: Linux 2.6.38-13-generic

IDE: Emacs 23.2.1

Programming: Clojure 1.3