Data Acquisition for Sentiment Analysis
![Page 1: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/1.jpg)
DASA Project: Data Acquisition for Sentiment Analysis
Ali Belcaid © AB Advisory & Consulting
High level architecture and components overview – March 2013
![Page 2: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/2.jpg)
Objectives
• Streamline and facilitate the acquisition of unstructured data
• Create and manage corpora for contextual opinions and sentiments
• Detect trends based on contextual reviews, comments, discussions…
• Run and train models for sentiment or opinion analysis
• Provide figures, results and graphs as outputs
![Page 3: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/3.jpg)
Software components
• Python: programming language
• Django: web application framework
• Scrapy: web crawler
• Libraries: Twitter,
• MySQL / MongoDB / HBase: no definitive choice has been made yet. The final solution could be a mix of different databases, depending on the nature of the use.
• R Project: used whenever specific text-mining libraries are missing in Python or when it becomes easier to use R instead of Python. In that case, the R scripts will be encapsulated in Python programs.
• Hadoop: used for massive storage of raw data. The architecture is not yet defined.
![Page 4: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/4.jpg)
Simplified Solution Architecture
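[Architecture diagram. Labeled boxes: Web Interface (Django); Crawl Engine & API (Scrapy); Text Mining Engine (NLTK, and TM with the R Project); Pre-processing & Corpora. Other labels: Configuration, Crawl Content, Output results. The numbered callouts 1 to 5 are described under "Architecture components".]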
![Page 5: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/5.jpg)
Architecture components
1. Data sources: access will be managed via APIs or crawls. Sources are all those related to social media: blogs, forums, advisors, the social web… In general, any medium where sentiments or opinions are expressed.
2. Web interface to interact with the system: to manage inputs, configurations, outputs…
3. There will be a mix of Scrapy (the crawler) and Python scripts that use APIs. Basically, the engine will gather data from all sources and store it for further processing (pre-processing and analysis).
4. There will be a mix of Scrapy (the crawler) and Python scripts that use APIs. Basically, the engine will gather data from all sources and store it for further processing.
5. The target database solution is not yet selected. The objective is to store all the relevant content, whether it is raw data, configuration items or output results.
![Page 6: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/6.jpg)
Characteristics of Sentiment Analysis
Sentiment = Holder + Polarity + Target + Auxiliary
– Holder: who expresses the sentiment
– Target: what/whom the sentiment is expressed to
– Polarity: the nature of the sentiment (e.g., positive or negative)
Example: “The games in iPhone 4s are pretty funny!”
– Feature/Aspect: games
– Target: iPhone 4s
– Polarity: positive
– Holder: the user/reviewer
Auxiliary
• Strength: differentiates the intensity
• Confidence: measures the reliability of the sentiment
• Summary: explains the reason inducing the sentiment
• Time
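The decomposition above can be sketched as a small record type. This is only an illustration: the class name `Sentiment`, its field names, and the reading of the example sentence (aspect "games", target "iPhone 4s") are assumptions, not part of the DASA design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sentiment:
    """One sentiment record: Holder + Polarity + Target + Auxiliary."""
    holder: str                          # who expresses the sentiment
    target: str                          # what/whom the sentiment is expressed to
    polarity: str                        # e.g. "positive" or "negative"
    aspect: Optional[str] = None         # feature/aspect of the target, if any
    strength: Optional[float] = None     # auxiliary: intensity
    confidence: Optional[float] = None   # auxiliary: reliability of the sentiment
    summary: Optional[str] = None        # auxiliary: reason inducing the sentiment
    time: Optional[str] = None           # auxiliary: when it was expressed


# "The games in iPhone 4s are pretty funny!" (assumed reading of the slide)
example = Sentiment(
    holder="the user/reviewer",
    target="iPhone 4s",
    polarity="positive",
    aspect="games",
)
```

The auxiliary fields default to `None` because the slide lists them without saying how they would be computed.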
![Page 7: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/7.jpg)
Basic Tasks
• Holder detection – Find who expresses the sentiment
• Target recognition – Find whom/what the sentiment is expressed towards
• Sentiment (Polarity) classification – Positive, negative, neutral
• Opinion summarization
• Opinion spam detection
![Page 8: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/8.jpg)
Subjectivity versus Sentiment
• Sentiment analysis is also known as opinion mining.
• It attempts to identify the opinion/sentiment that a person may hold towards an object.
• It is a finer-grained analysis than subjectivity analysis.
![Page 9: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/9.jpg)
Lexicon Based Sentiment Classification
Basic idea
• Use the dominant polarity of the opinion words in the sentence to determine its polarity:
• If positive/negative opinion prevails, the opinion sentence is regarded as positive/negative
• Lexicon + Counting
• Lexicon + Grammar Rule + Inference Method
Example lexicons:
http://www.wjh.harvard.edu/~inquirer
http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
http://sentiwordnet.isti.cnr.it/
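The "Lexicon + Counting" variant can be sketched in a few lines of Python. The two word sets below are a hypothetical mini-lexicon made up for illustration; a real run would load one of the lexicons listed above. The function name `classify_sentence` is also an assumption.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"good", "great", "funny", "love", "excellent"}
NEGATIVE = {"bad", "poor", "boring", "hate", "terrible"}


def classify_sentence(sentence: str) -> str:
    """Return the dominant polarity of the opinion words in a sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # no opinion words, or a tie


print(classify_sentence("The games in iPhone 4s are pretty funny!"))
```

Plain counting ignores negation and intensifiers, which is where the "Grammar Rule + Inference Method" variant would come in.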
![Page 10: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/10.jpg)
Sentiment Analysis Tasks
Level: task description
Document level
• Task: sentiment classification of reviews
• Classes: positive, negative, and neutral
• Assumption: each document (or review) focuses on a single object (not true in many discussion posts) and contains opinion from a single opinion holder.
Sentence level
• Task 1: identifying subjective/opinionated sentences
• Classes: objective and subjective (opinionated)
• Task 2: sentiment classification of sentences
• Classes: positive, negative and neutral.
• Assumption: a sentence contains only one opinion; not true in many cases.
• Then we can also consider clauses or phrases.
Feature level
• Task 1: Identify and extract object features that have been commented on by an opinion holder (e.g., a reviewer).
• Task 2: Determine whether the opinions on the features are positive, negative or neutral.
• Task 3: Group feature synonyms.
• Produce a feature-based opinion summary of multiple reviews.
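As a rough sketch of the feature-level summary, the snippet below assumes Tasks 1 and 2 have already produced (feature, polarity) pairs. It groups synonyms (Task 3) and counts polarities per feature. The synonym table and the function name `summarize` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical synonym table (Task 3): surface form -> canonical feature name.
FEATURE_SYNONYMS = {
    "photo": "picture",
    "image": "picture",
    "picture": "picture",
    "battery life": "battery",
    "battery": "battery",
}


def summarize(opinions):
    """Build a feature-based summary from (feature, polarity) pairs."""
    summary = defaultdict(Counter)
    for feature, polarity in opinions:
        key = feature.lower()
        canonical = FEATURE_SYNONYMS.get(key, key)  # unknown features map to themselves
        summary[canonical][polarity] += 1
    return {feature: dict(counts) for feature, counts in summary.items()}


reviews = [
    ("photo", "positive"),
    ("Image", "positive"),
    ("picture", "negative"),
    ("battery life", "negative"),
]
print(summarize(reviews))
```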
![Page 11: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/11.jpg)
Some tools
Lexicon-based tools
• Use sentiment and subjectivity lexicons
• Rule-based classifier
• A sentence is subjective if it has at least two words in the lexicon
• A sentence is objective otherwise
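The two-word rule above translates almost directly into code. This is a minimal sketch: the lexicon contents are invented for illustration, and counting every matching token (rather than distinct words) is an assumption.

```python
import re

# Hypothetical subjectivity lexicon, for illustration only.
SUBJECTIVITY_LEXICON = {"pretty", "funny", "love", "hate", "terrible", "great", "believe"}


def is_subjective(sentence, lexicon=SUBJECTIVITY_LEXICON, threshold=2):
    """A sentence is subjective if at least `threshold` tokens are in the lexicon."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    hits = sum(1 for t in tokens if t in lexicon)
    return hits >= threshold


print(is_subjective("The games in iPhone 4s are pretty funny!"))
print(is_subjective("The phone has a 3.5 inch screen."))
```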
Corpus-based tools
• Use corpora annotated for subjectivity and/or sentiment
• Train machine learning algorithms:
• Naïve Bayes
• Decision trees
• SVM
• …
• Learn to automatically annotate new text
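To make the corpus-based route concrete, here is a small multinomial Naïve Bayes classifier written from scratch with the standard library. It is a sketch, not the project's implementation and not the NLTK API. The four training sentences are a made-up toy corpus.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy annotated corpus (hypothetical).
train_texts = [
    "I love this phone, great screen",
    "great games and a good camera",
    "I hate the battery, terrible phone",
    "bad screen and poor sound",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("great camera"))
```

Once trained on an annotated corpus, the same `predict` call can be applied to new, unlabeled text, which is the "learn to automatically annotate" step.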
![Page 12: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/12.jpg)
Sentiment Analysis : Levels
• Document level (e.g., product/movie review)
• Sentence level (e.g., news sentence)
• Expression level (e.g., word/phrase)
![Page 13: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/13.jpg)
Sentiment Analysis : Holder detection
Identifying Sources of Opinions with Conditional Random Fields and Extraction Patterns
Examples:
“International officers believe that the EU will prevail.”
“International officers said US officials want the EU to prevail.”
• View source identification as an information extraction task and tackle the problem using sequence tagging and pattern matching techniques simultaneously
• Linear-chain CRF model to identify opinion sources
• Patterns incorporated as features
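The idea of "patterns incorporated as features" can be illustrated with a per-token feature function of the kind usually fed to a linear-chain CRF toolkit. Only the feature extraction step is shown; training the CRF is out of scope here. The verb list, the feature names, and the single "token followed by an opinion verb" pattern are all assumptions made for this sketch.

```python
# Hypothetical list of opinion verbs used by the extraction pattern.
OPINION_VERBS = {"believe", "believes", "said", "say", "want", "wants"}


def token_features(tokens, i):
    """Feature dict for token i, combining lexical context and one pattern feature."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<START>"
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>"
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "prev.lower": prev_word,
        "next.lower": next_word,
        # Extraction pattern "<source> <opinion verb>" encoded as a boolean feature.
        "pattern.before_opinion_verb": next_word in OPINION_VERBS,
    }


def sentence_features(tokens):
    return [token_features(tokens, i) for i in range(len(tokens))]


tokens = "International officers believe that the EU will prevail .".split()
features = sentence_features(tokens)
print(features[1])
```

A sequence tagger would then receive one such feature dict per token, together with a source/non-source label per token, as training data.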
![Page 14: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/14.jpg)
Sentiment Analysis : Twitter
![Page 15: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/15.jpg)
Sentiment Analysis : Twitter
1. Tweet normalization – A simple rule-based model: “gooood” to “good”, “luve” to “love”
2. POS tagging – OpenNLP POS tagger
3. Word stemming – A word stem mapping table (about 20,000 entries)
4. Syntactic parsing – A Maximum Spanning Tree dependency parser
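Step 1 (tweet normalization) could look roughly like the following. The slide does not specify the rules, so this sketch assumes two: a lookup table for known misspellings and a collapse of repeated letters checked against a word list. Both tables are hypothetical, and the output is lowercased for simplicity.

```python
import re

# Hypothetical mapping table for non-standard spellings.
SPELLING_MAP = {"luve": "love", "gud": "good"}
# Hypothetical word list used to pick between candidate spellings.
KNOWN_WORDS = {"good", "love", "so", "cool"}


def normalize_token(token):
    lower = token.lower()
    if lower in SPELLING_MAP:
        return SPELLING_MAP[lower]
    if lower in KNOWN_WORDS:
        return lower
    # Collapse runs of three or more identical characters to two ("gooood" -> "good").
    two = re.sub(r"(.)\1{2,}", r"\1\1", lower)
    if two in KNOWN_WORDS:
        return two
    # Otherwise try collapsing every run to a single character ("sooo" -> "so").
    one = re.sub(r"(.)\1+", r"\1", lower)
    if one in KNOWN_WORDS:
        return one
    return two


def normalize_tweet(text):
    return " ".join(normalize_token(t) for t in text.split())


print(normalize_tweet("I luve this gooood phone"))
```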
![Page 16: Data Acquisition for Sentiment Analysis](https://reader033.fdocuments.net/reader033/viewer/2022052906/558c2c4dd8b42aa5738b4651/html5/thumbnails/16.jpg)
Crawling scenario : Definition
Scenario x
Instance 1
Instance 2
Instance n
Selected URLs
Configuration parameters
Name
Key words
…
• Scenario : 1 -> n : Category
• Theme : n -> n : Scenario
• Scenario : 1 -> n : Instance
• The scenario defines the type of crawl we want to run. It is tied to the notion of instance, which is a specific configuration of a scenario.
URL management module
Configuration parameter management module
We should look at the Nutch GUI currently under development and draw on it for managing parameters and URLs.
Theme
Category
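The scenario/instance relationship on this slide can be sketched as a small data model. Everything here is an assumption made for illustration: the class and field names, the direction of the scenario-to-category relation, and the `example.com` URL are not taken from the DASA design.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Instance:
    """A specific configuration of a scenario."""
    name: str
    keywords: List[str] = field(default_factory=list)
    urls: List[str] = field(default_factory=list)              # selected URLs
    parameters: Dict[str, str] = field(default_factory=dict)   # configuration parameters


@dataclass
class Scenario:
    """Defines the type of crawl to run; one scenario has many instances."""
    name: str
    categories: List[str] = field(default_factory=list)
    themes: List[str] = field(default_factory=list)
    instances: List[Instance] = field(default_factory=list)

    def add_instance(self, instance: Instance) -> None:
        self.instances.append(instance)


scenario = Scenario(name="Scenario x", categories=["Category"], themes=["Theme"])
scenario.add_instance(
    Instance(
        name="Instance 1",
        keywords=["iphone", "games"],
        urls=["http://example.com/forum"],  # placeholder URL
        parameters={"depth": "2"},          # hypothetical parameter
    )
)
print(len(scenario.instances))
```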