SEASR Analytics and Zotero University of Illinois at Urbana-Champaign.
SEASR Text
Uploaded by loretta-auvil. Category: Technology
National Center for Supercomputing Applications University of Illinois at Urbana-Champaign
MONK Project
MONK provides:
• 1400 works of literature in English from the 16th - 19th century = 108 million words, POS-tagged, TEI-tagged, in a MySQL database.
• Several different open-source interfaces for working with this data
• A public API to the datastore
• SEASR under the hood, for analytics
MONK Project
Executes flows for each analysis requested
– Predictive modeling using Naïve Bayes
– Predictive modeling using Support Vector Machines (SVM)
Dunning Loglikelihood TagCloud
• Words that are under-represented in writings by Victorian women as compared to Victorian men. —Sara Steger
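The tag cloud above ranks words by Dunning's log-likelihood statistic. As a rough sketch of how such a score could be computed for one word across two corpora (the function names and the 2x2 contingency-table layout are illustrative assumptions, not the MONK implementation):

```python
import math

def log_likelihood(count_a, total_a, count_b, total_b):
    """Dunning's G2 for one word: count_a of total_a words in corpus A,
    count_b of total_b words in corpus B (2x2 contingency table)."""
    observed = [
        [count_a, count_b],                      # occurrences of the word
        [total_a - count_a, total_b - count_b],  # all other words
    ]
    row_sums = [sum(row) for row in observed]
    col_sums = [total_a, total_b]
    grand_total = total_a + total_b
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing (0 * ln 0 is taken as 0)
                expected = row_sums[i] * col_sums[j] / grand_total
                g2 += o * math.log(o / expected)
    return 2.0 * g2

def is_under_represented(count_a, total_a, count_b, total_b):
    """True when the word's relative frequency is lower in A than in B."""
    return count_a / total_a < count_b / total_b
```

The score only says how surprising the difference is; the direction (over- or under-represented) comes from comparing the relative frequencies, which is why the second helper is separate.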
Feature Lens
“The discussion of the children introduces each of the short internal narratives. This champions the view that her method of repetition was patterned: controlled, intended, and a measured means to an end.
It would have been impossible to discern through traditional reading”
Semantic Analysis: Information Extraction
• Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)
• Extract the relevant information and ignore non-relevant information (important!)
• Link related information and output in a predetermined format
Information Extraction
Information type: state of the art (accuracy)
• Entities: an object of interest such as a person or organization. 90-98%
• Attributes: a property of an entity such as its name, alias, descriptor, or type. 80%
• Facts: a relationship held between two or more entities, such as the Position of a Person in a Company. 60-70%
• Events: an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction. 50-60%
“Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL
Information Extraction Approaches
• Terminology (name) lists
– This works very well if the list of names and name expressions is stable and available
• Tokenization and morphology
– This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)
• Use of characteristic patterns
– This works fairly well for novel entities
– Rules can be created by hand or learned via machine learning or statistical algorithms
– Rules capture local patterns that characterize entities from instances of annotated training data
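The three approaches in this list can be sketched side by side. This is a minimal illustration only: the terminology list, the title-based pattern, and the sample date are hypothetical, and hand-written rules stand in for the learned ones the slide mentions.

```python
import re

# Hypothetical terminology list (gazetteer); a real one would be far larger.
GAZETTEER = {
    "Rex Luthor": "Person",
    "Alderwood": "Location",
    "Boynton Laboratory": "Organization",
}

# Internal-format pattern: dates written as DD/MM/YY.
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{2}\b")

# Characteristic pattern: a title followed by capitalized words suggests a person,
# even when the name is not in the terminology list.
TITLE_PATTERN = re.compile(
    r"\b(?:Mayor|Senator)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
)

def extract_entities(text):
    """Return sorted (offset, text, label, approach) tuples."""
    found = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), name, label, "list"))
    for m in DATE_PATTERN.finditer(text):
        found.append((m.start(), m.group(0), "Date", "format"))
    for m in TITLE_PATTERN.finditer(text):
        found.append((m.start(1), m.group(1), "Person", "pattern"))
    return sorted(found)

sample = ("Mayor Rex Luthor announced today the establishment of a new "
          "research facility in Alderwood on 12/03/09.")  # date added for illustration
entities = extract_entities(sample)
```

Note that "Rex Luthor" is found twice here, once by the list and once by the pattern; a real system would need to merge overlapping hits.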
Mayor Rex Luthor announced today the establishment
of a new research facility in Alderwood. It will be
known as Boynton Laboratory.
Tags shown: NE:Person (Rex Luthor), NE:Time (today), NE:Location (Alderwood), NE:Organization (Boynton Laboratory)
Semantic Analytics
Named Entity (NE) Tagging
Mayor Rex Luthor announced today the establishment
of a new research facility in Alderwood. It will be
known as Boynton Laboratory.
UNE:Organization
Semantic Analysis
Co-reference Resolution for entities and unnamed entities
Mayor Rex Luthor announced today the establishment
of a new research facility in Alderwood. It will be
known as Boynton Laboratory.
Role labels shown: ACTOR, ACTION, WHEN, OBJECT, WHERE; ACTION, OBJECT, COMPL
Semantic Analysis
Semantic Role Analysis
Graph nodes: Rex Luthor (person), announce (action), establ. (event), Boynton Lab (organiz.), today (time), Alderwood (location)
Edge labels: actor (who), object (what), time (when), location (where), object (what)
Semantic Analysis
Concept-Relation Extraction
Results: Timeline
Results: Maps
UIMA Structured data
• Two SEASR examples using UIMA POS data
– Frequent patterns (rule associations) on nouns (fpgrowth)
– Sentiment analysis on adjectives
UIMA
Unstructured Information Management Architecture
UIMA + P.O.S. tagging
Four analysis engines analyze the document and record part-of-speech information:
• OpenNLP Tokenizer
• OpenNLP PosTagger
• OpenNLP SentenceDetector
• POSWriter
Serialization of the UIMA CAS
UIMA to SEASR: Experiment I
• Finding patterns
SEASR + UIMA: Frequent Patterns
Frequent Pattern Analysis on nouns
• Goal:
– Discover a cast of characters within the text
– Discover nouns that frequently occur together
• character relationships
Frequent Patterns: visualization
Analysis of Tom Sawyer: 10-paragraph window, support set to 10%
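To make the window and support settings concrete, here is a small sketch that counts noun pairs co-occurring within windows of paragraphs. It uses brute-force pair counting rather than FP-growth, and it assumes non-overlapping windows and a per-paragraph list of nouns as input; both are illustrative choices, not details from the SEASR flow.

```python
from collections import Counter
from itertools import combinations

def frequent_noun_pairs(paragraph_nouns, window=10, min_support=0.10):
    """paragraph_nouns: one list of nouns per paragraph.
    Returns {(noun1, noun2): support} for pairs meeting min_support."""
    # One transaction per window of consecutive paragraphs (non-overlapping).
    transactions = [
        set().union(*paragraph_nouns[i:i + window])
        for i in range(0, len(paragraph_nouns), window)
    ]
    pair_counts = Counter()
    for nouns in transactions:
        # Each unordered pair is counted once per window.
        pair_counts.update(combinations(sorted(nouns), 2))
    n = len(transactions)
    return {
        pair: count / n
        for pair, count in pair_counts.items()
        if count / n >= min_support
    }

# Toy input with a window of one paragraph and 50% support.
toy = [["Tom", "Huck"], ["Tom", "Becky"], ["Huck", "Tom"], ["Becky"]]
pairs = frequent_noun_pairs(toy, window=1, min_support=0.5)
```

Support here is the fraction of windows containing both nouns, so a 10% threshold over 10-paragraph windows keeps pairs that appear together in at least one window out of ten.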
UIMA to SEASR: Experiment II
• Sentiment Analysis
UIMA + SEASR: Sentiment Analysis
• Classifying text based on its sentiment
– Determining the attitude of a speaker or a writer
– Determining whether a review is positive/negative
• Ask: What emotion is being conveyed within a body of text?
– Look at only adjectives (UIMA POS)
• Lots of issues, challenges, and “but …” caveats
• Need to answer:
– What emotions to track?
– How to measure/classify an adjective to one of the selected emotions?
– How to visualize the results?
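Restricting the analysis to adjectives could look like the sketch below. It assumes the POS output is available as (token, tag) pairs using Penn Treebank tags, where JJ, JJR, and JJS mark adjectives; the input format is an assumption for illustration, not the serialized UIMA CAS itself.

```python
# Penn Treebank tags for adjectives: base, comparative, superlative.
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}

def adjectives(tagged_tokens):
    """Keep only adjective tokens, lowercased, in document order."""
    return [word.lower() for word, tag in tagged_tokens if tag in ADJECTIVE_TAGS]

# Hypothetical tagged sentence.
tagged = [("The", "DT"), ("Delightful", "JJ"), ("day", "NN"),
          ("turned", "VBD"), ("rainier", "JJR")]
found = adjectives(tagged)
```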
UIMA + SEASR: Sentiment Analysis
• Which emotions:
– http://en.wikipedia.org/wiki/List_of_emotions
– http://changingminds.org/explanations/emotions/basic%20emotions.htm
– http://www.emotionalcompetency.com/recognizing.htm
• Parrott’s classification (2001)
– six core emotions
– Love, Joy, Surprise, Anger, Sadness, Fear
UIMA + SEASR: Sentiment Analysis
• How to classify adjectives:
– Lots of metrics we could use …
• Lists of adjectives already classified
– http://www.derose.net/steve/resources/emotionwords/ewords.html
– Need a “nearness” metric for missing adjectives
– How about the thesaurus game?
• Using only a thesaurus, find a path between two words
– no antonyms
– no colloquialisms or slang
UIMA + SEASR: Sentiment Analysis
• How to get from delightful to rainy?
['delightful', 'fair', 'balmy', 'moist', 'rainy']
• sexy to joyless?
['sexy', 'provocative', 'blue', 'joyless']
• bitter to lovable?
['bitter', 'acerbic', 'tangy', 'sweet', 'lovable']
UIMA + SEASR: Sentiment Analysis
• Use this game as a metric for measuring a given adjective to one of the six emotions.
• Assume the longer the path, the “farther away” the two words are.
• Need to address some of the issues
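The thesaurus game amounts to a shortest-path search over synonym links. A minimal sketch using breadth-first search follows; the tiny thesaurus contains only the links from the delightful-to-rainy example and is not real SynNet data, and links are treated as one-directional.

```python
from collections import deque

# Toy synonym links (word -> listed synonyms), taken from the example path.
THESAURUS = {
    "delightful": ["fair"],
    "fair": ["balmy"],
    "balmy": ["moist"],
    "moist": ["rainy"],
}

def synonym_path(thesaurus, start, goal):
    """Breadth-first search; returns a shortest chain of synonyms or None."""
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        word = queue.popleft()
        if word == goal:
            path = []
            while word is not None:  # walk back to the start
                path.append(word)
                word = previous[word]
            return path[::-1]
        for synonym in thesaurus.get(word, []):
            if synonym not in previous:
                previous[synonym] = word
                queue.append(synonym)
    return None

path = synonym_path(THESAURUS, "delightful", "rainy")
distance = len(path) - 1  # number of links: longer means "farther away"
```

Because the links are directed, a path from a to b does not guarantee one from b to a, which is the symmetry question raised in the SynNet metrics.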
SynNet: rainy to pleasant
UIMA + SEASR: Sentiment Analysis
• SynNet Metrics
• Common nodes
• Path length
• Symmetric: a->b->c vs. c->b->a
• Link strength:
• tangy->sweet
• sweet->lovable
• Use of slang or informal usage
UIMA + SEASR: Sentiment Analysis
• Common Nodes
• depth of common
UIMA + SEASR: Sentiment Analysis
• Symmetry of path in common nodes
UIMA + SEASR: Sentiment Analysis
• Find the shortest path between adjective and each emotion:
• ['delightful', 'beatific', 'joyful']
• ['delightful', 'ineffable', 'unspeakable', 'fearful']
• Pick the emotion with shortest path length
• tie breaking procedures
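The selection step can be sketched as a single breadth-first pass from the adjective, followed by picking the nearest emotion word. The toy links come from the two example paths above; the mapping from emotion names to target words and the alphabetical tie-break are placeholder assumptions, since the slides do not specify either.

```python
from collections import deque

def distances_from(thesaurus, start):
    """Number of synonym links from start to every reachable word."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        word = queue.popleft()
        for synonym in thesaurus.get(word, []):
            if synonym not in dist:
                dist[synonym] = dist[word] + 1
                queue.append(synonym)
    return dist

def classify(thesaurus, adjective, emotion_words):
    """emotion_words maps an emotion name to its target word.
    Returns the emotion with the shortest path, or None if none is reachable."""
    dist = distances_from(thesaurus, adjective)
    reachable = [(dist[w], emotion) for emotion, w in emotion_words.items() if w in dist]
    if not reachable:
        return None
    # Placeholder tie-break: equal distances resolve alphabetically by emotion name.
    return min(reachable)[1]

toy_links = {
    "delightful": ["beatific", "ineffable"],
    "beatific": ["joyful"],
    "ineffable": ["unspeakable"],
    "unspeakable": ["fearful"],
}
targets = {"Joy": "joyful", "Fear": "fearful"}  # assumed target words
label = classify(toy_links, "delightful", targets)
```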
UIMA + SEASR: Sentiment Analysis
• Not a perfect solution
– still need context to get quality
• Vain
– ['vain', 'insignificant', 'contemptible', 'hateful']
– ['vain', 'misleading', 'puzzling', 'surprising']
• Animal
– ['animal', 'sensual', 'pleasing', 'joyful']
– ['animal', 'bestial', 'vile', 'hateful']
– ['animal', 'gross', 'shocking', 'fearful']
– ['animal', 'gross', 'grievous', 'sorrowful']
• Negation
– “My mother was not a hateful person.”
UIMA + SEASR: Sentiment Analysis
• Process Overview
• Extract the adjectives (UIMA POS analysis)
• Read in adjectives (SEASR library)
• Label each adjective (SynNet)
• Summarize windows of adjectives
• lots of experimentation here
• Visualize the windows
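The "summarize windows" step might be as simple as counting labels per window. The sketch below assumes non-overlapping windows and an arbitrary default window size; the slides only say this step involved a lot of experimentation, so neither choice should be read as the method actually used.

```python
from collections import Counter

def summarize_windows(labels, window=50):
    """Count emotion labels in consecutive, non-overlapping windows
    of labeled adjectives (labels are in document order)."""
    return [
        Counter(labels[i:i + window])
        for i in range(0, len(labels), window)
    ]

# Toy sequence of labels, one per adjective.
labels = ["Joy", "Joy", "Fear", "Sadness", "Fear", "Fear"]
summary = summarize_windows(labels, window=3)
dominant = [counts.most_common(1)[0][0] for counts in summary]
```

Each Counter can feed a visualization directly, for example as one column of emotion proportions per window.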
UIMA + SEASR: Sentiment Analysis
• Visualization
• New SEASR visualization component
• Based on flare ActionScript Library
• http://flare.prefuse.org/
• Still in development
• http://demo.seasr.org:1714/public/resources/data/emotions/ev/EmotionViewer.html