SEASR Text

Text National Center for Supercomputing Applications University of Illinois at Urbana-Champaign


Pathway to SEASR Workshop in March 2009 in North Carolina

Transcript of SEASR Text

Page 1: SEASR Text

Text

National Center for Supercomputing Applications University of Illinois at Urbana-Champaign

Page 2: SEASR Text

MONK Project

MONK provides:

•  1400 works of literature in English from the 16th to the 19th century (108 million words), POS-tagged and TEI-tagged, in a MySQL database.

•  Several different open-source interfaces for working with this data

•  A public API to the datastore

•  SEASR under the hood, for analytics

Page 3: SEASR Text

MONK Project

Executes flows for each analysis requested

–  Predictive modeling using Naïve Bayes

–  Predictive modeling using Support Vector Machines (SVM)

Page 4: SEASR Text

Dunning Loglikelihood TagCloud

•  Words that are under-represented in writings by Victorian women as compared to Victorian men. —Sara Steger
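As a rough illustration of the statistic behind such a tag cloud, the sketch below computes a Dunning-style log-likelihood (G2) score for one word in two corpora, using the common two-corpus simplification. The function names, the sign convention for under-representation, and the counts in the usage lines are assumptions made for illustration; they are not taken from MONK or SEASR.

```python
import math


def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in an analysis corpus of c words
    and b times in a reference corpus of d words."""
    # Expected counts if the word were equally frequent in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def signed_log_likelihood(a, b, c, d):
    """Negative when the word is under-represented in the analysis corpus."""
    g2 = log_likelihood(a, b, c, d)
    return -g2 if a / c < b / d else g2


# Invented counts: a word seen 5 times per 1000 words in one corpus
# and 20 times per 1000 words in the other.
score = signed_log_likelihood(5, 20, 1000, 1000)
print("under-represented" if score < 0 else "over-represented", abs(score))
```

Words with the most negative scores would be the candidates for an "under-represented" cloud; font size could then follow the magnitude of the score.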

Page 5: SEASR Text

Feature Lens

“The discussion of the children introduces each of the short internal narratives. This champions the view that her method of repetition was patterned: controlled, intended, and a measured means to an end.

It would have been impossible to discern through traditional reading.”

Page 6: SEASR Text

Semantic Analysis: Information Extraction

•  Definition: Information extraction is the identification of specific semantic elements within a text (e.g., entities, properties, relations)

•  Extract the relevant information and ignore non-relevant information (important!)

•  Link related information and output in a predetermined format

Page 7: SEASR Text

Information Extraction

Information type: state of the art (accuracy)

•  Entities (90-98%): an object of interest such as a person or organization.

•  Attributes (80%): a property of an entity such as its name, alias, descriptor, or type.

•  Facts (60-70%): a relationship held between two or more entities, such as the Position of a Person in a Company.

•  Events (50-60%): an activity involving several entities, such as a terrorist act, airline crash, management change, or new product introduction.

“Introduction to Text Mining,” Ronen Feldman, Computer Science Department, Bar-Ilan University, ISRAEL

Page 8: SEASR Text

Information Extraction Approaches

•  Terminology (name) lists

–  This works very well if the list of names and name expressions is stable and available

•  Tokenization and morphology

–  This works well for things like formulas or dates, which are readily recognized by their internal format (e.g., DD/MM/YY or chemical formulas)

•  Use of characteristic patterns

–  This works fairly well for novel entities

–  Rules can be created by hand or learned via machine learning or statistical algorithms

–  Rules capture local patterns that characterize entities from instances of annotated training data
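The three approaches above can be sketched side by side. This is a minimal illustration only: the name list, the title words, and the function name `extract_entities` are all hypothetical, the date pattern checks the DD/MM/YY shape without validating the values, and the "characteristic pattern" is a single hand-written rule rather than one learned from annotated data.

```python
import re

# Approach 1: terminology list (hypothetical entries).
KNOWN_NAMES = {"Rex Luthor": "Person", "Alderwood": "Location"}

# Approach 2: internal format, here the DD/MM/YY shape of a date.
DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{2}\b")

# Approach 3: a hand-written characteristic pattern: a title word
# followed by one or more capitalized words is taken to be a person.
TITLE_RE = re.compile(
    r"\b(?:Mayor|Senator|Professor)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)"
)


def extract_entities(text):
    """Return (surface text, label, approach) triples found in text."""
    found = []
    for name, label in KNOWN_NAMES.items():
        if name in text:
            found.append((name, label, "list"))
    for match in DATE_RE.finditer(text):
        found.append((match.group(0), "Date", "format"))
    for match in TITLE_RE.finditer(text):
        found.append((match.group(1), "Person", "pattern"))
    return found


for entity in extract_entities("Mayor Ann Example visited Alderwood on 12/03/09."):
    print(entity)
```

The pattern rule finds "Ann Example" even though that name is not in the list, which is the sense in which patterns work for novel entities.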

Page 9: SEASR Text

Mayor Rex Luthor announced today the establishment

of a new research facility in Alderwood. It will be

known as Boynton Laboratory.

NE:Person (Rex Luthor), NE:Time (today), NE:Location (Alderwood), NE:Organization (Boynton Laboratory)

Semantic Analytics

Named Entity (NE) Tagging

Page 10: SEASR Text

Mayor Rex Luthor announced today the establishment

of a new research facility in Alderwood. It will be

known as Boynton Laboratory.

UNE:Organization

Semantic Analysis

Co-reference Resolution for entities and unnamed entities

Page 11: SEASR Text

Mayor Rex Luthor announced today the establishment

of a new research facility in Alderwood. It will be

known as Boynton Laboratory.

Role labels (from the figure): ACTOR, ACTION, WHEN, OBJECT, WHERE, ACTION, OBJECT, COMPL

Semantic Analysis

Semantic Role Analysis

Page 12: SEASR Text

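[Figure: concept-relation graph for the example sentence]

Nodes: Rex Luthor (person), announce (action), establ. (event), Boynton Lab (organiz.), today (time), Alderwood (location)

Edge labels: actor (who), time (when), object (what), object (what), location (where)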

Semantic Analysis

Concept-Relation Extraction

Page 13: SEASR Text

Results: Timeline

Page 14: SEASR Text

Results: Maps

Page 15: SEASR Text

UIMA Structured data

•  Two SEASR examples using UIMA POS data

–  Frequent patterns (rule associations) on nouns (fpgrowth)

–  Sentiment analysis on adjectives

Page 16: SEASR Text

UIMA

Unstructured Information Management Architecture

Page 17: SEASR Text

UIMA + P.O.S. tagging

Four analysis engines analyze a document and record part-of-speech information:

•  OpenNLP Tokenizer

•  OpenNLP PosTagger

•  OpenNLP SentenceDetector

•  POSWriter

Serialization of the UIMA CAS

Page 18: SEASR Text

UIMA to SEASR: Experiment I

•  Finding patterns

Page 19: SEASR Text

SEASR + UIMA: Frequent Patterns

Frequent Pattern Analysis on nouns

•  Goal:

–  Discover a cast of characters within the text

–  Discover nouns that frequently occur together

•  character relationships

Page 20: SEASR Text

Frequent Patterns: visualization

Analysis of Tom Sawyer: 10-paragraph window, support set to 10%
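To make "window" and "support" concrete, here is a small sketch that counts which nouns (and noun pairs) appear together in at least a given fraction of text windows. It uses brute-force counting rather than the FP-growth algorithm the workshop names, so it only suits small inputs; the function name, parameters, and sample windows are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_noun_sets(windows, min_support=0.10, max_size=2):
    """Return {noun tuple: support} for noun sets that occur in at
    least min_support of the windows. Each window is a list of nouns."""
    counts = Counter()
    for nouns in windows:
        unique = sorted(set(nouns))  # count each noun once per window
        for size in range(1, max_size + 1):
            counts.update(combinations(unique, size))
    threshold = min_support * len(windows)
    return {
        items: n / len(windows)
        for items, n in counts.items()
        if n >= threshold
    }


# Invented windows of nouns, one list per window of paragraphs.
windows = [
    ["tom", "huck"],
    ["tom", "becky"],
    ["tom", "huck", "joe"],
    ["aunt"],
]
for items, support in frequent_noun_sets(windows, min_support=0.5).items():
    print(items, support)
```

Pairs that survive the threshold are candidate character relationships; a graph visualization could draw one edge per surviving pair.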

Page 21: SEASR Text

UIMA to SEASR: Experiment II

•  Sentiment Analysis

Page 22: SEASR Text

UIMA + SEASR: Sentiment Analysis

•  Classifying text based on its sentiment

–  Determining the attitude of a speaker or a writer

–  Determining whether a review is positive/negative

•  Ask: What emotion is being conveyed within a body of text?

–  Look at only adjectives (UIMA POS)

•  Lots of issues, challenges, and “but …” caveats

•  Need to answer:

–  What emotions to track?

–  How to measure/classify an adjective to one of the selected emotions?

–  How to visualize the results?

Page 23: SEASR Text

UIMA + SEASR: Sentiment Analysis

•  Which emotions:

–  http://en.wikipedia.org/wiki/List_of_emotions

–  http://changingminds.org/explanations/emotions/basic%20emotions.htm

–  http://www.emotionalcompetency.com/recognizing.htm

•  Parrott’s classification (2001)

–  six core emotions

–  Love, Joy, Surprise, Anger, Sadness, Fear

Page 24: SEASR Text

UIMA + SEASR: Sentiment Analysis

Page 25: SEASR Text

UIMA + SEASR: Sentiment Analysis

•  How to classify adjectives:

–  Lots of metrics we could use …

•  Lists of adjectives already classified

–  http://www.derose.net/steve/resources/emotionwords/ewords.html

–  Need a “nearness” metric for missing adjectives

–  How about the thesaurus game?

•  Using only a thesaurus, find a path between two words

–  no antonyms

–  no colloquialisms or slang

Page 26: SEASR Text

UIMA + SEASR: Sentiment Analysis

•  How to get from delightful to rainy?

['delightful', 'fair', 'balmy', 'moist', 'rainy']

•  sexy to joyless?

['sexy', 'provocative', 'blue', 'joyless']

•  bitter to lovable?

['bitter', 'acerbic', 'tangy', 'sweet', 'lovable']
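One way to play the thesaurus game mechanically is a breadth-first search over synonym links, which returns a shortest path when one exists. The sketch below is an assumption about how that could look, not the SynNet implementation: the toy `SYNONYMS` table contains only the links from the example paths on this slide, treated as one-directional, and `shortest_path` is a hypothetical name.

```python
from collections import deque

# Toy thesaurus: only the links from the example paths on this slide.
# A real thesaurus would list many synonyms per word.
SYNONYMS = {
    "delightful": ["fair"],
    "fair": ["balmy"],
    "balmy": ["moist"],
    "moist": ["rainy"],
    "sexy": ["provocative"],
    "provocative": ["blue"],
    "blue": ["joyless"],
    "bitter": ["acerbic"],
    "acerbic": ["tangy"],
    "tangy": ["sweet"],
    "sweet": ["lovable"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns a list of words or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        word = path[-1]
        if word == goal:
            return path
        for neighbor in graph.get(word, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


print(shortest_path(SYNONYMS, "delightful", "rainy"))
print(shortest_path(SYNONYMS, "rainy", "delightful"))  # no reverse links here
```

Because the toy links run in one direction only, the reverse query finds no path, which hints at why symmetry matters as a metric.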

Page 27: SEASR Text

UIMA + SEASR: Sentiment Analysis

•  Use this game as a metric for measuring a given adjective to one of the six emotions.

•  Assume the longer the path, the “farther away” the two words are.

•  Address some of the issues

Page 28: SEASR Text

SynNet: rainy to pleasant

Page 29: SEASR Text

UIMA + SEASR: Sentiment Analysis

•  SynNet Metrics

•  Common nodes

•  Path length

•  Symmetry: a->b->c vs. c->b->a

•  Link strength:

•  tangy->sweet

•  sweet->lovable

•  Use of slang or informal usage
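Two of these metrics can be sketched on a directed synonym graph. The reading taken here is an assumption: "common nodes" means words reachable from both starting words within a fixed number of links (recorded with their depth from each side), and a link is "symmetric" when each word lists the other as a synonym. The graph entries and function names are invented for illustration.

```python
def reachable(graph, start, max_depth):
    """Map every word reachable within max_depth links to its depth."""
    depths = {start: 0}
    frontier = [start]
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for word in frontier:
            for neighbor in graph.get(word, []):
                if neighbor not in depths:
                    depths[neighbor] = depth
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return depths


def common_nodes(graph, a, b, max_depth):
    """Words reachable from both a and b, with (depth from a, depth from b)."""
    from_a = reachable(graph, a, max_depth)
    from_b = reachable(graph, b, max_depth)
    return {w: (from_a[w], from_b[w]) for w in from_a.keys() & from_b.keys()}


def is_symmetric_link(graph, a, b):
    """True when a lists b as a synonym and b lists a."""
    return b in graph.get(a, []) and a in graph.get(b, [])


# Invented toy graph.
graph = {
    "rainy": ["wet", "damp"],
    "pleasant": ["mild", "damp"],
    "damp": ["moist"],
    "wet": ["rainy"],
}
print(common_nodes(graph, "rainy", "pleasant", max_depth=2))
print(is_symmetric_link(graph, "rainy", "wet"))
```

Shallow common nodes suggest the two words are close; many one-way links on a path suggest the path is weaker evidence.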

Page 30: SEASR Text

UIMA + SEASR: Sentiment Analysis

•  Common Nodes

•  Depth of common nodes

Page 31: SEASR Text

UIMA + SEASR: Sentiment Analysis

•  Symmetry of path in common nodes

Page 32: SEASR Text

UIMA + SEASR: Sentiment Analysis

•  Find the shortest path between adjective and each emotion:

•  ['delightful', 'beatific', 'joyful']

•  ['delightful', 'ineffable', 'unspeakable', 'fearful']

•  Pick the emotion with shortest path length

•  tie breaking procedures
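The selection step can be sketched as follows, under stated assumptions: each emotion is represented by one target adjective, path length is the number of links on a breadth-first shortest path, and ties are returned together rather than resolved, since the slides do not specify the tie-breaking procedure. The toy graph holds only the two example paths above; all names are hypothetical.

```python
from collections import deque


def path_length(graph, start, goal):
    """Number of links on a shortest path, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        word, dist = queue.popleft()
        if word == goal:
            return dist
        for neighbor in graph.get(word, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def nearest_emotions(graph, adjective, emotion_words):
    """Emotions whose target word is closest to the adjective.
    More than one result means a tie that still has to be broken."""
    lengths = {}
    for emotion, target in emotion_words.items():
        dist = path_length(graph, adjective, target)
        if dist is not None:
            lengths[emotion] = dist
    if not lengths:
        return []
    best = min(lengths.values())
    return sorted(e for e, d in lengths.items() if d == best)


# Toy graph built from the two example paths only.
TOY_GRAPH = {
    "delightful": ["beatific", "ineffable"],
    "beatific": ["joyful"],
    "ineffable": ["unspeakable"],
    "unspeakable": ["fearful"],
}
EMOTION_WORDS = {"joy": "joyful", "fear": "fearful"}  # assumed targets

print(nearest_emotions(TOY_GRAPH, "delightful", EMOTION_WORDS))
```

Returning every tied emotion keeps the tie-breaking policy separate, so alternatives (link strength, common nodes) can be tried without changing the search.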

Page 33: SEASR Text

UIMA + SEASR: Sentiment Analysis

•  Not a perfect solution

–  still need context to get quality

•  Vain

–  ['vain', 'insignificant', 'contemptible', 'hateful']

–  ['vain', 'misleading', 'puzzling', 'surprising']

•  Animal

–  ['animal', 'sensual', 'pleasing', 'joyful']

–  ['animal', 'bestial', 'vile', 'hateful']

–  ['animal', 'gross', 'shocking', 'fearful']

–  ['animal', 'gross', 'grievous', 'sorrowful']

•  Negation

–  “My mother was not a hateful person.”

Page 34: SEASR Text

UIMA + SEASR: Sentiment Analysis

•  Process Overview

•  Extract the adjectives (UIMA POS analysis)

•  Read in adjectives (SEASR library)

•  Label each adjective (SynNet)

•  Summarize windows of adjectives

•  lots of experimentation here

•  Visualize the windows
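The first four steps of this overview might be approximated as below. Everything here is a simplified stand-in: the input is a plain list of (word, tag) pairs rather than a serialized UIMA CAS, the adjective tags assume a Penn Treebank style tagset, the `label_of` dictionary stands in for SynNet labelling, and windows are fixed-size runs of adjectives, which is only one of many possible windowing choices.

```python
from collections import Counter

# Assumed tagset: Penn Treebank adjective tags.
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}


def extract_adjectives(tagged_tokens):
    """Keep the lowercased words whose POS tag marks an adjective."""
    return [word.lower() for word, tag in tagged_tokens if tag in ADJECTIVE_TAGS]


def summarize_windows(adjectives, label_of, window_size):
    """Count emotion labels per window of window_size adjectives.
    Adjectives without a label are skipped."""
    summaries = []
    for start in range(0, len(adjectives), window_size):
        window = adjectives[start:start + window_size]
        summaries.append(Counter(label_of[a] for a in window if a in label_of))
    return summaries


# Invented input and labels.
tagged = [
    ("The", "DT"), ("joyful", "JJ"), ("boy", "NN"), ("saw", "VBD"),
    ("a", "DT"), ("fearful", "JJ"), ("hateful", "JJ"), ("dog", "NN"),
]
labels = {"joyful": "joy", "fearful": "fear", "hateful": "anger"}

adjectives = extract_adjectives(tagged)
for i, summary in enumerate(summarize_windows(adjectives, labels, window_size=2)):
    print(i, dict(summary))
```

Each per-window count is the kind of record a visualization component could plot along the length of the text.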

Page 35: SEASR Text

UIMA + SEASR: Sentiment Analysis

•  Visualization

•  New SEASR visualization component

•  Based on flare ActionScript Library

•  http://flare.prefuse.org/

•  Still in development

•  http://demo.seasr.org:1714/public/resources/data/emotions/ev/EmotionViewer.html

Page 36: SEASR Text

UIMA + SEASR: Sentiment Analysis