Document-level Semantic Orientation and Argumentation
Transcript of Document-level Semantic Orientation and Argumentation
Presented by Marta Tatu, CS 7301
March 15, 2005
![Page 2: Document-level Semantic Orientation and Argumentation](https://reader036.fdocuments.net/reader036/viewer/2022062309/56815b11550346895dc8bbe3/html5/thumbnails/2.jpg)
or ? Semantic Orientation Applied to
Unsupervised Classification of
Reviews
Peter D. TurneyACL-2002
Overview

An unsupervised learning algorithm for classifying reviews as recommended or not recommended. The classification is based on the semantic orientation of the phrases in the review that contain adjectives and adverbs.
Algorithm

Input: a review

1. Identify phrases that contain adjectives or adverbs by using a part-of-speech tagger.
2. Estimate the semantic orientation of each phrase.
3. Assign a class to the given review based on the average semantic orientation of its phrases.

Output: a classification (recommended or not recommended)
Step 1

Apply Brill's part-of-speech tagger to the review. Adjectives are good indicators of subjective sentences, but in isolation they can be ambiguous: unpredictable steering (negative) vs. unpredictable plot (positive).

Extract two consecutive words: one is an adjective or adverb, the other provides the context.

| | First Word | Second Word | Third Word (not extracted) |
|---|---|---|---|
| 1. | JJ | NN or NNS | Anything |
| 2. | RB, RBR, or RBS | JJ | Not NN nor NNS |
| 3. | JJ | JJ | Not NN nor NNS |
| 4. | NN or NNS | JJ | Not NN nor NNS |
| 5. | RB, RBR, or RBS | VB, VBD, VBN, or VBG | Anything |
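The five extraction patterns above can be sketched over pre-tagged text. The sketch below assumes input as (word, Penn Treebank tag) pairs produced by any POS tagger (the paper uses Brill's):

```python
# Two-word phrase extraction following Turney's five tag patterns.
# A missing third word (end of sentence) counts as "not NN nor NNS".

ADJ = {"JJ"}
ADV = {"RB", "RBR", "RBS"}
NOUN = {"NN", "NNS"}
VERB = {"VB", "VBD", "VBN", "VBG"}

def extract_phrases(tagged):
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        if t1 in ADJ and t2 in NOUN:                       # pattern 1
            phrases.append((w1, w2))
        elif t1 in ADV and t2 in ADJ and t3 not in NOUN:   # pattern 2
            phrases.append((w1, w2))
        elif t1 in ADJ and t2 in ADJ and t3 not in NOUN:   # pattern 3
            phrases.append((w1, w2))
        elif t1 in NOUN and t2 in ADJ and t3 not in NOUN:  # pattern 4
            phrases.append((w1, w2))
        elif t1 in ADV and t2 in VERB:                     # pattern 5
            phrases.append((w1, w2))
    return phrases
```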
Step 2

Estimate the semantic orientation of the extracted phrases using PMI-IR (Turney, 2001). Pointwise Mutual Information (Church and Hanks, 1989):

$$\mathrm{PMI}(word_1, word_2) = \log_2 \frac{p(word_1\ \&\ word_2)}{p(word_1)\,p(word_2)}$$

Semantic Orientation:

$$\mathrm{SO}(phrase) = \mathrm{PMI}(phrase, \text{``excellent''}) - \mathrm{PMI}(phrase, \text{``poor''})$$

PMI-IR estimates PMI by issuing queries to a search engine (AltaVista, ~350 million pages):

$$\mathrm{SO}(phrase) = \log_2 \frac{\mathrm{hits}(phrase\ \mathrm{NEAR}\ \text{``excellent''})\,\mathrm{hits}(\text{``poor''})}{\mathrm{hits}(phrase\ \mathrm{NEAR}\ \text{``poor''})\,\mathrm{hits}(\text{``excellent''})}$$
Step 2 – continued

- 0.01 is added to the hit counts to avoid division by zero.
- If hits(phrase NEAR "excellent") ≤ 4 and hits(phrase NEAR "poor") ≤ 4, the phrase is eliminated.
- "AND (NOT host:epinions)" was added to the queries so as not to include the Epinions website itself.
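Putting the SO formula together with these adjustments gives a minimal sketch. The hit counts would come from the search-engine queries; the parameter names are mine:

```python
from math import log2

def semantic_orientation(near_excellent, near_poor, excellent, poor):
    """SO(phrase) from hit counts; 0.01 is added to the NEAR counts
    to avoid division by zero."""
    return log2(((near_excellent + 0.01) * poor) /
                ((near_poor + 0.01) * excellent))

def keep_phrase(near_excellent, near_poor):
    # A phrase is eliminated when both NEAR counts are at most 4.
    return near_excellent > 4 or near_poor > 4
```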
Step 3

Calculate the average semantic orientation of the phrases in the given review.
- If the average is positive, the review is classified as recommended.
- If the average is negative, it is classified as not recommended.

| Phrase | POS tags | SO |
|---|---|---|
| direct deposit | JJ NN | 1.288 |
| local branch | JJ NN | 0.421 |
| small part | JJ NN | 0.053 |
| online service | JJ NN | 2.780 |
| well other | RB JJ | 0.237 |
| low fees | JJ NNS | 0.333 |
| … | | |
| true service | JJ NN | -0.732 |
| other bank | JJ NN | -0.850 |
| inconveniently located | RB VBN | -1.541 |
| Average Semantic Orientation | | 0.322 |
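Step 3 reduces to an average and a sign test; a minimal sketch:

```python
def classify_review(orientations):
    """Recommended iff the average SO of the review's phrases is positive."""
    avg = sum(orientations) / len(orientations)
    return "recommended" if avg > 0 else "not recommended"
```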
Experiments

- 410 reviews from Epinions: 170 (41%) not recommended, 240 (59%) recommended
- Average phrases per review: 26
- Baseline accuracy (majority class): 59%

| Domain | Accuracy | Correlation |
|---|---|---|
| Automobiles | 84.00% | 0.4618 |
| Banks | 80.00% | 0.6167 |
| Movies | 65.83% | 0.3608 |
| Travel Destinations | 70.53% | 0.4155 |
| All | 74.39% | 0.5174 |
Discussion

What makes movies hard to classify? The average SO tends to classify a recommended movie as not recommended: evil characters make good movies. The whole is not necessarily the sum of its parts. Good beaches do not necessarily add up to a good vacation, but good automobile parts usually do add up to a good automobile.
Applications

- Summary statistics for search engines
- Summarization of reviews: pick out the sentence with the highest positive/negative semantic orientation given a positive/negative review
- Filtering "flames" for newsgroups: when the semantic orientation drops below a threshold, the message might be a potential flame
Questions? Comments? Observations?
Thumbs up? Sentiment Classification using Machine Learning Techniques

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan, EMNLP 2002
Overview

Considers the problem of classifying documents by overall sentiment. Three machine learning methods are evaluated besides the human-generated lists of words:
- Naïve Bayes
- Maximum Entropy
- Support Vector Machines
Experimental Data

- Movie-review domain; source: Internet Movie Database (IMDb)
- Star or numerical ratings converted into positive, negative, or neutral, so there is no need to hand-label the data for training or testing
- Maximum of 20 reviews per author per sentiment category
- 752 negative reviews, 1301 positive reviews, 144 reviewers
List-of-Words Baseline

Maybe there are certain words that people tend to use to express strong sentiments. Classification is done by counting the number of positive and negative words in the document. Random-choice baseline: 50%.
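The counting baseline can be sketched as follows. The two word lists in the test are illustrative stand-ins, not the paper's actual human-generated lists:

```python
def wordlist_classify(tokens, positive_words, negative_words):
    """Count occurrences of listed sentiment words; ties fall to
    negative here (the slide does not specify tie handling)."""
    pos = sum(t in positive_words for t in tokens)
    neg = sum(t in negative_words for t in tokens)
    return "positive" if pos > neg else "negative"
```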
Machine Learning Methods

Bag-of-features framework:
- {f1, …, fm}: predefined set of m features
- ni(d): number of times fi occurs in document d
- A document d is represented by the vector $\vec{d} = (n_1(d), n_2(d), \ldots, n_m(d))$

Naïve Bayes:

$$c_{NB} = \arg\max_c P(c \mid d), \qquad P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)}$$

$$P_{NB}(c \mid d) := \frac{P(c)\,\prod_{i=1}^{m} P(f_i \mid c)^{n_i(d)}}{P(d)}$$
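The Naïve Bayes formula can be sketched in log space, dropping the constant P(d) since it does not affect the argmax. Add-one smoothing is my assumption; the slide does not specify a smoothing method:

```python
from math import log
from collections import Counter

def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes with add-one smoothing (an assumption;
    the paper's exact smoothing is not shown on the slide)."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, c in zip(docs, labels):
        counts[c].update(t for t in d if t in vocab)
    cond = {}
    for c in classes:
        total = sum(counts[c].values())
        cond[c] = {f: (counts[c][f] + 1) / (total + len(vocab)) for f in vocab}
    return prior, cond

def classify_nb(d, prior, cond):
    # argmax_c  log P(c) + sum_i n_i(d) log P(f_i | c)
    def score(c):
        return log(prior[c]) + sum(log(cond[c][t]) for t in d if t in cond[c])
    return max(prior, key=score)
```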
Machine Learning Methods – continued

Maximum Entropy:

$$P_{ME}(c \mid d) := \frac{1}{Z(d)}\,\exp\Big(\sum_i \lambda_{i,c}\,F_{i,c}(d, c)\Big)$$

where $F_{i,c}$ is a feature/class function:

$$F_{i,c}(d, c') = \begin{cases}1, & n_i(d) > 0 \text{ and } c' = c\\ 0, & \text{otherwise}\end{cases}$$

Support vector machines: find the hyperplane that maximizes the margin. The constrained optimization problem yields

$$\vec{w} = \sum_j \alpha_j c_j \vec{d}_j, \qquad \alpha_j \geq 0,\; c_j \in \{1, -1\}$$

where cj is the correct class of document dj.
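The resulting SVM decision rule classifies by the sign of w·x. The sketch below assumes the multipliers α_j are already known from training, and omits any bias term since none appears on the slide:

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def svm_predict(x, train_docs, alphas, classes):
    """Sign of w.x with w = sum_j alpha_j c_j d_j; alphas would come
    from solving the margin optimization above."""
    score = sum(a * c * dot(d, x)
                for a, c, d in zip(alphas, classes, train_docs))
    return 1 if score >= 0 else -1
```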
Evaluation

- 700 positive-sentiment and 700 negative-sentiment documents
- 3 equal-sized folds
- The tag "NOT_" was added to every word between a negation word ("not", "isn't", "didn't") and the first punctuation mark: "good" is opposite to "not very good"
- Features:
  - 16,165 unigrams appearing at least 4 times in the 1,400-document corpus
  - 16,165 most frequently occurring bigrams in the same data
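The NOT_ tagging described above can be sketched as follows; the negation list extends the slide's three examples and is an assumption:

```python
import re

# The slide names the first three; "no" and "never" are my additions.
NEGATIONS = {"not", "isn't", "didn't", "no", "never"}

def add_not_tags(tokens):
    """Prefix NOT_ to every token between a negation word and the
    next punctuation mark."""
    out, negating = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,;:!?]", tok):
            negating = False
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        elif tok.lower() in NEGATIONS:
            negating = True
            out.append(tok)
        else:
            out.append(tok)
    return out
```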
Results

POS information was added to differentiate between, e.g., "I love this movie" and "This is a love story".
Conclusion

- Results produced by the machine learning techniques are better than the human-generated baselines; SVMs tend to do the best
- Unigram presence information is the most effective feature
- Frequency vs. presence: "thwarted expectation", where many words are indicative of the opposite sentiment to that of the entire review
- Some form of discourse analysis is necessary
Questions? Comments? Observations?
Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status

Simone Teufel and Marc Moens, Computational Linguistics, 2002
Overview

- Summarization of scientific articles: restore the discourse context of extracted material by adding the rhetorical status of each sentence in the document
- Gold-standard data for summaries, consisting of computational linguistics articles annotated with the rhetorical status and relevance of each sentence
- A supervised learning algorithm that classifies sentences into 7 rhetorical categories
Why?

Knowledge about the rhetorical status of a sentence enables tailoring summaries to the user's expertise and task:
- Nonexpert summary: background information and the general purpose of the paper
- Expert summary: no background; instead, differences between this approach and similar ones
- Contrasts or complementarity among articles can be expressed
Rhetorical Status

Generalizations about the nature of scientific texts, plus information that enables the construction of better summaries:
- Problem structure: problems (research goals), solutions (methods), and results
- Intellectual attribution: what the new contribution is, as opposed to previous work and background (generally accepted statements)
- Scientific argumentation
- Attitude toward other people's work: a rival approach, a prior approach with a fault, or an approach contributing parts of the authors' own solution
Metadiscourse and Agentivity

- Metadiscourse is an aspect of scientific argumentation and a way of expressing attitude toward previous work: "we argue that", "in contrast to common belief, we"
- Agent roles in argumentation: rivals, contributors of part of the solution (they), the entire research community, or the authors of the paper (we)
Citations and Relatedness

Just knowing that an article cites another is often not enough; one needs to read the context of the citation to understand the relation between the articles:
- Article cited negatively or contrastively
- Article cited positively, or one in which the authors state that their own work originates from the cited work
Rhetorical Annotation Scheme

- Only one category is assigned to each full sentence
- Nonoverlapping, nonhierarchical scheme
- The rhetorical status is determined on the basis of the global context of the paper
Relevance

- Select important content from the text
- Highly subjective, hence low human agreement
- A sentence is considered relevant if it describes the research goal or states a difference with a rival approach
- Other definitions: a sentence is relevant if it shows a high level of similarity with a sentence in the abstract
Corpus

80 conference articles, with XML markup added:
- Association for Computational Linguistics (ACL)
- European Chapter of the Association for Computational Linguistics (EACL)
- Applied Natural Language Processing (ANLP)
- International Joint Conference on Artificial Intelligence (IJCAI)
- International Conference on Computational Linguistics (COLING)
The Gold Standard

- 3 task-trained annotators
- 17 pages of guidelines; 20 hours of training
- No communication between annotators
- Evaluation measures of the annotation: stability and reproducibility
Results of Annotation

Kappa coefficient K (Siegel and Castellan, 1988):

$$K = \frac{P(A) - P(E)}{1 - P(E)}$$

where P(A) is the pairwise agreement and P(E) the random (chance) agreement.

- Stability: K = .82, .81, .76 (N = 1,220 and k = 2)
- Reproducibility: K = .71
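The kappa computation can be sketched for the two-annotator case, with chance agreement computed from pooled category proportions as in the Siegel and Castellan formulation:

```python
from collections import Counter

def kappa(a, b):
    """K = (P(A) - P(E)) / (1 - P(E)) for two annotators' label lists."""
    n = len(a)
    p_a = sum(x == y for x, y in zip(a, b)) / n   # pairwise agreement
    pooled = Counter(a) + Counter(b)
    p_e = sum((cnt / (2 * n)) ** 2 for cnt in pooled.values())  # chance
    return (p_a - p_e) / (1 - p_e)
```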
The System

Supervised machine learning with a Naïve Bayes classifier.
Features

Absolute location of a sentence: limitations of the authors' own method can be expected toward the end of a paper, while limitations of other researchers' work are discussed in the introduction.
Features – continued

- Section structure: relative and absolute position of the sentence within its section (first, last, second or third, second-to-last or third-to-last, or somewhere in the first, second, or last third of the section)
- Paragraph structure: relative position of the sentence within its paragraph (initial, medial, or final)
Features – continued

- Headlines: type of headline of the current section (Introduction, Implementation, Example, Conclusion, Result, Evaluation, Solution, Experiment, Discussion, Method, Problems, Related Work, Data, Further Work, Problem Statement, or Non-Prototypical)
- Sentence length: longer or shorter than a 12-word threshold
Features – continued

- Title word contents: does the sentence contain words that also occur in the title?
- TF*IDF word contents: high values go to words that occur frequently in one document but rarely in the overall collection; do any of the 18 highest-scoring TF*IDF words appear in the sentence?
- Verb syntax: voice, tense, and modal linguistic features
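The TF*IDF idea can be sketched with the common raw-tf times log(N/df) variant; the paper's exact weighting scheme is not given on the slide:

```python
from math import log

def tf_idf(term, doc, corpus):
    """High for terms frequent in this document but rare in the
    collection; doc is a token list, corpus a list of token lists."""
    tf = doc.count(term)
    df = sum(term in d for d in corpus)       # document frequency
    return tf * log(len(corpus) / df) if df else 0.0
```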
Features – continued

- Citation: citation (self), citation (other), author name, or none, plus the location of the citation within the sentence (beginning, middle, or end)
- History: most probable previous category (e.g., AIM tends to follow CONTRAST), calculated in a second pass during training
Features – continued

Formulaic expressions: a list of phrases described by regular expressions, divided into 18 classes comprising a total of 644 patterns. Clustering prevents data sparseness.
Features – continued

Agent: 13 types, 167 patterns. The placeholder WORK_NOUN can be replaced by a set of 37 nouns including theory, method, prototype, and algorithm. Agent classes whose distribution was very similar to the overall distribution of target categories were excluded.
Features – continued

Action: 365 verbs clustered into 20 classes based on semantic concepts such as similarity and contrast:
- PRESENTATION_ACTIONs: present, report, state
- RESEARCH_ACTIONs: analyze, conduct, define, observe

Negation is considered.
System Evaluation

10-fold cross-validation.
Feature Impact

The most distinctive single feature is Location, followed by SegAgent, Citations, Headlines, Agent, and Formulaic.
Questions? Comments? Observations?
Thank You!