Text Summarization -- In Search of Effective Ideas and Techniques
Shuhua Liu, Assistant Professor
Department of Information Systems
Åbo Akademi University, Finland & University of California, Berkeley
Modified By Shinta P., 2012
Headline news — informing
TV-GUIDES — decision making
Abstracts of papers — time saving
Graphical maps — orienting
What is text summarization?
To reduce (long) textual information to its most essential points
to distill the most important information from a source or sources to produce an abridged version of it (Endres-Niggemeyer, 1998; Mani and Maybury, 1999; Spärck-Jones, 1999).
Text summarization: a context-dependent activity
‘Genres’ of Summary?
- Indicative vs. informative: used for quick categorization vs. content processing.
- Extract vs. abstract: lists fragments of text vs. re-phrases content coherently.
- Generic vs. query-oriented: provides author’s view vs. reflects user’s interest.
- Background vs. just-the-news: assumes reader’s prior knowledge is poor vs. up-to-date.
- Single-document vs. multi-document source: based on one text vs. fuses together many texts.
Shuhua Liu, IIS/IAMSR, ÅA
Text summarization
Key issues:
- How to identify the most important content out of the rest of the text?
- How to synthesize the substance and formulate a summary text based on the identified content?
Major approaches:
- Selection based: produce "extracts"
- Text understanding based: produce "abstracts"
Selection based summarization: how does it work?
The most content-bearing sentences or passages are identified and selected to compose a summary.
Compute a significance value for each sentence (Luhn, 1958; Edmundson, 1969), based on:
- word frequency;
- the keywords, title words, and cue words it contains;
- the position of the sentence.
RST (Rhetorical Structure Theory) based discourse analysis (Marcu, 1997)
Passage and sentence similarity analysis (Goldstein et al, 2000; CMU)
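The sentence-scoring idea above can be sketched in a few lines. This is only an illustration in the spirit of Luhn and Edmundson, not their exact methods: the stopword list, the cue-word list, the top-5 keyword cutoff, the equal feature weights, and the function names `score_sentences` and `extract` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical word lists; a real system would use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "is", "are", "and", "it", "for", "on"}
CUE_WORDS = {"significant", "conclusion", "important", "results"}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def score_sentences(sentences, title, weights=(1.0, 1.0, 1.0, 1.0)):
    """Return one significance score per sentence (higher = more important)."""
    w_key, w_title, w_cue, w_pos = weights
    tokens = [[t for t in tokenize(s) if t not in STOPWORDS] for s in sentences]
    freq = Counter(t for sent in tokens for t in sent)
    # Keywords: the most frequent non-stopwords (assumed cutoff: top 5).
    keywords = {w for w, _ in freq.most_common(5)}
    title_words = set(tokenize(title)) - STOPWORDS
    scores = []
    for i, sent in enumerate(tokens):
        n = max(len(sent), 1)
        key = sum(t in keywords for t in sent) / n
        ttl = sum(t in title_words for t in sent) / n
        cue = sum(t in CUE_WORDS for t in sent) / n
        # Position feature: favor the first and last sentence.
        pos = 1.0 if i in (0, len(sentences) - 1) else 0.0
        scores.append(w_key * key + w_title * ttl + w_cue * cue + w_pos * pos)
    return scores

def extract(sentences, title, k=2):
    """Pick the k highest-scoring sentences, returned in document order."""
    scores = score_sentences(sentences, title)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Returning the selected sentences in document order is a small design choice that keeps the extract a little more readable than ordering by score.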
MSWord AutoSummarize
Text understanding system
A text understanding task often aims to recover all of the information that there is in a text, including what is only implicit in what is actually written. “All the richness of natural language
becomes fair game, including metaphor, metonymy, discourse structure, and the recognition of the author's underlying intentions, and the full interplay between language and world knowledge becomes central to the task.”
Text understanding based summarization
Depends on complete sentence analysis and discourse analysis with full knowledge support:
- Syntactic parser, semantic interpreter
- Linguistic knowledge, world knowledge, domain knowledge
- Reasoning mechanisms that work effectively over huge knowledge collections.
Selection based vs. Understanding based
Selection based: generally applicable, but the content can be incoherent and hard to read because of unclear relationships between the selected text excerpts, dangling references, and so on.
Understanding based: high precision, but very slow, with a large amount of wasted computation, and highly domain specific.
Endres-Niggemeyer (2000) found that people sometimes prefer extractive summaries to gloss-over abstractive summaries!
The reality:
The dominant approach in practice is still selection-based;
Understanding-based systems exist only in theory, and will remain so for quite a while;
However, certain text understanding tasks in small scale or restricted domains can be done.
Topic guided text summarization
Text summarization as a process of topic analysis, passage extraction, text understanding, information integration/fusion, and text generation.
Passage extraction guided by topic structure is expected to preserve the logical relationships between the extracted text parts, e.g. sentences are arranged logically according to topic structure.
Topic representation will also be very helpful in the next phase: text analysis and information integration.
Phase 1: Theme detection, topic labels, sentence/passage selection
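Theme detection through pairwise passage similarity analysis
- Vector space model of terms and documents
- TF-IDF: baseline method

$$w_{ij} = f_{ij} \times \log\frac{N}{n}$$

$$\mathrm{similarity}(D_i, D_j) = \frac{\sum_{k=1}^{t} w_{ik}\, w_{jk}}{\sqrt{\sum_{k=1}^{t} (w_{ik})^2 \cdot \sum_{k=1}^{t} (w_{jk})^2}}$$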
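A minimal sketch of the TF-IDF weighting and cosine similarity above, using only the standard library. It assumes the usual reading of the symbols, which the slide does not spell out: f is the raw frequency of a term in a passage, N the number of passages, and n the number of passages containing the term. The function names are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(passages):
    """passages: list of token lists. Returns one {term: weight} dict per passage."""
    N = len(passages)
    # n[term] = number of passages that contain the term (assumed meaning of n)
    n = Counter(term for p in passages for term in set(p))
    vectors = []
    for p in passages:
        f = Counter(p)  # raw term frequency within this passage
        vectors.append({term: f[term] * math.log(N / n[term]) for term in f})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values()) * sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```

Note that a term occurring in every passage gets weight log(1) = 0 and therefore contributes nothing to the similarity.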
Passage similarity analysis with LSA method
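LSA (Latent Semantic Analysis)
- Similar results as using TF-IDF
- Fuzzy LSI approach (Nikravesh, 2002)

$$D = \{d_1, d_2, \ldots, d_n\}, \qquad W = \{w_1, w_2, \ldots, w_m\}$$

$$N = \big(n(d_i, w_j)\big)_{ij}$$

$$N = U \Sigma V^t, \qquad \tilde{N} = U \tilde{\Sigma} V^t \approx U \Sigma V^t = N$$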
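The rank reduction above can be sketched with NumPy's singular value decomposition. This is an illustration under stated assumptions, not the method used in the talk: rows of the count matrix are taken to be passages and columns terms, only the k largest singular values are kept, and passages are compared by cosine similarity in the reduced space. The function name `lsa_similarity` is hypothetical.

```python
import numpy as np

def lsa_similarity(N, k):
    """N: (passages x terms) count matrix. Returns the passage-passage
    cosine similarity matrix computed in the rank-k latent space."""
    U, s, Vt = np.linalg.svd(N, full_matrices=False)
    # Passage coordinates in the latent space: rows of U scaled by the
    # k largest singular values (singular values come sorted descending).
    docs = U[:, :k] * s[:k]
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty passages
    unit = docs / norms
    return unit @ unit.T
```

Two passages that share no terms can still come out similar here if their terms co-occur elsewhere in the collection, which is the main difference from plain TF-IDF matching.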
Passage adjacency matrix (partial)
similarity strength >= 0.35

        s21  s22  s23  s24  s25  s26  s27  s28
s21      0    1    0    1    0    1    0    0
s22      1    0    0    0    0    1    0    0
s23      0    0    0    0    0    0    0    0
s24      1    0    0    0    1    1    1    1
s25      0    0    0    1    0    0    1    1
s26      1    1    0    1    0    0    0    1
s27      0    0    0    1    1    0    0    1
s28      0    0    0    1    1    1    1    0
s29      0    0    0    0    0    0    0    0
s210     0    1    0    0    0    0    0    0
s211     0    0    0    0    0    0    0    1
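A small sketch of how such an adjacency matrix could be derived from a similarity matrix and then grouped into passage clusters. The 0.35 threshold comes from the slide; treating clusters as connected components of the resulting graph is an assumption made for the example, and the function names are illustrative.

```python
def adjacency(sim, threshold=0.35):
    """Link two distinct passages when their similarity reaches the threshold."""
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= threshold else 0 for j in range(n)]
            for i in range(n)]

def clusters(adj):
    """Connected components of the passage graph, via depth-first search."""
    seen, groups = set(), []
    for start in range(len(adj)):
        if start in seen:
            continue
        stack, group = [start], []
        seen.add(start)
        while stack:
            i = stack.pop()
            group.append(i)
            for j, linked in enumerate(adj[i]):
                if linked and j not in seen:
                    seen.add(j)
                    stack.append(j)
        groups.append(sorted(group))
    return groups
```

A passage with no links, like s23 or s29 in the matrix above, ends up as a cluster of its own.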
Passage Relation Map
Passage Extraction Rules
Passage clusters help us to identify themes and topics; unconnected passages form distinct topics covered in a document.
- The sentence/passage closest to the centroid of the cluster is chosen to be included in the summary.
- The MMR algorithm (CMU) (Goldstein et al, 2000): sentences that are maximally similar to the document and maximally dissimilar to sentences already in the summary are selected to compose a summary.
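The MMR selection rule can be sketched as a greedy loop. This is a simplified illustration, not the CMU implementation: it assumes the similarities are precomputed, with `relevance[i]` the similarity of sentence i to the whole document and `pairwise[i][j]` the similarity between sentences i and j, and it uses a trade-off parameter `lam` whose default of 0.7 is arbitrary.

```python
def mmr_select(relevance, pairwise, k, lam=0.7):
    """Greedily pick k sentence indices, trading off relevance to the document
    against redundancy with sentences already selected."""
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr(i):
            # Redundancy: similarity to the closest already-selected sentence.
            redundancy = max((pairwise[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` the loop reduces to picking the k most relevant sentences; lower values push the selection toward sentences that differ from what is already in the summary.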
Creating theme labels
- Keywords (TF based)
- Word families (semantically related words in a passage cluster)
- Key phrases
  - Linguistic approach
  - Statistical + simple heuristics (Kelledy and Smeaton, 1997): seems quite effective.
Next step
WordNet, since 1985
Lexical database developed at Princeton University, led by George Miller
- Hand-coded, freely available
- Word knowledge of: nouns, verbs, adjectives, adverbs
- Semantic network representation with only a few semantic relations: synonym, hypernym; categorization relation: Is-a
- Widely used in query expansion, word similarity determination (based on synsets)
Table: Semantic Relations in WordNet (Miller, 1995)

Semantic Relation        Syntactic Category   Examples
Synonymy (similar)       N, V, Aj, Av         pipe, tube; rise, ascend; sad, unhappy; rapidly, speedily
Antonymy (opposite)      Aj, Av, (N, V)       wet, dry; powerful, powerless; friendly, unfriendly; rapidly, slowly
Hyponymy (subordinate)   N                    sugar maple, maple; maple, tree; tree, plant
Meronymy (part)          N                    brim, hat; gin, martini; ship, fleet
Troponymy (manner)       V                    march, walk; whisper, speak
Entailment               V                    drive, ride; divorce, marry

Note: N = Nouns, Aj = Adjectives, V = Verbs, Av = Adverbs
ConceptNet, MIT Media Lab
Common sense knowledge base with NLP capability
Extracted automatically from common sense knowledge expressed in semi-structured natural-language sentences from OMCSNet (Open Mind Common Sense), applying about 50 extraction rules.
- Example sentence: "The Effect of [falling off a bike] is [you get hurt]."
- "A lime is a very sour fruit" at OMCS is extracted into two assertions:
  IsA (lime, fruit)
  PropertyOf (lime, very sour)
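To make the idea of extraction rules concrete, here is a toy sketch with two regular-expression rules covering the two example sentences above. The rules, their names, and the function `extract_assertions` are hypothetical; they are not taken from ConceptNet, whose actual rule set is larger and more robust.

```python
import re

# Two hypothetical extraction rules, written only for the slide's examples.
EFFECT_RULE = re.compile(
    r"^The effect of (?P<cause>.+?) is (?P<effect>.+?)\.?$", re.IGNORECASE)
ISA_RULE = re.compile(
    r"^An? (?P<thing>\w+) is an? (?P<modifier>.+ )?(?P<kind>\w+)\.?$", re.IGNORECASE)

def extract_assertions(sentence):
    """Map a semi-structured sentence to (relation, arg1, arg2) tuples."""
    sentence = sentence.strip()
    m = EFFECT_RULE.match(sentence)
    if m:
        return [("EffectOf", m.group("cause"), m.group("effect"))]
    m = ISA_RULE.match(sentence)
    if m:
        out = [("IsA", m.group("thing"), m.group("kind"))]
        # Words between the article and the head noun become a property.
        if m.group("modifier"):
            out.append(("PropertyOf", m.group("thing"), m.group("modifier").strip()))
        return out
    return []  # no rule applies
```

Patterns this simple break easily, for instance when the cause phrase itself contains the word "is", which hints at why a practical system needs many rules plus additional normalization.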
Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)
THINGS (52,000 assertions)
- IsA: (IsA "apple" "fruit")
- PartOf: (PartOf "CPU" "computer")
- PropertyOf: (PropertyOf "coffee" "wet")
- MadeOf: (MadeOf "bread" "flour")
- DefinedAs: (DefinedAs "meat" "flesh of animal")
EVENTS (38,000 assertions)
- PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
- SubeventOf: (SubeventOf "play sport" "score goal")
- FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
- LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")
AGENTS (104,000 assertions)
- CapableOf: (CapableOf "dentist" "pull tooth")
SPATIAL (36,000 assertions)
- LocationOf: (LocationOf "army" "in war")
TEMPORAL (time & sequence)
CAUSAL (17,000 assertions)
- EffectOf: (EffectOf "view video" "entertainment")
- DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")
AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
- DesireOf: (DesireOf "person" "not be depressed")
- MotivationOf: (MotivationOf "play game" "compete")
FUNCTIONAL (115,000 assertions)
- UsedFor: (UsedFor "fireplace" "burn wood")
- CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")
ASSOCIATION K-LINES (1.25 million assertions)
- SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
- ThematicKLine: (ThematicKLine "wedding dress" "veil")
- ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")
ConceptNet (Liu and Singh, 2004a, 2004b)
Inference
- Spreading activation: node activation radiating outward from an origin node
  - GetContext (node)
  - GetAnalogousConcept (node)
- Graph traversal: FindPathBetweenNodes (node1, node2)
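The two inference styles can be illustrated on a toy graph. This sketch is not the ConceptNet API: the graph, the decay factor, the hop limit, and the Python function names (modeled loosely on GetContext and FindPathBetweenNodes) are assumptions, and the spreading activation is simplified so that each node is activated only once, on first arrival.

```python
from collections import deque

# Toy semantic network with hypothetical edges, undirected for simplicity.
GRAPH = {
    "wedding": ["bride", "veil", "cake"],
    "bride": ["wedding", "veil"],
    "veil": ["wedding", "bride"],
    "cake": ["wedding", "flour"],
    "flour": ["cake", "bread"],
    "bread": ["flour"],
}

def get_context(origin, decay=0.5, max_hops=2):
    """Spreading activation: energy radiates outward from the origin node,
    shrinking by `decay` at every hop. Returns (node, activation) pairs."""
    activation = {origin: 1.0}
    frontier = [origin]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for nb in GRAPH.get(node, []):
                if nb not in activation:
                    activation[nb] = activation[node] * decay
                    next_frontier.append(nb)
        frontier = next_frontier
    del activation[origin]
    # Strongest first; ties broken alphabetically for a stable order.
    return sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))

def find_path_between_nodes(start, goal):
    """Breadth-first search; returns a shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nb in GRAPH.get(path[-1], []):
            if nb not in visited:
                visited.add(nb)
                queue.append(path + [nb])
    return None
```

For example, `get_context("wedding")` ranks the direct neighbors above "flour", which is two hops away, and `find_path_between_nodes("wedding", "bread")` walks through "cake" and "flour".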
ConceptNet (Liu and Singh, 2004a, 2004b)
Support
- Topic sensing
- Query expansion
- Semantic similarity of words
- Lexical generalization
- Thematic generalization
Much needs to be examined: the vocabulary is uncontrolled and can be biased in terms of content, but the knowledge seems quite reliable.
Topic-Sensing
Eurovoc: multilingual thesaurus
Controlled vocabulary, 20 languages, broad fields:
politics, international relations, European Communities, law, economics, trade, finance, social questions, education, science, international organizations, employment and working conditions, industry, business and competition, production, technology and research, transport, environment, energy, agriculture, forestry and fisheries, agri-foodstuffs, geography