Text Summarization -- In Search of Effective Ideas and Techniques

Shuhua Liu, Assistant Professor
Department of Information Systems
Åbo Akademi University, Finland & University Berkeley

Modified by Shinta P., 2012


Headline news — informing

TV-GUIDES — decision making

Abstracts of papers — time saving

Graphical maps — orienting

What is text summarization?

To reduce (long) textual information to its most essential points.

To distill the most important information from a source or sources to produce an abridged version of it (Endres-Niggemeyer, 1998; Mani and Maybury, 1999; Spärck-Jones, 1999).

Text summarization: a context-dependent activity

'Genres' of Summary?

Indicative vs. informative: used for quick categorization vs. content processing.

Extract vs. abstract: lists fragments of text vs. re-phrases content coherently.

Generic vs. query-oriented: provides author's view vs. reflects user's interest.

Background vs. just-the-news: assumes reader's prior knowledge is poor vs. up-to-date.

Single-document vs. multi-document source: based on one text vs. fuses together many texts.

Shuhua Liu, IIS/IAMSR, ÅA

Text summarization

Key issues:
  How to identify the most important content out of the rest of the text?
  How to synthesize the substance and formulate a summary text based on the identified content?

Major approaches:
  Selection based: produce "extracts"
  Text understanding based: produce "abstracts"

Selection based summarization: how does it work?

The most content-bearing sentences or passages are identified and selected to compose a summary.

Compute a significance value for each sentence (Luhn, 1958; Edmundson, 1969), based on:
  word frequency counts;
  the keywords, title words, and cue words it contains;
  the position of the sentence.

RST (Rhetorical Structure Theory) based discourse analysis (Marcu, 1997)

Passage and sentence similarity analysis (Goldstein et al., 2000; CMU)
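
The sentence-significance idea above can be illustrated with a minimal sketch. It is not the exact scoring of Luhn or Edmundson: the stopword list, the cue-word list, the equal weights, and the rule that the first and last sentences earn a position bonus are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical word lists, chosen only for this illustration.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "it",
             "that", "for", "on", "with", "as", "by"}
CUE_WORDS = {"significant", "important", "conclusion", "results"}


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_sentences(sentences, title, w_freq=1.0, w_title=1.0, w_cue=1.0, w_pos=1.0):
    """Return one significance score per sentence (higher means more important)."""
    content = [t for s in sentences for t in tokenize(s) if t not in STOPWORDS]
    freq = Counter(content)
    max_freq = max(freq.values()) if freq else 1
    title_words = set(tokenize(title)) - STOPWORDS
    scores = []
    for idx, sentence in enumerate(sentences):
        words = [t for t in tokenize(sentence) if t not in STOPWORDS]
        if not words:
            scores.append(0.0)
            continue
        f = sum(freq[w] for w in words) / (max_freq * len(words))  # frequency feature
        t = sum(w in title_words for w in words) / len(words)      # title-word feature
        c = sum(w in CUE_WORDS for w in words) / len(words)        # cue-word feature
        p = 1.0 if idx in (0, len(sentences) - 1) else 0.0         # position feature (assumed rule)
        scores.append(w_freq * f + w_title * t + w_cue * c + w_pos * p)
    return scores


def extract(sentences, title, k=2):
    """Pick the k highest-scoring sentences and return them in document order."""
    scores = score_sentences(sentences, title)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Returning the selected sentences in their original order is a small design choice that keeps the extract closer to the flow of the source text.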

MSWord AutoSummarize

Text understanding system

A text understanding task often aims to recover all of the information that there is in a text, including what is only implicit in what is actually written. "All the richness of natural language becomes fair game, including metaphor, metonymy, discourse structure, and the recognition of the author's underlying intentions, and the full interplay between language and world knowledge becomes central to the task."

Text understanding based summarization

Depends on complete sentence analysis and discourse analysis with full knowledge support:
  Syntactic parser, semantic interpreter
  Linguistic knowledge, world knowledge, domain knowledge
  Reasoning mechanisms that work effectively over huge knowledge collections.

Selection based vs. Understanding based

Selection based: generally applicable, but yields incoherent content and poor readability due to unclear relationships between the selected text excerpts, dangling references, and so on.

Understanding based: high precision, but very slow, with a large amount of wasted computation, and highly domain specific.

Endres-Niggemeyer (2000) found that people sometimes prefer extractive summaries to glossed-over abstractive summaries!

The reality:

The dominant approach in practice is still selection-based;

Understanding based systems exist only in theory, and will remain so for quite a while;

However, certain text understanding tasks on a small scale or in restricted domains can be done.

Topic guided text summarization

Text summarization as a process of topic analysis, passage extraction, text understanding, information integration/fusion, and text generation.

Passage extraction guided by topic structure is expected to preserve the logical relationships between the extracted text parts, e.g. sentences are arranged logically according to topic structure.

Topic representation will also be very helpful in the next phase of text analysis and information integration.

Phase 1: Theme detection, topic labels, sentence/passage selection

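Theme detection through pairwise passage similarity analysis
  Vector space model of terms and documents
  TF-IDF: baseline method

$$ w_{ij} = f_{ij} \cdot \log\frac{N}{n} $$

$$ \mathrm{similarity}(D_i, D_j) = \frac{\sum_{k=1}^{t} w_{ik}\, w_{jk}}{\sqrt{\sum_{k=1}^{t} (w_{ik})^{2} \cdot \sum_{k=1}^{t} (w_{jk})^{2}}} $$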
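
The TF-IDF baseline and the pairwise similarity measure named on this slide can be sketched as follows. The sketch assumes the usual reading of the symbols (term frequency within a passage, N as the total number of passages, n as the number of passages containing the term) and assumes that the similarity is the standard cosine measure; the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(passages):
    """passages: list of token lists. Returns (vocabulary, list of weight vectors)."""
    vocab = sorted({t for p in passages for t in p})
    n_docs = len(passages)                                   # N in the formula
    doc_freq = Counter(t for p in passages for t in set(p))  # n for each term
    vectors = []
    for p in passages:
        tf = Counter(p)                                      # raw term frequencies
        vectors.append([tf[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Note that a term occurring in every passage gets weight zero under this weighting, so it contributes nothing to the similarity.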

Passage similarity analysis with LSA method

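LSA (Latent Semantic Analysis)
  Similar results as using TF-IDF
  Fuzzy LSI approach (Nikravesh, 2002)

$$ D = \{d_1, d_2, \ldots, d_n\}, \qquad W = \{w_1, w_2, \ldots, w_m\} $$

$$ N = \big(n(w_i, d_j)\big)_{ij} $$

$$ N = U \Sigma V^{t}, \qquad \tilde{N} = U \tilde{\Sigma} V^{t} \approx U \Sigma V^{t} = N $$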
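
A minimal sketch of the LSA step, assuming NumPy is available and that rows of the count matrix are terms and columns are passages. The function names and the choice to compare passages by cosine similarity in the reduced space are assumptions for illustration, not the method used in the original experiments.

```python
import numpy as np


def lsa_approximation(N, k):
    """Rank-k approximation of the term-document matrix N via truncated SVD."""
    U, s, Vt = np.linalg.svd(N, full_matrices=False)
    s_trunc = s.copy()
    s_trunc[k:] = 0.0                      # keep only the k largest singular values
    return U @ np.diag(s_trunc) @ Vt


def passage_similarity(N, k):
    """Cosine similarity between passages (columns of N) in the k-dimensional latent space."""
    U, s, Vt = np.linalg.svd(N, full_matrices=False)
    docs = (np.diag(s[:k]) @ Vt[:k, :]).T  # one row of latent coordinates per passage
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                # avoid division by zero for empty passages
    unit = docs / norms
    return unit @ unit.T
```

Choosing k is the main design decision: a small k merges passages that share related vocabulary, while k equal to the full rank reproduces the original matrix.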

Passage adjacency matrix (partial)

similarity strength >= 0.35

       s21  s22  s23  s24  s25  s26  s27  s28
s21     0    1    0    1    0    1    0    0
s22     1    0    0    0    0    1    0    0
s23     0    0    0    0    0    0    0    0
s24     1    0    0    0    1    1    1    1
s25     0    0    0    1    0    0    1    1
s26     1    1    0    1    0    0    0    1
s27     0    0    0    1    1    0    0    1
s28     0    0    0    1    1    1    1    0
s29     0    0    0    0    0    0    0    0
s210    0    1    0    0    0    0    0    0
s211    0    0    0    0    0    0    0    1
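
As a rough sketch of how such an adjacency matrix could be derived and used: similarities at or above the 0.35 threshold become links, and connected groups of passages are then read off as candidate themes. The function names are hypothetical, and grouping by connected components is an assumed clustering rule. The sample data is the s21 to s28 block of the matrix on this slide.

```python
def adjacency_from_similarity(sim, threshold=0.35):
    """Turn a square similarity matrix into a 0/1 adjacency matrix (no self-links)."""
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= threshold else 0 for j in range(n)]
            for i in range(n)]


def connected_components(adj):
    """Group passage indices into connected clusters using depth-first search."""
    n = len(adj)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for other in range(n):
                if adj[node][other] and other not in seen:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(component))
    return components


# Rows and columns s21..s28 of the partial adjacency matrix shown on the slide.
ADJ = [
    [0, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 1, 1],
    [1, 1, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1, 1, 0],
]
```

On this block, `connected_components(ADJ)` puts s23 in a cluster of its own and links the other seven passages together.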

Passage Relation Map

Passage Extraction Rules

Passage clusters help us to identify themes and topics; unconnected passages form distinct topics covered in a document.

The MMR algorithm (CMU) (Goldstein et al., 2000):
  The sentence/passage closest to the centroid of the cluster is chosen for inclusion in the summary.
  Sentences that are maximally similar to the document and maximally dissimilar to sentences already in the summary are selected to compose a summary.
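
The selection rule just described (similar to the document, dissimilar to what is already chosen) can be sketched as a greedy loop. This is a simplified illustration rather than the CMU implementation: the trade-off parameter `lam`, its default value, and the precomputed similarity inputs are assumptions.

```python
def mmr_select(doc_sim, pair_sim, k, lam=0.7):
    """Greedy MMR-style selection.

    doc_sim[i]:     similarity of sentence i to the whole document
    pair_sim[i][j]: similarity between sentences i and j
    lam:            weight on relevance versus novelty (assumed parameter)
    """
    selected = []
    candidates = list(range(len(doc_sim)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy is the highest similarity to any sentence already selected.
            redundancy = max((pair_sim[i][j] for j in selected), default=0.0)
            return lam * doc_sim[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` the loop reduces to plain relevance ranking; lowering `lam` penalizes sentences that repeat what the summary already contains.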

Creating theme labels

Keywords (TF based)
Word families (semantically related words in a passage cluster)
Key phrases
  Linguistic approach
  Statistical + simple heuristics (Kelledy and Smeaton, 1997), which seems quite effective.
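
For the TF-based keyword option, a minimal sketch might simply take the most frequent non-stopword terms in a passage cluster as its label. The function name, the stopword handling, and the `top_n` parameter are assumptions for illustration.

```python
from collections import Counter


def theme_label(cluster_passages, stopwords, top_n=3):
    """Label a passage cluster with its top_n most frequent non-stopword terms."""
    counts = Counter(t for p in cluster_passages for t in p if t not in stopwords)
    return [word for word, _ in counts.most_common(top_n)]
```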

Next step

WordNet, since 1985

Lexical database developed at Princeton University, led by George Miller

Hand-coded, freely available

Word knowledge of: nouns, verbs, adjectives, adverbs

Semantic network representation with only a few semantic relations: synonym, hypernym; categorization relation: is-a

Widely used in query expansion, word similarity determination (based on synsets)

Table: Semantic Relations in WordNet (Miller, 1995)

Semantic Relation        Syntactic Category   Examples
Synonymy (similar)       N, V, Aj, Av         pipe, tube; rise, ascend; sad, unhappy; rapidly, speedily
Antonymy (opposite)      Aj, Av, (N, V)       wet, dry; powerful, powerless; friendly, unfriendly; rapidly, slowly
Hyponymy (subordinate)   N                    sugar maple, maple; maple, tree; tree, plant
Meronymy (part)          N                    brim, hat; gin, martini; ship, fleet
Troponymy (manner)       V                    march, walk; whisper, speak
Entailment               V                    drive, ride; divorce, marry

Note: N = Nouns, Aj = Adjectives, V = Verbs, Av = Adverbs

ConceptNet, MIT Media Lab

Common sense knowledge base with NLP capability

Extracted automatically from common sense knowledge expressed in semi-structured NL sentences from OMCSNet (Open Mind Common Sense), applying about 50 extraction rules:
  "The Effect of [falling off a bike] is [you get hurt]."
  "A lime is a very sour fruit" at OMCS is extracted into two assertions:
    IsA (lime, fruit)
    PropertyOf (lime, very sour)
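
To make the idea of an extraction rule concrete, here is a toy sketch of a single pattern that handles the lime example. It is a hypothetical rule written for illustration and is not one of ConceptNet's actual extraction rules; the regular expression and function name are assumptions.

```python
import re

# Hypothetical rule for sentences of the form
# "A <concept> is a [<property>] <category>".
IS_A_RULE = re.compile(
    r"^(?:a|an)\s+(?P<concept>\w+)\s+is\s+(?:a|an)\s+"
    r"(?:(?P<prop>.+)\s+)?(?P<category>\w+)\.?$",
    re.IGNORECASE,
)


def extract_assertions(sentence):
    """Return (relation, concept, value) triples for a matching sentence, else []."""
    match = IS_A_RULE.match(sentence.strip())
    if not match:
        return []
    concept = match.group("concept").lower()
    assertions = [("IsA", concept, match.group("category").lower())]
    if match.group("prop"):
        # Any words between the article and the final noun are treated as a property.
        assertions.append(("PropertyOf", concept, match.group("prop").lower()))
    return assertions
```

A real rule set would need part-of-speech information to decide which words form the category; this sketch simply treats the last word as the category.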

Twenty Semantic Relation Types in ConceptNet (Liu and Singh, 2004)

THINGS (52,000 assertions)
  IsA: (IsA "apple" "fruit")
  PartOf: (PartOf "CPU" "computer")
  PropertyOf: (PropertyOf "coffee" "wet")
  MadeOf: (MadeOf "bread" "flour")
  DefinedAs: (DefinedAs "meat" "flesh of animal")

EVENTS (38,000 assertions)
  PrerequisiteEventOf: (PrerequisiteEventOf "read letter" "open envelope")
  SubeventOf: (SubeventOf "play sport" "score goal")
  FirstSubeventOf: (FirstSubeventOf "start fire" "light match")
  LastSubeventOf: (LastSubeventOf "attend classical concert" "applaud")

AGENTS (104,000 assertions)
  CapableOf: (CapableOf "dentist" "pull tooth")

SPATIAL (36,000 assertions)
  LocationOf: (LocationOf "army" "in war")

TEMPORAL (time & sequence)

CAUSAL (17,000 assertions)
  EffectOf: (EffectOf "view video" "entertainment")
  DesirousEffectOf: (DesirousEffectOf "sweat" "take shower")

AFFECTIONAL (mood, feeling, emotions) (34,000 assertions)
  DesireOf: (DesireOf "person" "not be depressed")
  MotivationOf: (MotivationOf "play game" "compete")

FUNCTIONAL (115,000 assertions)
  UsedFor: (UsedFor "fireplace" "burn wood")
  CapableOfReceivingAction: (CapableOfReceivingAction "drink" "serve")

ASSOCIATION K-LINES (1.25 million assertions)
  SuperThematicKLine: (SuperThematicKLine "western civilization" "civilization")
  ThematicKLine: (ThematicKLine "wedding dress" "veil")
  ConceptuallyRelatedTo: (ConceptuallyRelatedTo "bad breath" "mint")

ConceptNet (Liu and Singh, 2004a, 2004b)

Inference
  Spreading activation: node activation radiating outward from an origin node
    GetContext (node)
    GetAnalogousConcept (node)
  Graph traversal:
    FindPathBetweenNodes (node1, node2)
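
The two inference styles can be illustrated on a toy graph. This is a sketch only and does not use the ConceptNet toolkit: the graph, its edge weights, the decay factor, the hop limit, and the Python function names (modelled loosely on the operation names above) are all assumptions.

```python
from collections import deque

# Toy concept graph; nodes, edges, and weights are invented for illustration.
GRAPH = {
    "wedding": {"bride": 0.9, "veil": 0.6},
    "bride": {"wedding": 0.9, "veil": 0.8, "dress": 0.7},
    "veil": {"wedding": 0.6, "bride": 0.8},
    "dress": {"bride": 0.7, "clothing": 0.9},
    "clothing": {"dress": 0.9},
}


def get_context(graph, origin, decay=0.5, max_hops=2):
    """Spread activation outward from origin; return (node, activation) pairs, strongest first."""
    activation = {origin: 1.0}
    frontier = {origin: 1.0}
    for _ in range(max_hops):
        next_frontier = {}
        for node, energy in frontier.items():
            for neighbor, weight in graph.get(node, {}).items():
                passed = energy * weight * decay
                # Keep only the strongest activation that reaches each node.
                if passed > activation.get(neighbor, 0.0):
                    activation[neighbor] = passed
                    next_frontier[neighbor] = passed
        frontier = next_frontier
    activation.pop(origin)
    return sorted(activation.items(), key=lambda kv: kv[1], reverse=True)


def find_path_between_nodes(graph, start, goal):
    """Breadth-first search for a shortest path (by hop count); None if unreachable."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], {}):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None
```

Activation fades with each hop, so nearby and strongly linked concepts dominate the returned context, which is the intuition behind using it for topic sensing.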

ConceptNet (Liu and Singh, 2004a, 2004b)

Supports:
  Topic sensing
  Query expansion
  Semantic similarity of words
  Lexical generalization
  Thematic generalization

Much needs to be examined:
  Uncontrolled vocabulary, can be biased in terms of content; but seems to be quite reliable knowledge.

Topic-Sensing

Eurovoc: multilingual thesaurus

Controlled vocabulary, 20 languages, broad fields:
  politics, international relations, European Communities, law, economics, trade, finance, social questions, education, science, international organizations, employment and working conditions,
  industry, business and competition, production, technology and research, transport, environment, energy, agriculture, forestry and fisheries, agri-foodstuffs, geography