Text Summarization Jagadish M(07305050) Annervaz K M (07305063) Joshi Prasad(07305047) Ajesh Kumar S(07305065) Shalini Gupta(07305R02)
Date posted: 22-Dec-2015



Introduction

Summary: Brief but accurate representation of the contents of a document

Goal: Take an information source, extract the most important content from it and present it to the user in a condensed form and in a manner sensitive to the user’s needs.

Compression: The amount of text to present, measured as the ratio of the length of the summary to the length of the source.
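The compression definition above can be illustrated with a small sketch. Measuring length in whitespace-separated words is an assumption made here (characters or sentences would serve equally well), and the function name `compression_ratio` is hypothetical.

```python
def compression_ratio(summary: str, source: str) -> float:
    """Length of the summary divided by the length of the source, in words."""
    source_len = len(source.split())
    if source_len == 0:
        raise ValueError("source text is empty")
    return len(summary.split()) / source_len


# Toy texts: a 10-word source and a 3-word summary.
source = "Mars is cold because its thin atmosphere cannot hold heat"
summary = "Mars is cold"
ratio = compression_ratio(summary, source)
print(ratio)
```

A smaller ratio means a more aggressive summary.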


MSWord AutoSummarize


Presentation Outline

Motivation

Different Genres

Simple Statistical Techniques

Degree Centrality

LexRank

Lexical/Co-reference Chains

Rhetorical Structure Theory

WordNet Based Methods

DUC/TAC


Motivation

Abstracts for Scientific and other articles

News summarization (mostly Multiple document summarization)

Classification of articles and other written data

Web pages for search engines

Web access from PDAs, cell phones

Question answering and data gathering


Genres

Indicative vs. informative: used for quick categorization vs. content processing.

Extract vs. abstract: lists fragments of text vs. re-phrases content coherently.

Generic vs. query-oriented: provides author’s view vs. reflects user’s interest.

Background vs. just-the-news: assumes reader’s prior knowledge is poor vs. up-to-date.

Single-document vs. multi-document source: based on one text vs. fuses together many texts.


Statistical scoring

Scoring techniques:

Word frequencies throughout the text (Luhn58)

Position in the text (Edmundson69)

Title method (Edmundson69)

Cue phrases in sentences (Edmundson69)
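The position, title and cue-phrase techniques listed above can be combined into one sentence score. The sketch below is only an illustration: the linear combination, the equal default weights, the "earlier is better" position feature and the tiny cue-phrase list are all assumptions, and `edmundson_style_score` is a hypothetical name.

```python
import re


def words_of(text):
    """Lower-cased set of alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def edmundson_style_score(sentence, index, n_sentences, title, cue_phrases,
                          w_position=1.0, w_title=1.0, w_cue=1.0):
    # Position: earlier sentences score higher (a simplifying assumption).
    position = 1.0 - index / n_sentences
    # Title: number of distinct words shared with the title.
    title_overlap = len(words_of(sentence) & words_of(title))
    # Cue: number of listed cue phrases that occur in the sentence.
    lowered = sentence.lower()
    cue_hits = sum(1 for phrase in cue_phrases if phrase in lowered)
    return w_position * position + w_title * title_overlap + w_cue * cue_hits


title = "Text Summarization"
cue_phrases = ["in conclusion", "significantly"]  # illustrative list only
sentences = [
    "Text summarization condenses a document.",
    "In conclusion, summarization saves reading time.",
    "The weather was nice.",
]
scores = [
    edmundson_style_score(s, i, len(sentences), title, cue_phrases)
    for i, s in enumerate(sentences)
]
print(scores)
```

The weights would normally be tuned against human-written extracts rather than left at 1.0.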


Luhn58

Important words occur fairly frequently

Earliest work in field
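The idea that important words occur fairly frequently can be sketched as a simple frequency score. This is a simplified illustration, not a reproduction of Luhn's original scoring: the stop-word list, the `min_count` cut-off and the function names are assumptions.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "are"}  # tiny illustrative list


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def frequency_scores(sentences, min_count=2):
    """Score each sentence by the number of frequent (significant) words in it."""
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOP_WORDS
    )
    # A word is treated as significant if it occurs at least min_count times.
    significant = {w for w, c in counts.items() if c >= min_count}
    return [sum(1 for w in tokenize(s) if w in significant) for s in sentences]


sentences = [
    "Summarization systems extract important sentences.",
    "Important sentences contain frequent words.",
    "The cat sat quietly.",
]
luhn_scores = frequency_scores(sentences)
print(luhn_scores)  # [2, 2, 0]
```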


Statistical Approaches (contd.)

Degree Centrality

LexRank

Continuous LexRank


Degree Centrality

Problem formulation:

Represent each sentence by a vector

Denote each sentence as the node of a graph

Cosine similarity determines the edges between nodes
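The formulation above can be sketched with bag-of-words vectors. Using raw term counts is an assumption made for brevity (an idf-weighted variant would follow the same pattern), and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    """Bag-of-words vector: word -> count."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(sentences):
    """Entry [i][j] is the cosine similarity of sentences i and j."""
    vectors = [sentence_vector(s) for s in sentences]
    return [[cosine(u, v) for v in vectors] for u in vectors]


sim = similarity_matrix(["the cat sat", "the cat ran", "dogs bark"])
print(sim[0][1])  # two shared words out of three in each sentence
```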


Degree Centrality

Since we are interested in significant similarities, we can eliminate some low values in this matrix by defining a threshold.


Degree Centrality

Compute the degree of each sentence

Pick the nodes (sentences) with high degrees
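Thresholding the similarity matrix and picking high-degree sentences can be sketched as follows. The threshold value of 0.1 and the choice to ignore self-similarity are assumptions, and the function names are hypothetical.

```python
def degree_centrality(similarity, threshold=0.1):
    """Degree of each sentence in the graph that keeps only edges above threshold."""
    n = len(similarity)
    return [
        sum(1 for j in range(n) if j != i and similarity[i][j] > threshold)
        for i in range(n)
    ]


def top_sentences(degrees, k):
    """Indices of the k highest-degree sentences, returned in document order."""
    ranked = sorted(range(len(degrees)), key=lambda i: degrees[i], reverse=True)
    return sorted(ranked[:k])


# Toy symmetric similarity matrix for three sentences.
similarity = [
    [1.0, 0.5, 0.3],
    [0.5, 1.0, 0.05],
    [0.3, 0.05, 1.0],
]
degrees = degree_centrality(similarity)
print(degrees)                    # [2, 1, 1]
print(top_sentences(degrees, 1))  # [0]
```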


Degree Centrality

Disadvantage in Degree Centrality approach


LexRank

The centrality vector p gives the LexRank score of each sentence (similar to PageRank). It is defined by:
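As a reference point, the LexRank paper listed in the References (Erkan and Radev, 2004) defines the score of a sentence u from the scores of its neighbours. Here adj[u] is the set of sentences adjacent to u in the similarity graph, deg(v) is the degree of v, and A is the adjacency matrix of that graph; the symbol names follow that paper.

```latex
p(u) = \sum_{v \in \mathrm{adj}[u]} \frac{p(v)}{\deg(v)}
\qquad\text{or, in matrix form,}\qquad
\mathbf{p} = \mathbf{B}^{T}\mathbf{p},
\quad
B(i,j) = \frac{A(i,j)}{\sum_{k} A(i,k)}
```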


What Should B Satisfy?

Stochastic Matrix and Markov Chain property.

Irreducible

Aperiodic


Perron-Frobenius Theorem

An irreducible and aperiodic Markov chain is guaranteed to converge to a stationary distribution


Reducibility


Aperiodicity


LexRank

B is a stochastic matrix

Is it an irreducible and aperiodic matrix?

Damping (Page et al. 1998)


Matrix Form of p with Damping

Solve for p using the power method
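A minimal power-method sketch, assuming the damped form used in the cited LexRank paper, p = [dU + (1 - d)B]^T p, where U is the N x N matrix with every entry equal to 1/N. The damping value 0.15, the tolerance and the function names are illustrative assumptions.

```python
def row_normalize(adjacency):
    """Divide each row by its sum so that every row sums to 1 (matrix B)."""
    n = len(adjacency)
    matrix = []
    for row in adjacency:
        total = sum(row)
        # A row with no edges is replaced by a uniform row (assumption).
        matrix.append([x / total for x in row] if total else [1.0 / n] * n)
    return matrix


def lexrank_power_method(adjacency, damping=0.15, tol=1e-8, max_iter=1000):
    n = len(adjacency)
    b = row_normalize(adjacency)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        # p(u) = d/N + (1 - d) * sum_v B[v][u] * p(v)
        new_p = [
            damping / n + (1.0 - damping) * sum(b[v][u] * p[v] for v in range(n))
            for u in range(n)
        ]
        delta = sum(abs(a - c) for a, c in zip(new_p, p))
        p = new_p
        if delta < tol:
            break
    return p


# Toy graph: sentence 0 is similar to both others; 1 and 2 are not similar to each other.
adjacency = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
]
p = lexrank_power_method(adjacency)
print(p)
```

Sentence 0 receives the highest score, and the scores sum to 1 because every iterate stays a probability distribution.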


Continuous LexRank


Linguistic/Semantic Methods

Co-reference/Lexical Chains

Rhetorical Analysis


Co-reference/Lexical Chains

Assumption/Observation: The important parts of a text are more closely related to each other semantically

Co-reference/Lexical Chains (Object-Action, Part-of relation, Semantically related)

Important sentences are traversed by a larger number of such chains


Co-reference/Lexical Chains

Mr. Kenny is the person that invented the anesthetic machine which uses micro-computers to control the rate at which an anesthetic is pumped into the blood. Such machines are nothing new. But his device uses two micro-computers to achieve much closer monitoring of the pump feeding the anesthetic into the patient
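To illustrate how chains can rank the three sentences above, the sketch below counts how many chains pass through each sentence. The chains are assigned by hand for illustration only; a real system would derive them from a thesaurus or co-reference resolver. The names `chains` and `chain_scores` are hypothetical.

```python
def chain_scores(chains, n_sentences):
    """Number of distinct chains that pass through each sentence."""
    scores = [0] * n_sentences
    for members in chains.values():
        # Count each chain at most once per sentence.
        for sentence_index in {index for index, _ in members}:
            scores[sentence_index] += 1
    return scores


# Hand-built chains: (sentence index, word) pairs. Illustrative only.
chains = {
    "person": [(0, "Mr. Kenny"), (0, "person"), (2, "his")],
    "machine": [(0, "machine"), (1, "machines"), (2, "device"), (2, "pump")],
    "micro-computer": [(0, "micro-computers"), (2, "micro-computers")],
    "anesthetic": [(0, "anesthetic"), (2, "anesthetic")],
}
lexical_scores = chain_scores(chains, n_sentences=3)
print(lexical_scores)  # [4, 1, 4]
```

Under these hand-built chains the first and third sentences are traversed by four chains each, while "Such machines are nothing new." is traversed by one.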


Rhetorical Structure Theory

Mann & Thompson 88

Rhetorical relation: holds between two non-overlapping text snippets

Nucleus: core idea, writer's purpose

Satellite: referred to in the context of the nucleus, for justifying, evidencing, contradicting, etc.


Rhetorical Structure Theory

The nucleus of a rhetorical relation is comprehensible independently of the satellite, but not vice versa

Not all rhetorical relations are nucleus-satellite relations; Contrast is a multinuclear relation

Example: evidence [The truth is that the pressure to smoke in 'junior high' is greater than it will be any other time of one’s life:][ we know that 3,000 teens start smoking each day.]


Rhetorical Structure Theory

Rhetorical parsing:

Breaks the text into elementary units

Uses cue phrases (discourse markers) and a notion of semantic similarity to hypothesize rhetorical relations

Rhetorical relations can be assembled into rhetorical structure trees (RS-trees) by recursively applying individual relations across the whole text


[Figure: RS-tree over the ten text units below. Node labels in the tree: 2 Elaboration; 2 Elaboration; 8 Example; 2 Background, Justification; 3 Elaboration; 8 Concession; 10 Antithesis; 4, 5 Contrast; 5 Evidence; Cause]

(1) With its distant orbit (50 percent farther from the sun than Earth) and slim atmospheric blanket,

(2) Mars experiences frigid weather conditions.

(3) Surface temperatures typically average about -60 degrees Celsius (-76 degrees Fahrenheit) at the equator and can dip to -123 degrees C near the poles.

(4) Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,

(5) but any liquid water formed in this way would evaporate almost instantly

(6) because of the low atmospheric pressure.

(7) Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,

(8) Most Martian weather involves blowing dust and carbon monoxide.

(9) Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.

(10) Yet even on the summer pole, where the sun remains in the sky all day long, temperatures never warm enough to melt frozen water.


RST Based Summarization

Multiple RS-trees

A built RS-tree captures relations in the text and can be used for high-quality summarization

Picking up the ‘K’ nodes nearest to the root

Disadvantages
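One possible reading of "the K nodes nearest to the root" is sketched below: each text unit is promoted upward through nucleus links, and units are ranked by the shallowest depth at which they appear in a promotion set. This reading is an assumption, and the `Node` class and the four-unit toy tree are hypothetical, not a tree from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    relation: Optional[str] = None  # relation name for internal nodes
    unit: Optional[int] = None      # text-unit number for leaves
    nuclei: List["Node"] = field(default_factory=list)
    satellites: List["Node"] = field(default_factory=list)


def promoted(node):
    """Units promoted to this node: the leaf itself, or the units of its nuclei."""
    if node.unit is not None:
        return {node.unit}
    units = set()
    for child in node.nuclei:
        units |= promoted(child)
    return units


def unit_depths(node, depth=0, depths=None):
    """Shallowest depth at which each unit appears in a promotion set."""
    if depths is None:
        depths = {}
    for unit in promoted(node):
        depths[unit] = min(depths.get(unit, depth), depth)
    for child in node.nuclei + node.satellites:
        unit_depths(child, depth + 1, depths)
    return depths


def summarize(root, k):
    """The k units closest to the root, returned in text order."""
    depths = unit_depths(root)
    ranked = sorted(depths, key=lambda unit: (depths[unit], unit))
    return sorted(ranked[:k])


# Hypothetical toy tree over four text units.
root = Node(
    relation="Elaboration",
    nuclei=[Node(relation="Justification",
                 nuclei=[Node(unit=2)], satellites=[Node(unit=1)])],
    satellites=[Node(relation="Contrast",
                     nuclei=[Node(unit=3), Node(unit=4)])],
)
print(summarize(root, 1))  # [2]
print(summarize(root, 3))  # [2, 3, 4]
```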


WordNet based Approach for Summarization

Preprocessing of text

Constructing sub-graph from WordNet

Synset Ranking

Sentence Selection

Principal Component Analysis


Preprocessing

Break text into sentences

Apply POS tagging

Identify collocations in the text

Remove the stop words

Sequence is important
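A small sketch of why the order of the preprocessing steps matters: a collocation such as "point of view" contains a stop word, so collocations have to be joined before stop words are removed. POS tagging is left out because it needs a trained tagger. The stop-word list, the collocation list and the function names are illustrative assumptions.

```python
import re

STOP_WORDS = {"the", "a", "of", "is", "in", "and", "to"}  # tiny illustrative list
COLLOCATIONS = {("point", "of", "view")}                   # hypothetical collocation list


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def join_collocations(tokens):
    """Replace each known collocation by a single underscore-joined token."""
    out, i = [], 0
    while i < len(tokens):
        for colloc in COLLOCATIONS:
            if tuple(tokens[i:i + len(colloc)]) == colloc:
                out.append("_".join(colloc))
                i += len(colloc)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


def preprocess(text):
    result = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        tokens = join_collocations(tokens)  # must happen before stop-word removal
        result.append([t for t in tokens if t not in STOP_WORDS])
    return result


processed = preprocess("The point of view is clear. Stop words are removed.")
print(processed)
```

Had stop words been removed first, "of" would have disappeared and the collocation could no longer be matched.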


Constructing sub-graph from WordNet

Mark all the words and collocations in the WordNet graph which are present in the text

Traverse the generalization edges up to a fixed depth, and mark the synsets you visit

Construct a graph, containing only the marked synsets
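The three steps above can be sketched on a toy hypernym table. The `HYPERNYMS` dictionary is a hand-written stand-in for WordNet's generalization edges, not real WordNet data, and `build_subgraph` is a hypothetical name.

```python
# Toy generalization (hypernym) edges: child -> list of parents. Not real WordNet data.
HYPERNYMS = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
    "carnivore": ["mammal"],
    "mammal": ["animal"],
}


def build_subgraph(words, hypernyms, max_depth):
    """Mark text words found in the graph, climb max_depth levels, keep marked nodes."""
    nodes = set(hypernyms) | {p for parents in hypernyms.values() for p in parents}
    marked = {w for w in words if w in nodes}
    frontier = set(marked)
    for _ in range(max_depth):
        # Move one generalization level up from the current frontier.
        frontier = {p for node in frontier for p in hypernyms.get(node, [])} - marked
        marked |= frontier
    # Keep only edges whose two endpoints are both marked.
    edges = {
        (child, parent)
        for child in marked
        for parent in hypernyms.get(child, [])
        if parent in marked
    }
    return marked, edges


marked, edges = build_subgraph(["dog", "cat", "sat"], HYPERNYMS, max_depth=2)
print(sorted(marked))
print(sorted(edges))
```

With a depth of 2, "mammal" and "animal" stay outside the sub-graph, and "sat" is ignored because it is not in the toy table.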


Synset Ranking

Rank synsets based on their relevance to text

Construct a rank vector R with one entry per node of the graph, each initialized to $1/\sqrt{n}$, where n is the number of nodes in the graph

Create an authority matrix A, with A(i,j) = 1/(num_of_predecessors(j)) if j is a child of i.


Synset Ranking

Update the R vector iteratively as,

Higher value implies better rank and higher relevance
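The rank vector and authority matrix can be built directly from their definitions; the update step is not. The sketch below makes three assumptions that should be checked against the cited WordNet summarization paper: the predecessors of a node are taken to be its parents, entries of A not covered by the definition are 0, and one update multiplies R by A and rescales the result to unit length. All function names are hypothetical.

```python
import math


def authority_matrix(nodes, parents):
    """A[i][j] = 1 / num_of_predecessors(j) if j is a child of i, else 0."""
    index = {node: k for k, node in enumerate(nodes)}
    n = len(nodes)
    a = [[0.0] * n for _ in range(n)]
    for child, child_parents in parents.items():
        for parent in child_parents:
            # ASSUMPTION: the predecessors of a node are its parents.
            a[index[parent]][index[child]] = 1.0 / len(child_parents)
    return a


def initial_rank(n):
    """Every entry starts at 1/sqrt(n), so the vector has unit length."""
    return [1.0 / math.sqrt(n)] * n


def update_step(a, r):
    # ASSUMPTION: multiply by A, then rescale to unit Euclidean length.
    product = [sum(a_ij * r_j for a_ij, r_j in zip(row, r)) for row in a]
    norm = math.sqrt(sum(x * x for x in product))
    return [x / norm for x in product] if norm else product


nodes = ["dog", "cat", "canine", "feline", "carnivore"]
parents = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
}
a = authority_matrix(nodes, parents)
r0 = initial_rank(len(nodes))
r1 = update_step(a, r0)
print(r1)
```

After one step under these assumptions, "carnivore" outranks "canine" and "feline" because it collects rank from both of them.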


Sentence Selection

Construct a matrix, M with m rows and n columns

m is number of sentences and n is number of nodes

For each sentence Si

Traverse graph G, starting with words present in Si and following generalization edges

Find set of reachable synsets, SYi

For each syij ∈ SYi

set M[Si][syij] to rank of syij calculated in previous step
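The construction of M above can be sketched as follows. The toy graph and the rank values are made up for illustration, and the function names are hypothetical.

```python
def reachable_synsets(words, hypernyms, known):
    """Synsets reachable from the sentence's words by following generalization edges."""
    seen = set()
    stack = [w for w in words if w in known]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(p for p in hypernyms.get(node, []) if p in known)
    return seen


def sentence_synset_matrix(sentences, nodes, hypernyms, rank):
    """M has one row per sentence and one column per graph node."""
    column = {node: j for j, node in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in sentences]
    for i, words in enumerate(sentences):
        for synset in reachable_synsets(words, hypernyms, column):
            matrix[i][column[synset]] = rank[synset]
    return matrix


nodes = ["dog", "cat", "canine", "feline", "carnivore"]
hypernyms = {
    "dog": ["canine"],
    "cat": ["feline"],
    "canine": ["carnivore"],
    "feline": ["carnivore"],
}
# Made-up rank values standing in for the output of the ranking step.
rank = {"dog": 0.1, "cat": 0.1, "canine": 0.3, "feline": 0.3, "carnivore": 0.6}
sentences = [["dog", "barks"], ["cat", "dog"]]
m = sentence_synset_matrix(sentences, nodes, hypernyms, rank)
print(m)
```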


Principal Component Analysis

Apply PCA to matrix M to obtain a set of principal components (eigenvectors)

The eigenvalue of each eigenvector is a measure of the relevance of that eigenvector to the meaning

Sort the eigenvectors according to their eigenvalues

For each eigenvector, find its projection on each sentence


Principal Component Analysis

Select the top nnumselect sentences for each eigenvector

nnumselect is proportional to the eigenvalue of the eigenvector

nnumselect = $\lambda_i / \sum_j \lambda_j$, where $\lambda_i$ is the eigenvalue corresponding to eigenvector i
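The proportional allocation can be sketched as below. Turning each eigenvalue share into a sentence count by multiplying with a total summary size is an assumption, as are the parameter `summary_size` and the function names.

```python
def eigenvalue_shares(eigenvalues):
    """Share of each eigenvector: its eigenvalue divided by the sum of all eigenvalues."""
    total = sum(eigenvalues)
    if total == 0:
        raise ValueError("eigenvalues sum to zero")
    return [value / total for value in eigenvalues]


def sentences_per_eigenvector(eigenvalues, summary_size):
    """ASSUMPTION: counts are the shares scaled by the summary size and rounded."""
    total = sum(eigenvalues)
    return [round(summary_size * value / total) for value in eigenvalues]


eigenvalues = [6.0, 3.0, 1.0]  # toy values, sorted in decreasing order
shares = eigenvalue_shares(eigenvalues)
counts = sentences_per_eigenvector(eigenvalues, summary_size=10)
print(shares)
print(counts)  # [6, 3, 1]
```

Because of rounding, the counts need not always add up to the requested summary size; a real implementation would have to handle the remainder.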


Document Understanding Conference (DUC)

Text Analysis Conference (TAC)

Interest and activity aimed at building powerful multi-purpose information systems

Evaluation results of various summarization techniques

www-nlpir.nist.gov/projects/duc/data.html


Human Summary of Our Presentation :)

What is Text Summarization?

Why Text Summarization?

Methods of Summarization:

LexRank

Lexical Chains

Rhetorical Structure Theory

WordNet Based


Challenges ahead..

Ensuring text coherency

Sentences may have dangling anaphors

Summarizing non-textual data

Handling multiple sources effectively

High reduction rates are needed

Achieving human-quality summarization!


References

Erkan, G. and D. R. Radev. 2004. LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research 22, 457–479.

Barzilay, R. and M. Elhadad. 1997. Using Lexical Chains for Text Summarization. In Proceedings of the Workshop on Intelligent Scalable Text Summarization at the ACL/EACL Conference, 10–17. Madrid, Spain.

Mann, W.C. and S.A. Thompson. 1988. Rhetorical Structure Theory: Toward a Functional Theory of Text Organization. Text 8(3), 243–281. Also available as USC/Information Sciences Institute Research Report RR-87-190.


Baldwin, B. and T. Morton. 1998. Coreference-Based Summarization. In T. Firmin Hand and B. Sundheim (eds). TIPSTER-SUMMAC Summarization Evaluation. Proceedings of the TIPSTER Text Phase III Workshop. Washington.

Marcu, D. 1998. Improving Summarization Through Rhetorical Parsing Tuning. Proceedings of the Workshop on Very Large Corpora. Montreal, Canada.

Ramakrishnan and Bhattacharya. 2003. Text Representation with WordNet Synsets. In Proceedings of the Eighth International Conference on Applications of Natural Language to Information Systems (NLDB 2003).


Bellare, Anish S., Atish S., Loiwal, Bhattacharya, Mehta, Ramakrishnan. 2004. Generic Text Summarization using WordNet.

Mani, I. and M. T. Maybury (eds). 1999. Advances in Automatic Text Summarization. MIT Press. ISBN 0-262-13359-8.

www.wikipedia.com


Thank You