Summary Generation Keith Trnka

Page 1: Summary Generation Keith Trnka

Summary Generation
Keith Trnka

Page 2: Summary Generation Keith Trnka

The approach

● Apply Marcu's basic summarizer (1999) to perform content selection

● Re-generate the selected content so that it's more natural

Page 3: Summary Generation Keith Trnka

RST Refresher

● A text is composed of elementary discourse units (EDUs)
  – What constitutes an EDU varies from author to author
  – Common consensus that they are no larger than sentences
● Text spans
  – An EDU is a text span
  – A sequence of adjacent text spans in some rhetorical relation is a text span

Page 4: Summary Generation Keith Trnka

RST Refresher (cont'd)

● A rhetorical relation is the relationship between text spans (see the representation sketch below)
  – Some relations have the notion of nuclearity: one sub-span (nucleus) is the one to which all other sub-spans (satellites) relate
    ● These relations are called mononuclear
    ● Example: [When I got home,] circumstance-for [I was tired]
  – Other relations are called multinuclear
    ● There is no most-important sub-span
    ● Example: [Cats scratch] contrast-with [, but dogs bite.]
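
The nuclearity distinction is easy to make concrete. Below is a minimal Python sketch of a text span, assuming a simple recursive representation (the Span class, its field names, and the relation labels are illustrative assumptions, not the Treebank's actual format): leaves are EDUs, internal nodes carry a relation and mark which children are nuclei.

    # A text span: either an EDU (leaf) or a relation over sub-spans.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Span:
        relation: Optional[str] = None   # e.g. "circumstance", "contrast"; None for an EDU
        text: Optional[str] = None       # EDU text, for leaves only
        children: List["Span"] = field(default_factory=list)
        nucleus_indices: List[int] = field(default_factory=list)  # which children are nuclei

        def is_edu(self) -> bool:
            return not self.children

    # Mononuclear: exactly one nucleus; the satellite relates to it.
    mono = Span(relation="circumstance",
                children=[Span(text="When I got home,"), Span(text="I was tired")],
                nucleus_indices=[1])

    # Multinuclear: every sub-span is a nucleus; none is most important.
    multi = Span(relation="contrast",
                 children=[Span(text="Cats scratch"), Span(text=", but dogs bite.")],
                 nucleus_indices=[0, 1])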

Page 5: Summary Generation Keith Trnka

RST Discourse Treebank

● RST analyses of 385 WSJ articles from Penn Treebank

● Available from the LDC (http://www.ldc.upenn.edu)
● An overview can be found in (Carlson et al. 2001)
● The annotation manual is (Carlson and Marcu 2001)
● Thanks to the department for buying it

Page 6: Summary Generation Keith Trnka

RST Discourse Treebank (cont'd)

● Notes about the annotation
  – EDUs are clause-like
  – Mononuclear relations were forced to be binary
  – Relative clauses and appositives can be embedded relations

Page 7: Summary Generation Keith Trnka

RST Discourse Treebank (cont'd)

● Statistical analysis of the 335 training documents
  – 98% of spans are binary (two children)
  – For binary mononuclear relations:
    ● Nucleus-satellite order can be predicted with 87% accuracy, given the relation, using predict-majority (see the sketch after the table)

Relation                        Frequency   N-S Order   S-N Order
Elaboration-additional            20.44%      99.79%       0.17%
Attribution                       17.19%      32.34%      67.42%
elaboration-object-attribute-e    16.13%      99.96%       0.04%
Elaboration-additional-e           5.22%      99.06%       0.94%
Circumstance                       3.95%      55.26%      44.56%
Explanation-argumentative          3.61%      96.88%       2.34%
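
As a concrete reading of the predict-majority figure, the Python sketch below always guesses each relation's more frequent order and computes the resulting accuracy; the numbers are copied from the table, which lists only the most frequent relations, so the result only approximates the 87% reported over the full relation set.

    # Predict-majority over the relations in the table: for each relation,
    # always predict its more frequent nucleus-satellite order.
    rows = [
        # (relation, frequency, P(N-S order), P(S-N order))
        ("Elaboration-additional",         0.2044, 0.9979, 0.0017),
        ("Attribution",                    0.1719, 0.3234, 0.6742),
        ("elaboration-object-attribute-e", 0.1613, 0.9996, 0.0004),
        ("Elaboration-additional-e",       0.0522, 0.9906, 0.0094),
        ("Circumstance",                   0.0395, 0.5526, 0.4456),
        ("Explanation-argumentative",      0.0361, 0.9688, 0.0234),
    ]

    covered = sum(freq for _, freq, _, _ in rows)
    correct = sum(freq * max(ns, sn) for _, freq, ns, sn in rows)
    print(f"Predict-majority accuracy on these relations: {correct / covered:.1%}")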

Page 8: Summary Generation Keith Trnka

Marcu's Content Selection Algorithm

● Described in (Marcu 1999)
● Promotion sets
  – The promotion set of each span is the union of all promotion sets of its nuclear sub-spans
  – The promotion set of an EDU is the EDU itself (see the sketch below)
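
A rough Python rendering of the promotion-set definition, reusing the Span sketch from the RST Refresher slide (an assumed representation, not Marcu's code):

    def promotion_set(span):
        """Promotion set of a span: the EDU itself for a leaf, otherwise
        the union of the promotion sets of its nuclear sub-spans."""
        if span.is_edu():
            return {id(span)}                     # identify EDUs by object identity
        units = set()
        for i in span.nucleus_indices:
            units |= promotion_set(span.children[i])
        return units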

Page 9: Summary Generation Keith Trnka

Marcu's Content Selection Algorithm (cont'd)

● Build a partial ordering of EDUs*
  – For each EDU, find the topmost span whose promotion set contains it. Let d be the tree depth of this span.
  – The rank of each EDU is (sketched below):
    ● d + 1, if the EDU is in an embedded relation
    ● d, otherwise
  – Example of the partial ordering

* Re-worded from Marcu's description
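
Continuing the Python sketch, one way to compute the ranks (and hence the partial ordering) is a top-down walk that records the depth of the first, i.e. topmost, span whose promotion set contains each EDU; the `embedded` flag used here is an assumption for marking embedded relations.

    def edu_ranks(root):
        """Map each EDU (by id) to its rank; lower rank = more salient."""
        ranks = {}

        def walk(span, depth):
            for edu_id in promotion_set(span):
                # A top-down walk reaches the topmost promoting span first,
                # so only the first depth seen for an EDU is kept.
                ranks.setdefault(edu_id, depth)
            for child in span.children:
                walk(child, depth + 1)

        walk(root, 0)
        for edu in collect_edus(root):
            if getattr(edu, "embedded", False):   # assumed flag for embedded relations
                ranks[id(edu)] += 1               # embedded EDUs get rank d + 1
        return ranks

    def collect_edus(span):
        return [span] if span.is_edu() else [e for c in span.children for e in collect_edus(c)]

EDUs with equal rank form the groups of the partial ordering.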

Page 10: Summary Generation Keith Trnka

Marcu's Content Selection Algorithm (cont'd)

● Given a summary length requirement
  – Select the topmost EDU groups until it isn't possible to select more and still honor the length requirement (see the sketch below)
  – Effect: the summary can't always be as close to the desired length as possible
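
A sketch of the selection step under a length budget, using the ranks from the previous sketch; lengths are counted in EDUs here purely for illustration, since the slide doesn't specify the length unit.

    def select_edus(ranks, length_limit):
        """Take whole rank groups, most salient first, while staying within the budget."""
        groups = {}
        for edu_id, rank in ranks.items():
            groups.setdefault(rank, []).append(edu_id)

        selected = []
        for rank in sorted(groups):
            group = groups[rank]
            if len(selected) + len(group) > length_limit:
                break          # taking the whole group would exceed the limit
            selected.extend(group)
        return selected        # may be well under length_limit, as noted above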

Page 11: Summary Generation Keith Trnka

Generation desiderata

● Removal of problems
  – Dangling references
  – Dangling discourse markers
● Introduction of coherence
  – Generate smaller referring expressions
  – Generate discourse markers when appropriate

Page 12: Summary Generation Keith Trnka

Example

Claude Bebear, chairman and chief executive officer, of Axa-Midi Assurances, pledged to retain employees and management of Farmers Group Inc.. Mr. Bebear made his remarks at a breakfast meeting with reporters here yesterday as part of a tour. Farmers was quick yesterday to point out the many negative aspects. For one, Axa plans to do away with certain tax credits.

Page 13: Summary Generation Keith Trnka

The theoretical approach

● Content selection
  – Marcu's summarization algorithm
● Paragraph generation
  – Organize sentences into paragraphs
● Sentence generation
  – Construct complete sentences from EDUs

Page 14: Summary Generation Keith Trnka

The theoretical approach (cont'd)

● Discourse marker generation
  – Remove discourse markers that refer to removed text spans
  – Generate discourse markers when none exists and one is appropriate
● Referring expression generation
  – Generate the best unambiguous referring expressions
    ● Shorter is better
    ● Faster to interpret is better

Page 15: Summary Generation Keith Trnka

The implemented approach

● Content selection
  – Marcu's algorithm as stated
● Paragraph generation
  – Not implemented

Page 16: Summary Generation Keith Trnka

Implementation: Sentence “generation”

● If a selected group of EDUs is an entire text span
  – Select them all as-is, uppercase the front, and make sure it ends with punctuation
● If a selected group of EDUs is an entire text span, except for some embedded relations
  – Remove punctuation associated with the embeddings and add sentence terminators from the embeddings
● If a selected group of EDUs is a sentence
  – Select it as-is
● If a selected EDU isn't part of such a group
  – Uppercase the front and end with punctuation (see the sketch below)
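
For the lone-EDU case, the cleanup amounts to something like the following Python sketch; the exact punctuation handling is an assumption, not the implementation described on the slide.

    def cleanup_edu(text):
        """Uppercase the front of an EDU and make sure it ends with punctuation."""
        text = text.strip()
        if not text:
            return text
        text = text[0].upper() + text[1:]
        if text[-1] not in ".!?":
            text = text.rstrip(",;") + "."   # drop a leftover clause-final comma/semicolon
        return text

    print(cleanup_edu("when I got home,"))   # -> "When I got home."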

Page 17: Summary Generation Keith Trnka

Implementation: Discourse marker generation

● Train to see which discourse markers go with which relations
● In generation, select discourse markers with a probability > 80%

Page 18: Summary Generation Keith Trnka

Training on discourse markers

● Discourse markers identified by string matching at beginning and ending of each EDU

● List of markers taken from (Knott and Dale 1994)

Page 19: Summary Generation Keith Trnka

Training on discourse markers (cont'd)

● Three statistics trained on binary, atomic spans with zero or one markers (estimation sketched below)
  – Inclusion: P(include a marker | relation)
  – Usage: P(marker = m | include, relation)
  – Position: P(position = start or end of sub-span 1 or 2 | marker, include, N-S order)
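
The inclusion and usage statistics are relative-frequency estimates; a Python sketch of how they might be trained follows. The observation format, one (relation, marker, position) triple per binary atomic span, is an assumption.

    from collections import Counter

    def train_marker_stats(observations):
        """observations: iterable of (relation, marker_or_None, position_or_None),
        e.g. ("circumstance", "when", "start of sub-span 1")."""
        rel_total, rel_with_marker = Counter(), Counter()
        marker_given_rel = Counter()

        for relation, marker, _position in observations:
            rel_total[relation] += 1
            if marker is not None:
                rel_with_marker[relation] += 1
                marker_given_rel[(relation, marker)] += 1

        def p_include(relation):                    # Inclusion
            return rel_with_marker[relation] / rel_total[relation]

        def p_marker(marker, relation):             # Usage
            return marker_given_rel[(relation, marker)] / rel_with_marker[relation]

        return p_include, p_marker

    # Position is estimated analogously, conditioned on (marker, include, N-S order).
    # Generation (previous slide): emit a marker only if its probability exceeds 0.8.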

Page 20: Summary Generation Keith Trnka

Rough evaluation

● Sentence “generation” isn't much different from not changing it at all
  – Except embedded relation removal
● Out of 347 summaries, a discourse marker was only generated once:
  – Ms. Johnson is awed by the earthquake's destructive force. "It really brings you down to a human level," Though "It's hard to accept all the suffering but you have to.

Page 21: Summary Generation Keith Trnka

Desired approach: Content selection

● Marcu's algorithm can only select groups of EDUs
  – Sometimes produces overly short summaries or nothing at all
  – If a preferential ordering could be defined within equivalence classes, summaries could meet the desired length better
● EDUs tied to more salient EDUs have their score boosted

Page 22: Summary Generation Keith Trnka

Desired approach: Paragraph generation

● Paragraphs in the source document are marked
  – Leave paragraph boundaries intact if they form large enough paragraphs
  – A shallow method, but it has potential
● Correlate paragraph boundaries with something
  – RS-tree structure
  – Co-reference chain beginnings/endings
  – Topical text segments, by an extension of Hearst's text segmentation algorithm (Hearst 1994)

Page 23: Summary Generation Keith Trnka

Desired approach: Sentence generation

● Apply shallow parsing to understand the rough syntactic structure of an EDU
● Relative clauses can be attached and full sentences generated, as in (Siddharthan 2004)

Page 24: Summary Generation Keith Trnka

Desired approach: Discourse marker generation

● The probabilities computed in DM training aren't the best
  – Need to attach discourse markers and recompute, repeat until stable
  – The attachment algorithm involves a constraint-satisfaction problem
● DM attachment needed to perform DM removal
● A DM generator should understand syntax better
  – When should commas be included, and where?

Page 25: Summary Generation Keith Trnka

Desired approach: Referring expression generation

● Requires good co-reference resolution

– A reference resolver requires (at least) a base noun phrase chunker

– EDUs might be used in conjunction with a shallow parse to approximate Hobbs' naïve approach

● Mitkov (2002) describes Hobbs' naïve approach

● Generation algorithm only adds the creation of a list of referring expressions, ordered by preference

Page 26: Summary Generation Keith Trnka

Conclusions

● Document length is poorly defined
  – Quite a bit of variation between EDU length, word length, and character length
● Attaching discourse markers to the relation they realize is tough
● Representing natural language in programs can be tough
● Summarization of quotations requires special treatment

Page 27: Summary Generation Keith Trnka

References

● Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski (2001). Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory. Proceedings of the 2nd SIGDIAL Workshop on Discourse and Dialogue, Eurospeech 2001, Denmark, September 2001.

● Lynn Carlson and Daniel Marcu (2001). Discourse Tagging Manual. ISI Tech Report ISI-TR-545, July 2001.

● Marti Hearst (1994). Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, June 1994.

● Alistair Knott and Robert Dale (1994). Using Linguistic Phenomena to Motivate a Set of Coherence Relations. Discourse Processes 18(1): 35-62.

● William Mann and Sandra Thompson (1988). Rhetorical Structure Theory: Toward a functional theory of text organization. Text 8(3): 243-281.

Page 28: Summary Generation Keith Trnka

References (cont'd)

● Daniel Marcu (1999). Discourse trees are good indicators of importance in text. In I. Mani and M. Maybury editors, Advances in Automatic Text Summarization, pages 123-136, The MIT Press.

– I think this is a cleanup of his earlier work from 1997.

● Ruslan Mitkov (2002). Anaphora Resolution. Pearson Education.

● Advaith Siddharthan (2004). Syntactic Simplification and Text Cohesion. To appear in the Journal of Language and Computation, Kluwer Academic Publishers, the Netherlands.