Timestamped Graphs: Evolutionary Models of Text for Multi-document Summarization
Ziheng Lin and Min-Yen Kan
Department of Computer Science
National University of Singapore, Singapore
TextGraphs 2 at HLT/NAACL 2007
Using Evolutionary Models of Text for Multi-document summarization
Summarization
• Traditionally, heuristics for extractive summarization
– Cue/stigma phrases
– Sentence position (relative to document, section, paragraph)
– Sentence length
– TF×IDF, TF scores
– Similarity (with title, context, query)
• With the advent of machine learning, heuristic weights for different features are tuned by supervised learning
• In the last few years, graphical representations of text have shed new light on the summarization problem
Prestige as sentence selection
• One motivation for using graphical methods was to model the problem as finding the prestige of nodes in a social network
• PageRank uses a random walk to smooth the effect of non-local context
• HITS and SALSA model hubs and authorities
• In summarization, this led to TextRank and LexRank
• Contrast with previous graphical approaches (Salton et al. 1994)
• Did we leave anything out of our representation for summarization?
Yes, the notion of an evolving network
Social networks change!
Natural evolving networks (Dorogovtsev and Mendes, 2001)
– Citation networks: New papers can cite old ones, but the old network is static
– The Web: new pages are added with an old page connecting it to the web graph, old pages may update links
Evolutionary models for summarization
Writers and readers often follow conventional rhetorical styles - articles are not written or read in an arbitrary way
Consider the evolution of texts using a very simplistic model
– Writers write from the first sentence onwards in a text
– Readers read from the first sentence onwards of a text
• A simple model: sentences get added incrementally to the graph
Timestamped Graph Construction
Approach
– These assumptions suggest iteratively adding sentences to the graph in chronological order.
– At each iteration, consider which edges to add to the graph.
– For a single document: simple and straightforward: add the 1st sentence, followed by the 2nd, and so forth, until the last sentence is added
– For multi-document: treat it as multiple instances of single documents, which evolve in parallel; i.e., add the 1st sentences of all documents, followed by all 2nd sentences, and so forth
• Doesn’t really model chronological ordering between articles; fix later
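The parallel insertion order described above can be sketched in a few lines of Python. This is only an illustration, assuming each document is a list of sentences; the function name `insertion_batches` and the `(doc_index, sentence_index)` pairs are hypothetical, not part of the original system.

```python
from itertools import zip_longest


def insertion_batches(documents):
    """Yield, per timestep, the (doc_index, sentence_index) pairs added to the graph.

    Timestep 0 adds the 1st sentence of every document, timestep 1 the 2nd, and so on.
    Documents that have run out of sentences simply contribute nothing.
    """
    missing = object()  # sentinel for "this document has no sentence at this row"
    for sent_idx, row in enumerate(zip_longest(*documents, fillvalue=missing)):
        yield [(doc_idx, sent_idx) for doc_idx, sent in enumerate(row) if sent is not missing]


docs = [["a1", "a2", "a3"], ["b1", "b2"], ["c1"]]
batches = list(insertion_batches(docs))
for step, batch in enumerate(batches, start=1):
    print(step, batch)
```

Indices here are 0-based. A single document is just the special case of a one-element `documents` list.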
Timestamped Graph Construction
Model:
• Documents as columns
– di = document i
• Sentences as rows
– sj = jth sentence of document
Timestamped Graph Construction
• A multi-document example
[Figure: grid with documents doc1, doc2, doc3 as columns and sentences sent1, sent2, sent3 as rows]
An example TSG: DUC 2007 D0703A-A
Timestamped Graph Construction
• Properties of edges
• Properties of nodes
• Input text transformation function
This is just one instance of a TSG. Let’s generalize and formalize them.
Def: A timestamped graph algorithm tsg(M) is a 9-tuple (d, e, u, f, σ, t, i, s, τ) that specifies a resulting algorithm that takes as input the set of texts M and outputs a graph G
Edge properties (d, e, u, f)
• Edge Direction (d)
– Forward, backward, or undirected
• Edge Number (e)
– number of edges to instantiate per timestep
• Edge Weight (u)
– weighted or unweighted edges
• Inter-document factor (f)
– penalty factor for links between documents in multi-document sets.
Node properties (σ, t, i, s)
• Vertex selection function σ(u, G)
– One strategy: among those nodes not yet connected to u in G, choose the one with the highest similarity to u
– Similarity functions: Jaccard, cosine, concept links (Ye et al., 2005)
• Text unit type (t)
– Most extractive algorithms use sentences as elementary units
• Node increment factor (i)
– How many nodes get added at each timestep
• Skew degree (s)
– Models how nodes in multi-document graphs are added
– Skew degree = how many iterations to wait before adding the 1st sentence of the next document
– Let’s illustrate this …
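As a rough illustration of the vertex selection strategy above, the following Python sketch picks, for a node u, the most cosine-similar node that u is not yet linked to. The bag-of-words vectors, the adjacency-set graph, and the names `cosine` and `select_vertex` are assumptions made for this example; the slides do not specify the data structures used.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_vertex(u, graph, vectors):
    """Return the node most similar to u among nodes u is not yet linked to, or None."""
    candidates = [v for v in graph if v != u and v not in graph[u]]
    if not candidates:
        return None
    return max(candidates, key=lambda v: cosine(vectors[u], vectors[v]))


sentences = {
    "s1": "the storm hit the coast",
    "s2": "the storm damaged the coast road",
    "s3": "stocks fell on monday",
}
vectors = {name: Counter(text.split()) for name, text in sentences.items()}
graph = {name: set() for name in sentences}  # adjacency: node -> nodes it links to

first = select_vertex("s1", graph, vectors)   # most similar unlinked node
graph["s1"].add(first)
second = select_vertex("s1", graph, vectors)  # next choice once the first edge exists
print(first, second)
```

Swapping `cosine` for a Jaccard or concept-link similarity only changes the `key` function; the selection logic stays the same.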
Skew Degree Examples
time(d1) < time(d2) < time(d3) < time(d4)
[Figure: three graphs over documents d1, d2, d3, d4: skewed by 1, skewed by 2, and freely skewed]
Freely skewed = Only add a new document when it would be linked by some node using vertex function σ
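One way to read a fixed skew degree is as a per-document delay on the timestep at which each sentence enters the graph. The sketch below assumes a node increment factor of 1 and documents already sorted chronologically; `timestamp` and `schedule` are hypothetical names, and the freely skewed case is not covered because it depends on σ.

```python
def timestamp(doc_rank, sent_index, skew=0):
    """Timestep at which a sentence enters the graph.

    doc_rank:   0-based position of the document in chronological order.
    sent_index: 1-based position of the sentence within its document.
    skew:       iterations to wait before starting the next document.
    """
    return sent_index + skew * doc_rank


def schedule(doc_lengths, skew=0):
    """Map each timestep to the (doc_rank, sent_index) pairs added at that step."""
    steps = {}
    for doc_rank, length in enumerate(doc_lengths):
        for sent_index in range(1, length + 1):
            step = timestamp(doc_rank, sent_index, skew)
            steps.setdefault(step, []).append((doc_rank, sent_index))
    return dict(sorted(steps.items()))


unskewed = schedule([2, 2, 2], skew=0)
skewed_by_1 = schedule([2, 2, 2], skew=1)
print(unskewed)
print(skewed_by_1)
```

With `skew=0` all first sentences share timestep 1, which is the parallel evolution used in the basic model; with `skew=1` each later document starts one step after the previous one.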
Input text transformation function (τ)
• Document Segmentation Function (τ)
– Problem observed in some multi-document clusters where some documents are very long
– It takes many timesteps to introduce all of their sentences, causing too many edges to be drawn
– τ(G) segments long documents into several sub-documents
• Solution is too much of a hack; hope to investigate more in current and future work
[Figure: document d5 segmented into sub-documents d5a and d5b]
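A segmentation function of this kind could look like the Python sketch below. The slides do not say how segment boundaries are chosen, so the fixed `max_len` threshold is purely an assumption, as are the function name and the letter suffixes (modelled on the d5a/d5b labels).

```python
def segment_documents(documents, max_len):
    """Split documents longer than max_len sentences into consecutive sub-documents.

    documents: dict mapping document ID -> list of sentences.
    Returns a new dict; split documents are renamed with letter suffixes (a, b, ...).
    """
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    result = {}
    for doc_id, sentences in documents.items():
        if len(sentences) <= max_len:
            result[doc_id] = list(sentences)
            continue
        chunks = [sentences[i:i + max_len] for i in range(0, len(sentences), max_len)]
        if len(chunks) > 26:
            raise ValueError("too many segments for single-letter suffixes")
        for n, chunk in enumerate(chunks):
            result[doc_id + chr(ord("a") + n)] = chunk
    return result


cluster = {"d4": ["x1", "x2", "x3"], "d5": ["y1", "y2", "y3", "y4", "y5", "y6", "y7"]}
segmented = segment_documents(cluster, max_len=4)
print(segmented)
```

Each sub-document then behaves like an ordinary column in the graph, so a long document no longer dominates the number of timesteps.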
Timestamped Graph Construction
• Representations
– We can model a number of different algorithms using this 9-tuple formalism: (d, e, u, f, σ, t, i, s, τ)
– The given toy example: (f, 1, 0, 1, max-cosine-based, sentence, 1, 0, null)
– LexRank graphs: (u, N, 1, 1, cosine-based, sentence, Lmax, 0, null)
N = total number of sentences in the cluster; Lmax = the max document length
i.e., all sentences are added into the graph in one timestep, each connected to all others, and cosine scores are given to edge weights
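The 9-tuple can be written down as a small record type, which makes the two instances above easier to compare. The following Python dataclass is only a sketch: the field names are invented for readability, and the symbolic values "N" and "Lmax" are kept as strings rather than resolved against a real cluster.

```python
from dataclasses import astuple, dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class TSGParams:
    direction: str                    # d: "f" (forward), "b" (backward) or "u" (undirected)
    edges_per_step: Union[int, str]   # e: edges instantiated per timestep
    weighted: bool                    # u: weighted (1) or unweighted (0) edges
    inter_doc_factor: float           # f: penalty factor for cross-document links
    vertex_selection: str             # σ: name of the vertex selection function
    text_unit: str                    # t: elementary text unit
    node_increment: Union[int, str]   # i: nodes added per timestep
    skew_degree: int                  # s: delay before starting the next document
    transform: Optional[str]          # τ: input text transformation, None for "null"


# The two instances listed on the slide.
TOY = TSGParams("f", 1, False, 1.0, "max-cosine-based", "sentence", 1, 0, None)
LEXRANK = TSGParams("u", "N", True, 1.0, "cosine-based", "sentence", "Lmax", 0, None)

print(astuple(TOY))
print(astuple(LEXRANK))
```

Freezing the dataclass keeps a parameter setting hashable, so it can be used as a dictionary key when recording results per configuration.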
TSG-based summarization
• Methodology
• Evaluation
• Analysis
System Overview
• Sentence splitting
– Detect and mark sentence boundaries
– Annotate each sentence with the doc ID and the sentence number
– E.g., XIE19980304.0061: 4 March 1998 from Xinhua News; XIE19980304.0061-14: the 14th sentence of this document
• Graph construction
– Construct TSG in this phase
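The sentence labelling scheme can be illustrated with a short Python sketch. The regular-expression splitter is a deliberately naive stand-in for real sentence boundary detection, and the function name `annotate_sentences` is hypothetical; only the `<doc ID>-<sentence number>` label format comes from the slide.

```python
import re


def annotate_sentences(doc_id, text):
    """Split text into sentences and label each as '<doc_id>-<n>', with n starting at 1."""
    # Naive boundary detection: break after '.', '!' or '?' when followed by whitespace.
    parts = [p.strip() for p in re.split(r"(?<=[.!?])\s+", text) if p.strip()]
    return [(f"{doc_id}-{n}", sentence) for n, sentence in enumerate(parts, start=1)]


labelled = annotate_sentences(
    "XIE19980304.0061",
    "Rain fell overnight. Roads were closed! Will it continue?",
)
for label, sentence in labelled:
    print(label, sentence)
```

A production splitter would need to handle abbreviations, quotations, and datelines, which this pattern does not attempt.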
System Overview
• Sentence Ranking
– Apply topic-sensitive random walk on the graph to redistribute the weights of the nodes
• Sentence extraction
– Extract the top-ranked sentences
– Two different modified MMR re-rankers are used, depending on whether it is the main or update task
Evaluation
• Dataset: DUC 2005, 2006 and 2007.
• Evaluation tool: ROUGE: n-gram based automatic evaluation
• Each dataset contains 50 or 45 clusters, each cluster contains a query and 25 documents
• Evaluate on some parameters
– Do different e values affect the summarization process?
– How do topic-sensitivity and edge weighting perform in running PageRank?
– How does skewing the graph affect the information flow in the graph?
Evaluation on number of edges (e)
Tried different e values
• Optimal performance: e = 2
• At e = 1, graph is too loosely connected, not suitable for PageRank → very low performance
• At e = N, a LexRank system
[Figure: performance plots over e, with e = 2 and N marked]
Evaluation (other edge parameters)
• PageRank: generic vs topic-sensitive
• Edge weight (u): unweighted vs weighted
• Optimal performance: topic-sensitive PageRank and weighted edges

Topic-sensitive   Weighted edges   ROUGE-1   ROUGE-2
No                No               0.39358   0.07690
Yes               No               0.39443   0.07838
No                Yes              0.39823   0.08072
Yes               Yes              0.39845   0.08282
Evaluation on skew degree (s)
• Different skew degrees: s = 0, 1 and 2
• Optimal performance: s = 1
• s = 2 introduces a delay interval that is too large
• Need to try freely skewed graphs

Skew degree   ROUGE-1   ROUGE-2
0             0.36982   0.07580
1             0.37268   0.07682
2             0.36998   0.07489
Holistic Evaluation in DUC
We participated in DUC 2007 with an extractive-based TSG system
• Main task: 12th for ROUGE-2, 10th for ROUGE-SU4 among 32 systems
• Update task: 3rd for ROUGE-2, 4th for ROUGE-SU4 among 24 systems
• Used a modified version of maximal marginal relevance to penalize links in previously read articles
– Extension of inter-document factor (f)
• TSG formalism better tailored to deal with update / incremental text tasks
• New method that may be competitive with current approaches
– Other top scoring systems may do sentence compression, not just extraction
Conclusion
• Proposed a timestamped graph model for text understanding and summarization
– Adds sentences one at a time
• Parameterized model with nine variables
– Canonicalizes representation for several graph based summarization algorithms
Future Work
• Freely skewed model
• Empirical and theoretical properties of TSGs (e.g., in-degree distribution)
Backup Slides
25 minute talk total
26 Apr 2007, 11:50-12:15
Differences for main and update task processing
Main task:
1. Construct a TSG for input cluster
2. Run topic-sensitive PageRank on the TSG
3. Apply first modified version of MMR to extract sentences
Update task:
• Cluster A:
– Construct a TSG for cluster A
– Run topic-sensitive PageRank on the TSG
– Apply the second modified version of MMR to extract sentences
• Cluster B:
– Construct a TSG for clusters A and B
– Run topic-sensitive PageRank on the TSG; only retain sentences from B
– Apply the second modified version of MMR to extract sentences
• Cluster C:
– Construct a TSG for clusters A, B and C
– Run topic-sensitive PageRank on the TSG; only retain sentences from C
– Apply the second modified version of MMR to extract sentences
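The cumulative structure of the update task can be sketched as a small generator: each cluster is processed against a graph built from everything read so far, but only its own sentences stay eligible for extraction. The function name and the tuple layout are hypothetical, and graph construction, ranking, and re-ranking are deliberately left out.

```python
def update_task_inputs(clusters):
    """Yield (cluster_name, graph_input, retained) for each cluster in reading order.

    clusters:    list of (name, documents) pairs, e.g. [("A", [...]), ("B", [...])].
    graph_input: all documents read so far, used to construct the TSG.
    retained:    documents of the current cluster; only their sentences are kept after ranking.
    """
    seen = []
    for name, documents in clusters:
        seen.extend(documents)
        yield name, list(seen), list(documents)


clusters = [("A", ["a1", "a2"]), ("B", ["b1"]), ("C", ["c1"])]
steps = list(update_task_inputs(clusters))
for name, graph_input, retained in steps:
    print(name, graph_input, retained)
```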
Sentence Ranking
• Once a timestamped graph is built, we want to compute a prestige score for each node
• PageRank: use an iterative method that allows the weights of the nodes to redistribute until stability is reached
• Similarities as edges → weighted edges; query → topic-sensitive
[Equation: PageRank update, annotated with a topic-sensitive (Q) portion and a standard random walk term]
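A minimal Python sketch of a weighted, topic-sensitive PageRank is given below. It follows the common formulation in which a query-biased jump distribution is mixed with a weight-proportional random walk; the damping value, the handling of nodes without outgoing edges, and all names are assumptions for illustration and may differ from the exact update used in the talk.

```python
def topic_sensitive_pagerank(weights, relevance, damping=0.85, tol=1e-8, max_iter=200):
    """Iterate a query-biased PageRank until the scores stabilise.

    weights:   dict u -> {v: weight of edge u -> v}.
    relevance: dict node -> non-negative relevance to the query (defines the node set).
    """
    nodes = list(relevance)
    total_rel = sum(relevance.values())
    # Topic-sensitive portion: jump distribution proportional to query relevance
    # (uniform if no node is relevant, which gives generic PageRank).
    jump = {v: (relevance[v] / total_rel if total_rel else 1.0 / len(nodes)) for v in nodes}
    out_sum = {u: sum(weights.get(u, {}).values()) for u in nodes}
    score = dict(jump)
    for _ in range(max_iter):
        # Score held by nodes without outgoing edges is redistributed via the jump distribution.
        dangling = sum(score[u] for u in nodes if out_sum[u] == 0)
        new = {}
        for v in nodes:
            # Standard random walk term: incoming score split in proportion to edge weight.
            walk = sum(score[u] * weights[u][v] / out_sum[u]
                       for u in nodes if out_sum[u] and v in weights.get(u, {}))
            new[v] = (1 - damping) * jump[v] + damping * (walk + dangling * jump[v])
        delta = sum(abs(new[v] - score[v]) for v in nodes)
        score = new
        if delta < tol:
            break
    return score


edges = {"a": {"b": 1.0}, "b": {"a": 0.5, "c": 0.5}, "c": {"a": 1.0}}
generic = topic_sensitive_pagerank(edges, {"a": 1.0, "b": 1.0, "c": 1.0})
biased = topic_sensitive_pagerank(edges, {"a": 0.0, "b": 0.0, "c": 1.0})
print(generic)
print(biased)
```

With uniform relevance the function reduces to generic weighted PageRank; concentrating relevance on one node raises that node's score relative to the generic run.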
Sentence Extraction – Main task
• Original MMR: integrates a penalty of the maximal similarity of the candidate document and one selected document
• Ye et al. (2005) introduced a modified MMR: integrates a penalty of the total similarity of the candidate sentence and all selected sentences
• Score(s) = PageRank score of s; S = selected sentences
• This is used in the main task
[Equation: modified MMR score, with a penalty term over similarity to all previously selected sentences]
Sentence Extraction – Update task
• Update task assumes readers already read previous cluster(s)
– implies we should not select sentences that have redundant information with previous cluster(s)
• Propose a modified MMR for the update task:
– consider the total similarity of the candidate sentence with all selected sentences and sentences in previously-read cluster(s)
• P contains some top-ranked sentences in previous cluster(s)
[Equation: modified MMR score for the update task, with an added penalty term for overlap with previous cluster(s)]
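The greedy re-ranking idea can be sketched in Python as follows. The exact scoring formula is not reproduced here; this sketch simply subtracts a weighted sum of similarities to the already selected sentences S and to the previous-cluster sentences P from each candidate's prestige score. The `penalty` weight, the Jaccard similarity, and the function names are all assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def rerank(scores, sim, k, previous=(), penalty=0.5):
    """Greedily pick k sentences, penalising total similarity to earlier choices.

    scores:   dict sentence -> prestige score (e.g. a PageRank score).
    sim:      function (sentence, sentence) -> similarity.
    previous: sentences from already-read clusters (P); leave empty for the main task.
    penalty:  hypothetical weight on the redundancy penalty.
    """
    selected = []
    remaining = set(scores)
    while remaining and len(selected) < k:
        def adjusted(s):
            overlap = sum(sim(s, t) for t in selected) + sum(sim(s, p) for p in previous)
            return scores[s] - penalty * overlap
        best = max(sorted(remaining), key=adjusted)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected


scores = {"storm hits coast": 0.5, "storm hits the coast": 0.45, "markets fall": 0.3}
main = rerank(scores, jaccard, k=2)
update = rerank(scores, jaccard, k=1, previous=["storm hits coast today"])
print(main)
print(update)
```

In the main-task call the near-duplicate second sentence loses to a lower-scored but novel one; in the update-task call both storm sentences are penalised for overlapping with what the reader has already seen.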
References
• Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research, (22).
• Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of EMNLP 2004.
• S.N. Dorogovtsev and J.F.F. Mendes. 2001. Evolution of networks. Submitted to Advances in Physics on 6th March 2001.
• Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7).
• Jon M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. In Proceedings of ACM-SIAM Symposium on Discrete Algorithms, 1999.
• Shiren Ye, Long Qiu, Tat-Seng Chua, and Min-Yen Kan. 2005. NUS at DUC 2005: Understanding documents via concept links. In Proceedings of DUC 2005.