
A Hierarchical Decoder with Three-level Hierarchical Attention to Generate Abstractive Summaries of Interleaved Texts

Sanjeev Kumar Karn 1,3, Francine Chen 2, Yan-Ying Chen 2, Ulli Waltinger 3 and Hinrich Schütze 1

1 Center for Information and Language Processing (CIS), LMU Munich
2 FX Palo Alto Laboratory, Palo Alto, California
3 Machine Intelligence, Siemens CT, Munich, Germany

1 [email protected]
2 {chen,yanying}@fxpal.com

Abstract

Interleaved texts, where posts belonging to different threads occur in one sequence, are a common occurrence, e.g., online chat conversations. To quickly obtain an overview of such texts, existing systems first disentangle the posts by threads and then extract summaries from those threads. The major issues with such systems are error propagation and non-fluent summaries. To address those, we propose an end-to-end trainable hierarchical encoder-decoder system. We also introduce a novel hierarchical attention mechanism which combines three levels of information from an interleaved text, i.e., posts, phrases and words, and implicitly disentangles the threads. We evaluated the proposed system on multiple interleaved text datasets, and it outperforms a SOTA two-step system by 20-40%.

1 Introduction

Interleaved texts, e.g., multi-author entries for activity reports and social media conversations such as Slack, are increasingly common. However, getting a quick sense of the different threads in an interleaved text is often difficult due to the entanglement of threads, i.e., posts belonging to different threads occurring in one sequence; see the hypothetical example in Figure 1.

In conversation disentanglement, interleaved posts are grouped by thread. However, a reader still has to read all posts in all clustered threads to get the insights. To address this shortcoming, Shang et al. (2018) proposed a system that takes an interleaved text as input and provides the reader with its summaries. Their system is an unsupervised two-step system: first, a conversation disentanglement component clusters the posts thread-wise,

Figure 1: In the upper part, 7 interleaved posts belonging to different threads occur in a sequence. In the background at the bottom, posts are disentangled (clustered) into 3 threads (posts are outlined with colors corresponding to threads), and in the foreground, single-sentence abstractive summaries are generated for each thread.

and second, a multi-sentence compression component compresses the thread-wise posts to single-sentence summaries. However, this system has two major disadvantages: first, the disentanglement obtained through either supervised (Jiang et al., 2018) or unsupervised (Wang and Oard, 2009) methods propagates its errors to the downstream summarization task and therefore degrades the overall performance; and second, the compression component is restricted to formulating summaries out of the disentangled threads and therefore cannot bring in new words to improve fluency.


We aim to tackle these issues through an end-to-end trainable encoder-decoder system that takes a variable-length input, e.g., interleaved texts, processes it and generates a variable-length output, e.g., a multi-sentence summary. An end-to-end system eliminates the disentanglement component, and thus, the error propagation. Furthermore, the corpus-level vocabulary of the decoder provides it with a greater selection of words, and thus, a possibility to improve language fluency.

In the domain of text summarization, a hierarchical encoder, encoding words in a sentence (post) followed by the encoding of sentences in a document (channel), is a very commonly used method (Nallapati et al., 2016; Hsu et al., 2018). However, hierarchical decoding is rare, as many works in the domain aim to comprehend an important fact from single or multiple documents. Summarizing interleaved texts provides us a unique opportunity to employ hierarchical decoding, as such texts comprise several facts from several threads. We also propose a novel hierarchical attention, which assists the decoder in its summary generation process with 3 levels of information from the interleaved text (posts, phrases, and words), rather than the traditional two levels of post and word (Nallapati et al., 2017, 2016; Tan et al., 2017; Cheng and Lapata, 2016).

As labeling of interleaved texts is a difficult and expensive task, we devised an algorithm that synthesizes interleaved text-summary pair corpora of different difficulty levels (in terms of entanglement) from a regular document-summary pair corpus. Using these corpora, we show the encoder-decoder system not only obviates the disentanglement component, but also enhances performance. Further, our hierarchical encoder-decoder system consistently outperforms traditional sequential ones.

To summarize, our contributions are:

• We propose an end-to-end encoder-decoder system over a pipeline to obtain a quick overview of interleaved texts.

• To the best of our knowledge, we are the first to use a hierarchical decoder to obtain multi-sentence abstractive summaries from texts.

• We propose a novel hierarchical attention that integrates information from 3 levels (posts, phrases and words) and is trained end-to-end.

• We devise an algorithm that synthesizes interleaved text-summary corpora, on which we verify pipeline vs. encoder-decoder systems, sequential vs. hierarchical decoding, and 2- vs. 3-level hierarchical attention. Overall, the proposed system attains 20-40% performance gains on both real-world (AMI) and synthetic datasets.

2 Related Work

Ma et al. (2012); Aker et al. (2016); Shang et al. (2018) designed earlier systems that summarize posts in multi-party conversations in order to provide readers with an overview of the discussed matters. They broadly follow the same approach: cluster the posts and then extract a summary from each cluster.

There are two kinds of summarization: abstractive and extractive. In abstractive summarization, the model utilizes a corpus-level vocabulary and generates novel sentences as the summary, while extractive models extract or rearrange the source words as the summary. Abstractive models based on neural sequence-to-sequence (seq2seq) (Rush et al., 2015) proved to generate summaries with higher ROUGE scores than the feature-based abstractive models. Integration of attention into seq2seq (Bahdanau et al., 2014) led to further advancement of abstractive summarization (Nallapati et al., 2016; Chopra et al., 2016).

Li et al. (2015) proposed an encoder-decoder (auto-encoder) model that utilizes a hierarchy of networks: word-to-word followed by sentence-to-sentence. Their model is better at capturing the underlying structure than a vanilla sequential encoder-decoder model (seq2seq). Krause et al. (2017); Jing et al. (2018) showed that multi-sentence captioning of an image through a hierarchical Recurrent Neural Network (RNN), topic-to-topic followed by word-to-word, is better than seq2seq.

These works suggest that a hierarchical encoder, with word-to-word encoding followed by post-to-post, will better recognize the dispersed information in interleaved texts. Similarly, a hierarchical decoder, thread-to-thread followed by word-to-word, will intrinsically disentangle the posts, and therefore, generate more appropriate summaries.

Figure 2: Our hierarchical encoder-decoder architecture. On the left, interleaved posts are encoded hierarchically, i.e., word-to-word (Ew2w) followed by post-to-post (Ep2p). On the right, summaries are generated hierarchically, thread-to-thread (Dt2t) followed by word-to-word (Dw2w).

Nallapati et al. (2016) devised a hierarchical attention mechanism for a seq2seq model, where two levels of attention distributions over the source, i.e., sentence and word, are computed at every step of the word decoding. Based on the sentence attentions, the word attentions are rescaled. Hsu et al. (2018) slightly simplified this mechanism and computed the sentence attention only at the first step. Our hierarchical attention is more intuitive and computes new sentence attentions for every new summary sentence, and, unlike Hsu et al. (2018), is trained end-to-end.

3 Model

Problem Statement

We aim to design a system that, when given a sequence of posts, $C = \langle P_1, \ldots, P_{|C|} \rangle$, produces a sequence of summaries, $T = \langle S_1, \ldots, S_{|T|} \rangle$. For simplicity and clarity, unless otherwise noted, we will use lowercase italics for variables, uppercase italics for sequences, lowercase bold for vectors and uppercase bold for matrices.

3.1 Encoder

The hierarchical encoder (see the left-hand section of Figure 2) is based on Nallapati et al. (2017), where the word-to-word and post-to-post encoders are bi-directional LSTMs. The word-to-word BiLSTM encoder ($E_{w2w}$) runs over the word embeddings of post $P_i$ and generates a set of hidden representations, $\langle h^{E_{w2w}}_{i,0}, \ldots, h^{E_{w2w}}_{i,p} \rangle$, of $d$ dimensions. The average pooled value of the word-to-word representations of post $P_i$, $\frac{1}{p}\sum_{j=0}^{p} h^{E_{w2w}}_{i,j}$, is input to the post-to-post BiLSTM encoder ($E_{p2p}$), which then generates a set of representations, $\langle h^{E_{p2p}}_{0}, \ldots, h^{E_{p2p}}_{n} \rangle$, corresponding to the posts. Overall, for a given channel $C$, the output representations of word-to-word, $\mathbf{W}$, and post-to-post, $\mathbf{P}$, have $n \times p \times 2d$ and $n \times 2d$ dimensions respectively.
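As a concrete illustration, the following is a minimal PyTorch sketch of this two-level encoding; it is not the authors' implementation, and the class name, padding handling and toy shapes are our assumptions:

    import torch
    import torch.nn as nn

    class HierarchicalEncoder(nn.Module):
        """Sketch of the two-level encoder: word-to-word BiLSTM per post,
        mean pooling, then a post-to-post BiLSTM over the pooled vectors."""

        def __init__(self, emb_dim: int, d: int):
            super().__init__()
            # Bi-directional LSTMs; each direction has d hidden units, so the
            # outputs are 2d-dimensional, matching W (n x p x 2d) and P (n x 2d).
            self.word_enc = nn.LSTM(emb_dim, d, batch_first=True, bidirectional=True)
            self.post_enc = nn.LSTM(2 * d, d, batch_first=True, bidirectional=True)

        def forward(self, channel_emb: torch.Tensor):
            # channel_emb: (n, p, emb_dim) -- n posts, p (padded) words per post.
            word_states, _ = self.word_enc(channel_emb)       # W: (n, p, 2d)
            pooled = word_states.mean(dim=1).unsqueeze(0)     # average pooling: (1, n, 2d)
            post_states, _ = self.post_enc(pooled)            # P: (1, n, 2d)
            return word_states, post_states.squeeze(0)

    # Toy usage: a channel with 4 posts of 20 padded words, 100-dim embeddings.
    enc = HierarchicalEncoder(emb_dim=100, d=100)
    W, P = enc(torch.randn(4, 20, 100))
    print(W.shape, P.shape)  # torch.Size([4, 20, 200]) torch.Size([4, 200])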

3.2 Decoder

Our hierarchical decoder's structure and arrangement are similar to Li et al. (2015)'s hierarchical auto-encoder, with two uni-directional LSTM decoders, thread-to-thread and word-to-word (see the right-hand side of Figure 2); however, it differs considerably in terms of inputs, initial states and attentions, which we explain in the next two sections.

The initial state $h^{D_{t2t}}_{0}$ of the thread-to-thread LSTM decoder ($f_{D_{t2t}}$) is set with a feedforward-mapped representation of the average pooled post representations, $c' = \frac{1}{n}\sum_{i=0}^{n} h^{E_{p2p}}_{i}$. At each step $k$ of $f_{D_{t2t}}$, a sequence of attention weights, $\langle \beta^{k}_{0,0}, \ldots, \beta^{k}_{n,p} \rangle$, corresponding to the set of encoded word representations, $\langle h^{E_{w2w}}_{0,0}, \ldots, h^{E_{w2w}}_{n,p} \rangle$, is computed utilizing the previous state, $h^{D_{t2t}}_{k-1}$. We elaborate on the attention computation in the next section.

A weighted representation of the words (crossed blue circle in Figure 2) is then computed: $\sum_{i=1}^{n}\sum_{j=1}^{p} \beta^{k}_{i,j} \mathbf{W}_{ij}$. Additionally, we use the last hidden state $h^{D_{w2w}}_{k-1,q}$ of the word-to-word decoder LSTM ($D_{w2w}$) of the previously generated summary sentence as the second input to compute the next state of the thread-to-thread decoder, i.e., $h^{D_{t2t}}_{k}$. The motivation is to provide information about the previous sentence.


The current state $h^{D_{t2t}}_{k}$ is passed through a single-layer feedforward network and a distribution over STOP=1 and CONTINUE=0 is computed:

$$p^{STOP}_{k} = \sigma\big(g(h^{D_{t2t}}_{k})\big) \qquad (1)$$

where $g$ is a feedforward network. In Figure 2, the process is depicted by a yellow circle. The thread-to-thread decoder keeps decoding until $p^{STOP}_{k}$ is greater than 0.5.

Additionally, the current state $h^{D_{t2t}}_{k}$ and the inputs to $D_{t2t}$ at that step are passed through a two-layer feedforward network $r$ followed by a dropout layer to compute the thread representation $s_{k} = r\big([h^{D_{t2t}}_{k}; h^{D_{w2w}}_{k-1,q}; \beta^{k} * \mathbf{W}]\big)$.

Given a thread representation $s_{k}$, the word-to-word decoder generates a summary for the thread. Our word-to-word decoder is based on Bahdanau et al. (2014). It is a unidirectional attentional LSTM ($f_{D_{w2w}}$); see the right-hand side of Figure 2. We refer to Bahdanau et al. (2014) for further details.
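The thread-level decoding loop, including the stop check of Eq. 1 and the thread representation $s_k$, can be sketched as follows. The attention and word-decoder calls are left as stand-in callables, and all names and layer sizes are illustrative assumptions rather than the paper's implementation:

    import torch
    import torch.nn as nn

    class ThreadDecoderSketch(nn.Module):
        """Illustrative thread-to-thread decoding loop (our naming): attend over
        encoded words, combine the attended context with the last word-decoder
        state, update the LSTM state, emit a stop probability (Eq. 1), and
        produce a thread representation s_k."""

        def __init__(self, d: int, max_threads: int = 5):
            super().__init__()
            self.init_map = nn.Linear(2 * d, d)     # maps pooled post states c' to the initial state
            self.cell = nn.LSTMCell(2 * d + d, d)   # input: attended word context + prev. word-decoder state
            self.stop = nn.Linear(d, 1)             # g(.) in Eq. 1
            self.thread_rep = nn.Sequential(        # r(.): two-layer feedforward + dropout
                nn.Linear(d + d + 2 * d, d), nn.ReLU(), nn.Linear(d, d), nn.Dropout(0.1))
            self.max_threads = max_threads

        def forward(self, W, P, attn_fn, word_decoder_fn):
            # W: (n, p, 2d) word states, P: (n, 2d) post states.
            h = torch.tanh(self.init_map(P.mean(dim=0, keepdim=True)))   # (1, d)
            c = torch.zeros_like(h)
            prev_word_state = torch.zeros(1, h.size(1))
            threads = []
            for _ in range(self.max_threads):
                beta = attn_fn(h, W)                                     # (n, p) phrase-level weights
                context = (beta.unsqueeze(-1) * W).sum(dim=(0, 1)).unsqueeze(0)  # (1, 2d)
                h, c = self.cell(torch.cat([context, prev_word_state], dim=-1), (h, c))
                if torch.sigmoid(self.stop(h)).item() > 0.5:             # Eq. 1: stop decoding threads
                    break
                s_k = self.thread_rep(torch.cat([h, prev_word_state, context], dim=-1))
                threads.append(s_k)
                prev_word_state = word_decoder_fn(s_k)                   # last hidden state of the word decoder
            return threads

    # Toy usage with stand-in attention and word-decoder callables.
    d, n, p = 100, 4, 20
    dec = ThreadDecoderSketch(d)
    W, P = torch.randn(n, p, 2 * d), torch.randn(n, 2 * d)
    attn_fn = lambda h, enc: torch.rand(enc.shape[0], enc.shape[1])
    word_decoder_fn = lambda s: torch.zeros(1, d)
    print(len(dec(W, P, attn_fn, word_decoder_fn)), "thread representation(s) generated")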

3.3 Hierarchical Attention

Figure 3: Hierarchical attention mechanism. Dotted lines indicate involvement in the mechanism.

Our novel hierarchical attention works at 3 levels. The post-level attention (corresponding to posts), i.e., γ, and the phrase-level attention (corresponding to source tokens), i.e., β, are computed while obtaining a thread representation, s. The word-level attention (also corresponding to source tokens), i.e., α, is computed while generating a word, y, of a summary, S; see Figure 3.

We draw inspiration for the hierarchical attention from some of the recent works in computer vision (Noh et al., 2017; Teichmann et al., 2019), which show that a convolutional neural network (CNN)-based local descriptor with attention is better at obtaining key points from an image than a CNN-based global descriptor. Phrases from posts of interleaved texts are equivalent to visual patterns in images, and thus, extracting phrases is more relevant for thread recognition than extracting posts. Thus, contrary to popular hierarchical attention (Nallapati et al., 2016; Cheng and Lapata, 2016; Tan et al., 2017), we have an additional phrase-level attention focusing again on words, but with a different responsibility. Further, the popularly held intuition of hierarchical attention, i.e., sentence attention scales word attention, is still intact, as gamma (post attention) scales beta.

At step $k$ of thread decoding, we compute the elements of the post-level attention $\gamma^{k}$ as:

$$\gamma^{k}_{i} = \sigma\big(attn_{\gamma}(h^{D_{t2t}}_{k-1}, \mathbf{P}_{i})\big), \quad i \in \{1, \ldots, n\} \qquad (2)$$

where $attn_{\gamma}$ aligns the current thread decoder state vector $h^{D_{t2t}}_{k-1}$ to the vectors of matrix $\mathbf{P}_{i}$ and then maps the aligned vectors to scalar values through a feed-forward network. At the same step, we also compute the elements of the phrase-level attention $\beta^{k}_{i,j}$ as:

$$\beta^{k}_{i,j} = \sigma\big(attn_{\beta}(h^{D_{t2t}}_{k-1}, a_{i,j})\big), \quad \text{where } a_{i,j} = add(\mathbf{W}_{i,j}, \mathbf{P}_{i}), \quad i \in \{1, \ldots, n\},\ j \in \{1, \ldots, p\} \qquad (3)$$

where $add$ aligns a post representation to its constituting word representations and performs element-wise addition, and $attn_{\beta}$ is a feedforward network that maps the current thread decoder state $h^{D_{t2t}}_{k-1}$ and the vector $a_{i,j}$ to a scalar value. Importantly, the $\sigma(\cdot)$ in γ and β allows a thread not to be associated with any relevant phrase, thereby indicating a halt in decoding.

Then, we use $\gamma^{k}$ to rescale the phrase-level attentions $\beta^{k}$: $\beta^{k}_{i,j} = \beta^{k}_{i,j} * \gamma^{k}_{i}$.

At step $l$ of word-to-word decoding of summary thread $k$, we compute the elements of the word-level attention $\alpha^{k,l}_{i,\cdot}$ as below:

$$\alpha^{k,l}_{i,j} = \frac{\exp\big(e^{k,l}_{i,j}\big)}{\sum_{i=1}^{n}\sum_{j=1}^{p} \exp\big(e^{k,l}_{i,j}\big)}, \quad \text{where } e^{k,l}_{i,j} = attn_{\alpha}\big(h^{D_{w2w}}_{k,l-1}, a_{i,j}\big) \qquad (4)$$

where $a_{i,j}$ is the same as in Eq. 3 and $attn_{\alpha}$ is a feedforward network that maps the current word decoder state $h^{D_{w2w}}_{k,l-1}$ and the vector $a_{i,j}$ to a scalar value.


Interleaved sentences:
  [3] this study was conducted to evaluate the influence of e . . .
  [F] to assess the effect of a program of supervised fitness . . .
  [F] an 8-week randomized , controlled trial . . . .
  [3] nine endurance-trained athletes participated in a randomised . . .
  . . .
  [Q] we examined the effects of intensity of training on ratings . . .
  [Q] subjects were recruited as sedentary controls or were randomly . . .
  [Q] the at lt group trained at velocity lt and the greater than . . .
  [3] data were obtained on 47 of 51 intervention patients and 45 . . .

Interleaved titles:
  [3] caffeine in sport . influence of endurance exercise on the urinary caffeine concentration .
  [F] supervised fitness walking in patients with osteoarthritis of the knee . a randomized , controlled trial .
  [Q] the effect of training intensity on ratings of perceived exertion .

Table 1: The left rows contain an interleaving of 3 articles with 2 to 5 sentences each, and the right rows contain their interleaved titles. Associated sentences and titles are depicted by matching symbols.

Finally, we use the rescaled phrase-level attentions $\beta^{k}$ to rescale the word-level attention $\alpha^{k,l}$: $\alpha^{k,l}_{i,j} = \beta^{k}_{i,j} \times \alpha^{k,l}_{i,j}$.
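A compact sketch of the three attention levels and the two rescaling steps, following our reading of Eqs. 2-4; the concatenation-based scoring layers and the toy shapes are assumptions, since the paper does not spell out the exact alignment networks:

    import torch
    import torch.nn as nn

    class HierAttentionSketch(nn.Module):
        """Sketch of the 3-level attention (Eqs. 2-4): sigmoid post attention (gamma),
        sigmoid phrase attention (beta) rescaled by gamma, and softmax word attention
        (alpha) rescaled by beta."""

        def __init__(self, d: int):
            super().__init__()
            self.attn_gamma = nn.Linear(d + 2 * d, 1)   # scores (thread state, post vector)
            self.attn_beta = nn.Linear(d + 2 * d, 1)    # scores (thread state, word+post vector)
            self.attn_alpha = nn.Linear(d + 2 * d, 1)   # scores (word-decoder state, word+post vector)

        def forward(self, h_t2t, h_w2w, W, P):
            n, p, _ = W.shape
            a = W + P.unsqueeze(1)                                          # add(W_ij, P_i): (n, p, 2d)

            # Eq. 2: post-level attention, one sigmoid score per post.
            gamma = torch.sigmoid(self.attn_gamma(
                torch.cat([h_t2t.expand(n, -1), P], dim=-1))).squeeze(-1)   # (n,)

            # Eq. 3: phrase-level attention over all (post, word) positions, rescaled by gamma.
            beta = torch.sigmoid(self.attn_beta(
                torch.cat([h_t2t.expand(n, p, -1), a], dim=-1))).squeeze(-1)  # (n, p)
            beta = beta * gamma.unsqueeze(-1)

            # Eq. 4: word-level attention, softmax over all source tokens, rescaled by beta.
            e = self.attn_alpha(torch.cat([h_w2w.expand(n, p, -1), a], dim=-1)).squeeze(-1)
            alpha = torch.softmax(e.view(-1), dim=0).view(n, p)
            alpha = beta * alpha
            return gamma, beta, alpha

    # Toy usage: 4 posts, 20 words each; 2d = 200 encoder states, d = 100 decoder states.
    attn = HierAttentionSketch(d=100)
    gamma, beta, alpha = attn(torch.randn(1, 100), torch.randn(1, 100),
                              torch.randn(4, 20, 200), torch.randn(4, 200))
    print(gamma.shape, beta.shape, alpha.shape)  # (4,), (4, 20), (4, 20)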

3.4 Training Objective

We train our hierarchical encoder-decoder network similarly to an attentive seq2seq model (Bahdanau et al., 2014), but with an additional weighted sum of sigmoid cross-entropy losses on the stopping distribution; see Eq. 1. Given a thread summary, $Y^{k} = \langle w_{k,0}, \ldots, w_{k,q} \rangle$, our word-to-word decoder generates a target $\langle y_{k,0}, \ldots, y_{k,q} \rangle$, with words from the same vocabulary $U$. We train our model end-to-end by minimizing the objective given in Eq. 5.

$$\sum_{k=1}^{m}\sum_{l=1}^{q} \log p_{\theta}\big(y_{k,l} \mid w_{k,\cdot<l}, \mathbf{W}\big) + \lambda \sum_{k=1}^{m} y^{STOP}_{k} \log\big(p^{STOP}_{k}\big) \qquad (5)$$
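A hedged sketch of this objective in its standard negative-log-likelihood form, assuming teacher-forced word logits and per-thread stop logits are already available; the tensor layouts and the helper name are ours, not the authors':

    import torch
    import torch.nn.functional as F

    def summarization_loss(word_logits, target_ids, stop_logits, stop_labels, lam=1.0):
        """Sketch of the training objective (Eq. 5): token-level negative
        log-likelihood for every generated summary word plus a lambda-weighted
        sigmoid cross-entropy on the STOP/CONTINUE distribution."""
        # word_logits: (num_threads, q, vocab), target_ids: (num_threads, q)
        nll = F.cross_entropy(word_logits.reshape(-1, word_logits.size(-1)),
                              target_ids.reshape(-1), reduction="sum")
        # stop_logits, stop_labels: (num_threads,) with labels in {0., 1.}
        stop_loss = F.binary_cross_entropy_with_logits(stop_logits, stop_labels,
                                                       reduction="sum")
        return nll + lam * stop_loss

    # Toy call: 3 threads, 15-token summaries, vocabulary of 8000.
    loss = summarization_loss(torch.randn(3, 15, 8000),
                              torch.randint(0, 8000, (3, 15)),
                              torch.randn(3), torch.tensor([0., 0., 1.]))
    print(loss.item())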

4 Dataset

Obtaining labeled training data for conversation summarization is challenging. The available datasets are either extractive (Verberne et al., 2018) or too small (Barker et al., 2016; Anguera et al., 2012) to train a neural model. To get around this issue and thoroughly verify the proposed architecture, we synthesized a dataset by utilizing a corpus of conventional texts for which summaries are available. We create two corpora of interleaved texts: one from the abstracts and titles of articles from the PubMed corpus and one from the questions and titles of Stack Exchange questions. A random interleaving of sentences from a few PubMed abstracts or Stack Exchange questions roughly resembles interleaved texts, and correspondingly the interleaving of their titles resembles its multi-sentence summary.

The algorithm that we devised for creating synthetic interleaved texts is defined in detail in the Appendix. The number of abstracts to include in the interleaved texts is given as a range (from a to b) and the number of sentences per abstract to include is given as a second range (from m to n). We vary these parameters and create three different corpora for the experiments: Easy (a=2, b=2, m=5 and n=5), Medium (a=2, b=3, m=2 and n=5) and Hard (a=2, b=5, m=2 and n=5). Table 1 shows an example of a data instance in the Hard Interleaved RCT corpus.
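For reference, the three difficulty settings can be collected in a small configuration sketch; the dictionary layout is ours, while the values are those stated above:

    # Difficulty settings used to synthesize the three corpora
    # (a, b: number of interleaved abstracts; m, n: sentences kept per abstract).
    DIFFICULTY = {
        "easy":   dict(a=2, b=2, m=5, n=5),
        "medium": dict(a=2, b=3, m=2, n=5),
        "hard":   dict(a=2, b=5, m=2, n=5),
    }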

5 Experiments

We report ROUGE-1, ROUGE-2, and ROUGE-L as the quantitative evaluation of the models. The hyper-parameters for the experiments are described in detail in the Appendix and remain the same unless otherwise noted.

5.1 Upper-bound

In the upper-bound experiments, we check the impact of disentanglement on the abstractive summarization models, e.g., seq2seq and hier2hier. In order to do this, we first provide the ground-truth disentanglement (cluster) information and evaluate the performance of these models. Second, we let the models perform either end-to-end or two-step summarization. In order to perform these experiments, we compiled three corpora of different entanglement difficulty using the Pubmed corpus of MeSH types Disease and Chemical¹. The training, evaluation and test sets are of sizes 170k, 4k and 4k respectively.

¹ Interleaving is performed within a MeSH type.


Input Text  Model         | Easy                      | Medium                    | Hard
                          | Rouge-1  Rouge-2  Rouge-L | Rouge-1  Rouge-2  Rouge-L | Rouge-1  Rouge-2  Rouge-L
ind         seq2seq       | 35.09    28.72    13.16   | 36.31    28.78    13.45   | 37.74    28.72    13.76
dis         seq2seq       | 36.38    29.90    14.78   | 35.63    28.45    13.98   | 37.87    28.85    14.77
dis         hier2hier     | 35.30    28.93    13.35   | 37.30    29.83    14.90   | 39.09    30.11    15.22
kmn         seq2seq(dis)  | 34.48    27.51    13.31   | 34.05    26.58    13.14   | 35.54    26.36    13.65
kmn         seq2seq       | 34.28    27.84    13.86   | 34.89    27.42    13.68   | 31.22    23.37    11.77
kmn         compress      | 30.04    19.83    10.75   | 29.37    17.54    10.43   | 29.11    15.76    10.13
ent         seq2seq       | 35.78    28.89    14.62   | 35.20    27.44    13.54   | 32.46    24.17    12.17
ent         hier2hier     | 35.88    28.47    13.33   | 37.29    29.63    14.95   | 37.11    27.97    14.26

Table 2: Summarization performance (ROUGE recall scores) comparing models when the threads are disentangled (upper section, upper bounds) and when the threads are entangled (lower section, real-world) on the Easy, Medium and Hard Pubmed corpora. ind = individual, dis = disentangled (ground-truth), kmn = K-means disentangled, ent = entangled. In the middle, the first kmn row shows a seq2seq model trained on ground-truth disentangled texts and tested on unsupervised disentangled texts, and the second kmn row shows a seq2seq model trained and tested on unsupervised disentangled texts.

The seq2seq model can use the ground-truth disentanglement information in two ways, i.e., summarize the threads individually or summarize concatenated threads. The first two rows in Table 2 compare the performance of those two sets of experiments. Clearly, the seq2seq model can easily detect thread boundaries in concatenated threads and performs as well as the individual model. However, hier2hier is better than seq2seq at detecting thread boundaries, as indicated by its performance gain on the Medium and Hard corpora (see row 3 in Table 2), and therefore sets the upper bound for interleaved text summarization.

Additionally, we also utilize Shang et al. (2018)'s unsupervised disentanglement component and cluster the entangled threads. Importantly, their disentanglement component requires a fixed cluster size as an input; however, our Medium and Hard corpora have a varying cluster size, and therefore we give their system the benefit of the doubt and input the maximum cluster size, i.e., 3 and 5 respectively. We sort the clusters by their association to the sequence of summaries, where the association is measured by ROUGE-L between them. We then take the seq2seq trained on ground-truth disentanglement and test it on these unsupervised-disentangled texts to understand the strength of the unsupervised clustering. The performance of the pre-trained model remains somewhat similar (see rows 2 and 4), indicating a strong disentanglement component. We also train and test a seq2seq on unsupervised-disentangled texts; however, its performance lowers slightly (see row 5), which we believe is due to noise inserted by the heuristic sorting of clusters.

In real-world scenarios, i.e., without ground-truth disentanglement, Shang et al. (2018)'s unsupervised two-step system performs worse than seq2seq on unsupervised disentanglement (see rows 5 and 6), the reason being that a seq2seq model trained on a sufficiently large dataset is better at summarization than the unsupervised sentence compression (extractive) method. At the same time, a seq2seq model trained on entangled texts performs similarly to a seq2seq trained on unsupervised-disentangled texts (see rows 5 and 7), thereby showing that the disentanglement component is not necessary. Finally, a hier2hier trained on entangled texts is the only model that comes closest to the upper bound set by hier2hier on disentangled texts (see rows 3 and 8).

6 Seq2seq vs. hier2hier models

Further, we compare the proposed hierarchical approach against the seq2seq approach in summarizing interleaved texts by experimenting on Medium and Hard corpora obtained from much more varied base document-summary pairs. We interleave a Pubmed corpus of 10 MeSH types, e.g., anatomy and organism. Similarly, we interleave Stack Exchange posts-question pairs of 12 different categories with regular vocabularies, e.g., science fiction and travel. As before, the interleaving is performed within a type or category. The training, evaluation and test sets of Pubmed are of sizes 280k, 5k and 5k, and those of Stack Exchange are of sizes 140k, 4k and 4k respectively.


Corpus      Model      | Pubmed                    | Stack Exchange
Difficulty             | Rouge-1  Rouge-2  Rouge-L | Rouge-1  Rouge-2  Rouge-L
Medium      seq2seq    | 30.67    11.71    23.80   | 18.78    03.52    14.73
Medium      hier2hier  | 32.78    12.36    25.33   | 24.34    05.07    18.63
Hard        seq2seq    | 29.07    10.96    21.76   | 20.21    04.03    14.93
Hard        hier2hier  | 33.36    12.69    24.72   | 24.96    05.56    17.95

Table 3: ROUGE recall scores on the Medium and Hard corpora. The base Pubmed corpus has abstract-summary pairs of 10 MeSH types, while the base Stack Exchange corpus has posts-question pairs from 12 topics.

Results in Table 3 show that a noticeable improvement is obtained by changing the decoder to hierarchical, i.e., 1.5-3 ROUGE points on Pubmed and 2-4.5 points on Stack Exchange.

Additionally, we evaluated the models' strength in recognizing threads when summaries are ordered by the location of each thread's greatest density. Here, density refers to the smallest window of posts containing over 50% of the posts belonging to a thread; e.g., post1-thread1, post1-thread2, post2-thread2, post2-thread1, post3-thread1, post4-thread1 → thread2-summary, thread1-summary. In this example, although thread1 occurs earlier, the majority of its posts occur later, and therefore its summary also occurs later. So, we create Medium and Hard corpora of Stack Exchange with summaries sorted by thread density and perform abstractive summarization studies. As seen in Table 4, both the seq2seq and hier2hier models perform similarly to the corpora with summaries sorted by thread occurrence (see Table 3), which indicates a strong disentanglement in such abstractive models irrespective of the summary arrangement. In addition, the hier2hier model is still consistently better than the seq2seq model.
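This density-based ordering can be illustrated with a small sketch under our reading of the description above; the helper names and the tie-breaking rule are assumptions:

    def densest_window(positions, total):
        """Smallest window of post positions containing more than half of a thread's
        posts (our reading of 'density'); returns its (start, end) post indexes."""
        need = total // 2 + 1                     # strictly more than 50% of the posts
        best = None
        for i in range(len(positions) - need + 1):
            window = (positions[i], positions[i + need - 1])
            if best is None or window[1] - window[0] < best[1] - best[0]:
                best = window
        return best

    def order_threads_by_density(thread_positions):
        """Order thread ids by the location of their densest window."""
        key = {t: densest_window(pos, len(pos)) for t, pos in thread_positions.items()}
        return sorted(key, key=lambda t: key[t])

    # Example from the text: thread1 posts at channel positions 0, 3, 4, 5; thread2 at 1, 2.
    print(order_threads_by_density({"thread1": [0, 3, 4, 5], "thread2": [1, 2]}))
    # -> ['thread2', 'thread1']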

Medium Corpus
Model      Rouge-1  Rouge-2  Rouge-L
seq2seq    19.67    03.88    15.37
hier2hier  23.97    05.63    18.75

Hard Corpus
Model      Rouge-1  Rouge-2  Rouge-L
seq2seq    19.62    03.71    14.90
hier2hier  24.14    05.00    17.25

Table 4: ROUGE recall scores of the models on the Stack Exchange Medium and Hard corpora.

To understand the impact of hierarchy on the hier2hier model, we perform an ablation study using the Hard Pubmed corpus for the experiments; Table 5 shows the results. Clearly, adding hierarchical decoding already provides a boost in performance. Hierarchical encoding also adds some improvement to the performance; however, the enhancement attained in training and inference speed by the hierarchical encoding is much more valuable (see Figure 4 in Appendix C)². Thus, the hier2hier model not only achieves greater accuracy but also reduces training and inference time.

Pubmed Hard Corpus
Model      Rouge-1  Rouge-2  Rouge-L
seq2seq    29.07    10.96    21.76
seq2hier   32.92    11.87    24.43
hier2seq   31.86    11.90    23.57
hier2hier  33.36    12.69    24.72

Table 5: ROUGE recall scores of the ablated models (encoder-decoder) on the Pubmed Hard corpus.

7 Hierarchical attention

To understand the impact of hierarchical attention on the hier2hier model, we perform an ablation study of the post-level attentions (γ) and phrase-level attentions (β), using the Pubmed Hard corpus.

Model                 Rouge-1  Rouge-2  Rouge-L
hier2hier(+γ +β)      33.36    12.69    24.72
hier2hier(−γ +β)      32.65    12.21    24.23
hier2hier(+γ −β)      31.28    10.20    23.49
hier2hier(Li et al.)  29.83    09.80    22.17
hier2hier(−γ −β)      30.58    10.00    22.96
seq2seq               29.07    10.96    21.76

Table 6: ROUGE recall scores of the ablated models (attentions) on the Hard Pubmed corpus.

Table 6 shows the performance comparison. The γ attention improves the performance of hierarchical decoding (by 0.5-1 points), but not by a lot. The phrase-level attention, i.e., β, is very important, as without it the model performance is noticeably reduced (ROUGE values decrease by 2-3 points).

² hier2hier takes ≈1.5 days to train on a Tesla V100 GPU, while seq2seq takes ≈4.5 days.


The hierarchical attentions closest to ours, i.e., Nallapati et al. (2017, 2016); Tan et al. (2017); Cheng and Lapata (2016), do not use β, and are therefore equivalent to hier2hier(+γ −β), which performs worse than hier2hier(−γ +β) and hier2hier(+γ +β), thus signifying the importance of β. We also include a Li et al. (2015)-type post-level attention technique in the comparison, where a softmax γ instead of the σ(·)-based γ and β is used to compute the thread representation. The results indicate that σ(·) fits better in this case. Lastly, removing both γ and β makes hier2hier similar to seq2seq, except for a few more parameters, i.e., two additional LSTMs, and the performance is also very similar.

8 AMI Experiments

We also evaluated both abstractive models, seq2seq and hier2hier, on the popular AMI meeting corpus (McCowan et al., 2005), and compare them against Shang et al. (2018)'s two-step system. We follow the standard train, eval and test split. The results in Table 7 show that hier2hier outperforms both systems by a large margin.

Model         Rouge-1  Rouge-2  Rouge-L
Shang et al.  29.00    -        -
seq2seq       31.60    10.60    25.03
hier2hier     39.75    12.75    25.41

Table 7: ROUGE F1 scores of the models on the AMI corpus with summary size 150.

9 Discussion

Table 8 shows an output of our hierarchical abstractive system, in which the interleaved texts are at the top, and the ground-truth and generated summaries are at the bottom. Table 8 also shows the top two post indexes attended to by the post-level attention (γ) while generating those summaries, and they coincide with the relevant posts. Similarly, the top 10 indexes (words) of the phrase-level attention (β) are directly visualized in the table through color coding matching the generation. The system not only manages to disentangle the interleaved texts but also generates appropriate abstractive summaries. Meanwhile, β provides explainability of the output.

The next step in this research is transfer learning of the hierarchical system trained on the synthetic corpus to real-world examples. Further, we aim to modify hier2hier to include some of the recent additions to seq2seq models, e.g., the pointer mechanism of See et al. (2017).

Interleaved Texts:
  0  this study was conducted to evaluate the influence of excessive sweating during long-distance running on the urinary concentration of caffeine . . .
  1  to assess the effect of a program of supervised fitness walking and patient education on functional status , pain , and . . .
  . . .
  5  a total of 102 patients with a documented diagnosis of primary osteoarthritis of one or both knees participated . . .
  6  we examined the effects of intensity of training on ratings of perceived exertion ( . . .
  . . .

Ground Truth / Generation (top 2 attended posts):
  GT:        caffeine in sport . influence of endurance exercise on the urinary caffeine concentration .
  Gen (0,2): effect of excessive [UNK] during [UNK] running on the urinary concentration of caffeine .
  GT:        supervised fitness walking in patients with osteoarthritis of the knee . a randomized , controlled trial .
  Gen (1,4): effect of a physical fitness walking on functional status , pain , and pain
  GT:        the effect of training intensity on ratings of perceived exertion .
  Gen (6,8): effects of intensity of training on perceived [UNK] in [UNK] athletes .

Table 8: Interleaved sentences of 3 articles, and the corresponding ground-truth (GT) and hier2hier-generated (Gen) summaries. The indexes of the top 2 sentences attended to (γ) for each generation are given in parentheses. Additionally, the top words attended to (β) for the generation are colored accordingly.


10 Conclusion

We presented an end-to-end trainable hierarchical encoder-decoder architecture with a novel hierarchical attention which implicitly disentangles interleaved texts and generates abstractive summaries covering the text threads. The architecture addresses the error-propagation and fluency issues that occur in two-step architectures, thereby yielding performance gains of 20-40% on both real-world and synthetic datasets.

References

Ahmet Aker, Monica Paramita, Emina Kurtic, Adam Funk, Emma Barker, Mark Hepple, and Rob Gaizauskas. 2016. Automatic label generation for news comment clusters. In Proceedings of the 9th International Natural Language Generation Conference, pages 61–69.

Xavier Anguera, Simon Bozonnet, Nicholas Evans, Corinne Fredouille, Gerald Friedland, and Oriol Vinyals. 2012. Speaker diarization: A review of recent research. IEEE Transactions on Audio, Speech, and Language Processing, 20(2):356–370.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Emma Barker, Monica Lestari Paramita, Ahmet Aker, Emina Kurtic, Mark Hepple, and Robert Gaizauskas. 2016. The sensei annotated corpus: Human summaries of reader comment conversations in on-line news. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 42–52.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, pages 93–98. The Association for Computational Linguistics.

Wan-Ting Hsu, Chieh-Kai Lin, Ming-Ying Lee, Kerui Min, Jing Tang, and Min Sun. 2018. A unified model for extractive and abstractive summarization using inconsistency loss. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 132–141. Association for Computational Linguistics.

Jyun-Yu Jiang, Francine Chen, Yan-Ying Chen, and Wei Wang. 2018. Learning to disentangle interleaved conversational threads with a siamese hierarchical network and similarity ranking. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1812–1822. Association for Computational Linguistics.

Baoyu Jing, Pengtao Xie, and Eric Xing. 2018. On the automatic generation of medical imaging reports. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2577–2586. Association for Computational Linguistics.

Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. 2017. A hierarchical approach for generating descriptive image paragraphs. In Computer Vision and Pattern Recognition (CVPR).

Jiwei Li, Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1106–1115. Association for Computational Linguistics.

Zongyang Ma, Aixin Sun, Quan Yuan, and Gao Cong. 2012. Topic-driven reader comments summarization. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 265–274. ACM.

I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, W. Post, Dennis Reidsma, and P. Wellner. 2005. The AMI meeting corpus. In Proceedings of Measuring Behavior 2005, 5th International Conference on Methods and Techniques in Behavioral Research, pages 137–140. Noldus Information Technology.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI Conference on Artificial Intelligence.

Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, pages 280–290. ACL.

Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. 2017. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision, pages 3456–3465.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 379–389. The Association for Computational Linguistics.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083. Association for Computational Linguistics.

Guokan Shang, Wensi Ding, Zekun Zhang, Antoine Tixier, Polykarpos Meladianos, Michalis Vazirgiannis, and Jean-Pierre Lorre. 2018. Unsupervised abstractive meeting summarization with multi-sentence compression and budgeted submodular maximization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 664–674. Association for Computational Linguistics.

Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. From neural sentence summarization to headline generation: A coarse-to-fine approach. In IJCAI, pages 4109–4115.

Marvin Teichmann, Andre Araujo, Menglong Zhu, and Jack Sim. 2019. Detect-to-retrieve: Efficient regional aggregation for image search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5109–5118.

Suzan Verberne, Emiel Krahmer, Iris Hendrickx, Sander Wubben, and Antal van den Bosch. 2018. Creating a reference data set for the summarization of discussion forum threads. Language Resources and Evaluation, pages 1–23.

Lidan Wang and Douglas W. Oard. 2009. Context-based message expansion for disentanglement of interleaved text conversations. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 200–208. Association for Computational Linguistics.

Appendix

A Interleave Algorithm

In Algorithm 1, Interleave takes a set of concatenated abstracts and titles, C = 〈A1;T1, . . . , A|C|;T|C|〉, the minimum, a, and maximum, b, number of abstracts to interleave, and the minimum, m, and maximum, n, number of sentences in a source, and then returns a set of concatenated interleaved texts and summaries. window takes a sequence of texts, X, and returns a window iterator of size $\frac{|X|-w}{t}+1$, where w and t are the window size and sliding step respectively. window reuses elements of X, and therefore enlarges the corpus size. The notation U refers to uniform sampling, [·] to array indexing, and Reverse to reversing an array.

Algorithm 1 Interleaving Algorithm

 1: procedure Interleave(C, a, b, m, n)
 2:   C, Z ← window(C, w = b, t = 1), Array()
 3:   while C ≠ ∅ do
 4:     C′, A′, T′, S ← C.Next(), Array(), Array(), {}
 5:     r ∼ U(a, b)
 6:     for j = 1 to r do                ▷ Selection
 7:       A, T ← C′[j]
 8:       T′.Add(T)
 9:       q ∼ U(m, n)
10:       A′.Add(A[1:q])
11:       S ← S ∪ {j × q}
12:     A, T ← Array(), Array()
13:     for 1 to |S| do                  ▷ Interleaving
14:       k ← U(S)
15:       S ← S \ k
16:       I ← Reverse(A′[k]).pop()
17:       A.Add(I)
18:       J ← T′[k]
19:       if J ∉ T then
20:         T.Add(J)
21:     Z.Add(A; T)
22:   return Z
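For illustration, a compact Python rendering of Algorithm 1 under our reading of it; the helper names and the toy corpus are assumptions, and, as in the algorithm, window reuses elements so the corpus is enlarged:

    import random

    def window(corpus, w, t=1):
        """Sliding windows of w (abstract, title) pairs with step t; elements are reused."""
        for i in range(0, len(corpus) - w + 1, t):
            yield corpus[i:i + w]

    def interleave(corpus, a, b, m, n, seed=0):
        """Python rendering of Algorithm 1 (our reading): sample r articles per window,
        keep q sentences of each abstract, then pop sentences in random thread order."""
        rng = random.Random(seed)
        out = []
        for group in window(corpus, w=b):
            r = rng.randint(a, b)
            picked_sents, picked_titles, bag = [], [], []
            for j in range(r):                         # Selection
                abstract, title = group[j]
                q = rng.randint(m, n)
                picked_sents.append(list(reversed(abstract[:q])))  # reversed so pop() keeps order
                picked_titles.append(title)
                bag.extend([j] * q)
            mixed_sents, mixed_titles = [], []
            while bag:                                 # Interleaving
                k = bag.pop(rng.randrange(len(bag)))
                mixed_sents.append(picked_sents[k].pop())
                if picked_titles[k] not in mixed_titles:
                    mixed_titles.append(picked_titles[k])
            out.append((mixed_sents, mixed_titles))
        return out

    # Toy corpus: each article is (list of sentences, title).
    toy = [([f"a{i} s{j}" for j in range(5)], f"title {i}") for i in range(4)]
    pairs = interleave(toy, a=2, b=3, m=2, n=5)
    print(len(pairs), pairs[0][1])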

B Parameters

For the word-to-word encoder, the steps are limited to 20, while the steps in the word-to-word decoder are limited to 15. The steps in the post-to-post encoder and thread-to-thread decoder depend on the corpus type, e.g., Medium has 15 steps in post-to-post and 3 steps in thread-to-thread. In the seq2seq experiments, the source is flattened, and therefore, the number of steps in the source is limited to 300. We initialized all weights, including word embeddings, with a random normal distribution with mean 0 and standard deviation 0.1. The embedding vectors and hidden states of the encoder and decoder in the models are set to dimension 100. Texts are lowercased. The vocabulary size is limited to 8000 and 15000 for the Pubmed and Stack Exchange corpora respectively. We pad short sequences with a special token, 〈PAD〉. We use Adam (Kingma et al., 2014) with an initial learning rate of 0.0001 and a batch size of 64 for training.

C Training Loss

Figure 4: Running average training loss of seq2seq (pink) and hier2hier (gray) on the Stack Exchange Hard corpus.

D Examples


Interleaved sentences:
  [3] botulinum toxin a is effective for treatment . . .
  [3] the trigone is generally spared because of the theoretical . . .
  [3] evaluate efficacy and safety of trigone-including . . .
  [F] most methadone-maintained injection drug users . . .
  [Q] gender-related differences in the incidence of bleeding . . .
  [Q] we studied patients with stemi receiving fibrinolysis . . .
  [F] physicians may be reluctant to treat hcv in idus because . . .
  [Q] outcomes included moderate or severe bleeding defined . . .
  [F] optimal hcv management approaches for idus remain . . .
  [Q] moderate or severe bleeding was 1.9-fold higher . . .
  [F] we are conducting a randomized controlled trial in a network . . .
  [Q] bleeding remained higher in women even after adjustment . . .

Interleaved titles:
  [3] prospective randomised controlled trial comparing trigone-sparing versus trigone-including intradetrusor injection of abobotulinumtoxina for refractory idiopathic detrusor overactivity .
  [F] rationale and design of a randomized controlled trial of directly observed hepatitis c treatment delivered in methadone clinics .
  [Q] comparison of incidence of bleeding and mortality of men versus women with st-elevation myocardial infarction treated with fibrinolysis .

Table 9: The left rows contain an interleaving of 3 articles with 2 to 5 sentences each, and the right rows contain their interleaved titles. Associated sentences and titles are depicted by matching symbols.

Interleaved sentences:
  [3] the effects of short-course antiretrovirals given to . . .
  [Q] good adherence is essential for successful antiretroviral . . .
  [3] women in kenya received short-course zidovudine ( zdv ) . . .
  [3] breast milk samples were collected two to three times weekly . . .
  [F] the present primary analysis of antiretroviral therapy with . . .
  [Q] this was an observational analysis of an open multicenter . . .
  [F] patients with hiv-1 rna at least 5000 copies/ml were . . .
  [Q] at 4-weekly clinic visits , art drugs were provided and . . .
  [F] the primary objective was to demonstrate non-inferiority . . .
  [Q] viral load response was assessed in a subset of patients . . .
  [¨] we explored the link between serum alpha-fetoprotein levels . . .
  [Q] drug possession ratio ( percentage of drugs taken between . . .
  [¨] a low alpha-fetoprotein level ( < 5.0 ng/ml ) was an . . .
  [F] six hundred and eighty-nine patients were randomized . . .
  [F] at 48 weeks , 84 % of drv/r and 78 % of lpv/r patients . . .
  [3] hiv-1 dna was quantified by real-time pcr . . . .
  [¨] serum alpha-fetoprotein measurement should be integrated . . .

Interleaved titles:
  [3] hiv-1 persists in breast milk cells despite antiretroviral treatment to prevent mother-to-child transmission .
  [Q] patterns of individual and population-level adherence to antiretroviral therapy and risk factors for poor adherence in the first year of the dart trial in uganda and zimbabwe .
  [F] efficacy and safety of once-daily darunavir/ritonavir versus lopinavir/ritonavir in treatment-naive hiv-1-infected patients at week 48 .
  [¨] serum alpha-fetoprotein predicts virologic response to hepatitis c treatment in hiv coinfected patients .

Table 10: The left rows contain an interleaving of 4 articles with 2 to 5 sentences each, and the right rows contain their interleaved titles. Associated sentences and titles are depicted by matching symbols.


Interleaved Texts:
  0  botulinum toxin a is effective for treatment of idiopathic detrusor overactivity ( [UNK] )
  1  the [UNK] is generally [UNK] because of the theoretical risk of [UNK] reflux ( [UNK] ) , although studies assessing . . .
  . . .
  3  most [UNK] injection drug users ( idus ) have been infected with hepatitis c virus ( hcv ) , but . . .
  4  [UNK] differences in the incidence of bleeding and its relation to subsequent mortality in patients with st-segment elevation myocardial infarction . . .
  . . .
  8  optimal hcv management approaches for idus remain unknown .
  . . .

Ground Truth / Generation (top 2 attended posts):
  GT:        prospective randomised controlled trial comparing trigone-sparing versus trigone-including intradetrusor injection of abobotulinumtoxina for refractory idiopathic detrusor overactivity .
  Gen (0,1): efficacy of [UNK] [UNK] in patients with idiopathic detrusor overactivity : rationale , design
  GT:        rationale and design of a randomized controlled trial of directly observed hepatitis c treatment delivered in methadone clinics .
  Gen (3,4): validation of a point-of-care hepatitis injection drug injection drug , hcv medication , and
  GT:        comparison of incidence of bleeding and mortality of men versus women with st-elevation myocardial infarction treated with fibrinolysis .
  Gen (4,8): subgroup analysis of patients with st-elevation myocardial infarction with st-elevation myocardial infarction .

Table 11: Interleaved sentences of 3 articles, and the corresponding ground-truth (GT) and hier2hier-generated (Gen) summaries. The indexes of the top 2 sentences attended to (γ) for each generation are given in parentheses. Additionally, the top words attended to (β) for the generation are colored accordingly.

Interleaved Texts:
  0   the effects of short-course antiretrovirals given to reduce mother-to-child transmission ( [UNK] ) on temporal patterns of [UNK] hiv-1 rna
  1   good adherence is essential for successful antiretroviral therapy ( art ) provision , but simple measures have rarely been validated . . .
  2   women in kenya received short-course zidovudine ( zdv ) , single-dose nevirapine ( [UNK] ) , combination [UNK] or short-course . . .
  3   breast milk samples were collected two to three times weekly for 4-6 weeks .
  4   the present primary analysis of antiretroviral therapy with [UNK] examined in naive subjects ( [UNK] ) compares the efficacy and . . .
  . . .
  10  we explored the link between serum [UNK] levels and virologic response in [UNK] [UNK] c virus coinfected patients .
  . . .

Ground Truth / Generation (top 2 attended posts):
  GT:          hiv-1 persists in breast milk cells despite antiretroviral treatment to prevent mother-to-child transmission .
  Gen (0,2):   impact of hiv-1 persists on hiv-1 rna in human immunodeficiency virus-infected individuals with hiv-1
  GT:          patterns of individual and population-level adherence to antiretroviral therapy and risk factors for poor adherence in the first year of the dart trial in uganda and zimbabwe .
  Gen (1,3):   impact of a antiretroviral treatment algorithm on adherence to antiretroviral therapy in [UNK] ,
  GT:          efficacy and safety of once-daily darunavir/ritonavir versus lopinavir/ritonavir in treatment-naive hiv-1-infected patients at week 48 .
  Gen (4,2):   a randomized trial of [UNK] versus [UNK] in treatment-naive hiv-1-infected patients with hiv-1 infection
  GT:          serum alpha-fetoprotein predicts virologic response to hepatitis c treatment in hiv coinfected patients .
  Gen (10,12): predicting virologic response in [UNK] coinfected patients coinfected with hiv-1 : a [UNK] randomized

Table 12: Interleaved sentences of 4 articles, and the corresponding ground-truth (GT) and hier2hier-generated (Gen) summaries. The indexes of the top 2 sentences attended to (γ) for each generation are given in parentheses. Additionally, the top words attended to (β) for the generation are colored accordingly.