Song Lyrics Summarization Inspired by Audio Thumbnailing

Michael Fell, Elena Cabrio, Fabien Gandon, Alain Giboin. Song Lyrics Summarization Inspired by Audio Thumbnailing. RANLP 2019 – Recent Advances in Natural Language Processing, Sep 2019, Varna, Bulgaria. HAL Id: hal-02281138, https://hal.archives-ouvertes.fr/hal-02281138.

Michael Fell, Elena Cabrio, Fabien Gandon, Alain Giboin
Université Côte d'Azur, CNRS, Inria, I3S, France
{firstname.lastname}@inria.fr

Abstract

Given the peculiar structure of songs, applying generic text summarization methods to lyrics can lead to the generation of highly redundant and incoherent text. In this paper, we propose to enhance state-of-the-art text summarization approaches with a method inspired by audio thumbnailing. Instead of searching for the thumbnail clues in the audio of the song, we identify equivalent clues in the lyrics. We then show how these summaries that take into account the audio nature of the lyrics outperform the generic methods according to both an automatic evaluation and human judgments.

1 Introduction

Automatic text summarization is the task of producing a concise and fluent summary while preserving the key information content and overall meaning of a text (Allahyari et al., 2017). Numerous approaches have been developed to address this task and applied widely in various domains including news articles (Cheng and Lapata, 2016), scientific papers (Mei and Zhai, 2008), web content such as blogs (Hu et al., 2007), customer reviews (Pecar, 2018) and social media messages (He and Duan, 2018). Just as we may need to summarize a story, we may also need to summarize song lyrics, for instance to produce adequate snippets for a search engine dedicated to an online song collection or for music digital libraries. From a linguistic point of view, however, lyrics are a very peculiar genre of document, and generic summarization methods may not be appropriate when the input for summarization comes from a specific domain or type of genre, as songs are (Nenkova et al., 2011).

Compared to news documents, for instance, lyrics have a very different structure. Given the repeating forms, the peculiar structure (e.g. the segmentation into verse, chorus, etc.) and other unique characteristics of song lyrics, we need summarization algorithms to take advantage of these additional elements to more accurately identify relevant information in song lyrics. But just as such characteristics enable the exploration of new approaches, other characteristics make the application of summarization algorithms very challenging: the presence of repeated lines, a discourse structure that strongly depends on the interrelation of music and words in the melody composition, the heterogeneity of musical genres, each featuring peculiar styles and wording (Brackett, 1995), and simply the fact that not all songs tell a story.

In this direction, this paper focuses on the following research question: what is the impact of the context in summarizing song lyrics? This question is broken down into two sub-questions: 1) How do generic text summarization methods perform on lyrics? and 2) Can such a peculiar context be leveraged to identify relevant sentences and improve song text summarization? To answer our research questions, we experiment with generic unsupervised state-of-the-art text summarization methods (i.e. TextRank, and a topic distribution based method) to perform lyrics summarization, and show that adding contextual information helps such models to produce better summaries. Specifically, we enhance text summarization approaches with a method inspired by audio thumbnailing techniques that leverages the repetitive structure of song texts to improve summaries. We show how summaries that take into account the audio nature of the lyrics outperform the generic methods according to both an automatic evaluation over 50k lyrics and the judgments of 26 human subjects.

In the following, Section 2 reports on related work. Section 3 presents the lyrics summarization task and the proposed methods. Sections 4 and 5 report on the experiments and on the evaluation, respectively. Section 6 concludes the paper.


2 Summarization Methods

This section reports on related work on both text and audio summarization methods.

2.1 Text Summarization

In the literature, there are two different families of approaches for automatic text summarization: extraction and abstraction (Allahyari et al., 2017). Extractive summarization methods identify important elements of the text and reproduce them verbatim (they depend only on the extraction of sentences or words from the original text). In contrast, abstractive summarization methods interpret and examine the text to generate a new, shorter text that conveys the most critical information from the original text. Even though summaries created by humans are usually not extractive, most summarization research has focused on extractive methods. Purely extractive summaries often give better results (Nallapati et al., 2016), due to the fact that abstractive methods must cope with more complex problems such as semantic representation, inference and natural language generation. Existing abstractive summarizers often rely on an extractive pre-processing component to produce the abstract of the text (Berg-Kirkpatrick et al., 2011; Knight and Marcu, 2000). Consequently, in this paper we focus on extractive summarization methods, also given the fact that i) lyrics make strong use of figurative language, which makes abstractive summarization even more challenging; and ii) the composer's choice of words may also be important for capturing the style of the song.

In the following, we focus on unsupervised methods for text summarization, the ones targeted in our study (no gold standard of human-produced summaries of song texts exists). Most methods have the summary generation process in common: given a text, the importance of each sentence of that text is determined; then, the sentences with the highest importance are selected to form a summary. How different summarizers determine the importance of each sentence may differ. Statistics-based summarizers extract indicator features from each sentence; e.g. (Fattah and Ren, 2009) use, among others, the sentence position and length and named entities as features. Topic-based summarizers aim to represent each sentence by its underlying topics. For instance, (Hennig, 2009) applies Probabilistic Latent Semantic Analysis, while Latent Dirichlet Allocation is used in (Arora and Ravindran, 2008) to model each sentence's distribution over latent topics. Another type of summarization method is graph-based summarizers. Three of the most popular graph-based summarizers are TextRank (Mihalcea and Tarau, 2004), LexRank (Erkan and Radev, 2004), and the system of (Parveen et al., 2015). These methods work by constructing a graph whose nodes are sentences and whose edge weights are sentence similarities; the sentences that are central to the graph are then found by computing PageRank (Page et al., 1999). Contrary to all previously described methods, systems using supervised machine learning form another type of summarizer. For instance, (Fattah, 2014) treats extractive summarization as a binary classification task, extracting indicator features from the sentences of gold summaries and learning to detect the sentences that should be included in a summary.

Context-Specific Summarization. If specific knowledge about the application scenario or the domain of the summarized text is available, generic summarization methods can be adapted to take this prior information into account. In query-based summarization (Otterbacher et al., 2005; Wang et al., 2016), the user's query is taken into account when generating a summary. Summarization of a scientific paper can be improved by considering its citations, as in (Delort et al., 2003). However, to the best of our knowledge, no summarization methods have been proposed for the domain of song texts. In this paper we present a summarization method that uses prior knowledge about the text it summarizes to help generic summarizers generate better summaries.

Evaluation Criteria and Methods. Summaries should i) contain the most important information from the input documents, ii) not contain redundant information, and iii) be readable, hence grammatical and coherent (Parveen and Strube, 2015). While a multitude of methods to identify important sentences has been described above, several approaches aim to make summaries less redundant and more coherent. The simplest way to evaluate summaries is to let humans assess their quality, but this is extremely expensive. The factors that humans must consider when scoring each candidate summary are grammaticality, non-redundancy, integration of the most important pieces of information, structure and coherence (Saggion and Poibeau, 2013). The more common way is to let humans generate (possibly multiple) summaries for a text and then automatically assess how close a machine-made summary is to the human gold summaries by computing ROUGE scores (Lin, 2004), which boils down to measuring n-gram overlaps between the gold summaries and the automatic summary. More recently, there have been attempts to rate summaries automatically without the need for gold summaries (Nenkova et al., 2011). The key idea is that a summary should be similar to the original text with regard to characteristic criteria such as the word distribution. (Mackie et al., 2014) find that topic words are a suitable metric to automatically evaluate microblog summaries.

2.2 Audio Summarization

Lyrics are texts that accompany music. Therefore, it is worthwhile to see if methods from audio summarization can be transferred to lyrics summarization. In audio summarization, the goal is to find the most representative parts of a song; in Pop songs these are usually the chorus and the bridge, in instrumental music the main theme. The task of creating short audio summaries is also known as audio thumbnailing (Bartsch and Wakefield, 2005; Chai and Vercoe, 2003; Levy et al., 2006), as the goal is to produce a short representation of the music that fits onto a thumbnail but still covers its most representative parts. In a recent approach to audio thumbnailing (Jiang and Müller, 2015), the authors generate a Double Thumbnail from a musical piece by finding its two most representative parts. For this, they search for candidate musical segments in an a priori unsegmented song. Candidate musical segments are defined as sequences of music that repeat themselves more or less exactly. The representativeness of each candidate segment with respect to the whole piece is then estimated by their fitness metric. They define the fitness of a segment as a trade-off between how exactly a part is repeated and how much of the whole piece is covered by all repetitions of that segment. The audio segments along with their fitness then allow them to create an audio double thumbnail consisting of the two fittest audio segments.

3 Lyrics Summarization

Song texts are arranged in segments and lines. For instance, the song text depicted in Figure 1 consists of 8 segments and 38 lines. Given a song text S consisting of n lines of text, S = (x_1, ..., x_n), we define the task of extractive lyrics summarization as producing a concise summary sum(S) of the song text, consisting of a subset of the original text lines: sum(S) ⊆ S, where usually |sum(S)| ≪ |S|. The goal of a summary is to preserve the key information and overall meaning of the song text. To address this task, we apply the following methods from the literature: the popular graph-based summarizer TextRank and an adaptation of a topic-based method (TopSum). Moreover, we introduce a method inspired by audio thumbnailing (which we dub Lyrics Thumbnail) that aims at creating a summary from the most representative parts of the original song text. While for TextRank we rely on the off-the-shelf implementation of (Barrios et al., 2016), in the following we describe the other two methods.

3.1 TopSum

We implement a simple topic-based summarization model that aims to construct a summary whose topic distribution is as similar as possible to that of the original text. Following (Kleedorfer et al., 2008), we train a topic model by factorizing a tf-idf-weighted term-document matrix of a song text corpus (see Section 4.2) using non-negative matrix factorization into a term-topic and a topic-document matrix. Given the learnt term-topic matrix, we compute a topic vector t for each new document (song text). In order to treat t as a (pseudo-) probability distribution over latent topics $t_i$, we normalize t by applying $\lambda t.\; t / \sum_{t_i \in t} t_i$ to it. Given the distributions over latent topics for each song text, we then incrementally construct a summary by greedily adding one line from the original text at a time (the same mechanism as in the KLSum algorithm of (Haghighi and Vanderwende, 2009)): the line $x^*$ of the original text that minimizes the distance between the topic distribution $t_S$ of the original text S and the topic distribution of the incremental summary sum(S):

$$x^* = \underset{x \in (S \setminus sum(S))}{\operatorname{argmin}} \; W\bigl(t_S,\, t_{sum(S)+x}\bigr)$$

W is the Wasserstein distance (Villani, 2008) and is used to measure the distance between two probability distributions (an alternative to the Jensen-Shannon divergence (Louis and Nenkova, 2013)).
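To make the greedy construction concrete, below is a minimal sketch in Python, assuming a fitted scikit-learn TfidfVectorizer and NMF model. The helper names (topic_distribution, topsum) are illustrative, not the authors' implementation, and scipy's one-dimensional wasserstein_distance over topic indices is our stand-in for W.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
# assumed setup: vectorizer = TfidfVectorizer().fit(corpus)
#                nmf = NMF(n_components=30).fit(vectorizer.transform(corpus))

def topic_distribution(texts, vectorizer, nmf):
    """Project texts into topic space; normalize to a pseudo-distribution."""
    t = nmf.transform(vectorizer.transform(texts)).sum(axis=0)
    s = t.sum()
    return t / s if s > 0 else np.full(len(t), 1.0 / len(t))

def topsum(lines, vectorizer, nmf, summary_len=4):
    support = np.arange(nmf.n_components_)   # topic indices as support points
    t_full = topic_distribution(lines, vectorizer, nmf)
    summary, candidates = [], list(dict.fromkeys(lines))  # unique lines only
    while len(summary) < summary_len and candidates:
        # greedily add the line that keeps the incremental summary's topic
        # distribution closest (in 1-D Wasserstein distance) to the full text's
        best = min(candidates, key=lambda x: wasserstein_distance(
            support, support, t_full,
            topic_distribution(summary + [x], vectorizer, nmf)))
        summary.append(best)
        candidates.remove(best)
    return [l for l in dict.fromkeys(lines) if l in summary]  # original order
```

Given the fitted models, topsum(lines, vectorizer, nmf) would return 4 lines approximating the topic distribution of the whole text.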


Figure 1: Song text of “Let’s start a band” by Amy MacDonald along with two example summaries.

3.2 Lyrics Thumbnail

Inspired by (Jiang and Müller, 2015), we transfer their fitness measure for audio segments to compute the fitness of lyrics segments. Analogous to an audio thumbnail, we define a Lyrics Thumbnail as the most representative and repetitive part of the song text. Consequently, it usually consists of (a part of) the chorus. In our corpus the segments are annotated (as double line breaks in the lyrics), so unlike in audio thumbnailing, we do not have to induce segments, but rather measure their fitness. In the following, we describe the fitness measure for lyrics segments and how we use it to produce a summary of the lyrics.

Lyrics Fitness. Given a segmented song text S = (S_1, ..., S_m) consisting of text segments S_i, where each S_i consists of |S_i| text lines, we cluster the S_i into partitions of similar segments. For instance, the lyrics in Figure 1 consist of 8 segments and 38 lines, and the chorus cluster consists of {S_5, S_6, S_7}. The fitness Fit of a segment cluster C ⊆ S is defined through the precision pr of the cluster and the coverage co of the cluster. pr describes how similar the segments in C are to each other, while co is the relative amount of lyrics lines covered by C:

$$pr(C) = \Bigl(\sum_{\substack{S_i, S_j \in C\\ i<j}} 1\Bigr)^{-1} \cdot \sum_{\substack{S_i, S_j \in C\\ i<j}} sim(S_i, S_j)$$

$$co(C) = \Bigl(\sum_{S_i \in S} |S_i|\Bigr)^{-1} \cdot \sum_{S_i \in C} |S_i|$$

where sim is a normalized similarity measure between text segments. Fit is the harmonic mean of pr and co. The fitness of a segment S_i is defined as the fitness of the cluster to which S_i belongs:

$$\forall S_i \in C: \quad Fit(S_i) = Fit(C) = \frac{2\, pr(C) \cdot co(C)}{pr(C) + co(C)}$$

For lyrics segments without repetition, the fitness is defined as zero. Based on the fitness Fit for segments, we define a fitness measure for a text line x. This allows us to compute the fitness of arbitrary summaries (with no or unknown segmentation). If the text line x occurs $f_i(x)$ times in text segment S_i, then its line fitness fit is defined as:

$$fit(x) = \Bigl(\sum_{S_i \in S} f_i(x)\Bigr)^{-1} \cdot \sum_{S_i \in S} f_i(x) \cdot Fit(S_i)$$
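The following is a minimal sketch of these fitness computations, assuming segments are given as lists of lines and the clustering has already been done; the string-similarity choice (difflib's SequenceMatcher) is our illustrative stand-in for the normalized measure sim.

```python
from difflib import SequenceMatcher

def sim(seg_a, seg_b):
    """Normalized string similarity in [0, 1] between two segments
    (each a list of lines); an illustrative stand-in for the paper's sim."""
    return SequenceMatcher(None, "\n".join(seg_a), "\n".join(seg_b)).ratio()

def cluster_fitness(cluster, all_segments):
    """Fit(C): harmonic mean of precision pr(C) and coverage co(C)."""
    pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
    if not pairs:                            # no repetition -> fitness 0
        return 0.0
    pr = sum(sim(a, b) for a, b in pairs) / len(pairs)
    co = sum(len(s) for s in cluster) / sum(len(s) for s in all_segments)
    return 2 * pr * co / (pr + co)

def line_fitness(line, segments, segment_fitness):
    """fit(x): occurrence-weighted mean fitness of the segments containing x."""
    counts = [seg.count(line) for seg in segments]
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(c * f for c, f in zip(counts, segment_fitness)) / total
```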

Fitness-Based Summary. Analogous to (Jiang and Müller, 2015)'s audio thumbnails, we create fitness-based summaries for a song text. A Lyrics Double Thumbnail consists of two segments: one from the fittest segment cluster (usually the chorus), and one from the second fittest segment cluster (usually the bridge). We pick the first occurring representative of each segment cluster; which segment to pick from a cluster is a potential question for future work. If the second fittest cluster has a fitness of 0, we generate a Lyrics Single Thumbnail solely from the fittest cluster (usually the chorus). If the generated thumbnail has a length of k lines and we want to produce a summary of p < k lines, we select the p lines in the middle of the thumbnail, following (Chai and Vercoe, 2003)'s "Section-transition Strategy", which they find more likely to capture the "hook" of the music. (They also experiment with other methods to create a thumbnail, such as taking the section-initial or section-ending lines.)
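A sketch of the thumbnail-based summary under these definitions follows; cluster_fitness is the helper from the previous sketch, and the trimming policy for the double-thumbnail case is our reading of the description, not the authors' code.

```python
def lyrics_thumbnail_summary(clusters, all_segments, p=4):
    """Lyrics (Double) Thumbnail summary; clusters are lists of segments
    (segments = lists of lines), cluster_fitness as in the sketch above."""
    ranked = sorted(clusters, key=lambda c: cluster_fitness(c, all_segments),
                    reverse=True)
    thumbnail = list(ranked[0][0])       # first occurring representative (chorus)
    if len(ranked) > 1 and cluster_fitness(ranked[1], all_segments) > 0:
        thumbnail += ranked[1][0]        # double thumbnail: add a bridge segment
    if len(thumbnail) <= p:
        return thumbnail
    start = (len(thumbnail) - p) // 2    # "section-transition": p middle lines
    return thumbnail[start:start + p]
```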

4 Experimental Setting

We now describe the WASABI dataset of song lyrics (Section 4.1) and the tested configurations of the summarization methods (Section 4.2).

4.1 Dataset

From the WASABI corpus (Meseguer-Brocal et al., 2017) we select a subset of 190k unique song texts with available genre information. As the corpus has spurious genres (416 different ones), we focus on the 10 most frequent ones in order to evaluate our methods per genre. We add 2 additional genres from the underrepresented Rap field (Southern Hip Hop and Gangsta Rap). The resulting dataset contains 95k song lyrics.

To define the length of sum(S) (see Section 3), we follow (Bartsch and Wakefield, 2005), who recommend creating audio thumbnails of the median chorus length over the whole corpus. We therefore estimate the median chorus length on our corpus by computing a Lyrics Single Thumbnail on each text, and we find the median chorus length to be 4 lines. Hence, we generate summaries of this length for all lyrics and all summarization models, to exclude a length bias from the comparison of methods. (We leave the study of other measures to estimate the summary length to future work.) As the length of the lyrics thumbnail is lower-bounded by the length of the chorus in the song text, we keep only those lyrics with an estimated chorus length of at least 4 lines. The final corpus of 12 genres consists of 50k lyrics with the following genre distribution: Rock: 8.4k, Country: 8.3k, Alternative Rock: 6.6k, Pop: 6.9k, R&B: 5.2k, Indie Rock: 4.4k, Hip Hop: 4.2k, Hard Rock: 2.4k, Punk Rock: 2k, Folk: 1.7k, Southern Hip Hop: 281, Gangsta Rap: 185.

4.2 Models and Configurations

We create summaries using the three summarization methods described in Section 3, i.e. a graph-based (TextRank), a topic-based (TopSum), and a fitness-based (Lyrics Thumbnail) method, plus two additional combined models (described below). While the Lyrics Thumbnail is generated from the full segment structure of the lyrics, including its duplicate lines, all other models are fed with unique text lines as input (i.e. redundant lines are deleted). This is done to produce less redundant summaries, given that, for instance, TextRank scores each duplicate line the same, hence it may create summaries consisting of all identical lines. TopSum can suffer from a similar shortcoming: if there is a duplicate line close to the ideal topic distribution, adding that line again will let the incremental summary under construction stay close to the ideal topic distribution. All models were instructed to produce summaries of 4 lines, as this is the estimated median chorus length in our corpus (see Section 4.1). The summary lines were arranged in the same order in which they appear in the original text (in case of repeated parts, the first position of each line was used as its original position).

We use the TextRank implementation of (Barrios et al., 2016) (https://github.com/summanlp/textrank) without removing stop words (lyrics lines can be quite short, so removing stop words could strip a line of all its content). The topic model for TopSum is built using non-negative matrix factorization with scikit-learn (Pedregosa et al., 2011) (https://scikit-learn.org, loss='kullback-leibler') for 30 topics on the full corpus of 190k lyrics. For the topical distance, we only consider the distance between the 3 most relevant topics of the original text, following the intuition that one song text usually covers only a small number of topics. The Lyrics Thumbnail is computed using a string-based distance between text segments to facilitate clustering; this similarity has been shown in (Watanabe et al., 2016) to successfully indicate segment borders. In our implementation, segments are clustered using the DBSCAN algorithm (Ester et al., 1996) (eps=0.3, min_samples=2). We also produce two summaries by combining TextRank + TopSum and TextRank + TopSum + Lyrics Thumbnail, to test whether summaries can benefit from the complementary perspectives the three different summarization methods take.
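As an illustration of the clustering step in this configuration, the following sketch clusters segments with DBSCAN over a precomputed string-distance matrix, using the parameter values reported above (eps=0.3, min_samples=2); the distance function itself is an assumption on our part.

```python
import numpy as np
from difflib import SequenceMatcher
from sklearn.cluster import DBSCAN

def segment_distance(seg_a, seg_b):
    """1 - normalized string similarity (illustrative distance choice)."""
    return 1.0 - SequenceMatcher(None, seg_a, seg_b).ratio()

def cluster_segments(segments):
    """Cluster lyrics segments (each one string) into groups of near-
    repetitions; DBSCAN label -1 marks segments without repetition."""
    n = len(segments)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = segment_distance(segments[i], segments[j])
    return DBSCAN(eps=0.3, min_samples=2,
                  metric="precomputed").fit(dist).labels_
```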

Model Combination. For any lyrics line, we can obtain a score from each of the applied methods: TextRank provides a score for each line, TopSum provides a distance between the topic distributions of an incremental summary and the original text, and fit provides the fitness of each line. We treat our summarization methods as black boxes and use a simple method to combine the scores the different methods provide for each line. Given the original text separated into lines S = (x_1, ..., x_n), a summary is constructed by greedily adding one line $x^*$ at a time to the incremental summary sum(S) ⊆ S such that the sum of the normalized ranks of all scores is minimal:

$$x^* = \underset{x \in (S \setminus sum(S))}{\operatorname{argmin}} \; \sum_{A} R_A(x)$$

where A ∈ {TextRank, TopSum, fit}. The normalized rank $R_A(x)$ of the score that method A assigns to line x is computed as follows: first, the highest scores are assigned rank 0, the second highest scores get rank 1, and so forth (in the case of the topical distance, a "higher score" means a lower value). Then the ranks are linearly scaled to the [0, 1] interval, so each sum of ranks $\sum_A R_A(x)$ is in [0, 3].
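A minimal sketch of this combination scheme follows, assuming each method is wrapped as a scoring function where higher is better (topical distances negated beforehand); tie handling is simplified with respect to the shared-rank description above, and the names are ours.

```python
import numpy as np

def normalized_ranks(scores):
    """Best score -> rank 0, next -> rank 1, ...; ranks scaled to [0, 1].
    (Ties get distinct ranks here; the paper gives equal scores equal rank.)"""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)              # descending: best first
    ranks = np.empty_like(scores)
    ranks[order] = np.arange(len(scores))
    return ranks / max(len(scores) - 1, 1)

def combine_summarizers(lines, score_fns, summary_len=4):
    """Greedy selection minimizing the sum of normalized ranks over methods;
    each score_fn takes (candidate_line, current_summary)."""
    summary, candidates = [], list(lines)
    while len(summary) < summary_len and candidates:
        rank_sums = sum(normalized_ranks([fn(x, summary) for x in candidates])
                        for fn in score_fns)
        summary.append(candidates.pop(int(np.argmin(rank_sums))))
    return summary
```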

Model Nomenclature. For brevity, we henceforth refer to the TextRank model as M_r, the TopSum model as M_s, and the fitness-based summarizer as M_f; the model combinations are M_rs and M_rsf, respectively.

5 Evaluation

We evaluate the quality of the produced lyrics summaries both by soliciting human judgments on the goodness and utility of a given summary (Section 5.1) and through an automatic evaluation of the summarization methods (Section 5.2), to provide a comprehensive evaluation.

5.1 Human Evaluation

We performed a human evaluation of the different summarization methods introduced above by asking participants to rate the different summaries presented to them, specifying their agreement or disagreement with the following standard criteria (Parveen and Strube, 2015):

Informativeness: The summary contains the main points of the original song text.

Non-redundancy: The summary does not contain duplicate or redundant information.

Coherence: The summary is fluent to read and grammatically correct.

We add one criterion coming from our definition of the lyrics summarization task:

Meaning: The summary preserves the meaning of the original song text.

An experimental psychologist, expert in Human-Computer Interaction, advised us in defining the questionnaire and setting up the experiment. 26 participants (12 nationalities; 18 men, 8 women; aged from 21 to 59) took a questionnaire (Google Forms) consisting of rating 30 items with respect to the criteria defined above on a Likert scale from 1 (low) to 5 (high). Each participant was presented with 5 different summaries, each produced by one of the previously described summarization models, for 6 different song texts. Participants were given example ratings for the different criteria in order to familiarize them with the procedure. Then, for each song text, the original song text along with its 5 summaries was presented in random order and had to be rated according to the above criteria. For the criterion of Meaning, we asked participants to give a short free-text explanation for their score. The selected 6 song texts ("Pills N Potions" by Nicki Minaj, "Hurt" by Nine Inch Nails, "Real to Me" by Brian McFadden, "Somebody That I Used To Know" by Gotye, "Receive" by Alanis Morissette, and "Let's Start A Band" by Amy MacDonald) have a minimum and a median chorus length of 4 lines and come from different genres, i.e. Pop/Rock (4), Folk (1) and Rap (1), similar to our corpus genre distribution. Song texts were selected to cover different lengths (18-63 lines), genders of singer (3 male, 3 female), topics (family, life, drugs, relationship, depression), and moods (depressive, angry, hopeful, optimistic, energetic). The artist name and song title were not shown to the participants.

Results. Figure 2 shows the ratings obtained for each criterion. We examine the significant differences between the models' performances by performing a paired two-tailed t-test; the significance levels are 0.05 (*), 0.01 (**), 0.001 (***), and n.s. First, Informativeness and Meaning are rated higher** for the combined model M_rs compared to the single models M_r and M_s. Combining all three models improves the summaries further: for both Informativeness and Meaning, the model M_rsf is rated higher*** than M_rs. Further, summaries created by M_rsf are rated higher*** in Coherence than summaries from any other model, except M_f (n.s. difference). Summaries are rated on the same level (n.s. differences) for Non-redundancy in all but the M_r and M_f summaries, which are perceived as lower*** in Non-redundancy than all others. Note how the model M_rsf is more stable than all others, exhibiting lower standard deviations in all criteria except Non-redundancy.


Figure 2: Human ratings per summarization model in terms of average and standard deviation.

The criteria Informativeness and Meaning are highly correlated (Pearson correlation coefficient 0.84); correlations between the other criteria range between 0.29 and 0.51.

Overall, leveraging the Lyrics Fitness in a song text summary improves summary quality. Especially with respect to the criteria that, we believe, indicate summary quality the most, Informativeness and Meaning, the M_rsf method performs significantly better and more consistently.

Figure 1 shows an example song text and example summaries from the experiment. Summary 1 is generated by M_f and consists of the chorus. Summary 2 is made by the method M_rsf, contains relevant parts of the verses and the chorus, and was rated much higher in Informativeness and Meaning. We analyzed the free text written by the participants to comment on the Meaning criterion, but no relevant additional information was provided (the participants mainly restated their ratings).

5.2 Automatic Evaluation

We computed four different indicators of summary quality on the dataset of 50k songs described in Section 4.1. Three of the criteria use the similarity between probability distributions P, Q, by which we mean that we compute the Wasserstein distance between P and Q (cf. Section 3.1) and apply $\lambda x.\; x^{-1}$ to it (this works since we always deal with distances > 0). The criteria are:

Distributional Semantics: similarity between the word distributions of the original and the summary, cf. (Louis and Nenkova, 2013). We give results relative to the similarity of the best performing model (=100%).

Topical: similarity between the topic distributions of the original and the summary, restricted to the 3 most relevant topics of the original song text. We give results relative to the similarity of the best performing model (=100%).

Coherence: average similarity between the word distributions in consecutive sentences of the summary, cf. (ShafieiBavani et al., 2018). We give results relative to the coherence of the original song text (=100%).

Lyrics fitness: average line-based fitness fit (cf. Section 3) of the lines in the summary. We give results relative to the Lyrics fitness of the original song text (=100%).
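As an illustration, here is a minimal sketch of the inverse-Wasserstein similarity and the Coherence criterion, assuming word distributions are given as aligned numpy arrays; the function names are ours.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def inv_wasserstein_similarity(p, q):
    """Similarity = 1 / Wasserstein distance (distances are > 0 in practice)."""
    support = np.arange(len(p))
    d = wasserstein_distance(support, support, p, q)
    return 1.0 / max(d, 1e-12)   # guard against identical distributions

def coherence(sentence_word_dists):
    """Average similarity between word distributions of consecutive sentences."""
    sims = [inv_wasserstein_similarity(p, q)
            for p, q in zip(sentence_word_dists, sentence_word_dists[1:])]
    return sum(sims) / len(sims) if sims else 0.0
```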

Results. When evaluating each of the 12 genres, we found two clusters of genres that behave very similarly. Therefore, we report the results for these two groups: the Rap genre cluster contains Hip Hop, Southern Hip Hop, and Gangsta Rap; the Rock/Pop cluster contains the 9 other genres. The results of the different automatic evaluation metrics are shown in Table 1. Distributional Semantics metrics have previously been shown (Louis and Nenkova, 2013; ShafieiBavani et al., 2018) to correlate highly with user responsiveness judgments. We would therefore expect correlations of this metric with the Informativeness or Meaning criteria, as those criteria are closest to responsiveness, but we have found no large differences between the different models for this criterion. The summaries of the M_s model have the highest similarity to the original text and those of M_f the lowest (90%). The differences between the highest and lowest values are small.

| Evaluation criterion | Genre | M_r | M_s | M_rs | M_f | M_rsf | original text |
|---|---|---|---|---|---|---|---|
| Distributional Semantics [%] | Rock / Pop | 92 | 100 | 97 | 90 | 93 | n/a |
| | Rap | 94 | 100 | 99 | 86 | 92 | |
| | Σ | 92 | 100 | 98 | 90 | 93 | |
| Topical [%] | Rock / Pop | 44 | 100 | 76 | 41 | 64 | n/a |
| | Rap | 58 | 100 | 80 | 48 | 66 | |
| | Σ | 46 | 100 | 77 | 42 | 64 | |
| Coherence [%] | Rock / Pop | 110 | 95 | 99 | 99 | 100 | 100 |
| | Rap | 112 | 115 | 112 | 107 | 107 | |
| | Σ | 110 | 97 | 101 | 100 | 101 | |
| Lyrics fitness [%] | Rock / Pop | 71 | 53 | 63 | 201 | 183 | 100 |
| | Rap | 0 | 0 | 0 | 309 | 249 | |
| | Σ | 62 | 47 | 55 | 214 | 191 | |

Table 1: Automatic evaluation results for the 5 summarization models and 2 genre clusters. Distributional Semantics and Topical are relative to the best model (=100%); Coherence and Lyrics fitness are relative to the original text (=100%).

For the Topical similarity, the results are mostly in the same order as the Distributional Semantics ones, but with much larger differences. While the M_s model reaches the highest similarity, this is a self-fulfilling prophecy, as the summaries of M_s were generated with the objective of maximizing topical similarity. The other two models that incorporate M_s (M_rs and M_rsf) show a much higher topical similarity to the original text than M_r and M_f.

Coherence is rated best for M_r, at 110%. All other models show a coherence close to that of the original text, between 97% and 101%. We believe that the increased coherence of M_r is not linguistically founded, but merely algorithmic: M_r produces summaries of the most central sentences in a text, and this centrality builds on the concept of sentence similarity. Therefore, M_r implicitly optimizes for the automatic coherence metric, which is based on the similarity of consecutive sentences. Sentence similarity seems to be insufficient to predict human judgments of coherence in this case.

As might be expected, the methods explicitly incorporating the Lyrics fitness produce summaries with a fitness much higher than the original text: 214% for the M_f and 191% for the M_rsf model. The methods not incorporating fitness produce summaries with a much lower fitness than the original: M_r 62%, M_s 47%, and M_rs 55%. In the Rap genre this fitness is even zero, i.e. summaries (in the median) contain no part of the chorus.

Overall, no single automatic evaluation criterion was able to explain the judgments of our human participants. However, considering Topical similarity and fitness together gives us a hint. The model M_f has a high fitness (214%) but a low Topical similarity (42%). The M_s model has the highest Topical similarity (100%) but a low fitness (47%). M_rsf might be preferred by humans as it strikes a balance between Topical similarity (64%) and fitness (191%). Hence, M_rsf succeeds in capturing lines from the most relevant parts of the lyrics, such as the chorus, while jointly representing the important topics of the song text.

6 Conclusion

In this paper we have defined and addressed the task of lyrics summarization. We have applied both generic unsupervised text summarization methods (TextRank and a topic-based method we called TopSum) and a method inspired by audio thumbnailing to 50k lyrics from the WASABI corpus. We have carried out an automatic evaluation of the produced summaries, computing standard metrics in text summarization, and a human evaluation with 26 participants, showing that with a fitness measure transferred from the musicology literature we can amend generic text summarization algorithms and produce better summaries.

In future work, we will model the importance of a line given its segment, to avoid cutting off important parts of the chorus, as we sometimes observed. Moreover, we plan to address the challenging task of abstractive summarization of song lyrics, with the goal of creating prose-style summaries of song texts, more similar to what humans would do, using their own words.

Acknowledgement

This work is partly funded by the French National Research Agency (ANR) under the WASABI project (contract ANR-16-CE23-0017-01).


References

Mehdi Allahyari, Seyed Amin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. Text summarization techniques: A brief survey. CoRR, abs/1707.02268.

Rachit Arora and Balaraman Ravindran. 2008. Latent Dirichlet allocation based multi-document summarization. In Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, pages 91–97. ACM.

Federico Barrios, Federico López, Luis Argerich, and Rosa Wachenchauzer. 2016. Variations of the similarity function of TextRank for automated summarization. CoRR, abs/1602.03606.

Mark A. Bartsch and Gregory H. Wakefield. 2005. Audio thumbnailing of popular music using chroma-based representations. IEEE Transactions on Multimedia, 7(1):96–104.

Taylor Berg-Kirkpatrick, Dan Gillick, and Dan Klein. 2011. Jointly learning to extract and compress. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 481–490, Stroudsburg, PA, USA. Association for Computational Linguistics.

David Brackett. 1995. Interpreting Popular Music. Cambridge University Press.

Wei Chai and Barry Vercoe. 2003. Music thumbnailing via structural analysis. In Proceedings of the Eleventh ACM International Conference on Multimedia, pages 223–226.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 484–494. Association for Computational Linguistics.

Jean-Yves Delort, Bernadette Bouchon-Meunier, and Maria Rifqi. 2003. Enhanced web document summarization using hyperlinks. In Proceedings of the Fourteenth ACM Conference on Hypertext and Hypermedia, HYPERTEXT '03, pages 208–215, New York, NY, USA. ACM.

Güneş Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231.

Mohamed Abdel Fattah. 2014. A hybrid machine learning model for multi-document summarization. Applied Intelligence, 40(4):592–600.

Mohamed Abdel Fattah and Fuji Ren. 2009. GA, MR, FFNN, PNN and GMM based models for automatic text summarization. Computer Speech and Language, 23(1):126–144.

Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 362–370. Association for Computational Linguistics.

Ruifang He and Xingyi Duan. 2018. Twitter summarization based on social network and sparse reconstruction. In AAAI.

Leonhard Hennig. 2009. Topic-based multi-document summarization with probabilistic latent semantic analysis. In Proceedings of the International Conference RANLP-2009, pages 144–149.

Meishan Hu, Aixin Sun, and Ee-Peng Lim. 2007. Comments-oriented blog summarization by sentence extraction. In Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, CIKM '07, pages 901–904, New York, NY, USA. ACM.

Nanzhu Jiang and Meinard Müller. 2015. Estimating double thumbnails for music recordings. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 146–150.

Florian Kleedorfer, Peter Knees, and Tim Pohle. 2008. Oh oh oh whoah! Towards automatic topic detection in song lyrics. In ISMIR, pages 287–292.

Kevin Knight and Daniel Marcu. 2000. Statistics-based summarization - step one: Sentence compression. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 703–710. AAAI Press.

Mark Levy, Mark Sandler, and Michael Casey. 2006. Extraction of high-level musical structure from audio data and its application to thumbnail generation. In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, volume 5, pages V–V. IEEE.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.

Annie Louis and Ani Nenkova. 2013. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2).

Stuart Mackie, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2014. On choosing an effective automatic evaluation metric for microblog summarisation. In Proceedings of the 5th Information Interaction in Context Symposium, pages 115–124. ACM.

Qiaozhu Mei and ChengXiang Zhai. 2008. Generating impact-based summaries for scientific literature. In ACL.

Gabriel Meseguer-Brocal, Geoffroy Peeters, Guillaume Pellerin, Michel Buffa, Elena Cabrio, Catherine Faron Zucker, Alain Giboin, Isabelle Mirbel, Romain Hennequin, Manuel Moussallam, Francesco Piccoli, and Thomas Fillon. 2017. WASABI: a Two Million Song Database Project with Audio and Cultural Metadata plus WebAudio enhanced Client Applications. In Web Audio Conference 2017 – Collaborative Audio #WAC2017, London, United Kingdom. Queen Mary University of London.

Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2016. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI.

Ani Nenkova, Kathleen McKeown, et al. 2011. Automatic summarization. Foundations and Trends in Information Retrieval, 5(2–3):103–233.

Jahna Otterbacher, Güneş Erkan, and Dragomir R. Radev. 2005. Using random walks for question-focused sentence retrieval. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 915–922. Association for Computational Linguistics.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.

Daraksha Parveen, Hans-Martin Ramsl, and Michael Strube. 2015. Topical coherence for graph-based extractive summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1949–1954.

Daraksha Parveen and Michael Strube. 2015. Integrating importance, non-redundancy and coherence in graph-based extractive summarization. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI '15, pages 1298–1304. AAAI Press.

Samuel Pecar. 2018. Towards opinion summarization of customer reviews. In Proceedings of ACL 2018, Student Research Workshop, pages 1–8. Association for Computational Linguistics.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Horacio Saggion and Thierry Poibeau. 2013. Automatic Text Summarization: Past, Present and Future, pages 3–21. Springer, Berlin, Heidelberg.

Elaheh ShafieiBavani, Mohammad Ebrahimi, Raymond Wong, and Fang Chen. 2018. Summarization evaluation in the absence of human model summaries using the compositionality of word embeddings. In Proceedings of the 27th International Conference on Computational Linguistics, pages 905–914. Association for Computational Linguistics.

Cédric Villani. 2008. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.

Lu Wang, Hema Raghavan, Vittorio Castelli, Radu Florian, and Claire Cardie. 2016. A sentence compression based framework to query-focused multi-document summarization. arXiv preprint arXiv:1606.07548.

Kento Watanabe, Yuichiroh Matsubayashi, Naho Orita, Naoaki Okazaki, Kentaro Inui, Satoru Fukayama, Tomoyasu Nakano, Jordan Smith, and Masataka Goto. 2016. Modeling discourse segments in lyrics using repeated patterns. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1959–1969.