
What is in the news on a subject: automatic and sparse summarization of large document corpora

Luke Miratrix 1*   Jinzhu Jia 2*   Brian Gawalt 3   Bin Yu 1,3   Laurent El Ghaoui 3

1 Department of Statistics, UC Berkeley, Berkeley, CA 94720
2 School of Mathematical Sciences and Center for Statistical Science, Peking University, Beijing, China, 100871
3 Department of EECS, UC Berkeley, Berkeley, CA 94720

July 26, 2011

Abstract

News media play a significant role in our political and daily lives. The traditional approach in media analysis to news summarization is labor intensive. As the amount of news data grows rapidly, the need is acute for automatic and scalable methods to aid media analysis researchers so that they could screen corpora of news articles very quickly before detailed reading.

In this paper we propose a general framework for subject-specific summarization of document corpora with news articles as a special case. We use the state-of-the-art scalable and sparse statistical predictive framework to generate a list of short words/phrases as a summary of a subject. In particular, for a particular subject of interest (e.g., China), we first create a list of words/phrases to represent this subject (e.g., China, Chinas, and Chinese) and then create automatic labels for each document depending on the appearance pattern of this list in the document. The predictor vector is then high dimensional and contains counts of the rest of the words/phrases in the documents, excluding phrases overlapping the subject list. Moreover, we consider several pre-processing schemes, including document unit choice, labeling scheme, tf-idf representation and L2 normalization, to prepare the text data before applying the sparse predictive framework. We examined four different scalable feature selection methods for summary list generation: phrase Co-occurrence, phrase correlation, L1-regularized logistic regression (L1LR), and L1-regularized linear regression (Lasso).

We carefully designed and conducted a human survey to compare the different summarizers with human understanding based on news articles from the New York Times international section in 2009. There are three main findings: first, the Lasso is the best overall choice of feature selection method; second, the tf-idf representation is a strong choice of vector space, but only for longer units of text such as articles in newspapers; and third, the L2 normalization is the best for shorter units of text such as paragraphs.

* Miratrix and Jia are co-first authors.

Keywords: text summarization, high-dimensional analysis, sparse modeling, Lasso, L1 regularized logistic regression, co-occurrence, tf-idf, L2 normalization.

1 Introduction

Joseph Pulitzer wrote in his last will and testament, "[Journalism] is a noble profession and one of unequaled importance for its influence upon the minds and morals of the people." Faced with an overwhelming amount of world news, concerned citizens, media analysts and decision-makers alike would greatly benefit from scalable, efficient methods that extract compact summaries of how subjects of interest are covered in text corpora. These compact summaries could be used by the interested people for screening corpora of news articles before detailed reading or further investigation.

We propose a novel approach to perform automatic, subject-specific summarization of news articles that applies to general text corpora. Our approach allows researchers to easily and quickly explore large volumes of news articles. These methods readily generalize to other types of documents. For example, El Ghaoui et al. (2011) identified potentially dangerous aspects of specific airports by using these methods on pilots' flight logs. We are currently implementing our approach within an on-line toolkit, SnapDragon,1 which will soon be available to researchers interested in screening corpora of documents relative to a subject matter.

News media significantly drives the course of world events. By choosing which events to report and the manner in which to report them, the media affects the sentiments of readers and through them the wider world. Exposure to news can drive change in society (Miller and Krosnick, 2000; Nisbet and Myers, 2010; Hopkins and King, 2010), and, even when controlling for topic, can vary in tone, emphasis, and style (Branton and Dunaway, 2009; Gilens and Hertzman, 2000; D'Alessio and Allen, 2000). Our focus on news media is motivated by a crucial need in a democracy: to understand precisely where and how these media variations occur.

1 http://statnews2.eecs.berkeley.edu/snapdragon


Currently, media analysis is often conducted by hand coding (Wahl et al., 2002; Denham, 2004; Potter, 2009). Hand-coding manually reduces complex text data to a handful of quantitative variables, allowing for statistical analysis such as regression or simple tests of difference. It is prohibitively labor-intensive (Hopkins and King, 2010). In personal correspondence, Denham relayed that each article took roughly fifteen minutes to analyze, suggesting about 28 hours of time for their full text corpus of 115 articles.

In the last five years we have seen the emergence of a computational social science field connecting statistics and machine learning to anthropology, sociology, public policy, and more (Lazer et al., 2009). Many organizations have introduced automated summary methods: Google news trends, Twitter's trending queries, Crimson Hexagon's brand analysis and others all use computation to make sense of the vast volumes of text now publicly generated. These approaches, discussed below, illustrate the potential of computation for news media analysis.

Summarization by extraction. There are two major approaches to text analysis, key-phrase extraction (listing key-phrases for the document such as in Rose et al. (2010); Senellart and Blondel (2008); Frank et al. (1999); Chen et al. (2006)) and sentence extraction (identifying and presenting the "most relevant" sentences of a document as a summary, such as in Hennig (2009); Goldstein et al. (2000); Neto et al. (2002)). Both these approaches score potential key-phrases or sentences found in the text and then select the highest scorers as the summary. This line of research has primarily focused on summarizing individual documents, with one summary for every document in a corpus.

However, when there are multiple documents, even a short summary of each document adds up quickly. Content can be buried in a sea of summaries if most documents are not directly related to the subject of interest. If many documents are similar, the collection of summaries becomes redundant. Moreover, if the subject of interest is usually mentioned in a secondary capacity, it might be missing entirely from the summaries. To address some of these problems, Goldstein et al. (2000) worked on summarizing multiple documents at once to remove redundancy. Under their system, sentences are scored and selected sequentially, with future sentences penalized by similarity to previously selected sentences. In this system, the documents need to be first clustered by overall topic.

Hennig (2009) fits a latent topic model (similar to LDA, discussed below) for subject-specific summarization of documents. Here the subject is represented as a set of documents and a short narrative of the desired content. All units of text are projected into a latent topic space that is learned from the data independent of the subject, and then sentences are extracted by a scoring procedure comparing the similarity of the latent representations of the sentences to the subject. Although we also summarize an entire collection of documents as they pertain to a specific subject of interest, we do not use a latent space representation of the data. In Monroe et al. (2008), the authors merge all text into two super-documents and then score individual words based on their differing rates of appearance, normalized by their overall frequency. We analyze the corpus through individual document units.

Summarization via topic modeling. Some analysis algorithms take text information as input and produce a model, usually generative, fit to the data. The model itself captures structure in the data, and this structure can be viewed as a summary. A popular example is the latent Dirichlet allocation (Blei et al., 2003), which posits that each word observed in the text is standing in for a hidden, latent "topic" variable. These models are complex and dense, with all the words playing a role in all the topics, but one can still take the most prominent words in a topic as the summary.

Chang et al. (2009) had humans evaluate the internal cohesion of learned topics. Respondents were asked to identify "impostor" words inserted into lists of words representing a topic. This showed these approaches as producing cogent and reasonable topics. Supervised versions (Blei and McAuliffe, 2008) of these methods can be used to summarize a subject of interest.

Although these methods are computationally expensive and produce dense models requiring truncation for interpretability, they are powerful indications of the capabilities of computer-assisted summarization. These methods analyze the corpus as a whole and model how the documents cover a modest number of organically grown topics. We opt instead for a more directed process of summarizing a particular, specified subject (out of possibly millions).

Other automated approaches. Google Trend charts are calculated by comparing the number of times a subject appears in the news outlets that Google compiles to the overall volume of news for a specified time period. Even this simple approach can show how subjects enter and leave public discourse across time. Twitter's trending topics appears to operate similarly, although it selects the hottest topics by those which are gaining in frequency most quickly. Although neither of these tools summarizes a specified subject, they are similar in spirit to the normalized simpler methods (co-occur and correlation screen) that we will introduce and investigate in this paper.

Hopkins and King (2010) extrapolate from a potentially non-random sample of hand-coded documents to estimate the proportion of documents in several pre-defined categories. This can be used for sentiment analysis (e.g., estimating the proportion of blogs showing approval for some specified public figure). We instead identify key-phrases most associated with a given subject. These key-phrases could then be analyzed directly for sentiment, thus reducing the amount of hand-coding required. Their work is behind Crimson Hexagon, a company currently offering brand analysis to several companies.

In a similar spirit, we believe there is opportunity to answer the question, "What is being said in the news regarding China?" or, more generally, "What is discussed in this corpus of documents regarding subject A?" using machine learning techniques.

Our predictive and sparse approach with human evaluation. In this paper, we propose to use statistical machine learning tools such as Lasso that are fast, sparse, and different from those described earlier to produce short and interpretable summaries. Our proposal is desirable because media analysis (or general document summarization) tools need to encourage exploration, allowing researchers to easily examine how different topics are portrayed in a corpus or how this portrayal evolves over time or compares across different corpora.

Given a corpus of documents (e.g., the New York Times International Section articles in 2009) as well as a subject of interest (e.g., subject China as represented as a short list of words China, Chinas and Chinese), we establish the predictive framework by first automatically labeling documents into positive and negative examples by, for example, determining if they contain words from the subject list. Counts of words/phrases not on the subject list then form the predictor vectors for the documents. After normalizing these vectors, we use scalable, reproducible prediction and classification techniques to identify a small set of words and phrases that best predict the subject as it appears. This overall arc reduces corpora of many millions of words into a few representative key-phrases that constitute how the given subject is treated in the corpus.

To validate these summaries we cannot use traditional machine learning approaches, however, since traditional measures have no guarantee of correlating with actual meaning. We therefore compare the different summary approaches with a human survey. We find that sparse methods such as the Lasso indeed produce higher quality summaries than many currently used, simpler, methods. Moreover, we conducted usability testing to investigate how different preprocessing techniques differ in quality of resulting lists in the user survey. We found the choice of preprocessing important, especially with simpler summarization methods. The sparse methods such as the Lasso, however, are more robust to potential mistakes made in the data preparation step.

To illustrate, consider how the New York Times treated China (represented as "china, chinas, chinese") in the international section in 2009. One of our summary methods, L1LR, yields a short list of terms: "beijing, contributed research, global, hu jintao, imports, of xinjiang, peoples liberation army, shanghai, sichuan province, staterun, tibet, trade, uighurs, wen jiabao, xinhua news agency". This succinct summary captures main relevant personalities (e.g., Wen Jiabao, Hu Jintao), associated countries and areas (e.g., Uighurs, Tibet), entities (Xinhua news), and topics (trade, imports, global [activity], state-run [organizations]). The presence of "contributed research" indicates that other people, in addition to the authors, contributed to many of the Times articles on China. These terms inform interested readers, including China experts, about how China is being treated by the New York Times and suggest directions for further reading on topics such as Sichuan, Xinjiang, Tibet, and trade. Table 2 contains four other sample summaries.

The rest of the paper is organized as follows. Section 2.1 describes our proposal including the predictive framework and the preprocessing choices used in the study. Section 2.2 proposes different key-phrase selectors (e.g., Lasso, co-occurrence) one might use to generate the final summary. We then describe the validation experiment designed to examine the performance of the summarizers built from these choices in Section 3. Results of this experiment are shown in Section 4. Section 5 concludes.

2 Our approach: predictive, fast, and sparse

Our approach is based on a predictive binary classification framework. In a typical binary classification scenario, data units (e.g., news articles or paragraphs) belong to two classes and features of a data unit are used to predict its class membership. Classification of text documents using the phrases in those documents as features is familiar and well-studied (Genkin et al., 2007; Zhang and Oles, 2001).

We turn subject-specific (e.g., China) summarization into a binary classification problem by forming two classes, that is, we automatically label news articles (or other document units) in a corpus as subject-related and irrelevant. See Section 2.1.3, where we discuss several different ways of labeling. We then use a predictive classifier to generate a summary list. To be precise, we take those words and phrases most important for classification as a summary of the subject relative to the corpus.

A predictive framework consists of n units, each with a class label y_i ∈ {−1, +1} and a collection of p possible features that can be used to predict this class label. Each unit i ∈ I ≡ {1, . . . , n} is attributed a value x_ij for each feature j ∈ J ≡ {1, . . . , p}. These x_ij form an n × p matrix X. The n units are blocks of text taken from the corpus (e.g., entire articles or individual paragraphs), the class labels y_i indicate whether document unit i contains content on a subject of interest, and the features are all the possible key-phrases that could be used to summarize the subject. As mentioned earlier, we consider several ways of automatically labeling or assigning class memberships based on the document unit itself in Section 2.1.3. Matrix X and vector y can be built in several ways. We build X by reweighting the elements of a document-term matrix C:

Definition. A document-term matrix C sets

C_ij := the number of times key-phrase j appears in document i.

This is often called the bag-of-phrases model: each document is represented as a vector with the jth element being the total number of times that the specific phrase j appears in the document. Stack these row vectors to make the matrix C ∈ R^{n×p} of counts. C has one row for each document and one column for each phrase. C tends to be highly sparse: most entries are 0.

To transform raw text into this vector space, convert it to a collection of individual text document units, establish a dictionary of possible phrases, and count how often each of the dictionary's phrases appears in each of the document units. Once this is completed, the summarizing process consists of 3 major steps (sketched in code after the list below):

1. Reweight: build X from C;

2. Label: build y by identifying which document units in the corpus likely treat the specified subject;

3. Select: extract a list of phrases that ideally summarize the subject.
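To make the three steps concrete, here is a minimal sketch of the pipeline in Python. It is an illustration only, not the authors' Matlab/C implementation; the helper names (build_C, summarize) and the use of scikit-learn's CountVectorizer are our own assumptions.

```python
# Illustrative sketch of the reweight / label / select pipeline (not the paper's code).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def build_C(documents):
    """Bag-of-phrases counts: all unigrams, bigrams, and trigrams, lowercased."""
    vectorizer = CountVectorizer(ngram_range=(1, 3), lowercase=True)
    C = vectorizer.fit_transform(documents)            # sparse n x p count matrix
    return C, np.array(vectorizer.get_feature_names_out())

def summarize(documents, subject, reweight, label, select, k=15):
    """Run the three steps: 1) reweight C into X, 2) label units, 3) select phrases."""
    C, phrases = build_C(documents)
    X = reweight(C)                                     # step 1: build X from C
    y = label(C, phrases, subject)                      # step 2: +1 / -1 labels
    return select(X, y, phrases, k)                     # step 3: k summary phrases
```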


How the document units are labelled, how the document units are vectorized, and how phrases are selected can all be done in different ways. Different choices for these steps result in different summarizers, some better than others. We describe these steps and choices fully in Sections 2.1 and 2.2.

Iraq              Russia                 Germany                   Mexico
american          a medvedev             angela merkel             and border protection
and afghanistan   caucasus               berlin                    antonio betancourt
baghdad           europe                 chancellor angela         cancun
brigade           gas                    european                  chihuahua
combat            georgia                france and                denise grady
gen               interfax news agency   frankfurt                 drug cartels
in afghanistan    iran                   group of mostly           guadalajara
invasion          moscow                 hamburg                   influenza
nuri              nuclear                marwa alsherbini          oaxaca
pentagon          president dmitri       matchfixing               outbreak
saddam            republics              minister karltheodor zu   president felipe
sergeant          sergei                 munich                    sinaloa
sunni             soviet                 nazi                      swine
troops            vladimir               world war                 texas
war               and                    who                       tijuana

Table 1: Four Sample Summaries of Four Different Countries. The method used, a count rule with a threshold of 2, the Lasso for feature selection, and tf-idf reweighting of features, was one of the best identified for article-unit analysis by our validation experiment.

2.1 Data pre-processing

In this section, we describe in detail how we pre-process a corpus of documents into a vectorized space so that the predictive classification approach can be employed. Our description is for a general document corpus, but at times we use examples from news article summarization to ground the general description and to show the need to consider context.


2.1.1 Choosing the document units

We divide the raw text into units of analysis and determine which of those units have relevant information about the subject, and summarize based on common features found in these units. The granularity with which the text is partitioned may then have some impact on the resulting summaries. In particular, we hypothesized that using smaller, lower-word-count units of text should produce more detail-oriented summaries, while using larger units will highlight key-phrases dealing more with the larger themes discussed when the subject of interest is mentioned.

We tested this hypothesis by comparing summarizers that analyze at the article level to those which analyze at the component-paragraphs level. Interestingly, we found no large differences. See Section 4.4.

2.1.2 Identifying potential summarizing phrases

To build the document-term matrix C we first identify all possible phrases that could be part of a summary. This list of possibilities constitutes our dictionary. Building this dictionary begins with asking, "Which text phrases are acceptably descriptive?" Sometimes the answer to this question suggests a manually-defined dictionary: if summaries should only list, e.g., countries, then the dictionary would be easy to assemble by hand.

In many situations, however, the dictionary of terms should be kept large, and possibly be drawn from the corpus itself. Different decisions—Should capitalization matter? Are punctuation marks terms? Can terms include numerals?—yield dictionaries varying widely in size and utility. Terms could be further disambiguated by many natural language tools, e.g., part-of-speech tagging, which would again increase dictionary size. Any automated system for forming a dictionary will entail at least some uncertainty in term identification.

We elected to use a large dictionary containing all phrases of up to three words in length. We generated our dictionary by first removing all numerals and punctuation, then case-folding (converting all text to lowercase). We then segmented each document into overlapping phrases, consisting of all single words, bigrams and trigrams in that document unit. Some text analysts stem their phrases, reducing words to a core root prefix, e.g., truncating "terror," "terrorist," and "terrorism" to "terror". But we do not stem. There is a semantic difference if a particular subject is associated with "canadians", the citizenry, versus "canada", the country. For our corpus of articles from the New York Times International Section in 2009, this process identified 4.5 million distinct phrases. A first-pass pruning, removing all phrases appearing fewer than six times, resulted in a dictionary of p = 216,626 distinct phrases.

2.1.3 Representing subjects as lists of words/phrases and automatically labeling documents

We train a classifier to predict document labels, y_i ∈ {−1, +1}, with their vectors of phrase features, x_i ∈ R^p, for i = 1, . . . , n. In the labeling step we automatically build the label vector y = (y_1, . . . , y_n) by deciding whether each document unit in the corpus should be considered a positive class example or a negative class example.

Establishing the class labels for a news document corpus is sometimes straightforward. For instance, for comparing 1989 articles about mental illness to those from 1999 as Wahl et al. (2002) did, the labels would be simple: the documents from the opposing years go in opposite classes. We build y by identifying which of the document units treat the subject of interest, with "treat" being precisely defined later. For small enough n, y could be built by hand. For corpora too large to admit manual labeling, we need reasonable automatic labeling. Ideally this need not be a perfect identification—noise in labeling should not have undue impact on the resulting summaries.

In the large sense, a subject is a concept of interest that an investigator might have. We represent a subject with a small list of words or phrases; e.g., the subject of China would be well represented by the set {"china," "chinas," "chinese"}. Specifically, let the subject Q ⊂ J be a list of words or phrases selected by the user to capture a subject of interest.

count-m. We consider two general automatic labeling techniques. The first technique, count-m, marks document unit i as treating a subject if related phrases appear frequently enough, as given by:

Definition. Count-m labeling labels document unit i as:

y_i = 2 · 1{r_i ≥ m} − 1,

where 1{·} is the indicator function and r_i ≡ Σ_{j∈Q} c_ij is the total number of subject-specific phrases in unit i.

hardcount-m. The second automatic labeling technique, hardcount-m, drops any document i with 0 < r_i < m from the data set instead of labeling it with −1. The hardcount method considers those documents too ambiguous to be useful as negative class examples. It produces the same positive example set as count-m. We hypothesized that dropping the ambiguous document units would heighten the contrast in content between the two classes, and thus lead to superior summaries. It did not. See Section 4.4.

We generate y this way because we are interested in how a subject is treated in a corpus. Other approaches are possible. For example, y might identify which articles were written in a specific date range; this would then lead to a summary of what was covered in that date range.
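A sketch of the two labeling rules, under the same illustrative conventions as the earlier pipeline sketch (C is the sparse count matrix, phrases its column names; the function names are our own, not the authors'):

```python
import numpy as np

def count_m_labels(C, phrases, subject, m=2):
    """count-m: y_i = +1 if the subject phrases appear at least m times in unit i."""
    cols = np.flatnonzero(np.isin(phrases, list(subject)))
    r = np.asarray(C[:, cols].sum(axis=1)).ravel()       # r_i, subject-phrase counts
    return np.where(r >= m, 1, -1)

def hardcount_m_labels(C, phrases, subject, m=2):
    """hardcount-m: as count-m, but drop ambiguous units with 0 < r_i < m."""
    cols = np.flatnonzero(np.isin(phrases, list(subject)))
    r = np.asarray(C[:, cols].sum(axis=1)).ravel()
    keep = (r == 0) | (r >= m)                            # rows retained in the data set
    return np.where(r[keep] >= m, 1, -1), keep
```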

2.1.4 Reweighting and removing features

It is well known that baseline word frequencies impact information retrieval methods, and so often raw counts are adjusted to account for commonality and rarity of terms (e.g., Monroe et al., 2008; Salton and Buckley, 1988). In the predictive framework, this adjustment is done in the construction of the feature matrix X. We consider three different constructions of X, all built on the bag-of-phrases representation. Salton and Buckley (1988) examined a variety of weighting approaches for document retrieval in a multi-factor experiment and found choice of approach to be quite important; we take a similar comparative approach for our task of summarizing a corpus of documents.

We first remove the columns corresponding to the phrases in subject Q from the set of possible features J to prevent the summary from being trivial and circular. We also remove sub-phrases and super-phrases. For example, if Q is {"united states"} then candidate summary phrases "united", "states", "of the united", and "states of america" would all be removed. The removal is easily automated.

Our baseline then is to simply drop stop words (words determined a priori as too uninformative to merit inclusion). Our second approach is to rescale each phrase vector (column of C) to have unity L2 norm. Our third is an implementation of the tf-idf technique (Salton and Buckley, 1988; Salton, 1991), rescaling the bag-of-phrases components so that appearances of rarer phrases are considered more important than common ones.

Stop Words. Stop words are low information words such as "and," or "the", typically appearing with high frequency. Stop words may be context dependent. For example, in US international news "united states" or "country" might be considered high frequency and low information. High-frequency words have higher variance and effective weight in many methods, causing them to be erroneously selected as features due to sample noise. To deal with these nuisance words, many text-processing methods use a fixed, hand-built stop-word list and preemptively remove all features on that list from consideration (e.g., Zhang and Oles, 2001; Ifrim et al., 2008; Genkin et al., 2007).

This somewhat ad-hoc method does not adapt automatically to the individual character of a given corpus and presents many difficulties. Switching to a corpus of a different language would require new stop word lists. When considering phrases instead of single words, the stop word list is not naturally or easily extended. For example, simply dropping phrases containing any stop word is problematic: it would be a mistake to label "shock and awe" uninteresting. On the other hand, there are very common candidate phrases that are entirely made up of stop words, e.g., "of the," so just culling the single word phrases is unlikely to be sufficient. See Monroe et al. (2008) for further critique.

L2-rescaling. As an alternative, appropriately adjusting the document vectors can act in lieu of a stop-word list by reducing the variance and weight of the high-frequency features. We use the corpus to estimate baseline appearance rates for each feature and then adjust the matrix C by a function of these rates; this core idea is discussed by Monroe et al. (2008).

We define L2-rescaling to be:

Definition. X is an L2-rescaled version of C if each column of C is rescaled to have unit length under the L2 norm, i.e.:

x_ij = c_ij / √z_j,   where z_j ≡ Σ_{i=1}^{n} c_ij²

tf-idf Weighting. An alternate rescaling comes from the popular tf-idf heuristic (Salton, 1991), which attempts to de-emphasize commonly occurring terms while also trying to account for each document's length.

Definition. X is a tf-idf weighted version of C if

x_ij := (c_ij / q_i) · log(n / d_j),

where q_i ≡ Σ_{j=1}^{p} c_ij is the sum of the counts of all key-phrases in document i and d_j ≡ Σ_{i=1}^{n} 1{c_ij > 0} is the number of documents in which term j appears at least once.
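Both reweightings are simple scalings of C and can be written directly against the definitions above. The sketch below uses SciPy sparse matrices and is illustrative rather than the authors' implementation; the guards against empty rows and columns are our own additions.

```python
import numpy as np
import scipy.sparse as sp

def l2_rescale(C):
    """x_ij = c_ij / sqrt(z_j) with z_j = sum_i c_ij^2 (unit-length columns)."""
    C = sp.csc_matrix(C, dtype=float)
    z = np.sqrt(np.asarray(C.multiply(C).sum(axis=0)).ravel())
    z[z == 0] = 1.0                                      # guard against empty columns
    return C @ sp.diags(1.0 / z)

def tfidf_weight(C):
    """x_ij = (c_ij / q_i) * log(n / d_j) as defined above."""
    C = sp.csr_matrix(C, dtype=float)
    n = C.shape[0]
    q = np.asarray(C.sum(axis=1)).ravel()                # q_i: total counts in unit i
    d = np.asarray((C > 0).sum(axis=0)).ravel()          # d_j: units containing phrase j
    q[q == 0] = 1.0
    d[d == 0] = 1.0
    return sp.diags(1.0 / q) @ C @ sp.diags(np.log(n / d))
```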


Under tf-idf, words which appear in a large proportion of documents are shrunk considerably in their representation in X. Words which appear in all n documents, such as "the", are zeroed out entirely. A potential advantage of tf-idf is that it might ease comparisons between documents of different lengths because term counts are rescaled by the total count of terms in the document.

To illustrate the advantages of reweighting, we generated four summaries from the L1LR feature selection method with all combinations of L2-rescaling and stop-word removal (see supplementary material). Without reweighting, if no stop words are deleted the list is dominated by high-frequency, low-content words such as "of," "and," and "was". The just stop-word removal list is only a bit better. It contains generic words such as "mr," "percent," and "world." The rescaled list does not contain these words. This is a common problem with stop-word lists: they get rid of the worst offenders, but do not solve the overall problem. The list from stop-word removal together with L2-rescaling and the list from just L2-rescaling are the same—the rescaling, in this case, has rendered the stop-word list irrelevant.

We hypothesized that feature weighting is more transparent and reproducible than stop-word removal and that it results in superior summaries when compared to stop-word removal. With the human validation experiment, we compared using L2-rescaling, tf-idf weighting, and stop-word removal as the pre-processing step for each of our feature selectors and found that humans indeed prefer lists coming from the reweighting methods.

2.2 Four Feature Selection Methods

Classical prediction yields models that give each feature a non-zero weight. The models are thus hard to interpret when there are many features, as is typically the case with text analysis (our data set contains more than 200,000 potential phrases). We, however, want to ensure that the number of phrases selected is small so the researcher can easily read and consider the entire summary. Short summaries are quick to digest, and thus are easily comparable. Such summaries might even be automatically generated for a corpus in one language and then translated to another, thus easing comparison of media coverage from different nationalities and allowing insight into foreign language news (Dai et al., 2011). Fundamentally, a small set of features is more easily evaluated by human researchers.

Given the features X and document labels y for a subject, we extract the columns of X that constitute the final summary. We seek a subset of phrases K ⊆ J with cardinality as close as possible to, but no larger than, a target k, the desired summary length. We typically use k = 15 phrases, but 30 or 50 might also be desirable. The higher the value of k, the more detailed and complex the summary. We require the selected phrases to be distinct, meaning that we do not count sub-phrases. For example, if "united states" and "united" are both selected, we drop "united".
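The sub-phrase rule can be implemented as a small post-processing filter; the sketch below is one way to do it (the function name is our own and the check is a simple word-level containment test):

```python
def drop_subphrases(selected):
    """Drop any selected phrase that is contained, word for word, in a longer one."""
    kept = []
    for p in selected:
        is_sub = any(p != q and (" " + p + " ") in (" " + q + " ") for q in selected)
        if not is_sub:
            kept.append(p)
    return kept

# Example: drop_subphrases(["united states", "united", "tibet"]) -> ["united states", "tibet"]
```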

The constraint of short summaries makes the summarization problem a sparse feature selection problem, as studied in, e.g., Forman (2003); Lee and Chen (2006); Yang and Pendersen (1997). Sparse methods, such as L1-penalized regression, naturally select a small subset of the available features (in our case candidate key-phrases) as being relevant predictors.

In other domains, L1-regularized methods are useful for sparse model selection; they can identify which of a large set of mostly irrelevant features are associated with some outcome. In our domain there is no reasonable underlying model that is indeed sparse; we expect different phrases to be more or less relevant, but few to be completely and utterly irrelevant. Nevertheless, we still employ the sparse methods to take advantage of their feature selection aspects, hoping that the most important features will be selected first.

We examine four methods for extraction, detailed below. Two of them, Co-occurrence and Correlation Screening, are scoring schemes where each feature gets scored independently and the top-scoring features are taken as a summary. This is similar to traditional key-phrase extraction techniques and to other methods currently used to generate word clouds and other text visualizations. The other two (the Lasso and L1LR) are L1-regularized least squares linear regression and logistic regression, respectively. Table 2 displays four summaries for China, one from each feature selector: choice matters greatly. Co-occurrence and L1-penalized logistic regression (L1LR) are familiar schemes from previous work (Gawalt et al., 2010).

2.2.1 Co-occurrence

Co-occurrence is our simplest, baseline, method. The idea is to take those phrases that appear most often (or have greatest weight) in the positively marked text as the summary. This method is often used in, e.g., newspaper charts showing the trends of major words over a year (such as Google News Trends2) or word or tag clouds (created at sites such as Wordle3).

2 http://www.google.com/trends
3 http://www.wordle.net/


     Co-occurrence          Correlation            L1LR                   Lasso
 1   and                    beijing and            asian                  asian
 2   by                     beijings               beijing                beijing
 3   contributed research   contributed research   contributed research   contributed research
 4   for                    from beijing           euna lee               exports
 5   global                 global                 global                 global
 6   has                    in beijing             hong kong              hong kong
 7   hu jintao              li                     jintao                 jintao
 8   in beijing             minister wen jiabao    north korea            north korea
 9   its                    president hu jintao    shanghai               shanghai
10   of                     prime minister wen     staterun               tibet
11   that                   shanghai               uighurs                uighurs
12   the                    the beijing            wen jiabao             wen jiabao
13   to                     tibet                  xinhua                 xinhua
14   xinhua                 xinhua                 the
15   year                   zhang

Table 2: Comparison of the Four Feature Selection Methods. Four sample summaries of news coverage of China. (Documents labeled via count-2 on articles, X from L2-rescaling.) Note superior quality of Lasso and L1LR on the right.

By the feature selection step we have two labelled document subsets, I+ = {i ∈ I | y_i = +1}, of cardinality #I+, and I− = {i ∈ I | y_i = −1}, of cardinality #I−. The relevance score s_j of feature j for all j ∈ J is then:

s_j = (1 / #I+) Σ_{i∈I+} x_ij

s_j is the average weight of the phrase in the positively marked examples. For some k′, let s be the (k′+1)th highest value found in the set {s_j | j ∈ J}. Build K = {j ∈ J : s_j > s}, the set of (up to) k′ phrases with the highest average weight across the positive examples. Any phrases tied with the (k′+1)th highest value are dropped, sometimes giving a list shorter than k′. The size of K after sub-phrases are removed can be even less. Let the initial value of k′ be k, the actual desired length. Now adjust k′ upwards until just before the summary of distinct phrases is longer than k. We are then taking the k′ ≥ k top phrases and removing the sub-phrases to produce k or fewer distinct phrases in the final summary.
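The scoring and the upward adjustment of k′ can be sketched as follows, reusing drop_subphrases from the earlier sketch. Tie handling is omitted for brevity, and the names are again illustrative assumptions rather than the paper's code.

```python
import numpy as np

def cooccurrence_summary(X, y, phrases, k=15):
    """s_j = mean weight of phrase j over the positive examples; grow k' until the
    distinct (sub-phrase-free) list would exceed k phrases."""
    s = np.asarray(X.tocsr()[y == 1].mean(axis=0)).ravel()   # average over I+
    order = np.argsort(-s)                                    # phrases by decreasing score
    summary = []
    for k_prime in range(k, len(order) + 1):
        candidate = drop_subphrases(list(phrases[order[:k_prime]]))
        if len(candidate) > k:
            break                                             # previous k' was the last valid one
        summary = candidate
    return summary
```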


If X = C, i.e. it is not weighted, then s_j is the average number of times feature j appears in I+, and this method selects those phrases that appear most frequently in the positive examples. The weighting step may, however, reduce the Co-occurrence score for common words that appear frequently in both the positive and negative examples. This is especially true if, as is usually the case, there are many more negative examples than positive ones. Appropriate weighting can radically increase this method's performance.

2.2.2 Correlation Screening

Correlation Screening selects features with the largest absolute correlation with the subject labeling y. It is a fast method that independently selects phrases that tend to appear in the positively marked text and not in the negatively marked text. Score each feature as:

s_j = |cor(x_j, y)|

Now select the k highest-scoring, distinct features as described for Co-occurrence, above.
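The score can be computed for all columns at once without densifying X; the sketch below is our own illustrative code (population means and standard deviations are used throughout, and a small floor avoids dividing by zero for constant columns).

```python
import numpy as np

def correlation_scores(X, y):
    """s_j = |cor(x_j, y)|, computed column-wise from sparse X."""
    X = X.tocsc().astype(float)
    n = X.shape[0]
    y_centered = y.astype(float) - y.mean()
    # cov(x_j, y) = (1/n) * sum_i x_ij * (y_i - mean(y)), since y_centered has zero mean
    cov = np.asarray(X.T @ y_centered).ravel() / n
    mean_x = np.asarray(X.mean(axis=0)).ravel()
    var_x = np.asarray(X.multiply(X).mean(axis=0)).ravel() - mean_x ** 2
    sd_x = np.sqrt(np.maximum(var_x, 1e-12))              # floor for constant columns
    return np.abs(cov / (sd_x * np.std(y)))
```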

2.2.3 L1-penalized linear regression (the Lasso)

The Lasso is an L1-penalized version of linear regression and is the first of two feature selection methods examined in this paper that address our model-sparsity-for-interpretability constraint explicitly. Imposing an L1 penalty on a least-squares problem regularizes the vector of coefficients, allowing for optimal model fit in high-dimensional (p > n) regression settings. Furthermore, L1 penalties typically result in sparse feature-vectors, which is desirable in our context. The Lasso takes advantage of the correlation structure of the features to, in principle, avoid selecting highly correlated terms. For an overview of the Lasso and other sparse methods see, e.g., The Elements of Statistical Learning (Hastie et al., 2003).

The Lasso is defined as:

(β(λ), γ) := arg min_{β,γ} Σ_{i=1}^{n} (y_i − x_i^T β − γ)² + λ Σ_j |β_j|.   (1)

The penalty term λ governs the number of non-zero elements of β. We use a non-penalized intercept, γ, in our model. Penalizing the intercept would shrink the estimated ratio of the number of positive example documents to the number of negative example documents to 1. This is not desirable; the number of positive examples is far less than 50%, as shown in Table 1 in the supplementary materials, and in any case is not a parameter which needs estimation for our summaries. We solve this convex optimization problem with a modified version of the BBR algorithm (Genkin et al., 2007) described further in Section 2.2.5.

2.2.4 L1-penalized logistic regression (L1LR)

Similar to the Lasso, L1-penalized logistic regression (L1LR) is typically used to obtain a sparse feature set for predicting the log-odds of an outcome variable being either +1 or −1. It is widely studied in the classification literature, including text classification (see Genkin et al., 2007; Ifrim et al., 2008; Zhang and Oles, 2001). We define the model as:

(β(λ), γ) := arg min_{β,γ} Σ_{i=1}^{n} log(1 + exp[−y_i(x_i^T β + γ)]) + λ Σ_j |β_j|.   (2)

The penalty term λ again governs the number of non-zero elements of β. As with the Lasso, we again do not penalize the intercept. We implement L1LR with a modified form of the BBR algorithm.

2.2.5 Implementation and computational cost

Computational costs primarily depend on the size and sparsity of X. We store X as a list of tuples, each tuple being a row and column index and value of a non-zero element. This list is sorted so it is quick to identify all elements in a matrix column. This data structure saves both in storage and computational cost.
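This column-sorted tuple list is, in effect, a standard compressed sparse column layout; the toy sketch below shows the equivalent access pattern with SciPy and is our own illustration, not the authors' C data structure.

```python
import numpy as np
import scipy.sparse as sp

# Build a tiny sparse matrix from (row, col, value) tuples.
rows = np.array([0, 2, 1])
cols = np.array([1, 1, 3])
vals = np.array([3.0, 1.0, 2.0])
X = sp.csc_matrix((vals, (rows, cols)), shape=(4, 5))

j = 1
col = X.getcol(j)                     # all non-zero entries of column j, O(Z_j)
print(col.indices, col.data)          # row indices and values: [0 2] [3. 1.]
```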

The complexity for the tf-idf and L2 rescaling methods is O(Z), with Z being the number of nonzero elements in X, because we only have to re-weight the nonzero elements and the weights are only calculated from the nonzero elements. Stop-word elimination is also O(Z).

The running times of the four feature selection methods differ widely. For Correlation Screening and Co-occurrence, the complexity is O(Z). The Lasso and L1LR depend on solving convex optimization problems. We implemented them using a modified form of the BBR algorithm (Genkin et al., 2007). The BBR algorithm is a coordinate descent algorithm for solving L1-penalized logistic regressions with penalized (or no) intercept. It cycles through all the columns of X, iteratively computing the optimal β_j for feature j holding the other β_k fixed. We modified the BBR algorithm such that 1) we can solve the Lasso with it; 2) we do not penalize the intercept; and 3) the implementation exploits the sparsity of X. Not penalizing the intercept preserves sparsity, even if we chose to center the columns of X.

For each iteration of the modified BBR, we first calculate the optimum intercept γ given the current value of β and then cycle through the features, calculating the update to β_j of Δβ_j for j = 1, . . . , p. The complexity for calculating Δβ_j is O(Z_j), where Z_j is the number of nonzero elements in the jth column of X. We then have a complexity cost of O(Z) for each full cycle through the features. We stop when the decrease in the loss function after a full cycle through β_1 to β_p is sufficiently small. For both the Lasso and logistic regression, we need only a few iterations to obtain the final solution. When not exploiting the sparse matrix, the complexity of a coordinate descent iteration is O(n × p). When Z ≪ n × p, our implementation saves a lot of computation cost.

For both the Lasso and L1LR, higher values of λ result in the selection of fewer features. A sufficiently high λ will return a β with zero weight for all phrases, selecting no phrase features, and λ = 0 reverts the problem to ordinary regression, leading to some weight put on all phrases in most circumstances. By doing a binary search between these two extremes, we quickly find a value of λ for which β(λ) has the desired k distinct phrases with non-zero weight.
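The binary search over λ can be sketched with an off-the-shelf solver standing in for the modified BBR algorithm. Here scikit-learn's Lasso (whose alpha plays the role of λ, up to a rescaling of the loss) and the drop_subphrases helper from earlier are assumptions of the sketch, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_summary(X, y, phrases, k=15, n_steps=30):
    """Binary-search the L1 penalty until the fit keeps at most k distinct phrases."""
    def selected_phrases(alpha):
        coef = Lasso(alpha=alpha, fit_intercept=True, max_iter=5000).fit(X, y).coef_
        return drop_subphrases(list(phrases[coef != 0]))

    lo, hi = 1e-8, 1.0
    while len(selected_phrases(hi)) > 0:                 # grow hi until nothing is selected
        hi *= 2.0
    best = []
    for _ in range(n_steps):
        mid = (lo + hi) / 2.0
        phrases_mid = selected_phrases(mid)
        if len(phrases_mid) > k:
            lo = mid                                     # too many phrases: larger penalty
        else:
            best, hi = phrases_mid, mid                  # acceptable: try a smaller penalty
    return best
```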

Computation speed comparisons. We timed the various methods to compare them given our data set. The average times to summarize a given subject for each method, not including the time to load, label, and rescale the data, are shown in Table 3. As expected, Co-occurrence and Correlation Screening are roughly the same speed. The data-preparation steps by themselves (not including loading the data into memory) average a total of 15 seconds, more expensive by far than the feature selection for the simple methods (although we did not optimize the labeling of y or the dropping of the subject-related features from X).

The Lasso is currently about 9 times slower and L1LR is more than 100 times slower than the baseline Co-occurrence using current optimization techniques, but these techniques are evolving fast. For example, one current area of research, safe feature elimination, allows for quickly and safely pruning many irrelevant features before fitting, leading to substantial speed-ups (El Ghaoui et al., 2011). This pre-processing step allows for a huge reduction in the number of features when the penalty parameter is high, which is precisely the regime where the desired list length is short.


The difference between the Lasso and L1LR may seem surprising given the running time analysis above. L1LR is slower for two main reasons: for a given λ, L1LR requires more iterations to converge on average, and for each iteration, L1LR takes more time due to, in part, more complex mathematical operations.

We implemented the sparse regression algorithms and feature correlation functions in C; the overall package is in Matlab. L1LR is the slowest method, although further optimization is possible. The speed cost for the Lasso, especially when considering the overhead of labeling and rescaling, is fairly minor, as shown in the third column of Table 3.

                      Phrase selection   Total time   Percent
                      (sec)              (sec)        increase
Co-occurrence           1.0               20.3
Correlation Screen      1.0               20.3            0%
The Lasso               9.3               28.7          +41%
L1LR                  104.9              124.2         +511%

Table 3: Computational Speed Chart. Average running times for the four feature selection methods over all subjects considered. The second column includes time to generate y and adjust X. The final column is the percentage increase in total time over Co-occurrence, the baseline method.

2.2.6 Comments

The primary advantages of Co-occurrence and Correlation Screening are that they are fast, scalable, and easily distributed across multiple platforms for parallel processing. Unfortunately, as they score each feature independently from the others, they cannot take advantage of any dependence between features to aid summarization, for example, to remove redundant phrases. The Lasso and L1LR, potentially, do. The downside is that sparse methods can be more computationally intensive.

Final summaries consist of a target of k distinct key-phrases. The feature-selectors are adjusted to provide enough phrases such that once sub-phrases (e.g., "united" in "united states") are removed, the list is k phrases long. This removal step, similar to stop-word removal, is somewhat ad hoc. It would be preferable to have methods that naturally select distinct phrases that do not substantially overlap. Sparse methods are such methods; they do not need to take advantage of this step, supporting the heuristic knowledge that L1-penalization tends to avoid selecting highly correlated features. With tf-idf, an average of about one phrase is dropped for the Lasso and L1LR. The independent feature selection methods, however, tend to drop many phrases. See Section 1 of the supplementary material.

3 Human Experiment

Four sample summaries of the coverage of four different countries are shown in Table 1. Sometimes fragments are selected as stand-ins for complete phrases, e.g., "President Felipe [Calderon]." These summaries inform us as to which aspects of these countries are of most concern to the New York Times in 2009: even now, Nazis and the World Wars are tied to Germany. Iraq and Afghanistan are also tied closely. Gen[erals] and combat are the major focus in Iraq. The coverage of Mexico revolves around the swine flu, drug cartels, and concerns about the border. Russia had a run-in with Europe about gas, and nuclear involvement with Iran. These summaries provide some pointers as to future directions of more in-depth analysis. They came from a specific combination of choices for the reweighting, labeling, and feature selection steps. But are these summaries better, or worse, than the summaries from a different summarizer?

Comparing the efficacy of different summarizers requires systematic evaluation. To do this, many researchers use corpora with existing summaries (e.g., using the human-encoded key-phrases in academic journals such as in Frank et al. (1999)), or corpora that already have human-generated summaries (such as the TIPSTER dataset used in Neto et al. (2002)). Others have humans generate summaries for a sample of individual documents and compare the summarizer's output to this human baseline. We, however, summarize many documents, and so we cannot use an annotated evaluation corpus or summaries of individual documents.

In the machine-learning world, numerical measures such as prediction accuracy or model fit are often used to compare different techniques, and this has been successful in many applications. While we hypothesize that prediction accuracy should correlate with summary quality to a certain extent, there are no results to demonstrate this. Indeed, some researchers found that it does not always (Gawalt et al., 2010; Chang et al., 2009). Because of this, evaluating summarizer performance with numerical measures is not robust to critique.

Although time consuming, people can tell how well a summary relates to a subject, as the hand-coding practice in media analysis shows. Because final outcomes of interest are governed by human opinion, the only way to validate that a summarizer is achieving its purpose is via a study where humans assess summary quality. We therefore design and conduct such a study. Our study has three main aims: to verify that features used for classification are indeed good key-phrases, to help learn what aspects of the summarizers seem most important in extracting the key meaning of a corpus, and to determine which feature selection methods are most robust to different choices of pre-processing (choice of granularity, labeling of document units, and rescaling of X).

We compare our four feature selection methods under a variety of labeling and reweighting choices in a crossed, randomized experiment where non-experts read both original documents and our summaries and judge the quality and relevance of the output. Even though we expect individuals' judgements to vary, we can average the responses across a collection of respondents and thus get a measure of overall, generally shared opinion.

We carried out our survey in conjunction with the XLab, a campus lab dedicated to helping researchers conduct human experiments. We recruited 36 respondents (undergraduates at a major university) from the lab's respondent pool via a generic, nonspecific message stating that there was a study that would take up to one hour of time. While these experiments are expensive and time consuming, they are necessary for evaluating text summarization tools.

3.1 Description of text corpus

For our investigation we used the International Section of the New York Times for the 2009 year. Articles were scraped from the newspaper's RSS feed,4 and the HTML markup was stripped from the text. We obtained 130,266 paragraphs of text comprising 9,560 articles. The New York Times, upon occasion, will edit an article and repost it under a different headline and link; these multiple versions of the articles remain in the data set. By looking for similar articles as measured by a small angle between their feature vectors in the document-term matrix C, we estimate that around 400 articles (4–5%) have near-duplicates.
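The near-duplicate check can be done with cosine similarity between rows of C (a small angle between two vectors means a cosine near 1). The sketch below is illustrative, and the 0.95 cutoff is our own assumption rather than the threshold used in the paper.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def near_duplicate_pairs(C, threshold=0.95):
    """Return article index pairs whose bag-of-phrases vectors form a small angle."""
    sim = cosine_similarity(C)                   # dense n x n; fine for moderate n
    i, j = np.where(np.triu(sim, k=1) > threshold)
    return list(zip(i.tolist(), j.tolist()))
```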

The number of paragraphs in an article ranges from 1 to 38. Typical articles5 have about 16 paragraphs (with an Inter-Quartile Range (IQR) of 11 to 20 paragraphs). However, about 15% of the articles, the "World Briefing" articles, are a special variety that contain only one long paragraph.6 Among the more typical, non-"World Briefing" articles, the distribution of article length as number of paragraphs is bell-shaped and unimodal. The longer articles, with a median length of 664 words, have much shorter paragraphs (median of 38 words), generally, than the "World Briefing" single-paragraph articles (median of 87 words).

4 feed://feeds.nytimes.com/nyt/rss/World
5 See, e.g., http://www.nytimes.com/2011/03/04/world/americas/04mexico.html

3.2 Generating the sample of summaries

A choice of each of the options described above gives a unique summarizer. We evaluated 96 different summarizers built from these factors:

1. We evaluated summarizers that analyzed the data at the article-unit and paragraph-unit level (see Section 2.1.1).

2. When performing paragraph-unit analysis, we labeled document units using count-1, count-2, and hardcount-2. For the article-unit analysis we considered these three, plus count-3 and hardcount-3 (see Section 2.1.3).

3. We considered tf-idf weighting, L2 rescaling, and simple stop-word removal (see Section 2.1.4).

4. We considered four feature-selection techniques (see Section 2.2).

Choices of unit of analysis, labeling, the three preprocessing options, and the four feature-selection methods give 96 different summarizers indexed by these combinations of factors.

We compared the efficacy of these combinations by having respondents assess the quality of several different summaries generated by each summarizer. We applied each summarizer to the set of all articles in the New York Times International Section from 2009 for 15 different countries of interest. These countries are listed in Table 1 in the supplementary materials. We only considered countries with reasonable representation in the corpus. After identifying these countries, we hand-specified a phrase set Q for each country by including any plurals and possessives of the country and any common names for the country's people. Using these 15 subjects on each of the 96 summarizers, we calculated 1,440 summaries.

Supplementary Table 1 also includes the number of positively marked examples under all the count-m labeling schemes we used for both article- and paragraph-unit analysis. The "article-1" header is the most generous labeling: any article that mentions any of the words associated with the subject one or more times is marked as treating the subject. Even under this, positive examples are scarce; it is clear we are attempting to summarize something that does not constitute a large portion of the text.

Footnote 6: See, e.g., http://www.nytimes.com/2011/03/03/world/americas/03briefs-cuba.html

3.3 The survey and respondents

For our survey, paid respondents were convened in a large room of kiosks where they could be asked questions in a focused, controlled environment. Each respondent sat at a computer and was given a series of questions over the course of an hour. Respondents assessed a series of summaries and articles presented in 6 blocks of 8 questions each. Each block considered a single (randomly selected) subject from our list of 15. Within a block, respondents were first asked to read four articles and rate their relevance to the specified subject. Respondents were then asked to read and rate four summaries of that subject randomly chosen from the subject's library of 96. Respondents could not go back to previous questions.

Before the survey all respondents did a sample series of four practice questions and were then asked if they had any questions as to how to score or rate articles and summaries.

Evaluating article topicality. To ensure that respondents had a high probability of seeing several articles actually relevant to the subject being investigated, the articles presented in a block were selected with a weighted sampling scheme, with weights proportional to the number of times the block's country's name was mentioned. We monitored the success of this scheme (and collected data about the quality of the automatic labelers) by asking the respondents to evaluate each shown article's relevance to the specified subject on a 1 to 7 scale.
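As a small illustration, the weighted draw of the four articles shown in a block might look like the following sketch; the mention counts and the choice to sample without replacement are assumptions.

    # Draw k articles with selection probability proportional to how often the
    # block's country is mentioned (sampling without replacement is assumed).
    import numpy as np

    def sample_block_articles(mention_counts, k=4, seed=0):
        rng = np.random.default_rng(seed)
        w = np.asarray(mention_counts, dtype=float)
        return rng.choice(len(w), size=k, replace=False, p=w / w.sum())

    # Articles mentioning the country more often are more likely to be shown.
    print(sample_block_articles([12, 1, 0, 7, 3, 0, 5, 2]))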

Scoring 4 or higher as relevant, respondents saw at least 2 articles (out of the 4) on the subject of interest about 75% of the time. Scoring 3 or higher, the share of blocks with at least two relevant articles rises to 91%. We attempt to summarize how a subject is treated overall, including how it is treated in articles in which the subject is only a secondary consideration. For example, an article focused on world energy production may discuss Russia. Hence, even a modest score of 3 or 4 is a likely indication that the article has subject-related content.

Only the first 120 words of each article were shown; consultation with journalists suggests this would not have a detrimental impact on content presented, as a traditional newspaper article's "inverted pyramid" structure moves from the most important information to more minute details as it progresses (Pottker, 2003).

Evaluating summaries. A simple random sample of 4 summaries was presented after the four articles. Each summary was presented on its own screen. The respondents were asked to assess each summary in four respects:

1. Content: How well does this list capture the content about [subject] in the text you just read? (1-7 scale, 7 being fully captured)

2. Relevance: How many words irrelevant or unrelated to [subject] does this list contain? (1-7 scale, 7 being no irrelevant words)

3. Redundancy: How many redundant or repeated words does this list contain? (1-7 scale, 7 being no redundancies)

4. Specificity: Given what you just read, would you say this list is probably too general or too specific a summary of how [subject] was covered by the newspaper in 2009? (response options: too general, about right, too specific, not relevant, and can't tell)

All respondents finished their full survey, and fewer than 1% of the questions were skipped. Time to completion ranged from 14 to 41 minutes, with a mean completion time of 27 minutes.

4 Human Survey Results

We primarily examined an aggregate "quality" score, taken as the mean of the three main outcomes (Content, Relevance, and Redundancy). Figure 1 shows the raw mean aggregate outcomes for the article-unit and paragraph-unit data. We immediately see that the Lasso and L1LR perform better than Co-occurrence and Correlation Screening, and that tf-idf is a good choice of rescaling for articles while L2 rescaling is better for paragraphs. Labeling does not seem to matter for the articles, but does for paragraphs. Clearly, the unit of analysis interacts with the other three factors, and so we conduct further analysis of the article-unit and paragraph-unit data separately. Section 4.4 has overall comparisons.

We analyze the data by fitting the respondents' responses to the summarizer characteristics using linear regression. The full model includes terms for respondent, subject, unit type, rescaling used, labeling used, and feature selector used, as well as all interaction terms among the latter four factors.

[Figure 1 appeared here: three panels plotting mean aggregate score (roughly 3.0 to 5.0) against labeling scheme (count-1, count-2, hardcount-2, count-3, hardcount-3), rescaling (stop, resc, tfidf), and feature selector (cooc, corr, lasso, l1lr), with separate curves for article-unit (A) and paragraph-unit (P) summarizers.]

Figure 1: Aggregate Results. Outcome is aggregate score based on the raw data. There are major differences between article-unit analysis and paragraph-unit analysis when considering the impact of choices in preprocessing.

In all models, there are large respondent and subject effects. Some subjects were more easily summarized than others, and some respondents were more critical than others. Interactions between the four summarizer factors are (unsurprisingly) present (df = 33, F = 4.14, log P ≈ −13 under ANOVA). There are significant three-way interactions between unit, feature-selector, and rescaling (P ≈ 0.03) and between labeling, feature-selector, and rescaling (P ≈ 0.03). Interaction plots (Figure 1) suggest that the sizes of these interactions are large, making interpretation of the marginal differences for each factor potentially misleading. Table 4 shows all significant two-way interactions and main effects for the full model, as well as for models run on the article-unit and paragraph-unit data separately.
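As an illustration, this fit could be expressed as follows, assuming a long-format data frame with one row per rated summary; the column names and file are hypothetical, and the real analysis must respect that some labeling levels exist only for article-unit summarizers.

    # Full model: aggregate quality regressed on respondent and subject effects
    # plus unit, labeling, rescaling, and feature selector with all interactions.
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    ratings = pd.read_csv("summary_ratings.csv")   # one row per rated summary

    full_model = smf.ols(
        "quality ~ C(respondent) + C(subject)"
        " + C(unit) * C(labeling) * C(rescaling) * C(selector)",
        data=ratings,
    ).fit()

    # F tests for each term, including the interactions discussed above.
    print(anova_lm(full_model, typ=2))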

4.1 Article unit analysis

Interactions between factors make interpretation difficult, but overall, Lasso is a good summarizer that is resistant to preprocessing choices. Interestingly, the simplest method, Co-occurrence, is on par with Lasso under tf-idf.

The left column of Figure 2 shows plots of the three two-way interactions between feature selector, labeling scheme, and rescaling method for the article-unit data. There is a strong interaction between rescaling and feature-selection method (df = 6, F = 8.07, log P ≈ −8, top-left plot), and no evidence of a labeling by feature-selection interaction or a labeling by rescaling interaction. Model-adjusted plots (not shown) akin to Figure 2 do not differ substantially in character. Table 4 shows all significant main effects and pairwise interactions. There is no significant three-way interaction.

Lasso is the most consistent method, maintaining high scores under almost all combinations of the other two factors. In Figure 2, note how Lasso has a tight cluster of means regardless of rescaling used in the top-left plot and how Lasso's outcomes are high and consistent across all labelings in the middle-left plot. Though L1LR or Co-occurrence may be slightly superior to Lasso when the data has been vectorized according to tf-idf, they are not greatly so, and, regardless, both these methods seem fragile, varying a great deal in their outcomes based on the text preprocessing choices. Note, for example, how vulnerable the Co-occurrence feature-selection method is to the choice of rescaling.

Tf-idf seems to be the best overall rescaling technique, consistently coming out ahead regardless of choice of labeling or feature-selection method. Note how its curve is higher than the L2-rescaling and stop-word curves in both the top- and bottom-left plots in Figure 2. Under tf-idf, all the methods seem comparable. Alternatively put, tf-idf brings otherwise poor feature selectors up to the level of the better selectors.

[Figure 2 appeared here: six panels of mean aggregate score (roughly 2.5 to 5.5) showing the pairwise interactions of feature selector (cooc, corr, lasso, l1lr), labeling scheme (count-1, count-2, hardcount-2, count-3, hardcount-3), and rescaling technique (S = stop, R = resc, T = tfidf); the left column shows article-unit summarizers and the right column paragraph-unit summarizers.]

Figure 2: Aggregate Quality Plots. Pairwise interactions of feature selector, labeling, and rescaling technique. The left-hand plots are for article-unit summarizers, the right-hand plots for paragraph-unit summarizers. See the testing results for which interactions are significant.

Factor              |       All data        |  Article-unit   | Paragraph-unit
                    | Unit  Feat  Lab  Resc | Feat  Lab  Resc | Feat  Lab  Resc
Unit                |  .    -2    -1   -7   |                 |
Feature Selection   |      -15     .  -10   | -10    .   -7   |  -6    .   -2
Labeling            |              .    .   |        .    .   |       -1    .
Rescaling           |                 -14   |           -15   |            -3

Table 4: Main Effects and Interactions of Factors. Main effects lie along the diagonal. A number denotes a significant main effect or pairwise interaction for aggregate scores, and is the base-10 log of the P-value; "." denotes lack of significance. "All data" is all data in a single model. "Article-unit" and "Paragraph-unit" indicate models run on only those data for summarizers operating at that level of granularity.

Adjusting P-values with Tukey's honest significant difference and calculating all pairwise contrasts for each of the three factors show which choices are overall good performers, ignoring interactions. For each factor, we fit a model with no interaction terms for the factor of interest and then performed pairwise testing, adjusting the P-values to control the familywise error rate. See Table 5 for the resulting rankings of the factor levels. Co-occurrence and Correlation Screening are significantly worse than L1LR and Lasso (Correlation Screening vs. L1LR gives t = 3.46, P < 0.005). The labeling method options are indistinguishable. The rescaling method options are ordered, with tf-idf significantly better than L2 rescaling (t = 5.08, log P ≈ −4), which in turn is better than stop-word removal (t = 2.45, P < 0.05).
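For example, the marginal comparison of feature selectors on the article-unit data could be sketched as below; this is a simplification that compares raw scores rather than refitting the no-interaction model described above, and the column names are hypothetical.

    # Tukey-adjusted pairwise contrasts for the feature-selection factor,
    # restricted to article-unit summarizers.
    import pandas as pd
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    ratings = pd.read_csv("summary_ratings.csv")          # hypothetical file
    article = ratings[ratings["unit"] == "article"]

    tukey = pairwise_tukeyhsd(endog=article["quality"],
                              groups=article["selector"],
                              alpha=0.05)
    print(tukey.summary())   # e.g., the cooc/corr vs. Lasso/L1LR separations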

4.2 Paragraph unit analysis

For the paragraph-unit summarizers, the story is similar. Lasso is again the most stable to various pre-processing decisions, but does not have as strong a showing under some of the labeling choices. Co-occurrence is again the most unstable. L1LR and Correlation Screening outperform Lasso under some configurations. The main difference from the article-unit data is that tf-idf is a poor choice of rescaling and L2-rescaling is the best choice.

The right column of Figure 2 shows the interactions between the three factors. There is again a significant interaction between rescaling and feature-selection method (df = 6, F = 3.25, P < 0.005, top-right plot). This time, however, it is not entirely due to Co-occurrence being sensitive to rescaling. Co-occurrence is still sensitive, but Correlation Screening and L1LR are as well. Stop-word removal does quite well for L1LR and Lasso, suggesting that rescaling is less relevant for shorter document units.

Data included   Order (article)                           Order (paragraph)
All             cooc, corr < L1LR, Lasso                  cooc < corr, Lasso, L1LR
                stop < resc < tf-idf                      tf-idf, stop < resc
tf-idf only     no differences                            no differences
L2 only         cooc < L1LR, Lasso; corr < Lasso          no differences
stop only       cooc < corr, L1LR, Lasso; corr < Lasso    cooc < Lasso, L1LR
cooc only       stop < resc < tf-idf                      stop < resc
corr only       stop < tf-idf                             no differences
Lasso only      no differences                            no differences
L1LR only       no differences                            tf-idf < resc

Table 5: Quality of Feature Selectors. This table compares the significance of the separation of the feature-selection methods on the margin. Order is always from lowest to highest estimated quality. A "<" denotes a significant separation. All P-values are corrected for multiple pairwise testing. The last seven lines have lower power due to subsetting the data.

Co-occurrence is significantly worse than the other three on the margin (Co-occurrence vs. Correlation Screening gives an adjusted pairwise test with t = 4.11, P < 0.0005), but the other three are indistinguishable. Labeling matters significantly (df = 2, F = 5.23, P < 0.01), with count-1 doing better on the margin than count-2 and hardcount-2. The higher threshold is likely removing too many substantive paragraphs from the set of positive examples. See Table 1 in the supplementary materials: around 75% of the examples are dropped by moving from count-1 to count-2.

4.3 Analysis of subscores

The above analysis considers the aggregate score across (1) specific Content captured, (2) Redundancy of phrases in the list, and (3) general Relevance of phrases in the list to the subject. We also performed the above analyses for each of the three sub-scores separately. The overall conclusions mainly hold, with a few important exceptions. The paragraph-unit results are especially variable, suggesting that paragraph-unit analysis requires more fine-tuning to get good results than article-unit analysis.


The Lasso and L1LR maintain word-lists that have few repeats (good Redundancy scores), but their information capture degrades when given quite short units of text. This partially explains the weaker performance of Lasso in the aggregate scores for the paragraph-unit. For the paragraph-unit summarizers, L2-rescaling is clearly superior for Relevance and Content scores, but inferior to tf-idf for Redundancy.

The marginal Redundancy scores for feature-selection method are extremely differentiated, with L1LR and Lasso both scoring high and Co-occurrence and Correlation Screening scoring quite low. Correlation Screening's poor Redundancy score substantially reduces its aggregate score. This might be solved by a different, more sophisticated pruning technique. Indeed, given Correlation Screening's quite high scores for Relevance and Content, fixing the Redundancy problem could result in a good, fast summarizer that may well outperform the penalized regression methods.

4.4 Discussion

The feature selectors interact differently with labeling and rescaling under the two different units of analysis. While the overall summary quality was no different between these two varieties of summarizer, interaction plots suggest labeling is important, with count-2 being more appropriate for articles and count-1 being more appropriate for paragraph units (see top plots of Figure 1). This is unsurprising: a count of 1 vs. 2 means a lot more in a single paragraph than in an entire article.

Preprocessing choice is a real concern. While stop-word removal and L2-rescaling seem relatively consistent across both units of analysis, tf-idf works much worse, overall, for the paragraph-unit summarizers than with articles. This is probably due to the short length of the paragraphs causing rescaling by term frequency to have a large and varying impact. It might also have to do with tf-idf correctly adjusting for the length of the short "World Briefing" articles. Under Lasso, however, these decisions seem less important, regardless of unit size.
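For concreteness, the two reweightings under comparison can be written in a few lines; this is a sketch with invented toy documents, not the paper's preprocessing code, and note that scikit-learn's TfidfTransformer also applies an L2 normalization after the idf weighting by default.

    # Contrast tf-idf weighting with plain L2 (unit-length) rescaling of the
    # raw phrase counts for each document unit.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.preprocessing import normalize

    units = ["russia increased gas exports",            # toy stand-ins for
             "gas prices rose across europe",            # paragraphs or articles
             "a one paragraph world briefing on cuba"]

    X = CountVectorizer().fit_transform(units)           # raw counts
    X_tfidf = TfidfTransformer().fit_transform(X)        # tf-idf: downweights common terms
    X_l2 = normalize(X, norm="l2")                       # L2: each row scaled to unit length

    print(X_tfidf.toarray().round(2))
    print(X_l2.toarray().round(2))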

Comparing the performance of the feature selectors is difficult due to the different nature of interactions for paragraph and article units. That said, Lasso consistently performed well. For the article-unit it performed near the top. For the paragraph-unit it did better than most but was not as definitively superior. L1LR, if appropriately staged, also performs well.

We hypothesized that paragraph-unit analysis would generate more specific summaries and article-unit analysis more general ones. This does not seem to be the case; in analyzing the results for the fourth question on generality vs. specificity of the summaries (not shown), there was no major difference found between article-unit and paragraph-unit summarizers.

It is on the surface surprising that the Lasso often outperformed L1LR, as L1LR fits a model that is more appropriate for the binary outcome of the labeling. The Lasso has an L2 loss, which is sensitive to outliers, while L1LR's logistic curve is less sensitive. However, the design matrix X, especially under rescaling, is heavily restricted: all entries are nonnegative and few are large. This may limit the opportunity for individual entries in the L2 loss to have a significant impact, ameliorating the major drawback of the Lasso.
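For reference, writing the two objectives schematically (intercepts and the paper's exact scaling conventions omitted), the comparison is between the Lasso's squared-error loss and L1LR's logistic loss, each with an L1 penalty:

    \hat{\beta}_{\text{Lasso}} = \arg\min_{\beta}\ \tfrac{1}{2}\,\lVert y - X\beta\rVert_2^2 + \lambda\,\lVert\beta\rVert_1,
    \qquad
    \hat{\beta}_{\text{L1LR}} = \arg\min_{\beta}\ \sum_{i=1}^{n}\log\!\bigl(1 + e^{-y_i x_i^{\top}\beta}\bigr) + \lambda\,\lVert\beta\rVert_1,

with y_i in {-1, +1} for the logistic case. The squared-error term grows quadratically in each residual, which is why a handful of large entries in X could dominate the fit; with the nonnegative, rescaled entries described above, such outliers are rare.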

There is no evidence that dropping units that mention the subject below a given threshold (the hardcount labeling technique) is a good idea. Indeed, it appears to be a bad one. The pattern of a quality dip between count-n and hardcount-n appears in both the paragraph- and article-unit results. Perhaps articles that mention a subject only once are important negative examples. The sub-scores offer no further clarity on this point.

5 Conclusions

News media significantly impact our day-to-day lives and the direction of public policy. Analyzing the news, however, is a complicated task. The labor intensity of hand coding leads either to small-scale studies or to great expense. This and the amount of news available to a typical reader strongly motivate automated methods to help with media analysis.

We proposed a sparse predictive framework for extracting meaningful summaries of specific subjects from document corpora, including corpora of news articles. We constructed different summarizers based on our proposed approach by combining different options of data preparation scheme, document granularity, labeling choice, and feature-selection method. We advocated for, and carried out, a human validation experiment to compare these different summarizers among themselves and with human understanding.

Based on the human experiment, we conclude that the features selected using a prediction framework do generally form informative key-phrase summaries for subjects of interest for the corpus of New York Times International Section articles. These summaries are superior to those from simpler methods of the kind currently in wide use, such as Co-occurrence. The Lasso is a good overall feature selector that seems robust to how the data is vectorized and labeled, and it is computationally scalable. L1LR, a natural fit model-wise, can perform well if preprocessing is done correctly. However, it is computationally expensive. Data preparation is important: the vector representation of the data should incorporate some reweighting of the phrase appearance counts. Tf-idf is a good overall choice unless the document units are small (e.g., paragraphs, and, presumably, headlines, online comments, and tweets), in which case an L2 scaling should be used. We also learned that limiting phrases to three words or fewer is a potential problem; we encountered it, for example, when dealing with political leaders frequently mentioned with title (as in "Secretary of State Hillary Clinton"). Ifrim et al. (2008) proposed a greedy descent approach for L1LR that allows for arbitrary-length key-phrases, but it currently does not allow for an intercept or reweighting. However, it potentially could. Alternatively, natural language tools such as part-of-speech tagging could pull out such names as distinct features. These approaches are currently under investigation.
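As a compact illustration of this recommended configuration, a tf-idf phrase representation followed by an L1-penalized linear fit against the automatic labels might be sketched as follows; the phrase-length cap, penalty weight, and list length are illustrative choices, not the authors' settings.

    # Sketch of a Lasso-based key-phrase summarizer: tf-idf phrase counts and an
    # L1-penalized linear fit; phrases with positive weight form the summary list.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Lasso

    def lasso_summary(docs, labels, n_phrases=15, alpha=0.01):
        vec = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
        X = vec.fit_transform(docs)                        # tf-idf document-phrase matrix
        fit = Lasso(alpha=alpha).fit(X, np.asarray(labels, dtype=float))
        terms = vec.get_feature_names_out()
        top = np.argsort(fit.coef_)[::-1][:n_phrases]      # largest positive weights first
        return [terms[j] for j in top if fit.coef_[j] > 0]

Here docs would hold the article- or paragraph-unit texts with the subject phrases excluded, and labels the 0/1 vector produced by the chosen count-m scheme.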

Our framework provides a general tool for summarization of corpora of documents relative to a subject matter. Using this tool, researchers can easily explore a corpus of documents with an eye to understanding a concise portrayal of any subject they desire. We are now in the process of working with social scientists to use this tool to carry out research in a substantive social science field. Our proposed approach is being implemented within an on-line toolkit, SnapDragon (footnote 7). This toolkit is intended for researchers interested in quickly assessing how corpora of documents cover specified subjects.

6 Acknowledgements

We are extremely grateful to Hoxie Ackerman and Saheli Datta for help assembling this publication. We are indebted to the staff of the XLab at UC Berkeley for their help in planning and conducting the human validation study. This work was partially supported by NSF grant SES-0835531 under the "Cyber-Infrastructure and Discovery" program, NSF grant DMS-0907632, ARO grant W911NF-11-1-0114, NSF grant CCF-0939370, and NSF-CMMI grant 30148. Luke Miratrix is grateful for the support of a Graduate Research Fellowship from the National Science Foundation. Jinzhu Jia thanks the Department of Statistics, UC Berkeley, for supporting this research when he was a postdoc.

Footnote 7: http://statnews2.eecs.berkeley.edu/snapdragon


References

Blei, D. and J. McAuliffe (2008). Supervised topic models. In J. Platt, D. Koller, Y. Singer, and S. Roweis (Eds.), Advances in Neural Information Processing Systems 20, pp. 121–128. Cambridge, MA: MIT Press.

Blei, D. M., A. Y. Ng, and M. I. Jordan (2003). Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022.

Branton, R. P. and J. Dunaway (2009). Slanted newspaper coverage of immigration: The importance of economics and geography. Policy Studies Journal 37(2), 257–273.

Chang, J., J. Boyd-Graber, S. Gerrish, C. Wang, and D. Blei (2009). Reading tea leaves: How humans interpret topic models. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta (Eds.), Advances in Neural Information Processing Systems 22, pp. 288–296.

Chen, J., Q. Yang, B. Zhang, Z. Chen, D. Shen, and Q. Cheng (2006). Diverse topic phrase extraction from text collection. In WWW 2006, Edinburgh, UK.

Dai, X., J. Jia, L. El Ghaoui, and B. Yu (2011). SBA-term: Sparse bilingual association for terms. In Fifth IEEE International Conference on Semantic Computing, Palo Alto, CA, USA.

D'Alessio, D. and M. Allen (2000). Media bias in presidential elections: a meta-analysis. Journal of Communication 50(4), 133–156.

Denham, B. E. (2004). Hero or hypocrite? International Review for the Sociology of Sport 39(2), 167–185.

El Ghaoui, L., G.-C. Li, V.-A. Duong, V. Pham, A. Srivastava, and K. Bhaduri (2011, October). Sparse machine learning methods for understanding large text corpora: Application to flight reports. In Conference on Intelligent Data Understanding.

El Ghaoui, L., V. Viallon, and T. Rabbani (2011). Safe feature elimination for the LASSO. Submitted, April 2011.

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. J. Mach. Learn. Res. 3, 1289–1305.

Frank, E., G. W. Paynter, I. H. Witten, C. Gutwin, and C. G. Nevill-Manning (1999). Domain-specific keyphrase extraction. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), California, pp. 668–673. Morgan Kaufmann.

Gawalt, B., J. Jia, L. W. Miratrix, L. El Ghaoui, B. Yu, and S. Clavier (2010). Discovering word associations in news media via feature selection and sparse classification. In MIR '10, Proceedings of the International Conference on Multimedia Information Retrieval, Philadelphia, Pennsylvania, USA, pp. 211–220.

Genkin, A., D. D. Lewis, and D. Madigan (2007). Large-scale Bayesian logistic regression for text categorization. Technometrics 49(3), 291–304.

Gilens, M. and C. Hertzman (2000). Corporate ownership and news bias: Newspaper coverage of the 1996 Telecommunications Act. The Journal of Politics 62(2), 369–386.

Goldstein, J., V. Mittal, J. Carbonell, and M. Kantrowitz (2000). Multi-document summarization by sentence extraction. In NAACL-ANLP 2000 Workshop on Automatic Summarization, pp. 40–48.

Hastie, T., R. Tibshirani, and J. H. Friedman (2003). The Elements of Statistical Learning. Springer.

Hennig, L. (2009). Topic-based multi-document summarization with probabilistic latent semantic analysis. In Recent Advances in Natural Language Processing (RANLP).

Hopkins, D. and G. King (2010). A method of automated nonparametric content analysis for social science. American Journal of Political Science 54(1), 229–247.

Ifrim, G., G. Bakir, and G. Weikum (2008). Fast logistic regression for text categorization with variable-length n-grams. In 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, pp. 354–362. ACM.

Lazer, D., A. Pentland, L. Adamic, S. Aral, A.-L. Barabási, D. Brewer, N. Christakis, N. Contractor, J. Fowler, M. Gutmann, T. Jebara, G. King, M. Macy, D. Roy, and M. Van Alstyne (2009). Computational social science. Science 323(5915), 721–723.

Lee, L. and S. Chen (2006). New methods for text categorization based on a new feature selection method and a new similarity measure between documents. Lecture Notes in Computer Science 4031, 1280.

Miller, J. M. and J. A. Krosnick (2000). News media impact on the ingredients of presidential evaluations: Politically knowledgeable citizens are guided by a trusted source. American Journal of Political Science 44(2), 301–315.

Monroe, B. L., M. P. Colaresi, and K. M. Quinn (2008). Fightin' words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis 16(4), 372–403.

Neto, J., A. Freitas, and C. Kaestner (2002). Automatic text summarization using a machine learning approach. Advances in Artificial Intelligence 2507, 205–215.

Nisbet, E. C. and T. A. Myers (2010). Challenging the state: Transnational TV and political identity in the Middle East. Political Communication 27(4), 347–366.

Potter, A. E. (2009). Voodoo, zombies, and mermaids: U.S. newspaper coverage of Haiti. Geographical Review 99(2), 208–230.

Pottker, H. (2003). News and its communicative quality: the inverted pyramid–when and why did it appear? Journalism Studies 4, 501–511.

Rose, S., D. Engel, N. Cramer, and W. Cowley (2010). Automatic keyword extraction from individual documents. In M. W. Berry and J. Kogan (Eds.), Text Mining: Applications and Theory. John Wiley and Sons, Ltd.

Salton, G. (1991). Developments in automatic text retrieval. Science 253(5023), 974–980.

Salton, G. and C. Buckley (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management 24(5), 513–523.

Senellart, P. and V. D. Blondel (2008). Automatic discovery of similar words. In Survey of Text Mining II. Springer.

Wahl, O. E., A. Wood, and R. Richards (2002). Newspaper coverage of mental illness: Is it changing? Psychiatric Rehabilitation Skills 6(1), 9–31.

Yang, Y. and J. O. Pedersen (1997). A comparative study on feature selection in text categorization. In ICML-97, 14th International Conference on Machine Learning, Nashville, US, pp. 412–420.

Zhang, T. and F. J. Oles (2001). Text categorization based on regularized linear classification methods. Information Retrieval 4, 5–31.