News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and...

42
News and Geolocated Social Media Accurately Measure Protest Size Anton Sobolev 1 , Keith Chen 1 , Jungseock Joo 1 , Zachary C. Steinert-Threlkeld 1, * June 6, 2019 Abstract Larger protests are more likely to lead to policy changes than small ones, but whether or not attendance estimates provided in newspapers or generated from social media are biased is an open question. This research note closes the question: news and geolocated social media data generate accurate estimates of the size of protests. This claim is substantiated using cell phone location data from ten million individuals during the 2017 United States Women’s March protests. These cell phone estimates correlate strongly with those provided in news media as well as three size estimates generated using geolocated tweets, one text-based and two based on images. In testing these estimates, we also show that wealthier, more Democratic, and more urbanized areas generated larger protests. 1 University of California - Los Angeles * To whom correspondence should be addressed: [email protected] Word count (excluding references, tables, and figures): 2,803 1

Transcript of News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and...

Page 1: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

News and Geolocated Social Media Accurately MeasureProtest Size

Anton Sobolev1, Keith Chen1, Jungseock Joo1, Zachary C. Steinert-Threlkeld1,*

June 6, 2019

Abstract

Larger protests are more likely to lead to policy changes than small ones, butwhether or not attendance estimates provided in newspapers or generated from socialmedia are biased is an open question. This research note closes the question: newsand geolocated social media data generate accurate estimates of the size of protests.This claim is substantiated using cell phone location data from ten million individualsduring the 2017 United States Women’s March protests. These cell phone estimatescorrelate strongly with those provided in news media as well as three size estimatesgenerated using geolocated tweets, one text-based and two based on images. In testingthese estimates, we also show that wealthier, more Democratic, and more urbanizedareas generated larger protests.

1 University of California - Los Angeles∗To whom correspondence should be addressed: [email protected] count (excluding references, tables, and figures): 2,803

1

Page 2: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

1 Introduction

Using estimates of protest size obtained via mobile phone location data from 10 million

individuals, this research note shows that news and social media data generate accurate

estimates of protest size.

Protests are a key tool in the repertoire of social movements (Tilly and Tarrow, 2015),

and their efficacy is primarily a function of the number of participants (“size”) (DeNardo,

1985; Chenoweth and Stephan, 2011; Biggs, 2016). Methods to directly measure size are too

costly to be employed across large numbers of protests, limiting their use in academic studies

(Schweingruber and McPhail, 1999; McPhail and McCarthy, 2004; Yip et al., 2010). Instead,

event datasets that record size (usually of protests but also of riots or massacres) rely on

numbers provided in media reports, usually newspapers. This reliance raises concerns about

the accuracy of the tallies since researchers do not create the measures themselves and those

who do, usually activists or authority figures, have strategic reasons for inflating or deflating,

respectively, size. Complicating matters, newspapers can choose between multiple estimates

to find the one that suits their editorial beliefs (Mann, 1974). Concern about measurement

validity means many event datasets record size as an ordinal variable (Salehyan et al., 2012)

or only measure casualties (Raleigh et al., 2010).

Using individuals’ location recorded via their mobile phones, this research note shows that

scholars can rely on size estimates reported in newspapers or obtained via geolocated tweets.

Since people almost always carry their smart phone, those devices record individuals’ location

passively, and other researchers have verified that the locations are not biased politically

ANONYMOUS, we consider the cell phone location to be a “gold standard”. Cell phone

data are especially important because, since 1995, the federal government no longer records

protest size in Washington D.C. (McCarthy et al., 1999), and rigorously measuring with

enumerators is too costly when protests occur in multiple locations (Schweingruber and

McPhail, 1999; Yip et al., 2010). The cell phone location data (“cell phone data”) therefore

forms the baseline against which we compare estimates from news (print and television) and

2

Page 3: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

social media (text and images).

The argument that news and social media can be used to measure protest size is sub-

stantiated using the January 21st, 2017 United States Women’s March. One of the largest

mobilizations in American history, that day featured 942 protests comprised of approximately

4.77 million individuals (Chenoweth and Pressman, 2017). The cell phone data shows that

protest size reported in newspapers as well as estimated via geolocated tweets are reliable.

Of the four sets of estimates we compare, those reported in newspapers or calculated via

geolocated tweets containing protest key words are the most accurate, followed by counting

the number of faces in protest photos shared on Twitter. Counting the number of accounts

which share protest photos is the least accurate. The Supplementary Materials show that

the first principal component of the four sets of estimates provide the most accurate protest

size, though this fifth estimate is the most difficult to scale and so not one we recommend

for future use.

2 The Challenge

Large protests are more likely to lead to policy change than small ones (Biggs, 2016). Ku-

ran (1989)’s canonical model of bandwagoning implicitly means a revolution follows large

protests. Lohmann (1994) argues that unprecedented numbers of people rallying in the Ger-

man Democratic Republic in the beginning of 1989 were a key reason the protest movement

grew and eventually toppled Erich Honecker?s government. The gradual growth of protest

size in Iran in 1979 also made it increasingly difficult for the Shah to remain in power (Rasler,

1996; Kurzman, 2004). Nonviolent social movements succeed more often than violent ones

precisely because they tend to be larger: larger movements increase the cost of repression and

make regime defections more likely (Chenoweth and Stephan, 2011; Celestino and Gleditsch,

2013; Thurber, 2013).1

Previous estimates of protest size rely on teams of well-trained observers or size estimates

1A social movement is not the same as a protest. The quantitative study of them, however, measurestheir size based on the size of protests mobilized as part of the movement.

3

Page 4: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

reported in newspapers. Using observers who understand the square footage of a protest site,

the density of the crowd, and the percent of the site the crowd occupies is considered the

most accurate estimation method; on average, a standing person requires about 5 square

feet of personal space (McPhail and McCarthy, 2004). Trained observers can also disperse

within a protest and count the size of clusters of participants (Schweingruber and McPhail,

1999; Yip et al., 2010). We are aware of no dataset which uses this methodology, probably

because it is expensive and requires advance knowledge a protest will occur.

Instead, scholars rely on estimates of protest size in newspapers (Woolley, 2000; Earl

et al., 2004). Though newspapers have known biases in their coverage (Gerner and Schrodt,

1998; Weidmann, 2014; Baum and Zhukov, 2015), the concern about measurement validity

for protest size does not stem from their preference towards violent events in urbanized areas.

Instead, the concern is that newspaper estimates of protest size are secondary, usually coming

directly from protest organizers or state authorities. The estimates could be too high or too

low, but the bias would not be correctable since no independent size estimate exists.2

Distrust of size estimates leads many datasets to report alternate measures of protest

size. The Social Conflict Analysis Database (SCAD) transforms newspaper estimates into

an ordinal variable on a logged scale (Salehyan et al., 2012). The Armed Conflict Location

and Event Data project only reports fatalities since those are believed to be easier than

people in groups to count (Raleigh et al., 2010). Other datasets, such as Mass Mobilization

(Clark and Regan, 2018), Mass Mobilization in Autocracies Database (Weidmann and Rod,

2018), and Nonviolent and Violent Campaigns and Outcomes (Chenoweth, Pinckney and

Lewis, 2018) do record the estimates provided in newspapers. Nonetheless, we are aware of

no research that has shown that size data from newspapers can be used without introducing

bias, and work which takes protest size as its outcome instead has used the count of protests

(Steinert-Threlkeld, 2017).

2Newspapers are more likely to report on large protests than small ones (McCarthy et al., 1999). Sincelarge protests often grow from small ones, being able to record small protests that do not appear in news-papers should enhance the understanding of protest dynamics.

4

Page 5: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

3 Data

That news and geolocated social media can accurately measure the number of protesters

is demonstrated using the case of the 2017 United States’ Women’s March. Occurring on

January 21st, 2017, the day after President Donald Trump’s inauguration, the simultaneous

protests across the country constitutes the largest single-day mobilization in the United

States.3

This note compares eight estimates of protest size from four sources: cell phone location

data (1) ANONYMOUS, the Crowd Counting Consortium (3) (Chenoweth and Pressman,

2017), Twitter (3), and population according to the American Community Survey as a

placebo (1). The cell phone location data are from a data broker and shared with an author

of this paper; the Crowd Counting Consortium (CCC) data are publicly available; and

the tweets have been collected by another author of this paper ANONYMOUS. Table 1

provides summary statistics of the three sources.4

The cell phone estimates use data provided from SafeGraph, a third-party data broker.

Cell phones record owners’ location and let downloaded applications use that information to,

ostensibly, provide services. The applications and telecommunications providers also sell the

location data brokers or directly to companies (Valentino-DeVries et al., 2018; Cox, 2019).

Our dataset contains location information for just over ten million individuals. For more

information on the data and validation of the political representativeness, see ANONY-

MOUS. We have no identifying information on these individuals (de Montjoye et al., 2018).

Identifying protesters from the SafeGraph data requires several steps. First, we only

search in cities identified in the CCC data. Second, in those cities, we only consider pedes-

trians by excluding pings that were registered while the owners of mobile phones used private

or public transportation; mode of transit is identifiable by the frequency of location pings.

Third, we take the distribution of seven-digit geohashes on January 21st, 2017, the day of

3Protests also occurred in other countries. Lacking cell phone data on them, this paper ignores them.4All tables were made using Hlavac (2018).

5

Page 6: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

Table 1: Summary Statistics

Measure N Mean St. Dev. Min Max

Cell Phone Data 399 238.479 596.154 2 3,082

CCC: News 326 10,526.630 52,362.250 3 725,000

CCC, All Estimates (Best Guess) 614 6,727.480 42,490.840 1 725,000

CCC, All Estimates (Low) 614 5,804.056 34,443.560 1 550,000

Twitter: Text Accounts 1,045 10.793 70.643 2 1,522

Twitter: Images Faces 244 2,601.361 10,568.640 0 132,708

Twitter: Images Accounts 269 926.301 2,209.269 3 20,521

Note: CCC refers to the Crowd Counting Consortium data and uses the “Best Guess”variable unless otherwise indicated. Twitter: Text Accounts uses keywords to iden-tify protesters. Twitter: Images Faces sums the number of faces in protest images.Twitter: Images Accounts counts the number of unique accounts that share protestphotos.

the Women’s March, and keep the geohashes above the 95th percentile per city.5 Fourth,

any individual in those geohashes is counted as a protester. The size of this group represents

a conservative estimate of the total size of the protest in each city. In most of the cities, the

neighboring geohashes also have a relatively high density of individuals but are below the

95% threshold; some of the individuals from these geohashes are probably protesters. To

summarize: the number of people in a city’s densest 7-digit geohashes (those whose density

is in the 95th percentile or greater) is the the number of protesters to which we compare the

other estimates.6

The Crowd Counting Consortium (CCC) is an open-source, hand-coded events dataset

established to record protest size. CCC is agnostic regarding source type. Most are newspa-

pers or local television broadcasts, but a substantial minority (26.84%) of estimates derive

5A geohash is an 8 digit alphanumeric code that corresponds to a cell with sides of 38 meters. We use the7 digit code, corresponding to cells wide sides of 152 meters. Because geohashes are alphanumeric stringsand not points, location matching is much quicker than using spatial coordinates.

6We explored an additional step to identifying protesters, based on the path of their walking. In responseto concerns from the data provider, we dropped this methodology. Initial results suggested that this extrastep degraded results.

6

Page 7: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

from tweets or Facebook statuses. It also reports high and low estimates, based on these

sources. Unless otherwise indicated, this research note uses the Best Guess variable, usually

an average of the adjusted low and high estimates in an article, as the protest size estimate.

CCC: News refers to events for which the source is a newspaper or local television broadcast.

CCC, All Estimates includes these events for which only social media provides estimates.7

CCC, All Estimates (Low) uses the Low estimate, not Best Guess, and ignores source type.

For Twitter, three approaches approximate the number of protesters.8 First is a count

of the number of unique accounts in a city per day using one of three common keywords,

“womensmarch”, “whyimarch”, or “imarchfor”, between 10 a.m. and 5 p.m. local time.

Tweets are not resolved to intracity locations, for three reasons. First, for the majority of

tweets, Twitter aggregates the specific location of a tweet to local polygons while reporting

the actual location as the city that contains that polygon. The polygon, however, does

not necessarily envelop the location of the actual tweet. For example, Rancho Palos Verdes

and Monterrey, two cities near Los Angeles, recorded, based on GPS coordinates, the most

number of tweets from January 21, but Twitter labels those tweets as coming from Los

Angeles. Intracity location information is therefore unknown for almost all tweets. Second,

the Crowd Counting Consortium does not always provide intracity geographic information.

A protest they label as occurring in Seattle would require further research for obtaining the

specific location, for example. Third, most cities, even large ones, contain only one protest

per day; if they contain multiple, one protest will contain the vast majority of protesters

(Biggs, 2016). We call this measure Twitter: Text Accounts.

The second Twitter approach to estimate the number of protesters at the Women’s March

is to count the number of faces in any protest photo from a city. A convolutional neural

7We include television broadcasts as part of news because all estimates come from the source type’swebsite, and there is no facile method to distinguish between newspapers and broadcasts. Twitter andFacebook are identifiable with the presence of “twitter” or “facebook” in the link or the word “FB” enteredas the source, so we exclude social media sources.

8Tweets were collected by connecting to Twitter’s streaming application programming interface (API)using the package streamR and requesting only tweets with geographic coordinates (Barbera, 2013). Formore detail, see ANONYMOUS

7

Page 8: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

network (CNN) first identifies protest photos, then another CNN identifies faces in that

photo. Faces in a protest photo are summed and added per city-day. If protesters who

share geolocated photos are randomly scattered across a protest site and share pictures of

their immediate surroundings, then this approach is similar to many authorities’ estimating

procedure (McPhail and McCarthy, 2004). We call this measure Twitter: Images Faces.

The third Twitter approach is to count the number of people tweeting a protest photo.

This approach does not count retweets, so it assumes that any tweet of a protest photo

requires that the Twitter user was actually at a protest. We call this measure Twitter: Images

Accounts. Section 5 discusses in more detail why these three measures should correlate with

the cell phone data estimates.

For either image approach to be accurate, we assume that Twitter account owners take

photographs of protesters, share them on Twitter at some point on the day of the protest,

are not strategic about the timing of the sharing (Pfeffer and Mayer, 2018), and enable

geolocation for the tweet.9 While we can think of no reason that accounts would, in aggregate,

be biased in sharing, or not, protest photos, it is possible these estimates will not be accurate,

especially because so many steps are required to enter our data. The results presented in the

next section show that, despite the biases that could enter in any of these steps, the Twitter

approaches are accurate. For details of these Twitter approaches, see ANONYMOUS.

Table 1 shows that the range of size estimates varies substantially across datasets. To

compare the preferred four estimates to the cell phone data, we therefore convert each

estimate of size to a per capita measure, log-transform that result because it is still very

skewed, and take the z-score of the logged per-capita measure. (For the remainder of the

paper, “z-score” refers to this measure.) We prefer per capita to raw estimates because

population size will drive the cell phone, CCC, and Twitter estimates. Incorporating it

into our measurement therefore removes a possible source of omitted variable bias. The

z-score is appropriate because the estimates vary by orders of magnitude across datasets, so

9The phrase “some point on the day of the protest” is intentionally vague: large crowds can overwhelmthe cellular towers used to transmit data, making it impossible to post during a protest.

8

Page 9: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

standardizing permits distant measurement across the datasets.10 Standardizing in this way

permits the measurement of protest size across datasets whose estimates vary by orders of

magnitude.

4 Results

CCC: News and Twitter: Text Accounts estimate protest size with the least error. Twitter:

Images Faces is next best, while counting the number of accounts that share protest photos

is the least accurate. No estimate, however, is unreliable, and Section 5 discusses when

different ones may be preferred.

Unless otherwise stated, all results are for the 155 cities for which each dataset records a

protest size. The Supplementary Materials shows results keeping the maximum number of

observations each dataset and the cell phone data share; results do not change.

Figure 1 shows the correlation of the z-score for each of the datasets with the cell phone

data. Section S4.1 reports these correlations in a table.

Table 2 shows that CCC: News and Twitter: Text Accounts produce the closest estimates

to the cell phone data. Those two measures are also the closest to the cell phone data the

greatest percentage of the time, whether measuring difference (rows 1-5) or the number

of observations within .1, .5, and 1 standard deviation of the cell phone data (rows 6-8).

Twitter: Images Accounts is the least accurate on six of the eight fit measures.

Figures 2 and 3 show when each dataset’s measure is close to the cell phone data. Figure

2 shows the cumulative distribution of the four datasets’ closest estimates across the range

of z-scores. This graph reveals any bias in each estimate. For example, if most of a dataset’s

closest predictions are for small protests, its CDF will approach 1 quickly. If its closest

predictions are randomly scattered across the range of protest size, its CDF will track the

gold-standard’s. Twitter: Text Accounts has more of its closest estimates occur for small

protests, which makes sense: because more tweets contain protest hashtags than protest

images, it records small protests better than image estimates. Twitter: Images Accounts

10The Supplementary Materials show that the results hold when not using the per capita measure.

9

Page 10: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

Figure 1: Correlation of Estimates

Table 2: Measuring Fit

CCC: Twitter: Twitter: Twitter:

News Images Accounts Images Faces Text Accounts

Mean Error 0.06 0.10 0.08 0.07

Mean Error (Trimmed) 0.06 0.11 0.14 0.04

Mean Absolute Error 0.58 0.82 0.71 0.60

Mean Absolute Error (Trimmed) 0.51 0.70 0.59 0.50

Closest Estimate % 0.31 0.22 0.20 0.27

Percent Within 0.1 SD 0.17 0.08 0.11 0.14

Percent Within 0.5 SD 0.52 0.38 0.52 0.54

Percent Within 1 SD 0.82 0.67 0.77 0.81

Note: “(Trimmed)” datasets drop observations where |z| > 3. Bold is the best measure

per row; underline, the worst.

10

Page 11: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

tracks the cell phone data as protests pass their average size. Twitter: Images Faces outper-

forms Twitter: Images Accounts for most of its range except 0 ≤ z ≤ 1. CCC is the most

consistently accurate of the three, as its CDF closely tracks the cell phone data estimates.

Figure 2: Location of Best Fits per Measure

Finally, Figure 3 shows the mean error by z-score decile, with the closest estimate per

decile labeled. All measures overestimate the size of small protests and underestimate large

ones. CCC: News and Twitter: Text Accounts perform best, as they each have the smallest

mean error three times. Twitter: Images Faces is closest twice; Twitter: Images Accounts,

once.

The Supplementary Materials present additional analyses that confirm these results. To

verify that results are not driven by the per capita transformation, we show results measuring

the outcome as the z-score of the logarithm of protest size; the subsections presenting them

are labeled “Not Per Capita”. We have also generated protest size estimates using the first

principal component of all four measure as well as the two Twitter images measures; Section

S1 explains this analysis, and its results are presented in the Supplementary Material’s other

sections. The principal components have the least error.

After introducing the principal components analysis, the next sections replicate the re-

11

Page 12: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

Figure 3: Mean Error by Decile

sults above. S2 repeats Table 2 using all observations per measure and the “Not Per Capita”

measure. Section S3 replicates Figure 3 using all observations per measure the “Not Per

Capita” measure. Both incorporate the principal components results and show results when

using the maximum number of observations per dataset.

Sections S4 and S5 provide two sets of additional results. Section S4 shows the Pearson

correlation coefficient for each measure, using different subsets of data and operationaliza-

tions of protest size. Section S5 introduces a series of models regressing the cell phone data

on the four measures, separately and pooled; all models include state fixed effects, socioeco-

nomic controls at the urbanized area level taken from the 2016 American Community Survey,

and population. CCC: News and Twitter: Text Accounts feature the highest correlations

12

Page 13: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

and best model fit regardless of the subset of data or operationalization of protest size.

5 Discussion

5.1 Why Does it Work?

Why do news media and geolocated social media data reliably measure protest size? News

reports rely on others’ estimates, and the social media estimates are from models applied to

relatively small samples with many entry points for bias. That both sets of estimates work

may therefore appear surprising.

The estimates in CCC: News correlate well with the cell phone data because they tend

to come from actors applying methodology similar to that developed in Schweingruber and

McPhail (1999). While we can find no reports of how activists generate their estimates, state

authorities do approximate crowd attendance by estimating the carrying capacity of public

areas and the percentage of that area covered with people (McPhail and McCarthy, 2004).

The estimates in newspapers therefore are not as haphazard as their lack of methodological

reporting would suggest.

The data generating process for Twitter: Images Faces is similar. The photos for this

note primarily consist of people taking photos of their immediate surrounding. The photos

for this paper contain anywhere from 1 to 48 faces, with an average of 2.81 faces per photo.

So long as people who geotag their photos are randomly distributed across a protest site, and

there is no reason to think they are not, then Twitter: Images Faces is similar to randomly

sampling segments of a protest size (McPhail and McCarthy, 2004).

5.2 Which Dataset Should I Prefer?

This research note is agnostic about whether or not researchers should prefer news or social

media data, as there are advantages and disadvantages to relying on either. Broadly speak-

ing, Twitter requires a greater fixed cost but has lower marginal cost than using newspapers.

Newspapers are easier to access than social media, so projects using them can start more

quickly. If a project includes local newspapers, they will also provide more geographic cover-

13

Page 14: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

age than Twitter. One disadvantage is that they are still secondary sources of information,

whereas tweets, properly collected, provide primary evidence. A project using newspapers

to measure protests also requires more ongoing work than one using Twitter.

Though acquiring and processing Twitter data is more technically challenging than doing

the same for news articles, it has some advantages over them. Protest images, because they

are primary source material, may be closer to the cell phone data in their data generating

process. Using the keyword approach to measure protest size is much less difficult than using

images and equally accurate, so measuring protest size once a data pipeline is established is

relatively easy.

One advantage to using images on Twitter versus tweet text is that text requires subject

matter expertise on keywords per protest event to analyze, whereas visuals are closer to a

universal language. Measuring size with text therefore requires more ongoing involvement

than image-based estimates as new events occur.

These findings do not offer any insight into whether or not news reports or Twitter

provide reliable estimates outside of the United States, though other work has found similar

levels of correlation between newspaper estimates and Twitter: Images Faces in South Korea

and Russia (Joo and Steinert-Threlkeld, 2019). It is possible, for example, that crowd size

estimates in news, as well as usage of keywords and sharing of protest photos on social media,

is only reliable in countries above a socioeconomic threshold, with strong protections for

individual rights, during peaceful protests, or some combination of those and other factors.

This work cannot test the international robustness of this finding because CCC only focuses

on the United States, there is no other dataset of contemporaneous protest size, and we

could not obtain cell phone location data in other locations.

5.3 How Many People Protested?

The cell phone estimates should not be interpreted as the true number of protesters since

the data represent about 3% of the United States’ population. The estimates will therefore

be lower than whatever the true number of protesters was. For example, the largest protest

14

Page 15: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

in the cell phone data was in Washington D.C., with 3,082 attendees. The United States

has almost 330 million inhabitants, so those estimates scale to roughly 102,723 protesters

(3,082*33.33). Those estimates are much smaller than newspapers’ reports of 750,000 people;

the carrying capacity of the National Mall, however, is probably much closer to 500,000

people (McPhail and McCarthy, 2004), and reliable, third-party estimates conclude that

470,000 individuals protested in Washington D.C. (Wallace and Parlapiano, 2017).

5.4 Concluding Thoughts

(Similar data have been used to measure the size of apolitical (Mamei and Colonna, 2016)

and political events (Shalev and Rotman, 2016; Traag, Quax and Sloot, 2017).)

These results suggest that news provide reliable estimates of protest sizes, as do social

media text or images of protesters (Botta, Moat and Preis, 2015). Event datasets that report

size should therefore report raw estimates recorded in news. Researchers using those data

can then decide whether or not to transform them. Transforming, and not reporting, raw

data throws away information future researchers may find useful.

The common, understandable concern that news estimates of protest size cannot be

trusted should be laid to rest. Whether a researcher prefers news or social media to measure

protest size is his or her choice, as they are equally accurate. So long as one believes that

estimates of protest size from cellphone location data are accurate, then estimates of protest

size from news and social media are trustworthy.

15

Page 16: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

ReferencesBarbera, Pablo. 2013. “streamR.”.URL: https://cran.r-project.org/web/packages/streamR/

Baum, Matthew A and Yuri M Zhukov. 2015. “Filtering revolution: Reporting bias ininternational newspaper coverage of the Libyan civil war.” Journal of Peace Research52(3):384–400.

Biggs, Michael. 2016. “Size Matters: Quantifying Protest by Counting Participants.” Soci-ological Methods & Research pp. 1–33.

Botta, Federico, Helen Susannah Moat and Tobias Preis. 2015. “Quantifying crowd size withmobile phone and Twitter data.” Royal Society Open Science 2:150162.

Celestino, Mauricio Rivera and Kristian Skrede Gleditsch. 2013. “Fresh carnations or allthorn, no rose? Nonviolent campaigns and transitions in autocracies.” Journal of PeaceResearch 50(3):385–400.

Chenoweth, Erica and Jeremey Pressman. 2017. “Crowd Counting Consortium.”.URL: https://sites.google.com/view/crowdcountingconsortium/home

Chenoweth, Erica, Jonathan Pinckney and Orion Lewis. 2018. “Days of rage : Introducingthe NAVCO 3.0 Dataset.” Journal of Peace Research 55(4):524–534.

Chenoweth, Erica and Maria J. Stephan. 2011. Why Civil Resistance Works. New YorkCity: Columbia University Press.

Clark, David H. and Patrick M. Regan. 2018. “Mass Mobilization Protest Data.”.URL: https://www.binghamton.edu/massmobilization/about.html

Cox, Joseph. 2019. “I Gave a Bounty Hunter $300. Then He Located Our Phone.”.URL: https://motherboard.vice.com/en us/article/nepxbz/i-gave-a-bounty-hunter-300-dollars-located-phone-microbilt-zumigo-tmobile

de Montjoye, Yves-Alexandre, Sebastien Gambs, Vincent Blondel, Geoffrey Canright, Nico-las De Cordes, Sebastien Deletaille, Kenth Engø-monsen, Manuel Garcia-Herranz, JakeKendall, CAmeron Kerry, Gautier Krings, Emmanuel Letouze, Miguel Luengo-Oroz, NuriaOliver, Luc Rocher, Alex Rutherford, Zbigniew Smoreda, Jessica Steele, Erik Wetter, AlexPentland and Linus Bengtsson. 2018. “On the privacy-conscientious use of mobile phonedata.” Scientific Data 5.URL: http://dx.doi.org/10.1038/sdata.2018.286

DeNardo, James. 1985. Power in Numbers: The Political Strategy of Protest and Rebellion.Princeton: Princeton University Press.

Earl, Jennifer, Andrew Martin, John D. Mccarthy and Sarah A. Soule. 2004. “The Use ofNewspaper Data in the Study of Collective Action.” Annual Review of Sociology 30:65–80.

16

Page 17: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

Gerner, Deborah J. and Philip A. Schrodt. 1998. The Effects of Media Coverage on CrisisAssessment and Early Warning in the Middle East. In Early Warning and Early Response,ed. Susanne Schmeidl and Howard Adelman. New York City: Columbia University Press.

Hlavac, Marek. 2018. “stargazer: beautiful Latex, HTML, and ASCII tables from R statis-tical output.”.

Joo, Jungseock and Zachary C. Steinert-Threlkeld. 2019. “Image as Data: Automated VisualContent Analysis for Political Science.”.

Kuran, Timur. 1989. “Sparks and Prairie Fires: A Theory of Unanticipated Political Revo-lution.” Public Choice 61(1):41–74.

Kurzman, Charles. 2004. The Unthinkable Revolution in Iran. Cambridge: Harvard Univer-sity Press.

Lohmann, Susanne. 1994. “The Dynamics of Informational Cascades: The Monday Demon-strations in Leipzig, East Germany, 1989-91.” World Politics 47(1):42–101.

Mamei, Marco and Massimo Colonna. 2016. “Estimating attendance from cellular networkdata.” International Journal of Geographical Information Science 30(7):1281–1301.

Mann, Leon. 1974. “Counting the Crowd: Effects of Editorial Policy on Estimates.” Jour-nalism Quarterly 51(2):278–285.

McCarthy, John D., Clark McPhail, Jackie Smith and Louis J. Crishock. 1999. Electronicand Print Media Representations of Washington, D.C. Demonstrations, 1982 and 1991: ADemography of Description Bias. In Acts of Dissent: New Developments in the Study ofProtest, ed. Dieter Rucht, Ruud Koopmans and Friedhelm Neidhardt. pp. 113–130.

McPhail, Clark and John McCarthy. 2004. “Who Counts and How: Estimating the Size ofProtests.” Contexts 3(3):12–18.

Pfeffer, Jurgen and Katja Mayer. 2018. “Tampering with Twitter’s Sample API.” EPJ DataScience 7(50):1–21.

Raleigh, Clionadh, Andrew Linke, Havard Hegre and Joakim Karlsen. 2010. “IntroducingACLED: An Armed Conflict Location and Event Dataset: Special Data Feature.” Journalof Peace Research 47(5):651–660.

Rasler, Karen. 1996. “Concessions, Repression, and Political Protest in the Iranian Revolu-tion.” American Sociological Review 61(1):132–152.

Salehyan, Idean, Cullen Hendrix, Jesse Hammer, Christina Case, Christopher Linebarger,Emily Stull and Jennifer Williams. 2012. “Social Conflict in Africa: A New Database.”International Interactions 38(4):503–511.

Schweingruber, David and Clark McPhail. 1999. “A Method for Systematically Observingand Recording Collective Action.” Sociological Methods & Research 27(4):451–498.

17

Page 18: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

Shalev, Michael and Assaf Rotman. 2016. “Mass Protests Meet Big Data.” Working paper .URL: http://cpd.berkeley.edu/wp-content/uploads/2016/09/Rotman-Shalev-for-UCB-Colloquium-1.pdf

Steinert-Threlkeld, Zachary C. 2017. “Spontaneous Collective Action: Peripheral Mobiliza-tion During the Arab Spring.” American Political Science Review 111(02):379–403.

Thurber, Ches. 2013. “Social Structures of Revolution: Networks, Overlap, and the Choiceof Violent Versus Nonviolent Strategies in Conflict.” Journal of Peace Research 50(3):401.

Tilly, Charles and Sidney Tarrow. 2015. Contentious Politics. Second ed. Oxford: OxfordUniversity Press.

Traag, V.A., R. Quax and P.M.A. Sloot. 2017. “Modelling the distance impedance of protestattendance.” Physica A: Statistical Mechanics and its Applications 468:171–182.

Valentino-DeVries, Jennifer, Natasha Singer, Michael H. Keller and Aaron Krolik. 2018.“Your Apps KnowWhere You Were Last Night, and They’re Not Keeping It Secret.”.URL: https://www.nytimes.com/interactive/2018/12/10/business/location-data-privacy-apps.html?smid=tw-share

Wallace, Tim and Alicia Parlapiano. 2017. “Crowd Scientists Say Women’s March inWashington Had 3 Times as Many People as Trump’s Inauguration.”.URL: https://www.nytimes.com/interactive/2017/01/22/us/politics/womens-march-trump-crowd-estimates.html

Weidmann, Nils B. 2014. “On the Accuracy of Media-based Conflict Event Data.” Journalof Conflict Resolution 59(6):1129–1149.

Weidmann, Nils B. and Espen Geelmuyden Rod. 2018. The Internet and Political Protestin Autocracies. Oxford University Press.

Woolley, John T. 2000. “Using Media-Based Data in Studies of Politics.” American Journalof Political Science 44(1):156–173.

Yip, P, R Watson, K Chan, E Lau, F Chen, Y Xu, L Xi, D Cheung, B Ip and D Liu. 2010.“Estimation of the Number of People in a Demonstration.” Australian & New ZealandJournal of Statistics 52(1):17–26.

18

Page 19: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

Supplementary Materials for

“News and Social Media Accurately Measure Protest Size”

S1 Principal Component Analysis

We also construct two latent measures, using the first component from a principal component

analysis, of protest size using CCC: News, Twitter: Text Accounts, Twitter: Images Faces,

and Twitter: Images Accounts. PCA: All, the first measure, uses all four estimates. PCA:

Images, the second, uses only Twitter: Images Faces and Twitter: Images Accounts. Only

images are preferred for the second because this analysis will scale more easily than relying

on newspapers or hand-constructed protest dictionaries.

The following sections include PCA estimates. Overall, PCA: All performs the best of

all six measures, while PCA: Images performs almost identically to Twitter: Images Faces.

S2 Additional Model Fit Measurement

S2.1 Per Capita, Overlapping Observations

Table A1: Overlapping Observations, Include PCA

CCC: PCA: PCA: Twitter: Twitter: Twitter:

News Images All Images Accounts Images Faces Text Accounts

Mean Error 0.06 0.09 0.00 0.10 0.08 0.07

Mean Error (Trimmed) 0.06 0.09 -0.07 0.11 0.14 0.04

Mean Absolute Error 0.58 0.75 0.84 0.82 0.71 0.60

Mean Absolute Error (Trimmed) 0.58 0.72 0.78 0.79 0.62 0.58

Best Predictor, % 0.22 0.10 0.20 0.16 0.11 0.21

Within 0.1 SDs 0.17 0.12 0.11 0.08 0.11 0.14

Within 0.5 SDs 0.52 0.50 0.50 0.38 0.52 0.54

Within 1 SDs 0.82 0.75 0.67 0.67 0.77 0.81

Note: “(Trimmed)” datasets drop observations where |z| > 3.

19

Page 20: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

S2.2 Per Capita, All Observations

Table A2: All Observations, Include PCACCC: PCA: PCA: Twitter: Twitter: Twitter:

News Images All Images Accounts Images Faces Text Accounts

Mean Error 0.03 0.11 0.00 0.06 0.11 0.13

Mean Error (Trimmed) 0.05 0.13 -0.07 0.10 0.16 0.12

Mean Absolute Error 0.60 0.73 0.84 0.83 0.71 0.61

Mean Absolute Error (Trimmed) 0.59 0.69 0.78 0.78 0.63 0.60

Best Predictor, 0.22 0.10 0.20 0.16 0.11 0.21

Within 0.1 SDs 0.17 0.11 0.11 0.08 0.09 0.13

Within 0.5 SDs 0.50 0.53 0.50 0.39 0.51 0.54

Within 1 SDs 0.81 0.76 0.67 0.68 0.78 0.80

Note: “(Trimmed)” datasets drop observations where |z| > 3.

S2.3 Not Per Capita, Overlapping Observations

Table A3: Not Per Capita, Include PCA, Overlapping Observations

CCC: PCA: PCA: Twitter: Twitter: Twitter:

News Images All Images Accounts Images Faces Text Accounts

Mean Error 0.20 -0.22 -0.41 -0.14 -0.24 0.71

Mean Error (Trimmed) 0.20 -0.20 0.27 -0.14 -0.22 0.58

Mean Absolute Error 0.49 0.70 2.06 0.58 0.54 0.94

Mean Absolute Error (Trimmed) 0.49 0.68 1.42 0.58 0.52 0.83

Best Predictor, % 0.32 0.10 0.07 0.22 0.15 0.14

Within 0.1 SDs 0.19 0.06 0.03 0.12 0.08 0.07

Within 0.5 SDs 0.62 0.46 0.13 0.52 0.61 0.43

Within 1 SDs 0.88 0.79 0.33 0.82 0.88 0.66

Note: “(Trimmed)” datasets drop observations where |z| > 3.

20

Page 21: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

S2.4 Not Per Capita, All Observations

Table A4: Not Per Capita, Include PCA, All Observations

CCC: PCA: PCA: Twitter: Twitter: Twitter:

News Images All Images Accounts Images Faces Text Accounts

Mean Error 0.03 0.11 0.00 0.06 0.11 0.13

Mean Error (Trimmed) 0.05 0.13 −0.07 0.10 0.16 0.12

Mean Absolute Error 0.60 0.73 0.84 0.83 0.71 0.61

Mean Absolute Error (Trimmed) 0.59 0.69 0.78 0.78 0.63 0.60

Best Predictor, % 0.22 0.10 0.20 0.16 0.11 0.21

Within 0.1 SDs 0.17 0.11 0.11 0.08 0.09 0.13

Within 0.5 SDs 0.50 0.53 0.50 0.39 0.51 0.54

Within 1 SDs 0.81 0.76 0.67 0.68 0.78 0.80

Note: “(Trimmed)” datasets drop observations where |z| > 3.

21

Page 22: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

S3 Mean Error by Z-Score Decile

S3.1 Per Capita, Overlapping Observations

Figure A1: Mean Error by Decile, Overlapping Observations

22

Page 23: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

S3.2 Per Capita, All Observations

Figure A2: Mean Error by Decile, All Observations

23

Page 24: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

S3.3 Not Per Capita, Overlapping Observations

Figure A3: Mean Error by Decile, Overlapping Observations

24

Page 25: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

S3.4 Not Per Capita, All Observations

Figure A4: Mean Error by Decile, All Observations

25

Page 26: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

S4 Correlation Tables

Table A5 shows the pairwise correlation of the cell phone data estimates with the six protest

measures and city population. Because correlation is sensitive to outliers, our preferred

measure is the correlation of the logged protest size values. Raw correlations are very high for

all measures, with the news and social media measures appearing equally accurate. The raw

correlations are influenced by the largest protests, however, though the logged correlations

are still high. Social media and news are almost equally accurate, though Twitter: Images

Accounts is consistently the least accurate. Its size correlation is the lowest of the six

measures, and the gap between its rank correlation and the next lowest (CCC, All Estimates

(Low)) is large. Note as well that its size and rank correlation are low with the news and

other Twitter estimates, suggesting that counting the number of accounts which share protest

images captures something slightly different than protest attendance. While counting the

number of accounts which share protest photos may appear less methodologically problematic

than counting faces, it is clearly worse than the other two Twitter estimates and should be

avoided.

Across almost all measurements of protest size, the placebo correlation is lower than the

measures of protest size. The only times it is larger, shown in Tables A7b and A9b, is when

the data are restricted to the 128 observations common to all datasets. Even then, it is not

much larger than CCC: News or Twitter: Text Accounts.

S4.1 Z-score of Log per Capita

This measure is the one used throughout the manuscript.

26

Page 27: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

Tab

leA

5:C

orre

lati

onof

Z-s

core

ofL

ogp

erC

apit

a

(a)

All

Ob

serv

atio

ns

per

Pai

r

Cel

lP

hon

eD

ata

Urb

aniz

edA

rea

CC

C:

CC

C,

CC

C,

Tw

itte

r:T

wit

ter:

Tw

itte

r:

New

sA

llE

stim

ates

All

Est

imat

es(L

ow)

Tex

tA

ccou

nts

Imag

esF

aces

Imag

esA

ccou

nts

Cel

lP

hon

eD

ata

1-0.5

850.

655

0.65

50.

652

0.68

10.

424

0.39

5

Urb

aniz

edA

rea

-0.5

851

-0.3

27-0.3

27-0.3

38-0.4

28-0.0

59-0.2

16

CC

C:

New

s0.

655

-0.3

271

10.

988

0.80

70.

609

0.39

1

CC

C,

All

Est

imat

es0.

655

-0.3

271

10.

988

0.80

70.

609

0.39

1

CC

C,

All

Est

imat

es(L

ow)

0.65

2-0.3

380.

988

0.98

81

0.78

00.

600

0.36

6

Tw

itte

r:T

ext

Acc

oun

ts0.

681

-0.4

280.

807

0.80

70.

780

10.

574

0.56

3

Tw

itte

r:Im

ages

Fac

es0.

424

-0.0

590.

609

0.60

90.

600

0.57

41

0.24

9

Tw

itte

r:Im

ages

Acc

oun

ts0.

395

-0.2

160.

391

0.39

10.

366

0.56

30.

249

1

(b)

Sam

eO

bse

rvat

ion

sfo

rA

llP

airs

Cel

lP

hon

eD

ata

Urb

aniz

edA

rea

CC

C:

CC

C,

CC

C,

Tw

itte

r:T

wit

ter:

Tw

itte

r:

New

sA

llE

stim

ates

All

Est

imat

es(L

ow)

Tex

tA

ccou

nts

Imag

esF

aces

Imag

esA

ccou

nts

Cel

lP

hon

eD

ata

1-0.4

090.

609

0.61

90.

612

0.64

20.

432

0.33

7

Urb

aniz

edA

rea

-0.4

091

-0.3

59-0.3

12-0.3

17-0.5

01-0.1

09-0.0

89

CC

C:

New

s0.

609

-0.3

591

10.

991

0.80

20.

606

0.30

5

CC

C,

All

Est

imat

es0.

619

-0.3

121

10.

991

0.75

20.

592

0.20

9

CC

C,

All

Est

imat

es(L

ow)

0.61

2-0.3

170.

991

0.99

11

0.73

60.

585

0.19

1

Tw

itte

r:T

ext

Acc

oun

ts0.

642

-0.5

010.

802

0.75

20.

736

10.

611

0.31

9

Tw

itte

r:Im

ages

Fac

es0.

432

-0.1

090.

606

0.59

20.

585

0.61

11

0.29

4

Tw

itte

r:Im

ages

Acc

oun

ts0.

337

-0.0

890.

305

0.20

90.

191

0.31

90.

294

1

27

Page 28: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

S4.2 Z-score of Log Size

This measure is the z-score of the logged protest size, not log per capita protest size.

28

Page 29: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

Tab

leA

7:C

orre

lati

onof

Z-s

core

ofL

ogM

easu

res

(a)

All

Ob

serv

atio

ns

per

Pai

r

Cel

lP

hon

eD

ata

Urb

aniz

edA

rea

CC

C:

CC

C,

CC

C,

Tw

itte

r:T

wit

ter:

Tw

itte

r:

New

sA

llE

stim

ates

All

Est

imat

es(L

ow)

Tex

tA

ccou

nts

Imag

esF

aces

Imag

esA

ccou

nts

Cel

lP

hon

eD

ata

10.

570

0.70

80.

708

0.69

90.

730

0.71

40.

631

Urb

aniz

edA

rea

0.57

01

0.46

70.

467

0.46

00.

630

0.64

10.

844

CC

C:

New

s0.

708

0.46

71

10.

990

0.82

50.

805

0.55

9

CC

C,

All

Est

imat

es0.

708

0.46

71

10.

990

0.82

50.

805

0.55

9

CC

C,

All

Est

imat

es(L

ow)

0.69

90.

460

0.99

00.

990

10.

800

0.79

50.

541

Tw

itte

r:T

ext

Acc

oun

ts0.

730

0.63

00.

825

0.82

50.

800

10.

833

0.75

8

Tw

itte

r:Im

ages

Fac

es0.

714

0.64

10.

805

0.80

50.

795

0.83

31

0.68

7

Tw

itte

r:Im

ages

Acc

oun

ts0.

631

0.84

40.

559

0.55

90.

541

0.75

80.

687

1

(b)

Sam

eO

bse

rvat

ion

sfo

rA

llP

airs

Cel

lP

hon

eD

ata

Urb

aniz

edA

rea

CC

C:

CC

C,

CC

C,

Tw

itte

r:T

wit

ter:

Tw

itte

r:

New

sA

llE

stim

ates

All

Est

imat

es(L

ow)

Tex

tA

ccou

nts

Imag

esF

aces

Imag

esA

ccou

nts

Cel

lP

hon

eD

ata

10.

722

0.70

70.

716

0.70

20.

726

0.70

70.

660

Urb

aniz

edA

rea

0.72

21

0.45

80.

481

0.47

20.

633

0.62

60.

849

CC

C:

New

s0.

707

0.45

81

10.

990

0.83

00.

800

0.56

1

CC

C,

All

Est

imat

es0.

716

0.48

11

10.

993

0.79

80.

767

0.50

8

CC

C,

All

Est

imat

es(L

ow)

0.70

20.

472

0.99

00.

993

10.

781

0.75

80.

493

Tw

itte

r:T

ext

Acc

oun

ts0.

726

0.63

30.

830

0.79

80.

781

10.

834

0.67

8

Tw

itte

r:Im

ages

Fac

es0.

707

0.62

60.

800

0.76

70.

758

0.83

41

0.68

1

Tw

itte

r:Im

ages

Acc

oun

ts0.

660

0.84

90.

561

0.50

80.

493

0.67

80.

681

1

29

Page 30: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

S4.3 Log

30

Page 31: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

Tab

leA

9:C

orre

lati

onof

Log

(Siz

e)

(a)

All

Ob

serv

atio

ns

per

Pai

r

Cel

lP

hon

eD

ata

Urb

aniz

edA

rea

CC

C:

CC

C,

CC

C,

Tw

itte

r:T

wit

ter:

Tw

itte

r:

New

sA

llE

stim

ates

All

Est

imat

es(L

ow)

Tex

tA

ccou

nts

Imag

esF

aces

Imag

esA

ccou

nts

Cel

lP

hon

eD

ata

10.

639

0.72

40.

724

0.71

90.

755

0.71

40.

627

Urb

aniz

edA

rea

0.63

91

0.56

10.

561

0.55

70.

714

0.70

20.

787

CC

C:

New

s0.

724

0.56

11

10.

990

0.83

90.

807

0.60

2

CC

C,

All

Est

imat

es0.

724

0.56

11

10.

990

0.83

90.

807

0.60

2

CC

C,

All

Est

imat

es(L

ow)

0.71

90.

557

0.99

00.

990

10.

811

0.79

50.

579

Tw

itte

r:T

ext

Acc

oun

ts0.

755

0.71

40.

839

0.83

90.

811

10.

838

0.79

3

Tw

itte

r:Im

ages

Fac

es0.

714

0.70

20.

807

0.80

70.

795

0.83

81

0.71

0

Tw

itte

r:Im

ages

Acc

oun

ts0.

627

0.78

70.

602

0.60

20.

579

0.79

30.

710

1

(b)

Sam

eO

bse

rvat

ion

sfo

rA

llP

airs

Cel

lP

hon

eD

ata

Urb

aniz

edA

rea

CC

C:

CC

C,

CC

C,

Tw

itte

r:T

wit

ter:

Tw

itte

r:

New

sA

llE

stim

ates

All

Est

imat

es(L

ow)

Tex

tA

ccou

nts

Imag

esF

aces

Imag

esA

ccou

nts

Cel

lP

hon

eD

ata

10.

778

0.69

50.

721

0.71

20.

726

0.70

70.

660

Urb

aniz

edA

rea

0.77

81

0.58

50.

603

0.60

00.

709

0.69

80.

797

CC

C:

New

s0.

695

0.58

51

10.

991

0.83

00.

800

0.56

1

CC

C,

All

Est

imat

es0.

721

0.60

31

10.

992

0.79

80.

767

0.50

8

CC

C,

All

Est

imat

es(L

ow)

0.71

20.

600

0.99

10.

992

10.

781

0.75

80.

493

Tw

itte

r:T

ext

Acc

oun

ts0.

726

0.70

90.

830

0.79

80.

781

10.

834

0.67

8

Tw

itte

r:Im

ages

Fac

es0.

707

0.69

80.

800

0.76

70.

758

0.83

41

0.68

1

Tw

itte

r:Im

ages

Acc

oun

ts0.

660

0.79

70.

561

0.50

80.

493

0.67

80.

681

1

31

Page 32: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

S4.4 Raw

32

Page 33: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

Tab

leA

10:

Cor

rela

tion

ofSiz

e,no

Tra

nsf

orm

atio

n

(a)

All

Ob

serv

atio

ns

per

Pai

r

Cel

lP

hon

eD

ata

Urb

aniz

edA

rea

CC

C:

CC

C,

CC

C,

Tw

itte

r:T

wit

ter:

Tw

itte

r:

New

sA

llE

stim

ates

All

Est

imat

es(L

ow)

Tex

tA

ccou

nts

Imag

esF

aces

Imag

esA

ccou

nts

Cel

lP

hon

eD

ata

10.

587

0.79

80.

798

0.84

10.

794

0.82

90.

640

Urb

aniz

edA

rea

0.58

71

0.67

40.

674

0.64

80.

678

0.58

30.

870

CC

C:

New

s0.

798

0.67

41

10.

972

0.97

80.

966

0.81

9

CC

C,

All

Est

imat

es0.

798

0.67

41

10.

972

0.97

80.

966

0.81

9

CC

C,

All

Est

imat

es(L

ow)

0.84

10.

648

0.97

20.

972

10.

931

0.94

30.

760

Tw

itte

r:T

ext

Acc

oun

ts0.

794

0.67

80.

978

0.97

80.

931

10.

971

0.84

9

Tw

itte

r:Im

ages

Fac

es0.

829

0.58

30.

966

0.96

60.

943

0.97

11

0.78

2

Tw

itte

r:Im

ages

Acc

oun

ts0.

640

0.87

00.

819

0.81

90.

760

0.84

90.

782

1

(b)

Sam

eO

bse

rvat

ion

sfo

rA

llP

airs

Cel

lP

hon

eD

ata

Urb

aniz

edA

rea

CC

C:

CC

C,

CC

C,

Tw

itte

r:T

wit

ter:

Tw

itte

r:

New

sA

llE

stim

ates

All

Est

imat

es(L

ow)

Tex

tA

ccou

nts

Imag

esF

aces

Imag

esA

ccou

nts

Cel

lP

hon

eD

ata

10.

587

0.79

80.

798

0.84

10.

794

0.82

90.

640

Urb

aniz

edA

rea

0.58

71

0.67

40.

674

0.64

80.

678

0.58

30.

870

CC

C:

New

s0.

798

0.67

41

10.

972

0.97

80.

966

0.81

9

CC

C,

All

Est

imat

es0.

798

0.67

41

10.

972

0.97

80.

966

0.81

9

CC

C,

All

Est

imat

es(L

ow)

0.84

10.

648

0.97

20.

972

10.

931

0.94

30.

760

Tw

itte

r:T

ext

Acc

oun

ts0.

794

0.67

80.

978

0.97

80.

931

10.

971

0.84

9

Tw

itte

r:Im

ages

Acc

oun

ts0.

640

0.87

00.

819

0.81

90.

760

0.84

90.

782

1

33

Page 34: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

S5 Results, Regression

This section regresses the cell phone measure on a placebo model, the placebo model with

the size estimate of the four measures, and a model pooling the four estimates. The placebo

model includes variables for city population, median household income, income inequality,

vote share for Candidate Hillary Clinton, and the number of unemployed. All models include

state fixed effects.

Each subsection uses a different transformation of protest size applied to the overlapping

observations and all observations per measure. There are therefore eight tables, two per the

four subsections.

The results mirror the main paper’s. Across dependent variable transformations and

subsets of data, CCC: News and Twitter: Text Accounts models explain the most variation

of the cell phone protest size. In Tables A15 (z-score of log size) and A17 (log size), the

placebo model outperforms the others models. In those tables, however, the model that

pools the four measures’ estimates with the placebo variables finds statistical significance for

only CCC: News and Twitter: Text Accounts. Note as well that the placebo is worse than

the other measures when using raw data, as seen in Tables A18 and A19.

S5.1 Z-score of Log per Capita

34

Page 35: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

35

Table A12: Same Cases

Z, Log(Per Capita)

(1) (2) (3) (4) (5) (6)

Z, Log(Twitter: Images Faces per Capita) 0.273∗∗ 0.050

(0.105) (0.088)

Z, Log(Twitter: Text per Capita) 0.791∗∗∗ 0.545∗∗∗

(0.104) (0.171)

Z, Log(Twitter: Images Accounts per Capita) 0.363∗∗∗ 0.020

(0.124) (0.119)

Z, Log(CCC: News per Capita) 0.694∗∗∗ 0.297∗∗

(0.106) (0.147)

Median HHI 0.238 0.139 0.068 0.279 0.092 0.042

(0.195) (0.192) (0.150) (0.187) (0.159) (0.152)

Gini 0.178 0.212 0.085 0.015 0.101 0.078

(0.151) (0.146) (0.115) (0.154) (0.122) (0.123)

County Dem. Vote Share −0.034 −0.160 −0.163 −0.060 −0.249∗ −0.240∗

(0.179) (0.180) (0.137) (0.171) (0.148) (0.140)

Unemployed Count −0.300 −0.208 0.562 0.517 0.967 0.899

(1.341) (1.294) (1.024) (1.309) (1.100) (1.045)

Population −0.014 −0.131 −0.990 −0.904 −1.336 −1.324

(1.459) (1.409) (1.115) (1.425) (1.195) (1.133)

Intercept 0.302 0.221 0.042 0.413 0.285 0.107

(0.922) (0.890) (0.700) (0.880) (0.744) (0.694)

Observations 127 127 127 127 127 127

Adjusted R2 0.122 0.182 0.494 0.201 0.428 0.510

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Page 36: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

36

Table A13: Maximum Cases per Measure

Z, Log(Per Capita)

(1) (2) (3) (4) (5) (6)

Z, Log(Twitter: Images Faces per Capita) 0.280∗∗∗ 0.050

(0.091) (0.088)

Z, Log(Twitter: Text per Capita) 0.683∗∗∗ 0.545∗∗∗

(0.076) (0.171)

Z, Log(Twitter: Images Accounts per Capita) 0.281∗∗∗ 0.020

(0.083) (0.119)

Z, Log(CCC: News per Capita) 0.497∗∗∗ 0.297∗∗

(0.085) (0.147)

Median HHI 0.213∗ 0.184 0.085 0.266∗ 0.238 0.042

(0.120) (0.158) (0.113) (0.143) (0.148) (0.152)

Gini 0.202∗ 0.262∗∗ 0.182∗ 0.058 0.164 0.078

(0.103) (0.126) (0.095) (0.129) (0.115) (0.123)

County Dem. Vote Share −0.053 −0.166 −0.220∗∗ −0.040 −0.216 −0.240∗

(0.116) (0.153) (0.108) (0.138) (0.144) (0.140)

Unemployed Count −0.566 −0.717 −0.456 −0.232 0.521 0.899

(0.733) (0.929) (0.724) (0.913) (1.075) (1.045)

Population 0.352 0.462 0.212 −0.031 −0.885 −1.324

(0.757) (0.968) (0.753) (0.949) (1.169) (1.133)

Intercept −0.561 −1.274 −0.449 −0.844 −1.092∗ 0.107

(0.590) (0.853) (0.514) (0.651) (0.600) (0.694)

Observations 214 153 167 165 139 127

Adjusted R2 0.365 0.244 0.492 0.206 0.357 0.510

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Page 37: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

S5.2 Z-score of Log Size

Table A14: Same Cases

Z, Log(Size)

(1) (2) (3) (4) (5) (6)

Z, Log(Twitter: Images Faces) 0.334∗∗∗ 0.094

(0.085) (0.100)

Z, Log(Twitter: Text Accounts) 0.355∗∗∗ 0.199∗∗

(0.068) (0.097)

Z, Log(Twitter: Images Accounts) 0.274∗∗ 0.056

(0.124) (0.130)

Z, Log(CCC: News) 0.436∗∗∗ 0.224∗

(0.092) (0.121)

Median HHI 0.130 0.064 0.050 0.156 0.062 0.037

(0.121) (0.113) (0.106) (0.119) (0.109) (0.106)

Gini −0.077 0.0004 −0.001 −0.085 −0.048 0.001

(0.094) (0.088) (0.082) (0.092) (0.083) (0.084)

County Dem. Vote Share 0.231∗∗ 0.045 0.011 0.146 0.021 −0.070

(0.111) (0.112) (0.105) (0.115) (0.108) (0.108)

Unemployed Count −1.498∗ −0.804 −0.110 −0.864 −0.307 0.213

(0.821) (0.775) (0.759) (0.851) (0.770) (0.772)

Population 2.014∗∗ 1.057 0.056 1.153 0.527 −0.289

(0.893) (0.856) (0.858) (0.954) (0.851) (0.876)

Intercept 0.953 0.744 0.183 0.846 0.578 0.249

(0.573) (0.530) (0.518) (0.562) (0.514) (0.508)

Observations 128 128 128 128 128 128

Adjusted R2 0.362 0.461 0.522 0.392 0.499 0.542

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

37

Page 38: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

38

Table A15: Maximum Cases per Measure

Z, Log(Size)

(1) (2) (3) (4) (5) (6)

Z, Log(Twitter: Images Faces) 0.356∗∗∗ 0.094

(0.065) (0.100)

Z, Log(Twitter: Text Accounts) 0.344∗∗∗ 0.199∗∗

(0.050) (0.097)

Z, Log(Twitter: Images Accounts) 0.338∗∗∗ 0.056

(0.075) (0.130)

Z, Log(CCC: News) 0.363∗∗∗ 0.224∗

(0.060) (0.121)

Median HHI 0.380∗∗∗ 0.088 0.104 0.236∗∗∗ 0.197∗∗ 0.037

(0.064) (0.091) (0.084) (0.089) (0.078) (0.106)

Gini 0.052 0.026 0.070 −0.037 0.043 0.001

(0.058) (0.074) (0.069) (0.075) (0.064) (0.084)

County Dem. Vote Share 0.188∗∗∗ 0.022 −0.055 0.101 0.020 −0.070

(0.063) (0.093) (0.085) (0.088) (0.080) (0.108)

Unemployed Count −0.453 −0.682 −0.617 −0.666 −0.576 0.213

(0.496) (0.539) (0.529) (0.571) (0.674) (0.772)

Population 0.725 0.858 0.628 0.837 0.865 −0.289

(0.511) (0.566) (0.557) (0.600) (0.732) (0.876)

Intercept −1.275∗∗∗ 0.055 −0.672∗ −0.779∗ −0.985∗∗∗ 0.249

(0.237) (0.492) (0.375) (0.403) (0.318) (0.508)

Observations 295 154 170 166 187 128

Adjusted R2 0.691 0.563 0.610 0.541 0.600 0.542

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Page 39: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

S5.3 Log Size

Table A16: Same Cases

Log(Size)

(1) (2) (3) (4) (5) (6)

Log(Twitter: Images Faces) 0.297∗∗∗ 0.083

(0.076) (0.089)

Log(Twitter: Text Accounts) 0.697∗∗∗ 0.390∗∗

(0.133) (0.190)

Log(Twitter: Images Accounts) 0.275∗∗ 0.056

(0.124) (0.130)

Log(CCC: News) 0.351∗∗∗ 0.180∗

(0.074) (0.097)

Median HHI 0.091 0.044 0.035 0.109 0.043 0.026

(0.085) (0.079) (0.074) (0.083) (0.076) (0.074)

Gini −0.054 0.0003 −0.0004 −0.059 −0.033 0.001

(0.065) (0.062) (0.058) (0.064) (0.058) (0.058)

County Dem. Vote Share 0.161∗∗ 0.032 0.007 0.102 0.014 −0.049

(0.077) (0.078) (0.073) (0.080) (0.075) (0.076)

Unemployed Count −1.045∗ −0.561 −0.077 −0.603 −0.215 0.149

(0.573) (0.541) (0.530) (0.594) (0.537) (0.539)

Population 1.406∗∗ 0.738 0.039 0.805 0.368 −0.202

(0.623) (0.597) (0.599) (0.666) (0.594) (0.612)

Intercept 2.395∗∗∗ 1.465∗∗∗ 1.400∗∗∗ 1.655∗∗∗ 1.075∗∗ 0.750

(0.400) (0.437) (0.395) (0.513) (0.450) (0.521)

Observations 128 128 128 128 128 128

Adjusted R2 0.362 0.461 0.522 0.392 0.499 0.542

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

39

Page 40: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

40

Table A17: Maximum Cases per Measure

Log(Size)

(1) (2) (3) (4) (5) (6)

Log(Twitter: Images Faces) 0.317∗∗∗ 0.083

(0.058) (0.089)

Log(Twitter: Text Accounts) 0.675∗∗∗ 0.390∗∗

(0.099) (0.190)

Log(Twitter: Images Accounts) 0.339∗∗∗ 0.056

(0.075) (0.130)

Log(CCC: News) 0.291∗∗∗ 0.180∗

(0.048) (0.097)

Median HHI 0.265∗∗∗ 0.062 0.072 0.165∗∗∗ 0.137∗∗ 0.026

(0.045) (0.063) (0.059) (0.062) (0.054) (0.074)

Gini 0.036 0.018 0.049 −0.026 0.030 0.001

(0.040) (0.052) (0.048) (0.053) (0.045) (0.058)

County Dem. Vote Share 0.131∗∗∗ 0.016 −0.038 0.071 0.014 −0.049

(0.044) (0.065) (0.059) (0.061) (0.056) (0.076)

Unemployed Count −0.317 −0.476 −0.431 −0.465 −0.402 0.149

(0.346) (0.377) (0.369) (0.399) (0.471) (0.539)

Population 0.506 0.599 0.438 0.585 0.604 −0.202

(0.356) (0.395) (0.389) (0.419) (0.511) (0.612)

Intercept 0.840∗∗∗ 0.933∗∗ 0.817∗∗∗ 0.367 0.164 0.750

(0.166) (0.364) (0.265) (0.332) (0.276) (0.521)

Observations 295 154 170 166 187 128

Adjusted R2 0.691 0.563 0.610 0.541 0.600 0.542

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

Page 41: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

S5.4 Size

Table A18: Same Cases

Cell Phone Data

(1) (2) (3) (4) (5) (6)

Twitter: Images Faces 0.00000 0.0001∗

(0.00001) (0.00003)

Twitter: Text Accounts −0.0001 −0.0002

(0.001) (0.003)

Twitter: Images Accounts −0.00004 −0.0001

(0.0001) (0.0001)

CCC: News −0.00000 −0.00001∗

(0.00000) (0.00000)

Median HHI 0.132 0.129 0.134 0.137 0.138 0.111

(0.110) (0.111) (0.111) (0.110) (0.111) (0.110)

Gini −0.036 −0.033 −0.037 −0.039 −0.041 −0.030

(0.105) (0.105) (0.105) (0.105) (0.105) (0.104)

County Dem. Vote Share 0.303∗∗∗ 0.298∗∗∗ 0.304∗∗∗ 0.317∗∗∗ 0.312∗∗∗ 0.315∗∗∗

(0.112) (0.112) (0.112) (0.112) (0.112) (0.113)

Unemployed Count −1.261 −1.178 −1.295 −1.589∗ −1.426 −1.435

(0.854) (0.910) (0.883) (0.949) (0.888) (1.045)

Population 2.057∗∗ 1.957∗ 2.098∗∗ 2.509∗∗ 2.259∗∗ 2.580∗∗

(0.919) (1.005) (0.977) (1.124) (0.984) (1.213)

Intercept 4.958∗∗∗ 4.953∗∗∗ 4.960∗∗∗ 5.005∗∗∗ 4.967∗∗∗ 5.060∗∗∗

(0.080) (0.084) (0.084) (0.109) (0.084) (0.116)

Observations 128 128 128 128 128 128

Log Likelihood −768.393 −768.376 −768.390 −768.200 −768.326 −765.705

Akaike Inf. Crit. 1,548.787 1,550.751 1,550.781 1,550.401 1,550.652 1,551.410

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

41

Page 42: News and Geolocated Social Media Accurately Measure Protest …€¦ · social media (text and images). The argument that news and social media can be used to measure protest size

42

Table A19: Maximum Cases per Measure

Cell Phone Data

(1) (2) (3) (4) (5) (6)

Twitter: Images Faces 0.00000 0.0001∗

(0.00001) (0.00003)

Twitter: Text Accounts −0.0001 −0.0002

(0.001) (0.003)

Twitter: Images Accounts 0.00001 −0.0001

(0.0001) (0.0001)

CCC: News −0.00000 −0.00001∗

(0.00000) (0.00000)

Median HHI 0.669∗∗∗ 0.194∗ 0.297∗∗∗ 0.264∗∗ 0.316∗∗∗ 0.111

(0.086) (0.111) (0.106) (0.107) (0.097) (0.110)

Gini 0.534∗∗∗ −0.029 0.111 0.127 0.326∗∗∗ −0.030

(0.080) (0.105) (0.102) (0.101) (0.089) (0.104)

County Dem. Vote Share 0.291∗∗∗ 0.370∗∗∗ 0.353∗∗∗ 0.351∗∗∗ 0.273∗∗∗ 0.315∗∗∗

(0.083) (0.108) (0.103) (0.105) (0.097) (0.113)

Unemployed Count −0.842 −0.471 −0.699 −0.613 −2.134∗∗ −1.435

(0.607) (0.740) (0.761) (0.757) (0.949) (1.045)

Population 1.187∗ 1.041 1.334∗ 1.190 3.205∗∗∗ 2.580∗∗

(0.623) (0.777) (0.799) (0.803) (1.045) (1.213)

Intercept 4.487∗∗∗ 4.832∗∗∗ 4.741∗∗∗ 4.737∗∗∗ 4.672∗∗∗ 5.060∗∗∗

(0.058) (0.079) (0.078) (0.091) (0.077) (0.116)

Observations 295 154 170 166 187 128

Log Likelihood −1,709.422 −909.459 −991.326 −969.625 −1,050.807 −765.705

Akaike Inf. Crit. 3,430.844 1,832.918 1,996.653 1,953.250 2,115.614 1,551.410

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01