News and Geolocated Social Media Accurately Measure Protest Size
Anton Sobolev1, Keith Chen1, Jungseock Joo1, Zachary C. Steinert-Threlkeld1,*
June 6, 2019
Abstract
Larger protests are more likely to lead to policy changes than small ones, but whether or not attendance estimates provided in newspapers or generated from social media are biased is an open question. This research note closes the question: news and geolocated social media data generate accurate estimates of the size of protests. This claim is substantiated using cell phone location data from ten million individuals during the 2017 United States Women's March protests. These cell phone estimates correlate strongly with those provided in news media as well as three size estimates generated using geolocated tweets, one text-based and two based on images. In testing these estimates, we also show that wealthier, more Democratic, and more urbanized areas generated larger protests.
1 University of California - Los Angeles
∗To whom correspondence should be addressed: [email protected]
Word count (excluding references, tables, and figures): 2,803
1 Introduction
Using estimates of protest size obtained via mobile phone location data from 10 million
individuals, this research note shows that news and social media data generate accurate
estimates of protest size.
Protests are a key tool in the repertoire of social movements (Tilly and Tarrow, 2015),
and their efficacy is primarily a function of the number of participants (“size”) (DeNardo,
1985; Chenoweth and Stephan, 2011; Biggs, 2016). Methods to directly measure size are too
costly to be employed across large numbers of protests, limiting their use in academic studies
(Schweingruber and McPhail, 1999; McPhail and McCarthy, 2004; Yip et al., 2010). Instead,
event datasets that record size (usually of protests but also of riots or massacres) rely on
numbers provided in media reports, usually newspapers. This reliance raises concerns about
the accuracy of the tallies since researchers do not create the measures themselves and those
who do, usually activists or authority figures, have strategic reasons for inflating or deflating,
respectively, size. Complicating matters, newspapers can choose between multiple estimates
to find the one that suits their editorial beliefs (Mann, 1974). Concern about measurement
validity means many event datasets record size as an ordinal variable (Salehyan et al., 2012)
or only measure casualties (Raleigh et al., 2010).
Using individuals’ location recorded via their mobile phones, this research note shows that
scholars can rely on size estimates reported in newspapers or obtained via geolocated tweets.
Since people almost always carry their smartphones, those devices record individuals' locations
passively, and other researchers have verified that the locations are not politically biased
(ANONYMOUS); we therefore consider the cell phone location data to be a "gold standard". Cell phone
data are especially important because, since 1995, the federal government no longer records
protest size in Washington D.C. (McCarthy et al., 1999), and rigorously measuring with
enumerators is too costly when protests occur in multiple locations (Schweingruber and
McPhail, 1999; Yip et al., 2010). The cell phone location data (“cell phone data”) therefore
forms the baseline against which we compare estimates from news (print and television) and
social media (text and images).
The argument that news and social media can be used to measure protest size is substantiated
using the January 21st, 2017 United States Women's March. One of the largest
mobilizations in American history, that day featured 942 protests comprising approximately
4.77 million individuals (Chenoweth and Pressman, 2017). The cell phone data show that
protest sizes reported in newspapers, as well as those estimated via geolocated tweets, are reliable.
Of the four sets of estimates we compare, those reported in newspapers or calculated via
geolocated tweets containing protest key words are the most accurate, followed by counting
the number of faces in protest photos shared on Twitter. Counting the number of accounts
which share protest photos is the least accurate. The Supplementary Materials show that
the first principal component of the four sets of estimates provides the most accurate protest
size, though this fifth estimate is the most difficult to scale and so not one we recommend
for future use.
2 The Challenge
Large protests are more likely to lead to policy change than small ones (Biggs, 2016). Kuran's (1989) canonical model of bandwagoning implicitly means that a revolution follows large
protests. Lohmann (1994) argues that unprecedented numbers of people rallying in the German
Democratic Republic at the beginning of 1989 were a key reason the protest movement
grew and eventually toppled Erich Honecker's government. The gradual growth of protest
size in Iran in 1979 also made it increasingly difficult for the Shah to remain in power (Rasler,
1996; Kurzman, 2004). Nonviolent social movements succeed more often than violent ones
precisely because they tend to be larger: larger movements increase the cost of repression and
make regime defections more likely (Chenoweth and Stephan, 2011; Celestino and Gleditsch,
2013; Thurber, 2013).1
Previous estimates of protest size rely on teams of well-trained observers or size estimates
1A social movement is not the same as a protest. The quantitative study of them, however, measures their size based on the size of protests mobilized as part of the movement.
reported in newspapers. Using observers who understand the square footage of a protest site,
the density of the crowd, and the percent of the site the crowd occupies is considered the
most accurate estimation method; on average, a standing person requires about 5 square
feet of personal space (McPhail and McCarthy, 2004). Trained observers can also disperse
within a protest and count the size of clusters of participants (Schweingruber and McPhail,
1999; Yip et al., 2010). We are aware of no dataset which uses this methodology, probably
because it is expensive and requires advance knowledge that a protest will occur.
Instead, scholars rely on estimates of protest size in newspapers (Woolley, 2000; Earl
et al., 2004). Though newspapers have known biases in their coverage (Gerner and Schrodt,
1998; Weidmann, 2014; Baum and Zhukov, 2015), the concern about measurement validity
for protest size does not stem from their preference towards violent events in urbanized areas.
Instead, the concern is that newspaper estimates of protest size are secondary, usually coming
directly from protest organizers or state authorities. The estimates could be too high or too
low, but the bias would not be correctable since no independent size estimate exists.2
Distrust of size estimates leads many datasets to report alternate measures of protest
size. The Social Conflict Analysis Database (SCAD) transforms newspaper estimates into
an ordinal variable on a logged scale (Salehyan et al., 2012). The Armed Conflict Location
and Event Data project only reports fatalities since those are believed to be easier to count
than people in groups (Raleigh et al., 2010). Other datasets, such as Mass Mobilization
(Clark and Regan, 2018), Mass Mobilization in Autocracies Database (Weidmann and Rod,
2018), and Nonviolent and Violent Campaigns and Outcomes (Chenoweth, Pinckney and
Lewis, 2018) do record the estimates provided in newspapers. Nonetheless, we are aware of
no research that has shown that size data from newspapers can be used without introducing
bias, and work which takes protest size as its outcome instead has used the count of protests
(Steinert-Threlkeld, 2017).
2Newspapers are more likely to report on large protests than small ones (McCarthy et al., 1999). Since large protests often grow from small ones, being able to record small protests that do not appear in newspapers should enhance the understanding of protest dynamics.
3 Data
That news and geolocated social media can accurately measure the number of protesters
is demonstrated using the case of the 2017 United States Women's March. Occurring on
January 21st, 2017, the day after President Donald Trump’s inauguration, the simultaneous
protests across the country constitute the largest single-day mobilization in the United
States.3
This note compares eight estimates of protest size from four sources: cell phone location
data (1) (ANONYMOUS), the Crowd Counting Consortium (3) (Chenoweth and Pressman,
2017), Twitter (3), and population according to the American Community Survey as a
placebo (1). The cell phone location data are from a data broker and shared with an author
of this paper; the Crowd Counting Consortium (CCC) data are publicly available; and
the tweets have been collected by another author of this paper ANONYMOUS. Table 1
provides summary statistics of the three sources.4
The cell phone estimates use data provided from SafeGraph, a third-party data broker.
Cell phones record owners’ location and let downloaded applications use that information to,
ostensibly, provide services. The applications and telecommunications providers also sell the
location data to brokers or directly to companies (Valentino-DeVries et al., 2018; Cox, 2019).
Our dataset contains location information for just over ten million individuals. For more
information on the data and validation of the political representativeness, see ANONYMOUS.
We have no identifying information on these individuals (de Montjoye et al., 2018).
Identifying protesters from the SafeGraph data requires several steps. First, we only
search in cities identified in the CCC data. Second, in those cities, we only consider pedestrians
by excluding pings that were registered while the owners of mobile phones used private
or public transportation; mode of transit is identifiable by the frequency of location pings.
Third, we take the distribution of seven-digit geohashes on January 21st, 2017, the day of
3Protests also occurred in other countries. Lacking cell phone data on them, this paper ignores them.
4All tables were made using Hlavac (2018).
Table 1: Summary Statistics
Measure N Mean St. Dev. Min Max
Cell Phone Data 399 238.479 596.154 2 3,082
CCC: News 326 10,526.630 52,362.250 3 725,000
CCC, All Estimates (Best Guess) 614 6,727.480 42,490.840 1 725,000
CCC, All Estimates (Low) 614 5,804.056 34,443.560 1 550,000
Twitter: Text Accounts 1,045 10.793 70.643 2 1,522
Twitter: Images Faces 244 2,601.361 10,568.640 0 132,708
Twitter: Images Accounts 269 926.301 2,209.269 3 20,521
Note: CCC refers to the Crowd Counting Consortium data and uses the "Best Guess" variable unless otherwise indicated. Twitter: Text Accounts uses keywords to identify protesters. Twitter: Images Faces sums the number of faces in protest images. Twitter: Images Accounts counts the number of unique accounts that share protest photos.
the Women’s March, and keep the geohashes above the 95th percentile per city.5 Fourth,
any individual in those geohashes is counted as a protester. The size of this group represents
a conservative estimate of the total size of the protest in each city. In most of the cities, the
neighboring geohashes also have a relatively high density of individuals but are below the
95% threshold; some of the individuals from these geohashes are probably protesters. To
summarize: the number of people in a city’s densest 7-digit geohashes (those whose density
is in the 95th percentile or greater) is the number of protesters to which we compare the
other estimates.6
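Steps three and four above can be sketched as follows. This is a minimal illustration, not the authors' production pipeline: the ping records, the threshold logic, and the function name are our own assumptions, and the pings are assumed to be already restricted to pedestrians in CCC cities.

```python
from collections import Counter

def count_protesters(pings, percentile=0.95):
    """Keep a city's densest 7-digit geohash cells (at or above the given
    density percentile) and count every individual observed in them."""
    density = Counter(gh for _, gh in pings)          # pings per cell
    ranked = sorted(density.values())
    # Density value at the requested percentile of this city's cells.
    cutoff = ranked[min(int(percentile * len(ranked)), len(ranked) - 1)]
    dense = {gh for gh, n in density.items() if n >= cutoff}
    return len({uid for uid, gh in pings if gh in dense})

# Illustrative (individual_id, geohash7) pings for one city on the day
# of the march; all values are made up.
pings = [
    ("u1", "9q5ctr1"), ("u2", "9q5ctr1"), ("u3", "9q5ctr1"),
    ("u6", "9q5ctr1"), ("u4", "9q5ctr2"), ("u5", "9q5ctr3"),
]
```

Lowering `percentile` would capture the neighboring, slightly less dense cells the note mentions, at the cost of a less conservative count.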
The Crowd Counting Consortium (CCC) is an open-source, hand-coded events dataset
established to record protest size. CCC is agnostic regarding source type. Most are newspapers
or local television broadcasts, but a substantial minority (26.84%) of estimates derive
5A geohash is an 8-digit alphanumeric code that corresponds to a cell with sides of 38 meters. We use the 7-digit code, corresponding to cells with sides of 152 meters. Because geohashes are alphanumeric strings and not points, location matching is much quicker than using spatial coordinates.
6We explored an additional step for identifying protesters, based on the path of their walking. In response to concerns from the data provider, we dropped this methodology. Initial results suggested that this extra step degraded results.
from tweets or Facebook statuses. It also reports high and low estimates, based on these
sources. Unless otherwise indicated, this research note uses the Best Guess variable, usually
an average of the adjusted low and high estimates in an article, as the protest size estimate.
CCC: News refers to events for which the source is a newspaper or local television broadcast.
CCC, All Estimates includes these events plus those for which only social media provides estimates.7
CCC, All Estimates (Low) uses the Low estimate, not Best Guess, and ignores source type.
For Twitter, three approaches approximate the number of protesters.8 First is a count
of the number of unique accounts in a city per day using one of three common keywords,
“womensmarch”, “whyimarch”, or “imarchfor”, between 10 a.m. and 5 p.m. local time.
Tweets are not resolved to intracity locations, for three reasons. First, for the majority of
tweets, Twitter aggregates the specific location of a tweet to local polygons while reporting
the actual location as the city that contains that polygon. The polygon, however, does
not necessarily envelop the location of the actual tweet. For example, Rancho Palos Verdes
and Monterrey, two cities near Los Angeles, recorded, based on GPS coordinates, the most
tweets on January 21, but Twitter labels those tweets as coming from Los
Angeles. Intracity location information is therefore unknown for almost all tweets. Second,
the Crowd Counting Consortium does not always provide intracity geographic information.
A protest they label as occurring in Seattle, for example, would require further research to obtain the
specific location. Third, most cities, even large ones, contain only one protest
per day; if they contain multiple, one protest will contain the vast majority of protesters
(Biggs, 2016). We call this measure Twitter: Text Accounts.
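A minimal sketch of the Twitter: Text Accounts measure, assuming tweets have already been resolved to a city and converted to local time; the record field names are ours, not Twitter's schema.

```python
from datetime import datetime

KEYWORDS = ("womensmarch", "whyimarch", "imarchfor")

def text_accounts(tweets, city):
    """Count unique accounts tweeting a protest keyword in `city`
    between 10 a.m. and 5 p.m. local time."""
    users = set()
    for t in tweets:
        if t["city"] != city or not (10 <= t["local_time"].hour < 17):
            continue
        if any(k in t["text"].lower() for k in KEYWORDS):
            users.add(t["user"])
    return len(users)

# Illustrative records; all values are made up.
tweets = [
    {"user": "a", "city": "Los Angeles", "text": "On my way #WomensMarch",
     "local_time": datetime(2017, 1, 21, 11, 30)},
    {"user": "a", "city": "Los Angeles", "text": "#whyimarch",
     "local_time": datetime(2017, 1, 21, 13, 0)},
    {"user": "b", "city": "Los Angeles", "text": "lunch downtown",
     "local_time": datetime(2017, 1, 21, 12, 0)},
]
```

Counting accounts rather than tweets, as here, prevents one prolific user from inflating the estimate.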
The second Twitter approach to estimate the number of protesters at the Women’s March
is to count the number of faces in any protest photo from a city. A convolutional neural
7We include television broadcasts as part of news because all estimates come from the source type's website, and there is no facile method to distinguish between newspapers and broadcasts. Twitter and Facebook are identifiable with the presence of "twitter" or "facebook" in the link or the word "FB" entered as the source, so we exclude social media sources.
8Tweets were collected by connecting to Twitter's streaming application programming interface (API) using the package streamR and requesting only tweets with geographic coordinates (Barbera, 2013). For more detail, see ANONYMOUS.
network (CNN) first identifies protest photos, then another CNN identifies faces in that
photo. Faces in protest photos are summed per city-day. If protesters who
share geolocated photos are randomly scattered across a protest site and share pictures of
their immediate surroundings, then this approach is similar to many authorities’ estimating
procedure (McPhail and McCarthy, 2004). We call this measure Twitter: Images Faces.
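Setting the two CNNs aside, the aggregation step of Twitter: Images Faces can be sketched as below; `is_protest` and `n_faces` stand in for the protest classifier's and face detector's outputs and are our own field names.

```python
from collections import defaultdict

def images_faces(photos):
    """Sum detected faces over protest photos, per city."""
    totals = defaultdict(int)
    for p in photos:
        if p["is_protest"]:
            totals[p["city"]] += p["n_faces"]
    return dict(totals)

# Illustrative records standing in for the two CNNs' outputs;
# non-protest photos contribute nothing.
photos = [
    {"city": "Seattle", "is_protest": True,  "n_faces": 3},
    {"city": "Seattle", "is_protest": True,  "n_faces": 5},
    {"city": "Seattle", "is_protest": False, "n_faces": 2},
]
```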
The third Twitter approach is to count the number of people tweeting a protest photo.
This approach does not count retweets, so it assumes that any tweet of a protest photo
requires that the Twitter user was actually at a protest. We call this measure Twitter: Images
Accounts. Section 5 discusses in more detail why these three measures should correlate with
the cell phone data estimates.
For either image approach to be accurate, we assume that Twitter account owners take
photographs of protesters, share them on Twitter at some point on the day of the protest,
are not strategic about the timing of the sharing (Pfeffer and Mayer, 2018), and enable
geolocation for the tweet.9 While we can think of no reason that accounts would, in aggregate,
be biased in sharing, or not sharing, protest photos, it is possible these estimates will not be accurate,
especially because so many steps are required for a photo to enter our data. The results presented in the
next section show that, despite the biases that could enter in any of these steps, the Twitter
approaches are accurate. For details of these Twitter approaches, see ANONYMOUS.
Table 1 shows that the range of size estimates varies substantially across datasets. To
compare the preferred four estimates to the cell phone data, we therefore convert each
estimate of size to a per capita measure, log-transform that result because it is still very
skewed, and take the z-score of the logged per-capita measure. (For the remainder of the
paper, “z-score” refers to this measure.) We prefer per capita to raw estimates because
population size will drive the cell phone, CCC, and Twitter estimates. Incorporating it
into our measurement therefore removes a possible source of omitted variable bias. The
z-score is appropriate because the estimates vary by orders of magnitude across datasets, so
9The phrase "some point on the day of the protest" is intentionally vague: large crowds can overwhelm the cellular towers used to transmit data, making it impossible to post during a protest.
standardizing permits the measurement of protest size across datasets whose estimates vary by
orders of magnitude.10
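The three-step transformation can be sketched as follows, assuming matched lists of protester counts and city populations; the function name is ours.

```python
import math
from statistics import mean, stdev

def standardize(counts, populations):
    """Per-capita rate -> natural log -> z-score across cities: the
    "z-score" measure used throughout the note."""
    logged = [math.log(c / p) for c, p in zip(counts, populations)]
    mu, sd = mean(logged), stdev(logged)
    return [(x - mu) / sd for x in logged]
```

Because the z-score is computed over the logged per-capita rates, two cities with the same protesters-per-resident rate receive the same score regardless of their raw sizes.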
4 Results
CCC: News and Twitter: Text Accounts estimate protest size with the least error. Twitter:
Images Faces is next best, while counting the number of accounts that share protest photos
is the least accurate. No estimate, however, is unreliable, and Section 5 discusses when
different ones may be preferred.
Unless otherwise stated, all results are for the 155 cities for which each dataset records a
protest size. The Supplementary Materials show results that keep the maximum number of
observations each dataset shares with the cell phone data; results do not change.
Figure 1 shows the correlation of the z-score for each of the datasets with the cell phone
data. Section S4.1 reports these correlations in a table.
Table 2 shows that CCC: News and Twitter: Text Accounts produce the closest estimates
to the cell phone data. Those two measures are also the closest to the cell phone data the
greatest percentage of the time, whether measuring difference (rows 1-5) or the number
of observations within .1, .5, and 1 standard deviation of the cell phone data (rows 6-8).
Twitter: Images Accounts is the least accurate on six of the eight fit measures.
Figures 2 and 3 show when each dataset’s measure is close to the cell phone data. Figure
2 shows the cumulative distribution of the four datasets’ closest estimates across the range
of z-scores. This graph reveals any bias in each estimate. For example, if most of a dataset’s
closest predictions are for small protests, its CDF will approach 1 quickly. If its closest
predictions are randomly scattered across the range of protest size, its CDF will track the
gold-standard’s. Twitter: Text Accounts has more of its closest estimates occur for small
protests, which makes sense: because more tweets contain protest hashtags than protest
images, it records small protests better than image estimates. Twitter: Images Accounts
10The Supplementary Materials show that the results hold when not using the per capita measure.
Figure 1: Correlation of Estimates
Table 2: Measuring Fit
CCC: Twitter: Twitter: Twitter:
News Images Accounts Images Faces Text Accounts
Mean Error 0.06 0.10 0.08 0.07
Mean Error (Trimmed) 0.06 0.11 0.14 0.04
Mean Absolute Error 0.58 0.82 0.71 0.60
Mean Absolute Error (Trimmed) 0.51 0.70 0.59 0.50
Closest Estimate % 0.31 0.22 0.20 0.27
Percent Within 0.1 SD 0.17 0.08 0.11 0.14
Percent Within 0.5 SD 0.52 0.38 0.52 0.54
Percent Within 1 SD 0.82 0.67 0.77 0.81
Note: “(Trimmed)” datasets drop observations where |z| > 3. Bold is the best measure
per row; underline, the worst.
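The fit statistics in Table 2 can be reproduced roughly as follows, for z-scored estimate and gold-standard vectors. Our reading of the trimming rule, dropping observations where either series has |z| > 3, is an assumption, as is the function name.

```python
from statistics import mean

def fit_measures(est, gold, trim=3.0):
    """Fit statistics for one z-scored estimate against the gold
    standard: mean error, mean absolute error (raw and trimmed), and
    the share of observations within 0.1, 0.5, and 1 SD."""
    errs = [e - g for e, g in zip(est, gold)]
    kept = [e - g for e, g in zip(est, gold)
            if abs(e) <= trim and abs(g) <= trim]

    def share_within(t):
        return sum(1 for d in errs if abs(d) <= t) / len(errs)

    return {
        "mean_error": mean(errs),
        "mean_error_trimmed": mean(kept),
        "mae": mean(abs(d) for d in errs),
        "mae_trimmed": mean(abs(d) for d in kept),
        "within_0.1_sd": share_within(0.1),
        "within_0.5_sd": share_within(0.5),
        "within_1_sd": share_within(1.0),
    }
```

Mean error captures bias (systematic over- or under-estimation), while mean absolute error and the within-SD shares capture precision.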
tracks the cell phone data as protests pass their average size. Twitter: Images Faces outperforms
Twitter: Images Accounts for most of its range except 0 ≤ z ≤ 1. CCC is the most
consistently accurate of the three, as its CDF closely tracks the cell phone data estimates.
Figure 2: Location of Best Fits per Measure
Finally, Figure 3 shows the mean error by z-score decile, with the closest estimate per
decile labeled. All measures overestimate the size of small protests and underestimate large
ones. CCC: News and Twitter: Text Accounts perform best, as they each have the smallest
mean error three times. Twitter: Images Faces is closest twice; Twitter: Images Accounts,
once.
The Supplementary Materials present additional analyses that confirm these results. To
verify that results are not driven by the per capita transformation, we show results measuring
the outcome as the z-score of the logarithm of protest size; the subsections presenting them
are labeled “Not Per Capita”. We have also generated protest size estimates using the first
principal component of all four measures as well as the two Twitter images measures; Section
S1 explains this analysis, and its results are presented in the Supplementary Material’s other
sections. The principal components have the least error.
Figure 3: Mean Error by Decile

After introducing the principal components analysis, the next sections replicate the results above. Section S2 repeats Table 2 using all observations per measure and the "Not Per Capita"
measure. Section S3 replicates Figure 3 using all observations per measure and the "Not Per
Capita” measure. Both incorporate the principal components results and show results when
using the maximum number of observations per dataset.
Sections S4 and S5 provide two sets of additional results. Section S4 shows the Pearson
correlation coefficient for each measure, using different subsets of data and operationalizations
of protest size. Section S5 introduces a series of models regressing the cell phone data
on the four measures, separately and pooled; all models include state fixed effects, socioeconomic
controls at the urbanized area level taken from the 2016 American Community Survey,
and population. CCC: News and Twitter: Text Accounts feature the highest correlations
and best model fit regardless of the subset of data or operationalization of protest size.
5 Discussion
5.1 Why Does it Work?
Why do news media and geolocated social media data reliably measure protest size? News
reports rely on others’ estimates, and the social media estimates are from models applied to
relatively small samples with many entry points for bias. That both sets of estimates work
may therefore appear surprising.
The estimates in CCC: News correlate well with the cell phone data because they tend
to come from actors applying methodology similar to that developed in Schweingruber and
McPhail (1999). While we can find no reports of how activists generate their estimates, state
authorities do approximate crowd attendance by estimating the carrying capacity of public
areas and the percentage of that area covered with people (McPhail and McCarthy, 2004).
The estimates in newspapers therefore are not as haphazard as their lack of methodological
reporting would suggest.
The data generating process for Twitter: Images Faces is similar. The photos for this
note primarily consist of people photographing their immediate surroundings and contain
anywhere from 1 to 48 faces, with an average of 2.81 faces per photo.
So long as people who geotag their photos are randomly distributed across a protest site, and
there is no reason to think they are not, then Twitter: Images Faces is similar to randomly
sampling segments of a protest site (McPhail and McCarthy, 2004).
5.2 Which Dataset Should I Prefer?
This research note is agnostic about whether or not researchers should prefer news or social
media data, as there are advantages and disadvantages to relying on either. Broadly speaking,
Twitter requires a greater fixed cost but has a lower marginal cost than using newspapers.
Newspapers are easier to access than social media, so projects using them can start more
quickly. If a project includes local newspapers, they will also provide more geographic coverage
than Twitter. One disadvantage is that they are still secondary sources of information,
whereas tweets, properly collected, provide primary evidence. A project using newspapers
to measure protests also requires more ongoing work than one using Twitter.
Though acquiring and processing Twitter data is more technically challenging than doing
the same for news articles, it has some advantages over them. Protest images, because they
are primary source material, may be closer to the cell phone data in their data generating
process. Using the keyword approach to measure protest size is much less difficult than using
images and equally accurate, so measuring protest size once a data pipeline is established is
relatively easy.
One advantage of using images on Twitter over tweet text is that text requires subject-matter
expertise to choose keywords for each protest event, whereas visuals are closer to a
universal language. Measuring size with text therefore requires more ongoing involvement
than image-based estimates as new events occur.
These findings do not offer any insight into whether or not news reports or Twitter
provide reliable estimates outside of the United States, though other work has found similar
levels of correlation between newspaper estimates and Twitter: Images Faces in South Korea
and Russia (Joo and Steinert-Threlkeld, 2019). It is possible, for example, that crowd size
estimates in news, as well as usage of keywords and sharing of protest photos on social media,
are only reliable in countries above a socioeconomic threshold, with strong protections for
individual rights, during peaceful protests, or some combination of those and other factors.
This work cannot test the international robustness of this finding because CCC only focuses
on the United States, there is no other dataset of contemporaneous protest size, and we
could not obtain cell phone location data in other locations.
5.3 How Many People Protested?
The cell phone estimates should not be interpreted as the true number of protesters since
the data represent about 3% of the United States’ population. The estimates will therefore
be lower than whatever the true number of protesters was. For example, the largest protest
in the cell phone data was in Washington D.C., with 3,082 attendees. The United States
has almost 330 million inhabitants, so those estimates scale to roughly 102,723 protesters
(3,082 × 33.33). Those estimates are much smaller than newspapers' reports of 750,000 people;
the carrying capacity of the National Mall, however, is probably much closer to 500,000
people (McPhail and McCarthy, 2004), and reliable, third-party estimates conclude that
470,000 individuals protested in Washington D.C. (Wallace and Parlapiano, 2017).
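The back-of-the-envelope scaling above can be written as a one-line adjustment, under the strong (and our own) assumption of a uniform sampling rate across cities.

```python
def scale_to_population(observed, multiplier=33.33):
    """Scale a count from the ~3% cell phone sample (10 million of
    roughly 330 million people, hence a multiplier of ~33.33) to a
    full-population figure. Assumes a uniform sampling rate, which
    surely varies by city in practice."""
    return round(observed * multiplier)
```

Applied to the Washington D.C. count of 3,082, this reproduces the roughly 102,723 protesters cited in the text.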
5.4 Concluding Thoughts
Similar data have been used to measure the size of apolitical (Mamei and Colonna, 2016)
and political events (Shalev and Rotman, 2016; Traag, Quax and Sloot, 2017).
These results suggest that news provides reliable estimates of protest sizes, as do social
media text or images of protesters (Botta, Moat and Preis, 2015). Event datasets that report
size should therefore report raw estimates recorded in news. Researchers using those data
can then decide whether or not to transform them. Transforming, and not reporting, raw
data throws away information future researchers may find useful.
The common, understandable concern that news estimates of protest size cannot be
trusted should be laid to rest. Whether a researcher prefers news or social media to measure
protest size is his or her choice, as they are equally accurate. So long as one believes that
estimates of protest size from cellphone location data are accurate, then estimates of protest
size from news and social media are trustworthy.
References

Barbera, Pablo. 2013. "streamR.".
URL: https://cran.r-project.org/web/packages/streamR/

Baum, Matthew A and Yuri M Zhukov. 2015. "Filtering revolution: Reporting bias in international newspaper coverage of the Libyan civil war." Journal of Peace Research 52(3):384–400.

Biggs, Michael. 2016. "Size Matters: Quantifying Protest by Counting Participants." Sociological Methods & Research pp. 1–33.

Botta, Federico, Helen Susannah Moat and Tobias Preis. 2015. "Quantifying crowd size with mobile phone and Twitter data." Royal Society Open Science 2:150162.

Celestino, Mauricio Rivera and Kristian Skrede Gleditsch. 2013. "Fresh carnations or all thorn, no rose? Nonviolent campaigns and transitions in autocracies." Journal of Peace Research 50(3):385–400.

Chenoweth, Erica and Jeremy Pressman. 2017. "Crowd Counting Consortium.".
URL: https://sites.google.com/view/crowdcountingconsortium/home

Chenoweth, Erica, Jonathan Pinckney and Orion Lewis. 2018. "Days of rage: Introducing the NAVCO 3.0 Dataset." Journal of Peace Research 55(4):524–534.

Chenoweth, Erica and Maria J. Stephan. 2011. Why Civil Resistance Works. New York City: Columbia University Press.

Clark, David H. and Patrick M. Regan. 2018. "Mass Mobilization Protest Data.".
URL: https://www.binghamton.edu/massmobilization/about.html

Cox, Joseph. 2019. "I Gave a Bounty Hunter $300. Then He Located Our Phone.".
URL: https://motherboard.vice.com/en_us/article/nepxbz/i-gave-a-bounty-hunter-300-dollars-located-phone-microbilt-zumigo-tmobile

de Montjoye, Yves-Alexandre, Sebastien Gambs, Vincent Blondel, Geoffrey Canright, Nicolas De Cordes, Sebastien Deletaille, Kenth Engø-Monsen, Manuel Garcia-Herranz, Jake Kendall, Cameron Kerry, Gautier Krings, Emmanuel Letouze, Miguel Luengo-Oroz, Nuria Oliver, Luc Rocher, Alex Rutherford, Zbigniew Smoreda, Jessica Steele, Erik Wetter, Alex Pentland and Linus Bengtsson. 2018. "On the privacy-conscientious use of mobile phone data." Scientific Data 5.
URL: http://dx.doi.org/10.1038/sdata.2018.286

DeNardo, James. 1985. Power in Numbers: The Political Strategy of Protest and Rebellion. Princeton: Princeton University Press.

Earl, Jennifer, Andrew Martin, John D. McCarthy and Sarah A. Soule. 2004. "The Use of Newspaper Data in the Study of Collective Action." Annual Review of Sociology 30:65–80.
Gerner, Deborah J. and Philip A. Schrodt. 1998. The Effects of Media Coverage on Crisis Assessment and Early Warning in the Middle East. In Early Warning and Early Response, ed. Susanne Schmeidl and Howard Adelman. New York City: Columbia University Press.

Hlavac, Marek. 2018. "stargazer: beautiful LaTeX, HTML, and ASCII tables from R statistical output.".

Joo, Jungseock and Zachary C. Steinert-Threlkeld. 2019. "Image as Data: Automated Visual Content Analysis for Political Science.".

Kuran, Timur. 1989. "Sparks and Prairie Fires: A Theory of Unanticipated Political Revolution." Public Choice 61(1):41–74.

Kurzman, Charles. 2004. The Unthinkable Revolution in Iran. Cambridge: Harvard University Press.

Lohmann, Susanne. 1994. "The Dynamics of Informational Cascades: The Monday Demonstrations in Leipzig, East Germany, 1989-91." World Politics 47(1):42–101.

Mamei, Marco and Massimo Colonna. 2016. "Estimating attendance from cellular network data." International Journal of Geographical Information Science 30(7):1281–1301.

Mann, Leon. 1974. "Counting the Crowd: Effects of Editorial Policy on Estimates." Journalism Quarterly 51(2):278–285.

McCarthy, John D., Clark McPhail, Jackie Smith and Louis J. Crishock. 1999. Electronic and Print Media Representations of Washington, D.C. Demonstrations, 1982 and 1991: A Demography of Description Bias. In Acts of Dissent: New Developments in the Study of Protest, ed. Dieter Rucht, Ruud Koopmans and Friedhelm Neidhardt. pp. 113–130.

McPhail, Clark and John McCarthy. 2004. "Who Counts and How: Estimating the Size of Protests." Contexts 3(3):12–18.

Pfeffer, Jurgen and Katja Mayer. 2018. "Tampering with Twitter's Sample API." EPJ Data Science 7(50):1–21.

Raleigh, Clionadh, Andrew Linke, Havard Hegre and Joakim Karlsen. 2010. "Introducing ACLED: An Armed Conflict Location and Event Dataset: Special Data Feature." Journal of Peace Research 47(5):651–660.

Rasler, Karen. 1996. "Concessions, Repression, and Political Protest in the Iranian Revolution." American Sociological Review 61(1):132–152.

Salehyan, Idean, Cullen Hendrix, Jesse Hammer, Christina Case, Christopher Linebarger, Emily Stull and Jennifer Williams. 2012. "Social Conflict in Africa: A New Database." International Interactions 38(4):503–511.

Schweingruber, David and Clark McPhail. 1999. "A Method for Systematically Observing and Recording Collective Action." Sociological Methods & Research 27(4):451–498.
Shalev, Michael and Assaf Rotman. 2016. “Mass Protests Meet Big Data.” Working paper.
URL: http://cpd.berkeley.edu/wp-content/uploads/2016/09/Rotman-Shalev-for-UCB-Colloquium-1.pdf
Steinert-Threlkeld, Zachary C. 2017. “Spontaneous Collective Action: Peripheral Mobilization During the Arab Spring.” American Political Science Review 111(02):379–403.
Thurber, Ches. 2013. “Social Structures of Revolution: Networks, Overlap, and the Choice of Violent Versus Nonviolent Strategies in Conflict.” Journal of Peace Research 50(3):401.
Tilly, Charles and Sidney Tarrow. 2015. Contentious Politics. Second ed. Oxford: Oxford University Press.
Traag, V.A., R. Quax and P.M.A. Sloot. 2017. “Modelling the distance impedance of protest attendance.” Physica A: Statistical Mechanics and its Applications 468:171–182.
Valentino-DeVries, Jennifer, Natasha Singer, Michael H. Keller and Aaron Krolik. 2018. “Your Apps Know Where You Were Last Night, and They’re Not Keeping It Secret.”.
URL: https://www.nytimes.com/interactive/2018/12/10/business/location-data-privacy-apps.html?smid=tw-share
Wallace, Tim and Alicia Parlapiano. 2017. “Crowd Scientists Say Women’s March in Washington Had 3 Times as Many People as Trump’s Inauguration.”.
URL: https://www.nytimes.com/interactive/2017/01/22/us/politics/womens-march-trump-crowd-estimates.html
Weidmann, Nils B. 2014. “On the Accuracy of Media-based Conflict Event Data.” Journal of Conflict Resolution 59(6):1129–1149.
Weidmann, Nils B. and Espen Geelmuyden Rod. 2018. The Internet and Political Protest in Autocracies. Oxford University Press.
Woolley, John T. 2000. “Using Media-Based Data in Studies of Politics.” American Journal of Political Science 44(1):156–173.
Yip, P, R Watson, K Chan, E Lau, F Chen, Y Xu, L Xi, D Cheung, B Ip and D Liu. 2010. “Estimation of the Number of People in a Demonstration.” Australian & New Zealand Journal of Statistics 52(1):17–26.
Supplementary Materials for
“News and Social Media Accurately Measure Protest Size”
S1 Principal Component Analysis
We also construct two latent measures of protest size, each the first component from a
principal component analysis (PCA) of several size estimates. PCA: All, the first measure,
uses all four estimates: CCC: News, Twitter: Text Accounts, Twitter: Images Faces, and
Twitter: Images Accounts. PCA: Images, the second, uses only Twitter: Images Faces and
Twitter: Images Accounts. The second is restricted to images because an image-based
analysis scales more easily than relying on newspapers or hand-constructed protest dictionaries.
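The construction can be sketched as follows; this is a minimal example with synthetic data and illustrative names, not the paper's actual estimates:

```python
import numpy as np

def first_component_scores(X):
    """Scores on the first principal component of the columns of X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each size estimate
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[0]  # project observations onto the first component

# Four noisy log-size estimates of the same latent protest sizes
rng = np.random.default_rng(0)
log_true = rng.normal(8, 1, size=200)
estimates = np.column_stack([log_true + rng.normal(0, 0.3, 200) for _ in range(4)])

pca_all = first_component_scores(estimates)            # analogue of PCA: All
pca_images = first_component_scores(estimates[:, 2:])  # analogue of PCA: Images
```

The sign of the first component is arbitrary, so in practice the scores are oriented so that larger values correspond to larger protests.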
The following sections include PCA estimates. Overall, PCA: All performs the best of
all six measures, while PCA: Images performs almost identically to Twitter: Images Faces.
S2 Additional Model Fit Measurement
S2.1 Per Capita, Overlapping Observations
Table A1: Overlapping Observations, Include PCA
CCC: News   PCA: Images   PCA: All   Twitter: Images Accounts   Twitter: Images Faces   Twitter: Text Accounts
Mean Error 0.06 0.09 0.00 0.10 0.08 0.07
Mean Error (Trimmed) 0.06 0.09 -0.07 0.11 0.14 0.04
Mean Absolute Error 0.58 0.75 0.84 0.82 0.71 0.60
Mean Absolute Error (Trimmed) 0.58 0.72 0.78 0.79 0.62 0.58
Best Predictor, % 0.22 0.10 0.20 0.16 0.11 0.21
Within 0.1 SDs 0.17 0.12 0.11 0.08 0.11 0.14
Within 0.5 SDs 0.52 0.50 0.50 0.38 0.52 0.54
Within 1 SDs 0.82 0.75 0.67 0.67 0.77 0.81
Note: “(Trimmed)” datasets drop observations where |z| > 3.
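The statistics reported in these tables are straightforward error summaries. A minimal sketch, assuming the benchmark and each estimate are already on a common standardized scale (z-scores of logged per capita size) and that trimming drops observations whose standardized values exceed 3 in absolute value:

```python
import numpy as np

def fit_stats(benchmark, estimate, trim=3.0):
    """Summarize an estimate's error against the cell phone benchmark."""
    err = estimate - benchmark
    keep = (np.abs(benchmark) <= trim) & (np.abs(estimate) <= trim)  # drop |z| > 3
    return {
        "mean_error": err.mean(),
        "mean_error_trimmed": err[keep].mean(),
        "mean_abs_error": np.abs(err).mean(),
        "mean_abs_error_trimmed": np.abs(err[keep]).mean(),
        "within_0.1_sd": np.mean(np.abs(err) <= 0.1),
        "within_0.5_sd": np.mean(np.abs(err) <= 0.5),
        "within_1_sd": np.mean(np.abs(err) <= 1.0),
    }

rng = np.random.default_rng(0)
benchmark = rng.normal(size=500)
estimate = benchmark + rng.normal(scale=0.5, size=500)  # a noisy measure
stats = fit_stats(benchmark, estimate)
```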
S2.2 Per Capita, All Observations
Table A2: All Observations, Include PCA
CCC: News   PCA: Images   PCA: All   Twitter: Images Accounts   Twitter: Images Faces   Twitter: Text Accounts
Mean Error 0.03 0.11 0.00 0.06 0.11 0.13
Mean Error (Trimmed) 0.05 0.13 -0.07 0.10 0.16 0.12
Mean Absolute Error 0.60 0.73 0.84 0.83 0.71 0.61
Mean Absolute Error (Trimmed) 0.59 0.69 0.78 0.78 0.63 0.60
Best Predictor, % 0.22 0.10 0.20 0.16 0.11 0.21
Within 0.1 SDs 0.17 0.11 0.11 0.08 0.09 0.13
Within 0.5 SDs 0.50 0.53 0.50 0.39 0.51 0.54
Within 1 SDs 0.81 0.76 0.67 0.68 0.78 0.80
Note: “(Trimmed)” datasets drop observations where |z| > 3.
S2.3 Not Per Capita, Overlapping Observations
Table A3: Not Per Capita, Include PCA, Overlapping Observations
CCC: News   PCA: Images   PCA: All   Twitter: Images Accounts   Twitter: Images Faces   Twitter: Text Accounts
Mean Error 0.20 -0.22 -0.41 -0.14 -0.24 0.71
Mean Error (Trimmed) 0.20 -0.20 0.27 -0.14 -0.22 0.58
Mean Absolute Error 0.49 0.70 2.06 0.58 0.54 0.94
Mean Absolute Error (Trimmed) 0.49 0.68 1.42 0.58 0.52 0.83
Best Predictor, % 0.32 0.10 0.07 0.22 0.15 0.14
Within 0.1 SDs 0.19 0.06 0.03 0.12 0.08 0.07
Within 0.5 SDs 0.62 0.46 0.13 0.52 0.61 0.43
Within 1 SDs 0.88 0.79 0.33 0.82 0.88 0.66
Note: “(Trimmed)” datasets drop observations where |z| > 3.
S2.4 Not Per Capita, All Observations
Table A4: Not Per Capita, Include PCA, All Observations
CCC: News   PCA: Images   PCA: All   Twitter: Images Accounts   Twitter: Images Faces   Twitter: Text Accounts
Mean Error 0.03 0.11 0.00 0.06 0.11 0.13
Mean Error (Trimmed) 0.05 0.13 −0.07 0.10 0.16 0.12
Mean Absolute Error 0.60 0.73 0.84 0.83 0.71 0.61
Mean Absolute Error (Trimmed) 0.59 0.69 0.78 0.78 0.63 0.60
Best Predictor, % 0.22 0.10 0.20 0.16 0.11 0.21
Within 0.1 SDs 0.17 0.11 0.11 0.08 0.09 0.13
Within 0.5 SDs 0.50 0.53 0.50 0.39 0.51 0.54
Within 1 SDs 0.81 0.76 0.67 0.68 0.78 0.80
Note: “(Trimmed)” datasets drop observations where |z| > 3.
S3 Mean Error by Z-Score Decile
S3.1 Per Capita, Overlapping Observations
Figure A1: Mean Error by Decile, Overlapping Observations
S3.2 Per Capita, All Observations
Figure A2: Mean Error by Decile, All Observations
S3.3 Not Per Capita, Overlapping Observations
Figure A3: Mean Error by Decile, Overlapping Observations
S3.4 Not Per Capita, All Observations
Figure A4: Mean Error by Decile, All Observations
S4 Correlation Tables
Table A5 shows the pairwise correlation of the cell phone data estimates with the six protest
measures and city population. Because correlation is sensitive to outliers, our preferred
measure is the correlation of the logged protest size values. Raw correlations are very high
for all measures, though they are driven partly by the largest protests; the logged
correlations remain high. By either measure, social media and news are almost equally
accurate, though Twitter: Images
Accounts is consistently the least accurate. Its size correlation is the lowest of the six
measures, and the gap between its rank correlation and the next lowest (CCC, All Estimates
(Low)) is large. Note as well that its size and rank correlation are low with the news and
other Twitter estimates, suggesting that counting the number of accounts which share protest
images captures something slightly different from protest attendance. While counting the
number of accounts which share protest photos may appear less methodologically problematic
than counting faces, it is clearly worse than the other two Twitter estimates and should be
avoided.
Across almost all measurements of protest size, the placebo correlation is lower than the
measures of protest size. The only cases where it is larger, shown in Tables A7b and A9b, occur when
the data are restricted to the 128 observations common to all datasets. Even then, it is not
much larger than CCC: News or Twitter: Text Accounts.
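The sensitivity of correlation to the largest observations, which motivates the preference for logged values, is easy to demonstrate; this is an illustrative sketch with synthetic heavy-tailed sizes, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(1)
true_size = rng.lognormal(mean=8, sigma=1.5, size=300)             # heavy-tailed sizes
estimate = true_size * rng.lognormal(mean=0, sigma=0.5, size=300)  # noisy estimate

raw_r = np.corrcoef(true_size, estimate)[0, 1]
log_r = np.corrcoef(np.log(true_size), np.log(estimate))[0, 1]
# raw_r is driven largely by the handful of largest observations,
# while log_r weights all observations more evenly.
```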
S4.1 Z-score of Log per Capita
This measure is the one used throughout the manuscript.
Table A5: Correlation of Z-score of Log per Capita

Columns follow the same order as the rows: Cell Phone Data, Urbanized Area, CCC: News, CCC, All Estimates, CCC, All Estimates (Low), Twitter: Text Accounts, Twitter: Images Faces, Twitter: Images Accounts.

(a) All Observations per Pair

Cell Phone Data              1      -0.585   0.655   0.655   0.652   0.681   0.424   0.395
Urbanized Area              -0.585   1      -0.327  -0.327  -0.338  -0.428  -0.059  -0.216
CCC: News                    0.655  -0.327   1       1       0.988   0.807   0.609   0.391
CCC, All Estimates           0.655  -0.327   1       1       0.988   0.807   0.609   0.391
CCC, All Estimates (Low)     0.652  -0.338   0.988   0.988   1       0.780   0.600   0.366
Twitter: Text Accounts       0.681  -0.428   0.807   0.807   0.780   1       0.574   0.563
Twitter: Images Faces        0.424  -0.059   0.609   0.609   0.600   0.574   1       0.249
Twitter: Images Accounts     0.395  -0.216   0.391   0.391   0.366   0.563   0.249   1

(b) Same Observations for All Pairs

Cell Phone Data              1      -0.409   0.609   0.619   0.612   0.642   0.432   0.337
Urbanized Area              -0.409   1      -0.359  -0.312  -0.317  -0.501  -0.109  -0.089
CCC: News                    0.609  -0.359   1       1       0.991   0.802   0.606   0.305
CCC, All Estimates           0.619  -0.312   1       1       0.991   0.752   0.592   0.209
CCC, All Estimates (Low)     0.612  -0.317   0.991   0.991   1       0.736   0.585   0.191
Twitter: Text Accounts       0.642  -0.501   0.802   0.752   0.736   1       0.611   0.319
Twitter: Images Faces        0.432  -0.109   0.606   0.592   0.585   0.611   1       0.294
Twitter: Images Accounts     0.337  -0.089   0.305   0.209   0.191   0.319   0.294   1
S4.2 Z-score of Log Size
This measure is the z-score of the logged protest size, not log per capita protest size.
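The two transformations differ only in whether size is divided by population before logging and standardizing; a small sketch with made-up numbers:

```python
import numpy as np

size = np.array([5_000.0, 12_000.0, 300.0, 750_000.0])  # illustrative protest sizes
population = np.array([4e5, 9e5, 5e4, 8e6])             # matching city populations

def zscore(x):
    return (x - x.mean()) / x.std()

z_log_per_capita = zscore(np.log(size / population))  # measure used in Section S4.1
z_log_size = zscore(np.log(size))                     # measure used in this subsection
```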
Table A7: Correlation of Z-score of Log Measures

Columns follow the same order as the rows: Cell Phone Data, Urbanized Area, CCC: News, CCC, All Estimates, CCC, All Estimates (Low), Twitter: Text Accounts, Twitter: Images Faces, Twitter: Images Accounts.

(a) All Observations per Pair

Cell Phone Data              1       0.570   0.708   0.708   0.699   0.730   0.714   0.631
Urbanized Area               0.570   1       0.467   0.467   0.460   0.630   0.641   0.844
CCC: News                    0.708   0.467   1       1       0.990   0.825   0.805   0.559
CCC, All Estimates           0.708   0.467   1       1       0.990   0.825   0.805   0.559
CCC, All Estimates (Low)     0.699   0.460   0.990   0.990   1       0.800   0.795   0.541
Twitter: Text Accounts       0.730   0.630   0.825   0.825   0.800   1       0.833   0.758
Twitter: Images Faces        0.714   0.641   0.805   0.805   0.795   0.833   1       0.687
Twitter: Images Accounts     0.631   0.844   0.559   0.559   0.541   0.758   0.687   1

(b) Same Observations for All Pairs

Cell Phone Data              1       0.722   0.707   0.716   0.702   0.726   0.707   0.660
Urbanized Area               0.722   1       0.458   0.481   0.472   0.633   0.626   0.849
CCC: News                    0.707   0.458   1       1       0.990   0.830   0.800   0.561
CCC, All Estimates           0.716   0.481   1       1       0.993   0.798   0.767   0.508
CCC, All Estimates (Low)     0.702   0.472   0.990   0.993   1       0.781   0.758   0.493
Twitter: Text Accounts       0.726   0.633   0.830   0.798   0.781   1       0.834   0.678
Twitter: Images Faces        0.707   0.626   0.800   0.767   0.758   0.834   1       0.681
Twitter: Images Accounts     0.660   0.849   0.561   0.508   0.493   0.678   0.681   1
S4.3 Log
Table A9: Correlation of Log(Size)

Columns follow the same order as the rows: Cell Phone Data, Urbanized Area, CCC: News, CCC, All Estimates, CCC, All Estimates (Low), Twitter: Text Accounts, Twitter: Images Faces, Twitter: Images Accounts.

(a) All Observations per Pair

Cell Phone Data              1       0.639   0.724   0.724   0.719   0.755   0.714   0.627
Urbanized Area               0.639   1       0.561   0.561   0.557   0.714   0.702   0.787
CCC: News                    0.724   0.561   1       1       0.990   0.839   0.807   0.602
CCC, All Estimates           0.724   0.561   1       1       0.990   0.839   0.807   0.602
CCC, All Estimates (Low)     0.719   0.557   0.990   0.990   1       0.811   0.795   0.579
Twitter: Text Accounts       0.755   0.714   0.839   0.839   0.811   1       0.838   0.793
Twitter: Images Faces        0.714   0.702   0.807   0.807   0.795   0.838   1       0.710
Twitter: Images Accounts     0.627   0.787   0.602   0.602   0.579   0.793   0.710   1

(b) Same Observations for All Pairs

Cell Phone Data              1       0.778   0.695   0.721   0.712   0.726   0.707   0.660
Urbanized Area               0.778   1       0.585   0.603   0.600   0.709   0.698   0.797
CCC: News                    0.695   0.585   1       1       0.991   0.830   0.800   0.561
CCC, All Estimates           0.721   0.603   1       1       0.992   0.798   0.767   0.508
CCC, All Estimates (Low)     0.712   0.600   0.991   0.992   1       0.781   0.758   0.493
Twitter: Text Accounts       0.726   0.709   0.830   0.798   0.781   1       0.834   0.678
Twitter: Images Faces        0.707   0.698   0.800   0.767   0.758   0.834   1       0.681
Twitter: Images Accounts     0.660   0.797   0.561   0.508   0.493   0.678   0.681   1
S4.4 Raw
Table A10: Correlation of Size, no Transformation

Columns follow the same order as the rows of panel (a): Cell Phone Data, Urbanized Area, CCC: News, CCC, All Estimates, CCC, All Estimates (Low), Twitter: Text Accounts, Twitter: Images Faces, Twitter: Images Accounts.

(a) All Observations per Pair

Cell Phone Data              1       0.587   0.798   0.798   0.841   0.794   0.829   0.640
Urbanized Area               0.587   1       0.674   0.674   0.648   0.678   0.583   0.870
CCC: News                    0.798   0.674   1       1       0.972   0.978   0.966   0.819
CCC, All Estimates           0.798   0.674   1       1       0.972   0.978   0.966   0.819
CCC, All Estimates (Low)     0.841   0.648   0.972   0.972   1       0.931   0.943   0.760
Twitter: Text Accounts       0.794   0.678   0.978   0.978   0.931   1       0.971   0.849
Twitter: Images Faces        0.829   0.583   0.966   0.966   0.943   0.971   1       0.782
Twitter: Images Accounts     0.640   0.870   0.819   0.819   0.760   0.849   0.782   1

(b) Same Observations for All Pairs

Cell Phone Data              1       0.587   0.798   0.798   0.841   0.794   0.829   0.640
Urbanized Area               0.587   1       0.674   0.674   0.648   0.678   0.583   0.870
CCC: News                    0.798   0.674   1       1       0.972   0.978   0.966   0.819
CCC, All Estimates           0.798   0.674   1       1       0.972   0.978   0.966   0.819
CCC, All Estimates (Low)     0.841   0.648   0.972   0.972   1       0.931   0.943   0.760
Twitter: Text Accounts       0.794   0.678   0.978   0.978   0.931   1       0.971   0.849
Twitter: Images Accounts     0.640   0.870   0.819   0.819   0.760   0.849   0.782   1
S5 Results, Regression
This section regresses the cell phone measure on three specifications: the placebo model
alone, the placebo model plus each of the four size estimates in turn, and a model pooling
all four estimates with the placebo variables. The placebo model includes variables for
city population, median household income, income inequality, vote share for Hillary
Clinton, and the number of unemployed. All models include state fixed effects.
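The specification can be sketched with a plain least squares fit in which the state fixed effects enter as dummy variables; synthetic data and illustrative names, not the paper's:

```python
import numpy as np

rng = rng = np.random.default_rng(2)
n = 200
state = rng.integers(0, 5, size=n)     # five "states" for fixed effects
placebo = rng.normal(size=(n, 3))      # e.g. income, Gini, vote share
size_estimate = rng.normal(size=n)     # one protest size measure
y = (0.5 * size_estimate + placebo @ np.array([0.2, 0.1, -0.3])
     + 0.4 * state + rng.normal(scale=0.5, size=n))  # cell phone measure

# Design matrix: intercept, size estimate, placebo covariates, state dummies (one dropped)
dummies = (state[:, None] == np.arange(1, 5)).astype(float)
X = np.column_stack([np.ones(n), size_estimate, placebo, dummies])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[1] is the size estimate's coefficient net of the placebo
# covariates and the state fixed effects
```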
Each subsection uses a different transformation of protest size applied to the overlapping
observations and all observations per measure. There are therefore eight tables, two in each
of the four subsections.
The results mirror the main paper’s. Across dependent variable transformations and
subsets of data, the CCC: News and Twitter: Text Accounts models explain the most variation
in the cell phone protest size measure. In Tables A15 (z-score of log size) and A17 (log size),
the placebo model outperforms the other models. In those tables, however, the model that
pools the four measures’ estimates with the placebo variables finds statistical significance for
only CCC: News and Twitter: Text Accounts. Note as well that the placebo is worse than
the other measures when using raw data, as seen in Tables A18 and A19.
S5.1 Z-score of Log per Capita
Table A12: Same Cases
Z, Log(Per Capita)
(1) (2) (3) (4) (5) (6)
Z, Log(Twitter: Images Faces per Capita) 0.273∗∗ 0.050
(0.105) (0.088)
Z, Log(Twitter: Text per Capita) 0.791∗∗∗ 0.545∗∗∗
(0.104) (0.171)
Z, Log(Twitter: Images Accounts per Capita) 0.363∗∗∗ 0.020
(0.124) (0.119)
Z, Log(CCC: News per Capita) 0.694∗∗∗ 0.297∗∗
(0.106) (0.147)
Median HHI 0.238 0.139 0.068 0.279 0.092 0.042
(0.195) (0.192) (0.150) (0.187) (0.159) (0.152)
Gini 0.178 0.212 0.085 0.015 0.101 0.078
(0.151) (0.146) (0.115) (0.154) (0.122) (0.123)
County Dem. Vote Share −0.034 −0.160 −0.163 −0.060 −0.249∗ −0.240∗
(0.179) (0.180) (0.137) (0.171) (0.148) (0.140)
Unemployed Count −0.300 −0.208 0.562 0.517 0.967 0.899
(1.341) (1.294) (1.024) (1.309) (1.100) (1.045)
Population −0.014 −0.131 −0.990 −0.904 −1.336 −1.324
(1.459) (1.409) (1.115) (1.425) (1.195) (1.133)
Intercept 0.302 0.221 0.042 0.413 0.285 0.107
(0.922) (0.890) (0.700) (0.880) (0.744) (0.694)
Observations 127 127 127 127 127 127
Adjusted R2 0.122 0.182 0.494 0.201 0.428 0.510
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
Table A13: Maximum Cases per Measure
Z, Log(Per Capita)
(1) (2) (3) (4) (5) (6)
Z, Log(Twitter: Images Faces per Capita) 0.280∗∗∗ 0.050
(0.091) (0.088)
Z, Log(Twitter: Text per Capita) 0.683∗∗∗ 0.545∗∗∗
(0.076) (0.171)
Z, Log(Twitter: Images Accounts per Capita) 0.281∗∗∗ 0.020
(0.083) (0.119)
Z, Log(CCC: News per Capita) 0.497∗∗∗ 0.297∗∗
(0.085) (0.147)
Median HHI 0.213∗ 0.184 0.085 0.266∗ 0.238 0.042
(0.120) (0.158) (0.113) (0.143) (0.148) (0.152)
Gini 0.202∗ 0.262∗∗ 0.182∗ 0.058 0.164 0.078
(0.103) (0.126) (0.095) (0.129) (0.115) (0.123)
County Dem. Vote Share −0.053 −0.166 −0.220∗∗ −0.040 −0.216 −0.240∗
(0.116) (0.153) (0.108) (0.138) (0.144) (0.140)
Unemployed Count −0.566 −0.717 −0.456 −0.232 0.521 0.899
(0.733) (0.929) (0.724) (0.913) (1.075) (1.045)
Population 0.352 0.462 0.212 −0.031 −0.885 −1.324
(0.757) (0.968) (0.753) (0.949) (1.169) (1.133)
Intercept −0.561 −1.274 −0.449 −0.844 −1.092∗ 0.107
(0.590) (0.853) (0.514) (0.651) (0.600) (0.694)
Observations 214 153 167 165 139 127
Adjusted R2 0.365 0.244 0.492 0.206 0.357 0.510
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
S5.2 Z-score of Log Size
Table A14: Same Cases
Z, Log(Size)
(1) (2) (3) (4) (5) (6)
Z, Log(Twitter: Images Faces) 0.334∗∗∗ 0.094
(0.085) (0.100)
Z, Log(Twitter: Text Accounts) 0.355∗∗∗ 0.199∗∗
(0.068) (0.097)
Z, Log(Twitter: Images Accounts) 0.274∗∗ 0.056
(0.124) (0.130)
Z, Log(CCC: News) 0.436∗∗∗ 0.224∗
(0.092) (0.121)
Median HHI 0.130 0.064 0.050 0.156 0.062 0.037
(0.121) (0.113) (0.106) (0.119) (0.109) (0.106)
Gini −0.077 0.0004 −0.001 −0.085 −0.048 0.001
(0.094) (0.088) (0.082) (0.092) (0.083) (0.084)
County Dem. Vote Share 0.231∗∗ 0.045 0.011 0.146 0.021 −0.070
(0.111) (0.112) (0.105) (0.115) (0.108) (0.108)
Unemployed Count −1.498∗ −0.804 −0.110 −0.864 −0.307 0.213
(0.821) (0.775) (0.759) (0.851) (0.770) (0.772)
Population 2.014∗∗ 1.057 0.056 1.153 0.527 −0.289
(0.893) (0.856) (0.858) (0.954) (0.851) (0.876)
Intercept 0.953 0.744 0.183 0.846 0.578 0.249
(0.573) (0.530) (0.518) (0.562) (0.514) (0.508)
Observations 128 128 128 128 128 128
Adjusted R2 0.362 0.461 0.522 0.392 0.499 0.542
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
Table A15: Maximum Cases per Measure
Z, Log(Size)
(1) (2) (3) (4) (5) (6)
Z, Log(Twitter: Images Faces) 0.356∗∗∗ 0.094
(0.065) (0.100)
Z, Log(Twitter: Text Accounts) 0.344∗∗∗ 0.199∗∗
(0.050) (0.097)
Z, Log(Twitter: Images Accounts) 0.338∗∗∗ 0.056
(0.075) (0.130)
Z, Log(CCC: News) 0.363∗∗∗ 0.224∗
(0.060) (0.121)
Median HHI 0.380∗∗∗ 0.088 0.104 0.236∗∗∗ 0.197∗∗ 0.037
(0.064) (0.091) (0.084) (0.089) (0.078) (0.106)
Gini 0.052 0.026 0.070 −0.037 0.043 0.001
(0.058) (0.074) (0.069) (0.075) (0.064) (0.084)
County Dem. Vote Share 0.188∗∗∗ 0.022 −0.055 0.101 0.020 −0.070
(0.063) (0.093) (0.085) (0.088) (0.080) (0.108)
Unemployed Count −0.453 −0.682 −0.617 −0.666 −0.576 0.213
(0.496) (0.539) (0.529) (0.571) (0.674) (0.772)
Population 0.725 0.858 0.628 0.837 0.865 −0.289
(0.511) (0.566) (0.557) (0.600) (0.732) (0.876)
Intercept −1.275∗∗∗ 0.055 −0.672∗ −0.779∗ −0.985∗∗∗ 0.249
(0.237) (0.492) (0.375) (0.403) (0.318) (0.508)
Observations 295 154 170 166 187 128
Adjusted R2 0.691 0.563 0.610 0.541 0.600 0.542
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
S5.3 Log Size
Table A16: Same Cases
Log(Size)
(1) (2) (3) (4) (5) (6)
Log(Twitter: Images Faces) 0.297∗∗∗ 0.083
(0.076) (0.089)
Log(Twitter: Text Accounts) 0.697∗∗∗ 0.390∗∗
(0.133) (0.190)
Log(Twitter: Images Accounts) 0.275∗∗ 0.056
(0.124) (0.130)
Log(CCC: News) 0.351∗∗∗ 0.180∗
(0.074) (0.097)
Median HHI 0.091 0.044 0.035 0.109 0.043 0.026
(0.085) (0.079) (0.074) (0.083) (0.076) (0.074)
Gini −0.054 0.0003 −0.0004 −0.059 −0.033 0.001
(0.065) (0.062) (0.058) (0.064) (0.058) (0.058)
County Dem. Vote Share 0.161∗∗ 0.032 0.007 0.102 0.014 −0.049
(0.077) (0.078) (0.073) (0.080) (0.075) (0.076)
Unemployed Count −1.045∗ −0.561 −0.077 −0.603 −0.215 0.149
(0.573) (0.541) (0.530) (0.594) (0.537) (0.539)
Population 1.406∗∗ 0.738 0.039 0.805 0.368 −0.202
(0.623) (0.597) (0.599) (0.666) (0.594) (0.612)
Intercept 2.395∗∗∗ 1.465∗∗∗ 1.400∗∗∗ 1.655∗∗∗ 1.075∗∗ 0.750
(0.400) (0.437) (0.395) (0.513) (0.450) (0.521)
Observations 128 128 128 128 128 128
Adjusted R2 0.362 0.461 0.522 0.392 0.499 0.542
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
Table A17: Maximum Cases per Measure
Log(Size)
(1) (2) (3) (4) (5) (6)
Log(Twitter: Images Faces) 0.317∗∗∗ 0.083
(0.058) (0.089)
Log(Twitter: Text Accounts) 0.675∗∗∗ 0.390∗∗
(0.099) (0.190)
Log(Twitter: Images Accounts) 0.339∗∗∗ 0.056
(0.075) (0.130)
Log(CCC: News) 0.291∗∗∗ 0.180∗
(0.048) (0.097)
Median HHI 0.265∗∗∗ 0.062 0.072 0.165∗∗∗ 0.137∗∗ 0.026
(0.045) (0.063) (0.059) (0.062) (0.054) (0.074)
Gini 0.036 0.018 0.049 −0.026 0.030 0.001
(0.040) (0.052) (0.048) (0.053) (0.045) (0.058)
County Dem. Vote Share 0.131∗∗∗ 0.016 −0.038 0.071 0.014 −0.049
(0.044) (0.065) (0.059) (0.061) (0.056) (0.076)
Unemployed Count −0.317 −0.476 −0.431 −0.465 −0.402 0.149
(0.346) (0.377) (0.369) (0.399) (0.471) (0.539)
Population 0.506 0.599 0.438 0.585 0.604 −0.202
(0.356) (0.395) (0.389) (0.419) (0.511) (0.612)
Intercept 0.840∗∗∗ 0.933∗∗ 0.817∗∗∗ 0.367 0.164 0.750
(0.166) (0.364) (0.265) (0.332) (0.276) (0.521)
Observations 295 154 170 166 187 128
Adjusted R2 0.691 0.563 0.610 0.541 0.600 0.542
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
S5.4 Size
Table A18: Same Cases
Cell Phone Data
(1) (2) (3) (4) (5) (6)
Twitter: Images Faces 0.00000 0.0001∗
(0.00001) (0.00003)
Twitter: Text Accounts −0.0001 −0.0002
(0.001) (0.003)
Twitter: Images Accounts −0.00004 −0.0001
(0.0001) (0.0001)
CCC: News −0.00000 −0.00001∗
(0.00000) (0.00000)
Median HHI 0.132 0.129 0.134 0.137 0.138 0.111
(0.110) (0.111) (0.111) (0.110) (0.111) (0.110)
Gini −0.036 −0.033 −0.037 −0.039 −0.041 −0.030
(0.105) (0.105) (0.105) (0.105) (0.105) (0.104)
County Dem. Vote Share 0.303∗∗∗ 0.298∗∗∗ 0.304∗∗∗ 0.317∗∗∗ 0.312∗∗∗ 0.315∗∗∗
(0.112) (0.112) (0.112) (0.112) (0.112) (0.113)
Unemployed Count −1.261 −1.178 −1.295 −1.589∗ −1.426 −1.435
(0.854) (0.910) (0.883) (0.949) (0.888) (1.045)
Population 2.057∗∗ 1.957∗ 2.098∗∗ 2.509∗∗ 2.259∗∗ 2.580∗∗
(0.919) (1.005) (0.977) (1.124) (0.984) (1.213)
Intercept 4.958∗∗∗ 4.953∗∗∗ 4.960∗∗∗ 5.005∗∗∗ 4.967∗∗∗ 5.060∗∗∗
(0.080) (0.084) (0.084) (0.109) (0.084) (0.116)
Observations 128 128 128 128 128 128
Log Likelihood −768.393 −768.376 −768.390 −768.200 −768.326 −765.705
Akaike Inf. Crit. 1,548.787 1,550.751 1,550.781 1,550.401 1,550.652 1,551.410
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
Table A19: Maximum Cases per Measure
Cell Phone Data
(1) (2) (3) (4) (5) (6)
Twitter: Images Faces 0.00000 0.0001∗
(0.00001) (0.00003)
Twitter: Text Accounts −0.0001 −0.0002
(0.001) (0.003)
Twitter: Images Accounts 0.00001 −0.0001
(0.0001) (0.0001)
CCC: News −0.00000 −0.00001∗
(0.00000) (0.00000)
Median HHI 0.669∗∗∗ 0.194∗ 0.297∗∗∗ 0.264∗∗ 0.316∗∗∗ 0.111
(0.086) (0.111) (0.106) (0.107) (0.097) (0.110)
Gini 0.534∗∗∗ −0.029 0.111 0.127 0.326∗∗∗ −0.030
(0.080) (0.105) (0.102) (0.101) (0.089) (0.104)
County Dem. Vote Share 0.291∗∗∗ 0.370∗∗∗ 0.353∗∗∗ 0.351∗∗∗ 0.273∗∗∗ 0.315∗∗∗
(0.083) (0.108) (0.103) (0.105) (0.097) (0.113)
Unemployed Count −0.842 −0.471 −0.699 −0.613 −2.134∗∗ −1.435
(0.607) (0.740) (0.761) (0.757) (0.949) (1.045)
Population 1.187∗ 1.041 1.334∗ 1.190 3.205∗∗∗ 2.580∗∗
(0.623) (0.777) (0.799) (0.803) (1.045) (1.213)
Intercept 4.487∗∗∗ 4.832∗∗∗ 4.741∗∗∗ 4.737∗∗∗ 4.672∗∗∗ 5.060∗∗∗
(0.058) (0.079) (0.078) (0.091) (0.077) (0.116)
Observations 295 154 170 166 187 128
Log Likelihood −1,709.422 −909.459 −991.326 −969.625 −1,050.807 −765.705
Akaike Inf. Crit. 3,430.844 1,832.918 1,996.653 1,953.250 2,115.614 1,551.410
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01