Sampling the Twitter graph
Stats on social networks / Twitter | Survey sampling | Extending the sampling design | Results and future work
Using sampling methods to estimate rare stats on Twitter’s graph
Antoine Rebecq
INSEE - Université Paris X
12/14/15
Antoine Rebecq Sampling the Twitter graph
Outline
1 Stats on social networks / Twitter : Motivation ; Towards design-based estimation
2 Survey sampling : Estimates ; Sampling design
3 Extending the sampling design : Snowball sampling ; Adaptive sampling
4 Results and future work : Results ; Sample size ; Future work
Section 1
Stats on social networks / Twitter
Subsection 1
Motivation
Big data begets big graph
Twitter in 2013
Image from [2]
Studies - Twitter
A large range of studies used Twitter data (Computer Science, Sociology, Psychology, etc.)
Data on Twitter can be collected via :
The REST API (limited number of queries - queries can be on anything)
The Streaming API (only 1% of tweets matching some criteria)
The Firehose (unlimited access, expensive)
The Twitter graph
The Twitter graph ([7]) :
Is directed (following is not necessarily reciprocal)
Degree distribution is heavy-tailed
Has small path lengths
Subsection 2
Towards design-based estimation
Model-based estimation :
Scale-free networks, Barabasi-Albert ([1])
Small-world networks, Watts-Strogatz ([13])
Very little exists about design-based statistical inference on networks (Kolaczyk 2009, [6])
We try survey sampling methods used in official statistics institutes to make design-based inference about “big graphs”
Example : Star Wars : The Force Awakens
Star Wars : The Force Awakens
Example : “Star Wars, The Force Awakens”
Let’s write :
$y_k$ = number of tweets @starwars by user $k$ between 10/29/15, 7:48 - 10:48 PM EST
$z_k = \mathbb{1}\{y_k \geq 1\}$
Goal : estimate $N_C = T(Z)$
Additionally, we write : $n_C = \sum_{k \in s} z_k$
Section 2
Survey sampling
Subsection 1
Estimates
Horvitz-Thompson estimator
Population $U$ : vertices of the Twitter graph.
Assign all $k \in U$ an inclusion probability $P(k \in s) = \pi_k$
Classic unbiased estimator for totals and means : Horvitz-Thompson
$\hat{T}(Y)_{HT} = \sum_{k \in s} \frac{y_k}{\pi_k}$
$\hat{\bar{y}} = \frac{1}{N} \sum_{k \in s} \frac{y_k}{\pi_k}$
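The two Horvitz-Thompson formulas can be sketched in a few lines of Python. This is only an illustration: the function names and the three-unit sample are hypothetical and are not taken from the study.

```python
def horvitz_thompson_total(y_sample, pi_sample):
    """Estimate a population total: each sampled value is weighted by 1 / pi_k."""
    if len(y_sample) != len(pi_sample):
        raise ValueError("y_sample and pi_sample must have the same length")
    return sum(y / pi for y, pi in zip(y_sample, pi_sample))


def horvitz_thompson_mean(y_sample, pi_sample, population_size):
    """Estimate a population mean: the estimated total divided by the known N."""
    return horvitz_thompson_total(y_sample, pi_sample) / population_size


# Hypothetical sample of three users: tweet counts and inclusion probabilities.
y = [2, 0, 1]
pi = [0.5, 0.25, 0.1]
total_hat = horvitz_thompson_total(y, pi)
mean_hat = horvitz_thompson_mean(y, pi, population_size=100)
print(total_hat, mean_hat)
```

A unit drawn with a small inclusion probability stands for many unsampled units, which is why it receives a large weight.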
Variance of the Horvitz-Thompson estimator depends on the first and second-order inclusion probabilities :
$\pi_k = P(k \in s)$
$\pi_{kl} = P(k, l \in s)$
$V(\hat{T}(Y)_{HT}) = \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l) \frac{y_k}{\pi_k} \frac{y_l}{\pi_l}$
Calibrated estimator
Deville-Särndal, 1992 ([3]). Modification of the Horvitz-Thompson estimator to take auxiliary information into account. For example :
T (Y ) = Number of tweets @StarWars
N = Number of users in scope
Structure of number of followers
Number of verified users
. . .
Very similar to empirical likelihood methods ([9]).
Subsection 2
Sampling design
Sampling frame
Each Twitter user is assigned a unique id. When a new user is created, the id assigned to it is greater than the last assigned id.
However, not all ids match an existing user ($\approx 3.1 \cdot 10^9$ ids as of October 2015), which means our frame over-covers the population. Over-coverage can be corrected by using either a Horvitz-Thompson or a Hajek estimator (see [10]).
Sampling design : Bernoulli
Poisson sampling : for each $k \in U$, run a $\pi_k$-Bernoulli experiment to decide whether to include unit $k$ in the sample.
Bernoulli sampling : $\forall k, \pi_k = p$
The sample size of this design is not fixed. We set the expected sample size to 20000.
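A minimal sketch of Poisson and Bernoulli sampling, assuming the frame is simply the index range 0..N-1. The frame size, the probability p and the seed below are illustrative values, not the ones used in the study.

```python
import random


def poisson_sample(inclusion_probs, rng):
    """Keep unit k when an independent Bernoulli(pi_k) draw succeeds."""
    return [k for k, pi in enumerate(inclusion_probs) if rng.random() < pi]


def bernoulli_sample(population_size, p, rng):
    """Bernoulli sampling is Poisson sampling with pi_k = p for every unit."""
    return poisson_sample([p] * population_size, rng)


rng = random.Random(2015)  # fixed seed so that the sketch is reproducible
N = 100_000                # hypothetical frame size
p = 0.02                   # expected sample size is N * p = 2000
s = bernoulli_sample(N, p, rng)
print(len(s))              # realized size varies around N * p from one draw to another
```

Because every unit is drawn independently, the realized sample size is random with mean N * p, which is why only an expected sample size can be set.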
Sampling design : Stratified Bernoulli
We write : $U = U_1 \oplus U_2$ ($h = 1, 2$ being called “strata”) and draw two independent Bernoulli samples in $U_1$ and $U_2$.
Here :
U1 = Followers of official @starwars account
U2 = Rest of Twitter users
Sampling design : Neyman allocation
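The variance of the Horvitz-Thompson estimator is minimized for (Neyman, [8]) :
$n_h = n \frac{N_h S_h}{\sum_h N_h S_h}$
where $n$ is the total sample size and $S_h$ is the standard deviation of $y$ in stratum $h$.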
Given the expected values, we set :
n1 = 9700
n2 = 10300
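As an illustration, the standard Neyman allocation gives each stratum a share of the sample proportional to N_h S_h, where S_h is the within-stratum standard deviation. The stratum sizes and standard deviations below are hypothetical and are not the values behind n1 and n2 above.

```python
def neyman_allocation(n_total, stratum_sizes, stratum_sds):
    """Split n_total across strata proportionally to N_h * S_h."""
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one stratum must have a positive N_h * S_h")
    return [n_total * w / total_weight for w in weights]


# Hypothetical strata: a small, highly variable one and a large, homogeneous one.
alloc = neyman_allocation(
    n_total=1000,
    stratum_sizes=[10_000, 990_000],
    stratum_sds=[0.3, 0.01],
)
print(alloc)
```

A small stratum can receive a large share of the sample when the variable of interest is much more dispersed there, which is the situation of the @starwars followers stratum.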
Sampling design : Stratified Bernoulli
Estimators for the two “simple” designs :
$\hat{N}_{C1} = \frac{n_C}{p}$
$\hat{N}_{C2} = \frac{N_1}{n_1} n_{C1} + \frac{N - N_1}{n_2} n_{C2}$
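The two estimators translate directly into code. In this sketch the argument names follow the notation above, and all numbers are made up for illustration.

```python
def bernoulli_estimate(n_c, p):
    """N_C1: number of sampled users in scope, scaled by 1 / p."""
    return n_c / p


def stratified_estimate(N, N1, n1, n2, n_c1, n_c2):
    """N_C2: each stratum count is scaled by (stratum size / stratum sample size)."""
    return (N1 / n1) * n_c1 + ((N - N1) / n2) * n_c2


# Hypothetical figures: 1,000,000 users, 10,000 of them in stratum 1.
n_c1_hat = bernoulli_estimate(n_c=40, p=0.01)
n_c2_hat = stratified_estimate(N=1_000_000, N1=10_000, n1=500, n2=1500,
                               n_c1=100, n_c2=3)
print(n_c1_hat, n_c2_hat)
```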
Variance estimators
$\hat{V}(\hat{T}(Y))_1 = \sum_{k \in s} \frac{(1 - p) y_k^2}{p^2}$
(for the indicator $z_k$, $z_k^2 = z_k$)
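A short sketch of how the Bernoulli variance estimator yields a coefficient of variation. It assumes the usual form of the estimator, a sum of (1 - p) y_k^2 / p^2 over the sample (for a 0/1 variable the square changes nothing). The sample below is hypothetical.

```python
import math


def bernoulli_total_variance_cv(y_sample, p):
    """Return (estimated total, estimated variance, estimated CV) under Bernoulli(p)."""
    total = sum(y_sample) / p
    variance = sum((1 - p) * y ** 2 for y in y_sample) / p ** 2
    cv = math.sqrt(variance) / total if total else float("nan")
    return total, variance, cv


# Hypothetical sample: 2000 users drawn with p = 0.01, 400 of them in scope.
z_sample = [1] * 400 + [0] * 1600
total_hat, var_hat, cv_hat = bernoulli_total_variance_cv(z_sample, p=0.01)
print(total_hat, var_hat, cv_hat)
```

For an indicator variable the CV is close to sqrt((1 - p) / n_C), so precision is driven by the number of in-scope users found in the sample rather than by the overall sample size.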
Section 3
Extending the sampling design
Snowball sampling
From now on, our sampling designs will include extensions : $s = s_0 \cup s_{ext}$
$s_0$ is still selected using stratified Bernoulli, but with an expected sample size of 1000, so that the expected sample size of $s$ is more or less 20000.
Subsection 1
Snowball sampling
Population U
Initial sample s0
One stage snowball extension s = A(s0)
Formally, we write :
$B_i = \{i\} \cup \{j \in V, E_{ji} \neq \emptyset\}$
$A_i = \{i\} \cup \{j \in V, E_{ij} \neq \emptyset\}$
$s = A(s_0)$
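$\hat{N}_{C3} = \sum_{i \in s} \frac{z_i}{1 - \pi(B_i)}$
where $\pi(B_i)$ is the probability that no unit of $B_i$ is selected in the initial sample $s_0$ :
$\pi(B_i) = P(B_i \cap s_0 = \emptyset)$
$= \prod_{k \in B_i} (1 - P(k \in s_0))$
$= q_{S_1}^{\#(B_i \cap U_1)} \cdot q_{S_2}^{\#(B_i \cap U_2)}$
with $q_{S_h}$ the probability that a unit of stratum $h$ is not selected in $s_0$.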
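The snowball estimator can be sketched as follows. The sketch assumes that the initial sample is a stratified Bernoulli sample with rates p1 and p2 (hypothetical names), so that the probability that no unit of B_i is initially selected is a product of powers of (1 - p1) and (1 - p2). The example data are invented.

```python
def prob_not_reached(b_in_u1, b_in_u2, p1, p2):
    """pi(B_i): probability that no unit of B_i belongs to the initial sample."""
    return (1 - p1) ** b_in_u1 * (1 - p2) ** b_in_u2


def snowball_estimate(units, p1, p2):
    """Sum z_i / (1 - pi(B_i)) over the units i of the extended sample s.

    `units` holds tuples (z_i, #(B_i in U1), #(B_i in U2)); B_i contains i itself.
    """
    return sum(
        z / (1 - prob_not_reached(b1, b2, p1, p2))
        for z, b1, b2 in units
    )


# Hypothetical extended sample of three users.
units = [
    (1, 1, 0),  # tweeted, B_i is the user alone, in stratum 1
    (1, 0, 2),  # tweeted, B_i has two units, both in stratum 2
    (0, 3, 1),  # did not tweet, contributes nothing to the total
]
n_c3_hat = snowball_estimate(units, p1=0.5, p2=0.1)
print(n_c3_hat)
```

Computing the weight of a sampled user requires the size and stratum composition of B_i, which has to be collected for every unit of s and not only for the units of s_0.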
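$\hat{V}(\hat{N}_{C3}) = \sum_{i \in s} \sum_{j \in s} \frac{z_i z_j}{\pi(B_i \cup B_j)} \gamma'_{ij}$
where :
$\gamma'_{ij} = \frac{\pi(B_i \cup B_j) - \pi(B_i) \pi(B_j)}{[1 - \pi(B_i)][1 - \pi(B_j)]}$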
Subsection 2
Adaptive sampling
In adaptive sampling, when (Thompson, [11])
Used in official statistics to measure the number of drug users or HIV-positive people
Sampling design often compared to the video game “Minesweeper”
Image from [12]
Once a unit bearing the characteristic of interest (i.e. a user who tweeted about the Star Wars trailer) is found, its whole network (i.e. its friends, friends of friends, etc. who have tweeted about Star Wars) is included in the sample.
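The expansion step can be sketched as a breadth-first search that only continues through users bearing the characteristic. This is a simplified illustration rather than the procedure actually run against the Twitter API: the graph, the `neighbors` function and the `has_trait` function are hypothetical stand-ins.

```python
from collections import deque


def adaptive_expand(initial_sample, neighbors, has_trait):
    """Grow the sample from every initial unit that bears the trait.

    Neighbors of a trait-bearing unit are observed and added to the sample;
    the search only continues from neighbors that also bear the trait.
    """
    sample = set(initial_sample)
    queue = deque(u for u in initial_sample if has_trait(u))
    while queue:
        u = queue.popleft()
        for v in neighbors(u):
            if v not in sample:
                sample.add(v)
                if has_trait(v):
                    queue.append(v)
    return sample


# Hypothetical toy graph (adjacency lists) and set of users who tweeted.
graph = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4], 6: [7], 7: [6]}
tweeted = {1, 2, 4, 7}
final_sample = adaptive_expand(
    initial_sample={1, 6},
    neighbors=lambda u: graph[u],
    has_trait=lambda u: u in tweeted,
)
print(sorted(final_sample))
```

User 7 is never reached in this toy example because user 6, the only initial unit next to it, did not tweet. The final sample size depends on the size of the networks met along the way, so it cannot be fixed in advance.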
Estimator :
$\hat{N}_{C4} = \sum_{k=1}^{K} \frac{n^*_{Ck} J_k}{\pi_{gk}}$
where :
$K$ = number of networks
$y^*_k$ = total of $Y$ in the network $k$
$n^*_{Ck}$ = number of people with $y_k \geq 1$ in the network $k$
$J_k = \mathbb{1}\{k \in C\}$
$\pi_{gk}$ = probability that the initial sample intersects $k$
When using an adaptive design, it is often better to use the Rao-Blackwellized version of the previous estimator. It has a very simple closed form in the case of the stratified adaptive design.
$\hat{N}_{C5} = n_0 + \sum_{r=1}^{K} \frac{n_r}{1 - (1 - p)^{n_r}}$
where : $n_0 = \#s_0$ and $s_0 = \cup_r \{k \in s, \delta(k, C) = 1\}$ is the union of the edges of $C$.
Section 4
Results and future work
Subsection 1
Results
Design      n       n_scope  n_0   N_C (est.)  CV (est.)  Deff (est.)
Bernoulli   20013   3946     -     354121      0.231      1.04
Stratified  20094   9832     -     316889      0.097      0.68
1-snowball  159957  73570    1000  331097      0.031      0.60
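To read the CV column, the estimated standard error of each estimate is CV times the estimate. The sketch below turns the three rows of the table into approximate 95% intervals. The normal approximation and the 1.96 multiplier are assumptions made here for illustration and are not stated in the slides.

```python
def interval_from_cv(estimate, cv, z=1.96):
    """Approximate interval: estimate +/- z * (cv * estimate)."""
    standard_error = cv * estimate
    return estimate - z * standard_error, estimate + z * standard_error


# (estimated N_C, estimated CV) for each design, copied from the table.
results = {
    "Bernoulli": (354121, 0.231),
    "Stratified": (316889, 0.097),
    "1-snowball": (331097, 0.031),
}
intervals = {design: interval_from_cv(est, cv) for design, (est, cv) in results.items()}
for design, (low, high) in intervals.items():
    print(f"{design}: [{low:,.0f}, {high:,.0f}]")
```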
Mean number of tweets @StarWars per user : 1.18 ± 0.07
Suggests that bots are not responsible for this very large number of tweets (see [5], [4]) !
Subsection 2
Sample size
Snowball sampling - sample size
Expected sample size ≈ 20000.
Actual sample size : > 150000 !
Adaptive sampling
With our test subject (tweets @AmericanIdol), average network size was no greater than a few units (≈ 10000 tweets in the scope)
With Star Wars (≈ 300000 tweets in the scope, with far fewer tweets per person), we couldn’t get to the end of every network !
Subsection 3
Future work
Control sample size
Estimates and calibration on graph totals (centrality, clustering coefficients, path length, etc.)
Conclusion
Thank you !
http://nc233.com/cmstatistics2015
@nc233
References
[1] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
[2] Paul Burkhardt and Chris Waring. An NSA big graph experiment. Presentation at the Carnegie Mellon University SDI/ISTC Seminar, Pittsburgh, PA, 2013.
[3] Jean-Claude Deville and Carl-Erik Särndal. Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418):376–382, 1992.
[4] Emilio Ferrara. “Manipulation and abuse on social media” by Emilio Ferrara with Ching-man Au Yeung as coordinator. SIGWEB Newsl., (Spring):4:1–4:9, April 2015.
[5] Emilio Ferrara, Onur Varol, Clayton Davis, Filippo Menczer, and Alessandro Flammini. The rise of social bots. arXiv preprint arXiv:1407.5225, 2014.
[6] Eric D. Kolaczyk. Statistical analysis of network data. Springer, 2009.
[7] Seth A. Myers, Aneesh Sharma, Pankaj Gupta, and Jimmy Lin. Information network or social network?: the structure of the Twitter follow graph. In Proceedings of the companion publication of the 23rd international conference on World Wide Web companion, pages 493–498. International World Wide Web Conferences Steering Committee, 2014.
[8] Jerzy Neyman. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, pages 558–625, 1934.
[9] Art B. Owen. Empirical likelihood. CRC Press, 2010.
[10] Olivier Sautory. Les enjeux méthodologiques liés à l’usage de bases de sondage imparfaites.
[11] Steven K. Thompson. Adaptive cluster sampling. Journal of the American Statistical Association, 85(412):1050–1059, 1990.
[12] Steven K. Thompson. Stratified adaptive cluster sampling. Biometrika, pages 389–397, 1991.
[13] Duncan J. Watts and Steven H. Strogatz. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.