Master’s Thesis
Analysis on Information Spreading as Recorded in Twittersphere
Park, Hosung (박호성, 朴鎬成)
Department of Computer Science
KAIST
2011
Analysis on Information Spreading as Recorded in Twittersphere
Advisor : Professor Moon, Sue Bok
by
Park, Hosung
Department of Computer Science
KAIST
A thesis submitted to the faculty of KAIST in partial fulfillment
of the requirements for the degree of Master of Science in Engineering
in the Department of Computer Science. The study was conducted in
accordance with Code of Research Ethics1.
2010. 12. 16.
Approved by
Professor Moon, Sue Bok
[Advisor]
1 Declaration of Ethical Conduct in Research: I, as a graduate student of KAIST, hereby declare that
I have not committed any acts that may damage the credibility of my research. These include, but are
not limited to: falsification, thesis written by someone else, distortion of research findings or plagiarism.
I affirm that my thesis contains honest conclusions based on my own careful research under the guidance
of my thesis advisor.
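Analysis on Information Spreading as Recorded in Twittersphere
Park, Hosung
This thesis has been examined and approved by the thesis committee
as a Master's thesis of KAIST.
December 16, 2010
Committee Chair: Moon, Sue Bok (문수복) (seal)
Committee Member: Whang, Kyu-Young (황규영) (seal)
Committee Member: Oh, Hae Yeon (오혜연) (seal)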
MCS
20093227
박호성. Park, Hosung. Analysis on Information Spreading as Recorded in Twittersphere.
트위터소셜네트워크에서의정보전파분석. Department of Computer Science. 2011.
22p. Advisor Prof. Moon, Sue Bok. Text in English.
ABSTRACT
Twitter offers an unprecedented opportunity for the study of information spreading in human
society, as all actions and the underlying social network can be recorded. In this thesis, we present an
empirical study of information spreading phenomena in Twitter. We collected 32 million photo URLs
posted on Twitter via Twitpic and the 52 million related tweets from April 2010 to June 2010. A Twitpic
link guarantees that the source of information is unique and is a Twitter user. Thus, Twitpic links
eliminate the chance of multiple source users bringing the same information into the Twittersphere. We
analyze information creation and spreading by tracking Twitpic URL links. Microscopic characteristics
of information diffusion at the level of individual pieces of information are not yet well explored. We
analyze temporal and topological characteristics of microscopic information spreading using the
reconstructed diffusion trees of Korean users. We find that diffusion in Twitter is very fast and
produces wide and shallow diffusion trees. We present an information diffusion model that extends the
Independent Cascade Model by considering the response time of users. We show that spreading
probabilities are influenced by the type of information and the directness to the original source. To
the best of our knowledge, this work is the first study of the characteristics of microscopic
information diffusion in Twitter.
Contents
Abstract
Contents
List of Tables
List of Figures
Chapter 1. Introduction
Chapter 2. Background and Related Work
Chapter 3. Basic Analysis
3.1 Data Collections
3.2 Daily Patterns of Twitpic Links
3.3 Creation and Spreading of Photos
Chapter 4. Microscopic Patterns of Information Diffusion in Twitter
4.1 Temporal Analysis
4.2 Topological Analysis
Chapter 5. Information Diffusion Model of Twitter
5.1 Process of the Model
5.2 Characteristics of Diffusion Probabilities and Response Time
Chapter 6. Discussion
Chapter 7. Conclusions
Summary (in Korean)
References
List of Tables
4.1 Statistics of diffusion trees ≥ 10 nodes
4.2 Statistics of diffusion trees ≥ 100 nodes
5.1 Diffusion probabilities for each type
6.1 Correlation coefficients for group L (followers ≥ 150) and group S (10 < followers < 150)
List of Figures
3.1 Daily behavior of creation and spreading of information
3.2 CCDF of the number of photo creations for users
3.3 PDF of the interval between photo uploads
3.4 CCDF of the number of tweets for each link
3.5 CDF of spreading duration in days
3.6 The median duration in days over the number of spreading tweets
4.1 Examples of diffusion trees with the same number of nodes
4.2 CCDF of proportion of spread within 24 hours
4.3 CDF of max and median timestamps of tweets for each diffusion tree
4.4 Mean depth of tweets with timestamps in days
4.5 CDF of proportion of source contribution
4.6 Cascade size over proportion of source contribution
5.1 CDF of diffusion probabilities of direct diffusion from the source node
5.2 CDF of diffusion probabilities of indirect diffusion from non-source nodes
5.3 CDF of response time in hours
6.1 CDF of clustering coefficient of Korean users
Chapter 1. Introduction
Various online social network services (OSNs) allow users to communicate with each other and help
users acquire information. An increasing number of people use OSNs not only for social interactions
but also for propagating and obtaining information, as OSNs have recently taken on the role of news
media [14]. Information or an idea is created at a source and then spreads over a social network as
users forward it to other users. This cascading phenomenon is common in OSNs.
There is a large literature on information spreading. Measurement and analysis studies have examined
how information spreads in chain letters, the blogosphere, social games, Flickr, Facebook, Digg, Twitter
and so on. Emerging OSNs allow researchers to track massive and complete user behavior data. One of
these, Twitter, offers an unprecedented opportunity for the study of information spreading in human
society, as all actions and the underlying social network can be recorded.
In this thesis, we present an empirical study of how information spreads on Twitter using Twitpic links.
We trace Twitpic URL links to capture information propagation. Twitpic provides a unique URL for
each uploaded photo, which guarantees that the source of information diffusion is a single Twitter
user. This differs from the diffusion of links to news articles or blog posts, which may have multiple
sources of information. Multiple sources bring redundant information into the Twittersphere, making it
hard to identify microscopic patterns of information diffusion at the level of individual pieces of
information. Because of this difficulty, recent research concentrates on macroscopic trends of diffusion
for aggregate topics. In contrast, we concentrate on microscopic trends of information diffusion in
Twitter using Twitpic links.
Our key questions are "How fast and how broadly does information diffuse?" and "What are the
microscopic information spreading patterns in Twitter?" To answer these questions, we collected 32
million photo URLs posted on Twitter via Twitpic and the 52 million related tweets from April 2010 to
June 2010. We also collected the social graph of Korean users to examine information diffusion paths by
inferring information diffusion trees.
We present an empirical analysis of information creation and spreading. Users tend to create photos
periodically, but they spread information without regularity. Only 1% of information items become
popular, gathering 10 tweets or more. We analyze temporal and topological characteristics of
microscopic information spreading using the inferred diffusion trees of Korean users. In a diffusion
tree, each node corresponds to a forwarding tweet and each edge corresponds to a diffusion path.
99.88% of diffusion trees have a median timestamp within 7 days, implying that Twitter keeps topics
fresh. For topological analysis, we measure cascade size, maximum depth, median depth, width of
tree, single-child edge fraction and volume contribution of source nodes. The statistics show that
information diffusion trees are wide and shallow.
We consider the Independent Cascade Model (ICM) a suitable model for Twitter and present an
information diffusion model that extends ICM with the response time of users. We show that
different types of information have different spreading probabilities. We divide information into
‘General’, ‘Promotion’ and ‘Request’ types. Promotion information, which offers a reward for
spreading, has the largest spreading probability for direct transfer from the source. Direct transfer
from the original source has a larger spreading probability than indirect transfer from other nodes.
We also measure the response time of users. The median response time is 26.37 minutes.
This thesis is organized as follows. Chapter 2 describes background and related work. Chapter 3
presents a basic analysis of the creation and spreading of information. Chapter 4 studies temporal
and topological diffusion patterns in Twitter. Chapter 5 covers the information diffusion model of
Twitter. Chapter 6 discusses the effective spreading structure of the social network. Chapter 7 concludes.
Chapter 2. Background and Related Work
Twitter (http://twitter.com) is a popular online social networking site. Twitter users can post and
read short text messages of at most 140 characters, called tweets. Users ‘follow’ other users to
subscribe to their tweets. The following relationship is directed, and only 22.1% of social
edges are reciprocal [14]. Twitter provides the ‘Retweet’ feature to allow users to forward information to
others easily. If a user retweets a tweet that he/she received, his/her followers can also read that
tweet. Once retweeted, a tweet gets retweeted almost instantly on the next hops. General users,
celebrities, mass media and enterprises use Twitter for social interaction and as an information
channel. Kwak et al. [14] showed that Twitter has the characteristics of both a social networking
service and news media.
There is a large literature on social network analysis [14, 15, 3, 1, 2, 16, 17, 12, 5].
Measurement and analysis studies have examined how information spreads on social networks, covering
various types of information and social networks. Gruhl et al. [12] studied information diffusion in
blogspace and presented the “chatter” and “spike” characterization of topic propagation. Leskovec et
al. [17] considered information propagation as recommendation cascades. They studied recommendation
and purchase propagation with a large on-line retailer dataset and described cascade patterns through
frequent cascade subgraphs and the size distribution of cascades. Leskovec et al. [16] analyzed temporal
and structural patterns of information propagation in blogspace. McGlohon et al. [19] clustered blogs
into ‘humor’ and ‘conservative’ blogs by structural cascade types and showed that the temporal activity
of blogs is bursty. Bakshy et al. [1] viewed gesture adoption in Second Life as information propagation.
Cha et al. [2, 3] analyzed the online media YouTube and Flickr. Emerging OSNs allow researchers to
track massive and complete user behavior data. Kwak et al. [14] studied the entire Twittersphere.
Lerman et al. [15] studied the spread of information on Digg and Twitter.
However, the microscopic information spreading behavior of Twitter at the level of individual pieces
of information is not yet well explored. Our work concentrates on the microscopic information diffusion
patterns in Twitter, whereas previous research analyzes macroscopic characteristics such as the
diffusion of trending topics. We track Twitpic URL links to capture the temporal and topological
characteristics of information spreading for each link.
There have been many efforts to define information spreading models in social networks [10, 8, 9, 4,
22, 13]. The Independent Cascade Model (ICM) and the Linear Threshold Model (LTM) are two basic
diffusion models that have been considered in previous studies. These models are explained in Chapter 5.
Kempe et al. [13] find influential nodes that maximize the spread of influence under the given models.
Watts et al. [22] studied information diffusion based on simulations with the threshold model. Goyal
et al. [10] studied a method to set the probability parameters of the model. Gomez et al. [8] inferred
the underlying diffusion networks when only timestamps of diffusion are provided. Gotz et al. [9]
proposed a zero-crossing model for each individual blog to reproduce temporal and topological
characteristics of the blogosphere.
We consider ICM a suitable model for Twitter and present an extension of ICM that takes the response
time of users into account. We show that spreading probabilities are influenced by the type of
information and the directness to the original source.
Chapter 3. Basic Analysis
In this chapter, we present a basic analysis of the creation and spreading of information.
3.1 Data Collections
Twitpic (http://twitpic.com) is a third-party service for Twitter that allows users to share their
photos easily on Twitter. As of August 2010, Twitpic has about 110 million visitors a month [21]. Users
upload their photos to Twitpic from mobile devices or desktops. Each uploaded photo receives a unique
URL from Twitpic, and users include these URLs in tweets to share their photos. We use Twitpic URLs
to track the spread of information in Twitter.
We use the Twitter Streaming API [20] to crawl Twitpic URLs on Twitter. For the basic analysis in
this chapter, we collected traces for 3 months, from April 2010 to June 2010. We collected 52,696,184
tweets containing 32,053,742 unique photos. We also collected snapshots of the social graph of Korean
Twitter users for the analysis of diffusion patterns. There were about 640 thousand Korean users as of
June 30th.
3.2 Daily Patterns of Twitpic Links
Figure 3.1 shows the daily number of created photos and tweets. The x-axis represents a timeline
in days. The red boxes show the number of tweets that upload photos, creating information in the
Twittersphere. The green line shows the total number of tweets, covering both creation and spreading
of photos. The blue line is the difference between the green line and the red boxes, that is, the
number of spreading tweets.
Information creation has a period of 7 days, with peaks on weekends. Users may create many
photos on weekends because they have more spare time and special events then. In contrast,
spreading behavior has no obvious period. This implies that spreading can occur at any time,
because it only requires access to Twitter.
3.3 Creation and Spreading of Photos
How many users create photos over the 3 months? The number of users who create photos is 3,627,782.
The average number of photo uploads per user is 8.8 and the median is 3.
Figure 3.2 shows the complementary cumulative distribution function (CCDF) of the number of photo
creations per user in log scale. More than 90% of users upload fewer than 25 photos.
What is the interval between photo uploads of a user? Figure 3.3 shows the probability density
function (PDF) of the interval between photo uploads in log scale. The unit of the x-axis is seconds.
The distribution fluctuates after 24 hours. The vertical black lines are guide lines at whole days
(24 hours, 48 hours and so on). We observe that the peaks lie on the guide lines, implying regular
behavior: users may upload photos at a regular time of day. 90% of intervals are less than 8 days
and the median interval is 12.87 hours.
Figure 3.1: Daily behavior of creation and spreading of information
Figure 3.2: CCDF of the number of photo creations for users
Figure 3.3: PDF of the interval between photo uploads
We regard tweeting a tweet that contains a Twitpic URL link as spreading behavior. Figure 3.4
shows the CCDF of the number of tweets for each link in log scale. 81.6% of information items do not
spread at all, suggesting that most photos are of little interest to others. Only 1% of information
items are referred to 10 times or more. In this thesis, we focus on the diffusion of information that
is spread 10 times or more, to capture the diffusion of meaningful information.
Figure 3.5 shows the CDF of spreading duration in days. The spreading duration is the time difference
between the first tweet and the last tweet of a link. We only plot information items that have two
tweets or more. 90% of information spreading ends within just one day.
Figure 3.6 shows the median duration in days over the number of spreading tweets in log scale. The
median duration is proportional to the number of tweets up to 100 tweets. Massive spreading with
more than 100 tweets can occur over a wide range of durations.
Figure 3.4: CCDF of the number of tweets for each link
Figure 3.5: CDF of spreading duration in days
Figure 3.6: The median duration in days over the number of spreading tweets
Chapter 4. Microscopic Patterns of Information Diffusion in Twitter
We reconstruct information diffusion trees in temporal order to investigate microscopic information
diffusion patterns. Reconstruction requires some inference, for the following reasons.
First, even though Twitter provides the ‘Retweet’ feature to forward tweets to others, some users
spread information as a normal tweet that does not name the user from whom they received the
information. Second, since the introduction of the official ‘Retweet’ button in the Twitter interface,
the forwarder parsed from a retweet message can be incorrect: when a message is retweeted with
the retweet button, the original source of the message appears as the forwarder in the retweet,
regardless of who actually forwarded it.
The inference rules for finding the actual forwarder of a tweet by user A are as follows.
• Find the set U of users whom A follows (A's followees) and who tweeted the link before A's tweet
appeared.
• If A's tweet is a retweet and the forwarder F parsed from the retweet text is in U, set the actual
forwarder of A to F.
• Otherwise, set the actual forwarder of A to the first user in U to tweet.
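The three rules above can be sketched in Python as follows. This is a minimal illustration, not the code used in the thesis: the function name `infer_forwarder`, its parameters, and the representation of earlier tweets as `(user, timestamp)` pairs are assumptions made for the example. Returning `None` when U is empty is also an assumption; the text does not state how such tweets are handled.

```python
from typing import List, Optional, Set, Tuple


def infer_forwarder(
    tweet_time: float,
    followees: Set[str],
    earlier_tweets: List[Tuple[str, float]],
    parsed_forwarder: Optional[str] = None,
) -> Optional[str]:
    """Infer who forwarded a link to user A (hypothetical helper).

    tweet_time       -- timestamp of A's tweet
    followees        -- users whom A follows
    earlier_tweets   -- (user, timestamp) pairs of tweets carrying the same link
    parsed_forwarder -- forwarder parsed from the retweet text, if A's tweet is a retweet
    """
    # Rule 1: U = followees of A who tweeted the link before A's tweet.
    candidates = sorted(
        (t, u) for (u, t) in earlier_tweets if u in followees and t < tweet_time
    )
    if not candidates:
        return None  # assumption: no followee tweeted earlier, so no forwarder is inferred
    users_in_u = {u for _, u in candidates}
    # Rule 2: trust the parsed forwarder only if it belongs to U.
    if parsed_forwarder is not None and parsed_forwarder in users_in_u:
        return parsed_forwarder
    # Rule 3: otherwise pick the earliest-tweeting user in U.
    return candidates[0][1]
```

For example, if the parsed forwarder is the original source but A does not follow the source, rule 3 assigns the earliest-tweeting followee instead, which is the button-retweet case described above.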
We ignore replies because replies are not intended for spreading, and followers of the replier cannot
see a reply unless they follow both the replier and the replied-to user. We also eliminate loops and
multiple edges to obtain trees. Examples of reconstructed diffusion trees are shown in Figure 4.1. The
two trees have the same number of nodes but different shapes.
There may exist many diffusion trees for one piece of information. We focus on the diffusion tree that
stems from the original source of the photo, and we call it the primary tree. For primary trees having
10 edges or more, the median proportion of tweet volume contributed by the primary tree is 0.842,
which means that 84.2% of tweets are in the primary tree. For primary trees having 100 edges or more,
the median proportion is 0.696, which suggests that widely spread information is more likely to also
spread over external paths outside the Twitter social network. We only examine the primary trees to
study information diffusion phenomena on social networks.
4.1 Temporal Analysis
In this section, we investigate how fast information diffuses in Twitter and its temporal diffusion
patterns. We only consider the 6,728 diffusion trees of Korean users that have 10 edges or more.
Figure 4.2 shows the CCDF of the proportion of tweets that spread the information within 24 hours
after the source information appeared. More than 90% of diffusion trees have 80% of their tweets within
24 hours. This result shows that most information diffusion in Twitter takes place within a day, which
is very fast compared to other OSNs.
Figure 4.1: Examples of diffusion trees with the same number of nodes
Figure 4.2: CCDF of proportion of spread within 24 hours
Figure 4.3 shows the CDF of the max and median timestamps of tweets for each diffusion tree. Red
points stand for max timestamps and blue points for median timestamps. 90.84% of trees complete
diffusion within 7 days. 99.88% of trees have median timestamps within 7 days. These characteristics
allow Twitter to be a source of real-time information or news.
Figure 4.3: CDF of max and median timestamps of tweets for each diffusion tree
Figure 4.4 shows the mean depth of tweets in diffusion trees. Tweets are grouped by timestamp
in days. Tweets of the first day occur at shallower depth than tweets of other days, but there is not
much difference in mean depth among the tweets of the other days. This implies that slow diffusion of
information does not always involve many steps from the source, and that fast diffusion can also
involve many steps from the source.
4.2 Topological Analysis
We measure properties of diffusion trees having (T1) 10 nodes or more and (T2) 100 nodes or more.
The measured properties are cascade size, maximum depth, median depth, width of tree, single-child
edge fraction and volume contribution of source nodes. Here, the width of a tree is defined as the
maximum size of a set of nodes that lie at the same depth, and the single-child edge fraction is defined
as the fraction of nodes with exactly one child, as used in [18]. Statistics are shown in Tables 4.1
and 4.2. The median cascade size is 17 for (T1) and 174 for (T2). The median of the median depth is 1.5
for (T1) and 1.0 for (T2). The median width is 10.0 for (T1) and 125.0 for (T2). These metrics indicate
that the diffusion trees in Twitter are wide and shallow. The single-child edge fraction has a median of
0.125 for (T1) and 0.0439 for (T2). This fraction is very low, meaning that there are not many single
chains that lengthen the depth of trees without broadening the width. The median proportion of the
source's children over all nodes is 0.4375 for (T1) and 0.8104 for (T2). This means that the source
plays an important role in information diffusion.
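The tree properties defined above can be computed from a child-to-parent map, as in the following sketch. The function name `tree_metrics` and the dictionary layout are assumptions for illustration. The sketch also assumes that cascade size counts the source node and that depth statistics are taken over non-source nodes, which matches the minimum median depth of 1 reported in the tables but is not stated explicitly in the text.

```python
from collections import Counter
from statistics import median
from typing import Dict, Hashable


def tree_metrics(parent: Dict[Hashable, Hashable], source: Hashable) -> Dict[str, float]:
    """Shape statistics of one diffusion tree given child -> parent links."""
    if not parent:
        raise ValueError("tree must contain at least one non-source node")
    depth = {source: 0}
    for node in parent:
        # Walk up until a node with known depth is found, then fill depths back down.
        path = []
        cur = node
        while cur not in depth:
            path.append(cur)
            cur = parent[cur]
        d = depth[cur]
        for n in reversed(path):
            d += 1
            depth[n] = d

    n_nodes = len(depth)                    # cascade size (assumed to include the source)
    children = Counter(parent.values())     # number of children per node
    per_level = Counter(depth.values())     # number of nodes at each depth
    non_source_depths = [depth[n] for n in parent]
    return {
        "cascade_size": n_nodes,
        "max_depth": max(non_source_depths),
        "median_depth": median(non_source_depths),
        "width": max(per_level.values()),
        # fraction of nodes with exactly one child
        "single_child_fraction": sum(1 for c in children.values() if c == 1) / n_nodes,
        # proportion of the source's children over all nodes
        "source_contribution": children[source] / n_nodes,
    }
```

A tree with source `s`, three direct children `a`, `b`, `c`, and a chain `a → d → e` has width 3, maximum depth 3, and a source contribution of 3/6.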
Figure 4.4: Mean depth of tweets with timestamps in days
Figure 4.5: CDF of proportion of source contribution
Figure 4.6: Cascade size over proportion of source contribution
However, the median source volume contribution of 0.8104 for (T2) can be misleading, as shown in
Figures 4.5 and 4.6. Figure 4.5 shows the CDF of the proportion of source contribution. Black dots
represent trees having 10 nodes or more (T1) and red dots represent trees having 100 nodes or more
(T2). For widely spread information (T2), even though the median is as high as 0.8104, the proportion
of source contribution splits into two sides: large source contribution and small source contribution.
This implies that widely spread information needs either an influential source node or hubs on the
diffusion path. Figure 4.6 shows cascade size over the proportion of source contribution. It also
confirms that widely spread information results from either large source contribution or small source
contribution.
≥ 10 nodes                   Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
cascade size                 10.00     12.00     17.00     36.88     31.00     991.00
max. depth                   1.00      2.00      3.00      3.24      4.00      16.00
median depth                 1.000     1.000     1.500     1.654     2.000     7.000
width                        2.00      7.00      10.00     23.45     17.00     954.00
single-child edge fraction   0.00000   0.05882   0.12500   0.13640   0.20000   0.66670
source contribution          0.001092  0.214300  0.437500  0.476700  0.750000  0.997100
Table 4.1: Statistics of diffusion trees ≥ 10 nodes
≥ 100 nodes                  Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
cascade size                 100.0     131.0     174.0     226.3     265.0     991.0
max. depth                   1.000     2.000     3.000     4.216     6.000     16.000
median depth                 1.000     1.000     1.000     1.776     2.000     7.000
width                        19.0      67.0      125.0     164.7     213.0     954.0
single-child edge fraction   0.00000   0.01629   0.04390   0.08019   0.14790   0.24760
source contribution          0.001092  0.117200  0.810400  0.583000  0.960700  0.997100
Table 4.2: Statistics of diffusion trees ≥ 100 nodes
Chapter 5. Information Diffusion Model of Twitter
In this chapter, we present an information diffusion model for Twitter and the characteristics of
information diffusion that the model relies on.
The Independent Cascade and Linear Threshold models are two basic diffusion models for
information diffusion processes that have been considered in previous studies.
The Independent Cascade Model (ICM) [7] starts with a set of active nodes A0. When a node
s becomes active, it has exactly one chance to activate each inactive neighbor t, with probability
p(s, t). Whether or not the activation succeeds, s cannot attempt to activate t again. Newly activated
nodes have a chance to activate their neighbors in the next step. This process runs in discrete steps
until no more activations are possible.
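The discrete-step process described above can be sketched as follows. This is a generic illustration of ICM, not code from the thesis; the function name `independent_cascade`, the adjacency dictionary `neighbors`, and the callable `prob` (standing in for p(s, t)) are assumptions for the example.

```python
import random
from typing import Callable, Dict, Hashable, Iterable, List, Set


def independent_cascade(
    neighbors: Dict[Hashable, List[Hashable]],
    prob: Callable[[Hashable, Hashable], float],
    seeds: Iterable[Hashable],
    rng: random.Random,
) -> Set[Hashable]:
    """Run one ICM cascade from the seed set A0 and return all activated nodes."""
    frontier = list(seeds)
    active = set(frontier)
    while frontier:                      # one loop iteration = one discrete step
        next_frontier = []
        for s in frontier:               # s was activated in the previous step
            for t in neighbors.get(s, ()):
                # s gets exactly one attempt on each still-inactive neighbor t
                if t not in active and rng.random() < prob(s, t):
                    active.add(t)
                    next_frontier.append(t)
        frontier = next_frontier         # only newly activated nodes try next
    return active
```

Because each node enters the frontier only once, every pair (s, t) is attempted at most once, which is the defining property of the model.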
The Linear Threshold Model (LTM) [11] has a threshold $t_n \in [0, 1]$ for each node $n$. A node $n$
is influenced by each neighbor $m$ with weight $w(n, m)$, where $\sum_{m} w(n, m) \le 1$. This process
also starts with a set of active nodes A0 and continues in discrete steps. At each step, a node $n$ is
activated if $t_n \le \sum_{\text{activated } m} w(n, m)$.
LTM is not suitable for our diffusion modeling because Twitter users cannot see the redundant
retweets of the same tweet that are made with the retweet button. Instead, they only see the first
such retweet in their timelines. 61.4% of users receive only one tweet, and more than 90% of users
receive no more than 5 tweets, for the same link. Thus, we extend ICM to obtain the information
diffusion model for Twitter.
5.1 Process of the Model
Unlike ICM, our model has a response time for each edge. Response time is the time elapsed
between a user receiving and forwarding information. In the real world, a depth-1 edge can be created
later than a depth-3 edge when the user at depth 1 discovers the information later than the user at
depth 3. ICM cannot reflect this phenomenon, because it always creates deeper edges later than
shallower edges. We therefore take response time into account in edge creation. In our model, there is
only one active starting node A0 because the source of a Twitpic link is unique. Activation in this
model corresponds to forwarding the information to others. We assume that an activated node cannot
return to the inactive state. Information diffuses from an active node s to an inactive neighbor t with
probability p(s, t). If the activation succeeds, a new edge (s, t) is created in the diffusion tree.
When a new edge is created, we draw a response time from the response time distribution shown later
and attach the resulting timestamp to the edge. The process runs in time ticks. Activated nodes whose
timestamps are earlier than the current time tick get a chance to activate their neighbors. The process
runs until the time tick reaches the selected time limit. If a node has multiple incoming edges, only
the first edge is given a chance to activate the node.
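One possible reading of this process is sketched below. It is not the thesis implementation: instead of fixed time ticks it uses an event queue ordered by activation timestamp, which processes nodes in the same temporal order. The names `simulate_timed_cascade`, `followers`, `prob` and `sample_response_time` are hypothetical, and the interpretation that a node receives only one activation attempt (from its earliest-tweeting neighbor) is an assumption drawn from the last sentence above.

```python
import heapq
import random
from typing import Callable, Dict, Hashable, List, Tuple


def simulate_timed_cascade(
    followers: Dict[Hashable, List[Hashable]],
    prob: Callable[[Hashable, Hashable], float],
    sample_response_time: Callable[[random.Random], float],
    source: Hashable,
    time_limit: float,
    rng: random.Random,
) -> List[Tuple[Hashable, Hashable, float]]:
    """Return diffusion-tree edges (s, t, timestamp) created up to time_limit."""
    exposed = {source}              # nodes that already had their single chance
    edges = []
    queue = [(0.0, 0, source)]      # (activation time, tie-breaker, node)
    counter = 1                     # tie-breaker so node objects are never compared
    while queue:
        now, _, s = heapq.heappop(queue)
        for t in followers.get(s, ()):
            if t in exposed:
                continue            # only the first incoming edge may activate t
            exposed.add(t)
            if rng.random() < prob(s, t):
                ts = now + sample_response_time(rng)   # t forwards after a delay
                if ts <= time_limit:
                    edges.append((s, t, ts))
                    heapq.heappush(queue, (ts, counter, t))
                    counter += 1
    return edges
```

With a sampled delay per edge, a follower of the source who responds slowly can receive a later timestamp than a node three hops away, which plain ICM cannot express.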
In the next section, we show the characteristics of diffusion probabilities and response time in
Twitter that the model uses.
5.2 Characteristics of Diffusion Probabilities and Response Time
Diffusion Probabilities in Twitter
We measure diffusion probabilities with respect to the type of information and the directness to the
source node. We manually divide the information in Twitpic into three types: (1) general information,
(2) promotion information and (3) request information. General information is ordinary content such as
street scenes, portraits, humorous images and so on. Promotion information gives users a reward for
spreading it. For example, enterprises launch promotional campaigns for new products on
Twitter, giving the new product to users who retweet the campaign. Request information does not
offer a reward for spreading, but the tweet asks readers to spread it. A search for a missing
child is one example of this type. We also consider whether a node receives the information directly
from the source node or not. For each case, we calculate the mean diffusion probability
$p = \frac{\text{sum of all activated followers of all nodes}}{\text{sum of all followers of all nodes}}$.
Table 5.1 shows these diffusion probabilities.
information types        direct to src   indirect to src
General Information      0.00107         0.00083
Promotion Information    0.01382         0.00103
Request Information      0.00321         0.00112
Table 5.1: Diffusion probabilities for each type
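The mean diffusion probability defined above is a pooled ratio over all nodes, which can be sketched as follows. The function name `mean_diffusion_probability` and the input layout, one `(followers, activated_followers)` pair per node, are assumptions for illustration, and the numbers in the usage note are made up.

```python
from typing import Iterable, Tuple


def mean_diffusion_probability(nodes: Iterable[Tuple[int, int]]) -> float:
    """Pooled ratio: total activated followers divided by total followers.

    nodes -- (number of followers, number of activated followers) per spreading node
    """
    total_followers = 0
    total_activated = 0
    for followers, activated in nodes:
        total_followers += followers
        total_activated += activated
    if total_followers == 0:
        raise ValueError("no followers to compute a probability over")
    return total_activated / total_followers
```

For three hypothetical nodes with (100, 1), (300, 3) and (600, 0), the pooled value is 4/1000 = 0.004, whereas averaging the per-node ratios would give about 0.0067. The pooled form weights nodes by their follower counts.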
Promotion information has the strongest spreading power for direct diffusion from the information
source. For every type, indirect spreading has less spreading power than direct spreading from the
source. This suggests that users tend to avoid spreading information that has already been forwarded
by someone else.
Figure 5.1 shows the CDF of diffusion probabilities of direct diffusion from the source node. Each
line represents a different type of information: red is general information, blue is promotion
information and green is request information. Figure 5.2 shows the CDF of diffusion probabilities of
indirect diffusion from non-source nodes. Compared to Figure 5.1, forwarded information has less
spreading power than information from the source. We confirm that different types of information have
different distributions of diffusion probabilities and that directness to the source node influences
the diffusion probabilities.
Response time of Twitter User
We measure the response time of users to obtain the response time distribution. The response time is
defined as the period between the time when a piece of information is transferred to user A and the
time when user A responds to it with a tweet.
Figure 5.3 shows the CDF of the response time with the x-axis in hours. Most responses occur within
a few hours, indicating that information diffuses quickly in Twitter. The median response time is 26.37
minutes.
Figure 5.1: CDF of diffusion probabilities of direct diffusion from the source node
Figure 5.2: CDF of diffusion probabilities of indirect diffusion from non-source nodes
Figure 5.3: CDF of response time in hours
Chapter 6. Discussion
In this chapter, we discuss the effective spreading structure of the social network. Social networks
exhibit clustering. The clustering coefficient [23] is a quantitative measure of this phenomenon.
The clustering coefficient C(s) for a node s is defined as follows. Let s be a node that has n
neighbours, or followers in Twitter. Then at most n(n − 1)/2 edges can exist between them. C(s) is the
fraction
$C(s) = \frac{\text{number of actually existing edges}}{\text{number of allowable edges}}$.
Intuitively, C(s) is the probability that friends of s are also friends of each other. The clustering
coefficient C for the whole network is the average of C(s) over all s.
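The definition of C(s) can be sketched as below. Because follow edges in Twitter are directed, the sketch takes the notion of "connected" as a parameter; the helper names `clustering_coefficient`, `reciprocal` and `either_direction`, and the representation of follow edges as a set of `(follower, followee)` pairs, are assumptions for illustration. Returning 0.0 for nodes with fewer than two neighbours is also an assumption, since the fraction is undefined there.

```python
from itertools import combinations
from typing import Callable, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def reciprocal(follows: Set[Edge]) -> Callable[[Hashable, Hashable], bool]:
    """Two users are connected only if they follow each other."""
    return lambda u, v: (u, v) in follows and (v, u) in follows


def either_direction(follows: Set[Edge]) -> Callable[[Hashable, Hashable], bool]:
    """Two users are connected if at least one follows the other."""
    return lambda u, v: (u, v) in follows or (v, u) in follows


def clustering_coefficient(
    neighbours: Iterable[Hashable],
    connected: Callable[[Hashable, Hashable], bool],
) -> float:
    """C(s): existing edges among the neighbours of s over the n(n-1)/2 allowable ones."""
    nbrs = list(neighbours)
    n = len(nbrs)
    if n < 2:
        return 0.0  # assumption: undefined case is reported as 0
    allowable = n * (n - 1) / 2
    existing = sum(1 for u, v in combinations(nbrs, 2) if connected(u, v))
    return existing / allowable
```

For neighbours `a`, `b`, `c` with follow edges a→b, b→a and b→c, the reciprocal definition gives 1/3 and the either-direction definition gives 2/3.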
Figure 6.1: CDF of clustering coefficient of Korean users
Figure 6.1 shows the CDF of the clustering coefficient of 457,168 Korean users who have fewer than
2000 followers and more than one follower. More than 99% of Korean users have fewer than 2000
followers. We ignore foreign followers in the calculation. Because edges in Twitter are directed, we plot
two definitions of a connected edge. (1) For the blue line, only reciprocal edges are considered connected
edges. (2) For the red line, both reciprocal and one-way edges are counted. The mean and median values
of the clustering coefficient are 0.278 and 0.179 for (1) and 0.332 and 0.238 for (2). On average, about
30% of the neighbors of a user are also neighbors of each other in Twitter.
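The two edge definitions can be made concrete with a small sketch on a directed follow graph. The representation (a dictionary mapping each account to the set of accounts it follows), the function names and the toy data are assumptions for illustration, and the neighbours of a user are taken to be its followers, following the definition given earlier in this chapter.

```python
from itertools import combinations


def followers_of(follows, s):
    """Accounts u such that u follows s."""
    return {u for u, targets in follows.items() if s in targets and u != s}


def is_connected(follows, u, v, reciprocal_only):
    """Decide whether the pair (u, v) counts as a connected edge."""
    u_to_v = v in follows.get(u, set())
    v_to_u = u in follows.get(v, set())
    if reciprocal_only:
        return u_to_v and v_to_u   # definition (1): reciprocal edges only
    return u_to_v or v_to_u        # definition (2): reciprocal or one-way


def follower_clustering(follows, s, reciprocal_only):
    """Clustering coefficient of s computed over its followers."""
    neighbours = followers_of(follows, s)
    n = len(neighbours)
    if n < 2:
        return 0.0  # assumed convention for nodes with fewer than two followers
    existing = sum(
        1
        for u, v in combinations(neighbours, 2)
        if is_connected(follows, u, v, reciprocal_only)
    )
    return existing / (n * (n - 1) / 2)


# Hypothetical data: b, c and d follow s; b and c follow each other; d follows b.
toy_follows = {"b": {"s", "c"}, "c": {"s", "b"}, "d": {"s", "b"}, "s": set()}
strict = follower_clustering(toy_follows, "s", reciprocal_only=True)
loose = follower_clustering(toy_follows, "s", reciprocal_only=False)
```

Definition (2) can never give a smaller value than definition (1) for the same user, which is consistent with the higher mean and median reported for (2).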
Dunbar’s number [6] is a limit on the number of people with whom one can maintain stable social
relationships. 150 is a commonly used value.
We divide users into two groups according to Dunbar’s number. The first group, L, has 150 or more
followers. The second group, S, has fewer than 150 and more than 10 followers. We ignore users who
have fewer than 10 followers.
For each of the groups L and S, Table 6.1 shows the correlation coefficient between (1) the number of
followers and the median number of followers of one-hop neighbors, (2) the number of followers and the
clustering coefficient, (3) the clustering coefficient and the median number of followers of one-hop
neighbors, and (4) the clustering coefficient and the mean clustering coefficient of one-hop neighbors.
Correlation                                                                       Group L   Group S
(1) # of followers, the median # of followers of one-hop neighbors                0.1094     0.0182
(2) # of followers, clustering coefficient                                        0.0263    -0.1585
(3) clustering coefficient, the median # of followers of one-hop neighbors        0.8536     0.3236
(4) clustering coefficient, the mean clustering coefficient of one-hop neighbors  0.6034     0.5091
Table 6.1: Correlation coefficients for group L (followers ≥ 150) and group S (10 < followers < 150)
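The grouping and the kind of coefficient reported in Table 6.1 can be sketched as below. The text does not state which correlation coefficient is used, so the Pearson coefficient here is an assumption, as are the record layout (dictionaries with `followers` and `clustering` keys), the function names and the toy values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def split_by_dunbar(users, threshold=150, floor=10):
    """Group L: followers >= threshold. Group S: floor < followers < threshold."""
    group_l = [u for u in users if u["followers"] >= threshold]
    group_s = [u for u in users if floor < u["followers"] < threshold]
    return group_l, group_s


# Hypothetical user records, not data from the thesis.
users = [
    {"followers": 5, "clustering": 0.9},
    {"followers": 20, "clustering": 0.4},
    {"followers": 100, "clustering": 0.2},
    {"followers": 149, "clustering": 0.3},
    {"followers": 150, "clustering": 0.1},
    {"followers": 300, "clustering": 0.2},
    {"followers": 1000, "clustering": 0.3},
]
group_l, group_s = split_by_dunbar(users)
# Row (2) of the table for group L: number of followers against clustering coefficient.
row2_l = pearson(
    [u["followers"] for u in group_l],
    [u["clustering"] for u in group_l],
)
```

Rows (1), (3) and (4) would use the same `pearson` call with per-user neighbor statistics (median follower count or mean clustering coefficient of one-hop neighbors) as the second sequence.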
Correlations (1) and (2) are weak or absent for both groups, implying that the number of followers
is not an important factor in the structure of the follower network. But there are strong correlations
in (3) and (4), although (3) is only moderate for group S (0.3236). Statistic (3) shows that the more
densely the neighbors are clustered, the more followers these neighbors have, which gives more chances
to spread information. The correlation for group L is very strong, with a coefficient of 0.8536. Statistic
(4) shows homophily among users: densely clustered users also have densely clustered one-hop neighbors.
In viral marketing, it is important to choose the initiating nodes so as to maximize marketing outcomes.
Table 6.1 gives hints for this problem. An ordinary user may have more densely clustered friends than an
enterprise user. When the two users have a similar number of followers, it is better for the enterprise
to ask an existing user who has densely clustered friends to initiate the campaign than to initiate the
campaign itself. This can be one characteristic of an effective spreading structure in a social network.
Chapter 7. Conclusions
In this thesis, we study information spreading phenomena in Twitter using Twitpic links. Tracking
Twitpic links guarantees that the source of each piece of information is a unique Twitter user, which
makes it easy to trace information diffusion in Twitter.
First, we present a basic analysis of the creation and spreading of information with 52 million tweets.
Twitter users create information periodically and spread information without regularity.
Second, we study microscopic characteristics of information diffusion at the level of individual pieces
of information with Korean users. Temporal and topological analysis shows that diffusion in Twitter is
fast and produces wide and shallow diffusion trees. A large spread of information is due to either the
power of the source, for example celebrities, or hubs in diffusion trees.
Third, we present an information diffusion model of Twitter, which is an extension of the independent
cascade model. This model considers the response time of users and runs until the selected time limit is
reached. We show that spreading probabilities are influenced by the type of information and directness
to the original source.
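The three ingredients named above (independent-cascade activation, user response time, and a time limit) can be combined as in the sketch below. It is a simplified illustration under stated assumptions, not the model as specified in the thesis: the function name, the two probabilities `p_direct` and `p_indirect`, the `sample_delay` callback and the toy graph are all hypothetical.

```python
import heapq
import random


def simulate_cascade(followers, source, p_direct, p_indirect,
                     sample_delay, time_limit, rng=None):
    """Return {user: activation_time} for one simulated cascade.

    followers maps a user to the users who receive that user's tweets.
    sample_delay(rng) returns a non-negative response time.
    Simplification (an assumption of this sketch): the first successful
    exposure fixes a user's activation time.
    """
    rng = rng or random.Random()
    activated = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        now, node = heapq.heappop(queue)
        # Directness: followers of the source use p_direct, all others p_indirect.
        p = p_direct if node == source else p_indirect
        for follower in followers.get(node, ()):
            if follower in activated:
                continue
            if rng.random() < p:
                respond_at = now + sample_delay(rng)
                if respond_at <= time_limit:  # stop spreading past the limit
                    activated[follower] = respond_at
                    heapq.heappush(queue, (respond_at, follower))
    return activated


# Hypothetical chain s -> a -> c -> d, plus s -> b, with a constant one-hour delay.
toy_followers = {"s": ["a", "b"], "a": ["c"], "c": ["d"]}
certain = simulate_cascade(toy_followers, "s", 1.0, 1.0,
                           lambda rng: 1.0, time_limit=2.0)
nothing = simulate_cascade(toy_followers, "s", 0.0, 0.0,
                           lambda rng: 1.0, time_limit=2.0)
```

With probability 1 and a two-hour limit the cascade reaches a, b and c but not d, whose response would arrive at hour 3. In practice `sample_delay` would draw from a measured response-time distribution, and the probabilities would depend on the type of information.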
We leave the comparison between the results for Twitpic data and general tweets to future work. Accurate
prediction of spreading and the characteristics of nodes that are effective in information spreading also
remain as future work. Our work sheds light on microscopic characteristics of information diffusion in
OSNs. This work can help viral marketers on OSNs plan campaigns and choose targets.
Summary
Analysis on Information Spreading as Recorded in Twittersphere
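Twitter offers an unprecedented opportunity to study information spreading in human society, as it
provides all user actions and the structure of the social network. In this thesis, we analyze information
spreading phenomena based on data provided by Twitter. We collected 32 million photo URL links
uploaded to Twitter through Twitpic from April 2010 to June 2010, together with 52 million related
tweets. Twitpic links guarantee that the source of each piece of information is a unique Twitter user,
which excludes a case that is ambiguous to trace: several users bringing the same information into
Twitter, each starting its own diffusion. We trace information spreading with these Twitpic links and
analyze the creation and delivery of information. We also analyze, temporally and topologically,
information spreading from the microscopic viewpoint of individual pieces of information, which has
been little studied in Twitter, using diffusion trees inferred from the social graph of Korean users.
Through this analysis we show that information spreading in Twitter is very fast and produces wide and
shallow diffusion trees. We also propose an information diffusion model for Twitter that extends the
Independent Cascade Model by considering the response time of users. We show that the diffusion
probability in Twitter is influenced by the type of information, classified by intent, and by whether the
information was received directly or indirectly from its source. This work contributes to revealing the
characteristics of information diffusion from a microscopic viewpoint in online social networks, and it
can be applied in viral marketing to find effective marketing targets or to predict information spreading.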
References
[1] E. Bakshy, B. Karrer, and L. Adamic. Social influence and the diffusion of user-created content. In
Proceedings of the tenth ACM conference on Electronic commerce, pages 325–334. ACM, 2009.
[2] M. Cha, H. Kwak, P. Rodriguez, Y. Ahn, and S. Moon. I tube, you tube, everybody tubes: analyzing
the world’s largest user generated content video system. In Proceedings of the 7th ACM SIGCOMM
conference on Internet measurement, pages 1–14. ACM, 2007.
[3] M. Cha, A. Mislove, and K. Gummadi. A measurement-driven analysis of information propagation
in the flickr social network. In Proceedings of the 18th international conference on World wide web,
pages 721–730. ACM, 2009.
[4] J. Cointet and C. Roth. How realistic should knowledge diffusion models be. Journal of Artificial
Societies and Social Simulation, 10(3):5, 2007.
[5] J. Cointet and C. Roth. Socio-semantic dynamics in a blog network. In Computational Science and
Engineering, 2009. CSE’09. International Conference on, volume 4, pages 114–121. IEEE, 2009.
[6] R. Dunbar. Grooming, gossip, and the evolution of language. Harvard Univ Pr, 1998.
[7] J. Goldenberg, B. Libai, and E. Muller. Talk of the network: A complex systems look at the
underlying process of word-of-mouth. Marketing Letters, 12(3):211–223, 2001.
[8] M. Gomez Rodriguez, J. Leskovec, and A. Krause. Inferring networks of diffusion and influence. In
Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data
mining, pages 1019–1028. ACM, 2010.
[9] M. Gotz, J. Leskovec, M. McGlohon, and C. Faloutsos. Modeling blog dynamics. In AAAI Conference
on Weblogs and Social Media, 2009.
[10] A. Goyal, F. Bonchi, and L. Lakshmanan. Learning influence probabilities in social networks. In
Proceedings of the third ACM international conference on Web search and data mining, pages 241–250.
ACM, 2010.
[11] M. Granovetter and R. Soong. Threshold models of diffusion and collective behavior. The Journal
of Mathematical Sociology, 9(3):165–179, 1983.
[12] D. Gruhl, R. Guha, D. Liben-Nowell, and A. Tomkins. Information diffusion through blogspace. In
Proceedings of the 13th international conference on World Wide Web, pages 491–501. ACM, 2004.
[13] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network.
In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and
data mining, pages 137–146. ACM, 2003.
[14] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In
Proceedings of the 19th international conference on World wide web, pages 591–600. ACM, 2010.
[15] K. Lerman and R. Ghosh. Information contagion: An empirical study of the spread of news on Digg
and Twitter social networks. 2010.
[16] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst. Cascading behavior in large
blog graphs: Patterns and a model. In Society of Applied and Industrial Mathematics: Data Mining
(SDM07), 2007.
[17] J. Leskovec, A. Singh, and J. Kleinberg. Patterns of influence in a recommendation network.
Advances in Knowledge Discovery and Data Mining, pages 380–389, 2006.
[18] D. Liben-Nowell and J. Kleinberg. Tracing information flow on a global scale using Internet
chain-letter data. Proceedings of the National Academy of Sciences, 105(12):4633, 2008.
[19] M. McGlohon, J. Leskovec, C. Faloutsos, M. Hurst, and N. Glance. Finding patterns in blog shapes
and blog evolution. In International Conference on Weblogs and Social Media. The AAAI Press,
2007.
[20] Twitter Streaming API. http://dev.twitter.com/pages/streaming api.
[21] Visitor of Twitpic. http://www.quantcast.com/twitpic.comp.
[22] D. Watts and P. Dodds. Influentials, networks, and public opinion formation. Journal of Consumer
Research, 34(4):441–458, 2007.
[23] D. Watts and S. Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393(6684):440–442,
1998.