Survey on Twitter Analysis

ABSTRACT

In this paper we introduce two aspects of research on Twitter: (1) the quantitative study of Twitter and (2) algorithms for finding influential twitterers. At the end of the paper, two applications based on Twitter are presented.

1. INTRODUCTION

An important characteristic of Twitter is its real-time nature: users can publish what they are doing or thinking right now.

Unlike on most online social networking sites, such as Facebook or MySpace, the relationship of following and being followed requires no reciprocation. A user can follow any other user, and the user being followed need not follow back. Being a follower on Twitter means that the user receives all the messages (called tweets) from those the user follows. The common practice of responding to a tweet has evolved into a well-defined markup culture: RT stands for retweet, ’@’ followed by a user identifier addresses the user, and ’#’ followed by a word represents a hashtag. This well-defined markup vocabulary, combined with a strict limit of 140 characters per posting, lets users express themselves with brevity.

2. Characteristics Analysis

2.1 Following and Followers Distribution

Figure 1 #following/followers

A directed network is constructed from the following and followed relationships. Figure 1 displays the distribution of the number of followings as the solid line and that of followers as the dotted line.

The y-axis represents the complementary cumulative distribution function (CCDF). The research first explains the distribution of the number of followings.

The dashed line in Figure 1 up to $x = 10^5$ fits a power-law distribution with an exponent of 2.276. Most real networks, including social networks, have a power-law exponent between 2 and 3. The data points beyond $x = 10^5$ represent users who have many more followers than the power-law distribution predicts. Similar tail behavior in the degree distribution has been reported for Cyworld in [1] but not for other social networks. The common characteristic of Twitter and Cyworld is that many celebrities are present and readily form online relations with their fans.

2.2 Reciprocity

Twitter shows a low level of reciprocity: 77.9% of user pairs with any link between them are connected one-way, and only 22.1% have a reciprocal relationship. The research calls the users in a reciprocal relationship with a user that user’s r-friends, as they reciprocate the user’s following. Previous studies have reported much higher reciprocity on other social networking services: 68% on Flickr [2] and 84% on Yahoo! 360 [3].

2.3 Degree of Separation

The concept of degrees of separation has become a key to understanding societal structure ever since Stanley Milgram’s famous ‘six degrees of separation’ experiment [4]. In his work he reports that any two people could be connected on average within six hops of each other. Research on the MSN messenger network reports that the median and the 90th percentile degrees of separation are 6 and 7.8, respectively [5].

Material for Rinko June 25th, 2010

Survey on Twitter Analysis Kitsuregawa, Toyoda Lab (D1) RenYong ID: 48-107411

To estimate the path-length distribution of Twitter, the same random sampling approach as in [1] is used. The median and the mode of the distribution are both 4, and the average path length is 4.12. The 90th percentile distance, known as the effective diameter, is 4.8. An average path length of 4.12 is quite short for a network of Twitter’s size.
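The sampling idea can be illustrated with a minimal Python sketch. It assumes seeds are drawn uniformly at random and that distance means the number of hops along directed "following" edges; the exact procedure of [1] may differ. The function names and the toy graph are hypothetical.

```python
import random
import statistics
from collections import deque


def bfs_distances(graph, source):
    """Hop counts from source to every node reachable along directed edges."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def sample_path_lengths(graph, num_seeds, rng=None):
    """Collect path lengths from randomly chosen seed users (assumed sampling scheme)."""
    rng = rng or random.Random(0)
    seeds = rng.sample(sorted(graph), min(num_seeds, len(graph)))
    lengths = []
    for seed in seeds:
        # Skip the zero-length path from the seed to itself.
        lengths.extend(d for d in bfs_distances(graph, seed).values() if d > 0)
    return lengths


# Toy graph: each key follows the users in its list.
toy = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
lengths = sample_path_lengths(toy, num_seeds=4)
print(statistics.mean(lengths), statistics.median(lengths))
```

On a real network the same lengths would also yield the mode and the 90th percentile; only the mean and median are shown here.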

3 Trending The Trends

3.1 Comparison with Trends in Other Media

To answer what topics are popular on Twitter, the research compares Twitter’s trending topics with those in other media, namely Google Trends and CNN headlines.

A search keyword and a trending topic are considered a match if the length of their longest common substring is more than 70% of either string. Only 126 (3.6%) of the 3,479 unique trending topics from Twitter exist among the 4,597 unique hot keywords from Google. Most of them are real-world events, celebrities, and movies.
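The matching rule can be sketched in Python as follows. This is only an illustration: it reads "more than 70% of either string" as "more than 70% of the shorter string", and it lowercases both strings before comparing; both choices are assumptions, as are the function names.

```python
def longest_common_substring_length(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def is_match(keyword, topic, threshold=0.7):
    """True if the common substring covers more than `threshold` of the shorter string."""
    if not keyword or not topic:
        return False
    common = longest_common_substring_length(keyword.lower(), topic.lower())
    return common > threshold * min(len(keyword), len(topic))


print(is_match("Air France", "air france flight"))
print(is_match("twitter", "google"))
```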

Figure 2 The age of the trending topics

The freshness of topics in Google Trends and Twitter trending topics is also compared. Figure 2 plots how many topics are fresh, a day old, a week old, or older. On average 95% of topics each day are new on Google, while only 72% of topics are new on Twitter. Interactions among users, e.g., retweets, replies, and mentions, are prevalent on Twitter, unlike in Google search, and such interactions might be a factor that keeps trending topics alive.

How close are trending topics to CNN Headline News in time and coverage? CNN Headline News items from the Twitter data collection period are collected, and a preliminary analysis is conducted.

For the subset of trending topics that were matched against CNN Headline News, CNN was ahead in reporting more than half the time. However, some news broke on Twitter before CNN; these stories are of a live-broadcast nature (e.g., sports matches and accidents).

3.2 User Participation in Trending Topics

How many topics does a user participate in on average? Out of 41 million Twitter users, a large number of users (8,262,545) participated in trending topics, and about 15% of those users participated in more than 10 topics during four months.

Figure 3 Cumulative fraction

A trending topic neither lasts forever nor dies never to come back. Figure 3 plots the CDF of the active periods and shows that 73% of topics have a single active period. About 15% of topics have 2 active periods and 5% have 3. Very few have more than 3 active periods.

Most of the active periods are a week or shorter. The corresponding figure shows that 31% of periods are 1 day long, and only 7% of periods are longer than 10 days. There are, however, a few long-lasting topics that have been active for more than two months.

This research applies the classification methodology of [6] to the number of tweets and their times, and classifies trending topic periods into the following four categories: exogenous subcritical, exogenous critical, endogenous subcritical, and endogenous critical.

Manual inspection of the topics that fall into the exogenous critical class reveals that they are mostly timely breaking news, which the research refers to as headline news.

Table 1 # of topics in each category

        Subcritical     Critical
Exo.    31.5% (1905)    54.3% (3290)
Endo.   6.9% (419)      7.3% (444)

The numbers and percentages of active periods in each class are shown in Table 1. The largest number falls into the exogenous critical class. This means Twitter users tend to talk about topics from headline news and respond to fresh news.

4 Impact of Retweet

On Twitter people acquire information not always directly from those they follow, but often via retweets. Assuming a tweet posted by a user is viewed and consumed by all of the user’s followers, the research counts the number of additional recipients who are not immediate followers of the original tweet owner. Figure 4 displays the average and median of this count per tweet against the number of followers of the original tweet user. The median lies almost always below the average, indicating that many tweets have a very large number of additional recipients. Up to about 1,000 followers, the average number of additional recipients is not affected by the number of followers of the tweet source. That is, no matter how many followers a user has, the tweet is likely to reach an audience of a certain size once it starts spreading via retweets. This illustrates the power of retweeting: the retweet mechanism gives every user the power to spread information broadly.
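As a rough illustration, the count of additional recipients for one tweet could be computed as in the Python sketch below. It assumes that every follower of a retweeter sees the retweet and that the author is not counted as a recipient; the function name and the `followers` mapping are hypothetical.

```python
def additional_recipients(author, retweeters, followers):
    """Users reached only via retweets.

    followers maps a user to the users who follow that user.
    The result excludes the author and the author's immediate followers.
    """
    direct = set(followers.get(author, ()))
    reached = set()
    for user in retweeters:
        reached.update(followers.get(user, ()))
    return reached - direct - {author}


followers = {
    "alice": ["bob", "carol"],
    "bob": ["carol", "dave", "alice"],
    "dave": ["erin"],
}
# alice tweets; bob retweets alice, then dave retweets bob.
extra = additional_recipients("alice", ["bob", "dave"], followers)
print(sorted(extra))
```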

Figure 4 Average and median numbers of additional recipients

4.1 Retweet Tree

To answer how far and deep retweets travel on Twitter, the research builds an information diffusion tree for every tweet that is retweeted and calls it a retweet tree. All retweet trees are subgraphs of the Twitter network.

The research illustrates all the retweet trees of the topic ‘air france flight’ in Figure 5. In every connected component, different colors represent different tweets. The forest of retweet trees has a large number of one-hop or two-hop chains. The research finds interesting retweet patterns such as repetitive retweet and cross-retweet: the former is repeatedly retweeting the same tweet, and the latter is two users retweeting each other.

Figure 5 Retweet trees of ‘air france flight’ tweets

Figure 6 plots the CCDFs of the retweet tree heights and of the number of users in a retweet tree. A height of 1 is the most common, accounting for 95.8% of trees, and no tree goes beyond 11 hops.

Figure 6 Height and participating users in retweet trees

4.2 Temporal Analysis of Retweet

Figure 7 Time lag between a retweet and the original tweet

Figure 7 plots the time lag from a tweet to its retweet. Half of all retweets occur within an hour, and 75% within a day. However, about 10% of retweets take place a month later.

Figure 8 Time lag between a retweet and the original tweet

Figure 8 plots the time lag between two nodes on a retweet tree. As most retweet trees are one hop deep, the time lag on the first hop is spread out, with the median at just under 1 hour and the inter-quartile range extending from a few minutes to more than a day. Interestingly, from the second hop on, retweets are much more responsive and occur basically back to back up to 5 hops away.

5 Influential Twitterers Finding Algorithms

5.1 Motivations

The benefit of finding influential twitterers is multifold. First, it potentially brings order to the real-time web, in that it allows search results to be sorted by the authority or influence of the contributing twitterers, giving a timely update of the thoughts of influential twitterers. Second, according to [7], Twitter is also a marketing platform. Targeting influential users will increase the efficiency of a marketing campaign.

Currently, a twitterer’s influence is often measured by her node in-degree in the network, i.e., the number of followers. However, as observed in previous social network analysis studies [8], in-degree does not accurately capture the notion of influence. PageRank improves over in-degree by considering the link structure of the whole network [8]. Nevertheless, PageRank ignores the interests of twitterers, which affect the way twitterers influence one another. Given this, an algorithm called TwitterRank is proposed.

Its framework is shown in Figure 9.

Figure 9 Framework of the Proposed Approach

First, the topics that twitterers are interested in are distilled automatically by analyzing the content of their tweets. Second, based on the topics distilled, topic-specific relationship networks among twitterers are constructed. Finally, influence is measured by taking both the topical similarity between twitterers and the link structure into account.

5.2 Topic-Distillation and Homophily among Twitterers

The goal of topic distillation is to automatically identify the topics that twitterers are interested in, based on the tweets they have published. For this purpose the Latent Dirichlet Allocation (LDA) model [9] is applied, which is an unsupervised machine learning technique for identifying latent topic information in a large document collection.

To distill the topics that twitterers are interested in using LDA, documents would naturally correspond to tweets. However, since the goal is to understand the topics that each twitterer is interested in rather than the topic that each single tweet is about, the research aggregates the tweets published by each individual twitterer into one big document. Thus, each document essentially corresponds to a twitterer.

The result is represented in three matrices:

1. DT, a $D \times T$ matrix, where $D$ is the number of twitterers and $T$ is the number of topics. $DT_{ij}$ contains the number of times a word in twitterer $s_i$’s tweets has been assigned to topic $t_j$.

2. WT, a $W \times T$ matrix, where $W$ is the number of unique words used in the tweets and $T$ is the number of topics. $WT_{ij}$ captures the number of times unique word $w_i$ has been assigned to topic $t_j$.

3. Z, a $1 \times N$ vector, where $N$ is the total number of words in the tweets. $Z_i$ is the topic assignment for word $w_i$.

Among the three matrices in the result of topic distillation, matrix DT is of particular interest. It contains the number of times a word in a twitterer’s tweets has been assigned to a particular topic. It can be row-normalized as $DT'$ such that $\|DT'_{i\cdot}\|_1 = 1$ for each row $DT'_{i\cdot}$. Each element $DT'_{ij}$ captures the probability that twitterer $s_i$ is interested in topic $t_j$.
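The row normalization of DT can be sketched in Python as below. The handling of a twitterer with no assigned words (a uniform distribution over topics) is an assumption added for the sketch, not something the source specifies; the function name is hypothetical.

```python
def row_normalize(dt):
    """Turn per-twitterer topic counts into per-twitterer topic probabilities."""
    normalized = []
    for row in dt:
        total = sum(row)
        if total == 0:
            # Assumption: a twitterer without any words gets a uniform distribution.
            normalized.append([1.0 / len(row)] * len(row))
        else:
            normalized.append([count / total for count in row])
    return normalized


# Two twitterers, three topics: word-to-topic assignment counts.
dt = [[2, 6, 2], [0, 0, 0]]
dt_norm = row_normalize(dt)
print(dt_norm)
```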

Given this, the topical difference between twitterers can be measured as follows:

Definition 1 The topical difference between two twitterers $s_i$ and $s_j$ can be calculated as:

$dist(i, j) = \sqrt{D_{JS}(i, j)}$ (1)

$D_{JS}(i, j)$ is the Jensen-Shannon Divergence between the two probability distributions $DT'_{i\cdot}$ and $DT'_{j\cdot}$, which is defined as:

$D_{JS}(i, j) = \frac{1}{2}\left(D_{KL}(DT'_{i\cdot} \,\|\, M) + D_{KL}(DT'_{j\cdot} \,\|\, M)\right)$ (2)

$M$ is the average of the two probability distributions, i.e. $M = \frac{1}{2}(DT'_{i\cdot} + DT'_{j\cdot})$. $D_{KL}$ in Eq. (2) is the Kullback-Leibler Divergence, which defines the divergence from distribution $Q$ to $P$ as:

$D_{KL}(P \,\|\, Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$

According to this definition of topical difference, the research finds:

(1) Twitterers with a following relationship are more similar than those without.

(2) Twitterers with a reciprocal following relationship are more similar than those without.

These two findings show that homophily does exist among twitterers.
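The topical difference of Definition 1 can be sketched in Python as follows. The sketch takes Eq. (1) to be the square root of the Jensen-Shannon divergence and uses base-2 logarithms so that the divergence lies between 0 and 1; the logarithm base is an assumption, since the text does not state it, and the function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q); terms with p(i) = 0 contribute 0.

    Requires q(i) > 0 wherever p(i) > 0, which holds when q is the mixture M.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))


def topical_difference(p, q):
    """dist(i, j), read here as the square root of the JS divergence."""
    # max() guards against tiny negative values from floating-point rounding.
    return math.sqrt(max(js_divergence(p, q), 0.0))


same = topical_difference([0.5, 0.5], [0.5, 0.5])
opposite = topical_difference([1.0, 0.0], [0.0, 1.0])
print(same, opposite)
```

Two twitterers with identical topic distributions get a difference of 0, and two with disjoint interests get the maximum difference under the base-2 convention.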

5.3 Topic-specific TwitterRank

First of all, a directed graph D(V, E) is formed with the twitterers and the “following” relationships among them. V is the vertex set, which contains all the twitterers. E is the edge set. There is an edge between two twitterers if there is a “following” relationship between them, and the edge is directed from follower to friend.

A random surfer model on graph D computes the TwitterRank as follows: the random surfer visits each twitterer with a certain probability by following the appropriate edge in D. TwitterRank differs from PageRank in that the random surfer performs a topic-specific random walk, i.e. the transition probability from one twitterer to another is topic-specific. By doing so, the research essentially constructs a topic-specific relationship network among twitterers.

The transition matrix for topic $t$, denoted as $P_t$, is defined as follows:

Definition 2 Given a topic $t$, each element of matrix $P_t$, i.e. the transition probability of the random surfer from follower $s_i$ to friend $s_j$, is defined as:

$P_t(i, j) = \frac{|T_j|}{\sum_{a:\, s_i \text{ follows } s_a} |T_a|} \times sim_t(i, j)$ (3)

$|T_j|$ is the number of tweets published by $s_j$, and $\sum_{a:\, s_i \text{ follows } s_a} |T_a|$ sums up the number of tweets published by all of $s_i$’s friends. $sim_t(i, j)$ is the similarity between $s_i$ and $s_j$ in topic $t$, which is defined as:

$sim_t(i, j) = 1 - |DT'_{it} - DT'_{jt}|$ (4)

This definition captures two notions. First, assume twitterer $s_i$ follows a number of friends. Those friends publish different numbers of tweets, all of which will be directly visible to $s_i$. The more a friend $s_j$ publishes, the higher the portion of the tweets $s_i$ reads that comes from $s_j$. Generally, this leads to a higher influence on $s_i$, which corresponds to a higher transition probability from $s_i$ to $s_j$. This intuition is captured in the first term on the RHS of Eq. (3). Figure 10 shows an example with three twitterers: $s_1$ follows $s_2$ and $s_3$, who publish 500 and 1000 tweets respectively. In this case, $s_3$’s influence on $s_1$ is twice that of $s_2$ when the topical similarity among the three twitterers is not taken into account.

Figure 10 Example of Transition Probability Calculation

Second, $s_j$’s influence on $s_i$ is also related to the topical similarity between the two, as suggested by the homophily phenomenon discussed in Section 5.2. The row-normalized matrix $DT'$ is one of the results of the topic distillation. A row $DT'_{i\cdot}$ contains the probability of twitterer $s_i$’s interest in different topics. The similarity between $s_i$ and $s_j$ in topic $t$ is evaluated from the difference between the probabilities that the two twitterers are interested in the same topic $t$, which is the second term on the RHS of Eq. (3). The more similar the two twitterers are, the higher the transition probability from $s_i$ to $s_j$.
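Both terms of the transition probability can be put together in a short Python sketch. It assumes the probability is 0 when there is no “following” edge, and it reproduces the three-twitterer example (500 versus 1000 tweets) with identical topic interests so that the similarity term equals 1. All names and data are hypothetical.

```python
def similarity(dt_norm, i, j, topic):
    """sim_t(i, j) = 1 - |DT'_it - DT'_jt|."""
    return 1.0 - abs(dt_norm[i][topic] - dt_norm[j][topic])


def transition_probability(i, j, topic, friends, tweet_counts, dt_norm):
    """P_t(i, j): share of friend j's tweets among all tweets of i's friends, times similarity."""
    if j not in friends[i]:
        return 0.0  # Assumption: no edge means no transition.
    total = sum(tweet_counts[a] for a in friends[i])
    if total == 0:
        return 0.0  # Assumption: friends who never tweet exert no influence.
    return tweet_counts[j] / total * similarity(dt_norm, i, j, topic)


# Twitterer 0 follows twitterers 1 and 2, who publish 500 and 1000 tweets.
friends = {0: [1, 2], 1: [], 2: []}
tweet_counts = [100, 500, 1000]
dt_norm = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # identical interests, so sim = 1

p_01 = transition_probability(0, 1, 0, friends, tweet_counts, dt_norm)
p_02 = transition_probability(0, 2, 0, friends, tweet_counts, dt_norm)
print(p_01, p_02)
```

With equal topical similarity, the friend who publishes twice as many tweets receives twice the transition probability.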

It is possible that some twitterers “follow” one another in a loop without “following” other twitterers outside the loop. Such a loop will accumulate high influence without distributing it. To tackle this, a teleportation vector $E_t$ is also introduced, which captures the probability that the random surfer “jumps” to some twitterer instead of following the edges of the graph D. $E_t$ is defined as follows:

Definition 3 The teleportation vector of the random surfer in topic $t$ is defined as:

$E_t = DT''_{\cdot t}$ (5)

$DT''_{\cdot t}$ is the $t$-th column of matrix $DT''$, which is the column-normalized form of matrix DT such that $\|DT''_{\cdot t}\|_1 = 1$. DT is one of the results obtained during the topic distillation.

With the transition probability matrix and the teleportation vector defined, the topic-specific TwitterRank can be calculated.

Definition 4 The topic-specific TwitterRank of the twitterers in topic $t$, denoted as $\vec{TR_t}$, can be calculated iteratively by:

$\vec{TR_t} = \gamma P_t \times \vec{TR_t} + (1 - \gamma) E_t$ (6)

$P_t$ is the transition probability matrix defined in Eq. (3), and $E_t$ is the teleportation vector defined in Eq. (5). $\gamma$ is a parameter between 0 and 1 that controls the probability of teleportation. The lower $\gamma$ is, the higher the probability that the random surfer will teleport to twitterers according to $E_t$, and vice versa.
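The iteration of Eq. (6) can be sketched in Python as a simple fixed-point loop. The sketch assumes that influence flows from follower $s_i$ to friend $s_j$, so entry `p_t[i][j]` contributes to the rank of `j`; the default `gamma=0.85`, the tolerance, and the iteration cap are illustrative choices rather than values taken from the text.

```python
def twitter_rank(p_t, e_t, gamma=0.85, tol=1e-10, max_iter=1000):
    """Iterate TR_t = gamma * P_t x TR_t + (1 - gamma) * E_t until it stops changing.

    p_t[i][j] is the transition probability from follower i to friend j.
    e_t is the teleportation vector for the topic.
    """
    n = len(e_t)
    rank = list(e_t)  # Start from the teleportation vector.
    for _ in range(max_iter):
        new_rank = [
            gamma * sum(p_t[i][j] * rank[i] for i in range(n))
            + (1.0 - gamma) * e_t[j]
            for j in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_rank, rank)) < tol:
            return new_rank
        rank = new_rank
    return rank


# Twitterer 0 follows 1 and 2; 2 publishes twice as many tweets as 1.
p_t = [
    [0.0, 1.0 / 3.0, 2.0 / 3.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
e_t = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
ranks = twitter_rank(p_t, e_t)
print(ranks)
```

In this toy network the twitterer who publishes more ends up with the highest rank, and the twitterer nobody follows keeps only the teleportation share.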

5.4 Aggregation of Topic-specific TwitterRank

The approach presented in the sections above generates a set of topic-specific TwitterRank vectors, which basically measure the twitterers’ influence in individual topics. An aggregation of TwitterRank can also be obtained to measure twitterers’ overall influence.

Definition 5 Twitterers’ general influence can be measured as an aggregation of the topic-specific TwitterRank in different topics, which is calculated as:

$\vec{TR} = \sum_t r_t \cdot \vec{TR_t}$

$\vec{TR_t}$ is the TwitterRank vector for topic $t$, while $r_t$ is the weight assigned to topic $t$ and the associated $\vec{TR_t}$.
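The aggregation is a weighted sum of the per-topic rank vectors, as in the Python sketch below. How the weights are chosen is not specified here, so the equal weights in the example are purely an assumption; the function name is hypothetical.

```python
def aggregate_twitter_rank(topic_ranks, topic_weights):
    """Weighted sum of topic-specific rank vectors: TR = sum_t r_t * TR_t."""
    n = len(topic_ranks[0])
    return [
        sum(w * ranks[i] for w, ranks in zip(topic_weights, topic_ranks))
        for i in range(n)
    ]


# Two topics, two twitterers; equal topic weights are an illustrative assumption.
topic_ranks = [[0.2, 0.8], [0.6, 0.4]]
overall = aggregate_twitter_rank(topic_ranks, [0.5, 0.5])
print(overall)
```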

5.5 Empirical Evaluation

The research evaluates the usefulness of TwitterRank (TR) in the twitterer recommendation task. Comparisons against related algorithms are also conducted. The related algorithms studied include:

In-degree (InD), which measures the influence of twitterers by the number of followers.

PageRank (PR), which measures the influence with only the link structure of the network taken into account.

Topic-sensitive PageRank (TSPR), which makes use of a topic-biased teleportation vector.

The recommendation task is designed as below:

1 Choose a set of “following” relationships based on different criteria
2 For each “following” relationship in the set, do the following:
3     Take $s_o$ and $s_f$ as the follower and the friend
4     Choose another 10 twitterers that $s_o$ does not follow, and denote them as $S_t$
5     Remove the existing relationship between $s_o$ and $s_f$ to generate a new network
6     Apply the different algorithms to measure the influence of the twitterers in the new network
7     Based on the influence, $s_o$ is recommended whether to follow $s_f$

$L$, the set of existing “following” relationships chosen in Step 1 of the recommendation task, is considered the “ground truth” for evaluation: the recommendation is considered “good” if $s_f$ is ranked higher than all the twitterers in $S_t$ chosen in Step 4. Given this, the quality of the recommendation is measured as the number of twitterers in $S_t$ who have a higher rank than $s_f$. More formally, it is defined as follows:

Definition 6 Assume $l$ is a ranked list recommended by any of the algorithms, and $s_i$ is a twitterer. Let $l(s_i)$ be the rank of $s_i$ in $l$ (a higher rank corresponds to a lower-numbered position in $l$). The quality of the recommendation $Q(l)$ is measured as:

$Q(l) = |\{ s_t \mid s_t \in S_t \text{ and } l(s_t) < l(s_f) \}|$

$s_f$ is the friend removed in Step 5 of the recommendation task. The lower the value of $Q(l)$ is, the higher the quality of the corresponding algorithm is.
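The quality measure can be sketched in Python as follows. The ranked list is assumed to be ordered from most to least influential, so position 1 is the highest rank; the function name and user identifiers are hypothetical.

```python
def recommendation_quality(ranked_list, removed_friend, candidates):
    """Q(l): number of candidates in S_t ranked above the removed friend s_f."""
    position = {user: rank for rank, user in enumerate(ranked_list, start=1)}
    target = position[removed_friend]
    return sum(1 for user in candidates if position[user] < target)


# One candidate ("s3") is ranked above the removed friend "sf".
ranked = ["s3", "sf", "s1", "s2"]
q = recommendation_quality(ranked, "sf", ["s1", "s2", "s3"])
print(q)
```

A value of 0 means the removed friend was ranked above every candidate, which is the best possible outcome for an algorithm.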

Different L’s based on various criteria have been used to study the proposed TwitterRank’s performance as comprehensively as possible. Currently, there are in total four criteria based on which L is generated:

(a) Two L’s, denoted by $L_{fh}$ and $L_{fl}$, are generated based on the number of followers that $s_f$ has: $L_{fh}$ has $s_f$ with a high follower count, while $L_{fl}$ has $s_f$ with a low follower count. $s_f$’s follower count is considered high if it is larger than FH, and low if it is smaller than FL. FH and FL are set as the 90th and 10th percentiles of all the follower counts of the twitterers.

(b) Two L’s, denoted by $L_{th}$ and $L_{tl}$, are generated based on the number of tweets that $s_f$ has. These two sets are generated in a similar way as in (a). The difference is that the thresholds for high and low tweet counts, denoted as TH and TL, are set as the 90th and 10th percentiles of all the tweet counts of the twitterers.

(c) Two L’s, denoted by $L_{dl}$ and $L_{dh}$, are generated based on the topical difference between $s_o$ and $s_f$. These two sets are generated in a similar way as in (a) and (b). The difference is that the thresholds for low and high topical difference, denoted as DL and DH, are set as the 10th and 90th percentiles of the topical differences of all the existing “following” relationships.

(d) Two L’s are generated based on whether there is a reciprocal “following” relationship between $s_o$ and $s_f$. No threshold is applied.

There are eight sets of L used in each individual round of evaluation. Five rounds of evaluation are conducted.

Figure 11 Comparison of Performance (measured by Q(l)) in the Recommendation Task

Figure 11 shows the average results of the four algorithms with different sets of L over all the evaluation rounds. It can be observed that all the algorithms perform better in scenarios where $L_{dl}$ is used than in those where $L_{dh}$ is used. This observation shows that there are twitterers who “follow” because of the topical similarity between them and their friends. This supports the phenomenon of homophily discussed before.

TR is outperformed by other algorithms in 3 of the 8 scenarios studied, namely those where $L_{fh}$, $L_{tl}$, and $L_{dh}$ are used. In scenarios where $L_{fh}$ is used, there is no obvious difference in the performance of the algorithms, yet InD achieves the best performance. This is probably because, in the dataset, twitterers’ “following” behavior has already been biased toward those with more followers, since InD is essentially the algorithm applied by Twitter to recommend friends.

In scenarios where $L_{tl}$ is used, TR’s performance is the worst of all. This is because the quality of the topics distilled for $s_f$ is lower, since LDA-based topic distillation is less accurate when little content is available. Consequently, this hurts the performance of TR, which takes topical similarity into account when measuring the twitterers’ influence.

In scenarios where $L_{dh}$ is used, TR outperforms all the other algorithms except InD. This phenomenon, together with the one observed in scenarios where $L_{fh}$ is used, shows that there still exist some twitterers who do not “follow” based on topical similarity, although homophily is observed.

TR performs the best in all the other scenarios, though the improvement is not significant in most cases. It is noted that in scenarios where $L_{fl}$ is used, TR outperforms the other algorithms significantly, especially InD and PR. This is because the friends of $s_o$ in the “following” relationships in $L_{fl}$ have lower numbers of followers. Consequently, the corresponding $s_o$ has a lower chance of being biased by the recommendation made by Twitter, which is essentially made with InD. In such cases, the chance that the “following” relationship was formed due to topical similarity is higher. Therefore, TR outperforms InD and PR, which do not take topical similarity into account. Furthermore, TR outperforms TSPR. This is because TSPR uses an identical transition probability matrix when calculating the topic-specific ranks. By doing so, TSPR basically propagates a twitterer’s influence in one topic to her friends in different topics with equal probabilities.

6 Applications

Twitter has been used in various areas. Target event detection and online word of mouth are two typical examples; both make use of the real-time nature of Twitter. More information can be found in the corresponding research.

7 Summary

We introduced characteristics of Twitter: a non-power-law follower distribution, a short average path length, and low reciprocity, which all mark a deviation from known characteristics of human social networks. The impact of retweets and the news nature of trending topics were presented. Then the phenomenon of homophily was explained, and TwitterRank, an extension of the PageRank algorithm that builds on homophily, was introduced.

8 References

[1] Y.-Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. Analysis of topological characteristics of huge online social networking services. In Proc. of the 16th international conference on World Wide Web. ACM, 2007.

[2] M. Cha, A. Mislove, and K. P. Gummadi. A measurement-driven analysis of information propagation in the Flickr social network. In Proc. of the 18th international conference on World Wide Web. ACM, 2009.

[3] R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In Proc. of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.

[4] S. Milgram. The small world problem. Psychology Today, 2(1):60–67, 1967.

[5] J. Leskovec and E. Horvitz. Planetary-scale views on a large instant-messaging network. In Proc. of the 17th international conference on World Wide Web. ACM, 2008.

[6] R. Crane and D. Sornette. Robust dynamic classes revealed by measuring the response function of a social system. Proc. of the National Academy of Sciences, 105(41):15649–15653, 2008.

[7] S. Milstein, A. Chowdhury, G. Hochmuth, B. Lorica, and R. Magoulas. Twitter and the micro-messaging revolution: Communication, connections, and immediacy–140 characters at a time. O’Reilly Report, November 2008.

[8] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7):107–117, 1998.

[9] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
