Clustering Twitter Users Based on Interests Derived from Their Retweetskjs.nagaokaut.ac.jp ›...

ISIS2017 The 18th International Symposium on Advanced Intelligent Systems

183

Clustering Twitter Users Based on Interests Derived from Their Retweets

Suppassara Kijsupapaisan, Koichi Yamada, Muneyuki Unehara, Izumi Suzuki

[email protected], {yamada,unehara,suzuki}@kjs.nagaokaut.ac.jp

Nagaoka University of Technology Kamitomioka, Nagaoka, Niigata, 940-2188 Japan

Abstract. We propose a new approach to cluster twitter users based on categories of interest, which are examined from their retweets. A user's attributes are represented by a vector composed of frequencies of retweets per each category of interest. The category of each retweet inherits from the hub user who posted the original tweet. The hub user category is defined in advance from the past tweets and the user profile. We apply two clustering techniques and the results are evaluated through two kinds of similarity measures between users within clusters; one utilizes hashtag features and the other noun features in retweets. The results shows that the proposed approach has the potential for twitter user clustering based on retweets data.

Keywords: Twitter; Users clustering; Users similarity

1 Introduction

Twitter is a web service which has over 100 million users, and 200 million messages are posted and shared everyday. Many digital contents companies are using twitter to connect with people in order to show their opinions to audience. A challenge of twitter may be a limitation of text length (140 characters) that leads to the content shortness. Beside, the microblogging (likes Twitter) analysis has the difficulty due to abbreviations and linguistic errors.

Twitter provides three functions that allow users to reply, quote and retweet. The reply function is a function for responding tweets conversation. The retweet mechanism is regarded as the way to propagate messages throughout users. In this paper, we focus on the retweet mechanism. We suppose that users retweet in order to share their interests on timeline. In consequence, the retweets also can be used as properties for clustering users based on their interest.

Retweeting behavior was explained in [2] where it is the way to be in conversation, according to a series of questions on @zephoria’s twitter account. There are several reasons in retweeting; to inform information, to agree with the content or to validate other’s thoughts. Furthermore, Nagarajan et al. [7] and Welch et al. [9] also investigated and gave a qualitative examination about retweet and concluded that the retweet has the potential for utilization. Moreover, [6] shows that users tweeting behavior are always changing. Recently, the percentage of retweet is larger than the percentage of replies. Some users seem to prefer retweeting than writing their own


184

tweets. In such a case, extracting information of users only from their original tweets would be insufficient. Therefore, the idea of using retweets was considered.

Recent works on twitter user’s topic of interest, P. Kapanipathi et al. [4] proposed a method to identify user's interest by extracting named entities from tweet text and using those words to define the category. However, the reliability of data source is still being a problem. The work [5] proposed a method to find user communities by using following links of celebrities. They supposed that a group of people with the same interests is gathering within a large society. Then, they proposed a method to detect communities with common interests. Their method aimed to get more cohesive users that share a common interest. The measurement of interest was based on the number of celebrities in an interest category that users follow. The idea of using celebrities as the interest category identifier was applied to our research. Beside, work [8] also used the propagation of retweet in twitter to find community.

Fig. 1. The model of relation between the ordinary user and the hub user. The black circle

represented a hub user and the white is regular users. We consider the relations of regular users to hub users are connecting by retweet actions

Thus, in this research, we assume that retweets have a potential to be analyzed for users’ interests and we can further use those interests for clustering users into groups by using unsupervised learning. The experiment results were evaluated to proof that our method has ability to cluster users who have common interests and has high similarities into the same group relied on retweets data.

2 Retweets

Retweet is a function that lets users post a tweet about other's tweets. We define a hub user by the user whose tweets always get retweeted by a large number of others. The examples of hub users are news reporters, celebrities or organizations. However, a hub user can be anyone if he/she got a large number of retweets in our case.

Using retweet as the resource has various potentials; 1) Discovery of active users: Collecting users who retweet trending tweets is a

good way to collect active users. Users who are retweeting often seem to be interested in others’ tweets and try to spread them to their audiences.


185

2) Hub user filtering: The numbers of retweeted users reveal the public acceptance of hub users. Generally, hub users can be classified into a category through the contents that they usually spread.

3) The same interest user group: Users retweet because they want to spread the content. This means the hub user can be used as the center of community where the ordinary users with the same interest interact with the hub user.

According to these reasons, we considered that retweets are more valuable than original tweets for our goal.

3 Data Collection

We started the tweet search with topics of US President Election using popular hashtags; #trump, #nevertrump, #Imwither, etc., during March and June 2016. First, we focused on so-called "hub users" retweeted by a large number of others. Then, We collected the users who retweeted the hub users' posts (Fig. 2) and their recent tweets.

Fig. 2. The diagram depicted users collecting process.

We collected retweeting users from hub users’ posts and also retweeting users recent tweets.

The total numbers of users and their tweets collected are 1335 and 3.0 million respectively with the API limitation that 3,000 recent tweets per user can be retrieved. Among the retrieved tweets, 15% are original posts, 68% are retweets, and the rest are quotes and replies.

The hub users called hereafter are defined formally and collected in the following conditions; 1) more than 300 users in our data collection have retweeted their posts, and 2) they have followers more than 10 thousands. The number of hub users chosen is 930 users. The top 500 active users are chosen as the ordinary users, who will be clustered later, in the conditions; 1) more than 35% of his/her tweets are retweets, and 2) more than 50% of his/her retweets belong to one or some of categories in Table 1 in the next section.


186

4 Implementation

4.1 Hub Users' Category

We examined the content of tweets and the profile description of hub users to identify the category of each hub user. Latent Dirichlet allocation (LDA) [1] was employed for finding topics within a document set. In our case, a tweet is a document and a hub user has a collection of documents. LDA is a topic model that gives us a probability distribution on latent topics for a document and a probability distribution on words for a latent topic.

We determined the category of each hub user manually taking into account the results of LDA and the user's profile. The categories we used are shown in table 1,

Table 1. The categories from Alchemy taxonomy recognition services. Categories

art and entertainment health and fitness science automotive and vehicles hobbies and interests society business and industrial home and garden sports

education law style and fashion family and parenting news technology and computing

finance pets travel food and drink religion and spirituality

which was originated from taxonomy used by Alchemy taxonomy recognition service; the service is analyzing assigned text into categories.

4.2 User’s Interest

Every retweet posted by ordinary users inherits the category from the hub user who posted the original tweet, and the interest of an ordinary user is represented by the frequency vector of categories.

𝑋𝑋′ = 𝑋𝑋 − 𝑋𝑋𝑚𝑚𝑚𝑚𝑚𝑚𝑋𝑋𝑚𝑚𝑚𝑚𝑚𝑚 − 𝑋𝑋𝑚𝑚𝑚𝑚𝑚𝑚

, (1)

where X’ is a normalized frequency, X is the frequency of retweets in each category, Xmin is the lowest frequency in the category, and Xmax is the highest frequency. We clustered the users with their frequency vector using two unsupervised clustering techniques: k-means clustering and hierarchy clustering with ward method, the number of clusters are n = 5, 7, 10 and 15 respectively.


187

5 Evaluation and Results

5.1 Evaluation feature

The basic idea of the proposed concept is that frequencies of a user's retweets are reflection of his/her interests. Users belonging to the same cluster would share common topics or words. Thus we introduce two kinds of similarity measures related to the common words to evaluate the quality of clusters. Then, we evaluated the obtained clusters with the two similarity measures. For the purpose of comparison, random selected sets of users were also evaluated with the measures.

One of the similarity measures utilizes hashtag features. Hashtag is a mechanism provided by twitter and a way to join social trending conversation. As the result, hashtags reveal the topics a user is interested in, so we utilize hashtag usage as features for measuring similarity (See Fig. 3).

On the other hand, nouns in user’s tweets could also reflect user’s interests. We

extract all nouns in the tweet collection by using Part of Speech tagging[3]. Then, a noun vector is calculated for each user, where each element of the vector is the Term frequency - Inverse Document Frequency (Tf-idf) under the condition that all tweets of a user are combined as a document. The number of dimensions of the noun vector was 5832.

Fig. 3. The example of hashtag usage vector

5.2 Similarity Measurement

Cosine similarity is used to measure the similarity between two vectors. We apply this method to both Hashtags feature and nouns feature in order to evaluate the similarity between two users.

The similarity of all user pairs in one cluster was averaged. For the similarity of overall, we weighted each cluster with the number of users in the cluster and divide with total users as shown in equation (2).

𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 𝑜𝑜𝑜𝑜 𝑠𝑠𝑠𝑠𝑠𝑠 𝑐𝑐𝑠𝑠𝑐𝑐𝑠𝑠𝑠𝑠𝑐𝑐𝑠𝑠 =

∑ (𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 𝑜𝑜𝑜𝑜 𝑐𝑐𝑠𝑠𝑐𝑐𝑠𝑠𝑠𝑠𝑐𝑐𝑠𝑠𝑖𝑖 × |𝑐𝑐𝑠𝑠𝑐𝑐𝑠𝑠𝑠𝑠𝑐𝑐𝑠𝑠𝑖𝑖|)𝑁𝑁𝑖𝑖=1

𝑠𝑠𝑜𝑜𝑠𝑠𝑠𝑠𝑠𝑠 𝑐𝑐𝑠𝑠𝑐𝑐𝑠𝑠𝑠𝑠 , (2)


188

5.3 Unsupervised techniques

We consider similarity value from Hashtags and nouns features separately. Figure 4 is the results of all cluster similarities from hashtags feature, both clustering technique have better average similarity values than the users which selected randomly (outer cluster). The results from nouns feature also have the same tendency (Figure 5). These results show us that our proposed method has potential of clustering Twitter users.

Fig. 4. The similarity result of Hashtags feature - comparison between two clustering methods

with different numbers of cluster clusters

Fig.5. The similarity result of nouns feature - comparison between two clustering methods with different numbers of cluster clusters

6.4 Number of Clusters

Both k-mean clustering and hierarchy clustering require a number of clusters before processing that is crucial to find the best value which suit with data. As mentioned previously, we performed the clustering with number of cluster n = 5, 7,

0.047 0.053 0.054

0.058

0.047 0.053 0.055

0.059

0.0003 0.0004 0.0005 0.0005 0

0.015

0.03

0.045

0.06

0.075

5 7 10 15

Sim

ilarit

y va

lue

Number of clusters

Hierarchy Technique K-mean Technique Random Selection

0.059 0.069

0.081 0.089

0.062 0.071

0.078 0.086

0.0004 0.002 0.004 0.008

0

0.02

0.04

0.06

0.08

0.1

5 7 10 15

Sim

ilarit

y va

lue

Number of clusters

Hierarchy Techinique K-mean Technique Random Selection


189

10 and 15. We found that when the assigned number of cluster is low, there are the clusters which have very high frequency in one category and one cluster that gather users who have ambiguous interest.

Table 2. The hierarchy clustering similarity result, the number of clusters is 5 cluster

number representative interest hashtags feature

similarity nouns feature

similarity number of

users 1 nonspecific 0.027 lower than 0.001 333 2 business and industrial 0.065 0.076 30 3 art and entertainment 0.081 0.090 66 4 news 0.116 0.271 23 5 technology 0.095 0.294 46

random - 0.0005 0.0003 100

Table 3. The hierarchy clustering similarity result, the number of cluster is 10 cluster



similarity number of users

1 news (lower frequency) 0.048 0.043 112 2 business and industrial 0.065 0.076 30 3 sports 0.051 0.077 43 4 art and entertainment 0.081 0.090 66 5 nonspecific 0.020 lower than 0.001 167 6 news 0.116 0.271 23 7 food 0.039 0.153 2 8 technology 0.095 0.294 46 9 science 0.223 0.207 7

10 fashion 0.082 0.153 4 random - 0.0004 0.0006 50

Table 4. The hierarchy clustering similarity result,

the number of cluster is 15 cluster



similarity number of users

1 news (lower frequency) 0.048 0.043 112 2 business and industrial 0.065 0.076 30 3 sports 0.051 0.077 43 4 art and entertainment 0.070 0.061 57 5 nonspecific 0.020 lower than 0.001 158 6 news 0.116 0.271 23 7 travel 0.110 0.082 4 8 food 0.039 0.135 2 9 hobby and interests 0.112 0.264 25

10 technology 0.111 0.356 24 11 entertainment

(lower frequency) 0.259 0.447 8

12 education 0.053 0.015 2 13 science 0.213 0.207 7 14 fashion 0.082 0.1534 4 15 pets - - 1

random - 0.0004 0.0006 50


190

When the assigned number of cluster is getting higher, more communities could be discovered. Those are the clusters which were concealed by the bigger clusters previously. Moreover, those new clusters also have good similarities values.

7 Discussion

The proposed concept shows that the retweet has potential to discover group of user who share common interest when compared with the random selection. The assigned number of cluster should depend on the purpose of application and the specific-level of target users. The applied categories in the experiment are in general levels that the categories could be deeper in specific field like movies, soccer or music. The clustered users can be used as the resource data for leveraging useful information that directly relate to users interest for marketing strategies or online advertising.

References

1. D. M. Blei, A. Y. Ng and M. I. Jordan. Latent dirichlet allocation. In Journal of Machine Learning Research, pages 993-1022, 2003.

2. D. Boyd, S. Golder, and G. Lotan. Tweet, tweet, retweet: Conversational aspects of retweeting on Twitter. In 43rd Hawaii International Conf. on System Sciences, page 412, 2008.

3. K. Gimpel, N. Schneider, B. O ’ Connor, D. Das. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments. In Proceeding of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Volume 2 pages 42-47, 2011.

4. P. Kapanipathi, P. Jain, C. Venkataramani, and A. Sheth. User Interests Identification on Twitter Using a Hierarchical Knowledge Base. The Semantic Web: Trends and Challenges, Volume 8465, pages 99-113, 2014.

5. K. H. Lim and A. Datta. Finding twitter communities with common interests using following links of celebrities. In Proceedings of the 3rd international workshop on Modeling social media, pages 25–32, 2012.

6. Y. Liu, C. Kliman-Silver, and A. Mislove. The Tweets They Are a-Changin’: Evolution of Twitter Users and Behavior. In Proceedings of the 8th International AAAI Conference on Weblogs and Social Media, 2014.

7. M. Nagarajan, H. Purohit and A. Sheth. A Qualitative Examination of Topical Tweet and Retweet Practices.In Proceedings of Fourth International AAAI Conference on Weblogs and Social Media, pages 295–298, 2010.

8. Y. Ota, K. Maruyama, and M. Terada. Discovery of interesting users in Twitter by overlapping propagation paths of retweets. In Proceedings of The International Joint Conferences on Web Intelligence and Intelligent Agent Technology, Volume 3, pages 274-279 (2012)

9. J. Welch, D. He, U. Schonfeld, and J. Cho. Topical Semantics of Twitter Links. In Proceedings of the 4th ACM international conference on Web search and data mining, Pages 327-336, 2011.

Clustering Twitter Users Based on Interests Derived from Their Retweetskjs.nagaokaut.ac.jp ›...

Documents

Transcript of Clustering Twitter Users Based on Interests Derived from Their Retweetskjs.nagaokaut.ac.jp ›...