Global connectivity and multilinguals in the Twitter network (slides)
Transcript of Global connectivity and multilinguals in the Twitter network (slides)
Global Connectivity and Multilinguals in the Twitter Network
Scott A. Hale, Oxford Internet Institute
http://www.scotthale.net/pubs/?chi2014
28 April 2014
Scott A. Hale, Global Connectivity and Multilinguals in the Twitter Network
Background, Motivations
Twitter perceived as ‘global’ network
Past studies concentrate on the role of geography in structuring the network (e.g. Takhteyev, Gruzd, & Wellman, 2011)
Language also important?
Less than half of tweets are in English (the most used language) (Hong, Convertino, & Chi, 2011)
Content is diverse across languages in other platforms (e.g. Wikipedia; Hecht & Gergle, 2010)
Multilingual users may act as unconscious translators bridging language divides (Eleta & Golbeck, 2012)
What are the implications for designing search and friend recommendation algorithms?
Hypotheses
1 Language will have a strong role in structuring the network: the mentions/retweet network will have many clusters composed of a single, dominant language
2 Users engaging with content in multiple languages (multilingual users) will serve as bridges between different clusters of the Twitter network
3 Users primarily writing in less-represented languages will be more likely to cross language boundaries than users writing in highly-represented languages
4 When users cross languages they will cross to larger languages (e.g. English) and thus at a language level English will form more bridges than other languages
Data
Twitter mentions, retweet network
18 days of ‘spritzer’ 1% sample stream from June 2011
7,341,271 nodes. 8,545,693 directed, weighted edges
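As a rough illustration of how a directed, weighted mentions/retweet network might be assembled from raw tweets, here is a minimal Python sketch. The `(author, text)` input format, the `build_mention_edges` name, the lower-casing of usernames, and the choice to skip self-mentions are assumptions made for the example, not details taken from the study.

```python
import re
from collections import Counter

# Matches @username tokens; retweets ("RT @user: ...") are caught by the same pattern.
MENTION_RE = re.compile(r"@(\w+)")

def build_mention_edges(tweets):
    """Return a Counter mapping (source, target) -> number of mentions/retweets.

    `tweets` is an iterable of (author, text) pairs (a hypothetical input format).
    """
    edges = Counter()
    for author, text in tweets:
        source = author.lower()
        for target in MENTION_RE.findall(text):
            target = target.lower()
            if target != source:  # assumption: ignore self-mentions
                edges[(source, target)] += 1
    return edges

example_tweets = [
    ("alice", "RT @bob: hello world"),
    ("alice", "@bob thanks!"),
    ("carol", "good morning @alice and @bob"),
]
example_edges = build_mention_edges(example_tweets)
example_nodes = {user for edge in example_edges for user in edge}
```

Each key of `example_edges` is a directed edge and its count is the edge weight, so repeated mentions between the same pair of users accumulate into a single weighted edge.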
Data cleaning
Language classification
Clean text of tweets for language detection (remove URLs, usernames, emoticons)
Use Chromium Compact Language Detection kit for language detection (Graham, Hale, & Gaffney, 2013)
Remove users with fewer than 2 tweets or 20% of the user's tweets in one language
Remove users with fewer than four tweets total
Bots and spam users
Remove users with no mentions (indegree=0)
Select only the largest weakly-connected component (88% of nodes)
End result
916,836 nodes (users) and 2,652,618 directed edges (mentions/retweets)
Each user is assigned a most used language and the frequency [0-1] with which that language is used
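The cleaning steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the regular expressions are simplified stand-ins, language labels per tweet are assumed to come from an external detector, the in-degree filter is applied in a single pass, and all function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
# Simplified emoticon pattern (assumption): eyes, optional nose, mouth.
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][\-']?[)(DPp](?!\w)")

def clean_text(text):
    """Strip URLs, usernames, and simple emoticons before language detection."""
    for pattern in (URL_RE, USER_RE, EMOTICON_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())

def dominant_language(tweet_langs):
    """Return (most used language, share of tweets in that language)."""
    lang, count = Counter(tweet_langs).most_common(1)[0]
    return lang, count / len(tweet_langs)

def largest_weak_component(edges):
    """Return the node set of the largest weakly-connected component."""
    neighbors = defaultdict(set)
    for source, target in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)  # direction is ignored for weak connectivity
    seen, best = set(), set()
    for start in neighbors:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for other in neighbors[node]:
                if other not in component:
                    component.add(other)
                    stack.append(other)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

def filter_network(edges):
    """Drop users who are never mentioned, then keep the largest weak component."""
    mentioned = {target for _, target in edges}
    kept = [(s, t) for s, t in edges if s in mentioned and t in mentioned]
    giant = largest_weak_component(kept)
    return [(s, t) for s, t in kept if s in giant]
```

For example, `clean_text("@bob check http://t.co/abc :) nice")` leaves only the words to be classified, and `dominant_language(["en", "en", "es", "en"])` gives the most used language together with its share of the user's tweets.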
Users by most used language
Language                  User count   Tweets/user (s.d.)
English (en)                 375,474    8.43 (5.81)
Japanese (ja)                137,263    9.51 (8.38)
Portuguese (pt)              133,501    7.95 (5.18)
Malay/Indonesian (ms)        106,223    8.44 (5.51)
Spanish (es)                  70,246    8.01 (5.18)
Dutch (nl)                    31,035    8.81 (5.84)
Korean (ko)                   16,123   10.46 (8.96)
Thai (th)                      8,629    9.03 (6.48)
Arabic (ar)                    7,679    8.30 (6.48)
French (fr)                    5,769    9.06 (6.71)
Filipino/Tagalog (fil)         5,393    6.74 (3.64)
Table: Languages with the most users and the average number of tweets per user. Each user is placed in the language he or she uses most frequently.
Multilinguals vs Monolinguals
11% of users (103,645) were observed to use more than one language and were designated as multilingual users.
Figure: Comparison of tweet count, in-degree, and out-degree for multilingual and monolingual users. Vertical lines show mean values.
Finding clusters
Finding clusters of densely connected nodes in a network of this size is non-trivial.
Used the label propagation algorithm: each node adopts the most common label among its neighbors (Raghavan, Albert, & Kumara, 2007)
I parallelized the algorithm and open-sourced the code: http://www.scotthale.net/pubs/?chi2014
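To make the label propagation idea concrete, here is a small, single-threaded Python sketch. It is not the parallelized implementation mentioned above: it assumes edges are treated as undirected and unweighted, updates nodes asynchronously in random order, and breaks ties at random, with `label_propagation`, `max_rounds`, and `seed` being names chosen for this example.

```python
import random
from collections import Counter, defaultdict

def label_propagation(edges, max_rounds=100, seed=0):
    """Assign each node the most common label among its neighbors until stable."""
    rng = random.Random(seed)
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)  # assumption: direction and weight are ignored

    labels = {node: node for node in neighbors}  # every node starts in its own cluster
    order = sorted(neighbors)
    for _ in range(max_rounds):
        rng.shuffle(order)
        changed = False
        for node in order:
            counts = Counter(labels[other] for other in neighbors[node])
            top = max(counts.values())
            best = sorted(label for label, c in counts.items() if c == top)
            if labels[node] in best:
                continue  # current label is already a most common one
            labels[node] = rng.choice(best)
            changed = True
        if not changed:
            break  # every node holds a label that is most common among its neighbors
    return labels

# Two separate triangles end up as two clusters.
triangle_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
triangle_labels = label_propagation(triangle_edges)
```

Nodes that share a final label form one cluster. Because ties are broken at random, different seeds can give different partitions on less clear-cut graphs.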
Language and structure
Found 17,480 clusters, most with a clearly dominant language
Figure: Histograms of the size of clusters (left) and the number of languages within each cluster (right).
Language and structure
Figure: Scatter plot of cluster size and the percentage of users in the cluster most often using the most prevalent language.
Language and structure
Most-used language   % users in most-used language   Number of languages   Number of nodes
Malay (ms)                                    78.3                    41           123,616
English (en)                                  99.3                    39           114,826
Portuguese (pt)                               94.3                    40           101,987
Japanese (ja)                                 99.6                    19            83,785
English (en)                                  75.7                    44            80,387
English (en)                                  55.1                    42            37,688
Dutch (nl)                                    90.6                    23            20,634
Table: Clusters with over 10,000 nodes found through the label propagation algorithm. Collectively 61% of all users are in one of these clusters.
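The columns of a table like this one can be derived from two mappings: user to cluster and user to most-used language. The following Python sketch shows one way to do that. The dictionary inputs and the `cluster_language_summary` name are assumptions for the example.

```python
from collections import Counter, defaultdict

def cluster_language_summary(cluster_of, language_of):
    """Summarize each cluster by dominant language, its share, and cluster size."""
    members = defaultdict(list)
    for user, cluster in cluster_of.items():
        members[cluster].append(language_of[user])

    rows = []
    for cluster, langs in members.items():
        lang, count = Counter(langs).most_common(1)[0]
        rows.append({
            "cluster": cluster,
            "most_used": lang,
            "pct_most_used": 100.0 * count / len(langs),
            "n_languages": len(set(langs)),
            "n_nodes": len(langs),
        })
    # Largest clusters first, as in the table.
    return sorted(rows, key=lambda row: row["n_nodes"], reverse=True)

toy_clusters = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1, "f": 1}
toy_languages = {"a": "en", "b": "en", "c": "en", "d": "es", "e": "ja", "f": "ja"}
toy_rows = cluster_language_summary(toy_clusters, toy_languages)
```

In the toy data, cluster 0 has four users of whom three mainly use English, so its dominant-language share is 75% across two languages.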
Bridging role of multilinguals
Figure: Size of the largest, weakly-connected component (left), total number of components (center), and average size of the components (right) created by removing all multilingual users, an equivalent number of monolingual users randomly, an equivalent number of all users randomly, and removing all multilingual users from a network with the same degree distribution but with edges randomly shuffled. Box plots show values from 100 realizations. Mean values are indicated with +.
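The removal experiment described in the caption can be sketched in Python as below. This is an illustrative toy under stated assumptions: it covers only the "remove multilinguals" and "remove the same number of random monolinguals" conditions (not the shuffled-edge comparison), and the function names and the tiny example network are hypothetical.

```python
import random

def component_sizes(nodes, edges, removed=()):
    """Sizes of weakly-connected components after deleting `removed` nodes."""
    remaining = set(nodes) - set(removed)
    neighbors = {node: set() for node in remaining}
    for source, target in edges:
        if source in remaining and target in remaining:
            neighbors[source].add(target)
            neighbors[target].add(source)
    sizes, seen = [], set()
    for start in remaining:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        sizes.append(size)
    return sorted(sizes, reverse=True)

def summarize(sizes):
    """Largest component, number of components, and mean component size."""
    return {"largest": sizes[0], "count": len(sizes), "mean_size": sum(sizes) / len(sizes)}

# Toy network: two language groups joined only through multilingual user "m".
nodes = ["a1", "a2", "a3", "m", "b1", "b2", "b3"]
edges = [("a1", "a2"), ("a2", "a3"), ("a3", "m"), ("m", "b1"), ("b1", "b2"), ("b2", "b3")]
multilingual = {"m"}
monolingual = set(nodes) - multilingual

after_multilingual = summarize(component_sizes(nodes, edges, multilingual))

rng = random.Random(0)
random_runs = [
    summarize(component_sizes(
        nodes, edges, rng.sample(sorted(monolingual), len(multilingual))))
    for _ in range(100)  # 100 realizations, as in the figure
]
```

Comparing `after_multilingual` with the distribution in `random_runs` mirrors the comparison in the figure: if multilingual users act as bridges, removing them should fragment the network more than removing an equal number of randomly chosen monolingual users.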
Variations by language
Figure: Number of users in each language compared to the percentage of these users classified as multilingual
Variations by language (Wikipedia)
Figure: Number of users in each language compared to the percentage of those users editing multiple language editions of Wikipedia (arXiv:1312.0976).
Bridging at a language level
Figure: A collapsed network graph with users grouped to nodes representing the primary language used (nodes shown: ar, de, en, es, fil, fr, it, ja, ko, ms, nl, pt, ru, th, tr). Edges are weighted by the percent error in the expected vs. the actual number of mentions and retweets between language groups. Node size is proportional to the number of users primarily using each language, and node color is the result of a modularity-maximizing community detection algorithm.
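One way such percent-error edge weights could be computed is sketched below in Python. The slides do not define the expected count, so this example assumes a simple independence baseline: the expected number of mentions from language A to language B is (mentions sent by A) × (mentions received by B) / (all mentions). That baseline, the function name, and the toy data are all assumptions and may differ from the calculation behind the figure.

```python
from collections import Counter

def percent_error_between_languages(edges, user_lang):
    """Percent error of actual vs. expected mention counts per language pair.

    `edges` maps (source_user, target_user) -> weight; `user_lang` maps
    user -> primary language. Expected counts use an independence baseline
    (an assumption, see the lead-in above).
    """
    actual = Counter()
    sent, received = Counter(), Counter()
    total = 0
    for (source, target), weight in edges.items():
        a, b = user_lang[source], user_lang[target]
        actual[(a, b)] += weight
        sent[a] += weight
        received[b] += weight
        total += weight

    result = {}
    for a in sent:
        for b in received:
            expected = sent[a] * received[b] / total
            result[(a, b)] = 100.0 * (actual[(a, b)] - expected) / expected
    return result

toy_edges = {("u1", "u2"): 3, ("u2", "u1"): 3, ("u3", "u4"): 2, ("u1", "u3"): 2}
toy_lang = {"u1": "en", "u2": "en", "u3": "ja", "u4": "ja"}
toy_errors = percent_error_between_languages(toy_edges, toy_lang)
```

Positive values mean two language groups interact more than the baseline predicts and negative values mean less. In the toy data, English users mention each other more than expected and mention Japanese users less than expected.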
Bridging at a language level
Figure: Percentage of remaining users not in the largest-connected component after removing users in different languages. NB 3.7% of remaining users are not in the largest-connected component after removing multilingual users across all languages.
Hypotheses
X Language will have a strong role in structuring the network: the mentions/retweet network will have many clusters composed of a single, dominant language
X Users engaging with content in multiple languages (multilingual users) will serve as bridges between different clusters of the Twitter network
Ö Users primarily writing in less-represented languages will be more likely to cross language boundaries than users writing in highly-represented languages
X When users cross languages they will cross to larger languages (e.g. English) and thus at a language level English will form more bridges than other languages
Implications and future directions
Implications
Global connectivity results through the combination of multilinguals across many language pairs
Allow users to have multiple preferred languages (over 10% of users used multiple languages)
Important per-language variations
Further work
Construct diffusion cascades of links, retweets
Examine the content of external links, e.g., the original language of the content
Comparison of language to geography
Working on similar questions with Wikipedia as a point of comparison (arXiv:1312.0976)
I would like to thank Taha Yasseri, Eric T. Meyer, Sandra Gonzalez-Bailon, Jonathan
Bright, Mike Thelwall, and Irene Eleta as well as the anonymous CHI reviewers who
provided helpful comments on previous versions of this article.
Eleta, I., & Golbeck, J. (2012). Bridging Languages in Social Networks: How Multilingual Users of Twitter Connect Language Communities. Proceedings of the American Society for Information Science and Technology, 49(1), 1–4. Available from http://dx.doi.org/10.1002/meet.14504901327
Graham, M., Hale, S. A., & Gaffney, D. (2013). Where in the world are you? Geolocation and language identification in Twitter. Professional Geographer.
Hecht, B., & Gergle, D. (2010). The Tower of Babel meets Web 2.0: User-generated content and its applications in a multilingual context. In Proceedings of the 28th international conference on human factors in computing systems (pp. 291–300). New York, NY, USA: ACM. Available from http://doi.acm.org/10.1145/1753326.1753370
Hong, L., Convertino, G., & Chi, E. (2011). Language matters in Twitter: A large scale study. In International AAAI conference on weblogs and social media (pp. 518–521). Available from http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2856
Raghavan, U. N., Albert, R., & Kumara, S. (2007, September). Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E, 76(3), 036106. Available from http://link.aps.org/doi/10.1103/PhysRevE.76.036106
Takhteyev, Y., Gruzd, A., & Wellman, B. (2011). Geography of Twitter networks. Social Networks, 1–26. Available from http://www.sciencedirect.com/science/article/pii/S0378873311000359#FCANote