Global connectivity and multilinguals in the Twitter network (slides)

Global Connectivity and Multilinguals in the Twitter Network, by Scott A. Hale (Oxford Internet Institute), 28 April 2014. http://www.scotthale.net/pubs/?chi2014

Transcript of Global connectivity and multilinguals in the Twitter network (slides)

Page 1: Global connectivity and multilinguals in the Twitter network (slides)

Global Connectivity and Multilinguals in the Twitter Network

Scott A. Hale, Oxford Internet Institute

http://www.scotthale.net/pubs/?chi2014

28 April 2014


Page 2: Global connectivity and multilinguals in the Twitter network (slides)

Background, Motivations

Twitter perceived as ‘global’ network

Past studies concentrate on role of geography in structuring network (e.g. Takhteyev, Gruzd, & Wellman, 2011)

Language also important?

Less than half the tweets in English (the most used language) (Hong, Convertino, & Chi, 2011)

Content is diverse across languages in other platforms (e.g. Wikipedia; Hecht & Gergle, 2010)

Multilingual users may act as unconscious translators bridging language divides (Eleta & Golbeck, 2012)

What implications for designing search, friend recommendation algorithms?


Page 3: Global connectivity and multilinguals in the Twitter network (slides)

Hypotheses

1 Language will have a strong role in structuring the network: the mentions/retweet network will have many clusters composed of a single, dominant language

2 Users engaging with content in multiple languages (multilingual users) will serve as bridges between different clusters of the Twitter network

3 Users primarily writing in less-represented languages will be more likely to cross language boundaries than users writing in highly-represented languages

4 When users cross languages they will cross to larger languages (e.g. English) and thus at a language level English will form more bridges than other languages


Page 4: Global connectivity and multilinguals in the Twitter network (slides)

Data

Twitter mentions, retweet network

18 days of ‘spritzer’ 1% sample stream from June 2011

7,341,271 nodes. 8,545,693 directed, weighted edges
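
A directed, weighted mentions/retweet network of this kind could be assembled roughly as follows. This is only a sketch under stated assumptions: the input format `(author, text)`, the regular expression, the function name `build_edges`, and the choice to drop self-mentions are all illustrative and are not taken from the slides. It also assumes retweets appear in the "RT @user" style, so they are counted as mentions.

```python
import re
from collections import Counter

# Assumption: mentions and old-style retweets both appear as "@username" in the text.
MENTION_RE = re.compile(r"@(\w+)")

def build_edges(tweets):
    """Count directed edges author -> mentioned user.

    tweets: iterable of (author, text) pairs (hypothetical input format).
    Returns a Counter mapping (author, target) to the number of mentions.
    """
    weights = Counter()
    for author, text in tweets:
        source = author.lower()  # Twitter usernames are case-insensitive
        for target in MENTION_RE.findall(text):
            target = target.lower()
            if target != source:  # assumption: ignore self-mentions
                weights[(source, target)] += 1
    return weights

tweets = [
    ("Alice", "hi @bob"),
    ("alice", "RT @Bob: hello @carol"),
    ("bob", "@bob note to self"),
]
edges = build_edges(tweets)
```

Edges point from the author to the mentioned user, which matches the later use of in-degree as "number of times mentioned".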


Page 5: Global connectivity and multilinguals in the Twitter network (slides)

Data cleaning

Language classification

Clean text of tweets for language detection (remove urls, usernames, emoticons)

Use Chromium Compact Language Detection kit for language detection (Graham, Hale, & Gaffney, 2013)

Remove users with fewer than 2 tweets or 20% of the user’s tweets in one language

Remove users with fewer than four tweets total
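
The text-cleaning step might look something like the sketch below. The regular expressions are illustrative assumptions (in particular, the emoticon pattern covers only a few common Western-style emoticons), and `clean_tweet` is a hypothetical name. The language detector itself is not shown.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Assumption: a deliberately small emoticon pattern (eyes, optional nose, mouth).
EMOTICON_RE = re.compile(r"(?<!\w)[:;=8][\-o\*']?[\)\]\(\[dDpP/\\|](?!\w)")

def clean_tweet(text: str) -> str:
    """Strip URLs, usernames, and simple emoticons before language detection."""
    text = URL_RE.sub(" ", text)       # URLs first, so "://" is never read as an emoticon
    text = MENTION_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    return " ".join(text.split())      # collapse the whitespace left behind

cleaned = clean_tweet("@alice bonjour le monde :) http://t.co/abc")
```

The cleaned string is what would then be passed to the language detector.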

Bots and spam users

Remove users with no mentions (indegree=0)

Select only the largest weakly-connected component (88% of nodes)
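
The two filtering steps can be sketched with the standard library alone. The function names are hypothetical, and the sketch assumes a single pass of the in-degree filter (users who lose all incoming edges because of the filter are not removed again), which the slides do not specify.

```python
from collections import defaultdict, deque

def weakly_connected_components(nodes, edges):
    """Return a list of node sets, ignoring edge direction."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    component.add(other)
                    queue.append(other)
        components.append(component)
    return components

def filter_network(edges):
    """Drop never-mentioned users, then keep the largest weak component."""
    mentioned = {dst for _, dst in edges}  # users with in-degree > 0
    kept = [(s, d) for s, d in edges if s in mentioned and d in mentioned]
    nodes = {n for edge in kept for n in edge}
    if not nodes:
        return set(), []
    largest = max(weakly_connected_components(nodes, kept), key=len)
    return largest, [(s, d) for s, d in kept if s in largest]

edges = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b"),
         ("x", "a"), ("d", "e"), ("e", "d")]
nodes, kept_edges = filter_network(edges)
```

Here `x` is never mentioned and is dropped, and the smaller component `{d, e}` is discarded.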

End result

916,836 nodes (users) and 2,652,618 directed edges (mentions/retweets)

Each user assigned most used language and frequency [0-1] that the most used language is used
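
Assigning each user a most used language and its frequency could be done along these lines. The function name and input format are assumptions, and this sketch applies no minimum-tweet threshold before counting a language, so it is simpler than the cleaning rules listed above.

```python
from collections import Counter

def summarize_user(tweet_languages):
    """Return (most used language, its share in [0, 1], uses more than one language).

    tweet_languages: list with one detected language code per tweet
    (hypothetical input; the detection step is not shown).
    """
    counts = Counter(tweet_languages)
    language, n = counts.most_common(1)[0]
    return language, n / len(tweet_languages), len(counts) > 1

mixed = summarize_user(["en", "en", "ja", "en"])
single = summarize_user(["pt", "pt", "pt", "pt"])
```

A user is flagged as using multiple languages here whenever more than one language code appears among their tweets.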


Page 9: Global connectivity and multilinguals in the Twitter network (slides)

Users by most used language

Language                User count   Tweets/user (s.d.)
English (en)            375,474      8.43 (5.81)
Japanese (ja)           137,263      9.51 (8.38)
Portuguese (pt)         133,501      7.95 (5.18)
Malay/Indonesian (ms)   106,223      8.44 (5.51)
Spanish (es)            70,246       8.01 (5.18)
Dutch (nl)              31,035       8.81 (5.84)
Korean (ko)             16,123       10.46 (8.96)
Thai (th)               8,629        9.03 (6.48)
Arabic (ar)             7,679        8.30 (6.48)
French (fr)             5,769        9.06 (6.71)
Filipino/Tagalog (fil)  5,393        6.74 (3.64)

Table: Languages with the most users and the average number of tweets per user. Each user is placed in the language he or she uses most frequently.


Page 10: Global connectivity and multilinguals in the Twitter network (slides)

Multilinguals vs Monolinguals

11% of users (103,645) were observed to use more than one language and designated as multilingual users.

Figure: Comparison of tweet count, in-degree, and out-degree for multilingual and monolingual users. Vertical lines show mean values.


Page 11: Global connectivity and multilinguals in the Twitter network (slides)

Finding clusters

Finding clusters of densely connected nodes in a network of this size is non-trivial.

Used the label propagation algorithm: each node adopts the most common label among its neighbors (Raghavan, Albert, & Kumara, 2007)

I parallelized the algorithm, and open-sourced the code: http://www.scotthale.net/pubs/?chi2014
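
For readers unfamiliar with label propagation, a minimal sequential sketch follows. It is not the parallelized implementation mentioned above. It assumes edges are treated as undirected, counts repeated edges as extra votes, and breaks ties at random; the function name and the `max_rounds` cap are illustrative.

```python
import random
from collections import Counter, defaultdict

def label_propagation(edges, seed=0, max_rounds=100):
    """Asynchronous label propagation on an undirected view of the edges."""
    rng = random.Random(seed)
    neighbors = defaultdict(list)
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    labels = {node: node for node in neighbors}  # every node starts with its own label
    order = list(neighbors)
    for _ in range(max_rounds):
        rng.shuffle(order)  # visit nodes in a fresh random order each round
        changed = False
        for node in order:
            counts = Counter(labels[n] for n in neighbors[node])
            best = max(counts.values())
            top = [label for label, c in counts.items() if c == best]
            if labels[node] not in top:  # keep the current label if it is already a winner
                labels[node] = rng.choice(top)
                changed = True
        if not changed:  # every node already holds a most common neighbor label
            break
    return labels

edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
labels = label_propagation(edges)
```

Nodes that end up sharing a label form one cluster. On the two disconnected triangles above, each triangle converges to a single label of its own.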


Page 12: Global connectivity and multilinguals in the Twitter network (slides)

Language and structure

Found 17,480 clusters, most with a clearly dominant language

Figure: Histograms of the size of clusters (left) and the number of languages within each cluster (right).


Page 13: Global connectivity and multilinguals in the Twitter network (slides)

Language and structure

Figure: Scatter plot of cluster size and the percentage of users in the cluster most often using the most prevalent language.


Page 14: Global connectivity and multilinguals in the Twitter network (slides)

Language and structure

Most-used language   % users in most-used language   Number of languages   Number of nodes
Malay (ms)           78.3                            41                    123,616
English (en)         99.3                            39                    114,826
Portuguese (pt)      94.3                            40                    101,987
Japanese (ja)        99.6                            19                    83,785
English (en)         75.7                            44                    80,387
English (en)         55.1                            42                    37,688
Dutch (nl)           90.6                            23                    20,634

Table: Clusters with over 10,000 nodes found through the label propagation algorithm. Collectively 61% of all users are in one of these clusters.
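
The per-cluster columns in a table like this can be derived from a cluster assignment and each user's most used language. The sketch below assumes two dictionaries keyed by user; the names `cluster_of`, `language_of`, and `cluster_language_stats` are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_language_stats(cluster_of, language_of):
    """Summarize the language make-up of each cluster.

    cluster_of: dict user -> cluster label
    language_of: dict user -> most used language
    """
    members = defaultdict(list)
    for user, cluster in cluster_of.items():
        members[cluster].append(language_of[user])
    stats = {}
    for cluster, languages in members.items():
        counts = Counter(languages)
        top_language, top_count = counts.most_common(1)[0]
        stats[cluster] = {
            "language": top_language,                          # most-used language
            "percent": 100.0 * top_count / len(languages),     # % users in that language
            "n_languages": len(counts),                        # number of languages
            "n_nodes": len(languages),                         # number of nodes
        }
    return stats

cluster_of = {"u1": 0, "u2": 0, "u3": 0, "u4": 0, "u5": 1}
language_of = {"u1": "ms", "u2": "ms", "u3": "ms", "u4": "en", "u5": "ja"}
stats = cluster_language_stats(cluster_of, language_of)
```

As a check on the caption, the seven cluster sizes in the table sum to 562,923, which is about 61% of the 916,836 users.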


Page 15: Global connectivity and multilinguals in the Twitter network (slides)

Hypotheses

X Language will have a strong role in structuring the network: the mentions/retweet network will have many clusters composed of a single, dominant language

2 Users engaging with content in multiple languages (multilingual users) will serve as bridges between different clusters of the Twitter network

3 Users primarily writing in less-represented languages will be more likely to cross language boundaries than users writing in highly-represented languages

4 When users cross languages they will cross to larger languages (e.g. English) and thus at a language level English will form more bridges than other languages


Page 16: Global connectivity and multilinguals in the Twitter network (slides)

Bridging role of multilinguals

Figure: Size of the largest, weakly-connected component (left), total number of components (center), and average size of the components (right) created by removing all multilingual users, an equivalent number of monolingual users randomly, an equivalent number of all users randomly, and removing all multilingual users from a network with the same degree distribution but with edges randomly shuffled. Box plots show values from 100 realizations. Mean values are indicated with +.
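
The removal experiment described in this caption can be sketched as follows. The code measures the three quantities (largest component, number of components, mean component size) after removing a chosen set of users, and repeats the measurement for random removals of the same size. All function names are hypothetical, node identifiers are assumed to be sortable, and the degree-preserving edge shuffle is not covered.

```python
import random

def component_sizes(nodes, edges):
    """Sizes of weakly-connected components among `nodes`, largest first (union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        if a in parent and b in parent:  # skip edges touching removed users
            parent[find(a)] = find(b)
    sizes = {}
    for n in nodes:
        root = find(n)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)

def summarize_after_removal(nodes, edges, removed):
    remaining = set(nodes) - set(removed)
    sizes = component_sizes(remaining, edges)
    return {
        "largest": sizes[0] if sizes else 0,
        "n_components": len(sizes),
        "mean_size": sum(sizes) / len(sizes) if sizes else 0.0,
    }

def random_baseline(nodes, edges, candidates, k, realizations=100, seed=0):
    """Remove k random users from `candidates`, once per realization."""
    rng = random.Random(seed)
    pool = sorted(candidates)
    return [summarize_after_removal(nodes, edges, rng.sample(pool, k))
            for _ in range(realizations)]

nodes = ["a", "m", "b", "c"]
edges = [("a", "m"), ("m", "b"), ("b", "c")]
without_m = summarize_after_removal(nodes, edges, ["m"])   # "m" plays the bridging user
baseline = random_baseline(nodes, edges, ["a", "b", "c"], k=1, realizations=5)
```

Comparing `without_m` against the distribution in `baseline` mirrors the comparison of multilingual removal against random monolingual removal.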


Page 17: Global connectivity and multilinguals in the Twitter network (slides)

Hypotheses

X Language will have a strong role in structuring the network: the mentions/retweet network will have many clusters composed of a single, dominant language

X Users engaging with content in multiple languages (multilingual users) will serve as bridges between different clusters of the Twitter network

3 Users primarily writing in less-represented languages will be more likely to cross language boundaries than users writing in highly-represented languages

4 When users cross languages they will cross to larger languages (e.g. English) and thus at a language level English will form more bridges than other languages


Page 18: Global connectivity and multilinguals in the Twitter network (slides)

Variations by language

Figure: Number of users in each language compared to the percentage of these users classified as multilingual


Page 19: Global connectivity and multilinguals in the Twitter network (slides)

Variations by language (Wikipedia)

Figure: Number of users in each language compared to the percentage of those users editing multiple language editions of Wikipedia (arXiv:1312.0976).


Page 20: Global connectivity and multilinguals in the Twitter network (slides)

Hypotheses

X Language will have a strong role in structuring the network: the mentions/retweet network will have many clusters composed of a single, dominant language

X Users engaging with content in multiple languages (multilingual users) will serve as bridges between different clusters of the Twitter network

Ö Users primarily writing in less-represented languages will be more likely to cross language boundaries than users writing in highly-represented languages

4 When users cross languages they will cross to larger languages (e.g. English) and thus at a language level English will form more bridges than other languages


Page 21: Global connectivity and multilinguals in the Twitter network (slides)

Bridging at a language level

[Figure: collapsed language-level network graph. Node labels: ar, de, en, es, fil, fr, it, ja, ko, ms, nl, pt, ru, th, tr]

Figure: A collapsed network graph with users grouped into nodes representing the primary language used. Edges are weighted by the percent error in the expected vs. the actual number of mentions and retweets between language groups. Node size is proportional to the number of users primarily using each language, and node color is the result of a modularity-maximizing community detection algorithm.
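
The slide does not say how the expected number of mentions between two language groups is computed. One plausible reading, used here purely as an assumption, is a random-mixing null model: the expected weight from language a to language b is (total outgoing weight of a) × (total incoming weight of b) / (total weight). The sketch below computes the percent error of the actual count against that expectation; the function name and input format are hypothetical.

```python
from collections import Counter

def language_percent_error(edges, language_of):
    """Percent error of actual vs. expected mention weight between language groups.

    edges: iterable of (source, target, weight) with positive weights
    language_of: dict user -> primary language
    Assumption: expected[a, b] = out_total[a] * in_total[b] / total.
    """
    actual, out_total, in_total = Counter(), Counter(), Counter()
    total = 0
    for src, dst, weight in edges:
        a, b = language_of[src], language_of[dst]
        actual[(a, b)] += weight
        out_total[a] += weight
        in_total[b] += weight
        total += weight
    result = {}
    for a in out_total:
        for b in in_total:
            expected = out_total[a] * in_total[b] / total
            result[(a, b)] = 100.0 * (actual[(a, b)] - expected) / expected
    return result

language_of = {"u1": "en", "u2": "en", "u3": "ja", "u4": "ja"}
edges = [("u1", "u2", 3), ("u3", "u4", 3), ("u1", "u3", 1), ("u4", "u2", 1)]
errors = language_percent_error(edges, language_of)
```

In this toy network each language pair has an expected weight of 2, so within-language pairs come out above expectation and cross-language pairs below it.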


Page 22: Global connectivity and multilinguals in the Twitter network (slides)

Bridging at a language level

Figure: Percentage of remaining users not in the largest-connected component after removing users in different languages. NB 3.7% of remaining users are not in the largest-connected component after removing multilingual users across all languages.


Page 27: Global connectivity and multilinguals in the Twitter network (slides)

Hypotheses

X Language will have a strong role in structuring the network: the mentions/retweet network will have many clusters composed of a single, dominant language

X Users engaging with content in multiple languages (multilingual users) will serve as bridges between different clusters of the Twitter network

Ö Users primarily writing in less-represented languages will be more likely to cross language boundaries than users writing in highly-represented languages

X When users cross languages they will cross to larger languages (e.g. English) and thus at a language level English will form more bridges than other languages


Page 28: Global connectivity and multilinguals in the Twitter network (slides)

Implications and future directions

Implications

Global connectivity results from the combination of multilinguals across many language pairs

Allow users to have multiple preferred languages (over 10% of users used multiple languages)

Important per-language variations

Further work

Construct diffusion cascades of links, retweets

Examine the content of external links, e.g., original language of content

Comparison of language to geography

Working on similar questions with Wikipedia as point of comparison (arXiv:1312.0976)


Page 29: Global connectivity and multilinguals in the Twitter network (slides)


I would like to thank Taha Yasseri, Eric T. Meyer, Sandra Gonzalez-Bailon, Jonathan Bright, Mike Thelwall, and Irene Eleta as well as the anonymous CHI reviewers who provided helpful comments on previous versions of this article.

Page 30: Global connectivity and multilinguals in the Twitter network (slides)

Eleta, I., & Golbeck, J. (2012). Bridging Languages in Social Networks: How Multilingual Users of Twitter Connect Language Communities. Proceedings of the American Society for Information Science and Technology, 49(1), 1–4. Available from http://dx.doi.org/10.1002/meet.14504901327

Graham, M., Hale, S. A., & Gaffney, D. (2013). Where in the world are you? Geolocation and language identification in Twitter. Professional Geographer.

Hecht, B., & Gergle, D. (2010). The Tower of Babel meets Web 2.0: User-generated content and its applications in a multilingual context. In Proceedings of the 28th international conference on human factors in computing systems (pp. 291–300). New York, NY, USA: ACM. Available from http://doi.acm.org/10.1145/1753326.1753370

Hong, L., Convertino, G., & Chi, E. (2011). Language matters in Twitter: A large scale study. In International AAAI conference on weblogs and social media (pp. 518–521). Available from http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2856


Page 31: Global connectivity and multilinguals in the Twitter network (slides)

Raghavan, U. N., Albert, R., & Kumara, S. (2007, September). Near linear time algorithm to detect community structures in large-scale networks. Phys. Rev. E, 76(3), 036106. Available from http://link.aps.org/doi/10.1103/PhysRevE.76.036106

Takhteyev, Y., Gruzd, A., & Wellman, B. (2011). Geography of Twitter networks. Social Networks, 1–26. Available from http://www.sciencedirect.com/science/article/pii/S0378873311000359#FCANote
