A Topic Analysis Approach To Revealing Discussions On The Australian Twittersphere
Brenda Moon, Queensland University of Technology
Introduction
This paper investigates techniques to identify the topics being discussed in one week of tweets from the Australian Twittersphere. Tweets were extracted from the Tracking Infrastructure for Social Media Analysis (TrISMA), a comprehensive dataset that captures all tweets by 2.8m Australian accounts (Bruns, Burgess & Banks et al., 2016).
Selected week: Sunday 2 August to Saturday 8 August 2015
• Thursday 6th August 2015 was used for One Day in the Life of a National Twittersphere (Axel Bruns and Brenda Moon, presented at Social Media and Society, London, 13 July 2016)
• Same day used for initial development of topic modelling approach
• Then extended to full week
Latent Dirichlet Allocation
Blei, D. M. (2011)
Data cleaning
• Remove
– retweets & multitweets (“rt”, “mt” or “via”)
– URLs
– dates, times, distances & weights
– words shorter than 3 characters
– ellipses (‘...’)
• NLTK tokenisation using Twitter Tokenizer
– Remove all @users and URLs
– Lowercase
• Convert
– HTML entities to text
– Hashtags to words (trim ‘#’ off hashtags)
• NLTK lemmatisation
• NLTK stopwords
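The cleaning steps above can be sketched roughly as follows. This is a standard-library approximation, not the pipeline used in the talk: the regular expressions, the `clean_tweet` name and the tiny `STOPWORDS` set are illustrative assumptions, the NLTK tokeniser and stopword list are replaced by simple stand-ins, and lemmatisation is left out. It also assumes that the "Remove" step drops a whole tweet when it contains "rt", "mt" or "via".

```python
import html
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Markers of retweets / modified tweets (assumed to disqualify the whole tweet).
RETWEET_RE = re.compile(r"\b(?:rt|mt|via)\b", re.IGNORECASE)
# Crude stand-in for dates, times, distances and weights: digits plus optional unit.
NUMERIC_RE = re.compile(r"\d+(?:am|pm|km|cm|mm|kg|m|g)?")
# Hypothetical mini stopword list; the slides use NLTK's stopwords instead.
STOPWORDS = {"the", "and", "for", "that", "this", "with", "are", "was"}


def clean_tweet(text):
    """Return a list of cleaned tokens, or None if the tweet is dropped."""
    text = html.unescape(text)           # HTML entities to text
    text = URL_RE.sub(" ", text)         # remove URLs
    text = MENTION_RE.sub(" ", text)     # remove @users
    if RETWEET_RE.search(text):          # drop retweets and modified tweets
        return None
    text = text.replace("#", " ").lower()  # hashtags to words, lowercase
    # Keeping only alphanumeric runs also discards ellipses and punctuation.
    tokens = re.findall(r"[a-z0-9]+", text)
    return [
        t for t in tokens
        if len(t) >= 3
        and not NUMERIC_RE.fullmatch(t)
        and t not in STOPWORDS
    ]


print(clean_tweet("Watching the #Ashes at 10pm &amp; loving it http://t.co/abc @mate ..."))
```

Hashtags are turned into plain words here, so any hashtag-based grouping has to read the hashtags from the raw tweet text before this function is applied.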
Hashtag pooling
• Mehrotra, Sanner, Buntine & Xie (2013) looked at different options of ‘pooling’ tweets into documents before LDA analysis to see if this could increase accuracy. They found that hashtag pooling was effective (best was hashtag pooling with clustering, but more complex to apply)
• Group all the tweets with hashtags into documents for each hashtag (some tweets will be added into more than one document)
• Tweets without hashtags stay as individual documents
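A minimal sketch of the pooling rule described above, assuming hashtags are matched case-insensitively with a simple regular expression; the function name `pool_by_hashtag` and the choice to represent a document as a list of tweets are illustrative, not taken from the talk.

```python
import re
from collections import defaultdict

HASHTAG_RE = re.compile(r"#(\w+)")


def pool_by_hashtag(tweets):
    """Group tweets into documents: one per hashtag, plus one per untagged tweet."""
    pooled = defaultdict(list)
    singles = []
    for text in tweets:
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        if tags:
            # A tweet with several hashtags is added to each matching document.
            for tag in tags:
                pooled[tag].append(text)
        else:
            # Tweets without hashtags stay as individual documents.
            singles.append([text])
    return [pooled[tag] for tag in sorted(pooled)] + singles


docs = pool_by_hashtag(["#ashes great", "#Ashes #auspol hmm", "no tags"])
print(len(docs))
```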
Corpus filtering (Thursday 6 August 2015)
• Raw tweets: 963,064
• After data cleaning: 583,528
• After hashtag pooling: 516,263
– 23% of tweets had hashtags
• Dictionary pruning – remove most frequent and least frequent terms
– no_above=0.5 (fraction of documents), no_below=5 (documents)
– 223,157 unique tokens reduced to 49,964 unique tokens
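The pruning rule can be illustrated with a small standard-library sketch. It is intended to mirror the `no_below` / `no_above` semantics of Gensim's `Dictionary.filter_extremes` (keep a token if it appears in at least `no_below` documents and in no more than `no_above` of all documents); the `prune_vocabulary` helper is hypothetical and ignores Gensim's additional `keep_n` cap.

```python
from collections import Counter


def prune_vocabulary(docs, no_below=5, no_above=0.5):
    """Drop tokens that occur in too few or too many documents.

    docs is a list of token lists. Returns (pruned_docs, kept_vocabulary).
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document

    max_docs = int(no_above * len(docs))  # fraction converted to a document count
    keep = {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}
    pruned = [[tok for tok in tokens if tok in keep] for tokens in docs]
    return pruned, keep


toy_docs = [["a", "b"], ["a", "c"], ["a", "b"], ["a", "d"]]
pruned, vocab = prune_vocabulary(toy_docs, no_below=2, no_above=0.5)
print(sorted(vocab))
```

In the toy example "a" is removed for being in every document, and "c" and "d" for appearing only once.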
Latent Dirichlet Allocation (LDA)
• Gensim LDA (Lau & Baldwin, 2014)
• LdaMulticore
• Identify 30 topics
• 100 passes
Thursday 6th August 2015 – overall terms
https://github.com/bmabey/pyLDAvis
Thursday 6th August 2015 – Topic 2: Politics / coal / China / Queensland
Thursday 6th August 2015 – Topic 5: Cricket – The Ashes
Thursday 6th August 2015 – 30 topics, with hashtag pooling
Topic labelled in figure: MH370
Thursday 6th August 2015 – 30 topics, with hashtag pooling. Comparison to other study
Tentative topic labels in figure: Pop?, Teen culture?, MH370
• 1.1m tweets from 147k, to 224k accounts
• 294k nodes total, including non-Australians
• 535k edges from 856k @mentions / RTs
• Visualisation: Gephi, Force Atlas 2
• Colours: Gephi, modularity resolution 1.0
• Labels assigned through qualitative evaluation
Cluster labels in network figure: Politics, Cricket, Teen Culture, Pop
From “One Day in the Life of a National Twittersphere” by Axel Bruns and Brenda Moon, presented at Social Media and Society, London, 13 July 2016.
Further Outlook
• Confirm initial topic labelling by looking at top tweets for each topic
• Check whether the hashtag pooling has allowed non-hashtag tweet topics to still be visible
• Use statistical coherence of model (U_Mass coherence, C_V coherence) to tune LDA parameters
• Model different numbers of topics (coarse/fine grain)
• Relate topics per user back to our mention network graphs
• Extend to the full week (or longer)
• Compare to alternative approaches
– Doc2Vec / TensorFlow / dynamic LDA etc.
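As a rough illustration of the coherence-based tuning mentioned above, the sketch below computes a UMass-style coherence score for one topic from document co-occurrence counts, following the commonly cited definition (sum over ordered word pairs of log((D(w_m, w_l) + 1) / D(w_l))). The `umass_coherence` helper is hypothetical; library implementations such as Gensim's `CoherenceModel` may differ in smoothing and aggregation, so scores are not directly comparable.

```python
import math


def umass_coherence(top_words, docs):
    """UMass-style coherence of one topic.

    top_words: the topic's top words, most probable first.
    docs: list of token lists (the corpus used to fit the model).
    """
    doc_sets = [set(tokens) for tokens in docs]

    def doc_count(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_count(top_words[l])
            if denom == 0:
                raise ValueError(f"{top_words[l]!r} does not occur in the corpus")
            score += math.log((doc_count(top_words[m], top_words[l]) + 1) / denom)
    return score


toy_docs = [["a", "b"], ["a", "b"], ["a", "c"]]
print(umass_coherence(["a", "b"], toy_docs), umass_coherence(["a", "c"], toy_docs))
```

Higher (less negative) scores indicate that a topic's top words tend to co-occur in the same documents, which is the property one would compare across models fitted with different numbers of topics.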
References
• Blei, D. M. (2011). Introduction to probabilistic topic models. Communications of the ACM, 1–16. Retrieved from http://www.cs.princeton.edu/~blei/papers/Blei2011.pdf
• Mehrotra, R., Sanner, S., Buntine, W., & Xie, L. (2013). Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling. Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, 889–892. http://doi.org/10.1145/2484028.2484166
• Lau, J. H., & Baldwin, T. (2014). An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation.
• Puschmann, C., & Scheffler, T. (2016). Topic modeling for media and communication research: A short primer (HIIG Discussion Paper Series No. 2016–5). Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2836478
• Sievert, C., & Shirley, K. (2014). LDAvis: A method for visualizing and interpreting topics. Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 63–70. Retrieved from http://www.aclweb.org/anthology/W/W14/W14-3110