



Twitter Sentiment Analysis Applied to Land Use in Los Angeles County

Dr. Arthur Huang, David Ebert & Parker Rider
Tarleton State University
Stephenville, TX 76401

[email protected] [email protected] [email protected]

June 2, 2016∗

1 Introduction

How happy is my neighborhood? With the increasing popularity of microblogging sites such as Twitter, Facebook, Google+ and Tumblr, it is easier than ever to apply machine learning approaches to vast amounts of social media data for the purpose of detecting influenza outbreaks Hu et al. (2013), Achrekar et al. (2011), monitoring political sentiments Tumasjan et al. (2010), detecting earthquakes Sakaki et al. (2010), and even identifying drunk tweets Hossain et al. (2016). This research aims to unite Twitter data with machine learning, sentiment analysis, and land use studies to address another question: What is the geography of emotion? Using a corpus of 9.4 million geotagged tweets collected from Los Angeles County, we aim to classify tweets according to their sentiment polarity, perform a cluster analysis to identify the happiest and saddest places in Los Angeles County, and then analyze which neighborhood attributes correlate with the happiness and sadness expressed in tweets. By addressing questions of where and why neighborhoods are happy and sad, we hope that this scalable approach to land-use classification will enable city planners to more efficiently design safe and enjoyable living spaces.

The rest of this paper is organized as follows: section 2 summarizes primary approaches to sentiment analysis and land use studies. Following a description of the data sets and lexicons used in this study in section 3, section 4 details the procedures used to select a lexicon and train a random forest classifier. Finally, we present our results in section 5.

∗The source code for this project is available at www.github.com/dpebert7/Rsentiment. Additionally, updates on the status of the 2016 tweet collection from Los Angeles County can be seen by following @tsutweets on Twitter.




2 Literature review

2.1 Lexical Polarity Classifiers

Within the field of natural language processing, sentiment analysis refers broadly to the task of automatically determining the attitude of a writer with regard to some topic Mohammad (2015). Early work focused on automatically determining the sentiment polarity (i.e. positive vs. negative sentiment) of documents such as online movie reviews using lexical affinity approaches, part-of-speech tagging, and classification techniques from machine learning Turney (2002), Pang et al. (2002). As with other work in natural language processing, this work focused on documents much longer than tweets, which are limited to 140 characters, and in many instances neutral polarities were either nonexistent or ignored.

Among the many difficulties involved with classifying sentiment, those which are particularly important to the polarity classification of tweets include non-standard language, a lack of properly labeled training data, ambiguous meaning, and subjective language Mohammad (2015). In particular, many applications of sentiment analysis require the removal of objective tweets which do not express positive or negative polarity Agarwal et al. (2011), Barbosa and Feng (2010). On the other hand, Pang and Lee (2008) argue that objective text is not only difficult to define and identify, but may also be useful in determining subjective feelings (e.g. ‘It’s rainy.’). For the purposes of our study, we define a tweet’s sentiment polarity (or simply polarity) as a personal positive or negative feeling. We assume that all tweets may express polarity, even if they are objective or are directed at a particular audience or target. This assumption is justified by the fact that most of the geotagged tweets in our corpus come from individuals expressing their thoughts rather than from news organizations, celebrities, etc. Additionally, the tweets are not retweets, but rather are originally composed by the given user.

Because of its wide availability, many prior studies have used geotagged Twitter data. One challenge of using such data for land use studies is that Twitter users are not a representative or random sample of the population. Compared to the rest of the United States population, geotagged users are more likely to live in cities and to have income above the median. Additionally, Twitter users are more likely to be young, Asian, Black and Latino/Hispanic Malik et al. (2015). Moreover, while about 16% of American internet users are active on Twitter, less than 2% of tweets are geotagged Duggan and Brenner (2013), Morstatter et al. (2013).

The simplest approach to classifying polarity is to count positive and negative words from a lexicon. This is easy to implement, since it does not require a training set. However, it is not as robust as other methods, since it usually relies on a lexicon built outside the corpus. Moreover, the lexicon approach requires additional work to account for negations and to remove objective tweets Taboada et al. (2011).



Lexicon classifiers have been most successful when combined with other approaches from natural language processing such as part-of-speech tagging and trained classifiers Cambria et al. (2013).

2.2 Machine Learning Classifiers

Machine learning approaches to sentiment polarity classification have proven to be effective when a training set is available Feldman (2013). However, apart from intensive and expensive manual classification, no acceptable training set of tweets labeled by polarity was available to us. An alternative approach, often called semi-supervised or distant supervision, uses hashtags (e.g. #happy, #sad, #awful) or emoticons (e.g. :D, :(, ☺) to create a training set. While such symbols are not always reliable indicators of polarity for individual sentences, prior research has successfully used distant supervision to train classifiers Taboada et al. (2011).

Past research into sentiment analysis used machine learning algorithms such as support vector machines and random forests to train classifiers Lima et al. (2015). Following prior research into sentiment analysis of tweets, we opted to use random forests. Random forests are an ensemble method in which numerous decision trees are fitted randomly to the data. While an individual tree may not be a good predictor of sentiment polarity, random forests leverage the combined predictions of many decision trees. Random forest classifiers have the advantage of being robust and working well with various types of data, including data with various underlying distributions that are difficult to classify Tan et al. (2006). Unlike lexicon polarity scores, which range from −∞ to ∞, random forest classifiers return values between 0 and 1, where scores near 0 indicate that most decision trees predict sad sentiment and scores near 1 indicate that most decision trees predict happy sentiment.

Following other research, we evaluate the effectiveness of our lexical and machine learning classifiers by comparing the area under the curve (AUC) of the receiver operating characteristic. The receiver operating characteristic (ROC) curve is a graphical representation of the true positive classification rate versus the false positive classification rate. For our balanced, two-class classification, a perfect classification model will have an AUC of 1, while random guessing will have an area under the curve near 0.5. In addition to being a good indicator of a classifier's true positive rate, ROC curves are especially useful for comparing performance between classifiers Tan et al. (2006), including between lexical and machine learning classifiers.
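As a rough illustration, the following R sketch computes the AUC and an accuracy-maximizing cutoff with the pROC package; the label and score vectors are hypothetical stand-ins for a validation set, not our actual data.

library(pROC)

labels <- c(1, 1, 0, 1, 0, 0, 1, 0)                   # true polarity (1 = happy, 0 = sad)
scores <- c(0.9, 0.7, 0.4, 0.6, 0.2, 0.5, 0.8, 0.3)   # classifier output

roc_obj <- roc(response = labels, predictor = scores)
auc(roc_obj)                                               # area under the ROC curve
coords(roc_obj, "best", ret = c("threshold", "accuracy")) # accuracy-maximizing cutoff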

2.3 Urban Land Use Studies

This section needs to be greatly expanded once we have a better idea of what kind of cluster analysis, regression, and/or statistical tests we plan on running using the latitude/longitude part of the tweets.



3 Data and Lexicons

The primary data set used in this analysis, which we will call LA2014, is a corpus comprising 9.4 million geotagged tweets collected from Los Angeles County between January 1, 2014 and November 13, 2014. This corpus was provided by ?who gave us the set?. During the data preprocessing (described in section 4), about 200,000 blank tweets and 200,000 Spanish tweets were removed, leaving nearly 9 million tweets for analysis. The tweets were made by 41,941 unique usernames, each posting between 1 and 3,249 tweets, with an average of nearly 214 tweets per user. Figure 1 displays the number of tweets made for each username.

Figure 1: Tweets per unique username in LA2014

In addition to the tweets collected from Los Angeles County, our analysis used a smaller set of tweets made available by Go et al. (2009), called Sentiment140. This data set consists of 498 tweets collected using Twitter's REST API with various search queries. The 498 tweets were hand-classified into sets of 177 negative, 139 neutral, and 182 positive tweets. The positive and negative tweets were used to assist in the selection of lexicons and to validate the machine learning model.

To assist in the training of a machine learning classifier, we compared the following four lexicons:

• The NRC Word-Emotion Association Lexicon consists of 5,636 terms, each of which was scored for polarity, arousal, and dominance. We assigned a score of 1 to words with positive polarity and −1 to words with negative polarity. The lexicon was created by ??.



• The AFINN Lexicon, constructed by Nielsen (2011), is a small set of 2,477 words, mostly taken from the ANEW Lexicon, with an emphasis on social media vocabulary, including slang and abbreviations. Words in this lexicon were manually scored from −5 to 5, with negative scores assigned to words with negative polarity and positive scores assigned to words with positive polarity.

• The OpinionFinder Lexicon, created by Wilson et al. (2005), classifies a set of words and phrases as positive, neutral, or negative. OpinionFinder extends the Multi-Perspective Question Answering (MPQA) data set. The creators used a group of human annotators to assign a sentiment to selected words. Words with inconsistent assignments were then removed, leaving a list of 6,884 English words, each classified as having positive (1) or negative (−1) polarity.

• Finally, the ANEW Lexicon is the work of Bradley and Lang (1999), expanded upon by Warriner et al. (2013). The lexicon consists of 13,915 lemmas, each of which is assigned a sentiment polarity score between 1 and 9, taken as the average score assigned by a group of human scorers; the raw scores range from 1.26 to 8.53. To center the lexicon, we subtracted the mean word score from each word's score, so that negative values indicate negative polarity (a one-line R sketch of this centering follows the list).
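The centering step for the ANEW scores is a one-liner in R; the following sketch uses a hypothetical three-word excerpt (in practice the mean is taken over all 13,915 lemmas).

# Hypothetical excerpt of the ANEW/Warriner valence scores (1-9 scale).
anew <- data.frame(word  = c("vacation", "murder", "table"),
                   score = c(8.53, 1.48, 4.90))
# Center the scores: negative values now indicate negative polarity.
anew$centered <- anew$score - mean(anew$score)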

4 Methodology

[Figure 2 is a flowchart: the LA2014 tweets flow through the steps "Remove emoticon tweets", "Clean & tokenize tweets", and "Classify tweets"; the emoticon tweets are used to train the random forest classifier, and Sentiment140 is used to validate the lexicon (AFINN).]

Figure 2: Summary of Research Methodology

A summary of the methodology we used is given in figure 2.

4.1 Emoticon Tweets

Prior to cleaning the data, we pulled a semi-supervised validation set from the data. Since the data included few hashtag-identified tweets (i.e. tweets including hashtags with strong polarity such as #happy, #sad, etc.), we used common emoticons instead. All tweets were checked for 11 emoticons indicating positive polarity, including :), (:, :-), (-:, :D, :-D, =), and (=. Additionally, the tweets were checked for 9 emoticons indicating negative polarity, including :(, :-(, ):, )-:, :[, :{, }:, =(, and )=.



In LA2014, over 150,000 tweets included a happy emoticon and nearly 50,000 tweets included a sad emoticon, together accounting for nearly 2% of the data. These tweets were assigned positive and negative polarity before the cleaning and sentiment analysis steps and were used to (1) select the most appropriate lexicon (section 4.3), and (2) train a classifier (future work). Word clouds illustrating common words from this semi-supervised set are shown in figure 3. Figure 4 gives a few examples of tweets from the emoticon set, showing that while the tweets usually contain appropriate positive or negative sentiment, some tweets in the set are nevertheless not very useful for classification because they are non-English or sarcastic.
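The labeling step can be sketched in R as follows; the emoticon lists are abbreviated to the ones listed above, and the sample tweets are invented for illustration.

happy_emoticons <- c(":)", "(:", ":-)", "(-:", ":D", ":-D", "=)", "(=")
sad_emoticons   <- c(":(", ":-(", "):", ")-:", ":[", ":{", "}:", "=(", ")=")

# TRUE wherever the text contains any of the given literal patterns.
contains_any <- function(text, patterns) {
  Reduce(`|`, lapply(patterns, function(p) grepl(p, text, fixed = TRUE)))
}

tweets <- c("great show tonight :)", "missed the bus :(", "lunch time")
happy <- contains_any(tweets, happy_emoticons)
sad   <- contains_any(tweets, sad_emoticons)

# Keep only unambiguous tweets: emoticons of exactly one polarity present.
polarity <- ifelse(happy & !sad, 1, ifelse(sad & !happy, 0, NA))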

Figure 3: Frequent terms found in 150,000 happy (left) and 50,000 sad (right) semi-supervised tweets from LA2014.

4.2 Cleaning

Next, the data were cleaned in the following manner (a minimal R sketch of these steps appears after the list):

• tokenize urls, retweets, usernames, and hashtags as follows:

      Text                             Cleaned Text
      Iconic http://t.co/vWceYgGypw    iconic url
      All your fault @khoagie8         all your fault username
      Now I want Starbucks! #thirsty   now i want starbucks hash thirsty

• remove punctuation, numbers, and non-English characters

• set all text to lowercase

• shorten repeated characters to at most two letters, e.g. yesssss! becomes yess

• remove all leading and trailing spaces

• remove all remaining tweets that are fewer than 3 characters long

Figure 4: Example tweets from the emoticon semi-supervised tweets
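The sketch below implements this cleaning pipeline in base R; the regular expressions are illustrative rather than the exact ones used in our code.

clean_tweets <- function(tweets) {
  x <- tweets
  x <- gsub("http[^[:space:]]+", "url", x)   # tokenize URLs
  x <- gsub("@\\w+", "username", x)          # tokenize usernames
  x <- gsub("#(\\w+)", "hash \\1", x)        # tokenize hashtags
  x <- tolower(x)                            # lowercase
  x <- gsub("[^a-z ]", "", x)                # drop punctuation, numbers, non-English characters
  x <- gsub("(.)\\1{2,}", "\\1\\1", x)       # yesssss -> yess
  x <- gsub("^\\s+|\\s+$", "", x)            # trim leading/trailing spaces
  x[nchar(x) >= 3]                           # drop tweets under 3 characters
}

clean_tweets("Now I want Starbucks! #thirsty")
# [1] "now i want starbucks hash thirsty"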

4.3 Lexicon Selection

Before applying sentiment analysis, the four lexicons were evaluated to determine which produced the highest accuracy and area under the receiver operating characteristic curve when applied to both Sentiment140 and the emoticon tweets. A list of 21 negations was used to reverse the score of words immediately preceded by a negation.1
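A sketch of this scoring scheme in R is given below; the three-word lexicon and five-word negation list are hypothetical stand-ins for AFINN and our 21-negation list.

afinn <- data.frame(word  = c("like", "awful", "happy"),
                    score = c(2, -3, 3))
negations <- c("not", "dont", "cant", "never", "no")

score_tweet <- function(tokens, lexicon, negations) {
  scores <- lexicon$score[match(tokens, lexicon$word)]
  scores[is.na(scores)] <- 0
  # Reverse the score of any word immediately preceded by a negation.
  negated <- c(FALSE, head(tokens, -1) %in% negations)
  sum(ifelse(negated, -scores, scores))
}

score_tweet(c("i", "dont", "like", "mondays"), afinn, negations)
# [1] -2   ("like" scores +2, reversed by the preceding "dont")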

Table 1 shows the results of the four lexicons when applied to the non-neutral tweets from Sentiment140, and table 2 gives the results when the four lexicons are applied to the emoticon tweets. Both tables report the area under the receiver operating characteristic curve, the cutoff value that yielded the highest accuracy, and the classification accuracy achieved at the optimal cutoff point. In all cases, values at or below the cutoff value are classified as negative, and values above the cutoff value are classified as positive. In all cases the baseline accuracy is 50%.

1We still need to do more work to find out how effective this is, or whether we should just allow negations to cancel out the words they precede. For example, while the phrase ‘don’t like’ probably indicates sad polarity, a phrase like ‘can’t love’ might portray a more neutral or even positive sentiment.



Tables 1 and 2 indicate that the AFINN lexicon has higher accuracy and AUC than the other lexicons for both validation data sets. They also indicate that higher AUC and accuracy measures were obtained over the Sentiment140 data. In particular, the classification accuracies in table 2 are in some cases barely above the baseline of 50%. There are several possible explanations for these low values:

• None of the lexicons match the emoticons contained in the emoticon tweets. In many cases, the tweets are too short to match many English words, and a sizable portion are non-English, e.g. “Lo extrano :(” (Spanish for “I miss him :(”).

• The emoticon data is noisy, containing tweets that are incorrectly classified based on the emoticon in the tweet, e.g. “@celebhelpers heeellpppp? :)”.

• On the other hand, the Sentiment140 tweets all have clear, human-assigned polarity, and only a few contain emoticons.

Overall, though the accuracy of the AFINN lexicon applied to Sentiment140 is a satisfactory 77.7%, the accuracy of 63.5% over the emoticon data indicates that we will need to improve this classifier before using it more broadly on tweets from Los Angeles County. To address this problem, we next turn to a machine learning classifier.

Table 1: Accuracy of lexicons applied to Sentiment140 data

Lexicon         AUC     Cutoff   Accuracy
AFINN           0.852   1        77.7%
ANEW            0.796   2.245    73.5%
OpinionFinder   0.779   1        72.1%
NRC             0.743   1        67.7%

Table 2: Accuracy of lexicons applied to Emoticon data

Lexicon         AUC     Cutoff   Accuracy
AFINN           0.686   2        63.5%
ANEW            0.671   1.95     63.0%
OpinionFinder   0.647   1        60.6%
NRC             0.620   1        58.9%

4.4 Trained Classifier

The procedure for training a machine learning classifier over the emoticon data has 3 steps:

1. Use a norm-difference sentiment index (NDSI) approach to build a lexicon of happy and sad words from the emoticon tweets.



2. Use the NDSI lexicon to build a term-document matrix over the emoticon tweets.

3. Train and test a random forest classifier over the term-document matrix.

We begin by selecting words from among the emoticon tweets that are good predictors of whether a tweet is happy or sad. To do this we consider the norm-difference sentiment index (NDSI) score for each word occurring in at least 0.1% of tweets. Given a term t, sets of happy (☺) and sad (☹) tweets, and a smoothing term α added to each class count, the score is calculated as follows:

NDSI(t) = |n(t|☺) − n(t|☹)| / (n(t|☺) + n(t|☹) + 2α)

In this way, all terms that occur in the emoticon tweets are scored from 0 to 1, with scores closer to 1 indicating that the term occurred much more frequently in one polarity class than in the other. A sample of the 5,570 words in the NDSI lexicon is given in table 3, sorted by NDSI score. For our calculations we used the constant α = 128 to penalize words that occur infrequently in the corpus. To limit the processing required, and since most of the words in the NDSI lexicon have low NDSI scores, our model uses only the 1,024 words with NDSI scores above 0.05.
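The scores in table 3 can be reproduced with a few lines of R; the function below applies the formula above, with the smoothing term added to each class count.

ndsi <- function(n_happy, n_sad, alpha = 128) {
  abs(n_happy - n_sad) / (n_happy + n_sad + 2 * alpha)
}

ndsi(1976, 120)   # "thank" -> 0.789
ndsi(59, 1277)    # "sad"   -> 0.765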

Rank   Term       n(t|☺)   n(t|☹)   NDSI(t)
1      thank      1976     120      0.789
2      sad        59       1277     0.765
3      thanks     2137     177      0.762
4      birthday   1698     194      0.700
5      happy      2458     346      0.690
6      miss       453      2380     0.623
...
100    leave      103      279      0.275
...
998    santa      42       25       0.052
...
5570   shes       184      185      0.001

Table 3: Selected words from the NDSI lexicon, sorted by NDSI score

The next step involves creating a term-document matrix over the emoticon data. This sparse matrix has 93,046 rows, one for each tweet in the emoticon corpus, and 1,026 columns. The first column indicates the sentiment polarity of the tweet (1 for happy, 0 for sad), the second column gives the AFINN score of the tweet2, and the remaining 1,024 columns correspond to the words in the NDSI lexicon. Each (i, j + 2) entry of the matrix indicates the number of times term j occurred in tweet i. An example term-document matrix is given in table 4.
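A sketch of the matrix construction using the Matrix package follows; the token lists and the four-word lexicon are invented for illustration, and the polarity and AFINN columns would be bound on afterwards.

library(Matrix)

ndsi_words <- c("thank", "sad", "miss", "birthday")
tweets <- list(c("thank", "you", "so", "much"),
               c("miss", "you", "already"),
               c("sad", "sad", "day"))

# Sparse matrix of term counts: one row per tweet, one column per term.
dtm <- Matrix(0, nrow = length(tweets), ncol = length(ndsi_words),
              sparse = TRUE, dimnames = list(NULL, ndsi_words))
for (i in seq_along(tweets)) {
  dtm[i, ] <- as.numeric(table(factor(tweets[[i]], levels = ndsi_words)))
}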

Finally, the term-document matrix is used to train a random forest classifier. The random forest model consisted of 500 trees and was trained over a set of 20,000 happy tweets and 20,000 sad tweets selected at random from the emoticon data set.

2The AFINN scores have not yet been incorporated into these results. Coming soon...



          polarity   AFINN score   term 1   term 2   term 3   term 4   ...
tweet 1   1          6             0        0        0        1        ...
tweet 2   0          -4            0        0        0        0        ...
tweet 3   1          3             0        0        1        0        ...
tweet 4   1          0             0        0        0        0        ...
tweet 5   0          2             3        0        0        0        ...
...

Table 4: Term-document matrix used to train classifiers

The rest of the emoticon tweets were used as test data, and Sentiment140 was also used for validation. Table 5 provides the test results of the random forest classifier. Compared to the AFINN lexicon alone, the random forest is far more successful in classifying tweets from the emoticon data set, with an AUC increase of nearly 0.15. Figure 5 shows a density plot of the random forest scores, illustrating the random forest classifier's effectiveness in distinguishing happy and sad tweets.
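The training and scoring steps can be sketched with the randomForest package as follows; train and test are hypothetical data frames shaped like table 4, with the polarity label in the first column.

library(randomForest)

# Fit 500 trees on the term counts (and AFINN scores), excluding the label column.
rf <- randomForest(x = train[, -1], y = as.factor(train$polarity),
                   ntree = 500)

# Fraction of the 500 trees voting "happy": near 1 = happy, near 0 = sad.
scores <- predict(rf, newdata = test[, -1], type = "prob")[, "1"]
predicted <- as.integer(scores > 0.515)   # cutoff from table 5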

Test Tweets   AUC     Cutoff   Accuracy
Emoticon      0.831   0.515    77.8%
Sent140       0.777   0.602    72.1%

Table 5: Accuracy of Random Forest Classifier

Figure 5: Random forest densities over the Sentiment140 and emoticon tweets

4.5 Cluster Analysis

Someday...



5 Results

Report on results of applying the lexicon and random forest classifier to the LA2014 data.

6 Conclusions

This section will need to be expanded later, once results are confirmed. Here’s a preview:

Definite: The AFINN lexicon is more effective than many others at classifying tweets’ sentimentpolarity, though used on its own it is not a very accurate classifier.

Probable: The random forest classifier trained over an NDSI lexicon yields high accuracy whenpredicting a tweet’s sentiment polarity, especially when AFINN scores are incorporated.

Uncertain: Results from applying sentiment scores to land use. What independent variables aregood predictors of happiness? Where are the happy or sad clusters in LA county?

References

Xia Hu, Jiliang Tang, Huiji Gao, and Huan Liu. Unsupervised sentiment analysis with emotional signals. In Proceedings of the 22nd International Conference on World Wide Web, pages 607–618. International World Wide Web Conferences Steering Committee, 2013.

Harshavardhan Achrekar, Avinash Gandhe, Ross Lazarus, Ssu-Hsin Yu, and Benyuan Liu. Predicting flu trends using Twitter data. In Computer Communications Workshops (INFOCOM WKSHPS), 2011 IEEE Conference on, pages 702–707. IEEE, 2011.

Andranik Tumasjan, Timm Oliver Sprenger, Philipp G Sandner, and Isabell M Welpe. Predicting elections with Twitter: What 140 characters reveal about political sentiment. ICWSM, 10:178–185, 2010.

Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes Twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, pages 851–860. ACM, 2010.

Nabil Hossain, Tianran Hu, Roghayeh Feizi, Ann Marie White, Jiebo Luo, and Henry Kautz. Inferring fine-grained details on user activities and home location from social media: Detecting drinking-while-tweeting patterns in communities. arXiv preprint arXiv:1603.03181, 2016.

Saif M Mohammad. Sentiment analysis: Detecting valence, emotions, and other affectual states from text. Emotion Measurement, 2015.

Peter D Turney. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 417–424. Association for Computational Linguistics, 2002.

Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, pages 79–86. Association for Computational Linguistics, 2002.



Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow, and Rebecca Passonneau. Sentiment analysis of Twitter data. In Proceedings of the Workshop on Languages in Social Media, pages 30–38. Association for Computational Linguistics, 2011.

Luciano Barbosa and Junlan Feng. Robust sentiment detection on Twitter from biased and noisy data. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 36–44. Association for Computational Linguistics, 2010.

Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2):1–135, 2008.

Momin M Malik, Hemank Lamba, Constantine Nakos, and Jürgen Pfeffer. Population bias in geotagged tweets. In Ninth International AAAI Conference on Web and Social Media, 2015.

Maeve Duggan and Joanna Brenner. The demographics of social media users, 2012. Volume 14, 2013.

Fred Morstatter, Jürgen Pfeffer, Huan Liu, and Kathleen M Carley. Is the sample good enough? Comparing data from Twitter's streaming API with Twitter's Firehose. arXiv preprint arXiv:1306.5204, 2013.

Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. Lexicon-based methods for sentiment analysis. Computational Linguistics, 37(2):267–307, 2011.

Erik Cambria, Björn Schuller, Yunqing Xia, and Catherine Havasi. New avenues in opinion mining and sentiment analysis. IEEE Intelligent Systems, (2):15–21, 2013.

Ronen Feldman. Techniques and applications for sentiment analysis. Communications of the ACM, 56(4):82–89, 2013.

Ana Carolina ES Lima, Leandro Nunes de Castro, and Juan M Corchado. A polarity analysis framework for Twitter messages. Applied Mathematics and Computation, 270:756–767, 2015.

Pang-Ning Tan, Michael Steinbach, Vipin Kumar, et al. Introduction to Data Mining, volume 1. Pearson Addison Wesley, Boston, 2006.

Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1:12, 2009.

Finn Årup Nielsen. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903, 2011.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354. Association for Computational Linguistics, 2005.

Margaret M Bradley and Peter J Lang. Affective norms for English words (ANEW): Instruction manual and affective ratings. Technical Report C-1, The Center for Research in Psychophysiology, University of Florida, 1999.

Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207, 2013.
