Tweet Rises: Twitter Sentiment Analysis - Caltech .Tweet Rises: Twitter Sentiment Analysis...
Embed Size (px)
Transcript of Tweet Rises: Twitter Sentiment Analysis - Caltech .Tweet Rises: Twitter Sentiment Analysis...
Tweet Rises: Twitter Sentiment Analysis
Aleksander BelloCalifornia Institute of
Alexandru CiocCalifornia Institute of
Victor DuanCalifornia Institute of
Archan LuharCalifornia Institute of
Louis OBryanCalifornia Institute of
ABSTRACTThis paper focuses on the work of the California Instituteof Technology CS 145 group: Tweet Rises. It focuses ona combination of Twitter sentiment analysis and effectiveweb-based visualization.
The group worked with several forms of natural languageprocessing and machine learning, and with two primary vi-sualization methods.
Categories and Subject DescriptorsH.3.5 [Online Information Services]: Web-based ser-vices; H.5.3 [Group and Organization Interfaces]: Web-based interaction
KeywordsTwitter, sentiment analysis, 2D visualization
1. INTRODUCTIONHow is the world feeling right now? That is a hard ques-tion so to make it easier and narrow the scope down to thequantifiable, we ask, how is Twitter feeling right now? Wepropose here an application and underlying infrastructure tocategorize Tweets based on emotional content, create a rep-resentative sample, and visualize the sentiments on a mapin a web browser in real-time.
2. FRONTENDIn order to make our work available to the widest audience,we decided to work on visualizing our results within a webbrowser. A major challenge was in, first, developing a work-ing prototype and in, next, iterating upon it on a weekly
Figure 1: Heatmap of Twitter sentiment.
basis. Among our most important priorities was to preventour clients from being flooded with too much informationand, from the other standpoint, still providing enough in-formation to make viewing the website meaningful.
Our development primarily utilized the Google Maps APIin order to achieve a fast and effective visualization. GoogleMaps allowed us to focus on the visualization itself, andsaved us the time of having to create a scalable map of theUnited States.
Originally, we intended on visualizing our results using aheat map, but quickly discovered that a more understand-able mode of communication was to use a state map. Ourprimary concern with using the heat map was that the GoogleMaps API heatmap naturally scales based on the amount ofdata it receives. This means that, within seconds, statesand cities with small populations have their data pointseffectively reduced to obscurity, and our entire heat maptherefore only shows data for cities like Los Angeles, SanFrancisco, and New York City. As previously mentioned,we therefore focused on a different type of visualization, thestate map.
Our idea for using our state mapcame from analyzing vari-ous forms of 2D visualizations, and seeing that a particularlyunderstandable visualization came from geographic electoralmaps during U.S. elections. This forms of visualization de-picts an entire state with a solid color that indicates that
Figure 2: Example US electoral map.
states political preference. We believed that, since peoplehave already been predisposed to understand these mapsfrom news coverage, we could lower the barrier to entry forunderstanding what our data actually depicts. Therefore,in our own work, we decided to create an independent ge-ometric shape for each state using the Google Maps API,and then colored the state as the average color of the Tweetsentiments it received. For visualization purposes, we letnegative emotions be depicted in red and positive emotionsbe depicted in blue.
An issue that quickly arose was that, in averaging the colorover every Tweet received, every state would, given an ad-equate amount of time, become the same shade of purple -roughly an equal amount of redand blue sentiments. Webelieve that this makes sense since, over a prolonged periodof time, we should notice an equal amount of both posi-tive and negative sentiments since Twitter has a very wideaudience so whenever a wave of positive or negative senti-ments are shown, people often respond with a contradictingsentiment.
In order to alleviate this, we decided to allow for specifica-tion of how many Tweets to average over. Thus, by loweringthe number to something more manageable, like 10 Tweets,we see a more meaningful rise and fall of positive and nega-tive sentiments.
Our final touches to our frontends visualization came in theform of sidebar indicating trending topics and their overallsentiments. Clicking on a topic allowed users to view the vi-sualization for that single topic. This provided more mean-ingful information for users who aimed at gauging overallsentiments for a single topic as opposed to the overall Twit-ter Tweet stream. Even further, the sidebar allows usersto estimate, at a glance, what a topics sentiment is. Thiscould potentially be seen as an exploratory tool, since itgives users a chance to notice outliers - topics that mightbe heavily weighed towards one sentiment, and then easilygives them access to see the map for that topic.
Overall, we believe our techniques provided for an effectivevisualization of our Twitter sentiment data.
Figure 3: State map of Twitter sentiment.
3. NATURAL LANGUAGE PROCESSINGThe first step to analyzing the Tweets is natural languageprocessing (NLP). The techniques used to classify the Tweetsall focus on a bag of words approach. As such, the NLP por-tion of the project focused on developing an efficient way toget the best bag of words from any given Tweet.
Initially, we eliminate obvious stop words that dont con-tribute to the content of a Tweet. This list includes wordssuch as a, I, she, etc. Afterwards, we define words asa string of alphabetic characters with whitespace on bothsides. Note that this ignores things such as numbers andemoticons. Once the set of words in each Tweet has beencomputed for each Tweet in the training data, mutual infor-mation is used to determine the words that provide the mostinsight to the content of the Tweets. For our purposes weused about 1000 words. With these, each Tweet was thuscharacterized by which of these 1000 words appeared. Forexample, for the Tweet I am not happy. He is not happyand the mutual information words not and happy, theTweet would be characterized as [not, happy]. Note thatthe number of times a word appears is not taken into ac-count.
Once a Tweet has been characterized by the above steps,it is passed along to the machine learning portion of theclassification.
4. MACHINE LEARNINGOur machine learning methods consisted of four algorithms:Nave Bayes, Stochastic gradient descent, Support vectormachines, and Maximum entropy. Our first implementationwas Nave Bayes, due to its simplicity. Nave Bayes predictsthe classification of an observation by providing a partic-ularly simple formula for the probability that an outcomeC is observed given that there are features F1, F2, ..., Fn inthe observation. These probabilities can be compared foreach outcome to find the most likely one. Specifically, themodel assumes that the features variables F1, F1, ..., Fn areindependent, and the assumption implies that
p(C|F1, F2, ..., Fn) =1
where Z = p(F1, F2, , Fn) is the evidence for these features.In our case, the outcome C was whether the Tweet had
Figure 4: Performance of Naive Bayes, Maximumentropy, and Support vector machines algorithmsfor up to 10,000 training Tweets.
positive or negative sentiment, the features were the wordsdetermined by the mutual information algorithm, and theprobabilities p(Fi|C) were determined from the training datadepending on their appearance rate. So, the formula gave away to compare the likelihood of the two sentiments giventhe words in the Tweet.
One problem with this approach was that if a word did notappear in both positive and negative sentiment Tweets, theprobability p(Fi|C) was zero for one of the outcomes. Al-though these terms were unlikely, they occurred in our calcu-lation when we limited the number of training Tweets. Wedecided to simply leave these terms out of our calculation.
The other three algorithms, stochastic gradient descent, sup-port vector machines, and maximum entropy, were imple-mented in the python scikit-learn package. So, our workwith these algorithms mainly involved tuning the parame-ters to the scikit-learn functions. For example, we changedthe loss function, number of iterations, learning rate, andwhether or not to fit the intercept for stochastic gradientdescent.
Using 10,000 training Tweets, all algorithms but stochas-tic gradient descent had an accuracy rate between 65% and75% on the test data set. The performance of our stochas-tic gradient descent implementation was poor, so we left itout in the end. The support vector machines algorithm per-formed better than the other algorithms when the numberof training Tweets was less than 10,000, achieving over 70%accuracy.
Only our Nave Bayes algorithm was able to process signif-icantly larger training sets in a reasonable amount of time.We were able to train Nave Bayes on 1.6 million Tweets,which gave the algorithm almost 80% accuracy, outperform-ing the others. The other algorithms may have performedbetter with the same amount of training data, but they tooksignificantly longer at only 20,000 Tweets, so this was im-practical to test.
5. BACKENDThe first thing that needs to be done before we can produceany results is to have the raw Tweets, i.e. the text and ge-olocation information. This is obtained by the Twitter 1%firehose API. There is a persistent connection between ourbackend and Twitter that continuously streams new Tweets
Figure 5: Performance of Naive Bayes classifier fo