City Happenings into Wikipedia Category: Classifying Urban...
Transcript of City Happenings into Wikipedia Category: Classifying Urban...
City Happenings into Wikipedia Category:Classifying Urban Events by Combining Analyses of
Location-based Social Networks and Wikipedia
Shoya SatoTakuro Yonezawa
Jin NakazawaKeio University
Kanagawa, Japan{shoya,takuro,jin}@ht.sfc.keio.ac.jp
Satoshi KawasakiKen Ohta
Hiroshi InamuraNTT Docomo, Inc.Kanagawa, Japan
{satoshi.kawasaki.vx,ootaken,inamura}@nttdocomo.com
Hideyuki TokudaKeio University
Kanagawa, [email protected]
ABSTRACTRecently, many researchers have been focusing on the detec-tion and classification of urban events by information analy-sis on social networks. Previous works mainly use text anal-ysis of users’ posts on social networks for detecting urbanevents. However, this approach has a limitation that theusers’ posts must mention the event for the analysis to beconducted. We propose a new method for classifying urbanevents by extracting user interest from the location-basedsocial network information without text analysis. The pro-posed method includes analyzing common friends of users inthe vicinity of the event venue and extracting common thefriends’ attributes by referring to related Wikipedia informa-tion. We designed and implemented the proposed method,and conducted an experiment for evaluating our method.Our experimental result shows that our method can classifyevents well in cases where participants have similar inter-ests.
CCS Concepts•Human-centered computing → Social network analy-sis;
KeywordsUrban event; classification; SNS; Wikipedia
1. INTRODUCTIONVarious events are conducted in cities, including sports
games, festivals, etc. Recognizing these urban events real-time is one of the key functionalities for making cities smarterin terms of disaster/emergency management, disease/health
management, traffic management, and further urban plan-ning. To detect and classify urban events, many researchershave been focusing on analyzing user-generated content-informationfrom location-based social networks, such as geo-tagged tweetsfrom Twitter [7]. The analysis approaches are mainly classi-fied into the following two categories (or their cross analysis):semantic analysis [3, 9, 2] and spatiotemporal signal anal-ysis [5, 1, 8]. Semantic analysis leverages the fact that theparticipants in urban events post text information about theevents. By extracting hot topics or burst information fromthe content, urban events can be detected and classified.However, since this approach relies on the assumption thatthere are a number of text contents that mention an urbanevent, applying this approach to the urban events not men-tioned considerably is difficult. In addition, even if there isconsiderable content, it always contains noisy information,which is not related to the events. However, spatiotem-poral signal analysis focuses on the geolocation and times-tamp of content, instead of semantic information. The ap-proach analyzes the spatio-temporal pattern of signals fromthe location-based social network to detect abnormality incities. However, though this approach works well for de-tecting abnormalities, achieving classification-inferring thetype of urban events is difficult. Therefore, both seman-tic and spatiotemporal approaches still have limitations inthe detection and classification of urban events based onlocation-based social network information.
In this paper, we propose an alternative simple approachfor detecting and classifying urban events to resolve thisproblem. We focus on common followee information froma location-based social network. We assume that peopleparticipating in an event have similar interests, and theseinterests are reflected on their followees. For example, par-ticipants of a football game are generally interested in foot-ball, and they would also follow football related users onsocial networking sites. Therefore, by analyzing commonfollowees of participants on social networks, the characteris-tics of the event could be inferred. Our classification methodfirst involves obtaining common followees from the socialnetwork of users who tweet from the vicinity of an urbanevent site. It then weighs each common followee. Highlyweighted common followees usually include famous users,who already appear in Wikipedia entries. Therefore, we
Geo-tagged Tweets
Extract Users
(Possible) Event Attendees
(Possible) Event Attendees
…Common Followees
…Common Followees
Weighting
…
Common FolloweesFeature Extraction
Urban EventClassification
0.4 0.25 0.2 0.15
0.4 0.25 0.2
…
…
Figure 1: Overall process of proposed urban event classification
adopt the Wikipedia category of the common followees forurban event classification.
We designed and implemented an interactive analysis toolthat grabs and classifies events based on the common fol-lowees information. We classified sixteen urban events byusing this method and compared the results with the groundtruth, which was achieved by manual classification of thoseevents. From the result, we could confirm that classificationby our method achieves a higher correlation with groundtruth for the events that gather selective participants, suchas a baseball game, which is attended by the fans of twoteams, and a comic market, where otakus gather, as com-pared to generic events, e.g., a nature viewing event. Thecontributions of this paper are as follows:
• Presenting a new method for urban event classificationbased on combining the analysis of common followeesof event participants and Wikipedia entries
• Designing and implementing an interactive web toolthat easily captures and classifies urban events
• Evaluating the effectiveness of our approach by com-paring classification results of our method with eventclassification of human impression of the events
2. EVENTS CLASSIFICATION BASED ONCOMMON FOLLOWEES ANALYSIS
2.1 Overall ProcessAs location-based social network data sources, we use geo-
tagged tweet information from Twitter. Figure 1 shows theoverall process of urban event classification. First, we ex-tract SNS users who post textual messages from the vicinityof an event venue when the event is being conducted. Sec-ond, we extract the common followees weighting factor fromthe user set. Common followees imply user accounts beingfollowed by more than two users from a user set. Then, weclassify each common followee by referring to a Wikipediaarticle related to the common followee. Finally, we use theoverall result of classified information as classification of theurban event. Details of each process are explained in thefollowing sections.
2.2 Common Followees WeightingThis process extracts common followees that are uniquely
exposed in the events, and make their weighting. A more
straightforward approach is simply counting the number ofusers from the user set following each common follower.However, this simple approach hides important informationregarding the common followees being uniquely followed bythe users of the event. For example, through Twitter, wefound that @masason user always appears as a popular com-mon followee from user sets of many events. The @masasonuser has most followed account in Japan (2,595,630 follow-ers). Such famous users are recommended by Twitter whena user registers on Twitter. These users may possibly showup as noise in the classification of events. Therefore, pickingup common followees who are uniquely followed by users ofthe events is necessary. To remove noise, we calculate theratio of following by the user sets from exact events andentire Twitter space. With the comparison information, weextract the common followee weighting factor for the events.
For detailed processing, we put a set of common followeesas F = {f0, f1, ...., fn}. EventFollow(fx) represents eachcommon followee’s total follow number by Twitter users ob-served in the event. We apply EventIndex(fx) as popularityof the common followee among twitter users observed in theevent:
EventIndex(fx) =EventFollow(fx)n∑
i=0
EventFollow(fi)
Then, we also calculate BasicIndex, which represents theoriginal popularity of each common followee among a setF in entire Twitter space. BasicFollow(fx) represents thenumber of followed of each common followee by all Twitterusers.
BasicIndex(fx) =BasicFollow(fx)n∑
i=0
BasicFollow(fi)
Finally, we calculate FeatureIndex of each common fol-lowee, a weighting factor of common followees, which isuniquely exposed from Twitter users observed in the event:
FeatureIndex(fx) = EventIndex(fx) − BasicIndex(fx)
If FeatureIndex of a common followee is greater thanzero, it implies that the common followee is obtaining morefollowers than usual. Through this process, we obtain thelist of common followees with unique weighting information
when the urban events are conducted. Then, we classifyeach common followees’ character.
2.3 Common Followees Classificationby Wikipedia Category
Next, we infer the categories of the extracted common fol-lowees. To investigate characteristics of twitter users, sev-eral research works focused on analyzing textual informationfrom tweets and additional metadata. For example, Marcoet.al.[6] conducted a linguistic Twitter analysis to infer theTwitter users’ political affiliation, ethnicity identification,etc. These approaches work effectively in detecting certaintopics; however, classified categories are still limited for ap-plication to urban events. To classify common followees invarious categories, we used a Wikipedia category related tothe article about common followees. We found that mostcommon followees are influencer accounts having more than2,000 followers. In addition, there are also exact or relatedWikipedia articles for most of the common followers. There-fore, we assume that Wikipedia information related to com-mon followees, especially the category of the article, can beutilized for classifying common followees.
Wikipedia is an online encyclopedia created and managedby the power of collective wisdom. It has more than 30million articles in multiple languages maintaining fairness ofpolicy as much as possible. Each Wikipedia article is associ-ated to certain Wikipedia categories. Wikipedia categoriesare structured as a hierarchical category tree. Though de-tails of a category (e.g., name of category, or level of treeat which the category exists) differ depending on differentlanguage versions of Wikipedia, any Wikipedia version hasa similar higher-level category that covers various funda-mental genres/keywords. For example, Japanese Wikipediahas nine categories at the top-level category (e.g., society,nature, technology, or culture). Lower-level tree categoriesinclude more details of categories, such as political system,biology, computer, or animation. By stepping up from alower-level category to an upper-level category, categoriesattached to any article can be classified into the same higher-level category. We use the number of categories at a certainhigher-level tree as the dimension of vector spaces for clas-sification of the article.
To classify a common followee, specifying the related Wikipediaarticle of the common followee is necessary. Finding therelated article is not difficult because there are several on-line databases that provide the relationship between famousTwitter accounts and a Wikipedia article. After specifyingthe article, we first collect categories associated with the ar-ticle. Since these collected categories are located at a lowertree level, the upper level (parent) of category, which is pre-liminarily defined to be used for urban classification, is dis-covered by breadth first search. Some lower-level categorieshave multiple parents, so breadth first search is adapted toall parent trees. If the search hits a defined level of a cate-gory, our algorithm adds a score to the category.
As an example, @cfm miku the Twitter account of a vir-tual idol called Hatsune Miku1, which is followed by morethan 40,000 users, can be associated to the Wikipedia arti-cle of Hatsune Miku. The article is classified into followingcategories: music software, virtual singer, vocaloid, musicin Sapporo, and Crypton Future Media (a company name).These categories are located at a lower level of the category
1https://en.wikipedia.org/wiki/Hatsune Miku
Geotagged Tweets Collection Module
Geotagged Tweets Database
Event RegistrationModule
Event ClassificationModule
Result VisualizationModule
Common Followees Analysis Module
Common Followees’Information
Common Followees’ Category
Request
Twitter Users’ followees Information
RequestGeotagged Tweet
Geotagged Tweet
Geo information/ Twitter users
Common Followees’List with Weighting Factor
ClassificationResult
Event RegisterationInterface
Classification Result Visualization
(A)(B)
(C)(D) Twitter Users’ followees
Information
Figure 2: System Architecture
tree; thus, each category is transformed into an upper cate-gory by breadth first search. In this example, all categoriesare expressed, such as music software = {music:2, movie:1,computer:4} and virtual singer = {art:1, entertainment:3}.Finally, the result of totaling categories is used as a featurevector for each common followee’s classification.
2.4 Urban Events ClassificationBased on common followee weighting factor and classified
categories of each followee, we finally classify urban events.
We define ~fx as a feature vector of an arbitrary commonfollowee fx ∈ F . Then, the feature vector of the urbanevent is calculated as the total feature vectors of commonfollowees while considering each of their weighting factors:
~e =
n∑i=0
FeatureIndex(fi)n∑
j=0
FeatureIndex(fj)
~fi
3. DESIGN AND IMPLEMENTATIONWe designed and implemented an interactive system that
enables users to grab and classify urban events semi-automatically(see Figure 2 for overall system architecture). The systemhas the following functions:
3.1 Collecting and visualizing geo-tagged tweetsWe collected geo-tagged tweets in Japan by using Twitter
Streaming API2 (A in Figure 2). The collected tweets arestored into a database, and it can be visualized on GoogleMaps with date and area specifications(B in Figure 2).
3.2 Registering urban events from the map tobe classified
By using the interface, users can register target events tobe classified (C in Figure 2). Figure 3 shows the registrationinterface. Each pin represents geo-tagged tweets, and users
2http://dev.twitter.com
Target Event Registration Interface Common followees ranking
and their account name/image
Twitter users in red circle are target to be analysed
Classi!cation of each common
followees
Figure 3: A screenshot of the interactive urbanevent classification tool-left: event registration,right: showing classified common followees
can register a target event by specifying the date, time, andarea (see red circle on Figure 3).
3.3 Classifying registered urban events auto-matically
The registered events are then classified using our methoddescribed in Section 2 (D in Figure 2). First, we extractcommon followees from users present where target eventshappened. Considering that analyzing all common followeesrequires high calculation costs (e.g., too many queries toTwitter API beyond its limit and also Wikipedia access), weanalyzed the top ten weighted common followees for urbanevent classification.
Most of the geo-tagged tweets we collected are generatedby Japanese users, and the extracted common followees arealso related to Japan. Therefore, we used the Japanese ver-sion of Wikipedia. In terms of category tree level for ur-ban event classification, we used a category level that has91 categories (see Table 1). This is because this categorylevel is recommended as a comprehensive category level bythe Japanese version of Wikipedia, and we consider that itcan provide enough classification category to urban events.The information at each classification process (i.e.,the origi-nal categories of each Wikipedia article of common followees,level of 91 category of each Wikipedia article and result clas-sification of the urban event) is visualized with the tool (seealso Figure 3). As an example, Figure 4 shows the classifi-cation results of two different events held at the same venueat different dates. We visualized the classification of urbanevents, which is 91 dimension of feature vector, as a wordcloud for recognizing its feature easily. The size of eachkeyword represents the weight of each keyword, and zeroweighted keywords are not shown in the word cloud. OnApril 13th, 2013, an idol group’s live music concert was heldat the national stadium. The analyzed categories show thatclassification keywords of music and entertainment were rep-resented as strong categories. On the contrary, on January13th, 2014, a soccer game was held at the same venue. Atthat time, we confirmed that classification keyword of sportswas represented as strong category.
4. EVALUATIONWe evaluate how our classification method is similar to
people’s actual impression about the event classification. Weselected 13 events for our evaluation. Then, we collected
National Stadium
2014
Jan.13
2013
Apr.13Idol Group Music Live
Soccer Game
Figure 4: Classification example: two differentevents at the same venue by our proposed method
the ground truth of the event classification by conducting aquestionnaire survey.
4.1 Target EventsJACE (Japan Association for the Promotion of Creative
Events) defines 16 types of events, including traditional fes-tivals, art exhibitions, music exhibitions, parades, sports,contents, etc. From these, we found 12 types of events withgeo-tagged tweets from their venue and classified them byusing our tool. We analyzed all Twitter users who postedgeo-tagged tweets when/where the event was being held.
4.2 Questionnaire Survey for Ground TruthTo validate the accuracy of our classification method, we
also prepared the ground truth of the target event classifi-cation. We conducted a questionnaire survey for 27 partic-ipants (18-25 ages) to classify the events with same classi-fication keywords as that of our method. Figure 5 showsthe actual questionnaire interface. The interface providesa name, image, and simple explanation of events that wereavailable from the event’s official website. In addition tothis information, classification keywords (as same as Table1) are also presented. When participants click on each key-word, the keyword is associated to the event. We askedparticipants to select multiple keywords considered as re-lated keywords of each event. The classification result foreach event from all participants is aggregated as values ofthe 91 dimension of vector space. We used this result as theground truth because this result shows the extent to whichpeople are impressed by the events.
4.3 Comparison ResultWe compared the classification results of our method with
the ground truth through a questionnaire survey. Both clas-sifications are expressed as the 91 dimension of feature vec-tor, so we evaluate the similarity of both classifications withcosine similarity, as follows.
cos(~x, ~y) =~x · ~y|~x||~y| =
|v|∑i=1
xiyi√√√√ |v|∑i=1
x2i
√√√√ |v|∑i=1
y2i
Table 1: Categories to be used for classifying urban events: 91 dimension of vector spaceSociety, Politics, Economy, Industry, Traffic, Education, History, Welfare, Medical, Health, Environment, Citizenship, Peace,
Military, University, Art, Culture, Language, Religion, Playing, Hobby, TraditionalArt, Literature, Music, FineArt, Movie, Drama,
Anime, Manga, Illustration, Sports, Game, Gambling, Fashion, DietaryCuluture, Structure, Media, Entertainment, Continent,
Asia, Africa, Oceania, NorthAmerica, SouthAmerica, Europe, Hokkaido, Tohoku, Kanto, Chubu, Kinki, Chugoku, Shikoku, Kyushu,
Okinawa, Nature, Universe, Elements, Weather, Disaster, Sea, LivingThing, Plant, Animal, Mineral, Study, Philosophy,
Logic, Linguistics, Psychology, Literature, ScienceOfReligion, PoliticalScience, BusinessAdministration, Sociology, Pedagogy,
Mathematics, Physics, Chemistry, Biology, Anthropology, EarthScience, MedicalScience, Pharmacy, Dentistry, Agriculture, Engineering,
Technology, Computer, Network, Electronics, Biotechnology
Event Name
Event Explanation
keywords for
classifying the event
Selected keywords
are registered here
Figure 5: Questionnaire survey interface
Table 2 shows the comparison result. Since values of thefeature vectors of each event are expressed as a positivenumber, the value of cosine similarity is expressed from 0as the minimum similarity, to 1 as the maximum similarity.As a result, we confirmed that there is a strong similarityfor the events of a high school soccer championship, Mo-moiro Clover live, comic market, the Iruma Airbase show,and Tokyo Anime Award Festival. On the contrary, theevents of Sapporo Snow Festival, Hakata Dontaku festival,Gaien fireworks festival, Kanda festival, and Hirosaki cherryblossom festival have low similarity with the ground truth.
In terms of events that have a high similarity with theground truth, we found that the events are popular for spe-cific people interested in the genre of events, such as sports,music, manga, anime, or military. Since these events havea strong fanbase, the fans’ interest seems to be reflectedto common followees. For example, in terms of the IrumaAirbase show, high weighted common followers were relatedto the military or government, such as the official Twitteraccount of Japanese self-defence force and government. Asanother example, common followers in Momoiro Clover live(a live music of idol group called Momoiro Clover) includesTwitter accounts of group members of Momoiro Clover, theyare classified into entertainment, music, and fashion.
On the contrary, in terms of events having low similar-ity, we found that those events are popular for all people.For example, traditional festivals, such as Kanda festival, orpublic viewing events, such as fireworks or cherry blossomfestival, are the events where many types of people, such asfamilies and couples of all ages participate. Therefore, sinceparticipants’ interests in the social network are also diverse,obtaining feature classification is difficult. Kawano.et.al.[4]
introduced ”popularity” of urban events that are composedof diverse participants and suggested a method of classify-ing events based on ”popularity”. These low similarity eventshave a possibility of becoming high popularity events. If so,we can estimate a certain accuracy of event classification byadapting the popularity analytics at the same time.
In addition to above observations, we confirmed severalfeatures from low similarity events. Sapporo Snow Festivalis one of Japan’s most popular winter events that exhibitsa dozen large snow sculptures. Thus, the ground truth in-cludes several keywords such as culture, fineart, or Hokkaido(name of the island where the event is held). However, ourmethod classifies the events by using different keywords suchas music and entertainment. These keywords are derivedfrom high-weighted common followees of Hatsune-Miku (avirtual singer software idol). In fact, there was a specialsub-event called Hatsune-Miku Snow Festival as a part ofthe snow festival in 2014. Hatsune-Miku is popular amonginternet users. This was done so that the effect on Miku’sfans may be more reflected on Twitter rather than otherusers. Similarly, though Hakata Dontaku is a traditionalfestival, our method derives entertainment and music as key-words. This is also because the idol group performed live onthat day at the event, and their fans’ tweets were mainly ob-served. These results suggest that our classification methodis significantly affected by active Twitter users. On the con-trary, though the above two results are different from thepublic impression of the events, our result has a possibilityof showing more real aspects of the snow festival.
5. CONCLUSION AND FUTURE WORKIn this paper, we propose a novel method of classifying
urban events by using a Wikipedia category. Previous re-lated works mainly focused on textual analysis of location-based SNS for event classification; the quality and numberof textual information to be analyzed are always importantissues. Our method analyzes common followees in users inthe vicinity of the event venue and extracts common fol-lowee attributes by referring to Wikipedia information. Ourexperimental result showed that the proposed method canclassify the events well where participants have similar inter-ests. Therefore, we can provide an alternative classificationmethod even if SNS texts from the area do not mention theevents.
As future work, first, the accuracy of urban event clas-sification requires enhancement. In this paper, we regardall Twitter users from an area as the event’s participants.However, some users might not be participants. Removingsuch noise will enhance our classification accuracy. Combin-ing our method with textural analysis may provide a more
accurate and robust method of urban event classification.Second, we will apply our method also for event detectionby constant monitoring. If the analyzed classification key-words change for an area, it implies that an event has hap-pened. This urban event detection and classification playsan important role in the era of cities with IoT.
6. ACKNOWLEDGMENTSThis research is partially supported by National Institute
of Information and Communications Technology.
7. REFERENCES[1] Boettcher, A., and Lee, D. Eventradar: A real-time
local event detection scheme using twitter stream. InGreen Computing and Communications (GreenCom),2012 IEEE International Conference on (2012), IEEE,pp. 358–367.
[2] Gupta, A., and Kumaraguru, P. Credibility rankingof tweets during high impact events. In Proceedings ofthe 1st Workshop on Privacy and Security in OnlineSocial Media (New York, NY, USA, 2012), PSOSM ’12,ACM, pp. 2:2–2:8.
[3] Ishikawa, S., Arakawa, Y., Tagashira, S., andFukuda, A. Hot topic detection in local areas usingtwitter and wikipedia. In ARCS Workshops (ARCS),2012 (2012), IEEE, pp. 1–5.
[4] Kawano, M., Yonezawa, T., Nakazawa, J.,Kawasaki, S., Ohta, K., Inamura, H., andTokuda, H. Classifying urban events’ popularity byanalyzing friends information in location-based socialnetwork. In The First International Conference on IoTin Urban Space, Urb-IoT 2014, Rome, Italy, October27-28, 2014 (2014), pp. 87–89.
[5] Lee, R., and Sumiya, K. Measuring geographicalregularities of crowd behaviors for twitter-basedgeo-social event detection. In Proceedings of the 2NdACM SIGSPATIAL International Workshop onLocation Based Social Networks (New York, NY, USA,2010), LBSN ’10, ACM, pp. 1–10.
[6] Pennacchiotti, M., and Popescu, A.-M. A machinelearning approach to twitter user classification. InICWSM (2011), L. A. Adamic, R. A. Baeza-Yates, andS. Counts, Eds., The AAAI Press.
[7] Steiger, E., de Albuquerque, J. P., and Zipf, A.An advanced systematic literature review onspatiotemporal analyses of twitter data. Transactionsin GIS 19, 6 (2015), 809–834.
[8] Wakamiya, S., Lee, R., and Sumiya, K.Crowd-sourced cartography: Measuring socio-cognitivedistance for urban areas based on crowd’s movement.In Proceedings of the 2012 ACM Conference onUbiquitous Computing (New York, NY, USA, 2012),UbiComp ’12, ACM, pp. 935–942.
[9] Weng, J., and Lee, B.-S. Event detection in twitter.In International AAAI Conference on Web and SocialMedia (2011).
Table 2: Comparison result