City Happenings into Wikipedia Category: Classifying Urban...

City Happenings into Wikipedia Category:Classifying Urban Events by Combining Analyses of

Location-based Social Networks and Wikipedia

Shoya SatoTakuro Yonezawa

Jin NakazawaKeio University

Kanagawa, Japan{shoya,takuro,jin}@ht.sfc.keio.ac.jp

Satoshi KawasakiKen Ohta

Hiroshi InamuraNTT Docomo, Inc.Kanagawa, Japan

{satoshi.kawasaki.vx,ootaken,inamura}@nttdocomo.com

Hideyuki TokudaKeio University

Kanagawa, [email protected]

ABSTRACTRecently, many researchers have been focusing on the detec-tion and classification of urban events by information analy-sis on social networks. Previous works mainly use text anal-ysis of users’ posts on social networks for detecting urbanevents. However, this approach has a limitation that theusers’ posts must mention the event for the analysis to beconducted. We propose a new method for classifying urbanevents by extracting user interest from the location-basedsocial network information without text analysis. The pro-posed method includes analyzing common friends of users inthe vicinity of the event venue and extracting common thefriends’ attributes by referring to related Wikipedia informa-tion. We designed and implemented the proposed method,and conducted an experiment for evaluating our method.Our experimental result shows that our method can classifyevents well in cases where participants have similar inter-ests.

CCS Concepts•Human-centered computing → Social network analy-sis;

KeywordsUrban event; classification; SNS; Wikipedia

1. INTRODUCTIONVarious events are conducted in cities, including sports

games, festivals, etc. Recognizing these urban events real-time is one of the key functionalities for making cities smarterin terms of disaster/emergency management, disease/health

management, traffic management, and further urban plan-ning. To detect and classify urban events, many researchershave been focusing on analyzing user-generated content-informationfrom location-based social networks, such as geo-tagged tweetsfrom Twitter [7]. The analysis approaches are mainly classi-fied into the following two categories (or their cross analysis):semantic analysis [3, 9, 2] and spatiotemporal signal anal-ysis [5, 1, 8]. Semantic analysis leverages the fact that theparticipants in urban events post text information about theevents. By extracting hot topics or burst information fromthe content, urban events can be detected and classified.However, since this approach relies on the assumption thatthere are a number of text contents that mention an urbanevent, applying this approach to the urban events not men-tioned considerably is difficult. In addition, even if there isconsiderable content, it always contains noisy information,which is not related to the events. However, spatiotem-poral signal analysis focuses on the geolocation and times-tamp of content, instead of semantic information. The ap-proach analyzes the spatio-temporal pattern of signals fromthe location-based social network to detect abnormality incities. However, though this approach works well for de-tecting abnormalities, achieving classification-inferring thetype of urban events is difficult. Therefore, both seman-tic and spatiotemporal approaches still have limitations inthe detection and classification of urban events based onlocation-based social network information.

In this paper, we propose an alternative simple approachfor detecting and classifying urban events to resolve thisproblem. We focus on common followee information froma location-based social network. We assume that peopleparticipating in an event have similar interests, and theseinterests are reflected on their followees. For example, par-ticipants of a football game are generally interested in foot-ball, and they would also follow football related users onsocial networking sites. Therefore, by analyzing commonfollowees of participants on social networks, the characteris-tics of the event could be inferred. Our classification methodfirst involves obtaining common followees from the socialnetwork of users who tweet from the vicinity of an urbanevent site. It then weighs each common followee. Highlyweighted common followees usually include famous users,who already appear in Wikipedia entries. Therefore, we

Geo-tagged Tweets

Extract Users

(Possible) Event Attendees

(Possible) Event Attendees

…Common Followees

…Common Followees

Weighting

…

Common FolloweesFeature Extraction

Urban EventClassification

0.4 0.25 0.2 0.15

0.4 0.25 0.2

…

…

Figure 1: Overall process of proposed urban event classification

adopt the Wikipedia category of the common followees forurban event classification.

We designed and implemented an interactive analysis toolthat grabs and classifies events based on the common fol-lowees information. We classified sixteen urban events byusing this method and compared the results with the groundtruth, which was achieved by manual classification of thoseevents. From the result, we could confirm that classificationby our method achieves a higher correlation with groundtruth for the events that gather selective participants, suchas a baseball game, which is attended by the fans of twoteams, and a comic market, where otakus gather, as com-pared to generic events, e.g., a nature viewing event. Thecontributions of this paper are as follows:

• Presenting a new method for urban event classificationbased on combining the analysis of common followeesof event participants and Wikipedia entries

• Designing and implementing an interactive web toolthat easily captures and classifies urban events

• Evaluating the effectiveness of our approach by com-paring classification results of our method with eventclassification of human impression of the events

2. EVENTS CLASSIFICATION BASED ONCOMMON FOLLOWEES ANALYSIS

2.1 Overall ProcessAs location-based social network data sources, we use geo-

tagged tweet information from Twitter. Figure 1 shows theoverall process of urban event classification. First, we ex-tract SNS users who post textual messages from the vicinityof an event venue when the event is being conducted. Sec-ond, we extract the common followees weighting factor fromthe user set. Common followees imply user accounts beingfollowed by more than two users from a user set. Then, weclassify each common followee by referring to a Wikipediaarticle related to the common followee. Finally, we use theoverall result of classified information as classification of theurban event. Details of each process are explained in thefollowing sections.

2.2 Common Followees WeightingThis process extracts common followees that are uniquely

exposed in the events, and make their weighting. A more

straightforward approach is simply counting the number ofusers from the user set following each common follower.However, this simple approach hides important informationregarding the common followees being uniquely followed bythe users of the event. For example, through Twitter, wefound that @masason user always appears as a popular com-mon followee from user sets of many events. The @masasonuser has most followed account in Japan (2,595,630 follow-ers). Such famous users are recommended by Twitter whena user registers on Twitter. These users may possibly showup as noise in the classification of events. Therefore, pickingup common followees who are uniquely followed by users ofthe events is necessary. To remove noise, we calculate theratio of following by the user sets from exact events andentire Twitter space. With the comparison information, weextract the common followee weighting factor for the events.

For detailed processing, we put a set of common followeesas F = {f0, f1, ...., fn}. EventFollow(fx) represents eachcommon followee’s total follow number by Twitter users ob-served in the event. We apply EventIndex(fx) as popularityof the common followee among twitter users observed in theevent:

EventIndex(fx) =EventFollow(fx)n∑

i=0

EventFollow(fi)

Then, we also calculate BasicIndex, which represents theoriginal popularity of each common followee among a setF in entire Twitter space. BasicFollow(fx) represents thenumber of followed of each common followee by all Twitterusers.

BasicIndex(fx) =BasicFollow(fx)n∑

i=0

BasicFollow(fi)

Finally, we calculate FeatureIndex of each common fol-lowee, a weighting factor of common followees, which isuniquely exposed from Twitter users observed in the event:

FeatureIndex(fx) = EventIndex(fx) − BasicIndex(fx)

If FeatureIndex of a common followee is greater thanzero, it implies that the common followee is obtaining morefollowers than usual. Through this process, we obtain thelist of common followees with unique weighting information

when the urban events are conducted. Then, we classifyeach common followees’ character.

2.3 Common Followees Classificationby Wikipedia Category

Next, we infer the categories of the extracted common fol-lowees. To investigate characteristics of twitter users, sev-eral research works focused on analyzing textual informationfrom tweets and additional metadata. For example, Marcoet.al.[6] conducted a linguistic Twitter analysis to infer theTwitter users’ political affiliation, ethnicity identification,etc. These approaches work effectively in detecting certaintopics; however, classified categories are still limited for ap-plication to urban events. To classify common followees invarious categories, we used a Wikipedia category related tothe article about common followees. We found that mostcommon followees are influencer accounts having more than2,000 followers. In addition, there are also exact or relatedWikipedia articles for most of the common followers. There-fore, we assume that Wikipedia information related to com-mon followees, especially the category of the article, can beutilized for classifying common followees.

Wikipedia is an online encyclopedia created and managedby the power of collective wisdom. It has more than 30million articles in multiple languages maintaining fairness ofpolicy as much as possible. Each Wikipedia article is associ-ated to certain Wikipedia categories. Wikipedia categoriesare structured as a hierarchical category tree. Though de-tails of a category (e.g., name of category, or level of treeat which the category exists) differ depending on differentlanguage versions of Wikipedia, any Wikipedia version hasa similar higher-level category that covers various funda-mental genres/keywords. For example, Japanese Wikipediahas nine categories at the top-level category (e.g., society,nature, technology, or culture). Lower-level tree categoriesinclude more details of categories, such as political system,biology, computer, or animation. By stepping up from alower-level category to an upper-level category, categoriesattached to any article can be classified into the same higher-level category. We use the number of categories at a certainhigher-level tree as the dimension of vector spaces for clas-sification of the article.

To classify a common followee, specifying the related Wikipediaarticle of the common followee is necessary. Finding therelated article is not difficult because there are several on-line databases that provide the relationship between famousTwitter accounts and a Wikipedia article. After specifyingthe article, we first collect categories associated with the ar-ticle. Since these collected categories are located at a lowertree level, the upper level (parent) of category, which is pre-liminarily defined to be used for urban classification, is dis-covered by breadth first search. Some lower-level categorieshave multiple parents, so breadth first search is adapted toall parent trees. If the search hits a defined level of a cate-gory, our algorithm adds a score to the category.

As an example, @cfm miku the Twitter account of a vir-tual idol called Hatsune Miku1, which is followed by morethan 40,000 users, can be associated to the Wikipedia arti-cle of Hatsune Miku. The article is classified into followingcategories: music software, virtual singer, vocaloid, musicin Sapporo, and Crypton Future Media (a company name).These categories are located at a lower level of the category

1https://en.wikipedia.org/wiki/Hatsune Miku

Geotagged Tweets Collection Module

Geotagged Tweets Database

Event RegistrationModule

Event ClassificationModule

Result VisualizationModule

Twitter

Common Followees Analysis Module

Common Followees’Information

Common Followees’ Category

Request

Twitter Users’ followees Information

RequestGeotagged Tweet

Geotagged Tweet

Geo information/ Twitter users

Common Followees’List with Weighting Factor

ClassificationResult

Event RegisterationInterface

Classification Result Visualization

(A)(B)

(C)(D) Twitter Users’ followees

Information

Figure 2: System Architecture

tree; thus, each category is transformed into an upper cate-gory by breadth first search. In this example, all categoriesare expressed, such as music software = {music:2, movie:1,computer:4} and virtual singer = {art:1, entertainment:3}.Finally, the result of totaling categories is used as a featurevector for each common followee’s classification.

2.4 Urban Events ClassificationBased on common followee weighting factor and classified

categories of each followee, we finally classify urban events.

We define ~fx as a feature vector of an arbitrary commonfollowee fx ∈ F . Then, the feature vector of the urbanevent is calculated as the total feature vectors of commonfollowees while considering each of their weighting factors:

~e =

n∑i=0

FeatureIndex(fi)n∑

j=0

FeatureIndex(fj)

~fi

3. DESIGN AND IMPLEMENTATIONWe designed and implemented an interactive system that

enables users to grab and classify urban events semi-automatically(see Figure 2 for overall system architecture). The systemhas the following functions:

3.1 Collecting and visualizing geo-tagged tweetsWe collected geo-tagged tweets in Japan by using Twitter

Streaming API2 (A in Figure 2). The collected tweets arestored into a database, and it can be visualized on GoogleMaps with date and area specifications(B in Figure 2).

3.2 Registering urban events from the map tobe classified

By using the interface, users can register target events tobe classified (C in Figure 2). Figure 3 shows the registrationinterface. Each pin represents geo-tagged tweets, and users

2http://dev.twitter.com

Target Event Registration Interface Common followees ranking

and their account name/image

Twitter users in red circle are target to be analysed

Classi!cation of each common

followees

Figure 3: A screenshot of the interactive urbanevent classification tool-left: event registration,right: showing classified common followees

can register a target event by specifying the date, time, andarea (see red circle on Figure 3).

3.3 Classifying registered urban events auto-matically

The registered events are then classified using our methoddescribed in Section 2 (D in Figure 2). First, we extractcommon followees from users present where target eventshappened. Considering that analyzing all common followeesrequires high calculation costs (e.g., too many queries toTwitter API beyond its limit and also Wikipedia access), weanalyzed the top ten weighted common followees for urbanevent classification.

Most of the geo-tagged tweets we collected are generatedby Japanese users, and the extracted common followees arealso related to Japan. Therefore, we used the Japanese ver-sion of Wikipedia. In terms of category tree level for ur-ban event classification, we used a category level that has91 categories (see Table 1). This is because this categorylevel is recommended as a comprehensive category level bythe Japanese version of Wikipedia, and we consider that itcan provide enough classification category to urban events.The information at each classification process (i.e.,the origi-nal categories of each Wikipedia article of common followees,level of 91 category of each Wikipedia article and result clas-sification of the urban event) is visualized with the tool (seealso Figure 3). As an example, Figure 4 shows the classifi-cation results of two different events held at the same venueat different dates. We visualized the classification of urbanevents, which is 91 dimension of feature vector, as a wordcloud for recognizing its feature easily. The size of eachkeyword represents the weight of each keyword, and zeroweighted keywords are not shown in the word cloud. OnApril 13th, 2013, an idol group’s live music concert was heldat the national stadium. The analyzed categories show thatclassification keywords of music and entertainment were rep-resented as strong categories. On the contrary, on January13th, 2014, a soccer game was held at the same venue. Atthat time, we confirmed that classification keyword of sportswas represented as strong category.

4. EVALUATIONWe evaluate how our classification method is similar to

people’s actual impression about the event classification. Weselected 13 events for our evaluation. Then, we collected

National Stadium

2014

Jan.13

2013

Apr.13Idol Group Music Live

Soccer Game

Figure 4: Classification example: two differentevents at the same venue by our proposed method

the ground truth of the event classification by conducting aquestionnaire survey.

4.1 Target EventsJACE (Japan Association for the Promotion of Creative

Events) defines 16 types of events, including traditional fes-tivals, art exhibitions, music exhibitions, parades, sports,contents, etc. From these, we found 12 types of events withgeo-tagged tweets from their venue and classified them byusing our tool. We analyzed all Twitter users who postedgeo-tagged tweets when/where the event was being held.

4.2 Questionnaire Survey for Ground TruthTo validate the accuracy of our classification method, we

also prepared the ground truth of the target event classifi-cation. We conducted a questionnaire survey for 27 partic-ipants (18-25 ages) to classify the events with same classi-fication keywords as that of our method. Figure 5 showsthe actual questionnaire interface. The interface providesa name, image, and simple explanation of events that wereavailable from the event’s official website. In addition tothis information, classification keywords (as same as Table1) are also presented. When participants click on each key-word, the keyword is associated to the event. We askedparticipants to select multiple keywords considered as re-lated keywords of each event. The classification result foreach event from all participants is aggregated as values ofthe 91 dimension of vector space. We used this result as theground truth because this result shows the extent to whichpeople are impressed by the events.

4.3 Comparison ResultWe compared the classification results of our method with

the ground truth through a questionnaire survey. Both clas-sifications are expressed as the 91 dimension of feature vec-tor, so we evaluate the similarity of both classifications withcosine similarity, as follows.

cos(~x, ~y) =~x · ~y|~x||~y| =

|v|∑i=1

xiyi√√√√ |v|∑i=1

x2i

√√√√ |v|∑i=1

y2i

Table 1: Categories to be used for classifying urban events: 91 dimension of vector spaceSociety, Politics, Economy, Industry, Traffic, Education, History, Welfare, Medical, Health, Environment, Citizenship, Peace,

Military, University, Art, Culture, Language, Religion, Playing, Hobby, TraditionalArt, Literature, Music, FineArt, Movie, Drama,

Anime, Manga, Illustration, Sports, Game, Gambling, Fashion, DietaryCuluture, Structure, Media, Entertainment, Continent,

Asia, Africa, Oceania, NorthAmerica, SouthAmerica, Europe, Hokkaido, Tohoku, Kanto, Chubu, Kinki, Chugoku, Shikoku, Kyushu,

Okinawa, Nature, Universe, Elements, Weather, Disaster, Sea, LivingThing, Plant, Animal, Mineral, Study, Philosophy,

Logic, Linguistics, Psychology, Literature, ScienceOfReligion, PoliticalScience, BusinessAdministration, Sociology, Pedagogy,

Mathematics, Physics, Chemistry, Biology, Anthropology, EarthScience, MedicalScience, Pharmacy, Dentistry, Agriculture, Engineering,

Technology, Computer, Network, Electronics, Biotechnology

Event Name

Event Explanation

keywords for

classifying the event

Selected keywords

are registered here

Figure 5: Questionnaire survey interface

Table 2 shows the comparison result. Since values of thefeature vectors of each event are expressed as a positivenumber, the value of cosine similarity is expressed from 0as the minimum similarity, to 1 as the maximum similarity.As a result, we confirmed that there is a strong similarityfor the events of a high school soccer championship, Mo-moiro Clover live, comic market, the Iruma Airbase show,and Tokyo Anime Award Festival. On the contrary, theevents of Sapporo Snow Festival, Hakata Dontaku festival,Gaien fireworks festival, Kanda festival, and Hirosaki cherryblossom festival have low similarity with the ground truth.

In terms of events that have a high similarity with theground truth, we found that the events are popular for spe-cific people interested in the genre of events, such as sports,music, manga, anime, or military. Since these events havea strong fanbase, the fans’ interest seems to be reflectedto common followees. For example, in terms of the IrumaAirbase show, high weighted common followers were relatedto the military or government, such as the official Twitteraccount of Japanese self-defence force and government. Asanother example, common followers in Momoiro Clover live(a live music of idol group called Momoiro Clover) includesTwitter accounts of group members of Momoiro Clover, theyare classified into entertainment, music, and fashion.

On the contrary, in terms of events having low similar-ity, we found that those events are popular for all people.For example, traditional festivals, such as Kanda festival, orpublic viewing events, such as fireworks or cherry blossomfestival, are the events where many types of people, such asfamilies and couples of all ages participate. Therefore, sinceparticipants’ interests in the social network are also diverse,obtaining feature classification is difficult. Kawano.et.al.[4]

introduced ”popularity” of urban events that are composedof diverse participants and suggested a method of classify-ing events based on ”popularity”. These low similarity eventshave a possibility of becoming high popularity events. If so,we can estimate a certain accuracy of event classification byadapting the popularity analytics at the same time.

In addition to above observations, we confirmed severalfeatures from low similarity events. Sapporo Snow Festivalis one of Japan’s most popular winter events that exhibitsa dozen large snow sculptures. Thus, the ground truth in-cludes several keywords such as culture, fineart, or Hokkaido(name of the island where the event is held). However, ourmethod classifies the events by using different keywords suchas music and entertainment. These keywords are derivedfrom high-weighted common followees of Hatsune-Miku (avirtual singer software idol). In fact, there was a specialsub-event called Hatsune-Miku Snow Festival as a part ofthe snow festival in 2014. Hatsune-Miku is popular amonginternet users. This was done so that the effect on Miku’sfans may be more reflected on Twitter rather than otherusers. Similarly, though Hakata Dontaku is a traditionalfestival, our method derives entertainment and music as key-words. This is also because the idol group performed live onthat day at the event, and their fans’ tweets were mainly ob-served. These results suggest that our classification methodis significantly affected by active Twitter users. On the con-trary, though the above two results are different from thepublic impression of the events, our result has a possibilityof showing more real aspects of the snow festival.

5. CONCLUSION AND FUTURE WORKIn this paper, we propose a novel method of classifying

urban events by using a Wikipedia category. Previous re-lated works mainly focused on textual analysis of location-based SNS for event classification; the quality and numberof textual information to be analyzed are always importantissues. Our method analyzes common followees in users inthe vicinity of the event venue and extracts common fol-lowee attributes by referring to Wikipedia information. Ourexperimental result showed that the proposed method canclassify the events well where participants have similar inter-ests. Therefore, we can provide an alternative classificationmethod even if SNS texts from the area do not mention theevents.

As future work, first, the accuracy of urban event clas-sification requires enhancement. In this paper, we regardall Twitter users from an area as the event’s participants.However, some users might not be participants. Removingsuch noise will enhance our classification accuracy. Combin-ing our method with textural analysis may provide a more

accurate and robust method of urban event classification.Second, we will apply our method also for event detectionby constant monitoring. If the analyzed classification key-words change for an area, it implies that an event has hap-pened. This urban event detection and classification playsan important role in the era of cities with IoT.

6. ACKNOWLEDGMENTSThis research is partially supported by National Institute

of Information and Communications Technology.

7. REFERENCES[1] Boettcher, A., and Lee, D. Eventradar: A real-time

local event detection scheme using twitter stream. InGreen Computing and Communications (GreenCom),2012 IEEE International Conference on (2012), IEEE,pp. 358–367.

[2] Gupta, A., and Kumaraguru, P. Credibility rankingof tweets during high impact events. In Proceedings ofthe 1st Workshop on Privacy and Security in OnlineSocial Media (New York, NY, USA, 2012), PSOSM ’12,ACM, pp. 2:2–2:8.

[3] Ishikawa, S., Arakawa, Y., Tagashira, S., andFukuda, A. Hot topic detection in local areas usingtwitter and wikipedia. In ARCS Workshops (ARCS),2012 (2012), IEEE, pp. 1–5.

[4] Kawano, M., Yonezawa, T., Nakazawa, J.,Kawasaki, S., Ohta, K., Inamura, H., andTokuda, H. Classifying urban events’ popularity byanalyzing friends information in location-based socialnetwork. In The First International Conference on IoTin Urban Space, Urb-IoT 2014, Rome, Italy, October27-28, 2014 (2014), pp. 87–89.

[5] Lee, R., and Sumiya, K. Measuring geographicalregularities of crowd behaviors for twitter-basedgeo-social event detection. In Proceedings of the 2NdACM SIGSPATIAL International Workshop onLocation Based Social Networks (New York, NY, USA,2010), LBSN ’10, ACM, pp. 1–10.

[6] Pennacchiotti, M., and Popescu, A.-M. A machinelearning approach to twitter user classification. InICWSM (2011), L. A. Adamic, R. A. Baeza-Yates, andS. Counts, Eds., The AAAI Press.

[7] Steiger, E., de Albuquerque, J. P., and Zipf, A.An advanced systematic literature review onspatiotemporal analyses of twitter data. Transactionsin GIS 19, 6 (2015), 809–834.

[8] Wakamiya, S., Lee, R., and Sumiya, K.Crowd-sourced cartography: Measuring socio-cognitivedistance for urban areas based on crowd’s movement.In Proceedings of the 2012 ACM Conference onUbiquitous Computing (New York, NY, USA, 2012),UbiComp ’12, ACM, pp. 935–942.

[9] Weng, J., and Lee, B.-S. Event detection in twitter.In International AAAI Conference on Web and SocialMedia (2011).

Table 2: Comparison result

City Happenings into Wikipedia Category: Classifying Urban...

Documents

Transcript of City Happenings into Wikipedia Category: Classifying Urban...