When a Tweet Finds its Place: Fine-Grained Tweet ...ceur-ws.org/Vol-1831/paper_9.pdf When a Tweet...


    When a Tweet Finds its Place: Fine-Grained Tweet Geolocalisation

    Pavlos Paraskevopoulos¹ and Giovanni Pellegrini² and Themis Palpanas³

    ¹ University of Trento, Telecom Italia - SKIL
      p.paraskevopoulos@unitn.it

    ² University of Trento, Telecom Italia - SKIL
      giovanni.pellegrini@studenti.unitn.it

    ³ Paris Descartes University


    Abstract. The recent rise in the use of social networks has resulted in an abundance of information on different aspects of everyday social activities that is available online. In the analysis of information originating from social networks, and especially Twitter, an important aspect is that of the geographic coordinates, i.e., geolocalisation, of the relevant information. This information is used by a variety of applications for the better understanding of an urban area, the tracking of the way a virus spreads, the identification of people that need help in case of a disaster (e.g., an earthquake), or just for the better understanding of the dynamics of a major event (e.g., a concert). However, only a tiny percentage of Twitter posts are geotagged, which restricts the applicability of location-based applications. In this work, we extend our framework for geolocating tweets that are not geotagged, and describe a general solution for estimating the city, and the neighborhood in the city, from which a post was generated. In addition, we study the specific problem of geolocalising tweets deriving from targeted locations of interest (i.e., cities and neighborhoods in these cities), and present the visualizations of the prototype dashboard application we have developed, which can help end-users and large-scale event organizers to better plan and manage their activities. The experimental evaluation with real data demonstrates the efficiency and effectiveness of our approach.

    Keywords: geotag, geolocation, Twitter, social networks

    1 Introduction

    [Motivation:] Events that happen around us affect our lives to different degrees. The effects of an event on a community vary depending on the type of the event and its dynamics. For example, traffic jams affect the way we move, football matches and concerts may affect the normal pace of life in the area of the venue for a short period of time, while earthquakes and diseases are unpredictable events, which could cause significant problems that have to be addressed fast. Many entities, public and private, are interested in analyzing the effects of such events, in order to better understand and react to them, and ultimately improve the quality of life. For example, identifying a lack of clean water at a place would lead the water providers to take special care in resolving the problem. Even though this would have been a manual, labour-intensive, and time-consuming process in the past (e.g., consider the 1854 cholera outbreak in London [17]), this is no longer the case.

    People tend to share their experiences, especially those affecting their lives (or feelings). Social networks, such as Twitter [3], Facebook [1] and Google+ [2], give users the opportunity to express themselves and report details about their everyday social activities. The combination of this behavior with the widespread use of smartphones and tablets has allowed users to report their activities in real time, adding reports from several different locations (not just from their homes, or workplaces). Consequently, we now have access to datasets containing detailed information on social activities. To that effect, several studies [29], including applications [20, 32, 14, 7, 11, 6, 30, 35] and techniques [33, 28, 23, 31], have been developed that analyze datasets created through the use of social networks, tracking crowd movements and identifying needs, in order to provide benefits to end users, businesses, civil authorities and scientists alike.

    It is interesting to note that several of these applications depend on the knowledge of the user location at the time of the posting. This knowledge is necessary for applications that aim to characterize an urban landscape or optimize urban planning [14], monitor and track mobility and traffic [7], and identify and report natural disasters, such as earthquakes [11]. For example, in the case of earthquakes, knowing the exact location of a tweet can provide actionable insights to emergency-response workers (extent of damages, or number of victims at specific locations, etc.) [8]. Such applications, which represent an increasingly wide range of domains, are restricted to the use of geotagged data⁴, that is, posts in social networks containing the geographic coordinates of the user at the time of posting.

    Evidently, the availability of geotagged data determines not only the possibility to use such applications, but also their quality-performance characteristics: the more geotagged posts are available, the better the quality of the results will be (more precisely: the higher the probability of being able to produce better quality results). Nevertheless, the availability of geotagged data is rather limited. In Twitter, which is the focus of our study, the number of geotagged tweets is a mere 1.5-3% of the total number of tweets [19, 21, 15]. As a result, the amount of useful data for these applications to analyze is small, which in turn limits the utility of the applications. Even if we considered this subset of geotagged tweets as representative, “there is a tendency for geotaggers to be slightly older than non-geotaggers” [27], which may lead to non-representative, or skewed results.

    ⁴ For the rest of this paper, we will use the terms geotagged and geolocalised interchangeably.

    [Proposed Approach and Contributions:] In this study, we address this problem by extending our framework [24] for geolocalising tweets that are non-geotagged. Even though previous works have recognized the importance of this problem and have studied it [9, 18] (for a comprehensive discussion of this problem refer to [15]), their goal was to produce a coarse-grained estimate (i.e., postal zipcodes, cities, or geographical areas larger than cities) of the location of a set of non-geotagged tweets (e.g., those originating from a single user).

    In contrast, in our previous work [24], we examined this problem at a much finer granularity, thus enabling a new range of applications that require detailed geolocalised data. More specifically, our solution provides location estimates for individual tweets, at the level of a city neighborhood given the city (or the city, given the country). That is, we focused on the identification of the location, where the location belonged to a set of candidate locations. This solution exploits the similarities in the content between an individual tweet and a set of geotagged tweets, as well as their time-evolution characteristics.

    In this work, we extend our previous solution, and describe a general technique for estimating the location from which a post was generated using a two-stage process: we first determine the city, and then the neighborhood in the city, by building content-based models and analyzing the volume of posts over time, independently for each one of these two levels. Using this setup, we are able to effectively predict the location of a post from the Twitter stream, when the only input we have is the actual content of the post and its timestamp.
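
    The two-stage idea described above can be illustrated with a minimal sketch. It is not the method of this paper: it assumes that a plain bag-of-words model with cosine similarity stands in for the content-based models, and it leaves out the analysis of post volume over time. All function names and the toy tweets are hypothetical.

```python
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def build_models(tagged_tweets):
    """Map each location label to a bag-of-words Counter of its tweets."""
    models = {}
    for label, text in tagged_tweets:
        models.setdefault(label, Counter()).update(tokenize(text))
    return models


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(text, models):
    """Return the label whose model is most similar to the tweet text."""
    query = Counter(tokenize(text))
    return max(models, key=lambda label: cosine(query, models[label]))


def geolocalise(text, city_models, neighborhood_models):
    """Stage 1 picks the city; stage 2 picks a neighborhood within it."""
    city = best_match(text, city_models)
    return city, best_match(text, neighborhood_models[city])


# Hypothetical toy data: (city, neighborhood, text) of geotagged tweets.
geotagged = [
    ("Rome", "Vatican", "queue at st peter basilica vatican museums"),
    ("Rome", "Trastevere", "dinner in trastevere rome pasta"),
    ("Milan", "Duomo", "duomo rooftop view milan"),
    ("Milan", "Navigli", "aperitivo on the navigli canal milan"),
]

# One model per city, and one model per neighborhood inside each city.
city_models = build_models((city, text) for city, _, text in geotagged)
neighborhood_models = {
    city: build_models((n, t) for c, n, t in geotagged if c == city)
    for city in city_models
}

result = geolocalise(
    "long queue for the vatican museums", city_models, neighborhood_models
)
print(result)
```

    In this sketch the city-level and neighborhood-level models are built independently from the same geotagged tweets, mirroring the two levels named in the text; a timestamp-aware variant would additionally weight each candidate location by its recent posting volume.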

    In addition, we study the specific problem of geolocalising tweets deriving from targeted locations of interest, that is, neighborhoods of a particular cultural, social, or touristic importance (e.g., the Vatican in Rome). Our experiments show that we can reuse our technique for this case, as well, by adjusting its operation to this context, where a small number of popular keywords mentioned in the posts characterize the location.

    Finally, we present the visualizations of the prototype dashboard application we have developed, which can help end-users and large-scale event organizers to better plan and manage their activities. These interactive visualizations include heatmaps for the volume of (geotagged and geolocalised) tweets, where the user can zoom at different levels of granularity, ranging from a country, down to a city neighborhood, for which the user can also explore the relevant keywords. Furthermore, we provide visualizations that illustrate in a comprehensive manner the changes in the volume of posts at different locations over time.

    [Paper Organization:] The rest of the document is organized as follows. In Section 2 we present the related work. Section 3 formalizes the problem, and Section 4 describes our solution. We present our experimental evaluation in Section 5, and our prototype dashboard implementation in Section 6. Finally, we conclude in Section 7.


    2 Related Work

    Several works have studied the problem of geotagged tweet analysis. Balduini et al. [7] studied the movement of people by analyzing geotagged tweets. Some studies focus on the extraction of local events by analyzing the text in the tweets [12]. Abdelhaq et al. [4] use both geotagged and non-geotagged tweets for identifying keywords that best describe events. We note that in all the above studies, the tweets that are analyzed are already geotagged. In contrast, our focus is on non-geotagged tweets.

    The problem of using tweets in order to identify the location of a user, or the place where an event took place, has been studied in the past. The “who, where, what, when” attributes extracted from a user’s profile can be used to create spatio-temporal profiles of users, and ultimately lead to identification of mobility patterns [34]. Cheng et al. [10] create location pro