7/24/2019 An Investigation in Defining Neighbourhood Boundaries Using Location Based Social Media
An Investigation in Defining Neighbourhood Boundaries
Using Location Based Social Media
Tai Tong KAM
28th August 2015
For BENVGSC6: Dissertation
Supervised by: Steven Gray, Dr Elsa Arcaute
Word Count: 10,169 words
This dissertation is submitted in partial fulfilment for the requirements for the MSc in
Smart Cities and Urban Analytics in the Centre for Advanced Spatial Analysis, Bartlett
Faculty of the Built Environment, University College London.
ABSTRACT
The widespread use of smartphones and social media has opened opportunities for
researchers to define one of the most elusive concepts in cities: neighbourhoods. While
the number of neighbourhood detection methods using location based social media has
increased in recent years, there is much that we do not know about the process. For
example, researchers have rarely integrated the neighbourhoods detected with
administrative data to add meaning beyond what can be inferred from social media.
This work takes a step towards better understanding neighbourhood detection methods,
and also attempts to add meaning to the clusters / neighbourhoods generated by
incorporating administrative data to these clusters / neighbourhoods.
I break down the neighbourhood detection process into three common elements: (a) the
unit used for aggregation, (b) the type of clustering method used, and (c) the similarity
measure.
I then illustrate one way of better understanding the neighbourhood detection process by
applying multiple variations of the Livehoods method (Cranshaw et al., 2012) on data
from Greater London, and find that in addition to neighbourhood clusters, the
Livehoods method may also be able to generate clusters that depict the city's boundaries
from the residents' perspective.
I also make a preliminary attempt in this work to combine the clusters / neighbourhoods
formed using the Livehoods method with data from London's Lower Super Output
Areas to investigate ethnic diversity in neighbourhoods. I found that using location
based social media may generate neighbourhood boundaries that are more appropriate
than, or can complement, traditional administrative boundaries for studies where
definitions of neighbourhood go beyond arbitrary administrative boundaries and a
multifaceted view of neighbourhoods is needed.
DECLARATION
I, Tai Tong Kam, hereby declare that this dissertation is all my original work and that all
sources have been acknowledged. It is 10,169 words in length.
Signature
====================
Date: 28th August 2015
TABLE OF CONTENTS
1. RESEARCH GOAL AND OVERVIEW..................................................................... 8
1.1. Research goal, motivations, and limitations ......................................................... 8
1.2. Overview........................................................................................................... 10
2. INTRODUCTION.................................................................................................... 12
2.1. Neighbourhoods ................................................................................................ 12
2.2. Location Based Social Media and Detecting Neighbourhood Boundaries.......... 14
2.3. Review of Methods for Neighbourhood Detection ............................................. 16
3. METHODOLOGY ................................................................................................... 25
3.1. Data sources ...................................................................................................... 25
3.2. Data sorting, import, storage and analysis......................................................... 26
3.3. The Livehoods method...................................................................................... 26
4. ANALYZING THE LIVEHOODS METHOD.......................................................... 30
4.1. Tuning the number of smallest eigenvalues (k).................................................. 30
4.2. Tuning the alpha constant (α)............................................................................ 33
4.3. Tuning the nearest neighbours parameter (m).................................................. 34
4.4. Using cosine similarity....................................................................................... 35
4.5. Nearest neighbours versus full similarity graph ................................................ 36
4.6. Summary........................................................................................................... 36
5. DESCRIPTION OF LIVEHOOD CLUSTERS / NEIGHBOURHOODS.................. 38
5.1. Overview of neighbourhoods............................................................................. 38
5.2. Breakdown of individual neighbourhoods ......................................................... 46
6. COMPARING LIVEHOODS CLUSTERS TO LOWER SUPER OUTPUT AREAS 54
7. CONCLUSION......................................................................................................... 59
7.1. Concluding Remarks ......................................................................................... 59
7.2. Limitations and Future Research ...................................................................... 60
8. BIBLIOGRAPHY.................................................................................................... 64
9. APPENDIX.............................................................................................................. 67
9.1. Scripts for collecting and formatting data for analysis ...................................... 67
9.1.1. IPython notebook: twitter_streaming.ipynb ............................................... 67
9.1.2. IPython notebook: extract_twitter_data.ipynb........................................... 70
9.1.3. IPython notebook: foursquare_search_place.ipynb................................... 75
9.1.4. IPython notebook: format_data_for_analysis.ipynb.................................. 84
9.2. Scripts for Livehoods clustering method........................................................... 89
9.2.1. Bash script: install.sh................................................................................. 89
9.2.2. Bash script: runLDN.sh............................................................................. 90
9.2.3. Python script: clustering.py....................................................................... 92
9.2.4. Python script: clusteringalgo.py................................................................. 94
9.2.5. Python script: getdata.py ..........................................................................100
9.2.6. Python script: utils.py ...............................................................................111
9.3. Scripts for visualizing cluster results ................................................................119
9.3.1. Python script: formatresults.py.................................................................119
9.3.2. Python script: visualize_cluster_results.py................................................127
9.4. Scripts for comparing Lower Super Output Areas with Livehoods clusters in
terms of ethnic diversity ..............................................................................................138
9.4.1. Python script: extract_ldn_lsoa.ipynb .......................................................138
9.4.2. Python script: add_ethnic_diversity_to_geojson.ipynb.............................141
9.4.3. Python script: stats_for_eth_diversity.ipynb .............................................146
9.4.4. R script: ethnic_diversity_chart.R ............................................................148
9.5. Livehood clusters for nearest neighbours parameter m=5 to m=20..................149
9.6. Largest cluster generated from Livehoods method ...........................................151
LIST OF FIGURES
Figure 1: Relationship between number of smallest eigenvalues (k) found and number of
clusters formed ................................................................................................................. 32
Figure 2: Boundaries formed for different number of clusters ................................................ 33
Figure 3: Boundaries formed for different alpha constants ................................................... 34
Figure 4: Boundaries formed for different nearest neighbours parameter (m) ....................... 35
Figure 5: Clustering results for London ................................................................................ 40
Figure 6: Properties of Livehood clusters ............................................................................. 44
Figure 7: Overall distribution of venues and checkins across clusters ................................... 47
Figure 8: Hirschman concentration index (HI) for clusters.................................................... 56
LIST OF TABLES
Table 1: Summary statistics for cluster results for London ................................................... 41
Table 2: Percentage difference between proportion of venues within cluster to proportion of
venues within city in terms of Foursquare's main categories............................................... 50
Table 3: Percentage difference between proportion of users within cluster checking-in to
proportion of users within city checking-in in terms of Foursquare's main categories............ 52
ACKNOWLEDGMENTS
I would like to thank my supervisors, Steven Gray and Elsa Arcaute, who have been
extremely supportive and helpful throughout the dissertation process. Steven was also
instrumental in helping me process the data by guiding me on the process for setting up
the cloud computing infrastructure required to run the time-consuming scripts in parallel.
On the other hand, Elsa introduced me to Anastasios Noulas from the University of
Cambridge, who kindly provided the Foursquare data used in this work.
I would also like to thank all the teachers, staff and fellow course mates at CASA, who
have given me a great year of friendship, learning and joy in my time at CASA and
inspired me to do better.
Finally, I would like to thank my partner Cherlyn Ng, whose love, patience and support
made it possible for me to focus on my work while we were 6,740 miles apart.
1. RESEARCH GOAL AND OVERVIEW
1.1. Research goal, motivations, and limitations
The widespread use of smartphones and social media has generated an immense
amount of data which has been used to study topics such as mobility and event
detection in the city (Silva et al., 2013). Some researchers have been attempting to
use the data to define one of the most elusive concepts in cities: neighbourhoods
(Cranshaw et al., 2012; Falher et al., 2015; Zhang et al., 2013). While the research is
promising, there is much that we do not understand about the process of detecting
neighbourhoods using location based social media. For example, we do not know
how the neighbourhoods detected compare with traditional administrative
boundaries, and how we can combine the neighbourhoods detected with data from
these administrative boundaries to help us better understand cities dynamically. We
also do not know how the neighbourhoods detected may change when data over
different time periods or different time intervals are used and what these changes
may mean.
This work takes a step towards better understanding neighbourhood detection
methods. I break down the neighbourhood detection process into three common
elements, (a) the unit used for aggregation, (b) the type of clustering method used,
and (c) the similarity measure used, so that they can be studied in depth.
Better understanding can come in the form of research on particular elements in the
neighbourhood detection process across a variety of methods and comparing the
differences when different elements are used. It can also come in the form of better
understanding a particular method in depth and exploring how the neighbourhoods
formed are different depending on the parameters used.
In this dissertation, I illustrate one way of doing this by applying multiple variations
of the Livehoods method (Cranshaw et al., 2012) on data from Greater London. The
Livehoods method was chosen as it is a venues-based approach which has not been
used as much in the literature. In addition, it has not yet been applied to the Greater
London area.
As mentioned above, we do not understand how we can combine the clusters /
neighbourhoods detected via neighbourhood detection methods with data from
administrative boundaries to help us better understand cities. Integrating clusters /
neighbourhoods detected in this way with data from
administrative boundaries is rare in the neighbourhood detection literature, as most
researchers using neighbourhood detection methods have used them for developing
recommendation engines that find similar places based on social media activity. As
such, I make a preliminary attempt in this work to combine the clusters /
neighbourhoods formed using the Livehoods method with data from more
traditional administrative boundaries (the Lower Super Output Areas in this case) to
extend the meaningfulness of the clusters / neighbourhoods formed. In particular, I
have tried to integrate ethnic diversity data with the clusters / neighbourhoods
formed using the Livehoods method.
As neighbourhood detection using location based social media is relatively new and
there are few comparisons between existing neighbourhood detection methods, this
work is not aimed at evaluating whether one method or even whether particular
elements of a method are better than another. Neighbourhood detection is a form of
clustering, and determining the best clustering method has a certain degree of
subjectivity.
1.2. Overview
The dissertation is divided into seven sections.
Section Two discusses the concept of neighbourhoods, its importance for
understanding cities and why social media is a useful source of data for defining
neighbourhoods. I will review the methods that have so far been used for defining
neighbourhoods and three common elements used by the methods: (a) the unit used
for aggregation, (b) the type of clustering method used, and (c) the similarity
measure used. I will then describe what we have learnt so far about neighbourhood
detection using location based social media, and outline some ideas for better
understanding these methods.
Sections Three to Six illustrate one way we can better understand neighbourhood
detection methods by taking a closer look at the Livehoods method (Cranshaw et al.,
2012). Section Three begins by describing the data and methodology used.
Section Four then considers different variations of Cranshaw et al.'s (2012)
Livehoods method for neighbourhood detection and tests three different parameters
to find out if changing them affects the clustering results.
Section Five describes the clusters / neighbourhoods that are formed using the
Livehoods method and explores some types of information that can be derived from
these clusters, by combining the clusters with Foursquare's venues database.
Section Six describes the clusters / neighbourhoods that are formed using the
Livehoods method by combining them with data from Lower Super Output Areas
(LSOAs) in Greater London. It discusses the issue of the modifiable areal unit
problem (Openshaw, 1984) and how characteristics of the clusters / neighbourhoods
formed using the Livehoods method may be more appropriate than traditional
administrative boundaries such as the LSOAs.
Section Seven consists of concluding remarks and outlines some ideas for further
research that can help us better understand neighbourhood detection methods using
location based social media.
2. INTRODUCTION
2.1. Neighbourhoods
Neighbourhoods are a ubiquitous feature of urban living: everyone lives in a
neighbourhood. Many groups have an interest in understanding neighbourhoods.
Cranshaw and Yano (2010) note that analysing neighbourhoods is of interest to
businesses such as realtors and developers as the quality of a neighbourhood
affects the value of their assets, and to researchers in the social sciences as they seek
to understand neighbourhood and community level factors that influence
phenomena such as obesity rates and perceived happiness through neighbourhood
effects (Sampson et al., 2002). A third group with an interest in neighbourhoods
is city governments, which implement neighbourhood interventions and wish to
identify where the interventions would make sense and be most effective. Being
able to identify neighbourhoods in our cities would be valuable to all three groups.
While there is a general consensus that a neighbourhood is a contiguous
geographic area within a larger city, limited in size, and somewhat homogeneous in
its characteristics (Weiss et al., 2007), it is hard to pin down a more exact definition
(Chaskin, 1998; Weiss et al., 2007). Researchers have defined neighbourhoods in
terms of three dimensions with varying emphasis: social ties, physical demarcations
and residents' experiences (Chaskin, 1997). These are influenced by many factors
such as administrative boundaries, manmade features such as roads, natural features
such as rivers, demographics, social networks of the people that live in or frequent
the area, and the availability of services and facilities (Cranshaw and Yano, 2010).
Each person's perception of their neighbourhood boundaries may differ, even from
their neighbours, and these perceptions may also differ from the official boundaries
used by city governments for urban planning or neighbourhood initiatives
(Campbell et al., 2009). However, researchers have also found evidence that
residents often identify a common core within their neighbourhood, and the
differences are about the boundaries where neighbourhoods begin and end
(Campbell et al., 2009).
Neighbourhoods differ from communities, in the sense that neighbourhoods are tied
to a spatial unit with boundaries, while communities are not limited to spatial units.
This difference is reflected in how the role of neighbourhoods in cities has shifted
over time. To summarize Chaskin (1997), neighbourhoods in the past were tied
closely to the idea of community. There were close ties between those living within
a neighbourhood and a strong sense of identity, akin to an urban village. However,
as transportation systems improved and communication over long distances became
available, ties within a neighbourhood have become less close and more functional,
providing a space where neighbours share information, aid and services. When
studying social ties within neighbourhoods, it may be useful to look at common
social and functional activities between those living in a neighbourhood and where
these activities take place. These may give an indication of places that are
considered part of the neighbourhood for those involved in the activities.
Traditionally, studies on neighbourhoods and the neighbourhood effect have used
boundaries where data was easily available, such as administrative and political
boundaries. The data is often reliable as it is typically collected by government
agencies, and the boundaries used usually do not change greatly. Such data is useful
for understanding long term trends and behaviours such as demographics and
urbanisation. However, data from these traditional sources is usually collected at certain
periods with long intervals between each period. The data collected represents
snapshots at particular points in time, and does not capture the multiple changes that
may occur in between data collection periods. For example, full censuses
in the United Kingdom take place once every ten years. This means that data from
traditional sources is less useful for researchers interested in questions where
trends and behaviours are short term or temporary in nature, such as commuting
behaviour during transport strikes or riots. In addition, data from
traditional sources is often expensive and time consuming to collect. For studying
such short term and dynamic trends and behaviours, location based social media is
likely to be a more suitable data source.
2.2. Location Based Social Media and Detecting Neighbourhood Boundaries
Location based social media is a relatively new source of data for researchers. Users
of these platforms post their thoughts or activities with location data attached. Many
of the characteristics of data from these posts or check-ins make it suitable for
studying short term phenomena and behaviours. It is easily available, it is cheap and
quick to collect, and it provides multiple points of data within a short period. Its
biggest advantage over other data sources is the amount of context that it provides.
A typical data point from location based social media contains information on who
the user is, where the user was, and when the data was created. It also provides
additional information depending on the social media platform used. For example,
Twitter (https://twitter.com/) users post tweets indicating what they were doing or
thinking, Instagram (https://instagram.com/) users post photos, and Foursquare
(https://foursquare.com/) users provide more detailed information about
their location. Social media platforms may provide additional contextual
information. The aforementioned Foursquare, for example, maintains a database of
venues that their users post from. This database contains rich contextual information
such as the type of venue (e.g. restaurant, school) and its popularity, which can be
linked to the posts from its users. Furthermore, it is possible to look at the
relationships between different users on social media platforms through the users'
interactions with each other.
Silva et al. (2012) observe that the widespread adoption of smartphones and social
media websites has created a valuable opportunity to study city dynamics. Data
from location based social media provides rich contextual information on user
activity at different times of day. These characteristics make location-based social
media useful in detecting the invisible image of cities (Silva et al., 2012), such as
patterns of transition between locations that serve different functions in the city.
Given that city neighbourhoods do not follow strict boundaries and can shift over
time (Chaskin, 1997), location-based social media, which provides a large amount
of data in real time, is a useful source of information for neighbourhood detection in
cities and identifying changes over time. As such, researchers have also started to
use social media to detect neighbourhood boundaries.
Using data from location based social media has its limitations. While data from
location based social media has rich context and can be collected easily, such
platforms are typically used by young males who are interested in technology
(Cranshaw et al., 2012), thus the data represents a skewed demographic. Using such
data may generate clusters / neighbourhoods that reflect the views of a certain
demographic, which may not be in agreement with the general population. In
addition, data on these platforms are usually private unless the user agrees to share
the data publicly, which further limits the amount of data available for analysis.
Another factor to consider is that users may curate the types of places that they
check in at using location based social media. Places that are considered more
socially desirable to be at may be over represented when using data from location
based social media. For example, people may be more likely to check in when eating
at a new fancy restaurant or shopping in a branded goods store rather than when
they are eating at a fast food restaurant or shopping in a discount store. This means
that conclusions based on data from location based social media will likely be
biased towards such socially desirable venues. In the case of neighbourhood
detection, the clusters / neighbourhoods formed may be similarly biased. Previous
research has shown that users have been more likely to check in at venues
concerning travel and transport, office buildings, and residences (Preotiuc-Pietro
and Cohn, 2013). Despite these limitations, researchers believe that data from
location based social media can still be valuable for its rich contextual information
and sheer volume available (Silva et al., 2013).
2.3. Review of Methods for Neighbourhood Detection
What follows is a review of neighbourhood detection methods using location-based
social media. Neighbourhood detection using location based social media is
typically treated as a clustering problem, and the methods used so far reflect this
paradigm. Essentially, researchers wish to cluster users' social media activities into
contiguous geographic areas based on certain measures of similarity.
Neighbourhood detection methods usually contain three elements:
a. The unit used for aggregation (e.g. grid-based, venue-based)
b. The type of clustering method (e.g. K-Means clustering, spectral
clustering)
c. The similarity measures used
Unit used for aggregation
While the data from location based social media comes in the form of individual
posts or check-ins, it is usually aggregated in some spatial form before being
clustered. A common method used in neighbourhood detection is to take the grid-
based approach for aggregating the posts. This means dividing the city into multiple
grid squares of equal size and aggregating the properties of the posts within the grid
square. The properties of the grid squares are later used to calculate similarity
measures between grid squares during clustering. Noulas et al (2011), for example,
used a grid-square approach where each grid contained the distribution of
Foursquare venue categories nearby and the number of check-ins at these venues.
Grid squares that are contiguous and are similar to each other based on the
clustering algorithm are then grouped to form neighbourhoods. In certain cases, the
grid square itself may be treated as a neighbourhood. The neighbourhoods formed
can change depending on the number, size and shape of the grid cells used. For
example, large grid cells mean fewer cells overall, which speeds up processing but
delineates neighbourhood boundaries less precisely. The size of the grid is
therefore often a key decision that has to be made in grid-based approaches.
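As a minimal sketch of the grid-based aggregation described above, the Python below assigns check-ins to equal-sized grid squares and counts check-ins per venue category in each square. The cell sizes, the sample coordinates, the category labels and the function names (grid_cell, aggregate_by_grid) are illustrative assumptions and are not taken from any of the cited methods. Cells are defined in degrees for simplicity, whereas a real analysis would more likely use a projected coordinate system.

```python
from collections import Counter, defaultdict
import math


def grid_cell(lat, lon, cell_size_deg):
    """Return the (row, col) index of the grid square containing a point."""
    return (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))


def aggregate_by_grid(checkins, cell_size_deg=0.01):
    """Count check-ins per venue category within each grid square.

    `checkins` is an iterable of (lat, lon, category) tuples; this input
    format is a hypothetical one chosen for the sketch.
    """
    cells = defaultdict(Counter)
    for lat, lon, category in checkins:
        cells[grid_cell(lat, lon, cell_size_deg)][category] += 1
    return dict(cells)


# Hypothetical check-ins in central London (made-up values for illustration).
sample = [
    (51.5074, -0.1278, "Food"),
    (51.5079, -0.1281, "Food"),
    (51.5076, -0.1272, "Travel & Transport"),
    (51.5230, -0.1105, "Professional"),
]

# The same check-ins aggregated with a fine and a coarse grid.
fine = aggregate_by_grid(sample, cell_size_deg=0.01)
coarse = aggregate_by_grid(sample, cell_size_deg=0.05)
```

With these made-up points, the finer grid keeps the fourth check-in in its own square, while the coarser grid places all four check-ins in a single square. This illustrates how the choice of cell size changes the units that are later compared and clustered.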
A second method is the venues-based approach. Venues are locations specifically
identified by location-based social media platforms, which usually have a database
of venues that users can check-in from. Researchers can make use of the data
contained in these venue databases in addition to the posts made by the users to
develop methods for neighbourhood detection. Venues that are considered similar to
each other and fulfil a proximity criterion, such as being within a certain distance
of each other, are then grouped together, and the area bounded by these venues
forms a neighbourhood. The proximity criterion is important as it defines the
geographic aspect of the venues. It is similar to how defining the size and shape of
the grids in the grid-based approach determines how the grids are geographically
related to each other. One of the earliest attempts at neighbourhood detection using
location based social media is called Livehoods (Cranshaw et al., 2012) and this
took the venues-based approach. Zhang et al (2013) pointed out that one of the
weaknesses of the venues-based approach is that the neighbourhoods formed have to
be geographically tied to the network of venues used, whereas the grid-based
approach does not.
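The proximity criterion in the venues-based approach can be sketched as follows. This is a simplified illustration, not the Livehoods method itself: venues are linked when they lie within an assumed distance threshold of each other and a caller-supplied similarity test passes, and the groups are then read off as connected components of the resulting graph. The threshold (max_km), the similar predicate, the venue ids and the coordinates are all hypothetical.

```python
import math
from itertools import combinations


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def group_venues(venues, similar, max_km=0.5):
    """Group venues that are linked by both proximity and similarity.

    `venues` maps a venue id to (lat, lon); `similar(u, v)` is a hypothetical
    caller-supplied predicate. Groups are the connected components of the
    resulting graph, a simple stand-in for a real clustering step.
    """
    neighbours = {v: set() for v in venues}
    for u, v in combinations(venues, 2):
        if haversine_km(venues[u], venues[v]) <= max_km and similar(u, v):
            neighbours[u].add(v)
            neighbours[v].add(u)

    groups, seen = [], set()
    for start in venues:
        if start in seen:
            continue
        # Depth-first walk over linked venues to collect one component.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        groups.append(component)
    return groups


# Hypothetical venues (made-up ids and coordinates).
venues = {
    "cafe": (51.5074, -0.1278),
    "pub": (51.5080, -0.1290),
    "gym": (51.5200, -0.1000),
}
# Treat every pair as similar so that only the proximity criterion matters.
groups = group_venues(venues, similar=lambda u, v: True, max_km=0.5)
```

In this example the cafe and the pub are roughly a hundred metres apart and end up in one group, while the gym, well over a kilometre away, forms its own group. Tightening or loosening max_km changes which venues can be grouped, in the same way that the grid size does in the grid-based approach.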
Clustering methods
Clustering methods used in neighbourhood detection are a reflection of the breadth
and variety of clustering methods used in other fields. This dissertation does not
seek to determine which clustering methods are the best methods for
neighbourhood detection using location based social media, since there is a certain
degree of subjectivity. So far, neighbourhood detection methods have included
clustering methods such as K-Means clustering (Del Bimbo et al., 2014), spectral
clustering (Cranshaw et al., 2012; Noulas et al., 2011), and topic-based modelling
(Cranshaw and Yano, 2010). Each clustering method requires the researcher to
choose its parameters, such as the number of topics in topic-based
modelling or the number of clusters in K-Means clustering.
Similarity measures
A variety of similarity measures have been used in neighbourhood detection. In
terms of properties to include in the similarity measure, researchers have used
properties related to users, such as the users' check-in patterns and interests (Del
Bimbo et al., 2014). Researchers have also used properties related to venues in the
databases of location based social media platforms, such as the distribution of
Foursquare venue categories nearby and the number of check-ins at these venues
(Noulas et al., 2011). Other researchers have combined the above mentioned
properties with temporal properties to provide a contextually richer set of properties
to calculate similarity (Falher et al., 2015; Zhang et al., 2013). Different properties
characterise neighbourhoods in different ways, which makes them useful for different
purposes. Amongst the three dimensions of neighbourhoods mentioned earlier
(social ties, physical demarcations and residents' experiences), methods in
neighbourhood detection using location based social media have typically used
properties related to residents' experiences, for example the number of check-ins,
the temporal pattern of check-ins, and the type and number of venues in the area.
Cosine similarity measures similarity as the cosine of the angle between two vectors
(Xia et al., 2015). In neighbourhood detection methods, these vectors represent the
properties of the grid squares in the grid-based method and of the venues in the
venues-based method. Cosine similarity is often used for clustering in neighbourhood
detection with location based social media, and is often preferred over other
similarity measures because it does not take the magnitude of the vectors into
account. This is useful in cases where the magnitudes of the vectors differ greatly
but matter less for determining similarity. For example, cosine similarity is often
used in information retrieval to determine document similarity, as the relative
frequency of words in each document and across documents is more important than
the total number of words in a document (Huang, 2008). Similarly, the magnitudes of
vectors used in neighbourhood detection differ greatly. The most popular venues
often garner many more check-ins than less popular ones, and the most active users
check in much more frequently than those who are less active (Scellato and Mascolo,
2011). As such, researchers have found that relative frequencies between
venues/grid squares are more useful for neighbourhood detection than absolute
numbers, and prefer cosine similarity measures over Euclidean distance measures
when measuring similarity for neighbourhood detection (Cranshaw et al., 2012;
Preoţiuc-Pietro et al., 2013).
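The contrast between cosine similarity and Euclidean distance can be illustrated with a minimal sketch. The check-in counts below are hypothetical and are not taken from the dissertation's data; the helper name cosine_similarity is also just an illustrative choice.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical check-in counts by three users at two venues.
quiet_venue = np.array([1, 2, 0])
busy_venue = 50 * quiet_venue  # same mix of users, fifty times the volume

cos_sim = cosine_similarity(quiet_venue, busy_venue)
euclid = float(np.linalg.norm(busy_venue - quiet_venue))

print(round(cos_sim, 6))  # the two venues point in the same direction
print(round(euclid, 2))   # yet they are far apart in Euclidean terms
```

Under these assumed counts the two venues are visited by the same users in the same proportions, so cosine similarity treats them as identical while Euclidean distance is dominated by the difference in popularity.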
Researchers use different combinations of the three elements (unit used for
aggregation, clustering method, similarity measure) of neighbourhood detection to
create neighbourhoods, depending on their research purpose. Within each element,
researchers have also had to make decisions that influence the eventual
neighbourhoods formed. Most of the research so far seeks to compare urban
neighbourhoods within and across cities so that recommendation engines can make
better recommendations based on criteria such as the user's check-in patterns, the
user's preferred venue categories and the user's interests. The goal is to suggest
new places that the user may wish to visit, which are similar to places the user has
visited in the past.
A typical example of a neighbourhood detection method for recommendation
engines comes from Noulas et al (2011). They take a grid-based approach and use a
spectral clustering algorithm to cluster grid squares based on the distribution of
Foursquare venue categories nearby and the number of check-ins at these venues.
The method creates neighbourhoods that give us an idea of what types of places are
in an area, and a measure of their importance based on users' check-in activity.
Another example is Del Bimbo et al (2014)'s LiveCities method, which performed
K-means clustering using data on Facebook check-ins and user interests and
Foursquare venue categories.
An early attempt at neighbourhood detection was the Livehoods algorithm
(Cranshaw et al., 2012), which took the venues-based approach and used spectral
clustering to cluster Foursquare venues in Pittsburgh in the United States based on
spatial and social proximity. Through interviews with local residents, Cranshaw et al
(2012) found that neighbourhood detection methods could generate clusters /
neighbourhoods that reflect the character of life in cities. More recent attempts
have combined more information and experimented with different elements. For
example, Zhang et al (2013)'s Hoodsquare method takes a grid-based approach and
assesses the similarity of a grid cell with its neighbouring grid cells based on (a) the
distribution of Foursquare venue categories in the vicinity; (b) whether these venues
were frequented by tourists or locals; and (c) the busiest time of the day in terms of
check-ins at these venues. Neighbourhoods were then formed by finding groups of
grid cells that had high relative homogeneity. Zhang et al (2013) point out that using
multiple types of information may better represent the multifaceted nature of
neighbourhoods, and that grid-based methods may be more suitable for identifying
neighbourhoods as the boundaries formed using grid-based methods are not bound
to a particular set of venues.
The most recent attempt at neighbourhood detection using location based social
media describes neighbourhoods in terms of the activity they host (Falher et al.,
2015). Falher et al consider 2 neighbourhoods to be similar if they contain the same
kind of Foursquare venues in the same proportion. In addition to basing the
similarity of these venues on the number of check-ins and unique users as well as
the temporal distribution of the check-ins, they also take into account the
distribution of Foursquare venues in the surrounding area.
Cranshaw and Yano (2010) provided a different perspective by treating the question
as an issue of latent topic discovery. They divided the city into grids and applied
topic based modeling to the grids, using each grid as a document and each
Foursquare category tag as a word. With this method, they were able to identify
clusters of places and activities that often appeared together (e.g. beach and seafood).
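The "grid as document, category tag as word" framing can be sketched as follows. This only shows how the documents might be assembled before any topic model is fitted; the grid resolution, the venue coordinates, the category tags and the function names are all hypothetical and do not come from Cranshaw and Yano (2010).

```python
from collections import Counter, defaultdict
import math

CELL_SIZE_DEG = 0.01  # hypothetical grid resolution in degrees

def grid_cell(lat, lon, cell_size=CELL_SIZE_DEG):
    """Map a coordinate to the (row, col) index of the grid square containing it."""
    return (math.floor(lat / cell_size), math.floor(lon / cell_size))

def build_documents(venues):
    """Return {grid cell: Counter of category tags}, one bag of words per cell."""
    documents = defaultdict(Counter)
    for lat, lon, category in venues:
        documents[grid_cell(lat, lon)][category] += 1
    return dict(documents)

# Hypothetical venues: (latitude, longitude, category tag).
venues = [
    (51.5012, -0.1234, "pub"),
    (51.5018, -0.1239, "pub"),
    (51.5015, -0.1231, "theatre"),
    (51.5321, -0.1055, "cafe"),
]
documents = build_documents(venues)
for cell, words in documents.items():
    print(cell, dict(words))
```

Each resulting bag of category tags plays the role of one document, which a topic model could then summarise as a mixture of latent topics.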
While research on neighbourhood detection using location based social media has
flourished, there is less research available on understanding whether these methods
accurately reflect neighbourhoods in reality, and how they can contribute to
purposes other than recommending new places that users may wish to visit.
Researchers using the Livehoods algorithm attempted to validate the
neighbourhoods generated through their algorithm (Cranshaw et al., 2012). The
neighbourhoods identified by Cranshaw et al's algorithm included neighbourhoods
that corresponded with municipal boundaries, those that were subsets of municipal
boundaries and those that spilled over more than one municipal boundary.
Cranshaw et al interviewed 27 residents who lived in the city and found that the
neighbourhoods generated by their Livehoods method closely matched the residents'
perspectives of neighbourhoods in the city. Cranshaw et al's research provides
evidence that the boundaries generated by neighbourhood detection algorithms can
capture local dynamics that include factors such as municipal boundaries,
demographics, traffic flow and economic development.
Some researchers have argued that including more properties in the similarity
measures would better characterise the units being aggregated and produce clusters
that more closely match actual neighbourhoods. For example, Del Bimbo et al (2014)
use both static features (e.g. categories assigned by location based social networks)
and dynamic features (e.g. distribution of the interests of the people who check in at
venues) in their LiveCities method to create neighbourhoods for Florence, which
they then validated qualitatively through online questionnaires with 28 residents.
They found that including both types of features produces neighbourhoods that better
reflect the residents' perceptions.
There is much that we do not know about the neighbourhood detection process
with location based social media. For example, we do not know
how the neighbourhoods detected compare with traditional administrative
boundaries, and how we can combine the neighbourhoods detected with data from
these administrative boundaries to help us better understand cities dynamically. We
also do not know how the neighbourhoods detected may change when data over
different time periods or different time intervals are used and what these changes
may mean.
Better understanding can come in the form of research on particular elements in the
neighbourhood detection process across a variety of methods and comparing the
differences when different elements are used. It can also come in the form of better
understanding a particular method in depth and exploring how the neighbourhoods
formed are different depending on the parameters used. In this dissertation, I look at
the Livehoods method in depth by applying variations of the method on data
collected on Greater London. The Livehoods method was chosen as it is a venues-
based approach which has not been used as much in the literature. It is also one of
the rare methods in the neighbourhood detection literature that has validated the
clusters / neighbourhoods generated with the city's residents and found strong
support that the residents' perceptions agreed with the clusters formed. This gives it
legitimacy in being able to detect actual neighbourhoods compared to other
neighbourhood detection methods. In addition, it has not yet been applied to the
Greater London area.
3. METHODOLOGY
Python was used for most of the analysis and visualization in this work. IPython
notebooks were used for early exploration and experimentation with the data and
Python scripts were written in the later stages to run the neighbourhood detection
method. All scripts used for this work can be found in the appendix section.
3.1. Data sources
The data used for analysis consists of 42,581 Foursquare check-ins at 8,845 venues
by 12,397 unique users in the Greater London area from 6th April 2011 to 31st May
2011. This data was kindly provided by Anastasios Noulas from the University of
Cambridge. For each check-in, the data consists of the user ID, the time, the latitude
and longitude, and the venue ID. Further information on the venues was collected
using the Python package foursquare. This included information on each venue's
name, category and subcategory (as categorized by the social media network
Foursquare).
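As a rough illustration of how headline figures such as the number of check-ins, venues and unique users can be derived from records of this shape, the sketch below counts them from a small in-memory CSV. The column names (user_id, time, lat, lon, venue_id) and the sample rows are assumptions made for illustration; the actual file layout of the dataset is not described in this work.

```python
import csv
import io

# Hypothetical file layout: one check-in per row with these column names.
SAMPLE = """user_id,time,lat,lon,venue_id
u1,2011-04-06T09:15:00,51.5074,-0.1278,v1
u2,2011-04-06T12:30:00,51.5074,-0.1278,v1
u1,2011-04-07T18:45:00,51.5155,-0.0922,v2
"""

def summarise_checkins(handle):
    """Count check-ins, distinct venues and distinct users in a CSV stream."""
    n_checkins = 0
    venues, users = set(), set()
    for row in csv.DictReader(handle):
        n_checkins += 1
        venues.add(row["venue_id"])
        users.add(row["user_id"])
    return {"checkins": n_checkins, "venues": len(venues), "users": len(users)}

summary = summarise_checkins(io.StringIO(SAMPLE))
print(summary)
```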
Data was also collected from 6th April 2015 to 31st May 2015 for three cities:
London, Singapore and New York City. The Python package tweepy was used to
collect data from Twitter's streaming API, which offers samples of the data being
posted on Twitter in real time. A subset of this data consists of Foursquare check-ins
from users who have linked their Foursquare accounts to their Twitter accounts such
that their Foursquare check-ins also appear as tweets on Twitter. The scripts for
collecting this data and formatting them for analysis are also included in the
appendix. While this data was eventually not used in the analysis for this work,
future work could compare the results generated across the three different cities, or
the results generated from 2 different time periods in London.
3.2. Data sorting, import, storage and analysis
The data was formatted using the Python package pandas, which was developed to
mimic the R software's capabilities in managing large tables of data quickly and
easily. To improve the speed of the analysis, much of the intermediate data required
was pre-generated and stored in various file formats such as JSON files, numpy files
for matrices and pickle files created using the Python pickle package.
As each run of the method took one to two hours, an Amazon cloud server was set
up to run the multiple variations of the neighbourhood detection method. This
greatly sped up the process.
The results of the neighbourhood detection method were stored in pickle files. They
were subsequently converted to GeoJSON format and also stored in a MySQL
database using Python's sqlalchemy package for further analysis and visualization.
In parts of the process where GeoJSON files had to be manipulated, the Python
packages fiona and shapely were used to manage GeoJSON files and check for
relationships between geographic features, for example whether a particular venue
was within a particular boundary.
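The kind of containment check described above can be sketched without any geospatial dependency. The dissertation used fiona and shapely for this; the block below deliberately swaps in a plain ray-casting test written with the standard library only, and the boundary and venue coordinates are hypothetical.

```python
def point_in_polygon(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside the polygon given by `ring`?

    `ring` is a list of (lon, lat) vertices, as in a GeoJSON linear ring.
    Points lying exactly on an edge are not handled specially.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray from the point towards +x cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical rectangular boundary and two venues.
boundary = [(-0.15, 51.50), (-0.10, 51.50), (-0.10, 51.53), (-0.15, 51.53)]
venue_inside = point_in_polygon(-0.12, 51.51, boundary)
venue_outside = point_in_polygon(-0.05, 51.51, boundary)
print(venue_inside, venue_outside)
```

A geometry library remains the more robust choice for real boundaries, since it handles holes, multi-part polygons and edge cases that this sketch ignores.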
Many of the visualizations in this work were created using Python's matplotlib and
seaborn packages. Figure 8 was created using the software R and its ggplot library.
3.3. The Livehoods method
The Livehoods method is Cranshaw et al's (2012) method for neighbourhood
detection using location based social media. It is a venues-based approach that
performs spectral clustering on an affinity matrix that takes both spatial affinity and
social affinity into consideration. This method sought to fit the intuitive notion that
neighbourhoods are areas that a similar set of people frequent: the more often the
same people go to the same venues, the more likely these venues are in the same
neighbourhood. To validate this method, Cranshaw et al (2012) conducted
qualitative interviews with residents in their study area and verified that the
neighbourhoods generated by their method closely matched the residents'
perspectives of neighbourhoods in the city.
Specifically, I applied the following steps from Cranshaw et al (2012) to generate
the affinity matrices used in the spectral clustering algorithm:
1. Given the following sets:
   a. Set $V$, a set of $n_V$ Foursquare venues, for which we can compute a
      geographic distance $d(i, j)$ between the venues given their latitude and
      longitude coordinates.
   b. Set $U$, a set of $n_U$ Foursquare users.
   c. Set $C$, a set of check-ins of users in $U$ to the venues in $V$.
   Each venue $v$ in $V$ is then represented by an $n_U$-dimensional vector
   $\mathbf{c}_v$, where the $u$th component of $\mathbf{c}_v$ is the number of
   times user $u$ checked in to $v$.
2. Compute the social similarity $s(i, j)$ between each pair of venues $i, j \in V$
   by comparing the vectors $\mathbf{c}_i$ and $\mathbf{c}_j$. Cosine similarity was
   used for this measure, where
   $$s(i, j) = \frac{\mathbf{c}_i \cdot \mathbf{c}_j}{\lVert \mathbf{c}_i \rVert \, \lVert \mathbf{c}_j \rVert}$$
3. Compute an $n_V$ by $n_V$ affinity matrix $A = (a_{i,j})$ on the venues. For a
   given venue $v$, let $N_m(v)$ be the $m$ closest venues to $v$ according to the
   distance $d(v, \cdot)$, for some parameter $m$. Then we let
   $$a_{i,j} = \begin{cases} s(i, j) + \alpha & \text{if } j \in N_m(i) \text{ or } i \in N_m(j) \\ 0 & \text{otherwise} \end{cases}$$
   where $\alpha$ is a small constant that prevents any degenerate matrices from
   forming. In Cranshaw et al (2012)'s work, a value of $1 \times 10^{-2}$ was used
   for $\alpha$.
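The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not the script used in this work (which is in the appendix): the function names, the use of the haversine formula for geographic distance, the zero self-affinity on the diagonal and the toy data are all assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_matrix(lat_deg, lon_deg):
    """Pairwise great-circle distances in km between all venues."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))

def livehoods_affinity(checkins, lat_deg, lon_deg, m=10, alpha=0.01):
    """Affinity matrix following steps 1 to 3.

    checkins: (n_venues, n_users) array; entry [v, u] is the number of
    check-ins by user u at venue v.
    """
    checkins = np.asarray(checkins, dtype=float)
    n = checkins.shape[0]
    m = min(m, n - 1)  # a venue cannot have more than n - 1 neighbours

    # Step 2: cosine similarity between venue check-in vectors.
    norms = np.linalg.norm(checkins, axis=1)
    norms[norms == 0] = 1.0  # venues with no check-ins get similarity 0
    unit = checkins / norms[:, None]
    social = unit @ unit.T

    # Step 3: the m geographically nearest venues, excluding the venue itself.
    dist = haversine_matrix(lat_deg, lon_deg)
    np.fill_diagonal(dist, np.inf)
    nearest = np.argsort(dist, axis=1)[:, :m]
    is_neighbour = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), m)
    is_neighbour[rows, nearest.ravel()] = True
    is_neighbour = is_neighbour | is_neighbour.T  # j in N_m(i) or i in N_m(j)

    # Assumption: self-affinity is left at 0 because a venue is never its own neighbour.
    return np.where(is_neighbour, social + alpha, 0.0)

# Hypothetical toy data: 4 venues, 3 users.
checkins = [[2, 0, 1], [4, 0, 2], [0, 3, 0], [0, 1, 0]]
lat = [51.500, 51.501, 51.600, 51.601]
lon = [-0.120, -0.121, -0.300, -0.301]
A = livehoods_affinity(checkins, lat, lon, m=1)
print(np.round(A, 2))
```

In this toy case venues 0 and 1 are mutual nearest neighbours with identical user mixes, so their affinity is the cosine similarity of 1 plus alpha, while venue pairs that are not neighbours receive an affinity of 0.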
The affinity matrices were generated using the Python packages NumPy (Van Der
Walt et al., 2011) and SciPy (Jones et al., 2001), and spectral clustering was
performed on the affinity matrices using the Python package scikit-learn (Pedregosa
et al., 2011). To determine the number of clusters that the algorithm should create, I
used the commonly used eigengap heuristic (Noulas et al., 2011; Planck and
Luxburg, 2006). This involved calculating the k smallest eigenvalues of the
normalized Laplacian of the affinity matrix, and setting the number of clusters as the
number where the largest difference between consecutive eigenvalues occurred.
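A minimal sketch of the eigengap heuristic is given below. It assumes the symmetric normalized Laplacian and the convention that a gap after the i-th smallest eigenvalue suggests i clusters; the exact Laplacian variant and indexing convention used in the appendix scripts may differ, and the function name and toy matrix are hypothetical.

```python
import numpy as np

def eigengap_num_clusters(affinity, k):
    """Pick a number of clusters from the largest gap among the k smallest
    eigenvalues of the symmetric normalized Laplacian
    L = I - D^(-1/2) A D^(-1/2)."""
    A = np.asarray(affinity, dtype=float)
    degree = A.sum(axis=1)
    inv_sqrt = np.zeros_like(degree)
    inv_sqrt[degree > 0] = 1.0 / np.sqrt(degree[degree > 0])
    laplacian = np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    eigenvalues = np.linalg.eigvalsh(laplacian)[:k]  # returned in ascending order
    gaps = np.diff(eigenvalues)
    # Assumed convention: a gap after the i-th eigenvalue (1-based) means i clusters.
    return int(np.argmax(gaps)) + 1

# Hypothetical affinity matrix: two fully connected groups of three venues.
block = np.ones((3, 3)) - np.eye(3)
toy_affinity = np.block([[block, np.zeros((3, 3))],
                         [np.zeros((3, 3)), block]])
n_clusters = eigengap_num_clusters(toy_affinity, k=4)
print(n_clusters)
```

For matrices of the size used in this work, a sparse eigensolver such as scipy.sparse.linalg.eigsh would be the more practical way to obtain only the k smallest eigenvalues; the dense solver is used here purely to keep the sketch short.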
The question of determining parameters such as the number of clusters to form is an
important issue for clustering algorithms (Lancichinetti and Fortunato, 2009; Planck
and Luxburg, 2006; Zelnik-Manor and Perona, 2004). For some clustering
algorithms, researchers have found that maximizing modularity is a useful technique
to guide which values to use for various parameters (Lancichinetti and Fortunato,
2009), though they also recognize that this technique has its own limitations
(Fortunato and Barthélemy, 2007; Good et al., 2010; Lancichinetti and Fortunato,
2011). For spectral clustering algorithms such as the one used in the Livehoods
method, the eigengap heuristic was developed in particular to maximize modularity
for the clusters generated (Donetti and Munoz, 2004).
Cranshaw et al (2012) included a post processing step after spectral clustering to
break up any cluster that spanned too large a geographic area (more than 40% of the
geographic area in their work on Pittsburgh), and redistributed the venues in those
clusters to the nearest cluster instead. In my work, the spectral clustering algorithm
typically produced one cluster that spans a large part of the city. This seems to be a
qualitatively different type of cluster whose boundaries reflect what the users of
the social media platform regard as the boundaries of their city, rather
than any particular neighbourhood. As there was no theoretical reason to redistribute
the venues in this large cluster and as a result expand the boundaries of the other
clusters, I chose not to break up the large cluster.
4. ANALYZING THE LIVEHOODS METHOD
As described above, there are a number of parameters in the Livehoods method
(Cranshaw et al., 2012) that can be tuned to generate the neighbourhood boundaries:
the number of smallest eigenvalues to calculate (k), the number of nearest
neighbours (m), and the alpha constant (α). Cranshaw et al's (2012) values for these
parameters for the Pittsburgh metropolitan area were 45, 10 and 0.01 respectively.
Cranshaw et al (2012) acknowledged that tuning the clusters is non-trivial and may
lead to experimenter bias. As such, it is worth exploring how tuning the parameters
affects the resulting neighbourhoods formed to better understand the Livehoods
method.
4.1. Tuning the number of smallest eigenvalues (k)
In general, as the value for k increased, the total number of clusters formed
increased as well. Figure 1 illustrates the relationship between k and the total
number of clusters formed using the eigengap heuristic, for values of k from 0 to
200 and Cranshaw et al's (2012) values of 0.01 for the alpha constant and 10 for the
number of nearest neighbours. The number of clusters formed increases at certain
threshold values of k, and remains constant until the next threshold is reached. The
threshold values for k in this case are 7, 9, 13, 25, 43, 74 and 101, with the
corresponding values for the number of clusters formed being 5, 7, 11, 23, 41, 72
and 99.
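The step-like relationship reported above can be written down as a small lookup. The thresholds and cluster counts are the ones stated in the text (for m = 10 and an alpha constant of 0.01); the function name is hypothetical, and the lookup says nothing about values of k beyond the range that was tested.

```python
from bisect import bisect_right

# Threshold values of k and the cluster counts reported in the text above.
K_THRESHOLDS = [7, 9, 13, 25, 43, 74, 101]
CLUSTER_COUNTS = [5, 7, 11, 23, 41, 72, 99]

def clusters_for_k(k):
    """Number of clusters reported for a given k (a step function), or None
    if k is below the first reported threshold."""
    idx = bisect_right(K_THRESHOLDS, k) - 1
    return CLUSTER_COUNTS[idx] if idx >= 0 else None

print(clusters_for_k(45))   # 41
print(clusters_for_k(100))  # 72
```

In every reported case the number of clusters is two less than the threshold value of k at which it first appears.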
Figure 2 shows the boundaries of the clusters that are formed when the 7 threshold
values of k are used in the Livehoods method, with m = 10 and α = 0.01. As the number
of clusters created increases, the larger clusters tend to break up into smaller and
smaller clusters. The areas near the centre of the city tend to be broken up first, and
continue to be broken up into smaller clusters as the number of clusters increases.
The clusters nearer to the edges of the city tend to remain large and unbroken.
Generally, the clusters formed nearer the edge of the city are larger than the clusters
formed nearer the centre of the city. This phenomenon is likely because the density
of venues further from the centre of the city is much lower than the density of
venues nearer the centre of the city. Since the Livehoods method uses a nearest
neighbours criterion for identifying adjacent venues, areas where venues are less
dense will cover larger areas when searching for adjacent venues and result in the
method creating boundaries with larger areas. Many of the clusters formed when
there are a higher number of clusters are either subsets of the clusters formed using a
lower number of clusters, or very similar to the clusters formed using a lower
number of clusters. The clear exception occurs where k = 74 and 72 clusters are
formed: a previously undetected large cluster appears. This is the qualitatively
different cluster mentioned earlier.
Donetti and Munoz (2004) have pointed out that the weakest part of the eigengap
heuristic is that we do not know a priori how many eigenvalues (k in the Livehoods
method) should be calculated. While Cranshaw et al (2012) have not provided any
guidelines on how to choose the right value of k for cities of different sizes, cities
occupying a larger area could be seen to potentially contain more neighbourhoods,
so larger values of k should be used. As the Greater London area is much larger
than Pittsburgh, k should be larger than 45. A k value of 100 was arbitrarily chosen
in this work to test the effects of tuning the nearest neighbour parameter and the
alpha constant, to reflect the possibility of a higher number of neighbourhoods in
London. An even higher value may be more suitable as London is many times larger
than Pittsburgh, but this value was used to keep computation requirements
manageable.
Figure 1: Relationship between number of smallest eigenvalues (k) found and number of clusters formed
Figure 2: Boundaries formed for different number of clusters
Panels: 5 clusters (k = 7); 7 clusters (k = 9); 11 clusters (k = 13); 23 clusters (k = 25); 41 clusters (k = 43); 72 clusters (k = 74); 99 clusters (k = 101)
4.2. Tuning the alpha constant (α)
To see if the alpha constant influenced the clusters formed using the Livehoods
method, clusters were formed with k = 100, m = 10 and α varying from 0.00 to 0.05.
In general, there was little difference in the clusters formed. Figure 3 depicts the
boundaries formed using the various alpha constants. Almost all clusters formed are
consistent or highly similar at the different alpha values. In certain rare instances,
some clusters are merged or subdivided into 2 clusters. This shows that varying the
alpha constant between 0.00 and 0.05 does not greatly influence the boundaries
formed. A clear exception occurs with the largest cluster in the shift from α = 0.00
to α = 0.01: it expands greatly to include many other parts of the Greater London
area. This boundary remains consistent as α increases. This behaviour again
highlights the qualitatively different nature of this cluster.
Figure 3: Boundaries formed for different alpha constants
Panels: α = 0.00; α = 0.01; α = 0.02; α = 0.03; α = 0.04; α = 0.05
4.3. Tuning the nearest neighbours parameter (m)
To see if the nearest neighbours parameter influenced the clusters formed using the
Livehoods method, clusters were formed with k = 100, α = 0.01, and m varying
from 5 to 20. Figure 4 depicts the boundaries formed for some of the values used.
When m = 5, the boundaries formed overlap many of the other boundaries. As m
increases, the number of overlaps decreases and more stable clusters are formed. For
m = 8 to m = 20, the clusters formed are largely consistent with each other. Smaller
clusters with a high density of venues are more consistent than larger clusters with
a low density of venues. The largest cluster changes in shape and size at different
levels of m. It is hard to determine the optimal number to use for m, but values of 8
and higher seem to generate reasonably consistent clusters.
Figure 4: Boundaries formed for different nearest neighbours parameter (m)
Panels: m = 5; m = 8; m = 10; m = 15; m = 18; m = 20
4.4. Using cosine similarity
It has been mentioned earlier that cosine similarity was preferred over other
similarity measures because cosine similarity does not take the magnitude of vectors
into account. In the case of forming neighbourhoods and determining venue
similarity, the relative frequency of the user check-ins at each venue and across
venues matters more than the total number of user check-ins at each venue. Similarity
measures that include magnitude, such as Euclidean distance, are thus less suitable
than the cosine similarity measure. Using Jaccard similarity, another overlap-based
similarity measure, produced results similar to the cosine similarity measure.
4.5. Nearest neighbours versus full similarity graph
The k-nearest neighbours similarity graph (with the number of neighbours set by the
parameter m) was chosen for constructing the affinity matrix instead of the full
similarity graph, as the k-nearest neighbours graph better captured check-in
behaviour in neighbourhoods. While individuals have regular mobility patterns and
often return to a few highly frequented locations such as home, school or work
(González et al., 2008), this differs from their check-in behaviour on location based
social media networks: 60% to 80% of check-ins occur at places
that were not visited before by individual users (Noulas et al., 2012). Using the full
similarity graph meant that most of the similarity captured would relate to new
places that the users visited over the time period. This would create clusters of
venues that related to types of places that groups of users preferred to visit such as
museums, nightspots and stadiums, and generate boundaries that span most of the
city. These boundaries cannot be classified as neighbourhoods, given that they
overlap each other greatly and cover areas that are similar to each other.
The nearest neighbours graph, on the other hand, captures similarity relating to users
who visited sets of venues close to one another. The boundaries formed often have
clear separation from each other and there is very little overlap in terms of area
covered by the boundaries. These boundaries better fit the intuitive notion of
neighbourhoods in a city.
4.6. Summary
Through an investigation of the Livehoods method, I have found that using different
alpha values from 0.01 to 0.05 and nearest neighbours parameters of 8 and above
generally does not affect the clusters formed. I have also found that using different
values for the number of smallest eigenvalues changes the resulting number of
clusters formed, with more clusters being formed when the number of eigenvalues
increases. The investigation also revealed that two types of clusters may be formed
by the method. One type of cluster is the contiguous geographic space that can be
associated with neighbourhoods, and another type of cluster seems to be large and
spans the entire city.
In the next two sections, I will use one of the sets of clusters / neighbourhoods
generated by the Livehoods method to illustrate the types of information that can be
derived from clusters formed using the Livehoods method, and neighbourhood
detection methods in general. In section 5, I combine the clusters formed with data
from Foursquare's venues database and use it to describe the types of venues and
activities that take place within each cluster. Incorporating information from location
based social media to better understand the clusters / neighbourhoods formed is
common for researchers using neighbourhood detection methods.
In section 6, I attempt to combine the clusters / neighbourhoods formed using the
Livehoods method with data from administrative boundaries (the Greater London
Lower Super Output Areas in this case) and determine the ethnic diversity of the
clusters / neighbourhoods formed. Integrating clusters / neighbourhoods detected
using neighbourhood detection with data from administrative boundaries is rare in
the neighbourhood detection literature as most researchers using neighbourhood
detection methods have used them for developing recommendation engines that find
similar places based on social media activity. My attempt tries to add more meaning
to the clusters formed so that they can be used for other purposes, such as
investigating ethnic diversity issues within neighbourhoods.
5. DESCRIPTION OF LIVEHOOD CLUSTERS / NEIGHBOURHOODS
5.1. Overview of neighbourhoods
For comparison, the Livehoods method was applied to the Foursquare data with k =
100, α = 0.01, and m = 10. For the Greater London area, 72 clusters were generated.
Their boundaries are depicted in Figure 5. The numbers on the clusters will be used
as a reference for labelling and describing the results below. As mentioned earlier,
the largest cluster formed (cluster 66 in this case) is not depicted in the figures as it
is a qualitatively different type of cluster, and not included when describing the
clustering results. The boundaries for this cluster can be found in the appendix.
Table 1 contains summary statistics related to each cluster. The area for each cluster
ranged from 0.11 square kilometers (cluster 48) to 203 square kilometers (cluster
18) with a median of 1.86 square kilometers per cluster. While tests (using Python's
powerlaw package) show no support for a power law distribution, the distribution is
highly skewed with many small clusters and a few huge clusters. The huge clusters
also tend to have low density in terms of checkins and venues, and as such they
could be an artefact of the nearest neighbours proximity criterion. In sparse areas,
the nearest neighbours tend to be further apart from each other than in dense areas,
thus venues far apart from each other are more likely to be linked and clustered
together.
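The effect of venue density on the reach of a nearest neighbours criterion can be illustrated with synthetic points. The layout below is hypothetical (random points in a 1 km square, and the same layout stretched to a 10 km square) and uses planar rather than geographic distances for simplicity.

```python
import numpy as np

def mth_neighbour_distance(points, m):
    """Distance from each point to its m-th nearest neighbour (planar coordinates)."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    dist.sort(axis=1)  # column 0 is the zero distance from each point to itself
    return dist[:, m]

rng = np.random.default_rng(0)
# Hypothetical venues: 200 points in a 1 km square, and the same layout
# stretched to a 10 km square to mimic a sparser area.
dense = rng.uniform(0, 1, size=(200, 2))
sparse = dense * 10

dense_reach = float(np.median(mth_neighbour_distance(dense, m=10)))
sparse_reach = float(np.median(mth_neighbour_distance(sparse, m=10)))
print(round(sparse_reach / dense_reach, 2))  # ratio of typical neighbour distances
```

With the number of neighbours fixed at m = 10, the typical distance to the tenth neighbour grows in proportion to how spread out the venues are, which is consistent with sparse areas producing clusters that cover more ground.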
Figures 6a to 6c depict properties of the clusters in terms of absolute numbers: the
number of venues in each cluster ranged from 16 (cluster 45) to 279 (cluster 38)
with a median of 129; the number of check-ins in each cluster ranged from 43
(cluster 45) to 5147 (cluster 2) with a median of 412; and the number of unique
users checking in in each cluster ranged from 10 (cluster 45) to 2585 (cluster 2) with
a median of 230. Figures 6d to 6f depict properties of the clusters relative to the area
of the cluster and the number of venues in the cluster: the number of venues per
square kilometer ranged from 0.77 (cluster 18) to 1,304.61 (cluster 7) with a median
of 43.95; the number of check-ins per venue ranged from 1.26 (cluster 65) to 40.09
(cluster 26) with a median of 3.23; and the number of unique users per venue ranged
from 0.55 (cluster 67) to 19.52 (cluster 16) with a median of 1.89.
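The per-venue and per-user ratios reported in Table 1 follow directly from the absolute counts. The sketch below recomputes them for three clusters using the check-in, user and venue counts listed in Table 1; the function and key names are illustrative only. Per-area ratios are left out because the areas in Table 1 are rounded to two decimals, so they cannot be reproduced exactly from the printed values.

```python
# (check-ins, users, venues) for three clusters, copied from Table 1.
CLUSTERS = {
    2: (5147, 2585, 161),
    16: (3508, 1737, 89),
    18: (257, 110, 157),
}

def ratios(checkins, users, venues):
    """Relative measures for one cluster, rounded to two decimals as in Table 1."""
    return {
        "checkins_per_venue": round(checkins / venues, 2),
        "checkins_per_user": round(checkins / users, 2),
        "users_per_venue": round(users / venues, 2),
    }

for cluster_id, counts in CLUSTERS.items():
    print(cluster_id, ratios(*counts))
```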
Many of the distributions of cluster properties are highly skewed. Clusters 2, 13, 16
and 26 are particularly active clusters and are in the top 5 in terms of users and
checkins across all clusters, whether in absolute terms or on a per venue basis.
Collectively, the four clusters account for 29.5% of all checkins from 60% of unique
users despite containing only 5.7% of all venues across the city. This is
understandable for clusters 2 and 13 as they are in the city centre, and cluster 26 as it
is at Heathrow airport. Cluster 16 contains Wembley Stadium, and it is likely that
it had such high values for users and check-ins during that period as it was the host
for the 2011 UEFA Champions League Final on 28th May 2011, which is within the
period of analysis. People attending this event are highly likely to check in on social
media as it is a rare and meaningful event for them. Under more normal
circumstances, cluster 16 likely would have values closer to the median.
Across all clusters, cluster 18 stands out with the largest area and relatively low
frequencies of users and venues over such a large area. It could be classified as an
outlier, but results for the cluster have been included for completeness. In addition,
all variations of the Livehoods method detect this cluster or a cluster similar to this
cluster. This is more likely an artefact of using the nearest neighbours proximity
criterion as discussed above.
Figure 5: Clustering results for London (panels: Greater London area; City area)
Table 1: Summary statistics for cluster results for London
Columns (left to right): Cluster; Area (sq km); Number of check-ins; Number of users; Number of venues; Number of check-ins per sq km; Number of users per sq km; Number of venues per sq km; Number of check-ins per venue; Number of check-ins per user; Number of users per venue
0 0.69 1002 641 238 1447.35 925.9 343.78 4.21 1.56 2.69
1 0.89 469 321 165 527.2 360.84 185.48 2.84 1.46 1.95
2 1.25 5147 2585 161 4121.23 2069.82 128.91 31.97 1.99 16.06
3 26.83 356 178 180 13.27 6.63 6.71 1.98 2 0.99
4 2.95 851 450 163 288.6 152.61 55.28 5.22 1.89 2.76
5 0.75 462 230 102 616.58 306.95 136.13 4.53 2.01 2.25
6 2.19 1055 556 239 481.71 253.87 109.13 4.41 1.9 2.33
7 0.16 695 447 215 4217.23 2712.38 1304.61 3.23 1.55 2.08
8 0.82 754 493 195 924.93 604.76 239.21 3.87 1.53 2.53
9 1.77 610 325 241 344.83 183.72 136.24 2.53 1.88 1.35
10 1.5 806 409 253 536.37 272.18 168.36 3.19 1.97 1.62
11 0.6 967 622 231 1602.32 1030.65 382.77 4.19 1.55 2.69
12 1.09 294 163 120 270.77 150.12 110.52 2.45 1.8 1.36
13 2.73 2888 2032 202 1056.98 743.7 73.93 14.3 1.42 10.06
14 4.62 540 213 155 116.81 46.07 33.53 3.48 2.54 1.37
15 0.62 1357 578 108 2184.13 930.31 173.83 12.56 2.35 5.35
16 22.55 3508 1737 89 155.54 77.01 3.95 39.42 2.02 19.52
17 1.74 691 322 165 396.12 184.59 94.59 4.19 2.15 1.95
18 203.11 257 110 157 1.27 0.54 0.77 1.64 2.34 0.7
19 0.88 248 154 101 280.51 174.19 114.24 2.46 1.61 1.52
20 2.08 556 296 154 267.1 142.2 73.98 3.61 1.88 1.92
21 23.94 831 398 257 34.71 16.63 10.74 3.23 2.09 1.55
22 12.1 453 304 157 37.43 25.12 12.97 2.89 1.49 1.94
23 4.7 378 168 139 80.49 35.78 29.6 2.72 2.25 1.21
24 1.56 464 296 123 296.6 189.21 78.62 3.77 1.57 2.41
25 42.64 285 121 135 6.68 2.84 3.17 2.11 2.36 0.9
26 0.35 2165 975 54 6131.41 2761.26 152.93 40.09 2.22 18.06
27 0.41 348 235 163 844.05 569.97 395.34 2.13 1.48 1.44
28 0.31 167 117 48 543.27 380.61 156.15 3.48 1.43 2.44
29 1.24 827 384 54 668.99 310.63 43.68 15.31 2.15 7.11
30 1.71 1921 547 148 1126.03 320.63 86.75 12.98 3.51 3.7
31 0.75 160 124 31 214.22 166.02 41.5 5.16 1.29 4
32 136.96 432 340 131 3.15 2.48 0.96 3.3 1.27 2.6
33 25.62 405 224 141 15.81 8.74 5.5 2.87 1.81 1.59
34 0.21 637 394 188 3098.25 1916.34 914.4 3.39 1.62 2.1
35 0.15 181 94 38 1197.88 622.1 251.49 4.76 1.93 2.47
36 22.11 321 140 93 14.52 6.33 4.21 3.45 2.29 1.51
37 0.6 358 183 73 600.17 306.79 122.38 4.9 1.96 2.51
38 0.32 1169 740 279 3624.81 2294.57 865.12 4.19 1.58 2.65
39 1.4 1366 622 161 974.53 443.75 114.86 8.48 2.2 3.86
40 8.27 179 69 81 21.65 8.34 9.8 2.21 2.59 0.85
41 5.94 144 82 87 24.23 13.79 14.64 1.66 1.76 0.94
42 0.28 481 311 75 1702.65 1100.88 265.49 6.41 1.55 4.15
43 1.86 172 134 29 92.24 71.87 15.55 5.93 1.28 4.62
44 75.25 167 69 99 2.22 0.92 1.32 1.69 2.42 0.7
45 1.13 43 10 16 38.16 8.88 14.2 2.69 4.3 0.62
46 6.48 65 30 40 10.03 4.63 6.17 1.62 2.17 0.75
47 11.88 315 149 144 26.51 12.54 12.12 2.19 2.11 1.03
48 0.11 199 155 36 1761.06 1371.68 318.58 5.53 1.28 4.31
49 31.95 173 86 89 5.42 2.69 2.79 1.94 2.01 0.97
50 0.66 255 117 99 387.71 177.89 150.52 2.58 2.18 1.18
51 0.55 385 248 131 705.65 454.55 240.1 2.94 1.55 1.89
52 39.21 775 287 129 19.77 7.32 3.29 6.01 2.7 2.22
53 1.12 751 413 209 670.36 368.65 186.56 3.59 1.82 1.98
54 87.89 202 93 107 2.3 1.06 1.22 1.89 2.17 0.87
55 5.6 316 98 123 56.39 17.49 21.95 2.57 3.22 0.8
56 18.86 551 287 200 29.21 15.21 10.6 2.76 1.92 1.44
57 1.12 189 105 79 168.69 93.72 70.51 2.39 1.8 1.33
58 0.33 766 444 132 2296.85 1331.33 395.8 5.8 1.73 3.36
59 21.86 412 195 193 18.85 8.92 8.83 2.13 2.11 1.01
60 47.01 228 88 107 4.85 1.87 2.28 2.13 2.59 0.82
61 1.27 115 60 56 90.25 47.08 43.95 2.05 1.92 1.07
62 1.99 181 56 66 90.82 28.1 33.12 2.74 3.23 0.85
63 9.31 47 20 28 5.05 2.15 3.01 1.68 2.35 0.71
64 8.39 1325 681 261 157.85 81.13 31.09 5.08 1.95 2.61
65 10.86 54 31 43 4.97 2.86 3.96 1.26 1.74 0.72
67 33.75 99 28 51 2.93 0.83 1.51 1.94 3.54 0.55
68 14.95 103 44 38 6.89 2.94 2.54 2.71 2.34 1.16
69 4.78 113 76 73 23.62 15.89 15.26 1.55 1.49 1.04
70 0.5 699 367 115 1388.01 728.75 228.36 6.08 1.9 3.19
71 34.32 532 323 221 15.5 9.41 6.44 2.41 1.65 1.46
Figure 6: Properties of Livehood clusters
5.2. Breakdown of individual neighbourhoods
The venues within each cluster are venues that can be found on the location based
social network Foursquare. Foursquare categorizes its venues in a category hierarchy
with three levels. The 10 main categories at the top of the hierarchy are: Arts &
Entertainment, College & University, Event, Food, Nightlife Spot, Outdoors &
Recreation, Professional & Other Places, Residence, Shop & Service, and Travel &
Transport. Each of these 10 main categories has its own subcategories, which
themselves can be further subcategorized. There are more than 200 subcategories and
sub-subcategories altogether. As places may be referred to at different levels of
granularity, some venues may not have a sub-subcategory. For example, London
Heathrow's Terminal 5 falls in the Travel & Transport main category, the airport
subcategory, and the airport terminal sub-subcategory. London Heathrow Airport,
on the other hand, falls in the same main category and subcategory, but does not have
a sub-subcategory.
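The three-level hierarchy, with its optional lower levels, can be represented compactly in code. The following is a minimal sketch only: the class `VenueCategory` and its field names are hypothetical and do not correspond to Foursquare's API or to the data model used in this work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VenueCategory:
    """A venue's position in a three-level category hierarchy (illustrative)."""
    main: str
    sub: Optional[str] = None      # absent for venues described only at the top level
    sub_sub: Optional[str] = None  # absent for venues such as a whole airport

    def depth(self) -> int:
        """Number of hierarchy levels filled in for this venue (1 to 3)."""
        return 1 + (self.sub is not None) + (self.sub_sub is not None)


# The two Heathrow examples from the text.
terminal_5 = VenueCategory("Travel & Transport", "Airport", "Airport Terminal")
heathrow = VenueCategory("Travel & Transport", "Airport")

print(terminal_5.depth(), heathrow.depth())
```

Because both venues share the same main category, any profile built on the 10 main categories treats them identically, regardless of how deep their classification goes.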
We can gain insight into the makeup of the city by creating city profiles using
information on the category of each venue and the behavior of the users of location based
social media networks. To calculate the distribution of venues / checkins by category
for the city, the formula used to calculate the value for each category (A) was:
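$$A = \frac{\text{No. of venues (or checkins) in the category across the city}}{\text{Total no. of venues (or checkins) in the city}} \times 100$$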
Figure 7 shows the overall distribution of venues and checkins across all clusters
according to Foursquare's main categories, in percentage values. 29.23% of venues in
the data are in the food category, followed by 17.05% of venues in the nightlife spots
category. Users, however, check in mostly at venues related to travel & transport
(23.04%), professional & other places (18.86%), and arts & entertainment
(15.68%). From here, we can observe that venues in the travel & transport, professional
& other places, nightlife spot and arts & entertainment categories receive a disproportionate
number of checkins. This means that clusters formed based on Foursquare checkins are
likely to be biased towards venues in these categories, and may be more suitable
for research questions related to such categories (e.g. transport, culture).
Figure 7: Overall distribution of venues and checkins across clusters (series: % of venues; % of checkins)
Similar profiles can be created for each cluster to form neighbourhood profiles. To
calculate the distribution of venues / checkins by category within a neighbourhood, the
formula used to calculate the value for each category (B) was:
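$$B = \frac{\text{No. of venues (or checkins) in the category within the neighbourhood}}{\text{Total no. of venues (or checkins) in the neighbourhood}} \times 100$$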
This gives a sense of the type of venues in the clusters and the type of activities that
occur within them. These neighbourhood profiles were compared with the city profile
to understand which categories within the neighbourhood were overrepresented /
underrepresented. For each category, the formula was:
$$\text{Percentage difference} = \frac{B - A}{A} \times 100$$
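The city profile, the neighbourhood profile and their comparison can be sketched in a few lines. This is an illustration rather than the code used for the analysis: the function names and the toy venue lists are made up, and the comparison is assumed to be the relative difference between the neighbourhood share (B) and the city share (A), which is consistent with values below -100% never appearing in Tables 2 and 3.

```python
from collections import Counter


def category_profile(categories):
    """Percentage of items (venues or checkins) that fall in each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {cat: 100 * n / total for cat, n in counts.items()}


def percentage_difference(neighbourhood, city):
    """Relative difference (B - A) / A * 100 for each category in the neighbourhood.

    Categories absent from the neighbourhood are omitted, mirroring the
    empty cells in Tables 2 and 3.
    """
    return {
        cat: (b - city[cat]) / city[cat] * 100
        for cat, b in neighbourhood.items()
    }


# Made-up toy data: one main-category label per venue.
city_venues = ["Food"] * 6 + ["Nightlife Spot"] * 3 + ["Travel & Transport"] * 1
cluster_venues = ["Food"] * 1 + ["Travel & Transport"] * 1

city = category_profile(city_venues)        # value A for each category
cluster = category_profile(cluster_venues)  # value B for each category
diff = percentage_difference(cluster, city)

# The category with the highest positive difference characterises the cluster.
most_overrepresented = max(diff, key=diff.get)
print(diff, most_overrepresented)
```

In this toy example, travel & transport makes up 10% of city venues but 50% of the cluster's venues, a difference of 400%, so the cluster would be characterised as a travel & transport cluster in terms of venues. Running the same functions on checkin records instead of venues gives the checkin-based profile.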
Tables 2 and 3 contain the percentage difference figures for all clusters for venues and
checkins respectively, with the highest positive difference for each cluster highlighted.
These percentage differences were used to determine which types of venues occurred
more frequently, and which types of venues users checked in at more frequently,
within the cluster than in the city as a whole. For example, clusters 28 and 29 have more venues
and checkins in the travel and transport category, as these clusters are essentially the
London Heathrow airport terminals, which we expect to have a higher concentration of
venues and checkins related to travel and transport. Another example is clusters with
high levels of concentration of venues and checkins in the college & university category.
Clusters 27, 46 and 47 have percentage difference figures of over 1000% for users
checking-in, and they contain University College London, Brunel University London,
and the Queen Mary University of London respectively.
From Tables 2 and 3, we again observe differences between checkin behaviour and types
of venues. For many clusters, the most overrepresented category in terms of venues is
different from the most overrepresented category in terms of checkins. Cluster 3, for
example, would be characterised as a cluster in the outdoors & recreation category in
terms of venues, and as a cluster in the residence category in terms of checkins.
Table 2: Percentage difference between proportion of venues within cluster to proportion of venues
within city in terms of Foursquare's main categories
Note: Empty cells indicate that the cluster did not contain venues in that category
Columns (left to right): Cluster; Arts & Entertainment; College & University; Food; Nightlife Spot; Outdoors & Recreation; Professional & Other Places; Residence; Shop & Service; Travel & Transport
0 -36.9 22.98 -12.42 -2.02 -69.76 99.3 -74.25 73.22 -75.35
1 56.72 -22.09 94.24 -56.19 56.2 -62.7 -25.86 -73.22
2 -30.23 -31.03 -20.18 102.84 193.33 -20.29 -75.62 -38.01
3 -38.26 11.54 -4.65 134.71 -28.4 58.7 -29.9 5.49
4 -58.42 -36.97 18.11 -40.54 -22.52 10.51 18.75 -57.63 75.24
5 67.78 9 34.81 -45.16 -30.51 -65.77 -37.2 63.82
6 -9.85 -79.5 9.09 13.45 31.04 47.02 -17.33 -35.31
7 -31.44 -77.73 60.26 56.88 -78.1 2.95 -52.95 -79.92
8 -54.73 -22.79 23.46 -9.36 -49.38 2.55 -83.84 48.28 -18.77
9 -55.9 180.75 2.22 -4.14 -90.14 133.34 -62.21 -42.23 -57.8
10 -48.85 55.06 -18.64 -4.91 -42.81 122.44 -75.65 -21.83 -15.52
11 49.53 154.99 27.43 9.59 -68.65 -8.55 10.19 -61.68
12 26.02 -4.49 25.29 92.24 -76.52 -54.33 -70.01 -8.28 -71.29
13 30.04 10.82 -22.51 9.05 13.9 -53.57 -43.21 40.73
14 -0.96 -13.19 21.4 -36.72 38.45 41.43 -13.5 -8.14
15 -20.1 5.53 -25.73 -13.13 -73.83 166.2
16 129.2 -39.23 -34.44 -14.57 1.53 45.47 33.45 56.65
17 21.43 24.18 54.36 -24.57 -41.32 -80.73 6.06 -44.67
18 -52.25 117.14 -29.47 -13.5 60.17 -25.01 263.68 18.16 3.35
19 -14.22 -51.24 31.58 -38.66 -4.08 -53.37 22.5 -25.08 61.23
20 -15.11 286.03 -8.37 -23.11 58.2 7.68 -19.18 -1.14 -12.97
21 13.19 -1.02 -10.97 31.98 143.38 -49.51 74.07 14.06 -46.44
22 60.08 -33.82 -18.15 70.65 95.26 -36.71 24.69 -36.45 -35.35
23 -34.8 -62.94 -2.77 -39.4 64.02 24.05 -6.9 -0.35 33.67
24 18.68 -10.05 31.48 -60.4 -55.77 -21.14 -71.75 141.86 -52.68
25 -49.49 -28.26 -22.95 88.27 9.84 92.36 61.76 9.32
26 -84.9 -74.66 -80.66 520.55
27 5.15 557.5 -5.91 -9.77 52.44 -43.69 -2.42 -68.56
28 -51.49 -42.13 -86.13 -57.63 413.88
29 -57.48 -64.32 -69.86 -27.37 383.09
30 -84.8 280.27 24.38 -69.56 -32 -6.33 -34.86 65.99 -6.48
31 291.8 78.17 6.84 34.47 -12.38 -43.2 11.9 -46.44
32 62.27 -59.01 -4.74 -48.43 20.96 -67.33 80.23 33.85 66.35
33 113.59 49.43 -27.19 -20.11 28.6 -52.36 205.03 -13.9 34.75
34 -52.34 21.84 5.64 -60.03 12.27 118.52 -79.64
35 85.21 40.38 15.75 23.61 -30.97 -55.25 164.5 -19.12 -57.8
36 -16.27 -36.54 -61.95 35.7 212.06 21.38 99.28 -14.69 -4.63
37 -35.32 -17.32 1.74 5.48 38.55 -85.88 143.13
38 171.65 -14.21 54.33 33.81 -83.13 -64.45 -89.22 -24.21 -79.37
39 -13.91 226.23 -21.75 10.8 -51.87 4 -38.53 -68.67 76.51
40 -44.94 12.62 -37 23.14 -80.04 253.85 20.23 41.13
41 3.16 17.28 -7.7 69.65 15.34 -81.31 120.97 57.67 -73.56
42 -39.18 38.28 -6.71 -13.03 -32 -44.9 -86.72 180.57
43 -21.64 -73.29 -77.59 -12.38 13.6 -65.78 328.44
44 -73.54 20.32 -36.87 -1.62 166.26 -23.29 -62.21 15.55 71.79
45 -56.6 -27.16 184.75 269.19 263.68
46 461.5 -36.87 5.95 107.09 11.88 76.33 7.84 -36.71
47 -6.69 112.17 -33.73 -2.14 143.45 7.09 99.89 8.67 -25.59
48 473 73.62 -45.37 42.38 -76.93 -72.2
49 4.48 -15.42 -17.82 -12.38 -14.8 198.41 59.69 -10.74
50 19.84 9 34.81 -38.3 -46.4 -13.13 -31.54 25.6 -18.09
51 36.96 -61.07 48.82 37.11 -61.71 -13.13 -25.24 -70.75
52 -25.23 27.5 -55.4 -62.58 -58.2 8.39 140.23 22.44 149.11
53 -58.63 -76.49 30.43 -17.18 -19.06 -47.53 94.2 -8.11
54 -75.45 11.62 -41.43 -1.71 37.23 -19.93 5.16 125.1 17.43
55 7.23 -2.48 -12.28 -7.99 43.88 -45.59 83.76 124.77 -56.03
56 19.84 -75.23 16.98 -22.1 -39.09 -48.67 117.82 66.52 -25.54
57 -68.66 -28.73 49.58 25.51 -64.95 -43.2 -10.48 36.88 -67.87
58 92.51 -63.52 -15.24 0.94 25.56 51.16 -31.27 -29.94 -17.77
59 -88.42 -47.36 -9.25 65.54 42.38 -66.44 98.37 1.1 2.85
60 -76.85 5.28 -21.08 -60.27 81.21 -24.48 429 81.98 -28.8
61 -53.7 -21.08 -7.3 -48.23 -16.09 32.25 21.32 89.88
62 -18.51 -7.35 4.17 74.81 173.36 -26.16 -46.62 -58.23
63 -7.39 -36.87 58.92 3.55 -32.87 142.65 -36.71
64 23.23 -62.64 -6.19 -10.71 65.34 10.16 -6.15 -46.19 34.75
65 -48.94 2.83 -33 73.74 156.72 -73.83 104.77
67 120.25 -81.23 25.99 146.27 -0.22 -21.37 20.23 50.53
68 764.33 40.38 -47.39 -11.71 38.06 -77.62 -11.83 -15.61
69 -68.66 81.63 -10.35 75.23 -77.28 79.04 -86.31 -14.31
70 -52.24 -17.67 -15.9 -6.06 -31.49 -81.66 215.81
71 18.68 -10.05 -17.4 13.15 54.82 -3.22 154.23 -17.94 -22.27
Table 3: Percentage difference between proportion of users within cluster checking-in to proportion of users
within city checking-in in terms of Foursquare's main categories
Note: Empty cells indicate that the cluster did not contain checkins at venues in that category
Columns (left to right): Cluster; Arts & Entertainment; College & University; Food; Nightlife Spot; Outdoors & Recreation; Professional & Other Places; Residence; Shop & Service; Travel & Transport
0 -79.6 -0.55 40.74 27.39 -68.98 -18.97 -85.59 301.5 -56.14
1 -76.69 102.86 443.13 -84.49 -9.12 -53.68 -62.41 -75.13
2 -96.62 -89.9 -89.18 -79.65 397.75 -94.49 -98.82 -93.59
3 -81.43 104.46 56.17 47.46 -35.65 512.27 -40.5 -20.42
4 -72.77 9.73 -4.76 -70.33 -73.65 -20.14 4.96 -88.73 154.67
5 126.08 -58.82 45.69 -29.67 -90.09 -83.59 -64.74 54.72
6 116.72 -82.9 21.01 42.39 -48.17 -15.14 50.33 -77.63
7 -81.33 -86.93 259.98 268.23 -43.21 -52.84 -28.4 -87.57
8 -78.01 -39.08 159.78 23.12 -55.41 -63.36 -90.29 176.79 -32.2
9 -85.31 425.78 109.76 105.26 -98.28 67.16 -40.13 -46.84 -47.13
10 -77.83 92.76 14.47 19.42 -85.73 52.78 -81.92 8.74 26.19
11 -13.69 445.6 72.18 60.66 -68.8 -57.56 207.14 -75.39
12 -48.3 71.85 178.87 242.42 -60.69 -44.18 201.37 -45.07 -77.12
13 -82.6 -61.13 -82.41 612.28 -83.99 -89.89 -94.21 -28.25
14 -38.17 13.87 51.86 -92 40.98 290.4 -50.08 13.97
15 -72.15 -35.39 -67.18 -73.94 -96.01 236.68
16 524.9 -91.76 -95.61 -99.12 -92.89 -81.57 -86.8 -93.28
17 107.66 10.02 96.5 -12.62 -78.23 -78.63 95.86 -68.76
18 -80.65 243.03 21.36 83.09 -41.15 -43.26 822.86 37.06 -12.31
19 -34.14 -61.93 244.84 8.38 -56.45 -81.68 142.79 -17.42 -14.89
20 -32.68 524.9 32.26 -3.85 83.51 -38.03 75.02 -42.16 -10.84
21 195.81 -44.45 12.16 78.69 -30.11 -76.61 192.27 -35.31 -77.81
22 200.33 117.03 8.52 100.78 -40.43 -73.89 90.3 -72.75 -70.07
23 -84.82 -24.29 61.91 -3.02 41.43 -2.85 20.7 -7.8 11.63
24 -28.72 41.17 65.53 -59.82 -88.47 -39.35 -67.85 397.29 -90.41
25 -89.82 2.18 15.6 -7.1 -20.61 331.64 62.27 28.77
26 -97.28 -94.73 -99.06 316.58
27 -84.32 1098.59 126.17 174.43 -12.23 107.7 -4.81 -73.98
28 -91.32 3.49 -91.78 -73.64 248.71
29 -75.58 -87.72 -96.76 -86.46 295.48
30 -82.04 31.14 -37.7 -92.67 108.91 -84.51 -77.6 505.79 -84.18
31 352.94 186.79 -24.24 14.29 -93.44 -93.1 220.05 -83.63
32 287.79 -56.49 -11.74 -65.94 -77.61 -84.3 125.45 -35.42 -45.16
33 95.82 70.33 -26.53 3.9 -16.51 -61.95 559.49 -22.22 -20.15
34 -92.35 174.14 129.52 -40.18 -36.21 167.03 -86.18
35 -79.11 -44.45 104.41 58.14 204.98 -83.29 342.83 -74.63 -28.66
36 71.27 -30.99 -67.45 32