Hot-Spot analysis Using Apache Spark framework


Apache Spark Framework to Process Large Scale Data

Supriya, Shalmali Bhoir, Krishna Dharaiya, Kavita Korgaonkar, Tejal Sabnis
School of Computing, Informatics, and Decision Systems Engineering,
Arizona State University, Tempe, AZ
{sashokk2, sbhoir, kdharaiy, kkorgaon, tsabnis}@asu.edu

ABSTRACT

With the advent of Big Data, the volume of available spatial data is increasing tremendously. This surge has led people in academia and industry to analyze the data and discover patterns and trends for business intelligence applications and research purposes. In this project, we learn to handle and process huge spatial and spatial-temporal datasets and execute queries on them. The project runs Apache Spark on a Hadoop cluster of machines, where the GeoSpark library, an in-memory cluster computing framework, is used for processing large scale spatial data. GeoSpark provides a geometrical operations library that accesses spatial RDDs to perform basic geometrical operations. We leverage Spark's ability to perform fast, scalable, fault-tolerant operations on Resilient Distributed Datasets to carry out geo-spatial operations. We have also established a suitable measure to strike a balance between run time performance and CPU/memory utilization in the distributed cluster. In addition, we implement Hot Spot analysis using the Getis-Ord Gi* spatial statistic to identify statistically significant hot spots in the spatial data within the Apache Spark framework.

Keywords

GeoSpark, Large scale data, Spatial data, Spatial-Temporal dataset, Hadoop, Apache Spark, Cluster computing, Distributed dataset

1. INTRODUCTION

Geo-spatial queries involve spatial data and operations that are not inherently supported by Apache Spark, so users must write tedious custom code for such data processing jobs on top of Spark. The GeoSpark system extends the Resilient Distributed Dataset (RDD) to support spatial data. The key contributions of this project are as follows. (1) GeoSpark as a full-fledged cluster computing framework to perform queries such as the spatial range query, spatial KNN (K-Nearest Neighbors) query, and spatial join query. (2) Analysis of the efficiency of these queries when run on indexed and unindexed data. Experiments show that GeoSpark achieves better run time performance than its Hadoop-based counterparts. (3) Applying spatial statistics to spatial-temporal big data in order to identify statistically significant spatial hot spots. For this phase we used the New York City Yellow Taxi dataset for January 2015 (~1.8 GB). Section 2 describes the system design; section 3 describes the queries run on Apache Spark using GeoSpark APIs and the Hot Spot analysis using the Spark framework [8]; section 4 presents the experiments and results; section 5 gives the conclusion; and sections 6 and 7 contain the acknowledgements and references, respectively.

2. SYSTEM DESIGN

This section explains the Hadoop and Spark technologies that enable fast, distributed computation on large datasets.

2.1 Hadoop

The Apache Hadoop project is open-source software for reliable, scalable, fault-tolerant, distributed computing [1]. It is a framework that allows for the distributed processing of large datasets across clusters of machines, each offering local computation and storage. The key feature provided by Hadoop, apart from increased performance and lower latencies, is its ability to provide fault tolerance by replicating data across worker nodes in the cluster. This replication also increases the availability of the data. The components of a Hadoop system are the Name-node, Data-node, Job-tracker, and Task-tracker [2]. Hadoop consists of a distributed filesystem (Hadoop Distributed File System), a framework for job scheduling and cluster resource management (Hadoop YARN), and a parallel processing system for large datasets (Hadoop MapReduce) [3][4].

2.2 HDFS (Hadoop Distributed File System)

The Hadoop Distributed File System (HDFS) is used to store the distributed data, i.e., the vertices of the rectangle RDDs or the search query windows. The data is distributed and stored across the master and slave nodes in a transparent manner when it is first loaded into HDFS. Computations fetch their input from HDFS and store their results back to HDFS.

2.3 Apache Spark

Apache Spark is an open source cluster computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Spark works on Resilient Distributed Datasets (RDDs) [5][9]. These are immutable objects that cannot be modified in place but can be transformed into new RDDs or used in actions. They can be explicitly persisted and employ lineage-based recovery schemes. We use RDDs to load the vertices of geo-spatial objects from HDFS and perform transformations and actions to run our geo-spatial algorithms.


Figure 1. Spark Architecture

Spark's in-memory cluster computing framework can be used in three modes: standalone, over YARN, or inside Hadoop MapReduce. We have used Spark in both distributed and standalone mode, with the data for processing residing in HDFS or in local storage. Apache Spark has rich high-level libraries for Java, Python, Scala, R, and SQL. Spark's unified engine can handle diverse workloads such as machine learning, analytics, and graph processing on large datasets, in both batch and streaming processing. Figure 1 depicts the Spark architecture and Figure 2 depicts master-slave task execution in the Spark environment.
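As an illustration, the snippet below is a minimal Java sketch of creating a Spark context for standalone cluster mode and pointing it at data in HDFS; the application name, master URL, and file path are placeholders rather than our actual cluster settings.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkContextSetup {
    public static void main(String[] args) {
        // Standalone cluster mode: point at the Spark master started on the master node.
        // For local testing, the master URL can be replaced with "local[*]".
        SparkConf conf = new SparkConf()
                .setAppName("GeoSpatialQueries")
                .setMaster("spark://master:7077");   // placeholder master URL
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load a CSV file from HDFS; each line becomes one String element of the RDD.
        JavaRDD<String> lines = sc.textFile("hdfs://master:9000/data/arealm.csv");
        System.out.println("Number of input records: " + lines.count());

        sc.stop();
    }
}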

2.4 Master Node

We configured one machine as the master node, running Hadoop and Spark. The master node is responsible for distributing work among the slave nodes and for collecting the results of computations after they complete.

2.5 Slave Node

We configured three slave nodes, with the master node also running as a worker. These nodes receive the data partitioned by the master and perform computations on their own partitions. All slave nodes run the same algorithm and send the results of their local computations back to the master, which computes the global result. Table 1 shows the master-slave configuration of the three machines.

Figure 2. Spark Master-Slave Task execution

Figure 3. Manipulating RDD using Transformations and Actions

2.6 Resilient Distributed Datasets (RDD)

At a high level, a Spark application creates RDDs out of the input data, runs (lazy) transformations on these RDDs to convert them into some other form, and finally performs actions to collect or store data. The Resilient Distributed Dataset (RDD) is the core concept of the Spark framework. It can hold any type of data, and Spark stores the data of an RDD across different partitions. RDDs are immutable: applying a transformation does not modify the original RDD but returns a new one. RDDs support two types of operations:

Transformations - map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, coalesce

Actions - reduce, collect, count, first, take, countByKey, foreach.
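A minimal Java sketch of this transformation/action pattern is shown below (the input path and record layout are hypothetical, and Java 8 lambda syntax is used for brevity even though the cluster in section 2.7 lists OpenJDK 1.7): a text file is loaded into an RDD, transformed lazily with map and filter, and an action finally triggers the computation.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddBasics {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("RddBasics").setMaster("local[*]"));

        // Transformation chain: lazily evaluated, each step returns a new RDD.
        JavaRDD<String> lines = sc.textFile("hdfs://master:9000/data/points.csv"); // hypothetical path
        JavaRDD<Double> longitudes = lines
                .map(line -> Double.parseDouble(line.split(",")[0])) // first column as longitude
                .filter(lon -> lon >= -74.25 && lon <= -73.7);       // keep values in a range

        // Action: triggers the actual computation and returns a result to the driver.
        long kept = longitudes.count();
        System.out.println("Points within the longitude range: " + kept);

        sc.stop();
    }
}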

2.7 Environment Setup

The environment setup of the system is as follows.

● Operating system of each node in the cluster is Ubuntu 14.04.

Machine             Processor   Disk Space   RAM
Master (Worker 1)   1.60 GHz    88.9 GB      5.6 GB
Worker 2            2.60 GHz    27.6 GB      3.8 GB
Worker 3            2.40 GHz    19.9 GB      5.9 GB

Table 1. Configuration of master-slave nodes


Figure 4. Input and Output of Spatial Range Query

● Software installation versions:
  - OpenJDK - 1.7.0
  - Hadoop - 2.6.0
  - Spark - 1.2.0
  - SSH - for password-less communication
● Created password-less SSH login between all slave nodes and the master node.
● Connected the master with all slaves using the Hadoop and Spark configuration setup.

This setup ensures that the whole cluster is up and running.

3. IMPLEMENTATION OF QUERIES

In project phase 1 we were required to load the GeoSpark jar into the Apache Spark Scala shell and execute the following operations using Scala. Adding the GeoSpark jar file to the project dependencies enables us to use the GeoSpark APIs in the Spark program. In this phase, we had to create a GeoSpark SpatialRDD (PointRDD) and then implement the spatial range query, spatial KNN query, and spatial join query. The sections below elaborate on each of these queries.

3.1 Spatial Range Query

The spatial range query finds all objects that lie within a given range. For geo-spatial operations, the range query locates all geometric points contained in the given query rectangle, and we use it to check the containment of points within given query rectangles. GeoSpark executes the range query algorithm following this execution model: load the target dataset, partition the data into a number of partitions, create a spatial index on each SRDD partition if required, broadcast the query window to each SRDD partition, check the spatial predicate in each partition, and remove duplicate spatial objects that exist due to the data partitioning phase. Figure 4 shows the points in the x-y space; the query window is specified by inputting its coordinates, and the output of SpatialRangeQuery() is the number of points in the query window.

Input:
  The envelope, i.e., the query window. Example: (-113.79, -09.73, 32.99, 35.08)
  The HDFS path for the point dataset.

Output:
  Count of points in the envelope.

Algorithm:
1. Define the envelope based on the input.
2. Read from the given HDFS path and load the data into a PointRDD (objectRDD).
3. Build the R-Tree on the objectRDD if needed.
4. Execute the RangeQuery.SpatialRangeQuery() API of GeoSpark for the spatial range query, or execute the RangeQuery.SpatialRangeQueryUsingIndex() API of GeoSpark for the spatial range query using an index.
5. Return the count of the result obtained in step 4.

Figure 5. Spatial Range Query Algorithm

Figure 2 shows the various points in the space; the coordinates of the envelope are defined and the spatial range query returns all the points inside the envelope [10].

Input:
  Input1 - queryEnvelope: x1, y1, x2, y2. These are the coordinates of the envelope we want to create.
  Input2 - the HDFS location of the point dataset. The CSV file is loaded via PointRDD into the objectRDD; the dataset consists of (longitude, latitude) points, each defining a point in the space.

Output:
  The spatial range query returns the number of points contained in the queryEnvelope whose coordinates are specified in Input1.

Algorithm:
The algorithm reads one file that contains all the vertices for the input polygons (arealm.csv). The input from this file is loaded into an RDD and partitioned across the worker nodes. A plane sweep is performed at each worker to check for containment of points in the query window. The results are sent back to the master, where they are merged after removing duplicates (if any) and the count is returned. The other part of the spatial range query builds an R-Tree index on the PointRDD and then runs the spatial range query on it.


Figure 3 shows the code for the spatial range query, and Figure 4 shows the code for the spatial range query using an index, where an R-Tree index is built on the PointRDD and that PointRDD is then queried [6].
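To make the containment check concrete, the sketch below implements the core of an unindexed range query with plain Spark RDDs and the JTS Envelope class that GeoSpark builds on; it illustrates the spatial predicate only, not the GeoSpark RangeQuery API itself, and the input path and column layout are assumptions.

import com.vividsolutions.jts.geom.Envelope;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RangeQuerySketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("RangeQuerySketch").setMaster("local[*]"));

        // Query window given as (minX, maxX, minY, maxY), as in the example envelope above.
        final Envelope queryWindow = new Envelope(-113.79, -9.73, 32.99, 35.08);

        // Each CSV line is assumed to hold "longitude,latitude,..." for one point.
        JavaRDD<String> lines = sc.textFile("hdfs://master:9000/data/arealm.csv");

        // Spatial predicate, evaluated independently in every partition.
        long count = lines.filter(line -> {
            String[] cols = line.split(",");
            double x = Double.parseDouble(cols[0]);
            double y = Double.parseDouble(cols[1]);
            return queryWindow.contains(x, y);   // containment check against the window
        }).count();

        System.out.println("Points inside the query window: " + count);
        sc.stop();
    }
}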

3.2 Spatial KNN Query

GeoSpark uses a heap-based top-k algorithm, which has two phases: selection and merge. It takes a partitioned SRDD, a point P, and a number K as inputs. To calculate the K nearest objects around point P, in the selection phase GeoSpark computes, for each SRDD partition, the distance from every object to the given point P and maintains a local heap, adding or removing elements based on these distances. This heap contains the K nearest objects around the given point P. For an indexed SRDD, the system can utilize the local indexes to reduce the query time. After the selection phase, GeoSpark merges the results from each partition, keeps the K elements with the shortest distances to P, and outputs the result. Figure 5 specifies the Spatial KNN query code, and Figure 6 shows the code for building an R-Tree on the 2-D PointRDD and running SpatialKnnQueryUsingIndex().

Input:
  Coordinates of the point whose neighbors are to be found.
  Value of k, i.e., the number of neighbors.
  The HDFS location of the input points in 2-D space.

Output: The coordinates of the k nearest neighbors.

Algorithm:
1. Create a point from the coordinates given in the input.
2. Read from the given HDFS path and load the data into a PointRDD (objectRDD).
3. Build the R-Tree on the objectRDD (PointRDD) wherever specified in the query.
4. Execute the KNNQuery.SpatialKnnQuery() API of GeoSpark for the spatial KNN query, or execute the KNNQuery.SpatialKnnQueryUsingIndex() API of GeoSpark for the spatial KNN query using an index.
5. Output the coordinates of the k nearest neighbors.

Figure 6. Spatial KNN Query code
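To make the selection-and-merge idea concrete, the sketch below computes a k-nearest-neighbors result with plain Spark primitives: takeOrdered performs the per-partition selection of the k smallest distances and merges them at the driver. It mirrors, but does not reproduce, GeoSpark's heap-based implementation; the query point, k, input path, and record layout are assumptions.

import java.io.Serializable;
import java.util.Comparator;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class KnnQuerySketch {
    // The comparator must be Serializable because Spark ships it to the executors.
    static class DistanceComparator implements Comparator<Tuple2<double[], Double>>, Serializable {
        @Override
        public int compare(Tuple2<double[], Double> a, Tuple2<double[], Double> b) {
            return Double.compare(a._2(), b._2());
        }
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("KnnQuerySketch").setMaster("local[*]"));

        final double qx = -73.98, qy = 40.75;   // assumed query point
        final int k = 5;

        // Map each "longitude,latitude" record to (point, squared distance to the query point).
        JavaRDD<Tuple2<double[], Double>> withDistance =
                sc.textFile("hdfs://master:9000/data/arealm.csv").map(line -> {
                    String[] c = line.split(",");
                    double x = Double.parseDouble(c[0]);
                    double y = Double.parseDouble(c[1]);
                    double d = (x - qx) * (x - qx) + (y - qy) * (y - qy);
                    return new Tuple2<>(new double[]{x, y}, d);
                });

        // Selection happens per partition and the merge at the driver:
        // takeOrdered returns the k elements with the smallest distances overall.
        List<Tuple2<double[], Double>> nearest = withDistance.takeOrdered(k, new DistanceComparator());
        for (Tuple2<double[], Double> t : nearest) {
            System.out.println(t._1()[0] + ", " + t._1()[1]);
        }

        sc.stop();
    }
}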

Input:
  The HDFS location of input1 (points).
  The HDFS location of input2 (rectangles).

Output: Count (size) of the query result.

Algorithm:
1. Read the CSV file of coordinate points from HDFS input1 and load it into a PointRDD.
2. Build the R-Tree on the PointRDD wherever specified.
3. Read the CSV file with the coordinates of the rectangles from HDFS input2 and load it into a RectangleRDD.
4. Execute JoinQuery(sc, PointRDD, RectangleRDD).
5. Return the size of the result of the query.

Figure 7. Spatial Join Query code

3.3 Spatial Join Query

The spatial join query joins a set of polygons with the spatial points based on a spatial join predicate. GeoSpark first partitions the data from the two input SRDDs and creates local spatial indexes (if required) for the SRDDs [7]. It then joins the two datasets by their keys: for spatial objects that have the same grid ID, GeoSpark calculates their spatial relations, and if two elements from the two SRDDs overlap, they are kept in the final result. The algorithm then groups the results for each rectangle; the grouped results are either rectangles or points. Finally, the algorithm removes duplicates and returns the result to other operations or saves it to disk. The spatial join query creates a GeoSpark RectangleRDD and joins it with a PointRDD. The different spatial join queries executed on the dataset are listed below; a conceptual sketch of the per-rectangle overlap check follows the list.

1. Join the PointRDD using Equal grid without R-Tree index.
2. Join the PointRDD using Equal grid with R-Tree index.
3. Join the PointRDD using R-Tree grid without R-Tree index.
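As referenced above, the following is a conceptual sketch of the containment/overlap predicate at the heart of the join, written with plain Spark RDDs and the JTS Envelope test. GeoSpark's actual implementation first partitions both SRDDs by grid ID so that only co-located pairs are compared, which this simplified version omits; the input paths and column layouts are assumptions.

import com.vividsolutions.jts.geom.Envelope;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class JoinPredicateSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("JoinPredicateSketch").setMaster("local[*]"));

        // Rectangles as "x1,x2,y1,y2" and points as "x,y" (assumed layouts).
        JavaRDD<Envelope> rectangles = sc.textFile("hdfs://master:9000/data/zcta510.csv").map(l -> {
            String[] c = l.split(",");
            return new Envelope(Double.parseDouble(c[0]), Double.parseDouble(c[1]),
                                Double.parseDouble(c[2]), Double.parseDouble(c[3]));
        });
        JavaRDD<double[]> points = sc.textFile("hdfs://master:9000/data/arealm.csv").map(l -> {
            String[] c = l.split(",");
            return new double[]{Double.parseDouble(c[0]), Double.parseDouble(c[1])};
        });

        // Pair every rectangle with every point, then keep the pairs whose point
        // falls inside the rectangle. GeoSpark avoids this full pairing via grid IDs.
        JavaPairRDD<Envelope, double[]> joined = rectangles.cartesian(points)
                .filter(pair -> pair._1().contains(pair._2()[0], pair._2()[1]));

        System.out.println("Matching (rectangle, point) pairs: " + joined.count());
        sc.stop();
    }
}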

3.4 Spatial Join Query (Cartesian Product)

The second phase of the project included implementing the SpatialJoinQueryUsingCartesianProduct() function in the GeoSpark source code. This function computes the spatial join query using a simple Cartesian product algorithm between the PointRDD and the RectangleRDD: for each rectangle in the query window dataset, the rectangle is checked against the point dataset using the regular GeoSpark spatial range query. The function is implemented in the JoinQuery.java file as a function of the JoinQuery class.


Input: PointRDD, RectangleRDD

Output: JavaPairRDD<Envelope, HashSet<Point>>

Algorithm:
1. Create a PointRDD objectRDD.
2. Create a RectangleRDD queryWindowRDD.
3. Collect the rectangles from queryWindowRDD into one Java List L.
4. For each rectangle R in L:
     RangeQuery.SpatialRangeQuery(objectRDD, R, 0)
   End
5. Collect all the results.
6. Parallelize the results to generate an RDD in the format JavaPairRDD<Envelope, HashSet<Point>>.
7. Return the result RDD.

Figure 8. Spatial Join Query using Cartesian Product Algorithm

The algorithm for the spatial join query using the Cartesian product is explained in Figure 8.
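Below is a minimal sketch of this collect-and-loop structure using plain Spark and JTS types. It follows the steps of Figure 8 but substitutes an RDD filter for the GeoSpark SpatialRangeQuery call, so the class name, input paths, and record layouts are illustrative rather than the project's actual source.

import java.util.ArrayList;
import java.util.List;
import com.vividsolutions.jts.geom.Envelope;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CartesianJoinSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("CartesianJoinSketch").setMaster("local[*]"));

        // Steps 1-2: load points ("x,y") and query rectangles ("x1,x2,y1,y2"), layouts assumed.
        JavaRDD<double[]> points = sc.textFile("hdfs://master:9000/data/arealm.csv")
                .map(l -> { String[] c = l.split(",");
                            return new double[]{Double.parseDouble(c[0]), Double.parseDouble(c[1])}; });
        JavaRDD<Envelope> rectRdd = sc.textFile("hdfs://master:9000/data/zcta510.csv")
                .map(l -> { String[] c = l.split(",");
                            return new Envelope(Double.parseDouble(c[0]), Double.parseDouble(c[1]),
                                                Double.parseDouble(c[2]), Double.parseDouble(c[3])); });

        // Step 3: collect the rectangles to the driver.
        List<Envelope> rectangles = rectRdd.collect();

        // Steps 4-5: one range query per rectangle, gathering (rectangle, matching points) pairs.
        List<Tuple2<Envelope, List<double[]>>> results = new ArrayList<>();
        for (Envelope r : rectangles) {
            List<double[]> inside = points.filter(p -> r.contains(p[0], p[1])).collect();
            results.add(new Tuple2<>(r, inside));
        }

        // Steps 6-7: parallelize the driver-side results back into an RDD and return/inspect it.
        JavaRDD<Tuple2<Envelope, List<double[]>>> resultRdd = sc.parallelize(results);
        System.out.println("Rectangles processed: " + resultRdd.count());
        sc.stop();
    }
}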

3.5 IDENTIFY TOP FIFTY SPATIO-TEMPORAL HOT SPOTS

In today's world, observational data is collected at an ever increasing rate. With the advent of Big Data and the ubiquitous collection of spatial-temporal observational data (e.g., vehicle and mobile asset tracking data, geo-fencing), identifying unusual patterns in the data in a statistically significant manner has become a key problem that numerous businesses and organizations are attempting to solve for real-time and batch analysis. Since the data is humongous, such algorithms cannot scale on a single machine. Over the past few years, interest in the Apache Spark framework has exploded, both in industry and academia. Spark can readily be installed on clusters of hardware and offers a sophisticated software platform for creating scalable distributed algorithms to work on these massive datasets.

Problem Definition

Phase 3 of the project focuses on applying spatial statistics to the spatial-temporal big data in order to identify statistically significant hot spots using the distributed Apache Spark computing framework.

Figure 9. The dataset of New York Yellow Taxi January 2015 and the boundary around the dataset

Input

In this problem we utilize a very large collection of spatial-temporal observational data: the New York City Yellow Taxi January 2015 dataset, approximately 1.8 GB in size. This dataset consists of 12,748,982 records representing all Yellow Cab taxi trips in New York City in January 2015. Each record contains key information such as pickup and drop-off date, time, and location (latitude, longitude), trip distance, passenger count, and fare amount. Given this dataset, we are required to identify the fifty most statistically significant drop-off locations in both time and space using the Getis-Ord statistic. Space is aggregated into a grid (using latitude and longitude), while time is aggregated into time windows, i.e., one-day periods for the month of January (31 days). The source data is clipped to an envelope encompassing the five New York City boroughs in order to remove some of the noisy, erroneous data: the latitude ranges from 40.5N to 40.9N and the longitude from 73.7W to 74.25W. Figure 9 shows the dataset plotted on the New York City map and the defined boundary of the dataset, which restricts us to considering only the points inside the envelope so as to remove noise and outliers.

Output

The application runs on top of the Apache Spark framework and is implemented in Java. The dataset is provided as an uncompressed CSV file, and a CSV file containing the top fifty hot spots in the format (latitude, longitude, time_step, zscore) is saved to the given output path.

Approach

GIS professionals use spatial statistics to identify statistically significant clusters or outliers in spatial data. When identifying statistically significant clusters (often termed Hot Spot Analysis), a very commonly used statistic is the Getis-Ord Gi*. It provides z-scores and p-values that allow users to determine whether features with either high or low values are clustered spatially. This statistic can be applied to both the spatial and spatial-temporal domains. In phase 3 of this project, we focus on the spatial-temporal Getis-Ord statistic calculation for the hot spot analysis, with cells having a higher z-score being hotter.


Figure 10. Spatial-Temporal Space 3-D structure

Time and space are aggregated into cube cells, where the x-axis denotes latitude, the y-axis denotes longitude, and time is on the z-axis. Each cell unit size is 0.01 * 0.01 in terms of latitude and longitude. One day is used as the time step size, which amounts to 31 steps (the number of days in January) on the z-axis. In total there are 68,200 cells in the spatial-temporal space of our clipped dataset. Figure 10 depicts the 3-D structure of the dataset, where the 2-D spatial data plotted with longitude and latitude values is plotted against the time axis.

The algorithm for the Hot Spot analysis makes use of the Java APIs of the Spark framework. The Spark context is created using SparkConf(), and the input data file is loaded into an RDD using the textFile() method of the Spark context. We then apply Spark transformations to the RDD. First, we use the flatMap() transformation, a combination of flatten and map, which creates a new RDD from the existing one: each data record is mapped to the (x y z) cell it would occupy in the 3-D structure, where x, y, and z are the cell numbers on the x-axis, y-axis, and z-axis respectively. The second transformation is mapToPair(), which creates key-value pairs of the form ((x y z), 1) and converts the JavaRDD into a JavaPairRDD. The reduceByKey() transformation then aggregates the cells with the same key; as a result we get the number of occurrences of each cell in the dataset, i.e., the total number of points falling in each cell of the 3-D space. We further apply coalesce(1), which returns a new RDD reduced to one partition, and the action collect(), which fetches the entire RDD onto a single machine. We can use collect() because only about 15K aggregated key-value pairs remain, which easily fit in memory. We then process the collected RDD as a Java collection, checking whether the neighbors of each cell are present in the list and applying the Getis-Ord statistic accordingly. Finally, the cells are sorted in descending order of their z-score, and the fifty cells with the top z-scores, indicating the hottest cells, are output. The total time taken to execute the Java application that finds the fifty hot spots on the 1.8 GB dataset using Spark on a 4 GB RAM Ubuntu machine is around 1.5 minutes.
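A condensed sketch of this pipeline is given below. It follows the flatMap / mapToPair / reduceByKey / coalesce / collect structure described above, with the clipping envelope from the problem statement; the input path, column indices, and date parsing are assumptions rather than the project's actual source, and Spark 1.x flatMap semantics (returning an Iterable) are assumed.

import java.util.Collections;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class HotSpotCellCount {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("HotSpotCellCount"));

        // Clipping envelope for the five boroughs, from the problem statement.
        final double minLat = 40.50, maxLat = 40.90, minLon = -74.25, maxLon = -73.70;

        JavaPairRDD<String, Integer> cellCounts =
            sc.textFile("hdfs://master:9000/data/yellow_tripdata_2015-01.csv")    // assumed path
              // flatMap: emit one "(x y z)" key per record, or nothing if the pick-up point
              // lies outside the envelope (the pseudo-algorithm marks such records (-1 -1 -1)).
              .flatMap(line -> {
                  String[] c = line.split(",");
                  double lon = Double.parseDouble(c[5]);                 // assumed pick-up longitude column
                  double lat = Double.parseDouble(c[6]);                 // assumed pick-up latitude column
                  int day = Integer.parseInt(c[1].substring(8, 10));     // day of month from pick-up datetime
                  if (lon < minLon || lon > maxLon || lat < minLat || lat > maxLat) {
                      return Collections.<String>emptyList();
                  }
                  int cellX = (int) ((lon - minLon) / 0.01); // same as (|maxLongitude| - |pickupLongitude|) / 0.01
                  int cellY = (int) ((lat - minLat) / 0.01);
                  return Collections.singletonList(cellX + " " + cellY + " " + day);
              })
              // mapToPair + reduceByKey: count how many pick-ups fall in each (x y z) cell.
              .mapToPair(cell -> new Tuple2<>(cell, 1))
              .reduceByKey((a, b) -> a + b)
              .coalesce(1);

        // Only ~15K active cells remain, so they fit comfortably in driver memory.
        List<Tuple2<String, Integer>> activeCells = cellCounts.collect();
        System.out.println("Active cells: " + activeCells.size());
        sc.stop();
    }
}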

Assumptions

Below are the assumptions made for analyzing hot spots in the spatial-temporal dataset.

● Each record in the input dataset is a point in the 3-D space.
● Space is aggregated into a grid of cells using the latitude and longitude of the pick-up location.
● The size of each cell in the 2-D space is 0.01 * 0.01 in terms of latitude and longitude.
● Time is aggregated into windows of one day, comprising a total of 31 values on the z-axis.
● The source data is clipped to an envelope encompassing the five New York City boroughs in order to remove some of the noisy, erroneous data and the outliers. The envelope ranges in latitude from 40.5N to 40.9N and in longitude from 73.7W to 74.25W.
● Each cell in the 3-D space has 26 neighbors, and the cell itself is also considered its own neighbor, as depicted in Figure 11. The weight of cell i with respect to cell j is defined as w_{ij} = 1 if j is one of the 26 neighbors of i or j = i, and w_{ij} = 0 otherwise.

Figure 11. The 26 neighbors of a cell in the 3-D space, with the cell itself also counted as its neighbor
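The 27-cell neighborhood (the 26 neighbors plus the cell itself) can be enumerated directly from a cell's coordinates. The small helper below is an illustrative sketch of that enumeration, not code from the project.

import java.util.ArrayList;
import java.util.List;

public class NeighborCells {
    // Returns the keys of the 27 cells in the neighborhood of (x, y, z), including (x, y, z) itself.
    static List<String> neighborhood(int x, int y, int z) {
        List<String> keys = new ArrayList<>();
        for (int dx = -1; dx <= 1; dx++) {
            for (int dy = -1; dy <= 1; dy++) {
                for (int dz = -1; dz <= 1; dz++) {
                    // dx = dy = dz = 0 is the cell itself, which counts as its own neighbor.
                    keys.add((x + dx) + " " + (y + dy) + " " + (z + dz));
                }
            }
        }
        return keys;   // 27 keys; cells on the grid boundary simply have some keys that never occur
    }

    public static void main(String[] args) {
        System.out.println(neighborhood(10, 20, 5).size());   // prints 27
    }
}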

Input:
New York City Yellow Taxi (January 2015) dataset (~1.8 GB), in which each record contains the pick-up latitude and longitude, date, etc.

Output:
The fifty most significant pick-up locations in both time and space according to the Getis-Ord Gi* statistic.

Pseudo Algorithm:

1. Create the Spark context. (Steps 2 - 5 are run in the Spark context.)
2. Load the input data file into a JavaRDD in the Spark context.
3. Using a word-count approach, map each data point (record) to its 3-D cell, using a FlatMapFunction in Java:
   CellX = (|maxLongitude| - |PickupLongitude|) / 0.01
   CellY = (PickupLatitude - minLatitude) / 0.01
   CellZ = day of the month of January
   If the pick-up longitude and latitude are not inside the envelope, return (-1 -1 -1); else return (CellX CellY CellZ).
4. Count and aggregate the number of times each (CellX CellY CellZ) occurs using the mapToPair function. All cells with the same coordinates are aggregated.
5. Save the output of step 4 to a file, which contains ~15K lines, one per active cell. Each line has the form (CellX CellY CellZ, x_i), where x_i denotes the attribute value of cell i, i.e., the total number of pick-up locations falling in that cell.
6. Collect the contents of this output into a list cellList.
7. Compute n, the total number of cells in the spatial-temporal 3-D space:
   n = (|maxLongitude - minLongitude| / 0.01) x (|maxLatitude - minLatitude| / 0.01) x 31 = 55 x 40 x 31 = 68200
8. Calculate the mean \bar{X} = \frac{\sum_{j=1}^{n} x_j}{n}.
9. Calculate the standard deviation S = \sqrt{\frac{\sum_{j=1}^{n} x_j^2}{n} - \bar{X}^2}.
10. For each cell in cellList:
   10.1 From the coordinates of a cell we know the coordinates of its 26 neighbors.
   10.2 Check whether each neighbor of the cell exists in cellList; if it does, add its attribute value x_j to the neighborhood sum \sum_{j} w_{ij} x_j.
   10.3 Calculate the Gi* statistic, i.e., the z-score of the cell, using the equation
        G_i^* = \frac{\sum_{j=1}^{n} w_{ij} x_j - \bar{X} \sum_{j=1}^{n} w_{ij}}{S \sqrt{\frac{n \sum_{j=1}^{n} w_{ij}^2 - (\sum_{j=1}^{n} w_{ij})^2}{n-1}}}
   10.4 Since w_{ij} = 1 when i and j are neighbors (and zero otherwise), and each cell has 26 neighbors plus itself, \sum_{j} w_{ij} = \sum_{j} w_{ij}^2 = 27, and the statistic simplifies to
        G_i^* = \frac{\sum_{j \in N(i)} x_j - 27\bar{X}}{S \sqrt{\frac{27n - 27^2}{n-1}}}
11. Sort cellList in descending order of the Gi* statistic, i.e., the z-score.
12. Output the fifty top cells with the highest Gi* score.
13. Convert the cell values back into longitude and latitude and output each hot spot's latitude, longitude, and day.
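A driver-side sketch of steps 8 - 12 is given below, operating on the collected (cell, count) pairs produced by the earlier pipeline. The value of n and the 27-neighbor weighting follow the pseudo-algorithm; the method and variable names are illustrative.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GetisOrdSketch {
    // Computes Gi* z-scores for the active cells and returns the fifty hottest, highest score first.
    static List<Map.Entry<String, Double>> topFifty(Map<String, Integer> cellCounts) {
        final double n = 68200.0;              // total cells: 55 x 40 x 31 (step 7)

        // Steps 8-9: mean and standard deviation over ALL cells; empty cells contribute x_j = 0.
        double sum = 0.0, sumSq = 0.0;
        for (int x : cellCounts.values()) { sum += x; sumSq += (double) x * x; }
        double mean = sum / n;
        double stdDev = Math.sqrt(sumSq / n - mean * mean);

        // Step 10.4: with w_ij = 1 over the 27-cell neighborhood, the denominator is constant.
        double denom = stdDev * Math.sqrt((27.0 * n - 27.0 * 27.0) / (n - 1.0));

        Map<String, Double> zScores = new HashMap<>();
        for (String cell : cellCounts.keySet()) {
            String[] p = cell.split(" ");
            int cx = Integer.parseInt(p[0]), cy = Integer.parseInt(p[1]), cz = Integer.parseInt(p[2]);
            // Steps 10.1-10.2: sum the counts of the 27 neighborhood cells; only active ones contribute.
            double neighborSum = 0.0;
            for (int dx = -1; dx <= 1; dx++)
                for (int dy = -1; dy <= 1; dy++)
                    for (int dz = -1; dz <= 1; dz++)
                        neighborSum += cellCounts.getOrDefault(
                                (cx + dx) + " " + (cy + dy) + " " + (cz + dz), 0);
            // Step 10.3 simplified form: (sum over neighborhood - 27 * mean) / denominator.
            zScores.put(cell, (neighborSum - 27.0 * mean) / denom);
        }

        // Steps 11-12: sort by z-score descending and keep the fifty hottest cells.
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(zScores.entrySet());
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        return ranked.subList(0, Math.min(50, ranked.size()));
    }
}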

4. EXPERIMENTS AND RESULTS

Once all operations were implemented, we monitored the performance of the cluster. This section provides a preliminary experimental evaluation of the run time performance of the large-scale spatial data processing systems. We performed tests on the cluster using the arealm.csv and zcta510.csv datasets, and each operation was run multiple times to observe the average performance of the nodes.

Our cluster was set up over an ad-hoc network with one master and three slave nodes. To measure performance, we used the tool Ganglia, which helped us monitor the CPU, memory, and network usage of the cluster while executing tasks. We compared the efficiency of the queries when run on indexed and unindexed data. Each task was executed on the cluster, and screenshots were taken from the Spark and Ganglia UIs.

4.1 SPATIAL RANGE QUERY

This query's performance is compared on indexed and unindexed data.

Unindexed data (Task 2a)

This is the spatial range query where we query the PointRDD dataset using a query window. Here the PointRDD is unindexed, and the average execution time for this query is 147.7 secs. Figure 12 indicates the CPU utilization on the master and 2 slaves, and Figure 13 indicates the memory utilization on the master and 2 slave nodes. The results indicate that the master has the highest CPU and memory utilization compared to the slaves.

Indexed data (Task 2b)

This is the spatial range query where we build an R-tree on the PointRDD and then query it using the query window. The average execution time for this query is 59.9 secs, which is less than task 2a since the data is indexed. Figure 14 indicates the CPU utilization of the master and slave nodes.


Figure 15 shows the memory utilization of the master and slaves. These metrics indicate that the master has the maximum CPU and memory utilization compared to the other two slaves.

Figure 12. CPU utilization on cluster nodes for query 2a

Figure 13. Memory utilization on cluster nodes for query 2a

Figure 14. CPU utilization on cluster nodes for query 2b

Figure 15. Memory utilization on cluster nodes for query 2b

4.2 SPATIAL KNN QUERY

This spatial query finds the 5 nearest neighbors.

Unindexed data (Task 3a)

The figures below show the results of the spatial KNN query on an unindexed PointRDD. The average execution time is 51 secs. Figure 16 indicates the CPU utilization on the master and 2 slaves, and Figure 17 indicates the memory utilization on the master and 2 slave nodes.

Figure 16. CPU utilization on cluster nodes for query 3a

Figure 17. Memory utilization on cluster nodes for query 3a


R-Tree indexed data (Task 3b)

Here we first create an R-Tree index and then query the PointRDD. The average execution time is 17.7 secs, which is less than task 3a since the data is indexed. Figure 18 indicates the CPU utilization on the master and 2 slaves, and Figure 19 indicates the memory utilization on the master and 2 slave nodes.

Figure 18. CPU utilization on cluster nodes for query 3b

Figure 19. Memory utilization on cluster nodes for query 3b

4.3 SPATIAL JOIN QUERY

This query deals with creating a GeoSpark RectangleRDD and using it to join with a PointRDD.

Unindexed data (Task 4a)

The average execution time of this query is 836 secs. Figure 20 indicates the CPU utilization on the master and 2 slaves, and Figure 21 indicates the memory utilization on the master and 2 slave nodes.

R-Tree indexed using equal grid (Task 4b)

Join the PointRDD using the Equal grid with an R-Tree index. The average execution time of this query is 580 secs, which is less than the time taken by query 4a on unindexed data points. Figure 22 indicates the CPU utilization on the master and 2 slaves, and Figure 23 indicates the memory utilization on the master and 2 slave nodes.

Figure 20. CPU utilization on cluster nodes for query 4a

Figure 21. Memory utilization on cluster nodes for query 4a

Figure 22. CPU utilization on cluster nodes for query 4b

Query 4c

Join the PointRDD using the R-Tree grid without an R-Tree index. The average execution time of this query is 53 secs, which is much less than queries 4a and 4b; query 4c takes less time because it uses the R-Tree grid partitioning. Figure 24 indicates the CPU utilization on the master and 2 slaves.


Figure 23. Memory utilization on cluster nodes for query 4b

Figure 24. CPU utilization on cluster nodes for query 4c

4.4 CARTESIAN PRODUCT ALGORITHM

This is the spatial join query that uses the simple Cartesian product algorithm: for each rectangle from the query window dataset, the rectangle is checked against the point dataset using the regular GeoSpark spatial range query. The average execution time of this query is 9060 secs. Figure 25 indicates the CPU utilization on the master and 2 slaves, and Figure 26 indicates the memory utilization on the master and 2 slave nodes.

5. CONCLUSION

We studied GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark provides an API that allows Apache Spark programmers to easily deploy spatial operations. Experiments across data sizes and spatial analyses show that GeoSpark achieves better run time performance than its MapReduce-based counterparts, and that Spark speeds up the task at hand as the number of nodes increases. We achieved greater efficiency in terms of CPU execution and memory utilization while running the tasks on a cluster. Although some latency was present, we identified that it was caused mainly by the slow ad-hoc network. We therefore believe that Spark can be effectively used to handle computation intensive queries.

6. ACKNOWLEDGEMENTS

We would like to express our sincere gratitude to the course instructor, Dr. Mohamed Sarwat, for giving us an opportunity to work with one of the latest and leading cluster computing frameworks. We would also like to thank the teaching assistant, Jia Yu, who was always there to help us and guided us with his insight.

Figure 25. CPU utilization on cluster nodes

Figure 26. Memory utilization on cluster nodes

7. REFERENCES

[1] https://hadoop.apache.org
[2] http://saphanatutorial.com/mapreduce/
[3] https://en.wikipedia.org/wiki/Apache_Hadoop
[4] http://spark.apache.org/docs/latest/programming-guide.html
[5] https://en.wikipedia.org/wiki/Apache_Spark
[6] http://www.geolib.co.uk/
[7] www.vividsolutions.com/jts/JTSHome.htm
[8] Eldawy, Ahmed, et al. "CG_Hadoop: computational geometry in MapReduce." Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. ACM, 2013.
[9] Zaharia, Matei, et al. "Spark: cluster computing with working sets." Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing. Vol. 10. 2010.
[10] Hsiao, Pei-Yung, and Chia-Chun Tsai. "A new plane-sweep algorithm based on spatial data structure for overlapped rectangles in 2-D plane." Proceedings of the Fourteenth Annual International Computer Software and Applications Conference (COMPSAC 90). IEEE, 1990.