Hot-Spot analysis Using Apache Spark framework
Apache Spark Framework to Process Large Scale Data
Supriya, Shalmali Bhoir, Krishna Dharaiya, Kavita Korgaonkar, Tejal Sabnis
School of Computing, Informatics, and Decision Systems Engineering,
Arizona State University, Tempe, AZ
{sashokk2, sbhoir, kdharaiy, kkorgaon, tsabnis}@asu.edu
ABSTRACT
With the sudden advent of Big Data, the volume of available
spatial data is increasing tremendously. This sudden surge of
data has motivated people in academia and business to
analyze and discover patterns and trends in the data for
various business intelligence applications and research
purposes. In this project, we learn to handle and process huge
spatial and spatial-temporal datasets and execute queries on
them. This project deals with running Apache Spark on a
Hadoop cluster of machines where GeoSpark library, an in-
memory cluster computing framework, is used for processing
large scale spatial data. GeoSpark provides a geometrical
operations library that accesses spatial RDDs to perform
basic geometrical operations. We leverage the ability of
Spark to perform fast, scalable, fault-tolerant operations
using Resilient Distributed Datasets to carry out geo-spatial
operations. We have also established a suitable measure to
strike a balance between the run time performance and
CPU/memory utilization in the distributed cluster. In this
project we are also required to implement the Hot Spot
analysis using the Getis-Ord Gi* spatial statistic to identify
statistically significant Hot Spots in the spatial data in the
Apache Spark framework.
Keywords
GeoSpark, Large scale data, Spatial data, Spatial-Temporal
dataset, Hadoop, Apache Spark, Cluster computing, Distributed
dataset
1. INTRODUCTION
Geo-spatial queries involve spatial data and operations which are
not inherently supported by Apache Spark. Hence users need to
perform tedious programming for data processing jobs on top of
Spark. GeoSpark system extends Resilient Distributed Dataset
(RDD) to support spatial data. The key contributions of this
project are as follows. (1) GeoSpark as a full-fledged cluster
computing framework to perform certain queries like Spatial range
query, Spatial KNN (K-Nearest Neighbors) query and Spatial join
query. (2) Analysis of efficiency of queries when run on indexed
and unindexed data. Experiments show that GeoSpark achieves
better run time performance than its Hadoop based counterparts.
(3) Applying spatial statistics to spatial-temporal big data in order
to identify statistically significant spatial Hot Spots. For this
phase we have used the New York City Yellow Taxi data for
January 2015 (~1.8 GB). Section 2 specifies the system design, followed by
section 3 describing various queries run on Apache Spark using
GeoSpark APIs and Hot Spot analysis using Spark framework [8].
Section 4 shows the experiments and results obtained, section
5 presents the conclusion, and sections 6 and 7 contain the
acknowledgements and references respectively.
2. SYSTEM DESIGN
This section explains the Hadoop and Spark technologies that
enable faster distributed computation for large datasets.
2.1 Hadoop
The Apache Hadoop project is open-source software for reliable,
scalable, fault-tolerant, distributed computing [1]. It is a
framework that allows for the distributed processing of large
datasets across clusters of machines, each offering local
computation and storage. The key feature provided by Hadoop
apart from increased performance and lower latencies is its ability
to provide fault tolerance by replicating data across worker nodes
in the cluster. Furthermore, this replication increases the
availability of data. The components of a Hadoop system are the
Name-node, Data-node, Job-tracker, and Task-tracker [2]. Hadoop
consists of a distributed filesystem (Hadoop Distributed File
System), a framework for job scheduling and cluster resource
management (Hadoop YARN), and a parallel processing system for
large datasets (Hadoop MapReduce) [3][4].
2.2 HDFS (Hadoop Distributed File System)
The Hadoop Distributed File System (HDFS) is used for storage of
distributed data, i.e. the vertices of the rectangle RDDs or search
query windows. The data is distributed and stored in master node
and slave nodes in a transparent manner when it is first loaded in
HDFS. The computations are performed by fetching data from
HDFS and the results are stored back to HDFS.
2.3 Apache Spark
Apache Spark is an open source cluster computing framework. It
provides an interface for programming entire clusters with implicit
data parallelism and fault tolerance. Spark works on Resilient
Distributed Datasets (RDDs) [5] [9]. These are immutable objects
that cannot be modified but can be transformed into new RDDs or
can be used in some action. They can be explicitly persisted and
employ lineage based recovery schemes. We have used RDDs to
load the vertices of geo-spatial objects from HDFS and perform
transformations and actions to run our algorithms for geo-spatial
operations.
Figure 1. Spark Architecture
Spark's in-memory cluster computing framework can
be used in 3 modes – standalone, over Yarn or inside Hadoop
MapReduce. We have used Spark in both distributed and
standalone mode, with the data for processing residing in the
HDFS or in the local storage. Apache Spark has rich high-level
libraries for Java, Python, Scala, R and SQL. The unified engine of
Spark can handle diverse workloads like machine learning,
analytics, and graph processing for large datasets in both batch
processing and stream processing. Figure 1 depicts the Spark
architecture and Figure 2 depicts the Master-Slave task execution
in the Spark environment.
2.4 Master Node
We configured one machine as the master node, running Hadoop and
Spark. The master node is responsible for work distribution
among the slave nodes and collecting results of computations after
they are complete.
2.5 Slave Node
We have configured three slave nodes, with the master node also
running as a worker. These nodes receive data that is partitioned
by the master and perform computations on their set of data. All slave
nodes run the same algorithm and send back the result of local
computations to master which computes the global result. Table 1
depicts the master-slave configuration of three machines.
Figure 2. Spark Master-Slave Task execution
Figure 3. Manipulating RDD using Transformations and
Actions
2.6 Resilient Distributed Datasets (RDD)
At a high level, any Spark application creates RDDs out of the
input data, runs (lazy) transformations on these RDDs to
convert them to some other form (shape), and finally performs
actions to collect or store data. Resilient Distributed Datasets
(RDD) is the core concept in Spark framework. It can hold any
type of data. Spark stores data in RDD on different partitions.
RDDs are immutable: a transformation does not modify an RDD in
place but returns a new RDD, while the original RDD
remains the same. RDDs support two types of operations:
Transformation - map, filter, flatMap, groupByKey,
reduceByKey, aggregateByKey, pipe, coalesce
Action - reduce, collect, count, first, take, countByKey,
foreach.
2.7 Environment Setup
The environment setup of the system is as follows.
● Operating system of each node in cluster is Ubuntu
14.04.
Machine             Processor   Disk Space   RAM
Master (Worker 1)   1.60 GHz    88.9 GB      5.6 GB
Worker 2            2.60 GHz    27.6 GB      3.8 GB
Worker 3            2.40 GHz    19.9 GB      5.9 GB
Table 1. Configuration of master-slave nodes
Figure 4: Input and Output of Spatial Range Query
Software installation versions
- OpenJDK - 1.7.0
- Hadoop - 2.6.0
- Spark – 1.2.0
- SSH – for password-less communication
● Created password-less SSH login between all slave
nodes and master node.
● Connected master with all slaves using Hadoop and
Spark configuration setup.
This setup ensures the whole cluster is up and running.
3. IMPLEMENTATION OF QUERIES
In project phase 1 we were required to load the GeoSpark jar into
the Apache Spark Scala shell and execute the following operations
using Scala. The added GeoSpark jar file in project dependencies
enables us to use GeoSpark APIs in the Spark program. In this
phase, we had to create GeoSpark SpatialRDD (PointRDD) and
then implement spatial range query, spatial KNN query and
spatial join query. The section below elaborates on each of these
queries.
3.1 Spatial Range Query
The Spatial Range query finds all the objects that lie
within a given specific range. For geo-spatial operations, a range
query locates all the geometric points that are contained in the
given query rectangle co-ordinates. Using spatial range query, we
check for containment of point within given query rectangles.
GeoSpark executes range query algorithm following the execution
model: Load target dataset, partition data into number of
partitions, create a spatial index on each SRDD partition if
required, broadcast the query window to each SRDD partition,
check the spatial predicate in each partition, and remove spatial
objects duplicates that existed due to the data partitioning phase.
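The per-partition containment check at the heart of this execution model amounts to a simple envelope test. Below is a minimal plain-Java sketch; class and method names are illustrative, not the GeoSpark API, which delegates this predicate to JTS geometry operations.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: check which points fall inside a query envelope,
// mirroring the per-partition predicate step of the range query.
public class RangeSketch {
    // Envelope defined by min/max longitude (x) and latitude (y).
    static boolean contains(double minX, double maxX, double minY, double maxY,
                            double px, double py) {
        return px >= minX && px <= maxX && py >= minY && py <= maxY;
    }

    // Count points contained in the envelope (the query's final output).
    static long countInEnvelope(List<double[]> points,
                                double minX, double maxX, double minY, double maxY) {
        long count = 0;
        for (double[] p : points) {
            if (contains(minX, maxX, minY, maxY, p[0], p[1])) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<double[]> pts = Arrays.asList(
            new double[]{-74.0, 40.7},   // inside
            new double[]{-73.9, 40.8},   // inside
            new double[]{-80.0, 35.0});  // outside
        System.out.println(countInEnvelope(pts, -74.25, -73.7, 40.5, 40.9)); // prints 2
    }
}
```

In the distributed setting, each partition would run this scan independently and the master would merge the per-partition counts after duplicate removal.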
Figure 4 shows the points in the x-y space; the query window
is specified by inputting its coordinates. The output of
SpatialRangeQuery() is the number of points in the query
window.
Input:
Input the Envelope i.e Query Window
Example:(-113.79,-09.73,32.99,35.08);
Input HDFS path for point Dataset
Output:
Count of Points in the Envelope
Algorithm:
1. Define Envelope based on the input
2. Read from the HDFS path given and
convert it into a PointRDD (ObjectRDD)
3. Build the R-Tree on the ObjectRDD if
needed
4. Execute RangeQuery.SpatialRangeQuery()
API of GeoSpark for Spatial Range
Query, or execute
RangeQuery.SpatialRangeQueryUsingIndex()
API of GeoSpark for Spatial Range
Query using Index.
5. Return count of the result obtained in
step 4
Figure 5. Spatial Range Query Algorithm
Once the coordinates of the Envelope are defined, the Spatial Range
query returns all the points inside the envelope [10].
Input:
Input1 - queryEnvelope: x1, y1, x2, y2
This is the co-ordinates of Envelope we want to create.
Input2 – The location of input2 in HDFS and we use PointRDD
to convert the dataset in csv file into the objectRDD. The dataset
consists of points (longitude, latitude), which defines a point in
the space.
Output:
The output of the spatial range query returns the number of
points belonging in the queryEnvelope whose coordinates are
specified in the Input1.
Algorithm:
The algorithm reads one file that contains all the vertices for input
polygons (arealm.csv). The input from the first file is loaded into
an RDD and partitioned across the worker nodes. A plane sweep
is performed at each worker to check for containment of points in
the query window. The result is sent back to the master where all
the results are merged after removing duplicates (if any) and the
count is returned. The other part of the spatial range query
builds an R-Tree index on the PointRDD and then runs the
spatial range query. Figure 5 shows the algorithm for the spatial range
query, including the Spatial Range Query using
Index, where an R-Tree index is built on the PointRDD and this
PointRDD is then queried [6].
3.2 Spatial KNN Query
GeoSpark uses a heap based top-k algorithm, which has two
phases: selection and merge. It takes a partitioned SRDD, a point
P and a number k as inputs. To calculate the k nearest objects
around point P, in the selection phase, for each SRDD partition
GeoSpark calculates the distances between each object to the given
point P, then maintains a local heap by adding or removing
elements based on the distances. This heap contains the nearest k
objects around the given point P. For Indexed SRDD, the system
can utilize the local indexes to reduce the query time. After the
selection phase, GeoSpark merges results from each partition,
keeps the nearest k elements that have the shortest distances to P
and outputs the result. Figure 6 shows the Spatial KNN query
algorithm, including building the R-Tree on the 2-D PointRDD
and running SpatialKnnQueryUsingIndex().
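The selection phase of this heap-based scheme can be sketched with a bounded max-heap in plain Java. This is an illustrative stand-alone version, not GeoSpark's per-partition implementation; names are assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the heap-based selection phase: keep the k points nearest to
// a query point P. Per-partition heaps would be merged the same way.
public class KnnSketch {
    static List<double[]> nearestK(List<double[]> points, double px, double py, int k) {
        // Max-heap ordered by distance to P, so the farthest of the current
        // k candidates sits on top and is evicted when a closer point arrives.
        PriorityQueue<double[]> heap = new PriorityQueue<>(
            Comparator.comparingDouble((double[] q) -> dist2(q, px, py)).reversed());
        for (double[] p : points) {
            heap.add(p);
            if (heap.size() > k) heap.poll(); // drop current farthest
        }
        List<double[]> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble(q -> dist2(q, px, py)));
        return result;
    }

    static double dist2(double[] q, double px, double py) {
        double dx = q[0] - px, dy = q[1] - py;
        return dx * dx + dy * dy; // squared Euclidean distance suffices for ranking
    }
}
```

In the merge phase, the contents of the per-partition heaps would be concatenated and passed through the same routine to keep the global k nearest elements.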
Input:
Coordinates of the point whose neighbors
are to be found
Value of k i.e the number of neighbors
The location of the input in HDFS of Points
in 2-D space
Output: Return coordinates of k nearest
neighbors.
Algorithm:
1. Create a point by taking the
coordinates of the point from input
2. Read from the HDFS path given and
convert it into a PointRDD (ObjectRDD)
3. Build the R-Tree on the ObjectRDD if
specified in the query
4. Execute KNNQuery.SpatialKnnQuery() API of
GeoSpark for Spatial KNN Query, or
execute KNNQuery.SpatialKnnQueryUsingIndex()
API of GeoSpark for Spatial KNN
Query using Index.
5. Output the coordinates of the k nearest
neighbors
Figure 6. Spatial KNN Query code
Input:
The location of the input1 in HDFS
The location of the input2 in HDFS
Output: Count Size of the result query
Algorithm:
1. Read the csv file of 2-D coordinate
points from HDFS input1 and load it into
a PointRDD
2. Build the R-Tree on the PointRDD
wherever specified
3. Read the csv file with coordinates of
the rectangles stored in HDFS input2 and
load the RectangleRDD
4. Execute JoinQuery(sc, PointRDD,
RectangleRDD)
5. Return size of the result of the query
Figure 7. Spatial Join Query code
3.3 Spatial Join Query
The Spatial Join Query joins a set of polygons and the spatial points
based on a spatial join predicate. GeoSpark first partitions the
data from the two input SRDDs as well as creates local spatial
indexes (if required) for the SRDD [7]. Then it joins the two
datasets by their keys. For the spatial objects that have the same
grid ID, GeoSpark calculates their spatial relations. If two
elements from the two SRDDs overlap, they are kept in the
final results. The algorithm then groups the results for each
rectangle; the grouped results are in the form of either rectangles or
points. Finally, the algorithm removes the duplicates and returns
the result to other operations or saves the final result to disk.
Spatial Join query deals with creating a GeoSpark RectangleRDD
and using it to join with a PointRDD. The different Spatial Join
Queries to be executed on the dataset are:
1. Join the PointRDD using Equal grid without R-Tree index.
2. Join the PointRDD using Equal grid with R-Tree index.
3. Join the PointRDD using R-Tree grid without R-Tree index.
3.4 Spatial Join Query (Cartesian Product)
The second phase of the project included implementing the
SpatialJoinQueryUsingCartesianProduct() function in the
GeoSpark source code. This function finds the Spatial join query
using the simple Cartesian Product algorithm between the
PointRDD and RectangleRDD. Specifically, for each rectangle
from the query window dataset, the function checks the rectangle
against the point dataset using the regular GeoSpark Spatial Range
Query. It is implemented as a method of the JoinQuery class in
the JoinQuery.java file.
Input: PointRDD, RectangleRDD
Output:
JavaPairRDD<Envelope, HashSet<Point>>
Algorithm:
1. Create a PointRDD objectRDD
2. Create a RectangleRDD queryWindowRDD;
3. Collect rectangles from queryWindowRDD to
one Java List L;
4. For each rectangle R in L
RangeQuery.SpatialRangeQuery(objectRDD,
R, 0)
End
5. Collect all the Results
6. Parallelize the results to generate an RDD
in the format JavaPairRDD<Envelope,
HashSet<Point>>
7. Return the result RDD
Figure 8. Spatial Join Query using Cartesian Product
Algorithm
The algorithm for the spatial
join query using the Cartesian product is explained in Figure 8.
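Stripped of the RDD machinery, this Cartesian-product join reduces to a nested loop: each rectangle is scanned against every point. A plain-Java sketch follows (names illustrative; the actual implementation issues one GeoSpark SpatialRangeQuery per rectangle and parallelizes the collected results):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the Cartesian-product join: for each query rectangle, run a
// full containment scan over the point set, mirroring one range query
// per rectangle in the description above.
public class CartesianJoinSketch {
    // rects entries: {minX, maxX, minY, maxY}; points entries: {x, y}
    static Map<double[], Set<double[]>> join(List<double[]> rects, List<double[]> points) {
        Map<double[], Set<double[]>> result = new HashMap<>();
        for (double[] r : rects) {
            Set<double[]> matched = new HashSet<>();
            for (double[] p : points) {
                if (p[0] >= r[0] && p[0] <= r[1] && p[1] >= r[2] && p[1] <= r[3]) {
                    matched.add(p);
                }
            }
            result.put(r, matched); // analogous to one (Envelope, HashSet<Point>) pair
        }
        return result;
    }
}
```

The quadratic cost of this scan is visible in the experiments: the Cartesian-product query is by far the slowest of the join variants.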
3.5 IDENTIFY TOP FIFTY SPATIO-
TEMPORAL HOT SPOTS
In today's world observational data is being collected at an ever
increasing rate. With the advent of Big Data and the ubiquitous
collection of spatial-temporal observational data (e.g., vehicle and
mobile assets tracking data, geo-fencing), the identification of
unusual patterns of data in a statistically significant manner has
become a key problem that numerous businesses and
organizations are attempting to solve for real-time and batch
analysis.
Since the data is enormous, such algorithms cannot feasibly be
run on a single machine. Over the past few years, interest in the
Apache Spark framework has exploded, both in industry, and
academia. Spark can readily be installed on clusters of hardware,
and offers a sophisticated software platform for creating scalable
distributed algorithms using programming techniques to work on
these massive datasets.
Problem Definition:
The phase 3 of the project focuses on applying the spatial
statistics to the spatial-temporal big data in order to identify
statistically significant hot-spots using a distributed Apache spark
computing framework.
Figure 9. The dataset of New York Yellow Taxi January 2015
and the boundary around the dataset
Input
In this problem we will utilize a very large collection of spatial-
temporal observational data of the New York City Yellow Taxi
January 2015 dataset, approximately 1.8 GB in size. This
dataset consists of 12,748,982 records representing all Yellow Cab
taxi trips in New York City in January 2015. Each record in the
dataset contains key information such as pickup and drop-off
date, time, and location (latitude, longitude), trip distance,
passenger count, and fare amount. Given this particular dataset,
we are required to identify the fifty most statistically significant
drop-off locations in both time and space using the Getis-Ord
statistic. Space is aggregated into a grid (using latitude and
longitude), while time is aggregated into time windows i.e. one day
periods for the month of January (31 days). The source data is
clipped to an envelope encompassing the five New York City
boroughs in order to remove some of the noisy error data i.e., the
latitude ranges from 40.5N – 40.9N and the longitude ranges from
73.7W – 74.25W. Figure 9 shows the dataset plotted on the New
York City map and the defined boundary of the dataset, which
restricts us to consider only the points inside the envelope so as
to remove noise and outliers.
Output
The application developed will run on top of the Apache Spark
framework, implemented using Java. The dataset is provided as an
uncompressed CSV file and the CSV file containing the top fifty
hot spots in the format of (latitude, longitude, time_step, zscore) is
saved in the given output path.
Approach
GIS professionals use spatial statistics to identify statistically
significant clusters or outliers in spatial data. When identifying
statistically significant clusters (often termed Hot Spot Analysis),
a very commonly used statistic is the Getis-Ord Gi*. It provides
z-scores and p-values that allow users to determine whether features
with either high or low values are clustered spatially. This statistic
can be applied to both the spatial and spatial-temporal domains.
In this project phase 3, we are focused on the spatial-temporal
Getis-Ord statistic calculation for the hot spot analysis, with the
cells having the highest z-scores being the hottest.
Figure 10. Spatial-Temporal Space 3-D structure
Time and space are aggregated into cube cells, where the x axis
denotes the latitude and the y axis denotes the longitude with time
on the z axis. Each cell unit size is 0.01 * 0.01 in terms of latitude
and longitude. One day is used as the time step size which
amounts to 31(number of days in January) step sizes on the z
axis. In total there would be 68200 cells in the spatial-temporal
space of our clipped dataset. Figure 10 depicts the 3-D
structure of the dataset, with the 2-D spatial data (longitude and
latitude) plotted against the time axis.
The algorithm for Hot Spot analysis makes use of Java APIs of
the Spark framework. The Spark Context is created using the
SparkConf(). The input data file is loaded in the RDD using the
textFile() method in the Spark Context. Then we need to apply
Spark transformations on the RDD. We make use of the flatMap()
transformation, which is a combination of flatten and map. This
transformation creates a new RDD from the existing one. Each
data record in the RDD is mapped to the (x y z) cell in which it
would be located in the 3-D structure. Here x, y, z are the cell
number on the x-axis, y-axis and z-axis respectively. The second
transformation applied is mapToPair() which creates the Key-
Value pairs like ((x y z), 1) and converts JavaRDD to
JavaPairRDD. The reduceByKey() aggregates the cell with the
same key. As a result, we get the number of each cell occurrences
in the dataset which implies that we get the total number of active
points in the cell in 3-D space. We further perform the action
coalesce(1), which returns a new RDD reduced to one partition. The
other action, collect(), fetches the entire RDD onto a single
machine. We use collect() because the Key-Value pairs that remain
after aggregation number approximately 15k, which can
fit in memory. We process the collected RDD into a Java
Collection, where we check whether each neighbor of a cell is present
in the list and accordingly apply the Getis-Ord statistic. Then the cells
are sorted in descending order of z-score, and the fifty cells with
the top z-scores, indicating the hottest cells, are output. The
total time taken to execute the Java application of finding the fifty
Hot-Spots on 1.8 GB data using Spark on 4GB RAM Ubuntu
machine is around 1.5 minutes.
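The cell mapping performed inside the flatMap() transformation can be sketched as a plain-Java function. The envelope constants come from the clipping bounds of the dataset; the class name and the exact rounding at cell boundaries are illustrative assumptions.

```java
// Sketch of the per-record cell mapping done inside the flatMap()
// transformation: a (longitude, latitude, day) triple becomes a 3-D cell
// index, or (-1,-1,-1) if the point falls outside the clipping envelope.
public class CellMapSketch {
    static final double MIN_LAT = 40.5, MAX_LAT = 40.9;
    static final double MIN_LON = -74.25, MAX_LON = -73.7;
    static final double MAX_ABS_LON = 74.25; // |maxLongitude| in the cell formula
    static final double CELL = 0.01;         // cell size in degrees

    static int[] toCell(double lon, double lat, int day) {
        if (lon < MIN_LON || lon > MAX_LON || lat < MIN_LAT || lat > MAX_LAT) {
            return new int[]{-1, -1, -1}; // outside the five-borough envelope
        }
        int x = (int) ((MAX_ABS_LON - Math.abs(lon)) / CELL);
        int y = (int) ((lat - MIN_LAT) / CELL);
        return new int[]{x, y, day}; // day (1..31) is the time step on the z-axis
    }
}
```

Each record mapped this way becomes a ((x y z), 1) pair, and reduceByKey() then sums the ones to produce the per-cell counts.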
Assumptions
Below are the various assumptions for analyzing hot spots in the
spatial-temporal dataset.
Each record in the input dataset is a point in the 3-D
space
Figure 11. The 26 neighbors of a cell in the 3-D
space, with the cell itself also counted as a neighbor
Space is aggregated into a grid of cells using latitude and
longitude of pick-up location.
Size of each cell in the 2-D space is 0.01 * 0.01 in
terms of latitude and longitude.
Time is aggregated into window of one-day comprising
of total 31 values on the z-axis.
The source data is clipped to an envelope encompassing
the five New York City boroughs in order to remove some of the
noisy error data and the outliers. The envelope ranges
in latitude from 40.5N – 40.9N and in longitude from 73.7W –
74.25W.
Each cell in the 3-D space has 26 neighbors and the cell
itself is also considered its own neighbor as is depicted
in figure 11. The weight wij of cell i with respect to cell j
is defined as 1 if j is one of these 27 neighborhood cells of
i, and 0 otherwise.
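The neighbor relation in the last assumption is easy to make concrete: the 3x3x3 block of cells centered on (x, y, z) gives the 27 cells (26 neighbors plus the cell itself) whose weights are one. A small sketch, with illustrative names:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: enumerate the 27-cell neighborhood (26 neighbors plus the cell
// itself) used by the Getis-Ord computation. Cells on the boundary of the
// grid simply have fewer in-range neighbors.
public class NeighborSketch {
    static List<int[]> neighborhood(int x, int y, int z) {
        List<int[]> cells = new ArrayList<>();
        for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++)
                for (int dz = -1; dz <= 1; dz++)
                    cells.add(new int[]{x + dx, y + dy, z + dz});
        return cells; // 27 cells, each with weight w_ij = 1
    }
}
```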
Input:
New York City Yellow Taxi (January 2015) Dataset (~1.8 GB)
with each record contains pick-up latitude and longitude, date, etc.
Output:
Fifty most significant pick-up locations in both time and space
using Getis-Ord Gi* Statistic.
Pseudo Algorithm:
1. Create Spark context
Step 2 – 5 are run in Spark Context
2. Load Input data file in the Spark context to JavaRDD.
3. Using a Word-Count approach, map each data point (record)
to a 3-D cell. We use FlatMapFunction in Java.
CellX = (|maxLongitude| - |PickupLongitude|) / 0.01
CellY = (PickupLatitude - minLatitude) / 0.01
CellZ = Day of the January month
If (PickupLongitude and PickupLatitude are not in the
envelope)
Return (-1 -1 -1)
Else
Return (CellX CellY CellZ)
4. Count and aggregate the number of times each (CellX CellY
CellZ) occurs using the mapToPair function. All cells with the
same coordinates are aggregated.
5. Save the output of step 4 in the file which contains ~15K
lines indicating those many active cells.
The output has each line in the form (CellX CellY CellZ, xi),
where xi denotes the attribute value of cell i, which
corresponds to the total number of pickup locations in that
cell.
6. Collect the contents of this output in a List cellList
7. Compute n, which is the total number of cells in the spatial-
temporal 3-D space
n = (|maxLongitude - minLongitude| / 0.01) ×
(|maxLatitude - minLatitude| / 0.01) × 31
= 55 × 40 × 31 = 68,200
8. Calculate the mean: X̄ = (Σj xj) / n
9. Calculate the standard deviation: S = sqrt((Σj xj²) / n - X̄²)
10. For each cell in cellList
10.1 From the coordinates of a cell we know the coordinates
of its 26 neighbors.
10.2 Find whether any of the neighbors of the cell exist in the
cellList, and if they do, add the attribute values xj of all
its neighbors to evaluate Σj wij xj.
10.3 Calculate the Gi* statistic, i.e. the z-score of each cell,
using the equation below:
Gi* = (Σj wij xj - X̄ Σj wij) /
(S sqrt([n Σj wij² - (Σj wij)²] / (n - 1)))
10.4 Since wij = 1 if i and j are neighbors, and each cell has
26 neighbors plus itself, Σj wij = Σj wij² = 27, and Gi*
simplifies to:
Gi* = (Σj∈N(i) xj - 27 X̄) /
(S sqrt((27n - 729) / (n - 1)))
11. Sort the cellList in descending order of the Gi*
statistic, i.e. the z-score.
12. Output the fifty top cells with the highest Gi* score.
13. Convert the cell values into longitude and latitude and
output each hot spot's latitude, longitude, and day.
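The aggregation and scoring steps of this pseudo-algorithm can be sketched in plain Java over the collected counts. A HashMap stands in for the collected RDD contents; key format and names are illustrative, and all weights are one over the 27-cell neighborhood.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the Getis-Ord Gi* z-score over aggregated cell counts.
// cellCounts maps "x,y,z" -> attribute value x_j (pickup count); cells
// absent from the map have x_j = 0. n is the total number of cells (68,200).
public class GetisOrdSketch {
    static Map<String, Double> zScores(Map<String, Integer> cellCounts, long n) {
        double sum = 0, sumSq = 0;
        for (int v : cellCounts.values()) { sum += v; sumSq += (double) v * v; }
        double mean = sum / n;                         // X-bar
        double std = Math.sqrt(sumSq / n - mean * mean); // S
        double w = 27.0; // sum of weights: 26 neighbors plus the cell itself
        double denom = std * Math.sqrt((n * w - w * w) / (n - 1));
        Map<String, Double> z = new HashMap<>();
        for (String key : cellCounts.keySet()) {
            String[] c = key.split(",");
            int x = Integer.parseInt(c[0]), y = Integer.parseInt(c[1]), t = Integer.parseInt(c[2]);
            double neighborSum = 0; // sum of w_ij * x_j over the 27-cell neighborhood
            for (int dx = -1; dx <= 1; dx++)
                for (int dy = -1; dy <= 1; dy++)
                    for (int dz = -1; dz <= 1; dz++)
                        neighborSum += cellCounts.getOrDefault(
                            (x + dx) + "," + (y + dy) + "," + (t + dz), 0);
            z.put(key, (neighborSum - mean * w) / denom);
        }
        return z;
    }
}
```

Sorting the resulting entries by descending z-score and taking the first fifty yields the hot-spot list of steps 11 and 12.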
4. EXPERIMENTS AND RESULTS
Now that we have completed implementing all operations, it is
time to monitor the performance of the cluster. This section provides a
preliminary experimental evaluation that studies the run time
performance of the large-scale spatial data processing systems. We
performed tests on cluster using arealm.csv and zcta510.csv
datasets. The same operations were performed multiple times to
observe average performance of nodes.
Our cluster was set up over an ad-hoc network having one master
and three slave nodes. For measuring performance, we have used a
tool called "Ganglia". It helped us monitor the CPU, memory, and
network usage of the cluster while executing tasks. We have
compared efficiency of these queries when run on indexed and
unindexed data.
Each task was executed on the cluster and screenshots were taken
from the Spark and Ganglia UI.
4.1 SPATIAL RANGE QUERY
This query’s performance is compared based on indexed and
unindexed data.
Unindexed data (Task 2a)
This is the spatial range query where we query the dataset for
PointRDD using query window. Here the PointRDD is unindexed
and average execution time for this query is 147.7 secs. Figure 12
indicates the CPU utilization on master and 2 slaves. Figure 13
indicates the memory utilization on master and 2 slave nodes. The
results indicate that the master has the highest CPU and memory
utilization as compared to slaves.
Indexed data (Task 2b)
This is the spatial range query where we build R-tree on
PointRDD and then query the database for PointRDD using the
query window. The average execution time for this query is 59.9 secs,
which is less than task 2a since the data is indexed. Figure 14
indicates the CPU utilization of Master and slave nodes. Figure
15 shows the Memory utilization of Master, and slaves. These
metrics for query indicate that Master has the maximum CPU
and memory utilization as compared to the other two slaves.
Figure 12. CPU utilization on cluster nodes for query 2a
Figure 13. Memory utilization on cluster nodes for query 2a
Figure 14. CPU utilization on cluster nodes for query 2b
Figure 15. Memory utilization on cluster nodes for query 2b
4.2 SPATIAL KNN QUERY
This spatial query finds 5 nearest neighbors.
Unindexed data
The figures below show the results of the spatial KNN query on
unindexed PointRDD. The average execution time is 51 secs. Figure 16
indicates the CPU utilization on master and 2 slaves. Figure 17
indicates the memory utilization on master and 2 slave nodes.
Figure 16. CPU utilization on cluster nodes for query 3a
Figure 17. Memory utilization on cluster nodes for query 3a
R-Tree index
First we create the R-Tree index and then query the PointRDD. The
average execution time is 17.7 secs, which is less than task 3a since
the data is indexed. Figure 18 indicates the CPU utilization on master and 2
slaves. Figure 19 indicates the memory utilization on master and 2
slave nodes.
Figure 18. CPU utilization on cluster nodes for query 3b
Figure 19. Memory utilization on cluster nodes for query 3b
Spatial Join Query
This query deals with creating a GeoSpark RectangleRDD and
using it to join with a PointRDD.
Unindexed data (Task 4a)
The average execution time of this query is 836 secs. Figure 20
indicates the CPU utilization on master and 2 slaves. Figure 21
indicates the memory utilization on master and 2 slave nodes.
4.3 R-Tree indexed using equal grid (Task 4b)
Join the PointRDD using Equal grid with R-Tree index. The
average execution time of this query is 580 secs, which is less than
the time taken by the 4a query with unindexed data points. Figure
22 indicates the CPU utilization on master and 2 slaves. Figure
23 indicates the memory utilization on master and 2 slave nodes.
Figure 20. CPU utilization on cluster nodes for query 4a
Figure 21. Memory utilization on cluster nodes for query 4a
Figure 22. CPU utilization on cluster nodes for query 4b
Query 4c
Join the PointRDD using R-Tree grid without R-Tree index. The
average execution time of this query is 53 secs, which is much less
than query 4a and query 4b. Query 4c takes less time because it
uses the R-Tree grid. Figure 24 indicates the CPU utilization on master
and 2 slaves.
Figure 23. Memory utilization on cluster nodes for query 4b
Figure 24. CPU utilization on cluster nodes for query 4c
4.4 CARTESIAN PRODUCT ALGORITHM
This is the Spatial Join query that uses the simple Cartesian
Product algorithm. Specifically, for each rectangle from the query
window dataset, check this rectangle against the point datasets
using the regular GeoSpark Spatial Range Query. The average
execution time of this query is 9060 secs. Figure 25 indicates the
CPU utilization on master and 2 slaves. Figure 26 indicates the
memory utilization on master and 2 slave nodes.
5. CONCLUSION
We learned GeoSpark, an in-memory cluster computing
framework for processing large-scale spatial data. GeoSpark
provides an API for Apache Spark programmers to easily deploy
spatial operations. Experiments on data sizes and spatial analysis
show that GeoSpark achieves better run time performance than its
MapReduce based counterparts. Spark speeds up the task at
hand with an increase in the number of nodes. We achieved greater
efficiency in terms of CPU execution and memory utilization
while running the task on a cluster. Though latency was
present, we identified that it was caused mainly by the slow
ad-hoc network. We therefore believe that Spark can be effectively
used to handle computation intensive queries.
6. ACKNOWLEDGEMENTS
We would like to express our sincere gratitude towards our course
instructor Dr. Mohamed Sarwat for giving us an opportunity to
work with one of the latest and leading cluster computing
frameworks. We would also like to thank the teaching assistant Jia
Yu who was always there to help us and guided us with his insight
into the same.
Figure 25. CPU utilization on cluster nodes
Figure 26: Memory utilization on cluster nodes
7. REFERENCES
[1] https://hadoop.apache.org
[2] http://saphanatutorial.com/mapreduce/
[3] https://en.wikipedia.org/wiki/Apache_Hadoop
[4] http://spark.apache.org/docs/latest/programming-guide.html
[5] https://en.wikipedia.org/wiki/Apache_Spark
[6] http://www.geolib.co.uk/
[7] www.vividsolutions.com/jts/JTSHome.htm
[8] Eldawy, Ahmed, et al. "CG_Hadoop: computational
geometry in MapReduce." Proceedings of the 21st ACM
SIGSPATIAL International Conference on Advances in
Geographic Information Systems. ACM, 2013.
[9] Zaharia, Matei, et al. "Spark: cluster computing with working
sets." Proceedings of the 2nd USENIX conference on Hot
topics in cloud computing. Vol. 10. 2010.
[10] Hsiao, Pei-Yung, and Chia-Chun Tsai. "A new plane-sweep
algorithm based on spatial data structure for overlapped
rectangles in 2-D plane." Computer Software and
Applications Conference, 1990. COMPSAC 90. Proceedings.,
Fourteenth Annual International. IEEE, 1990.