![Page 1: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/1.jpg)
High Performance Spatial Query Processing for Large Scale Scientific Data
Ablimit Aji Adviser: Fusheng Wang
Math & CS Seminar Nov 16, 2012
![Page 2: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/2.jpg)
![Page 3: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/3.jpg)
Spatial Applications
Map
![Page 4: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/4.jpg)
Pathology Analytical Imaging
Big Spatial Data
Satellite Imagery & Remote Sensing
![Page 5: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/5.jpg)
Integrative Multi-Scale Biomedical Informatics
• Reproducible anatomic/functional characterization at gross level (Radiology) and fine level (Pathology).
• Integration with multiple types of “omics” information (genomics, proteomics, metabolomics).
• Create categories of jointly classified data to describe pathophysiology and to predict prognosis and response to treatment.
![Page 6: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/6.jpg)
Emory In Silico Brain Tumor Research Center
Integrative in-silico study of GBM brain tumors through digital pathology, radiology, “omics” data, and patient outcome.
![Page 7: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/7.jpg)
Pathology Imaging
• High-resolution whole slide images provide rich information about morphological and functional characteristics of biological systems. (Ex1)
• Such information has tremendous potential for providing insights regarding the underlying mechanisms of disease onset and progression.
![Page 8: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/8.jpg)
Introduction Distinguishing Characteristics in Pathology Images
… Molecular data
Clinical data
![Page 9: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/9.jpg)
Systematic Image Algorithm Evaluation
• High quality image analysis algorithms are essential to support biomedical research and diagnosis
– Validate algorithms with human annotations
– Compare and consolidate different algorithm results
– Sensitivity study on algorithms’ parameters
• Example: What are the distances and overlap ratios between markup boundaries from two algorithms?
Algorithm one vs. Algorithm two
Cross match / join two spatial data sets
![Page 10: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/10.jpg)
Spatial Centric Queries
WINDOW
NEAREST NEIGHBOR
CONTAINMENT POINT
SPATIAL JOIN DENSITY
Need: Manage, Query and Compare Spatially Derived Information
![Page 11: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/11.jpg)
PAIS Data Management Architecture
[Architecture diagram; component labels: UML Model, XML Schema, Data Mapping, Pathology Image Database, Data Analysis Applications (MATLAB, SAS…), Parallel Database, PAIS Data Management, Web APIs, Image Viewer, Clinical Data, Molecular & Genetic Annotation, Application Server, Image Data Management, Job Management, Data Staging, PAIS Document Generator, Database Schema, Loading Manager, Zipped PAIS Documents, PIDB, Image Analysis Results, Human Annotations]
![Page 12: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/12.jpg)
Spatial Database Engine
• Extends an RDBMS to support multi-dimensional spatial data types and queries
– Extended spatial data types
– Spatial functions and procedures
– Spatial access methods
• space oriented – e.g. Grid-File
• data oriented – e.g. R*-Tree
![Page 13: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/13.jpg)
Partitioned, Shared-Nothing Parallel DBMS
• Most often I/O is the main bottleneck
• Partition data for parallel data access
• Shared-nothing architecture is widely used in practice as a scalable database solution
![Page 14: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/14.jpg)
Parallel Spatial DB Summary
• Comprehensive data model to represent data and provenance.
• Expressive query interface (Spatial SQL).
• Can be scaled with a partitioned parallel DBMS.
• Can be set up quickly, and supports a set of queries for small/medium scale data.
![Page 15: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/15.jpg)
Questions ?
![Page 16: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/16.jpg)
Limitations of Parallel Database Approach
• Data loading takes a very long time
• Limited scalability: possible, but at high cost
• Expensive software/hardware licenses
• Limited query support
• Maintenance and tuning are complex
– most often needs a DBA
F. Wang et al., High Performance Analytical Pathology Imaging Database for Algorithm Evaluation, In MICCAI-DCI 2011
![Page 17: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/17.jpg)
Large Volumes of Derived Data
• 10 billion pixels for each WSI: 10^5 × 10^5
• 1 M derived spatial objects, 100 M features per image
• Thousands of patients, tens of slides per patient
• Many algorithms with varying parameters
• Moderate-size healthcare operations can easily generate terabytes of data per day
• If full WSI were adopted in clinical environments, petabytes of data per day would be likely
Demands High-Throughput
![Page 18: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/18.jpg)
High Complexity of Geometric Computation
[Chart: time (s) of the cross-comparing query, axis 0 to 1200, with a breakdown of where the time is spent]
Area_Of_Intersection 37.4%
Area_Of_Union 36.7%
ST_Intersects 21.8%
Search 2.7%
Index building 1.5%
Area_Of_Intersection + Area_Of_Union + ST_Intersects: 95.9%
SELECT AVG(ratio) FROM (
SELECT
ST_Area(ST_Intersection(p.the_geom, q.the_geom))/
ST_Area(ST_Union(p.the_geom, q.the_geom)) AS ratio
FROM oligoastroiii_1_1 AS p, oligoastroiii_1_2 AS q
WHERE ST_Intersects(p.the_geom, q.the_geom) ) AS ratios;
Requires High Performance
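To make the measure in the query concrete, here is a minimal Python sketch of the same overlap ratio (area of intersection divided by area of union). It assumes axis-aligned rectangles as stand-ins for the polygon markups, and the names `Rect`, `area`, `intersection_area`, and `overlap_ratio` are illustrative only; a spatial engine computes this on arbitrary polygons, which is where most of the cost above comes from.

```python
from typing import Tuple

# Assumption: a rectangle (xmin, ymin, xmax, ymax) stands in for a polygon markup.
Rect = Tuple[float, float, float, float]


def area(r: Rect) -> float:
    """Area of a rectangle; degenerate rectangles have area 0."""
    return max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])


def intersection_area(a: Rect, b: Rect) -> float:
    """Area shared by two rectangles, or 0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def overlap_ratio(a: Rect, b: Rect) -> float:
    """Intersection area divided by union area (0 when both are empty)."""
    inter = intersection_area(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2×2 squares offset by one unit in each direction share one unit of area out of seven, so `overlap_ratio((0, 0, 2, 2), (1, 1, 3, 3))` is 1/7.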
![Page 19: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/19.jpg)
High Performance Queries with MapReduce
• MapReduce is a parallel computing framework widely used for large-scale data analysis and queries
– Widely used in major internet applications; open-source version: Hadoop
– Two simple UDFs for data processing: map & reduce
– Very easy to set up and to develop scalable applications
– Implicit parallelization by data partitioning
• Our approach:
– Take advantage of MapReduce to run queries
– Native spatial query support with a spatial engine
![Page 20: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/20.jpg)
Major Gaps
• Spatial Data Partitioning and Index Support
– Spatial data is multidimensional
– Hard to preserve data locality
– Supporting and managing multidimensional index on MapReduce is not easy
![Page 21: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/21.jpg)
Major Gaps
• MapReduce only provides map() and reduce()
– Need to map spatial query processing algorithms to map and reduce environment
– Need to revise and rethink spatial query processing
• An easy to use query interface is required
– MapReduce == assembly language for MPP
– Industrial examples: Hive, Pig, SCOPE, ...
– e.g., at Facebook, 70% of the MapReduce workload is submitted through Hive
![Page 22: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/22.jpg)
Spatial Join Processing with MapReduce
• Staging:
– images are tiled into regular grids
– tiles are redistributed across HDFS as file blocks
– small tiles are merged and metadata is added to records
• Map:
– identify records of the same tiles to form tasks
• Reduce:
– execute queries with the spatial query engine
– aggregate query results
[Diagram: for each image tile, the results of Algorithm 1 and Algorithm 2 are written as boundary files, merged into a boundary file of record(imageID, tileID, boundary) entries, and copied to HDFS; MAP performs the data scans and REDUCE runs the spatial join, with one reducer per tile]
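The map and reduce steps above can be sketched in plain Python, without Hadoop. This is an illustration under stated assumptions, not the system's implementation: records follow the slide's record(imageID, tileID, boundary) layout with an added algorithm tag, boundaries are reduced to bounding boxes, and a bounding-box intersection test stands in for the spatial query engine. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def map_phase(records):
    """Group records by (imageID, tileID); each group is one reduce task."""
    groups = defaultdict(lambda: defaultdict(list))
    for algorithm, image_id, tile_id, box in records:
        groups[(image_id, tile_id)][algorithm].append(box)
    return groups


def boxes_intersect(a, b):
    """Stand-in for the spatial engine: do two (xmin, ymin, xmax, ymax) boxes overlap?"""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def reduce_phase(groups, left="A1", right="A2"):
    """Join the two algorithms' results tile by tile (one reducer per tile)."""
    results = {}
    for key, by_algorithm in groups.items():
        results[key] = [
            (p, q)
            for p, q in product(by_algorithm.get(left, []), by_algorithm.get(right, []))
            if boxes_intersect(p, q)
        ]
    return results


def aggregate(results):
    """Aggregate per-tile results; here, just count the matching pairs."""
    return sum(len(pairs) for pairs in results.values())


records = [
    ("A1", "img1", 0, (0, 0, 2, 2)),
    ("A2", "img1", 0, (1, 1, 3, 3)),
    ("A2", "img1", 0, (5, 5, 6, 6)),
    ("A1", "img1", 1, (0, 0, 1, 1)),
    ("A2", "img2", 1, (0, 0, 1, 1)),
]
joined = reduce_phase(map_phase(records))
```

Because the join key is the tile, objects are only compared with objects from the same tile of the same image, which is what lets the reducers run independently.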
![Page 23: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/23.jpg)
Real-Time Spatial Query Engine (RESQUE)
• Index building on demand (low overhead)
• Query pipelines to combine multiple steps of query processing
• Decoupled parallel spatial query processing
• Support of spatial access methods and algorithms
– spatial selection, multi-way spatial join, nearest neighbor, and highest density queries, and extensible for new ones
Example: Two-way Spatial Join
[Diagram labels: Boundary File1, Boundary File2, Bulk R*-Tree Building (twice), R*-Tree File1, R*-Tree File2, Spatial Join Algorithm, Geometry Refinement, Spatial Measure, Result File]
![Page 24: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/24.jpg)
Nearest Neighbor Query Processing Workflow
Example query: for each cell, find the closest blood vessel and return the distance to that blood vessel. Access methods can vary (R*-Tree, Voronoi, etc.).
Map Reduce Reduce
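As a point of reference for what the query returns, here is a brute-force Python sketch of the cell-to-vessel nearest neighbor search. It assumes cells and vessels are both reduced to points, and it scans every vessel for every cell; it does not use an R*-Tree or a Voronoi diagram, which is the role of the access methods mentioned above. The function name `nearest_vessel` is hypothetical.

```python
import math


def nearest_vessel(cells, vessels):
    """For each cell (x, y), return (index of the closest vessel, distance to it).

    Brute force: O(len(cells) * len(vessels)) distance computations.
    """
    result = []
    for cx, cy in cells:
        distances = [math.hypot(cx - vx, cy - vy) for vx, vy in vessels]
        best = min(range(len(vessels)), key=distances.__getitem__)
        result.append((best, distances[best]))
    return result


answers = nearest_vessel(cells=[(0, 0), (10, 0)], vessels=[(3, 4), (10, 1)])
```

The quadratic cost of this scan is the reason an index is needed at the data sizes discussed in these slides.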
![Page 25: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/25.jpg)
System Architecture — Hadoop-GIS
Goal: provide end-to-end support (from storage and indexing
to query execution to query composition) for exploration of
spatial scientific datasets at big data scales.
• SQL query interface for ease of use
• Queries are translated into MapReduce with YSmart-Spatial
• Hadoop executes the translated MapReduce code
• Relies on the RESQUE for spatial query processing
![Page 26: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/26.jpg)
Hadoop-GIS
• HiveQL -> HiveSP
– Pros: Widely used as EDW, tightly integrated with Hadoop
– Mature query processing pipeline is already in place
– Cons: Married to Hive, less freedom for optimization
• YSmart -> YSmart-Spatial
– Pros: Good query optimization framework
– Better control of customized optimization (e.g. GPU)
– Cons: Not widely used, lots of development
![Page 27: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/27.jpg)
Hadoop-GIS: HiveSP
![Page 28: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/28.jpg)
Declarative Query Interface
Schema Creation:
CREATE TABLE tcga_markups
( markup_id BIGINT, provenance STRING, hand_marked BOOLEAN,
center ST_POINT, polygon ST_POLYGON)
PARTITIONED BY TILE(polygon ST_POLYGON, 4096, 4096)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
Spatial Join Query:
SELECT
ST_AREA(ST_INTERSECTION(ta.polygon, tb.polygon))/
ST_AREA(ST_UNION(ta.polygon, tb.polygon)) AS ratio,
ST_DISTANCE(ST_CENTROID(tb.polygon), ST_CENTROID(ta.polygon)) AS distance
FROM tcga_markups ta JOIN tcga_markups tb ON
(ST_INTERSECTS(ta.polygon, tb.polygon) = TRUE)
WHERE ta.provenance='Algorithm1' AND
tb.provenance='Algorithm2';
![Page 29: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/29.jpg)
Query Processing Pipeline
• SQL: Parsing, Validation
• AST: Semantic Analysis, Optimization, Plan Generation
• Operator Tree: Physical Plan, Optimization
• Execution: Hadoop
![Page 30: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/30.jpg)
Two-way Join Query Plan
![Page 31: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/31.jpg)
Spatial Query Engine Performance
[Charts: querying time for a single tile; query response time for a single image]
• R*-Tree building ≈ 16%, R*-Tree join ≈ 84%
• Indexing scalability: IndexTime(single image) ≈ IndexTime(tile) × number of tiles
• Storage: applying chain coding at R*-Tree leaves gives ≈ 42% storage reduction
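The slides do not say which chain coding scheme is applied at the leaves. As an illustration only, the sketch below uses the classic 8-direction Freeman chain code, which stores a boundary as one start point plus one small direction code per step instead of a full coordinate pair per vertex. It assumes the boundary is a sequence of 8-connected pixel coordinates; the names `encode` and `decode` are hypothetical.

```python
# Freeman 8-direction codes: 0 = east, then counter-clockwise in 45 degree steps.
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]
CODE_OF = {step: code for code, step in enumerate(DIRECTIONS)}


def encode(points):
    """Encode an 8-connected pixel path as (start point, list of direction codes).

    Raises KeyError if two consecutive points are not 8-connected neighbors.
    """
    codes = [
        CODE_OF[(x1 - x0, y1 - y0)]
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]
    return points[0], codes


def decode(start, codes):
    """Rebuild the pixel path from its start point and direction codes."""
    x, y = start
    points = [(x, y)]
    for code in codes:
        dx, dy = DIRECTIONS[code]
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


boundary = [(5, 5), (6, 5), (7, 6), (7, 7), (6, 7), (5, 6), (5, 5)]
start, codes = encode(boundary)
```

Each code fits in 3 bits, compared with two integers per vertex for raw coordinates; the actual saving depends on how the codes are packed and on the boundary shapes.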
![Page 32: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/32.jpg)
System Scalability Experiments
• Experimental Setup
– in-house cluster with 10 physical nodes (192 cores)
– join query: a set of 18 images (0.5 M nuclei/image)
• 39 GB of data with 6 different results
– nearest neighbor query: 50 images from TCGA (size)
• 42 GB
• Hadoop Setup
– Cloudera Hadoop 0.20.2
– Replication factor = 3
– File split size = 128 MB
– Individual task tracker memory = 512MB
![Page 33: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/33.jpg)
Query Setup
• Multi-way Spatial Join Query
– star join
– clique join
• Join Algorithms
– R*-Tree join (Breadth First Tree Traversal)
– Partition based spatial merge join (PBSM)
1. filter candidates with approximate processing
2. refine candidates with precise algorithm
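The two-step structure listed for PBSM, a cheap approximate pass followed by an exact pass on the survivors, can be sketched as follows. This is not a full PBSM implementation (there is no partitioning or plane sweep), and it assumes circles `(cx, cy, r)` as the objects so that the exact test stays short; the approximate pass compares minimum bounding rectangles (MBRs). All names are hypothetical.

```python
import math


def mbr(circle):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a circle (cx, cy, r)."""
    cx, cy, r = circle
    return (cx - r, cy - r, cx + r, cy + r)


def mbrs_overlap(a, b):
    """Approximate test: cheap, may report pairs that do not really intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def circles_intersect(a, b):
    """Exact test: the circles touch or overlap."""
    return math.hypot(a[0] - b[0], a[1] - b[1]) <= a[2] + b[2]


def approximate_then_exact(left, right):
    """Return (candidate pairs after the MBR pass, pairs confirmed by the exact pass)."""
    candidates = [(p, q) for p in left for q in right if mbrs_overlap(mbr(p), mbr(q))]
    matches = [(p, q) for p, q in candidates if circles_intersect(p, q)]
    return candidates, matches


left = [(0, 0, 1)]
right = [(1.5, 1.5, 1), (1, 0, 1), (5, 5, 1)]
candidates, matches = approximate_then_exact(left, right)
```

In this example the MBR pass discards the far circle at (5, 5), and the exact pass then rejects the diagonal neighbor whose box overlaps but whose circle does not.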
![Page 34: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/34.jpg)
Multi-Way Spatial Join: Star Join
PBSM R*-Tree
![Page 35: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/35.jpg)
Multi-Way Spatial Join: Clique Join
PBSM R*-Tree
![Page 36: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/36.jpg)
Nearest Neighbor Query
Voronoi Diagram R*-Tree
![Page 37: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/37.jpg)
Data Skew in Spatial Query Processing
• Skew may arise for many reasons
• Longest running task becomes the bottleneck
![Page 38: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/38.jpg)
Stragglers
![Page 39: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/39.jpg)
Skew Mitigation
• Hadoop generates tasks using hash partitioning (default)
• Hash partitioning is not skew resistant
• We override the default hash partition function
![Page 40: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/40.jpg)
Cost Estimation
[Figure: four tiles with estimated costs 6, 2, 4, and 10]
![Page 41: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/41.jpg)
Repartition (hash)
[Figure: four tiles (costs 6, 2, 4, 10) split into two groups by hash partitioning, with total costs 16 and 6]
![Page 42: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/42.jpg)
Repartition (optimal)
[Figure: four tiles (costs 6, 2, 4, 10) split into two groups optimally, with total costs 12 and 10]
![Page 43: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/43.jpg)
Query Optimization
• Partition problem is NP-Complete
• Greedy approach ≈ 4/3 OPT (not bad)
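The slides do not spell out the greedy rule. One common heuristic with a 4/3 worst-case bound on the longest task is "longest processing time first": sort tiles by estimated cost, largest first, and always give the next tile to the currently lightest task. The sketch below assumes that rule; the tile names are hypothetical, and the costs 6, 2, 4, 10 are the ones from the cost estimation slides.

```python
import heapq


def greedy_partition(costs, num_tasks):
    """Assign tiles to tasks, largest cost first, always to the lightest task.

    costs: dict mapping tile name to estimated cost.
    Returns (list of tile lists per task, list of total cost per task).
    """
    heap = [(0, task) for task in range(num_tasks)]  # (current load, task index)
    assignment = [[] for _ in range(num_tasks)]
    for tile, cost in sorted(costs.items(), key=lambda item: -item[1]):
        load, task = heapq.heappop(heap)
        assignment[task].append(tile)
        heapq.heappush(heap, (load + cost, task))
    loads = [sum(costs[tile] for tile in tiles) for tiles in assignment]
    return assignment, loads


tile_costs = {"t1": 6, "t2": 2, "t3": 4, "t4": 10}  # hypothetical tile names
assignment, loads = greedy_partition(tile_costs, num_tasks=2)
```

On these four tiles the heuristic yields loads of 12 and 10, so the longest task costs 12 rather than the 16 obtained with the hash split shown earlier.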
![Page 44: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/44.jpg)
Query Optimization
![Page 45: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/45.jpg)
Summary
• Effective management of big spatial data is one of the pressing challenges for next generation integrative biomedical research.
• We propose and provide a high performance MapReduce based querying system for large scale spatial data.
• We empirically test the concept with analytical medical imaging as an example application.
• The system is efficient, cost effective, and scalable, and provides an expressive query language.
![Page 46: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/46.jpg)
Next Steps (I)
• Density aware partition methods
– partition according to the local spatial density
– dynamically adjust partition granularity
• Index and pipeline selection
– one spatial index does not fit all (R*-Tree vs. Voronoi, etc.)
– select index based on query type and data set properties
– pipeline optimization
Spatial Partitioning and Optimization Methods for MapReduce
![Page 47: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/47.jpg)
Next Steps (II)
Spatial Query Optimization
Original query:
SELECT AVG(ratio) FROM (
SELECT
ST_Area(ST_Intersection(p.geom, q.geom))/
ST_Area(ST_Union(p.geom, q.geom)) AS ratio
FROM oligoastroiii_1_1 AS p, oligoastroiii_1_2 AS q
WHERE ST_Intersects(p.geom, q.geom) ) AS temp;
Rewritten query:
SELECT AVG(ratio) FROM (
SELECT area_intersect / (area_p + area_q - area_intersect) AS ratio
FROM (
SELECT ST_Area(ST_Intersection(p.geom, q.geom)) AS area_intersect,
ST_Area(p.geom) AS area_p, ST_Area(q.geom) AS area_q
FROM oligoastroiii_1_1 AS p, oligoastroiii_1_2 AS q
WHERE p.geom && q.geom ) AS temp1
WHERE area_intersect > 0 ) AS temp2;
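The second query avoids calling ST_Union by using the inclusion-exclusion identity for areas, so only one expensive geometric operation (the intersection) remains per pair. Written out, with area(·) standing for ST_Area:

```latex
\[
\operatorname{area}(p \cup q) = \operatorname{area}(p) + \operatorname{area}(q) - \operatorname{area}(p \cap q)
\]
\[
\text{ratio}
  = \frac{\operatorname{area}(p \cap q)}{\operatorname{area}(p \cup q)}
  = \frac{\operatorname{area}(p \cap q)}{\operatorname{area}(p) + \operatorname{area}(q) - \operatorname{area}(p \cap q)}
\]
```

The second query also replaces ST_Intersects with the bounding-box overlap operator `&&` and then keeps only pairs with a positive intersection area. Pairs that merely touch (zero intersection area) are therefore excluded, whereas ST_Intersects would include them with a ratio of 0, so the two averages can differ slightly on such data.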
![Page 48: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/48.jpg)
Next Steps (III)
Turbo Charge Hadoop-GIS with GPU
[Charts: peak performance (GFLOPS, axis 0 to 1800) from 2006 to 2011 for two series labeled G… and C…; labeled data points: 7900 GTX, 8800 GTX, 9800 GTX, GTX 280, GTX 285, GTX 480, GTX 580; E4300, E6850, Q9650, X7460, 980 XE]
All cores on a streaming multiprocessor (SM) execute the same instruction on different data.
E.g., NVIDIA GTX 580 has 512 cores (16 SMs, 32 cores each).
ST_Intersection on set of polygons
![Page 49: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/49.jpg)
Reduce Unnecessary Computation
K. Wang et al., Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems, In PVLDB 2012.
![Page 50: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/50.jpg)
Next Steps (IV)
• Database
– A mature storage engine
– Good single node performance (40 years of research)
– But not scalable
• MapReduce
– Highly scalable and fault tolerant
– But low single node throughput
• Best of the two worlds?
– current efforts: HadoopDB, Secondo
– how to merge the two for big spatial data?
MapReduce + DB ≈ Hybrid System
![Page 51: High Performance Spatial Query Processing for Large scale …cs700000/aji-11-16-12.pdf · 2013. 2. 8. · Nov 16, 2012 . Spatial Applications Map . Pathology Analytical Imaging ...](https://reader033.fdocuments.net/reader033/viewer/2022052001/601375d4bcaa7808263283c0/html5/thumbnails/51.jpg)
Questions?
Hadoop-GIS:
https://web.cci.emory.edu/confluence/display/HadoopGIS