High Dimensional Indexing using MongoDB (MongoSV 2012)

MONGODB FOR MULTI-DIMENSION SPATIAL INDEXING DECEMBER 2012 @nknize +Nicholas Knize


Transcript of High Dimensional Indexing using MongoDB (MongoSV 2012)

Page 1: High Dimensional Indexing using MongoDB (MongoSV 2012)

MONGODB FOR MULTI-DIMENSION SPATIAL INDEXING

DECEMBER 2012

@nknize +Nicholas Knize

Page 2: High Dimensional Indexing using MongoDB (MongoSV 2012)

Thermopylae Sciences & Technology – Who are we?

• Mixed Government (70%) and Commercial (30%) contracting company w/ ~150 employees

• Core customers:
– SOUTHCOM, Intel & Security Command, Army Intel Sector, DOI
– LVMS, Select Energy Oil & Gas, OSU, Cleveland Cavaliers, and STL Rams

• #1 Google Enterprise partner for Federal and partner w/ imagery providers (GeoEye / Digital Globe)

• FOSS4G contributor and 10gen Enterprise partner

WHO ARE THESE GUYS?

ACCOMPLISHING THE IMPOSSIBLE

ENTERPRISE PARTNER

Page 3: High Dimensional Indexing using MongoDB (MongoSV 2012)

“The 3D UDOP allows near real time visibility of all SOUTHCOM Directorates information in one location…this capability allows for unprecedented situational awareness and information sharing”

-Gen. Doug Fraser

TST PRODUCTS


Page 4: High Dimensional Indexing using MongoDB (MongoSV 2012)

COMMERCIAL CUSTOMERS


Commercial Examples

Cleveland Cavaliers

USGIF Las Vegas Motor Speedway

Baltimore Grand Prix

iSpatial framework serves millions of mobile devices

Page 5: High Dimensional Indexing using MongoDB (MongoSV 2012)

1. iSpatial provides a web-based interface for Multi-INT visualization and collaboration
2. Map/Reduce provides spatial statistic processing (spatial regression) and heuristics
3. Modified MongoDB provides storage and indexing of multi-dimension spatial data at scale

TST ARCHITECTURE


(1) iSpatial – UI/Visualization

(2) Hadoop M/R – Processing / Analysis

(3) MongoDB – Spatial Data Management @ Scale

Page 6: High Dimensional Indexing using MongoDB (MongoSV 2012)

What the…..HOW MUCH DATA?!?

• “Swimming in sensors drowning in data”
– What size data tsunami are we talking about?

• “Fix and Finish are meaningless until FIND is accomplished”
– A “Big Data” Spatial Search Problem

THAT’S A LOT OF DATA….


| Sensor Type | Resolution | Data Bandwidth | TB/Hr |
|---|---|---|---|
| FMV | 640 x 480 (Std Def); 1920 x 1080 (HD) | HD: 16bit x 3 bands @ 30fps, ~1 Gbps | ~0.45 TB |
| WAMI | Constant Hawk = 96 Mpx; Gorgon Stare = 460 Mpx; Argus = 1.8 Gpx | GS @ 16bit x 3 bands @ 2fps, ~15.3 Gbps; Argus @ 16bit x 3 bands @ 12fps, ~345.6 Gbps | ~6.89 TB (GS); ~155 TB (Argus) |
| Satellite | NITF / JP2 resolutions: 32K x 32K; 432K x 216K | 32K x 32K @ 8bit x 3 bands @ 1 frame/5 mins, ~27 Gbps | ~12.15 TB |
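The TB/Hr column follows from the bandwidth column by plain unit conversion. A minimal sketch, assuming the bandwidth figures are in gigabits per second and decimal terabytes are meant (the function name `tb_per_hour` is ours, not from the slides):

```python
def tb_per_hour(gbps: float) -> float:
    """Convert a sustained bit rate in Gbps to decimal terabytes per hour."""
    bits_per_hour = gbps * 1e9 * 3600   # bits produced in one hour
    return bits_per_hour / 8 / 1e12     # bits -> bytes -> terabytes


# Bandwidth figures copied from the slide's table.
for label, gbps in [
    ("FMV (HD)", 1.0),
    ("Gorgon Stare", 15.3),
    ("Argus", 345.6),
    ("Satellite", 27.0),
]:
    print(f"{label}: {tb_per_hour(gbps):.2f} TB/hr")
```

Each Gbps of sustained sensor output works out to 0.45 TB per hour, which is how the slide's hourly figures line up with its bandwidth figures.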

Page 7: High Dimensional Indexing using MongoDB (MongoSV 2012)

• Horizontally scalable – Large volume / elastic

• Vertically scalable – Heterogeneous data types (“Data Stack”)

• Smartly Distributed – Reduce the distance bits must travel

• Fault Tolerant – Replication Strategy and Consistency model

• High Availability – Node recovery

• Fast – Reads or writes (can’t always have both)

BIG DATA STORAGE CHARACTERISTICS


Desired Data Store Characteristics for ‘Big Data’

Page 8: High Dimensional Indexing using MongoDB (MongoSV 2012)

• Cassandra
– Nice Bring Your Own Index (BYOI) design
– … but Java, Java, Java… Memory management can be a maintenance issue
– Adding new nodes can be a pain (Token Changes, nodetool)
– Key-Value store…good for simple data models

• HBase
– Nice BigTable model
– Key-Value store…good for simple data models
– Lots of Java JNI (primarily based on std:hashmap of std:hashmap)

• CouchDB
– Provides some GeoSpatial functionality (Currently being rewritten)
– HEAVILY dependent on Map-Reduce model (complicated design)
– Erlang based – poor multi-threaded heap management

NOSQL OPTIONS


Subset of Evaluated NoSQL Options

Page 9: High Dimensional Indexing using MongoDB (MongoSV 2012)

Why MongoDB for Thermopylae?

• Documents based on JSON – A GEOJSON match made in heaven! (OGC)

• C++ - No Garbage Collection Overhead! Efficient memory management design reduces disk swapping and paging

• Disk storage is memory mapped, enabling fast swapping when necessary

• Built in auto-failover with replica sets and fast recovery with journaling

• Tunable Consistency – Consistency defined at application layer

• Schema Flexible – friendly properties of SQL enable easy port

• Provided initial spatial indexing support – Point based limited!

WHY TST <3’S MONGODB


Page 10: High Dimensional Indexing using MongoDB (MongoSV 2012)

MONGODB SPATIAL INDEXER


... The Spatial Indexer wasn’t quite right

• MongoDB (like nearly all relational DBs) uses a b-Tree
– Data structure for storing sorted data in log time
– Great for indexing numerical and text documents (1D attribute data)
– Cannot store multi-dimension (>2D) data – NOT COMPLEX GEOMETRY FRIENDLY

Page 11: High Dimensional Indexing using MongoDB (MongoSV 2012)

DIMENSIONALITY REDUCTION


How does MongoDB solve the dimensionality problem?

• Space Filling (Z) Curve
– A continuous line that intersects every point in a two-dimensional plane

• Use Geohash to represent lat/lon values
– Interleave the bits of a lat/long pair
– Base32 encode the result
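The two Geohash steps above (interleave, then Base32 encode) can be sketched as follows. This is the conventional public Geohash scheme, not MongoDB's internal code; the function name `geohash_encode` and the default precision are our own choices for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard Geohash alphabet


def geohash_encode(lat: float, lon: float, precision: int = 11) -> str:
    """Interleave longitude/latitude bisection bits and Base32 encode them."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one Base32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, precision=3))
```

Because the result is a single sortable string, it fits directly into a b-Tree, which is the point of the dimensionality reduction.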

Page 12: High Dimensional Indexing using MongoDB (MongoSV 2012)

GEOHASH BTREE ISSUES


• Neighbors aren’t so close!
– Neighboring points on the Geoid may end up on opposite ends of the plane
– Impacts search efficiency

• What about Geometry?
– Doesn’t support > 2D
– Mongo uses Multi-Location documents, which really just index multiple points that link back to a single document

Issues with the Geohash b-Tree approach
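To make the "neighbors aren't so close" issue concrete, here is a small sketch using the conventional Geohash scheme (a compact helper of our own, not MongoDB's code). Two points a few hundred meters apart, on opposite sides of the equator and prime meridian, share no hash prefix at all, so they sort far apart in a b-Tree.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat: float, lon: float, precision: int = 8) -> str:
    """Compact Geohash encoder: alternate lon/lat bisection, 5 bits per char."""
    ranges = {"lon": [-180.0, 180.0], "lat": [-90.0, 90.0]}
    values = {"lon": lon, "lat": lat}
    out, bits = [], 0
    for i in range(precision * 5):
        axis = "lon" if i % 2 == 0 else "lat"
        lo, hi = ranges[axis]
        mid = (lo + hi) / 2
        if values[axis] >= mid:
            bits = (bits << 1) | 1
            ranges[axis][0] = mid
        else:
            bits <<= 1
            ranges[axis][1] = mid
        if i % 5 == 4:
            out.append(BASE32[bits])
            bits = 0
    return "".join(out)


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters the two hashes share."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


north_east = geohash(0.001, 0.001)     # just NE of (0, 0)
south_west = geohash(-0.001, -0.001)   # just SW of (0, 0)
print(north_east, south_west, common_prefix_len(north_east, south_west))
```

A range scan over one hash prefix will therefore miss the other point, and a proximity search has to probe several prefixes to compensate.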

Page 13: High Dimensional Indexing using MongoDB (MongoSV 2012)

Sort Order and Multi-Dimension…a nightmare (3D / 4D Hilbert Scanning Order)

GEO-SHARDING ALTERNATIVE


Page 14: High Dimensional Indexing using MongoDB (MongoSV 2012)

Mongo Multi-location Document Clipping Issues ($within search doesn’t always work w/ multi-location)

[Figure: a Multi-Location Document (aka Polygon) tested against a Search Polygon in four cases (Case 1 to Case 4); two cases are marked “Success!” and two are marked “Fail!”]
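One way a point-based index can miss a polygon is sketched below. This is our own illustration under simple assumptions (axis-aligned search box, polygon indexed only by its vertices); it does not reproduce the specific cases in the slide's figure. A long strip crosses the search box, but none of its vertices fall inside it, so a vertex-only test fails while a bounding-box test succeeds.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def point_in_box(p: Point, box: Box) -> bool:
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


def any_vertex_in_box(vertices: List[Point], box: Box) -> bool:
    """What a multi-point index can see: only the stored vertices."""
    return any(point_in_box(v, box) for v in vertices)


def mbr_of(vertices: List[Point]) -> Box:
    """Minimum bounding rectangle of a polygon's vertices."""
    xs = [v[0] for v in vertices]
    ys = [v[1] for v in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


search_box = (0.0, 0.0, 10.0, 10.0)
# A thin strip that passes straight through the search box.
strip = [(-5.0, 4.0), (15.0, 4.0), (15.0, 6.0), (-5.0, 6.0)]

print(any_vertex_in_box(strip, search_box))          # vertex-only view
print(boxes_intersect(mbr_of(strip), search_box))    # bounding-box view
```

The bounding-box test is only a coarse filter (it can report candidates that do not truly intersect), but unlike the vertex-only test it never drops a shape that does.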

MULTI-LOCATION CLIPPING


Page 15: High Dimensional Indexing using MongoDB (MongoSV 2012)

• Constrain the system to single point searches
– Multi-dimension support will be exponentially complex (won’t scale)

• Interpolate points along the edge of the shape
– Multi-dimension support will be exponentially complex (won’t scale)

• Customize the spatial indexer
– Selected approach

SOLUTIONS TO GEOHASH PROBLEM


Potential Solutions

Page 16: High Dimensional Indexing using MongoDB (MongoSV 2012)

CUSTOM TUNED SPATIAL INDEXER


Thermopylae Custom Tuned MongoDB for Geo

TST leverages Kriegel’s 1996 research in R* Trees

• R-Trees organize any-dimensional data by representing the data as a minimum bounding box.
• Each node bounds its children. A node can have many objects in it (max: m, min: ceil(m/2))
• Splits and merges optimized by minimizing overlaps
• The leaves point to the actual objects (stored on disk, probably)
• Height balanced – search is O(log n) when MBRs overlap little (heavy overlap degrades it toward O(n))

Page 17: High Dimensional Indexing using MongoDB (MongoSV 2012)

Spatial Indexing at Scale with R-Trees

RTREE THEORY


Spatial data represented as minimum bounding rectangles (2-dimension), cubes (3-dimension), hexadecant (4-dimension)

Index represented as: <I, DiskLoc> where:

I = (I0, I1, … In) : n = number of dimensions

Each I is a set in the form of [min,max] describing MBR range along a dimension
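The <I, DiskLoc> entry described above can be sketched as a tuple of per-dimension intervals plus a record pointer. The class names `MBR` and `IndexEntry` and the integer `disk_loc` are placeholders of ours; the real DiskLoc type lives inside MongoDB's storage layer.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # [min, max] along one dimension


@dataclass(frozen=True)
class MBR:
    intervals: Tuple[Interval, ...]  # one interval per dimension

    @staticmethod
    def of_points(points: Sequence[Sequence[float]]) -> "MBR":
        """Smallest axis-aligned box enclosing all points, in any dimension."""
        dims = len(points[0])
        return MBR(tuple(
            (min(p[d] for p in points), max(p[d] for p in points))
            for d in range(dims)
        ))

    def intersects(self, other: "MBR") -> bool:
        """Boxes overlap only if their intervals overlap in every dimension."""
        return all(
            lo <= o_hi and o_lo <= hi
            for (lo, hi), (o_lo, o_hi) in zip(self.intervals, other.intervals)
        )


@dataclass(frozen=True)
class IndexEntry:
    mbr: MBR
    disk_loc: int  # hypothetical stand-in for a DiskLoc record pointer


box = MBR.of_points([(0, 0, 0), (2, 3, 1)])        # a 3-dimension MBR
entry = IndexEntry(mbr=box, disk_loc=42)
print(entry.mbr.intervals)
```

The same structure works unchanged for 2, 3, or 4 dimensions because nothing in it depends on the number of intervals.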

Page 18: High Dimensional Indexing using MongoDB (MongoSV 2012)

R*-Tree Spatial Index Example

• Sample insertion result for 4th order tree
• Objectives:
1. Minimize area
2. Minimize overlaps
3. Minimize margins
4. Maximize inner node utilization

[Figure: example tree with leaf entries a through l under internal nodes m, n, o, p]
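The first three objectives are simple geometric measures on bounding boxes. A minimal sketch, assuming boxes are lists of (min, max) intervals and taking margin as the sum of edge lengths (proportional to the perimeter in 2D); the function names are ours:

```python
from typing import List, Tuple

Rect = List[Tuple[float, float]]  # one (min, max) interval per dimension


def area(r: Rect) -> float:
    """Product of the extents (area in 2D, volume in 3D, and so on)."""
    result = 1.0
    for lo, hi in r:
        result *= hi - lo
    return result


def margin(r: Rect) -> float:
    """Sum of the extents; smaller margins favor squarer boxes."""
    return sum(hi - lo for lo, hi in r)


def overlap(a: Rect, b: Rect) -> float:
    """Area of the intersection of two boxes, 0.0 if they are disjoint."""
    result = 1.0
    for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b):
        side = min(a_hi, b_hi) - max(a_lo, b_lo)
        if side <= 0:
            return 0.0
        result *= side
    return result


a = [(0.0, 4.0), (0.0, 2.0)]
b = [(2.0, 6.0), (1.0, 5.0)]
print(area(a), margin(a), overlap(a, b))
```

A split or insertion strategy can then compare candidate groupings by these numbers and keep the one that scores lowest.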

R*-TREE INDEX OBJECTIVES


Page 19: High Dimensional Indexing using MongoDB (MongoSV 2012)

Insert

• Similar to insertion into B+-tree but may insert into any leaf; leaf splits in case capacity exceeded.
– Which leaf to insert into?
– How to split a node?

R*-TREE INSERT EXAMPLE


Page 20: High Dimensional Indexing using MongoDB (MongoSV 2012)

Insert—Leaf Selection

• Follow a path from root to leaf.
• At each node move into subtree whose MBR area increases least with addition of new rectangle.

[Figure: candidate subtrees m, n, o, p]
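The leaf-selection rule can be sketched as a least-enlargement choice among child MBRs. This is a simplified illustration of the rule as stated on the slide (area enlargement only, ties broken by smaller area); the helper names are ours and the tuned indexer may weigh overlap as well.

```python
from typing import List, Tuple

Rect = List[Tuple[float, float]]  # one (min, max) interval per dimension


def area(r: Rect) -> float:
    result = 1.0
    for lo, hi in r:
        result *= hi - lo
    return result


def union(a: Rect, b: Rect) -> Rect:
    """Smallest box covering both a and b."""
    return [(min(a_lo, b_lo), max(a_hi, b_hi))
            for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b)]


def choose_subtree(children: List[Rect], new_rect: Rect) -> int:
    """Index of the child whose MBR area grows least when new_rect is added."""
    def cost(i: int):
        grown = area(union(children[i], new_rect)) - area(children[i])
        return (grown, area(children[i]))  # tie-break on the smaller box
    return min(range(len(children)), key=cost)


m = [(0.0, 4.0), (0.0, 4.0)]
n = [(10.0, 14.0), (0.0, 4.0)]
new_rect = [(3.0, 5.0), (1.0, 2.0)]
print(choose_subtree([m, n], new_rect))  # m grows by 4, n grows by 28
```

Repeating this choice at every level, from root to leaf, yields the insertion path.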

Modified from Dr. Sahni (UoF Advanced Data Structures)

Page 21: High Dimensional Indexing using MongoDB (MongoSV 2012)

Insert—Leaf Selection

• Insert into m.


Modified from Dr. Sahni (UoF Advanced Data Structures)

Page 22: High Dimensional Indexing using MongoDB (MongoSV 2012)

Insert—Leaf Selection

• Insert into n.


Modified from Dr. Sahni (UoF Advanced Data Structures)

Page 23: High Dimensional Indexing using MongoDB (MongoSV 2012)

Insert—Leaf Selection

• Insert into o.


Modified from Dr. Sahni (UoF Advanced Data Structures)

Page 24: High Dimensional Indexing using MongoDB (MongoSV 2012)

Insert—Leaf Selection

• Insert into p.


Modified from Dr. Sahni (UoF Advanced Data Structures)

Page 25: High Dimensional Indexing using MongoDB (MongoSV 2012)

Query

• Start at root
• Find all overlapping MBRs
• Search subtrees recursively

[Figure: query rectangle x over the example tree (leaf entries a through l under internal nodes m, n, o, p)]
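The three query steps map onto a short recursive function. A minimal in-memory sketch with hypothetical `Node` objects (the real index stores nodes in disk buckets):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[Tuple[float, float], ...]  # one (min, max) interval per dimension


def intersects(a: Rect, b: Rect) -> bool:
    return all(a_lo <= b_hi and b_lo <= a_hi
               for (a_lo, a_hi), (b_lo, b_hi) in zip(a, b))


@dataclass
class Node:
    mbr: Rect
    children: List["Node"] = field(default_factory=list)          # inner node
    entries: List[Tuple[Rect, str]] = field(default_factory=list)  # leaf node


def search(node: Node, query: Rect) -> List[str]:
    """Return names of all entries whose MBR overlaps the query box."""
    if not intersects(node.mbr, query):
        return []  # prune the whole subtree
    hits = [name for rect, name in node.entries if intersects(rect, query)]
    for child in node.children:
        hits.extend(search(child, query))
    return hits


leaf_m = Node(mbr=((0, 3), (0, 3)),
              entries=[(((0, 1), (0, 1)), "a"), (((2, 3), (2, 3)), "b")])
leaf_n = Node(mbr=((10, 11), (0, 1)),
              entries=[(((10, 11), (0, 1)), "c")])
root = Node(mbr=((0, 11), (0, 3)), children=[leaf_m, leaf_n])

print(search(root, ((0.5, 2.5), (0.5, 2.5))))
```

When sibling MBRs overlap heavily, few subtrees can be pruned and the search visits most of the tree, which is where the O(n) worst case comes from.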

Modified from Dr. Sahni (UoF Advanced Data Structures)

Page 26: High Dimensional Indexing using MongoDB (MongoSV 2012)

Query

• Search m.

[Figure: the example tree with query rectangle x; node m is searched]

Modified from Dr. Sahni (UoF Advanced Data Structures)

Page 27: High Dimensional Indexing using MongoDB (MongoSV 2012)

R*-Tree Leverages B-Tree Base Data Structures (buckets)

R*-TREE MONGODB IMPLEMENTATION


Page 28: High Dimensional Indexing using MongoDB (MongoSV 2012)

Spatial Index Architecture, Organization, & Performance

[Diagram labels: MBRKeyNode(s), BucketHeader, MBRHeader]

1B Polygon Read Performance (worst case O(n))

| Dimensions | Num Buckets | Tree Height | Read Time |
|---|---|---|---|
| 3 | 3,448,276 | 3 | 190 ms |
| 5 | 5,076,143 | 3 | 275 ms |
| 100 | 90,909,091 | 8 | ~4.9 sec |
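The bucket counts and tree heights in this table are consistent with simple fan-out arithmetic. The sketch below is our own reading, not something the slides state: it assumes 1 billion entries and per-bucket capacities of 290, 197, and 11 keys (values inferred by dividing 1B by the listed bucket counts, reading the 5-dimension row as 5,076,143).

```python
def buckets_needed(n_objects: int, capacity: int) -> int:
    """Leaf buckets required if every bucket is filled to capacity."""
    return -(-n_objects // capacity)  # ceiling division


def tree_height(n_buckets: int, capacity: int) -> int:
    """Levels needed above the leaf buckets to reach a single root."""
    height, nodes = 0, n_buckets
    while nodes > 1:
        nodes = -(-nodes // capacity)
        height += 1
    return height


N = 10**9
# (dimensions, assumed keys per bucket): capacities are inferred, not sourced.
for dims, capacity in [(3, 290), (5, 197), (100, 11)]:
    buckets = buckets_needed(N, capacity)
    print(dims, buckets, tree_height(buckets, capacity))
```

Under these assumptions, each key grows with the number of dimensions, fewer keys fit in a bucket, and the tree gets taller, which matches the jump in read time at 100 dimensions.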

SPATIAL INDEX ARCH & ORG


Page 29: High Dimensional Indexing using MongoDB (MongoSV 2012)

Geo-Sharding – (in work)

Scalable Distributed R* Tree (SD-r*Tree)

“Balanced” binary tree, with nodes distributed on a set of servers:

• Each internal node has exactly two children

• Each leaf node stores a subset of the indexed dataset

• At each node, the heights of the two subtrees differ by at most one

• mongos “routing” node maintains binary tree
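The binary routing tree described above can be sketched in memory as follows. This is a toy illustration under our own assumptions (class names `DataNode` and `CoverageNode`, 2D boxes, everything in one process); it is not the in-work geo-sharding code.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def intersects(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class DataNode:
    name: str   # e.g. a chunk living on one shard
    mbr: Box


@dataclass
class CoverageNode:
    name: str
    left: Union["CoverageNode", DataNode]
    right: Union["CoverageNode", DataNode]

    @property
    def mbr(self) -> Box:
        """Coverage of an internal node is the union of its two children."""
        a, b = self.left.mbr, self.right.mbr
        return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def height(node) -> int:
    if isinstance(node, DataNode):
        return 0
    return 1 + max(height(node.left), height(node.right))


def is_balanced(node) -> bool:
    if isinstance(node, DataNode):
        return True
    return (abs(height(node.left) - height(node.right)) <= 1
            and is_balanced(node.left) and is_balanced(node.right))


def route(node, query: Box) -> List[str]:
    """Names of the data nodes a spatial query must be forwarded to."""
    if not intersects(node.mbr, query):
        return []
    if isinstance(node, DataNode):
        return [node.name]
    return route(node.left, query) + route(node.right, query)


d0 = DataNode("d0", (0, 0, 10, 10))
d1 = DataNode("d1", (20, 0, 30, 10))
d2 = DataNode("d2", (20, 20, 30, 30))
r1 = CoverageNode("r1", d0, CoverageNode("r2", d1, d2))

print(is_balanced(r1), route(r1, (25, 5, 26, 6)))
```

Routing by coverage means a query only touches the shards whose region it overlaps, rather than being broadcast to all of them.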

GEO-SHARDING


Page 30: High Dimensional Indexing using MongoDB (MongoSV 2012)

SD-r*Tree Data Structure Illustration

[Figure: the tree as data nodes d0, d1, d2 and coverage nodes r1, r2 are added, shown alongside the data nodes' spatial coverage of objects a through e]

• di = Data Node (Chunk)
• ri = Coverage Node

Leveraged work from Litwin, Mouza, Rigaux 2007

SD-r*Tree DATA STRUCTURE


Page 31: High Dimensional Indexing using MongoDB (MongoSV 2012)

SD-r*Tree Structure Distribution

[Figure: the SD-r*Tree nodes (d0, d1, d2, r1, r2) distributed across GeoShard 1, GeoShard 2, and GeoShard 3, with mongos as the routing node]

SD-r*TREE STRUCTURE DISTRIBUTION


Page 32: High Dimensional Indexing using MongoDB (MongoSV 2012)

Beyond 4-Dimensions - X-Tree (Berchtold, Keim, Kriegel – 1996)

[Figure legend: Normal Internal Nodes, Supernodes, Data Nodes]

• Avoid MBR overlaps – the more overlap, the closer reads get to the worst case O(n)

• Avoid node splits (main cause for high overlap)

• Introduce new node structure: Supernodes – Large Directory nodes of variable size

BEYOND 4-DIMENSIONS


Page 33: High Dimensional Indexing using MongoDB (MongoSV 2012)

X-TREE PERFORMANCE


X-Tree Performance Results (Berchtold, Keim, Kriegel – 1996)

Page 34: High Dimensional Indexing using MongoDB (MongoSV 2012)

T-Sciences Custom Tuned Spatial Indexer

• Optimized Spatial Search – Finds intersecting MBR and recurses into those nodes

• Optimized Spatial Inserts – Uses the Hilbert Value of MBR centroid to guide search – 28% reduction in number of nodes touched

• Optimized Deletes – Leverages R* split/merge approach for rebalancing tree when nodes become over/under-full

• Low maintenance – Leverages MongoDB’s automatic data compaction and partitioning
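The "Hilbert Value of MBR centroid" used to guide inserts can be sketched in 2D with the well-known iterative grid-to-curve mapping. The grid size, the extent, and the function names below are our assumptions for illustration; the slides do not show how the tuned indexer computes the value, and it would need a higher-dimension variant for 3D and 4D data.

```python
from typing import Tuple

Interval = Tuple[float, float]


def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of grid cell (x, y) along a Hilbert curve on an n x n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:              # rotate/flip the quadrant
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def mbr_hilbert_value(mbr: Tuple[Interval, Interval],
                      extent: Tuple[Interval, Interval],
                      n: int = 1024) -> int:
    """Hilbert value of an MBR's centroid, snapped to an n x n grid over extent."""
    (x_lo, x_hi), (y_lo, y_hi) = mbr
    (ex_lo, ex_hi), (ey_lo, ey_hi) = extent
    cx, cy = (x_lo + x_hi) / 2, (y_lo + y_hi) / 2
    gx = min(n - 1, int((cx - ex_lo) / (ex_hi - ex_lo) * n))
    gy = min(n - 1, int((cy - ey_lo) / (ey_hi - ey_lo) * n))
    return hilbert_index(n, gx, gy)


world = ((-180.0, 180.0), (-90.0, 90.0))
print(mbr_hilbert_value(((10.0, 11.0), (57.0, 58.0)), world))
```

Cells that are consecutive along the curve are always adjacent in the grid, so nearby centroids tend to get nearby values, which is what makes the value useful as an insertion hint.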

CONCLUSION


Page 35: High Dimensional Indexing using MongoDB (MongoSV 2012)

Example: Mosaicked Video with KLV Footprints


• Rip through KLV Metadata

• Index frame footprints and annotations as MBRs into X(R*)-Tree

• Leverage Geo-Sharding for spatially relevant scale

Page 36: High Dimensional Indexing using MongoDB (MongoSV 2012)

Example Use Case – OSINT (Foursquare Data)

• Sample Foursquare data set mashed with Government Intel Data (poly reports)

• 100 million Geo Document test (3D points and polys)

• 4 server replica set

• ~350ms query response

• ~300% improvement over PostGIS

EXAMPLE


Page 37: High Dimensional Indexing using MongoDB (MongoSV 2012)

Community Support

• Thermopylae plans to open source
– http://github.com/thermopylae

• TST working with 10gen to offer as a spatial extension

• Active developer collaboration
– IRC: #mongodb freenode.net

FIND US


Page 38: High Dimensional Indexing using MongoDB (MongoSV 2012)

THANK YOU

Questions?

Nicholas Knize

THANK YOU


Page 39: High Dimensional Indexing using MongoDB (MongoSV 2012)

Backup

Page 40: High Dimensional Indexing using MongoDB (MongoSV 2012)

Key Customers - Government

• US Dept of State Bureau of Diplomatic Security
– Build and support 30 TB Google Earth Globe with multi-terabytes of individual globes sent to embassies throughout the world. Integrated Google Earth and iSpatial framework.

• US Army Intelligence Security Command
– Provide expertise in managing technology integration – prime contractor providing operations, intelligence, and IT support worldwide. Partners include IBM, Lockheed Martin, Google, MIT, Carnegie Mellon. Integrated Google Earth and iSpatial framework.

• US Southern Command
– Coordinate Intelligence management systems spatial data collection, indexing, and distribution. Integrated Google Earth, iSpatial, and iHarvest.
– Index large volume imagery and expose it for different services (Air Force, Navy, Army, Marines, Coast Guard)

GOVERNMENT CUSTOMERS


Page 41: High Dimensional Indexing using MongoDB (MongoSV 2012)

COMMERCIAL CUSTOMERS


Key Customers - Commercial

Cleveland Cavaliers

USGIF Las Vegas Motor Speedway

Baltimore Grand Prix

iSpatial framework serves millions of mobile devices

Page 42: High Dimensional Indexing using MongoDB (MongoSV 2012)

• Expose and manage Multi-INT enterprise data in a geo-temporal user defined environment

• Provide a flexible and scalable spatial data infrastructure (SDI) for Multi-INT data access and analysis

• Spatially referenced data visualization on 3D globe & 2D maps

• Access real/near real-time data feeds from forward deployed devices

• Enable real-time information sharing and mission collaboration

ISPATIAL OVERVIEW
