
There Goes the Neighborhood! Spatial (or N-Dimensional) Search in a Relational World

Jim Gray, Microsoft

Alex Szalay, Johns Hopkins U.


Background

• I have been working with the astronomy community to build the World Wide Telescope: all telescope data federated in one internet-scale DB
• A great Web Services app
• The work here is joint with Alex Szalay
• SkyServer.Sdss.Org is the first installment
• SkyQuery.Net is the second installment (federated web services)


Outline

• How to do spatial lookup:
  – The old way: HTM
  – The new way: zoned lookup
• A load workflow system
• Embedded documentation: literate programming for SQL DDL


Spatial Data Access – SQL extension

Szalay, Kunszt, Brunner http://www.sdss.jhu.edu/htm

• Added Hierarchical Triangular Mesh (HTM) table-valued function for spatial joins
• Every object has a 20-deep Mesh ID
• Given a spatial definition, routine returns up to 10 covering triangles
• Spatial query is then up to 10 range queries
• Fast: 1,000 triangles / second / GHz

[Figure: recursive subdivision of a mesh triangle into 2,0 2,1 2,2 2,3, with 2,3 subdivided again into 2,3,0 2,3,1 2,3,2 2,3,3]
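Why a covering triangle turns into one range query can be sketched in a few lines. This is a minimal illustration, not the HTM library's code: it assumes the usual numbering in which each subdivision level appends two bits to the parent's ID, so every descendant of a triangle falls in one contiguous ID interval. The function names and the sample ID are hypothetical.

```python
def htm_id_range(triangle_id: int, triangle_depth: int, object_depth: int = 20):
    """Return (start, end) of the depth-`object_depth` IDs under one triangle.

    Assumption: a child's ID is parent_id * 4 + k for k in 0..3, so each
    extra level shifts the ID left by two bits.
    """
    if triangle_depth > object_depth:
        raise ValueError("covering triangle is deeper than the object IDs")
    shift = 2 * (object_depth - triangle_depth)
    start = triangle_id << shift
    end = ((triangle_id + 1) << shift) - 1
    return start, end


def covered(object_htm_id: int, cover):
    """True if the object's ID lies in any (start, end) range of the cover."""
    return any(start <= object_htm_id <= end for start, end in cover)


# A hypothetical depth-2 triangle with ID 0b1011, with objects keyed at depth 4.
lo, hi = htm_id_range(0b1011, triangle_depth=2, object_depth=4)
print(bin(lo), bin(hi))
```

Each (start, end) pair corresponds to one `between startHtmID and endHtmID` predicate, which an ordinary B-tree index on the ID column can serve.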


A typical call

-- find objects within 1 arcminute of (60,20)
select objID, ra, dec
from PhotoObj as p, fHtmCover(60,20,1) as triangle
where p.htmID between triangle.startHtmID and triangle.endHtmID
  and <geometry test on (ra,dec) – (60,20) < 1 arcmin>

-- or better yet
select objID, ra, dec, distance
from dbo.fGetNearbyObjEq(60,20,1)

[Figure: the coarse filter (coarse distance test) is followed by the correct filter, where a careful distance test rejects false positives]


Integration with CLR Makes it Nicer

• Peter Kukol converted 500 lines of external stored procedure “glue code” to 50 lines of C# code.

• Now we are converting library to C#

• Also, Cross Apply is VERY useful

select p.objID, count(*)
from PhotoObj p cross apply dbo.fGetNearbyObjEq(p.ra, p.dec, 1) n
group by p.objID


But…

• Wanted a faster way to do this: some computations were taking toooooo long (see below).

• Wanted to define areas in relational form.

• Wanted a portable way that works on any relational system.

• So, developed a “constraint database” approach – see below.


The Idea: Equations Define Subspaces

• For (x,y) above the line: ax + by > c
• Reverse the space by: -ax + -by > -c
• Intersect 3 half-spaces:
  a1x + b1y > c1
  a2x + b2y > c2
  a3x + b3y > c3

[Figure: the line ax + by = c with intercepts x=c/a and y=c/b; a second x-y plot shows the triangle formed by three such lines]


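The Idea: Equations Define Subspaces

a1x + b1y > c1
a2x + b2y > c2
a3x + b3y > c3

select count(*)
from convex
where a*@x + b*@y < c

select count(*)
from convex
where a*@x + b*@y > c

[Figure: two x-y panels showing the triangle formed by the three lines, one per query; each region of the plane is labeled with the query's count. The labels are 3, 2, 2, 2, 1, 1, 1 in one panel and 0, 1, 1, 1, 2, 2, 2 in the other]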
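The counting idea can be checked by hand with a small sketch. It assumes the convention stated earlier in the deck (a point is on the inside of a line when a*x + b*y > c); the triangle and the probe points are made-up sample data.

```python
# Each row is one half-space (a, b, c), satisfied where a*x + b*y > c.
# Hypothetical triangle with vertices (0,0), (4,0), (0,4).
TRIANGLE = [
    (1.0, 0.0, 0.0),     # x > 0
    (0.0, 1.0, 0.0),     # y > 0
    (-1.0, -1.0, -4.0),  # -x - y > -4, i.e. x + y < 4
]


def count_satisfied(convex, x, y):
    """Mirror of: select count(*) from convex where a*@x + b*@y > c"""
    return sum(1 for a, b, c in convex if a * x + b * y > c)


def count_violated(convex, x, y):
    """Mirror of: select count(*) from convex where a*@x + b*@y < c"""
    return sum(1 for a, b, c in convex if a * x + b * y < c)


for point in [(1, 1), (1, -1), (6, -1)]:
    print(point, count_satisfied(TRIANGLE, *point), count_violated(TRIANGLE, *point))
```

A point inside the triangle satisfies all three constraints and violates none; a point beside one edge violates one; a point beyond a vertex violates two.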


Domain is Union of Convex Hulls

• Simple volumes are unions of convex hulls.

• Higher order curves also work

• Complex volumes have holes and their holes have holes. (that is harder).

[Figure: a shape labeled "Not a convex hull", shown as a sum (+) of convex pieces]


Now in Relational Terms

create table HalfSpace (
    domainID    int not null                -- domain name
                foreign key references Domain(domainID),
    convexID    int not null,               -- grouping a set of ½ spaces
    halfSpaceID int identity(1,1),          -- a particular ½ space
    a           float not null,             -- the (a,b,..) parameters
    b           float not null,             -- defining the ½ space
    c           float not null,             -- the constraint (“c” above)
    primary key (domainID, convexID, halfSpaceID)
)

(x,y) is inside a convex if it is inside all lines of the convex
(x,y) is inside a convex if it is NOT OUTSIDE ANY line of the convex

Convexes containing point (@x,@y):

select convexID                     -- return the convex hulls
from HalfSpace                      -- from the constraints
where (@x * a + @y * b) < c         -- point outside the line?
group by all convexID               -- insist no line of convex
having count(*) = 0                 -- is outside (count outside == 0)
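The `group by all ... having count(*) = 0` trick keeps every convex as a group, even when none of its rows pass the `where` clause, and then returns the groups with zero "outside" rows. A minimal sketch of the same logic over in-memory rows follows; the rows are made-up sample data and the function name is hypothetical.

```python
from collections import defaultdict

# Rows of a HalfSpace-like table: (domainID, convexID, a, b, c).
HALF_SPACES = [
    (1, 1, 1.0, 0.0, 0.0),      # convex 1: x > 0
    (1, 1, 0.0, 1.0, 0.0),      #           y > 0
    (1, 1, -1.0, -1.0, -4.0),   #           x + y < 4
    (1, 2, 1.0, 0.0, 10.0),     # convex 2: x > 10
    (1, 2, -1.0, 0.0, -12.0),   #           x < 12
]


def convexes_containing(rows, x, y):
    """Return the convexIDs for which no half-space row puts (x, y) outside."""
    outside = defaultdict(int)
    for _domain_id, convex_id, a, b, c in rows:
        outside[convex_id] += 0            # "group by all": keep empty groups
        if x * a + y * b < c:              # point outside this line?
            outside[convex_id] += 1
    return sorted(cid for cid, n in outside.items() if n == 0)


print(convexes_containing(HALF_SPACES, 1, 1))
print(convexes_containing(HALF_SPACES, 11, 50))
```

Without the "keep empty groups" step, a convex that contains the point would produce no rows at all and would drop out of the result, which is the opposite of what is wanted.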


All Domains Containing this Point

• The group by is supported by the domain/convex index, so it’s a sequential scan (pre-sorted!).

select distinct domainID            -- return domains
from HalfSpace                      -- from constraints
where (@x * a + @y * b) < c         -- point outside
group by all domainID, convexID     -- never happens
having count(*) = 0                 -- count outside == 0


The Algebra is Simple (Boolean)

@domainID = spDomainNew (@type varchar(16), @comment varchar(8000))
@convexID = spDomainNewConvex (@domainID int)
@halfSpaceID = spDomainNewConvexConstraint (@domainID int, @convexID int, @a float, @b float, @c float)
@returnCode = spDomainDrop(@domainID)

select * from fDomainsContainPoint(@x float, @y float)

Once constructed, domains can be manipulated with the Boolean operations:

@domainID = spDomainOr (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000))
@domainID = spDomainAnd (@domainID1 int, @domainID2 int, @type varchar(16), @comment varchar(8000))
@domainID = spDomainNot (@domainID1 int, @type varchar(16), @comment varchar(8000))
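The slides do not show how the Boolean procedures are implemented. One plausible reading, sketched here as an assumption, is that a domain is a list of convexes and a convex is a list of half-spaces: OR concatenates the convex lists, and AND pairs every convex of one domain with every convex of the other and merges their constraints. NOT is left out because it is harder. All names below are hypothetical.

```python
from itertools import product


def contains(domain, x, y):
    """A point is in a domain if some convex has no half-space excluding it."""
    return any(all(not (a * x + b * y < c) for a, b, c in convex)
               for convex in domain)


def domain_or(d1, d2):
    """Union: the convexes of both domains."""
    return d1 + d2


def domain_and(d1, d2):
    """Intersection: each pair of convexes, with their constraints merged."""
    return [c1 + c2 for c1, c2 in product(d1, d2)]


def box(x0, y0, x1, y1):
    """Axis-aligned rectangle as a domain of one convex with four half-spaces."""
    return [[(1, 0, x0), (-1, 0, -x1), (0, 1, y0), (0, -1, -y1)]]


a = box(0, 0, 4, 4)
b = box(2, 2, 6, 6)
print(contains(domain_or(a, b), 5, 5), contains(domain_and(a, b), 3, 3))
```

Merging constraints works because the intersection of two convex regions is itself convex; some merged convexes may be empty, which is harmless for containment tests.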


What! No Bounding Box?

• Bounding box limits search to a subset of the convex hulls.

• If the query runs at 3M half-spaces/sec then there is no need for a bounding box, unless you have more than 10,000 lines.

• But, if you have a lot of half-spaces then bounding box is good.


OK: “solved” Areas Contain Point? What about: Points near point?

• Table-valued function finds points near a point
  – Select * from fGetNearbyEq(ra,dec,r)
• Use Hierarchical Triangular Mesh www.sdss.jhu.edu/htm/
  – Space filling curve, bounding triangles…
  – Standard approach
• 13 ms/call… So 70 objects/second.
• Too slow, so pre-compute neighbors: materialized view.
• At 70 objects/sec: takes 6 months to compute the materialized view on a billion objects.
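The six-month estimate follows from simple arithmetic, worked here as a check. The 30-day month is an assumption made for the conversion; the other inputs come from the slide.

```python
MS_PER_CALL = 13                      # from the slide
SLIDE_RATE = 70                       # objects/second, as rounded on the slide
OBJECTS = 1_000_000_000               # "billion objects"

calls_per_second = 1000 / MS_PER_CALL   # unrounded rate implied by 13 ms/call
seconds = OBJECTS / SLIDE_RATE
days = seconds / 86_400
months = days / 30                      # assuming 30-day months

print(f"{calls_per_second:.1f} calls/s, {days:.0f} days, {months:.1f} months")
```

At the slide's rounded rate this comes to roughly five and a half months of single-threaded computation, which the slide rounds up to six.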


Zone Based Spatial Join

• Divide space into zones
• Key points by Zone, offset (on the sphere this needs a wrap-around margin.)
• Point search: look in a few zones at a limited offset, ra ± r: a bounding box that has 1-π/4 false positives
• All inside the relational engine
• Avoids “impedance mismatch”
• Can “batch” all-all comparisons
• 33x faster and parallel: 6 days, not 6 months!

[Figure: a search circle of radius r crossing a zone boundary at zoneMax; labels: r, ra-zoneMax, x, Ra ± x, √(r²+(ra-zoneMax)²)cos(radians(zoneMax))]


In SQL: points near point

select o1.objID                                   -- find objects
from zone o1                                      -- in the zoned table
where o1.zoneID between                           -- where zone #
          floor((@dec-@r)/@zoneHeight) and        -- overlaps the circle
          ceiling((@dec+@r)/@zoneHeight)
  and o1.ra between @ra - @r and @ra + @r         -- quick filter on ra
  and o1.dec between @dec-@r and @dec+@r          -- quick filter on dec
  and sqrt(power(o1.cx-@cx,2)+power(o1.cy-@cy,2)+power(o1.cz-@cz,2))
          < @r                                    -- careful filter on distance

[Figure: circle inscribed in its bounding box; the careful filter eliminates the ~21% = 1-π/4 false positives that the bounding box lets through]
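The same three-stage filter (zone range, bounding box, careful distance) can be sketched outside the database. This is an illustration under stated assumptions, not the SkyServer code: the function names and sample points are hypothetical, angles are in degrees, the radius is converted to a chord length on the unit sphere before the careful test, and, like the slide, the sketch ignores ra wrap-around and the widening of the ra window away from the equator.

```python
import math
from collections import defaultdict


def unit_vector(ra, dec):
    """(cx, cy, cz) of the unit vector for a position given in degrees."""
    ra_r, dec_r = math.radians(ra), math.radians(dec)
    return (math.cos(dec_r) * math.cos(ra_r),
            math.cos(dec_r) * math.sin(ra_r),
            math.sin(dec_r))


def build_zones(points, zone_height):
    """Key (objID, ra, dec) rows by zoneID = floor(dec / zone_height)."""
    zones = defaultdict(list)
    for obj_id, ra, dec in points:
        zones[math.floor(dec / zone_height)].append((obj_id, ra, dec))
    return zones


def nearby(zones, zone_height, ra, dec, r):
    """objIDs within r degrees of (ra, dec)."""
    center = unit_vector(ra, dec)
    max_chord = 2 * math.sin(math.radians(r) / 2)   # r as a chord length
    lo = math.floor((dec - r) / zone_height)
    hi = math.ceil((dec + r) / zone_height)
    found = []
    for zone_id in range(lo, hi + 1):               # zones overlapping the circle
        for obj_id, o_ra, o_dec in zones.get(zone_id, []):
            if not (ra - r <= o_ra <= ra + r):      # quick filter on ra
                continue
            if not (dec - r <= o_dec <= dec + r):   # quick filter on dec
                continue
            if math.dist(unit_vector(o_ra, o_dec), center) < max_chord:
                found.append(obj_id)                # careful filter on distance
    return sorted(found)


POINTS = [(1, 10.0, 0.0), (2, 10.05, 0.05), (3, 10.09, 0.09), (4, 11.0, 0.0)]
zones = build_zones(POINTS, zone_height=0.05)
print(nearby(zones, 0.05, ra=10.0, dec=0.0, r=0.1))
```

In the sample data, object 3 sits in a corner of the bounding box but outside the circle, so it passes both quick filters and is rejected only by the careful distance test.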


Quantitative Evaluation: 7x faster than external stored proc (linkage is expensive)

• Time vs. radius for the neighbors function at various zone heights. Any small zone height is adequate.
• Time vs. best time at various radii. A zoneHeight of 4” is near-optimal

[Figure: "Rows vs elapsed time", fit is 1.46+2.2e-4*r^2 ms/asec; x-axis r (asec), y-axis time (sec); one series per zone height: 7.5 asec, 15 asec, 30 asec, 1 amin, 2 amin, 4 amin, 64 amin, plus the r^2 fit]

[Figure: "Relative time vs zone height (asec)", 4 minute zone is near optimal, 2 & 8 minute are slower; x-axis r (asec), y-axis time vs best; series: 7 asec, 15 asec, 30 asec, 1 amin, 2 amin, 4 amin, 16 amin, 1 degree]


All Neighbors of All Points (can Batch Process the Joins)

• A 5x additional speedup (35x in total)
• Example ignores some spherical geometry details in paper

for @deltaZone in {-1, 0, 1}
    insert neighbors                                 -- insert one zone's neighbors
    select o1.objID as objID,                        -- object pairs
           o2.objID as NeighborObjID,
           .. other fields elided
    from zone o1 join zone o2                        -- join 2 zones
      on o1.zoneID-@deltaZone = o2.zoneID            -- using zone number and ra
     and o2.ra between o1.ra - @r and o1.ra + @r     -- points near ra
    where -- elided margin logic, see paper.
      and o2.dec between o1.dec-@r and o1.dec+@r     -- quick filter on dec
      and sqrt(power(o1.x-o2.x,2)+power(o1.y-o2.y,2)+power(o1.z-o2.z,2))
          < @r                                       -- careful filter on distance
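The batch join can be sketched in the same spirit. The assumptions are labeled: the zone height is at least r, so neighbors can only be in the same zone or an adjacent one; self-pairs are skipped; the margin and other spherical details that the slide elides are ignored here too; all names and sample points are hypothetical.

```python
import math
from collections import defaultdict


def xyz(ra, dec):
    """Unit vector for a position given in degrees."""
    ra_r, dec_r = math.radians(ra), math.radians(dec)
    return (math.cos(dec_r) * math.cos(ra_r),
            math.cos(dec_r) * math.sin(ra_r),
            math.sin(dec_r))


def all_neighbors(points, zone_height, r):
    """All ordered pairs (objID, neighborObjID) closer than r degrees."""
    zones = defaultdict(list)
    for obj_id, ra, dec in points:
        zones[math.floor(dec / zone_height)].append((obj_id, ra, dec, xyz(ra, dec)))
    max_chord = 2 * math.sin(math.radians(r) / 2)
    pairs = set()
    for delta_zone in (-1, 0, 1):                    # one batch join per offset
        for zone_id, members in zones.items():
            for id1, ra1, dec1, v1 in members:
                for id2, ra2, dec2, v2 in zones.get(zone_id - delta_zone, []):
                    if id1 == id2:
                        continue                     # assumption: no self-pairs
                    if abs(ra2 - ra1) > r or abs(dec2 - dec1) > r:
                        continue                     # quick filters
                    if math.dist(v1, v2) < max_chord:
                        pairs.add((id1, id2))        # careful filter
    return sorted(pairs)


POINTS = [(1, 10.0, 0.0), (2, 10.05, 0.05), (3, 10.09, 0.11), (4, 11.0, 0.0)]
print(all_neighbors(POINTS, zone_height=0.1, r=0.1))
```

Objects 2 and 3 fall in different zones, so their pair is found by the passes with a nonzero zone offset rather than by the same-zone pass.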


Spatial Stuff Summary

• Easy
  – Point in polygon
  – Polygons containing points
  – (instance and batch)
• Works in higher dimensions
• Side note: Spherical polygons are
  – Hard in 2-space
  – Easy in 3-space


Spatial Stuff Summary

• Constraint databases are in
  – Streams (data is query, query is in DB)
  – Notification: subscription in DB, data is query
  – Spatial: constraints in DB, data is query
• You can express constraints as rows
• Then you
  – Can evaluate LOTS of predicates per second
  – Can do set algebra on the predicates.
• Benefits from SQL parallelism
• SQL == Prolog // DataLog?


References

• Representing Polygon Areas and Testing Point-in-Polygon Containment in a Relational Database, http://research.microsoft.com/~Gray/papers/Polygon.doc

• A Purely Relational Way of Computing Neighbors on a Sphere, http://research.microsoft.com/~Gray/papers/Neighbors.doc


Outline

• How to do spatial lookup:
  – The old way: HTM
  – The new way: zoned lookup
• A load workflow system
• Embedded documentation: literate programming for SQL DDL


Loading Consists of Many Tasks

• Very simply: capture, analyze to produce catalog info, convert to SQL, validate, import, index
• Each of these steps has many sub-steps
• We learned from the TerraServer that
  (1) LOADING IS WHERE THE TIME GOES.
  (2) You get to load it again when you discover better data, you discover a bug in the data, or you discover a better design.
  (3) Essentially, you are always loading or preparing to load.
• Everyone “knows this” but you have to experience it to grasp it.

[Figure: pipeline of Telescope Observation, Generate Catalogs, Validate Catalogs, Load to Database]


The SkyServer Load Manager

• Built a workflow engine with SQL Agent (our batch job scheduler) and DTS
• State machine is in the database, visible to all worker nodes
• Workers at each node pull work
• Step is a stored procedure
  – Logs to load monitor database
  – 3 levels of logging: job, step, phase
  – Logs and reporting are VERY useful
• Automatic (and some manual) backout to fix problems
• Demo http://skyserver.pha.jhu.edu/admin/tasklist.asp?all=1
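The pull model, with task state held in the database and workers claiming the next ready task, can be sketched with an in-memory SQLite database. Everything here is a hypothetical stand-in: the table and column names are invented, SQL Agent and DTS are not modeled, a step only writes a log row instead of running a stored procedure, and there is no concurrency control between workers.

```python
import sqlite3

STEPS = ["validate", "import", "index"]     # illustrative subset of the steps


def make_db():
    """Create the task and log tables and queue one task per step."""
    db = sqlite3.connect(":memory:")
    db.execute("create table task (taskID integer primary key, step text, "
               "status text, worker text)")
    db.execute("create table log (taskID int, level text, message text)")
    db.executemany("insert into task (step, status) values (?, 'ready')",
                   [(s,) for s in STEPS])
    db.commit()
    return db


def pull_and_run(db, worker):
    """Claim the oldest ready task, 'run' it, log it; return its step name."""
    row = db.execute("select taskID, step from task where status = 'ready' "
                     "order by taskID limit 1").fetchone()
    if row is None:
        return None                          # nothing left to do
    task_id, step = row
    db.execute("update task set status = 'running', worker = ? where taskID = ?",
               (worker, task_id))
    db.execute("insert into log values (?, 'step', ?)",
               (task_id, f"{step} done by {worker}"))
    db.execute("update task set status = 'done' where taskID = ?", (task_id,))
    db.commit()
    return step


db = make_db()
while (step := pull_and_run(db, "node1")) is not None:
    print("ran", step)
```

Because the state lives in tables, a monitoring page only needs to query `task` and `log` to show progress, which is the property the slide highlights.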


Outline

• How to do spatial lookup:
  – The old way: HTM
  – The new way: zoned lookup
• A load workflow system
• Embedded documentation: literate programming for SQL DDL


How do you document your schema?


Knuth’s Literate Programming

• Put documentation in the program
• Tool generates manual from program. In the new world: tool generates online hypertext from program.
• We annotated every table, view, function, .. with tags:
  – units
  – comment
  – reference to other tables (flags, star schema)
• Program scans DDL, generates web site.
• http://skyserver.sdss.org/en/help/docs/browser.asp

http://search.barnesandnoble.com/booksearch/isbnInquiry.asp?userid=1BHUJQ3VNI&isbn=0937073806&itm=1


CREATE FUNCTION fGetNearbyObjXYZ (@nx float, @ny float, @nz float, @r float)
-------------------------------------------------------------
--/H Returns table of primary objects within @r arcmins of an xyz point (@nx,@ny, @nz).
-------------------------------------------------------------
--/T There is no limit on the number of objects returned, but there are about 40 per sq arcmin.
--/T <p>returned table:
--/T <li> objID bigint PRIMARY KEY, -- Photo primary object identifier
--/T <li> run int NOT NULL, -- run that observed this object
--/T <li> camcol int NOT NULL, -- camera column that observed the object
--/T <li> field int NOT NULL, -- field that had the object
--/T <li> rerun int NOT NULL, -- computer processing run that discovered the object
--/T <li> type int NOT NULL, -- type of the object (3=Galaxy, 6= star, see PhotoType in DBconstants)
--/T <li> cx float NOT NULL, -- x,y,z of unit vector to this object
--/T <li> cy float NOT NULL,
--/T <li> cz float NOT NULL,
--/T <li> htmID bigint, -- Hierarchical Triangular Mesh id of this object
--/T <li> distance float -- distance in arc minutes to this object from the ra,dec.
--/T <br> Sample call to find PhotoObjects within 5 arcminutes of xyz -.996,-.1,0
--/T <br><samp>
--/T <br>select *
--/T <br>from dbo.fGetNearbyObjXYZ(-.996,-.1,0,5)
--/T </samp>
--/T <br>see also fGetNearbyObjEq, fGetNearestObjXYZ, fGetNearestObjXYZ
-------------------------------------------------------------
RETURNS @proxtab TABLE (
    objID bigint PRIMARY KEY,


CREATE TABLE Chunk (
-------------------------------------------------------------------------------
--/H Contains basic data for a Chunk
--
--/T A Chunk is a unit for SDSS data export.
--/T It is a part of an SDSS stripe, a 2.5 degree wide cylindrical segment
--/T aligned at a great circle between the survey poles.
--/T A Chunk has had both strips completely observed. Since
--/T the SDSS camera has gaps between its 6 columns of CCDs, each stripe has
--/T to be scanned twice (these are the strips) resulting in 12 slightly
--/T overlapping narrow observation segments. <P>
--/T Only those parts of a stripe are ready for export where the observation
--/T is complete, hence the declaration of a chunk, usually consisting of 2 runs.
-------------------------------------------------------------------------------
    chunkNumber   int NOT NULL,           --/D Unique chunk identifier
    startMu       int NOT NULL ,          --/D Starting mu value --/U arcsec
    endMu         int NOT NULL ,          --/D Ending mu value --/U arcsec
    [stripe]      int NOT NULL ,          --/D Stripe number
    exportVersion varchar(32) NOT NULL ,  --/D Export Version
)
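A scanner for these tags can be quite small. The sketch below is not the SkyServer tool; it is a hypothetical parser that assumes the tag meanings suggested by the examples (`--/H` one-line header, `--/T` descriptive text, `--/D` column description, `--/U` unit) and handles only `CREATE TABLE` blocks with one column per line.

```python
import re

SAMPLE_DDL = """\
CREATE TABLE Chunk (
--/H Contains basic data for a Chunk
--/T A Chunk is a unit for SDSS data export.
    chunkNumber int NOT NULL, --/D Unique chunk identifier
    startMu int NOT NULL, --/D Starting mu value --/U arcsec
)
"""

# One tag and its text, up to the next tag on the same line or end of line.
TAG = re.compile(r"--/([HTDU])\s*(.*?)\s*(?=--/[HTDU]|$)")


def parse_ddl(ddl):
    """Collect table name, header, text and per-column docs from tagged DDL."""
    doc = {"name": None, "header": "", "text": [], "columns": []}
    for line in ddl.splitlines():
        m = re.match(r"\s*CREATE\s+TABLE\s+(\w+)", line, re.IGNORECASE)
        if m:
            doc["name"] = m.group(1)
            continue
        tags = dict(TAG.findall(line))
        if "H" in tags:
            doc["header"] = tags["H"]
        if "T" in tags:
            doc["text"].append(tags["T"])
        if "D" in tags:                       # a column line: first token is its name
            doc["columns"].append({"name": line.split()[0].strip("[]"),
                                   "description": tags["D"],
                                   "unit": tags.get("U", "")})
    return doc


doc = parse_ddl(SAMPLE_DDL)
print(doc["name"], doc["header"], doc["columns"])
```

The resulting dictionary is the kind of intermediate form from which a generator could emit one HTML page per table, with units and descriptions next to each column.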


SkyServer Object Browser

http://skyserver.sdss.org/en/help/docs/browser.asp

• VB program scans DDL, generates web site.
• It is VERY useful
• Free Text Search VERY useful


Outline

• How to do spatial lookup:
  – The old way: HTM
  – The new way: zoned lookup
• A load workflow system
• Embedded documentation: literate programming for SQL DDL
• Not shown: A C# web service (http://skyservice.pha.jhu.edu/SdssCutout) or better yet! http://skyservice.pha.jhu.edu/dr1/imgcutout/chart.asp