PS1 PSPS Object Data Manager Design
PS1 PSPS Object Data Manager Design
PSPS Critical Design Review November 5-6, 2007
IfA
slide 2
Outline
ODM Overview
Critical Requirements Driving Design
Work Completed
Detailed Design
Spatial Querying [AS]
ODM Prototype [MN]
Hardware/Scalability [JV]
How Design Meets Requirements
WBS and Schedule
Issues/Risks
[AS] = Alex, [MN] = Maria, [JV] = Jan
slide 3
ODM Overview
The Object Data Manager will:
Provide a scalable data archive for the Pan-STARRS data products
Provide query access to the data for Pan-STARRS users
Provide detailed usage tracking and logging
slide 4
ODM Driving Requirements
Total size: 100 TB
• 1.5 × 10^11 P2 detections
• 8.3 × 10^10 P2 cumulative-sky (stack) detections
• 5.5 × 10^9 celestial objects
Nominal daily rate (divide by 3.5 × 365)
• P2 detections: 120 Million/day
• Stack detections: 65 Million/day
• Objects: 4.3 Million/day
Cross-Match requirement: 120 Million / 12 hrs ≈ 2800 / s
DB size requirement:
• 25 TB / yr
• ~100 TB by end of PS1 (3.5 yrs)
slide 5
Work completed so far
Built a prototype
Scoped and built prototype hardware
Generated simulated data
• 300M SDSS DR5 objects, 1.5B Galactic plane objects
Initial Load done – created 15 TB DB of simulated data
• Largest astronomical DB in existence today
Partitioned the data correctly using Zones algorithm
Able to run simple queries on distributed DB
Demonstrated critical steps of incremental loading
It is fast enough
• Cross-match > 60k detections/sec
• Required rate is ~3k/sec
slide 6
Detailed Design
Reuse SDSS software as much as possible
Data Transformation Layer (DX) – Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
Query Manager (QM: CasJobs for prototype)
slide 7
High-Level Organization
[Diagram: high-level organization of the ODM. A Web Based Interface (WBI) and Query Manager (QM) sit on top of the Data Storage (DS): the PS1 head database (PartitionsMap, Objects, LnkToObj, Meta, and a Detections partitioned view) connected through linked servers to slice databases P1..Pm, each holding [Objects_p], [LnkToObj_p], [Detections_p] partitioned tables and Meta. The Data Loading Pipeline (DLP) consists of a LoadAdmin server and LoadSupport1..n databases (objZoneIndx, orphans, Detections_l, LnkToObj_l, PartitionMap) on linked servers, fed by the Data Transformation Layer (DX). Legend: database, full table, [partitioned table], output table, partitioned view.]
slide 8
Detailed Design
Reuse SDSS software as much as possible
Data Transformation Layer (DX) – Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
Query Manager (QM: CasJobs for prototype)
slide 9
Data Transformation Layer (DX)
Based on SDSS sqlFits2CSV package
• LINUX/C++ application
• FITS reader driven off header files
Convert IPP FITS files to
• ASCII CSV format for ingest (initially)
• SQL Server native binary later (3x faster)
Follow the batch and ingest verification procedure described in ICD
• 4-step batch verification
• Notification and handling of broken publication cycle
Deposit CSV or binary input files in directory structure
• Create "ready" file in each batch directory
Stage input data on LINUX side as it comes in from IPP
slide 10
DX Subtasks
[Diagram: DX subtasks.
Initialization Job: FITS schema, FITS reader, CSV converter, CSV writer.
Batch Ingest: interface with IPP, naming convention, uncompress batch, read batch, verify batch.
Batch Verification: verify manifest, verify FITS integrity, verify FITS content, verify FITS data, handle broken cycle.
Batch Conversion: CSV converter, binary converter, "batch_ready", interface with DLP.]
slide 11
DX-DLP Interface
Directory structure on staging FS (LINUX):
• Separate directory for each JobID_BatchID
• Contains a "batch_ready" manifest file
– Name, #rows and destination table of each file
• Contains one file per destination table in ODM
– Objects, Detections, other tables
Creation of the "batch_ready" file is the signal to the loader to ingest the batch (a hypothetical layout is sketched below)
Batch size and frequency of ingest cycle TBD
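As an illustration only (the directory and file names below are hypothetical, not taken from the ICD), a staging area following this scheme might look like:

  /staging/Job0042_Batch0007/
      batch_ready        (manifest: name, #rows and destination table of each file)
      Objects.csv        (one file per destination ODM table)
      Detections.csv
      LnkToObj.csv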
slide 12
Detailed Design
Reuse SDSS software as much as possible
Data Transformation Layer (DX) – Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
Query Manager (QM: CasJobs for prototype)
slide 13
Data Loading Pipeline (DLP)
sqlLoader – SDSS data loading pipeline
• Pseudo-automated workflow system
• Loads, validates and publishes data
– From CSV to SQL tables
• Maintains a log of every step of loading
• Managed from Load Monitor Web interface
Has been used to load every SDSS data release
• EDR, DR1-6, ~15 TB of data altogether
• Most of it (since DR2) loaded incrementally
• Kept many data errors from getting into the database
– Duplicate ObjIDs (symptom of other problems)
– Data corruption (CSV format invaluable in catching this)
slide 14
sqlLoader Design
Existing functionality
• Shown for SDSS version
• Workflow, distributed loading, Load Monitor
New functionality
• Schema changes
• Workflow changes
• Incremental loading
– Cross-match and partitioning
slide 15
sqlLoader Workflow
Distributed design achieved with linked servers and SQL Server Agent
LOAD stage can be done in parallel by loading into temporary task databases
PUBLISH stage writes from task DBs to final DB
FINISH stage creates indices and auxiliary (derived) tables
[Diagram: sqlLoader workflow. LOAD stage: Export (EXP), Check CSV (CHK), Build Task DBs (BLD), Build SQL Schema (SQL), Validate (VAL), Backup (BCK), Detach (DTC). PUBLISH stage: Publish (PUB), Cleanup (CLN). FINISH stage (FIN).]
Loading pipeline is a system of VB and SQL scripts, stored procedures and functions
slide 16
Load Monitor Tasks Page
slide 17
Load Monitor Active Tasks
slide 18
Load Monitor Statistics Page
slide 19
Load Monitor – New Task(s)
slide 20
Data Validation
Tests for data integrity and consistency
Scrubs data and finds problems in upstream pipelines
Most of the validation can be performed within the individual task DB (in parallel); a minimal example of such a check is sketched below
[Diagram: validation tests.
Test uniqueness of primary keys – test the unique key in each table.
Test foreign keys – test for consistency of keys that link tables.
Test cardinalities – test consistency of numbers of various quantities.
Test HTM IDs – test the Hierarchical Triangular Mesh IDs used for spatial indexing.
Test link table consistency – ensure that links are consistent.]
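As an illustration of the kind of check the validator runs, a minimal primary-key uniqueness test in T-SQL might look like the following (table and column names are assumptions for the example, not the loader's actual test):

  -- Flag any duplicate ObjIDs in the task database (should return zero rows)
  SELECT objID, COUNT(*) AS n
  FROM Objects
  GROUP BY objID
  HAVING COUNT(*) > 1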
slide 21
Distributed Loading
[Diagram: distributed loading. A LoadAdmin master server holds the master schema and drives the Load Monitor; LoadSupport slave servers each build Task DBs (task data plus a view of the master schema) from Samba-mounted CSV/binary files; the Publish Data and Finish steps write from the task DBs into the publish schema.]
slide 22
Schema Changes
Schema in task and publish DBs is driven off a list of schema DDL files to execute (xschema.txt)
Requires replacing DDL files in schema/sql directory and updating xschema.txt with their names
PS1 schema DDL files have already been built
Index definitions have also been created
Metadata tables will be automatically generated using metadata scripts already in the loader
slide 23
Workflow Changes
Cross-Match and Partition steps will be added to the workflow
Cross-match will match detections to objects
Partition will horizontally partition data, move it to slice servers, and build DPVs on main
[Diagram: revised LOAD stage – Export, Check CSVs, Create Task DBs, Build SQL Schema, Validate, XMatch – followed by a Partition step in the PUBLISH stage.]
slide 24
Matching Detections with Objects
Algorithm described fully in prototype section
Stored procedures to cross-match detections will be part of the LOAD stage in loader pipeline
Vertical partition of Objects table kept on load server for matching with detections
Zones cross-match algorithm used to do 1" and 2" matches
Detections with no matches saved in Orphans table
slide 25
XMatch and Partition Data Flow
[Diagram: XMatch and Partition data flow between the Loadsupport, slice (Pm) and PS1 head databases. Steps and intermediate tables include LoadDetections, XMatch, Detections_In, LinkToObj_In, ObjZoneIndx, Orphans, PullChunk (Detections_chunk, LinkToObj_chunk), MergePartitions (Detections_m, LinkToObj_m), UpdateObjects (Objects_m), Pull, PartitionSwitch and Partition, ending in the Objects and LinkToObj tables.]
slide 26
Detailed Design
Reuse SDSS software as much as possible
Data Transformation Layer (DX) – Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
Query Manager (QM: CasJobs for prototype)
slide 27
Data Storage – Schema
slide 28
PS1 Table Sizes Spreadsheet
Input assumptions: Stars 5.00E+09 objects / 1.51E+11 detections; Galaxies 5.00E+08 objects / 3.675E+10 detections; Total objects 5.50E+09; P2 detections per year 4.30E+10; release scale factors: Prototype = 0.3 × DR1, DR1–DR4 = 0.29, 0.57, 0.86, 1.00.

Table  Flag1  Cols  Bytes/row  Rows  Total(TB)  Prototype  DR1  DR2  DR3  DR4  Flag2  Frac
AltModels  0  7  1547  10  1.547E-08  1.547E-08  1.547E-08  1.547E-08  1.547E-08  1.547E-08  1  1
CameraConfig  0  5  287  30  8.61E-09  8.61E-09  8.61E-09  8.61E-09  8.61E-09  8.61E-09  1  1
FileGroupMap  0  4  4335  100  4.335E-07  4.335E-07  4.335E-07  4.335E-07  4.335E-07  4.335E-07  1  1
IndexMap  0  7  2301  100  2.301E-07  2.301E-07  2.301E-07  2.301E-07  2.301E-07  2.301E-07  1  1
Objects  0  88  420  5.50E+09  2.31  0.693  2.31  2.31  2.31  2.31  1  0.33
ObjZoneIndx  0  7  63  5.50E+09  0.3465  0.10395  0.3465  0.3465  0.3465  0.3465  1  0
PartitionMap  0  3  4111  100  4.111E-07  4.111E-07  4.111E-07  4.111E-07  4.111E-07  4.111E-07  1  1
PhotoCal  0  10  151  1000  1.51E-07  1.51E-07  1.51E-07  1.51E-07  1.51E-07  1.51E-07  1  1
PhotozRecipes  0  2  267  10  2.67E-09  2.67E-09  2.67E-09  2.67E-09  2.67E-09  2.67E-09  1  1
SkyCells  0  2  10  50000  5.0E-07  5.0E-07  5.0E-07  5.0E-07  5.0E-07  5.0E-07  1  1
Surveys  0  2  267  30  8.01E-09  8.01E-09  8.01E-09  8.01E-09  8.01E-09  8.01E-09  1  1
DropP2ToObj  1  4  39  4.00E+06  0.000156  1.337E-05  4.457E-05  8.914E-05  1.337E-04  0.000156  1  0.33
DropStackToObj  1  4  39  4.00E+06  0.000156  1.337E-05  4.457E-05  8.914E-05  1.337E-04  0.000156  1  0.33
P2AltFits  1  13  71  1.51E+10  1.06855  0.09159  0.3053  0.6106  0.9159  1.06855  0  0.33
P2FrameMeta  1  18  343  1.05E+06  0.00036015  0.00003087  0.0001029  0.0002058  0.0003087  0.00036015  1  1
P2ImageMeta  1  64  2870  6.72E+07  0.192864  0.0165312  0.055104  0.110208  0.165312  0.192864  1  1
P2PsfFits  1  34  183  1.51E+11  27.5415  2.3607  7.869  15.738  23.607  27.5415  0  0.33
P2ToObj  1  3  31  1.51E+11  4.6655  0.3999  1.333  2.666  3.999  4.6655  1  0.33
P2ToStack  1  2  15  1.51E+11  2.2575  0.1935  0.645  1.29  1.935  2.2575  0  0.33
StackDeltaAltFits  1  13  71  3.68E+09  0.260925  0.022365  0.07455  0.1491  0.22365  0.260925  0  0.33
StackHiSigDeltas  1  32  167  3.68E+10  6.13725  0.52605  1.7535  3.507  5.2605  6.13725  0  0.33
StackLowSigDelta  1  2  5000  1.65E+06  0.00825  0.000707  0.002357  0.004714  0.007071  0.00825  0  0.33
StackMeta  1  49  1551  700000  0.0010857  0.00032571  0.0010857  0.0010857  0.0010857  0.0010857  0  0.33
StackModelFits  1  131  535  7.50E+09  4.0125  0.3439  1.1464  2.2929  3.4393  4.0125  0  0.33
StackPsfFits  1  44  215  8.25E+10  17.7375  1.5204  5.0679  10.1357  15.2036  17.7375  0  0.33
StackToObj  1  4  39  8.25E+10  3.2175  0.2758  0.9193  1.8386  2.7579  3.2175  1  0.33
StationaryTransient  1  2  23  5.00E+08  0.0115  0.000986  0.003286  0.006571  0.009857  0.0115  1  0.33
Sum  69.7696  6.5497  21.8324  41.0073  60.1822  69.7696
Indices (+20%)  13.9539  1.3099  4.3665  8.2015  12.0364  13.9539
Total  83.7235  7.8597  26.1989  49.2088  72.2186  83.7235

Flag legend (as given on the slide): 0 means the table size is essentially the same for all data releases (primary filegroup); 1 means the table size will grow. 0 means full table; 1 means the table is partitioned and distributed across the cluster. Frac: fraction of the table contained on each partition.
Note: these estimates are for the whole of PS1, assuming 3.5 years. 7 bytes are added to each row for overhead, as suggested by Alex.
slide 29
PS1 Table Sizes - All Servers
Table Year 1 Year 2 Year 3 Year 3.5
Objects 4.63 4.63 4.61 4.59
StackPsfFits 5.08 10.16 15.20 17.76
StackToObj 1.84 3.68 5.56 6.46
StackModelFits 1.16 2.32 3.40 3.96
P2PsfFits 7.88 15.76 23.60 27.60
P2ToObj 2.65 5.31 8.00 9.35
Other Tables 3.41 6.94 10.52 12.67
Indexes +20% 5.33 9.76 14.18 16.48
Total 31.98 58.56 85.07 98.87
Sizes are in TB
slide 30
Data Storage – Test Queries
Drawn from several sources
• Initial set of SDSS 20 queries
• SDSS SkyServer Sample Queries
• Queries from PS scientists (Monet, Howell, Kaiser, Heasley)
Two objectives
• Find potential holes/issues in schema
• Serve as test queries
– Test DBMS integrity
– Test DBMS performance
Loaded into CasJobs (Query Manager) as sample queries for prototype
slide 31
Data Storage – DBMS
Microsoft SQL Server 2005
• Relational DBMS with excellent query optimizer
Plus
• Spherical/HTM (C# library + SQL glue)
– Spatial index (Hierarchical Triangular Mesh)
• Zones (SQL library)
– Alternate spatial decomposition with dec zones
• Many stored procedures and functions
– From coordinate conversions to neighbor search functions (an example search is sketched below)
• Self-extracting documentation (metadata) and diagnostics
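For flavor, a neighbor search using this style of function library looks like the sketch below; the function name follows the SDSS SkyServer convention (fGetNearbyObjEq) and the PS1 equivalent may be named differently:

  -- All objects within 1 arcmin of (ra, dec) = (185.0, -0.5), SDSS-style helper function
  SELECT o.objID, o.ra, o.[dec], n.distance
  FROM dbo.fGetNearbyObjEq(185.0, -0.5, 1.0) AS n
  JOIN Objects AS o ON o.objID = n.objID
  ORDER BY n.distance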
slide 32
Documentation and Diagnostics
slide 33
Data Storage – Scalable Architecture
Monolithic database design (à la SDSS) will not do it
SQL Server does not have a cluster implementation
• Do it by hand
Partitions vs Slices
• Partitions are file-groups on the same server
– Parallelize disk accesses on the same machine
• Slices are data partitions on separate servers
• We use both!
Additional slices can be added for scale-out
For PS1, use SQL Server Distributed Partitioned Views (DPVs)
slide 34
Distributed Partitioned Views
Difference between DPVs and file-group partitioning
• FGs on the same database
• DPVs on separate DBs
• FGs are for scale-up
• DPVs are for scale-out
Main server has a view of a partitioned table that includes remote partitions (we call them slices to distinguish them from FG partitions); a minimal sketch of such a view follows
Accomplished with SQL Server's linked server technology
NOT truly parallel, though
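A minimal sketch of what such a distributed partitioned view could look like, assuming slice databases Slice1DB and Slice2DB on linked servers S1 and S2 and a CHECK constraint on the objID range in each member table (names are illustrative, not the actual PS1 DDL):

  -- Head-node view gluing the per-slice detection tables together
  CREATE VIEW Detections
  AS
  SELECT * FROM S1.Slice1DB.dbo.Detections_p1   -- first objID range, enforced by a CHECK constraint
  UNION ALL
  SELECT * FROM S2.Slice2DB.dbo.Detections_p2   -- next objID range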
slide 35
Scalable Data Architecture
Shared-nothing architecture
Detections split across cluster
Objects replicated on Head and Slice DBs
DPVs of Detections tables on the Head node DB
Queries on Objects stay on head node
Queries on detections use only local data on slices
[Diagram: head node holds Objects and a Detections DPV; slices S1–S3 each hold their own Objects_S and Detections_S tables.]
slide 36
Hardware - Prototype
[Diagram: prototype hardware layout. Functions: LX (PS01) Linux staging, 10 TB; L1 (PS13) and L2/M (PS05) loading, 9 TB; Head (PS11), S1 (PS12), S2 (PS03), S3 (PS04) DB servers, 39 TB; W (PS02) web server and MyDB. RAID configurations: RAID5 for staging, RAID10 for the DB servers; disk/rack configs 14D/3.5W and 12D/4W. Server naming convention: LX = Linux, L = load server, S/Head = DB server, M = MyDB server, W = web server; PS0x = 4-core, PS1x = 8-core. Storage building blocks: 10A = 10 × [13 × 750 GB], 3B = 3 × [12 × 500 GB].]
slide 37
Hardware – PS1
[Diagram: ping-pong arrangement of Live (Copy 1), Offline (Copy 2) and Spare (Copy 3) database copies, with queries served by the live copies while ingest and replication run on the offline copy.]
Ping-pong configuration to maintain high availability and query performance
2 copies of each slice and of main (head) node database on fast hardware (hot spares)
3rd spare copy on slow hardware (can be just disk)
Updates/ingest on offline copy, then switch copies when ingest and replication finished
Synchronize second copy while first copy is online
Both copies live when no ingest
3x basic config. for PS1
slide 38
Detailed Design
Reuse SDSS software as much as possible
Data Transformation Layer (DX) – Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
Query Manager (QM: CasJobs for prototype)
slide 39
Query Manager
Based on SDSS CasJobs
Configure to work with distributed database, DPVs
Direct links (contexts) to slices can be added later if necessary
Segregates quick queries from long ones
Saves query results server-side in MyDB
Gives users a powerful query workbench
Can be scaled out to meet any query load
PS1 Sample Queries available to users
PS1 Prototype QM demo
slide 40
ODM Prototype Components
Data Loading Pipeline
Data Storage
CasJobs
• Query Manager (QM)
• Web Based Interface (WBI)
Testing
slide 41
Spatial Queries (Alex)
slide 42
Spatial Searches in the ODM
slide 43
Common Spatial Questions
Points in region queries
1. Find all objects in this region
2. Find all "good" objects (not in masked areas)
3. Is this point in any of the regions?
Region in region
4. Find regions near this region and their area
5. Find all objects with error boxes intersecting region
6. What is the common part of these regions?
Various statistical operations
7. Find the object counts over a given region list
8. Cross-match these two catalogs in the region
slide 44
Sky Coordinates of Points
Many different coordinate systems
• Equatorial, Galactic, Ecliptic, Supergalactic
Longitude-latitude constraints
Searches often in a mix of different coordinate systems
• gb > 40 and dec between 10 and 20
• Problem: coordinate singularities, transformations
How can one describe constraints in an easy, uniform fashion?
How can one perform fast database queries in an easy fashion?
• Fast: indexes
• Easy: simple query expressions
slide 45
Describing Regions
Spacetime metadata for the VO (Arnold Rots)
Includes definitions of
• Constraint: single small or great circle
• Convex: intersection of constraints
• Region: union of convexes
Support both angles and Cartesian descriptions
Constructors for
• CIRCLE, RECTANGLE, POLYGON, CONVEX HULL
Boolean algebra (INTERSECTION, UNION, DIFF)
Proper language to describe the abstract regions
Similar to GIS, but much better suited for astronomy
slide 46
Things Can Get Complex
[Diagram: overlapping regions A and B.]
Green area: A (B − ε) should find B if it contains an A and is not masked.
Yellow area: A (B ± ε) is an edge case; it may find B if it contains an A.
slide 47
We Do Spatial 3 Ways
Hierarchical Triangular Mesh (extension to SQL)
• Uses table-valued functions
• Acts as a new "spatial access method"
Zones: fits SQL well
• Surprisingly simple & good
3D Constraints: a novel idea
• Algebra on regions, can be implemented in pure SQL
slide 48
PS1 Footprint
Using the projection cell definitions as centers for tessellation (T. Budavari)
slide 49
CrossMatch: Zone Approach
Divide space into declination zones
Objects ordered by zoneid, ra (on the sphere, need wrap-around margin)
Point search looks in neighboring zones within a ~(ra ± Δ) bounding box
All inside the relational engine
Avoids "impedance mismatch"
Can "batch" comparisons
Automatically parallel
Details in Maria's thesis
[Diagram: a zone of height zoneMax with a search point x and its ra ± Δ bounding box.]
slide 50
Indexing Using Quadtrees
Cover the sky with hierarchical pixels
COBE – start with a cube
Hierarchical Triangular Mesh (HTM) uses trixels
• Samet, Fekete
Start with an octahedron, and split each triangle into 4 children, down to 20 levels deep
Smallest triangles are 0.3"
Each trixel has a unique htmID
[Diagram: recursive subdivision of trixels, e.g. trixel 2 splits into 2,0–2,3 and trixel 2,3 into 2,3,0–2,3,3 (labels 20–22, 220–223).]
slide 51
Space-Filling Curve
[Diagram: space-filling curve over trixels 100–133.]
Triangles correspond to ranges; all points inside the triangle are inside the range.
Example ranges: [0.12, 0.13); [0.120, 0.121); [0.121, 0.122); [0.122, 0.123); [0.123, 0.130); [0.122, 0.130).
slide 52
SQL HTM Extension
Every object has a 20-deep htmID (44 bits)
Clustered index on htmID
Table-valued functions for spatial joins
• Given a region definition, routine returns up to 10 ranges of covering triangles
• Spatial query is mapped to ~10 range queries
Current implementation rewritten in C#
Excellent performance, little calling overhead
Three layers
• General geometry library
• HTM kernel
• IO (parsing + SQL interface)
slide 53
Writing Spatial SQL
-- region description is contained by @area
DECLARE @cover TABLE (htmStart bigint, htmEnd bigint)
INSERT @cover
SELECT * FROM dbo.fHtmCover(@area)        -- covering HTM ranges for the region

DECLARE @region TABLE (convexId bigint, x float, y float, z float, c float)
INSERT @region
SELECT * FROM dbo.fGetHalfSpaces(@area)   -- half-space (x, y, z, c) description of the region

SELECT o.ra, o.[dec], 1 AS flag, o.objid
FROM (SELECT objID AS objid, cx, cy, cz, ra, [dec]
      FROM Objects q
      JOIN @cover AS c ON q.htmID BETWEEN c.htmStart AND c.htmEnd) AS o
WHERE NOT EXISTS (                        -- keep objects inside every convex of the region
      SELECT p.convexId FROM @region AS p
      WHERE (o.cx*p.x + o.cy*p.y + o.cz*p.z < p.c)
      GROUP BY p.convexId)
slide 54
Status
All three libraries extensively tested
Zones used for Maria's thesis, plus various papers
New HTM code in production use since July on SDSS
Same code also used by STScI HLA, Galex
Systematic regression tests developed
Footprints computed for all major surveys
Complex mask computations done on SDSS
Loading: zones used for bulk crossmatch
Ad hoc queries: use HTM-based search functions
Excellent performance
slide 55
Prototype (Maria)
slide 56
PS1 PSPS Object Data Manager Design
PSPS Critical Design Review November 5-6, 2007
IfA
slide 57
Detail Design
General Concepts
Distributed Database Architecture
Ingest Workflow
Prototype
slide 58
Zones (spatial partitioning and indexing algorithm)
Partition and bin the data into declination zones
• ZoneID = floor ((dec + 90.0) / zoneHeight)
Few tricks required to handle spherical geometry
Place the data close on disk
• Clustered index on ZoneID and RA
Fully implemented in SQL
Efficient
• Nearby searches
• Cross-Match (especially)
Fundamental role in addressing the critical requirements
• Data volume management
• Association speed
• Spatial capabilities
[Diagram: the sky divided into declination (Dec) zones running along Right Ascension (RA).]
slide 59
Zoned Table
ObjID ZoneID* RA Dec CX CY CZ …
1 0 0.0 -90.0
2 20250 180.0 0.0
3 20250 181.0 0.0
4 40500 360.0 +90.0
* ZoneHeight = 8 arcsec in this example
ZoneID = floor ((dec + 90.0) / zoneHeight)
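A minimal sketch of how such a zoned table could be declared, with the zone computed from declination and the clustered index on (zoneID, ra) described above (the column set is trimmed and the names are illustrative, not the actual PS1 schema):

  -- zoneHeight here is 8 arcsec = 8/3600 deg, as in the example above
  CREATE TABLE ZonedObj (
      objID  bigint NOT NULL,
      ra     float  NOT NULL,
      [dec]  float  NOT NULL,
      cx     float  NOT NULL, cy float NOT NULL, cz float NOT NULL,
      zoneID AS CAST(FLOOR(([dec] + 90.0) / (8.0/3600.0)) AS int) PERSISTED
  )
  CREATE CLUSTERED INDEX ix_ZonedObj_zone ON ZonedObj (zoneID, ra)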
slide 60
SQL CrossNeighbors
SELECT *
FROM prObj1 z1
JOIN zoneZone ZZ ON ZZ.zoneID1 = z1.zoneID
JOIN prObj2 z2   ON ZZ.zoneID2 = z2.zoneID
WHERE z2.ra  BETWEEN z1.ra  - ZZ.alpha AND z1.ra  + ZZ.alpha
  AND z2.dec BETWEEN z1.dec - @r       AND z1.dec + @r
  AND (z1.cx*z2.cx + z1.cy*z2.cy + z1.cz*z2.cz) > cos(radians(@r))
slide 61
Good CPU Usage
slide 62
Partitions
SQL Server 2005 introduces technology to handle tables which are partitioned across different disk volumes and managed by a single server.
Partitioning makes management and access of large tables and indexes more efficient
• Enables parallel I/O
• Reduces the amount of data that needs to be accessed
• Related tables can be aligned and collocated in the same place, speeding up JOINs
slide 63
Partitions
2 key elements
• Partitioning function
– Specifies how the table or index is partitioned
• Partitioning scheme
– Using a partitioning function, the scheme specifies the placement of the partitions on file groups
Data can be managed very efficiently using Partition Switching
• Add a table as a partition to an existing table
• Switch a partition from one partitioned table to another
• Reassign a partition to form a single table
Main requirement
• The table must be constrained on the partitioning column
(a minimal DDL sketch follows)
slide 64
Partitions
For the PS1 design,
• Partitions mean File Group Partitions
• Tables are partitioned into ranges of ObjectID, which correspond to declination ranges
• ObjectID boundaries are selected so that each partition has a similar number of objects
slide 65
Distributed Partitioned Views
Tables participating in a Distributed Partitioned View (DPV) reside in different databases, which may live on different instances or on different (linked) servers
slide 66
Concept: Slices
In the PS1 design, the bigger tables will be partitioned across servers
To avoid confusion with the File Group Partitioning, we call them “Slices”
Data is glued together using Distributed Partitioned Views
The ODM will manage slices. Using slices improves system scalability.
For PS1 design, tables are sliced into ranges of ObjectID, which correspond to broad declination ranges. Each slice is subdivided into partitions that correspond to narrower declination ranges.
ObjectID boundaries are selected so that each slice has a similar number of objects.
slide 67
Detail Design Outline
General Concepts
Distributed Database Architecture
Ingest Workflow
Prototype
slide 68
PS1 Distributed DB System
[Diagram: the PS1 database (PartitionsMap, Objects, LnkToObj, Meta, and a Detections view) linked to slice databases P1..Pm ([Objects_p], [LnkToObj_p], [Detections_p], Meta), and the loading side (LoadAdmin plus LoadSupport1..n with objZoneIndx, orphans_l, Detections_l, LnkToObj_l, PartitionsMap), all connected through linked servers; the Query Manager (QM) and Web Based Interface (WBI) sit on top. Legend: database, full table, [partitioned table], output table, partitioned view.]
slide 69
Design Decisions: ObjID
Objects have their positional information encoded in their objID
• fGetPanObjID (ra, dec, zoneH) – see the sketch below
• ZoneID is the most significant part of the ID
It gives scalability, performance, and spatial functionality
Object tables are range partitioned according to their object ID
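Using the numbers worked out on the next slide, a call to this function would look like the following sketch; the signature for fGetPanObjID is the one quoted above, while the surrounding code is illustrative:

  -- zoneH = 0.008333 deg; example values taken from slide 70
  DECLARE @objID bigint
  SET @objID = dbo.fGetPanObjID(101.287155, -16.71611583, 0.008333)
  -- expected: 087941012871550661, whose leading digits (08794) carry the ZoneID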
slide 70
ObjectID Clusters Data Spatially
ObjectID = 087941012871550661
Dec = –16.71611583 ZH = 0.008333
ZID = (Dec+90) / ZH = 08794.0661
RA = 101.287155
ObjectID is unique when objects are separated by >0.0043 arcsec
slide 71
Design Decisions: DetectID
Detections have their positional information encoded in the detection identifier
• fGetDetectID (dec, observationID, runningID, zoneH)
• Primary key (objID, detectionID), to align detections with objects within partitions (a sketch follows)
• Provides efficient access to all detections associated to one object
• Provides efficient access to all detections of nearby objects
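A minimal sketch of what that alignment implies for a per-slice detections table; the table name and columns are illustrative, and psObjID is the hypothetical partition scheme sketched earlier, not actual PS1 DDL:

  -- Composite key (objID, detectID) keeps a detection in the same partition as its object
  CREATE TABLE Detections_p1 (
      objID    bigint NOT NULL,
      detectID bigint NOT NULL,
      ra float NOT NULL, [dec] float NOT NULL,
      CONSTRAINT pk_Detections_p1 PRIMARY KEY (objID, detectID)
  ) ON psObjID (objID)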
slide 72
DetectionID Clusters Data in Zones
DetectID = 0879410500001234567
Dec = –16.71611583 ZH = 0.008333
ZID = (Dec+90) / ZH = 08794.0661
ObservationID = 1050000
Running ID = 1234567
slide 73
ODM Capacity
5.3.1.3 The PS1 ODM shall be able to ingest into the ODM a total of
• 1.5 × 10^11 P2 detections
• 8.3 × 10^10 cumulative sky (stack) detections
• 5.5 × 10^9 celestial objects
together with their linkages.
slide 74
PS1 Table Sizes - Monolithic
Table Year 1 Year 2 Year 3 Year 3.5
Objects 2.31 2.31 2.31 2.31
StackPsfFits 5.07 10.16 15.20 17.74
StackToObj 0.92 1.84 2.76 3.22
StackModelFits 1.15 2.29 3.44 4.01
P2PsfFits 7.87 15.74 23.61 27.54
P2ToObj 1.33 2.67 4.00 4.67
Other Tables 3.19 6.03 8.87 10.29
Indexes +20% 4.37 8.21 12.04 13.96
Total 26.21 49.24 72.23 83.74
Sizes are in TB
slide 75
What goes into the main Server
[Diagram: the PS1 head database keeps the full Objects, LnkToObj, Meta and PartitionsMap tables; slices P1..Pm are attached via linked servers. Legend: database, full table, [partitioned table], output table, distributed partitioned view.]
slide 76
What goes into slices
[Diagram: each slice database P1..Pm holds its partitioned [Objects_p], [LnkToObj_p] and [Detections_p] tables plus copies of PartitionsMap and Meta; the PS1 head database (PartitionsMap, Objects, LnkToObj, Meta) is connected via linked servers. Legend: database, full table, [partitioned table], output table, distributed partitioned view.]
slide 78
Duplication of Objects & LnkToObj
Objects are distributed across slices
Objects, P2ToObj, and StackToObj are duplicated in the slices to parallelize "inserts" & "updates"
Detections belong in their object's slice
Orphans belong to the slice where their position would allocate them
• Orphans near slices' boundaries will need special treatment
Objects keep their original object identifier
• Even though positional refinement might change their zoneID and therefore the most significant part of their identifier
slide 79
Glue = Distributed Views
[Diagram: the head database exposes a Detections distributed partitioned view over the [Detections_p1]..[Detections_pm] tables on the slices; PartitionsMap, Objects, LnkToObj and Meta are full tables on the head, with per-slice copies of PartitionsMap and Meta. Legend: database, full table, [partitioned table], output table, distributed partitioned view.]
slide 80
Partitioning in Main Server
Main server is partitioned (Objects) and collocated (LnkToObj) by objID
Slices are partitioned (Objects) and collocated (LnkToObj) by objID
[Diagram: Query Manager (QM) and Web Based Interface (WBI) on top of the PS1 database and slices P1..Pm over linked servers.]
slide 81
PS1 Table Sizes - Main Server
Table Year 1 Year 2 Year 3 Year 3.5
Objects 2.31 2.31 2.31 2.31
StackPsfFits
StackToObj 0.92 1.84 2.76 3.22
StackModelFits
P2PsfFits
P2ToObj 1.33 2.67 4.00 4.67
Other Tables 0.41 0.46 0.52 0.55
Indexes +20% 0.99 1.46 1.92 2.15
Total 5.96 8.74 11.51 12.90
Sizes are in TB
slide 82
PS1 Table Sizes - Each Slice
Number of slices: m=4, m=8, m=10, m=12
Table Year 1 Year 2 Year 3 Year 3.5
Objects 0.58 0.29 0.23 0.19
StackPsfFits 1.27 1.27 1.52 1.48
StackToObj 0.23 0.23 0.28 0.27
StackModelFits 0.29 0.29 0.34 0.33
P2PsfFits 1.97 1.97 2.36 2.30
P2ToObj 0.33 0.33 0.40 0.39
Other Tables 0.75 0.81 1.00 1.01
Indexes +20% 1.08 1.04 1.23 1.19
Total 6.50 6.23 7.36 7.16
Sizes are in TB
slide 83
PS1 Table Sizes - All Servers
Table Year 1 Year 2 Year 3 Year 3.5
Objects 4.63 4.63 4.61 4.59
StackPsfFits 5.08 10.16 15.20 17.76
StackToObj 1.84 3.68 5.56 6.46
StackModelFits 1.16 2.32 3.40 3.96
P2PsfFits 7.88 15.76 23.60 27.60
P2ToObj 2.65 5.31 8.00 9.35
Other Tables 3.41 6.94 10.52 12.67
Indexes +20% 5.33 9.76 14.18 16.48
Total 31.98 58.56 85.07 98.87
Sizes are in TB
slide 84
Detail Design Outline
General Concepts
Distributed Database Architecture
Ingest Workflow
Prototype
slide 85
PS1 Distributed DB System
[Diagram: the same distributed DB system as slide 68 – the PS1 head database, slices P1..Pm, the LoadAdmin and LoadSupport1..n loading databases, and the Query Manager and Web Based Interface, all over linked servers.]
slide 86
“Insert” & “Update”
SQL Insert and Update are expensive operations due to logging and re-indexing
In the PS1 design, Insert and Update have been re-factored into sequences of:
Merge + Constrain + Switch Partition (a rough sketch follows)
Frequency
• f1: daily
• f2: at least monthly
• f3: TBD (likely to be every 6 months)
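A rough T-SQL sketch of the Merge + Constrain + Switch pattern for one partition; the table names, objID range and partition number are placeholders, not the production procedure:

  -- 1. Merge: build the new partition contents outside the live table
  SELECT * INTO Detections_merged
  FROM Detections_p1
  UNION ALL
  SELECT * FROM Detections_new_batch

  -- 2. Constrain: add the CHECK constraint the partition boundary requires
  ALTER TABLE Detections_merged
      ADD CONSTRAINT chk_range CHECK (objID >= 10000000000 AND objID < 20000000000)

  -- 3. Switch: swap the merged table in as the live partition (metadata-only;
  --    the old partition must have been switched out first and indexes must match)
  ALTER TABLE Detections_merged SWITCH TO Detections_p1 PARTITION 2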
slide 87
Ingest Workflow
[Diagram: ingest workflow. CSV detections (CSV Detect) are zoned (DZone) and cross-matched against the zoned Objects table (ObjectsZ) with a 1" match (X(1"), DXO_1a) and a 2" match of the remaining NoMatch detections (X(2"), DXO_2a); a Resolve step then produces the P2PsfFits, P2ToObj and Orphans tables.]
slide 88
Ingest @ frequency = f1
[Diagram: ingest at frequency f1 (daily). The LOADER produces ObjectsZ, P2ToObj_1, P2ToPsfFits_1 and Orphans_1; these feed the SLICE_1 partitions (Objects_1, P2ToObj_1, P2ToPsfFits_1, Orphans_1, Stack*_1, partitions 11–13) while the MAIN server keeps Metadata+, Objects, P2ToObj and StackToObj (partitions 1–3).]
slide 89
Updates @ frequency = f2
[Diagram: at frequency f2, Objects from the LOADER are applied to the MAIN server (Metadata+, Objects, P2ToObj, StackToObj) and to SLICE_1 (Objects_1, P2ToObj_1, P2ToPsfFits_1, Orphans_1, Stack*_1).]
slide 90
Updates @ frequency = f2
[Diagram: continuation of the f2 update flow – the refreshed Objects table is synchronized between the LOADER, SLICE_1 (Objects_1) and the MAIN server.]
slide 91
Snapshots @ frequency = f3
[Diagram: at frequency f3, a Snapshot of the Objects table is taken on the MAIN server (Metadata+, Objects, P2ToObj, StackToObj, partitions 1–3).]
slide 92
Batch Update of a Partition
[Diagram: batch update of a partition. The existing partitions A1, A2, A3 are merged with the incoming data (select into ... where) to build new tables B1, B2, B3 with primary-key indexes, which are then switched in to replace the old partitions (switch).]
slide 93
Scaling-out
Apply ping-pong strategy to satisfy query performance during ingest
2 x (1 main + m slices)
[Diagram: two copies of the PS1 head database and of every slice P1..Pm, with duplicate partitioned views, kept on linked servers behind the Query Manager (QM). Legend: database, duplicate database, full table, [partitioned table], partitioned view, duplicate partitioned view.]
slide 94
Scaling-out
More robustness, fault-tolerance, and reliability calls for
3 x (1 main + m slices)
[Diagram: three copies of the PS1 head database and of every slice P1..Pm on linked servers behind the Query Manager (QM). Legend: database, duplicate database, full table, [partitioned table], partitioned view, duplicate partitioned view.]
slide 95
Adding New slices
SQL Server range partitioning capabilities make it easy
• Recalculate partitioning limits
• Transfer data to new slices
• Remove data from slices
• Define and apply new partitioning schema
• Add new partitions to main server
• Apply new partitioning schema to main server
slide 96
Adding New Slices
slide 97
Detail Design Outline
General Concepts Distributed Database architecture Ingest Workflow Prototype
slide 99
ODM Ingest Performance
5.3.1.6 The PS1 ODM shall be able to ingest the data from the IPP at two times the nominal daily arrival rate*
* The nominal daily data rate from the IPP is defined as the total data volume to be ingested annually by the ODM divided by 365.
Nominal daily data rate:
• 1.5 × 10^11 / 3.5 / 365 = 1.2 × 10^8 P2 detections / day
• 8.3 × 10^10 / 3.5 / 365 = 6.5 × 10^7 stack detections / day
slide 100
Number of Objects
                 miniProto    myProto      Prototype    PS1
SDSS* Stars      5.7 x 10^4   1.3 x 10^7   1.1 x 10^8
SDSS* Galaxies   9.1 x 10^4   1.1 x 10^7   1.7 x 10^8
Galactic Plane   1.5 x 10^6   3 x 10^6     1.0 x 10^9
TOTAL            1.6 x 10^6   2.6 x 10^7   1.3 x 10^9   5.5 x 10^9
* "SDSS" includes a mirror of 11.3 < Dec < 30 objects to Dec < 0
Total GB of CSV loaded data: 300 GB
CSV bulk insert load: 8 MB/s
Binary bulk insert: 18-20 MB/s
Creation started: October 15th 2007; finished: October 29th 2007 (??)
Includes
• 10 epochs of P2PsfFits detections
• 1 epoch of Stack detections
slide 102
Prototype in Context
Survey               Objects      Detections
SDSS DR6             3.8 x 10^8
2MASS                4.7 x 10^8
USNO-B               1.0 x 10^9
Prototype            1.3 x 10^9   1.4 x 10^10
PS1 (end of survey)  5.5 x 10^9   2.3 x 10^11
slide 103
Size of Prototype Database
Table Main Slice1 Slice2 Slice3 Loader Total
Objects 1.30 0.43 0.43 0.43 1.30 3.89
StackPsfFits 6.49 6.49
StackToObj 6.49 6.49
StackModelFits 0.87 0.87
P2PsfFits 4.02 3.90 3.35 0.37 11.64
P2ToObj 4.02 3.90 3.35 0.12 11.39
Total 15.15 8.47 8.23 7.13 1.79 40.77
Extra Tables 0.87 4.89 4.77 4.22 6.86 21.61
Grand Total 16.02 13.36 13.00 11.35 8.65 62.38
Table sizes are in billions of rows
slide 104
Size of Prototype Database
Table Main Slice1 Slice2 Slice3 Loader Total
Objects 547.6 165.4 165.3 165.3 137.1 1180.6
StackPsfFits 841.5 841.6
StackToObj 300.9 300.9
StackModelFits 476.7 476.7
P2PsfFits 879.9 853.0 733.5 74.7 2541.1
P2ToObj 125.7 121.9 104.8 3.8 356.2
Total 2166.7 1171.0 1140.2 1003.6 215.6 5697.1
Extra Tables 207.9 987.1 960.2 840.7 957.3 3953.2
Allocated / Free 1878.0 1223.0 1300.0 1121.0 666.0 6188.0
Grand Total 4252.6 3381.1 3400.4 2965.3 1838.9 15838.3
9.6 TB of data in a distributed database
Table sizes are in GB
slide 105
Well-Balanced Partitions
Server Partition Rows Fraction Dec Range
Main 1 432,590,598 33.34% 32.59
Slice 1 1 144,199,105 11.11% 14.29
Slice 1 2 144,229,343 11.11% 9.39
Slice 1 3 144,162,150 11.12% 8.91
Main 2 432,456,511 33.33% 23.44
Slice 2 1 144,261,098 11.12% 8.46
Slice 2 2 144,073,972 11.10% 7.21
Slice 2 3 144,121,441 11.11% 7.77
Main 3 432,496,648 33.33% 81.98
Slice 3 1 144,270,093 11.12% 11.15
Slice 3 2 144,090,071 11.10% 14.72
Slice 3 3 144,136,484 11.11% 56.10
slide 106
Ingest and Association Times
Task                               Measured (minutes)
Create Detections Zone Table 39.62
X(0.2") 121M X 1.3B 65.25
Build #noMatches Table 1.50
X(1") 12k X 1.3B 0.65
Build #allMatches Table (121M) 6.58
Build Orphans Table 0.17
Create P2PsfFits Table 11.63
Create P2ToObj Table 14.00
Total of Measured Times 140.40
slide 107
Ingest and Association Times
Task                               Estimated (minutes)
Compute DetectionID, HTMID 30
Remove NULLS 15
Index P2PsfFits on ObjID 15
Slices Pulling Data from Loader 5
Resolve 1 Detection - N Objects 10
Total of Estimated Times 75
Educated Guess / Wild Guess
slide 108
Total Time to I/A daily Data
Task                               Time (hours)   Time (hours)
Ingest 121M Detections (binary) 0.32
Ingest 121M Detections (CSV) 0.98
Total of Measured Times 2.34 2.34
Total of Estimated Times 1.25 1.25
Total Time to I/A Daily Data 3.91 4.57
Requirement: Less than 12 hours (more than 2800 detections / s)
Detection Processing Rate: 8600 to 7400 detections / s
Margin on Requirement: 3.1 to 2.6
Using multiple loaders would improve performance
slide 109
Insert Time @ slices
Task                               Estimated (minutes)
Import P2PsfFits (binary out/in) 20.45
Import P2PsfFits (binary out/in) 2.68
Import Orphans 0.00
Merge P2PsfFits 58
Add constraint P2PsfFits 193
Merge P2ToObj 13
Add constraint P2ToObj 54
Total of Measured Times 362
6 h with 8 partitions/slice (~1.3 × 10^9 detections/partition)
Educated Guess
slide 110
Detections Per Partition
Years  Total Detections  Slices  Partitions per Slice  Total Partitions  Detections per Partition
0.0    0                  4       8                     32                0
1.0    4.29 x 10^10       4       8                     32                1.34 x 10^9
1.0    4.29 x 10^10       8       8                     64                6.7 x 10^8
2.0    8.57 x 10^10       8       8                     64                1.34 x 10^9
2.0    8.57 x 10^10      10       8                     80                1.07 x 10^9
3.0    1.29 x 10^11      10       8                     80                1.61 x 10^9
3.0    1.29 x 10^11      12       8                     96                1.34 x 10^9
3.5    1.50 x 10^11      12       8                     96                1.56 x 10^9
slide 111
Total Time for Insert @ slice
Task                               Time (hours)
Total of Measured Times 0.25
Total of Estimated Times 5.3
Total Time for daily insert 6
Daily insert may operate in parallel with daily ingest and association.
Requirement: Less than 12 hours
Margin on Requirement: 2.0
Using more slices will improve insert performance.
slide 112
Summary
Ingest + Association < 4 h using 1 loader (@f1 = daily)
• Scales with the number of servers
• Current margin on requirement 3.1
• Room for improvement
Detection Insert @ slices (@f1 = daily)
• 6 h with 8 partitions/slice
• It may happen in parallel with loading
Detections Lnks Insert @ main (@f2 < monthly)
• Unknown
• 6 h available
Objects insert & update @ slices (@f2 < monthly)
• Unknown
• 6 hours available
Objects update @ main server (@f2 < monthly)
• Unknown
• 12 h available. Transfer can be pipelined as soon as objects have been processed
slide 113
Risks
Estimates of Insert & Update at slices could be underestimated
• Need more empirical evaluation of exercising parallel I/O
Estimates and layout of disk storage could be underestimated
• Merges and indexes require 2x the data size
Hardware/Scalability (Jan)
slide 115
PS1 Prototype Systems Design
Jan Vandenberg, JHU
Early PS1 Prototype
slide 116
Engineering Systems to Support the Database Design
Sequential read performance is our life-blood. Virtually all science queries will be I/O-bound.
~70 TB raw data: 5.9 hours for a full scan on IBM's fastest 3.3 GB/s champagne-budget SAN
• Need a 20 GB/s I/O engine just to scan the full data in less than an hour. Can't touch this on a monolith.
Data mining a challenge even with good index coverage
• ~14 TB worth of indexes: 4-odd times bigger than SDSS DR6.
Hopeless if we rely on any bulk network transfers: must do work where the data is
Loading/ingest more CPU-bound, though we still need solid write performance
slide 117
Choosing I/O Systems
So killer sequential I/O performance is a key systems design goal. Which gear to use?
• FC/SAN?
• Vanilla SATA?
• SAS?
slide 118
Fibre Channel, SAN
Expensive but not-so-fast physical links (4 Gbit, 10 Gbit)
Expensive switch
Potentially very flexible
Industrial-strength manageability
Little control over RAID controller bottlenecks
slide 119
Straight SATA
Fast
Pretty cheap
Not so industrial-strength
slide 120
SAS
Fast: 12 Gbit/s FD building blocks
Nice and mature, stable
SCSI's not just for swanky drives anymore: takes SATA drives!
So we have a way to use SATA without all the "beige".
Pricey? $4400 for full 15x750GB system ($296/drive == close to Newegg media cost)
slide 121
SAS Performance, Gory Details
SAS v. SATA differences
[Chart: native SAS vs. SATA throughput (MB/s, 0–500) as a function of number of disks (1–7); roughly a 20% difference.]
slide 122
Per-Controller Performance
One controller can’t quite accommodate the throughput of an entire storage enclosure.
[Chart: controller limits – throughput (MB/s, up to ~1400) vs. number of disks (1–13) for one controller, against the 6 Gbit limit and the ideal scaling line.]
slide 123
Resulting PS1 Prototype I/O Topology
1100 MB/s single-threaded sequential reads per server
[Chart: aggregate design I/O performance – throughput (MB/s, up to ~1600) vs. number of disks (1–18) for one controller, dual controllers, the 6 Gbit limit, and ideal scaling.]
slide 124
RAID-5 v. RAID-10?
Primer, anyone?
RAID-5 perhaps feasible with contemporary controllers…
…but not a ton of redundancy
But after we add enough disks to meet performance goals, we have enough storage to run RAID-10 anyway!
slide 125
RAID-10 Performance
0.5*RAID-0 for single-threaded reads
RAID-0 perf for 2-user/2-thread workloads
0.5*RAID-0 writes
slide 126
PS1 Prototype Servers
[Diagram: prototype DB servers – head node H1 with slices S1, S2, S3 – and the prototype loader – Linux staging plus load servers L1 and L2.]
slide 127
PS1 Prototype Servers
PS1 Prototype
slide 128
PS1 Prototype Servers
slide 129
Projected PS1 Systems Design
[Diagram: projected PS1 systems design – a loader group (Linux staging, L1..LN) and four database groups (R1, R2, R3, G1), each with two head nodes (H1, H2) and eight slices (S1..S8).]
slide 130
Backup/Recovery/Replication Strategies
No formal backup
• …except maybe for mydb's, f(cost*policy)
3-way replication
• Replication != backup
– Little or no history (though we might have some point-in-time capabilities via metadata)
– Replicas can be a bit too cozy: must notice badness before replication propagates it
• Replicas provide redundancy and load balancing…
• Fully online: zero time to recover
• Replicas needed for happy production performance plus ingest, anyway
Off-site geoplex
• Provides continuity if we lose HI (local or trans-Pacific network outage, facilities outage)
• Could help balance trans-Pacific bandwidth needs (service continental traffic locally)
slide 131
Why No Traditional Backups?
Money no object… do traditional backups too!!!
Synergy, economy of scale with other collaboration needs (IPP?)… do traditional backups too!!!
Not super pricey…
…but not very useful relative to a replica for our purposes
• Time to recover
slide 132
Failure Scenarios (Easy Ones)
Zero downtime, little effort:
• Disks (common)
– Simple* hotswap
– Automatic rebuild from hotspare or replacement drive
• Power supplies (not uncommon)
– Simple* hotswap
• Fans (pretty common)
– Simple* hotswap
* Assuming sufficiently non-beige gear
slide 133
Failure Scenarios (Mostly Harmless Ones)
Some downtime and replica cutover:
• System board (rare)
• Memory (rare, and usually proactively detected and handled via scheduled maintenance)
• Disk controller (rare, potentially minimal downtime via cold-spare controller)
• CPU (not utterly uncommon, can be tough and time consuming to diagnose correctly)
slide 134
Failure Scenarios (Slightly Spooky Ones)
Database mangling by human or pipeline error
• Gotta catch this before replication propagates it everywhere
• Need lots of sanity checks before replicating
• (and so off-the-shelf near-realtime replication tools don't help us)
• Need to run replication backwards from older, healthy replicas. Probably less automated than healthy replication.
Catastrophic loss of datacenter
• Okay, we have the geoplex
– …but we're dangling by a single copy 'till recovery is complete
– …and this may be a while.
– …but are we still in trouble? Depending on colo scenarios, did we also lose the IPP and flatfile archive?
slide 135
Failure Scenarios (Nasty Ones)
Unrecoverable badness fully replicated before detection
Catastrophic loss of datacenter without geoplex
Can we ever catch back up with the data rate if we need to start over and rebuild with an ingest campaign? Don't bet on it!
slide 136
Operating Systems, DBMS?
SQL Server 2005 EE x64 (Win2003 EE x64)
• Why?
• Why not DB2, Oracle RAC, PostgreSQL, MySQL, <insert your favorite>?
Why EE? Because it's there. <indexed DPVs?>
Scientific Linux 4.x/5.x, or local favorite
Platform rant from JVV available over beers
slide 137
Systems/Database Management
Active Directory infrastructure
Windows patching tools, practices
Linux patching tools, practices
Monitoring
Staffing requirements
slide 138
Facilities/Infrastructure Projections for PS1
Power/cooling
• Prototype is 9.2 kW (2.6 tons AC)
• PS1: something like 43 kW, 12.1 tons
Rack space
• Prototype is 69 RU, <2 42U racks (includes 14U of rackmount UPS at JHU)
• PS1: about 310 RU (9-ish racks)
Networking: ~40 Gbit Ethernet ports
…plus sundry infrastructure, ideally already in place (domain controllers, monitoring systems, etc.)
slide 139
Operational Handoff to UofH
Gulp.
slide 140
How Design Meets Requirements
Cross-matching detections with objects
• Zone cross-match part of loading pipeline
• Already exceeded requirement with prototype
Query performance
• Ping-pong configuration for query during ingest
• Spatial indexing and distributed queries
• Query manager can be scaled out as necessary
Scalability
• Shared-nothing architecture
• Scale out as needed
• Beyond PS1 we will need truly parallel query plans
slide 141
WBS/Development Tasks
Refine Prototype/Schema
Staging/Transformation
Initial Load
Load/Resolve Detections
Resolve/Synchronize Objects
Create Snapshot
Replication Module
Query Processing
• Workflow Systems
• Logging
• Data Scrubbing
• SSIS (?) + C#
• QM/Logging
Hardware
Documentation
Testing
Redistribute Data
Effort per task (person-months): 2, 3, 1, 3, 3, 1, 2, 2, 2, 2, 4, 4, 4, 2
Total Effort: 35 PM
Delivery: 9/2008
slide 142
Personnel Available
2 new hires (SW Engineers) 100%
Maria 80%
Ani 20%
Jan 10%
Alainna 15%
Nolan Li 25%
Sam Carliles 25%
George Fekete 5%
Laszlo Dobos 50% (for 6 months)
slide 143
Issues/Risks
Versioning
• Do we need to preserve snapshots of monthly versions?
• How will users reproduce queries on subsequent versions?
• Is it ok that a new version of the sky replaces the previous one every month?
Backup/recovery
• Will we need 3 local copies rather than 2 for safety?
• Is restoring from offsite copy feasible?
Handoff to IfA beyond scope of WBS shown
• This will involve several PMs
Mahalo!
slide 145
Query Manager
[Screenshot: Query Manager query page. Callouts: the context that the query is executed in; the MyDB table that query results go into; the name that this query job is given; check query syntax; get graphical query plan; run query in quick (1-minute) mode; submit query to long (8-hour) queue; the query buffer; load one of the sample queries into the query buffer.]
slide 146
Query Manager
[Screenshot: stored procedure view, showing the stored procedure arguments and the SQL code for the stored procedure.]
slide 147
Query Manager
[Screenshot: MyDB page. MyDB context is the default, but other contexts can be selected; the space used and total space available are shown; multiple tables can be selected and dropped at once; the table list can be sorted by name, size, type; the user can browse DB views, tables, functions and procedures.]
slide 148
Query Manager
[Screenshot: table detail page, showing the query that created this table.]
slide 149
Query Manager
[Screenshot: cross-match/search form with the search radius, the table to hold results, and the context to run the search on.]