PS1 PSPS Object Data Manager Design
PS1 PSPS Object Data Manager Design
PSPS Critical Design Review November 5-6, 2007
IfA
slide 2
Outline
ODM Overview
Critical Requirements Driving Design
Work Completed
Detailed Design
Spatial Querying [AS]
ODM Prototype [MN]
Hardware/Scalability [JV]
How Design Meets Requirements
WBS and Schedule
Issues/Risks
[AS] = Alex, [MN] = Maria, [JV] = Jan
slide 3
ODM Overview
The Object Data Manager will:
Provide a scalable data archive for the Pan-STARRS data products
Provide query access to the data for Pan-STARRS users
Provide detailed usage tracking and logging
slide 4
ODM Driving Requirements
Total size: 100 TB
• 1.5 × 10^11 P2 detections
• 8.3 × 10^10 P2 cumulative-sky (stack) detections
• 5.5 × 10^9 celestial objects
Nominal daily rate (divide by 3.5 × 365)
• P2 detections: 120 Million/day
• Stack detections: 65 Million/day
• Objects: 4.3 Million/day
Cross-Match requirement: 120 Million / 12 hrs ≈ 2800 / s
DB size requirement:
• 25 TB / yr
• ~100 TB by end of PS1 (3.5 yrs)
slide 5
Work completed so far
Built a prototype
Scoped and built prototype hardware
Generated simulated data
• 300M SDSS DR5 objects, 1.5B Galactic plane objects
Initial Load done – created 15 TB DB of simulated data
• Largest astronomical DB in existence today
Partitioned the data correctly using Zones algorithm
Able to run simple queries on distributed DB
Demonstrated critical steps of incremental loading
It is fast enough
• Cross-match > 60k detections/sec
• Required rate is ~3k/sec
slide 6
Detailed Design
Reuse SDSS software as much as possible
Data Transformation Layer (DX) – Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
Query Manager (QM: CasJobs for prototype)
slide 7
High-Level Organization
[Diagram: high-level organization of the ODM. A Web Based Interface (WBI) and Query Manager (QM) sit on top of the Data Storage (DS): the PS1 head database (PartitionsMap, Objects, LnkToObj, Meta, and a Detections partitioned view) connected through linked servers to slice databases P1..Pm, each holding [Objects_p], [LnkToObj_p], [Detections_p] partitioned tables and Meta. The Data Loading Pipeline (DLP) consists of a LoadAdmin server and LoadSupport1..n databases (objZoneIndx, orphans, Detections_l, LnkToObj_l, PartitionMap) on linked servers, fed by the Data Transformation Layer (DX). Legend: database, full table, [partitioned table], output table, partitioned view.]
slide 8
Detailed Design
Reuse SDSS software as much as possible
Data Transformation Layer (DX) – Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
Query Manager (QM: CasJobs for prototype)
slide 9
Data Transformation Layer (DX)
Based on SDSS sqlFits2CSV package
• LINUX/C++ application
• FITS reader driven off header files
Convert IPP FITS files to
• ASCII CSV format for ingest (initially)
• SQL Server native binary later (3x faster)
Follow the batch and ingest verification procedure described in ICD
• 4-step batch verification
• Notification and handling of broken publication cycle
Deposit CSV or binary input files in directory structure
• Create "ready" file in each batch directory
Stage input data on LINUX side as it comes in from IPP
slide 10
DX Subtasks
[Diagram: DX subtasks.
Initialization Job: FITS schema, FITS reader, CSV converter, CSV writer.
Batch Ingest: interface with IPP, naming convention, uncompress batch, read batch, verify batch.
Batch Verification: verify manifest, verify FITS integrity, verify FITS content, verify FITS data, handle broken cycle.
Batch Conversion: CSV converter, binary converter, "batch_ready", interface with DLP.]
slide 11
DX-DLP Interface
Directory structure on staging FS (LINUX):
• Separate directory for each JobID_BatchID
• Contains a "batch_ready" manifest file
– Name, #rows and destination table of each file
• Contains one file per destination table in ODM
– Objects, Detections, other tables
Creation of the "batch_ready" file is the signal to the loader to ingest the batch (a hypothetical layout is sketched below)
Batch size and frequency of ingest cycle TBD
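As an illustration only (the directory and file names below are hypothetical, not taken from the ICD), a staging area following this scheme might look like:

  /staging/Job0042_Batch0007/
      batch_ready        (manifest: name, #rows and destination table of each file)
      Objects.csv        (one file per destination ODM table)
      Detections.csv
      LnkToObj.csv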
slide 12
Detailed Design
Reuse SDSS software as much as possible
Data Transformation Layer (DX) – Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
Query Manager (QM: CasJobs for prototype)
slide 13
Data Loading Pipeline (DLP)
sqlLoader – SDSS data loading pipeline
• Pseudo-automated workflow system
• Loads, validates and publishes data
– From CSV to SQL tables
• Maintains a log of every step of loading
• Managed from Load Monitor Web interface
Has been used to load every SDSS data release
• EDR, DR1-6, ~15 TB of data altogether
• Most of it (since DR2) loaded incrementally
• Kept many data errors from getting into the database
– Duplicate ObjIDs (symptom of other problems)
– Data corruption (CSV format invaluable in catching this)
slide 14
sqlLoader Design
Existing functionality
• Shown for SDSS version
• Workflow, distributed loading, Load Monitor
New functionality
• Schema changes
• Workflow changes
• Incremental loading
– Cross-match and partitioning
slide 15
sqlLoader Workflow
Distributed design achieved with linked servers and SQL Server Agent
LOAD stage can be done in parallel by loading into temporary task databases
PUBLISH stage writes from task DBs to final DB
FINISH stage creates indices and auxiliary (derived) tables
[Diagram: sqlLoader workflow. LOAD stage: Export (EXP), Check CSV (CHK), Build Task DBs (BLD), Build SQL Schema (SQL), Validate (VAL), Backup (BCK), Detach (DTC). PUBLISH stage: Publish (PUB), Cleanup (CLN). FINISH stage (FIN).]
Loading pipeline is a system of VB and SQL scripts, stored procedures and functions
slide 16
Load Monitor Tasks Page
slide 17
Load Monitor Active Tasks
slide 18
Load Monitor Statistics Page
slide 19
Load Monitor – New Task(s)
slide 20
Data Validation
Tests for data integrity and consistency
Scrubs data and finds problems in upstream pipelines
Most of the validation can be performed within the individual task DB (in parallel); a minimal example of such a check is sketched below
[Diagram: validation tests.
Test uniqueness of primary keys – test the unique key in each table.
Test foreign keys – test for consistency of keys that link tables.
Test cardinalities – test consistency of numbers of various quantities.
Test HTM IDs – test the Hierarchical Triangular Mesh IDs used for spatial indexing.
Test link table consistency – ensure that links are consistent.]
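As an illustration of the kind of check the validator runs, a minimal primary-key uniqueness test in T-SQL might look like the following (table and column names are assumptions for the example, not the loader's actual test):

  -- Flag any duplicate ObjIDs in the task database (should return zero rows)
  SELECT objID, COUNT(*) AS n
  FROM Objects
  GROUP BY objID
  HAVING COUNT(*) > 1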
slide 21
Distributed Loading
[Diagram: distributed loading. A LoadAdmin master server holds the master schema and drives the Load Monitor; LoadSupport slave servers each build Task DBs (task data plus a view of the master schema) from Samba-mounted CSV/binary files; the Publish Data and Finish steps write from the task DBs into the publish schema.]
slide 22
Schema Changes
Schema in task and publish DBs is driven off a list of schema DDL files to execute (xschema.txt)
Requires replacing DDL files in schema/sql directory and updating xschema.txt with their names
PS1 schema DDL files have already been built
Index definitions have also been created
Metadata tables will be automatically generated using metadata scripts already in the loader
slide 23
Workflow Changes
Cross-Match and Partition steps will be added to the workflow
Cross-match will match detections to objects
Partition will horizontally partition data, move it to slice servers, and build DPVs on main
[Diagram: revised LOAD stage – Export, Check CSVs, Create Task DBs, Build SQL Schema, Validate, XMatch – followed by a Partition step in the PUBLISH stage.]
slide 24
Matching Detections with Objects
Algorithm described fully in prototype section
Stored procedures to cross-match detections will be part of the LOAD stage in loader pipeline
Vertical partition of Objects table kept on load server for matching with detections
Zones cross-match algorithm used to do 1" and 2" matches
Detections with no matches saved in Orphans table
slide 25
XMatch and Partition Data Flow
[Diagram: XMatch and Partition data flow between the Loadsupport, slice (Pm) and PS1 head databases. Steps and intermediate tables include LoadDetections, XMatch, Detections_In, LinkToObj_In, ObjZoneIndx, Orphans, PullChunk (Detections_chunk, LinkToObj_chunk), MergePartitions (Detections_m, LinkToObj_m), UpdateObjects (Objects_m), Pull, PartitionSwitch and Partition, ending in the Objects and LinkToObj tables.]
slide 26
Detailed Design
Reuse SDSS software as much as possible
Data Transformation Layer (DX) – Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
Query Manager (QM: CasJobs for prototype)
slide 27
Data Storage – Schema
slide 28
PS1 Table Sizes Spreadsheet
Input assumptions: Stars 5.00E+09 objects / 1.51E+11 detections; Galaxies 5.00E+08 objects / 3.675E+10 detections; Total objects 5.50E+09; P2 detections per year 4.30E+10; release scale factors: Prototype = 0.3 × DR1, DR1–DR4 = 0.29, 0.57, 0.86, 1.00.

Table  Flag1  Cols  Bytes/row  Rows  Total(TB)  Prototype  DR1  DR2  DR3  DR4  Flag2  Frac
AltModels  0  7  1547  10  1.547E-08  1.547E-08  1.547E-08  1.547E-08  1.547E-08  1.547E-08  1  1
CameraConfig  0  5  287  30  8.61E-09  8.61E-09  8.61E-09  8.61E-09  8.61E-09  8.61E-09  1  1
FileGroupMap  0  4  4335  100  4.335E-07  4.335E-07  4.335E-07  4.335E-07  4.335E-07  4.335E-07  1  1
IndexMap  0  7  2301  100  2.301E-07  2.301E-07  2.301E-07  2.301E-07  2.301E-07  2.301E-07  1  1
Objects  0  88  420  5.50E+09  2.31  0.693  2.31  2.31  2.31  2.31  1  0.33
ObjZoneIndx  0  7  63  5.50E+09  0.3465  0.10395  0.3465  0.3465  0.3465  0.3465  1  0
PartitionMap  0  3  4111  100  4.111E-07  4.111E-07  4.111E-07  4.111E-07  4.111E-07  4.111E-07  1  1
PhotoCal  0  10  151  1000  1.51E-07  1.51E-07  1.51E-07  1.51E-07  1.51E-07  1.51E-07  1  1
PhotozRecipes  0  2  267  10  2.67E-09  2.67E-09  2.67E-09  2.67E-09  2.67E-09  2.67E-09  1  1
SkyCells  0  2  10  50000  5.0E-07  5.0E-07  5.0E-07  5.0E-07  5.0E-07  5.0E-07  1  1
Surveys  0  2  267  30  8.01E-09  8.01E-09  8.01E-09  8.01E-09  8.01E-09  8.01E-09  1  1
DropP2ToObj  1  4  39  4.00E+06  0.000156  1.337E-05  4.457E-05  8.914E-05  1.337E-04  0.000156  1  0.33
DropStackToObj  1  4  39  4.00E+06  0.000156  1.337E-05  4.457E-05  8.914E-05  1.337E-04  0.000156  1  0.33
P2AltFits  1  13  71  1.51E+10  1.06855  0.09159  0.3053  0.6106  0.9159  1.06855  0  0.33
P2FrameMeta  1  18  343  1.05E+06  0.00036015  0.00003087  0.0001029  0.0002058  0.0003087  0.00036015  1  1
P2ImageMeta  1  64  2870  6.72E+07  0.192864  0.0165312  0.055104  0.110208  0.165312  0.192864  1  1
P2PsfFits  1  34  183  1.51E+11  27.5415  2.3607  7.869  15.738  23.607  27.5415  0  0.33
P2ToObj  1  3  31  1.51E+11  4.6655  0.3999  1.333  2.666  3.999  4.6655  1  0.33
P2ToStack  1  2  15  1.51E+11  2.2575  0.1935  0.645  1.29  1.935  2.2575  0  0.33
StackDeltaAltFits  1  13  71  3.68E+09  0.260925  0.022365  0.07455  0.1491  0.22365  0.260925  0  0.33
StackHiSigDeltas  1  32  167  3.68E+10  6.13725  0.52605  1.7535  3.507  5.2605  6.13725  0  0.33
StackLowSigDelta  1  2  5000  1.65E+06  0.00825  0.000707  0.002357  0.004714  0.007071  0.00825  0  0.33
StackMeta  1  49  1551  700000  0.0010857  0.00032571  0.0010857  0.0010857  0.0010857  0.0010857  0  0.33
StackModelFits  1  131  535  7.50E+09  4.0125  0.3439  1.1464  2.2929  3.4393  4.0125  0  0.33
StackPsfFits  1  44  215  8.25E+10  17.7375  1.5204  5.0679  10.1357  15.2036  17.7375  0  0.33
StackToObj  1  4  39  8.25E+10  3.2175  0.2758  0.9193  1.8386  2.7579  3.2175  1  0.33
StationaryTransient  1  2  23  5.00E+08  0.0115  0.000986  0.003286  0.006571  0.009857  0.0115  1  0.33
Sum  69.7696  6.5497  21.8324  41.0073  60.1822  69.7696
Indices (+20%)  13.9539  1.3099  4.3665  8.2015  12.0364  13.9539
Total  83.7235  7.8597  26.1989  49.2088  72.2186  83.7235

Flag legend (as given on the slide): 0 means the table size is essentially the same for all data releases (primary filegroup); 1 means the table size will grow. 0 means full table; 1 means the table is partitioned and distributed across the cluster. Frac: fraction of the table contained on each partition.
Note: these estimates are for the whole of PS1, assuming 3.5 years. 7 bytes are added to each row for overhead, as suggested by Alex.
slide 29
PS1 Table Sizes - All Servers
Table Year 1 Year 2 Year 3 Year 3.5
Objects 4.63 4.63 4.61 4.59
StackPsfFits 5.08 10.16 15.20 17.76
StackToObj 1.84 3.68 5.56 6.46
StackModelFits 1.16 2.32 3.40 3.96
P2PsfFits 7.88 15.76 23.60 27.60
P2ToObj 2.65 5.31 8.00 9.35
Other Tables 3.41 6.94 10.52 12.67
Indexes +20% 5.33 9.76 14.18 16.48
Total 31.98 58.56 85.07 98.87
Sizes are in TB
slide 30
Data Storage – Test Queries
Drawn from several sources
• Initial set of SDSS 20 queries
• SDSS SkyServer Sample Queries
• Queries from PS scientists (Monet, Howell, Kaiser, Heasley)
Two objectives
• Find potential holes/issues in schema
• Serve as test queries
– Test DBMS integrity
– Test DBMS performance
Loaded into CasJobs (Query Manager) as sample queries for prototype
slide 31
Data Storage – DBMS
Microsoft SQL Server 2005
• Relational DBMS with excellent query optimizer
Plus
• Spherical/HTM (C# library + SQL glue)
– Spatial index (Hierarchical Triangular Mesh)
• Zones (SQL library)
– Alternate spatial decomposition with dec zones
• Many stored procedures and functions
– From coordinate conversions to neighbor search functions (an example search is sketched below)
• Self-extracting documentation (metadata) and diagnostics
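For flavor, a neighbor search using this style of function library looks like the sketch below; the function name follows the SDSS SkyServer convention (fGetNearbyObjEq) and the PS1 equivalent may be named differently:

  -- All objects within 1 arcmin of (ra, dec) = (185.0, -0.5), SDSS-style helper function
  SELECT o.objID, o.ra, o.[dec], n.distance
  FROM dbo.fGetNearbyObjEq(185.0, -0.5, 1.0) AS n
  JOIN Objects AS o ON o.objID = n.objID
  ORDER BY n.distance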
slide 32
Documentation and Diagnostics
slide 33
Data Storage – Scalable Architecture
Monolithic database design (à la SDSS) will not do it
SQL Server does not have a cluster implementation
• Do it by hand
Partitions vs Slices
• Partitions are file-groups on the same server
– Parallelize disk accesses on the same machine
• Slices are data partitions on separate servers
• We use both!
Additional slices can be added for scale-out
For PS1, use SQL Server Distributed Partitioned Views (DPVs)
slide 34
Distributed Partitioned Views
Difference between DPVs and file-group partitioning
• FGs on the same database
• DPVs on separate DBs
• FGs are for scale-up
• DPVs are for scale-out
Main server has a view of a partitioned table that includes remote partitions (we call them slices to distinguish them from FG partitions); a minimal sketch of such a view follows
Accomplished with SQL Server's linked server technology
NOT truly parallel, though
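A minimal sketch of what such a distributed partitioned view could look like, assuming slice databases Slice1DB and Slice2DB on linked servers S1 and S2 and a CHECK constraint on the objID range in each member table (names are illustrative, not the actual PS1 DDL):

  -- Head-node view gluing the per-slice detection tables together
  CREATE VIEW Detections
  AS
  SELECT * FROM S1.Slice1DB.dbo.Detections_p1   -- first objID range, enforced by a CHECK constraint
  UNION ALL
  SELECT * FROM S2.Slice2DB.dbo.Detections_p2   -- next objID range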
slide 35
Scalable Data Architecture
Shared-nothing architecture
Detections split across cluster
Objects replicated on Head and Slice DBs
DPVs of Detections tables on the Head node DB
Queries on Objects stay on head node
Queries on detections use only local data on slices
[Diagram: head node holds Objects and a Detections DPV; slices S1–S3 each hold their own Objects_S and Detections_S tables.]
slide 36
Hardware - Prototype
[Diagram: prototype hardware layout. Functions: LX (PS01) Linux staging, 10 TB; L1 (PS13) and L2/M (PS05) loading, 9 TB; Head (PS11), S1 (PS12), S2 (PS03), S3 (PS04) DB servers, 39 TB; W (PS02) web server and MyDB. RAID configurations: RAID5 for staging, RAID10 for the DB servers; disk/rack configs 14D/3.5W and 12D/4W. Server naming convention: LX = Linux, L = load server, S/Head = DB server, M = MyDB server, W = web server; PS0x = 4-core, PS1x = 8-core. Storage building blocks: 10A = 10 × [13 × 750 GB], 3B = 3 × [12 × 500 GB].]
slide 37
Hardware – PS1
[Diagram: ping-pong arrangement of Live (Copy 1), Offline (Copy 2) and Spare (Copy 3) database copies, with queries served by the live copies while ingest and replication run on the offline copy.]
Ping-pong configuration to maintain high availability and query performance
2 copies of each slice and of main (head) node database on fast hardware (hot spares)
3rd spare copy on slow hardware (can be just disk)
Updates/ingest on offline copy, then switch copies when ingest and replication finished
Synchronize second copy while first copy is online
Both copies live when no ingest
3x basic config. for PS1
slide 38
Detailed Design
Reuse SDSS software as much as possible
Data Transformation Layer (DX) – Interface to IPP
Data Loading Pipeline (DLP)
Data Storage (DS)
• Schema and Test Queries
• Database Management System
• Scalable Data Architecture
• Hardware
Query Manager (QM: CasJobs for prototype)
slide 39
Query Manager
Based on SDSS CasJobs
Configure to work with distributed database, DPVs
Direct links (contexts) to slices can be added later if necessary
Segregates quick queries from long ones
Saves query results server-side in MyDB
Gives users a powerful query workbench
Can be scaled out to meet any query load
PS1 Sample Queries available to users
PS1 Prototype QM demo
slide 40
ODM Prototype Components
Data Loading Pipeline
Data Storage
CasJobs
• Query Manager (QM)
• Web Based Interface (WBI)
Testing
slide 41
Spatial Queries (Alex)
slide 42
Spatial Searches in the ODM
slide 43
Common Spatial Questions
Points in region queries
1. Find all objects in this region
2. Find all "good" objects (not in masked areas)
3. Is this point in any of the regions?
Region in region
4. Find regions near this region and their area
5. Find all objects with error boxes intersecting region
6. What is the common part of these regions?
Various statistical operations
7. Find the object counts over a given region list
8. Cross-match these two catalogs in the region
slide 44
Sky Coordinates of Points
Many different coordinate systems
• Equatorial, Galactic, Ecliptic, Supergalactic
Longitude-latitude constraints
Searches often in a mix of different coordinate systems
• gb > 40 and dec between 10 and 20
• Problem: coordinate singularities, transformations
How can one describe constraints in an easy, uniform fashion?
How can one perform fast database queries in an easy fashion?
• Fast: indexes
• Easy: simple query expressions
slide 45
Describing Regions
Spacetime metadata for the VO (Arnold Rots)
Includes definitions of
• Constraint: single small or great circle
• Convex: intersection of constraints
• Region: union of convexes
Support both angles and Cartesian descriptions
Constructors for
• CIRCLE, RECTANGLE, POLYGON, CONVEX HULL
Boolean algebra (INTERSECTION, UNION, DIFF)
Proper language to describe the abstract regions
Similar to GIS, but much better suited for astronomy
slide 46
Things Can Get Complex
[Diagram: overlapping regions A and B.]
Green area: A (B − ε) should find B if it contains an A and is not masked.
Yellow area: A (B ± ε) is an edge case; it may find B if it contains an A.
slide 47
We Do Spatial 3 Ways
Hierarchical Triangular Mesh (extension to SQL)
• Uses table-valued functions
• Acts as a new "spatial access method"
Zones: fits SQL well
• Surprisingly simple & good
3D Constraints: a novel idea
• Algebra on regions, can be implemented in pure SQL
slide 48
PS1 Footprint
Using the projection cell definitions as centers for tessellation (T. Budavari)
slide 49
CrossMatch: Zone Approach
Divide space into declination zones
Objects ordered by zoneid, ra (on the sphere, need wrap-around margin)
Point search looks in neighboring zones within a ~(ra ± Δ) bounding box
All inside the relational engine
Avoids "impedance mismatch"
Can "batch" comparisons
Automatically parallel
Details in Maria's thesis
[Diagram: a zone of height zoneMax with a search point x and its ra ± Δ bounding box.]
slide 50
Indexing Using Quadtrees
Cover the sky with hierarchical pixels
COBE – start with a cube
Hierarchical Triangular Mesh (HTM) uses trixels
• Samet, Fekete
Start with an octahedron, and split each triangle into 4 children, down to 20 levels deep
Smallest triangles are 0.3"
Each trixel has a unique htmID
[Diagram: recursive subdivision of trixels, e.g. trixel 2 splits into 2,0–2,3 and trixel 2,3 into 2,3,0–2,3,3 (labels 20–22, 220–223).]
slide 51
Space-Filling Curve
[Diagram: space-filling curve over trixels 100–133.]
Triangles correspond to ranges; all points inside the triangle are inside the range.
Example ranges: [0.12, 0.13); [0.120, 0.121); [0.121, 0.122); [0.122, 0.123); [0.123, 0.130); [0.122, 0.130).
slide 52
SQL HTM Extension
Every object has a 20-deep htmID (44 bits)
Clustered index on htmID
Table-valued functions for spatial joins
• Given a region definition, routine returns up to 10 ranges of covering triangles
• Spatial query is mapped to ~10 range queries
Current implementation rewritten in C#
Excellent performance, little calling overhead
Three layers
• General geometry library
• HTM kernel
• IO (parsing + SQL interface)
slide 53
Writing Spatial SQL
-- region description is contained by @area
DECLARE @cover TABLE (htmStart bigint, htmEnd bigint)
INSERT @cover
SELECT * FROM dbo.fHtmCover(@area)        -- covering HTM ranges for the region

DECLARE @region TABLE (convexId bigint, x float, y float, z float, c float)
INSERT @region
SELECT * FROM dbo.fGetHalfSpaces(@area)   -- half-space (x, y, z, c) description of the region

SELECT o.ra, o.[dec], 1 AS flag, o.objid
FROM (SELECT objID AS objid, cx, cy, cz, ra, [dec]
      FROM Objects q
      JOIN @cover AS c ON q.htmID BETWEEN c.htmStart AND c.htmEnd) AS o
WHERE NOT EXISTS (                        -- keep objects inside every convex of the region
      SELECT p.convexId FROM @region AS p
      WHERE (o.cx*p.x + o.cy*p.y + o.cz*p.z < p.c)
      GROUP BY p.convexId)
slide 54
Status
All three libraries extensively tested
Zones used for Maria's thesis, plus various papers
New HTM code in production use since July on SDSS
Same code also used by STScI HLA, Galex
Systematic regression tests developed
Footprints computed for all major surveys
Complex mask computations done on SDSS
Loading: zones used for bulk crossmatch
Ad hoc queries: use HTM-based search functions
Excellent performance
slide 55
Prototype (Maria)
slide 56
PS1 PSPS Object Data Manager Design
PSPS Critical Design Review November 5-6, 2007
IfA
slide 57
Detail Design
General Concepts
Distributed Database Architecture
Ingest Workflow
Prototype
slide 58
Zones (spatial partitioning and indexing algorithm)
Partition and bin the data into declination zones
• ZoneID = floor ((dec + 90.0) / zoneHeight)
Few tricks required to handle spherical geometry
Place the data close on disk
• Clustered index on ZoneID and RA
Fully implemented in SQL
Efficient
• Nearby searches
• Cross-Match (especially)
Fundamental role in addressing the critical requirements
• Data volume management
• Association speed
• Spatial capabilities
[Diagram: the sky divided into declination (Dec) zones running along Right Ascension (RA).]
slide 59
Zoned Table
ObjID ZoneID* RA Dec CX CY CZ …
1 0 0.0 -90.0
2 20250 180.0 0.0
3 20250 181.0 0.0
4 40500 360.0 +90.0
* ZoneHeight = 8 arcsec in this example
ZoneID = floor ((dec + 90.0) / zoneHeight)
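A minimal sketch of how such a zoned table could be declared, with the zone computed from declination and the clustered index on (zoneID, ra) described above (the column set is trimmed and the names are illustrative, not the actual PS1 schema):

  -- zoneHeight here is 8 arcsec = 8/3600 deg, as in the example above
  CREATE TABLE ZonedObj (
      objID  bigint NOT NULL,
      ra     float  NOT NULL,
      [dec]  float  NOT NULL,
      cx     float  NOT NULL, cy float NOT NULL, cz float NOT NULL,
      zoneID AS CAST(FLOOR(([dec] + 90.0) / (8.0/3600.0)) AS int) PERSISTED
  )
  CREATE CLUSTERED INDEX ix_ZonedObj_zone ON ZonedObj (zoneID, ra)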
slide 60
SQL CrossNeighbors
SELECT *
FROM prObj1 z1
JOIN zoneZone ZZ ON ZZ.zoneID1 = z1.zoneID
JOIN prObj2 z2   ON ZZ.zoneID2 = z2.zoneID
WHERE z2.ra  BETWEEN z1.ra  - ZZ.alpha AND z1.ra  + ZZ.alpha
  AND z2.dec BETWEEN z1.dec - @r       AND z1.dec + @r
  AND (z1.cx*z2.cx + z1.cy*z2.cy + z1.cz*z2.cz) > cos(radians(@r))
slide 61
Good CPU Usage
slide 62
Partitions
SQL Server 2005 introduces technology to handle tables which are partitioned across different disk volumes and managed by a single server.
Partitioning makes management and access of large tables and indexes more efficient
• Enables parallel I/O
• Reduces the amount of data that needs to be accessed
• Related tables can be aligned and collocated in the same place, speeding up JOINs
slide 63
Partitions
2 key elements
• Partitioning function
– Specifies how the table or index is partitioned
• Partitioning scheme
– Using a partitioning function, the scheme specifies the placement of the partitions on file groups
Data can be managed very efficiently using Partition Switching
• Add a table as a partition to an existing table
• Switch a partition from one partitioned table to another
• Reassign a partition to form a single table
Main requirement
• The table must be constrained on the partitioning column
(a minimal DDL sketch follows)
slide 64
Partitions
For the PS1 design,
• Partitions mean File Group Partitions
• Tables are partitioned into ranges of ObjectID, which correspond to declination ranges
• ObjectID boundaries are selected so that each partition has a similar number of objects
slide 65
Distributed Partitioned Views
Tables participating in a Distributed Partitioned View (DPV) reside in different databases, which may live on different instances or on different (linked) servers
slide 66
Concept: Slices
In the PS1 design, the bigger tables will be partitioned across servers
To avoid confusion with the File Group Partitioning, we call them “Slices”
Data is glued together using Distributed Partitioned Views
The ODM will manage slices. Using slices improves system scalability.
For PS1 design, tables are sliced into ranges of ObjectID, which correspond to broad declination ranges. Each slice is subdivided into partitions that correspond to narrower declination ranges.
ObjectID boundaries are selected so that each slice has a similar number of objects.
slide 67
Detail Design Outline
General Concepts
Distributed Database Architecture
Ingest Workflow
Prototype
slide 68
PS1 Distributed DB System
[Diagram: the PS1 database (PartitionsMap, Objects, LnkToObj, Meta, and a Detections view) linked to slice databases P1..Pm ([Objects_p], [LnkToObj_p], [Detections_p], Meta), and the loading side (LoadAdmin plus LoadSupport1..n with objZoneIndx, orphans_l, Detections_l, LnkToObj_l, PartitionsMap), all connected through linked servers; the Query Manager (QM) and Web Based Interface (WBI) sit on top. Legend: database, full table, [partitioned table], output table, partitioned view.]
slide 69
Design Decisions: ObjID
Objects have their positional information encoded in their objID
• fGetPanObjID (ra, dec, zoneH) – see the sketch below
• ZoneID is the most significant part of the ID
It gives scalability, performance, and spatial functionality
Object tables are range partitioned according to their object ID
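Using the numbers worked out on the next slide, a call to this function would look like the following sketch; the signature for fGetPanObjID is the one quoted above, while the surrounding code is illustrative:

  -- zoneH = 0.008333 deg; example values taken from slide 70
  DECLARE @objID bigint
  SET @objID = dbo.fGetPanObjID(101.287155, -16.71611583, 0.008333)
  -- expected: 087941012871550661, whose leading digits (08794) carry the ZoneID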
slide 70
ObjectID Clusters Data Spatially
ObjectID = 087941012871550661
Dec = –16.71611583 ZH = 0.008333
ZID = (Dec+90) / ZH = 08794.0661
RA = 101.287155
ObjectID is unique when objects are separated by >0.0043 arcsec
slide 71
Design Decisions: DetectID
Detections have their positional information encoded in the detection identifier
• fGetDetectID (dec, observationID, runningID, zoneH)
• Primary key (objID, detectionID), to align detections with objects within partitions (a sketch follows)
• Provides efficient access to all detections associated to one object
• Provides efficient access to all detections of nearby objects
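A minimal sketch of what that alignment implies for a per-slice detections table; the table name and columns are illustrative, and psObjID is the hypothetical partition scheme sketched earlier, not actual PS1 DDL:

  -- Composite key (objID, detectID) keeps a detection in the same partition as its object
  CREATE TABLE Detections_p1 (
      objID    bigint NOT NULL,
      detectID bigint NOT NULL,
      ra float NOT NULL, [dec] float NOT NULL,
      CONSTRAINT pk_Detections_p1 PRIMARY KEY (objID, detectID)
  ) ON psObjID (objID)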
slide 72
DetectionID Clusters Data in Zones
DetectID = 0879410500001234567
Dec = –16.71611583 ZH = 0.008333
ZID = (Dec+90) / ZH = 08794.0661
ObservationID = 1050000
Running ID = 1234567
slide 73
ODM Capacity
5.3.1.3 The PS1 ODM shall be able to ingest into the ODM a total of
• 1.5 × 10^11 P2 detections
• 8.3 × 10^10 cumulative sky (stack) detections
• 5.5 × 10^9 celestial objects
together with their linkages.
slide 74
PS1 Table Sizes - Monolithic
Table Year 1 Year 2 Year 3 Year 3.5
Objects 2.31 2.31 2.31 2.31
StackPsfFits 5.07 10.16 15.20 17.74
StackToObj 0.92 1.84 2.76 3.22
StackModelFits 1.15 2.29 3.44 4.01
P2PsfFits 7.87 15.74 23.61 27.54
P2ToObj 1.33 2.67 4.00 4.67
Other Tables 3.19 6.03 8.87 10.29
Indexes +20% 4.37 8.21 12.04 13.96
Total 26.21 49.24 72.23 83.74
Sizes are in TB
slide 75
What goes into the main Server
[Diagram: the PS1 head database keeps the full Objects, LnkToObj, Meta and PartitionsMap tables; slices P1..Pm are attached via linked servers. Legend: database, full table, [partitioned table], output table, distributed partitioned view.]
slide 76
What goes into slices
[Diagram: each slice database P1..Pm holds its partitioned [Objects_p], [LnkToObj_p] and [Detections_p] tables plus copies of PartitionsMap and Meta; the PS1 head database (PartitionsMap, Objects, LnkToObj, Meta) is connected via linked servers. Legend: database, full table, [partitioned table], output table, distributed partitioned view.]
slide 78
Duplication of Objects & LnkToObj
Objects are distributed across slices
Objects, P2ToObj, and StackToObj are duplicated in the slices to parallelize "inserts" & "updates"
Detections belong in their object's slice
Orphans belong to the slice where their position would allocate them
• Orphans near slices' boundaries will need special treatment
Objects keep their original object identifier
• Even though positional refinement might change their zoneID and therefore the most significant part of their identifier
slide 79
Glue = Distributed Views
[Diagram: the head database exposes a Detections distributed partitioned view over the [Detections_p1]..[Detections_pm] tables on the slices; PartitionsMap, Objects, LnkToObj and Meta are full tables on the head, with per-slice copies of PartitionsMap and Meta. Legend: database, full table, [partitioned table], output table, distributed partitioned view.]
slide 80
Partitioning in Main Server
Main server is partitioned (Objects) and collocated (LnkToObj) by objID
Slices are partitioned (Objects) and collocated (LnkToObj) by objID
[Diagram: Query Manager (QM) and Web Based Interface (WBI) on top of the PS1 database and slices P1..Pm over linked servers.]
slide 81
PS1 Table Sizes - Main Server
Table Year 1 Year 2 Year 3 Year 3.5
Objects 2.31 2.31 2.31 2.31
StackPsfFits
StackToObj 0.92 1.84 2.76 3.22
StackModelFits
P2PsfFits
P2ToObj 1.33 2.67 4.00 4.67
Other Tables 0.41 0.46 0.52 0.55
Indexes +20% 0.99 1.46 1.92 2.15
Total 5.96 8.74 11.51 12.90
Sizes are in TB
slide 82
PS1 Table Sizes - Each Slice
Number of slices: m=4, m=8, m=10, m=12
Table Year 1 Year 2 Year 3 Year 3.5
Objects 0.58 0.29 0.23 0.19
StackPsfFits 1.27 1.27 1.52 1.48
StackToObj 0.23 0.23 0.28 0.27
StackModelFits 0.29 0.29 0.34 0.33
P2PsfFits 1.97 1.97 2.36 2.30
P2ToObj 0.33 0.33 0.40 0.39
Other Tables 0.75 0.81 1.00 1.01
Indexes +20% 1.08 1.04 1.23 1.19
Total 6.50 6.23 7.36 7.16
Sizes are in TB
slide 83
PS1 Table Sizes - All Servers
Table Year 1 Year 2 Year 3 Year 3.5
Objects 4.63 4.63 4.61 4.59
StackPsfFits 5.08 10.16 15.20 17.76
StackToObj 1.84 3.68 5.56 6.46
StackModelFits 1.16 2.32 3.40 3.96
P2PsfFits 7.88 15.76 23.60 27.60
P2ToObj 2.65 5.31 8.00 9.35
Other Tables 3.41 6.94 10.52 12.67
Indexes +20% 5.33 9.76 14.18 16.48
Total 31.98 58.56 85.07 98.87
Sizes are in TB
slide 84
Detail Design Outline
General Concepts
Distributed Database Architecture
Ingest Workflow
Prototype
slide 85
PS1 Distributed DB System
[Diagram: the same distributed DB system as slide 68 – the PS1 head database, slices P1..Pm, the LoadAdmin and LoadSupport1..n loading databases, and the Query Manager and Web Based Interface, all over linked servers.]
slide 86
“Insert” & “Update”
SQL Insert and Update are expensive operations due to logging and re-indexing
In the PS1 design, Insert and Update have been re-factored into sequences of:
Merge + Constrain + Switch Partition (a rough sketch follows)
Frequency
• f1: daily
• f2: at least monthly
• f3: TBD (likely to be every 6 months)
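A rough T-SQL sketch of the Merge + Constrain + Switch pattern for one partition; the table names, objID range and partition number are placeholders, not the production procedure:

  -- 1. Merge: build the new partition contents outside the live table
  SELECT * INTO Detections_merged
  FROM Detections_p1
  UNION ALL
  SELECT * FROM Detections_new_batch

  -- 2. Constrain: add the CHECK constraint the partition boundary requires
  ALTER TABLE Detections_merged
      ADD CONSTRAINT chk_range CHECK (objID >= 10000000000 AND objID < 20000000000)

  -- 3. Switch: swap the merged table in as the live partition (metadata-only;
  --    the old partition must have been switched out first and indexes must match)
  ALTER TABLE Detections_merged SWITCH TO Detections_p1 PARTITION 2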
slide 87
Ingest Workflow
[Diagram: ingest workflow. CSV detections (CSV Detect) are zoned (DZone) and cross-matched against the zoned Objects table (ObjectsZ) with a 1" match (X(1"), DXO_1a) and a 2" match of the remaining NoMatch detections (X(2"), DXO_2a); a Resolve step then produces the P2PsfFits, P2ToObj and Orphans tables.]
slide 88
Ingest @ frequency = f1
[Diagram: ingest at frequency f1 (daily). The LOADER produces ObjectsZ, P2ToObj_1, P2ToPsfFits_1 and Orphans_1; these feed the SLICE_1 partitions (Objects_1, P2ToObj_1, P2ToPsfFits_1, Orphans_1, Stack*_1, partitions 11–13) while the MAIN server keeps Metadata+, Objects, P2ToObj and StackToObj (partitions 1–3).]
slide 89
Updates @ frequency = f2
[Diagram: at frequency f2, Objects from the LOADER are applied to the MAIN server (Metadata+, Objects, P2ToObj, StackToObj) and to SLICE_1 (Objects_1, P2ToObj_1, P2ToPsfFits_1, Orphans_1, Stack*_1).]
slide 90
Updates @ frequency = f2
[Diagram: continuation of the f2 update flow – the refreshed Objects table is synchronized between the LOADER, SLICE_1 (Objects_1) and the MAIN server.]
slide 91
Snapshots @ frequency = f3
[Diagram: at frequency f3, a Snapshot of the Objects table is taken on the MAIN server (Metadata+, Objects, P2ToObj, StackToObj, partitions 1–3).]
slide 92
Batch Update of a Partition
[Diagram: batch update of a partition. The existing partitions A1, A2, A3 are merged with the incoming data (select into ... where) to build new tables B1, B2, B3 with primary-key indexes, which are then switched in to replace the old partitions (switch).]
slide 93
Scaling-out
Apply ping-pong strategy to satisfy query performance during ingest
2 x (1 main + m slices)
[Diagram: two copies of the PS1 head database and of every slice P1..Pm, with duplicate partitioned views, kept on linked servers behind the Query Manager (QM). Legend: database, duplicate database, full table, [partitioned table], partitioned view, duplicate partitioned view.]
slide 94
Scaling-out
More robustness, fault-tolerance, and reliability calls for
3 x (1 main + m slices)
[Diagram: three copies of the PS1 head database and of every slice P1..Pm on linked servers behind the Query Manager (QM). Legend: database, duplicate database, full table, [partitioned table], partitioned view, duplicate partitioned view.]
slide 95
Adding New slices
SQL Server range partitioning capabilities make it easy
• Recalculate partitioning limits
• Transfer data to new slices
• Remove data from slices
• Define and apply new partitioning schema
• Add new partitions to main server
• Apply new partitioning schema to main server
slide 96
Adding New Slices
slide 97
Detail Design Outline
General Concepts Distributed Database architecture Ingest Workflow Prototype
slide 99
ODM Ingest Performance
5.3.1.6 The PS1 ODM shall be able to ingest the data from the IPP at two times the nominal daily arrival rate*
* The nominal daily data rate from the IPP is defined as the total data volume to be ingested annually by the ODM divided by 365.
Nominal daily data rate:
• 1.5 × 10^11 / 3.5 / 365 = 1.2 × 10^8 P2 detections / day
• 8.3 × 10^10 / 3.5 / 365 = 6.5 × 10^7 stack detections / day
slide 100
Number of Objects
                 miniProto    myProto      Prototype    PS1
SDSS* Stars      5.7 x 10^4   1.3 x 10^7   1.1 x 10^8
SDSS* Galaxies   9.1 x 10^4   1.1 x 10^7   1.7 x 10^8
Galactic Plane   1.5 x 10^6   3 x 10^6     1.0 x 10^9
TOTAL            1.6 x 10^6   2.6 x 10^7   1.3 x 10^9   5.5 x 10^9
* "SDSS" includes a mirror of 11.3 < Dec < 30 objects to Dec < 0
Total GB of CSV loaded data: 300 GB
CSV bulk insert load: 8 MB/s
Binary bulk insert: 18-20 MB/s
Creation started: October 15th 2007; finished: October 29th 2007 (??)
Includes
• 10 epochs of P2PsfFits detections
• 1 epoch of Stack detections
slide 102
Prototype in Context
Survey               Objects      Detections
SDSS DR6             3.8 x 10^8
2MASS                4.7 x 10^8
USNO-B               1.0 x 10^9
Prototype            1.3 x 10^9   1.4 x 10^10
PS1 (end of survey)  5.5 x 10^9   2.3 x 10^11
slide 103
Size of Prototype Database
Table Main Slice1 Slice2 Slice3 Loader Total
Objects 1.30 0.43 0.43 0.43 1.30 3.89
StackPsfFits 6.49 6.49
StackToObj 6.49 6.49
StackModelFits 0.87 0.87
P2PsfFits 4.02 3.90 3.35 0.37 11.64
P2ToObj 4.02 3.90 3.35 0.12 11.39
Total 15.15 8.47 8.23 7.13 1.79 40.77
Extra Tables 0.87 4.89 4.77 4.22 6.86 21.61
Grand Total 16.02 13.36 13.00 11.35 8.65 62.38
Table sizes are in billions of rows
slide 104
Size of Prototype Database
Table Main Slice1 Slice2 Slice3 Loader Total
Objects 547.6 165.4 165.3 165.3 137.1 1180.6
StackPsfFits 841.5 841.6
StackToObj 300.9 300.9
StackModelFits 476.7 476.7
P2PsfFits 879.9 853.0 733.5 74.7 2541.1
P2ToObj 125.7 121.9 104.8 3.8 356.2
Total 2166.7 1171.0 1140.2 1003.6 215.6 5697.1
Extra Tables 207.9 987.1 960.2 840.7 957.3 3953.2
Allocated / Free 1878.0 1223.0 1300.0 1121.0 666.0 6188.0
Grand Total 4252.6 3381.1 3400.4 2965.3 1838.9 15838.3
9.6 TB of data in a distributed database
Table sizes are in GB
slide 105
Well-Balanced Partitions
Server Partition Rows Fraction Dec Range
Main 1 432,590,598 33.34% 32.59
Slice 1 1 144,199,105 11.11% 14.29
Slice 1 2 144,229,343 11.11% 9.39
Slice 1 3 144,162,150 11.12% 8.91
Main 2 432,456,511 33.33% 23.44
Slice 2 1 144,261,098 11.12% 8.46
Slice 2 2 144,073,972 11.10% 7.21
Slice 2 3 144,121,441 11.11% 7.77
Main 3 432,496,648 33.33% 81.98
Slice 3 1 144,270,093 11.12% 11.15
Slice 3 2 144,090,071 11.10% 14.72
Slice 3 3 144,136,484 11.11% 56.10
slide 106
Ingest and Association Times
Task                               Measured (minutes)
Create Detections Zone Table 39.62
X(0.2") 121M X 1.3B 65.25
Build #noMatches Table 1.50
X(1") 12k X 1.3B 0.65
Build #allMatches Table (121M) 6.58
Build Orphans Table 0.17
Create P2PsfFits Table 11.63
Create P2ToObj Table 14.00
Total of Measured Times 140.40
slide 107
Ingest and Association Times
Task                               Estimated (minutes)
Compute DetectionID, HTMID 30
Remove NULLS 15
Index P2PsfFits on ObjID 15
Slices Pulling Data from Loader 5
Resolve 1 Detection - N Objects 10
Total of Estimated Times 75
Educated Guess / Wild Guess
slide 108
Total Time to I/A daily Data
Task                               Time (hours)   Time (hours)
Ingest 121M Detections (binary) 0.32
Ingest 121M Detections (CSV) 0.98
Total of Measured Times 2.34 2.34
Total of Estimated Times 1.25 1.25
Total Time to I/A Daily Data 3.91 4.57
Requirement: Less than 12 hours (more than 2800 detections / s)
Detection Processing Rate: 8600 to 7400 detections / s
Margin on Requirement: 3.1 to 2.6
Using multiple loaders would improve performance
slide 109
Insert Time @ slices
Task                               Estimated (minutes)
Import P2PsfFits (binary out/in) 20.45
Import P2PsfFits (binary out/in) 2.68
Import Orphans 0.00
Merge P2PsfFits 58
Add constraint P2PsfFits 193
Merge P2ToObj 13
Add constraint P2ToObj 54
Total of Measured Times 362
6 h with 8 partitions/slice (~1.3 × 10^9 detections/partition)
Educated Guess
slide 110
Detections Per Partition
Years  Total Detections  Slices  Partitions per Slice  Total Partitions  Detections per Partition
0.0    0                  4       8                     32                0
1.0    4.29 x 10^10       4       8                     32                1.34 x 10^9
1.0    4.29 x 10^10       8       8                     64                6.7 x 10^8
2.0    8.57 x 10^10       8       8                     64                1.34 x 10^9
2.0    8.57 x 10^10      10       8                     80                1.07 x 10^9
3.0    1.29 x 10^11      10       8                     80                1.61 x 10^9
3.0    1.29 x 10^11      12       8                     96                1.34 x 10^9
3.5    1.50 x 10^11      12       8                     96                1.56 x 10^9
slide 111
Total Time for Insert @ slice
Task                               Time (hours)
Total of Measured Times 0.25
Total of Estimated Times 5.3
Total Time for daily insert 6
Daily insert may operate in parallel with daily ingest and association.
Requirement: Less than 12 hours
Margin on Requirement: 2.0
Using more slices will improve insert performance.
slide 112
Summary
Ingest + Association < 4 h using 1 loader (@f1 = daily)
• Scales with the number of servers
• Current margin on requirement 3.1
• Room for improvement
Detection Insert @ slices (@f1 = daily)
• 6 h with 8 partitions/slice
• It may happen in parallel with loading
Detections Lnks Insert @ main (@f2 < monthly)
• Unknown
• 6 h available
Objects insert & update @ slices (@f2 < monthly)
• Unknown
• 6 hours available
Objects update @ main server (@f2 < monthly)
• Unknown
• 12 h available. Transfer can be pipelined as soon as objects have been processed
slide 113
Risks
Estimates of Insert & Update at slices could be underestimated
• Need more empirical evaluation of exercising parallel I/O
Estimates and layout of disk storage could be underestimated
• Merges and indexes require 2x the data size
Hardware/Scalability (Jan)
slide 115
PS1 Prototype Systems Design
Jan Vandenberg, JHU
Early PS1 Prototype
slide 116
Engineering Systems to Support the Database Design
Sequential read performance is our life-blood. Virtually all science queries will be I/O-bound.
~70 TB raw data: 5.9 hours for a full scan on IBM's fastest 3.3 GB/s champagne-budget SAN
• Need a 20 GB/s I/O engine just to scan the full data in less than an hour. Can't touch this on a monolith.
Data mining a challenge even with good index coverage
• ~14 TB worth of indexes: 4-odd times bigger than SDSS DR6.
Hopeless if we rely on any bulk network transfers: must do work where the data is
Loading/ingest more CPU-bound, though we still need solid write performance
slide 117
Choosing I/O Systems
So killer sequential I/O performance is a key systems design goal. Which gear to use?
• FC/SAN?
• Vanilla SATA?
• SAS?
slide 118
Fibre Channel, SAN
Expensive but not-so-fast physical links (4 Gbit, 10 Gbit)
Expensive switch
Potentially very flexible
Industrial-strength manageability
Little control over RAID controller bottlenecks
slide 119
Straight SATA
Fast
Pretty cheap
Not so industrial-strength
slide 120
SAS
Fast: 12 Gbit/s FD building blocks
Nice and mature, stable
SCSI's not just for swanky drives anymore: takes SATA drives!
So we have a way to use SATA without all the "beige".
Pricey? $4400 for full 15x750GB system ($296/drive == close to Newegg media cost)
slide 121
SAS Performance, Gory Details
SAS v. SATA differences
[Chart: native SAS vs. SATA throughput (MB/s, 0–500) as a function of number of disks (1–7); roughly a 20% difference.]
slide 122
Per-Controller Performance
One controller can’t quite accommodate the throughput of an entire storage enclosure.
[Chart: controller limits – throughput (MB/s, up to ~1400) vs. number of disks (1–13) for one controller, against the 6 Gbit limit and the ideal scaling line.]
slide 123
Resulting PS1 Prototype I/O Topology
1100 MB/s single-threaded sequential reads per server
[Chart: aggregate design I/O performance – throughput (MB/s, up to ~1600) vs. number of disks (1–18) for one controller, dual controllers, the 6 Gbit limit, and ideal scaling.]
slide 124
RAID-5 v. RAID-10?
Primer, anyone?
RAID-5 perhaps feasible with contemporary controllers…
…but not a ton of redundancy
But after we add enough disks to meet performance goals, we have enough storage to run RAID-10 anyway!
slide 125
RAID-10 Performance
0.5*RAID-0 for single-threaded reads
RAID-0 perf for 2-user/2-thread workloads
0.5*RAID-0 writes
slide 126
PS1 Prototype Servers
[Diagram: prototype DB servers – head node H1 with slices S1, S2, S3 – and the prototype loader – Linux staging plus load servers L1 and L2.]
slide 127
PS1 Prototype Servers
PS1 Prototype
slide 128
PS1 Prototype Servers
slide 129
Projected PS1 Systems Design
[Diagram: projected PS1 systems design – a loader group (Linux staging, L1..LN) and four database groups (R1, R2, R3, G1), each with two head nodes (H1, H2) and eight slices (S1..S8).]
slide 130
Backup/Recovery/Replication Strategies
No formal backup
• …except maybe for mydb's, f(cost*policy)
3-way replication
• Replication != backup
– Little or no history (though we might have some point-in-time capabilities via metadata)
– Replicas can be a bit too cozy: must notice badness before replication propagates it
• Replicas provide redundancy and load balancing…
• Fully online: zero time to recover
• Replicas needed for happy production performance plus ingest, anyway
Off-site geoplex
• Provides continuity if we lose HI (local or trans-Pacific network outage, facilities outage)
• Could help balance trans-Pacific bandwidth needs (service continental traffic locally)
slide 131
Why No Traditional Backups?
Money no object… do traditional backups too!!!
Synergy, economy of scale with other collaboration needs (IPP?)… do traditional backups too!!!
Not super pricey…
…but not very useful relative to a replica for our purposes
• Time to recover
slide 132
Failure Scenarios (Easy Ones)
Zero downtime, little effort:
• Disks (common)
– Simple* hotswap
– Automatic rebuild from hotspare or replacement drive
• Power supplies (not uncommon)
– Simple* hotswap
• Fans (pretty common)
– Simple* hotswap
* Assuming sufficiently non-beige gear
slide 133
Failure Scenarios (Mostly Harmless Ones)
Some downtime and replica cutover:
• System board (rare)
• Memory (rare, and usually proactively detected and handled via scheduled maintenance)
• Disk controller (rare, potentially minimal downtime via cold-spare controller)
• CPU (not utterly uncommon, can be tough and time consuming to diagnose correctly)
slide 134
Failure Scenarios (Slightly Spooky Ones)
Database mangling by human or pipeline error
• Gotta catch this before replication propagates it everywhere
• Need lots of sanity checks before replicating
• (and so off-the-shelf near-realtime replication tools don't help us)
• Need to run replication backwards from older, healthy replicas. Probably less automated than healthy replication.
Catastrophic loss of datacenter
• Okay, we have the geoplex
– …but we're dangling by a single copy 'till recovery is complete
– …and this may be a while.
– …but are we still in trouble? Depending on colo scenarios, did we also lose the IPP and flatfile archive?
slide 135
Failure Scenarios (Nasty Ones)
Unrecoverable badness fully replicated before detection
Catastrophic loss of datacenter without geoplex
Can we ever catch back up with the data rate if we need to start over and rebuild with an ingest campaign? Don't bet on it!
slide 136
Operating Systems, DBMS?
SQL Server 2005 EE x64 (Win2003 EE x64)
• Why?
• Why not DB2, Oracle RAC, PostgreSQL, MySQL, <insert your favorite>?
Why EE? Because it's there. <indexed DPVs?>
Scientific Linux 4.x/5.x, or local favorite
Platform rant from JVV available over beers
slide 137
Systems/Database Management
Active Directory infrastructure
Windows patching tools, practices
Linux patching tools, practices
Monitoring
Staffing requirements
slide 138
Facilities/Infrastructure Projections for PS1
Power/cooling
• Prototype is 9.2 kW (2.6 tons AC)
• PS1: something like 43 kW, 12.1 tons
Rack space
• Prototype is 69 RU, <2 42U racks (includes 14U of rackmount UPS at JHU)
• PS1: about 310 RU (9-ish racks)
Networking: ~40 Gbit Ethernet ports
…plus sundry infrastructure, ideally already in place (domain controllers, monitoring systems, etc.)
slide 139
Operational Handoff to UofH
Gulp.
slide 140
How Design Meets Requirements
Cross-matching detections with objects
• Zone cross-match part of loading pipeline
• Already exceeded requirement with prototype
Query performance
• Ping-pong configuration for query during ingest
• Spatial indexing and distributed queries
• Query manager can be scaled out as necessary
Scalability
• Shared-nothing architecture
• Scale out as needed
• Beyond PS1 we will need truly parallel query plans
slide 141
WBS/Development Tasks
Refine Prototype/Schema
Staging/Transformation
Initial Load
Load/Resolve Detections
Resolve/Synchronize Objects
Create Snapshot
Replication Module
Query Processing
• Workflow Systems
• Logging
• Data Scrubbing
• SSIS (?) + C#
• QM/Logging
Hardware
Documentation
Testing
Redistribute Data
Effort per task (person-months): 2, 3, 1, 3, 3, 1, 2, 2, 2, 2, 4, 4, 4, 2
Total Effort: 35 PM
Delivery: 9/2008
slide 142
Personnel Available
2 new hires (SW Engineers) 100%
Maria 80%
Ani 20%
Jan 10%
Alainna 15%
Nolan Li 25%
Sam Carliles 25%
George Fekete 5%
Laszlo Dobos 50% (for 6 months)
slide 143
Issues/Risks
Versioning
• Do we need to preserve snapshots of monthly versions?
• How will users reproduce queries on subsequent versions?
• Is it ok that a new version of the sky replaces the previous one every month?
Backup/recovery
• Will we need 3 local copies rather than 2 for safety?
• Is restoring from offsite copy feasible?
Handoff to IfA beyond scope of WBS shown
• This will involve several PMs
Mahalo!
slide 145
Query Manager
[Screenshot: Query Manager query page. Callouts: the context that the query is executed in; the MyDB table that query results go into; the name that this query job is given; check query syntax; get graphical query plan; run query in quick (1-minute) mode; submit query to long (8-hour) queue; the query buffer; load one of the sample queries into the query buffer.]
slide 146
Query Manager
[Screenshot: stored procedure view, showing the stored procedure arguments and the SQL code for the stored procedure.]
slide 147
Query Manager
[Screenshot: MyDB page. MyDB context is the default, but other contexts can be selected; the space used and total space available are shown; multiple tables can be selected and dropped at once; the table list can be sorted by name, size, type; the user can browse DB views, tables, functions and procedures.]
slide 148
Query Manager
[Screenshot: table detail page, showing the query that created this table.]
slide 149
Query Manager
[Screenshot: cross-match/search form with the search radius, the table to hold results, and the context to run the search on.]