SciDB: Open Source Data Management System for Data-Intensive Scientific Analytics
Open Source Data Management System for Data-Intensive Scientific Analytics
Jacek Becla
San Diego Supercomputer Center
05/29/2009
Outline
• Needs, challenges, today’s solutions and emerging trends
• SciDB design and planned features
• SciDB structure and timeline
Size Challenge
• Data set sizes grow dramatically
• Growth rate increases
• Implications
  – Failures are routine
  – Provenance tracking is a must
  – Massive parallelization is a must
  – Full automation, self-adjustment is a must
Analytics Complexity
• More data varieties = more ways to analyze it
• Rapid growth of complexity of analytics
  – Time series comparisons
  – N² and N³ correlations
  – Proximity and grouping-based searches
• Interactive exploration enables most science
• Data uncertainty matters
• Provenance is an integral part of analytics
• User annotations are important
• Ad-hoc integration of derived data with raw data desired
• True for science and industry
Today’s Technologies
• Existing databases
  – Most too monolithic
  – Expensive to scale
  – Expensive to provide high availability
  – Built for perfect schemas and clean data
  – Relational data model far from ideal for most projects
  – APIs far from ideal
    • Desired intuitive interfaces
• Most very large systems shy away from databases
Today’s Solutions
• Metadata in lightweight database plus bulk data in files
  – BaBar, LHC, LCLS
• Bulk data stored as unstructured data in database
  – NIF
• Raw data in files, derived data in database
  – PanSTARRS, LSST (future projects)
• Complete (or mostly) home-grown systems
  – AT&T, Google, Yahoo, Amazon, Facebook
  – Most common solution
• All in database
  – WalMart (very expensive)
  – eBay (very expensive, testing new home-grown solution)
  – SDSS, bio, genomics (small-ish, single-server databases)
• Little reusing, roll-your-own mentality
Future
• Emerging trends
  – Shared nothing parallel database
  – Lightweight, specialized components
  – On low-cost commodity hardware
  – Aggressive compression
• Several attempts to push state-of-the-art forward
  – Aster Data, Vertica, ParAccel, EnterpriseDB, Greenplum, Netezza
• Some issues not addressed by anyone
  – Arrays, provenance, uncertainty, partial results, intuitive interfaces
XLDB Activities
• 2007
• Identify trends, roadblocks
• Bridge the gaps
• 2008
• Complex analytics
• Bridge the gaps
• 2009
• Reach out to non-US communities
• Connect with remaining sciences
New Open Source Science Database System
• Philosophy
  – address common scientific needs
  – geared for analytics, not OLTP
• Key requirements
  – open source
  – commercial quality
  – peta-scale
Data Model - Types
• Scalars
  – standard base types (int, float, string, date, …)
  – geospatial (3-D points, lines, polygons, boxes)
• Multi-d arrays
  – regular or irregular
  – any number of dimensions
  – nesting allowed
  – dense or sparse
Data Model - Operators
• Native (built-in)
  – array-sql (filter, project, group_by, aggregation, …)
  – array (pivot, regridding, reshaping, transformations, nest, flatten, …)
• User-defined functions
  – Postgres-style
  – coded in C++
• Native operators coded as UDFs
• All UDFs treated equally
  – optimizer might do more with built-in UDFs
• Two kinds: per cell, per array
• All UDFs executable in parallel
• Paradigm: primitives for data-heavy compute
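
The two kinds of UDF listed above can be illustrated with a minimal Python sketch. This is not SciDB code (the slides say UDFs are written in C++); the names `apply_per_cell` and `regrid_mean` are hypothetical, and plain nested lists stand in for a 2-D array.

```python
from typing import Callable, List

Array2D = List[List[float]]


def apply_per_cell(array: Array2D, udf: Callable[[float], float]) -> Array2D:
    """Per-cell UDF: each cell is transformed independently of its neighbours."""
    return [[udf(cell) for cell in row] for row in array]


def regrid_mean(array: Array2D, factor: int) -> Array2D:
    """Per-array operator: regrid by averaging factor x factor blocks."""
    rows, cols = len(array), len(array[0])
    if rows % factor or cols % factor:
        raise ValueError("array shape must be a multiple of the regrid factor")
    result = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            # Collect every cell of the current block, then average it.
            block = [array[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) / len(block))
        result.append(out_row)
    return result
```

Because a per-cell UDF touches one cell at a time and a block-wise regrid touches one block at a time, both kinds of work can in principle be split across chunks and run in parallel.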
Data Model – Match To Science Needs
• astronomy
• earth and environmental sciences, including oceanography, remote sensing, seismology
• bio-medical imaging
• fusion
• bio (need sequences)
• chemistry (need network structures)
Query Language
• “Parse-tree” representation of operations
• “Bindings” to C++, Python, IDL, ... (TBD)
• Tight integration with popular statistical tools like R or MATLAB
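
A “parse-tree” representation could look roughly like the following Python sketch. The `Node` class, the operator names (`scan`, `filter`, `sum`) and the `evaluate` function are assumptions made for illustration; the slides do not specify the actual bindings, which are marked TBD.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Node:
    """One operator in a query tree (hypothetical structure)."""
    op: str
    children: Tuple["Node", ...] = ()
    arg: Any = None


def evaluate(node: Node, arrays: Dict[str, List[float]]) -> Any:
    """Walk the tree bottom-up and apply each operator to its child's result."""
    if node.op == "scan":
        return arrays[node.arg]           # arg is the array name
    if node.op == "filter":
        source = evaluate(node.children[0], arrays)
        return [x for x in source if node.arg(x)]  # arg is a predicate
    if node.op == "sum":
        return sum(evaluate(node.children[0], arrays))
    raise ValueError(f"unknown operator: {node.op}")


# sum(filter(scan("flux"), x > 1)) expressed as a tree of operators.
tree = Node("sum", (Node("filter", (Node("scan", arg="flux"),),
                         arg=lambda x: x > 1),))
```

A client-side binding would only need to build such a tree and ship it to the server, which is one reason a tree form suits several host languages at once.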
Storage Model
• Granularity
  – “Chunked” arrays
    • Chunk = unit of storage, buffering and compression
  – Chunks partitioned across nodes
• Parallel model
  – Shared-nothing parallel DBMS
    • runs on a grid of computers, uniformity not required
  – Data exchanged between nodes as needed
• Format
  – Loaded or in-situ modes
    • in-situ: limited capabilities
  – Adaptors to translate external popular formats (like HDF5) on the fly
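
As a rough illustration of chunking and partitioning, the Python sketch below maps a cell to its chunk and assigns chunks to nodes round-robin. The round-robin placement and all function names are assumptions; the slides do not say how SciDB places chunks.

```python
from typing import Sequence, Tuple


def chunk_of(cell: Sequence[int], chunk_shape: Sequence[int]) -> Tuple[int, ...]:
    """Return the coordinates of the chunk that contains the given cell."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))


def linear_chunk_id(chunk: Sequence[int],
                    array_shape: Sequence[int],
                    chunk_shape: Sequence[int]) -> int:
    """Flatten chunk coordinates into a single row-major index."""
    # Ceiling division: number of chunks along each dimension.
    chunks_per_dim = [-(-n // s) for n, s in zip(array_shape, chunk_shape)]
    index = 0
    for coord, count in zip(chunk, chunks_per_dim):
        index = index * count + coord
    return index


def node_for_cell(cell: Sequence[int],
                  array_shape: Sequence[int],
                  chunk_shape: Sequence[int],
                  num_nodes: int) -> int:
    """Assumed placement policy: chunks are dealt to nodes round-robin."""
    chunk = chunk_of(cell, chunk_shape)
    return linear_chunk_id(chunk, array_shape, chunk_shape) % num_nodes
```

For a 100 x 100 array with 10 x 10 chunks on 4 nodes, cell (25, 37) falls in chunk (2, 3), which has linear index 23 and lands on node 3.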
Versioning
• No overwrite storage
• Named versions
• Delta compression
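
The three points above can be combined in a small Python sketch: nothing is overwritten, each version has a name, and each version stores only the cells that changed. The `VersionedArray` class is hypothetical and uses a dictionary of cells in place of real array storage.

```python
from typing import Dict, Hashable, List, Optional, Tuple

Cells = Dict[Hashable, float]


class VersionedArray:
    """No-overwrite store: a base version plus a list of named deltas."""

    def __init__(self, base_cells: Cells) -> None:
        self._versions: List[Tuple[str, Cells]] = [("base", dict(base_cells))]

    def read(self, name: Optional[str] = None) -> Cells:
        """Rebuild a version by replaying deltas; None means the latest."""
        state: Cells = {}
        for version_name, delta in self._versions:
            state.update(delta)
            if version_name == name:
                return state
        if name is None:
            return state
        raise KeyError(name)

    def commit(self, name: str, updates: Cells) -> int:
        """Append a named version; return how many cells the delta holds."""
        if any(existing == name for existing, _ in self._versions):
            raise ValueError(f"version already exists: {name}")
        current = self.read()
        # Keep only cells whose value actually differs from the latest state.
        delta = {cell: value for cell, value in updates.items()
                 if current.get(cell) != value}
        self._versions.append((name, delta))
        return len(delta)
```

Older versions stay readable after every commit, which is also what makes it possible to trace results back to the exact data they were computed from.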
Provenance
• Need:
  – what operations led to creation of given element
  – what operations used this element
  – what data elements were used as input to this operation
  – what data elements were created as output from this operation
• Natively supported
  – easy if workflow in SciDB
• Loading external provenance
• Efficient querying
• No-overwrite + delta compression helps
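
The four provenance questions listed above map onto four lookups over a log of operations. The Python sketch below is an assumption about how such a log might be organized, not SciDB's design; `ProvenanceLog` and its method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class ProvenanceLog:
    """Records which operation read and wrote which data elements."""

    def __init__(self) -> None:
        self._inputs: Dict[str, List[str]] = {}
        self._outputs: Dict[str, List[str]] = {}
        self._producers: Dict[str, List[str]] = defaultdict(list)
        self._consumers: Dict[str, List[str]] = defaultdict(list)

    def record(self, op: str, inputs: Iterable[str], outputs: Iterable[str]) -> None:
        self._inputs[op] = list(inputs)
        self._outputs[op] = list(outputs)
        for element in self._inputs[op]:
            self._consumers[element].append(op)
        for element in self._outputs[op]:
            self._producers[element].append(op)

    def used_by(self, element: str) -> List[str]:
        """Operations that used this element."""
        return list(self._consumers.get(element, []))

    def inputs_of(self, op: str) -> List[str]:
        """Elements used as input to this operation."""
        return list(self._inputs[op])

    def outputs_of(self, op: str) -> List[str]:
        """Elements created as output from this operation."""
        return list(self._outputs[op])

    def lineage(self, element: str) -> Set[str]:
        """All operations that led, directly or indirectly, to this element."""
        seen: Set[str] = set()
        stack = [element]
        while stack:
            current = stack.pop()
            for op in self._producers.get(current, []):
                if op not in seen:
                    seen.add(op)
                    stack.extend(self._inputs[op])
        return seen
```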
Uncertainty
• Error bars carried along in the computation
• Initial version
  – interval arithmetic
  – uniform error distribution
• More complex models usually science-specific
  – might consider implementing some in the future if enough commonalities
• Approximate results
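
Interval arithmetic, named above for the initial version, carries a lower and an upper bound through every operation. The Python sketch below shows the standard rules for addition, subtraction and multiplication; the `Interval` class and `from_error` helper are illustrative names, not part of SciDB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A value known only to lie between lo and hi."""
    lo: float
    hi: float

    @classmethod
    def from_error(cls, value: float, error: float) -> "Interval":
        """Build an interval from a measurement and its error bar."""
        return cls(value - error, value + error)

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other: "Interval") -> "Interval":
        # The smallest result pairs our low bound with the other's high bound.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other: "Interval") -> "Interval":
        # Signs may flip the ordering, so take min and max of all products.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))
```

The bounds only ever widen as operations accumulate, so the result is a safe but possibly pessimistic error bar.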
Resource Management
• Query scheduling
  – including shared scans (train scheduling)
• Query progress
• Support for long-running queries (cancel/stop/restart)
• Pre-execution query cost estimates
• Per user/query limits
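
The idea behind shared scans is that each chunk is read once and handed to every query that needs it. The Python sketch below is a simplified illustration in which all queries start together; real train scheduling would also let a query join a scan already in progress. `shared_scan` and the query format are assumptions.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple

# A query is a (step function, initial accumulator) pair.
Query = Tuple[Callable[[Any, List[float]], Any], Any]


def shared_scan(chunks: Iterable[List[float]],
                queries: Dict[str, Query]) -> Tuple[Dict[str, Any], int]:
    """Run all queries over a single pass; return results and chunks read."""
    state = {name: initial for name, (_, initial) in queries.items()}
    chunks_read = 0
    for chunk in chunks:
        chunks_read += 1  # one read, regardless of how many queries share it
        for name, (step, _) in queries.items():
            state[name] = step(state[name], chunk)
    return state, chunks_read


queries: Dict[str, Query] = {
    "sum": (lambda acc, chunk: acc + sum(chunk), 0),
    "max": (lambda acc, chunk: max(acc, max(chunk)), float("-inf")),
}
```

With two queries over three chunks, the data is read three times instead of six.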
Other Features
• High availability / automatic fail over
• Auto config and auto self-healing
Green Computing
• Aggressive compression → less disk
• Approximate results → stop computing early
• Shared scans → share I/O
• Scale out as you go → incremental provisioning
Science / Industry Needs
• Scale
• Complex analytics
  – time series, needle in haystack, group based
• Summary statistics @petascale
• Arrays
• Provenance
• Uncertainty
• Integration with statistical tools
… all needed by industry
Partnership - Roles
• Science and high-end commercial
  – provide input, including usecases
  – provide some resources
  – review design, test the product
• DBMS brain trust
  – design, oversee construction, perform research
• Non profit company
  – manage the project
  – support resulting system
Partnership – Current Players
• Science and high-end commercial
  – LSST, PNNL, UCSB, LLNL
  – eBay, Vertica, Microsoft
  – lighthouse customers: LSST and eBay
• DBMS brain trust
  – Michael Stonebraker, David DeWitt, Dave Maier, Jennifer Widom, Stan Zdonik, Sam Madden, Ugur Cetintemel, Magda Balazinska, Jignesh Patel
• Non profit company
  – SciDB, Inc. - 501(c)(3) foundation
• Plus… 5 developers working on 1st prototype
Have Usecases From
• astronomy (LSST)
• industry (eBay)
• genomics (LLNL)
• climate (PNNL/ARM)
• seismic (Emory Univ)
• environmental observation & modeling (Oregon Univ)
• earth remote sensing (UCSB)
• fusion (LLNL/NIF)
• WE NEED YOUR USECASES
Timeline
• Mid June ’09
  – professional-looking scidb.org
  – start building user community
• Late August ’09
  – planned demo at VLDB
  – reach out to non-US communities through XLDB3
• End of Q1 ’10 – alpha
• End of Q4 ’10 – beta
Manpower
• All work so far in-kind
• 4.5 FTEs working on demo
  – SLAC, MIT, UW, RAS
• Good chances to have funds available this FY to hire ~5 full time developers
• Actively looking for more partners
– GET INVOLVED
Summary
• Many commonalities within science and between science and industry
• Existing off-the-shelf technologies inefficient for very large scale analytics
• SciDB – new open source science DBMS
  – Community realizes shared software infrastructure is good
  – Big lighthouse customers
  – Strong team
  – If successful, will enable unprecedented analyses at extreme scale
Related Links
• http://scidb.org
• http://www-conf.slac.stanford.edu/xldb07
• http://www-conf.slac.stanford.edu/xldb08
• http://www-conf.slac.stanford.edu/xldb09