SciDB
Uploaded by oleg-tsarev
© Paradigm4 Inc. confidential
Topics
• The Big Complex Analytics Space
• SciDB Overview
• How we are different and why that matters
• Architecture
Note: We call our company P4 for short
Rich Data + Complex Analytics drive insights and innovative product offerings
● Tap new types and sources of data
– Location, genomic, behavioral, speech, sensors, images, …
● Integrate mixed data sources for novel insights
– Genomic/wearable sensors/EHRs/clinical/payer/provider
– Satellite images/smart grid data
– Location/weather/traffic/driving behavior
● Generate micro-segmented pricing & products
– Personalized insurance
– Precision medicine
– Precision warranties
– Behavioral targeting
– Location-based services
● Look at whole populations, big time windows, big regions
Where some of the ‘complex analytics’ problems are
Pharma, Biotech, AgroBusiness, Healthcare Informatics
• Next-gen sequencing analysis & GWAS
• Population studies
• Evidence-based outcome studies
• Pharmaco-economics
Insurance Analytics
• Personalized auto or workman’s comp insurance
• Catastrophe modeling and policy pricing
• Risk modeling for insurance exchanges
Industrial Analytics
• Precision warranty pricing & maintenance schedules
Call Centers
• Speech analytics
Energy
• Data from smart sensor grids
Digital Marketing
• Geo-targeting & other personalization strategies
• Recommendation engines
Financial Services
• Financial modeling, back testing, sensitivity testing
• Algorithmic trading
• Portfolio management & risk management
Scientific Research
• Astronomy, Climatology, High Energy Physics, et al.
‘Big Analytics’ covers two categories
• Big Volume + Simple Analytics
– Traditional Data Warehouses, RDBMSs
– Business analysts
– Count statistics, roll-ups, aggregates
• Big Volume + Complex Analytics (P4’s space)
– Emerging markets; new tools
– Data scientists / healthcare analysts / quants / operations researchers
– Multivariate statistics, clustering, SVD, machine learning, et al.
Why would industrial & commercial analytics applications benefit from yet another software platform?
• Sensor data, geospatial data, temporal data, genomic data & images are far more efficiently managed as multi-dimensional arrays than as relational tables
• Complex analytics should execute in place where the data resides and scale easily with additional nodes and cores
P4’s new ‘Complex Analytics’ database: scientific data management & analytics for the commercial world
Rich Data
Massively Scalable Math
Smart Data Management
P4 is well-matched for M2M data
Machine-generated data have inherent ordering & structure
• location data from cars and cell phones
• telematics data from sensors
• energy usage data from smart sensor grids
• genetic sequencing data
• patient telemetry data
• time series and longitudinal event data
• satellite images of the earth’s surface
Example timestamp: 2012-01-31 22:32:36.968000
A new ‘Complex Analytics’ database: scientific data management & analytics for the commercial & industrial worlds
• All-in-one next-generation database with data life cycle management and native, seamlessly integrated, scalable complex math operations
• n-dimensional array data model, optimal for temporal, geospatial, and machine-generated data
• Open Source
• Commodity HW grid or cloud
SciDB Features
• Distributed data storage: with redundancy/fault tolerance and high availability
• Scalable parallel operations: parallel linear algebra, aggregates, summaries, data loading
• ACID transactions
• Structured n-dimensional sparse array data model: defined by schema
• Expressive SQL-like query syntax: supports joins by array dimensions
• No-overwrite data versioning
• Extensible: user-defined types, functions, operators
Paradigm4 enables data-intensive research
• Capture: ingest, store, and manage data throughout its lifecycle
• Curation: save raw, corrected, pre-processed and derived analytic data, with metadata and provenance
• Curiosity: explore, drill down, filter, select
• Compute: complex math and modeling
• Collaboration: shared resource; no data silos with long, metadata-laden filenames
• Compliance: no-overwrite, versioned data storage supports reproducibility and validation of results
Why SciDB for scientists?
First class support for scientific data & scientific research
• Ingest, store, access, and manage data throughout its life cycle
• No overwrite database; historical versioning support
• Metadata: store curation and calibration information
• Extensibility (user defined types and operations)
• Save raw, corrected, pre-processed, and derived data
• Support for provenance
• Support reproducibility of results
• Share data across work groups and with outside organizations
P4’s native Array DB beats Relational DBs* on storage efficiency & complex computations
● Math functions run directly on native storage format
● Dramatic storage efficiencies as # of dimensions & attributes grows
– Architecture supports n dimensions
● Facilitates drill-down & clustering by like groups
● High performance for both sparse and dense data
– 10-100x faster than RDBMSs on array operations
Figure labels: 16 cells vs. 48 cells
* Applies to both row stores & column stores
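One reading of the 16-cell and 48-cell labels, offered here as an assumption because the figure itself did not survive extraction: a dense 4×4 array with one attribute stores 16 values, while the same data as a relational table with explicit (i, j, value) columns stores 16 rows × 3 columns = 48 values. A minimal Python sketch of that arithmetic follows; the helper names are hypothetical.

```python
def array_cells(shape, num_attributes=1):
    """Values stored when coordinates are implicit in the array layout."""
    cells = 1
    for extent in shape:
        cells *= extent
    return cells * num_attributes


def table_cells(shape, num_attributes=1):
    """Values stored when every row repeats its coordinates as columns."""
    rows = 1
    for extent in shape:
        rows *= extent
    return rows * (len(shape) + num_attributes)


shape = (4, 4)  # assumed: dense 4x4 array with a single attribute
print(array_cells(shape), table_cells(shape))
```

Under this model the gap widens as dimensions are added, since each extra dimension adds a coordinate column to every table row but nothing to the array's per-cell cost.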
Data exploration & analytics work better when the natural ordering of data is preserved
Clusters, temporal regions are stored together
Resample time or re-grid geospatial data at any resolution
Slice & drill-down in any n-dimensional region
Fast data selection for ad hoc queries
Efficient analytics over sub-regions & moving windows
Complex math underpins many use cases
Industrial Analytics
• Precision warranty pricing
• Proactive preventive maintenance
• Modeling & optimization
• Event monitoring in refineries and factories
Pharma Biotech Healthcare
• Next-gen Sequencing
• Population studies
• Outcome studies
• Precision medicine
Math
• covariance
• PCA, SVD
• cross validation
• bootstrapping
• cluster analysis
• linear/logistic regression
Complex math underpins many use cases
Computational Finance
• Back testing
• Sophisticated modeling
• Portfolio optimization
• Risk management
Math
• covariance
• PCA, SVD
• cross validation
• bootstrapping
• cluster analysis
• Monte Carlo methods
• linear/logistic regression
P4’s native math library supports distributed processing
• Task parallelism
– ‘Embarrassingly parallel’ tasks
– Process subpopulations in parallel
– Run simulations in parallel
• Massively scalable complex math
– ‘Non-embarrassingly’ parallel tasks like large scale linear algebra
– Math operations that pass intermediate data between nodes
– Challenging O(n^3) computations
– Math operations on data too large to fit on one node
• Large scale analytics without sampling
– Look at whole populations, big time windows, big regions
– Sample when you want to; not to fit analytics package constraints
– Use all the data: sometimes you really want the long tail or the black swan
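The task-parallel case above can be pictured with a small Python sketch. This is an illustration only, not SciDB code: threads stand in for cluster instances, and the subpopulation names and values are made up.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical subpopulations: group name -> measurements.
subpopulations = {
    "group_a": [1.0, 2.0, 3.0],
    "group_b": [10.0, 20.0],
    "group_c": [5.0],
}


def summarize(item):
    """Independent per-group work: no data is exchanged between tasks."""
    name, values = item
    return name, mean(values)


def summarize_all(groups, workers=4):
    # Each group is handled by its own task; results are merged at the end.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(summarize, groups.items()))


print(summarize_all(subpopulations))
```

The ‘non-embarrassingly’ parallel case differs in that tasks must exchange intermediate results, as in a distributed matrix multiply, so it does not fit this simple map-and-merge shape.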
Query language seamlessly integrates data manipulation & math
Array Query Language (AQL)
• Declarative SQL-like language extended for working with array data
• Large-scale math operations embedded in queries
Extensible
• Add user-defined types and functions
• R, Python, and other client interfaces
Compute the log odds ratio for a failure model using logistic regression
SELECT * FROM LOGISTREGR (model_matrix, success_count, failure_count, 'coefficients')
Linear Algebra as Building Block
Mathematical and Data Manipulation Operations
multiply ( transpose ( Simple_Array ), Simple_Array );
regrid( Simple_Array, 10, 10, avg (v2) );
cumsum ( filter ( Simple_Array, v1 = 'Odd' ), I, v1 );
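To make the first two operator calls concrete, here is a plain Python sketch of what they are assumed to compute on a small in-memory matrix: a Gram matrix (transpose times original) and a regrid that averages fixed-size tiles. It is not SciDB code; the function names merely mirror the operators, and the tile-averaging reading of regrid is an assumption.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def multiply(a, b):
    """Plain O(n^3) matrix product of two row-major nested lists."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def regrid_avg(m, block_rows, block_cols):
    """Average each block_rows x block_cols tile into one output cell."""
    out = []
    for r in range(0, len(m), block_rows):
        out_row = []
        for c in range(0, len(m[0]), block_cols):
            tile = [v for row in m[r:r + block_rows] for v in row[c:c + block_cols]]
            out_row.append(sum(tile) / len(tile))
        out.append(out_row)
    return out


a = [[1, 2], [3, 4]]
gram = multiply(transpose(a), a)                         # transpose(a) times a
coarse = regrid_avg([[1, 2, 3, 4], [5, 6, 7, 8]], 2, 2)  # 2x4 -> 1x2
print(gram, coarse)
```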
Flexible Schema
• Ad hoc queries
– Don’t have to know a priori what questions you will want to ask of your data
• Change schema dynamically
– Values <=> dimensions
• Supports transparent data exploration and mining
Well-suited for storing, accessing, & analyzing images
• Satellite images
• Healthcare images
• GIS data
Store metadata with the data
• Instrument id & calibration data
• Experimental conditions and variables
• Data set identifiers & comments
SciDB support for images
• Regrid operator
– Change resolution and coordinate systems
• Overlap
– Supports feature detection when features fall between nodes
• Support for multi-dimensional window operations
– Spatial averaging
• Non-integer dimensions
– Access image through spatio-temporal coordinate systems
– Astronomy (right ascension, declination)
– Remote sensing (lat, long, time)
SciDB array model: create array
CREATE ARRAY RGB < red : int16, green : int16, blue : int16 > [ longitude(double) = *, 10000, 0, latitude(double) = *, 10000, 0 ];
Attributes: red, green, blue
Dimensions: longitude, latitude
Dimension size: * indicates unbounded
Chunk size: 10000
Chunk overlap: 0
[SciDB] Scalable data management
Figure: four SciDB instances; instance 1 is the coordinator and instances 2, 3, and 4 are workers
Soft scalability test on automotive telematics and location data
• This graph shows how performance scales when both the data volume and the number of instances are increased together
• Query computes a score for each driver based on how many other vehicles were driving at the same time, in the same areas as the driver
• If data is perfectly distributed and if all operations in a query are perfectly parallelizable, the graph should be a 0 slope line
Figure: execution time relative to 1X (y-axis) vs. scale factor (x-axis)
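The quantity plotted on the graph's vertical axis can be sketched as follows. The timings are hypothetical placeholders, not values read from the slide; with perfect distribution and parallelism every ratio would equal 1.0, giving the zero-slope line described above.

```python
def relative_times(timings):
    """Divide each run's time by the 1X run's time.

    timings maps scale factor -> seconds, where data volume and instance
    count were both multiplied by that factor.
    """
    base = timings[1]
    return {scale: seconds / base for scale, seconds in sorted(timings.items())}


# Hypothetical measurements, not taken from the slide's graph.
timings = {1: 100.0, 2: 104.0, 4: 110.0}
print(relative_times(timings))
```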
New Data Window operator
• Computes aggregates over rolling one-dimensional windows
– skipping over empty array cells
– particularly useful for analysis of time series events that happen at varying frequencies
• Data window accepts an input array, a dimension name, number of preceding values, number of following values, and a list of aggregate calls
data_window (input_array, dim_name, num_preceding, num_following, aggregate1(attribute1), aggregate2(attribute2)...)
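The windowing behavior described above can be approximated in a few lines of Python. This is a sketch of the semantics as read from the slide, for a single sparse dimension held in a dictionary; it is not the operator's implementation, and the event data is invented.

```python
def data_window(cells, num_preceding, num_following, aggregate):
    """Rolling aggregate over the non-empty cells of a sparse 1-D array.

    cells maps a dimension coordinate to a value; missing coordinates are
    empty and are skipped rather than counted toward the window.
    """
    coords = sorted(cells)
    result = {}
    for idx, coord in enumerate(coords):
        lo = max(0, idx - num_preceding)
        hi = min(len(coords), idx + num_following + 1)
        result[coord] = aggregate([cells[c] for c in coords[lo:hi]])
    return result


# Events at irregular times (hypothetical data).
events = {0: 1.0, 7: 2.0, 8: 4.0, 30: 8.0}
print(data_window(events, 1, 0, sum))
```

Because the window counts non-empty cells rather than coordinate distance, the event at time 30 is still paired with its predecessor at time 8, which is what makes this form useful for events at varying frequencies.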
Analyzing event data
• Event hot spots
– Look at which specific sets of locations (at the lat-long level) have the most hard acceleration and hard braking events (count or volume normalized metric)
– Profile hot spots by day of week, time of day
• Event windows
– Look at a 30 second window before and after each hard braking and hard acceleration event
– Look for patterns to predict adverse events or profile drivers
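Both analyses can be sketched in plain Python on toy data. The function names, the two-decimal lat-long rounding, and the sample values are all assumptions made for illustration.

```python
from collections import Counter


def hot_spots(event_locations, precision=2):
    """Count events per lat-long cell after rounding coordinates."""
    return Counter(
        (round(lat, precision), round(lon, precision))
        for lat, lon in event_locations
    )


def event_windows(samples, event_times, half_width=30):
    """Collect (timestamp, value) samples within half_width seconds of each event."""
    return {
        t: [(ts, v) for ts, v in samples if t - half_width <= ts <= t + half_width]
        for t in event_times
    }


# Hypothetical hard-braking locations and a speed trace sampled every 20 s.
braking = [(42.361, -71.057), (42.362, -71.058), (40.0, -70.0)]
trace = [(0, 50), (20, 52), (40, 55), (60, 20), (80, 22), (100, 30)]

print(hot_spots(braking).most_common(1))
print(event_windows(trace, [60]))
```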
Manage data throughout its life cycle
• Data is never overwritten
– Preserve raw data, corrected data, and updates in the database
– Facilitates reproducibility, audits, compliance
– Supports model development and testing: what-if modeling, scenario testing, back-testing, sensitivity testing
• Updates are versioned
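A toy Python model of the no-overwrite idea: every store call appends a new version and earlier versions stay readable. The class and its methods are hypothetical, and it keeps full snapshots purely for simplicity; nothing here describes how SciDB lays out versions on disk.

```python
class VersionedArray:
    """Append-only store: every update creates a new version."""

    def __init__(self):
        self._versions = []  # list of dict snapshots, oldest first

    def store(self, cells):
        latest = dict(self._versions[-1]) if self._versions else {}
        latest.update(cells)
        self._versions.append(latest)
        return len(self._versions)  # 1-based version number

    def read(self, version=None):
        if version is None:
            version = len(self._versions)
        return dict(self._versions[version - 1])


arr = VersionedArray()
arr.store({(0, 0): 1.0})  # version 1: raw data
arr.store({(0, 0): 1.5})  # version 2: corrected value
print(arr.read(1), arr.read())
```

Reading version 1 after the correction still returns the raw value, which is the property that supports audits and reproducible results.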
Client Interfaces
• iquery interactive command line query interface
• Python, C++, R clients
• GUI (forms) interface coming
• Open source client API: roll your own!
What about Hadoop?
• Hadoop alone is not a DBMS
– No indexes, updates, data consistency, metadata
• Modules (Hadoop, Pig, Hive, HBase, HDFS) are loosely integrated and require a lot of glue code
• Requires skilled development staff to write custom code and maintain clusters
• Slower than a real parallel distributed database, so it needs more HW
• Linear algebra operators are hard to implement as a map and a reduce
• See Stonebraker, Kepner CACM blog post: Possible Hadoop Trajectories http://cacm.acm.org/blogs/blog-cacm/149074-possible-hadoop-trajectories
What about NoSQL like MongoDB?
Great for some use cases: match the tool to your requirements
• NoSQL and XML-based systems bake ‘schema’ into the application code or the records themselves
• NoSQL is most easily defined by what it excludes
– No schemas
– No query language
– Lacks the automatic data integrity of ACID databases
– No support for joins, which are useful when working with multiple data sources
– Requires coding to walk the data structures to manage data and extract information
– Harder to collaborate and share data across groups
– More custom code than a DB means potential longer-term maintenance and data archiving issues
• Paradigm4 offers the flexibility of object-oriented data schemas without sacrificing ACID database integrity or ad hoc query support
SciDB and Paradigm4
• SciDB is a global, open source community
– Scientists from many fields & computer scientists
– www.scidb.org
• Paradigm4, a commercial company, sponsors & manages SciDB
– Doing all the initial development for SciDB
– Sells and supports a commercial-quality release of SciDB
– Along with enterprise management tools (e.g. provisioning, security, recovery)
– And industry-specific add-ons
– www.paradigm4.com
Get more from your analytical database
• Power, Productivity & Performance
– Less coding
– Less data movement
– Transparent scale-up & speed-up
– Prototypes scale to production without rewriting
– Lower cost deployment
• Highly pedigreed technical team
– CTO is Mike Stonebraker, renowned database researcher & entrepreneur
• Ready to work with early adopters
Big Complex Analytics combines data sources for novel insights & products
Automotive Telematics Healthcare Informatics
Big Complex Analytics powers population studies
> 70K tissue samples
> 65K gene probes per sample
covariance, clustering, SVD
> 10 million cars
GPS & driving data every sec
insurance by the trip & how you drive
linear regressions, risk & pricing modeling
Architecture
• ‘Shared Nothing’ cluster of commodity hardware nodes
• Interconnected with standard Ethernet and TCP/IP
SciDB Array Schema
CREATE ARRAY Simple_Array < v1 : double, v2 : int64, v3 : string > [ I = 0:*, 5, 0, J = 0:9, 5, 0 ];
Attributes: v1, v2, v3
Dimensions: I, J
Dimension size: * is unbounded
Chunk size: 5
Chunk overlap: 0
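To show what a chunk size of 5 means for Simple_Array, the following Python sketch maps a cell's coordinates to the origin of the chunk that contains it. It assumes chunks form a regular grid starting at each dimension's lower bound and ignores overlap; the helper name is hypothetical.

```python
def chunk_origin(coords, starts, chunk_sizes):
    """Return the lowest coordinate of the chunk that contains coords."""
    return tuple(
        start + ((c - start) // size) * size
        for c, start, size in zip(coords, starts, chunk_sizes)
    )


# Simple_Array from the slide: I = 0:*, chunk 5; J = 0:9, chunk 5.
starts = (0, 0)
chunk_sizes = (5, 5)
print(chunk_origin((7, 3), starts, chunk_sizes))   # cell I=7, J=3
print(chunk_origin((12, 9), starts, chunk_sizes))  # cell I=12, J=9
```

With these settings each chunk holds up to 5 × 5 = 25 cells, and because I is unbounded, new chunks are simply added as I grows.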
SciDB array model: data types
• Whole numbers: int8, int16, int32, int64
• Unsigned whole numbers: uint8, ..., uint64
• Date and Time: datetime
• Date and Time with timezone: datetimez
• Floating point: float, double
• Boolean: bool
• Character: char
• Variable-length strings: string
SciDB array model: Storage
• SciDB stores every attribute separately
• Good compression:
– RLE
– zlib
• Parallel processing
SciDB array model: 1D-array
• Chunk: unit of data processing
• Chunk should fit in memory entirely
• User chooses chunk size
SciDB array model: bitmap
• SciDB describes EMPTY values using a bitmap
• The bitmap is compressed efficiently with RLE
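Run-length encoding suits an emptiness bitmap because sparse data tends to produce long runs of identical bits. A generic Python sketch of RLE follows; it shows the technique only and is not SciDB's on-disk format.

```python
def rle_encode(bits):
    """Run-length encode a sequence as [value, run_length] pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return runs


def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


# 1 = cell present, 0 = EMPTY (hypothetical sparse chunk).
bitmap = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
encoded = rle_encode(bitmap)
print(encoded)
```

Ten bits collapse to three runs here; the saving grows with the length of the empty stretches.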
SciDB array model: 2D-array
• Stride-major-order of chunks
SciDB array model: 2D-chunk
SciDB array model: clustering
• Several available chunk distributions:
– Round-Robin (default)
– Replication
• Optimizer splits queries into stages
• Every stage is processed in parallel
• Scatter/Gather (SG) redistributes intermediate results after every stage according to requirements
• Overlap helps decrease SG size (!)
• NO single point of failure
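Round-robin placement can be illustrated with a short Python sketch. The modulo rule below is an assumption chosen for clarity; the slide does not say how SciDB orders or hashes chunks before assigning them to instances.

```python
def round_robin(chunk_ids, num_instances):
    """Assign the chunk at position k to instance k mod num_instances."""
    placement = {i: [] for i in range(num_instances)}
    for position, chunk in enumerate(chunk_ids):
        placement[position % num_instances].append(chunk)
    return placement


chunks = [f"chunk_{n}" for n in range(6)]
print(round_robin(chunks, 4))
```

Spreading consecutive chunks over different instances lets each query stage work on all instances at once.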
SciDB array model: redundancy
• --redundancy=X
– Every chunk is replicated X times
• Single copy on every node
• Redundant chunks are used only when a node becomes unavailable
• We protect against network and disk failures
• Use RAID to protect against disk failures
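One way to picture the redundancy bullets is the Python sketch below. It assumes that X means X extra copies, that no node holds two copies of the same chunk, and that replicas go on the next nodes in ring order; the placement rule and function names are illustrative guesses, not SciDB's documented policy.

```python
def place_replicas(chunk_index, num_nodes, redundancy):
    """Return the primary node and `redundancy` distinct replica nodes."""
    if redundancy >= num_nodes:
        raise ValueError("need more nodes than extra copies")
    primary = chunk_index % num_nodes
    replicas = [(primary + k) % num_nodes for k in range(1, redundancy + 1)]
    return primary, replicas


def readable_from(chunk_index, num_nodes, redundancy, failed):
    """Pick the node to read from: the primary unless it has failed."""
    primary, replicas = place_replicas(chunk_index, num_nodes, redundancy)
    for node in [primary] + replicas:
        if node not in failed:
            return node
    return None  # every copy of this chunk is on a failed node


print(place_replicas(5, 4, 1))
print(readable_from(5, 4, 1, failed={1}))
```

The replica is consulted only when the primary's node is in the failed set, matching the bullet that redundant chunks are used only when a node becomes unavailable.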
SciDB array model: release 12.7
• Time series
• Optimizations
• Binary loader (based on PostgreSQL binary loader)
• data_window operator
SciDB array model: next release
• Repair failed nodes from redundant data
• Elastic cluster:
– Increase/decrease node count
Contact
– Marilyn Matz
– CEO & co-founder
– 781 718 3999
– www.paradigm4.com
– www.scidb.org
innovative data management with complex analytics