Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications
Basho Technologies | 1
Scaling Time Series Applications
Basho
Dorothy Pults – Product Evangelist @deepults
Tom Sigler – Solution Architect @tom_sigler
Databricks
Peyman Mohajerian - Solution Architect @mohajeri
BASHO TECHNOLOGIES
Distributed Systems Software for Big Data and IoT applications
2011 - Creators of Riak
• Riak KV: NoSQL Key Value database
• Riak TS: NoSQL Time Series database
• Integrations: Spark, Redis caching, Solr, Mesos, Riak S2
120+ employees
Global Offices • Seattle (HQ), Washington DC, London, Paris
1/3 of the Fortune 50
$1.3 Trillion market spend Internet of Things in 2019
30 Billion Installed base of IoT endpoints in 2020
*Source IDC
56% have integrated IoT data
IoT is 24% of the average IT budget
20% decrease in downtime
21% increase in revenue
*Vodafone IoT Barometer
CRITICAL SUCCESS FACTORS FOR IOT
• Explore new business models
• Address Key IoT challenges like Edge Analytics
• Provide comprehensive solutions
• Engage with a broader ecosystem
100TB DAILY – IOT AND WEATHER DATA
530M personal weather station reports each day
9M webcam uploads
2M crowd reports
> 20M IoT barometric reports
WEATHER FORECAST PREDICTS SALES
Ideal BERRY purchasing weather turns out to be low wind with temperatures below 80 degrees.
People are more likely to eat STEAK when it's warm out with higher winds but no rain, but not if it gets too hot.
EDGE ANALYTICS
• Edge Analytics
• Fog Computing
• Inverted Web
• Reverse CDN
NEW ECOSYSTEM – DATA PIPELINE
WHAT’S NEEDED TO SCALE FOR IoT
• A database optimized for IoT data
• Review your data life cycle
• Summations and aggregation
• Data expiration
• Data cleansing
• Processing close to devices
• Scale for unstructured metadata
TIME SERIES (TS) DATA
• Consists of successive observations made over a time interval
• Structured
• Time + State/Measurement
• Metadata/Context
• Frequency
TIME SERIES CHALLENGES AT SCALE
• Ingestion Velocity
• Data Volume
• Post Ingestion Workloads
  – Real time
  – Batch
• Lifecycle/Expiry
Riak TS Overview & Architecture
WHAT IS RIAK TS?
Riak TS is a distributed NoSQL key/value store optimized for time series data.
It provides a time series database solution that is extensible and scalable.
Riak TS is derived from Riak KV and adds the ability to co-locate data by composite primary key, including quanta, for efficient sequential read I/O operations.
Why Riak TS?
• Highly available
• Fault Tolerant
• Geo data locality
• Scalability
  – Operations
  – Real-time range query performance
RIAK TS MASTERLESS ARCHITECTURE
Riak has a masterless architecture. Every node is:
• homogeneous
• capable of serving all read and write requests
• responsible for a subset of data
RIAK TS: DISTRIBUTION AND CO-LOCATION
• Variation of Dynamo
• Composite key drives grouping on disk
  – Partition Key
  – Local Key (sort)
RIAK: REPLICATION OF DATA
• Intra-cluster replication
• Multi-cluster replication
put(“bucket/key”)
RIAK: HIGH AVAILABILITY
Hinted handoff allows Riak nodes to temporarily take over storage operations for a failed node and update that node with changes when it comes back online.
RIAK TS: SCALABILITY
Riak TS scales in a near-linear fashion, so increasing the number of nodes in a cluster increases the number of reads and writes the cluster can handle in a predictable fashion.
Rebalancing of the cluster is a non-blocking operation, which doesn’t require downtime to perform.
If 10 nodes can serve 40,000 writes/second, then 20 nodes should serve 72,000+ writes/second.
Adding a node:
> riak-admin cluster join [email protected]
> riak-admin cluster plan
> riak-admin cluster commit
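The "near-linear" claim can be checked with a little arithmetic on the slide's own figures. The sketch below is only an illustration: the function name is made up here, and the two throughput numbers are the ones quoted on the slide, not measurements.

```python
def scaling_efficiency(base_nodes, base_rate, new_nodes, new_rate):
    """Fraction of ideal (perfectly linear) scaling that is actually achieved."""
    # Perfectly linear scaling would multiply throughput by the node ratio.
    ideal_rate = base_rate * new_nodes / base_nodes
    return new_rate / ideal_rate


# Figures from the slide: 10 nodes at 40,000 writes/s, 20 nodes at 72,000 writes/s.
efficiency = scaling_efficiency(10, 40_000, 20, 72_000)
print(f"{efficiency:.0%} of linear scaling")
```

With these numbers, doubling the cluster yields 72,000 of an ideal 80,000 writes/second, which is what "near-linear" amounts to here.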
RIAK TS: QUERY
• SQL Interface
• Arithmetic Support
• Aggregate
  – Count()
  – Sum()
  – Mean() & Avg()
  – Min() & Max()
  – STDDEV()
• Group By
• Expanded capabilities in future releases
Range
select * from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'
Aggregate
select MIN(temperature), AVG(temperature), MAX(temperature) from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'
Arithmetic
select (temperature * 2), (pressure - 1) from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'
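The time bounds in these queries are UNIX epoch milliseconds, which are awkward to write by hand. As a rough sketch, the range query can be assembled from ordinary datetimes like this. The helper names `to_epoch_ms` and `range_query` are hypothetical and not part of any Riak client; the code only builds the query text and does not send it anywhere.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(moment):
    """Convert a timezone-aware datetime to integer UNIX epoch milliseconds."""
    # Integer division of timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)


def range_query(table, device_id, start, end):
    """Build the text of a range query shaped like the one on the slide."""
    if "'" in device_id:
        # Naive guard for this sketch only; real code should escape values properly.
        raise ValueError("device_id must not contain a single quote")
    return (
        f"select * from {table} "
        f"where time > {to_epoch_ms(start)} and time < {to_epoch_ms(end)} "
        f"and deviceId = '{device_id}'"
    )


start = datetime(2016, 1, 19, 17, 30, 10, tzinfo=timezone.utc)
end = datetime(2016, 1, 19, 17, 44, 50, tzinfo=timezone.utc)
query = range_query("GeoCheckin", "abc-xxx-001-001", start, end)
print(query)
```

The two datetimes above correspond to the bounds used on the slide, so the printed text matches the slide's range query.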
BATCH PROCESSING
• Real-time vs. Batch
• Spark Connector
• Parallel Extract
DATA LIFECYCLE
• Global expiry
• Per table expiry coming soon
• Spark batch for rollups/aggregation
Time Series Data Modeling
SUPPORTED DDL DATA TYPES
• VARCHAR - Any string content is valid, including Unicode. Can only be compared using strict equality, and will not be typecast (e.g., to an integer) for comparison purposes. Use single quotes to delimit varchar strings.
• BOOLEAN - true or false (any case)
• TIMESTAMP - Timestamps are integer values expressing UNIX epoch time in UTC in milliseconds. Zero is not a valid timestamp.
• SINT64 - Signed 64-bit integer
• DOUBLE - This type does not comply with its IEEE specification: NaN (not a number) and INF (infinity) cannot be used.
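These type rules can be turned into a simple client-side check before rows are written. The sketch below mirrors only the constraints stated on the slide; the function `is_valid` is hypothetical and is not how Riak TS validates data internally.

```python
import math


def is_valid(column_type, value):
    """Check a Python value against the slide's rules for one DDL type."""
    # bool is a subclass of int in Python, so exclude it from the integer types.
    is_int = isinstance(value, int) and not isinstance(value, bool)
    if column_type == "varchar":
        return isinstance(value, str)
    if column_type == "boolean":
        return isinstance(value, bool)
    if column_type == "timestamp":
        # Integer epoch milliseconds; zero is not a valid timestamp.
        return is_int and value != 0
    if column_type == "sint64":
        return is_int and -(2 ** 63) <= value < 2 ** 63
    if column_type == "double":
        # NaN and infinity cannot be stored.
        return isinstance(value, float) and math.isfinite(value)
    raise ValueError(f"unknown column type: {column_type}")


print(is_valid("timestamp", 1453224610000))
print(is_valid("timestamp", 0))
print(is_valid("double", float("nan")))
```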
THE KEY
Consists of:
• Partition Key (node/partition)
• Quantum (optional)
• Local Key (sort order)
RIAK TS: CREATE TABLE
CREATE TABLE GeoCheckin (
  deviceID varchar not null,
  time timestamp not null,
  weather varchar not null,
  temperature double,
  PRIMARY KEY (
    (deviceID, quantum(time, 15, 'm')),
    deviceID, time
  )
)
Partition Key: (deviceID, quantum(time, 15, 'm'))
Local Key: deviceID, time
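To make the effect of `quantum(time, 15, 'm')` concrete, the sketch below buckets timestamps into 15-minute quanta and groups rows the way the partition key and local key suggest. It assumes quanta are aligned to the UNIX epoch and ignores the hashing step that maps a partition key onto a node; all function names are made up for illustration.

```python
from collections import defaultdict

QUANTUM_MS = 15 * 60 * 1000  # quantum(time, 15, 'm') expressed in milliseconds


def quantum_start(ts_ms, quantum_ms=QUANTUM_MS):
    """Start of the quantum containing ts_ms (assumes epoch-aligned quanta)."""
    return ts_ms - (ts_ms % quantum_ms)


def partition_key(device_id, ts_ms):
    """Rows with the same device and the same quantum are stored together."""
    return (device_id, quantum_start(ts_ms))


def local_key(device_id, ts_ms):
    """Sort order of rows inside one partition."""
    return (device_id, ts_ms)


rows = [
    ("abc-xxx-001-001", 1453225490000),
    ("abc-xxx-001-001", 1453224610000),
    ("abc-xxx-001-001", 1453225500000),  # first millisecond of the next quantum
]

groups = defaultdict(list)
for device_id, ts_ms in rows:
    groups[partition_key(device_id, ts_ms)].append(local_key(device_id, ts_ms))
for key in groups:
    groups[key].sort()

for key, members in sorted(groups.items()):
    print(key, members)
```

Under this assumption the first two rows, whose timestamps are the bounds of the range query shown earlier in the deck, fall into the same quantum, so that query would touch a single partition.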
MODELING THE KEY
Methodology:
• What questions does your application ask?
• How is the data presented?
USE CASE: PEDOMETER
• Questions
  – How many steps today (distance) for user?
  – How many steps per day this week for user?
  – Daily average?
  – Change in elevation?
• Key
  – Partition: UserID
  – Local: timestamp
  – Optimized for reads: quantum of 1 week
  – Optimized for writes: quantum of 1 day
• Fields
  – timestamp
  – steps
  – device_id
  – elevation
  – geohash
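The read/write trade-off between a one-day and a one-week quantum can be illustrated by counting how many quanta a "this week" query has to touch. This is a sketch under the assumption that quanta are aligned to the UNIX epoch; the helper `quanta_spanned` and the example start time are hypothetical.

```python
DAY_MS = 24 * 60 * 60 * 1000
WEEK_MS = 7 * DAY_MS


def quanta_spanned(start_ms, end_ms, quantum_ms):
    """Number of quanta overlapped by the half-open range [start_ms, end_ms)."""
    first = start_ms // quantum_ms
    last = (end_ms - 1) // quantum_ms
    return last - first + 1


# An arbitrary example week starting 2016-01-19 00:00:00 UTC.
week_start = 1453161600000
week_end = week_start + WEEK_MS

print("1-day quantum: ", quanta_spanned(week_start, week_end, DAY_MS))
print("1-week quantum:", quanta_spanned(week_start, week_end, WEEK_MS))
```

For this example week, the one-day quantum spreads the query over seven partitions per user, while the one-week quantum needs at most two (two rather than one because the example week does not start on an epoch-aligned week boundary). The smaller quantum, in turn, spreads incoming writes for one user across more partitions over time.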
DEMO
• Riak TS
• Python client
• Jupyter Notebook
• Pandas
• Matplotlib
THE DATA
• Vehicle traffic data
• City of Aarhus, Denmark
• Two sensors placed at each exit
• 5 min intervals

Description            Field               Type
Sensor Status          status              varchar
Exit ID                exitid              varchar
Timestamp              ts                  timestamp
Average Measured Time  avgMeasuredTime     sint64
Average Speed          avgSpeed            sint64
Median Measured Time   medianMeasuredTime  sint64
Number of Vehicles     vehicleCount        sint64
Sensor ID              id                  sint64
Report ID              report_id           sint64
Spark and Riak: In-situ analytics beyond Hadoop
Who is Databricks
Why Us
• Creators of Apache Spark. Contribute 75% of the code - 10x more than others
• Trained 20K Spark users
• Largest number of customers deploying Spark (200+)
Our Product
• Just-in-Time Data Platform – powered by Apache Spark.
• Empower your organization to swiftly build and deploy advanced analytics with Spark.
open source data processing engine built around speed, ease of use, and sophisticated analytics
largest open source data project with 1000+ contributors
UNIFIED ENGINE ACROSS DIVERSE WORKLOADS & ENVIRONMENTS
Scale out, fault tolerant
Python, Java, Scala, and R APIs
Standard libraries
APACHE SPARK ENGINE
ANALOGY: EVOLUTION OF CONSUMER ELECTRONICS
• First Cellular Phones
• Specialized Devices
• Unified Device
HISTORY REPEATS: FASTER, EASIER TO USE, UNIFIED
• First Distributed Processing Engine
• Specialized Data Processing Engines
• Unified Data Processing Engine
Google Trends: Hadoop vs. Spark
Analytics in-situ
SQL, Streaming, ML
• Enable SQL analytics over Riak
• Use Riak to store streaming data
• Use Riak to serve results generated by Spark
Riak Spark Connector
The user application contacts the coordinating node, which returns the locations of the data using cluster replication and availability information. Then “N” Spark workers open “N” parallel connections to different nodes, which allows the application to retrieve the desired dataset “N” times faster without generating “hot spots”.
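The idea of N workers each reading their own slice can be sketched in plain Python. This is not the connector's implementation: per the slide, the real connector plans its reads from the data locations returned by the coordinating node, whereas this illustration simply splits a time range evenly and uses threads. `ROWS`, `fetch`, `split_range`, and `parallel_read` are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in dataset: one fake row (just a timestamp) every 10 seconds.
ROWS = list(range(1453224610000, 1453225490000, 10_000))


def split_range(start_ms, end_ms, n):
    """Split [start_ms, end_ms) into n contiguous, non-overlapping sub-ranges."""
    step, extra = divmod(end_ms - start_ms, n)
    bounds = [start_ms]
    for i in range(n):
        # Spread any remainder over the first `extra` sub-ranges.
        bounds.append(bounds[-1] + step + (1 if i < extra else 0))
    return list(zip(bounds[:-1], bounds[1:]))


def fetch(sub_range):
    """Hypothetical stand-in for one worker reading one slice from one node."""
    low, high = sub_range
    return [ts for ts in ROWS if low <= ts < high]


def parallel_read(start_ms, end_ms, n):
    """Read all sub-ranges concurrently and concatenate the results in order."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        chunks = list(pool.map(fetch, split_range(start_ms, end_ms, n)))
    return [row for chunk in chunks for row in chunk]


result = parallel_read(1453224610000, 1453225490000, 4)
print(len(result), "rows read by 4 workers")
```

Because the sub-ranges do not overlap and together cover the whole range, the combined result is the same as a single sequential read; only the work is divided.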
Demo
Build a PoC on Databricks today. Professional services and training also available.
Contact [email protected]
or
Sign up for a trial at https://databricks.com/try-databricks
Thank You!
If you have any questions please reach out to us at basho.com/contact