Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Transcript of Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Page 1: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Basho Technologies | 1

Scaling Time Series Applications

Basho
Dorothy Pults – Product Evangelist @deepults
Tom Sigler – Solution Architect @tom_sigler

Databricks
Peyman Mohajerian – Solution Architect @mohajeri

Page 2: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

BASHO TECHNOLOGIES

Distributed Systems Software for Big Data and IoT applications

2011 - Creators of Riak
• Riak KV: NoSQL Key Value database
• Riak TS: NoSQL Time Series database
• Integrations: Spark, Redis caching, Solr, Mesos, Riak S2

120+ employees

Global Offices
• Seattle (HQ), Washington DC, London, Paris

1/3 of the Fortune 50

Page 3: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

$1.3 Trillion Internet of Things market spend in 2019

30 Billion Installed base of IoT endpoints in 2020

*Source IDC

Page 4: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

56% have integrated IOT data

IoT is 24% of the average IT budget

20% decrease in downtime

21% increase in revenue

*Vodafone IOT Barometer

Page 5: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

CRITICAL SUCCESS FACTORS FOR IOT

• Explore new business models

• Address Key IoT challenges like Edge Analytics

• Provide comprehensive solutions

• Engage with a broader ecosystem

Page 6: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

100TB DAILY – IOT AND WEATHER DATA

530M personal weather station reports each day

9M webcam uploads

2M crowd reports

> 20M IoT barometric reports

Page 7: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

WEATHER FORECAST PREDICTS SALES

Ideal BERRY purchasing weather turns out to be low wind with temperatures below 80 degrees.

People are more likely to eat STEAK when it's warm out with higher winds but no rain, but not if it gets too hot.

Page 8: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

EDGE ANALYTICS

• Edge Analytics

• Fog Computing

• Inverted Web

• Reverse CDN

Page 9: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

NEW ECOSYSTEM – DATA PIPELINE

Page 10: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

WHAT’S NEEDED TO SCALE FOR IoT

• A database optimized for IoT data

• Review your data life cycle

• Summations and aggregation

• Data expiration

• Data cleansing

• Processing close to devices

• Scale for unstructured metadata

Page 11: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

TIME SERIES (TS) DATA

• Consists of successive observations made over a time interval

• Structured
• Time + State/Measurement
• Metadata/Context
• Frequency

Page 12: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

TIME SERIES CHALLENGES AT SCALE

• Ingestion Velocity
• Data Volume
• Post Ingestion Workloads
  – Real time
  – Batch
• Lifecycle/Expiry

Page 13: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Riak TS Overview & Architecture

Page 14: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

WHAT IS RIAK TS?

Riak TS is a distributed NoSQL key/value store optimized for time series data.

It provides a time series database solution that is extensible and scalable.

Riak TS is derived from Riak KV and adds the ability to co-locate data by composite primary key, including quanta (fixed ranges of time), for efficient sequential read I/O operations.

Page 15: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Why Riak TS?

• Highly available
• Fault Tolerant
• Geo data locality
• Scalability
  – Operations
  – Real-time range query performance

Page 16: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

RIAK TS MASTERLESS ARCHITECTURE

Riak has a masterless architecture. Every node is:
• homogeneous
• capable of serving all read and write requests
• responsible for a subset of data

Page 17: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

RIAK TS: DISTRIBUTION AND CO-LOCATION

• Variation of Dynamo
• Composite key drives grouping on disk
  – Partition Key
  – Local Key (sort)

Page 18: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

RIAK: REPLICATION OF DATA

• Intra-cluster replication
• Multi-cluster replication

put(“bucket/key”)

Page 19: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

RIAK: HIGH AVAILABILITY

Hinted handoff allows Riak nodes to temporarily take over storage operations for a failed node and update that node with changes when it comes back online.
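
The hinted handoff idea can be sketched as a toy model. This is only an illustration of the behavior described on the slide, not Riak's implementation: the `ToyCluster` class, its method names, and the explicit `fallback` argument are all hypothetical.

```python
class ToyCluster:
    """Toy model of hinted handoff; not Riak's actual implementation."""

    def __init__(self, nodes):
        self.store = {node: {} for node in nodes}   # data each node owns
        self.hints = {node: [] for node in nodes}   # writes held for other nodes
        self.down = set()                           # nodes currently failed

    def put(self, owner, fallback, key, value):
        if owner in self.down:
            # The fallback accepts the write and remembers who it belongs to.
            self.hints[fallback].append((owner, key, value))
        else:
            self.store[owner][key] = value

    def recover(self, node):
        """Bring a node back and hand off every write held on its behalf."""
        self.down.discard(node)
        for holder in list(self.hints):
            kept = []
            for owner, key, value in self.hints[holder]:
                if owner == node:
                    self.store[node][key] = value
                else:
                    kept.append((owner, key, value))
            self.hints[holder] = kept


cluster = ToyCluster(["a", "b", "c"])
cluster.down.add("a")                       # node "a" fails
cluster.put(owner="a", fallback="b", key="k1", value=42)
held_while_down = list(cluster.hints["b"])  # "b" is holding the write
cluster.recover("a")                        # "a" returns and receives it
```

After `recover`, node "a" holds the value and node "b" no longer carries a hint for it.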

Page 20: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

RIAK TS: SCALABILITY

Riak TS scales in a near-linear fashion, so increasing the number of nodes in a cluster predictably increases the number of reads and writes the cluster can handle.

Rebalancing the cluster is a non-blocking operation that requires no downtime.

If 10 nodes can serve 40,000 writes/second, then 20 nodes should serve 72,000+ writes/second.

Adding a node

> riak-admin cluster join [email protected]

> riak-admin cluster plan

> riak-admin cluster commit
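
As a worked example of what "near-linear" means for the 10-node and 20-node figures on this slide, the sketch below compares the quoted throughput with perfectly linear scaling. The function name is hypothetical and the numbers are the slide's own.

```python
def scaling_efficiency(base_nodes, base_rate, new_nodes, new_rate):
    """Fraction of ideal (perfectly linear) throughput actually achieved."""
    ideal_rate = base_rate * new_nodes / base_nodes
    return new_rate / ideal_rate


# Figures from the slide: 10 nodes at 40,000 writes/s, 20 nodes at 72,000 writes/s.
efficiency = scaling_efficiency(10, 40_000, 20, 72_000)
print(f"{efficiency:.0%} of linear scaling")
```

Doubling the cluster would ideally give 80,000 writes/second, so the quoted 72,000 corresponds to roughly 90% of linear.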

Page 21: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

RIAK TS: QUERY

• SQL Interface
• Arithmetic Support
• Aggregate
  – Count()
  – Sum()
  – Mean() & Avg()
  – Min() & Max()
  – STDDEV()
• Group By
• Expanded capabilities in future releases

Range

select * from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'

Aggregate

select MIN(temperature), AVG(temperature), MAX(temperature) from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'

Arithmetic

select (temperature * 2), (pressure - 1) from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'
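
The time bounds in these queries are UNIX epoch milliseconds. A minimal sketch of producing such a range query from Python datetimes follows. The helper names are hypothetical, and the plain string formatting is for illustration only (no escaping of untrusted input); sending the query to the database is left to whichever client you use.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to UNIX epoch milliseconds."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def range_query(table, device_id, start, end):
    """Build a range query in the shape shown on the slide."""
    return (
        f"select * from {table} "
        f"where time > {to_epoch_ms(start)} and time < {to_epoch_ms(end)} "
        f"and deviceId = '{device_id}'"
    )


start = datetime(2016, 1, 19, 17, 30, 10, tzinfo=timezone.utc)
end = datetime(2016, 1, 19, 17, 44, 50, tzinfo=timezone.utc)
query = range_query("GeoCheckin", "abc-xxx-001-001", start, end)
print(query)
```

With these two datetimes the bounds come out as 1453224610000 and 1453225490000, the same values used in the slide's queries.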

Page 22: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

BATCH PROCESSING

• Real-time vs. Batch
• Spark Connector
• Parallel Extract

Page 23: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

DATA LIFECYCLE

• Global expiry
• Per table expiry coming soon
• Spark batch for rollups/aggregation
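
To make the rollup idea concrete, here is a small sketch of aggregating raw readings into hourly summaries. It uses plain Python as a stand-in for a Spark batch job; the function name and the sample readings are made up for illustration.

```python
from collections import defaultdict

HOUR_MS = 60 * 60 * 1000


def hourly_rollup(readings):
    """Roll (epoch_ms, value) readings up to {hour_start_ms: (min, avg, max)}."""
    buckets = defaultdict(list)
    for ts_ms, value in readings:
        # Truncate each timestamp to the start of its hour.
        buckets[ts_ms - ts_ms % HOUR_MS].append(value)
    return {
        hour: (min(vals), sum(vals) / len(vals), max(vals))
        for hour, vals in sorted(buckets.items())
    }


# Made-up sample: two readings in one hour, one in the next.
raw = [(1453224610000, 20.0), (1453225490000, 22.0), (1453228800000, 25.0)]
rollup = hourly_rollup(raw)
```

Once a rollup like this is stored, the raw rows it summarizes become candidates for expiry.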

Page 24: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Time Series Data Modeling

Page 25: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

SUPPORTED DDL DATA TYPES

• VARCHAR - Any string content is valid, including Unicode. Can only be compared using strict equality, and will not be typecast (e.g., to an integer) for comparison purposes. Use single quotes to delimit varchar strings.
• BOOLEAN - true or false (any case)
• TIMESTAMP - Timestamps are integer values expressing UNIX epoch time in UTC in milliseconds. Zero is not a valid timestamp.
• SINT64 - Signed 64-bit integer
• DOUBLE - This type does not comply with its IEEE specification: NaN (not a number) and INF (infinity) cannot be used.
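
The TIMESTAMP and DOUBLE restrictions above are easy to check on the client before writing. The sketch below is one way to do that; the function name and the `time` and `temperature` field names are assumptions (borrowed from the GeoCheckin example in this deck), not part of any Riak API.

```python
import math


def validate_row(row):
    """Return a list of problems found in a row dict, per the type rules above."""
    problems = []

    ts = row.get("time")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(ts, bool) or not isinstance(ts, int) or ts == 0:
        problems.append("time must be a non-zero integer (epoch milliseconds)")

    temp = row.get("temperature")
    if temp is not None and (math.isnan(temp) or math.isinf(temp)):
        problems.append("temperature must not be NaN or INF")

    return problems


ok = validate_row({"time": 1453224610000, "temperature": 21.5})
bad = validate_row({"time": 0, "temperature": float("nan")})
```

Here `ok` is empty and `bad` reports both the zero timestamp and the NaN value.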

Page 26: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

THE KEY

Consists of:
• Partition Key (node/partition)
• Quantum (optional)
• Local Key (sort order)

Page 27: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

RIAK TS: CREATE TABLE

CREATE TABLE GeoCheckin (
  deviceID varchar not null,
  time timestamp not null,
  weather varchar not null,
  temperature double,
  PRIMARY KEY (
    (deviceID, quantum(time, 15, 'm')),
    deviceID, time
  )
)

Partition Key: (deviceID, quantum(time, 15, 'm'))

Local Key: deviceID, time
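
To illustrate what the 15-minute quantum in the partition key does, the sketch below groups rows by device and by the start of the quantum their timestamp falls in. It assumes quanta are aligned to the UNIX epoch and says nothing about how Riak TS hashes the key internally; the function name is hypothetical.

```python
QUANTUM_MS = 15 * 60 * 1000  # corresponds to quantum(time, 15, 'm')


def partition_key(device_id, time_ms, quantum_ms=QUANTUM_MS):
    """Rows that share this tuple fall in the same quantum for the same device."""
    # Assumption: quanta are aligned to the UNIX epoch.
    return (device_id, time_ms - time_ms % quantum_ms)


a = partition_key("abc-xxx-001-001", 1453224610000)  # 17:30:10 UTC
b = partition_key("abc-xxx-001-001", 1453225490000)  # 17:44:50 UTC
c = partition_key("abc-xxx-001-001", 1453225500000)  # 17:45:00 UTC
```

Under that assumption `a` and `b` share a quantum and would be co-located, while `c` starts the next one.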

Page 28: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

MODELING THE KEY

Methodology:
• What questions does your application ask?
• How is the data presented?

Page 29: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

USE CASE: PEDOMETER

• Questions
  – How many steps today (distance) for user?
  – How many steps per day this week for user?
  – Daily average?
  – Change in elevation?

• Key
  – Partition: UserID
  – Local: timestamp
  – Optimized for reads: quantum of 1 week
  – Optimized for writes: quantum of 1 day

• Fields
  – timestamp
  – steps
  – device_id
  – elevation
  – geohash
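
The read-versus-write trade-off in the quantum choice can be sketched by counting how many quanta a one-week query touches. This assumes epoch-aligned quanta, and the function name is hypothetical.

```python
DAY_MS = 24 * 60 * 60 * 1000
WEEK_MS = 7 * DAY_MS


def quanta_spanned(start_ms, end_ms, quantum_ms):
    """Number of quanta touched by the half-open range [start_ms, end_ms)."""
    first = start_ms // quantum_ms
    last = (end_ms - 1) // quantum_ms
    return last - first + 1


# A one-week query that starts exactly on a quantum boundary...
aligned_start = 100 * WEEK_MS
aligned = (
    quanta_spanned(aligned_start, aligned_start + WEEK_MS, DAY_MS),
    quanta_spanned(aligned_start, aligned_start + WEEK_MS, WEEK_MS),
)

# ...and the same query shifted by half a day.
shifted_start = aligned_start + DAY_MS // 2
shifted = (
    quanta_spanned(shifted_start, shifted_start + WEEK_MS, DAY_MS),
    quanta_spanned(shifted_start, shifted_start + WEEK_MS, WEEK_MS),
)
```

A weekly quantum keeps "steps this week" reads to one or two quanta, while a daily quantum spreads the same user's writes over more partitions at the cost of touching seven or eight quanta per weekly read.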

Page 30: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

DEMO

• Riak TS
• Python client
• Jupyter Notebook
• Pandas
• Matplotlib

Page 31: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

THE DATA

• Vehicle traffic data
• City of Aarhus, Denmark
• Two sensors placed at each exit
• 5 min intervals

Description            Field               Type
Sensor Status          status              varchar
Exit ID                exitid              varchar
Timestamp              ts                  timestamp
Average Measured Time  avgMeasuredTime     sint64
Average Speed          avgSpeed            sint64
Median Measured Time   medianMeasuredTime  sint64
Number of Vehicles     vehicleCount        sint64
Sensor ID              id                  sint64
Report ID              report_id           sint64

Page 32: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Spark and Riak: In-situ analytics beyond Hadoop

Page 33: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Who is Databricks

Why Us

• Creators of Apache Spark. Contribute 75% of the code - 10x more than others
• Trained 20K Spark users
• Largest number of customers deploying Spark (200+)

Our Product

• Just-in-Time Data Platform – powered by Apache Spark.
• Empower your organization to swiftly build and deploy advanced analytics with Spark.

Page 34: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

open source data processing engine built around speed, ease of use, and sophisticated analytics

largest open source data project with 1000+ contributors

Page 35: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

UNIFIED ENGINE ACROSS DIVERSE WORKLOADS & ENVIRONMENTS

Scale out, fault tolerant

Python, Java, Scala, and R APIs

Standard libraries

APACHE SPARK ENGINE

Page 36: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

ANALOGY: EVOLUTION OF CONSUMER ELECTRONICS

First Cellular Phones

Specialized Devices

Unified Device

Page 37: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

HISTORY REPEATS: FASTER, EASIER TO USE, UNIFIED

First Distributed Processing Engine

Specialized Data Processing Engines

Unified Data Processing Engine

Page 38: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Google Trends: Hadoop vs. Spark

Page 39: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Analytics in-situ

SQL: Enable SQL analytics over Riak

Streaming: Use Riak to store streaming data

ML: Use Riak to serve results generated by Spark

Page 40: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Riak Spark Connector

The user application contacts the coordinating node, which returns the locations of the data using cluster replication and availability information. Then “N” Spark workers open “N” parallel connections to different nodes, which allows the application to retrieve the desired dataset up to “N” times faster without generating “hot spots”.
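
The parallel extract pattern can be sketched with a thread pool standing in for Spark workers. Everything here is a stand-in: the `COVERAGE_PLAN` dictionary, node names, and sample rows are invented for illustration, and no real connector API is used.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical coverage plan: which node holds which slice of the dataset.
COVERAGE_PLAN = {
    "node1": [("abc-001", 1453224610000, 20.5)],
    "node2": [("abc-001", 1453224910000, 20.9)],
    "node3": [("abc-001", 1453225210000, 21.4)],
}


def fetch_from_node(node):
    """Stand-in for one worker opening its own connection to one node."""
    return COVERAGE_PLAN[node]


def parallel_extract(plan):
    """Read every slice concurrently, one worker per node, and merge the rows."""
    with ThreadPoolExecutor(max_workers=len(plan)) as pool:
        slices = pool.map(fetch_from_node, plan)  # results keep the plan's order
        return [row for rows in slices for row in rows]


rows = parallel_extract(COVERAGE_PLAN)
```

Because each worker reads from a different node, no single node has to serve the whole dataset, which is the "hot spot" the slide refers to.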

Page 41: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Demo

Page 42: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Build a PoC on Databricks today. Professional services and training also available.

Contact [email protected]

or

Sign up for a trial at https://databricks.com/try-databricks

Page 43: Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Thank You!

If you have any questions please reach out to us at basho.com/contact