Webinar: Data Modeling and Shortcuts to Success in Scaling Time Series Applications

Basho Technologies

Scaling Time Series Applications

Basho
Dorothy Pults – Product Evangelist @deepults
Tom Sigler – Solution Architect @tom_sigler

Databricks
Peyman Mohajerian - Solution Architect @mohajeri


BASHO TECHNOLOGIES

Distributed Systems Software for Big Data and IoT applications

2011 - Creators of Riak
• Riak KV: NoSQL Key Value database
• Riak TS: NoSQL Time Series database
• Integrations: Spark, Redis caching, Solr, Mesos, Riak S2

120+ employees

Global Offices • Seattle (HQ), Washington DC, London, Paris

1/3 of the Fortune 50


$1.3 trillion market spend on the Internet of Things in 2019

30 billion installed base of IoT endpoints in 2020

*Source: IDC


56% have integrated IoT data

IoT is 24% of the average IT budget

20% decrease in downtime

21% increase in revenue

*Source: Vodafone IoT Barometer


CRITICAL SUCCESS FACTORS FOR IOT

• Explore new business models

• Address key IoT challenges such as edge analytics

• Provide comprehensive solutions

• Engage with a broader ecosystem


100TB DAILY – IOT AND WEATHER DATA

530M personal weather station reports each day

9M webcam uploads

2M crowd reports

> 20M IoT barometric reports


WEATHER FORECAST PREDICTS SALES

Ideal BERRY purchasing weather turns out to be low wind with temperatures below 80 degrees.

People are more likely to eat STEAK when it's warm out with higher winds but no rain, but not if it gets too hot.


EDGE ANALYTICS

• Edge Analytics

• Fog Computing

• Inverted Web

• Reverse CDN


NEW ECOSYSTEM – DATA PIPELINE


WHAT’S NEEDED TO SCALE FOR IoT

• A database optimized for IoT data

• Review your data life cycle

• Summations and aggregation

• Data expiration

• Data cleansing

• Processing close to devices

• Scale for unstructured metadata


TIME SERIES (TS) DATA

• Consists of successive observations made over a time interval

• Structured
• Time + State/Measurement
• Metadata/Context
• Frequency


TIME SERIES CHALLENGES AT SCALE

• Ingestion Velocity
• Data Volume
• Post Ingestion Workloads
– Real time
– Batch
• Lifecycle/Expiry


Riak TS Overview & Architecture


WHAT IS RIAK TS?

Riak TS is a distributed NoSQL key/value store optimized for time series data.

It provides a time series database solution that is extensible and scalable.

Riak TS is derived from Riak KV and adds the ability to co-locate data by composite primary key, including quanta, for efficient sequential read I/O operations.


Why Riak TS?

• Highly available
• Fault Tolerant
• Geo data locality
• Scalability
– Operations
– Real-time range query performance


RIAK TS MASTERLESS ARCHITECTURE

Riak has a masterless architecture. Every node is:
• homogeneous
• capable of serving all read and write requests
• responsible for a subset of data


RIAK TS: DISTRIBUTION AND CO-LOCATION

• Variation of Dynamo
• Composite key drives grouping on disk
– Partition Key
– Local Key (sort)
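The way a partition key could drive placement on a Dynamo-style ring can be sketched as follows. This is a simplified illustration, not Riak's actual implementation: the use of SHA-1 here, the 64-partition ring, the round-robin ownership, and the names partition_for and owner_of are all assumptions made for the example.

```python
import hashlib

RING_SIZE = 2 ** 160   # size of the SHA-1 hash space
NUM_PARTITIONS = 64    # hypothetical number of partitions for this sketch


def partition_for(partition_key: tuple) -> int:
    """Map a partition key to one of NUM_PARTITIONS equal slices of the ring."""
    encoded = repr(partition_key).encode("utf-8")
    position = int.from_bytes(hashlib.sha1(encoded).digest(), "big")
    return position // (RING_SIZE // NUM_PARTITIONS)


def owner_of(partition: int, nodes: list) -> str:
    """Assign partitions to nodes round-robin (a simplification)."""
    return nodes[partition % len(nodes)]


# Rows that share a partition key hash to the same partition, so they are
# stored together; the local key then decides their sort order on disk.
key = ("abc-xxx-001-001", 1453224600000)
print(owner_of(partition_for(key), ["node1", "node2", "node3"]))
```

Because every row with the same device and quantum produces the same hash, a range query inside one quantum only needs to visit one partition.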


RIAK: REPLICATION OF DATA

• Intra-cluster replication
• Multi-cluster replication

put(“bucket/key”)


RIAK: HIGH AVAILABILITY

Hinted handoff allows Riak nodes to temporarily take over storage operations for a failed node and update that node with changes when it comes back online.
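The idea behind hinted handoff can be shown with a toy model. This is only a sketch of the concept described above, not Riak's code: the Node class and the put and handoff functions are hypothetical names, and real clusters also deal with replicas, quorums, and failure detection.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}    # key -> value owned by this node
        self.hints = {}   # intended owner name -> {key: value} held on its behalf


def put(owner, fallback, key, value):
    """Write to the owner, or park the write on the fallback as a hint."""
    if owner.up:
        owner.data[key] = value
    else:
        fallback.hints.setdefault(owner.name, {})[key] = value


def handoff(fallback, owner):
    """Replay hinted writes to the owner once it is back online."""
    if owner.up:
        owner.data.update(fallback.hints.pop(owner.name, {}))


a, b = Node("a"), Node("b")
a.up = False
put(a, b, "bucket/key", "value")   # node a is down, so b keeps a hint
a.up = True
handoff(b, a)                      # a receives the write it missed
```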


RIAK TS: SCALABILITY

Riak TS scales in a near-linear fashion, so increasing the number of nodes in a cluster increases the number of reads and writes the cluster can handle in a predictable fashion.

Rebalancing of the cluster is a non-blocking operation, which doesn’t require downtime to perform.

If 10 nodes can serve 40,000 writes/second, then 20 nodes should serve 72,000+ writes/second.
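The figures above imply roughly 90% of perfectly linear scaling. A minimal sketch of that arithmetic, where scaling_efficiency is a hypothetical helper name introduced for the example:

```python
def scaling_efficiency(base_nodes, base_rate, new_nodes, new_rate):
    """Ratio of observed throughput to perfectly linear throughput."""
    ideal_rate = base_rate * new_nodes / base_nodes
    return new_rate / ideal_rate


# 10 nodes at 40,000 writes/second would be 80,000 at 20 nodes if scaling
# were perfectly linear; the slide's 72,000+ is the near-linear estimate.
efficiency = scaling_efficiency(10, 40_000, 20, 72_000)
print(f"{efficiency:.0%} of linear")
```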

> riak-admin cluster join [email protected]

> riak-admin cluster plan

> riak-admin cluster commit

Adding a node


RIAK TS: QUERY

Range

select * from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'

Aggregate

select MIN(temperature), AVG(temperature), MAX(temperature) from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'

Arithmetic

select (temperature * 2), (pressure - 1) from GeoCheckin where time > 1453224610000 and time < 1453225490000 and deviceId = 'abc-xxx-001-001'

• SQL Interface
• Arithmetic Support
• Aggregate
– Count()
– Sum()
– Mean() & Avg()
– Min() & Max()
– STDDEV()
• Group By
• Expanded capabilities in future releases
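The time bounds in these queries are UNIX epoch milliseconds. The sketch below shows one way an application might build the range query text from Python datetimes. The helper names to_epoch_ms and range_query are assumptions for illustration and are not part of any Riak client; the values are interpolated directly, so they are assumed to come from trusted code.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(moment):
    """Convert a timezone-aware datetime to UNIX epoch milliseconds."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


def range_query(table, device_id, start, end, columns="*"):
    """Build the text of a time-range query for one device (trusted inputs only)."""
    return (
        f"select {columns} from {table} "
        f"where time > {to_epoch_ms(start)} and time < {to_epoch_ms(end)} "
        f"and deviceId = '{device_id}'"
    )


start = datetime(2016, 1, 19, 17, 30, 10, tzinfo=timezone.utc)
end = datetime(2016, 1, 19, 17, 44, 50, tzinfo=timezone.utc)
print(range_query("GeoCheckin", "abc-xxx-001-001", start, end))
```

Passing columns="MIN(temperature), AVG(temperature), MAX(temperature)" produces the aggregate form of the same query.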


BATCH PROCESSING

• Real-time vs. Batch
• Spark Connector
• Parallel Extract


DATA LIFECYCLE

• Global expiry
• Per table expiry coming soon
• Spark batch for rollups/aggregation
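The shape of a rollup and an expiry pass can be sketched in plain Python. This is not the Spark API or Riak's expiry mechanism; hourly_rollup and expire are hypothetical names, and the one-hour bucket is an assumption chosen for the example.

```python
from collections import defaultdict

HOUR_MS = 60 * 60 * 1000


def hourly_rollup(readings):
    """Group (timestamp_ms, value) pairs by hour; return per-hour (min, avg, max)."""
    buckets = defaultdict(list)
    for ts_ms, value in readings:
        buckets[ts_ms - ts_ms % HOUR_MS].append(value)
    return {
        hour: (min(vals), sum(vals) / len(vals), max(vals))
        for hour, vals in sorted(buckets.items())
    }


def expire(readings, now_ms, ttl_ms):
    """Keep only readings that are at most ttl_ms old relative to now_ms."""
    return [(ts, v) for ts, v in readings if now_ms - ts <= ttl_ms]


readings = [(0, 10.0), (1000, 20.0), (HOUR_MS, 30.0)]
print(hourly_rollup(readings))
print(expire(readings, now_ms=HOUR_MS, ttl_ms=1000))
```

Storing the rollup before the raw rows expire keeps long-term trends available at a fraction of the storage cost.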


Time Series Data Modeling


SUPPORTED DDL DATA TYPES

• VARCHAR - Any string content is valid, including Unicode. Can only be compared using strict equality, and will not be typecast (e.g., to an integer) for comparison purposes. Use single quotes to delimit varchar strings.
• BOOLEAN - true or false (any case)
• TIMESTAMP - Timestamps are integer values expressing UNIX epoch time in UTC in milliseconds. Zero is not a valid timestamp.
• SINT64 - Signed 64-bit integer
• DOUBLE - This type does not comply with its IEEE specification: NaN (not a number) and INF (infinity) cannot be used.


THE KEY

Consists of:
• Partition Key (node/partition)
• Quantum (optional)
• Local Key (sort order)


RIAK TS: CREATE TABLE

CREATE TABLE GeoCheckin (
  deviceID varchar not null,
  time timestamp not null,
  weather varchar not null,
  temperature double,
  PRIMARY KEY (
    (deviceID, quantum(time, 15, 'm')),
    deviceID, time
  )
)

Partition Key: (deviceID, quantum(time, 15, 'm'))

Local Key: deviceID, time
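The effect of quantum(time, 15, 'm') can be sketched as rounding each timestamp down to the start of its 15-minute window. The sketch assumes quanta are aligned to the UNIX epoch; quantum_start and partition_key are hypothetical helper names, not Riak functions.

```python
QUANTUM_MS = 15 * 60 * 1000  # quantum(time, 15, 'm') expressed in milliseconds


def quantum_start(ts_ms, quantum_ms=QUANTUM_MS):
    """Round a millisecond timestamp down to the start of its quantum."""
    return ts_ms - ts_ms % quantum_ms


def partition_key(device_id, ts_ms):
    """Rows with the same device and quantum share one partition key."""
    return (device_id, quantum_start(ts_ms))


# Two readings from the same device inside one 15-minute window are co-located.
first = partition_key("abc-xxx-001-001", 1453224610000)
second = partition_key("abc-xxx-001-001", 1453225490000)
print(first == second)
```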


MODELING THE KEY

Methodology:
• What questions does your application ask?
• How is the data presented?


USE CASE: PEDOMETER

• Questions
– How many steps today (distance) for user?
– How many steps per day this week for user?
– Daily average?
– Change in elevation?

• Key
– Partition: UserID
– Local: timestamp
– Optimized for reads: quantum of 1 week
– Optimized for writes: quantum of 1 day

• Fields
– timestamp
– steps
– device_id
– elevation
– geohash
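The read/write trade-off in the quantum choice can be made concrete by counting how many quanta a one-week query touches. This is a sketch under the assumption that quanta are aligned to the UNIX epoch; quanta_spanned is a hypothetical helper name.

```python
DAY_MS = 24 * 60 * 60 * 1000
WEEK_MS = 7 * DAY_MS


def quanta_spanned(start_ms, end_ms, quantum_ms):
    """Count the quanta touched by the inclusive range [start_ms, end_ms]."""
    return end_ms // quantum_ms - start_ms // quantum_ms + 1


# A "steps per day this week" query starting on a quantum boundary.
start = 100 * WEEK_MS
end = start + WEEK_MS - 1
print(quanta_spanned(start, end, DAY_MS))   # one quantum per day
print(quanta_spanned(start, end, WEEK_MS))  # the whole week in one quantum
```

A weekly quantum keeps the week's rows for a user together, which favors reads, while a daily quantum spreads the same user's writes across more partitions over time.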


DEMO

• Riak TS
• Python client
• Jupyter Notebook
• Pandas
• Matplotlib


THE DATA

Description              Field               Type
Sensor Status            status              varchar
Exit ID                  exitid              varchar
Timestamp                ts                  timestamp
Average Measured Time    avgMeasuredTime     sint64
Average Speed            avgSpeed            sint64
Median Measured Time     medianMeasuredTime  sint64
Number of Vehicles       vehicleCount        sint64
Sensor ID                id                  sint64
Report ID                report_id           sint64

• Vehicle traffic data
• City of Aarhus, Denmark
• Two sensors placed at each exit
• 5 min intervals

Spark and Riak: In-situ analytics beyond Hadoop


Who is Databricks

Why Us

• Creators of Apache Spark. Contribute 75% of the code - 10x more than others

• Trained 20K Spark users

• Largest number of customers deploying Spark (200+)

Our Product

• Just-in-Time Data Platform – powered by Apache Spark.

• Empower your organization to swiftly build and deploy advanced analytics with Spark.

open source data processing engine built around speed, ease of use, and sophisticated analytics

largest open source data project with 1000+ contributors

UNIFIED ENGINE ACROSS DIVERSE WORKLOADS & ENVIRONMENTS

Scale out, fault tolerant

Python, Java, Scala, and R APIs

Standard libraries

APACHE SPARK ENGINE

ANALOGY: EVOLUTION OF CONSUMER ELECTRONICS

First Cellular Phones

Specialized Devices

Unified Device

HISTORY REPEATS: FASTER, EASIER TO USE, UNIFIED

First Distributed Processing Engine

Specialized Data Processing Engines

Unified Data Processing Engine

Google Trends: Hadoop vs. Spark

Analytics in-situ

SQL: Enable SQL analytics over Riak

Streaming: Use Riak to store streaming data

ML: Use Riak to serve results generated by Spark

Riak Spark Connector

The user application contacts the coordinating node, which returns the locations of the data using cluster replication and availability information. Then “N” Spark workers open “N” parallel connections to different nodes, which allows the application to retrieve the desired dataset “N” times faster without generating “hot spots”.
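The parallel-extract idea can be sketched with plain Python threads. This is not the Riak Spark Connector's implementation: split_range, fetch, and parallel_extract are hypothetical names, and fetch is a stand-in that only pretends to read rows.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start_ms, end_ms, n):
    """Split [start_ms, end_ms) into n contiguous, non-overlapping sub-ranges."""
    step, extra = divmod(end_ms - start_ms, n)
    bounds, lo = [], start_ms
    for i in range(n):
        hi = lo + step + (1 if i < extra else 0)
        bounds.append((lo, hi))
        lo = hi
    return bounds


def fetch(sub_range):
    """Hypothetical stand-in for one worker reading one sub-range from one node."""
    lo, hi = sub_range
    return hi - lo  # pretend each millisecond yields exactly one row


def parallel_extract(start_ms, end_ms, n):
    """Run n fetches concurrently and return the total number of rows read."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return sum(pool.map(fetch, split_range(start_ms, end_ms, n)))


print(split_range(0, 10, 3))
print(parallel_extract(0, 10, 3))
```

Because the sub-ranges do not overlap, each worker talks to its own slice of the data and no single node has to serve the whole result.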

Build a PoC on Databricks today. Professional services and training also available.

Contact [email protected]

or

Sign up for a trial at https://databricks.com/try-databricks



Thank You!

If you have any questions please reach out to us at basho.com/contact
