Data Modeling IoT and Time Series data in NoSQL

Matthew Brender, Drew Kerrigan

{ “Matt” : ‘mbrender@basho.com’,‘mjbrender’,‘@mjbrender’,‘ruby, javascript, go’

{ “Drew” : ‘dkerrigan@basho.com’,‘drewkerrigan’,‘@dr00_b’,‘erlang, elixir, go’

Meet your presenters

Basho Technologies | 2

Basho Snapshot

Distributed Systems Software for Big Data, IoT and Hybrid Cloud applications

Founded January 2008

2011 Creators of Riak
• Riak core: used by Goldman, Visa…
• Riak KV: Feature-rich Distributed NoSQL database
• Riak S2: Object and cloud storage software

2015 New Products
• Basho Data Platform: NoSQL, caching & analytics
• Riak TS: Distributed database designed for time series

120+ employees

Global Offices Seattle (HQ), Washington DC, London, Tokyo

Agenda

• Time Series Data
• Introducing Riak TS
• Data Modeling
• Coding with Riak TS

What is Time Series?

How Is Time Series Data Different?

• High performance reads and writes of time series data

Data location matters

Data needs to be easy to retrieve using range queries

select * from devices
where time >= 2015-08-06 01:00:00
and time <= 2015-08-06 01:10:00
and errorcode = 555123
and device_type = 'mobile'

Higher write volumes

All while still being highly available!

With no data loss even with a huge number of sources

Eventually rolled up, compressed, with the details expired

Introducing Riak TS

[Diagram: Basho Data Platform architecture.
Service instances: Spark, Redis (Caching), Solr, Elastic Search, Web Services, 3rd Party Web Services & Integrations.
Storage instances: Riak KV (Key/Value), Riak S2 (Object Storage), Riak TS (Time Series), Document Store, Columnar, Graph.
Core services: Replication & Synchronization, Message Routing, Cluster Management & Monitoring, Logging & Analytics, Internal Data Store.]

Riak TS Feature Details

Feature Overview

Feature: Data co-location by time and geohash or, more generally, series and data family
Benefit: Easily analyze temporal and geocoded data

Feature: Configure time series bucket-type that propagates across the cluster using a simple, SQL-like command
Benefit: Simple setup for faster ROI

Feature: Greater data locality
Benefit: Faster data storage and retrieval

Feature: Option to store structured and semi-structured data
Benefit: Clean data written to the database, eliminating the need to cleanse data

Feature: Write queries using a subset of SQL
Benefit: Faster application development. Write applications to extract and analyze your data in a familiar language

Feature: Near-linear scaling
Benefit: Easy to grow database to meet data demands

Feature: High Availability for ingest
Benefit: No data loss even when data is streaming from a large number of sources

Riak TS Feature Details

• Same distributed systems benefits of Riak KV

Operational Simplicity

Fault Tolerance

Robust Client APIs

Broad Client Libraries

Massive Scalability

Active Anti-Entropy

Masterless

High Availability

Low Latency

Read Repair

Riak Search

Riak TS Optimization

Optimized Deployment

• Data Co-Location
• Composite Keys: time or geohash, data family
• Time quantization (quantum)

Simplified Data Modeling

• DDL: Table and field definitions support structured and semi-structured data

Fast Queries and Analysis

• Range Queries (SQL based)
• LevelDB filtering
• Spark Connector

Riak has a masterless architecture in which every node in a cluster is capable of serving read and write requests.

Requests are routed to nodes using standard load balancing.
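Time quantization, listed above under data co-location, can be pictured with a small standalone Python sketch. It is an illustration only, not the Riak TS implementation: the function names, the unit table, and the choice of (family, series, quantum start) as the grouping tuple are assumptions made for this example.

```python
# Hypothetical sketch of time quantization; not Riak TS code.
# Unit widths in milliseconds (assumed unit letters: days, hours, minutes, seconds).
UNIT_MS = {"d": 86_400_000, "h": 3_600_000, "m": 60_000, "s": 1_000}


def quantum_start(ts_ms: int, size: int, unit: str) -> int:
    """Return the epoch-millisecond start of the quantum that contains ts_ms."""
    width = size * UNIT_MS[unit]
    return ts_ms - (ts_ms % width)


def partition_key(family: str, series: str, ts_ms: int, size: int, unit: str) -> tuple:
    """Rows that produce the same tuple would be grouped (co-located) together."""
    return (family, series, quantum_start(ts_ms, size, unit))


if __name__ == "__main__":
    first = partition_key("all", "false", 1365555661000, 15, "m")
    one_minute_later = partition_key("all", "false", 1365555661000 + 60_000, 15, "m")
    print(first == one_minute_later)
```

Two readings taken a minute apart fall into the same 15-minute quantum, so they share a grouping tuple; a reading from the next quantum would not.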

Riak TS Optimization

[Diagram: Riak KV hashing with a 2i query, compared with Riak TS hashing with a TS query.]

RIAK TS – Storing Structured Data

• Key format
  – Objects have a composite key (partition key and local key)
• Tables
  – Buckets can be defined as tables
  – Tables can have a schema defined using DDL
  – Columns in the table can be typed
• Data Validation
  – Data is validated on input

Buckets used to Define Tables
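Validation on input can be sketched as a check of each row against a typed schema. The Python below is only an illustration of the idea, not how Riak TS performs validation; the `SCHEMA` list and `validate_row` function are hypothetical, and the column names are borrowed from the hard drive example later in this deck.

```python
# Hypothetical sketch of validating a row against typed columns; not Riak TS code.
from typing import Any, List, Sequence

# (column name, expected Python type, nullable) -- an assumed mapping for illustration.
SCHEMA = [
    ("date", int, False),
    ("family", str, False),
    ("failure", str, False),
    ("serial", str, True),
    ("model", str, True),
    ("capacity", float, True),
    ("temperature", float, True),
]


def validate_row(row: Sequence[Any], schema=SCHEMA) -> List[str]:
    """Return a list of problems; an empty list means the row is acceptable."""
    if len(row) != len(schema):
        return [f"expected {len(schema)} columns, got {len(row)}"]
    errors = []
    for value, (name, col_type, nullable) in zip(row, schema):
        if value is None:
            if not nullable:
                errors.append(f"{name}: NULL not allowed")
        elif type(value) is not col_type:
            errors.append(
                f"{name}: expected {col_type.__name__}, got {type(value).__name__}"
            )
    return errors


if __name__ == "__main__":
    print(validate_row([None, "all", "false", None, None, None, "26"]))
```

Rejecting a bad row at write time is what lets later queries assume every stored value already has the declared type.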

RIAK TS – Range Queries

• Use Cases
  – Range queries
• Implementation Details
  – SQL based query language
  – Filtering rows based on column expressions
  – Filtering executed in backend
  – Specific columns are extracted
  – Simple select with WHERE clause
    • for numbers: >, >=, <, <=, =, !=
    • for other data types: =, !=
    • AND, OR (nesting operators are supported)

Query Like SQL

select * from devices
where time >= 2015-08-06 01:00:00
and time <= 2015-08-06 01:10:00
and errorcode = 555123
and device_type = 'mobile'

Data Modeling

How does one approach time series data?

The first rule…

The real first rule of data modeling:

• Decide what questions you want to ask of the data
  – Graphs?
  – Granularity?
  – Analysis?
  – Monitoring?

Graphs

Sample Data Exercise

Hard drive test data
– https://www.backblaze.com/hard-drive-test-data.html
– https://en.wikipedia.org/wiki/S.M.A.R.T.

Data Characteristics
[Date, Serial Number, Model, Capacity (bytes), Failure, …, smart_194_raw (Temp), …]

Sample Row:
• Date: "2013-04-10"
• Model: "Hitachi HDS5C3030ALA630"
• Failure: 0
• Temp: 26

Which columns are good candidates for indexing given the question we are asking of the data?

Define the Conceptual Query

Effect of temperature on hard drive stability

Approach 1: "Find all failures in 2013"

SELECT * FROM HardDrives
WHERE date >= 2013-01-01
AND date <= 2013-12-31
AND failure = 'true'

• Pros:
  – All data is colocated physically
• Cons:
  – Requires client side processing for further analysis

Create the Table

riak-admin bucket-type create HardDrives '{"props":{"n_val":3, "table_def":"CREATE TABLE HardDrives (
    date TIMESTAMP NOT NULL,
    family VARCHAR NOT NULL,
    failure VARCHAR NOT NULL,
    serial VARCHAR,
    model VARCHAR,
    capacity FLOAT,
    temperature FLOAT,
    PRIMARY KEY ((quantum(date, 1, 'y'), family, failure), date, family, failure))"}}'

Ingest the Data

RawRow = [
    <<"2013-04-10">>,               %% Date
    <<"MJ0351YNG9Z0XA">>,           %% Serial
    <<"Hitachi HDS5C3030ALA630">>,  %% Model
    <<"3000592982016">>,            %% Capacity
    <<"0">>,                        %% Failure
    …, <<"26">>, …],                %% SMART Stats with Temperature

ProcessedRow = [
    1365555661000,                  %% Date
    <<"all">>,                      %% Family
    <<"false">>,                    %% Failure
    <<"MJ0351YNG9Z0XA">>,           %% Serial
    <<"Hitachi HDS5C3030ALA630">>,  %% Model
    3000592982016.0,                %% Capacity
    26.0],                          %% Temperature

Ingest the Data

ProcessedRow = [
    convert(lists:nth(1, RawRow), date),      % date
    <<"all">>,                                % family
    convert(lists:nth(5, RawRow), boolean),   % failure
    lists:nth(2, RawRow),                     % serial
    lists:nth(3, RawRow),                     % model
    convert(lists:nth(4, RawRow), float),     % capacity
    convert(lists:nth(51, RawRow), float)     % temp
],

riakc_ts:put(Pid, <<"HardDrives">>, [ProcessedRow]).
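The slides call a `convert` helper without showing its body. As a rough illustration of what such a conversion step might do, here is a standalone Python sketch. The behavior of `convert`, the midnight-UTC interpretation of the date, and the `temp_index` parameter are all assumptions made for this example; the slide's sample timestamp (1365555661000) carries a time of day that this sketch does not reproduce.

```python
# Hypothetical Python stand-in for the slide's convert helper; behavior is assumed.
from datetime import datetime, timezone


def convert(value: str, kind: str):
    """Turn a raw CSV string into the type the table column expects."""
    if kind == "date":
        # Assumption: dates are interpreted as midnight UTC and stored as epoch ms.
        day = datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        return int(day.timestamp()) * 1000
    if kind == "boolean":
        # The raw data marks a failure with "1"; the table stores 'true' or 'false'.
        return "true" if value == "1" else "false"
    if kind == "float":
        return float(value)
    raise ValueError(f"unknown kind: {kind}")


def process_row(raw_row: list, temp_index: int) -> list:
    """Reorder and convert a raw row into the HardDrives column order."""
    return [
        convert(raw_row[0], "date"),          # date
        "all",                                # family
        convert(raw_row[4], "boolean"),       # failure
        raw_row[1],                           # serial
        raw_row[2],                           # model
        convert(raw_row[3], "float"),         # capacity
        convert(raw_row[temp_index], "float"),  # temperature
    ]


if __name__ == "__main__":
    raw = ["2013-04-10", "MJ0351YNG9Z0XA", "Hitachi HDS5C3030ALA630",
           "3000592982016", "0", "26"]
    print(process_row(raw, temp_index=5))
```

Note that Python lists are zero-indexed while Erlang's `lists:nth` is one-indexed, so the positions differ by one from the slide.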

Query the Data

Start = integer_to_list(date_to_epoch_ms(<<"2013-01-01">>)),
End = integer_to_list(date_to_epoch_ms(<<"2013-12-31">>)),

Query = "select * from HardDrives where date >= " ++ Start ++
        " and date <= " ++ End ++
        " and family = 'all' and failure = 'true'",

{_Fields, Results} = riakc_ts:query(Pid, list_to_binary(Query)),
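The query above relies on a `date_to_epoch_ms` helper that the slides do not define. A minimal Python sketch of that helper and of the query-string assembly follows. It assumes dates mean midnight UTC, and `build_query` is a hypothetical name; it interpolates values directly into the string, so it is only suitable for trusted input.

```python
# Hypothetical sketch of the date helper and query assembly; assumes midnight UTC.
from datetime import datetime, timezone


def date_to_epoch_ms(date_str: str) -> int:
    """Convert 'YYYY-MM-DD' to epoch milliseconds at midnight UTC."""
    day = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(day.timestamp()) * 1000


def build_query(table: str, start_date: str, end_date: str,
                family: str, failure: str) -> str:
    """Assemble the range query text; every partition key field is constrained."""
    return (
        f"select * from {table} "
        f"where date >= {date_to_epoch_ms(start_date)} "
        f"and date <= {date_to_epoch_ms(end_date)} "
        f"and family = '{family}' and failure = '{failure}'"
    )


if __name__ == "__main__":
    print(build_query("HardDrives", "2013-01-01", "2013-12-31", "all", "true"))
```

Because the upper bound is midnight at the start of 2013-12-31, rows stamped later that day would fall outside the range; that only matters if timestamps carry a time of day.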

Process the Results

Total Failures: 112
Results:
[{1365555661000,<<"all">>,<<"true">>,<<"9VS3FM1J">>,<<"ST31500341AS">>,1500301910016.0,31.0},
 {...}, {...}, ...]
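The client-side processing that Approach 1 requires amounts to grouping the returned rows by model and counting failures per temperature. The Python below is a sketch of that step under the assumption that each result row has the seven HardDrives columns in table order; the function names are hypothetical and this is not the presenters' `ts:approach1()` code.

```python
# Hypothetical sketch of client-side aggregation over query results.
from collections import Counter, defaultdict


def failures_by_model_and_temp(rows):
    """Map each model to a Counter of temperature -> number of failure rows."""
    histogram = defaultdict(Counter)
    for _date, _family, _failure, _serial, model, _capacity, temperature in rows:
        histogram[model][temperature] += 1
    return histogram


def format_histogram(histogram) -> str:
    """Render one line per model, with temperatures in ascending order."""
    lines = []
    for model in sorted(histogram):
        cells = " ".join(f"{t}={n}" for t, n in sorted(histogram[model].items()))
        lines.append(f'"{model}": {cells}')
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up rows for illustration only (placeholder serials), not real results.
    sample = [
        (1365555661000, "all", "true", "SERIAL-A", "MODEL-X", 1.5e12, 31.0),
        (1365555661000, "all", "true", "SERIAL-B", "MODEL-Y", 1.5e12, 20.0),
        (1365642061000, "all", "true", "SERIAL-C", "MODEL-Y", 1.5e12, 20.0),
    ]
    print(format_histogram(failures_by_model_and_temp(sample)))
```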

Results

130> ts:approach1().
Total Failures: 112
"ST31500341AS": ...
"ST3000DM001": ...
"Hitachi HDS5C4040ALE630": ...
"ST4000DM000": ...
"ST31500541AS": 18.0=1 19.0=1 20.0=2 21.0=3 22.0=2 24.0=2 25.0=1 29.0=3 30.0=1

Refine the Query

New Query

SELECT * FROM HardDrives
WHERE date >= 2013-01-01
AND date <= 2013-12-31
AND model = 'ST31500541AS'
AND failure = 'true'

New Primary Key

PRIMARY KEY (
    (quantum(date, 1, 'y'), model, failure),
    date, model, failure))"}}'

Same (but more focused) Results

"ST31500541AS": 18.0=1 19.0=1 20.0=2 21.0=3 22.0=2 24.0=2 25.0=1 29.0=3 30.0=1

Think Outside the Box

New Approach: Multi-Model with Riak KV

Conceptual Query: Read the single value of a bunch of counters!

"Find the number of failures for each Quantum, Model, and Temperature combination"

• Pros:
  – Each data point is pre-calculated, so very little client side processing
  – Potentially faster, depending on a lot of variables
• Cons:
  – Requires knowing which specific statistics you want before writing the data
  – Requires several counter writes for every row of raw data

Create the Bucket Type

riak-admin bucket-type create HardDriveCounters '{"props":{"datatype":"counter"}}'

Ingest the Data

Failure = lists:nth(5, RawRow),              % failure
Year = extract_year(lists:nth(1, RawRow)),   % year
Model = lists:nth(3, RawRow),                % model
Temp = lists:nth(51, RawRow),                % temperature

Bucket = {<<"HardDriveCounters">>, Year},
Key = list_to_binary(binary_to_list(Model) ++ binary_to_list(Temp)),

%% We only care about failures
case Failure of
    <<"1">> ->
        Counter = riakc_counter:new(),
        Counter1 = riakc_counter:increment(Counter),
        riakc_pb_socket:update_type(Pid, Bucket, Key, riakc_counter:to_op(Counter1));
    _ ->
        ok
end.

Query the Data

StartTemp = 16,
EndTemp = 28,
Results = range_get(Pid, <<"2013">>, <<"ST31500341AS">>, StartTemp, EndTemp, []).
...

%% Stops when CurrentTemp reaches EndTemp, so EndTemp itself is not fetched.
range_get(_Pid, _Year, _Model, EndTemp, EndTemp, Accum) ->
    lists:reverse(Accum);
range_get(Pid, Year, Model, CurrentTemp, EndTemp, Accum) ->
    Bucket = {<<"HardDriveCounters">>, Year},
    Key = list_to_binary(binary_to_list(Model) ++ integer_to_list(CurrentTemp)),
    %% A temperature with no recorded failures has no counter yet.
    NumFailures = case riakc_pb_socket:fetch_type(Pid, Bucket, Key) of
        {ok, Counter} -> riakc_counter:value(Counter);
        {error, {notfound, _Type}} -> 0
    end,
    range_get(Pid, Year, Model, CurrentTemp + 1, EndTemp,
              [{CurrentTemp, NumFailures} | Accum]).
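The counter approach can be tried without a cluster by swapping Riak for an in-memory dictionary. The Python sketch below mirrors the key scheme from the slides (year as the bucket, model plus temperature as the key); `CounterStore`, `record_row`, and `range_get` are hypothetical stand-ins for this illustration, not a Riak client.

```python
# Hypothetical in-memory sketch of the counter-per-key approach; not a Riak client.
from collections import defaultdict


class CounterStore:
    """Stands in for a counter bucket type: one integer per (bucket, key)."""

    def __init__(self):
        self._counters = defaultdict(int)

    def increment(self, bucket, key, amount=1):
        self._counters[(bucket, key)] += amount

    def value(self, bucket, key):
        # Missing counters read as zero, like a temperature with no failures.
        return self._counters.get((bucket, key), 0)


def record_row(store, year, model, temp, failure):
    """Write path: only failure rows ("1") increment a counter."""
    if failure == "1":
        store.increment(("HardDriveCounters", year), f"{model}{temp}")


def range_get(store, year, model, start_temp, end_temp):
    """Read path: one key lookup per temperature in [start_temp, end_temp)."""
    bucket = ("HardDriveCounters", year)
    return [
        (temp, store.value(bucket, f"{model}{temp}"))
        for temp in range(start_temp, end_temp)
    ]


if __name__ == "__main__":
    store = CounterStore()
    record_row(store, "2013", "ST31500341AS", 26, "1")
    record_row(store, "2013", "ST31500341AS", 26, "1")
    record_row(store, "2013", "ST31500341AS", 27, "0")
    print(range_get(store, "2013", "ST31500341AS", 25, 28))
```

One design caveat: concatenating model and temperature without a separator can make two different combinations share a key if one model name is a prefix of another followed by digits, so a delimiter may be worth adding in practice.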

Data Modeling in Riak

Multi-Model with Riak KV

• Keys: Create your own using quantum + “dimension”

• Range Queries: Create your own client side multi-get to issue incremental key gets

• Compound Queries: Create more composite keys!

• Data Location: Sometimes inefficient because data is spread across many vnodes / partitions

Data Modeling in Riak

Time Series Modeling in Riak TS

• Keys: Automatically managed based on your PRIMARY KEY definition as well as the values in those fields

• Range Queries: Use a well-known subset of SQL to specify a start and end in a WHERE clause, which performs a server-side multi-get

• Compound Queries: Possible with a wisely chosen composite PRIMARY KEY, although multiple tables may still be necessary

• Data Location: Very efficient data grouping by quantums, families, and series.

Conclusion

Part of the Basho Data Platform

[Diagram: Basho Data Platform architecture, with Riak TS (Time Series) among the storage instances alongside Riak KV (Key/Value) and Riak S2 (Object Storage), between the service instances and the core services.]


QUESTIONS?

Spend time getting to know us

@basho @riconconf

OPEN SOURCE
• Basho Data Platform (code): Riak KV with parallel extract
• Basho Data Platform Add-ons (code): Spark + Spark Connector
Download a build

ENTERPRISE
• Basho Data Platform, Enterprise: Riak EE with multi-cluster replication, Spark Leader Election Service
• Basho Data Platform Add-ons: Redis + Cache Proxy, Spark Workers + Spark Master
Contact us to get started
