Data Modeling IoT and Time Series data in NoSQL

Post on 27-Jan-2017


Transcript of Data Modeling IoT and Time Series data in NoSQL


Matthew Brender, Drew Kerrigan


Meet your presenters

{ "Matt" : 'mbrender@basho.com', 'mjbrender', '@mjbrender', 'ruby, javascript, go' }

{ "Drew" : 'dkerrigan@basho.com', 'drewkerrigan', '@dr00_b', 'erlang, elixir, go' }

Basho Technologies | 2

Basho Snapshot

Distributed Systems Software for Big Data, IoT and Hybrid Cloud applications

Founded January 2008

2011: Creators of Riak
• Riak core: used by Goldman, Visa…
• Riak KV: Feature-rich Distributed NoSQL database
• Riak S2: Object and cloud storage software

2015: New Products
• Basho Data Platform: NoSQL, caching & analytics
• Riak TS: Distributed database designed for time series

120+ employees

Global Offices: Seattle (HQ), Washington DC, London, Tokyo

Agenda

• Time Series Data
• Introducing Riak TS
• Data Modeling
• Coding with Riak TS

What is Time Series?


How Is Time Series Data Different?

• High performance reads and writes of time series data

Data location matters

Data needs to be easy to retrieve using range queries

select * from devices
where time >= 2015-08-06 01:00:00
and time <= 2015-08-06 01:10:00
and errorcode = 555123
and device_type = "mobile"

Higher write volumes

All while still being highly available!

With no data loss even with a huge number of sources

Eventually rolled up, compressed, with the details expired
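The roll-up mentioned in the last point can be pictured with a small sketch. This is plain Python rather than anything Riak TS provides: `roll_up`, the one-hour bucket width, and the sample readings are all hypothetical, and averaging is just one possible aggregate.

```python
from collections import defaultdict

HOUR_MS = 60 * 60 * 1000  # assumed roll-up width: one hour, in milliseconds


def roll_up(points, bucket_ms):
    """Average (timestamp_ms, value) points into fixed-width time buckets."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [total, count]
    for ts, value in points:
        start = ts - (ts % bucket_ms)  # start of the bucket containing ts
        sums[start][0] += value
        sums[start][1] += 1
    return {start: total / count for start, (total, count) in sorted(sums.items())}


# Hypothetical readings: two in the first hour, one in the second.
raw = [(0, 20.0), (1_800_000, 22.0), (3_600_000, 30.0)]
hourly = roll_up(raw, HOUR_MS)
```

Once the hourly averages are stored, the detailed points could be expired, which is the trade-off the slide describes.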

Introducing Riak TS

[Diagram: Basho Data Platform architecture. Service instances: Spark, Redis (Caching), Solr, Elastic Search, Web Services, 3rd Party Web Services & Integrations. Storage instances: Riak KV (Key/Value), Riak S2 (Object Storage), Riak TS (Time Series), Document Store, Columnar, Graph. Core services: Replication & Synchronization, Message Routing, Cluster Management & Monitoring, Logging & Analytics, Internal Data Store.]

Riak TS Feature Details: Feature Overview

Feature | Benefit
Data co-location by time and geohash or more generally series and data family | Easily analyze temporal and geocoded data
Configure time series bucket-type that propagates across the cluster using a simple, SQL-like command | Simple setup for faster ROI
Greater data locality | Faster data storage and retrieval
Option to store structured and semi-structured data | Clean data written to the database eliminating the need to cleanse data
Write queries using a subset of SQL | Faster application development. Write applications to extract and analyze your data in a familiar language
Near-linear scaling | Easy to grow database to meet data demands
High Availability for ingest | No data loss even when data is streaming from a large number of sources

Riak TS Feature Details

• Same distributed systems benefits of Riak KV

Operational Simplicity

Fault Tolerance

Robust Client APIs

Broad Client Libraries

Massive Scalability

CRDTs

Active Anti-Entropy

Masterless

High Availability

Low Latency

Read Repair

Riak Search


Riak TS Optimization


Optimized Deployment

• Data Co-Location
• Composite Keys - time or geohash, data family
• Time quantization (quantum)

Simplified Data Modeling

• DDL – Table and field definitions support structured and semi-structured data

Fast Queries and Analysis

• Range Queries (SQL based)
• LevelDB filtering
• Spark Connector
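Time quantization can be pictured as flooring each timestamp to the start of a fixed-width window, so that every point in the same window shares one quantum value. The sketch below is plain Python under that assumption; the `quantum` helper, the unit table, and the 15-minute width are illustrative and are not taken from the Riak TS source. Calendar-based units such as years are not covered.

```python
# Milliseconds per unit; only fixed-width units are handled in this sketch.
UNIT_MS = {"s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def quantum(ts_ms, size, unit):
    """Return the start (in epoch ms) of the window that contains ts_ms."""
    width = size * UNIT_MS[unit]
    return ts_ms - (ts_ms % width)


# Two readings five minutes apart fall into the same 15-minute quantum,
# while a reading 15 minutes later falls into the next one.
a = quantum(1365555661000, 15, "m")
b = quantum(1365555961000, 15, "m")
c = quantum(1365556561000, 15, "m")
```

Because `a` and `b` are equal, rows carrying those timestamps would share the quantum part of a composite key, which is what makes co-location possible.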

Riak has a masterless architecture in which every node in a cluster is capable of serving read and write requests.

Requests are routed to nodes using standard load balancing.


Riak KV Hashing

[Diagram panels: PUT; 2i Query]

Riak TS Hashing

[Diagram panels: PUT; TS Query]
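A rough way to see why hashing on a quantized key co-locates data: if only the partition key (quantum, family, series) is hashed, every row in the same quantum lands on the same partition. The sketch below is a simplification in plain Python. The names `partition_key` and `partition_for`, the 64-partition ring, and the modulo step are assumptions for illustration; Riak's real ring uses consistent hashing rather than a simple modulo.

```python
import hashlib

NUM_PARTITIONS = 64  # hypothetical ring size
FIFTEEN_MIN_MS = 15 * 60 * 1000


def partition_key(ts_ms, family, series, quantum_ms):
    """Build the part of the key that decides placement."""
    return (ts_ms - ts_ms % quantum_ms, family, series)


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a partition key to a partition index via a SHA-1 digest."""
    digest = hashlib.sha1(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions


# Two rows five minutes apart share a quantum, so they share a partition.
p1 = partition_for(partition_key(1365555661000, "all", "true", FIFTEEN_MIN_MS))
p2 = partition_for(partition_key(1365555961000, "all", "true", FIFTEEN_MIN_MS))
```

A range query that stays within one quantum would then touch a single partition instead of fanning out across the cluster.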

Riak TS – Storing Structured Data

Buckets used to Define Tables

• Key format
– Objects have a composite key (partition key and local key)
• Tables
– Buckets can be defined as tables
– Tables can have a schema defined using DDL
– Columns in the table can be typed
• Data Validation
– Data is validated on input
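Validation on input can be sketched as checking each incoming row against typed, optionally nullable columns. The Python below is only an illustration of the idea, not how Riak TS implements it: `SCHEMA` mirrors the HardDrives table defined later in the talk, while `validate_row` and the mapping of column types to Python types are assumptions.

```python
# (column name, expected Python type, nullable), mirroring the HardDrives table.
SCHEMA = [
    ("date", int, False),         # TIMESTAMP NOT NULL, as epoch milliseconds
    ("family", str, False),       # VARCHAR NOT NULL
    ("failure", str, False),      # VARCHAR NOT NULL
    ("serial", str, True),        # VARCHAR
    ("model", str, True),         # VARCHAR
    ("capacity", float, True),    # FLOAT
    ("temperature", float, True)  # FLOAT
]


def validate_row(row, schema=SCHEMA):
    """Raise if the row does not match the schema; otherwise return it."""
    if len(row) != len(schema):
        raise ValueError(f"expected {len(schema)} columns, got {len(row)}")
    for value, (name, col_type, nullable) in zip(row, schema):
        if value is None:
            if not nullable:
                raise ValueError(f"column {name} must not be null")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, col_type):
            raise TypeError(f"column {name} expects {col_type.__name__}")
    return row


good = validate_row([1365555661000, "all", "false", "MJ0351YNG9Z0XA",
                     "Hitachi HDS5C3030ALA630", 3000592982016.0, 26.0])
```

Rejecting bad rows at write time is what lets later queries assume clean, typed columns.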

Riak TS – Range Queries

• Use Cases
– Range queries
• Implementation Details
– SQL based query language
– Filtering rows based on column expressions
– Filtering executed in backend
– Specific columns are extracted
– Simple select with WHERE clause
  • for numbers: >, >=, <, <=, =, !=
  • for other data types: =, !=
  • AND, OR (nesting operators are supported)

Query Like SQL

select * from devices
where time >= 2015-08-06 01:00:00
and time <= 2015-08-06 01:10:00
and errorcode = 555123
and device_type = "mobile"
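The filtering described above (column expressions combined with AND, then extraction of specific columns) can be mimicked on in-memory rows. This is a plain-Python sketch of the semantics only: `select`, `matches`, and the sample device rows are hypothetical, OR and nesting are left out, and nothing here reflects how the backend executes the filter.

```python
import operator

# Comparison operators listed on the slide, mapped to Python functions.
OPS = {
    ">": operator.gt, ">=": operator.ge,
    "<": operator.lt, "<=": operator.le,
    "=": operator.eq, "!=": operator.ne,
}


def matches(row, conditions):
    """True if the row satisfies every (column, op, value) condition (AND)."""
    return all(OPS[op](row[col], value) for col, op, value in conditions)


def select(rows, conditions, columns=None):
    """Filter rows, optionally keeping only the named columns."""
    out = []
    for row in rows:
        if matches(row, conditions):
            out.append(row if columns is None else {c: row[c] for c in columns})
    return out


# Hypothetical device rows; time is a plain integer here for brevity.
devices = [
    {"time": 5, "errorcode": 555123, "device_type": "mobile"},
    {"time": 20, "errorcode": 555123, "device_type": "mobile"},
    {"time": 7, "errorcode": 1, "device_type": "mobile"},
]
where = [("time", ">=", 0), ("time", "<=", 10),
         ("errorcode", "=", 555123), ("device_type", "=", "mobile")]
hits = select(devices, where, columns=["time"])
```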

Data Modeling

How does one approach time series data?

The first rule…


The real first rule of data modeling:

• Decide what questions you want to ask of the data
– Graphs?
– Granularity?
– Analysis?
– Monitoring?

Graphs


Sample Data Exercise

Hard drive test data
– https://www.backblaze.com/hard-drive-test-data.html
– https://en.wikipedia.org/wiki/S.M.A.R.T.

Data Characteristics

[Date, Serial Number, Model, Capacity (bytes), Failure, …, smart_194_raw (Temp), …]

Sample Row:
• Date: "2013-04-10"
• Model: "Hitachi HDS5C3030ALA630"
• Failure: 0
• Temp: 26

Which columns are good candidates for indexing given the question we are asking of the data?

Define the Conceptual Query

Effect of temperature on hard drive stability

Approach 1: "Find all failures in 2013"

SELECT * FROM HardDrives
WHERE date >= 2013-01-01
AND date <= 2013-12-31
AND failure = 'true'

• Pros:
– All data is colocated physically
• Cons:
– Requires client side processing for further analysis

Create the Table

riak-admin bucket-type create HardDrives '{"props":{"n_val":3, "table_def":"CREATE TABLE HardDrives (
    date TIMESTAMP NOT NULL,
    family VARCHAR NOT NULL,
    failure VARCHAR NOT NULL,
    serial VARCHAR,
    model VARCHAR,
    capacity FLOAT,
    temperature FLOAT,
    PRIMARY KEY ((quantum(date, 1, 'y'), family, failure), date, family, failure))"}}'

Ingest the Data

RawRow = [
    <<"2013-04-10">>,               %% Date
    <<"MJ0351YNG9Z0XA">>,           %% Serial
    <<"Hitachi HDS5C3030ALA630">>,  %% Model
    <<"3000592982016">>,            %% Capacity
    <<"0">>,                        %% Failure
    …, <<"26">>, …],                %% SMART Stats with Temperature

ProcessedRow = [
    1365555661000,                  %% Date
    <<"all">>,                      %% Family
    <<"false">>,                    %% Failure
    <<"MJ0351YNG9Z0XA">>,           %% Serial
    <<"Hitachi HDS5C3030ALA630">>,  %% Model
    3000592982016.0,                %% Capacity
    26.0],                          %% Temperature

Ingest the Data

ProcessedRow = [
    convert(lists:nth(1, RawRow), date),     % date
    <<"all">>,                               % family
    convert(lists:nth(5, RawRow), boolean),  % failure
    lists:nth(2, RawRow),                    % serial
    lists:nth(3, RawRow),                    % model
    convert(lists:nth(4, RawRow), float),    % capacity
    convert(lists:nth(51, RawRow), float)    % temp
],

riakc_ts:put(Pid, <<"HardDrives">>, [ProcessedRow]).
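The slides do not show the `convert` helper, so the following is only a guess at its behaviour, written as a self-contained Python sketch. Assumptions: dates are read as midnight UTC and stored as epoch milliseconds, the raw failure flag "1" becomes the string "true", and `process_row` and `temp_index` are hypothetical names (index 50 corresponds to the slide's `lists:nth(51, ...)`). The slide's sample timestamp appears to include a time-of-day component that this sketch does not reproduce.

```python
from datetime import datetime, timezone


def convert(value, kind):
    """Convert one raw CSV string into the type the table expects."""
    if kind == "date":
        day = datetime.strptime(value, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        return int(day.timestamp()) * 1000  # epoch milliseconds, midnight UTC
    if kind == "boolean":
        return "true" if value == "1" else "false"
    if kind == "float":
        return float(value)
    raise ValueError(f"unknown kind: {kind}")


def process_row(raw, temp_index=50):
    """Reorder and convert a raw row into the HardDrives column order."""
    return [
        convert(raw[0], "date"),            # date
        "all",                              # family
        convert(raw[4], "boolean"),         # failure
        raw[1],                             # serial
        raw[2],                             # model
        convert(raw[3], "float"),           # capacity
        convert(raw[temp_index], "float"),  # temperature
    ]


# Shortened sample row, so the temperature sits at index 5 instead of 50.
sample = ["2013-04-10", "MJ0351YNG9Z0XA", "Hitachi HDS5C3030ALA630",
          "3000592982016", "0", "26"]
row = process_row(sample, temp_index=5)
```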

Query the Data

Start = integer_to_list(date_to_epoch_ms(<<"2013-01-01">>)),
End = integer_to_list(date_to_epoch_ms(<<"2013-12-31">>)),

Query = "select * from HardDrives where date >= " ++ Start ++
        " and date <= " ++ End ++
        " and family = 'all' and failure = 'true'",

{_Fields, Results} = riakc_ts:query(Pid, list_to_binary(Query)),

Process the Results

Total Failures: 112
Results:

[{1365555661000, <<"all">>, <<"true">>, <<"9VS3FM1J">>, <<"ST31500341AS">>, 1500301910016.0, 31.0},
 {...}, {...}, ...]

Results

130> ts:approach1().
Total Failures: 112
"ST31500341AS": ...
"ST3000DM001": ...
"Hitachi HDS5C4040ALE630": ...
"ST4000DM000": ...

"ST31500541AS": 18.0=1 19.0=1 20.0=2 21.0=3 22.0=2 24.0=2 25.0=1 29.0=3 30.0=1
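The client-side processing that Approach 1 requires amounts to tallying failures per model and temperature from the returned rows. A minimal Python sketch of that step follows; the function names and the three sample rows are made up for illustration, and the row layout assumes the HardDrives column order used above.

```python
from collections import Counter, defaultdict


def failures_by_model_and_temp(rows):
    """Count failed drives per model, broken down by temperature."""
    tally = defaultdict(Counter)
    for _date, _family, failure, _serial, model, _capacity, temp in rows:
        if failure == "true":
            tally[model][temp] += 1
    return tally


def format_tally(tally):
    """Render one line per model in the style of the results slide."""
    lines = []
    for model in sorted(tally):
        counts = " ".join(f"{t}={n}" for t, n in sorted(tally[model].items()))
        lines.append(f'"{model}": {counts}')
    return lines


# Hypothetical query results (not real Backblaze data).
rows = [
    (1, "all", "true", "SERIAL-A", "MODEL-X", 1.0, 19.0),
    (2, "all", "true", "SERIAL-B", "MODEL-X", 1.0, 18.0),
    (3, "all", "true", "SERIAL-C", "MODEL-X", 1.0, 19.0),
]
report = format_tally(failures_by_model_and_temp(rows))
```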

Refine the Query

New Query

SELECT * FROM HardDrives
WHERE date >= 2013-01-01
AND date <= 2013-12-31
AND model = 'ST31500541AS'
AND failure = 'true'

New Primary Key

PRIMARY KEY ((quantum(date, 1, 'y'), model, failure), date, model, failure))"}}'

Same (but more focused) Results

"ST31500541AS": 18.0=1 19.0=1 20.0=2 21.0=3 22.0=2 24.0=2 25.0=1 29.0=3 30.0=1

Think Outside the Box

New Approach: Multi-Model with Riak KV

Conceptual Query: "Find the number of failures for each Quantum, Model, and Temperature combination"

Read the single value of a bunch of counters!

• Pros:
– Each data point is pre-calculated, so very little client side processing
– Potentially faster, depending on a lot of variables
• Cons:
– Requires knowing which specific stat values you want before writing data
– Requires several counter writes for every row of raw data

Create the Bucket Type

riak-admin bucket-type create HardDriveCounters '{"props":{"datatype":"counter"}}'

Ingest the Data

Failure = lists:nth(5, RawRow),              % failure
Year = extract_year(lists:nth(1, RawRow)),   % year
Model = lists:nth(3, RawRow),                % model
Temp = lists:nth(51, RawRow),                % temperature

Bucket = {<<"HardDriveCounters">>, Year},
Key = list_to_binary(binary_to_list(Model) ++ binary_to_list(Temp)),

%% We only care about failures
case Failure of
    <<"1">> ->
        Counter = riakc_counter:new(),
        Counter1 = riakc_counter:increment(Counter),
        riakc_pb_socket:update_type(Pid, Bucket, Key, riakc_counter:to_op(Counter1));
    _ -> ok
end.
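The counter approach pre-computes one count per (year, model, temperature) at write time. The sketch below imitates that with an in-memory `collections.Counter` standing in for the Riak counter bucket; `counter_key`, `record_row`, and the shortened sample rows are hypothetical, and the key is simply model and temperature concatenated, as in the slide.

```python
from collections import Counter


def counter_key(model, temp):
    """Compose the counter key the same way the slide does: model ++ temp."""
    return model + temp


def record_row(counters, raw, temp_index=50):
    """Increment the matching counter, but only for failed drives."""
    if raw[4] != "1":  # we only care about failures
        return
    year = raw[0][:4]  # "2013-04-10" -> "2013"
    counters[(year, counter_key(raw[2], raw[temp_index]))] += 1


# Shortened hypothetical rows: temperature at index 5 instead of 50.
counters = Counter()
record_row(counters, ["2013-04-10", "S1", "ST31500341AS", "0", "1", "26"], temp_index=5)
record_row(counters, ["2013-05-02", "S2", "ST31500341AS", "0", "1", "26"], temp_index=5)
record_row(counters, ["2013-05-03", "S3", "ST31500341AS", "0", "0", "26"], temp_index=5)
```

Every failed row costs one extra write here, which is the "several counter writes" cost listed among the cons.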

Query the Data

StartTemp = 16,
EndTemp = 28,
Results = range_get(Pid, <<"2013">>, <<"ST31500341AS">>, StartTemp, EndTemp, []).
...

%% Pid is passed in explicitly so the connection is in scope.
range_get(_Pid, _Year, _Model, EndTemp, EndTemp, Accum) ->
    lists:reverse(Accum);
range_get(Pid, Year, Model, CurrentTemp, EndTemp, Accum) ->
    Bucket = {<<"HardDriveCounters">>, Year},
    Key = list_to_binary(binary_to_list(Model) ++ integer_to_list(CurrentTemp)),
    {ok, Counter} = riakc_pb_socket:fetch_type(Pid, Bucket, Key),
    NumFailures = riakc_counter:value(Counter),
    range_get(Pid, Year, Model, CurrentTemp + 1, EndTemp,
              [{CurrentTemp, NumFailures} | Accum]).
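The client-side range read boils down to one key lookup per temperature step. A compact Python sketch of the same loop follows, using a plain dict in place of the Riak counter bucket. `range_get` here is a hypothetical stand-in, the end of the range is exclusive to match the recursion above, and treating a missing counter as zero is an added assumption.

```python
def range_get(counters, year, model, start_temp, end_temp):
    """Fetch one counter per temperature in [start_temp, end_temp)."""
    results = []
    for temp in range(start_temp, end_temp):
        key = (year, f"{model}{temp}")                 # same key shape as at ingest
        results.append((temp, counters.get(key, 0)))   # missing counter -> 0
    return results


# Hypothetical pre-computed counters.
stored = {
    ("2013", "ST31500341AS18"): 1,
    ("2013", "ST31500341AS20"): 2,
}
results = range_get(stored, "2013", "ST31500341AS", 16, 22)
```

Each step is a separate get, so a wide range means many round trips, potentially spread across many partitions.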

Data Modeling in Riak

Multi-Model with Riak KV

• Keys: Create your own using quantum + “dimension”

• Range Queries: Create your own client side multi-get to issue incremental key gets

• Compound Queries: Create more composite keys!

• Data Location: Sometimes inefficient because data is spread across many vnodes / partitions


Time Series Modeling in Riak TS

• Keys: Automatically managed based on your PRIMARY KEY definition as well as the values in those fields

• Range Queries: Use a well known subset of SQL to simply specify a start and end in a WHERE clause which performs a server side multi-get

• Compound Queries: Possible with a wisely chosen composite PRIMARY KEY, although multiple tables may still be necessary

• Data Location: Very efficient data grouping by quantums, families, and series.


Conclusion

Part of the Basho Data Platform

[Diagram: Basho Data Platform architecture (service instances, storage instances including Riak KV, Riak S2 and Riak TS, and core services)]


QUESTIONS?

Spend Time getting to know us

@basho @riconconf

OPEN SOURCE / ENTERPRISE

Basho Data Platform (code)
• Riak KV with parallel extract

Basho Data Platform, Enterprise
• Riak EE with multi-cluster replication
• Spark Leader Election Service

Basho Data Platform Add-ons (code)
• Spark + Spark Connector

Basho Data Platform Add-ons
• Redis + Cache Proxy
• Spark Workers + Spark Master

Download a build / Contact us to get started