a real-time architecture using Hadoop and Storm at Devoxx

@nathan_gs#DV13-#rtbigdata

A real-time architecture using Hadoop and Storm.

Nathan Bijnens


Speaker

Nathan BijnensDataCrunchers@nathan_gs

http://twitter.com/nathan_gs

http://twitter.com/nathan_gs


Our Vision

Big Data

test

Volume


Big Data

test

Velocity


Our Vision

Volume

test

Variety


Computing Trends

Source: Immutability Changes Everything - Pat Helland, RICON2012

Computation (CPUs) Expensive

Disk Storage Expensive

Coordination Easy(Latches Don’t Often Hit)

DRAM Expensive

Computation Cheap (Many Core Computers)

Disk Storage Cheap(Cheap Commodity Disks)

Coordination Hard(Latches Stall a Lot, etc)

DRAM / SSD Getting Cheap

Past Current

http://vimeo.com/52831373


Credits

Nathan Marz•Ex-Backtype & Twitter

•Startup in Stealthmode

•Storm

•Cascalog

•ElephantDB

manning.com/marz

http://manning.com/marz

http://manning.com/marz


A Data System


Not all information is equal.

Some information is derived from other pieces of information.

Data is more than Information


Eventually you will reach the most ‘raw’ form of

information.This is the information you hold true,

simple because it exists.

Let’s call this ‘data’, very similar to ‘event’.

Data is more than Information


Events used to manipulate the master

data.

Events - Before


Today, events are the master data.

Events - After


Let’s store everything.

Data System


Data is Immutable

Events


Data is Time Based

Events


Capturing change traditionally

Person Location

Nathan Antwerp

Geert Dendermonde

John Ghent

Person Location

Nathan Ghent

Geert Dendermonde

John Ghent


Capturing change

Person Location Time

Nathan Antwerp 2005-01-01

Geert Dendermonde 2011-10-08

John Ghent 2010-05-02

Nathan Ghent 2013-02-03

Person Location Timestamp


Geert Dendermonde

2011-10-08



The data you query is often transformed, aggregated, ...

Rarely used in it’s original form.

Query


Query

Query = function ( all data )


Number of people living in each city.

Person Location Time


Geert Dendermonde

2011-10-08


Nathan Ghent 2013-02-03

Location Count

Ghent 2

Dendermonde 1


Query

All Data Query


Query: Precompute

All Data Query

Precomputed View


Layered Architecture

Speed Layer

Batch Layer

Serving Layer


Layered Architecture

HadoopElephan

tDB

Query

Incoming Data

Cassandra


Batch Layer


Batch Layer

HadoopElephan

tDB

Incoming Data


Unrestrained computation.

Batch Layer


No need to De-Normalize.

Batch Layer


Horizontal scalable.

Batch Layer


High Latency.Let’s pretend temporarily that update latency

doesn’t matter.

Batch Layer


Functional computation, based on immutable inputs, is idempotent.

Batch Layer


Stores master copy of data set...

Batch Layer

append only.


Batch Layer


Batch: View generation

Master Dataset

View #1

View #3

View #2

MapReduc

e

MapReduce

MapReduce


1. Take a large data set and divide it into subsets

2. Perform the same function on all subsets

3. Combine the output from all subsets

…

…

Output

MA

PR

ED

UC

E

MapReduce

DoWork() DoWork() DoWork()…


MapReduce


Serialization & Schema

Catch errors as quickly as they happen. Validation on write vs on read.



CSV is actually a serialization language that is just poorly defined.



•Use a format with a schema.

•Thrift

•Avro

•Protobuffers

•Added bonus: it’s faster & uses less space.


Read only database.No random writes required.

Batch View Database


Every iteration produces the Views from scratch.

Batch View Database


Batch View Database

•ElephantDB

•Splout

•Voldemort

•…


Batch Layer

Not yet absorbe

d.Data absorbed into Batch Views

Time Now

We are not done yet…Just a few hours of data.


Speed Layer


Overview

HadoopElephan

tDB

Incoming Data

Cassandra


Stream processing.

Speed Layer


Continuous computation.

Speed Layer


Transactional.

Speed Layer


Storing a limited window of data.

Compensating for the last few hours of data.

Speed Layer


All the complexity is isolated in the Speed layer.

If anything goes wrong, it’s auto-corrected.

Speed Layer


CAP

You have a choice between:

•Availability

•Queries are eventual consistent.

•Consistency

•Queries are consistent.

Consistency

Availability

Partition Toleranc

e


Some algorithms are hard to implement in real time. For those

cases we could estimate the results.

Eventual accuracy


Speed Layer

Incoming Data

Real Time

View 1

Real Time

View 2


Storm

•Message passing.

•Distributed processing.

•Horizontally scalable.

•Incremental algorithms.

•Fast.

•Data in motion.


Storm


Storm

•Tuple

•Stream


Storm

•Spout

•Bolt


Storm

•Grouping


Data Ingestion

•Kafka

•Flume

•Scribe

•*MQ

•…


Speed Layer Views

•The views are stored in Read & Write database.

•Cassandra

•Hbase

•Redis

•MySQL

•ElasticSearch

•…

•Much more complex than a read only view.


Serving Layer


Overview

HadoopElephan

tDB

Query

Incoming Data

Cassandra


Random reads

Serving Layer


This layer queries the Batch & Real Time views and merges it.

Serving Layer


Serving Layer

Real Time Views

Merge

Batch Views


How to query an Average?

Serving Layer


Overview


Overview

HadoopElephan

tDB

Query

Incoming Data

Cassandra


CQRS

Source: martinfowler.com/bliki/CQRS.html – Martin Fowler

http://martinfowler.com/bliki/CQRS.html



http://vimeo.com/52831373


Lambda Architecture


Lambda Architecture

Can discard any view, batch and real time, and just recreate everything

from the master data.


Lambda Architecture

Mistakes are corrected via recomputation.

Write bad data? Remove the data & recompute.

Bug in view generation? Just recompute the view.


Lambda Architecture

Data storage is highly optimized.


Lambda Architecture

Immutability changes everything.


Questions?

Questions?@nathan_gs & #DV13

slideshare.net/nathan_gs

[email protected]

http://slideshare.net/nathan_gs

http://slideshare.net/nathan_gs


DataCrunchers

We enable companies in envisioning, defining and implementing a data strategy.

A one-stop-shop for all your Big Data needs.

The first Big Data Consultancy agency in Belgium.


Thank you

Thank you@nathan_gs

[email protected]

a real-time architecture using Hadoop and Storm at Devoxx

Technology

Transcript of a real-time architecture using Hadoop and Storm at Devoxx