a real-time architecture using Hadoop and Storm at Devoxx

@nathan_gs #DV13-#rtbigdata A real-time architecture using Hadoop and Storm. Nathan Bijnens

Transcript of a real-time architecture using Hadoop and Storm at Devoxx

Page 1: a real-time architecture using Hadoop and Storm at Devoxx

@nathan_gs#DV13-#rtbigdata

A real-time architecture using Hadoop and Storm.

Nathan Bijnens

Page 2: a real-time architecture using Hadoop and Storm at Devoxx

Speaker

Nathan Bijnens

DataCrunchers

@nathan_gs

Page 3: a real-time architecture using Hadoop and Storm at Devoxx

Our Vision

Big Data

Volume

Page 4: a real-time architecture using Hadoop and Storm at Devoxx

Big Data

Velocity

Page 5: a real-time architecture using Hadoop and Storm at Devoxx

Our Vision

Volume

Variety

Page 6: a real-time architecture using Hadoop and Storm at Devoxx

Computing Trends

Past

•Computation (CPUs) Expensive

•Disk Storage Expensive

•DRAM Expensive

•Coordination Easy (Latches Don’t Often Hit)

Current

•Computation Cheap (Many Core Computers)

•Disk Storage Cheap (Cheap Commodity Disks)

•DRAM / SSD Getting Cheap

•Coordination Hard (Latches Stall a Lot, etc.)

Source: Immutability Changes Everything - Pat Helland, RICON2012

Page 7: a real-time architecture using Hadoop and Storm at Devoxx

Credits

Nathan Marz

•Ex-Backtype & Twitter

•Startup in stealth mode

•Storm

•Cascalog

•ElephantDB

manning.com/marz

Page 8: a real-time architecture using Hadoop and Storm at Devoxx

A Data System

Page 9: a real-time architecture using Hadoop and Storm at Devoxx

Not all information is equal.

Some information is derived from other pieces of information.

Data is more than Information

Page 10: a real-time architecture using Hadoop and Storm at Devoxx

Eventually you will reach the most ‘raw’ form of information. This is the information you hold true, simply because it exists.

Let’s call this ‘data’, very similar to ‘event’.

Data is more than Information

Page 11: a real-time architecture using Hadoop and Storm at Devoxx

Events used to manipulate the master data.

Events - Before

Page 12: a real-time architecture using Hadoop and Storm at Devoxx

Today, events are the master data.

Events - After

Page 13: a real-time architecture using Hadoop and Storm at Devoxx

Let’s store everything.

Data System

Page 14: a real-time architecture using Hadoop and Storm at Devoxx

Data is Immutable

Events

Page 15: a real-time architecture using Hadoop and Storm at Devoxx

Data is Time Based

Events

Page 16: a real-time architecture using Hadoop and Storm at Devoxx

Capturing change traditionally

Person  Location
Nathan  Antwerp
Geert   Dendermonde
John    Ghent

Person  Location
Nathan  Ghent
Geert   Dendermonde
John    Ghent

Page 17: a real-time architecture using Hadoop and Storm at Devoxx

Capturing change

Person  Location     Time
Nathan  Antwerp      2005-01-01
Geert   Dendermonde  2011-10-08
John    Ghent        2010-05-02
Nathan  Ghent        2013-02-03

Person  Location     Timestamp
Nathan  Antwerp      2005-01-01
Geert   Dendermonde  2011-10-08
John    Ghent        2010-05-02

Page 18: a real-time architecture using Hadoop and Storm at Devoxx

The data you query is often transformed, aggregated, ...

Rarely used in its original form.

Query

Page 19: a real-time architecture using Hadoop and Storm at Devoxx

Query

Query = function(all data)

Page 20: a real-time architecture using Hadoop and Storm at Devoxx

Number of people living in each city.

Person  Location     Time
Nathan  Antwerp      2005-01-01
Geert   Dendermonde  2011-10-08
John    Ghent        2010-05-02
Nathan  Ghent        2013-02-03

Location     Count
Ghent        2
Dendermonde  1
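
The query on this slide can be sketched as a pure function over the full list of events. This is a minimal Python sketch, assuming a person lives at their most recently recorded location; the names `EVENTS` and `people_per_city` are illustrative and do not come from the talk.

```python
from collections import Counter

# Immutable, time-stamped facts from the slide: (person, location, date).
EVENTS = [
    ("Nathan", "Antwerp", "2005-01-01"),
    ("Geert", "Dendermonde", "2011-10-08"),
    ("John", "Ghent", "2010-05-02"),
    ("Nathan", "Ghent", "2013-02-03"),
]

def people_per_city(events):
    """Query = function(all data): count people by their latest known location."""
    latest = {}
    # ISO dates sort chronologically as strings, so sorting on the date field works.
    for person, location, _date in sorted(events, key=lambda e: e[2]):
        latest[person] = location  # a later event overrides an earlier one
    return dict(Counter(latest.values()))

print(people_per_city(EVENTS))
```

Nothing is ever updated in place: Nathan's move to Ghent is one more event, and the count is derived from the whole history.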

Page 21: a real-time architecture using Hadoop and Storm at Devoxx

Query

All Data Query

Page 22: a real-time architecture using Hadoop and Storm at Devoxx

Query: Precompute

All Data Query

Precomputed View
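
Precomputation splits the work in two: an expensive pass over all data that builds a view, and a cheap lookup against that view at query time. A rough sketch, assuming the people-per-city example; `build_view` and `query` are hypothetical names.

```python
from collections import Counter

def build_view(events):
    """Expensive step: scan all data once and produce a precomputed view."""
    latest = {}
    for person, location, _date in sorted(events, key=lambda e: e[2]):
        latest[person] = location
    return dict(Counter(latest.values()))

def query(view, city):
    """Cheap step: a lookup against the precomputed view, no scan of raw data."""
    return view.get(city, 0)

events = [
    ("Nathan", "Antwerp", "2005-01-01"),
    ("Geert", "Dendermonde", "2011-10-08"),
    ("John", "Ghent", "2010-05-02"),
    ("Nathan", "Ghent", "2013-02-03"),
]
view = build_view(events)
print(query(view, "Ghent"))
```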

Page 23: a real-time architecture using Hadoop and Storm at Devoxx

Layered Architecture

Speed Layer

Batch Layer

Serving Layer

Page 24: a real-time architecture using Hadoop and Storm at Devoxx

Layered Architecture

Hadoop

ElephantDB

Query

Incoming Data

Cassandra

Page 25: a real-time architecture using Hadoop and Storm at Devoxx

Batch Layer

Page 26: a real-time architecture using Hadoop and Storm at Devoxx

Batch Layer

Hadoop

ElephantDB

Incoming Data

Page 27: a real-time architecture using Hadoop and Storm at Devoxx

Unrestrained computation.

Batch Layer

Page 28: a real-time architecture using Hadoop and Storm at Devoxx

No need to De-Normalize.

Batch Layer

Page 29: a real-time architecture using Hadoop and Storm at Devoxx

Horizontally scalable.

Batch Layer

Page 30: a real-time architecture using Hadoop and Storm at Devoxx

High Latency. Let’s pretend temporarily that update latency doesn’t matter.

Batch Layer

Page 31: a real-time architecture using Hadoop and Storm at Devoxx

Functional computation, based on immutable inputs, is idempotent.

Batch Layer

Page 32: a real-time architecture using Hadoop and Storm at Devoxx

Stores master copy of data set... append only.

Batch Layer

Page 33: a real-time architecture using Hadoop and Storm at Devoxx

Batch Layer

Page 34: a real-time architecture using Hadoop and Storm at Devoxx

Batch: View generation

Master Dataset -> MapReduce -> View #1

Master Dataset -> MapReduce -> View #2

Master Dataset -> MapReduce -> View #3

Page 35: a real-time architecture using Hadoop and Storm at Devoxx

1. Take a large data set and divide it into subsets

2. Perform the same function on all subsets

3. Combine the output from all subsets

Output

MapReduce

DoWork() DoWork() DoWork()…
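
The three steps above can be sketched in a few lines of single-process Python. This is only an illustration of the idea: it leaves out the shuffle and sort by key that a real MapReduce framework performs, and `split`, `do_work` and `combine` are made-up names.

```python
from collections import Counter
from functools import reduce

def split(data, n):
    """Step 1: divide the data set into n roughly equal subsets."""
    return [data[i::n] for i in range(n)]

def do_work(subset):
    """Step 2: the same function applied to every subset (here: count locations)."""
    return Counter(subset)

def combine(partials):
    """Step 3: merge the partial outputs into one result."""
    return reduce(lambda a, b: a + b, partials, Counter())

locations = ["Ghent", "Antwerp", "Ghent", "Dendermonde", "Ghent", "Antwerp"]
output = combine(do_work(subset) for subset in split(locations, 3))
print(dict(output))
```

Because `do_work` only looks at its own subset, the calls are independent of each other and could run on different machines.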

Page 36: a real-time architecture using Hadoop and Storm at Devoxx

MapReduce

Page 37: a real-time architecture using Hadoop and Storm at Devoxx

Serialization & Schema

Catch errors as quickly as they happen. Validation on write vs on read.
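
Validation on write can be sketched as a record type that refuses malformed data before it reaches the master data set. This is a plain-Python stand-in for what a schema-based serialization format would enforce; `LocationEvent`, `master_dataset` and `append_event` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class LocationEvent:
    person: str
    location: str
    when: date

    def __post_init__(self):
        # Validate on write: reject malformed records before they are stored.
        if not self.person or not self.location:
            raise ValueError("person and location must be non-empty")
        if not isinstance(self.when, date):
            raise TypeError("when must be a datetime.date")

master_dataset = []

def append_event(person, location, when):
    """Append-only write path; an invalid event raises and is never stored."""
    master_dataset.append(LocationEvent(person, location, when))

append_event("Nathan", "Ghent", date(2013, 2, 3))
try:
    append_event("", "Ghent", date(2013, 2, 3))
except ValueError as err:
    print("rejected:", err)
```

With validation on read, the empty name would sit in storage until some later job tripped over it.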

Page 38: a real-time architecture using Hadoop and Storm at Devoxx

Serialization & Schema

CSV is actually a serialization language that is just poorly defined.

Page 39: a real-time architecture using Hadoop and Storm at Devoxx

Serialization & Schema

•Use a format with a schema.

•Thrift

•Avro

•Protocol Buffers

•Added bonus: it’s faster & uses less space.

Page 40: a real-time architecture using Hadoop and Storm at Devoxx

Read-only database. No random writes required.

Batch View Database

Page 41: a real-time architecture using Hadoop and Storm at Devoxx

Every iteration produces the Views from scratch.

Batch View Database

Page 42: a real-time architecture using Hadoop and Storm at Devoxx

Batch View Database

•ElephantDB

•Splout

•Voldemort

•…

Page 43: a real-time architecture using Hadoop and Storm at Devoxx

Batch Layer

Data absorbed into Batch Views

Not yet absorbed.

Time

Now

We are not done yet… Just a few hours of data.

Page 44: a real-time architecture using Hadoop and Storm at Devoxx

Speed Layer

Page 45: a real-time architecture using Hadoop and Storm at Devoxx

Overview

Hadoop

ElephantDB

Incoming Data

Cassandra

Page 46: a real-time architecture using Hadoop and Storm at Devoxx

Stream processing.

Speed Layer

Page 47: a real-time architecture using Hadoop and Storm at Devoxx

Continuous computation.

Speed Layer

Page 48: a real-time architecture using Hadoop and Storm at Devoxx

Transactional.

Speed Layer

Page 49: a real-time architecture using Hadoop and Storm at Devoxx

Storing a limited window of data.

Compensating for the last few hours of data.

Speed Layer
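
A limited window could look roughly like this: the real-time view only holds events that the batch layer has not absorbed yet and drops them once a batch run has caught up. The class `RealtimeView` and the integer timestamps are assumptions made for this sketch.

```python
from collections import Counter

class RealtimeView:
    """Keeps counts only for events newer than the last completed batch run."""

    def __init__(self):
        self.events = []  # (timestamp, key) pairs not yet covered by a batch view

    def add(self, timestamp, key):
        self.events.append((timestamp, key))

    def expire(self, batch_covers_until):
        # Once the batch layer has absorbed data up to this time, drop it here.
        self.events = [(t, k) for t, k in self.events if t > batch_covers_until]

    def counts(self):
        return Counter(k for _t, k in self.events)

view = RealtimeView()
view.add(100, "Ghent")
view.add(200, "Antwerp")
view.add(300, "Ghent")
view.expire(batch_covers_until=150)
print(dict(view.counts()))
```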

Page 50: a real-time architecture using Hadoop and Storm at Devoxx

All the complexity is isolated in the Speed layer.

If anything goes wrong, it’s auto-corrected.

Speed Layer

Page 51: a real-time architecture using Hadoop and Storm at Devoxx

CAP

You have a choice between:

•Availability

•Queries are eventually consistent.

•Consistency

•Queries are consistent.

Consistency

Availability

Partition Tolerance

Page 52: a real-time architecture using Hadoop and Storm at Devoxx

Some algorithms are hard to implement in real time. For those cases we could estimate the results.

Eventual accuracy
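
The talk does not name a specific estimator. As one illustration, counting distinct items exactly needs memory proportional to the number of items, while a k-minimum-values (KMV) sketch keeps only the `k` smallest hash values and estimates the count from them. The function names and the choice of KMV are assumptions of this example; the batch layer would later replace the estimate with the exact figure.

```python
import hashlib
import heapq

def _hash01(item):
    """Map an item to a pseudo-random float in [0, 1) using SHA-1."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def estimate_distinct(items, k=256):
    """Estimate the number of distinct items from the k smallest hash values."""
    kept = set()  # the k smallest distinct hash values seen so far
    heap = []     # negated copies of `kept`, so -heap[0] is the largest kept value
    for item in items:
        h = _hash01(item)
        if h in kept:
            continue
        if len(kept) < k:
            kept.add(h)
            heapq.heappush(heap, -h)
        elif h < -heap[0]:
            largest = -heapq.heappushpop(heap, -h)
            kept.discard(largest)
            kept.add(h)
    if len(kept) < k:
        return len(kept)  # fewer than k distinct values: the count is exact
    return int((k - 1) / -heap[0])

stream = [i % 10000 for i in range(50000)]  # 10,000 distinct values, each seen 5 times
print(estimate_distinct(stream))
```

Memory use is bounded by `k` regardless of stream length; a larger `k` trades memory for a tighter estimate.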

Page 53: a real-time architecture using Hadoop and Storm at Devoxx

Speed Layer

Incoming Data

Real Time View 1

Real Time View 2

Page 54: a real-time architecture using Hadoop and Storm at Devoxx

Storm

•Message passing.

•Distributed processing.

•Horizontally scalable.

•Incremental algorithms.

•Fast.

•Data in motion.

Page 55: a real-time architecture using Hadoop and Storm at Devoxx

Storm

Page 56: a real-time architecture using Hadoop and Storm at Devoxx

Storm

•Tuple

•Stream

Page 57: a real-time architecture using Hadoop and Storm at Devoxx

Storm

•Spout

•Bolt

Page 58: a real-time architecture using Hadoop and Storm at Devoxx

Storm

•Grouping
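
The Storm concepts of tuple, stream, spout, bolt and grouping can be imitated in a toy, single-process Python simulation. This is not Storm's API; every name below (`location_spout`, `CountBolt`, `fields_grouping`) is invented for illustration, and a real topology would run the tasks in parallel across a cluster.

```python
from collections import Counter
import zlib

def location_spout():
    """Spout: a source that emits a stream of tuples."""
    for location in ["Ghent", "Antwerp", "Ghent", "Dendermonde", "Ghent"]:
        yield (location,)  # a tuple with a single field

class CountBolt:
    """Bolt: consumes tuples and updates a running count incrementally."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        self.counts[tup[0]] += 1

def fields_grouping(tup, num_tasks):
    """Grouping: tuples with the same field value always reach the same bolt task."""
    return zlib.crc32(tup[0].encode("utf-8")) % num_tasks

tasks = [CountBolt() for _ in range(2)]
for tup in location_spout():
    tasks[fields_grouping(tup, len(tasks))].execute(tup)

total = sum((task.counts for task in tasks), Counter())
print(dict(total))
```

Grouping by field value means each task owns the full count for its keys, so no coordination between tasks is needed.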

Page 59: a real-time architecture using Hadoop and Storm at Devoxx

Data Ingestion

•Kafka

•Flume

•Scribe

•*MQ

•…

Page 60: a real-time architecture using Hadoop and Storm at Devoxx

Speed Layer Views

•The views are stored in a Read & Write database.

•Cassandra

•HBase

•Redis

•MySQL

•ElasticSearch

•…

•Much more complex than a read-only view.

Page 61: a real-time architecture using Hadoop and Storm at Devoxx

Serving Layer

Page 62: a real-time architecture using Hadoop and Storm at Devoxx

Overview

Hadoop

ElephantDB

Query

Incoming Data

Cassandra

Page 63: a real-time architecture using Hadoop and Storm at Devoxx

Random reads

Serving Layer

Page 64: a real-time architecture using Hadoop and Storm at Devoxx

This layer queries the Batch & Real Time views and merges them.

Serving Layer
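
For an additive metric such as an event count, the merge can be as simple as a sum. The sketch below assumes the two views cover disjoint time ranges (batch up to its last run, real time after it); the view contents and the name `serve_count` are hypothetical.

```python
from collections import Counter

# Hypothetical event counts per city.
batch_view = Counter({"Ghent": 2, "Dendermonde": 1})  # precomputed by the batch layer
realtime_view = Counter({"Ghent": 1, "Antwerp": 1})   # only events since the last batch run

def serve_count(city):
    """Merge at query time: batch result plus the real-time delta."""
    return batch_view[city] + realtime_view[city]

print(serve_count("Ghent"))
```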

Page 65: a real-time architecture using Hadoop and Storm at Devoxx

Serving Layer

Real Time Views

Merge

Batch Views

Page 66: a real-time architecture using Hadoop and Storm at Devoxx

How to query an Average?

Serving Layer
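
One common answer, sketched here as an assumption rather than as the speaker's own solution: two averages cannot be merged directly, so each view stores a sum and a count, and the division happens only at query time. The numbers and the name `merge_average` are made up.

```python
def merge_average(batch, realtime):
    """Each view holds (sum, count); merge the parts, then divide once."""
    total = batch[0] + realtime[0]
    count = batch[1] + realtime[1]
    return total / count if count else None

batch = (900.0, 90)     # hypothetical: batch average is 10.0
realtime = (200.0, 10)  # hypothetical: real-time average is 20.0
print(merge_average(batch, realtime))  # 11.0, not the naive (10.0 + 20.0) / 2 = 15.0
```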

Page 67: a real-time architecture using Hadoop and Storm at Devoxx

Overview

Page 68: a real-time architecture using Hadoop and Storm at Devoxx

Overview

Hadoop

ElephantDB

Query

Incoming Data

Cassandra

Page 69: a real-time architecture using Hadoop and Storm at Devoxx

CQRS

Source: martinfowler.com/bliki/CQRS.html – Martin Fowler

Page 70: a real-time architecture using Hadoop and Storm at Devoxx

Lambda Architecture

Page 71: a real-time architecture using Hadoop and Storm at Devoxx

Lambda Architecture

Can discard any view, batch and real time, and just recreate everything from the master data.

Page 72: a real-time architecture using Hadoop and Storm at Devoxx

Lambda Architecture

Mistakes are corrected via recomputation.

Write bad data? Remove the data & recompute.

Bug in view generation? Just recompute the view.
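
Correction by recomputation can be sketched as follows: because a view is a pure function of the master data, fixing a mistake means fixing the input (or the function) and running it again. The records and the name `build_view` are illustrative.

```python
from collections import Counter

def build_view(master):
    """Rebuild the view from scratch; there is no incremental state to repair."""
    return Counter(location for _person, location in master)

# Hypothetical master data containing one bad record (misspelled city).
master = [("Nathan", "Ghent"), ("Geert", "Dendermonde"), ("John", "Ghnet")]
view = build_view(master)

# Write bad data? Remove the data & recompute.
master = [record for record in master if record != ("John", "Ghnet")]
view = build_view(master)
print(dict(view))
```

A bug in `build_view` itself is handled the same way: fix the function and rerun it over the unchanged master data.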

Page 73: a real-time architecture using Hadoop and Storm at Devoxx

Lambda Architecture

Data storage is highly optimized.

Page 74: a real-time architecture using Hadoop and Storm at Devoxx

Lambda Architecture

Immutability changes everything.

Page 75: a real-time architecture using Hadoop and Storm at Devoxx

Questions?

@nathan_gs & #DV13

slideshare.net/nathan_gs

[email protected]

Page 76: a real-time architecture using Hadoop and Storm at Devoxx

DataCrunchers

We enable companies to envision, define and implement a data strategy.

A one-stop shop for all your Big Data needs.

The first Big Data Consultancy agency in Belgium.

Page 77: a real-time architecture using Hadoop and Storm at Devoxx

Thank you

@nathan_gs

[email protected]