a real-time architecture using Hadoop and Storm at Devoxx
-
Upload
nathan-bijnens -
Category
Technology
-
view
9.376 -
download
11
description
Transcript of a real-time architecture using Hadoop and Storm at Devoxx
@nathan_gs#DV13-#rtbigdata
A real-time architecture using Hadoop and Storm.
Nathan Bijnens
@nathan_gs#DV13-#rtbigdata
Speaker
Nathan BijnensDataCrunchers@nathan_gs
@nathan_gs#DV13-#rtbigdata
Our Vision
Big Data
test
Volume
@nathan_gs#DV13-#rtbigdata
Big Data
test
Velocity
@nathan_gs#DV13-#rtbigdata
Our Vision
Volume
test
Variety
@nathan_gs#DV13-#rtbigdata
Computing Trends
Source: Immutability Changes Everything - Pat Helland, RICON2012
Computation (CPUs) Expensive
Disk Storage Expensive
Coordination Easy(Latches Don’t Often Hit)
DRAM Expensive
Computation Cheap (Many Core Computers)
Disk Storage Cheap(Cheap Commodity Disks)
Coordination Hard(Latches Stall a Lot, etc)
DRAM / SSD Getting Cheap
Past Current
@nathan_gs#DV13-#rtbigdata
Credits
Nathan Marz•Ex-Backtype & Twitter
•Startup in Stealthmode
•Storm
•Cascalog
•ElephantDB
manning.com/marz
@nathan_gs#DV13-#rtbigdata
A Data System
@nathan_gs#DV13-#rtbigdata
Not all information is equal.
Some information is derived from other pieces of information.
Data is more than Information
@nathan_gs#DV13-#rtbigdata
Eventually you will reach the most ‘raw’ form of
information.This is the information you hold true,
simple because it exists.
Let’s call this ‘data’, very similar to ‘event’.
Data is more than Information
@nathan_gs#DV13-#rtbigdata
Events used to manipulate the master
data.
Events - Before
@nathan_gs#DV13-#rtbigdata
Today, events are the master data.
Events - After
@nathan_gs#DV13-#rtbigdata
Let’s store everything.
Data System
@nathan_gs#DV13-#rtbigdata
Data is Immutable
Events
@nathan_gs#DV13-#rtbigdata
Data is Time Based
Events
@nathan_gs#DV13-#rtbigdata
Capturing change traditionally
Person Location
Nathan Antwerp
Geert Dendermonde
John Ghent
Person Location
Nathan Ghent
Geert Dendermonde
John Ghent
@nathan_gs#DV13-#rtbigdata
Capturing change
Person Location Time
Nathan Antwerp 2005-01-01
Geert Dendermonde 2011-10-08
John Ghent 2010-05-02
Nathan Ghent 2013-02-03
Person Location Timestamp
Nathan Antwerp 2005-01-01
Geert Dendermonde
2011-10-08
John Ghent 2010-05-02
@nathan_gs#DV13-#rtbigdata
The data you query is often transformed, aggregated, ...
Rarely used in it’s original form.
Query
@nathan_gs#DV13-#rtbigdata
Query
Query = function ( all data )
@nathan_gs#DV13-#rtbigdata
Number of people living in each city.
Person Location Time
Nathan Antwerp 2005-01-01
Geert Dendermonde
2011-10-08
John Ghent 2010-05-02
Nathan Ghent 2013-02-03
Location Count
Ghent 2
Dendermonde 1
@nathan_gs#DV13-#rtbigdata
Query
All Data Query
@nathan_gs#DV13-#rtbigdata
Query: Precompute
All Data Query
Precomputed View
@nathan_gs#DV13-#rtbigdata
Layered Architecture
Speed Layer
Batch Layer
Serving Layer
@nathan_gs#DV13-#rtbigdata
Layered Architecture
HadoopElephan
tDB
Query
Incoming Data
Cassandra
@nathan_gs#DV13-#rtbigdata
Batch Layer
@nathan_gs#DV13-#rtbigdata
Batch Layer
HadoopElephan
tDB
Incoming Data
@nathan_gs#DV13-#rtbigdata
Unrestrained computation.
Batch Layer
@nathan_gs#DV13-#rtbigdata
No need to De-Normalize.
Batch Layer
@nathan_gs#DV13-#rtbigdata
Horizontal scalable.
Batch Layer
@nathan_gs#DV13-#rtbigdata
High Latency.Let’s pretend temporarily that update latency
doesn’t matter.
Batch Layer
@nathan_gs#DV13-#rtbigdata
Functional computation, based on immutable inputs, is idempotent.
Batch Layer
@nathan_gs#DV13-#rtbigdata
Stores master copy of data set...
Batch Layer
append only.
@nathan_gs#DV13-#rtbigdata
Batch Layer
@nathan_gs#DV13-#rtbigdata
Batch: View generation
Master Dataset
View #1
View #3
View #2
MapReduc
e
MapReduce
MapReduce
@nathan_gs#DV13-#rtbigdata
1. Take a large data set and divide it into subsets
2. Perform the same function on all subsets
3. Combine the output from all subsets
…
…
Output
MA
PR
ED
UC
E
MapReduce
DoWork() DoWork() DoWork()…
@nathan_gs#DV13-#rtbigdata
MapReduce
@nathan_gs#DV13-#rtbigdata
Serialization & Schema
Catch errors as quickly as they happen. Validation on write vs on read.
@nathan_gs#DV13-#rtbigdata
Serialization & Schema
CSV is actually a serialization language that is just poorly defined.
@nathan_gs#DV13-#rtbigdata
Serialization & Schema
•Use a format with a schema.
•Thrift
•Avro
•Protobuffers
•Added bonus: it’s faster & uses less space.
@nathan_gs#DV13-#rtbigdata
Read only database.No random writes required.
Batch View Database
@nathan_gs#DV13-#rtbigdata
Every iteration produces the Views from scratch.
Batch View Database
@nathan_gs#DV13-#rtbigdata
Batch View Database
•ElephantDB
•Splout
•Voldemort
•…
@nathan_gs#DV13-#rtbigdata
Batch Layer
Not yet absorbe
d.Data absorbed into Batch Views
Time Now
We are not done yet…Just a few hours of data.
@nathan_gs#DV13-#rtbigdata
Speed Layer
@nathan_gs#DV13-#rtbigdata
Overview
HadoopElephan
tDB
Incoming Data
Cassandra
@nathan_gs#DV13-#rtbigdata
Stream processing.
Speed Layer
@nathan_gs#DV13-#rtbigdata
Continuous computation.
Speed Layer
@nathan_gs#DV13-#rtbigdata
Transactional.
Speed Layer
@nathan_gs#DV13-#rtbigdata
Storing a limited window of data.
Compensating for the last few hours of data.
Speed Layer
@nathan_gs#DV13-#rtbigdata
All the complexity is isolated in the Speed layer.
If anything goes wrong, it’s auto-corrected.
Speed Layer
@nathan_gs#DV13-#rtbigdata
CAP
You have a choice between:
•Availability
•Queries are eventual consistent.
•Consistency
•Queries are consistent.
Consistency
Availability
Partition Toleranc
e
@nathan_gs#DV13-#rtbigdata
Some algorithms are hard to implement in real time. For those
cases we could estimate the results.
Eventual accuracy
@nathan_gs#DV13-#rtbigdata
Speed Layer
Incoming Data
Real Time
View 1
Real Time
View 2
@nathan_gs#DV13-#rtbigdata
Storm
•Message passing.
•Distributed processing.
•Horizontally scalable.
•Incremental algorithms.
•Fast.
•Data in motion.
@nathan_gs#DV13-#rtbigdata
Storm
@nathan_gs#DV13-#rtbigdata
Storm
•Tuple
•Stream
@nathan_gs#DV13-#rtbigdata
Storm
•Spout
•Bolt
@nathan_gs#DV13-#rtbigdata
Storm
•Grouping
@nathan_gs#DV13-#rtbigdata
Data Ingestion
•Kafka
•Flume
•Scribe
•*MQ
•…
@nathan_gs#DV13-#rtbigdata
Speed Layer Views
•The views are stored in Read & Write database.
•Cassandra
•Hbase
•Redis
•MySQL
•ElasticSearch
•…
•Much more complex than a read only view.
@nathan_gs#DV13-#rtbigdata
Serving Layer
@nathan_gs#DV13-#rtbigdata
Overview
HadoopElephan
tDB
Query
Incoming Data
Cassandra
@nathan_gs#DV13-#rtbigdata
Random reads
Serving Layer
@nathan_gs#DV13-#rtbigdata
This layer queries the Batch & Real Time views and merges it.
Serving Layer
@nathan_gs#DV13-#rtbigdata
Serving Layer
Real Time Views
Merge
Batch Views
@nathan_gs#DV13-#rtbigdata
How to query an Average?
Serving Layer
@nathan_gs#DV13-#rtbigdata
Overview
@nathan_gs#DV13-#rtbigdata
Overview
HadoopElephan
tDB
Query
Incoming Data
Cassandra
@nathan_gs#DV13-#rtbigdata
CQRS
Source: martinfowler.com/bliki/CQRS.html – Martin Fowler
@nathan_gs#DV13-#rtbigdata
Lambda Architecture
@nathan_gs#DV13-#rtbigdata
Lambda Architecture
Can discard any view, batch and real time, and just recreate everything
from the master data.
@nathan_gs#DV13-#rtbigdata
Lambda Architecture
Mistakes are corrected via recomputation.
Write bad data? Remove the data & recompute.
Bug in view generation? Just recompute the view.
@nathan_gs#DV13-#rtbigdata
Lambda Architecture
Data storage is highly optimized.
@nathan_gs#DV13-#rtbigdata
Lambda Architecture
Immutability changes everything.
@nathan_gs#DV13-#rtbigdata
Questions?
Questions?@nathan_gs & #DV13
slideshare.net/nathan_gs
@nathan_gs#DV13-#rtbigdata
DataCrunchers
We enable companies in envisioning, defining and implementing a data strategy.
A one-stop-shop for all your Big Data needs.
The first Big Data Consultancy agency in Belgium.