The three generations of Big Data processing

Rubén Casado [email protected]

Description

Big Data is often characterized by the 3 "Vs": variety, volume and velocity. While variety refers to the nature of the information (multiple sources, schema-less data, etc.), volume and velocity refer to processing issues that have to be addressed by different processing paradigms. When the volume of data is larger than conventional relational database infrastructures can cope with, the classic solution is massively parallel batch processing: a group of transactions is collected over a period of time, then the data is entered and processed and the batch results are produced. Batch processing is an efficient way of handling high volumes of data, but several applications instead require real-time processing of data streams from heterogeneous sources: a continual input, processing and output of data, where each item must be handled within a small time window (near real time). Low latency is the main goal of this paradigm, with application domains such as smart cities, entertainment or disaster management. Batch processing provides strong results since it can use more data and, for example, train better predictive models, but it is not feasible for domains where a low response time is critical. Real-time processing solves the latency issue, but the information it can analyze is limited in order to achieve that low latency. Many domains require the benefits of both batch and real-time processing approaches, so a new processing paradigm is needed: the hybrid model. To obtain a complete result, the batch and real-time results must be queried and the results merged together. Synchronization, result composition and other non-trivial issues have to be addressed at this stage, which can be considered the key element of the hybrid model. This talk will review the evolution of Big Data processing techniques over time, identify the main milestones (both technologies and scientific publications) and give an introduction to the key technologies needed to understand the complex Big Data processing domain.

Transcript of The three generations of Big Data processing

Page 1: The three generations of Big Data processing

The three generations of

Big Data processing

Rubén Casado

[email protected]

Page 2: The three generations of Big Data processing

1. Big Data processing

2. Batch processing

3. Real-time processing

4. Hybrid computation model

5. Conclusions

Agenda

Page 3: The three generations of Big Data processing

About me :-)

Page 4: The three generations of Big Data processing

PhD in Software Engineering

MSc in Computer Science

BSc in Computer Science

Academics

Work

Experience

Page 5: The three generations of Big Data processing

About Treelogic

Page 6: The three generations of Big Data processing

Treelogic is an R&D intensive company with the mission of creating, boosting, developing and adapting scientific and technological knowledge to improve quality standards in our daily life

Page 7: The three generations of Big Data processing

TREELOGIC – Distributor and Sales

Page 8: The three generations of Big Data processing

International Projects

National Projects

Regional Projects

R&D Management System

Internal Projects

Research Lines

Computer Vision

Big Data

Terahertz technology

Data science

Social Media Analysis

Semantics

Security & Safety

Justice

Health

Transport

Financial services

ICT tailored solutions

Solutions

R&D

Page 9: The three generations of Big Data processing

7 ongoing FP7 projects

ICT, SEC, OCEAN

Coordinating 5 of them

3 ongoing Eurostars projects

Coordinating all of them

Page 10: The three generations of Big Data processing

RESEARCH & INNOVATION

7 years' experience in R&D projects

Page 11: The three generations of Big Data processing

www.datadopter.com

Page 12: The three generations of Big Data processing

1. Big Data processing

2. Batch processing

3. Real-time processing

4. Hybrid computation model

5. Conclusions

Agenda

Page 13: The three generations of Big Data processing

A massive volume of both structured and unstructured data that is so large that it is difficult to process with traditional database and software techniques

What is Big Data?

Page 14: The three generations of Big Data processing

Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization

How is Big Data defined?

- Gartner IT Glossary -

Page 15: The three generations of Big Data processing

3 problems

Volume

Variety

Velocity

Page 16: The three generations of Big Data processing

3 solutions

Batch processing

NoSQL

Real-time processing

Page 17: The three generations of Big Data processing

3 solutions

Batch processing

NoSQL

Real-time processing

Page 18: The three generations of Big Data processing

• Scalable

• Large amount of static data

• Distributed

• Parallel

• Fault tolerant

• High latency

Batch processing

Volume

Page 19: The three generations of Big Data processing

• Low latency

• Continuous unbounded

streams of data

• Distributed

• Parallel

• Fault-tolerant

Real-time processing

Velocity

Page 20: The three generations of Big Data processing

• Low latency

• Massive data + Streaming data

• Scalable

• Combine batch and real-time results

Hybrid computation model

Volume Velocity

Page 21: The three generations of Big Data processing

All data → Batch processing → Batch results

New data → Real-time processing → Stream results

Batch results + Stream results → Combination → Final results

Hybrid computation model
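The combination stage can be pictured with a hedged sketch (all class and field names here are hypothetical, not from the talk): the batch view covers everything up to the last batch run, while the real-time view covers only the data that has arrived since.

// Hypothetical query-time merge of the batch and real-time views
class SumCount {
    double sum;   // accumulated measurement, e.g. SO2
    long count;
}

class HybridQuery {
    java.util.Map<String, SumCount> batchView;    // recomputed periodically from all data
    java.util.Map<String, SumCount> realtimeView; // incremental, covers only new data

    double average(String key) {
        SumCount b = batchView.get(key);
        SumCount r = realtimeView.get(key);
        // Merging the two partial results yields the complete, up-to-date answer
        return (b.sum + r.sum) / (b.count + r.count);
    }
}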

Page 22: The three generations of Big Data processing

Batch processing

Large amounts of static data

Scalable solution

Volume

Real-time processing

Computing streaming data

Low latency

Velocity

Hybrid computation

Lambda Architecture

Volume + Velocity

Processing Paradigms

Inception: 2003
1st Generation (batch processing): 2006
2nd Generation (real-time processing): 2010
3rd Generation (hybrid computation): 2014

Page 23: The three generations of Big Data processing

Batch / Real-Time / Hybrid: 10 years of Big Data processing technologies

2003: Google publishes "The Google File System"
2004: Google publishes "MapReduce: Simplified Data Processing on Large Clusters"
2005: Doug Cutting starts developing Hadoop
2006: Yahoo! starts working on Hadoop
2008: Apache Hadoop is in production
2009: Facebook creates Hive; Yahoo! creates Pig
2010: Yahoo! creates S4; Cloudera presents Flume
2011: Nathan Marz creates Storm; LinkedIn presents Kafka
2012: Nathan Marz defines the Lambda Architecture
2013: Google publishes "MillWheel: Fault-Tolerant Stream Processing at Internet Scale"; LinkedIn presents Samza

Page 24: The three generations of Big Data processing

Processing Pipeline: DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS

Page 25: The three generations of Big Data processing

Static stations and mobile sensors in Asturias sending streaming data

Historical data of > 10 years

Monitoring, trend identification, predictions

Air Quality case study

Page 26: The three generations of Big Data processing

1. Big Data processing overview

2. Batch processing

3. Real-time processing

4. Hybrid computation model

5. Conclusions

Agenda

Page 27: The three generations of Big Data processing

Batch processing technologies

DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS

o Data acquisition: HDFS commands, Sqoop, Flume, Scribe
o Data storage: HDFS, HBase
o Data analysis: MapReduce, Hive, Pig, Cascading, Spark, Shark

Page 28: The three generations of Big Data processing

• Import to HDFS

hadoop dfs -copyFromLocal <path-to-local> <path-to-remote>

hadoop dfs -copyFromLocal /home/hduser/AirQuality/ /hdfs/AirQuality/

HDFS commands | DATA ACQUISITION | BATCH

Page 29: The three generations of Big Data processing

• Tool designed for transferring data between HDFS/HBase and structured datastores

• Based on MapReduce

• Includes connectors for multiple databases:

o MySQL
o PostgreSQL
o Oracle
o SQL Server
o DB2
o Generic JDBC connector

• Java API

Sqoop | DATA ACQUISITION | BATCH

Page 30: The three generations of Big Data processing

1) Import data from the database to HDFS:

sqoop import --all-tables --connect jdbc:mysql://localhost/testDatabase --target-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1

2) Analyze the data (Hadoop)

3) Export the results to the database:

sqoop export --connect jdbc:mysql://localhost/testDatabase --export-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1

Sqoop | DATA ACQUISITION | BATCH

Page 31: The three generations of Big Data processing

• Service for collecting, aggregating, and moving large amounts of log data

• Simple and flexible architecture based on streaming data flows

• Reliability, scalability, extensibility, manageability

• Supported log stream types:

o Avro
o Syslog
o Netcat

Flume | DATA ACQUISITION | BATCH

Page 32: The three generations of Big Data processing

Sources: Avro, Thrift, Exec, JMS, NetCat, Syslog TCP/UDP, HTTP, Custom
Channels: Memory, JDBC, File
Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, Null, HBase, Custom

• Architecture:

o Source: waits for events
o Channel: stores the information until it is consumed by the sink
o Sink: sends the information to another agent or system

Flume | DATA ACQUISITION | BATCH
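As a sketch of how those three pieces are wired together (agent name, port and paths here are hypothetical, not from the talk), a minimal Flume configuration might look like this:

agent.sources = syslog-src
agent.channels = mem-ch
agent.sinks = hdfs-sink

# Source: waits for syslog events on a TCP port
agent.sources.syslog-src.type = syslogtcp
agent.sources.syslog-src.host = 0.0.0.0
agent.sources.syslog-src.port = 5140
agent.sources.syslog-src.channels = mem-ch

# Channel: buffers events in memory until the sink consumes them
agent.channels.mem-ch.type = memory

# Sink: writes the events into HDFS
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/AirQuality/
agent.sinks.hdfs-sink.channel = mem-ch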

Page 33: The three generations of Big Data processing

Stations send the information to the servers. Flume collects this information and moves it into HDFS for further analysis

Air quality syslogs

Flume | DATA ACQUISITION | BATCH

Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";

Page 34: The three generations of Big Data processing

• Server for aggregating log data streamed in real time from a large number of servers

• A Scribe server runs on every node in the system, configured to aggregate messages and send them to a central Scribe server (or servers) in larger groups

• The central Scribe server(s) can write the messages to the files that are their final destination

Scribe | DATA ACQUISITION | BATCH

Page 35: The three generations of Big Data processing

category = 'mobile'

# '1; 43.5298; -5.6734; 2000-01-01; 23; 89; 1.97; …'
message = sensor_log.readLine()
log_entry = scribe.LogEntry(category, message)

# Create a Scribe client
client = scribe.Client(iprot=protocol, oprot=protocol)
transport.open()
result = client.Log(messages=[log_entry])
transport.close()

• Sending a sensor message to a Scribe Server

Scribe | DATA ACQUISITION | BATCH

Page 36: The three generations of Big Data processing

• Distributed file system for Hadoop

• Master-slave architecture (NameNode / DataNodes)

o NameNode: manages the directory tree and regulates access to files by clients
o DataNodes: store the data

• Files are split into blocks of the same size, and these blocks are stored and replicated in a set of DataNodes

HDFS | DATA STORAGE | BATCH
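The same import shown earlier with the shell command can be done programmatically; a hedged sketch with the FileSystem Java API (the NameNode address is a placeholder):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsImport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the NameNode (placeholder address)
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        // Equivalent of: hadoop dfs -copyFromLocal <local> <remote>
        fs.copyFromLocalFile(new Path("/home/hduser/AirQuality/"),
                             new Path("/hdfs/AirQuality/"));
        fs.close();
    }
}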

Page 37: The three generations of Big Data processing

• Open-source, non-relational, distributed, column-oriented database modeled after Google's BigTable

• Random, real-time read/write access to the data

• Not a relational database:

o very light "schema"

• Rows are stored in sorted order

HBase | DATA STORAGE | BATCH
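For flavor, a hedged sketch with the HBase Java client of that era (table, row key and column names are hypothetical, not from the talk):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class AirQualityHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "air_quality");   // hypothetical table

        // Row key: station id + date; one column family "measures"
        Put put = new Put(Bytes.toBytes("1-2001-01-01"));
        put.add(Bytes.toBytes("measures"), Bytes.toBytes("SO2"), Bytes.toBytes("7"));
        table.put(put);

        // Random, real-time read of the same row
        Get get = new Get(Bytes.toBytes("1-2001-01-01"));
        Result result = table.get(get);
        System.out.println(Bytes.toString(
            result.getValue(Bytes.toBytes("measures"), Bytes.toBytes("SO2"))));
        table.close();
    }
}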

Page 38: The three generations of Big Data processing

• Framework for processing large amounts of data in parallel across a distributed cluster

• Loosely inspired by the classic Divide and Conquer (D&C) strategy

• The developer has to implement Map and Reduce functions:

o Map: takes the input, partitions it into smaller sub-problems and distributes them to worker nodes as <K, V> pairs
o Reduce: collects the <K, List(V)> pairs and generates the results

MapReduce | DATA ANALYTICS | BATCH

Page 39: The three generations of Big Data processing

• Design patterns:

o Joins:
- Reduce-side join
- Replicated join
- Semi-join

o Sorting:
- Secondary sort
- Total order sort

o Filtering

o Statistics:
- AVG
- VAR
- Count
- …

o Top-K

o Binning

o …

MapReduce

DATA ANALYTICS | BATCH

Page 40: The three generations of Big Data processing

• Obtain the SO2 average of each station

MapReduce

Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";

DATA ANALYTICS | BATCH

Page 41: The three generations of Big Data processing

Input Data → Mapper / Mapper / Mapper → Shuffling

Mapper output, as <Station_ID, SO2_value> pairs:
<1, 6> <1, 2> <3, 1> <1, 9> <3, 9> <2, 6> <2, 6> <1, 6> <2, 0> <2, 8> <1, 2> <3, 9> …

MapReduce | DATA ANALYTICS | BATCH

• Mappers read the records and emit the SO2 value as <Station_ID, SO2_value>

Page 42: The three generations of Big Data processing

Shuffling → Reducer (sum, then divide) → results

• Each reducer receives <Station_ID, [SO2_1, SO2_2, …, SO2_n]> and computes the average for the station:

Station_ID, AVG_SO2
1, 2.013
2, 2.695
3, 3.562

MapReduce | DATA ANALYTICS | BATCH
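As a rough illustration of the two functions just described, a hedged Java sketch (field positions follow the CSV sample above; parsing details are assumptions):

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SO2Average {

    // Emits <station_id, SO2 value> for every input line
    public static class SO2Mapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(";");
            String station = fields[0].replace("\"", "").trim();
            double so2 = Double.parseDouble(fields[5].replace("\"", "").trim());
            ctx.write(new Text(station), new DoubleWritable(so2));
        }
    }

    // Receives <station_id, [SO2 values]> and emits the average
    public static class SO2Reducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text station, Iterable<DoubleWritable> values, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (DoubleWritable v : values) {
                sum += v.get();
                count++;
            }
            ctx.write(station, new DoubleWritable(sum / count));
        }
    }
}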

Page 43: The three generations of Big Data processing

Hive

• Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets

• Abstraction layer on top of MapReduce

• SQL-like language called HiveQL

• Metastore: central repository of Hive metadata

DATA ANALYTICS | BATCH

Page 44: The three generations of Big Data processing

CREATE TABLE air_quality (Estacion INT, Titulo STRING, latitud DOUBLE, longitud DOUBLE, Fecha STRING, SO2 INT, NO INT, CO FLOAT, …)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA INPATH '/CalidadAire_Gijon' OVERWRITE INTO TABLE air_quality;

Hive

• Obtain the SO2 average of each station

SELECT Titulo, avg(SO2)
FROM air_quality
GROUP BY Titulo;

DATA ANALYTICS | BATCH

Page 45: The three generations of Big Data processing

• Platform for analyzing large data sets

• High-level language for expressing data analysis programs: Pig Latin, a data-flow programming language

• Abstraction layer on top of MapReduce

• Procedural language

Pig | DATA ANALYTICS | BATCH

Page 46: The three generations of Big Data processing

Pig | DATA ANALYTICS | BATCH

• Obtain the SO2 average of each station

air_quality = LOAD '/CalidadAire_Gijon' USING PigStorage(';')
AS (estacion:chararray, titulo:chararray, latitud:chararray,
longitud:chararray, fecha:chararray, so2:double,
no:chararray, co:chararray, pm10:chararray, o3:chararray,
dd:chararray, vv:chararray, tmp:chararray, hr:chararray,
prb:chararray, rs:chararray, ll:chararray, ben:chararray,
tol:chararray, mxil:chararray, pm25:chararray);

grouped = GROUP air_quality BY estacion;

avg = FOREACH grouped GENERATE group, AVG(air_quality.so2);

DUMP avg;

Page 47: The three generations of Big Data processing

• Cascading is a data processing API and query planner used for defining, sharing, and executing data-processing workflows

• Makes development of complex Hadoop MapReduce workflows easy

• Plays a role similar to Pig

Cascading | DATA ANALYTICS | BATCH

Page 48: The three generations of Big Data processing

// define source and sink Taps (the source scheme was missing in the slide)
Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );
Scheme sinkScheme = new TextLine( new Fields( "Estacion", "SO2" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

// group every Tuple by station and average the SO2 field
Pipe assembly = new Pipe( "avgSO2" );
assembly = new GroupBy( assembly, new Fields( "Estacion" ) );
Aggregator avg = new Average( new Fields( "SO2" ) );
assembly = new Every( assembly, avg );

// connect the taps and the assembly into a flow
Flow flow = flowConnector.connect( "avg-SO2", source, sink, assembly );

// execute the flow, block until complete
flow.complete();

DATA ANALYTICS | BATCH

• Obtain the SO2 average of each station

Cascading

Page 49: The three generations of Big Data processing

Spark

• Cluster computing system for faster data analytics

• Not a modified version of Hadoop

• Compatible with HDFS

• In-memory data storage for very fast iterative processing

• MapReduce-like engine

• API in Scala, Java and Python

DATA ANALYTICS | BATCH
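As a taste of the API, a hedged Java sketch of the per-station SO2 average (input path and field positions are assumptions based on the CSV sample above):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkSO2Avg {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("so2-avg"));
        JavaRDD<String> lines = sc.textFile("hdfs:///AirQuality/");
        // (station, (so2, 1)) pairs, summed per key, then divided
        JavaPairRDD<String, Tuple2<Double, Long>> sums = lines.mapToPair(l -> {
            String[] f = l.split(";");
            double so2 = Double.parseDouble(f[5].replace("\"", "").trim());
            return new Tuple2<>(f[0], new Tuple2<>(so2, 1L));
        }).reduceByKey((a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()));
        sums.mapValues(t -> t._1() / t._2())
            .collect()
            .forEach(t -> System.out.println(t._1() + " -> " + t._2()));
        sc.stop();
    }
}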

Page 50: The three generations of Big Data processing

Spark | DATA ANALYTICS | BATCH

• Hadoop is slow due to replication, serialization and disk I/O

Page 51: The three generations of Big Data processing

Spark | DATA ANALYTICS | BATCH

• 10x-100x faster than Hadoop

Page 52: The three generations of Big Data processing

Shark

• Large-scale data warehouse system for Spark

• SQL on top of Spark

• Actually HiveQL over Spark

• Up to 100x faster than Hive

DATA ANALYTICS | BATCH

Page 53: The three generations of Big Data processing

Pros

• Faster than the Hadoop ecosystem

• Easier to develop new applications (Scala, Java and Python APIs)

Cons

• Not tested on extremely large clusters yet

• Problems when a reducer's data does not fit in memory

Spark / Shark | DATA ANALYTICS | BATCH

Page 54: The three generations of Big Data processing

1. Big Data processing

2. Batch processing

3. Real-time processing

4. Hybrid computation model

5. Conclusions

Agenda

Page 55: The three generations of Big Data processing

Real-time processing technologies

DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS

o Data acquisition: Flume
o Data storage: Kafka, Kestrel
o Data analysis: Flume, Storm, Trident, S4, Spark Streaming

Page 56: The three generations of Big Data processing

Flume | DATA ACQUISITION | REAL-TIME

Page 57: The three generations of Big Data processing

• Kafka is a distributed, partitioned, replicated commit log service

o Producer/consumer model
o Kafka maintains feeds of messages in categories called topics
o Kafka runs as a cluster

Kafka | DATA STORAGE | REAL-TIME

Page 58: The three generations of Big Data processing

Insert the AirQuality sensor log file into the Kafka cluster and consume the info.

// Create a producer
Producer<String, String> producer = new Producer<String, String>(config);

// Open the sensor log file
BufferedReader br = …;
String line;

while (true) {
    line = br.readLine();
    if (line == null) {
        … // wait for new data
    } else {
        producer.send(new KeyedMessage<String, String>(topic, line));
    }
}

Kafka | DATA STORAGE | REAL-TIME

Page 59: The three generations of Big Data processing

AirQuality consumer

ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config);

Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, new Integer(1));

Map<String, List<KafkaMessageStream>> consumerMap = consumer.createMessageStreams(topicCountMap);
KafkaMessageStream stream = consumerMap.get(topic).get(0);

ConsumerIterator it = stream.iterator();
while (it.hasNext()) {
    // consume it.next()
}

Kafka | DATA STORAGE | REAL-TIME

Page 60: The three generations of Big Data processing

• Simple distributed message queue

• A single Kestrel server has a set of queues (strictly ordered FIFO)

• In a cluster of Kestrel servers, the servers don't know about each other and don't do any cross communication

• Kestrel vs Kafka:

o Kafka consumers are cheaper (basically just the bandwidth usage)
o Kestrel does not depend on ZooKeeper, which means it is operationally less complex if you don't already have a ZooKeeper installation
o Kafka has significantly better throughput
o Kestrel does not support ordered consumption

Kestrel | DATA STORAGE | REAL-TIME

Page 61: The three generations of Big Data processing

Interceptor

• Interface org.apache.flume.interceptor.Interceptor

• Can modify or even drop events based on any criteria

• Flume supports chaining of interceptors.

• Types:

o Timestamp interceptor

o Host interceptor

o Static interceptor

o UUID interceptor

o Morphline interceptor

o Regex Filtering interceptor

o Regex Extractor interceptor

Flume | DATA ANALYTICS | REAL-TIME

Page 62: The three generations of Big Data processing

• The sensors' information must be filtered by "Station 2": an interceptor will filter the information between the Source and the Channel.

Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";

"2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";

"3";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";

"2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";

Flume | DATA ANALYTICS | REAL-TIME

Page 63: The three generations of Big Data processing

# Write format can be text or writable
… # Defining the channel (memory type)
… # Defining the source (syslog)
… # Defining the sink (HDFS)

# Defining the interceptor
agent.sources.source.interceptors = i1
agent.sources.source.interceptors.i1.type = org.apache.flume.interceptor.StationFilter

// Pseudocode of the custom interceptor
class StationFilter implements Interceptor
    … if (!"2".equals(station))
        discard the event;
    else
        keep the event;

Flume | DATA ANALYTICS | REAL-TIME
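A hedged Java sketch of such a StationFilter (field layout assumed from the CSV sample above; a real interceptor also needs a nested Builder class, omitted here):

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

public class StationFilter implements Interceptor {
    @Override public void initialize() { }

    @Override
    public Event intercept(Event event) {
        String line = new String(event.getBody(), StandardCharsets.UTF_8);
        String station = line.split(";")[0].replace("\"", "").trim();
        // Returning null drops the event; only station "2" is kept
        return "2".equals(station) ? event : null;
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> kept = new ArrayList<Event>();
        for (Event e : events) {
            Event out = intercept(e);
            if (out != null) kept.add(out);
        }
        return kept;
    }

    @Override public void close() { }
}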

Page 64: The three generations of Big Data processing

Hadoop → Storm equivalences:

JobTracker → Nimbus
TaskTracker → Supervisor
Job → Topology

• Distributed and scalable realtime computation system

• Doing for real-time processing what Hadoop did for batch processing

• Topology: processing graph. Each node contains processing logic (spouts and bolts); links between nodes are streams of data

o Spout: source of streams. Reads a data source and emits the data into the topology as a stream
o Bolt: processing unit. Reads data from several streams, does some processing and possibly emits new streams
o Stream: unbounded sequence of tuples. Tuples can contain any serializable object

Storm | DATA ANALYTICS | REAL-TIME

Page 65: The three generations of Big Data processing

• AirQuality average values

o Step 1: build the topology

CAReader (Spout) → LineProcessor (Bolt) → AvgValues (Bolt)

Storm | DATA ANALYTICS | REAL-TIME

Page 66: The three generations of Big Data processing

• AirQuality average values

o Step 1: build the topology

TopologyBuilder AirAVG = new TopologyBuilder();

AirAVG.setSpout("ca-reader", new CAReader(), 1);

// shuffleGrouping -> even distribution
AirAVG.setBolt("ca-line-processor", new LineProcessor(), 3)
      .shuffleGrouping("ca-reader");

// fieldsGrouping -> tuples with the same field value go to the same task
AirAVG.setBolt("ca-avg-values", new AvgValues(), 2)
      .fieldsGrouping("ca-line-processor", new Fields("id"));

Storm | DATA ANALYTICS | REAL-TIME

Page 67: The three generations of Big Data processing

public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    // Initialize the file reader
    BufferedReader br = new …
}

public void nextTuple() {
    String line = br.readLine();
    if (line == null) {
        return;
    } else {
        collector.emit(new Values(line));
    }
}

Storm

• AirQuality average values

o Step 2: CAReader implementation (IRichSpout interface)

DATA ANALYTICS | REAL-TIME

Page 68: The three generations of Big Data processing

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("id", "stationName", "lat", …));
}

public void execute(Tuple input, BasicOutputCollector collector) {
    collector.emit(new Values((Object[]) input.getString(0).split(";")));
}

Storm

• AirQuality average values

o Step 3: LineProcessor implementation (IBasicBolt interface)

DATA ANALYTICS | REAL-TIME

Page 69: The three generations of Big Data processing


public void execute(Tuple input, BasicOutputCollector collector) {
    // totals and counts are hashmaps with the accumulated values per station
    if (totals.containsKey(id)) {
        item = totals.get(id);
        count = counts.get(id);
    } else {
        // Create a new item
    }

    // Update values
    item.setSo2(item.getSo2() + Integer.parseInt(input.getStringByField("so2")));
    item.setNo(item.getNo() + Integer.parseInt(input.getStringByField("no")));
    …
}

Storm

• AirQuality average values

o Step 4: AvgValues implementation (IBasicBolt interface)

DATA ANALYTICS | REAL-TIME

Page 70: The three generations of Big Data processing

• High-level abstraction on top of Storm

o Provides high-level operations (joins, filters, projections, aggregations, functions…)

Pros

o Easy, powerful and flexible
o Incremental topology development
o Exactly-once semantics

Cons

o Very few built-in functions
o Lower performance and higher latency than raw Storm

Trident | DATA ANALYTICS | REAL-TIME
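As a hedged sketch of what a Trident topology looks like (spout, ParseLine and Avg are assumed user-supplied classes, since Trident ships with few built-ins; MemoryMapState is Trident's in-memory state):

import backtype.storm.tuple.Fields;
import storm.trident.TridentTopology;
import storm.trident.testing.MemoryMapState;

// Per-station running SO2 average
TridentTopology topology = new TridentTopology();
topology.newStream("air-quality", spout)                       // spout emits "line"
        .each(new Fields("line"), new ParseLine(),             // hypothetical Function
              new Fields("station", "so2"))
        .groupBy(new Fields("station"))
        .persistentAggregate(new MemoryMapState.Factory(),
                             new Fields("so2"),
                             new Avg(),                        // hypothetical CombinerAggregator
                             new Fields("avg_so2"));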

Page 71: The three generations of Big Data processing

Simple Scalable Streaming System

Distributed, scalable, fault-tolerant platform for processing continuous unbounded streams of data

Inspired by the MapReduce and Actor models of computation:

o Data processing is based on Processing Elements (PEs)
o Messages are transmitted between PEs in the form of events (key, attributes)
o Processing Nodes are the logical hosts of PEs

S4 | DATA ANALYTICS | REAL-TIME

Page 72: The three generations of Big Data processing

<bean id="split" class="SplitPE">
  <property name="dispatcher" ref="dispatcher"/>
  <property name="keys">
    <!-- Listen for log lines -->
    <list>
      <value>LogLines *</value>
    </list>
  </property>
</bean>

<bean id="average" class="AveragePE">
  <property name="keys">
    <list>
      <value>CAItem stationId</value>
    </list>
  </property>
</bean>
…

• AirQuality average values

S4 | DATA ANALYTICS | REAL-TIME

Page 73: The three generations of Big Data processing

Spark Streaming

• Spark for real-time processing

• Streaming computation as a series of very short batch jobs (windows)

• Keeps state in memory

• API similar to Spark

DATA ANALYTICS | REAL-TIME
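A hedged Java sketch of that idea (socket source, 5-second micro-batches; host and port are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class SO2StreamingCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("so2-streaming");
        // Each 5-second window becomes one small batch job
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));
        JavaDStream<String> lines = jssc.socketTextStream("sensor-host", 9999);
        // Count readings per station within each micro-batch
        lines.mapToPair(l -> new scala.Tuple2<>(l.split(";")[0], 1))
             .reduceByKey((a, b) -> a + b)
             .print();
        jssc.start();
        jssc.awaitTermination();
    }
}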

Page 74: The three generations of Big Data processing

1. Big Data processing

2. Batch processing

3. Real-time processing

4. Hybrid computation model

5. Conclusions

Agenda

Page 75: The three generations of Big Data processing

• We are at the beginning of this generation

• Short-term Big Data processing goal

• Abstraction layer over the Lambda Architecture

• Promising technologies

o SummingBird

o Lambdoop

Hybrid Computation Model

Page 76: The three generations of Big Data processing

SummingBird

• Library to write MapReduce-like processes that can be executed on Hadoop, Storm or a hybrid model

• Scala syntax

• The same logic can be executed in batch, real-time and hybrid batch/real-time mode

HYBRID COMPUTATION MODEL

Page 77: The three generations of Big Data processing

SummingBird | HYBRID COMPUTATION MODEL

Page 78: The three generations of Big Data processing

Pros

• Hybrid computation model

• Same programming model for all processing paradigms

• Extensible

Cons

• MapReduce-like programming

• Scala

• Not as abstract as some users would like

SummingBird | HYBRID COMPUTATION MODEL

Page 79: The three generations of Big Data processing

Software abstraction layer over open-source technologies

o Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident

Common patterns and operations (aggregation, filtering, statistics…) already implemented; no MapReduce-like processes

The same single API for the three processing paradigms:

o Batch processing similar to Pig / Cascading
o Real-time processing using built-in functions, easier than Trident
o Hybrid computation model transparent for the developer

Lambdoop | HYBRID COMPUTATION MODEL

Page 80: The three generations of Big Data processing

Lambdoop building blocks: Data → Operation → Data, composed into a Workflow. A Data object can wrap streaming data, static data or both.

Lambdoop | HYBRID COMPUTATION MODEL

Page 81: The three generations of Big Data processing

DataInput db_historical = new StaticCSVInput(URI_db);
Data historical = new Data(db_historical);
Workflow batch = new Workflow(historical);

Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");

batch.add(filter);
batch.add(select);
batch.add(group);
batch.add(average);

batch.run();
Data results = batch.getResults();

Lambdoop | HYBRID COMPUTATION MODEL

Page 82: The three generations of Big Data processing

DataInput stream_sensor = new StreamXMLInput(URI_sensor);
Data sensor = new Data(stream_sensor);
Workflow streaming = new Workflow(sensor, new WindowsTime(100));

Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");

streaming.add(filter);
streaming.add(select);
streaming.add(group);
streaming.add(average);

streaming.run();
while (true) {
    Data live_results = streaming.getResults();
    …
}

Lambdoop | HYBRID COMPUTATION MODEL

Page 83: The three generations of Big Data processing

DataInput historical = new StaticCSVInput(URI_folder);
DataInput stream_sensor = new StreamXMLInput(URI_sensor);
Data all_info = new Data(historical, stream_sensor);
Workflow hybrid = new Workflow(all_info, new WindowsTime(1000));

Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");

hybrid.add(filter);
hybrid.add(select);
hybrid.add(group);
hybrid.add(average);

hybrid.run();
Data updated_results = hybrid.getResults();

Lambdoop | HYBRID COMPUTATION MODEL

Page 84: The three generations of Big Data processing

Pros

• High abstraction layer for all processing models

• Covers all steps in the data processing pipeline

• Same Java API for all programming paradigms

• Extensible

Cons

• Ongoing project

• Not open-source yet

• Not tested on large clusters yet

Lambdoop | HYBRID COMPUTATION MODEL

Page 85: The three generations of Big Data processing

1. Big Data processing

2. Batch processing

3. Real-time processing

4. Hybrid computation model

5. Conclusions

Agenda

Page 86: The three generations of Big Data processing

Conclusions

• Big Data is not only Hadoop

• Identify the processing requirements of your project

• Analyze the alternatives for all steps in the data pipeline

• The battle for real-time processing is open

• Stay tuned for the hybrid computation model

Page 87: The three generations of Big Data processing

Thanks for your attention!

www.datadopter.com

www.treelogic.com

Contact us:

[email protected]

[email protected]

MADRID Avda. de Manoteras, 38

Oficina D507

28050 Madrid · España

ASTURIAS Parque Tecnológico de Asturias

Parcela 30

33428 Llanera - Asturias · España

902 286 386