Stratosphere for Hadoop Users - Hasso Plattner Institute · Pact (PArallelization ConTracts)...

Stratosphere for Hadoop Users Potsdam, January 03, 2012 Arvid Heise


Outline

1 Overview over Stratosphere

2 Dataflow Orientation

3 Tuple-based Data Model

4 Other Differences

5 Seminar Organization

Arvid Heise | Scalable Data Analysis Algorithms | January 03, 2012


Stratosphere Stack

Higher-Level Language: Simple Script -> Simple Parser -> SOPREMO Plan -> SOPREMO Compiler

Programming Model: PACT Program -> PACT Optimizer

Execution Engine: Execution Graph -> Nephele Scheduler


Pact (PArallelization ConTracts)

- Parallel programming model
- Implementation and generalization of Map/Reduce
- Interface similar to Hadoop's
- Defines the parallelization semantics of tasks
- Pact plan is dataflow-oriented
- Pact optimizes plans and compiles them to execution graphs for Nephele

Alexandrov et al. 2010. MapReduce and Pact - Comparing Data Parallel Programming Models.


Hadoop and Stratosphere Job

    SELECT * FROM Documents d JOIN Rankings r ON r.url = d.url
    WHERE CONTAINS(d.text, [keywords]) AND r.rank > [rank]
    AND NOT EXISTS (SELECT * FROM Visits v WHERE v.url = d.url
        AND v.visitDate = CURDATE());

[Figure: the query as a chain of Hadoop jobs (Map and Reduce stages) and as a Pact plan (Map, Match, and CoGroup) over the inputs d (Documents), r (Rankings), and v (Visits). The Maps apply the filters CONTAINS [keywords], rank > [rank], and visitDate = CURDATE(). Record layouts by key: url (url, content); ip_addr (ip_addr, url, date, ad_revenue, ...); rank (rank, url, avg_duration); url (rank, url, avg_duration).]

Battré et al. 2010. Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing


Nephele

- Executes an Execution Graph
- Decides for each task how many instances are appropriate
- Assigns task instances to computation units
- Manages fault tolerance and adapts to changes

Daniel Warneke and Odej Kao. 2009. Nephele: Efficient Parallel Data Processing in the Cloud


Sopremo and Simple

- High level language layer
- Simple = query language
- Sopremo = semi-structured data model (JSON) and operators
- Extensible operators for several use cases
  - Text Mining, Data Cleansing, Data Mining


Data Analysis Program

Hadoop
- Driver program + multiple jobs
- Driver program manually connects inputs and outputs
- Fixed pipeline
- 1 Job = 1 Map + 1 Reduce

Stratosphere
- Directed acyclic graph of arbitrary Pacts
- Explicit data sources and sinks
- Pacts also support two inputs (for join-like operations)


Map/Reduce in Stratosphere

- Works the same as in Hadoop
- But the semantics are interpreted differently
- Each Pact defines data dependencies
- Map: each tuple can be treated separately
- Reduce: tuples with the same key are grouped and guaranteed to be processed by the same reducer

[Figure: for Map and for Reduce, the key/value input is split into independent subsets.]
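The two contracts can be illustrated without any Stratosphere classes. The following is a minimal sketch in plain Java; the class `ContractSketch` and its methods `mapEach` and `reduceByKey` are hypothetical names chosen for this illustration and are not part of the Pact API. It only shows which elements end up in the same independent subset.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java illustration of the Map and Reduce contracts; not Pact API. */
final class ContractSketch {

    /** Map contract: every input element is its own independent subset. */
    static List<Map.Entry<String, Integer>> mapEach(List<String> words) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : words) {
            // Each iteration looks at exactly one element, so it could run anywhere.
            out.add(new SimpleEntry<>(word, 1));
        }
        return out;
    }

    /** Reduce contract: all pairs sharing a key form one independent subset. */
    static Map<String, Integer> reduceByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            // merge() adds the value to the group that owns this key.
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("pact", "nephele", "pact");
        System.out.println(reduceByKey(mapEach(words))); // {nephele=1, pact=2}
    }
}
```

Because the contract only states which elements belong together, a runtime is free to decide where each subset is processed.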


Two Input Pacts

- Currently three additional Pacts for two inputs
  - Cross: make all possible pairs
  - Match: find all matching pairs
  - CoGroup: group all matching tuples
- All pairs/groups are treated independently

[Figure: for Cross, Match, and CoGroup, Input A and Input B are combined into independent subsets.]
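To make the difference between the three two-input contracts concrete, here is a small sketch in plain Java. It assumes elements encoded as `"key:value"` strings, and the class `TwoInputSketch` with its methods is hypothetical, not Pact API. Each method returns the subsets that a user function would be called with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java illustration of the subsets built by each two-input contract. */
final class TwoInputSketch {

    /** Groups "key:value" strings by their key part. */
    static Map<String, List<String>> byKey(List<String> input) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String element : input) {
            String key = element.substring(0, element.indexOf(':'));
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(element);
        }
        return groups;
    }

    /** Cross: every element of A is paired with every element of B. */
    static List<String> cross(List<String> a, List<String> b) {
        List<String> pairs = new ArrayList<>();
        for (String x : a) {
            for (String y : b) {
                pairs.add(x + "|" + y);
            }
        }
        return pairs;
    }

    /** Match: only pairs with equal keys are built (an equi-join). */
    static List<String> match(List<String> a, List<String> b) {
        Map<String, List<String>> groupsB = byKey(b);
        List<String> pairs = new ArrayList<>();
        for (String x : a) {
            String key = x.substring(0, x.indexOf(':'));
            for (String y : groupsB.getOrDefault(key, Collections.<String>emptyList())) {
                pairs.add(x + "|" + y);
            }
        }
        return pairs;
    }

    /** CoGroup: one subset per key, holding all elements of A and of B with that key. */
    static List<String> coGroup(List<String> a, List<String> b) {
        Map<String, List<String>> groupsA = byKey(a);
        Map<String, List<String>> groupsB = byKey(b);
        TreeSet<String> keys = new TreeSet<>(groupsA.keySet());
        keys.addAll(groupsB.keySet());
        List<String> groups = new ArrayList<>();
        for (String key : keys) {
            // A key that is missing on one side yields an empty list for that side.
            groups.add(groupsA.getOrDefault(key, Collections.<String>emptyList())
                    + "+" + groupsB.getOrDefault(key, Collections.<String>emptyList()));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("u1:doc", "u2:doc");
        List<String> b = Arrays.asList("u1:visit", "u1:visit2", "u3:visit");
        System.out.println(cross(a, b).size());
        System.out.println(match(a, b));
        System.out.println(coGroup(a, b));
    }
}
```

In this sketch CoGroup also produces a group for a key that occurs in only one input, which is what makes conditions such as "emit if no visit exists" expressible.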


Comparison

Hadoop
- Workarounds are often necessary in a complex M/R program
- Error-prone, recurring manual data partitioning
- Patterns evolved for joins etc. (Data mining book)
- Fixed pipeline allows fine-grained tweaks (exploits)

Stratosphere
- More expressiveness for complex data operations
- Maintains dataflow semantics to some degree
- Allows optimization and different shipping strategies
- Fewer hooks than Hadoop


Comparison

[Figure: the Hadoop execution plan drawn as a physical query plan, divided into data load, map, shuffle, and reduce phases.]

Dittrich et al. 2010. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing)


Optimization

- Inspired by traditional query optimization
- Choose best join/shipping strategy for plan
- Reorder Pacts
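As an illustration of what choosing a shipping strategy can mean, the sketch below compares two textbook options for a join over several nodes: broadcasting the smaller input to every node, or repartitioning both inputs by key. The cost formulas, the class `ShippingSketch`, and the enum names are assumptions made for this example; they are not the PACT optimizer's actual cost model.

```java
/** Hypothetical cost comparison between two shipping strategies for a join. */
final class ShippingSketch {

    enum Strategy { BROADCAST_SMALLER, REPARTITION_BOTH }

    /**
     * Assumed network cost: broadcasting sends the smaller input to every node,
     * repartitioning sends each input over the network roughly once.
     */
    static Strategy choose(long bytesA, long bytesB, int nodes) {
        long broadcastCost = Math.min(bytesA, bytesB) * nodes;
        long repartitionCost = bytesA + bytesB;
        return broadcastCost < repartitionCost
                ? Strategy.BROADCAST_SMALLER
                : Strategy.REPARTITION_BOTH;
    }

    public static void main(String[] args) {
        // A tiny input joined with a large one: broadcasting the tiny one is cheaper.
        System.out.println(choose(10L, 10_000L, 8));
        // Two inputs of similar size: repartitioning both is cheaper.
        System.out.println(choose(5_000L, 6_000L, 8));
    }
}
```

The point of the sketch is only that the decision depends on input sizes and the degree of parallelism, which is why it is better left to an optimizer than fixed in user code.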


Switch to Tuple-based Model

- Bleeding Edge! Check pact-examples subproject
- Typical Hadoop job:
  - Map: transforms/filters data, sets key
  - Reduce: combines data, unsets key
  - Preceding maps limit reordering
- Stratosphere uses tuples instead of k/v-pairs
- User should set keys as soon as possible for ALL Pacts
- Each Pact is annotated to define the "key"


Reduce Example

    public static class CountWords extends ReduceStub {
        public void reduce(Iterator<PactRecord> records, Collector out) {
            PactRecord element = null;
            int sum = 0;
            // Add up field 1 (the count) of all records that share the key.
            while (records.hasNext()) {
                element = records.next();
                PactInteger count =
                    element.getField(1, PactInteger.class);
                sum += count.getValue();
            }

            // Reuse the last record: field 0 still holds the key.
            element.setField(1, new PactInteger(sum));
            out.collect(element);
        }
    }
    ...
    ReduceContract reducer = new ReduceContract(
        CountWords.class,  // <-- UDF
        PactString.class,  // <-- key class
        0,                 // <-- key index
        mapper);           // <-- input


PactRecords Semantics

- List of fields = keys or values
- Fields that are used as keys need to implement Key
- Maintains serialized form as long as possible, lazy deserialization
  - Fields are only deserialized when read
  - Serializes only written fields
  - Needs type of field to deserialize
- Very efficient storage of null values
- ⇒ Use a separate field for each key in your code
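The idea of lazy deserialization can be sketched in a few lines of plain Java. The class `LazyRecordSketch` below is hypothetical and much simpler than the real PactRecord; it assumes only two field types (4-byte integers and UTF-8 strings) and exists to show why the caller has to pass the field type and why unread fields cost nothing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical record with lazily deserialized fields; not the real PactRecord. */
final class LazyRecordSketch {
    private final byte[][] serialized; // binary form of each field, null = null field
    private final Object[] cache;      // deserialized values, filled on first read
    int deserializations = 0;          // counts how often bytes were decoded

    LazyRecordSketch(byte[][] serialized) {
        this.serialized = serialized;
        this.cache = new Object[serialized.length];
    }

    /** Decodes a field only when it is read; the caller must supply the type. */
    <T> T getField(int index, Class<T> type) {
        if (serialized[index] == null) {
            return null;
        }
        if (cache[index] == null) {
            cache[index] = decode(serialized[index], type);
            deserializations++;
        }
        return type.cast(cache[index]);
    }

    private static Object decode(byte[] bytes, Class<?> type) {
        if (type == Integer.class) {
            return ByteBuffer.wrap(bytes).getInt();
        }
        if (type == String.class) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
        throw new IllegalArgumentException("unsupported field type: " + type);
    }

    public static void main(String[] args) {
        byte[][] fields = {
            "pact".getBytes(StandardCharsets.UTF_8),
            ByteBuffer.allocate(4).putInt(7).array()
        };
        LazyRecordSketch record = new LazyRecordSketch(fields);
        System.out.println(record.getField(1, Integer.class)); // decodes field 1 only
        System.out.println(record.deserializations);           // field 0 was never touched
    }
}
```

A record that is merely forwarded, or whose key is the only field read, therefore never pays for decoding its remaining fields.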


Word Count Example

- Wrap Pact Stubs in Pact Contracts
- All pacts except source specify their input

    String dataInput = ...;
    FileDataSource source = new FileDataSource(
        LineInFormat.class, dataInput, "Input Lines");
    MapContract mapper = new MapContract(
        TokenizeLine.class, source, "Tokenize Lines");
    ReduceContract reducer = new ReduceContract(
        CountWords.class, PactString.class, 0, mapper,
        "Count Words");
    FileDataSink out = new FileDataSink(
        WordCountOutFormat.class, output, reducer,
        "Word Counts");


Word Count Example

- Input format needs to be parsed in any case
- Here we could already split lines and omit map
- In this case, we emit a PactRecord with 1 field
- Reuse objects to minimize garbage collection

    public static class LineInFormat extends DelimitedInputFormat {
        // Reused for every line to avoid allocating a new object per record.
        private final PactString string = new PactString();

        public boolean readRecord(PactRecord record, byte[] line, int numBytes) {
            this.string.setValueAscii(line, 0, numBytes);
            record.setField(0, this.string);
            return true;
        }
    }


Word Count Example

- Implicit semantic of fields

    public static class TokenizeLine extends MapStub {
        private final PactRecord outputRecord = new PactRecord();
        private final PactString string = new PactString();
        private final PactInteger integer = new PactInteger(1);

        private final AsciiUtils.WhitespaceTokenizer tokenizer =
            new AsciiUtils.WhitespaceTokenizer();

        @Override
        public void map(PactRecord record, Collector collector) {
            // get the first field (as type PactString)
            PactString str = record.getField(0, PactString.class);

            // tokenize the line
            this.tokenizer.setStringToTokenize(str);
            while (tokenizer.next(this.string)) {
                // we emit a (word, 1) pair
                this.outputRecord.setField(0, this.string);
                this.outputRecord.setField(1, this.integer);
                collector.collect(this.outputRecord);
            }
        }
    }


Word Count Example

    @Combinable
    public static class CountWords extends ReduceStub {
        private final PactInteger theInteger = new PactInteger();

        @Override
        public void reduce(Iterator<PactRecord> records, Collector out)
                throws Exception {
            PactRecord element = null;
            int sum = 0;
            while (records.hasNext()) {
                element = records.next();
                PactInteger i =
                    element.getField(1, PactInteger.class);
                sum += i.getValue();
            }

            this.theInteger.setValue(sum);
            element.setField(1, this.theInteger);
            out.collect(element);
        }
    }


Word Count Example

    public static class WordCountOutFormat extends FileOutputFormat {
        private final StringBuilder buffer = new StringBuilder();

        @Override
        public void writeRecord(PactRecord record) throws IOException {
            // Format each record as "<word> <count>\n".
            this.buffer.setLength(0);
            this.buffer.append(
                record.getField(0, PactString.class));
            this.buffer.append(' ');
            this.buffer.append(
                record.getField(1, PactInteger.class).getValue());
            this.buffer.append('\n');

            byte[] bytes = this.buffer.toString().getBytes();
            this.stream.write(bytes);
        }
    }


Composite Keys

- Reduce, CoGroup, and Match use keys
- May be composed of more than one field
- Useful for block identifier: separate row and column

Assuming a block is a three-tuple (row, column, data):

    ReduceContract reducer = new ReduceContract(MergeBlocks.class,
        new Class[] { PactInteger.class, PactInteger.class }, // <-- key classes
        new int[] { 0, 1 },                                    // <-- key indices
        mapper);                                               // <-- input
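What grouping on a composite key does to such blocks can be shown in plain Java. The sketch below is an assumption-laden illustration: `CompositeKeySketch` and `mergeBlocks` are hypothetical names, blocks are modeled as `int[]` triples, and "merging" is simplified to summing the data field. It is not the `MergeBlocks` stub referenced above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of grouping by a two-field key; not Pact API. */
final class CompositeKeySketch {

    /** Sums field 2 of all (row, column, data) blocks that agree on fields 0 and 1. */
    static Map<List<Integer>, Integer> mergeBlocks(List<int[]> blocks) {
        Map<List<Integer>, Integer> merged = new HashMap<>();
        for (int[] block : blocks) {
            // Row and column stay separate fields; together they form the key.
            List<Integer> key = Arrays.asList(block[0], block[1]);
            merged.merge(key, block[2], Integer::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<int[]> blocks = Arrays.asList(
            new int[] {0, 1, 5},
            new int[] {0, 1, 7},
            new int[] {1, 0, 3});
        Map<List<Integer>, Integer> merged = mergeBlocks(blocks);
        System.out.println(merged.get(Arrays.asList(0, 1))); // 12
        System.out.println(merged.get(Arrays.asList(1, 0))); // 3
    }
}
```

Keeping row and column as separate fields means a later Pact can key on the row alone without re-parsing a combined identifier.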


Comparison to Key/Value-Pairs

- Key/value-pairs are easier to understand
  - May use generic syntax to check compatibility
  - Implicit type specification through generics
- Tuple-based is more flexible, especially for optimization
  - Pays off quickly in more complex tasks
  - More convention-based, field semantics less clear
  - UDF more verbose, especially when reusing objects
  - ⇒ Work on class mapping, HLL uses schema inference
  - Generalization of k/v-pairs, may simulate k/v


Separate Execution Engine

- Nephele and Pact together are equivalent to Hadoop
- Nephele may be used by itself for fine-grained data management
- However, decoupling may lose some optimization potential
- UDF is wrapped in Pact and Nephele code

[Figure: a PACT program with the user functions UF1 (map), UF2 (map), UF3 (match), and UF4 (reduce) is compiled into a Nephele DAG with vertices V1 to V4, which is then spanned into a parallel data flow over Node 1 and Node 2 with in-memory and network channels. A vertex wraps the user function in PACT code (grouping) and Nephele code (communication).]

User function (match):

    function match(Key k, Tuple val1, Tuple val2)
    -> (Key, Tuple)
    {
        Tuple res = val1.concat(val2);
        res.project(...);
        Key k = res.getColumn(1);
        return (k, res);
    }

Surrounding PACT code (pseudocode):

    invoke():
        while (!input2.eof)
            KVPair p = input2.next();
            hash-table.put(p.key, p.value);
        while (!input1.eof)
            KVPair p = input1.next();
            KVPair t = hash-table.get(p.key);
            if (t != null)
                KVPair[] result =
                    UF.match(p.key, p.value, t.value);
                output.write(result);
    end


Dynamic Resource Allocation

- Hadoop works best in a cluster environment
- Stratosphere also targets a true cloud environment
- Computation units may be dynamically booked and released

[Figure: aggregating the smallest 20% of the numbers, Nephele (left) and Hadoop (right). Both plots show the average instance utilization [%] (USR, SYS, WAIT) and the average network traffic among instances [MBit/s] over time [minutes]; the time axis runs to 35 minutes for Nephele and to 100 minutes for Hadoop.]


Continuous Reoptimization

- Stratosphere wants to optimize plans
- However, the environment (cloud or cluster) is not stable
- An a priori optimal plan may be bad after a while
  - Start with robust initial plan
  - Adapt and optimize during runtime


Smart Checkpointing

(Work in Progress)

- Hadoop materializes all results
  - Very good for fault-tolerance
  - A crashed computation unit may be immediately replaced
  - However, often bad for runtime performance
- Stratosphere materializes data only when meaningful
  - Mostly, when materialization is cheaper than recalculation
  - Will materialize results of computation-intensive tasks
  - Will not materialize Cartesian products
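The rule "materialize when it is cheaper than recalculation" can be written down as a one-line comparison. The following sketch is only an illustration under stated assumptions: the class `CheckpointSketch`, the method `shouldMaterialize`, and the simple cost estimate (output size divided by write throughput versus recomputation time) are hypothetical and do not describe how Stratosphere actually decides.

```java
/** Hypothetical cost rule for checkpoint decisions; names and numbers are illustrative. */
final class CheckpointSketch {

    /** Returns true when writing the result is estimated to be cheaper than recomputing it. */
    static boolean shouldMaterialize(long outputBytes, double writeBytesPerSecond,
                                     double recomputeSeconds) {
        double writeSeconds = outputBytes / writeBytesPerSecond;
        return writeSeconds < recomputeSeconds;
    }

    public static void main(String[] args) {
        // Computation-intensive task: 1 MB of output that took 10 minutes to compute.
        System.out.println(shouldMaterialize(1_000_000L, 100e6, 600.0));
        // Cartesian product: 1 TB of output that can be rebuilt in a minute.
        System.out.println(shouldMaterialize(1_000_000_000_000L, 100e6, 60.0));
    }
}
```

Under these assumed numbers the first call returns true and the second returns false, matching the two cases named on the slide.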


Get Stratosphere Running

- Register at https://www.stratosphere.eu/register/register
- Email me to be unlocked for stage1
- Install git and maven2
- Check out and build:

      mkdir stratosphere && cd stratosphere
      git clone https://stratosphere.eu/git/stage1.git .
      git checkout -b version02 remotes/origin/version02
      mvn install

- See http://www.stratosphere.eu/projects/Stratosphere/wiki/GettingStarted
- Start local mode
- Get word count running


Port Hadoop to Stratosphere

- Meeting on 01/10/2012
  - Conceptual port should be completed
  - Implementation started
- Email me if you run into trouble
- File a ticket if you find a bug
- Help other teams, write to the sdaa mailing list

Who uses joins, composite keys?


Cluster Times

- Proposal: for the next 6 weeks, every group gets a fixed day
- Thursday 6pm – Monday 6pm
- Time slots are flexibly tradable
- Announce start/end on cluster mailing list


Final presentations

- May be rescheduled if everybody agrees
- First part should be similar to Hadoop talk
- Second part should include detailed benchmark and comparison
- State what you will additionally evaluate until report


Report

- Two column format (sig-alternate.cls)
- 6–8 pages, max. 2 appendix
- How did you change the original algorithms?
- Which design decisions did you make?
- Focus on important, interesting evaluations
- What results did you get? Do they make sense?
- How would you improve the algorithm for better runtime/results?
- No code, only conceptual
