The Flink - Apache Bigtop integration

36
1 © Cloudera, Inc. All rights reserved. Marton Balassi | Solutions Architect Flink PMC member @MartonBalassi | [email protected] The Flink - Apache Bigtop integration

Transcript of The Flink - Apache Bigtop integration

Page 1: The Flink - Apache Bigtop integration

1© Cloudera, Inc. All rights reserved.

Marton Balassi | Solutions Architect Flink PMC

member@MartonBalassi | [email protected]

The Flink - Apache Bigtop integration

Page 2: The Flink - Apache Bigtop integration

2© Cloudera, Inc. All rights reserved.

Outline

• Short introduction to Bigtop

• An even shorter intro to Flink

• From Flink source to linux packages

• Implementing BigPetStore

• From linux packages to Cloudera parcels

• Summary

Page 3: The Flink - Apache Bigtop integration

3© Cloudera, Inc. All rights reserved.

Short introduction to Bigtop

Page 4: The Flink - Apache Bigtop integration

4© Cloudera, Inc. All rights reserved.

What is Bigtop?

Apache project for standardizing testing, packaging and integration of

leading big data components.

Page 5: The Flink - Apache Bigtop integration

5© Cloudera, Inc. All rights reserved.

Components as building blocks

And many more …

Page 6: The Flink - Apache Bigtop integration

6© Cloudera, Inc. All rights reserved.

Dependency hell

-------------------------------------------------------------------------hdfszookeeperhbasekafkaspark...mapredooziehiveetc ---

-------------------------------------------------------

----------------------------------------------------------

----------------------------------------------------------

----------------------------------------------------------

----------------------------------------------------------

----------------------------------------------------------

Build all the Things!!!

Page 7: The Flink - Apache Bigtop integration

7© Cloudera, Inc. All rights reserved.

Early value added

• Bigtop has been around since the 0.20 days of Hadoop

• Provide a common foundation for proper integration of growing number of

Hadoop family components

• Foundation provides solid base for validating applications running on top of the

stack(s)

• Neutral packaging and deployment/config

Page 8: The Flink - Apache Bigtop integration

8© Cloudera, Inc. All rights reserved.

Early mission accomplished

• Foundation for commercial Hadoop distros/services

• Leveraged by app providers

Page 9: The Flink - Apache Bigtop integration

9© Cloudera, Inc. All rights reserved.

Adding more components

Page 10: The Flink - Apache Bigtop integration

10© Cloudera, Inc. All rights reserved.

New focus and target groups

• Going way beyond just building debs/rpms

• Data engineers vs distro builders

• Enhance Operations/Deployment

• Reference implementations & tutorials

Page 11: The Flink - Apache Bigtop integration

11© Cloudera, Inc. All rights reserved.

An even shorter intro to Flink

Page 12: The Flink - Apache Bigtop integration

12© Cloudera, Inc. All rights reserved.

The Flink stack

DataStream APIStream Processing

DataSet APIBatch Processing

RuntimeDistributed Streaming Data Flow

Libraries

Streaming and batch as first class citizens.

Page 13: The Flink - Apache Bigtop integration

13© Cloudera, Inc. All rights reserved.

Flink in the wild

30 billion events daily 2 billion events in 10 1Gb machines

Picked Flink for "Saiki"data integration &

distribution platform

See talks by at

Runs their fork of Flink on 1000+ nodes

Page 14: The Flink - Apache Bigtop integration

14© Cloudera, Inc. All rights reserved.

From Flink source to linux packages

Page 15: The Flink - Apache Bigtop integration

15© Cloudera, Inc. All rights reserved.

The Bigtop component build

• Bigtop builds the component (potentially after patching it)

• Breaks up the files to linux distro friendly way (/etc/flink/conf, …)

• Adds users, groups, systemd services for the components

• Sets up the paths and alternatives for convenient access

• Builds the debs/rpm, takes care of the dependencies

http://jayunit100.blogspot.com/2014/04/how-bigtop-packages-hadoop.html

Page 16: The Flink - Apache Bigtop integration

16© Cloudera, Inc. All rights reserved.

Implementing BigPetStore

Page 17: The Flink - Apache Bigtop integration

17© Cloudera, Inc. All rights reserved.

BigPetStore Outline

• BigPetStore model

• Data generator with the DataSet API

• ETL with the DataSet and Table APIs

• Matrix factorization with FlinkML

• Recommendation with the DataStream API

Page 18: The Flink - Apache Bigtop integration

18© Cloudera, Inc. All rights reserved.

BigPetStore

• Blueprints for Big Data applications

• Consists of:• Data Generators• Examples using tools in Big Data ecosystem

to process data• Build system and tests for integrating tools

and multiple JVM languages• Part of the Bigtop project

Page 19: The Flink - Apache Bigtop integration

19© Cloudera, Inc. All rights reserved.

BigPetStore model

• Customers visiting pet stores generating transactions, location based

Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on , vol., no., pp.49-55, 3-5 Dec. 2014

Page 20: The Flink - Apache Bigtop integration

20© Cloudera, Inc. All rights reserved.

Data generation

• Use RJ Nowling’s Java generator classes• Write transactions to JSON

val env = ExecutionEnvironment.getExecutionEnvironmentval (stores, products, customers) = getData()val startTime = getCurrentMillis()

val transactions = env.fromCollection(customers).flatMap(new TransactionGenerator(products)).withBroadcastSet(stores, ”stores”).map{t => t.setDateTime(t.getDateTime + startTime); t}

transactions.writeAsText(output)

Page 21: The Flink - Apache Bigtop integration

21© Cloudera, Inc. All rights reserved.

ETL with the DataSet API

• Read the dirty JSON• Output (customer, product) pairs for the recommenderval env = ExecutionEnvironment.getExecutionEnvironmentval transactions = env.readTextFile(json).map(new FlinkTransaction(_))

val productsWithIndex = transactions.flatMap(_.getProducts).distinct.zipWithUniqueId

val customerAndProductPairs = transactions.flatMap(t => t.getProducts.map(p => (t.getCustomer.getId,

p))).join(productsWithIndex).where(_._2).equalTo(_._2).map(pair => (pair._1._1, pair._2._1)).distinct

customerAndProductPairs.writeAsCsv(output)

Page 22: The Flink - Apache Bigtop integration

22© Cloudera, Inc. All rights reserved.

ETL with Table API

• Read the dirty JSON• SQL style queries (SQL coming in Flink 1.1)

val env = ExecutionEnvironment.getExecutionEnvironmentval transactions = env.readTextFile(json).map(new FlinkTransaction(_))

val table = transactions.map(toCaseClass(_)).toTable

val storeTransactionCount = table.groupBy('storeId).select('storeId, 'storeName, 'storeId.count as 'count)

val bestStores = table.groupBy('storeId).select('storeId.max as 'max).join(storeTransactionCount).where(”count = max”).select('storeId, 'storeName, 'storeId.count as 'count).toDataSet[StoreCount]

Page 23: The Flink - Apache Bigtop integration

23© Cloudera, Inc. All rights reserved.

A little recommender theory

Item factors

User side information User-Item matrixUser factors

Item side information

U

I

PQ

R

• R is potentially huge, approximate it with PQ• Prediction is TopK(user’s row Q)

Page 24: The Flink - Apache Bigtop integration

24© Cloudera, Inc. All rights reserved.

• Read the (customer, product) pairs• Write P and Q to file

Matrix factorization with FlinkML

val env = ExecutionEnvironment.getExecutionEnvironmentval input = env.readCsvFile[(Int,Int)](inputFile)

.map(pair => (pair._1, pair._2, 1.0))

val model = ALS().setNumfactors(numFactors).setIterations(iterations).setLambda(lambda)

model.fit(input)

val (p, q) = model.factorsOption.getp.writeAsText(pOut)q.writeAsText(qOut)

Page 25: The Flink - Apache Bigtop integration

25© Cloudera, Inc. All rights reserved.

Recommendation with the DataStream API

• Give the TopK recommendation for a user• (Could be optimized)

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.socketTextStream(”localhost”, 9999).map(new GetUserVector()).broadcast().map(new PartialTopK()).keyBy(0).flatMap(new GlobalTopK()).print();

Page 26: The Flink - Apache Bigtop integration

26© Cloudera, Inc. All rights reserved.

From linux packages

to Cloudera parcels

Page 27: The Flink - Apache Bigtop integration

27© Cloudera, Inc. All rights reserved.

Why parcels?

• We have linux packages, why a new format?

• Cloudera Manager needs to update parcel without root privileges

• A big, single bundle for the whole ecosystem

• Plays well with the CM services and monitoring

• Package signing

https://github.com/cloudera/cm_ext

Page 28: The Flink - Apache Bigtop integration

28© Cloudera, Inc. All rights reserved.

Managing the Flink parcel from CM

Page 29: The Flink - Apache Bigtop integration

29© Cloudera, Inc. All rights reserved.

Next steps – Flink operations

• Flink does not offer a HistoryServer yet

Running on YARN is inconvenient like this

Follow [FLINK-4136] for resulotion

• The stand-alone cluster mode runs multiple jobs in the JVM

In practice users fire up clusters per job

Alibaba has a multitenant fork, aim is to contribute

https://www.youtube.com/watch?v=_Nw8NTdIq9A

Page 30: The Flink - Apache Bigtop integration

30© Cloudera, Inc. All rights reserved.

Next steps – CM services, monitoring

Page 31: The Flink - Apache Bigtop integration

31© Cloudera, Inc. All rights reserved.

Summary

Page 32: The Flink - Apache Bigtop integration

32© Cloudera, Inc. All rights reserved.

Summary

• Flink is a dataflow engine with batch and streaming as first class citizens

• Bigtop offers unified packaging, testing and integration

• BigPetStore gives you a blueprint for a range of apps

• It is straight-forward to CM Parcel based on Bigtop

Page 33: The Flink - Apache Bigtop integration

33© Cloudera, Inc. All rights reserved.

Big thanks to

• Clouderans supporting the project:

Sean Owen

Alexander Bartfeld

Justin Kestelyn

• The BigPetStore folks:

Suneel Marthi

Ronald J. Nowling

Jay Vyas

• Bigtop people answering my silly

questions:

Konstantin Boudnik

Roman Shaposhnik

Nate D'Amico

• Squirrels pushing the integration:

Robert Metzger

Fabian Hueske

Page 34: The Flink - Apache Bigtop integration

34© Cloudera, Inc. All rights reserved.

Check out the code

github.com/mbalassi/bigpetstore-flinkgithub.com/mbalassi/flink-parcel

Feel free to give me feedback.

Page 35: The Flink - Apache Bigtop integration

35© Cloudera, Inc. All rights reserved.

Come to Flink Forward

Page 36: The Flink - Apache Bigtop integration

36© Cloudera, Inc. All rights reserved.

Thank you@[email protected]