Implementing BigPetStore with Apache Flink


Transcript of Implementing BigPetStore with Apache Flink

Page 1: Implementing BigPetStore with Apache Flink

Implementing BigPetStore A blueprint for Flink users

Márton Balassi
[email protected] / @MartonBalassi

Hungarian Academy of Sciences

Page 2: Implementing BigPetStore with Apache Flink

Outline

• BigPetStore model
• Data generator with the DataSet API
• ETL with the DataSet & Table API
• Matrix factorization with FlinkML
• Recommendation with the DataStream API
• Summary

Page 3: Implementing BigPetStore with Apache Flink

BigPetStore

Blueprints for Big Data applications

Consists of:
• Data generators
• Examples using tools in the Big Data ecosystem to process data
• Build system and tests for integrating tools and multiple JVM languages

Part of the Apache BigTop project

Page 4: Implementing BigPetStore with Apache Flink

BigPetStore model

• Customers visit pet stores and generate transactions; the model is location based

Nowling, R.J.; Vyas, J., "A Domain-Driven, Generative Data Model for Big Pet Store," in Big Data and Cloud Computing (BDCloud), 2014 IEEE Fourth International Conference on, pp. 49-55, 3-5 Dec. 2014

Page 5: Implementing BigPetStore with Apache Flink

Data generation

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val (stores, products, customers) = getData()
val startTime = getCurrentMillis()

// Generate transactions per customer, then shift their timestamps by the start time
val transactions = env.fromCollection(customers)
  .flatMap(new TransactionGenerator(products))
  .withBroadcastSet(stores, "stores")
  .map { t => t.setDateTime(t.getDateTime + startTime); t }

transactions.writeAsText(output)

• Use RJ Nowling’s Java generator classes
• Write transactions to JSON

Page 6: Implementing BigPetStore with Apache Flink

ETL with the DataSet API

val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))

// Assign a unique id to every distinct product
val productsWithIndex = transactions.flatMap(_.getProducts)
  .distinct
  .zipWithUniqueId

// Join on the product to replace it with its id, then deduplicate
val customerAndProductPairs = transactions
  .flatMap(t => t.getProducts.map(p => (t.getCustomer.getId, p)))
  .join(productsWithIndex).where(_._2).equalTo(_._2)
  .map(pair => (pair._1._1, pair._2._1))
  .distinct

customerAndProductPairs.writeAsCsv(output)

• Read the dirty JSON
• Output (customer, product) pairs for the recommender
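The same ETL logic can be sketched with plain Scala collections, which may help when reading the DataSet version. This is only an illustration under assumptions: `Transaction` is a hypothetical stand-in for the real transaction class, products are plain strings, and `zipWithIndex` assigns consecutive indices, whereas Flink's `zipWithUniqueId` only promises unique ones.

```scala
// Plain-collections sketch of the ETL step (no Flink involved).
// Transaction is a hypothetical stand-in, not the BigPetStore class.
object EtlSketch {
  final case class Transaction(customerId: Long, products: Seq[String])

  // Returns the distinct (customerId, productIndex) pairs.
  def customerProductPairs(transactions: Seq[Transaction]): Set[(Long, Long)] = {
    // Give every distinct product an index, in order of first appearance
    val productIndex: Map[String, Long] =
      transactions.flatMap(_.products).distinct.zipWithIndex
        .map { case (product, idx) => product -> idx.toLong }
        .toMap

    // Replace each product by its index (the role of the join), then deduplicate
    transactions
      .flatMap(t => t.products.map(p => (t.customerId, productIndex(p))))
      .toSet
  }
}
```

A lookup in a local map plays the role of the distributed join here; on a cluster the product index is itself a DataSet, which is why the Flink version joins instead.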

Page 7: Implementing BigPetStore with Apache Flink

ETL with the Table API

val env = ExecutionEnvironment.getExecutionEnvironment
val transactions = env.readTextFile(json).map(new FlinkTransaction(_))

val table = transactions.map(toCaseClass(_)).toTable

// Number of transactions per store
val storeTransactionCount = table.groupBy('storeId)
  .select('storeId, 'storeName, 'storeId.count as 'count)

// Keep the stores whose transaction count equals the overall maximum
val bestStores = storeTransactionCount.select('count.max as 'max)
  .join(storeTransactionCount).where("count = max")
  .select('storeId, 'storeName, 'count)
  .toDataSet[StoreCount]

• Read the dirty JSON
• SQL style queries
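The intent of the query, counting transactions per store and keeping the stores with the highest count, can be sketched with plain Scala collections. The case classes `Tx` and `StoreCount` below are hypothetical stand-ins introduced for illustration; this is not the Table API.

```scala
// Plain-collections sketch of the "best stores" query (no Flink involved).
// Tx and StoreCount are hypothetical stand-ins for the slide's case classes.
object BestStoresSketch {
  final case class Tx(storeId: Long, storeName: String)
  final case class StoreCount(storeId: Long, storeName: String, count: Long)

  def bestStores(txs: Seq[Tx]): Seq[StoreCount] = {
    // GROUP BY store, COUNT(*) per group
    val counts = txs
      .groupBy(t => (t.storeId, t.storeName))
      .map { case ((id, name), group) => StoreCount(id, name, group.size.toLong) }
      .toSeq

    if (counts.isEmpty) Seq.empty
    else {
      // Keep every store whose count equals the overall maximum
      val max = counts.map(_.count).max
      counts.filter(_.count == max).sortBy(_.storeId)
    }
  }
}
```

Ties are kept on purpose: several stores can share the maximum count, which matches a join on equality between a count and the maximum.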

Page 8: Implementing BigPetStore with Apache Flink

A little recommender theory

[Figure: the user-item matrix R (users U by items I) factorized into user factors P and item factors Q, with user side information and item side information alongside]

• R is potentially huge, approximate it with PQ
• Prediction is TopK(user’s row of P × Q)
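A minimal sketch of the prediction step, assuming the user's factor vector and the item factors are already in memory: score every item by the dot product of the two vectors and keep the K highest scores. The names `FactorPrediction`, `dot`, and `topK` are made up for this illustration, and ties are broken by item id only to keep the result deterministic.

```scala
// Sketch: predict for one user by scoring each item with a dot product.
// All names here are illustrative; nothing is taken from the talk's code.
object FactorPrediction {
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "factor vectors must have the same length")
    a.indices.map(i => a(i) * b(i)).sum
  }

  // userFactors: the user's row of P; itemFactors: item id -> its factor vector from Q
  def topK(userFactors: Array[Double],
           itemFactors: Map[Int, Array[Double]],
           k: Int): Seq[(Int, Double)] =
    itemFactors.toSeq
      .map { case (itemId, q) => (itemId, dot(userFactors, q)) }
      // Highest score first; equal scores are ordered by item id
      .sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1))
      .take(k)
}
```

For example, with user factors (1.0, 2.0) and items 10 -> (1, 0), 20 -> (0, 1), 30 -> (1, 1), the scores are 1.0, 2.0, and 3.0, so the top 2 are items 30 and 20.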

Page 9: Implementing BigPetStore with Apache Flink

Matrix factorization with FlinkML

import org.apache.flink.ml.recommendation.ALS

val env = ExecutionEnvironment.getExecutionEnvironment

// Every (customer, product) pair becomes an implicit rating of 1.0
val input = env.readCsvFile[(Int, Int)](inputFile)
  .map(pair => (pair._1, pair._2, 1.0))

val model = ALS()
  .setNumFactors(numFactors)
  .setIterations(iterations)
  .setLambda(lambda)

model.fit(input)

// User factors (P) and item factors (Q)
val (p, q) = model.factorsOption.get
p.writeAsText(pOut)
q.writeAsText(qOut)

• Read the (customer, product) pairs
• Write P and Q to file
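For context, regularized matrix factorization is commonly stated as the minimization below, written in the slides' P, Q notation. This is a generic textbook form and an assumption on my part, not a statement of FlinkML's exact objective; the library may weight the regularization term differently. Here p_u is the user's row of P, q_i is the item's factor vector from Q, and λ is the value passed to setLambda.

```latex
\min_{P,\,Q} \sum_{(u,i)\,:\,r_{ui}\ \text{observed}}
  \left( r_{ui} - p_u^{\top} q_i \right)^2
  + \lambda \left( \sum_{u} \lVert p_u \rVert^2 + \sum_{i} \lVert q_i \rVert^2 \right)
```

Alternating least squares fixes Q and solves for P, then fixes P and solves for Q; each half-step is an ordinary least-squares problem, and the number of such rounds is what setIterations controls.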

Page 10: Implementing BigPetStore with Apache Flink

Recommendation with the DataStream API

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.socketTextStream("localhost", 9999)
    .map(new GetUserVector())
    .broadcast()
    .map(new PartialTopK())
    .keyBy(0)
    .flatMap(new GlobalTopK())
    .print();

• Get the user’s row for a userID
• Compute the distributed TopK of the user’s row × Q
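The two-stage TopK can be sketched without Flink as follows, assuming each parallel worker has already scored its own slice of the items for the requested user. Every worker keeps only its local best K, and a final step merges those short lists. `DistributedTopK`, `partialTopK`, and `globalTopK` are illustrative names; they mirror the roles of `PartialTopK` and `GlobalTopK` in the pipeline but are not their implementation.

```scala
// Sketch of a two-stage TopK: local best-K per partition, then a global merge.
// Names and types are illustrative assumptions, not the talk's classes.
object DistributedTopK {
  type Scored = (Int, Double) // (itemId, score)

  // Highest score first; equal scores are ordered by item id
  private def best(scores: Seq[Scored], k: Int): Seq[Scored] =
    scores.sortWith((a, b) => a._2 > b._2 || (a._2 == b._2 && a._1 < b._1)).take(k)

  // Runs on every partition over the items that partition holds
  def partialTopK(partition: Seq[Scored], k: Int): Seq[Scored] = best(partition, k)

  // Runs once per user over the partial results of all partitions
  def globalTopK(partials: Seq[Seq[Scored]], k: Int): Seq[Scored] = best(partials.flatten, k)
}
```

The merge is exact because any item in the global top K must also be in the top K of its own partition, so only K candidates per partition need to be sent to the final step.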

Page 11: Implementing BigPetStore with Apache Flink

Summary

• Go beyond WordCount with BigPetStore
• Feel free to mix the DataSet, DataStream, FlinkML, Table APIs in your Flink workflows
• Data generation, cleaning, ETL, machine learning, streaming prediction on top of one engine with under 500 lines of code
• Java and Scala APIs work well together
• A Flink pet project is always fun. No pun intended.

Page 12: Implementing BigPetStore with Apache Flink

Big thanks to

• The BigPetStore folks: Suneel Marthi, Ronald J. Nowling, Jay Vyas
• Squirrels helping with the code: Gyula Fóra, Gábor Gevay, Gábor Hermann, Fabian Hueske, Aljoscha Krettek
• And to the whole Flink community

Page 13: Implementing BigPetStore with Apache Flink

Check out the code

https://github.com/mbalassi/bigpetstore-flink

Márton Balassi
[email protected] / @MartonBalassi

Hungarian Academy of Sciences