Budapest Spark Meetup - Apache Spark @enbrite.ly

Transcript
Page 1: Budapest Spark Meetup - Apache Spark @enbrite.ly

Apache Spark @enbrite.ly
Budapest Spark Meetup
March 30, 2016

Page 2

Joe MÉSZÁROS, software engineer

@joemesz

joemeszaros

Page 3

Who are we?

Our vision is to revolutionize the KPIs and metrics the online advertisement industry currently uses. With our products (Antifraud, Brandsafety and Viewability) we provide actionable data to our customers.

Page 4

Agenda

● What we do?
● How we do? - the enbrite.ly data platform
● Real world antifraud example
● Lessons learned + Spark at scale +/-

Page 5

What we do?

[Pipeline diagram] DATA COLLECTION → DATA PROCESSING → ANALYZE → ANTI FRAUD / VIEWABILITY / BRAND SAFETY → REPORT + API

Page 6

How we do?

DATA COLLECTION

Page 7

How we do?

DATA PROCESSING

Page 8

Amazon EMR
● Most popular cloud service provider
● Amazon Big Data ecosystem
● Applications: Hadoop, Spark, Hive, …
● Scaling is easy
● Do not trust the BIG guys (API problem)
● Spark application in EMR runs on YARN (cluster manager)

For more information: https://aws.amazon.com/elasticmapreduce/

Page 9

Tools we use

https://github.com/spotify/luigi | 4500 ★ | more than 200 contributors

Workflow engine that helps you build complex data pipelines of batch jobs. Created by Spotify’s engineering team.

Page 10

Your friendly plumber that glues your Hadoop, Spark, … jobs together with simple dependency definitions and failure management.

Page 11

import luigi


class SparkMeetupTask(luigi.Task):

    param = luigi.Parameter(default=42)

    def requires(self):
        return SomeOtherTask(self.param)

    def run(self):
        with self.output().open('w') as f:
            f.write('Hello Spark meetup!')

    def output(self):
        return luigi.LocalTarget('/meetup/message')


if __name__ == '__main__':
    luigi.run()

Page 12

Web interface

Page 13

Web interface

Page 14

Let me tell you a short story...

Page 15

Tools we created

GABO LUIGI

Luigi + enbrite.ly extensions = Gabo Luigi
● Dynamic task configuration + dependencies
● Reshaped web interface
● Define reusable data pipeline templates
● Monitoring for each task

Page 16

Tools we created

GABO LUIGI

Page 17

Tools we created

GABO LUIGI

We plan to release it into the wild and make it open source as part of Spotify’s Luigi! If you are interested, our doors are open :-)

Page 18

Tools we created

GABO MARATHON

Motivation: Testing with large data sets and slow batch jobs is boring and wasteful!

Page 19

Tools we created

GABO MARATHON

Graphite

Page 20

Real world example

You are fighting against robots and want to humanize the ad tech era. You have a simple idea to detect bot traffic, which saves the world. Let’s implement it!

Page 21

Real world example

THE IDEA: Analyse events that are too hasty and deviate from regular, human-like profiles: too many clicks in a defined timeframe.

INPUT: Load balancer access log files on S3
OUTPUT: Print invalid sessions

Page 22

How to solve it?

Step 1: convert access log files to events
Step 2: sessionize events
Step 3: detect too many clicks

Page 23

The way to access log

{ "session_id": "spark_meetup_jsmmmoq", "timestamp": 1456080915621, "type": "click"}

eyJzZXNzaW9uX2lkIjoic3BhcmtfbWVldHVwX2pzbW1tb3EiLCJ0aW1lc3RhbXAiOjE0NTYwODA5MTU2MjEsInR5cGUiOiAiY2xpY2sifQo=

Click event attributes (created by JS tracker)

Access log format
TS CLIENT_IP STATUS "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."

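The event parameter in the URL is just the Base64-encoded JSON shown above. As a quick sanity check, plain Java (no Spark needed, the class name DecodeEvent is ours) turns the string from this slide back into the click event:

```java
import java.util.Base64;

public class DecodeEvent {
    // Decode a Base64-encoded tracker payload back into its JSON form.
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded));
    }

    public static void main(String[] args) {
        String encoded = "eyJzZXNzaW9uX2lkIjoic3BhcmtfbWVldHVwX2pzbW1tb3EiLCJ0aW1lc3RhbXAiOjE0NTYwODA5MTU2MjEsInR5cGUiOiAiY2xpY2sifQo=";
        // Prints the click-event JSON from this slide.
        System.out.print(decode(encoded));
    }
}
```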

Page 24

Step 1: log to event
Simplify: log files are on the local storage, only click events.

SparkConf conf = new SparkConf().setAppName("LogToEvent");
JavaSparkContext sparkContext = new JavaSparkContext(conf);

JavaRDD<String> rawEvents = sparkContext.textFile(LOG_FOLDER);
// 2016-02-29T23:50:36.269432Z 178.165.132.37 200 "GET
//   https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."

Page 25

Step 1: log to event

JavaRDD<String> rawUrls = rawEvents.map(l -> l.split("\\s+")[3]);
// GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj...

JavaRDD<String> eventParameter = rawUrls
    .map(u -> parseUrl(u).get("event"));
// eyJzZXNzaW9uX2lkIj…

JavaRDD<String> base64Decoded = eventParameter
    .map(e -> new String(Base64.getDecoder().decode(e)));
// {"session_id": "spark_meetup_jsmmmoq",
//  "timestamp": 1456080915621, "type": "click"}

IoUtil.saveAsJsonGzipped(base64Decoded);
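parseUrl above is a helper that does not appear in the deck; a minimal sketch of what it might look like (name and behaviour are assumptions), pulling query parameters out of the tracked URL. Splitting each pair at the first '=' only keeps Base64 padding characters in the value intact:

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class UrlParser {
    // Split the query string of a URL into a parameter map,
    // e.g. "https://api.endpoint?event=abc" -> {event=abc}.
    static Map<String, String> parseUrl(String rawUrl) {
        Map<String, String> params = new HashMap<>();
        String query = URI.create(rawUrl).getQuery();
        if (query == null) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                // Split at the first '=' so values may themselves contain '='.
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }
}
```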

Page 26

Step 2: event to session

SparkConf conf = new SparkConf().setAppName("EventToSession");
JavaSparkContext sparkContext = new JavaSparkContext(conf);

JavaRDD<String> jsonEvents = IoUtil.readFrom(LOCAL_STORAGE);

JavaRDD<ClickEvent> clickEvents = jsonEvents
    .map(e -> readJsonObject(e));

JavaPairRDD<String, Iterable<ClickEvent>> groupedEvents =
    clickEvents.groupBy(e -> e.getSessionId());

JavaPairRDD<String, Session> sessions = groupedEvents
    .flatMapValues(sessionizer);

Page 27

Step 2: event to session

// Sessionizer
public Session call(Iterable<ClickEvent> clickEvents) {
    List<ClickEvent> ordered = sortByTimestamp(clickEvents);
    Session session = new Session();
    for (ClickEvent event : ordered) {
        session.addClick(event);
    }
    return session;
}
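sortByTimestamp is likewise not shown in the slides; one plausible implementation, with a minimal stand-in for the ClickEvent type (the real class surely carries more fields):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortHelper {
    // Minimal stand-in for the ClickEvent type used on the slides.
    static class ClickEvent {
        private final long timestamp;
        ClickEvent(long timestamp) { this.timestamp = timestamp; }
        long getTimestamp() { return timestamp; }
    }

    // Copy the events into a list and order them by timestamp,
    // so the sessionizer sees clicks in chronological order.
    static List<ClickEvent> sortByTimestamp(Iterable<ClickEvent> clickEvents) {
        List<ClickEvent> ordered = new ArrayList<>();
        clickEvents.forEach(ordered::add);
        ordered.sort(Comparator.comparingLong(ClickEvent::getTimestamp));
        return ordered;
    }
}
```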

Page 28

Step 2: event to session

class Session {

    public boolean isBad = false;
    public List<Long> clickTimestamps = new ArrayList<>();

    public void addClick(ClickEvent e) {
        clickTimestamps.add(e.getTimestamp());
    }

    public void badify() { this.isBad = true; }
}

Page 29

Step 3: detect bad sessions

JavaRDD<Session> sessions = IoUtil.readFrom(LOCAL_STORAGE);

JavaRDD<Session> markedSessions = sessions
    .map(s -> {
        if (s.clickTimestamps.size() > THRESHOLD) s.badify();
        return s;
    });

JavaRDD<Session> badSessions = markedSessions
    .filter(s -> s.isBad);

badSessions.collect().forEach(System.out::println);
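The deck simplifies "too many clicks in a defined timeframe" down to a total-count threshold. A sketch of the timeframe version (the names tooManyClicks, maxClicks and windowMillis are made up here) that sweeps a window over the sorted timestamps:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ClickRate {
    // True if any time window of windowMillis milliseconds contains
    // more than maxClicks clicks. Two-pointer sweep over sorted timestamps.
    static boolean tooManyClicks(List<Long> timestamps, int maxClicks, long windowMillis) {
        List<Long> ts = new ArrayList<>(timestamps);
        Collections.sort(ts);
        int start = 0;
        for (int end = 0; end < ts.size(); end++) {
            // Shrink the window until it spans at most windowMillis.
            while (ts.get(end) - ts.get(start) > windowMillis) {
                start++;
            }
            if (end - start + 1 > maxClicks) {
                return true;
            }
        }
        return false;
    }
}
```

A session would then be marked bad with s.badify() when its clickTimestamps trip this check.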

Page 30

Congratulations!
MISSION COMPLETED

YOU just saved the world with a simple idea within ~10 minutes.

Page 31

Using Spark pros
● Spark is fun: community, tools
● Easy to start with it
● Language support: Python, Scala, Java, R
● Unified stack: batch, streaming, SQL, ML

Page 32

Using Spark cons

● You need memory, and more memory
● Distributed application, hard to debug
● Hard to optimize

Page 33

Lessons learned
● Do not use the default config, always optimize!
● Eliminate technical debt + automate
● Failures happen: use monitoring from the very first breath + a fault tolerant implementation
● Spark is fun, but not a hammer for everything

Page 34

Data platform future
● Would like to play with Redshift
● Change data format (Avro, Parquet, …)
● Would like to play with streaming
● Would like to play with Spark 2.0

Page 35

WE ARE HIRING!

working at the ex-Prezi office, K9

check out the company in Forbes :-)

amazing company culture

BUT the real reason ….

Page 36

WE ARE HIRING!

… is our mood manager, Bigyó :)

Page 37

Joe MÉSZÁROS, software engineer
[email protected]

@joemesz @enbritely

joemeszaros
enbritely

THANK YOU!

?QUESTIONS?