Budapest Spark Meetup - Apache Spark @enbrite.ly


Apache Spark @enbrite.ly
Budapest Spark Meetup
March 30, 2016

Joe MÉSZÁROS
software engineer
@joemesz

joemeszaros

Who we are
Our vision is to revolutionize the KPIs and metrics the online advertising industry is currently using. With our products, Antifraud, Brandsafety and Viewability, we provide actionable data to our customers.

Agenda
- What we do
- How we do it: the enbrite.ly data platform
- A real-world anti-fraud example
- Lessons learned + Spark at scale: pros and cons

Pipeline: DATA COLLECTION → DATA PROCESSING → ANALYZE (ANTI-FRAUD, VIEWABILITY, BRAND SAFETY) → REPORT + API

What we do

How we do it: DATA COLLECTION

How we do it: DATA PROCESSING

Amazon EMR
- Most popular cloud service provider
- Amazon big data ecosystem
- Applications: Hadoop, Spark, Hive, ...
- Scaling is easy
- Do not trust the BIG guys (API problem)
- A Spark application on EMR runs on YARN (the cluster manager)
- For more information: https://aws.amazon.com/elasticmapreduce/
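On EMR a Spark application is submitted to YARN with spark-submit (for example as an EMR step); a minimal sketch, with a placeholder class and jar name:

spark-submit --master yarn --deploy-mode cluster --class com.example.LogToEvent log-to-event.jar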

Tools we use: Luigi
https://github.com/spotify/luigi | 4500+ stars | more than 200 contributors
A workflow engine that helps you build complex data pipelines of batch jobs. Created by Spotify's engineering team.

Your friendly plumber, which glues your Hadoop, Spark, ... jobs together with simple dependency definitions and failure management.

import luigi


class SparkMeetupTask(luigi.Task):

    param = luigi.Parameter(default=42)

    def requires(self):
        return SomeOtherTask(self.param)

    def run(self):
        with self.output().open('w') as f:
            f.write('Hello Spark meetup!')

    def output(self):
        return luigi.LocalTarget('/meetup/message')


if __name__ == '__main__':
    luigi.run()
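Assuming the file above is saved as spark_meetup_task.py (the file name is made up), the task could be run locally with:

python spark_meetup_task.py SparkMeetupTask --param 7 --local-scheduler

The --local-scheduler flag skips the central scheduler daemon, which is handy for local testing.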

Web interface


Let me tell you a short story...

Tools we created: GABO LUIGI

Luigi + enbrite.ly extensions = Gabo Luigi
- Dynamic task configuration + dependencies
- Reshaped web interface
- Define reusable data pipeline templates
- Monitoring for each task


Tools we created: GABO LUIGI
We plan to release it into the wild and make it open source as part of Spotify's Luigi! If you are interested, the doors are open :-)

Tools we created: GABO MARATHON
Motivation: testing with large data sets and slow batch jobs is boring and wasteful!

Tools we created: GABO MARATHON

Graphite

Real-world example
You are fighting robots and want to humanize the ad tech era. You have a simple idea to detect bot traffic, which saves the world. Let's implement it!

Real-world example
THE IDEA: analyze events that are too hasty and deviate from regular, human-like profiles: too many clicks in a defined timeframe.

INPUT: load balancer access log files on S3
OUTPUT: print invalid sessions

Step 1: convert access log files to events

Step 2: sessionize events

Step 3: detect too many clicks
How do we solve it?

The way to the access log
{ "session_id": "spark_meetup_jsmmmoq", "timestamp": 1456080915621, "type": "click" }

Click event attributes (created by the JS tracker), base64-encoded:
eyJzZXNzaW9uX2lkIjoic3BhcmtfbWVldHVwX2pzbW1tb3EiLCJ0aW1lc3RhbXAiOjE0NTYwODA5MTU2MjEsInR5cGUiOiAiY2xpY2sifQo=

Access log format
TS CLIENT_IP STATUS "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."

Step 1: log to event
Simplification: the log files are on local storage and contain only click events.

SparkConf conf = new SparkConf().setAppName("LogToEvent");JavaSparkContext sparkContext = new JavaSparkContext(conf);

JavaRDD rawEvents = sparkContext.textFile(LOG_FOLDER);// 2016-02-29T23:50:36.269432Z 178.165.132.37 200 "GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj..."

Step 1: log to eventJavaRDD rawUrls = rawEvents.map(l -> l.split("\\s+")[3]);// GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj...

JavaRDD rawUrls = rawEvents.map(l -> l.split("\\s+")[3]);// GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj...

JavaRDD eventParameter = rawUrls .map(u -> parseUrl(u).get("event"));// eyJzZXNzaW9uX2lkIj

JavaRDD rawUrls = rawEvents.map(l -> l.split("\\s+")[3]);// GET https://api.endpoint?event=eyJzZXNzaW9uX2lkIj...

JavaRDD eventParameter = rawUrls .map(u -> parseUrl(u).get("event"));// eyJzZXNzaW9uX2lkJavaRDD base64Decoded = eventParameter .map(e -> new String(Base64.getDecoder().decode(e)));// {"session_id": "spark_meetup_jsmmmoq", // "timestamp": 1456080915621, "type": "click"}

IoUtil.saveAsJsonGzipped(base64Decoded);
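The parseUrl helper is not part of the deck; a minimal sketch of what it might look like, ignoring URL-encoding edge cases:

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: extract query parameters from the logged request.
static Map<String, String> parseUrl(String rawUrl) {
    // drop quotes and the HTTP method prefix, e.g. "GET "
    String url = rawUrl.replace("\"", "").replaceFirst("^GET\\s+", "");
    Map<String, String> params = new HashMap<>();
    String query = URI.create(url).getQuery();
    if (query != null) {
        for (String pair : query.split("&")) {
            String[] kv = pair.split("=", 2); // limit 2 keeps base64 '=' padding intact
            params.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
    }
    return params;
}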

Step 2: event to session

SparkConf conf = new SparkConf().setAppName("EventToSession");
JavaSparkContext sparkContext = new JavaSparkContext(conf);

JavaRDD<String> jsonEvents = IoUtil.readFrom(LOCAL_STORAGE);
JavaRDD<ClickEvent> clickEvents = jsonEvents.map(e -> readJsonObject(e));

// group the click events of each visitor session together
JavaPairRDD<String, Iterable<ClickEvent>> groupedEvents =
        clickEvents.groupBy(e -> e.getSessionId());

// turn each group of clicks into one or more Session objects
JavaPairRDD<String, Session> sessions =
        groupedEvents.flatMapValues(sessionizer);

Step 2: event to session

// Sessionizer
public Iterable<Session> call(Iterable<ClickEvent> clickEvents) {
    List<ClickEvent> ordered = sortByTimestamp(clickEvents);
    Session session = new Session();
    for (ClickEvent event : ordered) {
        session.addClick(event);
    }
    // flatMapValues expects an Iterable, so wrap the single session
    return Collections.singletonList(session);
}
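The deck's Sessionizer folds all of a visitor's clicks into a single session, yet Step 2 uses flatMapValues, which implies one group may yield several sessions. A hedged variant that splits on an inactivity gap (the 30-minute cutoff is an assumption, not from the deck):

// Assumed rule: close a session after 30 minutes of silence.
static final long GAP_MS = 30 * 60 * 1000;

public Iterable<Session> call(Iterable<ClickEvent> clickEvents) {
    List<ClickEvent> ordered = sortByTimestamp(clickEvents);
    List<Session> result = new ArrayList<>();
    Session current = new Session();
    long lastTs = Long.MIN_VALUE;
    for (ClickEvent event : ordered) {
        if (lastTs != Long.MIN_VALUE && event.getTimestamp() - lastTs > GAP_MS) {
            result.add(current); // long silence: close and start a new session
            current = new Session();
        }
        current.addClick(event);
        lastTs = event.getTimestamp();
    }
    result.add(current);
    return result;
}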

Step 2: event to session

class Session {

    public boolean isBad = false;
    public List<Long> clickTimestamps = new ArrayList<>();

    public void addClick(ClickEvent e) {
        clickTimestamps.add(e.getTimestamp());
    }

    public void badify() {
        this.isBad = true;
    }
}

Step 3: detect bad sessions

JavaRDD<Session> sessions = IoUtil.readFrom(LOCAL_STORAGE);
// mark sessions that have more clicks than the threshold
JavaRDD<Session> markedSessions = sessions.map(s -> {
    if (s.clickTimestamps.size() > THRESHOLD) {
        s.badify();
    }
    return s;
});
JavaRDD<Session> badSessions = markedSessions.filter(s -> s.isBad);

badSessions.collect().forEach(System.out::println);
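This check counts total clicks per session, while the stated idea was "too many clicks in a defined timeframe". A sketch of a windowed check (tooManyClicks, threshold and windowMs are illustrative names, not from the deck), assuming the timestamps were sorted during sessionization:

import java.util.List;

// Slide a window over ordered click timestamps; flag the session as soon as
// more than `threshold` clicks fall within any `windowMs`-wide window.
static boolean tooManyClicks(List<Long> sortedTs, int threshold, long windowMs) {
    int start = 0;
    for (int end = 0; end < sortedTs.size(); end++) {
        while (sortedTs.get(end) - sortedTs.get(start) > windowMs) {
            start++; // shrink the window until it spans at most windowMs
        }
        if (end - start + 1 > threshold) {
            return true;
        }
    }
    return false;
}

With this in place, the map step above could call tooManyClicks(s.clickTimestamps, THRESHOLD, WINDOW_MS) instead of comparing the raw count.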

Congratulations! MISSION COMPLETED
You just saved the world with a simple idea in ~10 minutes.

Using Spark: pros
- Spark is fun: community, tools
- Easy to get started
- Language support: Python, Scala, Java, R
- Unified stack: batch, streaming, SQL, ML

Using Spark: cons
- You need memory, and then more memory
- Distributed applications are hard to debug
- Hard to optimize

Lessons learned
- Do not use the default config, always optimize (see the sketch below)
- Eliminate technical debt + automate
- Failures happen: use monitoring from the very first breath + a fault-tolerant implementation
- Spark is fun, but it is not a hammer for everything
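As an illustration of the first point, a few commonly tuned Spark knobs (the values are made up, not enbrite.ly's settings):

// Illustrative only: override defaults that rarely fit a real workload.
SparkConf conf = new SparkConf()
        .setAppName("TunedJob")
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "4")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.default.parallelism", "200");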

Data platform future
- Would like to play with Redshift
- Change data format (Avro, Parquet, ...)
- Would like to play with streaming
- Would like to play with Spark 2.0

WE ARE HIRING!

- Working at the ex-Prezi office, K9
- Check out the company in Forbes :-)
- Amazing company culture
- BUT the real reason ...


... is our mood manager, Bigy :)

Joe MÉSZÁROS
software engineer
joe@enbrite.ly

@joemesz @enbritely

joemeszaros | enbritely

THANK YOU! QUESTIONS?