Apache Spark: The modern data analytics platform

Apache Spark Mate Gulyas

Transcript of Apache Spark: The modern data analytics platform

Page 1: Apache Spark: The modern data analytics platform

Apache Spark

Mate Gulyas

Page 2: Apache Spark: The modern data analytics platform

WHY DO WE DO IT?

54% average viewability**

36% non-human visitors*

What percentage of digital ads reach people?

Clickbots, botnets

Invisible, hidden ads

Transparency in the market

* http://technorati.com/iab-keynote-36-percent-ad-traffic-from-bots-and-threatening-industry/

** http://www.statista.com/statistics/255061/viewability-rates-for-rich-media-ads-worldwide-by-industry/

Page 3: Apache Spark: The modern data analytics platform

JavaScript Segmentation

Behaviour analysis

WHAT DO WE DO?

Page 4: Apache Spark: The modern data analytics platform

Distributed data processing

WHAT DO WE NEED?

Average-size client: 30 GB / day, 900 GB / month

20 average-size clients: 600 GB / day, 18 TB / month
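The monthly figures follow from the daily ones by simple multiplication. A minimal sketch of that arithmetic, assuming a 30-day month and decimal units (1 TB = 1000 GB); both are assumptions inferred from the numbers on the slide, and the object and value names are made up for illustration:

```scala
object DataVolume {
  // Figures taken from the slide
  val gbPerClientPerDay = 30
  val clients = 20

  // Assumptions: 30-day month, decimal units (1 TB = 1000 GB)
  val daysPerMonth = 30
  val gbPerTb = 1000

  // 30 GB/day * 30 days = 900 GB/month for one client
  val gbPerClientPerMonth: Int = gbPerClientPerDay * daysPerMonth
  // 30 GB/day * 20 clients = 600 GB/day for all clients
  val gbAllClientsPerDay: Int = gbPerClientPerDay * clients
  // 600 GB/day * 30 days = 18000 GB = 18 TB/month for all clients
  val tbAllClientsPerMonth: Int = gbAllClientsPerDay * daysPerMonth / gbPerTb

  def main(args: Array[String]): Unit = {
    println(s"One client:  $gbPerClientPerMonth GB / month")
    println(s"All clients: $gbAllClientsPerDay GB / day")
    println(s"All clients: $tbAllClientsPerMonth TB / month")
  }
}
```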

Page 5: Apache Spark: The modern data analytics platform

Recurring data transformations

WHAT DO WE NEED?

Page 6: Apache Spark: The modern data analytics platform

Interactive / Batch / Streaming / SQL / Graph processing

IT’S SO COOL

Page 7: Apache Spark: The modern data analytics platform

In-memory

WHY SPARK?

Page 8: Apache Spark: The modern data analytics platform

Productive API

WHY SPARK?

Page 9: Apache Spark: The modern data analytics platform

Multiple languages

WHY SPARK?

Page 10: Apache Spark: The modern data analytics platform

Active community

WHY SPARK?

Page 11: Apache Spark: The modern data analytics platform

friendly

WHY SPARK?

developer

analyst

CFO

Page 14: Apache Spark: The modern data analytics platform

Resilient Distributed Dataset (RDD)

ONE THING TO REMEMBER

Page 15: Apache Spark: The modern data analytics platform

RDD
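To make the RDD idea concrete, here is a minimal sketch rather than anything shown in the talk. It assumes Spark core is on the classpath and runs in local mode; the object name, method name and input range are hypothetical. Transformations such as `map` and `filter` only describe the computation; nothing runs until an action such as `reduce` is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  // Sums the even squares of 1..upTo using RDD operations
  def evenSquareSum(sc: SparkContext, upTo: Int): Int = {
    // Distribute a local collection across the cluster as an RDD
    val numbers = sc.parallelize(1 to upTo)
    // Transformations are lazy: they only record how to derive the new RDD
    val evenSquares = numbers.map(n => n * n).filter(_ % 2 == 0)
    // The action triggers the actual computation
    evenSquares.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, using all available cores
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(evenSquareSum(sc, 10)) // 4 + 16 + 36 + 64 + 100 = 220
    } finally {
      sc.stop()
    }
  }
}
```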

Page 16: Apache Spark: The modern data analytics platform

IT RUNS ON

Mesos, YARN, Standalone, AWS EC2

Page 17: Apache Spark: The modern data analytics platform

WHERE DOES IT GET DATA FROM?

Amazon S3, HDFS, Cassandra, Hive, HBase, Tachyon, local filesystem, ODBC databases, etc.

Page 18: Apache Spark: The modern data analytics platform

Batch processing

THE OLD WAY

Page 19: Apache Spark: The modern data analytics platform

Interactive analytics

THE NEW WAY

Page 20: Apache Spark: The modern data analytics platform

SPARK WITH IPYTHON

Page 21: Apache Spark: The modern data analytics platform

Spark SQL

THE NOT THAT OLD WAY

Page 22: Apache Spark: The modern data analytics platform

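{"name": "Mate Gulyas", "twitter": "gulyasm"}
{"name": "John Doe", "email": "[email protected]"}
{"name": "Jane Doe", "email": "[email protected]"}

// Load the JSON file (one object per line); the schema is inferred from the records
val input = hiveCtx.jsonFile("example.json")
// Register the result under the table name used by the query
input.registerAsTable("users")
// Records without a "twitter" field yield null in that column
hiveCtx.sql("SELECT name, twitter FROM users")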

SQL WITH JSON

Page 23: Apache Spark: The modern data analytics platform

Spark Streaming

THE LOW LATENCY WAY

Page 24: Apache Spark: The modern data analytics platform

DStream
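A DStream is a sequence of small RDD batches, one per time interval. A minimal word-count sketch of that model, not taken from the talk: it assumes Spark Streaming 1.3 or later on the classpath and a text source on `localhost:9999` (for example `nc -lk 9999`). The host, port, batch interval and object name are all hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the receiver, one for processing
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    // Assumption: incoming data is grouped into 5-second batches
    val ssc = new StreamingContext(conf, Seconds(5))

    // Each batch of lines read from the socket becomes one RDD in the DStream
    val lines = ssc.socketTextStream("localhost", 9999)
    // The same operations as on RDDs, applied batch by batch
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Print the first few word counts of every batch
    counts.print()

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // block until the job is stopped
  }
}
```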

Page 25: Apache Spark: The modern data analytics platform

MLlib

THE SKYNET WAY

Page 26: Apache Spark: The modern data analytics platform

GraphX

I LOVE GRAPHS

Page 27: Apache Spark: The modern data analytics platform

Third party modules

THE OTHERS WAY

Page 28: Apache Spark: The modern data analytics platform

On-premises, AWS, Databricks Cloud

BUT… WHERE TO GO?

Page 29: Apache Spark: The modern data analytics platform

TAKEAWAY I

Spark can provide one platform to cover most of the use-cases in data analytics

Page 30: Apache Spark: The modern data analytics platform

TAKEAWAY II

A productive, fast data processing framework that helps you minimize time to business impact.

Page 31: Apache Spark: The modern data analytics platform

MATE [email protected]

@gulyasm, @enbritely

THANK YOU!