USING APACHE SPARK FOR ANALYTICS IN THE CLOUD
William C. Benton, Principal Software Engineer, Red Hat Emerging Technology
June 24, 2015

Transcript of USING APACHE SPARK FOR ANALYTICS IN THE CLOUD

Page 1

USING APACHE SPARK FOR ANALYTICS IN THE CLOUD

William C. Benton

Principal Software Engineer

Red Hat Emerging Technology

June 24, 2015

Page 2

ABOUT ME
Distributed systems and data science in Red Hat's Emerging Technology group
Active open-source and Fedora developer
Before Red Hat: programming language research

Page 3

FORECAST
Distributed data processing: history and mythology
Data processing in the cloud
Introducing Apache Spark
How we use Spark for data science at Red Hat

Page 4

Recent history & persistent mythology

DATA PROCESSING

Page 5

What makes distributed data processing difficult?

CHALLENGES

Page 6

MAPREDUCE (2004)
A novel application of some very old functional programming ideas to distributed computing
All data are modeled as key-value pairs
Mappers transform pairs; reducers merge several pairs with the same key into one new pair
Runtime system shuffles data to improve locality

Page 7

WORD COUNT
"a b"  "c e"  "a b"  "d a"  "d b"

Page 8

MAPPED INPUTS
(a, 1) (b, 1)
(c, 1) (e, 1)
(a, 1) (b, 1)
(d, 1) (a, 1)
(d, 1) (b, 1)

Page 9

SHUFFLED RECORDS
(a, 1) (a, 1) (a, 1)
(b, 1) (b, 1) (b, 1)
(c, 1)
(d, 1) (d, 1)
(e, 1)

Page 10

REDUCED RECORDS

(a, 3) (b, 3) (c, 1) (d, 2) (e, 1)
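A minimal pure-Python sketch of the flow just shown (not from the slides): the mapper, shuffle, and reducer below reproduce the word-count walkthrough above in a single process, so the "shuffle" is simply a group-by-key.

from collections import defaultdict

records = ["a b", "c e", "a b", "d a", "d b"]

def mapper(record):
    # emit a (word, 1) pair for every word in a record
    for word in record.split():
        yield (word, 1)

def shuffle(mapped_pairs):
    # group all values that share a key, as the runtime's shuffle would
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # merge all pairs with the same key into a single (key, count) pair
    return (key, sum(values))

mapped = [pair for record in records for pair in mapper(record)]
reduced = sorted(reducer(k, vs) for k, vs in shuffle(mapped).items())
print(reduced)  # [('a', 3), ('b', 3), ('c', 1), ('d', 2), ('e', 1)]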

Page 11

HADOOP (2005)
Open-source implementation of MapReduce, a distributed filesystem, and more
Inexpensive way to store and process data with scale-out on commodity hardware
Motivates many of the default assumptions we make about “big data” today

Page 12

“FACTS”
You need an architecture that will scale out to many nodes to handle real-world data analytics
Your network and disks probably aren't fast enough
Locality is everything: you need to be able to run compute jobs on the nodes storing your data

Page 13

“FACTS”
You need an architecture that will scale out to many nodes to handle real-world data analytics
Your network and disks probably aren't fast enough
Locality is everything: you need to be able to run compute jobs on the nodes storing your data

Page 14

...at least two analytics production clusters (at Microsoft and Yahoo) have median job input sizes under 14 GB and 90% of jobs on a Facebook cluster have input sizes under 100 GB.

Appuswamy et al., “Nobody ever got fired for buying a cluster.” Microsoft Research Tech Report.

Page 15

Takeaway #1: you may need petascale storage, but you probably don't even need terascale compute.

Page 16

Takeaway #2: moderately sized workloads benefit more from scale-up than scale-out.

Page 17

“FACTS”
You need an architecture that will scale out to many nodes to handle real-world data analytics
Your network and disks probably aren't fast enough
Locality is everything: you need to be able to run compute jobs on the nodes storing your data

Page 18

Contrary to our expectations ... CPU (and not I/O) is often the bottleneck [and] improving network performance can improve job completion time by a median of at most 2%

Ousterhout et al., “Making Sense of Performance in Data Analytics Frameworks.” USENIX NSDI ’15.

Page 19

Takeaway #3: I/O is not the bottleneck (especially in moderately-sized jobs); focus on CPU performance.

Page 20

“FACTS”
You need an architecture that will scale out to many nodes to handle real-world data analytics
Your network and disks probably aren't fast enough
Locality is everything: you need to be able to run compute jobs on the nodes storing your data

Page 21

Takeaway #4: collocated data and compute was a sensible choice for petascale jobs in 2005, but shouldn't necessarily be the default today.

Page 22

FACTS (REVISED)
You probably don't need an architecture that will scale out to many nodes to handle real-world data analytics (and might be better served by scaling up)
Your network and disks probably aren't the problem
You have enormous flexibility to choose the best technologies for storage and compute

Page 23

HADOOP IN 2015
MapReduce is low-level, verbose, and not an obvious fit for many interesting problems
No unified abstractions: Hive or Pig for query, Giraph for graph, Mahout for machine learning, etc.
Fundamental architectural assumptions need to be revisited along with the “facts” motivating them

Page 24

How our assumptions should change

DATA PROCESSING IN THE CLOUD

Page 25

COLLOCATED DATA AND COMPUTE

Page 26

ELASTIC RESOURCES

Page 27

DISTINCT STORAGE AND COMPUTE
Combine the best storage system for your application with elastic compute resources.

Page 28

INTRODUCING SPARK

Page 29

Apache Spark is a framework for distributed computing based on a high-level, expressive abstraction.

Page 30

[Diagram: Spark core, with Query, ML, Graph, and Streaming libraries layered on top]

Page 31

[Diagram: Spark core with Query, ML, Graph, and Streaming libraries, running ad hoc or under Mesos or YARN]

Page 32

[Diagram: Spark core with Query, ML, Graph, and Streaming libraries, running ad hoc or under Mesos or YARN]

Language bindings for Scala, Java, Python, and R

Access data from JDBC, Gluster, HDFS, S3, and more
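As an aside (not from the slides), the same textFile call can point at different storage back-ends just by changing the URI. The paths and bucket below are hypothetical, and s3n:// is the S3 scheme commonly paired with Spark 1.x-era Hadoop connectors; a Gluster volume is shown here simply as a mounted filesystem path.

lines_hdfs  = spark.textFile("hdfs:///data/hamlet.txt")         # HDFS
lines_s3    = spark.textFile("s3n://some-bucket/hamlet.txt")    # Amazon S3
lines_mount = spark.textFile("file:///mnt/gluster/hamlet.txt")  # e.g. a mounted Gluster volume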

Page 33

A resilient distributed dataset is a partitioned, immutable, lazy collection.

Page 34

A resilient distributed dataset is a partitioned, immutable, lazy collection.

Page 35

A resilient distributed dataset is a partitioned, immutable, lazy collection.

The PARTITIONS making up an RDD can be distributed across multiple machines

Page 36

A resilient distributed dataset is a partitioned, immutable, lazy collection.

TRANSFORMATIONS create new (lazy) collections; ACTIONS force computations and return results
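A tiny illustration of that laziness (not from the slides), assuming a SparkContext named spark as in the examples on the following pages:

nums = spark.parallelize(range(1, 1000))
doubled = nums.map(lambda x: x * 2)          # a transformation: nothing runs yet
total = doubled.reduce(lambda a, b: a + b)   # an action: forces the computation
print(total)                                 # 999000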

Page 37

CREATING AN RDD

# from an in-memory collection
spark.parallelize(range(1, 1000))

# from the lines of a text file
spark.textFile("hamlet.txt")

# from a Hadoop-format binary file
spark.hadoopFile("...")
spark.sequenceFile("...")
spark.objectFile("...")

Page 38

TRANSFORMING RDDS

# transform each element independently
numbers.map(lambda x: x + 1)

# turn each element into zero or more elements
lines.flatMap(lambda s: s.split(" "))

# reject elements that don't satisfy a predicate
vowels = ['a', 'e', 'i', 'o', 'u']
words.filter(lambda s: s[0] in vowels)

# keep only one copy of duplicate elements
words.distinct()

Page 39

TRANSFORMING RDDS

# return an RDD of key-value pairs, sorted by the keys of each
pairs.sortByKey()

# combine every two pairs having the same key, using the given reduce function
pairs.reduceByKey(lambda x, y: max(x, y))

# join together two RDDs of pairs so that
# [(a, b)] join [(a, c)] == [(a, (b, c))]
pairs.join(other_pairs)
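A quick check of that join semantics (not from the slides), using literal pairs:

pairs = spark.parallelize([("a", "b")])
other_pairs = spark.parallelize([("a", "c")])
print(pairs.join(other_pairs).collect())   # [('a', ('b', 'c'))]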

Page 40

CACHING RESULTS

# tell Spark to cache this RDD in cluster memory after we compute it
sorted_pairs = pairs.sortByKey()
sorted_pairs.cache()

# as above, except also store a copy on disk
sorted_pairs.persist(MEMORY_AND_DISK)

# uncache and free this result
sorted_pairs.unpersist()
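One detail the slide leaves implicit: in PySpark the storage level above is a member of pyspark.StorageLevel, so a runnable version of the persist call looks like this.

from pyspark import StorageLevel

sorted_pairs.persist(StorageLevel.MEMORY_AND_DISK)   # keep in memory, spill to disk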

Page 41

COMPUTING RESULTS

# compute this RDD and return a count of elements
numbers.count()

# compute this RDD and materialize it as a local collection
counts.collect()

# compute this RDD and write each partition to stable storage
words.saveAsTextFile("...")

Page 42

WORD COUNT EXAMPLE

# create an RDD backed by the lines of a file
f = spark.textFile("...")

# ...mapping from lines of text to words
words = f.flatMap(lambda line: line.split(" "))

# ...mapping from words to occurrences
occs = words.map(lambda word: (word, 1))

# ...reducing occurrences to counts
counts = occs.reduceByKey(lambda a, b: a + b)

# POP QUIZ: what have we computed so far?
counts.saveAsTextFile("...")
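Not from the slides: the same pipeline applied to the toy records from the earlier MapReduce walkthrough, collected locally instead of written to storage.

f = spark.parallelize(["a b", "c e", "a b", "d a", "d b"])
words = f.flatMap(lambda line: line.split(" "))
occs = words.map(lambda word: (word, 1))
counts = occs.reduceByKey(lambda a, b: a + b)
print(sorted(counts.collect()))   # [('a', 3), ('b', 3), ('c', 1), ('d', 2), ('e', 1)]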

Page 43

PETASCALE STORAGE, IN-MEMORY COMPUTE

[Diagram: storage/compute nodes alongside compute-only nodes, which primarily operate on cached post-ETL data]

Page 44

DATA SCIENCE AT RED HAT

Page 45

THE EMERGING TECH DATA SCIENCE TEAM
Engineers with distributed systems, data science, and scientific computing expertise
Goal: help internal customers solve data problems and make data-driven decisions
Principles: identify best practices, question outdated assumptions, use best-of-breed technology

Page 46

DEVELOPMENT

Six compute-only nodes
Two nodes for Gluster storage
Apache Spark running under Apache Mesos
Open-source “notebook” interfaces to analyses
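A hedged sketch (not from the slides) of what pointing PySpark at a Mesos master can look like for a cluster like this one; the hostname and application name are hypothetical.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("mesos://mesos-master.example.com:5050")   # hypothetical Mesos master
        .setAppName("data-science-notebook"))
spark = SparkContext(conf=conf)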

Page 47

DATA SOURCES

FTP

S3

SQL

MongoDB

ElasticSearch

Page 48

INTERACTIVE QUERY

Page 49

TWO CASE STUDIES

Page 50

ROLE ANALYSIS
Data source: historical configuration and telemetry data for internal machines, from ElasticSearch
Data size: hundreds of GB
Analysis: identify machine roles based on the packages each has installed

Page 51

BUDGET FORECASTING
Data sources: operational log data for OpenShift Online (from MariaDB), actual costs incurred by OpenShift
Data size: over 120 GB
Analysis: identify operational metrics most strongly correlated with operating expenses; model daily operating expense as a function of these metrics

Aggregating performance metrics: 17 hours in MariaDB, 15 minutes in Spark!

Page 52

NEXT STEPS

Page 53

DEMO VIDEO
See a video demo of Continuum Analytics, PySpark, and Red Hat Storage: h.264 or Ogg Theora

Page 54

WHERE FROM HERE
Check out the Emerging Technology Data Science team's library to help build your own data-driven applications: https://github.com/willb/silex/
See my blog for articles about open source data science: http://chapeau.freevariable.com
Questions?

Page 55

THANKS

Page 56