Transcript of "Apache Spark: Easy and Fast Big Data Analytics" by Pat McDonough (files.meetup.com/14077672/Staxing Bricks.pdf)

Page 1:

Apache Spark: Easy and Fast Big Data Analytics

Pat McDonough

Page 2:

Founded by the creators of Apache Spark out of UC Berkeley's AMPLab

Fully committed to 100% open source Apache Spark

Support and Grow the Spark Community and Ecosystem

Building Databricks Cloud

Page 3:

Databricks & Datastax

Apache Spark is packaged as part of Datastax Enterprise Analytics 4.5

Databricks & Datastax Have Partnered for Apache Spark Engineering and Support

Page 4:

Big Data Analytics: Where We've Been

• 2003 & 2004 - Google GFS & MapReduce Papers are Precursors to Hadoop

• 2006 & 2007 - Google BigTable and Amazon Dynamo Papers are Precursors to Cassandra, HBase, and Others

Pages 5-8:

Big Data Analytics: A Zoo of Innovation

Page 9:

What's Working?

Many Excellent Innovations Have Come From Big Data Analytics:

• Distributed & Data Parallel is disruptive ... because we needed it

• We Now Have Massive throughput… Solved the ETL Problem

• The Data Hub/Lake Is Possible

Page 10:

What Needs to Improve? Go Beyond MapReduce

MapReduce is a Very Powerful and Flexible Engine

Processing Throughput Previously Unobtainable on Commodity Equipment

But MapReduce Isn’t Enough:

• Essentially Batch-only

• Inefficient with respect to memory use, latency

• Too Hard to Program

Page 11:

What Needs to Improve? Go Beyond (S)QL

SQL Support Has Been A Welcome Interface on Many Platforms

And in many cases, a faster alternative

But SQL Is Often Not Enough:

• Sometimes you want to write real programs (Loops, variables, functions, existing libraries) but don’t want to build UDFs.

• Machine Learning (see above, plus iterative)

• Multi-step pipelines

• Often an Additional System

Page 12:

What Needs to Improve? Ease of Use

Big Data Distributions Provide a Number of Useful Tools and Systems

Choices are Good to Have

But This Is Often Unsatisfactory:

• Each new system has its own configs, APIs, and management; coordinating multiple systems is challenging

• A typical solution requires stringing together disparate systems - we need unification

• Developers want the full power of their programming language

Page 13:

What Needs to Improve? Latency

Big Data systems are throughput-oriented

Some new SQL Systems provide interactivity

But We Need More:

• Interactivity beyond SQL interfaces

• Repeated access of the same datasets (i.e. caching)

Page 14:

Can Spark Solve These Problems?

Page 15:

Apache Spark

Originally developed in 2009 in UC Berkeley's AMPLab

Fully open sourced in 2010 – now at Apache Software Foundation

http://spark.apache.org

Pages 16-17:

Project Activity (June 2013 → June 2014)

total contributors: 68 → 255

companies contributing: 17 → 50

total lines of code: 63,000 → 175,000

Pages 18-19:

Compared to Other Projects

[Bar charts: Commits and Lines of Code Changed, activity in the past 6 months, Spark vs. other projects]

Spark is now the most active project in the Hadoop ecosystem

Page 20:

Spark on Github

So active on Github, sometimes we break it

Over 1200 Forks (can’t display Network Graphs)

~80 commits to master each week

So many PRs We Built our own PR UI

Pages 21-22:

Apache Spark - Easy to Use And Very Fast

Fast and general cluster computing system interoperable with Big Data Systems Like Hadoop and Cassandra

Improved Efficiency:

• In-memory computing primitives

• General computation graphs

Improved Usability:

• Rich APIs

• Interactive shell

Up to 100× faster (2-10× on disk)

2-5× less code

Page 23:

Apache Spark - A Robust SDK for Big Data Applications

[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]

Unified System With Libraries to Build a Complete Solution

Full-featured Programming Environment in Scala, Java, Python…

Very developer-friendly, Functional API for working with Data

Runtimes available on several platforms

Page 24:

Spark Is A Part Of Most Big Data Platforms

• All Major Hadoop Distributions Include Spark

• Spark Is Also Integrated With Non-Hadoop Big Data Platforms like DSE

• Spark Applications Can Be Written Once and Deployed Anywhere

[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core. Deploy Spark Apps Anywhere]

Page 25:

Cassandra + Spark: A Great Combination

Both are Easy to Use

Spark Can Help You Bridge Your Hadoop and Cassandra Systems

Use Spark Libraries, Caching on-top of Cassandra-stored Data

Combine Spark Streaming with Cassandra Storage

Datastax spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector
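A minimal sketch of what using the connector looks like from Scala, assuming the 1.x-era connector API; the keyspace ("test"), table ("words"), column names, and host are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Point Spark at a Cassandra node (hypothetical host)
val conf = new SparkConf()
  .setAppName("CassandraExample")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD of rows and use normal Spark operations on it
val words = sc.cassandraTable("test", "words")
println(words.count())

// Write a pair RDD back to Cassandra
val counts = sc.parallelize(Seq(("spark", 10), ("cassandra", 7)))
counts.saveToCassandra("test", "words", SomeColumns("word", "count"))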

Page 26:

Easy: Get Started Immediately

Interactive Shell, Multi-language Support

Python:

lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:

val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) { return s.contains("error"); }
}).count();

Page 27:

Easy: Clean API

Resilient Distributed Datasets

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations

• Automatically rebuilt on failure

Operations

• Transformations (e.g. map, filter, groupBy)

• Actions (e.g. count, collect, save)

Write programs in terms of transformations on distributed datasets
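A minimal sketch of this model in the Scala shell, assuming a SparkContext `sc`; the output path is hypothetical.

val nums = sc.parallelize(1 to 1000)           // build an RDD from a local collection
val evens = nums.filter(_ % 2 == 0)            // transformation: lazy, nothing runs yet
val squares = evens.map(n => n * n)            // another lazy transformation
println(squares.count())                       // action: triggers the computation (prints 500)
squares.saveAsTextFile("hdfs://.../squares")   // action: writes the results out (hypothetical path)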

Page 28:

Easy: Expressive API

map reduce

Page 29:

Easy: Expressive API

map filter groupBy sort union join leftOuterJoin rightOuterJoin

reduce count fold reduceByKey groupByKey cogroup cross zip

sample take first partitionBy mapWith pipe save ...
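A minimal sketch combining a few of these operations on key-value RDDs; the data is made up for illustration.

val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
                                ("about.html", "3.4.5.6"),
                                ("index.html", "1.3.3.1")))
val pageNames = sc.parallelize(Seq(("index.html", "Home"),
                                   ("about.html", "About")))

val visitCounts = visits.mapValues(_ => 1).reduceByKey(_ + _)  // hits per page
val joined = visitCounts.join(pageNames)                       // (url, (count, title))
joined.collect().foreach(println)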

Pages 30-31:

Easy: Example – Word Count

Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 32:

Easy: Works Well With Hadoop

Data Compatibility

• Access your existing Hadoop Data

• Use the same data formats

• Adheres to data locality for efficient processing


Deployment Models

• “Standalone” deployment

• YARN-based deployment

• Mesos-based deployment

• Deploy on existing Hadoop cluster or side-by-side
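A minimal sketch of the data-compatibility point above, assuming a SparkContext `sc`; the paths and the SequenceFile key/value types are hypothetical.

import org.apache.hadoop.io.{IntWritable, Text}

// Plain text files already sitting in HDFS
val logs = sc.textFile("hdfs://.../logs")
println(logs.count())

// An existing Hadoop SequenceFile, read with its Writable key/value classes
val seq = sc.sequenceFile("hdfs://.../counts", classOf[Text], classOf[IntWritable])
val asStrings = seq.map { case (k, v) => (k.toString, v.get) }  // copy out of reused Writables
asStrings.take(5).foreach(println)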

Page 33:

Example: Logistic Regression

data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w

Page 34:

Fast: Using RAM, Operator Graphs

In-memory Caching

• Data Partitions read from RAM instead of disk

Operator Graphs

• Scheduling Optimizations

• Fault Tolerance

[DAG diagram: RDDs A-F flow through map, filter, groupBy, and join operators, grouped into Stages 1-3; cached partitions are read from RAM instead of being recomputed]
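A minimal Scala sketch of the caching behavior described above; the input path is hypothetical.

import org.apache.spark.storage.StorageLevel

val ratings = sc.textFile("hdfs://.../ratings.csv")
val parsed = ratings.map(_.split(","))

// cache() keeps partitions in RAM (MEMORY_ONLY); persist() allows other levels,
// e.g. spilling to disk when memory is short
parsed.persist(StorageLevel.MEMORY_AND_DISK)

parsed.count()   // first action: reads from HDFS and populates the cache
parsed.count()   // later actions read the cached partitions instead of re-reading the file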

Page 35:

Fast: Logistic Regression Performance

[Chart: Running Time (s) vs. Number of Iterations (1, 5, 10, 20, 30), Hadoop vs. Spark]

Hadoop: 110 s / iteration

Spark: first iteration 80 s, further iterations 1 s

Page 36:

Fast: Scales Down Seamlessly

[Chart: Execution time (s) vs. % of working set in cache]

Cache disabled: 68.8 s, 25%: 58.1 s, 50%: 40.7 s, 75%: 29.7 s, Fully cached: 11.5 s

Page 37:

Easy: Fault Recovery

RDDs track lineage information that can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2])

[Lineage diagram: HDFS File → filter(func = startswith(...)) → Filtered RDD → map(func = split(...)) → Mapped RDD]
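The same pipeline sketched in Scala; toDebugString prints the lineage Spark would replay to rebuild lost partitions (the input path is hypothetical).

val textFile = sc.textFile("hdfs://.../app.log")
val msgs = textFile
  .filter(_.startsWith("ERROR"))
  .map(_.split("\t")(2))

// Prints the chain of RDDs (mapped RDD <- filtered RDD <- HadoopRDD) that Spark
// recomputes if a cached partition is lost
println(msgs.toDebugString)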

Page 38:

How Spark Works

Pages 39-42:

Working With RDDs

[Diagram, built up across these slides: RDDs flow through Transformations to new RDDs; an Action returns a Value to the driver]

textFile = sc.textFile("SomeFile.txt")

Transformations:

linesWithSpark = textFile.filter(lambda line: "Spark" in line)

Action → Value:

linesWithSpark.count()   # 74

linesWithSpark.first()   # Apache Spark

Pages 43-59:

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()

messages.filter(lambda s: "php" in s).count()

[Animated diagram, built up across these slides: the Driver sends tasks to three Workers; each Worker reads its HDFS block (Block 1-3), processes and caches the data (Cache 1-3), and returns results to the Driver. The first count() (the "mysql" query) is the Action that triggers this work; the second query (the "php" pattern) is processed entirely from the Workers' caches, with no further HDFS reads.]

Cache your data ➔ Faster Results

Full-text search of Wikipedia

• 60GB on 20 EC2 machines

• 0.5 sec from cache vs. 20s for on-disk

Page 60:

Spark’s Libraries

[Diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]

Page 61:

Spark SQL

Page 62:

What is Spark SQL?

• Out of the box APIs built on the same system

• SQL interfaces, SchemaRDDs, and a LINQ-like DSL for end users

• An optimizer framework for manipulating trees of relational operators.

• Native support for executing relational queries (SQL) in Spark.

• Optimized integration with external sources
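A minimal sketch of the 1.1-era Spark SQL API (a SchemaRDD registered as a temporary table); the people.txt file and its layout are hypothetical.

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit conversion: RDD[Person] -> SchemaRDD

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")

val teenagers = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagers.map(row => "Name: " + row(0)).collect().foreach(println)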

Page 63:

SparkSQL Architecture

Page 64:

Relationship to Shark

Borrows

• Hive data loading code / in-memory columnar representation

• hardened spark execution engine

Adds

• RDD-aware optimizer / query planner

• execution engine

• language interfaces.

Catalyst/SparkSQL is a nearly-from-scratch rewrite that leverages the best parts of Shark

Page 65:

Hive Compatibility

Interfaces to access data and code in the Hive ecosystem:

• Support for writing queries in HQL

• Catalog that interfaces with the Hive MetaStore

• Tablescan operator that uses Hive SerDes

• Wrappers for Hive UDFs, UDAFs, UDTFs
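A minimal sketch of querying Hive data through Spark SQL's HiveContext (Spark 1.1-era API); the table and query are hypothetical and assume an existing Hive metastore.

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// HiveQL queries run against tables registered in the Hive MetaStore
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("SELECT key, value FROM src WHERE key < 10")
  .collect()
  .foreach(println)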

Page 66:

Parquet Support

Native support for reading data stored in Parquet:

• Columnar storage avoids reading unneeded data.

• Nested Data support

• RDDs can be written to parquet files, preserving the schema.

• Predicate push-down support
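A minimal sketch of the Parquet support described above, using the 1.x SchemaRDD API; the Event case class and paths are hypothetical.

import org.apache.spark.sql.SQLContext

case class Event(userId: Int, action: String)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

val events = sc.parallelize(Seq(Event(1, "click"), Event(2, "view")))

// Write an RDD out as Parquet, preserving the schema
events.saveAsParquetFile("events.parquet")

// Read it back; the schema is recovered from the Parquet metadata
val loaded = sqlContext.parquetFile("events.parquet")
loaded.registerTempTable("events")
sqlContext.sql("SELECT action, COUNT(*) FROM events GROUP BY action").collect()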

Page 67:

JSON Support

Native support for reading data stored in JSON:

• Schema-inference through sampling

• Nested data support
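A minimal sketch of the JSON support (Spark 1.1-era jsonFile on SQLContext); the people.json file is hypothetical.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// The schema is inferred by sampling the input
val people = sqlContext.jsonFile("people.json")
people.printSchema()

people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").collect()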

Page 68:

Built-in Driver

JDBC available OOTB as of Spark 1.1

Page 69:

Optimizations

• In addition to the standard Spark framework’s optimizations…

• Predicate push-down

• Partition pruning

• Code gen

• Automatic Broadcasts (based on statistics)

Page 70:

Example: SparkSQL, Core APIs, and MLlib Working Together

val trainingDataTable = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

// Since `sql` returns an RDD, the results can be easily used in MLlib
val trainingData = trainingDataTable.map { row =>
  val features = Array[Double](row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}

val model = new LogisticRegressionWithSGD().run(trainingData)

Page 71:

Recent Roadmap Updates

Performance and Usability Improvements

• Disk spilling for skewed blocks during cache operations

• Disk spilling during aggregations for PySpark

• “sort-based shuffle”

• usability improvements for monitoring the performance of long-running or complex jobs

Page 72:

Recent Roadmap Updates: SparkSQL

• JDBC/ODBC server built-in

• Support for loading JSON data directly into Spark’s SchemaRDD format, including automatic schema inference.

• Dynamic bytecode generation, significantly speeding up execution for queries that perform complex expression evaluation.

• This release also adds support for registering Python, Scala, and Java lambda functions as UDFs

• Spark 1.1 adds a public types API to allow users to create SchemaRDDs from custom data sources.

• Many, many optimizations (Parquet-specific, cost-based, ...)

Page 73:

Recent Roadmap Updates

MLlib

• New library of statistical packages which provides exploratory analytic functions (stratified sampling, correlations, chi-squared tests, creating random datasets, …)

• Utilities for feature extraction (Word2Vec and TF-IDF) and feature transformation (normalization and standard scaling).

• Support for nonnegative matrix factorization and SVD via Lanczos.

• Decision tree algorithm has been added in Python and Java.

• Tree aggregation primitive

• Performance improves across the board, around 2-3X for many algorithms and up to 5X for large-scale decision tree problems.
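A minimal sketch of the new statistics utilities mentioned above (Spark 1.1-era MLlib API); the data is made up for illustration.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)))

val summary = Statistics.colStats(observations)      // column-wise summary statistics
println(summary.mean)
println(summary.variance)

val corr = Statistics.corr(observations, "pearson")  // correlation matrix between columns
println(corr)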

Page 74:

Recent Roadmap Updates

Spark Streaming

• New data source for Amazon Kinesis

• Apache Flume: a new pull-based mode (simplifying deployment and providing high availability)

• The first of a set of streaming machine learning algorithms is introduced with streaming linear regression.

• Rate limiting has been added for streaming inputs
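For context, a minimal Spark Streaming sketch (socket source, 10-second batches, word counts); the host and port are hypothetical, and the newer sources above (Kinesis, pull-based Flume) plug in the same way through their own input DStreams.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(10))

// One DStream per input source; socketTextStream is the simplest built-in source
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()
ssc.start()
ssc.awaitTermination()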

Page 75:

Thank You!

Visit http://databricks.com: Blogs, Tutorials and more

Questions?