
Apache Spark
Easy and Fast Big Data Analytics
Pat McDonough

Databricks
Founded by the creators of Apache Spark out of UC Berkeley's AMPLab
Fully committed to 100% open source Apache Spark
Support and Grow the Spark Community and Ecosystem
Building Databricks Cloud

Databricks & Datastax
Apache Spark is packaged as part of Datastax Enterprise Analytics 4.5
Databricks & Datastax Have Partnered for Apache Spark Engineering and Support

Big Data Analytics: Where We've Been
• 2003 & 2004 - Google GFS & MapReduce Papers are Precursors to Hadoop
• 2006 & 2007 - Google BigTable and Amazon Dynamo Papers are Precursors to Cassandra, HBase, and Others

Big Data Analytics: A Zoo of Innovation

What's Working?
Many Excellent Innovations Have Come From Big Data Analytics:
• Distributed & Data Parallel is disruptive ... because we needed it
• We Now Have Massive Throughput… Solved the ETL Problem
• The Data Hub/Lake Is Possible

What Needs to Improve? Go Beyond MapReduce
MapReduce is a Very Powerful and Flexible Engine
Processing Throughput Previously Unobtainable on Commodity Equipment
But MapReduce Isn’t Enough:
• Essentially Batch-only
• Inefficient with respect to memory use, latency
• Too Hard to Program

What Needs to Improve? Go Beyond (S)QL
SQL Support Has Been A Welcome Interface on Many Platforms
And in many cases, a faster alternative
But SQL Is Often Not Enough:
• Sometimes you want to write real programs (Loops, variables, functions, existing libraries) but don’t want to build UDFs.
• Machine Learning (see above, plus iterative)
• Multi-step pipelines
• Often an Additional System

What Needs to Improve? Ease of Use
Big Data Distributions Provide a Number of Useful Tools and Systems
Choices are Good to Have
But This Is Often Unsatisfactory:
• Each new system has its own configs, APIs, and management; coordinating multiple systems is challenging
• A typical solution requires stringing together disparate systems - we need unification
• Developers want the full power of their programming language

What Needs to Improve? Latency
Big Data systems are throughput-oriented
Some new SQL Systems provide interactivity
But We Need More:
• Interactivity beyond SQL interfaces
• Repeated access of the same datasets (i.e. caching)

Can Spark Solve These Problems?

Apache Spark
Originally developed in 2009 in UC Berkeley's AMPLab
Fully open sourced in 2010 – now at Apache Software Foundation
http://spark.apache.org

Project Activity
                         June 2013    June 2014
total contributors              68          255
companies contributing          17           50
total lines of code         63,000      175,000

Compared to Other Projects
[Chart: commits and lines of code changed over the past 6 months, Spark vs. other Hadoop-ecosystem projects]
Spark is now the most active project in the Hadoop ecosystem

Spark on GitHub
So active on GitHub, sometimes we break it
Over 1200 Forks (can’t display Network Graphs)
~80 commits to master each week
So many PRs we built our own PR UI

Apache Spark - Easy to Use And Very Fast
Fast and general cluster computing system, interoperable with Big Data systems like Hadoop and Cassandra
Improved Efficiency:
• In-memory computing primitives
• General computation graphs
Improved Usability:
• Rich APIs
• Interactive shell
Up to 100× faster (2-10× on disk)
2-5× less code

Apache Spark - A Robust SDK for Big Data Applications
[Stack diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]
Unified System With Libraries to Build a Complete Solution
Full-featured Programming Environment in Scala, Java, Python…
Very Developer-friendly, Functional API for Working with Data
Runtimes Available on Several Platforms

Spark Is A Part Of Most Big Data Platforms
• All Major Hadoop Distributions Include Spark
• Spark Is Also Integrated With Non-Hadoop Big Data Platforms like DSE
• Spark Applications Can Be Written Once and Deployed Anywhere
[Stack diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]
Deploy Spark Apps Anywhere

Cassandra + Spark: A Great Combination
Both are Easy to Use
Spark Can Help You Bridge Your Hadoop and Cassandra Systems
Use Spark Libraries, Caching on-top of Cassandra-stored Data
Combine Spark Streaming with Cassandra Storage
Datastax spark-cassandra-connector: https://github.com/datastax/spark-cassandra-connector
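
As a rough Scala sketch of what working with the connector looks like (the keyspace "test", table "kv", its key/value columns, and the contact point are assumptions, and exact setup can vary by connector version):

import com.datastax.spark.connector._          // brings in cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CassandraExample")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // assumed Cassandra contact point
val sc = new SparkContext(conf)

// Read a Cassandra table as an RDD and apply ordinary Spark operations to it
val rows = sc.cassandraTable("test", "kv")               // assumed keyspace and table
println(rows.map(_.getInt("value")).reduce(_ + _))       // sum of the assumed "value" column

// Write an RDD of tuples back to the same (assumed) table
val data = sc.parallelize(Seq(("key1", 1), ("key2", 2)))
data.saveToCassandra("test", "kv", SomeColumns("key", "value"))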

Easy: Get Started Immediately
Interactive Shell, Multi-language Support

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) { return s.contains("ERROR"); }
}).count();

Easy: Clean API
Resilient Distributed Datasets
• Collections of objects spread across a cluster, stored in RAM or on Disk
• Built through parallel transformations
• Automatically rebuilt on failure
Operations
• Transformations (e.g. map, filter, groupBy)
• Actions (e.g. count, collect, save)
Write programs in terms of transformations on distributed datasets
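
As a small Scala illustration of this split (transformations are lazy and only record lineage; actions trigger execution), with placeholder file paths:

val lines = sc.textFile("hdfs://.../input.txt")        // base RDD (placeholder path)

// Transformations are lazy: nothing runs yet, Spark only records the lineage
val words  = lines.flatMap(line => line.split(" "))
val pairs  = words.map(word => (word, 1))
val counts = pairs.reduceByKey(_ + _)

// Actions trigger the actual computation
val total = counts.count()                // number of distinct words
counts.saveAsTextFile("hdfs://.../out")   // materialize results to storage (placeholder path)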

Easy: Expressive API
map, reduce

Easy: Expressive API
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
sample, take, first, partitionBy, mapWith, pipe, save, ...

Easy: Example – Word Count

Hadoop MapReduce:

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark:

val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Easy: Works Well With Hadoop
Data Compatibility
• Access your existing Hadoop Data
• Use the same data formats
• Adheres to data locality for efficient processing
Deployment Models
• “Standalone” deployment
• YARN-based deployment
• Mesos-based deployment
• Deploy on existing Hadoop cluster or side-by-side
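
As a rough Scala illustration of how one application can target these different deployment models, a sketch of constructing a SparkContext with a configurable master URL (the host names and app name are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

// The master URL selects the deployment model:
//   "local[*]"            - run locally for development
//   "spark://host:7077"   - standalone cluster
//   "yarn-client"         - YARN-based deployment (Spark 1.x master string)
//   "mesos://host:5050"   - Mesos-based deployment
val conf = new SparkConf()
  .setAppName("MyApp")                       // placeholder app name
  .setMaster("spark://master-host:7077")     // placeholder master URL
val sc = new SparkContext(conf)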

Example: Logistic Regression

import numpy
from math import exp

data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    # logistic-loss gradient summed over all points
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w

Fast: Using RAM, Operator Graphs
In-memory Caching
• Data Partitions read from RAM instead of disk
Operator Graphs
• Scheduling Optimizations
• Fault Tolerance
[Diagram: RDD operator graph in which RDDs A-F flow through map, filter, join, and groupBy operations, split by the scheduler into Stages 1-3; cached partitions are highlighted]
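
A small Scala sketch, loosely following the diagram above, of caching an RDD and running a multi-stage operator graph over it; the paths and record layout are assumptions:

import org.apache.spark.storage.StorageLevel

val events = sc.textFile("hdfs://.../events")           // placeholder path
  .map(line => (line.split(",")(0), line))              // assumed (userId, record) layout
  .persist(StorageLevel.MEMORY_ONLY)                    // keep partitions in RAM for reuse

val profiles = sc.textFile("hdfs://.../profiles")       // placeholder path
  .map(line => (line.split(",")(0), line))

// join + filter + groupBy form a multi-stage operator graph; Spark schedules the
// stages and can recompute lost partitions from lineage if a node fails
val grouped = events.join(profiles)
  .filter { case (_, (event, _)) => event.contains("ERROR") }
  .groupByKey()

grouped.count()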

Fast: Logistic Regression Performance
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30), Hadoop vs. Spark]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s

Fast: Scales Down Seamlessly
[Chart: execution time (s) vs. % of working set in cache]
Cache disabled: ~69 s, 25% cached: ~58 s, 50% cached: ~41 s, 75% cached: ~30 s, fully cached: ~12 s

Easy: Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data

msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                .map(lambda s: s.split("\t")[2]))

[Diagram: HDFS File -> filter(func = startswith(...)) -> Filtered RDD -> map(func = split(...)) -> Mapped RDD]

How Spark Works

Working With RDDs

textFile = sc.textFile("SomeFile.txt")                            # creates an RDD

linesWithSpark = textFile.filter(lambda line: "Spark" in line)    # Transformation: RDD -> RDD

linesWithSpark.count()    # Action -> 74
linesWithSpark.first()    # Action -> "# Apache Spark"

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()   # Action: triggers the first computation
messages.filter(lambda s: "php" in s).count()     # Second query is served from the cache

[Diagram sequence: a Driver and three Workers, each Worker holding one HDFS block (Block 1-3).
On the first action, the Driver ships tasks to the Workers; each Worker reads its HDFS block,
processes and caches the filtered messages (Cache 1-3), and returns results to the Driver.
On the second action, the Driver ships tasks again, and the Workers process the data directly
from their caches, returning results without re-reading HDFS.]

Cache your data ➔ Faster Results
Full-text search of Wikipedia:
• 60GB on 20 EC2 machines
• 0.5 sec from cache vs. 20 s on-disk

Spark’s Libraries
[Stack diagram: SQL, Machine Learning, Streaming, and Graph libraries on top of Spark Core]

Spark SQL

What is Spark SQL?
• Out of the box APIs built on the same system
• SQL interfaces, SchemaRDDs, and a LINQ-like DSL for end users
• An optimizer framework for manipulating trees of relational operators.
• Native support for executing relational queries (SQL) in Spark.
• Optimized integration with external sources
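
As a rough sketch in the style of the Spark 1.1 SQL programming guide, turning an RDD of case classes into a SchemaRDD and querying it with SQL; the file path and "name,age" record layout are assumptions:

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD          // implicit conversion: RDD[Person] -> SchemaRDD

val people = sc.textFile("people.txt")     // assumed path with "name,age" lines
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerTempTable("people")         // expose the RDD to the SQL interface

val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(row => "Name: " + row(0)).collect().foreach(println)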

SparkSQL Architecture

Relationship to Shark
Borrows:
• Hive data loading code / in-memory columnar representation
• Hardened Spark execution engine
Adds:
• RDD-aware optimizer / query planner
• Execution engine
• Language interfaces
Catalyst/SparkSQL is a nearly from-scratch rewrite that leverages the best parts of Shark

Hive Compatibility
Interfaces to access data and code in the Hive ecosystem:
• Support for writing queries in HQL
• Catalog that interfaces with the Hive MetaStore
• Tablescan operator that uses Hive SerDes
• Wrappers for Hive UDFs, UDAFs, UDTFs

Parquet Support
Native support for reading data stored in Parquet:
• Columnar storage avoids reading unneeded data
• Nested data support
• RDDs can be written to Parquet files, preserving the schema
• Predicate push-down support
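
A short Scala sketch of round-tripping a SchemaRDD through Parquet (Spark 1.x-era API; the Person schema, sample rows, and output path are assumptions):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 35)))   // sample data

// Write the SchemaRDD to Parquet, preserving its schema
people.saveAsParquetFile("people.parquet")                                 // assumed path

// Read it back; the result is again a SchemaRDD that can be registered and queried
val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlContext.sql("SELECT name FROM parquet_people WHERE age < 30").collect().foreach(println)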

JSON Support
Native support for reading data stored in JSON:
• Schema inference through sampling
• Nested data support
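
A minimal Scala sketch of loading JSON into a SchemaRDD (Spark 1.1-era API; the file path and its "name"/"age" fields are assumptions):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// jsonFile infers the schema, including nested fields, from the JSON records
val people = sqlContext.jsonFile("people.json")    // assumed path, one JSON object per line
people.printSchema()

people.registerTempTable("people_json")
sqlContext.sql("SELECT name FROM people_json WHERE age > 21").collect().foreach(println)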

Built-in Driver
JDBC available OOTB as of Spark 1.1

Optimizations
• In addition to the standard Spark framework’s optimizations…
• Predicate push-down
• Partition pruning
• Code generation
• Automatic Broadcasts (based on statistics)
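
These optimizations are largely automatic, but a couple can be toggled through SQLContext configuration. A hedged Scala sketch, assuming a SQLContext named sqlContext as in the earlier examples and using property names as documented for Spark 1.1; the threshold value is illustrative:

// Enable runtime bytecode generation for expression evaluation
sqlContext.setConf("spark.sql.codegen", "true")

// Tables smaller than this threshold (in bytes) are broadcast for joins
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)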

Example: SparkSQL, Core APIs, and MLlib Working Together
val trainingDataTable = sql(""" SELECT e.action,
u.age, u.latitude, u.logitude FROM Users u JOIN
Events e ON u.userId = e.userId""")// Since `sql`
returns an RDD, the results of can be easily used in
MLlib val trainingData = trainingDataTable.map { row => val features = Array[Double](row(1), row(2), row(3)) LabeledPoint(row(0), features) } val model = new
LogisticRegressionWithSGD().run(trainingData)

Recent Roadmap Updates
Performance and Usability Improvements
• Disk spilling for skewed blocks during cache operations
• Disk spilling during aggregations for PySpark
• “sort-based shuffle”
• Usability improvements for monitoring the performance of long-running or complex jobs

Recent Roadmap Updates
SparkSQL
• JDBC/ODBC server built-in
• Support for loading JSON data directly into Spark's SchemaRDD format, including automatic schema inference
• Dynamic bytecode generation, significantly speeding up execution for queries that perform complex expression evaluation
• This release also adds support for registering Python, Scala, and Java lambda functions as UDFs
• Spark 1.1 adds a public types API to allow users to create SchemaRDDs from custom data sources
• Many, many optimizations (Parquet-specific, cost-based, ...)

Recent Roadmap Updates
MLlib
• New library of statistical packages providing exploratory analytic functions (stratified sampling, correlations, chi-squared tests, creating random datasets, ...); see the sketch after this list
• Utilities for feature extraction (Word2Vec and TF-IDF) and feature transformation (normalization and standard scaling)
• Support for nonnegative matrix factorization and SVD via Lanczos
• Decision tree algorithm has been added in Python and Java
• Tree aggregation primitive
• Performance improves across the board, with improvements of around 2-3X for many algorithms and up to 5X for large-scale decision tree problems
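
A brief Scala sketch of the new statistics and feature-transformation utilities mentioned above, as they appear in the MLlib 1.1 API; the data is synthetic and purely illustrative:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.feature.StandardScaler

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)))

// Exploratory statistics: column-wise summary and a correlation matrix
val summary = Statistics.colStats(observations)
println(summary.mean)
println(Statistics.corr(observations, "pearson"))

// Feature transformation: standard scaling (zero mean, unit variance)
val scaler = new StandardScaler(withMean = true, withStd = true).fit(observations)
val scaled = scaler.transform(observations)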

Recent Roadmap Updates
Spark Streaming
• New data source for Amazon Kinesis
• Apache Flume: a new pull-based mode (simplifying deployment and providing high availability)
• The first of a set of streaming machine learning algorithms is introduced with streaming linear regression (see the sketch after this list)
• Rate limiting has been added for streaming inputs
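
To make the streaming linear regression item concrete, a rough Scala sketch in the style of the Spark 1.1 MLlib docs; the input directories, feature count, and batch interval are placeholders:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))          // placeholder batch interval

// Text files dropped into these directories are parsed into labeled points
val trainingData = ssc.textFileStream("hdfs://.../train").map(LabeledPoint.parse)
val testData     = ssc.textFileStream("hdfs://.../test").map(LabeledPoint.parse)

val numFeatures = 3                                      // placeholder feature count
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingData)                              // update the model on each batch
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()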