An Overview of Apache Spark - Big Data Everywhere

An Overview of Apache Spark Ilya Gulman CTO, Twingo June, 2014

An Overview of Apache Spark

Ilya Gulman CTO, Twingo June, 2014

About Twingo

•  A Big Data Company

•  Established in 2006 by Golan Nahum

•  25 Employees

•  Reseller and expert integrator in HP VERTICA

•  Reseller and integrator in MAPR

•  Reseller and expert integrator in MICROSTRATEGY

•  Deep knowledge of Python and Linux

•  More than 20 successful Big Data projects completed

•  Expertise in SaaS/OEM Big Data solutions

More Customers

Our BIG DATA

Agenda

•  What is Spark?

•  The Difference with Spark

•  SQL on Spark

•  Combining the power

•  Real-World Use Cases

•  Resources

What is Spark?

Fast and general MapReduce-like engine for large-scale data processing

•  Fast
   –  In-memory data storage for very fast interactive queries
   –  Up to 100 times faster than Hadoop MapReduce

•  General
   –  Unified platform that can combine SQL, machine learning, streaming, graph and complex analytics

•  Ease of use
   –  Can be developed in Java, Scala or Python

•  Integrated with Hadoop
   –  Can read from HDFS, HBase, Cassandra, and any Hadoop data source

Spark is the Most Active Open Source Project in Big Data

[Chart: project contributors in past year (axis 0 to 140); labeled projects include Giraph, Storm and Tez]

The Spark Community

Unified Platform

Shark (SQL)

Spark Streaming (Streaming)

MLlib (Machine learning)

Spark (General execution engine)

GraphX (Graph computation)

Continued innovation bringing new functionality, e.g.:

•  Java 8 (Closures, Lambda Expressions)
•  Spark SQL (SQL on Spark, not just Hive)
•  BlinkDB (Approximate Queries)
•  SparkR (R wrapper for Spark)

Supported Languages

•  Java
•  Scala
•  Python
•  SQL

Data Sources

•  Local Files
   –  file:///opt/httpd/logs/access_log

•  S3

•  Hadoop Distributed Filesystem
   –  Regular files, sequence files, any other Hadoop InputFormat

•  HBase

•  Can also read from any other Hadoop data source

The Difference with Spark

Easy and Fast Big Data

•  Easy to Develop
   –  Rich APIs in Java, Scala, Python
   –  Interactive shell

•  Fast to Run
   –  General execution graphs
   –  In-memory storage

2-5× less code; up to 10× faster on disk, 100× in memory

Resilient Distributed Datasets (RDD)

•  Spark revolves around RDDs

•  Fault-tolerant collection of elements that can be operated on in parallel
   –  Parallelized Collection: Scala collection which is run in parallel
   –  Hadoop Dataset: records of files supported by Hadoop

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

RDD Operations

•  Transformations
   –  Creation of a new dataset from an existing one
   –  map, filter, distinct, union, sample, groupByKey, join, etc.

•  Actions
   –  Return a value after running a computation
   –  collect, count, first, takeSample, foreach, etc.

Check the documentation for a complete list

http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
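
The split between transformations and actions can be sketched without Spark at all. The block below is a plain-Python toy, not Spark's API: ToyRDD, tag_length and the sample lines are hypothetical names made up for illustration. It only shows the idea that transformations record a recipe, while an action runs it.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: holds a recipe, not results."""

    def __init__(self, compute):
        # compute is a zero-argument function returning an iterator of elements
        self._compute = compute

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    # Transformations: return a new ToyRDD and run nothing yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    # Actions: run the whole chain and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []  # records every time the map function actually runs


def tag_length(line):
    calls.append(line)
    return (line, len(line))


lines = ToyRDD.from_list(["ERROR disk", "INFO ok", "ERROR net"])
errors = lines.filter(lambda s: s.startswith("ERROR")).map(tag_length)

calls_before_action = len(calls)  # transformations alone do no work
result = errors.collect()         # the action triggers the computation
```

In this toy, every action replays the whole chain from the start, which is the cost that caching is meant to avoid.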

RDD Code Example

file = spark.textFile("hdfs://...")
errors = file.filter(lambda line: "ERROR" in line)

# Count all the errors
errors.count()

# Count errors mentioning MySQL
errors.filter(lambda line: "MySQL" in line).count()

# Fetch the MySQL errors as an array of strings
errors.filter(lambda line: "MySQL" in line).collect()

RDD Persistence / Caching

•  Variety of storage levels
   –  memory_only (default), memory_and_disk, etc.

•  API Calls
   –  persist(StorageLevel)
   –  cache(): shorthand for persist(StorageLevel.MEMORY_ONLY)

•  Considerations
   –  Read from disk vs. recompute (memory_and_disk)
   –  Total memory storage size (memory_only_ser)
   –  Replicate to second node for faster fault recovery (memory_only_2)
      •  Think about this option if supporting a web application

http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
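
Why caching matters can be shown with a small plain-Python sketch. This is not Spark code: ToyDataset, load_errors and the load counter are hypothetical, and the only storage level modeled is an in-memory one. It counts how often the data is rebuilt with and without a cache.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD with an optional in-memory cache."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument function that rebuilds the data
        self._use_cache = False
        self._stored = None

    def cache(self):
        self._use_cache = True    # keep the first result in memory from now on
        return self

    def collect(self):
        if self._use_cache and self._stored is not None:
            return self._stored
        data = list(self._compute())
        if self._use_cache:
            self._stored = data
        return data


loads = {"count": 0}


def load_errors():
    loads["count"] += 1           # stands in for an expensive read and filter
    return [s for s in ["ERROR a", "INFO b", "ERROR c"] if "ERROR" in s]


uncached = ToyDataset(load_errors)
uncached.collect()
uncached.collect()
loads_without_cache = loads["count"]  # rebuilt on every action

loads["count"] = 0
cached = ToyDataset(load_errors).cache()
cached.collect()
cached.collect()
loads_with_cache = loads["count"]     # built once, then served from memory
```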

Cache Scaling Matters

[Bar chart: execution time (s) vs. % of working set in cache. Cache disabled: 69; 25%: 58; 50%: 41; 75%: 30; fully cached: 12]

RDD Fault Recovery

RDDs track lineage information that can be used to efficiently recompute lost data

msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

[Lineage diagram: HDFS File -> filter (func = startsWith(…)) -> Filtered RDD -> map (func = split(...)) -> Mapped RDD]
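
The lineage idea can be sketched in plain Python. This is a toy model, not Spark's implementation: the Dataset class, its rebuilt log and the sample log lines are hypothetical. Each derived dataset remembers its parent and the function that produced it, so a lost partition is rebuilt by replaying only that partition's chain.

```python
class Dataset:
    """Hypothetical toy dataset split into partitions, with lineage."""

    def __init__(self, parent=None, fn=None, source=None):
        self.parent = parent   # lineage: the dataset this one derives from
        self.fn = fn           # lineage: per-partition function applied to the parent
        self.source = source   # only set on the root (stands in for an HDFS file)
        self.cache = {}        # partition index -> materialized list
        self.rebuilt = []      # log of partition indexes computed here

    def derive(self, fn):
        return Dataset(parent=self, fn=fn)

    def partition(self, i):
        if self.parent is None:
            return list(self.source[i])  # root: re-read durable storage
        if i not in self.cache:
            # Rebuild only this partition by replaying lineage from the parent
            self.cache[i] = self.fn(self.parent.partition(i))
            self.rebuilt.append(i)
        return self.cache[i]


log_file = [  # two partitions of tab-separated log lines
    ["ERROR\tdb\ttimeout", "INFO\tweb\tok"],
    ["ERROR\tweb\trefused", "ERROR\tdb\tdeadlock"],
]
root = Dataset(source=log_file)
filtered = root.derive(lambda part: [s for s in part if s.startswith("ERROR")])
msgs = filtered.derive(lambda part: [s.split("\t")[2] for s in part])

before = [msgs.partition(i) for i in range(2)]

# Simulate a failed worker that held partition 1 of both derived datasets
del msgs.cache[1]
del filtered.cache[1]

recovered = msgs.partition(1)  # partition 0 is untouched, partition 1 is replayed
```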

Interactive Shell

•  Iterative Development
   –  Cache those RDDs
   –  Open the shell and ask questions
      •  We have all wished we could do this with MapReduce
   –  Compile / save your code for scheduled jobs later

•  Scala – spark-shell
•  Python – pyspark

SQL on Spark

Before Spark - Hive

•  Puts structure/schema onto HDFS data
•  Compiles HiveQL queries into MapReduce jobs
•  Very popular: 90+% of Facebook Hadoop jobs generated by Hive
•  Initially developed by Facebook

But... Hive is slow

•  Takes 20+ seconds even for simple queries

•  "A good day is when I can run 6 Hive queries" – @mtraverso

SQL over Spark

Shark – SQL over Spark

•  Hive-compatible (HiveQL, UDFs, metadata)
   –  Works in existing Hive warehouses without changing queries or data!

•  Augments Hive
   –  In-memory tables and columnar memory store

•  Fast execution engine
   –  Uses Spark as the underlying execution engine
   –  Low-latency, interactive queries
   –  Scales out and tolerates worker failures

Machine Learning

Machine Learning - MLlib

•  K-Means
•  L1 and L2-regularized Linear Regression
•  L1 and L2-regularized Logistic Regression
•  Alternating Least Squares
•  Naive Bayes
•  Stochastic Gradient Descent

* As of May 14, 2014
** Don't be surprised if you see the Mahout library converting to Spark soon

Streaming

Comparison to Storm

•  Higher throughput than Storm
   –  Spark Streaming: 670k records/sec/node
   –  Storm: 115k records/sec/node
   –  Commercial systems: 100-500k records/sec/node

[Two charts, WordCount and Grep: throughput per node (MB/s) vs. record size (100 and 1000 bytes), Spark vs. Storm]

Combining the power

Combining the power

•  Use machine learning results as a table

GENERATE KMeans(tweet_locations)
SAVE AS TABLE tweet_clusters;

•  Combine SQL, ML, and streaming (Scala)

val points = sc.runSql[Double, Double](
  "select latitude, longitude from historic_tweets")

val model = KMeans.train(points, 10)

sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)

Real-World Use Cases

Spark at Yahoo!

•  Fast Machine Learning
   –  Personalized news pages

Spark at Yahoo!

•  Hive on Spark (Shark)

Using existing BI tools to view and query advertising analytic data collected in Hadoop. Any tool that plugs into Hive, like Tableau, automatically works with Shark.

Spark at Conviva

•  One of the largest streaming video companies on the Internet

•  4+ billion video feeds per month (second only to YouTube)

•  CONVIVA uses Spark Streaming to learn network conditions in real time

Resources

Remember

•  If you want to use a new technology you must learn that new technology

•  For those who have been using Hadoop for a while, at one time you had to learn all about MapReduce and how to manage and tune it

•  To get the most out of a new technology you need to learn that technology; this includes tuning
   –  There are switches you can use to optimize your work

Configuration

http://spark.apache.org/docs/latest/

Most Important

•  Application Configuration
   http://spark.apache.org/docs/latest/configuration.html

•  Standalone Cluster Configuration
   http://spark.apache.org/docs/latest/spark-standalone.html

•  Tuning Guide
   http://spark.apache.org/docs/latest/tuning.html

Resources

•  Pig on Spark
   –  http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html
   –  https://github.com/aniket486/pig
   –  https://github.com/twitter/pig/tree/spork
   –  http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
   –  https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix

•  Latest on Spark
   –  http://databricks.com/categories/spark/
   –  http://www.spark-stack.org/

Thank You

www.twingo.co.il

Optional - More Examples

Word Count

Java 8:

JavaSparkContext sc = new JavaSparkContext(master, appName, [sparkHome], [jars]);
JavaRDD<String> file = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = file
    .flatMap(line -> Arrays.asList(line.split(" ")))
    .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
    .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://...");

Scala:

val sc = new SparkContext(master, appName, [sparkHome], [jars])
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

•  Java MapReduce (~15 lines of code)
•  Java Spark (~7 lines of code)
•  Scala and Python (4 lines of code)
   –  interactive shell: skip line 1 and replace the last line with counts.collect()
•  Java 8 (4 lines of code)
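
For readers without a cluster at hand, the same three steps (flatMap, map, reduceByKey) can be traced in plain Python. This is a local sketch that does not use Spark; word_count and the sample lines are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    # flatMap step: split every line into words and flatten into one sequence
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map step: pair each word with the count 1
    pairs = ((word, 1) for word in words)
    # reduceByKey step: sum the counts of identical words
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


counts = word_count(["to be or", "not to be"])
```

On a cluster the reduce step runs per key across machines; here a single dictionary plays that role.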

Network Word Count – Streaming

// Create the context with a 1 second batch size
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
  System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))

// Create a NetworkInputDStream on target host:port and count the
// words in input stream of \n delimited text (e.g. generated by 'nc')
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)

val words = lines.flatMap(_.split(" "))

val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

wordCounts.print()

ssc.start()
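
The 1 second batch size above means word counts are produced per batch, not over the whole stream. A rough plain-Python sketch of that micro-batch idea follows. It is not Spark Streaming: batch_word_counts and the timestamped sample events are hypothetical, and real input would arrive over a socket instead of a list.

```python
from collections import Counter


def batch_word_counts(events, batch_seconds=1):
    """events: iterable of (timestamp_seconds, line) pairs."""
    batches = {}
    for ts, line in events:
        batch_id = int(ts // batch_seconds)  # which batch interval the line falls in
        batches.setdefault(batch_id, Counter()).update(line.split(" "))
    # One independent word count per batch, in time order
    return [(b, dict(batches[b])) for b in sorted(batches)]


events = [(0.2, "a b a"), (0.9, "b"), (1.5, "a c")]
result = batch_word_counts(events)
```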

Deploying Spark – Cluster Manager Types

•  Standalone mode – Comes bundled (EC2 capable)

•  YARN

•  Mesos