Apache Spark Overview @ ferret
![Page 1: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/1.jpg)
APACHE SPARK OVERVIEW
tech talk @ ferret
Andrii Gakhov
![Page 2: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/2.jpg)
• Apache Spark™ is a fast and general engine for large-scale data processing.
• Latest release: Spark 1.1.1 (Nov 26, 2014)
• spark.apache.org
• Originally developed in 2009 at UC Berkeley’s AMPLab and open-sourced in 2010. Spark is now supported by Databricks.
![Page 3: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/3.jpg)
APACHE SPARK
[Diagram: the Spark stack. Spark SQL, MLlib, GraphX, and Streaming sit on Apache Spark, which runs standalone with local storage or on Mesos or YARN, across nodes (for example on EC2), with data in S3 or HDFS.]
![Page 4: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/4.jpg)
RDD
• Spark’s primary concept is the Resilient Distributed Dataset (RDD): an abstraction of an immutable, distributed dataset.
• Collections of objects that can be stored in memory or on disk across the cluster
• Parallel functional transformations (map, filter, …)
• Automatically rebuilt on failure
textFile = sc.textFile("api.log")
anotherFile = sc.textFile("hdfs://var/log/api.log")
![Page 5: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/5.jpg)
RDD
• RDDs have actions, which return values, and transformations, which return pointers to new RDDs.
• Actions:
  • reduce collect count countByKey take saveAsTextFile takeSample …
• Transformations:
  • map filter flatMap distinct sample join union intersection reduceByKey groupByKey sortByKey …
errors = logFile.filter(lambda line: line.startswith("ERROR"))
print(errors.count())
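The split between transformations and actions can be sketched in plain Python. This is only a single-process analogy, not Spark's implementation: `ToyDataset` and the sample log lines are hypothetical names made up for illustration. Transformations record a step and return a new object; actions replay the recorded steps and return a value.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD: immutable, lazy, single-process."""

    def __init__(self, source, steps=()):
        self._source = source       # zero-argument callable returning the raw records
        self._steps = tuple(steps)  # recorded transformations, in order

    # Transformations: record the step and return a NEW dataset; nothing runs yet.
    def map(self, fn):
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return ToyDataset(self._source, self._steps + (("filter", fn),))

    def _compute(self):
        records = iter(self._source())
        for kind, fn in self._steps:
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return records

    # Actions: replay the recorded steps and return a plain value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


# Made-up sample data standing in for a log file.
log_lines = ["ERROR disk full", "INFO started", "ERROR timeout"]
log_file = ToyDataset(lambda: log_lines)
errors = log_file.filter(lambda line: line.startswith("ERROR"))  # no work done yet
print(errors.count())  # the filter runs only now
```

Because every dataset keeps the steps needed to recompute it from the source, a lost result can be rebuilt by replaying them, which is the idea behind "automatically rebuilt on failure".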
![Page 6: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/6.jpg)
PERSISTENCE
• You can control the persistence of an RDD across operations (MEMORY_ONLY, MEMORY_AND_DISK …)
• When you persist an RDD in memory, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it)
• This allows future actions to be much faster (often by more than 10x).
errors.cache()
endpoint_errors = errors.filter(lambda line: "/test/endpoint" in line)
endpoint_errors.count()
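Why caching helps can be shown with a small plain-Python sketch. It is an analogy only: `CachedResult`, `load_errors`, and the sample lines are hypothetical, and Spark caches per partition across nodes rather than a single value. The counter shows that the expensive scan runs once even though the result is used twice.

```python
load_calls = 0  # how many times the "expensive" scan has run


def load_errors():
    """Pretend to scan a large log and keep only the error lines."""
    global load_calls
    load_calls += 1
    lines = ["ERROR /test/endpoint 500", "INFO ok", "ERROR /other 500"]
    return [line for line in lines if line.startswith("ERROR")]


class CachedResult:
    """Hypothetical helper: compute on first use, then reuse the stored value."""

    def __init__(self, compute):
        self._compute = compute
        self._value = None
        self._ready = False

    def get(self):
        if not self._ready:
            self._value = self._compute()
            self._ready = True
        return self._value


errors = CachedResult(load_errors)
endpoint_errors = [line for line in errors.get() if "/test/endpoint" in line]
total_errors = len(errors.get())  # served from the stored value, no second scan
print(len(endpoint_errors), total_errors, load_calls)
```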
![Page 7: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/7.jpg)
[Diagram: a chain of iterations in Hadoop MapReduce (HDFS) compared with the same chain in Apache Spark (HDFS, memory)]
![Page 8: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/8.jpg)
INTERACTIVE DEMO
STRATA+HADOOP WORLD EXAMPLE
http://www.datacrucis.com/research/twitter-analysis-for-strata-barcelona-2014-with-apache-spark-and-d3.html
![Page 9: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/9.jpg)
SPARK SQL
TRANSFORM RDD WITH SQL
![Page 10: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/10.jpg)
SCHEMA RDD
• Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark.
• At the core of this component is a new type of RDD - SchemaRDD.
• SchemaRDDs are composed of Row objects, along with a schema that describes the data types of each column in the row.
• A SchemaRDD is similar to a table in a traditional relational database.
• A SchemaRDD can be created from an existing RDD, a Parquet file, a JSON dataset, or by running HiveQL against data stored in Apache Hive.
![Page 11: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/11.jpg)
SCHEMA RDD
• To work with Spark SQL you need an SQLContext (or HiveContext)
from pyspark.sql import SQLContext, Row
sqlCtx = SQLContext(sc)
records = sc.textFile("customers.csv")
customers = records.map(lambda line: line.split(",")) \
    .map(lambda r: Row(name=r[0], age=int(r[1])))
customersTable = sqlCtx.inferSchema(customers)
customersTable.registerAsTable("customers")
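What "rows plus a schema" means can be sketched without Spark. The following is a plain-Python illustration under stated assumptions: `Row` here is a `namedtuple` stand-in, `parse_customer` and `infer_schema` are hypothetical helpers, and the two records are made-up sample data. It does not reproduce how Spark SQL infers types.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark SQL Row with two named columns.
Row = namedtuple("Row", ["name", "age"])


def parse_customer(line):
    """Split one CSV line into a typed Row."""
    name, age = line.split(",")
    return Row(name=name.strip(), age=int(age))


def infer_schema(rows):
    """Map each column name to the Python type name found in the first row."""
    first = rows[0]
    return {field: type(getattr(first, field)).__name__ for field in first._fields}


records = ["Alice,72", "Bob,35"]  # made-up sample data
customers = [parse_customer(line) for line in records]
schema = infer_schema(customers)
print(schema)
```

Once column names and types are known, a query engine can reason about a condition on a column such as age instead of treating each record as an opaque object.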
![Page 12: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/12.jpg)
SCHEMA RDD
• Transformations over an RDD are just functional transformations on partitioned collections of objects
• Transformations over a SchemaRDD are declarative transformations on partitioned collections of tuples
[Diagram: an RDD of User objects next to a SchemaRDD of rows with Name, Age, and Phone columns]
![Page 13: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/13.jpg)
SPARK SQL
• A SchemaRDD can be used as a regular RDD at the same time.
seniors = sqlCtx.sql("""SELECT * FROM customers WHERE age >= 70""")
print(seniors.map(lambda r: "Name: " + r.name).take(10))
print(seniors.count())
![Page 14: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/14.jpg)
MLLIB
Distributed Machine Learning
![Page 15: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/15.jpg)
MACHINE LEARNING LIBRARY
• MLlib uses the linear algebra package Breeze, which depends on netlib-java and jblas
• MLlib in Python requires NumPy version 1.4+
• MLlib is under active development
  • Many API changes every release
  • Not all algorithms are fully functional
![Page 16: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/16.jpg)
MACHINE LEARNING LIBRARY
• Basic statistics
• Classification and regression
  • linear models (SVMs, logistic regression, linear regression)
  • decision trees
  • naive Bayes
• Collaborative filtering
  • alternating least squares (ALS)
• Clustering
  • k-means
![Page 17: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/17.jpg)
MACHINE LEARNING LIBRARY
• Dimensionality reduction
  • singular value decomposition (SVD)
  • principal component analysis (PCA)
• Feature extraction and transformation
• Optimization
  • stochastic gradient descent
  • limited-memory BFGS (L-BFGS)
![Page 18: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/18.jpg)
MACHINE LEARNING LIBRARY
• LinearRegression with stochastic gradient descent (SGD) example on Spark:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

def parsePoint(line):
    # The first value is the label, the rest are features
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

# data is an RDD of text lines; the slide does not show how it is loaded
parsedData = data.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData)
valuesAndPreds = parsedData.map(lambda p: (p.label, model.predict(p.features)))
MSE = valuesAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / valuesAndPreds.count()
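The same workflow (parse, train with SGD, compute the mean squared error) can be sketched in plain Python on one machine. This is a toy under stated assumptions, not MLlib's algorithm: `train_sgd`, its `step` and `epochs` defaults, the intercept term, and the five sample lines (generated from y = 2x + 1) are all made up for illustration.

```python
def parse_point(line):
    """Parse 'label,f1 f2 ...' into (label, [features])."""
    values = [float(x) for x in line.replace(",", " ").split()]
    return values[0], values[1:]


def predict(weights, intercept, features):
    return sum(w * x for w, x in zip(weights, features)) + intercept


def train_sgd(points, step=0.1, epochs=2000):
    """Fit weights and an intercept with one gradient step per example."""
    weights, intercept = [0.0] * len(points[0][1]), 0.0
    for _ in range(epochs):
        for label, features in points:
            error = predict(weights, intercept, features) - label
            # Gradient of 0.5 * error**2 with respect to each parameter.
            weights = [w - step * error * x for w, x in zip(weights, features)]
            intercept -= step * error
    return weights, intercept


def mean_squared_error(points, weights, intercept):
    squared = [(label - predict(weights, intercept, f)) ** 2 for label, f in points]
    return sum(squared) / len(squared)


# Made-up noise-free sample data following y = 2x + 1.
lines = ["1.0,0.0", "1.5,0.25", "2.0,0.5", "2.5,0.75", "3.0,1.0"]
points = [parse_point(line) for line in lines]
weights, intercept = train_sgd(points)
mse = mean_squared_error(points, weights, intercept)
print(weights, intercept, mse)
```

On this noise-free data the fitted weight and intercept approach 2 and 1, so the error shrinks toward zero; real data would leave a residual error.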
![Page 19: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/19.jpg)
SPARK STREAMING
Fault-tolerant stream processing
![Page 20: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/20.jpg)
SPARK STREAMING
• Spark Streaming enables scalable, high-throughput, fault-tolerant stream processing of live data streams
• Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data
• Internally, a DStream is represented as a sequence of RDDs.
![Page 21: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/21.jpg)
SPARK STREAMING
• Example of processing a Twitter stream with Spark Streaming:
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
val ssc = new StreamingContext(sc, Seconds(1))
val tweets = TwitterUtils.createStream(ssc, auth)
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
…
![Page 22: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/22.jpg)
SPARK STREAMING
• Any operation applied on a DStream translates to operations on the underlying RDDs.
[Diagram: a DStream as a sequence of RDDs: RDD @ time1, RDD @ time2, RDD @ time3, RDD @ time4]
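The idea of a stream as a sequence of small batches can be sketched in plain Python. This is an illustration only, not the Spark Streaming API: `to_batches`, `get_tags`, `flat_map_stream`, and the timestamped events are hypothetical. Each batch plays the role of one RDD, and an operation on the stream is applied to every batch independently.

```python
from collections import defaultdict


def to_batches(events, batch_seconds):
    """Group (timestamp, text) events into consecutive batches by time interval."""
    batches = defaultdict(list)
    for timestamp, text in events:
        batches[int(timestamp // batch_seconds)].append(text)
    if not batches:
        return []
    # Intervals with no events still produce an (empty) batch.
    return [batches[i] for i in range(max(batches) + 1)]


def get_tags(text):
    """Hypothetical helper: return the hashtags found in one message."""
    return [word for word in text.split() if word.startswith("#")]


def flat_map_stream(batches, fn):
    """Apply a flatMap-style function to every batch independently."""
    return [[out for item in batch for out in fn(item)] for batch in batches]


# Made-up events: (seconds since start, message).
events = [(0.2, "hello #spark"), (0.7, "#spark #streaming"),
          (1.4, "no tags"), (2.1, "#rdd")]
batches = to_batches(events, batch_seconds=1)
hash_tags = flat_map_stream(batches, get_tags)
print(hash_tags)
```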
![Page 23: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/23.jpg)
SPARK STREAMING
• Spark Streaming also provides windowed computations, which allow you to apply transformations over a sliding window of data
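A sliding window over batches can be sketched as follows. This is a plain-Python analogy with hypothetical names: `windowed_counts` measures both the window length and the slide interval in number of batches, whereas Spark Streaming expresses them as durations, and the hashtag batches are made-up sample data.

```python
from collections import Counter


def windowed_counts(batches, window_length, slide_interval):
    """Count items over a sliding window measured in batches.

    Emits one Counter every `slide_interval` batches, each covering the most
    recent `window_length` batches that have arrived so far.
    """
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        window = [item for batch in batches[start:end] for item in batch]
        results.append(Counter(window))
    return results


# Made-up batches of hashtags, one list per batch interval.
batches = [["#spark"], ["#spark", "#rdd"], ["#rdd"], ["#spark"]]
counts = windowed_counts(batches, window_length=3, slide_interval=2)
for window in counts:
    print(dict(window))
```

With a window of three batches sliding by two, consecutive windows overlap by one batch, so items in that shared batch are counted in both results.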
![Page 24: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/24.jpg)
CONCLUSIONS
![Page 25: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/25.jpg)
SPEED
• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Spark won the Daytona GraySort contest for 2014 (sortbenchmark.org) with 4.27 TB/min (in 2013 Hadoop was the winner with 1.42 TB/min)
[Chart: Logistic regression in Hadoop and Spark]
![Page 26: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/26.jpg)
EASE OF USE
• Supports out of the box:
  • Java
  • Scala
  • Python
• You can use it interactively from the Scala and Python shells
![Page 27: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/27.jpg)
GENERALITY
• SQL with Spark SQL
• Machine learning with MLlib
• Graph computation with GraphX
• Stream processing with Spark Streaming
![Page 28: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/28.jpg)
RUNS EVERYWHERE
• Spark can run on
  • Hadoop (YARN)
  • Mesos
  • standalone
  • in the cloud
• Spark can read from
  • S3
  • HDFS
  • HBase
  • Cassandra
  • any Hadoop data source.
![Page 29: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/29.jpg)
![Page 30: Apache Spark Overview @ ferret](https://reader034.fdocuments.net/reader034/viewer/2022042614/559ef4bd1a28ab34208b4795/html5/thumbnails/30.jpg)
Thank you.
• Credits:
• http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014
• http://spark.apache.org
• http://www.databricks.com
• http://www.datacrucis.com/research/twitter-analysis-for-strata-barcelona-2014-with-apache-spark-and-d3.html