Introduction to Apache Spark
By samy-dindane
OUTLINE
History of "Big Data" engines
Apache Spark: What is it and what's special about it?
Apache Spark: What is it used for?
Apache Spark: API
Tools and software usually used with Apache Spark
Demo
HISTORY OF "BIG DATA" ENGINES
BATCH VS STREAMING
HISTORY OF "BIG DATA" ENGINES
2011 - Hadoop MapReduce: Batch, on-disk processing
2011 - Apache Storm: Realtime
2014 - Apache Tez
2014 - Apache Spark: Batch and near-realtime, in-memory processing
2015 - Apache Flink: Realtime, in-memory processing
APACHE SPARK: WHAT IS IT AND WHAT'S SPECIAL ABOUT IT?
WHY SPARK?
Most machine learning algorithms are iterative; each iteration can improve the results.
With a disk-based approach, each iteration's output is written to disk, making the processing slow.
HADOOP MAPREDUCE EXECUTION FLOW
SPARK EXECUTION FLOW
Spark is a distributed data processing engine
Started in 2009
Open source & written in Scala
Compatible with Hadoop's data
Runs in memory and on disk
Runs 10 to 100 times faster than Hadoop MapReduce
Applications can be written in Java, Scala, Python & R
Supports batch and near-realtime workflows (micro-batches)
Spark has four modules: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing)
APACHE SPARK: WHAT IS IT USED FOR?
CAPTURE AND EXTRACT DATA
Data can come from several sources:
Databases
Flat files
Web and mobile applications' logs
Data feeds from social media
IoT devices
TRANSFORM DATA
Data in an analytics pipeline needs transformation
Check and correct quality issues
Handle missing values
Cast fields into specific data types
Compute derived fields
Split or merge records for more granularity
Join with other datasets
Restructure data
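In Spark these steps are expressed through the RDD or DataFrame API; as a minimal local sketch (plain Scala collections and hypothetical sensor records standing in for a distributed dataset), the same transformations look like:

```scala
// Hypothetical raw CSV records with quality issues (a missing value).
val raw = List("s1,12.5", "s2,", "s3,7.0")

// Parse, cast the value field to Double, and handle missing values.
val parsed = raw.map { line =>
  val parts = line.split(",", 2)
  (parts(0), if (parts(1).isEmpty) None else Some(parts(1).toDouble))
}

// Drop incomplete records and compute a derived field (Celsius -> Fahrenheit).
val cleaned = parsed.collect { case (id, Some(c)) => (id, c * 9 / 5 + 32) }
```

On a real Spark RDD or DataFrame the calls are analogous, but run distributed across the cluster.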
STORE DATA
Data can then be stored in several ways
As self-describing files (Parquet, JSON, XML)
SQL databases
Search databases (Elasticsearch, Solr)
Key-value stores (HBase, Cassandra)
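As a small local sketch (standard library only; the records and file name are illustrative), data can be persisted as newline-delimited JSON, one of the self-describing formats above:

```scala
import java.nio.file.{Files, Paths}

// Hypothetical cleaned records to persist; field names are illustrative.
val records = List(("s1", 54.5), ("s3", 44.6))

// Each line is a standalone JSON object carrying its own field names,
// which is what makes the format self-describing.
val json = records
  .map { case (id, v) => s"""{"sensor": "$id", "value": $v}""" }
  .mkString("\n")

Files.write(Paths.get("readings.jsonl"), json.getBytes("UTF-8"))
```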
QUERY, ANALYZE, VISUALIZE
With Spark Shell, notebooks, Kibana, etc.
APACHE SPARK: API
EXECUTION FLOW
RESILIENT DISTRIBUTED DATASETS
RDDs are the fundamental data unit in Spark
Resilient: If data in memory is lost, it can be recreated
Distributed: Stored in memory across the cluster
Dataset: The initial data can come from a file or be created programmatically
RDDs
Immutable and partitioned collection of elements
Basic operations: map, filter, reduce, persist
Several implementations: PairRDD, DoubleRDD, SequenceFileRDD
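The RDD API deliberately mirrors Scala's collection API; as a local sketch (a plain List standing in for a distributed RDD), the basic operations look like:

```scala
// A local List stands in for an RDD here; on a real RDD the same
// calls run distributed across the cluster.
val data = List(1, 2, 3, 4, 5)

val doubled = data.map(_ * 2)        // map: transform each element
val large   = doubled.filter(_ > 4)  // filter: keep matching elements
val total   = large.reduce(_ + _)    // reduce: combine to a single value
```

persist, which caches an RDD in memory across computations, has no meaningful local-collection analogue.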
HISTORY
2011 (Spark release) - RDD API
2013 - Introduction of the DataFrame API: adds the concept of a schema and lets Spark manage it for more efficient serialization and deserialization
2015 - Introduction of the Dataset API
OPERATIONS ON RDDs
Transformations
Actions
TRANSFORMATIONS
Create a new dataset from an RDD, such as filter, map, or reduceByKey
ACTIONS
Return a value to the driver program after running a computation on the dataset
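The key difference is laziness: transformations only describe a computation, and the work happens when an action runs. Spark itself is needed to observe this on an RDD, but a Scala view shows the same evaluation model locally (a sketch, not Spark code):

```scala
var evaluations = 0

// Like a transformation: building the lazy view does no work yet.
val squares = (1 to 5).view.map { x => evaluations += 1; x * x }
val before = evaluations  // still 0: nothing has been evaluated

// Like an action: forcing the view actually runs the computation.
val result = squares.toList
val after = evaluations   // now 5: each element was evaluated once
```

This is why Spark can optimize a whole chain of transformations before executing anything.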
EXAMPLE OF MAP AND FILTER TRANSFORMATIONS
HOW TO RUN SPARK PROGRAMS?
Inside Spark Shell
Using a notebook
As a Spark application
By submitting a Spark application to spark-submit
INSIDE SPARK SHELL
Run ./bin/spark-shell
val textFile = sc.textFile("README.md")
val lines = textFile.filter(line => line.contains("Spark"))
lines.collect()
USING A NOTEBOOK
There are many Spark notebooks; we are going to use Spark Notebook (http://spark-notebook.io/)
Open http://localhost:9000/
AS A SPARK APPLICATION
By adding spark-core and other Spark modules as project dependencies and using the Spark API inside the application
import org.apache.spark.{SparkConf, SparkContext}

object SampleApplication {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Sample Application")
      .setMaster("local")
    val sc = new SparkContext(conf)

    val logData = sc.textFile("/tmp/spark/README.md")
    val lines = logData.filter(line => line.contains("Spark"))
    lines.collect()

    sc.stop()
  }
}
BY SUBMITTING A SPARK APPLICATION TO SPARK-SUBMIT
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
TERMINOLOGY
SparkContext: A connection to a Spark cluster
Worker node: Node that runs the program in a cluster
Task: A unit of work
Job: Consists of multiple tasks
Executor: Process on a worker node that runs the tasks
TOOLS AND SOFTWARE USUALLY USED WITH APACHE SPARK
HDFS: HADOOP DISTRIBUTED FILE SYSTEM
Simple: Uses many servers as one big computer
Reliable: Detects failures, has redundant storage
Fault-tolerant: Auto-retry, self-healing
Scalable: Scales (almost) linearly with disks and CPUs
APACHE KAFKA
A DISTRIBUTED AND REPLICATED MESSAGING SYSTEM
APACHE ZOOKEEPER
ZOOKEEPER IS A DISTRIBUTED, OPEN-SOURCE COORDINATION SERVICE FOR DISTRIBUTED APPLICATIONS
Coordination: Needed when multiple nodes need to work together
Examples:
Group membership
Locking
Leader election
Synchronization
Publisher/subscriber
APACHE MESOS
Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction.
The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.
A cluster manager that:
Runs distributed applications
Abstracts CPU, memory, storage, and other resources
Handles resource allocation
Handles application isolation
Has a Web UI for viewing the cluster's state
NOTEBOOKS
Spark Notebook: Allows performing reproducible analysis with Scala, Apache Spark and more
Apache Zeppelin: A web-based notebook that enables interactive data analytics
THE END
APACHE SPARK
Is a fast distributed data processing engine
Runs in memory
Can be used with Java, Scala, Python & R
Its main data structure is the Resilient Distributed Dataset