Introduction to Apache Spark
By samy-dindane
OUTLINE
History of "Big Data" engines
Apache Spark: What is it and what's special about it?
Apache Spark: What is it used for?
Apache Spark: API
Tools and software usually used with Apache Spark
Demo
HISTORY OF "BIG DATA" ENGINES
BATCH VS STREAMING
HISTORY OF "BIG DATA" ENGINES
2011 - Hadoop MapReduce: Batch, on-disk processing
2011 - Apache Storm: Realtime
2014 - Apache Tez
2014 - Apache Spark: Batch and near-realtime, in-memory processing
2015 - Apache Flink: Realtime, in-memory processing
APACHE SPARK: WHAT IS IT AND WHAT'S SPECIAL ABOUT IT?
WHY SPARK?
Most machine learning algorithms are iterative; each iteration can improve the results.
With a disk-based approach, each iteration's output is written to disk, making the processing slow.
HADOOP MAPREDUCE EXECUTION FLOW
SPARK EXECUTION FLOW
Spark is a distributed data processing engine
Started in 2009
Open source & written in Scala
Compatible with Hadoop's data
Runs in memory and on disk
Runs 10 to 100 times faster than Hadoop MapReduce
Applications can be written in Java, Scala, Python & R
Supports batch and near-realtime workflows (micro-batches)
Spark has four modules: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing)
APACHE SPARK: WHAT IS IT USED FOR?
CAPTURE AND EXTRACT DATA
Data can come from several sources:
Databases
Flat files
Web and mobile applications' logs
Data feeds from social media
IoT devices
TRANSFORM DATA
Data in an analytics pipeline needs transformation
Check and correct quality issues
Handle missing values
Cast fields into specific data types
Compute derived fields
Split or merge records for more granularity
Join with other datasets
Restructure data
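In Spark these steps are expressed through the RDD or DataFrame API; as a minimal local sketch (plain Scala collections and hypothetical sensor records standing in for a distributed dataset), the same transformations look like:

```scala
// Hypothetical raw CSV records with quality issues (a missing value).
val raw = List("s1,12.5", "s2,", "s3,7.0")

// Parse, cast the value field to Double, and handle missing values.
val parsed = raw.map { line =>
  val parts = line.split(",", 2)
  (parts(0), if (parts(1).isEmpty) None else Some(parts(1).toDouble))
}

// Drop incomplete records and compute a derived field (Celsius -> Fahrenheit).
val cleaned = parsed.collect { case (id, Some(c)) => (id, c * 9 / 5 + 32) }
```

On a real Spark RDD or DataFrame the calls are analogous, but run distributed across the cluster.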
STORE DATA
Data can then be stored in several ways
As self-describing files (Parquet, JSON, XML)
SQL databases
Search databases (Elasticsearch, Solr)
Key-value stores (HBase, Cassandra)
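As a small local sketch (standard library only; the records and file name are illustrative), data can be persisted as newline-delimited JSON, one of the self-describing formats above:

```scala
import java.nio.file.{Files, Paths}

// Hypothetical cleaned records to persist; field names are illustrative.
val records = List(("s1", 54.5), ("s3", 44.6))

// Each line is a standalone JSON object carrying its own field names,
// which is what makes the format self-describing.
val json = records
  .map { case (id, v) => s"""{"sensor": "$id", "value": $v}""" }
  .mkString("\n")

Files.write(Paths.get("readings.jsonl"), json.getBytes("UTF-8"))
```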
QUERY, ANALYZE, VISUALIZE
With Spark Shell, notebooks, Kibana, etc.
APACHE SPARK: API
EXECUTION FLOW
RESILIENT DISTRIBUTED DATASETS
RDDs are the fundamental data unit in Spark
Resilient: If data in memory is lost, it can be recreated
Distributed: Stored in memory across the cluster
Dataset: The initial data can come from a file or be created programmatically
RDDs
Immutable and partitioned collection of elements
Basic operations: map, filter, reduce, persist
Several implementations: PairRDD, DoubleRDD, SequenceFileRDD
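The RDD API deliberately mirrors Scala's collection API; as a local sketch (a plain List standing in for a distributed RDD), the basic operations look like:

```scala
// A local List stands in for an RDD here; on a real RDD the same
// calls run distributed across the cluster.
val data = List(1, 2, 3, 4, 5)

val doubled = data.map(_ * 2)        // map: transform each element
val large   = doubled.filter(_ > 4)  // filter: keep matching elements
val total   = large.reduce(_ + _)    // reduce: combine to a single value
```

persist, which caches an RDD in memory across computations, has no meaningful local-collection analogue.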
HISTORY
2011 (Spark release) - RDD API
2013 - Introduction of the DataFrame API: adds the concept of a schema and lets Spark manage it for more efficient serialization and deserialization
2015 - Introduction of the Dataset API
OPERATIONS ON RDDs
Transformations
Actions
TRANSFORMATIONS
Create a new dataset from an RDD, such as filter, map, or reduceByKey
ACTIONS
Return a value to the driver program after running a computation on the dataset
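The key difference is laziness: transformations only describe a computation, and the work happens when an action runs. Spark itself is needed to observe this on an RDD, but a Scala view shows the same evaluation model locally (a sketch, not Spark code):

```scala
var evaluations = 0

// Like a transformation: building the lazy view does no work yet.
val squares = (1 to 5).view.map { x => evaluations += 1; x * x }
val before = evaluations  // still 0: nothing has been evaluated

// Like an action: forcing the view actually runs the computation.
val result = squares.toList
val after = evaluations   // now 5: each element was evaluated once
```

This is why Spark can optimize a whole chain of transformations before executing anything.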
EXAMPLE OF MAP AND FILTER TRANSFORMATIONS
HOW TO RUN SPARK PROGRAMS?
Inside Spark Shell
Using a notebook
As a Spark application
By submitting a Spark application to spark-submit
INSIDE SPARK SHELL
Run ./bin/spark-shell
val textFile = sc.textFile("README.md")
val lines = textFile.filter(line => line.contains("Spark"))
lines.collect()
USING A NOTEBOOK
There are many Spark notebooks; we are going to use Spark Notebook (http://spark-notebook.io/)
Open http://localhost:9000/
AS A SPARK APPLICATION
By adding spark-core and other Spark modules as project dependencies and using the Spark API inside the application
import org.apache.spark.{SparkConf, SparkContext}

object SampleApplication {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Sample Application")
      .setMaster("local")
    val sc = new SparkContext(conf)

    val logData = sc.textFile("/tmp/spark/README.md")
    val lines = logData.filter(line => line.contains("Spark"))
    lines.collect()

    sc.stop()
  }
}
BY SUBMITTING A SPARK APPLICATION TO SPARK-SUBMIT
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
TERMINOLOGY
SparkContext: A connection to a Spark cluster
Worker node: Node that runs the program in a cluster
Task: A unit of work
Job: Consists of multiple tasks
Executor: Process on a worker node that runs the tasks
TOOLS AND SOFTWARE USUALLY USED WITH APACHE SPARK
HDFS: HADOOP DISTRIBUTED FILE SYSTEM
Simple: Uses many servers as one big computer
Reliable: Detects failures, has redundant storage
Fault-tolerant: Auto-retry, self-healing
Scalable: Scales (almost) linearly with disks and CPUs
APACHE KAFKA
A DISTRIBUTED AND REPLICATED MESSAGING SYSTEM
APACHE ZOOKEEPER
ZOOKEEPER IS A DISTRIBUTED, OPEN-SOURCE COORDINATION SERVICE FOR DISTRIBUTED APPLICATIONS
Coordination: Needed when multiple nodes need to work together
Examples:
Group membership
Locking
Leader election
Synchronization
Publisher/subscriber
APACHE MESOS
Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction.
The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.
A cluster manager that:
Runs distributed applications
Abstracts CPU, memory, storage, and other resources
Handles resource allocation
Handles application isolation
Has a Web UI for viewing the cluster's state
NOTEBOOKS
Spark Notebook: Allows performing reproducible analysis with Scala, Apache Spark and more
Apache Zeppelin: A web-based notebook that enables interactive data analytics
THE END
APACHE SPARK
Is a fast distributed data processing engine
Runs in memory
Can be used with Java, Scala, Python & R
Its main data structure is the Resilient Distributed Dataset