Introduction to Apache Spark


Transcript of Introduction to Apache Spark

Page 1: Introduction to Apache Spark

INTRODUCTION TO APACHE SPARK

BY SAMY DINDANE

Page 2: Introduction to Apache Spark

OUTLINE

- History of "Big Data" engines
- Apache Spark: What is it and what's special about it?
- Apache Spark: What is it used for?
- Apache Spark: API
- Tools and software usually used with Apache Spark
- Demo

Page 3: Introduction to Apache Spark

HISTORY OF "BIG DATA" ENGINES

Page 4: Introduction to Apache Spark

BATCH VS STREAMING

Page 5: Introduction to Apache Spark

HISTORY OF "BIG DATA" ENGINES2011 - Hadoop MapReduce: Batch, in-disk processing2011 - Apache Storm: Realtime2014 - Apache Tez2014 - Apache Spark: Batch and near-realtime, in-memory processing2015 - Apache Flink: Realtime, in-memory processing

Page 6: Introduction to Apache Spark

APACHE SPARK: WHAT IS IT AND WHAT'S SPECIAL ABOUT IT?

Page 7: Introduction to Apache Spark

WHY SPARK?

Most machine learning algorithms are iterative; each iteration can improve the results. With a disk-based approach, each iteration's output is written to disk, making the processing slow.
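As a minimal sketch (not from the slides; the file path and update step are made up, and sc is the SparkContext predefined in the Spark shell), caching lets an iterative job reuse data in memory instead of re-reading it from disk on every pass:

// Hypothetical iterative job: without cache(), every iteration
// would re-read and re-parse the input from disk.
val points = sc.textFile("/tmp/points.txt")
  .map(_.split(",").map(_.toDouble))
  .cache() // keep the parsed data in memory across iterations

var weight = 0.0
for (i <- 1 to 10) {
  // each pass reuses the in-memory data instead of hitting disk again
  weight += points.map(p => p(0) * 0.01).sum()
}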

Page 8: Introduction to Apache Spark

HADOOP MAPREDUCE EXECUTION FLOW

SPARK EXECUTION FLOW

Page 9: Introduction to Apache Spark

Spark is a distributed data processing engine

- Started in 2009
- Open source & written in Scala
- Compatible with Hadoop's data

Page 10: Introduction to Apache Spark

- It runs in memory and on disk
- Runs 10 to 100 times faster than Hadoop MapReduce
- Can be written in Java, Scala, Python & R
- Supports batch and near-realtime workflows (micro-batches)
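For example (a sketch, not from the slides; the path is made up), an RDD's storage level can be set so it stays in memory and spills to disk only when memory runs out:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("/tmp/big-file.txt") // hypothetical input
rdd.persist(StorageLevel.MEMORY_AND_DISK)  // in memory first, spill to disk if needed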

Page 11: Introduction to Apache Spark

Spark has four modules built on the Spark core: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing).

Page 12: Introduction to Apache Spark

APACHE SPARK: WHAT IS IT USED FOR?

Page 13: Introduction to Apache Spark

CAPTURE AND EXTRACT DATA

Data can come from several sources:

- Databases
- Flat files
- Web and mobile applications' logs
- Data feeds from social media
- IoT devices
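A small sketch of two common ingestion paths (the paths are made up; sc and sqlContext are the entry points predefined in the Spark shell):

val logs = sc.textFile("/data/logs/*.log")        // flat files
val events = sqlContext.read.json("/data/events") // JSON feeds, e.g. exported social media data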

Page 14: Introduction to Apache Spark
Page 15: Introduction to Apache Spark

TRANSFORM DATA

Data in an analytics pipeline needs transformation

- Check and correct quality issues
- Handle missing values
- Cast fields into specific data types
- Compute derived fields
- Split or merge records for more granularity
- Join with other datasets
- Restructure data
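A sketch of what a few of these steps can look like with the RDD API (the CSV layout and field names are made up for illustration):

val raw = sc.textFile("/data/users.csv")
val users = raw
  .map(_.split(","))
  .filter(_.length == 3) // drop malformed records (quality issue)
  .map { case Array(id, name, age) =>
    // cast fields and handle a missing value with a default
    (id.toLong, name.trim, if (age.isEmpty) -1 else age.toInt)
  }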

Page 16: Introduction to Apache Spark
Page 17: Introduction to Apache Spark

STORE DATA

Data can then be stored in several ways

- As self-describing files (Parquet, JSON, XML)
- SQL databases
- Search databases (Elasticsearch, Solr)
- Key-value stores (HBase, Cassandra)
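For instance (a sketch; the paths are made up), the DataFrame API can write data out as self-describing Parquet files:

val df = sqlContext.read.json("/data/events")
df.write.parquet("/data/events.parquet")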

Page 18: Introduction to Apache Spark
Page 19: Introduction to Apache Spark

QUERY, ANALYZE, VISUALIZE

With Spark Shell, notebooks, Kibana, etc.

Page 20: Introduction to Apache Spark
Page 21: Introduction to Apache Spark

APACHE SPARK: API

Page 22: Introduction to Apache Spark

EXECUTION FLOW

Page 23: Introduction to Apache Spark

RESILIENT DISTRIBUTED DATASETS

RDDs are the fundamental data unit in Spark

- Resilient: If data in memory is lost, it can be recreated
- Distributed: Stored in memory across the cluster
- Dataset: The initial data can come from a file or be created programmatically
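Both creation paths, as a quick sketch (the file path is made up):

val fromFile = sc.textFile("/tmp/data.txt") // dataset backed by a file
val fromCode = sc.parallelize(1 to 1000)    // dataset created programmatically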

Page 24: Introduction to Apache Spark

RDDs

- Immutable and partitioned collection of elements
- Basic operations: map, filter, reduce, persist
- Several implementations: PairRDD, DoubleRDD, SequenceFileRDD

Page 25: Introduction to Apache Spark

HISTORY

- 2011 (Spark release) - RDD API
- 2013 - Introduction of the DataFrame API: Adds the concept of a schema and allows Spark to manage it for more efficient serialization and deserialization
- 2015 - Introduction of the Dataset API

Page 26: Introduction to Apache Spark

OPERATIONS ON RDD'S

- Transformations
- Actions

Page 27: Introduction to Apache Spark

TRANSFORMATIONS

Create a new dataset from an RDD, like map, filter, and reduceByKey (reduce itself is an action, not a transformation)

Page 28: Introduction to Apache Spark

ACTIONS

Return a value to the driver program after running a computation on the dataset
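A small sketch contrasting transformations and actions (the values in the comments follow from the data):

val numbers = sc.parallelize(1 to 100)
val evens = numbers.filter(_ % 2 == 0) // transformation: lazy, nothing runs yet
val count = evens.count()              // action: triggers the job, returns 50 to the driver
val total = evens.reduce(_ + _)        // action: returns 2550 to the driver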

Page 29: Introduction to Apache Spark

EXAMPLE OF MAP AND FILTER TRANSFORMATIONS
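The original slide shows the example as an image; here is a minimal equivalent sketch for the Spark shell (using README.md as in the shell example later in the deck):

val lines = sc.textFile("README.md")
val lengths = lines.map(line => line.length)               // map: transform each element
val matches = lines.filter(line => line.contains("Spark")) // filter: keep matching elements
matches.take(5).foreach(println)                           // action: materialize a few results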

Page 30: Introduction to Apache Spark


Page 31: Introduction to Apache Spark

HOW TO RUN SPARK PROGRAMS?

- Inside the Spark shell
- Using a notebook
- As a Spark application
- By submitting a Spark application to spark-submit

Page 32: Introduction to Apache Spark

INSIDE SPARK SHELL

Run ./bin/spark-shell

val textFile = sc.textFile("README.md")
val lines = textFile.filter(line => line contains "Spark")
lines.collect()

Page 33: Introduction to Apache Spark

USING A NOTEBOOK

There are many Spark notebooks; we are going to use Spark Notebook: http://spark-notebook.io/

Run spark-notebook, then open http://localhost:9000/

Page 34: Introduction to Apache Spark

AS A SPARK APPLICATION

By adding spark-core and other Spark modules as project dependencies and using the Spark API inside the application.
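With sbt, for instance, the dependency can be declared like this (a sketch; the version matches the Spark 1.x-era API used below):

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"

The application itself: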

import org.apache.spark.{SparkConf, SparkContext}

object SampleApplication { // object name is illustrative
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Sample Application")
      .setMaster("local")

    val sc = new SparkContext(conf)

    val logData = sc.textFile("/tmp/spark/README.md")

    val lines = logData.filter(line => line contains "Spark")
    lines.collect()

    sc.stop()
  }
}

Page 35: Introduction to Apache Spark

BY SUBMITTING A SPARK APPLICATION TO SPARK-SUBMIT

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
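For example (hypothetical class and jar names), submitting the sample application above to a local master:

./bin/spark-submit \
  --class SampleApplication \
  --master local[2] \
  target/scala-2.10/sample-application.jar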

Page 36: Introduction to Apache Spark

TERMINOLOGY

- SparkContext: A connection to a Spark cluster
- Worker node: Node that runs the program in a cluster
- Task: A unit of work
- Job: Consists of multiple tasks
- Executor: Process on a worker node that runs the tasks

Page 37: Introduction to Apache Spark

TOOLS AND SOFTWARE USUALLYUSED WITH APACHE SPARK

Page 38: Introduction to Apache Spark

HDFS: HADOOP DISTRIBUTED FILE SYSTEM

Page 39: Introduction to Apache Spark

- Simple: Uses many servers as one big computer
- Reliable: Detects failures, has redundant storage
- Fault-tolerant: Auto-retry, self-healing
- Scalable: Scales (almost) linearly with disks and CPU

Page 40: Introduction to Apache Spark

APACHE KAFKA

A DISTRIBUTED AND REPLICATED MESSAGING SYSTEM

Page 41: Introduction to Apache Spark
Page 42: Introduction to Apache Spark

APACHE ZOOKEEPER

ZOOKEEPER IS A DISTRIBUTED, OPEN-SOURCE COORDINATION SERVICE FOR DISTRIBUTED APPLICATIONS

Page 43: Introduction to Apache Spark

Coordination: Needed when multiple nodes need to work together. Examples:

- Group membership
- Locking
- Leader election
- Synchronization
- Publisher/subscriber

Page 44: Introduction to Apache Spark

APACHE MESOS

Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.

Page 45: Introduction to Apache Spark

A cluster manager that:

- Runs distributed applications
- Abstracts CPU, memory, storage, and other resources
- Handles resource allocation
- Handles applications' isolation
- Has a Web UI for viewing the cluster's state

Page 46: Introduction to Apache Spark

NOTEBOOKS

- Spark Notebook: Allows performing reproducible analysis with Scala, Apache Spark and more
- Apache Zeppelin: A web-based notebook that enables interactive data analytics

Page 47: Introduction to Apache Spark
Page 48: Introduction to Apache Spark

THE END

APACHE SPARK

- Is a fast distributed data processing engine
- Runs in memory
- Can be used with Java, Scala, Python & R
- Its main data structure is the Resilient Distributed Dataset
