
Spark - Lightning-Fast Cluster Computing by Example


Ramesh Mudunuri, Vectorum

Saturday, December 6, 2014


About me

Big data enthusiast

Vectorum.com: member of the startup product development team, using Spark technology

What to expect

Introduction to Spark

Spark ecosystem

How Spark differs from Hadoop MapReduce

Where Spark shines

How easy it is to install and start learning

Small code demos

Where to find additional information

This is not

A training class

A workshop

A product demo with commercial interest

Spark - Lightning-Fast Cluster Computing by Example

Introduction to Spark

Spark ecosystem

How it differs from Hadoop MapReduce

Where it shines

How easy it is to install and start learning

Small code demos

Where to find additional information

What is Spark?

A general-purpose, large-scale, high-performance processing engine

http://spark.apache.org/

What is Spark?

Like MapReduce, but an in-memory processing engine that also runs much faster

http://spark.apache.org/

What is Spark?

Apache Spark is a fast and general engine for large-scale data processing.

Spark History

Started as a research project in 2009 at UC Berkeley's AMPLab; open-sourced in 2010 and later became an Apache project

Matei Zaharia, Spark development team member and Databricks co-founder

Why is Spark so special

Speed

General-purpose, fast, in-memory processing engine

(Relatively) easy to develop and deploy complex analytical applications

APIs for Java, Scala, and Python

Well-integrated ecosystem of tools

www.databricks.com

Why is Spark so special (continued)

In-memory processing makes it well suited for iterative algorithms

Can run in various setups

Standalone (my favorite way to learn Spark)

Cluster, EC2,

YARN, Mesos

Read data from

Local file system

HDFS

HBase, Cassandra, and others

http://www.cloudera.com

Spark - Lightning-Fast Cluster Computing by Example

Introduction to Spark

Spark ecosystem

How it differs from Hadoop MapReduce

Where it shines

How easy it is to install and start learning

Small code demos

Where to find additional information

Apache Spark Core

Foundation

Scheduling,

Memory Management

Fault recovery, etc.

Spark SQL

Run Spark computations with SQL expressions (see the sketch below)

Compatible with Hive*

JDBC/ODBC connection capabilities

* Hive: distributed data storage and SQL software with custom UDF capabilities

SerDe is short for Serializer/Deserializer. Hive uses the SerDe interface for I/O: it handles both serialization and deserialization, and it interprets the results of serialization as individual fields for processing.
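As a flavor of how this looks in practice, a minimal sketch using the Spark 1.x-era SQLContext from the shell; the JSON input path is hypothetical:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val people = sqlContext.jsonFile("/filepath/people.json") // hypothetical input; schema is inferred
people.registerTempTable("people")                        // expose the data as a SQL table
sqlContext.sql("SELECT name FROM people WHERE age >= 21").collect().foreach(println)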


Spark Streaming

Component for processing live data streams

API for handling streaming data

e.g. sources: log files, queued messages, sensor-emitted data
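A minimal sketch of the streaming API under the Spark 1.x-era interface, reusing the shell's SparkContext; the socket host and port are illustrative:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))     // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999) // illustrative text source on a socket
lines.filter(_.startsWith("ERROR")).count().print() // count of error lines per batch
ssc.start()                                         // begin receiving and processing data
ssc.awaitTermination()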

MLlib - Machine learning

Libraries of machine-learning algorithms

e.g. classification, regression, clustering, collaborative filtering, dimensionality reduction

Very active Spark development community
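For example, a minimal k-means sketch with the Spark 1.x-era MLlib API; the input path and its format (one space-separated numeric vector per line) are assumptions:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("/filepath/points.txt") // hypothetical input file
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()                              // iterative algorithm: caching avoids re-reading every pass
val model = KMeans.train(points, 2, 20) // k = 2 clusters, 20 iterations
model.clusterCenters.foreach(println)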

GraphX

APIs for Graph computation

PageRank

Connected components

Label propagation

SVD++

Strongly connected components

Triangle count

Alpha-level component at the time of this talk
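A minimal PageRank sketch against the GraphX API of the time; the edge-list path and convergence tolerance are illustrative:

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "/filepath/edges.txt") // hypothetical "srcId dstId" edge list
val ranks = graph.pageRank(0.0001).vertices                     // iterate until the given tolerance
ranks.take(5).foreach(println)                                  // (vertexId, rank) pairs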

Spark Engine Terminology

Spark Context

The object Spark uses to access the cluster

Driver & Executor

The driver runs the main program and launches parallel operations

Executors run inside workers and execute tasks

Resilient Distributed Dataset (RDD)

Immutable, fault-tolerant collection object

RDD functions (similar to Hadoop MapReduce functions)

Transformation

Action
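To make this terminology concrete, here is a minimal sketch of creating a SparkContext in a standalone Scala program; the spark-shell creates one for you as sc, and the app name and local master URL below are illustrative:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MeetupDemo")     // illustrative application name
  .setMaster("local[2]")        // run locally with 2 threads; a cluster URL would go here
val sc = new SparkContext(conf) // the driver's handle to the cluster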

Spark shell and Spark context


RDD - Resilient Distributed Dataset

Resilient Distributed Datasets (RDDs) are Spark's fundamental abstraction for representing a collection of objects that can be distributed across multiple machines in a cluster.

Simple definition: an immutable, fault-tolerant collection object

There are two ways to create an RDD in Spark:

1. Create an RDD from an external data source

2. Perform a transformation on one or more existing RDDs

val lines = sc.textFile("/filepath/README.md")   // 1. create an RDD from an external data source

val errors = lines.filter(_.startsWith("ERROR")) // 2. transform an existing RDD


Transformation - Action

Transformation operations are lazy (not executed immediately)

Transformations create new RDDs from existing RDDs

e.g. filter, map

Action operations return final values to the driver program or write data to the file system

e.g. collect, saveAsTextFile (see the sketch below)

http://www.mapr.com
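Continuing the README example from the RDD slide, a minimal sketch of how laziness plays out; the output path is hypothetical:

val lines = sc.textFile("/filepath/README.md")   // transformation source: nothing runs yet
val errors = lines.filter(_.startsWith("ERROR")) // transformation: still lazy
val n = errors.count()                           // action: triggers the actual job
errors.saveAsTextFile("/filepath/errors-out")    // action: writes results to the file system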

Spark - Lightning-Fast Cluster Computing by Example

Introduction to Spark

Spark ecosystem

How Spark differs from Hadoop MapReduce

Where it shines

How easy it is to install and start learning

Small code demos

Where to find additional information

How is Spark different from Hadoop MapReduce?

1. Speed
   Spark: 100x faster in-memory and 10x faster on disk

2. Ease of use
   Spark: easily write applications in Java, Scala, or Python; interactive shell available for Scala and Python; high-level, simple map-reduce operations
   Hadoop: Java only; no shell; complex map-reduce operations

3. Tools
   Spark: well-integrated tools (Spark SQL, Streaming, MLlib, etc.) for developing complex analytical applications
   Hadoop: loosely coupled large set of tools, but very mature

4. Deployment
   Spark: Hadoop V1/V2 (YARN); also Mesos and Amazon EC2
   Hadoop: --

5. Data sources
   Spark: HDFS (Hadoop), HBase, Cassandra, Amazon S3
   Hadoop: --

How is Spark different from Hadoop MapReduce? (continued)

6. Applications
   Spark: an application is the higher-level unit; it runs multiple jobs in sequence or in parallel, and its processes (called executors) run on the cluster's workers
   Hadoop: a job is the higher-level unit; it processes data with map-reduce and writes the results to storage

7. Executors
   Spark: an executor can run multiple tasks in a single process
   Hadoop: each map/reduce task runs in its own process

8. Shared variables
   Spark: broadcast variables are read-only (lookup) variables shipped only once to each worker; accumulators let workers add values that only the driver reads, and are fault tolerant (see the sketch below)
   Hadoop: counters include additional (system) metric counters such as "Map input records"

9. Persisting/caching RDDs
   Spark: cached RDDs can be reused across operations, which increases processing speed
   Hadoop: --

10. Lazy evaluation
    Spark: transformation execution plans are bundled together and run only when an RDD action is called
    Hadoop: --
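A minimal sketch of the two shared-variable types under the Spark 1.x-era API; the lookup map and input values are made up:

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2)) // read-only, shipped once to each worker
val misses = sc.accumulator(0)                     // workers add to it; only the driver reads it
val codes = sc.parallelize(Seq("a", "b", "x"))
val values = codes.map { k =>
  if (!lookup.value.contains(k)) misses += 1       // count lookup misses on the workers
  lookup.value.getOrElse(k, 0)
}
values.collect()      // action: runs the job
println(misses.value) // the driver reads the accumulated count: 1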

Spark - Lightning-Fast Cluster Computing by Example

Introduction to Spark

Spark ecosystem

How it differs from Hadoop MapReduce

Where Spark shines

How easy it is to install and start learning

Small code demos

Where to find additional information

Where Spark shines

Well suited for iterative computations

Machine learning algorithms

Iterative analytics

Multi-data-source computations

Multi-sourced sensor data

Aggregated analytics

Transforming and summarizing the data

Spark - Lightning-Fast Cluster Computing by Example

Introduction to Spark

Spark ecosystem

How it differs from Hadoop MapReduce

Where it shines

How easy it is to install and start learning

Small code demos

Where to find additional information

Link: http://spark.apache.org/downloads.html

Standalone - choose a package type: pre-built for Hadoop 1.x

Source code is also available

Build tools: Maven or sbt

Distro versions: Hadoop, Cloudera, MapR

Current Spark version

Release cycle: every 3 months

How easy it is to install and start learning

Can be installed quickly on a laptop/PC

Parameter checklist (a setup sketch follows the list):

Java 1.7

Scala 2.10.x

$SPARK_HOME/conf settings
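A minimal sketch of the standalone laptop setup, assuming a pre-built package has already been downloaded from the link above; the archive name pattern is illustrative:

java -version                   # expect Java 1.7
scala -version                  # expect Scala 2.10.x
tar xzf spark-*-bin-hadoop1.tgz # unpack the pre-built package
cd spark-*-bin-hadoop1
./bin/spark-shell               # the REPL starts and creates the SparkContext sc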

Spark - Lightning-Fast Cluster Computing by Example

Introduction to Spark

Spark ecosystem

How it differs from Hadoop MapReduce

Where it shines

How easy it is to install and start learning

Small code demos

Where to find additional information

Spark Scala REPL

cd $SPARK_HOME

./bin/spark-shell

Web UI on port 4040

Spark Master & Worker in background

cd $SPARK_HOME

./sbin/start-all.sh

Starts both the master and a worker
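Once the master and worker are running, the shell can be attached to the standalone cluster; spark://localhost:7077 is the standalone master's usual default URL:

./bin/spark-shell --master spark://localhost:7077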

Spark - Lightning-Fast Cluster Computing by Example

Introduction to Spark

Spark ecosystem

How it differs from Hadoop MapReduce

Where it shines

How easy it is to install and start learning

Small code demos

Where to find additional information

Use case with Spark SQL

Spark Scala REPL

Spark SQL
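A minimal sketch of a Spark SQL use case from the Scala REPL with the Spark 1.x-era API; the case class, file path, and query are illustrative:

case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD // Spark 1.x implicit: RDD of case classes -> SchemaRDD

val people = sc.textFile("/filepath/people.txt") // hypothetical "name,age" lines
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")

sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
  .collect()
  .foreach(println)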