Apache Spark: Lightning Fast Cluster Computing


Transcript of Apache Spark: Lightning Fast Cluster Computing

Page 1: Apache Spark: Lightning Fast Cluster Computing

Apache Spark

Lightning Fast Cluster Computing

Eric Mizell – Director, Solution Engineering

Page 2: Apache Spark: Lightning Fast Cluster Computing


What is Apache Spark?

Apache open source project

Distributed compute engine for fast and expressive data processing

Designed for iterative, in-memory computations and interactive data mining

Expressive multi-language APIs for Java, Scala, Python, and R

Powerful abstractions that enable data workers to rapidly iterate over data for:

• ETL, Machine Learning, SQL, Stream Processing, and Graph Processing

[Diagram: the Spark stack: Scala, Java, and Python APIs over the Spark Core Engine, with Spark SQL, Spark Streaming, MLlib, and GraphX on top.]

Page 3: Apache Spark: Lightning Fast Cluster Computing


Why Spark?

Elegant Developer APIs

• Data Frames/SQL, Machine Learning, Graph algorithms and streaming

• Scala, Python, Java and R

• Single environment for pre-processing and Machine Learning

In-memory computation model

• Effective for iterative computations and machine learning

Machine Learning On Hadoop

• Implementations of distributed ML algorithms

• Pipeline API (Spark ML)

Runs on Hadoop via YARN, on Mesos, or standalone

Page 4: Apache Spark: Lightning Fast Cluster Computing


Interactions with Spark

Command Line

• Scala shell – Scala/Java (./bin/spark-shell)

• Python - (./bin/pyspark)

Notebooks

• Apache Zeppelin Notebook

• Jupyter/IPython Notebook

• IRuby Notebook

ODBC/JDBC (Spark SQL only via Thrift)

• Simba driver

• DataDirect driver
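
A first interactive session in the Python shell might look like this (a minimal sketch; the HDFS path is hypothetical, and sc is the SparkContext the shell creates for you):

$ ./bin/pyspark
>>> rdd = sc.textFile("hdfs:///tmp/sample.txt")          # lazily points at the file
>>> rdd.filter(lambda line: "ERROR" in line).count()     # action triggers the job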

Page 5: Apache Spark: Lightning Fast Cluster Computing


Introducing Apache Zeppelin: a web-based notebook for interactive analytics

Features

• Ad-hoc experimentation

• Deeply integrated with Spark + Hadoop

• Supports multiple language backends

• Incubating at Apache

Use Cases

• Data exploration and discovery

• Visualization

• Interactive snippet-at-a-time experience

“Modern Data Science Studio”

Page 6: Apache Spark: Lightning Fast Cluster Computing


Fundamental Abstraction: Resilient Distributed Datasets

RDD: work with distributed collections as primitives

RDD Properties

• Immutable collections of objects spread across a cluster

• Built through parallel transformations (map, filter, etc.)

• Automatically rebuilt on failure

• Controllable persistence (e.g. caching in RAM)

Multiple Languages: broad developer, partner, and customer engagement

[Diagram: the developer writes driver code against one logical RDD; physically, the Spark driver distributes it as partitions 1–3 across worker nodes.]

sc = new SparkContext
rdd = sc.textFile("hdfs://…")
rdd.filter(…)
rdd.cache()
rdd.count()
rdd.map(…)

RDDs are collections of objects distributed across a cluster, cached in RAM or on disk. They are built through parallel transformations, automatically rebuilt on failure, and immutable (each transformation creates a new RDD).
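
To make controllable persistence concrete, here is a minimal PySpark sketch (the path and names are hypothetical):

from pyspark import StorageLevel

words = sc.textFile("hdfs:///logs/app.log").flatMap(lambda line: line.split())
words.persist(StorageLevel.MEMORY_AND_DISK)   # keep partitions in RAM, spill to disk
print(words.count())   # first action computes the RDD and populates the cache
print(words.count())   # second action is served from the cached partitions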

Page 7: Apache Spark: Lightning Fast Cluster Computing


What can developers do with RDDs?

RDD Operations

Transformations

• e.g. map, filter, groupBy, join

• Lazy operations that build RDDs from other RDDs

Actions

• e.g. count, collect, save

• Return a result or write it to storage

Other primitives

• Accumulators

• Broadcast variables

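As a concrete illustration of the two shared-variable primitives, here is a minimal PySpark sketch (the log format and names are hypothetical):

# Broadcast variable: a read-only lookup table shipped once to each worker
severity = sc.broadcast({"ERROR": 3, "WARN": 2, "INFO": 1})

# Accumulator: a counter workers add to and only the driver reads
unknown = sc.accumulator(0)

def score(line):
    level = line.split("\t")[0]
    if level not in severity.value:
        unknown.add(1)                 # count lines with an unrecognized level
        return 0
    return severity.value[level]

total = sc.textFile("hdfs:///logs/app.log").map(score).sum()   # action runs the job
print(total, unknown.value)    # the accumulator is populated only after the action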

Page 8: Apache Spark: Lightning Fast Cluster Computing


Example: Mining Console Logs

Load error messages from a log into memory, then interactively search for patterns:

lines = sc.textFile("hdfs://...")                        # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()            # action
messages.filter(lambda s: "bar" in s).count()
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Page 9: Apache Spark: Lightning Fast Cluster Computing


RDD Demo

Page 10: Apache Spark: Lightning Fast Cluster Computing


SQL Access and Data Frames

[Diagram: the Spark stack on YARN and HDFS, with Spark SQL highlighted.]

Page 11: Apache Spark: Lightning Fast Cluster Computing


Spark SQL

Table Structure: integrated to work with tables and rows

Hive Queries via Spark: the Spark SQL (Hive) context can connect to Hive and run Hive queries

Bindings for Python, Scala, Java, and R

Data Frames: a new abstraction that simplifies and speeds up SQL processing

[Diagram: Spark SQL runs on the Spark Core Engine over YARN and HDFS, exposing the Data Frame DSL and Data Frame API on top of a common Data Source API.]
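
A minimal sketch of these pieces with the Spark 1.x Python API (1.4+ reader shown; the file path and table name are hypothetical):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)            # or HiveContext(sc) to query Hive tables

# Load a JSON file into a Data Frame and expose it to SQL as a temp table
people = sqlContext.read.json("hdfs:///data/people.json")
people.registerTempTable("people")

# The result of a SQL query is itself a Data Frame
adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()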

Page 12: Apache Spark: Lightning Fast Cluster Computing


What are Data Frames?

Data Frames represent data in RDDs as a table

RDD is a low-level abstraction

–Think of the RDD as bytecode and the Data Frame as the Java program

Data Frame Properties

–Data Frames attach a schema to RDDs

–Allow users to perform aggressive query optimizations

–Bring the power of SQL to RDDs!

dept | name     | age
-----|----------|----
Bio  | H Smith  | 48
CS   | A Turing | 54
Bio  | B Jones  | 43
Phys | E Witten | 61

[Diagram: each row is a tuple in a relational view; underlying storage ranges from columnar formats (ORCFile, Parquet) to unstructured data (JSON, CSV, text, Avro, custom weblogs).]

Page 13: Apache Spark: Lightning Fast Cluster Computing


Data Frames are intuitive

Task: find the average age by department.

dept | name     | age
-----|----------|----
Bio  | H Smith  | 48
CS   | A Turing | 54
Bio  | B Jones  | 43
Phys | E Witten | 61

[The original slide shows an RDD example next to the equivalent, much shorter Data Frame example.]
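
The slide's code is not preserved in the transcript; a minimal PySpark sketch of the two approaches might look like this (sqlContext as in the earlier sketch):

# RDD version: pair each dept with (age, 1), sum both, then divide
rows = sc.parallelize([("Bio", "H Smith", 48), ("CS", "A Turing", 54),
                       ("Bio", "B Jones", 43), ("Phys", "E Witten", 61)])
avg_by_dept = (rows.map(lambda r: (r[0], (r[2], 1)))
                   .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                   .mapValues(lambda s: s[0] / float(s[1])))
print(avg_by_dept.collect())

# Data Frame version: declarative, and the optimizer plans the aggregation
df = sqlContext.createDataFrame(rows, ["dept", "name", "age"])
df.groupBy("dept").avg("age").show()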

Page 14: Apache Spark: Lightning Fast Cluster Computing


DataFrame Demo


Page 15: Apache Spark: Lightning Fast Cluster Computing


MLlib: Machine Learning Library

[Diagram: the Spark stack on YARN and HDFS, with MLlib highlighted.]

Page 16: Apache Spark: Lightning Fast Cluster Computing


What is Machine Learning?

Machine learning is the study of algorithms that learn concepts from data.

A key aspect of learning is generalization: how well a learning algorithm is able to predict on unseen examples.

Page 17: Apache Spark: Lightning Fast Cluster Computing


Machine Learning Primitives

Unsupervised Learning

• Clustering (K-means)

• Recommendation: collaborative filtering (alternating least squares)

• Dimensionality reduction: principal component analysis (PCA) and singular value decomposition (SVD)

Supervised Learning

• Classification: Naïve Bayes, decision trees, random forests, gradient-boosted trees

• Regression: linear, logistic, and Support Vector Machines (SVMs)
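
As one example, a minimal K-means sketch with the Spark 1.x MLlib RDD API (the points are made up):

from pyspark.mllib.clustering import KMeans

# Four 2-D points forming two obvious clusters
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)        # the two learned centroids
print(model.predict([0.5, 0.5]))   # cluster id assigned to a new point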

Page 18: Apache Spark: Lightning Fast Cluster Computing


ML Workflows are complex

[Diagram: a sponsored search advertising pipeline. Log parsing and cleaning feed ad-category and query-category mappings plus Q-Q and Q-A similarity; feature extraction with polynomial expansion of the Q-A signals produces features; a linear solver trains a model on a training split and reports metrics on a test split; the ad server consumes the model, with data stored in HDFS.]

Challenges:

• Specify the pipeline

• Inspect and debug

• Tune hyperparameters

• Productionize

Page 19: Apache Spark: Lightning Fast Cluster Computing


ML Pipeline makes ML workflows easier

Transformer: transforms one dataset into another

Estimator: fits a model to data

Pipeline: a sequence of stages, each an estimator or a transformer

Parameters: a trait for components that take parameters
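
A minimal sketch of these pieces with the Spark 1.x Pipeline API (the toy data and column names are made up; sqlContext as in the earlier sketch):

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

training = sqlContext.createDataFrame(
    [(0, "spark is fast", 1.0), (1, "hadoop mapreduce", 0.0)],
    ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")      # transformer
hashingTF = HashingTF(inputCol="words", outputCol="features")  # transformer
lr = LogisticRegression(maxIter=10)                            # estimator

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)   # fitting yields a PipelineModel, itself a transformer
model.transform(training).show()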

Page 20: Apache Spark: Lightning Fast Cluster Computing


Streaming: Real-Time Stream Processing

[Diagram: the Spark stack on YARN and HDFS, with Spark Streaming highlighted.]

Page 21: Apache Spark: Lightning Fast Cluster Computing


Spark Streaming

• Spark Streaming is an extension of the Spark core API that supports scalable, high-throughput, and fault-tolerant streaming applications.

• Data can be ingested from many sources, such as Kafka, Flume, Twitter, ZeroMQ, or TCP sockets.

• Data is processed using the now-familiar API: map, filter, reduce, join, and window.

• Processed data can be stored in databases, filesystems, or live dashboards.
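
A minimal sketch of a streaming word count over a TCP socket (host and port are hypothetical):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)          # 1-second micro-batches over the SparkContext

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                        # print each batch's counts to the driver log

ssc.start()
ssc.awaitTermination()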

Page 22: Apache Spark: Lightning Fast Cluster Computing


GraphX: Graph Processing

[Diagram: the Spark stack on YARN and HDFS, with GraphX highlighted.]

Page 23: Apache Spark: Lightning Fast Cluster Computing


Spark GraphX: Graph API on Spark

Seamlessly work with graphs and collections

Growing library of graph algorithms

• SVD++, Connected Components, Triangle Count, …

Iterative graph computations using Pregel

• Implements Valiant’s Bulk Synchronous Parallel (BSP) model for distributing graph algorithms

Use Cases

• Social media: suggest new connections based on existing relationships

• Networking: find the best routing through a given network

Page 24: Apache Spark: Lightning Fast Cluster Computing

Distributed Graphs as Tables (RDDs)

[Diagram: a property graph with vertices A–F is split across two partitions by a 2D vertex-cut heuristic and stored as RDDs: a vertex table, an edge table (A-B, A-C, C-D, B-C, A-E, A-F, E-F, E-D), and a routing table recording which partitions hold each vertex’s edges.]

Page 25: Apache Spark: Lightning Fast Cluster Computing


How to Get Started with Spark

Page 26: Apache Spark: Lightning Fast Cluster Computing


Try Spark Today

• Download the Hortonworks Sandbox: http://hortonworks.com/products/hortonworks-sandbox/

• Go to the Apache Spark website: http://spark.apache.org/

• Learn Spark

• Build a proof of concept

• Test new functionality

Page 27: Apache Spark: Lightning Fast Cluster Computing


Thank You!

Eric Mizell – Director, Solution Engineering, [email protected]