Apache Spark: Lightning Fast Cluster Computing


Transcript of Apache Spark: Lightning Fast Cluster Computing

Page 1: Apache Spark: Lightning Fast Cluster Computing

Apache Spark

Lightning Fast Cluster Computing

Eric Mizell – Director, Solution Engineering

Page 2: Apache Spark: Lightning Fast Cluster Computing


What is Apache Spark?

Apache open source project

Distributed compute engine for fast and expressive data processing

Designed for iterative, in-memory computations and interactive data mining

Expressive multi-language APIs for Java, Scala, Python, and R

Powerful abstractions that enable data workers to rapidly iterate over data for:

• ETL, Machine Learning, SQL, Stream Processing, and Graph Processing

[Diagram: the Spark stack: Scala, Java, and Python APIs over the Spark Core Engine, with Spark SQL, Spark Streaming, MLlib, and GraphX on top.]

Page 3: Apache Spark: Lightning Fast Cluster Computing


Why Spark?

Elegant Developer APIs

• Data Frames/SQL, Machine Learning, Graph algorithms and streaming

• Scala, Python, Java and R

• Single environment for pre-processing and Machine Learning

In-memory computation model

• Effective for iterative computations and machine learning

Machine Learning On Hadoop

• Implementations of distributed ML algorithms

• Pipeline API (Spark ML)

Runs on Hadoop via YARN, on Mesos, or standalone

Page 4: Apache Spark: Lightning Fast Cluster Computing


Interactions with Spark

Command Line

• Scala shell – Scala/Java (./bin/spark-shell)

• Python - (./bin/pyspark)

Notebooks

• Apache Zeppelin Notebook

• Jupyter/IPython Notebook

• IRuby Notebook

ODBC/JDBC (Spark SQL only via Thrift)

• Simba driver

• DataDirect driver
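
A first interactive session in the Python shell might look like this (a minimal sketch; the HDFS path is hypothetical, and sc is the SparkContext the shell creates for you):

$ ./bin/pyspark
>>> rdd = sc.textFile("hdfs:///tmp/sample.txt")          # lazily points at the file
>>> rdd.filter(lambda line: "ERROR" in line).count()     # action triggers the job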

Page 5: Apache Spark: Lightning Fast Cluster Computing


Introducing Apache Zeppelin: a web-based notebook for interactive analytics

Features

• Ad-hoc experimentation

• Deeply integrated with Spark + Hadoop

• Supports multiple language backends

• Incubating at Apache

Use Cases

• Data exploration and discovery

• Visualization

• Interactive snippet-at-a-time experience

“Modern Data Science Studio”

Page 6: Apache Spark: Lightning Fast Cluster Computing


Fundamental Abstraction: Resilient Distributed Datasets

RDD: work with distributed collections as primitives

RDD Properties

• Immutable collections of objects spread across a cluster

• Built through parallel transformations (map, filter, etc.)

• Automatically rebuilt on failure

• Controllable persistence (e.g. caching in RAM)

Multiple Languages: broad developer, partner, and customer engagement

[Diagram: the developer writes driver code against one logical RDD; physically, the Spark driver distributes it as partitions 1–3 across worker nodes.]

sc = new SparkContext
rdd = sc.textFile("hdfs://…")
rdd.filter(…)
rdd.cache()
rdd.count()
rdd.map(…)

RDDs are collections of objects distributed across a cluster, cached in RAM or on disk. They are built through parallel transformations, automatically rebuilt on failure, and immutable (each transformation creates a new RDD).
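
To make controllable persistence concrete, here is a minimal PySpark sketch (the path and names are hypothetical):

from pyspark import StorageLevel

words = sc.textFile("hdfs:///logs/app.log").flatMap(lambda line: line.split())
words.persist(StorageLevel.MEMORY_AND_DISK)   # keep partitions in RAM, spill to disk
print(words.count())   # first action computes the RDD and populates the cache
print(words.count())   # second action is served from the cached partitions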

Page 7: Apache Spark: Lightning Fast Cluster Computing


What can developers do with RDDs?

RDD Operations

Transformations

• e.g. map, filter, groupBy, join

• Lazy operations that build RDDs from other RDDs

Actions

• e.g. count, collect, save

• Return a result or write it to storage

Other primitives

• Accumulators

• Broadcast variables

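As a concrete illustration of the two shared-variable primitives, here is a minimal PySpark sketch (the log format and names are hypothetical):

# Broadcast variable: a read-only lookup table shipped once to each worker
severity = sc.broadcast({"ERROR": 3, "WARN": 2, "INFO": 1})

# Accumulator: a counter workers add to and only the driver reads
unknown = sc.accumulator(0)

def score(line):
    level = line.split("\t")[0]
    if level not in severity.value:
        unknown.add(1)                 # count lines with an unrecognized level
        return 0
    return severity.value[level]

total = sc.textFile("hdfs:///logs/app.log").map(score).sum()   # action runs the job
print(total, unknown.value)    # the accumulator is populated only after the action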

Page 8: Apache Spark: Lightning Fast Cluster Computing


Example: Mining Console Logs

Load error messages from a log into memory, then interactively search for patterns:

lines = sc.textFile("hdfs://...")                        # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # transformed RDD
messages = errors.map(lambda s: s.split('\t')[2])
messages.cache()

messages.filter(lambda s: "foo" in s).count()            # action
messages.filter(lambda s: "bar" in s).count()
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

Page 9: Apache Spark: Lightning Fast Cluster Computing


RDD Demo

Page 10: Apache Spark: Lightning Fast Cluster Computing


SQL Access and Data Frames

[Diagram: the Spark stack on YARN and HDFS, with Spark SQL highlighted.]

Page 11: Apache Spark: Lightning Fast Cluster Computing


Spark SQL

Table Structure: integrated to work with tables and rows

Hive Queries via Spark: the Spark SQL (Hive) context can connect to Hive and run Hive queries

Bindings for Python, Scala, Java, and R

Data Frames: a new abstraction that simplifies and speeds up SQL processing

[Diagram: Spark SQL runs on the Spark Core Engine over YARN and HDFS, exposing the Data Frame DSL and Data Frame API on top of a common Data Source API.]
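
A minimal sketch of these pieces with the Spark 1.x Python API (1.4+ reader shown; the file path and table name are hypothetical):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)            # or HiveContext(sc) to query Hive tables

# Load a JSON file into a Data Frame and expose it to SQL as a temp table
people = sqlContext.read.json("hdfs:///data/people.json")
people.registerTempTable("people")

# The result of a SQL query is itself a Data Frame
adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()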

Page 12: Apache Spark: Lightning Fast Cluster Computing


What are Data Frames?

Data Frames represent data in RDDs as a table

RDD is a low-level abstraction

–Think of the RDD as bytecode and the Data Frame as the Java program

Data Frame Properties

–Data Frames attach a schema to RDDs

–Allow users to perform aggressive query optimizations

–Bring the power of SQL to RDDs!

dept | name     | age
-----|----------|----
Bio  | H Smith  | 48
CS   | A Turing | 54
Bio  | B Jones  | 43
Phys | E Witten | 61

[Diagram: each row is a tuple in a relational view; underlying storage ranges from columnar formats (ORCFile, Parquet) to unstructured data (JSON, CSV, text, Avro, custom weblogs).]

Page 13: Apache Spark: Lightning Fast Cluster Computing


Data Frames are intuitive

Task: find the average age by department.

dept | name     | age
-----|----------|----
Bio  | H Smith  | 48
CS   | A Turing | 54
Bio  | B Jones  | 43
Phys | E Witten | 61

[The original slide shows an RDD example next to the equivalent, much shorter Data Frame example.]
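
The slide's code is not preserved in the transcript; a minimal PySpark sketch of the two approaches might look like this (sqlContext as in the earlier sketch):

# RDD version: pair each dept with (age, 1), sum both, then divide
rows = sc.parallelize([("Bio", "H Smith", 48), ("CS", "A Turing", 54),
                       ("Bio", "B Jones", 43), ("Phys", "E Witten", 61)])
avg_by_dept = (rows.map(lambda r: (r[0], (r[2], 1)))
                   .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                   .mapValues(lambda s: s[0] / float(s[1])))
print(avg_by_dept.collect())

# Data Frame version: declarative, and the optimizer plans the aggregation
df = sqlContext.createDataFrame(rows, ["dept", "name", "age"])
df.groupBy("dept").avg("age").show()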

Page 14: Apache Spark: Lightning Fast Cluster Computing


DataFrame Demo


Page 15: Apache Spark: Lightning Fast Cluster Computing


MLlib: Machine Learning Library

[Diagram: the Spark stack on YARN and HDFS, with MLlib highlighted.]

Page 16: Apache Spark: Lightning Fast Cluster Computing


What is Machine Learning?

Machine learning is the study of algorithms that learn concepts from data.

A key aspect of learning is generalization: how well a learning algorithm is able to predict on unseen examples.

Page 17: Apache Spark: Lightning Fast Cluster Computing


Machine Learning Primitives

Unsupervised Learning

• Clustering (K-means)

• Recommendation: collaborative filtering (alternating least squares)

• Dimensionality reduction: principal component analysis (PCA) and singular value decomposition (SVD)

Supervised Learning

• Classification: Naïve Bayes, decision trees, random forests, gradient-boosted trees

• Regression: linear, logistic, and Support Vector Machines (SVMs)
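
As one example, a minimal K-means sketch with the Spark 1.x MLlib RDD API (the points are made up):

from pyspark.mllib.clustering import KMeans

# Four 2-D points forming two obvious clusters
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)        # the two learned centroids
print(model.predict([0.5, 0.5]))   # cluster id assigned to a new point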

Page 18: Apache Spark: Lightning Fast Cluster Computing


ML Workflows are complex

[Diagram: a sponsored search advertising pipeline. Log parsing and cleaning feed ad-category and query-category mappings plus Q-Q and Q-A similarity; feature extraction with polynomial expansion of the Q-A signals produces features; a linear solver trains a model on a training split and reports metrics on a test split; the ad server consumes the model, with data stored in HDFS.]

Challenges:

• Specify the pipeline

• Inspect and debug

• Tune hyperparameters

• Productionize

Page 19: Apache Spark: Lightning Fast Cluster Computing


ML Pipeline makes ML workflows easier

Transformer: transforms one dataset into another

Estimator: fits a model to data

Pipeline: a sequence of stages, each an estimator or a transformer

Parameters: a trait for components that take parameters
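
A minimal sketch of these pieces with the Spark 1.x Pipeline API (the toy data and column names are made up; sqlContext as in the earlier sketch):

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

training = sqlContext.createDataFrame(
    [(0, "spark is fast", 1.0), (1, "hadoop mapreduce", 0.0)],
    ["id", "text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")      # transformer
hashingTF = HashingTF(inputCol="words", outputCol="features")  # transformer
lr = LogisticRegression(maxIter=10)                            # estimator

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)   # fitting yields a PipelineModel, itself a transformer
model.transform(training).show()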

Page 20: Apache Spark: Lightning Fast Cluster Computing


Streaming: Real-Time Stream Processing

[Diagram: the Spark stack on YARN and HDFS, with Spark Streaming highlighted.]

Page 21: Apache Spark: Lightning Fast Cluster Computing


Spark Streaming

• Spark Streaming is an extension of the Spark core API that supports scalable, high-throughput, and fault-tolerant streaming applications.

• Data can be ingested from many sources, such as Kafka, Flume, Twitter, ZeroMQ, or TCP sockets.

• Data is processed using the now-familiar API: map, filter, reduce, join, and window.

• Processed data can be stored in databases, filesystems, or live dashboards.
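
A minimal sketch of a streaming word count over a TCP socket (host and port are hypothetical):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)          # 1-second micro-batches over the SparkContext

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                        # print each batch's counts to the driver log

ssc.start()
ssc.awaitTermination()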

Page 22: Apache Spark: Lightning Fast Cluster Computing


GraphX: Graph Processing

[Diagram: the Spark stack on YARN and HDFS, with GraphX highlighted.]

Page 23: Apache Spark: Lightning Fast Cluster Computing


Spark GraphX: Graph API on Spark

Seamlessly work with graphs and collections

Growing library of graph algorithms

• SVD++, Connected Components, Triangle Count, …

Iterative graph computations using Pregel

• Implements Valiant’s Bulk Synchronous Parallel (BSP) model for distributing graph algorithms

Use Cases

• Social media: suggest new connections based on existing relationships

• Networking: find the best routing through a given network

Page 24: Apache Spark: Lightning Fast Cluster Computing

Distributed Graphs as Tables (RDDs)

[Diagram: a property graph with vertices A–F is split across two partitions by a 2D vertex-cut heuristic and stored as RDDs: a vertex table, an edge table (A-B, A-C, C-D, B-C, A-E, A-F, E-F, E-D), and a routing table recording which partitions hold each vertex’s edges.]

Page 25: Apache Spark: Lightning Fast Cluster Computing


How to Get Started with Spark

Page 26: Apache Spark: Lightning Fast Cluster Computing


Try Spark Today

• Download the Hortonworks Sandbox: http://hortonworks.com/products/hortonworks-sandbox/

• Go to the Apache Spark website: http://spark.apache.org/

• Learn Spark

• Build a proof of concept

• Test new functionality

Page 27: Apache Spark: Lightning Fast Cluster Computing


Thank You!

Eric Mizell – Director, Solution Engineering, [email protected]