Apache Spark: Usage and Roadmap in Hadoop

© Cloudera, Inc. All rights reserved.

Jai Ranganathan


Spark will replace MapReduce
To become the standard execution engine for Hadoop


The Future of Data Processing on Hadoop
Spark complemented by specialized fit-for-purpose engines

General Data Processing w/Spark
Fast Batch Processing, Machine Learning, and Stream Processing

Analytic Database w/Impala
Low-Latency, Massively Concurrent Queries

Full-Text Search w/Solr
Querying textual data

On-Disk Processing w/MapReduce
Jobs at extreme scale and extremely disk IO intensive

Shared:
• Data Storage
• Metadata
• Resource Management
• Administration
• Security
• Governance


Cloudera Leading the Spark Movement

2013 2014 2015 2016

Identified Spark’s early potential

Ships and Supports Spark with CDH 4.4

Spark on YARN integration

Announces initiative to make Spark the standard execution engine

Launches first Spark training

Added security integration

Cloudera engineers publish O’Reilly Spark book

Leading effort to further performance, usability, and enterprise-readiness


Community Initiative: Spark Supersedes MapReduce

Community development to port components to Spark:

Stage 1
• Crunch on Spark
• Search on Spark

Stage 2
• Hive on Spark (beta)
• Spark on HBase (beta)

Stage 3
• Pig on Spark (alpha)
• Sqoop on Spark


Cloudera Customer Use Cases

Core Spark

Financial Services
• Portfolio Risk Analysis
• ETL Pipeline Speed-Up
• 20+ years of stock data

Health
• Identify disease-causing genes in the full human genome
• Calculate Jaccard scores on health care data sets

ERP
• Optical Character Recognition and Bill Classification

Data Services
• Trend analysis
• Document classification (LDA)
• Fraud analytics

Spark Streaming

Financial Services
• Online Fraud Detection

Health
• Incident Prediction for Sepsis

Retail
• Online Recommendation Systems
• Real-Time Inventory Management

Ad Tech
• Real-Time Ad Performance Analysis
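
One of the health use cases above mentions Jaccard scores. As a reminder of what that metric computes, here is a minimal plain-Python sketch (not Spark code, and not the customer's implementation): the Jaccard score of two sets is the size of their intersection divided by the size of their union. The function name and the sample sets are hypothetical, chosen only for illustration.

```python
def jaccard_score(a, b) -> float:
    """Return |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty sets score 0.0.
        return 0.0
    return len(set_a & set_b) / len(union)


if __name__ == "__main__":
    # Hypothetical attribute sets for two records (illustrative values only).
    record_1 = {"a", "b", "c"}
    record_2 = {"b", "c", "d"}
    print(jaccard_score(record_1, record_2))  # 2 shared / 4 total = 0.5
```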


Apache Spark
Flexible, in-memory data processing for Hadoop

Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell

Flexible Extensible API
• APIs for different types of workloads:
  • Batch
  • Streaming
  • Machine Learning
  • Graph

Fast Batch & Stream Processing
• In-Memory processing and caching


The Spark Ecosystem & Hadoop

Hadoop Integration
• Spark-on-YARN integration
• Shares data, metadata, administration, security, & governance

Platform stack (top to bottom):
Spark Streaming, MLlib, SparkSQL, GraphX, DataFrames, SparkR
Spark, Impala, MR, Others
RESOURCE MANAGEMENT: YARN
STORAGE: HDFS, HBase


Logistic Regression Performance (Data Fits in Memory)

[Chart: running time (s), 0 to 4000, vs. # of iterations (1, 5, 10, 20, 30) for MapReduce and Spark]

MapReduce: 110 s/iteration
Spark: first iteration = 80 s; further iterations 1 s due to caching
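
The per-iteration figures on this slide imply a simple cost model, sketched below as plain arithmetic. It assumes, as a simplification not stated on the slide, that each MapReduce iteration costs a constant 110 s and that every Spark iteration after the first costs a constant 1 s. The function and constant names are illustrative.

```python
# Figures quoted on the slide, in seconds.
MAPREDUCE_SECONDS_PER_ITERATION = 110
SPARK_FIRST_ITERATION_SECONDS = 80
SPARK_CACHED_ITERATION_SECONDS = 1


def mapreduce_total_seconds(iterations: int) -> int:
    """Assumed model: every MapReduce iteration pays the full cost."""
    return MAPREDUCE_SECONDS_PER_ITERATION * max(iterations, 0)


def spark_total_seconds(iterations: int) -> int:
    """Assumed model: the first iteration loads and caches the data,
    later iterations read from the in-memory cache."""
    if iterations <= 0:
        return 0
    return (SPARK_FIRST_ITERATION_SECONDS
            + SPARK_CACHED_ITERATION_SECONDS * (iterations - 1))


if __name__ == "__main__":
    # Iteration counts taken from the chart's x-axis.
    for n in (1, 5, 10, 20, 30):
        print(n, mapreduce_total_seconds(n), spark_total_seconds(n))
```

Under these assumptions, 30 iterations come to 3,300 s for MapReduce and 109 s for Spark, which is consistent with the chart's 0 to 4000 s axis.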


Apache Spark Streaming

What is it?
• Run continuous processing of data using Spark’s core API
• Extends Spark concepts to fault-tolerant, transformable streams
• Adds “rolling window” operations
  • Example: Compute rolling averages or counts for data over last five minutes

Benefits:
• Reuse knowledge and code in both contexts
  • Same programming paradigm for streaming and batch
• Simplicity of development
  • High-level API with automatic DAG generation
• Excellent throughput
  • Scale easily to support large volumes of data ingest
• Combine elements like MLlib and Oryx into streaming applications

Common Use Cases:
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS
• Detect anomalous behavior and trigger alerts
• Continuous reporting of summary metrics for incoming data
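
The “rolling window” example on this slide (averages or counts over the last five minutes) can be illustrated with a minimal plain-Python sketch. This is not the Spark Streaming API; it only shows the windowing idea on a single machine. The class name, the method names, and the assumption that events arrive in non-decreasing timestamp order are all illustrative choices.

```python
from collections import deque


class RollingWindow:
    """Tracks (timestamp, value) events from the last `window_seconds` seconds.

    Assumption for this sketch: events are added in non-decreasing
    timestamp order, with timestamps given in seconds.
    """

    def __init__(self, window_seconds: float = 300.0):  # five minutes
        self.window_seconds = window_seconds
        self._events = deque()
        self._total = 0.0

    def add(self, timestamp: float, value: float) -> None:
        self._events.append((timestamp, value))
        self._total += value
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop events at or before (now - window), keeping the
        # half-open interval (now - window, now].
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value

    def count(self) -> int:
        return len(self._events)

    def average(self) -> float:
        return self._total / len(self._events) if self._events else 0.0


if __name__ == "__main__":
    window = RollingWindow(window_seconds=300)
    window.add(0, 10)
    window.add(100, 20)
    window.add(350, 30)  # the event at t=0 is now older than five minutes
    print(window.count(), window.average())
```

In a streaming engine the same computation is distributed and made fault-tolerant; the sketch keeps only the eviction logic that defines the window.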


Spark Streaming Architectures

Data Sources

High-Fidelity Archival

Ingest

Integration Layer

• Flume• Kafka

Spark Stream Processing

Data Prep Aggregation / Scoring

Transformed Results

Application

Notifications

HDFS

Spark Long-Term Analytics/Model Building

HBase

Real-Time Result Serving

Real-Time Serving


SparkSQL + DataFrames
Machine Learning Applications

• Goal:
  • Spark/Java Developers and Data Scientists can inline SQL into Spark apps

• Designed for:
  • Ease of development for Spark developers
  • Handful of concurrent Spark jobs

• Strengths:
  • Ease of embedding SQL into Java or Scala applications
  • SQL for common functionality in developer flow (e.g., aggregations, filters, samples)


Execution Pipeline

[Diagram: SQL AST / DataFrames → Logical Plan → Optimized Logical Plan → Physical Plans → CBO → Selected Plan → RDDs]


Uniting Spark and Hadoop
The One Platform Initiative

Management
Leverage Hadoop-native resource management.

Security
Full support for Hadoop security and beyond.

Scale
Enable 10k-node clusters.

Streaming
Support for 80% of common stream processing workloads.


Detailed Roadmap: One Platform Initiative

[Legend, color-coded on the slide: Completed Work / Planned Future Work]

Management
• Spark on YARN Integration
• HBase integration
• Improved metrics for monitoring/troubleshooting
• Dynamic Resource Allocation

• Spark on YARN:
  • Container resizing
  • Dynamic Resource Allocation for Streaming
• Simplified resource configuration
• Improved WebUI for debugging
• Improved metrics for visibility into resource utilization
• Smart auto-tuning of job parameters

Security
• Kerberos Integration
• HDFS Sync (Sentry)
• Secure data at rest

• Secure data over the wire
• Audit/Lineage (Navigator)
• Spark PCI compliance
• Integration with Intel’s advanced encryption libraries
• Enable column and view level security

Scale
• Revamp Scheduler handling of node failure
• Sort based shuffle improvements
• Task Scheduling based on HDFS data locality and caching
• Scheduler improvements for performance at scale
• Stress test at scale with mixed multi-tenant workloads

• HDFS DDM Integration
• Dynamic resource utilization & prioritization
• Scale Spark History Server for 1000s of jobs

Streaming
• Zero Data Loss with Spark Streaming Resilience
• Flume integration
• Kafka integration
• SQL semantics for expressing streaming jobs (Business Users)
• New streaming specific API extensions
• Streaming application management (pause, update, redeploy) via CM
• Optimized state updates: efficient point lookups and delta updates


Spark Resources

• Learn Spark
  • O’Reilly Advanced Analytics with Spark eBook (written by Clouderans): http://shop.oreilly.com/product/0636920035091.do
  • Cloudera Developer Blog: http://blog.cloudera.com/blog/category/spark/
  • cloudera.com/spark: http://www.cloudera.com/spark

• Get Trained
  • Cloudera Spark Training: http://cloudera.com/content/cloudera/en/training/courses/spark-training.html

• Try it Out
  • Cloudera Live Spark Tutorial: http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-live.html


Try It With Cloudera Live

cloudera.com/live

Featuring tutorials on:

CDH


Thank You
Jairam Ranganathan
