Apache Spark: Usage and Roadmap in Hadoop
1© Cloudera, Inc. All rights reserved.
Apache Spark: Usage and Roadmap in Hadoop
Jai Ranganathan

Spark will replace MapReduce
To become the standard execution engine for Hadoop

The Future of Data Processing on Hadoop
Spark complemented by specialized fit-for-purpose engines
General Data Processing w/Spark: Fast Batch Processing, Machine Learning, and Stream Processing
Analytic Database w/Impala: Low-Latency, Massively Concurrent Queries
Full-Text Search w/Solr: Querying textual data
On-Disk Processing w/MapReduce: Jobs at extreme scale and extremely disk IO intensive
Shared:
• Data Storage
• Metadata
• Resource Management
• Administration
• Security
• Governance

Cloudera Leading the Spark Movement
Timeline, 2013 to 2016:
• Identified Spark’s early potential
• Ships and Supports Spark with CDH 4.4
• Spark on YARN integration
• Announces initiative to make Spark the standard execution engine
• Launches first Spark training
• Added security integration
• Cloudera engineers publish O’Reilly Spark book
• Leading effort to further performance, usability, and enterprise-readiness

Community Initiative: Spark Supersedes MapReduce
Community development to port components to Spark:
Stage 1
• Crunch on Spark
• Search on Spark
Stage 2
• Hive on Spark (beta)
• Spark on HBase (beta)
Stage 3
• Pig on Spark (alpha)
• Sqoop on Spark

Cloudera Customer Use Cases
Core Spark
Financial Services
• Portfolio Risk Analysis
• ETL Pipeline Speed-Up
• 20+ years of stock data
Health
• Identify disease-causing genes in the full human genome
• Calculate Jaccard scores on health care data sets
ERP
• Optical Character Recognition and Bill Classification
Data Services
• Trend analysis
• Document classification (LDA)
• Fraud analytics
Spark Streaming
Financial Services
• Online Fraud Detection
Health
• Incident Prediction for Sepsis
Retail
• Online Recommendation Systems
• Real-Time Inventory Management
Ad Tech
• Real-Time Ad Performance Analysis

Apache Spark
Flexible, in-memory data processing for Hadoop
Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell
Flexible Extensible API
• APIs for different types of workloads: Batch, Streaming, Machine Learning, Graph
Fast Batch & Stream Processing
• In-Memory processing and caching

The Spark Ecosystem & Hadoop
Hadoop Integration
• Spark-on-YARN integration
• Shares data, metadata, administration, security, & governance
Stack diagram:
• STORAGE: HDFS, HBase
• RESOURCE MANAGEMENT: YARN
• Engines: Spark, Impala, MR, Others
• Spark components: Spark Streaming, MLlib, SparkSQL, GraphX, Dataframes, SparkR

Logistic Regression Performance (Data Fits in Memory)
[Chart: running time in seconds (0 to 4000) against number of iterations (1, 5, 10, 20, 30), for MapReduce and Spark]
MapReduce: 110 s/iteration
Spark: first iteration = 80 s; further iterations 1 s due to caching
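
As a rough check on the figures quoted for this chart, the sketch below totals running time under a simple linear model. It reads the 110 s/iteration figure as the MapReduce cost, and takes 80 s for the first Spark iteration and 1 s for each later one. The function names and the assumption that per-iteration cost stays constant are illustrative, not taken from the slide.

```python
def mapreduce_seconds(iterations, per_iteration=110):
    """Total time when every iteration pays the same cost (assumed linear)."""
    return iterations * per_iteration


def spark_seconds(iterations, first=80, cached=1):
    """Total time when only the first iteration pays the full load cost
    and later iterations run against cached data."""
    if iterations <= 0:
        return 0
    return first + (iterations - 1) * cached


# Iteration counts taken from the chart's x-axis.
for n in (1, 5, 10, 20, 30):
    print(f"{n:>2} iterations: MapReduce {mapreduce_seconds(n):>4} s, "
          f"Spark {spark_seconds(n):>3} s")
```

Under these assumptions, 30 iterations take 3300 s on MapReduce and 109 s on Spark, which matches the chart's 0 to 4000 s range.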

Apache Spark Streaming
What is it?
• Run continuous processing of data using Spark’s core API
• Extends Spark concepts to fault-tolerant, transformable streams
• Adds “rolling window” operations
• Example: Compute rolling averages or counts for data over last five minutes
Benefits:
• Reuse knowledge and code in both contexts
• Same programming paradigm for streaming and batch
• Simplicity of development
• High-level API with automatic DAG generation
• Excellent throughput
• Scale easily to support large volumes of data ingest
• Combine elements like MLlib and Oryx into streaming applications
Common Use Cases:
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS
• Detect anomalous behavior and trigger alerts
• Continuous reporting of summary metrics for incoming data
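
The rolling-window idea can be illustrated without Spark. The sketch below is plain Python, not the Spark Streaming API: a hypothetical `RollingWindow` class keeps the events from the last 300 seconds (five minutes) and reports their count and average. It assumes events arrive in non-decreasing timestamp order, and the sample timestamps and values are made up for illustration.

```python
from collections import deque


class RollingWindow:
    """Holds (timestamp, value) pairs that fall in (now - window, now]."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.events = deque()
        self.total = 0.0

    def add(self, timestamp, value):
        # Assumes timestamps never go backwards.
        self.events.append((timestamp, value))
        self.total += value
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that are a full window old or older.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def count(self):
        return len(self.events)

    def average(self):
        return self.total / len(self.events) if self.events else None


# Illustrative events: (seconds since start, measured value).
window = RollingWindow(window_seconds=300)
for ts, value in [(0, 10.0), (100, 20.0), (250, 30.0), (400, 40.0)]:
    window.add(ts, value)
    print(ts, window.count(), window.average())
```

By the time the event at 400 s arrives, the events at 0 s and 100 s have aged out, so only the last two remain in the window. A streaming engine applies the same idea across partitions and adds fault tolerance, which this single-process sketch does not attempt.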

Spark Streaming Architectures
Diagram components:
• Data Sources
• Ingest
• Integration Layer: Flume, Kafka
• Spark Stream Processing: Data Prep, Aggregation / Scoring
• Transformed Results
• Application Notifications
• High-Fidelity Archival
• HDFS
• Spark Long-Term Analytics/Model Building
• HBase
• Real-Time Result Serving

SparkSQL + Dataframes
Machine Learning Applications
Goal:
• Spark/Java Developers and Data Scientists can inline SQL into Spark apps
Designed for:
• Ease of development for Spark developers
• Handful of concurrent Spark jobs
Strengths:
• Ease of embedding SQL into Java or Scala applications
• SQL for common functionality in developer flow (e.g., aggregations, filters, samples)

Execution Pipeline
[Diagram: SQL AST or Dataframes → Logical Plan → Optimized Logical Plan → Physical Plans → CBO → Selected Plan → RDDs]

Uniting Spark and Hadoop
The One Platform Initiative
Management: Leverage Hadoop-native resource management.
Security: Full support for Hadoop security and beyond.
Scale: Enable 10k-node clusters.
Streaming: Support for 80% of common stream processing workloads.

Detailed Roadmap: One Platform Initiative
Legend: Completed Work / Planned Future Work
Management
• Spark on YARN Integration
• HBase integration
• Improved metrics for monitoring/troubleshooting
• Dynamic Resource Allocation
• Spark on YARN:
• Container resizing
• Dynamic Resource Allocation for Streaming
• Simplified resource configuration
• Improved WebUI for debugging
• Improved metrics for visibility into resource utilization
• Smart auto-tuning of job parameters
Security
• Kerberos Integration
• HDFS Sync (Sentry)
• Secure data at rest
• Secure data over the wire
• Audit/Lineage (Navigator)
• Spark PCI compliance
• Integration with Intel’s advanced encryption libraries
• Enable column and view level security
Scale
• Revamp Scheduler handling of node failure
• Sort based shuffle improvements
• Task Scheduling based on HDFS data locality and caching
• Scheduler improvements for performance at scale
• Stress test at scale with mixed multi-tenant workloads
• HDFS DDM Integration
• Dynamic resource utilization & prioritization
• Scale Spark History Server for 1000s of jobs
Streaming
• Zero Data Loss with Spark Streaming Resilience
• Flume integration
• Kafka integration
• SQL semantics for expressing streaming jobs (Business Users)
• New streaming specific API extensions
• Streaming application management (pause, update, redeploy) via CM
• Optimized state updates: efficient point lookups and delta updates

Spark Resources
Learn Spark:
• O’Reilly Advanced Analytics with Spark eBook (written by Clouderans)
• Cloudera Developer Blog
• cloudera.com/spark
Get Trained:
• Cloudera Spark Training
Try it Out:
• Cloudera Live Spark Tutorial
Try It With Cloudera Live
cloudera.com/live
Featuring tutorials on:
CDH

Thank You
Jairam [email protected]