Spark Summit EU talk by Tug Grall

How Spark is Enabling the New Wave of Converged Applications
Tugdual Grall, MapR Technologies

Transcript of Spark Summit EU talk by Tug Grall

Page 1: Spark Summit EU talk by Tug Grall

How Spark is Enabling the New Wave of Converged Applications

Tugdual Grall MapR Technologies

Page 2: Spark Summit EU talk by Tug Grall

Decreasing Job Latencies

[Figure: job latency on a scale of Hours, Mins, Secs, Milli Secs, with the labels on-disk, in-memory, and Tipping Point.]

Page 3: Spark Summit EU talk by Tug Grall

Analytics & ETL: Batch or Continuous?

[Figure: two charts. First chart axes: Value of Data against Time since data is generated. Second chart axes: Value of Data against Volume of Data used for Analytics.]

It’s not an either/or; you have to do both.

Page 4: Spark Summit EU talk by Tug Grall

Why Stream Processing?

6:01 P.M.: 32°
6:02 P.M.: 32°
6:03 P.M.: 33°
6:04 P.M.: 36°
6:05 P.M.: 37°
6:06 P.M.: 36°
6:07 P.M.: 36°
6:08 P.M.: 35°
6:09 P.M.: 35°
6:10 P.M.: 35°
6:11 P.M.: 35°
6:12 P.M.: 35°
6:13 P.M.: 35°

37°

It was hot at 6:05 yesterday!

Batch processing may be too late for some events
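To make the point concrete, here is a minimal sketch of the batch view in plain Python (not Spark). The `readings` list copies the values from this slide; `hottest_reading` is a hypothetical helper introduced only for illustration. A batch job sees the complete data set, so it can only report the peak after the fact.

```python
# Readings copied from the slide: (time, temperature in degrees).
readings = [
    ("6:01 P.M.", 32), ("6:02 P.M.", 32), ("6:03 P.M.", 33),
    ("6:04 P.M.", 36), ("6:05 P.M.", 37), ("6:06 P.M.", 36),
    ("6:07 P.M.", 36), ("6:08 P.M.", 35), ("6:09 P.M.", 35),
    ("6:10 P.M.", 35), ("6:11 P.M.", 35), ("6:12 P.M.", 35),
    ("6:13 P.M.", 35),
]


def hottest_reading(batch):
    """Return the (time, temperature) pair with the highest temperature."""
    return max(batch, key=lambda reading: reading[1])


# The batch runs once all the data is stored, so the answer arrives late.
when, degrees = hottest_reading(readings)
print(f"It was hot at {when} yesterday: {degrees} degrees")
```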

Page 5: Spark Summit EU talk by Tug Grall

Why Stream Processing?

It’s becoming important to process events as they arrive

[Diagram labels: Stream; Topic; Temperature. Event shown: 6:05 P.M.: 37°]

Turn on the air conditioning!
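The idea of reacting to each event on arrival could be sketched as follows, in plain Python rather than Spark Streaming. The threshold is an assumption (the slide does not state one), and `temperature_stream` and `react` are hypothetical names; the generator stands in for a consumer reading a "Temperature" topic.

```python
from typing import Iterable, Iterator, Tuple

# Assumption: the slide gives no threshold; 37 is chosen for illustration.
ALERT_THRESHOLD = 37


def temperature_stream() -> Iterator[Tuple[str, int]]:
    """Yield events one at a time, standing in for a 'Temperature' topic."""
    yield from [
        ("6:03 P.M.", 33),
        ("6:04 P.M.", 36),
        ("6:05 P.M.", 37),
        ("6:06 P.M.", 36),
    ]


def react(events: Iterable[Tuple[str, int]]) -> Iterator[str]:
    """Inspect each event as it arrives and emit an action immediately."""
    for when, degrees in events:
        if degrees >= ALERT_THRESHOLD:
            yield f"{when}: turn on the air conditioning"


for action in react(temperature_stream()):
    print(action)
```

Because `react` is a generator, the action is produced as soon as the triggering event is read, without waiting for the rest of the stream.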

Page 6: Spark Summit EU talk by Tug Grall

Advanced Analytics

Descriptive
● What happened
● Why did it happen
● Discovery in nature
● Batch analytics

Predictive
● What will happen
● Combines historical data with rules and algorithms
● ML (Batch + Real Time)

Prescriptive
● What + When + Why
● Suggestions to take advantage of future opportunity or mitigate risks

Streaming
● Analyse data as it happens
● Triggers and alarms
● Anomaly detection
● Continuous ETL and analytics

● Agility is key to success.

There is a need to converge these analytics
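As one possible illustration of the "triggers and alarms" and "anomaly detection" items in the streaming category, the sketch below flags a value that strays too far from the mean of the previous few values. This is a deliberately simple rolling-window rule in plain Python, not a method taken from the talk; `detect_anomalies`, `window`, and `tolerance` are hypothetical names and defaults.

```python
from collections import deque
from statistics import mean


def detect_anomalies(values, window=3, tolerance=2.0):
    """Yield (index, value) for values far from the mean of the previous window."""
    recent = deque(maxlen=window)
    for index, value in enumerate(values):
        # Only judge a value once a full window of history is available.
        if len(recent) == window and abs(value - mean(recent)) > tolerance:
            yield index, value
        recent.append(value)


temperatures = [32, 32, 33, 36, 37, 36, 36, 35]
print(list(detect_anomalies(temperatures)))
```

The same function works on a stored list (batch) or on a generator fed by a live stream, which is the kind of reuse the convergence argument points at.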

Page 7: Spark Summit EU talk by Tug Grall

Converged Computing

              Offline       Real Time
Programmatic  Spark & ML    Spark Streaming
SQL           Spark SQL     Spark Structured Streaming

Page 8: Spark Summit EU talk by Tug Grall

The Many “Convergences” In Progress

CONVERGENCE

On Prem & Cloud

Analytics & Operations

Data at Rest & Data in Motion

Storage & Compute

Files, Tables, Stream data

Page 9: Spark Summit EU talk by Tug Grall

Spark on Non-Converged Platform

[Diagram: three separate clusters, labelled Cluster 1, Cluster 2, and Cluster 3. Components shown: Real-Time Producers; Kafka Cluster with Topics; NoSQL Database; Hadoop/S3 Storage; Advanced Analytics; Real-time dashboards. Each cluster carries its own Management, Monitoring, and Security.]

• Redundant 3x Management, Monitoring and Security
• Redundant 3x Data Storage

Page 10: Spark Summit EU talk by Tug Grall

Converged Computing & Converged Data Management

Page 11: Spark Summit EU talk by Tug Grall

MapR Converged Data Platform

[Diagram: platform stack. Labels recovered from the slide:]

Open Source Engines & Tools; Commercial Engines & Applications; Custom Apps; Cloud and Managed Services; Search and Others
Data Processing
APIs: HDFS API; POSIX, NFS; HBase API; JSON API; Kafka API
Web-Scale Storage: MapR-FS
Database: MapR-DB
Event Streaming: MapR Streams
Enterprise-Grade Platform Services: Real Time; Unified Security; Multi-tenancy; Disaster Recovery; Global Namespace; High Availability
Unified Management and Monitoring

Page 12: Spark Summit EU talk by Tug Grall

SAMPLE CUSTOMER USE CASE

Page 13: Spark Summit EU talk by Tug Grall

Customer 360 & Behavior prediction

[Diagram labels: Website Click-Stream; Internal Data Sources; External Data Sources; Support Tickets; DBMS; Email; CRM; Topic (x4); Datalake/DataHub; Real Time/Offline ClickStream Analysis; Offline; Real Time; HA, DR, NFS, Snapshots, Data Protection]

● Prediction Modelling
● Attribution Modelling
● Cohort Analysis
● Customer Lifetime Value
● Attrition Modelling
● Response Modelling
● Churn Modelling

Eliminate latency due to data movement between clusters

Eliminate redundant storage with MapR Streams and lower the TCO

360 Degree Customer View

Customer Behavior Prediction

Better Conversion Rate and Lower Attrition $$$

Page 14: Spark Summit EU talk by Tug Grall

STREAMING FIRST ARCHITECTURE

Page 15: Spark Summit EU talk by Tug Grall

What Exactly Do We Need to Do?

[Diagram: stages labelled Data Sources, Collect Data, Process Data, Store Data, Serve Data. Other labels: Stream; Topic; NFS/POSIX]

Page 16: Spark Summit EU talk by Tug Grall

Trinity of Real Time

[Diagram labels: Real-Time Producers; Topic (x2); Global Messaging System; Transformational Tier; Operational NoSQL/Document Database; Real Time Analytics]
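The flow between the tiers named on this slide can be sketched with in-memory stand-ins. This is only an illustration under stated assumptions: a `queue.Queue` plays the role of a topic in the messaging system, a `dict` plays the role of the operational NoSQL/document database, and the function names, sensor IDs, and enrichment rule are all hypothetical. A real deployment would use the Kafka API and a document database client instead.

```python
import json
import queue

# Hypothetical stand-ins: a queue for one topic in the messaging system and
# a dict for the operational NoSQL/document database.
topic = queue.Queue()
document_db = {}


def produce(sensor_id, degrees):
    """Real-time producer: publish a JSON-encoded event to the topic."""
    topic.put(json.dumps({"sensor": sensor_id, "degrees": degrees}))


def transform_and_store():
    """Transformational tier: drain the topic, enrich each event, upsert it."""
    while not topic.empty():
        event = json.loads(topic.get())
        event["hot"] = event["degrees"] >= 36  # illustrative enrichment rule
        document_db[event["sensor"]] = event   # keep the latest state per sensor


produce("s1", 35)
produce("s2", 37)
produce("s1", 36)
transform_and_store()

# Real-time analytics: query the operational store for the current state.
hot_sensors = sorted(s for s, doc in document_db.items() if doc["hot"])
print(hot_sensors)
```

Keeping the producer, the transformation step, and the store decoupled through the topic is what lets each tier scale or fail independently.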

Page 17: Spark Summit EU talk by Tug Grall

Continuous Streaming ETL & Computed Analytics

[Diagram labels: DB; Application; Topic (x4)]

● 60 events/sec
● 10 MB/event
● Table-based topics
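For a sense of scale, the two figures above imply the following ingest rate. The arithmetic assumes decimal units and a constant event rate over a full day, neither of which the slide states.

```python
EVENTS_PER_SECOND = 60  # from the slide
MB_PER_EVENT = 10       # from the slide
SECONDS_PER_DAY = 24 * 60 * 60

mb_per_second = EVENTS_PER_SECOND * MB_PER_EVENT
# Assumption: decimal units (1 TB = 1,000,000 MB) and a constant rate all day.
tb_per_day = mb_per_second * SECONDS_PER_DAY / 1_000_000

print(f"{mb_per_second} MB/s, about {tb_per_day:.2f} TB/day")
```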

[Diagram labels: Search Application; Multi-Tier Data Archival; Level 1 Aggregates; Level 2 Aggregates; Level 3 Aggregates; Delta Aggregates; Pre-Computed; On-Demand; Advanced ML Analytics]

Pre-compute analytics with Spark Streaming on Data-in-motion
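One way to picture pre-computed, multi-level aggregates that are updated by deltas is the sketch below. It is plain Python, not Spark Streaming, and it rests on an assumption: the slide does not define the three levels, so minute, hour, and day roll-ups are used purely for illustration. `apply_delta` and the sample events are hypothetical.

```python
from collections import defaultdict

# Assumption: the slide does not define the three levels; minute, hour and
# day roll-ups are used here purely for illustration.
level1 = defaultdict(float)  # per minute
level2 = defaultdict(float)  # per hour
level3 = defaultdict(float)  # per day


def apply_delta(timestamp, amount):
    """Fold one event into every level as it arrives (a delta update)."""
    day, clock = timestamp.split(" ")
    hour = clock.split(":")[0]
    level1[f"{day} {clock}"] += amount
    level2[f"{day} {hour}"] += amount
    level3[day] += amount


# Illustrative events: (timestamp, amount).
events = [
    ("2016-01-01 18:04", 2.0),
    ("2016-01-01 18:05", 3.0),
    ("2016-01-01 19:00", 1.5),
]
for ts, amount in events:
    apply_delta(ts, amount)

# Queries read the pre-computed totals instead of rescanning raw events.
print(level2["2016-01-01 18"], level3["2016-01-01"])
```

Each event touches a fixed number of counters, so the cost of keeping the aggregates fresh does not grow with the amount of history already stored.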

Page 18: Spark Summit EU talk by Tug Grall

Q&A

1. Read explanation of and download code
   – https://www.mapr.com/blog/fast-scalable-streaming-applications-mapr-streams-spark-streaming-and-mapr-db
   – https://www.mapr.com/blog/spark-streaming-hbase
2. Get Started: MapR Converged Data Platform https://www.mapr.com/get-started-with-mapr
3. Get Answers: MapR Converge Community https://community.mapr.com/community/answers
4. Get Trained: MapR On-Demand Training https://learn.mapr.com

Engage with us!

Page 19: Spark Summit EU talk by Tug Grall

THANK YOU.