Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Big Data 2.0 HOW SPARK TECHNOLOGIES ARE RESHAPING THE WORLD OF BIG DATA ANALYTICS Presented By: Lillian Pierson, P.E.

Transcript of Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Page 1: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Big Data 2.0

HOW SPARK TECHNOLOGIES ARE RESHAPING THE WORLD OF BIG DATA ANALYTICS

Presented By: Lillian Pierson, P.E.

Page 2: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Today’s webinar

Apache Spark: Journey from “Hadoop ecosystem component” to “Big Data platform”

The story of how Spark began

Is Spark a data engineering or data science platform?

Who is using Spark and for what?

Got Spark skills? Here’s why you should

Page 3: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Apache Spark

JOURNEY FROM “HADOOP ECOSYSTEM COMPONENT” TO “BIG DATA PLATFORM”

Page 4: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

What is Spark?

Page 5: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Why in-memory applications?

“In-memory computing appliances are … faster than the traditional Hadoop system because in-memory appliances don’t use MapReduce… By storing data in memory, in-memory appliances are able to bypass the time-consuming disk accesses that are required as part of the map and reduce operations that comprise the MapReduce process. In-memory data storage, processing, and analysis is fast enough to generate data analytics in real-time, derived from streaming data sources.” – Excerpt from my book: Big Data/Hadoop for Dummies
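The disk-versus-memory contrast in the excerpt can be illustrated with a small, plain-Python sketch. It is not Spark or MapReduce code. It only shows the access pattern: an iterative job that re-reads its input from disk on every pass, versus one that loads the data once and reuses it from memory. The function names and the JSON-lines sample file are assumptions made for illustration.

```python
import json
import os
import tempfile


def write_sample(path, n):
    """Write n small JSON records, one per line (hypothetical sample data)."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(json.dumps({"value": i}) + "\n")


def iterate_from_disk(path, passes):
    """Disk-bound pattern: re-read and re-parse the file on every pass."""
    disk_reads = 0
    total = 0
    for _ in range(passes):
        with open(path) as f:
            records = [json.loads(line) for line in f]
        disk_reads += 1
        total += sum(r["value"] for r in records)
    return total, disk_reads


def iterate_in_memory(path, passes):
    """In-memory pattern: read the file once, then reuse the parsed records."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    disk_reads = 1
    total = 0
    for _ in range(passes):
        total += sum(r["value"] for r in records)
    return total, disk_reads


if __name__ == "__main__":
    sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
    write_sample(sample_path, 1000)
    print(iterate_from_disk(sample_path, passes=5))
    print(iterate_in_memory(sample_path, passes=5))
```

Both functions return the same total. Only the number of disk reads differs, and that gap grows with the number of passes, which is why iterative workloads such as machine learning benefit most from keeping data in memory.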

Page 6: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

From Hadoop ecosystem component…

HDFS

MapReduce 2.0

YARN

Page 7: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

From Hadoop ecosystem component…

HDFS

Spark

MapReduce 2.0

YARN

Page 8: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

To big data platform

HDFS

MapReduce 2.0

Spark YARN

Page 9: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

To big data platform

Spark-as-a-Service

Page 10: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Spark’s 4 submodules

◦ Spark SQL

◦ MLlib

◦ GraphX

◦ Streaming

Page 11: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Spark SQL module

DataFrames

Spark SQL

◦ SQL

Hive

◦ HiveQL

◦ Spark Processing Engine

Page 12: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

MLlib module

Data analysis

Statistics

Machine learning

Page 13: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

GraphX module

Graph data storage and processing

GraphX

◦ In-memory graph data processing

HDFS

◦ Graph data storage

Page 14: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Streaming module

Continuously streaming data

Discretized Streams (DStreams)

Micro-batch processing

Page 15: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

DStreams and micro-batch architecture

Source: http://www.slideshare.net/skpabba/hadoop-and-spark

RDD @ time 1 RDD @ time 2 RDD @ time 3
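The micro-batch idea on this slide can be sketched in plain Python, without Spark. A continuous stream of timestamped events is cut into fixed-width time intervals, and each interval's events form one batch, analogous to "RDD @ time 1", "RDD @ time 2", and so on. The function name `micro_batches` and the sample events are assumptions for illustration. Unlike Spark Streaming, this sketch emits nothing for intervals that receive no events.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-width batches.

    Returns a list of (batch_start_time, [values]) pairs in time order,
    one pair per interval that received at least one event.
    """
    buckets = defaultdict(list)
    for timestamp, value in events:
        # Floor the timestamp to the start of its interval.
        batch_start = (timestamp // batch_interval) * batch_interval
        buckets[batch_start].append(value)
    return sorted(buckets.items())


if __name__ == "__main__":
    # Hypothetical events: (seconds since start, payload)
    sample_events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
    for start, values in micro_batches(sample_events, batch_interval=1):
        print(f"batch @ time {start}: {values}")
```

The batch interval is the main tuning knob: a shorter interval lowers latency, while a longer one gives each batch more data to amortize scheduling overhead.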

Page 16: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Basic Spark Architecture

Spark SQL MLlib GraphX Streaming

Physical Hardware

Data Storage Layer (HDFS)

Resource Manager (YARN)

Spark Core Libraries

Single Abstraction Layer

Processing Processing Processing Processing

Page 17: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Changes with Spark 2.0

Spark 1.0

◦ RDD API

Spark 1.3

◦ RDD API

◦ DataFrame API

Spark 1.6

◦ RDD API

◦ DataFrame API

◦ Dataset API

Spark 2.0

◦ Dataset API

◦ DataFrame API

◦ RDD API

Page 18: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Changes with Spark 2.0

Spark 1.0

◦ RDD API

Spark 2.0

◦ Dataset API

◦ DataFrame API

◦ RDD API

Page 19: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Changes with Spark 2.0

Structured Stream Processing

DataFrame API

Dataset API

Page 20: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

The story of how Spark began

Page 21: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Taking things from the beginning…

2009

Mesos

UC Berkeley

Interactive, iterative parallel processing (in-memory)

◦ Machine learning requirements

Integrates with Hadoop ecosystem

Dr. Ion Stoica, Computer Science Professor, UC Berkeley

Page 22: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Databricks… the cutting edge of Spark

Delivers Apache Spark-as-a-Service

Most popular solution for deploying Spark on the cloud

Dr. Ion Stoica, Executive Chairman, Databricks

Page 23: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Databricks… the cutting edge of Spark

Spark on an as-needed basis

Automates

◦ Cluster building and configuration

◦ Security

◦ Process monitoring

◦ Resource monitoring

Notebooks

◦ For data analysis and machine learning using Python, R, and Scala

Data visualization capabilities

◦ Data visualization and dashboard design options

Page 24: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Is Spark a data engineering or data science platform?

DATA ENGINEERING COMPONENTS AND TECHNOLOGIES

DATA SCIENCE COMPONENTS AND TECHNOLOGIES

Page 25: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Spark’s data engineering elements

Automate cluster sizing and configuration requirements

Data Storage: HDFS

Resource Management:

◦ Spark Standalone

◦ Apache Mesos

◦ Hadoop YARN

Page 26: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Spark’s data engineering elements

Spark Streaming submodule: reuse the same code you use for batch processing, but get real-time results!

Integrates with big data sources, like:

◦ HDFS

◦ Flume

◦ Kafka

◦ Twitter

◦ ZeroMQ
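The "reuse the same code" point can be sketched in plain Python, without Spark. One function holds the processing logic, and it is called once over the full dataset in batch mode or once per micro-batch in streaming mode. The names `word_counts`, `run_batch`, and `run_streaming` are hypothetical and do not correspond to any Spark API.

```python
from collections import Counter


def word_counts(lines):
    """Shared logic: count words in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def run_batch(all_lines):
    """Batch mode: process the complete dataset in one call."""
    return word_counts(all_lines)


def run_streaming(batches):
    """Streaming mode: apply the same function to each micro-batch
    and yield a copy of the running total after every batch."""
    running = Counter()
    for batch in batches:
        running += word_counts(batch)
        yield Counter(running)


if __name__ == "__main__":
    lines = ["spark streaming", "spark sql", "kafka spark"]
    print(run_batch(lines))
    for snapshot in run_streaming([lines[:1], lines[1:]]):
        print(snapshot)
```

Once every micro-batch has been processed, the streaming total matches the batch result, while intermediate snapshots are available as soon as each batch arrives.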

Page 27: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Doing data science with Spark

Useful for machine learning and analysis of big data

Build big data analytics products

Programmable in Python, R, Scala, and SQL

Submodules:

◦ SQL and DataFrames

◦ MLlib for machine learning

◦ GraphX for in-memory big (graph) data computations

Page 28: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Doing data science with Spark

Spark integrates with the following data sources and formats:

◦ Hive, Avro, Parquet, CSV, JSON, JDBC, and HBase

◦ BI tools: Tableau, Qlik, Zoomdata, etc. (through JDBC)

Page 29: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Who is using Spark and for what?

AUTOMATIC LABS

LENDUP

SELLPOINTS

FINDIFY

Page 30: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Automatic Labs on Databricks

Making cars smarter with real-time analytics

Connect to, and make smart use of, your car’s data

Page 31: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Automatic Labs on Databricks

Automatic apps do things like:

◦ Decoding engine problems

◦ Locating parked cars

◦ Crash detection and response

◦ Low fuel warnings, etc.

Automatic is using Spark to make cars smarter with real-time analytics

During product development, Automatic needs to query, explore, and visualize large amounts of data, QUICKLY. By moving this work over to Spark, Automatic was able to:

◦ Validate products in days, not weeks

◦ Complete complex queries in minutes

◦ Free up 1 full-time data scientist

◦ Save $10K/month on infrastructure costs

Page 32: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

LendUp on Databricks

Improving the lending process and experience

“Moving up the LendUp Ladder means earning access to more money, at better rates, for longer periods of time” - LendUp

Page 33: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

LendUp on Databricks

LendUp uses Spark for:

◦ Feature engineering at scale

◦ Fast model building and testing

By using Spark to do this work, LendUp is able to:

◦ Build more accurate models, faster

◦ Offer more lines of credit

◦ Develop new products more quickly

◦ Increase in-house productivity of data science team

Page 34: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Sellpoints on Databricks

Increasing ROI on ad spend

Page 35: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Sellpoints on Databricks

Increasing ROI on ad spend

Sellpoints offers services in:

◦ Identifying qualified shoppers

◦ Driving traffic

◦ Increasing sales conversion

By moving to Databricks, Sellpoints was able to:

◦ Productize a new predictive analytics offering, improving the ad spend ROI threefold compared to competitive offerings.

◦ Reduce the time and effort required to deliver actionable insights to the business team while lowering costs.

◦ Improve productivity of the engineering and data science team by eliminating the time spent on DevOps and maintaining open source software.

Page 36: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Findify on Databricks

Improving shopping experience for ecommerce customers

Uses machine learning to continually improve search accuracy

Page 37: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Findify on Databricks

Improving shopping experience for ecommerce customers

By moving to Databricks, Findify was able to:

◦ Focus on development instead of infrastructure, allowing them to complete their feature development projects faster and reduce customer frustration over delayed analytics

◦ Focus on building innovative features - because the managed Spark platform eliminated time spent on DevOps and infrastructure issues.

Uses machine learning to continually improve search accuracy

Page 38: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Got Spark skills? Here’s why you should

IMPACT ON SALARY

TRAINING ISSUES AND OPPORTUNITIES

Page 39: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

How much do Spark skills pay?

Source: 2015 Data Science Salary Survey, by O’Reilly

[Bar chart: Annual Salary Increase]

◦ Spark Skills: $11,000

◦ Scala Programming: $4,000

◦ Basic Exploratory Analysis (>4 hr/wk): $4,600

◦ D3.js Skills: $8,000

Page 40: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Getting training and experience in Spark

$149.50

Sale until March 30 only

Discount code: ‘SPRING50’

Page 41: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Getting training and experience in Spark

Get hands-on training in the following areas:

◦ Using RDD

◦ Writing applications using Scala

◦ Spark SQL

◦ Spark Streaming

◦ Machine Learning in Spark (MLlib)

◦ Spark GraphX

◦ Spark Project Implementation

Page 43: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Download these slides

Page 44: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

Why Data Science From Simplilearn

Key Features

◦ 40 hours of real-life industry project experience

◦ 25 hours of high-quality e-learning

◦ Visualize and optimize data effectively using the built-in tools in R, SAS, and Excel

◦ 48 hours of live instructor-led online sessions

◦ Get proficient in using R, SAS, and Excel to model data and predict solutions to business problems

◦ Master the concepts of statistical analysis like linear and logistic regression, cluster analysis, and forecasting

Page 45: Big Data 2.0 - How Spark technologies are reshaping the world of big data analytics

OUR JOURNEY SO FAR

Simplilearn: World’s Largest Certification Training Destination

One of the largest collections of accredited certification training in the world.

◦ Project Management

◦ Digital Marketing

◦ Big Data & Analytics

◦ Business Productivity Tools

◦ Quality Management

◦ Virtualization and Cloud Computing

◦ IT Security

◦ Financial Management

◦ CompTIA Certification

◦ IT Hardware and N/W

◦ ERP

◦ IT Services and Architecture

◦ Agile and Scrum Certification

◦ OS and Database

◦ Web and App Programming

[Timeline graphic: 2010 to 2016]