using Apache Spark and Zeppelin -...

48
Big Data Visualization using Apache Spark and Zeppelin Prajod Vettiyattil, Software Architect, Wipro

Transcript of using Apache Spark and Zeppelin -...

Page 1: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Big Data Visualizationusing

Apache Spark and Zeppelin

Prajod Vettiyattil, Software Architect, Wipro

Page 2: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Agenda

• Big Data and Ecosystem tools

• Apache Spark

• Apache Zeppelin

• Data Visualization

• Combining Spark and Zeppelin

2

Page 3: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

BIG DATA AND ECOSYSTEM TOOLS

Page 4: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Big Data

• Data size beyond systems capability

– Terabyte, Petabyte, Exabyte

• Storage

– Commodity servers, RAID, SAN

• Processing

– In reasonable response time

– A challenge here

4

Page 5: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Server

Tradition processing tools

• Move what ?

– the data to the code or

– the code to the data

5

Data

Server

move data to code

move code to data

Code

Page 6: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Traditional processing tools

• Traditional tools

– RDBMS, DWH, BI

– High cost

– Difficult to scale beyond certain data size

• price/performance skew

• data variety not supported

6

Page 7: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Map-Reduce and NoSQL

• Hadoop toolset

– Free and open source

– Commodity hardware

– Scales to exabytes(1018), maybe even more

• Not only SQL

– Storage and query processing only

– Complements Hadoop toolset

– Volume, velocity and variety

7

Page 8: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

All is well ?

• Hadoop was designed for batch processing

• Disk based processing: slow

• Many tools to enhance Hadoop’s capabilities

– Distributed cache, Haloop, Hive, HBase

• Not for interactive and iterative

8

Page 9: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

TOWARDS SINGULARITY

Page 10: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

What is singularity ?

10

0

1000

2000

3000

4000

5000

6000

7000

8000

1 2 3 4 5 6 7

AI

cap

acit

y

Decade

Decade vs AI capacity

Point of singularity

Page 11: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Technological singularity

• When AI capability exceeds Human capacity

• AI or non-AI singularity

• 2045: http://en.wikipedia.org/wiki/Ray_Kurzweil

– The predicted year

11

Page 12: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

APACHE SPARK

Page 13: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

History of Spark

13

Spark 1.3.0 released

2015

MarchSpark 1.0.0

released. 100TB sort achieved in

23 mins

2014

Spark donated to Apache Software

Foundation

2013

Spark is made open source. Available on

github

2010

Spark created by PhD student

at UC Berkeley, Matei

Zaharia

2009

Page 14: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Contributors in Spark

• Yahoo

• Intel

• UC Berkeley

• …

• 50+ organizations

14

Page 15: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Hadoop and Spark

• Spark complements the Hadoop ecosystem

• Replaces: Hadoop MR

• Spark integrates with

– HDFS

– Hive

– HBase

– YARN

15

Page 16: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Other big data tools

• Spark also integrates with

– Kafka

– ZeroMQ

– Cassandra

– Mesos

16

Page 17: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Programming Spark

• Java

• Scala

• Python

• R

17

Page 18: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Spark toolset

18

Apache Spark

Spark

SQL

Spark

Streamin

g

MLlib GraphX

Spark R

Blink DB

Spark

Cassandra

Tachyon

Page 19: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

What is Spark for ?

19

Batch

Interactive Streaming

Page 20: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

The main difference: speed

• RAM access vs Disk access

– RAM access is 100,000 times faster !

20

https://gist.github.com/hellerbarde/2843375

Page 21: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Lambda Architecture pattern

• Used for Lambda architecture implementation

– Batch layer

– Speed layer

– Serving layer

21

Batch

Layer

Speed

Layer

Serving Layer

Data Input

Data

consumers

Page 22: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Worker Node

Executor

Deployment Architecture

22

Master Node

Executor

Task CacheTaskTask

Worker Node

ExecutorExecutorExecutor

HDFS

Data

Node

HDFS

Data

Node

TaskTaskTaskCache

Spark’s Cluster

Manager

Spark Driver

HDFS Name

Node

Page 23: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

APACHE ZEPPELIN

Page 24: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Interactive data analytics

• For Spark and Flink

• Web front end

• At the back, it connects to

– SQL systems(Eg: Hive)

– Spark

– Flink

24

Page 25: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Deployment Architecture

25

Spark / Flink /

Hive

Zeppelin daemon

Web

browser 1

Web

browser 2

Web

browser 3

Web Server

Local

Interpreters

Optional

Remote

Interpreters

Page 26: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Notebook

• Is where you do your data analysis

• Web UI REPL with pluggable interpreters

• Interpreters

– Scala, Python, Angular, SparkSQL, Markdown and Shell

26

Page 27: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

User Interface features

• Markdown

• Dynamic HTML generation

• Dynamic chart generation

• Screen sharing via websockets

27

Page 28: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

SQL Interpreter

• SQL shell

– Query spark data using SQL queries

– Return normal text, HTML or chart type results

28

Page 29: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Scala interpreter for Spark

• Similar to the Spark shell

• Upload your data into Spark

• Query the data sets(RDDs) in your Spark server

• Execute map-reduce tasks

• Actions on RDD

• Transformations on RDD

29

Page 30: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

DATA VISUALIZATION

Page 33: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

The need for visualization

33

Big DataDo

something to data

User gets

comprehensible

data

Page 34: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Tools for Data Presentation Architecture

34

1.Identify

2.Locate

3.Manipulate

4.Format

5.Present

A data analysis tool/toolset would support:

Page 35: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

COMBINING SPARK AND ZEPPELIN

Page 36: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Spark and Zeppelin

36

Spark

Worker

Node

Spark

Master

Node

Spark

Worker

Node

Zeppelin daemon

Web

browser 1

Web

browser 2

Web

browser 3

Web Server

Local

Interpreters

Remote

Interpreters

Page 37: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Zeppelin views: Table from SQL

37

Page 38: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Zeppelin views: Table from SQL

38

%sql select age, count(1) from bank where

marital="${marital=single,single|divorced|married}"

group by age order by age

Page 39: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Zeppelin views: Pie chart from SQL

39

Page 40: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Zeppelin views: Bar chart from SQL

40

Page 41: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Zeppelin views: Angular

41

Page 42: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Share variables: MVVM

• Between Scala/Python/Spark and Angular

• Observe scala variables from angular

42

Scala-Spark Angular

x = “foo” x = “bar”

Zeppelin

Page 43: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Screen sharing using Zeppelin

• Share your graphical reports

– Live sharing

– Get the share URL from zeppelin and share with others

– Uses websockets

• Embed live reports in web pages

43

Page 44: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

FUTURE

Page 45: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Spark and Zeppelin

• Spark

– Machine Learning using Spark

– GraphX and MLlib

• Zeppelin

– Additional interpreters

– Better graphics

– Report persistence

– More report templates

– Better angular integration45

Page 46: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

SUMMARY

Page 47: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod

Summary

• Spark and tools

• The need for visualization

• The role of Zeppelin

• Zeppelin – Spark integration

47

Page 48: using Apache Spark and Zeppelin - DeveloperMarchdevelopermarch.com/.../report/...SparkBigData_PrajodVettiyattil.pdf · Big Data Visualization using Apache Spark and Zeppelin Prajod