Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin

Robert Hryniewicz, Data Evangelist, @RobertH8z

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda

• Quick Demo
• Spark Overview
• Zeppelin + HDP
• Lab ~ 1hr
• Spark 2.0
• Q/A

“Big Data”

Where does “Big Data” come from?

Internet of Anything (IoAT)
– Wind Turbines, Oil Rigs, Cars
– Weather Stations, Smart Grids
– RFID Tags, Beacons, Wearables

User Generated Content (Web & Mobile)
– Twitter, Facebook, Snapchat, YouTube
– Clickstream, Ads, User Engagement
– Payments: PayPal, Venmo

44 ZB in 2020

The “Big Data” Problem

Problem: A single machine cannot process or even store all the data!

Solution: Distribute data over large clusters

Difficulty:
• How to split work across machines?
• Moving data over network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?

Spark Background

History of Hadoop & Spark

Access Rates

At least an order of magnitude difference between memory and hard drive / network speed

(Diagram labels: FAST, slower, slowest)

What is Spark?

Apache Open Source Project
– originally developed at AMPLab (University of California, Berkeley)

Data Processing Engine
– In-memory computation
– FAST!

Elegant Developer-friendly APIs
– Supports: Scala, Python, Java and R
– Single environment for Data Wrangling, Machine Learning (ML), SQL Queries, Streaming Apps

Spark Ecosystem

Libraries built on Spark Core: Spark SQL, Spark Streaming, Spark MLlib, GraphX

Apache Spark Basics

Spark Context

What is it?
• Main entry point for Spark functionality
• Represents a connection to a Spark cluster
• Represented as sc in your code
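
As a minimal sketch of what sc gives you, the snippet below obtains a SparkContext and runs one small distributed computation. The app name, the local[*] master and the sample numbers are assumptions for illustration only. In Zeppelin or spark-shell, sc is already created for you, in which case getOrCreate simply returns the running context.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed settings for a local test run; on a cluster the master is usually supplied externally.
val conf = new SparkConf().setAppName("sc-sketch").setMaster("local[*]")

// Returns the already-running context (e.g. the one Zeppelin provides) or creates a new one.
val sc = SparkContext.getOrCreate(conf)

// Distribute a local collection across the cluster as an RDD.
val numbers = sc.parallelize(1 to 100)

// filter is a lazy transformation; reduce is the action that actually triggers the job.
val sumOfEvens = numbers.filter(_ % 2 == 0).reduce(_ + _)

println(s"Sum of even numbers: $sumOfEvens")
```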

Spark SQL

Spark SQL Overview

Spark module for structured data processing (e.g. DB tables, JSON files)

Three ways to manipulate data:
– DataFrames API
– SQL queries
– Datasets API

DataFrames

• Distributed collection of data organized into named columns
• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
– rows, columns, and schema
• API available in Scala, Java, Python, and R

DataFrames: Created from Various Sources

(Diagram: CSV, Avro, HIVE and Text sources are read through Spark SQL into a DataFrame made of rows and columns Col1, Col2, …, ColN)

DataFrames from HIVE:
– Reading and writing HIVE tables

DataFrames from files:
– Built-in: JSON, JDBC, ORC, Parquet, HDFS
– External plug-in: CSV, HBASE, Avro

Data is described as a DataFrame with rows, columns and a schema
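
To make the "built-in sources" point concrete, here is a hedged sketch that loads a small JSON file into a DataFrame using the Spark 1.6-style SQLContext covered in these slides. The temporary file, its contents and the app name are made up for illustration, and the explicit file: URI assumes local mode or a single-node sandbox where the driver's file system is visible to the executors.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("df-sources-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Write a tiny JSON-lines file (one JSON object per line) so the example is self-contained.
val jsonPath = Files.createTempFile("flights", ".json")
val sample =
  """{"Origin":"IAD","Dest":"TPA","DepDelay":8}
    |{"Origin":"IND","Dest":"BWI","DepDelay":34}""".stripMargin
Files.write(jsonPath, sample.getBytes(StandardCharsets.UTF_8))

// Built-in JSON source: the schema (column names and types) is inferred from the data.
val flights = sqlContext.read.json(jsonPath.toUri.toString)

flights.printSchema()
flights.show()
```

Other built-in formats follow the same read pattern, for example sqlContext.read.parquet(path) or sqlContext.read.orc(path).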

SQL Context and Hive Context

SQLContext
• Entry point into all functionality in Spark SQL
• All you need is a SparkContext

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

HiveContext
• Superset of functionality provided by the basic SQLContext
– Read data from Hive tables
– Access to Hive functions (UDFs)
• Use when your data resides in Hive

import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)

Spark SQL Examples

DataFrame Example

Reading Data From Table

val df = sqlContext.table("flightsTbl")

df.select("Origin", "Dest", "DepDelay").show(5)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|       8|
|   IAD| TPA|      19|
|   IND| BWI|       8|
|   IND| BWI|      -4|
|   IND| BWI|      34|
+------+----+--------+

DataFrame Example

Using DataFrame API to Filter Data (show delays more than 15 min)

// The $"column" syntax needs the SQLContext implicits in scope
import sqlContext.implicits._

df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

SQL Example

Using SQL to Query and Filter Data (again, show delays more than 15 min)

// Register Temporary Table
df.registerTempTable("flights")

// Use SQL to Query Dataset
sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+

Spark Streaming

Spark Streaming

Overview
• Extension of Spark Core API
• Stream processing of live data streams
– Scalable
– High-throughput
– Fault-tolerant

(Diagram: streaming input sources, including ZeroMQ and MQTT)

Spark Streaming

Spark Streaming

Window Operations

Apply transformations over a sliding window of data, e.g. rolling average
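
The sliding-window idea can be sketched on an ordinary local collection. This is plain Scala, not the Spark Streaming API: the function name, the readings and the window size are hypothetical. In Spark Streaming the analogous DStream window operation takes a window length and a slide interval expressed as durations rather than element counts.

```scala
// Rolling average over a sliding window, shown on a local collection.
// windowSize = number of readings per window, slide = how far the window advances each step.
def rollingAverage(readings: Seq[Double], windowSize: Int, slide: Int = 1): Seq[Double] =
  readings
    .sliding(windowSize, slide)
    .filter(_.size == windowSize) // drop a trailing partial window
    .map(window => window.sum / window.size)
    .toList

// Hypothetical sensor readings, one per batch interval.
val readings = Seq(10.0, 20.0, 30.0, 40.0, 50.0)

// Each output value averages the three most recent readings.
val averages = rollingAverage(readings, windowSize = 3)
println(averages.mkString(", "))
```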

Spark MLlib

Where Can We Use Data Science / Machine Learning

Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates

Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens

Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security

Retail
• Product recommendation
• Inventory management
• Price optimization

Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis

Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels

Spark ML: Spark API for building ML pipelines

(Diagram: a Pipeline chains Feature transform 1, Feature transform 2, Combine features and LinearRegression.
Train: Input DataFrame → Pipeline → Pipeline Model.
Predict: Input DataFrame → Pipeline Model → Output DataFrame.
The Pipeline Model can be exported.)
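
A pipeline with the stages named on this slide might look like the sketch below. The column names, the sample rows (including the DepHour values) and the choice of StringIndexer for the two feature transforms are assumptions made for illustration; they are not taken from the lab.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SQLContext

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("ml-pipeline-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Hypothetical training data: (Origin, Dest, DepHour, DepDelay).
val training = sqlContext.createDataFrame(Seq(
  ("IAD", "TPA", 7.0, 8.0),
  ("IAD", "TPA", 18.0, 19.0),
  ("IND", "BWI", 6.0, 8.0),
  ("IND", "BWI", 9.0, -4.0),
  ("IND", "BWI", 17.0, 34.0),
  ("IND", "JAX", 16.0, 25.0)
)).toDF("Origin", "Dest", "DepHour", "DepDelay")

// Feature transforms 1 and 2: map airport codes to numeric indices.
val originIndexer = new StringIndexer().setInputCol("Origin").setOutputCol("OriginIdx")
val destIndexer = new StringIndexer().setInputCol("Dest").setOutputCol("DestIdx")

// Combine features into the single vector column expected by the estimator.
val assembler = new VectorAssembler()
  .setInputCols(Array("OriginIdx", "DestIdx", "DepHour"))
  .setOutputCol("features")

// A small regularization term keeps the fit stable on this tiny sample.
val lr = new LinearRegression()
  .setLabelCol("DepDelay")
  .setFeaturesCol("features")
  .setRegParam(0.1)

// Train: fitting the Pipeline on a DataFrame yields a PipelineModel.
val pipeline = new Pipeline().setStages(Array(originIndexer, destIndexer, assembler, lr))
val model = pipeline.fit(training)

// Predict: the PipelineModel turns an input DataFrame into an output DataFrame
// that carries an extra "prediction" column.
val predictions = model.transform(training)
predictions.select("Origin", "Dest", "DepDelay", "prediction").show()
```

Because every stage is fitted and applied through the one Pipeline object, the same transformations are guaranteed to run at training and prediction time.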

Spark GraphX

GraphX

• PageRank
• Topic Modeling (LDA)
• Community Detection

Source: ampcamp.berkeley.edu
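
As a rough illustration of the first item, the sketch below runs GraphX PageRank on a four-page link graph. The edge list, app name and tolerance are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.Graph

val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("graphx-pagerank-sketch").setMaster("local[*]"))

// Hypothetical link graph: each pair (src, dst) means "page src links to page dst".
val links = sc.parallelize(Seq((1L, 2L), (2L, 3L), (3L, 1L), (4L, 1L)))

// Build a graph from the edge list; every vertex gets the placeholder attribute 1.
val graph = Graph.fromEdgeTuples(links, 1)

// Run PageRank until the ranks change by less than the given tolerance.
val ranks = graph.pageRank(0.0001).vertices

// Print pages ordered by rank, highest first.
ranks.collect().sortBy { case (_, rank) => -rank }.foreach {
  case (id, rank) => println(f"page $id%d: $rank%.3f")
}
```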

Apache Zeppelin & HDP Sandbox

What’s Apache Zeppelin?

Web-based notebook that enables interactive data analytics.

You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

What is a Note/Notebook?

• A web-based GUI for small code snippets
• Write code snippets in browser
• Zeppelin sends code to backend for execution
• Zeppelin gets data back from backend
• Zeppelin visualizes data
• Zeppelin Note = Set of (Paragraphs/Cells)
• Other Features - Sharing/Collaboration/Reports/Import/Export

How does Zeppelin work?

(Diagram: the notebook author and collaborators/report viewers work in Zeppelin, which connects to a cluster: Spark | Hive | HBase, any of 30+ back ends)

Big Data Lifecycle

Stages: Collect, ETL / Process, Analysis, Report, Data Product

Roles: Data Engineer, Data Scientist, Business user, Customer

HDP Sandbox

What’s included in the Sandbox?
• Zeppelin
• Latest Hortonworks Data Platform (HDP)
– Spark
– YARN Resource Management
– HDFS Distributed Storage Layer
– And many more components...

(Diagram: Scala, Java, Python and R APIs on top of the Spark Core Engine with Spark SQL, Spark Streaming, MLlib and GraphX, running on YARN over HDFS nodes 1 … N)

There’s more to HDP

Hortonworks Data Platform 2.4.x

DATA MANAGEMENT
– HDFS: Hadoop Distributed File System
– YARN: Data Operating System (nodes 1 … N)

DATA ACCESS
– Batch: MapReduce
– Script: Pig
– SQL: Hive
– NoSQL: HBase, Accumulo, Phoenix
– Stream: Storm
– Search: Solr
– In-memory
– Others: ISV Engines
– Tez, Slider

GOVERNANCE & INTEGRATION
– Data Lifecycle & Governance: Falcon, Atlas
– Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS

SECURITY
– Administration, Authentication, Authorization, Auditing, Data Protection: Ranger, Knox, Atlas, HDFS Encryption

OPERATIONS
– Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
– Scheduling: Oozie

Deployment Choice: Linux, Windows, On-Premise, Cloud

Spark 2.0

What’s New in Spark 2.0

API Unification
– DataFrame is an alias for Dataset[Row]
– SparkSession (%spark) replaces SparkContext, SQLContext, and HiveContext
• spark is the new entry point to all Spark features

Structured Streaming
– DataFrame/Dataset for manipulating stream data
– Real-time incremental processing
– Attempt to unify streaming, interactive, and batch processing

Performance Improvements
– Tungsten - “bare metal” code generation
– ORC & Parquet file formats
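
A minimal sketch of the unified entry point, assuming Spark 2.x on the classpath. The app name, the local[*] master and the sample rows are illustrative; in Zeppelin or spark-shell on Spark 2.x a session named spark is already provided, and getOrCreate returns it.

```scala
import org.apache.spark.sql.SparkSession

// Assumed settings for a local run.
val spark = SparkSession.builder()
  .appName("spark2-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows shaped like the flight data used in the Spark SQL examples.
val flights = Seq(("IAD", "TPA", 8), ("IAD", "TPA", 19), ("IND", "BWI", 34))
  .toDF("Origin", "Dest", "DepDelay")

// The DataFrame API and SQL both go through the same SparkSession.
flights.filter($"DepDelay" > 15).show()

flights.createOrReplaceTempView("flights")
val delayed = spark.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15")
delayed.show()

// The underlying SparkContext is still reachable when RDD-level access is needed.
val sc = spark.sparkContext
```

Note that createOrReplaceTempView takes over the role that registerTempTable plays in the Spark 1.x examples.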

Hortonworks Community Connection

Hortonworks Community Connection

Read access for everyone, join to participate and be recognized

• Full Q&A Platform (like StackOverflow)

• Knowledge Base Articles

• Code Samples and Repositories

Community Engagement

Participate now at: community.hortonworks.com

• 7,500+ Registered Users
• 15,000+ Answers
• 20,000+ Technical Assets

One Website!

Lab Preview

Link to Lab Setup Instructions

http://tinyurl.com/hwx-spark-intro

Robert Hryniewicz
@RobertH8z

Thanks!