Unlocking Your Hadoop Data with Apache Spark and CDH5


Transcript of Unlocking Your Hadoop Data with Apache Spark and CDH5

Page 1: Unlocking Your Hadoop Data with Apache Spark and CDH5

1

Seattle Spark Meetup

Page 2: Unlocking Your Hadoop Data with Apache Spark and CDH5

2

Next Sessions

• Currently planning

– New sessions

• Joint Seattle Spark Meetup and Seattle Mesos User Group meeting

• Joint Seattle Spark Meetup and Cassandra meeting

• Mobile marketing scenario with Tune

– Repeats requested for:

• Deep Dive into Spark, Tachyon, and Mesos Internals (Eastside)

• Spark at eBay: Troubleshooting the everyday issues (Seattle)

• This session (Eastside)

Page 3: Unlocking Your Hadoop Data with Apache Spark and CDH5

3

Unlocking your Hadoop data with Apache

Spark and CDH5

Denny Lee, Steven Hastings

Data Sciences Engineering

Page 4: Unlocking Your Hadoop Data with Apache Spark and CDH5

4

Purpose

• Showcasing Apache Spark scenarios within your Hadoop cluster

• Utilizing some Concur Data Sciences Engineering Scenarios

• Special Guests:

– Tableau Software: Showcasing Tableau connecting to Apache Spark

– Cloudera: How to configure and deploy Spark on YARN via Cloudera Manager

Page 5: Unlocking Your Hadoop Data with Apache Spark and CDH5

5

Agenda

• Configuring and Deploying Spark on YARN with Cloudera Manager

• Quick primer on our expense receipt scenario

• Connecting to Spark on your CDH5.1 cluster

• Quick demos

– Pig vs. Hive

– SparkSQL

• Tableau connecting to SparkSQL

• Deep Dive demo

– MLLib: SVD

Page 6: Unlocking Your Hadoop Data with Apache Spark and CDH5

6

Take Picture of Receipt

[Image: raw OCR output of a sample receipt from Chantanee Thai Restaurant & Bar, Bellevue, WA (www.chantanee.com) — table, server, Amex card details, Subtotal $40.00, Total Taxes $3.80, Grand Total $43.80, 15% tip $6.00 — illustrating how noisy the captured text is.]

Page 7: Unlocking Your Hadoop Data with Apache Spark and CDH5

7

Help Choose Expense Type

Page 8: Unlocking Your Hadoop Data with Apache Spark and CDH5

8

Gateway Node

[Diagram: Hadoop cluster with a gateway node]

Gateway Node:
- Can connect to HDFS
- Can execute Hadoop jobs on the cluster
- Can execute Spark on the cluster, OR locally for ad hoc work
- Can set up multiple VMs

Page 9: Unlocking Your Hadoop Data with Apache Spark and CDH5

9

Connecting…

spark-shell --master spark://$masternode:7077 --executor-memory 1G --total-executor-cores 16

--master
  Specify the master node, OR if using a gateway node you can just run locally to test.

--executor-memory
  Limit the amount of memory you use; otherwise you'll use up as much as you can (should set defaults).

--total-executor-cores
  Limit the number of cores you use; otherwise you'll use up as much as you can (should set defaults).
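Since the agenda also covers Spark on YARN via Cloudera Manager, here is a minimal sketch of the equivalent invocations from a CDH 5.1 gateway node. This is an assumption on my part (the slide only shows the standalone-master form); the flags are standard spark-shell options from the Spark 1.0/1.1 era.

# Run against YARN; assumes Cloudera Manager has deployed the Spark/YARN gateway configuration to this node
spark-shell --master yarn-client --executor-memory 1G --executor-cores 2 --num-executors 8

# Run locally on the gateway node for ad hoc testing, using 4 threads
spark-shell --master local[4]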

Page 10: Unlocking Your Hadoop Data with Apache Spark and CDH5

10

Connecting… and resources

Page 11: Unlocking Your Hadoop Data with Apache Spark and CDH5

11

RowCount: Pig

A = LOAD '/user/hive/warehouse/dennyl.db/sample_ocr/000000_0' USING TextLoader as (line:chararray);

B = group A all;

C = foreach B generate COUNT(A);

dump C;

Page 12: Unlocking Your Hadoop Data with Apache Spark and CDH5

12

RowCount: Spark

val ocrtxt = sc.textFile("/user/hive/warehouse/dennyl.db/sample_ocr/000000_0")

ocrtxt.count

Page 13: Unlocking Your Hadoop Data with Apache Spark and CDH5

13

RowCount: Pig vs. Spark

Query    Pig        Spark
1        0:00:41    0:00:02
2        0:00:42    0:00:00.5

Row Count against 1+ million categorized receipts
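The slides don't explain the gap between the two Spark runs, but it is consistent with the data being served from memory the second time. A minimal sketch of making that explicit, assuming the ocrtxt RDD from the previous slide (this caching step is not shown in the talk):

// Cache the RDD so repeated actions are served from memory rather than re-reading HDFS
ocrtxt.cache()
ocrtxt.count   // first action reads from HDFS and populates the cache
ocrtxt.count   // subsequent actions run against the cached partitions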

Page 14: Unlocking Your Hadoop Data with Apache Spark and CDH5

14

RowCount: Spark Stages

Page 15: Unlocking Your Hadoop Data with Apache Spark and CDH5

15

WordCount: Pig

B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

C = group B by word;

D = foreach C generate COUNT(B), group;

E = group D all;

F = foreach E generate COUNT(D);

dump F;

Page 16: Unlocking Your Hadoop Data with Apache Spark and CDH5

16

WordCount: Spark

val wc = ocrtxt.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

wc.count

Page 17: Unlocking Your Hadoop Data with Apache Spark and CDH5

17

WordCount: Pig vs. Spark

Query    Pig        Spark
1        0:02:09    0:00:38
2        0:02:07    0:00:02

Word Count against 1+ million categorized receipts

Page 18: Unlocking Your Hadoop Data with Apache Spark and CDH5

18

SparkSQL: Querying

// Utilize SQLContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

// Configure class
case class ocrdata(company_code: String, user_id: Long, date: Long,
  category_desc: String, category_key: String, legacy_key: String,
  legacy_desc: String, vendor: String, amount: Double)

Page 19: Unlocking Your Hadoop Data with Apache Spark and CDH5

19

SparkSQL: Querying (2)

// Extract Data
val ocr = sc.textFile("/$HDFS_Location")
  .map(_.split("\t"))
  .map(m => ocrdata(m(0), m(1).toLong, m(2).toLong, m(3), m(4), m(5), m(6), m(7), m(8).toDouble))

// For Spark 1.0.2
ocr.registerAsTable("ocr")

// For Spark 1.1.0+
ocr.registerTempTable("ocr")

Page 20: Unlocking Your Hadoop Data with Apache Spark and CDH5

20

SparkSQL: Querying (3)

// Write a SQL statement
val blah = sqlContext.sql("SELECT company_code, user_id FROM ocr")

// Show the first 10 rows
blah.map(a => a(0) + ", " + a(1)).collect().take(10).foreach(println)
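A side note not on the slides: .collect() pulls the entire result set back to the driver before .take(10) trims it. Since blah is itself an RDD, a lighter-weight variant (my suggestion, not what was demoed) is to take first:

// Ship only 10 rows to the driver instead of the full result
blah.map(a => a(0) + ", " + a(1)).take(10).foreach(println)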

Page 21: Unlocking Your Hadoop Data with Apache Spark and CDH5

21

Oops!

14/11/15 09:55:35 ERROR scheduler.TaskSetManager: Task 16.0:0 failed 4 times; aborting job
14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Cancelling stage 16
14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Stage 16 was cancelled
14/11/15 09:55:35 INFO scheduler.DAGScheduler: Failed to run collect at <console>:22
14/11/15 09:55:35 WARN scheduler.TaskSetManager: Task 136 was killed.
14/11/15 09:55:35 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 16.0, whose tasks have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 16.0:0 failed 4 times, most recent failure: Exception failure in TID 135 on host $server$: java.lang.NumberFormatException: For input string: "\N"
    sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
    java.lang.Double.parseDouble(Double.java:540)
    scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:232)

Page 22: Unlocking Your Hadoop Data with Apache Spark and CDH5

22

Let’s find the error

// Incorrect configuration; let me try to find "\N"
val errors = ocrtxt.filter(line => line.contains("\\N"))

// Error count (71053 lines)
errors.count

// Look at some of the data
errors.take(10).foreach(println)

Page 23: Unlocking Your Hadoop Data with Apache Spark and CDH5

23

Solution

// Issue: the [amount] field contains \N, which is the NULL value generated by Hive

// Configure class (Original)
case class ocrdata(company_code: String, user_id: Long, ... amount: Double)

// Configure class (Updated)
case class ocrdata(company_code: String, user_id: Long, ... amount: String)
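Switching amount to String is what the talk does; an alternative sketch (mine, not from the slides) keeps the field numeric and maps Hive's \N to 0.0 while parsing. parseAmount is a hypothetical helper, and this keeps the original ocrdata definition with amount: Double:

// Hypothetical helper: treat Hive's "\N" (or anything unparseable) as 0.0
def parseAmount(s: String): Double =
  if (s == "\\N") 0.0 else try { s.toDouble } catch { case _: NumberFormatException => 0.0 }

val ocrSafe = sc.textFile("/$HDFS_Location")
  .map(_.split("\t"))
  .map(m => ocrdata(m(0), m(1).toLong, m(2).toLong, m(3), m(4), m(5), m(6), m(7), parseAmount(m(8))))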

Page 24: Unlocking Your Hadoop Data with Apache Spark and CDH5

24

Re-running the Query

14/11/16 18:43:32 INFO scheduler.DAGScheduler: Stage 10 (collect at <console>:22) finished in 7.249 s
14/11/16 18:43:32 INFO spark.SparkContext: Job finished: collect at <console>:22, took 7.268298566 s

-1978639384, 20156192

-1978639384, 20164613

542292324, 20131109

-598558319, 20128132

1369654093, 20130970

-1351048937, 20130846

Page 25: Unlocking Your Hadoop Data with Apache Spark and CDH5

25

SparkSQL: By Category (Count)

// Query
val blah = sqlContext.sql("SELECT category_desc, COUNT(1) FROM ocr GROUP BY category_desc")

blah.map(a => a(0) + ", " + a(1)).collect().take(10).foreach(println)

// Results
14/11/16 18:46:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
14/11/16 18:46:12 INFO spark.SparkContext: Job finished: collect at <console>:22, took 4.275620339 s

Category 1, 25

Category 2, 97

Category 3, 37

Page 26: Unlocking Your Hadoop Data with Apache Spark and CDH5

26

SparkSQL: via Sum(Amount)

// Query
val blah = sqlContext.sql("SELECT category_desc, sum(amount) FROM ocr GROUP BY category_desc")

blah.map(a => a(0) + ", " + a(1)).collect().take(10).foreach(println)

// Results
14/11/16 18:46:12 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
14/11/16 18:46:12 INFO spark.SparkContext: Job finished: collect at <console>:22, took 4.275620339 s

Category 1, 2000

Category 2, 10

Category 3, 1800

Page 27: Unlocking Your Hadoop Data with Apache Spark and CDH5

27

Connecting Tableau to SparkSQL

Page 28: Unlocking Your Hadoop Data with Apache Spark and CDH5

28

Diving into MLLib / SVD for the Expense Receipt Scenario

Page 29: Unlocking Your Hadoop Data with Apache Spark and CDH5

29

Overview

• Re-Intro to Expense Receipt Prediction

• SVD

– What is it?

– Why do I care?

• Demo

– Basics (get the data)

– Data wrangling

– Compute SVD

Page 30: Unlocking Your Hadoop Data with Apache Spark and CDH5

30

Expense Receipts (Problems)

• Want to guess the expense type based on the words on the receipt

• Receipt X Word matrix is sparse

• Some words are likely to be found together

• Some words are actually the same word

Page 31: Unlocking Your Hadoop Data with Apache Spark and CDH5

31

SVD (Singular Value Decomposition)

• Been around a while

– But still popular and useful

• Matrix Factorization

– Intuition: rotate your view of the data

• Data can be well approximated by fewer features

– And you can get an idea of how good the approximation is (see the note below)
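To make the factorization concrete (standard SVD background, not from the slides): for a word × record matrix A, the SVD gives A = U Σ Vᵀ, and keeping only the top k singular values yields the low-rank approximation A ≈ U_k Σ_k V_kᵀ; the size of the discarded singular values tells you how good that approximation is.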

Page 32: Unlocking Your Hadoop Data with Apache Spark and CDH5

32

Demo: Overview

Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD

Page 33: Unlocking Your Hadoop Data with Apache Spark and CDH5

33

Basics

val rawlines = sc.textFile("/user/stevenh/subocr/subset.dat")

val ocrRecords = rawlines map { rawline =>
  rawline.split("\t")
} filter { line =>
  line.length == 10 && line(8) != ""
} zipWithIndex() map { case (lineItems, lineIdx) =>
  OcrRecord(lineIdx, lineItems(0), lineItems(1).toLong, lineItems(4),
    lineItems(8).toDouble, lineItems(9))
}

zipWithIndex() lets you give each record in your RDD an incrementing integer index
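The OcrRecord case class isn't shown on the slides; a plausible definition, inferred from how its fields are constructed here and used on the following slides (the field names are my guesses):

case class OcrRecord(recIdx: Long, companyCode: String, userId: Long,
  categoryKey: String, amount: Double, ocrText: String)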

Page 34: Unlocking Your Hadoop Data with Apache Spark and CDH5

34

Tokenize Records

// The OCR text apparently contains literal "\r\n" character sequences; strip them before tokenizing
val splitRegex = new scala.util.matching.Regex("\\\\r\\\\n")
val wordRegex = new scala.util.matching.Regex("[a-zA-Z0-9_]+")

val recordWords = ocrRecords flatMap { rec =>
  val s1 = splitRegex.replaceAllIn(rec.ocrText, "")
  val s2 = wordRegex.findAllIn(s1)
  for { S <- s2 } yield (rec.recIdx, S.toLowerCase)
}

Keep track of which record this came from
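A quick way to spot-check the output (not on the slide); each element should be a (record index, lowercased word) pair:

// Print a few (recIdx, word) pairs; the actual values depend on your data
recordWords.take(5).foreach(println)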

Page 35: Unlocking Your Hadoop Data with Apache Spark and CDH5

35

Demo: Overview

Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD

Page 36: Unlocking Your Hadoop Data with Apache Spark and CDH5

36

Group data by Record and Word

// Count occurrences of each (recIdx, word) pair
val wordCounts = recordWords groupBy { T => T }

// Key by word: (word, (recIdx, word, count))
val wordsByRecord = wordCounts map { gr =>
  (gr._1._2, (gr._1._1, gr._1._2, gr._2.size))
}

// Assign each distinct word an incrementing index: (word, wordIdx)
val uniqueWords = wordsByRecord groupBy { T =>
  T._2._2
} zipWithIndex() map { gr =>
  (gr._1._1, gr._2)
}

Page 37: Unlocking Your Hadoop Data with Apache Spark and CDH5

37

Join Record, Word Data

// Join on word: (word, ((recIdx, word, count), wordIdx))
val preJoined = wordsByRecord join uniqueWords

val joined = preJoined map { pj =>
  RecordWord(pj._2._1._1, pj._2._2, pj._2._1._2, pj._2._1._3.toDouble)
}

Join 2-tuple RDDs on first value of tuple

Now we have data for each non-zero word/record combo
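Like OcrRecord, the RecordWord case class isn't shown; a plausible definition inferred from the constructor call above and the fields (recIdx, wordIdx, n) used on the next slide (names are my guesses):

case class RecordWord(recIdx: Long, wordIdx: Long, word: String, n: Double)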

Page 38: Unlocking Your Hadoop Data with Apache Spark and CDH5

38

Demo: Overview

Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD

Page 39: Unlocking Your Hadoop Data with Apache Spark and CDH5

39

Generate Word x Record Matrix

// Imports needed (not shown on the slide)
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, SparseVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val ncols = ocrRecords.count().toInt
val nrows = uniqueWords.count().toLong

// One sparse row per word; each row holds that word's counts across all records
val vectors: RDD[Vector] = joined groupBy { T =>
  T.wordIdx
} map { gr =>
  val indices = for { x <- gr._2 } yield x.recIdx.toInt
  val data = for { x <- gr._2 } yield x.n
  new SparseVector(ncols, indices.toArray, data.toArray)
}

val rowMatrix = new RowMatrix(vectors, nrows, ncols)

This is a Spark Vector, not a scala Vector

Page 40: Unlocking Your Hadoop Data with Apache Spark and CDH5

40

Demo: Overview

Raw Data → Tokenized Words → Grouped Words, Records → Matrix → SVD

Page 41: Unlocking Your Hadoop Data with Apache Spark and CDH5

41

Compute SVD

val svd = rowMatrix.computeSVD(5, computeU = true)

• Ironically, in Spark v1.0 computeSVD is limited by an operation which must complete on a single node…
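The slide stops at the call; here is a minimal sketch of inspecting the result, assuming Spark 1.0+ MLlib where computeSVD returns a SingularValueDecomposition with fields U, s, and V:

// s holds the top 5 singular values as a local Vector
println(svd.s)

// U (word space) is a distributed RowMatrix; V (record space) is a local Matrix
println(s"U: ${svd.U.numRows()} x ${svd.U.numCols()}, V: ${svd.V.numRows} x ${svd.V.numCols}")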

Page 42: Unlocking Your Hadoop Data with Apache Spark and CDH5

42

Spark / SVD References

• Distributing the Singular Value Decomposition with Spark

– Spark-SVD gist

– Twitter / Databricks blog post

• Spotting Topics with the Singular Value Decomposition

Page 43: Unlocking Your Hadoop Data with Apache Spark and CDH5

43

Now do something with the data!

Page 44: Unlocking Your Hadoop Data with Apache Spark and CDH5

44

Upcoming Conferences

• Strata + Hadoop World

– http://strataconf.com/big-data-conference-ca-2015/public/content/home

– San Jose, Feb 17-20, 2015

• Spark Summit East

– http://spark-summit.org/east

– New York, March 18-19, 2015

• Ask for a copy of “Learning Spark”

– http://www.oreilly.com/pub/get/spark

Page 45: Unlocking Your Hadoop Data with Apache Spark and CDH5