Spark - Shark Data Analytics Stack on a Hadoop Cluster

Post on 25-Feb-2016

April 22, 2013

Big Data Week
Data Science Group

April 23, 2013

Michael Malak

Data Analytics Senior Engineer at Time Warner Cable
Technicaltidbit.com

Chris Deptula
Senior Big Data Consultant
317.840.2935
chris.deptula@openbi.com
@chrisdeptula
http://www.openbi.com

Michael Walker
Managing Partner
720.373.2200
m@rosebt.com
http://www.rosebt.com

Agenda

• The Big Data Problem
• Spark Ecosystem
• NFL Data Science Use Case
• Visualizing Data

The Big Data Problem

Speed Kills in Data Science

Hype Cycle for Emerging Tech 2012

Hype Cycle for Big Data 2012

Evolution of DW Architecture

Emerging DW Architecture

Next-Generation Data Architecture

Big Data Ecosystem Parts

DW Database Systems MQ 2013

Total Enterprise Data Growth 2005-2015

Structured vs Unstructured Data

Modern DW/BI Analytical Ecosystems

Big Data Ecosystem Parts

The Internet of Things

Big Data 4 V's

New World of Databases

Hadoop

Big Data Vendor Focused on Hadoop and NoSQL Revenue 2012

Big Data Analytics Infrastructure

The Spark Ecosystem

Agenda

• What Hadoop gives us
• What everyone is complaining about in 2013
• Spark
  o Berkeley Team
  o BDAS (Berkeley Data Analytics Stack)
  o RDDs (Resilient Distributed Datasets)
  o Shark
  o Spark Streaming
  o Other Spark subsystems

Global Big Data Apr 23, 2013

technicaltidbit.com 31

What Hadoop Gives Us

• HDFS
• Map/Reduce

Hadoop: HDFS

Image from mark.chmarny.com

Hadoop: Map/Reduce

Image from people.apache.org/~rdonkin

Image from blog.octo.com

Map/Reduce Tools

[Diagram: the Map/Reduce tool stack. Labels: Linux, Hadoop, HBase, Pig, Hive, App, Pig Script, HiveQL]

Hadoop Distribution Dogs in the Race

Hadoop Distribution Query Tool

Stinger

Apache Drill

Other Open Source Solutions

• Druid
• Spark

Not just caching, but streaming

• 1st generation: HDFS
• 2nd generation: Caching & “Push” Map/Reduce
• 3rd generation: Streaming

Berkeley Team

• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into 6 year program

Image from Ion Stoica’s slides from Strata 2013 presentation

BDAS (Berkeley Data Analytics Stack)

[Diagram: the BDAS stack. Labels: Linux, Mesos, Hadoop/HDFS, Spark, Bagel, Shark, Spark Streaming; applications on top: Spark App, Bagel App, Shark App, Spark Streaming App]

RDDs (Resilient Distributed Datasets)

Image from Matei Zaharia’s paper

RDDs: Laziness

// All lazy: the three transformations below read and compute nothing
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))   // shorthand for x => x.startsWith("ERROR")
                  .map(_.split('\t')(2))
                  .filter(_.contains("foo"))

// Action! count triggers the actual computation
val cnt = errors.count()
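The laziness on this slide can be tried without a cluster. The following is a minimal sketch in plain Scala, assuming a collection view as a stand-in for an RDD; it is an analogy, not the Spark API. The names LazinessSketch, sampleLines, filterCalls, and callsBeforeAction are hypothetical, and the sample log lines are made up for illustration.

```scala
// Plain-Scala analogy for RDD laziness: a collection view stands in for an RDD.
// LazinessSketch, sampleLines and filterCalls are hypothetical names, not Spark API.
object LazinessSketch {
  // Made-up log lines with three tab-separated fields each.
  val sampleLines: Seq[String] = Seq(
    "ERROR\t10:01\tfoo failed",
    "INFO\t10:02\tall good",
    "ERROR\t10:03\tbar failed",
    "ERROR\t10:04\tfoo timed out"
  )

  // Counts how many times the first filter predicate has actually run.
  var filterCalls = 0

  // "Transformations": building the view evaluates nothing.
  val errors = sampleLines.view
    .filter { line => filterCalls += 1; line.startsWith("ERROR") }
    .map(_.split('\t')(2))
    .filter(_.contains("foo"))

  // Snapshot taken after the pipeline is defined but before any action.
  val callsBeforeAction: Int = filterCalls

  // "Action": asking for a result walks the data and runs the predicates.
  val cnt: Int = errors.size

  def main(args: Array[String]): Unit =
    println(s"predicate calls before action: $callsBeforeAction, count: $cnt")
}
```

The counter is only there to make the deferred evaluation observable; a real RDD additionally distributes the work and records lineage, which a view does not.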

RDDs: Transformations vs. Actions

Transformations

map(func)
filter(func)
flatMap(func)
sample(withReplacement, frac, seed)
union(otherDataset)
groupByKey[K,V]
reduceByKey[K,V](func)
join[K,V,W](otherDataset)
cogroup[K,V,W1,W2](other1, other2)
cartesian[U](otherDataset)
sortByKey[K,V]

Actions

reduce(func)
collect()
count()
take(n)
first()
saveAsTextFile(path)
saveAsSequenceFile(path)
foreach(func)

[K,V] in Scala is the same as <K,V> in C++ templates and Java generics
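To make the key-value operations in the list concrete, here is a small sketch of what groupByKey, reduceByKey, and join compute, written against ordinary Scala collections. It assumes eager, single-machine evaluation, so it illustrates the results only, not Spark's lazy or distributed behavior. PairOpsSketch and its helpers are hypothetical names that merely mirror the RDD operation names.

```scala
// Plain-Scala sketch of what three pair "transformations" compute.
// PairOpsSketch and its helpers are hypothetical; they only mirror the RDD operation names.
object PairOpsSketch {
  // groupByKey: collect all values that share a key.
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // reduceByKey: combine the values of each key with a binary function.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    groupByKey(pairs).map { case (k, vs) => k -> vs.reduce(func) }

  // join: pair up values from two datasets that share a key.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))

  def main(args: Array[String]): Unit = {
    val events = Seq("hercman" -> 1, "shewolf" -> 1, "hercman" -> 1)
    val teams  = Seq("hercman" -> "red", "shewolf" -> "blue")
    println(reduceByKey(events)(_ + _))
    println(join(events, teams))
  }
}
```

The [K, V] type parameters are the Scala counterpart of the <K,V> notation mentioned on the slide.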

Hive vs. Shark

[Diagram: Hive runs HiveQL over HDFS files; Shark runs HiveQL over HDFS files + RDDs]

Shark: Copy from HDFS to RDD

CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;

CREATE TABLE wiki_cached AS SELECT * FROM wiki;

Each statement creates a table that is stored in the cluster’s memory using RDD.cache(). A table name ending in _cached has the same effect as setting shark.cache to true.

Shark: Just a Shim

Shark

Images from Reynold Xin’s presentation

What about “Big Data”?

[Chart: Shark effectiveness across data sizes (KB, MB, GB, TB, PB)]

Median Hadoop job input size

Image from Reynold Xin’s presentation

Spark Streaming: Motivation

[Diagram labels: x1,000,000 clients; DStream; RDD; RDD; HDFS]

Spark Streaming: DStream

• “A series of small batches”

[Diagram: a DStream drawn as a series of RDDs, one per 2-second batch, each holding JSON events such as:]

{"id": "hercman", "eventType": "buyGoods"}
{"id": "shewolf", "eventType": "error"}
{"id": "catlover", "eventType": "buyGoods"}
{"id": "hercman", "eventType": "logOff"}
. . .
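The idea of "a series of small batches" can be sketched in plain Scala by bucketing timestamped events into fixed intervals. This is an illustration under stated assumptions, not how Spark Streaming is implemented: the Event class, its timeMs field, and toBatches are hypothetical, the timestamps are made up, and intervals with no events are simply skipped here.

```scala
// Sketch of "a series of small batches": bucket timestamped events into fixed intervals.
// Event, timeMs and toBatches are hypothetical names, not Spark Streaming API.
object MicroBatchSketch {
  final case class Event(timeMs: Long, id: String, eventType: String)

  // Assign each event to the interval that contains its timestamp,
  // then return the non-empty batches in time order.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events.groupBy(_.timeMs / batchMs).toSeq.sortBy(_._1).map(_._2)

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(100, "hercman", "buyGoods"),
      Event(1900, "shewolf", "error"),
      Event(2500, "catlover", "buyGoods"),
      Event(4100, "hercman", "logOff")
    )
    // Each inner Seq stands in for one RDD of the DStream.
    toBatches(events, batchMs = 2000).foreach(batch => println(batch.map(_.eventType)))
  }
}
```

With a 2000 ms interval, each inner sequence plays the role of one 2-second RDD in the diagram.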

Spark Streaming: DAG

The DAG:

Kafka DStream[String] (JSON)
  -> DStream.transform
     -> DStream.filter(_.eventType == "error")     (DStream[EvObj])
        -> DStream.foreach(println)
     -> DStream.filter(_.eventType == "buyGoods")  (DStream[EvObj])
        -> DStream.map((_.id, 1))
        -> DStream.groupByKey
        -> DStream.foreach(println)

Spark Streaming: Example Code

// Initialize
val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
val msgs = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

// DAG: parse each JSON message into an event object
val events: DStream[EvObj] = msgs.transform(rdd => rdd.map(new EvObj(_)))

// Branch 1: count error events per batch
val errorCounts = events.filter(_.eventType == "error")
errorCounts.foreach(rdd => println(rdd.count))

// Branch 2: count distinct users buying goods per batch
val usersBuying = events.filter(_.eventType == "buyGoods")
                        .map(e => (e.id, 1))
                        .groupByKey
usersBuying.foreach(rdd => println(rdd.count))

// Go
ssc.start()

Stateful Spark Streaming

class ErrorsPerUser(var numErrors: Int = 0) extends Serializable

val updateFunc = (values: Seq[EvObj], state: Option[ErrorsPerUser]) => {
  if (values.exists(_.eventType == "logOff")) {
    None  // the user logged off: returning None drops that user's state
  } else {
    val s = state.getOrElse(new ErrorsPerUser)
    s.numErrors += values.count(_.eventType == "error")
    Option(s)
  }
}

// DAG
val events: DStream[EvObj] = msgs.transform(rdd => rdd.map(new EvObj(_)))
// Keep logOff events as well as errors so that updateFunc can see them
val errorsAndLogOffs = events.filter(e => e.eventType == "error" || e.eventType == "logOff")
val states = errorsAndLogOffs.map(e => (e.id, e))
                             .updateStateByKey[ErrorsPerUser](updateFunc)

// Off-DAG
states.foreach(rdd => println("Num users experiencing errors: " + rdd.count))
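The per-user state logic can be exercised without a cluster by folding an update function over a list of batches. The sketch below is plain Scala and only imitates the semantics; it assumes the intended behavior is "count errors per user until that user logs off", and it keeps a bare Int as the state instead of ErrorsPerUser. StatefulSketch, Event, update, and step are hypothetical names.

```scala
// Plain-Scala sketch of per-key state carried across batches.
// StatefulSketch, Event, update and step are hypothetical names, not Spark Streaming API.
object StatefulSketch {
  final case class Event(id: String, eventType: String)

  // New state for one user given this batch's events and the previous error count.
  // Returning None drops the user's state (the convention updateStateByKey follows).
  def update(values: Seq[Event], state: Option[Int]): Option[Int] =
    if (values.exists(_.eventType == "logOff")) None
    else Some(state.getOrElse(0) + values.count(_.eventType == "error"))

  // Apply one batch to every user that has state or appears in the batch.
  def step(states: Map[String, Int], batch: Seq[Event]): Map[String, Int] = {
    val byUser = batch.groupBy(_.id)
    val keys = states.keySet ++ byUser.keySet
    keys.flatMap { k =>
      update(byUser.getOrElse(k, Seq.empty), states.get(k)).map(k -> _)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq(Event("shewolf", "error"), Event("hercman", "buyGoods")),
      Seq(Event("shewolf", "error"), Event("hercman", "logOff"))
    )
    println(batches.foldLeft(Map.empty[String, Int])(step))
  }
}
```

Note that the update function is also called for users who have state but no events in the current batch, which is why step iterates over the union of both key sets.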

Other Spark Subsystems

• Bagel (similar to Google Pregel)

• Sparkler (Matrix decomposition)
• (Machine Learning)

Teaser

• Future Meetup: Machine learning from real-time data streams

Data Science NFL Use-Case

Speed Kills - In Data Science and NFL

Speed Kills: Up-Tempo - No Huddle

Observation: best college and NFL offenses are using an up-tempo - no huddle strategy.

Hypothesis: NFL teams using the up-tempo - no huddle strategy have the best winning records.

Data Science Formula

Data Science Formula

Data science processes include:

1) Information gathering

2) Re-representation of the information in a schema that aids analysis

3) Development of insight through the manipulation of this representation

4) Creation of some knowledge product or direct action based on the insight

Data Science Formula

Information > Schema > Insight > Product

The data analysis may be organized in two key loops:

1) Searching loop (seeking, extracting, filtering information)

2) Understanding loop (modeling and conceptualization from a schema that best fits the evidence)

Data Collection

Collect data on all NFL offenses

Collect data on NFL offenses using an up-tempo - no huddle strategy

Collect data on NFL team records (win-losses)

Data Comparison

Compare data on all NFL offenses with data on NFL offenses using an up-tempo - no huddle strategy

Data Science Tools

• Scientific methods
• Analytical techniques
• Machine learning techniques
• Algorithm design and execution
• Data visualization and story-telling
• Statistics
• Math
• Computer engineering
• Data mining
• Data modeling

Predictive, Descriptive, Prescriptive Analytics

There are three types of data analysis:

Descriptive (business intelligence and data mining)

Predictive (forecasting)

Prescriptive (optimization and simulation)

Predictive Analytics

Predictive Modeling Techniques

Predictive Analytics

Prescriptive Analytics

Regression Analysis

NFL Data Sources

2002-2009 NFL PLAY BY PLAY

http://www.infochimps.com/tags/nfl

Supplemented stats source:

http://www.databasefootball.com/

http://www.pro-football-reference.com/

Data Results

From 2001 to 2012 only one NFL team used the up-tempo - no huddle offense consistently:

The New England Patriots

Note: In college football many teams started using the up-tempo - no huddle offense consistently in 2009. One team stands out:

The Oregon Ducks

Data Model - DVOA

DVOA (Defense-adjusted Value Over Average) for total offense as well as rushing and passing offense separated. All numbers are adjusted to an average schedule of opponents and an average percentage of fumbles recovered by the offense.

Exceptions are the three columns marked NON-ADJUSTED. Rushing includes all rushing, not just running backs.

Designed by Aaron Schatz of Football Outsiders

2001 Offensive Efficiency Ratings

2002 Offensive Efficiency Ratings

2003 Offensive Efficiency Ratings

2004 Offensive Efficiency Ratings

2005 Offensive Efficiency Ratings

2006 Offensive Efficiency Ratings

2007 Offensive Efficiency Ratings

2008 Offensive Efficiency Ratings

2009 Offensive Efficiency Ratings

2011 Offensive Efficiency Ratings

2012 Offensive Efficiency Ratings

Yr to Yr Correlation of Offense 2002 - 2009

All-Time DVOA Lists

All-Time DVOA Lists

NFL Team Plays per Game 2003

NFL Team Plays per Game 2004

NFL Team Plays per Game 2005

NFL Team Plays per Game 2006

NFL Team Plays per Game 2007

NFL Team Plays per Game 2008

NFL Team Plays per Game 2009

NFL Team Plays per Game 2011

NFL Team Plays per Game 2012

Team Records

Total Points

Average Points per Game

Total Plays

Average Plays per Game

Data Results

From 2001 to 2009 the New England Patriots had 107 wins and 37 losses - the second-best record in the NFL (behind Manning's Indianapolis Colts).

They appeared in five (5) Super Bowls, winning three (3).

From 2001 to 2012 the New England Patriots had 146 wins and 46 losses - the best record in the NFL.

Data Results

In 128 games NE scored a total of 3,356 points (26.2 ppg).

Second to Manning's Colts.

NE runs an up-tempo - no huddle offense

IND runs a no huddle offense

Data Results

NE in 5 SBs - wins 3 SBs (up-tempo - no huddle offense)

IND in 2 SBs - wins 1 SB (no huddle offense)

NE Stats 2001 - 2012

Yr - W/L - Plpg/R - Off R - Tot pps

2001 11-5 62.5/18 6 371 - Super Bowl Champs

2002 9-7 64.4/8 16 381

2003 14-2 66.4/1 12 348 - Super Bowl Champs

2004 14-2 64.3/7 4 437 - Super Bowl Champs

2005 10-6 63.7/15 10 379

2006 12-4 66.4/4 7 385

2007 16-0 65.8/3 1 589 - Lost SB

2008 11-5 68.4/1 8 410

2009 10-6 67/2 6 427

2010 14-2 62.6/21 1 518

2011 13-3 67.4/2 3 513 - Lost SB

2012 12-4 74.3/1 1 557
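The figures in the NE table can be cross-checked against the earlier Data Results slides with a few lines of Scala. The sketch below copies wins, losses, plays per game, and total points from the table; the Season class, the helper names, and the choice of a Pearson correlation are assumptions added for illustration. The 2002-2009 rows add up to the 3,356 points in 128 games quoted earlier, and the full table gives the 146-46 record.

```scala
// Recomputes headline figures from the NE table and adds a simple correlation.
// Season, record, totalPoints and pearson are hypothetical names; the numbers come from the table.
object PatriotsStatsSketch {
  final case class Season(year: Int, wins: Int, losses: Int, playsPerGame: Double, points: Int)

  val seasons: Seq[Season] = Seq(
    Season(2001, 11, 5, 62.5, 371), Season(2002, 9, 7, 64.4, 381),
    Season(2003, 14, 2, 66.4, 348), Season(2004, 14, 2, 64.3, 437),
    Season(2005, 10, 6, 63.7, 379), Season(2006, 12, 4, 66.4, 385),
    Season(2007, 16, 0, 65.8, 589), Season(2008, 11, 5, 68.4, 410),
    Season(2009, 10, 6, 67.0, 427), Season(2010, 14, 2, 62.6, 518),
    Season(2011, 13, 3, 67.4, 513), Season(2012, 12, 4, 74.3, 557)
  )

  private def between(from: Int, to: Int): Seq[Season] =
    seasons.filter(s => s.year >= from && s.year <= to)

  // Combined (wins, losses) over an inclusive range of seasons.
  def record(from: Int, to: Int): (Int, Int) = {
    val s = between(from, to)
    (s.map(_.wins).sum, s.map(_.losses).sum)
  }

  // Total points scored over an inclusive range of seasons.
  def totalPoints(from: Int, to: Int): Int = between(from, to).map(_.points).sum

  // Pearson correlation coefficient between two equally long series.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    val n = xs.size
    val mx = xs.sum / n
    val my = ys.sum / n
    val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
    val sx = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
    val sy = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
    cov / (sx * sy)
  }

  def main(args: Array[String]): Unit = {
    println(record(2001, 2012))
    println(totalPoints(2002, 2009).toDouble / (8 * 16)) // points per game over 128 games
    println(pearson(seasons.map(_.playsPerGame), seasons.map(_.wins.toDouble)))
  }
}
```

A correlation over twelve seasons of a single team is only a descriptive statistic; it does not by itself test the up-tempo hypothesis.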

Denver Broncos Stats 2011 - 2012

Yr - W/L - Plpg/R - Off R - Tot pps

2011 8-8 63.6/16 25 309

2012 13-3 69.2/4 2 481

Buffalo Bills - 1990-1993

Yr - W/L - Plpg/R - Off R - Tot pps

1990 13-3 58.2/21 1 428 - Lost SB

1991 13-3 66/2 2 458 - Lost SB

1992 11-5 67.9/1 3 381 - Lost SB

1993 12-4 67.3/3 7 329 - Lost SB

Oregon Ducks 2009-2012

Yr - W/L - Plpg/R - Off R - Tot pps

2009 10-3 68.1/54 8 468 - Lost Rose B - Final Rank 11

2010 12-1 78.8/5 1 611 - Lost BCS Ch - Final Rank 3

2011 12-2 72.5/32 3 644 - Won Rose B - Final Rank 4

2012 12-1 81.4/9 2 637 - Won Fiesta B - Final Rank 2

Forecasting Principles - Framework

Stages of Forecasting

Forecasting Methods Selection Chart

Forecasting Methodology Tree

Change in Offensive Play Calling

Ave Pass Attempts per Game 35 Yrs

Completion % 35 yrs

Interception % 35 yrs

Passing: Return vs. Risk

QB Ratings 35 yrs

Series 1st Down Likelihood on 1st and 10

Offensive Points by Yards

Season Wins by Passing Efficiency

Plays per Game per Year

Speed Kills: Up-Tempo - No Huddle

Prediction: More NFL offenses will utilize the up-tempo - no huddle strategy.

Prediction: More NFL offenses will pass.

Forecast: NFL teams using the up-tempo - no huddle strategy will have the best winning records.

Visualizing Data

When does a Data Scientist's job end?

• Data Scientists must be able to tell stories with their findings.

• The audience may not understand regression analysis, weighted ranks, etc.

• They must be able to present findings in a clear, concise, and visually accessible manner.

Which would your manager prefer?

VS

Big Data Visualization Tools

• Pentaho
• Tableau
• Jaspersoft
• Pervasive
• Birst
• Many, many more

Pentaho

An open source full suite of data integration and analytics tools
- Data Integration
- Pixel perfect reports
- OLAP cubes
- Dashboards

Works with many Big Data sources including Hadoop, Hive, HBase, Cassandra, MongoDB, and Couchbase along with traditional data sources.

The OLAP engine, Mondrian, automatically generates SQL.

Pentaho

Integrates with WEKA, but with no other statistical tools or languages.

Unfortunately, like all other user-driven visualization tools today, data must be extracted to memory or a database. Pentaho makes doing this easy.

The commercial edition includes a capability called Instaview that automates this process.

Work is under way on support for Hive SQL.

Demo

Demo creating visualizations

Speed Kills

Thank You

Presentation by:

Michael Malak

Chris Deptula

Michael Walker