Spark - Shark Data Analytics Stack on a Hadoop Cluster

April 22, 2013

Big Data Week

Data Science Group

April 23, 2013

Michael Malak

Data Analytics Senior Engineer at Time Warner Cable

technicaltidbit.com

Chris Deptula

Senior Big Data …

@chrisdeptula

http://www.openbi.com

Michael Walker

Managing …

http://www.rosebt.com

Agenda

• The Big Data Problem
• Spark Ecosystem
• NFL Data Science Use Case
• Visualizing Data

The Big Data Problem

Speed Kills in Data Science

Hype Cycle for Emerging Tech 2012

Hype Cycle for Big Data 2012

Evolution of DW Architecture

Emerging DW Architecture

Next-Generation Data Architecture

Big Data Ecosystem Parts

DW Database Systems MQ 2013

Total Enterprise Data Growth 2005-2015

Structured vs Unstructured Data

Modern DW/BI Analytical Ecosystems

Big Data Ecosystem Parts

The Internet of Things

Big Data 4 V's

New World of Databases

Hadoop

Big Data Vendor Focused on Hadoop and NoSQL Revenue 2012

Big Data Analytics Infrastructure

The Spark Ecosystem

Agenda

• What Hadoop gives us
• What everyone is complaining about in 2013
• Spark
  o Berkeley Team
  o BDAS (Berkeley Data Analytics Stack)
  o RDDs (Resilient Distributed Datasets)
  o Shark
  o Spark Streaming
  o Other Spark subsystems

Global Big Data Apr 23, 2013

technicaltidbit.com

What Hadoop Gives Us

• HDFS
• Map/Reduce



Hadoop: HDFS

Image from mark.chmarny.com



Hadoop: Map/Reduce



Image from people.apache.org/~rdonkin

Image from blog.octo.com

Map/Reduce Tools



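[Diagram: tool stack showing Linux, Hadoop, the tools that run on Hadoop (HBase, Pig, Hive), and what the user writes for each (App, Pig Script, HiveQL)]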

Hadoop Distribution Dogs in the Race



Hadoop Distribution Query Tool

Stinger

Apache Drill

Other Open Source Solutions

• Druid
• Spark



Not just caching, but streaming

• 1st generation: HDFS
• 2nd generation: Caching & “Push” Map/Reduce
• 3rd generation: Streaming



Berkeley Team

• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into 6 year program



Image from Ion Stoica’s slides from Strata 2013 presentation

BDAS (Berkeley Data Analytics Stack)

[Diagram: layered stack showing Linux, Mesos, Hadoop/HDFS, Spark, the Spark subsystems (Bagel, Shark, Spark Streaming), and the applications on top (Spark App, Bagel App, Shark App, Spark Streaming App)]

RDDs (Resilient Distributed Datasets)



Image from Matei Zaharia’s paper

RDDs: Laziness



val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))   // keep only error lines
                  .map(_.split('\t')(2))            // take the third tab-separated field
                  .filter(_.contains("foo"))
val cnt = errors.count

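_.startsWith("ERROR") is shorthand for x => x.startsWith("ERROR")

All lazy: textFile, filter, and map only describe the computation.

Action! count triggers the actual work.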
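The laziness described above can be imitated without a cluster. The following is a minimal sketch using a plain Scala collection view rather than the Spark API; the sample lines, the `inspected` counter, and the object name are assumptions made up for illustration. Building the pipeline touches no data, and only the final `size` call (standing in for the `count` action) runs the filters.

```scala
object LazyLogSketch {
  // Hypothetical sample input; stands in for the HDFS file on the slide.
  val sampleLines: Seq[String] = Seq(
    "ERROR\t2013-04-23\tfoo failed",
    "INFO\t2013-04-23\tall good",
    "ERROR\t2013-04-23\tbar failed",
    "ERROR\t2013-04-23\tfoo timed out"
  )

  // Counts how many lines the first filter has actually inspected.
  var inspected = 0

  // Builds the pipeline without running it: a view only records the steps.
  def buildPipeline(lines: Seq[String]) =
    lines.view
      .filter { l => inspected += 1; l.startsWith("ERROR") }
      .map(_.split('\t')(2))
      .filter(_.contains("foo"))

  def main(args: Array[String]): Unit = {
    val errors = buildPipeline(sampleLines)
    println(s"inspected before the action: $inspected")
    val cnt = errors.size // plays the role of the count action
    println(s"cnt = $cnt, inspected after the action: $inspected")
  }
}
```

A view is only a local analogy: it shows the deferred evaluation, not the partitioning, caching, or fault recovery that RDDs add.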

RDDs: Transformations vs. Actions



Transformations
map(func)
filter(func)
flatMap(func)
sample(withReplacement, frac, seed)
union(otherDataset)
groupByKey[K,V]
reduceByKey[K,V](func)
join[K,V,W](otherDataset)
cogroup[K,V,W1,W2](other1, other2)
cartesian[U](otherDataset)
sortByKey[K,V]

Actions
reduce(func)
collect()
count()
take(n)
first()
saveAsTextFile(path)
saveAsSequenceFile(path)
foreach(func)

[K,V] in Scala is the same as <K,V> in C++ templates and Java generics
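To make the [K,V] notation and the keyed transformations concrete, here is a minimal sketch of what groupByKey and reduceByKey compute, written against local Scala collections rather than RDDs. The object name and the local function signatures are assumptions for illustration; the real operations are lazy and run across a cluster.

```scala
object KeyedOpsSketch {
  // Local stand-in for groupByKey: collect all values that share a key.
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // Local stand-in for reduceByKey: combine the values of each key with func.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    groupByKey(pairs).map { case (k, vs) => (k, vs.reduce(func)) }

  def main(args: Array[String]): Unit = {
    val pairs = Seq(("error", 1), ("buyGoods", 1), ("error", 1))
    println(groupByKey(pairs))
    println(reduceByKey(pairs)(_ + _))
  }
}
```

K and V are type parameters: the same two functions work for any key and value types, which is what the bracket notation in the list above expresses.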

Hive vs. Shark



[Diagram: Hive runs HiveQL over HDFS files; Shark runs HiveQL over HDFS files + RDDs]

Shark: Copy from HDFS to RDD

CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;

CREATE TABLE wiki_cached AS SELECT * FROM wiki;

Creates a table that is stored in the cluster’s memory using RDD.cache(). The second form relies on the _cached table-name suffix.



Shark: Just a Shim



Shark

Images from Reynold Xin’s presentation

What about “Big Data”?



[Chart: Shark effectiveness across data sizes from KB to PB, with the median Hadoop job input size marked]



Image from Reynold Xin’s presentation

Spark Streaming: Motivation



[Diagram labels: x1,000,000 clients; HDFS; DStream; RDD; RDD]

Spark Streaming: DStream

• “A series of small batches”



{"id": "hercman", "eventType": "buyGoods"}
{"id": "shewolf", "eventType": "error"}
. . .
{"id": "catlover", "eventType": "buyGoods"}
{"id": "hercman", "eventType": "logOff"}

Each 2-second batch of events becomes one RDD.
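The "series of small batches" idea can be sketched in a few lines of plain Scala. This is not Spark Streaming code: the `Event` type, its `timeMs` timestamp field, and the `toBatches` helper are hypothetical names introduced here. The sketch only shows how a stream of events is cut into fixed-width intervals, each of which would be handed to the engine as one RDD.

```scala
object MicroBatchSketch {
  // Hypothetical event type; the timestamp field is an assumption for this sketch.
  case class Event(timeMs: Long, id: String, eventType: String)

  // Assigns each event to a fixed-width interval and returns the batches in time order.
  // Intervals that received no events are simply absent from the result.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events
      .groupBy(_.timeMs / batchMs)
      .toSeq
      .sortBy(_._1)
      .map(_._2)

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(100, "hercman", "buyGoods"),
      Event(1500, "shewolf", "error"),
      Event(2300, "catlover", "buyGoods"),
      Event(4100, "hercman", "logOff")
    )
    toBatches(events, 2000).zipWithIndex.foreach { case (batch, i) =>
      val ids = batch.map(_.id).mkString(", ")
      println(s"batch $i: $ids")
    }
  }
}
```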

Spark Streaming: DAG



The DAG:

Kafka DStream[String] (JSON) > DStream.transform
  > DStream[EvObj] > DStream.filter(_.eventType == "error") > DStream.foreach(println)
  > DStream[EvObj] > DStream.filter(_.eventType == "buyGoods") > DStream.map((_.id, 1)) > DStream.groupByKey > DStream.foreach(println)

Spark Streaming: Example Code



// Initialize: 2-second batches, messages read from Kafka
val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
val msgs = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

// DAG: parse each JSON message into an evObj
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))

// Branch 1: print the number of error events in each batch
val errorCounts = events.filter(_.eventType == "error")
errorCounts.foreach(rdd => println(rdd.count))

// Branch 2: print the number of distinct users buying in each batch
val usersBuying = events.filter(_.eventType == "buyGoods")
                        .map(e => (e.id, 1))
                        .groupByKey
usersBuying.foreach(rdd => println(rdd.count))

// Go
ssc.start

Stateful Spark Streaming



class ErrorsPerUser(var numErrors: Int = 0) extends Serializable

// Called once per user per batch with that user's new events and previous state
val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
  if (values.exists(_.eventType == "logOff")) None   // user logged off: drop the state
  else {
    val s = state.getOrElse(new ErrorsPerUser)
    s.numErrors += values.count(_.eventType == "error")
    Option(s)
  }
}

// DAG: keep error and logOff events, keyed by user id
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))
val relevant = events.filter(e => e.eventType == "error" || e.eventType == "logOff")
val states = relevant.map(e => (e.id, e))
                     .updateStateByKey[ErrorsPerUser](updateFunc)

// Off-DAG
states.foreach(rdd => println("Num users experiencing errors:" + rdd.count))
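The state-update pattern on this slide can be exercised locally. The sketch below is plain Scala, not Spark Streaming: `Event`, `update`, and `step` are hypothetical names, the state is simplified to a running error count, and treating a logOff as "drop this user's state" is an assumption about the slide's intent. `update` has the same shape as an updateStateByKey function (new values and old state in, new state out), and `step` applies it to every user once per batch.

```scala
object StateUpdateSketch {
  // Hypothetical event type standing in for the slide's evObj.
  case class Event(id: String, eventType: String)

  // New values plus the previous state in, new state out; None removes the key.
  def update(values: Seq[Event], state: Option[Int]): Option[Int] =
    if (values.exists(_.eventType == "logOff")) None
    else Some(state.getOrElse(0) + values.count(_.eventType == "error"))

  // Applies one batch to the per-user state map. Users with existing state are
  // updated even when the batch holds no events for them.
  def step(states: Map[String, Int], batch: Seq[Event]): Map[String, Int] = {
    val byUser = batch.groupBy(_.id)
    val keys = states.keySet ++ byUser.keySet
    keys.flatMap { k =>
      update(byUser.getOrElse(k, Seq.empty), states.get(k)).map(n => k -> n)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batch1 = Seq(Event("shewolf", "error"), Event("shewolf", "error"))
    val batch2 = Seq(Event("shewolf", "error"), Event("hercman", "error"))
    val batch3 = Seq(Event("shewolf", "logOff"))
    val finalState = Seq(batch1, batch2, batch3).foldLeft(Map.empty[String, Int])(step)
    println(finalState)
  }
}
```

Note that in this simplified version any event creates state for its user, so a user who has only bought goods is tracked with a count of zero.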

Other Spark Subsystems

• Bagel (similar to Google Pregel)

• Sparkler (Matrix decomposition)

• (Machine Learning)



Teaser

• Future Meetup: Machine learning from real-time data streams



Data Science NFL Use-Case

Speed Kills - In Data Science and NFL

Speed Kills: Up-Tempo - No Huddle

Observation: best college and NFL offenses are using an up-tempo - no huddle strategy.

Hypothesis: NFL teams using the up-tempo - no huddle strategy have the best winning records.

Data Science Formula

Data Science Formula

Data science processes include:

1) Information gathering

2) Re-representation of the information in a schema that aids analysis

3) Development of insight through the manipulation of this representation

4) Creation of some knowledge product or direct action based on the insight

Data Science Formula

Information > Schema > Insight > Product

The data analysis may be organized in two key loops:

1) Searching loop (seeking, extracting, filtering information)

2) Understanding loop (modeling and conceptualization from a schema that best fits the evidence)

Data Collection

Collect data on all NFL offenses

Collect data on NFL offenses using an up-tempo - no huddle strategy

Collect data on NFL team records (win-losses)

Data Comparison

Compare data on all NFL offenses with data on NFL offenses using an up-tempo - no huddle strategy

Data Science Tools

• Scientific methods
• Analytical techniques
• Machine learning techniques
• Algorithm design and execution
• Data visualization and story-telling
• Statistics
• Math
• Computer engineering
• Data mining
• Data modeling

Predictive, Descriptive, Prescriptive Analytics

There are three types of data analysis:

Descriptive (business intelligence and data mining)

Predictive (forecasting)

Prescriptive (optimization and simulation)

Predictive Analytics

Predictive Modeling Techniques

Predictive Analytics

Prescriptive Analytics

Regression Analysis

NFL Data Sources

2002-2009 NFL PLAY BY PLAY

http://www.infochimps.com/tags/nfl

Supplemented stats source:

http://www.databasefootball.com/

http://www.pro-football-reference.com/

Data Results

From 2001 to 2012 only one NFL team used the up-tempo - no huddle offense consistently:

The New England Patriots

Note: In college football many teams started using the up-tempo - no huddle offense consistently in 2009. One team stands out:

The Oregon Ducks

Data Model - DVOA

DVOA (Defense-adjusted Value Over Average) is reported for total offense, with rushing and passing offense also broken out separately. All numbers are adjusted to an average schedule of opponents and an average percentage of fumbles recovered by the offense.

Exceptions are the three columns marked NON-ADJUSTED. Rushing includes all rushing, not just running backs.

Designed by Aaron Schatz of Football Outsiders

2001 Offensive Efficiency Ratings

Yr to Yr Correlation of Offense 2002 - 2009

All-Time DVOA Lists

NFL Team Plays per Game 2003

Team Records

Total Points

Average Points per Game

Total Plays

Average Plays per Game

Data Results

From 2001 to 2009 the New England Patriots had 107 wins and 37 losses, the second-best record in the NFL (behind Manning's Indianapolis Colts).

They appeared in five (5) Super Bowls, winning three (3).

From 2001 to 2012 the New England Patriots had 146 wins and 46 losses, the best record in the NFL.

Data Results

In 128 games NE scored a total of 3,356 points (26.2 ppg), second to Manning's Colts.

NE runs an up-tempo, no-huddle offense.

IND runs a no-huddle offense.

Data Results

NE in 5 SBs - wins 3 SBs (up-tempo - no-huddle offense)

IND in 2 SBs - wins 1 SB (no-huddle offense)

NE Stats 2001 - 2012

Yr - W/L - Plpg/R - Off R - Tot pps

(Yr = season, W/L = wins-losses, Plpg/R = plays per game / league rank, Off R = offense rank, Tot pps = total points)

2001 11-5 62.5/18 6 371 - Super Bowl Champs
2002 9-7 64.4/8 16 381
2003 14-2 66.4/1 12 348 - Super Bowl Champs
2004 14-2 64.3/7 4 437 - Super Bowl Champs
2005 10-6 63.7/15 10 379
2006 12-4 66.4/4 7 385
2007 16-0 65.8/3 1 589 - Lost SB
2008 11-5 68.4/1 8 410
2009 10-6 67/2 6 427
2010 14-2 62.6/21 1 518
2011 13-3 67.4/2 3 513 - Lost SB
2012 12-4 74.3/1 1 557
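The win-loss totals and the points-per-game figure quoted earlier can be re-derived from the NE table above. The following is a small sketch in plain Scala; the object and function names are made up for illustration, and the reading that the "128 games" are the 2002-2009 regular seasons is an assumption, made because those eight seasons are the ones whose point totals add up to 3,356.

```scala
object PatriotsTotalsSketch {
  // One row per season, copied from the NE table: (year, wins, losses, total points).
  val seasons: Seq[(Int, Int, Int, Int)] = Seq(
    (2001, 11, 5, 371), (2002, 9, 7, 381), (2003, 14, 2, 348),
    (2004, 14, 2, 437), (2005, 10, 6, 379), (2006, 12, 4, 385),
    (2007, 16, 0, 589), (2008, 11, 5, 410), (2009, 10, 6, 427),
    (2010, 14, 2, 518), (2011, 13, 3, 513), (2012, 12, 4, 557)
  )

  // Sums wins, losses, and points over an inclusive range of seasons.
  def totals(from: Int, to: Int): (Int, Int, Int) = {
    val rows = seasons.filter { case (yr, _, _, _) => yr >= from && yr <= to }
    (rows.map(_._2).sum, rows.map(_._3).sum, rows.map(_._4).sum)
  }

  def main(args: Array[String]): Unit = {
    println(totals(2001, 2009)) // (wins, losses, points) for 2001-2009
    println(totals(2001, 2012)) // (wins, losses, points) for 2001-2012
    val (w, l, pts) = totals(2002, 2009)
    val games = w + l
    println(f"$pts points in $games games = ${pts.toDouble / games}%.1f ppg")
  }
}
```

Keeping the rows as data makes it easy to repeat the same check for the Denver, Buffalo, and Oregon tables.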

Denver Broncos Stats 2011 - 2012

Yr - W/L - Plpg/R - Off R - Tot pps

2011 8-8 63.6/16 25 309

2012 13-3 69.2/4 2 481

Buffalo Bills - 1990-1993

Yr - W/L - Plpg/R - Off R - Tot pps

1990 13-3 58.2/21 1 428 - Lost SB
1991 13-3 66/2 2 458 - Lost SB
1992 11-5 67.9/1 3 381 - Lost SB
1993 12-4 67.3/3 7 329 - Lost SB

Oregon Ducks 2009-2012

Yr - W/L - Plpg/R - Off R - Tot pps

2009 10-3 68.1/54 8 468 - Lost Rose B - Final Rank 11

2010 12-1 78.8/5 1 611 - Lost BCS Ch - Final Rank 3

2011 12-2 72.5/32 3 644 - Won Rose B - Final Rank 4

2012 12-1 81.4/9 2 637 - Won Fiesta B - Final Rank 2

Forecasting Principles - Framework

Stages of Forecasting

Forecasting Methods Selection Chart

Forecasting Methodology Tree

Change in Offensive Play Calling

Ave Pass Attempts per Game 35 Yrs

Completion % 35 yrs

Interception % 35 yrs

Passing: Return vs. Risk

QB Ratings 35 yrs

Series 1st Down Likelihood on 1st and 10

Offensive Points by Yards

Season Wins by Passing Efficiency

Plays per Game per Year

Speed Kills: Up-Tempo - No Huddle

Prediction: More NFL offenses will utilize the up-tempo - no huddle strategy.

Prediction: More NFL offenses will pass.

Forecast: NFL teams using the up-tempo - no huddle strategy will have the best winning records.

Visualizing Data

When does a Data Scientist's job end?

• Data Scientists must be able to tell stories with their findings.

• The audience may not understand regression analysis, weighted ranks, etc.

• Must be able to present findings in a clear, concise, and easy to look at manner.

Which would your manager prefer?

VS

Big Data Visualization Tools

• Pentaho
• Tableau
• Jaspersoft
• Pervasive
• Birst
• Many many more

Pentaho

An open source full suite of data integration and analytics tools
- Data Integration
- Pixel perfect reports
- OLAP cubes
- Dashboards

Works with many Big Data sources including Hadoop, Hive, HBase, Cassandra, MongoDB, and Couchbase along with traditional data sources.

OLAP engine automatically generates SQL using Mondrian.

Pentaho

Has integration with WEKA, but no other statistical languages.

Unfortunately, like all other user-driven visualization tools today, data must be extracted to memory or a database. Pentaho makes doing this easy.

The commercial edition includes Instaview, a capability that automates this process.

Work is in progress on support for Hive SQL.

Demo

Demo creating visualizations

Speed Kills

Thank You

Presentation by:

Michael Malak

Chris Deptula

Michael Walker
