Spark - Shark Data Analytics Stack on a Hadoop Cluster


description

Spark - Shark Data Analytics Stack on a Hadoop Cluster. April 22, 2013. Big Data Week Data Science Group. April 23, 2013. Michael Malak, Data Analytics Senior Engineer at Time Warner Cable, technicaltidbit.com. Chris Deptula, Senior Big Data Consultant, 317.840.2935.

Transcript of Spark - Shark Data Analytics Stack on a Hadoop Cluster

Page 1: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Spark - Shark Data Analytics Stack on a Hadoop Cluster

April 22, 2013

Page 2: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Big Data Week Data Science Group

April 23, 2013

Page 3: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Michael Malak

Data Analytics Senior Engineer at Time Warner Cable

technicaltidbit.com

Page 4: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Chris Deptula

Senior Big Data Consultant

[email protected]

@chrisdeptula

http://www.openbi.com

Page 5: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Michael Walker

Managing [email protected]

www.rosebt.com

Page 6: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Agenda

• The Big Data Problem
• Spark Ecosystem
• NFL Data Science Use Case
• Visualizing Data

Page 7: Spark - Shark Data Analytics Stack on a Hadoop Cluster

The Big Data Problem

Page 8: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Speed Kills in Data Science

Page 9: Spark - Shark Data Analytics Stack on a Hadoop Cluster
Page 10: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Hype Cycle for Emerging Tech 2012

Page 11: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Hype Cycle for Big Data 2012

Page 12: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Evolution of DW Architecture

Page 13: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Emerging DW Architecture

Page 14: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Next-Generation Data Architecture

Page 15: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Big Data Ecosystem Parts

Page 16: Spark - Shark Data Analytics Stack on a Hadoop Cluster

DW Database Systems MQ 2013

Page 17: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Total Enterprise Data Growth 2005-2015

Page 18: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Structured vs Unstructured Data

Page 19: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Modern DW/BI Analytical Ecosystems

Page 20: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Big Data Ecosystem Parts

Page 21: Spark - Shark Data Analytics Stack on a Hadoop Cluster

The Internet of Things

Page 22: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Big Data 4 V's

Page 23: Spark - Shark Data Analytics Stack on a Hadoop Cluster

New World of Databases

Page 24: Spark - Shark Data Analytics Stack on a Hadoop Cluster

New World of Databases

Page 25: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Hadoop

Page 26: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Hadoop

Page 27: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Hadoop

Page 28: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Big Data Vendor Focused on Hadoop and NoSQL Revenue 2012

Page 29: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Big Data Analytics Infrastructure

Page 30: Spark - Shark Data Analytics Stack on a Hadoop Cluster

The Spark Ecosystem

Page 31: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Agenda

• What Hadoop gives us
• What everyone is complaining about in 2013
• Spark
  o Berkeley Team
  o BDAS (Berkeley Data Analytics Stack)
  o RDDs (Resilient Distributed Datasets)
  o Shark
  o Spark Streaming
  o Other Spark subsystems

Global Big Data Apr 23, 2013

technicaltidbit.com 31

Page 32: Spark - Shark Data Analytics Stack on a Hadoop Cluster

What Hadoop Gives Us

• HDFS
• Map/Reduce

Page 33: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Hadoop: HDFS

Image from mark.chmarny.com

Page 34: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Hadoop: Map/Reduce

Image from people.apache.org/~rdonkin

Image from blog.octo.com

Page 35: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Map/Reduce Tools

[Stack diagram labels: Linux; Hadoop; HBase, Pig, Hive; App, Pig Script, HiveQL]

Page 36: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Hadoop Distribution Dogs in the Race

Hadoop Distribution Query Tool

Stinger

Apache Drill

Page 37: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Other Open Source Solutions

• Druid
• Spark

Page 38: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Not just caching, but streaming

• 1st generation: HDFS
• 2nd generation: Caching & “Push” Map/Reduce
• 3rd generation: Streaming

Page 39: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Berkeley Team

• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into 6 year program

Image from Ion Stoica’s slides from his Strata 2013 presentation

Page 40: Spark - Shark Data Analytics Stack on a Hadoop Cluster

BDAS (Berkeley Data Analytics Stack)

[Stack diagram labels: Linux; Mesos; Hadoop/HDFS; Spark; Bagel, Shark, Spark Streaming; Spark App, Bagel App, Shark App, Spark Streaming App]

Page 41: Spark - Shark Data Analytics Stack on a Hadoop Cluster

RDDs (Resilient Distributed Datasets)

Image from Matei Zaharia’s paper

Page 42: Spark - Shark Data Analytics Stack on a Hadoop Cluster

RDDs: Laziness

val lines = spark.textFile("hdfs://...")          // lazy
val errors = lines.filter(_.startsWith("ERROR"))  // lazy; _.startsWith("ERROR") is shorthand for x => x.startsWith("ERROR")
                  .map(_.split('\t')(2))          // lazy
                  .filter(_.contains("foo"))      // lazy
val cnt = errors.count                            // Action! Only now is anything computed

All Lazy: textFile, filter, map

Action!: count
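The laziness on this slide can be tried without a cluster. The sketch below is only an analogy: it uses a plain Scala collection view instead of an RDD, and the log lines and the `inspected` counter are made up for illustration. It assumes the view behaves like the RDD pipeline in one respect: building the chain of filter and map does no work, and the work happens when a result is requested.

```scala
object LazyPipeline {
  // Hypothetical tab-separated log lines standing in for the HDFS file.
  val logLines: List[String] = List(
    "ERROR\tdb\tfoo timeout",
    "INFO\tweb\tfoo served",
    "ERROR\tweb\tbar missing",
    "ERROR\tdb\tfoo refused"
  )

  // Returns (lines inspected before the action, count, lines inspected after the action).
  def run(): (Int, Int, Int) = {
    var inspected = 0
    // "Transformations": a view only records what to do, it does not do it yet.
    val errors = logLines.view
      .filter { line => inspected += 1; line.startsWith("ERROR") }
      .map(_.split('\t')(2))
      .filter(_.contains("foo"))
    val before = inspected
    // "Action": asking for the size forces the whole chain to run.
    val cnt = errors.size
    (before, cnt, inspected)
  }

  def main(args: Array[String]): Unit = {
    val (before, cnt, after) = run()
    println(s"inspected before action: $before, count: $cnt, inspected after action: $after")
  }
}
```

Nothing is inspected until `size` is called, which mirrors how `count` triggers the RDD computation.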

Page 43: Spark - Shark Data Analytics Stack on a Hadoop Cluster

RDDs: Transformations vs. Actions

Transformations
map(func)
filter(func)
flatMap(func)
sample(withReplacement, frac, seed)
union(otherDataset)
groupByKey[K,V]
reduceByKey[K,V](func)
join[K,V,W](otherDataset)
cogroup[K,V,W1,W2](other1, other2)
cartesian[U](otherDataset)
sortByKey[K,V]

Actions
reduce(func)
collect()
count()
take(n)
first()
saveAsTextFile(path)
saveAsSequenceFile(path)
foreach(func)

[K,V] in Scala is the same as <K,V> templates in C++ or generics in Java
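To make the key/value operations and the [K,V] type parameters concrete, here is a minimal local sketch. It is not Spark code: `reduceByKey` and `groupByKey` below are hypothetical helper functions over ordinary Scala sequences, written only to show what the same-named RDD transformations compute for each key.

```scala
object PairOps {
  // Local analogue of reduceByKey: merge all values that share a key using `func`.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(func) }

  // Local analogue of groupByKey: collect all values that share a key.
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  def main(args: Array[String]): Unit = {
    val words = Seq("spark", "shark", "spark", "hive", "spark")
    // Classic word count: map each word to (word, 1), then sum the ones per word.
    val counts = reduceByKey(words.map(w => (w, 1)))(_ + _)
    println(counts)
  }
}
```

Here K is inferred as String and V as Int, just as a Java programmer would write Map<String, Integer>.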

Page 44: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Hive vs. Shark

[Diagram: Hive runs HiveQL over HDFS files; Shark runs HiveQL over HDFS files + RDDs]

Page 45: Spark - Shark Data Analytics Stack on a Hadoop Cluster

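Shark: Copy from HDFS to RDD

CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;

CREATE TABLE wiki_cached AS SELECT * FROM wiki;

Either statement creates a table that is stored in the cluster’s memory using RDD.cache(): the first through the shark.cache table property, the second through the _cached suffix on the table name.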

Page 46: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Shark: Just a Shim

Images from Reynold Xin’s presentation

Page 47: Spark - Shark Data Analytics Stack on a Hadoop Cluster

What about “Big Data”?

[Chart: Shark effectiveness against data size, on a scale of KB, MB, GB, TB, PB]

Page 48: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Median Hadoop job input size

Image from Reynold Xin’s presentation

Page 49: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Spark Streaming: Motivation

[Diagram labels: x1,000,000 clients; HDFS]

Page 50: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Spark Streaming: DStream

• “A series of small batches”

[Diagram: a DStream drawn as a sequence of RDDs, one per 2 sec interval, holding events such as:]

{"id": "hercman", "eventType": "buyGoods"}
{"id": "shewolf", "eventType": "error"}
. . .
{"id": "catlover", "eventType": "buyGoods"}
{"id": "hercman", "eventType": "logOff"}
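The idea of "a series of small batches" can be sketched locally. The code below is not the Spark Streaming API: `Event`, its `timestampMs` field, and `toBatches` are assumptions made for illustration. It simply cuts a list of timestamped events into consecutive 2 second groups, the way a DStream presents incoming data as one RDD per interval.

```scala
object MicroBatches {
  // Hypothetical event shape; timestampMs is an assumed field that is not on the slide.
  case class Event(id: String, eventType: String, timestampMs: Long)

  // Assign each event to a batch by integer-dividing its timestamp by the interval,
  // then return the non-empty batches in time order.
  def toBatches(events: Seq[Event], intervalMs: Long): Seq[Seq[Event]] =
    events.groupBy(_.timestampMs / intervalMs).toSeq.sortBy(_._1).map(_._2)

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("hercman", "buyGoods", 500L),
      Event("shewolf", "error", 1500L),
      Event("catlover", "buyGoods", 2500L),
      Event("hercman", "logOff", 4100L)
    )
    toBatches(events, 2000L).zipWithIndex.foreach { case (batch, i) =>
      println(s"batch $i: ${batch.map(e => e.id + "/" + e.eventType).mkString(", ")}")
    }
  }
}
```

One simplification to keep in mind: this sketch skips intervals with no events, whereas a streaming system produces a batch for every interval, even an empty one.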

Page 51: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Spark Streaming: DAG

The DAG

Kafka DStream[String] (JSON)
  → DStream.transform
      → DStream[EvObj] → DStream.filter(_.eventType == "error") → DStream.foreach(println)
      → DStream[EvObj] → DStream.filter(_.eventType == "buyGoods") → DStream.map(e => (e.id, 1)) → DStream.groupByKey → DStream.foreach(println)

Page 52: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Spark Streaming: Example Code

// Initialize
val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
val messages = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

// DAG: parse each JSON message into an event object
val events: DStream[evObj] = messages.transform(rdd => rdd.map(new evObj(_)))

// Branch 1: print the number of error events in each batch
val errorCounts = events.filter(_.eventType == "error")
errorCounts.foreach(rdd => println(rdd.count))

// Branch 2: print the number of distinct users buying in each batch
val usersBuying = events.filter(_.eventType == "buyGoods")
                        .map(e => (e.id, 1))
                        .groupByKey
usersBuying.foreach(rdd => println(rdd.count))

// Go
ssc.start

Page 53: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Stateful Spark Streaming

class ErrorsPerUser(var numErrors: Int = 0) extends Serializable

val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
  if (values.exists(_.eventType == "logOff")) None   // user logged off: drop the state
  else {
    val s = state.getOrElse(new ErrorsPerUser)       // first event for this user: start at 0
    values.foreach(e => e.eventType match {
      case "error" => s.numErrors += 1
      case _ =>
    })
    Option(s)
  }
}

// DAG: keep error and logOff events, keyed by user id, so updateFunc can see both
val events: DStream[evObj] = messages.transform(rdd => rdd.map(new evObj(_)))
val relevant = events.filter(e => e.eventType == "error" || e.eventType == "logOff")
val states = relevant.map(e => (e.id, e))
                     .updateStateByKey[ErrorsPerUser](updateFunc)

// Off-DAG
states.foreach(rdd => println("Num users experiencing errors:" + rdd.count))
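What updateStateByKey does from batch to batch can be sketched without Spark. The code below is an illustration under stated assumptions, not the library's implementation: `step` and `countErrors` are hypothetical names, events are reduced to their eventType string, and the state is a plain Int. It assumes the behavior that the update function is called for every key that has new values or existing state, and that returning None removes the key.

```scala
object StatefulSketch {
  // One batch step of a local analogue of updateStateByKey.
  def step[K, V, S](state: Map[K, S], batch: Seq[(K, V)])
                   (update: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
    // Group the new values of this batch by key.
    val newValues: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    // Visit every key with old state or new values; None drops the key.
    (state.keySet ++ newValues.keySet).toSeq.flatMap { k =>
      update(newValues.getOrElse(k, Seq.empty), state.get(k)).map(s => k -> s)
    }.toMap
  }

  // Count "error" events per user; forget the user once a "logOff" arrives.
  val countErrors: (Seq[String], Option[Int]) => Option[Int] =
    (eventTypes, state) =>
      if (eventTypes.contains("logOff")) None
      else Some(state.getOrElse(0) + eventTypes.count(_ == "error"))

  def main(args: Array[String]): Unit = {
    val batch1 = Seq("shewolf" -> "error", "shewolf" -> "error", "hercman" -> "error")
    val batch2 = Seq("hercman" -> "logOff")
    val s1 = step(Map.empty[String, Int], batch1)(countErrors)
    val s2 = step(s1, batch2)(countErrors)
    println(s"after batch 1: $s1")
    println(s"after batch 2: $s2")
  }
}
```

The state for a user who sends nothing in a batch is carried forward unchanged, because the update function still runs for that key with an empty list of values.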

Page 54: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Other Spark Subsystems

• Bagel (similar to Google Pregel)
• Sparkler (Matrix decomposition)
• (Machine Learning)

Page 55: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Teaser

• Future Meetup: Machine learning from real-time data streams

Page 56: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Data Science NFL Use-Case

Page 57: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Speed Kills - In Data Science and NFL

Page 58: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Speed Kills: Up-Tempo - No Huddle

Observation: The best college and NFL offenses are using an up-tempo - no huddle strategy.

Hypothesis: NFL teams using the up-tempo - no huddle strategy have the best winning records.

Page 59: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Data Science Formula

Page 60: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Data Science Formula

Data science processes include:

1) Information gathering

2) Re-representation of the information in a schema that aids analysis

3) Development of insight through the manipulation of this representation

4) Creation of some knowledge product or direct action based on the insight

Page 61: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Data Science Formula

Information > Schema > Insight > Product

The data analysis may be organized in two key loops:

1) Searching loop (seeking, extracting, filtering information)

2) Understanding loop (modeling and conceptualization from a schema that best fits the evidence)

Page 62: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Data Collection

Collect data on all NFL offenses

Collect data on NFL offenses using an up-tempo - no huddle strategy

Collect data on NFL team records (win-losses)

Page 63: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Data Comparison

Compare data on all NFL offenses with data on NFL offenses using an up-tempo - no huddle strategy

Page 64: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Data Science Tools

• Scientific methods
• Analytical techniques
• Machine learning techniques
• Algorithm design and execution
• Data visualization and story-telling
• Statistics
• Math
• Computer engineering
• Data mining
• Data modeling

Page 65: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Predictive, Descriptive, Prescriptive Analytics

There are three types of data analysis:

Descriptive (business intelligence and data mining)

Predictive (forecasting)

Prescriptive (optimization and simulation)

Page 66: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Predictive Analytics

Page 67: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Predictive Modeling Techniques

Page 68: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Predictive Analytics

Page 69: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Prescriptive Analytics

Page 70: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Regression Analysis

Page 71: Spark - Shark Data Analytics Stack on a Hadoop Cluster

NFL Data Sources

2002-2009 NFL PLAY BY PLAY

http://www.infochimps.com/tags/nfl

Supplemented stats source:

http://www.databasefootball.com/

http://www.pro-football-reference.com/

Page 72: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Data Results

From 2001 to 2012 only one NFL team used the up-tempo - no huddle offense consistently:

The New England Patriots

Note: In college football many teams started using the up-tempo - no huddle offense consistently in 2009. One team stands out:

The Oregon Ducks

Page 73: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Data Model - DVOA

DVOA (Defense-adjusted Value Over Average) for total offense as well as rushing and passing offense separated. All numbers are adjusted to an average schedule of opponents and an average percentage of fumbles recovered by the offense.

Exceptions are the three columns marked NON-ADJUSTED. Rushing includes all rushing, not just running backs.

Designed by Aaron Schatz of Football Outsiders

Page 74: Spark - Shark Data Analytics Stack on a Hadoop Cluster

2001 Offensive Efficiency Ratings

Page 75: Spark - Shark Data Analytics Stack on a Hadoop Cluster

2002 Offensive Efficiency Ratings

Page 76: Spark - Shark Data Analytics Stack on a Hadoop Cluster

2003 Offensive Efficiency Ratings

Page 77: Spark - Shark Data Analytics Stack on a Hadoop Cluster

2004 Offensive Efficiency Ratings

Page 78: Spark - Shark Data Analytics Stack on a Hadoop Cluster

2005 Offensive Efficiency Ratings

Page 79: Spark - Shark Data Analytics Stack on a Hadoop Cluster

2006 Offensive Efficiency Ratings

Page 80: Spark - Shark Data Analytics Stack on a Hadoop Cluster

2007 Offensive Efficiency Ratings

Page 81: Spark - Shark Data Analytics Stack on a Hadoop Cluster

2008 Offensive Efficiency Ratings

Page 82: Spark - Shark Data Analytics Stack on a Hadoop Cluster

2009 Offensive Efficiency Ratings

Page 83: Spark - Shark Data Analytics Stack on a Hadoop Cluster

2011 Offensive Efficiency Ratings

Page 84: Spark - Shark Data Analytics Stack on a Hadoop Cluster

2012 Offensive Efficiency Ratings

Page 85: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Yr to Yr Correlation of Offense 2002 - 2009

Page 86: Spark - Shark Data Analytics Stack on a Hadoop Cluster

All-Time DVOA Lists

Page 87: Spark - Shark Data Analytics Stack on a Hadoop Cluster

All-Time DVOA Lists

Page 88: Spark - Shark Data Analytics Stack on a Hadoop Cluster

NFL Team Plays per Game 2003

Page 89: Spark - Shark Data Analytics Stack on a Hadoop Cluster

NFL Team Plays per Game 2004

Page 90: Spark - Shark Data Analytics Stack on a Hadoop Cluster

NFL Team Plays per Game 2005

Page 91: Spark - Shark Data Analytics Stack on a Hadoop Cluster

NFL Team Plays per Game 2006

Page 92: Spark - Shark Data Analytics Stack on a Hadoop Cluster

NFL Team Plays per Game 2007

Page 93: Spark - Shark Data Analytics Stack on a Hadoop Cluster

NFL Team Plays per Game 2008

Page 94: Spark - Shark Data Analytics Stack on a Hadoop Cluster

NFL Team Plays per Game 2009

Page 95: Spark - Shark Data Analytics Stack on a Hadoop Cluster

NFL Team Plays per Game 2011

Page 96: Spark - Shark Data Analytics Stack on a Hadoop Cluster

NFL Team Plays per Game 2012

Page 97: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Team Records

Page 98: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Total Points

Page 99: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Average Points per Game

Page 100: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Total Plays

Page 101: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Average Plays per Game

Page 102: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Data Results

From 2001 to 2009 the New England Patriots had 107 wins and 37 losses - the second-best record in the NFL (behind Manning's Indianapolis Colts).

They appeared in five (5) Super Bowls, winning three (3).

From 2001 to 2012 the New England Patriots had 146 wins and 46 losses - the best record in the NFL.

Page 103: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Data Results

In 128 games NE scored a total of 3,356 points (26.2 ppg)

Second to Manning's Colts.

NE runs an up-tempo - no huddle offense

IND runs a no huddle offense
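The totals quoted in these Data Results slides can be rechecked from the per-season figures on the NE Stats 2001 - 2012 slide. The sketch below copies wins, losses and total points from that slide; the object and function names are made up for illustration. It assumes that the 128 games correspond to the eight seasons 2002-2009, which is the range that reproduces 3,356 points.

```scala
object PatriotsTotals {
  // (year, wins, losses, total points) copied from the NE Stats 2001 - 2012 slide.
  val seasons: Seq[(Int, Int, Int, Int)] = Seq(
    (2001, 11, 5, 371), (2002, 9, 7, 381), (2003, 14, 2, 348), (2004, 14, 2, 437),
    (2005, 10, 6, 379), (2006, 12, 4, 385), (2007, 16, 0, 589), (2008, 11, 5, 410),
    (2009, 10, 6, 427), (2010, 14, 2, 518), (2011, 13, 3, 513), (2012, 12, 4, 557)
  )

  private def span(from: Int, to: Int) =
    seasons.filter { case (yr, _, _, _) => yr >= from && yr <= to }

  // Total (wins, losses) over the given seasons, inclusive.
  def record(from: Int, to: Int): (Int, Int) =
    (span(from, to).map(_._2).sum, span(from, to).map(_._3).sum)

  // Total points divided by regular-season games played (wins + losses).
  def pointsPerGame(from: Int, to: Int): Double = {
    val games = span(from, to).map(s => s._2 + s._3).sum
    span(from, to).map(_._4).sum.toDouble / games
  }

  def main(args: Array[String]): Unit = {
    println(s"2001-2009 record: ${record(2001, 2009)}")
    println(s"2001-2012 record: ${record(2001, 2012)}")
    println(f"2002-2009 points per game: ${pointsPerGame(2002, 2009)}%.1f")
  }
}
```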

Page 104: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Data Results

NE in 5 SBs - wins 3 SBs (up-tempo - no huddle offense)

IND in 2 SBs - wins 1 SB (no huddle offense)

Page 105: Spark - Shark Data Analytics Stack on a Hadoop Cluster

NE Stats 2001 - 2012

Yr - W/L - Plpg/R - Off R - Tot pps

2001 11-5 62.5/18 6 371 - Super Bowl Champs

2002 9-7 64.4/8 16 381

2003 14-2 66.4/1 12 348 - Super Bowl Champs

2004 14-2 64.3/7 4 437 - Super Bowl Champs

2005 10-6 63.7/15 10 379

2006 12-4 66.4/4 7 385

2007 16-0 65.8/3 1 589 - Lost SB

2008 11-5 68.4/1 8 410

2009 10-6 67/2 6 427

2010 14-2 62.6/21 1 518

2011 13-3 67.4/2 3 513 - Lost SB

2012 12-4 74.3/1 1 557
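One simple way to put a number on the "speed kills" hypothesis for a single team is a correlation coefficient between plays per game and wins. This is only a sketch of that idea, not the analysis the presenters ran: the Pearson formula is a swapped-in, standard technique, the names are hypothetical, and twelve seasons of one team are far too few to draw conclusions from. The data points are copied from the NE table above.

```scala
object TempoCorrelation {
  // (plays per game, wins) for New England, 2001-2012, copied from the table above.
  val nePlaysAndWins: Seq[(Double, Double)] = Seq(
    (62.5, 11.0), (64.4, 9.0), (66.4, 14.0), (64.3, 14.0), (63.7, 10.0), (66.4, 12.0),
    (65.8, 16.0), (68.4, 11.0), (67.0, 10.0), (62.6, 14.0), (67.4, 13.0), (74.3, 12.0)
  )

  // Pearson correlation coefficient: covariance divided by the product of the spreads.
  def pearson(pairs: Seq[(Double, Double)]): Double = {
    val n = pairs.size
    val meanX = pairs.map(_._1).sum / n
    val meanY = pairs.map(_._2).sum / n
    val cov = pairs.map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = pairs.map { case (x, _) => math.pow(x - meanX, 2) }.sum
    val varY = pairs.map { case (_, y) => math.pow(y - meanY, 2) }.sum
    cov / math.sqrt(varX * varY)
  }

  def main(args: Array[String]): Unit =
    println(f"plays per game vs. wins, NE 2001-2012: r = ${pearson(nePlaysAndWins)}%.2f")
}
```

A value near +1 would support the hypothesis for this sample, a value near 0 would not; comparing across all teams, as the Data Collection slide proposes, would be the more meaningful test.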

Page 106: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Denver Broncos Stats 2011 - 2012

Yr - W/L - Plpg/R - Off R - Tot pps

2011 8-8 63.6/16 25 309

2012 13-3 69.2/4 2 481

Page 107: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Buffalo Bills - 1990-1993

Yr - W/L - Plpg/R - Off R - Tot pps

1990 13-3 58.2/21 1 428 - Lost SB

1991 13-3 66/2 2 458 - Lost SB

1992 11-5 67.9/1 3 381 - Lost SB

1993 12-4 67.3/3 7 329 - Lost SB

Page 108: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Oregon Ducks 2009-2012

Yr - W/L - Plpg/R - Off R - Tot pps

2009 10-3 68.1/54 8 468 - Lost Rose B - Final Rank 11

2010 12-1 78.8/5 1 611 - Lost BCS Ch - Final Rank 3

2011 12-2 72.5/32 3 644 - Won Rose B - Final Rank 4

2012 12-1 81.4/9 2 637 - Won Fiesta B - Final Rank 2

Page 109: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Forecasting Principles - Framework

Page 110: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Stages of Forecasting

Page 111: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Forecasting Methods Selection Chart

Page 112: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Forecasting Methodology Tree

Page 113: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Change in Offensive Play Calling

Page 114: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Ave Pass Attempts per Game 35 Yrs

Page 115: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Completion % 35 yrs

Page 116: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Interception % 35 yrs

Page 117: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Passing: Return vs. Risk

Page 118: Spark - Shark Data Analytics Stack on a Hadoop Cluster

QB Ratings 35 yrs

Page 119: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Series 1st Down Likelihood on 1st and 10

Page 120: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Offensive Points by Yards

Page 121: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Season Wins by Passing Efficiency

Page 122: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Plays per Game per Year

Page 123: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Speed Kills: Up-Tempo - No Huddle

Prediction: More NFL offenses will utilize the up-tempo - no huddle strategy.

Prediction: More NFL offenses will pass.

Forecast: NFL teams using the up-tempo - no huddle strategy will have the best winning records.

Page 124: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Visualizing Data

Page 125: Spark - Shark Data Analytics Stack on a Hadoop Cluster

When does a Data Scientist's job end?

• Data Scientists must be able to tell stories with their findings.

• The audience may not understand regression analysis, weighted ranks, etc.

• They must be able to present findings in a clear, concise, and easy-to-read manner.

Page 126: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Which would your manager prefer?

VS

Page 127: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Big Data Visualization Tools

• Pentaho
• Tableau
• Jaspersoft
• Pervasive
• Birst
• Many many more

Page 128: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Pentaho

An open source full suite of data integration and analytics tools
- Data Integration
- Pixel perfect reports
- OLAP cubes
- Dashboards

Works with many Big Data sources including Hadoop, Hive, HBase, Cassandra, MongoDB, and Couchbase along with traditional data sources.

OLAP engine automatically generates SQL using Mondrian.

Page 129: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Pentaho

Has integration with WEKA, but with no other statistical languages.

Unfortunately, as with all other user-driven visualization tools today, data must first be extracted to memory or a database. Pentaho makes doing this easy.

The commercial edition includes a capability, called Instaview, to automate this process.

Work on creating support for Hive SQL.

Page 130: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Demo

Demo creating visualizations

Page 131: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Speed Kills

Page 132: Spark - Shark Data Analytics Stack on a Hadoop Cluster

Thank You

Presentation by:

Michael Malak

Chris Deptula

Michael Walker