Spark - Shark Data Analytics Stack on a Hadoop Cluster
April 22, 2013
Big Data Week
Data Science Group
April 23, 2013
Michael Malak
Data Analytics Senior Engineer at Time Warner Cable
technicaltidbit.com
Chris Deptula
Senior Big Data [email protected]
@chrisdeptula
http://www.openbi.com
Michael Walker
Managing [email protected]
www.rosebt.com
Agenda
• The Big Data Problem
• Spark Ecosystem
• NFL Data Science Use Case
• Visualizing Data
The Big Data Problem
Speed Kills in Data Science
Hype Cycle for Emerging Tech 2012
Hype Cycle for Big Data 2012
Evolution of DW Architecture
Emerging DW Architecture
Next-Generation Data Architecture
Big Data Ecosystem Parts
DW Database Systems MQ 2013
Total Enterprise Data Growth 2005-2015
Structured vs Unstructured Data
Modern DW/BI Analytical Ecosystems
Big Data Ecosystem Parts
The Internet of Things
Big Data 4 V's
New World of Databases
Hadoop
Big Data Vendor Focused on Hadoop and NoSQL Revenue 2012
Big Data Analytics Infrastructure
The Spark Ecosystem
Agenda
• What Hadoop gives us
• What everyone is complaining about in 2013
• Spark
  o Berkeley Team
  o BDAS (Berkeley Data Analytics Stack)
  o RDDs (Resilient Distributed Datasets)
  o Shark
  o Spark Streaming
  o Other Spark subsystems
Global Big Data Apr 23, 2013
technicaltidbit.com 31
What Hadoop Gives Us
• HDFS
• Map/Reduce
Hadoop: HDFS
Image from mark.chmarny.com
Hadoop: Map/Reduce
Image from people.apache.org/~rdonkin
Image from blog.octo.com
Map/Reduce Tools
[Diagram: Map/Reduce tool stack showing Linux, Hadoop, HBase, Pig, and Hive, with an App, a Pig Script, and HiveQL on top]
Hadoop Distribution Dogs in the Race
Hadoop Distribution Query Tool
Stinger
Apache Drill
Other Open Source Solutions
• Druid
• Spark
Not just caching, but streaming
• 1st generation: HDFS
• 2nd generation: Caching & “Push” Map/Reduce
• 3rd generation: Streaming
Berkeley Team
• 40 students
• 8 faculty
• 3 staff software engineers
• Silicon Valley style skunkworks office space
• 2 years into 6 year program
Image from Ion Stoica’s slides from Strata 2013 presentation
Spark
BDAS (Berkeley Data Analytics Stack)
[Diagram: BDAS stack showing Linux, Mesos, Hadoop/HDFS, and Bagel, Shark, and Spark Streaming, with a Spark App, Bagel App, Shark App, and Spark Streaming App on top]
RDDs (Resilient Distributed Datasets)
Image from Matei Zaharia’s paper
RDDs: Laziness
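val lines = spark.textFile("hdfs://...")
// Transformations: all lazy, nothing is read or computed yet
val errors = lines.filter(_.startsWith("ERROR"))  // same as x => x.startsWith("ERROR")
                  .map(_.split('\t')(2))
                  .filter(_.contains("foo"))
// Action! count triggers the actual computation
val cnt = errors.count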
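The laziness on this slide can be imitated without a cluster. The following is a minimal sketch that uses a plain Scala collection view as a stand-in for an RDD; that substitution, the object and method names, and the sample log lines are all assumptions made for illustration and are not Spark's API. Nothing is evaluated while the pipeline is being built; forcing the view plays the role of the action.

```scala
object LazyPipelineSketch {
  // Hypothetical sample log lines (tab-separated: level, time, message).
  val sampleLines: List[String] = List(
    "ERROR\t10:01\tfoo failed",
    "INFO\t10:02\tall good",
    "ERROR\t10:03\tbar failed",
    "ERROR\t10:04\tfoo timed out"
  )

  // Returns (predicate evaluations before the action, count, evaluations after).
  def run(lines: List[String]): (Int, Int, Int) = {
    var evaluations = 0

    // A view records the steps without running them, loosely like transformations.
    val errors = lines.view
      .filter { l => evaluations += 1; l.startsWith("ERROR") }
      .map(_.split('\t')(2))
      .filter(_.contains("foo"))

    val before = evaluations // still 0: the pipeline has only been described
    val cnt = errors.size    // forcing the view stands in for the count action
    (before, cnt, evaluations)
  }

  def main(args: Array[String]): Unit = {
    val (before, cnt, after) = run(sampleLines)
    println("evaluations before action: " + before)
    println("cnt: " + cnt)
    println("evaluations after action: " + after)
  }
}
```

A view is only an analogy: it is lazy in the same sense, but it is neither distributed nor fault tolerant, and it recomputes on every traversal unless the result is materialized.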
RDDs: Transformations vs. Actions
Transformations
map(func)
filter(func)
flatMap(func)
sample(withReplacement, frac, seed)
union(otherDataset)
groupByKey[K,V]
reduceByKey[K,V](func)
join[K,V,W](otherDataset)
cogroup[K,V,W1,W2](other1, other2)
cartesian[U](otherDataset)
sortByKey[K,V]

Actions
reduce(func)
collect()
count()
take(n)
first()
saveAsTextFile(path)
saveAsSequenceFile(path)
foreach(func)

[K,V] in Scala is the same as <K,V> in C++ templates or Java generics.
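To make the key-value operations and the [K,V] type parameters concrete, here is a minimal sketch of what groupByKey and reduceByKey compute, written against ordinary in-memory Scala sequences. The names groupByKeyLocal and reduceByKeyLocal are hypothetical helpers for illustration, not Spark methods, and the sketch ignores partitioning and distribution entirely.

```scala
object ByKeySketch {
  // Local stand-in for groupByKey: collect all values that share a key.
  def groupByKeyLocal[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Local stand-in for reduceByKey: combine the values of each key with func.
  def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(func: (V, V) => V): Map[K, V] =
    groupByKeyLocal(pairs).map { case (k, vs) => k -> vs.reduce(func) }

  def main(args: Array[String]): Unit = {
    val pairs = Seq("hercman" -> 1, "shewolf" -> 1, "hercman" -> 1)
    println(groupByKeyLocal(pairs))
    println(reduceByKeyLocal(pairs)(_ + _))
  }
}
```

Both helpers return a complete result immediately, so they behave like a transformation followed by collect(); on a real RDD the by-key step alone would stay lazy until an action runs.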
Hive vs. Shark
[Diagram: Hive runs HiveQL over HDFS files; Shark runs HiveQL over HDFS files + RDDs]
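Shark: Copy from HDFS to RDD
CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS SELECT * FROM wiki;
CREATE TABLE wiki_cached AS SELECT * FROM wiki;
Each statement creates a table that is stored in the cluster’s memory using RDD.cache(): the first through the "shark.cache" table property, the second through the _cached suffix on the table name.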
Shark: Just a Shim
Shark
Images from Reynold Xin’s presentation
What about “Big Data”?
[Chart: Shark effectiveness against data size (KB, MB, GB, TB, PB), with the median Hadoop job input size marked]
Image from Reynold Xin’s presentation
Spark Streaming: Motivation
[Diagram: x1,000,000 clients, HDFS, and a DStream made up of RDDs]
Spark Streaming: DStream
• “A series of small batches”
[Diagram: the stream is cut into 2 sec batches, each batch an RDD of JSON events such as:]
{"id": "hercman", "eventType": "buyGoods"}
{"id": "shewolf", "eventType": "error"}
{"id": "catlover", "eventType": "buyGoods"}
{"id": "hercman", "eventType": "logOff"}
. . .
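The "series of small batches" idea can be sketched with plain Scala collections. In this minimal sketch the Event class, its timeMs field, the toBatches helper, and the timestamps are all hypothetical; only the ids, the event types, and the 2 sec interval come from the slide. Each inner sequence plays the role of one RDD in the DStream.

```scala
object MicroBatchSketch {
  // Hypothetical event record: arrival time in milliseconds plus the two JSON fields.
  final case class Event(timeMs: Long, id: String, eventType: String)

  // Assign each event to the interval that contains its arrival time,
  // then return the non-empty batches in time order.
  def toBatches(events: Seq[Event], batchMs: Long): Seq[Seq[Event]] =
    events
      .groupBy(_.timeMs / batchMs)
      .toSeq
      .sortBy(_._1)
      .map(_._2)

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event(100L, "hercman", "buyGoods"),
      Event(1500L, "shewolf", "error"),
      Event(2300L, "catlover", "buyGoods"),
      Event(4100L, "hercman", "logOff")
    )
    toBatches(events, 2000L).zipWithIndex.foreach { case (batch, i) =>
      val summary = batch.map(e => e.id + "/" + e.eventType).mkString(", ")
      println("batch " + i + ": " + summary)
    }
  }
}
```

One simplification to keep in mind: this sketch drops intervals in which no event arrived, whereas a streaming system keeps ticking and produces a batch for every interval, even an empty one.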
Spark Streaming: DAG
The DAG
Kafka DStream[String] (JSON)
  DStream.transform -> DStream[EvObj]
    DStream.filter(_.eventType == "error") -> DStream[EvObj] -> DStream.foreach(println)
    DStream.filter(_.eventType == "buyGoods") -> DStream[EvObj] -> DStream.map(e => (e.id, 1)) -> DStream.groupByKey -> DStream.foreach(println)
Spark Streaming: Example Code
// Initialize
val ssc = new StreamingContext("mesos://localhost", "games", Seconds(2), …)
val msgs = ssc.kafkaStream[String](prm, topic, StorageLevel.MEMORY_AND_DISK)

// DAG: parse each JSON message into an evObj
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))

// Branch 1: number of error events per batch
val errorCounts = events.filter(_.eventType == "error")
errorCounts.foreach(rdd => println(rdd.count))

// Branch 2: number of distinct users buying per batch
val usersBuying = events.filter(_.eventType == "buyGoods").map(e => (e.id, 1)).groupByKey
usersBuying.foreach(rdd => println(rdd.count))

// Go
ssc.start
Stateful Spark Streaming
class ErrorsPerUser(var numErrors: Int = 0) extends Serializable

val updateFunc = (values: Seq[evObj], state: Option[ErrorsPerUser]) => {
  if (values.exists(_.eventType == "logOff")) None  // user logged off: drop the state
  else {
    val s = state.getOrElse(new ErrorsPerUser)
    values.foreach(e => e.eventType match {
      case "error" => s.numErrors += 1
      case _ =>  // ignore other event types
    })
    Option(s)
  }
}

// DAG
val events: DStream[evObj] = msgs.transform(rdd => rdd.map(new evObj(_)))
// keep logOff events as well, so that updateFunc can see them
val userEvents = events.filter(e => e.eventType == "error" || e.eventType == "logOff")
val states = userEvents.map(e => (e.id, e)).updateStateByKey[ErrorsPerUser](updateFunc)

// Off-DAG
states.foreach(rdd => println("Num users experiencing errors:" + rdd.count))
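How per-key state carries over from one batch to the next can be sketched with a plain Scala fold. This is a minimal sketch under stated assumptions: the Event class and the update and step functions are hypothetical, the state is a bare Int rather than an ErrorsPerUser object, and the reading that a logOff event drops the user's state is an interpretation of the slide. The update rule is applied to every user that has state or new events, and returning None removes that user.

```scala
object StatefulSketch {
  // Hypothetical event record with the two fields used on the slides.
  final case class Event(id: String, eventType: String)

  // Per-user update rule: drop the state on logOff, otherwise add this batch's errors.
  def update(values: Seq[Event], state: Option[Int]): Option[Int] =
    if (values.exists(_.eventType == "logOff")) None
    else Some(state.getOrElse(0) + values.count(_.eventType == "error"))

  // Apply the rule to every user seen so far or in the new batch,
  // keeping only users whose new state is defined.
  def step(state: Map[String, Int], batch: Seq[Event]): Map[String, Int] = {
    val byUser = batch.groupBy(_.id)
    (state.keySet ++ byUser.keySet).toSeq.flatMap { id =>
      update(byUser.getOrElse(id, Seq.empty[Event]), state.get(id)).map(n => id -> n)
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val batches = Seq(
      Seq(Event("shewolf", "error"), Event("hercman", "buyGoods")),
      Seq(Event("shewolf", "error"), Event("hercman", "logOff"))
    )
    val finalState = batches.foldLeft(Map.empty[String, Int])(step)
    println(finalState)
  }
}
```

With the two sample batches, shewolf accumulates two errors while hercman's entry is removed by the logOff in the second batch.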
Other Spark Subsystems
• Bagel (similar to Google Pregel)
• Sparkler (Matrix decomposition)
• (Machine Learning)
Teaser
• Future Meetup: Machine learning from real-time data streams
Data Science NFL Use-Case
Speed Kills - In Data Science and NFL
Speed Kills: Up-Tempo - No Huddle
Observation: best college and NFL offenses are using an up-tempo - no huddle strategy.
Hypothesis: NFL teams using the up-tempo - no huddle strategy have the best winning records.
Data Science Formula
Data science processes include:
1) Information gathering
2) Re-representation of the information in a schema that aids analysis
3) Development of insight through the manipulation of this representation
4) Creation of some knowledge product or direct action based on the insight
Data Science Formula
Information > Schema > Insight > Product
The data analysis may be organized in two key loops:
1) Searching loop (seeking, extracting, filtering information)
2) Understanding loop (modeling and conceptualization from a schema that best fits the evidence)
Data Collection
Collect data on all NFL offenses
Collect data on NFL offenses using an up-tempo - no huddle strategy
Collect data on NFL team records (win-losses)
Data Comparison
Compare data on all NFL offenses with data on NFL offenses using an up-tempo - no huddle strategy
Data Science Tools
Scientific methods
Analytical techniques
Machine learning techniques
Algorithm design and execution
Data visualization and story-telling
Statistics
Math
Computer engineering
Data mining
Data modeling
Predictive, Descriptive, Prescriptive Analytics
There are three types of data analysis:
Descriptive (business intelligence and data mining)
Predictive (forecasting)
Prescriptive (optimization and simulation)
Predictive Analytics
Predictive Modeling Techniques
Predictive Analytics
Prescriptive Analytics
Regression Analysis
NFL Data Sources
2002-2009 NFL PLAY BY PLAY
http://www.infochimps.com/tags/nfl
Supplemented stats source:
http://www.databasefootball.com/
http://www.pro-football-reference.com/
Data Results
From 2001 to 2012 only one NFL team used the up-tempo - no huddle offense consistently:
The New England Patriots
Note: In college football many teams started using the up-tempo - no huddle offense consistently in 2009. One team stands out:
The Oregon Ducks
Data Model - DVOA
DVOA (Defense-adjusted Value Over Average) for total offense as well as rushing and passing offense separated. All numbers are adjusted to an average schedule of opponents and an average percentage of fumbles recovered by the offense.
Exceptions are the three columns marked NON-ADJUSTED. Rushing includes all rushing, not just running backs.
Designed by Aaron Schatz of Football Outsiders
2001 Offensive Efficiency Ratings
2002 Offensive Efficiency Ratings
2003 Offensive Efficiency Ratings
2004 Offensive Efficiency Ratings
2005 Offensive Efficiency Ratings
2006 Offensive Efficiency Ratings
2007 Offensive Efficiency Ratings
2008 Offensive Efficiency Ratings
2009 Offensive Efficiency Ratings
2011 Offensive Efficiency Ratings
2012 Offensive Efficiency Ratings
Yr to Yr Correlation of Offense 2002 - 2009
All-Time DVOA Lists
NFL Team Plays per Game 2003
NFL Team Plays per Game 2004
NFL Team Plays per Game 2005
NFL Team Plays per Game 2006
NFL Team Plays per Game 2007
NFL Team Plays per Game 2008
NFL Team Plays per Game 2009
NFL Team Plays per Game 2011
NFL Team Plays per Game 2012
Team Records
Total Points
Average Points per Game
Total Plays
Average Plays per Game
Data Results
From 2001 to 2009 the New England Patriots had 107 wins and 37 losses - the second-best record in the NFL (behind Manning's Indianapolis Colts).
They appeared in five (5) Super Bowls, winning three (3).
From 2001 to 2012 the New England Patriots had 146 wins and 46 losses - the best record in the NFL.
Data Results
In 128 games NE scored a total of 3,356 points (26.2 ppg)
Second to Manning's Colts.
NE runs an up-tempo - no huddle offense
IND runs a no huddle offense
Data Results
NE in 5 SBs - wins 3 SBs (up-tempo - no huddle offense)
IND in 2 SBs - wins 1 SB (no huddle offense)
NE Stats 2001 - 2012
Yr - W/L - Plpg/R - Off R - Tot pps
2001 11-5 62.5/18 6 371 - Super Bowl Champs
2002 9-7 64.4/8 16 381
2003 14-2 66.4/1 12 348 - Super Bowl Champs
2004 14-2 64.3/7 4 437 - Super Bowl Champs
2005 10-6 63.7/15 10 379
2006 12-4 66.4/4 7 385
2007 16-0 65.8/3 1 589 - Lost SB
2008 11-5 68.4/1 8 410
2009 10-6 67/2 6 427
2010 14-2 62.6/21 1 518
2011 13-3 67.4/2 3 513 - Lost SB
2012 12-4 74.3/1 1 557
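The totals quoted on the Data Results slides can be re-derived from the NE table above. The following is a minimal sketch, with the table rows copied in by hand; the Season class and the helper names are hypothetical, and two assumptions are made: games played equals wins plus losses (no ties), and the 128 games refer to the 2002-2009 seasons covered by the play-by-play data.

```scala
object PatriotsTotalsSketch {
  // One row per season from the NE Stats table: year, wins, losses, total points.
  final case class Season(year: Int, wins: Int, losses: Int, points: Int)

  val seasons: Seq[Season] = Seq(
    Season(2001, 11, 5, 371), Season(2002, 9, 7, 381),
    Season(2003, 14, 2, 348), Season(2004, 14, 2, 437),
    Season(2005, 10, 6, 379), Season(2006, 12, 4, 385),
    Season(2007, 16, 0, 589), Season(2008, 11, 5, 410),
    Season(2009, 10, 6, 427), Season(2010, 14, 2, 518),
    Season(2011, 13, 3, 513), Season(2012, 12, 4, 557)
  )

  // Seasons whose year falls in the inclusive range [from, to].
  def between(from: Int, to: Int): Seq[Season] =
    seasons.filter(s => s.year >= from && s.year <= to)

  // Total (wins, losses) over the given seasons.
  def record(ss: Seq[Season]): (Int, Int) =
    (ss.map(_.wins).sum, ss.map(_.losses).sum)

  // Assumption: games played = wins + losses (no ties).
  def games(ss: Seq[Season]): Int = ss.map(s => s.wins + s.losses).sum

  def totalPoints(ss: Seq[Season]): Int = ss.map(_.points).sum

  def pointsPerGame(ss: Seq[Season]): Double =
    totalPoints(ss).toDouble / games(ss)

  def main(args: Array[String]): Unit = {
    println(record(between(2001, 2009)))        // (107,37)
    println(record(between(2001, 2012)))        // (146,46)
    println(games(between(2002, 2009)))         // 128
    println(totalPoints(between(2002, 2009)))   // 3356
    println(pointsPerGame(between(2002, 2009))) // about 26.2
  }
}
```

Under those assumptions the table agrees with the slides: 107-37 through 2009, 146-46 through 2012, and 3,356 points in 128 games.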
Denver Broncos Stats 2011 - 2012
Yr - W/L - Plpg/R - Off R - Tot pps
2011 8-8 63.6/16 25 309
2012 13-3 69.2/4 2 481
Buffalo Bills - 1990-1993
Yr - W/L - Plpg/R - Off R - Tot pps
1990 13-3 58.2/21 1 428 - Lost SB
1991 13-3 66/2 2 458 - Lost SB
1992 11-5 67.9/1 3 381 - Lost SB
1993 12-4 67.3/3 7 329 - Lost SB
Oregon Ducks 2009-2012
Yr - W/L - Plpg/R - Off R - Tot pps
2009 10-3 68.1/54 8 468 - Lost Rose B - Final Rank 11
2010 12-1 78.8/5 1 611 - Lost BCS Ch - Final Rank 3
2011 12-2 72.5/32 3 644 - Won Rose B - Final Rank 4
2012 12-1 81.4/9 2 637 - Won Fiesta B - Final Rank 2
Forecasting Principles - Framework
Stages of Forecasting
Forecasting Methods Selection Chart
Forecasting Methodology Tree
Change in Offensive Play Calling
Ave Pass Attempts per Game 35 Yrs
Completion % 35 yrs
Interception % 35 yrs
Passing: Return vs. Risk
QB Ratings 35 yrs
Series 1st Down Likelihood on 1st and 10
Offensive Points by Yards
Season Wins by Passing Efficiency
Plays per Game per Year
Speed Kills: Up-Tempo - No Huddle
Prediction: More NFL offenses will utilize the up-tempo - no huddle strategy.
Prediction: More NFL offenses will pass.
Forecast: NFL teams using the up-tempo - no huddle strategy will have the best winning records.
Visualizing Data
When does a Data Scientist's job end?
• Data Scientists must be able to tell stories with their findings.
• The audience may not understand regression analysis, weighted ranks, etc.
• Must be able to present findings in a clear, concise, and easy to look at manner.
Which would your manager prefer?
VS
Big Data Visualization Tools
Pentaho
Tableau
Jaspersoft
Pervasive
Birst
Many many more
Pentaho
An open source full suite of data integration and analytics tools
- Data Integration
- Pixel perfect reports
- OLAP cubes
- Dashboards
Works with many Big Data sources including Hadoop, Hive, HBase, Cassandra, MongoDB, and Couchbase along with traditional data sources.
OLAP engine automatically generates SQL using Mondrian.
Pentaho
Integrates with WEKA, but not with other statistical packages or languages.
Unfortunately, like all other user-driven visualization tools today, data must first be extracted to memory or a database. Pentaho makes doing this easy.
The commercial edition includes Instaview, a capability that automates this process.
Work is under way on support for Hive SQL.
Demo
Demo creating visualizations
Speed Kills
Thank You
Presentation by:
Michael Malak
Chris Deptula
Michael Walker