Spark Under the Hood - Meetup @ Data Science London
databricks -
Who are we?
Sameer Farooqui
• Trainer @ Databricks
• 150+ trainings on Hadoop, C*, HBase, Couchbase, NoSQL, etc
Doug Bateman
• Dir of Training @ NewCircle
• Spark Trainer for Databricks
• 800+ trainings on Java, Python, Android, Hibernate, Spring, etc
Jon Bates
• Data Scientist
• Consultant for Databricks
• edX assistant instructor on Scalable ML w/ Spark
Agenda: Talks
• Sameer Farooqui, 15 mins: Intro & Spark Overview
• Doug Bateman, 25 mins: Power Plant Demo (ETL + Linear Regression)
• Jon Bates, 25 mins: Iris Flower Demo (Model Parallel w/ scikit-learn)
Agenda: Q & A 30 mins
+
Lars Francke
• Consulting Architect for Cloudera
• Cluster setup, Security/Kerberos, Hive, Impala, HBase, Spark
• Based in Germany
Stephane Rion
• Senior Data Scientist @ Big Data Partnership + Spark Trainer for Databricks
• R, scikit-learn, Spark, Mahout, HBase, Hive, Pig
• Based in London
Who are you?
1) I have used Spark hands on before…
2) I have more than 1 year hands on experience with ML…
Spark: 100% open source and mature
Used in production by over 500 organizations, from Fortune 100 companies to small innovators
Apache Spark: Large user community
[Chart: Commits in the past year, comparing MapReduce, YARN, HDFS, Storm, and Spark]
[Chart: Contributors per Month to Spark, 2011 to 2015]
Most active project in big data
Large-Scale Usage
Largest cluster: 8000 nodes
Largest single job: 1 petabyte
Top streaming intake: 1 TB/hour
2014 on-disk 100 TB sort record
On-Disk Sort Record: Time to sort 100TB
2013 Record: Hadoop, 2100 machines, 72 minutes
2014 Record: Spark, 207 machines, 23 minutes
Source: Daytona GraySort benchmark, sortbenchmark.org
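To compare the two records on resources as well as wall-clock time, the figures above can be converted to machine-minutes. This is only an arithmetic sketch; the machine-minutes metric is an illustrative choice here, not something the benchmark itself reports.

```python
# Figures taken from the two sort records listed above.
records = {
    "2013 Hadoop": {"machines": 2100, "minutes": 72},
    "2014 Spark": {"machines": 207, "minutes": 23},
}

def machine_minutes(record):
    """Total machine time consumed: machines multiplied by minutes."""
    return record["machines"] * record["minutes"]

hadoop = machine_minutes(records["2013 Hadoop"])  # 2100 * 72 = 151200
spark = machine_minutes(records["2014 Spark"])    # 207 * 23 = 4761

time_speedup = records["2013 Hadoop"]["minutes"] / records["2014 Spark"]["minutes"]
machine_ratio = records["2013 Hadoop"]["machines"] / records["2014 Spark"]["machines"]
resource_ratio = hadoop / spark

print(f"Wall-clock speedup: {time_speedup:.1f}x")
print(f"Machines used: {machine_ratio:.1f}x fewer")
print(f"Machine-minutes: {resource_ratio:.1f}x fewer")
```

Roughly 3x faster on roughly 10x fewer machines works out to about 30x less machine time.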
2014: an Amazing Year for Spark
Total contributors: 150 => 500
Lines of code: 190K => 370K
500+ active production deployments
Overview of ML Algorithms
Prediction:
• Regression, Classification: LinearRegression, DecisionTree, SVM, LogisticRegression, NaiveBayes
Feature Transformation: Tokenizer, HashingTF, IDF, Word2Vec, Normalizer, StandardScaler
Recommendation: ALS
Clustering: KMeans, GaussianMixtureEM, LDA
Overview of ML Algorithms
Other:
• Statistics: Correlation, ChiSqTest, Statistics, MultivariateOnlineSummarizer
• Linear Algebra: RowMatrix, EigenValueDecomposition, Matrix, Vector
• Optimization: GradientDescent, LBFGS
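As an illustration of what a feature transformer such as Tokenizer followed by HashingTF does, here is a minimal plain-Python sketch of the hashing trick. It does not use the Spark API: the function names, the vector size of 16, and the use of crc32 as the hash are all assumptions of this sketch, and Spark's own hash function differs.

```python
import zlib

def tokenize(text):
    """Lowercase and split on whitespace (a simple stand-in for a tokenizer)."""
    return text.lower().split()

def hashing_tf(tokens, num_features=16):
    """Map tokens to a fixed-length term-frequency vector with the hashing trick."""
    vector = [0] * num_features
    for token in tokens:
        # crc32 is used only because it is deterministic across runs;
        # it is an assumption of this sketch, not Spark's hash function.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector

tokens = tokenize("Error ts msg1 Warn ts msg2 Error ts msg1")
vector = hashing_tf(tokens)
print(tokens)
print(vector)
```

The vector length is fixed up front, so no vocabulary has to be built or shipped between machines; the price is that distinct tokens can collide in the same slot.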
Spark Physical Cluster
[Diagram: one Spark Driver coordinating four Executors, each running two Tasks]
Spark Data Model
logLinesRDD: RDD / DataFrame with 4 partitions
Partition 1: Error, ts, msg1 | Warn, ts, msg2 | Error, ts, msg1
Partition 2: Info, ts, msg8 | Warn, ts, msg2 | Info, ts, msg8
Partition 3: Error, ts, msg3 | Info, ts, msg5 | Info, ts, msg5
Partition 4: Error, ts, msg4 | Warn, ts, msg9 | Error, ts, msg1
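The idea that each partition is processed by its own task can be sketched without Spark. The example below assumes each row of records above is one partition of logLinesRDD, and uses a thread pool as a stand-in for executors; `filter_errors` is a hypothetical name, not a Spark function.

```python
from concurrent.futures import ThreadPoolExecutor

# The four partitions of logLinesRDD, one list per partition (assumed grouping).
partitions = [
    [("Error", "ts", "msg1"), ("Warn", "ts", "msg2"), ("Error", "ts", "msg1")],
    [("Info", "ts", "msg8"), ("Warn", "ts", "msg2"), ("Info", "ts", "msg8")],
    [("Error", "ts", "msg3"), ("Info", "ts", "msg5"), ("Info", "ts", "msg5")],
    [("Error", "ts", "msg4"), ("Warn", "ts", "msg9"), ("Error", "ts", "msg1")],
]

def filter_errors(partition):
    """One task: keep only the Error records of a single partition."""
    return [record for record in partition if record[0] == "Error"]

# One worker thread per partition stands in for one task per partition.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    error_partitions = list(pool.map(filter_errors, partitions))

error_counts = [len(p) for p in error_partitions]
print(error_counts)       # [2, 0, 1, 2]
print(sum(error_counts))  # 5
```

The filtered result keeps the same four-partition layout; a partition can simply end up empty, as the second one does here.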
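Spark Data Model
Partition 1: item-1, item-2
Partition 2: item-3, item-4
Partition 3: item-5, item-6
Partition 4: item-7, item-8
Partition 5: item-9, item-10
[Diagram: three executors (Ex) holding the five RDD partitions, two on each of the first two executors and one on the third]
more partitions = more parallelism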
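The layout above (ten items, five partitions, three executors) can be reproduced with a short plain-Python sketch. Round-robin placement is a simplification chosen for illustration; it is not a description of how Spark actually schedules partitions, and both function names are hypothetical.

```python
def split_into_partitions(items, num_partitions):
    """Split items into contiguous, near-equal partitions."""
    size, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions

def assign_round_robin(partitions, num_executors):
    """Place partitions on executors in round-robin order (a simplification)."""
    placement = {e: [] for e in range(num_executors)}
    for index, partition in enumerate(partitions):
        placement[index % num_executors].append(partition)
    return placement

items = [f"item-{i}" for i in range(1, 11)]
partitions = split_into_partitions(items, 5)
placement = assign_round_robin(partitions, 3)
for executor, parts in placement.items():
    print(f"Ex {executor}: {parts}")
```

With only two partitions, one of the three executors would sit idle, which is the point behind "more partitions = more parallelism".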
Use Case: predict power output given a set of readings from various sensors in a gas-fired power generation plant
Schema Definition:
• AT = Atmospheric Temperature in °C
• V = Exhaust Vacuum Speed
• AP = Atmospheric Pressure
• RH = Relative Humidity
• PE = Power Output (value we are trying to predict)
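To make the prediction task concrete, here is a minimal sketch of fitting PE as a linear function of AT, V, AP, and RH with ordinary least squares in plain Python. It is not the demo's code. The readings are synthetic, generated from hypothetical coefficients (`TRUE_WEIGHTS`) chosen only for illustration, and the value ranges are assumptions rather than plant data.

```python
import random

# Hypothetical coefficients used only to generate synthetic readings.
TRUE_WEIGHTS = [450.0, -2.0, -0.3, 0.05, -0.1]  # intercept, AT, V, AP, RH

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def fit_linear(rows, targets):
    """Least squares via the normal equations; returns [intercept, w_AT, w_V, w_AP, w_RH]."""
    xs = [[1.0] + list(row) for row in rows]
    d = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(d)] for i in range(d)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(d)]
    return solve(xtx, xty)

def predict(weights, row):
    """Predicted PE for one (AT, V, AP, RH) reading."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))

random.seed(0)
rows = [
    (random.uniform(5, 35), random.uniform(25, 80),
     random.uniform(990, 1035), random.uniform(25, 100))
    for _ in range(200)
]  # synthetic (AT, V, AP, RH) readings, assumed ranges
targets = [predict(TRUE_WEIGHTS, row) for row in rows]

weights = fit_linear(rows, targets)
sample = (20.0, 50.0, 1010.0, 60.0)
print("Predicted PE:", round(predict(weights, sample), 3))
```

Because the synthetic targets are noise-free, the fit recovers the generating function; real sensor data would leave a residual error to evaluate.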
Model Parallelism
• Model stored across workers
• Communicate data to all workers
• Examples: grid search, cross validation, ensembles
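A grid search shows the model-parallel pattern in miniature: every worker receives the full dataset and fits a different candidate model. The sketch below uses a thread pool as a stand-in for workers, a one-dimensional ridge regression as the model, and a tiny synthetic dataset; the grid values and function names are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Tiny synthetic dataset (y = 2x), sent in full to every worker.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
valid = [(5.0, 10.0), (6.0, 12.0)]

def fit_ridge(data, lam):
    """Closed-form 1-D ridge regression without intercept: w = sum(xy) / (sum(xx) + lam)."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, y in data)
    return sxy / (sxx + lam)

def evaluate(lam):
    """One worker's job: fit one candidate model on all the data and score it."""
    w = fit_ridge(train, lam)
    mse = sum((w * x - y) ** 2 for x, y in valid) / len(valid)
    return lam, w, mse

grid = [0.0, 0.1, 1.0, 10.0]  # hypothetical hyperparameter grid
with ThreadPoolExecutor(max_workers=len(grid)) as pool:
    results = list(pool.map(evaluate, grid))

best_lam, best_w, best_mse = min(results, key=lambda r: r[2])
print("best lambda:", best_lam, "weight:", best_w)
```

This pattern suits data that fits on a single worker, since the whole dataset is communicated to each one.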
Data Parallelism
• Data stored across workers
• Communicate model to all workers
• Examples: MLlib linear models, matrix outer products
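The data-parallel pattern can be sketched as gradient descent where the data stays partitioned and only the current model and the small partial gradients move. The example below runs sequentially in one process and uses synthetic data (y = 3x); the partitioning, learning rate, and function names are assumptions of this sketch, not MLlib's implementation.

```python
# Synthetic data (y = 3x), split into partitions that stay on their workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
partitions = [data[i:i + 2] for i in range(0, len(data), 2)]

def partial_gradient(partition, w):
    """Computed where the data lives: squared-error gradient sum for one partition."""
    return sum((w * x - y) * x for x, y in partition)

def train(partitions, learning_rate=0.01, iterations=100):
    """Gradient descent on a single weight w for the model y = w * x."""
    n = sum(len(p) for p in partitions)
    w = 0.0
    for _ in range(iterations):
        # "Broadcast" w to every partition, then sum the small partial results.
        gradient = sum(partial_gradient(p, w) for p in partitions) / n
        w -= learning_rate * gradient
    return w

w = train(partitions)
print("learned weight:", round(w, 6))
```

Each iteration communicates one number per partition regardless of how many records the partition holds, which is why this pattern scales with the data size.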
Scalability Rules
1st Rule of thumb: Computation & storage should be linear (in n, d), where n is the number of observations and d the number of features
2nd Rule of thumb: Perform parallel and in-memory computation
3rd Rule of thumb: Minimize network communication
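The first and third rules can be illustrated with a mergeable column summary: each partition is scanned once (linear in n), only three numbers per column are kept (linear in d), and only those small summaries are combined. This is a plain-Python sketch with a hypothetical `ColumnSummary` class and made-up rows, not Spark's MultivariateOnlineSummarizer.

```python
class ColumnSummary:
    """Per-column count, sum, and sum of squares: O(d) storage, one pass over the data."""

    def __init__(self, d):
        self.n = 0
        self.sums = [0.0] * d
        self.squares = [0.0] * d

    def add(self, row):
        """Fold one row into the running totals."""
        self.n += 1
        for j, value in enumerate(row):
            self.sums[j] += value
            self.squares[j] += value * value
        return self

    def merge(self, other):
        """Combine two partial summaries; only O(d) numbers cross the network."""
        self.n += other.n
        self.sums = [a + b for a, b in zip(self.sums, other.sums)]
        self.squares = [a + b for a, b in zip(self.squares, other.squares)]
        return self

    def mean(self):
        return [s / self.n for s in self.sums]

    def variance(self):
        """Population variance per column."""
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sums, self.squares)]

def summarize(partition, d):
    """Run on each worker: summarize the local partition in a single pass."""
    summary = ColumnSummary(d)
    for row in partition:
        summary.add(row)
    return summary

# Made-up rows with d = 2 columns, split over two partitions.
partitions = [
    [(1.0, 10.0), (2.0, 20.0)],
    [(3.0, 30.0), (4.0, 40.0)],
]
total = ColumnSummary(2)
for partition in partitions:
    total.merge(summarize(partition, 2))
print(total.mean(), total.variance())
```

The sum-of-squares formula keeps the sketch short; a production summarizer would typically use a numerically more stable update.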