What’s New in the Berkeley Data Analytics Stack
What’s Next for the Berkeley Data Analytics Stack
UC BERKELEY
Michael Franklin
July 20, 2015
Data Science Summit SF
The Berkeley AMPLab
80+ Students, Postdocs, Faculty and Staff from: Databases, Machine Learning, Systems, Security, and Networking
Mission Statement: Making Sense of Data at Scale by Integrating:
• Algorithms – Machine Learning, Statistical Methods
• Machines – Cluster and Cloud Computing
• People – Crowdsourcing and Human Computation
Franklin, Jordan, Stoica, Patterson, Shenker, Recht, Katz, Joseph, Goldberg, Mahoney, Popa, Gonzalez
AMPLab: A Public/Private Partnership
NSF CISE Expedition Award: Part of 2012 White House Big Data Initiative
DARPA XData Program
DoE/Lawrence Berkeley National Lab
And these Industrial Sponsors:
Berkeley Data Analytics Stack (Apache and BSD open source)
In-House Applications: Cancer Genomics, Energy Debugging, Smart Buildings
Access and Interfaces: Spark Streaming, Shark, BlinkDB, GraphX, MLlib, MLBase, SparkR, SampleClean, Velox Model Serving
Processing Engine: Spark
Storage: Tachyon; HDFS, S3, …
Resource Virtualization: Mesos, Yarn
Big Data Ecosystem Evolution
General batch processing: MapReduce
Specialized systems (iterative, interactive and streaming apps): Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, …
AMPLab Unification Philosophy
Don’t specialize MapReduce – Generalize it!
Two additions to Hadoop MR can enable all the models shown earlier!
1. General Task DAGs
2. Data Sharing
For Users: Fewer Systems to Use, Less Data Movement
Spark Streaming, GraphX, Spark SQL, MLbase, …
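The two additions named above (general task DAGs and data sharing) can be illustrated with a toy sketch. This is not Spark's implementation or API: the `Dataset` class and its methods are hypothetical names chosen for illustration. Each node records its parents (the DAG), is computed lazily, and can keep its result in memory so that several downstream computations share it.

```python
class Dataset:
    """A node in a task DAG: lazily computed from its parent nodes."""

    def __init__(self, compute, parents=()):
        self._compute = compute    # function: parent results -> list of rows
        self._parents = parents    # upstream Dataset nodes (the DAG edges)
        self._use_cache = False    # set by cache()
        self._cached = None        # in-memory copy, filled on first collect()
        self.evaluations = 0       # how many times this node was computed

    @staticmethod
    def from_list(items):
        return Dataset(lambda: list(items))

    def map(self, fn):
        return Dataset(lambda rows: [fn(r) for r in rows], (self,))

    def filter(self, pred):
        return Dataset(lambda rows: [r for r in rows if pred(r)], (self,))

    def cache(self):
        """Mark this node's result to be kept in memory and shared."""
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache and self._cached is not None:
            return self._cached
        inputs = [p.collect() for p in self._parents]
        self.evaluations += 1
        result = self._compute(*inputs)
        if self._use_cache:
            self._cached = result
        return result


# Two different computations reuse the same cached intermediate result.
squares = Dataset.from_list(range(10)).map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)

total = sum(squares.collect())
even_squares = evens.collect()
```

Without `cache()`, the `map` stage would be recomputed for every downstream job; with it, `squares.evaluations` stays at 1.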
In-Memory Dataflow System
M. Zaharia, M. Chowdhury, M. Franklin, I. Stoica, S. Shenker, “Spark: Cluster Computing with Working Sets,” USENIX HotCloud, 2010.
• Developed in AMPLab and its predecessor the RADLab
• Alternative to Hadoop MapReduce
• 10-100x speedup for ML and interactive queries
• Central component of the BDAS Stack
• “Graduated” to Apache Foundation -> Apache Spark
Apache Spark Meetups Around the World (Jan ‘15)
Apache Spark Meetups Around the World (July ‘15)
Regional growth labels on the map: +72%, +124%, +79%, +57%
Berkeley Data Analytics Stack
In-house Apps: Cancer Genomics, Energy Debugging, Smart Buildings
Access and Interfaces: Spark Streaming, SparkSQL, BlinkDB, SampleClean, G-OLA, SparkR, GraphX, Splash, MLlib, MLBase, MLPipelines, Velox
Processing Engine: Spark Core
Storage: Succinct, Tachyon; HDFS, S3, Ceph, …
Resource Virtualization: Mesos, Hadoop Yarn
Spark Core
• Major rearchitecture and features (community)
– DataFrames API
– Tungsten: bringing Spark closer to bare metal
• Memory Management and Binary Processing
• Cache-aware computation
• Code generation
• R interface
• Spark SQL and Spark Streaming enhancements
• Still rapidly growing!
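As a rough illustration of the "code generation" item, the sketch below contrasts interpreting an expression tree for every row with generating source code for the expression once and calling the compiled function. It is plain Python, not Tungsten's mechanism; the `Col`, `Lit`, `BinOp`, `interpret`, and `compile_expr` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: float


@dataclass(frozen=True)
class BinOp:
    op: str            # one of "+", "-", "*"
    left: "Expr"
    right: "Expr"


Expr = Union[Col, Lit, BinOp]


def interpret(expr, row):
    """Walk the tree for every row: flexible, but dispatches at each node."""
    if isinstance(expr, Col):
        return row[expr.name]
    if isinstance(expr, Lit):
        return expr.value
    left, right = interpret(expr.left, row), interpret(expr.right, row)
    return {"+": left + right, "-": left - right, "*": left * right}[expr.op]


def to_source(expr):
    """Emit Python source text for the expression."""
    if isinstance(expr, Col):
        return f"row[{expr.name!r}]"
    if isinstance(expr, Lit):
        return repr(expr.value)
    if expr.op not in {"+", "-", "*"}:
        raise ValueError(f"unsupported operator: {expr.op}")
    return f"({to_source(expr.left)} {expr.op} {to_source(expr.right)})"


def compile_expr(expr):
    """Generate the code once; the result is an ordinary function of a row."""
    return eval(f"lambda row: {to_source(expr)}")


expr = BinOp("+", BinOp("*", Col("price"), Col("qty")), Lit(1.5))
row = {"price": 2.0, "qty": 3}

interpreted = interpret(expr, row)
compiled = compile_expr(expr)(row)
```

Both paths return the same value; the compiled function avoids walking the tree for each row.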
BDAS: Latest Developments
• Velox – Model Serving and Personalization
– KeystoneML integration
– Improved service APIs and deployment tools
– Open source alpha release
Where do models go?
Data → Training → Model
Conference Papers, Sales Reports, Drive Actions
Introducing Velox: Model Serving
Driving Actions: Suggesting Items at Checkout, Fraud Detection, Cognitive Assistance, Internet of Things
Low-Latency, Personalized, Rapidly Changing
Problem: Separate Systems
Offline Analytics Systems: Sophisticated ML on static data.
Online Serving Systems (e.g., MongoDB): Low-latency data serving.
How do we serve low-latency predictions and train on live data?
Velox Model Serving System [Crankshaw, Bailis, Gonzalez et al. CIDR’15]
Decompose personalized predictive models:
Split:
• Personalization Model (Online): Online Updates, Active Learning
• Feature Model (Batch): Feature Caching, Approx. Features
Order-of-magnitude reductions in prediction latencies.
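The split can be sketched as follows, assuming a linear personalization model over shared features (prediction = user weights · item features). This is a toy illustration, not Velox's API: `features`, `UserModel`, and the numbers are hypothetical. The feature function stands in for the batch-trained feature model and is cached; the per-user weights are updated online as feedback arrives.

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def features(item_id: int) -> tuple:
    """Stand-in for an expensive, batch-trained feature model f(x)."""
    return (1.0, float(item_id % 3), float(item_id % 5))


class UserModel:
    """Per-user linear model over the shared (cached) item features."""

    def __init__(self, dim, learning_rate=0.1):
        self.weights = [0.0] * dim
        self.learning_rate = learning_rate

    def predict(self, item_id):
        return sum(w * f for w, f in zip(self.weights, features(item_id)))

    def observe(self, item_id, rating):
        """One gradient step on squared error, applied as feedback arrives."""
        f = features(item_id)
        error = self.predict(item_id) - rating
        self.weights = [
            w - self.learning_rate * error * x for w, x in zip(self.weights, f)
        ]


user = UserModel(dim=3)
before = abs(user.predict(7) - 4.0)
for _ in range(20):
    user.observe(7, 4.0)        # the user keeps rating item 7 as 4.0
after = abs(user.predict(7) - 4.0)
```

Only the small per-user weight vector changes online; the feature computation is served from the cache after its first evaluation.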
• MLPipelines (now KeystoneML)
– Alpha release
– End-to-end pipelines in vision, speech, and NLP
– Horizontal scalability to 100’s of machines and multi-terabyte datasets
What is KeystoneML?
Software framework for building scalable end-to-end machine learning pipelines.
Helps us explore how to build systems for robust, scalable, end-to-end advanced analytics workloads and the patterns that emerge.
Example pipelines that achieve state-of-the-art results on large scale datasets in computer vision, NLP, and speech - fast.
Previewed at AMP Camp 5 and on AMPLab Blog as “ML Pipelines”
Public release last month! http://keystone-ml.org/
How does it fit with BDAS?
Batch Model Training: KeystoneML, alongside MLlib, GraphX, and ml-matrix, on top of Spark
Real Time Serving: Velox Model Server
http://amplab.github.io/velox-modelserver
Example: Image Classification
Images (VOC2007) → .fit( )
Pipeline: Resize → Grayscale → SIFT → PCA → Fisher Vector → Linear Regression → MaxClassifier
Fitted pipeline: Resize → Grayscale → SIFT → PCA Map → Fisher Encoder → Linear Model → MaxClassifier
Achieves performance of Chatfield et al., 2011
Embarrassingly parallel featurization and evaluation
15 min on a modest cluster
5K examples, 40K features, 20 classes
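The pipeline idea above (stages chained together, where some stages must first be fit on data) can be sketched in a few lines. This is a toy Python analogue, not KeystoneML's Scala API: `Transformer`, `and_then`, and `MeanCenterer` are hypothetical names, and the stages are trivial stand-ins for the image operators on the slide.

```python
class Transformer:
    """A pipeline stage that maps one input to one output."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

    def and_then(self, nxt):
        """Chain two stages into a single Transformer."""
        return Transformer(lambda x: nxt(self(x)))


class MeanCenterer:
    """An estimator: fit() inspects training data and returns a Transformer."""

    def fit(self, rows):
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        return Transformer(lambda row: [v - m for v, m in zip(row, means)])


# Fixed stages need no training data.
scale = Transformer(lambda row: [v / 255.0 for v in row])
argmax = Transformer(lambda row: max(range(len(row)), key=row.__getitem__))

# The estimator is fit on featurized training data, then joins the chain.
train = [[0, 255, 51], [255, 0, 51]]
centerer = MeanCenterer().fit([scale(r) for r in train])

pipeline = scale.and_then(centerer).and_then(argmax)
prediction = pipeline([0, 255, 51])
```

Here `argmax` plays the role of the final max classifier: it returns the index of the largest score.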
Current Software Features
Data Loaders
» CSV, CIFAR, ImageNet, VOC, TIMIT, 20 Newsgroups
Transformers
» NLP - Tokenization, n-grams, term frequency, NER*, parsing*
» Images - Convolution, Grayscaling, LCS, SIFT*, FisherVector*, Pooling, Windowing, HOG, Daisy
» Speech - MFCCs*
» Stats - Random Features, Normalization, Scaling*, Signed Hellinger Mapping, FFT
» Utility/misc - Caching, Top-K classifier, indicator label mapping, sparse/dense encoding transformers.
Estimators
» Learning - Block linear models, Linear Discriminant Analysis, PCA, ZCA Whitening, Naive Bayes*, GMM*
• Example Pipelines
• NLP - 20 Newsgroups, Wikipedia Language model
• Images - MNIST, CIFAR, VOC, ImageNet
• Speech - TIMIT
• Evaluation Metrics
• Binary Classification
• Multiclass Classification
• Multilabel Classification
* - Links to external library: MLlib, ml-matrix, VLFeat, EncEval
Research Direction: Automatic Resource Estimation
Long, complicated pipelines.
» Just a composition of dataflows!
How long will this thing take to run?
When do I cache?
» Pose as a constrained optimization problem.
Enables Efficient Hyperparameter Tuning
(ref. E. Sparks et al. “Automating Model Search for Large Scale Machine Learning”, SOCC, Aug 2015)
Pipeline diagram stages: Resize, Grayscale, SIFT, PCA, Fisher Vector; LCS, PCA, Fisher Vector; Block Linear Solver, Weighted Block Linear Solver; Top 5 Classifier
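The "When do I cache?" question can be posed as a small constrained optimization: choose the set of stage outputs to keep in memory that saves the most recomputation time without exceeding a memory budget. The sketch below is an assumption-laden toy, not the method from the cited work: the cost model (each reuse after the first costs a full recompute, stages treated independently), the stage names, and all numbers are hypothetical.

```python
from itertools import combinations


def best_cache_set(stages, memory_budget_gb):
    """Brute-force search over subsets; fine for a handful of stages."""
    names = list(stages)
    best_set, best_saved = (), 0.0
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            memory = sum(stages[s]["size_gb"] for s in subset)
            if memory > memory_budget_gb:
                continue  # violates the memory constraint
            # Caching a stage avoids recomputing it on each reuse after the first.
            saved = sum(
                stages[s]["compute_s"] * (stages[s]["reuses"] - 1) for s in subset
            )
            if saved > best_saved:
                best_set, best_saved = subset, saved
    return best_set, best_saved


# Hypothetical per-stage estimates: compute time, output size, times reused.
stages = {
    "sift": {"compute_s": 120.0, "size_gb": 8.0, "reuses": 3},
    "pca": {"compute_s": 30.0, "size_gb": 2.0, "reuses": 3},
    "fisher": {"compute_s": 60.0, "size_gb": 6.0, "reuses": 2},
}

chosen, seconds_saved = best_cache_set(stages, memory_budget_gb=10.0)
```

A real estimator would also account for dependencies between stages (a cached downstream output makes its upstream stages unnecessary) and would need a scalable solver rather than enumeration.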
SampleClean
• Released two Spark Packages
– SampleClean: SparkSQL-integrated library for record dedup, entity resolution, and active learning
– AMPCrowd: web service for crowdsourcing through Amazon Mechanical Turk or an "internal" crowd
• REST API to allow for human-in-the-loop, asynchronous data cleaning pipelines
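Record dedup, as mentioned above, can be illustrated with a minimal similarity-based sketch. This is not SampleClean's API or algorithm: it assumes token-level Jaccard similarity with a fixed threshold, and `tokens`, `jaccard`, and `dedup` are hypothetical helper names.

```python
def tokens(record):
    """Lowercase the record and split it into a set of word tokens."""
    return set(record.lower().replace(",", " ").split())


def jaccard(a, b):
    """Size of the token intersection divided by the size of the union."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def dedup(records, threshold=0.6):
    """Greedy pass: keep a record unless it resembles one already kept."""
    kept = []
    for record in records:
        if all(jaccard(record, k) < threshold for k in kept):
            kept.append(record)
    return kept


records = [
    "UC Berkeley AMPLab",
    "uc berkeley, amplab",
    "Lawrence Berkeley National Lab",
]
unique = dedup(records)
```

In a human-in-the-loop pipeline, pairs whose similarity falls near the threshold are the natural candidates to send to a crowd for a decision rather than resolving them automatically.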
SampleClean Framework
Current research focus: Latency Reduction for human-in-the-loop
• Straggler Mitigation
• Pool Maintenance
• Active Learning
Summary
• AMPLab project
• Cross-disciplinary team, Industry engagement
• Open Source development and community building
• BDAS philosophy: Unification
• Spark + SQL + Graphs + ML + …
• After graduating Mesos, Tachyon & Spark we are moving up the stack to support declarative and real-time Machine Learning and analytics.
…
To find out more or get involved:
amplab.berkeley.edu
UC BERKELEY
Thanks to NSF CISE Expeditions in Computing, DARPA XData, Founding Sponsors: Amazon Web Services, Google, IBM, and SAP, the Thomas and Stacy Siebel Foundation, all our industrial sponsors and partners, and all the members of the AMPLab Team.