What’s New in the Berkeley Data Analytics Stack

What’s Next for the Berkeley Data Analytics Stack

UC BERKELEY

Michael Franklin

July 20, 2015

Data Science Summit SF

The Berkeley AMPLab

80+ Students, Postdocs, Faculty and Staff from: Databases, Machine Learning, Systems, Security, and Networking

Mission Statement: Making Sense of Data at Scale by Integrating:
• Algorithms – Machine Learning, Statistical Methods
• Machines – Cluster and Cloud Computing
• People – Crowdsourcing and Human Computation

Franklin, Jordan, Stoica, Patterson, Shenker, Recht, Katz, Joseph, Goldberg, Mahoney, Popa, Gonzalez

AMPLab: A Public/Private Partnership

NSF CISE Expedition Award: Part of 2012 White House Big Data Initiative

DARPA XData Program

DoE/Lawrence Berkeley National Lab

And these Industrial Sponsors:

Berkeley Data Analytics Stack (Apache and BSD open source)

• In House Applications: Cancer Genomics, Energy Debugging, Smart Buildings
• Access and Interfaces: Spark Streaming, Shark, BlinkDB, GraphX, MLlib, MLBase, SparkR, Velox Model Serving, Sample Clean
• Processing Engine: Spark
• Storage: Tachyon; HDFS, S3, …
• Resource Virtualization: Mesos, Yarn

Big Data Ecosystem Evolution

• General batch processing: MapReduce
• Specialized systems (iterative, interactive and streaming apps): Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, …

AMPLab Unification Philosophy

Don’t specialize MapReduce – Generalize it!

Two additions to Hadoop MR can enable all the models shown earlier!

1. General Task DAGs

2. Data Sharing
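A toy, single-process sketch of the two additions, for illustration only. It is not Spark's API: the `Dataset` class, `source`, and the `compute_count` counter are hypothetical names invented here. Each dataset is a node in a DAG of tasks, and `cache()` lets several downstream computations share one intermediate result instead of recomputing it.

```python
from typing import Callable, List, Optional


class Dataset:
    """A node in a task DAG: a function applied to the results of its parents."""

    compute_count = 0  # number of node evaluations performed (illustration only)

    def __init__(self, fn: Callable[..., list],
                 parents: Optional[List["Dataset"]] = None):
        self.fn = fn
        self.parents = parents or []
        self.cached = False
        self._value: Optional[list] = None

    def cache(self) -> "Dataset":
        """Mark this node's output as shared: keep it in memory after first use."""
        self.cached = True
        return self

    def map(self, f: Callable) -> "Dataset":
        return Dataset(lambda rows: [f(r) for r in rows], [self])

    def filter(self, pred: Callable) -> "Dataset":
        return Dataset(lambda rows: [r for r in rows if pred(r)], [self])

    def collect(self) -> list:
        """Evaluate the DAG below this node, reusing cached results."""
        if self.cached and self._value is not None:
            return self._value
        inputs = [p.collect() for p in self.parents]
        Dataset.compute_count += 1
        value = self.fn(*inputs)
        if self.cached:
            self._value = value
        return value


def source(rows) -> Dataset:
    """Leaf node that materializes an in-memory collection."""
    return Dataset(lambda: list(rows))


# Two queries share one cached intermediate dataset.
squares = source(range(10)).map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0).collect()
odds = squares.filter(lambda x: x % 2 == 1).collect()
```

With the `cache()` call the second query reuses `squares`; without it, the source and map steps would run again for every query.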

For Users: Fewer Systems to Use, Less Data Movement

[Diagram: Spark Streaming, GraphX, Spark SQL, MLbase, …]

In-Memory Dataflow System

M. Zaharia, M. Chowdhury, M. Franklin, I. Stoica, S. Shenker, “Spark: Cluster Computing with Working Sets,” USENIX HotCloud, 2010.

• Developed in AMPLab and its predecessor the RADLab
• Alternative to Hadoop MapReduce
• 10-100x speedup for ML and interactive queries
• Central component of the BDAS Stack
• “Graduated” to Apache Foundation -> Apache Spark

Apache Spark Meetups Around the World (Jan ‘15)

Apache Spark Meetups Around the World (July ‘15)

[World maps of meetup locations; growth labels shown on the July map: +72%, +124%, +79%, +57%]


[BDAS stack diagram]
• In-house Apps: Cancer Genomics, Energy Debugging, Smart Buildings
• Access and Interfaces: Spark Streaming, SparkSQL, BlinkDB, GraphX, MLlib, MLBase, SparkR, SampleClean, G-OLA, Velox, MLPipelines, Splash
• Processing Engine: Spark Core
• Storage: Tachyon, Succinct; HDFS, S3, Ceph, …
• Resource Virtualization: Mesos, Hadoop Yarn




Spark Core

• Major rearchitecture and features (community)
  – DataFrames API
  – Tungsten: bringing Spark closer to bare metal
    • Memory Management and Binary Processing
    • Cache-aware computation
    • Code generation
• R interface
• Spark SQL and Spark Streaming enhancements
• Still rapidly growing!



• Velox – Model Serving and Personalization
  – KeystoneML integration
  – Improved service APIs and deployment tools
  – Open source alpha release

BDAS: Latest Developments

Where do models go?

[Diagram: Data, Training, Model]

• Conference Papers
• Sales Reports
• Drive Actions

Introducing Velox: Model Serving

Driving Actions

• Suggesting Items at Checkout
• Fraud Detection
• Cognitive Assistance
• Internet of Things

Low-Latency, Personalized, Rapidly Changing

Problem: Separate Systems

• Offline Analytics Systems: sophisticated ML on static data.
• Online Serving Systems (such as MongoDB): low-latency data serving.

How do we serve low-latency predictions and train on live data?

Velox Model Serving System

Decompose personalized predictive models: [Crankshaw, Bailis, Gonzalez et al. CIDR’15]

Split:
• Personalization Model (Online): Online Updates, Active Learning
• Feature Model (Batch): Feature Caching, Approx. Features

Order-of-magnitude reductions in prediction latencies.
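A minimal sketch of the split described above, under stated assumptions. A shared feature function stands in for the batch-trained feature model and is cached. A small per-user weight vector stands in for the personalization model and is updated online. The names `features`, `PersonalizedModel`, and `observe`, the three-dimensional placeholder featurization, and the plain SGD update are all hypothetical choices made for illustration; they are not Velox's API or its actual update rule.

```python
from collections import defaultdict
from functools import lru_cache
from typing import Dict, List, Tuple

DIM = 3  # hypothetical feature dimension


@lru_cache(maxsize=10_000)
def features(item: Tuple[float, float]) -> Tuple[float, ...]:
    """Shared feature model f(x); stands in for a model trained offline in batch.

    lru_cache plays the role of feature caching: repeated items are not refeaturized.
    """
    a, b = item
    return (1.0, a, b)  # bias term plus raw attributes (placeholder featurization)


class PersonalizedModel:
    """Per-user linear model over shared features: prediction = w_user . f(item)."""

    def __init__(self, learning_rate: float = 0.1):
        self.lr = learning_rate
        self.weights: Dict[str, List[float]] = defaultdict(lambda: [0.0] * DIM)

    def predict(self, user: str, item: Tuple[float, float]) -> float:
        w = self.weights[user]
        return sum(wi * fi for wi, fi in zip(w, features(item)))

    def observe(self, user: str, item: Tuple[float, float], label: float) -> None:
        """Online update: one SGD step on squared error, touching this user only."""
        f = features(item)
        error = self.predict(user, item) - label
        w = self.weights[user]
        for i in range(DIM):
            w[i] -= self.lr * error * f[i]


model = PersonalizedModel()
model.observe("alice", (1.0, 2.0), label=1.0)
alice_score = model.predict("alice", (1.0, 2.0))  # moved toward the label
bob_score = model.predict("bob", (1.0, 2.0))      # unaffected by alice's feedback
```

Only the small per-user vector changes at serving time, which is what makes the online path cheap; the expensive feature model can be retrained in batch on its own schedule.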





• KeystoneML (previously MLPipelines)
  – Alpha release
  – End-to-end pipelines in vision, speech, and NLP
  – Horizontal scalability to 100s of machines and multi-terabyte datasets

What is KeystoneML?

Software framework for building scalable end-to-end machine learning pipelines.

Helps us explore how to build systems for robust, scalable, end-to-end advanced analytics workloads and the patterns that emerge.

Example pipelines that achieve state-of-the-art results on large scale datasets in computer vision, NLP, and speech - fast.

Previewed at AMP Camp 5 and on AMPLab Blog as “ML Pipelines”

Public release last month! http://keystone-ml.org/


How does it fit with BDAS?

• Batch Model Training: KeystoneML, alongside MLlib, GraphX, and ml-matrix, on Spark
• Real Time Serving: Velox Model Server (http://amplab.github.io/velox-modelserver)

Example: Image Classification

Pipeline before fitting: Resize, Grayscale, SIFT, PCA, Fisher Vector, Linear Regression, MaxClassifier

After calling .fit( ) on Images (VOC2007): Resize, Grayscale, SIFT, PCA Map, Fisher Encoder, Linear Model, MaxClassifier

• Achieves performance of Chatfield et al., 2011
• Embarrassingly parallel featurization and evaluation
• 15 min on a modest cluster
• 5K examples, 40K features, 20 classes
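The fit step in the example can be sketched in plain Python. This is a toy analogue, not KeystoneML's Scala API: `Transformer`, `and_then`, `MeanCenterer`, and `fit_after` are hypothetical names, and the stages are trivial stand-ins for the image operators listed above. The idea shown is that fixed transformers chain directly, while an estimator is first fit on featurized training data and then becomes one more transformer in the chain.

```python
from typing import Callable, List, Sequence


class Transformer:
    """A fixed function applied to one item; chains with and_then."""

    def __init__(self, fn: Callable):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

    def and_then(self, nxt: "Transformer") -> "Transformer":
        return Transformer(lambda x: nxt(self(x)))


class MeanCenterer:
    """An estimator: fit() inspects training data and returns a Transformer."""

    def fit(self, data: Sequence[List[float]]) -> Transformer:
        n = len(data)
        means = [sum(col) / n for col in zip(*data)]
        return Transformer(lambda v: [a - m for a, m in zip(v, means)])


def fit_after(prefix: Transformer, estimator, train) -> Transformer:
    """Featurize training data with prefix, fit estimator on it, chain the two."""
    fitted = estimator.fit([prefix(x) for x in train])
    return prefix.and_then(fitted)


# Stand-ins for fixed stages such as Resize/Grayscale and MaxClassifier.
scale = Transformer(lambda v: [x / 255.0 for x in v])
argmax = Transformer(lambda v: max(range(len(v)), key=v.__getitem__))

train = [[0.0, 255.0], [255.0, 0.0], [255.0, 255.0]]
pipeline = fit_after(scale, MeanCenterer(), train).and_then(argmax)
prediction = pipeline([255.0, 0.0])  # index of the largest centered feature
```

Because every stage is a per-item function once fitted, applying the pipeline to a dataset is a map over items, which is what makes featurization and evaluation parallelize so easily.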

Current Software Features

Data Loaders

» CSV, CIFAR, ImageNet, VOC, TIMIT, 20 Newsgroups

Transformers

» NLP - Tokenization, n-grams, term frequency, NER*, parsing*

» Images - Convolution, Grayscaling, LCS, SIFT*, FisherVector*, Pooling, Windowing, HOG, Daisy

» Speech - MFCCs*

» Stats - Random Features, Normalization, Scaling*, Signed Hellinger Mapping, FFT

» Utility/misc - Caching, Top-K classifier, indicator label mapping, sparse/dense encoding transformers.

Estimators

» Learning - Block linear models, Linear Discriminant Analysis, PCA, ZCA Whitening, Naive Bayes*, GMM*

Example Pipelines

» NLP - 20 Newsgroups, Wikipedia Language model

» Images - MNIST, CIFAR, VOC, ImageNet

» Speech - TIMIT

Evaluation Metrics

» Binary Classification

» Multiclass Classification

» Multilabel Classification

* - Links to external library: MLlib, ml-matrix, VLFeat, EncEval

Research Direction: Automatic Resource Estimation

Long, complicated pipelines.

» Just a composition of dataflows!

How long will this thing take to run?

When do I cache?

» Pose as a constrained optimization problem.

Enables Efficient Hyperparameter Tuning

(ref. E. Sparks et al. “Automating Model Search for Large Scale Machine Learning”, SOCC, Aug 2015)
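One very simplified way to read "when do I cache?" as a constrained optimization: pick the set of intermediate outputs to keep in memory that saves the most recomputation time without exceeding a memory budget. The sketch below is an assumption-laden illustration, not the method from the cited work. The `Node` fields, the cost model (each extra use of an uncached output costs one full recompute), and the brute-force search are all hypothetical, and dependencies between stages are ignored.

```python
from itertools import combinations
from typing import List, NamedTuple, Sequence, Tuple


class Node(NamedTuple):
    name: str
    compute_secs: float  # estimated time to (re)compute this stage's output
    size_gb: float       # estimated size of the output if kept in memory
    uses: int            # how many downstream passes read the output


def choose_cache_set(nodes: Sequence[Node],
                     memory_gb: float) -> Tuple[List[str], float]:
    """Maximize saved recompute time subject to total cached size <= memory_gb.

    Brute force over all subsets; fine for a handful of stages only.
    """
    best_saving, best_set = 0.0, ()
    for k in range(len(nodes) + 1):
        for subset in combinations(nodes, k):
            if sum(n.size_gb for n in subset) > memory_gb:
                continue  # violates the memory constraint
            saving = sum(n.compute_secs * (n.uses - 1) for n in subset)
            if saving > best_saving:
                best_saving, best_set = saving, subset
    return [n.name for n in best_set], best_saving


# Made-up estimates for three stages of an image pipeline.
stages = [
    Node("sift", compute_secs=600.0, size_gb=30.0, uses=2),
    Node("pca", compute_secs=100.0, size_gb=5.0, uses=10),
    Node("fisher", compute_secs=300.0, size_gb=20.0, uses=10),
]
to_cache, saved = choose_cache_set(stages, memory_gb=32.0)
```

The same estimates of per-stage time and size also answer "how long will this thing take to run?", which is why resource estimation and caching decisions are treated together.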

[Figure: example pipeline graph. Stages shown: Resize, Grayscale, SIFT, PCA, Fisher Vector; LCS, PCA, Fisher Vector; Block Linear Solver; Weighted Block Linear Solver; Top 5 Classifier.]



SampleClean

• Released two Spark Packages
  – SampleClean: SparkSQL-integrated library for record dedup, entity resolution, and active learning
  – AMPCrowd: web service for crowdsourcing through Amazon Mechanical Turk or an "internal" crowd
    • REST API to allow for human-in-the-loop, asynchronous data cleaning pipelines
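A rough sketch of how record dedup with a human in the loop might be organized. It assumes a simple token-overlap similarity and two hypothetical thresholds; `jaccard`, `triage_pairs`, and the threshold values are invented for illustration and are not the SampleClean or AMPCrowd API. Confident pairs are resolved automatically, and only the uncertain middle band would be sent to a crowd for labeling.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two records, case-insensitive."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def triage_pairs(records: Sequence[str],
                 auto_match: float = 0.8,
                 auto_reject: float = 0.3
                 ) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]:
    """Split candidate pairs into automatic matches and pairs needing a human."""
    matches, ask_crowd = [], []
    for i, j in combinations(range(len(records)), 2):
        s = jaccard(records[i], records[j])
        if s >= auto_match:
            matches.append((i, j))    # confident duplicate
        elif s > auto_reject:
            ask_crowd.append((i, j))  # uncertain: would be queued for the crowd
        # otherwise: confident non-duplicate, dropped
    return matches, ask_crowd


records = [
    "Cafe Ritz Carlton San Francisco",
    "cafe ritz carlton san francisco",
    "Ritz Carlton Cafe SF",
    "Chez Panisse Berkeley",
]
matches, ask_crowd = triage_pairs(records)
```

In an asynchronous pipeline the `ask_crowd` pairs would be posted to the crowd service and merged back when answers arrive, while the automatic matches proceed immediately.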


SampleClean Framework

Current research focus: Latency Reduction for human-in-the-loop
• Straggler Mitigation
• Pool Maintenance
• Active Learning

Summary

• AMPLab project
  – Cross-disciplinary team, Industry engagement
  – Open Source development and community building

• BDAS philosophy: Unification
  – Spark + SQL + Graphs + ML + …

• After graduating Mesos, Tachyon & Spark we are moving up the stack to support declarative and real-time Machine Learning and analytics.

…

To find out more or get involved:

amplab.berkeley.edu

[email protected]


Thanks to NSF CISE Expeditions in Computing, DARPA XData, Founding Sponsors: Amazon Web Services, Google, IBM, and SAP, the Thomas and Stacy Siebel Foundation, all our industrial sponsors and partners, and all the members of the AMPLab Team.
