What’s New in the Berkeley Data Analytics Stack

What’s Next for the Berkeley Data Analytics Stack UC BERKELEY Michael Franklin July 20 2015 Data Science Summit SF

Transcript of What’s New in the Berkeley Data Analytics Stack

Page 1: What’s New in the Berkeley Data Analytics Stack

What’s Next for the Berkeley Data Analytics Stack

UC BERKELEY

Michael Franklin
July 20, 2015

Data Science Summit SF

Page 2: What’s New in the Berkeley Data Analytics Stack

The Berkeley AMPLab

80+ Students, Postdocs, Faculty and Staff from: Databases, Machine Learning, Systems, Security, and Networking

Mission Statement: Making Sense of Data at Scale by Integrating:
• Algorithms – Machine Learning, Statistical Methods
• Machines – Cluster and Cloud Computing
• People – Crowdsourcing and Human Computation

Franklin, Jordan, Stoica, Patterson, Shenker, Recht, Katz, Joseph, Goldberg, Mahoney, Popa, Gonzalez

Page 3: What’s New in the Berkeley Data Analytics Stack

AMPLab: A Public/Private Partnership

NSF CISE Expedition Award: Part of 2012 White House Big Data Initiative

DARPA XData Program

DoE/Lawrence Berkeley National Lab

And these Industrial Sponsors:

Page 4: What’s New in the Berkeley Data Analytics Stack

Berkeley Data Analytics Stack (Apache and BSD open source)

In-House Applications: Cancer Genomics, Energy Debugging, Smart Buildings

Access and Interfaces: Spark Streaming, Shark, BlinkDB, GraphX, MLlib, MLBase, SparkR, SampleClean, Velox Model Serving

Processing Engine: Spark

Storage: Tachyon; HDFS, S3, …

Resource Virtualization: Mesos, Yarn

Page 5: What’s New in the Berkeley Data Analytics Stack

Big Data Ecosystem Evolution

General batch processing: MapReduce

Specialized systems (iterative, interactive and streaming apps): Pregel, Dremel, GraphLab, Storm, Giraph, Drill, Tez, Impala, S4, …

Page 6: What’s New in the Berkeley Data Analytics Stack

AMPLab Unification Philosophy

Don’t specialize MapReduce – Generalize it!

Two additions to Hadoop MR can enable all the models shown earlier!

1. General Task DAGs

2. Data Sharing
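A minimal sketch of these two additions, assuming a toy single-process model rather than Spark's actual scheduler: tasks form a general DAG instead of a fixed map-then-reduce shape, and intermediate results stay in memory so that several downstream tasks can share them. The names `Task`, `run`, and the example graph are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    """One node in a task DAG: a function applied to its parents' outputs."""
    name: str
    fn: Callable[..., object]
    parents: List["Task"] = field(default_factory=list)


def run(task: Task, shared: Dict[str, object], log: List[str]) -> object:
    """Evaluate `task`, reusing any result already held in `shared`."""
    if task.name in shared:           # data sharing: reuse the in-memory result
        return shared[task.name]
    inputs = [run(parent, shared, log) for parent in task.parents]
    log.append(task.name)             # record that this task actually executed
    shared[task.name] = task.fn(*inputs)
    return shared[task.name]


# A diamond-shaped DAG: two branches read the same loaded dataset.
load = Task("load", lambda: [3, 1, 4, 1, 5, 9, 2, 6])
total = Task("total", lambda xs: sum(xs), [load])
count = Task("count", lambda xs: len(xs), [load])
mean = Task("mean", lambda s, n: s / n, [total, count])

shared: Dict[str, object] = {}
executed: List[str] = []
result = run(mean, shared, executed)
```

In this sketch `load` runs once even though both `total` and `count` consume it, which is the kind of reuse a chain of independent MapReduce jobs would have to obtain by writing to and re-reading from storage.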

For Users: Fewer Systems to Use, Less Data Movement

Spark: Spark Streaming, GraphX, Spark SQL, MLbase, …

Page 7: What’s New in the Berkeley Data Analytics Stack

In-Memory Dataflow System

M. Zaharia, M. Chowdhury, M. Franklin, I. Stoica, S. Shenker, “Spark: Cluster Computing with Working Sets”, USENIX HotCloud, 2010.

• Developed in AMPLab and its predecessor the RADLab
• Alternative to Hadoop MapReduce
• 10-100x speedup for ML and interactive queries
• Central component of the BDAS Stack
• “Graduated” to Apache Foundation -> Apache Spark

Page 8: What’s New in the Berkeley Data Analytics Stack

Apache Spark Meetups Around the World (Jan ‘15)

Page 9: What’s New in the Berkeley Data Analytics Stack

Apache Spark Meetups Around the World (July ‘15)

+ 72%

+124%

+ 79%

+ 57%

Page 10: What’s New in the Berkeley Data Analytics Stack

Berkeley Data Analytics Stack

In-house Apps: Cancer Genomics, Energy Debugging, Smart Buildings

Access and Interfaces: Spark Streaming, SparkSQL, BlinkDB, GraphX, MLlib, MLBase, SampleClean, G-OLA, SparkR, Velox, MLPipelines, Splash

Processing Engine: Spark Core

Storage: Tachyon, Succinct; HDFS, S3, Ceph, …

Resource Virtualization: Mesos, Hadoop Yarn

Page 11: What’s New in the Berkeley Data Analytics Stack

Berkeley Data Analytics Stack


Spark Core

• Major rearchitecture and features (community)
  – DataFrames API
  – Tungsten: bringing Spark closer to bare metal
    • Memory Management and Binary Processing
    • Cache-aware computation
    • Code generation
• R interface
• Spark SQL and Spark Streaming enhancements
• Still rapidly growing!

Page 12: What’s New in the Berkeley Data Analytics Stack


BDAS: Latest Developments

• Velox – Model Serving and Personalization
  – KeystoneML integration
  – Improved service APIs and deployment tools
  – Open source alpha release

Page 13: What’s New in the Berkeley Data Analytics Stack

Introducing Velox: Model Serving

Data → Training → Model

Where do models go?
• Conference Papers
• Sales Reports
• Drive Actions

Page 14: What’s New in the Berkeley Data Analytics Stack

Driving Actions

• Suggesting Items at Checkout
• Fraud Detection
• Cognitive Assistance
• Internet of Things

Low-Latency, Personalized, Rapidly Changing

Page 15: What’s New in the Berkeley Data Analytics Stack

Problem: Separate Systems

Offline Analytics Systems: Sophisticated ML on static data.

Online Serving Systems (MongoDB): Low-Latency data serving.

How do we serve low-latency predictions and train on live data?

Page 16: What’s New in the Berkeley Data Analytics Stack

Velox Model Serving System

Decompose personalized predictive models:

[CIDR’15]

Page 17: What’s New in the Berkeley Data Analytics Stack

Velox Model Serving System

Decompose personalized predictive models: [Crankshaw, Bailis, Gonzalez et al. CIDR’15]

Split:
• Feature Model (Batch): Feature Caching, Approx. Features
• Personalization Model (Online): Online Updates, Active Learning

Order-of-magnitude reductions in prediction latencies.
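A minimal sketch of the split described on this slide, under stated assumptions: a prediction is taken to be the dot product of a small per-user weight vector (the personalization model) with the output of a shared feature function (the feature model). Feature outputs are cached, and only the user's weights are updated online. The class `SplitModelServer`, the function `feature_model`, and the plain gradient step are hypothetical stand-ins, not the Velox API or its actual update rule.

```python
from typing import Dict, List, Tuple

Item = Tuple[float, float]


def feature_model(item: Item) -> List[float]:
    """Stand-in for an expensive feature function trained offline in batch."""
    a, b = item
    return [1.0, a, b, a * b]


class SplitModelServer:
    def __init__(self, dim: int, lr: float = 0.1) -> None:
        self.dim = dim
        self.lr = lr
        self.user_weights: Dict[str, List[float]] = {}
        self.feature_cache: Dict[Item, List[float]] = {}
        self.feature_calls = 0  # counts real feature computations

    def _features(self, item: Item) -> List[float]:
        # Feature caching: compute each item's features at most once.
        if item not in self.feature_cache:
            self.feature_calls += 1
            self.feature_cache[item] = feature_model(item)
        return self.feature_cache[item]

    def _weights(self, user: str) -> List[float]:
        return self.user_weights.setdefault(user, [0.0] * self.dim)

    def predict(self, user: str, item: Item) -> float:
        w, f = self._weights(user), self._features(item)
        return sum(wi * fi for wi, fi in zip(w, f))

    def observe(self, user: str, item: Item, label: float) -> None:
        """Online update: one squared-error gradient step on this user's weights."""
        f = self._features(item)
        err = self.predict(user, item) - label
        w = self._weights(user)
        for i in range(self.dim):
            w[i] -= self.lr * err * f[i]


server = SplitModelServer(dim=4)
item = (0.5, 2.0)
error_before = abs(server.predict("alice", item) - 1.0)
for _ in range(20):
    server.observe("alice", item, 1.0)
error_after = abs(server.predict("alice", item) - 1.0)
```

The design point the sketch tries to show is that feedback for one user touches only that user's small weight vector, while the costly feature computation is shared across users and served from the cache.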

Page 18: What’s New in the Berkeley Data Analytics Stack

BDAS: Latest Developments


• MLPipelines (now KeystoneML)
  – Alpha release
  – End-to-end pipelines in vision, speech, and NLP
  – Horizontal scalability to 100s of machines and multi-terabyte datasets

Page 19: What’s New in the Berkeley Data Analytics Stack

What is KeystoneML?

Software framework for building scalable end-to-end machine learning pipelines.

Helps us explore how to build systems for robust, scalable, end-to-end advanced analytics workloads and the patterns that emerge.

Example pipelines that achieve state-of-the-art results on large scale datasets in computer vision, NLP, and speech - fast.

Previewed at AMP Camp 5 and on AMPLab Blog as “ML Pipelines”

Public release last month! http://keystone-ml.org/

Page 20: What’s New in the Berkeley Data Analytics Stack

How does it fit with BDAS?

• Spark: MLlib, GraphX, ml-matrix
• KeystoneML: Batch Model Training
• Velox Model Server: Real Time Serving

http://amplab.github.io/velox-modelserver

Page 21: What’s New in the Berkeley Data Analytics Stack

Example: Image Classification

Pipeline applied to Images (VOC2007) with .fit( ):
Resize → Grayscale → SIFT → PCA → Fisher Vector → Linear Regression → MaxClassifier

Fitted pipeline:
Resize → Grayscale → SIFT → PCA Map → Fisher Encoder → Linear Model → MaxClassifier

• Achieves performance of Chatfield et al., 2011
• Embarrassingly parallel featurization and evaluation
• 15 min on a modest cluster
• 5K examples, 40K features, 20 classes
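The fit step on this slide can be sketched as follows. This is a toy illustration, not the KeystoneML API: a pipeline is a list of stages, some of which are plain transformers and some of which are estimators that must see training data first. Fitting walks the chain and replaces each estimator with the transformer it learns, in the way PCA becomes a PCA map and linear regression becomes a linear model. All class and function names are hypothetical, and the stages work on single numbers rather than images.

```python
from typing import Callable, List, Sequence, Union

Transformer = Callable[[float], float]


class Estimator:
    """A stage that must see training data before it can transform anything."""

    def fit(self, xs: Sequence[float], ys: Sequence[float]) -> Transformer:
        raise NotImplementedError


class MeanCenter(Estimator):
    def fit(self, xs: Sequence[float], ys: Sequence[float]) -> Transformer:
        mu = sum(xs) / len(xs)
        return lambda x: x - mu


class LeastSquaresSlope(Estimator):
    """Fit y = w * x (no intercept) by ordinary least squares."""

    def fit(self, xs: Sequence[float], ys: Sequence[float]) -> Transformer:
        w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        return lambda x: w * x


def fit_pipeline(stages: List[Union[Transformer, Estimator]],
                 xs: Sequence[float], ys: Sequence[float]) -> Transformer:
    """Fit estimators in order, feeding each the output of the stages before it."""
    fitted: List[Transformer] = []
    current = list(xs)
    for stage in stages:
        t = stage.fit(current, ys) if isinstance(stage, Estimator) else stage
        fitted.append(t)
        current = [t(x) for x in current]

    def apply(x: float) -> float:
        for t in fitted:
            x = t(x)
        return x

    return apply


def halve(x: float) -> float:          # plain transformer, nothing to learn
    return x / 2.0


def sign(score: float) -> float:       # final decision stage
    return 1.0 if score >= 0 else -1.0


train_x = [2.0, 4.0, 6.0, 8.0]
train_y = [-1.0, -1.0, 1.0, 1.0]
model = fit_pipeline([halve, MeanCenter(), LeastSquaresSlope(), sign],
                     train_x, train_y)
predictions = [model(x) for x in train_x]
```

Once fitted, `model` is an ordinary function from input to prediction, so applying it to each example is independent of every other example.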

Page 22: What’s New in the Berkeley Data Analytics Stack

Current Software Features

Data Loaders

» CSV, CIFAR, ImageNet, VOC, TIMIT, 20 Newsgroups

Transformers

» NLP - Tokenization, n-grams, term frequency, NER*, parsing*

» Images - Convolution, Grayscaling, LCS, SIFT*, FisherVector*, Pooling, Windowing, HOG, Daisy

» Speech - MFCCs*

» Stats - Random Features, Normalization, Scaling*, Signed Hellinger Mapping, FFT

» Utility/misc - Caching, Top-K classifier, indicator label mapping, sparse/dense encoding transformers.

Estimators

» Learning - Block linear models, Linear Discriminant Analysis, PCA, ZCA Whitening, Naive Bayes*, GMM*

Example Pipelines

» NLP - 20 Newsgroups, Wikipedia Language model

» Images - MNIST, CIFAR, VOC, ImageNet

» Speech - TIMIT

Evaluation Metrics

» Binary Classification

» Multiclass Classification

» Multilabel Classification

* - Links to external library: MLlib, ml-matrix, VLFeat, EncEval

Page 23: What’s New in the Berkeley Data Analytics Stack

Research Direction: Automatic Resource Estimation

Long, complicated pipelines.

» Just a composition of dataflows!

How long will this thing take to run?

When do I cache?

» Pose as a constrained optimization problem.

Enables Efficient Hyperparameter Tuning

(ref. E. Sparks et al. “Automating Model Search for Large Scale Machine Learning”, SOCC, Aug 2015)

Pipeline stages shown: Resize, Grayscale, SIFT → PCA → Fisher Vector, LCS → PCA → Fisher Vector, Block Linear Solver, Weighted Block Linear Solver, Top 5 Classifier
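One way to read "When do I cache?" as a constrained optimization is sketched below, under strong simplifying assumptions: each stage has an estimated compute time, an output size, and a number of times its output is read; an uncached stage is recomputed on every read; and the goal is to minimize total time subject to a memory budget. The stage names echo the slide, but every number is invented for illustration, the cost model ignores recomputation of upstream stages, and the exhaustive search is a stand-in for whatever solver the research actually uses.

```python
from itertools import combinations
from typing import Dict, FrozenSet, NamedTuple, Tuple


class Stage(NamedTuple):
    seconds: float    # estimated time to compute the stage's output once
    gigabytes: float  # memory needed to keep that output cached
    uses: int         # how many times downstream work reads the output


def total_time(stages: Dict[str, Stage], cached: FrozenSet[str]) -> float:
    """Cached stages are computed once; uncached stages once per read."""
    return sum(s.seconds * (1 if name in cached else s.uses)
               for name, s in stages.items())


def best_cache_set(stages: Dict[str, Stage],
                   budget_gb: float) -> Tuple[FrozenSet[str], float]:
    """Exhaustively search cache sets that fit within the memory budget."""
    names = list(stages)
    best: Tuple[FrozenSet[str], float] = (frozenset(),
                                          total_time(stages, frozenset()))
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if sum(stages[n].gigabytes for n in subset) > budget_gb:
                continue  # violates the memory constraint
            t = total_time(stages, frozenset(subset))
            if t < best[1]:
                best = (frozenset(subset), t)
    return best


# Illustrative numbers only.
stages = {
    "sift": Stage(seconds=60.0, gigabytes=8.0, uses=3),
    "pca": Stage(seconds=10.0, gigabytes=2.0, uses=3),
    "fisher": Stage(seconds=30.0, gigabytes=4.0, uses=6),
}
chosen, seconds = best_cache_set(stages, budget_gb=10.0)
```

With these made-up figures the most expensive stage is not worth caching, because its output would use most of the budget while a cheaper, more frequently read stage saves more time per gigabyte.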

Page 24: What’s New in the Berkeley Data Analytics Stack


BDAS: Latest Developments

• Released two Spark Packages
  – SampleClean: SparkSQL-integrated library for record dedup, entity resolution, and active learning
  – AMPCrowd: web service for crowdsourcing through Amazon Mechanical Turk or an "internal" crowd
• REST API to allow for human-in-the-loop, asynchronous data cleaning pipelines
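A minimal sketch of how record dedup and a human-in-the-loop step can fit together, assuming a simple token-overlap (Jaccard) similarity and two hypothetical thresholds: pairs that are clearly duplicates are merged automatically, and ambiguous pairs are queued for a crowd to judge. None of the names below come from the SampleClean or AMPCrowd APIs, and the sample records are made up.

```python
from typing import List, Set, Tuple

Pair = Tuple[int, int]


def tokens(record: str) -> Set[str]:
    """Lowercase a record and split it into a set of word tokens."""
    return set(record.lower().replace(",", " ").split())


def jaccard(a: str, b: str) -> float:
    """Token-set overlap: |intersection| / |union|."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def triage(records: List[str], auto_merge: float = 0.8,
           ask_crowd: float = 0.4) -> Tuple[List[Pair], List[Pair]]:
    """Split candidate pairs into automatic merges and crowd questions."""
    merged: List[Pair] = []
    for_crowd: List[Pair] = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            sim = jaccard(records[i], records[j])
            if sim >= auto_merge:
                merged.append((i, j))      # confident duplicate
            elif sim >= ask_crowd:
                for_crowd.append((i, j))   # ambiguous: ask a human
    return merged, for_crowd


records = [
    "UC Berkeley AMPLab",
    "uc berkeley amplab",
    "AMPLab Berkeley",
    "Stanford InfoLab",
]
merged, for_crowd = triage(records)
```

Because the crowd answers arrive later than the automatic decisions, a pipeline built this way would naturally be asynchronous, with the queued pairs resolved as responses come back.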

Page 25: What’s New in the Berkeley Data Analytics Stack

SampleClean Framework

Current research focus: Latency Reduction for human-in-the-loop
• Straggler Mitigation
• Pool Maintenance
• Active Learning

Page 26: What’s New in the Berkeley Data Analytics Stack

Summary

• AMPLab project
• Cross-disciplinary team, Industry engagement
• Open Source development and community building

• BDAS philosophy: Unification
• Spark + SQL + Graphs + ML + …

• After graduating Mesos, Tachyon & Spark we are moving up the stack to support declarative and real-time Machine Learning and analytics.

Page 27: What’s New in the Berkeley Data Analytics Stack

To find out more or get involved:

amplab.berkeley.edu

[email protected]

UC BERKELEY

Thanks to NSF CISE Expeditions in Computing, DARPA XData, Founding Sponsors: Amazon Web Services, Google, IBM, and SAP, the Thomas and Stacy Siebel Foundation, all our industrial sponsors and partners, and all the members of the AMPLab Team.