The Berkeley AMPLab - Collaborative Big Data Research UC BERKELEY Anthony D. Joseph LASER Summer School September 2013



About Me

Education: MIT SB, MS, PhD

Joined Univ. of California, Berkeley in 1998

Current research areas:
» Cloud computing (Mesos): http://mesos.apache.org/
» Secure Machine Learning (SecML): http://radlab.cs.berkeley.edu/wiki/SecML
» DETER security testbed: http://deter-project.org/
» Intel Science and Technology Center for User Security: http://scrub.cs.berkeley.edu/

Other: Peer-to-Peer networking (Tapestry), Mobile computing, Wireless/Cellular networking


Sources Driving Big Data

It’s All Happening On-line

Every: Click, Ad impression, Billing event, Fast forward, pause, …, Friend request, Transaction, Network message, Fault, …

User Generated (Web & Mobile)

Internet of Things / M2M

Scientific Computing

Challenge 1: Data is Big

[Figure: Projected growth, shown as increase over 2010 (y-axis, 0 to 60) for the years 2010 to 2015 (x-axis); series: Moore's Law, Overall Data, Particle Accel., DNA Sequencers]

Data grows faster than Moore’s Law [IDC report, Kathy Yelick, LBNL]

Challenge 2: Data is Dirty

•  Variety of diverse sources

•  Uncurated

•  No schema

•  Inconsistent syntax and semantics

Dirty Data worse than Big Data

Challenge 3: Complex Questions

Hard questions
» What is the impact on traffic and home prices of building a new ramp?

Real-time questions
» Is there a cyber attack going on?

Open-ended questions
» How many supernovae happened last year?

Big Data Must Enable Decisions

Requires Multifaceted Approach

Three dimensions to improve data analysis
» Improving scale, efficiency, and quality of algorithms running in datacenters (Algorithms)
» Scaling up datacenters (Machines)
» Leverage human activity and intelligence (People)

Need to adaptively and flexibly combine all three dimensions


Algorithms, Machines, People (AMP)

•  Today’s apps: fixed point in solution space

[Figure: solution space with three axes (Algorithms, Machines, People); points marked: search, Watson/IBM]

Need techniques to dynamically pick best operating point

The AMP Lab

[Figure: Algorithms, Machines, People axes with search and Watson/IBM marked]

Make sense of data at scale by tightly integrating algorithms, machines, and people

AMP Lab

Faculty
» Alex Bayen (mobile sensing platforms)
» Armando Fox (systems)
» Michael Franklin (databases): Director
» Michael Jordan (machine learning): Co-director
» Anthony Joseph (secure machine learning & privacy)
» Randy Katz (systems)
» David Patterson (systems)
» Ion Stoica (systems): Co-director
» Scott Shenker (networking)

Algorithms

State-of-the-art Machine Learning (ML) algorithms do not scale
» Prohibitive to process all data points

[Figure: estimate plotted against # of data points, with the true answer marked]

How do you know when to stop?

Algorithms

Given any problem, data and a budget
» Immediate results with continuous improvement
» Calibrate answer: provide error bars

[Figure: estimate plotted against # of data points, with the true answer marked]

Error bars on every answer!

Algorithms

Given any problem, data and a budget
» Immediate results with continuous improvement
» Calibrate answer: provide error bars

[Figure: estimate plotted against # of data points (time), with the true answer marked]

Stop when error smaller than a given threshold
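The "stop when the error is smaller than a given threshold" idea can be illustrated with a small sketch. This is not the AMPLab's implementation; it assumes the quantity being estimated is a simple mean, uses a normal-approximation error bar (z = 1.96), and the names `estimate_mean`, `threshold`, and `min_samples` are hypothetical. Checking the error bar repeatedly while sampling also weakens its nominal coverage, which a real system would have to account for.

```python
import math
import random


def estimate_mean(stream, threshold, z=1.96, min_samples=30):
    """Consume values until the error bar half-width is <= threshold.

    Returns (estimate, half_width, n_used). The running mean and variance
    are maintained with Welford's online algorithm, so no data is stored.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    half_width = float("inf")
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_samples:
            std_err = math.sqrt(m2 / (n - 1) / n)  # standard error of the mean
            half_width = z * std_err
            if half_width <= threshold:
                break  # answer is good enough; skip the remaining data
    return mean, half_width, n


# Synthetic data (an assumption for illustration): one million noisy readings.
rng = random.Random(0)
data = (rng.gauss(10.0, 2.0) for _ in range(1_000_000))
estimate, half_width, n_used = estimate_mean(data, threshold=0.1)
print(f"{estimate:.2f} +/- {half_width:.2f} after {n_used} points")
```

The point of the sketch is that the loop returns after a small fraction of the input once the error bar is tight enough, rather than processing every data point.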

Algorithms

Given any problem, data and a time budget
» Automatically pick a solution on ML algorithm spectrum

[Figure: estimate plotted against time for a simple and a sophisticated algorithm, with the true answer marked; annotations: "pick simple", "pick sophisticated", "error too high"]
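One way to picture picking a point on the algorithm spectrum under a time budget is the sketch below. It assumes each candidate comes with a runtime estimate and an expected error; those estimates, the `Candidate` type, and the numbers are all hypothetical, and producing such estimates is the hard part that the slide leaves open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    est_seconds: float  # assumed runtime estimate on the given data
    est_error: float    # assumed expected error of the answer


def pick_algorithm(candidates, time_budget):
    """Return the lowest-error candidate that fits the budget, or None."""
    feasible = [c for c in candidates if c.est_seconds <= time_budget]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.est_error)


# Hypothetical numbers, for illustration only.
CANDIDATES = [
    Candidate("simple", est_seconds=1.0, est_error=0.10),
    Candidate("sophisticated", est_seconds=60.0, est_error=0.02),
]

for budget in (5.0, 100.0):
    print(budget, pick_algorithm(CANDIDATES, budget).name)
```

With a short budget the simple algorithm is chosen because the sophisticated one cannot finish in time; with a generous budget the sophisticated one wins on expected error.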

Machines

“The datacenter as a computer” still in its infancy
» Special purpose clusters, e.g., Hadoop cluster
» Highly variable performance
» Hard to program
» Hard to debug

Machines: Problem

Rapid innovation in cloud computing

No single framework optimal for all applications

Want to run multiple frameworks in a single cluster
» … to maximize utilization
» … to share data between frameworks

[Figure: example frameworks: Dryad, Pregel, Cassandra, Hypertable]

Machines: A Solution

Apache Mesos: a resource sharing layer supporting diverse frameworks
» Fine-grained sharing: Improves utilization, latency, and data locality
» Resource offers: Simple, scalable application-controlled scheduling mechanism

[Figure: Hadoop, Pregel, … running on shared nodes via Mesos, alongside separate Hadoop and Pregel clusters with their own nodes]

B. Hindman, et al., Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011, March 2011. http://mesos.apache.org/
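The resource-offer idea can be sketched as a toy simulation: a master offers each node's free CPUs to frameworks, and each framework decides for itself how much of an offer to use. This is not the Mesos API; `Offer`, `Framework`, `on_offer`, and `run_offer_round` are hypothetical names, only CPUs are modeled, and frameworks are offered resources in a fixed order rather than by an allocation policy.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    node: str
    cpus: int


class Framework:
    """Toy framework that only accepts offers on nodes holding its data."""

    def __init__(self, name, preferred_nodes, cpus_per_task, pending_tasks):
        self.name = name
        self.preferred_nodes = set(preferred_nodes)
        self.cpus_per_task = cpus_per_task
        self.pending_tasks = pending_tasks

    def on_offer(self, offer):
        """Return how many CPUs of the offer to use (0 declines it)."""
        if self.pending_tasks == 0 or offer.node not in self.preferred_nodes:
            return 0
        tasks = min(self.pending_tasks, offer.cpus // self.cpus_per_task)
        self.pending_tasks -= tasks
        return tasks * self.cpus_per_task


def run_offer_round(free_cpus, frameworks):
    """Offer each node's free CPUs to every framework in turn.

    Mutates free_cpus and returns (framework, node, cpus) allocations.
    """
    allocations = []
    for node in sorted(free_cpus):
        for fw in frameworks:
            if free_cpus[node] == 0:
                break
            used = fw.on_offer(Offer(node, free_cpus[node]))
            if used:
                free_cpus[node] -= used
                allocations.append((fw.name, node, used))
    return allocations


free = {"node1": 4, "node2": 4}
hadoop = Framework("hadoop", ["node1"], cpus_per_task=1, pending_tasks=3)
pregel = Framework("pregel", ["node1", "node2"], cpus_per_task=2, pending_tasks=2)
allocs = run_offer_round(free, [hadoop, pregel])
print(allocs)  # [('hadoop', 'node1', 3), ('pregel', 'node2', 4)]
```

The scheduling logic (here, a data-locality preference) lives in the framework, while the master only tracks what is free. That split is what "application-controlled scheduling" refers to on the slide.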

People

Humans can make sense of messy data!


People

Make people an integrated part of the system!
» Leverage human activity
» Leverage human intelligence (crowdsourcing):
•  Curate and clean dirty data
•  Answer imprecise questions
•  Test and improve algorithms

Challenge
» Inconsistent answer quality in all dimensions (e.g., type of question, time, cost)

[Figure: Machines + Algorithms exchanging data, activity, questions, and answers with people]
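A common baseline for coping with inconsistent crowd answers is to ask several workers the same question and combine their responses. The sketch below uses plain majority voting with an agreement cutoff; it is an illustrative assumption, not a technique named in the talk, and `aggregate_answers` and `min_agreement` are hypothetical names.

```python
from collections import Counter


def aggregate_answers(answers, min_agreement=0.6):
    """Majority vote over worker answers.

    Returns (label, agreement). The label is None when no answer reaches
    the agreement cutoff, signaling that more workers should be asked.
    """
    if not answers:
        return None, 0.0
    label, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    if agreement < min_agreement:
        return None, agreement
    return label, agreement


print(aggregate_answers(["cat", "cat", "dog"]))
print(aggregate_answers(["cat", "dog"]))
```

The agreement fraction acts as a rough quality signal: low agreement can trigger another round of questions, at the price of more time and cost.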

Our Vision: A Necessary Synergy

Columns: Algorithms, Machines, People

Challenge 1: Data is Big ✔ ✔

Challenge 2: Data is Dirty ✔ ✔ ✔

Challenge 3: Questions are complex ✔ ✔ ✔

Berkeley Data Analytics Stack

» Spark Streaming, Shark (SQL), BlinkDB, GraphX, MLBase
» Apache Spark
» HDFS / Hadoop Storage / Tachyon
» Apache Mesos / YARN Resource Manager

Big Data in 2020 Almost Certainly:

Create a new generation of big data scientists

A real datacenter OS

ML becoming an engineering discipline

People deeply integrated in big data analysis pipeline

If We’re Lucky:

System will know what to throw away

Come up with answers in minutes no one knows

Summary

Goal: Tame Big Data Problem
» Get results with right quality at the right time

Approach: Holistically integrate Algorithms, Machines, and People

Huge research issues across many domains

My Talks at LASER 2013

1.  AMP Lab introduction (this talk)

2.  The Datacenter Needs an Operating System

3.  Mesos, part one

4.  Dominant Resource Fairness

5.  Mesos, part two

6.  Spark