Berkeley Data Analytics Stack: Experience and Lesson Learned


Page 1: Berkeley Data Analytics Stack: Experience and Lesson Learned

Berkeley Data Analytics Stack: Experience and Lesson Learned

Ion Stoica UC Berkeley, Databricks, Conviva

Page 2: Berkeley Data Analytics Stack: Experience and Lesson Learned

Research Philosophy
Follow real problems
Focus on novel usage scenarios
Build real systems
» Be paranoid about simplicity
» Very hard to build complex systems in academia

Push for adoption
» Develop communities
» Train users

Disclaimer: By no means the only way to do research!

Page 3: Berkeley Data Analytics Stack: Experience and Lesson Learned

A Short History
2006: Start research in cluster computing
» Improve MapReduce scheduler (e.g., Fair Scheduler)

2009: Start building a Data Analytics Stack
» Spring 2009: Mesos
» Summer 2009: Spark
» 2010: Shark
» 2011: SparkStreaming
» 2012: Tachyon
» 2013: MLlib
» …

[Figure: timeline, 2005-2010 and 2011-2017]

Page 4: Berkeley Data Analytics Stack: Experience and Lesson Learned

The Berkeley AMPLab
January 2011 – 2017
» 8 faculty
» > 60 students
» Team of 3 software engineers

Organized for collaboration

3-day retreats (twice a year)

Algorithms, Machines, People

220 campers (100+ companies)

AMPCamp3 (November, 2014)

Page 5: Berkeley Data Analytics Stack: Experience and Lesson Learned

The Berkeley AMPLab
Governmental and industrial funding

[Figure: sponsor logos]

Goal: Next generation of open source data analytics stack for industry & academia: Berkeley Data Analytics Stack (BDAS)

Page 6: Berkeley Data Analytics Stack: Experience and Lesson Learned

Data Processing Stack

Data Processing Layer

Resource Management Layer

Storage Layer

Page 7: Berkeley Data Analytics Stack: Experience and Lesson Learned

Hadoop Stack

Data Processing Layer: Hadoop MR, Hive, Pig, Impala, Storm, …

Resource Management Layer: Hadoop YARN

Storage Layer: HDFS, S3, …

Page 8: Berkeley Data Analytics Stack: Experience and Lesson Learned

BDAS Stack

Data Processing Layer: Spark, SparkStreaming, Shark SQL, BlinkDB, GraphX, MLlib, MLBase

Resource Management Layer: Mesos

Storage Layer: Tachyon, HDFS, S3, …

Page 9: Berkeley Data Analytics Stack: Experience and Lesson Learned

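How do BDAS & Hadoop fit together?

BDAS stack
Data Processing Layer: Spark, SparkStreaming, Shark SQL, BlinkDB, GraphX, MLlib, MLBase
Resource Management Layer: Mesos
Storage Layer: Tachyon, HDFS, S3, …

Hadoop stack
Resource Management Layer: Hadoop YARN
Storage Layer: HDFS, S3, …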

Page 10: Berkeley Data Analytics Stack: Experience and Lesson Learned

How do BDAS & Hadoop fit together?

Data Processing Layer: Spark (Spark Streaming, Spark SQL, GraphX, ML library, BlinkDB, MLbase) alongside Hadoop MR, Hive, Pig, Impala, Storm

Resource Management Layer: Mesos, Hadoop YARN

Storage Layer: Tachyon, HDFS, S3, …

Page 11: Berkeley Data Analytics Stack: Experience and Lesson Learned

How do BDAS & Hadoop fit together?

Data Processing Layer: Spark (Spark Streaming, Spark SQL, GraphX, ML library, BlinkDB, MLbase), Hadoop MR, Hive, Pig, Impala, Storm

Resource Management Layer: Mesos, Hadoop YARN

Storage Layer: Tachyon, HDFS, S3, …

Page 12: Berkeley Data Analytics Stack: Experience and Lesson Learned

Apache Mesos
Problem: per-framework cluster
» Inefficient resource usage
» Hard to experiment, upgrade
» Hard to share data

Solution: common resource sharing layer
» Abstracts (“virtualizes”) resources to frameworks
» Enables diverse frameworks to share a cluster

[Figure: Uniprogramming (MPI and Hadoop on separate clusters) vs. Multiprogramming (MPI and Hadoop sharing one cluster through Mesos)]
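The contrast between per-framework clusters and a common sharing layer can be illustrated with a small sketch. This is plain Python, not the Mesos API: the function names, the 10-machine cluster, and the demand numbers are all hypothetical, and the sharing policy (hand the next free machine to whichever framework holds the fewest) is only one simple choice.

```python
from typing import Dict


def static_partition(demand: Dict[str, int], partition: Dict[str, int]) -> Dict[str, int]:
    """Per-framework clusters: each framework can use only its own machines."""
    return {fw: min(demand[fw], partition[fw]) for fw in demand}


def shared_pool(demand: Dict[str, int], total: int) -> Dict[str, int]:
    """Common sharing layer: hand out machines one at a time to frameworks
    that still have unmet demand, until the pool is empty."""
    alloc = {fw: 0 for fw in demand}
    free = total
    while free > 0:
        wanting = [fw for fw in demand if alloc[fw] < demand[fw]]
        if not wanting:
            break
        # Give the next machine to the framework holding the fewest so far.
        fw = min(wanting, key=lambda f: (alloc[f], f))
        alloc[fw] += 1
        free -= 1
    return alloc


# Hypothetical workload: MPI is busy, Hadoop is mostly idle.
demand = {"hadoop": 2, "mpi": 8}
separate = static_partition(demand, {"hadoop": 5, "mpi": 5})
shared = shared_pool(demand, total=10)

print("separate clusters:", separate, "machines in use:", sum(separate.values()))
print("shared cluster:   ", shared, "machines in use:", sum(shared.values()))
```

With separate clusters, the machines Hadoop leaves idle cannot help MPI; with a shared pool they can, which is the "inefficient resource usage" point in miniature.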

Page 13: Berkeley Data Analytics Stack: Experience and Lesson Learned

Apache Mesos
Open Source: 2010 (first release: 10,000 LoC)
Apache Project: 2012
Used in production at Twitter for past 2.5 years
» +10,000 machines
» +500 engineers using it

Third party Mesos schedulers
» Airbnb’s Chronos
» Twitter’s Aurora

Mesosphere: startup to commercialize Mesos

Page 14: Berkeley Data Analytics Stack: Experience and Lesson Learned

Mesos Meetups
Sept 2012: started Bay Area Spark Meetup
» Now +800 members

Other user groups:
» +700 members
» New York, Atlanta, Seattle, Los Angeles, Paris (France), Amsterdam (Netherlands), London (UK)

Page 15: Berkeley Data Analytics Stack: Experience and Lesson Learned

Monthly Contributors

65 contributors for last 12 months


Page 16: Berkeley Data Analytics Stack: Experience and Lesson Learned

Selected Users

Page 17: Berkeley Data Analytics Stack: Experience and Lesson Learned

Apache Spark
Problem: Need to support workloads beyond batch (MapReduce)
» Interactive, streaming, iterative (ML), graph processing

Motivating use cases:
» Iterative computations (ML researchers in RADLab)
» Interactive queries (Conviva, Facebook)

Page 18: Berkeley Data Analytics Stack: Experience and Lesson Learned

Apache Spark
Distributed Execution Engine
» Fault-tolerant, efficient in-memory storage
» Low-latency large-scale task scheduler
» Powerful programming model and APIs: Python, Java, Scala

Fast: up to 100x faster than Hadoop MR
» Can run sub-second jobs on hundreds of nodes

Easy to use: 2-5x less code than Hadoop MR
General: supports interactive & iterative apps
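One way to see why in-memory storage matters for iterative workloads is to count how often the input has to be read. The sketch below is plain Python, not the Spark API: `Dataset`, `cache`, `iterate_mean`, and `load_from_storage` are hypothetical names chosen for illustration, and the "algorithm" is a stand-in that simply scans the data on every step.

```python
from typing import Callable, List, Optional


class Dataset:
    """Toy dataset that either reloads from storage on every pass
    or keeps the loaded records in memory after the first pass."""

    def __init__(self, load: Callable[[], List[float]]) -> None:
        self._load = load
        self._cached: Optional[List[float]] = None
        self._use_cache = False
        self.loads = 0  # number of times storage was actually read

    def cache(self) -> "Dataset":
        self._use_cache = True
        return self

    def records(self) -> List[float]:
        if self._use_cache and self._cached is not None:
            return self._cached
        self.loads += 1
        data = self._load()
        if self._use_cache:
            self._cached = data
        return data


def iterate_mean(ds: Dataset, steps: int) -> float:
    """A stand-in for an iterative algorithm: each step scans the full
    dataset and moves the estimate halfway toward the data mean."""
    estimate = 0.0
    for _ in range(steps):
        rows = ds.records()
        mean = sum(rows) / len(rows)
        estimate += 0.5 * (mean - estimate)
    return estimate


def load_from_storage() -> List[float]:
    # Hypothetical slow read (disk or network in a real system).
    return [1.0, 2.0, 3.0, 4.0]


uncached = Dataset(load_from_storage)
cached = Dataset(load_from_storage).cache()
iterate_mean(uncached, steps=10)
iterate_mean(cached, steps=10)
print("storage reads without cache:", uncached.loads)  # one per iteration
print("storage reads with cache:   ", cached.loads)    # only the first pass
```

The same reasoning applies to interactive queries: repeated questions over the same data pay the load cost once if the data stays in memory.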

Page 19: Berkeley Data Analytics Stack: Experience and Lesson Learned

Apache Spark
Open Source: end of 2010 (<3,000 LoC, Scala)
Apache Project: 2013
Over time has grown to include key components (each one motivated by Spark use in production)
» Shark (2010) → SparkSQL (2014)
» SparkStreaming (2011)
» MLlib (2013)
» GraphX (2014)

On its way to becoming the platform of choice for developing Big Data applications

Page 20: Berkeley Data Analytics Stack: Experience and Lesson Learned

Spark Meetups
Jan 2012: started Bay Area Spark Meetup
» +3100 members

Now
» 33 cities
» 13 countries

Page 21: Berkeley Data Analytics Stack: Experience and Lesson Learned

Meetups Around the World

Page 22: Berkeley Data Analytics Stack: Experience and Lesson Learned

Monthly Contributors

[Figure: monthly contributors (axis 0 to 100), 2011 to 2014]

371 contributors for last 12 months

Page 23: Berkeley Data Analytics Stack: Experience and Lesson Learned

Compared to Other Projects

[Figure: activity in past 6 months, Commits and Lines of Code Changed, for MapReduce, YARN, HDFS, Storm, Spark]

2-3x more activity than: Hadoop, Storm, MongoDB, NumPy, D3, Julia, …

Page 24: Berkeley Data Analytics Stack: Experience and Lesson Learned

Wide Adoption

All major Hadoop distributions include Spark

Beyond Hadoop


Page 25: Berkeley Data Analytics Stack: Experience and Lesson Learned

Selected Users

Page 26: Berkeley Data Analytics Stack: Experience and Lesson Learned

Events
December 2013
» Talks from 22 organizations
» 450 attendees

June 2014
» Talks from 50 organizations
» 1100 attendees

spark-summit.org

Page 27: Berkeley Data Analytics Stack: Experience and Lesson Learned

Tachyon
Problem:
» Different Spark contexts cannot share in-memory data

Solution:
» Flexible API, including HDFS API
» Allow multiple frameworks (including Hadoop) to share in-memory data

[Figure: Spark (inst. 1), Spark (inst. 2), and Hadoop MR sharing data through Tachyon]
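The problem and solution above can be sketched in a few lines. This is a toy model in plain Python, not the Tachyon API: `JobContext`, `SharedMemoryStore`, `read_from_disk`, and the path `/data/events` are hypothetical, and the sketch ignores eviction, fault tolerance, and everything else a real storage layer has to handle.

```python
from typing import Callable, Dict


class JobContext:
    """Toy job context with a private in-memory cache (not shared)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._private: Dict[str, bytes] = {}

    def read(self, path: str, load: Callable[[str], bytes]) -> bytes:
        if path not in self._private:
            self._private[path] = load(path)
        return self._private[path]


class SharedMemoryStore:
    """Toy stand-in for a memory-speed storage layer shared by all jobs."""

    def __init__(self, load: Callable[[str], bytes]) -> None:
        self._load = load
        self._blocks: Dict[str, bytes] = {}

    def read(self, path: str) -> bytes:
        if path not in self._blocks:
            self._blocks[path] = self._load(path)
        return self._blocks[path]


disk_reads = {"count": 0}


def read_from_disk(path: str) -> bytes:
    # Hypothetical slow path to the underlying file system.
    disk_reads["count"] += 1
    return f"contents of {path}".encode()


# Two isolated contexts: each one has to load the same file itself.
a, b = JobContext("spark-1"), JobContext("spark-2")
a.read("/data/events", read_from_disk)
b.read("/data/events", read_from_disk)
isolated_reads = disk_reads["count"]

# Shared store: the second reader is served from memory.
disk_reads["count"] = 0
store = SharedMemoryStore(read_from_disk)
store.read("/data/events")
store.read("/data/events")
shared_reads = disk_reads["count"]

print("disk reads, isolated contexts:", isolated_reads)
print("disk reads, shared store:     ", shared_reads)
```

Placing the cache below the frameworks, behind a file-system-style interface, is what lets jobs that know nothing about each other benefit from the same in-memory copy.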

Page 28: Berkeley Data Analytics Stack: Experience and Lesson Learned

Tachyon

Open Source: Dec 2012 (<10,000 LoC)
Becoming narrow waist for storage in Big Data space

Page 29: Berkeley Data Analytics Stack: Experience and Lesson Learned

Release Growth

Tachyon 0.1 (Dec ‘12): 1 contributor
Tachyon 0.2 (Apr ‘13): 3 contributors
Tachyon 0.3 (Oct ‘13): 15 contributors
Tachyon 0.4 (Feb ‘14): 30 contributors
Tachyon 0.5 (July ‘14): 46 contributors

Page 30: Berkeley Data Analytics Stack: Experience and Lesson Learned

Open Community

[Figure: Berkeley contributors vs. non-Berkeley contributors (20+ companies)]

Page 31: Berkeley Data Analytics Stack: Experience and Lesson Learned

Selected Users

Page 32: Berkeley Data Analytics Stack: Experience and Lesson Learned

Multiple File System Choices


Page 33: Berkeley Data Analytics Stack: Experience and Lesson Learned

Reaching Tipping Point


Page 34: Berkeley Data Analytics Stack: Experience and Lesson Learned

Training: Integral Part of Success
Aug 2012: AMP Camp training workshop
» 150 in-person, 3000 online
» Now a regular event (Strata NY, training +450 people)

This year alone
» +1,800 trained people

Page 35: Berkeley Data Analytics Stack: Experience and Lesson Learned

Not Only Industrial Impact…
10s of papers at top conferences
» SOSP, SIGCOMM, SIGMOD, NIPS, VLDB, OSDI, NSDI, …

6 Best Paper Awards
» SIGCOMM, NSDI, EuroSys (2), ICML, ICDE

Great crop of students
» Last two years: MIT (3), Stanford (1), MSR, …

Open new research directions
» Resource allocation / microeconomy (DRF)
» Machine learning (Bootstrap Diagnosis)

Page 36: Berkeley Data Analytics Stack: Experience and Lesson Learned

And Even Saving Lives!
Scalable Nucleotide Alignment (SNAP)
» 3x-10x faster than state of art with same accuracy

ADAM Pipeline
» In use at the Broad Institute, Duke, Harvard, UCSC
» 10x-50x faster than state of art

Already saving lives!
» See “SNAP Helps Discover Infection” AMPLab blog 6/4/14 and M. Wilson, …, & C. Chiu, “Actionable Diagnosis of Neuroleptospirosis by Next-Generation Sequencing,” New England Journal of Medicine, 6/4/14


In a First, Test of DNA Finds Root of Illness By CARL ZIMMER JUNE 4, 2014

Joshua Osborn, 14, lay in a coma at American Family Children’s Hospital in Madison, Wis. For weeks his brain had been swelling with fluid, and a battery of tests had failed to reveal the cause.

The doctors told his parents, Clark and Julie, that they wanted to run one more test with an experimental new technology. Scientists would search Joshua’s cerebrospinal fluid for pieces of DNA. Some of them might belong to the pathogen causing his encephalitis.

The Osborns agreed, although they were skeptical that the test would succeed where so many others had failed. But in the first procedure of its kind, researchers at the University of California, San Francisco, managed to pinpoint the cause of Joshua’s problem — within 48 hours. He had been infected with an obscure species of bacteria. Once identified, it was eradicated within days.

The case, reported on Wednesday in The New England Journal of Medicine, signals an important advance in the science of diagnosis. For years, scientists have been sequencing DNA to identify pathogens. But until now, the process has been too cumbersome to yield useful information about an individual patient in a life-threatening emergency.

“This is an absolutely great story — it’s a tremendous tour de force,” said Tom Slezak, the leader of the pathogen informatics team at the Lawrence Livermore National Laboratory, who was not involved in the study.

Mr. Slezak and other experts noted that it would take years of further research before such a test might become approved for regular use. But it could be immensely useful: Not only might it provide speedy diagnoses to critically ill patients, they said, it could lead to more effective treatments for maladies that can be hard to identify, such as Lyme disease.

Diagnosis is a crucial step in medicine, but it can also be the most difficult. Doctors usually must guess the most likely causes of a medical problem and then order individual tests to see which is the right diagnosis. The guessing game can waste precious time. The causes of some conditions, like encephalitis, can be so hard to diagnose that doctors often end up with no answer at all.

“About 60 percent of the time, we never make a diagnosis” in encephalitis, said Dr. Michael R. Wilson, a neurologist at the University of California, San Francisco, and an author of the new paper. “It’s frustrating whenever someone is doing poorly, but it’s especially frustrating when we can’t even tell the parents what the hell is going on.”

For the last decade, researchers at the university have been working on methods for identifying pathogens based on their DNA. In 2003 Dr. Joseph DeRisi, a biochemist at

https://amplab.cs.berkeley.edu/2014/06/04/snap-helps-save-a-life/

SNAP

Page 37: Berkeley Data Analytics Stack: Experience and Lesson Learned

Research Philosophy
Follow real problems
Focus on novel usage scenarios
Build real systems
» Be paranoid about simplicity
» Very hard to build complex systems in academia

Push for adoption
» Develop communities
» Train users

Page 38: Berkeley Data Analytics Stack: Experience and Lesson Learned

Two Types of Research Projects
New systems
» Inspired by people using ours/existing systems
» E.g., Spark, Shark/SparkSQL, MLlib, SparkStreaming, Tachyon, …

New algorithms, techniques, optimizations
» Workload traces from large clusters (e.g., Facebook, Conviva)
» E.g., LATE, Sparrow, PACMan, Scarlett, …

Page 39: Berkeley Data Analytics Stack: Experience and Lesson Learned

Challenge: Public Clouds
Hugely convenient and powerful
» E.g., we won Terabyte sort benchmark this year using 206 AWS instances
  • 3x faster, 10x fewer machines than last year (Yahoo!)
» Whatever you deploy on AWS/Azure/GC can be used by anyone
  • Large pool of users (beyond academia)
  • Easy to train
» Large public data sets already available http://aws.amazon.com/datasets/

Page 40: Berkeley Data Analytics Stack: Experience and Lesson Learned

Why Use Experimental Testbeds?

Control and visibility
» Bare-metal servers
  • Some clouds do provide this: Rackspace, DigitalOcean
» SDN networks, RDMA, …

Re-configurability, heterogeneity

Free!

Enable end-to-end / cross-layer optimizations

Page 41: Berkeley Data Analytics Stack: Experience and Lesson Learned

What about Data and Apps?
Ideally, unique data not found on other clouds
Example:
» Fine grained logs/traces of cloud usage (public clouds cannot provide this)
» Scientific data (?)

Applications
» Ability to run existing systems/apps (need to maintain them!)
» New education apps (?)

Page 42: Berkeley Data Analytics Stack: Experience and Lesson Learned

Conclusions
Have right expectations, key to success!
Be aware that:
» Public clouds cover a big range of needs for system research
» Insights for new use cases unlikely to come from these testbeds

Focus on what is unique:
» Cross-layer optimization exploiting access to network (SDN, RDMA), storage, bare-metal servers
» Make available unique data sets (e.g., fine grained logs)