Scaling Distributed Machine Learning with the Parameter Server

Mu Li muli@cs.cmu.edu

[Motivating example: a user sends the query “osdi” to a search engine; the engine returns results together with ads, and the advertiser is charged when the user clicks an ad.]

Machine learning is concerned with systems that can learn from data.

[Figure: Ad click prediction. Training data size (TB, 0–700) and annual revenue (billion $, 0–80) per year, 2010–2014.]

We will report results using 1,000 machines!

Overview of machine learning: raw data → training data → machine learning system → model

Training data is stored as (key, value) pairs, i.e. an example-by-feature matrix (with keys such as osdi, www.usenix.org, user_mu_li).

Scale of industry problems
✦ 100 billion examples
✦ 10 billion features
✦ 1 TB – 1 PB of training data
✦ 100 – 1,000 machines

Goals of the system
✦ scale to industry problems
✦ efficient communication
✦ fault tolerance
✦ easy to use

Industry-size machine learning problems

Data and model partition

✦ Training data is partitioned across the worker machines
✦ The model is partitioned across the server machines
✦ Workers push updates to the servers and pull the current model back

Example: distributed gradient descent

✦ Workers pull the working set of the model
✦ Iterate until stop:
   ★ workers compute gradients
   ★ workers push gradients
   ★ servers update the model
   ★ workers pull the updated model

A minimal sketch of this loop follows.
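The following is a minimal single-process sketch of the worker/server interaction described above, assuming a logistic-regression objective. The class and method names (KVServer, Worker, pull, push) are illustrative stand-ins, not the parameter server's actual API, and the "servers" and "workers" here are plain Python objects rather than machines.

import numpy as np

class KVServer:
    """Holds a shard of the model as a (key -> value) store."""
    def __init__(self, num_features, lr=0.1):
        self.w = {k: 0.0 for k in range(num_features)}
        self.lr = lr

    def pull(self, keys):
        return np.array([self.w[k] for k in keys])

    def push(self, keys, grads):
        # Gradient descent update applied on the server side.
        for k, g in zip(keys, grads):
            self.w[k] -= self.lr * g

class Worker:
    def __init__(self, X, y):
        self.X, self.y = X, y                      # local data shard
        self.keys = np.nonzero(X.sum(axis=0))[0]   # working set: features seen locally

    def gradient(self, w_local):
        # Logistic-loss gradient restricted to the local working set.
        Xs = self.X[:, self.keys]
        margin = self.y * (Xs @ w_local)
        coef = -self.y / (1.0 + np.exp(margin))
        return Xs.T @ coef / len(self.y)

# Simulated run: 4 workers, 20 iterations.
rng = np.random.default_rng(0)
X = (rng.random((400, 50)) < 0.1).astype(float)    # sparse binary features
w_true = rng.normal(size=50)
y = np.where(X @ w_true > 0, 1.0, -1.0)            # labels in {-1, +1}

server = KVServer(num_features=50)
workers = [Worker(X[i::4], y[i::4]) for i in range(4)]

for it in range(20):
    for wk in workers:                    # in the real system these run in parallel
        w_local = server.pull(wk.keys)    # workers pull the working set
        g = wk.gradient(w_local)          # workers compute gradients
        server.push(wk.keys, g)           # workers push gradients; server updates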

Efficient communication

Challenges for data synchronization

✦ Massive communication traffic
   ★ frequent access to the shared model
✦ Expensive global barriers
   ★ between iterations

Task

✦ A task is a push, a pull, or a user-defined function (e.g., an iteration)
✦ Tasks have an “execute-after-finished” dependency
✦ Tasks can be executed asynchronously

Synchronous execution: iteration 1 (CPU intensive) cannot start until the gradient push & pull of iteration 0 (network intensive) has finished.
Asynchronous execution: iteration 1 overlaps with the push & pull of iteration 0, hiding the communication time.

A toy sketch of this overlap follows.
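A toy sketch of the synchronous vs. asynchronous execution contrasted above, using a Python thread pool to overlap the "network-intensive" push & pull with the "CPU-intensive" gradient computation. The timings and function names are purely illustrative.

import time
from concurrent.futures import ThreadPoolExecutor

def compute_gradient(it):      # CPU intensive
    time.sleep(0.3)
    return f"grad {it}"

def push_and_pull(it):         # network intensive
    time.sleep(0.3)

def run(asynchronous):
    start = time.time()
    with ThreadPoolExecutor(max_workers=1) as net:
        pending = None
        for it in range(4):
            compute_gradient(it)
            if asynchronous:
                # Issue the push & pull as a task and start the next iteration
                # immediately (no "execute-after-finished" dependency declared).
                pending = net.submit(push_and_pull, it)
            else:
                push_and_pull(it)          # wait for communication to finish
        if pending:
            pending.result()
    return time.time() - start

print("synchronous :", round(run(False), 1), "s")   # roughly 2.4 s
print("asynchronous:", round(run(True), 1), "s")    # roughly 1.5 s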

Flexible consistency

✦ Trade-off between algorithm efficiency and system performance
   ★ Sequential: tasks run one after another (1 → 2 → 3 → 4)
   ★ Eventual: all tasks may run concurrently (1, 2, 3, 4 independent)
   ★ 1-bounded delay: task t may start once task t − 2 has finished, so at most one predecessor is still outstanding (1, 2, 3, 4, 5)

A sketch of the bounded-delay dependency check follows.
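The sketch below expresses the bounded-delay rule as a single predicate: task t may be issued only when every task up to t − delay − 1 has finished. delay = 0 gives sequential consistency and delay = None stands for eventual consistency; the scheduler interface is illustrative, not the parameter server's actual API.

def may_issue(t, finished, delay):
    """Return True if task t is allowed to start.

    t        -- index of the task we want to issue (0-based)
    finished -- set of task indices that have completed
    delay    -- bounded-delay parameter, or None for eventual consistency
    """
    if delay is None:                       # eventual: no dependencies
        return True
    oldest_required = t - delay - 1         # every task up to this index must be done
    return all(i in finished for i in range(oldest_required + 1))

# Example: with 1-bounded delay, task 2 may run while task 1 is still in flight,
# but task 3 must wait for task 1 to finish.
finished = {0}
print(may_issue(2, finished, delay=1))   # True  (only task 0 must be done)
print(may_issue(3, finished, delay=1))   # False (task 1 not finished yet)
print(may_issue(1, finished, delay=0))   # True  (sequential: task 0 is done)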

Results for bounded delay

[Figure: Ad click prediction. Total time (hours, 0–1.8) split into computing and waiting, for bounded delay values 0, 1, 2, 4, 8, 16. Delay 0 is the sequential setting; an intermediate delay gives the best trade-off between computing and waiting time.]

User-defined filters

✦ Selectively communicate (key, value) pairs
✦ E.g., the KKT filter
   ★ send pairs only if they are likely to affect the model
   ★ >95% of keys are filtered out in the ad click prediction task

An illustrative sketch of such a filter follows.
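The talk does not spell out the filter's exact rule, so the sketch below only illustrates the shape of a user-defined filter: before pushing, drop (key, gradient) pairs that are unlikely to change the model. The "inactive" test (zero weight and small gradient magnitude) and the threshold are assumptions standing in for the real KKT-condition check.

def kkt_like_filter(pairs, weights, threshold=1e-3):
    """pairs: dict key -> gradient; weights: dict key -> current weight."""
    kept = {}
    for key, grad in pairs.items():
        # Assumed stand-in for the KKT test: an entry with zero weight and a
        # tiny gradient is considered unlikely to affect the model.
        inactive = weights.get(key, 0.0) == 0.0 and abs(grad) < threshold
        if not inactive:
            kept[key] = grad
    return kept

grads = {"osdi": 0.2, "www.usenix.org": 1e-5, "user_mu_li": -0.4}
w     = {"osdi": 0.0, "www.usenix.org": 0.0, "user_mu_li": 0.7}
print(kkt_like_filter(grads, w))   # 'www.usenix.org' is filtered out before sending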

Fault tolerance

Machine learning job logs over a three-month period:

[Figure: failure rate (%, 0–26) vs. job size, measured as #machines × time (hours): 100, 1,000, 10,000.]

Fault tolerance

✦ The model is partitioned by consistent hashing
✦ Default replication: chain replication (consistent, safe)
   ★ worker 0 pushes x to server 0; server 0 forwards f(x) to its replica server 1; acks flow back along the chain to the worker
✦ Option: aggregation reduces backup traffic (algorithm specific)
   ★ worker 0 pushes x and worker 1 pushes y; server 0 aggregates them and forwards a single f(x + y) to server 1 before acking both workers
✦ Implemented with an efficient vector clock

A small consistent-hashing sketch follows.
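A minimal sketch of partitioning model keys over server nodes with consistent hashing, as described above. Real deployments use many virtual nodes per server and a stronger hash; both are simplified here, and the class and server names are illustrative.

import bisect
import hashlib

def h(s):
    """Map a string to a point on the hash ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, servers, vnodes=8):
        # Each server owns several virtual points on the ring.
        self.ring = sorted((h(f"{s}#{v}"), s) for s in servers for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def owner(self, key):
        """Server responsible for a model key: first ring point clockwise of h(key)."""
        i = bisect.bisect_right(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["server0", "server1", "server2"])
for key in ["osdi", "www.usenix.org", "user_mu_li"]:
    print(key, "->", ring.owner(key))

# Adding or removing a server only moves the keys adjacent to its ring points,
# which keeps recovery and replication of a key range cheap.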

Easy to use

(Key, value) vectors for the shared parameters

✦ A mathematical sparse vector with entries at indices i1, i2, i3 maps directly onto a (key, value) store: (i1, ·), (i2, ·), (i3, ·)
✦ Good for programmers: matches the mental model
✦ Good for the system: exposes optimizations based on the structure of the data

Example: computing the gradient
gradient = dataᵀ × ( −label × 1 / (1 + exp(label × data × model)) )

A numpy rendering of this one-liner follows.
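A direct numpy rendering of the gradient one-liner above (the logistic-loss gradient), assuming data is an examples-by-features matrix, label is a vector in {-1, +1}, and model is the weight vector. Variable names follow the slide; the example data is made up.

import numpy as np

def gradient(data, label, model):
    # gradient = dataᵀ × ( −label / (1 + exp(label × (data × model))) )
    margin = label * (data @ model)
    return data.T @ (-label / (1.0 + np.exp(margin)))

rng = np.random.default_rng(1)
data = (rng.random((6, 4)) < 0.5).astype(float)
label = np.array([1, -1, 1, 1, -1, 1], dtype=float)
model = np.zeros(4)
print(gradient(data, label, model))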

Evaluation

Sparse Logistic Regression

✦ Predict whether an ad will be clicked or not
✦ Baseline: two systems in production

System            Method    Consistency          LOC
System-A          L-BFGS    Sequential           10K
System-B          Block PG  Sequential           30K
Parameter Server  Block PG  Bounded Delay + KKT  300

✦ 636 TB of real ads data
   ★ 170 billion examples, 65 billion features
✦ 1,000 machines with 16,000 cores

Progress

[Figure: error (large → small) vs. time (hours, 0.1–10, log scale) for System A, System B, and the Parameter Server.]

Time decomposition

[Figure: time (hours, 0–5) split into computing and waiting for System-A, System-B, and the Parameter Server.]

Topic Modeling (“LDA”)

✦ Gradient descent with eventual consistency
✦ 5 billion users’ click logs; users are grouped into 1,000 groups based on the URLs they clicked

[Figure: error (large → small) vs. time (hours, 0–70) with 1,000 machines and with 6,000 machines; the larger cluster gives a 4x speed-up.]

Largest experiments of related systems

[Figure: model size (10^4–10^11) vs. #cores (10^1–10^5); data were collected in April 2014. Sparse LR: REEF, VW, Naiad, MLbase, and the Parameter Server. Lasso: Petuum. LDA: GraphLab, YahooLDA, and the Parameter Server. DNN: DistBelief, Adam. The Parameter Server experiments sit at the largest scales on both axes.]

Summary: industry-size machine learning problems, efficient communication, fault tolerance, easy to use, evaluation.

Code available at parameterserver.org

Q&A