Scaling Distributed Machine Learning with the Parameter Server

Mu Li muli@cs.cmu.edu

[Motivating example: a user sends the query “osdi” to a search engine; the engine returns results together with ads, and the advertiser is charged when the user clicks an ad.]

Machine learning is concerned with systems that can learn from data.

[Figure: Ad click prediction. Training data size (TB, 0–700) and annual revenue (billion $, 0–80) per year, 2010–2014.]

We will report results using 1,000 machines!

Overview of machine learning: raw data → training data → machine learning system → model

Training data is stored as (key, value) pairs, i.e. an example-by-feature matrix (with keys such as osdi, www.usenix.org, user_mu_li).

Scale of industry problems
✦ 100 billion examples
✦ 10 billion features
✦ 1 TB – 1 PB of training data
✦ 100 – 1,000 machines

Goals of the system
✦ scale to industry problems
✦ efficient communication
✦ fault tolerance
✦ easy to use

Industry-size machine learning problems

Data and model partition

✦ Training data is partitioned across the worker machines
✦ The model is partitioned across the server machines
✦ Workers push updates to the servers and pull the current model back

Example: distributed gradient descent

✦ Workers pull the working set of the model
✦ Iterate until stop:
   ★ workers compute gradients
   ★ workers push gradients
   ★ servers update the model
   ★ workers pull the updated model

A minimal sketch of this loop follows.
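The following is a minimal single-process sketch of the worker/server interaction described above, assuming a logistic-regression objective. The class and method names (KVServer, Worker, pull, push) are illustrative stand-ins, not the parameter server's actual API, and the "servers" and "workers" here are plain Python objects rather than machines.

import numpy as np

class KVServer:
    """Holds a shard of the model as a (key -> value) store."""
    def __init__(self, num_features, lr=0.1):
        self.w = {k: 0.0 for k in range(num_features)}
        self.lr = lr

    def pull(self, keys):
        return np.array([self.w[k] for k in keys])

    def push(self, keys, grads):
        # Gradient descent update applied on the server side.
        for k, g in zip(keys, grads):
            self.w[k] -= self.lr * g

class Worker:
    def __init__(self, X, y):
        self.X, self.y = X, y                      # local data shard
        self.keys = np.nonzero(X.sum(axis=0))[0]   # working set: features seen locally

    def gradient(self, w_local):
        # Logistic-loss gradient restricted to the local working set.
        Xs = self.X[:, self.keys]
        margin = self.y * (Xs @ w_local)
        coef = -self.y / (1.0 + np.exp(margin))
        return Xs.T @ coef / len(self.y)

# Simulated run: 4 workers, 20 iterations.
rng = np.random.default_rng(0)
X = (rng.random((400, 50)) < 0.1).astype(float)    # sparse binary features
w_true = rng.normal(size=50)
y = np.where(X @ w_true > 0, 1.0, -1.0)            # labels in {-1, +1}

server = KVServer(num_features=50)
workers = [Worker(X[i::4], y[i::4]) for i in range(4)]

for it in range(20):
    for wk in workers:                    # in the real system these run in parallel
        w_local = server.pull(wk.keys)    # workers pull the working set
        g = wk.gradient(w_local)          # workers compute gradients
        server.push(wk.keys, g)           # workers push gradients; server updates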

Efficient communication

Challenges for data synchronization

✦ Massive communication traffic
   ★ frequent access to the shared model
✦ Expensive global barriers
   ★ between iterations

Task

✦ A task is a push, a pull, or a user-defined function (e.g., an iteration)
✦ Tasks have an “execute-after-finished” dependency
✦ Tasks can be executed asynchronously

Synchronous execution: iteration 1 (CPU intensive) cannot start until the gradient push & pull of iteration 0 (network intensive) has finished.
Asynchronous execution: iteration 1 overlaps with the push & pull of iteration 0, hiding the communication time.

A toy sketch of this overlap follows.
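A toy sketch of the synchronous vs. asynchronous execution contrasted above, using a Python thread pool to overlap the "network-intensive" push & pull with the "CPU-intensive" gradient computation. The timings and function names are purely illustrative.

import time
from concurrent.futures import ThreadPoolExecutor

def compute_gradient(it):      # CPU intensive
    time.sleep(0.3)
    return f"grad {it}"

def push_and_pull(it):         # network intensive
    time.sleep(0.3)

def run(asynchronous):
    start = time.time()
    with ThreadPoolExecutor(max_workers=1) as net:
        pending = None
        for it in range(4):
            compute_gradient(it)
            if asynchronous:
                # Issue the push & pull as a task and start the next iteration
                # immediately (no "execute-after-finished" dependency declared).
                pending = net.submit(push_and_pull, it)
            else:
                push_and_pull(it)          # wait for communication to finish
        if pending:
            pending.result()
    return time.time() - start

print("synchronous :", round(run(False), 1), "s")   # roughly 2.4 s
print("asynchronous:", round(run(True), 1), "s")    # roughly 1.5 s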

Flexible consistency

✦ Trade-off between algorithm efficiency and system performance
   ★ Sequential: tasks run one after another (1 → 2 → 3 → 4)
   ★ Eventual: all tasks may run concurrently (1, 2, 3, 4 independent)
   ★ 1-bounded delay: task t may start once task t − 2 has finished, so at most one predecessor is still outstanding (1, 2, 3, 4, 5)

A sketch of the bounded-delay dependency check follows.
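The sketch below expresses the bounded-delay rule as a single predicate: task t may be issued only when every task up to t − delay − 1 has finished. delay = 0 gives sequential consistency and delay = None stands for eventual consistency; the scheduler interface is illustrative, not the parameter server's actual API.

def may_issue(t, finished, delay):
    """Return True if task t is allowed to start.

    t        -- index of the task we want to issue (0-based)
    finished -- set of task indices that have completed
    delay    -- bounded-delay parameter, or None for eventual consistency
    """
    if delay is None:                       # eventual: no dependencies
        return True
    oldest_required = t - delay - 1         # every task up to this index must be done
    return all(i in finished for i in range(oldest_required + 1))

# Example: with 1-bounded delay, task 2 may run while task 1 is still in flight,
# but task 3 must wait for task 1 to finish.
finished = {0}
print(may_issue(2, finished, delay=1))   # True  (only task 0 must be done)
print(may_issue(3, finished, delay=1))   # False (task 1 not finished yet)
print(may_issue(1, finished, delay=0))   # True  (sequential: task 0 is done)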

Results for bounded delay

[Figure: Ad click prediction. Total time (hours, 0–1.8) split into computing and waiting, for bounded delay values 0, 1, 2, 4, 8, 16. Delay 0 is the sequential setting; an intermediate delay gives the best trade-off between computing and waiting time.]

User-defined filters

✦ Selectively communicate (key, value) pairs
✦ E.g., the KKT filter
   ★ send pairs only if they are likely to affect the model
   ★ >95% of keys are filtered out in the ad click prediction task

An illustrative sketch of such a filter follows.
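The talk does not spell out the filter's exact rule, so the sketch below only illustrates the shape of a user-defined filter: before pushing, drop (key, gradient) pairs that are unlikely to change the model. The "inactive" test (zero weight and small gradient magnitude) and the threshold are assumptions standing in for the real KKT-condition check.

def kkt_like_filter(pairs, weights, threshold=1e-3):
    """pairs: dict key -> gradient; weights: dict key -> current weight."""
    kept = {}
    for key, grad in pairs.items():
        # Assumed stand-in for the KKT test: an entry with zero weight and a
        # tiny gradient is considered unlikely to affect the model.
        inactive = weights.get(key, 0.0) == 0.0 and abs(grad) < threshold
        if not inactive:
            kept[key] = grad
    return kept

grads = {"osdi": 0.2, "www.usenix.org": 1e-5, "user_mu_li": -0.4}
w     = {"osdi": 0.0, "www.usenix.org": 0.0, "user_mu_li": 0.7}
print(kkt_like_filter(grads, w))   # 'www.usenix.org' is filtered out before sending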

Fault tolerance

Machine learning job logs over a three-month period:

[Figure: failure rate (%, 0–26) vs. job size, measured as #machines × time (hours): 100, 1,000, 10,000.]

Fault tolerance

✦ The model is partitioned by consistent hashing
✦ Default replication: chain replication (consistent, safe)
   ★ worker 0 pushes x to server 0; server 0 forwards f(x) to its replica server 1; acks flow back along the chain to the worker
✦ Option: aggregation reduces backup traffic (algorithm specific)
   ★ worker 0 pushes x and worker 1 pushes y; server 0 aggregates them and forwards a single f(x + y) to server 1 before acking both workers
✦ Implemented with an efficient vector clock

A small consistent-hashing sketch follows.
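A minimal sketch of partitioning model keys over server nodes with consistent hashing, as described above. Real deployments use many virtual nodes per server and a stronger hash; both are simplified here, and the class and server names are illustrative.

import bisect
import hashlib

def h(s):
    """Map a string to a point on the hash ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

class ConsistentHashRing:
    def __init__(self, servers, vnodes=8):
        # Each server owns several virtual points on the ring.
        self.ring = sorted((h(f"{s}#{v}"), s) for s in servers for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def owner(self, key):
        """Server responsible for a model key: first ring point clockwise of h(key)."""
        i = bisect.bisect_right(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing(["server0", "server1", "server2"])
for key in ["osdi", "www.usenix.org", "user_mu_li"]:
    print(key, "->", ring.owner(key))

# Adding or removing a server only moves the keys adjacent to its ring points,
# which keeps recovery and replication of a key range cheap.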

Easy to use

(Key, value) vectors for the shared parameters

✦ A mathematical sparse vector with entries at indices i1, i2, i3 maps directly onto a (key, value) store: (i1, ·), (i2, ·), (i3, ·)
✦ Good for programmers: matches the mental model
✦ Good for the system: exposes optimizations based on the structure of the data

Example: computing the gradient
gradient = dataᵀ × ( −label × 1 / (1 + exp(label × data × model)) )

A numpy rendering of this one-liner follows.
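A direct numpy rendering of the gradient one-liner above (the logistic-loss gradient), assuming data is an examples-by-features matrix, label is a vector in {-1, +1}, and model is the weight vector. Variable names follow the slide; the example data is made up.

import numpy as np

def gradient(data, label, model):
    # gradient = dataᵀ × ( −label / (1 + exp(label × (data × model))) )
    margin = label * (data @ model)
    return data.T @ (-label / (1.0 + np.exp(margin)))

rng = np.random.default_rng(1)
data = (rng.random((6, 4)) < 0.5).astype(float)
label = np.array([1, -1, 1, 1, -1, 1], dtype=float)
model = np.zeros(4)
print(gradient(data, label, model))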

Evaluation

Sparse Logistic Regression

✦ Predict whether an ad will be clicked or not
✦ Baseline: two systems in production

System            Method    Consistency          LOC
System-A          L-BFGS    Sequential           10K
System-B          Block PG  Sequential           30K
Parameter Server  Block PG  Bounded Delay + KKT  300

✦ 636 TB of real ads data
   ★ 170 billion examples, 65 billion features
✦ 1,000 machines with 16,000 cores

Progress

[Figure: error (large → small) vs. time (hours, 0.1–10, log scale) for System A, System B, and the Parameter Server.]

Time decomposition

[Figure: time (hours, 0–5) split into computing and waiting for System-A, System-B, and the Parameter Server.]

Topic Modeling (“LDA”)

✦ Gradient descent with eventual consistency
✦ 5 billion users’ click logs; users are grouped into 1,000 groups based on the URLs they clicked

[Figure: error (large → small) vs. time (hours, 0–70) with 1,000 machines and with 6,000 machines; the larger cluster gives a 4x speed-up.]

Largest experiments of related systems

[Figure: model size (10^4–10^11) vs. #cores (10^1–10^5); data were collected in April 2014. Sparse LR: REEF, VW, Naiad, MLbase, and the Parameter Server. Lasso: Petuum. LDA: GraphLab, YahooLDA, and the Parameter Server. DNN: DistBelief, Adam. The Parameter Server experiments sit at the largest scales on both axes.]

Summary: industry-size machine learning problems, efficient communication, fault tolerance, easy to use, evaluation.

Code available at parameterserver.org

Q&A