[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Google Projects (Max Lin,...


Abstract: Machine learning researchers and practitioners develop computer algorithms that "improve performance automatically through experience". At Google, machine learning is applied to solve many problems, such as prioritizing emails in Gmail, recommending tags for YouTube videos, and identifying different aspects from online user reviews. Machine learning on big data, however, is challenging. Some "simple" machine learning algorithms with quadratic time complexity, while running fine on hundreds of records, are almost impractical to use on billions of records.

In this talk, I will describe lessons drawn from various Google projects on developing large-scale machine learning systems. These systems build on top of Google's computing infrastructure, such as GFS and MapReduce, and attack the scalability problem through massively parallel algorithms. I will present the design decisions made in these systems and strategies for scaling and speeding up machine learning systems on web-scale data.

Speaker biography: Max Lin is a software engineer in Google Research's New York City office. He is the tech lead of the Google Prediction API, a machine learning web service in the cloud. Prior to Google, he published research work on video content analysis, sentiment analysis, machine learning, and cross-lingual information retrieval. He holds a PhD in Computer Science from Carnegie Mellon University.

Transcript of [Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Google Projects (Max Lin,...

Machine Learning on Big Data: Lessons Learned from Google Projects

Max Lin, Software Engineer | Google Research

Massively Parallel Computing | Harvard CS 264 | Guest Lecture | March 29th, 2011

Outline

• Machine Learning intro

• Scaling machine learning algorithms up

• Design choices of large scale ML systems

Outline

• Machine Learning intro

• Scaling machine learning algorithms up

• Design choices of large scale ML systems

“Machine Learning is a study of computer algorithms that improve automatically through experience.”

Training

"The quick brown fox jumped over the lazy dog." → English

"To err is human, but to really foul things up you need a computer." → English

"No hay mal que por bien no venga." → Spanish

"La tercera es la vencida." → Spanish

Testing

"To be or not to be -- that is the question" → ?

"La fe mueve montañas." → ?

Input X → Model f(x) → Output Y; for a new input x', the model predicts f(x') = y'.

Linear Classifier

Example input: "The quick brown fox jumped over the lazy dog."

x = [ 0 ('a'), 0 ('aardvark'), ..., 1 ('dog'), ..., 1 ('the'), ..., 0 ('montañas'), ... ]

w = [ 0.1, 132, ..., 150, 200, ..., -153, ... ]

f(x) = w \cdot x = \sum_{p=1}^{P} w_p x_p
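To make the bag-of-words representation and the dot-product score concrete, here is a minimal Python sketch (not from the talk); the vocabulary and weight values are illustrative only.

```python
# A minimal sketch: bag-of-words features and a linear classifier score
# f(x) = w · x. The vocabulary and the weights are made up for illustration.
import re
from collections import Counter

vocabulary = {"a", "aardvark", "brown", "dog", "fox", "montañas", "the"}
weights = {"the": 1.5, "brown": 0.7, "fox": 0.9, "dog": 0.4, "montañas": -2.0}

def featurize(text):
    """Map a sentence to a sparse bag-of-words vector {word: count}."""
    tokens = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    return Counter(t for t in tokens if t in vocabulary)

def score(x):
    """f(x) = w · x = sum over features p of w_p * x_p (sparse dot product)."""
    return sum(weights.get(word, 0.0) * count for word, count in x.items())

x = featurize("The quick brown fox jumped over the lazy dog.")
print(score(x))  # a positive score -> "English" in this toy setup
```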

Training Data

[Figure: the training data as an N × P matrix — N rows of examples, each with P feature values for Input X and a label for Output Y]

http://www.flickr.com/photos/mr_t_in_dc/5469563053/

Typical machine learning data at Google

N: 100 billion / 1 billion; P: 1 billion / 10 million (mean / median)

Classifier Training

• Training: Given {(x, y)} and f, minimize the following objective function

\arg\min_w \sum_{i=1}^{N} L(y_i, f(x_i; w)) + R(w)

http://www.flickr.com/photos/visitfinland/5424369765/

Use Newton's method? w_{t+1} \leftarrow w_t - H(w_t)^{-1} \nabla J(w_t)
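As a concrete (hypothetical) instance of this objective, the sketch below plugs a logistic loss in for L and an L2 penalty in for R; neither choice is specified on the slide.

```python
# A minimal sketch of the training objective J(w) = sum_i L(y_i, f(x_i; w)) + R(w),
# assuming a logistic loss L and an L2 regularizer R (labels y_i in {-1, +1}).
import numpy as np

def objective(w, X, y, lam=1.0):
    margins = y * (X @ w)                      # N margins, one per training example
    loss = np.sum(np.log1p(np.exp(-margins)))  # logistic loss summed over N examples
    return loss + 0.5 * lam * np.dot(w, w)     # plus the regularizer R(w)

# Newton's update needs the P x P Hessian H(w_t); with P on the order of 10^9
# features that is ~10^18 entries, which is why it is impractical at this scale.
```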

Outline

• Machine Learning intro

• Scaling machine learning algorithms up

• Design choices of large scale ML systems

Scaling Up

• Why big data?

• Parallelize machine learning algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

[Figure: subsampling — Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M; one shard is sampled onto a single machine to train the Model, i.e. reduce N]

Why not Small Data?

[Banko and Brill, 2001]

Scaling Up

• Why big data?

• Parallelize machine learning algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

Parallelize Estimates

• Naive Bayes Classifier

• Maximum Likelihood Estimates

w_{\text{the} \mid EN} = \frac{\sum_{i=1}^{N} \mathbf{1}_{EN,\text{the}}(x_i)}{\sum_{i=1}^{N} \mathbf{1}_{EN}(x_i)}

\arg\min_w \; -\sum_{i=1}^{N} \log \Big( P(y_i; w) \prod_{p=1}^{P} P(x_{ip} \mid y_i; w) \Big)

Word Counting

Map: X = "The quick brown fox ...", Y = EN → emit ('the|EN', 1), ('quick|EN', 1), ('brown|EN', 1), ...

Reduce: [ ('the|EN', 1), ('the|EN', 1), ('the|EN', 1) ] → C('the'|EN) = sum of values = 3

w_'the'|EN = C('the'|EN) / C(EN)

[Figure: Word Counting with MapReduce — Big Data is split into Shard 1 ... Shard M; Mapper 1 ... Mapper M each read one shard and emit pairs such as ('the' | EN, 1), ('fox' | EN, 1), ..., ('montañas' | ES, 1); the Reducer tallies the counts and updates w to produce the Model]
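Below is a minimal, single-process sketch of the map and reduce steps above; plain Python stands in for Google's MapReduce, and the shard contents are toy data.

```python
# A minimal sketch of MapReduce-style word counting for Naive Bayes training.
from collections import defaultdict

shards = [
    [("The quick brown fox jumped over the lazy dog", "EN")],
    [("No hay mal que por bien no venga", "ES")],
]

def map_shard(shard):
    """Mapper: emit ('word|label', 1) for every token in one shard."""
    for text, label in shard:
        for token in text.lower().split():
            yield (f"{token}|{label}", 1)

def reduce_counts(pairs):
    """Reducer: sum the values for each key, i.e. C(word|label)."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts

pairs = (kv for shard in shards for kv in map_shard(shard))  # shuffle step omitted
counts = reduce_counts(pairs)
print(counts["the|EN"])  # -> 2 in this toy shard
```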

Parallelize Optimization

• Maximum Entropy Classifiers

• Good: J(w) is concave

• Bad: no closed-form solution like NB

• Ugly: Large N

\arg\min_w \; -\sum_{i=1}^{N} \log \frac{\exp\!\big(\sum_{p=1}^{P} w_p x_{ip}\big)^{y_i}}{1 + \exp\!\big(\sum_{p=1}^{P} w_p x_{ip}\big)}

Gradient Descent

http://www.cs.cmu.edu/~epxing/Class/10701/Lecture/lecture7.pdf

Gradient Descent

• w is initialized as zero

• for t in 1 to T:

  • calculate the gradient \nabla J(w)

  • update w_{t+1} \leftarrow w_t - \eta \nabla J(w)

\nabla J(w) = \sum_{i=1}^{N} g(w, x_i, y_i), i.e. a sum of per-example gradient terms over all N training examples
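A compact sketch of this loop, assuming a logistic loss for J (the step size, iteration count, and labels in {-1, +1} are illustrative choices, not the talk's):

```python
# A minimal gradient-descent sketch for a linear classifier with logistic loss.
import numpy as np

def gradient(w, X, y):
    """∇J(w): the per-example gradients summed over all N examples."""
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))   # derivative of log(1 + exp(-y * w·x))
    return X.T @ coeff                     # one full pass costs O(P * N)

def train(X, y, T=100, eta=0.1):
    w = np.zeros(X.shape[1])               # w is initialized as zero
    for _ in range(T):                     # total training cost O(T * P * N)
        w = w - eta * gradient(w, X, y)
    return w
```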

Distribute Gradient

• w is initialized as zero

• for t in 1 to T:

  • calculate the gradient \nabla J(w) in parallel

  • update w_{t+1} \leftarrow w_t - \eta \nabla J(w)

• Training CPU: O(TPN) to O(TPN / M)

Distribute Gradient

[Figure: Distribute Gradient with MapReduce — Big Data is split into Shard 1 ... Shard M; Machine 1 ... Machine M each compute a partial gradient sum over their shard and emit (dummy key, partial gradient sum); the Reducer sums the partials and updates w, producing the Model; the Map/Reduce job is repeated until convergence]
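The same loop with the gradient summed shard by shard: each machine maps its shard to a partial gradient sum, and the reduce step adds them up and updates w. A single-process sketch (a real system runs each partial sum on its own machine):

```python
# A minimal sketch of one Map/Reduce round of distributed gradient descent.
import numpy as np

def partial_gradient(w, X_shard, y_shard):
    """Map: the partial gradient sum over one shard (same logistic loss as above)."""
    margins = y_shard * (X_shard @ w)
    coeff = -y_shard / (1.0 + np.exp(margins))
    return X_shard.T @ coeff               # emitted under a dummy key

def distributed_step(w, shards, eta=0.1):
    """Reduce: sum the partial gradients and update w."""
    grad = sum(partial_gradient(w, X, y) for X, y in shards)
    return w - eta * grad

# Repeat distributed_step until the model converges; each round is one MapReduce job.
```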

Scaling Up

• Why big data?

• Parallelize machine learning algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

Parallelize Subroutines

• Support Vector Machines

Primal: \arg\min_{w,b,\zeta} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \zeta_i \quad \text{s.t. } 1 - y_i(w \cdot \phi(x_i) + b) \le \zeta_i, \; \zeta_i \ge 0

• Solve the dual problem: \arg\min_{\alpha} \; \frac{1}{2}\alpha^T Q \alpha - \alpha^T \mathbf{1} \quad \text{s.t. } 0 \le \alpha \le C, \; y^T \alpha = 0

http://www.flickr.com/photos/sea-turtle/198445204/

The computational cost of the Primal-Dual Interior Point Method is O(n^3) in time and O(n^2) in memory.

Parallel SVM [Chang et al., 2007]

• Parallel, row-wise Incomplete Cholesky Factorization (ICF) of Q

• Parallel interior point method

• Time O(n^3) becomes O(n^2 / M)

• Memory O(n^2) becomes O(n / M)

• Parallel Support Vector Machines (psvm): http://code.google.com/p/psvm/

• Implemented in MPI

Parallel ICF

• Distribute Q by row across M machines

• For each dimension n < √N:

  • Workers send their local pivots to the master

  • The master selects the largest of the local pivots and broadcasts the global pivot to the workers

[Figure: rows of Q distributed across machines — Machine 1: rows 1, 2; Machine 2: rows 3, 4; Machine 3: rows 5, 6; ...; the factorization keeps roughly √N columns]
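Below is a serial sketch of the pivoted incomplete Cholesky idea (Q ≈ H Hᵀ with about √N columns); PSVM's parallel version distributes the rows of Q and H across machines and exchanges only the pivot information at each step. This is an illustration, not the PSVM implementation.

```python
# A minimal serial sketch of pivoted incomplete Cholesky factorization:
# approximate the n x n PSD matrix Q by H (n x k) so that Q ≈ H @ H.T.
import numpy as np

def icf(Q, k, tol=1e-8):
    n = Q.shape[0]
    H = np.zeros((n, k))
    d = np.diag(Q).astype(float).copy()        # residual diagonal of Q - H @ H.T
    for j in range(k):
        i = int(np.argmax(d))                  # pivot: largest residual diagonal
        if d[i] <= tol:
            return H[:, :j]                    # residual is negligible; stop early
        H[:, j] = (Q[:, i] - H[:, :j] @ H[i, :j]) / np.sqrt(d[i])
        d -= H[:, j] ** 2                      # update the residual diagonal
    return H

# Example: a low-rank PSD matrix is recovered exactly at k = rank.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 8))
Q = A @ A.T
H = icf(Q, k=8)
print(np.allclose(Q, H @ H.T, atol=1e-6))      # True
```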

Scaling Up

• Why big data?

• Parallelize machine learning algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

Majority Vote

[Figure: Majority Vote — Big Data is split into Shard 1 ... Shard M; Machine 1 ... Machine M each train independently on one shard, producing Model 1, Model 2, Model 3, ..., Model M]

Majority Vote

• Train individual classifiers independently

• Predict by taking majority votes

• Training CPU: O(TPN) to O(TPN / M)
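A small sketch of this scheme with a toy per-shard trainer (the same logistic-loss gradient descent as above, which is an illustrative choice, not the talk's):

```python
# A minimal sketch of distributed learning by majority vote: train one linear
# classifier per shard, then vote at prediction time.
import numpy as np

def train_on_shard(X, y, T=100, eta=0.1):
    """Toy per-shard trainer (logistic-loss gradient descent, labels in {-1, +1})."""
    w = np.zeros(X.shape[1])
    for _ in range(T):
        coeff = -y / (1.0 + np.exp(y * (X @ w)))
        w -= eta * (X.T @ coeff)
    return w

def majority_vote(models, X):
    """Each model votes sign(w · x); the majority sign wins (ties map to 0)."""
    votes = np.sign(np.stack([X @ w for w in models]))  # shape (M, num_examples)
    return np.sign(votes.sum(axis=0))

# models = [train_on_shard(X_m, y_m) for X_m, y_m in shards]  # one model per machine
# predictions = majority_vote(models, X_test)
```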

Parameter Mixture

[Figure: Parameter Mixture [Mann et al., 2009] — Map: Machine 1 ... Machine M each train on one shard and emit (dummy key, w_m); Reduce: average the weight vectors w_1, w_2, ... to produce the Model]

http://www.flickr.com/photos/annamatic3000/127945652/

Much less network usage than distributed gradient descent: O(MN) vs. O(MNT)

Iterative Param Mixture

[Figure: Iterative Parameter Mixture — Map: Machine 1 ... Machine M each train for one epoch on their shard and emit (dummy key, w_m); Reduce after each epoch: average the w_m, redistribute the averaged w, and continue training, producing the Model]

[McDonald et al., 2010]
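A sketch of the iterative parameter mixture; with epochs=1 it reduces to the one-shot parameter mixture of Mann et al. above (the per-shard trainer is again a toy logistic-loss step, not Google's):

```python
# A minimal sketch of (iterative) parameter mixture: train on every shard from a
# shared starting point, average the weights, and repeat once per epoch.
import numpy as np

def shard_epoch(w, X, y, eta=0.1):
    """One epoch of training on a single shard, starting from the shared w."""
    coeff = -y / (1.0 + np.exp(y * (X @ w)))   # logistic-loss gradient, labels {-1,+1}
    return w - eta * (X.T @ coeff)

def parameter_mixture(shards, P, epochs=1):
    w = np.zeros(P)
    for _ in range(epochs):                    # epochs = 1: Mann et al., 2009
        local_ws = [shard_epoch(w, X, y) for X, y in shards]  # Map: one per machine
        w = np.mean(local_ws, axis=0)          # Reduce: average the parameters
    return w                                   # epochs > 1: McDonald et al., 2010
```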

Outline

• Machine Learning intro

• Scaling machine learning algorithms up

• Design choices of large scale ML systems

http://www.flickr.com/photos/mr_t_in_dc/5469563053/

Scalable

http://www.flickr.com/photos/aloshbennett/3209564747/

Parallel

http://www.flickr.com/photos/wanderlinse/4367261825/

Accuracy

http://www.flickr.com/photos/brenderous/4532934181/

Binary Classification

http://www.flickr.com/photos/mararie/2340572508/

Automatic Feature Discovery

http://www.flickr.com/photos/prunejuice/3687192643/

Fast Response

http://www.flickr.com/photos/jepoirrier/840415676/

Memory is the new hard disk.

http://www.flickr.com/photos/neubie/854242030/

Algorithm + Infrastructure

Design for Multicores

http://www.flickr.com/photos/geektechnique/2344029370/

Combiner

Multi-shard Combiner

[Chandra et al., 2010]

Machine Learning on Big Data

Parallelize ML Algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

Parallel

Accuracy

Fast Response

Google APIs

• Prediction API

• machine learning service on the cloud

• http://code.google.com/apis/predict

• BigQuery

• interactive analysis of massive data on the cloud

• http://code.google.com/apis/bigquery