[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Google Projects (Max Lin,...


Abstract: Machine learning researchers and practitioners develop computer algorithms that "improve performance automatically through experience". At Google, machine learning is applied to solve many problems, such as prioritizing emails in Gmail, recommending tags for YouTube videos, and identifying different aspects from online user reviews. Machine learning on big data, however, is challenging. Some "simple" machine learning algorithms with quadratic time complexity, while running fine on hundreds of records, are almost impractical to use on billions of records.

In this talk, I will describe lessons drawn from various Google projects on developing large-scale machine learning systems. These systems build on top of Google's computing infrastructure, such as GFS and MapReduce, and attack the scalability problem through massively parallel algorithms. I will present the design decisions made in these systems and strategies for scaling and speeding up machine learning systems on web-scale data.

Speaker biography: Max Lin is a software engineer in Google Research's New York City office. He is the tech lead of the Google Prediction API, a machine learning web service in the cloud. Prior to Google, he published research work on video content analysis, sentiment analysis, machine learning, and cross-lingual information retrieval. He holds a PhD in Computer Science from Carnegie Mellon University.

Transcript of [Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Google Projects (Max Lin,...

Machine Learning on Big Data: Lessons Learned from Google Projects

Max Lin, Software Engineer | Google Research

Massively Parallel Computing | Harvard CS 264 | Guest Lecture | March 29th, 2011

Outline

• Machine Learning intro

• Scaling machine learning algorithms up

• Design choices of large scale ML systems

Outline

• Machine Learning intro

• Scaling machine learning algorithms up

• Design choices of large scale ML systems

“Machine Learning is a study of computer algorithms that improve automatically through experience.”

Training

"The quick brown fox jumped over the lazy dog." → English

"To err is human, but to really foul things up you need a computer." → English

"No hay mal que por bien no venga." → Spanish

"La tercera es la vencida." → Spanish

Testing

"To be or not to be -- that is the question" → ?

"La fe mueve montañas." → ?

Input X → Model f(x) → Output Y; for a new input x', the model predicts f(x') = y'.

Linear Classifier

Example input: "The quick brown fox jumped over the lazy dog."

x = [ 0 ('a'), 0 ('aardvark'), ..., 1 ('dog'), ..., 1 ('the'), ..., 0 ('montañas'), ... ]

w = [ 0.1, 132, ..., 150, 200, ..., -153, ... ]

f(x) = w \cdot x = \sum_{p=1}^{P} w_p x_p
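To make the bag-of-words representation and the dot-product score concrete, here is a minimal Python sketch (not from the talk); the vocabulary and weight values are illustrative only.

```python
# A minimal sketch: bag-of-words features and a linear classifier score
# f(x) = w · x. The vocabulary and the weights are made up for illustration.
import re
from collections import Counter

vocabulary = {"a", "aardvark", "brown", "dog", "fox", "montañas", "the"}
weights = {"the": 1.5, "brown": 0.7, "fox": 0.9, "dog": 0.4, "montañas": -2.0}

def featurize(text):
    """Map a sentence to a sparse bag-of-words vector {word: count}."""
    tokens = re.findall(r"\w+", text.lower(), flags=re.UNICODE)
    return Counter(t for t in tokens if t in vocabulary)

def score(x):
    """f(x) = w · x = sum over features p of w_p * x_p (sparse dot product)."""
    return sum(weights.get(word, 0.0) * count for word, count in x.items())

x = featurize("The quick brown fox jumped over the lazy dog.")
print(score(x))  # a positive score -> "English" in this toy setup
```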

Training Data

[Figure: the training data as an N × P matrix — N rows of examples, each with P feature values for Input X and a label for Output Y]

http://www.flickr.com/photos/mr_t_in_dc/5469563053/

Typical machine learning data at Google

N: 100 billion / 1 billion; P: 1 billion / 10 million (mean / median)

Classifier Training

• Training: Given {(x, y)} and f, minimize the following objective function

\arg\min_w \sum_{i=1}^{N} L(y_i, f(x_i; w)) + R(w)

http://www.flickr.com/photos/visitfinland/5424369765/

Use Newton's method? w_{t+1} \leftarrow w_t - H(w_t)^{-1} \nabla J(w_t)
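As a concrete (hypothetical) instance of this objective, the sketch below plugs a logistic loss in for L and an L2 penalty in for R; neither choice is specified on the slide.

```python
# A minimal sketch of the training objective J(w) = sum_i L(y_i, f(x_i; w)) + R(w),
# assuming a logistic loss L and an L2 regularizer R (labels y_i in {-1, +1}).
import numpy as np

def objective(w, X, y, lam=1.0):
    margins = y * (X @ w)                      # N margins, one per training example
    loss = np.sum(np.log1p(np.exp(-margins)))  # logistic loss summed over N examples
    return loss + 0.5 * lam * np.dot(w, w)     # plus the regularizer R(w)

# Newton's update needs the P x P Hessian H(w_t); with P on the order of 10^9
# features that is ~10^18 entries, which is why it is impractical at this scale.
```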

Outline

• Machine Learning intro

• Scaling machine learning algorithms up

• Design choices of large scale ML systems

Scaling Up

• Why big data?

• Parallelize machine learning algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

[Figure: subsampling — Big Data is split into Shard 1, Shard 2, Shard 3, ..., Shard M; one shard is sampled onto a single machine to train the Model, i.e. reduce N]

Why not Small Data?

[Banko and Brill, 2001]

Scaling Up

• Why big data?

• Parallelize machine learning algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

Parallelize Estimates

• Naive Bayes Classifier

• Maximum Likelihood Estimates

w_{\text{the} \mid EN} = \frac{\sum_{i=1}^{N} \mathbf{1}_{EN,\text{the}}(x_i)}{\sum_{i=1}^{N} \mathbf{1}_{EN}(x_i)}

\arg\min_w \; -\sum_{i=1}^{N} \log \Big( P(y_i; w) \prod_{p=1}^{P} P(x_{ip} \mid y_i; w) \Big)

Word Counting

Map: X = "The quick brown fox ...", Y = EN → emit ('the|EN', 1), ('quick|EN', 1), ('brown|EN', 1), ...

Reduce: [ ('the|EN', 1), ('the|EN', 1), ('the|EN', 1) ] → C('the'|EN) = sum of values = 3

w_'the'|EN = C('the'|EN) / C(EN)

[Figure: Word Counting with MapReduce — Big Data is split into Shard 1 ... Shard M; Mapper 1 ... Mapper M each read one shard and emit pairs such as ('the' | EN, 1), ('fox' | EN, 1), ..., ('montañas' | ES, 1); the Reducer tallies the counts and updates w to produce the Model]
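Below is a minimal, single-process sketch of the map and reduce steps above; plain Python stands in for Google's MapReduce, and the shard contents are toy data.

```python
# A minimal sketch of MapReduce-style word counting for Naive Bayes training.
from collections import defaultdict

shards = [
    [("The quick brown fox jumped over the lazy dog", "EN")],
    [("No hay mal que por bien no venga", "ES")],
]

def map_shard(shard):
    """Mapper: emit ('word|label', 1) for every token in one shard."""
    for text, label in shard:
        for token in text.lower().split():
            yield (f"{token}|{label}", 1)

def reduce_counts(pairs):
    """Reducer: sum the values for each key, i.e. C(word|label)."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts

pairs = (kv for shard in shards for kv in map_shard(shard))  # shuffle step omitted
counts = reduce_counts(pairs)
print(counts["the|EN"])  # -> 2 in this toy shard
```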

Parallelize Optimization

• Maximum Entropy Classifiers

• Good: J(w) is concave

• Bad: no closed-form solution like NB

• Ugly: Large N

\arg\min_w \; -\sum_{i=1}^{N} \log \frac{\exp\!\big(\sum_{p=1}^{P} w_p x_{ip}\big)^{y_i}}{1 + \exp\!\big(\sum_{p=1}^{P} w_p x_{ip}\big)}

Gradient Descent

http://www.cs.cmu.edu/~epxing/Class/10701/Lecture/lecture7.pdf

Gradient Descent

• w is initialized as zero

• for t in 1 to T:

  • calculate the gradient \nabla J(w)

  • update w_{t+1} \leftarrow w_t - \eta \nabla J(w)

\nabla J(w) = \sum_{i=1}^{N} g(w, x_i, y_i), i.e. a sum of per-example gradient terms over all N training examples
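A compact sketch of this loop, assuming a logistic loss for J (the step size, iteration count, and labels in {-1, +1} are illustrative choices, not the talk's):

```python
# A minimal gradient-descent sketch for a linear classifier with logistic loss.
import numpy as np

def gradient(w, X, y):
    """∇J(w): the per-example gradients summed over all N examples."""
    margins = y * (X @ w)
    coeff = -y / (1.0 + np.exp(margins))   # derivative of log(1 + exp(-y * w·x))
    return X.T @ coeff                     # one full pass costs O(P * N)

def train(X, y, T=100, eta=0.1):
    w = np.zeros(X.shape[1])               # w is initialized as zero
    for _ in range(T):                     # total training cost O(T * P * N)
        w = w - eta * gradient(w, X, y)
    return w
```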

Distribute Gradient

• w is initialized as zero

• for t in 1 to T:

  • calculate the gradient \nabla J(w) in parallel

  • update w_{t+1} \leftarrow w_t - \eta \nabla J(w)

• Training CPU: O(TPN) to O(TPN / M)

Distribute Gradient

[Figure: Distribute Gradient with MapReduce — Big Data is split into Shard 1 ... Shard M; Machine 1 ... Machine M each compute a partial gradient sum over their shard and emit (dummy key, partial gradient sum); the Reducer sums the partials and updates w, producing the Model; the Map/Reduce job is repeated until convergence]
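The same loop with the gradient summed shard by shard: each machine maps its shard to a partial gradient sum, and the reduce step adds them up and updates w. A single-process sketch (a real system runs each partial sum on its own machine):

```python
# A minimal sketch of one Map/Reduce round of distributed gradient descent.
import numpy as np

def partial_gradient(w, X_shard, y_shard):
    """Map: the partial gradient sum over one shard (same logistic loss as above)."""
    margins = y_shard * (X_shard @ w)
    coeff = -y_shard / (1.0 + np.exp(margins))
    return X_shard.T @ coeff               # emitted under a dummy key

def distributed_step(w, shards, eta=0.1):
    """Reduce: sum the partial gradients and update w."""
    grad = sum(partial_gradient(w, X, y) for X, y in shards)
    return w - eta * grad

# Repeat distributed_step until the model converges; each round is one MapReduce job.
```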

Scaling Up

• Why big data?

• Parallelize machine learning algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

Parallelize Subroutines

• Support Vector Machines

Primal: \arg\min_{w,b,\zeta} \; \frac{1}{2}\|w\|_2^2 + C \sum_{i=1}^{n} \zeta_i \quad \text{s.t. } 1 - y_i(w \cdot \phi(x_i) + b) \le \zeta_i, \; \zeta_i \ge 0

• Solve the dual problem: \arg\min_{\alpha} \; \frac{1}{2}\alpha^T Q \alpha - \alpha^T \mathbf{1} \quad \text{s.t. } 0 \le \alpha \le C, \; y^T \alpha = 0

http://www.flickr.com/photos/sea-turtle/198445204/

The computational cost of the Primal-Dual Interior Point Method is O(n^3) in time and O(n^2) in memory.

Parallel SVM [Chang et al., 2007]

• Parallel, row-wise Incomplete Cholesky Factorization (ICF) of Q

• Parallel interior point method

• Time O(n^3) becomes O(n^2 / M)

• Memory O(n^2) becomes O(n / M)

• Parallel Support Vector Machines (psvm): http://code.google.com/p/psvm/

• Implemented in MPI

Parallel ICF

• Distribute Q by row across M machines

• For each dimension n < √N:

  • Workers send their local pivots to the master

  • The master selects the largest of the local pivots and broadcasts the global pivot to the workers

[Figure: rows of Q distributed across machines — Machine 1: rows 1, 2; Machine 2: rows 3, 4; Machine 3: rows 5, 6; ...; the factorization keeps roughly √N columns]
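Below is a serial sketch of the pivoted incomplete Cholesky idea (Q ≈ H Hᵀ with about √N columns); PSVM's parallel version distributes the rows of Q and H across machines and exchanges only the pivot information at each step. This is an illustration, not the PSVM implementation.

```python
# A minimal serial sketch of pivoted incomplete Cholesky factorization:
# approximate the n x n PSD matrix Q by H (n x k) so that Q ≈ H @ H.T.
import numpy as np

def icf(Q, k, tol=1e-8):
    n = Q.shape[0]
    H = np.zeros((n, k))
    d = np.diag(Q).astype(float).copy()        # residual diagonal of Q - H @ H.T
    for j in range(k):
        i = int(np.argmax(d))                  # pivot: largest residual diagonal
        if d[i] <= tol:
            return H[:, :j]                    # residual is negligible; stop early
        H[:, j] = (Q[:, i] - H[:, :j] @ H[i, :j]) / np.sqrt(d[i])
        d -= H[:, j] ** 2                      # update the residual diagonal
    return H

# Example: a low-rank PSD matrix is recovered exactly at k = rank.
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 8))
Q = A @ A.T
H = icf(Q, k=8)
print(np.allclose(Q, H @ H.T, atol=1e-6))      # True
```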

Scaling Up

• Why big data?

• Parallelize machine learning algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

Majority Vote

[Figure: Majority Vote — Big Data is split into Shard 1 ... Shard M; Machine 1 ... Machine M each train independently on one shard, producing Model 1, Model 2, Model 3, ..., Model M]

Majority Vote

• Train individual classifiers independently

• Predict by taking majority votes

• Training CPU: O(TPN) to O(TPN / M)
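A small sketch of this scheme with a toy per-shard trainer (the same logistic-loss gradient descent as above, which is an illustrative choice, not the talk's):

```python
# A minimal sketch of distributed learning by majority vote: train one linear
# classifier per shard, then vote at prediction time.
import numpy as np

def train_on_shard(X, y, T=100, eta=0.1):
    """Toy per-shard trainer (logistic-loss gradient descent, labels in {-1, +1})."""
    w = np.zeros(X.shape[1])
    for _ in range(T):
        coeff = -y / (1.0 + np.exp(y * (X @ w)))
        w -= eta * (X.T @ coeff)
    return w

def majority_vote(models, X):
    """Each model votes sign(w · x); the majority sign wins (ties map to 0)."""
    votes = np.sign(np.stack([X @ w for w in models]))  # shape (M, num_examples)
    return np.sign(votes.sum(axis=0))

# models = [train_on_shard(X_m, y_m) for X_m, y_m in shards]  # one model per machine
# predictions = majority_vote(models, X_test)
```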

Parameter Mixture

[Figure: Parameter Mixture [Mann et al., 2009] — Map: Machine 1 ... Machine M each train on one shard and emit (dummy key, w_m); Reduce: average the weight vectors w_1, w_2, ... to produce the Model]

http://www.flickr.com/photos/annamatic3000/127945652/

Much less network usage than distributed gradient descent: O(MN) vs. O(MNT)

Iterative Param Mixture

[Figure: Iterative Parameter Mixture — Map: Machine 1 ... Machine M each train for one epoch on their shard and emit (dummy key, w_m); Reduce after each epoch: average the w_m, redistribute the averaged w, and continue training, producing the Model]

[McDonald et al., 2010]
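A sketch of the iterative parameter mixture; with epochs=1 it reduces to the one-shot parameter mixture of Mann et al. above (the per-shard trainer is again a toy logistic-loss step, not Google's):

```python
# A minimal sketch of (iterative) parameter mixture: train on every shard from a
# shared starting point, average the weights, and repeat once per epoch.
import numpy as np

def shard_epoch(w, X, y, eta=0.1):
    """One epoch of training on a single shard, starting from the shared w."""
    coeff = -y / (1.0 + np.exp(y * (X @ w)))   # logistic-loss gradient, labels {-1,+1}
    return w - eta * (X.T @ coeff)

def parameter_mixture(shards, P, epochs=1):
    w = np.zeros(P)
    for _ in range(epochs):                    # epochs = 1: Mann et al., 2009
        local_ws = [shard_epoch(w, X, y) for X, y in shards]  # Map: one per machine
        w = np.mean(local_ws, axis=0)          # Reduce: average the parameters
    return w                                   # epochs > 1: McDonald et al., 2010
```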

Outline

• Machine Learning intro

• Scaling machine learning algorithms up

• Design choices of large scale ML systems

http://www.flickr.com/photos/mr_t_in_dc/5469563053/

Scalable

http://www.flickr.com/photos/aloshbennett/3209564747/

Parallel

http://www.flickr.com/photos/wanderlinse/4367261825/

Accuracy

http://www.flickr.com/photos/brenderous/4532934181/

Binary Classification

http://www.flickr.com/photos/mararie/2340572508/

Automatic Feature Discovery

http://www.flickr.com/photos/prunejuice/3687192643/

Fast Response

http://www.flickr.com/photos/jepoirrier/840415676/

Memory is the new hard disk.

http://www.flickr.com/photos/neubie/854242030/

Algorithm + Infrastructure

Design for Multicores

http://www.flickr.com/photos/geektechnique/2344029370/

Combiner

Multi-shard Combiner

[Chandra et al., 2010]

Machine Learning on Big Data

Parallelize ML Algorithms

• Embarrassingly parallel

• Parallelize sub-routines

• Distributed learning

Parallel

Accuracy

Fast Response

Google APIs

• Prediction API

• machine learning service on the cloud

• http://code.google.com/apis/predict

• BigQuery

• interactive analysis of massive data on the cloud

• http://code.google.com/apis/bigquery