Optimization via (too much?) Randomization


Page 1: Optimization via  (too much?) Randomization

Optimization via (too much?) Randomization

Peter Richtarik

Why parallelizing like crazy and being lazy can be good

Page 2: Optimization via  (too much?) Randomization

Optimization as Mountain Climbing

Page 3: Optimization via  (too much?) Randomization

Optimization with Big Data = Extreme* Mountain Climbing

* in a billion dimensional space on a foggy day

Page 4: Optimization via  (too much?) Randomization

Big Data

• digital images & videos
• transaction records
• government records
• health records
• defence
• internet activity (social media, wikipedia, ...)
• scientific measurements (physics, climate models, ...)

BIG Volume, BIG Velocity, BIG Variety

Page 5: Optimization via  (too much?) Randomization

God’s Algorithm = Teleportation

Page 6: Optimization via  (too much?) Randomization

If You Are Not a God...

[Figure: a sequence of iterates x0 → x1 → x2 → x3]

Page 7: Optimization via  (too much?) Randomization

[Figure labels: "start", "settle for this", "holy grail"]

Randomized Parallel Coordinate Descent

Page 8: Optimization via  (too much?) Randomization

• Western General Hospital (Creutzfeldt-Jakob Disease)
• Arup (Truss Topology Design)
• Ministry of Defence dstl lab (Algorithms for Data Simplicity)
• Royal Observatory (Optimal Planet Growth)

Page 9: Optimization via  (too much?) Randomization

Optimization as Lock Breaking

Page 10: Optimization via  (too much?) Randomization

A Lock with 4 Dials

Setup: Combination maximizing F opens the lock

x = (x1, x2, x3, x4),   F(x) = F(x1, x2, x3, x4)

F is a function representing the "quality" of a combination.

Optimization Problem: Find the combination maximizing F.
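
To make the setup concrete, here is a minimal Python sketch (not from the slides) of the 4-dial lock as an optimization problem; the 10 positions per dial and the particular quality function F are assumptions made for illustration:

```python
from itertools import product

# Hypothetical 4-dial lock: each dial has positions 0..9.
# F assigns a "quality" to every combination; the maximizer opens the lock.
SECRET = (3, 1, 4, 1)  # assumed hidden combination, used only to define F

def F(x):
    # Quality peaks (at zero) exactly at the secret combination.
    return -sum((xi - si) ** 2 for xi, si in zip(x, SECRET))

# Brute force over all 10**4 combinations works here, but is hopeless when
# there are a billion dials, which is why cleverer methods are needed.
best = max(product(range(10), repeat=4), key=F)
print(best)  # (3, 1, 4, 1)
```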

Page 11: Optimization via  (too much?) Randomization

Optimization Algorithm

Page 12: Optimization via  (too much?) Randomization

A System of a Billion Locks with Shared Dials

# dials = n = # locks

[Figure: a graph whose nodes are the dials x1, x2, x3, x4, ..., xn; one node is highlighted as a lock]

1) Nodes in the graph correspond to dials

2) Nodes in the graph also correspond to locks: each lock (= node) owns the dials connected to it in the graph by an edge

Page 13: Optimization via  (too much?) Randomization

How do we Measure the Quality of a Combination?

F : R^n → R

• Each lock j has its own quality function Fj, depending on the dials it owns
• However, it does NOT open when Fj is maximized
• The system of locks opens when F = F1 + F2 + ... + Fn is maximized
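
In symbols, a sketch of the objective (the notation S_j for the set of dials owned by lock j is introduced here for convenience; it is not on the slide):

```latex
\[
F : \mathbb{R}^n \to \mathbb{R}, \qquad
F(x_1, \dots, x_n) \;=\; \sum_{j=1}^{n} F_j\bigl((x_i)_{i \in S_j}\bigr)
\]
```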

Page 14: Optimization via  (too much?) Randomization

An Algorithm with (too much?) Randomization

1) Randomly select a lock

2) Randomly select a dial belonging to the lock

3) Adjust the value on the selected dial, based only on the information corresponding to the selected lock
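
A minimal, self-contained Python sketch of this rule on a toy ring of locks (the quadratic quality functions, the step size, and the "lock j owns dial j and its neighbours" convention are all assumptions for illustration, not taken from the slides):

```python
import random

n = 6
neighbours = {j: [(j - 1) % n, (j + 1) % n] for j in range(n)}
owned = {j: [j] + neighbours[j] for j in range(n)}   # dials owned by lock j

def F_j(j, x):
    # Toy quality of lock j; it depends only on the dials lock j owns.
    return -sum((x[i] - 1.0) ** 2 for i in owned[j])

x = [0.0] * n
alpha = 0.1                                          # assumed step size
for _ in range(1000):
    j = random.randrange(n)                          # 1) randomly select a lock
    i = random.choice(owned[j])                      # 2) randomly select one of its dials
    grad_ji = -2.0 * (x[i] - 1.0)                    # 3) adjust the dial using only
    x[i] += alpha * grad_ji                          #    information from lock j (dF_j/dx_i)

print([round(v, 2) for v in x])  # all dials drift towards 1.0, the maximizer
```

Note that the true partial derivative of F at dial i would sum contributions from every lock that owns i; using only the selected lock's Fj is exactly the "lazy" part of the rule.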

Page 15: Optimization via  (too much?) Randomization

Synchronous Parallelization

[Figure: jobs J1–J9 scheduled over time on Processors 1–3; each processor sits IDLE while waiting for the others at every synchronization point (WASTEFUL)]

Page 16: Optimization via  (too much?) Randomization

Crazy (Lock-Free) Parallelization

[Figure: jobs J1–J9 run back-to-back over time on Processors 1–3, with no idle gaps (NO WASTE)]
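
A hedged Python sketch of what "crazy" (lock-free) parallelization looks like in code: several threads repeatedly update a shared iterate without any synchronization, in the spirit of the lock-free methods referenced at the end. The toy objective, step size and thread count are made-up illustrations:

```python
import random
import threading

n = 1000
target = [random.uniform(-1.0, 1.0) for _ in range(n)]
x = [0.0] * n                  # shared iterate: no locks protect it
alpha = 0.1                    # assumed step size

def worker(num_steps):
    for _ in range(num_steps):
        i = random.randrange(n)                  # pick a random coordinate (dial)
        grad_i = -2.0 * (x[i] - target[i])       # partial derivative of the toy quality
        x[i] += alpha * grad_i                   # unsynchronized write: other threads may
                                                 # read or write x at the same moment

threads = [threading.Thread(target=worker, args=(20000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("quality F(x) =", -sum((xi - ti) ** 2 for xi, ti in zip(x, target)))
```

The "Theoretical Result" slide that follows is about when updates of this kind still make progress.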

Page 17: Optimization via  (too much?) Randomization

Crazy Parallelization

Page 18: Optimization via  (too much?) Randomization

Crazy Parallelization

Page 19: Optimization via  (too much?) Randomization

Crazy Parallelization

Page 20: Optimization via  (too much?) Randomization

Crazy Parallelization

Page 21: Optimization via  (too much?) Randomization

Theoretical Result

The theoretical bound depends on:

• Average # of dials in a lock
• Average # of dials common between 2 locks
• # of locks
• # of processors
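
For context, one published bound with this shape, from the parallel coordinate descent paper listed on the references slide (quoted from memory as an illustration, so the exact expression on the slide may differ): with n dials/locks, τ processors, and ω measuring how many dials a lock touches, the number of iterations needed to reach accuracy ε grows roughly like

```latex
\[
k \;\approx\; \frac{n}{\tau}\left(1 + \frac{(\omega - 1)(\tau - 1)}{n - 1}\right)\frac{1}{\varepsilon}
\]
```

so when locks share few dials (ω small), adding processors gives a nearly linear speedup.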

Page 22: Optimization via  (too much?) Randomization

Computational Insights

Page 23: Optimization via  (too much?) Randomization
Page 24: Optimization via  (too much?) Randomization
Page 25: Optimization via  (too much?) Randomization
Page 26: Optimization via  (too much?) Randomization
Page 27: Optimization via  (too much?) Randomization

Theory vs Reality

Page 28: Optimization via  (too much?) Randomization

Why parallelizing like crazy and being lazy can be good?

Randomization

• Effectivity
• Tractability
• Efficiency
• Scalability (big data)
• Parallelism
• Distribution
• Asynchronicity

Parallelization

Page 29: Optimization via  (too much?) Randomization

Optimization Methods for Big Data

• Randomized Coordinate Descent
  – P. Richtarik and M. Takac: Parallel coordinate descent methods for big data optimization, arXiv:1212.0873 [can solve a problem with 1 billion variables in 2 hours using 24 processors]

• Stochastic (Sub)Gradient Descent
  – P. Richtarik and M. Takac: Randomized lock-free methods for minimizing partially separable convex functions [can be applied to optimize an unknown function]

• Both of the above
  – M. Takac, A. Bijral, P. Richtarik and N. Srebro: Mini-batch primal and dual methods for support vector machines, arXiv:1303.xxxx

Page 30: Optimization via  (too much?) Randomization

Final 2 Slides

Page 31: Optimization via  (too much?) Randomization

Tools

• Probability
• Machine Learning
• Matrix Theory
• HPC

Page 32: Optimization via  (too much?) Randomization