Optimization via (too much?) Randomization


Page 1: Optimization via  (too much?) Randomization

Optimization via (too much?) Randomization

Peter Richtarik

Why parallelizing like crazy and being lazy can be good

Page 2: Optimization via  (too much?) Randomization

Optimization as Mountain Climbing

Page 3: Optimization via  (too much?) Randomization

Optimization with Big Data = Extreme* Mountain Climbing

* in a billion dimensional space on a foggy day

Page 4: Optimization via  (too much?) Randomization

Big Data

• digital images & videos
• transaction records
• government records
• health records
• defence
• internet activity (social media, wikipedia, ...)
• scientific measurements (physics, climate models, ...)

BIG Volume, BIG Velocity, BIG Variety

Page 5: Optimization via  (too much?) Randomization

God’s Algorithm = Teleportation

Page 6: Optimization via  (too much?) Randomization

If You Are Not a God...

[Figure: a sequence of iterates x0 → x1 → x2 → x3]

Page 7: Optimization via  (too much?) Randomization

[Figure labels: "start", "settle for this", "holy grail"]

Randomized Parallel Coordinate Descent

Page 8: Optimization via  (too much?) Randomization

• Western General Hospital (Creutzfeldt-Jakob Disease)
• Arup (Truss Topology Design)
• Ministry of Defence dstl lab (Algorithms for Data Simplicity)
• Royal Observatory (Optimal Planet Growth)

Page 9: Optimization via  (too much?) Randomization

Optimization as Lock Breaking

Page 10: Optimization via  (too much?) Randomization

A Lock with 4 Dials

Setup: Combination maximizing F opens the lock

x = (x1, x2, x3, x4),   F(x) = F(x1, x2, x3, x4)

F is a function representing the "quality" of a combination.

Optimization Problem: Find the combination maximizing F.
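
To make the setup concrete, here is a minimal Python sketch (not from the slides) of the 4-dial lock as an optimization problem; the 10 positions per dial and the particular quality function F are assumptions made for illustration:

```python
from itertools import product

# Hypothetical 4-dial lock: each dial has positions 0..9.
# F assigns a "quality" to every combination; the maximizer opens the lock.
SECRET = (3, 1, 4, 1)  # assumed hidden combination, used only to define F

def F(x):
    # Quality peaks (at zero) exactly at the secret combination.
    return -sum((xi - si) ** 2 for xi, si in zip(x, SECRET))

# Brute force over all 10**4 combinations works here, but is hopeless when
# there are a billion dials, which is why cleverer methods are needed.
best = max(product(range(10), repeat=4), key=F)
print(best)  # (3, 1, 4, 1)
```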

Page 11: Optimization via  (too much?) Randomization

Optimization Algorithm

Page 12: Optimization via  (too much?) Randomization

A System of a Billion Locks with Shared Dials

# dials = n = # locks

[Figure: a graph whose nodes are the dials x1, x2, x3, x4, ..., xn; one node is highlighted as a lock]

1) Nodes in the graph correspond to dials

2) Nodes in the graph also correspond to locks: each lock (= node) owns the dials connected to it in the graph by an edge

Page 13: Optimization via  (too much?) Randomization

How do we Measure the Quality of a Combination?

F : R^n → R

• Each lock j has its own quality function Fj, depending on the dials it owns
• However, it does NOT open when Fj is maximized
• The system of locks opens when F = F1 + F2 + ... + Fn is maximized
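
In symbols, a sketch of the objective (the notation S_j for the set of dials owned by lock j is introduced here for convenience; it is not on the slide):

```latex
\[
F : \mathbb{R}^n \to \mathbb{R}, \qquad
F(x_1, \dots, x_n) \;=\; \sum_{j=1}^{n} F_j\bigl((x_i)_{i \in S_j}\bigr)
\]
```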

Page 14: Optimization via  (too much?) Randomization

An Algorithm with (too much?) Randomization

1) Randomly select a lock

2) Randomly select a dial belonging to the lock

3) Adjust the value on the selected dial, based only on the information corresponding to the selected lock
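
A minimal, self-contained Python sketch of this rule on a toy ring of locks (the quadratic quality functions, the step size, and the "lock j owns dial j and its neighbours" convention are all assumptions for illustration, not taken from the slides):

```python
import random

n = 6
neighbours = {j: [(j - 1) % n, (j + 1) % n] for j in range(n)}
owned = {j: [j] + neighbours[j] for j in range(n)}   # dials owned by lock j

def F_j(j, x):
    # Toy quality of lock j; it depends only on the dials lock j owns.
    return -sum((x[i] - 1.0) ** 2 for i in owned[j])

x = [0.0] * n
alpha = 0.1                                          # assumed step size
for _ in range(1000):
    j = random.randrange(n)                          # 1) randomly select a lock
    i = random.choice(owned[j])                      # 2) randomly select one of its dials
    grad_ji = -2.0 * (x[i] - 1.0)                    # 3) adjust the dial using only
    x[i] += alpha * grad_ji                          #    information from lock j (dF_j/dx_i)

print([round(v, 2) for v in x])  # all dials drift towards 1.0, the maximizer
```

Note that the true partial derivative of F at dial i would sum contributions from every lock that owns i; using only the selected lock's Fj is exactly the "lazy" part of the rule.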

Page 15: Optimization via  (too much?) Randomization

Synchronous Parallelization

[Figure: jobs J1–J9 scheduled over time on Processors 1–3; each processor sits IDLE while waiting for the others at every synchronization point (WASTEFUL)]

Page 16: Optimization via  (too much?) Randomization

Crazy (Lock-Free) Parallelization

[Figure: jobs J1–J9 run back-to-back over time on Processors 1–3, with no idle gaps (NO WASTE)]
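
A hedged Python sketch of what "crazy" (lock-free) parallelization looks like in code: several threads repeatedly update a shared iterate without any synchronization, in the spirit of the lock-free methods referenced at the end. The toy objective, step size and thread count are made-up illustrations:

```python
import random
import threading

n = 1000
target = [random.uniform(-1.0, 1.0) for _ in range(n)]
x = [0.0] * n                  # shared iterate: no locks protect it
alpha = 0.1                    # assumed step size

def worker(num_steps):
    for _ in range(num_steps):
        i = random.randrange(n)                  # pick a random coordinate (dial)
        grad_i = -2.0 * (x[i] - target[i])       # partial derivative of the toy quality
        x[i] += alpha * grad_i                   # unsynchronized write: other threads may
                                                 # read or write x at the same moment

threads = [threading.Thread(target=worker, args=(20000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("quality F(x) =", -sum((xi - ti) ** 2 for xi, ti in zip(x, target)))
```

The "Theoretical Result" slide that follows is about when updates of this kind still make progress.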

Page 17: Optimization via  (too much?) Randomization

Crazy Parallelization

Page 18: Optimization via  (too much?) Randomization

Crazy Parallelization

Page 19: Optimization via  (too much?) Randomization

Crazy Parallelization

Page 20: Optimization via  (too much?) Randomization

Crazy Parallelization

Page 21: Optimization via  (too much?) Randomization

Theoretical Result

The theoretical bound depends on:

• Average # of dials in a lock
• Average # of dials common between 2 locks
• # of locks
• # of processors
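
For context, one published bound with this shape, from the parallel coordinate descent paper listed on the references slide (quoted from memory as an illustration, so the exact expression on the slide may differ): with n dials/locks, τ processors, and ω measuring how many dials a lock touches, the number of iterations needed to reach accuracy ε grows roughly like

```latex
\[
k \;\approx\; \frac{n}{\tau}\left(1 + \frac{(\omega - 1)(\tau - 1)}{n - 1}\right)\frac{1}{\varepsilon}
\]
```

so when locks share few dials (ω small), adding processors gives a nearly linear speedup.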

Page 22: Optimization via  (too much?) Randomization

Computational Insights

Page 23: Optimization via  (too much?) Randomization
Page 24: Optimization via  (too much?) Randomization
Page 25: Optimization via  (too much?) Randomization
Page 26: Optimization via  (too much?) Randomization
Page 27: Optimization via  (too much?) Randomization

Theory vs Reality

Page 28: Optimization via  (too much?) Randomization

Why parallelizing like crazy and being lazy can be good?

Randomization

• Effectivity
• Tractability
• Efficiency
• Scalability (big data)
• Parallelism
• Distribution
• Asynchronicity

Parallelization

Page 29: Optimization via  (too much?) Randomization

Optimization Methods for Big Data

• Randomized Coordinate Descent
  – P. Richtarik and M. Takac: Parallel coordinate descent methods for big data optimization, arXiv:1212.0873 [can solve a problem with 1 billion variables in 2 hours using 24 processors]

• Stochastic (Sub)Gradient Descent
  – P. Richtarik and M. Takac: Randomized lock-free methods for minimizing partially separable convex functions [can be applied to optimize an unknown function]

• Both of the above
  – M. Takac, A. Bijral, P. Richtarik and N. Srebro: Mini-batch primal and dual methods for support vector machines, arXiv:1303.xxxx

Page 30: Optimization via  (too much?) Randomization

Final 2 Slides

Page 31: Optimization via  (too much?) Randomization

Tools

• Probability
• Machine Learning
• Matrix Theory
• HPC

Page 32: Optimization via  (too much?) Randomization