Tutorial on Optimization with Submodular Functions


Transcript of Tutorial on Optimization with Submodular Functions

Page 1: Tutorial on Optimization with  Submodular  Functions

Tutorial on Optimization with Submodular Functions

Andreas Krause

Page 2: Tutorial on Optimization with  Submodular  Functions

2

Acknowledgements
Partly based on material presented at IJCAI 2009 together with Carlos Guestrin (CMU)

MATLAB Toolbox available on http://las.ethz.ch/; many of the algorithms presented here are implemented there

Page 3: Tutorial on Optimization with  Submodular  Functions

Discrete Optimization
Many discrete optimization problems can be formulated as follows:

Given a finite set V, we wish to select a subset A ⊆ V (subject to some constraints) maximizing a utility F(A).

These problems are the focus of this tutorial.

3

Page 4: Tutorial on Optimization with  Submodular  Functions

Contamination of drinking water could affect millions of people

4

Example: Sensor Placement [with Leskovec, Guestrin, VanBriesen, Faloutsos, J Wat Res Mgt ‘08]

Place sensors to detect contaminations ("Battle of the Water Sensor Networks" competition)

Where should we place sensors to quickly detect contamination?

[Figure: water distribution network with a contamination event and placed sensors; water flow simulator from EPA; Hach sensor, ~$14K each.]

Page 5: Tutorial on Optimization with  Submodular  Functions

5

Robotic monitoring of rivers and lakes
[with Singh, Guestrin, Kaiser, Journal of AI Research '08]

Need to monitor large spatial phenomena: temperature, nutrient distribution, fluorescence, …
Can only make a limited number of measurements!
Predict at unobserved locations.
Use robotic sensors (NIMS, Kaiser et al., UCLA) to cover large areas.

[Figure: actual vs. predicted temperature (color) by depth and location across the lake.]

Where should we sense to get the most accurate predictions?

Page 6: Tutorial on Optimization with  Submodular  Functions

6

Example: Model-based sensing
Model predicts impact of contaminations:
For water networks: water flow simulator from EPA
For lake monitoring: statistical, data-driven models
For each subset A of the set V of all network junctions, compute the sensing quality F(A).

[Figure: the model predicts high-, medium-, and low-impact locations; a sensor reduces impact through early detection. A well-chosen placement A = {S1, S2, S3, S4} has high sensing quality F(A) = 0.9; a poor placement has low sensing quality F(A) = 0.01.]

Page 7: Tutorial on Optimization with  Submodular  Functions

7

Sensor placement
Given: finite set V of locations, sensing quality F
Want: A* ⊆ V such that A* = argmax F(A) subject to |A| ≤ k

NP-hard!

How well can this simple heuristic do?

Greedy algorithm:
Start with A = {}
For i = 1 to k:
  s* := argmax_s F(A ∪ {s})
  A := A ∪ {s*}
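A minimal Python sketch of this greedy rule, treating F as a black-box set function (function and variable names are illustrative, not taken from the MATLAB toolbox):

def greedy_maximize(F, V, k):
    """Repeatedly add the element with the largest marginal gain F(A u {s}) - F(A)."""
    A = set()
    for _ in range(k):
        best_s, best_gain = None, float("-inf")
        for s in V:
            if s in A:
                continue
            gain = F(A | {s}) - F(A)   # marginal benefit of s given A
            if gain > best_gain:
                best_s, best_gain = s, gain
        if best_s is None:             # fewer than k elements available
            break
        A.add(best_s)
    return A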


Page 8: Tutorial on Optimization with  Submodular  Functions

8

Performance of greedy algorithm

[Figure: population protected F(A) (higher is better) vs. number of sensors placed, on a small subset of the water networks data; the greedy curve nearly matches the optimal curve.]

Greedy score empirically close to optimal. Why?

Page 9: Tutorial on Optimization with  Submodular  Functions

9

Key property 1: Monotonicity

Placement A = {S1, S2}, placement B = {S1, S2, S3, S4}.

F is monotonic: A ⊆ B ⇒ F(A) ≤ F(B)
Adding sensors can only help.

Page 10: Tutorial on Optimization with  Submodular  Functions

10

Key property 2: Diminishing returns

Placement A = {S1, S2}, placement B = {S1, S2, S3, S4}. Consider adding a new sensor S'.
Adding S' to the small placement A helps a lot; adding S' to the large placement B doesn't help much.

Submodularity: F(A ∪ {S'}) − F(A) ≥ F(B ∪ {S'}) − F(B)
(large improvement vs. small improvement)

Page 11: Tutorial on Optimization with  Submodular  Functions

11

One reason submodularity is useful

Theorem [Nemhauser et al '78]: Suppose F is monotonic and submodular. Then the greedy algorithm gives a constant factor approximation:
F(A_greedy) ≥ (1 − 1/e) max_{|A| ≤ k} F(A)   (~63%)

Greedy algorithm gives a near-optimal solution! In general, this guarantee is the best possible unless P = NP!

Page 12: Tutorial on Optimization with  Submodular  Functions

In this tutorial you will:
See examples and properties of submodular functions
Learn how submodularity helps to find good solutions
Find out about current research directions

12

Page 13: Tutorial on Optimization with  Submodular  Functions

13

Set functions
Finite set V = {1, 2, …, n}
Function F: 2^V → R
Will always assume F({}) = 0 (w.l.o.g.)
Assume a black box that can evaluate F for any input A
Approximate (noisy) evaluation of F is OK

[Figure: set V of all network junctions; one placement with high sensing quality F(A) = 0.9, another with low sensing quality F(A) = 0.01.]

Page 14: Tutorial on Optimization with  Submodular  Functions

14

Submodular set functions
A set function F on V is called submodular if, for all A, B ⊆ V:
F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)

Equivalent diminishing returns characterization: for all A ⊆ B ⊆ V and s ∉ B,
F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B)
(adding s to the small set A gives a large improvement; adding s to the large set B gives a small improvement)

Page 15: Tutorial on Optimization with  Submodular  Functions

15

Submodularity and supermodularity
F is called supermodular if −F is submodular.
F is called modular if F is both sub- and supermodular. For modular ("additive") F:
F(A) = Σ_{i ∈ A} w(i) for some weights w(i)

Page 16: Tutorial on Optimization with  Submodular  Functions

16

Example: Set cover

Want to cover a floorplan with discs: place sensors in a building, where each node predicts values of positions within some radius. Possible locations V.

For A ⊆ V: F(A) = "area covered by sensors placed at A"

Formally: W a finite set, with a collection of n subsets S_i ⊆ W.
For A ⊆ V = {1, …, n} define F(A) = |∪_{i ∈ A} S_i|

[Figure: building floorplan (server, lab, kitchen, offices, …) with sensing discs.]

Page 17: Tutorial on Optimization with  Submodular  Functions

17

Set cover is submodular

A = {S1, S2}, B = {S1, S2, S3, S4}; consider adding a new disc S'.

F(A ∪ {S'}) − F(A) ≥ F(B ∪ {S'}) − F(B)

[Figure: the floorplan shown twice; the new disc S' adds a lot of area on top of A, but only a little on top of B.]

Page 18: Tutorial on Optimization with  Submodular  Functions

18

A better model for sensing

Joint probability distribution P(X1, …, Xn, Y1, …, Yn) = P(Y1, …, Yn) P(X1, …, Xn | Y1, …, Yn)
(prior × likelihood)

Ys: temperature at location s
Xs: sensor value at location s
Xs = Ys + noise

[Figure: graphical model with latent temperatures Y1, …, Y6 connected to each other and to noisy sensor values X1, …, X6.]

Page 19: Tutorial on Optimization with  Submodular  Functions

Information gain
Utility of having sensors at subset A of all locations:
F(A) = (uncertainty about temperature Y before sensing) − (uncertainty about temperature Y after sensing X_A)

[Figure: sensing at A = {1, 2, 3} gives high value F(A); sensing at A = {1, 4, 5} gives low value F(A).]

Page 20: Tutorial on Optimization with  Submodular  Functions

20

Example: Submodularity of information gain
Y1, …, Ym, X1, …, Xn discrete RVs
F(A) = I(Y; X_A) = H(Y) − H(Y | X_A)
F(A) is always monotonic
However, NOT always submodular

Theorem [Krause & Guestrin UAI '05]: If the Xi are all conditionally independent given Y, then F(A) is submodular!

[Figure: graphical model with Y1, Y2, Y3 and conditionally independent observations X1, …, X4.]
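A small numerical sketch of F(A) = I(Y; X_A) for a toy naive Bayes model (binary Y, conditionally independent binary tests); the probabilities below are invented purely for illustration:

import itertools
import numpy as np

p_y = np.array([0.5, 0.5])                              # prior P(Y)
p_x_given_y = [np.array([[0.9, 0.1],                    # P(X_i = x | Y = y), rows indexed by y
                         [0.2, 0.8]]) for _ in range(3)]

def info_gain(A):
    """F(A) = I(Y; X_A) = H(Y) - H(Y | X_A), enumerating outcomes of the tests in A."""
    H_Y = -np.sum(p_y * np.log2(p_y))
    H_Y_given_XA = 0.0
    for xs in itertools.product([0, 1], repeat=len(A)):
        joint = p_y.copy()                               # P(y, x_A) = P(y) * prod_i P(x_i | y)
        for i, x in zip(A, xs):
            joint = joint * p_x_given_y[i][:, x]
        p_xA = joint.sum()
        if p_xA > 0:
            post = joint / p_xA
            H_Y_given_XA -= p_xA * np.sum(post * np.log2(np.where(post > 0, post, 1.0)))
    return H_Y - H_Y_given_XA

# Monotone: info_gain([0]) <= info_gain([0, 1]); diminishing returns can be checked numerically.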

Page 21: Tutorial on Optimization with  Submodular  Functions

21

Proof: Submodularity of information gain
Y1, …, Ym, X1, …, Xn discrete RVs; F(A) = I(Y; X_A) = H(Y) − H(Y | X_A)

Since the variables X are independent given Y:
F(A ∪ {s}) − F(A) = H(Xs | X_A) − H(Xs | Y)

The second term is constant (independent of A). The first term is nonincreasing in A:
A ⊆ B ⇒ H(Xs | X_A) ≥ H(Xs | X_B)   („information never hurts")

Hence Δ(s | A) = F(A ∪ {s}) − F(A) is monotonically nonincreasing in A, so F is submodular.

Page 22: Tutorial on Optimization with  Submodular  Functions

22

Example: Influence in social networks [Kempe, Kleinberg, Tardos KDD '03]

Who should get free cell phones?
V = {Alice, Bob, Charlie, Dorothy, Eric, Fiona}
F(A) = expected number of people influenced when targeting A

[Figure: social network over V with the probability of influencing (0.2 to 0.5) attached to each edge.]

Page 23: Tutorial on Optimization with  Submodular  Functions

23

Influence in social networks is submodular [Kempe, Kleinberg, Tardos KDD '03]

Key idea: Flip the coins c for all edges in advance, yielding a set of "live" edges.
F_c(A) = number of people influenced under outcome c (a set cover function!)
F(A) = Σ_c P(c) F_c(A) is submodular as well!
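A rough Monte-Carlo sketch of that live-edge construction; the toy graph and edge probabilities below are purely illustrative:

import random

p = {("Alice", "Bob"): 0.5, ("Bob", "Charlie"): 0.3,     # influence probability per directed edge
     ("Alice", "Dorothy"): 0.4, ("Dorothy", "Eric"): 0.2}

def reached(A, live_edges):
    """Number of people influenced from seed set A over the sampled 'live' edges (set cover!)."""
    seen, frontier = set(A), list(A)
    while frontier:
        u = frontier.pop()
        for (a, b) in live_edges:
            if a == u and b not in seen:
                seen.add(b)
                frontier.append(b)
    return len(seen)

def expected_influence(A, samples=2000):
    """Estimate F(A) = sum_c P(c) F_c(A) by sampling coin flips c for all edges in advance."""
    total = 0
    for _ in range(samples):
        live = {e for e, prob in p.items() if random.random() < prob}
        total += reached(A, live)
    return total / samples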

Page 24: Tutorial on Optimization with  Submodular  Functions

24

Closedness properties
F1, …, Fm submodular functions on V and λ1, …, λm > 0.
Then F(A) = Σ_i λ_i F_i(A) is submodular!
Submodularity is closed under nonnegative linear combinations.

Extremely useful fact!!
F(A, θ) submodular for each θ ⇒ Σ_θ P(θ) F(A, θ) submodular (expectations preserve submodularity)
Multicriterion optimization: F1, …, Fm submodular, λ_i > 0 ⇒ Σ_i λ_i F_i(A) submodular

Page 25: Tutorial on Optimization with  Submodular  Functions

25

Submodularity and concavity
Suppose g: N → R and F(A) = g(|A|).
Then F(A) is submodular if and only if g is concave!
E.g., g could say "buying in bulk is cheaper".

[Figure: a concave function g(|A|) of the set size |A|.]

Page 26: Tutorial on Optimization with  Submodular  Functions

26

Maximum of submodular functions
Suppose F1(A) and F2(A) are submodular. Is F(A) = max(F1(A), F2(A)) submodular?

[Figure: F1 and F2 plotted against |A|; their pointwise maximum has a non-concave kink.]

max(F1, F2) is not submodular in general!

Page 27: Tutorial on Optimization with  Submodular  Functions

27

Minimum of submodular functions
Well, maybe F(A) = min(F1(A), F2(A)) instead?

A       F1(A)  F2(A)  F(A)
{}      0      0      0
{a}     1      0      0
{b}     0      1      0
{a,b}   1      1      1

Here F({b}) − F({}) = 0 < 1 = F({a,b}) − F({a}), violating diminishing returns.

min(F1, F2) is not submodular in general!
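Counterexamples like this are easy to verify with a brute-force check of the diminishing-returns condition on a small ground set; a short sketch (names illustrative):

from itertools import combinations

def is_submodular(F, V, tol=1e-12):
    """Check F(A u {s}) - F(A) >= F(B u {s}) - F(B) for all A <= B and s not in B."""
    V = list(V)
    subsets = [set(c) for r in range(len(V) + 1) for c in combinations(V, r)]
    for A in subsets:
        for B in subsets:
            if A <= B:
                for s in V:
                    if s not in B and F(A | {s}) - F(A) < F(B | {s}) - F(B) - tol:
                        return False
    return True

# The function F = min(F1, F2) from the table above fails this check on V = {'a', 'b'}.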

Page 28: Tutorial on Optimization with  Submodular  Functions

28

Submodularity and convexity

Key result [Lovász '83]: Every submodular function F induces a convex function g on R^n_+ (its Lovász extension) such that:
F(A) = g(w_A) for all A ⊆ V (w_A the indicator vector of A)
min_A F(A) = min_w g(w) s.t. w ∈ [0,1]^n, and the minimum is attained at a corner!
g is efficient to evaluate

Can use the ellipsoid algorithm to minimize submodular functions!

[Figure: F: 2^V → R submodular on V = {a, b}; the extension g: R^|V| → R satisfies g(1,0) = F({a}) and g(1,1) = F({a,b}).]
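The reason g is cheap to evaluate is the greedy/sorting formula for the Lovász extension; a small Python sketch under the convention F({}) = 0 (names illustrative):

def lovasz_extension(F, w):
    """Evaluate g(w): sort elements by decreasing weight and charge each element
    its marginal value along the resulting chain of sets."""
    order = sorted(w, key=w.get, reverse=True)
    g, S, prev_F = 0.0, set(), 0.0
    for s in order:
        S.add(s)
        curr_F = F(S)
        g += w[s] * (curr_F - prev_F)
        prev_F = curr_F
    return g

# For the indicator vector w_A of a set A, lovasz_extension(F, w_A) returns F(A).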

Page 29: Tutorial on Optimization with  Submodular  Functions

Minimization ofsubmodular functions

Page 30: Tutorial on Optimization with  Submodular  Functions

30

Minimizing submodular functions

Theorem [Iwata & Orlin (SODA 2009)]: There is a fully combinatorial, strongly polynomial algorithm for minimizing submodular functions that runs in time O(n^6 log n).

Polynomial-time = Practical ??

Is there a more practical alternative?

Page 31: Tutorial on Optimization with  Submodular  Functions

31

The submodular polyhedron P_F
P_F = {x ∈ R^n : x(A) ≤ F(A) for all A ⊆ V}, where x(A) = Σ_{s ∈ A} x(s)

Example: V = {a, b} with

A       F(A)
{}      0
{a}     -1
{b}     2
{a,b}   0

P_F is the set of x with x({a}) ≤ F({a}), x({b}) ≤ F({b}), x({a,b}) ≤ F({a,b}).

[Figure: P_F drawn in the (x_{a}, x_{b}) plane.]

Page 32: Tutorial on Optimization with  Submodular  Functions

32

A more practical alternative? [Fujishige '91, Fujishige et al '11]

Example: V = {a, b} with F({}) = 0, F({a}) = −1, F({b}) = 2, F({a,b}) = 0.
Base polytope: B_F = P_F ∩ {x : x({a,b}) = F({a,b})}

Minimum norm algorithm:
1. Find x* = argmin ||x||_2 s.t. x ∈ B_F   (here x* = [-1, 1])
2. Return A* = {i : x*(i) < 0}   (here A* = {a})

Theorem [Fujishige '91]: A* is an optimal solution!
Note: Can solve step 1 using Wolfe's algorithm.
Runtime finite but unknown!!

[Figure: base polytope and minimum norm point x* in the (x_{a}, x_{b}) plane.]

Page 33: Tutorial on Optimization with  Submodular  Functions

33

Empirical comparison [Fujishige et al ’06]

Minimum norm algorithm orders of magnitude faster! Our implementation can solve n = 10k in < 6 minutes!

[Figure: running time in seconds (log scale, lower is better) vs. problem size (log scale, 64 to 1024) on cut functions from the DIMACS Challenge; the minimum norm algorithm is fastest.]

Page 34: Tutorial on Optimization with  Submodular  Functions

34

Example: Image denoising

Page 35: Tutorial on Optimization with  Submodular  Functions

35

Example: Image denoising

[Figure: pairwise Markov random field on a pixel grid; Xi: noisy pixels, Yi: "true" pixels.]

Pairwise Markov Random Field:
P(x1, …, xn, y1, …, yn) ∝ Π_{i,j} ψ_{i,j}(y_i, y_j) Π_i φ_i(x_i, y_i)

Want argmax_y P(y | x) = argmax_y log P(x, y) = argmin_y Σ_{i,j} E_{i,j}(y_i, y_j) + Σ_i E_i(y_i),
where E_{i,j}(y_i, y_j) = −log ψ_{i,j}(y_i, y_j)

When is this MAP inference efficiently solvable (in high-treewidth graphical models)?

Page 36: Tutorial on Optimization with  Submodular  Functions

36

MAP inference in Markov Random Fields

Energy E(y) = Σ_{i,j} E_{i,j}(y_i, y_j) + Σ_i E_i(y_i)

Suppose the y_i are binary; define F(A) = E(y^A), where y^A_i = 1 iff i ∈ A.
Then min_y E(y) = min_A F(A).

Theorem [c.f., Hammer '65; Kolmogorov et al, PAMI '04]: The MAP inference problem is solvable by graph cuts ⟺ for all i, j: E_{i,j}(0,0) + E_{i,j}(1,1) ≤ E_{i,j}(0,1) + E_{i,j}(1,0) ⟺ each E_{i,j} is submodular.
"Efficient if we prefer that neighboring pixels have the same color."

Page 37: Tutorial on Optimization with  Submodular  Functions

37

Constrained minimization
Have seen: if F is submodular on V, we can solve A* = argmin_A F(A).
What about A* = argmin F(A) s.t. A ∈ C for some constraint set C (e.g., balanced cut, …)?

In general, not much is known about constrained minimization. However, one can solve:
A* = argmin F(A) s.t. 0 < |A| < n
A* = argmin F(A) s.t. |A| is odd/even [Goemans & Ramakrishnan '95]
A* = argmin F(A) s.t. A ∈ argmin G(A) for G submodular [Fujishige '91]

Some recent lower bounds by Svitkina & Fleischer '08.

Page 38: Tutorial on Optimization with  Submodular  Functions

State of the art in minimization
Faster algorithms for special cases via modern results for convex minimization [Stobbe & Krause NIPS '10]

Approximation algorithms for constrained minimization: Cooperative cuts [Jegelka & Bilmes CVPR ’11]

Fast approximate minimization via iterative reduction to graph cuts [Jegelka & Bilmes NIPS ‘11]

Many open questions in principled, efficient, large-scale submodular minimization!

38

Page 39: Tutorial on Optimization with  Submodular  Functions

Maximizing submodular functions

Page 40: Tutorial on Optimization with  Submodular  Functions

40

Maximizing submodular functions

Minimizing convex functions: polynomial-time solvable!
Minimizing submodular functions: polynomial-time solvable!
Maximizing convex functions: NP-hard!
Maximizing submodular functions: NP-hard!

But can get approximation guarantees

Page 41: Tutorial on Optimization with  Submodular  Functions

41

Maximizing non-monotonic functions
Suppose we want A* = argmax_A F(A) for a non-monotonic F.

Example: F(A) = U(A) − C(A), where U(A) is a submodular utility and C(A) is a supermodular cost function.
E.g.: trading off utility and privacy in personalized search [Krause & Horvitz AAAI '08]

In general: NP-hard. Moreover, if F(A) can take negative values, it is as hard to approximate as maximum independent set (i.e., NP-hard to get an O(n^{1-ε}) approximation).

[Figure: F(A) as a function of |A|, with an interior maximum.]

Page 42: Tutorial on Optimization with  Submodular  Functions

42

Theorem [Feige, Mirrokni, Vondrak FOCS '07]: There is an efficient randomized local search procedure that, given a nonnegative submodular function F with F({}) = 0, returns a set A_LS such that F(A_LS) ≥ (2/5) max_A F(A).

Moreover:
Picking a random set gives a ¼ approximation (½ approximation if F is symmetric!)
Current best approximation [Chekuri, Vondrak, Zenklusen]: ~0.41 via simulated annealing
We cannot get better than a ½ approximation unless P = NP


Page 43: Tutorial on Optimization with  Submodular  Functions

43

Scalarization vs. constrained maximization
Given monotonic utility F(A) and cost C(A), optimize either:

Option 1 ("scalarization"): max_A F(A) − C(A)
Can get a 2/5 approximation… if F(A) − C(A) ≥ 0 for all sets A. Positiveness is a strong requirement!

Option 2 ("constrained maximization"): max_A F(A) s.t. C(A) ≤ B
Greedy algorithm gets a (1 − 1/e) ≈ 0.63 approximation. Many more results!
Focus for the rest of today.

Page 44: Tutorial on Optimization with  Submodular  Functions

44

Battle of the Water Sensor Networks Competition [with Leskovec, Guestrin, VanBriesen, Faloutsos, J Wat Res Mgt 2008]

Real metropolitan area network (12,527 nodes)
Water flow simulator provided by EPA
3.6 million contamination events
Multiple objectives: detection time, affected population, …
Place sensors that detect well "on average"

Page 45: Tutorial on Optimization with  Submodular  Functions

45

Reward function is submodular
Claim: The reward function is monotonic submodular.

Consider event i:
R_i(u_k) = benefit from sensor u_k in event i
R_i(A) = max_{u_k ∈ A} R_i(u_k)
⇒ R_i is submodular

Overall objective: R(A) = Σ_i Prob(i) R_i(A)
⇒ R is submodular

Can use the greedy algorithm to solve max_{|A| ≤ k} R(A)!

Page 46: Tutorial on Optimization with  Submodular  Functions

46

Bounds on optimal solution [Krause et al., J Wat Res Mgt '08]

(1 − 1/e) bound quite loose… can we get better bounds?

[Figure: population protected F(A) (higher is better) vs. number of sensors placed (0 to 20), water networks data; the offline (Nemhauser) bound lies well above the greedy solution.]

Page 47: Tutorial on Optimization with  Submodular  Functions

47

Data-dependent bounds [Minoux '78]

Suppose A is a candidate solution to argmax F(A) s.t. |A| ≤ k, and let A* = {s1, …, sk} be an optimal solution. Then:
F(A*) ≤ F(A ∪ A*)
= F(A) + Σ_i [F(A ∪ {s1, …, si}) − F(A ∪ {s1, …, si−1})]
≤ F(A) + Σ_i [F(A ∪ {si}) − F(A)]
= F(A) + Σ_i Δ_{si}

For each s ∈ V \ A, let Δ_s = F(A ∪ {s}) − F(A), and order so that Δ_1 ≥ Δ_2 ≥ … ≥ Δ_n.
Then: F(A*) ≤ F(A) + Σ_{i=1}^{k} Δ_i
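Once F is available as a black box, this bound is a one-liner to compute; a small Python sketch (names illustrative):

def data_dependent_bound(F, V, A, k):
    """Minoux-style upper bound on max_{|B| <= k} F(B), computed from any candidate set A:
    F(A) plus the sum of the k largest single-element marginal gains."""
    deltas = sorted((F(A | {s}) - F(A) for s in V if s not in A), reverse=True)
    return F(A) + sum(deltas[:k])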


Page 48: Tutorial on Optimization with  Submodular  Functions

48

Bounds on optimal solution [Krause et al., J Wat Res Mgt '08]

Submodularity gives data-dependent bounds on the performance of any algorithm.

[Figure: sensing quality F(A) (higher is better) vs. number of sensors placed (0 to 20), water networks data; the data-dependent bound is much tighter than the offline (Nemhauser) bound and lies close to the greedy solution.]

Page 49: Tutorial on Optimization with  Submodular  Functions

49

BWSN Competition results
13 participants; performance measured on 30 different criteria.

[Figure: total score (higher is better) by team: the greedy approach, Berry et al., Dorini et al., Wu & Walski, Ostfeld & Salomons, Propato & Piller, Eliades & Polycarpou, Huang et al., Guan et al., Ghimire & Barkdoll, Trachtman, Gueli, Preis & Ostfeld. Method legend: G: genetic algorithm, H: other heuristic, D: domain knowledge, E: "exact" method (MIP).]

24% better performance than runner-up!

Page 50: Tutorial on Optimization with  Submodular  Functions

50

What was the trick?

Simulated all 3.6M contamination events on 2 weeks / 40 processors; 152 GB of data on disk, 16 GB in main memory (compressed).
Very accurate computation of F(A), but very slow evaluation of F(A): 30 hours per 20-sensor placement; 6 weeks for all 30 settings.

[Figure: running time in minutes (lower is better) vs. number of sensors selected (1 to 10); exhaustive search (all subsets) vs. naive greedy.]

Submodularity to the rescue!

Page 51: Tutorial on Optimization with  Submodular  Functions

51

Scaling up the greedy algorithm [Minoux '78]
In round i+1, having picked A_i = {s1, …, si}, pick
s_{i+1} = argmax_s F(A_i ∪ {s}) − F(A_i)
I.e., maximize the "marginal benefit" Δ(s | A_i) = F(A_i ∪ {s}) − F(A_i).

Key observation: Submodularity implies that for i ≤ j,
Δ(s | A_i) ≥ Δ(s | A_j)
Marginal benefits can never increase!

Page 52: Tutorial on Optimization with  Submodular  Functions

52

"Lazy" greedy algorithm [Minoux '78]

Lazy greedy algorithm:
First iteration as usual.
Keep an ordered list of the marginal benefits Δ_i from the previous iteration.
Re-evaluate Δ_i only for the top element.
If Δ_i stays on top, use it; otherwise re-sort.

[Figure: priority list of elements a to e ordered by benefit Δ(s | A); only the top entry is re-evaluated in each step.]

Note: Very easy to compute online bounds, lazy evaluations, etc. [Leskovec, Krause et al. '07]
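A sketch of the lazy greedy idea with a max-heap of (possibly stale) marginal-benefit bounds; it follows the description above, with illustrative names:

import heapq

def lazy_greedy(F, V, k):
    A, f_A = set(), F(set())
    heap = [(-(F({s}) - f_A), s) for s in V]     # negated gains: heapq is a min-heap
    heapq.heapify(heap)
    for _ in range(min(k, len(V))):
        while True:
            _, s = heapq.heappop(heap)
            gain = F(A | {s}) - f_A               # re-evaluate only the top element
            if not heap or gain >= -heap[0][0]:   # still on top: take it
                A.add(s)
                f_A += gain
                break
            heapq.heappush(heap, (-gain, s))      # stale bound: push back and re-sort
    return A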


Page 53: Tutorial on Optimization with  Submodular  Functions

53

Same setup as before: all 3.6M contamination events simulated on 2 weeks / 40 processors; 152 GB of data on disk, 16 GB in main memory (compressed); very accurate but very slow evaluation of F(A) (30 hours per 20-sensor placement, 6 weeks for all 30 settings).

Submodularity to the rescue: using "lazy evaluations", 1 hour per 20 sensors; done after 2 days!

[Figure: running time in minutes (lower is better) vs. number of sensors selected; the fast (lazy) greedy curve lies far below exhaustive search (all subsets) and naive greedy.]

Page 54: Tutorial on Optimization with  Submodular  Functions

54

Non-constant cost functions
For each s ∈ V, let c(s) > 0 be its cost (e.g., feature acquisition costs, …).
Cost of a set: C(A) = Σ_{s ∈ A} c(s). Want to solve

A* = argmax F(A) s.t. C(A) ≤ B

Cost-benefit greedy algorithm:
Start with A := {};
While there is an s ∈ V \ A s.t. C(A ∪ {s}) ≤ B:
  s* := argmax_s [F(A ∪ {s}) − F(A)] / c(s)
  A := A ∪ {s*}
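A sketch of this cost-benefit rule in Python (for the guarantee on the next slide one would also run the unit-cost greedy and keep the better of the two solutions); names illustrative:

def cost_benefit_greedy(F, c, V, B):
    """Greedily add the affordable element with the best marginal-gain-per-cost ratio."""
    A, spent = set(), 0.0
    while True:
        affordable = [s for s in V if s not in A and spent + c[s] <= B]
        if not affordable:
            return A
        s_star = max(affordable, key=lambda s: (F(A | {s}) - F(A)) / c[s])
        A.add(s_star)
        spent += c[s_star]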


Page 55: Tutorial on Optimization with  Submodular  Functions

55

Performance of cost-benefit greedy
Want max_A F(A) s.t. C(A) ≤ 1, with two items:

Set A   F(A)   C(A)
{a}     2ε     ε
{b}     1      1

Cost-benefit greedy picks a (best benefit/cost ratio). Then it cannot afford b!
As ε → 0, cost-benefit greedy performs arbitrarily badly!

Page 56: Tutorial on Optimization with  Submodular  Functions

56

Cost-benefit optimization [Wolsey '82, Sviridenko '04, Krause et al '05]

Theorem [Krause and Guestrin '05]: Let A_CB be the cost-benefit greedy solution and A_UC the unit-cost greedy solution (i.e., ignoring costs). Then
max { F(A_CB), F(A_UC) } ≥ (1 − 1/√e) OPT   (~39%)

Can still compute online bounds and speed up using lazy evaluations.

Note: Can also get
a (1 − 1/e) approximation in time O(n^4) [Sviridenko '04]
a (1 − 1/e) approximation for multiple linear constraints [Kulik '09]
a 0.38/k approximation for k matroid and m linear constraints [Chekuri et al '11]


Page 57: Tutorial on Optimization with  Submodular  Functions

57

Informative path planning
So far: max F(A) s.t. |A| ≤ k.
But the most informative locations might be far apart, and the robot needs to travel between selected locations!

Locations V are nodes in a graph; C(A) = cost of the cheapest path connecting the nodes A.
Want: max F(A) s.t. C(A) ≤ B.

Can exploit submodularity to get near-optimal paths; outperforms existing heuristics.

[Figure: graph of candidate sensing locations with travel costs on the edges.]

Page 58: Tutorial on Optimization with  Submodular  Functions

58

Question

I have 10 minutes. Which blogs should I read to be most up to date? [Leskovec-Krause et al. '07]

184M blogs; 77% read blogs [Universal McCann, '08]

Page 59: Tutorial on Optimization with  Submodular  Functions

59

Cascades in the Blogosphere
[with Leskovec, Guestrin, Faloutsos, VanBriesen, Glance KDD 07 – Best Paper]

[Figure: an information cascade spreading through blog posts over time; readers further down the cascade learn about the story after us!]

Which blogs should we read to learn about big cascades early?

Page 60: Tutorial on Optimization with  Submodular  Functions

60

Water vs. Web

In both problems we are given:
a graph with nodes (junctions / blogs) and edges (pipes / links)
cascades spreading dynamically over the graph (contamination / citations)

Want to pick nodes to detect big cascades early.

Placing sensors in water networks vs. selecting informative blogs: in both applications the utility functions are submodular [generalizes Kempe, Kleinberg, Tardos, KDD '03].

Page 61: Tutorial on Optimization with  Submodular  Functions

61

Performance on blog selection (~45k blogs)

Outperforms state-of-the-art heuristics; 700x speedup using submodularity!

[Figure, left: cascades captured (higher is better) vs. number of blogs selected (0 to 100); greedy beats the in-links, all out-links, # posts, and random baselines. Figure, right: running time in seconds (lower is better) vs. number of blogs selected (1 to 10); fast (lazy) greedy lies far below exhaustive search (all subsets) and naive greedy.]

Page 62: Tutorial on Optimization with  Submodular  Functions

Taking "attention" into account

Naïve approach: Just pick the 10 best blogs. This selects big, well-known blogs (Instapundit, etc.), which contain many posts and take long to read!

[Figure: cascades captured vs. number of posts (time) allowed; cost/benefit analysis clearly beats ignoring cost.]

Cost-benefit optimization picks "summarizer" blogs!

Page 63: Tutorial on Optimization with  Submodular  Functions

Generalizations

Page 64: Tutorial on Optimization with  Submodular  Functions

Learning to optimize submodular functions

Online submodular optimization:
Learn to pick a sequence of sets to maximize a sequence of (unknown) submodular functions.
Application: building algorithm portfolios.

Adaptive submodular optimization:
Gradually build up a set, taking feedback into account.
Application: sequential experimental design.

64

Page 65: Tutorial on Optimization with  Submodular  Functions

65

Predicting the "hot" blogs

Want blogs that will be informative in the future. Split the data set: train on the historic part, test on the future part. Let's see what goes wrong here.

[Figure: cascades captured vs. number of posts (time) allowed; "cheating" (greedy trained on the future, tested on the future) vs. greedy trained on historic data and tested on the future. Number of detections per month (Jan to May) for Greedy vs. Saturate: greedy detects well on the training period but poorly afterwards.]

Poor generalization! Why's that? Blog selection "overfits" to the training data. We want blogs that continue to do well!

Page 66: Tutorial on Optimization with  Submodular  Functions

66

Online optimization

Fi(A) = detections in interval i. The "overfit" blog selection A scores, e.g., F1(A) = .5, F2(A) = .8, F3(A) = .6 on the training intervals, but only F4(A) = .01 and F5(A) = .02 afterwards.

[Figure: number of detections per month (Jan to May) for Greedy vs. Saturate, as on the previous slide.]

Page 67: Tutorial on Optimization with  Submodular  Functions

67

Online maximization of submodular functions [Streeter, Golovin NIPS '08]

Protocol: at each time step t we pick a set A_t; then the submodular function F_t is revealed (or only the value F_t(A_t)), and we receive reward r_t = F_t(A_t). Goal: maximize the total reward Σ_t r_t.

Theorem: Can efficiently choose A1, …, A_T such that, in expectation,
(1/T) Σ_t F_t(A_t) ≥ (1 − 1/e) max_{|A| ≤ k} (1/T) Σ_t F_t(A) − o(1)
for any sequence F_i, as T → ∞.

Can get "no-regret" against the "cheating" greedy algorithm.

Page 68: Tutorial on Optimization with  Submodular  Functions

Online Greedy Algorithm [Streeter & Golovin, NIPS '08]

Replace each stage of the greedy algorithm with an expert / multi-armed bandit algorithm: stage j selects element a_j, so the algorithm picks {a1, a2, a3, …, ak}.

Feedback to stage j for its action a_j (in the scheduling application below, an action is a pair (h, t)) is the marginal gain f({a1, a2, …, a_{j−1}, a_j}) − f({a1, a2, …, a_{j−1}}).
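A rough full-information sketch of this construction, using a Hedge-style multiplicative-weights learner in each of the k slots; the step size and reward scaling are illustrative assumptions, not the paper's exact algorithm:

import math
import random

def online_greedy(V, k, rounds, get_F, eta=0.1):
    V = list(V)
    weights = [[1.0] * len(V) for _ in range(k)]          # one expert algorithm per greedy slot
    history = []
    for t in range(rounds):
        F_t = get_F(t)                                     # this round's submodular function
        chosen = []
        for i in range(k):                                 # each slot samples an element ~ its weights
            total = sum(weights[i])
            r, acc, idx = random.random() * total, 0.0, 0
            for j, w in enumerate(weights[i]):
                acc += w
                if acc >= r:
                    idx = j
                    break
            chosen.append(idx)
        history.append({V[j] for j in chosen})
        prefix = set()                                     # reward each slot by marginal gains
        for i in range(k):
            base = F_t(prefix)
            for j, s in enumerate(V):
                weights[i][j] *= math.exp(eta * (F_t(prefix | {s}) - base))
            prefix.add(V[chosen[i]])
    return history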

Page 69: Tutorial on Optimization with  Submodular  Functions

69

Results on blogs[Streeter, Golovin, Krause NIPS ‘09]

Performance of online algorithm converges quicklyto “cheating” offline greedy algorithm!

Can learn to rank by modeling position dependent effects

0 100 200 3000

0.2

0.4

0.6

0.8

1

Time (days)

Avg.

nor

mal

ized

perfo

rman

ceT=47

Page 70: Tutorial on Optimization with  Submodular  Functions

Hybridizing Solvers [Streeter, Golovin, Smith]

• Hybridization via task-switching schedules
• Need to learn which schedules work well!

[Figure: example schedule S that switches among heuristics #1 to #4 in time slices of 2, 4, 6, 2, and 10 seconds.]

Page 71: Tutorial on Optimization with  Submodular  Functions

Hybridizing Solvers [Streeter & Golovin '08]

• Actions A = {run heuristic h for t seconds : h ∈ H, t > 0}
• Instances: f_i(S) = Pr[schedule S completes job i within the time limit]. This is a submodular function!
• Task: Select schedules {S_i : i = 1, 2, …, T} online to maximize the number of instances solved
• The online greedy algorithm achieves no-(1 − 1/e)-regret

Page 72: Tutorial on Optimization with  Submodular  Functions

[Figure: empirical solver-hybridization results (both axes on a log scale). Streeter, Golovin, Smith, AAAI '07 & CSP '08]

Page 73: Tutorial on Optimization with  Submodular  Functions


Page 74: Tutorial on Optimization with  Submodular  Functions

Learning to optimize submodular functions

Online submodular optimization:
Learn to pick a sequence of sets to maximize a sequence of (unknown) submodular functions.
Application: building algorithm portfolios.

Adaptive submodular optimization:
Gradually build up a set, taking feedback into account.
Application: sequential experimental design.

74

Page 75: Tutorial on Optimization with  Submodular  Functions

75

Adaptive selection

[Figure: a diagnosis policy as a decision tree: run test x1 and, depending on its outcome (0/1), run x2 or x3, and so on.]

Want to effectively diagnose while minimizing the cost of testing!
Interested in a policy (decision tree), not a set, so we can't use submodular set functions!

Can we generalize submodularity for sequential decision making?

Page 76: Tutorial on Optimization with  Submodular  Functions

Problem Statement
Given:
Items (tests, experiments, actions, …) V = {1, …, n}
Associated random variables X1, …, Xn taking values in O
Objective: f(A, x_V) = benefit of having selected A ⊆ V if the world is in state x_V

Value of a policy π: F(π) = E_{x_V}[ f(E(π, x_V), x_V) ], where E(π, x_V) is the set of tests run by π if the world is in state x_V

Want: π* = argmax_π F(π), subject to a bound on the number (or cost) of items π selects

NP-hard (also hard to approximate!)

Page 77: Tutorial on Optimization with  Submodular  Functions

Adaptive greedy algorithm
Suppose we've seen X_A = x_A. The conditional expected benefit of adding item s is
Δ(s | x_A) = E[ f(A ∪ {s}, X_V) − f(A, X_V) | X_A = x_A ]
(the expected benefit if the world is in state x_V, conditioned on the observations so far).

Adaptive greedy algorithm:
Start with A = {}.
For i = 1:k
  Pick s* = argmax_s Δ(s | x_A)
  Observe X_{s*} = x_{s*}
  Set A = A ∪ {s*}

When does this adaptive greedy algorithm work??
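A skeleton of the loop just described in Python; delta and observe are placeholders for a model-specific conditional-benefit computation and the real feedback source:

def adaptive_greedy(items, k, delta, observe):
    """delta(s, obs) -> conditional expected benefit of item s given observations so far;
    observe(s)      -> the realized outcome x_s of running item s."""
    observations = {}
    for _ in range(k):
        remaining = [s for s in items if s not in observations]
        if not remaining:
            break
        s_star = max(remaining, key=lambda s: delta(s, observations))
        observations[s_star] = observe(s_star)    # the feedback drives the next choice
    return observations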

Page 78: Tutorial on Optimization with  Submodular  Functions

Adaptive submodularity [Golovin & Krause, JAIR 2011]

f is called adaptive submodular iff
Δ(s | x_A) ≥ Δ(s | x_B) whenever x_B observes more than x_A (i.e., x_A is a sub-realization of x_B).

f is called adaptive monotone iff Δ(s | x_A) ≥ 0 for all observations x_A.

Theorem: If f is adaptive submodular and adaptive monotone w.r.t. the distribution P, then
F(π_greedy) ≥ (1 − 1/e) F(π_opt)

Many other results about submodular set functions can also be "lifted" to the adaptive setting!

Page 79: Tutorial on Optimization with  Submodular  Functions

79

Application: Adaptive Viral Marketing

Adaptively select promotion targets; after each selection, see which people are influenced.

[Figure: social network over Alice, Bob, Charlie, Daria, Eric, Fiona with the probability of influencing (0.2 to 0.5) on each edge.]

Page 80: Tutorial on Optimization with  Submodular  Functions

80

Application: Adaptive Viral Marketing

Theorem: The objective is adaptive submodular!

Hence, the adaptive greedy choice is a (1 − 1/e) ≈ 63% approximation to the optimal adaptive solution.

[Figure: the same social network as on the previous slide.]

Page 81: Tutorial on Optimization with  Submodular  Functions

Stochastic Minimum Cost Cover

Suppose we wish to get 25% market penetration at minimum cost.

Theorem [Golovin & Krause, '11]: If F is adaptive submodular and adaptive monotone, then
Cost(π_greedy) ≤ O(log n) Cost(π_opt)

[Figure: the same social network as before.]

Page 82: Tutorial on Optimization with  Submodular  Functions

Optimal decision trees
Prior over diseases P(Y); likelihood of test outcomes P(X_V | Y).
Suppose that P(X_V | Y) is deterministic (noise free): each test eliminates hypotheses y.
How should we test to eliminate all incorrect hypotheses?

"Generalized binary search", equivalent to maximizing information gain.

[Figure: naive Bayes model with Y = "sick" and tests X1 = "fever", X2 = "rash", X3 = "cough"; a decision tree branching on the test outcomes.]

Page 83: Tutorial on Optimization with  Submodular  Functions

ODT is Adaptive Submodular

83

Objective = probability mass of hypotheses you have ruled out.

[Figure: running test s splits the remaining hypotheses by its outcome (0 or 1); subsequent tests v and w split them further.]

Not hard to show that this objective is adaptive monotone and adaptive submodular.

Page 84: Tutorial on Optimization with  Submodular  Functions

Guarantees for optimal decision trees

With an adaptive submodular analysis, one recovers the known approximation guarantees for greedy (generalized binary search) policies (Garey & Graham, 1974; Loveland, 1985; Arkin et al., 1993; Kosaraju et al., 1999; Dasgupta, 2004; Guillory & Bilmes, 2009; Nowak, 2009; Gupta et al., 2010).

This result requires that tests are exact (no noise)!

[Figure: the diagnosis decision tree from before.]

Page 85: Tutorial on Optimization with  Submodular  Functions

Noisy Bayesian diagnosis [with Daniel Golovin, Deb Ray, NIPS '10]
In practice, observations typically are noisy. Results for the noise-free case do not generalize.
Key problem: tests no longer rule out hypotheses (they only make them less likely).
Intuitively, we want to gather enough information to make the right decision!

[Figure: naive Bayes model with Y = "sick" and noisy tests "fever", "rash", "cough".]

Page 86: Tutorial on Optimization with  Submodular  Functions

Noisy Bayesian diagnosis
Suppose I run all tests, i.e., see X_V = x_V. The best I can do is to choose the action a* maximizing expected utility given x_V.
How should I cheaply test to guarantee that I choose a*?

Existing approaches:
Generalized binary search?
Maximize information gain?
Maximize value of information?
These are not adaptive submodular in the noisy setting!

Theorem: All these approaches can have cost more than n / log n times the optimal cost!

Page 87: Tutorial on Optimization with  Submodular  Functions

A new criterion for nonmyopic VOI
Strategy: Replace the noisy problem with a noiseless problem.
Key idea: Make the test outcomes part of the hypothesis.

[Figure: each hypothesis is a full outcome vector of the tests, e.g., [0,0,0], [0,1,0], [1,0,0], [1,0,1], [1,1,0], …]

Page 88: Tutorial on Optimization with  Submodular  Functions

A new criterion for nonmyopic VOI
Only need to distinguish between noisy hypotheses that lead to different decisions!

Draw an edge between every pair of hypotheses that lead to different decisions; the weight of an edge is the product of the hypotheses' probabilities. The objective F(A) is the total mass of edges cut when observing X_A = x_A (an observed test outcome rules out all hypotheses inconsistent with it, cutting their edges).

[Figure: hypotheses [0,0,0], [1,0,0], [0,1,0], [1,0,1], [1,1,1] grouped into equivalence classes, with weighted edges between classes.]

Page 89: Tutorial on Optimization with  Submodular  Functions

Theoretical guarantees [with Daniel Golovin, Deb Ray, NIPS '10]

Theorem: Equivalence class edge-cutting (EC2) is adaptive submodular and adaptive monotone. If every hypothesis has prior probability at least p_min, the greedy policy's cost is within an O(log(1/p_min)) factor of the optimal policy's cost.

These are the first approximation guarantees for nonmyopic VOI in general graphical models!

89

Page 90: Tutorial on Optimization with  Submodular  Functions

Example: The Iowa Gambling Task [with Colin Camerer, Deb Ray]

Various competing theories on how people make decisions under uncertainty:
Maximize expected utility? [von Neumann & Morgenstern '47]
Constant relative risk aversion? [Pratt '64]
Portfolio optimization? [Hanoch & Levy '70]
(Normalized) prospect theory? [Kahnemann & Tversky '79]

How should we design tests to distinguish theories?

What would you prefer?
[Figure: two gambles A and B over the outcomes −$10, $0, +$10, with probabilities .3 and .7 swapped between them.]

Page 91: Tutorial on Optimization with  Submodular  Functions

Iowa Gambling as Bayesian experimental design
Every possible test X_s = (g_{s,1}, g_{s,2}) is a pair of gambles.
Theories are parameterized by θ; each theory predicts a utility U(g, y, θ) for every gamble.

[Figure: graphical model with theory Y and parameters θ generating the responses X1, …, Xn to the gamble pairs (g_{i,1}, g_{i,2}).]

Page 92: Tutorial on Optimization with  Submodular  Functions

Simulation Results

The adaptive submodular criterion (EC2) outperforms existing approaches.

Page 93: Tutorial on Optimization with  Submodular  Functions

Experimental Study [with Colin Camerer, Deb Ray]

Study with 57 naïve subjects.
Indication of heterogeneity in the population; unexpectedly, no support for CRRA.
Submodularity crucial for real-time performance: 32,000 designs evaluated in < 5 s, 8x faster through lazy evaluations!

Page 94: Tutorial on Optimization with  Submodular  Functions

94

Structure in learning problems

AI/ML, last 10 years: convexity (kernel machines, SVMs, GPs, MLE, …)
AI/ML, "next 10 years": submodularity and new structural properties

MATLAB Toolbox for optimizing submodular functions (JMLR 10)
Series of NIPS Workshops on Discrete Optimization in ML (DISCML), talks recorded on videolectures.net

Page 95: Tutorial on Optimization with  Submodular  Functions

Submodularity in ML / AI

95

Fast inference for high-order submodular MAP problems [NIPS '10]

Submodular dictionary selection for sparse representation [ICML '10]

Learning to solve zero-sum submodular games [IJCAI '11]

First convergence rates for GP optimization [ICML '10]

Page 96: Tutorial on Optimization with  Submodular  Functions

Conclusions
Discrete optimization is abundant in applications.
Fortunately, some of those problems have structure: submodularity.
Submodularity can be exploited to develop efficient, scalable algorithms with strong guarantees.
Can handle complex constraints.
Can learn to optimize (online, adaptive, …).

96