Tutorial on Optimization with Submodular Functions


Transcript of Tutorial on Optimization with Submodular Functions

Page 1: Tutorial on Optimization with  Submodular  Functions

Tutorial on Optimization with Submodular Functions

Andreas Krause

Page 2: Tutorial on Optimization with  Submodular  Functions

2

Acknowledgements
Partly based on material presented at IJCAI 2009 together with Carlos Guestrin (CMU)

MATLAB Toolbox available on http://las.ethz.ch/; many of the algorithms presented here are implemented there

Page 3: Tutorial on Optimization with  Submodular  Functions

Discrete Optimization
Many discrete optimization problems can be formulated as follows:

Given a finite set V, we wish to select a subset A ⊆ V (subject to some constraints) maximizing a utility F(A).

These problems are the focus of this tutorial.

3

Page 4: Tutorial on Optimization with  Submodular  Functions

Contamination of drinking water could affect millions of people

4

Example: Sensor Placement [with Leskovec, Guestrin, VanBriesen, Faloutsos, J Wat Res Mgt ‘08]

Place sensors to detect contaminations ("Battle of the Water Sensor Networks" competition)

Where should we place sensors to quickly detect contamination?

[Figure: water distribution network with a contamination event and placed sensors; water flow simulator from EPA; Hach sensor, ~$14K each.]

Page 5: Tutorial on Optimization with  Submodular  Functions

5

Robotic monitoring of rivers and lakes
[with Singh, Guestrin, Kaiser, Journal of AI Research '08]

Need to monitor large spatial phenomena: temperature, nutrient distribution, fluorescence, …
Can only make a limited number of measurements!
Predict at unobserved locations.
Use robotic sensors (NIMS, Kaiser et al., UCLA) to cover large areas.

[Figure: actual vs. predicted temperature (color) by depth and location across the lake.]

Where should we sense to get the most accurate predictions?

Page 6: Tutorial on Optimization with  Submodular  Functions

6

Example: Model-based sensing
Model predicts impact of contaminations:
For water networks: water flow simulator from EPA
For lake monitoring: statistical, data-driven models
For each subset A of the set V of all network junctions, compute the sensing quality F(A).

[Figure: the model predicts high-, medium-, and low-impact locations; a sensor reduces impact through early detection. A well-chosen placement A = {S1, S2, S3, S4} has high sensing quality F(A) = 0.9; a poor placement has low sensing quality F(A) = 0.01.]

Page 7: Tutorial on Optimization with  Submodular  Functions

7

Sensor placement
Given: finite set V of locations, sensing quality F
Want: A* ⊆ V such that A* = argmax F(A) subject to |A| ≤ k

NP-hard!

How well can this simple heuristic do?

Greedy algorithm:
Start with A = {}
For i = 1 to k:
  s* := argmax_s F(A ∪ {s})
  A := A ∪ {s*}
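A minimal Python sketch of this greedy rule, treating F as a black-box set function (function and variable names are illustrative, not taken from the MATLAB toolbox):

def greedy_maximize(F, V, k):
    """Repeatedly add the element with the largest marginal gain F(A u {s}) - F(A)."""
    A = set()
    for _ in range(k):
        best_s, best_gain = None, float("-inf")
        for s in V:
            if s in A:
                continue
            gain = F(A | {s}) - F(A)   # marginal benefit of s given A
            if gain > best_gain:
                best_s, best_gain = s, gain
        if best_s is None:             # fewer than k elements available
            break
        A.add(best_s)
    return A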


Page 8: Tutorial on Optimization with  Submodular  Functions

8

Performance of greedy algorithm

[Figure: population protected F(A) (higher is better) vs. number of sensors placed, on a small subset of the water networks data; the greedy curve nearly matches the optimal curve.]

Greedy score empirically close to optimal. Why?

Page 9: Tutorial on Optimization with  Submodular  Functions

9

Key property 1: Monotonicity

Placement A = {S1, S2}, placement B = {S1, S2, S3, S4}.

F is monotonic: A ⊆ B ⇒ F(A) ≤ F(B)
Adding sensors can only help.

Page 10: Tutorial on Optimization with  Submodular  Functions

10

Key property 2: Diminishing returns

Placement A = {S1, S2}, placement B = {S1, S2, S3, S4}. Consider adding a new sensor S'.
Adding S' to the small placement A helps a lot; adding S' to the large placement B doesn't help much.

Submodularity: F(A ∪ {S'}) − F(A) ≥ F(B ∪ {S'}) − F(B)
(large improvement vs. small improvement)

Page 11: Tutorial on Optimization with  Submodular  Functions

11

One reason submodularity is useful

Theorem [Nemhauser et al '78]: Suppose F is monotonic and submodular. Then the greedy algorithm gives a constant factor approximation:
F(A_greedy) ≥ (1 − 1/e) max_{|A| ≤ k} F(A)   (~63%)

Greedy algorithm gives a near-optimal solution! In general, this guarantee is the best possible unless P = NP!

Page 12: Tutorial on Optimization with  Submodular  Functions

In this tutorial you will:
See examples and properties of submodular functions
Learn how submodularity helps to find good solutions
Find out about current research directions

12

Page 13: Tutorial on Optimization with  Submodular  Functions

13

Set functions
Finite set V = {1, 2, …, n}
Function F: 2^V → R
Will always assume F({}) = 0 (w.l.o.g.)
Assume a black box that can evaluate F for any input A
Approximate (noisy) evaluation of F is OK

[Figure: set V of all network junctions; one placement with high sensing quality F(A) = 0.9, another with low sensing quality F(A) = 0.01.]

Page 14: Tutorial on Optimization with  Submodular  Functions

14

Submodular set functions
A set function F on V is called submodular if, for all A, B ⊆ V:
F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)

Equivalent diminishing returns characterization: for all A ⊆ B ⊆ V and s ∉ B,
F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B)
(adding s to the small set A gives a large improvement; adding s to the large set B gives a small improvement)

Page 15: Tutorial on Optimization with  Submodular  Functions

15

Submodularity and supermodularity
F is called supermodular if −F is submodular.
F is called modular if F is both sub- and supermodular. For modular ("additive") F:
F(A) = Σ_{i ∈ A} w(i) for some weights w(i)

Page 16: Tutorial on Optimization with  Submodular  Functions

16

Example: Set cover

Want to cover a floorplan with discs: place sensors in a building, where each node predicts values of positions within some radius. Possible locations V.

For A ⊆ V: F(A) = "area covered by sensors placed at A"

Formally: W a finite set, with a collection of n subsets S_i ⊆ W.
For A ⊆ V = {1, …, n} define F(A) = |∪_{i ∈ A} S_i|

[Figure: building floorplan (server, lab, kitchen, offices, …) with sensing discs.]

Page 17: Tutorial on Optimization with  Submodular  Functions

17

Set cover is submodular

A = {S1, S2}, B = {S1, S2, S3, S4}; consider adding a new disc S'.

F(A ∪ {S'}) − F(A) ≥ F(B ∪ {S'}) − F(B)

[Figure: the floorplan shown twice; the new disc S' adds a lot of area on top of A, but only a little on top of B.]

Page 18: Tutorial on Optimization with  Submodular  Functions

18

A better model for sensing

Joint probability distribution P(X1, …, Xn, Y1, …, Yn) = P(Y1, …, Yn) P(X1, …, Xn | Y1, …, Yn)
(prior × likelihood)

Ys: temperature at location s
Xs: sensor value at location s
Xs = Ys + noise

[Figure: graphical model with latent temperatures Y1, …, Y6 connected to each other and to noisy sensor values X1, …, X6.]

Page 19: Tutorial on Optimization with  Submodular  Functions

Information gain
Utility of having sensors at subset A of all locations:
F(A) = (uncertainty about temperature Y before sensing) − (uncertainty about temperature Y after sensing X_A)

[Figure: sensing at A = {1, 2, 3} gives high value F(A); sensing at A = {1, 4, 5} gives low value F(A).]

Page 20: Tutorial on Optimization with  Submodular  Functions

20

Example: Submodularity of information gain
Y1, …, Ym, X1, …, Xn discrete RVs
F(A) = I(Y; X_A) = H(Y) − H(Y | X_A)
F(A) is always monotonic
However, NOT always submodular

Theorem [Krause & Guestrin UAI '05]: If the Xi are all conditionally independent given Y, then F(A) is submodular!

[Figure: graphical model with Y1, Y2, Y3 and conditionally independent observations X1, …, X4.]
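A small numerical sketch of F(A) = I(Y; X_A) for a toy naive Bayes model (binary Y, conditionally independent binary tests); the probabilities below are invented purely for illustration:

import itertools
import numpy as np

p_y = np.array([0.5, 0.5])                              # prior P(Y)
p_x_given_y = [np.array([[0.9, 0.1],                    # P(X_i = x | Y = y), rows indexed by y
                         [0.2, 0.8]]) for _ in range(3)]

def info_gain(A):
    """F(A) = I(Y; X_A) = H(Y) - H(Y | X_A), enumerating outcomes of the tests in A."""
    H_Y = -np.sum(p_y * np.log2(p_y))
    H_Y_given_XA = 0.0
    for xs in itertools.product([0, 1], repeat=len(A)):
        joint = p_y.copy()                               # P(y, x_A) = P(y) * prod_i P(x_i | y)
        for i, x in zip(A, xs):
            joint = joint * p_x_given_y[i][:, x]
        p_xA = joint.sum()
        if p_xA > 0:
            post = joint / p_xA
            H_Y_given_XA -= p_xA * np.sum(post * np.log2(np.where(post > 0, post, 1.0)))
    return H_Y - H_Y_given_XA

# Monotone: info_gain([0]) <= info_gain([0, 1]); diminishing returns can be checked numerically.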

Page 21: Tutorial on Optimization with  Submodular  Functions

21

Proof: Submodularity of information gain
Y1, …, Ym, X1, …, Xn discrete RVs; F(A) = I(Y; X_A) = H(Y) − H(Y | X_A)

Since the variables X are independent given Y:
F(A ∪ {s}) − F(A) = H(Xs | X_A) − H(Xs | Y)

The second term is constant (independent of A). The first term is nonincreasing in A:
A ⊆ B ⇒ H(Xs | X_A) ≥ H(Xs | X_B)   („information never hurts")

Hence Δ(s | A) = F(A ∪ {s}) − F(A) is monotonically nonincreasing in A, so F is submodular.

Page 22: Tutorial on Optimization with  Submodular  Functions

22

Example: Influence in social networks [Kempe, Kleinberg, Tardos KDD '03]

Who should get free cell phones?
V = {Alice, Bob, Charlie, Dorothy, Eric, Fiona}
F(A) = expected number of people influenced when targeting A

[Figure: social network over V with the probability of influencing (0.2 to 0.5) attached to each edge.]

Page 23: Tutorial on Optimization with  Submodular  Functions

23

Influence in social networks is submodular [Kempe, Kleinberg, Tardos KDD '03]

Key idea: Flip the coins c for all edges in advance, yielding a set of "live" edges.
F_c(A) = number of people influenced under outcome c (a set cover function!)
F(A) = Σ_c P(c) F_c(A) is submodular as well!
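A rough Monte-Carlo sketch of that live-edge construction; the toy graph and edge probabilities below are purely illustrative:

import random

p = {("Alice", "Bob"): 0.5, ("Bob", "Charlie"): 0.3,     # influence probability per directed edge
     ("Alice", "Dorothy"): 0.4, ("Dorothy", "Eric"): 0.2}

def reached(A, live_edges):
    """Number of people influenced from seed set A over the sampled 'live' edges (set cover!)."""
    seen, frontier = set(A), list(A)
    while frontier:
        u = frontier.pop()
        for (a, b) in live_edges:
            if a == u and b not in seen:
                seen.add(b)
                frontier.append(b)
    return len(seen)

def expected_influence(A, samples=2000):
    """Estimate F(A) = sum_c P(c) F_c(A) by sampling coin flips c for all edges in advance."""
    total = 0
    for _ in range(samples):
        live = {e for e, prob in p.items() if random.random() < prob}
        total += reached(A, live)
    return total / samples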

Page 24: Tutorial on Optimization with  Submodular  Functions

24

Closedness properties
F1, …, Fm submodular functions on V and λ1, …, λm > 0.
Then F(A) = Σ_i λ_i F_i(A) is submodular!
Submodularity is closed under nonnegative linear combinations.

Extremely useful fact!!
F(A, θ) submodular for each θ ⇒ Σ_θ P(θ) F(A, θ) submodular (expectations preserve submodularity)
Multicriterion optimization: F1, …, Fm submodular, λ_i > 0 ⇒ Σ_i λ_i F_i(A) submodular

Page 25: Tutorial on Optimization with  Submodular  Functions

25

Submodularity and concavity
Suppose g: N → R and F(A) = g(|A|).
Then F(A) is submodular if and only if g is concave!
E.g., g could say "buying in bulk is cheaper".

[Figure: a concave function g(|A|) of the set size |A|.]

Page 26: Tutorial on Optimization with  Submodular  Functions

26

Maximum of submodular functions
Suppose F1(A) and F2(A) are submodular. Is F(A) = max(F1(A), F2(A)) submodular?

[Figure: F1 and F2 plotted against |A|; their pointwise maximum has a non-concave kink.]

max(F1, F2) is not submodular in general!

Page 27: Tutorial on Optimization with  Submodular  Functions

27

Minimum of submodular functions
Well, maybe F(A) = min(F1(A), F2(A)) instead?

A       F1(A)  F2(A)  F(A)
{}      0      0      0
{a}     1      0      0
{b}     0      1      0
{a,b}   1      1      1

Here F({b}) − F({}) = 0 < 1 = F({a,b}) − F({a}), violating diminishing returns.

min(F1, F2) is not submodular in general!
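Counterexamples like this are easy to verify with a brute-force check of the diminishing-returns condition on a small ground set; a short sketch (names illustrative):

from itertools import combinations

def is_submodular(F, V, tol=1e-12):
    """Check F(A u {s}) - F(A) >= F(B u {s}) - F(B) for all A <= B and s not in B."""
    V = list(V)
    subsets = [set(c) for r in range(len(V) + 1) for c in combinations(V, r)]
    for A in subsets:
        for B in subsets:
            if A <= B:
                for s in V:
                    if s not in B and F(A | {s}) - F(A) < F(B | {s}) - F(B) - tol:
                        return False
    return True

# The function F = min(F1, F2) from the table above fails this check on V = {'a', 'b'}.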

Page 28: Tutorial on Optimization with  Submodular  Functions

28

Submodularity and convexity

Key result [Lovász '83]: Every submodular function F induces a convex function g on R^n_+ (its Lovász extension) such that:
F(A) = g(w_A) for all A ⊆ V (w_A the indicator vector of A)
min_A F(A) = min_w g(w) s.t. w ∈ [0,1]^n, and the minimum is attained at a corner!
g is efficient to evaluate

Can use the ellipsoid algorithm to minimize submodular functions!

[Figure: F: 2^V → R submodular on V = {a, b}; the extension g: R^|V| → R satisfies g(1,0) = F({a}) and g(1,1) = F({a,b}).]
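The reason g is cheap to evaluate is the greedy/sorting formula for the Lovász extension; a small Python sketch under the convention F({}) = 0 (names illustrative):

def lovasz_extension(F, w):
    """Evaluate g(w): sort elements by decreasing weight and charge each element
    its marginal value along the resulting chain of sets."""
    order = sorted(w, key=w.get, reverse=True)
    g, S, prev_F = 0.0, set(), 0.0
    for s in order:
        S.add(s)
        curr_F = F(S)
        g += w[s] * (curr_F - prev_F)
        prev_F = curr_F
    return g

# For the indicator vector w_A of a set A, lovasz_extension(F, w_A) returns F(A).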

Page 29: Tutorial on Optimization with  Submodular  Functions

Minimization ofsubmodular functions

Page 30: Tutorial on Optimization with  Submodular  Functions

30

Minimizing submodular functions

Theorem [Iwata & Orlin (SODA 2009)]: There is a fully combinatorial, strongly polynomial algorithm for minimizing submodular functions that runs in time O(n^6 log n).

Polynomial-time = Practical ??

Is there a more practical alternative?

Page 31: Tutorial on Optimization with  Submodular  Functions

31

The submodular polyhedron P_F
P_F = {x ∈ R^n : x(A) ≤ F(A) for all A ⊆ V}, where x(A) = Σ_{s ∈ A} x(s)

Example: V = {a, b} with

A       F(A)
{}      0
{a}     -1
{b}     2
{a,b}   0

P_F is the set of x with x({a}) ≤ F({a}), x({b}) ≤ F({b}), x({a,b}) ≤ F({a,b}).

[Figure: P_F drawn in the (x_{a}, x_{b}) plane.]

Page 32: Tutorial on Optimization with  Submodular  Functions

32

A more practical alternative? [Fujishige '91, Fujishige et al '11]

Example: V = {a, b} with F({}) = 0, F({a}) = −1, F({b}) = 2, F({a,b}) = 0.
Base polytope: B_F = P_F ∩ {x : x({a,b}) = F({a,b})}

Minimum norm algorithm:
1. Find x* = argmin ||x||_2 s.t. x ∈ B_F   (here x* = [-1, 1])
2. Return A* = {i : x*(i) < 0}   (here A* = {a})

Theorem [Fujishige '91]: A* is an optimal solution!
Note: Can solve step 1 using Wolfe's algorithm.
Runtime finite but unknown!!

[Figure: base polytope and minimum norm point x* in the (x_{a}, x_{b}) plane.]

Page 33: Tutorial on Optimization with  Submodular  Functions

33

Empirical comparison [Fujishige et al ’06]

Minimum norm algorithm orders of magnitude faster! Our implementation can solve n = 10k in < 6 minutes!

[Figure: running time in seconds (log scale, lower is better) vs. problem size (log scale, 64 to 1024) on cut functions from the DIMACS Challenge; the minimum norm algorithm is fastest.]

Page 34: Tutorial on Optimization with  Submodular  Functions

34

Example: Image denoising

Page 35: Tutorial on Optimization with  Submodular  Functions

35

Example: Image denoising

[Figure: pairwise Markov random field on a pixel grid; Xi: noisy pixels, Yi: "true" pixels.]

Pairwise Markov Random Field:
P(x1, …, xn, y1, …, yn) ∝ Π_{i,j} ψ_{i,j}(y_i, y_j) Π_i φ_i(x_i, y_i)

Want argmax_y P(y | x) = argmax_y log P(x, y) = argmin_y Σ_{i,j} E_{i,j}(y_i, y_j) + Σ_i E_i(y_i),
where E_{i,j}(y_i, y_j) = −log ψ_{i,j}(y_i, y_j)

When is this MAP inference efficiently solvable (in high-treewidth graphical models)?

Page 36: Tutorial on Optimization with  Submodular  Functions

36

MAP inference in Markov Random Fields

Energy E(y) = Σ_{i,j} E_{i,j}(y_i, y_j) + Σ_i E_i(y_i)

Suppose the y_i are binary; define F(A) = E(y^A), where y^A_i = 1 iff i ∈ A.
Then min_y E(y) = min_A F(A).

Theorem [c.f., Hammer '65; Kolmogorov et al, PAMI '04]: The MAP inference problem is solvable by graph cuts ⟺ for all i, j: E_{i,j}(0,0) + E_{i,j}(1,1) ≤ E_{i,j}(0,1) + E_{i,j}(1,0) ⟺ each E_{i,j} is submodular.
"Efficient if we prefer that neighboring pixels have the same color."

Page 37: Tutorial on Optimization with  Submodular  Functions

37

Constrained minimization
Have seen: if F is submodular on V, we can solve A* = argmin_A F(A).
What about A* = argmin F(A) s.t. A ∈ C for some constraint set C (e.g., balanced cut, …)?

In general, not much is known about constrained minimization. However, one can solve:
A* = argmin F(A) s.t. 0 < |A| < n
A* = argmin F(A) s.t. |A| is odd/even [Goemans & Ramakrishnan '95]
A* = argmin F(A) s.t. A ∈ argmin G(A) for G submodular [Fujishige '91]

Some recent lower bounds by Svitkina & Fleischer '08.

Page 38: Tutorial on Optimization with  Submodular  Functions

State of the art in minimization
Faster algorithms for special cases via modern results for convex minimization [Stobbe & Krause NIPS '10]

Approximation algorithms for constrained minimization: Cooperative cuts [Jegelka & Bilmes CVPR ’11]

Fast approximate minimization via iterative reduction to graph cuts [Jegelka & Bilmes NIPS ‘11]

Many open questions in principled, efficient, large-scale submodular minimization!

38

Page 39: Tutorial on Optimization with  Submodular  Functions

Maximizing submodular functions

Page 40: Tutorial on Optimization with  Submodular  Functions

40

Maximizing submodular functions

Minimizing convex functions: polynomial-time solvable!
Minimizing submodular functions: polynomial-time solvable!
Maximizing convex functions: NP-hard!
Maximizing submodular functions: NP-hard!

But can get approximation guarantees

Page 41: Tutorial on Optimization with  Submodular  Functions

41

Maximizing non-monotonic functions
Suppose we want A* = argmax_A F(A) for a non-monotonic F.

Example: F(A) = U(A) − C(A), where U(A) is a submodular utility and C(A) is a supermodular cost function.
E.g.: trading off utility and privacy in personalized search [Krause & Horvitz AAAI '08]

In general: NP-hard. Moreover, if F(A) can take negative values, it is as hard to approximate as maximum independent set (i.e., NP-hard to get an O(n^{1-ε}) approximation).

[Figure: F(A) as a function of |A|, with an interior maximum.]

Page 42: Tutorial on Optimization with  Submodular  Functions

42

Theorem [Feige, Mirrokni, Vondrak FOCS '07]: There is an efficient randomized local search procedure that, given a nonnegative submodular function F with F({}) = 0, returns a set A_LS such that F(A_LS) ≥ (2/5) max_A F(A).

Moreover:
Picking a random set gives a ¼ approximation (½ approximation if F is symmetric!)
Current best approximation [Chekuri, Vondrak, Zenklusen]: ~0.41 via simulated annealing
We cannot get better than a ½ approximation unless P = NP


Page 43: Tutorial on Optimization with  Submodular  Functions

43

Scalarization vs. constrained maximization
Given monotonic utility F(A) and cost C(A), optimize either:

Option 1 ("scalarization"): max_A F(A) − C(A)
Can get a 2/5 approximation… if F(A) − C(A) ≥ 0 for all sets A. Positiveness is a strong requirement!

Option 2 ("constrained maximization"): max_A F(A) s.t. C(A) ≤ B
Greedy algorithm gets a (1 − 1/e) ≈ 0.63 approximation. Many more results!
Focus for the rest of today.

Page 44: Tutorial on Optimization with  Submodular  Functions

44

Battle of the Water Sensor Networks Competition [with Leskovec, Guestrin, VanBriesen, Faloutsos, J Wat Res Mgt 2008]

Real metropolitan area network (12,527 nodes)
Water flow simulator provided by EPA
3.6 million contamination events
Multiple objectives: detection time, affected population, …
Place sensors that detect well "on average"

Page 45: Tutorial on Optimization with  Submodular  Functions

45

Reward function is submodular
Claim: The reward function is monotonic submodular.

Consider event i:
R_i(u_k) = benefit from sensor u_k in event i
R_i(A) = max_{u_k ∈ A} R_i(u_k)
⇒ R_i is submodular

Overall objective: R(A) = Σ_i Prob(i) R_i(A)
⇒ R is submodular

Can use the greedy algorithm to solve max_{|A| ≤ k} R(A)!

Page 46: Tutorial on Optimization with  Submodular  Functions

46

Bounds on optimal solution [Krause et al., J Wat Res Mgt '08]

(1 − 1/e) bound quite loose… can we get better bounds?

[Figure: population protected F(A) (higher is better) vs. number of sensors placed (0 to 20), water networks data; the offline (Nemhauser) bound lies well above the greedy solution.]

Page 47: Tutorial on Optimization with  Submodular  Functions

47

Data-dependent bounds [Minoux '78]

Suppose A is a candidate solution to argmax F(A) s.t. |A| ≤ k, and let A* = {s1, …, sk} be an optimal solution. Then:
F(A*) ≤ F(A ∪ A*)
= F(A) + Σ_i [F(A ∪ {s1, …, si}) − F(A ∪ {s1, …, si−1})]
≤ F(A) + Σ_i [F(A ∪ {si}) − F(A)]
= F(A) + Σ_i Δ_{si}

For each s ∈ V \ A, let Δ_s = F(A ∪ {s}) − F(A), and order so that Δ_1 ≥ Δ_2 ≥ … ≥ Δ_n.
Then: F(A*) ≤ F(A) + Σ_{i=1}^{k} Δ_i
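Once F is available as a black box, this bound is a one-liner to compute; a small Python sketch (names illustrative):

def data_dependent_bound(F, V, A, k):
    """Minoux-style upper bound on max_{|B| <= k} F(B), computed from any candidate set A:
    F(A) plus the sum of the k largest single-element marginal gains."""
    deltas = sorted((F(A | {s}) - F(A) for s in V if s not in A), reverse=True)
    return F(A) + sum(deltas[:k])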


Page 48: Tutorial on Optimization with  Submodular  Functions

48

Bounds on optimal solution [Krause et al., J Wat Res Mgt '08]

Submodularity gives data-dependent bounds on the performance of any algorithm.

[Figure: sensing quality F(A) (higher is better) vs. number of sensors placed (0 to 20), water networks data; the data-dependent bound is much tighter than the offline (Nemhauser) bound and lies close to the greedy solution.]

Page 49: Tutorial on Optimization with  Submodular  Functions

49

BWSN Competition results
13 participants; performance measured on 30 different criteria.

[Figure: total score (higher is better) by team: the greedy approach, Berry et al., Dorini et al., Wu & Walski, Ostfeld & Salomons, Propato & Piller, Eliades & Polycarpou, Huang et al., Guan et al., Ghimire & Barkdoll, Trachtman, Gueli, Preis & Ostfeld. Method legend: G: genetic algorithm, H: other heuristic, D: domain knowledge, E: "exact" method (MIP).]

24% better performance than runner-up!

Page 50: Tutorial on Optimization with  Submodular  Functions

50

What was the trick?

Simulated all 3.6M contamination events on 2 weeks / 40 processors; 152 GB of data on disk, 16 GB in main memory (compressed).
Very accurate computation of F(A), but very slow evaluation of F(A): 30 hours per 20-sensor placement; 6 weeks for all 30 settings.

[Figure: running time in minutes (lower is better) vs. number of sensors selected (1 to 10); exhaustive search (all subsets) vs. naive greedy.]

Submodularity to the rescue!

Page 51: Tutorial on Optimization with  Submodular  Functions

51

Scaling up the greedy algorithm [Minoux '78]
In round i+1, having picked A_i = {s1, …, si}, pick
s_{i+1} = argmax_s F(A_i ∪ {s}) − F(A_i)
I.e., maximize the "marginal benefit" Δ(s | A_i) = F(A_i ∪ {s}) − F(A_i).

Key observation: Submodularity implies that for i ≤ j,
Δ(s | A_i) ≥ Δ(s | A_j)
Marginal benefits can never increase!

Page 52: Tutorial on Optimization with  Submodular  Functions

52

"Lazy" greedy algorithm [Minoux '78]

Lazy greedy algorithm:
First iteration as usual.
Keep an ordered list of the marginal benefits Δ_i from the previous iteration.
Re-evaluate Δ_i only for the top element.
If Δ_i stays on top, use it; otherwise re-sort.

[Figure: priority list of elements a to e ordered by benefit Δ(s | A); only the top entry is re-evaluated in each step.]

Note: Very easy to compute online bounds, lazy evaluations, etc. [Leskovec, Krause et al. '07]
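A sketch of the lazy greedy idea with a max-heap of (possibly stale) marginal-benefit bounds; it follows the description above, with illustrative names:

import heapq

def lazy_greedy(F, V, k):
    A, f_A = set(), F(set())
    heap = [(-(F({s}) - f_A), s) for s in V]     # negated gains: heapq is a min-heap
    heapq.heapify(heap)
    for _ in range(min(k, len(V))):
        while True:
            _, s = heapq.heappop(heap)
            gain = F(A | {s}) - f_A               # re-evaluate only the top element
            if not heap or gain >= -heap[0][0]:   # still on top: take it
                A.add(s)
                f_A += gain
                break
            heapq.heappush(heap, (-gain, s))      # stale bound: push back and re-sort
    return A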


Page 53: Tutorial on Optimization with  Submodular  Functions

53

Same setup as before: all 3.6M contamination events simulated on 2 weeks / 40 processors; 152 GB of data on disk, 16 GB in main memory (compressed); very accurate but very slow evaluation of F(A) (30 hours per 20-sensor placement, 6 weeks for all 30 settings).

Submodularity to the rescue: using "lazy evaluations", 1 hour per 20 sensors; done after 2 days!

[Figure: running time in minutes (lower is better) vs. number of sensors selected; the fast (lazy) greedy curve lies far below exhaustive search (all subsets) and naive greedy.]

Page 54: Tutorial on Optimization with  Submodular  Functions

54

Non-constant cost functions
For each s ∈ V, let c(s) > 0 be its cost (e.g., feature acquisition costs, …).
Cost of a set: C(A) = Σ_{s ∈ A} c(s). Want to solve

A* = argmax F(A) s.t. C(A) ≤ B

Cost-benefit greedy algorithm:
Start with A := {};
While there is an s ∈ V \ A s.t. C(A ∪ {s}) ≤ B:
  s* := argmax_s [F(A ∪ {s}) − F(A)] / c(s)
  A := A ∪ {s*}
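A sketch of this cost-benefit rule in Python (for the guarantee on the next slide one would also run the unit-cost greedy and keep the better of the two solutions); names illustrative:

def cost_benefit_greedy(F, c, V, B):
    """Greedily add the affordable element with the best marginal-gain-per-cost ratio."""
    A, spent = set(), 0.0
    while True:
        affordable = [s for s in V if s not in A and spent + c[s] <= B]
        if not affordable:
            return A
        s_star = max(affordable, key=lambda s: (F(A | {s}) - F(A)) / c[s])
        A.add(s_star)
        spent += c[s_star]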


Page 55: Tutorial on Optimization with  Submodular  Functions

55

Performance of cost-benefit greedy
Want max_A F(A) s.t. C(A) ≤ 1, with two items:

Set A   F(A)   C(A)
{a}     2ε     ε
{b}     1      1

Cost-benefit greedy picks a (best benefit/cost ratio). Then it cannot afford b!
As ε → 0, cost-benefit greedy performs arbitrarily badly!

Page 56: Tutorial on Optimization with  Submodular  Functions

56

Cost-benefit optimization [Wolsey '82, Sviridenko '04, Krause et al '05]

Theorem [Krause and Guestrin '05]: Let A_CB be the cost-benefit greedy solution and A_UC the unit-cost greedy solution (i.e., ignoring costs). Then
max { F(A_CB), F(A_UC) } ≥ (1 − 1/√e) OPT   (~39%)

Can still compute online bounds and speed up using lazy evaluations.

Note: Can also get
a (1 − 1/e) approximation in time O(n^4) [Sviridenko '04]
a (1 − 1/e) approximation for multiple linear constraints [Kulik '09]
a 0.38/k approximation for k matroid and m linear constraints [Chekuri et al '11]


Page 57: Tutorial on Optimization with  Submodular  Functions

57

Informative path planning
So far: max F(A) s.t. |A| ≤ k.
But the most informative locations might be far apart, and the robot needs to travel between selected locations!

Locations V are nodes in a graph; C(A) = cost of the cheapest path connecting the nodes A.
Want: max F(A) s.t. C(A) ≤ B.

Can exploit submodularity to get near-optimal paths; outperforms existing heuristics.

[Figure: graph of candidate sensing locations with travel costs on the edges.]

Page 58: Tutorial on Optimization with  Submodular  Functions

58

Question

I have 10 minutes. Which blogs should I read to be most up to date? [Leskovec-Krause et al. '07]

184M blogs; 77% read blogs [Universal McCann, '08]

Page 59: Tutorial on Optimization with  Submodular  Functions

59

Cascades in the Blogosphere
[with Leskovec, Guestrin, Faloutsos, VanBriesen, Glance KDD 07 – Best Paper]

[Figure: an information cascade spreading through blog posts over time; readers further down the cascade learn about the story after us!]

Which blogs should we read to learn about big cascades early?

Page 60: Tutorial on Optimization with  Submodular  Functions

60

Water vs. Web

In both problems we are given:
a graph with nodes (junctions / blogs) and edges (pipes / links)
cascades spreading dynamically over the graph (contamination / citations)

Want to pick nodes to detect big cascades early.

Placing sensors in water networks vs. selecting informative blogs: in both applications the utility functions are submodular [generalizes Kempe, Kleinberg, Tardos, KDD '03].

Page 61: Tutorial on Optimization with  Submodular  Functions

61

Performance on blog selection (~45k blogs)

Outperforms state-of-the-art heuristics; 700x speedup using submodularity!

[Figure, left: cascades captured (higher is better) vs. number of blogs selected (0 to 100); greedy beats the in-links, all out-links, # posts, and random baselines. Figure, right: running time in seconds (lower is better) vs. number of blogs selected (1 to 10); fast (lazy) greedy lies far below exhaustive search (all subsets) and naive greedy.]

Page 62: Tutorial on Optimization with  Submodular  Functions

Taking "attention" into account

Naïve approach: Just pick the 10 best blogs. This selects big, well-known blogs (Instapundit, etc.), which contain many posts and take long to read!

[Figure: cascades captured vs. number of posts (time) allowed; cost/benefit analysis clearly beats ignoring cost.]

Cost-benefit optimization picks "summarizer" blogs!

Page 63: Tutorial on Optimization with  Submodular  Functions

Generalizations

Page 64: Tutorial on Optimization with  Submodular  Functions

Learning to optimize submodular functions

Online submodular optimization:
Learn to pick a sequence of sets to maximize a sequence of (unknown) submodular functions.
Application: building algorithm portfolios.

Adaptive submodular optimization:
Gradually build up a set, taking feedback into account.
Application: sequential experimental design.

64

Page 65: Tutorial on Optimization with  Submodular  Functions

65

Predicting the "hot" blogs

Want blogs that will be informative in the future. Split the data set: train on the historic part, test on the future part. Let's see what goes wrong here.

[Figure: cascades captured vs. number of posts (time) allowed; "cheating" (greedy trained on the future, tested on the future) vs. greedy trained on historic data and tested on the future. Number of detections per month (Jan to May) for Greedy vs. Saturate: greedy detects well on the training period but poorly afterwards.]

Poor generalization! Why's that? Blog selection "overfits" to the training data. We want blogs that continue to do well!

Page 66: Tutorial on Optimization with  Submodular  Functions

66

Online optimization

Fi(A) = detections in interval i. The "overfit" blog selection A scores, e.g., F1(A) = .5, F2(A) = .8, F3(A) = .6 on the training intervals, but only F4(A) = .01 and F5(A) = .02 afterwards.

[Figure: number of detections per month (Jan to May) for Greedy vs. Saturate, as on the previous slide.]

Page 67: Tutorial on Optimization with  Submodular  Functions

67

Online maximization of submodular functions [Streeter, Golovin NIPS '08]

Protocol: at each time step t we pick a set A_t; then the submodular function F_t is revealed (or only the value F_t(A_t)), and we receive reward r_t = F_t(A_t). Goal: maximize the total reward Σ_t r_t.

Theorem: Can efficiently choose A1, …, A_T such that, in expectation,
(1/T) Σ_t F_t(A_t) ≥ (1 − 1/e) max_{|A| ≤ k} (1/T) Σ_t F_t(A) − o(1)
for any sequence F_i, as T → ∞.

Can get "no-regret" against the "cheating" greedy algorithm.

Page 68: Tutorial on Optimization with  Submodular  Functions

Online Greedy Algorithm [Streeter & Golovin, NIPS '08]

Replace each stage of the greedy algorithm with an expert / multi-armed bandit algorithm: stage j selects element a_j, so the algorithm picks {a1, a2, a3, …, ak}.

Feedback to stage j for its action a_j (in the scheduling application below, an action is a pair (h, t)) is the marginal gain f({a1, a2, …, a_{j−1}, a_j}) − f({a1, a2, …, a_{j−1}}).
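A rough full-information sketch of this construction, using a Hedge-style multiplicative-weights learner in each of the k slots; the step size and reward scaling are illustrative assumptions, not the paper's exact algorithm:

import math
import random

def online_greedy(V, k, rounds, get_F, eta=0.1):
    V = list(V)
    weights = [[1.0] * len(V) for _ in range(k)]          # one expert algorithm per greedy slot
    history = []
    for t in range(rounds):
        F_t = get_F(t)                                     # this round's submodular function
        chosen = []
        for i in range(k):                                 # each slot samples an element ~ its weights
            total = sum(weights[i])
            r, acc, idx = random.random() * total, 0.0, 0
            for j, w in enumerate(weights[i]):
                acc += w
                if acc >= r:
                    idx = j
                    break
            chosen.append(idx)
        history.append({V[j] for j in chosen})
        prefix = set()                                     # reward each slot by marginal gains
        for i in range(k):
            base = F_t(prefix)
            for j, s in enumerate(V):
                weights[i][j] *= math.exp(eta * (F_t(prefix | {s}) - base))
            prefix.add(V[chosen[i]])
    return history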

Page 69: Tutorial on Optimization with  Submodular  Functions

69

Results on blogs[Streeter, Golovin, Krause NIPS ‘09]

Performance of online algorithm converges quicklyto “cheating” offline greedy algorithm!

Can learn to rank by modeling position dependent effects

0 100 200 3000

0.2

0.4

0.6

0.8

1

Time (days)

Avg.

nor

mal

ized

perfo

rman

ceT=47

Page 70: Tutorial on Optimization with  Submodular  Functions

Hybridizing Solvers [Streeter, Golovin, Smith]

• Hybridization via task-switching schedules
• Need to learn which schedules work well!

[Figure: example schedule S that switches among heuristics #1 to #4 in time slices of 2, 4, 6, 2, and 10 seconds.]

Page 71: Tutorial on Optimization with  Submodular  Functions

Hybridizing Solvers [Streeter & Golovin '08]

• Actions A = {run heuristic h for t seconds : h ∈ H, t > 0}
• Instances: f_i(S) = Pr[schedule S completes job i within the time limit]. This is a submodular function!
• Task: Select schedules {S_i : i = 1, 2, …, T} online to maximize the number of instances solved
• The online greedy algorithm achieves no-(1 − 1/e)-regret

Page 72: Tutorial on Optimization with  Submodular  Functions

[Figure: empirical solver-hybridization results (both axes on a log scale). Streeter, Golovin, Smith, AAAI '07 & CSP '08]

Page 73: Tutorial on Optimization with  Submodular  Functions


Page 74: Tutorial on Optimization with  Submodular  Functions

Learning to optimize submodular functions

Online submodular optimization:
Learn to pick a sequence of sets to maximize a sequence of (unknown) submodular functions.
Application: building algorithm portfolios.

Adaptive submodular optimization:
Gradually build up a set, taking feedback into account.
Application: sequential experimental design.

74

Page 75: Tutorial on Optimization with  Submodular  Functions

75

Adaptive selection

[Figure: a diagnosis policy as a decision tree: run test x1 and, depending on its outcome (0/1), run x2 or x3, and so on.]

Want to effectively diagnose while minimizing the cost of testing!
Interested in a policy (decision tree), not a set, so we can't use submodular set functions!

Can we generalize submodularity for sequential decision making?

Page 76: Tutorial on Optimization with  Submodular  Functions

Problem Statement
Given:
Items (tests, experiments, actions, …) V = {1, …, n}
Associated random variables X1, …, Xn taking values in O
Objective: f(A, x_V) = benefit of having selected A ⊆ V if the world is in state x_V

Value of a policy π: F(π) = E_{x_V}[ f(E(π, x_V), x_V) ], where E(π, x_V) is the set of tests run by π if the world is in state x_V

Want: π* = argmax_π F(π), subject to a bound on the number (or cost) of items π selects

NP-hard (also hard to approximate!)

Page 77: Tutorial on Optimization with  Submodular  Functions

Adaptive greedy algorithm
Suppose we've seen X_A = x_A. The conditional expected benefit of adding item s is
Δ(s | x_A) = E[ f(A ∪ {s}, X_V) − f(A, X_V) | X_A = x_A ]
(the expected benefit if the world is in state x_V, conditioned on the observations so far).

Adaptive greedy algorithm:
Start with A = {}.
For i = 1:k
  Pick s* = argmax_s Δ(s | x_A)
  Observe X_{s*} = x_{s*}
  Set A = A ∪ {s*}

When does this adaptive greedy algorithm work??
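A skeleton of the loop just described in Python; delta and observe are placeholders for a model-specific conditional-benefit computation and the real feedback source:

def adaptive_greedy(items, k, delta, observe):
    """delta(s, obs) -> conditional expected benefit of item s given observations so far;
    observe(s)      -> the realized outcome x_s of running item s."""
    observations = {}
    for _ in range(k):
        remaining = [s for s in items if s not in observations]
        if not remaining:
            break
        s_star = max(remaining, key=lambda s: delta(s, observations))
        observations[s_star] = observe(s_star)    # the feedback drives the next choice
    return observations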

Page 78: Tutorial on Optimization with  Submodular  Functions

Adaptive submodularity [Golovin & Krause, JAIR 2011]

f is called adaptive submodular iff
Δ(s | x_A) ≥ Δ(s | x_B) whenever x_B observes more than x_A (i.e., x_A is a sub-realization of x_B).

f is called adaptive monotone iff Δ(s | x_A) ≥ 0 for all observations x_A.

Theorem: If f is adaptive submodular and adaptive monotone w.r.t. the distribution P, then
F(π_greedy) ≥ (1 − 1/e) F(π_opt)

Many other results about submodular set functions can also be "lifted" to the adaptive setting!

Page 79: Tutorial on Optimization with  Submodular  Functions

79

Application: Adaptive Viral Marketing

Adaptively select promotion targets; after each selection, see which people are influenced.

[Figure: social network over Alice, Bob, Charlie, Daria, Eric, Fiona with the probability of influencing (0.2 to 0.5) on each edge.]

Page 80: Tutorial on Optimization with  Submodular  Functions

80

Application: Adaptive Viral Marketing

Theorem: The objective is adaptive submodular!

Hence, the adaptive greedy choice is a (1 − 1/e) ≈ 63% approximation to the optimal adaptive solution.

[Figure: the same social network as on the previous slide.]

Page 81: Tutorial on Optimization with  Submodular  Functions

Stochastic Minimum Cost Cover

Suppose we wish to get 25% market penetration at minimum cost.

Theorem [Golovin & Krause, '11]: If F is adaptive submodular and adaptive monotone, then
Cost(π_greedy) ≤ O(log n) Cost(π_opt)

[Figure: the same social network as before.]

Page 82: Tutorial on Optimization with  Submodular  Functions

Optimal decision trees
Prior over diseases P(Y); likelihood of test outcomes P(X_V | Y).
Suppose that P(X_V | Y) is deterministic (noise free): each test eliminates hypotheses y.
How should we test to eliminate all incorrect hypotheses?

"Generalized binary search", equivalent to maximizing information gain.

[Figure: naive Bayes model with Y = "sick" and tests X1 = "fever", X2 = "rash", X3 = "cough"; a decision tree branching on the test outcomes.]

Page 83: Tutorial on Optimization with  Submodular  Functions

ODT is Adaptive Submodular

83

Objective = probability mass of hypotheses you have ruled out.

[Figure: running test s splits the remaining hypotheses by its outcome (0 or 1); subsequent tests v and w split them further.]

Not hard to show that this objective is adaptive monotone and adaptive submodular.

Page 84: Tutorial on Optimization with  Submodular  Functions

Guarantees for optimal decision trees

With an adaptive submodular analysis, one recovers the known approximation guarantees for greedy (generalized binary search) policies (Garey & Graham, 1974; Loveland, 1985; Arkin et al., 1993; Kosaraju et al., 1999; Dasgupta, 2004; Guillory & Bilmes, 2009; Nowak, 2009; Gupta et al., 2010).

This result requires that tests are exact (no noise)!

[Figure: the diagnosis decision tree from before.]

Page 85: Tutorial on Optimization with  Submodular  Functions

Noisy Bayesian diagnosis [with Daniel Golovin, Deb Ray, NIPS '10]
In practice, observations typically are noisy. Results for the noise-free case do not generalize.
Key problem: tests no longer rule out hypotheses (they only make them less likely).
Intuitively, we want to gather enough information to make the right decision!

[Figure: naive Bayes model with Y = "sick" and noisy tests "fever", "rash", "cough".]

Page 86: Tutorial on Optimization with  Submodular  Functions

Noisy Bayesian diagnosis
Suppose I run all tests, i.e., see X_V = x_V. The best I can do is to choose the action a* maximizing expected utility given x_V.
How should I cheaply test to guarantee that I choose a*?

Existing approaches:
Generalized binary search?
Maximize information gain?
Maximize value of information?
These are not adaptive submodular in the noisy setting!

Theorem: All these approaches can have cost more than n / log n times the optimal cost!

Page 87: Tutorial on Optimization with  Submodular  Functions

A new criterion for nonmyopic VOI
Strategy: Replace the noisy problem with a noiseless problem.
Key idea: Make the test outcomes part of the hypothesis.

[Figure: each hypothesis is a full outcome vector of the tests, e.g., [0,0,0], [0,1,0], [1,0,0], [1,0,1], [1,1,0], …]

Page 88: Tutorial on Optimization with  Submodular  Functions

A new criterion for nonmyopic VOI
Only need to distinguish between noisy hypotheses that lead to different decisions!

Draw an edge between every pair of hypotheses that lead to different decisions; the weight of an edge is the product of the hypotheses' probabilities. The objective F(A) is the total mass of edges cut when observing X_A = x_A (an observed test outcome rules out all hypotheses inconsistent with it, cutting their edges).

[Figure: hypotheses [0,0,0], [1,0,0], [0,1,0], [1,0,1], [1,1,1] grouped into equivalence classes, with weighted edges between classes.]

Page 89: Tutorial on Optimization with  Submodular  Functions

Theoretical guarantees [with Daniel Golovin, Deb Ray, NIPS '10]

Theorem: Equivalence class edge-cutting (EC2) is adaptive submodular and adaptive monotone. If every hypothesis has prior probability at least p_min, the greedy policy's cost is within an O(log(1/p_min)) factor of the optimal policy's cost.

These are the first approximation guarantees for nonmyopic VOI in general graphical models!

89

Page 90: Tutorial on Optimization with  Submodular  Functions

Example: The Iowa Gambling Task [with Colin Camerer, Deb Ray]

Various competing theories on how people make decisions under uncertainty:
Maximize expected utility? [von Neumann & Morgenstern '47]
Constant relative risk aversion? [Pratt '64]
Portfolio optimization? [Hanoch & Levy '70]
(Normalized) prospect theory? [Kahnemann & Tversky '79]

How should we design tests to distinguish theories?

What would you prefer?
[Figure: two gambles A and B over the outcomes −$10, $0, +$10, with probabilities .3 and .7 swapped between them.]

Page 91: Tutorial on Optimization with  Submodular  Functions

Iowa Gambling as Bayesian experimental design
Every possible test X_s = (g_{s,1}, g_{s,2}) is a pair of gambles.
Theories are parameterized by θ; each theory predicts a utility U(g, y, θ) for every gamble.

[Figure: graphical model with theory Y and parameters θ generating the responses X1, …, Xn to the gamble pairs (g_{i,1}, g_{i,2}).]

Page 92: Tutorial on Optimization with  Submodular  Functions

Simulation Results

The adaptive submodular criterion (EC2) outperforms existing approaches.

Page 93: Tutorial on Optimization with  Submodular  Functions

Experimental Study [with Colin Camerer, Deb Ray]

Study with 57 naïve subjects.
Indication of heterogeneity in the population; unexpectedly, no support for CRRA.
Submodularity crucial for real-time performance: 32,000 designs evaluated in < 5 s, 8x faster through lazy evaluations!

Page 94: Tutorial on Optimization with  Submodular  Functions

94

Structure in learning problems

AI/ML, last 10 years: convexity (kernel machines, SVMs, GPs, MLE, …)
AI/ML, "next 10 years": submodularity and new structural properties

MATLAB Toolbox for optimizing submodular functions (JMLR 10)
Series of NIPS Workshops on Discrete Optimization in ML (DISCML), talks recorded on videolectures.net

Page 95: Tutorial on Optimization with  Submodular  Functions

Submodularity in ML / AI

95

Fast inference for high-order submodular MAP problems [NIPS '10]

Submodular dictionary selection for sparse representation [ICML '10]

Learning to solve zero-sum submodular games [IJCAI '11]

First convergence rates for GP optimization [ICML '10]

Page 96: Tutorial on Optimization with  Submodular  Functions

Conclusions
Discrete optimization is abundant in applications.
Fortunately, some of those problems have structure: submodularity.
Submodularity can be exploited to develop efficient, scalable algorithms with strong guarantees.
Can handle complex constraints.
Can learn to optimize (online, adaptive, …).

96