Learning to Branch
Ellen Vitercik
Joint work with Nina Balcan, Travis Dick, and Tuomas Sandholm
Published in ICML 2018
Integer Programs (IPs)

maximize c · x
subject to Ax ≤ b
x ∈ {0,1}^n
Facility location problems can be formulated as IPs.
Clustering problems can be formulated as IPs.
Binary classification problems can be formulated as IPs.
Solving IPs is NP-hard.
Branch and Bound (B&B)

• Most widely used algorithm for solving IPs (used in CPLEX and Gurobi)
• Recursively partitions the search space to find an optimal solution
• Organizes the partition as a tree
• Many parameters: CPLEX has a 221-page manual describing 135 parameters
  "You may need to experiment."
Why is tuning B&B parameters important?

• Save time
• Solve more problems
• Find better solutions
B&B in the real world

• A delivery company routes trucks daily, using integer programming to select routes
• Demand changes every day, so it solves hundreds of similar optimization problems

Using this set of typical problems, can we learn the best parameters?
Model

An application-specific distribution over IPs is fixed; the algorithm designer receives samples (A^(1), b^(1), c^(1)), …, (A^(m), b^(m), c^(m)) and must choose B&B parameters.

How to use samples to find the best B&B parameters for my domain?
The model has been studied in applied communities [Hutter et al. '09].
The model has also been studied from a theoretical perspective [Gupta and Roughgarden '16; Balcan et al. '17].
Model

1. Fix a set of B&B parameters to optimize
2. Receive sample problems from an unknown distribution
3. Find parameters with the best performance on the samples
("Best" could mean smallest search tree, for example.)
Questions to address

• How to find parameters that are best on average over the samples?
• Will those parameters have high performance in expectation?
Outline
1. Introduction
2. Branch-and-Bound
3. Learning algorithms
4. Experiments
5. Conclusion and Future Directions
B&B example

max (40, 60, 10, 10, 3, 20, 60) · x
s.t. (40, 50, 30, 10, 10, 40, 30) · x ≤ 100
x ∈ {0,1}^7

B&B
1. Choose a leaf of the tree
2. Branch on a variable
3. Fathom the leaf if:
   i. the LP relaxation solution is integral,
   ii. the LP relaxation is infeasible, or
   iii. the LP relaxation solution isn't better than the best-known integral solution

Walking through the example:
• Root: LP relaxation solution (1/2, 1, 0, 0, 0, 0, 1), objective 140
• Branch on x1:
  • x1 = 1: (1, 3/5, 0, 0, 0, 0, 1), objective 136
  • x1 = 0: (0, 1, 0, 1, 0, 1/4, 1), objective 135
• Branch the x1 = 1 node on x2:
  • x2 = 0: (1, 0, 0, 1, 0, 1/2, 1), objective 120
  • x2 = 1: (1, 1, 0, 0, 0, 0, 1/3), objective 120
• Branch the x1 = 0 node on x6:
  • x6 = 1: (0, 3/5, 0, 0, 0, 1, 1), objective 116
  • x6 = 0: (0, 1, 1/3, 1, 0, 0, 1), objective 133.3
• Branch the x6 = 0 node on x3:
  • x3 = 1: (0, 4/5, 1, 0, 0, 0, 1), objective 118
  • x3 = 0: (0, 1, 0, 1, 1, 0, 1), objective 133. Integral, so this leaf is fathomed, and 133 becomes the best-known integral solution
• The remaining leaves (objectives 120, 120, 116, 118) are fathomed because their LP bounds aren't better than 133

This talk: How to choose which variable to branch on?
(Assume every other aspect of B&B is fixed.)
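The loop above can be sketched in code. This is a minimal illustration, not the paper's implementation: because the example IP has a single ≤-constraint, the LP relaxation can be solved by the greedy fractional-knapsack rule, and here branching always picks the fractional variable.

```python
from fractions import Fraction

values  = [40, 60, 10, 10, 3, 20, 60]
weights = [40, 50, 30, 10, 10, 40, 30]
capacity = 100
n = len(values)

def lp_bound(fixed):
    """LP relaxation with some variables fixed to 0/1, solved by the greedy
    fractional-knapsack rule (valid because there is a single constraint).
    Returns (objective, index of the fractional variable or None),
    or (None, None) if the fixing is infeasible."""
    cap = capacity - sum(weights[i] for i, v in fixed.items() if v)
    if cap < 0:
        return None, None
    obj = Fraction(sum(values[i] for i, v in fixed.items() if v))
    for i in sorted((i for i in range(n) if i not in fixed),
                    key=lambda i: Fraction(values[i], weights[i]), reverse=True):
        if weights[i] <= cap:
            cap -= weights[i]
            obj += values[i]
        elif cap > 0:
            obj += Fraction(values[i] * cap, weights[i])
            return obj, i            # fractional variable: branch here
        else:
            break
    return obj, None                 # LP solution is integral

def branch_and_bound():
    best, nodes, stack = 0, 0, [dict()]
    while stack:
        fixed = stack.pop()
        nodes += 1
        obj, frac = lp_bound(fixed)
        if obj is None or obj <= best:
            continue                 # fathom: infeasible or not better than incumbent
        if frac is None:
            best = int(obj)          # fathom: integral; update incumbent
            continue
        for v in (0, 1):             # branch on the fractional variable
            stack.append({**fixed, frac: v})
    return best, nodes

best, nodes = branch_and_bound()
print(f"optimal value {best}, tree size {nodes}")
```

On this instance it recovers the optimal value 133, matching the example tree.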
Variable selection policies can have a huge effect on tree size
Outline
1. Introduction
2. Branch-and-Bound
   a. Algorithm Overview
   b. Variable Selection Policies
3. Learning algorithms
4. Experiments
5. Conclusion and Future Directions
Variable selection policies (VSPs)

Score-based VSP: at a leaf Q, branch on the variable x_i maximizing score(Q, i).

Many options! Little is known about which to use when.
Variable selection policies

For an IP instance Q:
• Let c_Q be the objective value of its LP relaxation
• Let Q_i^- be Q with x_i set to 0, and let Q_i^+ be Q with x_i set to 1

Example: at the root of the running example, the LP relaxation solution is (1/2, 1, 0, 0, 0, 0, 1) with c_Q = 140.
Branching on x_1 yields Q_1^- with c_{Q_1^-} = 135 and Q_1^+ with c_{Q_1^+} = 136.
The linear rule (parameterized by μ) [Linderoth & Savelsbergh, 1999]

Branch on the variable x_i maximizing:
score(Q, i) = μ · min(c_Q − c_{Q_i^-}, c_Q − c_{Q_i^+}) + (1 − μ) · max(c_Q − c_{Q_i^-}, c_Q − c_{Q_i^+})
Variable selection policies

The (simplified) product rule [Achterberg, 2009]
Branch on the variable x_i maximizing:
score(Q, i) = (c_Q − c_{Q_i^-}) · (c_Q − c_{Q_i^+})

And many more…
Our parameterized rule

Given d scoring rules score_1, …, score_d, the goal is to learn the best convex combination μ_1 score_1 + ⋯ + μ_d score_d.

Branch on the variable x_i maximizing:
score(Q, i) = μ_1 score_1(Q, i) + ⋯ + μ_d score_d(Q, i)
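The scoring-rule machinery can be made concrete on the running knapsack example. This is a sketch under two assumptions: the LP values are computed with the greedy fractional-knapsack bound (valid here because there is a single constraint), and the two rules combined are score_1 = min bound decrease and score_2 = max bound decrease, i.e., the components of the linear rule.

```python
from fractions import Fraction

values  = [40, 60, 10, 10, 3, 20, 60]
weights = [40, 50, 30, 10, 10, 40, 30]
capacity = 100
n = len(values)

def lp_value(fixed):
    """Objective value of the LP relaxation with some variables fixed to 0/1,
    computed by the greedy fractional-knapsack rule (one constraint only)."""
    cap = Fraction(capacity - sum(weights[i] for i, v in fixed.items() if v))
    assert cap >= 0            # every single fixing is feasible in this example
    obj = Fraction(sum(values[i] for i, v in fixed.items() if v))
    for i in sorted((i for i in range(n) if i not in fixed),
                    key=lambda i: Fraction(values[i], weights[i]), reverse=True):
        take = min(Fraction(weights[i]), cap)   # fill greedily, possibly fractionally
        obj += values[i] * take / weights[i]
        cap -= take
    return obj

c_Q = lp_value({})             # LP value at the root: 140, as in the example tree
# Bound decreases c_Q - c_{Q_i^-} and c_Q - c_{Q_i^+} for every variable.
decs = [(c_Q - lp_value({i: 0}), c_Q - lp_value({i: 1})) for i in range(n)]

# score_1 = min decrease, score_2 = max decrease; combine with weight mu.
for mu in (0.0, 0.5, 1.0):
    scores = [mu * min(d) + (1 - mu) * max(d) for d in decs]
    best = max(range(n), key=scores.__getitem__)
    print(f"mu = {mu}: branch on x_{best + 1}")
```

With μ = 1 (pure min decrease) this branches on x_1, matching the root of the example tree, where c_{Q_1^-} = 135 and c_{Q_1^+} = 136; with μ = 0 it prefers x_7.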
Model, revisited: how to use samples to find the best B&B parameters μ_1, …, μ_d for my domain?
Outline
1. Introduction
2. Branch-and-Bound
3. Learning algorithms
   a. First try: Discretization
   b. Our Approach
4. Experiments
5. Conclusion and Future Directions
First try: Discretization

1. Discretize the parameter space
2. Receive sample problems from an unknown distribution
3. Find the parameters in the discretization with the best average performance
[Plot: average tree size as a function of μ]
This has been prior work's approach [e.g., Achterberg (2009)].
Discretization gone wrong

[Plot: average tree size as a function of μ]
This can actually happen!
Discretization gone wrong

Theorem [informal]. For any discretization, there exists a problem-instance distribution D inducing this behavior.

Proof ideas:
• D's support consists of infeasible IPs with "easy out" variables
• B&B takes exponential time unless it branches on the "easy out" variables
• B&B only finds the "easy outs" if it uses parameters from a specific range
[Plot: expected tree size as a function of μ]
Outline
1. Introduction
2. Branch-and-Bound
3. Learning algorithms
   a. First try: Discretization
   b. Our Approach
      i. Single-parameter settings
      ii. Multi-parameter settings
4. Experiments
5. Conclusion and Future Directions
Simple assumption

There exists κ upper bounding the size of the largest tree we are willing to build.

Common assumption, e.g.:
• Hutter, Hoos, Leyton-Brown, Stützle, JAIR '09
• Kleinberg, Leyton-Brown, Lucier, IJCAI '17
Useful lemma

Lemma: For any two scoring rules and any IP Q, O((# variables)^{κ+2}) intervals partition [0,1] such that for any interval [a, b], B&B builds the same tree across all μ ∈ [a, b].

(The number of intervals is much smaller in our experiments!)
Useful lemma: proof sketch

At the root, each variable x_i's combined score μ · score_1(Q, i) + (1 − μ) · score_2(Q, i) is linear in μ, so [0,1] is partitioned into intervals on which the argmax (the variable B&B branches on) is constant: for example, branch on x_1 on one interval, x_2 on the next, x_3 on another. Fix an interval on which B&B branches on x_2. Within it, the scores at the children Q_2^- and Q_2^+ (e.g., μ · score_1(Q_2^-, 1) + (1 − μ) · score_2(Q_2^-, 1)) are again linear in μ, so the interval subdivides further: on one sub-interval B&B branches on x_2 then x_3, on another it branches on x_2 then x_1.

Proof idea:
• Continue dividing [0,1] into intervals such that within each interval the variable-selection order is fixed
• We can subdivide only a finite number of times
• The proof follows by induction on tree depth
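The first subdivision step can be sketched numerically. The per-variable scores s1, s2 below are made-up numbers (not taken from the running example); since each combined score is linear in μ, the argmax can only change at pairwise crossing points:

```python
# Hypothetical per-variable scores under two rules at one B&B leaf.
s1 = [3.0, 1.0, 2.5]   # score_1(Q, i) for i = 0, 1, 2  (made-up numbers)
s2 = [0.5, 4.0, 2.0]   # score_2(Q, i)

def argmax_intervals(s1, s2, eps=1e-9):
    """Partition [0, 1] into intervals on which the same variable
    maximizes mu*s1[i] + (1-mu)*s2[i]."""
    # Candidate breakpoints: crossings of the linear functions f_i(mu).
    pts = {0.0, 1.0}
    n = len(s1)
    for i in range(n):
        for j in range(i + 1, n):
            denom = (s1[i] - s2[i]) - (s1[j] - s2[j])
            if abs(denom) > eps:
                mu = (s2[j] - s2[i]) / denom
                if 0.0 < mu < 1.0:
                    pts.add(mu)
    pts = sorted(pts)
    out = []
    for a, b in zip(pts, pts[1:]):
        mid = (a + b) / 2   # the argmax is constant strictly inside (a, b)
        best = max(range(n), key=lambda i: mid * s1[i] + (1 - mid) * s2[i])
        out.append(((a, b), best))
    return out

for (a, b), i in argmax_intervals(s1, s2):
    print(f"mu in [{a:.3f}, {b:.3f}]: branch on x_{i}")
```

On these numbers the sweep yields three regimes (one variable for small μ, another for intermediate μ, a third for large μ), mirroring the picture in the lemma.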
Learning algorithm

Input: a set of IPs sampled from a distribution D

For each IP, set μ = 0. While μ < 1:
1. Run B&B using μ · score_1 + (1 − μ) · score_2, resulting in a tree T
2. Find the interval [μ, μ') such that if B&B is run using the scoring rule μ'' · score_1 + (1 − μ'') · score_2 for any μ'' ∈ [μ, μ'), B&B will build the same tree T (takes a little bookkeeping)
3. Set μ = μ'

Return: any μ̂ from the interval minimizing average tree size
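The exact algorithm finds the intervals analytically via bookkeeping. As a rougher empirical illustration (not the authors' procedure), the sketch below runs B&B with the linear rule on the running knapsack example over a grid of μ values; tree size varies with μ in a piecewise-constant way. The greedy LP bound and the handling of infeasible children (scored with the full gap) are simplifying assumptions for this toy.

```python
from fractions import Fraction

values  = [40, 60, 10, 10, 3, 20, 60]
weights = [40, 50, 30, 10, 10, 40, 30]
capacity = 100
n = len(values)

def lp(fixed):
    """Greedy fractional bound; returns (objective, is_integral) or None."""
    cap = Fraction(capacity - sum(weights[i] for i, v in fixed.items() if v))
    if cap < 0:
        return None                      # fixed assignment already infeasible
    obj = Fraction(sum(values[i] for i, v in fixed.items() if v))
    integral = True
    for i in sorted((i for i in range(n) if i not in fixed),
                    key=lambda i: Fraction(values[i], weights[i]), reverse=True):
        take = min(Fraction(weights[i]), cap)
        if 0 < take < weights[i]:
            integral = False             # this variable is fractional in the LP
        obj += values[i] * take / weights[i]
        cap -= take
    return obj, integral

def bnb_linear_rule(mu):
    """B&B branching via the linear rule mu*min(...) + (1-mu)*max(...)."""
    best, nodes, stack = 0, 0, [dict()]
    while stack:
        fixed = stack.pop()
        nodes += 1
        res = lp(fixed)
        if res is None:
            continue                     # fathom: infeasible
        obj, integral = res
        if obj <= best:
            continue                     # fathom: bound no better than incumbent
        if integral:
            best = int(obj)              # fathom: LP solution is integral
            continue
        def decrease(i, v):              # LP-bound decrease of the child Q_i^v
            child = lp({**fixed, i: v})
            return obj - child[0] if child else obj
        cand = [i for i in range(n) if i not in fixed]
        best_var = max(cand, key=lambda i:
                       mu * min(decrease(i, 0), decrease(i, 1))
                       + (1 - mu) * max(decrease(i, 0), decrease(i, 1)))
        stack.append({**fixed, best_var: 0})
        stack.append({**fixed, best_var: 1})
    return best, nodes

bests, sizes = [], []
for k in range(21):
    b, t = bnb_linear_rule(k / 20)
    bests.append(b)
    sizes.append(t)
print("distinct tree sizes over the mu grid:", sorted(set(sizes)))
```

Every run still returns the optimal value 133; only the tree size changes with μ.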
Learning algorithm guarantees

Let μ̂ be the algorithm's output given Õ(κ³/ε² · ln(# variables)) samples. W.h.p.,
E_{Q∼D}[tree-size(Q, μ̂)] − min_{μ∈[0,1]} E_{Q∼D}[tree-size(Q, μ)] < ε

Proof intuition: bound the algorithm class's intrinsic complexity (IC)
• The lemma bounds the number of "truly different" parameters
• Parameters that are "the same" come from a simple set
Learning theory allows us to translate IC into sample complexity.
Outline
1. Introduction
2. Branch-and-Bound
3. Learning algorithms
   a. First try: Discretization
   b. Our Approach
      i. Single-parameter settings
      ii. Multi-parameter settings
4. Experiments
5. Conclusion and Future Directions
Useful lemma: higher dimensions

Lemma: For any d scoring rules and any IP, a set H of O((# variables)^{κ+2}) hyperplanes partitions [0,1]^d such that for any connected component R of [0,1]^d ∖ H, B&B builds the same tree across all μ ∈ R.
Learning-theoretic guarantees

Fix d scoring rules and draw samples Q_1, …, Q_m ∼ D.
If m = Õ(κ³/ε² · ln(d · # variables)), then w.h.p., for all μ ∈ [0,1]^d,
|1/m · Σ_{i=1}^m tree-size(Q_i, μ) − E_{Q∼D}[tree-size(Q, μ)]| < ε

Average tree size generalizes to expected tree size.
Outline
1. Introduction
2. Branch-and-Bound
3. Learning algorithms
4. Experiments
5. Conclusion and Future Directions
Experiments: Tuning the linear rule

Let score_1(Q, i) = min(c_Q − c_{Q_i^-}, c_Q − c_{Q_i^+})
and score_2(Q, i) = max(c_Q − c_{Q_i^-}, c_Q − c_{Q_i^+}).

Our parameterized rule: branch on the variable x_i maximizing
score(Q, i) = μ · score_1(Q, i) + (1 − μ) · score_2(Q, i)

This is the linear rule [Linderoth & Savelsbergh, 1999].
Experiments: Combinatorial auctions

Leyton-Brown, Pearson, and Shoham. Towards a universal test suite for combinatorial auction algorithms. In Proceedings of the Conference on Electronic Commerce (EC), 2000.

"Regions" generator: 400 bids, 200 goods, 100 instances
"Arbitrary" generator: 200 bids, 100 goods, 100 instances
Additional experiments

Facility location: 70 facilities, 70 customers, 500 instances
Clustering: 5 clusters, 35 nodes, 500 instances
Agnostically learning linear separators: 50 points in R², 500 instances
Outline
1. Introduction
2. Branch-and-Bound
3. Learning algorithms
4. Experiments
5. Conclusion and Future Directions
Conclusion

• Study B&B, a widely used algorithm for combinatorial problems
• Show how to use ML to weight variable selection rules
• First sample complexity bounds for tree search algorithm configuration
  • Unlike prior work [Khalil et al. '16; Alvarez et al. '17], which is purely empirical
• Empirically show our approach can dramatically shrink tree size
  • We prove this improvement can even be exponential
• Theory applies to other tree search algorithms, e.g., for solving CSPs
Future directions

• How can we train faster?
  • We don't want to build every tree B&B will make for every training instance
  • Can we train on small IPs and then apply the learned policies on large IPs?
• What other tree-building applications can we apply our techniques to?
  • E.g., building decision trees and taxonomies
• How can we attack other learning problems in B&B?
  • E.g., node-selection policies
Thank you! Questions?