
Designing Bandit-Based Combinatorial Optimization Algorithms

A dissertation submitted to the University of Manchester

For the degree of Master of Science
In the Faculty of Engineering and Physical Sciences

2013

By Sarah Nogueira

School of Computer Science


Contents

1 Abstract
2 Declaration
3 Intellectual property statement
4 Introduction
5 Background and theory
  5.1 Search space
  5.2 Local search
    5.2.1 Neighbourhood
    5.2.2 Local optimum
    5.2.3 Exploitation vs Exploration
    5.2.4 Hill-climbers
    5.2.5 Simulated-annealing
  5.3 Multi-armed bandit (MAB) problems and Thompson sampling
    5.3.1 The multi-armed bandit (MAB) problem
    5.3.2 Thompson sampling
    5.3.3 Contextual bandit algorithms
6 Research and experimental methods
  6.1 Bandit-based search methods
    6.1.1 Learning the action at a locus
    6.1.2 Learning the locus to consider
  6.2 Benchmark problems
    6.2.1 NK-landscapes (NKL)
    6.2.2 Hierarchical If-and-only-if (H-IFF)
    6.2.3 The MAX-SAT problem
    6.2.4 The 0/1 knapsack problem
7 Results and discussion
  7.1 Measures of performance
    7.1.1 Parameter settings
    7.1.2 Results on NKL
    7.1.3 Results on the H-IFF problem
    7.1.4 Results on the SAT problem
    7.1.5 Results on the 0/1 knapsack problem
  7.2 Statistical tests
    7.2.1 Summary
    7.2.2 The Wilcoxon signed-rank test
8 Conclusions
  8.1 Future avenues for investigation
    8.1.1 Bandit-based simulated-annealing
    8.1.2 The reward distribution
    8.1.3 The change-point
A Thompson sampling for Contextual bandits

WORD COUNT: 15452

1 Abstract

In this dissertation, we aim at improving local search methods for combinatorial optimization problems. As local search techniques suffer from being attracted to local optima, we attempt to counter this weakness on search spaces made of binary variables using two machine learning approaches. The first approach consists in modelling the choice of the bit (0 or 1) assigned to a variable as a Bernoulli two-armed bandit problem, so that local search methods learn which value is best to assign to a given variable. The second consists in modelling the choice of the variable to be changed as a Bernoulli multi-armed bandit (MAB) problem, so that local search methods learn which variable should be modified at a given stage. We use two different Thompson sampling algorithms: the first is adapted to dynamic environments to solve the MAB problem, while the second uses the context of the variables to solve it. We study the behaviour of the resulting hybrid hill-climbers on four benchmark problems: the NK-landscapes, the Hierarchical If-and-only-if problem, the MAX-SAT problem and the 0/1 knapsack problem. Although the results obtained on these problems are not conclusive, considering the observations made, we propose possible future adaptations of the Thompson sampling algorithm used, so that it better fits our model. We also provide alternative algorithms that could be considered at the next stage to improve our results.


2 Declaration

I declare that no portion of the work referred to in the dissertation has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.


3 Intellectual property statement

i The author of this dissertation (including any appendices and/or schedules to this dissertation) owns certain copyright or related rights in it (the “Copyright”) and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii Copies of this dissertation, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has entered into. This page must form part of any such copies made.

iii The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the “Intellectual Property”) and any reproductions of copyright works in the dissertation, for example graphs and tables (“Reproductions”), which may be described in this dissertation, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv Further information on the conditions under which disclosure, publication and commercialisation of this dissertation, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant Dissertation restriction declarations deposited in the University Library, The University Library’s regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University’s Guidance for the Presentation of Dissertations.


4 Introduction


In combinatorial optimization, we aim at finding the set of parameters that maximizes (or minimizes) an objective (also called fitness) among a set of possible parameter settings. We can picture the fitnesses of all possible parameter settings as a landscape in which we want to find the location of the highest hill (the global optimum). Local search techniques are metaheuristic methods that make small steps in the search space to find the global optimum. Although they are known to be efficient methods to find the global optimum on some landscapes, they have the tendency to get “stuck” in the search space when reaching the top of a hill that is not the highest one (a local optimum).


To each move that a local search method makes in the search space, we could associate a reward reflecting the fitness improvement the action taken has brought. The multi-armed bandit (MAB) problem [9] consists in finding a strategy that maximizes the cumulative expected reward, where a strategy is the action to take (or the arm to pull). Section 5 provides the background necessary to understand the results and discussions of this dissertation. On the one hand, it gives general knowledge on local search methods and on two of these methods, the hill-climber and simulated-annealing. On the other hand, it defines the MAB problem and provides details on Thompson sampling [11], a heuristic that addresses it. Section 6 gives the solution we propose to improve the hill-climber and the simulated-annealing, and defines four benchmark problems (the NK-landscapes [6], the Hierarchical If-and-only-if problem (H-IFF) [14], the MAX-SAT problem [15] and the 0/1 knapsack problem [16]). In section 7, we run our algorithms on several instances of these problems and comment on the results obtained. Section 8 gives the overall conclusions of the project and the future avenues of investigation.


5 Background and theory

Combinatorial optimization aims at finding a set of discrete parameters that maximize (or minimize) a set of objectives (or costs). In such problems, an exhaustive search is often infeasible, and a trade-off between the quality of a solution and the computational complexity of search techniques has to be made. The travelling salesman problem (TSP) is a typical benchmark combinatorial problem where a salesman has to find the shortest path to visit exactly once every city of a given list. Local search methods are among the most successful heuristics to solve it. In this report, we will give some more background to these problems and methods. Section 5.1 defines the search space of an optimization problem, section 5.2 presents the concepts of local search methods and focuses on two of them, the hill-climbing algorithm and simulated-annealing (SA). The reference used for a general introduction to these concepts is given in [1]. Finally, section 5.3 defines the multi-armed bandit problem in the Bernoulli case and a heuristic called Thompson sampling that addresses the problem.

5.1 Search space

The search space of a combinatorial problem consists of all the possible sets of parameters. Hereafter, we will assume that we have N discrete parameters. The feasible search space of a combinatorial problem consists of the regions of the search space that constitute a candidate solution to the problem, meaning that they meet all the constraints of the problem.

5.2 Local search

Local search techniques consist in locally searching the direction to take in the search space to find an optimum. They start from a point in the search space and look for solutions by making small steps in the neighbourhood of the current point.

Without loss of generality, we suppose that we want to maximize a single evaluation function eval (also called the objective function) that returns the fitness associated with a point of the search space. We chose to represent a point in the search space by a string x of N integers representing the set of N parameters (or variables). We call the string that maximizes the evaluation function the optimal string. In the binary case, we thus have 2^N possible strings that form the feasible search space F, which is exponential in the input size of the problem.

5.2.1 Neighbourhood

There are several ways of defining the neighbourhood of a point x in the feasible search space F. One of them, called the ε-radius neighbourhood, is the set of points defined as follows:

{x′ ∈ F : dist(x, x′) ≤ ε},

where dist is a distance function defined over F and ε ≥ 0. For vectors of discrete values, the distance commonly used is the Hamming distance:

∀x, x′ ∈ F,  Hamming(x, x′) = Σ_{i=1}^{N} δ(x_i ≠ x′_i)

where δ is the indicator function, returning one if its argument is true and zero otherwise. In other words, the Hamming distance between two points of our search space is the number of variables that do not have the same value.
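To make these definitions concrete, here is a minimal Python sketch (the function names are ours, not the dissertation's) of the Hamming distance and of the ε-radius neighbourhood over binary strings:

from itertools import combinations

def hamming(x, y):
    """Number of positions at which two binary strings differ."""
    assert len(x) == len(y)
    return sum(xi != yi for xi, yi in zip(x, y))

def eps_neighbourhood(x, eps):
    """All binary strings within Hamming distance eps of x (excluding x itself)."""
    n = len(x)
    neighbours = []
    for k in range(1, eps + 1):
        for loci in combinations(range(n), k):
            y = list(x)
            for l in loci:
                y[l] = 1 - y[l]          # flip the bits at the chosen loci
            neighbours.append(tuple(y))
    return neighbours

# Example: the 1-radius neighbourhood of 010 is {110, 000, 011}.
print(eps_neighbourhood((0, 1, 0), 1))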

5.2.2 Local optimum

A point of the feasible search space is said to be a local optimum when it is the best point of its neighbourhood. As local search algorithms only look for candidate solutions in the neighbourhood of the current point, they always return a local optimum.

5.2.3 Exploitation vs Exploration

Local search techniques are a trade-off between the exploration of the search space and the exploitation of the solution found so far. This exploitation versus exploration (EvE) problem ([2]) can be explained as follows: as mentioned in section 5.1, by exploring the whole search space, we are ensured to find the global solution(s) of a problem. But exploring the whole feasible search space is very time-consuming, as we have to evaluate all the points of the search space to find the best one(s). The exploration ratio is defined by the size chosen for our neighbourhood: the larger the neighbourhood, the greater the exploration. As the size of the neighbourhood increases, the algorithm is less likely to get “stuck” in a local optimum, but at the same time the time necessary to find a solution increases. We can also point out that, even if the hill-climber always returns a local optimum, the solution returned is “less local” when the size of the neighbourhood increases, as it is the best of its neighbourhood.

Sections 5.2.4 and 5.2.5 present two local search algorithms in the binary case (meaning that each one of the N parameters can take the value 0 or 1). We will refer to the index of a variable of a point (or string) of the search space as its locus.


5.2.4 Hill-climbers

Traditional hill-climbing algorithm

Algorithm 1 gives us the traditional hill-climbing algorithm. It starts by (randomly) choosing a current point in the search space, and iteratively gets closer to a solution by making “small steps” in the search space. The algorithm always takes the best move in the neighbourhood of the current point until there is no improvement left.

Algorithm 1 Traditional hill-climber algorithm.

1: Initialize best (a vector of N bits) randomly
2: local ← false
3: repeat
4:   select x, the best neighbour of best (the one that maximizes the evaluation function)
5:   if eval(x) ≥ eval(best) then
6:     best ← x
7:   else
8:     local ← true
9:   end if
10: until local
11: return best
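For illustration, the following Python sketch is one possible reading of algorithm 1 with a 1-bit-flip neighbourhood. The function and argument names (eval_fn, n) are ours, and strict improvement is required in the acceptance test so that the loop is guaranteed to terminate on plateaus:

import random

def hill_climber(eval_fn, n, rng=random):
    """Traditional hill-climber with a 1-bit-flip neighbourhood.
    eval_fn maps a list of N bits to a fitness value to be maximized."""
    best = [rng.randint(0, 1) for _ in range(n)]
    best_fit = eval_fn(best)
    while True:
        # Scan the N neighbours obtained by flipping one bit and keep the best one.
        candidates = []
        for locus in range(n):
            x = best[:]
            x[locus] = 1 - x[locus]
            candidates.append((eval_fn(x), x))
        fit, x = max(candidates, key=lambda c: c[0])
        if fit > best_fit:               # strict improvement, unlike the ">=" of the pseudocode
            best, best_fit = x, fit
        else:
            return best, best_fit        # no improving neighbour: local optimum reached

# Example on a toy "one-max" landscape (fitness = number of ones).
print(hill_climber(sum, 10))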

The drawback of such an algorithm is that the string returned strongly depends on the initial point: if the initial point is not located on the hill of a global optimum, the algorithm will always return the nearest local optimum. Also, as in many problems the global optimum is unknown, there is no way to know how far the evaluation of the returned solution is from that of the global optimum. Another drawback is that the algorithm scans all the neighbours of the current point to find the best one. In the case where we take a neighbourhood of size 1 (i.e. all the points that differ from the current point by only 1 bit), the algorithm evaluates each one of the N resulting neighbours. The next two sections present possible improvements, respectively to find a better point (i.e. a point with greater fitness) and to reduce the computational complexity.

Iterated hill-climber

The simplest solution to improve the traditional hill-climber is to re-iterate the procedure from different (random) starting points and to take the best solution (the one that maximizes the evaluation function). This way, we increase our chances of reaching a good solution. But this procedure does not guarantee finding the global optimum and does not really escape local optima either.


Random-mutation hill-climbers

Algorithm 2 is the random-mutation hill-climbing (RMHC) algorithm, first introduced by Forrest and Mitchell [4]. The size of the neighbourhood is set to 1 (as we flip only one bit of the current solution). The main difference with the traditional hill-climber (algorithm 1) is that, at each iteration, it does not necessarily move in the best direction. Indeed, it randomly selects a point from the neighbourhood of the current point and accepts it only if the point is better than the current point (with respect to the evaluation function); so it might not select the best point of the neighbourhood. As a result, the algorithm explores regions of the search space it would not have explored otherwise. The random aspect of the algorithm improves its chances of reaching a better solution. This algorithm can also be used in an iterated way, as described in the previous section, for better results. Another difference is that we stop the algorithm after a maximal number of iterations MAX_IT. Indeed, after a certain number of iterations, the algorithm will find no more improvements in the neighbourhood of the current point, independently of the neighbour randomly selected. This parameter has to be well tuned to make sure the algorithm stops after reaching a local optimum.

Algorithm 2 RMHC algorithm.

1: Initialize xcurrent (a vector of N bits) randomly
2: for t = 1 to MAX_IT do
3:   select a locus l randomly in xcurrent
4:   if flipping the bit at locus l is better then
5:     flip the bit at locus l
6:   end if
7: end for
8: return xcurrent
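A compact Python sketch of the RMHC loop, under the same conventions as above (eval_fn, n and max_it are hypothetical argument names), could look as follows:

import random

def rmhc(eval_fn, n, max_it, rng=random):
    """Random-mutation hill-climber: flip one randomly chosen bit per iteration
    and keep the flip only if it strictly improves the fitness (equal-fitness
    moves could also be accepted, as in the original RMHC of [4])."""
    x = [rng.randint(0, 1) for _ in range(n)]
    fit = eval_fn(x)
    for _ in range(max_it):
        locus = rng.randrange(n)
        x[locus] = 1 - x[locus]          # tentative flip
        new_fit = eval_fn(x)
        if new_fit > fit:
            fit = new_fit                # keep the improving move
        else:
            x[locus] = 1 - x[locus]      # undo the flip
    return x, fit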

Stochastic hill-climbers

Algorithm 3 gives us the general procedure of a stochastic hill-climber (as described in [1]).

Algorithm 3 Stochastic hill-climbing algorithm.

1: Initialize the current point xcurrent (a vector of N bits) randomly
2: for t = 1 to MAX_IT do
3:   select a point xnew from the neighbourhood of xcurrent
4:   δ ← eval(xcurrent) − eval(xnew)
5:   accept xnew as the new current point with probability p = 1/(1 + exp(δ/T))
6: end for
7: return xcurrent

As in the RMHC algorithm, we select a point randomly in the neighbourhood of the current point, but we do not accept it only when it brings an improvement. Indeed, we accept a point with a probability p depending on δ = eval(xcurrent) − eval(xnew), which is the relative merit of the current point and the new point. We can point out the following facts:
• When δ = 0 (i.e. when eval(xcurrent) = eval(xnew)), p = 0.5, which means that the decision on accepting the new point is random;
• When δ < 0 (i.e. when the new point is better than the current point), the new point is accepted with a probability p > 0.5;
• When δ > 0, the new point is accepted with a probability p < 0.5;
• As δ increases, the chances of accepting the new point decrease;
• p is always strictly positive, which means that even the worst point can be selected as the new current point.
The underlying idea in this algorithm is that, by assigning a non-null probability to accepting points worse than the current one, we increase our chances of escaping local optima: we might sometimes take wrong directions in the search space to avoid being attracted into the area of a local optimum. We can also point out that the probability p depends on a constant T. The choice of T controls the importance given to the relative merit δ of the current point and the new point.
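As a small numerical illustration of this acceptance rule (not code from the dissertation), the probability p and its behaviour for negative, null and positive δ can be checked directly:

import math, random

def accept_probability(delta, T):
    """Logistic acceptance rule of the stochastic hill-climber,
    with delta = eval(x_current) - eval(x_new)."""
    return 1.0 / (1.0 + math.exp(delta / T))

# With T = 1: a better candidate (delta < 0) is accepted with p > 0.5,
# an equal one with p = 0.5, and a worse one (delta > 0) with p < 0.5.
for delta in (-2.0, 0.0, 2.0):
    print(delta, round(accept_probability(delta, T=1.0), 3))

def stochastic_accept(delta, T, rng=random):
    """Accept the candidate with probability p."""
    return rng.random() < accept_probability(delta, T)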

Another state-of-the-art combinatorial optimization algorithm that relies on a mathematical heuristic is described in section 5.2.5.

5.2.5 Simulated-annealing

Simulated-annealing is a heuristic-based local search technique that aims at finding a better trade-off between exploration and exploitation of the search space. Algorithm 4 gives us the procedure of simulated-annealing (as expressed in [1]). As pointed out in [1], “the main difference between the stochastic hill-climber and the simulated-annealing is that the latter changes the parameter T [referred to as the temperature parameter], making the procedure similar to a purely random search and gradually decreases the parameter T during the run”. The inputs of the algorithm are the starting temperature Tmax, the final temperature Tmin (also called the frozen temperature) and the cooling function g(t, Tmax) that gives the temperature at time step t. Algorithm 4 uses a geometric cooling function g(t, Tmax) = Tmax · r^t, where the constant r is the cooling ratio, but other cooling functions can be used.

The probability of accepting a candidate point that does not improve the fitness is equal to p = exp(−δ/T), where δ > 0 is the fitness gap between the current point and the candidate point and T is the current temperature. At each time step, as the temperature decreases, the probability of accepting a point that does not improve the fitness decreases, and the algorithm explores the search space less and less in favour of points that improve the fitness in the neighbourhood of the current point. When the starting point (which is randomly chosen) is located in the basin of attraction of a local optimum, the algorithm improves the chances to escape it and to explore more promising regions of the search space.


Algorithm 4 simulated-annealing algorithm.

1: INPUTS: Tmax, Tmin and the cooling ratio r
2: Initialize T (T ← Tmax)
3: Initialize the current point xcurrent randomly
4: repeat
5:   select a point xnew from the neighbourhood of xcurrent
6:   δ ← eval(xcurrent) − eval(xnew)
7:   if δ ≤ 0 then
8:     we accept xnew (xcurrent ← xnew)
9:   else if random[0; 1] < exp(−δ/T) then
10:    we accept xnew (xcurrent ← xnew)
11:  end if
12:  T ← T ∗ r
13: until T ≤ Tmin
14: return xcurrent
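A minimal Python sketch of algorithm 4, assuming a 1-bit-flip neighbourhood and the geometric cooling schedule (all function and parameter names are ours):

import math, random

def simulated_annealing(eval_fn, n, t_max, t_min, r, rng=random):
    """Simulated-annealing with a 1-bit-flip neighbourhood and geometric
    cooling T <- T * r (0 < r < 1)."""
    x = [rng.randint(0, 1) for _ in range(n)]
    fit = eval_fn(x)
    T = t_max
    while T > t_min:
        locus = rng.randrange(n)
        x[locus] = 1 - x[locus]                  # candidate = one bit flipped
        new_fit = eval_fn(x)
        delta = fit - new_fit                    # delta <= 0 means improvement
        if delta <= 0 or rng.random() < math.exp(-delta / T):
            fit = new_fit                        # accept the candidate
        else:
            x[locus] = 1 - x[locus]              # reject: undo the flip
        T *= r
    return x, fit

# Example run on the one-max landscape.
print(simulated_annealing(sum, 20, t_max=2.0, t_min=1e-3, r=0.995))

With r close to 1 the temperature decreases slowly, so the early iterations behave almost like a random walk before the search becomes greedy.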

5.3 Multi-armed bandit (MAB) problems and Thompson sampling

The multi-armed bandit problem refers to the problem a gambler faces when choosing which arm to pull on a slot machine. Indeed, if we consider a set of K arms (representing K slots), each arm produces a reward following a probability distribution that is initially unknown. The aim is to find a strategy that maximizes the expected cumulative reward. Bandit algorithms are optimization heuristics that address the EvE problem. Section 5.3.1 presents the Bernoulli MAB problem, and sections 5.3.2 and 5.3.3 present two heuristics to solve it, considering different possible cases. [9] gives a general introduction to the MAB problem and to the algorithms that address it.

5.3.1 The multi-armed bandit (MAB) problem

At each time step t, the player selects the arm with index j(t) and receives a reward xt. The goal is to find a strategy (i.e. which arm to pull at each time step) that maximizes the expected cumulative reward. In our case, we will only consider the Bernoulli multi-armed bandit problem. Each arm j is modelled by a random variable θj. We assume that pulling an arm has only two possible outcomes: either it is a success, in which case the arm receives a reward equal to 1, or it is a failure and the arm receives a reward equal to 0. We can then see each arm as a Bernoulli trial, where arm j has a probability θj of issuing a reward equal to 1 and otherwise has a reward equal to 0 (we can also point out that the parameter θj represents the mean reward, as the mean of a Bernoulli distribution is equal to that parameter). The MAB problem also assumes that the two arms are independent and that the model θ is stationary. Let Dj denote the past rewards of the jth arm. Consequently, P(θ|D) = P(θ1|D1) · P(θ2|D2). In section 5.3.2, we present Thompson sampling, which is a heuristic-based method to decide upon the best arm to pull at each time step.

5.3.2 Thompson sampling

As expressed in [10], from an exploitation perspective, we would seek to maximize the immediate reward

E(r|a) = ∫ E(r|a, θ) P(θ|D) dθ

where a = (a1, a2) is the set formed by the two arms. Thompson sampling is a “probability matching heuristic that consists in randomly selecting an action a according to its probability of being optimal”. [11] defines this probability as:

P(aj = a∗) = ∫ I(aj = a∗|θ) P(θ|D) dθ,

where a∗ is the optimal arm. In this section, we will present the Thompson sampling used in our algorithm. All the concepts presented here, the algorithms and the implementation used are the ones given in [11]. The next two subsections respectively discuss the stationary case and the dynamic case (when the probability distribution of the arm model θ changes over time).

The stationary case

The Thompson sampling algorithm starts by sampling the parameters θj uniformly in [0; 1] and selects the arm with the biggest θj for the first pull. It then updates the history of past rewards D and the success and failure counters for each arm j, αj = #{reward = 1} + 1 and βj = #{reward = 0} + 1. According to Bayes’ rule, we can write the posterior distribution P(θ|D) as follows:

P(θj|Dj) ∝ P(xt|aj, θj)P(θj)

where P(xt|aj, θj) is the parametric likelihood function modelling the likelihood of observing the reward xt at time t, knowing that we take action aj (i.e. we pull the jth arm) and knowing the model θj; Dj is the set of past rewards for arm j and P(θj) is the prior distribution. As the conjugate prior of a Bernoulli distribution is a Beta distribution and as the likelihood follows a Bernoulli distribution, if we assume a Beta distribution for the prior θ ∼ B(α, β) (prior knowledge), then the posterior distribution θ|D will also follow a Beta distribution. As the arms are independent, we can sample P(θ|D) by sampling each P(θj|Dj) from the distribution Beta(αj, βj), then pull the arm with the largest mean reward θj and update the hyper-parameters αj and βj for each arm.

We can point out that uniformly sampling the parameter θj in [0; 1] at the beginning is equivalent to sampling it from a Beta(1, 1) distribution. This method allows us to efficiently estimate the mean θj after a certain number of iterations (because a certain number of data-points is needed to reach a good estimation of θj). In other words, the algorithm iteratively learns the parameter θj of the Bernoulli distribution (it improves its estimation at every iteration) and, consequently, it learns the probability of having a positive reward for arm j. Algorithm 5 gives us the pseudo-code of the algorithm.


Algorithm 5 Thompson sampling algorithm

1: Initialize α = 1 and β = 1
2: Initialize each P(θj) with a Beta(α, β) = Beta(1, 1) distribution (equivalent to a uniform distribution)
3: Initialize the success and failure counters Sj = 0 and Fj = 0 for every arm j
4: for t = 1 to max_iter do
5:   for j = 1 to K do
6:     sample θj from Beta(1 + Sj, 1 + Fj)
7:   end for
8:   select the arm j with the maximal sampled value θj
9:   set Sj ← Sj + 1 if pulling arm j was a success and Fj ← Fj + 1 otherwise
10:  update the distribution of arm j according to Beta(α + Sj, β + Fj)
11: end for
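For reference, a minimal Python sketch of this stationary Bernoulli Thompson sampling loop for K arms (the pull callback and all names are ours; random.betavariate samples from the Beta posteriors):

import random

def thompson_sampling(pull, n_arms, max_iter, rng=random):
    """Stationary Bernoulli Thompson sampling.
    pull(j) must return a reward of 0 or 1 for arm j; each arm j keeps a
    Beta(1 + S_j, 1 + F_j) posterior over its success probability theta_j."""
    successes = [0] * n_arms
    failures = [0] * n_arms
    for _ in range(max_iter):
        # Sample one theta_j per arm from its current Beta posterior ...
        samples = [rng.betavariate(1 + successes[j], 1 + failures[j])
                   for j in range(n_arms)]
        # ... and pull the arm whose sample is the largest.
        j = max(range(n_arms), key=lambda a: samples[a])
        if pull(j) == 1:
            successes[j] += 1
        else:
            failures[j] += 1
    return successes, failures

# Toy example: two Bernoulli arms with success probabilities 0.2 and 0.7.
probs = [0.2, 0.7]
s, f = thompson_sampling(lambda j: int(random.random() < probs[j]), 2, 1000)
print(s, f)   # the second arm should have been pulled far more often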

The more an arm is pulled, the more concentrated its Beta distribution gets and the more likely we are to sample close to the real mean and therefore pull the best arm. In most real-world cases, the reward distribution is non-stationary and switches. The next subsection gives us a model for dynamic environments and an adaptation of algorithm 5.

Dynamic environments

In this section, we assume that the environment changes over time. More specifically, we assume that the reward distribution switches at a constant rate γ. We will consider two cases: either both arms of our model switch at the same time (we will refer to this model as the global switching model), or each arm switches independently from the other arm (referred to as the per-arm switching model). In the global switching model, we assume that at each time step there is a probability γ that both arms switch. When a switch occurs, both arms re-start sampling their mean reward θj from a Beta(1, 1) distribution. In the per-arm model, at each time step, each arm has a probability γ of switching. In both cases, we do not know the run-length rt (which is the number of time steps since the last switch for an arm at time t). As explained in [11], if we call Dt−1 the history of rewards and arms pulled so far, we have:

P(θ|Dt−1) = Σ_{rt} P(θ|Dt−1, rt) P(rt|Dt−1)

where P(θ|Dt−1, rt) is the posterior of our model θ given the data and P(rt|Dt−1) is the run-length probability. “Now to sample from P(θ|Dt−1), we just need to sample from the P(rt|Dt−1) [...] and then given that run-length, sample from P(θ|Dt−1, rt) to arrive at our arm model θ” ([11]). As the input parameter γ is unknown, [11] also gives us an implementation that learns the switching rate γ. This model will be referred to as the non-parametric model and, to simplify the algorithm, the switching rate is assumed constant and unique (for both arms). The paper considers the occurrence of a switching point (also called a change-point) as a Bernoulli variable, where a success corresponds to the occurrence of a change-point and a failure to its non-occurrence. These counts are used as the hyper-parameters of a Beta distribution and the switching rate is sampled as seen previously for the reward and run-length distributions.

As a result, we have four change-point Thompson sampling (CTS) algorithms that we will refer to as:
• the Global model, for the parametric global model;
• the NP Global model, for the non-parametric global model;
• the PACTS model, for the parametric per-arm model;
• and the NP PACTS model, for the non-parametric per-arm model.
The implementation of the CTS algorithms that is given also performs stratified re-sampling of the run-lengths. Indeed, the algorithm stores the hyper-parameters of the Beta distributions associated with each arm for M different run-lengths (hereafter, we will refer to the parameter M as the maximum run-length). When the maximum number of run-lengths stored is reached, the algorithm deletes the M′ Beta distributions associated with the M′ run-lengths that are the least likely to occur. This technique allows the algorithm to limit the number of hyper-parameters stored (which can become very large as the algorithm learns the rewards returned by the actions taken) while keeping in memory the hyper-parameters associated with the most probable run-lengths.
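The Python sketch below illustrates only the idea of the global switching model with a known, constant switching rate γ: one hypothesis is kept per run-length, each carrying its own Beta counts, and unlikely run-lengths are simply pruned instead of being stratified-resampled. It is a simplified illustration of the mechanism, not the implementation of [11] used in this project:

import random

class GlobalSwitchingTS:
    """Simplified global-switching Thompson sampling sketch (known gamma).
    Each hypothesis is [weight, successes per arm, failures per arm], where the
    weight approximates the posterior probability of that run-length."""

    def __init__(self, n_arms, gamma, max_hyps=50, rng=random):
        self.n_arms, self.gamma, self.max_hyps, self.rng = n_arms, gamma, max_hyps, rng
        self.hyps = [[1.0, [0] * n_arms, [0] * n_arms]]

    def select_arm(self):
        # Sample a run-length hypothesis according to its weight ...
        _weight, s, f = self.rng.choices(self.hyps, weights=[h[0] for h in self.hyps])[0]
        # ... then do ordinary Thompson sampling with that hypothesis's counts.
        samples = [self.rng.betavariate(1 + s[j], 1 + f[j]) for j in range(self.n_arms)]
        return max(range(self.n_arms), key=lambda j: samples[j])

    def update(self, arm, reward):
        grown, cp_mass = [], 0.0
        for w, s, f in self.hyps:
            # Predictive probability of the observed reward under this run's counts.
            p1 = (1 + s[arm]) / (2 + s[arm] + f[arm])
            pred = p1 if reward == 1 else 1 - p1
            cp_mass += w * self.gamma * 0.5      # switch: reward predicted by a fresh Beta(1,1)
            s2, f2 = s[:], f[:]
            if reward == 1: s2[arm] += 1
            else:           f2[arm] += 1
            grown.append([w * (1 - self.gamma) * pred, s2, f2])
        # New run-length-0 hypothesis: fresh counts plus the current observation.
        s0, f0 = [0] * self.n_arms, [0] * self.n_arms
        if reward == 1: s0[arm] += 1
        else:           f0[arm] += 1
        self.hyps = [[cp_mass, s0, f0]] + grown
        # Keep only the most probable hypotheses and renormalize the weights.
        self.hyps = sorted(self.hyps, key=lambda h: -h[0])[: self.max_hyps]
        total = sum(h[0] for h in self.hyps)
        for h in self.hyps:
            h[0] /= total

Pruning by weight is a cruder memory bound than the stratified re-sampling described above, but it keeps the sketch short while preserving the run-length bookkeeping.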

5.3.3 Contextual bandit algorithms

The contextual MAB problem

The main difference between the contextual MAB problem and the MAB problem we saw previously is that, when choosing an arm to pull, for each arm j the player is now given a context bj and the history Dj (which is now made of the past rewards and the past contexts of arm j). As expressed in [13], if we assume a context bj to be a vector of R^d and if we make the assumption of linear payoffs, the model θj associated with each arm j (which corresponds to the mean reward of arm j) will then be θj = bj^T µj, where µj is a vector of R^d that models the reward contribution of each element of the context. As previously, at each time step t, the learner will then pull the arm with the biggest mean reward θj(t) = bj(t)^T µj(t). We can see such contextual bandits as a generalization of the MAB problem we formulated previously. As said in [13], if we assume a Gaussian distribution N(bj^T µj, v^2) for the likelihood of the reward rj(t) (given the context bj(t) and the parameter µj), where the standard deviation v is a constant assumed by the problem, and if we also assume the prior distribution for µj at time t − 1 to be a Gaussian N(µj(t − 1), v^2 Bj(t − 1)^−1), then the posterior distribution at time t will be a Gaussian N(bj(t)^T µj(t), v^2 bj(t)^T Bj(t)^−1 bj(t)), where Bj(t) and µj(t) are defined as:

Bj(t) = Id + Σ_{τ=1}^{t−1} bj(τ) bj(τ)^T

µj(t) = Bj(t)^−1 ( Σ_{τ=1}^{t−1} bj(τ) rj(τ) )

As previously, at each time step t and for each arm j, given the context bj(t), the parameter µj(t) can be sampled from the prior distribution; we can then decide upon which arm to pull and update the Gaussian distribution with the observed reward. The associated Thompson sampling algorithm given in [13] is provided in appendix A.
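A small numpy sketch of the update and sampling steps for one arm, under the linear-payoff assumptions above (the class name and the value chosen for v are ours, and the algorithm of appendix A may differ in its details):

import numpy as np

class LinearThompsonArm:
    """One arm of a linear-payoff contextual bandit (sketch).
    Keeps B_j = I_d + sum b b^T and the running sum of b * r, so that
    mu_hat = B_j^{-1} sum(b r); at decision time a parameter mu_tilde is drawn
    from N(mu_hat, v^2 B_j^{-1}) and the estimated mean reward is b^T mu_tilde."""

    def __init__(self, d, v=0.25, rng=None):
        self.B = np.eye(d)
        self.br_sum = np.zeros(d)
        self.v = v
        self.rng = rng or np.random.default_rng()

    def sample_mean_reward(self, b):
        mu_hat = np.linalg.solve(self.B, self.br_sum)
        cov = self.v ** 2 * np.linalg.inv(self.B)
        mu_tilde = self.rng.multivariate_normal(mu_hat, cov)
        return float(b @ mu_tilde)

    def update(self, b, reward):
        self.B += np.outer(b, b)
        self.br_sum += reward * b

# Choosing between two arms given per-arm contexts (all values hypothetical):
d = 4
arms = [LinearThompsonArm(d) for _ in range(2)]
contexts = [np.random.rand(d), np.random.rand(d)]
j = max(range(2), key=lambda a: arms[a].sample_mean_reward(contexts[a]))
arms[j].update(contexts[j], reward=1.0)   # the reward would come from the environment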


6 Research and experimental methods

At each time step, the random-mutation hill-climber (RMHC) (cf. section 5.2.4) randomly selects a neighbour of the current point (by selecting a locus on the current string) and accepts it as the new current point if it improves the fitness. A possible improvement to this algorithm would be to learn from the previous steps made in the search space. In this perspective, we consider two modifications that could be brought to the RMHC algorithm to address the EvE problem: either, for each locus, we could learn which is the best bit value to be assigned (0 or 1), or we could learn which locus should be selected at each time step (instead of selecting it randomly). Section 6.1 presents these two bandit-based search methods and section 6.2 presents the benchmark problems on which we are going to observe the performance of our algorithms and compare it to that of the hill-climber.

6.1 Bandit-based search methods

This section studies three hybrid hill-climber algorithms: on the one hand, section 6.1.1 presents the learnBit and learnContext algorithms, which aim at improving the RMHC algorithm by learning which bit to assign to a given locus (using respectively the CTS algorithm (cf. section 5.3.2) and the contextual bandit algorithm (cf. section 5.3.3)); on the other hand, section 6.1.2 presents the learnLocus algorithm, which attempts to improve the hill-climber by learning which locus should be selected to be flipped (also using the CTS algorithm). Hereafter, we assume a neighbourhood of size 1, so that each candidate solution differs from the current point in only one bit.

6.1.1 Learning the action at a locus

The algorithms presented in this section randomly select the neighbour of the current point to consider as a candidate (in other words, the locus of the current point to be considered). We consider each one of the N binary variables as a two-armed bandit problem. When the algorithm selects a locus, the two-armed bandit associated with that locus indicates the action to take (i.e. whether the bit at that locus should be flipped or not: pulling the first arm corresponds to assigning the bit 0 and pulling the second arm corresponds to assigning the bit 1). As in the RMHC algorithm, if the action taken improves the fitness (meaning that the new current point has a greater fitness than the previous one), the new point is accepted as the current point and the reward associated with that action is equal to 1; otherwise the current point remains the same and the associated reward is equal to 0. We focused on two main approaches:
• In the first approach, we use the Thompson sampling CTS algorithm associated with that locus to indicate which arm should be pulled. As seen in section 5.3.2, this algorithm considers the landscape as a dynamic environment in which change-points might occur. We will consider two versions of this algorithm, which we will refer to as the learnBit1 and the learnBit2 algorithms. Both are presented in section 6.1.1.
• In the second approach, we use the contextual bandit algorithm associated with that locus to indicate the action to take. In this case, when a locus is selected, the algorithm is given the context of the locus (which is made of the bits of the N − 1 remaining loci of the candidate solution) and learns from this action in this context. As opposed to the first case, this approach thus considers a stationary environment. We will also consider two versions of this algorithm, which we will refer to as learnContext1 and learnContext2. Both are presented in section 6.1.1.

Using the CTS bandit algorithm

In algorithms 6 and 7, we associate a two-armed bandit with each of the N bits that form the set of parameters. We consider a maximal number of iterations that we will denote as MAX_IT. As in the RMHC algorithm, after randomly initializing the set of N parameters, at each iteration a locus is randomly selected. The two-armed bandit then gives us the action to take (setting the bit at the selected locus to 0 or 1). In the event that the action improves the fitness, we perform the action, the new string becomes the current candidate solution and a reward of 1 is associated with this action. Otherwise, the reward is set to 0.

As only two-armed bandits are used in these two algorithms, for a given run-length we will refer to the hyper-parameters α and β associated with arm 0 (which corresponds to the action that sets the bit at the associated locus to 0) as α0 and β0. Similarly, we will refer to the hyper-parameters of the two-armed bandit associated with arm 1 as α1 and β1. We can also recall that the hyper-parameter α for an arm corresponds to the number of successes obtained by pulling the arm plus 1, and that the hyper-parameter β for an arm corresponds to the number of failures obtained by pulling the arm plus 1. Thereby, when a reward of 0 is associated with an arm, it means that the hyper-parameter β is incremented by 1.

In algorithm 6 (called the learnBit1 algorithm), when the action given by the bandit model is equal to the current value of the bit, the algorithm learns from this action by associating a reward of 0 to the silent action. The goal, in the long term, is to avoid that for a given locus the same action is returned again and again. In theory, this configuration could help the algorithm to make steps in the search space: after a certain number of iterations, as a null reward would be associated with the silent action, the algorithm would finally end up returning the opposite action for the locus. Indeed, for a given locus l, if the bit at this locus in the current solution is 0 and the best action returned by the bandit is also 0, then associating a null reward to that silent action increments the hyper-parameter β0 (which is also a measure of the number of failures that result from pulling arm 0) by one. If, for the locus l, the silent action 0 is returned several times by the algorithm, the hyper-parameter β0 will increase accordingly, so that the value sampled from the Beta distribution associated with arm 1 will be more likely to be greater than the one sampled for arm 0. Indeed, we recall that the arm returned by the algorithm is the one with the greatest value sampled from its Beta distribution. In expectation, the value sampled from the Beta distribution associated with arm 0 is equal to α0/(α0 + β0). So as β0 increases, this value is more likely to be smaller than the value sampled from the Beta distribution associated with arm 1 (which in expectation is α1/(α1 + β1)).

An alternative to this algorithm would be to only learn from an action when the action is not the bit already assigned at a locus (i.e. when the action is not silent), as shown in algorithm 7 (called the learnBit2 algorithm). Indeed, even if the learnBit1 version presents the great advantage that it forces the algorithm to try to make steps in the search space, it can also present disadvantages. Firstly, at an early stage (i.e. after a small number of iterations), the fact that the action returned by the bandit is the same as the current bit assignment does not imply that the bandit associated with this locus needs to be updated: at the next iteration, the mean reward of each arm sampled from the Beta model of the arm might not be the same, and the algorithm can still return another action. Also, our algorithm does learn for which run-length the Beta model should be considered. As the run-length considered is also sampled, assigning a null reward when the action is the same as the current bit might not be the best choice. Finally, only learning from an action when it does change the current solution would be less computationally expensive, as every time the bandit learns from an action it resamples the Beta models of the M′ least probable run-lengths (assuming that the maximal number M of stored run-lengths has been reached). For these reasons, and as it seems that learning from an action that is equal to the current bit can have a great impact on the final solution, we also considered a version of the algorithm (algorithm 7) that only learns from actions that actually change the current candidate solution. These two versions of the algorithm will also be considered when using the contextual bandit model.


Algorithm 6 The learnBit1 algorithm.

1: for i = 1 to N do
2:   Associate a Beta model (two-armed bandit) with locus i
3: end for
4: Initialize xcurrent (a vector of N bits) randomly
5: i ← 1
6: while i ≤ MAX_IT do
7:   select a locus locus randomly in xcurrent
8:   action ← model.learner[locus]()
9:   if action == xcurrent[locus] then
10:    reward ← 0
11:  else
12:    i ← i + 1
13:    if taking the action improves the fitness then
14:      reward ← 1
15:      perform the action
16:    else
17:      reward ← 0
18:      do not perform the action
19:    end if
20:  end if
21:  Update the Beta model of locus according to reward
22: end while
23: return xcurrent


Algorithm 7 The learnBit2 algorithm.

1: for i = 1 to N do
2:   Associate a Beta model (two-armed bandit) with locus i
3: end for
4: Initialize xcurrent (a vector of N bits) randomly
5: i ← 1
6: while i ≤ MAX_IT do
7:   select a locus locus randomly in xcurrent
8:   action ← model.learner[locus]()
9:   if action ≠ xcurrent[locus] then
10:    i ← i + 1
11:    if taking the action improves the fitness then
12:      reward ← 1
13:      perform the action
14:    else
15:      reward ← 0
16:      do not perform the action
17:    end if
18:    Update the Beta model of locus according to reward
19:  end if
20: end while
21: return xcurrent

As described in these two algorithms, since we consider that an iteration corresponds to an evaluation of the fitness function, when the action selected for a given locus is already the bit at that same locus, the algorithm does not count it as an iteration. In theory, the learnBit1 algorithm terminates: as a reward of 0 is associated with a silent action, the opposite arm has a non-null probability of being pulled after a certain number of learning steps. Nevertheless, this process can take a large number of learning steps if the hyper-parameters α and β of the Beta model are largely unbalanced. As unbalanced parameters are the result of previous learning steps that do not improve the fitness, forcing the algorithm to sample the opposite action from the Beta distribution by artificially making up the gap between the hyper-parameters is not always a good approach. For these reasons, in order to run our algorithms in a sensible amount of time, we arbitrarily decided to terminate them when the learning steps consecutively return a silent action ⌊MAX_IT/4⌋ times (where ⌊.⌋ denotes the floor function). As the maximal gap between the hyper-parameters α and β for a given run-length is the run-length itself, and as the average maximal run-length in expectation for a given two-armed bandit is equal to MAX_IT/N (since a given locus is selected MAX_IT/N times in expectation), the choice of terminating the algorithm after ⌊MAX_IT/4⌋ consecutive learning steps that return a silent action, with our choice of parameter settings for MAX_IT, improves the running time without losing efficiency. For the learnBit2 algorithm, as the algorithm does not learn from a silent action, if the gap between the hyper-parameters α and β is very large, the action returned by the CTS algorithm can remain silent for a large number of learning steps. Indeed, as no learning occurs, the probability that a non-silent action is sampled from the Beta distribution remains the same at every learning step. Therefore, when the non-silent action has a large number of failures β − 1 compared to its number of successes α − 1, the algorithm can take a large amount of time to reach the MAX_IT iterations without making any steps in the search space. In the next section, we will describe these two versions of the algorithm using the contextual bandit algorithm presented in section 5.3.3.

Using the contextual bandit algorithm

In our current implementations of the bandit-based hill-climbers, we considered the choice of flipping a bit of a candidate solution as a MAB problem. From this point of view, our algorithm assumes the bits of a candidate solution to be independent of each other. For this reason, considering each bit of a candidate solution as a contextual MAB problem could significantly improve the performance of our algorithms, as it would attempt to model the dependencies between the different bits of a candidate solution. We defined the contextual MAB problem and gave a Thompson sampling algorithm solving it in section 5.3.3; we now explain how we can use this version of Thompson sampling in our algorithms.

The advantage of using a contextual bandit algorithm instead of the CTS algorithm is that we could attempt to model the dependencies between the several bits of a binary candidate solution of length N by modelling each bit as a contextual MAB (instead of a simple MAB), where the context is the parameter setting given by the N − 1 other bits. Hence, the context bj will be a vector of {0, 1}^(N−1) and a linear payoff will be assumed. Nevertheless, we can point out that contextual Thompson sampling is more expensive than the previous implementation. As the algorithm uses the context to learn the predictor µj of each arm, the algorithm has to “see” enough contexts before it actually learns which is the best arm to pull given a context. As in section 6.1.1, algorithms 8 and 9 present the two cases, respectively when learning from an action that does not change the current candidate solution (i.e. a silent action) and when not learning from it.


Algorithm 8 The learnContext1 algorithm.

1: for i = 1 to N do
2:   Associate a contextual two-armed bandit with locus i
3: end for
4: Initialize xcurrent (a vector of N bits) randomly
5: i ← 1
6: while i ≤ MAX_IT do
7:   select a locus locus randomly in xcurrent
8:   context ← xcurrent[−locus], where xcurrent[−locus] denotes the string xcurrent deprived of the bit at locus
9:   action ← model.learner[locus](context)
10:  if action == xcurrent[locus] then
11:    reward ← 0
12:  else
13:    i ← i + 1
14:    if taking the action improves the fitness then
15:      reward ← 1
16:      perform the action
17:    else
18:      reward ← 0
19:      do not perform the action
20:    end if
21:  end if
22:  Update the model of locus according to reward and context
23: end while
24: return xcurrent


Algorithm 9 The learnContext2 algorithm.

1: for i = 1 to N do
2:   Associate a contextual two-armed bandit with locus i
3: end for
4: Initialize xcurrent (a vector of N bits) randomly
5: i ← 1
6: while i ≤ MAX_IT do
7:   select a locus locus randomly in xcurrent
8:   context ← xcurrent[−locus], where xcurrent[−locus] denotes the string xcurrent deprived of the bit at locus
9:   action ← model.learner[locus](context)
10:  if action ≠ xcurrent[locus] then
11:    i ← i + 1
12:    if taking the action improves the fitness then
13:      reward ← 1
14:      perform the action
15:    else
16:      reward ← 0
17:      do not perform the action
18:    end if
19:    Update the model of locus according to reward and context
20:  end if
21: end while
22: return xcurrent

In the next section, we will present another possible hybrid bandit-based hill-climber algorithm.

6.1.2 Learning the locus to consider

Instead of associating a two-armed bandit with each one of the N bits, another solution is to associate an N-armed bandit with the whole set of parameters. Previously, after a locus was randomly selected, we aimed at learning whether its bit should be changed or not. In the binary case, this can be seen as learning whether the selected locus is the one that should be changed, or whether no action should be taken and another locus should be randomly selected. Another way of modelling this problem is to associate a single N-armed bandit with the set of the N parameters. This model presents several advantages compared to the previous one. First of all, using a single N-armed bandit implies that a maximum of N · M Beta distributions will be stored, against 2 · M · N for the other version using the CTS algorithm (where M is the maximal number of run-lengths stored by the algorithm). Another advantage is that, whilst the previous version considered N independent bandits, this version considers a single bandit for all the loci, and thus might need fewer learning steps in expectation and less storage space to find a solution. Another advantage of this model is that we do not have to deal with the issue of silent actions, and the total number of learning steps is equal to the number of iterations MAX_IT. This algorithm (given in algorithm 10) will be referred to as the learnLocus algorithm and will use the CTS algorithm presented in section 5.3.2.

Algorithm 10 The learnLocus algorithm.

1: Associate a Beta model (N-armed bandit) with the set of N parameters
2: Initialize xcurrent (a vector of N bits) randomly
3: i ← 1
4: while i ≤ MAX_IT do
5:   locus ← model.learner()
6:   i ← i + 1
7:   if flipping the bit at locus improves the fitness of xcurrent then
8:     reward ← 1
9:     flip the bit of xcurrent at locus
10:  else
11:    reward ← 0
12:  end if
13:  Update the Beta model according to reward
14: end while
15: return xcurrent
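To show how such an N-armed bandit can drive the hill-climber, here is a hedged Python sketch of the learnLocus idea; for brevity it uses the stationary Thompson sampler (algorithm 5) instead of the CTS sampler actually used in the dissertation, and the names are ours:

import random

def learn_locus(eval_fn, n, max_it, rng=random):
    """Sketch of learnLocus: one N-armed bandit chooses which locus to flip;
    a flip that improves the fitness yields reward 1, otherwise reward 0."""
    successes, failures = [0] * n, [0] * n
    x = [rng.randint(0, 1) for _ in range(n)]
    fit = eval_fn(x)
    for _ in range(max_it):
        samples = [rng.betavariate(1 + successes[l], 1 + failures[l]) for l in range(n)]
        locus = max(range(n), key=lambda l: samples[l])
        x[locus] = 1 - x[locus]          # try the flip suggested by the bandit
        new_fit = eval_fn(x)
        if new_fit > fit:
            fit = new_fit
            successes[locus] += 1
        else:
            x[locus] = 1 - x[locus]      # undo the flip
            failures[locus] += 1
    return x, fit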

The next section presents the benchmark problems in binary combinatorial optimization that we are going to use to observe and compare the performance of our algorithms.

6.2 Benchmark problems

In this section, we describe the four benchmark problems that we have considered in order to test our hybrid bandit-based hill-climber algorithms. Section 6.2.1 presents the NK-landscapes [6], section 6.2.2 presents the H-IFF problem [14], section 6.2.3 presents the MAX-SAT problem [15] and section 6.2.4 presents the 0/1 knapsack problem [16].

6.2.1 NK-landscapes (NKL)

NKL (first introduced in [6] and [7]) are stochastically generated functions over strings of N bits. A fitness function is generated under the assumption that each one of the N bits has interactions with K other bits of the string. NKL are particularly interesting for evaluating optimization techniques, as we can vary the values of N and K to evaluate their behaviour on different types of landscapes. As explained in [8], an NKL is described by a fitness function F : {0, 1}^N → R+ that is generated stochastically. More precisely, each site (or variable) is assumed to have K neighbours with which it interacts (also called epistatic links). For a given site, we hence have 2^(K+1) possible neighbourhood configurations (assuming that we include the site in the neighbourhood). For each site x_i of the string x of length N and for each possible configuration of its neighbourhood, we randomly generate (i.e. uniformly in [0; 1]) an associated fitness value F_i(x_i; x_{i1}, ..., x_{iK}). We can then calculate the total fitness of a string x as follows:

F(x) = (1/N) Σ_{i=1}^{N} F_i(x_i; x_{i1}, ..., x_{iK}),   x ∈ {0, 1}^N

There are two possible ways of defining the neighbourhoods: the adjacent neighbourhoods and the random neighbourhoods. In the first case, we consider that the neighbourhood is made of the K closest sites to the site i in the string when K is an even integer, and we simply take the K − 1 nearest sites when K is an odd number. To achieve that, we also assume that the string is a “cycle” (meaning, for instance, that the two closest sites to the first site are the Nth site and the 2nd site). The second type of neighbourhood is the one where the K neighbours are selected randomly.

Testing our algorithm on NK-landscapes will allow us to observe how our algorithm performs on landscapes with interactions between the variables. The NKL used in our experiments were implemented by ourselves.
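As an illustration of how such landscapes can be generated, here is a Python sketch of a random NKL fitness function; for simplicity it takes the K neighbours of a locus to be the next K loci on the cycle, which is only one possible adjacent-neighbourhood convention, and the function names are ours:

import random

def make_nkl(n, k, rng=random):
    """Generate a random NK-landscape: each locus i gets a lookup table giving
    a fitness contribution in [0, 1] for each of the 2^(K+1) settings of the
    locus and its K neighbours."""
    tables = [[rng.random() for _ in range(2 ** (k + 1))] for _ in range(n)]

    def fitness(x):
        total = 0.0
        for i in range(n):
            # Bits of locus i followed by its K neighbours to the right (cyclic).
            bits = [x[(i + j) % n] for j in range(k + 1)]
            index = int("".join(map(str, bits)), 2)
            total += tables[i][index]
        return total / n

    return fitness

# Example: a landscape over 20 bits with K = 3 epistatic links per locus.
f = make_nkl(20, 3)
print(f([random.randint(0, 1) for _ in range(20)]))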

6.2.2 Hierarchical If-and-only-if (H-IFF)

The Hierarchical If-and-only-if (H-IFF) problem [14] belongs to the category of building-block problems, meant to model a decomposable problem with inter-level interdependencies. The building-block hypothesis is used to split the search space into blocks in order to facilitate the search for an optimum. This hypothesis is made when the problem is said to be decomposable into sub-problems. Many real-world problems are hierarchically decomposable. H-IFF problems attempt to model landscapes where the hierarchical building-blocks have interdependencies between them.

We suppose that we have k sub-blocks per block and p levels. A block B is defined as a string of length n = k^p of symbols from an alphabet S. At the lowest level, we will hence have a string of length k^0 = 1, which will be the parent of k sub-blocks, and so on until we reach the highest level. The H-IFF problem assumes a number of sub-blocks per block k = 2 and an alphabet of three symbols S = {0, 1, −}. We implemented the H-IFF problem as specified in [14], which recursively defines the transform function T(B) (representing the meaning of a block) and the fitness function F(B) as follows:

$$T(B) = \begin{cases} b_1 & \text{if } |B| = 1,\\ t(T(B_1), \ldots, T(B_k)) & \text{otherwise.} \end{cases}$$

$$F(B) = \begin{cases} f(B) & \text{if } |B| = 1,\\ |B|\, f(T(B)) + \sum_{i=1}^{k} F(B_i) & \text{otherwise.} \end{cases}$$

where t : S^2 → S is the base transform function giving the meaning of a pair of symbols and f : S → R is the base fitness function giving the fitness of a single symbol. Both these functions are defined in tables 6.1 and 6.2.


A   B   t({A,B})
0   0   0
0   −   −
0   1   −
−   0   −
−   −   −
−   1   −
1   0   −
1   −   −
1   1   1

Table 6.1: Base transform function t for the H-IFF problem.

A   f(A)
0   1
−   0
1   1

Table 6.2: Base fitness function f for the H-IFF problem.

As hill-climbers are known to perform poorly on such landscapes [14], H-IFF represents an interesting problem on which to test our hybrid hill-climber and simulated-annealing algorithms. The H-IFF instances used in our experiments were implemented by ourselves.
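A direct Python transcription of the recursive definitions above (for k = 2, using tables 6.1 and 6.2) could read as follows; it is a sketch written from the formulas rather than our actual implementation.

def hiff_fitness(block):
    """Recursive H-IFF fitness for k = 2 over the alphabet {'0', '1', '-'}.

    `block` is a list of symbols whose length is a power of two; e.g.
    hiff_fitness(list('1' * 32)) returns 192.0, the optimum of the 32-bit
    instance discussed in section 7.1.3.
    """
    def t(a, b):                       # base transform function (table 6.1)
        return a if a == b and a != '-' else '-'

    def f(a):                          # base fitness function (table 6.2)
        return 0.0 if a == '-' else 1.0

    def transform(b):                  # T(B): the "meaning" of a block
        if len(b) == 1:
            return b[0]
        half = len(b) // 2
        return t(transform(b[:half]), transform(b[half:]))

    def fit(b):                        # F(B): the recursive fitness
        if len(b) == 1:
            return f(b[0])
        half = len(b) // 2
        return len(b) * f(transform(b)) + fit(b[:half]) + fit(b[half:])

    return fit(block)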

6.2.3 The MAX-SAT problem

In logic, a clause is a disjunction of boolean variables. The Boolean Satisfiability Problem (SAT) is a combinatorial problem which consists in finding an interpretation that satisfies a set of clauses. In other words, if we consider that we have N distinct boolean variables, the SAT problem consists in finding a set of N binary values that makes the conjunction of the clauses true. A derived version of the SAT problem is the MAX-SAT problem, which consists in maximizing the number of satisfied clauses. The MAX-SAT problem is particularly well suited to our project, as we need an objective function to evaluate candidate solutions. The instances of the MAX-SAT problem we used are artificially generated Random-3-SAT instances (i.e. instances whose clauses contain exactly 3 variables) given in [15].
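The objective we maximize is simply the number of satisfied clauses; a small Python sketch of this evaluation, assuming clauses are given as lists of signed integers in the usual DIMACS convention, is shown below.

def max_sat_fitness(clauses, assignment):
    """Count the satisfied clauses of a CNF formula.

    `clauses` is a list of clauses, each a list of non-zero integers: literal v
    means variable v is true, -v means it is negated.  `assignment` is a list
    of booleans, with variable v mapped to assignment[v - 1].
    """
    satisfied = 0
    for clause in clauses:
        for literal in clause:
            value = assignment[abs(literal) - 1]
            if (literal > 0) == value:     # the literal evaluates to true
                satisfied += 1
                break                      # one true literal satisfies the clause
    return satisfied

# Example: max_sat_fitness([[1, -2], [2, 3]], [True, False, True]) returns 2,
# as both clauses (x1 OR NOT x2) and (x2 OR x3) are satisfied.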

6.2.4 The 0/1 knapsack problem

In the 0/1 knapsack problem, we consider that we have N items, each one with a value and a weight. A knapsack is formed by several items and can contain at most one copy of each item. We want to find the subset of items that maximizes the total value of the knapsack without exceeding a maximal weight. We can encode the content of a knapsack by a string of N bits, with the value 0 at the i-th position if the i-th item is not in the knapsack and 1 otherwise. We then try to maximize the value of the knapsack under the maximal weight constraint. In our experiments, we used small knapsack instances generated with the generator by Pisinger [16]. Testing our algorithms on knapsack instances will allow us to observe how they perform on constrained problems. Nevertheless, as the knapsack problem is constrained by a maximum weight, we have to modify the previous algorithm to ensure the feasibility of the solution. Therefore, instead of accepting a point as the new current point whenever it improves the fitness, we proceed as follows:

• If both the current point and the new point are feasible, we accept the new point if it has a greater fitness than the current point.
• If the current point is feasible and the new point is not feasible, we do not accept the new point.
• If the current point is infeasible and the new point is feasible, we accept the new point.
• If both points are infeasible, we accept the point that violates the constraint the least (i.e. the one with the lowest weight).

This procedure ensures that the algorithm always moves towards “more feasible” points. Nevertheless, other approaches could be considered (for instance, adding a penalty function of the excess weight in the knapsack to the fitness of an infeasible solution). A sketch of this acceptance rule is given below.
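The sketch below transcribes the four acceptance rules into Python; the item values and weights and the weight capacity are assumed to be given, and the points are bit strings represented as lists of 0/1.

def knapsack_accept(current, candidate, values, weights, capacity):
    """Decide whether `candidate` replaces `current` under the rules above."""
    def total(x, coeffs):
        return sum(c for c, bit in zip(coeffs, x) if bit)

    cur_weight, cand_weight = total(current, weights), total(candidate, weights)
    cur_ok, cand_ok = cur_weight <= capacity, cand_weight <= capacity

    if cur_ok and cand_ok:
        # Both feasible: accept only a strict improvement in value.
        return total(candidate, values) > total(current, values)
    if cur_ok and not cand_ok:
        return False                   # never leave the feasible region
    if not cur_ok and cand_ok:
        return True                    # always re-enter the feasible region
    # Both infeasible: accept the point that violates the constraint the least.
    return cand_weight < cur_weight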


7 Results and discussion

In this section, we discuss the results obtained when running our hybrid algorithms (cf. section 6.1) on the benchmark problems (cf. section 6.2) and the tuning of the input parameters. To compare the bandit-based RMHC algorithms to the RMHC algorithm, statistical testing is performed to check whether the objective achieved by the new hybrid algorithms we created is significantly different from the objective achieved by the RMHC algorithm.

7.1 Measures of performance

In this section, we ran our algorithms on instances of the four benchmark problems discussed in section 6.2 and observed their performance. Section 7.1.2 presents the results on the NKL for several values of N and K. Sections 7.1.3, 7.1.4 and 7.1.5 present our results and observations when running the algorithms on H-IFF, SAT and 0/1 knapsack instances respectively.

7.1.1 Parameter settings

As our previous results on NKL and H-IFF instances using the learnBit1 algorithm and other bandit-based algorithms (using the CTS model) were not conclusive, we decided to use the smallest possible value for the M′ parameter, which is 1 (where M′ is the number of run-lengths to be resampled when M run-lengths have been stored). Indeed, every time the maximal number M of stored run-lengths is reached (meaning that we have stored the Beta distributions of all the arms for M different run-lengths), we delete the M′ least probable run-lengths, which speeds up the algorithms by freeing storage space. Whilst this technique reduces the computational expense of the algorithms using the CTS model, the results might be less accurate. Therefore, in an attempt to observe whether our algorithms are able to learn from the previous steps in the search space, we initially used the best parameter settings for this purpose without taking the computational expense into account. Should our tests reveal that our algorithms do learn from the previous steps when compared with the hill-climber algorithm, the next stage would be to optimize these parameters to make the algorithms less expensive. In the same perspective, we chose to set the maximal number of run-lengths M equal to the number of iterations MAX_IT. When the number of learning steps and the number of iterations are the same (e.g. in the learnBit2 and learnContext2 algorithms), this choice allows us to store all the possible run-lengths for a given bandit, even though it is significantly more expensive. The remaining parameters (i.e. the number of iterations MAX_IT and the switching rate γ) will be determined in the following sections by observing the evolution of the fitness against the number of iterations for each of the benchmark problems.

7.1.2 Results on NKL

In this section, we ran our algorithms on NK-landscapes for different values of N and K. In order to be able to perform paired statistical tests and to conclude on the performance of our algorithms on the NKL problem class, we ran each of our algorithms 30 times on the same NKL instances and in the same order.

Evolution of the fitness against the number of iterations

In this section, we aimed at observing the evolution of the fitness against the number of iterations (which is different from the number of learning steps, as mentioned previously). Figure 7.1 presents two plots of the mean fitness over 30 runs for N = 50, for K = 3 and K = 10 respectively. These two figures have been chosen to illustrate our results on NKL: we ran all our algorithms 30 times on NKL for values of N equal to 50, 75 and 100 and for values of K between 1 and 10, and all the results observed are similar to the ones presented in figure 7.1. Indeed, we can see on these figures that none of the algorithms seems to outperform the RMHC algorithm. Our first observation from the plots is that our choice for the maximal number of iterations MAX_IT = 15 ∗ N is largely sufficient. Indeed, after a certain number of iterations, it appears that the fitness does not increase any more. To make sure that the fitness has not just reached a plateau (meaning that further increases of fitness could still be observed), we ran our algorithms for larger numbers of iterations and observed no further improvement. From this observation, we can conclude that the solution returned is a local optimum. Indeed, after a very large number of iterations, as a candidate solution has N neighbours, the probability that all the neighbours of the current solution have been considered as candidate solutions and evaluated is very high (in expectation, it takes N iterations for a given neighbour to be considered when the locus is randomly selected). Another observation made from our experiments is that, as we expected, the learnContext1 and learnContext2 algorithms seem to take a larger number of iterations to reach their final solutions.

In the first plot of figure 7.1, the algorithms using the CTS model (i.e. the learnBit1, learnBit2 and learnLocus algorithms) use the Global CTS model, while in the second plot they use the NP Global model. For the ones using the Global model, after observing the fitnesses obtained for different switching rates, the switching rate chosen is equal to 1/N. This choice of parameter seems to be the most appropriate for the parametric CTS algorithms on the instances of the benchmark problems we used. Nevertheless, it might not be a good choice when considering larger instances: for practical reasons, we first tried our algorithms on small instances of the benchmark problems presented in section 6.2. The results obtained with the parametric and non-parametric models (i.e. respectively when a switching rate is given to the algorithm as an input parameter and when it is not) seem very similar. As the Global model is the least expensive one (compared to the NP Global, the PACTS and the NP PACTS models), it will be the preferred model in our experiments. Indeed, the non-parametric models are more expensive as they learn the switching rate, and the per-arm models consider different switching points for each arm, which further slows down the algorithms.


Figure 7.1: Mean fitness obtained over 30 runs on NKL instances against the number of iterations, respectively using the Global and the NP Global model for the algorithms using the CTS model. The first graph is the mean fitness over 30 different NKL with N = 50 and K = 3, and the second graph is the mean fitness over 30 different NKL with N = 50 and K = 10.


Once the parameters have been set up thanks to the observations made in this section, it is interesting to observe the fitnesses reached by the different algorithms for different values of N and K. This is the purpose of the next section.

Fitness against N and K

Figure 7.2 presents the mean fitness over 30 runs for N = 50 against the values of K (namely K = 1, K = 3, K = 5 and K = 10). The first plot of the figure uses the Global model while the second one uses the NP Global model for the algorithms using the CTS model. Even though the best mean fitness returned on these plots for the chosen values of N and K is not always the one returned by the RMHC algorithm, none of the algorithms seems to consistently outperform the RMHC algorithm across the values of N and K. Another observation is that the fitnesses obtained by learnContext1 and learnContext2 are very unstable: the standard error of the mean is the greatest for these two algorithms. We can also notice that, on the two plots, the mean fitnesses returned by these two algorithms look very different, although their values should not be influenced by whether a Global or an NP Global model is used, as they are the only algorithms that do not use the CTS model but the contextual bandit model. A sensible explanation for this observation is that the algorithms using the contextual bandit model get stuck in a local optimum before they can learn from the context. Indeed, each of the N loci of a candidate solution has a context of size N − 1, which leads to 2^{N−1} possible contexts for each locus. Yet, figure 7.1 shows that for N = 50, the learnContext1 and learnContext2 algorithms reach a local optimum after about 500 iterations, which is equivalent to 10 ∗ N iterations. Thus, the number of contexts seen by the algorithm before it reaches a local optimum is far too small, relative to the number of possible contexts, for the algorithm to have time to learn from the previous steps according to the context of the loci of the current solution.

Even though the results do not seem convincing, we can also observe the maximal fitness reached by the algorithms over the 30 runs and see whether better results are obtained. Figure 7.3 presents the maximal fitness reached by the same algorithms for the same values of N and K as in figure 7.2.


Figure 7.2: Mean fitness obtained over 30 runs on NKL instances for N = 50 against the value of K. The first graph uses the Global model and the second graph uses the NP Global model.


Figure 7.3: Best fitness obtained over 30 runs on NKL instances for N = 50 against the value of K. The first graph uses the Global model and the second graph uses the NP Global model.


7.1.3 Results on the H-IFF problem

In this section, we ran our algorithms 30 times on different H-IFF instances (for k = 2 and for different values of p). The same parameter settings as in section 7.1.1 were chosen. First, we observe the evolution of the mean fitness against the number of iterations; then we observe the evolution of the mean fitness and of the best fitness obtained over the 30 runs against the values of p.

Evolution of the fitness against the number of iterations

Figure 7.4 contains two plots, respectively using the Global and the NP Global model for the algorithms using the CTS model. The plots represent the evolution of the mean fitness over 30 runs against the number of function evaluations on the 32-bit H-IFF instance (i.e. for k = 2 and p = 5). The first observation that can be made from these two plots is that, unlike the results obtained on NKL, we can easily see differences between the algorithms. We notice that the learnContext1 algorithm especially seems to differentiate itself from the others, as it is the only one that reaches a better mean fitness than the RMHC algorithm. Another very interesting observation is that, for all the hybrid algorithms, the current candidate solution has a greater mean fitness than the mean fitness obtained with the RMHC algorithm. This could be evidence that the algorithms actually do learn something from the moves made in the search space. Nevertheless, these observations have to be confirmed through statistical testing. Even though the results obtained by the learnContext1 algorithm seem promising, we can point out two negative observations. Firstly, the instance considered is very small (N = 32). Secondly, even on this small instance, none of the algorithms solves the problem, as the optimal fitness on the 32-bit H-IFF instance is equal to 192 and the best mean fitness obtained by our hybrid algorithms is approximately equal to 87.


Figure 7.4: Mean fitness over 30 runs on the 32-bit H-IFF instance against the number of iterations, respectively using the Global and the NP Global model for the algorithms using the CTS model.


Fitness against the depth p

Figures 7.5 and 7.6 respectively show the mean and the best fitness obtained over the 30 runs against the values of p (for p = 3, p = 4 and p = 5). They both contain two plots to compare the results when using the parametric Global model and the non-parametric NP Global model. The first observation that can be made is that the algorithms do not perform better on even smaller H-IFF instances. Indeed, the dashed red line on the plots represents the optimal fitness, and all the mean fitnesses obtained lie largely below it. Nevertheless, when observing the best fitness obtained over the 30 runs, the learnContext1 algorithm clearly stands out. We can also notice that the learnBit2 algorithm is the only one that manages to solve the two smallest instances (the 8-bit and the 16-bit H-IFF instances) over the 30 restarts. These observations are not sufficient to lead to any conclusions, but they can help us in trying to improve these algorithms. As the algorithms seemed to learn during the first iterations, we tried to switch off the bandit component after a small number of iterations, before the algorithm gets stuck in a local optimum. Figure 7.7 illustrates one of the results we obtained. On this plot, we switch off the bandit algorithms (turning them into an RMHC algorithm) after N = 32 iterations. According to the plots in figure 7.4, after 32 iterations the algorithms seem to have learnt from the moves in the search space while the hybrid algorithms are not yet stuck in local optima. Nevertheless, the mean fitnesses observed in figure 7.7 are not any better than the ones observed in figure 7.5. The other trials we made, using other CTS models and other numbers of iterations after which the hybrid algorithms are turned off, did not seem to bring any improvement either.


Figure 7.5: Mean fitness obtained over 30 runs on H-IFF instances for k = 2 against the value of p. The first graph uses the Global model and the second graph uses the NP Global model.


Figure 7.6: Best fitness obtained over 30 runs on H-IFF instances for k = 2 against the value of p. The first graph uses the Global model and the second graph uses the NP Global model.


Figure 7.7: Mean fitness obtained over 30 runs on H-IFF instances for k = 2 against the value of p when switching off the bandit algorithms after N = 32 iterations. The first graph uses the Global model and the second graph uses the NP Global model.

7.1.4 Results on the SAT problem

In this section, we present some of our results on both satisfiable and unsatisfiable SAT instances.

Evolution of the fitness against the number of iterations

Figure 7.8 presents the mean fitness over 30 runs against the number of iterations on a satisfiable SAT instance with N = 100 variables and 200 clauses. The same observations as previously can be made: the hybrid algorithms cannot be distinguished from each other, and the number of iterations MAX_IT = 15 ∗ N appears to be large enough to return a local optimum. Very similar results are observed on different SAT instances and with different CTS models (i.e. when using the NP Global or the PACTS model).


Figure 7.8: Mean fitness over 30 runs on a satisfiable SAT instance (with N = 100 variables and 200 clauses) against the number of iterations, using the Global model for the algorithms using the CTS model.

In the following sections, we observe the mean and the best fitnesses reached over 30 runs when running the algorithms on both satisfiable and unsatisfiable SAT instances.

Fitness on satisfiable instances

In figure 7.9, we can observe the mean fitness on 8 satisfiable instances: the first plot shows the mean fitness returned by our algorithms on 4 satisfiable SAT instances with N = 100 variables and 200 clauses, and the second plot shows the same for instances with N = 50 variables and 300 clauses. On the first plot, we can again observe that some of the hybrid algorithms seem to outperform the RMHC algorithm on some instances. Nevertheless, these results are very inconsistent. Indeed, we ran our algorithms on 20 different satisfiable instances and the plots we obtained did not allow us to observe any general trend in the behaviour of our algorithms. The same observations can be made for the best fitness over 30 runs in the plots of figure 7.10: even though some of the algorithms solve some of the satisfiable SAT instances over 30 restarts, no general behaviour can be identified.


Figure 7.9: Mean fitness over 30 runs on satisfiable SAT instances using the Global model. The first plot corresponds to instances with N = 100 variables and 200 clauses and the second plot corresponds to instances with 50 variables and 300 clauses.


Figure 7.10: Best fitness obtained over 30 runs on satisfiable SAT instances using the Global model. The first plot corresponds to instances with N = 100 variables and 200 clauses and the second plot corresponds to instances with 50 variables and 300 clauses.


Fitness on unsatisfiable instances

Figures 7.11 and 7.12 show that exactly the same observations can be made when testing our algorithms on unsatisfiable instances.

Figure 7.11: Mean fitness over 30 runs on unsatisfiable SAT instances (with N = 100 variables and 200 clauses) using the Global model.


Figure 7.12: Best fitness obtained over 30 runs on unsatisfiable SAT instances (with N = 100 variables and 200 clauses) using the Global model.

7.1.5 Results on the 0/1 knapsack problem

The results observed on the 0/1 knapsack instances we generated are essentially the same as the ones we obtained on the other three benchmark problems. As shown in figure 7.13, the average fitness (i.e. the total value of the knapsack) obtained by our hybrid algorithms on a knapsack instance with 50 items is not better than the one returned by the RMHC algorithm. The same observations were made on other 0/1 knapsack instances.


Figure 7.13: Mean fitness over 30 runs on a 0/1 knapsack instance (with N = 50 items) against the number of iterations, using the Global model.

7.2 Statistical tests

In this section, as our observations do not show any improvement brought by our hybrid algorithms over the RMHC algorithm, except perhaps on the 32-bit H-IFF instance, we decided to first focus our statistical tests on this problem instance. In order to make some basic observations, we first give summary statistics of the samples in section 7.2.1 and then apply the Wilcoxon signed-rank test in section 7.2.2. To facilitate the procedure, we retrieved the data samples and used the language R to inspect the data and perform the statistical testing.

7.2.1 Summary

Summary statistics on the data retrieved are given in figure 7.14. As observed in section 7.1.3, we notice that the algorithm that reaches the best mean fitness and the best overall fitness is the learnContext1 algorithm, as its mean fitness is equal to 87.33 and the maximal fitness it reached over the 30 runs is equal to 132.


          RMHC     learnBit1  learnBit2  learnContext1  learnContext2  learnLocus
Min.       72.00     68.00      68.00       72.00          68.00         72.00
1st Qu.    80.00     80.00      76.00       77.00          77.00         76.00
Median     82.00     84.00      84.00       86.00          84.00         84.00
Mean       85.33     85.47      84.53       87.33          84.27         84.33
3rd Qu.    92.00     92.00      92.00       92.00          91.00         88.00
Max.      116.00    104.00     100.00      132.00         120.00        124.00

Figure 7.14: Summary statistics of the fitnesses obtained by the RMHC and the hybrid algorithms over 30 runs on the 32-bit H-IFF instance.

7.2.2 The Wilcoxon signed-rank test

The Wilcoxon signed-rank test allows us to decide whether two data samples come from the same distribution. The hypotheses are the following:

H0: The two samples come from the same distribution.
H1: The two samples come from different distributions.

Table 7.1 gives the p-values of the Wilcoxon tests for our five hybrid bandit-based hill-climbing algorithms. At a significance level of 0.05, we cannot reject, for any pair of algorithms, the hypothesis that the two samples come from the same distribution (as the p-values are all greater than 0.05). In other words, we found no evidence that using bandit-based hill-climbers influences the fitness distribution on the 32-bit H-IFF instance. Unfortunately, similar results were obtained on the other benchmark problem instances.

               learnBit1  learnBit2  learnContext1  learnContext2  learnLocus
RMHC              1          0.9904     0.729          0.845          0.8821
learnBit1                    0.7018     0.575          0.4728         0.2345
learnBit2                               0.7168         0.9389         0.8073
learnContext1                                          0.3726         0.2731
learnContext2                                                         0.8536

Table 7.1: p-values of the pairwise Wilcoxon signed-rank tests between the algorithms on the 32-bit H-IFF instance.
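As noted above, the tests were run in R; an equivalent paired test can be run in Python with SciPy, as in the sketch below, where the two lists are placeholders standing in for the 30 paired fitness values collected for two algorithms on the same instances.

from scipy.stats import wilcoxon

# Placeholder paired samples (the real data is the 30 fitness values per
# algorithm obtained on the 32-bit H-IFF instance, in the same run order).
rmhc_fitness = [72, 80, 82, 85, 92, 116, 84, 80, 88, 90]
hybrid_fitness = [72, 77, 86, 87, 92, 132, 84, 78, 90, 96]

statistic, p_value = wilcoxon(rmhc_fitness, hybrid_fitness)
print("Wilcoxon statistic = %.2f, p-value = %.4f" % (statistic, p_value))
# A p-value above 0.05 means we cannot reject H0, i.e. we have no evidence
# that the two paired samples come from different distributions.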


8 Conclusions

Although the results we obtained on the four problem classes are not conclusive (the statistical tests show that the fitnesses sampled from the bandit-based and non-bandit-based RMHC algorithms come from the same distributions), other hybrid algorithms could be investigated. Section 8.1 discusses the future avenues of investigation, including another bandit-based combinatorial optimization technique that we could study and some modifications that we could bring to our current algorithms.

8.1 Future avenues for investigation

In this section we discuss possible future areas of investigation considering our current results and the progress made on the project. Section 8.1.1 discusses another bandit-based combinatorial optimization algorithm, and sections 8.1.2 and 8.1.3 present two modifications that could be brought to our current implementations of the bandit-based local search techniques to improve their performance.

8.1.1 Bandit-based simulated-annealing

In the first phase of our project, we studied and implemented some hybrid simulated-annealing algorithms using the CTS algorithm. As the results found using these algorithms were not conclusive, we decided to focus on the bandit-based hill-climbers, as tuning the parameters of such algorithms is easier. Indeed, for the bandit-based simulated-annealing, in addition to the parameters of the CTS algorithms, an appropriate cooling schedule has to be determined. Nevertheless, further investigations could be pursued in this direction, as simulated-annealing presents great advantages over the RMHC algorithms on many problems. As opposed to the RMHC algorithm, which accepts a point only if it improves the fitness, the acceptance function of simulated-annealing, based on a heuristic, brings another dimension to addressing the EvE problem. Also, the versions of the bandit-based simulated-annealing using the contextual bandit and using a single N-armed bandit have not been tried yet.


8.1.2 The reward distribution

In our algorithms, we assumed a binary pay-off: if the action taken brings an improvement to the fitness of the current point, the reward is equal to 1, and it is equal to 0 otherwise. The Thompson sampling algorithm used in our algorithms was the one given in [11]. One possible improvement would be to modify this implementation of the Thompson sampling algorithm so that it can deal with different reward distributions. We could, for instance, consider the reward to be the fitness improvement between the current point and the new point (resulting from the action decided by the Thompson sampling algorithm). In this case, we could assume a Gaussian prior distribution on the parameters θ_j of the reward distribution of each of the j arms and a Gaussian likelihood (of observing a reward r_i(t) at time t) and proceed as previously: indeed, the posterior probability of being the optimal arm to pull would also be Gaussian and easy to compute. [12] gives more details on Thompson sampling with Gaussian prior distributions.
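As an illustration of this idea, the following sketch maintains a Gaussian posterior over the mean reward of one arm, assuming a known observation variance and a N(mu0, sigma0^2) prior (both of which would have to be tuned); the selection step stays the same as before, sampling from each arm's posterior and playing the arg max.

import math
import random

class GaussianThompsonArm:
    """Posterior over an arm's mean reward theta, assuming rewards ~ N(theta, sigma^2)
    with known sigma and a conjugate N(mu0, sigma0^2) prior on theta."""

    def __init__(self, mu0=0.0, sigma0=1.0, sigma=1.0):
        self.mu, self.var = mu0, sigma0 ** 2   # current posterior N(mu, var)
        self.obs_var = sigma ** 2              # assumed reward variance

    def sample(self, rng):
        # Thompson sampling draw from the current posterior.
        return rng.gauss(self.mu, math.sqrt(self.var))

    def update(self, reward):
        # Standard Gaussian-Gaussian conjugate update for one observation,
        # e.g. with `reward` set to the fitness improvement of the move.
        precision = 1.0 / self.var + 1.0 / self.obs_var
        self.mu = (self.mu / self.var + reward / self.obs_var) / precision
        self.var = 1.0 / precision

# Selection over N arms: play arg max of [arm.sample(rng) for arm in arms],
# observe the fitness improvement, and update the chosen arm.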

8.1.3 The change-point

Instead of allowing a change-point to occur with a probability equal to the switching rate γ at each iteration, a possible alternative for the bandit-based simulated-annealing would be to make a change-point occur whenever the simulated-annealing accepts a point that does not improve the fitness of the current point. This way, we would reset the reward distribution of the arms to the Beta(1, 1) distribution so that the algorithm starts to learn again from the new direction taken. As the temperature decreases, the algorithm would learn over longer run-lengths.


A Thompson sampling for Contextual bandits

The following algorithm is the one given in appendix C of [13], specialized to the case of two arms.

Algorithm 11 Thompson sampling for contextual bandits with 2 parameters.

1: for t = 1, 2, . . . do
2:     For each arm j ∈ {1, 2}, sample θ_j(t) independently from the distribution N(b_j(t)^T µ_j(t), v^2 b_j(t)^T B_j(t)^{-1} b_j(t)).
3:     Play arm a(t) ← arg max_j θ_j(t) and observe reward r_t.
4:     Update B_{a(t)} ← B_{a(t)} + b_{a(t)}(t) b_{a(t)}(t)^T, f_{a(t)} ← f_{a(t)} + b_{a(t)}(t) r_t, µ_{a(t)} ← B_{a(t)}^{-1} f_{a(t)}.
5: end for
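For reference, a possible NumPy transcription of Algorithm 11 could look as follows; the functions contexts(t) (returning the pair of context vectors b_1(t), b_2(t)) and reward(t, arm) are hypothetical stand-ins for however the contexts and rewards are produced, and v is the scaling parameter of the sampling variance.

import numpy as np

def contextual_thompson_sampling(contexts, reward, T, v=1.0, seed=0):
    """Sketch of Algorithm 11 for 2 arms with d-dimensional context vectors."""
    rng = np.random.default_rng(seed)
    d = len(contexts(0)[0])
    B = [np.eye(d) for _ in range(2)]        # per-arm matrices B_j
    f = [np.zeros(d) for _ in range(2)]      # per-arm vectors f_j
    mu = [np.zeros(d) for _ in range(2)]     # per-arm means mu_j = B_j^{-1} f_j

    for t in range(T):
        b = contexts(t)
        theta = []
        for j in range(2):
            mean = float(b[j] @ mu[j])
            var = v ** 2 * float(b[j] @ np.linalg.solve(B[j], b[j]))
            theta.append(rng.normal(mean, np.sqrt(var)))
        a = int(np.argmax(theta))            # play the arm with the largest sample
        r = reward(t, a)                     # observe the reward
        B[a] += np.outer(b[a], b[a])         # rank-one update of B_a
        f[a] += b[a] * r
        mu[a] = np.linalg.solve(B[a], f[a])  # mu_a = B_a^{-1} f_a
    return mu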


Bibliography

[1] How to Solve It: Modern Heuristics. Second Edition. Zbigniew Michalewicz and David B. Fogel. (1998). Springer.

[2] The Exploration and Exploitation of Response Surfaces: Some General Considerations and Examples. G. E. P. Box. Biometrics. (1954).

[3] Metaheuristics in Combinatorial Optimization: Overview and Conceptual Comparison. Christian Blum and Andrea Roli. ACM Computing Surveys, Vol. 35, No. 3. (2003).

[4] Relative Building-Block Fitness and the Building-Block Hypothesis. Stephanie Forrest and Melanie Mitchell. In D. Whitley (ed.), Foundations of Genetic Algorithms 2, Morgan Kaufmann, San Mateo, CA, 1993.

[5] When Will a Genetic Algorithm Outperform Hill-Climbing? Melanie Mitchell and John H. Holland. In J. D. Cowan, G. Tesauro, and J. Alspector (editors), Advances in Neural Information Processing Systems 6. San Mateo, CA: Morgan Kaufmann, 1994.

[6] Towards a general theory of adaptive walks on rugged landscapes. Kauffman, S. and Levin, S. (1987). Journal of Theoretical Biology 128 (1) 11–45.

[7] The NK model of rugged fitness landscapes and its application to maturation of the immune response. Kauffman, S. and Weinberger, E. (1989). Journal of Theoretical Biology, Vol. 141, No. 2, 211–245.

[8] Mixed-Integer NK Landscapes. Rui Li, Michael T.M. Emmerich, Jeroen Eggermont, Ernst G.P. Bovenkamp, Thomas Bäck, Jouke Dijkstra and Johan H.C. Reiber.

[9] Algorithms for the multi-armed bandit problem. Volodymyr Kuleshov and Doina Precup. Journal of Machine Learning Research 1 (2000) 1–48.

[10] An Empirical Evaluation of Thompson Sampling. Olivier Chapelle and Lihong Li. NIPS (2011).

[11] Thompson Sampling in Switching Environments with Bayesian Online Change Point Detection. Joseph Mellor and Jonathan Shapiro (2013). University of Manchester.


[12] Further Optimal Regret Bounds for Thompson Sampling. Shipra Agrawal and Navin Goyal. CoRR (2012).

[13] Thompson Sampling for Contextual Bandits with Linear Payoffs. Shipra Agrawal and Navin Goyal. Microsoft Research India. (2013).

[14] Modeling Building-Block Interdependency. Richard A. Watson, Gregory S. Hornby and Jordan B. Pollack. Volen Center for Complex Systems, Brandeis University, Waltham, MA, USA. (1998).

[15] http://www.cs.ubc.ca/~hoos/SATLIB/benchm.html

[16] http://www.diku.dk/~pisinger/codes.html
