Demicco & Associates, LLC - Delaware Riverkeeper

19
Multiple Instance Learning: Algorithms and Applications Boris Babenko Dept. of Computer Science and Engineering University of California, San Diego Abstract Traditional supervised learning requires a training data set that consists of inputs and corre- sponding labels. In many applications, however, it is difficult or even impossible to accurately and consistently assign labels to inputs. A relatively new learning paradigm called Multi- ple Instance Learning allows the training of a classifier from ambiguously labeled data. This paradigm has been receiving much attention in the last several years, and has many useful applications in a number of domains (e.g. computer vision, computer audition, bioinformat- ics, text processing). In this report we review several representative algorithms that have been proposed to solve this problem. Furthermore, we discuss a number of existing and potential applications, and how well the currently available algorithms address the problems presented by these applications. 1 Introduction In traditional supervised learning a training dataset, consisting of input and output/label pairs, is used to construct a classifier that can predict outputs/labels for novel inputs [1, 2]. An enormous amount of theoretical and practical work has gone into developing robust algorithms for this prob- lem. However, the requirement of input/label pairs in the training data is surprisingly prohibitive in certain applications, and more recent research has focused on developing more flexible paradigms for learning. In this paper we focus on the Multiple Instance Learning (MIL) paradigm, which has been emerging as a useful tool in a number of application domains. In this paradigm the data is assumed to have some ambiguity in how the labels are assigned. In particular, rather than pro- viding the learning algorithm with input/label pairs, labels are assigned to sets or bags of inputs. Restricting the labels to be binary, the MIL assumption is that every positive bag contains at least one positive input. The true input labels can be thought of as latent variables, because they are not known during training. Consider the following simple example (adapted from [3]) of a MIL problem, illustrated in Fig. 1. There are several faculty members, and each owns a key chain that contains a few keys. You know that some of these faculty members are able to enter a certain room, and some aren’t. The task is then to predict whether a certain key or a certain key chain can get you into this room. To solve this we need to find the key that all the “positive” key chains have in common. Note that 1

Transcript of Demicco & Associates, LLC - Delaware Riverkeeper

Page 1: Demicco & Associates, LLC - Delaware Riverkeeper

Multiple Instance Learning: Algorithms and Applications

Boris BabenkoDept. of Computer Science and Engineering

University of California, San Diego

Abstract

Traditional supervised learning requires a training data set that consists of inputs and corre-sponding labels. In many applications, however, it is difficult or even impossible to accuratelyand consistently assign labels to inputs. A relatively new learning paradigm called Multi-ple Instance Learning allows the training of a classifier from ambiguously labeled data. Thisparadigm has been receiving much attention in the last several years, and has many usefulapplications in a number of domains (e.g. computer vision, computer audition, bioinformat-ics, text processing). In this report we review several representative algorithms that have beenproposed to solve this problem. Furthermore, we discuss a number of existing and potentialapplications, and how well the currently available algorithms address the problems presentedby these applications.

1 Introduction

In traditional supervised learning a training dataset, consisting of input and output/label pairs, isused to construct a classifier that can predict outputs/labels for novel inputs [1, 2]. An enormousamount of theoretical and practical work has gone into developing robust algorithms for this prob-lem. However, the requirement of input/label pairs in the training data is surprisingly prohibitive incertain applications, and more recent research has focused on developing more flexible paradigmsfor learning. In this paper we focus on the Multiple Instance Learning (MIL) paradigm, which hasbeen emerging as a useful tool in a number of application domains. In this paradigm the data isassumed to have some ambiguity in how the labels are assigned. In particular, rather than pro-viding the learning algorithm with input/label pairs, labels are assigned to sets or bags of inputs.Restricting the labels to be binary, the MIL assumption is that every positive bag contains at leastone positive input. The true input labels can be thought of as latent variables, because they are notknown during training.

Consider the following simple example (adapted from [3]) of a MIL problem, illustrated inFig. 1. There are several faculty members, and each owns a key chain that contains a few keys.You know that some of these faculty members are able to enter a certain room, and some aren’t.The task is then to predict whether a certain key or a certain key chain can get you into this room.To solve this we need to find the key that all the “positive” key chains have in common. Note that

1

Page 2: Demicco & Associates, LLC - Delaware Riverkeeper

Serge’s key‐chain Sanjoy’s key‐chain Lawrence’s key‐chain

Serge cannot enterthe Secret Room

Sanjoy can enterthe Secret Room

Lawrence can enterthe Secret Roomthe Secret Room the Secret Room the Secret Room

Figure 1: Keychain example: see text for details.

if we can correctly identify this key, we can also correctly classify an entire key chain - either itcontains the required key, or it doesn’t.

The spirit of MIL emerged as early as 1990 when Keeler et al. designed a neural networkalgorithm that would find the best segmentation of a hand written digit during training [4]. In otherwords, the digit label was assigned to every possible (block) segmentation of a digit image, ratherthan to a tightly cropped image. The idea was further developed, and the actual term was coinedby Dietterich et al. in [3]. The problem that inspired this work was that of drug discovery, wheresome property of the molecule has to be predicted from its shape statistics. The ambiguity comesinto play because each molecule can twist and bend into several different distinct shapes, and it isnot known which shape is responsible for the particular property (see Figure 4). In Section 3 wewill see many more interesting applications that fit into the MIL paradigm.

Although MIL has received an increasing amount of attention in recent years, the problem isstill fairly undeveloped and there are many interesting open questions. The goal of this report is toprovide a review of some representative algorithms for MIL, as well as some interesting applica-tions. Finally, we discuss how well the existing algorithms address the problems presented by theapplications and point out potential avenues for future research.

2 Algorithms

The purpose of this section is to formally define the MIL problem, and review some popular al-gorithms for solving MIL. For the purpose of clarity, it is useful to divide all existing methodsinto a hierarchy, but, as is usually the case, this is impossible to do as various methods overlap invarious ways. As with supervised learning, the training procedure for MIL will require two basicingredients: a cost function (e.g. 0/1 loss, likelihood), and a method of finding a classifier that opti-mizes that cost function (e.g. gradient descent, heuristic search). We chose to structure this reviewby splitting the methods up according to the former criteria, but it should be noted that there aremethods with different cost functions that are actually very similar in terms of their optimizationprocedures. Next we define the MIL problem formally, review the most basic way of solving MIL aswell as some PAC algorithm and theoretic results; finally, we conclude with a review of maximum

2

Page 3: Demicco & Associates, LLC - Delaware Riverkeeper

likelihood and maximum margin MIL algorithms.

2.1 Problem Statement

We begin by formally stating the problem, and establishing the notation that will be used in the restof this section. Since it will be useful to compare MIL to standard supervised learning, and sincemost MIL algorithms are derived from well known supervised learning algorithms, we first establishnotation for supervised learning. The training data consists of instance examples {x1, x2 . . . , xn}and labels {y1, y2 . . . , yn}, where xi ∈ X and yi ∈ Y . Typically, X = Rd is the d−dimensionalEuclidean space, and Y = {0, 1}. The goal of supervised learning is to train a classifier functionh(X) : X → Y that will accurately predict labels y for novel instances x. On the other hand,in Multiple Instance Learning the training set consists of bags {X1, X2 . . . , Xn} and bag labels{y1, y2 . . . , yn}, where Xi = {xi1, xi2 . . . , xim}, xij ∈ X and yi ∈ {0, 1} (for convenience wealso define y′ = (2y− 1) ∈ {−1,+1}). Note that the MIL problem is defined only for binary cases(although extensions have also been explored [5, 6]). For the sake of simplicity, we will assumethat X = Rd. Furthermore, we will assume that all bags are of a fixed size m, although in mostalgorithms this is not necessary.

The basic assumption of MIL, as mentioned before, is that a bag is positive if at least one of theinstances in that bag is positive (the true positive instance inside a positive bag is sometimes referredto as the “witness” or the “key”). In other words, we assume that instance labels yij ∈ {0, 1} existfor each instance, but they are not known during training. We can think of these as latent variables.The MIL assumption can be then summarized as follows:

yi ={

1 if ∃j s.t. yij = 10 otherwise

(1)

Another way of writing this is:

yi = maxj{yij} (2)

This is a bit less intuitive, but will prove to be useful later.The goal of MIL is to either train an instance classifier h(X) : X → Y or a bag classifier

H(X) : Xm → Y . However, note that a bag classifier can be easily constructed with the cor-rect instance classifier: H(Xi) = maxj h(xij). Therefore, with few exceptions [7, 8], most MIL

algorithms aim to learn the instance classifier.

2.2 Naive MIL

Perhaps the simplest way of solving the MIL problem is to generate instance labels yij equal to thebag labels yi. For negative bags, the generated instance labels will be correct, because all instancesin a negative bag are negative. However, positive bags may contain both positive and negativeinstances, and this approach will inevitably lead us to assigning positive labels to negative instances.

3

Page 4: Demicco & Associates, LLC - Delaware Riverkeeper

xi ∈ X ith instance (supervised)Xi = {xi1, xi2 . . . , xim} ith bag with m instances xij (MIL)

yi ∈ Y label of ith instance (supervised) or ith bag (MIL)yij true label of jth instance in ith bagy′ (2y − 1)n instances (supervised) or number of bags (MIL)m number of instances per bagd dimensionality

h(x) instance classifierH(X) bag classifier

Table 1: A summary of notation.

Still the question remains: can a standard supervised learning algorithm perform comparably to aMIL algorithm? Many papers have investigated this question, and have shown for a variety ofdatasets and algorithms (both supervised and MIL) that using the MIL framework results in betterperformance [3, 9, 10, 11]. MIL algorithms tend to outperform this “naive MIL” approach in practiceby a large margin. Nevertheless, this is an important scenario to consider, and we will see that anapproach similar to this leads to interesting theoretical results.

2.3 PAC Algorithms and Bounds

Early work on MIL focused on simple algorithms, such as axis aligned rectangles, for which the-oretical results can be derived more readily [12, 13]. In particular, several researchers focused ondeveloping Probably Approximately Correct (PAC) [14] learning algorithms, and showing samplecomplexity bounds. The basic idea behind PAC is to prove that given some training data, a certainlearning algorithm will produce an accurate classifier with high probability. Formally, for smallε and δ, P[err(h) > ε] < δ, where err(h) is the error of classifier h on unseen data. Using thisframework one can show roughly how much data will be needed to achieve an error of ε with highprobability – this is often referred to as the sample complexity. To arrive at these results, thesealgorithms generally need to make some assumptions about the data. For the case of supervisedlearning, the most common assumption is that the instances are drawn i.i.d. from some distribu-tion. For MIL there is a similar assumption: we assume that all instances are drawn i.i.d. from somedistribution, and then placed randomly into bags. This means that the instances in one bag haveno special relationship. This is a very important assumption, and we will discuss when it does anddoes not apply in Section 4.

Blum [15] proposed a surprisingly simple algorithm that reveals an interesting relationshipbetween supervised learning and MIL. The basic idea of his algorithm is as follows: for negativebags, assign a negative instance label to all instances; for positive bags, choose one instance perbag at random, and assign a positive instance label. Finally, plug these instance/label pairs into

4

Page 5: Demicco & Associates, LLC - Delaware Riverkeeper

a noise tolerant learning algorithm [16]. This is similar to the naive MIL approach we discussedin the previous section. The key to this algorithm is that the positive instances will always havea correct label. However, the negative instances may be labeled positive (if they came from apositive bag). Therefore, the data set that we produce will have asymmetric label noise. Withthis reduction, Blum shows that any concept class that is PAC learnable with noise is also PAClearnable for MIL. The best sample complexity bound is also due to Blum. This algorithm is amore sophisticated extension of the above, and uses the Statistical Query framework [16] (that wasused to prove PAC bounds for noise-tolerant learning by Kearns et al.). The sample complexitybound of this MIL algorithm is O(d2m/ε2). Note that this implies that learning becomes moredifficult as ε decreases, as the dimensionality d increases, and as the bag size m increases.

Although Blum’s algorithm achieved the best known PAC bounds for MIL, it is clearly not apractical algorithm. For large bag sizes m we would essentially be throwing away vast amounts ofdata. However, this algorithm is interesting to study because it highlights the relationship betweensupervised learning and MIL. The algorithms we review in the sections below generally havefewer known theoretical properties, but are more practical and have been shown to perform well ondifficult data.

2.4 Maximum Likelihood

The first category of practical MIL algorithms that we will discuss aims to train a classifier thatmaximizes the likelihood of the data. Maximum likelihood methods are popular in supervisedlearning, and we begin with a breis review. For each instance xi and a classifier h we let pi ≡P(yi = 1|xi, h) be the posterior probability of that instance. The exact definition of this entity willdepend on the classifier. The log likelihood1 is defined as follows:

L =∑

i|yi=1

log(pi) +∑

i|yi=0

log(1− pi) (3)

To adapt this loss function for MIL we need to remove the dependence on instance labels, as theseare not known during training. Instead, we let pi ≡ P(yi = 1|Xi, h) be the posterior probability ofthe bag, and pij ≡ P(yij = 1|xij , h) be the posterior probability of an instance. The equation forlog likelihood remains the same, but note that pi is now the bag probability. As before, the exactdefinition of pij will depend on the classifier.

Finally, we need to connect the bag probability pi to the probabilities of its instances pij . Anatural way of defining this relationship is:

pi = maxj{pij} (4)

The above can be considered a probabilistic equivalent of Eqn. 2. An important observation isthat the max operator is not differentiable, which means that the log likelihood for MIL is not

1Maximizing the likelihood can cause numerical errors, so it is common to instead maximize the log likelihood,which is equivalent because log is monotonically increasing.

5

Page 6: Demicco & Associates, LLC - Delaware Riverkeeper

0 0.2 0.4 0.6 0.8 1

0.5

0.6

0.7

0.8

0.9

1

v

g i(vi)

MAXNORISRGM(r=5)GM(r=20)LSE(r=5)LSE(r=20)

Figure 2: Various models g`(v`) applied to v` = (v, 1 − v).

differentiable. This presents a considerable problem as most maximum likelihood methods insupervised learning take advantage of the smoothness of the loss function to perform some formof gradient descent. Luckily, several differentiable approximations to the max operator exist in theliterature. We continue with a review of these approximations.

2.4.1 Differentiable Max Approximations

We now present an overview of differentiable approximations of the max, which we refer to asthe softmax. The general idea is to approximate the max over {v1, v2 . . . , vm} by a differentiablefunction g`(v`), such that:

g`(v`) ≈ max`

(v`) = v∗ (5)

∂g`(v`)∂vi

≈ 1(vi = v∗)∑` 1(v` = v∗)

(6)

Intuitively, if vi is the unique max, then changing vi changes the max by the same amount, other-wise changing vi does not affect the max.

A number of approximations for g have been proposed. We summarize the choices used here inTable 2: a variant of log-sum-exponential (LSE) [17, 18], generalized mean (GM), noisy-or (NOR)[10], and the ISR model [4, 10]. For experimental comparisons of these different choices withinMIL we refer the reader to [19]. In Fig. 2 we show the different models applied to (v, 1 − v) forv ∈ [0, 1].

LSE and GM each have a parameter r the controls their sharpness and accuracy; g`(v`) →v∗ as r → ∞ (note that large r can lead to numerical instability). For LSE one can show thatv∗ − log(m)/r ≤ g`(v`) ≤ v∗ [17] and for GM that (1/m)1/rv∗ ≤ g`(v`) ≤ v∗, where m = |v`|.NOR and ISR are only defined over [0, 1]. Both have probabilistic interpretations and work well in

6

Page 7: Demicco & Associates, LLC - Delaware Riverkeeper

practice; however, these models are best suited for small m as g`(v`) → 1 as m →∞. Finally, allmodels are exact for m = 1, and if ∀` v` ∈ [0, 1], then 0 ≤ g`(v`) ≤ 1 for all models.

g`(v`) ∂g`(v`)/∂vi domain

LSE 1r ln

(1m

∑` exp(rv`)

) exp(rvi)∑` exp(rv`)

[−∞,∞]

GM(

1m

∑` vr

`

) 1r g`(v`)

vr−1i∑` vr

`[0,∞]

NOR 1−∏

`(1− v`)1−g`(v`)

1−vi[0, 1]

ISR∑

` v′`1+

∑` v′`

, v′` = v`1−v`

(1−g`(v`)

1−vi

)2[0, 1]

Table 2: Four softmax approximations g`(v`) ≈ max`(v`).

Now we are ready to go over a few specific maximum likelihood algorithms for MIL. For each,we will define the classifier h and the instance probability pij in terms of h.

2.4.2 Logistic Regression

Logistic regression [2] is a popular probabilistic model for supervised learning, and using theframework we have laid out so far it is fairly straightforward to adapt it to MIL [9]. The clas-sifier will consist of just one vector in the input space: h = {w}, w ∈ Rd. Using the sigmoidfunction:

σ(x) =1

1 + exp{−x}(7)

we define the instance probabilities as follows:

pij = σ(w · xij

)(8)

We can think of the above as a probabilistic version of linear regression. Note that the range ofthe sigmoid is between 0 and 1, so plugging the dot product (w · xij) into this function produces avalid probability value. Another interpretation of logistic regression is that w is a normal vector toa separating hyperplane, and the probability of an instance depends on the distance between it andthe hyperplane. While in standard logistic regression we would directly plug these probabilitiesinto the likelihood cost function, for MIL we must first compute the bag probabilities pi and thenplug the resulting values into the likelihood. If we use one of the max approximations discussedabove, this likelihood will be differentiable with respect to w. We can therefore use a standardunconstrained optimization procedure (e.g. gradient descent) to find a good value for w. A similar,though more sophisticated, model was proposed in [20].

7

Page 8: Demicco & Associates, LLC - Delaware Riverkeeper

2.4.3 Diverse Density

Maron et al. [21] proposed an algorithm called Diverse Density for MIL. Unlike many of theother MIL algorithm, this method does not have a close relative in supervised learning (though it isreminiscent of nearest neighbor). The classifier again consists of just one vector in the input space:h = {w}, w ∈ Rd, which we will call the target point. Rather than defining a hyperplane, however,this vector specifies a point in the input space that is “most positive”. The probability of an instancexij depends on the distance between it and this target point:

pij = exp{−||xij − w||2} (9)

In other words, the positive region in the space is defined by a Gaussian centered at the point w.Having defined this, we can again plug the instance probabilities into one of the max approxima-tions, and then into the likelihood ([21] uses the NOR model). Gradient descent can be used tofind an optimal w, and Maron et al. suggest to use points in the positive bags to initialize (becausepresumably the vector w will lie close to at least one instance in every positive bag). Several ex-tensions can be incorporated into this framework. First, we can easily include a weighting of thefeatures, rather than using basic Euclidean distance:

pij = exp{−∑

k

sk(xijk − wk)2} (10)

It is then straight forward to modify the gradient descent to find both w and s. Furthermore, ratherthan having just one target point w, it is also straight forward to modify this algorithm to findseveral target points. The instance probabilities can then be calculated by taking a max over all thetarget points. This makes the classifier more powerful, and enables it to learn positive classes thatare more sophisticated (essentially a mixture of Gaussians).

The EM-DD [22] algorithm uses the same cost function as the above, but the optimizationtechnique is different. In this algorithm, the authors chose not to use a soft approximation ofmax. As we mentioned before, this leads to a non-differentiable likelihood function. Zhang et al.therefore propose the following simple heuristic algorithm to find the optimal target point w. Inthe spirit of EM, the algorithm iterates over two steps: (1) use the current estimate of the classifierto pick the most probable point from each positive bag, and (2) find a new target point w bymaximizing likelihood over all negative instances and the positive instances from the previous step(can use any gradient descent variant to do this since the max is no longer a problem). In otherwords, step (2) reduces to standard supervised learning, rather than MIL. We will see a similarsearch heuristic used in one of the maximum margin algorithms in Section 2.5

2.4.4 Boosting

Boosting [23] is a popular, powerful tool in statistical machine learning. Many different boostingalgorithms for standard supervised learning exist in the literature, and these algorithms have beenshown to be very effective in many applications. The basic idea behind boosting is to take a simple

8

Page 9: Demicco & Associates, LLC - Delaware Riverkeeper

learning algorithm, train several simple classifiers, and combine them. Typically, the combinationis done via weighted sum, the classifier is defined as: h(x) = sign

( ∑Tt=1 αtht(x)

), where ht ∈ H

is a “weak classifier” (sometimes called the “base classifier”), and αt are the positive weights.Viola et al. proposed a boosting algorithm for solving MIL in [10]. They define the probability

of an instance as

pij = σ( ∑

t

αtht(xij))

(11)

This is similar to posterior probability estimates for AdaBoost and LogitBoost [24]. They thenuse the NOR and ISR max approximations to define the bag probabilities pi. Using the gradientboosting framework [25, 26], they derive an boosting procedure that optimizes the bag likelihood.Gradient boosting works by training one weak classifier at a time, and treating each of them as agradient descent step in function space. To do this, we first compute the gradient ∂L/∂h and thenfind a weak classifier ht that is as close as possible to this gradient (note that it may be impossibleto move exactly in the direction of the gradient because we are limited in our choice of ht). Theweight αt can then be found via line search.

2.4.5 Other Gradient Descent algorithms

Other gradient descent based MIL algorithms can be framed in a manner similar to the above,by picking various differentiable loss functions for supervised learning, and plugging in the maxapproximation to compute the loss of a bag. For example, in [18] a neural network algorithm isderived by taking the squared loss function that is generally used in neural nets for supervisedlearning, and plugging in the LSE max approximation from above. The neural network can thenbe trained using a variation of the back propagation algorithm.

2.5 Maximum Margin

Due the success of the Support Vector Machine (SVM) algorithm [1], and the various positivetheoretical results behind it, maximum margin methods have become extremely popular in machinelearning. It is no surprise then that many variations of SVM have been proposed for the MIL

problem. In this section we review a few of the most representative algorithms. We begin with abrief review of soft margin SVM.

2.5.1 Review of max margin methods

The basic idea behind SVM is to find a hyperplane in the input space2 that separates the trainingdata points with as big a margin as possible. The classifier is defined by the hyperplane normal wand the offset b, h = {w, b}. The margin, γ, is defined as the smallest distance between a positive

2For certain formulations of SVM the “kernel trick” can be employed to first map the input instances into a highdimensional feature space.

9

Page 10: Demicco & Associates, LLC - Delaware Riverkeeper

‐ negative instances‐ positive bag 1‐ positive bag 2‐ support vector

MI‐SVM mi‐SVM

Figure 3: Comparing the solutions of MI-SVM and mi-SVM for a simple 2D dataset.

or negative point and this hyperplane. Because the data can be scaled arbitrarily, it can be shownthat the following optimization is equivalent to maximizing the margin:

minw,b,ξ

12 ||w||

22 + C

∑i ξi (12)

s.t. yi(w · xi + b) ≥ 1− ξi

The above can be interpreted as follows: we assume that the margin is at least 1, and shrink thesize of the hyperplane normal w. The latter can also be seen as a form of regularization. Becausethe data may actually not be separable, we also include slack variables ξi for each point xi. Thepoints that are closest to the hyperplane (the ones that lie γ away) are called support vectors.

2.5.2 SVM for MIL

The main question we need to consider when adapting SVM to the MIL problem is how to definethe margin. For negative bags, this is simple - all instances in negative bags are known to be neg-ative, so the margin is defined as in the regular case. For positive bags, however, the issue is morecomplicated. Andrews et al. pioneered the work on max margin algorithms for MIL [27, 28]. Inthis work they propose two different ways of defining the margin, and present two programs thatsolve for the max margin classifier. In both cases the optimization turns out to be a mixed integerprogram, so Andrews et al. propose simple heuristic algorithms to do this optimization.

Bag margin: identifying the witnessThe first type of margin is what we will refer to as the bag margin. As mentioned earlier, fornegative instances, the margin is defined as in regular SVM. For positive instances we define themargin of a bag as the maximum distance between the hyperplane and all of its instances. Since wewant at least one instance in a positive bag to be positive, we want at least one instance in a positivebag to have a large positive margin. In other words, the maximum over all the instance margins in

10

Page 11: Demicco & Associates, LLC - Delaware Riverkeeper

the bag must be bigger than one. Incorporating slack variables, we arrive at the following program:

minw,b,ξ

12 ||w||

22 + C

∑ij ξij (13)

s.t. (w · xij + b) ≤ −1 + ξij ,∀i|yi = 0maxj(w · xij + b) ≥ 1− ξij ,∀i|yi = 1

ξij ≥ 0

The second constraint in this optimization is not convex, which makes this a difficult problem tosolve. By introducing an extra variable s(i) for each bag, we can convert the above program into amixed integer program:

mins(i)

minw,b,ξ

12 ||w||

22 + C

∑ij ξij (14)

s.t. (w · xij + b) ≤ −1 + ξij ,∀i|yi = 0(w · xis(i) + b) ≥ 1− ξij ,∀i|yi = 1

ξij ≥ 0

The intuition of this program is that the variables s(i) are trying to locate the witness instance ineach bag. Note that the constraints involve just one instance per positive bag. This means that thepotentially negative instance in a positive bag will be ignored. This version of SVM for MIL iscalled MI-SVM.

Mixed integer programs are difficult to solve; therefore, Andrews et al. propose a simple heuris-tic algorithm for solving the above program. The heuristic has the flavor of an EM algorithm: firstthe values of s(i) are guessed using the current classifier (by choosing the instance in every positivebag with the largest margin), and then the SVM classifier is trained just as in regular supervisedlearning. These two steps are repeated until convergence, when the values of s(i) stop changing.Note that this optimization procedure is very similar to the one used in EM-DD, although the ob-jective function is different.

Instance margin: identifying the instance labelsIn the above formulation, we were trying to recover the latent variable s(i) that identifies thekey instance in every positive bag. As mentioned previously, this approach ignores the negativeinstances in the positive bags. On the other hand, we could attempt to recover the latent variableyij , which is the label of every instance. Again, for negative bags this is known to always be 0.Once these variables are known, we could train a regular SVM using these instances labels. The

11

Page 12: Demicco & Associates, LLC - Delaware Riverkeeper

program is summarized below:

minyij

minw,b,ξ

12 ||w||

22 + C

∑ij ξij (15)

s.t. yij = 0,∀i|yi = 0∑j yij ≥ 1,∀i|yi = 1

y′ij(w · xij + b) ≥ 1− ξij

ξij ≥ 0

This variation of SVM for MIL is called mi-SVM. This is also a mixed integer program, and is alsodifficult to solve. As before, Andrews et al. propose a heuristic algorithm that is similar to the onewe described for MI-SVM. The algorithm again iterates over two steps: use the current classifierto compute instance labels yij , then use these to train a standard SVM. If for a positive bag, noneof the instance labels come out as positive, the instance with the largest value of (w · xij + b) islabeled positive.

Figure 3 shows a simple example that illustrates the differences between MI-SVM and mi-SVM. The important difference is that in the former approach, the negative instance in the positivebags are essentially ignored, and only one instance in each positive bag contributes to the optimiza-tion of the hyperplane; in the latter, the negative instances in the positive bags, as well as multiplepositive instances from one bag can be support vectors.

2.5.3 Sparse MIL

As mentioned previously, both the mi-SVM and MI-SVM involve mixed integer programs whichare difficult to solve. Instead, Bunescu et al. propose a third alternative for adapting an SVM to theMIL problem [11]. We begin with the program in Equation 13. The first constraint is simple anddoes not present any problems; the second constraint is the one that causes problems. Our goal isthen to replace this constraint with something more manageable.

First, we add the following constraint for all positive bags: |wxij + b| ≥ 1. This ensures thatevery instance in the bag has a large margin, whether it be negative or positive. Now we wouldlike to enforce the fact that every positive bag should have at least one positive instance. Considerthe following constraint:

∑j(wxij + b) ≥ 1 + −1(m − 1) = (2 − m). If all instances in a

bag are negative, then this constraint would be false. Therefore, this constraint ensures that atleast one instance in the bag will be positive. One minor issue is that even if one instance in thebag is positive, the constraint might be violated if the margins of the negative points have a largemagnitude. In this sense, this constraint is pulling all the negative instances in the bag closer to thehyperplane. However, the first constraint we added ensures that the margin for the negative points

12

Page 13: Demicco & Associates, LLC - Delaware Riverkeeper

is large enough. Adding the slack variables, we arrive at the following program:

minw,b,ξ

12 ||w||

22 + C

∑ij ξij (16)

s.t. (w · xij + b) ≤ −1 + ξij ,∀i|yi = 0∑j(w · xij) ≥ (2−m),∀i|yi = 1

|w · xij + b| ≥ 1ξij ≥ 0

Although this program is not convex, Bunescu et al. show how to solve it with a Concave-Convexsolver (CCCP), which gives an exact solution unlike a heuristic algorithm. Furthermore, they showthat this algorithm works particularly well on problems with sparse positive bags (where there arefew positives in every positive bag).

2.5.4 Combining DD and 1-norm SVM

An interesting combination of Diverse Density and SVM was developed in [29, 30]. Recall that theDD classifier consists of a “target” point w, and instances are considered positive if they are closeto this point. Bi et al. propose to embed all bags in a new feature space, and then train a regularSVM classifier. The intuition behind this feature space is closely linked to the idea behind DD.

Let x1, x2 . . . xt be the all the instances in positive bags (e.g. xij ,∀i|yi = 1). For each bag weconstruct the following vector:

di =

minj m(x1, xij)minj m(x2, xij)

...minj m(xt, xij)

(17)

In other words, in the new feature space, there will be a dimension for every xij in positive bags,and the value of this dimension will the distance to the closest bag member. Suppose that one ofthese xt instances was the target point w (or was very close to the target point). That means that forpositive bags, the value of the tth dimension in the new feature space will be small, because therewill be at least one point close to that target point. On the other hand, for negative bags, the valuewill be large. This intuition leads us to believe that this space will be discriminative.

The authors then propose to use a 1-norm SVM instead of the usual L2 based SVM objectivefunction. This is because the 1-norm SVM often results in a vector w that is more sparse. Thisis particularly attractive in this scenario because every dimension of w corresponds to an instancefrom the positive bags. First off the dimensionality will be quite large, so a sparse solution isattractive. In addition, a sparse solution will essentially “pick out” the target point(s).

minw,b,ξ

||w||1 + C∑

i ξi (18)

s.t. y′i(w · di + b) ≥ 1− ξi

13

Page 14: Demicco & Associates, LLC - Delaware Riverkeeper

An important note about this method is that it builds a bag classifier rather than an instanceclassifier. However, due to the way this problem is structured, we could easily classify an instanceby pretending it is a bag of size 1.

2.6 Other approaches

Recall that in MIL, there is a simple relationship between a bag classifier and an instance classifier:H(Xi) = maxj h(xij). As mentioned before, this relationship allows us to classify bags usingan instance classifier. In some applications, however, only a bag classifier is needed, and in somecases it can be argued that training a bag classifier directly, rather than training an instance classifierand using it to classify bags, is an easier problem. A couple of methods have been proposed forthis [8, 7], but they generally disregard the MIL assumption that a positive bag contains at least onepositive instance, and instead treat the problem as set classification.

3 Applications

In this section we will review a few interesting applications of Multiple Instance Learning. Moreimportantly, we will highlight the fundamental differences in the datasets these applications in-volve. In particular, there appear to be two distinct types of applications for MIL. We borrow theterminology of [31] and call these two types “polymorphism ambiguity” and “part-whole ambi-guity”. In the former, an object can have multiple different appearances in input space and it isnot known which of these is responsible for the output/label; in the latter, an object can be brokeninto several pieces/parts, each of which has a representation in input space, and labels are assignedto the whole objects rather than the parts, though only one part is responsible for the label. Wecontinue with a few concrete examples, and in Section 4 we discuss how the differences betweenthese types of applications affect algorithm design.

3.1 Polymorphism Ambiguity

The original application for MIL that was proposed by Dietterich et al. [3] is a case of polymor-phism ambiguity. In this application, we wish to classify molecules by looking at their shape. Eachmolecule can twist and bend, and can therefore appear in several distinct shapes (see Fig. 4). Itwould require significantly more work for a chemist to figure out which shape is responsible forthe label (which specifies a certain behavior) of the molecule. Instead, we can collect the mea-surements for all the different shapes of a molecule, and assign a label to the entire bag of shapes.Dietterich et al. also introduced a dataset, that is commonly used to this day, called MUSK. Thisdataset consists of labels indicating whether a molecule produces a “musky” smell. If successful,these types of molecule classification methods can be of great use in drug discovery.

14

Page 15: Demicco & Associates, LLC - Delaware Riverkeeper

Figure 4: Polymorphism Ambiguity. This is a figure from [3] that we reproduce for convenience;it shows an example of one molecule taking on different shapes. Presumably, only one of these twoshapes would be responsible for certain behaviors of this molecule.

3.2 Part-whole Ambiguity

The part-whole ambiguity datasets commonly arise in the computer vision and audition domains.In these areas it is common to have applications where labeling an entire image or sound clip ismuch easier than precisely cropping out an object of interest (e.g. a face in an image, a word in asound clip).

The first example we will go over is object detection. Current methods in computer vision forobject detection often train a classifier using a set of carefully cropped training images. At runtime, the classifier is applied to every sub-window of a larger image, and non-maximal suppressionis usually applied for post processing. Generating these carefully cropped training images, how-ever, can be burdensome. Furthermore, it is difficult for a human labeler to be consistent with thecropping. For this reason, Viola et al. argue that placing bounding boxes around objects is an inher-ently ambiguous task. Instead they propose to gather training data into bags: for each image that isknown to contain the object of interest, several bounding boxes (potentially of different scales andorientation) are cropped to generate a bag for MIL. Note that now only image labels are required,rather than precise bounding boxes. An example is shown in Figure 5 (BOTTOM).

A closely related approach in computer vision is to use image segmentation instead of denselysampling sub-windows of an image. As before, labels are given to images even though the labelsapply only to one region of the image. For both testing and training the image is broken into severalcoherent pieces using an image segmentation algorithm. This is useful for cases when the objectof interest cannot be easily inscribed in a box (e.g. objects that have articulation like humans,animals, etc). Some image statistics are computing for each segment, and the segments of oneimage constitute one bag for MIL. This type of approach has been explored in [27, 29, 30]. Anexample is shown in Figure 5 (TOP).

Computer audition is analogous to computer vision in many ways, except the data is 1 dimen-sional, rather than 2 dimensional. A sound clip can be split into a bag either by using a slidingwindow in the temporal domain, or by splitting it in the frequency domain such that each instancecontains features that describe only a limited range of frequencies[20]. The former is analogous tocropping out object boundaries in computer vision, and the latter can help with noisy audio data.

15

Page 16: Demicco & Associates, LLC - Delaware Riverkeeper

{ }{              }

{ }{          }Figure 5: Part/whole Ambiguity. These are two examples of a bag being generated by splitting animage into regions with (TOP) image segmentation, and (BOTTOM) sliding windows. Note that alabel that would be assigned to the entire image ((TOP) tiger and (BOTTOM) face) is attributed tojust one of the smaller regions.

Another set of applications comes from Bioinformatics, where biological sequences (eitherDNA base pairs, or amino acids) need to be classified. A common approach is to use a slidingwindow across the sequence to form a bag of overlapping sub sequences [32, 9]. The assumptionis that only one of the subsequences is responsible for the behavior of the entire DNA sequence orprotein.

Finally, document/text classification has been investigated within a MIL framework [27]. Adocument is general split into paragraphs, or into overlapping pieces consisting of several para-graphs. The features computed for each piece of the document are presumably less noisy thanfeatures computed over the entire document; this can lead to more precise classification.

4 Discussion & Future Work

An obvious question to ask is which MIL algorithm is the best. As is the case with supervisedlearning, the performance of various algorithms depends on the datasets, and no absolute “best”algorithm exists. Unfortunately there have been few comprehensive studies comparing a widerange of algorithms. [9] is one such study, and it concludes that the performance of differentalgorithms varies depending on the data. Although the MUSK dataset is commonly used to evaluateMIL algorithms, it is difficult to use this dataset to draw conclusions about algorithms because itcontains a relatively small amount of data (note that it was introduced in 1997 [3]).

Perhaps the most interesting observation is that there indeed appear to be two distinct classes ofapplications for MIL (polymorphism and part/whole ambiguities), and yet the algorithm develop-

16

Page 17: Demicco & Associates, LLC - Delaware Riverkeeper

ment has been independent of this distinction. The datasets belonging to these two different typesof ambiguities may behave in very different ways, and it would be interesting to see how thesebehaviors can be used to our advantage.

One important difference in behavior between these two ambiguities is the effect of bag size, m.Recall from Section 2.3 that the best known sample complexity bound for MIL is O(d2m/ε2) [15].This implies that increasing the bag size m actually making the problem harder. Of course thisbound would only hold if the data assumptions made by this PAC bound hold. In particular, thereare two important assumptions made: (1) the instances in a bag are drawn independently, and (2)each positive bag contains at least one positive instances. In the case of polymorphism ambiguity,these seem to be a reasonable assumptions, and the implied behavior tends to be consistent withreality. For example, when dealing with molecules that have a huge number of possible shapes,finding the one that matters becomes difficult. Note that in this scenario the user does not have anycontrol over the bag size - it depends purely on the data and is fixed. Furthermore, if a molecule islabeled positive, then the bag definitely contains a positive instance.

For part/whole ambiguity, however, this assumption does not hold and the data may behave ina completely opposite manner! Consider the case of object detection with sliding windows gener-ating a bag. Sampling an image densely with sliding windows gives us a higher chance of croppingout the “perfect” bounding box around the face in an image. Therefore, in this application, a largerbag may actually make learning easier. Note that in this type of application, the instances are notat all independent. The other key difference is that here we are actually not guaranteed to have apositive instance in a positive bag. For example, consider the case of using image segmentation togenerate a bag - what if the segmentation performs poorly and returns nonsense segments? Notethat in this scenario the user does have control over the bag size, as there is usually way of deter-mining how many parts the whole object is broken into to generate a bag. The only way to ensurethat the MIL assumption holds in this application is to generate bags exhaustively: take every pos-sible segment in an image, crop out every possible bounding box, etc. Of course, this would beprohibitive in terms of resources as the amount of data explodes – in some cases the number ofparts for one object is actually infinite.

In the future, it would be interesting to develop algorithms that take these observations intoaccount. In particular, an algorithm specifically tailored to the part/whole ambiguity datasets wouldbe useful in many vision and audition applications. Furthermore, the theoretical properties of thesetypes of datasets should be explored by making more appropriate assumptions.

Acknowledgements

The author would like to thank Serge Belongie, Sanjoy Dasgupta and Lawrence Saul for theirfeedback.

17

Page 18: Demicco & Associates, LLC - Delaware Riverkeeper

References

[1] Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag (1995)

[2] Bishop, C.: Pattern Recognition and Machine Learning (Information Science and Statistics),Secaucus. NJ, USA. Springer, Heidelberg (2006)

[3] Dietterich, T.G., Lathrop, R.H., Lozano-Perez, T.: Solving the multiple-instance problemwith axis parallel rectangles. A.I. (1997)

[4] Keeler, J.D., Rumelhart, D.E., Leow, W.K.: Integrated segmentation and recognition of hand-printed numerals. In: NIPS. (1990)

[5] Ray, S., Page, D.: Multiple instance regression. In: ICML. (2001) 425–432

[6] Zhou, Z., Zhang, M.: Multi-Instance Multi-Label Learning with Application to Scene Clas-sification. In: NIPS. (2006)

[7] Wang, J., Zucker, J.: Solving the multiple-instance problem: A lazy learning approach. In:ICML. (2000) 1119–1125

[8] Gartner, T., Flach, P., Kowalczyk, A., Smola, A.: Multi-instance kernels. In: ICML. (2002)179–186

[9] Ray, S., Craven, M.: Supervised versus multiple instance learning: an empirical comparison.ICML (2005) 697–704

[10] Viola, P., Platt, J.C., Zhang, C.: Multiple instance boosting for object detection. In: NIPS.(2005)

[11] Bunescu, R., Mooney, R.: Multiple instance learning for sparse positive bags. In: ICML.(2007) 105–112

[12] Auer, P., Long, P., Srinivasan, A.: Approximating Hyper-Rectangles: Learning and Pseudo-random Sets. Journal of Computer and System Sciences 57(3) (1998) 376–388

[13] Long, P., Tan, L.: PAC Learning Axis-aligned Rectangles with Respect to Product Distribu-tions from Multiple-Instance Examples. Machine Learning 30(1) (1998) 7–21

[14] Valiant, L.: A theory of the learnable. Communications of the ACM 27(11) (1984) 1134–1142

[15] Blum, A., Kalai, A.: A Note on Learning from Multiple-Instance Examples. Machine Learn-ing 30(1) (1998) 23–29

[16] Kearns, M.: Efficient noise-tolerant learning from statistical queries. Journal of the ACM(JACM) 45(6) (1998) 983–1006

18

Page 19: Demicco & Associates, LLC - Delaware Riverkeeper

[17] Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge Univ. Press (2004)

[18] Ramon, J., De Raedt, L.: Multi instance neural networks. ICML, Workshop on Attribute-Value and Relational Learning (2000)

[19] Babenko, B., Dollar, P., Tu, Z., Belongie, S.: Simultaneous learning and alignment: Multi-instance and multi-pose learning. In: Faces in Real-Life Images. (2008)

[20] Saul, L.K., Rahim, M.G., Allen, J.B.: A statistical model for robust integration of narrowbandcues in speech. Computer Speech and Language 15 (2001) 175–194

[21] Maron, O., Lozano-Perez, T.: A framework for multiple-instance learning. In: NIPS. (1998)

[22] Zhang, Q., Goldman, S.: EM-DD: An improved multiple-instance learning technique. In:NIPS. Volume 14. (2002) 1073–1080

[23] Freund, Y., Schapire, R.: A decision-theoretic generalization of on-line learning and anapplication to boosting. Journal of Computer and System Sciences 55 (1997) 119–139

[24] Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view ofboosting. Technical report, Stanford University (1998)

[25] Mason, L., Baxter, J., Bartlett, P., Frean, M.: Boosting algorithms as gradient descent. NIPS12 (2000) 512–518

[26] Friedman, J.: Greedy function approximation: A gradient boosting machine. Annals ofStatistics 29(5) (2001) 1189–1232

[27] Andrews, S., Hofmann, T., Tsochantaridis, I.: Multiple instance learning with generalizedsupport vector machines. A.I. (2002) 943–944

[28] Andrews, S.: Learning from Ambiguous Examples. PhD thesis, Brown University (2007)

[29] Chen, Y., Wang, J.: Image categorization by learning and reasoning with regions. JMLR 5(2004) 913–939

[30] Bi, J., Chen, Y., Wang, J.: A sparse support vector machine approach to region-based imagecategorization. In: CVPR. (2005)

[31] Andrews, S., Hofmann, T.: Multiple Instance Learning via Disjunctive Programming Boost-ing. In: NIPS. (2004)

[32] Zhang, Y., Chen, Y., Ji, X.: Motif Discovery as a Multiple-Instance Problem. (2006) 805–809

19