Nonnegative matrix factorization for segmentation analysis

Roman Sandler

Technion - Computer Science Department - Ph.D. Thesis PHD-2010-09 - 2010


Nonnegative matrix factorization for segmentation analysis

Roman Sandler


Nonnegative matrix factorization for segmentation analysis

Research thesis

In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

Roman Sandler

Submitted to the Senate of the Technion - Israel Institute of Technology

Haifa, Tamuz 5770 (June 2010)


The research thesis was done under the supervision of Prof. Michael Lindenbaum in the Faculty of Computer Science.

The generous financial help of the Technion is gratefully acknowledged.


Contents

1 Introduction
  1 Introduction
  2 Segmentation
    2.1 Feature space clustering
    2.2 Graph partitioning methods
    2.3 Numerical geometry methods
    2.4 Hierarchical Segmentation
    2.5 Model-assisted segmentation
  3 Segmentation evaluation
    3.1 Supervised evaluation
    3.2 Unsupervised evaluation
    3.3 Consensus of several segmentation hypotheses
    3.4 Theoretical performance prediction
  4 Nonnegative matrix factorization
  5 Earth mover's distance
    5.1 Fast EMD approximations

2 Segmentation Evaluation
  1 Framework
  2 Finding true segmentation distributions using nonnegative matrix factorization
    2.1 Nonnegative matrix factorization
    2.2 NMF algorithms
    2.3 Factorizing the histogram matrix
    2.4 Estimating model complexity using several modalities
    2.5 Dealing with boundary inaccuracies

3 Nonnegative Matrix Factorization with Earth Mover's Distance metric
  1 Observations and intuitions
  2 EMD NMF
    2.1 Earth mover's distance
    2.2 Single domain LP-based EMD algorithm
    2.3 Convergence
    2.4 Bilateral EMD NMF
  3 Efficient EMD NMF algorithms
    3.1 A gradient based approach
    3.2 A gradient optimization with WEMD approximation
    3.3 The optimization process

4 Applications
  1 A tool for unsupervised online algorithm tuning
  2 Face recognition
    2.1 The EMD NMF components
    2.2 Face recognition algorithm
  3 Texture modeling
  4 NMF and image segmentation
    4.1 A naive NMF based segmentation algorithm
    4.2 Spatial smoothing
    4.3 Multiscale factorization
    4.4 Boundary aware factorization
    4.5 Bilateral EMD NMF segmentation algorithm

5 Experiments
  1 Evaluation experiments
    1.1 The accuracy of unsupervised estimates
    1.2 Application: image-specific algorithm optimization
  2 Face recognition experiment
  3 Texture descriptor estimation
  4 Segmentation

6 Discussion


List of Figures

2.1 Distribution curves
2.2 Object distribution examples
3.1 Space-feature domain
3.2 Illustrating the intuition
4.1 Precision/recall performance with fixed and manually chosen parameter sets
4.2 Precision/recall performance of the N-cut on 100 Berkeley images for some fixed k-s
4.3 Facial space for 4 people. The two-dimensional (w1, w2) convex subspace is projected onto the triangle with corners in (1, 0), (0, 1), and (0, 0). The corners of the triangle represent the basis facial archetypes obtained by EMD NMF. The inner points show the actual facial images weighted in this basis.
4.4 Examples of texture mosaics
4.5 Multiscale W estimates
5.1 Precision/recall estimation
5.2 Inconsistency distributions for different measurement methods
5.3 Precision/recall performance for fixed sets and automatic parameter choice
5.4 The performance for the outlier images
5.5 Examples of segmentation errors
5.6 The Yale faces database. The database contains images of 15 people, and we considered 8 images for each person. The first two rows show examples of the database images. The last row shows the basis images obtained with EMD NMF.
5.7 Typical recognition error in ORL database. When the test face image (a) is in a very different pose from that of the same person in the training set, the most similar person in the same pose (b) may be erroneously identified. The second-most similar identifications (c,d) are correct.
5.8 Texture descriptor estimation accuracy
5.9 Segmentation examples, Weizmann database
5.10 Segmentation examples, Berkeley database


List of Tables

4.1 Distribution of the images in the Berkeley test set according to the better performing algorithm
5.1 The average F and ∆F values for the segmentation algorithms
5.2 Classification accuracies of different algorithms on the ORL database and the corresponding basis sizes cited from [74]
5.3 Classification accuracy of EMD NMF on the ORL database for different basis sizes


Abstract

This research is concerned with image segmentation, one of the central problems of image analysis.

A new model of a segmented image is proposed and used to develop tools for the analysis of image segmentations: image specific evaluation of segmentation algorithms' performance, extraction of image segment descriptors, and extraction of image segments. Prevalent segmentation models are typically based on the assumption of smoothness within segments, and contrast between them, in the chosen image representation. The proposed model, in contrast, describes segmentations using image adaptive properties, which makes it relatively robust to context factors such as image quality or the presence of texture. The image representation in the proposed terms may be obtained in a fully unsupervised process and does not require learning from other images.

The proposed model characterizes the histograms, or some other additive feature vectors, calculated over the image segments as nonnegative combinations of basic histograms. It is shown that correct (manually drawn) segmentations generally have similar descriptions in such a representation. A specific algorithm to obtain such histograms and combination coefficients is proposed; it is based on nonnegative matrix factorization (NMF).

NMF approximates a given data matrix as a product of two low rank nonnegative matrices, usually by minimizing the L2 or the KL distance between the data matrix and the matrix product. This factorization was shown to be useful for several important computer vision applications. New NMF algorithms are proposed here to minimize two kinds of Earth Mover's Distance (EMD) error between the data and the matrix product. We propose an iterative NMF algorithm (EMD NMF) and prove its convergence. The algorithm is based on linear programming. We discuss the numerical difficulties of EMD NMF and propose an efficient approximation.

The advantages of the proposed combination of a linear image model with a sophisticated decomposition method are demonstrated in several applications. First, we use the boundary mixing weights (the boundary is widened and is also considered a segment) to assess image segmentation quality in precision and recall terms without ground truth. We demonstrate a surprisingly high accuracy of the unsupervised estimates obtained with our method in comparison to human-judged ground truth. We use the proposed unsupervised measure to automatically improve the quality of popular segmentation algorithms. Second, we discuss the advantage of EMD NMF over L2-NMF in the context of two challenging computer vision tasks: face recognition and texture modeling. The latter task is built on the proposed image model and demonstrates its application to non-histogram features. Third, we show that a simple segmentation algorithm based on rough segmentation using bilateral EMD NMF performs well for many image types.


List of Notations

NMF - Nonnegative Matrix Factorization

H∗ - Feature distributions associated with the given segments

H - Feature distributions associated with the true segments

W - Mixing weights of the distributions

EMD - Earth Mover's Distance

P - Precision

R - Recall

F-value - A commonly used scalar measure representing both P and R

h~x(f) - Feature distribution at location ~x

hf(~x) - Spatial distribution for a feature subset f

BEMD - Bilateral Earth Mover's Distance

WEMD - Wavelet Earth Mover's Distance approximation

F(~x) - Feature value at location ~x

TV(x) - Total variation


Chapter 1

Introduction


1 Introduction

The task of segmentation, or more generally grouping, is an essential stage in image analysis. The analysis of a typical complex scene is significantly simplified if the image is partitioned into semantically meaningful parts. High level vision algorithms, such as recognition and scene analysis, usually require reasonably segmented image parts as an input. The demand for good grouping methods is thus well known. A lot of research on segmentation has been done during the last decades, and many algorithms implementing different approaches have been proposed. However, the goal of an efficient, fully automatic image segmentation algorithm is yet to be reached. Although the existing algorithms are able to solve a variety of specific (even fairly general) segmentation tasks correctly and efficiently, there is still no unified approach able to solve the segmentation problem in general.

At first sight the task of image segmentation seems reasonable and well defined. Most humans can effortlessly segment an image into meaningful objects in a way that seems reasonable to other people. Over the years image segmentation has been considered from numerous directions. Psychologists found various image properties which humans associate with object boundaries and details (the Gestalt rules [72]). Different mathematical formulations (e.g., a functional minimization problem [11, 64, 51] and a relaxation problem [63, 29]) have been proposed to describe segmentation. Many different representations of the image data have been proposed: brightness, color, texture, etc. Many combinations of approaches [46], feature spaces [29, 20, 14, 66, 19], and mathematical tools were tested. Learning methodologies for solving specific classes of segmentation problems were demonstrated [8, 37]. However, in contrast to humans, the existing state-of-the-art methods usually perform well only on specific classes of segmentation tasks. None of them can be taken as is and applied to a different class of segmentation tasks. Moreover, even if such a method is applied to an image from the appropriate class, it may still fail if the image is "non-representative" for the class.

Consequently, this research is not concerned with the development of yet another segmentation algorithm (though, in the end, we propose one), but rather with proposing a new, segmentation related, image model and developing image adaptive analysis in the terms of this model. The model refers to two complementary domains: the spatial domain and the image feature domain. We assume that in the spatial domain each image location is associated with a single image segment. Ideally, we would like each object to be associated with a unique feature value. However, this ideal assumption delegates the effects of texture and image quality to the image features, which must then be very complex, such as distribution patterns of filter responses; see e.g. [46]. Here we take the opposite approach. The considered image features are relatively simple, such as brightness gradient magnitude or filter responses, and naturally, many objects share similar feature values. To specify an object uniquely in the feature domain, we assume that it is associated with a feature distribution different from the distributions of the other objects. This approach allows us to use the same image features for different image classes, while the feature domain object description becomes image-specific instead of class-specific.

In the proposed model, a feature description of any (not necessarily correct) image segment can be obtained from the descriptions of the true segments. A feature distribution associated with such a segment is a mixture of the true segments' distributions. Moreover, the mixture weights are equal to the area ratios of the true segments composing the analyzed segment. This model can be mathematically expressed as a matrix product:

H∗ = HW, (1.1)

where H∗ = (h∗1 | … | h∗M) are the feature distributions associated with M different segments, H = (h1 | … | hK) are the object feature distributions, and W are the mixing weights, which actually have a spatial interpretation.
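The mixture relation can be checked numerically. The sketch below (with hypothetical histograms and pixel counts, not data from this thesis) builds a segment covering parts of two objects and verifies that its pooled histogram equals the area-weighted combination Hw:

```python
import numpy as np

# Hypothetical feature distributions of two "true" objects over 4 bins
h1 = np.array([0.7, 0.2, 0.1, 0.0])
h2 = np.array([0.0, 0.1, 0.3, 0.6])
H = np.column_stack([h1, h2])        # columns are the true distributions

# A hypothesized segment covering 30 pixels of object 1 and 10 of object 2:
# its mixing weights are the area ratios
w = np.array([30, 10]) / 40.0
h_star = H @ w                        # model prediction (one column of H*)

# The same histogram obtained by directly pooling the pixels of both parts
h_direct = (30 * h1 + 10 * h2) / 40.0
print(np.allclose(h_star, h_direct))  # True
```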

Note that the matrix H∗ is usually easy to calculate directly from the image. Then, knowing H, one can obtain spatial information on the segments associated with H∗. Moreover, even if H is unknown, the feature distributions associated with the true segments, along with some spatial information on these segments, can be estimated using nonnegative matrix factorization (NMF).

A result of NMF is a pair of matrices holding the feature (distribution) and spatial information on the true segments. This information can be directly used in important vision tasks. Given a segmentation hypothesis, its quality can be evaluated by comparing the feature distributions associated with the hypothesized segments to the true distributions. The obtained true feature distributions can be used to identify, in a database, the textures composing the image. Finally, the feature and spatial distributions can be used for actual image segmentation.

The core of the proposed approach is an NMF process. NMF approximates a given data matrix H∗ as a product of two low rank nonnegative matrices, H∗ ≈ HW. The factorization becomes useful and interesting when the multiplied matrices are of low rank, which usually implies that the factorization is approximate. In this case, the decomposition is useful for representing a signal as an additive combination of a small number of atomic signals (segment descriptors in our case).
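For a concrete (if simplified) picture, the sketch below generates synthetic segment histograms H* = HW from a random rank-2 model and recovers an approximate factorization with scikit-learn's standard NMF (KL divergence with multiplicative updates). This is off-the-shelf NMF only; the EMD-based variants developed later in this work are not part of scikit-learn:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
B, K, M = 8, 2, 20          # feature bins, true segments, observed segments

# Columns of H_true are the (unknown) object feature distributions h_k;
# columns of W_true are area-ratio mixing weights, so H_star = H_true W_true
H_true = rng.dirichlet(np.ones(B), size=K).T      # B x K
W_true = rng.dirichlet(np.ones(K), size=M).T      # K x M
H_star = H_true @ W_true                          # B x M observed histograms

# Standard low-rank NMF in the spirit of Lee and Seung [39]
model = NMF(n_components=K, init="nndsvda", beta_loss="kullback-leibler",
            solver="mu", max_iter=2000, tol=1e-8)
H_est = model.fit_transform(H_star)               # B x K estimated basis
W_est = model.components_                         # K x M estimated weights

# Reconstruction error is near zero, since the data are exactly rank 2
err = np.linalg.norm(H_star - H_est @ W_est)
```

Note that the recovered factors are determined only up to scaling and permutation of the components, which is why the check below is on the reconstruction rather than on H_est itself.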

The basic algorithm proposed by Lee and Seung [39] gets a matrix H∗ and tries to find a pair of low rank nonnegative matrices H and W satisfying

min_{H,W} Dist_φ(H∗, HW)   s.t.   W ≥ 0, H ≥ 0,   (1.2)

where the distance φ is either the Frobenius norm or the Kullback-Leibler distance.

We argue that measuring the dissimilarity of H∗ and HW as the L2 or KL distance, even

with additional bias terms, does not match the nature of errors in realistic imagery, and that other types of distances should be preferred. In particular, the Earth mover's distance (EMD) [56] metric is known (e.g., [71, 56, 42, 30]) to quantify the errors in image or histogram matching better than other metrics. Its error mechanism, modeled as a complex local deformation of the original descriptor, is often a good model for the actual distortion processes in image formation. In this work it is proposed to factorize the given matrix using the EMD; that is, the minimization (1.2) is considered here with the EMD as the distance φ.
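The difference between the two error models is visible already for one-dimensional histograms. In the sketch below (using SciPy's wasserstein_distance, which computes the 1-D EMD; an illustration only, not the LP-based solver developed later), the L2 distance rates a small shift and a large shift of the same peak as equally bad, while the EMD grows with the amount of mass transport:

```python
import numpy as np
from scipy.stats import wasserstein_distance

bins = np.arange(5)
h1 = np.array([0., 0., 1., 0., 0.])   # all mass at bin 2
h2 = np.array([0., 0., 0., 1., 0.])   # shifted by one bin
h3 = np.array([0., 0., 0., 0., 1.])   # shifted by two bins

# L2 cannot tell the two shifts apart
print(np.linalg.norm(h1 - h2), np.linalg.norm(h1 - h3))   # both sqrt(2)

# EMD reflects how far the mass actually moved
print(wasserstein_distance(bins, bins, h1, h2))            # 1.0
print(wasserstein_distance(bins, bins, h1, h3))            # 2.0
```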

We propose two EMD based NMF tasks and provide linear programming based algorithms for the factorization. More efficient algorithms, based on the wavelet EMD approximation [65], are described as well. The more general algorithm, denoted bilateral EMD NMF, is suitable for the case when the distortion is well modeled by a small (in the EMD sense) error in both the spatial and the feature domains. The simpler algorithm is preferred when the distortion fits the EMD model in only one of the domains.

We examined the proposed approaches on four vision tasks. First, the proposed image model in conjunction with EMD NMF was applied to image segmentation quality estimation. We demonstrate a surprisingly high accuracy of the unsupervised estimates obtained with the proposed method in comparison to human-judged ground truth. Then, the proposed unsupervised measure is used to automatically improve the quality of popular segmentation algorithms. The strength of EMD NMF was next tested on a traditional NMF test case: face recognition. We handle unaligned facial images with some pose changes and different facial expressions and show a performance superior to that of other popular factorization methods. The third task is texture modeling. Given an unlabeled image containing multiple textures, we extract the descriptors of the individual textures using EMD NMF without actually segmenting the image. Finally, we show, for the first time, actual NMF based image segmentation. In all cases we consider sets of naturally deformed signal samples and reconstruct parts which appear to be the meaningful original signals.

The rest of this work is organized as follows: sections 2 through 5 of this chapter provide background on common segmentation algorithms, segmentation quality estimation approaches, nonnegative matrix factorization, and the Earth mover's distance. Chapter 2 provides intuition on the proposed image model by presenting it in the context of the segmentation evaluation task. The EMD NMF methods are developed in chapter 3. Chapter 4 presents four applications of the proposed approach to actual vision tasks. The experimental validation of the theory and benchmarks of the proposed applications are reported in chapter 5. The last chapter (6) provides a discussion of the reported research.


2 Segmentation

Much effort has been dedicated to the development of segmentation algorithms and their analysis, resulting in the large variety of segmentation approaches that exist today. Some of the commonly used ones are graph cut algorithms [11, 64], hierarchical segmentation [63], model assisted methods [8], and numerical geometry based methods [51]. The performance of the state-of-the-art algorithms associated with each approach is more or less similar, but it is still not close to human performance. A recent comparative evaluation of several algorithms is described in [27].

Most of the algorithms mentioned above are optimization procedures over several basic properties of an image. The properties commonly used for such optimization are intra-region homogeneity, inter-region contrast, and a model of the boundaries. While the first two properties are relatively well studied and many algorithms for clustering and edge detection are available, the boundary property is more subjective and is usually obtained with learning. It should be noted that the optimization techniques generally used in segmentation algorithms are kept relatively simple in order to remain efficient.

Over the years, many algorithms dealing with the problems described above have been proposed. Below, we review one or more representative methods for each of several popular approaches. The algorithms mentioned in this section are not completely automatic and require parameter tuning; in chapter 4 we propose an automatic method for doing so.

Several families of image segmentation and segmentation evaluation algorithms have been widely discussed in the computer vision literature recently. Most of them contain some preprocessing steps, but it is common to group them according to the last (most important) step, the segmentation itself. In this summary the same approach is taken, and the differences between the preprocessing steps are discussed for the relevant examples. Note also that some relations between the different algorithm families have been shown; in principle, the same algorithm may be considered from the point of view of a different algorithm family, even though from this point of view it will be much more complicated.

2.1 Feature space clustering

In this approach the pixels of an image are transformed into another representation, where each pixel has a feature vector containing values describing its location and/or its neighborhood. The segmentation algorithm receives the distribution of these vectors and searches for its best clustering. The locations associated with each cluster are considered parts of the same object. A geometric restriction is usually applied, so that two distant pixels are less likely to become parts of the same object.

The common steps of the algorithms in this family are:

• Converting the image to a feature space, such as color-location [18], Gabor [59], or wavelet features, or just the original gray levels. Some more complex modern algorithms [46, 69] cluster the obtained feature space data and take the distribution of cluster indexes in a pixel's neighborhood as its feature vector.

• Clustering of the feature vectors. The simplest method is K-means; it is very popular because it is relatively fast. More precise methods, mean-shift for example, demand a much greater computational effort but improve the results [19].

• Postprocessing. In this stage the pixels are assigned to the group of the closest cluster center. If the resulting segmentation contains isolated pixels of different groups, it is repaired by some smoothing algorithm.
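The steps above can be sketched with off-the-shelf tools. The toy example below (a synthetic two-region image and an arbitrary spatial weight, chosen for illustration; this is plain K-means, not the mean shift tool [19] used in our experiments) clusters color-location style feature vectors:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic grayscale image: two flat regions plus mild noise
img = np.zeros((40, 40))
img[:, 20:] = 1.0
img += rng.normal(0.0, 0.05, img.shape)

# Feature vector per pixel: (gray value, scaled x, scaled y);
# the spatial weight is an arbitrary choice for this sketch
ys, xs = np.mgrid[0:40, 0:40]
spatial_weight = 0.2 / 40.0
feats = np.column_stack([img.ravel(),
                         spatial_weight * xs.ravel(),
                         spatial_weight * ys.ravel()])

# Cluster the feature vectors and map cluster indexes back to the grid
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(feats).reshape(img.shape)
# Each flat region ends up in a single cluster
```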

The simplest example of this approach is gray-scale image segmentation by thresholding. More complicated examples may be found in [28]. In our experiments we use the mean shift tool [19], which is one of the state-of-the-art segmentation tools as benchmarked in [27] and [4]. Mean shift has two algorithm-specific parameters: the pixel neighborhood size in the spatial and in the feature space [27].

Interestingly, the mean shift algorithm considers the spatial coordinates as features, and is thus related to the methods which operate in both the feature and the spatial domains, such as graph methods and numerical geometry methods.

2.2 Graph partitioning methods

This approach considers image pixels as graph vertices. The edges of the graph represent the similarity between the connected pixels. The algorithms in this approach look for a graph partitioning that minimizes the cost of the deleted edges and maximizes the cost of the remaining ones.

Partitioning an arbitrary graph is a hard problem. To simplify it, two approaches are taken. One [64] is to apply some simplifications to the graph structure and recast the problem as finding the eigenvalues of a sparse matrix. Another [11] is to define some vertices as "seed" points which belong to different graph partitions; the segmentation problem then becomes a standard maximal flow problem with a known solution.

The problem of finding the minimal cut is different from feature space clustering. However, the simplifications made in [64] can be shown, e.g., in [76], to bring the problem back into the feature clustering domain. Another domain of segmentation problems which was shown to be related to the graph formulation is that of the numerical geometry methods; see [12].

The graph partitioning approach implicitly assumes that the similarity values used as edge weights describe the image well; that is, the weights of edges between vertices belonging to the same object should be higher than the weights of edges between vertices belonging to different objects. If this assumption does not hold, the segmentation fails. For example, the N-Cut method with features based on brightness similarity fails to segment some images containing textured objects: the brightness contrast between different textures may be weaker than the contrast created within a single texture. One way to overcome this problem is to learn the edge values associated with actual object boundaries for a specific class of images [4].

The parameters of graph-cut algorithms usually specify the relative importance of the spatial vs. the feature originated similarity information [46]. An additional parameter may specify the importance of the seed information over the contrast information [11]. In our experiments we considered the normalized cut algorithm [64]. We used the implementation of its authors [21] and represented the images in three feature spaces (grayscale, color, and texture). We considered only one parameter, the number of segments.
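The spectral relaxation of this idea can be sketched in a few lines (an illustration only, far simpler than the implementation [21] used in our experiments): build a grid graph with brightness-similarity edge weights and threshold the second eigenvector (the Fiedler vector) of the normalized graph Laplacian.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import laplacian
from scipy.linalg import eigh

# Toy image: two flat regions separated by a brightness step
img = np.zeros((10, 10))
img[:, 5:] = 1.0
h, w = img.shape
n = h * w

# 4-connected grid graph; edge weight = brightness similarity
A = lil_matrix((n, n))
sigma = 0.5   # similarity bandwidth, an arbitrary choice for this sketch
for y in range(h):
    for x in range(w):
        i = y * w + x
        for dy, dx in ((0, 1), (1, 0)):
            yy, xx = y + dy, x + dx
            if yy < h and xx < w:
                j = yy * w + xx
                wgt = np.exp(-(img[y, x] - img[yy, xx]) ** 2 / sigma ** 2)
                A[i, j] = A[j, i] = wgt

# Relaxed normalized cut: sign of the Fiedler vector of the normalized Laplacian
L = laplacian(A.tocsr(), normed=True).toarray()
vals, vecs = eigh(L)
labels = (vecs[:, 1] > 0).astype(int).reshape(h, w)
# The sign split recovers the two regions (which side gets which label is arbitrary)
```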


2.3 Numerical geometry methods

In contrast to the previously mentioned approaches, these methods try to separate an object (or a group of objects) from the background. They usually treat the image as a potential field. A contour is initialized to encircle the area of interest and tries to shrink to zero length by moving towards the center of the encircled area. The image (the potential) resists this movement in regions associated with evidence for an object boundary (e.g., high gradient). As a result, in some places the contour penetrates farther than in others, and in the end it lies on the boundary of the object. The segmentation process is actually a process of solving image-based PDEs, originally formulated in [51]. This approach is very successful in solving specific classes of segmentation problems. However, it is not a good method for segmenting a general image with an unknown number of objects.

As already mentioned, a connection between numerical geometry and graph partitioning methods was shown in [12]; thus, in principle, one could choose the approach better suited to the problem at hand. The algorithm for choosing the best performing algorithm for a given image, described in chapter 4, may be a step towards practical use of this connection.

2.4 Hierarchical Segmentation

In contrast to the previous bottom-up approaches, this one combines bottom-up and top-down methods. Each small group of neighboring pixels of the original image is represented by a "pixel" in another image of smaller resolution. Building a pyramid of these images of representative "pixels" above the original image terminates when each "pixel" of the highest image represents an object in the original image. After the pyramid is built, the association of the "pixels" of the lower levels may be changed according to some relaxation algorithm. Usually there are several such top-down reshuffling sweeps, after which the pyramid is rebuilt.

In [16] each pixel belongs to a single group at a time. Modern algorithms, e.g., [63], allow each location to belong to multiple groups. Both algorithms use only raw graylevel information. A more recent version of the latter algorithm, [29], introduces the use of filter responses and shape distributions as pixel features.

The parameters used in the hierarchical approach determine the similarity thresholds used in the pixel gathering process. Additional parameters are related to the relaxation process, in which the final decision on the pixel-to-object association is made; these parameters specify the probability thresholds.

2.5 Model-assisted segmentation

This method assumes a priori knowledge about the class of the object(s) which appear in the image (a top-down approach). Given the class, statistics about object shape or, in more recent algorithms, statistics of the relative appearance of object parts, are collected. In the segmented image the known shapes are matched to the image data, while the shape's freedom to deform and the amount of image data to ignore are given by the algorithm parameters.


Recent variations of this method apply hierarchical segmentation [8] or texture class recognition [37] to improve the obtained segmentation. To avoid over-fitting the model to the training examples, bottom-up and top-down approaches can be combined in the learning stage [41].


3 Segmentation evaluation

In order to evaluate the quality of different segmentation algorithms, a segmentation evaluation mechanism is required. Unlike in, say, object recognition, there is no binary indicator of the success or failure of the segmentation task. A segmentation may be partially correct, and two different segmentations may both be reasonable.

Segmentation algorithms are sometimes evaluated in the context of specific applications. The advocates of this task-dependent approach argue that segmentation is not an end in itself, and therefore should be evaluated using the application performance (see, e.g., [9] for an estimation of grouping quality in the context of model based object recognition). This approach is best when working on a specific application, but it does not support modular design and does not guarantee suitable performance for other tasks.

Humans can consistently discriminate between good and bad segmentations. This implies that, at least in principle, task-independent measures exist [47]. Such task-independent evaluations may be done by comparing the segmentation results to ground truth segmentations (supervised evaluation). Alternatively, the evaluations may be done without using any reference segmentation at all (unsupervised evaluation).

3.1 Supervised evaluation

Supervised, or ground truth based, evaluation is commonly used for empirical comparison of algorithms. Some approaches compare the evaluated segmentation to the reference segmentations using some type of set difference; see [3, 48, 73, 78] for examples. Some of these methods focus on the boundaries between the segments and compare them to the reference boundaries, in statistical terms of miss and false positive, or precision and recall [48]. A different approach, building on information theoretic considerations, is proposed in [26]. The recently available large image databases associated with manual segmentations [47] reveal the inconsistency of human segmentations, allow the quantitative comparison of different approaches on a common test bed [27], and enable learning-based design of segmentation procedures [8, 48].

3.2 Unsupervised evaluation

Unsupervised evaluation of segmentation does not require ground truth and is based only on the information included in the image itself. It is usually based on heuristic measures of consistency, related to Gestalt laws, between the image and the segmentation. Some examples are intra-region uniformity, inter-region contrast [10, 17], specific region shape properties (e.g., convexity [36]), or combinations thereof [77]. It may also be based on statistical measures of quality (e.g., high likelihood), when a statistical characterization of the underlying perceptual context is available [3, 55, 25].

Unsupervised evaluation is considered rather weak for evaluating segmentation [78]. It is sensitive to texture and context, suffers from the absence of the very informative ground truth, and does not offer a clear interpretation: unsupervised evaluation algorithms provide a measure which, supposedly, increases monotonically with the perceptual quality of the


segmentation. Yet, this measure is not explicitly related to the empirical error probability provided by, say, precision/recall. Unsupervised evaluation is rarely discussed as an end in itself (see, however, [10, 17, 75, 49]). It is more commonly discussed in the context of the numerous segmentation methods (see, e.g., [51, 55]). In fact, every segmentation algorithm may be interpreted as an optimization of an unsupervised quality measure. Clearly, the evaluation methods associated with segmentation algorithms are often simplistic, so that the resulting segmentation algorithm is efficient.

In spite of its weaknesses, unsupervised evaluation is needed for generating effective segmentation algorithms, for selecting their parameters in different contexts, and for informing subsequent stages of the visual process (e.g., recognition), which get the segmentation as input, about its quality.

3.3 Consensus of several segmentation hypotheses

A relatively recent approach to segmentation evaluation compares the given segmentation hypothesis to a reference obtained from several other hypotheses [75, 70, 54]. This approach is based on the assumption that automatic segmentations contain many true details, while the false details are random and do not repeat in different segmentations. Thus, a consensus of many automatic segmentation hypotheses will contain only the true segmentation details.

The consensus approach is similar in some respects to the evaluation method proposed in this work: grouping quality is evaluated, in an unsupervised way, relative to a combined result of several grouping algorithms. This way, the evaluated segmentation may be compared to some reference in a quantitatively meaningful way. The key difference is in the unsupervised reference estimation. While the consensus methods try to find a true segmentation reference based on the assumption that automatic segmentations are reasonable, we propose to estimate a much simpler ground truth reference (distributions associated with the true segments), which is based on a specific image model and requires much weaker assumptions.

3.4 Theoretical performance prediction

Besides evaluating the obtained segmentation quality, there have been several attempts to predict the expected quality analytically and control the segmentation process in order to obtain better quality [3, 6]. The evaluations are based on data quality and on the reliability of the grouping cues. The analytical prediction always depends on the details of a specific algorithm, but since the analysis relies on the maximum likelihood criterion, the generic algorithm analyzed in [3] performs as well as or better than any general segmentation algorithm.


4 Nonnegative matrix factorization

It has long been popular to use component analysis (CA) methods (e.g., PCA, LDA, CCA, spectral clustering, etc.) in modeling, clustering, classification and visualization problems. The idea of CA techniques is to decompose a given signal into components related to the basic signals in the problem's domain. The given signal is then characterized, explicitly or implicitly (e.g., in kernel methods), by the mixing coefficients of these basic components.

Many CA techniques can be formulated as eigenproblems, offering great potential for efficient learning of linear and nonlinear models without local minima. It is common to consider only the components related to the largest eigenvalues and work with a signal approximation in a low dimensional space. This allows relatively few samples to suffice for successful estimation of the components. CA techniques are especially useful for high-dimensional data because of the curse of dimensionality, which usually requires a large number of samples to build accurate models. Over the years many computer vision, computer graphics, signal processing, and statistical problems were posed as problems of learning a low dimensional CA model.

The traditional eigenproblems are equivalent to a least squares fit to a matrix; see, e.g., [52, 22]. When applied to physical processes, standard CA methods such as PCA fail to reconstruct physically meaningful components because of incorrect weighting and because they allow negative component entries. To overcome these drawbacks, Paatero and Tapper [52] proposed an alternative least squares based approach.

Nonnegative Matrix Factorization (NMF) is a representation of a nonnegative matrix as a product of two nonnegative matrices. It is common to consider matrices of low rank, usually implying that the factorization is approximate. Initially [52], this task was formulated as follows: Given a nonnegative matrix A ∈ R^{n×m} and a positive integer k < min(m,n), find nonnegative matrices H ∈ R^{n×k} and W ∈ R^{k×m} which minimize the functional

f(H,W) = (1/2)‖A − HW‖₂².   (4.1)

Minimizing (4.1) is difficult for several reasons, including the existence of local minima resulting from the non-convexity of f(H,W) in both H and W, and, perhaps more importantly, the non-uniqueness of the solution. Additional information is commonly used to direct the algorithm to the desired solution [23].

The problem received much attention after the information theoretic formulation and the multiplicative update algorithms for the Frobenius norm and the Kullback-Leibler distance proposed by Lee and Seung [38, 39]. Different aspects of this latter algorithm were analyzed and many improvements were proposed [7, 24, 32, 31, 23, 74]. The main research topics include speeding up the minimization process, studying the influence of the initialization seeds, and extending NMF to include additional constraints on W and H. See the survey in [7].

The factorization is commonly done by iterative algorithms: one matrix (e.g., W) is treated as a constant, getting its value from the previous iteration, while the other, H, is changed to reduce the cost f(H,W). Then the roles of the matrices are switched. The algorithms differ mostly in the specific cost reducing iteration, and in the use of additional


information. There are four main approaches to NMF iterations. Paatero and Tapper [52] used the alternating least squares (ALS) approach, which was reported to be the most accurate, but also the slowest, method [7]. Our empirical observations concur with this claim [60]. As already mentioned, for successful application of NMF to practical problems the factorization is biased toward {H,W} having some desirable properties. In such cases it is simpler to use a gradient descent step in each iteration, e.g., [34]. The third approach uses multiplicative update algorithms (MUA), which can be regarded as a special case of gradient descent methods. The speed and simplicity of the original multiplicative update algorithm by Lee and Seung [39] are the foundation of NMF's popularity. The algorithms we develop in this work are related to the fourth approach, which in a sense may be considered a generalization of the ALS approach: each NMF iteration is an optimization process by itself, solving a convex task [32]. Sometimes even an approximate solution is sufficient for good performance [1, 60].
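Two of these iteration styles can be sketched in a few lines. This is a simplified illustration, not the implementations used in this work; the matrix sizes, iteration counts, and the small epsilon guarding against division by zero are illustrative choices.

```python
import numpy as np
from scipy.optimize import nnls

def nmf_als(A, k, iters=30, seed=0):
    """Alternating least squares: each half-step solves a convex
    nonnegative least-squares problem exactly."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    H, W = rng.random((n, k)), np.zeros((k, m))
    for _ in range(iters):
        for j in range(m):            # fix H, solve each column of W
            W[:, j], _ = nnls(H, A[:, j])
        for i in range(n):            # fix W, solve each row of H
            H[i, :], _ = nnls(W.T, A[i, :])
    return H, W

def nmf_mua(A, k, iters=500, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for the Frobenius cost; eps is an
    added safeguard against division by zero, not part of [39]."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    H, W = rng.random((n, k)) + eps, rng.random((k, m)) + eps
    for _ in range(iters):
        W *= (H.T @ A) / (H.T @ H @ W + eps)
        H *= (A @ W.T) / (H @ W @ W.T + eps)
    return H, W

# Both schemes approximately recover an exactly rank-3 nonnegative matrix.
rng = np.random.default_rng(1)
A = rng.random((10, 3)) @ rng.random((3, 8))
errs = {}
for factorize in (nmf_als, nmf_mua):
    H, W = factorize(A, k=3)
    errs[factorize.__name__] = np.linalg.norm(A - H @ W) / np.linalg.norm(A)
```

On this small example the exact ALS solves converge in far fewer iterations than the multiplicative updates, matching the accuracy/speed trade-off reported in [7].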

Biasing the solution toward a special form of H or W is usually dictated by application considerations. Common choices are weighting the importance of each entry of the factorized matrix, enforcing sparsity on the factors, and enforcing similarity of the weight factors. In [52] the matrix columns are weighted according to their reliability. In [23] a more general weighting approach is presented. It was shown that sparse basis functions [43, 34] and sparse mixing weights [1] should be preferred for many applications. If the relation between the mixing weights is known, they may be forced to comply with it [74]. In this research, the columns of both H and W are forced to sum to one. In the preliminary version of the segmentation evaluation tool we enforced a special parametrization of the columns of H. The use of a better metric turned out to be a better biasing tool for this application.

The NMF technique has been applied in the fields of object and face recognition, action recognition, and segmentation [74, 67, 60]. In computer vision the use of NMF is strongly motivated by the relative difficulty of obtaining pure examples of, say, class descriptors, while their mixtures are easily available. Following [38], face representation/recognition became a standard test case for NMF methods. Therefore, in this work we also address the face recognition problem, although it is not in the main line of this research.

The different NMF algorithms mentioned in this section are variants of the L2 and KL distances with additional biasing terms. In the beginning [60] we also followed this path. It turned out, however, that using a different basic distance measure is advantageous for our line of application. Naturally, many of the methods developed for other NMF variants may be applied to EMD NMF as well. Chapter 3 of this report is dedicated to the derivation of the EMD NMF method. In chapter 4 a bias on the W factor is demonstrated. To avoid repetition, further details on NMF background related to the proposed new factorization may be found in these chapters alongside the related details of the proposed method.


5 Earth mover’s distance

The comparison of distributions is very important in computer vision. Many vision tasks deal with large amounts of data and thus need to summarize it with descriptors, e.g., mean filter responses [56]. It was shown that distributions of such features are more informative than mean values alone [40, 69, 45]. However, when the data is described by a histogram or histogram-like descriptor, the need for a distribution comparison tool arises.

The problem of comparing a distribution S = (s_1, ..., s_n) to a distribution T = (t_1, ..., t_n) is a thoroughly studied one. When the compared distributions are related to known and theoretically studied processes, the comparison techniques are usually also theoretically well founded. Some examples are:

The Kullback-Leibler divergence is an information-theoretic tool. It tells how well, on average, T is coded in terms of S. Formally:

D(S, T) = ∑_i s_i log(s_i / t_i).   (5.1)

The χ² distance has a statistical justification. It measures how unlikely it is that T describes a sample of the population represented by S. Formally:

D(S, T) = ∑_i (s_i − t_i)² / s_i.   (5.2)

The Kolmogorov-Smirnov distance also has a statistical justification. It measures how unlikely it is that T and S are two samples drawn from the same distribution. Formally:

D(S, T) = max_i |ŝ_i − t̂_i|,   (5.3)

where ŝ_i and t̂_i are the bins of the cumulative versions of S and T, respectively.
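The three measures can be computed directly from a pair of n-bin histograms, assuming (as the formulas do) strictly positive bins for the first two; the example histograms are arbitrary.

```python
import numpy as np

def kl_divergence(S, T):
    # Eq. (5.1): D(S,T) = sum_i s_i log(s_i / t_i); assumes positive bins.
    return float(np.sum(S * np.log(S / T)))

def chi2_distance(S, T):
    # Eq. (5.2): D(S,T) = sum_i (s_i - t_i)^2 / s_i
    return float(np.sum((S - T) ** 2 / S))

def ks_distance(S, T):
    # Eq. (5.3): max_i of the absolute cumulative-histogram difference.
    return float(np.max(np.abs(np.cumsum(S) - np.cumsum(T))))

S = np.array([0.1, 0.4, 0.3, 0.2])   # two arbitrary 4-bin histograms
T = np.array([0.2, 0.3, 0.3, 0.2])
print(kl_divergence(S, T), chi2_distance(S, T), ks_distance(S, T))
```

A zero bin in S or T immediately exposes the division-by-zero problem discussed next.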

While these measures were successfully applied to important computer vision problems (see, e.g., [48, 38]), in practice they suffer from numerical difficulties. Division by zero in the first two distances brings the overall measure to infinity. Discrete histogram bins may cause a situation where the data corresponding to some bin in one histogram has moved to a neighboring bin in the second histogram. Moreover, sometimes it is reasonable to consider histograms with different bin boundaries and different sums of bins. In the latter case we would like, following [56], to consider distances between signatures.

To overcome the division by zero, more numerically stable versions of the Kullback-Leibler divergence (the Jeffrey divergence) and of the χ² distance were proposed. It is also common to use empirically justified Lp norms and some other less common methods. To overcome the binning problem, besides using the Kolmogorov-Smirnov distance, which is useful only in 1D, it is common to consider the quadratic-form distance

d(S, T) = √((S − T)ᵀ A (S − T)),

where the matrix A contains the probability of the i-th bin's content moving to bin j. Note, however, that each of these solutions is ad hoc and does not provide a general tool to compare a pair of signatures with a prespecified relation between the bins. To address all these problems in the same framework, Rubner proposed to use the Earth mover's distance [56].


The Earth Mover's Distance (EMD) is a method to evaluate the dissimilarity between two distributions in some feature space, where a distance measure between single features, the ground distance, is given. The EMD, whose variants are known as the Monge-Kantorovich metric, Wasserstein metric, Mallows distance, etc., was first applied to computer vision tasks by Werman et al. [71] and generalized by Rubner [56]. The name EMD follows the intuitive explanation of the measure: "Given two distributions, one can be seen as a mass of earth properly spread in space, the other as a collection of holes in that same space. Then, the EMD measures the least amount of work needed to fill the holes with earth. Here, a unit of work corresponds to transporting a unit of earth by a unit of ground distance."

Formally, to compute the EMD one solves a transportation problem. This can be done by solving a linear program; see eqs. (4.5) and (2.4). In this formulation the bins of S represent the suppliers, the bins of T represent the receivers, and the ground distance measures the cost of moving a unit of goods from each supplier to each receiver.

EMD was shown to outperform other distances for numerous computer vision problems, e.g., [71, 56, 42, 30]. However, the solution of the transportation problem takes O(N³ log N) for a pair of histograms of N bins. Thus, it is frequently not applied to practical problems due to its computation time.

5.1 Fast EMD approximations

The various accelerated EMD computation techniques proposed over the years may be roughly divided into two main groups. One addresses special cases of the EMD problem and gives fast and exact EMD calculation methods which work only for a specific ground distance or a specific type of signature. The other proposes fast but approximate calculation methods for more general cases.

We start with the exact, special case methods. In the original work, Werman et al. [71] showed that the EMD between one dimensional histograms with L1 as the ground distance is equal to the L1 distance between the cumulative histograms. Ling and Okada proposed EMD-L1 [44]. They showed that if the ground distance is L1, the number of variables in the LP problem can be reduced from O(N²) to O(N). The worst case time complexity is exponential, as for all simplex methods; empirically, however, they showed that this algorithm has an average time complexity of O(N²). Pele and Werman [53] proposed using EMD with thresholded distances. They showed that in this case the EMDs are metrics and their computation time is an order of magnitude faster than that of the original algorithm. A special case of thresholded ground distance, 0 for corresponding bins, 1 for adjacent bins, and 2 for farther bins and for the extra mass, can be computed in linear time.
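Werman et al.'s one-dimensional result can be checked numerically. Here scipy's 1-D Wasserstein routine plays the role of the exact EMD; the unit-spaced bin positions make the ground distance L1, and the histograms are random normalized examples.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
S = rng.random(16); S /= S.sum()      # two normalized 16-bin histograms
T = rng.random(16); T /= T.sum()
bins = np.arange(16.0)                # unit-spaced bins -> L1 ground distance

emd = wasserstein_distance(bins, bins, u_weights=S, v_weights=T)
cum_l1 = np.sum(np.abs(np.cumsum(S) - np.cumsum(T)))
print(emd, cum_l1)   # the two values agree up to numerical precision
```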

Now we mention some examples of approximate methods. Indyk and Thaper [35] proposed approximating EMD-L1 by embedding it into the L1 norm. The embedding complexity is O(Nd log ∆), where N is the feature set size, d is the feature space dimension and ∆ is the diameter of the underlying space. Grauman and Darrell [30] substituted histogram intersection for L1 in order to approximate partial matching. Shirdhonkar and Jacobs [65] presented a linear-time algorithm for approximating EMD with some Ln ground distances using the wavelet coefficients of the difference between histograms.

In this work we use the wavelet EMD approximation [65] because it is one of the fastest general methods and, mainly, because it has an analytic gradient expression.


Chapter 2

Segmentation Evaluation


Figure 2.1: Edge strength distributions on different regions of an image. (a) The regions are specified by manual or automatic segmentations. (b) Distribution densities on the segments specified by (several) manual segmentations; each curve is associated with the distribution of a single segment, and the plot contains distributions of several different segmentations of the same image. (c) Cumulative distributions corresponding to manual segmentations. (d) Cumulative distributions corresponding to incorrect segmentations of the same image. The "thick" curves are actually clusters of similar thin curves.

1 Framework

We consider the evaluation of segmentations such as those created by general purpose segmentation algorithms, e.g., [19, 64]. Thus, a segmentation is a partition of the image into disjoint regions, separated by thin boundaries. We denote the evaluated segmentation as the hypothesized or given segmentation.

Every point in the image may be characterized by some local properties, such as intensity, color, or texture, which may be represented by a feature vector. In this work we used three boundary sensitive operators corresponding to texture, brightness and color.

Consider a good segmentation of some image (note that some images have more than one good segmentation). Our basic model is that for every pixel within a particular segment, the local characterization may be regarded as an instance of a random variable, associated with some (discrete) distribution. The distributions associated with different segments are not necessarily different. The region around the boundary is considered as another segment and is characterized by a distribution, just like every other segment. Intuitively, for boundary sensitive operators, we expect this distribution to put higher weights on high values. Note, however, that due to texture, the other distributions are not expected to be disjoint from the boundary distribution. A given image is associated with a small number of distributions, characterizing the different object appearances and the boundaries. We assume that the distributions of parts of a true segment are approximately equal to the distribution of the whole true segment.

As an example, consider the distributions associated with the intensity operator for several human segmentations of the same image (Fig. 2.1(a),(b)). Note that the distributions are clustered into several types. The clustering phenomenon is clearer in the less noisy, cumulative representation of these distributions (Fig. 2.1(c)). The lower cumulative distribution curves, which rise only for relatively high values, are those associated with the manually marked boundaries. The examples in Figure 2.2 show that this phenomenon occurs in many images in each of the three modalities. It should be emphasized that these distributions, characterizing the true segments and the true boundary, are not only unknown but are not


[Figure 2.2 panels: (a) original, (b) manual segmentations, (c) brightness, (d) color, (e) texture; cumulative distribution plots per modality.]

Figure 2.2: Examples of the object distribution clustering effect for different modalities. The boundary distributions associated with different manual segmentations, shown in the second column, are plotted in red. The object distributions are blue.

even uniquely specified. Note, however, that despite the significant difference in boundary markings made by different people, the corresponding boundary distributions are similar for each image; see Figure 2.2. We shall show that estimating them leads to a quantitative, meaningful and yet unsupervised quality measure for a given segmentation.

Consider now an incorrect segmentation hypothesis. Every incorrect segment contains parts of several true segments. Each such part is associated with the distribution of the true segment to which the part belongs. Therefore, we expect the incorrect segments to be characterized by mixtures of the true distributions; see Fig. 2.1(d). The basic goal considered here is to estimate the correctness of a given segmentation when no ground truth is given. Specifically, we would like to estimate the accuracy of the inter-segment boundaries in precision/recall terms [48].

To carry out this seemingly impossible task, we first consider a simpler one. Assume that the number of true segments (including the boundary segment), k, as well as the associated distributions, are known. All the mixture distributions lie in the convex hull of these true distributions. Therefore, given a particular hypothesized segment and its distribution, the mixture coefficients associated with the hypothesized segment may be obtained by solving an overconstrained system of linear equations. Then the precision and recall may be easily calculated; see below.
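With the true distributions known, recovering the mixture coefficients reduces to a nonnegative least-squares fit over the overconstrained system; the distribution values below are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

# Columns: hypothetical true distributions (boundary first), 4 bins x 3 segments.
H = np.array([[0.05, 0.60, 0.10],
              [0.15, 0.30, 0.50],
              [0.30, 0.08, 0.30],
              [0.50, 0.02, 0.10]])

w_true = np.array([0.7, 0.0, 0.3])    # a 70/30 mixture of segments 1 and 3
h_star = H @ w_true                   # observed hypothesized-segment distribution

w, residual = nnls(H, h_star)         # overconstrained system, w >= 0
print(np.round(w, 3))                 # recovers the mixing coefficients
```

Since the observed histogram lies exactly in the convex hull of the columns of H, the fit recovers the mixing weights (0.7, 0.0, 0.3) with zero residual; with noisy measurements the residual quantifies the model mismatch.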


Consider now the more difficult task, where the true distributions are not known. To find these true distributions, specify many (not necessarily correct) hypothesized segments and find their distributions. Each of these hypothesized distributions lies in the convex hull of the k true distributions. Thus, this set of distributions may be regarded as a matrix product of the true distributions and a nonnegative weights matrix. Therefore, finding the true (hidden) distributions associated with the true segments is a nonnegative matrix factorization task.

Formally, let h_i be the operator response distribution in the i-th segment, represented as an n-bin histogram or column vector. Thus, H = (h_1, h_2, ..., h_k) ∈ R^{n×k} represents all the underlying (true) distributions in the image. Consider now some segmentation containing m segments (including the boundary). Let H∗ = (h∗_1, h∗_2, ..., h∗_m) ∈ R^{n×m} be the matrix of the distributions associated with these segments. Then, H∗ may be written as

H∗ = HW,   (1.1)

where W ∈ R^{k×m} is a weight matrix. In practice, the measured distributions may be noisy. The factorization still holds as an approximation H∗ ≈ HW for an effective value of k, which we estimate.

W.l.g., let h_1 and h∗_1 be the histograms associated with the boundaries in the true segmentation and in the hypothesized one. Then, by definition,

Precision = w_{11},    Recall = α_1 w_{11} / ∑_j α_j w_{1j}, (1.2)

where α_j is the size of the j-th segment.

Thus, the quality of a given hypothesized segmentation may be found by decomposing its operator response histogram matrix into two matrices H and W, representing the distributions associated with the true segments and the mixture coefficients, respectively. Note that we do not find the ground truth segmentation at any step of this evaluation; we have only its description in terms of feature distributions.
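The computation in (1.2) is only a few lines once W and the segment sizes are known. The sketch below assumes a layout that is ours, not the thesis': row 0 of W corresponds to the true boundary class and column 0 to the hypothesized boundary segment.

```python
import numpy as np

def precision_recall(W, alpha):
    """Precision and recall of the hypothesized boundary segment, eq. (1.2).
    W     : (k, m) mixture weights; row 0 = true boundary class,
            column 0 = hypothesized boundary segment (assumed layout).
    alpha : (m,) segment sizes.
    """
    precision = W[0, 0]
    recall = alpha[0] * W[0, 0] / np.dot(alpha, W[0, :])
    return precision, recall

# Toy example: the hypothesized boundary is 80% true boundary material,
# while 10% of each inner segment is boundary material that went undetected.
W = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.9, 0.0],
              [0.1, 0.0, 0.9]])
alpha = np.array([100.0, 450.0, 450.0])
p, r = precision_recall(W, alpha)   # p = 0.8, r = 80/170
```

Note that recall weights the boundary-class coefficients by segment sizes, so large inner segments containing a little boundary material pull the recall down.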

2 Finding true segmentation distributions using nonnegative matrix factorization

2.1 Nonnegative matrix factorization

The decomposition of the measured histogram matrix H∗ into a mixture of basic histograms is a nonnegative matrix factorization (NMF) task [39, 23, 7, 24]. This task is often formulated as follows: Given a nonnegative matrix H∗ ∈ R^{n×m} and a positive integer k < min(m, n), find nonnegative matrices H ∈ R^{n×k} and W ∈ R^{k×m} which minimize the functional

f(H, W) = Dist(H∗, HW). (2.1)

The matrix pair {H, W} is called a nonnegative matrix factorization of H∗, although H∗ is not necessarily exactly equal to the product HW. Minimizing (2.1) is difficult for several reasons, including the existence of local minima as a result of the nonconvexity of f(H, W) in both H and W, and, perhaps more importantly, the nonuniqueness of the solution. Additional information is commonly used to direct the algorithm to the desired solution [23].


2.2 NMF algorithms

The problem of nonnegative matrix factorization was introduced by Paatero [52] but got much attention only after its information theoretic formulation and the multiplicative update algorithm by Lee and Seung [39]; see the survey in [7]. The factorization is commonly done by iterative algorithms: one matrix (e.g., W) is treated as a constant, getting its value from the previous iteration, while the other, H, is changed to reduce the cost f(H, W). Then the roles of the matrices are switched and W is changed for fixed H. The algorithms differ mostly in the specific cost reducing iteration, and in the use of additional information.

The preliminary version [60] of this paper was based on a variation of Lee and Seung's algorithm. This method is based on the traditional minimization of the L2 distance in (2.1), and required additional constraints to perform well.
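For reference, the Lee and Seung multiplicative updates can be sketched in a few lines. This is a generic L2-NMF, not the constrained variant of [60]; the function name and defaults are ours.

```python
import numpy as np

def nmf_l2(Hstar, k, iters=2000, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates minimizing ||H* - HW||_F^2.
    Each update keeps H and W nonnegative and never increases the cost."""
    rng = np.random.default_rng(seed)
    n, m = Hstar.shape
    H = rng.random((n, k)) + eps
    W = rng.random((k, m)) + eps
    for _ in range(iters):
        W *= (H.T @ Hstar) / (H.T @ H @ W + eps)   # W step, H fixed
        H *= (Hstar @ W.T) / (H @ W @ W.T + eps)   # H step, W fixed
    return H, W
```

On an exactly low-rank nonnegative matrix this typically drives the relative reconstruction error close to zero, although, as noted above, only a local minimum is guaranteed.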

For many signal comparison tasks, where the error mechanism is not modeled well by additive noise but is rather a complex local deformation of the original signal or the signal descriptor, the Earth mover's distance (EMD) performs better than many other metrics [56, 42, 13]. Histogram comparison is a well-known case of such a task, because any noise added to a signal affects several bins of that signal's histogram. EMD measures the minimal change needed to convert one of the compared histograms into the other, subject to the given deformation costs. In [61] it is shown how to perform an NMF task that minimizes the EMD between the matrix columns. As expected, EMD NMF performs much better than L2-based analogs for the factorization of histogram sets [61]. In EMD NMF, H∗ is factorized with a sequence of linear tasks

H^k = arg min_H ∑_m EMD(H∗_m, (HW^{k−1})_m)

W^k_m = arg min_W EMD(H∗_m, (H^k W)_m), (2.2)

where A_m is the m-th column of the matrix A; see [61] for details.

2.3 Factorizing the histogram matrix

To carry out the factorization (1.1), we used the EMD NMF method [61] as well as several supporting techniques.

Much data is needed for successful factorization. Therefore, instead of factorizing the matrix H∗, associated with a single segmentation, we consider a larger matrix associated with several segmentations (of the same image). Such segmentations are either available or may be created using a segmentation algorithm with different sets of parameters. H∗ is thus redefined as an n×M matrix whose columns are the M histograms associated with all segments of all segmentations.

H∗ = (h∗_1, h∗_2, . . . , h∗_m, h∗_{m+1}, . . . , h∗_M). (2.3)

The factorization (1.1) is now changed to H∗ ≈ HW, where H is an n×k matrix (unchanged) and W is a much larger k×M weight matrix.

Clearly, for successful factorization, H∗ should contain different combinations of the true H vectors. Geometrically, the columns of H∗ are points in the convex hull specified by the columns of H in R^n_+. To get a stable reconstruction of the convex hull, the samples (vectors in H∗) should represent it well. An ideal situation would be if each H∗ vector were equal to some vector in the true H matrix and all H vectors were represented. The more realistic scenario is to get some vectors in H∗ that are close to the vectors of H and many others that are in the inner regions of the convex hull. In terms of segmentation quality, this means that we are interested in segments which are pure examples of the existing object classes, in addition to segments which are mixtures of different objects. A trivial way to obtain such examples is an over-segmentation of the image, but in practice this is not a good solution. For reliable histogram estimation, the pure segments should be large and include a significant (10%–20%) fraction of the true segment. Empirically, we found that if the segmentations constructing the H∗ matrix are associated with a large diversity of precision and recall grades (though none necessarily requires both to be high), the reconstruction is stable. This can usually be achieved by choosing a large diversity of automatic segmentation algorithm parameter sets.

For common segmentation sets, many segments are associated with very similar distributions. For an example, see Fig. 2.1d, where the middle cluster of curves corresponds to such similar distributions. Geometrically, this means that the center of the convex hull is over-represented. Such uneven representation increases the computational effort and sometimes may even cause incorrect factorization. Following [24], we represent the combinations of H distributions more evenly by a dilution process which replaces every set of similar columns with a single representative. (Technically, we consider only a prespecified number of the most EMD-different H∗ vectors.) After the NMF is carried out, the full, nondiluted weight matrix W is found by a single W iteration from (3.1). The factorization algorithm is formally described in Algorithm 1. This algorithm actually finds the precision and recall not only for the segmentation of interest, but for all segmentations in H∗. These estimations are useful for finding the model complexity k; see below.
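The dilution step can be sketched as a greedy farthest-point selection. Here the EMD between n-bin histograms under the |i − j| ground distance reduces to the L1 distance between cumulative sums; the function names and the greedy strategy are our illustrative choices, not the thesis'.

```python
import numpy as np

def emd_1d(h1, h2):
    """EMD between two normalized 1-D histograms under the |i-j|
    ground distance: the L1 distance of their cumulative sums."""
    return np.abs(np.cumsum(h1 - h2)).sum()

def dilute(Hstar, n_keep):
    """Greedily keep the n_keep mutually most EMD-different columns,
    a sketch of replacing clusters of similar columns by representatives."""
    m = Hstar.shape[1]
    kept = [0]                                   # arbitrary starting column
    dist = np.array([emd_1d(Hstar[:, 0], Hstar[:, j]) for j in range(m)])
    while len(kept) < n_keep:
        j = int(np.argmax(dist))                 # farthest from all kept columns
        kept.append(j)
        dist = np.minimum(dist, [emd_1d(Hstar[:, j], Hstar[:, i])
                                 for i in range(m)])
    return Hstar[:, kept]
```

On a matrix whose columns form a few tight clusters, this picks one representative per cluster, which is exactly the evening-out effect the dilution aims for.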

Given H, we still need to identify the distribution (column of H) associated with the boundary. We choose the distribution with the highest µ + 2σ value, where µ is the expected value of the distribution and σ is its standard deviation.

Algorithm 1 Factorization

Input: Histogram matrix H∗, model complexity k.
1: Dilute H∗ to H′ as described in 2.3.
2: Initialize H^0 ∈ R^{n×k} with the most EMD-different columns from H′. Initialize W ∈ R^{k×m} with random values, and normalize its columns to sum to 1.
3: Do W and H iterations (3.1), solving H′ ≈ HW, until convergence.
4: Order the columns of H by µ + 2σ.
5: Solve H∗ = HW for W with a W iteration using the obtained H.
6: Decompose W into segmentation-specific coefficient matrices W_i. Estimate the precisions (P_i) and the recalls (R_i) for each segmentation using W_i and (1.2).
Output: {P_i}, {R_i}.


2.4 Estimating model complexity using several modalities

The factorization algorithm described above decomposes the available histograms into sums of k basic histograms. We found, empirically, that assigning the correct value to the model complexity k is critical to the algorithm's success: for example, a too-high value of k may lead to a decomposition of the true boundary histogram into two or more estimated basic histograms, and to inaccurate estimations of P and R. The best value of k differs from image to image and depends on the type of boundary-sensitive operator (modality) as well.

Specifying the true number of clusters is a hard and, in the general case, unsolved problem. However, for the current problem, we found that using the correspondence between estimations in different modalities may assist us in finding the correct model complexities. We use three modalities: brightness, color, and texture. The NMF (Algorithm 1) is applied to each of them separately, using three corresponding model complexities, k_1, k_2, k_3. Let P_{ij}, R_{ij} be the estimated precision and recall associated with the i-th segmentation and the j-th modality. We found empirically that these estimations should have the following properties:

• Consistency between modalities. In principle, if we use different boundary sensitive operators, we should still get similar precision and recall if they function properly. Likewise, we should get similar F-values (a commonly used scalar measure representing both P and R, F = 2PR/(P + R)) for each modality if the model complexities were chosen properly. Thus,

α_c = max_i ∑_{j=1}^{3} (F_{ij} − median_j(F_{ij})) (2.4)

measures the consistency and should be small. This measure was empirically chosen over other possible measures, e.g., measuring the consistency in P and R separately.

• Diversity of evaluations. The parameters of the segmentation algorithm are chosen to maximize the diversity of the precision and the recall grades of the respective segmentations, to ensure good estimation of the H vectors. Thus, it is expected that

α_v = std_i(median_j(P_{ij})) · std_i(median_j(R_{ij})) (2.5)

will be large for correct P_{ij} and R_{ij} estimations. Note that when k_j is too small (i.e., one vector in H represents two or more actual classes), the variety of recall grades decreases, because substantial parts of the inner segments are assigned to be the boundary and always remain "undetected". Analogously, when k_j is too large, the variety of precision grades decreases.

• Boundary size cannot be too large, because otherwise the image would contain only boundaries and no pixels inside segments. Thus, using the P_{ij}-s and the segment sizes, we estimate the boundary area percentage in the image and denote it α_b.

Then, one empirically selected way to quantify the considerations discussed above is to minimize

c(k_1, k_2, k_3) = (α_c · α_b) / α_v. (2.6)
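A transcription of this criterion might look as follows. Here α_b is passed in as a given, since its estimation from the P_{ij} values and segment sizes is not detailed at this point, and eps implements the flooring of each factor noted below in the text; the function name is ours.

```python
import numpy as np

def complexity_cost(P, R, alpha_b, eps=0.01):
    """Cost c(k1,k2,k3) of eq. (2.6) for one model-complexity triple.
    P, R    : (num_segmentations, 3) estimated precision/recall per modality.
    alpha_b : boundary-area estimate (its computation is not shown here).
    eps floors each factor so that none dominates when it is very small."""
    F = 2 * P * R / (P + R)                                   # F-values
    alpha_c = np.max(np.sum(F - np.median(F, axis=1, keepdims=True),
                            axis=1))                          # consistency (2.4)
    alpha_v = (np.std(np.median(P, axis=1)) *
               np.std(np.median(R, axis=1)))                  # diversity (2.5)
    alpha_c, alpha_b, alpha_v = (max(x, eps)
                                 for x in (alpha_c, alpha_b, alpha_v))
    return alpha_c * alpha_b / alpha_v
```

Modalities that agree (identical F-values across the three columns) give a small α_c and hence a small cost; inconsistent modalities are penalized.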


Algorithm 2 Evaluation

Input: A test image I and its segmentation(s) s_i, i ∈ 1, . . . , M.
1: If needed, add additional segmentations using a segmentation algorithm and different parameter sets.
2: Run three boundary sensitive operators (denoted different modalities), and measure their distributions within the segments. Construct three matrices H∗_1, H∗_2, H∗_3.
3: For all combinations of k′_1, k′_2, k′_3 ∈ {2, 3, 4, 5}, factorize every matrix H∗_j using the corresponding k′_j value by applying Algorithm 1, and obtain the precisions {P_{i,k′_j}} and recalls {R_{i,k′_j}} of all segmentations. Choose the (k_1, k_2, k_3) triple minimizing c(k′_1, k′_2, k′_3) (2.6).
4: Calculate P_i = median(P_{i,k_j}) and R_i = median(R_{i,k_j}).
Output: {P_i}, {R_i}

Note that none of the factors should dominate this expression even if it is very small. Thus, each α is limited by some small constant.

2.5 Dealing with boundary inaccuracies

Typical segmentation algorithms distort the boundaries. That is, even for a segmentation providing roughly true segments, the boundary locations are inaccurate. This problem is recognized in supervised evaluation methods [48, 27], and some small location error margin is allowed.

Naturally, the problem arises here as well: the distribution evaluated on the inaccurate boundary is not the one characterizing the true boundary, and the distribution evaluated within a segment contains contributions from the boundary. Thus, we do not use the distribution of the boundary sensitive operator directly. Rather, to handle this difficulty, we replace the responses in the boundary points with the highest response in their circular neighborhood (r = 5). Because we expect higher values on the boundary, the pixel contributing the maximal value is indeed likely to belong to the true boundary. The other segment distributions are calculated similarly, except that points which were considered when the boundary distribution was calculated do not contribute to them.
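The max-in-neighborhood replacement can be sketched with a grey dilation over a disk-shaped footprint. This uses SciPy; the function name and the fixed 32-bin histogram over [0, 1] are our illustrative choices, not the thesis'.

```python
import numpy as np
from scipy import ndimage

def boundary_histogram(response, boundary_mask, r=5, bins=32):
    """Histogram of operator responses on the boundary, after replacing
    each pixel's response by the maximum in its r-radius neighborhood."""
    y, x = np.ogrid[-r:r + 1, -r:r + 1]
    disk = (x * x + y * y) <= r * r                 # circular neighborhood
    dilated = ndimage.grey_dilation(response, footprint=disk)
    values = dilated[boundary_mask]
    hist, _ = np.histogram(values, bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)
```

A boundary marked a couple of pixels off a strong response line still collects that line's responses, which is the intended correction for localization errors.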

A summary of the full factorization-based evaluation algorithm is described in Algorithm 2.


Chapter 3

Nonnegative Matrix Factorization with Earth Mover's Distance metric


Figure 3.1: Bilateral relation between the spatial and the feature domain representations. The image (a) may be represented in two domains. The highlighted spatial bin is associated with the feature distribution h_{x0}(f) (b). The highlighted feature bin in the whole-image histogram h(f) (d) is associated with the spatial distribution h_{f0}(x) (c). The highlighted bins in the spatial (c) and the feature (b) distributions are identical; h_{f0}(x_0) = h_{x0}(f_0).

1 Observations and intuitions

The EMD NMF methods we propose are general and are not limited to an image domain. For concreteness and a more intuitive explanation, we chose to focus here on image representation. Consider an image f(~x) describing some feature f as a function of the coordinate ~x. We shall be interested in two types of histograms representing, respectively, parts of the image and parts of the feature space.

A feature distribution h_{~x}(f) corresponds to a region R_{~x} in the image and describes the feature distribution corresponding to the pixel values in this region.

A spatial distribution h_f(~x) corresponds to a subset f of the feature space and describes the distribution of spatial locations corresponding to pixels having a value in this subset. Note that the spatial distributions do not necessarily sum to one.

See Figure 3.1 for the relations between the two domains and the respective histograms. In this work we consider only spatial regions and feature domain subsets large enough to contain a reasonable number of samples. Note that many other image representations follow this formulation. Two examples are orientation histograms [45] and Gabor jets [62]. While both coordinates may be multidimensional, e.g., Gabor jets, we chose to discuss only a scalar feature f in the following lines, for simplicity.


Consider representing an image object, or several similar objects (denoted a visual class throughout this paper), using spatial and feature distributions. Ideally, we would expect such an object to be associated with the same feature vector in all its locations. We would also expect the spatial distributions to be piecewise constant within the objects for every feature subset. Naturally, this expectation is unrealistic and the respective distributions are somewhat different, though these differences often follow a systematic pattern, described below.

Consider a region belonging to a visual class with some ideal gray level histogram h(f). Different regions of the same class may be associated with different surface normal directions, and correspondingly with histograms which are brighter or darker. In this case, the absence of some gray level in the histogram is better explained by the presence of additional gray levels in nearby feature histogram bins than in distant, unrelated bins. Consider now the spatial domain. In realistic textures, the distribution of gray levels in every region is not entirely uniform. Consider, for example, two adjacent regions in an image of a zebra. One region may contain more black pixels than the other, but the union of the regions has a histogram which is closer to the ideal class histogram. More generally, the absence of some gray level in a spatial bin is better explained by the presence of surplus instances of this gray level in nearby spatial bins than in other locations. This model of distortion leads to comparison of distributions with the Earth mover's distance, as will be explained in greater detail in the next section.

The proposed image model is well-suited to the NMF representation. Let the (i, j)-th element of H∗ measure the number of pixels with the i-th feature in the j-th region of the image. Then, the j-th column of H∗ contains the feature distribution in region j, h_j(f). Analogously, the i-th row contains the spatial distribution of the i-th feature subset, h_i(~x). The factorization variables, H and W, refer to the feature and spatial representations of the visual classes of the image. The columns of H represent the ideal feature distributions, and the rows of W represent the ideal visual class locations, the image segments. The value of the (i, j)-th bin in the product matrix HW is the sum of the i-th feature probabilities in the different classes, weighted by their relative areas in the j-th region. In other words, it tells us how many of the feature values in the range i we expect to find in region j, which is exactly the property the (i, j)-th bin of the matrix H∗ measures.
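Constructing H∗ from a quantized feature image and a region labeling is then a small counting exercise; the sketch below assumes the features are already binned to integer indices, and the function name is ours.

```python
import numpy as np

def histogram_matrix(feature_bins, region_labels, n_bins):
    """H*(i, j) counts pixels whose feature falls in bin i inside
    region j; columns are then normalized to feature distributions."""
    n_regions = int(region_labels.max()) + 1
    Hstar = np.zeros((n_bins, n_regions))
    np.add.at(Hstar, (feature_bins.ravel(), region_labels.ravel()), 1.0)
    return Hstar / Hstar.sum(axis=0, keepdims=True)

# A 2x2 image with two regions: region 0 mixes feature bins 0 and 1,
# region 1 contains only bin 1.
f = np.array([[0, 1], [1, 1]])
lab = np.array([[0, 0], [1, 1]])
Hstar = histogram_matrix(f, lab, n_bins=2)   # columns sum to one
```

Reading the result row-wise gives the (unnormalized) spatial distribution of each feature bin, mirroring the bilateral view described above.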

By factorizing H∗, we perform clustering in both the spatial and the feature domains. For image segmentation it is common to consider such groupings and gather pixels with similar appearance features and spatial locations. Some methods, for example, explicitly use this principle by clustering pixels in a combined (color, spatial coordinates) space [66, 19]. Here we show that NMF models both the spatial and the feature image descriptors in a complementary way and acts as an iterative, EM-like, segmentation algorithm.

For a reasonable factorization we should ensure that H∗ ≈ HW and that the differences follow the local deformation model we discussed earlier. This compels us to require minimization of the EMD error between both the rows and the columns of H∗ and HW. In the next sections we quantify these requirements and use them to propose EMD NMF.


Figure 3.2: Intuitive explanation of the model. The feature distributions in the graphs (b) (lower: feature histograms; upper: cumulative histograms) are associated with the squares in the image (a). The red dotted lines refer to the red squares lying inside the totem segments, the green dashed lines refer to the green squares lying inside the background segments, and the yellow solid lines refer to the yellow squares intersecting the totem and background segments. The feature indicator images (c) show the pixels with equal feature values. The respective spatial histograms are shown in (d).

2 EMD NMF

Consider M nonnegative histograms with N bins. The histograms are represented in matrix form, H∗ ∈ R^{N×M}, where the j-th histogram is the column H∗_j. The matrix H∗ may be decomposed into a product of H ∈ R^{N×K} and W ∈ R^{K×M}, where H and W are interpreted as K basis vectors in two complementary domains. In most cases, a low dimensional approximation is more meaningful than an exact factorization. Then, the desired factorization H, W is a solution of eq. (1.2) for small K values. Let Dist_φ(A, B) be the sum of distances φ between the corresponding columns of A and B. Then, H∗ ≈ HW implies that Dist_φ(H∗, HW) is the sum of distances between the feature histograms. Analogously, H∗T ≈ W^T H^T implies that Dist_φ(H∗T, W^T H^T) is the sum of distances between the spatial histograms. Therefore, in order to find the spatial distributions, we should factorize H∗T by solving

arg min_{H,W} Dist_φ(H∗T, W^T H^T)   s.t. W ≥ 0, H ≥ 0. (2.1)

A joint clustering in both domains is, therefore,

arg min_{H,W} λ_1 Dist_φ(H∗, HW) + λ_2 Dist_φ(H∗T, W^T H^T)   s.t. W ≥ 0, H ≥ 0. (2.2)

Conveniently, the L2 distance is bin-wise, and Dist_φ(H∗, HW) = Dist_φ(H∗T, W^T H^T). Thus, segmenting an image in the spatial and feature domains is equivalent to solving the traditional L2-NMF of the feature distribution matrix associated with this image. Unfortunately, this algorithm fails for real images. Solving (2.2) with L2-NMF implicitly associates the error independence assumption with different histogram bins. This assumption is not a good model for the sample deviation in the approximation H∗ ≈ HW, in either the feature or the spatial domain. As already mentioned, we propose to use the EMD metric for column comparison and show its ability to solve such problems.

2.1 Earth mover’s distance

The Earth mover's distance (EMD) evaluates the dissimilarity between two distributions in some feature space, where a distance measure between single features is given [56]. For image features, the EMD is motivated by the following intuitive observation: some histogram bin mass may transfer to nearby bins due to natural image formation processes. The distance between two distributions which may be considered small local deformations of each other should be less than that of other distribution pairs which differ in non-neighboring bins. Intuitively, we can view the traditional EMD metric as the sum of the changes required to transform one distribution into the other, with a low cost given to local deformations and a high cost to nonlocal ones. Formally, the EMD between two histograms is formulated as a linear program (2.4, 2.5) whose goal is to minimize the total flow f(i, j) between the bins of the source histogram (i) and the bins (j) of the target histogram for a given inter-bin flow cost d(i, j); see [56]. The cost parameter d(i, j), denoted also the ground distance, specifies the inter-bin flow cost for each pair of source and target bins. EMD is a metric when d(i, j) is a metric as well; thus, we consider here only this type of cost function and denote it the underlying metric.

We consider a nonnormalized distance

EMD(h^s, h^t) = ∑_{i,j} f(i, j)d(i, j), (2.3)

where f(i, j) is a solution of:

min_f ∑_{i,j} f(i, j)d(i, j) (2.4)

s.t. f(i, j) ≥ 0,   ∑_j f(i, j) ≤ h^s_i,   ∑_i f(i, j) ≤ h^t_j, (2.5)

∑_{i,j} f(i, j) = min(∑_i h^s_i, ∑_j h^t_j),

because the total flow in our case is prespecified.
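The transportation LP (2.4)–(2.5) can be solved directly with an off-the-shelf LP solver. In the equal-mass case relevant here the inequality constraints are tight, so the sketch below (using scipy.optimize.linprog, our choice of solver) states them as equalities.

```python
import numpy as np
from scipy.optimize import linprog

def emd_lp(hs, ht, d):
    """EMD between two equal-mass histograms via the transportation LP.
    d[i, j] is the ground distance between source bin i and target bin j."""
    n = len(hs)
    A_eq, b_eq = [], []
    for i in range(n):                       # sum_j f(i, j) = hs_i
        row = np.zeros(n * n); row[i * n:(i + 1) * n] = 1
        A_eq.append(row); b_eq.append(hs[i])
    for j in range(n):                       # sum_i f(i, j) = ht_j
        row = np.zeros(n * n); row[j::n] = 1
        A_eq.append(row); b_eq.append(ht[j])
    res = linprog(d.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun
```

With the |i − j| ground distance on 1-D histograms, this agrees with the well-known cumulative-sum formula for the 1-D EMD.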

Earth mover’s distance between matrices

We define the EMD between two matrices with M columns as the sum of EMDs between each column in the source matrix and the corresponding column in the target matrix:

‖H^s − H^t‖EMD = ∑_{m=1}^{M} EMD(H^s_m, H^t_m). (2.6)


For columns representing feature vectors, this distance measures the sum of distances between respective feature pairs. Naturally, to consider the EMD in the spatial domain, we should find ‖H^{sT} − H^{tT}‖EMD.

2.2 Single domain LP-based EMD algorithm

The general NMF problem is nonconvex and has a unique solution only in limited cases [24]. However, if one of the variable matrices H or W is given, the problem becomes linear. Thus, by consecutively fixing either H or W, one can find a local minimum for (1.2) by solving a sequence of convex tasks. This approach is also applicable to the case at hand by a simple reformulation of the EMD linear programming problem. As a result, a local minimum of EMD NMF is found by solving a sequence of linear programming tasks.

Consider h^s = H∗_m and h^t = (HW)_m. Note that both vectors are normalized histograms and thus sum to one: ∑_i h^s_i = ∑_j h^t_j = 1; this constraint implies that the columns of W sum to 1 as well. With these normalizations, the linear programming constraints associated with the EMD between H∗_m and HW_m (eq. 2.5) become

f_m(i, j) ≥ 0,   ∑_j f_m(i, j) = H∗(i, m), (2.7)

∑_i f_m(i, j) = ∑_k H(j, k)W(k, m).

Note that the constraint ∑_{i,j} f_m(i, j) = 1 is satisfied automatically, since ∑_{i,j} f_m(i, j) = ∑_i H∗(i, m) = 1.

Note also that if we know H, both f_m(i, j) and the matrix W minimizing it may be found as:

arg min_{f,W} ∑_m ∑_{i,j} f_m(i, j)d(i, j)   s.t. (2.7). (2.8)

Analogously, if we know W, we can find both f_m(i, j) and the matrix H minimizing it as:

arg min_{f,H} ∑_m ∑_{i,j} f_m(i, j)d(i, j)   s.t. (2.7). (2.9)

Thus, given some initial guess for H or W, we can improve the solution by the following two-phase Algorithm 3.
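One W-iteration of (2.8), restricted to a single column for readability, can be written as a single LP in which the flow variables and the column of W are solved for jointly. The sketch below uses scipy.optimize.linprog; the variable layout and function name are ours.

```python
import numpy as np
from scipy.optimize import linprog

def w_step_column(H, hstar, d):
    """One column of the W-iteration (2.8): jointly solve for the flow
    f(i,j) and the weights w minimizing EMD(h*, Hw), subject to
    sum_j f(i,j) = h*_i,  sum_i f(i,j) = (Hw)_j,  sum_k w_k = 1, f,w >= 0."""
    n, k = H.shape
    nf = n * n                                    # flow vars, then k weights
    c = np.concatenate([d.ravel(), np.zeros(k)])  # weights cost nothing
    A_eq, b_eq = [], []
    for i in range(n):                            # source marginals
        row = np.zeros(nf + k); row[i * n:(i + 1) * n] = 1
        A_eq.append(row); b_eq.append(hstar[i])
    for j in range(n):                            # target marginals, tied to w
        row = np.zeros(nf + k)
        row[np.arange(j, nf, n)] = 1
        row[nf:] = -H[j]
        A_eq.append(row); b_eq.append(0.0)
    row = np.zeros(nf + k); row[nf:] = 1          # weights sum to one
    A_eq.append(row); b_eq.append(1.0)
    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.x[nf:], res.fun
```

For a column that is an exact mixture of the basis columns, the LP recovers the mixing weights with zero EMD residual.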

For columns representing feature distributions, this algorithm finds a set of basic distri-butions (H) and the mixing weights (W ) to construct the samples in H∗ from this set. Forthe spatial domain we factorize H∗T . This way we find a set of basic spatial distributions(rows of W ) and the mixing weights (H) to construct the samples in H∗ from this set.

2.3 Convergence

Theorem 1.1. Algorithm 3 converges to a local minimum.


Algorithm 3 EMD NMF

Input: The objective matrix H∗ ∈ R^{N×M} and an initial guess for the basis H^0 ∈ R^{N×K}.
1: Find W^0 using (2.8).
2: k = 0
3: repeat
4: k = k + 1
5: Find H^k using (2.9).
6: Find W^k using (2.8).
7: until ε > |‖H∗ − H^k W^k‖EMD − ‖H∗ − H^{k−1} W^{k−1}‖EMD|
Output: W^k and H^k.

Proof. 1. Feasibility: First note that Algorithm 3 is a sequence of LP processes. We should show that a feasible solution exists for every one of them. The minimization (2.8) gets a pair H∗, H^k of normalized matrices. Any normalized matrix W^k ensures that ∑_i H∗(i, m) = ∑_j (HW)(j, m), and thus implies that a feasible solution exists. This follows from EMD being a transportation problem, which has a feasible solution when ∑_i h^s_i = ∑_j h^t_j [33]. An identical argument shows the existence of a feasible solution for the minimization (2.9).

2. Linear programming, by definition, minimizes the flow cost and, due to (2.6), minimizes ‖H∗ − HW‖EMD. Thus, applying (2.9) finds a globally optimal H^k for a given W^{k−1}, and applying (2.8) finds a globally optimal W^k for a given H^k.

3. Since the objective in (2.9) and in (2.8) is the same, ‖H∗ − H^k W^{k−1}‖EMD ≤ ‖H∗ − H^{k−1} W^{k−1}‖EMD and ‖H∗ − H^k W^k‖EMD ≤ ‖H∗ − H^k W^{k−1}‖EMD.

4. From the above it follows that every cycle of Algorithm 3 monotonically decreases the distance ‖H∗ − H^k W^k‖EMD. This distance is lower-bounded, and therefore the algorithm converges (to a local minimum).

2.4 Bilateral EMD NMF

Algorithm 3 minimizes the EMD between the corresponding columns of a given matrix and a matrix product approximating it. Note, however, that in the general case specified by eq. (2.2), our goal is to minimize the EMD both between the corresponding columns and between the corresponding rows. W.l.g. we shall denote the columns as feature distributions and the rows as spatial distributions, as we did in Section 1. The proposed bilateral NMF is a mathematically similar extension of Algorithm 3: while Algorithm 3 considers only the feature domain and regards the spatial histogram errors as independent, we now add the minimization of the EMD in the spatial domain to the optimization function. Thus the bilateral EMD distance is

BEMD(H∗, HW) = λ_1 ∑_{m=1}^{M} EMD(h∗_m, Hw_m) + λ_2 ∑_{f=1}^{F} EMD(H∗T_f, (W^T H^T)_f). (2.10)

Both EMD terms depend, of course, on the ground distance metric [56]; see the detailed specification below.

To minimize this proposed distance, we extend the EMD NMF technique of alternating convex minimizations. Thus, analogously to Algorithm 3, each step of the proposed minimization is a linear programming task, and a sequence of such tasks achieves a local minimum and provides estimates for H and W.

The EMD between one column H∗_m of H∗ and Hw_m is:

min_{f_m} ∑_{i,j} f_m(i, j)d_f(i, j)   s.t. (2.7), (2.11)

where f_m is a variable measuring the flow that we want to minimize between the histogram bins, and d_f(i, j) is a ground distance measuring the cost of moving between the bins. In the new distance we need to minimize the flow f_m between feature histogram bins while also minimizing the flow f_s between spatial histogram bins. Thus, the new cost function is:

min_{f_m, f_s} ∑_{m,i,j} f_m(i, j)d_f(i, j) + ∑_{s,u,v} f_s(u, v)d_x(u, v) (2.12)

subject to the constraints (2.7) and the additional constraints on the spatial flow for the i-th rows of H∗ and HW:

f_s(u, v) ≥ 0,   ∑_v f_s(u, v) ≤ H∗(i, u), (2.13)

∑_u f_s(u, v) ≤ ∑_k H(i, k)W(k, v),

∑_{u,v} f_s(u, v) = min(∑_u H∗(i, u), ∑_{k,v} H(i, k)W(k, v)).

The ground distance d_x(u, v) measures the cost of moving between the spatial bins u and v. The alternating steps are:

W step – minimize (2.12) for f_m, f_s, and W, subject to (2.7) and (2.13).

H step – minimize (2.12) for f_m, f_s, and H, subject to (2.7) and (2.13).


Note that the two sets of constraints, (2.7) and (2.13), are not of the same form. The first specifies equality constraints and thus requires the total flow ∑_i ∑_j f_m(i, j) to equal one. This is necessary to ensure that the columns of the solution matrices H and W still sum to one. The second constraint set, (2.13), on the other hand, cannot be of the equality type, because formally there is no constraint on the sums of the H and W rows. In practice, however, the sums of the HW rows are very similar to the sums of the H∗ rows. We apply here the standard inequality constraints of the EMD [56]. In a sense, this formulation of the problem may be regarded as solving EMD NMF between the columns with an EMD penalty term on the distance between the rows.


3 Efficient EMD NMF algorithms

It is possible to find a local minimum of (2.6) by iterative application of (2.9) and (2.8), starting from some reasonable guess for H. Linear programming is a well-studied problem, and plenty of freeware and commercial solvers are available. However, for (2.9) the dimension of the problem is MN². This means that even for a traditional, relatively small problem of factorizing 100 facial images (each in 16×16 resolution), the LP optimization problem operates on about 6 million variables. This makes even the specification of the problem (construction of the constraint matrix) a challenging task with today's solvers.

Most of the variables arise from the need to calculate the flow fm(i, j) (and possibly fs(i, j)) in order to estimate the EMD between the histograms. The actual variables of interest are H and W, which are only a small fraction of the variables in both (2.8) and (2.9).

3.1 A gradient based approach

The task of finding H^k and W^k in each step of Algorithm 3 is:

H^k = arg min_H ∑_m EMD(H∗_m, (HW^{k−1})_m),
W^k_m = arg min_W EMD(H∗_m, (H^k W)_m).   (3.1)

For bilateral EMD NMF it is:

H^k = arg min_H BEMD(H∗, HW^{k−1}),
W^k = arg min_W BEMD(H∗, H^k W).   (3.2)

Given both H and W, the error (2.6) can be calculated by solving M (or M + N) independent, relatively small LP problems. We can solve both minimizations in (3.1) or (3.2) with some gradient based optimization over possible H (or W) values. We are guaranteed to find the globally optimal solution of each optimization because tasks (2.8) and (2.9) are convex.

Unfortunately, the complexity of a single precise EMD computation is O(N³ log N). Thus, the gradient based approach is expected to be computationally expensive as well.

3.2 A gradient optimization with WEMD approximation

Much effort has been devoted to speeding up the EMD calculation. For some underlying metrics it is easier than for others. For example, the match distance [71], which is the EMD between 1D histograms with a specific underlying metric, can be calculated as an L1 distance between the cumulative versions of the histograms. A short survey of other methods suggested for faster EMD calculation may be found in [65, 53].
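The match-distance special case can be illustrated with a tiny sketch (pure Python, unit ground distance between adjacent bins assumed):

```python
def match_distance(h_t, h_s):
    """EMD between two 1D histograms with unit ground distance between
    adjacent bins: equal to the L1 distance of their cumulative sums."""
    cum_t = cum_s = total = 0.0
    for a, b in zip(h_t, h_s):
        cum_t += a
        cum_s += b
        total += abs(cum_t - cum_s)
    return total

# Moving all mass one bin away costs 1; two bins away costs 2.
print(match_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # -> 1.0
print(match_distance([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # -> 2.0
```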

Shirdhonkar and Jacobs [65] proposed an efficient way to calculate the EMD between two histograms for some common underlying metrics d(i, j). They proved that the result of


optimization (2.4) is approximated very well by:

d_WEMD(h^t, h^s) = ∑_λ α_λ |W_λ(h^t − h^s)|,   (3.3)

where W_λ(h^t − h^s) are the wavelet transform coefficients of the n-dimensional difference h^t − h^s for all shifts and scales λ, and α_λ are scale dependent coefficients. The different underlying metrics are characterized by the chosen scale weightings and wavelet kernels. Note that we are looking for local minima of some calculated EMD values and not for the EMD values themselves. Empirically, we found that the local minima of EMD and WEMD are generally co-located, and thus the accuracy of the WEMD approximation of the actual EMD is less important for our goal.
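As an illustration, a minimal 1D sketch of (3.3): a plain decimated Haar transform of the histogram difference with scale weights 2^(−3j/2), the 1D case of the weights proposed in [65]. A faithful implementation would also sum over all shifts; this toy version only demonstrates the ranking behavior that matters here:

```python
def haar_wemd(ht, hs):
    """Rough WEMD sketch, eq. (3.3): sum of scale-weighted absolute Haar
    coefficients of the histogram difference (lengths: a power of two)."""
    d = [a - b for a, b in zip(ht, hs)]
    total, level = 0.0, 1
    while len(d) > 1:
        approx, detail = [], []
        for i in range(0, len(d), 2):
            approx.append((d[i] + d[i + 1]) / 2.0)   # coarser scale signal
            detail.append((d[i] - d[i + 1]) / 2.0)   # wavelet coefficients
        total += 2.0 ** (-1.5 * level) * sum(abs(c) for c in detail)
        d, level = approx, level + 1
    return total

# As with the true EMD, the value grows with the distance the mass must move.
near = haar_wemd([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])
far = haar_wemd([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0])
```

Since the optimization only needs the approximation to rank candidate solutions consistently, this co-location of minima is exactly the property exploited in the text.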

Using the approximation (3.3) in (3.1) and (3.2) reduces the computational complexity of the EMD to linear. However, gradient methods naturally require knowledge of the gradient for the optimization variables. In the case of linear programming, the gradient may be derived from the solution of the dual problem; therefore, it is a byproduct of the EMD calculation. Unfortunately, for the WEMD we need to calculate the gradient separately. This gradient is:

∇d_WEMD = ∑_λ α_λ · sign(W_λ(h^t − h^s)) · ∇W_λ(h^t),   (3.4)

where the explicit expression for the gradient ∇W_λ(h^t), with respect to either W or H, is lengthy but straightforward. The complexity of the gradient (3.4) computation for H is O(N²K). Note, however, that many of the summands remain constant between the iterations, and a smart calculation of the gradient greatly accelerates the computation.

Note that formally, applying the WEMD requires equality constraints in (2.13). This condition is not satisfied, but in practice the sums of the H∗ rows are similar to those of the HW rows. Thus we used the WEMD to find the EMD and its gradient for both the columns and the rows of the matrices.

3.3 The optimization process

We tested two optimization strategies: constrained optimization (H ≥ 0, W ≥ 0) of the distance (3.3), and unconstrained optimization with a high penalty for negative variable values:

arg min_x ∑_m d_WEMD(H∗_m, (HW)_m) + Φ(x),   (3.5)

where x is either W or H, according to the relevant iteration, and Φ(x) is a quadratic penalty term for x < 0. The latter, unconstrained optimization appears to be more precise and faster.
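A toy illustration of the penalty strategy (plain gradient descent on a hypothetical one-variable objective, not the actual WEMD code):

```python
def penalty(x, c=1e3):
    """Quadratic penalty Phi(x): large cost for negative entries only."""
    return c * sum(min(0.0, xi) ** 2 for xi in x)

def penalty_grad(x, c=1e3):
    return [2.0 * c * min(0.0, xi) for xi in x]

def minimize(grad_f, x, steps=5000, lr=1e-4, c=1e3):
    """Unconstrained gradient descent on f(x) + Phi(x)."""
    for _ in range(steps):
        g, p = grad_f(x), penalty_grad(x, c)
        x = [xi - lr * (gi + pi) for xi, gi, pi in zip(x, g, p)]
    return x

# f(x) = (x + 1)^2 has its unconstrained minimum at x = -1; the penalty
# keeps the solution next to the feasible boundary x = 0.
sol = minimize(lambda x: [2.0 * (x[0] + 1.0)], [0.1])
```

With a finite penalty weight the solution sits slightly inside the infeasible region, which is the usual trade-off of penalty methods against hard constraints.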

Still, EMD NMF iterations are more complex than those of L2-NMF. Using Matlab on an Intel Core 2 Quad 2.5 GHz processor, one full H iteration for M = 256, N = 32, K = 3 (corresponding to the texture experiment described in section 3) takes around 30 seconds. One full H iteration for M = 200, N = 1024, K = 40 (corresponding to the face recognition experiment described in section 2) may take up to 20 minutes.


Chapter 4

Applications


[Figure 4.1 plot: recall vs. precision, both axes 0 to 1; curves: N-cut gray, N-cut color, N-cut texture, MeanShift color, MeanShift gray; points: best N-cut, best MeanShift, Optimal by image]

Figure 4.1: Precision/recall performance with fixed and manually chosen parameter sets of the 5 tested algorithms on Berkeley images.

1 A tool for unsupervised online algorithm tuning

Benchmark databases ([47] and, more recently, e.g., [2]) have become very popular in the last decade. Comparing the performance of an algorithm on a large set of images helps to determine its advantages and disadvantages for diverse image types, to learn algorithm parameters for optimal performance [48, 4], and to compare its performance to that of competing algorithms [27].

Naturally, the requirement for ground truth segmentations restricts these works to the off-line mode. In section 1 we will show that estimating segmentation performance online allows us both to select the optimal algorithm (out of a set) and to increase the performance of a given algorithm by fitting better parameters for each image.

Segmentation algorithms usually depend on a set of parameters, and their performance is characterized by measuring the average precision and recall grades achieved on an etalon set of images for each parameter set. We denote these grades as fixed parameter set precision and recall. The term fixed explicitly refers to the parameters being the same for all images in the set. Performance curves (the fixed parameter set precisions plotted versus the recalls) are often used to illustrate the performance characteristics and to choose the algorithm parameter set with desired performance specifications; see Figure 4.1. For general evaluation, an algorithm's performance is usually measured with a scalar performance grade. A common choice for such a grade is the maximal F-value, F = 2PR/(P + R), associated with a point on the performance curve. The optimal fixed parameter set is considered to be the one for which the maximal F-value is obtained. It is rather interesting that the curve points summarize a wide distribution of image specific performances. See Figure 4.2 for the image specific performances of all database images associated with three curve points from Figure 4.1.
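For concreteness, a hypothetical performance curve and its maximal F-value point:

```python
def f_value(precision, recall):
    """Scalar performance grade: F = 2PR / (P + R)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Hypothetical curve points, one per fixed parameter set; the optimal
# fixed parameter set is the one maximizing the F-value.
curve = [(0.9, 0.3), (0.7, 0.6), (0.5, 0.8)]
best = max(curve, key=lambda pr: f_value(*pr))
print(best)  # -> (0.7, 0.6)
```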

What we show here is that online analysis of existing algorithms might improve their performance by allowing an optimal parameter set to be chosen for each image. The single purple point in Figure 4.1 illustrates the possible improvement gain. This point shows the average (over the dataset) precision/recall performance when the optimal algorithm and


[Figure 4.2 plot: precision/recall scatter, axes 0 to 1; legend: k=2, k=24, k=105]

Figure 4.2: Precision/recall performance of the N-cut on 100 Berkeley images for k = 2, 24, and 105. The thick points correspond to the averaged precision and recall of the 100 segmentations.

Table 4.1: Distribution of the images in the Berkeley test set according to the better performing algorithm

Algorithm           |        N-cut           | Mean shift
Modality            | gray | color | texture | gray | color
# of images (fixed) |  2   |  27   |    9    |  3   |  59
# of images (auto)  |  2   |  25   |    8    |  9   |  56

parameter set are manually chosen for each image. Note that because the parameters are now image dependent, the algorithm's performance is now described by a single point in this plot (and not by a curve). In all precision-recall plots (Figures 4.1, 5.3, and 5.4), we show the performance of different parameter optimization methods on each image as compared to the performance of the traditional method of optimizing parameters for an ensemble of images. Additional details are given in section 1.

Segmentation performance when the algorithm (with its optimal parameters) is manually chosen for each image is much better than that of any of the five algorithms with any fixed parameter set, because the traditional method of optimizing the algorithm parameters for an ensemble of images may not work well for outliers. Even the best parameter set might not be optimal for these outliers, which might be segmented better with the same algorithm, albeit with a different parameter set; see Figure 5.3. This is where our unsupervised evaluation algorithm comes into play: it can serve here as an independent referee, able to automatically estimate the performance of a segmentation algorithm with different parameter sets on a specific image. Choosing better parameters for each image can enhance performance notably.

If we are not limited to a specific algorithm, performance can be further improved by using several candidate segmentation algorithms. See the distribution of the Berkeley test images between the algorithms in Table 4.1. Each algorithm can segment an image with different parameter sets; the proposed evaluation algorithm will point out the better performing algorithm and parameter set for it. Although each algorithm performs moderately well for


predefined parameter sets, choosing the optimal parameter set for each image boosts their performance to be close to the state-of-the-art segmentation for predefined parameter sets; see [4]. Choosing the optimal algorithm improves this performance even more.

The proposed segmentation tuning process is hierarchical. The external part uses the proposed evaluation to specify a particular internal algorithm and tune its parameters. Any common algorithm may be used; here we use five variations of two algorithms: N-cut [64] and mean shift [19]. In section 1 we show that this approach indeed improves the performance of each internal algorithm and of the combination thereof.

Note that the NMF is run only once, on a small number of segmentations (we used 10). The F values for all other segmentations are calculated using the basic histograms computed during this initialization.


2 Face recognition

Face representation is a common test case for NMF algorithms [38, 43, 74]. Traditional NMF algorithms measure the differences between the faces with translation-sensitive L2-related metrics, and thus require a good alignment between the facial features. It was shown that when the NMF is forced to prefer spatially limited basis components, these L2-based algorithms perform better and provide perceptually reasonable parts [43, 34]. Here we show that the use of NMF with the EMD metric yields different, but still perceptually meaningful, components. We found that these components are even more efficient for face classification.

2.1 The EMD NMF components

Unlike the L2 distance, the EMD is not very sensitive to small misalignments, facial expressions, and pose changes. The basis components provided by the EMD NMF are facial archetypes, each of which looks like a slightly deformed face. Each facial feature (e.g., the shape of the head, the haircut, or the shape of the nose) associated with some archetype is shared by several people. The face images in a set associated with the same person, and with different poses and expressions, are usually close (in the EMD sense) to a common facial prototype. This prototype is usually a convex combination of a small number of archetypes. Every face image is a combination of a few archetypes with relatively high coefficients (the prototype) and some other archetypes with much lower coefficients.

To better illustrate this structure, we start by considering a simple image set of 4 faces: two parents, their daughter, and another, male, non-family member (six images of each person; see examples in Figure 4.3). The people in the database share several features. The males have rougher facial features, while the female faces are smoother. The daughter shares facial features with both of her parents, especially with her father. The 24 images were put into the columns of H∗, and it was decomposed with EMD NMF with k = 3. The ground distance is the 2D distance between the image pixels. Note that the number of archetypes is smaller than the number of people. The resulting weight diagram is shown in Figure 4.3. The 3 weights associated with every image and the EMD NMF may be plotted in 2D because w1 + w2 + w3 = 1. See Figure 4.3, where the input faces are plotted as (w1, w2) points. The k = 3 archetypes correspond to the (1, 0), (0, 1), and (0, 0) points. The archetypes and some input images are shown as well. Note the similarity between the father (red circles) and the daughter (black triangles): both are represented mainly by the archetype at (0, 0). However, the father shares some male facial features with the archetype at (0, 1). The daughter, on the other hand, shares many facial features with her mother's archetype, located at (1, 0). The very noticeable changes in facial appearance caused by pose and expression are represented by small translations in the obtained subspace.

Interestingly, the representation of visual objects as a combination of object-like archetypes was suggested as a plausible model for object recognition in the human visual system [15, 68].

2.2 Face recognition algorithm

To demonstrate the power of the EMD NMF, we use a straightforward recognition algorithm, based on 1-NN in the coefficient space. Let {(Ij, Cj), j = 1, . . . , L} be the training set (Ij is


Figure 4.3: Facial space for 4 people. The two-dimensional (w1, w2) convex subspace is projected onto the triangle with corners at (1, 0), (0, 1), and (0, 0). The corners of the triangle represent the basis facial archetypes obtained by EMD NMF. The inner points show the actual facial images weighted in this basis.

an image, and Cj is the corresponding class label).

Training:
Input: {(Ij, Cj), j = 1, . . . , L}.
1: Normalize every image Ij so that ‖Ij‖1 = 1.
2: Decompose the matrix I (with columns Ij) by EMD NMF: I = HW.
3: Normalize every column wj so that ‖wj‖2 = 1.
Output: H, W.

Test:
Input: It, H, W.
1: Normalize the test image It so that ‖It‖1 = 1.
2: Approximate It as a convex combination of H's columns, with weights wt = arg min_w EMD(It, Hw).
3: Normalize wt so that ‖wt‖2 = 1.
4: Find j∗ = arg max_j ⟨wj, wt⟩.
Output: Cj∗.
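The matching step reduces to maximizing the inner product of L2-normalized weight vectors. A sketch with hypothetical archetype weights (the EMD projection of step 2 is abstracted away):

```python
def cosine(u, v):
    """Inner product of the L2-normalized versions of u and v."""
    nu = sum(x * x for x in u) ** 0.5
    nv = sum(x * x for x in v) ** 0.5
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def classify(w_test, train_weights, train_labels):
    """1-NN in the EMD NMF coefficient space: return the label of the
    training column whose weight vector maximizes <wj, wt>."""
    scores = [cosine(w_test, wj) for wj in train_weights]
    return train_labels[scores.index(max(scores))]

# Hypothetical 3-archetype weights for two people.
train_w = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.1, 0.9, 0.0]]
labels = ["father", "father", "mother"]
print(classify([0.85, 0.15, 0.0], train_w, labels))  # -> father
```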

This algorithm was successfully tested on two standard face recognition databases; see section ??.


Figure 4.4: Examples of texture mosaics. The mosaic borders change randomly, resulting in random combinations of the textures in the sample rectangles. Here, the images contain 3, 4, 6, and 7 textures. Note the high local variability of the textures.

3 Texture modeling

A texture mosaic is an image containing several types of textures in random arrangements; see examples from [50] in Figure 4.4. We consider the task of estimating the texture descriptors associated with each texture class of the mosaic. We also would like to classify the textures in each mosaic location, at least roughly (e.g., for subsequent segmentation). To that end, we consider the texture in nonoverlapping square image patches (blocks). The texture in each block is a positive mixture of the basic textures. Therefore, the NMF suggests itself as an analysis tool.

The textures in the database [50] exhibit a lot of spatial variation. Even for relatively large blocks, the average texture descriptor in the block differs greatly from the average descriptor for the whole texture patch. Nor are the mosaics large enough to render descriptor distribution methods (e.g., [40]) effective. The EMD metric better compensates for the variability of the texture descriptor within the same texture than does L2 [56, 13]. Therefore, EMD NMF is expected to be more accurate than L2-NMF in estimating the texture descriptors and their mixing coefficients.

We rephrase the image model from section 1 as follows: Let each texture class be associated with some vector descriptor h^true_k in each location of this texture. Then the K descriptors associated with a mosaic image are H^true = (h^true_1, . . . , h^true_K). Ideally, the mean texture descriptor in the j-th image block should be h∗_j = H^true w^true_j, where w^true_j is the vector of true fractions of the j-th block area associated with each texture class.

We applied the NMF to the texture mosaics by:

1. Converting the image to some feature vector representation. Following the findings in [56], we chose to work with Gabor features; thus each location is represented by a 6-orientation × 5-scale feature vector of Gabor responses [62]. Again, although the texture descriptors are organized in matrix columns, we consider a 2D ground distance in the scale-orientation space.

2. Dividing the image into M nonoverlapping rectangular blocks and calculating the mean feature vector h∗j for each block. We denote all the sampled mean block descriptors H∗ = (h∗1| . . . |h∗M).


3. Finding the factorization H∗ ≈ HW. In this case only the domain of texture descriptors fits the EMD noise model; thus we use the single domain EMD NMF version.

The results of the factorization are the approximated representative texture descriptors H = (h1| . . . |hK) and the approximated fraction of each texture in each block, W = (w1| . . . |wM). In section 3 we show that the results obtained with EMD NMF are more accurate and more robust than those obtained with L2-NMF.
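Step 2 above can be sketched as follows, with scalar toy features standing in for the 30-dimensional Gabor vectors:

```python
def block_means(features, block):
    """Divide a feature map (rows of per-pixel feature vectors) into
    nonoverlapping block x block tiles and average the vectors in each,
    producing the columns h*_j of H*."""
    rows, cols = len(features), len(features[0])
    dim = len(features[0][0])
    H_star = []
    for r0 in range(0, rows - block + 1, block):
        for c0 in range(0, cols - block + 1, block):
            mean = [0.0] * dim
            for r in range(r0, r0 + block):
                for c in range(c0, c0 + block):
                    for d in range(dim):
                        mean[d] += features[r][c][d]
            H_star.append([v / block ** 2 for v in mean])
    return H_star

# 4x4 map of scalar "features" (top half 1.0, bottom half 0.0), 2x2 blocks.
feats = [[[float(r < 2)] for _ in range(4)] for r in range(4)]
print(block_means(feats, 2))  # -> [[1.0], [1.0], [0.0], [0.0]]
```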


4 NMF and image segmentation

4.1 A naive NMF based segmentation algorithm

The NMF may be applied to image segmentation. We start by describing a preliminary, naive NMF based segmentation procedure, and then continue developing it to achieve better results. Suppose that we use the NMF procedure to obtain an H and W associated with relatively small tiles Rm covering the image. W gives us a rough localization of the segments at the same resolution as the tiles; see Figure 4.5, top line. To obtain a refined, pixel resolution segmentation, we use the following Bayesian consideration: The fraction wk,m is the fraction of pixels in the tile Rm coming from class k, and may be regarded as the prior probability that a pixel in Rm belongs to class k. We propose to decide, for every pixel, to which class it belongs, by means of a maximum a-posteriori decision. Suppose the image is scalar and F(~x) is the value at pixel ~x. Let Hk,f be the value of the bin associated with the feature value f in the histogram of class k. Then:

C(~x) = arg max_k P(ck | f = F(~x))
      = arg max_k [ wk,m Hk,F(~x) / ∑_{k′=1}^{K} wk′,m Hk′,F(~x) ].   (4.1)

The preliminary NMF-based segmentation algorithm is:

1. Tile the image with M regions.

2. Compute H∗ for these regions.

3. Factorize H∗ with NMF and obtain H and W .

4. Compute C(~x) for each image pixel using eq. (4.1).
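The pixelwise MAP decision of step 4, eq. (4.1), in sketch form (toy two-class histograms; a real run would use the H and W returned by the factorization):

```python
def map_class(pixel_value, tile_index, H, W):
    """MAP pixel labeling, eq. (4.1): the posterior of class k at a pixel in
    tile m is proportional to the prior W[k][m] times the likelihood
    H[k][pixel_value]; the normalizing denominator in (4.1) does not affect
    the argmax and is omitted."""
    posts = [W[k][tile_index] * H[k][pixel_value] for k in range(len(H))]
    return posts.index(max(posts))

# Two classes over a 4-bin feature histogram, one tile (m = 0).
H = [[0.7, 0.2, 0.1, 0.0],   # class 0: mass on dark feature values
     [0.0, 0.1, 0.2, 0.7]]   # class 1: mass on bright feature values
W = [[0.5], [0.5]]           # equal priors in the tile
print(map_class(0, 0, H, W))  # -> 0
print(map_class(3, 0, H, W))  # -> 1
```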

For computational simplicity we use square tiles.

Unfortunately, this algorithm does not work well for real images. Even though the EMD NMF succeeds in finding reasonable approximations for the H and W matrices, as shown in section ??, the inaccuracies in the obtained W estimations cause frequent errors in the Bayesian assignment (4.1). We now propose several improvements which bias the bilateral EMD NMF toward even more accurate W estimation, and a correspondingly better image segmentation algorithm.

4.2 Spatial smoothing

Recall that, ideally, the spatial basis histograms W^T are piecewise constant. To use this information, we propose to implement the NMF under the BEMD distance with a preference for minimizing the total variation [57]:

[H, W] = arg min_{H,W} BEMD(H∗, HW) + λ TV(W),   (4.2)

where

TV(W) = ∑_{k=1}^{K} ∑_{m=1}^{M} |dxWm,k| + |dyWm,k|.   (4.3)


dxWm,k (dyWm,k) is the difference between the spatial histogram value Wm,k and the value Wm′,k associated with the following x (y) coordinate on the image plane.
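The TV term (4.3) for a single class column of W, with the tiles laid out on a grid (a sketch; row-major tile ordering is assumed):

```python
def total_variation(Wk, rows, cols):
    """TV of one class column of W on a rows x cols tile grid:
    sum of absolute differences between x- and y-neighbors, eq. (4.3)."""
    tv = 0.0
    for r in range(rows):
        for c in range(cols):
            m = r * cols + c
            if c + 1 < cols:             # x-direction difference dxW
                tv += abs(Wk[m] - Wk[m + 1])
            if r + 1 < rows:             # y-direction difference dyW
                tv += abs(Wk[m] - Wk[m + cols])
    return tv

# A piecewise constant weight map has low TV; a checkered one is penalized.
print(total_variation([1.0, 1.0, 1.0, 1.0], 2, 2))  # -> 0.0
print(total_variation([1.0, 0.0, 0.0, 1.0], 2, 2))  # -> 4.0
```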

In the new distance we need to minimize zx and zy, the differences between neighboring W entries, in addition to minimizing the flows fm and fs between the feature and spatial histogram bins. Thus, the new cost function is:

min_{fm, fs, zx, zy} ∑_{m,i,j} fm(i, j) df(i, j) + ∑_{i,u,v} fs(u, v) dx(u, v) + ∑_{m,k} zx(m, k) + zy(m, k),   (4.4)

subject to the constraints (2.7), (2.13), and the additional constraints on the spatial changes of W (similar for the x and y directions):

zx(m, k) ≥ 0

−zx(m, k) ≤ dx(Wm,k) ≤ zx(m, k). (4.5)

The ground distance dx(u, v) measures the cost of moving between the spatial bins u and v. The alternating steps become:

W step – minimize (4.4) for fm, fs, z, and W such that (2.7), (2.13), and (4.5).

H step – minimize (4.4) for fm, fs, z, and H such that (2.7), (2.13), and (4.5).

In practice, we use WEMD based optimization to solve each step, analogously to what is described in section 3.

4.3 Multiscale factorization

The preferred solution for W is piecewise constant. Thus, we can save a lot of computational effort by working with W at lower resolution during most of the factorization process. Moreover, the feature histogram estimation is more precise when applied to larger regions; see, e.g., section 3. To exploit this twofold advantage, we use a hierarchical, or multiscale, BEMD NMF solver.

First, the image is divided into large tiles and a small H∗ matrix is built. This matrix is factorized quickly, and a rough W along with a precise H are estimated. Then, the new H∗ associated with smaller tiles is constructed and factorized with BEMD NMF. The latter factorization is initialized with the estimated H.

This process may be continued to finer resolutions; however, for the finer scales, the complexity grows and the model becomes less accurate. Therefore, we usually applied the factorization with 3-4 scales; see Figure 4.5.
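The coarse-to-fine schedule can be sketched as a driver loop; `histograms` and `factorize` below are caller-supplied placeholders standing in for the H∗ construction and the BEMD NMF step, not real functions:

```python
def multiscale_factorize(image, factorize, histograms, n_scales=3, tile0=80):
    """Coarse-to-fine BEMD NMF driver (a sketch): build H* over coarse
    tiles first, then refine the tiling and warm-start from the previous H.
    The tile side shrinks as tile0 / beta, as in Algorithm 4 below."""
    H = W = None
    for beta in range(1, n_scales + 1):
        tile = tile0 // beta                  # finer tiles at each scale
        H_star = histograms(image, tile)      # H* for the current tiling
        H, W = factorize(H_star, H)           # initialized with coarse H
    return H, W

# Toy stand-ins just to exercise the schedule.
sizes = []
multiscale_factorize(
    image=None,
    factorize=lambda H_star, H0: ("H", "W"),
    histograms=lambda img, tile: sizes.append(tile),
    n_scales=3, tile0=80)
print(sizes)  # -> [80, 40, 26]
```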

4.4 Boundary aware factorization

We refer to a boundary as a special, one pixel wide segment in which each pixel has at least one pair of neighboring pixels belonging to different object classes.


Figure 4.5: W estimates by multiscale BEMD. The results are for a three-class factorization. The rightmost image for every scale shows the boundary class.

Because of its small size and high variability, the boundary is not modeled as a standard row of W. In each W step the factorization algorithm associates a small part α (2.5% in our implementation) of each non-single-class region with the boundary segment; see Figure 4.5, the rightmost image for each scale. For a single-class region (i.e., a region with Wm,k > 1 − α for some k), the boundary class weight is zero. The boundary class is usually associated with a wide distribution because of the high variation in the boundary feature values. Technically, the boundary class is associated with a column in H, and the H step of BEMD NMF remains the same. Effectively, we gain a twofold advantage: the boundary feature histogram collects the feature distribution of the outliers in non-single-class regions, and the class feature histograms become more precise.

4.5 Bilateral EMD NMF segmentation algorithm

The final segmentation algorithm (Algorithm 4) is an enhancement of the first, naive algorithm proposed at the beginning of this section with the spatial smoothing term, the hierarchical decomposition, and the boundary extraction. The parameters are: βmax is the number of scales (we used 3 or 4); ∆ is the length of the tile side (we used ∼ 80 pixels); K is the manually specified number of classes.

Pixelwise Bayesian assignment sometimes creates a salt-and-pepper like mix between two classes if both classes have similar probability in a region. To avoid this kind of noise, we smoothed the obtained probability maps with several iterations of anisotropic diffusion.


Algorithm 4 Bilateral EMD NMF segmentation

Input: I(x, y), K, βmax, ∆.
1: Guess initial H ∈ R^{n×(k+1)} in a reasonable way. Set the boundary distribution as uniform.
2: for scale β = 1 : βmax do
3:   Calculate H∗β for (∆/β) × (∆/β) tiles.
4:   repeat
5:     Find W β using the W step.
6:     W β ← FindBoundary(W β); see sec. 4.4.
7:     Find H using the H step.
8:   until convergence
9: end for
10: Find P(ck|F(~x)) with (4.1).
11: Smooth P(ck|F(~x)) and find C(~x) with MAP.
Output: W β, H, and C(x, y).


Chapter 5

Experiments


1 Evaluation experiments

The proposed evaluation method was experimentally validated as follows: We segmented the images in the Berkeley dataset using 5 segmentation algorithms, varying dozens of parameter sets for each. We then compared the precision and recall grades obtained by the manual procedure described in [48] with the automatic grades of the proposed Algorithm 2. We also used the automatic estimations obtained in this experiment to choose the best algorithm and parameter set pair for each image.

We used two popular segmentation tools: normalized cut [64] and mean shift [19]. For both we used the code published by the authors (for normalized cut we used a faster multiscale code version [20]). We tested mean shift in the grayscale and color modalities, and tested normalized cut in the grayscale, color, and texture modalities.

The proposed factorization, Algorithm 3, and the normalized cut algorithm require image edge strength maps as input. For multidimensional features (color and texture), edge detection is not as simple an operation as it is for the grayscale feature. Here we used the edge detection algorithm described in [62]. We expect, however, that other edge detection operators would give results similar to those reported here.

1.1 The accuracy of unsupervised estimates

The manual markings supplied as part of the Berkeley database serve as ground truth, and are considered below as "true". Note, however, that different people segmented the same images differently, and the ground truth is not unique. Thus, before comparing the automatic estimations to the supervised quality assessments, we ascertain the intrinsic limitation of the supervised assessment method itself.

The Berkeley benchmark tool contains two types of manual segmentations. The human operators saw either a color or a grayscale version of each image. The resulting segmentations are cataloged according to the observed modality. We adopt here the supervised quality assessment procedure as described in [48]. The two observed modalities correspond to two different assessment methods. We compare our automatic estimations to the grades given by both methods. The difference between the two manual methods is considered an intrinsic accuracy limitation. The distribution of manual assessment inconsistencies is shown in Figure 5.2(a).

We found that the precision of segmentation assessed according to the graylevel based ground truth is systematically lower by 4% than that assessed according to the color based ground truth. Analogously, the precision assessed automatically is systematically 8.5% higher than that assessed with the color based ground truth. The recall assessments are unbiased across the three methods. We normalized the grayscale and the automatic precision grades by factors of 1.04 and 0.915, respectively, for inter-modality comparisons. Note that this normalization is not needed if one uses a single modality estimation, e.g., for algorithm comparison, as is done in common precision estimations, e.g., [48].

The distributions of the differences between the automatic estimations and the manual estimations in both modalities are shown in Figure 5.2(b) and (c); see details below. The automatic estimation of the recall is almost as good as the manual one. The differences between the manual and the automatic estimations of the precision are slightly greater


Technion - Computer Science Department - Ph.D. Thesis PHD-2010-09 - 2010


Figure 5.1: A comparison of unsupervised precision (blue) and recall (green) estimates with the supervised evaluations made with two types of ground truth. Each image was segmented 48 times by the N-cut algorithm. The examples include the best and the worst consistencies between the three evaluation methods. Note that poor segmentation quality, e.g., the bird image in the second line, does not stand in the way of a good quality assessment. On the other hand, good but too consistent segmentations, e.g., the image of the boys in the sixth line, might result in a worse assessment.


[Figure 5.2 plots: histograms of assessment differences. Panels: (a) human color vs. gray (P_color − P_gray, R_color − R_gray); (b) automatic vs. human gray (P_est − P_gray, R_est − R_gray); (c) automatic vs. human color (P_color − P_est, R_color − R_est); (d) automatic vs. human (|P_unsupervised − P_human| − |P_color − P_gray|, and the analogous recall differences).]

Figure 5.2: Inconsistency distributions for different measurement methods. The color and automatic precision assessments were normalized to be unbiased relative to the graylevel-based assessments; see text.

than between the two manual estimations. This result is expected: the boundary is the smallest and the most highly varying segment in the image. Adding a small amount of non-boundary data to the estimated boundary distribution thus yields significant differences in the precision estimation, because the miscounted non-boundary pixels are numerous. On the other hand, adding a small amount of boundary data to the estimated non-boundary distributions does not change the recall much, because the few miscounted boundary pixels are insignificant relative to the non-boundary area. Note also that the manual evaluations are partially based on semantic image analysis, and yet, surprisingly, the precision of the proposed automatic evaluation is very high.

The accuracy was tested as follows. Most of the experiments were performed on the Berkeley "test" database. To estimate the histogram basis H, we segmented each color image by mean shift with 10 different parameter sets. We found empirically that this is the best way to quickly obtain segmentations with the most diverse precision and recall values. In all successive experiments we used the established H bases and estimated W in a single W-iteration.

The histograms shown in Figure 5.2 refer to 4800 N-cut segmentations of 100 images from the Berkeley "test" database. Each image was segmented 16 times in each of the three modalities using a different number of expected cuts (we used k = 2, 4, 8, 12, 16, 20, 24, 29, 34, 40, 50, 60, 70, 80, 90, and 105). Note that the segmentation method is not important for this test. Similar results were obtained in [60] by using mean shift segmentations.

Typical precision/recall correspondences for different images are shown in Figure 5.1. Note that the automatic recall estimations are always very similar to the supervised ones. The precision estimates may be less consistent with human opinion for some difficult images, for which the differences in supervised precision are also higher. Even for the worst (over 100 images) inconsistency between the automatic and manual precision assessments (the "boys" image in Figure 5.1), the automatic precision estimation is still very similar to the supervised ones. Note that the unsupervised estimate is usually monotonic in the true one. An interesting observation is that the input segmentation quality does not influence the automatic estimation performance; see, e.g., the bird in Figure 5.1.

We found that the distribution of the differences between the automatic and the manual estimations is correlated with the inconsistencies in human judgment (Fig. 5.2(d)). Large precision or recall differences between the automatic and the manual estimations are as rare


as between two different manual estimations. Numerically, the variance of the precision (recall) inconsistencies between the two supervised estimations is 0.01 (0.012), and between the supervised (color) and the algorithmic estimations it is 0.03 (0.018). The variance of the inconsistency difference is 0.02 << 0.01 + 0.03, i.e., the difference is far from random, and the difference magnitudes are correlated for the two measurements. For recall the situation is similar: 0.015 << 0.012 + 0.018.
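This reasoning relies on the identity Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y): a difference variance well below the sum of the individual variances implies positively correlated errors. A small synthetic illustration (the data here is invented, not the thesis measurements):

```python
# Two error series that share a common component: the variance of their
# difference is far below the sum of their variances, exactly the
# signature used in the text to argue the errors are correlated.
import random

random.seed(0)
common = [random.gauss(0, 1) for _ in range(10000)]
x = [c + random.gauss(0, 0.3) for c in common]  # shared component + noise
y = [c + random.gauss(0, 0.3) for c in common]

def var(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / len(v)

diff_var = var([a - b for a, b in zip(x, y)])
print(diff_var, var(x) + var(y))  # the first is much smaller
```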

1.2 Application: image-specific algorithm optimization

We now test the power of the proposed evaluation method for unsupervised image-specific performance optimization. We used 100 images from the Berkeley "test" database. Each image was segmented 16 times by normalized cut in the graylevel, color, and texture modalities, and 75 times by the mean shift algorithm in the graylevel and color modalities.
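The image-specific selection described in the next paragraph (estimate P and R for each candidate segmentation, score each by the F-value, keep the best) can be sketched as follows; the candidate numbers here are invented for illustration:

```python
# Unsupervised model selection by F-value: each candidate segmentation
# carries an (estimated precision, estimated recall) pair, and the one
# with the highest F-value is chosen. Algorithm names and scores are
# illustrative only.
def f_value(precision, recall):
    """Harmonic mean of precision and recall (the F-number score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

candidates = [
    ("ncut_gray", 0.54, 0.61),
    ("ncut_color", 0.58, 0.66),
    ("meanshift_color", 0.62, 0.60),
]

best = max(candidates, key=lambda c: f_value(c[1], c[2]))
print(best[0])
```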

For each image segmentation we estimated the precision, the recall, and the F-value in the proposed unsupervised way. Then, for each image, we selected the 5 segmentations associated with the highest automatic F-values, one for each algorithm. Among those we also chose the segmentation with the highest F-value, which represents the optimal algorithm and segmentation for this image; see Figure 5.3. Similarly to [48] and [27], we tested the

[Figure 5.3 plot: recall vs. precision for N-cut gray, N-cut color, N-cut texture, mean shift color, mean shift gray, and the automatic best choice.]

Figure 5.3: F-value differences between the segmentations in a set (the parameter sets and the automatically chosen segmentations) and the best segmentation of the same image with the same algorithm and the optimal, image-specific, segmentation parameters. The width of the marker is proportional to the average F-value difference for the set. Note that the automatic estimations are better and closer to the optimum than any fixed parameter set, for all algorithms. The top-right automatic estimation (black) is associated with choosing the best segmentation from the 5 algorithms.

average performance on the dataset, although, as shown in Figure 4.2, this performance is very different for different images. As expected, we found that choosing the best parameter set for an algorithm increases its performance; see Figure 5.3 for algorithm-specific graphs. We also estimated, in a supervised way, for each algorithm and for their union, the best possible segmentation. We checked for each image the F-value difference from the chosen


Table 5.1: The average F and ∆F values for the segmentation algorithms

Algorithm          N-cut                    Mean shift       All
Modality           gray   color  texture   gray   color
fixed parameters   .54    .59    .55       .52    .615
automatic          .55    .59    .55       .53    .622      .64
∆F fixed           .044   .043   .04       -      .048
∆F auto            .038   .043   .039      -      .042      0.04

segmentation to the optimal one and found that the variability of these differences for the chosen segmentations is smaller than for the predefined parameter sets; see Figure 5.3. One more way to see the improvement due to the online parameter tuning is to compare the average F-values of the best parameter set and of the chosen segmentations in Table 5.1. Note that because the F-value is a nonlinear combination of P and R, the average F differs from the F computed from the average P and the average R (as done in [48]). The F-values associated with the average P and R are larger by 0.02 than those in Table 5.1.
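The nonlinearity caveat is easy to demonstrate numerically; the (P, R) pairs below are invented for illustration:

```python
# Because F = 2PR/(P+R) is nonlinear, averaging F over images differs
# from computing F of the average P and average R, and the latter is
# larger here, matching the direction noted in the text.
def f_value(p, r):
    return 2 * p * r / (p + r)

pairs = [(0.3, 0.9), (0.9, 0.3), (0.6, 0.6)]

mean_f = sum(f_value(p, r) for p, r in pairs) / len(pairs)
mean_p = sum(p for p, _ in pairs) / len(pairs)
mean_r = sum(r for _, r in pairs) / len(pairs)
f_of_means = f_value(mean_p, mean_r)

print(mean_f, f_of_means)  # f_of_means exceeds mean_f
```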

[Figure 5.4 plot: recall vs. precision showing the P-R curve, P-deficient images (23%), standard images (63%), R-deficient images (10%), all images, and the automatic choice.]

Figure 5.4: The performance for the outlier images, shown for the mean shift color segmentations. The points with circle markers show the performance of the proposed automatic choice algorithm, while the other end of the line connected to each of them shows the performance with the optimal set of algorithm parameters. The algorithm's tuning curve is plotted for illustration purposes.

The most important improvement offered by our method concerns the "outlier" images, for which the optimal segmentation is associated with a parameter set significantly different from the algorithm's overall optimal parameter set. See Figure 5.4 for the performance curves associated with such images.

As described in Section 1.1, there are natural differences between the different estimation methods, even between two supervised methods. All the precision/recall curves here are calculated in the supervised color modality, and even in the other supervised modality they are somewhat different. Thus, it is not surprising that the automatic parameter tuning did not bring the algorithms to the optimum in the color modality. Parameters which are optimal for


the color-based estimation are non-optimal for the grayscale-based one. Note, however, that even for the largest misses of the online algorithm, the chosen segmentations are reasonable; see Figure 5.5.


Figure 5.5: The largest inconsistencies between the manual and the automatic choice of the best mean shift segmentation from a set. The left segmentation in each pair corresponds to the automatic choice; the right segmentation corresponds to the best supervised score.


2 Face recognition experiment

We tested the EMD NMF based recognition algorithm on the popular Yale [5] and ORL [58] face databases. We follow the experimental procedure of [74], so that we can relate our results to those reported there for the ORL database. Accordingly, the face images are downsampled so that their longer side is 32 pixels. Moreover, as observed in [74], the recognition performance depends to a small extent on the partition of the database into training and test sets. Following [74] and the approaches cited there, we report the best results obtained over several training/test partitions.

In contrast to [74], we did not tightly align the faces by forcing the eye positions to coincide. Both databases contained images that were only roughly aligned. We did not alter the ORL database and, in the Yale database, we only centered the faces. This was necessary to avoid a situation in which the facial position plays too great a role in identification.

Figure 5.6: The Yale face database. The database contains images of 15 people, and we considered 8 images for each person. The first two rows show examples of the database images. The last row shows the basis images obtained with EMD NMF.

The Yale face database contains fewer people than ORL, but is more challenging for recognition. We used a subset of it containing the images corresponding to the same lighting direction. Even with this restriction, the recognition task is not easy due to the high variability of expressions and to the possible presence of glasses. This implies that even for the best partition of the database into training and test sets, the test faces always differ considerably from their closest training examples. Four images were used to represent every person in the training set. A relatively high recognition rate of 86.6% was achieved using only 6 basis archetypes (representing 15 people). The archetypes obtained in this test are shown in Figure 5.6 together with examples of the faces they represent. Increasing the number of archetypes to 15 (one per person) increased the recognition rate to 95%. All the misses are due to glasses appearing in the test image but not in the corresponding training images.

It is interesting to observe that the proposed algorithm does not behave like a nearest neighbor algorithm with the EMD metric. When a representative archetype for each person was computed as the image minimizing the sum of EMD distances over the corresponding training images, and 1-NN (with the EMD metric) was used for recognition, the accuracy was only 73.3%. This advantage of the EMD NMF based algorithm could also be predicted from the weight diagram in Figure 4.3, where, clearly, the father's images are closer to the daughter's mean image than to his own mean image (in weight space) and can be recognized only by the additional components.

The ORL database contains images of 40 people and is somewhat easier. As in [74], five images were used to represent every person in the training set. The recognition accuracy



Figure 5.7: A typical recognition error in the ORL database. When the test face image (a) is in a very different pose from those of the same person in the training set, the most similar person in the same pose (b) may be erroneously identified. The second-most similar identifications (c,d) are correct.

naturally changes with the basis size K. For K equal to or larger than the number of classes (people), the EMD NMF algorithm outperforms all the NMF based algorithms considered in [74], which often use much larger bases; see Table 5.2. Even with a much lower basis dimension, the proposed algorithm achieves very high, competitive accuracy.

Analyzing the (few) recognition errors, we found that they are associated with poses which differ notably from those in the training set; see Figure 5.7.

Table 5.2: Classification accuracies of different algorithms on the ORL database and the corresponding basis sizes, cited from [74].

Algorithm      NMF    LNMF   NGE    PCA    LDA    MFA
Basis Size     158    130    121    105    39     48
Accuracy (%)   74.0   87.5   95.5   85.5   94.5   95.5

Table 5.3: Classification accuracy of EMD NMF on the ORL database for different basis sizes.

Basis Size     2     5     8     10    20    30    40    50
Accuracy (%)   8.5   70.5  87.5  94.5  90.5  95.0  96.5  97.0


3 Texture descriptor estimation

We applied the algorithm described in Section 3 to 90 online-generated mosaics [50]. Each test was repeated for combinations of two parameters: the number of textures in the mosaic (K = 3, . . . , 12) and the number of blocks M = 16, 64, 256, 1024 (the number of columns in H∗). The blocks tessellate the image; therefore, M also specifies the block size: 128 × 128, 64 × 64, 32 × 32, and 16 × 16 pixels, respectively. In each test the parameter K was set to the number of texture classes in the image.
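The mapping from the number of blocks M to the block side can be checked in a few lines; the 512 × 512 mosaic size is inferred from the block sizes quoted above, and the helper is illustrative:

```python
# For an image tessellated into M equal square blocks, the block side is
# image_side / sqrt(M). For a 512 x 512 mosaic, M = 16, 64, 256, 1024
# yields sides of 128, 64, 32, 16 pixels, matching the text.
import math

def block_side(image_side, m):
    """Side length of each block when the image is split into m squares."""
    per_axis = math.isqrt(m)
    assert per_axis * per_axis == m, "m must be a perfect square"
    return image_side // per_axis

sides = [block_side(512, m) for m in (16, 64, 256, 1024)]
print(sides)
```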

We compared the estimated H and W matrices with the true matrices H^true and W^true using the following correlation measure:

Q_a(A, A^{true}) = \frac{1}{K} \sum_{i=1}^{K} \frac{\langle \vec{a}_i, \vec{a}_i^{\,true} \rangle}{\|\vec{a}_i\| \, \|\vec{a}_i^{\,true}\|}.    (3.1)

The estimated Q_h = Q_a(H, H^true) and Q_w = Q_a(W^T, (W^true)^T) values for the different test parameters are shown in Figure 5.8. The columns/rows are assigned to the respective ones in the true matrices by a sequential greedy assignment which maximizes Q_a. Note that as the block size increases, the descriptors H∗ are evaluated over a bigger area and are thus more precise for both metrics.
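The measure (3.1) together with the sequential greedy column assignment can be sketched as follows; this is a minimal NumPy version with synthetic bases, not the thesis code:

```python
# Q_a as in (3.1): columns of the estimated basis are greedily matched to
# still-unassigned true columns by cosine similarity, and the mean
# similarity over the matched pairs is reported.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def q_a(a, a_true):
    """Mean cosine similarity after sequential greedy column assignment."""
    k = a.shape[1]
    unused = list(range(k))
    total = 0.0
    for i in range(k):
        # pick the still-unassigned true column most similar to column i
        j = max(unused, key=lambda j: cosine(a[:, i], a_true[:, j]))
        unused.remove(j)
        total += cosine(a[:, i], a_true[:, j])
    return total / k

h_true = np.eye(4)[:, :3]            # three synthetic 4-bin "descriptors"
h_est = h_true[:, [1, 0, 2]] + 0.05  # permuted and slightly perturbed
print(q_a(h_est, h_true))            # close to 1 despite the permutation
```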

The graphs in Figure 5.8 illustrate two important differences in the behavior of the two metrics. Although they perform comparably when a sufficient number (64) of relatively reliable samples (64 × 64 blocks) is available, EMD NMF outperforms L2-NMF when the number of sample vectors is small or the samples are less reliable. For the EMD metric, the quality of the H reconstruction does not depend on the number of classes, whereas for the L2 metric it decreases with larger K. These findings also support the observation that EMD is more robust when ideal data is not available.

In addition to the mean of the column/row correlations (3.1), we also measured their standard deviation. We found that EMD NMF is generally associated with a much smaller (by 30-50%) standard deviation than L2-NMF. The intuitive explanation is that while the L2-NMF estimations of the H columns and W rows are either very accurate or very inaccurate, the EMD NMF estimations are generally more stable. Together with the average correlation results, this makes the EMD NMF estimations of both H and W more reliable than those of L2-NMF.


[Figure 5.8 plots: average correlation of H (top row) and W (bottom row) with the true values, as a function of window size (left column: L2 vs. EMD) and of the number of classes (right column: L2 and EMD at 128 × 128 and 16 × 16 blocks).]

Figure 5.8: Texture descriptor estimation accuracy. The first row shows the reconstruction quality of the basis descriptors and the second row that of the mixing coefficients. The left column shows the average (over different values of K) reconstruction quality for the different sizes of the sampling blocks, and the right column shows the reconstruction quality as a function of the number of texture classes for two sizes of the sampling blocks.


4 Segmentation

We experimented on two popular image databases: the Berkeley Segmentation Dataset [47] and the Weizmann Segmentation Evaluation Database [2]. Both databases are built on similar ideas and provide tools to benchmark algorithm performance using manual segmentations of the database images. Both test performance in similar terms: an algorithm receives an F-number score for each segmented database image.

The evaluation task associated with the F-value score is different for the two databases. The Berkeley score judges an algorithm by its ability to detect all object boundaries specified in the manual segmentations and to avoid boundary detection elsewhere. The Weizmann evaluation task is to specify the main object's pixels in the image as accurately as possible.

We performed a similar simple test on both databases. Each pixel was characterized by its gray level value and gradient magnitude as a 2D feature. Each image was segmented with Algorithm 4 into a manually specified number (between 2 and 7) of classes plus the boundary class. In both tests the proposed algorithm showed consistent results; see some examples in Figures 5.9 and 5.10. However, the interpretation of these results differs between the two databases.
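A minimal version of this per-pixel feature might look as follows; computing the gradient magnitude by finite differences is our assumption, as the thesis does not specify the operator:

```python
# Stack the gray value and the local gradient magnitude into a
# 2-channel feature image, one 2D feature per pixel.
import numpy as np

def pixel_features(gray):
    """Return an (H, W, 2) array: gray level and gradient magnitude."""
    gy, gx = np.gradient(gray.astype(float))  # finite-difference gradients
    grad_mag = np.hypot(gx, gy)
    return np.dstack([gray.astype(float), grad_mag])

img = np.tile(np.arange(8.0), (8, 1))  # horizontal ramp test image
feats = pixel_features(img)
print(feats.shape)
```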

Weizmann database. The goal of this benchmark is to detect the main object in the image accurately. The database was purposely designed to contain "a variety of images with objects that differ from their surroundings by either intensity, texture, or other low level cues." These low level cues may vary along the goal object as well as along the background. The images in the database are grayscale. The best performance achieved by the algorithm on this database was F = 0.83. According to [2], this is much better than that of N-cut (F = 0.72) and mean shift (F = 0.57), and even better than that of some complex multifeature algorithms. The algorithm succeeded best on images in which the object and the background have different feature descriptions, no matter how complex these descriptions are, and failed mostly on images where the object and background descriptions share a large part of the feature space, especially if the shared features have a large spatial presence; see examples in Figure 5.9.

Berkeley database. The Berkeley test checks an algorithm's performance on boundary detection tasks for color images. The ground truth segmentations include some of the image objects, chosen manually. Algorithm 4 provides for each image point a probability of being a boundary point. These probability maps were tested by the database benchmark tools.

Testing the algorithm on this database reveals both its merits and its deficiencies. While its results (F = 0.55) are worse than those obtained by state-of-the-art learning based algorithms [4], it should be noted that the state-of-the-art results are obtained using the color information in the images. Our results are similar to those obtained by N-cut and mean shift on grayscale images [?]. Looking at some examples, it is apparent that the algorithm is able to extract the appearance model but fails to exploit this knowledge to segment the fine details of the object. Stronger features (e.g., texture and color) and a more sophisticated final segmentation stage are needed to exhibit the strength of the proposed algorithm in this test.


Figure 5.9: Segmentation examples, Weizmann database

Figure 5.10: Segmentation examples, Berkeley database


Chapter 6

Discussion


A new image model is proposed. A given image is represented as a set of feature distributions, and the model interprets the distributions as a product of two factors representing the image contents. Decomposing the distributions using nonnegative matrix factorization has two immediate applications in computer vision: segmentation evaluation and texture modeling.

We propose a fundamentally new approach to unsupervised estimation of segmentation quality. The approach is able to predict precision/recall characterizations in an unsupervised way. Experiments, carried out on a large database, demonstrate the accuracy of the estimates and their application to tuning the segmentation process. The key idea behind the surprising ability to calculate the precision and recall without a ground truth reference is that the distribution of certain image properties on the boundaries and in the segments is a weak, partial ground truth representation, which suffices to establish precision and recall yet may be established online. The segmentations optimized by the proposed measure are often consistent with manual segmentation. When inconsistencies do arise, it is frequently because the manual segmentations are themselves inconsistent. This seems to be the case when a lot of semantic knowledge is used.

Using the obtained evaluation method, we were able to adapt image segmentation algorithms to specific images: the best algorithm for a particular image is chosen, along with the optimal parameters. Diverse images can thus be efficiently segmented by the most efficient algorithm rather than by a complex general-purpose segmentation algorithm.

The core of the proposed image model and image analysis approach is NMF. A new type of NMF task, NMF with the EMD metric, is proposed. The problem is solved with a linear programming based iterative algorithm, and a WEMD [65] based optimization technique is proposed for its fast implementation. Algorithms based on the proposed EMD NMF outperformed previous NMF based algorithms in the context of two challenging computer vision tasks.

The main advantage of the new approach would seem to be its enhanced robustness. Consider, for example, the task of identifying a set of basis descriptors from mixture measurements. When the given measurements closely approximate linear combinations of the hidden descriptors, the L2-NMF technique suffices to accurately extract the basis. When the mixtures are, however, mixtures of deformed descriptors, this is no longer the case. Nonetheless, the deformed descriptors may be close, in the EMD sense, to the original descriptors, and then the mixture of deformed descriptors is EMD-close to the mixture of original descriptors (with the same weights). This lower sensitivity to deformations allows EMD NMF to succeed when L2-NMF does not. Note that this situation is typical when we approximate a histogram from a small sample mixture.
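The robustness argument can be illustrated numerically, using SciPy's one-dimensional Wasserstein distance as a stand-in for EMD; the histograms are synthetic:

```python
# A histogram and a slightly shifted copy are far apart in L2 but close
# in EMD, since EMD charges mass only for the distance it is moved.
import numpy as np
from scipy.stats import wasserstein_distance

bins = np.arange(10)
h = np.zeros(10)
h[3] = 1.0        # unit mass at bin 3
h_shift = np.zeros(10)
h_shift[4] = 1.0  # same mass, shifted by one bin

l2 = np.linalg.norm(h - h_shift)                  # sqrt(2), "completely different"
emd = wasserstein_distance(bins, bins, h, h_shift)  # 1 bin of work, "nearly the same"
print(l2, emd)
```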

The image model discussed in this work allows us to use the enhanced properties of EMD NMF for a simple and elegant image description as a matrix product. Naturally, the simple linear model merely replaces the complex, nonlinear approximation with the more complex EMD metric. However, it allows an elegant image analysis independent of technical details: the segmentation evaluation, texture modeling, and even the segmentation approach directly follow from the proposed linear model.

We see several possible directions for future research. Each of the considered applications is just a straightforward demonstration of the advantages of EMD NMF and, for three of them, of the image model. Converting these applications into full-scale face recognition, database


search, and segmentation tools is worth additional research. Another research direction could be the introduction of learned data from sources external to the analyzed image. Finally, the proposed approach could be applied to other data types, e.g., movies.


Bibliography

[1] M. Aharon, M. Elad, and A.M. Bruckstein, K-SVD and its non-negative variant for dictionary design, Proc. SPIE Conference on Wavelets, vol. 5914, July 2005.

[2] S. Alpert, M. Galun, R. Basri, and A. Brandt, Image segmentation by probabilistic bottom-up aggregation and cue integration, CVPR, June 2007.

[3] A. Amir and M. Lindenbaum, A generic grouping algorithm and its quantitative analysis, PAMI 20 (1998), no. 2, 168-185.

[4] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, From contours to regions: An empirical evaluation, CVPR, 2009.

[5] P. N. Belhumeur, J. Hespanha, and D. Kriegman, Eigenfaces vs. fisherfaces: Recognition using class specific linear projection, PAMI 19 (1997), no. 7, 711-720.

[6] A. Berengolts and M. Lindenbaum, On the performance of connected components grouping, IJCV 41 (2001), no. 3, 195-216.

[7] M. Berry, M. Browne, A. Langville, P. Pauca, and R. Plemmons, Algorithms and applications for approximate nonnegative matrix factorization, Computational Statistics and Data Analysis 52 (2007), no. 1, 155-173.

[8] E. Borenstein, E. Sharon, and S. Ullman, Combining top-down and bottom-up segmentation, CVPRW, 2004, p. 46.

[9] S. Borra and S. Sarkar, A framework for performance characterization of intermediate-level grouping modules, PAMI 19 (1997), no. 11, 1306-1312.

[10] M. Borsotti, P. Campadelli, and R. Schettini, Quantitative evaluation of color image segmentation results, Pattern Recogn. Lett. 19 (1998), no. 8, 741-747.

[11] Y. Boykov and M.-P. Jolly, Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images, ICCV, 2001, pp. 105-112.

[12] Y. Boykov and V. Kolmogorov, Computing geodesics and minimal surfaces via graph cuts, ICCV, 2003.

[13] R. E. Broadhurst, Statistical estimation of histogram variation for texture classification, Texture Analysis and Synthesis Workshop, ICCV, 2005, pp. 25-30.


[14] T. Brox, M. Rousson, R. Deriche, and J. Weickert, Unsupervised segmentation incorporating colour, texture, and motion, Computer Analysis of Images and Patterns 2756 (2003), 353-360.

[15] H. H. Bulthoff and S. Edelman, Psychophysical support for a two-dimensional view interpolation theory of object recognition, PNAS, vol. 89, January 1992, pp. 60-64.

[16] P. J. Burt, T.-H. Hong, and A. Rosenfeld, Segmentation and estimation of image region properties through cooperative hierarchical computation, IEEE Transactions on Systems, Man and Cybernetics 11 (1981), no. 12, 802-809.

[17] S. Chabrier, B. Emile, H. Laurent, C. Rosenberger, and P. Marche, Unsupervised evaluation of image segmentation application to multi-spectral images, ICPR, 2004, pp. 576-579.

[18] C. Christoudias, B. Georgescu, and P. Meer, Synergism in low level vision, ICPR, 2002, pp. 150-155.

[19] D. Comaniciu and P. Meer, Mean shift: A robust approach toward feature space analysis, PAMI 24 (2002), no. 5, 603-619.

[20] T. Cour, F. Benezit, and J. Shi, Spectral segmentation with multiscale graph decomposition, CVPR, 2005, pp. 1124-1131.

[21] T. Cour, F. Benezit, and J. Shi, Spectral segmentation with multiscale graph decomposition, CVPR (Washington, DC, USA), IEEE Computer Society, 2005, pp. 1124-1131.

[22] F. de la Torre, ch. A unification of component analysis methods.

[23] I. Dhillon and S. Sra, Generalized nonnegative matrix approximations with Bregman divergences, NIPS, vol. 18, 2006, pp. 283-290.

[24] D. Donoho and V. Stodden, When does non-negative matrix factorization give a correct decomposition into parts?, NIPS, 2003.

[25] J. H. Elder and R. M. Goldberg, Ecological statistics of Gestalt laws for the perceptual organization of contours, J. Vis. 2 (2002), no. 4, 324-353.

[26] E. A. Engbers, M. Lindenbaum, and A. W. M. Smeulders, An information-based measure for grouping quality, ECCV (3), 2004, pp. 392-404.

[27] F. J. Estrada and A. D. Jepson, Benchmarking image segmentation algorithms, IJCV 85 (2009), no. 2, 167-181.

[28] D. A. Forsyth and J. Ponce, Computer vision: A modern approach, Prentice Hall PTR, August 2002.

[29] M. Galun, E. Sharon, R. Basri, and A. Brandt, Texture segmentation by multiscale aggregation of filter responses and shape elements, ICCV, 2003, pp. 716-723.


[30] K. Grauman and T. Darrel, Fast contour matching using approximate earth mover's distance, CVPR, vol. 1, 2004, pp. 220–227.

[31] T. Hazan and A. Shashua, Analysis of l2-loss for probabilistically valid factorizations under general additive noise, Tech. Report 2007-13, The Hebrew University, 2007.

[32] M. Heiler and C. Schnorr, Learning sparse representations by non-negative matrix factorization and sequential cone programming, J. Mach. Learn. Res. 7 (2006), 1385–1407.

[33] F. S. Hillier and G. J. Lieberman, Introduction to operations research, McGraw-Hill Science/Engineering/Math, 2005.

[34] P. O. Hoyer, Non-negative matrix factorization with sparseness constraints, J. Mach. Learn. Res. 5 (2004), 1457–1469.

[35] P. Indyk and N. Thaper, Fast image retrieval via embeddings, IWSCTV, 2003.

[36] D. W. Jacobs, Robust and efficient detection of salient convex groups, PAMI 18 (1996), no. 1, 23–37.

[37] M. P. Kumar, P. H. S. Torr, and A. Zisserman, Obj cut, CVPR, 2005, pp. 18–25.

[38] D. D. Lee and H. S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature 401 (1999), no. 6755, 788–791.

[39] D. D. Lee and H. S. Seung, Algorithms for non-negative matrix factorization, NIPS 13 (2001), 556–562.

[40] T. Leung and J. Malik, Representing and recognizing the visual appearance of materials using three-dimensional textons, IJCV 43 (2001), no. 1, 29–44.

[41] A. Levin and Y. Weiss, Learning to combine bottom-up and top-down segmentation, IJCV 81 (2009), no. 1, 105–118.

[42] E. Levina and P. Bickel, The earth mover's distance is the Mallows distance: some insights from statistics, ICCV, vol. 2, 2001, pp. 251–256.

[43] S. Z. Li, X. Hou, H. Zhang, and Q. Cheng, Learning spatially localized, parts-based representation, CVPR, 2001, pp. 207–212.

[44] H. Ling and K. Okada, An efficient earth mover's distance algorithm for robust histogram comparison, PAMI 29 (2007), no. 5, 840–853.

[45] D. G. Lowe, Distinctive image features from scale-invariant keypoints, IJCV 60 (2004), no. 2, 91–110.

[46] J. Malik, S. Belongie, T. K. Leung, and J. Shi, Contour and texture analysis for image segmentation, IJCV 43 (2001), no. 1, 7–27.


[47] D. Martin, C. Fowlkes, D. Tal, and J. Malik, A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics, ICCV, vol. II, 2001, pp. 416–423.

[48] D. R. Martin, C. C. Fowlkes, and J. Malik, Learning to detect natural image boundaries using local brightness, color, and texture cues, PAMI 26 (2004), no. 5, 530–549.

[49] P. Meer, B. Matei, and K. Cho, Performance characterization in computer vision, ch. Input guided performance evaluation, pp. 115–124, Kluwer, Amsterdam, 2000.

[50] S. Mikes and M. Haindl, Prague texture segmentation data generator and benchmark, 2006, pp. 67–68.

[51] D. Mumford and J. Shah, Optimal approximations by piecewise smooth functions and associated variational problems, Comm. Pure Appl. Math. XLII (1989), 577–685.

[52] P. Paatero and U. Tapper, Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values, Environmetrics 5 (1994), no. 2, 111–126.

[53] O. Pele and M. Werman, Fast and robust earth mover's distances, ICCV, 2009.

[54] A. Rabinovich, S. Belongie, T. Lange, and J. M. Buhmann, Model order selection and cue combination for image segmentation, CVPR (Washington, DC, USA), IEEE Computer Society, 2006, pp. 1130–1137.

[55] X. Ren and J. Malik, Learning a classification model for segmentation, ICCV, 2003, pp. 10–17.

[56] Y. Rubner, Perceptual metrics for image database navigation, Ph.D. thesis, Stanford University, 1999.

[57] L. Rudin, S. Osher, and E. Fatemi, Nonlinear total variation based noise removal algorithms, Phys. D 60 (1992), no. 1–4, 259–268.

[58] F. Samaria and A. Harter, Parameterisation of a stochastic model for human face identification, Proceedings of 2nd IEEE Workshop on Applications of Computer Vision (Sarasota, FL), IEEE, December 1994.

[59] R. Sandler, Gabor filter analysis for texture segmentation, Master's thesis, Technion, June 2005.

[60] R. Sandler and M. Lindenbaum, Unsupervised estimation of segmentation quality using nonnegative factorization, CVPR, 2008, pp. 1–8.

[61] R. Sandler and M. Lindenbaum, Nonnegative matrix factorization with earth mover's distance metric, CVPR, 2009, pp. 1–8.

[62] R. Sandler and M. Lindenbaum, Optimizing Gabor filter design for texture edge detection and classification, IJCV, 2009.


[63] E. Sharon, A. Brandt, and R. Basri, Fast multiscale image segmentation, CVPR, 2000, pp. 1070–1077.

[64] J. Shi and J. Malik, Normalized cuts and image segmentation, CVPR, 1997, p. 731.

[65] S. Shirdhonkar and D. Jacobs, Approximate earth mover's distance in linear time, CVPR, 2008, pp. 1–8.

[66] N. Sochen, R. Kimmel, and R. Malladi, A general framework for low level vision, IEEE Trans. on Image Processing 7 (1998), no. 3, 310–318.

[67] C. Thurau and V. Hlavac, Pose primitive based human action recognition in videos or still images, CVPR, 2008, pp. 1–8.

[68] S. Ullman, High-level vision: Object recognition and visual cognition, The MIT Press, Cambridge, MA, 1996.

[69] M. Varma and A. Zisserman, Classifying images of materials: Achieving viewpoint and illumination independence, ECCV, vol. 3, May 2002, pp. 255–271.

[70] S. K. Warfield, K. H. Zou, and W. M. Wells, III, Simultaneous truth and performance level estimation (STAPLE): An algorithm for the validation of image segmentation, MedImg 23 (2004), no. 7, 903–921.

[71] M. Werman, S. Peleg, and A. Rosenfeld, A distance metric for multidimensional histograms, CVGIP, vol. 32, 1985, pp. 328–336.

[72] M. Wertheimer, Untersuchungen zur Lehre von der Gestalt, Psychologische Forschung 4 (1923), 301–350.

[73] L. R. Williams and K. K. Thornber, A comparison of measures for detecting natural shapes in cluttered backgrounds, IJCV 34 (1999), no. 2–3, 81–96.

[74] J. C. Yang, S. C. Yang, Y. Fu, X. L. Li, and T. S. Huang, Non-negative graph embedding, CVPR, 2008, pp. 1–8.

[75] Y. Yitzhaky and E. Peli, A method for objective edge detection evaluation and detector parameter selection, PAMI 25 (2003), no. 8, 1027–1033.

[76] R. Zass and A. Shashua, A unifying approach to hard and probabilistic clustering, ICCV (Washington, DC, USA), IEEE Computer Society, 2005, pp. 294–301.

[77] H. Zhang, S. Cholleti, S. A. Goldman, and J. E. Fritts, Meta-evaluation of image segmentation using machine learning, CVPR, 2006, pp. 1138–1145.

[78] Y. J. Zhang, A review of recent evaluation methods for image segmentation, ISSPA 1 (2001), 148–15.


Thanks to the proposed model, the results are obtained in a seemingly simple way. The simplicity of the model does, of course, require a more complicated metric than those in common use, but alongside the simplicity we also obtain a description of the image in terms of factors that carry semantic meaning in computer vision. This description enables image-adapted analysis and a certain relinquishing of the external learning employed by works that reach comparable results.

We see several natural continuations of this work. One can, of course, further develop the four applications we presented, and still introduce into them learning from sources outside the image itself. Likewise, the study of the proposed model itself can be deepened by adding further meaningful factors, such as feature weighting, the use of several features, and so on. Another possible direction is applying the ideas presented here to other kinds of data, such as video.


A second kind of EMD NMF factorization reduces the EMD both between the pairs of columns and between the rows of H* and its approximation HW.

The NMF problem is not convex, but when one of the variables is known, the remaining problem is convex. NMF is therefore usually solved by finding a good initial guess and running an iterative process in which, at every iteration, one variable is fixed and a step toward the (local) solution is made by changing the other variable. In our two proposed algorithms we take a similar approach; however, since computing the EMD requires solving a linear program, each of our steps is itself the solution of a linear program that both computes the EMD and finds the variable (H or W) that minimizes it. To make the solution practical, we use a wavelet-based approximation of the EMD and solve the NMF with a gradient-based method.
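To give a feel for the metric being minimized: for one-dimensional histograms over the same bins with unit ground distance, the EMD reduces to the L1 distance between the cumulative distributions (the Mallows-distance view noted in [42]); the general case, and the factorization steps themselves, require a linear program. A minimal sketch of the 1-D case (function name hypothetical):

```python
import numpy as np

def emd_1d(p, q):
    """EMD between two 1-D histograms on the same bins, with unit
    ground distance: the L1 distance between their cumulative sums."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()                    # normalize to distributions
    q = q / q.sum()
    return float(np.abs(np.cumsum(p - q)).sum())
```

Shifting all mass by one bin costs exactly one unit per bin of shift, which is the locality property that makes the EMD robust to small feature shifts.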

The two algorithms described above (segmentation quality estimation and the decomposition of complex texture images) were implemented both with L2-based NMF and with EMD NMF. The results with the EMD-based algorithm are far better. Moreover, the EMD-based algorithm makes NMF usable for additional applications that were not possible with L2-NMF.

An algorithm for face image characterization and recognition using the new nonnegative factorization

Nonnegative factorization has been known from its very beginning as an effective tool at the core of face representation and recognition algorithms. Different face parts (such as the nose, eyes, mouth, etc.) turn out to be good basis vectors for representation, and different mixtures of them represent the faces of different people. All the L2-based methods, however, require prior alignment of the face images, so that, for example, the eyes in all the images in the database lie at exactly the same coordinates. Using the EMD frees us from this alignment requirement, since a small shift of one face part or another implies only a small EMD value. We propose a simple NMF-based face recognition algorithm and show that using it yields excellent performance, far better than that of competing algorithms.

A segmentation algorithm using the new nonnegative factorization

The segmentation algorithm starts by dividing the image into squares and representing each square by its feature distribution, a column of H*. The EMD NMF algorithm is used to estimate the feature distributions of the true segments. In this context, every row of the matrix W should describe a segment. We show that the rows of H* and of HW should be similar in the EMD sense as well, and we therefore use the EMD-based algorithm in both spaces: features and image coordinates. This algorithm works well at coarse resolution; to refine it we use a multiscale implementation and a Bayesian assignment. The resulting segmentation was tested and achieves better results than basic algorithms such as normalized cut and mean shift, although it does not reach the results achieved by learning-based algorithms.

Summary

We proposed a new image model based on nonnegative factorization. The image is represented by a collection of distributions, and the model interprets this collection as the product of two factors that describe the image content. To exploit the model's advantages to the fullest, we also proposed a new nonnegative factorization method that uses the EMD metric. We demonstrated the strength of the new model in three important image analysis applications: improving segmentation algorithms based on the ability to estimate their quality, texture modeling, and a segmentation algorithm. The three applications use no supervision and reach better results than comparable algorithms. In all three applications, as well as in a face recognition experiment, we use EMD NMF and demonstrate its advantages over other NMF methods for our kinds of data.


Finding the description of a given image under the proposed matrix-product model has immediate implications; each of the matrices W and H has characteristic uses in computer vision tasks:

Segmentation quality evaluation

The matrix W expresses the spatial relation between given segments (of some segmentation) and the true segments (objects), and is therefore useful, for example, for evaluating segmentation quality. Given an image and a segmentation of it, we always want to know how good the segmentation is. An informative and popular way to measure this quality is through two measures that assess how well the boundaries between the segments in the image were found:

• Precision: the fraction of the detected boundary that is a true boundary, out of the full length of the detected boundary.

• Recall: the fraction of the true boundary that was detected, out of the full length of the true boundary.

It is easy to show that both measures can be computed from the matrix W. Unlike all previous methods in the literature, this method makes it possible to compute segmentation quality in precision/recall terms without a reference segmentation, that is, without using a true segmentation of the image. We manage to compare the boundaries of the given segmentation to the true boundaries even though we do not know where they are. The results chapter of this thesis shows that quality evaluation with this method is almost as good as comparison against a human-made segmentation.
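For concreteness, the two measures can be sketched on binary boundary maps. Note that the thesis computes them from W without a reference segmentation, whereas this toy version (names hypothetical) assumes the true boundary is known:

```python
import numpy as np

def boundary_precision_recall(detected, true):
    """Precision: fraction of detected boundary pixels that are true.
    Recall: fraction of true boundary pixels that were detected.
    Both inputs are binary maps of the same shape (no matching
    tolerance, for simplicity)."""
    detected = np.asarray(detected, dtype=bool)
    true = np.asarray(true, dtype=bool)
    tp = np.logical_and(detected, true).sum()   # correctly found pixels
    precision = tp / max(int(detected.sum()), 1)
    recall = tp / max(int(true.sum()), 1)
    return float(precision), float(recall)
```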

Application: unsupervised optimization of a segmentation algorithm for a given image

The ability to estimate segmentation quality at run time makes it possible to choose, for every image, both the most suitable segmentation algorithm and the parameters of that algorithm that best fit the image. In the results chapter we describe the performance improvement of several algorithms obtained by using the proposed evaluation method.

Texture characterization

The matrix H expresses the composition of the true segments, and is useful when this composition is needed. This technique can be used, for example, to search a texture database for the texture types that compose the image. To test this claim we used mosaic images composed of especially difficult texture examples. We divided the image into square segments and found H and W by nonnegative factorization. The reconstructions we obtained of the true texture feature vectors were very good; see the results chapter.
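The database-search step can be sketched as nearest-neighbor matching of recovered columns of H against stored texture distributions. The function name and data are placeholders, and the L1 distance stands in for the EMD used in the thesis:

```python
import numpy as np

def match_textures(H, database):
    """For each recovered column of H (a feature distribution), return
    the index of the closest distribution among the columns of
    `database`, here under the L1 distance for simplicity."""
    # Pairwise L1 distances: |H[:, i] - database[:, j]| summed over bins,
    # giving a (k, n_database) distance table.
    d = np.abs(H[:, :, None] - database[:, None, :]).sum(axis=0)
    return d.argmin(axis=1)
```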

New algorithms for nonnegative factorization: EMD NMF

A central part of the proposed model is the success of the nonnegative factorization. This factorization was proposed in 1994 by Paatero & Tapper and became popular following the works of Lee & Seung from 1999 and 2001. The factorization is usually obtained by solving the minimization problem

    (H, W) = argmin_{H,W} Φ(H*, HW).

Several algorithms have been proposed that solve this problem for many variants of the distance Φ, based on the L2 and Kullback-Leibler distances. In our case these two distances are not optimal: since we deal with distributions, it is preferable to use a distance known to be better suited for comparing distributions, the well-known Earth Mover's Distance (EMD).

In this work we developed methods for nonnegative factorization under the EMD metric. We propose two versions of EMD-based factorization. In one version, which follows the usual sense of the EMD, the factorization reduces the EMD between the columns of H* and those of its approximation HW.
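For reference, the standard L2 instance of this minimization is solved by the Lee & Seung multiplicative-update scheme [39]. A minimal sketch follows (this is not the EMD-based algorithm of this thesis, and random initialization stands in for a considered initial guess):

```python
import numpy as np

def nmf_l2(V, k, iters=2000, eps=1e-9, seed=0):
    """Approximate a nonnegative n x m matrix V as H @ W, with H (n x k)
    and W (k x m) nonnegative, by Lee & Seung multiplicative updates
    minimizing the Frobenius norm ||V - H W||."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    H = rng.random((n, k)) + eps       # basis (distributions, up to scale)
    W = rng.random((k, m)) + eps       # mixing weights
    for _ in range(iters):
        # Each update keeps its factor nonnegative and never increases
        # the objective; eps guards against division by zero.
        W *= (H.T @ V) / (H.T @ H @ W + eps)
        H *= (V @ W.T) / (H @ W @ W.T + eps)
    return H, W
```

The multiplicative form is why the factors stay nonnegative: each entry is rescaled by a nonnegative ratio rather than moved by an additive gradient step.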


Abstract

Segmentation has been one of the central research areas of computer vision for several decades. Many other research areas, such as recognition, tracking, focusing and more, depend on the performance of the initial process that divides the image into regions containing single objects, namely segmentation. Despite great effort and recently improved algorithms, the results still do not reach the level of the ultimate segmentation machine: the human. There are several segmentation methods that succeed very well on certain types of images. There are also methods that can segment various types of images after massive learning. But there is no method that takes an arbitrary image and knows how to analyze it correctly. This work contributes to segmentation research mainly by proposing a method for analyzing and understanding an image in terms of its correct segmentation, thereby enabling the choice of segmentation tools suited to a specific image.

The model

In the most general terms, this work proposes a new image model. The model refers to two complementary spaces in which an image is described: the spatial domain and the feature space. We assume that in the spatial domain the description is one-to-one and every location is associated with exactly one object. In the ideal (and unrealistic) case we would like every object in the image to have a unique color, i.e., a one-to-one correspondence between the identity of an object and its features. In reality the situation in feature space is far from this ideal, and many objects contain pixels of the same color. A more realistic model, which we adopt, is that every object in the image is described by a unique distribution in feature space.

Given the distributions of the true segments, one can find the distribution associated with any segment in the image (not necessarily a true segment). This distribution is a weighted sum of the distributions of the true segments, weighted by their share in that segment. That is, the proposed model states that the feature distribution in any segment can be written as a nonnegative sum of a small number of basic distributions, which correspond to the correct segments and the objects in the image.

In matrix notation this claim can be expressed as follows. Let H* = (h*_1 | … | h*_m) be a matrix whose columns are the distributions of m arbitrary segments. Then H* can be written as the product

    H* = HW,

where the columns of H = (h_1 | … | h_k) are the feature distributions of the objects in the image, and W = (w_1 | … | w_k) holds the presence ratios of the objects in the segments.
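The claim can be illustrated numerically. The distributions below are made up for the example; they show how every column of H* remains a valid distribution because the columns of H sum to one and so does each column of presence ratios in W:

```python
import numpy as np

# Hypothetical example: k = 2 object distributions over 3 feature bins.
H = np.array([[0.7, 0.1],
              [0.2, 0.2],
              [0.1, 0.7]])           # each column sums to 1

# Presence ratios of the two objects in m = 3 arbitrary segments;
# each column sums to 1 as well.
W = np.array([[1.0, 0.5, 0.2],
              [0.0, 0.5, 0.8]])

H_star = H @ W                        # per-segment feature distributions
# Every column of H* is again nonnegative and sums to 1; the first
# segment contains only the first object, so its column equals H[:, 0].
```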

Image decomposition under the model

The advantage of this relation is that, under certain conditions, it can be inverted: given any segment in the image, one can measure its feature distribution and, knowing H, compute which of the image's objects it is composed of and in what proportions. Hence, by measuring the object proportions in relatively small regions that tile the image, a coarse segmentation of the image can be found. Naturally, characterizing the composition of segments relies on knowing the distributions H of the objects in the image.

In practice we know neither H nor W in the above equation. Therefore, to reach the image description characterized by H*, we must solve the problem of factoring a matrix into a product of two matrices. Because k, the number of objects in the image, is small relative to the number of segments that can be examined in an image (m), this factorization will usually be approximate. Since both the features (H) and the weights (W) contain only nonnegative values, we use the factorization method known as nonnegative matrix factorization (NMF).


The research was carried out under the supervision of Prof. Michael Lindenbaum in the Faculty of Computer Science.

I thank the Technion for its generous financial support of my studies.


Nonnegative matrix factorization for segmentation analysis

Research thesis

In partial fulfillment of the requirements for the degree of Doctor of Philosophy

Roman Sandler

Submitted to the Senate of the Technion - Israel Institute of Technology

Tamuz 5770    Haifa    June 2010


Nonnegative matrix factorization for segmentation analysis

Roman Sandler
