SURFACES WITH OCCLUSIONS FROM LAYERED STEREO

a dissertation submitted to the department of computer science
and the committee on graduate studies of stanford university
in partial fulfillment of the requirements
for the degree of doctor of philosophy

Michael H. Lin
December 2002

© Copyright by Michael H. Lin 2003
All Rights Reserved

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Carlo Tomasi (Principal Advisor)

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Christoph Bregler

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Dwight Nishimura (Electrical Engineering)

Approved for the University Committee on Graduate Studies:

Abstract

Stereo, or the determination of 3D structure from multiple 2D images of a scene, is one of the fundamental problems of computer vision. Although steady progress has been made in recent algorithms, producing accurate results in the neighborhood of depth discontinuities remains a challenge. Moreover, among the techniques that best localize depth discontinuities, it is common to work only with a discrete set of disparity values, hindering the modeling of smooth, non-fronto-parallel surfaces.

This dissertation proposes a three-axis categorization of binocular stereo algorithms according to their modeling of smooth surfaces, depth discontinuities, and occlusion regions, and describes a new algorithm that simultaneously lies in the most accurate category along each axis. To the author’s knowledge, it is the first such algorithm for binocular stereo.

The proposed method estimates scene structure as a collection of smooth surface patches. The disparities within each patch are modeled by a continuous-valued spline, while the extent of each patch is represented via a labeled, pixelwise segmentation of the source images. Disparities and extents are alternately estimated by surface fitting and graph cuts, respectively, in an iterative, energy minimization framework. Input images are treated symmetrically, and occlusions are addressed explicitly. Boundary localization is aided by image gradients.

Qualitative and quantitative experimental results are presented, which demonstrate that, for scenes consisting of smooth surfaces, the proposed algorithm significantly improves upon the state of the art, more accurately localizing both the depth of surface interiors and the position of surface boundaries. Finally, limitations of the proposed method are discussed, and directions for future research are suggested.

Acknowledgements

I would like to thank my advisor, Professor Carlo Tomasi, for bestowing upon me his unwavering support, guidance, generosity, patience, and wisdom, even during unfortunate circumstances of his own.

I would also like to thank the members of my reading and orals committees—Professors Chris Bregler, Dwight Nishimura, Amnon Shashua, Robert Gray, and Steve Rock—for their thought-provoking questions and encouraging comments on my work.

I am indebted to Stan Birchfield, whose research and whose prophetic words led me to the subject of this dissertation.

Thanks also to Burak Göktürk, Héctor González-Baños, and Mark Ruzon for their assistance in the preparation of my thesis defense, and to the other members of the Stanford Vision Lab and Robotics Lab.

I am grateful to Daniel Scharstein and Olga Veksler, whose constructive critiquing of preliminary versions of this manuscript helped to shape its final form.

Finally, I would like to thank my family and friends, and especially Lily Kao, who have made the process of completing this work much more pleasant.

Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Foundations of Stereo
  1.2 Solving the Correspondence Problem
  1.3 Surface Interiors vs. Boundaries
  1.4 A Categorization of Stereo Algorithms
    1.4.1 Continuity
    1.4.2 Discontinuity
    1.4.3 Uniqueness
  1.5 A Brief Survey of Stereo Methods
    1.5.1 Pointwise Color Matching
    1.5.2 Windowed Correlation
    1.5.3 Regularization
    1.5.4 Cooperative Methods
    1.5.5 Dynamic Programming
    1.5.6 Graph-Based Methods
    1.5.7 Layered Methods
  1.6 Our Proposed Approach
  1.7 Outline of Dissertation

2 Preliminaries
  2.1 Design Principles
  2.2 Mathematical Abstraction
  2.3 Desired Properties
    2.3.1 Consistency
    2.3.2 Smoothness
    2.3.3 Non-triviality
  2.4 Energy Minimization

3 Surface Fitting
  3.1 Defining Surface Smoothness
  3.2 Surfaces as 2D Splines
  3.3 Surface Non-triviality
  3.4 Surface Smoothness
  3.5 Surface Consistency
  3.6 Surface Optimization

4 Segmentation
  4.1 Segmentation by Graph Cuts
  4.2 Segmentation Non-triviality
  4.3 Segmentation Smoothness
  4.4 Segmentation Consistency
  4.5 Segmentation Optimization

5 Integration
  5.1 Segmentation Consistency, Revisited
  5.2 Overall Optimization
    5.2.1 Iterative Descent
    5.2.2 Merging Surfaces
    5.2.3 Initialization
    5.2.4 Post-Processing

6 Experimental Results
  6.1 Quantitative Evaluation Metric
  6.2 Quantitative Results
    6.2.1 “Map”
    6.2.2 “Venus”
    6.2.3 “Sawtooth”
    6.2.4 “Tsukuba”
  6.3 Qualitative Results
    6.3.1 “Cheerios”
    6.3.2 “Clorox”
    6.3.3 “Umbrella”

7 Discussion and Future Work
  7.1 Efficiency
  7.2 Theory vs. Practicality
  7.3 Generality

A Image Interpretation
  A.1 Interpolation
  A.2 Certainty

Bibliography

List of Tables

2.1 Contributions to energy
5.1 Our overall optimization algorithm
5.2 Our post-processing algorithm
6.1 Layout for figures of complete results

List of Figures

1.1 The geometry of an ideal pinhole camera
1.2 The geometry of triangulation
1.3 The geometry of the epipolar constraint
1.4 An example of a smooth surface
1.5 An example of discontinuities with occlusion
6.1 “Map” results
6.2 “Map” error distributions
6.3 “Venus” results
6.4 “Venus” error distributions
6.5 “Sawtooth” results
6.6 “Sawtooth” error distributions
6.7 “Tsukuba” results
6.8 “Tsukuba” error distributions
6.9 “Cheerios” results
6.10 “Clorox” results
6.11 “Umbrella” results

Chapter 1

Introduction

Ever since antiquity, people have wondered: How does vision work? How is it that we “see” a three-dimensional world? For nearly two millennia, the generally accepted theory (proposed by many, including Euclid [ca. 325-265 BC] and Ptolemy [ca. AD 85-165]) was that people’s eyes send out probing rays which “feel” the world. This notion persisted until the early 17th century, when, in 1604, Kepler published the first theoretical explanation of the optics of the eye [45], and, in 1625, Scheiner observed experimentally the existence of images formed at the rear of the eyeball. Those discoveries emphasized a more focused question: How does depth perception work? How is it that, from the two-dimensional images projected on the retina, we perceive not two, but three dimensions in the world?

With the advent of computers in the mid-20th century, a broader question arose: how can depth perception be accomplished in general, whether by human physiology or otherwise? Aside from possibly elucidating the mechanism of human depth perception, a successful implementation of machine depth perception could have many practical applications, from terrain mapping and industrial automation to autonomous navigation and real-time human-computer interaction. In a sense, Euclid and Ptolemy had the right idea: methods using active illumination (including sonar, radar, and laser rangefinding) can produce extremely accurate depth information and are relatively easy to implement. Unfortunately, such techniques are invasive and have limited range, and thus have a restricted application domain; purely passive methods would be much more generally applicable. So how can passive depth perception be accomplished? In particular, how can images with only two spatial dimensions yield information about a third?

Static monocular depth cues (including occlusion, relative size, and vertical placement within the field of view, but excluding binocular disparity and motion parallax) are very powerful and often more than sufficient; after all, we are typically able to perceive depth even when limited to using only one eye from only one viewpoint. Inspired by that human ability, much research has investigated how depth information can be inferred from a single “flat” image (e.g., Roberts [59] in 1963). However, most such monocular approaches depend heavily upon strong assumptions about the scene, and for general scenes, such knowledge has proven to be very difficult to instill in a computer.

Julesz [42, 43] demonstrated in 1960 that humans can perform binocular depth perception in the absence of any monocular cues. This discovery led to a proliferation of research on the extraction of 3D depth information from two 2D images of the same static scene taken from different viewpoints. In this dissertation, we address this problem of computational binocular stereopsis, or stereo: recovering three-dimensional structure from two color (or intensity) images of a scene.

1.1 Foundations of Stereo

The foundations for reconstructing a 3D model from multiple 2D images are correspondence and triangulation. To understand how this works, let us first take a look at the reverse process: the formation of 2D images from a 3D scene. Suppose we have an ideal pinhole camera (Figure 1.1), with center of projection P and image plane Π. Then the image p of a world point X is located at the intersection of Π with the line segment through P and X. Conversely, given the center of projection P, if we are told that p is the image of some world point X, we know that X is located somewhere along the ray from P through p.

Figure 1.1: The geometry of an ideal pinhole camera: the ray from center of projection P through world point X intersects image plane Π at image point p.

Although this alone is not enough to determine the 3D location of X, if there is a second camera with center of projection Q, whose image of the same world point X is q, then we additionally know that X is located somewhere along the ray from Q through q (Figure 1.2). That is, barring degenerate geometry, the position of X is precisely the intersection of the ray from P through p with the ray from Q through q. This well-known process of triangulation, or position estimation by the intersection of two back-projected rays, is what enables precise 3D localization, and thus forms the theoretical basis for stereo.

Figure 1.2: The geometry of triangulation: world point X lies at intersection of rays from centers of projection P, Q through respective image points p, q.

However, note that the preceding description of triangulation requires the positions of two image points p and q which are known to be images of the same world point X. Moreover, reconstructing the entire scene would require knowing the positions of all such pairs (p, q) that are the image of some pairwise-common world point. Determining this pairing is the correspondence problem, and is a prerequisite to performing triangulation. Correspondence is much more difficult than triangulation, and thus forms the pragmatic basis for stereo.

Although stereo is not difficult to understand in theory, it is not easy to solve in practice. Mathematically, stereo is an ill-posed, inverse problem: its solution essentially involves inverting a many-to-one transformation (image formation in this case), and thus is underconstrained. Specifically, triangulation is very sensitive to input perturbations: small changes in the position of one or both image points can lead to arbitrarily large changes in the position of the triangulated world point. In other words, any uncertainty in the result of correspondence can potentially yield a virtually unbounded uncertainty in the result of triangulation. This difficulty is further exacerbated by the fact that correspondence is often prone to small errors.
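As an illustrative aside, the following minimal Python/NumPy sketch (our own; the cameras, coordinates, and perturbation are arbitrary choices) implements midpoint triangulation of two back-projected rays and illustrates the sensitivity just described:

    import numpy as np

    def triangulate_midpoint(P, u, Q, v):
        """Triangulate the rays P + s*u and Q + t*v: with noisy measurements
        the two rays rarely meet exactly, so return the midpoint of the
        shortest segment joining them."""
        u = u / np.linalg.norm(u)
        v = v / np.linalg.norm(v)
        w = P - Q
        a, b, c = u @ u, u @ v, v @ v      # a = c = 1 after normalization
        d, e = u @ w, v @ w
        denom = a * c - b * b              # near 0 when rays are near-parallel
        s = (b * e - c * d) / denom
        t = (a * e - b * d) / denom
        return 0.5 * ((P + s * u) + (Q + t * v))

    # Two unit-focal-length cameras with a 0.1-unit baseline, viewing X.
    X = np.array([0.0, 0.0, 10.0])
    P, Q = np.array([-0.05, 0.0, 0.0]), np.array([0.05, 0.0, 0.0])
    p = (X - P) / (X - P)[2]               # image points on the plane z = 1
    q = (X - Q) / (X - Q)[2]

    print(triangulate_midpoint(P, p, Q, q))   # approximately [0, 0, 10]

    # Perturb one image coordinate by a hundredth of its disparity:
    q_noisy = q + np.array([1e-4, 0.0, 0.0])
    print(triangulate_midpoint(P, p, Q, q_noisy))   # depth moves to ~10.1

A tiny shift in a single image coordinate changes the recovered depth by about 1% here, and the same image-plane error produces ever larger depth errors as the disparity shrinks with distance.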

1.2 Solving the Correspondence Problem

Thus we see that accurately solving the correspondence problem is the key to accurately solving the stereo problem. In effect, we have narrowed our question from “How can passive binocular stereopsis be done?” to “How can passive binocular correspondence be done?”

The fundamental hypothesis behind multi-image correspondence is that the appearance of any sufficiently small region in the world changes little from image to image. In general, “appearance” might emphasize higher-level descriptors over raw intensity values, but in its strongest sense, this hypothesis would mean that the color of any world point remains constant from image to image. In other words, if image points p and q are both images of some world point X, then the color values at p and q are equal. This color constancy (or brightness constancy, in the case of grayscale images) hypothesis is in fact true with ideal cameras if all visible surfaces in the world are perfectly diffuse (i.e., Lambertian). In practice, given photometric camera calibration and typical scenes, color constancy holds well enough to justify its use by most algorithms for correspondence.

The geometry of the binocular imaging process also significantly prunes the set of possible correspondences, from lying potentially anywhere within the 2D image, to lying necessarily somewhere along a 1D line embedded in that image. Suppose that we are looking for all corresponding image point pairs (p, q) involving a given point q (Figure 1.3). Then we know that the corresponding world point X, of which q is an image, must lie somewhere along the ray through q from the center of projection Q. The image of this ray in the other camera’s image plane Π lies on a line l that is the intersection of Π with the plane spanned by the points P, Q, and q. Because X lies on that ray, its projection p on Π must lie on the corresponding epipolar line l. (When corresponding epipolar lines lie on corresponding scanlines, the images are said to be rectified; the difference in coordinates of corresponding image points is called the disparity at those points.) This observation, that given one image point, a matching point in the other image must lie on the corresponding epipolar line, is called the epipolar constraint. Use of the epipolar constraint requires geometric camera calibration, and is what typically distinguishes stereo correspondence algorithms from other, more general correspondence algorithms.
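For rectified images, the pruning becomes especially simple, as the following minimal Python sketch makes explicit (our own illustration; the sign convention d = xl − xr and the non-negative disparity range are assumptions about the camera setup, not statements from the text):

    import numpy as np

    # For rectified images, the epipolar line of right-image pixel (x_r, y)
    # is simply row y of the left image, so all candidate matches share the
    # scanline and disparity reduces to d = x_l - x_r.
    def candidate_matches(left, x_r, y, d_max):
        """Colors of all left-image pixels that could match right pixel
        (x_r, y), assuming disparities in [0, d_max]."""
        return [left[y, x_r + d]
                for d in range(d_max + 1) if x_r + d < left.shape[1]]

    left = np.random.randint(0, 256, size=(4, 16))      # toy grayscale image
    print(candidate_matches(left, x_r=3, y=2, d_max=5)) # a 1D search, not 2D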

Based on color constancy and the epipolar constraint, correspondence might proceed by matching every point in one image to every point with exactly the same color in its corresponding epipolar line. However, this is obviously flawed: there would be not only missed matches at the slightest deviation from color constancy, but also potentially many spurious matches from anything else that happens to be the same color. Moreover, with real cameras, sensor noise and finite pixel sizes lead to additional imprecision in solving the correspondence problem. It is apparent that color constancy and the epipolar constraint are not enough to determine correspondence with sufficient accuracy for reliable triangulation. Thus, some additional constraint is needed in order to reconstruct a meaningful three-dimensional model. What other information can we use to solve the correspondence problem?

Figure 1.3: The geometry of the epipolar constraint: image point p corresponding to image point q must lie on epipolar line l, which is the intersection of image plane Π with the plane spanned by q and centers of projection P, Q.

Marr and Poggio [52] proposed two such additional rules to guide binocular correspondence: uniqueness, which states that “each item from each image may be assigned at most one disparity value,” and continuity, which states that “disparity varies smoothly almost everywhere.”

In explaining the uniqueness rule, Marr and Poggio specified that each “item corresponds to something that has a unique physical position,” and suggested that detected features such as edges or corners could be used. They explicitly cautioned against equating an “item” with a “gray-level point,” describing a scene with transparency as a contraindicating example. However, this latter interpretation, that each image location be assigned at most one disparity value, is nonetheless very prevalent in practice; only a small number of stereo algorithms (such as [71]) attempt to find more than one disparity value per pixel. This common simplification is in fact justifiable, if pixels are regarded as point samples rather than area samples, under the assumption that the scene consists of opaque objects: in that case, each image point receives light from, and is the projection of, only the one closest world point along its optical ray.

In explaining the continuity rule, Marr and Poggio observed that “matter is cohesive, it is separated into objects, and the surfaces of objects are generally smooth compared with their distance from the viewer” [52]. These smooth surfaces, whose normals vary slowly, generally meet or intersect in smooth edges, whose tangents vary slowly [36]. When projected onto a two-dimensional image plane, these three-dimensional features result in smoothly varying disparity values almost everywhere in the image, with “only a small fraction of the area of an image . . . composed of boundaries that are discontinuous in depth” [52]. In other words, a reconstructed disparity map can be expected to be piecewise smooth, consisting of smooth surface patches separated by cleanly defined, smooth boundaries.

These two rules further disambiguate the correspondence problem. Together with color constancy and the epipolar constraint, uniqueness and continuity typically provide sufficient constraints to yield a reasonable solution to the stereo correspondence problem.

1.3 Surface Interiors vs. Boundaries

A closer look at the continuity rule shows a clear distinction between the interiors and the boundaries of surfaces: depth is smooth at the former, and non-smooth at the latter. This bifurcation of continuity into two complementary aspects is often reflected in the design of stereo algorithms, because, as noted by Belhumeur [7], “depth, surface orientation, occluding contours, and creases should be estimated simultaneously.”

On the one hand, it is important to recover surface interiors by estimating their depth and orientation, because such regions typically constitute the vast majority of the image area. On the other hand, it is important to recover surface boundaries by estimating occluding contours and creases, because boundaries typically are the most salient image features. (In fact, much of the earliest work on the three-dimensional interpretation of images focused on line drawing interpretation, in which boundaries are the only image features.) Birchfield [13] in particular emphasized the importance of discontinuities, which generally coincide with surface boundaries.

Moreover, in general, neither surface interiors nor surface boundaries can be unambiguously derived solely from the other. For example, given only a circular boundary that is discontinuous in depth, the interior could be either a fronto-parallel circle or the front hemisphere of a ball; this difficulty is inherent in the sparseness of boundaries. Conversely, given the disparity value at every pixel in an image, while one could threshold the difference between neighboring values to detect depth discontinuities, the selection of an appropriate threshold would be tricky at best; detecting creases by thresholding second-order differences would be even more problematic. This difficulty is inherent in the discrete nature of pixel-based reconstructions that do not otherwise indicate the presence or absence of discontinuities.

Furthermore, not only should surface interiors and surface boundaries each be estimated directly, but because boundaries and interiors are interdependent, with the former bordering on the latter, they should in fact be estimated cooperatively within a single algorithm, rather than independently by separate algorithms.

In other words, a stereo algorithm should explicitly and simultaneously consider both of the two complementary aspects of the continuity rule: smoothness over surface interiors, and discontinuity across surface boundaries.

1.4 A Categorization of Stereo Algorithms

Many stereo algorithms are based upon the four constraints listed in Section 1.2: the epipolar constraint, color constancy, continuity, and uniqueness. Of these, the former two are relatively straightforward, but the manner in which the latter two are applied varies greatly [4, 22, 27, 65]. We propose a three-axis categorization of binocular stereo algorithms according to their interpretations of continuity and uniqueness, where we subdivide continuity according to the discussion of Section 1.3. In the following subsections, for each of the three axes, we list last the category which we consider to be the most preferable.

1.4.1 Continuity

The first axis describes the modeling of continuity over disparity values within smooth surface patches.

As an example of a smooth surface patch, consider a slanted plane, with left and right images as shown in Figure 1.4. Then the true disparity over the surface patch would also be a slanted plane (i.e., a linear function of the x and y coordinates within the image plane).

Figure 1.4: An example of a smooth surface (left and right images shown).

How might a stereo algorithm model the disparity along this slanted plane, or along any one smooth surface in general? Using this example as an illustration, we propose to categorize smooth surface models into three broad groups; most models used by prior stereo algorithms fall into one of these categories.

Constant

In these most restricted models, every point within any one smooth surface patch is assigned the same disparity value. This value is usually chosen from a finite, predetermined set of possible disparities, such as the set of all integers within a given range, or the set of all multiples of a given fraction (e.g., 1/4 or 1/2) within a given range. Examples of prior work in this category include traditional sum-of-squared-differences correlation, as well as [17, 30, 44, 47, 52].

Applied to our example, these models would likely recover several distinct fronto-parallel surfaces. This would be a poor approximation to the true answer. While one could lump together multiple constant-disparity surfaces to simulate slanted or curved surfaces, such a grouping would likely contain undesirably large internal jumps in disparity, especially in textureless regions. It would be desirable to be able to represent directly not only fronto-parallel surfaces, but also slanted and curved surfaces.

Discrete

In these intermediate models, disparities are again limited to discrete values, but with multiple distinct values permitted within each surface patch. Surface smoothness in this context means that within each surface, neighboring pixels should have disparity values that are numerically as close as possible to one another. In other words, intra-surface discontinuities are expected to be small. For identical discretization, the “smooth” surfaces expressible by these models are a strict superset of those expressible by “Constant” models. Examples of prior work in this category include [7, 41, 60, 86].

Applied to our example, this category would improve upon the previous by shrinking the jumps in disparity to the resolution of the disparity discretization. However, it would be even better if the jumps were completely removed. Although one could fit a smooth surface to the discretized data from these models, such a fit would still be subject to error; e.g., if our slanted plane had a disparity range less than one discretization unit, these models would likely recover a fronto-parallel surface.

Real

In these most general models, disparities within each smooth surface patch vary smoothly over the real numbers (or some computer approximation thereof). This category can be thought of as the limit of the “Discrete” category as the discretization step approaches zero. Various interpretations of smoothness can be used; most try to minimize local first- or second-order differences in disparity. Examples of prior work in this category include [1, 3, 11, 70, 75].

Applied to our example, in the absence of other sources of error, these models should correctly find the true disparity. Therefore, among these three categories for modeling smooth surfaces, we find this one to be the most preferable, because it allows for the greatest precision in estimating depth.
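As a toy illustration, the following minimal Python/NumPy sketch (our own example, loosely echoing the spline surfaces used later in this dissertation) contrasts a unit-step discretization with a least-squares planar fit, one simple “Real” model, on a gently slanted plane:

    import numpy as np

    # A slanted plane whose disparity spans only 0.8 units across the image:
    # "Constant" or "Discrete" models with unit steps would flatten it.
    H, W = 32, 32
    ys, xs = np.mgrid[0:H, 0:W]
    true_d = 5.0 + 0.8 * xs / (W - 1)      # true disparity in [5.0, 5.8]

    quantized = np.round(true_d)           # what a unit-step model can express
    print(np.unique(quantized))            # [5. 6.]: a step edge, not a slope

    # A "Real" model: least-squares fit of d(x, y) = a*x + b*y + c.
    A = np.column_stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    a, b, c = np.linalg.lstsq(A, true_d.ravel(), rcond=None)[0]
    print(a, b, c)                         # ~0.0258, ~0.0, ~5.0: exact recovery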

1.4.2 Discontinuity

The second axis describes the treatment of discontinuities at the boundaries of smooth surface patches.

As an example of a scene with boundaries discontinuous in depth, consider a small, fronto-parallel square “floating” in front of a larger one, with left and right images as shown in Figure 1.5. Then the true disparity for this scene would be a small square of larger disparity inside a larger square of smaller disparity, with step edges along the top and bottom of the smaller square.

Figure 1.5: An example of discontinuities with occlusion. Top: left and right images. Bottom: left and right disparity maps (dark gray = small disparity; light gray = large disparity; white = no disparity).

How might a stereo algorithm model the disparity across this depth discontinuity? Using this example as an illustration, we propose to categorize discontinuity models into four broad groups; most models used by prior stereo algorithms fall into one of these categories. Specifically, the penalty associated with a discontinuity is examined as a function of the size of the jump of the discontinuity.

Free

In this category, discontinuities are not specifically penalized. That is, with all else being equal, no preference is given for continuity in the final, reconstructed disparity map. In particular, these methods often fail to resolve the ambiguity caused by periodic textures or textureless regions. Examples of prior work in this category include traditional sum-of-squared-differences correlation, as well as [44, 52, 74, 84, 86].

Applied to our example, these models would likely produce frequent, scattered errors throughout the interior of the squares: the cross-hatched pattern is perfectly periodic, so the identity of the single, best match would likely be determined by random perturbations.

Infinite

In this category, discontinuities are penalized infinitely; i.e., they are disallowed. The entire image is treated as one “smooth” surface. That is, the entire image, as a unit, is subject to the chosen model of smooth surface interiors; “almost everywhere” continuity in fact applies everywhere. The recovered disparity map is smooth everywhere, although potentially not uniformly so. (Note, however, that the surface smoothness model may itself allow small discontinuities within a single surface.) Examples of prior work in this category include [1, 5, 37, 70].

Applied to our example, these models would not separate the foreground and background squares, but would instead connect them by smoothing over their common boundary. The width of the blurred boundary can vary depending on the specific algorithm, but typically, the boundary will be at least several pixels wide.

Convex

In this category, discontinuities are allowed, but a penalty is imposed that is a finite, positive, convex function of the size of the jump of the discontinuity. Typically, that convex cost function is either the square or the absolute value of the size of the jump. The resulting discontinuities often tend to be somewhat blurred, because the cost of two adjacent discontinuities is no more than that of a single discontinuity of the same total size. Examples of prior work in this category include [41, 60, 75].

Applied to our example, these models would likely separate the foreground and background squares successfully. However, at the top and bottom edges of the smaller square, where there is just a horizontal line, these models might output a disparity value in between those of the foreground and background.

Non-convex

In this category, discontinuities are allowed, but a penalty is imposed that is a non-convex function of the size of the jump of the discontinuity. One common choice for that non-convex cost function is the Potts energy [57], which assesses a constant penalty for any non-zero discontinuity, regardless of size. The resulting discontinuities usually tend to be fairly clean, because the cost of two adjacent discontinuities is generally more than that of a single discontinuity of the same total size. Examples of prior work in this category include [7, 21, 24, 30].

Applied to our example, these models would likely separate the foreground and background squares successfully. Moreover, the recovered disparity values would likely contain only two distinct depths: those of the foreground and the background. Therefore, among these four categories for modeling discontinuities, we find this one to be the most preferable, because it reconstructs boundaries the most cleanly, with minimal warping of surface shape.
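For concreteness, the four categories can be compared through the penalty they assign as a function of the disparity jump at a boundary; the following minimal Python sketch (our own; the constant LAMBDA is arbitrary) also shows why convex penalties blur edges while the Potts penalty keeps them sharp:

    # Discontinuity penalty as a function of the disparity jump `delta`,
    # one function per category (constants are arbitrary illustrations).
    LAMBDA = 10.0

    def free(delta):     return 0.0                                 # never penalized
    def infinite(delta): return 0.0 if delta == 0 else float("inf") # disallowed
    def convex(delta):   return LAMBDA * abs(delta)                 # e.g., linear
    def potts(delta):    return 0.0 if delta == 0 else LAMBDA       # non-convex

    # Why convex penalties blur boundaries: one jump of 4 costs the same as
    # four adjacent jumps of 1, so nothing discourages smearing the edge.
    print(convex(4), 4 * convex(1))   # 40.0 40.0
    # The Potts penalty prefers one clean jump over a smeared one:
    print(potts(4), 4 * potts(1))     # 10.0 40.0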

1.4.3 Uniqueness

The third axis describes the application of uniqueness, especially to the occlusions that accompany depth discontinuities.

As an example of a scene with occlusions, consider again the example of the floating square shown in Figure 1.5. The true disparity for this scene would be a small square of larger disparity inside a larger square of smaller disparity, with an occlusion region of no disparity to the left or right side of the smaller square in the left or right images, respectively. The occlusion region is the portion of the background square which, in the other image, is occluded by the foreground square. Points within the occlusion region have no disparity because they do not have corresponding points in the other image, and we define disparity using correspondence (as opposed to inverse depth). In general, disparity discontinuities are always accompanied by occlusion regions [30], except when the boundary lies along an epipolar line. This is a consequence of uniqueness and symmetry.

How might a stereo algorithm model the disparity in this scene, or in others with discontinuities and occlusions? Using this example as an illustration, we propose to categorize occlusion models into four broad groups; most models used by prior stereo algorithms fall into one of these categories.

Transparent

In this category, uniqueness is not assumed; these models allow for transparency. Our “floating squares” example does not exhibit transparency, so these models would be of little benefit for it; however, for scenes that do exhibit transparency, these models would be essential for adequate reconstruction. Furthermore, natural scenes often contain fine details (such as tree branches against the sky) that are only a few pixels in width; because of pixelization, these images effectively contain transparency as well [62]. Unfortunately, stereo reconstruction with transparency is a very challenging problem with few existing solutions; one such example of prior work is [71].

One-way

In this category, uniqueness is assumed within a chosen reference image, but not considered within the other. That is, each location in the reference image is assigned at most one disparity, but the disparities at multiple locations in the reference image may point to the same location in the other image. Typically, each location in the reference image is assigned exactly one disparity, and occlusion relationships are ignored. That is, these models generally search for correspondences within occlusion regions as well as within non-occlusion regions. Examples of prior work in this category include traditional SSD correlation, as well as [11, 21, 44, 84].

Applied to our example, these models would likely find the correct disparity within the unoccluded regions. Points within the occluded regions will typically be assigned some intermediate disparity value between those of the foreground and background. Note that such an assignment would result in the occluded point and a different, unoccluded point both being paired with the same point in the other image. This “collision,” or failure of reciprocal uniqueness, is an undesirable yet readily detectable condition; these models allow it and are thus less than ideal.
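Such collisions are readily detectable by cross-checking the two one-way disparity maps, as in the following minimal Python/NumPy sketch (our own illustration of the failure mode described here, not a method from the text):

    import numpy as np

    def cross_check(d_left, d_right, tol=0):
        """Cross-check two integer disparity maps of equal shape: a left
        pixel (y, x) with disparity d maps to right pixel (y, x - d); if
        that pixel's own disparity does not agree within tol, reciprocal
        uniqueness fails there. Returns a boolean mask that is False where
        the check fails (likely occlusions); border handling is simplified
        by clipping."""
        H, W = d_left.shape
        ys, xs = np.mgrid[0:H, 0:W]
        xr = np.clip(xs - d_left, 0, W - 1)     # where each left pixel lands
        return np.abs(d_left - d_right[ys, xr]) <= tol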

Asymmetric Two-way

In this category, uniqueness is encouraged for both images, but the two images are treated unequally. That is, reasoning about occlusion is done, and the occlusions that accompany depth discontinuities are qualitatively recovered, but there is still one chosen reference image, resulting in asymmetries in the reconstructed result. Examples of prior work in this category include [3, 17, 52, 75, 86].

Applied to our example, these models would likely find the correct disparity within the unoccluded regions, and most of the occlusion region would be marked as such. However, some occluded pixels near the edge of the occlusion region might mistakenly be assigned to the nearer surface; the outline of the reconstructed smaller square would likely look different on its left versus right edges.

Symmetric Two-way

In this category, uniqueness is enforced in both images symmetrically; detected occlusion regions are marked as being without correspondence. Examples of prior work in this category include [7, 30, 41, 47].

Applied to our example, these models would likely find the correct disparity (or lack thereof) everywhere, barring other sources of error. Therefore, among these four categories for modeling uniqueness and occlusions, we find this one to be the most preferable for our purposes, because it encourages the greatest precision in localizing boundaries in image space by fully utilizing the assumption of uniqueness.

1.5 A Brief Survey of Stereo Methods

Section 1.1 explained that stereo correspondence is fundamentally an underconstrained, inverse problem. Section 1.2 proposed that it is fairly straightforward to impose color constancy and the epipolar constraint (by matching each image point with every image point of the same color in the corresponding epipolar line), but observed that additional constraints are necessary. Section 1.2 concluded by claiming that uniqueness and continuity generally suffice as those further constraints, without discussing how they are applied.

This section motivates and reviews a few selected approaches to stereo that have been used in the past, discusses how they use uniqueness and continuity, and explains how they fit within our three-axis categorization.

1.5.1 Pointwise Color Matching

Under the assumption of opacity, uniqueness implies that each image point may correspond to at most one other image point. Thus, the simplest way to apply uniqueness on top of color constancy and the epipolar constraint would be to match each image point with the one image point of the most similar color in the corresponding epipolar line. This naive technique might work well in an ideal, Lambertian world in which every world point has a unique color, but in practice, the discretized color values of digital images can cause problems.

As an extreme example, let us consider a binary random-dot stereogram, in which each image consists of pixels which are randomly and independently black or white. With pointwise correspondence, there will be a true match at the correct disparity, but there will also be a 50% chance of a false match at any incorrect disparity. This is because looking at the color at a single point does not provide enough information to uniquely identify that point.

Thus we see that even with the use of color constancy, uniqueness, and the epipolar constraint, without continuity, direct pointwise stereo correspondence is still ambiguous, in that false matches may appear as good as the correct match.
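The 50% figure is easy to verify empirically; the following minimal Python/NumPy sketch (our own; the scanline width and true disparity are arbitrary) performs pointwise matching on one scanline of a binary random-dot pair:

    import numpy as np

    rng = np.random.default_rng(0)
    W, d_true = 1000, 3
    right = rng.integers(0, 2, size=W)      # one random binary scanline
    left = np.roll(right, d_true)           # shifted copy: true disparity 3

    # Pointwise matching: a candidate disparity d "matches" at pixel x when
    # the colors agree. The true disparity matches everywhere, but any wrong
    # disparity still matches about half the time, so pointwise evidence
    # alone cannot distinguish them at an individual pixel.
    for d in range(6):
        x = np.arange(d, W)
        match_rate = np.mean(left[x] == right[x - d])
        print(d, round(match_rate, 2))      # ~0.5 except 1.0 at d == 3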

1.5.2 Windowed Correlation

With the use of continuity, which implies that neighboring image points will likely have similar disparities, one can pool information among neighboring points to reduce ambiguity and false matches. This is the basic idea behind windowed correlation, on which many early stereo methods are based.

Classical windowed correlation consists of comparing a fixed-size window of pixels, rather than individual pixels, and choosing the disparity that yields the best match over the whole window. Windowed correlation is simple, efficient, and effective at reducing false matches. For example, let us reconsider the aforementioned random-dot stereogram. Whereas matching individual pixels gave a false match rate of 50%, with a 5×5 window of pixels for correlation, the probability of an exact false match would be reduced to 2^−25, or under 1 in 33 million.

However, for a window to match exactly, the disparity within the window must be constant. Otherwise, no disparity will yield a perfect match, and the algorithm will pick whichever disparity gives the smallest mismatch, which may or may not be the disparity at the center of the window. In other words, windowed correlation methods depend on the implicit assumption that disparities are “locally” constant; these methods work best where that is indeed the case.
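Classical winner-take-all SSD correlation along rectified scanlines might be sketched as follows (our own minimal Python/NumPy rendering; window size and search range are arbitrary):

    import numpy as np

    def ssd_disparity(left, right, d_max, half=2):
        """Winner-take-all windowed SSD: for each left pixel, pick the
        disparity whose (2*half+1)^2 window matches the right image best.
        Implicitly assumes disparity is constant within each window, which
        is exactly where the method breaks down near discontinuities."""
        left = np.asarray(left, dtype=float)
        right = np.asarray(right, dtype=float)
        H, W = left.shape
        best_d = np.zeros((H, W), dtype=int)
        best_cost = np.full((H, W), np.inf)
        for d in range(d_max + 1):
            diff2 = np.full((H, W), np.inf)
            diff2[:, d:] = (left[:, d:] - right[:, : W - d]) ** 2
            # Box-filter the squared differences into per-pixel window SSDs.
            pad = np.pad(diff2, half, mode="edge")
            cost = sum(pad[dy : dy + H, dx : dx + W]
                       for dy in range(2 * half + 1)
                       for dx in range(2 * half + 1))
            better = cost < best_cost
            best_d[better], best_cost[better] = d, cost[better]
        return best_d

Applied to the random-dot example of Section 1.5.1, a 5×5 window (half = 2) makes exact false matches vanishingly rare, at the price of the locally-constant-disparity assumption discussed above.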

The meaning of “locally” above is determined by the size and shape of the correlation window. However, choosing the configuration of the window is not easy. On the one hand, if the window is too small, spurious matches will remain, and many incorrect matches will look “good”; larger windows are better at reducing ambiguity by minimizing false matches. On the other hand, if the window is too large, it will be unlikely to contain only a single disparity value, and even the correct match will look “bad”; larger windows are thus more likely to reject true matches.

To some extent, this problem of choosing a window size can be alleviated by using adaptive windows, which shift and/or shrink to avoid depth discontinuities, while remaining larger away from discontinuities. Kanade and Okutomi [44] use explicit, adaptive, rectangular windows. Hirschmüller [35] uses adaptive, piecewise square windows. Scharstein and Szeliski [63] use implicit “windows” formed by iterative, nonlinear diffusion.

In practice, windowed correlation techniques work fairly well within smooth, textured regions, but tend to blur across any discontinuities. Moreover, they generally perform poorly in textureless regions, because they do not specifically penalize discontinuities in the recovered depth map. That is, although these methods assume continuity by their use of windows, they do not directly encourage continuity in the case of ambiguous matches.

1.5.3 Regularization

One way to enforce continuity, even in the presence of ambiguous matches, is through the use of regularization. Generically, regularization is a technique for stabilizing inverse problems by explicitly quantifying smoothness and adding it as one more simultaneous goal to optimize. Applied to stereo, it treats disparity as a real-valued function of image location, defines a functional measuring the “smoothness” of such a disparity function, and tries to maximize that functional while simultaneously maximizing color constancy. Such “smoothness” can be quantified in many ways, but in general, nearby image locations should have similar disparities.
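In discretized form, such an energy might be sketched as follows (our own generic illustration, not any specific cited formulation; nearest-pixel sampling is used for brevity where real implementations interpolate):

    import numpy as np

    def regularization_energy(d, left, right, lam=1.0):
        """Generic discretized regularization energy for a disparity map d:

            E(d) = sum_p (I_L(p) - I_R(p - d(p)))^2        data / color constancy
                 + lam * sum_neighbors (d(p) - d(q))^2     first-order smoothness

        Every pixel must carry a disparity; there is no way to mark a pixel
        as unmatched, which is the occlusion weakness discussed below."""
        H, W = d.shape
        ys, xs = np.mgrid[0:H, 0:W]
        xr = np.clip(xs - np.round(d).astype(int), 0, W - 1)
        data = np.sum((left - right[ys, xr]) ** 2)
        smooth = np.sum(np.diff(d, axis=0) ** 2) + np.sum(np.diff(d, axis=1) ** 2)
        return data + lam * smooth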

Horn and Schunck [38] popularized both the use of regularization on otherwise underconstrained image correspondence problems, and the use of variational methods to solve the resulting energy minimization problems in a continuous domain. Horn [37], Poggio et al. [56], and Barnard [5] suggested several formulae for quantifying smoothness, all of which impose uniform smoothness and forbid discontinuities. Terzopoulos [77] and Lee and Pavlidis [49] investigated regularization with discontinuities. Rivera and Marroquín [58] formulated a higher-order, edge-preserving method that does not penalize constant, non-zero slopes.

Computationally, regularization tends to yield challenging nonlinear optimization problems that, without fairly sophisticated optimization algorithms, can be highly dependent on good initial conditions. Often, multiscale or multigrid methods are needed [76], but Akgul et al. [1] presented an alternate method of ensuring reliable convergence, by starting with two initial conditions and evolving them cooperatively until they coincide. Allowing discontinuities further complicates the optimization process; Blake and Zisserman [16] propose a method for optimizing certain specific models of regularization with discontinuities.

Aside from optimization challenges, the primary weakness of regularization methods is that they do not readily allow occlusions to be represented. Every image point is forced to have some disparity; no point can remain unmatched, as would be required for proper occlusions under the constraint of uniqueness.

1.5.4 Cooperative Methods

We have seen that neither windowed correlation nor regularization methods support proper occlusions: they try to find unique disparity values for one reference image, but without checking for “collisions” in the other image. In effect, uniqueness, although assumed, is not enforced; only one-way uniqueness is applied.

Inspired by biological nervous systems, cooperative methods directly implement the assumptions of continuity and two-way uniqueness in an iterative, locally connected, massively parallel system. These techniques operate directly in the space of correspondences (referred to as the matching score volume by [85], and as the disparity-space image, or DSI, by [17, 40, 65]), rather than in image space, evolving a 3D lattice of continuous-valued weights via mutual excitation and inhibition.

This space of possible correspondences can be parameterized in several ways. Typically, (x, y, d) is used, with (x, y) representing position in the chosen reference image, and d representing disparity. Assuming rectified input images, however, an alternate, symmetric parameterization is (xl, xr, y). Qualitatively, a weight at (xl, xr, y) in such a coordinate system represents the likelihood that (xl, y) in the left image and (xr, y) in the right image correspond to one another (i.e., are images of the same physical world point).

Initially, this matching score volume is populated with local similarity measures, typically obtained via correlation with small windows. Subsequently, the weights are updated in parallel as follows: if a weight at (xl, xr, y) is large, then for uniqueness, weights at (x, xr, y) and (xl, x, y) are inhibited, and for continuity, weights at any other, non-inhibited points “near” (xl, xr, y) are excited. Upon convergence of this relaxation algorithm, these real-valued weights are compared with one another and thresholded to determine final correspondences (or the lack thereof).
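One parallel update of this kind might be sketched as follows (a much-simplified rendering of the generic scheme, in Python/NumPy; the neighborhood, weights, and normalization are our own arbitrary choices, not those of any cited algorithm):

    import numpy as np

    def cooperative_step(C, radius=2, alpha=0.5, beta=0.5):
        """One relaxation step on a square matching-score slice C[xl, xr]
        for a fixed scanline y.

        Excitation: entries along the same diagonal of C share a similar
        disparity xl - xr, so they lend each other support (continuity).
        Inhibition: C[xl, :] and C[:, xr] should each contain at most one
        strong match, so each entry is suppressed by its row and column
        competitors (two-way uniqueness)."""
        n = C.shape[0]
        excite = np.zeros_like(C)
        for k in range(1, radius + 1):
            excite[k:, k:] += C[:-k, :-k]     # support from (xl - k, xr - k)
            excite[:-k, :-k] += C[k:, k:]     # support from (xl + k, xr + k)
        excite /= 2 * radius
        inhibit = (C.sum(axis=1, keepdims=True)
                   + C.sum(axis=0, keepdims=True) - 2 * C) / (2 * n)
        return np.clip(C + alpha * excite - beta * inhibit, 0.0, 1.0)

Iterating such a step and finally thresholding the weights yields the correspondences, as described above.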

Different cooperative algorithms use different models of excitation, corresponding to different models of smooth surfaces. Marr and Poggio [52] use a fixed, 2D excitation region for “Constant” surfaces; moreover, their algorithm is only defined for binary-valued (e.g., black and white) input images. Zitnick and Kanade [86] use a fixed, 3D excitation region for “Discrete” surfaces; their algorithm is designed for real-valued images.

In practice, [86] can give very good results, but because it uses a fixed window for excitation, boundaries can be rounded or blurred (analogous to classical windowed correlation). To improve boundary localization, Zhang and Kambhamettu [85] use a variable, 3D excitation region that is dependent on an initial color segmentation of the input images; the idea is that depth discontinuities will likely correlate well with monocular color edges.

Regarding convergence, because the cooperative update is local in nature, accurate results depend upon good initialization. In particular, although a limited number of false matches can start with a good initial score, true matches must start with a good initial score. These methods support discontinuities with non-convex penalties; two-way uniqueness is encouraged, generally asymmetrically.

1.5.5 Dynamic Programming

Like cooperative methods, dynamic programming methods also operate in a discretized disparity space in order to encourage bidirectional uniqueness along with continuity. However, while cooperative methods are iterative and find a locally optimal set of real-valued weights that must then be thresholded, dynamic programming is non-iterative and finds a globally optimal set of binary weights that directly translate into the presence or absence of each candidate correspondence. Thus, dynamic programming methods for stereo are much faster, and apparently also more principled, than their cooperative counterparts.

However, the downsides to dynamic programming are twofold. First, dynamic programming can only optimize one scanline at a time. Many desired interactions among scanlines, such as those required for continuity between scanlines, require less principled, ad hoc post-processing, usually with no guarantee of optimality. Second, dynamic programming depends upon the validity of the ordering constraint, which states that in each pair of corresponding scanlines, corresponding pixels appear in the same order in the left and right scanlines. Because the ordering constraint is a generalization that is not always true [28], and because optimizing scanlines independently can be rather prone to noise, dynamic programming is better suited for applications where speed is an important consideration.
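The per-scanline computation can be sketched as a sequence alignment (our own minimal Python rendering; actual methods differ in their costs, constraints, and backtracking):

    import numpy as np

    def dp_scanline(L, R, occ=2.0):
        """Align one pair of rectified scanlines by dynamic programming.

        Classic alignment formulation: each left pixel either matches a
        right pixel (cost = squared color difference, order preserved) or
        is skipped as occluded (cost = occ), and likewise for right pixels.
        The ordering constraint is implicit in the left-to-right recurrence.
        Returns the minimum total cost; the matches themselves could be
        recovered by backtracking through the table D."""
        n, m = len(L), len(R)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, :] = occ * np.arange(m + 1)       # leading right occlusions
        D[:, 0] = occ * np.arange(n + 1)       # leading left occlusions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                D[i, j] = min(
                    D[i - 1, j - 1] + (L[i - 1] - R[j - 1]) ** 2,  # match
                    D[i - 1, j] + occ,                             # L[i-1] occluded
                    D[i, j - 1] + occ,                             # R[j-1] occluded
                )
        return D[n, m]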

Various dynamic programming approaches differ in their treatment of continuity. Baker and Binford [2] and Ohta and Kanade [54] impose continuity simply and directly, by first matching edges, then interpolating over the untextured regions. Unfortunately, such interpolation does not preserve sharp discontinuities.

Taking the opposite approach, Intille and Bobick [17, 40] do not use continuity at all, relying upon “ground control points” and the ordering constraint to obviate the need for any external smoothness constraints. Their asymmetric method uses neither intra- nor inter-scanline smoothness, and treats each scanline independently. The method of Geiger, Ladendorf, and Yuille [30] also treats scanlines independently, but supposes that disparities are piecewise constant along each scanline, and symmetrically enforces a strict correspondence between discontinuities in one image and occlusions in the other.

In contrast, both Cox et al. [24] and Belhumeur and Mumford [8] impose 2D continuity through inter-scanline constraints. Cox et al. [24] count the total number of depth discontinuities (horizontal plus vertical), and specify that this number should be minimized as a subordinate goal; they suggest either one or two passes of dynamic programming as efficient methods for approximating this minimization. Belhumeur and Mumford [8] also require the minimization of the number of pixels at which discontinuities are present, but Belhumeur [6, 7] generalizes the notion of discontinuity, counting both step edges and crease edges. Belhumeur formulates a symmetric energy functional that incorporates this count, and proposes that it be minimized with iterated stochastic dynamic programming.

Aside from depending upon the ordering constraint, all of these methods have discrete, pixelized approaches to continuity that are at most one-dimensional. These limitations are the primary weaknesses of dynamic programming for stereo.

1.5.6 Graph-Based Methods

As do dynamic programming methods, graph-based methods leverage combinatorial optimization techniques for their power, but unlike dynamic programming methods, graph-based methods are able to optimize continuity over the entire 2D image, instead of only along individual 1D scanlines. Graph-based methods are based upon efficient algorithms [32, 46] for calculating the minimum-cost cut (or, equivalently, the maximum flow [29]) through a network graph.

    There are two general flavors of graph-based stereo methods. One flavor computes

    the global minimum of a convex energy functional with a single minimum-cost cut;

    typically, the cost of a discontinuity is a linear function of its size. Roy and Cox [60]

    propose one such method, which discards the ordering constraint used in dynamic

    programming in favor of a local coherence constraint. Their method uses an undi-

    rected graph built upon a chosen reference image, and finds exactly one disparity for

    each pixel therein. Ishikawa and Geiger [41] propose another such method, which

    retains the ordering constraint, and furthermore distinguishes among ordinary, edge,

    and junction pixels. Their method uses a directed graph, and symmetrically enforces

    two-way uniqueness. Both of these methods tend to produce discontinuities that

    are somewhat blurred, because they are incapable of using non-convex, sub-linear

    penalties for discontinuities.

    The other flavor of graph-based methods computes a strong local minimum of

    a non-convex energy functional with iterated minimum-cost cuts. Boykov, Veksler,

    and Zabih [18] developed one such optimization technique that is applicable to an

    extremely wide variety of non-convex energies. Boykov et al. [19] subsequently

    developed another such optimization technique that is somewhat less widely appli-

    cable, but which produces results that are provably within a constant factor of being

    globally optimal [79]. Boykov et al. [20, 21] apply these techniques to stereo, again

    using an undirected graph built upon a chosen reference image, and finding exactly

    one disparity for each pixel therein. Kolmogorov and Zabih [47] build more complex

    graphs; their method enforces symmetric, two-way uniqueness, but is limited to

    constant-disparity continuity.


    In general, graph-based methods are not only quite powerful, but also fairly effi-

    cient for what they do. For our purposes, their main weakness is their restriction to

    computing discrete-valued disparities, due to their inherently combinatorial nature.

    1.5.7 Layered Methods

    Like regularization with discontinuities, layered models [25, 26, 80] estimate real-

    valued disparities while allowing discontinuities, producing piecewise-smooth surface

    reconstructions. However, while the former methods represent all surface patches

    together with a single function mapping image location to disparity value, layered

    methods separately represent each surface patch with its own such function, and

    combine them into a single depth map through the use of support maps, which define

    the image regions in which each surface is “active.”

    The primary consequence of this representational enrichment is that when

    combining support maps, there need not be exactly one “active” surface per pixel.

    In particular, it is trivial to represent pixels at which no surface is active. This lets

    layered methods readily model occlusion regions, enabling them to consider two-way

    uniqueness.

    Another consequence of modeling surfaces with layers is that each surface gets

    its own identity, independent of its support map. This allows image regions to be

    grouped semantically, rather than merely topologically. In other words, layered

    models have the advantage of being able to model hidden connections among visible

    surface patches that are separated by occluding objects. For example, consider a

    scene seen through chain link fence. With standard regularization, either the fence

    must be ignored, or the remainder of the scene must be cut into many independent

    pieces. With a layered model, the fence and the background can both survive intact.
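    The following toy sketch (hypothetical Python, with invented array values)
    illustrates the representational point: a support map makes "no active surface"
    trivially expressible, while each layer remains a single intact function even
    when its support is interleaved with another layer's.

        import numpy as np

        # Label 0 means "no surface active" (e.g., a half-occluded pixel);
        # labels 1..n select a layer.
        f = np.array([[1, 1, 0, 2],
                      [1, 0, 2, 2]])           # segmentation (support) map
        d = {1: np.full((2, 4), 5.0),          # per-layer disparity maps,
             2: np.full((2, 4), 2.0)}          # each defined everywhere

        depth = np.full(f.shape, np.nan)       # NaN marks unassigned pixels
        for k, dk in d.items():
            depth[f == k] = dk[f == k]         # a layer contributes only
                                               # inside its own support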

    Baker, Szeliski, and Anandan [3] developed a layered method for stereo recon-

    struction that is based upon minimizing the resynthesis error obtained by comparing

    input images warped according to the recovered depth map. Their method models

    the disparity of each surface patch as a plane with small, local deviations. Their

    theory includes transparency, but their implementation uses asymmetric, two-way


    uniqueness. Like windowed correlation methods, they achieve some degree of conti-

    nuity by spatially blurring the match likelihood.

    Birchfield and Tomasi [11] developed a method that models each surface as a

    connected, slanted plane. They estimate the assignment of pixels to surfaces with the

    graph-based techniques of [18], and favor placing boundaries along intensity edges,

    again yielding exactly one disparity value for each image location. Among prior work,

    their algorithm is the most similar to ours.

    1.6 Our Proposed Approach

    In Section 1.4, we proposed a three-axis categorization of binocular stereo algo-

    rithms according to their treatment of continuity and uniqueness. In the remainder

    of this dissertation, we propose an algorithm that simultaneously lies in the most

    preferable category along all three axes: real-valued disparities, non-convex disconti-

    nuity penalties, and symmetric two-way occlusions. To the author’s knowledge, ours

    is the first such algorithm for binocular stereo.

    We contend that, for scenes consisting of smooth surfaces, our algorithm improves

    upon the current state of the art, achieving both more accurate localization in depth

    of surface interiors via subpixel disparity estimation, and more accurate localization

    in the image plane of surface boundaries via the symmetric treatment of images with

    proper handling of occluded regions.

    1.7 Outline of Dissertation

    In Chapter 2, we describe our mathematical model of the stereo problem and solu-

    tions thereof. In Chapters 3 and 4, we describe surface fitting and boundary local-

    ization, respectively. In Chapter 5, we describe the interaction between surface fitting

    and boundary localization, and give the overall optimization algorithm. In Chapter 6,

    we present some promising qualitative and quantitative experimental results. Finally,

    in Chapter 7, we offer a few concluding remarks.

    Chapter 2

    Preliminaries

    In this chapter, we develop a mathematical abstraction of the stereo problem.

    This abstract formulation is defined within a continuous domain; discretization of

    the problem for computational feasibility will be discussed in subsequent chapters.

    2.1 Design Principles

    Because the stereo problem is so sensitive to perturbations, in order to get the best
    results, it is especially important that the algorithm be designed to minimize the

    unnecessary introduction and propagation of errors. To this end, we follow two

    guiding principles: least commitment, and least discretization.

    In the computation of our final answer, we would like to make the best possible

    use of all available data. This means that, at any particular stage in the computation,

    we would like to be the least committed possible to any particular interpretation of

    the data. Because of this, it is better to match images directly, instead of matching

    only extracted features (such as edges and corners): we don’t want to discard the

    dense image data so early in the process. Similarly, it is better to directly estimate

    subpixel disparities, rather than fit smooth surfaces to pre-calculated, integer-only

    disparities, for the same reason.

    In a similar spirit, we also would like to avoid rounding errors as much as possible,



    so our computations are done in a continuous space as much as possible. Most basi-

    cally, our algorithm estimates floating point disparity values defined on a continuous

    domain; these disparity values are only discretized in our implementation by finite

    machine precision. In addition, since we are trying to recover subpixel (non-integer)

    disparity values, we need to match image appearance at inter-pixel image positions.

    This means that we must define image appearance at inter-pixel image positions; that

    is, we must interpolate the input images. We describe how we do this in Appendix A.
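    The details are deferred to Appendix A; purely for illustration (this sketch
    assumes nothing about the appendix's actual interpolation scheme), sampling an
    image at non-integer positions might look like the following in Python:

        import numpy as np
        from scipy.ndimage import map_coordinates

        def sample(image, xs, ys, order=3):
            """Evaluate a continuous interpolant of `image` at subpixel
            positions; order=3 is cubic spline, order=1 is bilinear."""
            # map_coordinates expects coordinates as (row, col) = (y, x)
            return map_coordinates(image,
                                   [np.atleast_1d(ys), np.atleast_1d(xs)],
                                   order=order, mode='nearest')

        img = np.arange(16.0).reshape(4, 4)
        print(sample(img, xs=1.5, ys=2.25))    # appearance between pixels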

    2.2 Mathematical Abstraction

    As motivated in Section 1.5, in order to fall into the most preferable category

    along each of our three proposed axes, we use a layered model [25, 26, 80] to represent

    possible solutions to the stereo problem.

    Our stereo algorithm follows the common practice of assuming that input images

    have been normalized with respect to both photometric and geometric calibration.

    In particular, we assume that the images are rectified. Let

    I = { p = (x, y, t) } = R × R × {‘left’, ‘right’}

    be the space of image locations, and let

    I : I → R^m

    be the given input image pair. Typically, m = 3 (with the components of R^m repre-

    senting red, green, and blue) for color images, and m = 1 for grayscale images, but our

    algorithm does not depend on the semantic interpretation of R^m; any feature space

    can be used. Note that the image space is defined to be continuous, not discrete; we

    discuss this matter further in Appendix A.

    Our abstract model of a hypothesized solution consists of a labeling (or segmen-

    tation) f, which assigns each point of the two input images to zero or one of N surfaces,
    plus N disparity maps d[k], each of which assigns a disparity value to each point of


    the two input images:

    [segmentation]      f : I → {0, 1, ..., N}
    [disparity map]     d[k] : I → R,   for k in {1, 2, ..., N}

    In other words, these functions are the independent unknowns that are to be esti-

    mated.

    The segmentation function f specifies to which one of the N surfaces, if any, each

    image location “belongs.” We take belonging to mean the existence of a world point

    which (a) projects to the image location in question, and (b) is visible in both images.

    For each surface, the signed disparity function d[k] defines the correspondence (or

    matching) function m[k] between image locations:

    m[k] : I → I
    m[k](x, y, t) = ( x + d[k](x, y, t), y, ¬t )

    where ¬‘left’ = ‘right’ and vice versa. That is, for each surface k, m[k] maps each

    location in one image to the corresponding location in the other image. Note that, for

    all k, d[k] and m[k] are both defined for all (x, y, t), regardless of the value of f(x, y, t).

    Furthermore, for standard camera configurations, d[k] will generally be positive in the

    right image and negative in the left image, if it represents a real surface.

    Thus, the interpretation of this model is:

    for all p:

    f(p) = k with k > 0 ⇒ p corresponds to m[k](p)

    f(p) = 0 ⇒ p corresponds to no location in the other image

    That is, a hypothesized solution specifies a set of correspondences between left and

    right image locations, where each image location is a member of at most one corre-

    spondence.
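    Concretely, a hypothesized solution could be stored as follows (an illustrative
    sketch; the class and field names are invented for exposition):

        import numpy as np

        class Hypothesis:
            """A labeling f plus N disparity maps over both views."""
            def __init__(self, f, d):
                self.f = f   # f[t][y, x] in {0, 1, ..., N}; t = 0/1 for left/right
                self.d = d   # d[k][t][y, x]: real-valued disparity of surface k

            def match(self, k, x, y, t):
                """m[k](x, y, t) = (x + d[k](x, y, t), y, not-t); shown here
                at integer pixel positions only, for brevity."""
                return (x + self.d[k][t][y, x], y, 1 - t)

    A pixel with f(p) = 0 simply takes part in no correspondence, which is how
    occlusions will be represented.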


    2.3 Desired Properties

    Given this abstract representation of a solution, how can we evaluate any

    particular hypothesized solution? What are some conditions that characterize a

    “good” solution? We propose three desired properties: consistency, smoothness, and

    non-triviality.

    2.3.1 Consistency

    Correspondence of image locations should be bidirectional. In other words, if

    points p and q are images of the same world point, then each corresponds to the

    other; otherwise, neither corresponds to the other. If a hypothesized solution were to

    say that p corresponds to q but that q does not correspond to p, it would make no

    sense; we call such a solution inconsistent.

    Within each surface, this translates into a constraint on m[k] that:

    for all k, p: m[k](m[k](p)) = p (2.1)

    which is equivalent to a constraint on each d[k]. In particular, for each k, given one of

    d[k](·, ·, ‘left’) or d[k](·, ·, ‘right’), the other is uniquely determined. This reflects

    the notion that d[k](x, y, ‘left’) and d[k](x, y, ‘right’) are two representations of

    the same surface.

    Regarding segmentation, we also have the constraint on f that

    for all p: f(p) = k with k > 0 ⇒ f(m[k](p)) = k (2.2)

    Ideally, these consistency constraints should be satisfied exactly, but for computa-

    tional purposes, we merely attempt to maximize consistency.

    2.3.2 Smoothness

    Continuity dictates that a recovered disparity map should be piecewise smooth,

    consisting of smooth surface patches separated by cleanly defined, smooth boundaries.


    Thus, in trying to estimate the best reconstruction, we would like to maximize the

    “smoothness,” both of the surface shapes defined by d[k], and of the boundaries

    defined by f .

    Because the disparity maps d[k] are continuous-valued functions, they are

    amenable to the usual meaning of smoothness. We take smoothness of d[k] to mean

    differentiability, with the magnitude of higher derivatives being relatively small.

    Because the segmentation function f can only take on the integer values 0, 1, ..., N,

    it is piecewise constant, with line-like boundaries separating those pieces. We take

    smoothness of f to mean simplicity of these boundaries, with the total boundary

    length being relatively small.

    2.3.3 Non-triviality

    Good solutions should conform to, and explain, rather than ignore, the input

    data as much as possible. For example, any two input images could be interpreted as

    views of two painted, planar surfaces, each presented to one camera. Such a trivial

    interpretation, yielding no correspondence for any image location, would be valid but

    undesirable. In general, we expect that a correspondence exists for “most” image

    locations; i.e., we expect that the segmentation function f is “mostly” non-zero:

    for most p: f(p) > 0

    Moreover, although color constancy is sometimes violated (e.g., due to specular-

    ities), and smoothness and consistency are needed to fill in the gap, a solution that

    supposes a perfectly smooth and consistent surface, at the expense of violating color

    constancy everywhere, is also not desirable. In other words, we expect that color

    constancy holds for “most” image locations:

    for most p where f(p) > 0:   I( m[f(p)](p) ) = I(p)

    Intuitively, using the language of differential equations, consistency and
    smoothness provide the homogeneous terms that result in the general solution,
    while non-triviality provides the non-homogeneous terms that result in the
    particular solution.

                          disparity maps    segmentation
        non-triviality    E_match_I         E_unassigned
        smoothness        E_smooth_d        E_smooth_f
        consistency       E_match_d         E_match_f

    Table 2.1: Contributions to energy.

    2.4 Energy Minimization

    Now that we have defined the form of a solution, and stated its desired properties,

    how do we find the best solution?

    We formalize the stereo problem in the framework of energy minimization. In

    general, energy minimization approaches split a problem into two parts: defining

    the cost, or energy, of all hypothesized solutions [31], and finding the best solution

    by minimizing that energy. This separation is advantageous because it facilitates the

    use of general-purpose minimization techniques, enabling more focus upon the unique

    aspects of the specific application.

    For our application, we formulate six energy terms, corresponding to each of the

    three desired properties, applied to both surface interiors and surface boundaries (see

    Table 2.1). These terms are developed in the next two chapters; total energy is the

    sum of these terms.
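    Schematically, the bookkeeping is just a sum (a sketch only; the e_* functions
    are placeholders for the terms developed in Chapters 3 and 4, stubbed out here
    so the fragment runs):

        # Placeholder term functions, each returning one energy contribution.
        e_match_I = e_unassigned = e_smooth_f = e_match_f = lambda hyp: 0.0
        e_smooth_d = e_match_d = lambda hyp, k: 0.0

        def total_energy(hyp, n_surfaces):
            """Total energy = the sum of the six contributions in Table 2.1."""
            ks = range(1, n_surfaces + 1)
            return (e_match_I(hyp) + e_unassigned(hyp)            # non-triviality
                    + sum(e_smooth_d(hyp, k) for k in ks)
                    + e_smooth_f(hyp)                             # smoothness
                    + sum(e_match_d(hyp, k) for k in ks)
                    + e_match_f(hyp))                             # consistency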

    Chapter 3

    Surface Fitting

    In this chapter, we consider a restricted subproblem. Rather than simultaneously

    estimating both the 3D shape (given by d[k]) and the 2D support (given by f) of each

    surface, we consider the problem of estimating 3D shape, given 2D support. That

    is, supposing that the segmentation f is known, how can the disparity maps d[k] be

    found? Using this context, we explain our model of smooth surfaces; formulate the

    three energy terms that encourage surface non-triviality, smoothness, and consistency;

    and discuss the minimization of these energy terms.

    3.1 Defining Surface Smoothness

    Fitting a smooth surface to sparse and/or noisy data is a classic mathematical

    problem. Sometimes there are obvious gaps in a data set, and one would like to fill

    them in; other times, the data set is complete but is a mixture of signal and noise,

    and one would like to extract the signal. In either of these cases, the key to
    determining the solution is the exact specification of the smoothness that one expects

    to find. What are some ways to define, and subsequently impose, surface smoothness?

    Perhaps the simplest way to impose surface smoothness is to decree that the

    surface belong to some pre-defined class of known “smooth” surfaces, for example

    planar or quadric surfaces. This approach is extremely efficient: because qualified

    surfaces can be described by a small number of parameters (e.g., horizontal tilt,



    vertical tilt, and perpendicular distance for planar surfaces), there are only a few

    degrees of freedom, and it is relatively easy to find the optimal surface. Moreover,

    this approach can also be quite robust and tolerant of perturbations, for the same

    reason. Thus, when they adequately approximate true scene geometry, and efficiency

    is of concern, global parametric models can be a good choice (e.g., [3, 11]).

    Unfortunately, true scene geometry often contains local details that cannot be

    represented by a global parametric model. Low-order parametric models simply lack

    sufficient degrees of freedom, and generally smooth over all but the coarsest details.

    High-order global parametric models are less well-behaved, and can produce recon-

    structions with large-amplitude ringing. Thus, when the reconstruction of accurate

    details is required, global parametric models are generally unsuitable.

    At the other end of the spectrum are regularized, pixel-level lattices, where the

    complete disparity map is specified not by a few global parameters, but instead by

    the individual disparity values at every pixel. Each of these individual disparities is

    a separate degree of freedom, so fine detail can be represented faithfully, but some

    degree of interdependence between them is imposed, to encourage solutions that are

    “regular,” or smooth, in some sense. This is generally accomplished by defining a

    quantitative measure of departure from smoothness, and “weakly” minimizing it while

    simultaneously achieving the primary goal (typically color constancy). By varying the

    type and the strength of the regularization, one can usually balance the competing

    requirements for accurate details and overall smoothness.

    However, a pixel-level lattice alone is obviously discrete, specifying disparity values

    only at its grid points, while the underlying surface has a value at every point. In

    order to determine the disparity at subpixel positions, one must therefore interpolate

    between pixel positions. Moreover, the method of interpolation should be smooth

    enough not to interfere with the chosen form of regularization; for example, one would

    not use nearest-neighbor interpolation while minimizing quadratic variation. On the

    other hand, smoothly interpolating between pixels amounts to fitting a spline surface

    to the grid of data points, with one “control point” per pixel, and with regularization

    acting upon the control points.

    We generalize this idea of a regularized spline by allowing the spacing of the


    control point grid to vary; for example, there could be one spline control point at

    every n-th pixel in each direction. This model subsumes both global parametric

    models and pixel-level lattices: the former correspond to splines where one patch

    covers the entire image, and the latter correspond to splines where there is a control

    point at every pixel.

    3.2 Surfaces as 2D Splines

    We model the disparity map of each surface as a bicubic B-spline. This gives us

    the flexibility to represent a wide range of slanted or curved surfaces with subpixel

    disparity precision, while ensuring that disparity values and gradients vary smoothly

    over the surface. In addition, splines define analytic expressions for disparity values

    and gradients everywhere, and in particular at subpixel positions, as required by our

    abstract mathematical model.

    The control points of the bicubic B-spline are placed on a regular grid with fixed

    image coordinates (but variable disparity value). The resulting spline surface can be

    thought of as a linear combination of shifted basis functions, with shifts constrained

    to the grid.

    Mathematically, we restrict each d[k] to take the form of a bicubic spline with

    control points on a fairly coarse, uniform rectangular grid:

        d[k](x, y, t) = ∑_{i,j} D[k][i, j, t] · b(x − in, y − jn)            (3.1)

    where b is the bicubic basis function, D[k] is the lattice of control points, and n
    is the spacing thereof.

    In general, the spacing of the grid of spline control points should be fine enough

    so that surface shape details can be recovered, but not much finer than that so that

    computational costs are not inflated unnecessarily. In our experiments, for each view

    (left and right) of each hypothesized surface, we use a fixed, 5×5 grid of control points,

    giving a 2× 2 grid of spline patches that divide the image into equal quadrants.
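    A direct, unoptimized evaluation of Equation (3.1) might read as follows; the
    cubic B-spline basis is the standard uniform one, and the rescaling of its
    argument by the grid spacing n is our assumption about the intended
    parameterization:

        import numpy as np

        def b1(u):
            """Standard uniform cubic B-spline basis (support [-2, 2])."""
            a = abs(u)
            if a < 1.0:
                return 2.0 / 3.0 - a * a + a ** 3 / 2.0
            if a < 2.0:
                return (2.0 - a) ** 3 / 6.0
            return 0.0

        def spline_disparity(D, n, x, y):
            """d(x, y) = sum_{i,j} D[i, j] * b((x - i*n)/n) * b((y - j*n)/n)
            for one view of one surface; D is the control-point lattice."""
            return sum(D[i, j] * b1((x - i * n) / n) * b1((y - j * n) / n)
                       for i in range(D.shape[0])
                       for j in range(D.shape[1]))

        D = np.zeros((5, 5))                   # a fixed 5x5 control grid
        D[2, 2] = 1.0
        print(spline_disparity(D, n=64.0, x=128.0, y=128.0))   # = (2/3)^2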


    3.3 Surface Non-triviality

    This energy term, often called the “data term” in other literature, expresses

    the basic assumption of color constancy, penalizing any deviation therefrom. There

    are many possible ways to quantify deviation from color constancy; commonly used

    measures include the absolute difference, squared difference, truncated quadratic, and

    other robust norms [14, 34, 39]. For a measure that is both simple and differentiable, we use a scaled

    sum of squared differences:

        E_match_I = ∑_p  { g( I(m[k](p)) − I(p); A(p) )   if f(p) = k with k > 0,
                         { 0                               otherwise,            (3.2)

    where g(v; A) = v^T · A · v, and where A(p) is a space-variant measure of certainty
    defined in Appendix A. Qualitatively, A(p) normalizes for the local contrast of I
    around p.

    Note that, ideally, E_match_I would be defined as an integral over all p ∈ I.

    However, for computational convenience, we approximate the integral with a finite

    sum over discrete pixel positions, for this and other energy terms. This is a reasonable

    approximation if the summand is spatially smooth.
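    A discrete-sum sketch of Equation (3.2) follows, with the certainty A(p)
    simplified to a scalar weight per pixel (our simplification; the space-variant
    matrix form is defined in Appendix A) and with bilinear sampling standing in
    for the true image interpolant:

        import numpy as np
        from scipy.ndimage import map_coordinates

        def e_match_I(I_src, I_dst, d_k, f, k, A=None):
            """Scaled SSD between pixels assigned to surface k in one view
            and their matches m[k](p) in the other; others contribute 0."""
            ys, xs = np.nonzero(f == k)          # pixels with f(p) = k
            xm = xs + d_k[ys, xs]                # matched (subpixel) x-coords
            matched = map_coordinates(I_dst, [ys.astype(float), xm],
                                      order=1, mode='nearest')
            diff = matched - I_src[ys, xs]
            w = np.ones_like(diff) if A is None else A[ys, xs]
            return float(np.sum(w * diff ** 2))  # g(v; A) with scalar A

    For m-channel images, g(v; A) = v^T · A · v sums a weighted quadratic over the
    channels; the scalar case above corresponds to A(p) being a multiple of the
    identity.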

    3.4 Surface Smoothness

    Since we expect smooth surfaces to be more likely to occur, we would like to

    quantify and penalize any deviation from smoothness. Although our spline model

    already ensures some degree of surface smoothness, this inherent smoothness is limited

    to a spatial scale not much larger than that of the spline control point grid. On the

    other hand, we would like the option to encourage additional smoothness on a more

    global scale; hence we impose an additional energy term.

    We take the class of perfectly smooth surfaces to be the set of planar surfaces

    (including both fronto-parallel and slanted planes). The usual measures of deviation


    from planarity are the squared Laplacian:

        E = (u_xx + u_yy)^2

    and the quadratic variation, or thin plate spline bending energy [16, 33, 77]:

        E = u_xx^2 + 2·u_xy^2 + u_yy^2

    However, these measures have the disadvantage of using second derivatives, which

    can be susceptible to noise, and which tend to place too much emphasis on high-

    frequency, local deviations, relative to low-frequency, global deviations. Instead, in

    addition to restricting d[k] to take the form of a spline, we add an energy term which,

    loosely speaking, is proportional to the global “variance” of the surface slope:

        E_smooth_d[k] = λ_smooth_d · ∑_p ‖ ∇d[k](p) − mean(∇d[k]) ‖^2

    where the summation and the mean are both taken over all discrete pixel positions p,

    independent of the segmentation f . This energy term attempts to quantify deviations

    from global planarity.

    In our experiments, this energy term is given a very small weight, and mainly

    serves to accelerate the convergence of numerical optimization by shrinking the

    nullspace of the total energy function. This term does not prevent surfaces from

    being non-planar.
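    On a discrete pixel grid, this "variance of slope" term reduces to a few lines
    (a sketch only: the dissertation's d[k] has an analytic spline gradient, for
    which finite differences stand in here):

        import numpy as np

        def e_smooth_d(d_k, lam):
            """lam * sum_p || grad d(p) - mean(grad d) ||^2 over all pixels."""
            gy, gx = np.gradient(d_k)            # derivatives along y, then x
            return lam * float(np.sum((gx - gx.mean()) ** 2)
                               + np.sum((gy - gy.mean()) ** 2))

    As intended, the term vanishes exactly when d is planar, i.e., when its
    gradient is constant.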

    3.5 Surface Consistency

    For perfect consistency, a surface should have left and right views that coincide

    exactly with one another, as specified in Equation (2.1). In some prior work, this

    constraint has been enforced directly through specified update equations, without

    an explicit energy term being given [51]. However, with left and right views simul-

    taneously constrained each to have the form of Equation (3.1), exact coincidence is

    generally no longer possible. That is, a surface shape that conforms to Equation (3.1)


    in one view will no longer necessarily conform to Equation (3.1) when warped to the

    other view. Therefore, to allow but discourage any non-coincidence, we propose the

    energy term

        E_match_d[k] = λ_match_d · ∑_p ( m[k](m[k](p)) − p )^2

    or equivalently,

        E_match_d[k] = λ_match_d · ∑_p ( d[k](p) + d[k](m[k](p)) )^2

    which, intuitively, measures the distance between the surfaces defined by the left

    and right views. Again, the summation is taken over all discrete pixel positions p,

    independent of the segmentation f .
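    In discrete form, and showing only the left-to-right half of the sum (a sketch
    under the same bilinear-sampling simplification as before):

        import numpy as np
        from scipy.ndimage import map_coordinates

        def e_match_d(d_left, d_right, lam):
            """lam * sum_p (d(p) + d(m(p)))^2 over left-view pixels p; the
            full term adds the symmetric right-to-left sum as well."""
            H, W = d_left.shape
            ys, xs = np.mgrid[0:H, 0:W].astype(float)
            xm = xs + d_left                     # m(p): matched x, right view
            d_back = map_coordinates(d_right, [ys.ravel(), xm.ravel()],
                                     order=1, mode='nearest').reshape(H, W)
            return lam * float(np.sum((d_left + d_back) ** 2))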

    3.6 Surface Optimization

    Given a particular k, this chapter’s subproblem is to minimize (or reduce) total

    energy by varying d[k], while holding f and the remaining d[j] constant. Total energy

    is a sum of six terms; in this chapter, three of them were shown to depend smoothly

    on d[k]. In Chapter 4, the two terms E_unassigned and E_smooth_f are shown to
    depend only on f, and thus can be considered constant for the present subproblem.
    In Chapter 5, the remaining term E_match_f is shown to depend smoothly on d[k].

    Therefore, the total energy as a function of d[k] is differentiable, and can be minimized

    with standard gradient-based numerical methods.

    For convenience, we use Matlab’s optimization toolbox. The specific algorithm

    chosen is a trust region method with a 2D quadratic subproblem. This is a greedy

    descent method; at each step, it minimizes a quadratic model within a 2D trust

    region spanned by the gradient and the Newton direction. Experimentally, this algo-

    rithm exhibited more reliable convergence than the quasi-Newton methods with line


    searches, and although it requires the calculation of the Hessian, in our implemen-

    tation, that expense is relatively small compared to the total computational require-

    ments of solving the stereo problem.
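    For readers without Matlab, the analogous call shape in SciPy (a stand-in, not
    the routine actually used) is a trust-region Newton method that likewise
    consumes the gradient and Hessian; here it is exercised on SciPy's built-in
    Rosenbrock test function in place of the stereo energy:

        import numpy as np
        from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

        # In the dissertation's setting, the variables would be the spline
        # control points D[k] and the objective the total energy.
        x0 = np.zeros(5)
        res = minimize(rosen, x0, jac=rosen_der, hess=rosen_hess,
                       method='trust-ncg')      # trust-region Newton descent
        print(res.x)                            # converges to all ones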

    In this chapter, we have shown how to minimize total energy by varying each d[k]

    individually. As long as f remains fixed, there is nothing to be gained by varying all

    d[k] simultaneously. For each k > 0, we call minimizing over d[k] a surface-fitting step,

    and consider it to be a building block towards a complete algorithm for minimizing

    total energy.

    Chapter 4

    Segmentation

    In this chapter, we consider a restricted subproblem. Rather than simultaneously

    estimating both the 3D shape (given by d[k]) and the 2D support (given by f) of

    each surface, we consider the problem of estimating 2D support, given 3D shape.

    That is, supposing that the disparity maps d[k] are known for all surfaces, how can

    the segmentation f be found? Using this context, we explain our model of segmen-

    tation; formulate the three energy terms that encourage segmentation non-triviality,

    smoothness, and consistency; and discuss the minimization of these energy terms.

    4.1 Segmentation by Graph Cuts

    Boykov, Veksler, and Zabih [21] showed that certain labeling problems can be

    formulated as energy minimization problems and solved efficiently by repeatedly using

    maximum flow techniques to find minimum-cost cuts of associated network graphs.

    Generally speaking, such problems seek to assign one of a given set of labels to each

    of a given set of items, simultaneously optimizing not only items’ individual pref-

    erences for each label, but also interactions between pairs of items, where interacting

    items additionally prefer to be assigned similar labels, for some measure of similarity.

    Formally, let L be a finite set of labels, P be a finite set of items, and N ⊆ P × P
    be the set of interacting pairs of items. The methods of [21] find a labeling f that
    assigns exactly one label f_p ∈ L to each item p ∈ P, such that



    an energy function of the form

        E(f) = ∑_{(p,q)∈N} V_{p,q}(f_p, f_q) + ∑_{p∈P} D_p(f_p)              (4.1)

    is minimized. Individual energies D_p should be nonnegative but can otherwise be
    arbitrary; interaction energies V_{p,q} should be either semi-metric or metric, where V is
    a semi-metric if it is symmetric and positive precisely for distinct labels, and where V
    is a metric if it additionally satisfies the triangle inequality. Kolmogorov and Zabih
    [48] generalize these results, deriving necessary and sufficient conditions for an
    energy function to be minimizable via graph cuts.
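    Evaluating Equation (4.1) for a candidate labeling is straightforward, as the
    following sketch shows (illustrative Python, using a Potts-style interaction,
    which is a metric and therefore admissible):

        def labeling_energy(f, D, neighbors, V):
            """E(f) = sum over pairs (p,q) in N of V(f_p, f_q)
                    + sum over items p of D_p(f_p)         (Equation 4.1)."""
            e = sum(D[p][f[p]] for p in range(len(f)))
            e += sum(V(f[p], f[q]) for (p, q) in neighbors)
            return e

        # Tiny example: three items, labels {0, 1}, chain neighborhood.
        D = [[0.0, 2.0], [2.0, 0.0], [0.0, 2.0]]       # D_p(label) >= 0
        potts = lambda a, b: 0.0 if a == b else 1.0    # Potts metric
        print(labeling_energy([0, 0, 0], D, [(0, 1), (1, 2)], potts))  # 2.0

    The hard part, of course, is the minimization over exponentially many
    labelings, which is what the graph-cut constructions of [21] address.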