
Multimodal Stereo from

Thermal Infrared and Visible Spectrum

A dissertation submitted by José Fernando Barrera Campo at Universitat Autònoma de Barcelona in partial fulfilment of the requirements for the degree of Doctor of Philosophy.

Bellaterra, December 5, 2012


Director
Felipe Lumbreras
Dept. Ciències de la Computació & Centre de Visió per Computador
Universitat Autònoma de Barcelona

Co-director
Angel Sappa
Computer Vision Center

Thesis Committee
Dr. Joaquim Salvi Mas, Dept. Arquitectura i Tecnologia de Computadors, Universitat de Girona
Dr. Antonio López Peña, Dept. Ciències de la Computació & Centre de Visió per Computador, Universitat Autònoma de Barcelona
Dr. Filiberto Pla Bañón, Dept. Llenguatges i Sistemes Informàtics, Universitat Jaume I
Dr. Daniel Ponsa Mussarra, Dept. Ciències de la Computació & Centre de Visió per Computador, Universitat Autònoma de Barcelona
Dr. Carme Julià Ferré, Dept. Enginyeria Informàtica i Matemàtiques, Universitat Rovira i Virgili

This document was typeset by the author using LaTeX 2ε.

The research described in this book was carried out at the Computer Vision Center, Universitat Autònoma de Barcelona.

Copyright © MMX by José Fernando Barrera Campo. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage and retrieval system, without permission in writing from the author.

ISBN 978-84-940530-2-3

Printed by Ediciones Graficas Rey, S.L.


Dedicated to my mother Patricia E. Campo and my sister María A. Barrera


Multimodal Stereo from Thermal Infrared and Visible Spectrum

This thesis has been partially supported by the projects TIN2011-29494-C03-02 and TIN2011-25606, and the research programme Consolider-Ingenio 2010: MIPRCV (CSD2007-00018).

Bellaterra, December 5, 2012


Acknowledgement

The work presented in this thesis is the result of research carried out during 2008-2012 at the Computer Vision Center (CVC) and the Universitat Autònoma de Barcelona (UAB). These years have been very rich in both my professional and private life, and I would like to express my gratitude to all who have directly or indirectly supported this project.

First of all I would like to thank my advisors, Dr. Felipe Lumbreras and Dr. Angel Sappa, whose guidance, patience and support made this thesis possible. Their combined advice brought focus to my work and shaped this research. They introduced me to the world of scientific research through a novel problem, which not only allowed me to learn a set of computer vision techniques, but also taught me how to face a real, challenging problem.

For financial support I would like to thank the Computer Vision Center, the Department of Informatics (UAB), and the projects TIN2011-29494-C03-02 and TIN2011-25606, as well as the research programme Consolider-Ingenio 2010: MIPRCV (CSD2007-00018).

I would also like to express my gratitude to the other co-authors I have had the privilege of working with: Dr. Cristhian Aguilera and Dr. Ricardo Toledo. Likewise, I would like to thank the ADAS group (CVC): Dr. Antonio López, Dr. Joan Serrat, Dr. Felipe Lumbreras, Dr. Daniel Ponsa, Dr. Angel Sappa, Dr. David Gerónimo, Dr. Jaume Amores, Dr. Ricardo Toledo, Dr. Aura Hernández, Dr. Katerine Díaz, Dr. Juan Peralta, Dr. Carme Julià, Rao Muhammad, Dr. José Álvarez, Dr. Ferran Diego, David Vázquez, Javier Marín, Naveen Onkarappa, Muhammad Rouhani, José Rubio, Diego Cheda, Alejandro González, Germán Ros, Mónica Piñol, Yainuvis Socarrás and Jiaolong Xu, who created a nice environment of scientific discussion and debate that helped me to finish this dissertation successfully.

During these years, I have gained many friends among the great people of the CVC. Especially, I will miss the interesting talks during the coffee and lunch breaks with Diego Cheda, David Rojas, Naveen Onkarappa, Jorge Bernal, Murad Al Haj, Ariel Amato, Shida Beigpour, Jon Almazán, Noha Elfiky, Dr. Debora Gil, Pep Gonfaus, Wenjuan Gong, Dr. Jordi Gonzàlez, Mario Rojas, Marco Pedersoli, Fahad Khan, Patricia Márquez, Carles Sánchez, Dr. Andrew Bagdanov, Antonio Clavelli, Bhaskar Chakraborty, Albert Gordo, Jaime Moreno, Naila Murray, Muhammad Rouhani, Farshad Nourbakhsh and Marc Serra. I would like to thank Diego Cheda and David Rojas for their great support and friendship in hard times. I would also like to thank the administrative staff of the CVC: Montse Cullere, Raquel Gómez, Ana M. García, Claire Perez-Mangado, Gisele Kohatsu, Mireia Martí, Raquel Rionegro, Eva Caballero, Carmen Merino, and Marc Paz for their friendship and professional advice.

Last but not least, I would like to express all my gratitude to my family: my mother Patricia Campo, my sister María A. Barrera, and my father Fernando Barrera. Thanks to all of you for your support and love.


Abstract

Recent advances in thermal infrared imaging (LWIR) have allowed its use in applications beyond the military domain. Nowadays, this new family of sensors is included in different technical and scientific applications. They offer features that facilitate tasks such as the detection of pedestrians, hot spots, and temperature differences, which can significantly improve the performance of systems where people are expected to play the principal role; for instance, video surveillance, monitoring, and pedestrian detection applications.

This dissertation poses the following question: could a pair of sensors measuring different bands of the electromagnetic spectrum, such as the visible and thermal infrared bands, be used to extract depth information? Although it is a complex question, we show that a system with these characteristics is possible, and we discuss its advantages, drawbacks, and potential opportunities.

The matching and fusion of data coming from different sensors, such as the emissions registered in the visible and infrared bands, represents a special challenge, because it has been shown that these signals are weakly correlated. Therefore, many traditional image processing and computer vision techniques are not directly applicable, requiring adjustments for their correct performance in each modality.

In this research, an experimental study comparing different cost functions and matching approaches is performed in order to build a multimodal stereo vision system. Furthermore, the common problems in infrared/visible stereo, especially in outdoor scenes, are identified. Our framework summarizes the architecture of a generic stereo algorithm at different levels (computational, functional, and structural), which can be extended toward high-level (semantic) fusion and higher-order priors.

The proposed framework is intended to explore novel multimodal stereo matching approaches, going from sparse to dense representations (both disparity and depth maps). Moreover, context information is added in the form of priors and assumptions. Finally, this dissertation shows a promising way toward the integration of multiple sensors for recovering three-dimensional information.


Resumen

Recent advances in thermal imaging (LWIR) have allowed its use in applications beyond the military domain. Nowadays, this new family of sensors is being included in diverse technical and scientific applications. Sensors of this type facilitate tasks such as pedestrian detection, hot-spot detection, and the detection of temperature changes, among others; characteristics that can significantly improve the performance of a system, especially when there is interaction with humans. For example, video surveillance, pedestrian detection, and posture analysis applications.

This thesis poses, among others, the following research question: could a pair of sensors operating in different bands of the electromagnetic spectrum, such as the visible and thermal infrared bands, provide depth information? Although it is a complex question, we demonstrate that a system with these characteristics is possible, and we also discuss its possible advantages, disadvantages, and potential opportunities.

The fusion and matching of data coming from different sensors, such as the emissions registered in the visible and infrared bands, represents an attractive challenge, since it has been shown that those signals are weakly correlated. Therefore, many traditional image processing and computer vision techniques are inadequate, requiring adjustments for their correct operation.

In this research, an experimental study was carried out comparing different multimodal cost functions and matching techniques in order to build a multimodal stereo system. The problem shared between visible/visible and infrared/visible stereo was also identified, particularly in outdoor environments. Among the contributions of this thesis is a methodology that is generic at different levels (computational, functional, and structural), allowing its extension to more complex schemes such as high-level (semantic) fusion and higher-order terms (assumptions).

The proposed approach is intended to explore new stereo matching methods, going from a sparse solution to a dense one (in both disparity and depth maps). In addition, context information has been included in the form of assumptions and constraints. Finally, this dissertation shows a promising path toward the integration of multiple sensors.


Contents

Abstract

Resumen

1 Introduction
  1.1 Motivation
  1.2 Objectives and Methodology
  1.3 Outline of the thesis

2 Literature Review
  2.1 Image Fusion
  2.2 Image Fusion by Registering
    2.2.1 Infinite Homographic Registration
    2.2.2 Global Image Registration
    2.2.3 Regions-of-Interest Registration
    2.2.4 Stereo Geometric Registration
  2.3 Image Fusion by Matching
    2.3.1 Pre-Processing Stage
    2.3.2 Cost Calculation
    2.3.3 Optimization
    2.3.4 Post-Processing

3 Multimodal Imagery
  3.1 Infrared Sensors: History, Theory and Evolution
  3.2 Multimodal Stereo Head
  3.3 Camera Calibration
  3.4 Image Rectification
    3.4.1 Thermal Infrared and Visible Camera Model
    3.4.2 Rectification of Camera Matrix
  3.5 Depth Computation from Disparity Maps
  3.6 Multimodal Evaluation Datasets
    3.6.1 OSU Color-Thermal Database
    3.6.2 CVC-01 Multimodal Urban Dataset: Generic Surfaces
    3.6.3 CVC-02 Multimodal Urban Dataset: Planar Surfaces

4 Similarity Window-based Matching Cost Function
  4.1 Introduction
  4.2 Matching Cost Volume
  4.3 Thermal Infrared and Visible Matching Cost Functions
    4.3.1 Mutual Information
    4.3.2 Gradient Information
    4.3.3 Mutual and Gradient Information
    4.3.4 Scale-Space Context
  4.4 Experimental Results
  4.5 Conclusions

5 Local Information Based Approach
  5.1 Introduction
  5.2 Sparse Stereo
    5.2.1 Matching Cost Volume Computation
    5.2.2 Disparity Selection Based on Confidence Measures
  5.3 Evaluation Methodology
  5.4 Experimental Results
  5.5 Conclusions

6 Piecewise Planar Based Approach
  6.1 Introduction
  6.2 Background
  6.3 Pixelwise Planar Stereo
    6.3.1 Matching Cost Volume Computation
    6.3.2 Plane-Based Hypotheses Generation
    6.3.3 Pixelwise Plane Labeling
  6.4 Experimental Results
  6.5 Conclusions

7 Context and Piecewise Planar Based Approach
  7.1 Introduction
  7.2 Background
  7.3 Superpixelwise Planar Stereo
    7.3.1 Matching Cost Volume Computation
    7.3.2 Plane-Based Hypotheses Generation
    7.3.3 Superpixelwise Plane Labeling
  7.4 Experimental Results
  7.5 Conclusions

8 Conclusions and Future Work
  8.1 Conclusions
  8.2 Future Work

Bibliography

Publications


List of Tables

3.1 Visible and infrared spectral bands
3.2 OSU database specifications
3.3 Multimodal Urban Dataset: Generic Surfaces
3.4 Context-based multimodal evaluation dataset

4.1 Right matches ordered by maximums using a stack
4.2 Right matches ordered by maximums using a pyramid
4.3 Improvement achieved using interscale propagation

5.1 ROC evaluations
5.2 Examples of sparse depth maps from outdoor scenarios
5.3 Examples of sparse depth maps from outdoor scenarios

6.1 Examples of the evaluation data set
6.2 Global Eabs and Erel of case studies

7.1 Scale space representation
7.2 Context-based multimodal evaluation dataset
7.3 Global Eabs (pixel) and Erel of the case studies


List of Figures

2.1 Multimodal data fusion schema
2.2 Modular stereo vision framework

3.1 IR working principle
3.2 Detector materials and their corresponding spectral responses
3.3 Multimodal stereo heads
3.4 Evaluation set-up
3.5 Example of rectified images
3.6 Examples of calibration points
3.7 LWIR/VS camera model
3.8 Example of rectified images
3.9 Depth computation from disparity maps
3.10 OSU color-thermal database
3.11 CVC-01 Urban Dataset

4.1 Multimodal Matching Cost Volume
4.2 Expected multimodal correspondences
4.3 Comparison between MI and SAD cost functions
4.4 Multimodal gradient field
4.5 Multimodal pyramidal representations
4.6 Propagation comparison
4.7 v-disparity representation
4.8 RMS error of C under view changes

5.1 Proposed sparse stereo framework
5.2 Example of cost curves
5.3 Sub-pixel disparity interpolation
5.4 Evaluation regions

6.1 Illustration of the algorithm's stages
6.2 Inputs and output images of first stage
6.3 Illustration of the split and merge segmentation
6.4 Inter-plane distance
6.5 Planar hypotheses simplification
6.6 Graph cuts outcomes
6.7 Experimental results scene 1-4
6.8 Experimental results scene 5-8
6.9 Average accuracy
6.10 Average Erel

7.1 Context-based multimodal stereo algorithm
7.2 Multimodal matching cost volume
7.3 Split and merge segmentation example
7.4 Wrong region assignment
7.5 Plane hypotheses graph
7.6 Superpixel neighborhood N
7.7 Superpixelwise plane labeling results
7.8 Case study 1
7.9 Case study 2
7.10 Case study 3
7.11 Case study 4
7.12 Case study 5
7.13 Case study 6
7.14 Case study 7
7.15 Case study 8
7.16 Case study 9
7.17 Average accuracy
7.18 Average Erel


Chapter 1

Introduction

Stereo matching is a classical problem in computer vision that has been largely studied in the last decades [25, 73, 136], especially for binocular systems that work at the same spectral band, in general the visible spectrum. Nowadays, a new family of sensors, operating at other spectral bands, is being used in different technical and scientific domains. This allows the simultaneous use of sensors that register information from separate spectral bands. The multispectral concept is not new; indeed, it goes back to the 1950s and 1960s, when practical methods for combining data coming from various sensors were already being sought. The main target of those studies was to provide a unified representation that could be used for a better understanding of a given scene. Hence, terms such as fusion, combination, and integration, and other related ones that express more or less the same concept, have emerged in the literature. These early contributions were always related to military applications, mainly due to the high manufacturing costs as well as the secrecy involved with this technology. However, recent advances in the field of multi-wavelength detection have allowed this technology to become more popular and commercially available.

This thesis is focused on studying a particular combination of two image sensors. On the one hand, standard cameras that respond to wavelengths from about 390 to 750 nm (Visible Spectrum: VS); on the other hand, thermal infrared cameras that detect radiation in the range of 8 to 14 µm (Long-Wavelength InfraRed band: LWIR). The proposed combination seeks a richer representation of the scene; in other words, the goal is to obtain a 3D representation just by using 2D images from different spectral bands. Reaching such a target is far from being a trivial problem. Note that such a system fuses radiation emissions coming from different spectral bands: visual information, which is related to the appearance of the objects, and thermal infrared information, which refers to the temperature of the objects in the scene.
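For reference, once correspondences have been established between a rectified image pair, metric depth follows from the standard pinhole stereo relation Z = fB/d (revisited in Section 3.5). A minimal sketch, where the focal length and baseline values are illustrative only, not those of the actual rig:

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Recover depth Z = f * B / d for a rectified stereo pair.

    disparity_px: horizontal pixel offset between the matched points.
    focal_px:     focal length expressed in pixels.
    baseline_m:   distance between the two camera centers, in meters.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return focal_px * baseline_m / disparity_px


# Hypothetical values: 500 px focal length, 12 cm baseline, 10 px disparity.
z = depth_from_disparity(10.0, 500.0, 0.12)  # 6.0 meters
```

Note the inverse relation: halving the disparity doubles the estimated depth, which is why distant points are the hardest to localize accurately.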

These multimodal systems have grown to become a significant tool for dealing with a wide range of problems, for instance remote sensing, pedestrian detection, surveillance, and medical imaging, among others. However, their capabilities and potential for recovering 3D information, although a novel challenge, are still unclear.


In the current thesis, a multimodal stereo matching algorithm for extracting depth information from thermal infrared and visible images is investigated. This problem is addressed by formulating two questions: (i) is it possible to obtain 3D information from visible and thermal infrared images? and (ii) is it possible to obtain dense 3D information from visible and thermal infrared images? The answers to these questions give rise to the current dissertation.

At first glance, it may seem that a unified representation obtained from this multispectral system (LWIR/VS) would lead to a representation with higher information content; in other words, not only depth but also appearance as well as temperature could be provided. However, before obtaining that desired representation, the correspondence problem between thermal infrared and visible images must be overcome. These images are strongly different, which makes it a challenging task. Furthermore, few approaches have been proposed in the literature, and it is difficult to identify the best options for developing a robust matching algorithm able to deal with infrared and visible variations.

Finding correspondences in these images poses a great challenge due to their weak correlation; in fact, they are close to being uncorrelated [120, 139]. Although these conclusions are somewhat discouraging, it is possible to exploit shape information as well as local underlying relationships between VS and LWIR measurements. This helps to find enough similarities for recovering a sketch of the structure of the scene; in this way, the first question mentioned above is tackled. Then, from these sparse correspondences collected in a low-level way, together with a prior knowledge base and a set of scene constraints, the second question is addressed.

Data fusion systems have evolved from competitive to cooperative schemes, passing through hybrid systems where the information is treated in a complementary form. Most practical applications in the state of the art involve data fusion in a competitive or complementary manner. Although a cooperative scheme is the most interesting one regarding the generation of information, it is also the most difficult, because the resulting information content is intrinsically new and not contained in the individual inputs. The current dissertation presents a multimodal stereo vision system working in a cooperative fashion, where not only are the two inputs themselves formed from complementary sensors (VS and LWIR), but the fusion itself is cooperative. Special emphasis is given to the cost function used for matching, since it is a critical component of these systems.
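To make the notion of a multimodal-friendly cost function concrete: mutual information, one of the measures studied later in Chapter 4, can be estimated from the joint histogram of two co-located image windows. A minimal NumPy sketch of the plug-in histogram estimator, where the bin count is an illustrative choice rather than the thesis's setting:

```python
import numpy as np

def mutual_information(patch_a, patch_b, bins=32):
    """Estimate MI between two same-size image patches via a joint histogram.

    MI measures statistical dependence rather than linear correlation, which
    is why it remains usable even when LWIR and VS intensities are weakly
    correlated.
    """
    hist, _, _ = np.histogram2d(patch_a.ravel(), patch_b.ravel(), bins=bins)
    pxy = hist / hist.sum()                      # joint distribution p(a, b)
    px = pxy.sum(axis=1, keepdims=True)          # marginal p(a)
    py = pxy.sum(axis=0, keepdims=True)          # marginal p(b)
    nz = pxy > 0                                 # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```

In a stereo setting, such a score would be evaluated for each candidate disparity, with the best-scoring offset taken as the match; the plug-in estimator is biased for small windows, which is one reason window size matters in practice.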

1.1 Motivation

Data fusion is a biologically inspired concept that has helped living creatures evolve their capability to use multiple senses and improve their ability to survive. Senses such as sight, hearing, touch, smell, and taste are naturally combined by living creatures to achieve an understanding of their surrounding environment and to avoid potential hazards. Even though a computational system is still far from emulating those capacities, recent advances in data fusion exploit the statistical potential introduced by observing the same phenomena via multiple types of sensors.


Thermal infrared sensors have been used in thermography, medical imaging, night vision, surveillance, and non-destructive testing, just to name a few domains. However, nowadays they are becoming a common source of information for different computer vision applications. For instance, in video surveillance they facilitate people detection [95, 103, 66] and increase system availability during the nighttime. In driving assistance, similarly to the surveillance task, thermal information helps to detect pedestrians as well as to analyze vehicle occupants' posture [60, 157]. Finally, new techniques for thermal performance analysis of buildings, industrial facilities, and materials [79, 150] require the use of VS images just for visualizing the information. It is clear that the coexistence of LWIR and VS sensors is increasing, but in most cases the multispectral images are independently analyzed and used. In the current dissertation, we propose to go a step further by fusing the information from these two cameras (LWIR/VS) in order to obtain 3D data.

As mentioned above, the estimation of correspondences between LWIR and VS images is a novel problem whose solution would have a favorable impact on other branches of computer vision. For instance, previous work on VS images, such as 3D reconstruction [2, 158], object recognition [47, 112], and scene classification [101, 124], assumes that a set of points has been extracted from the images and matched. A LWIR/VS matching algorithm would allow the development of similar applications, but in a multimodal context. This strategy has been followed by Süsstrunk et al. [26, 49, 133], but with visible and Near-InfraRed (NIR) images, in tasks such as illuminant estimation and detection, shadow removal, and scene category recognition. Note that the NIR band lies just beyond the red channel of a VS camera. These studies have shown that the use of an additional spectral band increases the information content and hence the performance of a VS-based vision system.

1.2 Objectives and Methodology

Focused on the questions stated before, three big and independent general objectives have been defined in this dissertation. Although independent, these objectives were formulated in a sequential way and were reached at different stages of the development of the current thesis.

1. Multimodal stereo rig setup: this objective includes, on the one hand, several technological challenges such as (i) the correct assembly of the multispectral stereo rig in a diverging camera focal axis orientation; (ii) the construction of a particular checkerboard pattern that is visible in the two spectral bands; and (iii) the camera synchronization problem, in order to acquire stereo images at the same instant. On the other hand, once these technological problems are addressed, the challenges related to classical stereo systems also need to be tackled here: (i) stereo camera calibration and (ii) image rectification. Finally, once all the modules defining the multimodal stereo system are ready, another point tackled under the framework of this first objective is the construction of representative databases, together with their corresponding ground truth data. These databases will be used throughout this thesis to evaluate the proposed algorithms.

All the points presented above are addressed in Chapter 3. It starts by introducing the hardware used to build the proposed multimodal stereo head, detailing the procedure used to obtain an initial setup that facilitates further image processing. Then, the calibration and rectification approaches proposed in the current dissertation are presented. Note that, in comparison with classical visible-visible stereo rigs, in the current thesis the particular problem of working with images with different characteristics needs to be dealt with. Chapter 3 also presents the data sets acquired during all these years, which have been used in the different stages of the thesis. Furthermore, when possible, images in the data sets are enriched with the corresponding ground truth data. Additionally, a particular evaluation approach, useful to quantify the accuracy of the data sets, is proposed.

2. Matching cost functions: although this thesis involves a large number of challenges, from hardware to software, the correct definition of a matching cost function is the most relevant one. Actually, having a reliable matching cost function will be the milestone for further research. In order to reach this goal, the research will start with the assessment of classical cost functions used in visible-visible stereo vision as well as in related multispectral problems. This initial study will unveil the real dimension of the multimodal correspondence problem (i.e., LWIR/VS) that needs to be faced throughout the whole thesis. Given that the first objective of this thesis will tackle the image rectification problem, the matching cost function proposed in this second stage will be based on that representation.

Chapter 4 studies the state of the art on block matching cost functions, both unimodal and multimodal. It is intended to evaluate the viability of current approaches to tackle the LWIR/VS matching problem. In this chapter, a novel block matching cost function for the multimodal problem is presented. It is based on the combined use of mutual and gradient information in a scale space context. Results from Chapter 4 are then exploited in Chapter 5, where a sparse 3D representation is obtained from a multimodal stereo rig. Chapter 5 formulates the multimodal stereo problem by using the matching cost function proposed in Chapter 4 and an iterative optimization step, which is based on a constrained winner-takes-all strategy. Finally, the adaptation of an evaluation methodology, initially proposed to quantify sparse visible/visible stereo representations, is proposed. This methodology is used to evaluate the sparse stereo results from Chapter 5.

3. Dense stereo representations: this final objective will rely on the results from the previous stages. In the stereo vision community, the study of novel dense stereo algorithms, most of them based on some kind of energy minimization, is the most active research topic. Motivated by this fact, and in order to push further the results obtained in previous stages (i.e., sparse stereo representations), the work at the current stage will be oriented towards the generation of dense multimodal representations.

In Chapter 6, prior knowledge of the scene is exploited to constrain the solution space. It assumes a piecewise planar representation that helps to fill in the missing data of the sparse solution obtained in Chapter 5. It formulates the stereo matching as an energy minimization problem using a Markov random field, where the data term uses the cost function proposed in Chapter 4 and the regularization term smooths the results using gradient information. This regularization term assumes a pixelwise planar prior. In Chapter 7, this regularization term, as well as the data term, are extended to a superpixel framework. Additionally, in Chapter 7 the energy function incorporates information from the context by using an Adaboost-based classifier to distinguish planar from non-planar regions.

1.3 Outline of the thesis

The remainder of this dissertation is organized as follows. Chapter 2 introduces work related to the concepts tackled in the multimodal stereo problem. In particular, the state of the art in problems from the image fusion domain and in stereo vision approaches is presented. It should be noticed that there is not much literature on the specific multimodal stereo problem tackled in this dissertation; hence, this second chapter focuses on those algorithms from the classical stereo vision community that can be extended to the multimodal domain.

Chapter 3 first presents the basic concepts needed to understand infrared sensors. Then, the multimodal stereo rig used as a test bed throughout this dissertation is presented. Calibration and rectification issues are also detailed in this chapter, since they require a particular setup due to the fact that the cameras capture images at different spectral bands. Finally, the data sets and corresponding ground truths generated to validate the different algorithms proposed in this thesis are given. These data sets have been released through our Web site to the multimodal community for further evaluations and comparisons.

Chapter 4 and Chapter 5 focus on the development of a cost function for the multimodal matching and stereo problems. The proposed cost function is based on the use of mutual and gradient information in a scale space representation. Chapter 4 presents details about its use in a template matching framework. Then, in Chapter 5, this cost function is used in the multimodal stereo domain tackled in this thesis. In this case, it is used to generate sparse disparity and depth representations.

The cost function mentioned above is later on used in Chapters 6 and 7, where a global minimization framework to extract dense information is implemented. This framework is based on the use of context and planar assumptions. In Chapter 6, a piecewise planar constraint is imposed to simplify both the scene geometry and the multimodal correspondence problem. Chapter 7 also focuses on obtaining dense representations based on a planar assumption; however, it proposes to use an Adaboost classifier to identify planar surfaces that will be used during the final optimization stage.

Finally, Chapter 8 summarizes the entire work, highlighting its principal contributions and describing future research lines.


Chapter 2

Literature Review

Stereo matching algorithms in the visible spectrum have been the subject of intense research for decades [25, 136, 149]. On the contrary, LWIR/VS stereo algorithms are still an emerging research field, and there is little literature on this topic. Instead of presenting a straightforward enumeration of the latest advances in multimodal stereo matching, the current chapter proposes a comprehensive classification of the related literature, yielding new insights into this problem.

In the context of this research, the term multisensor data fusion, or just data fusion, particularly refers to the set of techniques that enable combining information from two sources, thermal infrared and visible images, in order to achieve a unified representation. This definition is not limited to a system that fuses thermal infrared and visible information. Actually, this concept has been widely employed in diverse application areas such as industrial control systems, video and image processing, robotics, and sensor networks, to name a few, where multiple observations of the same phenomenon need to be integrated.

In spite of the widespread acceptance of multisensor systems in computer vision, there has been no consensus on essential aspects such as their usage, architecture, and classification. Usually, they are redefined as a function of the studied problem, making it hard to separate the advances reached in the field of data fusion from the treated problem. An example of this situation can be observed when two apparently different systems are compared: on one hand, an aircraft tracking system that combines observations made by a radar and an infrared imaging sensor [166]; on the other hand, a pedestrian detection system that recognizes persons through a lidar sensor and an infrared camera [60]. Although they perform different tasks, both face the same challenge: to fuse heterogeneous information.

A review of previous works allows us to identify two schemes of data fusion: by registering and by matching. This classification is used for the literature review presented in this chapter. In the first category, the images are registered, i.e., transformed into a unique coordinate system, and then fused at the raw-data level.


Figure 2.1: Multimodal image fusion schema: (a) by registering and (b) by matching.

Figure 2.1(a) shows a generic block diagram of multimodal fusion by image registration. This illustration depicts a data flow diagram where two multimodal images, ILWIR and IVS, are transformed into a unique representation. The P1 and P2 blocks are pre-processing steps, such as image enhancement (edges or contrast), radiometric and geometric corrections, and noise reduction, among others. T1 and T2 are n-dimensional transformations that register the input images. Once the images are aligned, their fusion is performed simply by overlapping them. This action can vary depending on the accuracy of the transformations or the data representation; however, the goal is the same. The data fusion is represented by the f block. In the subsequent blocks, thermal infrared and visible information is already available for any particular method operation (P3 block). On the other hand, a fusion schema based on matching is shown in Fig. 2.1(b). Even though it has fewer blocks than the former, it is more complex because the data are not explicitly associated by a particular transformation; thus, that transformation must be found during the algorithm's execution. Under this schema, P1 and P2 are pre-processing steps, f represents a general data matching operation, and P3 is the first block where fused multimodal information is available. Nowadays, little research has been carried out on multimodal image fusion by matching, since there is no unconstrained function for measuring the degree of dis/similarity between thermal infrared radiation and visible emissions.

In the following sections, the three main topics involved in this thesis are reviewed; each section covers a different aspect of the problem. Thus, the image fusion section (Sect. 2.1) presents a review of sensor fusion concepts, particularly its architectures. Then, the image fusion by registering section (Sect. 2.2) presents a survey and classification of approaches related to this research. Finally, the approaches of image fusion by matching are reviewed in Sect. 2.3. This section presents a survey of the methods on which the algorithms proposed in this dissertation are based.


2.1 Image Fusion

Image fusion, or data fusion, is a research field concerned with the search for practical methods of merging images from various sensors in order to provide a unified image that allows making inferences about a physical event, activity, or situation [65, 164]. Terms such as merging, combination, synergy, integration, and several others that express more or less the same concept are usual in the literature [64, 87, 168, 130]. Although the concept of data fusion is old, the introduction of new sensors into the market, for example LWIR cameras, makes it possible to research different sensor combinations. In recent years, data fusion techniques have been used in artificial intelligence, pattern recognition, computer vision, and other areas. Data fusion methods were primarily developed for military applications; hence, many of their principles were adapted to the requirements of those applications, for instance the data fusion lexicon presented in [168] and the process model for data fusion introduced in [87]. These publications unified the terminology related to data fusion, while laying the basis for a possible generic architecture.

Data fusion is a challenging task because there are a large number of issues to be addressed. They include not only issues related to the fusion systems themselves but also features of the environment in which they operate. Recent data fusion methodologies have identified the sources of those problems, at least from the perspective of a generic fusion framework [4, 145]: they may be generated by the phenomenon under observation, the diversity of sensors, or the fusion algorithm.

Liggins et al. [107] categorize the fusion process into three levels, where each level attempts to produce new data that are expected to be more informative than the initial inputs. Fusion processes are thus divided into: low-level (or raw data) fusion, which combines several sources of raw data; intermediate-level (or feature-level) fusion, which combines various features, such as lines, corners, edges, or positions, into a feature map; and high-level fusion, which involves decision methods such as voting, fuzzy logic, and statistical methods. Since several fusion methods combine different levels of the previous classification [60, 170, 105, 174], Sidek and Quadri [143] extend the three-level view to five fusion categories. Although this categorization solves ambiguities in the classification of the fusion paradigm, Liggins's model is still preferred.

Data fusion approaches can also be categorized according to the type of sensor configuration. Durrant-Whyte [42] defines three types of sensor configurations. A sensor configuration is called complementary if the sensors do not directly depend on each other but can be combined; an example of this configuration are surveillance applications built from disjunct data [103]. Actually, fusing complementary information is an easy task because it involves appending independent sensor data. A set of sensors is in a competitive (or redundant) configuration if they deliver independent measurements of the same scene. This configuration is unusual in computer vision, but often used in other research fields for fault tolerance. Finally, a cooperative sensor configuration is obtained when the information provided by two or more independent sensors is combined to derive information that would not be available from any single sensor. In contrast to complementary and competitive fusion, this scheme is the most difficult and the most sensitive to inaccuracies [121].

2.2 Image Fusion by Registering

By definition, image registration is the process of overlaying two or more images of the same scene taken at different times, from different viewpoints, and/or by different sensors [176]. Generally, one of the images is labeled as the reference, while the remaining ones are aligned to it. In multimodal approaches, each camera can be at a different position with respect to a world reference frame and have different intrinsic parameters. Therefore, the projections of an object onto the image planes cannot be assumed to correspond (i.e., to lie at the same position); corresponding objects may have different sizes, shapes, positions, and intensities. In order to combine the information of the multimodal images, several multimodal approaches require a registration of those objects.

Multimodal image registration approaches depend on factors such as camera placement, scene complexity, registration range, and density of registered objects in the scene. Nevertheless, the majority of registration methods consist of the following four steps [24]: (i) feature detection; (ii) feature matching; (iii) transformation model estimation; and (iv) image resampling and transformation. In this section, the literature related to thermal infrared and visible image fusion is classified by means of its transformation model [95]. Hartley and Zisserman [67] describe these transformations through the following equation:

p′ = Hp + δe′, (2.1)

where p and p′ are the projections of the same scene point onto the image planes; H is a homographic transformation that maps all corresponding points lying on the same plane (π); δ is the parallax relative to that plane; and e′ stands for the epipole of the second camera. From this equation, an infinite homographic transformation is obtained if p ≈ p′ and H = H∞; a global homographic transformation is obtained if p′ = Hp, ∀p ∈ π, and δ → 0 or T → 0, where T is the translation vector; a region-of-interest transformation is obtained when more than one plane π is assumed to exist, with a homographic transformation for each plane (i.e., ∃πi | p′i = Hipi). Finally, a stereo geometric transformation is obtained when p and p′ cannot be related through a homographic transformation; they are then related by a rotation and a translation (i.e., p′ = R−1(p − T)). The singularities of Eq. (2.1) give rise to the following taxonomy.
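To make the roles of H, δ, and e′ in Eq. (2.1) concrete, the point transfer can be sketched numerically. The snippet below is a minimal illustration only; the function name `transfer_point` and the use of inhomogeneous pixel inputs are our assumptions, not part of the reviewed works.

```python
import numpy as np

def transfer_point(H, p, delta=0.0, e_prime=(0.0, 0.0)):
    """Plane-plus-parallax transfer of Eq. (2.1): p' ~ H p + delta * e'.

    H is a 3x3 homography, p a 2D pixel, delta the parallax relative to
    the reference plane, and e_prime the epipole in the second image.
    All quantities are lifted to homogeneous coordinates internally.
    """
    ph = np.array([p[0], p[1], 1.0])
    eh = np.array([e_prime[0], e_prime[1], 1.0])
    q = H @ ph + delta * eh
    return q[:2] / q[2]  # back to inhomogeneous pixel coordinates

# A point on the reference plane (delta = 0) is mapped by H alone:
H = np.array([[1.0, 0.0, 5.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])  # pure horizontal shift of 5 pixels
print(transfer_point(H, (10.0, 20.0)))                         # -> [15. 20.]
# An off-plane point picks up a residual displacement toward the epipole:
print(transfer_point(np.eye(3), (0.0, 0.0), 1.0, (2.0, 2.0)))  # -> [1. 1.]
```

With δ → 0 this reduces to the global homographic case, while letting δ vary per point corresponds to the stereo geometric case of the taxonomy.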

2.2.1 Infinite Homographic Registration

In Conaire et al. [32] and Davis and Sharma [39], it is assumed that the thermal infrared and visible cameras are closely placed and the scene is far from them, so that the deviation of an object from the ground plane is negligible compared to the distance between the ground and the cameras. Under these assumptions, an infinite planar homography can be applied to the scene and all objects will be aligned in each image. Similarly, Beyan et al. [15] use an infinite homography to rectify a pair of binary masks coming from a background subtraction algorithm; they detect abandoned objects by fusing these masks using binary operations and mathematical morphology. Fay et al. [45] and Torresan et al. [156] also fuse thermal infrared and visible videos, combining background subtraction with heuristic rules in pedestrian detection and tracking systems. In conclusion, these approaches are only valid when the objects of interest are very far from the cameras or when the cameras are closely placed.

2.2.2 Global Image Registration

Fusion by global approaches can only be used when strong assumptions about the movement and placement of the objects in the given scene are made. The results will be accurate when the scene follows the specific model, but completely wrong when the imaged scene does not fit the assumptions of the model. A usual assumption of these approaches is that all objects lie on the same plane. Often, to enforce this assumption, only foreground objects are considered. These methods also assume that the differences among the homographic transformations are negligible for all objects in the scene. Thus, in scenes where the objects of interest lie on different planes, only those objects lying on the global plane will be correctly registered (any object out of this plane will be misaligned).

Irani and Anandan [77] use directional-derivative operators to generate features in a pair of Gaussian pyramids, one for each modality (LWIR and VS). Then, they use local correlation values of these features to obtain a global alignment for the multimodal image. Global alignment is done by estimating a parametric surface that can measure the registration error between the two images. Newton's method is used to iteratively search for the parametric transformation that maximizes that global function.

Coiras et al. [30] match triangles formed from edge features in VS and LWIR images in order to learn an affine transformation model for static images. Then, this static model is employed to register urban scenes. The global affine transformation that maximizes this model is then the best transformation relating the multimodal images. Similarly, Ye [171] uses edge information; in this case, the silhouette of a person walking through an indoor corridor is considered. Every human silhouette is tracked and matched frame by frame using a Hausdorff distance, and the multimodal images are registered from this edge matching. This method assumes that the cameras are closely placed and that registration can be achieved with a displacement and a scaling.

Han and Bhanu [66] use the features extracted while a person walks in a scene to learn a projective transformation model to register VS and IR images. It is assumed that the person walking in the scene moves in a straight line, in order to ensure that the global projective transformation model assumption holds. Feature points from silhouettes in both multimodal images are used as input to a hierarchical genetic algorithm that searches for the best global transformation.


Itoh et al. [80] use a calibration board to register the VS and LWIR cameras. In that work, they propose a system for hand movement recognition. Since registration is only required in a predefined area whose dimensions are known, the homographic transformation is computed from that calibration pattern and assumed to be valid for the hands as well.

Global image registration approaches assume that all registered objects lie on a single plane in the images; otherwise, suboptimal results are obtained. This limitation constrains their application domain to tasks where the entities lie on a plane, making them useless in those cases where the objects are at different depths.

2.2.3 Regions-of-Interest Registration

An improvement over the previous methods is achieved if regions of interest are used to register objects at multiple depths. The main assumption of these approaches is that each object in the scene lies on a specific plane and that each plane can be individually registered with a separate homography.

Chen et al. [76] propose that multimodal video sequences (LWIR and VS) can be registered using the maximization of mutual information. Their approach assumes that a moving target is tracked in one of the modalities and represented by a bounding box. Then, its corresponding bounding box is searched for in the other modality using a local sliding window. In this step, they also assume that the scale is constant and known a priori in order to initialize the starting point of the search. This method allows corresponding objects at different depths and successfully computes their homography transformations. A drawback of this method is that it depends on a correct object segmentation, which is often difficult to achieve.
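The mutual information score that drives this kind of region search can be sketched from a joint gray-level histogram. The helper below is a simplified, hypothetical illustration (the function name and bin count are our choices), not the implementation used in [76]:

```python
import numpy as np

def mutual_information(patch_a, patch_b, bins=16):
    """Mutual information between two equally sized image patches,
    estimated from their joint intensity histogram."""
    joint, _, _ = np.histogram2d(patch_a.ravel(), patch_b.ravel(), bins=bins)
    pxy = joint / joint.sum()                # joint probability table
    px = pxy.sum(axis=1, keepdims=True)      # marginal of patch_a
    py = pxy.sum(axis=0, keepdims=True)      # marginal of patch_b
    nz = pxy > 0                             # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

# A sliding-window search would evaluate this score for every candidate
# bounding box in the second modality and keep the maximizer.
rng = np.random.default_rng(0)
a = rng.random((32, 32))
print(mutual_information(a, a) > mutual_information(a, rng.random((32, 32))))  # True
```

Because the score depends only on the joint statistics of the intensities, not on their raw values, it tolerates the very different appearance of LWIR and VS patches better than direct correlation.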

2.2.4 Stereo Geometric Registration

When a stereo camera setup is used in combination with additional cameras from other modalities, the images from each modality can be combined using the 3D point estimates provided by the stereo system as well as the geometric relation between the stereo and the multimodal cameras. Ju et al. [82] use this principle to register a thermal infrared image onto a 3D cloud of points provided by a VS/VS stereo head. This is possible because the additional (LWIR) cameras are calibrated with respect to the reference stereo pair; thus, the thermal infrared image can be reprojected onto the 3D representation (and vice versa). A similar method is presented by Yang and Chen [169], but instead of a stereo rig they use structured light to obtain the 3D point coordinates of objects in the scene.

Krotosky and Trivedi [95] report a multimodal stereo system for pedestrian detection consisting of three different stereo pairs: (i) two VS cameras and two LWIR cameras; (ii) two VS cameras and a single LWIR camera; and (iii) one VS and one LWIR camera. All cameras in these multimodal stereo rigs are rectified (i.e., aligned in pitch, roll, and yaw). In (i), a configuration of two VS and two LWIR cameras is analyzed in order to obtain a dense disparity map from both the VS-stereo and the LWIR-stereo pair. The matching algorithm used in both modalities is presented in [91]; it is based on area correlation with a cost function built on absolute differences. Note that this approach is not multimodal, since the LWIR images are matched independently of the remaining modalities. They conclude that accurate disparity maps can be recovered from two LWIR images; similar conclusions are reached by Prakash et al. [128], but using a cost function based on the sum of squared differences. Both works assume isotherm lines in the LWIR images. The second Krotosky multimodal stereo system is composed of a VS stereo rig paired with a single infrared camera. Similarly to the first two works reviewed in this section, they use a trifocal framework to register disparity maps and thermal infrared data; this kind of approach can be seen as texturing 3D data with heat information. Finally, the last multimodal stereo approach is a system based on raw fusion, in contrast to the approaches introduced above. The authors use a disparity voting scheme combined with window-based matching [94]. They propose to match regions in cross-spectral stereo images by maximization of mutual information. The weak correlation between the multimodal images is overcome by adding restrictions; for instance, it is assumed that a pedestrian's shape is approximately planar. This assumption allows the definition of a voting matrix where a poor match corresponds to a large disparity variation, whereas small disparity variations indicate good matching. Subsequent works follow this approach for matching regions and then registering LWIR and VS images in diverse tasks such as surveillance, pedestrian detection, and sparse depth estimation from multimodal images [96, 97, 98, 103].

Multiple stereo camera approaches based on a stereo geometric setup have also been investigated by Bertozzi et al. [13, 12, 14] for pedestrian detection. They use a multimodal stereo head similar to the one presented by Krotosky and Trivedi in configuration (i): a redundant pair of unimodal stereo rigs at different spectral bands (LWIR and VS). The novelty of their approach is that registration occurs in the disparity domain. Although state-of-the-art detection rates are reported, a tetra-camera stereo system entails a high computational overhead.

These approaches represent an advance in the multimodal image fusion field. However, the estimation of depth is constrained to certain image regions that contain human body parts, and only depth (or height) information is available; distinctive features of objects, such as their shape, are lost during the matching process due to the large size of the employed windows. For these reasons, novel methods that work in a pixelwise manner are needed, so that denser solutions can be provided, opening new application domains for LWIR/VS multimodal systems.

2.3 Image Fusion by Matching

In general, matching-based approaches are used in the computational stereo vision.Computational stereo refers to the problem of determining three-dimensional struc-ture of a scene from two or more images taken from distinct viewpoints. It is awell-known technique to obtain depth information by optical triangulation. Most of

Page 30: Multimodal Stereo from Thermal Infrared and Visible SpectrumMultimodal Stereo from Thermal Infrared and Visible Spectrum A dissertation submitted by Jos e Fernando Barrera Campo at

14 LITERATURE REVIEW

Pre-processing

Mu

ltim

od

al

Ima

ges

Dis

pa

rity

Ma

p

Segmentation

ContextInformation

Winner-Takes-All

GlobalOptimization

CostVolume

Cost Calculation Optimization Post-processing

RectificationInterpolation

Labels toDisparities

LWIR/VS CostFunction

Aggregation

Figure 2.2: Modular stereo vision framework.

The previous works have been comprehensively surveyed by Scharstein and Szeliski [136]. They suggest that a stereo vision algorithm can be broadly divided into the following four steps: (i) cost initialization, in which the matching costs for assigning different disparity hypotheses to different pixels are calculated; (ii) cost aggregation, where the initial matching costs are aggregated spatially over support regions; (iii) disparity optimization, in which the best disparity hypothesis for each pixel is computed by either global or local optimization; and (iv) disparity refinement, where the generated disparity maps are post-processed to remove mismatches or to provide sub-pixel disparity estimations. Later, Brown et al. [25] classified stereo algorithms into two dichotomous classes, local and global. Approaches such as block matching, gradient-based methods, and feature matching fall into the former category, while dynamic programming, intrinsic curves, graph cuts, non-linear diffusion, belief propagation, and correspondenceless methods belong to the latter. Other subsequent publications were heavily influenced by these works, particularly in their understanding of the process of stereo matching from two frames [75]. Recently, Oon-Ee et al. [123] presented a novel stereo architecture, called the modular stereo vision framework (MSVF), which consists of four stages: (i) pre-processing; (ii) cost calculation; (iii) optimization; and (iv) post-processing. Its initial input is a pair of stereo images, from which a cost volume is obtained and then optimized until a disparity map is produced (the output of the stereo algorithm). Its novelty lies in how the various submodules interact, allowing a wide variety of stereo algorithms to be described in a natural way. In this section, we adapt this generic framework to review the algorithms related to the multimodal stereo algorithms proposed along this thesis. Thus, Fig. 2.2 presents the main stages as well as their sub-modules. Arrows in this figure indicate data flow; arrows with feedback to the initial source indicate that a method can be followed by another method from the same stage. The following subsections review specific issues related to the approaches that were used by our algorithms.
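For the VS/VS case, the four classical steps can be sketched with a toy SAD block matcher (SAD itself, as discussed later in this section, is unsuitable for LWIR/VS pairs; function and parameter names here are illustrative, not taken from any of the cited systems):

```python
import numpy as np

def block_matching(left, right, max_disp=16, radius=2):
    """Toy VS/VS stereo matcher following the four classical steps."""
    h, w = left.shape
    # (i) cost initialization: absolute differences per disparity hypothesis
    cost = np.full((h, w, max_disp), np.inf)
    for d in range(max_disp):
        diff = np.abs(left[:, d:].astype(float) - right[:, :w - d].astype(float))
        cost[:, d:, d] = diff
    # (ii) cost aggregation: moving average over a square support window
    k = 2 * radius + 1
    agg = np.empty_like(cost)
    for d in range(max_disp):
        c = np.pad(cost[:, :, d], radius, mode='edge')
        c = np.apply_along_axis(lambda v: np.convolve(v, np.ones(k) / k, 'valid'), 1, c)
        c = np.apply_along_axis(lambda v: np.convolve(v, np.ones(k) / k, 'valid'), 0, c)
        agg[:, :, d] = c
    # (iii) disparity optimization: local winner-takes-all over the cost volume
    disp = np.argmin(agg, axis=2)
    # (iv) disparity refinement (e.g. median filtering, LRC) would follow here
    return disp
```

On a synthetic pair shifted by a constant disparity, the winner-takes-all output recovers that shift in the textured interior of the image.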

2.3.1 Pre-Processing Stage

The pre-processing stage encompasses all methods to prepare the multimodal imagesfor matching, excluding any comparisons between them. These methods only use



information from I_VS and I_LWIR, or from external sources such as camera calibration or positional data (for initial image rectification). Pre-processing prepares each input image for matching by the cost calculation stage, either by signal enhancement/noise reduction or by data conversion to the appropriate format. A review of the methods used in this stage is presented below.

A. Rectification

Fusiello et al. [55] present a rectification algorithm for general and unconstrained VS stereo rigs. Their algorithm takes the two projective projection matrices of the cameras and computes a new, rectified pair. Since these matrices are generated from the rotation axes of the cameras as well as their spatial position with respect to a world reference frame, the rectified projection matrices can suffer great geometric distortion. In practice, multimodal cameras that are not pre-aligned can lead to poorly rectified images. Hence, Gluckman and Nayar [59] propose a method for stereo rectification that determines a pair of transformations minimizing the effect of resampling, the loss of pixels due to under-sampling, and the creation of new pixels due to over-sampling. This property is valuable for stereo heads built from heterogeneous sensors, like multimodal stereo rigs. Mallon and Whelan [116] follow this approach, but simplify the minimization step to evaluating only the singular values of a Jacobian, used as a local measure of the deformation produced by a transformation, instead of the complicated minimization function based on integral approximations presented by Gluckman and Nayar.

B. Segmentation

Oon-Ee et al. [123] argue that segmentation of the reference image helps to reduce the computational overhead of stereo algorithms, because the number of pixels to be matched is reduced. Also, the optimization stage is simplified by grouping similar pixels, which decreases the size of the cost volume in comparison to one without segmentation. Although the segmentation of visible-spectrum images is itself a wide field of research [46, 86, 102, 142], there are few works on LWIR segmentation that can be tested without being reimplemented; they are often problem-oriented, for example pedestrian segmentation [27].

C. Context Information

Hoiem et al. [74] provide insight into a contextual framework for estimating the coarse orientations of large surfaces in outdoor images. This framework classifies each image pixel into three main categories: part of the ground plane, surfaces that stick up from the ground, or part of the sky. Saxena et al. [135] use this approach to infer context information from a still image and to create a naive 3D representation (without depth measurements). Gallup et al. [56] exploit this framework for classifying planar and non-planar surfaces in a VS stereo algorithm for urban



images, and then use this information as another data term to minimize. The reported results show that prior knowledge of the surfaces improves the quality of the 3D models.

2.3.2 Cost Calculation

The cost calculation stage encompasses all those methods that compare two images and produce an estimate of the matching costs. This estimate stretches across three dimensions: the two dimensions of the reference image plus a third dimension, the disparity. Cost calculation performs the first half of stereo matching by providing numerical estimations of the likelihood of a match. The modules reviewed here are:

A. LWIR/VS Cost Functions

Krotosky et al. [95] introduce a matching algorithm for regions that contain human body silhouettes, both in thermal infrared and visible spectrum. They extract rectangular windows from these images, and then measure their degree of similarity using mutual information. Although this approach is valid for tracking people or depth estimation of pedestrians, it is an application-oriented solution. Furthermore, it assumes that hot points correspond to persons, and that only those regions should be matched. Recently, Torabi et al. [154] performed a comparative evaluation of dense stereo correspondence algorithms in the multimodal field, under the same restrictions. They conclude that similarity functions previously used in VS/VS stereo heads, such as Normalized Cross-Correlation (NCC) and Histograms of Oriented Gradients (HOG) [38], are highly sensitive to dissimilarities even in the presence of edges. In contrast, mutual information and Local Self-Similarity (LSS) [153] are the most accurate and discriminative correspondence measures among the evaluated ones. Cost functions frequently used in VS/VS stereo, such as Absolute Differences (AD), Sum of Absolute Differences (SAD), Sum of Squared Differences (SSD), Zero-normalized SSD (znSSD), NCC, Pearson's correlation (PC), Rank [6], and Census [173], cannot be considered in the multimodal domain, since they assume a linear correlation and this problem is clearly non-linear.

Egnal [43] was one of the first researchers to evaluate mutual information as a stereo correspondence measure. His work presents a review of mutual information (MI) properties and highlights its invariance to input data transformations. This property enables mutual information to measure similarities in more situations than other similarity metrics (for example, the ones mentioned above). His experiments suggest that it may be used for multimodal matching; although he does not explore this usage, later contributions do: Hirschmuller [73] investigates its properties under radiometric changes in the visible spectrum. Pluim et al. [125] propose a combination of mutual and gradient information for medical image registration, but just for some salient image points. Then, Fookes et al. [52] introduce an inter-scale propagation scheme of the joint probability, which is used to compute mutual information. This strategy is inefficient, because a matrix (the joint probability) is propagated through a scale-space representation for



every pixel in the given image. Moreover, the size of the windows from which mutual information is computed must be constant, preventing adaptive window approaches at different scales.
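As a minimal sketch of the measure itself (generic, not the implementation of any cited work), MI can be computed from the joint histogram of two co-located windows; because it depends only on co-occurrence statistics, it is unchanged by a bijective remapping of intensities, which is what makes it attractive for LWIR/VS matching:

```python
import numpy as np

def mutual_information(win_a, win_b, bins=16):
    """MI between two equally sized windows, from their joint histogram.
    Higher values indicate a statistically more plausible match."""
    joint, _, _ = np.histogram2d(win_a.ravel(), win_b.ravel(),
                                 bins=bins, range=[[0, 256], [0, 256]])
    pxy = joint / joint.sum()             # joint probability P(a, b)
    px = pxy.sum(axis=1, keepdims=True)   # marginal P(a)
    py = pxy.sum(axis=0, keepdims=True)   # marginal P(b)
    nz = pxy > 0                          # skip empty bins (log 0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```

For instance, a window compared against its own intensity-inverted copy (a crude stand-in for a modality change) scores as high as against itself, while an unrelated window scores near zero.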

B. Aggregation

Veksler [159] and Yoon et al. [172] demonstrate that, with a proper cost aggregation approach, algorithms based on local WTA optimization can outperform many techniques based on global optimization. Then, Tombari et al. [152] quantified this improvement and showed, through a large evaluation (15 aggregation methods), that a segmentation-based scheme outperforms the state of the art. Therefore, by adding this feature to a multimodal stereo algorithm, results can be improved. In the same evaluation, methods such as adaptive weights and variable window sizes are among the top performers.

2.3.3 Optimization

The optimization stage takes place after the generation of the cost volume. This stage produces a two-dimensional disparity map that meets a set of constraints.

A. Winner-Takes-All

The winner-takes-all step is a disparity selection strategy where the minimum or maximum cost is taken for every pixel/region, under certain local constraints. It is sensitive to noise and ambiguities, because no global constraint is enforced. This approach is suitable for real-time stereo algorithms, because its computational cost is low in comparison to other (global) minimization schemes. Thus, algorithms where speed is a valuable feature take advantage of this property [84, 132], as do stereo algorithms where an initial disparity map of the scene is needed [90, 104].

B. Global Optimization

Global methods can be less sensitive to the aforementioned problems of a local approach, since high-level priors provide additional information for ambiguous regions. These methods formulate the matching problem in optimization terms, allowing the introduction of restrictions and smoothness constraints. Boykov et al. [20] study the use of graph-based energy minimization techniques in diverse computer vision applications, establishing a link between two standard approaches: combinatorial graph cuts and geometric methods. This work has made it possible to include, on one hand, classical VS stereo cost functions as data terms, for instance mutual information [88], SAD [21], or combinations of them (Census and gradient) [118]; and, on the other hand, smoothness terms that model complex relationships between the data. Initially, these were



based on simple pairwise potential energies such as the Potts model or linear functions [21]. However, recent energy-based algorithms include high-order terms, for instance piecewise planar regularization terms, which not only penalize certain configurations of labels, but also use spatial information, given that every label represents a plane, or 3D surface, in the scene [54, 56, 144]. These approaches are being exploited for the 3D reconstruction of urban scenes.

2.3.4 Post-Processing

The post-processing stage is intended to improve the obtained disparity map. Typically, it includes methods for error correction, such as the left-right consistency check (LRC) [44] or border correction methods [71]. Egnal and Wildes [44] perform an empirical comparison of methods for detecting half-occlusions in binocular stereo; methods such as bimodality, match goodness jumps, LRC, ordering, and the occlusion constraint (OCC) are evaluated. Their study concludes that OCC and LRC rank at the top of the evaluation. Hu and Mordohai [75] propose a confidence measure based on LRC, but instead of applying it in a post-processing step they include it in the optimization stage (acting like a cost function).
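A left-right consistency check of the kind compared in [44] can be sketched as follows (a generic implementation; `tol` is an illustrative tolerance, in pixels):

```python
import numpy as np

def left_right_consistency(disp_l, disp_r, tol=1):
    """Mark pixels whose left->right and right->left disparities agree.
    Disagreeing (or out-of-image) pixels are flagged as invalid (False),
    typically indicating occlusions or mismatches."""
    h, w = disp_l.shape
    xs = np.arange(w)
    valid = np.zeros((h, w), dtype=bool)
    for y in range(h):
        xr = xs - disp_l[y]                  # matching column in the right image
        inside = (xr >= 0) & (xr < w)        # match must fall inside the image
        agree = np.abs(disp_l[y, inside] - disp_r[y, xr[inside]]) <= tol
        valid[y, inside] = agree
    return valid
```

Invalid pixels are then usually filled from neighbouring valid disparities or discarded.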

A. Interpolation

Hirschmuller et al. [69, 70, 73] made the disparity interpolation technique popular; though it does not increase the performance of a stereo algorithm, the resulting disparity maps and 3D representations are quasi-continuous. Therefore, all generated disparity maps are smooth, without the saw-tooth artifacts caused by the decimation of the baseline in VS/VS stereo heads.

B. Labels to Disparities

This step transforms the labels coming from the optimization stage into disparity maps. When disparity maps are requested, a label is converted into its equivalent disparity. In region-based algorithms such as [104, 108, 165], a set of labels is converted into equivalent surfaces (i.e., planes).
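Assuming, for illustration, that each label k encodes a plane (a, b, c) with d(x, y) = a·x + b·y + c (a hypothetical encoding, not necessarily the one used in the cited works), the conversion is a per-pixel evaluation:

```python
import numpy as np

def labels_to_disparities(labels, planes):
    """Convert per-pixel plane labels into a disparity map.
    planes[k] = (a, b, c) encodes d(x, y) = a*x + b*y + c for label k."""
    h, w = labels.shape
    ys, xs = np.mgrid[0:h, 0:w]              # pixel coordinate grids
    p = np.asarray(planes)[labels]           # (h, w, 3) plane coefficients
    return p[..., 0] * xs + p[..., 1] * ys + p[..., 2]
```

A fronto-parallel surface is the special case a = b = 0, where the whole region gets the constant disparity c.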


Chapter 3

Multimodal Imagery

The study of the nature of light has been one of the most important sources of progress in physics, as well as in computer vision. This section introduces the basic concepts needed to understand infrared sensors, their evolution, physical properties, and their usefulness in computer vision applications. Then, the proposed multimodal stereo head is presented, as well as its components and basic algorithms.

3.1 Infrared Sensors: History, Theory and Evolution

In 1800 the German astronomer Sir Frederick William Herschel experimented with a new form of electromagnetic radiation, which was called infrared radiation. During his experiments, he built a basic monochromator1, with which he measured the distribution of energy of a ray of sunlight. Herschel's experiment demonstrated the existence of radiation beyond what is known as the visible spectrum. Furthermore, it provided evidence of a relationship between temperature and color. His experiment used a glass prism to break sunlight up into its constituent spectral colors, and an array of thermometers with blackened bulbs, which were placed over every spectral color. Initially, they were meant to measure the temperature of the different colors. However, during the experiment Herschel noticed that a thermometer near the experimental setup registered a higher temperature in comparison to those used in the array (visible spectrum). Further experiments confirmed that there is an invisible form of light beyond the visible spectrum, and that it can be measured. Figure 3.1(a) shows a schematic diagram of his experimental setup. This discovery was largely ignored until modern instruments were used to acquire thermal infrared images, giving rise to a new research field known as infrared thermography. More recently, computer vision techniques have been

1A monochromator is an optical device that can produce monochromatic light.




Figure 3.1: IR working principle: (a) Herschel's experimental setup; and (b) simplified block diagram of an IR camera.

applied to analyze these images in order to exploit thermal information in tasks such as scene recognition, pedestrian detection, and multimodal stereo.

Clearly, the equipment used by Herschel has been improved many times. However, infrared cameras nowadays are still based on his initial operating principle (see Fig. 3.1(b)). Although infrared radiation is not detectable by the human eye, an IR camera can detect it. Its operation is similar to that of a digital camera working in the visible spectrum (VS), but infrared cameras replace the classical silicon Charge Coupled Device (CCD) sensor with a Focal Plane Array (FPA), which is made of materials and alloys sensitive to infrared (IR) wavelengths.

Their main components are shown in Fig. 3.1(b); among them we find an optical unit that focuses infrared radiation onto a detector. This unit is made from one or more germanium (Ge) lenses that act as band-pass filters, admitting wavelengths within the IR band and rejecting (attenuating) frequencies outside that range. The electronics module converts these opto-electrical signals into images. Finally, an optional software interface allows changing some operational parameters of the camera, for example thermal calibration tables, digital detail enhancement, automatic gain control, image stabilization, and noise suppression.

In general, FPAs can be classified into two categories, according to their working principle: (i) photon detectors; and (ii) thermal detectors. The former class corresponds to those detectors where the radiation is absorbed within the material by interaction with electrons. This causes a change in the energy distribution of the material, making it possible to determine the amount of incident radiation. These detectors have a high performance but require cryogenic cooling. Therefore, they are heavy and expensive systems, which is inconvenient for many applications. In contrast, thermal detectors absorb the incident radiation in a semiconductor material, causing a change in its temperature or other physical properties. This variation is used to generate a flow of electrons proportional to the incident radiation, which is easily transformed into a numerical measure. Usually, these devices are made from an array of microbolometers, which transform an incoming photon flux into heat; this in turn changes the electrical resistance of the detector, allowing an indirect measure of the IR radiation emitted/reflected by the scene. Nowadays, these detectors are commercially available for any kind of application, in contrast to the photon-based ones,



which are restricted to military applications.

Another advantage of thermal detectors is that they do not require cooling units; thus their manufacturing process is less complex, and therefore their final cost is several times lower than that of a photon detector. Recent advances in micro-miniaturization have allowed the development of large FPAs that compensate for their slow response time and lower image quality. In spite of these technological improvements, photon detectors are still the fastest and most selective devices for infrared applications (an extensive review is presented in [129]).

Infrared radiation, or IR, by definition refers to the part of the electromagnetic spectrum between the visible and microwave regions, and its energy is equal to:

E = h · c / λ, (3.1)

where c is the speed of light, 3 × 10^8 (m/s); λ its wavelength (m); and h is the Planck constant, equal to 6.6 × 10^−34 (J · s). Notice that an electromagnetic wave of any frequency, not only IR, may heat a surface. This principle is exploited for camera calibration (Sect. 3.3). Typically, IR cameras are designed and calibrated for a specific range of the IR spectrum. This means that the composition of the optics (lenses) and detector (FPA) must be selected according to the desired range. Figure 3.2 illustrates the spectral response regions for various materials [51]. Observe that in this figure there are two curves of high relevance for this research: the VS band (blue-to-red colored bar) and the microbolometer curve (solid red). They correspond to the VS and LWIR spectral bands, from which the multimodal stereo head acquires the I_VS and I_LWIR images, which will be converted into 3D representations. Table 3.1 presents a division of the visible and infrared spectrum by wavelength, together with the appropriate detectors for each range.
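As a quick numerical illustration of Eq. (3.1), with the constants quoted above, an LWIR photon carries roughly twenty times less energy than a visible one, which is part of why LWIR detection requires dedicated materials:

```python
H = 6.6e-34   # Planck constant (J*s), value as quoted in the text
C = 3.0e8     # speed of light (m/s)

def photon_energy(wavelength_m):
    """Photon energy E = H * C / wavelength (Eq. 3.1), in Joules."""
    return H * C / wavelength_m

e_lwir = photon_energy(10e-6)   # 10 um, inside the LWIR band (8-14 um)
e_vs = photon_energy(0.5e-6)    # 0.5 um, middle of the visible band
```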

IR signals have the same properties as visible light regarding reflection, refraction, and transmission. The intensity of the energy emitted from an object varies with its temperature and the radiation wavelength. Besides emitting radiation, an object also reacts to incident radiation from its surroundings by absorbing and reflecting a portion of it, or allowing some of it to pass through (for example, a lens). So, the total radiation (W) can be stated as a function of these three phenomena by the following equation:

W = αWα + ρWρ + τWτ , (3.2)

where the coefficients α, ρ, and τ stand for the fractions of the object's incident energy that are absorbed (α), reflected (ρ), and transmitted (τ). Each coefficient takes a value from zero to one, indicating how well an object absorbs, reflects, or transmits incident radiation. For example, the coefficients of a perfect blackbody are α = 1, ρ = 0, and τ = 0, which means 100% of the incident radiation is absorbed (there is no reflected or transmitted radiation). By contrast, an opaque body transmits no radiation, so its coefficients satisfy τ = 0 and α + ρ = 1.

In the real world there are no objects that are perfect absorbers, reflectors, ortransmitters, although some may come very close to one of these properties. Nonethe-less, the concept of a perfect blackbody is very important, because it is the foundation



Figure 3.2: Examples of detector materials and their spectral responses; from left to right: VS, visible spectrum; InGaAs, alloy of Indium (In), Gallium (Ga), and Arsenide (As); VisGaAs, alloy based on InGaAs; InSb, crystalline compound made from In and antimony (Sb); MCT-SW, arrays based on Mercury (Hg), Cadmium (Cd), and Telluride (Te) for Short-Wave; Microbolometer; QWIP (Quantum Well Infrared Photodetector), a layered FPA made from Ga/As; and MCT-LW, similar in composition to MCT-SW but for Long-Waves (image taken from [51]).

for relating IR radiation to the temperature of a given object. Thus, from the Stefan-Boltzmann law, the total radiated energy from a blackbody is expressed by the following equation:

W = σT^4, (3.3)

where σ is the Stefan-Boltzmann constant (5.67 × 10^−8 W/m^2K^4) and T is the absolute temperature of the blackbody, measured in Kelvin (K). Equation (3.3) provides an important relationship between the emitted radiation and the temperature of a perfect blackbody. Although no object of interest in the real world is a perfect blackbody, this equation allows relating real objects to perfect blackbodies and thus estimating their radiated energy. Finally, note that the Stefan-Boltzmann law implicitly takes into account the effective body surface, measured in m^2.

As mentioned above, the radiative properties of an object of interest are described in relation to a perfect blackbody (the perfect emitter). The emissivity or emittance (ε) of the object is defined as the ratio between the energy emitted by the object (Wobj) and the energy emitted by a blackbody (Wbb) at the same temperature. Then:

ε = Wobj / Wbb. (3.4)

The emissivity of a body is a number between zero and one, whose value can be made equal to α by Kirchhoff's law (Eq. (3.2)), thus α(λ) = ε(λ). Note that the emissivity of a body is not constant; actually, it varies with the wavelength of the radiation. Finally, if it is assumed that the emissivity of the object holds constant for a range of wavelengths, then the Stefan-Boltzmann law takes the form:

W = εσT^4, (3.5)



Table 3.1: Visible and infrared spectral bands

Spectral Band         Acronym   λ (µm)       Detector
Visible               VS        0.4 - 0.7    Silicon
Near InfraRed         NIR       0.78 - 1.0   VisGaAs and InGaAs
Short-Wave InfraRed   SWIR      1 - 3        VisGaAs, InGaAs and MCT-SW
Mid-Wave InfraRed     MWIR      3 - 5        InSb, MCT-SW and MCT-LW
Long-Wave InfraRed    LWIR      8 - 14       QWIP, µ-bolometer and MCT-LW

which states that the total emissive power of an object is proportional to the emissive power of a blackbody at the same temperature, but reduced by a factor of ε.
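Equation (3.5) is easy to illustrate numerically; the emissivity values below are illustrative assumptions, not measurements from this work:

```python
SIGMA = 5.67e-8   # Stefan-Boltzmann constant (W / m^2 K^4)

def radiated_power(temp_k, emissivity=1.0):
    """Emitted power per unit area, W = eps * SIGMA * T^4 (Eq. 3.5)."""
    return emissivity * SIGMA * temp_k ** 4

w_skin = radiated_power(310.0, emissivity=0.98)   # ~37 C surface, high emissivity
w_wall = radiated_power(293.0, emissivity=0.90)   # ~20 C painted wall
```

The T^4 dependence means that even the 17 K gap between these two surfaces changes the emitted power substantially, which is what makes warm bodies stand out in LWIR images.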

Once the energy radiated from a body has been defined, the radiation that impinges on the IR camera lens can be stated. Mainly, it is modelled by three different sources: (i) radiation from the object of interest; (ii) radiation from its surroundings that has been reflected onto the surface of the object; and (iii) emission from the atmosphere. This model assumes that the emission from the atmosphere originates from the attenuation of radiation as it passes through the atmosphere (i.e., absorption by gases and scattering by suspended particles). By modelling these sources of emission, it is possible to obtain an equation for computing the object's temperature from thermally calibrated images. Each source becomes a term, as indicated below:

(i) Emission from the object: ε · τ · Wobj, where τ is the transmittance of the atmosphere.

(ii) Reflected emission from ambient sources: (1 − ε) · τ · Wamb, where (1 − ε) is the reflectance of the object. It is assumed that the temperature Tamb is the same for all emitting surfaces within the half sphere seen from a point on the object's surface.

(iii) Emission from the atmosphere: (1 − τ) · Watm, where (1 − τ) is the emissivity of the atmosphere.

Finally, the total radiation received by the camera can now be written as:

W = ε · τ · Wobj + (1 − ε) · τ · Wamb + (1 − τ) · Watm, (3.6)

where Wamb depends on Tamb, the (effective) temperature of the object's surroundings, or reflected ambient (background) temperature, and Watm depends on Tatm, the temperature of the atmosphere. To achieve an accurate estimation of the object's temperature, the emissivity of the object, the atmospheric attenuation, and the temperature of the ambient surroundings are required. Depending on the circumstances, these factors may be measured, assumed, or found in look-up tables.
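Although temperature recovery is not pursued in this thesis, Eq. (3.6) is a linear relation that can be inverted for Wobj once ε, τ, and the ambient/atmospheric terms are known (a sketch of the standard rearrangement):

```python
def object_radiation(w_total, w_amb, w_atm, eps, tau):
    """Solve Eq. (3.6) for W_obj, given
    W = eps*tau*W_obj + (1 - eps)*tau*W_amb + (1 - tau)*W_atm."""
    return (w_total - (1 - eps) * tau * w_amb - (1 - tau) * w_atm) / (eps * tau)
```

The recovered Wobj would then be mapped to a temperature through Eq. (3.5).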

In this research, the parameters mentioned above are ignored, because our problemis not how to calculate temperatures from images. However, all equations described



in this section are intended to provide insight into the thermal infrared principles that are necessary in further chapters.

3.2 Multimodal Stereo Head

This section presents in detail the multimodal stereo head used for generating the evaluation dataset and its ground truth data. Figure 3.3 shows illustrations of the multimodal platforms used during this dissertation. The different challenges of this research can be appreciated in these illustrations, from the manufacture of the stereo rig to the development of all the algorithms that transform camera outputs into images adequate for depth estimation. The different stages of the image processing, as well as operational issues, are presented below.

Figure 3.3: Multimodal stereo heads: (a) and (b) three VS cameras and one LWIR; (c) and (d) three VS cameras and one LWIR, free; and (e) two VS cameras and one LWIR, rigidly attached.

The proposed multispectral stereo rig is built with two cameras, which are rigidly mounted and oriented in the same direction. These cameras work in different spectral bands: while one measures radiation in the visible band, the other registers infrared radiation. From now on, this system will be referred to as the multimodal stereo head, which is able to provide a pair of multispectral images.

Most of the stereo heads presented in the literature for VS stereo, and others commercially available, are built from cameras that have the same specifications (i.e., detector and focal length). This choice constrains the stereo problem and facilitates the reuse of software as well as published methods. However, the multimodal case is far more complex, since heterogeneous sensors are used, besides the problems originated



Figure 3.4: 2D representation ((X,Z) plane) of the evaluation set-up.

by the multimodality. So, the alignment of two views coming from cameras with different detectors and intrinsic parameters is a task that requires special attention. In general, it is more complicated than a classical VS/VS stereo calibration.

Figure 3.3(e) shows the multimodal stereo head used during this research. An advantage of this configuration is its low software/hardware requirements, which helps to obtain synchronized frames. Other configurations increase the number of sensors, and therefore the response time [110].

The multimodal stereo head consists of a pair of cameras (LWIR/VS) separated by a baseline of 12 cm, in a non-verged geometry. This configuration is obtained by adjusting the pose of the cameras until their z coordinate axes are perpendicular to the baseline. Hence, the images provided by the multispectral stereo head are pre-aligned, ensuring their correct rectification. Thermal infrared images are obtained with a Long-Wavelength InfraRed camera (PathFindIR from Flir2), while color images come from a standard camera based on an ICX084 Sony CCD sensor with a 6 mm focal length lens. The former detects radiation in the range of 8 − 14 µm (LWIR band), whereas the color camera responds to wavelengths from about 390 to 750 nm (visible spectrum). In order to evaluate the accuracy of the obtained results, the proposed multimodal stereo rig is assembled together with a commercial (VS/VS) stereo head, which is used to generate synthesized ground truth values.
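For this rectified, non-verged geometry, depth follows from the usual triangulation relation Z = f·B/d, with the focal length expressed in pixels. The pixel pitch below is an assumed value for an ICX084-class sensor, used purely for illustration:

```python
def depth_from_disparity(disp_px, baseline_m=0.12, focal_mm=6.0, pixel_um=4.65):
    """Depth (m) from disparity (pixels) for a rectified, non-verged rig.
    pixel_um is an assumed pixel pitch, not a value taken from the text."""
    focal_px = focal_mm * 1e3 / pixel_um   # focal length converted to pixels
    return focal_px * baseline_m / disp_px
```

With these numbers, a 20-pixel disparity corresponds to a depth of roughly 7.7 m; halving the disparity doubles the depth, which is why the 12 cm baseline limits the useful working range.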

A 2D illustration of the evaluation set-up is shown in Fig. 3.4. This diagramdepicts the three cameras, which are linearly arranged, and a point P on the scene.

2www.flir.com



Figure 3.5: Example of rectified images: (a) infrared image of the checkerboard pattern; (b) original color image; (c) infrared rectified image; and (d) visible rectified image.

As mentioned above, the VS cameras are part of a commercial stereo vision system3 (BPG) that provides the ground truth data, whereas the multimodal stereo head is composed of the LWIR and VS1 cameras, as indicated in the figure. The origin of the World Reference Frame (WRF) of both stereo rigs (LWIR/VS1 and VS1/VS2) is situated at the optical center of VS1; therefore both stereo systems share the same coordinate system.

VS1 and VS2 are rigidly attached and form a compact device, whose calibration parameters are known. In contrast, the calibration parameters of the multimodal stereo head must be obtained, particularly those related to the LWIR camera. The next sections give details of the method followed for camera calibration, as well as the algorithm proposed for rectification. Hereinafter VS1 is referred to as VS, for simplicity.

3.3 Camera Calibration

Multispectral stereo camera calibration is considerably more complex than the classical VS/VS one, because the LWIR sensor measures heat variations. Therefore, a calibration pattern ideally should have two different temperatures for generating contrasted images. In practice, this is not feasible. Furthermore, the effect of thermal diffusion between the calibration pattern and the air causes both smooth step edges and distorted corners in infrared images, which are not perceived at a glance. In order to avoid these problems, we calibrate the multispectral head in an outdoor scenario using a metallic checkerboard. In this way, sun rays are reflected by the white rectangles and absorbed by the black ones; this procedure enhances the contrast of the image and helps the detection of calibration points. Although the problem of blurred calibration points is partially solved by this lighting reflection/absorption technique, a saddle point detector is considered instead of a classical corner detector to obtain more robust results.

3Bumblebee from Point Grey

The LWIR camera has been calibrated using a modified version of Bouguet's toolbox [19]. The main difference with respect to camera calibration in the VS spectrum is how to make the calibration pattern visible in this modality. As mentioned above, a special metallic checkerboard has been made using a thin aluminium metallized paper. Black squares are printed over this surface by means of a laser printer, making it possible to detect them from both the VS and LWIR cameras. Figure 3.5(a)-(b) shows a pair of calibration images (LWIR and color). Despite using a metallic calibration pattern, the junctions of black and white squares are not correctly detected, due to thermal diffusion. Hence, calibration points are extracted using a saddle point detector instead of a classical corner detector. In our particular case, the use of saddle points results in a more stable detection; this is due to the fact that the thermal variation between black and white squares is not enough to generate step edges, and the structure of the junctions looks more like saddle points than corners [99]. Figure 3.6(a)-(c) shows three illustrations of junctions obtained with the saddle point detector; note that even though the contrast of these infrared images is different, the junctions are correctly detected. Figure 3.6(d)-(f) depicts the local structure indicated by the red windows in Fig. 3.6(a)-(c); the green points are saddle points, while the red ones are corners; the straight lines show the diagonal directions whose intersection corresponds to the most likely position of the junctions. As can be seen in these plots, the green points are nearer to the intersections than the corresponding red ones.

Three independent calibration processes under different temperatures were performed to study the robustness of the intrinsic parameters of the LWIR camera when the saddle point detector is used; as a result, the obtained intrinsic parameters remained stable despite the changes in temperature. Notice that the LWIR images in Fig. 3.6(a)-(c) each correspond to one calibration image. As mentioned above, the cameras were aligned before starting the calibration process. This ensures that the projective transformations needed for rectification are smooth (the image planes are approximately coplanar). Once each camera has been calibrated, and its intrinsic parameters are known, the next step is to estimate the geometry of the multispectral stereo rig. Since the current thesis is focused on the generation of dense disparity maps, it is necessary to estimate the epipolar geometry (fundamental matrix F). Then, with this matrix, the next step is to rectify the multispectral images.

The fundamental matrix is accurately estimated using 149 pairs of calibration images (LWIR/VS), a saddle point detector, and a standard Least Median of Squares method with random sampling and the Sampson distance for determining valid calibration points.
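The inner linear estimator that robust schemes such as LMedS repeatedly invoke can be sketched with Hartley's normalized 8-point algorithm (illustrative only; the thesis pipeline adds random sampling and the Sampson distance on top of such a core, and all names here are ours):

```python
import numpy as np

def normalized_eight_point(x1, x2):
    """Linear estimate of F from N >= 8 correspondences x1[i] <-> x2[i].

    Hartley-normalized 8-point algorithm: the inner estimator that
    robust schemes (LMedS, RANSAC) call on random samples.
    """
    def normalize(pts):
        # Translate to the centroid, scale mean distance to sqrt(2).
        c = pts.mean(axis=0)
        s = np.sqrt(2) / np.mean(np.linalg.norm(pts - c, axis=1))
        T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1.0]])
        ph = np.column_stack([pts, np.ones(len(pts))])
        return (T @ ph.T).T, T

    p1, T1 = normalize(np.asarray(x1, float))
    p2, T2 = normalize(np.asarray(x2, float))
    # One row per correspondence from the constraint x2^T F x1 = 0.
    A = np.column_stack([p2[:, :1] * p1, p2[:, 1:2] * p1, p1])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)
    # Enforce rank 2 (fundamental matrices are singular).
    U, S, Vt = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt
    F = T2.T @ F @ T1          # undo the normalization
    return F / np.linalg.norm(F)
```

A robust wrapper would draw random 8-point samples, score each candidate F by the median Sampson distance over all pairs, and keep the best, exactly in the spirit of the LMedS scheme used above.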



Figure 3.6: Calibration points detected at different temperatures: (a)-(c) infrared images of the checkerboard pattern; and (d)-(f) detected saddle points.

3.4 Image Rectification

Once the LWIR and VS cameras have been calibrated, their intrinsic and extrinsic parameters are known, making it possible not only to rectify the images but also to compute the depth map of the scene. The image rectification was done using an extension of the method proposed in [55], with improved accuracy due to the inclusion of the radial and tangential distortion coefficients in the camera model.

Image rectification is a problem usually avoided in multimodal stereo applications by imposing constraints on the scene and camera positions: for instance, assuming that the image planes are approximately parallel [66], or that a 2D homogeneous transformation relates the images [174]. These restrictions on the geometry of the system are correct, but they constrain the application domain, since they hold in few real situations.

Some authors consider image rectification an optional step. However, it has been shown that searching for correspondences in rectified images is more efficient than using epipolar lines, especially in the case of dense stereo algorithms. A multimodal rectification algorithm must take into account that the sensors are placed at distinct positions, have different intrinsic parameters, and register signals coming from two spectral bands and physical phenomena (reflection of light and emission of infrared radiation). Therefore, the same object may have a different appearance in the multimodal images.

This section presents a rectification algorithm for multimodal images, based on [55] and [3]. However, our approach supports multiple modalities, and considers the parameters of the infrared camera as well as their contribution to the complexity of the stereo head geometry. Since infrared cameras are thermal sensors, two kinds of calibration could be performed. The first is a thermal calibration, which relates a radiation measurement to a temperature value. The second relates points in the scene to their image projections. Given that we are focused on recovering 3D information, the thermal calibration is not performed.

3.4.1 Thermal Infrared and Visible Camera Model

Figure 3.7: LWIR/VS camera model (world reference frame, image plane, principal axis, focal length f, and optical centre O).

Throughout this dissertation a projective camera model is assumed for both multimodal cameras. Following Hartley and Zisserman [67], the projection of an arbitrary 3D point X = [X Y Z]ᵀ onto the image plane is given by its intersection with the line through the optical center O and X. The focal length f is equal to the distance between the image plane and O. An illustration of the camera model is depicted in Fig. 3.7. If a world coordinate and its projection onto the image plane are represented by homogeneous vectors, that is, X and x, respectively, then they can be related by a transformation P (the perspective projection matrix) as follows:

x = PX, (3.7)

which can be decomposed, using an RQ factorization, into the product:

P = K[R | t], (3.8)

where K is the camera calibration matrix; R is a 3 × 3 matrix representing the orientation of the camera with respect to the world reference frame; and t its translation. The matrix K depends only on the intrinsic parameters of the camera, and has the following form:

K = [ αx  γ   x0
      0   αy  y0
      0   0   1  ] ,    (3.9)

where αx = f mx and αy = f my represent the focal length of the camera in terms of pixel dimensions in the x and y directions; mx and my are the effective number of pixels per millimetre along the x and y axes; (x0, y0) are the coordinates of the principal point; and γ is the skew factor.

The projection of a point in the scene onto the image plane is obtained by substituting Eq. (3.8) into Eq. (3.7) as follows:

x = K[R|t]X. (3.10)

Sometimes it is convenient to make the camera centre C explicit, in which case t = −RC.
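Eqs. (3.7)-(3.10) can be checked numerically with illustrative intrinsics (the values below are made up, not those of the thesis rig):

```python
import numpy as np

# Eq. (3.9): intrinsic matrix for illustrative values:
# alpha_x = alpha_y = 500 px, principal point (320, 240), zero skew.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

R = np.eye(3)                  # camera aligned with the world frame...
t = np.zeros((3, 1))           # ...and located at its origin
P = K @ np.hstack([R, t])      # Eq. (3.8): P = K [R | t]

X = np.array([0.2, -0.1, 2.0, 1.0])    # homogeneous world point
x = P @ X                               # Eq. (3.10): x = K [R | t] X
u, v = x[:2] / x[2]                     # inhomogeneous pixel coordinates
# u = 320 + 500 * 0.2 / 2 = 370,  v = 240 + 500 * (-0.1) / 2 = 215
```

The division by the third homogeneous coordinate is what encodes the perspective effect: doubling Z halves the pixel offset from the principal point.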


3.4.2 Rectification of Camera Matrix

At this point, it is assumed that both LWIR and VS cameras are calibrated (Sect. 3.3), and that their perspective projection matrices P_LWIR and P_VS are known. The rectification should compute two new projection matrices P′_LWIR and P′_VS that vertically align the images (in the y direction). This is done by rotating the old projection matrices around their optical centers until the image planes become coplanar, thereby containing the baseline. This ensures that the epipoles are at infinity and that the epipolar lines are parallel. In order to get horizontal epipolar lines, the baseline must be parallel to the new x axis of both cameras; thus, corresponding points have the same vertical coordinate (y).

The projection matrix P in Eq. (3.8) can be written in matrix form as:

P = [ q11 q12 q13 q14
      q21 q22 q23 q24
      q31 q32 q33 q34 ] = [M | q],    (3.11)

where M is a 3 × 3 matrix and q a column vector, for a given camera. By definition, the coordinates of the optical centre O are given by:

O = −M⁻¹q = −R⁻¹K⁻¹q. (3.12)

In order to rectify the images, it is necessary to compute the new principal axes x, y, and z. A special constraint is applied to the computation of x, since this axis must be parallel to the baseline so that the epipolar lines are horizontal. Then:

x = (O2 − O1) / ‖O2 − O1‖.    (3.13)

The new y axis is orthogonal to x and to the old Z:

y = (Z × x) / ‖Z × x‖,    (3.14)

where the old Z axis is the third row (r3ᵀ) of the rotation matrix R. The new z axis is then orthogonal to x and y:

z = (x × y) / ‖x × y‖.    (3.15)

The previous steps show how to compute the new axes of the rectified stereo head, but so far no image has been rectified. To do so, the new projection matrices P′_LWIR and P′_VS should be expressed in matrix form, similar to Eq. (3.11). Thus:

P′ = K[R′ | −R′C] = [M′ | q′],    (3.16)

Since the optical centres of the LWIR and VS cameras stay fixed, only R′ is unknown in the previous equation. Note that this rotation can be computed from the new axes (Eqs. (3.13), (3.14), and (3.15)) as follows:

R′ = [ x
       y
       z ] .    (3.17)
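Eqs. (3.13)-(3.17) translate directly into code; the sketch below uses a made-up toy rig (illustrative names and values, not the thesis implementation):

```python
import numpy as np

def rectifying_rotation(O1, O2, R_old):
    """New common rotation R' of Eqs. (3.13)-(3.17) (illustrative sketch).

    O1, O2: optical centres; R_old: rotation of one original camera,
    whose third row is the old principal axis Z used in Eq. (3.14).
    """
    x = (O2 - O1) / np.linalg.norm(O2 - O1)      # Eq. (3.13): along baseline
    Z = R_old[2]                                  # old principal axis r3^T
    y = np.cross(Z, x)
    y /= np.linalg.norm(y)                        # Eq. (3.14)
    z = np.cross(x, y)
    z /= np.linalg.norm(z)                        # Eq. (3.15)
    return np.vstack([x, y, z])                   # Eq. (3.17): rows x, y, z

# Toy rig: cameras 12 cm apart with a slightly tilted baseline.
O1 = np.array([0.0, 0.0, 0.0])
O2 = np.array([0.12, 0.01, 0.0])
R_new = rectifying_rotation(O1, O2, np.eye(3))
```

By construction the rows of R_new are orthonormal and its first row points along the baseline, which is exactly the condition for horizontal epipolar lines.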


Once R′ is obtained, the term [M′ | q′] is also known. Then, a transformation T that relates the old projection matrix to the new one (P → P′) needs to be computed:

T = M′M⁻¹.    (3.18)

The transformation T is applied to the original images I_VS and I_LWIR in order to rectify them. Since the rectified pixel grid differs from the original one, the new pixel values are obtained by interpolation. Figure 3.8 shows some results of the proposed method. These images were acquired by the stereo head shown in Fig. 3.3(c)-(d). Since the LWIR camera is not attached to the rest of the stereo rig, the principal axes of the cameras are unaligned; hence the rectified images exhibit a strong distortion (Fig. 3.8(c)-(d)). The rectification error of this method is less than one pixel, despite the low resolution of the LWIR image (164 × 129 pixels) in contrast to the resolution of I_VS, which is 640 × 480 pixels.
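The warping step can be sketched as an inverse mapping: every rectified pixel looks up its pre-image under T⁻¹ (nearest-neighbour resampling is used here for brevity, whereas the thesis interpolates; the function name is ours):

```python
import numpy as np

def warp_homography(img, T, out_shape):
    """Apply the rectifying transform T of Eq. (3.18) by inverse mapping."""
    H, W = out_shape
    Tinv = np.linalg.inv(T)
    ys, xs = np.mgrid[0:H, 0:W]
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])
    src = Tinv @ dst                            # pre-image of every output pixel
    sx = np.rint(src[0] / src[2]).astype(int)
    sy = np.rint(src[1] / src[2]).astype(int)
    ok = (sx >= 0) & (sx < img.shape[1]) & (sy >= 0) & (sy < img.shape[0])
    out = np.zeros(out_shape)
    out.ravel()[ok] = img[sy[ok], sx[ok]]       # nearest-neighbour resampling
    return out
```

With T equal to the identity the image is returned unchanged; a pure translation shifts it, with out-of-range pixels filled with zeros, which is the "loss and creation of pixels" the resampling discussion in Sect. 3.6.3 refers to.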

Figure 3.8: Example of rectified images: (a)-(b) LWIR and VS images and (c)-(d) rectified images.

3.5 Depth Computation from Disparity Maps

Once the multimodal stereo head has been rectified, its geometry becomes non-verged (i.e., the two cameras have parallel principal axes and aligned y axes) [26]. The depth of a point in the scene is given by:

z = fT / d,    (3.19)

where d is the disparity, T is the baseline of the stereo head, and f is the new focal length of the rectified multimodal stereo head. In general, the accuracy of a stereo vision system is difficult to quantify, since it depends on both its calibration and the image matching. However, the calibration process followed in this dissertation minimizes the reprojection error until a standard deviation of about 0.12 pixels is obtained. Figure 3.9 shows an example of a disparity map and its corresponding depth map.
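Eq. (3.19) in code, with made-up rig values (the focal length and baseline below are illustrative, not the thesis rig's):

```python
import numpy as np

# Eq. (3.19): z = f*T/d
f = 500.0                                # rectified focal length [px]
T = 0.12                                 # baseline [m]
d = np.array([60.0, 30.0, 15.0])         # disparities [px]
z = f * T / d                            # depths [m]: [1.0, 2.0, 4.0]
```

Note the inverse relation: halving the disparity doubles the depth, which is why depth resolution degrades quickly for distant structure.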


Figure 3.9: Depth computation from a disparity map: (a) disparity map and (b)-(c) depth maps in 3D.

3.6 Multimodal Evaluation Datasets

3.6.1 OSU Color-Thermal Database

The OTCBVS dataset is a publicly available4 benchmark of non-visible (e.g., infrared) images and videos for testing and evaluating novel and state-of-the-art multimodal algorithms. This benchmark contains videos and images recorded in and beyond the visible spectrum (VS, NIR, and LWIR); it is divided into seven subgroups: OSU thermal pedestrian database, IRIS thermal/visible face database, OSU color-thermal database, Terravic facial IR database, Terravic motion IR database, Terravic weapon IR database, and CBSR NIR face dataset. These subgroups are intended to cover topics such as person detection in LWIR; face recognition and analysis under variable illumination, expressions, and poses (VS, NIR, and LWIR); fusion of VS and LWIR; multimodal detection and tracking; weapon detection; and weapon discharge detection.

Despite the large collection of data, the only database suited to the multimodal problem is the OSU color-thermal database. It consists of six sequences of LWIR and VS images, grabbed at two different locations (three at each location). Every sequence contains on average 2500 pairs of frames (LWIR/VS). Figure 3.10 shows one of these multimodal sequences. This dataset is particularly interesting for evaluating template matching algorithms, because the images are aligned. Thus, two corresponding points can easily be found by assuming a registration error smaller than 2 pixels.

Although these images are already registered, they are useful, among other things, for evaluating LWIR/VS cost functions when the effects of projective distortion (different views) and occlusions are negligible. In general, the sequences depict a static background where several people are walking (pathway intersections). Given that the sequences were recorded at different times of day, there are variations in temperature, moving cloud shadows, and shadows cast by static objects that change between sequences. The cameras are mounted adjacent to each other on tripods at two locations approximately three floors above ground, providing a global view of the scene. VS and LWIR images were registered using a homography computed from manually selected points. More details about the hardware are provided in Table 3.2.

4. http://www.cse.ohio-state.edu/otcbvs-bench/


Figure 3.10: OSU color-thermal database: (a)-(d) VS images and (e)-(h) LWIR images.

Table 3.2: Specifications of OSU color-thermal database

Sensor details

  Feature            LWIR                    VS
  Reference          Raytheon PalmIR 250D    Sony TRV87 Handycam
  Focal length       25 mm
  Bit-depth          8 bits                  24 bits
  Image resolution   320 × 240
  Sample rate        ~30 Hz

3.6.2 CVC-01 Multimodal Urban Dataset: Generic Surfaces

This multimodal stereo dataset consists of four kinds of images, classified by their context and predominant geometry: (i) roads; (ii) facades; (iii) smooth surfaces; and (iv) the OSU Color-Thermal dataset. The first three groups were acquired with the proposed stereo head and contain outdoor scenarios with one or multiple planes and smooth surfaces. The last subset corresponds to the OSU Color-Thermal Database (Sect. 3.6.1). Although its images are aligned (i.e., without disparity), ground truth data in the form of disparity maps can be obtained if a registration accuracy of ±2 pixels is assumed. Table 3.3 shows an illustration of the whole dataset.

The multimodal stereo images in the dataset have been enriched with ground truth disparity maps and semi-automatically generated ground truth depth maps. These ground truths were obtained by fitting planes to the 3D data points obtained from the PGB stereo head. It works as follows. First, a color image from the left PGB camera is manually segmented into a set of planar regions (see Fig. 3.11(a)). Planar regions are easily identified, since they are the predominant surfaces in the considered outdoor scenarios. Then, every region is independently fitted with a plane using its corresponding


3D data points, by orthogonal regression using principal component analysis. Figure 3.11(c) shows an illustration of the synthetic 3D representation containing the different planes. Additionally, during this semi-automatic ground truth generation process, labels for occluded, valid, and unavailable pixels are obtained (see Fig. 3.11(b)).
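The orthogonal-regression plane fit can be sketched as follows (an illustrative implementation under our own naming, not the thesis code):

```python
import numpy as np

def fit_plane_pca(pts):
    """Orthogonal-regression plane fit (total least squares via PCA).

    Returns (centroid, unit normal); the normal is the direction of
    smallest variance, i.e. the last right singular vector of the
    centred data.
    """
    c = pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(pts - c)
    return c, Vt[-1]

# Synthetic check: 3D points lying on the plane z = 1 + 0.2x - 0.1y.
rng = np.random.default_rng(1)
xy = rng.uniform(-1, 1, size=(50, 2))
pts = np.column_stack([xy, 1.0 + 0.2 * xy[:, 0] - 0.1 * xy[:, 1]])
c, n = fit_plane_pca(pts)    # n is (anti)parallel to (0.2, -0.1, -1)
```

Unlike ordinary least squares on z, this minimizes the perpendicular distance of the 3D points to the plane, which is the appropriate criterion when all three coordinates are noisy stereo reconstructions.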

Figure 3.11: CVC-01 Urban Dataset: (a) Facade image from the proposed dataset overlapped with a mask from the segmentation. (b) Mask of regions with occluded and no depth information. (c) Synthetic 3D representation generated from the visible stereo pair, used as ground truth for evaluating the multispectral depth map.

Once the 3D planes for a given image are obtained, and since they are referred to the VS camera, the corresponding data points are projected onto the infrared camera. Thus, a ground truth disparity map is obtained. The fourth column in Table 3.3 shows some of these disparity maps and a sparse 3D representation.

In the case of smooth surfaces (e.g., third row in Table 3.3) no planes are fitted, and the depth information provided by PGB is used as a reference. The PGB software offers a trade-off between the density and the accuracy of the data points. Hence, in order to have a good representation, its parameters have been tuned so that the 3D models are dense enough and contain little noise. Strictly speaking, those models should not be considered ground truth; however, we use them as a baseline for qualitative comparisons. Finally, all this material (VS and LWIR images together with their corresponding disparity maps and 3D ground truth models) is available through our website5 for automatic evaluation and comparison of LWIR/VS stereo algorithms. To the best of our knowledge, no dataset of this kind exists in the research community.

3.6.3 CVC-02 Multimodal Urban Dataset: Planar Surfaces

In order to evaluate multimodal stereo methods, a set of multimodal images has been collected. The captured data depict a variety of urban scenes, which include buildings, sidewalks, trees, and vehicles, among others (see Table 3.4). It consists of 116 scenes. Every scene includes its corresponding thermal infrared and color images, which were acquired using the proposed multimodal stereo head and processed until a rectified pair was obtained. Moreover, each scene includes both a disparity map and a hand-annotated map of planar regions.

The disparity maps are provided by a VS/VS stereo vision system (PGB). Note

5. http://www.cvc.uab.es/adas/datasets/cvc-multimodal-stereo


Table 3.3: Illustration of the four subsets of images contained in the proposedmultispectral dataset.

  Subset                           Pairs of images
  Roads                            2
  Facades                          2
  Smooth surfaces                  19
  OSU Color-Thermal Dataset [39]   2

  (The I_VS, I_LWIR, and disparity & depth columns of the original table contain image thumbnails.)

that, for the sake of simplicity, the multispectral stereo head was presented as an independent system; however, it uses one of the cameras that belong to PGB (the right one). In other words, the camera referred to as VS in the current section corresponds to the right camera of PGB.

Additionally, our dataset includes a group of maps that indicate connected planar regions. These maps have been hand-labeled taking into account the geometry of the surfaces; thus, a unique label is assigned to each region, identifying the pixels that belong to the same plane. Table 3.4 shows some of the images in the dataset. I_LWIR and I_VS images are rectified; in both cases the size of the resulting images is 506 × 408 pixels. Hand-labeled and disparity maps are given at their original resolution of 640 × 480 pixels. Since the disparity maps provided by PGB are only accurate in textured regions, we have used the hand-labeled planar regions to obtain dense and accurate representations, particularly in textureless and noisy regions. To address this problem, we fit a plane to each hand-annotated region, using the image coordinates and corresponding disparity values. The disparity maps resulting from this user-supervised labelling process are shown in Table 3.4.

Image rectification is a critical issue, since this dataset is intended for multimodal stereo algorithms that assume perfectly parallel epipolar lines. So, regardless of the accuracy with which the intrinsic and extrinsic parameters were estimated, it is essential to use a rectification method that reduces the loss and creation of pixels caused by the projective transformations applied during rectification (resampling effect), while preserving the appearance of the image content. Therefore, this dataset was rectified using an adapted version of the method proposed by Mallon et al. [116].


Table 3.4: Some examples of the evaluation data set.

(Columns: I_VS, I_LWIR, maps of planar regions, synthesized disparity maps, and PGB disparity maps.)


Chapter 4

Similarity Window-based Matching Cost Functions

The aim of this chapter is to present a cost function that measures the similarity between two blocks of data extracted from different modalities, particularly VS and LWIR images. This function combines mutual information with a gradient-based shape descriptor. These metrics are computed at every level of a scale-space representation and then propagated through that representation following a coarse-to-fine strategy. The main contributions of this chapter are: (i) a complete description of the proposed multimodal cost function and its relationship with information theory; (ii) a quantitative evaluation of the combined use of gradient and mutual information, and of the benefit of propagating them between consecutive levels; and (iii) the establishment of principles toward a stereo multimodal cost function.

4.1 Introduction

The coexistence of infrared cameras with other sensors has opened new perspectives for the development of multimodal systems. One of the challenges is to find the best way to fuse all this information into a useful representation. In the current chapter, the problem of corresponding two blocks of data belonging to different spectral bands is considered (the multimodal correspondence problem). These data are provided by two cameras capable of measuring emissions in the visible and thermal infrared spectra. Since the cameras are mounted adjacent to each other, offering a similar view of the objects in the scene, occlusions are negligible.

The literature on multimodal matching can be broadly divided into entropy-based methods and feature-based methods. Throughout this chapter a hybrid approach is proposed, which exploits mutual information and gradient information in scale-space representations.



Mutual information is a concept derived from information theory; it measures the amount of information that one random variable contains about another. It is a powerful concept in situations where no prior relationship between the data is known. This property makes mutual information an ideal tool for addressing problems involving signals without an apparent relationship [61].

Viola et al. [161] intuitively introduced mutual information as a measure of alignment between images and 3D models; it was later formalized in [115], for images only. The importance of these early contributions was to identify the main properties of mutual information in the field of multimodal image processing, and its usability. A few years later, Egnal proposed in [43] a cost function for VS/VS stereo based on mutual information. He assumed that two data blocks correspond if the amount of shared information is maximal; that is, the correspondence between a window (template) and a set of candidate windows (search space) can be found by maximizing mutual information. Although its performance is not better than that of other, less complex cost functions [136, 149], it has been shown to be robust to radiometric differences [73].
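As an illustration of the measure itself, MI between two image blocks can be computed from their joint grey-level histogram (a minimal sketch under our own naming, not the thesis implementation):

```python
import numpy as np

def mutual_information(a, b, bins=16):
    """MI(A;B) = sum p(a,b) * log( p(a,b) / (p(a) p(b)) ).

    Estimated from a joint grey-level histogram: high when the two
    blocks share structure, even under a nonlinear intensity mapping
    between modalities.
    """
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pab = joint / joint.sum()
    pa, pb = pab.sum(axis=1), pab.sum(axis=0)
    nz = pab > 0
    return float((pab[nz] * np.log(pab[nz] / np.outer(pa, pb)[nz])).sum())

rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(32, 32)).astype(float)
mi_dep = mutual_information(a, 255.0 - a)             # inverted copy: dependent
mi_ind = mutual_information(a, rng.random((32, 32)))  # unrelated block
# mi_dep is far larger than mi_ind
```

The inverted-copy case mimics the polarity flips seen between LWIR and VS: MI stays high because it only asks whether one block is predictable from the other, not whether their intensities agree.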

Mutual information has also been widely used for medical image registration. In this field, Pluim et al. [125] propose combining mutual and gradient information, showing an improvement with respect to classical mutual-information-based formulations [115, 148]. Approaches that combine multiresolution schemes with mutual information have also been proposed for local matching in medical imaging. An advantage of these methods is information suppression, which allows analyzing the structure of the images at different levels of detail. Thus, the information of the current level can be enriched using the prior knowledge collected from previous levels in the hierarchy [99, 126]. A strategy similar to the one mentioned above is presented in [52], although restricted to the propagation of the joint probability obtained from two patches. In the current chapter, not only mutual information (MI) but also gradient information (GI) is propagated through two different scale-space representations: the first is based on a scale-space stack, while the second is a pyramidal representation.

In summary, this chapter presents a quantitative evaluation of performance when MI and GI are considered together. Furthermore, a comparison of the discriminative power of scale-space representations based on stacks and pyramids, and the advantages of propagating mutual and gradient information, are presented. The proposed approach is evaluated with a large number of experiments; to the best of our knowledge, previous works were, on the one hand, specially devoted to the registration problem and, on the other hand, validated on few samples. The rest of this chapter is organized as follows. Section 4.3 presents the theoretical principles of the proposed multimodal cost function as well as its relationship with information theory (entropy and mutual information). Experimental results and comparisons are given in Sect. 4.4. Finally, conclusions and final remarks are drawn in Sect. 4.5.


4.2 Matching Cost Volume

The matching cost volume is a three-dimensional array that stores the cost of corresponding two square windows obtained from a pair of multimodal images. This volume is obtained following a local window-based approach, which consists in computing a cost for each displacement of a sliding window while a second window is kept fixed on a point in the reference image. The cost volume is referred to as C(p, d), where p = (x, y) is the point in the reference image (I_VS) and d is the disparity, that is, the displacement of the sliding window measured in pixels. The point p corresponds to the center of the square window, of size wz, placed on I_VS(x, y), whereas d represents the location of the sliding window in I_LWIR. Specifically, the latter is a window of the same size as the previous one but centered on I_LWIR(x + d, y). Notice that the sliding window location can be parametrized by the coordinates of the reference window and d, since the multimodal images are rectified. Finally, the search space is defined as an interval [dmin, dmax] that contains all possible values of d. Figure 4.1 shows how a cost c(x, y, d) is indexed in C(p, d), together with the windows (i.e., reference and sliding windows) used for its calculation.

Figure 4.1: Multimodal Matching Cost Volume: a reference window centered at p = (x, y) in I_VS, a sliding window centered at x + d in I_LWIR, and the cost c(x, y, d) stored in C(p, d) over the disparity range [dmin, dmax].

Based on this representation of the costs, the LWIR/VS correspondence problem can be stated as an optimization problem. Formally, let C(p, d) be a cost volume. Then, the LWIR/VS correspondence problem is equivalent to optimizing d for a given p in a space of candidate solutions C, subject to a set of constraints J. The sought solution D is given by:

D = argmax_d C(p, d),    (4.1)

subject to the constraints J.
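For illustration, the cost-volume construction and the winner-take-all reading of Eq. (4.1) can be sketched as follows (the window score is a pluggable placeholder: the thesis's actual score combines MI and GI, and all names here are ours):

```python
import numpy as np

def cost_volume_disparity(ref, tgt, wz, dmin, dmax, score):
    """Fill C(p, d) over a rectified pair and take Eq. (4.1)'s argmax.

    ref, tgt: rectified reference (I_VS) and target (I_LWIR) images;
    wz: odd window size; score(w1, w2): similarity, higher is better.
    """
    h, w = ref.shape
    r = wz // 2
    C = np.full((h, w, dmax - dmin + 1), -np.inf)
    for y in range(r, h - r):
        for x in range(r, w - r):
            win = ref[y - r:y + r + 1, x - r:x + r + 1]  # reference window
            for i, d in enumerate(range(dmin, dmax + 1)):
                xc = x + d                               # sliding window centre
                if r <= xc < w - r:
                    C[y, x, i] = score(win, tgt[y - r:y + r + 1,
                                                xc - r:xc + r + 1])
    return dmin + np.argmax(C, axis=2)   # winner-take-all disparity D

# Toy rectified pair whose true disparity is 3 pixels everywhere;
# a negative-SAD score stands in for the thesis's MI + GI function.
ref = np.random.default_rng(3).random((20, 30))
tgt = np.roll(ref, 3, axis=1)
D = cost_volume_disparity(ref, tgt, 5, 0, 6,
                          lambda a, b: -np.abs(a - b).sum())
# interior pixels of D recover the true disparity, 3
```

Swapping the score function (and adding the constraints J and a global optimizer over C) is exactly how the decomposition described above separates the three parts of the problem.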

This representation decomposes the LWIR/VS correspondence problem into three parts: cost function, constraints, and optimization framework. Thus, all assumptions made during the matching process are translated into the form of a cost function and a set of constraints. Computationally, C(p, d) is a discrete array. However, it is generated from


a cost function, which measures the degree of similarity between a pair of windows. This cost function establishes a relationship of the form f : I_VS(p) → I_LWIR(x + d, y) that relates two similar data blocks. In our multimodal framework, the cost function reflects a condition that can be relaxed but whose value should be optimized. Common cost functions in VS/VS stereo are: (i) distance-based, such as AD, SAD, SSD, and the L2 norm; (ii) correlation-based, such as NCC, MNCC, and PC; (iii) locally normalized, such as znSSB; (iv) filter-based, such as LoG, Rank, and Mean; (v) transformation-based, such as Census; (vi) information-content-based, such as MI and HMI; and (vii) sampling-insensitive, such as BT. Leaving aside those functions based on mutual information, the remaining ones are useless for multimodal matching, since they indirectly assume radiometric consistency (brightness constancy).

The second part is the set of constraints. They are conditions that must be strictly met, since they prune the space of possible solutions to the most likely ones and reduce ambiguities. Examples of usual constraints are ordering [122], monotonicity [57], epipolar [111], uniqueness [117], visibility [81], and smoothness [21, 131].

The last part is the optimization framework used to solve the optimization problem. Usual optimization methods are bipartite graph matching [48], dynamic programming [85], graph search [131], convex minimization [113], simulated annealing [89], greedy algorithms [16], relaxation [7], semiglobal matching [70], and graph cuts [21].

Another advantage of the formulation presented in Eq. (4.1) is that it integrates into a single representation most of the assumptions commonly used in VS/VS stereo [149]; it is also possible to integrate processes such as feature selection and correspondence, as presented in [114]. Furthermore, sparse and dense solutions are computed in a similar way, simply by handling the cost volume.

4.3 Thermal Infrared and Visible Matching Cost Functions

The definition of a cost function able to find correct correspondences between the information provided by VS and LWIR cameras is a challenging task due to their low correlation. Several studies have confirmed this fact through different dependence tests. For instance, Scribner et al. [139] use a statistical analysis based on correlation ratios. They divide the wavelengths from 0.36 µm to 12 µm into 16 bands (approximately from VS to LWIR) and then compute their corresponding cross coefficients. As a result of this analysis, the VS and LWIR bands turn out to be only weakly correlated (η = 0.17). Another comprehensive analysis of the correlation between thermal infrared and visible images is performed by Morris et al. [120]. They employed joint distributions of wavelet coefficients as a correlation measure. Their conclusions were no different from those presented by Scribner. However, three characteristics of LWIR images are stated: (i) their power spectra are better fitted by a generalized Laplacian curve (as opposed to VS, which is logarithmic), a behavior caused by the lack of texture; (ii) these images capture the shape of objects while discarding much of their texture, suggesting that they may closely resemble intrinsic images; (iii) object boundaries


in these images are strongly correlated with VS boundaries. These conclusions are taken into account in order to propose a LWIR/VS cost function.

Recently, research in LWIR/VS correspondence has focused on recycling cost functions employed in similar problems, particularly VS/VS matching, multispectral registration, and multimodal tracking. Naive attempts to put thermal infrared and visible images in correspondence by assuming linear relationships between pixel values, with similarity metrics such as AD, SAD, SSD, and NCC, to name a few, end in suboptimal results. In general, these investigations conclude that either these metrics are inadequate and require additional techniques to guarantee consistency, or they are only valid for nearby spectral bands [105, 154, 170].

Following this trend, Egnal et al. [43] compare mutual information with a modified NCC. Although their early results were never published in an indexed journal, they reveal the strengths and weaknesses of MI as a VS/VS cost function. Not long afterwards, Hirschmuller et al. [73] investigated its invariance to radiometric differences. That study concludes that mutual information is a cost function able to handle nonlinearly correlated signals. However, we have experimentally found that although it is an efficient similarity measure for multispectral images, its formulation ignores shape information, which is the most correlated feature between thermal infrared and visible images.

Krotosky et al. [95] investigate a cost function based only on mutual information combined with a voting scheme. They extract rectangular windows centered on the surroundings of hot spots; in their application domain, these points correspond to candidate human body silhouettes. Although this approach is valid for tracking people or estimating the depth of pedestrians, it is an application-oriented solution. Furthermore, it assumes that hot points correspond to persons, and that only those regions should be matched.

This section presents the key concepts involved in the definition of the proposed multimodal cost functions. Initially, they are presented as a single equation. Then, their components are individually introduced: (i) mutual information; (ii) gradient information; and (iii) their combination (i.e., mutual and gradient information). Finally, their adaptation to a multi-resolution context is described.

One of our strategies for improving the discriminative power of MI and GI is a scale-space representation. Note that a similar scheme is proposed in [52], but propagating joint probabilities (p(ai, bi)). In contrast, the proposed cost function directly spreads the MI and GI between adjacent levels, allowing the size of the windows and the joint probability function to change. This is a great advantage because an optimal set of parameters can be obtained at each level.

The following cost functions are motivated by the observation of Fig. 4.2. Although LWIR does not correspond linearly to VS at the pixel level, previous research carried out on similar multispectral problems showed that nonlinear relationships can be exploited (e.g., [126, 52]). These investigations provide evidence that mutual information may work for LWIR/VS matching. Furthermore, in the same figure we can see that edges seem to be a relevant feature present in both modalities. These


42 SIMILARITY WINDOW-BASED MATCHING COST FUNCTION


Figure 4.2: Expected multimodal correspondences: (a) IV S; (b) |∇1(IV S)|; (c) ILWIR; and (d) |∇1(ILWIR)|.

facts motivate us to combine them into a novel cost function. Finally, a scale-space representation adds robustness and spreads local matches from coarser to finer scales, increasing the number of final correspondences. The proposed LWIR/VS cost function is defined as a linear combination of CMI and CGI weighted by a confidence value:

C(p, d) = [\lambda_0, \ldots, \lambda_t] \cdot \Big( \underbrace{C_{MI}(p, d) \cdot \mathrm{diag}\big(C_{GI}(p, d)\big)}_{C_{MGI}} \Big)^{T},    (4.2)

where λt is the confidence assigned to the measurements of mutual and gradient information at scale t, and CMGI is the vector of costs that results from merging CMI and CGI. As can be seen, the gradient information is combined with the mutual information through their product. Although there are other fusion rules, this one was selected because it satisfies the properties of monotonicity, continuity, and stability under linear transformations, while reducing the dimensionality of the input space. Moreover, it has a noise cancellation effect: low costs are obtained when textureless LWIR and textured VS regions are put in correspondence, and vice versa. The CMI and CGI values are scaled locally to the range [0, 1] before their fusion, that is, along the disparity axis of the cost volume representation (Fig. 4.1). The matching cost function presented above is used in a scale-space representation scheme, so the values of CMI and CGI are computed at all scales as indicated below:

C_{MI}(p, d) = \big[\, MI\big(\nabla_0^{0}(I(p, d))\big), \ldots, MI\big(\nabla_0^{t}(I(p, d))\big) \,\big],    (4.3)

C_{GI}(p, d) = \big[\, GI\big(\nabla_1^{0}(I(p, d))\big), \ldots, GI\big(\nabla_1^{t}(I(p, d))\big) \,\big],    (4.4)

where ∇_0^t stands for the convolution of an image I with a Gaussian derivative kernel of order zero (smoothing) at scale t, and ∇_1^t for the convolution with a Gaussian derivative kernel of order one (derivative operator) at scale t. Previous works have shown that gradient information by itself is not enough to find the right correspondences between LWIR and VS images when a stereo problem is considered [154, 125], since gradient orientations are useful only in the range [0, π] (i.e., half the range). Therefore, MI helps GI to overcome this loss of descriptiveness. The next subsections present in detail the computation of MI and GI needed in Eq. (4.3) and Eq. (4.4).
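Before turning to those components, the overall fusion of Eq. (4.2) can be sketched for a single candidate disparity as a λ-weighted sum of per-scale MI·GI products. The following minimal numpy sketch uses our own function and variable names (`combined_cost`, `c_mi`, `c_gi` are not from the thesis) and assumes the per-scale costs have already been scaled to [0, 1]:

```python
import numpy as np

def combined_cost(c_mi, c_gi, lambdas):
    """Sketch of Eq. (4.2) for ONE candidate disparity.

    c_mi, c_gi : per-scale MI and GI costs, already scaled to [0, 1]
                 along the disparity axis, as described in the text.
    lambdas    : confidence vector [lambda_0, ..., lambda_t].
    All names here are ours, not the thesis implementation.
    """
    c_mi = np.asarray(c_mi, dtype=float)
    c_gi = np.asarray(c_gi, dtype=float)
    # The element-wise product realizes C_MI . diag(C_GI) for a single
    # candidate: a low MI or a low GI cancels the other (noise cancellation).
    c_mgi = c_mi * c_gi
    return float(np.dot(np.asarray(lambdas, dtype=float), c_mgi))

# Example with the confidence vector reported later in Sect. 4.4:
c = combined_cost([0.8, 0.5, 0.3], [0.9, 0.4, 0.2], [0.45, 0.30, 0.25])
# c = 0.45*0.72 + 0.30*0.20 + 0.25*0.06 = 0.399
```

The product fusion makes the per-scale weighting explicit: a scale only contributes when both MI and GI agree that the candidate is plausible.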



4.3.1 Mutual Information

Mutual information has been shown to be a valid cost function in several multimodal problems: for instance, medical imaging registration [115, 125], LWIR/VS video registration [95], medical imaging matching [126], and LWIR/VS matching [96]. The results obtained by these studies provide enough evidence to justify its use in this kind of problem. Therefore, we formalize the multimodal correspondence problem in the context of a more general theory, namely information theory. For this purpose, it is first necessary to define a probability space (Ω, B, P). Formally speaking, this space is defined by Ω, which denotes a sample space; B, a space of events defined as a σ-field of subsets of Ω; and a probability measure P that assigns a real number P(F) to every member F of the σ-field B [61]. In our case, ILWIR and IV S are considered as random variables that map thermal infrared or visible measurements (symbols) into finite alphabets A. Thus, ILWIR : Ω → ALWIR and IV S : Ω → AV S.

Given that the LWIR/VS images are handled as two information sources, and these produce a succession of symbols in a random manner, the probability space (Ω, B, P) is a mathematical generalization of the interaction between: (i) symbols, which can be intensity or thermal infrared measurements; (ii) the space of all possible output symbols, grouped in blocks; (iii) events, which are sequences of symbols that can be drawn by the information sources; and (iv) a probability measure assigned to every event.

By definition, mutual information (MI) measures the amount of information that one random variable contains about another [36, 61]. It is a useful concept when no prior relationship between the data is known. It is estimated in a local way, for two square windows of size wz × wz pixels centered on p = (x, y) and (x + d, y). Thus, MI is defined as follows:

MI(p, d) = \sum_{a_i,\, b_j} p(a_i, b_j) \,\log \frac{p(a_i, b_j)}{p(a_i)\, p(b_j)},    (4.5)

where ai and bj are discretized pixel values within the windows (ai ∈ AV S and bj ∈ ALWIR); p(ai, bj) represents their joint probability mass function; and p(ai) and p(bj) are their respective marginal probability mass functions. The alphabets AV S and ALWIR are built by normalizing each window independently (to the range [0, 1]) and then quantizing it into Q levels. The joint probability mass function p(ai, bj) is a two-dimensional matrix whose values correspond to the probability that a pair (ai, bj) occurs. The marginal probabilities are determined by summing along each dimension of this matrix.
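Eq. (4.5), together with the alphabet construction just described, can be sketched in a few lines of numpy. This is a minimal illustration under our own naming (`mutual_information`, parameter `q`), not the thesis implementation:

```python
import numpy as np

def mutual_information(win_vs, win_lwir, q=16):
    """Sketch of Eq. (4.5): MI between two co-located windows.

    Each window is normalized independently to [0, 1] and quantized
    into q levels, as described in the text; names are ours.
    """
    def quantize(w):
        w = np.asarray(w, dtype=float)
        rng = w.max() - w.min()
        w = (w - w.min()) / rng if rng > 0 else np.zeros_like(w)
        return np.minimum((w * q).astype(int), q - 1)

    a = quantize(win_vs).ravel()
    b = quantize(win_lwir).ravel()

    # Joint probability mass function p(a_i, b_j) as a Q x Q matrix.
    joint = np.zeros((q, q))
    for ai, bj in zip(a, b):
        joint[ai, bj] += 1
    joint /= joint.sum()

    # Marginals are obtained by summing along each dimension.
    pa = joint.sum(axis=1)
    pb = joint.sum(axis=0)

    mi = 0.0
    for i in range(q):
        for j in range(q):
            if joint[i, j] > 0:
                mi += joint[i, j] * np.log(joint[i, j] / (pa[i] * pb[j]))
    return mi
```

Note that inverting one window's intensities (a bijective relabeling of its alphabet) leaves the MI unchanged, which is exactly the invariance discussed next.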

An additional explanation of why mutual information is a valid LWIR/VS cost function arises when its boundary conditions are analyzed. Formally, mutual information is bounded to the range [0, min(MI(p, p), MI(d, d))] [43]; the minimum value occurs when ai and bj are completely independent, and the maximum occurs when these symbols are either identical or affected by a transformation T that generates an equivalent joint probability matrix. The latter boundary condition deserves special attention because it justifies the performance of mutual information as a cost function. Notice that MI is maximized by more than one combination of ai and bj; therefore, there exists a set of transformations T performing a one-to-one mapping such that MI(p, T(p)) = MI(p, p) or MI(d, T(d)) = MI(d, d). This invariance gives MI a great advantage over other similarity functions, which look for identical patterns as is common in VS/VS stereo. It enables MI to measure similarity in more situations, particularly when the underlying relation between the symbols ai and bj is nonlinear.

Figure 4.3: Toy example of MI and SAD: (a) patch P; (b) searching space S; (c) mutual information; and (d) sum of absolute differences.

In order to discuss the capability of mutual information as an LWIR/VS cost function, let us consider a simple scenario like the one depicted in Fig. 4.3, where a patch denoted by P is searched in an image. In this toy example, a random block of 4 × 1 pixels is discretized into 3 levels, and then compared with other blocks of the same size extracted from the searching space. This patch is built from an alphabet of 3 symbols, AP = {a0, a1, a2}, as can be seen in Fig. 4.3(a). In contrast, the searching space S contains all possible combinations of AP (i.e., 3^4 = 81), as shown in Fig. 4.3(b). In order to find the right matches, two cost functions are employed: (i) mutual information (MI) and (ii) the Sum of Absolute Differences (SAD). Their resulting costs are plotted in Fig. 4.3(c) and Fig. 4.3(d), respectively. Potential right matches are all those 4 × 1 blocks in the searching space that optimize one of the cost functions; in the case of MI they are argmax_d(MI(P, S(d))), while for SAD they are argmin_d(SAD(P, S(d))).

It can be observed that the SAD function (Fig. 4.3(d)) is minimized at just one position (d = 43), namely when a patch identical to P is found in S. On the contrary, MI is maximized several times, six in total, at positions d = 6, 8, 39, 43, 74, 76. This is due to the fact that the patches at these locations produce equivalent joint probability distributions and, in consequence, equal MI costs. Notice that both the MI and SAD costs are normalized to the range [0, 1]. This simple example illustrates, on the one hand, the main advantage of MI over SAD and other similarity functions that only reward identical patterns, such as NCC, SSD, and PC: an identical patch or pattern is not necessarily the right one, because thermal variations have no direct relationship with intensity variations. Mutual information is thus able to find linear and nonlinear correlations between patches, taking into account the whole dependence structure of the variables. On the other hand, in the context of mutual information it is clear that a balance between the number of quantization levels Q and the window size wz must be met for accurate solutions.
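The toy experiment can be reproduced in spirit with a few lines of numpy. The patch values below are hypothetical (the thesis does not list them); what matters is that P uses all three symbols, so every one-to-one relabeling of P yields the same joint distribution and hence the same MI:

```python
import numpy as np
from itertools import product

def discrete_mi(a, b):
    """MI between two symbol sequences (Eq. (4.5)) estimated from counts."""
    a, b = np.asarray(a), np.asarray(b)
    mi = 0.0
    for s in set(a.tolist()):
        for t in set(b.tolist()):
            p_ab = np.mean((a == s) & (b == t))
            p_a, p_b = np.mean(a == s), np.mean(b == t)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

# Patch P: 4 pixels over a 3-symbol alphabet {0, 1, 2}; the searching
# space S holds all 3**4 = 81 candidate blocks.
P = (0, 1, 2, 0)  # hypothetical values, chosen so P uses all 3 symbols
S = list(product(range(3), repeat=4))

mi = np.array([discrete_mi(P, s) for s in S])
sad = np.array([sum(abs(p - q) for p, q in zip(P, s)) for s in S])

# SAD is minimized exactly once (the identical block), while MI is
# maximized by every one-to-one relabeling of P: 3! = 6 blocks.
print(np.sum(sad == sad.min()))          # 1
print(np.sum(np.isclose(mi, mi.max())))  # 6
```

The six MI maxima here play the role of the six positions d = 6, 8, 39, 43, 74, 76 in Fig. 4.3(c).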

4.3.2 Gradient Information


Figure 4.4: Gradient vectors: (a) to (d) VS patches; and (e) to (h) LWIR patches.

This section explains how gradient information is incorporated into the proposed LWIR/VS cost function. We adopt the formulation presented in [125], and show that it is not only valid for three-dimensional medical image registration, as initially proposed, but is also effective for LWIR/VS correspondence problems. Figure 4.4 shows eight image patches and their corresponding gradient vectors. These patches are divided into two groups according to their modality: Fig. 4.4(a-d) depict intensity changes in the visible band, while Fig. 4.4(e-h) show thermal infrared variations. Every column depicts a patch of the same scene in a different spectral band. Since these patches are registered (column-wise, for example Fig. 4.4(a) and Fig. 4.4(e)), there exists a point-to-point correspondence between pixels and gradients. Every gradient in these figures is colored according to its norm: strong gradients are red while weak ones are blue (the gradient color coding is the vertical bar on the left side of each patch). Note that in these figures the green and red gradients seem somehow related. In fact, experiments conducted by Pluim et al. [125, 126] showed a high correlation between middle and strong gradients for medical imaging. They could thus assume that image locations with a strong gradient denote transitions, and that these transitions have high information content. We reach the same conclusions. However, it is necessary to clarify what types of transitions or edges comply with this assumption.

In general, an edge can be classified into three categories [58]: (i) shadow-geometry edges, (ii) highlight edges, and (iii) material edges. However, given the wide spectral separation between images coming from the thermal infrared and visible bands, this classification requires an additional characteristic, related to the concurrence of these events or phenomena in multispectral sensors. A phenomenon detectable in the IV S and ILWIR images is denominated simultaneous iff its bandwidth overlaps both the VS and LWIR bands; otherwise, it is a phenomenon typical of a single spectral band. Figure 4.4(a) and Fig. 4.4(e) show two shadow edges, one produced by a building (facade), and another by a walking pedestrian. The former appears in both modalities, and its gradients are correlated, whereas the latter is registered only by the VS camera. Shadow edges produced by dynamic bodies cannot be corresponded, because they are perceived by only one sensor (VS). Moreover, a fast-moving shadow, like the one produced by a pedestrian, does not alter the temperature of the covered surface; thus this phenomenon is only observable in the VS spectrum.

Highlight edges are another example of band-typical phenomena, because they are often perceived in different ways by each sensor. Commonly, highlights and shading in the VS spectrum are explained through the dichromatic reflection model [140]; however, this model is unsuitable for the LWIR band, because it ignores relevant thermal-radiation interactions between bodies and their environment. Hence, highlights are avoided during the elaboration of the datasets.

In contrast to the categories mentioned above, material edges, as shown in Fig. 4.4, are strongly correlated, for example: object boundaries, pedestrians, the sidewalk, and the lamppost. These kinds of features are the ones that provide useful information for matching, because they are simultaneous phenomena and are locally distinctive. In conclusion, only events or phenomena simultaneous in the LWIR and VS bands are candidates to be matched; the next sections show that they correspond to 40%-50% of the pixels in the image.

It is important to note that thermal infrared and intensity variations are not necessarily equal (neither in orientation nor in magnitude). However, since the images are rectified, not only is the search for correspondences simplified to one dimension, but the objects in the scene also appear with a similar aspect (see Fig. 4.4). This is an advantage because both patches depict the same gradient field; thus their gradient vectors can be corresponded pixel-to-pixel. Moreover, their phase difference should be close to 0 or π (phase or counter-phase). In this way, these vectors can be used to unveil possible matches [120]. Let x and x′ be two corresponding points belonging to ∇1(IV S) and ∇1(ILWIR), respectively, referred to local window coordinates. The gradient information is obtained as follows:

GI(p, d) = \sum_{x,\, x'} \underbrace{w(\theta(x, x'))}_{\text{orientation}} \; \underbrace{\min(|x|, |x'|)}_{\text{norm}},    (4.6)

where | · | is the norm of a gradient; θ(x, x′) is the angle between corresponding gradients; and w(θ) is a function that penalizes gradient orientations out of phase or counter-phase. The gradient information is computed similarly to MI, on two windows centered on ∇1(IV S(p)) and ∇1(ILWIR(q)); thus the cost volume CGI(p, d) is obtained by sliding them through the searching space defined by each p in the reference image. The phase difference of two gradient vectors at the locations x and x′ is defined as follows:

\theta(x, x') = \arccos\!\left( \frac{x \cdot x'}{|x|\,|x'|} \right).    (4.7)

Eq. (4.7) is weighted by a function w : θ → [0, 1] that penalizes those gradient vectors that are not in phase or counter-phase; this function is defined as follows:

w(\theta) = \frac{\cos(2\theta) + 1}{2}.    (4.8)
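Eqs. (4.6)-(4.8) can be combined into one short routine. This is a numpy sketch with our own function and argument names, operating on precomputed gradient components of two registered windows:

```python
import numpy as np

def gradient_information(gx_vs, gy_vs, gx_lwir, gy_lwir, eps=1e-12):
    """Sketch of Eqs. (4.6)-(4.8): gradient information between two
    registered windows, given their gradient components (names are ours)."""
    g1 = np.stack([np.ravel(gx_vs), np.ravel(gy_vs)], axis=1)
    g2 = np.stack([np.ravel(gx_lwir), np.ravel(gy_lwir)], axis=1)

    n1 = np.linalg.norm(g1, axis=1)
    n2 = np.linalg.norm(g2, axis=1)

    # Phase difference theta (Eq. 4.7); eps guards against zero norms.
    cos_t = np.sum(g1 * g2, axis=1) / np.maximum(n1 * n2, eps)
    theta = np.arccos(np.clip(cos_t, -1.0, 1.0))

    # w(theta) (Eq. 4.8) rewards in-phase and counter-phase vectors.
    w = (np.cos(2.0 * theta) + 1.0) / 2.0

    # Eq. (4.6): weighted sum of the smaller of the two norms.
    return np.sum(w * np.minimum(n1, n2))
```

With identical or exactly opposite gradient fields, w(θ) = 1 and GI reduces to the sum of the smaller norms; orthogonal fields contribute nothing, which matches the phase/counter-phase discussion above.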

4.3.3 Mutual and Gradient Information

The combination of mutual and gradient information is defined as:

C_{MGI}(p, d) = C_{MI}(p, d) \cdot \mathrm{diag}\big(C_{GI}(p, d)\big),    (4.9)

This equation is mainly used to compare the performance before and after the fusion of MI and GI. Hence, CMGI at a given scale t represents the cost value at that scale.

4.3.4 Scale-Space Context

This section presents the basic notions related to scale-space representations, which are used to build two data structures: first a scale-space stack representation, and then a pyramidal representation. Previous sections have highlighted the importance of using contours and edges in the matching process; therefore, in this section a structural analysis of images is presented. This requires a scale-space representation, which is obtained by applying local derivative operators. The aim is to recover significant information through the scale-space representation, and to boost the accuracy of the proposed cost functions at a fine level, using the previous coarse levels as feedback. To this end, the cost function presented above is adjusted to a multiresolution scheme: it works by taking the values of mutual and gradient information at a certain scale and propagating them following a coarse-to-fine strategy, from scale t + 1 to scale t.

Stack Representation

The stack representation is obtained by convolving an image (IV S or ILWIR) with a Gaussian derivative kernel of order n. Note that the zero scale is also included and corresponds to the given image. Following the notation presented in [109]:

\nabla_n^{t}(I_k) = g_n(\sigma_t) * I_k,    (4.10)




Figure 4.5: Three-level pyramidal representations: (a) ∇0(IV S); (b) ∇1(IV S); (c) ∇0(ILWIR); and (d) ∇1(ILWIR).

where t is an identifier of the current scale, Ik is a given image (IV S or ILWIR), and gn(σt) is a Gaussian derivative kernel of order n and standard deviation σt. If n = 0 the Gaussian function itself is obtained; otherwise, its corresponding derivative kernel. In this chapter, only gradient information is required; hence ∇0 and ∇1 are computed for IV S and ILWIR. This means a stack of Gaussian-blurred images and their corresponding first-order derivative images. Note that in the case of a stack all the images have the same size; thus, ∇_0^{t+1}(Ik(p, d)) and ∇_0^t(Ik(p, d)) have an interscale correspondence. Hereinafter, for compactness, a given level t of the scale-space representation is referred to by σt (the standard deviation of its Gaussian mask).
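A 1-D sketch of the stack of Eq. (4.10), with the Gaussian kernels of order zero and one built explicitly, may clarify the construction; function names are ours, and all levels keep the size of the input, as noted above:

```python
import numpy as np

def gaussian_kernel(sigma, order=0):
    """1-D Gaussian derivative kernel g_n(sigma), n in {0, 1} (Eq. 4.10)."""
    radius = int(4 * sigma + 0.5)
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    # Order 0: smoothing kernel; order 1: its first derivative.
    return g if order == 0 else -x / sigma ** 2 * g

def scale_space_stack(signal, sigmas, order=0):
    """Stack representation: one filtered copy of `signal` per scale t;
    every level has the same size (no subsampling)."""
    return [np.convolve(signal, gaussian_kernel(s, order), mode='same')
            for s in sigmas]
```

For a step-like signal, the order-one levels peak in magnitude at the edge position at every scale, which is the edge evidence exploited by CGI.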

Pyramidal Representation

Another way to generate a scale-space representation is by means of a pyramidal hierarchy, which is similar to the method described above. It consists in adding a new stage after the Gaussian filtering that applies a downscaling algorithm, sampling the output image at a constant rate. In this chapter, we have explored the use of a half-octave Gaussian pyramid of zero and first order [37]. This representation has been chosen due to its reduction factor; hence it assures an optimal propagation of mutual information. Figure 4.5 shows two pyramidal representations of three levels; (a) and (b) correspond to an intensity image, while (c) and (d) to an infrared image. Note that in the two coarser levels the image features are still visible, in spite of their small sizes. See [109, 37, 99] for a detailed description of pyramidal representations.

In the case of a pyramidal representation, level t + 1 contains a smaller image than the one at the current level t (downsampling). Therefore, two situations must be considered: (i) if p is not present in the previous level, only its value at the current level is considered (MI or GI) and the term λ in Eq. (4.2) is set to 1; and (ii) if p is present in the previous level, then a cubic spline interpolation is used to compute its value of MI or GI at the previous level (prior). Since we are using rectified images, only ancestors on the epipolar line are considered (a one-dimensional interpolation problem); thus priors are obtained from the point's neighborhood, but at scale t + 1.

4.4 Experimental Results

In order to evaluate the proposed approach, small parts of thermal infrared and color images were cropped and used as templates; in total, 61600 patterns per image were extracted from the OTCBVS Benchmark Dataset (Sect. 3.6.1). Then, all the cost functions introduced above were computed between the reference template and all possible windows in its corresponding searching space, which is defined without disparity restrictions. Thus, the correct match is assumed to be at the point d where the cost function reaches its maximum value, as indicated below:

D = \operatorname*{argmax}_{d} \big( C(p, d) \big).    (4.11)

During the experiments, four cost volumes are optimized in order to compare them: CMI, CGI, CMGI and C, obtained from Eq. (4.3), Eq. (4.4), Eq. (4.9), and Eq. (4.2), respectively. Once all matching costs between a template and its searching space are computed, the three largest local maxima are extracted (only three values were selected for the sake of a simple presentation). These values are used to illustrate how often a cost function is right. Since the VS and LWIR images in the OTCBVS dataset are registered, the correct matches are those candidate windows centered on (x + d, y) with −2 < d < 2. In this way, it is possible to determine which of the three maxima corresponds to the right match (i.e., a tolerance range of ±2 pixels).
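The evaluation protocol (three largest local maxima, ±2-pixel tolerance) can be sketched as follows; function names are ours and plateaus are handled naively:

```python
import numpy as np

def top_local_maxima(cost, k=3):
    """Indices of the k largest local maxima of a 1-D cost curve (ours)."""
    c = np.asarray(cost, dtype=float)
    peaks = [i for i in range(1, len(c) - 1)
             if c[i] > c[i - 1] and c[i] >= c[i + 1]]
    return sorted(peaks, key=lambda i: c[i], reverse=True)[:k]

def hit_rank(cost, d_true, tol=2):
    """Rank (1..3) of the first of the three largest maxima falling
    within +/- tol pixels of the true disparity, or None if all miss."""
    for rank, d in enumerate(top_local_maxima(cost), start=1):
        if abs(d - d_true) <= tol:
            return rank
    return None
```

A winner-takes-all decision corresponds to accepting only `hit_rank == 1`; the tables below also count the cases where the second or third maximum is the right one.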

In order to evaluate the performance of the cost functions, their most relevant parameters are varied as follows: wz = {7, 19, 31}; σ = {0.5, 1, 1.5, . . . , 5.5, 6}; and Q = {8, 16, 32, 64}. The different levels of the scale-space representations are obtained by convolving the images with a Gaussian kernel of order {0, 1} and standard deviation σ. The experiments are conducted in the following manner: for each location p in the reference image, three windows of size 7, 19, and 31 are extracted and then corresponded independently with their searching space. The window size decreases with the scale: windows start with a size of 31 × 31 and finish with 7 × 7 at level t = 0. The propagation also follows this direction, from t = 2 to t = 0 (coarse to fine). This decreasing arrangement of windows is enforced because the upper levels of the scale space should enrich the finest ones, allowing small windows without loss of accuracy. Although this constraint is insignificant in this section, in a stereo scenario it is relevant because small windows avoid over-smoothed disparity maps, particularly at edges.

During the experiments, several initializations of the confidence vector [λ0, λ1, λ2] from Eq. (4.2) were also tested, as well as the quantization levels Q from Eq. (4.5). The best settings of these parameters were [0.45, 0.30, 0.25], which assign high confidence to the lowest levels of the scale space, and [Q0, Q1, Q2] = [7, 19, 32], quantization levels proportional to wz.



Table 4.1 and Table 4.2 summarize the results obtained when the best parameters are used, and present the percentages of correct matches when the first, second, or third maximum is right. For example, if a winner-takes-all optimization is used for matching two multimodal images, the number of correct matches corresponds to the values in the columns labeled 1st. maximum %, since these are the maximum arguments (argmax in Eq. (4.11)). The proposed cost functions are also evaluated when a scale-space stack or a pyramidal representation is used. Note that Table 4.1 and Table 4.2 are divided into two parts, without and with propagation, referring to the cases where MI, GI, and MGI are propagated through a scale-space representation. It can be observed that if MGI is propagated, then partial results of C are obtained; in fact, C is the aggregation of all these values. Finally, the percentages of success of the cost function C are presented in boldface and preceded by an asterisk (∗); since they are obtained only if a scale-space representation is used, their corresponding percentages are available at the finest level. The efficiency of the fusion of MI and GI through the different levels is defined as η = MGI − max(MI, GI) (η is also given as a percentage).

The experiments show that MI, GI, and MGI behave proportionally to the size of the template: if wz is increased, these cost functions are better estimated due to the larger number of observations. This tendency can be observed in Table 4.1 and Table 4.2 by comparing the results coming from different scales. These tables also show that the efficiency of the mutual and gradient information fusion improves if a scale-space representation is used. In particular, we obtain an improvement of about 5 times at the last level (t = 0) for the stack, and of about 4 times for the pyramidal representation. Since C (boldface) aggregates MI, GI, and MGI, it is not surprising that its scores surpass those of the rest of the cost functions, making it the best performing cost function.

Observing the percentages in these tables, we can say that, theoretically, 50% of correct matches could be achieved just by improving the optimization step. This could be done by transforming second and third maxima into first maxima (future chapters focus on this fact). The results show that the proposed propagation scheme increases the number of right matches, for both the stack and the pyramid. The results of the pyramid using propagation are better than without it. However, these results cannot be compared to the ones obtained with the stack, except at level 0, due to the compression of the images (see Fig. 4.5 for an example). Note that in this representation each level contains less information and the images are smaller; hence, the estimation of MI becomes weak. On the other hand, we observe that the mutual information estimator presented in Eq. (4.5), together with the way the alphabets are assembled, establishes a dependency between the accuracy of the MI estimation and the number of members in the sample (template size), which affects the performance of the propagation, especially in the pyramid.



Table 4.1: Right matches ordered by maxima using a stack (%)

                       1st. maximum %           2nd. maximum %           3rd. maximum %
                     MI    GI    MGI    η     MI    GI    MGI    η     MI   GI    MGI    η
Without propagation
  t = 2            37.2  43.3   50.0  6.7   11.5  10.7   11.6  0.0    5.1  4.9    5.8  0.8
  t = 1            17.3  22.6   27.6  5.0    7.6   8.6   10.0  1.3    4.1  4.4    5.1  0.8
  t = 0             4.9  10.4   12.5  2.1    3.6   5.2    6.4  1.2    3.4  3.2    4.9  1.6
With propagation
  t = 2            37.2  43.3  *50.0  6.7   11.5  10.7  *11.6  0.0    5.1  4.9   *5.8  0.8
  t = 1            31.7  33.0  *41.6  8.4   12.6  11.2  *12.6  2.2    6.7  5.2   *6.0  0.0
  t = 0            15.2  30.4  *40.0 10.0    9.2  11.0  *12.0  2.2    7.2  6.0   *5.7  0.0

Table 4.2: Right matches ordered by maxima using a pyramid (%)

                       1st. maximum %           2nd. maximum %           3rd. maximum %
                     MI    GI    MGI    η     MI    GI    MGI    η     MI   GI    MGI    η
Without propagation
  t = 2            31.3  37.8   41.0  3.2   10.6   9.7   11.9  1.3    4.8  4.7    6.1  1.3
  t = 1            14.1  21.3   24.3  2.9    7.7   6.7    8.1  0.5    3.9  4.2    4.8  0.6
  t = 0             4.9  10.4   12.5  2.1    3.6   5.2    6.4  1.2    3.4  3.2    4.9  1.6
With propagation
  t = 2            31.3  37.8  *41.0  3.2   10.6   9.7  *11.9  1.3    4.8  4.7   *6.1  1.3
  t = 1            26.8  31.5  *38.6  7.1    8.8   9.6  *10.5  0.9    5.4  5.7   *6.4  0.7
  t = 0            14.2  26.6  *36.6  8.0    6.4  11.4   *9.9  0.0    5.2  6.4   *6.5  0.1

∗ percentage corresponding to the C cost function



Table 4.3: Improvement achieved using interscale propagation (%)

              Stack (%)               Pyramid (%)
            MI     GI    MGI        MI     GI    MGI
  t = 2    0.0    0.0    0.0       0.0    0.0    0.0
  t = 1   14.4   10.6  *14.0      12.6   10.2  *13.4
  t = 0   10.3   14.3  *27.4       9.3   15.1  *24.1

∗ percentage corresponding to C cost function

Table 4.3 compares the improvements introduced by the stack and the pyramid. Although the results achieved by the stack are on average 3% better than those of the pyramid, the latter representation speeds up the computation of C, particularly of the mutual information, which is the most time-consuming process. Hirschmuller [69] assesses this improvement at 14% by comparing algorithm complexities.

Figure 4.6 shows the cost curves corresponding to MI and GI when the template and searching space indicated in Fig. 4.1 are matched. Although both approaches, with and without propagation, find the correct match, the relative difference between maxima is larger in Fig. 4.6(b), which is obtained by the proposed propagation scheme. Note that MI in Fig. 4.6(a) is noisy; in fact, it is hard to identify the correct match from MI, and the same happens for GI. In a certain way, this example visually shows how our cost function works.

A second evaluation is performed on the CVC-01 and CVC-02 datasets (presented in Sect. 3.6.2 and Sect. 3.6.3) using multiperspective visible and thermal infrared images. A selection of 46 pairs of multimodal images (IV S and ILWIR) was used for evaluating the proposed cost functions. The aim of this experiment is to assess the global performance of MI, GI, MGI, and C under viewpoint changes. An example of the selected images is shown in Fig. 4.7(a)-(b).

Since accurate ground truth data is necessary, an indirect method has been envisaged for evaluating the performance of the cost functions. This method consists in measuring the accuracy of the disparity values computed from a cost function on image regions that are planar surfaces. Actually, the evaluation is performed in the v-disparity space [100], that is, a histogram of disparities along the column direction. The ground truth data are provided by the PGB and are used for computing these representations (see Fig. 4.7(c)-(d)).

An interesting property of the v-disparity space is that planes in the Euclidean space are mapped onto straight lines. So, by identifying these straight lines, the accuracy of a cost function can be evaluated. This works as follows: firstly, the predominant planar regions in the given evaluation image are identified. Next, the contributions of these predominant planar regions in the v-disparity space are selected. Finally, a least-squares linear regression is applied only to this set of points, which provides the best



Figure 4.6: Example of the proposed propagation scheme: (a) MI and GI costs without propagation and (b) final MI and GI costs when they are propagated through a scale-space representation (MI: dashed line; GI: solid line).

fitting. In this way, the real position of a plane is estimated from noisy data. Figure 4.7(c) shows the disparity map when a single plane is recorded by the cameras; Fig. 4.7(d) depicts its corresponding v-disparity. Note that the number of rows in both plots is the same, but the disparity axis in the v-disparity representation depends on the position of the plane. In this plot, the straight line through the points represents the ideal disparity values of the plane. A justification of the proposed evaluation procedure can be appreciated in these images: even though the disparity image (Fig. 4.7(c)) could be used as ground truth, its large variance in disparity would produce an unfair evaluation. In the v-disparity representation, by contrast, we can filter out that dispersion and obtain a fairer ground truth.
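The construction just described can be sketched as follows. This is an illustrative sketch, not the thesis code: the function names (`v_disparity`, `fit_plane_line`) and the unweighted least-squares fit are our assumptions.

```python
import numpy as np

def v_disparity(disparity, d_max, valid_mask=None):
    """Build a v-disparity image: one disparity histogram per image row.

    disparity  : (H, W) array of integer disparities.
    valid_mask : optional boolean mask of pixels with known disparity.
    Returns an (H, d_max + 1) array of per-row disparity counts.
    """
    h, w = disparity.shape
    vdisp = np.zeros((h, d_max + 1), dtype=np.int32)
    for v in range(h):
        row = disparity[v]
        if valid_mask is not None:
            row = row[valid_mask[v]]
        counts = np.bincount(row.astype(np.int64), minlength=d_max + 1)
        vdisp[v] = counts[: d_max + 1]
    return vdisp

def fit_plane_line(vdisp, min_count=1):
    """Least-squares line d = a*v + b through the populated v-disparity bins."""
    rows, disps = np.nonzero(vdisp >= min_count)
    a, b = np.polyfit(rows, disps, 1)  # a weighted fit using the counts is also possible
    return a, b
```

A scene plane (e.g. the ground) produces a straight line in `vdisp`; the fitted `(a, b)` then gives the ideal disparity for each row.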


Figure 4.7: v-disparity map representation: (a) I_VS; (b) I_LWIR; (c) ground truth disparity map; and (d) v-disparity representation.

Once the planar regions in the evaluation images have been identified, and their corresponding straight lines (ℓ) fitted in the v-disparity representations, an error function based on the Root Mean Squared (RMS) error is used. It measures the orthogonal distance (dist⊥) between a disparity value obtained by our cost functions and its corresponding ℓ (which depends on the region the value belongs to). Thus, the error is defined as follows:

R = \left( \frac{1}{N} \sum_{x \in P} \left| \mathrm{dist}_{\perp}\left( d_C(x), \ell \right) \right|^{2} \right)^{\frac{1}{2}}, \qquad (4.12)

where R is the RMS error for a given planar region; P is the planar region; d_C(x) is the disparity value at a pixel x inside this region; and N is the number of pixels belonging to that region.
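A minimal sketch of Eq. (4.12), assuming the fitted line is given in slope-intercept form d = a·v + b (the helper name and this parameterization are our own):

```python
import numpy as np

def rms_plane_error(rows, disps, a, b):
    """RMS orthogonal distance (Eq. 4.12) from disparity samples to the
    fitted v-disparity line d = a*v + b.

    rows, disps : (v, d) coordinates of the pixels in one planar region P.
    """
    # Line in implicit form a*v - d + b = 0; standard point-to-line distance.
    dist = np.abs(a * rows - disps + b) / np.sqrt(a * a + 1.0)
    return np.sqrt(np.mean(dist ** 2))
```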

Figure 4.8 shows the average accumulated error of MI, GI, and C over the evaluation dataset. The results depicted in this figure correspond to level t = 0 of


Figure 4.8: Average accumulated errors of MI, GI and C at t = 0 when they are propagated through a stack.

a stack when MI, GI, and C are initialized with the same parameters as those used in the first experiment. These plots depict the errors made by assuming that the maximum argument corresponds to the correct match (Eq. (4.11)). This figure also reveals an important relationship between the errors and the number of pixels assumed to be correctly matched: the more pixels are matched, the more mistakes are made. This trade-off between parameters is studied in the following chapter.

Every point on the plots presented in Fig. 4.8 represents the error of a set of matches with a variable number of elements. The cost values are arranged in descending order, and the error is computed from the first matched pair, which has the maximum cost value, up to the number of points indicated on the horizontal axis. For this reason, the range of the curve goes from 1 to the number of pixels in the images. This helps us to visualize how the error increases as more matches are accepted.
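The accumulation just described can be sketched as follows (illustrative only; the array names and the per-match error flag are our assumptions):

```python
import numpy as np

def accumulated_error_curve(costs, is_wrong):
    """Error vs. number of accepted matches, as in the Fig. 4.8 plots.

    costs    : (N,) cost value of each candidate match.
    is_wrong : (N,) boolean, True if the match disagrees with ground truth.
    Returns err, where err[k] is the fraction of wrong matches among the
    k+1 highest-cost candidates.
    """
    order = np.argsort(costs)[::-1]            # sort by cost, descending
    wrong_sorted = is_wrong[order].astype(np.float64)
    accepted = np.arange(1, len(costs) + 1)    # 1 .. N accepted matches
    return np.cumsum(wrong_sorted) / accepted
```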

In all cases the combined use of MI and GI increases the number of correct matches, proving that mutual and gradient information supply complementary information that is useful in the matching process. Figure 4.8 shows how C improves on the previous similarity functions; a stable behavior in both error and number of correct matches is noticeable. Indirectly, this figure also reveals the error to be expected when a given number of pixels is assumed to be correctly matched. For example, if similar images are considered, it is expected that the 45% of pixels with the highest cost values will be matched with an error smaller than 4%.

4.5 Conclusions

This chapter presents a novel LWIR/VS cost function, which combines mutual information with gradient information in a multiresolution context. Experimental results


show the improvement in discriminative power when mutual and gradient information are combined and propagated through a scale-space representation, confirming the viability of the proposed approach. An evaluation framework is also presented that compares the performance of four multimodal cost functions (MI, GI, MGI, and C), which are evaluated with two metrics: the number of correct matches per maximum and the RMS error. The second evaluation is performed in the v-disparity space, which makes it possible to account for the noise in the acquisition of the disparity maps.

We have shown that adding gradient information (GI) together with a scale-space representation improves the descriptive ability of a similarity function based on mutual information. Details of its parameter setting (Q, w_z, t, λ, and σ) and how they affect the performance are also presented. Finally, the results obtained in real environments show that C is a useful tool for finding correspondences in multimodal images.


Chapter 5

Local Information Based Approach for Sparse Representations

This chapter presents an imaging system for computing sparse depth maps from multispectral images. A special stereo head consisting of an infrared and a color camera defines the proposed multimodal acquisition system. The cameras are rigidly attached so that their image planes are parallel. Sparse disparity maps are obtained by the combined use of mutual information enriched with gradient information. The proposed approach is evaluated using a Receiver Operating Characteristics curve. Experimental results in real outdoor scenarios are provided, showing its viability.

5.1 Introduction

The coexistence of visible and thermal infrared cameras has opened new perspectives for the development of multimodal systems. In general, these cameras are

being used in a complementary way, for example in applications such as video surveillance (e.g. [103, 27]) and driver assistance systems (e.g. [83]), where VS cameras provide information during daylight and LWIR cameras do so at night. More recently, multimodal systems based on cooperative sensors have been studied. As a result, these novel approaches are overcoming state-of-the-art algorithms: tasks such as illuminant estimation [53], scene category recognition [26], and shadow removal [134] are best performed by NIR/VS systems. Although these examples are not LWIR/VS systems, they prove that other information sources help to solve complex tasks.

All the approaches mentioned above involve registration and fusion steps, resulting in an image that, even though it contains several channels of information, lies in 2D space. We go beyond classical registration and fusion schemes by formulating the


following question: “is it possible to obtain 3D information from visible and thermal infrared images?”. It is clear that if the objective is to obtain depth maps close to the state of the art, classical binocular stereo systems (VS/VS) are more appropriate. Therefore, the motivation of this chapter is to show that the generation of 3D information from images belonging to different spectral bands is possible.

The role of the cameras in the proposed multimodal stereo system is not restricted to working in a complementary way (as is traditional) but extends to a cooperative fashion, being able to extract 3D information. This challenge represents a step forward for the 3D multimodal community, and the results obtained from this research can surely benefit applications in the driver assistance or video surveillance domains, where the detection of an object of interest can be enriched with an estimation of its aspect or its distance from the cameras.

The performance of a stereo vision algorithm is directly related to its capacity to find good correspondences (matching), and this task largely depends on the cost function used. In the multimodal case, as mentioned in the previous chapter, cost functions such as SAD, NCC, SSD or Census are not recommended since they assume a linear correlation, which is an incorrect hypothesis for multimodal signals [68]. Therefore, in this chapter the cost function presented in Chapt. 4 is adopted in order to establish a bijective relationship between LWIR and VS images (one-to-one correspondence).

Multimodal matching has been widely studied in registration and fusion problems, especially in medical imaging (e.g., [11, 138, 146]). However, there is little research related to the correspondence problem when thermal infrared and color images are considered. Hence, it is not clear how to exploit visible and infrared imaging in a cooperative framework to obtain 3D information.

The use of multimodal images (LWIR/VS) has attracted the interest of researchers in different computer vision fields, for example: human detection [66], video surveillance [97], and 3D mapping of surface temperature [169, 128]. Recently, [96] presented a comparison of two stereo systems, one working in the visible spectrum (composed of two color cameras) and the other in the infrared spectrum (using two LWIR cameras). Since the study was devoted to pedestrian detection, the authors conclude that both color- and infrared-based stereo have a similar performance for this kind of application. However, in order to have a more compact system they propose a multimodal trifocal framework defined by two color cameras and a LWIR camera. In this framework, infrared information is not used for stereoscopy but just for mapping LWIR information onto the 3D points computed from the VS/VS stereo head. This makes it possible to develop robust approaches for video surveillance applications (e.g., [97]).

In contrast to the previous approaches, a multimodal stereo head constructed with just two cameras, a thermal infrared and a color one, is studied in [95]. This minimal configuration is adopted in this chapter since it is the most compact architecture in terms of hardware and software. Critical issues such as camera synchronization, control signaling, bandwidth, and image processing, among others, have a low impact on the overall performance, and can be easily handled if an architecture such as the one presented in [110] is used. In Krotosky et al. [95] this compact multimodal stereo head


(LWIR/VS) is employed for matching regions that contain human body silhouettes. Since their contribution is aimed at person tracking, some assumptions are applied; for example, they use foreground segmentation to disclose possible human shapes, which later on are matched by maximizing mutual information [161]. Although these assumptions are valid, they restrict the scope of application.

A more general solution should be envisaged, allowing this kind of multimodal stereo system to be used in different applications. In other words, the matching should not be constrained to regions containing human body silhouettes. The current chapter has two main contributions. Firstly, a robust approach for computing sparse depth maps from multimodal images is proposed; since it is not restricted to a specific application, it can be used in any scenario. Secondly, an evaluation methodology based on Receiver Operating Characteristics (ROC) curves is proposed, which makes it possible to find the best parameters of the LWIR/VS cost function (C) proposed in Chapt. 4. This methodology was initially proposed for ranking VS/VS stereo systems [92], but in this chapter it is adopted for evaluating different initializations of C. This investigation leads to a correct trade-off between error and sparsity.

Although the proposed approach is motivated by the recovery of 3D information, it could additionally help to solve other multimodal problems, since most existing multimodal systems are affected by the same issue: the statistical independence between the different modalities, which makes their correlation difficult. Our approach offers a non-heuristic solution, which is a novel feature with respect to the state of the art.

The chapter is organized as follows. Section 5.2 presents details about the matching cost volume and describes the optimization steps as well as the constraints applied for computing sparse depth maps. Section 5.3 introduces the evaluation methodology. Experimental results with different scenes are presented in Sect. 5.4, together with the technique used for setting the parameters of the algorithm. Conclusions and final remarks are detailed in Sect. 5.5.

5.2 Sparse Stereo

This section presents in detail the proposed algorithm for computing sparse disparity maps. Figure 5.1 shows an illustration of the multimodal platform and a pair of images used to compute a sparse 3D representation. The different challenges of the tackled problem can be appreciated in this illustration, from image acquisition to depth map estimation. Issues such as evaluation are also discussed. The different stages of the proposed multispectral stereo are presented in detail below.

5.2.1 Matching Cost Volume Computation

A crucial aspect of any multimodal stereo algorithm is to find good matches, despite the poor correlation between LWIR and VS images [139, 120]. In the current



Figure 5.1: Proposed framework: (a) obtained sparse 3D map and multimodal images (I_LWIR and I_VS); (b) multimodal stereo head.

chapter, this problem is tackled by using a cost function that combines mutual and gradient information in a scale-space context. Although this similarity function was defined for multimodal template matching (Chapt. 4), the current chapter shows that it can also be used as a cost function in a multimodal stereo system. Additionally, its accuracy is improved through a Parzen window estimator, as described below.

The joint probability p(a_i, b_j) from Eq. (4.5) is estimated in two steps. Firstly, a two-dimensional histogram of the discretized pixel values a_i and b_j is constructed. Every entry is obtained as follows:

p(a_i, b_j) = \frac{1}{w_z^{2}} \sum_{x, x'} T\left[ (a_i, b_j) = (x, x') \right], \qquad (5.1)

where T[•] is a conditional function, which takes the value 1 if the argument between brackets is true, and 0 otherwise. It is appropriate to recall that x and x' represent a pair of corresponding points in local window coordinates, and that w_z is the size of these windows. Once all entries are computed, the joint probability is approximated by a Parzen estimator as presented in [88]. It places a Gaussian distribution g with standard deviation σ_g on every entry of the previously obtained histogram p(a_i, b_j). Thus:

\tilde{p}(a_i, b_j) = p(a_i, b_j) \ast g(a_i, b_j, \sigma_g). \qquad (5.2)

The other probabilities from Eq. (4.5) are determined by summing along each dimension of the previous joint probability (see [161] for more details about the probability estimation). Similarly to the cost volume presented in Chapt. 4, a multimodal cost volume C(p, d) is computed ∀p ∈ I_VS. This representation comes from Eq. (4.3) and Eq. (4.4), when the smoothed joint probability is used for computing MI.

In contrast to the template matching algorithm presented in Sect. 4.4, the two stereo images are not registered. Therefore, calibration and rectification have a decisive impact: not only is the search for correspondences restricted to one dimension, but the contours and edges of the objects contained in the scene also have


a similar aspect. This fact increases the probability of coincidence on contours and boundaries, as shown in [120].

5.2.2 Disparity Selection Based on Confidence Measures

In the previous section the matching cost of every pixel in the reference image was computed, obtaining a cost volume C. Now, this volume is used to obtain the best disparity map describing the scene. The generation of such a representation is achieved by transforming the disparity selection problem into an optimization one. Given that a fast but accurate algorithm is wanted, a local optimization method is proposed, based on a winner-takes-all (WTA) strategy constrained by confidence measures.

The proposed approach consists of two steps. Initially, the disparities with the highest cost values are selected as correct (i.e., seeds or control points) by applying a winner-takes-all (WTA) optimization criterion, without any kind of global reasoning. In these cases, a correct match is determined by the position d where the cost function reaches its maximum value: arg max_d (C(p, d)). The disparity map obtained after this first step contains several wrong entries due to occluded regions or regions without texture (no information). Note that the multimodal case is more susceptible to these anomalies than the classical VS/VS one, because LWIR images are poorly textured. For instance, assume that an object in the scene appears textured in the visible image while its temperature remains constant. The object then appears as a textureless region in the infrared image; these nonsimultaneous phenomena cause doubtful cost curves. In general, such curves have small deviations, flatness and low cost values (see Fig. 5.2(a)). By contrast, a typical cost curve has multiple maxima with similar values, as depicted in Fig. 5.2(b).

As mentioned above, it is hard to select the correct d among several candidates with similar scores. Therefore, a second step that rejects mismatched candidates is added. It consists in labelling as correct those correspondences with a cost score higher than a given threshold τ_WTA, whose optimal value is determined by means of ROC error curves computed over a large dataset. Figure 5.2(c) shows an example of a reliable maximum. Note that it satisfies C(p, D − ε) ≤ C(p, D) and C(p, D + ε) ≤ C(p, D) as ε → 0.
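The first optimization step over a dense cost volume can be sketched compactly (illustrative; the (H, W, D) layout and the function name are our assumptions):

```python
import numpy as np

def wta_with_threshold(cost_volume, tau_wta):
    """Winner-takes-all disparity selection, keeping only those maxima
    whose cost exceeds tau_wta (the seeds / control points).

    cost_volume : (H, W, D) array holding C(p, d) for every pixel p and
                  disparity candidate d.
    Returns (disparity, valid), where invalid pixels stay unmatched.
    """
    disparity = np.argmax(cost_volume, axis=2)   # arg max_d C(p, d)
    best_cost = np.max(cost_volume, axis=2)
    valid = best_cost > tau_wta                  # reliability threshold
    return disparity, valid
```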

The τ_MGI parameter is included in our formulation for picking only those pixels with large C values. Since the C cost function is reliable at edges and in textured regions, and those regions have a higher C cost, τ_MGI is used as a threshold that splits the cost map into two groups: (i) reliable matches and (ii) unreliable matches.

Given that it is better to reject uncertain correspondences than to accept bad disparities, another constraint is added to the proposed optimization. It is based on a confidence measure called Left-Right Difference (LRD), proposed in [75]. The LRD criterion favors a large margin between the two best maxima and also includes a term that checks the consistency of a pair of maxima across the two



Figure 5.2: Example of cost curves: (a) cost curve in a textureless region; (b) typical cost curve; and (c) ideal cost function.

multimodal images (I_LWIR and I_VS); it is defined as follows:

C_{LRD}(p, d) = \frac{c_1 - c_2}{\left| c_1 - \max\left( C^{\star}(p - d_1, d^{\star}) \right) + \varepsilon \right|}, \qquad (5.3)

where C⋆ is the cost volume computed when the reference image is switched, in the scope of this dissertation interchanging I_VS by I_LWIR; the term p − d_1 represents all matching costs between an image coordinate (x − d_1, y) in I_LWIR and its corresponding search space in I_VS; and d⋆ represents the search space over which the cost volume C⋆ is optimized. Observe that the ⋆ operator is dual to the definitions introduced in Chapt. 4. Finally, the optimization arg max_{d⋆}(C⋆(p − d_1, d⋆)) acts as a consistency test, which gives as a result the cost value of the maximum when the disparity value predicted by the cost volume C is used to optimize C⋆. If I_VS(p) and I_LWIR(p − d_1) are corresponding points, then C(p, d_1) and C⋆(p − d_1, d⋆_1) are similar.

In other words, Eq. (5.3) assumes that two corresponding windows should result in similar cost values, and in consequence small values in the denominator are expected. This formulation provides immunity against two observed weaknesses of the proposed multimodal cost function. On the one hand, if the margin c_1 − c_2 is large (a peak) but the pixel has been mismatched, the denominator will be large and the confidence will be low. On the other hand, if the margin is small the match is likely to be ambiguous; however, if the denominator is also small, a high confidence is assigned, because the correspondence is then likely to be correct (the opposite case will result in a low confidence value).

Hu and Mordohai [75] evaluate several confidence measures for VS/VS stereo matching, including LRD. Their study shows that a disparity selection strategy based on WTA with LRD as cost function achieves low error rates on the Middlebury evaluation dataset [136]. Moreover, they also show that even if a cost function (i.e., SAD or NCC) is used as a confidence measure, its result will be among the top ranked. This is a theoretical justification of our proposed disparity selection method based on cost thresholding and LRD confidence.
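For a single pixel, Eq. (5.3) can be sketched as follows. This is a simplified reading, with assumptions of our own: c_2 is taken as the second-highest cost value of the curve, the consistency term uses the maximum of the switched-reference cost curve, and the function name is illustrative.

```python
import numpy as np

def lrd_confidence(cost_lr, cost_rl, eps=1e-6):
    """Eq. (5.3) for one pixel p (assumes at least two disparity candidates).

    cost_lr : (D,) cost curve C(p, d) with I_VS as reference image.
    cost_rl : (D,) cost curve C*(p - d1, d*) with the reference switched.
    """
    d1 = int(np.argmax(cost_lr))
    c1 = cost_lr[d1]                        # best maximum
    c2 = np.max(np.delete(cost_lr, d1))     # second-best maximum
    consistency = abs(c1 - np.max(cost_rl)) + eps
    return (c1 - c2) / consistency
```

A large margin with a consistent switched-reference maximum yields a high confidence; an inconsistent cross-check inflates the denominator and suppresses it.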

Once all reliable matches have been found, they are used as seeds for searching for more matches, as indicated below. The neighborhood of each seed (its eight surrounding pixels) is considered as a set of candidates to be matched in the second optimization step, which is repeated until a maximum number of iterations is


reached or there are no more candidates. A neighbor can be a candidate only once, and in subsequent iterations it is ignored. In contrast to VS/VS stereo approaches, neighbourhoods bigger than one pixel are not recommended, because the proposed cost function has shown high efficiency in the matching of edges and textured regions, which are usually thin. When the seeds are selected and their corresponding neighbourhoods computed, an optimization step as described above is performed again, but without cost thresholding. In this second optimization the search space is reduced to a small fraction of the initial one; empirically, one tenth of the initial search space was found to be sufficient for achieving accurate solutions in a few iterations, about 4 or 5. Furthermore, monotonic cost curves C(p, [d_min, d_max]) (entirely increasing or decreasing) are also rejected.

Formally, the previously introduced optimizations are defined as follows:

D_{WTA} = \arg\max_{d} \left( C(p, d) \right), \qquad (5.4)

subject to: C(p, d) > \tau_{WTA}; C(p, d) is not monotonic.

D_{LRD} = \arg\max_{d} \left( C_{LRD}(p, d) \right), \qquad (5.5)

subject to: C_{LRD}(p, d) is not monotonic.

D_WTA is the disparity map given by a WTA optimization of C, while D_LRD is the disparity map given by a WTA optimization of the confidence measure C_LRD. Note that these optimizations are nested, because C_LRD is computed from D_WTA (see Eq. (5.3)). Then, these results are combined into a unique solution D through the following rule:

D = \begin{cases} \emptyset & \text{if } D_{WTA} \nsim D_{LRD}, \\ \frac{1}{2}\left( D_{WTA} + D_{LRD} \right) & \text{if } D_{WTA} \sim D_{LRD}. \end{cases} \qquad (5.6)

The similarity condition in Eq. (5.6) is based on the distance between the terms: two disparities are considered similar if their values are at most ±1 pixel apart. On the other hand, a method for sub-pixel interpolation is implemented. It is a variation of the sampling-insensitive dissimilarity measure [16].

When a point in the scene is registered by two cameras, its intensity values are different, since the radiation measured by each camera is not the same, the cameras have different parameters, and the acquisition is in general a noisy process. A way to address these complex interactions between sensor and environment is through cost interpolation [70]. So, for each disparity value in D, three cost values are extracted from C: the maximum and its two neighbours (C(p, d_1 − 1), C(p, d_1), and C(p, d_1 + 1)); an illustration is provided in Fig. 5.3. These points are used to fit a quadratic function, which is then optimized over the domain [d_1 − 1, d_1 + 1] to find a better approximation (with sub-pixel accuracy). After computing the disparity of every point in the reference image, their corresponding 3D positions (X, Y, Z) are obtained using the interpolated disparities.
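The quadratic fit over [d_1 − 1, d_1 + 1] admits a closed-form vertex, which can be sketched as follows (the clamping policy and the flat-curve fallback are our assumptions):

```python
def subpixel_disparity(c_minus, c_max, c_plus, d1):
    """Fit a parabola through (d1-1, c_minus), (d1, c_max), (d1+1, c_plus)
    and return the sub-pixel position of its vertex, clamped to
    [d1 - 1, d1 + 1]."""
    denom = c_minus - 2.0 * c_max + c_plus
    if denom == 0.0:                 # flat curve: keep the integer disparity
        return float(d1)
    delta = (c_minus - c_plus) / (2.0 * denom)   # vertex offset from d1
    return d1 + max(-1.0, min(1.0, delta))
```

With a symmetric curve the offset is zero; an asymmetric one shifts the disparity toward the higher-cost neighbour.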



Figure 5.3: Sub-pixel disparity interpolation.

Finally, the pseudocode of the proposed method for disparity selection based on confidence measures is presented in Alg. (1). Although this method may seem computationally complex, there is an increasing interest in this kind of algorithm because it is suitable for real-time applications (low memory requirements and simple operations); some examples of similar approaches are available as GPU and FPGA implementations [118, 132, 84].

5.3 Evaluation Methodology

The current section describes the quality metrics used for evaluating the performance of the proposed multimodal stereo algorithm. These metrics are inspired by a stereo ranking for VS/VS semi-dense stereo algorithms recently published on the Internet [93]. In general, stereo algorithms have been evaluated following the methodology proposed in [136], which has become a de facto standard in the stereo community. It presents two quality measures: (i) the RMS (root mean squared) error; and (ii) the percentage of bad matching pixels. In both cases, the resulting disparity maps are compared to a known ground truth in a dense fashion. However, in uncontrolled scenarios such as outdoors, obtaining ground truth data as presented in [72] or [137] is not feasible; for that reason, we must evaluate our proposed algorithm following a semi-dense methodology.

The method presented in [92] captures both error and sparsity in a single value, which is suitable for our dataset, so we extend this framework to the multimodal case. Each (error, sparsity) pair is plotted as a single point on a Receiver Operating Characteristics curve, letting us visualize how the performance is affected as more disparity values are accepted. Remember that every disparity obtained by our method has an associated cost value, which depends on C_MI and C_GI. Therefore, regions with low information (low entropy) or without texture (gradient) can be rejected considering their cost. During the evaluation process the best τ_MGI parameter can be easily identified (see Sect. 5.2.2).

In the current section, ROC curves are used for evaluating the performance of the proposed approach on the CVC-01 Multimodal Urban Dataset (Sect. 3.6.2), independently of the parameter settings w_z, Q, and σ_t. The evaluation procedure is briefly detailed below following the original notation.


Algorithm 1: Disparity Selection Based on Confidence Measures

Data: C(p, d), maxNumIterations
Result: D
begin
    i ← 0
    // First optimization step
    [D′, Seeds] ← confidenceOptimization(C, p, d, τ_WTA)
    D ← D′
    // Second optimization step
    while (i < maxNumIterations ∧ Seeds ≠ ∅) do
        i ← i + 1
        p′ ← getNeighbors(Seeds)                   // p with a d assigned are excluded
        d′ ← getNewSearchSpace(p′, Seeds, D′)
        [D′, Seeds′] ← confidenceOptimization(C, p′, d′, 0)
        Seeds ← Seeds′                             // new seeds
        D ← D + D′                                 // append new disparities

// Local optimization based on confidence measures
Function [D, Seeds] ← confidenceOptimization(C, p, d, τ)
begin
    D_WTA ← arg max_d (C(p, d))                    // WTA
    D_LRD ← arg max_d (getLRDcost(C(p, d)))        // confidence measure
    D ← D_WTA | C(p, D_WTA) ≥ τ ∧ D_WTA ∼ D_LRD    // disparities list
    Seeds ← p | C(p, D_WTA) ≥ τ ∧ D_WTA ∼ D_LRD    // locations list

The statistics about the quality of a semi-dense stereo algorithm should capture two things: (i) how often a matching error happens and (ii) how often a match is not found. These two values define the Error Rate (ER) and the Sparsity Rate (SR), respectively. In other words, the ER represents the percentage of incorrect correspondences:

ER = \frac{\text{incorrect correspondences}}{\text{all matchable pixels}}. \qquad (5.7)

On the other hand, the SR is defined as the percentage of missing correspondences over the set of matchable pixels:

SR = \frac{\text{missing correspondences}}{\text{all matchable pixels}}. \qquad (5.8)

Note that these values are not computed over the whole set of pixels but over those pixels with a match in the ground truth. An illustration of ROC curves for different scenarios can be seen in Table 5.1 (its meaning will be explained in Sect. 5.4). These plots have four extreme points: the origin, which represents a dense and perfect matching algorithm; its opposite, where no correct matches are found; the (0, 1) point, which corresponds to an algorithm that is dense but fully wrong; and finally, the (1, 0) point, which corresponds to a completely empty disparity map.
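The counting behind one ROC point (Eqs. (5.7)-(5.12)) can be sketched for a single image as follows. Illustrative only: the mask names are our assumptions, and the occlusion boundary band B is omitted for brevity.

```python
import numpy as np

def roc_point(gt, est, gt_valid, est_valid):
    """One (ER, SR) point for a semi-dense disparity map.

    gt, est             : ground truth / estimated disparity maps.
    gt_valid, est_valid : boolean masks of pixels where each is available.
    """
    n_valid = gt_valid.sum()                       # |V|, matchable pixels
    both = gt_valid & est_valid
    mismatch = (np.abs(gt - est) > 1) & both       # M: off by more than 1 px
    false_pos = est_valid & ~gt_valid              # FP: match in occluded area
    false_neg = gt_valid & ~est_valid              # FN: hole where GT exists
    er = (mismatch.sum() + false_pos.sum()) / n_valid   # Eq. (5.12)
    sr = false_neg.sum() / n_valid                      # missing / matchable
    return er, sr
```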



Figure 5.4: Evaluation regions: (a) segmented planar regions; and (b) evaluation regions.

The evaluation by ROC curves compares, row by row, a horizontal profile belonging to the ground truth disparity map with its corresponding one obtained by the tested algorithm. A correct match is assumed when the difference with respect to the correct value is smaller than or equal to 1 pixel. Note that only three sets of images (i.e., roads, facades and OSU in Sect. 3.6.2) are used for the evaluation, to avoid the problem of occlusions, which is slightly different between the validation and the multimodal stereo rig. Regarding the roads dataset, a single plane appears in every image, hence there are no occluded areas, while in the facades dataset occluded areas are removed by generating a synthetic 3D model. The OSU dataset was not obtained with our multimodal stereo head; it is provided by [39] and contains perfectly aligned VS and LWIR images. The smooth surfaces dataset is not used during the evaluation, since occluded areas in its ground truth would affect the evaluations; hence, this dataset is just used for a qualitative validation of the proposed approach.

Figure 5.4(a) shows three kinds of regions identified in our dataset: Occluded; Unavailable (e.g., not textured, or too far from/close to the multispectral stereo head); and Valid regions. A region is valid when depth information is known or it is possible to fit a plane to its defining pixels (Fig. 5.4(b)). Therefore, let V be the set of all pixels in the ground truth with disparity information available; O be the occluded regions; B be the regions close to an occlusion (by definition, this boundary is 5 pixels wide); and finally, C be the candidate matches obtained by the evaluated algorithm.

The two error metrics, ER and SR, introduced at the beginning of this section are now defined as a function of the following two terms: the operator T[•] defined in Eq. (5.2), and an image coordinate p that lies both on the ground truth and on the disparity map obtained by the proposed approach. Notice that the ground truth and the disparity map are referred to the same coordinate system; thus, they can be overlapped and their coordinates are equivalent.

Mismatch (M): a correspondence whose disparity value differs from the ground truth value by more than one pixel:

M(p) = T [|V (p)− C(p)| > 1] , (5.9)

this score also considers pixels near occlusions (B).


False Negative (FN): an unassigned correspondence where ground truth data is available (i.e., a hole):

FN(p) = T [p ∈ V : p /∈ C] . (5.10)

False Positive (FP): an assigned correspondence in occluded areas:

FP (p) = T [p ∈ C : p /∈ V ]. (5.11)

The ROC space is defined by the above functions, from which ER and SR are obtained; remember that they are used as the vertical and horizontal axes, respectively, in the ROC plots.

ER = (1/|V|) ∑p (M(p) + FP(p)), (5.12)

where p only takes the values of valid image coordinates (see Fig. 5.4(a)) and |V| is the number of valid pixels. Finally:

SR = (1/|V|) ∑p FN(p). (5.13)

In the ROC curves presented in Table 5.1, the sparsity rate parameter is varied as follows: the cost values of the candidate matches in C are sorted in descending order. Next, from this list, and by using a decreasing τMGI threshold, the different points of the ROC curve are obtained. For instance, the first plotted element in the ROC curve corresponds to the bottom right point, which is the maximum cost value, achieved only by a small set of pixels. Then, by decreasing the τMGI threshold, all the other points that define the ROC curve are obtained. In other words, the more pixels are selected by reducing the sparsity rate, the larger the resulting error rate.
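As a concrete illustration, the metrics of Eqs. (5.9)-(5.13) and the τMGI sweep can be sketched as follows. This is a hedged NumPy sketch, not the thesis code: the array layout, the function names, and the use of NaN to encode missing values are assumptions made here for illustration.

```python
import numpy as np

def roc_point(gt, disp, valid, cost, tau):
    """One (SR, ER) point: keep only candidates whose cost exceeds tau.

    gt, disp : ground-truth / estimated disparity maps (NaN = no value)
    valid    : boolean mask V of pixels with ground-truth disparity
    cost     : per-pixel matching cost used to threshold candidates
    """
    cand = ~np.isnan(disp) & (cost >= tau)               # candidate set C
    mismatch = valid & cand & (np.abs(gt - disp) > 1)    # M,  Eq. (5.9)
    false_neg = valid & ~cand                            # FN, Eq. (5.10)
    false_pos = cand & ~valid                            # FP, Eq. (5.11)
    n_valid = valid.sum()
    er = (mismatch.sum() + false_pos.sum()) / n_valid    # Eq. (5.12)
    sr = false_neg.sum() / n_valid                       # Eq. (5.13)
    return sr, er

def roc_curve(gt, disp, valid, cost, n_steps=50):
    # sweep tau from the maximum cost downwards: sparse/accurate -> dense/noisier
    taus = np.linspace(np.nanmax(cost), np.nanmin(cost), n_steps)
    return [roc_point(gt, disp, valid, cost, t) for t in taus]
```

Decreasing `tau` admits more candidates, tracing the curve from the sparse bottom-right point toward denser, noisier operating points.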

5.4 Experimental Results

This section presents experimental results obtained with different algorithm settings and scenes. The setting of parameters is obtained from two optimization steps. The first one is intended to find the best setting of (i) window size, (ii) scale, and (iii) quantization levels from the parameter space P = wz × σt × Q. The second optimization step is devoted to finding the best confidence values Λ = {λ0, ..., λt} used for propagating the CMI and CGI costs through consecutive levels (Eq. (4.2)). These two steps have been implemented as follows.

Firstly, an efficiency measure (em) is defined to be used as a quantitative value for comparisons between different settings. Let em = ∫₀¹ ER dSR be the area under the error curve (ER), defined for all SR in the interval [0, 1], for a given setting of parameters. The parameter space P is sampled in a limited number of values defining a regular grid. Then, the best setting of parameters corresponds to the node of that grid where em reaches its minimum value. Since a scale space representation is used in the proposed approach, not only the setting with the minimum em value is considered, but the best pi settings. Note that no prior information about the number of levels in the scale space representation is assumed. Hence, the family of parameter settings with the lowest error is obtained. This first optimization step is performed for each subset of the whole dataset. By analyzing the results it is possible to find similarities between the best settings for the images in the evaluation dataset. Thus, it is possible to find relationships between the elements of the parameter space, particularly between the window size and the quantization level.
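The efficiency measure and the grid search over P can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions: `evaluate`, which runs the matcher for one parameter setting and returns its (SR, ER) curve, is a hypothetical stand-in, and the grid values shown are examples, not the thesis ranges.

```python
import itertools
import numpy as np

def efficiency_measure(sr, er):
    """em = area under the ER(SR) curve on [0, 1]; lower is better."""
    sr, er = np.asarray(sr, float), np.asarray(er, float)
    order = np.argsort(sr)
    sr, er = sr[order], er[order]
    # trapezoidal rule over the sampled (SR, ER) points
    return float(np.sum(0.5 * (er[1:] + er[:-1]) * np.diff(sr)))

def rank_settings(evaluate, wz=(7, 19, 31), sigma=(0.5, 1.0, 1.5, 2.0),
                  Q=(8, 16, 32, 64)):
    """Exhaustively score every node of the grid P = wz x sigma_t x Q."""
    scored = [(efficiency_measure(*evaluate(w, s, q)), (w, s, q))
              for w, s, q in itertools.product(wz, sigma, Q)]
    scored.sort()   # ascending em: the whole family, best settings first
    return scored
```

Returning the whole ranked family (rather than only the minimum) mirrors the text: the best pi settings per scale are kept for the later cost-propagation step.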

Then, the second optimization step finds the best set of Λ = {λ0, ..., λt} values for merging the CMI and CGI costs corresponding to each of the pi settings obtained above. Although initially a large family of pi settings was considered, we experimentally found that three levels were enough to propagate the CMI and CGI costs through the scale space representation. Hence, this second optimization process finds the best [λ0, λ1, λ2] using a similar approach.

The two optimization steps mentioned above are used to find the best combination of parameter settings. Initially, an exhaustive search in the parameter space P is performed. The results are used to illustrate the behavior of ER and SR in each subset of the dataset. Table 5.1 shows the three error curves corresponding to the road, facade, and OSU color-thermal scenes. These curves depict the error and sparsity rate when the best settings are used in the CMI and CGI cost functions (Eq. (4.3) and Eq. (4.4), respectively), together with the improvement achieved by merging them (C). Finally, after finding the best settings for the whole dataset (including the confidence parameters [λ0, λ1, λ2]), several sparse depth maps of real outdoor scenarios are presented (see right columns in Table 5.2 and Table 5.3).

The parameter settings corresponding to the ROC curves presented in Table 5.1 were found with an exhaustive search in the following ranges: wz = {7, 19, 31}, σt = {0.5, 1, . . . , 6} and Q = {8, 16, 32, 64}. The best set of parameters and propagation scheme is the following: CMGI{p2} → CMGI{p1} → CMGI{p0}, where p2 = {31, 1.5, 32}, p1 = {19, 1, 16} and p0 = {7, 0.5, 8}. In our proposal, the window size (wz parameter) decreases from 31 to 7 pixels, which looks like an inverted pyramid. This avoids blurred disparity maps, particularly on edges, since small windows at the last level reduce the edge fattening problem. On the other hand, we observed that the information content decreases with scale, as reported in [99], but in our case faster, at σt = 2; a σt greater than this value decreases the correct matching score. This is due to the fact that gradients within the windows appear too smoothed, and the windows tend to have low entropies. Therefore, the propagation of costs should be done before this empirical value, given that a wrong setting could increase the error rate. Regarding λ, the best setting for the above parameter space P = {p2, p1, p0} is Λ = {0.45, 0.30, 0.25}.

At this point, we must distinguish between MI computed from a discrete joint probability p(ai, bj), Eq. (5.1), and its smooth version obtained from Eq. (5.2). The difference lies in how the joint probability p(ai, bj) is estimated. When the discrete joint probability is used, MI results in a wave-like curve that is difficult to minimize. On the contrary, when the smooth p(ai, bj) is considered, a better behaved function is obtained, which helps us to find stable parameters. The Parzen window estimation uses three different Gaussian kernels: g(ai, bj, 3) for wz = 7; g(ai, bj, 7) for wz = 19; and


Table 5.1: Results obtained at different scales and different settings (p2 = {31, 1.5, 32}, p1 = {19, 1, 16} and p0 = {7, 0.5, 8}; as well as their combination C{p2, p1, p0}).

[Each row of Table 5.1 shows one scene (Road, Facade, OSU) together with its MGI and CMGI ROC curves.]

g(ai, bj, 9) for wz = 31.

As a conclusion from the plots presented in Table 5.1, we can mention that the best results are obtained when C is used, improving the results at each scale. On


Table 5.2: Examples of sparse depth maps from outdoor scenarios (red color in the cost map corresponds to high cost values).

VS LWIR CMGI cost map 3D results

average, 45% of correct matches, with about 10% ER, can be obtained with the proposed approach and the initialization parameters provided. Another conclusion from Table 5.1 is that, for a given sparsity rate, the best results (lowest error rate) are always obtained after merging the different CMGI with the fusion rule presented in Eq. (4.2).


Table 5.3: Examples of sparse depth maps from outdoor scenarios (red color in the cost map corresponds to high cost values).

VS LWIR CMGI cost map 3D results


Table 5.2 and Table 5.3 show the results obtained with the proposed method. The first and second columns correspond to the rectified images, visible and thermal infrared, respectively. The third column presents the cost maps obtained after applying the disparity selection method explained in Sect. 5.2; each pixel in this representation corresponds to the maximum C value over d, i.e., arg max_d (C(p, d)). Finally, the fourth column depicts the 3D sparse depth maps obtained from the correct matches. The sparse maps show how the multimodal correspondence between LWIR and VS can provide useful 3D information. Notice that the images used for validating the proposed approach would still be a challenge for a VS/VS stereo algorithm. Nevertheless, the obtained results demonstrate the capability of our approach for finding correspondences in complex scenes under a wide range of radiometric differences, such as uncontrolled lighting conditions (sources), vignetting and shadows.

The results shown in the tables must be understood beyond the sparseness of the 3D representations, or the accuracy with which the contours are recovered. For example, notice that some regions in the LWIR images appear textureless whereas in the VS images they appear textured; however, our approach can overcome this situation and provides a depth map free of mismatches over those regions (and likewise in the contrary case). This is a consequence of the manner in which mutual and gradient information are combined: the multiplication of MI and GI reaches its maximum when a given correspondence is weighted as a correct one by both the MI and GI cost functions. A denser representation could be obtained by relaxing the τMGI threshold, but it would be affected by noisy data; actually, this is a common trade-off in stereo vision systems. It should be noticed that the 3D representations presented in Table 5.2 and Table 5.3 provide not only the (X, Y, Z) and color components (r, g, b), as classical stereo systems do, but also the thermal infrared information corresponding to every 3D point.

The cost maps presented in Table 5.2 and Table 5.3 show that, in general, the cost function introduced in Chapt. 4 can match a pair of windows extracted from multimodal stereo images with different accuracy. Our algorithm is designed to identify regions with high information content, such as edges and contours, and from them to obtain a 3D representation of the scene. It also penalizes mismatches in textureless areas, which are not reliable for finding correspondences, for instance in image regions such as walls and floors. As can be appreciated in Table 5.2 and Table 5.3 (third column), the higher cost values are concentrated on the edges, since in those regions a consensus between MI and GI is reached. Furthermore, it is possible to perceive the structure of the scene from these cost maps, which confirms the importance of discontinuities for relieving the ill-posedness of multimodal stereo. The strategy of cost propagation across a scale space representation enriches the final cost volume C, allowing the identification of the correct disparity within a candidate set (Sect. 5.2.2).

As a result, we can affirm that although the current section is focused on recovering 3D information, we have confirmed that the C cost function outperforms mutual information and gradient-based approaches in multimodal stereo matching problems. This conclusion is supported by reviewing previous work [8], which uses a similar cost function. Since both evaluations (the current and the previous one) use the same database (OSU Color-Thermal dataset [39]), we conclude that C is a valid cost function for searching correspondences in multimodal video sequences. This conclusion could also be extended to the multimodal pedestrian tracking and detection problem; this statement is motivated by the fact that the work of Krotosky et al. (e.g., [97, 96]) is based only on the use of mutual information as a similarity function for matching pedestrian regions.

Finally, regarding the question formulated in Sect. 5.1, “is it possible to obtain 3D information from visible and thermal infrared images?”, we can safely say that it is possible, and that it represents a promising research topic with several open issues.

5.5 Conclusions

This chapter presents a novel multimodal stereo matching algorithm for VS and LWIR images. The different stages for obtaining sparse depth maps are described. Furthermore, a ROC-based evaluation methodology is proposed for evaluating the results from such multimodal stereo heads. It allows analyzing the behavior over a wide range of different parameter settings. Although the obtained results are sparse representations, we should keep in mind the challenge of finding correspondences between these two separated spectral bands.

In summary, the main contributions of the current chapter are: (i) to present a study in an emerging topic, multimodal stereo LWIR/VS, achieving a sparse 3D representation from images coming from heterogeneous information sources; (ii) to propose a consistent criterion for establishing the multimodal correspondence; (iii) to establish a baseline for future comparisons; and (iv) to propose a framework that can be used as a test bed for evaluation purposes in this field.


Chapter 6

Piecewise Planar Based Approach for Dense Representations

This chapter proposes a novel framework for extracting dense disparity maps from a pair of multimodal images. The goals are to explore alternatives to the LWIR/VS correspondence problem and to incorporate semantic information in the matching process. The proposed framework consists of three steps. Initially, a rough disparity map is computed, which is sparse but accurate. Then, the scene is described through a set of plane hypotheses, using both the previous map and the visual information. Finally, a dense disparity map is computed by solving a global optimization problem, instead of a local one. The current chapter also presents the transformation of the proposed LWIR/VS cost function C from its discrete version to its smooth form. Finally, it proposes the use of a Manhattan-world assumption for constraining the space of all possible scenes to a group of those that can be effectively processed by the proposed multimodal piecewise planar stereo algorithm. Experimental results in outdoor scenarios are provided, showing the validity of the proposed framework.

6.1 Introduction

The 3D representation of scenes from two views is an issue that has received continual attention in the photogrammetry and computer vision literature, due to its wide range of applications, which include, for example, 3D object retrieval and 3D photo-realistic scene creation for games, movies, and maps. A first approximation to the surface reconstruction problem is to use dense depth data obtained from well-known stereo techniques [22, 28, 54, 162]. However, it is difficult to obtain accurate surface models from real data, which are generally affected by noise. Although there are other techniques robust to this drawback, all of them are exclusive to VS/VS stereo images. To the best of our knowledge, dense disparity maps for LWIR/VS stereo rigs based on planar patches have not been reported in the literature. Therefore, we wonder: “is it possible to obtain dense disparity maps from a multimodal stereo head defined with a camera working in the visible and another in the thermal infrared spectral band?”.

One possibility to address the above question is to define the LWIR/VS correspondence problem as an optimization problem, in which a label should be assigned to each image pixel. The goal then is to optimize an energy function that assigns labels by imposing restrictions on the observations and penalizing configurations that violate spatial smoothness priors. Similar regularization techniques have been successfully used in VS/VS stereo [22, 149]. However, they have not been explored yet for LWIR/VS stereo, because neither a cost function nor a similarity function for that problem had been proposed.

In this chapter, we propose a new multimodal stereo method aimed at recovering a dense, piecewise planar disparity map of the scene. It is also shown that the geometry of the scenes, predominantly planar, can be recovered in an accurate and smooth way. Our approach exploits high-contrast edges and straight lines in the scene to overcome the lack of detail or texture of multimodal images, particularly ILWIR. The structure of the chapter is the following. A review of works on multimodal stereo is presented in Sect. 6.2. Then, the steps of the proposed algorithm are presented in Sect. 6.3. Technical details of its evaluation together with experimental results are presented in Sect. 6.4. Finally, conclusions and final remarks are given in Sect. 6.5.

6.2 Background

The extraction of 3D information from multimodal stereo heads (LWIR/VS) has attracted the interest of researchers in different computer vision applications, for example: human detection [66], video surveillance [97], and 3D mapping of surface temperature [169], [128]. However, to the best of our knowledge, previous multimodal stereo systems are not able to obtain dense representations via optimization. The first attempt at incorporating these techniques was made, unsuccessfully, in [98]. That study concluded that energy functions based only on mutual information are not appropriate for solving the thermal infrared and visible correspondence problem. Thus, optimization methods such as graph cuts [88] and semi-global matching [69] do not converge to the global solution, since it is not possible to predict the value of a pixel when its corresponding one is given (the well known thermal infrared and visible spectrum correlation problem [120]).

Clearly, some constraints must be imposed in order to overcome the low correlation between the thermal infrared and visible bands. In the current chapter, these constraints are collected into the so-called Manhattan world assumption [35]. This constraint was originally used to model urban environments with buildings, exploiting regularities of contours and boundaries. Although these environments are characterized by surfaces lacking texture, repetitive patterns (e.g., brick facades), and constant colors (e.g., painted walls), a VS/VS stereo algorithm based on regions and the Manhattan assumption can recover depth information from such scenes (e.g., [54, 119]).

Region-based algorithms are relatively new; they have been proposed for VS/VS stereo matching, and they roughly consist of two steps. Firstly, a pre-initialization step is performed to compute a sparse representation. Then, a global optimization scheme is applied to generate the sought dense solution. These algorithms exploit edge information, which is a common characteristic in Manhattan world scenes.

We found that region-based methods based on the Manhattan world assumption, and those that constrain the scene to a set of planar structures (e.g., [17, 90, 162]), can be adapted to the thermal infrared and visible correspondence problem. The key aspect of this adaptation is the cost function used for measuring the similarity between image patches. In the current chapter a multimodal cost function, similar to the one introduced in Sect. 3, is used, but it includes a significant change in its formulation, which improves its accuracy in the presence of multimodal edges and textureless regions.

Once a suitable LWIR/VS cost function is established, it is possible to obtain an initial 3D sparse representation of the scene, where the edges are predominant. This representation is used both to deduce the structure of the scene and to initialize the proposed region-based algorithm.

MRF-based stereo [21] traditionally focuses on first order smoothness priors by assuming fronto-parallel surfaces, which limits the quality of the reconstructions, since only a discrete set of disparities is allowed. An iterative approach for fitting planes to disparity maps was proposed by [16], but this approach could only find the most dominant planes in each image. In contrast, our plane detection step is performed using a RANSAC plane fitting algorithm over the whole image, allowing the recovery of smaller planar patches as well as planes whose orientation is almost parallel to the principal axis of the camera.
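A minimal sketch of such a RANSAC plane fit on a sparse disparity map is shown below. This is illustrative only, not the thesis implementation: the point format (x, y, d), the iteration count, and the inlier tolerance are assumptions made for the example.

```python
import numpy as np

def ransac_plane(points, n_iter=500, tol=1.0, seed=0):
    """Fit a disparity plane d = a*x + b*y + c to sparse (x, y, d) samples."""
    rng = np.random.default_rng(seed)
    xy1 = np.c_[points[:, :2], np.ones(len(points))]
    d = points[:, 2]
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(points), size=3, replace=False)
        try:
            model = np.linalg.solve(xy1[idx], d[idx])   # minimal sample -> [a, b, c]
        except np.linalg.LinAlgError:                   # degenerate (collinear) sample
            continue
        inliers = np.abs(xy1 @ model - d) <= tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # least-squares refit over all inliers of the best hypothesis
    model, *_ = np.linalg.lstsq(xy1[best_inliers], d[best_inliers], rcond=None)
    return model, best_inliers
```

Running this repeatedly on the points not yet assigned to a plane would yield the set of plane hypotheses used in the later labeling stage.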

As mentioned above, our approach is closely related to the segment-based stereo algorithms presented in [17, 54, 56], but the key difference is that our MRF considers a novel data term based on mutual and gradient information in a scale-space context. Proposed approaches for urban reconstruction often start with image matching followed by pose estimation or dense stereo, and end with the integration of several depth maps into one 3D model [1, 34, 78]. However, in the multimodal field it is not clear how to compute the initial 3D models, so large 3D models of cities, or partial models of more than one view, are a distant goal. Despite this, the current chapter tackles the first challenge: obtaining dense data.

VS/VS stereo methods for urban scenes are in most cases pixel-based [18, 56, 151] and assume ruled surfaces¹, enabling the reduction of stereo matching ambiguities through photoconsistency-based constraints [23]. Such restrictions are unreachable for multimodal images, so instead of encouraging photoconsistency, we assume a structural similarity between local gradient fields and an underlying probability distribution between pixel values within a neighborhood.

¹In geometry, a surface S is ruled if through every point of S there is a straight line that lies on S.

To relieve the above mentioned drawbacks, priors based on image segmentation cues, as well as on the presence of dominant scene orientations and piecewise planar structures, have been suggested [119]. We employ these assumptions directly in the dense matching stage. The piecewise planarity explicitly suppresses the jaggy effect due to incorrect matches or disparity decimation. Urban scenes often exhibit strong structural regularities (flat, texture-poor walls, sharp corners, and axis-aligned geometry), and the presence of such structures is an opportunity for constraining, and therefore simplifying, the matching task, as suggested by Furukawa et al. [54]. Traditionally, these scenes are complicated for dense two-frame stereo correspondence algorithms such as those presented in [136], since the lack of texture leads to wrong matches, whereas the sharp angles and non-fronto-parallel geometry break the smoothness assumptions.

Piecewise planar stereo begins with the work of Wang and Adelson on layered motion models [163]. Then, several authors, including Baker et al. [5], Birchfield and Tomasi [16], and Tao et al. [151], constrained the problem to multiple views related by 2D affine motions. Their algorithms iterate between two steps: assigning pixels to 3D planes and refining the plane equations. Other researchers, such as Coorg and Teller [33], Werner et al. [167, 175], and Pollefeys et al. [127], use dominant plane orientations to perform a plane sweep. Recent works, such as Furukawa et al. [54], Sinha et al. [144], Micusik [119], and Gallup et al. [56], produce very convincing 3D reconstructions of man-made architectural scenes. We investigate all these research efforts from a theoretical point of view, identifying the specific features that can be exploited in a multimodal application domain.

Summarizing, the main contribution of this chapter is to propose a complete framework for dense disparity map estimation assuming a piecewise planar model of the scene. The proposed framework spans from the acquisition module up to the proposed algorithm for obtaining dense disparity maps (Chapt. 3 and Chapt. 5). The contributions are as follows: (i) a multimodal cost function is proposed to exploit reliable similarities between multimodal images; and (ii) a standard Markov Random Field is proposed for obtaining dense disparity maps, representing the scene by means of a piecewise planar model.

6.3 Pixelwise Planar Stereo

The proposed approach consists of three stages. Firstly, it starts by estimating a sparse but accurate disparity map of the scene. Then, in the second stage, the initial map is represented by means of a set of planes. Finally, a dense disparity map is obtained by a piecewise planar labeling framework. These stages are detailed next; an illustration is provided in Fig. 6.1. This framework assumes an acquisition module of multimodal images such as the one presented in Chapt. 3, which provides both rectified images and calibration parameters.


[Pipeline of Fig. 6.1: IVS and ILWIR → Initial Disparity Map (Dmap0) → Superpixel Segmentation (S), Graph-based Segmentation (P), and Support Region Segmentation (R) → Plane Fitting → Plane Linking → Piecewise Planar Labeling → Dense Disparity Map.]

Figure 6.1: Illustration of the algorithm’s stages.

6.3.1 Matching Cost Volume Computation

The goal of this section is to compute an initial disparity map, which will then be fitted by a set of planes. This disparity map is obtained from a matching cost volume following a local window based approach, in which a Winner Take All (WTA) strategy is used for disparity selection. The main challenges of this first stage are both to get a large number of good correspondences and to achieve high accuracy in their locations. In order to address the matching problem, we propose to extend the LWIR/VS cost function presented in Sect. 5.2 to its continuous form. The accuracy challenge is faced using quadratic interpolation (similar to Sect. 5.2.2).

The matching cost volume is obtained through a cost function inspired by the one presented in Sect. 5.2.1. In particular, we replace the fusion operator and redefine the system of weights, from scales to information sources: the parameter λ now weights the contributions of CMI and CGI at a given scale, instead of their combined contribution (CMGI) scale by scale. This chapter follows the notation previously introduced. Thus, it is assumed that IVS and ILWIR are rectified, the search space is constrained to one dimension, the image coordinate p = (x, y) corresponds to a given pixel in IVS, while the locations (x + d, y) represent the candidate correspondences in ILWIR, and d stands for a disparity value. The cost of matching two windows centered on IVS(p) and ILWIR(x + d, y) is obtained as follows:

C(p, d) = λCMI(p, d) + (1− λ)CGI(p, d), (6.1)

where CMI and CGI are the cost terms based on mutual and gradient information given a scale space representation, which will be detailed next; and d = {dmin, . . . , dmax}.
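The fusion of Eq. (6.1), together with the WTA selection and quadratic sub-pixel refinement mentioned above, can be sketched as follows. This is a hedged sketch: the cost-volume layout (H × W × D, higher score = better match) and the function names are assumptions for illustration, not the thesis code.

```python
import numpy as np

def fuse_costs(c_mi, c_gi, lam=0.5):
    """Eq. (6.1): blend the mutual-information and gradient cost terms."""
    return lam * c_mi + (1.0 - lam) * c_gi

def wta_subpixel(cost_volume, d_min=0):
    """Winner-Take-All over the disparity axis with quadratic refinement.

    Fits a parabola through the winning score and its two neighbors;
    borders of the disparity range are kept at integer precision.
    """
    n_disp = cost_volume.shape[2]
    d = np.argmax(cost_volume, axis=2)
    h, w = np.indices(d.shape)
    c0 = cost_volume[h, w, np.clip(d - 1, 0, None)]
    c1 = cost_volume[h, w, d]
    c2 = cost_volume[h, w, np.minimum(d + 1, n_disp - 1)]
    denom = c0 - 2.0 * c1 + c2
    inner = (d > 0) & (d < n_disp - 1) & (denom != 0)
    offset = np.where(inner, 0.5 * (c0 - c2) / np.where(denom != 0, denom, 1.0), 0.0)
    return d_min + d.astype(float) + offset
```

For a cost slice that is exactly quadratic around its peak, the refinement recovers the continuous maximum.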

By definition, Mutual Information (MI) measures the information content in common between two randomly sampled signals, considering them as a collection of symbols drawn in a random manner [36]. From the point of view of our problem, those samples are a pair of windows centered on IVS(p) and ILWIR(x + d, y), respectively, which encode energy measurements in the visible and thermal infrared bands. In this chapter we propose to approximate MI(p, d) from the individual entropies h(·) and the joint entropy h(· , ·) as follows:

MI(p, d) = h(p) + h(d)− h(p, d). (6.2)

The above equation can be expressed in its continuous form as integrals of the marginal probability distribution functions (PDFs) and the joint PDF of the pixel values iVS and iLWIR in IVS and ILWIR, respectively; then:

h(p) = −∫₀¹ P(iVS) log (P(iVS)) diVS, (6.3)

h(d) = −∫₀¹ P(iLWIR) log (P(iLWIR)) diLWIR, (6.4)

h(p, d) = −∫₀¹ ∫₀¹ P(iVS, iLWIR) log (P(iVS, iLWIR)) diVS diLWIR, (6.5)

where P(iVS, iLWIR) represents the joint PDF, and P(iVS) and P(iLWIR) are the two marginal PDFs. Kim et al. [88] approximate these PDFs by a Parzen window density estimation, which is a sum of Gaussian distributions g with mean µ and covariance ψ (a detailed explanation can be found in [161]). In the current chapter, a nonparametric estimator (NP) [41] is used for computing the joint PDF, instead of a Parzen estimator. In this way, we avoid dependencies on the selection of parameters: µ and ψ of the Parzen estimator, and the parameter needed for binning the windows (Q). Notice that a joint PDF is a two dimensional histogram, where rows and columns represent symbols within the windows. In our scope, these symbols come from the pixel values of the multispectral images; however, since thermal infrared measurements tend to be concentrated in a few bins, particularly in outdoor scenes where the temperature remains uniform (thermal equilibrium), the contribution of the estimator used in the current section is significant, because it does not require parameter tuning for binning as in Barrera et al. [8]. Hence, the joint PDF is obtained as:

P(i_VS, i_LWIR) = NP(p, d). (6.6)
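As an illustration of this step, the joint PDF of two windows can be sketched as a normalized two-dimensional histogram of paired pixel values. The snippet below is a minimal stand-in for the nonparametric estimator NP of [41]; the function name and the bin count are illustrative choices, not values from the thesis.

```python
import numpy as np

def joint_pdf(win_vs, win_lwir, bins=16):
    """Joint PDF of two same-size windows as a normalized 2D histogram.

    A plain bin count stands in here for the nonparametric estimator
    NP of [41]; `bins` is an illustrative choice, not a thesis value.
    Pixel intensities are assumed to be normalized to [0, 1].
    """
    h, _, _ = np.histogram2d(win_vs.ravel(), win_lwir.ravel(),
                             bins=bins, range=[[0.0, 1.0], [0.0, 1.0]])
    return h / h.sum()

# Toy example: a pair of 7x7 windows with random intensities.
rng = np.random.default_rng(0)
p_joint = joint_pdf(rng.random((7, 7)), rng.random((7, 7)))
print(p_joint.shape)                   # (16, 16)
print(np.isclose(p_joint.sum(), 1.0))  # True
```

Rows and columns of the histogram play the role of the symbols described above; a real implementation would evaluate this at every candidate pair (p, d).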

Once P(i_VS, i_LWIR) is obtained, the joint entropy term h(p, d) in Eq. (6.5) is computed as follows:

h(p, d) = −∑_{i_VS, i_LWIR} log(P(i_VS, i_LWIR)) ∗ g_ψ(i_VS, i_LWIR), (6.7)

where g_ψ is a Gaussian kernel needed to approximate the continuous form in Eq. (6.5) from its discrete equivalent (see [88] and [70] for more details). In practice, we found that using a small kernel of 3 × 3 pixels is enough for achieving good approximations, despite the few samples in the windows. Finally, the marginal PDFs, corresponding to P(i_VS) and P(i_LWIR) in Eq. (6.3) and Eq. (6.4), respectively, are computed from the joint probability by summing along each dimension of P(i_VS, i_LWIR). Thus, P(i_VS) = ∑_{i_LWIR} P(i_VS, i_LWIR) and P(i_LWIR) = ∑_{i_VS} P(i_VS, i_LWIR). Then:

h(p) = −∑_{i_VS} log(P(i_VS)) ∗ g_ψ(i_VS), (6.8)

h(d) = −∑_{i_LWIR} log(P(i_LWIR)) ∗ g_ψ(i_LWIR). (6.9)

Note that P(i_VS) and P(i_LWIR) are one-dimensional vectors, and in this case g_ψ is also a one-dimensional Gaussian kernel.
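The computation of Eqs. (6.2) and (6.6)-(6.9) can be sketched as below. For brevity the small Gaussian smoothing g_ψ is omitted and plain Shannon entropies are computed from the joint PDF; everything else follows the definitions above.

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a (possibly multi-dimensional) PDF; 0*log 0 := 0.
    The small Gaussian smoothing g_psi of Eqs. (6.7)-(6.9) is omitted here."""
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

def mutual_information(p_joint):
    """MI = h(p) + h(d) - h(p, d), Eq. (6.2), with marginals obtained by
    summing the joint PDF along each dimension."""
    p_vs = p_joint.sum(axis=1)    # P(i_VS)
    p_lwir = p_joint.sum(axis=0)  # P(i_LWIR)
    return entropy(p_vs) + entropy(p_lwir) - entropy(p_joint)

# A diagonal joint PDF (perfect dependence) gives maximal MI,
# a uniform product PDF (independence) gives MI close to 0.
ident = np.eye(8) / 8.0
indep = np.full((8, 8), 1.0 / 64.0)
print(mutual_information(ident))  # log 8, approx. 2.079
print(mutual_information(indep))  # approx. 0.0
```

The two toy PDFs bracket the behaviour relevant for matching: windows covering the same surface should produce a concentrated joint histogram (high MI), while mismatched windows approach independence (MI near zero).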

Mutual information finds linear and nonlinear correlations between a pair of windows, taking into account the whole dependence structure of the variables. However, since local image structures provide rich information that could also be exploited, we introduce a term based on gradient information (GI). This new term is intended to contribute to the matching score in textured regions by comparing the orientation of gradient vectors. It is based on the observation that gradient vectors with similar orientations unveil potential matches.

The gradient information presented in Eq. (4.6) is also employed without changes in the current section. It remains the product of two elements: the first one measures the similarity in the orientation of gradient vectors, while the second one is a factor that weights this similarity value (for more details see Sect. 4.3). Furthermore, the propagation scheme of MI and GI presented in previous chapters is also incorporated, but the two terms are combined only at the lowest scale, in contrast with the initial formulation, which combines them at every scale. Thus, the mutual and gradient information costs are defined as follows:

C_MI(p, d) = [α_0, . . . , α_t] · [MI(∇_0^0(p, d)), . . . , MI(∇_0^t(p, d))]^T, (6.10)

C_GI(p, d) = [β_0, . . . , β_t] · [GI(∇_1^0(p, d)), . . . , GI(∇_1^t(p, d))]^T, (6.11)

where t is an index that refers to a level in the scale space (t ∈ N); ∇_0^t and ∇_1^t are scale-space representations given by the convolution of an image with a Gaussian kernel of standard deviation σ, which is progressively increased until an image stack is obtained. Two Gaussian derivative kernels, of order 0 and 1, are used to generate the blurred and gradient stacks. The parameters α_i and β_i weight the contribution of MI and GI at every level of the stack.
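The weighted combination across scale levels in Eqs. (6.10) and (6.11) reduces to a dot product between the weight vector and the per-level scores. A minimal sketch, where the per-level MI and GI scores for one candidate pair (p, d) are assumed already computed and the weights are the settings later reported in Sect. 6.4:

```python
import numpy as np

def combine_over_scales(level_scores, level_weights):
    """Dot product of per-level matching scores with level weights,
    as in Eqs. (6.10)-(6.11)."""
    return float(np.asarray(level_weights, float) @ np.asarray(level_scores, float))

# Illustrative per-level scores; the weights [0.5, 0.3, 0.2]
# are the alpha/beta values reported in Sect. 6.4.
c_mi = combine_over_scales([0.9, 0.6, 0.4], [0.5, 0.3, 0.2])
c_gi = combine_over_scales([0.8, 0.5, 0.3], [0.5, 0.3, 0.2])
print(c_mi)  # 0.5*0.9 + 0.3*0.6 + 0.2*0.4, approx. 0.71
print(c_gi)  # 0.5*0.8 + 0.3*0.5 + 0.2*0.3, approx. 0.61
```

Note how the decreasing weights favour the finest scale while still letting coarser levels propagate evidence into weakly textured areas.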

The cost volume referred to in Eq. (6.1) is computed from C_MI and C_GI, Eq. (6.10) and Eq. (6.11), respectively. Section 6.4 presents the values of all the parameters described above and how they are set. Finally, the initial sparse disparity map (Dmap0) is extracted from C_MGI using the minimization strategy based on a WTA criterion with a bounded searching space presented in Sect. 5.2.2. Figure 6.2(c) shows an illustration of the sparse disparity map resulting from this first stage (output), as well as the multimodal input images, Figs. 6.2(a,b).


Figure 6.2: Inputs and output of the first stage: (a) rectified infrared image I_LWIR; (b) rectified visible image I_VS; and (c) initial sparse disparity map Dmap0.

6.3.2 Plane-Based Hypotheses Generation

In this section the given visible image is segmented into a set of regions. Then, eachregion is represented by a single plane, using the information from the initial sparsedisparity map. These planes are used in the final stage as labels for computing thedense disparity map we are looking for.

A. Split and Merge Segmentation

This step involves the combination of two segmentation algorithms (i.e., [102] and [46]), which are applied to I_VS for obtaining regions that preserve the object boundaries in the scene. These segmentation algorithms are used in a split-and-merge scheme, in order to unveil potential planar regions. So, the image is decomposed into small regions (superpixels) that later on are connected following a perceptual criterion. It should be noted that this combination of algorithms is motivated by the application domain (man-made environments).

The segmentation into support regions begins by splitting the given I_VS into a large set of small regions, referred to as superpixels [102]: S = ⋃_i s_i. Without loss of generality, we assume that disparity values inside a superpixel s_i can be accurately fitted by a plane; this assumption is met as long as I_VS is oversegmented. Hence, a trade-off between the size of superpixels and the fitting error should be found (500 regions on average). Large regions have a high probability of covering more than a single planar surface; on the contrary, smaller ones provide too few samples for a proper estimation of the geometry of the selected region. Then, in order to extract perceptually meaningful regions, the segmentation algorithm proposed in [46] is applied to I_VS, resulting in a set of partitions of the reference image: P = ⋃_i p_i. Finally, the superpixels (S) belonging to the same perceptual region (P) are connected, giving rise to the support regions R = ⋃_i r_i we were looking for (details on the two segmentation algorithms can be found in [102] and [46], respectively):

r_i = ⋃_{j∈Ω_i} s_j ,  Ω_i = { j | |s_j ∩ p_i| ≥ |s_j ∩ p_k|, ∀ k ≠ i } (6.12)


Figure 6.3: Illustration of the support regions: (a) original I_VS; (b) superpixels (S) obtained from [102]; (c) perceptual regions (P) from [46]; and (d) support regions (R) obtained by fusing (b) and (c).

where Ω_i contains the indexes of those superpixels with a maximum overlap for a given perceptual region p_i. The combined use of these segmentation algorithms (i.e., [102] and [46]) is considered because, theoretically, they guarantee a stable segmentation that preserves the region boundaries and adapts to the local structure of the scene. Other segmentation algorithms, or combinations of them, are also valid. For instance, the mean shift algorithm by Comaniciu et al. [31] is commonly employed in VS/VS region-based stereo algorithms, such as [18, 162, 17], but it could lead to under-segmentation in the absence of boundary cues, as is reported in [102]. I_VS is selected for segmentation because the world coordinate system is set at the VS camera and there is a large number of algorithms and available code (in contrast to LWIR imaging). Finally, Fig. 6.3 shows an illustration of the results from the two segmentation algorithms, S and P, as well as their fusion R.
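A minimal sketch of the fusion in Eq. (6.12): given two integer label maps (the superpixels S and the perceptual partition P), each superpixel is assigned to the perceptual region it overlaps the most. Function and variable names are illustrative.

```python
import numpy as np

def support_regions(superpixels, perceptual):
    """Fuse an oversegmentation with a perceptual segmentation (Eq. 6.12):
    each superpixel joins the perceptual region it overlaps the most.

    Both inputs are integer label maps of the same shape; the returned
    map carries the perceptual-region labels."""
    out = np.empty_like(superpixels)
    for s in np.unique(superpixels):
        mask = superpixels == s
        labels, counts = np.unique(perceptual[mask], return_counts=True)
        out[mask] = labels[np.argmax(counts)]  # maximum overlap wins
    return out

# Toy 2x4 label maps: four superpixels, two perceptual regions.
S = np.array([[0, 0, 1, 1],
              [2, 2, 3, 3]])
P = np.array([[5, 5, 6, 6],
              [5, 5, 6, 6]])
R = support_regions(S, P)
print(R)  # each superpixel adopts its majority perceptual label
```

In this toy case every superpixel lies entirely inside one perceptual region, so R simply reproduces P; in practice superpixels straddling a perceptual boundary are snapped to the side with the larger overlap.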

B. RANSAC Plane Fitting

Once the initial disparity map and the candidate planar regions (R) have been obtained (Sects. 6.3.1 and 6.3.2), they are combined in order to partition the disparity map into subsets of points, each of which is then fitted by a plane. Since Dmap0 is considered as a cloud of 3D points (x, y, d) that consists of planes, an estimator of plane parameters based on a Random Sample Consensus (RANSAC) [50] strategy is proposed.

Our estimator alternates between a plane parameters estimation step, which computes a plane from random samples, and a plane parameters evaluation step, which verifies its quality. These parameters are then used to describe the corresponding region with a plane. These steps are repeated until a maximum number of iterations is reached or the desired accuracy is obtained. In case the parameters of a planar region cannot be found, either because the error is greater than the given threshold or because the number of iterations is not enough, the region is labelled as non-planar. A more detailed explanation of these steps is given below.

• Plane parameters estimation. The parameters (c1, c2, c3) of the plane equation, defined as d = c1x + c2y + c3, are obtained by randomly selecting three points from the given region—remember that the z coordinate corresponds to the disparity value d, while (x, y) corresponds to the column and row position in the image.

• Plane parameters evaluation. The quality of the parameters estimated above is evaluated by two criteria: (i) the number of inliers (i.e., points with a geometric distance to the plane smaller than a given threshold τ_ERROR); and (ii) the ratio of the number of inliers to the total number of points in that region (this ratio needs to be higher than τ_RATIO).

The estimator selects the set of plane parameters with the best evaluation (i.e., the highest number of inliers and a ratio greater than τ_RATIO). Finally, the selected parameters are refined using only the inliers of the corresponding plane. In this refinement the parameters (c1, c2, c3) are obtained by orthogonal regression using principal component analysis. This RANSAC-based plane fitting is repeated for all the regions, resulting in a plane π per region. Hereinafter, each plane is defined by its normal vector n and the coordinates of its centroid x.
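The two RANSAC steps above can be sketched as follows. The function name, thresholds, and iteration count are illustrative choices rather than the thesis's values, and the final refinement uses ordinary least squares instead of the orthogonal PCA regression described above.

```python
import numpy as np

def ransac_plane(points, tau_error=1.0, tau_ratio=0.8, iters=200, seed=0):
    """Fit d = c1*x + c2*y + c3 to (x, y, d) points with RANSAC.

    Returns (c1, c2, c3) refined by least squares on the inliers, or None
    when the region cannot be explained by a plane (labelled non-planar).
    Thresholds/iterations are illustrative, not the thesis settings."""
    rng = np.random.default_rng(seed)
    A = np.column_stack([points[:, 0], points[:, 1], np.ones(len(points))])
    d = points[:, 2]
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(len(points), size=3, replace=False)
        try:
            c = np.linalg.solve(A[idx], d[idx])  # plane through 3 samples
        except np.linalg.LinAlgError:
            continue                             # degenerate sample
        inliers = np.abs(A @ c - d) < tau_error
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    if best_inliers is None or best_inliers.mean() < tau_ratio:
        return None                              # non-planar region
    # Refinement on the inliers only (least squares here; the thesis
    # uses orthogonal regression via PCA).
    c, *_ = np.linalg.lstsq(A[best_inliers], d[best_inliers], rcond=None)
    return c

# Noisy samples of the plane d = 2x - y + 3.
rng = np.random.default_rng(1)
xy = rng.uniform(0, 10, size=(100, 2))
d = 2 * xy[:, 0] - xy[:, 1] + 3 + rng.normal(0, 0.05, 100)
pts = np.column_stack([xy, d])
print(ransac_plane(pts))  # close to [2, -1, 3]
```

Regions where the inlier ratio never reaches the threshold return None, matching the non-planar labelling described above.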

C. Plane Hypotheses Generation Based on Plane Affinity

Once the RANSAC algorithm has been applied to all the regions in R, a postprocessing stage is performed to merge planar patches defined by similar parameters. This postprocessing simplifies the set of planar hypotheses. Note that the planes have been obtained locally, so the number of planar hypotheses could be as large as the number of regions in R. Hence, the goal of this postprocessing stage is to reduce the set of planar hypotheses Π = {π1, π2, ..., πn} to a minimum size such that the structure of the scene is still preserved. The plane linking stage is based on a distance (dist_π) computed from two planar patches, which was initially proposed in [151]. It is defined as follows:

dist_π(π_i, π_j) = l(π_i, π_j) + l(π_j, π_i), (6.13)

l(π_i, π_j) = ((x_j − x_i) · n_j) / (n_i · n_j). (6.14)

Eq. (6.14) corresponds to the length of the segment defined by x_j and the intersection with π_i of the line along n_j passing through x_j. For clarity, a 2D representation of the segment lengths used for computing Eq. (6.13) is given in Fig. 6.4.


Figure 6.4: 2D illustration of segment lengths used to compute distπ(πi, πj).

The plane distance of Eq. (6.13) is used as a similarity function for merging a pair of planar patches. Hence, two planar patches are fused into a single one if dist_π(π_i, π_j) ≤ τ_LINK. Once all possible combinations have been evaluated (only connected neighbor regions are considered), a new relabelled R is obtained and the RANSAC algorithm is called again, until convergence is reached.

Finally, once there are no more planes to be joined, a noisy planar hypotheses removal is performed. It is based on detecting the predominant planar orientations using a PCA over all normal vectors (n). This filtering stage tends to remove planes with an orientation n_i far away from the principal directions. This results in a compact set of planar hypotheses Π = {π1, π2, ..., πc}; it is expected that the number of hypotheses has been reduced: c ≪ n. Figure 6.5(b) shows the planar hypotheses obtained after merging planar patches with similar parameters and filtering the noisy ones. The original set contains 179 hypotheses (see Fig. 6.5(a)), while the one presented in Fig. 6.5(b) is defined by only 14 hypotheses. They were obtained after four iterations of the plane linking stage.
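The plane-linking distance of Eqs. (6.13)-(6.14) can be sketched as below on toy coplanar and non-coplanar patches. Segment lengths are taken as magnitudes, which we assume is the intent of the printed (signed) formula; names are illustrative.

```python
import numpy as np

def plane_link_distance(x_i, n_i, x_j, n_j):
    """Symmetric inter-plane distance of Eq. (6.13), from centroids x
    and unit normals n of two planar patches."""
    def seg_length(x_a, n_a, x_b, n_b):
        # Eq. (6.14); taken as a magnitude (assumption: the signed
        # printed form is interpreted as a segment length).
        return abs(np.dot(x_b - x_a, n_b)) / abs(np.dot(n_a, n_b))
    return seg_length(x_i, n_i, x_j, n_j) + seg_length(x_j, n_j, x_i, n_i)

n = np.array([0.0, 0.0, 1.0])   # both patches lie on z = const planes
x_i = np.array([0.0, 0.0, 0.0])

# Coplanar patch: distance 0, fused for any positive threshold.
d_same = plane_link_distance(x_i, n, np.array([5.0, 0.0, 0.0]), n)
# Patch lifted to z = 2: both segment lengths equal 2.
d_diff = plane_link_distance(x_i, n, np.array([5.0, 0.0, 2.0]), n)
print(d_same)  # 0.0
print(d_diff)  # 4.0
tau_link = 2.5  # value reported in Sect. 6.4
print(d_same <= tau_link, d_diff <= tau_link)  # True False
```

With τ_LINK = 2.5 the coplanar pair would be merged while the offset pair stays separate, mirroring the fusion rule described above.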

6.3.3 Pixelwise Plane Labeling

The set of planar hypotheses obtained above is now converted into labels for reformulating the disparity computation as a global minimization problem. This allows taking contextual constraints into account in order to achieve a dense disparity representation from multispectral information. The global minimization problem is based on the local correlation indicators computed in the previous sections (i.e., mutual and gradient information boosted by the scale-space representation). In this section, the former indicators, which were extracted at the pixel level, are now interpreted as projections of planar surfaces. This helps to constrain the searching space to a few candidates, while


Figure 6.5: Planar hypotheses simplification: (a) original set of planar hypotheses from the segmentation presented in Fig. 6.3 (179 planes) and (b) planar hypotheses resulting after four iterations of the postprocessing stage (14 planes).

the spatial coherence of disparity values is preserved. Notice that an extra planar hypothesis, denoted as π_∞, which represents all those regions out of the stereo range (e.g., sky or distant surfaces), is added to Π.

Markov Random Field theory provides a framework to relate the local correlation indicators with contextual constraints. These two elements are used to define an energy function, which is then minimized through classical graph cuts [22]. The method works by defining a regular (four-connected) grid where every node represents a pixel in the image. Hence, a graph G = 〈V, E〉 is obtained, where V represents the vertices and E the edges of the graph. Then, the graph cut algorithm searches in G for the best set of labels f that minimizes the following energy function:

E(f) = ∑_{p∈P} D_p(f_p) + λ_smooth ∑_{p,q∈N} V_pq(f_p, f_q), (6.15)

where P is the set of pixels in the image; D_p is the data term that measures how well a planar hypothesis explains a disparity value for a given pixel p; V_pq(f_p, f_q) is a smoothness prior computed in a neighborhood N (in the current chapter a first-order Markov Random Field is considered, thus a neighborhood has four connections); f_p and f_q are the current labels for pixels p and q, respectively; and λ_smooth is a weighting factor for the regularization term. The D_p function is defined as follows:

D_p(f_p) = { min(C(f_p), C_max)  if f_p ∈ {π1, π2, ..., πc},
           { 0.9 · C_max         if f_p = π_∞, (6.16)

where the cost assigned to a pixel p represents its degree of membership to a given plane π_i. This cost is obtained from C(p, d) (see Eq. (6.1)), where d corresponds to the hypothetical disparity obtained if p is assigned to the plane π_i. If a certain hypothesis π_i produces an inconsistent d (for instance, a value outside of the searching space), then p is penalized with the maximum cost C_max. Finally, the smoothness term V_pq is


Figure 6.6: Graph cuts outcomes: (a) labelled regions from graph cuts and (b)dense disparity map obtained with the proposed approach.

defined as:

V_pq(f_p, f_q) = ∇_1(I_VS(p)) · { 0              if f_p = f_q,
                                { dist_max       if f_p or f_q ∈ π_∞,
                                { dist(f_p, f_q) otherwise, (6.17)

where ∇_1 is the gradient of I_VS, and dist(f_p, f_q) is the Euclidean distance between the points p and q, depending on which planes they belong to. The minimization of Eq. (6.15) assigns every p in the reference image to one planar hypothesis π_i (see Fig. 6.6(a)). Then, from this membership the corresponding disparity value is obtained by computing the intersection of a ray passing through p with the assigned plane π_i. Figure 6.6(b) shows the dense disparity map corresponding to the illustration used as a case study in the previous sections.
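The data term of Eq. (6.16) can be sketched as follows: for one planar hypothesis, the disparity it predicts at each pixel indexes the precomputed cost volume C(p, d), and pixels whose predicted disparity leaves the search range receive the maximum cost. Function names and the toy sizes are illustrative.

```python
import numpy as np

def data_term(cost_volume, plane, c_max, d_max):
    """Data term of Eq. (6.16) for one planar hypothesis over the image.

    `cost_volume[y, x, d]` holds the matching cost C(p, d); the disparity
    hypothesised at pixel (x, y) by plane (c1, c2, c3) is d = c1*x + c2*y + c3.
    Pixels whose hypothesised d leaves the search range get the cost c_max."""
    h, w, _ = cost_volume.shape
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.rint(plane[0] * xs + plane[1] * ys + plane[2]).astype(int)
    valid = (d >= 0) & (d <= d_max)
    out = np.full((h, w), c_max, dtype=float)
    dy, dx = np.nonzero(valid)
    out[dy, dx] = np.minimum(cost_volume[dy, dx, d[dy, dx]], c_max)
    return out

# Tiny example: 2x2 image, 4 disparity levels, fronto-parallel plane d = 1.
cv = np.arange(16, dtype=float).reshape(2, 2, 4)
D = data_term(cv, plane=(0.0, 0.0, 1.0), c_max=100.0, d_max=3)
print(D)  # picks cv[y, x, 1] at every pixel: 1, 5, 9, 13
```

Stacking one such cost map per hypothesis (plus the constant 0.9·C_max map for π_∞) yields the unary costs consumed by the graph cut minimizer.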

6.4 Experimental Results

This section starts by presenting some details about the evaluation dataset, in particular the additional information that is required for evaluation (ground truth data). Then, eight case studies are presented; they are used to provide a qualitative overview of the algorithm performance, including both intermediate results of the proposed algorithm and dense disparity maps. Finally, two quantitative experiments are conducted over the whole evaluation dataset, providing a measurement of the global error.

In order to evaluate the proposed approach, the dataset presented in Sect. 3.6.3 has been used (CVC-02 multimodal dataset). This dataset contains 116 scenes, which depict a large variety of urban scenes, such as buildings, sidewalks, trees, and vehicles, among others (see Table 6.1). Every scene includes its corresponding thermal infrared and color image, a synthesized disparity map, and a hand-annotated map of planar regions. The images are acquired by the proposed multimodal stereo head,


and they are processed until a rectified pair is obtained. The planar region maps have been hand-labeled taking into account the geometry of the surfaces; thus, a unique label is assigned to each region, identifying the pixels that belong to the same plane.

Table 6.1 shows some examples of the images used for validating the proposed approach. I_LWIR and I_VS are rectified images; in both cases the size of the resulting images is 506 × 408 pixels. The hand-labeled and disparity maps are cropped and registered with respect to I_VS; their sizes are also 506 × 408 pixels. Since the disparity maps provided by the PGB2 are only accurate in textured regions, we have used the hand-labeled planar regions for obtaining dense and accurate representations, particularly in textureless and noisy regions. To address this problem, we fit a plane to each hand-annotated region, using the image coordinates and the corresponding disparity values. The disparity maps resulting from this user-supervised labelling process are shown in Table 6.1.

Dense disparity maps were obtained by setting the different parameters as indicated below. The values were empirically obtained, and the same setting is used in all the scenes. The initial Dmap0 is obtained by defining dmin = 0 and dmax = 64. The scale-space representation contains three levels, and the values used for propagating mutual and gradient information through the different levels are set as [α0, . . . , αt] = [0.5, 0.3, 0.2] and [β0, . . . , βt] = [0.5, 0.3, 0.2]; the threshold τMGI is set to 25% of the maximum cost value; finally, mutual and gradient information in Eq. (6.1) are fused defining λ = 0.4. The value related to the planar hypothesis generation was set as follows: τlink = 2.5. The default values of the graph cut implementation provided by [56] were used for the global minimization.

Figures 6.7 and 6.8 show the results obtained for eight different scenes. The initial multispectral stereo images are provided in the first and second columns of Table 6.1. The results are grouped by scene and show the output of each step of the proposed algorithm. Every scene has four associated images: (top-left) the support regions R, which split up the I_VS image into planar regions; (top-right) an illustration of the planar hypotheses Π; (bottom-left) the labelled regions obtained by graph cuts; and (bottom-right) the final disparity map. Notice that Π is the set of labels used during the minimization step, and the disparity map is obtained by using the plane parameters corresponding to each label. On the other hand, it can be appreciated how the minimization stage is able to filter out small regions and propagate information across neighbors (see the (bottom-left) illustrations in the different scenes and compare them with their corresponding (top-right) images in Figs. 6.7 and 6.8).

Figure 6.8 (scene 5) shows that the proposed approach can obtain dense disparity maps even in non-planar regions. In this illustration a large cylinder is approximated by two planar patches. The number of planar patches depends on the value used for setting the τlink parameter. Even in this challenging case the proposed algorithm is capable of finding a set of planar hypotheses, and converges toward an optimal solution that preserves the appearance of the scene.

2Point Grey Bumblebee


Table 6.1: Examples of the evaluation dataset.

(Columns, from left to right: I_LWIR, I_VS, maps of planar regions, and synthesized disparity maps.)


(Panels, left to right and top to bottom: Scene 1, Scene 2, Scene 3, Scene 4.)

Figure 6.7: Experimental results from different stages of the proposed approach; ineach scene the illustrations correspond to: (top-left) Support region R; (top-right)Planar hypotheses Π; (bottom-left) Labeled regions from graph cuts; and (bottom-right) Final disparity map.

The accuracy of the proposed algorithm is evaluated using two metrics, which are frequently employed as quantitative evaluation criteria for stereo matching algorithms. First, the absolute mean error (Eabs) is computed in a global manner for a given disparity map as follows:

Eabs = (1/N) ∑_{j=1}^{N} |dC(j) − dT(j)|, (6.18)

where dC is the disparity map computed by the proposed algorithm, dT is the ground truth, and N is the number of evaluated points. Since our dataset offers reliable ground truth only at those points that lie on a planar region, the error measurement is limited to those image coordinates that have both valid ground truth data and a disparity value. Notice that the label π_∞ used during the minimization step corresponds to no disparity; for this reason those image coordinates are excluded from the evaluation. A drawback of using Eabs as an evaluation metric lies in the fact that it does


(Panels, left to right and top to bottom: Scene 5, Scene 6, Scene 7, Scene 8.)

Figure 6.8: Experimental results from different stages of the proposed approach; ineach scene the illustrations correspond to: (top-left) Support region R; (top-right)Planar hypotheses Π; (bottom-left) Labeled regions from graph cuts; and (bottom-right) Final disparity map.

not distinguish between a few disparity estimations with large errors and many disparity estimations with small errors. Furthermore, it does not take into account that a small disparity value corresponds to a large depth value, and therefore its contribution to the global error should be different from that of, for instance, a large disparity (small distance). Hence, in order to take this effect into account, the mean relative error (Erel) is also used. It is computed as follows:

Erel = (1/N) ∑_{j=1}^{N} |dC(j) − dT(j)| / dT(j). (6.19)
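Both metrics are straightforward to compute; a minimal sketch, assuming ground-truth validity is marked by positive disparities (an illustrative convention, not the dataset's actual encoding):

```python
import numpy as np

def stereo_errors(d_computed, d_truth):
    """Absolute mean error (Eq. 6.18) and mean relative error (Eq. 6.19),
    evaluated only where valid ground truth exists (here: d_truth > 0)."""
    valid = d_truth > 0
    diff = np.abs(d_computed[valid] - d_truth[valid])
    e_abs = diff.mean()
    e_rel = (diff / d_truth[valid]).mean()
    return e_abs, e_rel

dc = np.array([10.0, 21.0, 39.0, 5.0])
dt = np.array([10.0, 20.0, 40.0, 0.0])  # last point: no ground truth
e_abs, e_rel = stereo_errors(dc, dt)
print(e_abs)  # (0 + 1 + 1) / 3, approx. 0.667
print(e_rel)  # (0 + 1/20 + 1/40) / 3 = 0.025
```

Note how the same one-pixel error contributes more to Erel at disparity 20 than at disparity 40, which is exactly the depth-aware weighting motivating Eq. (6.19).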

Eabs and Erel are computed for the 8 scenes that are used as case studies (see Fig. 6.7 and Fig. 6.8); their corresponding error scores are presented in Table 6.2. The Erel in scenes 1 and 4 is considerably larger than in the rest of the scenes in the dataset. In both cases these large values result from wrong matchings due to the lack of texture in the predominant geometries (a vertical non-textured wall).


Table 6.2: Global Eabs (pixel) and Erel of the case studies presented in Fig. 6.7 andFig. 6.8.

Case study    1      2      3      4      5      6      7      8
Eabs (px)   0.635  0.443  0.582  0.629  0.484  0.387  0.465  0.505
Erel        0.167  0.016  0.096  0.144  0.042  0.060  0.062  0.057

Figure 6.9: Average accuracy of the results obtained with the proposed approach, computed from the whole dataset.

Figure 6.9 shows the average accuracy of the obtained dense disparity maps when all the scenes in our dataset are considered. For each scene an accuracy histogram is computed using its corresponding ground truth map. The histogram counts the number of points for a given disparity error, spanning from 0 to 10 pixels. Then, from all these histograms a single plot is computed, showing the variability of the results (see Fig. 6.9). In this plot the central mark of each box corresponds to its median value, the edges of the box are the 25th and 75th percentiles, and the whiskers extend to the most extreme cases.

Figure 6.10 shows the average Erel computed over the whole dataset. The Erel measurements are restricted to discrete disparity values from 1 to 20 for the sake of visualization. This plot is similar to the previous one and presents the box plot of the mean relative error when all the images in the evaluation dataset are considered. Note how the mean Erel decreases to values below 10% when the disparity is higher than 10 pixels. On the other hand, disparity values smaller than or equal to 10 pixels


Figure 6.10: Average Erel of the results obtained with the proposed approach, computed from the whole dataset.

correspond to distant points (several meters away from the stereo rig), which are out of the calibration range of the current work.

The results presented above answer the question that was formulated at the beginning (Sect. 6.1), which motivated the current work. They show that, under certain restrictions, multispectral images can be used to extract dense disparity information. This information can be directly converted into a 3D representation describing the geometry of the scene, which would allow, for instance, extracting semantic relationships between the objects in the scene.

6.5 Conclusions

The current chapter presents a novel framework for extracting dense disparity maps from multimodal stereo images; each stage is described and evaluated both qualitatively and quantitatively. The results obtained from this research can benefit those applications where visible and thermal infrared cameras coexist, for instance building diagnostics, energy auditing, and material inspection. The main contributions of this chapter are as follows: (i) it improves the LWIR/VS cost function introduced in Sect. 4.3 and Sect. 5.2.1 through a smooth approximation; this new cost function is suitable for optimization techniques and weights the individual contributions of MI and GI; (ii) a global minimization scheme based on high-order priors for extracting dense disparity maps is proposed; this scheme uses a set of plane hypotheses as labels and defines a dynamic regularization term as a function of the location of a point and its current label.

The Manhattan-world assumption is used for constraining the set of scenes in which the proposed algorithm is useful, and to model the geometry of the scene. We also show that under this restriction it is possible to obtain accurate disparity maps, in spite of the low correlation between thermal infrared and visible images. The experimental results confirm that our piecewise planar approach can successfully recover planar surfaces in complex situations (non-planar and cluttered scenes).


Chapter 7

Context and Piecewise Planar Based Approach for Dense Representations

The previous chapter introduced a new framework for extracting dense disparity maps. This chapter continues along this line but improves its performance by adding contextual information. Thus, three novel features are incorporated into the previous approach. First, a solution to the main weakness of region-based algorithms is proposed, which reduces the effect of wrong plane detections. This method builds a fully connected graph from the detected planes and then generates a compact set of plane hypotheses, which are obtained in a global way by normalized cuts on that graph, as opposed to iterative methods. Furthermore, the graph defined in the MRF optimization step now connects superpixel centroids instead of pixels, which reduces the number of nodes and therefore the computational complexity (i.e., less memory and processing time). Finally, the use of a planar/non-planar classifier is proposed to provide meaningful contextual information about the scene, which is incorporated through an energy function that additionally employs an inter-plane distance in the prior model.

7.1 Introduction

Stereo algorithms based on contextual information are a novel approach to exploit the specific characteristics of the application domain. They have shown robustness to noise as well as a particular sensitivity to the context, which improves the quality of the results while outperforming state-of-the-art algorithms. Despite their great potential, some issues remain open, for example the appropriate selection of the sources of contextual information, their forms of representation, and the strategies for their exploitation. It is thus necessary to determine the relationships between entities and their surrounding environment, in order to interpret the scene and translate them into useful constraints. In the current chapter, we propose to go a step further by fusing thermal and visual data with contextual information, allowing dense and accurate 3D representations of the scene.

Over the last few years, many computer vision problems have been investigated from a multimodal point of view, assuming that a multimodal architecture could produce better results than a single modality. In practice, an inappropriate sensor combination may cause worse results, primarily due to uncertainties in the data, since it is difficult to distinguish accurate information from biased data [64]. Indeed, a multimodal stereo system such as the one proposed here, working at non-overlapping spectral bands, poses a complex task. This system should overcome the well-known stereo correspondence problem as well as image fusion issues. A starting point towards such challenging scenarios was given in [43, 73, 88], but in VS/VS stereo, whose images are affected only by smooth radiometric differences. Although these works do not truly investigate the multimodal problem, they indicate that an energy optimization framework based on mutual information could be a step toward dense solutions in the multimodal case.

Since energy optimization approaches have demonstrated their usefulness in earlier vision problems, in this chapter we propose to extend these formulations to multimodal stereo matching. Additionally, we propose to include contextual information in both the data and smoothness terms. Our study shows a promising path towards the integration of multiple sensors for recovering three-dimensional information.

There are several works in the field of 3D data extraction using contextual information [104, 16, 144, 106]. Particularly for VS/VS stereo systems, it has been shown that it is possible to produce an accurate 3D reconstruction, similar to the one obtained with a laser range scanner [54], just by imposing planarity constraints [119]. This assumption may seem excessively restrictive. However, such scenes can be effectively modelled by the so-called Manhattan-world assumption [35], as presented in the previous chapter. Similar to previous work, the proposed approach exploits the contextual information provided by man-made structures. Under this assumption, the surfaces in a given scene are considered as a collection of piecewise-planar patches, whose normal vectors are oriented in a reduced number of directions. This condition holds true for images taken in environments with certain regularities in their structure, particularly over the edges.

It should be noticed that although a Manhattan-world assumption is imposed,the proposed approach does not assume that the surfaces into the scene are alignedwith a Cartesian coordinate system (X, Y, and Z) as is presented in [54, 119, 17, 108].Indeed, in the current chapter the surfaces and their orientations are recovered from aninitial sparse disparity map. These surfaces are identified and grouped in accordancewith dominant orientations on the scene, which are obtained by a spectral clusteringmethod [142]. In this way, false plane detections and noisy surfaces are removed.

The main contribution of the current work is the formulation of multimodal stereo matching as a piecewise planar region partitioning and labelling problem. This formulation is then solved by the graph cuts algorithm [22]. The proposed approach works as follows. Firstly, a matching cost volume is computed, which is used for obtaining an initial disparity map. Then, a set of regions is obtained by overlapping two different segmentations of the visible image (perceptual grouping [46] and superpixels [102]). Next, planes are extracted from those candidate regions and are used for partitioning the initial disparity map. This map is then treated as a three-dimensional cloud of points, where the position of each point in that space is defined by its image coordinates (row and column) and disparity value. Once every region is approximated by a plane, the key question is how to obtain the dominant orientations in the scene. Since there is an infinite number of possible orientations, and a plane can be uniquely defined by one point and a normal vector, we propose to tackle this problem as a partitioning problem. Thus, instead of an iterative algorithm that joins regions and computes the planar parameters at each cycle [56], or a progressive approach [165], we propose to group superpixels into clusters according to their centroid location and normal. The graph consists of a set of nodes that correspond to all superpixels in the image, and a set of edges whose weights are distances between two superpixels. Finally, that graph is clustered into regions by normalized cuts [142]. We argue that the agglomerations of nodes in this graph correspond to the dominant orientations in the scene, and in turn they should be the labels used by the energy optimization framework. The proposed clustering strategy has shown high noise immunity, in particular to the noise caused by a deficient plane estimation. Finally, the multimodal energy function is solved via graph cuts [22]. This function compares both gradient vectors and mutual information at different scales, following a window-based approach.

The chapter is organized as follows. Section 7.2 introduces the state of the art in context-based stereo approaches. Then, Sect. 7.3 details the proposed method. Experimental results are presented in Sect. 7.4 using several outdoor scenarios, showing the validity of the proposed approach. Finally, conclusions are provided in Sect. 7.5.

7.2 Background

As presented in previous chapters of this dissertation, stereo matching from multimodal images (LWIR/VS) is a field still relatively unexplored, and certain questions about image correspondence have not yet been fully addressed in the literature. By contrast, stereo matching of VS/VS images is an active research area, which has evolved from window-based local approaches [85] to global methods that use regularizations for introducing additional information in the form of contextual restrictions or assumptions [149]. The lack of work in the LWIR/VS field is mainly due to the low correlation between LWIR and VS images, which prevents computing reliable matching cost values and at the same time prevents the use of state-of-the-art methods from VS/VS stereo [73, 88]. A way of avoiding incoherent results and overcoming the correlation problem is through local approaches; they commonly reduce the search for correspondences to certain regions of interest (ROI) in the images [76], such as human silhouettes [96], faces [147], or contrasted objects [169] (e.g., hot or cold objects on a smooth background). Although these ROI based approaches have shown attractive results, the main problem remains unsolved. Hence, the search for novel multimodal cost functions is a recurrent subject of research in the multimodal community. Stereo similarity functions such as Normalized Cross-Correlation (NCC) or image descriptors such as Histograms of Oriented Gradients (HOG) [38] have been evaluated for multimodal stereo matching, but without success [154]. On the other hand, mutual information, Local Self-Similarity (LSS) [141, 153], and the combination of mutual and gradient information in a multiresolution scheme [8] have shown to be the top ranked cost functions [154].

The current chapter follows the scheme presented in previous ones, where mutual and gradient information in a multiresolution representation are used for computing a cost volume that is optimized twice: once with a Winner-Takes-All (WTA) strategy in order to generate a sparse disparity map, and then with a graph cut algorithm for a dense representation (disparity and 3D points). This kind of method splits up the disparity computation into two steps. Initially, a coarse representation of the scene is obtained by a local method that operates at a low level; it is fast and precise. Then, this approximation is refined in subsequent iterations by a method that adds a reasoning layer (high level), whose functionality depends on its complexity. A similar approach has been implemented for the matching of two thermal infrared images in [40, 62, 63]. Although these approaches do not tackle multimodal stereo matching, they have to deal with similar problems, which are caused by infrared imagery (e.g., low resolution, noise, and blurred edges). They start with a quite sparse representation, which is obtained from feature matching techniques (e.g., corners in [40], and phase congruency of edges in [62] and [63]). Then, this first representation is refined in a second step by removing inconsistencies when a sparse representation is sought [62], or through the use of support regions such as Delaunay triangulations [40] or watershed segmentation [63] when a dense representation is required. In practice, these methods require a very accurate detection of contours or segmentations since the second step cannot correct mistakes; actually, the second step can only reject them.

More general and robust solutions, with respect to the previous methods, have been studied in the VS/VS stereo field; these solutions are usually referred to as region-based stereo matching methods. Early contributions on this topic tried to associate entities extracted from the images [29]. Entities shall be understood as regions, region descriptors, or nodes of a tree-like structure that represents a fine-to-coarse hierarchical candidate segmentation. A shortcoming inherent to these approaches is the image partitioning, since they need a precise segmentation of both images (homogeneity), which remains an open issue in computer vision. In contrast, recent approaches have reduced their dependence on the segmentation quality by fitting 3D surfaces, such as planes [17, 90, 151], splines [18], voxels [54], or affine models [16], to the initial representation. Nowadays, region-based matching algorithms are the best ranked in the VS/VS stereo field (Middlebury1).

The region based stereo approaches can be classified according to their operation space into two groups: (i) disparity map based and (ii) depth map based (e.g., [144], [54], [56] and [18]). Although the latter have demonstrated a better performance than those using disparity maps, they cannot be directly extended to LWIR/VS stereo matching since the initial depth maps are very sparse. Therefore, in the current work a disparity map based representation is used. The functional structure of the proposed method is not far from previously reported region based stereo algorithms, for instance those presented in [17, 56]. The main differences are discussed below.

1 http://vision.middlebury.edu/stereo/

First of all, the proposed approach is based on a split and merge segmentation instead of a color based algorithm [31, 86] as generally used in region based stereo [165, 17, 90]. Also, we propose to formulate the planar hypotheses generation as a graph clustering problem, instead of a RANSAC-like algorithm as in Sect. 6.3.2.B or [17, 56]. Thus, the phases of plane fitting and planar hypotheses generation are independent, allowing the relaxation of self-similarity assumptions [18]. In the current chapter, disparity discontinuities between plane patches are measured by an inter-plane distance as presented in [151]. This distance is used for grouping plane patches that are spatially near and have similar plane parameters. Hence, segmentation errors can be corrected.

Furthermore, we also introduce changes in the energy minimization step with respect to the one presented in the previous chapter. Instead of a regular lattice with a neighborhood system of first or second order (4- or 8-connected) [22], a quasi-regular tessellation of sites is used. Each site corresponds to the location of a superpixel centroid, and its neighborhood system is limited to those superpixels that share a border. On the one hand, this simplifies the labelling problem, since the total number of sites is smaller than if a regular lattice were used (the number of pixels in the image). On the other hand, a scheme of cost aggregation based on superpixels is introduced; similar approaches have shown a better trade-off between computational efficiency and accuracy in VS/VS stereo evaluations [152].
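As an illustration, the quasi-regular neighborhood system described above can be built directly from a superpixel label map. The sketch below (plain NumPy; the 4-connected border test is an assumption, not the thesis implementation) links two sites whenever their superpixels share a border:

```python
import numpy as np

def superpixel_adjacency(labels):
    """Neighborhood system over superpixels: an undirected edge connects two
    superpixels iff they share a border (4-connected pixel adjacency)."""
    edges = set()
    # Horizontally adjacent pixel pairs with different labels share a border.
    l, r = labels[:, :-1], labels[:, 1:]
    for a, b in zip(l[l != r].ravel(), r[l != r].ravel()):
        edges.add((int(min(a, b)), int(max(a, b))))
    # The same test for vertically adjacent pixel pairs.
    u, d = labels[:-1, :], labels[1:, :]
    for a, b in zip(u[u != d].ravel(), d[u != d].ravel()):
        edges.add((int(min(a, b)), int(max(a, b))))
    return edges

# Toy label map: four superpixels arranged in a 2x2 block layout.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 3, 3],
                   [2, 2, 3, 3]])
adjacency = superpixel_adjacency(labels)
```

Note that superpixels 0 and 3 are correctly excluded: they touch only at a corner, not along a border.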

The energy function used during the minimization step has also been reformulated: the data term is derived from [8] and a novel smoothness term is proposed. The latter incorporates a piecewise planar prior that penalizes discontinuities between nearby superpixels; again the inter-plane distance presented in [151] is used, but it measures the affinity of a pair of planar hypotheses instead of a Euclidean distance as proposed in [56], allowing the minimization over superpixel primitives.

7.3 Superpixelwise Planar Stereo

A diagram of the proposed approach is shown in Fig. 7.1. The first step corresponds to a sparse stereo, where a cost volume is computed from the LWIR and VS images and then minimized in order to obtain an initial disparity map. The second step is a dense stereo algorithm, which starts with the extraction of a set of planes from the initial disparity map. Since the labelling problem of plane surfaces is solved by a discrete minimization technique (graph cuts), the scene geometry must be summarized into a limited number of plane hypotheses. These hypotheses have a double meaning: they are 3D planes with their corresponding geometric parameters, and they are also


[Figure: block diagram. The VS and LWIR images feed a Multimodal Matching Cost block followed by a Winner-Takes-All step (sparse stereo, producing the initial disparity map); Plane-Based Hypotheses Generation and a Planar/non-planar Region Classifier then feed the Piecewise Superpixel Labeling step (dense stereo), which outputs the disparity map and the 3D cloud of points.]

Figure 7.1: Pipeline of the proposed context-based multimodal stereo algorithm.

labels. Once the labelling problem is solved, both a dense disparity map and a cloud of 3D points are recovered. In the next three sections, these two steps are explained in more detail.

7.3.1 Matching Cost Volume Computation

The cost volume C(p, d) is computed using the multimodal cost function introduced in Chapt. 6, which combines two similarity measures. On the one hand, it considers the mutual information that takes into account pixel values within the two windows, as proposed in [43]. On the other hand, it also includes the gradient information that compares gradient vectors within these windows, as proposed in [125]. A multi-scale, or coarse-to-fine, scheme is also included for analyzing the objects in the scene at different resolutions in order to raise the matching score [52]. In the current chapter the elements presented above (mutual information, gradient information, and multiscale scheme) are combined to define the multimodal matching cost volume as depicted in Fig. 7.2.
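To make the combination concrete, the following sketch computes a window-based cost from the two cues (plain NumPy; the histogram bin count, the α weighting and the sign convention are assumptions — the actual cost of Chapt. 6 also embeds the multiresolution scheme):

```python
import numpy as np

def window_mi(a, b, bins=8):
    """Mutual information between two equally sized windows
    (plug-in histogram estimate, in bits)."""
    hab, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pab = hab / hab.sum()
    pa, pb = pab.sum(axis=1), pab.sum(axis=0)
    nz = pab > 0
    return float(np.sum(pab[nz] * np.log2(pab[nz] / np.outer(pa, pb)[nz])))

def grad_similarity(a, b):
    """Agreement of gradient vectors within the two windows:
    magnitude-weighted absolute cosine between corresponding gradients."""
    gay, gax = np.gradient(a)
    gby, gbx = np.gradient(b)
    dot = np.abs(gax * gbx + gay * gby)
    norms = np.hypot(gax, gay) * np.hypot(gbx, gby)
    return float(dot.sum() / (norms.sum() + 1e-9))

def multimodal_cost(a, b, alpha=0.5):
    # Both cues are similarities, so lower (more negative) cost = better match.
    return -(alpha * window_mi(a, b) + (1 - alpha) * grad_similarity(a, b))

rng = np.random.default_rng(0)
win = rng.random((11, 11))
cost_same = multimodal_cost(win, win)                   # strongly negative
cost_rand = multimodal_cost(win, rng.random((11, 11)))  # near zero
```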

[Figure: a reference window centered at p = (x, y) in one image is compared against a sliding window at x + d in the other image, for every disparity d ∈ [dmin, dmax]; the resulting entries c(x, y, d) fill the volume C(p, d).]

Figure 7.2: Multimodal matching cost volume.


Table 7.1: Illustration of a set of images defining a scale space representation.

Scale           | |∇0(IVS)| | |∇1(IVS)| | |∇0(ILWIR)| | |∇1(ILWIR)|
σ = 0.5 (t = 0) | [image]   | [image]   | [image]     | [image]
σ = 2 (t = 1)   | [image]   | [image]   | [image]     | [image]
σ = 4 (t = 2)   | [image]   | [image]   | [image]     | [image]

The volume computation may be an expensive process, especially if the search space or the size of the images is large. However, it should be noticed that it is an efficient representation for two-step algorithms, since it is computed only once, at the beginning, and then optimized twice (first by a WTA and then through graph cuts).

Table 7.1 presents an example of a scale-space representation: the first column contains the values of σ from which each scale t is generated; the second column depicts the convolution of IVS with a Gaussian derivative kernel of order zero and σ = 0.5, 2, 4; the third column shows the norm of the IVS gradient field; the fourth column presents ILWIR smoothed; finally, the last column depicts the norm when ILWIR is convolved with a Gaussian mask of order one and σ = 0.5, 2, 4 (note that the second and third columns are dual to the fourth and fifth).
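The scale-space entries of Table 7.1 can be reproduced with separable Gaussian convolutions; below is a minimal NumPy sketch (the kernel radius of 3σ and the 'same'-mode border handling are assumptions):

```python
import numpy as np

def gauss_kernels(sigma):
    """1-D Gaussian (order 0) and its first derivative (order 1)."""
    r = max(1, int(3 * sigma + 0.5))
    x = np.arange(-r, r + 1, dtype=float)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    g /= g.sum()
    dg = -x / sigma ** 2 * g            # derivative-of-Gaussian kernel
    return g, dg

def conv_sep(img, k_row, k_col):
    """Separable 2-D convolution: filter along rows, then along columns."""
    tmp = np.apply_along_axis(lambda v: np.convolve(v, k_row, 'same'), 1, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k_col, 'same'), 0, tmp)

def scale_space(img, sigmas=(0.5, 2, 4)):
    """For each sigma: (smoothed image, gradient norm), i.e. the per-scale
    |grad0| and |grad1| entries of Table 7.1."""
    out = []
    for s in sigmas:
        g, dg = gauss_kernels(s)
        smooth = conv_sep(img, g, g)
        gx = conv_sep(img, dg, g)       # horizontal derivative
        gy = conv_sep(img, g, dg)       # vertical derivative
        out.append((smooth, np.hypot(gx, gy)))
    return out

# A vertical step edge: the gradient norm should peak at the edge, and the
# peak response should decrease as sigma grows.
img = np.zeros((32, 32))
img[:, 16:] = 1.0
levels = scale_space(img)
```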

Once C(p, d) has been computed, a Winner-Takes-All method is used to select the best disparity for every point in the VS image; then, an initial sparse disparity map (Dmap0) is obtained by filtering unreliable matches using the corresponding matching cost value. This minimization step is performed following the method presented in Sect. 5.2.2.
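A minimal sketch of this WTA step follows (the H x W x D cost-volume layout and the threshold value are assumptions; the actual filtering follows Sect. 5.2.2):

```python
import numpy as np

def wta_sparse(C, cost_thresh):
    """Winner-Takes-All over a cost volume C (H x W x D): keep the disparity
    of minimum cost per pixel, invalidating matches whose best cost is still
    above cost_thresh (marked -1), which yields a sparse disparity map."""
    best = C.argmin(axis=2)
    best_cost = C.min(axis=2)
    return np.where(best_cost <= cost_thresh, best, -1)

# Toy volume: 2 x 2 pixels, 4 disparity candidates, cost 1.0 everywhere.
C = np.full((2, 2, 4), 1.0)
C[0, 0, 2] = 0.1      # confident minimum at d = 2
C[1, 1, 3] = 0.9      # weak minimum: filtered out with thresh 0.5
Dmap0 = wta_sparse(C, cost_thresh=0.5)
```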

7.3.2 Plane-Based Hypotheses Generation

The plane based hypotheses generation consists of three steps, which result in a compact set of planar representations that will be used as labels in the final optimization.


The first step consists in segmenting IVS into a set of meaningful regions. Since we are working with piecewise planar scenes, ideally each region will correspond to a plane. Then, in the second step, a plane is fitted to each one of the regions previously obtained, using the sparse disparity map computed in Sect. 7.3.1. Finally, in the third step, the large set of planes previously computed is reduced, resulting in the dominant planes of the scene. The proposed approach is not limited to IVS segmentation; however, segmentation algorithms for VS images are currently the best ranked in segmentation tasks close to human perception.

A. Split and Merge Segmentation Revisited

In order to overcome the limited information supplied by the initial disparity map, a strategy for partitioning the images into approximately planar regions is adopted. This helps to correctly detect plane patches. The image segmentation algorithm presented in [46] was selected because the images depict man-made structures, which can be efficiently segmented using an algorithm inspired by perceptual grouping. Furthermore, this algorithm puts special emphasis on edge variabilities, which in the current work are important since they reveal the existence of planar surfaces. Figure 7.3 shows an illustration of the candidate planar regions R obtained by this split and merge segmentation strategy. These will later be used to cast the initial disparity map into a lattice-like structure of planar regions.


Figure 7.3: Split and merge segmentation example: (a) superpixels (S); (b) perceptual regions (P); and (c) resulting candidate planar regions (R).

B. RANSAC Plane Fitting Revisited

Once the sparse disparity map has been computed and the color image segmented into ri regions, a set of hypotheses of planar regions that describe the surfaces in the scene is computed. So, for every region ri ∈ R, a RANSAC-like algorithm such as the one presented in Sect. 6.3.2 is employed to estimate a pair 〈n, x〉, where n is the normal vector and x is the centroid of the points used for fitting the plane. Again, the planar region estimator operates in the disparity space, in contrast to previous approaches that work on depth maps (e.g., [56] and [144]).

A RANSAC based plane estimator is chosen since the accuracy of the sought disparity maps directly depends on the confidence of the planar hypotheses. By definition, these methods are capable of finding local models in noisy clouds of data; for instance, previous works have demonstrated that this kind of algorithm outperforms least-squares based techniques, since it is less sensitive to outliers [155]. It should be mentioned that only those regions ri that contain three or more valid disparities Dmap0(ri) are considered during this fitting step.
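The RANSAC fit in the disparity space can be sketched as follows (plain NumPy; the iteration count, inlier tolerance and the final least-squares refit are assumptions, not the exact procedure of Sect. 6.3.2):

```python
import numpy as np

def ransac_plane(pts, iters=200, tol=1.0, seed=0):
    """Fit the disparity plane d = c1*x + c2*y + c3 to points (x, y, d) taken
    from the sparse disparity map: sample 3 points, fit, count inliers, keep
    the best model, then refit it by least squares on its inliers."""
    rng = np.random.default_rng(seed)
    A = np.column_stack([pts[:, 0], pts[:, 1], np.ones(len(pts))])
    d = pts[:, 2]
    best_inliers = None
    for _ in range(iters):
        idx = rng.choice(len(pts), 3, replace=False)
        try:
            c = np.linalg.solve(A[idx], d[idx])   # exact fit to the 3 samples
        except np.linalg.LinAlgError:
            continue                              # degenerate (collinear) sample
        inliers = np.abs(A @ c - d) < tol
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    c, *_ = np.linalg.lstsq(A[best_inliers], d[best_inliers], rcond=None)
    return c, best_inliers

# Synthetic region: a 10x10 grid on the plane d = 0.5x + 0.2y + 3, plus outliers.
xs, ys = np.meshgrid(np.arange(10.0), np.arange(10.0))
pts = np.column_stack([xs.ravel(), ys.ravel(), (0.5 * xs + 0.2 * ys + 3).ravel()])
pts[:5, 2] += 20.0                                # 5 gross outliers
coeffs, inl = ransac_plane(pts)
normal = np.array([coeffs[0], coeffs[1], -1.0])   # plane normal n (up to sign)
centroid = pts[inl].mean(axis=0)                  # centroid x of the inliers
```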

C. Plane Hypotheses Generation Based on Graphs

The previous section could result in a representation that contains as many planes as regions. The current section aims at reducing such a large set into a compact representation whose members are the most representative planes. The selection of these planes must lead to a compact but also precise set of plane hypotheses that preserves the geometry of the scene, since the quality of the final results depends on the accuracy of this representation. Although there are several approaches for constructing such a compact set of plane hypotheses [54, 17, 56], two issues need to be considered here: (i) how to cluster the planes from the different regions; and (ii) how the clusters capture the three-dimensional appearance of the scene.

The generation of plane hypotheses is tackled by using a multiclass spectral clustering framework, which is based on graphs and employs normalized cuts [142] for weighting the cost of disconnecting two nodes. The proposed clustering approach operates with superpixels. Therefore, each superpixel is represented by a node in the graph and inherits the plane parameters of the region ri that contains it. The clustering allows correcting wrong assignments generated during the split and merge step (see Sect. 7.3.2). These wrong assignments are due to the fact that the split and merge step works at the pixel level without considering 3D information. They are noticeable at the boundaries of the objects in the scene. Figure 7.4 shows the wrong assignments of two superpixels, which belong to the floor but were merged with the vertical panel during the perceptual grouping.


Figure 7.4: Illustration of superpixels assigned to a wrong region during the split and merge step.

More formally, we define G as a weighted undirected graph: G = (V, E, W), where V represents the set of nodes (they correspond to superpixels in S); E is the set of edges connecting all these nodes; and W is a nonnegative and symmetric distance matrix whose elements correspond to all the possible distances between the different planes (previously computed by the RANSAC plane fitting approach in Sect. 7.3.2.B). The values in W can also be interpreted as a measure of the similarity between two given planes (πi, πj).

As mentioned above, the generation of plane hypotheses is formulated as a graph partitioning problem. Therefore, it is necessary to establish an equivalence relation valid for all pairs of nodes, which allows partitioning G into disjoint subsets. Since each subset is considered an equivalence class, all the planes that belong to it are assumed to be similar. In this way, the problem of plane hypotheses generation is simplified to obtaining a single element (a plane) to represent each one of the classes. The generation of plane hypotheses works as follows:

1. Given the set of planes estimated in Sect. 7.3.2.B, it constructs a weighted graph G = (V, E, W).

2. For a given number of desired clusters, it computes the normalized cuts on G, as indicated in [142].

3. From the obtained clusters, it computes the plane parameters (Π) using the corresponding cluster centroids.

4. If the current partition is not enough to encode the geometry of the scene, it increases the number of clusters and repeats from step 2.

The distance matrix W needed to construct the weighted graph G (step (1) in the previous list) is computed using an inter-plane distance distπ, which is formally defined in Eq. (6.13). This distance allows establishing an equivalence relationship between a pair of connected nodes i and j. W(i, j) is defined as presented in [142]:

W(i, j) = exp( −distπ(i, j)² / σdistπ² ),    (7.1)

where distπ is the distance between two planes and σdistπ is its standard deviation. Figure 7.5(a) presents the graph G before the multiclass spectral clustering, showing the crowded connections, while Fig. 7.5(b) shows the final clusters obtained after the plane hypotheses generation. It can be appreciated that the latter contains fewer connections than the former and is defined by only four predominant planes, which are shown in Fig. 7.5(c).
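The clustering above can be sketched end to end for the two-cluster case (plain NumPy; the Euclidean distance between plane parameter vectors stands in for the inter-plane distance of Eq. (6.13), and a Fiedler-vector bipartition stands in for the full multiclass normalized cuts of [142]):

```python
import numpy as np

def plane_affinity(planes, sigma):
    """W(i, j) = exp(-dist^2 / sigma^2) as in Eq. (7.1); dist is approximated
    here by the Euclidean distance between plane parameter vectors."""
    D = np.linalg.norm(planes[:, None, :] - planes[None, :, :], axis=2)
    return np.exp(-D ** 2 / sigma ** 2)

def spectral_bipartition(W):
    """Two-way normalized cut: threshold the generalized eigenvector with the
    second-smallest eigenvalue of the normalized graph Laplacian."""
    d = W.sum(axis=1)
    D_isqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - D_isqrt @ W @ D_isqrt   # symmetric normalized Laplacian
    vals, vecs = np.linalg.eigh(L)               # eigenvalues in ascending order
    fiedler = D_isqrt @ vecs[:, 1]
    return (fiedler > np.median(fiedler)).astype(int)

# Six plane-parameter vectors forming two tight groups of three.
planes = np.array([[1.0, 0.0, 0.0], [1.1, 0.0, 0.1], [0.9, 0.05, 0.0],
                   [0.0, 1.0, 5.0], [0.1, 1.0, 5.1], [0.0, 0.9, 4.9]])
cluster_labels = spectral_bipartition(plane_affinity(planes, sigma=1.0))
```

Each recovered cluster would then be summarized by a single representative plane (step 3 of the list above).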

The generation of plane hypotheses concludes with a set of planes Π whose number is smaller than the one initially provided by the RANSAC plane estimator (Sect. 7.3.2.B). It is a compact representation of the scene geometry (the dominant planes of the scene). As in the previous case, every plane in Π is defined by its centroid and normal vector.

7.3.3 Superpixelwise Plane Labeling

Given a compact set of plane hypotheses (Sect. 7.3.2.C), the final step consists in performing a piecewise superpixel labeling on the reference image IVS. In the current


Figure 7.5: Plane hypotheses generation: (a) initial graph G; (b) resulting partition; and (c) corresponding predominant planes.

section the set of plane hypotheses computed above is used as labels assigned to each superpixel in S. Once every superpixel has been labeled, both a dense disparity map and a dense depth map are obtained. Dense disparity maps are obtained by evaluating the plane parameters associated with a label, while depth maps are obtained by combining the disparity with the multimodal stereo head parameters.
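For reference, turning the dense disparity map into depth uses the standard rectified-stereo relation; the sketch below uses placeholder focal length and baseline values, not the calibrated parameters of the multimodal stereo head:

```python
import numpy as np

def disparity_to_depth(dmap, focal_px, baseline_m):
    """Z = f * B / d for every valid disparity; non-positive disparities are
    treated as invalid and mapped to infinite depth."""
    Z = np.full(dmap.shape, np.inf)
    valid = dmap > 0
    Z[valid] = focal_px * baseline_m / dmap[valid]
    return Z

dmap = np.array([[10.0, 20.0],
                 [ 0.0, 40.0]])          # 0 marks an invalid disparity
Z = disparity_to_depth(dmap, focal_px=800.0, baseline_m=0.12)
```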

The matching of planar regions that belong to different modalities (LWIR/VS) is now formulated as an energy optimization problem in the superpixel domain. Thus, a Markov random field is defined and solved via a graph cuts framework [22]. The goal of this section is to obtain a label fs that assigns a plane hypothesis to every superpixel. This label minimizes a global energy function E similar to the one presented in Eq. (6.15), but instead of performing the optimization at the pixel level, it is done at the superpixel level, as indicated below:

E(fs) = Edata(fs) + λsmooth Esmooth(fs, ft),    (7.2)

where Edata is a data term that compares the current label with the observed data, Esmooth is a regularization term, and λsmooth is a weighting factor. Edata is defined as the sum:

Edata(fs) = Ddata(fs) + Dclass(fs),    (7.3)

where Ddata is the cost of assigning a label fs to a superpixel s and Dclass is the likelihood that this superpixel is a non-planar region. Ddata is defined as follows:

Ddata(fs) =
    min(Cπ(fs), Cmax)           if fs ∈ {π1, π2, ..., πn, π∞},
    min(Cπ(fs), Cmax) + Cbias   if fs is not a plane,
    ρ · Cmax                    if fs is discarded,    (7.4)

where Cπ(fs) is the cost of assigning a plane hypothesis (label fs) to superpixel s, and Cmax is a constant used for: (i) truncating the Cπ cost; (ii) penalizing plane hypotheses that are inconsistent for a certain region s (e.g., plane hypotheses that generate disparity values outside of the volume); and (iii) allowing a given region to change its label to the one of its neighbor. The Cπ function is defined as follows:

Cπ(fs) = ∑p∈s C(p, d),    (7.5)


Figure 7.6: Graph connecting neighbor superpixels over: (a) the gradient magnitude of the VS image; (b) the superpixels of the VS image.

where d is the disparity value obtained by evaluating p in the current plane hypothesis (plane equation d = c1x + c2y + c3). This value is equal to the aggregation of the costs spanned by the plane fs in the C(p, d) volume (Sect. 7.3.1).
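Eq. (7.5), together with the Cmax truncation and out-of-volume penalty of Eq. (7.4), can be sketched as follows (NumPy; rounding the plane-evaluated disparity to the nearest volume bin is an assumption):

```python
import numpy as np

def plane_cost(C, pixels, plane, Cmax=5.0):
    """C_pi(f_s): evaluate the hypothesis d = c1*x + c2*y + c3 at every pixel
    p = (y, x) of superpixel s and aggregate the (truncated) matching costs
    from the volume C (H x W x D); disparities falling outside the volume are
    penalized with Cmax, as inconsistent hypotheses."""
    c1, c2, c3 = plane
    total = 0.0
    for y, x in pixels:
        d = int(round(c1 * x + c2 * y + c3))
        if 0 <= d < C.shape[2]:
            total += min(C[y, x, d], Cmax)
        else:
            total += Cmax
    return total

# Toy volume: cost 1.0 everywhere except a cheap fronto-parallel plane at d = 2.
C = np.full((4, 4, 5), 1.0)
C[:, :, 2] = 0.1
s = [(0, 0), (0, 1), (1, 0), (1, 1)]            # pixels of one superpixel
cost_good = plane_cost(C, s, (0.0, 0.0, 2.0))   # matches the cheap plane
cost_bad = plane_cost(C, s, (0.0, 0.0, 4.0))    # consistent but expensive
cost_out = plane_cost(C, s, (0.0, 0.0, 9.0))    # outside the volume
```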

The smoothness term is defined following [56], but using a superpixelwise formulation instead of the pixelwise one proposed in Sect. 6.3.3:

Esmooth(fs, ft) = g ·
    0                  if fs = ft,
    distmax            if fs or ft is not a plane,
    distπ(fs, ft)      otherwise,    (7.6)

with g = 1 / (γ |∇1(IVS)| + 1), where distπ is the inter-plane distance defined in Eq. (6.13); distmax is a constant value that penalizes discontinuities; and g is a weighting function whose domain is the gradient magnitude of the VS image (|∇1(IVS)|), evaluated at the midpoint of the segment that links the two superpixel centroids. Figure 7.6(a) depicts the graph connecting neighbor superpixels (those sharing a common border) over |∇1(IVS)|; the same representation is presented in Fig. 7.6(b) but over the superpixels of IVS. Finally, the energy function E from Eq. (7.2) is again minimized via graph cuts [22].
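A literal transcription of this term can be sketched as follows (labels are represented by plane parameter tuples, None stands for the "not a plane" label, and a Euclidean distance stands in for the inter-plane distance of Eq. (6.13) — all assumptions):

```python
import math

def smoothness(f_s, f_t, grad_mid, dist_pi, gamma=0.5, dist_max=30.0):
    """E_smooth of Eq. (7.6): an edge-aware weight g = 1/(gamma*|grad| + 1),
    evaluated at the midpoint between the two superpixel centroids, times a
    penalty: zero for equal labels, dist_max when either label is the
    non-planar one, and the inter-plane distance otherwise."""
    g = 1.0 / (gamma * grad_mid + 1.0)
    if f_s == f_t:
        penalty = 0.0
    elif f_s is None or f_t is None:
        penalty = dist_max
    else:
        penalty = dist_pi(f_s, f_t)
    return g * penalty

# Euclidean distance between plane parameter vectors as a stand-in for dist_pi.
def dist_pi(p, q):
    return math.dist(p, q)

pi1, pi2 = (0.5, 0.2, 3.0), (0.5, 0.2, 8.0)
same = smoothness(pi1, pi1, grad_mid=0.0, dist_pi=dist_pi)
flat = smoothness(pi1, pi2, grad_mid=0.0, dist_pi=dist_pi)   # smooth area
edge = smoothness(pi1, pi2, grad_mid=10.0, dist_pi=dist_pi)  # strong VS edge
```

Note how a strong gradient at the border lowers the penalty, so disparity discontinuities are allowed to align with VS edges.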

Planar and Non-Planar Region Classifier

In general, urban environments contain non-planar surfaces that should be detected before the graph-cuts labeling. These surfaces mainly correspond to bushes, trees and grass, which appear in images overlapping building structures; they are hard to fit with a planar model. Therefore, as pointed out by Gallup et al. [56], a non-planar region classifier can be used for identifying these surfaces. Since the classification problem essentially consists in determining whether a given patch corresponds to a planar or to a non-planar surface, a two-class classification algorithm is used.


Two modifications to the approach proposed by Gallup are made. Firstly, the feature vector that describes every patch is similar to the one presented in [74]. Thus, features such as color, texture, and location are computed from each patch. Secondly, instead of obtaining the class membership probability of a patch by a k-nearest-neighbor method, we use an Adaboost implementation based on decision trees [160], which directly returns a confidence value that is taken as a class membership probability. All the images in the dataset are split up into patches of 15x15 pixels, and then a feature vector is obtained from each patch. We collected a total of 1200 patches for each class, which were extracted from 20 images and manually classified into planar or non-planar surfaces. Then, they were divided into two groups, one for training and the other one for validation. After analyzing the classification error with respect to the number of weak classifiers, we concluded that a scheme with 100 weak classifiers is enough for the trade-off between accuracy and computational cost. From the classification results we can conclude that the weight vector provided by the decision trees of the Adaboost classifier confirms that texture is a discriminative feature for classifying a non-planar region. Hence, this information is used to detect unstructured objects, which in urban scenes generally correspond to non man-made elements.
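The two-class patch classifier can be sketched with a reduced feature vector and a stump-based discrete AdaBoost (both are stand-ins: the thesis uses the richer features of [74] and the decision-tree AdaBoost of [160]; the patch statistics and the synthetic data below are assumptions):

```python
import numpy as np

def patch_features(patch):
    """Reduced per-patch descriptor: mean intensity, std, gradient energy
    (the last two act as texture cues)."""
    gy, gx = np.gradient(patch.astype(float))
    return np.array([patch.mean(), patch.std(), np.mean(gx ** 2 + gy ** 2)])

def adaboost_train(X, y, rounds=10):
    """Discrete AdaBoost over axis-aligned decision stumps; y in {-1, +1}."""
    w = np.full(len(y), 1.0 / len(y))
    model = []
    for _ in range(rounds):
        best = None
        for j in range(X.shape[1]):               # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        err, j, thr, s = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = s * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)            # re-weight the training set
        w /= w.sum()
        model.append((alpha, j, thr, s))
    return model

def confidence(model, x):
    """Signed margin squashed to (0, 1): used as class membership probability."""
    score = sum(a * s * (1 if x[j] > thr else -1) for a, j, thr, s in model)
    return 1.0 / (1.0 + np.exp(-2.0 * np.clip(score, -30.0, 30.0)))

# Synthetic 15x15 patches: planar = smooth ramps, non-planar = random texture.
rng = np.random.default_rng(1)
ramp = np.linspace(0, 1, 15)[None, :] * np.ones((15, 1))
planar = [rng.uniform(0.2, 0.8) * ramp + rng.uniform(0, 0.2) for _ in range(20)]
textured = [rng.random((15, 15)) for _ in range(20)]
X = np.array([patch_features(p) for p in planar + textured])
y = np.array([1] * 20 + [-1] * 20)               # +1 = planar
model = adaboost_train(X, y)
p_planar = confidence(model, patch_features(planar[0]))
p_textured = confidence(model, patch_features(textured[0]))
```

On this toy data the gradient-energy feature alone separates the classes, which mirrors the observation above that texture is the discriminative cue.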

Finally, the data term in Eq. (7.3) is complemented by the class membership probability coming from the Adaboost classifier as follows:

Eclass(fs) = λclass ·
    1 − pclass    if fs ∈ {π1, π2, ..., πn, π∞},
    pclass        if fs is not a plane,
    0             if fs is discarded,    (7.7)

where pclass is the class membership probability and λclass is a constant that weights the contribution of the non-planar classifier. Figure 7.7(a) shows the disparity map of the case study used as an illustration throughout this chapter; its corresponding textured 3D map is presented in Fig. 7.7(b). Further examples will show the contribution of the planar/non-planar classifier.

7.4 Experimental Results

The proposed method is evaluated with the multimodal dataset presented in Sect. 3.6.3 (CVC-02 Multimodal Urban Dataset: Planar Surfaces). Its images depict a variety of urban scenes, including buildings, sidewalks, trees, and vehicles, among others. Although this dataset shows a large collection of planar surfaces with different orientations, other types of non-planar surfaces are also captured. The evaluation dataset consists of 116 outdoor scenarios. Table 7.2 shows some of the multimodal stereo images used in both the qualitative and quantitative evaluations of the proposed context based multimodal stereo algorithm. The first and second columns are the rectified images, IVS and ILWIR, respectively; the third column depicts their corresponding planar regions (referred to IVS); finally, the fourth column shows the synthesized disparity maps obtained with the approach presented above.


108 CONTEXT AND PIECEWISE PLANAR BASED APPROACH

Table 7.2: Some examples of the evaluation data set.

(Columns, left to right: case study number, IVS, ILWIR, maps of planar regions, and synthesized disparity maps. The rows correspond to case studies 1 to 9; the images themselves are omitted here.)



Figure 7.7: Results from the proposed context-based multimodal stereo algorithm: (a) disparity map; (b) textured 3D representation.

Table 7.2 shows good examples of Manhattan images. Often this kind of image has repetitive patterns (e.g., bricks or floor tiles), areas lacking texture due to constantly colored regions (e.g., painted walls), or strong lighting changes, just to mention a few. In spite of these unfavorable features, Coughlan et al. [35] demonstrated that the statistics of their edge gradients provide orientation information, either about the observer relative to the scene, or about which objects in the scene are unaligned. We exploit Coughlan's conclusions by assigning high confidence to high-frequency components of the multimodal images during the matching process. This is valid because these components (i.e., edges) are the features most correlated in the images [120]. Moreover, disparity values in low-frequency regions (i.e., non-edges) are efficiently inferred/interpolated from edges in Manhattan environments. VS/VS stereo algorithms, such as those presented in [54, 119], take advantage of these ideas for recovering realistic 3D models. However, the multimodal stereo case is more complicated because both images, LWIR and VS, must fulfill the Manhattan world assumption; it is not enough that only one image has this behavior.
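The edge-confidence idea above can be sketched with a simple gradient-magnitude map; the normalisation and the threshold value are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def edge_confidence(image, threshold=0.1):
    """Normalised gradient magnitude, zeroed in low-frequency regions."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    conf = mag / (mag.max() + 1e-12)   # scale to [0, 1]
    conf[conf < threshold] = 0.0       # non-edge pixels get no confidence
    return conf

# A vertical step edge yields confidence only along the discontinuity.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
conf = edge_confidence(img)
```

Pixels away from the step get zero confidence, so the matcher trusts only the high-frequency (edge) components, as described above.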

Dense disparity and 3D maps have been obtained by setting the parameters as indicated in Sect. 6.4. Furthermore, the parameters related to the multimodal cost function are obtained following the recommendations presented in [9]. The values used in the global optimization were set as follows: Cmax = 5, λsmooth = 1.25, dmax = 30, and γ = 0.5. The optimal setting of the parameters corresponding to the dense stereo step (see Fig. 7.1) is computed by means of an exhaustive search over the parameter space.
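The exhaustive search mentioned above amounts to evaluating every combination of candidate parameter values; a minimal sketch, in which the candidate grids and the evaluation function are hypothetical placeholders for the actual dense stereo run:

```python
import itertools

def evaluate(params):
    """Stand-in for running the dense stereo step and measuring its error;
    the real evaluation compares against the synthesized ground truth."""
    c_max, lam_smooth, d_max, gamma = params
    return (abs(c_max - 5) + abs(lam_smooth - 1.25)
            + abs(d_max - 30) + abs(gamma - 0.5))

grid = itertools.product([3, 5, 7],           # C_max candidates
                         [0.75, 1.25, 1.75],  # lambda_smooth candidates
                         [20, 30, 40],        # d_max candidates
                         [0.25, 0.5, 0.75])   # gamma candidates
best = min(grid, key=evaluate)
```
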

Five case studies are presented in Figures 7.8, 7.9, 7.10, 7.11 and 7.12. They show the experimental results obtained when the scenes shown in Table 7.2 are processed. Each illustration corresponds to the outcomes of the different steps of the proposed algorithm, namely: (a) split and merge segmentation; (b) initial disparity map; (c) plane hypothesis generation; (d) graph cut labelling; (e) final disparity map; and (f) different views of the resulting cloud of 3D points. A qualitative inspection of these figures, particularly illustrations (e) and (f), shows that our assumptions are valid for extracting dense disparity maps and 3D representations from multimodal images (LWIR/VS).

At this point it is worth giving some final remarks about the results presented above. As mentioned earlier, our solution strategy is inspired by region-based stereo matching algorithms working in the visible spectrum (i.e., [56, 90, 17]). However, in contrast to these methods, a LWIR/VS stereo algorithm must exhibit a high immunity to the noise generated by the uncertainty in the matching costs (data term in Eq. (7.3)). Although this is a problem common to all stereo algorithms, in this particular case it is a determining factor that may result in poor representations. This drawback is addressed in the current chapter by assigning more weight to the contribution of the smoothness term, contrary to the VS/VS stereo algorithms mentioned above. This fact can be appreciated in illustrations (c) and (d) of Figures 7.8 to 7.12, where the right tuning of the weighting parameter allows filtering out small noisy regions and propagating information across the connected superpixels.

The case studies presented in this section validate all the extensions and modifications applied to the Markov random field formulation. Furthermore, the disparity maps and 3D representations computed from multimodal images allow distinguishing the most relevant objects in the scene, even if some planes are missed or undetected.

The layout used for presenting case studies 6 to 9 has been modified with respect to case studies 1 to 5 to include a planar/non-planar probability map of IVS. Moreover, these cases depict long-range scenes and non-planar regions where contextual information improves the final results. Thus, Figs. 7.13, 7.14, 7.15 and 7.16 show the contribution of the planar/non-planar classifier previously proposed. Although the goal of this chapter is to produce depth maps, the results obtained after the second optimization have also been evaluated. Several examples of these results can be seen in Figs. 7.8 to 7.12, item (e), and Figs. 7.13 to 7.16, item (f). We have evaluated those graph-cut labels for accuracy with respect to hand-labeled planar and non-planar segments. Since the segment label is determined by majority vote, any of the labels π1, π2, ..., πn, π∞ counts as planar, and either of non-plane or discard counts as non-planar. Overall, 90.4% of the planar segments and 92.5% of the non-planar regions were labeled correctly (note that unlabeled segments in Table 7.2 are assumed to be non-plane).
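The label-grouping step of this evaluation can be sketched as follows; the string labels and the toy ground truth are illustrative, not taken from the dataset.

```python
PLANAR_LABELS = {"pi_1", "pi_2", "pi_n", "pi_inf"}  # any plane label is planar

def to_binary(label):
    """Map a graph-cut label to the planar/non-planar evaluation classes;
    'non-plane' and 'discard' both fall into non-planar."""
    return "planar" if label in PLANAR_LABELS else "non-planar"

# Toy example: five segments with hand-labelled ground truth.
predicted = ["pi_1", "pi_2", "non-plane", "discard", "pi_inf"]
ground    = ["planar", "planar", "non-planar", "planar", "planar"]

correct = sum(to_binary(p) == g for p, g in zip(predicted, ground))
accuracy = correct / len(ground)
```
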



Figure 7.8: Case study 1: (a) candidate planar regions (R); (b) Dmap0; (c) plane hypotheses; (d) regions labelled from graph cut; (e) resulting disparity map; and (f) different views of the resulting 3D representation.


Figure 7.9: Case study 2: (a) candidate planar regions (R); (b) Dmap0; (c) plane hypotheses; (d) regions labelled from graph cut; (e) resulting disparity map; and (f) different views of the resulting 3D representation.

Figure 7.10: Case study 3: (a) candidate planar regions (R); (b) Dmap0; (c) plane hypotheses; (d) regions labelled from graph cut; (e) resulting disparity map; and (f) different views of the resulting 3D representation.

Figure 7.11: Case study 4: (a) candidate planar regions (R); (b) Dmap0; (c) plane hypotheses; (d) regions labelled from graph cut; (e) resulting disparity map; and (f) different views of the resulting 3D representation.

Figure 7.12: Case study 5: (a) candidate planar regions (R); (b) Dmap0; (c) plane hypotheses; (d) regions labelled from graph cut; (e) resulting disparity map; and (f) different views of the resulting 3D representation.

Figure 7.13: Case study 6: (a) candidate planar regions (R); (b) Dmap0; (c) plane hypotheses; (d) planar and non-planar region probability map; (e) labels resulting from MRF labeling; (f) resulting disparity map; and (g) different views of the resulting 3D representation.

The performance of the proposed algorithm is quantitatively measured through two error metrics (Eabs and Erel). These metrics show the effect of the parameter settings on the obtained results with respect to a synthesized disparity map, which is considered as the ground truth (see Sect. 3.6.3 for details on the extraction of these synthesized disparity maps). The absolute mean error Eabs is defined in Eq. (6.18), while the relative mean error Erel is defined in Eq. (6.19). These metrics are used to evaluate the results of the case studies and provide quantitative measures of the error for the different scenes. Table 7.3 presents Eabs and Erel for every case study. Figures 7.17 and 7.18 show the average accuracy computed over the whole data set for different disparity error values (in pixels); these plots are computed in the same way as in Sect. 6.4.
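The two metrics can be sketched as follows, under the usual reading of absolute and relative mean disparity error (the exact formulas are Eqs. (6.18) and (6.19) in Chapter 6, so this is an assumption about their form):

```python
import numpy as np

def e_abs(d, d_gt, valid):
    """Mean absolute disparity error over the valid pixels."""
    return float(np.mean(np.abs(d[valid] - d_gt[valid])))

def e_rel(d, d_gt, valid):
    """Mean disparity error relative to the ground-truth magnitude."""
    return float(np.mean(np.abs(d[valid] - d_gt[valid]) / np.abs(d_gt[valid])))

d     = np.array([10.0, 21.0, 29.0])
d_gt  = np.array([10.0, 20.0, 30.0])
valid = d_gt > 0                      # e.g. exclude occluded/unlabelled pixels
abs_err = e_abs(d, d_gt, valid)       # mean of |0|, |1|, |1|
rel_err = e_rel(d, d_gt, valid)       # mean of 0, 1/20, 1/30
```
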



Figure 7.14: Case study 7: (a) candidate planar regions (R); (b) Dmap0; (c) plane hypotheses; (d) planar and non-planar region probability map; (e) labels resulting from MRF labeling; (f) resulting disparity map; and (g) different views of the resulting 3D representation.


Figure 7.15: Case study 8: (a) candidate planar regions (R); (b) Dmap0; (c) plane hypotheses; (d) planar and non-planar region probability map; (e) labels resulting from MRF labeling; (f) resulting disparity map; and (g) different views of the resulting 3D representation.

Figure 7.16: Case study 9: (a) candidate planar regions (R); (b) Dmap0; (c) plane hypotheses; (d) planar and non-planar region probability map; (e) labels resulting from MRF labeling; (f) resulting disparity map; and (g) different views of the resulting 3D representation.

Table 7.3: Global Eabs and Erel of the case studies presented in Figures 7.8 to 7.16.

Case study    1      2      3      4      5      6      7      8      9
Eabs        0.490  0.735  0.572  0.359  0.872  2.90   0.510  0.600  3.50
Erel        0.027  0.028  0.020  0.011  0.010  1.50   0.010  0.050  1.19

From the quantitative evaluation presented above we can conclude that the proposed approach is able to compute disparity maps with an accuracy of ±1 pixel in most of the multispectral image pairs of our data set (149 pairs of images). Furthermore, it should be highlighted that, on average, about 40% of the matches are correctly detected in every image (disparity error = 0).
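The accuracy-versus-disparity-error curves of Figs. 7.17 and 7.18 can be read as the fraction of pixels whose error falls within each threshold; a minimal sketch with toy disparities:

```python
import numpy as np

def accuracy_curve(d, d_gt, thresholds=(0, 1, 2, 3)):
    """Fraction of pixels whose disparity error is within each threshold."""
    err = np.abs(d - d_gt)
    return {t: float(np.mean(err <= t)) for t in thresholds}

d    = np.array([10.0, 21.0, 29.0, 33.0])
d_gt = np.array([10.0, 20.0, 30.0, 30.0])
curve = accuracy_curve(d, d_gt)   # errors are 0, 1, 1, 3 pixels
```
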


Figure 7.17: Average accuracy of the results obtained with the proposed approach, computed from the whole data set (119 pairs of images).

Figure 7.18: Average Erel of the results obtained with the proposed approach, computed from the whole data set.

7.5 Conclusions

This chapter presents a novel multimodal context-based stereo algorithm, which exploits recent advances in VS/VS region-based stereo. It uses the Manhattan world assumption as a criterion for splitting up the given scene into a set of planar patches.

This last chapter has confirmed that, under certain restrictions, the 3D structure of a scene can be inferred from a small set of good correspondences, in spite of the multimodal correspondence problems. The smooth version of the similarity function used in this chapter significantly increases the number of good matches, enough to exceed the minimum number of correspondences needed to obtain an accurate representation of the scene. Although its performance still depends on the parameter setting, both sparse and dense disparity maps can be obtained.

The proposed energy function allows obtaining dense representations by modelling the scene as a superpixel-wise planar surface. Furthermore, the non-planar regions, labelled by the Adaboost classifier, are also considered during the minimization, resulting in a robust solution despite their presence in the given scene. The energy function is optimized through a state-of-the-art graph-cuts algorithm. This formulation leads to two different representations: on the one hand, it allows obtaining the dominant planar regions of the given image; on the other hand, it computes the sought dense disparity map. Dominant planar regions are obtained through graph cuts, using as a prior a piecewise planar representation that evolves during the minimization process.

The experimental results show that the use of context information helps to overcome the lack of correlation between the multimodal images (LWIR/VS), while being a robust way to generate dense scene representations. Furthermore, the proposed approach has shown noise immunity, particularly in the LWIR images, where in general edges are blurred or the image regions are poorly contrasted.

The proposed approach has been tested with a large set of multimodal images, showing that context-based VS/VS stereo matching algorithms can be extended to tackle multimodal LWIR/VS stereo problems.



Chapter 8

Conclusion and Future Work

The main contributions of this dissertation are briefly recapitulated in this final chapter. Additionally, the future work and possible research lines identified during the development of this dissertation are presented.

8.1 Conclusions

The first objective stated at the beginning of this dissertation was reached by constructing a multimodal stereo rig for acquiring multispectral images. The image acquisition system is an essential step toward an accurate 3D modeling of a scene, given that the quality of the final results highly depends on the precision of the system setup. Therefore, not only the algorithms that transform the input images into useful representations are relevant, but also all the elements involved in its construction, such as the selection of cameras and their configuration. In order to achieve this objective, several configurations were tested, as described in Sect. 3.2. Finally, the selected configuration was a classical non-verged geometry, where the proposed multimodal stereo head shares one camera with a commercial stereovision system used for providing ground truth data. This minimal configuration reduces the computational cost while delivering quasi-rectified images, easing the work in subsequent stages.

Regarding the calibration stage, we conclude that saddle points are better detected in both modalities than the points found by a classical corner detector. Also, we found that reflecting the energy coming from an external source of radiation, such as the sun or an artificial one, during the acquisition of calibration images is better than periodically warming the calibration pattern (convective heat method), as is usually done in infrared applications. This is because small thermal variations in the calibration pattern correspond to blurred junctions, making it difficult to identify the exact location of the calibration points.

The multimodal image rectification was performed using two algorithms: firstly, an extension to the multimodal case of a standard rectification method for VS images, by Fusiello et al. [55], was used. Then, a method that minimizes the deformation of the rectified images (geometric distortion) was tested. Empirically, it was corroborated that the second option provides better rectification results, that is, images without strong deformations. In general, rectification methods like these are recommended for multimodal stereo due to their robustness to heterogeneous sensors (different intrinsic parameters).

As a tangible result of this first objective, two multimodal datasets and their corresponding ground truth data were acquired and published for public usage. These datasets consist of: (i) thermal infrared and visible images in raw format as well as their rectified versions; (ii) disparity maps registered to the corresponding reference images; (iii) 3D representations; (iv) hand-annotated planar regions; (v) synthesized disparity maps; and (vi) labeled image regions (valid and occluded image regions). To the best of our knowledge there are no similar datasets available for evaluation and comparison.

The second objective of this dissertation was related to the study of different matching cost functions. During this dissertation four matching cost functions were presented: MI, GI, MGI, and C. They are evaluated in different problems such as template matching and sparse and dense multimodal stereo. Each problem demanded a modification in its formulation, particularly in C. Hence, for template matching and sparse multimodal stereo this cost function includes a decision rule based on the principle of consensus. This formulation shows that for each of these problems a WTA strategy, or a disparity selection based on confidence, is enough to recover the correct matches. Another advantage of that formulation is that it allows a cost thresholding criterion that represents a trade-off between the number of accepted matches and the final expected error. In the case of a dense solution, a multimodal cost function based on the voting principle is proposed. This strategy is contrary to the previous one, but since a global optimization framework with a high-order prior is responsible for the decision-making, this scheme shows to be effective (low error rates). There is no point of comparison between these approaches because they are oriented to different applications and requirements (sparse versus dense). However, the voting scheme shows that the aggregation of mutual and gradient information can provide accurate 3D representations.
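The WTA selection with cost thresholding recalled above can be sketched as follows; the cost volume here is a toy placeholder for whichever cost function (MI, GI, MGI, or C) fills it.

```python
import numpy as np

def wta_with_threshold(cost_volume, max_cost):
    """Winner-takes-all disparity with a cost acceptance threshold.

    cost_volume -- (H, W, D) array of matching costs (lower is better)
    max_cost    -- pixels whose best cost exceeds this are rejected (-1),
                   trading the number of accepted matches against the error
    """
    best = cost_volume.argmin(axis=2)
    best_cost = cost_volume.min(axis=2)
    return np.where(best_cost <= max_cost, best, -1)

# Toy 1x2 cost volume with 3 disparity candidates per pixel.
costs = np.array([[[0.2, 0.5, 0.9],     # confident match at disparity 0
                   [0.8, 0.6, 0.7]]])   # best cost too high -> rejected
disp = wta_with_threshold(costs, max_cost=0.4)
```

Raising `max_cost` accepts more matches at the price of a higher expected error, which is exactly the trade-off described above.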

In order to reach the last objective of this dissertation, two multimodal stereo methods for dense representations were proposed; their results are similar to those obtained by state-of-the-art VS stereo algorithms, of course under the restrictions mentioned in Chapt. 4 and Chapt. 5. Both algorithms show that low absolute and relative error rates can be reached if an appropriate combination of prior knowledge and matching cost is fulfilled. Experimental results show that support regions based on superpixels achieve better qualitative and quantitative scores in comparison to an approach based on pixels; this is because large support regions compensate for the limited number of correct matches in textureless multimodal regions.

Both algorithms assume a piecewise planar world, which is valid for the proposed multimodal dataset. That constraint is needed to overcome the low correlation between thermal infrared and visible images, rather than to improve the 3D representations as usually intended in VS stereo approaches. This means that a Manhattan world assumption is a necessary condition in such a kind of system.

Chapters 6 and 7 present two different plane hypothesis generators; the first one is based on a plane affinity, while the second is based on a spectral clustering of graphs. The latter approach shows, on the one hand, to be robust enough to wrong plane estimations; on the other hand, it is able to find the principal orientations of the planes in the scene (clusters in the graph).
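The spectral clustering idea can be sketched as follows, with scikit-learn's `SpectralClustering` standing in for the thesis implementation; the plane normals, the affinity (squared cosine similarity, so opposite normals count as the same orientation), and the number of clusters are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(1)

# Local plane-normal estimates scattered around two dominant orientations,
# e.g. a facade and the ground plane.
normals = np.vstack([rng.normal([1, 0, 0], 0.05, size=(10, 3)),
                     rng.normal([0, 0, 1], 0.05, size=(10, 3))])
normals /= np.linalg.norm(normals, axis=1, keepdims=True)

# Graph affinity between nodes (plane estimates): squared cosine similarity.
affinity = (normals @ normals.T) ** 2

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(affinity)
# Each cluster groups the estimates sharing one principal plane orientation.
```
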

8.2 Future Work

In this dissertation, a set of multimodal stereo algorithms that exploit visual and thermal infrared information for recovering 3D representations of real scenes is proposed. Although there are several possible further research lines, we think that the most direct ones are related to improving the performance of the proposed algorithms. Hence, in the next paragraphs, several ideas for future work are briefly described.

Formulating new thermal infrared and visible cost functions. In general, the main problem of a multimodal stereo algorithm is related to its cost function; although in this dissertation several matching cost functions were proposed, there is still room for improvement. It is likely that better results can be obtained with additional effort, for instance by exploring new forms of: (i) computing the mutual information in a global scheme, instead of in a local way as presented in this dissertation; (ii) investigating other methods for measuring similarities between gradient fields; and finally, (iii) exploring other measures of similarity, such as local self-similarities, which show an interesting performance in a similar matching problem [141].

Handling occlusions. In this dissertation the occlusion problem is not taken into account. Actually, its effect is neglected by using a short distance between the multimodal cameras and avoiding complex viewpoints of the objects in the scene. A future work would be to explore how to extend the proposed multimodal stereo algorithms to address occlusions. Recently, a promising work on VS stereo matching has reported interesting results for this problem [10]. With this insight, it is expected that by combining the proposed multimodal algorithms with that work, a more robust algorithm could be obtained.

Adding more contextual information. This dissertation explored the use of contextual information through a planar/non-planar classifier. An attractive extension could be to add more contextual information, for example depth cues or geometric context, in order to get a better understanding of the scene (priors).

Incorporating multimodal stereo into other applications. Multiple camera systems have been incorporated into diverse domains; some of them include both thermal infrared and visible cameras, but working independently. For example, some new vehicles are being equipped with both sensors, opening the possibility of pedestrian detection algorithms based on thermal information and depth representations; in video surveillance, visible, thermal, and depth information can improve the overall performance; finally, in facility inspection, a multimodal stereo algorithm can help to detect losses in heating systems, damp, broken pipelines, etc.


Bibliography

[1] A. Akbarzadeh, J.-M. Frahm, P. Mordohai, B. Clipp, C. Engels, D. Gallup, P. Merrell, M. Phelps, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch, H. Towles, D. Nister, and M. Pollefeys. Towards urban 3D reconstruction from video. In Proc. Conf. 3D Imaging, Modeling, Processing, Visualization, and Transmission, pages 1–8, 2006.

[2] A. Auclair, N. Vincent, and L.D. Cohen. Using point correspondences without projective deformation for multi-view stereo reconstruction. In Proc. IEEE Int'l Conf. Image Processing, pages 193–196, 2008.

[3] N. Ayache and C. Hansen. Rectification of images for binocular and trinocular stereovision. In Proc. Int'l Conf. Pattern Recognition, volume 1, pages 11–16, 1988.

[4] K. Bahador, K. Alaa, O. Fakhreddine, and N. Saiedeh. Multisensor data fusion: A review of the state-of-the-art. Information Fusion, (0):1–12, 2011.

[5] S. Baker, R. Szeliski, and P. Anandan. A layered approach to stereo reconstruction. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 434–441, 1998.

[6] J. Banks, M. Bennamoun, K. Kubik, and P. Corke. A hybrid stereo matching algorithm incorporating the rank constraint. In Proc. IEEE Int'l Conf. Image Processing, volume 1, pages 33–37, 1999.

[7] S. Barnard and W. Thompson. Disparity analysis of images. IEEE Trans. Pattern Analysis and Machine Intelligence, 2(4):333–340, 1980.

[8] F. Barrera, F. Lumbreras, and A. Sappa. Multimodal template matching based on gradient and mutual information using scale-space. In Proc. IEEE Int'l Conf. Image Processing, pages 2749–2752, 2010.

[9] F. Barrera, F. Lumbreras, and A. Sappa. Multimodal stereo vision system: 3D data extraction and algorithm evaluation. IEEE J. Selected Topics in Signal Processing, 6(5):437–446, 2012.


[10] R. Ben-Ari and N. Sochen. Stereo matching with Mumford-Shah regularization and occlusion handling. IEEE Trans. Pattern Analysis and Machine Intelligence, 32(11):2071–2084, 2010.

[11] E. Bennett, J. Mason, and L. McMillan. Multispectral bilateral video fusion. IEEE Trans. Image Processing, 16(5):1185–1194, 2007.

[12] M. Bertozzi, E. Binelli, A. Broggi, and M. Rose. Stereo vision-based approaches for pedestrian detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 15–16, 2005.

[13] M. Bertozzi, A. Broggi, M. Felisa, S. Ghidoni, P. Grisleri, G. Vezzoni, C. Gomez, and M. Rose. Multi stereo-based pedestrian detection by means of daylight and far infrared cameras. In Object Tracking and Classification Beyond the Visible Spectrum, pages 371–401. 2009.

[14] M. Bertozzi, A. Broggi, M. Felisa, G. Vezzoni, and M. Rose. Low-level pedestrian detection by means of visible and far infra-red tetra-vision. In Proc. IEEE Intelligent Vehicles Symposium, pages 231–236, 2006.

[15] C. Beyan, A. Yigit, and A. Temizel. Fusion of thermal- and visible-band video for abandoned object detection. J. Electronic Imaging, 20(3):033001, 2011.

[16] S. Birchfield and C. Tomasi. Multiway cut for stereo and motion with slanted surfaces. In Proc. IEEE Int'l Conf. Computer Vision, volume 1, pages 489–495, 1999.

[17] M. Bleyer and M. Gelautz. A layered stereo matching algorithm using image segmentation and global visibility constraints. J. Photogrammetry and Remote Sensing, 59(3):128–150, 2005.

[18] M. Bleyer, C. Rother, and P. Kohli. Surface stereo with soft segmentation. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 1570–1577, 2010.

[19] J. Bouguet. Camera calibration toolbox for Matlab. http://www.vision.caltech.edu/bouguetj/calib_doc/index.html, 2012.

[20] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(9):1124–1137, 2004.

[21] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 377–384, 1999.

[22] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.


[23] M. Branislav and J. Kosecka. Multi-view superpixel stereo in man-made environments. Technical report, George Mason University, 2008.

[24] L. Brown. A survey of image registration techniques. ACM Computing Surveys, 24(4):325–376, 1992.

[25] M. Brown, D. Burschka, and D. Hager. Advances in computational stereo. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(8):993–1008, 2003.

[26] M. Brown and S. Susstrunk. Multi-spectral SIFT for scene category recognition. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 177–184, 2011.

[27] A. Caballero, J. Castillo, J. Serrano, and S. Bascon. Real-time human segmentation in infrared videos. Expert Systems with Applications, 38(3):2577–2584, 2011.

[28] D. Cobzas and H. Zhang. Planar patch extraction with noisy depth data. In Proc. Int'l Conf. 3-D Digital Imaging and Modeling, pages 240–245, 2001.

[29] L. Cohen, L. Vinet, P. Sander, and A. Gagalowicz. Hierarchical region based stereo matching. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 416–421, 1989.

[30] E. Coiras, J. Santamaría, and C. Miravet. Segment-based registration technique for visual-infrared images. Optical Engineering, 39(1):282–289, 2000.

[31] D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.

[32] C. Conaire, E. Cooke, N. O'Connor, N. Murphy, and A. Smearson. Background modelling in infrared and visible spectrum video for people tracking. In Proc. IEEE Wksp. Computer Vision and Pattern Recognition, volume 3, pages 20–28, 2005.

[33] S. Coorg and S. Teller. Extracting textured vertical facades from controlled close-range imagery. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 637–663, 1999.

[34] N. Cornelis, B. Leibe, K. Cornelis, and L. Gool. 3D urban scene modeling integrating recognition and reconstruction. Int'l J. Computer Vision, 78(3):121–141, 2008.

[35] J. Coughlan and A. Yuille. The Manhattan world assumption: Regularities in scene statistics which enable Bayesian inference. In Proc. Neural Information Processing Systems, pages 845–851, 2000.

[36] T. Cover and J. Thomas. Elements of Information Theory. Wiley-Interscience, 1991.


[37] J. Crowley, O. Riff, and J. Piater. Fast computation of characteristic scale using a half-octave pyramid. In Proc. Int'l Conf. Scale-Space Theories in Computer Vision.

[38] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 886–893, 2005.

[39] J. Davis and V. Sharma. Background-subtraction using contour-based fusion of thermal and visible imagery. Computer Vision and Image Understanding, 106(2-3):162–182, 2007.

[40] A. Dhua, F. Cutu, R. Hammoud, and S. Kiselewich. Triangulation based technique for efficient stereo computation in infrared images. In Proc. IEEE Intelligent Vehicles Symposium, pages 673–678, 2003.

[41] N. Dowson, T. Kadir, and R. Bowden. Estimating the joint statistics of images using nonparametric windows with application to registration using mutual information. IEEE Trans. Pattern Analysis and Machine Intelligence, 30(10):1841–1857, 2008.

[42] H. Durrant-Whyte. Sensor models and multisensor integration. Int'l J. Robotics Research, 7(6):97–113, 1988.

[43] G. Egnal. Mutual information as a stereo correspondence measure. Technical report, University of Pennsylvania, 2000.

[44] G. Egnal and R. Wildes. Detecting binocular half-occlusions: Empirical comparisons of five approaches. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(8):1127–1133, 2002.

[45] D. Fay, A. Waxman, M. Aguilar, D. Ireland, J. Racamato, W. Ross, W. Streilein, and M. Braun. Fusion of multi-sensor imagery for night vision: Color visualization, target learning and search. In Proc. Int'l Conf. Information Fusion, pages 215–219, 2000.

[46] P. Felzenszwalb and D. Huttenlocher. Efficient graph-based image segmentation. Int'l J. Computer Vision, 59:167–181, 2004.

[47] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Analysis and Machine Intelligence, 32:1627–1645, 2010.

[48] G. Fielding and M. Kam. Weighted matchings for dense stereo correspondence. Pattern Recognition, 33(9):1511–1524, 2000.

[49] D. Firmenich, M. Brown, and S. Susstrunk. Multispectral interest points for RGB-NIR image registration. In Proc. IEEE Int'l Conf. Image Processing, 2011.

[50] M. Fischler and R. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.


[51] FLIR. The ultimate infrared handbook for R&D professionals. Technical report, FLIR Commercial Vision Systems B.V., 2009.

[52] C. Fookes, A. Maeder, S. Sridharan, and J. Cook. Multi-spectral stereo image matching using mutual information. In Proc. Conf. 3D Imaging, Modeling, Processing, Visualization, and Transmission, pages 961–968, 2004.

[53] C. Fredembach and S. Susstrunk. Illuminant estimation and detection using near-infrared. In Proc. Digital Photography, volume 7250 of SPIE Proceedings, pages 72–80. SPIE, 2009.

[54] Y. Furukawa, B. Curless, S. Seitz, and R. Szeliski. Manhattan-world stereo. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 1422–1429, 2009.

[55] A. Fusiello, E. Trucco, and A. Verri. A compact algorithm for rectification of stereo pairs. Machine Vision and Applications, 12(1):16–22, 2000.

[56] D. Gallup, J. Frahm, and M. Pollefeys. Piecewise planar and non-planar stereo for urban scene reconstruction. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 1418–1425, 2010.

[57] D. Geiger, B. Ladendorf, and A. Yuille. Occlusions and binocular stereo. Int'l J. Computer Vision, 14:211–226, 1995.

[58] T. Gevers and H. Stokman. Classifying color edges in video into shadow-geometry, highlight, or material transitions. IEEE Trans. Multimedia, 5(2):237–243, 2003.

[59] J. Gluckman and S. Nayar. Rectifying transformations that minimize resampling effects. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 111–117, 2001.

[60] A. Grassi, V. Frolov, and F. Leon. Information fusion to detect and classify pedestrians using invariant features. Information Fusion, 12(4):284–292, 2011.

[61] R. Gray. Entropy and Information Theory. Springer-Verlag, New York, USA, 2009.

[62] K. Hajebi and J. Zelek. Sparse disparity map from uncalibrated infrared stereo images. In Proc. Canadian Conf. Computer and Robot Vision, page 17, 2006.

[63] K. Hajebi and J. Zelek. Dense surface from infrared stereo. In Proc. IEEE Wksp. Applications of Computer Vision, page 21, 2007.

[64] D. Hall and J. Llinas. An introduction to multisensor data fusion. Proc. of the IEEE, 85(1):6–23, 1997.

[65] D. Hall and S. McMullen. Mathematical Techniques in Multisensor Data Fusion. Artech House, second edition, 1992.


[66] J. Han and B. Bhanu. Fusion of color and infrared video for moving human detection. Pattern Recognition, 40:1771–1784, 2007.

[67] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.

[68] G. Hermosillo and O. Faugeras. Variational methods for multimodal image matching. Int'l J. Computer Vision, 50:329–343, 2002.

[69] H. Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, volume 2, pages 807–814, 2005.

[70] H. Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Analysis and Machine Intelligence, 30(2):328–341, 2008.

[71] H. Hirschmuller, P. Innocent, and J. Garibaldi. Real-time correlation-based stereo vision with reduced border errors. Int'l J. Computer Vision, 47(1-3):229–246, 2002.

[72] H. Hirschmuller and D. Scharstein. Evaluation of cost functions for stereo matching. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 1–8, 2007.

[73] H. Hirschmuller and D. Scharstein. Evaluation of stereo matching costs on images with radiometric differences. IEEE Trans. Pattern Analysis and Machine Intelligence, 31(9):1582–1599, 2009.

[74] D. Hoiem, A. Efros, and M. Hebert. Geometric context from a single image. In Proc. IEEE Int'l Conf. Computer Vision, volume 1, pages 654–661, 2005.

[75] X. Hu and P. Mordohai. A quantitative evaluation of confidence measures for stereo vision. IEEE Trans. Pattern Analysis and Machine Intelligence, 34(11):2121–2133, 2012.

[76] C. Hua, P. Varshney, and M. Slamani. On registration of regions of interest (ROI) in video sequences. In Proc. IEEE Int'l Conf. Advanced Video and Signal-Based Surveillance, pages 313–318, 2003.

[77] M. Irani and P. Anandan. Robust multi-sensor image alignment. In Proc. IEEE Int'l Conf. Computer Vision, pages 959–967, 1998.

[78] A. Irschara, C. Zach, and H. Bischof. Towards wiki-based dense city modeling. In Proc. IEEE Wksp. Int'l Conf. Computer Vision, pages 1–8, 2007.

[79] S. Irwan, A. Ahmed, N. Ibrahim, and N. Zakaria. Roof angle for optimum thermal and energy performance of insulated roof. In Proc. Int'l Conf. Energy and Environment, pages 145–150, 2009.


[80] M. Itoh, M. Ozeki, Y. Nakamura, and Y. Ohta. Simple and robust tracking of hands and objects for video-based multimedia production. In Proc. IEEE Conf. Multisensor Fusion and Integration for Intelligent Systems, 2003.

[81] J. Sun, Y. Li, S. B. Kang, and H.-Y. Shum. Symmetric stereo matching for occlusion handling. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, volume 2, pages 399–406, 2005.

[82] X. Ju, J. Nebel, and J. Siebert. 3d thermography imaging standardization technique for inflammation diagnosis. In Proc. SPIE, volume 5640, pages 266–273, 2005.

[83] S. Jung, J. Eledath, S. Johansson, and V. Mathevon. Egomotion estimation in monocular infra-red image sequence for night vision applications. In Proc. IEEE Wksp. Applications of Computer Vision, page 8, 2007.

[84] R. Kalarot and J. Morris. Comparison of FPGA and GPU implementations of real-time stereo vision. In Proc. IEEE Wksp. Computer Vision and Pattern Recognition, pages 9–15, 2010.

[85] T. Kanade and M. Okutomi. A stereo matching algorithm with an adaptive window: theory and experiment. IEEE Trans. Pattern Analysis and Machine Intelligence, 16(9):920–932, 1994.

[86] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Analysis and Machine Intelligence, 24:881–892, 2002.

[87] J. Kessler. Functional description of the data fusion process. Technical report, Office of Naval Technology, Naval Air Development Center, 1992.

[88] J. Kim, V. Kolmogorov, and R. Zabih. Visual correspondence using energy minimization and mutual information. In Proc. IEEE Int'l Conf. Computer Vision, pages 1033–1040, 2003.

[89] S. Kirkpatrick, C. Gelatt, and M. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.

[90] A. Klaus, M. Sormann, and K. Karner. Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In Proc. Int'l Conf. Pattern Recognition, volume 3, pages 15–18, 2006.

[91] K. Konolige. Small vision systems: Hardware and implementation. In Proc. Int'l Symp. Robotics Research, pages 111–116, 1997.

[92] J. Kostliva, J. Cech, and R. Sara. Feasibility boundary in dense and semi-dense stereo matching. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 1–8, 2007.

[93] J. Kostliva, R. Sara, and J. Cech. Stereo algorithm evaluation. http://cmp.felk.cvut.cz/~stereo/, 2012.


[94] S. Krotosky and M. Trivedi. Registration of multimodal stereo images using disparity voting from correspondence windows. In Proc. IEEE Int'l Conf. Advanced Video and Signal-Based Surveillance, pages 91–97, 2006.

[95] S. Krotosky and M. Trivedi. Mutual information based registration of multimodal stereo videos for person tracking. Computer Vision and Image Understanding, 106(2-3):270–287, 2007.

[96] S. Krotosky and M. Trivedi. On color-, infrared-, and multimodal-stereo approaches to pedestrian detection. IEEE Trans. Intelligent Transportation Systems, 8(4), 2007.

[97] S. Krotosky and M. Trivedi. Person surveillance using visual and infrared imagery. IEEE Trans. Circuits and Systems for Video Technology, 18(8):1096–1105, 2008.

[98] S. Krotosky and M. Trivedi. Registering multimodal imagery with occluding objects using mutual information: Application to stereo tracking of humans. In Augmented Vision Perception in Infrared, Advances in Pattern Recognition, pages 321–347, 2009.

[99] A. Kuijper. Mutual information aspects of scale space images. Pattern Recognition, 37(12):2361–2373, 2004.

[100] R. Labayrade and D. Aubert. A single framework for vehicle roll, pitch, yaw estimation and obstacles detection by stereovision. In Proc. IEEE Intelligent Vehicles Symposium, pages 31–36, 2003.

[101] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 2169–2178, 2006.

[102] A. Levinshtein, A. Stere, K. Kutulakos, D. Fleet, S. Dickinson, and K. Siddiqi. Turbopixels: Fast superpixels using geometric flows. IEEE Trans. Pattern Analysis and Machine Intelligence, 31(12):2290–2297, 2009.

[103] A. Leykin and R. Hammoud. Pedestrian tracking by fusion of thermal-visible surveillance videos. Machine Vision and Applications, 21(1):587–595, 2010.

[104] H. Li and G. Chen. Segment-based stereo matching using graph cuts. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 74–81, 2004.

[105] H. Li and Y. Zhou. Automatic EO/IR sensor image registration. In Proc. IEEE Int'l Conf. Image Processing, volume 3, pages 240–243, 1995.

[106] Y. Li, Q. Zheng, A. Sharf, D. Cohen-Or, B. Chen, and N. Mitra. 2d-3d fusion for layer decomposition of urban facades. In Proc. IEEE Int'l Conf. Computer Vision, pages 882–889, 2011.


[107] M. Liggins, D. Hall, and J. Llinas. Multisensor Data Fusion: Theory and Practice. CRC, second edition, 2008.

[108] M. Lin and C. Tomasi. Surfaces with occlusions from layered stereo. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(8):1073–1078, 2004.

[109] T. Lindeberg. Scale-Space Theory in Computer Vision. Kluwer Academic Publishers, 1994.

[110] G. Litos, X. Zabulis, and G. Triantafyllidis. Synchronous image acquisition based on network synchronization. In Proc. IEEE Wksp. Computer Vision and Pattern Recognition, page 167, 2006.

[111] H. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293:133–135, 1981.

[112] D. Lowe. Distinctive image features from scale-invariant keypoints. Int'l J. Computer Vision, 60(2):91–110, 2004.

[113] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. Int'l Joint Conf. Artificial Intelligence, pages 674–679, 1981.

[114] J. Maciel and J. Costeira. A global solution to sparse correspondence problems. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(2):187–199, 2003.

[115] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens. Multimodality image registration by maximization of mutual information. IEEE Trans. Medical Imaging, 16(2):187–198, 1997.

[116] J. Mallon and P. Whelan. Projective rectification from the fundamental matrix. Image and Vision Computing, 23(7):643–650, 2005.

[117] D. Marr and T. Poggio. Cooperative computation of stereo disparity. Science, 194(4262):283–287, 1976.

[118] X. Mei, X. Sun, M. Zhou, S. Jiao, H. Wang, and X. Zhang. On building an accurate stereo matching system on graphics hardware. In Proc. IEEE Wksp. Int'l Conf. Computer Vision, pages 467–474, 2011.

[119] B. Micusik and J. Kosecka. Piecewise planar city 3d modeling from street view panoramic sequences. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 2906–2912, 2009.

[120] N. Morris, S. Avidan, W. Matusik, and H. Pfister. Statistics of infrared images. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 1–7, 2007.

[121] S. Nadimi and B. Bhanu. Physics-based cooperative sensor fusion for moving object detection. In Proc. IEEE Wksp. Computer Vision and Pattern Recognition, pages 1–8, 2004.


[122] Y. Ohta and T. Kanade. Stereo by intra- and inter-scanline search using dynamic programming. IEEE Trans. Pattern Analysis and Machine Intelligence, 7(2):139–154, 1985.

[123] N. Oon-Ee and G. Velappa. A novel modular framework for stereo vision. In Proc. IEEE Int'l Conf. Advanced Intelligent Mechatronics, pages 857–862, 2009.

[124] F. Perronnin, C. Dance, G. Csurka, and M. Bressan. Adapted vocabularies for generic visual categorization. In Proc. European Conf. Computer Vision, pages 464–475, 2006.

[125] J. Pluim, J. Maintz, and M. Viergever. Image registration by maximization of combined mutual information and gradient information. IEEE Trans. Medical Imaging, 19(8):809–814, 2000.

[126] J. Pluim, J. Maintz, and M. Viergever. Mutual information matching in multiresolution contexts. Image and Vision Computing, 19(1-2):45–52, 2001.

[127] M. Pollefeys, D. Nister, J. M. Frahm, A. Akbarzadeh, P. Mordohai, B. Clipp, C. Engels, D. Gallup, S. J. Kim, P. Merrell, C. Salmi, S. Sinha, B. Talton, L. Wang, Q. Yang, H. Stewenius, R. Yang, G. Welch, and H. Towles. Detailed real-time urban 3d reconstruction from video. Int'l J. Computer Vision, 78(2-3):143–167, 2008.

[128] S. Prakash, P. Lee, and T. Caelli. 3d mapping of surface temperature using thermal stereo. In Proc. Int'l Conf. Control, Automation, Robotics and Vision, pages 1–4, 2006.

[129] A. Rogalski. Infrared detectors: an overview. Infrared Physics & Technology, 43(3-5):187–210, 2002.

[130] P. Rothman and R. Denton. Fusion or confusion: knowledge or nonsense. In Proc. SPIE, volume 1470, pages 2–12, 1991.

[131] S. Roy and I. Cox. A maximum-flow formulation of the n-camera stereo correspondence problem. In Proc. IEEE Int'l Conf. Computer Vision, pages 492–499, 1998.

[132] R. Yang, M. Pollefeys, and S. Li. Improved real-time stereo on commodity graphics hardware. In Proc. IEEE Wksp. Computer Vision and Pattern Recognition, page 36, 2004.

[133] Z. Sadeghipoor, Y. Lu, and S. Susstrunk. Correlation-based joint acquisition and demosaicing of visible and near-infrared images. In Proc. IEEE Int'l Conf. Image Processing, pages 3165–3168, 2011.

[134] N. Salamati, A. Germain, and S. Susstrunk. Removing shadows from images using color and near-infrared. In Proc. IEEE Int'l Conf. Image Processing, pages 1713–1716, 2011.


[135] A. Saxena, S. Chung, and A. Ng. 3-d depth reconstruction from a single still image. Int'l J. Computer Vision, 76(1):53–69, 2008.

[136] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int'l J. Computer Vision, 47(1-3):7–42, 2002.

[137] D. Scharstein and R. Szeliski. High-accuracy stereo depth maps using structured light. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 195–202, 2003.

[138] Y. Schechner and S. Nayar. Generalized mosaicing: Wide field of view multispectral imaging. IEEE Trans. Pattern Analysis and Machine Intelligence, 24(10):1334–1348, 2002.

[139] D. Scribner, P. Warren, and J. Schuler. Extending color vision methods to bands beyond the visible. Machine Vision and Applications, 11:306–312, 2000.

[140] S. Shafer. Using color to separate reflection components. In Color, pages 43–51. Wiley, 1992.

[141] E. Shechtman and M. Irani. Matching local self-similarities across images and videos. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 1–8, 2007.

[142] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[143] O. Sidek and S. Quadri. A review of data fusion models and systems. Int'l J. Image and Data Fusion, 3(1):3–21, 2012.

[144] S. Sinha, D. Steedly, and R. Szeliski. Piecewise planar stereo for image-based rendering. In Proc. IEEE Int'l Conf. Computer Vision, pages 1881–1888, 2009.

[145] D. Smith and S. Singh. Approaches to multisensor data fusion in target tracking: A survey. IEEE Trans. Knowledge and Data Engineering, 18:1696–1710, 2006.

[146] D. Socolinsky and L. Wolff. Multispectral image visualization through first-order fusion. IEEE Trans. Image Processing, 11(8):923–931, 2002.

[147] D. Socolinsky and L. Wolff. Face recognition in low-light environments using fusion of thermal infrared and intensified imagery. In Augmented Vision Perception in Infrared, Advances in Pattern Recognition, pages 197–211, 2009.

[148] C. Studholme, D. Hill, and D. Hawkes. An overlap invariant entropy measure of 3d medical image alignment. Pattern Recognition, 32(1):71–86, 1999.

[149] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother. A comparative study of energy minimization methods for Markov random fields with smoothness-based priors. IEEE Trans. Pattern Analysis and Machine Intelligence, 30(6):1068–1080, 2008.


[150] S. Tan, F. Che, Z. Xiaowu, K. Teo, S. Gao, D. Pinjala, and Y. Hoe. Thermal performance analysis of a 3d package. In Proc. Electronics Packaging Technology Conf., pages 72–75, 2010.

[151] H. Tao, H. Sawhney, and R. Kumar. A global matching framework for stereo computation. In Proc. IEEE Int'l Conf. Computer Vision, volume 1, pages 532–539, 2001.

[152] F. Tombari, S. Mattoccia, L. Stefano, and E. Addimanda. Classification and evaluation of cost aggregation methods for stereo correspondence. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 1–8, 2008.

[153] A. Torabi and G. Bilodeau. Local self-similarity as a dense stereo correspondence measure for thermal-visible video registration. In Proc. IEEE Wksp. Computer Vision and Pattern Recognition, pages 61–67, 2011.

[154] A. Torabi, M. Najafianrazavi, and G. Bilodeau. A comparative evaluation of multimodal dense stereo correspondence measures. In Proc. IEEE Int'l Symp. Robotic and Sensors Environments, pages 143–148, 2011.

[155] P. Torr and A. Zisserman. MLESAC: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78:138–156, 2000.

[156] H. Torresan, B. Turgeon, C. Ibarra-Castanedo, P. Hebert, and X. Maldague. Advanced surveillance systems: combining video and thermal imagery for pedestrian detection. Proc. SPIE, 5405:506–515, 2004.

[157] M. Trivedi, S. Cheng, E. Childers, and S. Krotosky. Occupant posture analysis with stereo and thermal infrared video: Algorithms and experimental evaluation. IEEE Trans. Vehicular Technology, 53(6):1698–1712, 2004.

[158] C. Vanegas, D. Aliaga, and B. Benes. Automatic extraction of Manhattan-world building masses from 3d laser range scans. IEEE Trans. Visualization and Computer Graphics.

[159] O. Veksler. Fast variable window for stereo correspondence using integral images. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 556–561, 2003.

[160] A. Vezhnevets and V. Vezhnevets. GML AdaBoost toolbox. http://www.inf.ethz.ch/personal/vezhneva/index.html, 2012.

[161] P. Viola and W. Wells. Alignment by maximization of mutual information. Int'l J. Computer Vision, 24(2):137–154, 1997.

[162] D. Wang and K. Lim. Obtaining depth map from segment-based stereo matching using graph cuts. Journal of Visual Communication and Image Representation, 22(4):325–331, 2011.


[163] J. Wang and E. Adelson. Representing moving images with layers. IEEE Trans. Image Processing, 3(5):625–638, 1994.

[164] Z. Wang, D. Ziou, C. Armenakis, D. Li, and Q. Li. A comparative analysis of image fusion methods. IEEE Trans. Geoscience and Remote Sensing, 43(6):1391–1402, 2005.

[165] Y. Wei and L. Quan. Region-based progressive stereo matching. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, volume 1, pages 106–113, 2004.

[166] W. Shi, G. Arabadjis, B. Bishop, P. Hill, and R. Plasse. Development of an experimental prototype multi-modal netted sensor fence for homeland defense and border integrity. In Proc. IEEE Conf. Technologies for Homeland Security, pages 221–226, 2007.

[167] T. Werner and A. Zisserman. New techniques for automated architectural reconstruction from photographs. In Proc. European Conf. Computer Vision, pages 541–555, 2002.

[168] F. White. Data fusion lexicon. Technical report, Data fusion subpanel of the Joint Directors of Laboratories Technical panel for C3, 1991.

[169] R. Yang and Y. Chen. Design of a 3-d infrared imaging system using structured light. IEEE Trans. Instrumentation and Measurement, 60(2):608–617, 2011.

[170] Z. Yang, G. Shen, W. Wang, Z. Qian, and Y. Ke. Spatial-spectral cross correlation for reliable multispectral image registration. In Proc. IEEE Applied Imagery Pattern Recognition Wksp., pages 1–8, 2009.

[171] G. Ye. Image Registration and Super-resolution Mosaicing. PhD thesis, The University of New South Wales, 2005.

[172] K. Yoon and I. Kweon. Locally adaptive support-weight approach for visual correspondence search. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pages 924–931, 2005.

[173] R. Zabih and J. Woodfill. Non-parametric local transforms for computing visual correspondence. In Proc. European Conf. Computer Vision, pages 151–158, 1994.

[174] L. Zheng and R. Laganiere. Registration of IR and EO video sequences based on frame difference. In Proc. Canadian Conf. Computer and Robot Vision, pages 459–464, 2007.

[175] A. Zisserman, F. Schaffalitzky, T. Werner, and A. Fitzgibbon. Automated reconstruction from multiple photographs. In Proc. IEEE Int'l Conf. Image Processing, volume 3, pages 517–520, 2002.

[176] B. Zitova and J. Flusser. Image registration methods: a survey. Image and Vision Computing, 21:977–1000, 2003.


Publications

The following publications are a direct result of the research carried out during the elaboration of this thesis, and give an idea of the progression that has been achieved.

Journals

• F. Barrera, F. Lumbreras, A. Sappa. Multimodal Stereo Vision System: 3D Data Extraction and Algorithm Evaluation, In IEEE Journal of Selected Topics in Signal Processing, 6(5):437–446, 2012.

• F. Barrera, F. Lumbreras, A. Sappa. Multispectral Piecewise Planar Stereo using Manhattan-World Assumption, In Pattern Recognition Letters, 2012 (Available online).

• C. Aguilera, F. Barrera, A. Sappa, and R. Toledo. Multispectral Image Feature Points, In Sensors, 12(9):12661–12672, 2012.

• F. Barrera, F. Lumbreras, A. Sappa. Context-Based 3D Extraction from Multimodal Stereo Data, In Information Fusion (Elsevier), (in second revision).

International Conferences

• F. Barrera, F. Lumbreras, and A. Sappa. Evaluation of Similarity Functions in Multimodal Stereo, In Proc. Int'l Conf. Image Analysis and Recognition (ICIAR), pages 320–329, 2012.

• F. Barrera, F. Lumbreras, C. Aguilera, and A. Sappa. Planar-Based Multispectral Stereo, In Proc. Quantitative InfraRed Thermography (QIRT), pages 1–8, 2012.

• C. Aguilera, F. Barrera, A. Sappa, and R. Toledo. A Novel SIFT-Like-Based Approach for FIR-VS Images Registration, In Proc. Quantitative InfraRed Thermography (QIRT), pages 1–8, 2012.

• F. Barrera, F. Lumbreras, and A. Sappa. Multimodal template matching based on gradient and mutual information using scale-space, In Proc. IEEE Int'l Conf. Image Processing (ICIP), pages 2749–2752, 2010.



Technical Reports

• F. Barrera, F. Lumbreras, and A. Sappa. Towards a multimodal stereo rig for ADAS: State of art and algorithms. CVC Technical Report, Computer Vision Center (Universitat Autonoma de Barcelona), 2008.