From Dusk till Dawn: Modeling in the Dark

Filip Radenović¹ Johannes L. Schönberger²,³ Dinghuang Ji² Jan-Michael Frahm² Ondřej Chum¹ Jiří Matas¹

¹CMP, Faculty of Electrical Engineering, Czech Technical University in Prague
²Department of Computer Science, The University of North Carolina at Chapel Hill
³Department of Computer Science, Eidgenössische Technische Hochschule Zürich

Abstract

Internet photo collections naturally contain a large variety of illumination conditions, with the largest difference between day and night images. Current modeling techniques do not embrace the broad illumination range, often leading to reconstruction failure or severe artifacts. We present an algorithm that leverages the appearance variety to obtain more complete and accurate scene geometry along with consistent multi-illumination appearance information. The proposed method relies on automatic scene appearance grouping, which is used to obtain separate dense 3D models. Subsequent model fusion combines the separate models into a complete and accurate reconstruction of the scene. In addition, we propose a method to derive the appearance information for the model under the different illumination conditions, even for scene parts that are not observed under one illumination condition. To achieve this, we develop a cross-illumination color transfer technique. We evaluate our method on a large variety of landmarks from across Europe reconstructed from a database of 7.4M images.

1. Introduction

Image retrieval and 3D reconstruction have made big strides in the past decade. Recently, image retrieval and Structure-from-Motion (SfM) methods have been combined to achieve modeling from 100 million images [10]. Combining them not only tackles scale but also allows the reconstruction of spatially complete models with high levels of detail [21]. A key observation is that an increasing number of images in the collections eases the registration of images taken under very different illumination conditions into a single 3D model.

This work was done while J. L. Schönberger was at the University of North Carolina at Chapel Hill. We thank Marc Eder for helping with experiments. F. Radenović, O. Chum, and J. Matas were supported by the MSMT LL1303 ERC-CZ and GACR P103/12/G084 grants. J. L. Schönberger, D. Ji, and J.-M. Frahm were supported in part by the National Science Foundation No. IIS-1349074, No. CNS-1405847, and the MITRE Corp.

Figure 1. Night model of St. Peter's Cathedral in Rome reconstructed by our method. Left: Model obtained from night images only. Right: Fused, recolored model from day and night images.

This is a feat not achieved by direct matching techniques, but rather by discovering sequences of matching images with a gradual change of illumination; see Figure 2 for such a transition sequence.

A sparse 3D reconstruction of feature points, obtained from a mixed set of day and night images, is reliable and occurs naturally in large-scale photo collections. This is due to the presence of "transition" images and due to the fact that some of the detected features, after photometric normalization, provide sufficiently stable matches across illumination transitions. Examples of such feature points, the corresponding image patches, and their normalized descriptor patches are shown in Figure 4.

However, while beneficial for SfM, the registration of mixed-illumination images creates challenges for dense 3D reconstruction, which delivers poor results or even fails in the presence of day and night images [15]. In particular, mixed illuminations cause erroneous dense correspondences due to accidental photo-consistency in multi-view stereo that distort the texture composition of the models.

As a first contribution of the paper, we propose a method for automatically separating day and night images based on the sparse scene graph produced by SfM and a learned day/night color model. The separated sets of day and night images then allow us to compute reliable dense reconstructions for each of the two modalities separately.

Figure 2. Týn Church, Prague. Registration of day and night images into the same model through smoothly varying illumination in intermediate images during dusk and dawn.

While two separate models may at first sight be seen as a drawback, we demonstrate that they often contain regions in which only one of the models provides reliable surface reconstruction. As expected, we observe that daytime images are usually significantly more frequent and, due to better illumination conditions, lead to overall superior models compared to nighttime models. Interestingly, we observed several situations where night images provide better reconstruction than their daytime counterparts: (i) when lights at night illuminate or texture areas that are shadowed or ambiguous during the day, and (ii) when areas with repeated and confusing textures are not lit during the night, allowing unambiguous dense matching in those areas. Our second contribution is to fuse the initially separated dense models into a superior model combining the strengths of both modalities.

Finally, as a third contribution, we introduce a color transfer method to consistently re-color the composite 3D areas for each illumination condition, even for areas that were not reconstructed under that illumination, i.e., we compute a nighttime color even for geometry that is only reconstructed in the day model.

In summary, our contributions achieve a more complete and accurate dense 3D reconstruction for the mixed day- and nighttime images that are typically present in Internet photo collections. Previously, the joint modeling of day- and nighttime images caused disturbing artifacts or even led to reconstruction failures. Additionally, we are able to reconstruct a complete color representation for the dense model surfaces, leveraging the corresponding appearance characteristics of the daytime and nighttime images.

2. Related Work

The seminal papers of Snavely et al. [23, 24] first proposed reconstruction from unordered Internet photo collections. To determine overlapping views, Snavely et al. performed exhaustive pairwise geometric verification. While this ensures the highest possible discovery rate, it impairs the scalability of their system due to the quadratic complexity growth in the number of images. In the following years, several methods for tackling the scalability of unordered photo collection reconstruction were proposed: appearance-based clustering methods for grouping the images [12, 6], vocabulary-tree-based approaches [1, 14], and most recently streaming-based methods leveraging augmented appearance indexing [10]. Although these systems successfully scaled the reconstruction to tens of millions of images, they lost the ability to reconstruct details of the scene in the process. Recently, Schönberger et al. [21] proposed a method to overcome this limitation. Their method leverages a tightly-coupled SfM and image retrieval system [17] to overcome the loss of fine details in the models while keeping the scalability of state-of-the-art reconstruction systems. Our reconstruction system is inspired by this method. Snavely et al. [23, 24] empirically observed the difficulty of registering night images due to their noisiness and darkness. In our system, we overcome this limitation by registering night images mainly through transition images under intermediate illumination conditions during dusk and dawn (see Figure 2). Snavely's system [22] provided an option to manually select day or night images to explore similar viewpoints and illuminations. In contrast, our system automatically classifies and clusters day and night images. In addition, we use the clustering to improve reconstruction results.

Schindler and Dellaert [19] proposed a method for analyzing the point in time at which a photo was taken. In contrast to our approach, their method relied on observable changes of the scene geometry, e.g., construction or demolition of buildings, which typically happen over longer periods of time. Our method focuses on modeling the illumination changes over the course of a day. Recently, Matzen and Snavely [16] proposed an approach to model and extract temporal scene appearance changes in 3D reconstructions. They perform temporal segmentation of the 3D model to obtain objects whose appearance changed over time. The recovered object appearance changes (wall art, signs, billboards, storefronts, etc.) relate to scene texture changes but not to illumination changes, due to their search for change over longer periods of time. In contrast, our algorithm aims at determining periodic short-term (over the course of a day) temporal scene appearance and illumination changes. Hence, our proposed approach deals with much smaller appearance differences in segmented parts of the reconstruction. These changes are caused by different illuminations during daytime and nighttime and are not correlated with scene texture changes.

Martin-Brualla et al. [15] proposed to compute time-lapse mosaics from unordered Internet photo collections of landmarks. They observed the difficulties posed by the presence of night and day images in the same reconstruction. Specifically, they noted that mixing day and night images within the same model introduces "unrealistic twilight effects". In this paper, we propose an approach that overcomes these failure cases and obtains a correct representation of the 3D model for both modes of illumination.

[Figure 3 pipeline diagram: Image Database → SfM (Scene Graph) → Sparse Model → Day/Night Clustering → Dense Reconstruction → Fusion → Fused Model]

Figure 3. The proposed day/night modeling pipeline, from sparse modeling through day/night clustering to the final dense modeling.

Ji et al. [11] proposed a system to automatically create illumination mosaics for a given outdoor scene from Internet photos. Their work strives to depict the temporal variability of the observed scene by presenting a 2D image of the scene with varying illumination along the rows of the image. They perform a search for a chain of images that exhibit smooth illumination variation and that are all related through a homography mapping. In contrast, our method considers all available images and not only the images related through homographies. Instead of illumination modeling in 2D, our approach achieves illumination separation and modeling in 3D for the entire scene. Moreover, the ordering of Ji et al. [11] heavily relies on the color of the sky shown in the images, whereas our system can perform day-night separation even with no sky present in any of the images.

Verdie et al. [25] learned a feature detector that is stable under significant illumination changes, facilitating the matching between day- and nighttime images. They observed that standard feature detectors exhibit significant temporal sensitivity, i.e., reduced repeatability under different illumination conditions. We exploit this temporal-sensitivity "flaw" of the standard detectors to efficiently split a given 3D model into groups of cameras and points that have the highest illumination change across groups, i.e., one group for the day and another for the night.

3. Overview

Before delving into the details of our method for day and night model reconstruction, we provide an overview as illustrated in Figure 3. It starts with a database of unordered images. During the initial phase of the reconstruction, we build a sparse 3D model using SfM (see Section 4). In support of sparse modeling, we index all images in the database using a min-Hash and find reconstruction seeds by leveraging geometrically verified hash collisions. Next, our SfM algorithm uses these seeds to build sparse 3D models for the scenes contained in the photo collection. Specifically, it uses a feedback loop to gradually extend the reconstruction through dedicated queries against the database. The resulting sparse model contains day and night images registered into the same model and represented as one scene graph.

In the next step, a dense scene model is obtained. Given the previously observed difficulties and artifacts caused by mixed day and night images, we deviate from the standard approach of directly proceeding to dense reconstruction. We first split the scene graph into day and night clusters to separate the images of the different illumination conditions (see Section 5). This in essence separates the scene graph into two scene graphs, one for daytime images and one for nighttime images. Then, we perform separate dense geometry estimation for the images in each of the scene graphs, yielding two separate dense 3D models (see Section 6). Subsequently, the two dense models are aligned into one common model representing the overall dense scene geometry. As part of computing the dense scene geometry, we obtain the color information of the point cloud under the two illumination conditions, i.e., a daytime color and a nighttime color for each point. Given that not all parts of the common model are necessarily visible both at daytime and at nighttime, we then determine the missing color information through cross-illumination transfer. Specifically, we use one illumination condition to find similar patches with a corresponding color in the other illumination. The color information of the patches under one illumination is then used to compose the missing color information for the point under the other illumination.

4. Reconstruction

In this section, we detail our approach for efficiently reconstructing all 3D models contained in a given image database. We use a generic database from [21] with over 7.4 million images downloaded from Flickr through keywords of famous landmarks, cities, countries, and architectural sites. The approach starts with an initial clustering procedure to find putative spatially related images. These spatially related images are subsequently used to seed an iterative reconstruction process that repeatedly extends the 3D model through a tight integration of the image retrieval and SfM modules, similar to the approach by Schönberger et al. [21, 20]. In contrast to their system, our approach exhaustively builds models for the entire image database. Due to the massive number of images in the database, exhaustive reconstruction imposes several challenges in terms of efficiency, which we address through an initial clustering procedure and a parallelized implementation of their system.

Clustering To seed our iterative reconstruction process efficiently, we find independent sets of spatially overlapping images using the clustering approach by Chum et al. [4]. This approach first indexes all database images in a min-Hash table and then uses spatially verified hash collisions as cluster seeds. Next, an incremental query expansion [5, 18] with spatial verification extends the initial clusters with additional images of the same landmark. The nearest-neighbor images in this query expansion step then define the graph of overlapping images, the so-called scene graph. Given that query expansion is a depth-first search strategy, the resulting scene graph is only sparsely connected. However, in order to achieve a successful reconstruction, SfM requires a denser scene graph than provided by the clustering method. Therefore, we first densify the scene graph as described in the following paragraph before using it in SfM. Compared to the approach in [21], which takes a single query image as input for the reconstruction, this clustering step reduces the number of query images dramatically. Rather than seeding the reconstruction with 7.4M query images, the clustering procedure identifies 19,546 individual landmarks used to initialize the subsequent reconstruction procedure, thereby reducing the number of seeds by three orders of magnitude.

Densification Next, we densify the initially sparse scene graph for improved reconstruction robustness and completeness. In the spirit of Schönberger et al. [21], we leverage the spatially verified image pairs and their visual-word matches, along with an affine model, as hypotheses for subsequent exhaustive feature matching and epipolar verification. From this exhaustive verification, we not only obtain a higher number of feature correspondences but also determine additional image pairs to densify the scene graph. More importantly, beyond the benefit of additional image pairs, the significantly increased number of feature correspondences is essential for establishing feature tracks from day to night images through dusk and dawn. Only through these transitive connections are we able to reliably register day and night images into a single 3D model.

Structure-from-Motion The densified scene graph is the input to the subsequent incremental SfM algorithm, which treats each edge in the graph as a putative image pair for reconstruction and attempts to reconstruct every connected component within a cluster. Connected components with fewer than 20 registered images are discarded for the purposes of day/night modeling, as they typically lack a sufficient number of transition images during dusk and dawn.

Extension To boost registration completeness, a final extension step issues further queries for all registered images in each reconstructed connected component. If new images are found and spatially verified, we again perform scene graph densification and use SfM to register the new views into the previously reconstructed models. While significantly increasing the size of the reconstructed models, the extension process also improves the performance of the day/night modeling step. The initial set of images obtained in clustering often contains images from only one modality, i.e., either day or night, even though our large-scale image database contains images of both modalities for almost all landmarks. The iterative extension overcomes this problem by incrementally growing the model from day to night or vice versa through transition images during dusk and dawn (see Figure 2 for an example).

5. Day/Night Clustering

After the exhaustive 3D reconstruction of all landmarks in the database, we proceed with clustering the images inside each of the 3D models into two groups: day- and nighttime. For crowd-sourced data, the clustering cannot simply rely on embedded EXIF time stamps. In our experiments, the majority of images either have no time stamp information at all or the information is clearly corrupt. We speculate that most images are taken on vacation and people do not adjust the time zone in their cameras. For most landmarks with many registered images, day- and nighttime images are registered into the same model as a result of the extension step (see Section 4). It is well known that standard feature (keypoint) detectors suffer from illumination sensitivity [25], i.e., the reliability of keypoint detectors degrades significantly when the images originate from outdoor scenes at different times of the day or generally under different illumination conditions. In this case, the detectors commonly produce keypoints at different locations for day and night lighting conditions [25]. This holds true even when the images are taken from the same viewpoint. Our key insight is to exploit this behavior in order to split the images inside an SfM model into two groups. Our clustering is based on the number of commonly observed 3D points for each pair of images with similar viewpoints. This enables us to identify day and night images registered within a model. For efficient grouping, we leverage a bipartite visibility graph [13], as explained in the following sections.

5.1. Min-cut on Bipartite Visibility Graph

A 3D model produced by SfM can be interpreted as a bipartite visibility graph G = (I ∪ P, E) [13], where the images i ∈ I and the points p ∈ P are the vertices of the graph. The edges of the graph are then defined by the visibility relations between cameras and points, i.e., if a point p is visible in an image i, then there exists an edge (i, p) ∈ E. We define the set of points observed by an image i as:

P(i) = {p ∈ P | (i, p) ∈ E}.   (1)
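As a concrete illustration, the bipartite visibility graph can be stored as two adjacency maps, one per side of the bipartition. This is a minimal sketch assuming the SfM output lists, for each 3D point, the IDs of the images observing it; the function and variable names are ours, not from the paper.

```python
from collections import defaultdict

def build_visibility_graph(point_observations):
    """point_observations: dict point_id -> iterable of image IDs."""
    points_in_image = defaultdict(set)   # P(i): points observed by image i
    images_of_point = defaultdict(set)   # images in which point p is visible
    for p, image_ids in point_observations.items():
        for i in image_ids:
            points_in_image[i].add(p)    # edge (i, p) in E
            images_of_point[p].add(i)
    return points_in_image, images_of_point
```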

Our day/night clustering separates the vertices of the graph (the cameras and points) into two groups: one corresponding to day cameras and points, and the second to night cameras and points.

Figure 4. Colosseum, Rome. Two feature tracks containing both day and night images/features. Each row depicts two images labeled as day and night, respectively, followed by a subset of feature patches depicted in two rows, one for day and one for night features, respectively. Intensity-normalized patches, the grayscale versions used for SIFT description, are shown to the right of the respective color patches. Notice the variation in lighting conditions for day and night, expressed as a significant color difference between patches. Best viewed in color.

More formally, we define two label vectors representing the group assignment, α_I for the images and α_P for the points:

α_I = {α_i ∈ {0, 1} | i ∈ I},   α_P = {α_p ∈ {0, 1} | p ∈ P},   (2)

where the label variables α_i and α_p correspond to image i and point p, and label α_i, α_p = 0 denotes day and α_i, α_p = 1 night. We formulate the problem of separating day from night images as an energy optimization. We propose the following energy function E over the graph G that measures the quality of a labeling (α_I, α_P):

E(α_I, α_P, G) = Σ_{i ∈ I} U_i(α_i) + Σ_{(i,p) ∈ E} P_{i,p}(α_i, α_p).   (3)

The term P_{i,p}(α_i, α_p) describes the pairwise potentials associated with the edges, enforcing a smooth labeling of the cameras and points with respect to their mutually observed scene information. A standard Potts model is used for the pairwise potentials, that is, P_{i,p}(α_i, α_p) = 0 for α_i = α_p and P_{i,p}(α_i, α_p) = 1 otherwise. The 3D points incur no unary cost for being assigned either label. The unary cost U_i(α_i) for images is based on the day/night illumination model discussed below. The clustering of all images and points in a model is achieved by minimizing the objective

(α_I, α_P) = argmin_{α_I, α_P} E(α_I, α_P, G)   (4)

using the min-cut/max-flow algorithm of Boykov et al. [3]. Figure 4 shows examples of 3D point tracks that contain both day and night labels.

5.2. Day/Night Illumination Model

We use a day/night illumination model to estimate the likelihood of an image being taken during day or night, respectively. As a feature for the prediction, a spatial color histogram in the opponent color space [9] is used:

I = (R + G + B)/3,
O1 = (R + G − 2B)/4 + 0.5,
O2 = (R − 2G + B)/4 + 0.5.   (5)

Figure 5. Colosseum, Rome. Examples of the image color histogram description area. Coordinates of features reconstructed as 3D points define the bounding boxes used to compute three histograms. Using the 3D model information, we successfully segment out confusing background and are able to focus the description on the three important parts: the sky, and the upper and lower parts of the reconstructed landmark.

To reduce the influence of occlusions and background clutter, a three-band spatial histogram is computed over a region of the image directly related to the reconstructed object, as depicted in Figure 5. The bottom two stripes of the histogram equally split the bounding box of feature points that have been reconstructed as 3D points in the model. The top band covers the sky area above the landmark, up to the top edge of the image.

The color is uniformly quantized, and each spatial band of the histogram is separately normalized by the number of pixels per region. The final illumination descriptor is obtained by concatenating the color histograms of the three spatial bands. In our experiments, we use n = 4 bins per color channel, resulting in an image descriptor of dimensionality D = 3n³ = 192.
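The descriptor computation maps directly to a few lines of NumPy. Below is a minimal sketch assuming an RGB image with values in [0, 1] and a bounding box (x1, x2, y1, y2) around the reconstructed feature points; the function names, and the choice to crop the sky band to the same horizontal extent as the box, are our assumptions.

```python
import numpy as np

def opponent_space(rgb):
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    I  = (R + G + B) / 3.0
    O1 = (R + G - 2.0 * B) / 4.0 + 0.5
    O2 = (R - 2.0 * G + B) / 4.0 + 0.5
    return np.stack([I, O1, O2], axis=-1)   # Eq. (5); all channels in [0, 1]

def band_histogram(band, n=4):
    # Joint n^3-bin histogram over the three channels, normalized by the
    # number of pixels in the band.
    idx = np.clip((band * n).astype(int), 0, n - 1)
    flat = (idx[..., 0] * n + idx[..., 1]) * n + idx[..., 2]
    hist = np.bincount(flat.ravel(), minlength=n ** 3).astype(float)
    return hist / max(flat.size, 1)

def illumination_descriptor(rgb, box, n=4):
    x1, x2, y1, y2 = box                 # bounding box of reconstructed features
    opp = opponent_space(rgb)
    ym = (y1 + y2) // 2
    bands = [opp[0:y1, x1:x2],           # sky band above the landmark
             opp[y1:ym, x1:x2],          # upper stripe of the bounding box
             opp[ym:y2, x1:x2]]          # lower stripe of the bounding box
    return np.concatenate([band_histogram(b, n) for b in bands])  # D = 3n^3
```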

To classify the illumination descriptors into daytime and nighttime, a linear SVM [2] is trained on ground-truth-labeled images of our largest model (Colosseum, Rome). The same trained SVM is used to compute the unary terms for each image i in all reconstructed models:

U_i(α_i) = { 0                        if α_i = SVM_p(i),
           { c · SVM_s(i) · |P(i)|    otherwise,   (6)

where SVM_p(i) and SVM_s(i) denote the SVM's label prediction and the absolute value of its prediction score for image i, respectively. The constant c > 0 scales the confidence placed in the trained SVM, with higher values expressing higher confidence; in our experiments we set c = 1. The cardinality of the set of observed points P(i) equals the number of edges that connect image i to 3D points in the visibility graph.

The label of image i is thus decided by the labels of its observed points (pairwise term) and by the confidence of the linear SVM prediction (unary term). In order for this process to be fair for all images, we multiply the SVM score by the number of observed points |P(i)| in the final unary term. This factor determines the fraction of observed points that must carry a different label in order to overturn the SVM prediction for the image.
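Putting Eqs. (3)-(6) together, the labeling can be computed with any s-t min-cut solver. The following is a minimal sketch using networkx rather than the Boykov-Kolmogorov implementation [3] used in the paper; the input format, node naming, and function names are our assumptions.

```python
import networkx as nx

def daynight_mincut(images, c=1.0):
    """images: dict image_id -> (svm_label, svm_score, observed point IDs),
    with svm_label 0 = day and 1 = night. Returns the set of day image IDs."""
    G = nx.DiGraph()
    S, T = 'DAY', 'NIGHT'                  # source = label 0, sink = label 1
    for i, (label, score, points) in images.items():
        img = ('img', i)
        unary = c * score * len(points)    # Eq. (6): c * SVM_s(i) * |P(i)|
        # Cutting the terminal edge on the SVM's predicted side costs
        # `unary`; taking the predicted label itself is free.
        if label == 0:
            G.add_edge(S, img, capacity=unary)
        else:
            G.add_edge(img, T, capacity=unary)
        for p in points:                   # Potts pairwise terms of weight 1
            G.add_edge(img, ('pt', p), capacity=1.0)
            G.add_edge(('pt', p), img, capacity=1.0)
    _, (day_side, _night_side) = nx.minimum_cut(G, S, T)
    return {n[1] for n in day_side if isinstance(n, tuple) and n[0] == 'img'}
```

Points carry no terminal edges, so they simply take the label of the side on which the cut leaves them, exactly as in Eq. (3).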

6. Day/Night Modeling

After obtaining the image clustering, we first reconstruct the separate models and then combine them into a joint model to produce consistent geometry and texture within each modality, as detailed in this section. Typically, there is an uneven distribution of day and night images, causing one of the modalities to have lower scene coverage. In addition, the different illumination conditions during day and night allow for the reconstruction of details that are clearly visible during the day but not at night, and vice versa. For example, many landmarks are lit during the night, and a reconstruction of fine details is oftentimes possible from night images while during the day those structures are hidden in shadows. Hence, in the second step, we fuse the geometry of the two models in order to obtain better completeness in terms of scene coverage and reconstruction of fine details. To obtain consistent color for the fused model, we re-color the structure of the respective other modality through repainting of visible structure and inpainting of structure not covered by images. The following sections describe our proposed approach in detail.

6.1. Dense Reconstruction

For dense reconstruction, we first separate the sparse model into its day and night modalities based on the labels α_I and α_P. For most models, there are enough images during day and night to allow for dense reconstruction in both modalities. We split the graph G into two disjoint sub-graphs: G_d for the day modality and G_n for the night modality. We separate the tracks of points that are visible in both day and night images. The two graphs serve as the input to the dense reconstruction system by Furukawa and Ponce [7, 8]. Separate reconstruction of day and night images removes many of the disturbing artifacts present when using all images in a model (see Figure 6). To mitigate reconstruction artifacts caused by sky regions, we create segmentation masks using an improved version of the approach proposed by Ji et al. [11]. In distinction to their approach, we leverage the sparse point cloud as an additional clue for deciding whether parts of the image belong to the sky or not.

[Figure 6 panels: Standard Dense; Night-only Dense; Day-only Dense; Day-Night Fused; Night-Day Fused]

Figure 6. Moulin Rouge, Paris. Standard dense modeling using day and night images creates disturbing artifacts, while separate modeling of day and night images produces consistent geometry and coloring. Fusion and recoloring improve completeness, appearance, and accuracy.

The outputs of this step are separate models for day and night. In the next section, we describe an approach that fuses the two models and leverages the benefits of the respective other modality for increased model completeness and detail reconstruction.

6.2. Fusion

Typically, the scene coverage of day and night models is very different due to a multitude of reasons. First, parts of the scene may not be covered by any images in one of the modalities, e.g., caused by occlusion or lack of images. In addition, we found that for some scenes, parts of the reconstruction are not covered by any images during the night due to restricted access to those areas, e.g., the inside of the Colosseum. A second reason for different scene coverage is the different illumination conditions causing dynamic range issues for the cameras that often prevent reliable reconstruction of scene parts, even though they are theoretically visible. Especially in night images, parts are often under-illuminated or lack any illumination at all. A similar issue exists for day images as well, e.g., shadows caused by intense sunlight often prevent reconstruction of structure. One such case is depicted in Figure 6. Using the default parameters, the dense reconstruction method by Furukawa and Ponce [8] is very conservative in terms of creating dense points, i.e., 3D structure only appears in high-confidence areas. Therefore, geometric fusion of the two models enables the use of structure that is more accurately reconstructed from day or night images. As a first step, we align the two models into the same reference frame using the correspondences from points that appear in both night and day images. However, such fused models contain both day and night points and thus suffer from inconsistent coloring. In the following paragraphs, we describe a joint repainting and inpainting procedure to color the fused day points in the night model and vice versa (see Figure 7). For simplicity, we explain the procedure for the case of coloring the fused point cloud using the night images, but the approach is analogous in the opposite direction.

Repainting As explained in the previous paragraph, many dense points are reconstructed in the day model but not in the night model, even though they are covered by night images. We project these points into all night images and determine their color as the median of all projections. For occlusion handling, we enforce depth consistency with the sparse point cloud: the depth of a dense point must lie between the 10th and 90th percentiles of the depth range of the sparse points observed by an image. While this cannot account for fine-grained occlusions, in our experiments the extracted colors are not affected by occluded observations due to the robust averaging of colors.
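A minimal sketch of this repainting rule follows. The projection helper and data containers are hypothetical stand-ins for the real SfM data structures; only the median aggregation and the 10th/90th-percentile depth test come from the paper.

```python
import numpy as np

def repaint_point(X, night_images):
    """X: 3D point (3,); night_images: list of (camera, rgb, sparse_depths)."""
    samples = []
    for cam, rgb, sparse_depths in night_images:
        uv, depth = cam.project(X)   # hypothetical helper: uv is None if the
        if uv is None:               # point is behind the camera or off-image
            continue
        lo, hi = np.percentile(sparse_depths, [10, 90])
        if not (lo <= depth <= hi):  # coarse occlusion test against the
            continue                 # sparse depth range of this image
        u, v = int(round(uv[0])), int(round(uv[1]))
        samples.append(rgb[v, u])
    if not samples:
        return None                  # left for the inpainting step
    return np.median(np.asarray(samples), axis=0)  # robust per-channel median
```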

Inpainting For those points that are not visible in any night image, we propose a novel inpainting method. The method learns the appearance mapping between known corresponding day and night patches to predict the color of unseen points. To establish dense correspondences between day and night patches, we first project all points into the day and night images. Any point that projects into both day and night images defines a correspondence that we use to infer the appearance of a day point during the night. Each of the correspondences usually projects into multiple day and night images. For each correspondence between day and night images, an average color histogram is extracted from a 5 × 5 patch around the projected image location. While we tried to incorporate shape information in the descriptors, we found color histograms to be sufficiently distinctive features and best performing for the task of inpainting. Using these histograms as input, we train a nearest-neighbor regressor to map from day patches to night patches.

Figure 7. Pantheon, Rome. Example of repainting, inpainting, and blending for a building facade that is not present in the original night reconstruction. (Panel labels: Night Image, Repainted, Inpainted + Blended, with the Inpainting and Blending steps indicated.)

To inpaint the color of points that project only into day images, we extract the average day color histogram for the point and use our trained regressor to predict its most likely appearance during the night. This inpainting method enables us to obtain a night model that is as complete as the day model. In all our experiments, we use N = 20 nearest neighbors for the regression and D = 96-dimensional histograms as the appearance descriptor.
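The regression step can be sketched with scikit-learn's KNeighborsRegressor. For brevity, this sketch regresses from a day patch histogram directly to a night color rather than to a full night patch; the array layouts and function names are our assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def fit_day_to_night(day_hists, night_colors, n_neighbors=20):
    """day_hists: (M, 96) histograms; night_colors: (M, 3) observed colors,
    both from points seen in day and night images."""
    reg = KNeighborsRegressor(n_neighbors=n_neighbors)
    reg.fit(day_hists, night_colors)   # known day/night correspondences
    return reg

def inpaint_night_colors(reg, day_only_hists):
    # Predict a plausible night color for points seen only in day images.
    return reg.predict(np.asarray(day_only_hists))
```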

Blending Even though we use a robust average in the repainting step, low-coverage points sometimes suffer from abrupt changes in appearance in 3D space wherever the field of view of one image ends. To counteract this artifact, we propose to blend these points by predicting their appearance using the same mapping as in the inpainting step. We improve the color of any point with a track length t < t_min. The originally repainted color is then blended with the inpainted color based on the track length of the point. The blended color of a point is calculated as

c_bl = ((t_min − t)/t_min) · c_inp + (t/t_min) · c_rep,   (7)

where c_inp and c_rep denote the inpainted and repainted colors, respectively. In all experiments, we set t_min = 10.
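Eq. (7) is a simple linear blend; a direct transcription follows (the function name is ours).

```python
def blend_color(c_rep, c_inp, t, t_min=10):
    """Blend repainted and inpainted colors for a point with track length t."""
    if t >= t_min:
        return c_rep               # well-observed points keep their median color
    w = t / t_min
    return (1.0 - w) * c_inp + w * c_rep   # Eq. (7)
```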

7. Results

Having described our novel approach for day/night modeling, we now evaluate the method on the entire 7.4M-image database and present results for a variety of scenes. Our experiments demonstrate that the proposed algorithm robustly generalizes to different illumination conditions.

Reconstruction The iterative reconstruction process for the database of 7.4 million images converges in 3 iterations for all clusters in the database and takes around one week on a single desktop machine. We produce day- and nighttime models for any reconstructed cluster that has a sufficient number of registered images, i.e., at least 30 day and 30 night images. We find 1,474 such models out of the initial set of 19,546 clusters used to seed the reconstruction pipeline. These models contain 239,717 unique registered images across 845 disjoint landmarks. The average ratio of day- to nighttime images in the reconstructions is 9:1.

Clustering To evaluate our clustering approach, we hand-labeled 13,931 images of 6 different landmarks present in the dataset using the two classes of labels "day" and "night" (see Table 1). For the sake of comparison, we also introduce a baseline method that clusters images into day- and nighttime by k-means clustering with two clusters on the HSV color histograms of the images. Our clustering approach achieves almost perfect classification for the day and night images, even in the challenging case with only a few night images. We outperform k-means on all landmarks and, most importantly, we classify night images very accurately, which is crucial for avoiding artifacts in day/night modeling. This is even more notable considering that night images are significantly outnumbered in most of the models.

Figure 8. Example reconstructions produced by our method for St. Peter's Basilica in the Vatican, the Colosseum in Rome, the Astronomical Clock in Prague, the Altare della Patria in Rome, and the Pantheon in Rome. (Columns: Day Image, Day Model, Night Model, Fused Night Model, Night Image.)

Geometric Fusion Figure 8 impressively demonstrates the improved completeness and accuracy of night models achieved by the geometric fusion. In addition, Figure 6 depicts an example of the opposite direction, where the structure of the day model is improved through the night model. We encourage readers to view the supplementary material for additional examples and videos.

Color Fusion Figure 7 demonstrates the proposed repainting, inpainting, and blending method applied to a building facade in a low-coverage part of the Pantheon reconstruction. The structure is not reconstructed in the original night model (Figure 8). Hence, the entire structure consists of repainted points from the day reconstruction. In addition, our method effectively inpaints structure that is not visible in any night image and removes artifacts through blending.

                                    Ours           Baseline
Landmark               # D    # N    TP     FP      TP     FP
Spanish Steps         1030     92  98.91   3.26   93.48  14.13
Moulin Rouge           880    754  87.00   0.93   85.81   1.33
Castel Sant'Angelo    1400    129  99.22   6.20   93.02   6.98
Astronomical Clock    2243   1375  97.89   5.60   80.15   2.98
Altare d. Patria      1993    357  97.76   2.52   92.72   4.20
St. Peter's Basilica  1980    495  98.99   2.22   87.47   6.46

Table 1. Quantitative evaluation of clustering accuracy for night images; TP and FP rates are given in %. Ground-truth labels were obtained through manual labeling.

8. Conclusions

We introduced a novel algorithm that handles and benefits from the variety of scene illuminations naturally present in large-scale Internet photo collections. This is in stark contrast to previous methods, which treated multiple illuminations as a nuisance or failure condition. We exploit the additional information to obtain a more complete and accurate 3D model and to create multi-illumination appearance information for the 3D model. The proposed method demonstrates that we can leverage the additional information provided by the different illuminations to boost modeling quality for both geometry and appearance.

References

[1] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S. Seitz, and R. Szeliski. Building Rome in a Day. Communications of the ACM, 2011.
[2] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[3] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE PAMI, 2004.
[4] O. Chum and J. Matas. Large-scale discovery of spatially related images. IEEE PAMI, 2010.
[5] O. Chum, J. Philbin, J. Sivic, M. Isard, and A. Zisserman. Total recall: Automatic query expansion with a generative feature model for object retrieval. In ICCV, 2007.
[6] J.-M. Frahm, P. Fite-Georgel, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y.-H. Jen, E. Dunn, B. Clipp, S. Lazebnik, and M. Pollefeys. Building Rome on a Cloudless Day. In ECCV, 2010.
[7] Y. Furukawa, B. Curless, S. M. Seitz, and R. Szeliski. Towards internet-scale multi-view stereo. In CVPR, 2010.
[8] Y. Furukawa and J. Ponce. Accurate, dense, and robust multiview stereopsis. IEEE PAMI, 2010.
[9] J.-M. Geusebroek, R. van den Boomgaard, A. W. Smeulders, and H. Geerts. Color invariance. IEEE PAMI, 2001.
[10] J. Heinly, J. L. Schönberger, E. Dunn, and J.-M. Frahm. Reconstructing the World* in Six Days *(As Captured by the Yahoo 100 Million Image Dataset). In CVPR, 2015.
[11] D. Ji, E. Dunn, and J.-M. Frahm. Synthesizing Illumination Mosaics from Internet Photo-Collections. In ICCV, 2015.
[12] X. Li, C. Wu, C. Zach, S. Lazebnik, and J.-M. Frahm. Modeling and recognition of landmark image collections using iconic scene graphs. In ECCV, 2008.
[13] Y. Li, N. Snavely, and D. P. Huttenlocher. Location recognition using prioritized feature matching. In ECCV, 2010.
[14] Y. Lou, N. Snavely, and J. Gehrke. MatchMiner: Efficient Spanning Structure Mining in Large Image Collections. In ECCV, 2012.
[15] R. Martin-Brualla, D. Gallup, and S. M. Seitz. Time-lapse mining from internet photos. In SIGGRAPH, 2015.
[16] K. Matzen and N. Snavely. Scene chronology. In ECCV, 2014.
[17] A. Mikulik, F. Radenović, O. Chum, and J. Matas. Efficient image detail mining. In ACCV, 2014.
[18] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In CVPR, 2007.
[19] G. Schindler and F. Dellaert. Probabilistic temporal inference on reconstructed 3D scenes. In CVPR, 2010.
[20] J. L. Schönberger and J.-M. Frahm. Structure-from-motion revisited. In CVPR, 2016.
[21] J. L. Schönberger, F. Radenović, O. Chum, and J.-M. Frahm. From Single Image Query to Detailed 3D Reconstruction. In CVPR, 2015.
[22] N. Snavely. Scene reconstruction and visualization from internet photo collections. PhD thesis, 2008.
[23] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3D. In SIGGRAPH, 2006.
[24] N. Snavely, S. M. Seitz, and R. Szeliski. Modeling the world from internet photo collections. IJCV, 2007.
[25] Y. Verdie, K. M. Yi, P. Fua, and V. Lepetit. TILDE: A Temporally Invariant Learned DEtector. In CVPR, 2015.