Camera Elevation Estimation from a Single Mountain...

1
Camera Elevation Estimation from a Single Mountain Landscape Photograph Martin ˇ Cadík 1 cadik@fit.vutbr.cz Jan Vašíˇ cek 1 xvasic21@stud.fit.vutbr.cz Michal Hradiš 1 ihradis@fit.vutbr.cz Filip Radenovi´ c 2 [email protected] Ondˇ rej Chum 2 [email protected] 1 CPhoto@FIT, http://cphoto.fit.vutbr.cz/elevation/, Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic 2 Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University in Prague, Prague, Czech Republic In outdoor environments one of the most important and informative at- tributes is the elevation: the height of a geographic location above the sea level. However, almost all currently available photos and videos lack ele- vation information. Moreover, a majority of them do not even contain the GPS coordinates. This work addresses the problem of camera elevation estimation from visual information contained in a landscape photograph. We introduce a new Alps100K dataset of annotated (GPS coordi- nates, elevation, EXIF if available) outdoor images from mountain en- vironments. We create a list of all hills and mountain peaks located in the seven Alpine countries from the OpenStreetMap database. The list of hill names is used to query the Flickr photo hosting service. The final collection contains 98136 outdoor images that span almost all possible elevations observed in the Alps [0, 4782m] and covers vast geographic area of the Alps. The images span all the seasons of the year and exhibit high variation in landscape appearance, see Fig. 1. 217.5m 43.7644N, 5.3622E 1798m 46.4384N, 13.6351E 2504.5m 45.9449N, 7.6527E 3503m 45.8113N, 6.7909E 4782m 45.8336N, 6.8616E Figure 1: A sample from the new benchmark dataset Alps100K. Im- age credits - flickr users: Allie_Caulfield, Erik, Guillaume Baviere, an- toine.pardigon, Karim von Orelli. To measure the ability of humans to estimate camera elevation from an image, an experiment with 100 subjects is conducted. The partici- pants are asked to estimate the camera elevation for each of the 50 test images using a web-based interface. The overall root-mean-squared error of human elevation predictions is RMSE(H)= 879.95m. The predictions for each test image along with the ground truths are plotted in Fig. 2. elevation 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 Ground Truth Human BOW CNN BOW+CNN index to the 50 test images Figure 2: Comparison of the elevation estimation by humans and the pro- posed methods. Blue boxes show the span of the human predictions: the red mark is the median, the edges of the box are the 25 th and 75 th per- centiles respectively, the whiskers extend to extreme human guesses that are not considered outliers, and outliers are plotted individually as ’+’. First proposed automatic approach to elevation estimation from im- age content is based on convolutional neural networks (CNN). We ini- tialize CNNs from a network previously trained on the Places205 dataset [2]. The basic network architecture is the same as the one used in the Caffe reference network and the Places-CNN network [2]. The main building blocks of the network are convolutions followed by Rectified Linear Units. First, second, and fifth convolutional layers are followed by max-pooling, each reducing resolution by a factor of two. The activa- tions of the first and second convolutional layers are locally normalized. The output of the convolutional part of the network is fed into a fully connected layer with 4096 units. The network is trained by mini-batch Stochastic Gradient Descent with momentum. 1.5 5.9 13.2 23.5 36.7 Figure 3: Left: normalized elevation histogram of the Alps mountain range (red) and the distribution of elevations in the Alps100K dataset (green). Right: geographic coverage of the new dataset. An alternative bag-of-words approach is based on the k-NN classifier, two efficient methods of obtaining k-NN are considered: sparse high- dimensional bag-of-words (BOW) based image retrieval, and image re- trieval with compact image representations (mVocab) [1]. The BOW approach using inverted file to efficiently retrieve images has been shown to perform well for specific object and place recognition, especially when combined with a spatial verification step, while the mVocab approach using a joint dimensionality reduction from multiple vocabularies shows certain level of generalization power. Therefore, we also propose a hy- brid method that first tries to estimate the elevation by recognizing the location (using BOW), and if that fails, i.e., no spatially verified image is retrieved, then by a secondary estimator: either mVocab or CNN. A test set of 13148 images is randomly selected from Alps100K, the rest of the dataset (i.e., 84988 images) is used for training. The selected measure of performance is an overall RMSE of elevation predictions with regards to the known ground truth elevations, see Tab. 1. The results for a subset of 50 images selected from the test dataset (used in user experi- ment), which compares the performance of automatic elevation estimation to the performance of humans, are shown in Fig. 2 and Tab. 1. Generally, all of the proposed methods achieve better scores than humans. Table 1: Results (overall root-mean-square error in meters) Method test dataset (13148 images) user experiment (50 images) Baseline 801.49; 786.42 1383.64; 1154.43 Human - 879.95 CNN 537.11 709.10 BOW 601.63 757.76 mVocab 610.36 811.00 BOW+mVocab 564.14 646.89 BOW+CNN 500.44 531.05 [1] F. Radenovi´ c, H. Jégou, and O. Chum. Multiple measurements and joint dimensionality reduction for large scale image search with short vectors. In Proc. ICMR. ACM, 2015. [2] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning Deep Features for Scene Recognition using Places Database. NIPS, 2014.

Transcript of Camera Elevation Estimation from a Single Mountain...

Page 1: Camera Elevation Estimation from a Single Mountain ...cphoto.fit.vutbr.cz/elevation/cadik15elevation_abstract.pdf · tributes is the elevation: the height of a geographic location

Camera Elevation Estimation from a Single Mountain Landscape Photograph

Martin Cadík1

[email protected]

Jan Vašícek1

[email protected]

Michal Hradiš1

[email protected]

Filip Radenovic2

[email protected]

Ondrej Chum2

[email protected]

1 CPhoto@FIT, http://cphoto.fit.vutbr.cz/elevation/,Faculty of Information Technology,Brno University of Technology,Brno, Czech Republic

2 Center for Machine Perception,Department of Cybernetics,Faculty of Electrical Engineering,Czech Technical University in Prague,Prague, Czech Republic

In outdoor environments one of the most important and informative at-tributes is the elevation: the height of a geographic location above the sealevel. However, almost all currently available photos and videos lack ele-vation information. Moreover, a majority of them do not even contain theGPS coordinates. This work addresses the problem of camera elevationestimation from visual information contained in a landscape photograph.

We introduce a new Alps100K dataset of annotated (GPS coordi-nates, elevation, EXIF if available) outdoor images from mountain en-vironments. We create a list of all hills and mountain peaks located inthe seven Alpine countries from the OpenStreetMap database. The listof hill names is used to query the Flickr photo hosting service. The finalcollection contains 98136 outdoor images that span almost all possibleelevations observed in the Alps [0, 4782m] and covers vast geographicarea of the Alps. The images span all the seasons of the year and exhibithigh variation in landscape appearance, see Fig. 1.

217.5m43.7644N, 5.3622E

1798m46.4384N, 13.6351E

2504.5m45.9449N, 7.6527E

3503m45.8113N, 6.7909E

4782m45.8336N, 6.8616E

Figure 1: A sample from the new benchmark dataset Alps100K. Im-age credits - flickr users: Allie_Caulfield, Erik, Guillaume Baviere, an-toine.pardigon, Karim von Orelli.

To measure the ability of humans to estimate camera elevation froman image, an experiment with 100 subjects is conducted. The partici-pants are asked to estimate the camera elevation for each of the 50 testimages using a web-based interface. The overall root-mean-squared errorof human elevation predictions is RMSE(H) = 879.95m. The predictionsfor each test image along with the ground truths are plotted in Fig. 2.

elev

atio

n

0

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

1 2 3 4 5 6 7 8 9 10 1112 1314 1516 1718 1920 2122 2324 2526 27 2829 3031 3233 3435 3637 3839 4041 4243 4445 46 4748 4950

Ground TruthHumanBOWCNNBOW+CNN

index to the 50 test images

Figure 2: Comparison of the elevation estimation by humans and the pro-posed methods. Blue boxes show the span of the human predictions: thered mark is the median, the edges of the box are the 25th and 75th per-centiles respectively, the whiskers extend to extreme human guesses thatare not considered outliers, and outliers are plotted individually as ’+’.

First proposed automatic approach to elevation estimation from im-age content is based on convolutional neural networks (CNN). We ini-tialize CNNs from a network previously trained on the Places205 dataset[2]. The basic network architecture is the same as the one used in theCaffe reference network and the Places-CNN network [2]. The main

building blocks of the network are convolutions followed by RectifiedLinear Units. First, second, and fifth convolutional layers are followedby max-pooling, each reducing resolution by a factor of two. The activa-tions of the first and second convolutional layers are locally normalized.The output of the convolutional part of the network is fed into a fullyconnected layer with 4096 units. The network is trained by mini-batchStochastic Gradient Descent with momentum.

1.5

5.9

13.2

23.5

36.7

Figure 3: Left: normalized elevation histogram of the Alps mountainrange (red) and the distribution of elevations in the Alps100K dataset(green). Right: geographic coverage of the new dataset.

An alternative bag-of-words approach is based on the k-NN classifier,two efficient methods of obtaining k-NN are considered: sparse high-dimensional bag-of-words (BOW) based image retrieval, and image re-trieval with compact image representations (mVocab) [1]. The BOWapproach using inverted file to efficiently retrieve images has been shownto perform well for specific object and place recognition, especially whencombined with a spatial verification step, while the mVocab approachusing a joint dimensionality reduction from multiple vocabularies showscertain level of generalization power. Therefore, we also propose a hy-brid method that first tries to estimate the elevation by recognizing thelocation (using BOW), and if that fails, i.e., no spatially verified image isretrieved, then by a secondary estimator: either mVocab or CNN.

A test set of 13148 images is randomly selected from Alps100K, therest of the dataset (i.e., 84988 images) is used for training. The selectedmeasure of performance is an overall RMSE of elevation predictions withregards to the known ground truth elevations, see Tab. 1. The results fora subset of 50 images selected from the test dataset (used in user experi-ment), which compares the performance of automatic elevation estimationto the performance of humans, are shown in Fig. 2 and Tab. 1. Generally,all of the proposed methods achieve better scores than humans.

Table 1: Results (overall root-mean-square error in meters)Method test dataset (13148 images) user experiment (50 images)Baseline 801.49; 786.42 1383.64; 1154.43Human - 879.95CNN 537.11 709.10BOW 601.63 757.76mVocab 610.36 811.00BOW+mVocab 564.14 646.89BOW+CNN 500.44 531.05

[1] F. Radenovic, H. Jégou, and O. Chum. Multiple measurements andjoint dimensionality reduction for large scale image search with shortvectors. In Proc. ICMR. ACM, 2015.

[2] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. LearningDeep Features for Scene Recognition using Places Database. NIPS,2014.