Research Article Dual-Layer Density Estimation for...

Research ArticleDual-Layer Density Estimation for Multiple ObjectInstance Detection

Qiang Zhang123 Daokui Qu123 Fang Xu123 Kai Jia123 and Xueying Sun24

1State Key Laboratory of Robotics Shenyang Institute of Automation Chinese Academy of SciencesNo 114 Nanta Street Shenhe District Shenyang 110016 China2University of Chinese Academy of Sciences No 19A Yuquan Road Beijing 100049 China3SIASUN Robot amp Automation Co Ltd No 16 Jinhui Street Hunnan New District Shenyang 110168 China4Department of Information Service and Intelligent Control Chinese Academy of Sciences No 114 Nanta StreetShenhe District Shenyang 110016 China

Correspondence should be addressed to Qiang Zhang zhangqiangsiacn

Received 8 May 2016 Revised 19 July 2016 Accepted 1 August 2016

Academic Editor Luis Paya

Copyright copy 2016 Qiang Zhang et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

This paper introduces a dual-layer density estimation-based architecture for multiple object instance detection in robot inventorymanagement applications The approach consists of raw scale-invariant feature transform (SIFT) feature matching and key pointprojectionThe dominant scale ratio and a reference clustering threshold are estimated using the first layer of the density estimationA cascade of filters is applied after feature template reconstruction and refined feature matching to eliminate false matchesBefore the second layer of density estimation the adaptive threshold is finalized by multiplying an empirical coefficient for thereference value The coefficient is identified experimentally Adaptive threshold-based grid voting is applied to find all candidateobject instances Error detection is eliminated using final geometric verification in accordance with Random Sample Consensus(RANSAC) The detection results of the proposed approach are evaluated on a self-built dataset collected in a supermarket Theresults demonstrate that the approach provides high robustness and low latency for inventory management application

1 Introduction

With the development of robotics humanoid robots havebeen introduced in innumerable applications Among theavailable functionalities of the humanoid robot specificobject detection has attracted increasing attention in recentyears Inventory management autosorting and pick-and-place system are typical applications Unlike single-objectdetection multiple-instance detection is a more challengingtask In this paper we focus on the goal of multiple objectinstance detection for robot inventory management andpropose an effective approach to achieve this goal

Multiple object instance detection is a complex technol-ogy that encounters a variety of difficulties First diversitiesof species shapes colors and sizes of objects make it difficultto accomplish the fixed goal Moreover target objects appeardifferent in different environments For example changes inscale orientation and illumination increase uncertainty and

ambiguity for identification Additionally multiple instancescan affect the verification procedure

There are two representative types of techniques for objectinstance detection the training and learning-based approachand the template-based approach The latter approachincludes an extensive range of template forms such as edgeboxes [1] patches [2] and local features Local feature match-ing based object detection method has received considerableattention from researchers because of its notable advantagesin overcoming a portion of the deficiencies caused by scalerotation and illumination changes Scale-invariant featuretransform (SIFT) [3] was proposed by Lowe in 2004 and hasbeen widely applied in many situations due to its robustnessA new approach called PCA-SIFT [4] was proposed tosimplify the calculations and decrease storage space Themain concept of PCA-SIFT is dimension reduction In 2005Mikolajczyk and Schmid proposed the gradient location andorientation histogram (GLOH) [5] The GLOH is a SIFT-like

Hindawi Publishing CorporationJournal of SensorsVolume 2016 Article ID 6937852 12 pageshttpdxdoiorg10115520166937852

2 Journal of Sensors

descriptor that uses a log-polar transformation hierarchyrather than four quadrants The original high dimensionalityof its descriptor can be reduced using PCA In 2008 Bayet al developed a prominent method known as speeded uprobust features (SURF) [6] based on improvements in theconstruction of SIFT features In [5 7] the performances oflocal feature descriptors such as SIFT PCA-SIFTGLOH andSURF were compared According to [5 7] PCA-SIFT andSURF have advantages in terms of speed and illuminationchanges whereas SIFT and GLOH are invariant to rotationscale changes and affine transformations

Feature matching is a basic procedure in object detectionFeature matching is typically performed by comparing thesimilarity of two feature descriptors In fact raw matchesoften contain a large number of mistakes thus false matchelimination is necessaryThe classical approaches are the ratiotest [3] bidirectional matching algorithm [8] and RANSAC[9] In addition a remarkable method based on scale restric-tion [10 11] was proposed This method first estimates adominant scale ratio using statistics after prematchingThenfeatures are reextracted from the high-resolution image atan adjusted Gaussian smoothing parameter according to thedominant scale ratio After refined matching feature pairsthat do not conform to a certain scale ratio restriction arerejected This method is adopted in our work due to its highperformance In addition we provide a new approach to gain-ing access to the dominant scale ratio In 2010 Arandjelovicand Zisserman [12] considered that theHellinger kernel leadsto superior matching results compared to Euclidean distancein SIFT feature matching

Lin et al [13] used a key point coordinate clusteringmethod for duplicate object detection Regions of interestare detected using an adaptive window search Wu et al[14] reported an improved graph-based method to locateobject instances In [15] Collet et al proposed a scalableapproach known as MOPED The framework first clustersmatched features and generates hypothesis models Potentialinstances can be found after an iterative process for poserefinement However the key point coordinates obtainedfrom the clustering results in [13ndash15] might be unreliablebecause the key points are sparsely distributed Alternativelyapproaches based on Hough voting were proposed andapplied in [16ndash18] The Hough voting based approach locatespossible instances according to feature mapping and densityestimation Specifically themethod in [16] appliesmean-shiftin the voting step Similarly grid voting was adopted in [19]Although Hough voting is an effective approach for multipleobject instance detection the clustering radius formean-shiftor grid voting should be preset by experience which leads tolow adaptability and accuracy

In this paper we present a new architecture that improvesmultiple object instance detection accuracy by consideringthe adaptive selection of the optimal clustering threshold anda cascade of filters for false feature match elimination Thecontributions of our work are as follows

(i) We propose an architecture for multiple objectinstance detection based on dual-layer density esti-mationThe first layer calculates an optimal clustering

threshold for the second layer and applies a constraintfor the next scale restriction-based false match elim-ination The second layer aims to detect all candidateobject instances The proposed strategy can reducethe possibility of mismatch and improve detectionaccuracy Compared to traditional methods whichneed to set the threshold manually the proposedadaptive clustering threshold computation methodleads to stronger environmental flexibility and higherrobustness

(ii) We introduce a new method to compute and verifythe value of the dominant scale ratio between thetraining image and query image Rather than usinga histogram statistical method for matched featuresthe value is derived from the first layer of the densityestimationThen the value is tested by an approximateone which is obtained based on the homographymatrix According to our experiments the proposedmethod is more robust for dominant scale ratioestimation compared to the conventional methods

The remainder of this paper is organized as follows Sec-tion 2 describes the proposed architecture according to ourparticular application background Details of the proposedmethod are discussed in Section 3 A variety of experimentsare designed to evaluate our approach The experimentalmethodology results and discussions are presented in Sec-tion 4 Finally Section 5 summarizes our contributions andpresents conclusions

2 Framework Overview

In this section we provide an introduction to the backgroundof our work and briefly explain the proposed architecture

Our work develops a service robot for a supermarketThepurpose of the robot is to count the goods before the start ofbusiness and provide feedback to the staff to ensure adequatesupplies Because no standard database exists for our specificapplication we created a database for 70 types of man-madeproducts to evaluate our algorithm

The lighting conditions in the supermarket are generallyuniform and thus we collected training images for each itemunder same lighting conditions One image was obtainedfrom the front and another 24 were captured from 24different directionsThe frontal object image serves for objectrecognition and all 25 sequence images were used to build asparse 3D model for recovering pose of the identified objectAll training images were captured at the distance which isapproximately equal to the minimum safe distance betweenthe robot and shelves This sampling method can ensurethat the training image has more details To validate ourarchitecture the training database was divided into three setsbased on the density of textures The set with the highestdensity of textures contains 20 types of products the set withamediumdensity of textures has 30 types and the set with thelowest density of textures includes 20 types For each objectthere were 2 to 40 instances in the scene image

Our proposed method is based on local features whichcan provide information about scale and rotation and SIFT

Journal of Sensors 3

SURF and PCA-SIFT are three alternatives According to[5 7] SIFT has better performance in scale and rotationchange than SURF and PCA-SIFT thus SIFT is used inour work although it is time-consuming The proposedframework is based on SIFT feature extraction and featurematching by considering the specific application backgroundThe framework consists of two phases the offline trainingphase and the online detection phase A graphic illustrationof the proposed approach is shown in Figure 1 To make ouralgorithm more explicit we make selected arrangements inadvance First the term key point refers to a point with 2Dcoordinates and the point is detected by SIFT theory Theterm descriptor represents a 128-dimensional SIFT featurevector The term feature consists of a description vector andthe scale orientation and coordinate of the SIFT point

In the offline phase as shown in Figure 1(a) an initialvalue of the Gaussian smoothing parameter is given inadvance The SIFT features are extracted from the trainingimages for certain objects Reference vectors between all keypoints and the object center are computed to locate the objectcentroid All features are stored in a retrieval structure toreduce time overhead during detection On the other handwe created a sparse 3D model for each object with a standardStructure fromMotion algorithm [20] and each 3D point wasassociated with a corresponding SIFT descriptor

The online detection phase is a dual-layer density esti-mation-basedmethodThe first layer exists for two purposesto compute the dominant scale ratio between the trainingimage and query image (Figures 1(b)ndash1(e)) and to calculatea reference clustering threshold for the second layer ofdensity estimation (Figures 1(f)ndash1(i)) At the beginning offeature extraction for the query image an initial value ofthe Gaussian smoothing parameter is given the same asin the training phase All descriptors extracted from thevideo footage are matched to their nearest neighbors in thedatabase (Figure 1(b)) and the key points are projected totheir reference centers (Figure 1(c)) A valid object centerwith a maximum density value can be found using kerneldensity estimation (Figure 1(d)) Considering that objectinstances in our applications have nearly the same scale thedominant scale ratio and an effective clustering thresholdare computed accordingly (Figure 1(e)) The second layerof density estimation detects all possible instances Firstthe feature template is reconstructed based on the initialvalue of the base scale and the calculated dominant scaleratio (Figure 1(f)) The majority of false feature matches areremoved by a cascade of filters based on the distance ratio testand scale restriction (Figure 1(g)) The key point projectionand 2D clustering methods are applied to find all candidateobject centers (Figure 1(h)) The final geometric verificationprocedure can eliminate incorrect detection results anddetermine each instancersquos pose (Figure 1(i))

3 Description of the Proposed Method

In this section we introduce our work in detail in accordancewith the aforementioned architectureThe schematic diagramfor the offline training phase and the flowchart of the onlinedetection are shown in Figures 2 and 4 respectively

31 Offline Training Template Generation and RetrievalStructure Construction Indeed the proposedmethod can beapplied in conjunction with any scale and rotation invariantfeatures As is described in Section 2 SIFT is applied inour work for its robustness To create templates for all typesof object instances frontal images of the targets must becaptured As noted in Section 2 the light conditions in ourapplication are relatively invariant In addition we assumethat all object instances face front outward SIFT is ableto work properly under these conditions Thus we cancollect one frontal image for each type of product for objectrecognition Besides for the following object pose estimationa sparse 3D model for each object was created (as shownin Figure 3) and thus 24 other images were captured atapproximately equally spaced intervals in a circle around eachobject According to SIFT theory the Gaussian smoothingparameter should be given first Suppose that the initial valueis set to 120590TrainInit = 120590119900 In this work 120590

119900is a fixed value as is

described in Section 4 and the SIFT feature extraction takesplace

We assume that the number of features for a specificobject is 119899 Each SIFT feature descriptor is a 128-dimensionalvector 119891

119894 where 119894 = 1 2 119899 Similarly the scale of the

feature is 119904119894 the principle orientation is 120579

119894 and its coordinate

is 119888119894(119909119894 119910119894) Coordinate differences V

119894between each SIFT key

point 119888119894(119909119894 119910119894) and the related object centroid 119888

119900(119909119900 119910119900) are

calculated according to the following

V119894119900= [

Δ119909119894

Δ119910119894

119909119894

119910119894

] minus [

119909119900

119910119900

Featurematching is a subprocedure in ourmultiple objectinstance detection architecture The process is used to findthe most similar feature in the dataset based on a distancemeasurement In our work the Hellinger distance measure-ment is applied due to its robustness according to [12]Feature matching is typically a time-consuming process Theconstruction of an effective retrieval structure is necessaryfor speeding up the detection phase Two types of effectiveretrieval methods are currently available tree-basedmethodsand hashing-based methods The randomized kd-tree [2122] hierarchical 119896-means tree [21 22] and vocabulary tree[23] are typical representatives of tree-based methods Localsensitive hashing (LSH) [24 25] and SSH [26] are tworepresentative hashing-based methods In all of the feasiblemethods near-optimal hashing algorithms [27] have provento be highly efficient and accurate and this method waschosen for our work Construction of multiple independenttrees to form a forest is necessary to reduce the false negativeand false positive rates

32 Online Multiple Object Instances Detection

321 Feature Extraction for Query Image and Feature Match-ing During online detection the system first obtains accessto a new captured video frame SIFT key points are detectedand descriptors are extracted in the same manner as thefirst part of offline procedure The Gaussian smoothingparameter is also set as 120590Query = 120590119900 Then the near-optimal

Database

Offline

(f) (g) (h) (i)

(c) (d) (e)

Effective training imageDominant scale ratioClustering threshold

Figure 1 Overview of the proposed framework (a) offline phase for constructing the retrieval structure (b)ndash(e) first layer of densityestimation (b) local feature detection (c) feature matching and key point mapping (d) first layer of density estimation and (e) intermediateresults (f)ndash(i) second layer of density estimation (f) feature template reconstruction (g) false matching result elimination and (h) clusteringfor candidate instances detection (i) geometric verification

Key points detection and

descriptors extraction

Reference vectors calculation

Retrieval structure construction Database

Frontal object images

Feature descriptors ScalesOrientationsReference vectors Original training images

120590o

Figure 2 Offline training procedure

Figure 3 3D sparse model of packing box from 25 images

hashing algorithm takes effect During feature matching lowdiscriminable matches are discarded based on ratio test ofdistances between the nearest neighbor and second nearestneighbor which was proposed in [3]

322 Key Points Projection and Object Center EstimationTheprinciple of key point projection is illustrated in Figure 5In Figure 5 the left part is the training image and the rightpart is the query image Regarding the middle part the solid

region is a matched patch from the query image and the areaformedbydotted lines is assumed to be the ideal case inwhichthere is only similarity transform Assume that the matchingpair of features is 119891

119894and 119891119895 where 119891

119894is from the database and

119891119895is from the query image The key points corresponding to

these two features are 119901119894(119909119894 119910119894) and 1199011015840

119895(1199091015840

119895 1199101015840

119895) As for a plane

object the center 1198881015840119900119895(1199091015840

1199001198951199101015840

119900119895) related to 119891

119895 can be estimated

according to (2)ndash(5)In the formulas 1199041015840

119895and 1205791015840119895are the corresponding scale and

orientation of feature 1198911015840119895 Similarly 119904

119894and 120579

119894are related to

feature 119891119894in the training image For each pair of matching

features there is a normalized deflection angle 120576119895between the

normal vector of an object surface and camera optical axisfor each matched features According to (5) the estimatedcenters would be located in a small range of areas aroundthe real center when the training image is the exact imagecorresponding to the ordered object instance and 120576

119895has an

extremely small value

120579 = 1205791015840

119895minus 120579119894 (2)

As shown in Figure 5 reference centers are distributedin small areas Then the problem of determining the center

Result

false results eliminationObject level

Clustering based on Tr

Key points projection

False matches eliminationbased on sr

Feature matching

Feature extraction

Scale setting 120590 = sr times 120590o

Get access to the validtraining image

Query image acquisition

Feature extraction

Scale setting 120590 = 120590o

Database

Feature matching

Kernel densityestimation

Dominant scale ratio sr

clustering thresholdTr computation

and reference

Figure 4 Online detection flowchart

Training image Query image

Matched

Optic axis

features

p998400j

c998400oco120576j

Figure 5 Key points projection principle diagram

coordinates is converted into a density estimation problemThe first layer of density estimation aims to find one of thevalid centers in the query image Object center estimationis a crucial problem A two-stage procedure-based adaptivekernel density estimation method elaborated in [28] isemployed to improve the precision Only those density valuesassociatedwith themapped key points are calculated to speedup the process The point with the highest density value issaved Although this point may be not the exact center it isa typical approximationThus the mapped point is identifiedas a valid center Simultaneously the exact training image canbe obtained As is illustrated in Figure 6 the blue point is theobtained object center

1199091015840

119900119895

1199101015840

119900119895

1199091015840

119895

1199101015840

119895

1199041015840

119895

119904119894

cos 120579 minus sin 120579sin 120579 cos 120579

] times V119894times cos 120576

119895(3)

1199091015840

119895

1199101015840

119895

1199041015840

119895

119904119894

cos 120579 minus sin 120579

sin 120579 cos 120579]

times V119894

times (1 minus

1205762

119895

1205764

119895

minus sdot sdot sdot)

1199091015840

119900119895

1199101015840

119900119895

]⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟

RealCenter

1199041015840

119895

119904119894

] times V119894times (minus

1205762

119895

1205764

119895

minus sdot sdot sdot )

⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟⏟

DistributionRange

Columns

Figure 6 Reference clustering threshold calculation

323 Dominant Scale Ratio Estimation and Scale Restriction-Based False Matches Elimination The dominant scale ratioserves two purposes false match elimination and calculationof a reference clustering radius for the second layer of densityestimation In contrast to the conventional methods in [1011] the dominant scale ratio in our work can be derivedaccording to (6) based on the assumption that the estimatedcenter has a typical scale ratio value In (6) sr is the orientedscale ratio 1199041015840

119898is the scale of the key point related to the

estimated object center and 119904119899is the scale of the matched key

point in the training image

sr =1199041015840

119898

119904119899

Once the valid center is found the points that supportthe center are recordedThese points are used to calculate thehomography matrix119867

119900for the pattern The matrix is shown

in (7) Because the minimum safe distance between the robotand the shelves is far enough which means the camera onthe robot is far from the targets the actual homography issufficiently close to affine transformationThen the dominantscale ratio sr1015840 can also be computed according to (8)Then sr1015840is used to verify sr Only if the value of sr is approximate tosr1015840 the value of sr is confirmed to be correct We use (9) toassess the similarity between the two values

119867119900=[[

ℎ11ℎ12ℎ13

ℎ21ℎ22ℎ23

ℎ31ℎ32

sr1015840 = radic1003816100381610038161003816ℎ11times ℎ22

1003816100381610038161003816+1003816100381610038161003816ℎ12times ℎ21

1003816100381610038161003816 (8)

100381610038161003816100381610038161003816100381610038161003816

sr minus sr1015840

min (sr sr1015840)

100381610038161003816100381610038161003816100381610038161003816

lt 15 (9)

To find all possible object instances a SIFT feature-basedtemplate of the ordered object must be reconstructed (seeFigure 1(f))The Gaussian smoothing factor is to be set basedon the dominant scale ratio and is adjusted in accordancewith (10) A new retrieval structure is constructed after SIFTfeatures are detected Then features obtained from the queryimage above are matched to the new dataset Due to theaforementioned preprocessing the amount of SIFT featuresin the newly constructed database is reduced compared

to offline training phase Thus the time overhead of thematching process is greatly reduced

120590TrainAdjust = sr times 120590119900 (10)

The strategy of feature matching disambiguation hereis a cascade of filters These filters can be divided into theratio test algorithm (proposed in [3]) scale restriction-basedmethod (presented in [11]) and geometric verification-basedapproachThe ratio test and scale restrictionmethods use thefollowing matching process The geometric verification takeseffect after clustering After this series of filters most of falsematches can be eliminated

324 Reference Clustering Threshold Computation and Can-didate Object Instances Detection Traditional methods fordetecting multiple object instances such as mean-shift andgrid voting are based on density estimation However thesemethods have the same disadvantage that the bandwidthmust be given by experience For example in [16] the cluster-ing thresholdwas set to a specific value In [19] the voting gridsize was set to the value associated with the size of the queryimage Nevertheless this approachmay still lead to unreliableresults For our specific application occasion the clusteringthreshold can be estimated based on the size of trainingimage and the aforementioned dominant scale ratio Beforethe clustering threshold is finally determined a referenceclustering threshold should be computed automatically Herethe reference clustering threshold can be estimated based on(11) In the formula119879

119903is the reference clustering threshold sr

is the oriented scale ratio and rows and cols are the numbersof rows and columns in the training image respectivelyAs noted above the mapped key points are located insmall regions around real centroids Therefore the clusteringthreshold Th can be finalized in line with (12) in which 119896 isa correction factor According to our repeated experimentsdescribed in Section 4 we provide a recommended value for119896 Candidate object instance detection is based on the secondlayer of density estimation Grid voting is employed here dueto its high precision and recall

119879119903=

sr times rows if rows lt cols

sr times cols otherwise(11)

Th = 119896 times 119879119903 (12)

33 Object Level False Result Elimination In the procedurefor eliminating false detection results we first calculate thehomography matrix for each cluster Then four corners ofthe training image are projected onto four new coordinatesAs a result a convex quadrilateral in accordance with thefour mapped corners is produced Here we provide a simplebut effective way to assess whether the system has obtainedcorrect object instances and error detections are eliminatedThe criterion is as follows

119888min leArea (Quadrilateral)

sr2 times Area (TrainingImage)le 119888max (13)

(a) (b) (c)

Figure 7 Examples of objects with different texture levels (a) high texture (b) medium texture (c) low texture

In (13) Area(Quadrilateral) is the area of the convexquadrilateral derived from each candidate object instanceArea(TrainingImage) is the area of the training imageAccording to (13) if the detection is accurate the ratiocoefficient between the area of the quadrilateral and thetraining image is approximate to sr2 The threshold 119888min and119888max should be set before verification

Finally for each cluster the features are matched to the3D sparse model created in the offline training procedureA noniterative method called EPnp [29] was employed toestimate pose for each object instance

4 Experiments

41 Experimental Methodology We are developing a servicerobot for the detection and manipulation of multiple objectinstances and there is no standard database for our specificapplication To validate our approach we created a databasefor 70 types of products with different shapes colors andsizes in a supermarket Objects to be detected were placedon shelves with the front outside All images were capturedusing a SONYRGB cameraThe resolution of the camera was1240 times 780 pixels To comprehensively evaluate the accuracyof the proposed architecture the database was divided intothree sets according to the texture level of the objects Figure 7shows examples of objects with different texture levels

We designed three experiments to evaluate the proposedarchitecture The first experiment was to verify whether thescale ratio calculation and false eliminationmethod were fea-sible The second one was to examine whether the proposedclustering threshold computation method was effective Thelast experiment was to comprehensively evaluate the perfor-mance of the proposed architectureThese three experimentswere designed as follows

(i) Experiment I for each training image in the databasewe acquired an image considering that the objectinstance in the image had the same scale as thetraining image Then the captured images weredownsampled The size of the resampled imageswere 100 75 50 and 25 of the original sizeWe calculated the dominant scale ratios based onthe conventional histogram statistics and proposedmethod separately Then the accuracy of both valueswas compared The feature matching and key point

projection results with and without false eliminationwere also recorded and compared

(ii) Experiment II we first calculated a clustering thresh-old according to (14)Thenwe tested the performanceof the conventional methods (mean-shift and gridvoting) based on changing the clustering thresholdcontinuously Here an approximate nearest neigh-bor searching method was employed to speed upmean-shift Because the thresholds could not bedirectly compared in different experiments we usedthe multiple of the computed threshold in differentexperiments to express the new value In (14) CR isthe bandwidth for mean-shift GS is the grid size forgrid voting and 119896MS and 119896GV are the coefficients Wechose an optimal threshold value according to theexperimental results In the experiment the thresholdratio parameters were sampled as 119896MS = 119896GV =26 24 22 20 19 18 17 16 14 12 10 08

CR = 12

times 119896MS times 119879119903 using mean-shif t

GS = 119896GV times 119879119903 using grid voting (14)

(iii) Experiment III we compared the proposed methodwith the conventional grid voting on three types ofdatasets The experimental conditions of the con-ventional grid voting were as follows width andheight of the grid are 1130 of the width and theheight of the query image and the voting grid hadan overlap of 25 of size with an adjacent gridThe performances of the proposed method and theconventional grid voting were expressed in terms ofthe accuracy (precision and recall) and computationaltime

In all the experiments the parameters for SIFT featureextraction and the threshold for feature matching were setas the default values in [3] In particular the initial Gaussiansmoothing parameter was set as 120590

119900= 16 and the default

threshold on key point contrast was set to 01 In theverification procedure in our experiments thresholds 119888minand 119888max were set as 08 and 12 respectively In our work allof the experiments have been conducted on Windows 7 PCwith Core i7-4710MQ CPU 250GHz and 8GB RAM

sr = 100 sr = 074

sr = 048 sr = 0254

(a) Center estimation and dominant scale ratio computation by proposedmethod

Scale ratio

Scale ratio0 1 2

Scale ratio

sr = 099 sr = 075

sr = 0234sr = 047

(b) Dominant scale ratio computation by conventional histogram statistic

Figure 8 The first example of dominant scale ratio computation

sr = 101 sr = 075

sr = 050 sr = 0251

Scale ratio0 1 2

Scale ratio

Scale ratio0 1 2

Scale ratio

sr = 029 sr = 021

sr = 052sr = 021

Figure 9 The second example of dominant scale ratio computation

42 Experimental Results and Analysis

421 Results of the Dominant Scale Ratio Computation andScale Restriction-Based False Match Elimination Figures 8and 9 display the results of two examples for computing thedominant scale ratios Figures 8(a) and 9(a) are the resultsof the proposed method whereas Figures 8(b) and 9(b) are

the results of the conventional method The reference scaleratios are 100 75 50 and 25 in these figures In Figures8(a) 8(b) and 9(a) the calculated results are close to thereference valuesHowever in Figure 9(b) the results obtainedby the conventional method are not reliable The reason forthe error in Figure 9(b) is that the background noise is toosevere and the extracted features may have nearly the same

(a) (b) (c)

Figure 10 Raw matching results (a) training image (b) feature matching (c) key points projection

(a) (b) (c)

Figure 11 Matching results with false matches elimination (a) training image (b) feature matching (c) key points projection

scale ratio The proposed method evaluates the dominantscale ratio depending on the distribution and relationship ofkey points therefore the result is more reliable

Figure 10 shows that the raw matching results withoutscale-constrained filtering exhibit a large number of falsematches The matching results based on scale-constrainedfiltering are shown in Figure 11 with fewer outliers presentScale restriction-based template reconstruction and elimi-nation of false matches lead to the best optimum results(Figure 12) Most of the false matches are eliminated and lay agood foundation for the subsequent clustering Figures 10ndash12illustrate the effectiveness of the proposed filters

422 Results of Clustering Threshold Estimation Figures13(a)ndash14(b) show the performance of the methods usingmean-shift and grid voting The brown curve in Figure 13(a)describes the accuracy of grid voting and the blue onedescribes accuracy of mean-shift Figure 13(b) illustrates thetrue positive rate versus false positive rate of mean-shift andgrid voting as the discrimination threshold changes Points inboth Figures 13(a) and 13(b) were sampled based on differentclustering threshold ratios as detailed in the experimentalmethodology The threshold ratio values decrease graduallyfrom left to right Besides coordinates surrounded by circlesare related to the precalculated threshold Figures 14(a) and14(b) show the average value and standard deviation ofcomputational time for mean-shift and grid voting based ondifferent thresholds

As shown in Figure 13(a) the precision decreases and therecall increases as the threshold is decreased In Figure 13(b)

both the true and false positive rates increase as the thresholdis decreased Figure 13(a) shows that grid voting has abetter performance than mean-shift in recall as a whole andFigure 13(b) indicates that grid voting has a better perfor-mance in accuracy than mean-shift According to Figures13(a) and 13(b) 119896MS and 119896GV corresponding to the inflectionpoint are both 18 As shown in Figure 14(a) the time costfor feature matching and ANN-based mean-shift clusteringremains relatively stable However a smaller threshold ratioleads to a higher time cost for geometric verification becausethe number of clusters increases As shown in Figure 14(b)the computational time for clustering using grid voting isconsiderably shorter than when using mean-shift but theverification time becomes longer due to the clustering errorsAccording to the results of the feasibility validation clusteringradius 119896MS = 18 for mean-shift and 119896GV = 18 for grid votingare optimized preset parameters for the detection of multipleobject instances in inventory management

423 Performance for Different Object Instance DetectionBased on the Proposed Architecture Table 1 shows the averageresults of different levels of textures using the proposedmethod and grid voting The precision and recall wererecorded The computational times for feature extractionraw matching density estimation template reconstruction-based rematching clustering and geometric verificationweredocumented separately Figure 15 shows the results of twoexamples using the proposed method

According to Table 1 different levels of texture densitywill lead to different accuracies and computational times

(a) (b) (c)

Figure 12 Matching results based on template reconstruction and scale restriction (a) training image (b) feature matching (c) key pointsprojection

Mean-shift + RANSACGrid voting + RANSAC

Recall ()

kMS = 18kGV = 18

1009590858075

(a) Accuracy of mean-shift and grid voting

Mean-shift + RANSAC

kMS = 18

kGV = 18

False positive rate ()

Grid voting + RANSAC

750 10 20 30 40 50 60 70

(b) True positive rate versus false positive rate of mean-shift and gridvoting

Figure 13 Accuracy performance using mean-shift and grid voting

Feature matchingClusteringGeometric verification

26 24 22 20 19 18 17 16 14 12 10 08

(a) Computational time for mean-shift

26 24 22 20 19 18 17 16 14 12 10 08

(b) Computational time for grid voting

Figure 14 Computational time statistics

Figure 15 Results of two detection examples

Table 1 Average results for different levels of texture using proposed method and grid voting

Texture level MethodsAccuracy () Computational time (ms)

Precision Recall Featuredetection Raw match Density

estimation Rematch Clustering Geometricverification Total

High Proposed 976 968 1027 379 479 526 3 522 2936Grid voting 962 963 1027 379 0 0 4 2595 4005

Medium Proposed 964 958 941 220 191 246 3 866 2467Grid voting 957 954 941 220 0 0 4 2033 3198

Low Proposed 921 936 586 94 72 119 4 1054 1929Grid voting 916 919 586 94 0 0 3 1345 2028

Precision and time overhead increase with increases in thetexture density Although the first layer of density esti-mation and template reconstruction-based rematching takesome computational time the geometric verification latencyis greatly reduced compared to the conventional methodbecause the adaptive threshold is more reasonable than thejudgment based simply on the size of the query image Table 1indicates that the proposed architecture can accurately detectand identify multiple identical objects with low latency Ascan be seen in Figure 15 most of object instances weredetected However objects marked as ldquoArdquo in Figure 15(a)ldquoBrdquo ldquoCrdquo and ldquoDrdquo in Figure 15(b) and ldquoFrdquo ldquoHrdquo and ldquoGrdquo inFigure 15(c) were not detected and objects marked as ldquoErdquowere a false detection result Reasons for these errors are thereflection of light (in Figure 15(a)) high similarity of objects(the short bottle marked as ldquoErdquo is similar to the high one inFigure 15(b)) translucent occlusion (three undetected yellowbottlesmarked as ldquoBrdquo ldquoCrdquo and ldquoDrdquo in Figure 15(b)) and errorclustering results (ldquoFrdquo ldquoGrdquo and ldquoHrdquo in Figure 15(c))

5 Conclusions

In this paper we introduced the problem of multiple objectinstance detection in robot inventory management and pro-posed a dual-layer density estimation-based architecture forresolving this issueThe proposed approach is able to success-fully address the multiple object instance detection problemin practice by considering dominant scale ratio-based falsematch elimination and adaptive clustering threshold-based

grid voting The experimental results illustrate the superiorperformance our proposed method in terms of its highaccuracy and low latency

Although the presented architecture performs well inthese types of applications the algorithm would fail whenapplied to more complex problems For example if objectinstances have different scales in the query image theassumptions made in this paper will be no longer validFurther more the accuracy of the proposed method willbe greatly reduced when there is a dramatic change ofillumination or the target is occluded by other translucentobjects In our future work we will focus on improving themethod for solving such complex problems

Competing Interests

The authors declare that they have no competing interests

Acknowledgments

The authors would like to thank Shenyang SIASUN RobotAutomation Co Ltd for funding this research The projectis supported byTheNational Key Technology RampD ProgramChina (no 2015BAF13B00)

References

[1] C L Zitnick and P Dollar ldquoEdge boxes locating object pro-posals from edgesrdquo in Proceedings of the European Conference

on Computer Vision (ECCV rsquo14) Zurich Switzerland September2014 pp 391ndash405 Springer Cham Switzerland 2014

[2] SHinterstoisser S BenhimaneNNavab P Fua andV LepetitldquoOnline learning of patch perspective rectification for efficientobject detectionrdquo in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition (CVPR rsquo08) pp 1ndash8IEEE Anchorage Alaska USA June 2008

[3] D G Lowe ldquoDistinctive image features from scale-invariantkeypointsrdquo International Journal of Computer Vision vol 60 no2 pp 91ndash110 2004

[4] Y Ke and R Sukthankar ldquoPCA-SIFT a more distinctiverepresentation for local image descriptorsrdquo in Proceedings ofthe IEEE Computer Society Conference on Computer Vision andPattern Recognition (CVPR rsquo04) pp II506ndashII513 WashingtonDC USA July 2004

[5] K Mikolajczyk and C Schmid ldquoA performance evaluation oflocal descriptorsrdquo IEEE Transactions on Pattern Analysis andMachine Intelligence vol 27 no 10 pp 1615ndash1630 2005

[6] H Bay A Ess T Tuytelaars and L Van Gool ldquoSpeeded-uprobust features (SURF)rdquo Computer Vision and Image Under-standing vol 110 no 3 pp 346ndash359 2008

[7] L Juan and O Gwun ldquoA comparison of SIFT PCA-SIFT andSURFrdquo International Journal of Image Processing vol 3 no 4pp 143ndash152 2009

[8] Q Sen and Z Jianying ldquoImproved SIFT-based bidirectionalimage matching algorithm Mechanical science and technologyfor aerospace engineeringrdquoMechanical Science and Technologyfor Aerospace Engineering vol 26 pp 1179ndash1182 2007

[9] J Wang and M F Cohen ldquoImage and video matting a surveyrdquoFoundations and Trends in Computer Graphics and Vision vol3 no 2 pp 97ndash175 2008

[10] Y Bastanlar A Temizel and Y Yardimci ldquoImproved SIFTmatching for image pairs with scale differencerdquo ElectronicsLetters vol 46 no 5 pp 346ndash348 2010

[11] J Zhang andH-S Sang ldquoSIFTmatchingmethod based on basescale transformationrdquo Journal of Infrared andMillimeter Wavesvol 33 no 2 pp 177ndash182 2014

[12] R Arandjelovic and A Zisserman ldquoThree things everyoneshould know to improve object retrievalrdquo in Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition(CVPR rsquo12) pp 2911ndash2918 San Francisco Calif USA June 2012

[13] F-E Lin Y-H Kuo and W H Hsu ldquoMultiple object local-ization by context-aware adaptive window search and search-based object recognitionrdquo in Proceedings of the 19th ACMInternational Conference onMultimedia ACMMultimedia (MMrsquo11) pp 1021ndash1024 ACM Scottsdale Ariz USA December 2011

[14] C-C Wu Y-H Kuo and W Hsu ldquoLarge-scale simultaneousmulti-object recognition and localization via bottom up search-based approachrdquo in Proceedings of the 20th ACM InternationalConference on Multimedia (MM rsquo12) pp 969ndash972 Nara JapanNovember 2012

[15] AColletMMartinez and S S Srinivasa ldquoTheMOPED frame-work object recognition andpose estimation formanipulationrdquoThe International Journal of Robotics Research vol 30 no 10 pp1284ndash1306 2011

[16] S Zickler and M M Veloso ldquoDetection and localization ofmultiple objectsrdquo in Proceedings of the 6th IEEE-RAS Inter-national Conference on Humanoid Robots pp 20ndash25 GenovaItaly December 2006

[17] G Aragon-Camarasa and J P Siebert ldquoUnsupervised clusteringinHough space for recognition ofmultiple instances of the same

object in a cluttered scenerdquo Pattern Recognition Letters vol 31no 11 pp 1274ndash1284 2010

[18] R Bao K Higa and K Iwamoto ldquoLocal feature based multipleobject instance identification using scale and rotation invariantimplicit shape modelrdquo in Proceedings of the 12th Asian Confer-ence onComputer Vision (ACCV rsquo14) Singapore November 2014pp 600ndash614 Springer Cham Switzerland 2014

[19] K Higa K Iwamoto and T Nomura ldquoMultiple object iden-tification using grid voting of object center estimated fromkeypoint matchesrdquo in Proceedings of the 20th IEEE InternationalConference on Image Processing (ICIP rsquo13) pp 2973ndash2977Melbourne Australia September 2013

[20] R Szeliski and S B Kang ldquoRecovering 3D shape and motionfrom image streams using nonlinear least squaresrdquo in Proceed-ings of the IEEE Computer Society Conference on ComputerVision and Pattern Recognition (CVPR rsquo93) pp 752ndash753 IEEENew York NY USA June 1993

[21] M Muja and D G Lowe ldquoFast approximate nearest neighborswith automatic algorithm configurationrdquo in Proceedings ofthe 4th International Conference on Computer Vision Theoryand Applications (VISAPP rsquo09) pp 331ndash340 Lisboa PortugalFebruary 2009

[22] M Muja and D G Lowe ldquoFast matching of binary featuresrdquo inProceedings of the 9th Conference on Computer and Robot Vision(CRV rsquo12) pp 404ndash410 IEEE Toronto Canada May 2012

[23] D Nister and H Stewenius ldquoScalable recognition with avocabulary treerdquo in Proceedings of the IEEE Computer SocietyConference on Computer Vision and Pattern Recognition (CVPRrsquo06) vol 2 pp 2161ndash2168 IEEE NewYork NY USA June 2006

[24] B Matei Y Shan H S Sawhney et al ldquoRapid object indexingusing locality sensitive hashing and joint 3D-signature spaceestimationrdquo IEEETransactions onPatternAnalysis AndMachineIntelligence vol 28 no 7 pp 1111ndash1126 2006

[25] B Kulis andK Grauman ldquoKernelized locality-sensitive hashingfor scalable image searchrdquo in Proceedings of the 12th Interna-tional Conference onComputerVision (ICCV rsquo09) pp 2130ndash2137Kyoto Japan October 2009

[26] J Wang S Kumar and S-F Chang ldquoSemi-supervised hash-ing for scalable image retrievalrdquo in Proceedings of the IEEEComputer Society Conference on Computer Vision and PatternRecognition (CVPR rsquo10) pp 3424ndash3431 IEEE San FranciscoCalif USA June 2010

[27] A Andoni and P Indyk ldquoNear-optimal hashing algorithmsfor approximate nearest neighbor in high dimensionsrdquo inProceedings of the 47th Annual IEEE Symposium on Foundationsof Computer Science (FOCS rsquo06) pp 459ndash468 Berkeley CalifUSA October 2006

[28] B W Silverman ldquoDensity Estimation for Statistics and DataAnalysis Chapman amp Hall LondonmdashNew York 1986 175 ppm12rdquo Biometrical Journal vol 30 pp 876ndash877 1988

[29] V Lepetit F Moreno-Noguer and P Fua ldquoEPnP An accurateO(n) solution to the PnP problemrdquo International Journal ofComputer Vision vol 81 no 2 pp 155ndash166 2009

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

RotatingMachinery

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Shock and Vibration

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

Navigation and Observation

DistributedSensor Networks