
CREST: Convolutional Residual Learning for Visual Tracking

Yibing Song1, Chao Ma2, Lijun Gong3, Jiawei Zhang1, Rynson W.H. Lau1, and Ming-Hsuan Yang4

1 City University of Hong Kong, Hong Kong   2 The University of Adelaide, Australia   3 SenseNets Technology Ltd, China   4 University of California, Merced, U.S.A.

Abstract

Discriminative correlation filters (DCFs) have been shown to perform superiorly in visual tracking. They only need a small set of training samples from the initial frame to generate an appearance model. However, existing DCFs learn the filters separately from feature extraction, and update these filters using a moving average operation with an empirical weight. These DCF trackers hardly benefit from end-to-end training. In this paper, we propose the CREST algorithm to reformulate DCFs as a one-layer convolutional neural network. Our method integrates feature extraction, response map generation and model update into the neural network for end-to-end training. To reduce model degradation during online update, we apply residual learning to take appearance changes into account. Extensive experiments on the benchmark datasets demonstrate that our CREST tracker performs favorably against state-of-the-art trackers. 1

1. Introduction

Visual tracking has various applications ranging from video surveillance and human-computer interaction to autonomous driving. The main difficulty is how to utilize the extremely limited training data (usually a bounding box in the first frame) to develop an appearance model robust to a variety of challenges, including background clutter, scale variation, motion blur and partial occlusions. Discriminative correlation filters (DCFs) have attracted increasing attention in the tracking community [4, 8, 30] due to two important properties. First, since spatial correlation is often computed in the Fourier domain as an element-wise product, DCFs are suitable for fast tracking. Second, DCFs regress the circularly shifted versions of input features to soft labels, i.e., labels generated by a Gaussian function ranging from zero to one.

1 More results and code are provided on the authors' webpages.


Figure 1: Convolutional features improve DCFs (DeepSRDCF [8], CCOT [11], HCFT [30]). We propose the CREST algorithm by formulating DCFs as a shallow convolutional layer with residual learning. It performs favorably against existing DCFs with convolutional features.

In contrast to most existing tracking-by-detection approaches [22, 1, 14, 34] that generate sparse response scores over sampled locations, DCFs always generate dense response scores over all searching locations. With the use of deep convolutional features [25], DCF-based tracking algorithms [30, 8, 11] have achieved state-of-the-art performance on recent tracking benchmark datasets [45, 46, 24].

However, existing DCF-based tracking algorithms are limited in two aspects. First, learning DCFs is independent of feature extraction. Although it is straightforward to learn DCFs directly over deep convolutional features as in [30, 8, 11], DCF trackers benefit little from end-to-end training. Second, most DCF trackers use a linear interpolation operation to update the learned filters over time. Such an empirical interpolation weight is unlikely to strike a good balance between model adaptivity and stability, and it leads to drifting of DCF trackers due to noisy updates.


These limitations raise two questions: (1) can DCFs with feature representation be modeled in an end-to-end manner, and (2) can DCFs be updated more effectively than by empirical operations such as linear interpolation?

To address these two questions, we propose a Convolutional RESidual learning scheme for visual Tracking (CREST). We interpret DCFs as the counterparts of the convolution filters in deep neural networks. In light of this idea, we reformulate DCFs as a one-layer convolutional neural network that directly generates the response map as the spatial correlation between two consecutive frames. With this formulation, feature extraction through pre-trained CNN models (e.g., VGGNet [38]), correlation response map generation, and model update are effectively integrated into an end-to-end form. The spatial convolution operation functions similarly to the dot product between the circularly shifted inputs and the correlation filter, and it removes the boundary effect of the Fourier transform by convolving directly in the spatial domain. Moreover, the convolutional layer is fully differentiable, which allows updating the convolutional filters using back-propagation. Similar to DCFs, the convolutional layer generates dense response scores over all searching locations in a one-pass manner. To properly update our model, we apply residual learning [15] to capture appearance changes by detecting the difference between the output of this convolutional layer and the ground-truth soft label. This helps alleviate rapid model degradation caused by noisy updates. Meanwhile, residual learning makes the target response more robust to large appearance variations. Ablation studies (Section 5.2) show that the proposed convolutional layer performs well against state-of-the-art DCF trackers and that the residual learning approach further improves the accuracy.

The main contributions of this work are as follows:

• We reformulate the correlation filter as one convolutional layer. It integrates feature extraction, response generation, and model update into the convolutional neural network for end-to-end training.

• We apply residual learning to capture the target appearance changes with reference to spatiotemporal frames. This effectively alleviates rapid model degradation caused by large appearance changes.

• We extensively validate our method on benchmark datasets with large-scale sequences. We show that our CREST tracker performs favorably against state-of-the-art trackers.

2. Related Work

There are extensive surveys of visual tracking in the literature [47, 37, 39]. In this section, we mainly discuss tracking methods that are based on correlation filters and CNNs.

Tracking by Correlation Filters. Correlation filters for visual tracking have attracted considerable attention due to their computational efficiency in the Fourier domain. Tracking methods based on correlation filters regress all the circularly shifted versions of the input features to a Gaussian function, and they do not need multiple samples of the target appearance. The MOSSE tracker [4] encodes target appearance through an adaptive correlation filter by optimizing the output sum of squared error. Several extensions have been proposed to considerably improve tracking accuracy. Examples include kernelized correlation filters [17], multiple dimensional features [12, 18], context learning [49], scale estimation [7], re-detection [31], subspace learning [28], short-term and long-term memory [20], reliable collection [27] and spatial regularization [9]. Different from existing correlation filter based frameworks that formulate the correlation operation as an element-wise multiplication in the Fourier domain, we formulate the correlation filter as a convolution operation in the spatial domain, represented by one convolutional layer in a CNN. In this sense, we demonstrate that feature extraction, response generation and model update can be integrated into one network for end-to-end prediction and optimization.

Tracking by CNNs. Visual representations are important for visual tracking. Existing CNN trackers mainly explore pre-trained object recognition networks and build upon discriminative or regression models. Discriminative tracking methods propose multiple particles and refine them through online classification. They include stacked denoising autoencoders [44], incremental learning [26], SVM classification [19] and fully connected neural networks [33]. These discriminative trackers require auxiliary training data as well as offline pre-training. On the other hand, regression based methods typically regress CNN features into soft labels (e.g., a two-dimensional Gaussian distribution). They focus on integrating convolutional features with the traditional DCF framework. Examples include hierarchical convolutional features [30], adaptive hedging [36], spatial regularization [8] and continuous convolution operators [11]. In addition, there are CNN-based methods to select convolutional features [42] and to update sequentially [43]. Furthermore, Siamese networks have received growing attention due to their two-stream identical structure. These include tracking by object verification [40], tracking by correlation [3] and tracking by location axis prediction [16]. Besides, recurrent neural networks (RNNs) have been investigated to facilitate tracking as object verification [6]. Different from the existing frameworks, we apply residual learning to capture the difference of the predicted response map between the current frame and the ground truth (the initial frame). This facilitates accounting for appearance changes and effectively reduces model degradation caused by noisy updates.



Figure 2: The pipeline of our CREST algorithm. We extract convolutional features from one search patch of the current frame T and the initial frame. These features are transformed into the response map through the base and residual mappings.

3. Convolutional Residual Learning

Our CREST algorithm carries out feature regression through the base and residual layers. The base layer consists of one convolutional layer, which is formulated as the traditional DCF. The difference between the base layer output and the ground-truth soft label is captured by the residual layers. Figure 2 shows the CREST pipeline. The details are discussed below.

3.1. DCF Reformulation

We revisit the DCF-based framework and formulate it as our base layer. DCFs learn a discriminative classifier and predict the target translation by searching for the maximum value in the response map. We denote the input sample by X and the corresponding Gaussian function label by Y. A correlation filter W is then learned by solving the following minimization problem:

W^{\star} = \arg\min_{W} \|W \ast X - Y\|^{2} + \lambda \|W\|^{2},   (1)

where \lambda is the regularization parameter. Typically, the convolution between the correlation filter W and the input X is computed as a dot product in the Fourier domain [17, 31, 18].

We reformulate the learning process of DCFs as the lossminimization of the convolutional neural network. The gen-eral form of the loss function [21] can be written as:

L(W ) =1

N

|N |∑i

LW (X(i)) + λr(W ), (2)

where N is the number of samples, LW (X(i)) (i ∈ N ) isthe loss of the i-th sample, and r(W ) is the weight decay.

We set N = 1 and take the L2 norm as r(W). The loss function in Eq. 2 can then be written as:

L(W) = L_W(X) + \lambda \|W\|^{2},   (3)

where L_W(X) = \|F(X) - Y\|^{2} is the L2 loss between the network output F(X) and the ground-truth label Y. We take F(X) = W \ast X as the convolution operation on X, which can be achieved through one convolutional layer. The convolutional filter W is equivalent to the correlation filter, and the loss function in Eq. 3 is equivalent to the DCF objective function. As a result, we formulate the DCF as one convolutional layer with the L2 loss as its objective function. We name it the base layer of our network. Its filter size is set to cover the target object. The convolutional weights can be effectively calculated using gradient descent instead of the closed-form solution [18].
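As a rough illustration of this reformulation, the sketch below builds the base layer as a single convolutional layer trained with an L2 loss and weight decay, as in Eq. 3. It uses PyTorch rather than the authors' MatConvNet implementation; the channel count and learning rate follow Section 5.1, while the filter size and the weight-decay value are placeholder assumptions.

```python
import torch
import torch.nn as nn

feat_channels = 64        # feature channels after PCA reduction (Section 5.1)
kh, kw = 13, 17           # filter size chosen to cover the target; example values only

# Base layer: the correlation filter W of Eq. 1 as one convolutional layer.
base = nn.Conv2d(feat_channels, 1, kernel_size=(kh, kw),
                 padding=(kh // 2, kw // 2), bias=False)

criterion = nn.MSELoss(reduction="sum")                # ||F(X) - Y||^2
optimizer = torch.optim.Adam(base.parameters(), lr=5e-8,
                             weight_decay=1e-4)        # lambda * ||W||^2 (lambda assumed)

def train_base(X, Y, loss_threshold=0.02, max_iters=500):
    """X: 1xCxHxW feature map of the training patch; Y: 1x1xHxW Gaussian soft label."""
    for _ in range(max_iters):
        optimizer.zero_grad()
        loss = criterion(base(X), Y)
        loss.backward()                                # gradient descent, no closed-form solution
        optimizer.step()
        if loss.item() < loss_threshold:
            break
```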

3.2. Residual Learning

We formulate DCFs as a base layer represented by one convolutional layer. Ideally, the response map from the base layer output will be identical to the ground-truth soft label. In practice, it is unlikely that a single-layer network can accomplish that. Instead of stacking more layers, which may cause the degradation problem [15], we apply residual learning to effectively capture the difference between the base layer output and the ground truth.

Figure 3 shows the structure of the base and residual layers. We denote by H(X) the optimal mapping of input X and by F_B(X) the output of the base layer. Rather than stacking more layers to approximate H(X), we expect these layers to approximate the residual function F_R(X) = H(X) - F_B(X).



Figure 3: Base and spatial residual layers.

As a result, our expected network output can be formulated as:

F(X) = F_B(X) + F_R(X) = F_B(X, \{W_B\}) + F_R(X, \{W_R\}),   (4)

where F_B(X) = W_B \ast X. The mapping F_R(X, \{W_R\}) represents the residual learning, and W_R is a general form of convolutional layers, with biases and ReLU [32] omitted to simplify notation. We adopt three layers with small filter sizes for the residual learning. They are set to capture the residual that is not represented in the base layer output. Finally, the input X is regressed through the base and residual mappings to generate the output response map.
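Under the same assumptions as the previous sketch, the residual branch can be sketched as three small convolutional layers whose output is added to the base response (Eq. 4). The intermediate channel width and the 3x3 kernels are illustrative choices, not the authors' exact configuration.

```python
import torch.nn as nn

class SpatialResidual(nn.Module):
    """Residual mapping F_R(X) of Eq. 4: three small conv layers with ReLU."""
    def __init__(self, in_channels=64, mid_channels=32):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.branch(x)

# The final response of Eq. 4 is the sum of the two branches:
#   response = base(X) + SpatialResidual()(X)
# Both branches are optimized jointly against the same soft label.
```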

In addition, we can also utilize a temporal residual, which helps to capture the difference when the spatial residual is not effective. We develop a temporal residual learning network with a structure similar to that of the spatial residual learning. The temporal input is extracted from the first frame, which contains the initial object appearance. Let X_t denote the input X at frame t. Thus we have

F(X_t) = F_B(X_t) + F_{SR}(X_t) + F_{TR}(X_1),   (5)

where F_{TR}(X_1) is the temporal residual from the first frame. The proposed spatiotemporal residual learning process encodes elusive object representation into the response map generation framework, and no additional data is needed to train the network.
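A compact sketch of Eq. 5, assuming the base layer and the two residual branches defined above (with the temporal branch sharing the spatial branch's structure):

```python
import torch.nn as nn

class CRESTResponse(nn.Module):
    """F(X_t) = F_B(X_t) + F_SR(X_t) + F_TR(X_1), as in Eq. 5."""
    def __init__(self, base, spatial_residual, temporal_residual):
        super().__init__()
        self.base = base                            # F_B, the one-layer DCF
        self.spatial_residual = spatial_residual    # F_SR on the current frame
        self.temporal_residual = temporal_residual  # F_TR on the initial frame

    def forward(self, x_t, x_1):
        return self.base(x_t) + self.spatial_residual(x_t) + self.temporal_residual(x_1)
```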

Figure 4 shows one example of the filter responses from the base and residual layers. The feature maps are scaled for visualization purposes. Given an input image, we first crop the search patch centered at the position estimated in the previous frame. The patch is fed into our feature extraction network and then regressed into a response map through the base and residual layers. We observe that the base layer performs similarly to traditional DCF-based trackers in predicting the response map. When the target object undergoes small appearance variations, the difference between the base layer output and the ground truth is minor, and the residual layers have little effect on the final response map. However, when the target object undergoes large appearance variations, such as background clutter, the response from the base layer is limited and may not differentiate the target from the background.


Figure 4: Visualization of the feature maps in the target localization step.

Nevertheless, this limitation is alleviated by the residual layers, which effectively model the difference between the base layer output and the ground truth. This helps to reduce noisy response values in the final output through the addition of the base and residual layers. As a result, the target response is more robust to large appearance variations.

4. Tracking via CREST

We illustrate the detailed procedure of CREST for visual tracking. As we do not need offline training, we present the tracking process through the aspects of model initialization, online detection, scale estimation and model update.

Model Initialization. Given an input frame with the target location, we extract a training patch centered on the target object. This patch is fed into our framework for feature extraction and response mapping. We adopt the VGG network [38] for feature extraction. All the parameters in the base and residual layers are randomly initialized from a zero-mean Gaussian distribution. The base and residual layers are well initialized after a few training steps.
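A possible sketch of this initialization step is shown below: a truncated VGG-16 feature extractor (up to conv4-3, keeping only the first two pooling layers, as in Section 5.1) and zero-mean Gaussian initialization of the base and residual layers. The layer counting relies on torchvision's VGG-16 definition, the extractor is assumed frozen, and the 0.01 standard deviation is an illustrative value.

```python
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

def build_feature_extractor():
    """VGG-16 up to conv4-3 with only the first two pooling layers retained."""
    layers, n_pool, n_conv = [], 0, 0
    for m in vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features:
        if isinstance(m, nn.MaxPool2d):
            n_pool += 1
            if n_pool > 2:
                continue                    # drop pool3 and any later pooling layer
        layers.append(m)
        if isinstance(m, nn.Conv2d):
            n_conv += 1
            if n_conv == 10:                # conv4-3 is the 10th conv layer of VGG-16
                break
    net = nn.Sequential(*layers).eval()
    for p in net.parameters():
        p.requires_grad_(False)             # features are extracted, not fine-tuned, here
    return net

def init_gaussian(module, std=0.01):
    """Zero-mean Gaussian initialization for the base and residual conv layers."""
    if isinstance(module, nn.Conv2d):
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Usage: base.apply(init_gaussian); spatial_residual.apply(init_gaussian); ...
```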

Online Detection. The online detection scheme is straightforward. When a new frame arrives, we extract the search patch centered at the location predicted in the previous frame. The search patch has the same size as the training patch and is fed into our framework to generate a response map. Once we have the response map, we locate the object by searching for the maximum response value.
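The localization step then reduces to finding the peak of the response map and converting its offset from the patch center into an image-space displacement. A sketch under the assumptions above (1x1xHxW response, feature stride of 4 from the two retained pooling layers):

```python
import torch

def locate_target(response, prev_center, stride=4):
    """response: 1x1xHxW map; prev_center: (cx, cy) of the previous target center."""
    _, _, H, W = response.shape
    idx = int(torch.argmax(response.view(-1)))
    py, px = divmod(idx, W)                      # peak position on the response map
    dx = (px - (W - 1) / 2.0) * stride           # displacement from the patch center,
    dy = (py - (H - 1) / 2.0) * stride           # scaled back to image coordinates
    return prev_center[0] + dx, prev_center[1] + dy
```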

Scale Estimation. After obtaining the target center location, we extract search patches at different scales. These patches are then resized to the fixed size of the training patches, so that the candidate object sizes at different scales are all normalized. We feed these patches into our network to generate the response maps. The width w_t and height h_t of the target object at frame t are updated as:

(w_t, h_t) = \beta (w_t^{\star}, h_t^{\star}) + (1 - \beta)(w_{t-1}, h_{t-1}),   (6)

where w_t^{\star} and h_t^{\star} are the width and height of the scaled object with the maximum response value. The weight factor \beta enables a smooth update of the target size.
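A sketch of this scale step: the candidate scale with the highest response peak is selected, and the size update of Eq. 6 is smoothed with β = 0.6 (Section 5.1). The set of candidate scales is not specified here and is left to the caller.

```python
def update_size(responses, candidate_sizes, prev_size, beta=0.6):
    """responses: one response map per candidate scale;
    candidate_sizes: the (w, h) each scale corresponds to; prev_size: (w_{t-1}, h_{t-1})."""
    best = max(range(len(responses)), key=lambda i: float(responses[i].max()))
    w_star, h_star = candidate_sizes[best]        # (w*_t, h*_t) in Eq. 6
    w_prev, h_prev = prev_size
    return (beta * w_star + (1 - beta) * w_prev,
            beta * h_star + (1 - beta) * h_prev)
```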


Model Update. We consistently generate training data during online tracking. For each frame, after predicting the target location, we generate the corresponding ground-truth response map, and the search patch is directly adopted as the training patch. The training patches and response maps collected every T frames are adopted as training pairs, which are fed into our network for online update.
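The update itself can be sketched as a short replay over the pairs collected since the last update, run for the few iterations and small learning rate given in Section 5.1; the single-input model signature is an assumption.

```python
def online_update(model, optimizer, criterion, pairs, iters=2):
    """pairs: list of (feature_map, soft_label) tensors collected over the last T frames."""
    for _ in range(iters):
        for X, Y in pairs:
            optimizer.zero_grad()
            loss = criterion(model(X), Y)   # L2 loss against the generated soft label
            loss.backward()
            optimizer.step()
```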

5. Experiments

In this section, we introduce the implementation details and analyze the effect of each component, including the base and residual layers. We then compare our CREST tracker with state-of-the-art trackers on the benchmark datasets for performance evaluation.

5.1. Experimental Setups

Implementation Details. We obtain the training patch from the first frame. Its size is 5 times the maximum of the object width and height. Our feature extraction network is VGG-16 [38] with only the first two pooling layers retained. We extract feature maps from the conv4-3 layer and reduce the feature channels to 64 through PCA dimensionality reduction, which is learned on the first-frame image patch. The regression target map is generated using a two-dimensional Gaussian function with a peak value of 1.0. The weight factor β for scale estimation is set to 0.6. Our experiments are performed on a PC with an i7 3.4 GHz CPU and a GeForce GTX Titan Black GPU, using the MatConvNet toolbox [41]. In the training stage, we iteratively apply the Adam optimizer [23] with a learning rate of 5e-8 to update the coefficients until the loss in Eq. 3 falls below a threshold of 0.02. In practice, our network converges from random initialization within a few hundred iterations. We update the model every 2 frames, for only 2 iterations with a learning rate of 2e-9.
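Two of these preprocessing steps can be sketched as follows: the two-dimensional Gaussian regression target with peak value 1.0, and the PCA projection that reduces the conv4-3 features to 64 channels, fitted on the first frame. The Gaussian bandwidth sigma is an assumed value.

```python
import numpy as np

def gaussian_label(h, w, cy, cx, sigma=2.0):
    """2D Gaussian soft label of size h x w with peak value 1.0 at (cy, cx)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def fit_pca(feat, out_dim=64):
    """feat: C x H x W feature map from the first frame; returns a C x out_dim projection."""
    X = feat.reshape(feat.shape[0], -1).T            # (H*W) x C samples
    X = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:out_dim].T                            # apply as feat_flat @ projection
```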

Benchmark Datasets. The experiments are conducted on three standard benchmarks: OTB-2013 [45], OTB-2015 [46] and VOT-2016 [24]. The first two datasets contain 50 and 100 sequences, respectively, annotated with ground-truth bounding boxes and various attributes. The VOT-2016 dataset contains 60 challenging videos selected from a set of more than 300 videos.

Evaluation Metrics. We follow the standard evaluation metrics from the benchmarks. For OTB-2013 and OTB-2015, we use the one-pass evaluation (OPE) with the precision and success plot metrics. The precision metric measures the rate of frames whose estimated locations are within a given threshold distance of the ground truth; the threshold distance is set to 20 pixels for all trackers. The success plot metric measures the overlap ratio between the predicted and ground-truth bounding boxes.

[Figure 5 legend scores: precision plots of OPE: HCFT 0.891, Base 0.883; success plots of OPE: HCFT 0.605, Base 0.601.]

Figure 5: Precision and success plots using one-pass evaluation on the OTB-2013 dataset. The performance of the base layer without scale estimation is similar to that of HCFT [30] on average.

[Figure 6 legend scores: precision plots of OPE: Base+Residual (Spatial+Temporal) 0.908, Base+Residual (Spatial) 0.897, Base 0.883; success plots of OPE: Base+Residual (Spatial+Temporal) 0.673, Base+Residual (Spatial) 0.663, Base 0.620.]

Figure 6: Precision and success plots using one-pass evaluation on the OTB-2013 dataset. The performance of the base layer is gradually improved with the integration of the spatiotemporal residual.

For the VOT-2016 dataset, the performance is measured in terms of expected average overlap (EAO), average overlap (AO), accuracy values (Av), accuracy ranks (Ar), robustness values (Rv) and robustness ranks (Rr). The average overlap is similar to the AUC metric in the OTB benchmarks.
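For reference, the OTB metrics described above can be sketched as follows, assuming per-frame boxes in (x, y, w, h) format; the AUC is approximated here by averaging the success rate over evenly spaced overlap thresholds.

```python
import numpy as np

def center_error(pred, gt):
    pc = pred[:, :2] + pred[:, 2:] / 2.0
    gc = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pc - gc, axis=1)

def iou(pred, gt):
    x1 = np.maximum(pred[:, 0], gt[:, 0]); y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def precision_at_20(pred, gt):
    return float((center_error(pred, gt) <= 20).mean())

def success_auc(pred, gt, thresholds=np.linspace(0, 1, 21)):
    overlaps = iou(pred, gt)
    return float(np.mean([(overlaps > t).mean() for t in thresholds]))
```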

5.2. Ablation Studies

Our CREST algorithm contains base and residual layers. As analyzed in Section 3.1, the base layer is formulated similarly to existing DCF-based trackers, and the residual layers refine the response map. In this section, we conduct an ablation analysis to compare the performance of the base layer with that of DCF-based trackers. In addition, we evaluate the base layer and its integration with the spatiotemporal residual layers.

We analyze our CREST algorithm on the OTB-2013 dataset. We first compare the base layer with HCFT [30], a traditional DCF-based tracker with convolutional features. The objective function of HCFT is shown in Eq. 1, which is the same as ours. As there is no scale estimation in HCFT, we remove this step from our algorithm. Figure 5 shows the quantitative evaluation under the AUC and average distance precision scores. We observe that the performance of our base layer is similar to that of HCFT on average, which indicates that our base layer achieves performance similar to that of DCF-based trackers with convolutional features.


[Figure 7 legend scores: precision plots of OPE: CREST 0.908, CCOT 0.899, HCFT 0.891, HDT 0.889, SINT 0.882, SRDCFdecon 0.870, MUSTer 0.865, DeepSRDCF 0.849, LCT 0.848, SRDCF 0.838; success plots of OPE: CREST 0.673, CCOT 0.672, SINT 0.655, SRDCFdecon 0.653, MUSTer 0.641, DeepSRDCF 0.641, LCT 0.628, SRDCF 0.626, SiamFC 0.612, HCFT 0.605.]

Figure 7: Precision and success plots over all 50 sequences using one-pass evaluation on the OTB-2013 dataset. The legend contains the area-under-the-curve score and the average distance precision score at 20 pixels for each tracker.

The DeepSRDCF [8] and CCOT [11] trackers differ from traditional DCF-based trackers in that they add a spatial constraint on the regularization term, which is different from our objective function. We also analyze the effect of residual integration in Figure 6. The AUC and average distance precision scores show that the base layer obtains an obvious improvement through the integration of spatial residual learning, while the temporal residual contributes little to the overall performance.

5.3. Comparisons with State-of-the-art

We conduct quantitative and qualitative evaluations on the benchmark datasets, including OTB-2013 [45], OTB-2015 [46] and VOT-2016 [24]. The details are discussed in the following.

5.3.1 Quantitative Evaluation

OTB-2013 Dataset. We compare our CREST tracker with the 29 trackers from the OTB-2013 benchmark [45] and 21 other state-of-the-art trackers, including KCF [17], MEEM [48], TGPR [13], DSST [7], RPT [27], MUSTer [20], LCT [31], HCFT [30], FCNT [42], SRDCF [9], CNN-SVM [19], DeepSRDCF [8], DAT [35], Staple [2], SRDCFdecon [10], CCOT [11], GOTURN [16], SINT [40], SiamFC [3], HDT [36] and SCT [5]. The evaluation is conducted on 50 video sequences using one-pass evaluation with the distance precision and overlap success metrics.

Figure 7 shows the evaluation results. We only show the top 10 trackers for presentation clarity. The AUC and distance precision scores for each tracker are reported in the figure legend. Among all the trackers, our CREST tracker performs favorably in both distance precision and overlap success rate. In Figure 7, we exclude the MDNet tracker [33], as it uses tracking videos for training. Overall, the precision and success plots demonstrate that our CREST tracker performs favorably against state-of-the-art trackers.

In Figure 9, we further analyze the tracker performance under different video attributes (e.g., background clutter, deformation and illumination variation) annotated in the benchmark.

[Figure 8 legend scores: precision plots of OPE: CCOT 0.898, DeepSRDCF 0.851, HDT 0.848, CREST 0.837, HCFT 0.837, SRDCFdecon 0.825, SRDCF 0.789, Staple 0.784, MEEM 0.781, MUSTer 0.774; success plots of OPE: CCOT 0.671, DeepSRDCF 0.635, SRDCFdecon 0.627, CREST 0.623, SRDCF 0.598, SiamFC 0.582, Staple 0.578, MUSTer 0.577, HDT 0.564, HCFT 0.562.]

Figure 8: Precision and success plots over all 100 sequences using one-pass evaluation on the OTB-2015 dataset. The legend contains the area-under-the-curve score and the average distance precision score at 20 pixels for each tracker.

We show the one-pass evaluation of the AUC score under eight main video attributes. The results indicate that our CREST tracker is effective in handling background clutter and illumination variation, mainly because the residual layers capture the appearance changes for effective model update. When the target appearance undergoes obvious changes or becomes similar to the background, existing DCF-based trackers (e.g., HCFT, CCOT) cannot accurately locate the target object. They do not perform as well as SINT, which makes sparse predictions by generating candidate bounding boxes and verifying them through a Siamese network. This limitation is reduced in our CREST tracker, where the residuals from the spatial and temporal domains effectively narrow the gap between the noisy output of the base layer and the ground-truth label. The dense prediction becomes accurate and the target location can be correctly identified. We observe similar performance for deformation, where LCT achieves a favorable result by integrating a re-detection scheme. However, it does not perform as well as our CREST tracker, in which the temporal residual is integrated into the framework and optimized as a whole. In motion blur and fast motion sequences, our tracker does not perform as well as CCOT. This can be attributed to the convolutional features from multiple layers that CCOT adopts; their feature representation performs better than ours from a single-layer output. Feature representations from multiple layers will be considered in our future work.

OTB-2015 Dataset. We also evaluate our CREST tracker on the OTB-2015 benchmark [46] against the 29 trackers in [45] and the state-of-the-art trackers, including KCF [17], MEEM [48], TGPR [13], DSST [7], MUSTer [20], LCT [31], HCFT [30], SRDCF [9], CNN-SVM [19], DeepSRDCF [8], Staple [2], SRDCFdecon [10], CCOT [11] and HDT [36]. Figure 8 shows the results of one-pass evaluation using the distance precision and overlap success rate. Our CREST tracker performs better than the DCF trackers HCFT and HDT with convolutional features.


[Figure 9 legend scores (AUC), top two per attribute: background clutter (21): CREST 0.675, SINT 0.650; deformation (19): CREST 0.689, LCT 0.668; illumination variation (25): CREST 0.655, SINT 0.637; in-plane rotation (31): CREST 0.645, SINT 0.631; low resolution (4): CREST 0.600, SINT 0.587; out-of-plane rotation (39): CREST 0.660, CCOT 0.659; motion blur (12): CCOT 0.666, CREST 0.656; fast motion (17): CCOT 0.652, CREST 0.642.]

Figure 9: The success plots for eight tracking challenges: background clutter, deformation, illumination variation, in-plane rotation, low resolution, out-of-plane rotation, motion blur and fast motion.

Table 1: Comparison with the state-of-the-art trackers on the VOT-2016 dataset. The results are presented in terms of expected average overlap (EAO), average overlap (AO), accuracy value (Av), accuracy rank (Ar), robustness value (Rv), and robustness rank (Rr).

      CCOT    Staple   EBT     DSRDCF   MDNet   SiamFC   CREST
EAO   0.331   0.295    0.291   0.276    0.257   0.277    0.283
AO    0.469   0.388    0.370   0.427    0.457   0.421    0.435
Av    0.523   0.538    0.441   0.513    0.533   0.549    0.514
Ar    2.167   2.100    4.383   2.517    1.967   2.465    2.833
Rv    0.850   1.350    0.900   1.167    1.204   1.382    1.083
Rr    2.333   3.933    2.467   3.550    3.250   2.853    2.733

Overall, the CCOT method achieves the best result, while CREST achieves performance similar to that of DeepSRDCF in both distance precision and overlap threshold. Since this dataset contains more videos with motion blur and fast motion, our CREST tracker is less effective than CCOT in handling these sequences.

VOT-2016 Dataset. We compare our CREST tracker with state-of-the-art trackers on the VOT-2016 benchmark, including CCOT [11], Staple [2], EBT [50], DeepSRDCF [8], MDNet [33] and SiamFC [3]. As indicated in the VOT-2016 report [24], the strict state-of-the-art bound is 0.251 under the EAO metric; trackers whose EAO values exceed this bound are considered state-of-the-art.

Table 1 shows the results of our CREST tracker and the state-of-the-art trackers. Among these methods, CCOT achieves the best results under the EAO metric, while the performance of our CREST tracker is similar to those of Staple and EBT. In addition, these trackers perform better than the DeepSRDCF, MDNet and SiamFC trackers. Note that on this dataset, MDNet does not use external tracking videos for training. According to the analysis of the VOT report and the definition of the strict state-of-the-art bound, all these trackers can be regarded as state-of-the-art.

5.3.2 Qualitative Evaluation

Figure 10 shows results of the top-performing trackers, MUSTer [20], SRDCFDecon [10], DeepSRDCF [8], CCOT [11] and our CREST tracker, on 12 challenging sequences. The MUSTer tracker does not perform well in any of the presented sequences. This is because it adopts empirical operations for feature extraction (e.g., SIFT [29]). Although keypoint matching and long-term processing are involved, handcrafted features with limited representation power are not able to differentiate the target from the background. In comparison, SRDCFDecon incorporates the tracking-by-detection scheme to jointly optimize the model parameters and sample weights. It performs well on in-plane rotation (Box), out-of-view (Liquor) and deformation (Skating2) because of detection.



Figure 10: Qualitative evaluation of our CREST tracker, MUSTer [20], SRDCFDecon [10], DeepSRDCF [8] and CCOT [11] on twelve challenging sequences (from left to right and top to bottom: Football, Bolt2, Bird1, KiteSurf, Liquor, Box, Skiing, Matrix, Football1, Basketball, Ironman and Skating-2, respectively). Our CREST tracker performs favorably against the state-of-the-art trackers.

However, the sparse response of SRDCFDecon leads to limitations on background clutter (Matrix), occlusion (Basketball, Bird1) and fast motion (Bolt2). DeepSRDCF improves the performance by combining convolutional features with SRDCF. Its dense prediction performs well in the fast motion (Bolt2) and occlusion (Basketball) scenes. However, the direct integration does not further exploit the model's potential, and limitations occur under illumination variation (Ironman) and out-of-view (Bird1). Instead of involving multiple convolutional features from different layers like CCOT, our CREST algorithm further exploits the model's potential through the DCF formulation and improves the performance through residual learning. We generate the target prediction in the base layer while capturing the elusive residual via the residual layers. These residual layers facilitate the learning process since they are jointly optimized with the base layer. As a result, the response map predicted by our network is more accurate for target localization, especially in the presented challenging scenarios. Overall, the visual evaluation indicates that our CREST tracker performs favorably against state-of-the-art trackers.

6. Concluding Remarks

In this paper, we propose CREST to formulate the correlation filter as a one-layer convolutional neural network named the base layer. It integrates convolutional feature extraction, correlation response map generation and model update as a whole for end-to-end training and prediction. This convolutional layer is fully differentiable and allows the convolutional filters to be updated via online back-propagation. Meanwhile, we exploit residual learning to take target appearance changes into account. We develop spatiotemporal residual layers to capture the difference between the base layer output and the ground-truth Gaussian label. They refine the response map by reducing noisy values, which alleviates the model degradation problem. Experiments on the standard benchmarks indicate that our CREST tracker performs favorably against state-of-the-art trackers.

7. Acknowledgements

This work is supported in part by the NSF CAREER Grant #1149783, and gifts from Adobe and Nvidia.


References

[1] Q. Bai, Z. Wu, S. Sclaroff, M. Betke, and C. Monnier. Randomized ensemble tracking. In ICCV, 2013.
[2] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. Torr. Staple: Complementary learners for real-time tracking. In CVPR, 2016.
[3] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In ECCV Workshops, 2016.
[4] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In CVPR, 2010.
[5] J. Choi, H. Jin Chang, J. Jeong, Y. Demiris, and J. Young Choi. Visual tracking using attention-modulated disintegration and integration. In CVPR, 2016.
[6] Z. Cui, S. Xiao, J. Feng, and S. Yan. Recurrently target-attending tracking. In CVPR, 2016.
[7] M. Danelljan, G. Hager, F. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In BMVC, 2014.
[8] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Convolutional features for correlation filter based visual tracking. In ICCV Workshops, 2015.
[9] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In ICCV, 2015.
[10] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Adaptive decontamination of the training set: A unified formulation for discriminative visual tracking. In CVPR, 2016.
[11] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In ECCV, 2016.
[12] M. Danelljan, F. Shahbaz Khan, M. Felsberg, and J. Van de Weijer. Adaptive color attributes for real-time visual tracking. In CVPR, 2014.
[13] J. Gao, H. Ling, W. Hu, and J. Xing. Transfer learning based visual tracking with gaussian processes regression. In ECCV, 2014.
[14] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.-M. Cheng, S. L. Hicks, and P. H. Torr. Struck: Structured output tracking with kernels. IEEE TPAMI, 2016.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[16] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In ECCV, 2016.
[17] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In ECCV, 2012.
[18] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE TPAMI, 2015.
[19] S. Hong, T. You, S. Kwak, and B. Han. Online tracking by learning discriminative saliency map with convolutional neural network. In ICML, 2015.
[20] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao. Multi-store tracker (MUSTer): A cognitive psychology inspired approach to object tracking. In CVPR, 2015.
[21] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[22] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE TPAMI, 2012.
[23] D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[24] M. Kristan et al. The visual object tracking VOT2016 challenge results. In ECCV Workshops, 2016.
[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[26] H. Li, Y. Li, and F. Porikli. DeepTrack: Learning discriminative feature representations by convolutional neural networks for visual tracking. In BMVC, 2014.
[27] Y. Li, J. Zhu, and S. C. Hoi. Reliable patch trackers: Robust visual tracking by exploiting reliable patches. In CVPR, 2015.
[28] T. Liu, G. Wang, and Q. Yang. Real-time part-based visual tracking via adaptive correlation filters. In CVPR, 2015.
[29] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[30] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. In ICCV, 2015.
[31] C. Ma, X. Yang, C. Zhang, and M.-H. Yang. Long-term correlation tracking. In CVPR, 2015.
[32] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
[33] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
[34] J. Ning, J. Yang, S. Jiang, L. Zhang, and M.-H. Yang. Object tracking via dual linear structured SVM and explicit feature map. In CVPR, 2016.
[35] H. Possegger, T. Mauthner, and H. Bischof. In defense of color-based model-free tracking. In CVPR, 2015.
[36] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.-H. Yang. Hedged deep tracking. In CVPR, 2016.
[37] S. Salti, A. Cavallaro, and L. Di Stefano. Adaptive appearance modeling for video tracking: Survey and evaluation. IEEE TIP, 2012.
[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2014.
[39] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE TPAMI, 2014.
[40] R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. In CVPR, 2016.
[41] A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. In ACM Multimedia, 2015.
[42] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In ICCV, 2015.
[43] L. Wang, W. Ouyang, X. Wang, and H. Lu. STCT: Sequentially training convolutional networks for visual tracking. In CVPR, 2016.
[44] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In NIPS, 2013.
[45] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In CVPR, 2013.
[46] Y. Wu, J. Lim, and M.-H. Yang. Object tracking benchmark. IEEE TPAMI, 2015.
[47] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys, 2006.
[48] J. Zhang, S. Ma, and S. Sclaroff. MEEM: Robust tracking via multiple experts using entropy minimization. In ECCV, 2014.
[49] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M.-H. Yang. Fast visual tracking via dense spatio-temporal context learning. In ECCV, 2014.
[50] G. Zhu, F. Porikli, and H. Li. Beyond local search: Tracking objects everywhere with instance-specific proposals. In CVPR, 2016.