DeepTrack: Learning Discriminative Feature Representations by Convolutional Neural Networks for Visual Tracking

Hanxi Li 1,2
[email protected]

Yi Li 1
http://users.cecs.anu.edu.au/~yili/

Fatih Porikli 1
http://www.porikli.com/

1 NICTA and ANU, Canberra, Australia
2 Jiangxi Normal University, Jiangxi, China

Defining hand-crafted feature representations demands expert knowledge and time-consuming manual adjustment; it is arguably one of the limiting factors of object tracking.

In this paper, we propose a novel solution that automatically relearns the most useful feature representations during the tracking process in order to adapt accurately to appearance changes and pose and scale variations while preventing drift and tracking failures. We employ a candidate pool of multiple Convolutional Neural Networks (CNNs) as a data-driven model of different instances of the target object. Each CNN maintains a specific set of kernels that favourably discriminate object patches from their surrounding background using all available low-level cues (Fig. 1). After being trained with just one instance at the initialization of the corresponding CNN, these kernels are updated in an online manner at each frame.

[Figure 1 diagram: each cue (Cue-1 to Cue-4, plus an optional pre-learned class-specific cue trained on e.g. faces) feeds a 32 × 32 input patch into a small CNN: a 13 × 13 convolution producing 9 feature maps of 20 × 20, 2 × 2 subsampling to 9 maps of 10 × 10, a 9 × 9 convolution producing 18 maps of 2 × 2, 2 × 2 subsampling to an 18 × 1 feature vector, and a fully connected layer giving a 2 × 1 output label.]

Figure 1: Overall architecture, with the class-specific cue (red box) and without it (rest).
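From the layer sizes recoverable in Figure 1, one per-cue network can be sketched as below. This is a minimal sketch assuming PyTorch; the activation functions and pooling type are our assumptions, since the abstract does not specify them.

```python
# A minimal sketch of one per-cue CNN, with layer sizes read off Figure 1:
# 32x32 input patch -> 13x13 conv -> 9 maps of 20x20 -> 2x2 subsampling
# -> 9 maps of 10x10 -> 9x9 conv -> 18 maps of 2x2 -> 2x2 subsampling
# -> 18x1 feature vector -> fully connected -> 2x1 output label.
# Tanh activations and average pooling are assumptions, not from the paper.
import torch
import torch.nn as nn

class CueCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 9, kernel_size=13),  # 1x32x32 -> 9x20x20
            nn.Tanh(),
            nn.AvgPool2d(2),                  # -> 9x10x10
            nn.Conv2d(9, 18, kernel_size=9),  # -> 18x2x2
            nn.Tanh(),
            nn.AvgPool2d(2),                  # -> 18x1x1, i.e. an 18-d vector
        )
        self.classifier = nn.Linear(18, 2)    # fully connected -> 2x1 label

    def forward(self, patch):                 # patch: (N, 1, 32, 32)
        f = self.features(patch).flatten(1)   # (N, 18) feature vector
        return self.classifier(f)             # object-vs-background scores
```

In the full model one such network is instantiated per cue, and the class-specific variant adds a further pre-learned cue (red box in Fig. 1).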

Instead of learning one complicated and powerful CNN model for all the appearance observations in the past, we choose a relatively small number of filters in each CNN within a framework equipped with a temporal adaptation mechanism (Fig. 2). Given a frame, the most promising CNNs in the pool are selected to evaluate the hypotheses for the target object. The hypothesis with the highest score is assigned as the current detection window, and the selected models are retrained using warm-start backpropagation, which optimizes a structural loss function.

[Figure 2 diagram: a CNN pool (CNN-1, CNN-2, ..., CNN-20) with stored prototypes; at example frames #34, #162, and #208 a test/estimate/train cycle runs along the time axis: (1) find the nearest CNNs via prototype matching, (2) perform the nearest CNNs, (3) select & rank, (4) train from the best CNN (e.g. CNN-3), (5) decide whether a new CNN is needed, and (6) either update the existing CNN or add a new one.]

Figure 2: Illustration of the temporal adaptation mechanism.
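The per-frame loop of Figure 2 can be sketched as follows. This is a schematic reading, assuming PyTorch and the CueCNN sketch above; the pool query, confidence threshold, learning rate, and the overlap-weighted MSE form of the structural loss are all illustrative assumptions, not the paper's exact procedure.

```python
# Schematic sketch of the temporal adaptation loop in Figure 2.
# All names here are illustrative, not the authors' API.
import torch
import torch.nn.functional as F

def iou(a, b):
    # IoU of axis-aligned boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def track_frame(pool, prototypes, patches, boxes, k=3,
                warm_start_steps=5, new_cnn_threshold=0.5):
    # pool: list of CueCNN (sketched above); prototypes: (len(pool), D)
    # tensor summarizing what each CNN was trained on, with D matching the
    # flattened patch summary below; patches: (N, 1, 32, 32) candidate
    # windows; boxes: their (x1, y1, x2, y2) coordinates.
    query = patches.mean(dim=0).flatten()          # crude frame summary
    nearest = torch.cdist(query[None], prototypes)[0].argsort()[:k]
    # (2)-(3) Run the nearest CNNs on every hypothesis; select & rank.
    best = (-1.0, 0, 0)                            # (score, cnn idx, box idx)
    with torch.no_grad():
        for i in nearest.tolist():
            scores = pool[i](patches).softmax(1)[:, 1]
            j = int(scores.argmax())
            if scores[j] > best[0]:
                best = (float(scores[j]), i, j)
    conf, i_star, j_star = best
    # (4)/(6) Warm-start retraining: a few gradient steps from the current
    # weights rather than training from scratch. Targets soften with overlap
    # against the estimated window -- one plausible reading of the structural
    # loss, not necessarily the paper's exact form.
    target = torch.tensor([iou(b, boxes[j_star]) for b in boxes])
    opt = torch.optim.SGD(pool[i_star].parameters(), lr=1e-2)
    for _ in range(warm_start_steps):
        opt.zero_grad()
        p = pool[i_star](patches).softmax(1)[:, 1]
        F.mse_loss(p, target).backward()
        opt.step()
    # (5) If even the best CNN is unconfident, add a fresh CNN to the pool.
    if conf < new_cnn_threshold:
        pool.append(CueCNN())
    return boxes[j_star], i_star
```

The warm start is what makes per-frame retraining affordable: each selected CNN resumes from its previous weights instead of relearning its filters.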

Our experiments on a large selection of videos from recent benchmarks demonstrate that our method outperforms the existing state-of-the-art algorithms and rarely loses track of the target object. We evaluate our method on 16 benchmark video sequences that cover the most challenging tracking scenarios, such as scale changes, illumination changes, occlusions, cluttered backgrounds, and fast motion (Fig. 3).

[Figure 3 plots: precision (y axis, 0–1) against location error threshold (x axis, 0–50) on the left, and success rate (y axis, 0–1) against overlap threshold (x axis, 0–1) on the right, comparing CNN (ours), SCM, Struck, CXT, ASLA, TLD, IVT, MIL, Frag, and CPF.]

Figure 3: The Precision Plot (left) and the Success Plot (right) of the compared visual tracking methods.
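For reference, the two curves are the standard benchmark metrics and can be computed from per-frame tracking results as in this sketch (NumPy assumed; the threshold ranges follow the axes in Fig. 3).

```python
# Sketch of the standard precision / success metrics behind Fig. 3.
import numpy as np

def precision_curve(center_errors, max_threshold=50):
    # center_errors: per-frame Euclidean distance in pixels between the
    # predicted and ground-truth box centers.
    thresholds = np.arange(0, max_threshold + 1)
    return np.array([(center_errors <= t).mean() for t in thresholds])

def success_curve(overlaps, num_thresholds=21):
    # overlaps: per-frame IoU between predicted and ground-truth boxes.
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return np.array([(overlaps > t).mean() for t in thresholds])
```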

In certain applications, the target object comes from a known class of objects, such as human faces. Our method can exploit this prior information to improve tracking performance by training a class-specific detector. In the tracking stage, given the particular instance information, one normally needs to combine the class-level detector and the instance-level tracker in some way, which usually leads to higher model complexity; in our architecture, the class knowledge instead enters as a pre-learned class-specific cue (the red box in Fig. 1, evaluated in Fig. 4).
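One way to realize the class-specific cue of Figure 1 is to pre-learn first-layer filters offline on class examples (e.g. training faces, as labelled in the figure) and feed their responses as an additional cue, so no separate class detector has to be fused at tracking time. The sketch below is our hedged reading in PyTorch; freezing the filters during tracking and the exact pre-learning procedure are assumptions, not details given in this abstract.

```python
# Hedged sketch of a pre-learned class-specific cue (red box in Figure 1).
import torch
import torch.nn as nn

class ClassSpecificCue(nn.Module):
    def __init__(self, pretrained_filters):
        # pretrained_filters: (9, 1, 13, 13) tensor learned offline on
        # class examples such as faces (assumed shape, matching Layer-1).
        super().__init__()
        self.conv = nn.Conv2d(1, 9, kernel_size=13, bias=False)
        self.conv.weight.data.copy_(pretrained_filters)
        self.conv.weight.requires_grad_(False)   # frozen while tracking

    def forward(self, patch):                    # patch: (N, 1, 32, 32)
        return torch.tanh(self.conv(patch))      # (N, 9, 20, 20) cue maps
```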

[Figure 4 plots: precision (y axis, 0–1) against location error threshold (x axis, 0–50) on the left, and success rate (y axis, 0–1) against overlap threshold (x axis, 0–1) on the right, comparing Class-CNN, CNN, and Normal-SGD.]

Figure 4: The Precision Plot (left) and the Success Plot (right) of the compared visual tracking methods.

To conclude, we introduced DeepTrack, a CNN-based online object tracker. We employed a CNN architecture and a structural loss function that handle multiple input cues and class-specific tracking. We also proposed an iterative training procedure that speeds up the training process significantly. Together with the CNN pool, our experiments demonstrate that DeepTrack performs very well on the 16 sequences.