
DeepTrack: Learning Discriminative Feature Representations by Convolutional Neural Networks for Visual Tracking

Hanxi Li 1,2

hanxi.li@nicta.com.au

Yi Li 1

http://users.cecs.anu.edu.au/~yili/

Fatih Porikli 1

http://www.porikli.com/

1 NICTA and ANU, Canberra, Australia

2 Jiangxi Normal University, Jiangxi, China

Defining hand-crafted feature representations needs expert knowledge, requires time-consuming manual adjustments, and is arguably one of the limiting factors of object tracking.

In this paper, we propose a novel solution that automatically relearns the most useful feature representations during the tracking process, in order to adapt accurately to appearance changes and pose and scale variations while preventing drift and tracking failures. We employ a candidate pool of multiple Convolutional Neural Networks (CNNs) as a data-driven model of different instances of the target object. Individually, each CNN maintains a specific set of kernels that favourably discriminate object patches from their surrounding background using all available low-level cues (Fig. 1). These kernels are updated in an online manner at each frame, after being trained with just one instance at the initialization of the corresponding CNN.

[Figure 1 content, condensed from the extracted image text: for each of the four low-level cues, a normalized 32 × 32 input patch is convolved with 13 × 13 kernels into 9 feature maps of 20 × 20 (Layer-1), subsampled 2 × 2 into 9 maps of 10 × 10 (Layer-2), convolved with 9 × 9 kernels into 18 maps of 2 × 2 (Layer-3), subsampled 2 × 2 into an 18 × 1 feature vector, and fully connected to a 2 × 1 output label. In the class-specific version (red box), the Layer-1 filters of a class-specific cue are pre-learned from training faces.]

Figure 1: Overall architecture with (red box) and without (rest) the class-specific version.
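For concreteness, the per-cue network of Figure 1 can be written down in a few lines. The following is a minimal sketch in PyTorch (an illustrative reimplementation, not the authors' code); the sigmoid activations and average-pooling subsampling are assumptions not fixed by the figure:

```python
import torch
import torch.nn as nn

class CueCNN(nn.Module):
    """Per-cue network following Figure 1: 32x32 patch -> 2x1 label."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 9, kernel_size=13),  # Layer-1: 9 maps, 32x32 -> 20x20
            nn.Sigmoid(),                     # activation assumed, not specified
            nn.AvgPool2d(2),                  # subsampling: 20x20 -> 10x10 (Layer-2)
            nn.Conv2d(9, 18, kernel_size=9),  # Layer-3: 18 maps, 10x10 -> 2x2
            nn.Sigmoid(),
            nn.AvgPool2d(2),                  # 2x2 -> 1x1, i.e. an 18x1 feature vector
        )
        self.classifier = nn.Linear(18, 2)    # fully connected to a 2x1 label

    def forward(self, x):                     # x: (N, 1, 32, 32) normalized patches
        f = self.features(x).flatten(1)       # (N, 18)
        return self.classifier(f)             # (N, 2) object-vs-background scores

patch = torch.randn(1, 1, 32, 32)
print(CueCNN()(patch).shape)                  # torch.Size([1, 2])
```

With 13 × 13 and 9 × 9 kernels on a 32 × 32 input, the spatial sizes reduce exactly as in Figure 1: 20 × 20, then 10 × 10, then 2 × 2, and finally an 18 × 1 feature vector.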

Instead of learning one complicated and powerful CNN model for all the appearance observations in the past, we choose a relatively small number of filters in each CNN, within a framework equipped with a temporal adaptation mechanism (Fig. 2). Given a frame, the most promising CNNs in the pool are selected to evaluate the hypotheses for the target object. The hypothesis with the highest score is assigned as the current detection window, and the selected models are retrained using a warm-start backpropagation which optimizes a structural loss function.

[Figure 2 content, condensed: a pool of CNNs (CNN-1, CNN-2, CNN-3, CNN-4, …, CNN-20), each paired with a stored prototype, is maintained along the time axis (example frames #34, #162, #208). Every frame runs a test, estimate, train cycle: ① find the nearest CNNs via prototype matching; ② run those CNNs; ③ select and rank their hypotheses; ④ warm-start training from the selected CNN (e.g. CNN-3); ⑤ decide whether a new CNN is needed; ⑥ either update the selected CNN or add a new CNN to the pool.]

Figure 2: Illustration of the temporal adaptation mechanism.
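In code, the per-frame loop of Figure 2 has roughly the following shape. This is a structural sketch only: `prototype_distance`, `evaluate` and `retrain` are hypothetical placeholders for the paper's prototype matching, hypothesis scoring and warm-start backpropagation, and `k` and `tau` are illustrative parameters, not values from the paper:

```python
# Hypothetical skeleton of the per-frame temporal adaptation (Figure 2).
# All method names and thresholds are placeholders, not the authors' API.

def track_frame(frame, candidates, pool, k=3, tau=0.5):
    # (1) nearest CNNs via prototype matching
    nearest = sorted(pool, key=lambda cnn: cnn.prototype_distance(frame))[:k]
    # (2)+(3) score every candidate window with the nearest CNNs, then
    # select and rank: the highest-scoring hypothesis is the detection window
    scored = ((cnn.evaluate(frame, w), cnn, w)
              for cnn in nearest for w in candidates)
    _, best_cnn, best_window = max(scored, key=lambda t: t[0])
    # (4) warm-start retraining from the selected CNN
    updated = best_cnn.retrain(frame, best_window)
    # (5)+(6) if the new appearance is far from every stored prototype,
    # add a new CNN to the pool; otherwise update the selected one in place
    if min(cnn.prototype_distance(frame) for cnn in pool) > tau:
        pool.append(updated)
    else:
        pool[pool.index(best_cnn)] = updated
    return best_window
```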

Our experiments on a large selection of videos from the recent benchmarks demonstrate that our method outperforms the existing state-of-the-art algorithms and rarely loses track of the target object. We evaluate our method on 16 benchmark video sequences that cover the most challenging tracking scenarios, such as scale changes, illumination changes, occlusions, cluttered backgrounds and fast motion (Fig. 3).

[Figure 3 content, condensed: left panel, Precision plots (Precision vs. location error threshold, 0 to 50 pixels); right panel, Success plots (Success rate vs. overlap threshold, 0 to 1). Compared trackers: CNN (ours), SCM, Struck, CXT, ASLA, TLD, IVT, MIL, Frag and CPF.]

Figure 3: The Precision Plot (left) and the Success Plot (right) of the compared visual tracking methods.
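Both plots follow the standard benchmark protocol: precision is the fraction of frames whose predicted center lies within a given pixel distance of the ground truth, and success is the fraction of frames whose predicted box overlaps the ground truth beyond a given IoU threshold. A minimal sketch of the two curves, assuming boxes are stored as (x, y, w, h) rows of NumPy arrays:

```python
import numpy as np

def center_errors(pred, gt):
    """Euclidean distance between predicted and ground-truth box centers.
    Boxes are (N, 4) arrays in (x, y, w, h) format."""
    cp = pred[:, :2] + pred[:, 2:] / 2.0
    cg = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(cp - cg, axis=1)

def iou(pred, gt):
    """Intersection-over-union of two (N, 4) box arrays in (x, y, w, h)."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def precision_curve(pred, gt, thresholds=np.arange(0, 51)):
    """Precision plot: fraction of frames with center error <= threshold."""
    err = center_errors(pred, gt)
    return np.array([(err <= t).mean() for t in thresholds])

def success_curve(pred, gt, thresholds=np.linspace(0, 1, 21)):
    """Success plot: fraction of frames with IoU > threshold."""
    ov = iou(pred, gt)
    return np.array([(ov > t).mean() for t in thresholds])
```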

In certain applications, the target object belongs to a known class of objects, such as human faces. Our method can use this prior information to improve tracking performance by training a class-specific detector. In the tracking stage, given the particular instance information, one needs to combine the class-level detector and the instance-level tracker in a certain way, which usually leads to higher model complexity (Fig. 4).
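One plausible reading of the class-specific branch in Figure 1 is to pre-learn the Layer-1 filters offline on faces and keep them fixed while the remaining layers adapt online to the tracked instance. A hedged sketch, reusing the hypothetical `CueCNN` module above (the freezing policy is an assumption, not the paper's stated procedure):

```python
def make_class_specific(pretrained_face_cnn: CueCNN) -> CueCNN:
    """Initialize a tracker CNN whose first layer carries filters
    pre-learned on faces (the class-specific cue); only the remaining
    layers adapt online to the particular instance being tracked."""
    tracker = CueCNN()
    tracker.features[0].load_state_dict(
        pretrained_face_cnn.features[0].state_dict())
    for p in tracker.features[0].parameters():
        p.requires_grad = False   # keep the class-level filters fixed
    return tracker
```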

[Figure 4 content, condensed: Precision plots (left) and Success plots (right) with the same axes as Figure 3, comparing Class-CNN, CNN and Normal-SGD.]

Figure 4: The Precision Plot (left) and the Success Plot (right) of the compared variants (class-specific CNN, baseline CNN, and normal SGD training).

To conclude, we introduced DeepTrack, a CNN-based online object tracker. We employed a CNN architecture and a structural loss function that handle multiple input cues and class-specific tracking. We also proposed an iterative procedure which speeds up the training process significantly. Together with the CNN pool, our experiments demonstrate that DeepTrack performs very well on the 16 sequences.