Coreset-Based Adaptive Tracking

Abhimanyu Dubey, Nikhil Naik, Dan Raviv, Rahul Sukthankar, and Ramesh Raskar

Abstract—We propose a method for learning from streaming visual data using a compact, constant-size representation of all the data that was seen until a given moment. Specifically, we construct a “coreset” representation of streaming data using a parallelized algorithm; a coreset is an approximation of a set with respect to the squared distances between this set and all other points in its ambient space. We learn an adaptive object appearance model from the coreset tree in constant time and logarithmic space and use it for object tracking by detection. Our method obtains excellent results for object tracking on three standard datasets over more than 100 videos. The ability to summarize data efficiently makes our method ideally suited for tracking in long videos in the presence of space and time constraints. We demonstrate this ability by outperforming a variety of algorithms on the TLD dataset, whose videos contain 2685 frames on average. This coreset-based learning approach can be applied both for real-time learning from small, varied data and for fast learning from big data.

Index Terms—Object tracking, learning from video, object recognition, feature measurements


1 INTRODUCTION

Real-world computer vision systems are increasingly required to provide robust real-time performance under memory and bandwidth constraints for tasks such as tracking, on-the-fly object detection, and visual search. Robots with visual sensors may be required to perform all of these tasks while interacting with their environment. At another extreme, IP cameras, life-logging devices, and other visual sensors may need to hold a summary of the data stream seen over a long time. Hence, new algorithms that enable real-world computer vision applications under a tight space-time budget are a necessity. In this paper, we propose new algorithms with low space-time complexity for learning from streaming visual data, inspired by research in computational geometry.

In recent years, computational geometry researchers have proposed techniques to obtain small representative sets of points from larger sets with constant or logarithmic space-time complexity [5], [14], [16]. We propose to adapt these techniques to retain a summary of visual data and train object appearance models in constant time and space from this summary. Specifically, we perform summarization of a video stream based on “coresets”, which are defined as small sets of points that approximately maintain the properties of the original set with respect to the distance between the set and other structures in its ambient space, with theoretical guarantees [18]. We demonstrate that the coreset formulation is useful for learning object appearance models for detection and tracking. Coresets have primarily been used as a technique to convert “big data” into manageable sizes [29]. In contrast, we employ coresets for reducing small data to “tiny data”, and perform real-time classifier training.

We use the object appearance model learned from the coreset representation for visual tracking by detection.

• Abhimanyu Dubey is an undergraduate student at the Indian Institute of Technology, New Delhi, India.

• Rahul Sukthankar is a research scientist at Google.

• Nikhil Naik, Dan Raviv, and Ramesh Raskar are with the Media Laboratory at the Massachusetts Institute of Technology.

Tracking systems typically retain just an appearance model for the object [22], [38] and, optionally, data from some prior frames [15], [33]. In contrast, we compute a summary of all the previous frames in constant time (on average) and update our appearance model continuously with this summary. The summarization process improves the appearance model with time. As a result, our method is beneficial for tracking objects in very long video sequences that need to deal with variation in lighting, background, and object appearance, along with the object leaving and re-entering the video.

We demonstrate the potential of this technique with a simple tracker that combines detections from a linear SVM trained from the coreset with a Kalman filter. We call this no-frills tracker the “Coreset Adaptive Tracker (CAT)”. CAT obtains competitive performance on three standard datasets: the Princeton Tracking Benchmark [32], the TLD dataset [23], and the CVPR2013 Tracking Benchmark [35]. Our method is found to be superior to popular methods for tracking video sequences from the TLD dataset.

We must emphasize that the primary goal of this paper is not to develop the best algorithm for tracking—which is evident from our simple design. Rather, our goal is to demonstrate the potential of coresets as a data reduction technique for learning from visual data. We summarize our contributions below.

1.1 Contributions

• We propose an algorithm for learning and detection of objects from streaming data by training a classifier in constant time using coresets. We construct coresets using a parallel algorithm and prove that this algorithm requires constant time (on average) and logarithmic space.

• We introduce the concept of hierarchical sampling from coresets, which improves learning performance significantly when compared to the original formulation [14].

• We obtain competitive results on standard benchmarks [23], [32], [35]. Our method is especially suited for longer videos (with several hundred frames), since it summarizes data from all the frames in the video so far.


2 RELATED WORK

First, we introduce the literature on coresets, followed by a summary of the prior art on online learning for tracking by detection.

Coresets and Data Reduction: The term coreset was first introduced by Agarwal et al. [1] in their seminal work on k-median and k-means problems. Badoiu et al. [5] propose a coreset for clustering using a subset of input points to generate the solution, while Har-Peled and Mazumdar [18] provide a different solution for coreset construction which includes points that may not be a subset of the original set. Feldman et al. [13] demonstrate that a weak coreset can be generated independent of the number of points in the data or their feature dimensions. We follow the formulation of Feldman et al. [14], which provides an eminently parallelizable streaming algorithm for coreset construction with a theoretical error bound. This formulation has recently been used for applications in robotics [12] and for summarization of time series data [29].

Online Learning for Detection and Tracking: The problem of object tracking by detection has been well studied in computer vision [31]. Early important examples of learning-based methods include [7], [10], [22]. Ross et al. [30] learn a low-dimensional subspace representation of object appearance using incremental algorithms for PCA. Li et al. [27] propose a compressive sensing tracking method using efficient ℓ1-norm minimization. Discriminative algorithms solve the tracking problem using a binary classification model for object appearance. Avidan [2] combines optical flow with SVMs and performs tracking by minimizing the SVM classification score, and extends this method with an “ensemble tracker” using AdaBoost [3]. Grabner et al. [17] develop a semi-supervised boosting framework to tackle the problem of drift due to incorrect detections. Kwon et al. [25] represent the object using multiple appearance models, obtained from sparse PCA, which are integrated using an MCMC framework. Babenko et al. [4] introduce the semi-supervised multiple instance learning (MIL) algorithm. Hare et al. [19] learn an online structured output SVM classifier for adaptive tracking. Zhang et al. [37] develop an object appearance model based on efficient feature extraction in the compressed domain and use the features discriminatively to separate the object from its background. Kalal et al. [23] introduce the TLD algorithm, where a pair of “expert” learners select positive and negative examples in a semi-supervised manner for real-time tracking. Gao et al. [15] combine the results of a tracker learnt using prior knowledge from auxiliary samples and another tracker learnt with samples from recent frames. Our work contributes to this literature with a novel coreset representation for model-based tracking.

3 CONSTRUCTING CORESETS

In this section, we introduce the key theoretical ideas behind coresets and low-rank approximation, and provide new results for the time and space complexity of a parallel algorithm for coreset construction.

3.1 Coresets

We denote time series data of d dimensions as rows in a matrix A, and search for a subset or a subspace of A in domain S such that the sum of squared distances from the rest of the domain to this summarized structure is a (1+ε)-approximation of that of A, where ε is a small constant (0 ≤ ε ≤ 1). Specifically, we call Ã a “coreset” of A if

(1 − ε) · dist²(A, S) ≤ dist²(Ã, S) + c ≤ (1 + ε) · dist²(A, S),   (1)

where dist²(A, S) denotes the sum of squared distances from each row of A to its closest point in S, and c is a constant. Eq. (1) can be rewritten in matrix form using the Frobenius norm [14] as

(1 − ε) · ‖AY‖²_F ≤ ‖ÃY‖²_F + c ≤ (1 + ε) · ‖AY‖²_F,   (2)

where Y is a d × (d − k) orthogonal matrix. Going forward, we write Ã ≈ A if Ã is a (1 + ε)-approximation of A.
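To make the definition concrete, the listing below is a minimal numerical sketch in C++ with Eigen (the linear algebra library our implementation in Section 5 lists); the code and its names are illustrative and not part of our system. It builds a small SVD-based summary C of A, in the spirit of the merge step of Section 4.3, and prints both sides of the cost in Eq. (2) for one particular Y:

#include <iostream>
#include <Eigen/Dense>

// ||MY||_F^2, the sum-of-squared-distances cost appearing in Eq. (2).
double frobCost(const Eigen::MatrixXd& M, const Eigen::MatrixXd& Y) {
  return (M * Y).squaredNorm();
}

int main() {
  const int p = 200, d = 16, k = 4, n = 8;
  Eigen::MatrixXd A = Eigen::MatrixXd::Random(p, d);

  // Y spans the (d - k)-dimensional orthogonal complement of the best
  // rank-k subspace of A (its trailing right singular vectors).
  Eigen::JacobiSVD<Eigen::MatrixXd> svd(A, Eigen::ComputeThinV);
  Eigen::MatrixXd Y = svd.matrixV().rightCols(d - k);

  // A small n x d stand-in for a coreset node: keep the top n singular
  // directions of A, as in the SVD-based merge of Section 4.3.
  Eigen::MatrixXd C = svd.singularValues().head(n).asDiagonal()
                      * svd.matrixV().leftCols(n).transpose();

  std::cout << "||AY||_F^2 = " << frobCost(A, Y)
            << "  vs  ||CY||_F^2 = " << frobCost(C, Y) << "\n";
}

The gap between the two printed costs is the tail energy dropped by the summary, which the constant c in Eqs. (1) and (2) absorbs.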

The best low-dimensional approximation can be obtained by calculating the k-dimensional eigen-structure, e.g., using Singular Value Decomposition (SVD). Several results have recently been proposed for obtaining a low-rank approximation of the data for summarization more efficiently than SVD. However, very few of these algorithms are designed for a streaming setting [9], [14], [16], [28].

Motivated by [14], we adopt a tree formulation to approximate the low-rank subspace which best represents the data. Unlike other methods, the tree formulation has sufficiently low time complexity for a real-time learner, considering the constants. Using this formulation, the time complexity of the summarization and learning steps is constant (on average) in between observations. In the next section we formalize and prove time and space guarantees for this approach.

3.2 Coreset Tree

The key idea in a coreset tree is that a low-rank approximation can be “approximated” using a merge-and-reduce technique. The origin of this approach goes back to Bentley and Saxe [6], and it was further adapted for coresets in [1]. In this work, following Feldman et al. [14], we generate a tree structure which holds a summary of the streaming data.

We denote Ci ∈ ℝ^(n×d) as a leaf in the tree constructed from n rows of d-dimensional features. The total number of data points captured until time t is denoted by p(t). For ease of calculation, we define q(t) by 2^q(t) = p(t)/n. Thus, ⌊q(t)⌋ is the depth of the tree at time t. We construct a binary tree from the streaming data, and compress the data by merging leaves whenever two adjacent leaves are formed at the same level. If p(t) = n·2^q₀, only the root node remains in memory after repeated merging of leaves. For the next n·2^q₀ streamed points, the right side of the binary tree keeps growing and merging, until it is reduced to a single node again when n·2^(q₀+1) data points have arrived (Figure 1).

If Ci and Cj are two (1 + ε) coresets, then their row-wise concatenation [Ci; Cj] is also a (1 + ε) coreset. Since

‖[Ci; Cj] Y‖²_F = ‖Ci Y‖²_F + ‖Cj Y‖²_F,   (3)


Fig. 1: Tree construction: Coresets are constructed using a merge-and-reduce approach. Whenever two coreset nodes are present at the same level, they are merged to construct a higher-level coreset (shown in green). The nodes that have been merged are deleted from memory (shown in red). We prove in Section 3 that the space requirement for tree construction is O(log p(t)) in the worst case, where p(t) is the number of data points at time t, and that on average the time complexity is constant per time step.

the proof follows trivially from the definition of a coreset. This identity implies that two nodes which are (1 + ε)-approximations can be concatenated and then re-compressed back to size n. This is a key feature of the tree formulation, which repeatedly performs compression from (2n × d) to (n × d). Each such computation can be considered constant in time and space. However, this formulation also reduces the quality of the approximation, since the merged coreset is a (1 + ε)²-approximation of the two nodes. For every level of the tree, a multiplicative error of (1 + ε) is thus introduced on the path from the leaf nodes to the root.

3.2.1 Data Approximation in the Tree

We denote the entire set up to time t by ∪t Ci, where the Ci are the leaves. We denote the coreset tree over these leaves by 2^(∪t Ci). It follows that

Theorem 1. If, for all i, C̃i is a (1 + ε/q(t))-approximation of Ci, then the collapse of the tree 2^(∪t Ci) (i.e., the root) is a (1 + ε/3) coreset of ∪t Ci.

Proof. We concatenate and compress sets of two leaves at every level of the tree. Compressing a set of 2n × d points into a set of n × d points introduces an additional multiplicative approximation factor of (1 + ε/q(t)). Since the maximum depth of the tree at time t is q(t), and

(1 + ε/q(t))^q(t) ≤ exp(ε/6) < 1 + ε/3   (4)

for ε < 0.1, we conclude that if we compress 2n × d points to n × d with a (1 + ε/q(t)) approximation, the root of the tree will be a (1 + ε/3) approximation of the entire data ∪t Ci.

3.2.2 Space and Time Complexity

We now provide new proofs for the time and space complexity of coreset tree generation, per time step and for the entire set.

Theorem 2. The space complexity to hold a 2^(∪t Ci) tree in memory is O(nd·q(t)), where n is the size of the coreset of points with dimension d.

Proof. In a coreset tree, two children of the same parent node cannot be in the memory stack simultaneously, which implies that, between two full collapses of the tree, the memory stack grows at most by O(nd), the size of one leaf. A full contraction is contained within a constant-size memory, S(n·2^q(t)) = 1. If the data size between two consecutive contractions is p(t), we know that n·2^q(t) < p(t) < n·2^(q(t)+1). Therefore,

S(p(t)) ≤ nd + max S(p′(t)),  where n·2^(q(t)−1) < p′(t) < n·2^q(t).   (5)

This recursive formulation results in a space requirement of S(p) ≤ O(nd·log(p(t)/n)) = O(nd·q(t)). Assuming n and d are constants chosen a priori, the memory complexity is logarithmic in the number of points seen so far.

Theorem 3. The time complexity of coreset tree construction between time steps t and t + 1 is O(log(p(t)/n)·nd²) in the worst case, which is logarithmic in p, the total number of points.

Proof. In the worst case, we perform log(p(t)/n) compressions, from the lowest level all the way to the root. The compressions are performed using SVD (described in detail in Section 4.3). Since the time complexity of each SVD is O(nd²), assuming n > d,

T(t + 1) − T(t) ≤ O(log(p(t)/n) · nd²) = O(q(t) · nd²) = O(q(t)),   (6)

where n and d are considered constants.

This worst-case scenario occurs for the last data point of a perfect subtree (p(t) = n·2^q(t)).

Finally, we wish to explore the total and average time complexity to summarize p points. On average, the time complexity is constant and does not depend on p.

Theorem 4. The time complexity for a complete collapse of a tree to a single coreset of size n × d is O(pd²), where d is the dimension of each point and p is the number of points. The average time complexity per point is constant.

Proof. The set of p points is split into p/n leaves of dimension n × d each. The time complexity of SVD is O(nd²), assuming n > d. A total of p(t)/n SVDs will be computed for compressing a perfect tree. Hence the total time complexity T(t) for all cycles until time t becomes

T(t) = O((p(t)/n) · nd²) = O(p(t) · d²) = O(p(t)),   (7)

assuming d is constant. We further infer that the average time complexity T̄(t) is O(1), as

T̄(t) = T(t)/p(t) = O(p(t) · d²/p(t)) = O(d²) = O(1).   (8)
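To make the count in Eq. (7) explicit: a perfect tree over p(t)/n leaves performs p(t)/n − 1 merges, each a single SVD of a (2n × d) matrix, so

T(t) ≤ (p(t)/n − 1) · O(nd²) = O(p(t) · d²),

and dividing by p(t) recovers the O(d²) = O(1) average in Eq. (8).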


Fig. 2: Tracking by detection using coresets: Our system for learning from streaming data contains three main components, which run in parallel. (a) Compression using coresets: For each frame, we extract HOG features from the object. After n frames are processed, a new leaf is inserted into the coreset tree. The tree is compressed recursively by merging nodes at the same level using SVD. (b) Training: A linear SVM for object appearance is trained using data points from the coreset tree after every n frames are processed. The maximum size of the training set is 2n × d. (c) Detection: The trained model is used for sliding-window detection on every new frame. Features extracted from the detected area are inserted into the coreset tree.

4 CAT: CORESET ADAPTIVE TRACKER

In the previous section, we described an efficient method for obtaining a summarization, with strong theoretical guarantees, of all the data seen up to a given point in a streaming setting. In this section, we use the coreset tree for tracking by detection with a coreset appearance model for the object.

4.1 Coreset Tree Implementation

The coreset tree is stored as a stack of coreset nodes SC. Each coreset node SC[i] has two attributes, level and features. The attribute level denotes the height of the coreset node in the tree, and features are the image features extracted from a frame. In a streaming setting, once n frames enter the frame stack, we create a coreset node with level = 0. We merge two coreset nodes when they have the same level.

4.2 Coreset Appearance Model

In the coreset tree, any node at level q with size n represents n·2^q frames, since it is formed by repeatedly merging 2^q leaf nodes. Nodes at different levels belong to different temporal instances of the streaming data. Nodes higher up in the tree hold older training samples, whereas a node with level 0 contains the last n frames. This is not ideal for training a tracking system, since more recent frames should have a representation in the learning process. To account for recent data, we propose a hierarchical sampling of the entire tree, constructing an at most (2n × d)-size summarization of the data seen until a given time. For each SC[i], we calculate ki = 2^(level(SC[i]) − level(top(SC))), and select the first n·ki elements from SC[i]. This summarized data constitutes our “coreset appearance model”, which is used to train an object detector.
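A minimal sketch of this sampling rule follows (illustrative code with Eigen; we read top(SC) as the highest-level node in the stack, under which ki ≤ 1 for every node and the row counts form a geometric series bounded by 2n):

#include <algorithm>
#include <vector>
#include <Eigen/Dense>

// One node of the coreset stack S_C (illustrative names).
struct CoresetNode {
  int level;                 // height in the coreset tree
  Eigen::MatrixXd features;  // n x d summary matrix
};

// Build the coreset appearance model: node i contributes its first
// n * 2^(level_i - level_max) rows, at most 2n rows in total.
Eigen::MatrixXd appearanceModel(const std::vector<CoresetNode>& stack) {
  int maxLevel = 0;
  for (const CoresetNode& node : stack)
    maxLevel = std::max(maxLevel, node.level);

  std::vector<Eigen::MatrixXd> parts;
  Eigen::Index total = 0;
  for (const CoresetNode& node : stack) {
    Eigen::Index take = node.features.rows() >> (maxLevel - node.level);
    take = std::max<Eigen::Index>(take, 1);  // keep at least one row
    parts.push_back(node.features.topRows(take));
    total += take;
  }

  Eigen::MatrixXd model(total, stack.front().features.cols());
  Eigen::Index row = 0;
  for (const Eigen::MatrixXd& part : parts) {
    model.middleRows(row, part.rows()) = part;
    row += part.rows();
  }
  return model;  // at most 2n x d training matrix
}

For a stack holding levels (3, 2, 1, 0), for example, the nodes contribute n, n/2, n/4, and n/8 rows, respectively.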

4.2.1 Advantages of Hierarchical Sampling

While the coreset tree formulation is inspired by Feldman et al. [14], our idea of hierarchically sampling the entire tree allows us to account for recent data, which is crucial for many learning problems that need to avoid a temporal bias in the training set, and especially for tracking by detection. We demonstrate in Table 5 that hierarchical sampling improves the learning performance significantly as compared to just using the data from the root of the tree.

4.3 System Implementation

In the first frame, user input is used to select a region of interest (ROI) for the object to be tracked. We resize the ROI to a 64 × 64 square, from which we extract 2 × 2 HOG features from the RGB channels. We also extract HOG features from small affine transformations of the input frame and add them to our frame stack. The affine transformations include scaling, translation, and rotation. The number of affine transformations is determined by the number of additional frames required to make the frame stack size reach a maximum of 25% of the total video length. For example, if the coreset size is 5% of the total length of the video, we add 4 transformations per frame. This process gives us the advantage of having more—albeit synthetically generated—samples to train the classifier, and makes the tracker robust to small object scale changes and translations.
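The extraction step could be sketched as follows (illustrative OpenCV code; the exact HOG geometry behind “2 × 2” is not fixed above, so the window, block, stride, and cell parameters here are assumptions):

#include <vector>
#include <opencv2/opencv.hpp>

// Illustrative per-channel HOG extraction on a 64 x 64 ROI.
std::vector<float> extractFeatures(const cv::Mat& roi) {
  cv::Mat patch;
  cv::resize(roi, patch, cv::Size(64, 64));

  // 64x64 window, 16x16 blocks, 8x8 block stride, 8x8 cells, 9 bins.
  cv::HOGDescriptor hog(cv::Size(64, 64), cv::Size(16, 16),
                        cv::Size(8, 8), cv::Size(8, 8), 9);

  std::vector<cv::Mat> channels;
  cv::split(patch, channels);  // one HOG descriptor per color channel

  std::vector<float> feature;
  for (const cv::Mat& channel : channels) {
    std::vector<float> descriptor;
    hog.compute(channel, descriptor);
    feature.insert(feature.end(), descriptor.begin(), descriptor.end());
  }
  return feature;  // one row of the frame stack / coreset leaf
}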

Initially, since no coreset appearance model has been trained, we need to use an alternate method for tracking (we use SCM [38] in our current implementation). Once the first n frames have been tracked, the first leaf of the tree is formed. Data classification, compression, and training now run in parallel (Figure 2). These components are summarized below:

• Classification: Detect the object in the stream with the last trained classifier.

• Compression: Update the coreset tree by inserting new leaves and summarizing data whenever possible.

• Training: Train an SVM classifier from the hierarchical coreset appearance model.

Classification

If a classifier has been trained, we feed each sliding window to the classifier and obtain a confidence value. After hard thresholding and non-maximum suppression, we detect an ROI. To remove noise in the position of the ROI, we employ a Kalman filter. We use Expectation Maximization (EM) to obtain its parameters from the centers of the returned predictions of the previous n frames. The EM returns learnt parameters for prediction, which are used to estimate the latent position of the bounding box center. This latent position is used to update the Kalman filter parameters for the next iteration. Once the center of the bounding box is obtained, we calculate the true ROI. We extract features from the ROI and feed them to a feature stack SP. Until the first n frames arrive, we obtain the ROI from SCM [38], since no classifier has been trained yet. This process continues until the last frame.
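A sketch of the filtering step with OpenCV's cv::KalmanFilter is shown below (illustrative code with a standard constant-velocity model; the EM-fitted noise parameters described above are replaced by fixed placeholder covariances):

#include <opencv2/video/tracking.hpp>

// Illustrative smoothing of the detected bounding-box center.
cv::KalmanFilter makeCenterFilter(float x, float y) {
  cv::KalmanFilter kf(4, 2, 0);  // state [x, y, vx, vy], measurement [x, y]
  kf.transitionMatrix = (cv::Mat_<float>(4, 4) <<
      1, 0, 1, 0,
      0, 1, 0, 1,
      0, 0, 1, 0,
      0, 0, 0, 1);
  cv::setIdentity(kf.measurementMatrix);
  cv::setIdentity(kf.processNoiseCov, cv::Scalar::all(1e-3));      // placeholder
  cv::setIdentity(kf.measurementNoiseCov, cv::Scalar::all(1e-1));  // placeholder
  kf.statePost = (cv::Mat_<float>(4, 1) << x, y, 0, 0);
  return kf;
}

// Per frame: predict first, then correct with the detector's ROI center;
// the corrected state estimates the latent center position.
cv::Point2f smoothCenter(cv::KalmanFilter& kf, cv::Point2f detected) {
  kf.predict();
  cv::Mat state = kf.correct((cv::Mat_<float>(2, 1) << detected.x, detected.y));
  return cv::Point2f(state.at<float>(0), state.at<float>(1));
}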



Algorithm     Mean  | Target Type        | Target Size | Movement   | Occlusion | Motion Type
              Score | Human Animal Rigid | Large Small | Slow  Fast | Yes   No  | Active Passive
SOWP [24]     0.54  | 0.46  0.52   0.69  | 0.54  0.52  | 0.62  0.42 | 0.43 0.65 | 0.61   0.45
MUSTER [21]   0.54  | 0.47  0.52   0.69  | 0.53  0.53  | 0.63  0.40 | 0.45 0.61 | 0.63   0.45
CAT 25        0.52  | 0.52  0.55   0.63  | 0.49  0.49  | 0.65  0.38 | 0.41 0.58 | 0.59   0.44
KCF [20]      0.52  | 0.42  0.50   0.65  | 0.48  0.54  | 0.65  0.47 | 0.41 0.67 | 0.65   0.47
DSST [11]     0.49  | 0.43  0.52   0.68  | 0.52  0.53  | 0.62  0.39 | 0.46 0.61 | 0.62   0.43
MEEM [36]     0.49  | 0.43  0.49   0.56  | 0.50  0.48  | 0.64  0.43 | 0.41 0.60 | 0.57   0.46
SCM [38]      0.49  | 0.49  0.50   0.54  | 0.47  0.48  | 0.60  0.37 | 0.41 0.57 | 0.59   0.44
TGPR [15]     0.47  | 0.36  0.51   0.58  | 0.46  0.48  | 0.62  0.41 | 0.35 0.65 | 0.56   0.44
Struck [19]   0.44  | 0.35  0.47   0.53  | 0.45  0.44  | 0.58  0.39 | 0.30 0.64 | 0.54   0.41
TLD [23]      0.38  | 0.29  0.35   0.44  | 0.32  0.38  | 0.52  0.30 | 0.34 0.39 | 0.50   0.31

TABLE 1: We use the Princeton Tracking Benchmark [32] to compare the precision of CAT with other algorithms. CAT 25 leads the benchmark in 3 out of 11 categories.

Compression

If |SP| > n, we pop the top n elements SP,n from the feature stack and push a coreset node containing the elements SP,n into the coreset stack SC with level = 0. If level(SC[i]) = level(SC[i+1]), the two nodes are merged by constructing a coreset from SC[i] and SC[i+1]. For this process, we first compute an SVD of the concatenated matrix [Ci; Ci+1] in the form U·S·Vᵀ, where Ci and Ci+1 are the n × d matrices of features. We then compute the (n × d)-size coreset by multiplying S(n×d) with Vᵀ(d×d), where S(n×d) contains the first n rows of S. We store the resulting coreset in SC[i] and delete SC[i+1]. The level of node SC[i] is increased by one. This process continues recursively.
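A self-contained sketch of this merge step follows (illustrative code; Eigen's JacobiSVD is one possible choice of SVD routine):

#include <algorithm>
#include <Eigen/Dense>

// Merge two n x d coreset nodes: stack them, take the SVD of the 2n x d
// result, and keep S(n x d) * V^T, i.e. the n largest singular values
// paired with their right singular vectors.
Eigen::MatrixXd compress(const Eigen::MatrixXd& Ci, const Eigen::MatrixXd& Cj) {
  const Eigen::Index n = Ci.rows(), d = Ci.cols();
  Eigen::MatrixXd stacked(2 * n, d);
  stacked << Ci, Cj;  // row-wise concatenation [Ci; Cj]

  Eigen::JacobiSVD<Eigen::MatrixXd> svd(stacked, Eigen::ComputeThinV);
  const Eigen::VectorXd& s = svd.singularValues();

  // In the usual n > d case only the top d rows of S are nonzero, so the
  // remaining rows of the new node stay zero.
  const Eigen::Index k = std::min<Eigen::Index>(n, s.size());
  Eigen::MatrixXd node = Eigen::MatrixXd::Zero(n, d);
  node.topRows(k) = s.head(k).asDiagonal() * svd.matrixV().leftCols(k).transpose();
  return node;  // replaces SC[i]; its level is then increased by one
}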

Training

If at least one leaf has been constructed and the size of the coreset stack changes, we train a one-class linear SVM using samples from SC. We sample the data points for the negative class randomly from regions of the image that exclude the ROI. We generate an at most (2n × d)-size appearance model by hierarchical sampling of the tree, as described in Section 4.2.
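For illustration, the training and scoring calls could be wired up as follows (a sketch with OpenCV's ml module; the text cites a linear SVM implementation [8], and our actual implementation may differ in these details):

#include <opencv2/ml.hpp>

// Illustrative training call. samples stacks the hierarchical coreset
// rows (positives) and random non-ROI patches (negatives), one CV_32F
// feature vector per row; labels holds +1 / -1 per row as CV_32S.
cv::Ptr<cv::ml::SVM> trainDetector(const cv::Mat& samples, const cv::Mat& labels) {
  cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
  svm->setType(cv::ml::SVM::C_SVC);
  svm->setKernel(cv::ml::SVM::LINEAR);
  svm->setC(1.0);
  svm->train(samples, cv::ml::ROW_SAMPLE, labels);
  return svm;
}

// Sliding-window confidence: RAW_OUTPUT returns the signed distance to
// the separating hyperplane instead of the class label.
float confidence(const cv::Ptr<cv::ml::SVM>& svm, const cv::Mat& feature) {
  return svm->predict(feature, cv::noArray(), cv::ml::StatModel::RAW_OUTPUT);
}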

5 EVALUATION

We implement the tracking system in C++ with the Boost, Eigen, OpenCV, pThreads, and libFreenect libraries. We evaluate the performance of CAT using three publicly available datasets at a coreset size of 25%, and denote the tracker as CAT 25.

5.1 Princeton Tracking Benchmark

We first evaluate our algorithm using the Princeton Tracking Benchmark [32], which is a collection of 100 manually annotated RGBD tracking videos, with 214 frames on average. The videos contain significant variation in lighting, background, object size, orientation, pose, motion, and occlusion. We report the average precision over all frames, which is the ratio of overlap between the ground truth and the detected bounding box around the object. We report the performance of all algorithms (including CAT) for RGB data. CAT 25 obtains competitive performance on this dataset (Table 1), and is ranked in the top three, after SOWP [24] and MUSTER [21]. Note that we obtain reasonable precision in the presence of occlusions, even though we have no explicit mechanism for occlusion handling. Sample detections are shown in Figure 5.

Algorithm     Average | Motocross Panda VW    Car chase
#Frames       2685    | 2665      3000  8576  9928
CAT 25        0.66    | 0.85      0.58  0.67  0.85
SOWP [24]     0.66    | 0.87      0.65  0.76  0.82
KCF [20]      0.63    | 0.86      0.43  0.66  0.80
MUSTER [21]   0.62    | 0.84      0.57  0.69  0.77
DSST [11]     0.62    | 0.83      0.45  0.68  0.75
MEEM [36]     0.59    | 0.83      0.43  0.67  0.81
TLD [23]      0.57    | 0.86      0.25  0.67  0.81
Struck [19]   0.40    | 0.57      0.21  0.54  0.31
TGPR [15]     0.38    | 0.62      0.12  0.62  0.18
SCM [38]      0.36    | 0.50      0.13  0.47  0.14

TABLE 2: CAT obtains state-of-the-art performance on the TLD dataset [23]. We report the average success rate over all videos, as well as the success rate for the four longest videos.

5.2 TLD Dataset

The coreset formulation summarizes all the data it has seen so far. Therefore, we expect the coreset appearance model to adapt to pose and lighting variations as it trains on additional data, and to perform well on very long videos. To demonstrate this adaptive behavior, we use the TLD dataset [23], which contains ten videos with 2685 frames of labeled video on average. For all experiments with the TLD dataset (Tables 2, 3, 4, and 7) we report the success rate at a threshold of 0.5. Table 2 shows that CAT 25 is joint first on the TLD dataset, along with SOWP [24]. CAT 25 outperforms methods including MUSTER [21], KCF [20], and MEEM [36]. These results support our intuition that the ability of CAT to summarize all data provides a boost in performance over long videos.

5.3 CVPR2013 Tracking Benchmark

We also evaluate our algorithm using the CVPR2013 Tracking Benchmark [35], which includes 50 annotated sequences. We use two evaluation metrics to compare the performance of tracking algorithms—the precision plot and success plot—as defined in [35]. The areas under the curves of these plots provide the success score and precision score used to compare the algorithms. The strength of our method lies in tracking objects from videos with several hundred frames or longer, so we compare a variety of trackers using videos that are 500 frames or longer (21 out of 50 videos), using OPE, SRE, and TRE as described in [35]. Figure 3 shows that CAT 25 obtains satisfactory accuracy on this competitive dataset, though its performance is not as good as on the TLD dataset, possibly due to the smaller number of frames available in the CVPR2013 Tracking Benchmark.


[Figure: six panels of precision plots (success rate vs. location error threshold) and success plots (success rate vs. overlap threshold) under OPE, SRE, and TRE; each legend lists per-tracker area-under-curve scores, with CAT_25 at 0.689/0.541 (OPE precision/success), 0.679/0.525 (SRE), and 0.682/0.514 (TRE).]

Fig. 3: CAT 25 obtains competitive performance on the CVPR2013 Tracking Benchmark on videos with 500 frames or longer.


5.4 Tracking in Extremely Long Videos

To further explore the performance of CAT on long videos, we generate a new dataset containing extremely long videos (between 10,000 and 100,000 frames) from the four largest TLD videos. To create these videos, we append clips that are 10% of the length of the original video. The clips are randomly flipped, time-reversed, shifted in contrast and brightness, translated by small amounts, and corrupted with Gaussian noise. We evaluate CAT with small, fixed coreset sizes (between 1000 and 2500) that are independent of the video size. We find that CAT (n=2500) has a better success rate (Table 3) than other top-performing trackers, including SOWP, MUSTER, and MEEM. In addition, CAT shows an increase in success rate with an increasing number of frames. In contrast, the success rate of other trackers stays constant or decreases as the number of frames increases.
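The per-frame perturbations could be reproduced along the following lines (an OpenCV sketch; the exact magnitudes are not listed above, so every parameter range below is a guess):

#include <random>
#include <opencv2/opencv.hpp>

// Perturb one frame of an appended clip. Clip-level flips (cv::flip) and
// time reversal are applied when the clip is assembled and are not shown.
cv::Mat perturbFrame(const cv::Mat& frame, std::mt19937& rng) {
  std::uniform_real_distribution<double> contrast(0.8, 1.2);    // guessed range
  std::uniform_real_distribution<double> brightness(-20.0, 20.0);
  std::uniform_int_distribution<int> shiftPx(-5, 5);

  // Contrast and brightness shift: out = alpha * frame + beta.
  cv::Mat out;
  frame.convertTo(out, -1, contrast(rng), brightness(rng));

  // Small translation.
  cv::Mat M = (cv::Mat_<double>(2, 3) << 1, 0, shiftPx(rng),
                                         0, 1, shiftPx(rng));
  cv::Mat shifted;
  cv::warpAffine(out, shifted, M, out.size());

  // Additive Gaussian noise (saturating add for 8-bit images).
  cv::Mat noise(shifted.size(), shifted.type());
  cv::randn(noise, cv::Scalar::all(0), cv::Scalar::all(5));
  shifted += noise;
  return shifted;
}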

5.5 Coreset Size Selection

The coreset size n determines the time and space complexity of CAT. In a majority of experiments, we evaluate the performance of CAT at a coreset size of 25% (CAT 25) of the total video size. Using such a large value for the coreset size is not a strict requirement, but rather an artifact of the datasets used for evaluation. A majority of videos in the three standard datasets are short—114 out of 155 videos have fewer than 400 frames. We model the object using dense HOG features, which leads to a very high-dimensional representation. Hence, the SVM requires a sufficient number of data points for good prediction on such short videos. We set n to 25% of the video size to obtain consistent performance across videos of various sizes.

                Number of Frames
Algorithm       10K    20K    50K    75K    100K
CAT (n=2500)    0.713  0.724  0.736  0.743  0.744
SOWP [24]       0.711  0.711  0.712  0.712  0.712
MUSTER [21]     0.703  0.705  0.702  0.702  0.702
CAT (n=2000)    0.685  0.702  0.718  0.725  0.726
MEEM [36]       0.683  0.682  0.682  0.682  0.681
KCF [20]        0.677  0.677  0.676  0.676  0.675
DSST [11]       0.669  0.670  0.670  0.669  0.670
CAT (n=1500)    0.627  0.674  0.697  0.701  0.702
CAT (n=1000)    0.606  0.634  0.658  0.669  0.673
TLD [23]        0.579  0.573  0.569  0.567  0.567
SCM [38]        0.528  0.529  0.529  0.528  0.529
TGPR [15]       0.382  0.368  0.365  0.364  0.362

TABLE 3: Success rate comparison: CAT performs well, and in fact improves with time, on very long videos with a small, fixed coreset size that is independent of the total number of frames.

However, CAT maintains high detection accuracy with small coreset sizes as well. In Table 4, we show that CAT obtains competitive performance with coreset sizes of 1%, 2.5%, 5%, and 10% on the TLD dataset.

5.6 Selection of Base Trackers

Unlike other tracking methods, our current implementation of CAT does require a base tracker for the first n frames, for populating the initial coreset leaf. In our current implementation, we use SCM as the base tracker. CAT significantly outperforms SCM on all datasets in our experiments. For instance, CAT 25 obtains an 83.3% improvement over SCM on the TLD dataset (Table 2), indicating that the coreset summarization method is responsible for the gain in performance. In fact, CAT shows similar performance gains over a variety of trackers used as base trackers (Table 6) in an experiment with the extremely long videos from Section 5.4.


                      Success Rate
Coreset Size    Mean | Motocross VW    Panda Carchase
1%              0.49 | 0.74      0.48  0.59  0.73
2.5%            0.51 | 0.76      0.49  0.59  0.74
5%              0.53 | 0.77      0.49  0.61  0.74
10%             0.56 | 0.77      0.49  0.62  0.79
25% (CAT 25)    0.66 | 0.85      0.58  0.67  0.85

TABLE 4: We evaluate CAT at several coreset sizes, expressed as a percentage of the video size, using the TLD dataset. CAT performs satisfactorily at coreset sizes as small as 1% of the video size.

Algorithm                                   Success Rate  Precision
Learning from the root of the tree [14]     0.398         0.434
Learning with hierarchical sampling (CAT)   0.524         0.677

TABLE 5: Our hierarchical sampling method gives more representation to recent data in the learning process, as demonstrated here using the CVPR2013 Tracking Benchmark. In the original formulation [14], the classifier would be trained from the root of the coreset tree alone, ignoring more recent data.

5.7 Comparison with Sampling Methods

In comparison to CAT, sampling can achieve a reduction in space and time complexity without the additional overhead of coreset computation. We therefore validate CAT against random sampling and sub-sampling on the TLD dataset, using multiple values of the coreset size n. For random sampling (RS), we use n random frames from all the frames seen so far. For sub-sampling (SS), we use n uniformly spaced frames. CAT consistently outperforms both RS and SS (Table 7). Thus, the summarization of data by coresets allows us to learn a more complex object appearance model as compared to sampling.

5.8 Empirical Evaluation of Time Complexity

The coreset formulation is used to train a learning algorithm in constant time and space (Section 3). We compare the time complexity of training with the coreset tree to that of training with all the data points (Figure 4, left). For a linear SVM implementation with linear time complexity [8], the learning time for training with all data points can become prohibitively large, in contrast to the constant, small time needed for learning from coresets. We pay for the reduced cost of training the learning algorithm with the additional time and space requirements of coreset tree formation. We prove in Section 3 that the time complexity of tree construction is constant on average, and we validate this result empirically in Figure 4 (right). Other coreset formulations [16] may be used to reduce the space requirement to constant, at the cost of a higher constant for the time complexity.

6 DISCUSSION

Simplicity: The utility of the coreset formulation is evident from the fact that we obtain competitive experimental results with a simple tracker that consists of a linear classifier and a Kalman filter. The primary goal of this work is to explore the benefit of coresets for learning from streaming visual data, rather than to compete on tracking benchmarks. Additional improvement could be obtained by choosing the optimal coreset size or exploring various hierarchical sampling methods.

                     Number of Frames
Algorithm            10K    20K    50K    75K    100K
CAT + SOWP [24]      0.728  0.739  0.755  0.764  0.771
SOWP [24]            0.711  0.711  0.712  0.712  0.712
CAT + MUSTER [21]    0.713  0.718  0.725  0.729  0.731
MUSTER [21]          0.703  0.705  0.702  0.702  0.702
CAT + SCM [38]       0.713  0.724  0.736  0.743  0.744
SCM [38]             0.528  0.529  0.529  0.528  0.529
CAT + TLD [23]       0.632  0.637  0.641  0.642  0.642
TLD [23]             0.579  0.573  0.569  0.567  0.567

TABLE 6: CAT (n=2500) outperforms a variety of trackers used as base trackers for the first n frames. Note that SCM is used as the base tracker for the results reported in other tables.

             % Frames Used For Training
Algorithm    25     37.5   50     62.5   75

Coreset Size = 512
CAT          0.408  0.449  0.483  0.534  0.582
SS           0.332  0.358  0.396  0.408  0.424
RS           0.218  0.221  0.251  0.279  0.301

Coreset Size = 768
CAT          0.502  0.559  0.604  0.637  0.668
SS           0.378  0.415  0.438  0.474  0.502
RS           0.257  0.299  0.326  0.368  0.385

Coreset Size = 1024
CAT          0.546  0.621  0.657  0.701  0.732
SS           0.418  0.492  0.540  0.587  0.633
RS           0.321  0.360  0.381  0.419  0.458

TABLE 7: CAT consistently obtains a better success rate than random sampling (RS) and sub-sampling (SS), evaluated here on the TLD dataset.

Performance may also improve by incorporating traditional methods for handling occlusion and drift.

Real-time Performance: Our current implementation, running on commodity hardware (Intel i7 processor, 8 GB RAM), performs coreset construction and SVM training in parallel at ∼20 FPS for small coreset sizes (n ≤ 64). Due to the high computational complexity of sliding-window detection, the total speed is ∼3 FPS. Faster detection methods (e.g., efficient sub-window search [26] and parallelization [34]) can easily be incorporated into our system to achieve real-time performance.

7 CONCLUDING REMARKS

In this paper, we proposed an efficient method for training object appearance models using streaming data. The key idea is to use coresets to compress object features through time to a constant size and to perform learning in constant time using hierarchical sampling of coresets. Our method is ideal for efficient tracking in long videos in real-world scenarios. Since we maintain a compact summary of all previous data, our learning model improves with time. We provide experimental validation for this insight with strong performance on the TLD dataset and on a synthetic dataset containing extremely long videos of 100,000 frames. The ideas described in this paper complement the current expertise in the computer vision community on developing trackers and detectors. Combining our framework with existing methods should lead to fruitful online approaches, particularly when ‘old’ data should be taken into account. While we explore the use of coresets for object tracking in this paper, we hope that our technique can be adapted for a variety of applications, such as image collection summarization, video summarization, and more.


Fig. 4: We validate the theoretical bounds for time complexity with an empirical analysis, averaged over all videos from the TLD dataset. (Left) The coreset tree allows us to train an SVM in constant time (depending on the coreset size n), while the learning time becomes prohibitively large when training with all data points. (Right) The time complexity of coreset construction is constant on average. We demonstrate this at various coreset sizes n, expressed as a percentage of the total video size.

Fig. 5: Sample detections: We demonstrate excellent performance on the CVPR2013 Tracking Benchmark, the Princeton Tracking Benchmark, and the TLD dataset.


REFERENCES

[1] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Approximating extent measures of points. Journal of the ACM, 51(4):606–635, 2004.

[2] S. Avidan. Support vector tracking. IEEE Transactions on PAMI, 26(8):1064–1072, 2004.

[3] S. Avidan. Ensemble tracking. IEEE Transactions on PAMI, 29(2):261–271, 2007.

[4] B. Babenko, M.-H. Yang, and S. Belongie. Robust object tracking with online multiple instance learning. IEEE Transactions on PAMI, 33(8):1619–1632, 2011.

[5] M. Badoiu, S. Har-Peled, and P. Indyk. Approximate clustering via core-sets. ACM Symposium on Theory of Computing, pages 250–257, 2002.

[6] J. Bentley and J. Saxe. Decomposable searching problems I: Static-to-dynamic transformation. Journal of Algorithms, 1(4):301–358, 1980.

[7] M. J. Black and A. D. Jepson. EigenTracking: Robust matching and tracking of articulated objects using a view-based representation. ECCV, pages 329–342, 1996.

[8] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27, 2011.

[9] K. L. Clarkson and D. P. Woodruff. Numerical linear algebra in the streaming model. ACM Symposium on Theory of Computing, 2009.

[10] D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object tracking. IEEE Transactions on PAMI, 25(5):564–577, 2003.

[11] M. Danelljan, G. Hager, F. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. British Machine Vision Conference, 2014.

[12] D. Feldman, S. Gil, R. Knepper, B. Julian, and D. Rus. K-robots clustering of moving sensors using coresets. IEEE International Conference on Robotics and Automation, pages 881–888, 2013.

[13] D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. ACM Symposium on Computational Geometry, pages 11–18, 2007.

[14] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. ACM-SIAM Symposium on Discrete Algorithms, pages 1434–1453, 2013.

[15] J. Gao, H. Ling, W. Hu, and J. Xing. Transfer learning based visual tracking with Gaussian processes regression. ECCV, pages 188–203, 2014.

[16] M. Ghashami and J. M. Phillips. Relative errors for deterministic low-rank matrix approximations. ACM-SIAM Symposium on Discrete Algorithms, 2014.

[17] H. Grabner, C. Leistner, and H. Bischof. Semi-supervised on-line boosting for robust tracking. ECCV, pages 234–247, 2008.

[18] S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. ACM Symposium on Theory of Computing, 2004.

[19] S. Hare, A. Saffari, and P. H. Torr. Struck: Structured output tracking with kernels. IEEE International Conference on Computer Vision, pages 263–270, 2011.

[20] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on PAMI, 37(3):583–596, 2015.

[21] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao. Multi-store tracker (MUSTer): A cognitive psychology inspired approach to object tracking. IEEE CVPR, pages 749–758, 2015.

[22] A. D. Jepson, D. J. Fleet, and T. F. El-Maraghi. Robust online appearance models for visual tracking. IEEE Transactions on PAMI, 25(10):1296–1311, 2003.

[23] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Transactions on PAMI, 34(7):1409–1422, 2012.

[24] H.-U. Kim, D.-Y. Lee, J.-Y. Sim, and C.-S. Kim. SOWP: Spatially ordered and weighted patch descriptor for visual tracking. IEEE ICCV, pages 3011–3019, 2015.

[25] J. Kwon and K. M. Lee. Visual tracking decomposition. IEEE CVPR, pages 1269–1276, 2010.

[26] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. IEEE CVPR, pages 1–8, 2008.

[27] H. Li, C. Shen, and Q. Shi. Real-time visual tracking using compressive sensing. IEEE CVPR, pages 1305–1312, 2011.

[28] E. Liberty. Simple and deterministic matrix sketching. ACM Conference on Knowledge Discovery and Data Mining, 2013.

[29] G. Rosman, M. Volkov, D. Feldman, W. Fisher, and D. Rus. Coresets for k-segmentation of streaming data. Advances in Neural Information Processing Systems, 2014.

[30] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang. Incremental learning for robust visual tracking. IJCV, 77(1-3):125–141, 2008.

[31] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Transactions on PAMI, 2014.

[32] S. Song and J. Xiao. Tracking revisited using RGBD camera: Unified benchmark and baselines. IEEE International Conference on Computer Vision, pages 233–240, 2013.

[33] J. Supancic and D. Ramanan. Self-paced learning for long-term tracking. IEEE CVPR, pages 2379–2386, 2013.

[34] C. Wojek, G. Dorko, A. Schulz, and B. Schiele. Sliding-windows for rapid object class localization: A parallel technique. Pattern Recognition, pages 71–81, 2008.

[35] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. IEEE CVPR, pages 2411–2418, 2013.

[36] J. Zhang, S. Ma, and S. Sclaroff. MEEM: Robust tracking via multiple experts using entropy minimization. ECCV, pages 188–203, 2014.

[37] K. Zhang, L. Zhang, and M.-H. Yang. Real-time compressive tracking. ECCV, pages 864–877, 2012.

[38] W. Zhong, H. Lu, and M.-H. Yang. Robust object tracking via sparsity-based collaborative model. IEEE CVPR, pages 1838–1845, 2012.