Active Frame Selection for Label Propagation in Videos
Sudheendra Vijayanarasimhan and Kristen Grauman
Department of Computer Science, University of Texas at Austin
Sections: Motivation; Main Idea; Estimating Expected Label Propagation Error; Results; Video Label Propagation; Active Frame Selection.
Manually labeling objects in video is tedious and expensive, yet such annotations are valuable for object and activity recognition.
Existing methods for interactive labeling:
• Propagate labels from an arbitrarily selected frame, and/or
• Assume a human will intervene repeatedly to correct errors.
Our active approach outperforms the baselines for all values of k and saves hours of manual effort per video, assuming the cost of correcting errors is proportional to the number of mislabeled pixels.
(Results plots: error measured as the average number of mislabeled pixels, in hundreds of pixels; total annotation time; per-frame accuracy sorted from high to low; SegTrack, k = 5.) Our error predictions in C follow the actual errors closely. In this case, our method automatically selects frames that capture high-resolution information for most of the objects.
Datasets: CamSeq01 (101 frames of a moving driving scene), CamVid seq05 (3,000 frames of a driving scene), LabelMe 8126 (167 frames of a traffic signal), SegTrack (6 videos with moving objects).
Baselines:
• Uniform-f: samples frames uniformly and transfers labels forward.
• Uniform: samples frames uniformly and transfers labels in both directions.
• Keyframe: selects frames with k-way spectral clustering on Gist features.
(Diagram: dynamic-programming cases over sequence frame indices 1, …, i, …, n and selected frame index b. Case 1: one-way end, n > i. Case 2: one-way beginning, b = 1 and n = i.)
Pixel Flow + MRF Label Propagation
Enhance the flow model with a space-time Markov Random Field:
• Infer label maps that are smooth in space and time.
• Exploit object appearance models defined by the labeled frames.
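As an illustration only (not reproduced from the poster), a standard space-time MRF energy of this kind can be written as below; the weights lambda_s, lambda_t and the neighborhood sets N_s, N_t are assumed notation:

    E(y) = \sum_p \phi_p(y_p) + \lambda_s \sum_{(p,q) \in N_s} \psi(y_p, y_q) + \lambda_t \sum_{(p,q) \in N_t} \psi(y_p, y_q)

Here the unary term \phi_p scores pixel p's label using the flow-propagated label and the appearance models built from the labeled frames, while \psi penalizes label disagreement between spatial neighbors (N_s) and flow-linked temporal neighbors (N_t).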
We explicitly model the probability that pixel p in frame t will be mislabeled if we were to obtain its label from frame t+1. The model combines two flow-based distance terms, an appearance term and a motion term, which estimate errors due to boundaries, occlusions, changes in appearance, and pixels entering or leaving the frame.
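The poster's exact functional form is not reproduced here; one plausible instantiation (my assumption, not the authors' formula) combines the two distances in an exponential model:

    P(mislabel at p) = 1 - \exp\big( - d_{app}(p)/\sigma_a - d_{mot}(p)/\sigma_m \big)

where d_app compares the appearance of p with its flow-predicted correspondence in frame t+1, d_mot measures motion and occlusion evidence at p, and sigma_a, sigma_m are scale parameters (all assumed notation).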
If more than one frame separates the labeled frame r_t from the current frame t, we compute the accumulated error recursively (and analogously for the labeled frame l_t on the other side).
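One natural way to write such an accumulation (a sketch under my reading of the poster, not necessarily the authors' exact recursion): a pixel propagated from labeled frame r to frame t is mislabeled either on the first hop or, having survived it, by the remaining hops,

    e(t, r) = e(t, t+1) + (1 - e(t, t+1)) \, e(t+1, r),  for t < r,

with e(t, t+1) the one-step mislabel probability above; the backward direction toward l_t is analogous.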
Identify the k frames which, if labeled, would propagate their labels to the rest of the video with minimal expected error.
Pipeline: (1) actively select k informative frames, (2) segment and label the selected frames, (3) propagate labels to all other frames.
Highlights of our approach:
• Annotate all objects in a video with minimal manual effort.
• Jointly select the k most useful frames via predicted "trackability".
• Efficient dynamic programming solution.
Pixel Flow Label Propagation
Use dense optical flow to track each pixel in both the forward and backward directions, until it reaches the closest labeled frame on either side (see the sketch below).
(Diagram: forward and backward flow between frames, with labels propagated forward and backward from the labeled frames; occluded pixels are marked.)
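A minimal sketch of one-step backward propagation with dense optical flow, assuming OpenCV's Farneback flow; the function name and parameter values are illustrative, not the authors' implementation:

    import cv2
    import numpy as np

    def propagate_labels_back(gray_t, gray_t1, labels_t1):
        # Dense flow from frame t to frame t+1: flow[y, x] = (dx, dy).
        flow = cv2.calcOpticalFlowFarneback(gray_t, gray_t1, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        h, w = gray_t.shape
        xs, ys = np.meshgrid(np.arange(w), np.arange(h))
        # Each pixel of frame t reads the label of its correspondence in frame t+1.
        map_x = (xs + flow[..., 0]).astype(np.float32)
        map_y = (ys + flow[..., 1]).astype(np.float32)
        labels_t = cv2.remap(labels_t1.astype(np.float32), map_x, map_y,
                             interpolation=cv2.INTER_NEAREST,
                             borderMode=cv2.BORDER_REPLICATE)
        return labels_t.astype(labels_t1.dtype)

Chaining such warps frame by frame toward the closest labeled frame on each side, and combining the two directions, gives the flow-based propagation described here; the MRF step then smooths the result.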
To segment an N-frame video, there are two sources of manual effort cost:
1. the cost of fully labeling a frame from scratch, denoted C_l;
2. the cost of correcting errors made by propagation, denoted C_c.
Our approach yields higher accuracy, especially for frames far from the labeled frames. It reduces annotation effort more than the baselines do, and can also predict the optimal number of frames to have labeled, k*.
Errors and time saved
Example of actively selected frames
(Notation diagram: sequence frame index 1, 2, …, i, …, n−1, n; selected frame index …, b−2, b−1, b.)
Dynamic programming solution
Define a table whose entry for (n, b, i) stores the optimal value of the objective when b frames are selected from the first n frames and i is the index of the b-th selected frame. For a given k, dynamic programming over this table yields the optimal value in time polynomial in N and k, compared to a naïve exhaustive search over all N-choose-k candidate frame sets.
Objective: we want the set of k frames that minimizes the expected total effort, i.e., the cost of labeling the selected frames from scratch plus the predicted cost of correcting propagation errors in the remaining frames.
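Written out under my own notation (S for the selected set, \hat{e}(t, S) for the predicted number of mislabeled pixels in frame t after propagation from S), this is a sketch of the objective rather than the poster's exact formula:

    S^* = \arg\min_{S \subseteq \{1, \dots, N\}, |S| = k} \; k \, C_l + C_c \sum_{t \notin S} \hat{e}(t, S)

Since |S| = k, the term k C_l is constant for a fixed k, so the search for that k reduces to minimizing the predicted propagation error; comparing the optima across different values of k is what lets the method suggest k*.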
(Diagram: Case 3, both ways, b > 1 and n = i; an intermediate frame m lies between the previous selected frame j and the current selected frame i.)
Let the N × N matrix C record the frame-to-frame predicted errors.
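A minimal Python sketch of a dynamic program in this spirit, assuming C[t, r] holds the predicted number of mislabeled pixels in frame t when its labels are propagated from frame r, and using a simple per-frame minimum over the two propagation directions in place of the poster's exact two-way combination (both assumptions are mine):

    import numpy as np

    def select_frames(C, k, C_l=1.0, C_c=1.0):
        # C: N x N array; C[t, r] = predicted mislabeled pixels in frame t when
        #    labels are propagated to it from frame r (assumed indexing).
        N = C.shape[0]
        INF = float("inf")

        def head_err(i):
            # "Case 2: one-way beginning" -- frames before the first selected frame i.
            return sum(C[m, i] for m in range(0, i))

        def tail_err(i):
            # "Case 1: one-way end" -- frames after the last selected frame i.
            return sum(C[m, i] for m in range(i + 1, N))

        def seg_err(j, i):
            # "Case 3: both ways" -- frames between selected frames j and i take
            # whichever direction predicts fewer errors (a simplification).
            return sum(min(C[m, j], C[m, i]) for m in range(j + 1, i))

        # dp[b][i]: best error for frames 0..i with b frames selected, the b-th being i.
        dp = [[INF] * N for _ in range(k + 1)]
        back = [[-1] * N for _ in range(k + 1)]
        for i in range(N):
            dp[1][i] = head_err(i)
        for b in range(2, k + 1):
            for i in range(N):
                for j in range(b - 2, i):          # previous selected frame
                    if dp[b - 1][j] == INF:
                        continue
                    cand = dp[b - 1][j] + seg_err(j, i)
                    if cand < dp[b][i]:
                        dp[b][i], back[b][i] = cand, j

        # Close with the one-way tail and pick the best last selected frame.
        best_i = min(range(N), key=lambda i: dp[k][i] + tail_err(i))
        total_err = dp[k][best_i] + tail_err(best_i)

        # Recover the selected frame indices by walking the back-pointers.
        selected, b, i = [], k, best_i
        while b >= 1:
            selected.append(i)
            i, b = back[b][i], b - 1
        selected.reverse()

        return selected, k * C_l + C_c * total_err

With prefix sums over C the segment costs can be precomputed, and the whole program stays polynomial in N and k, as opposed to enumerating every size-k subset of frames.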