Binary Features for Object Detection and Landmarking


Transcript of Binary Features for Object Detection and Landmarking

Page 1: Binary Features for Object Detection and Landmarking

Binary Features
Steven C. Mitchell, Ph.D.
Componica, LLC

Pages 2-3: Binary Features for Object Detection and Landmarking

What’s a Binary Feature?


-Let’s take an image and sample a region of interest, a 4x4 patch. Maybe you’re looking for a face, or a tumor, or a gun.
-In a typical object detection system, this region of interest is scanned across the image over different scales.
-Typically you scan left-to-right, top-to-bottom in steps of 10% of the patch size. Then you shrink the image (or scale the patch) by 20% and start over. Continue until the image becomes too small or you’ve found what you’re looking for. A rough sketch of that loop follows below.
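
A minimal sketch of that scanning loop in C (the detector callback, the names, and the exact step/scale arithmetic are illustrative assumptions, not code from the talk):

    /* Hypothetical detector callback: nonzero if the window at (x, y)
       with the given side length looks like the target. */
    typedef int (*detect_fn)(const unsigned char *img, int w, int h,
                             int x, int y, int side);

    /* Scan left-to-right, top-to-bottom in steps of 10% of the window
       size, then grow the window by 25% (equivalent to shrinking the
       image by 20%) and start over, until the window no longer fits
       or we find a hit. */
    int scan_image(const unsigned char *img, int w, int h,
                   int min_side, detect_fn detect)
    {
        int max_side = w < h ? w : h;
        for (int side = min_side; side <= max_side; side = side * 5 / 4) {
            int step = side / 10 > 0 ? side / 10 : 1;  /* 10% of patch size */
            for (int y = 0; y + side <= h; y += step)
                for (int x = 0; x + side <= w; x += step)
                    if (detect(img, w, h, x, y, side))
                        return 1;  /* found what we're looking for */
        }
        return 0;
    }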

Pages 4-10: Binary Features for Object Detection and Landmarking

-So let’s start with this patch (we’ll assume only gray values; forget about color for now).
-First, the pixels have values, typically from 0 to 255.
-We also need a way of addressing the locations of these pixels. I’ll use a simple numbering scheme, as the patches will always be 4x4.
-Lastly, I want to compare the brightness of two pixels. I’ll pick locations 5 and 11. In code, the whole feature is a two-line comparison, sketched below.
-Why those two locations? In a later slide, I’ll explain how locations are chosen.
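
A sketch of that single feature, assuming the 4x4 patch is stored row-major as 16 bytes:

    /* A 4x4 patch stored row-major: location n is patch[n], n = 0..15,
       with gray values 0..255. One binary feature compares two locations. */
    typedef struct { int a, b; } binary_feature;  /* e.g. a = 5, b = 11 */

    static int feature_fires(const unsigned char patch[16], binary_feature f)
    {
        return patch[f.a] < patch[f.b];  /* 1 = yes, 0 = no */
    }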


Pages 11-13: Binary Features for Object Detection and Landmarking

-OK, let’s try different patches with the same binary feature, that is, compare locations 5 and 11.
-Now imagine I try a whole bunch of pairs on a given patch: 2 vs 14, 8 vs 4, 7 vs 2, etc. I’m going to get a bunch of yes/no responses based on the patch I happen to show the system.


Page 14: Binary Features for Object Detection and Landmarking

Different Types of Binary Features

-Of course there are many different types of binary features, different types of questions I can ask.
-Simple thresholding, which pixel is brighter, which pixel is brighter by a threshold, how similar two pixels are. (These variants are sketched below.)
-With color it could be comparisons of different channels.
-The main points are: each feature has a fixed set of parameters discovered during training and fixed for recognition, and the output is a yes or no.
-BTW, I really like the simple comparison of two pixels. It’s fast, and any change to the brightness / contrast of a patch will always return the same result.
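
A sketch of those feature variants in C (the enum names and struct layout are assumptions; only the comparison logic comes from the slide):

    #include <stdlib.h>  /* abs */

    typedef enum { F_THRESHOLD, F_COMPARE, F_COMPARE_T, F_SIMILAR } feature_type;

    typedef struct {
        feature_type type;
        int a, b;  /* pixel locations, discovered in training, fixed after */
        int t;     /* threshold, likewise fixed */
    } feature;

    /* Every feature answers yes (1) or no (0). */
    static int eval(const unsigned char p[16], feature f)
    {
        switch (f.type) {
        case F_THRESHOLD: return p[f.a] > f.t;                /* simple thresholding */
        case F_COMPARE:   return p[f.a] < p[f.b];             /* which pixel is brighter */
        case F_COMPARE_T: return p[f.a] + f.t < p[f.b];       /* brighter by a margin */
        case F_SIMILAR:   return abs(p[f.a] - p[f.b]) < f.t;  /* how similar are two pixels */
        }
        return 0;
    }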

Page 15: Binary Features for Object Detection and Landmarking

Decision Tree Overview

-Now, in order to make use of these features, let’s talk about decision trees.

Page 16: Binary Features for Object Detection and Landmarking

Is Grass Wet?

[Figure: a decision tree for deciding whether it rained last night. The root asks "Is the grass wet?", one branch asks "Did you water the grass?", and each Y/N leaf holds a YES/NO probability histogram.]

-Let’s say you’re trying to determine if it rained last night.
-This is a classification problem.
-Here I constructed a simple decision tree based on a couple of yes/no questions.
-At the leaves of this tree are probability histograms created from my data.
-They sum to one.
-My decision is based on which of the two bars is greater at each leaf.

Page 17: Binary Features for Object Detection and Landmarking

Selecting Good Questions

[Figure: two candidate root questions compared side by side. "Is the grass wet?" yields skewed YES/NO leaf histograms; "Do you like oranges?" yields flat ones.]

-So how do I pick a good question? First pick a question from my universe of questions, pour my data through it, and measure how well it predicts.
-Three commonly used metrics: entropy, Gini impurity, and classification error. (Sketched below.)
-What they basically measure is how far away you are from just a 50/50 coin toss.
-Here you can see an irrelevant question like “Do you like oranges?” would yield a flat distribution. That means high entropy, Gini impurity, or classification error.
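
For reference, a sketch of those three metrics for a two-class leaf (standard textbook formulas, not code from the talk; a nonempty leaf is assumed):

    #include <math.h>

    /* All three measure distance from a 50/50 coin toss; lower is better. */
    static double gini(double yes, double no)
    {
        double p = yes / (yes + no);
        return 2.0 * p * (1.0 - p);          /* 0 when pure, 0.5 at 50/50 */
    }

    static double entropy(double yes, double no)
    {
        double p = yes / (yes + no), q = 1.0 - p, e = 0.0;
        if (p > 0) e -= p * log2(p);
        if (q > 0) e -= q * log2(q);
        return e;                            /* 0 when pure, 1 bit at 50/50 */
    }

    static double class_error(double yes, double no)
    {
        double p = yes / (yes + no);
        return p < 1.0 - p ? p : 1.0 - p;    /* error of the majority vote */
    }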

Page 18: Binary Features for Object Detection and Landmarking

[Figure: a decision tree whose internal nodes are pixel comparisons such as I[5] < I[11] and I[7] < I[3], with YES/NO probability histograms at the Y/N leaves.]

-Going back to binary features, the questions we ask are based on pixel comparisons.
-How do we pick the parameters? We randomly sample from the universe of parameters and choose the one that yields a good score on the given dataset.
-In the 4x4 patch, I would pick two random numbers from 0 to 15 (no duplicates) and a random threshold (if I need one). Add that feature to the tree, then test the tree with my dataset and compute a score. I’ll do this 2000 times and keep the binary feature that produced the tree with the best score. I then keep growing my tree in a greedy fashion until it’s big enough (5-9 levels deep) or accurate enough. (See the sketch below.)
-This answers the question of where x, y, and T come from.
-In my experience, a sampling of 500-2000 works really well, with diminishing returns for anything higher.
-This is the most time-consuming part of building these trees, but it’s extremely parallelizable.
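
A sketch of that sampling loop (the score callback is an assumption; it stands in for pouring the dataset through the candidate split and returning, say, weighted Gini impurity):

    #include <stdlib.h>

    /* A candidate pixel-comparison feature for a 4x4 patch: two distinct
       locations in 0..15, plus a threshold for variants that need one. */
    typedef struct { int a, b, t; } candidate;

    static candidate random_candidate(void)
    {
        candidate c;
        c.a = rand() % 16;
        do { c.b = rand() % 16; } while (c.b == c.a);  /* no duplicates */
        c.t = rand() % 256;
        return c;
    }

    /* Try n random candidates, keep the one with the best (lowest) score.
       500-2000 samples works well, with diminishing returns beyond that. */
    static candidate best_candidate(int n_samples, double (*score)(candidate))
    {
        candidate best = random_candidate();
        double best_score = score(best);
        for (int i = 1; i < n_samples; i++) {
            candidate c = random_candidate();
            double s = score(c);
            if (s < best_score) { best_score = s; best = c; }
        }
        return best;
    }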

Page 19: Binary Features for Object Detection and Landmarking

Selecting Good Questions

[Figure: the same two candidate questions, "Is the grass wet?" and "Do you like oranges?", now with numeric leaf outputs for regression instead of class histograms.]

-Now, that’s for classification. Decision trees can be used for regression too.
-Instead of classes like yes/no or cat/dog/horse, the output is the average value at the leaves from my dataset.
-What makes a good question? The ones that decrease the variance around those averages. (Sketched below.)
-Also note, the output can be multi-dimensional, not necessarily a single value. You can compute the variance of multi-dimensional things fairly easily, don’t worry.
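
A sketch of the variance-reduction score for one candidate split (names are assumptions; for multi-dimensional outputs, sum the per-dimension variances):

    /* Variance of a set of leaf outputs around their mean. */
    static double variance(const double *v, int n)
    {
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < n; i++) mean += v[i];
        mean /= n;
        for (int i = 0; i < n; i++) var += (v[i] - mean) * (v[i] - mean);
        return var / n;
    }

    /* A good question is one whose children have a low count-weighted
       sum of variances (lower = better). */
    static double split_score(const double *left, int nl,
                              const double *right, int nr)
    {
        return (nl * variance(left, nl) + nr * variance(right, nr)) / (nl + nr);
    }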

Page 20: Binary Features for Object Detection and Landmarking

[Figure: a regression tree over the same pixel comparisons, I[5] < I[11] and I[7] < I[3], with numeric values at the YES/NO leaves.]

-So here is a binary feature tree that returns a value (like the probability it’s an object) instead of a class... or it could be a vector, like landmarks. Traversal is just a handful of comparisons, as sketched below.
-Now we can start constructing interesting solutions using these concepts.
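
A sketch of evaluating such a tree, assuming a flat array-of-nodes layout (an implementation convenience, not from the slides):

    /* A regression tree over pixel comparisons, stored as a flat array.
       Leaves carry the output value (a probability, or one slot of a
       landmark vector). */
    typedef struct {
        int a, b;     /* pixel locations to compare (internal nodes) */
        int yes, no;  /* child indices; yes < 0 marks a leaf */
        float value;  /* output stored at leaves */
    } node;

    static float tree_predict(const node *t, const unsigned char patch[16])
    {
        int i = 0;
        while (t[i].yes >= 0)  /* walk down until we hit a leaf */
            i = (patch[t[i].a] < patch[t[i].b]) ? t[i].yes : t[i].no;
        return t[i].value;
    }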

Page 21: Binary Features for Object Detection and Landmarking

Corner Detector

-First, let’s start with corner detection.

Page 22: Binary Features for Object Detection and Landmarking

Harris Corner Detector

1. Compute a smooth gradient in X and Y.
2. For each pixel, compute this matrix.
3. Solve for R.
4. Non-maximum suppression to gather corners.

-The Harris Corner Detector is one of the simplest ways to detect corners, based on estimating the 2nd derivative of the sum-squared-distance between two patches.
-SURF, SIFT, SUSAN, etc.
-So what’s the point? These points are stable regardless of angle, scale, or translation.
-This reduces the data such that you can rapidly compare the image to a template for techniques like augmented reality, image stitching, and motion tracking.
-So you can find corners using these four easy steps... wait... lots of math... slow...

Page 23: Binary Features for Object Detection and Landmarking

FAST Corner Detector

Given a pixel, based on the 16 surrounding pixels, is this location a corner?

FAST uses a decision tree trained on real images and converted to nested if statements in C.

Doesn’t use math; averages about 3 comparisons per pixel... very, very FAST.

http://mi.eng.cam.ac.uk/~er258/work/fast.html

-OK, enough of that. Let’s use a more machine-learning approach... FAST: Features from Accelerated Segment Test. The underlying segment test is sketched below.
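
For intuition, here is a brute-force version of the FAST-9 segment test in C. This is not the shipped code: the actual detector compiles this test into the machine-generated decision tree described above.

    /* Brute-force FAST-9 segment test: a pixel is a corner if at least 9
       contiguous pixels on the 16-pixel Bresenham ring are all brighter
       than center + t, or all darker than center - t. */
    static int is_corner(const unsigned char *img, int stride,
                         int x, int y, int t)
    {
        static const int ring[16][2] = {
            {0,-3},{1,-3},{2,-2},{3,-1},{3,0},{3,1},{2,2},{1,3},
            {0,3},{-1,3},{-2,2},{-3,1},{-3,0},{-3,-1},{-2,-2},{-1,-3}
        };
        int c = img[y * stride + x];
        for (int start = 0; start < 16; start++) {
            int brighter = 1, darker = 1;
            for (int k = 0; k < 9; k++) {       /* 9 contiguous ring pixels */
                const int *o = ring[(start + k) & 15];
                int p = img[(y + o[1]) * stride + (x + o[0])];
                if (p <= c + t) brighter = 0;
                if (p >= c - t) darker = 0;
            }
            if (brighter || darker) return 1;
        }
        return 0;
    }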

Page 24: Binary Features for Object Detection and Landmarking

FAST Corner Detector

The source code is computer generated, and free for anyone to use.

It is 6000 lines long and not comprehensible.

With an averaging of vectors and an arctangent, you can get a rotation vector cheaply.

Taylor, Drummond: Multiple Target Localisation at over 100 FPS

Figure 2: Left: The 8×8 sample grid used for the HIPs and the 5 sample locations selected for indexing, relative to the FAST-9 interest point (shown by the grey circle). Right: The orientation assignment scheme uses a sum of the gradients between opposite pixels in the 16-pixel FAST ring.

2.1 Selecting Repeatable Feature Positions and Orientations

Runtime performance considerations led us to select FAST-9 [12] as the interest point detector. Typical approaches to assigning orientation require computationally expensive blurring [2] or histogramming [7] and would add significant computation to the runtime processing. Instead we simply sum gradients computed between opposite pixels in the 16-pixel ring used in FAST corner detection, as shown in Figure 2. The directions are fixed so the x and y components of the orientation vector can be computed very quickly from weighted sums of the 8 pixel differences.

We run FAST-9 on each training image within a viewpoint bin and represent the 35 highest-scoring corners from each 200×200 region with subfeatures. Proportionally fewer subfeatures are extracted from smaller regions at the edges of the viewpoint reference frame. For smaller scale viewpoint bins where the entire target is under 200×200 pixels a 35 corner minimum is enforced which effectively increases the feature density for these smaller targets. The orientation measure of Figure 2 is also computed at each detected corner. The position (x_r, y_r) and orientation θ_r of the subfeature in the coordinate system of the viewpoint reference frame can be computed as the warp used to generate the training image is known.

The appearance of the subfeature is represented by a sparsely-sampled quantised patch. We use a square 8×8 sampling grid centred on the interest point, with a 1-pixel gap between samples, as shown in Figure 2. Before sampling the pixel values the sampling grid is first rotated so that it is aligned with the detected orientation of the subfeature. The 64 samples are then extracted with bilinear interpolation, normalised for mean and standard deviation to give lighting invariance, and quantised into 5 intensity bins. The 5-bit index value explained in Section 3.2 is also computed and stored in the subfeature.

The most repeatable feature positions and orientations for a viewpoint bin appear as dense clusters of subfeatures in the (x_r, y_r, θ_r) space when all the training images in the bin have been processed. Every subfeature is considered as the potential centre of a HIP feature, and the set of other subfeatures (from other training images) that lie within a certain distance of the centre is found. We manually decide the allowable distance; in this paper we allow 2 pixels of localisation error and 10 degrees of orientation error. Sets of subfeatures given these allowed distances share enough similarity in appearance to be represented by a single HIP feature in a target database. The largest set of subfeatures represents the most repeatably-detected feature, and will be the first feature we select to add to the database. Sets continue…

http://mi.eng.cam.ac.uk/~er258/work/fast.html

Pages 25-26: Binary Features for Object Detection and Landmarking

FAST Example

-Here’s a picture of yours truly and a Starbucks logo that I ran for a project.
-The lines indicate a direction derived from that rotation vector in the last slide. It’s useful for normalizing patches, like if you were to create an augmented reality system on a mobile device.
-Here is some random dude’s YouTube video running FAST. I’d show you my own, but I didn’t have enough time.
-Notice it’s running in realtime on a slow iPhone 3; Harris Corners and SURF would drag on such a device. Just as a note, mobile phones typically run 10x-30x slower than desktops.


Page 27: Binary Features for Object Detection and Landmarking

Keypoint Recognition

-Once you have corners, the next step is to identify what those corners belong to.

Page 28: Binary Features for Object Detection and Landmarking

Keypoint Recognition

Fast Keypoint Recognition using Random Ferns
Mustafa Özuysal, Michael Calonder, Vincent Lepetit and Pascal Fua

-So in an image stitching problem, an augmented reality solution, or a bag-of-words object recognizer (Amazon’s product-IDer thingy), you sample a region of interest around each corner and try to match it with a known template.
-Comparisons are often non-trivial because you have to normalize the patches against distortions caused by rotation and tilt, normalize the brightness, and then come up with some feature vector from the patches.
-Finally, you measure the distances between the feature vectors of each patch in the template and the image. That’s like an O(n^2) deal there.
-Everything about this sounds really slow on an iPhone.
-OK, let’s use binary feature trees to solve this.

Page 29: Binary Features for Object Detection and Landmarking

Fast Keypoint Recognition using Random Ferns
Mustafa Özuysal, Michael Calonder, Vincent Lepetit and Pascal Fua

-First, generate patches from each corner in the original template with random orientations, sizes, and tilts. Generate a ton of them, because that’s our training set.

Page 30: Binary Features for Object Detection and Landmarking

Fast Keypoint Recognition using Random Ferns
Mustafa Özuysal, Michael Calonder, Vincent Lepetit and Pascal Fua

-Next, these guys simplified the decision tree concept with something they dubbed ferns (or primitive trees).
-The idea is that if you ask the same questions at each depth, you can collapse the tree into simple bits in an index. The leaves are simply locations in an array. (See the sketch below.)
-So, for example, three bits is 2^3 or 8 possible outcomes. So instead of a tree, you have an array of 8 probability histograms.
-Next, the selection of classes is based on a simple max of the class probabilities for a given set of bits, but you’re probably going to need a lot of bits to get a good result (they determine this empirically).
-Now, if you assume independence of the features, you can reduce this to products of several ferns.
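
A sketch of how one fern packs its comparisons into an array index (the struct layout is an assumption):

    /* One fern: the same d pixel comparisons applied in order, packed
       into a d-bit index. With d = 3 there are 2^3 = 8 leaves, so the
       "tree" collapses into an array of 8 probability histograms. */
    typedef struct { int a[8], b[8]; int depth; } fern;  /* depth <= 8 here */

    static unsigned fern_index(const fern *f, const unsigned char *patch)
    {
        unsigned idx = 0;
        for (int d = 0; d < f->depth; d++) {
            idx <<= 1;
            if (patch[f->a[d]] < patch[f->b[d]]) idx |= 1;  /* one bit per question */
        }
        return idx;  /* a location in the leaf array, 0 .. 2^depth - 1 */
    }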

Page 31: Binary Features for Object Detection and Landmarking

[Figure: a patch run through three ferns; each fern's yes/no answers pack into a binary index, e.g. 110₂ = 6, 001₂ = 1, 101₂ = 5.]

Efficient Keypoint Recognition, Lepetit et al

Page 32: Binary Features for Object Detection and Landmarking

[Figure: a second patch produces a different set of fern indices, e.g. 101₂ = 5 and 010₂ = 2.]

Efficient Keypoint Recognition, Lepetit et al

Page 33: Binary Features for Object Detection and Landmarking

[Figure: a third patch and its fern indices, e.g. 110₂ = 6 and 101₂ = 5.]

Efficient Keypoint Recognition, Lepetit et al

Pages 34-35: Binary Features for Object Detection and Landmarking

Fast Keypoint Recognition in Ten Lines of Code
Mustafa Özuysal, Pascal Fua, Vincent Lepetit

-This whole algorithm can be expressed in just 10 lines of C code. A sketch of what that loop looks like follows below.
-Very, very fast.
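
The published code differs in its details, but the core loop really is about that size. A sketch, reusing the fern and fern_index() sketch from earlier and assuming a precomputed table of per-leaf log-probabilities:

    #include <float.h>

    /* Classify a patch with M ferns and C classes: assuming the ferns
       are independent, class scores are sums of per-fern log-probabilities
       (the log of the product mentioned above). logp is laid out as
       [fern][leaf][class]. */
    int classify(const unsigned char *patch, const fern *ferns, int M,
                 const float *logp, int leaves, int C)
    {
        float best = -FLT_MAX;
        int best_class = 0;
        for (int c = 0; c < C; c++) {
            float s = 0.0f;
            for (int m = 0; m < M; m++)
                s += logp[(m * leaves + fern_index(&ferns[m], patch)) * C + c];
            if (s > best) { best = s; best_class = c; }  /* max over classes */
        }
        return best_class;
    }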


Page 36: Binary Features for Object Detection and Landmarking

From Bits to Images

-So these binary trees toss out all the gray values. Do they really characterize images well enough to solve serious problems?
-OK, let’s say we took an image, found corners, and sampled binary pairs (a few hundred) from 32x32 patches. Can we reconstruct an image from just the locations of the corners, the patch size, and the binary pairs?

Page 37: Binary Features for Object Detection and Landmarking

From Bits to Images: Inversion of Local Binary Descriptors
Emmanuel d’Angelo, Laurent Jacques, Alexandre Alahi and Pierre Vandergheynst

-Yes we can. It’s a bit like solving Sudoku.
-What’s really surprising is how much information we can capture without any gray levels.
-So you’re collecting edge information over different scales. Plus, if it’s just simple comparisons, it’s immune to brightness / contrast issues or global lighting.
-In many ways it’s superior to other means of characterizing images.

Page 38: Binary Features for Object Detection and Landmarking

Object Detection

-Let’s talk about object detection.

Page 39: Binary Features for Object Detection and Landmarking

Viola / Jones Object Detection

“Robust Real-time Object Detection”
Paul Viola and Michael Jones

-The Viola-Jones object detection framework was formulated in the early 2000s and was a breakthrough in object detection. Cheap cameras and cellphones use it all the time.
-It works by measuring the differences of the sums of rectangles and taking a threshold. If it exceeds a certain value, it’s a face.
-Now, of course, that’s a very poor system of face detection, so they strengthened it using the principles of ensemble learning.
-That is, yes, one rectangle comparison makes a very awful face detector, but if you have a large number of independent detectors and do a weighted vote, you’ll end up with a much more accurate detector.
-Wisdom of crowds.
-The AdaBoost algorithm shown here lists a method of determining the weighting. Basically, give a higher vote to the more accurate detectors, then retrain on the dataset with more attention to the incorrect samples. Repeat. (One round is sketched below.)
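
A sketch of one round of that weighting scheme, discrete AdaBoost (array layout and names are assumptions; pred and label hold class ids, and the weighted error is assumed to be strictly between 0 and 1):

    #include <math.h>

    /* One boosting round: score the weak detector on weighted data, give
       it a vote alpha based on its weighted error, then boost the weights
       of the samples it got wrong. Returns alpha. */
    double adaboost_round(const int *pred, const int *label, double *w, int n)
    {
        double err = 0.0, sum = 0.0;
        for (int i = 0; i < n; i++)
            if (pred[i] != label[i]) err += w[i];
        double alpha = 0.5 * log((1.0 - err) / err);  /* more accurate => bigger vote */
        for (int i = 0; i < n; i++) {
            w[i] *= exp(pred[i] != label[i] ? alpha : -alpha);  /* emphasize mistakes */
            sum += w[i];
        }
        for (int i = 0; i < n; i++) w[i] /= sum;      /* renormalize */
        return alpha;
    }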

Page 40: Binary Features for Object Detection and Landmarking

Viola / Jones Object Detection

Figure 2: The integral image. Left: A simple input of image values. Center: The computed integral image. Right:Using the integral image to calculate the sum over rectangle D.

3 The Technique

Our adaptive thresholding technique is a simple extension of Wellner’s method [Wellner 1993]. The main idea in Wellner’s algorithm is that each pixel is compared to an average of the surrounding pixels. Specifically, an approximate moving average of the last s pixels seen is calculated while traversing the image. If the value of the current pixel is t percent lower than the average then it is set to black, otherwise it is set to white. This method works because comparing a pixel to the average of nearby pixels will preserve hard contrast lines and ignore soft gradient changes. The advantage of this method is that only a single pass through the image is required. Wellner uses 1/8th of the image width for the value of s and 15 for the value of t. However, a problem with this method is that it is dependent on the scanning order of the pixels. In addition, the moving average is not a good representation of the surrounding pixels at each step because the neighbourhood samples are not evenly distributed in all directions. By using the integral image (and sacrificing one additional iteration through the image), we present a solution that does not suffer from these problems. Our technique is clean, straightforward, easy to code, and produces the same output independently of how the image is processed. Instead of computing a running average of the last s pixels seen, we compute the average of an s x s window of pixels centered around each pixel. This is a better average for comparison since it considers neighbouring pixels on all sides. The average computation is accomplished in linear time by using the integral image. We calculate the integral image in the first pass through the input image. In a second pass, we compute the s x s average using the integral image for each pixel in constant time and then perform the comparison. If the value of the current pixel is t percent less than this average then it is set to black, otherwise it is set to white. The following pseudocode demonstrates our technique for input image in, output binary image out, image width w and image height h.

procedure AdaptiveThreshold(in, out, w, h)
  for i = 0 to w do
    sum ← 0
    for j = 0 to h do
      sum ← sum + in[i, j]
      if i = 0 then
        intImg[i, j] ← sum
      else
        intImg[i, j] ← intImg[i−1, j] + sum
      end if
    end for
  end for

…method is easy to implement for real-time performance on a live video stream. Though our technique is an extension to a previous method [Wellner 1993], we increase robustness to strong illumination changes. In addition, we present a clear and tidy solution without increasing the complexity of the implementation. Our technique is also similar to the thresholding method of White and Rohrer for optical character recognition [White and Rohrer 1983], however we present an implementation designed for real-time video. The motivation for this work is finding fiducials in augmented reality applications. Pintaric also presents an adaptive thresholding algorithm specifically for augmented reality markers [Pintaric 2003], however his method requires that a fiducial has been located in a previous frame in order for the technique to threshold correctly. Our algorithm makes no assumptions and is more general, suitable for use in any application. The source code is available online at the address listed at the end of this paper.

2 Background

2.1 Real-Time Adaptive Thresholding

In this paper we focus on adaptively thresholding images from a live video stream. In order to maintain real-time performance, the thresholding algorithm must be limited to a small constant number of iterations through each image. Thresholding is often a sub-task that makes up part of a larger process. For instance in augmented reality, input images must be segmented to locate known markers in the scene that are used to dynamically establish the pose of the camera. A simple and fast adaptive thresholding technique is therefore an important tool.

2.2 Integral Images

An integral image (also known as a summed-area table) is a tool that can be used whenever we have a function from pixels to real numbers f(x,y) (for instance, pixel intensity), and we wish to compute the sum of this function over a rectangular region of the image. Examples of where integral images have been applied include texture mapping [Crow 1984], face detection in images [Viola and Jones 2004], and stereo correspondence [Veksler 2003]. Without an integral image, the sum can be computed in linear time per rectangle by calculating the value of the function for each pixel individually. However, if we need to compute the sum over multiple overlapping rectangular windows, we can use an integral image and achieve a constant number of operations per rectangle with only a linear amount of preprocessing.

To compute the integral image, we store at each location, I(x,y), the sum of all f(x,y) terms to the left and above the pixel (x,y). This is accomplished in linear time using the following equation for each pixel (taking into account the border cases),

I(x,y) = f(x,y) + I(x−1,y) + I(x,y−1) − I(x−1,y−1). (1)

Figure 2 (left and center) illustrates the computation of an integral image. Once we have the integral image, the sum of the function for any rectangle with upper left corner (x1,y1) and lower right corner (x2,y2) can be computed in constant time using the following equation,

∑_{x=x1}^{x2} ∑_{y=y1}^{y2} f(x,y) = I(x2,y2) − I(x2,y1−1) − I(x1−1,y2) + I(x1−1,y1−1). (2)

Figure 2 (right) illustrates that computing the sum of f(x,y) over the rectangle D using Equation 2 is equivalent to computing the sums over the rectangles (A+B+C+D) − (A+B) − (A+C) + A.

D. Bradley, G. Roth, Adaptive Thresholding using the Integral Image. J. Graphics Tools 12(2): 13-21 (2007)

-The other trick in Viola-Jones was the fast method of summing the rectangles using an integral image.
-If you construct an integral image by summing the pixels to the left and above while subtracting the upper-left pixel, you can rapidly compute any rectangle sum using the above equation. (Sketched below.)
-Problem is, constructing integral images can be slow, plus you’re doing 8 operations per feature.
-Binary features with pixel comparisons can do it in two operations, without even constructing an integral image or doing brightness / contrast normalization.
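
A sketch of the integral image construction (Equation 1) and the constant-time rectangle sum (Equation 2); the extra padding row and column of zeros is an implementation convenience to skip the border cases, not from the paper:

    #include <stdint.h>

    /* Build an integral image: ii(x,y) holds the sum of all pixels left
       of and above (x,y). ii is (w+1) x (h+1) with a zero row/column. */
    void integral_image(const unsigned char *img, uint32_t *ii, int w, int h)
    {
        for (int x = 0; x <= w; x++) ii[x] = 0;
        for (int y = 1; y <= h; y++) {
            ii[y * (w + 1)] = 0;
            for (int x = 1; x <= w; x++)
                ii[y * (w + 1) + x] = img[(y - 1) * w + (x - 1)]
                                    + ii[(y - 1) * (w + 1) + x]     /* above */
                                    + ii[y * (w + 1) + x - 1]       /* left */
                                    - ii[(y - 1) * (w + 1) + x - 1];/* upper-left */
        }
    }

    /* Sum over the rectangle [x1,x2) x [y1,y2) in constant time. A
       two-rectangle Haar feature needs two of these 4-lookup sums:
       the ~8 operations per feature mentioned above. */
    uint32_t rect_sum(const uint32_t *ii, int w,
                      int x1, int y1, int x2, int y2)
    {
        return ii[y2 * (w + 1) + x2] - ii[y1 * (w + 1) + x2]
             - ii[y2 * (w + 1) + x1] + ii[y1 * (w + 1) + x1];
    }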

Pages 41-42: Binary Features for Object Detection and Landmarking

Binary Feature-Based Object Detection

Unconstrained Face Detection
Shengcai Liao, Anil K. Jain, and Stan Z. Li

[Figure: a detection tree whose nodes are pixel comparisons such as I[5] < I[11] and I[7] < I[3], with YES/NO outputs at the leaves.]

Object Detection with Pixel Intensity Comparisons Organized in Decision Trees
Nenad Markus, Miroslav Frljak, Igor S. Pandzic, Jorgen Ahlberg, and Robert Forchheimer

-This technique was simultaneously published by several groups.
-Here is Nenad Markus’ implementation.
-His runs 30x faster than Viola-Jones and 9x faster than the Local Binary Patterns approach in OpenCV.
-Here he accomplishes rotational invariance by rotating the trees N times; however, it’s fast enough that that’s feasible.


Page 43: Binary Features for Object Detection and Landmarking

Object Landmarking

Page 44: Binary Features for Object Detection and Landmarking

Face Alignment by Explicit Shape Regression, Cao et al

-Microsoft has been putting a lot of effort into deriving methods for landmarking faces.
-For some reason they call it face alignment. We tend to call it landmarking or segmentation.
-Basically, find points on an object that may or may not represent contours of that object.

Page 45: Binary Features for Object Detection and Landmarking

Based on: Face Alignment by Explicit Shape Regression, Cao et al

[Figure: pipeline diagram. An affine transform maps the face to the mean shape; the shape is refined over stages t = 0, 1, 2, ..., 10 ("insert magic"); then it is transformed back from the mean shape.]

-Here is one of their approaches to landmarking faces using regression trees.
-Dubbed Explicit Shape Regression.
-Typically done with 10 groups of trees.
-Each group is hundreds of trees refining the shape vector from the previous group.
-Although they don’t say it, they’re effectively using a gradient boosting approach with regression trees and a lambda of one. A slightly lower lambda would improve generalization, but most likely they were not aware of this. (See the sketch below.)
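
A sketch of that boosted update with shrinkage: lambda = 1 reproduces the behavior described above, lambda < 1 is the generalization tweak. The stage_predict callback is an assumption standing in for one group of trees.

    /* Gradient-boosted shape update: each of n_stages groups of regression
       trees predicts a correction delta_S to the current shape S (stored
       as x0,y0,x1,y1,... so n_coords = 2 * number of landmarks). Scaling
       the correction by lambda < 1 (shrinkage) trades training fit for
       generalization. */
    void boost_shape(float *S, int n_coords, float lambda,
                     void (*stage_predict)(const float *S, int stage,
                                           float *delta),
                     int n_stages)
    {
        float delta[256];                     /* assumes n_coords <= 256 */
        for (int t = 0; t < n_stages; t++) {  /* e.g. 10 groups of trees */
            stage_predict(S, t, delta);       /* sum of this group's trees */
            for (int i = 0; i < n_coords; i++)
                S[i] += lambda * delta[i];    /* shrunken update */
        }
    }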

Page 46: Binary Features for Object Detection and Landmarking

Face Alignment by Explicit Shape Regression, Cao et al

What’s inside?

[Figure: a regression tree whose nodes compare pixels at offsets from landmarks, e.g. I[S5+Δ] < I[S11+Δ] and I[S7+Δ] < I[S3+Δ], with YES/NO branches leading to leaf outputs.]

-So each regression tree is between 5 and 9 levels deep.
-Pixel comparisons are made at locations relative to the landmarks, S. (One node test is sketched below.)
-One comparison requires two landmark indices (i, j) and an x/y delta from each landmark.
-The affine-to-mean transform in the other slide removes any need to care about scale.
-The leaves store delta S’s that move S closer to the target.
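
A sketch of one such node test in C (struct fields follow the description above; the names and the bounds clamping are assumptions):

    /* One node of an ESR-style regression tree: compare two pixels
       sampled at fixed offsets from two landmarks of the current shape
       (in the mean-shape frame, so scale is already normalized away). */
    typedef struct {
        int i, j;        /* which two landmarks */
        float dxi, dyi;  /* offset from landmark i */
        float dxj, dyj;  /* offset from landmark j */
    } shape_feature;

    static int shape_feature_fires(const unsigned char *img, int w, int h,
                                   const float *Sx, const float *Sy,
                                   const shape_feature *f)
    {
        int xa = (int)(Sx[f->i] + f->dxi), ya = (int)(Sy[f->i] + f->dyi);
        int xb = (int)(Sx[f->j] + f->dxj), yb = (int)(Sy[f->j] + f->dyj);
        /* clamp so a shifted shape can't read outside the image */
        if (xa < 0) xa = 0; if (xa >= w) xa = w - 1;
        if (ya < 0) ya = 0; if (ya >= h) ya = h - 1;
        if (xb < 0) xb = 0; if (xb >= w) xb = w - 1;
        if (yb < 0) yb = 0; if (yb >= h) yb = h - 1;
        return img[ya * w + xa] < img[yb * w + xb];
        /* a leaf reached through 5-9 of these tests stores a delta-S
           that nudges the whole shape toward the target */
    }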

Page 47: Binary Features for Object Detection and Landmarking

Face Alignment by Explicit Shape Regression, Cao et al

-An average face, S^0, is placed on the image using a face detector like Viola-Jones, LBP, or that tree thing I just talked about.
-The shape is refined to the image using groups of trees followed by affine transform adjustments.
-Here are examples of landmarked faces.
-The original paper argues that all generated landmarks are linear combinations of training faces, so it implicitly creates a shape model of faces and you don’t need to worry about generating nonsensical faces.

Page 48: Binary Features for Object Detection and Landmarking

In Conclusion

I just presented a small subset of a very large topic.

The comparison of two pixels is a surprisingly useful feature that’s very easy to compute.

Combined with decision trees and ferns, these techniques substitute machine learning for hand-crafted math.

This enables complicated object recognition techniques to run in realtime on mobile devices.