Carnegie Mellon University Pittsburgh
Clear Path Detection
A thesis submitted in partial satisfaction of the requirements for the degree of
Master of Science in the Department of Electrical and
Computer Engineering
By
Qi Wu
2008
This is to certify that the thesis of Qi Wu has been approved.

____________________________________
Prof. Tsuhan Chen, Advisor

____________________________________
Prof. Marios Savvides
Carnegie Mellon University, Pittsburgh
2008
Acknowledgments
I would like to express my sincere gratitude to my advisor, Prof. Tsuhan Chen, for his
guidance, encouragement, and support during my graduate study. His knowledge, kindness,
patience, enthusiasm, and vision have brought out the best in me during my M.S. study
and provided me with lifelong benefits. He taught me how to create valuable ideas,
perform research, communicate efficiently, and work with team members. I feel very
fortunate to have joined his research group and to be one of his students at CMU.
I would also like to extend my thanks to General Motors for providing me the opportunity
to work on this challenging project, and to Dr. Wende Zhang for his continuous
encouragement and support of my GM internship and M.S. thesis work. His guidance
and support have played a crucial role in the completion of my work.
I also thank my wonderful AMP lab mates at CMU: Kate Shim, David Liu, Amy Lu, Devi
Parikh, Andrew Gallagher, Yao‐Jen Chang, Yimeng Zhang, Wei Yu, and Congcong Li; and my
best friends, Yi Zhang, Le Xie, Xiaoqian Jiang, and Sicun Gao, for supporting my research
and sharing the happy times.
Finally, I express my sincere love and appreciation to my parents, for their constant
encouragement, support, and sacrifice throughout what have been two wonderful years
of my life.
Abstract
According to statistics from the Department of Transportation, most accidents
are caused by the carelessness of drivers who are unaware of obstacles
ahead. A warning system that indicates a clear path in front of the vehicle can assist
drivers in avoiding potential dangers and prevent accidents. In this project, a
novel, robust algorithm is proposed to detect the clear path in real time, with the
specific goal of incorporating such smart technology into General Motors
cars so as to alert drivers to any obstacle within a safe frontal distance while
driving.
Unlike traditional approaches, which try to build various
detectors to catch all types of obstacles on the road, the proposed framework
focuses on indicating the clear path ahead. We assume the video camera is
calibrated (intrinsic and extrinsic parameters known) and that the vehicle
information (speed and yaw angle) is available at all times. Making good
use of this prior knowledge of the scene, we first generate perspective
patches, inverse-projected from world coordinates, for extracting features
instead of traditional 2D patches on the frame. Second, the proposed
framework exploits each patch's spatial and temporal constraints
to build a probabilistic refinement on top of the initial statistical
learning from training data, which enhances both performance and
speed.
Finally, the proposed framework is verified through experimental study,
demonstrating its robustness and efficiency. Even in challenging situations
such as shadows and illumination changes, it achieves good performance.
Table of Contents

1. Introduction
   1.1 Motivation
   1.2 Related Work
   1.3 Overview of Proposed Work
   1.4 Thesis Structure
2. Perspective Patch Generation
   2.1 2D Patch Generation
   2.2 Perspective Patch Generation
       2.2.1 Pinhole Camera Model
       2.2.2 Generate Perspective Patches
3. Feature Extraction
   3.1 Filter Bank
       3.1.1 Leung-Malik (LM) Filter Bank
       3.1.2 Gabor Filter Bank
   3.2 Feature Representation
4. Feature Selection
   4.1 Weak Classifier
   4.2 Adaboost Feature Selection
5. Learning Algorithm
6. Patch‐based Refinement
   6.1 Spatial Patch Smoothing
   6.2 Temporal Patch Smoothing
   6.3 Patch‐based Refinement
7. Experiments
   7.1 Data Description
   7.2 Feature Preparation
   7.3 Data Result
8. Conclusion and Future Work
Reference
1. Introduction
1.1 Motivation
Every day, many traffic accidents involve cars or pedestrians being hit. The majority of
these accidents occur when either a pedestrian appears suddenly or a car is traveling
too fast to stop while a frontal obstacle is close. In either case, the driver's inability
to notice the obstacle in time causes the trouble. It is therefore very important
to keep a safe following distance clear. Doing so not only enables the driver to react to a
problem ahead without a panic stop, which could cause a following driver to
crash, but also reduces the probability of crashing into a frontal car that stops or
slows suddenly. The Pennsylvania driver's manual identifies a 4-second following distance
as the safe following distance (Figure 1.1(A)). Under normal driving conditions, this predefined
distance ahead of the car allows drivers to steer or brake to avoid a hazard
safely. Hence, a system (Figure 1.1(B)) that detects any obstacle within this safe distance can
assist drivers in avoiding potential dangers and reduce accidents caused by
carelessness, cell phone usage, drowsiness, and drinking.
Figure 1.1 Safe following distance. (A) Safe distance defined by the PA driver's manual. (B) Clear path detection system.
1.2 Related Work
In previous work, many obstacle detection methods (e.g., pedestrian
detection, vehicle detection) have been built to predict potential dangers (Figure 1.2).
Figure 1.2 Pedestrian detection and vehicle detection
The most common approaches to obstacle detection use active sensors [1], such as
radar (millimeter-wave) [2][3][4] and laser (Lidar) [5][6][7]. Radar
transmits radio waves into the atmosphere, and obstacles reflect some of the power
back to the radar's receiver. Lidar (Light Detection and Ranging) likewise transmits and
receives electromagnetic radiation, in the ultraviolet, visible, and infrared regions.
Prototype vehicles employing active sensors have shown promising results in this work.
Furthermore, in the nationwide autonomous vehicle competition, the DARPA Urban
Challenge 2007, radar and Lidar were used by all teams and performed well
for obstacle detection and road navigation, which demonstrates their robustness and efficiency
[8].
However, active sensors have several drawbacks: low resolution, short range, potential
interference with each other, and considerable expense. On the other hand, passive sensors
such as cameras, combined with computer vision algorithms, offer a more affordable
solution and can track obstacles more effectively. In the last decade, various
approaches have been reported in the literature. For vehicle detection, Bertozzi et al.
[9] and Zhao et al. [10] used stereo-vision-based methods to detect vehicles and obstacles.
In Matthews et al. [11], PCA was used for feature extraction and neural networks for
detection. Goerick et al. [12] used a method called Local Orientation Coding to extract
edge information to feed a neural network classifier. Betke [13] used motion and edge
information to hypothesize vehicle locations and template matching for detection. In
the work of Schneiderman [14], the statistics of obstacle appearance and non-obstacle
appearance were represented using two histograms, each representing the
distribution of wavelet-coefficient codewords.
For pedestrian detection, wavelet responses [15] and histograms of oriented gradients
[16] have been used to learn shape-based models for detecting humans. Viola et
al. [17] adopted spatial-temporal filters based on shifted frame differences to augment
pedestrian detection using spatial filters alone, thus using both motion and appearance for
detection. Fablet and Black [18] used optical flow to learn a generative human-motion
model, while Hedvig [19] trained a Support Vector Machine for detection.
However, generic detection that covers all kinds of obstacles is a hard problem in
computer vision. The intra-class variability of the obstacle class makes the detection
process very challenging: obstacles can appear in different colors, textures, and shapes,
and their movement through the perspective scene typically produces notable variability
in appearance and articulation. Variations in lighting conditions and fluctuations of
weather further complicate the problem. Therefore, due to this variability, it is difficult
to use a uniform detector for all obstacles. Figure 1.3(a) illustrates this: in feature
space, the features extracted from different kinds of obstacles (e.g., pedestrians, vehicles)
are so dispersed that several boundaries are needed to separate them. The clear path,
by contrast, has more clustered features, needs only one boundary to be separated from
everything else, and is easy to classify.
Figure 1.3 Feature spaces: (a) the feature space of previous obstacle detection; (b) the feature space of clear path detection.
Thus, in our approach, instead of designing various obstacle detectors, we build a
clear path detector to determine the clear path ahead of the car directly. As shown
in Figure 1.3(b), the path features, having less variation in color, texture, and shape,
are more clustered and easy to separate from the obstacle features.
1.3 Overview of Proposed work
Figure 1.4 The overview of clear path detection
The clear path detection system that we propose learns to differentiate the clear path from
other obstacles. Since the camera moves with the host vehicle, it is not easy to distinguish
the two from motion alone. Thus, we use appearance, namely texture and shape, as the
features to detect the clear path in a single frame. Figure 1.4 gives an overview of the system,
comprising several components: perspective patch generation, feature extraction
and selection, SVM learning, and patch-based refinement. Compared to traditional
detection algorithms, our proposed method has two novelties:
1) Perspective Patches
Since the camera satisfies the perspective projection model, it is improper to build
patches on the 2D frame directly. In our method, we generate 3D patches in
world coordinates as shown in Figure 2.3(A); the pinhole camera model and
calibration parameters are used to inverse-perspective-map them onto the 2D frame.
These patches not only save computation but also reduce the
ambiguity within each patch, which improves the overall performance.
2) Patch‐based refinement with spatial and temporal consideration
Within a frame, neighboring patches constrain each other because of spatial
continuity. Between neighboring frames, a patch in the current frame is constrained
by its corresponding region in the previous frame because of temporal continuity.
Hence, we can use these constraints to refine the initial learning result. In our
method, two types of patch smoothing, considering spatial and temporal
information separately, are proposed. Spatial patch smoothing enforces
the constraint that neighboring patches with similar textures should
have similar probabilities of being clear path or obstacle. Temporal patch
smoothing enforces a patch's appearance constraint and probability-consistency
constraint between the current frame and previous frames. Finally, combining these
two smoothings, a maximum likelihood estimator optimizes all the factors and
improves the overall performance.
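Chapter 6 develops the actual estimator; purely to illustrate the spatial-smoothing idea above, here is a minimal sketch in which each patch's clear-path probability is pulled toward a texture-similarity-weighted average of its 4-neighbours. The function name, the Gaussian similarity weight, and the iteration scheme are illustrative assumptions, not the thesis's formulation.

```python
import numpy as np

def spatial_smooth(prob, feats, sigma_f=1.0, n_iter=5):
    """Iteratively blend each patch's clear-path probability with its
    4-neighbours, weighting neighbours by texture similarity (sketch).

    prob:  (R, C) initial clear-path probabilities per patch.
    feats: (R, C, F) texture features per patch.
    """
    R, C = prob.shape
    p = prob.copy()
    for _ in range(n_iter):
        new = p.copy()
        for r in range(R):
            for c in range(C):
                num, den = p[r, c], 1.0
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < R and 0 <= cc < C:
                        # weight each neighbour by texture similarity
                        d2 = np.sum((feats[r, c] - feats[rr, cc]) ** 2)
                        wgt = np.exp(-d2 / (2 * sigma_f ** 2))
                        num += wgt * p[rr, cc]
                        den += wgt
                new[r, c] = num / den
        p = new
    return p

probs = np.full((3, 3), 0.9)
probs[1, 1] = 0.1                  # one outlier patch
feats = np.zeros((3, 3, 4))        # identical textures everywhere
smoothed = spatial_smooth(probs, feats)
# With identical textures, the outlier is pulled toward its neighbours.
```

With identical textures the weights are all 1 and the update reduces to plain neighbourhood averaging; dissimilar textures (e.g. a road/obstacle boundary) would get small weights and be smoothed less.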
1.4 Thesis Structure
The rest of the thesis is organized as follows. Chapter 2 outlines the process of
perspective patch generation using the pinhole camera model and calibration parameters.
Chapters 3 and 4 introduce feature extraction and feature selection, whose goal is
to find the most discriminative features for representing the patches. Chapter 5 describes
the learning process, and Chapter 6 proposes a novel refinement using spatial and
temporal information to improve the overall performance. Chapter 7 details the
experimental results, followed by a discussion and future work in Chapter 8.
2. Perspective Patch Generation
In this chapter, we describe the method of generating detection patches from the input
frames. Compared to using pixels directly, patches as the detection unit
improve computational efficiency. There are two well-known kinds of patches on 2D
images that do not consider any perspective information: fixed-grid patches and dynamic-size
patches. In our proposed method, however, since the calibration information is known, we can
build our grid patches in 3D world coordinates and inverse-perspective-map them
into 2D for extracting texture from the input frame. This focuses the detection region
on the frontal path without the influence of other parts of the scene, which greatly
increases the speed and accuracy of our algorithm.
2.1 2D Patch Generation
Figure 2.1 Fixed-size patches. (A) Detection region with small patches. (B) Detection region with large patches. The blue circled patch covers both clear path and obstacles.
As shown in Figure 2.1, in previous work [14] the input image is divided into
small patches of equal size; features are extracted from each patch and fed to the
classifiers. However, the patch size is difficult to define. If the patches are too small,
as in Figure 2.1(A), computation increases greatly and fewer
discriminative features can be extracted from each patch. But if the patches are too large,
as in Figure 2.1(B), each covers too much information, even crossing different classes
(in Figure 2.1(B), the blue circled patch covers both clear path and obstacles). That makes
the patches too ambiguous to classify correctly.
Figure 2.2 Dynamic size patch.
To address the limitation of the fixed grid, previous work such as [17]
proposed scanning the whole image with sub-windows (patches) of dynamic size, as shown in
Figure 2.2. To detect the presence of a clear path in a given image sequence, each
frame is scanned with sub-windows of different sizes at various positions, and every
candidate sub-window is tested against the learned classifier. However, this wastes much
computation and time testing improperly sized sub-windows, redundant
overlapping patches, and parts of the frame unnecessary for clear path detection, such as the sky and
objects beside the road.
2.2 Perspective Patch Generation

In clear path detection, the detection region lies on the ground ahead of the
camera, which satisfies the perspective projection model: the far-off road and
obstacles appear smaller, while those close to the camera appear much bigger. Given this,
it is improper to build patches of a single fixed size, and, as discussed above, it is
inefficient to build dynamic-size patches. In our proposed
method, instead of building patches on the 2D image directly, we construct the patches in
3D world coordinates as shown in Figure 2.3. We first define the detection region in
world coordinates and divide this region into small patches. Then, the calibration
parameters and the pinhole camera model are used to inverse-perspective-map them onto the 2D
image as a prior. Finally, we construct the 2D patches on the frame based on this
prior; these are called perspective patches.
Figure 2.3 Perspective patches. (A) Patches in world coordinates. Note the 3D patches are not of equal size. (B) 3D patches inverse-mapped onto the 2D frame. (C) Perspective patches.
2.2.1 Pinhole Camera Model

In the general pinhole camera model (Figure 2.4), the camera is assumed to perform a
perfect perspective transformation. Let $m = [u, v, 1]^T$ be the projective image point of the
world point $M = [x, y, z, 1]^T$, both expressed in homogeneous coordinates. They satisfy:

$$\mu m = P M$$

where $P$ is a $3 \times 4$ projection matrix describing the perspective projection process and $\mu$ is
an arbitrary scale factor. The projection matrix can also be decomposed as:

$$P = K [R \mid T]$$
Here $K$ is the matrix of intrinsic parameters, composed of the focal length $f$ and the camera
optical center (principal point) $[u_0, v_0]$:

$$K = \begin{bmatrix} f & 0 & u_0 \\ 0 & f & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$

and $[R \mid T]$ denotes the orientation and position of the camera with respect to the world
coordinate system:

$$R = R_x(\alpha)\, R_y(\beta)\, R_z(\gamma) =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix}
\begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}
\begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix},
\qquad
T = \begin{bmatrix} t_x \\ t_y \\ t_z \end{bmatrix}$$

In our case, with both the intrinsic and extrinsic calibration parameters known, we project the
grid patches in 3D coordinates onto the 2D frame to synthesize the perspective patches.
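As a concrete illustration of the projection $\mu m = K[R \mid T]M$, a minimal sketch follows; the calibration numbers here are made up for the example, not the thesis's actual parameters.

```python
import numpy as np

def project_point(M_world, K, R, T):
    """Project a homogeneous 3D world point onto the image plane.

    Implements mu * m = K [R | T] M from the pinhole model above.
    Returns pixel coordinates (u, v).
    """
    P = K @ np.hstack([R, T.reshape(3, 1)])   # 3x4 projection matrix
    m = P @ M_world                           # homogeneous image point
    return m[:2] / m[2]                       # divide out the scale mu

# Illustrative calibration (assumed values):
f, u0, v0 = 800.0, 320.0, 240.0
K = np.array([[f, 0, u0],
              [0, f, v0],
              [0, 0, 1.0]])
R = np.eye(3)                  # camera aligned with world axes
T = np.array([0.0, 0.0, 0.0])  # camera at the world origin

u, v = project_point(np.array([1.0, 0.5, 10.0, 1.0]), K, R, T)
# A point 10 m ahead and 1 m to the side lands offset from the
# principal point by f * x / z = 80 pixels horizontally.
```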
Figure 2.4 Pinhole Camera Model
2.2.2 Generate Perspective Patches

Because the scene in the frame is always perspective, obstacles that are too far from
the camera project to regions too small to be detected. Therefore,
as shown in Figure 2.3(A), we first define the detection region in the world coordinate system as
a rectangle 9 meters wide by 25 meters long in front of the camera. Second, this
detection region is divided into small patches. It is interesting to note that the patches
are not of equal size: the farther patches are much bigger than the nearer ones (compare
the 8-meter length of the farthest row to the 2-meter length of the nearest). This is because we must keep the
projected 2D patches large enough to be detected (at least a 10-pixel square, from which we can
extract enough discriminative features). Then, the calibration parameters and the pinhole camera
model are used to inverse-map the 3D patches onto the 2D frame, as shown in Figure 2.3(B).
However, the projected 2D patches on the frame are arbitrary quadrilaterals, and it is time-consuming
to extract their features. Thus, we normalize the projected arbitrary quadrilateral
patches (Figure 2.5) and synthesize grid patches from them, as shown in Figure 2.3(C);
these are called the perspective patches.
Figure 2.5 Normalizing arbitrary quadrilateral patches and synthesizing perspective patches.
Using the proposed perspective patches, we not only reduce the computation and
ambiguity of each detection patch, but also make it possible to consider the
temporal relationship between neighboring frames and to build a refinement on top of it.
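To make the construction concrete, a sketch of the two steps, dividing the ground region into rows that lengthen with distance and projecting ground points through a simple pinhole camera, follows. The row boundaries, camera height, and focal length are illustrative assumptions, not the thesis's calibration; the real system uses the full K[R|T] model of Section 2.2.1.

```python
import numpy as np

def ground_grid(width=9.0, depths=(5.0, 7.0, 9.0, 12.0, 17.0, 25.0), cols=3):
    """Divide a ground rectangle ahead of the camera into 3D patches.

    Rows grow with distance (roughly 2 m long nearby up to 8 m at the
    far end, as in the thesis) so every projected 2D patch stays large
    enough for feature extraction. Depth breakpoints are illustrative.
    Returns patches as 4 corner points (x, z) on the ground plane,
    camera looking along +z.
    """
    xs = np.linspace(-width / 2, width / 2, cols + 1)
    patches = []
    for z0, z1 in zip(depths[:-1], depths[1:]):
        for x0, x1 in zip(xs[:-1], xs[1:]):
            patches.append([(x0, z0), (x1, z0), (x1, z1), (x0, z1)])
    return patches

def project_ground(pt, f=800.0, u0=320.0, v0=240.0, cam_height=1.2):
    """Map a ground point into the image (simple pinhole with the
    optical axis parallel to the ground; assumed geometry)."""
    x, z = pt
    return (u0 + f * x / z, v0 + f * cam_height / z)

patches_3d = ground_grid()
patches_2d = [[project_ground(c) for c in p] for p in patches_3d]
# Far patches, though larger in metres, project to smaller quadrilaterals.
```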
3. Feature Extraction
This chapter introduces the feature extraction method. As in the human visual system,
texture, color, and shape are usually extracted from each patch. In our case, however, both the
clear path and obstacles can appear in all kinds of colors under various lighting
conditions, so we do not adopt any color features. Instead, the texture and shape
of the clear path and of obstacles differ enough to serve as good
discriminative features. We therefore characterize texture and shape by their responses
to a set of orientation- and spatial-frequency-selective linear filters (a filter bank).
3.1 Filter Bank
3.1.1 Leung-Malik (LM) Filter Bank

The Leung-Malik (LM) filter set is a multi-scale, multi-orientation filter bank which
contains first and second derivatives of Gaussian filters (edge and bar filters), Laplacian of
Gaussian (LOG) filters, and Gaussian filters (spot filters), 48 filters in total. In our
implementation, we extend the original filter bank with more orientations for the
Gaussian derivatives and more scales for the LOG filters. As shown in Figure 3.1, our
filter bank consists of first and second derivatives of Gaussians at 9 orientations and
3 basic scales $\sigma \in \{\sqrt{2}, 2, 2\sqrt{2}\}$ with an elongation factor of 3 (i.e. $\sigma_x = \sigma$ and
$\sigma_y = 3\sigma$), for a total of 54 filters. Moreover, with the basic scales
$\sigma \in \{\sqrt{2}, 2, 2\sqrt{2}, 4\}$, the Gaussians and LOG filters occur at scales from $\sigma$ to $3\sigma$ separately, for 24 more
filters. Therefore, there are 78 filters in our extended LM filter bank. Figure 3.1(A)
demonstrates a visualization of the entire extended LM filter bank.
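A sketch of how the oriented Gaussian-derivative (edge and bar) filters of such a bank can be generated; the filter size and zero-mean normalization are assumptions, and the Gaussian and LOG spot filters of the full bank are omitted here.

```python
import numpy as np

def oriented_gaussian_derivative(size, sigma, theta, order):
    """First- or second-order Gaussian derivative filter, elongated by
    a factor of 3 (sigma_x = sigma, sigma_y = 3*sigma) and rotated by
    theta, as in the extended LM bank described above."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    X = x * np.cos(theta) + y * np.sin(theta)
    Y = -x * np.sin(theta) + y * np.cos(theta)
    sx, sy = sigma, 3.0 * sigma
    g = np.exp(-(X**2 / sx**2 + Y**2 / sy**2) / 2.0)
    if order == 1:                       # edge filter: d/dX of the Gaussian
        h = -X / sx**2 * g
    else:                                # bar filter: d^2/dX^2 of the Gaussian
        h = (X**2 / sx**4 - 1.0 / sx**2) * g
    return h - h.mean()                  # make the filter zero-mean

# 2 derivative orders x 3 basic scales x 9 orientations = 54 filters
scales = [np.sqrt(2), 2.0, 2.0 * np.sqrt(2)]
thetas = [k * np.pi / 9 for k in range(9)]
bank = [oriented_gaussian_derivative(31, s, t, o)
        for o in (1, 2) for s in scales for t in thetas]
```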
3.1.2 Gabor Filter Bank

A Gabor filter is a linear filter whose impulse response is a harmonic function
multiplied by a Gaussian function. It has received considerable attention because the
characteristics of object shape can be approximated by these filters. In addition, these
filters have been shown to possess optimal localization properties in both the spatial and
frequency domains.
The two-dimensional Gabor function is:

$$G(x, y) = \exp\left(-\frac{X^2 + \gamma^2 Y^2}{2\sigma^2}\right) \sin\left(\frac{2\pi X}{\lambda}\right)$$

$$X = x\cos\theta + y\sin\theta, \qquad Y = -x\sin\theta + y\cos\theta$$
where $\lambda$ is the wavelength of the sinusoidal factor of the Gabor kernel, $\theta$ specifies the
orientation, and $\gamma$, called the spatial aspect ratio, represents the ellipticity of the support of
the Gabor function. To generate a variety of Gabor filters for extracting different
features, our Gabor filter bank keeps the spatial aspect ratio at 1 and 0.5 separately and
varies the wavelength over the set [3, 5, 7, 10, 13] at 9 orientations each, for 90 filters
in total. Figure 3.1(B) demonstrates a visualization of all Gabor filters.
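The bank can be generated directly from the formula above; a sketch follows, where the filter size and the envelope width $\sigma$ (which the text does not specify) are assumptions.

```python
import numpy as np

def gabor(size, wavelength, theta, gamma, sigma=None):
    """2D Gabor filter following the formula above:
    G(x, y) = exp(-(X^2 + gamma^2 Y^2) / (2 sigma^2)) * sin(2 pi X / wavelength).
    Tying sigma to the wavelength is an assumption made for this sketch."""
    if sigma is None:
        sigma = 0.5 * wavelength
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    X = x * np.cos(theta) + y * np.sin(theta)
    Y = -x * np.sin(theta) + y * np.cos(theta)
    return (np.exp(-(X**2 + gamma**2 * Y**2) / (2 * sigma**2))
            * np.sin(2 * np.pi * X / wavelength))

# 2 aspect ratios x 5 wavelengths x 9 orientations = 90 filters
bank = [gabor(31, lam, k * np.pi / 9, g)
        for g in (1.0, 0.5)
        for lam in (3, 5, 7, 10, 13)
        for k in range(9)]
```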
Figure 3.1 Filter banks. (A) Visualization of the extended LM filter bank with a mix of edge, bar, and spot filters at multiple orientations and scales. (B) Visualization of the Gabor filter
bank at 9 orientations with different parameters.
3.2 Feature Representation
Figure 3.2 Feature representation
As shown in Figure 3.2, after cropping the detection region from the input frame, each
pixel is transformed into a 168-dimensional vector of filter-bank responses. We then
sum all absolute responses corresponding to each filter within a patch. Since the size
of a patch changes with perspective, we normalize the features by dividing by the area
of each patch. Finally, the normalized values are concatenated to form a
168-dimensional feature vector, which can be treated as a histogram over the 168 filters.
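Assuming the per-pixel responses have already been computed (e.g. by convolving the cropped frame with the 168 filters), the per-patch aggregation and area normalization just described can be sketched as:

```python
import numpy as np

def patch_descriptor(responses, mask):
    """Aggregate per-pixel filter responses into one patch feature.

    responses: (H, W, F) array of F filter responses at every pixel
               (F = 168 for the bank described above).
    mask:      (H, W) boolean array selecting the pixels of one patch.
    Returns an F-dim vector: sum of absolute responses over the patch,
    divided by the patch area so near and far patches are comparable
    despite their different sizes.
    """
    area = mask.sum()
    return np.abs(responses[mask]).sum(axis=0) / area

# Toy check: 4x4 image, 3 filters, one 2x2 patch in the corner.
resp = np.ones((4, 4, 3))
m = np.zeros((4, 4), dtype=bool)
m[:2, :2] = True
feat = patch_descriptor(resp, m)   # each entry: 4 * |1| / area 4 = 1
```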
4. Feature Selection

After extracting features from the input frame, this chapter introduces feature selection.
There are two reasons to select features: 1) some filters in the LM filter bank
are similar to filters in the Gabor bank and therefore provide redundant features; 2) not all features
are discriminative enough to feed the classifier, so we should select the good
features obtained from the distinctive filters and discard the noisy variation. In our
proposed method, we first build a weak classifier on each feature; then AdaBoost is
adopted to choose the most discriminative features, which improves computational
efficiency and reduces running time.
4.1 Weak Classifier
For each patch, whether clear path or obstacle, there are 78 features obtained from the
LM filters and another 90 features obtained from the Gabor filters, for a total of 168
features that act as a pool for selection. Taking one feature column x_j from the
training data x, we find the optimum threshold θ_j that minimizes the overall
classification error, which yields a weak classifier h_j:

    h_j(x) = { clear path, if x_j < θ_j
             { obstacles,  if x_j > θ_j
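A minimal sketch of fitting one such threshold classifier, assuming the label convention above (1 = clear path when the feature value falls below the threshold, 0 = obstacles otherwise); the helper name is hypothetical.

```python
# Fit a single-feature threshold weak classifier h_j: scan candidate
# thresholds on one feature column and keep the threshold minimizing
# the weighted classification error.

def fit_weak_classifier(xs, ys, ws):
    """xs: one feature column; ys: labels (1 = clear path, 0 = obstacles);
    ws: per-sample weights. Returns (weighted error, optimal threshold)."""
    best = (float("inf"), None)
    for theta in sorted(set(xs)):
        err = sum(w for x, y, w in zip(xs, ys, ws)
                  if (1 if x < theta else 0) != y)
        if err < best[0]:
            best = (err, theta)
    return best
```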
4.2 AdaBoost Feature Selection
AdaBoost is a boosting algorithm designed to improve performance using a combination of
weak classifiers. The procedure for choosing the most discriminative features is
motivated by the face detection algorithm of Viola and Jones [17]; the details are
described in Table 1. Initially, every training sample is weighted equally. From the
pool of 168 features, the one with minimum classification error is selected without
replacement. The samples are re-weighted in proportion to the misclassification error
before the next feature, with its corresponding optimum weak classifier, is chosen from
the updated pool. Each selected feature (classifier) also has an associated confidence
α_j that is inversely proportional to its misclassification error and hence measures its
strength. The process of choosing classifiers continues until all the features have been
selected.
Table 1: AdaBoost Feature Selection

Given: training data (x_1, y_1), …, (x_N, y_N), where x_i is the i-th feature
corresponding to the i-th filter and y_i ∈ {0, 1} indicates whether the patch is
"obstacles" or "clear path".
Aim: train a weight for each feature and select features by weight.

• Initialize the weights w_{1,i} = 1/(2l) for y_i = 0 and w_{1,i} = 1/(2m) for
  y_i = 1, where l and m are the numbers of obstacle and clear-path examples.
• For t = 1, …, T:
  o Normalize the weights: w_{t,i} ← w_{t,i} / Σ_{j=1}^{n} w_{t,j}
  o Select the best weak classifier h_t with respect to the weighted error:
        ε_t = min_j Σ_i w_i |h_j(x_i) − y_i|
  o Update the weights:
        w_{t+1,i} = w_{t,i} β_t^{1−e_i}
    where e_i = 0 if example x_i is correctly classified by h_t and e_i = 1
    otherwise, and β_t = ε_t / (1 − ε_t).
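The loop in Table 1 can be sketched as below. This is an illustrative simplification, not the thesis implementation: each feature is reduced to a fixed-threshold stump at 0.5 so that the ranking loop itself stays visible, and a small guard keeps β numerically safe when the error is 0 or 1.

```python
# Hedged sketch of Table 1: repeatedly pick the weak classifier with
# minimum weighted error and down-weight correctly classified samples
# by beta = eps / (1 - eps); selection is without replacement.

def adaboost_rank(columns, ys, T):
    n = len(ys)
    ws = [1.0 / n] * n                      # uniform start (text splits by class)
    order = []
    pool = set(range(len(columns)))
    for _ in range(min(T, len(pool))):
        total = sum(ws)
        ws = [w / total for w in ws]        # normalize the weights
        best_j, best_err, best_preds = None, float("inf"), None
        for j in pool:
            preds = [1 if x < 0.5 else 0 for x in columns[j]]  # toy stump
            err = sum(w for w, p, y in zip(ws, preds, ys) if p != y)
            if err < best_err:
                best_j, best_err, best_preds = j, err, preds
        order.append(best_j)
        pool.discard(best_j)                # without replacement
        beta = max(best_err, 1e-12) / max(1.0 - best_err, 1e-12)
        ws = [w * (beta if p == y else 1.0)  # keep weight on the mistakes
              for w, p, y in zip(ws, best_preds, ys)]
    return order
```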
Figure 4.1 visualizes all the filters selected by this algorithm from the original
extended LM filter bank and Gabor filter bank, ranked by selection order. Figure 4.2
shows the filter response images generated by the top 3 and last 3 ranked filters. From
Figure 4.1, it is interesting that the selection of the most discriminative features
follows a trend: edge and bar filters at certain scales, oriented roughly between 45 and
135 degrees, rank higher than the other filters. This is confirmed by Figure 4.2: the
top 3 ranked filters produce responses on clear path that are distinct from those on
obstacles. In contrast, the results from the last 3 ranked filters show many noisy
responses generated by ill-suited filters, creating ambiguity between clear path and
obstacles.

Figure 4.1 Visualization of all ranked filters (ranked from top to bottom, left to right)

Figure 4.2 Filter response images generated by selected filters. (A) Original frame
(B) Top 3 ranked filters (C) Last 3 ranked filters
These results stem from the fact that clear path is usually flat with little texture,
while obstacles such as cars and pedestrians have complex texture with a variety of
edges and lines. This is the key property that lets us distinguish clear path from
obstacles. Only filters of the right scale and shape generate the maximum filter
responses. In addition, the captured frame is perspective, as mentioned before.
Therefore, the boundaries and texture of both clear path and obstacles run at
characteristic angles (around 45 degrees on the left and 135 degrees on the right)
toward the horizon line. Filters with these orientations yield the most discriminative
features.
5. Learning Algorithm
The Support Vector Machine (SVM), developed by Vladimir Vapnik [20], is arguably the
most popular off-the-shelf discriminative classifier. It provides state-of-the-art
performance in many real-world applications such as text categorization, biological
sequence analysis, and object classification. It is also known as the maximum-margin
classifier, as it learns the decision boundary separating the positive and negative
classes (assuming a binary classification problem) that maximizes the margin from the
data.
Figure 5.1 Support Vector Machines when the data is linearly separable. The decision boundary is chosen so as to maximize the margin.
Consider the training data set {(x_i, y_i)}, i = 1, 2, …, N, where x_i is a feature
vector with label y_i equal to −1 or +1, denoting obstacles and clear path respectively.
The feature vectors are assumed to be normalized to zero mean and unit variance to
prevent any particular dimension(s) from dominating the decision boundary. SVM strives
to find the hyperplane ω^T x − b = 0 that best separates the training data in terms of
distance from this hyperplane. If the data is linearly separable, the problem reduces to
maximizing the margin 2/||ω|| between the parallel hyperplanes on either side of the
decision boundary, such that the following conditions hold for all samples:
    ω^T x_i − b ≥ +1, if y_i = +1
    ω^T x_i − b ≤ −1, if y_i = −1

These can be written as:

    y_i (ω^T x_i − b) ≥ 1, ∀i
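A small numeric check of this constraint and the margin 2/||ω|| on a toy 2-D linearly separable set; the hyperplane and points are illustrative, not from the thesis data.

```python
# Verify y_i (w^T x_i - b) >= 1 for every sample, and compute the
# margin 2/||w|| between the two parallel hyperplanes.
import math

w, b = [1.0, 0.0], 0.0                     # toy hyperplane: x1 = 0
data = [([2.0, 1.0], +1), ([1.5, -1.0], +1),
        ([-2.0, 0.5], -1), ([-1.0, 2.0], -1)]

def satisfies_margin(w, b, data):
    return all(y * (sum(wi * xi for wi, xi in zip(w, x)) - b) >= 1
               for x, y in data)

margin = 2.0 / math.sqrt(sum(wi * wi for wi in w))   # = 2/||w||
```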
As shown in Figure 5.1, the sample points that lie on the margin hyperplanes are known
as support vectors, as they are the ones that determine the decision boundary. If the
data is not linearly separable, SVM maps the vectors into a high-dimensional feature
space with an appropriate kernel function K(x, x_i) so that the linear-separability
constraint can be satisfied. In our implementation, unlike the classical SVM, which uses
the signum function and has a binary output (+1 or −1), we need the posterior
probability of each class as prior information for later processing. Thus, we first
obtain the binary posterior probability distribution, which can be written as:
    f(x) = Σ_{i=1}^{N} α_i y_i K(x, x_i)

    p(y = ±1 | ω, x) = 1 / (1 + exp(∓ f(x) / ||ω||))
where α_i denotes the Lagrange multipliers and K(x, x_i) is the kernel function, for
which we use the RBF (Radial Basis Function) kernel. As shown in Figure 5.2, given the
features we classify each patch as "clear path" or "obstacles". Additionally, the SVM
classifier provides the probabilities of both classes, which are used to refine the
results in the next section.
Figure 5.2 Initial Type Estimation using SVM
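The probability output above can be sketched as below. This is a minimal illustration, not the trained model: the support set, multipliers, and the gamma value are placeholders, and the sigmoid scaling by 1/||ω|| follows the formula in the text.

```python
# Squash the SVM decision value f(x) = sum_i alpha_i y_i K(x, x_i)
# through a sigmoid scaled by 1/||w|| to get p(y = +1 | x).
import math

def rbf(x, z, gamma=0.0313):
    """RBF kernel exp(-gamma * ||x - z||^2); gamma value illustrative."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def posterior(x, support, w_norm):
    """support: list of (alpha_i, y_i, x_i); returns p(y = +1 | x)."""
    f = sum(a * y * rbf(x, sv) for a, y, sv in support)
    return 1.0 / (1.0 + math.exp(-f / w_norm))
```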
6. Patch-based Refinement
In our prior classification, each patch is classified into two classes: "clear path" and
"obstacles". However, like any other machine learning algorithm, SVM cannot guarantee
100 percent classification accuracy. Sometimes texture ambiguity leads SVM to the wrong
decision. Two classic errors are shown in Figure 6.1:
1) Spatial error. Within the same frame, a clear-path patch is wrongly classified as
"obstacles" while surrounded by correctly classified clear-path patches.
2) Temporal error. Between neighboring frames, a patch in the current frame is
classified as "obstacles" while its corresponding regions in the previous frame,
computed from the vehicle's movement, are classified as "clear path", creating a
conflict.
In fact, these kinds of errors can be corrected if we consider the relationships among
patches using their spatial and temporal information.
Figure 6.1 Classification errors; the circled patches are classified into the wrong classes. (A) Spatial error (B) Temporal error
Therefore, in this section we propose two types of patch smoothing that consider spatial
and temporal information separately. Spatial patch smoothing enforces a smoothness
constraint: neighboring patches with similar textures (e.g., color) should have similar
probabilities of being clear path or obstacles. It thus updates a patch's status under
the influence of its neighboring patches' statuses. Temporal patch smoothing enforces
appearance and probability-consistency constraints between the current frame and
previous frames. It thus updates a patch's status in the current frame under the
influence of its corresponding region's status in previous frames. Finally, combining
these two smoothings, a maximum likelihood estimator optimizes all the factors and
improves overall performance.
6.1 Spatial Patch Smoothing
Figure 6.2 Spatial patch smoothing. The detecting patch s_j has 6 adjoining neighboring patches.
Within one frame, neighboring patches have relationships that can help determine the
class of the detecting patch. Therefore, we define the spatial smoothness coefficient
n_j(c) of patch j, which enforces the constraint that neighboring patches with similar
texture should have a similar class c.
Let s_l denote one of the detecting patch s_j's neighboring patches, with associated
initial probability P_l^0(c) obtained from the SVM and maximum likelihood estimate ĉ_l:

    ĉ_l = argmax_c P_l^0(c)

The class of patch s_j is modeled by a contaminated Gaussian distribution with mean ĉ_l
and variance σ_l². We define the spatial smoothness coefficient n_j(c) to be:
    n_j(c) = Π_{s_l} N(c; ĉ_l, σ_l²) + ε,    c = 1 (clear path), 2 (obstacles)

where N(x; mean, σ²) is the Gaussian distribution and ε is a small constant
(e.g., 10⁻¹⁰).
We evaluate the variance σ_l² using texture similarity, neighboring connectivity, and
s_l's initial probability obtained from the SVM:
1. Texture similarity Δ_{j,l}, which measures the texture difference between patches
s_j and s_l. We estimate it as follows. Assuming texture is represented by color, two
patches should have similar color distributions if their textures are similar. We first
compute the color histograms of the two patches as their color distributions (C_j and
C_l). The Kullback–Leibler divergence then measures the difference between these two
distributions:

    D_KL(C_j || C_l) = Σ_i C_j(i) log( C_j(i) / C_l(i) )

Since the KL divergence is not symmetric, we sum both divergences to measure the
distance between the two distributions:

    Δ_{j,l} = D_KL(C_j || C_l) + D_KL(C_l || C_j)

Therefore, if the textures of two patches are identical, their texture similarity
Δ_{j,l} is zero, while Δ_{j,l} grows as the two patches become more different.
2. The neighboring connectivity b_{j,l}, the fraction of patch s_j's border shared with
patch s_l. As shown in Figure 6.2, neighboring patch 2 shares a longer border with
detecting patch j than neighboring patch 1 does, so patch 2 has more influence on the
detecting patch j and its neighboring connectivity is larger.
3. The initial probability P_l^0(c) of patch s_l obtained from the SVM.
Therefore, the variance σ_l² is defined as
    σ_l² = v / ( P_l^0(ĉ_l) · b_{j,l} · N(Δ_{j,l}; 0, σ_Δ²) )

where v and σ_Δ² are constants (v = 8 and σ_Δ² = 20 in our experiments). Thus, if patch
s_j and its neighboring patch s_l have similar textures, and if patch s_j's class is
consistent with its neighbors' status estimates (both classified as obstacles or both as
clear path), we expect the spatial smoothness coefficient n_j(c) to be large.
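The spatial coefficient can be sketched as below. This is an illustrative reading of the formulas above, not the thesis code: helper names are hypothetical, a small ε guards the logarithms and the product, and the constants follow the text (v = 8, σ_Δ² = 20).

```python
# Spatial smoothing coefficient n_j(c): symmetric KL divergence between
# color histograms gives the texture distance Delta; together with the
# border fraction b and the neighbor's SVM probability it sets the
# variance of a Gaussian vote centered on the neighbor's class estimate.
import math

def sym_kl(cj, cl, eps=1e-10):
    d = lambda p, q: sum(pi * math.log((pi + eps) / (qi + eps))
                         for pi, qi in zip(p, q))
    return d(cj, cl) + d(cl, cj)            # Delta_{j,l}

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def spatial_coeff(c, neighbors, v=8.0, var_delta=20.0, eps=1e-10):
    """neighbors: list of (c_hat, p0, border_frac, delta) per neighbor."""
    n_j = 1.0
    for c_hat, p0, b, delta in neighbors:
        var = v / (p0 * b * gaussian(delta, 0.0, var_delta) + eps)
        n_j *= gaussian(c, c_hat, var) + eps
    return n_j
```

A neighbor with identical texture (delta = 0) voting for class 1 makes n_j(1) exceed n_j(2), which is exactly the smoothing behavior described in the text.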
6.2 Temporal Patch Smoothing
Figure 6.3 Temporal patch smoothing. The patches in the current frame are projected back
to the previous frame given the known speed and yaw angle of the vehicle.
Because the camera on the vehicle is moving forward, temporally neighboring frames are
also related. As shown in Figure 6.3, knowing the speed and yaw angle of the vehicle, we
can calculate where the patches in the current frame (blue rectangles in (B)) were
located in the previous k frames (red rectangles in (A), for k = 1). Usually, a patch
and its corresponding regions in previous frames should have similar texture and be
classified into the same class. However, it is important to note that several kinds of
occlusion can leave a patch without corresponding patches in the previous frames, such
as:
a) An obstacle ahead keeps the same speed as our vehicle. A patch in the current frame
is "clear path" but its corresponding region in the last frame was occluded by the
obstacle (Figure 6.4(A)).
b) An obstacle suddenly appears from the side of our vehicle. A patch in the current
frame is "obstacles" but did not inherit that class from the last frame (Figure 6.4(B)).
Figure 6.4 Two types of occlusion
In these two cases, a patch is not consistent with its temporal regions. We call such a
patch not visible and set its consistency likelihood to a fixed probability instead of
computing the temporal consistency.
Therefore, in our implementation, the temporal smoothing coefficient c_{j,k}(c), which
ensures that patch s_j's estimate is consistent with its corresponding estimates in the
previous k frames, is computed from temporal consistency, visibility, and the patch's
initial probability obtained from the SVM.
1. Temporal consistency.
Given the vehicle speed v_T, yaw angle θ_T, and video frame rate f, we first calculate
that the vehicle has moved a distance v_T / f between neighboring frames. Second, the
region of each patch in world coordinates in the previous frame can be computed by:
    [X_prev]   [X_cur]           [sin θ_T]
    [Y_prev] = [Y_cur] − (v_T/f)·[cos θ_T]
    [Z_prev]   [Z_cur]           [   0   ]
Then, combining the change of position with the calibration information and camera
model, we project patch s_j onto its neighboring frame. Finally, since the projected
region may cover several patches in the previous frame, we compute patch s_j's projected
distribution b_{j,k}^0(c) from the probability distribution in the projected previous
frame k, to estimate the temporal consistency in the absence of occlusion:

    b_{j,k}^0(c) = (1 / num_{S_j}) Σ_{x ∈ S_j} P_{r(k,x)}^0(c)

As shown in Figure 6.3(A), r(k, x) is the index of the patch at time k containing the
pixel corresponding to pixel position x of patch s_j, and num_{S_j} is the number of
pixels in patch s_j. If the projected region's status is consistent with patch s_j's
status, we expect b_{j,k}^0(c) to be large when patch s_j is visible in the previous
frame.
2. Visibility v_{j,k}.
Due to the possible occlusions discussed above, a patch might not have corresponding
pixels in the previous frames. We estimate the overall visibility likelihood v_{j,k} to
measure how visible the patch is. This factor is modeled as a Gaussian and computed from
the texture similarity defined above:

    v_{j,k} = N(Δ_{j,k}; 0, σ_Δ²)

If patch s_j is visible in the previous frame, as shown in Figure 6.3, we can find its
corresponding region by searching the previous frames, and the temporal consistency
yields large values of b_{j,k}^0(c). If patch s_j is occluded in the previous frame, as
shown in Figure 6.4, we cannot find its corresponding region, and no solution yields a
large b_{j,k}^0(c) value. Therefore, we use v_{j,k} as a robust and computationally
efficient measure of a patch's visibility.
3. The initial probability P_j^0(c) of patch s_j obtained from the SVM.
Now we combine the visible and occluded cases. If a patch is fully visible, c_{j,k}(c)
is given by the visible consistency likelihood b_{j,k}^0(c) P_j^0(c); otherwise, its
occluded consistency likelihood is a fixed prior P_0 (e.g., 1/2). Therefore,

    c_{j,k}(c) = v_{j,k} b_{j,k}^0(c) P_j^0(c) + (1 − v_{j,k}) P_0,
        c = 1 (clear path), 2 (obstacles)

In our case, for computational efficiency, we only compute the influence from the
previous frame (k = 1).
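The temporal step can be sketched as below. This is an illustrative reading of the formulas above, assuming b_{j,k}^0(c) has already been computed from the projected region; the backward projection follows the translation equation in the text, and P_0 = 1/2 is the fixed occlusion prior.

```python
# Project a world point back by the vehicle motion (v_T / f along the
# yaw direction), then blend visible and occluded consistency cases by
# the visibility v_{j,k}.
import math

def project_back(x, y, z, v_t, f, theta):
    """One-frame backward translation: subtract (v_T/f)*(sin t, cos t, 0)."""
    step = v_t / f                          # distance moved between frames
    return (x - step * math.sin(theta),
            y - step * math.cos(theta),
            z)

def temporal_coeff(b0_c, p0_c, visibility, prior=0.5):
    """c_{j,k}(c) = v * b0(c) * P0(c) + (1 - v) * prior."""
    return visibility * b0_c * p0_c + (1.0 - visibility) * prior
```

With full visibility the coefficient reduces to b_{j,k}^0(c) P_j^0(c); with zero visibility it falls back to the fixed prior, matching the two occlusion cases described above.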
6.3 Patch-based Refinement
Finally, we refine the detecting patch s_j's initial probability P_j^0(c) using both its
neighboring patches and its corresponding regions in the previous frames. The
probability of patch s_j is updated iteratively as follows:

    P_j^{t+1}(c) = n_j(c) Π_k c_{j,k}(c) / Σ_{c'∈{1,2}} n_j(c') Π_k c_{j,k}(c'),
        c = 1 (clear path), 2 (obstacles),  t = 0, 1, 2, …

where n_j(c) is the spatial smoothing coefficient and c_{j,k}(c) is the temporal
smoothing coefficient for each projected region in the previous k frames (k = 1).
Moreover, for efficiency, we update the probability of patch s_j only once (t = 1).
Thus, after refinement, the final maximum likelihood estimate ĉ_j of patch s_j is:

    ĉ_j = argmax_c P_j^t(c)
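The one-step update above can be sketched as follows, taking the precomputed spatial and temporal coefficients as inputs (names are illustrative, and only a single previous frame k = 1 is assumed, as in the text).

```python
# One refinement step: multiply each class's spatial coefficient n_j(c)
# by its temporal coefficient c_{j,1}(c), normalize over the two
# classes, and take the argmax as the final estimate.

def refine(n_j, c_jk):
    """n_j, c_jk: dicts mapping class c in {1, 2} to its coefficient."""
    scores = {c: n_j[c] * c_jk[c] for c in (1, 2)}
    total = sum(scores.values())
    probs = {c: s / total for c, s in scores.items()}
    c_hat = max(probs, key=probs.get)       # final ML estimate
    return probs, c_hat
```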
7. Experiments
7.1 Data Description
The experimental data, provided by General Motors, consists of real-time videos captured
from a running vehicle. The videos run at 5 fps, with 4500 frames at 320×240 pixels in
total, and cover various conditions. Some typical situations are shown in Figure 7.1:
urban, highway, shadow, and illumination change. Some cases are challenging because of
ambiguity in texture and shape: the texture of a highway fence is similar to that of
clear path; the appearance of shadow on the clear path is easily confused with that of
obstacles; and sudden illumination changes also greatly affect texture and appearance.
Nevertheless, our experiments confirm that we achieve reasonable performance in these
situations.
Figure 7.1 Typical situations in the experiments. (A) Urban (B) Illumination change (C) Shadow (D) Highway
We then calibrated the camera's intrinsic parameters (focal length and optical center)
offline with checkerboard patterns, and its extrinsic parameters (the translation vector
and rotation matrix) with markers on the ground, using Zhang's method [62].
The real-time movement parameters corresponding to each frame, such as vehicle speed and
yaw angle, are read from sensors on the vehicle.
7.2 Feature Preparation
Figure 7.2 Feature selection. (A) 5-fold cross-validation (B) ROC curve
As described in Chapter 2, each frame is divided into patches in world coordinates and
inverse-mapped to perspective patches in 2D. Hence, 30 patches varying from 6×7 to 23×15
pixels in size are generated per frame. We labeled 33201 patches (6224 obstacle patches
and 23986 clear-path patches) extracted from 1007 frames covering all four typical
situations discussed above. Convolved with the extended LM filters and Gabor filters,
168 features are initially obtained from each patch. AdaBoost feature selection was
trained on the full set of clear-path and obstacle data to rank the features. Then, we
trained the SVM-based status estimator on the whole training set, adding the ranked
features incrementally. Figure 7.2(A) shows the 5-fold cross-validation results as
ranked features are added. We also adjusted the SVM parameters and plot the ROC curve in
Figure 7.2(B). Clearly, using more features improves performance. We observed the
variation in false alarm rate and detection rate for different numbers of features as
the SVM parameters changed, and chose the configuration providing a good balance between
speed and performance. In this project, we choose the top 50 features with their
corresponding filters. The SVM model uses an RBF kernel with parameters C = 32 and
gamma = 0.0313.
7.3 Results
In the test procedure, every frame is represented by 30 perspective patches, with 50
features extracted from each. The initial classification is produced by the SVM model
with probability output. Figure 7.3 shows some initial detection results in the urban
and highway cases. The results show that, even without any refinement, our algorithm can
already identify the clear path well against different types of obstacles (e.g., cars or
roadsides) of various sizes, shapes, and orientations.
Figure 7.3 Result of initial detection using SVM
Nevertheless, SVM can make wrong classifications because of its imperfect performance or
the ambiguity of the patches, such as the landmark in the clear-path patch shown in
Figure 7.4(A)1. Furthermore, in challenging cases like Figure 7.4(A)3 and 4, the
bridge's shadow and the illumination change greatly alter the patches' appearance and
texture, leading to wrong classifications. Hence, spatial and temporal information is
used to refine the initial classification. Figure 7.4(B) demonstrates the results after
refinement. The error patch in Figure 7.4(A)1 is corrected by the constraint of its
neighboring patches. The error patches in Figure 7.4(A)3 and 4 are corrected by the
constraints of both temporal and spatial neighboring patches.
Figure 7.4 Patch-based refinement
Table 7.1 shows the performance before and after refinement. Compared with the initial
SVM classification, which achieves 92.23% accuracy with 4.6% FAR and 5.1% FRR,
patch-based refinement improves accuracy to 94.57% and sharply reduces FAR to 3.2% and
FRR to 4.5%. We show more examples in Figure 7.5. Figure 7.6 shows some extreme cases we
cannot yet handle: huge illumination changes and over-exposure completely destroy the
texture, and the wrong constraints from spatial and temporal information amplify the
errors. We can improve our algorithm by extending the training set to cover these
situations or by finding more discriminative features that are invariant to illumination
change.

                     Accuracy   FAR    FRR
    SVM              92.23%     4.6%   5.1%
    SVM + Refinement 94.57%     3.2%   4.5%

Table 7.1 Test results
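The reported metrics follow the usual confusion-matrix definitions; a small sketch with made-up counts (not the thesis data) shows how they relate.

```python
# Accuracy, false acceptance rate (obstacle patch accepted as clear
# path), and false rejection rate (clear-path patch rejected as
# obstacle) from confusion-matrix counts.

def rates(tp, tn, fp, fn):
    """tp/fn count clear-path patches, tn/fp count obstacle patches."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    far = fp / (fp + tn)      # obstacles wrongly accepted as clear path
    frr = fn / (fn + tp)      # clear path wrongly rejected as obstacle
    return accuracy, far, frr
```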
Figure 7.5 More test results
Figure 7.6 Extreme cases that cannot yet be handled
8. Conclusion and Future Work
Unlike traditional approaches that build various detectors to detect all types of
obstacles on the road, the proposed framework focuses on indicating the clear path
ahead. We assume the video camera is calibrated (intrinsic and extrinsic parameters) and
that the vehicle information (speed and yaw angle) is known at all times. Making good
use of this prior knowledge of the scene, we first generate perspective patches
projected from world coordinates for feature extraction, instead of the traditional 2D
patches on the frame. The proposed framework then exploits each patch's spatial and
temporal constraints to build a probabilistic refinement on top of the initial
statistical learning from training data. This prior enhances both performance and speed.
Finally, our proposed framework is verified through experimental study, demonstrating
its robustness and efficiency. Even in challenging situations such as shadow and
illumination changes, we still obtain good performance after the proposed refinement.
In the future, we will further improve the learning of prior knowledge with more
training samples and better learning algorithms. Moreover, we will investigate tracking
using our perspective-projection prior, for example predicting where tracked obstacles
are located in world coordinates and projecting them onto subsequent frames using
motion, vehicle speed, and calibration parameters.
References
1. Hebert, M., Active and Passive Range Sensing for Robotics. Proc. IEEE Int'l Conf. Robotics and Automation, 2000. 1: p. 102-110.
2. K. Kaliyaperumal, S.L., K. Kluge, An algorithm for detecting roads and obstacles in radar images. IEEE Transactions on Vehicular Technology, 2001. 50(1): p. 170-182.
3. S. Sugimoto, H.T., H. Takahashi, M. Okutomi, Obstacle detection using millimeter-wave radar and its visualization on image sequence. ICPR, 2004.
4. S. Park, T.K., S. Kang, and K. Heon, A Novel Signal Processing Technique for Vehicle Detection Radar. 2003.
5. A. Ewald, V.W., Laser scanners for obstacle detection in automotive applications. Proc. IEEE Intelligent Vehicles Symposium, 2000.
6. J. Hancock, M.H., C. Thorpe, Laser intensity-based obstacle detection. Int'l Conf. on Intelligent Robots and Systems, 1998.
7. C. Wang, C.T., and A. Suppe, Ladar-Based Detection and Tracking of Moving Objects from a Ground Vehicle at High Speeds. Proc. IEEE Intelligent Vehicles Symp., 2003.
8. http://www.tartanracing.org/
9. N. Bertozzi, A.B., Real-time lane and obstacle detection on the GOLD system. IEEE Intelligent Vehicles Symp., 1996: p. 213-218.
10. G. Zhao, Y.S., Obstacle detection by vision system for autonomous vehicle. IEEE Intelligent Vehicles Symp., 2003: p. 31-36.
11. N. Matthews, P.A., D. Charnley, and C. Harris, Vehicle detection and recognition in greyscale imagery. Control Engineering Practice, 1996. 4: p. 473-479.
12. C. Goerick, N.D. and M.W., Artificial neural networks in real time car detection and tracking applications. Pattern Recognition Letters, 1996: p. 335-343.
13. M. Betke, E.H. and L. Davis, Multiple vehicle detection and tracking in hard real time. IEEE Intelligent Vehicles Symp., 1996: p. 351-356.
14. H. Schneiderman, T.K., Probabilistic modeling of local appearance and spatial relationships for object recognition. CVPR, 1998: p. 45-51.
15. M. Oren, C.P., P. Sinha, E. Osuna and T. Poggio, Pedestrian detection using wavelet templates. CVPR, 1997: p. 193-199.
16. N. Dalal, B. Triggs, Histograms of Oriented Gradients for Human Detection. CVPR, 2005.
17. P. Viola and M. Jones, Rapid Object Detection using a Boosted Cascade of Simple Features. CVPR, 2001.
18. R. Fablet, M.J. Black, Automatic Detection and Tracking of Human Motion with a View-based Representation. ECCV, 2002. 1: p. 476-491.
19. Sidenbladh, H., Detecting Human Motion with Support Vector Machines. ICPR, 2004. 2: p. 188-191.
20. Burges, C.J.C., A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 1998. 2(2): p. 121-167.