Learning Gaussian Conditional Random Fields for Low-Level...

Submitted Draft – Final Version will appear in CVPR 2007 -Please do not distribute

Learning Gaussian Conditional Random Fields for Low-Level Vision

Marshall F. TappenUniversity of Central Florida

Orlando, [email protected]

Ce Liu Edward H. Adelson William T. FreemanMIT CSAIL

Cambridge, MA [email protected]

Abstract

Markov Random Field (MRF) models are a popular toolfor vision and image processing. Gaussian MRF modelsare particularly convenient to work with because they canbe implemented using matrix and linear algebra routines.However, recent research has focused on on discrete-valuedand non-convex MRF models because Gaussian modelstend to over-smooth images and blur edges. In this paper,we show how to train a Gaussian Conditional Random Field(GCRF) model that overcomes this weakness and can out-perform the non-convex Field of Experts model on the taskof denoising images. A key advantage of the GCRF model isthat the parameters of the model can be optimized efficientlyon relatively large images. The competitive performance ofthe GCRF model and the ease of optimizing its parametersmake the GCRF model an attractive option for vision andimage processing applications.

1. Introduction

Markov Random Field, or MRF, models are a populartool for solving problems in low-level vision and image-processing. MRF models are convenient because the re-lationships between variables in the model are expressedin terms of local relationships, such as the relationship be-tween pixels in an image. At the same time, they are pow-erful because these local relationships combine to producelong-range interactions between variables.

Gaussian Markov Random Fields, which are MRFswhere the variables are also jointly Gaussian, have a longhistory in computer vision research [10]. Gaussian mod-els are particularly convenient to work with because manyof the computations involving these models, such as infer-ence, can be accomplished using linear algebra. Algorithmsfor numerical linear algebra are well understood, and effi-cient implementations are available in most programmingenvironments.

Despite these advantages, many Gaussian models havea characteristic weakness. Empirically, using these modelsoften results in over-smoothed images. This has led recent

research on MRF models of images to focus on discrete-valued MRF models [2, 4, 13] or non-convex continuous-valued models, such as the Field of Experts model [8]. Oneof the driving forces behind choosing these types of modelsis that they can implement robust penalty functions, such astruncated quadratic penalties. Using these robust penaltiesalleviates the problem of over-smoothing and has improvedthe performance on a number of tasks, including image de-noising [8], stereo reconstruction [2], and super-resolution[13].

In this paper, we describe a Gaussian Conditional Ran-dom Field, or GCRF, model that avoids the over-smoothingtypically associated with Gaussian models. To demonstrateits power, we show how it can denoise images with less er-ror than the non-convex Field of Experts model. The GCRFmodel is also able to produce estimates much more quickly.

An additional advantage of the GCRF model is that theparameters of the model can be found efficiently by mini-mizing the error in the MAP solution of the model. Thistractable method of training, described in Section 3, en-ables the GCRF model to improve on the results producedby the more complex, non-convex Field of Experts model.The fact that GCRF models are relatively easy to train andcan perform as well as more complicated models makesthem potentially useful for a number of low-level vision andimage-processing tasks. To demonstrate the convenienceand power of the model, sample training implementationsare available athttp://Removed for Review.

Section 2 describes the GCRF model. The training algo-rithm is described in Section 3. Related work and modelsare discussed in Section 4. The GCRF model is comparedto the Field of Experts model in Section 5.

2. Motivating the GCRF Model

The Gaussian Conditional Random Field (GCRF) modelcan be motivated in two ways: probabilistically as a Condi-tional Random Field [5], or as an estimator based on min-imizing a cost function. Section 2.1 shows how the GCRFestimator can be motivated as a probabilistic model, whileSection 2.3 shows how it can be motivated as the minimumof a quadratic cost function.

1

2.1. Gaussian Markov Random Field Model

Before presenting the actual Gaussian Conditional Ran-dom Field (GCRF) model used to denoise images, this sec-tion will first describe a GCRF model for images that suffersfrom the problem of oversmoothing. Section 2.2 will thenshow how this model can be improved into a model that canhandle sharp edges and produce much improved results.

This GCRF model is defined by a set of linear fea-tures that are convolution kernels. For a set of features,f1 . . . fNf

, the probability density of an image,X , condi-tioned on the observed image,O, is defined to be

p(X ) =1

Zexp

0

@−

NfX

i=1

X

x,y

((X ∗ fi)(x, y) − ri(x, y;O))2

1

A

(1)where(X ∗ fi)(x, y) denotes the value at location(x, y) inthe image produced by convolvingX with fi. We assumethat this image only contains those pixels where the entirefilter fits in X . This corresponds to “valid” convolution inMATLAB. The function ri(x, y;O) contains the estimatedvalue of(X ∗ fi)(x, y). For each featurefi, the functionri

uses the observed imageO to estimate the value of the filterresponse at each pixel. In the rest of this paper,ri(x, y;O)will be shortened tori for conciseness.

The exponent:Nf∑

i=1

∑

x,y

(

(X ∗ fi)(x, y) − r2i

)

(2)

can be written in matrix form by creating a set of matricesF1 . . . FNf

. Each matrixFi performs the same set of linearoperations as convolving an image with a filterfi. In otherwords, ifX is a vector created by unwrapping the imageX ,then the values ofFiX are identical to the values ofX ∗ fi.These matrices can then be stacked and Equation 2 can berewritten as

(F X − r)T (F X − r) (3)

where

F =

F1

F2

...FNf

and r =

r1

r2

...rNf

Equation 3 can be rewritten in the standard form of theexponent of a multivariate normal distribution by settingthe precision matrix,Λ−1, to beFT F and the mean,µ, to(FT F )−1FT r.

Note that the difference between(X − µ)T Λ−1(X − µ)and Equation 3 is the constantrT r − µT µ. However, thisvalue is constant for allX, so it only affects the normaliza-tion constant,Z, and the distributions defined using Equa-tion 1 orΛ andµ are identical.

2.2. Utilizing Weights

One of the major weaknesses of the GCRF model inEquation 1 is that it does not properly capture the statis-tics of natural images. Natural images tend to have smoothregions interrupted by sharp discontinuities. The Gaussianmodel imposes smoothness, but penalizes sharp edges. Inapplications like image denoising, optic flow, and range es-timation, this violates the actual image statistics and leadsto oversmoothing.

One way to avoid this oversmoothing is to use non-convex, robust potential functions, such as those usedin [13] and [8]. Unfortunately, the convenience of thequadratic model is lost when using these models.

Alternatively, the quadratic model can be modified byassigning weights to the various terms in the sum. For ex-ample, if the filters,f1 . . . fNf

, include derivative filters, thebasic GCRF model penalizes the strong derivatives aroundsharp edges. This causes the images estimated from basicGCRF models to have blurry edges. Using the observedimage to lower the weight for those derivatives that crossedges will preserve the sharpness of the estimated image.It should be noted that this example is illustrative. Section3 will show how use training data to learn the relationshipbetween the observed image and the weights.

Adding weights to the exponent of Equation 1 can beexpressed formally by modifying Equation 2:

Nf∑

i=1

∑

x,y

wi(x, y;O, θ) ((X ∗ fi)(x, y) − ri)2 (4)

wherewi(x, y;O, θ) is a positive weighting function thatuses the observed image,O, to assign a positive weight toeach quadratic term in Equation 4.

Assigning a weight to each quadratic term improves themodel’s ability to handle strong edges. For a term based ona derivative filter, the weight could be increased in flat re-gions of the image and decreased along edges. This wouldenable the model to smooth out noise, while preservingsharp edges. Section 3 will discuss how to learn the pa-rameters,θ, of this weighting function.

The potential difficulty in using this model is the fact thatthe weights must be computed from the observed image.This will be problematic if the observed image is too noisyor the necessary information is not easy to compute. Fortu-nately, as Section 5 will show, the system is able to performwell with high-noise levels when denoising images. Thisindicates that good weights can still be computed in high-noise problems.

2.3. Cost-Function Perspective

The GCRF model can also be motivated from a cost-function point of view. Leth(O; θ) be an estimator that usesthe observed image,O, to estimate an image. The estimateis the imageX that minimizes the quadratic function:

Nf∑

i=1

∑

x,y

wi(x, y;O, θ) ((X ∗ fi)(x, y) − ri)2 (5)

Or, in other words,

h(O; θ) = arg minX

∑

i,x,y wi(x, y;O, θ) ((X ∗ fi)(x, y) − ri))2 (6)

Again,ri(x, y;O) has been shortened tori for conciseness.Because Equation 5 contains a quadratic cost function,

the minimum can be computed using the pseudo-inverse.Using the matrix notation from Section 2.1, this pseudo-inverse can be expressed as:

h(O; θ) =(

FT W (O; θ)F)−1

FT W (O; θ)r (7)

In this equation, we have introduced a diagonal ma-trix W (O; θ) that is a function of the observed imageand parametersθ. This diagonal matrix is a block-diagonal matrix constructed from the diagonal sub-matricesW1(O; θ) . . . WNf

(O; θ). Each entry along the diagonal intheWi(O; θ) matrices is equal to the weight of a term at aparticular pixel,wi(x, y;O, θ).

3. Learning the Parameters of a GCRF

Before the model described in the previous sectioncan be used, the parameters of the weighting functions,wi(x, y;O, θ), must be chosen. Traditionally, the parame-ters of Conditional Random Fields are found by maximizingthe likelihood of the training data. Using the GCRF modelfrom 2.2, the conditional log-likelihood of a training sampleT , conditioned on an associated observationOT , is

−∑Nf

i=1

∑

x,y wi(x, y;OT , θ) ((T ∗ fi)(x, y) − ri)2

−∫∫

Xexp

(

−∑Nf

i=1

∑

x,y wi(x, y;OT , θ)×

((X ∗ fi)(x, y) − ri)2)

(8)and its partial derivative with respect to a parameterθi

is:

−∑Nf

i=1

∑

x,y∂wi(x,y;OT ,θ)

∂θi((T ∗ fi)(x, y) − ri)

2

−1

Z

∫∫

X

∂wi(x,y;OT ,θ)∂θi

×

exp (−∑Nf

i=1

∑

x,y wi(x, y;OT , θ)×

((X ∗ fi)(x, y) − ri)2)

(9)

Notice that the integration on the righthand of Equation9 can be rewritten as an expectation. This implies thatcomputing the gradient of the log-likelihood with respectto the parameters requires computing the expected valuesof the clique potentials or feature functions. For a Gaus-sian model, these expectations can be computed by invert-ing the inverse covariance matrix, or precision matrix, then

multiplying it with another matrix. The key difficulty withusing the log-likelihood to fit the GCRF parameters lies inthis matrix inversion. While the matrix

(

FT W (O; θ)F)

is

sparse,(

FT W (O; θ)F)−1

will be dense. For a256 × 256image, storing this dense matrix in memory with single-precision values will require 16GB of memory. In addition,inverting a matrix is anO(N3) operation, making it infea-sible for large images.

3.1. Discriminative Learning for GCRFs

The matrix inversion can be avoided by instead penal-izing the difference between the mode of the conditionaldistribution and the ground truth. The penalty is expressedusing a loss functionL(Y, T ) that assigns a loss for an im-ageY based on its difference from the ground-truth imageT . In the GCRF model, the mode of the conditional dis-tribution is the conditional mean, so the cost,C(θ,O, T ),associated with a particular set of parametersθ is

C(θ,O, T ) = L((

FT W (O; θ)F)−1

FT W (O; θ)r, T )(10)

The parametersθ can be found by minimizingC(θ,O, T ) using gradient descent techniques. For conve-nience when computing the gradient, we defineµ such that

µ =(

FT W (O; θ)F)−1

FT W (O; θ)r (11)

Using the chain rule, the partial derivative ofC(·) withrespect to a parameterθi is

∂C(θ,O, T )

∂θi

=∂C(θ,O, T )

∂µ

∂µ

∂θi

(12)

Note that∂C(θ,O,T )∂µ

is a1 × Np vector, whereNp is thenumber of pixels in the image, equal to

[

∂C(θ,O, T )

∂µ1. . .

∂C(θ,O, T )

∂µNp

]

(13)

The derivative vector∂µ∂θi

is an Np × 1 vector with eachelement equal to

∂µ1

∂θi

...∂µNp

∂θi

(14)

If the loss function is the squared-difference betweenµand the ground-truth image,T , then

C(θ,O, T ) = (µ − T )T

(µ − T ) (15)

and

∂C(θ,O, T )

∂µ= 2(µ − T )

3.2. Differentiating µ

The vectorµ, defined in Equation 11, can be differenti-ated with respect to a parameterθi using the matrix deriva-tive identity:

∂A−1

∂x= −A−1 ∂A

∂xA−1 (16)

Using this identity,

∂µ∂θi

= −(

FT W (O; θ)F)−1

FT ∂W (θ,O)∂θi

×(

FT W (O; θ)F)−1

FT W (O; θ)r

+(

FT W (O; θ)F)−1


r

(17)

Equation 17 can be expressed much more succinctly byreplacing

(

FT W (O; θ)F)−1

FT W (O; θ)r with µ and fac-toring:

∂µ∂θi

=(

FT W (O; θ)F)−1


(−Fµ + r)(18)

3.3. Computational Implications

Combining Equation 12 with Equation 18, the derivativeof theC(·) with respect to a parameterθi is

∂C(θ,O,T )∂θi

=∂C(θ,O,T )

∂µ

(

FT W (O; θ)F)−1


(−Fµ + r)

(19)The most expensive step in Equation 19 is computing

µ and ∂C(θ,O,T )∂µ

(

FT W (O; θ)F)−1

. Both of these termsinvolve calculating the product of an inverse matrix and avector. This can be computed efficiently inO(N2

p ) timeusing numerical linear algebra techniques. In our imple-mentation, we use the “backslash” operator in MATLAB tocompute these vectors.

Also notice that if the matrix multiplications are orderedcorrectly, each of these vectors must only be computed onceto calculate the gradient with respect to allθi. This makesthe complexity of computing the gradientO(N2

p ). Com-pared to maximizing the likelihood, which requiresO(N3

p )matrix inversions, this is a significant reduction in compu-tation. Even for a relatively small128 × 128 image, therewill be an immense savings in computation.

4. Related Work

This GCRF model is similar to that used by Tappen,Adelson, and Freeman to learn parameters for decomposingan image into shading and albedo components [12]. Thatsystem was also used to denoise images, though with anerror slightly higher than the Field of Experts model. Theprimary differences between that system and the system de-scribed here is that many more filters are used and the filterresponses are all set to zero. In [12], a semi-parametric re-gression system was used to estimate the filter responses for

Figure 1. The filters used in the GCRF model for denoising im-ages. These filters are a set of derivatives.

horizontal and vertical first-order derivatives. The GCRFmodel described here uses higher-order derivatives and con-strains them to be zero. This better captures the piecewisecontinuity in images.

A distinguishing characteristic of this method for train-ing is that model parameters are optimized using a lossfunction that measures the difference between the mostlikely image and the ground-truth image. This type of train-ing has also been used for discrete-valued graphical modelsby Collins [3], Taskar et al. [15, 14], and Altun et al. [1]. In[3], Collins shows how a variant of the perceptron algorithmcan be used to maximize a margin, similar to the margin inthe Support Vector Machine literature, between the correctoutput from a graphical model and all other possible out-puts. In [15], Taskar et al. show how a loss function canbe upper-bounded with a hinge function and the parameterscan be found using convex optimization. In [1], Altun et al.present a similar formulation.

Unlike [15] and [1], this work does not rely on machin-ery from convex optimization. Instead, the gradient is cal-culated analytically and standard gradient-based optimiza-tion techniques can be used to optimize the loss function.Relying on convex optimization limits the types of weight-functions and loss-functions that can be used. The only re-striction on the weight functions in the GCRF formulationpresented in this paper is that they be differentiable.

The GCRF model can also be viewed as a type ofEnergy-Based Model [6]. The advantage of Energy-BasedModels is that the partition function does not need to becomputed when training.

5. Comparing GCRFs to Field of Experts Mod-els for Denoising

To demonstrate the power of GCRFs on a real-worldproblem, we constructed a GCRF model for denoising im-ages. As stated earlier, a model has three major com-ponents: the linear featuresf1 . . . fNf

, the estimated re-sponses of the featuresr1(·) . . . rNf

(·), and the weightingfunctionsw1(·) . . . wNf

(·).To keep the linear system invertible, the first filter,

f1 is the identity filter withr1(x, y;O) = O(x, y) andw1(x, y;O) = 1. In other words, the first feature constrainsthe mean to be somewhat close to the observation.

The remaining filters are a set of eleven3 × 3 deriva-

Figure 2. The filters used for generating the features that are usedto estimate the weightwi(x, y;O) of each term in the GCRFmodel.

tive filters. The filters are shown in Figure 1. For each ofthese filters, the estimated response,ri(·), equals zero. Es-sentially, these filters are imposing smoothness constraints.

The weight function associated with each filter is basedon the absolute response of the observed imageO to a setof multi-scale oriented edge and bar filters. These filters areshown in Figure 2. These filters operate at multiple scalesin the image. The size of the filters in each of the threerows is 11 pixels, 21 pixels, and 31 pixels. The filters weredesigned to be elongated in order to be sensitive to extendededges. Due to the large size of the filters, reflected borderhandling was used when computing the filter responses. Athird set of responses was created by squaring the responsesof the corresponding edge and bar filters, then adding thesevalues.

Altogether, there are 72 filter responses at each pixel inthe image. Denoting the absolute response to each filter atlocation(x, y) asa

(x,y)1 . . . a

(x,y)72 , wherea

(x,y)72 is a feature

containing all ones, the weighting function is

wi(x, y;O) = exp

72∑

j=1

θija

(x,y)j

(20)

where θij represents the weight associated with response

j for filter i. The complete set of parameters,θ, is theconcatenation of the response weights for each filter,θ =[θ1θ2 . . . θ12]. The exponential ensures that the weight isalways positive.

The intuition behind this weight function is that the ob-served image,O, is being used to guess where edges occurin the image. The weight of the smoothing constraints canthen be reduced in those areas.

5.1. Training the GCRF Model

We trained the GCRF model by optimizing the responseweights,θ, to minimize the absolute value of the differencebetween between the predicted image and the true noiselessimage,T . To make the loss function differentiable, we used(x2 + 0.1) to approximate|x|. Using the derivations fromSection 2.2, the derivative with respect to a particularθi

j is

∂C

∂θji

= (µ−T )

((µ−T )2+0.1)12

×(

FT WF)−1

FT ∂W (θ,O)

∂θji

(−Fµ + r)(21)

For conciseness, we have dropped the arguments fromCandW . Note thatµ andT are vectors, so the division andsquare-root operations in (µ−T )

((µ−T )2+0.1)12

are element-wise

operations.We found that optimizing the parameters of the model

with this smoothed absolute error criterion led to an estima-tor that generalized better than the estimator trained usinga squared-error criterion. Optimizing under a squared-errorcriterion causes the algorithm to focus on the images withlarger errors. These images are treated equally under theabsolute error criterion. This helps generalization perfor-mance because outlier images do not receive undue weightduring training.

Theθ vector was optimized using a gradient-descent al-gorithm with the “bold-driver” heuristic. Under this heuris-tic, the step size was decreased each time the the error in thedenoised estimates increased.

The system was trained on forty129×129 patches takenfrom the same training images used by Roth and Black.White Gaussian noise, with a standard deviation of 25, wasadded to each observed image. Computing the gradient fortwo images required approximately two minutes on a 2.4GHz Intel Xeon processor. Using a computing cluster, wewere able to train the model in a few hours. The trainingimages used in training the GCRF model were much largerthan the15 × 15 patches used to train the Field of Expertsmodel, which was also trained using a cluster. It is useful tobe able to train on large images because they better matchthe test images.

5.2. ResultsFor comparison, we ran the Field of Experts (FoE) im-

plementation provided by Roth and Black on the same 68test images from the Berkeley Segmentation Database thatRoth and Black identified as test images. The observationswere created by adding white Gaussian noise with a stan-dard deviation of 25 to each image. The FoE model de-noises images by using gradient-descent to minimize a ro-bust cost function. The results vary depending on the step-size parameter used. We found that we obtained imageswith the highest log-posteriors by optimizing for 5000 iter-ations with a step-size of 0.6. This enabled us to comparethe two models fairly because the results from the GCRFmodel are optimal under that model.

Average AverageRMS Error PSNR

Model (Lower is Better) (Higher is Better)Field of Experts 11.04 27.59GCRF 10.47 28.04

Table 1. This table compares the average error in denoised esti-mates produced by the GCRF model described in this paper andthe Field of Experts model from [8]. The results are averaged overthe same 68 test images used in [8] The standard deviation of thenoise,σ, in the observations was 25.

Time to ProduceModel Estimates (sec)Field of Experts 376.9(2500 Iterations)GCRF 180.7GCRF using Conjugate Gradient 97.82(300 Iterations)

Table 2. A comparison of the amount of time needed to denoisean image using either the Field of Experts model, which usessteepest-descent, or the GCRF model, which uses the MATLABlinear solver. The Conjugate Gradients method is described in Sec-tion 5.3. Times were computed on a 3.2Ghz Intel Xeon Processor.

4 6 8 10 12 14 16 18 20

6

8

10

12

14

16

18

RMS Error of GCRF Estimate

RM

SE

rror

ofF

oEE

stim

ate

Figure 3. This scatterplot shows the RMS error in each of the 68test images when denoised with either the Field of Experts (FoE)model or the GCRF model. The GCRF model outperforms theFoE model for all images where the point is above the red line.

As Table 1 shows, the GCRF model produces denoisedimages with a lower error than the Field of Experts Model.In addition, as Table 2 shows, the GCRF model requiresnearly half the time to produce a denoised481×321 image.Note that in this table, we have shown the time required for2500 iterations of the steepest-descent algorithm, which isthe number of iterations suggested by Roth and Black.

To compare the performance of the two models acrossthe test set, Figure 3 shows of a scatterplot of the RMS er-rors of the images estimated with the two models. Thoseimages where the GCRF model produces a better estimateappear above the red line. As Figure 3 shows, the GCRFmodel produces the better estimate for the majority of the

images. It is important to note that the RMS error is justone type of error metric. Fortunately, if good performanceis desired under a specific error metric, the GCRF modelcan be easily adapted to different error metrics. It shouldbe noted that the GSM denoising algorithm [7] still outper-forms both of these models in terms of PSNR.

Figures 4, 5, and 6 show three examples of images fromthe test set where the GCRF model produces better denoisedimages in terms of PSNR. Figure 7 shows an example wherethe Field of Experts model outperforms the GCRF model interms of PSNR.

Qualitatively, the GCRF model does a much better job atpreserving the fine-scale texture. This comes at the expenseeliminating low-frequency noise in smooth portions of theimage, which can be seen in Figure 7.

5.3. Iterative Estimation in GCRF modelsFor large images, the overhead of constructing the matri-

ces and solving the linear system may be prohibitive. Inthese situations, the denoised image can be obtained byminimizing Equation 6 iteratively. A significant advantageof the GCRF model is that denoising is accomplished bysolving a linear system. This linear system can be solved it-eratively using the conjugate gradient method. This methodconverges more quickly than standard steepest-descent [9].

Using the notation in [9], the basic steps of the conjugategradient method for solvingAx = b are

d0 = r0 = b − Ax0 (22)

αi =rT

i ri

dTi

Adi(23)

xi+1 = xi + αidi (24)

ri+1 = ri − αiAdi (25)

βI+1 =rT

i+1ri+1

rTi

ri(26)

di+1 = ri+1 + βi+1di (27)

When using this method for the GCRF model, the matrixA is replaced with

(

FT W (O; θ)F)

. Each of the matrices inF correspond to computing a convolution, so matrix-vectorproducts such asdT

i Adi can be computed using convolu-tions. This makes each iteration of the conjugate gradi-ent slightly more expensive than an iteration of steepest-descent.

One advantage of the conjugate-gradient method is thatthere are no step-size parameters. We also found that it con-verged quickly. After 300 iterations of the conjugate gradi-ent method, the average absolute difference, per pixel, be-tween the iterative estimate and the estimate found with thedirect matrix solver was 5.3851e-05. As Table 2 shows, us-ing the iterative solver significantly reduces the amount oftime needed to produce an estimate. Further improvementsin speed may be possible by using a preconditioner suitedfor images, such as the preconditioner described in [11].

While the conjugate gradients method could be used tofind estimates in the Field of Experts model, the non-linear

(a) Original (b) Noisy Observation (c) Results using FoE Model,PSNR = 25.23

(d) Results using GCRF Model,PSNR= 26.28

Figure 4. In this denoising task, the GCRF model produces a more accurate estimate of the true image than the Field of Experts (FoE)model in terms of PSNR. The GCRF model also does a better job of preserving the texture in the image.

(a) Original (b) Noisy Observation (c) Results using FoE model, PSNR= 25.77

(d) Results using GCRF model,PSNR = 26.75

Figure 5. A second example where the GCRF model produces superiorresults in terms of PSNR. Again, the GCRF better preserves texture.

conjugate gradient algorithm requires line-searches, whichwould affect its overall performance.

6. ConclusionWe have shown how a Gaussian Conditional Random

Field (GCRF) model can outperform the non-convex Fieldof Experts model on the task of denoising images. TheGCRF model produced more accurate estimates, in terms ofPSNR, and required less computation to produce estimates.This is significant because of the advantages of the GCRFmodel:

1. It is convenient to work with. The GCRF model can beimplemented using standard numerical linear algebraroutines, which are available for most programmingenvironments.

2. As Sections 3 and 5 showed, the model can be trainedon relatively large images.

3. Any differentiable loss function on the estimated im-ages can be used during training. This is importantbecause the behavior of a system can be significantlyinfluenced by the loss function used to train it. Allow-ing for different loss functions during training enablesa system’s behavior to be controlled more precisely.

Combined with the ability to perform as well as non-convex models, these advantages make the GCRF model anattractive model for low-level vision and image processingapplications.

(a) Original (b) Noisy Observation (c) Result from FoE, PSNR = 28.59(d) Result from GCRF, PSNR =28.68

Figure 6. In this comparison, the PSNR for the Field of Experts result is slightly lower than the PSNR reported in [8]. We found that asthe Field of Experts code optimized the denoised images, the PSNR began to decrease as the log-posterior of the estimate continued toincrease. However, the result in (c) is visually quite similar to that in [8].

(a) Original (b) Noisy Observation (c) Result using FoE model, PSNR= 36.07

(d) Result using GCRF model,PSNR = 34.77

Figure 7. These images are an example of how the Field of Experts modelcan produce superior estimates in terms of PSNR. The FoEmodel is better at removing low-frequency noise from flat regions.

References[1] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden markov

support vector machines. InThe Twentieth InternationalConference on Machine Learning (ICML-2003), 2003.

[2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate en-ergy minimization via graph cuts.IEEE Transactions on Pat-tern Analysis and Machine Intelligence, 23(11):1222–1239,2001.

[3] M. Collins. Discriminative training methods for hiddenmarkov models: Theory and experiments with perceptron al-gorithms. InEMNLP 2002, 2002.

[4] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael. Learn-ing low-level vision. International Journal of Computer Vi-sion, 40(1):25–47, 2000.

[5] J. Lafferty, F. Pereira, and A. McCallum. Conditional ran-dom fields: Probabilistic models for segmenting and label-ing sequence data. InInternational Conference on MachineLearning (ICML), 2001.

[6] Y. LeCun and F. Huang. Loss functions for discriminativetraining of energy-based models. InProc. of the 10-th In-ternational Workshop on Artificial Intelligence and Statistics(AIStats’05), 2005.

[7] J. Portilla, V. Strela, M. Wainwright, and E. P. Simoncelli.Image denoising using scale mixtures of gaussians in thewavelet domain.IEEE Transactions on Image Processing,12(11):1338–1351, November 2003.

[8] S. Roth and M. Black. Field of experts: A framework forlearning image priors. InIEEE Conference on Computer Vi-

sion and Pattern Recognition (CVPR), volume 2, pages 860–867, 2005.

[9] J. R. Shewchuk. An introduction to the conjugategradient method without the agonizing pain. Available athttp://www.cs.cmu.edu/˜jrs/jrspapers.html .

[10] R. Szeliski. Bayesian modeling of uncertainty in low-levelvision. International Journal of Computer Vision, 5(3):271–301, 1990.

[11] R. Szeliski. Locally adapted hierarchical basis precondition-ing. ACM Transactions on Graphics, 25(3):1135–1143, Au-gust 2006.

[12] M. F. Tappen, E. H. Adelson, and W. T. Freeman. Estimatingintrinsic component images using non-linear regression. InThe Proceedings of the 2006 IEEE Computer Society Confer-ence on Computer Vision and Pattern Recognition (CVPR),volume 2, pages 1992–1999, 2006.

[13] M. F. Tappen, B. C. Russell, and W. T. Freeman. Efficientgraphical models for processing images. InProceedings ofthe 2004 IEEE Conference on Computer Vision and Patt ernRecognition, volume 2, pages 673–680, 2004.

[14] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin.Learning structured prediction models: A large margin ap-proach. InThe Twenty Second International Conference onMachine Learning (ICML-2005), 2005.

[15] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structuredprediction via the extragradient method. In Y. Weiss,B. Scholkopf, and J. Platt, editors,Advances in Neural Infor-mation Processing Systems 18. MIT Press, Cambridge, MA,2006.

Learning Gaussian Conditional Random Fields for Low-Level...

Documents

Transcript of Learning Gaussian Conditional Random Fields for Low-Level...