CVPR 2006 Submission #770. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.
Learning Gaussian Conditional Random Fields for Low-Level Vision
Anonymous CVPR submission
Paper ID 770
Abstract
Markov Random Field (MRF) models are a popular tool for vision and image processing. Gaussian MRF models are particularly convenient to work with because they can be implemented using matrix and linear algebra routines. However, recent research has focused on discrete-valued and non-convex MRF models because Gaussian models tend to over-smooth images and blur edges. In this paper, we show how to train a Gaussian Conditional Random Field (GCRF) model that overcomes this weakness and can outperform the non-convex Field of Experts model on the task of denoising images. A key advantage of the GCRF model is that its parameters can be optimized efficiently on relatively large images. The competitive performance of the GCRF model and the ease of optimizing its parameters make the GCRF model an attractive option for vision and image processing applications.
1. Introduction
Markov Random Field, or MRF, models are a popular tool for solving problems in low-level vision and image processing. MRF models are convenient because the relationships between variables in the model are expressed in terms of local relationships, such as the relationship between pixels in an image. At the same time, they are powerful because these local relationships combine to produce long-range interactions between variables.
Gaussian Markov Random Fields, which are MRFs where the variables are also jointly Gaussian, have a long history in computer vision research [10]. Gaussian models are particularly convenient to work with because many of the computations involving these models, such as inference, can be accomplished using linear algebra. Algorithms for numerical linear algebra are well understood, and efficient implementations are available in most programming environments.
Despite these advantages, many Gaussian models have a characteristic weakness: empirically, using these models often results in over-smoothed images. This has led recent research on MRF models of images to focus on discrete-valued MRF models [2, 4, 13] or non-convex continuous-valued models, such as the Field of Experts model [8]. One of the driving forces behind choosing these types of models is that they can implement robust penalty functions, such as truncated quadratic penalties. Using these robust penalties alleviates the problem of over-smoothing and has improved performance on a number of tasks, including image denoising [8], stereo reconstruction [2], and super-resolution [13].
In this paper, we describe a Gaussian Conditional Random Field, or GCRF, model that avoids the over-smoothing typically associated with Gaussian models. To demonstrate its power, we show how it can denoise images with less error than the non-convex Field of Experts model. The GCRF model is also able to produce estimates much more quickly.
An additional advantage of the GCRF model is that the parameters of the model can be found efficiently by minimizing the error in the MAP solution of the model. This tractable method of training, described in Section 3, enables the GCRF model to improve on the results produced by the more complex, non-convex Field of Experts model. The fact that GCRF models are relatively easy to train and can perform as well as more complicated models makes them potentially useful for a number of low-level vision and image-processing tasks. To demonstrate the convenience and power of the model, sample training implementations are available at http://Removed for Review.
Section 2 describes the GCRF model. The training algorithm is described in Section 3. Related work and models are discussed in Section 4. The GCRF model is compared to the Field of Experts model in Section 5.
2. Motivating the GCRF Model
The Gaussian Conditional Random Field (GCRF) model can be motivated in two ways: probabilistically, as a Conditional Random Field [5], or as an estimator based on minimizing a cost function. Section 2.1 shows how the GCRF estimator can be motivated as a probabilistic model, while Section 2.3 shows how it can be motivated as the minimum of a quadratic cost function.
2.1. Gaussian Markov Random Field Model
Before presenting the actual Gaussian Conditional Random Field (GCRF) model used to denoise images, this section will first describe a GCRF model for images that suffers from the problem of oversmoothing. Section 2.2 will then show how this model can be improved into one that can handle sharp edges and produce much better results.
This GCRF model is defined by a set of linear features that are convolution kernels. For a set of features, f_1 ... f_{N_f}, the probability density of an image, X, conditioned on the observed image, O, is defined to be

p(X) = \frac{1}{Z} \exp\left( -\sum_{i=1}^{N_f} \sum_{x,y} \left( (X * f_i)(x,y) - r_i(x,y;O) \right)^2 \right)    (1)

where (X * f_i)(x,y) denotes the value at location (x,y) in the image produced by convolving X with f_i. We assume that this image only contains those pixels where the entire filter fits in X; this corresponds to "valid" convolution in MATLAB. The function r_i(x,y;O) contains the estimated value of (X * f_i)(x,y). For each feature f_i, the function r_i uses the observed image O to estimate the value of the filter response at each pixel. In the rest of this paper, r_i(x,y;O) will be shortened to r_i for conciseness.
The exponent

\sum_{i=1}^{N_f} \sum_{x,y} \left( (X * f_i)(x,y) - r_i \right)^2    (2)

can be written in matrix form by creating a set of matrices F_1 ... F_{N_f}. Each matrix F_i performs the same set of linear operations as convolving an image with a filter f_i. In other words, if X is a vector created by unwrapping the image X, then the values of F_i X are identical to the values of X * f_i. These matrices can then be stacked and Equation 2 can be rewritten as

(FX - r)^T (FX - r)    (3)

where

F = \begin{bmatrix} F_1 \\ F_2 \\ \vdots \\ F_{N_f} \end{bmatrix} \quad\text{and}\quad r = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_{N_f} \end{bmatrix}

Equation 3 can be rewritten in the standard form of the exponent of a multivariate normal distribution by setting the precision matrix, \Lambda^{-1}, to F^T F and the mean, \mu, to (F^T F)^{-1} F^T r.

Note that the difference between (X - \mu)^T \Lambda^{-1} (X - \mu) and Equation 3 is r^T r - \mu^T F^T F \mu. This value is constant for all X, so it only affects the normalization constant, Z, and the distributions defined using Equation 1 or \Lambda and \mu are identical.
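The stacked-matrix construction above can be sketched in a few lines of Python. The `conv_matrix` helper and the tiny identity-plus-derivative model are illustrative assumptions, not the paper's implementation, and the helper builds the correlation form of "valid" filtering (the kernel is not flipped) for simplicity:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def conv_matrix(f, H, W):
    """Sparse matrix mapping a vectorized H-by-W image to its
    'valid' filtering with f (correlation form, for simplicity)."""
    kh, kw = f.shape
    oh, ow = H - kh + 1, W - kw + 1
    rows, cols, vals = [], [], []
    for y in range(oh):
        for x in range(ow):
            out = y * ow + x
            for dy in range(kh):
                for dx in range(kw):
                    rows.append(out)
                    cols.append((y + dy) * W + (x + dx))
                    vals.append(f[dy, dx])
    return sp.csr_matrix((vals, (rows, cols)), shape=(oh * ow, H * W))

# Toy model: an identity feature tied to the observation plus a
# horizontal-difference feature tied to zero (a smoothness term).
H = W = 8
rng = np.random.default_rng(0)
O = rng.normal(size=(H, W))                    # observed image

F1 = sp.eye(H * W, format='csr')               # f1 = identity
F2 = conv_matrix(np.array([[1.0, -1.0]]), H, W)
F = sp.vstack([F1, F2]).tocsr()
r = np.concatenate([O.ravel(), np.zeros(F2.shape[0])])

# mean of the Gaussian: mu = (F^T F)^{-1} F^T r, via a sparse solve
mu = spla.spsolve((F.T @ F).tocsr(), F.T @ r)
```

Because F is sparse, the solve never forms the dense covariance, which is the property the training method in Section 3 relies on.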
2.2. Utilizing Weights
One of the major weaknesses of the GCRF model in Equation 1 is that it does not properly capture the statistics of natural images. Natural images tend to have smooth regions interrupted by sharp discontinuities. The Gaussian model imposes smoothness, but penalizes sharp edges. In applications like image denoising, optic flow, and range estimation, this violates the actual image statistics and leads to oversmoothing.

One way to avoid this oversmoothing is to use non-convex, robust potential functions, such as those used in [13] and [8]. Unfortunately, the convenience of the quadratic model is lost when using these models.
Alternatively, the quadratic model can be modified by assigning weights to the various terms in the sum. For example, if the filters, f_1 ... f_{N_f}, include derivative filters, the basic GCRF model penalizes the strong derivatives around sharp edges. This causes the images estimated from basic GCRF models to have blurry edges. Using the observed image to lower the weight for those derivatives that cross edges will preserve the sharpness of the estimated image. It should be noted that this example is illustrative. Section 3 will show how to use training data to learn the relationship between the observed image and the weights.
Adding weights to the exponent of Equation 1 can be expressed formally by modifying Equation 2:

\sum_{i=1}^{N_f} \sum_{x,y} w_i(x,y;O,\theta) \left( (X * f_i)(x,y) - r_i \right)^2    (4)

where w_i(x,y;O,\theta) is a positive weighting function that uses the observed image, O, to assign a positive weight to each quadratic term in Equation 4.
Assigning a weight to each quadratic term improves the model's ability to handle strong edges. For a term based on a derivative filter, the weight could be increased in flat regions of the image and decreased along edges. This would enable the model to smooth out noise while preserving sharp edges. Section 3 will discuss how to learn the parameters, \theta, of this weighting function.

The potential difficulty in using this model is that the weights must be computed from the observed image. This will be problematic if the observed image is too noisy or the necessary information is not easy to compute. Fortunately, as Section 5 will show, the system is able to perform well at high noise levels when denoising images. This indicates that good weights can still be computed in high-noise problems.
2.3. Cost-Function Perspective
The GCRF model can also be motivated from a cost-function point of view. Let h(O;\theta) be an estimator that uses the observed image, O, to estimate an image. The estimate is the image X that minimizes the quadratic function:
\sum_{i=1}^{N_f} \sum_{x,y} w_i(x,y;O,\theta) \left( (X * f_i)(x,y) - r_i \right)^2    (5)

Or, in other words,

h(O;\theta) = \arg\min_X \sum_{i,x,y} w_i(x,y;O,\theta) \left( (X * f_i)(x,y) - r_i \right)^2    (6)

Again, r_i(x,y;O) has been shortened to r_i for conciseness.

Because Equation 5 is a quadratic cost function, the minimum can be computed using the pseudo-inverse. Using the matrix notation from Section 2.1, this pseudo-inverse can be expressed as

h(O;\theta) = \left( F^T W(O;\theta) F \right)^{-1} F^T W(O;\theta) r    (7)

In this equation, we have introduced a diagonal matrix W(O;\theta) that is a function of the observed image and the parameters \theta. It is a block-diagonal matrix constructed from the diagonal sub-matrices W_1(O;\theta) ... W_{N_f}(O;\theta). Each entry along the diagonal of the W_i(O;\theta) matrices is equal to the weight of a term at a particular pixel, w_i(x,y;O,\theta).
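Equation 7 is an ordinary weighted least-squares solve. The following sketch (a hypothetical 1-D toy setup, not the paper's code) shows the effect the weights are meant to have: down-weighting the single derivative term that crosses an edge lets the estimate keep that edge sharp while still smoothing within each region:

```python
import numpy as np

# Weighted MAP estimate of Equation 7, written densely for clarity:
#   h(O; theta) = (F^T W F)^{-1} F^T W r
def gcrf_estimate(F, W_diag, r):
    """Solve (F^T W F) x = F^T W r for the conditional mean."""
    FtW = F.T * W_diag            # scales column j of F^T by W_jj
    return np.linalg.solve(FtW @ F, FtW @ r)

# toy 1-D "image": identity feature plus weighted first differences
n = 6
obs = np.array([0., 0., 0., 10., 10., 10.])
I = np.eye(n)
D = np.eye(n - 1, n, 1) - np.eye(n - 1, n)   # D @ x = x[1:] - x[:-1]
F = np.vstack([I, D])
r = np.concatenate([obs, np.zeros(n - 1)])   # derivatives pulled to zero

w_smooth = np.full(n - 1, 10.0)
w_smooth[2] = 1e-3            # down-weight the derivative across the edge
W_diag = np.concatenate([np.ones(n), w_smooth])

x_hat = gcrf_estimate(F, W_diag, r)
```

With a uniform weight of 10 on every difference term the step would be blurred; with the one term down-weighted, `x_hat` stays close to the piecewise-constant signal.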
3. Learning the Parameters of a GCRF
Before the model described in the previous section can be used, the parameters of the weighting functions, w_i(x,y;O,\theta), must be chosen. Traditionally, the parameters of Conditional Random Fields are found by maximizing the likelihood of the training data. Using the GCRF model from Section 2.2, the conditional log-likelihood of a training sample T, conditioned on an associated observation O_T, is

-\sum_{i=1}^{N_f} \sum_{x,y} w_i(x,y;O_T,\theta) \left( (T * f_i)(x,y) - r_i \right)^2 - \log \int_X \exp\left( -\sum_{i=1}^{N_f} \sum_{x,y} w_i(x,y;O_T,\theta) \left( (X * f_i)(x,y) - r_i \right)^2 \right) dX    (8)

and its partial derivative with respect to a parameter \theta_j is

-\sum_{i=1}^{N_f} \sum_{x,y} \frac{\partial w_i(x,y;O_T,\theta)}{\partial \theta_j} \left( (T * f_i)(x,y) - r_i \right)^2 + \frac{1}{Z} \int_X \left[ \sum_{i=1}^{N_f} \sum_{x,y} \frac{\partial w_i(x,y;O_T,\theta)}{\partial \theta_j} \left( (X * f_i)(x,y) - r_i \right)^2 \right] \exp\left( -\sum_{i=1}^{N_f} \sum_{x,y} w_i(x,y;O_T,\theta) \left( (X * f_i)(x,y) - r_i \right)^2 \right) dX    (9)
Notice that the integration on the right-hand side of Equation 9 can be rewritten as an expectation. This implies that computing the gradient of the log-likelihood with respect to the parameters requires computing the expected values of the clique potentials, or feature functions. For a Gaussian model, these expectations can be computed by inverting the inverse covariance matrix, or precision matrix, then multiplying it with another matrix. The key difficulty with using the log-likelihood to fit the GCRF parameters lies in this matrix inversion. While the matrix \left( F^T W(O;\theta) F \right) is sparse, \left( F^T W(O;\theta) F \right)^{-1} will be dense. For a 256 × 256 image, storing this dense matrix in memory with single-precision values will require 16GB of memory. In addition, inverting a matrix is an O(N^3) operation, making it infeasible for large images.
3.1. Discriminative Learning for GCRFs
The matrix inversion can be avoided by instead penalizing the difference between the mode of the conditional distribution and the ground truth. The penalty is expressed using a loss function L(Y, T) that assigns a loss to an image Y based on its difference from the ground-truth image T. In the GCRF model, the mode of the conditional distribution is the conditional mean, so the cost, C(\theta,O,T), associated with a particular set of parameters \theta is

C(\theta,O,T) = L\left( \left( F^T W(O;\theta) F \right)^{-1} F^T W(O;\theta) r, \; T \right)    (10)

The parameters \theta can be found by minimizing C(\theta,O,T) using gradient-descent techniques. For convenience when computing the gradient, we define \mu such that

\mu = \left( F^T W(O;\theta) F \right)^{-1} F^T W(O;\theta) r    (11)
Using the chain rule, the partial derivative of C(·) with respect to a parameter \theta_i is

\frac{\partial C(\theta,O,T)}{\partial \theta_i} = \frac{\partial C(\theta,O,T)}{\partial \mu} \frac{\partial \mu}{\partial \theta_i}    (12)

Note that \partial C(\theta,O,T) / \partial \mu is a 1 × N_p vector, where N_p is the number of pixels in the image, equal to

\left[ \frac{\partial C(\theta,O,T)}{\partial \mu_1} \; \cdots \; \frac{\partial C(\theta,O,T)}{\partial \mu_{N_p}} \right]    (13)

The derivative vector \partial \mu / \partial \theta_i is an N_p × 1 vector with each element equal to

\begin{bmatrix} \partial \mu_1 / \partial \theta_i \\ \vdots \\ \partial \mu_{N_p} / \partial \theta_i \end{bmatrix}    (14)
If the loss function is the squared difference between \mu and the ground-truth image, T, then

C(\theta,O,T) = (\mu - T)^T (\mu - T)    (15)

and

\frac{\partial C(\theta,O,T)}{\partial \mu} = 2(\mu - T)^T
3.2. Differentiating µ

The vector \mu, defined in Equation 11, can be differentiated with respect to a parameter \theta_i using the matrix derivative identity

\frac{\partial A^{-1}}{\partial x} = -A^{-1} \frac{\partial A}{\partial x} A^{-1}    (16)

Using this identity,

\frac{\partial \mu}{\partial \theta_i} = -\left( F^T W(O;\theta) F \right)^{-1} F^T \frac{\partial W(O;\theta)}{\partial \theta_i} F \left( F^T W(O;\theta) F \right)^{-1} F^T W(O;\theta) r + \left( F^T W(O;\theta) F \right)^{-1} F^T \frac{\partial W(O;\theta)}{\partial \theta_i} r    (17)

Equation 17 can be expressed much more succinctly by replacing \left( F^T W(O;\theta) F \right)^{-1} F^T W(O;\theta) r with \mu and factoring:

\frac{\partial \mu}{\partial \theta_i} = \left( F^T W(O;\theta) F \right)^{-1} F^T \frac{\partial W(O;\theta)}{\partial \theta_i} (-F\mu + r)    (18)
3.3. Computational Implications

Combining Equation 12 with Equation 18, the derivative of C(·) with respect to a parameter \theta_i is

\frac{\partial C(\theta,O,T)}{\partial \theta_i} = \frac{\partial C(\theta,O,T)}{\partial \mu} \left( F^T W(O;\theta) F \right)^{-1} F^T \frac{\partial W(O;\theta)}{\partial \theta_i} (-F\mu + r)    (19)

The most expensive steps in Equation 19 are computing \mu and \frac{\partial C(\theta,O,T)}{\partial \mu} \left( F^T W(O;\theta) F \right)^{-1}. Both of these terms involve calculating the product of an inverse matrix and a vector. This can be computed efficiently in O(N_p^2) time using numerical linear algebra techniques. In our implementation, we use the "backslash" operator in MATLAB to compute these vectors.

Also notice that if the matrix multiplications are ordered correctly, each of these vectors must only be computed once to calculate the gradient with respect to all \theta_i. This makes the complexity of computing the gradient O(N_p^2). Compared to maximizing the likelihood, which requires O(N_p^3) matrix inversions, this is a significant reduction in computation. Even for a relatively small 128 × 128 image, this is an immense savings in computation.
4. Related Work
This GCRF model is similar to that used by Tappen, Adelson, and Freeman to learn parameters for decomposing an image into shading and albedo components [12]. That system was also used to denoise images, though with an error slightly higher than the Field of Experts model. The primary differences between that system and the system described here are that many more filters are used and the filter responses are all set to zero. In [12], a semi-parametric regression system was used to estimate the filter responses for horizontal and vertical first-order derivatives. The GCRF model described here uses higher-order derivatives and constrains them to be zero. This better captures the piecewise continuity of images.

Figure 1. The filters used in the GCRF model for denoising images. These filters are a set of derivatives.
A distinguishing characteristic of this method of training is that model parameters are optimized using a loss function that measures the difference between the most likely image and the ground-truth image. This type of training has also been used for discrete-valued graphical models by Collins [3], Taskar et al. [15, 14], and Altun et al. [1]. In [3], Collins shows how a variant of the perceptron algorithm can be used to maximize a margin, similar to the margin in the Support Vector Machine literature, between the correct output of a graphical model and all other possible outputs. In [15], Taskar et al. show how a loss function can be upper-bounded with a hinge function and the parameters can be found using convex optimization. In [1], Altun et al. present a similar formulation.

Unlike [15] and [1], this work does not rely on machinery from convex optimization. Instead, the gradient is calculated analytically, and standard gradient-based optimization techniques can be used to optimize the loss function. Relying on convex optimization limits the types of weight functions and loss functions that can be used. The only restriction on the weight functions in the GCRF formulation presented in this paper is that they be differentiable.

The GCRF model can also be viewed as a type of Energy-Based Model [6]. The advantage of Energy-Based Models is that the partition function does not need to be computed during training.
5. Comparing GCRFs to Field of Experts Models for Denoising
To demonstrate the power of GCRFs on a real-world problem, we constructed a GCRF model for denoising images. As stated earlier, the model has three major components: the linear features f_1 ... f_{N_f}, the estimated responses of the features r_1(·) ... r_{N_f}(·), and the weighting functions w_1(·) ... w_{N_f}(·).

To keep the linear system invertible, the first filter, f_1, is the identity filter with r_1(x,y;O) = O(x,y) and w_1(x,y;O) = 1. In other words, the first feature constrains the mean to be somewhat close to the observation.

The remaining filters are a set of eleven 3 × 3 derivative filters, shown in Figure 1. For each of these filters, the estimated response, r_i(·), equals zero. Essentially, these filters impose smoothness constraints.

Figure 2. The filters used for generating the features that are used to estimate the weight w_i(x,y;O) of each term in the GCRF model.
The weight function associated with each filter is based on the absolute response of the observed image O to a set of multi-scale oriented edge and bar filters. These filters, shown in Figure 2, operate at multiple scales in the image: the sizes of the filters in the three rows are 11 pixels, 21 pixels, and 31 pixels. The filters were designed to be elongated in order to be sensitive to extended edges. Due to the large size of the filters, reflected border handling was used when computing the filter responses. A third set of responses was created by squaring the responses of the corresponding edge and bar filters, then adding these values.
Altogether, there are 72 filter responses at each pixel in the image. Denoting the absolute response to each filter at location (x,y) as a^{(x,y)}_1 ... a^{(x,y)}_{72}, where a^{(x,y)}_{72} is a feature containing all ones, the weighting function is

w_i(x,y;O) = \exp\left( \sum_{j=1}^{72} \theta_i^j a^{(x,y)}_j \right)    (20)

where \theta_i^j represents the weight associated with response j for filter i. The complete set of parameters, \theta, is the concatenation of the response weights for each filter, \theta = [\theta_1 \theta_2 \ldots \theta_{12}]. The exponential ensures that the weight is always positive.
The intuition behind this weight function is that the observed image, O, is being used to guess where edges occur in the image. The weight of the smoothing constraints can then be reduced in those areas.
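A minimal sketch of the weighting function of Equation 20, with two toy derivative filters standing in for the 72-response feature set (the function names, filters, and parameter values are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import correlate

def edge_features(obs, edge_filters):
    """Per-pixel features: absolute responses to each oriented filter,
    plus a constant all-ones bias feature (cf. Equation 20)."""
    feats = [np.abs(correlate(obs, f, mode='reflect'))  # reflected borders
             for f in edge_filters]
    feats.append(np.ones_like(obs))                     # bias feature
    return np.stack(feats)                              # (n_feats, H, W)

def weights(obs, theta_i, edge_filters):
    """w_i(x, y; O) = exp( sum_j theta_i[j] * a_j(x, y) )"""
    a = edge_features(obs, edge_filters)
    return np.exp(np.tensordot(theta_i, a, axes=1))

# toy example: horizontal and vertical difference "edge filters"
filters = [np.array([[1.0, -1.0]]), np.array([[1.0], [-1.0]])]
obs = np.zeros((8, 8))
obs[:, 4:] = 10.0                        # a single vertical edge
theta_i = np.array([-0.5, -0.5, 0.0])    # negative: edges lower the weight
w = weights(obs, theta_i, filters)
```

With negative parameters on the edge responses, `w` drops sharply along the edge column and stays near one in flat regions, which is exactly the behavior the smoothness terms need.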
5.1. Training the GCRF Model
We trained the GCRF model by optimizing the response weights, \theta, to minimize the absolute value of the difference between the predicted image and the true noiseless image, T. To make the loss function differentiable, we used (x^2 + 0.1)^{1/2} to approximate |x|. Using the derivations from Section 3.2, the derivative with respect to a particular \theta_i^j is

\frac{\partial C}{\partial \theta_i^j} = \frac{(\mu - T)}{\left( (\mu - T)^2 + 0.1 \right)^{1/2}} \left( F^T W F \right)^{-1} F^T \frac{\partial W(O;\theta)}{\partial \theta_i^j} (-F\mu + r)    (21)
For conciseness, we have dropped the arguments from C and W. Note that \mu and T are vectors, so the division and square-root operations in (\mu - T) / ((\mu - T)^2 + 0.1)^{1/2} are element-wise operations.

We found that optimizing the parameters of the model with this smoothed absolute-error criterion led to an estimator that generalized better than an estimator trained using a squared-error criterion. Optimizing under a squared-error criterion causes the algorithm to focus on the images with the largest errors, whereas these images are treated equally under the absolute-error criterion. This helps generalization performance because outlier images do not receive undue weight during training.
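The smoothed absolute-error criterion and its element-wise derivative can be written down directly (a sketch; the function names are ours, and `eps=0.1` matches the constant used in the paper):

```python
import numpy as np

def smoothed_abs_loss(mu, T, eps=0.1):
    """Differentiable surrogate for sum |mu - T|: sum sqrt((mu - T)^2 + eps)."""
    return np.sum(np.sqrt((mu - T) ** 2 + eps))

def smoothed_abs_loss_grad(mu, T, eps=0.1):
    """Element-wise dC/dmu = (mu - T) / ((mu - T)^2 + eps)^(1/2)."""
    d = mu - T
    return d / np.sqrt(d ** 2 + eps)

# the gradient magnitude is bounded by 1, so an outlier pixel (or image)
# cannot dominate training the way it can under a squared-error loss
mu = np.array([0.0, 1.0, 100.0])
T = np.zeros(3)
g = smoothed_abs_loss_grad(mu, T)
```

The bounded gradient is the mechanism behind the generalization benefit described above: a pixel that is off by 100 pulls on the parameters no harder than one that is off by 10.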
The \theta vector was optimized using a gradient-descent algorithm with the "bold-driver" heuristic. Under this heuristic, the step size was decreased each time the error in the denoised estimates increased.
The system was trained on forty 129 × 129 patches taken from the same training images used by Roth and Black. White Gaussian noise with a standard deviation of 25 was added to each observed image. Computing the gradient for two images required approximately two minutes on a 2.4 GHz Intel Xeon processor. Using a computing cluster, we were able to train the model in a few hours. The training images used in training the GCRF model were much larger than the 15 × 15 patches used to train the Field of Experts model, which was also trained using a cluster. It is useful to be able to train on large images because they better match the test images.
5.2. Results

For comparison, we ran the Field of Experts (FoE) implementation provided by Roth and Black on the same 68 test images from the Berkeley Segmentation Database that Roth and Black identified as test images. The observations were created by adding white Gaussian noise with a standard deviation of 25 to each image. The FoE model denoises images by using gradient descent to minimize a robust cost function. The results vary depending on the step-size parameter used. We found that we obtained images with the highest log-posteriors by optimizing for 5000 iterations with a step size of 0.6. This enabled us to compare the two models fairly because the results from the GCRF model are optimal under that model.
Model               Average RMS Error    Average PSNR
                    (Lower is Better)    (Higher is Better)
Field of Experts    11.04                27.59
GCRF                10.47                28.04

Table 1. This table compares the average error in denoised estimates produced by the GCRF model described in this paper and the Field of Experts model from [8]. The results are averaged over the same 68 test images used in [8]. The standard deviation of the noise, σ, in the observations was 25.
Model                                             Time to Produce Estimates (sec)
Field of Experts (2500 Iterations)                376.9
GCRF                                              180.7
GCRF using Conjugate Gradient (300 Iterations)    97.82

Table 2. A comparison of the amount of time needed to denoise an image using either the Field of Experts model, which uses steepest descent, or the GCRF model, which uses the MATLAB linear solver. The Conjugate Gradient method is described in Section 5.3. Times were computed on a 3.2 GHz Intel Xeon processor.
Figure 3. This scatterplot shows the RMS error in each of the 68 test images when denoised with either the Field of Experts (FoE) model or the GCRF model; the x-axis is the RMS error of the GCRF estimate and the y-axis is the RMS error of the FoE estimate. The GCRF model outperforms the FoE model for all images where the point is above the red line.
As Table 1 shows, the GCRF model produces denoised images with a lower error than the Field of Experts model. In addition, as Table 2 shows, the GCRF model requires nearly half the time to produce a denoised 481 × 321 image. Note that in this table, we have shown the time required for 2500 iterations of the steepest-descent algorithm, which is the number of iterations suggested by Roth and Black.

To compare the performance of the two models across the test set, Figure 3 shows a scatterplot of the RMS errors of the images estimated with the two models. Those images where the GCRF model produces a better estimate appear above the red line. As Figure 3 shows, the GCRF model produces the better estimate for the majority of the images. It is important to note that RMS error is just one error metric; fortunately, if good performance is desired under a specific error metric, the GCRF model can easily be adapted to it. It should be noted that the GSM denoising algorithm [7] still outperforms both of these models in terms of PSNR.
Figures 4, 5, and 6 show three examples of images from the test set where the GCRF model produces better denoised images in terms of PSNR. Figure 7 shows an example where the Field of Experts model outperforms the GCRF model in terms of PSNR.

Qualitatively, the GCRF model does a much better job of preserving fine-scale texture. This comes at the expense of eliminating low-frequency noise in smooth portions of the image, which can be seen in Figure 7.
5.3. Iterative Estimation in GCRF Models

For large images, the overhead of constructing the matrices and solving the linear system may be prohibitive. In these situations, the denoised image can be obtained by minimizing Equation 6 iteratively. A significant advantage of the GCRF model is that denoising is accomplished by solving a linear system. This linear system can be solved iteratively using the conjugate gradient method, which converges more quickly than standard steepest descent [9].
Using the notation in [9], the basic steps of the conjugate gradient method for solving Ax = b are

d_0 = r_0 = b - A x_0    (22)

\alpha_i = \frac{r_i^T r_i}{d_i^T A d_i}    (23)

x_{i+1} = x_i + \alpha_i d_i    (24)

r_{i+1} = r_i - \alpha_i A d_i    (25)

\beta_{i+1} = \frac{r_{i+1}^T r_{i+1}}{r_i^T r_i}    (26)

d_{i+1} = r_{i+1} + \beta_{i+1} d_i    (27)

When using this method for the GCRF model, the matrix A is replaced with \left( F^T W(O;\theta) F \right). Each of the matrices in F corresponds to computing a convolution, so matrix-vector products such as d_i^T A d_i can be computed using convolutions. This makes each iteration of the conjugate gradient method slightly more expensive than an iteration of steepest descent.
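The steps in Equations 22-27 can be sketched in matrix-free form: `apply_A` only needs to supply matrix-vector products, which the GCRF computes with convolutions. In this sketch (names and the small dense stand-in for F^T W F are illustrative assumptions), A is never formed explicitly:

```python
import numpy as np

def conjugate_gradient(apply_A, b, x0, n_iters=300, tol=1e-10):
    """Plain conjugate gradient (Eqs. 22-27); apply_A(x) returns A @ x,
    so A is never stored as a matrix."""
    x = x0.copy()
    r = b - apply_A(x)             # Eq. 22
    d = r.copy()
    rr = r @ r
    for _ in range(n_iters):
        Ad = apply_A(d)
        alpha = rr / (d @ Ad)      # Eq. 23
        x += alpha * d             # Eq. 24
        r -= alpha * Ad            # Eq. 25
        rr_new = r @ r
        if rr_new < tol:
            break
        beta = rr_new / rr         # Eq. 26
        d = r + beta * d           # Eq. 27
        rr = rr_new
    return x

# For the GCRF, apply_A(x) = F^T W F x, computed with convolutions in
# practice; here a small dense stand-in keeps the example self-contained.
rng = np.random.default_rng(0)
F = rng.normal(size=(20, 8))
w = rng.uniform(0.5, 2.0, size=20)          # diagonal of W
r_vec = rng.normal(size=20)
apply_A = lambda x: F.T @ (w * (F @ x))     # never forms F^T W F
b = F.T @ (w * r_vec)
mu = conjugate_gradient(apply_A, b, np.zeros(8))
```

Because only products with F and F^T are needed, the storage cost stays linear in the number of pixels, which is what makes the iterative estimator practical for large images.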
One advantage of the conjugate gradient method is that there are no step-size parameters. We also found that it converged quickly. After 300 iterations of the conjugate gradient method, the average absolute per-pixel difference between the iterative estimate and the estimate found with the direct matrix solver was 5.3851e-05. As Table 2 shows, using the iterative solver significantly reduces the amount of time needed to produce an estimate. Further improvements in speed may be possible by using a preconditioner suited to images, such as the preconditioner described in [11].
While the conjugate gradient method could be used to find estimates in the Field of Experts model, the non-linear conjugate gradient algorithm requires line searches, which would affect its overall performance.

Figure 4. (a) Original; (b) Noisy Observation; (c) Results using FoE model, PSNR = 25.23; (d) Results using GCRF model, PSNR = 26.28. In this denoising task, the GCRF model produces a more accurate estimate of the true image than the Field of Experts (FoE) model in terms of PSNR. The GCRF model also does a better job of preserving the texture in the image.

Figure 5. (a) Original; (b) Noisy Observation; (c) Results using FoE model, PSNR = 25.77; (d) Results using GCRF model, PSNR = 26.75. A second example where the GCRF model produces superior results in terms of PSNR. Again, the GCRF better preserves texture.
6. Conclusion

We have shown how a Gaussian Conditional Random Field (GCRF) model can outperform the non-convex Field of Experts model on the task of denoising images. The GCRF model produced more accurate estimates, in terms of PSNR, and required less computation to produce estimates. This is significant because of the advantages of the GCRF model:
1. It is convenient to work with. The GCRF model can be implemented using standard numerical linear algebra routines, which are available in most programming environments.

2. As Sections 3 and 5 showed, the model can be trained on relatively large images.

3. Any differentiable loss function on the estimated images can be used during training. This is important because the behavior of a system can be significantly influenced by the loss function used to train it. Allowing for different loss functions during training enables a system's behavior to be controlled more precisely.
Combined with the ability to perform as well as non-convex models, these advantages make the GCRF model an attractive option for low-level vision and image-processing applications.
Figure 6. (a) Original; (b) Noisy Observation; (c) Result from FoE, PSNR = 28.59; (d) Result from GCRF, PSNR = 28.68. In this comparison, the PSNR for the Field of Experts result is slightly lower than the PSNR reported in [8]. We found that as the Field of Experts code optimized the denoised images, the PSNR began to decrease while the log-posterior of the estimate continued to increase. However, the result in (c) is visually quite similar to that in [8].
Figure 7. (a) Original; (b) Noisy Observation; (c) Result using FoE model, PSNR = 36.07; (d) Result using GCRF model, PSNR = 34.77. These images are an example of how the Field of Experts model can produce superior estimates in terms of PSNR. The FoE model is better at removing low-frequency noise from flat regions.
References

[1] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In The Twentieth International Conference on Machine Learning (ICML-2003), 2003.

[2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.

[3] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP 2002, 2002.

[4] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael. Learning low-level vision. International Journal of Computer Vision, 40(1):25–47, 2000.

[5] J. Lafferty, F. Pereira, and A. McCallum. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001.

[6] Y. LeCun and F. Huang. Loss functions for discriminative training of energy-based models. In Proc. of the 10th International Workshop on Artificial Intelligence and Statistics (AIStats'05), 2005.

[7] J. Portilla, V. Strela, M. Wainwright, and E. P. Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Transactions on Image Processing, 12(11):1338–1351, November 2003.

[8] S. Roth and M. Black. Field of experts: A framework for learning image priors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 860–867, 2005.

[9] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Available at http://www.cs.cmu.edu/~jrs/jrspapers.html.

[10] R. Szeliski. Bayesian modeling of uncertainty in low-level vision. International Journal of Computer Vision, 5(3):271–301, 1990.

[11] R. Szeliski. Locally adapted hierarchical basis preconditioning. ACM Transactions on Graphics, 25(3):1135–1143, August 2006.

[12] M. F. Tappen, E. H. Adelson, and W. T. Freeman. Estimating intrinsic component images using non-linear regression. In The Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1992–1999, 2006.

[13] M. F. Tappen, B. C. Russell, and W. T. Freeman. Efficient graphical models for processing images. In Proceedings of the 2004 IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 673–680, 2004.

[14] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: A large margin approach. In The Twenty Second International Conference on Machine Learning (ICML-2005), 2005.

[15] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction via the extragradient method. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA, 2006.