

Learning Gaussian Conditional Random Fields for Low-Level Vision

Anonymous CVPR submission

Paper ID 770

Abstract

Markov Random Field (MRF) models are a popular tool for vision and image processing. Gaussian MRF models are particularly convenient to work with because they can be implemented using matrix and linear algebra routines. However, recent research has focused on discrete-valued and non-convex MRF models because Gaussian models tend to over-smooth images and blur edges. In this paper, we show how to train a Gaussian Conditional Random Field (GCRF) model that overcomes this weakness and can outperform the non-convex Field of Experts model on the task of denoising images. A key advantage of the GCRF model is that the parameters of the model can be optimized efficiently on relatively large images. The competitive performance of the GCRF model and the ease of optimizing its parameters make the GCRF model an attractive option for vision and image processing applications.

1. Introduction

Markov Random Field, or MRF, models are a popular tool for solving problems in low-level vision and image processing. MRF models are convenient because the relationships between variables in the model are expressed in terms of local relationships, such as the relationship between pixels in an image. At the same time, they are powerful because these local relationships combine to produce long-range interactions between variables.

Gaussian Markov Random Fields, which are MRFs where the variables are also jointly Gaussian, have a long history in computer vision research [10]. Gaussian models are particularly convenient to work with because many of the computations involving these models, such as inference, can be accomplished using linear algebra. Algorithms for numerical linear algebra are well understood, and efficient implementations are available in most programming environments.

Despite these advantages, many Gaussian models have a characteristic weakness. Empirically, using these models often results in over-smoothed images. This has led recent research on MRF models of images to focus on discrete-valued MRF models [2, 4, 13] or non-convex continuous-valued models, such as the Field of Experts model [8]. One of the driving forces behind choosing these types of models is that they can implement robust penalty functions, such as truncated quadratic penalties. Using these robust penalties alleviates the problem of over-smoothing and has improved the performance on a number of tasks, including image denoising [8], stereo reconstruction [2], and super-resolution [13].

In this paper, we describe a Gaussian Conditional Random Field, or GCRF, model that avoids the over-smoothing typically associated with Gaussian models. To demonstrate its power, we show how it can denoise images with less error than the non-convex Field of Experts model. The GCRF model is also able to produce estimates much more quickly.

An additional advantage of the GCRF model is that the parameters of the model can be found efficiently by minimizing the error in the MAP solution of the model. This tractable method of training, described in Section 3, enables the GCRF model to improve on the results produced by the more complex, non-convex Field of Experts model. The fact that GCRF models are relatively easy to train and can perform as well as more complicated models makes them potentially useful for a number of low-level vision and image-processing tasks. To demonstrate the convenience and power of the model, sample training implementations are available at http://Removed for Review.

Section 2 describes the GCRF model. The training algorithm is described in Section 3. Related work and models are discussed in Section 4. The GCRF model is compared to the Field of Experts model in Section 5.

2. Motivating the GCRF Model

The Gaussian Conditional Random Field (GCRF) model can be motivated in two ways: probabilistically as a Conditional Random Field [5], or as an estimator based on minimizing a cost function. Section 2.1 shows how the GCRF estimator can be motivated as a probabilistic model, while Section 2.3 shows how it can be motivated as the minimum of a quadratic cost function.



2.1. Gaussian Markov Random Field Model

Before presenting the actual Gaussian Conditional Random Field (GCRF) model used to denoise images, this section will first describe a GCRF model for images that suffers from the problem of oversmoothing. Section 2.2 will then show how this model can be improved into a model that can handle sharp edges and produce much improved results.

This GCRF model is defined by a set of linear features that are convolution kernels. For a set of features, f_1 ... f_{N_f}, the probability density of an image, X, conditioned on the observed image, O, is defined to be

p(X) = \frac{1}{Z} \exp\left( -\sum_{i=1}^{N_f} \sum_{x,y} \left( (X \ast f_i)(x,y) - r_i(x,y;O) \right)^2 \right)   (1)

where (X ∗ f_i)(x, y) denotes the value at location (x, y) in the image produced by convolving X with f_i. We assume that this image only contains those pixels where the entire filter fits in X. This corresponds to "valid" convolution in MATLAB. The function r_i(x, y; O) contains the estimated value of (X ∗ f_i)(x, y). For each feature f_i, the function r_i uses the observed image O to estimate the value of the filter response at each pixel. In the rest of this paper, r_i(x, y; O) will be shortened to r_i for conciseness.

The exponent,

\sum_{i=1}^{N_f} \sum_{x,y} \left( (X \ast f_i)(x,y) - r_i \right)^2   (2)

can be written in matrix form by creating a set of matrices F_1 ... F_{N_f}. Each matrix F_i performs the same set of linear operations as convolving an image with a filter f_i. In other words, if X is a vector created by unwrapping the image X, then the values of F_i X are identical to the values of X ∗ f_i. These matrices can then be stacked and Equation 2 can be rewritten as

(FX - r)^T (FX - r)   (3)

where

F = \begin{bmatrix} F_1 \\ F_2 \\ \vdots \\ F_{N_f} \end{bmatrix} \quad \text{and} \quad r = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_{N_f} \end{bmatrix}

Equation 3 can be rewritten in the standard form of the exponent of a multivariate normal distribution by setting the precision matrix, Λ^{-1}, to be F^T F and the mean, µ, to (F^T F)^{-1} F^T r.

Note that the difference between (X − µ)^T Λ^{-1} (X − µ) and Equation 3 is the constant r^T r − µ^T Λ^{-1} µ. However, this value is constant for all X, so it only affects the normalization constant, Z, and the distributions defined using Equation 1 or Λ and µ are identical.
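The equivalence between the convolution form in Equation 2 and the stacked-matrix form in Equation 3 is easy to check numerically. The sketch below (not from the paper; the helper name and the test filter are illustrative) builds a sparse matrix F_i whose product with a flattened image matches MATLAB-style "valid" convolution:

```python
# Check that a sparse matrix F_i reproduces "valid" 2-D convolution of an
# image X with a filter f_i, so F_i @ X.ravel() matches (X * f_i) in Eq. 2/3.
import numpy as np
import scipy.sparse as sp
from scipy.signal import convolve2d

def valid_conv_matrix(kernel, H, W):
    """Sparse (out_pixels x H*W) matrix applying 'valid' 2-D convolution."""
    kh, kw = kernel.shape
    k = kernel[::-1, ::-1]            # flip so the matrix-vector product is convolution
    oh, ow = H - kh + 1, W - kw + 1
    rows, cols, vals = [], [], []
    for y in range(oh):
        for x in range(ow):
            for dy in range(kh):
                for dx in range(kw):
                    rows.append(y * ow + x)
                    cols.append((y + dy) * W + (x + dx))
                    vals.append(k[dy, dx])
    return sp.csr_matrix((vals, (rows, cols)), shape=(oh * ow, H * W))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 8))
f = np.array([[1.0, -1.0, 0.5], [0.25, 2.0, -0.75]])   # illustrative filter
Fi = valid_conv_matrix(f, *X.shape)
assert np.allclose(Fi @ X.ravel(), convolve2d(X, f, mode="valid").ravel())
```

Stacking such matrices vertically for all filters gives the matrix F used in Equation 3.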

2.2. Utilizing Weights

One of the major weaknesses of the GCRF model in Equation 1 is that it does not properly capture the statistics of natural images. Natural images tend to have smooth regions interrupted by sharp discontinuities. The Gaussian model imposes smoothness, but penalizes sharp edges. In applications like image denoising, optic flow, and range estimation, this violates the actual image statistics and leads to oversmoothing.

One way to avoid this oversmoothing is to use non-convex, robust potential functions, such as those used in [13] and [8]. Unfortunately, the convenience of the quadratic model is lost when using these models.

Alternatively, the quadratic model can be modified by assigning weights to the various terms in the sum. For example, if the filters, f_1 ... f_{N_f}, include derivative filters, the basic GCRF model penalizes the strong derivatives around sharp edges. This causes the images estimated from basic GCRF models to have blurry edges. Using the observed image to lower the weight for those derivatives that cross edges will preserve the sharpness of the estimated image. It should be noted that this example is illustrative. Section 3 will show how to use training data to learn the relationship between the observed image and the weights.

Adding weights to the exponent of Equation 1 can be expressed formally by modifying Equation 2:

\sum_{i=1}^{N_f} \sum_{x,y} w_i(x,y;O,\theta) \left( (X \ast f_i)(x,y) - r_i \right)^2   (4)

where w_i(x, y; O, θ) is a positive weighting function that uses the observed image, O, to assign a positive weight to each quadratic term in Equation 4.

Assigning a weight to each quadratic term improves the model's ability to handle strong edges. For a term based on a derivative filter, the weight could be increased in flat regions of the image and decreased along edges. This would enable the model to smooth out noise, while preserving sharp edges. Section 3 will discuss how to learn the parameters, θ, of this weighting function.

A potential difficulty in using this model is that the weights must be computed from the observed image. This will be problematic if the observed image is too noisy or the necessary information is not easy to compute. Fortunately, as Section 5 will show, the system is able to perform well at high noise levels when denoising images. This indicates that good weights can still be computed in high-noise problems.

2.3. Cost-Function Perspective

The GCRF model can also be motivated from a cost-function point of view. Let h(O; θ) be an estimator that uses the observed image, O, to estimate an image. The estimate is the image X that minimizes the quadratic function:



\sum_{i=1}^{N_f} \sum_{x,y} w_i(x,y;O,\theta) \left( (X \ast f_i)(x,y) - r_i \right)^2   (5)

Or, in other words,

h(O;\theta) = \arg\min_X \sum_{i,x,y} w_i(x,y;O,\theta) \left( (X \ast f_i)(x,y) - r_i \right)^2   (6)

Again, r_i(x, y; O) has been shortened to r_i for conciseness. Because Equation 5 contains a quadratic cost function, the minimum can be computed using the pseudo-inverse. Using the matrix notation from Section 2.1, this pseudo-inverse can be expressed as:

h(O;\theta) = \left( F^T W(O;\theta) F \right)^{-1} F^T W(O;\theta) r   (7)

In this equation, we have introduced a diagonal matrix W(O; θ) that is a function of the observed image and the parameters θ. This matrix is block-diagonal, constructed from the diagonal sub-matrices W_1(O; θ) ... W_{N_f}(O; θ). Each entry along the diagonal of the W_i(O; θ) matrices is equal to the weight of a term at a particular pixel, w_i(x, y; O, θ).
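A minimal sketch of Equation 7 using SciPy's sparse direct solver is shown below (the paper's implementation uses MATLAB's backslash operator; the function name and inputs here are assumptions, with F built by stacking valid-convolution matrices as in the earlier sketch):

```python
# Compute mu = (F^T W F)^{-1} F^T W r (Equation 7) with a sparse solve instead
# of forming the inverse explicitly. F is the stacked filter matrix, r the
# stacked targets, and w the stacked per-term weights (all supplied by caller).
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def gcrf_map_estimate(F, r, w):
    Wd = sp.diags(w)
    A = (F.T @ Wd @ F).tocsc()        # sparse, symmetric positive definite
    b = F.T @ (Wd @ r)
    return spla.spsolve(A, b)
```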

3. Learning the Parameters of a GCRF

Before the model described in the previous section can be used, the parameters of the weighting functions, w_i(x, y; O, θ), must be chosen. Traditionally, the parameters of Conditional Random Fields are found by maximizing the likelihood of the training data. Using the GCRF model from Section 2.2, the conditional log-likelihood of a training sample T, conditioned on an associated observation O_T, is

-\sum_{i=1}^{N_f} \sum_{x,y} w_i(x,y;O_T,\theta) \left( (T \ast f_i)(x,y) - r_i \right)^2 - \log \int_X \exp\left( -\sum_{i=1}^{N_f} \sum_{x,y} w_i(x,y;O_T,\theta) \left( (X \ast f_i)(x,y) - r_i \right)^2 \right) dX   (8)

and its partial derivative with respect to a parameter θ_i is

-\sum_{i=1}^{N_f} \sum_{x,y} \frac{\partial w_i(x,y;O_T,\theta)}{\partial \theta_i} \left( (T \ast f_i)(x,y) - r_i \right)^2 + \frac{1}{Z} \int_X \left[ \sum_{i=1}^{N_f} \sum_{x,y} \frac{\partial w_i(x,y;O_T,\theta)}{\partial \theta_i} \left( (X \ast f_i)(x,y) - r_i \right)^2 \right] \exp\left( -\sum_{i=1}^{N_f} \sum_{x,y} w_i(x,y;O_T,\theta) \left( (X \ast f_i)(x,y) - r_i \right)^2 \right) dX   (9)

Notice that the integration on the right-hand side of Equation 9 can be rewritten as an expectation. This implies that computing the gradient of the log-likelihood with respect to the parameters requires computing the expected values of the clique potentials, or feature functions. For a Gaussian model, these expectations can be computed by inverting the inverse covariance matrix, or precision matrix, then multiplying it with another matrix. The key difficulty with using the log-likelihood to fit the GCRF parameters lies in this matrix inversion. While the matrix (F^T W(O;θ) F) is sparse, (F^T W(O;θ) F)^{-1} will be dense. For a 256 × 256 image, storing this dense matrix in memory with single-precision values will require 16GB of memory. In addition, inverting a matrix is an O(N^3) operation, making it infeasible for large images.
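This memory figure follows directly from the size of the dense inverse: the precision matrix for a 256 × 256 image has (256 · 256)^2 entries, so storing it in single precision requires

(256 \times 256)^2 \times 4\ \text{bytes} = 65536^2 \times 4\ \text{bytes} \approx 1.7 \times 10^{10}\ \text{bytes} \approx 16\ \text{GB}.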

3.1. Discriminative Learning for GCRFs

The matrix inversion can be avoided by instead penalizing the difference between the mode of the conditional distribution and the ground truth. The penalty is expressed using a loss function L(Y, T) that assigns a loss for an image Y based on its difference from the ground-truth image T. In the GCRF model, the mode of the conditional distribution is the conditional mean, so the cost, C(θ, O, T), associated with a particular set of parameters θ is

C(\theta,O,T) = L\left( \left( F^T W(O;\theta) F \right)^{-1} F^T W(O;\theta) r,\; T \right)   (10)

The parameters θ can be found by minimizing C(θ, O, T) using gradient descent techniques. For convenience when computing the gradient, we define µ such that

\mu = \left( F^T W(O;\theta) F \right)^{-1} F^T W(O;\theta) r   (11)

Using the chain rule, the partial derivative of C(·) with respect to a parameter θ_i is

\frac{\partial C(\theta,O,T)}{\partial \theta_i} = \frac{\partial C(\theta,O,T)}{\partial \mu} \frac{\partial \mu}{\partial \theta_i}   (12)

Note that ∂C(θ,O,T)/∂µ is a 1 × N_p vector, where N_p is the number of pixels in the image, equal to

\left[ \frac{\partial C(\theta,O,T)}{\partial \mu_1} \;\cdots\; \frac{\partial C(\theta,O,T)}{\partial \mu_{N_p}} \right]   (13)

The derivative vector ∂µ/∂θ_i is an N_p × 1 vector with each element equal to

\begin{bmatrix} \frac{\partial \mu_1}{\partial \theta_i} \\ \vdots \\ \frac{\partial \mu_{N_p}}{\partial \theta_i} \end{bmatrix}   (14)

If the loss function is the squared difference between µ and the ground-truth image, T, then

C(\theta,O,T) = (\mu - T)^T (\mu - T)   (15)

and

\frac{\partial C(\theta,O,T)}{\partial \mu} = 2(\mu - T)



3.2. Differentiating µ

The vector µ, defined in Equation 11, can be differentiated with respect to a parameter θ_i using the matrix derivative identity

\frac{\partial A^{-1}}{\partial x} = -A^{-1} \frac{\partial A}{\partial x} A^{-1}   (16)

Using this identity,

\frac{\partial \mu}{\partial \theta_i} = -\left( F^T W(O;\theta) F \right)^{-1} F^T \frac{\partial W(O;\theta)}{\partial \theta_i} F \left( F^T W(O;\theta) F \right)^{-1} F^T W(O;\theta) r + \left( F^T W(O;\theta) F \right)^{-1} F^T \frac{\partial W(O;\theta)}{\partial \theta_i} r   (17)

Equation 17 can be expressed much more succinctly by replacing (F^T W(O;θ) F)^{-1} F^T W(O;θ) r with µ and factoring:

\frac{\partial \mu}{\partial \theta_i} = \left( F^T W(O;\theta) F \right)^{-1} F^T \frac{\partial W(O;\theta)}{\partial \theta_i} \left( r - F\mu \right)   (18)

3.3. Computational Implications

Combining Equation 12 with Equation 18, the derivative of C(·) with respect to a parameter θ_i is

\frac{\partial C(\theta,O,T)}{\partial \theta_i} = \frac{\partial C(\theta,O,T)}{\partial \mu} \left( F^T W(O;\theta) F \right)^{-1} F^T \frac{\partial W(O;\theta)}{\partial \theta_i} \left( r - F\mu \right)   (19)

The most expensive step in Equation 19 is computing µ and ∂C(θ,O,T)/∂µ (F^T W(O;θ) F)^{-1}. Both of these terms involve calculating the product of an inverse matrix and a vector. This can be computed efficiently in O(N_p^2) time using numerical linear algebra techniques. In our implementation, we use the "backslash" operator in MATLAB to compute these vectors.

Also notice that if the matrix multiplications are ordered correctly, each of these vectors must only be computed once to calculate the gradient with respect to all θ_i. This makes the complexity of computing the gradient O(N_p^2). Compared to maximizing the likelihood, which requires O(N_p^3) matrix inversions, this is a significant reduction in computation. Even for a relatively small 128 × 128 image, there will be an immense savings in computation.
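To make the bookkeeping concrete, here is a minimal sketch of Equation 19 for the squared-error loss, written in Python with SciPy rather than the paper's MATLAB implementation; the function and argument names are illustrative assumptions:

```python
# Loss and gradient for the squared-error criterion, following Eqs. 11, 18, 19.
# F: sparse (M x Np) stacked filter matrix; r: (M,) stacked targets;
# w: (M,) weights for the current theta; dw_dtheta: list of (M,) vectors
# holding dw/dtheta_i; T: (Np,) flattened ground-truth image.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def gcrf_loss_grad(F, r, w, dw_dtheta, T):
    Wd = sp.diags(w)
    A = (F.T @ Wd @ F).tocsc()
    mu = spla.spsolve(A, F.T @ (Wd @ r))        # Equation 11
    residual = r - F @ mu                        # (r - F mu) term in Eqs. 18/19
    dC_dmu = 2.0 * (mu - T)                      # gradient of (mu - T)^T (mu - T)
    z = spla.spsolve(A, dC_dmu)                  # second solve, shared by all theta_i
    Fz = F @ z
    grad = np.array([Fz @ (dw_i * residual) for dw_i in dw_dtheta])
    loss = float((mu - T) @ (mu - T))
    return loss, grad
```

Only two linear solves are needed regardless of the number of parameters, matching the observation above that each vector is computed once and reused for every θ_i.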

4. Related Work

This GCRF model is similar to that used by Tappen, Adelson, and Freeman to learn parameters for decomposing an image into shading and albedo components [12]. That system was also used to denoise images, though with an error slightly higher than the Field of Experts model. The primary differences between that system and the system described here are that many more filters are used and the filter responses are all set to zero. In [12], a semi-parametric regression system was used to estimate the filter responses for horizontal and vertical first-order derivatives. The GCRF model described here uses higher-order derivatives and constrains them to be zero. This better captures the piecewise continuity in images.

Figure 1. The filters used in the GCRF model for denoising images. These filters are a set of derivatives.

A distinguishing characteristic of this method for training is that model parameters are optimized using a loss function that measures the difference between the most likely image and the ground-truth image. This type of training has also been used for discrete-valued graphical models by Collins [3], Taskar et al. [15, 14], and Altun et al. [1]. In [3], Collins shows how a variant of the perceptron algorithm can be used to maximize a margin, similar to the margin in the Support Vector Machine literature, between the correct output from a graphical model and all other possible outputs. In [15], Taskar et al. show how a loss function can be upper-bounded with a hinge function and the parameters can be found using convex optimization. In [1], Altun et al. present a similar formulation.

Unlike [15] and [1], this work does not rely on machinery from convex optimization. Instead, the gradient is calculated analytically and standard gradient-based optimization techniques can be used to optimize the loss function. Relying on convex optimization limits the types of weight functions and loss functions that can be used. The only restriction on the weight functions in the GCRF formulation presented in this paper is that they be differentiable.

The GCRF model can also be viewed as a type of Energy-Based Model [6]. The advantage of Energy-Based Models is that the partition function does not need to be computed during training.

5. Comparing GCRFs to Field of Experts Models for Denoising

To demonstrate the power of GCRFs on a real-world problem, we constructed a GCRF model for denoising images. As stated earlier, a model has three major components: the linear features f_1 ... f_{N_f}, the estimated responses of the features r_1(·) ... r_{N_f}(·), and the weighting functions w_1(·) ... w_{N_f}(·).

To keep the linear system invertible, the first filter, f_1, is the identity filter with r_1(x, y; O) = O(x, y) and w_1(x, y; O) = 1. In other words, the first feature constrains the mean to be somewhat close to the observation.

The remaining filters are a set of eleven 3 × 3 derivative filters. The filters are shown in Figure 1. For each of these filters, the estimated response, r_i(·), equals zero. Essentially, these filters are imposing smoothness constraints.



Figure 2. The filters used for generating the features that are used to estimate the weight w_i(x, y; O) of each term in the GCRF model.

The weight function associated with each filter is based on the absolute response of the observed image O to a set of multi-scale oriented edge and bar filters, shown in Figure 2. These filters operate at multiple scales in the image; the sizes of the filters in the three rows are 11, 21, and 31 pixels. The filters were designed to be elongated in order to be sensitive to extended edges. Due to the large size of the filters, reflected border handling was used when computing the filter responses. A third set of responses was created by squaring the responses of the corresponding edge and bar filters, then adding these values.

Altogether, there are 72 filter responses at each pixel in the image. Denoting the absolute response to each filter at location (x, y) as a_1^{(x,y)} ... a_{72}^{(x,y)}, where a_{72}^{(x,y)} is a feature containing all ones, the weighting function is

w_i(x, y; O) = \exp\left( \sum_{j=1}^{72} \theta_{ij} a_j^{(x,y)} \right)   (20)

where θ_{ij} represents the weight associated with response j for filter i. The complete set of parameters, θ, is the concatenation of the response weights for each filter, θ = [θ_1 θ_2 ... θ_{12}]. The exponential ensures that the weight is always positive.

The intuition behind this weight function is that the observed image, O, is being used to guess where edges occur in the image. The weight of the smoothing constraints can then be reduced in those areas.
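A minimal sketch of Equation 20 is shown below; the parameter matrix theta (12 filters by 72 responses) and the per-pixel response matrix a are assumed to be computed elsewhere from the observation O:

```python
# Weighting function of Equation 20: each filter's per-pixel weight is the
# exponential of a linear combination of the 72 edge/bar-filter responses.
import numpy as np

def gcrf_weights(theta, a):
    """theta: (12, 72) response weights per filter; a: (Npix, 72) responses.
    Returns a (12, Npix) array of strictly positive weights w_i(x, y; O)."""
    return np.exp(theta @ a.T)
```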

5.1. Training the GCRF Model

We trained the GCRF model by optimizing the response weights, θ, to minimize the absolute value of the difference between the predicted image and the true noiseless image, T. To make the loss function differentiable, we used (x^2 + 0.1)^{1/2} to approximate |x|. Using the derivations from Section 3, the derivative with respect to a particular θ_{ij} is

\frac{\partial C}{\partial \theta_{ij}} = \frac{(\mu - T)}{\left( (\mu - T)^2 + 0.1 \right)^{\frac{1}{2}}} \left( F^T W F \right)^{-1} F^T \frac{\partial W(O;\theta)}{\partial \theta_{ij}} \left( r - F\mu \right)   (21)

For conciseness, we have dropped the arguments from C and W. Note that µ and T are vectors, so the division and square-root operations in (µ − T)/((µ − T)^2 + 0.1)^{1/2} are element-wise operations.

We found that optimizing the parameters of the model with this smoothed absolute error criterion led to an estimator that generalized better than the estimator trained using a squared-error criterion. Optimizing under a squared-error criterion causes the algorithm to focus on the images with larger errors. These images are treated equally under the absolute error criterion. This helps generalization performance because outlier images do not receive undue weight during training.
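Only the ∂C/∂µ term changes between the two criteria. A small sketch of the smoothed absolute error and its gradient (with the 0.1 smoothing constant quoted above):

```python
# Smoothed absolute-error loss from Section 5.1: sqrt(x^2 + 0.1) in place of |x|.
# Its gradient with respect to mu is the element-wise factor appearing in Eq. 21.
import numpy as np

def smoothed_abs_loss(mu, T, eps=0.1):
    diff = mu - T
    loss = float(np.sum(np.sqrt(diff ** 2 + eps)))
    dC_dmu = diff / np.sqrt(diff ** 2 + eps)
    return loss, dC_dmu
```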

The θ vector was optimized using a gradient-descent algorithm with the "bold-driver" heuristic. Under this heuristic, the step size was decreased each time the error in the denoised estimates increased.
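A minimal sketch of gradient descent with a bold-driver step-size rule follows; the growth and shrink factors are common choices rather than values reported in the paper, and loss_and_grad stands for the loss and gradient computation described above:

```python
# Bold-driver gradient descent: grow the step after a successful update,
# shrink it and reject the update when the training error increases.
def bold_driver(loss_and_grad, theta0, step=1e-3, grow=1.05, shrink=0.5, n_iter=200):
    theta = theta0.copy()
    loss, grad = loss_and_grad(theta)
    for _ in range(n_iter):
        candidate = theta - step * grad
        new_loss, new_grad = loss_and_grad(candidate)
        if new_loss < loss:
            theta, loss, grad = candidate, new_loss, new_grad
            step *= grow              # reward a successful step
        else:
            step *= shrink            # error increased: discard the step
    return theta
```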

The system was trained on forty 129 × 129 patches taken from the same training images used by Roth and Black. White Gaussian noise with a standard deviation of 25 was added to each observed image. Computing the gradient for two images required approximately two minutes on a 2.4 GHz Intel Xeon processor. Using a computing cluster, we were able to train the model in a few hours. The patches used to train the GCRF model were much larger than the 15 × 15 patches used to train the Field of Experts model, which was also trained using a cluster. It is useful to be able to train on large images because they better match the test images.

5.2. Results

For comparison, we ran the Field of Experts (FoE) implementation provided by Roth and Black on the same 68 test images from the Berkeley Segmentation Database that Roth and Black identified as test images. The observations were created by adding white Gaussian noise with a standard deviation of 25 to each image. The FoE model denoises images by using gradient-descent to minimize a robust cost function. The results vary depending on the step-size parameter used. We found that we obtained images with the highest log-posteriors by optimizing for 5000 iterations with a step-size of 0.6. This enabled us to compare the two models fairly because the results from the GCRF model are optimal under that model.



Model               Average RMS Error (lower is better)   Average PSNR (higher is better)
Field of Experts    11.04                                 27.59
GCRF                10.47                                 28.04

Table 1. This table compares the average error in denoised estimates produced by the GCRF model described in this paper and the Field of Experts model from [8]. The results are averaged over the same 68 test images used in [8]. The standard deviation of the noise, σ, in the observations was 25.

Model                                            Time to Produce Estimates (sec)
Field of Experts (2500 iterations)               376.9
GCRF                                             180.7
GCRF using Conjugate Gradient (300 iterations)   97.82

Table 2. A comparison of the amount of time needed to denoise an image using either the Field of Experts model, which uses steepest-descent, or the GCRF model, which uses the MATLAB linear solver. The Conjugate Gradient method is described in Section 5.3. Times were computed on a 3.2 GHz Intel Xeon processor.

[Figure 3: scatterplot of per-image RMS errors; x-axis: RMS Error of GCRF Estimate, y-axis: RMS Error of FoE Estimate.]

Figure 3. This scatterplot shows the RMS error in each of the 68 test images when denoised with either the Field of Experts (FoE) model or the GCRF model. The GCRF model outperforms the FoE model for all images where the point is above the red line.

As Table 1 shows, the GCRF model produces denoised images with a lower error than the Field of Experts model. In addition, as Table 2 shows, the GCRF model requires nearly half the time to produce a denoised 481 × 321 image. Note that in this table, we have shown the time required for 2500 iterations of the steepest-descent algorithm, which is the number of iterations suggested by Roth and Black.

To compare the performance of the two models across the test set, Figure 3 shows a scatterplot of the RMS errors of the images estimated with the two models. Those images where the GCRF model produces a better estimate appear above the red line. As Figure 3 shows, the GCRF model produces the better estimate for the majority of the images. It is important to note that the RMS error is just one type of error metric. Fortunately, if good performance is desired under a specific error metric, the GCRF model can be easily adapted to different error metrics. It should be noted that the GSM denoising algorithm [7] still outperforms both of these models in terms of PSNR.

Figures 4, 5, and 6 show three examples of images from the test set where the GCRF model produces better denoised images in terms of PSNR. Figure 7 shows an example where the Field of Experts model outperforms the GCRF model in terms of PSNR.

Qualitatively, the GCRF model does a much better job of preserving fine-scale texture. This comes at the expense of eliminating low-frequency noise in smooth portions of the image, which can be seen in Figure 7.

5.3. Iterative Estimation in GCRF Models

For large images, the overhead of constructing the matrices and solving the linear system may be prohibitive. In these situations, the denoised image can be obtained by minimizing Equation 6 iteratively. A significant advantage of the GCRF model is that denoising is accomplished by solving a linear system. This linear system can be solved iteratively using the conjugate gradient method. This method converges more quickly than standard steepest-descent [9].

Using the notation in [9], the basic steps of the conjugate gradient method for solving Ax = b are

d_0 = r_0 = b - A x_0   (22)

\alpha_i = \frac{r_i^T r_i}{d_i^T A d_i}   (23)

x_{i+1} = x_i + \alpha_i d_i   (24)

r_{i+1} = r_i - \alpha_i A d_i   (25)

\beta_{i+1} = \frac{r_{i+1}^T r_{i+1}}{r_i^T r_i}   (26)

d_{i+1} = r_{i+1} + \beta_{i+1} d_i   (27)

When using this method for the GCRF model, the matrix A is replaced with (F^T W(O;θ) F). Each of the matrices in F corresponds to computing a convolution, so matrix-vector products such as d_i^T A d_i can be computed using convolutions. This makes each iteration of the conjugate gradient slightly more expensive than an iteration of steepest-descent.
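A minimal sketch of Equations 22-27 follows, with the product A d supplied as a callable so that it can be implemented with convolutions rather than an explicit matrix; the function name, stopping rule, and iteration count are illustrative:

```python
# Conjugate gradient for A x = b, following Eqs. 22-27. apply_A(d) should
# return (F^T W F) d, e.g. computed with convolutions.
import numpy as np

def conjugate_gradient(apply_A, b, x0, n_iter=300, tol=1e-10):
    x = x0.copy()
    r = b - apply_A(x)                 # Eq. 22
    d = r.copy()
    rs = r @ r
    for _ in range(n_iter):
        Ad = apply_A(d)
        alpha = rs / (d @ Ad)          # Eq. 23
        x = x + alpha * d              # Eq. 24
        r = r - alpha * Ad             # Eq. 25
        rs_new = r @ r
        if rs_new < tol:
            break
        beta = rs_new / rs             # Eq. 26
        d = r + beta * d               # Eq. 27
        rs = rs_new
    return x
```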

One advantage of the conjugate-gradient method is that there are no step-size parameters. We also found that it converged quickly. After 300 iterations of the conjugate gradient method, the average absolute difference, per pixel, between the iterative estimate and the estimate found with the direct matrix solver was 5.3851e-05. As Table 2 shows, using the iterative solver significantly reduces the amount of time needed to produce an estimate. Further improvements in speed may be possible by using a preconditioner suited for images, such as the preconditioner described in [11].

While the conjugate gradient method could be used to find estimates in the Field of Experts model, the non-linear conjugate gradient algorithm requires line searches, which would affect its overall performance.



Figure 4. (a) Original. (b) Noisy observation. (c) Result using FoE model, PSNR = 25.23. (d) Result using GCRF model, PSNR = 26.28. In this denoising task, the GCRF model produces a more accurate estimate of the true image than the Field of Experts (FoE) model in terms of PSNR. The GCRF model also does a better job of preserving the texture in the image.

Figure 5. (a) Original. (b) Noisy observation. (c) Result using FoE model, PSNR = 25.77. (d) Result using GCRF model, PSNR = 26.75. A second example where the GCRF model produces superior results in terms of PSNR. Again, the GCRF better preserves texture.

6. Conclusion

We have shown how a Gaussian Conditional Random Field (GCRF) model can outperform the non-convex Field of Experts model on the task of denoising images. The GCRF model produced more accurate estimates, in terms of PSNR, and required less computation to produce estimates. This is significant because of the advantages of the GCRF model:

1. It is convenient to work with. The GCRF model can be implemented using standard numerical linear algebra routines, which are available for most programming environments.

2. As Sections 3 and 5 showed, the model can be trained on relatively large images.

3. Any differentiable loss function on the estimated images can be used during training. This is important because the behavior of a system can be significantly influenced by the loss function used to train it. Allowing for different loss functions during training enables a system's behavior to be controlled more precisely.

Combined with the ability to perform as well as non-convex models, these advantages make the GCRF model an attractive model for low-level vision and image processing applications.



Figure 6. (a) Original. (b) Noisy observation. (c) Result from FoE, PSNR = 28.59. (d) Result from GCRF, PSNR = 28.68. In this comparison, the PSNR for the Field of Experts result is slightly lower than the PSNR reported in [8]. We found that as the Field of Experts code optimized the denoised images, the PSNR began to decrease as the log-posterior of the estimate continued to increase. However, the result in (c) is visually quite similar to that in [8].

Figure 7. (a) Original. (b) Noisy observation. (c) Result using FoE model, PSNR = 36.07. (d) Result using GCRF model, PSNR = 34.77. These images are an example of how the Field of Experts model can produce superior estimates in terms of PSNR. The FoE model is better at removing low-frequency noise from flat regions.

References

[1] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In The Twentieth International Conference on Machine Learning (ICML-2003), 2003.
[2] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222-1239, 2001.
[3] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In EMNLP 2002, 2002.
[4] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael. Learning low-level vision. International Journal of Computer Vision, 40(1):25-47, 2000.
[5] J. Lafferty, F. Pereira, and A. McCallum. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001.
[6] Y. LeCun and F. Huang. Loss functions for discriminative training of energy-based models. In Proc. of the 10th International Workshop on Artificial Intelligence and Statistics (AIStats'05), 2005.
[7] J. Portilla, V. Strela, M. Wainwright, and E. P. Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Transactions on Image Processing, 12(11):1338-1351, November 2003.
[8] S. Roth and M. Black. Field of experts: A framework for learning image priors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 860-867, 2005.
[9] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Available at http://www.cs.cmu.edu/~jrs/jrspapers.html.
[10] R. Szeliski. Bayesian modeling of uncertainty in low-level vision. International Journal of Computer Vision, 5(3):271-301, 1990.
[11] R. Szeliski. Locally adapted hierarchical basis preconditioning. ACM Transactions on Graphics, 25(3):1135-1143, August 2006.
[12] M. F. Tappen, E. H. Adelson, and W. T. Freeman. Estimating intrinsic component images using non-linear regression. In The Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 1992-1999, 2006.
[13] M. F. Tappen, B. C. Russell, and W. T. Freeman. Efficient graphical models for processing images. In Proceedings of the 2004 IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 673-680, 2004.
[14] B. Taskar, V. Chatalbashev, D. Koller, and C. Guestrin. Learning structured prediction models: A large margin approach. In The Twenty Second International Conference on Machine Learning (ICML-2005), 2005.
[15] B. Taskar, S. Lacoste-Julien, and M. Jordan. Structured prediction via the extragradient method. In Y. Weiss, B. Scholkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA, 2006.
