Image Colorization with Deep Convolutional Neural Networks

Jeff Hwang [email protected]

You Zhou [email protected]

Abstract

We present a convolutional-neural-network-based system that faithfully colorizes black and white photographic images without direct human assistance. We explore various network architectures, objectives, color spaces, and problem formulations. The final classification-based model we build generates colorized images that are significantly more aesthetically pleasing than those created by the baseline regression-based model, demonstrating the viability of our methodology and revealing promising avenues for future work.

1. Introduction

Automated colorization of black and white images has been subject to much research within the computer vision and machine learning communities. Beyond simply being fascinating from an aesthetics and artificial intelligence perspective, such capability has broad practical applications ranging from video restoration to image enhancement for improved interpretability.

Here, we take a statistical-learning-driven approach towards solving this problem. We design and build a convolutional neural network (CNN) that accepts a black-and-white image as an input and generates a colorized version of the image as its output; Figure 1 shows an example of such a pair of input and output images. The system generates its output based solely on images it has “learned from” in the past, with no further human intervention.

In recent years, CNNs have emerged as the de facto standard for solving image classification problems, achieving error rates lower than 4% in the ImageNet challenge [12]. CNNs owe much of their success to their ability to learn and discern colors, patterns, and shapes within images and associate them with object classes. We believe that these characteristics naturally lend themselves well to colorizing images since object classes, patterns, and shapes generally correlate with color choice.

Figure 1. Sample input image (left) and output image (right).

2. Related work

Our project was inspired in part by Ryan Dahl’s CNN-based system for automatically colorizing images [2]. Dahl’s system relies on several ImageNet-trained layers from VGG16 [13], integrating them with an autoencoder-like system with residual connections that merge intermediate outputs produced by the encoding portion of the network comprising the VGG16 layers with those produced by the latter decoding portion of the network. The residual connections are inspired by those existing in the ResNet system built by He et al. that won the 2015 ImageNet challenge [5]. Since the connections link downstream network edges with upstream network edges, they purportedly allow for more rapid propagation of gradients through the system, which reduces training convergence time and enables training deeper networks more reliably. Indeed, Dahl reports much larger decreases in training loss on each training iteration with his most recent system compared with an earlier variant that did not utilize residual connections.

In terms of results, Dahl’s system performs extremely well in realistically colorizing foliage, skies, and skin. We, however, notice that in numerous cases, the images generated by the system are predominantly sepia-toned and muted in color. We note that Dahl formulates image colorization as a regression problem wherein the training objective to be minimized is a sum of Euclidean distances between each pixel’s blurred color channel values in the target image and predicted image. Although regression does seem to be well-suited to the task due to the continuous nature of color spaces, in practice, a classification-based approach may work better. To understand why, consider a pixel that


exists in a flower petal across multiple images that are identical, save for the color of the flower petals. Depending on the picture, this pixel can take on various tones of red, yellow, blue, and more. With a regression-based system that uses an ℓ2 loss function, the predicted pixel value that minimizes the loss for this particular pixel is the mean pixel value. Accordingly, the predicted pixel ends up being an unattractive, subdued mixture of the possible colors. Generalizing this scenario, we hypothesize that a regression-based system would tend to generate images that are desaturated and impure in color tonality, particularly for objects that take on many colors in the real world, which may explain the lack of punchiness in color in the sample images colorized by Dahl’s system.

3. Approach

We build a learning pipeline that comprises a neural network and an image pre-processing front-end.

3.1. General pipeline

During training time, our program reads images of pixel dimension 224 × 224 and 3 channels corresponding to red, green, and blue in the RGB color space. The images are converted to the CIELUV color space. The black and white luminance L channel is fed to the model as input. The U and V channels are extracted as the target values.

During test time, the model accepts a 224 × 224 × 1 black and white image. It generates two arrays, each of dimension 224 × 224 × 1, corresponding to the U and V channels of the CIELUV color space. The three channels are then concatenated together to form the CIELUV representation of the predicted image.
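The following is a minimal sketch of this pipeline, assuming scikit-image's rgb2luv/luv2rgb conversions and NumPy; the function and variable names are illustrative rather than taken from our actual implementation.

```python
import numpy as np
from skimage.color import rgb2luv, luv2rgb

def make_training_pair(rgb_image):
    """rgb_image: float array of shape (224, 224, 3) with values in [0, 1]."""
    luv = rgb2luv(rgb_image)             # convert RGB -> CIELUV
    L = luv[:, :, 0:1]                   # luminance channel: network input
    UV = luv[:, :, 1:3]                  # chrominance channels: training targets
    return L, UV

def assemble_prediction(L, U_pred, V_pred):
    """Recombine the input luminance with predicted chrominance at test time."""
    luv = np.concatenate([L, U_pred, V_pred], axis=2)  # (224, 224, 3) CIELUV image
    return luv2rgb(luv)                  # back to RGB for display
```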

3.2. Transfer learning

We initialized parts of our model with a VGG16 instance that has been pretrained on the ImageNet dataset. Since image subject matter often implies color palette, we reason that a network that has demonstrated prowess in discriminating amongst the many classes present in the ImageNet dataset would serve well as the basis for our network. This motivates our decision to apply transfer learning in this manner.

3.3. Activation function

We use the rectified linear unit as the nonlinearity that follows each of our convolutional and dense layers. Mathematically, the rectified linear unit is defined as

f(x) = max(0, x)

The rectified linear unit has been empirically shown to greatly accelerate training convergence [9]. Moreover, it is much simpler to compute than many other conventional activation functions. For these reasons, the rectified linear unit has become standard for convolutional neural networks.

Figure 2. Regression network schematic.

One downside of using the rectified linear unit as the activation function in a neural network is that the model parameters can be updated in such a way that the function’s active region is always in the zero-gradient section. In this scenario, subsequent backpropagated gradients will always be zero, hence rendering the corresponding neurons permanently inactive. In practice, this has not been an issue for us.

3.4. Batch normalization

Ioffe et al. introduced batch normalization as a means of dramatically reducing training convergence time and improving accuracy [7]. For our networks, we place a batch normalization layer before every non-linearity layer apart from the last few layers before the output. In our trials, we have found that doing so does improve the training rate of the systems.
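As an illustration, the sketch below shows how this placement might look in Lasagne, where batch_norm() wraps a layer so that normalization is applied before its nonlinearity; the filter counts here are placeholders rather than our actual configuration.

```python
from lasagne.layers import InputLayer, Conv2DLayer, batch_norm
from lasagne.nonlinearities import rectify

# Placeholder stack: convolution -> batch normalization -> ReLU.
net = InputLayer(shape=(None, 3, 224, 224))
net = batch_norm(Conv2DLayer(net, num_filters=64, filter_size=(3, 3),
                             pad='same', nonlinearity=rectify))
net = batch_norm(Conv2DLayer(net, num_filters=64, filter_size=(3, 3),
                             pad='same', nonlinearity=rectify))
```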

3.5. Baseline regression model

We used a regression-based model similar to the model described in [2] as our baseline. Figure 2 shows the structure of this baseline model.

We describe this architecture as comprising a “summarizing”, encoding process on the left side followed by a “creating”, decoding process on the right side.

The architecture of the leftmost column of layers is inherited from a portion of the VGG16 network. During this “summarizing” process, the size (height and width) of the feature map shrinks while the depth increases. As the model forwards its input deeper into the network, it learns a rich collection of higher-order abstract features.

The “creating” process on the right column is a modified version of the “residual encoder” structure described in [2]. Here, the network successively upscales the preceding layer output, merges the result with an intermediate output from the VGG16 layers via an elementwise sum, and performs a two-dimensional convolution on the result. The progressive, decoder-like upscaling of layers from an encoded representation of the input allows for the propagation of global spatial features to more-local image regions. This merging lets the network combine its more abstract concepts with its knowledge of more concrete features, so that the decoding process remains both creative and grounded in the structure of the input image.
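A rough Lasagne-style sketch of one such upscale-merge-convolve step is shown below; decoder_feature and vgg_feature stand in for the preceding decoder output and an intermediate VGG16 output, and the filter count is a placeholder. It assumes the two feature maps have matching shapes after the 2x upscaling.

```python
from lasagne.layers import Upscale2DLayer, ElemwiseSumLayer, Conv2DLayer

def residual_encoder_step(decoder_feature, vgg_feature, num_filters):
    """One decoding step: upscale, merge with a VGG16 intermediate, convolve.

    decoder_feature and vgg_feature are Lasagne layers whose output shapes
    are assumed to match after the 2x upscaling."""
    upscaled = Upscale2DLayer(decoder_feature, scale_factor=2)   # grow spatial size
    merged = ElemwiseSumLayer([upscaled, vgg_feature])           # residual-style merge
    return Conv2DLayer(merged, num_filters=num_filters,
                       filter_size=(3, 3), pad='same')           # refine merged features
```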

For the objective function in our system, we considered several loss functions. We began by using the vanilla ℓ2 loss function. Later, we moved on to deriving a loss function from the Huber penalty function, which is defined as

L(u) = \begin{cases} u^2 & |u| \le M \\ M(2|u| - M) & |u| > M \end{cases}

Intuitively, the function is defined piecewise in terms of a quadratic function and two affine functions. For residuals u whose magnitude is smaller than the threshold M, it follows the ℓ2 penalty function; for residuals whose magnitude is larger than M, it reverts to an ℓ1-like penalty. This feature of the Huber penalty function allows it to extract the best of both worlds between the ℓ2 and ℓ1 norms: it is more robust to outliers while still de-emphasizing points the system has already fit closely enough. For our particular use case, this behavior is ideal, since we expect there to be many outliers among the colors that correspond to a particular shape or pattern.
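A NumPy sketch of this penalty, with M as a free threshold parameter, might look as follows.

```python
import numpy as np

def huber_penalty(u, M=1.0):
    """Elementwise Huber penalty: quadratic for |u| <= M, affine beyond it."""
    u = np.asarray(u, dtype=np.float64)
    abs_u = np.abs(u)
    return np.where(abs_u <= M, u ** 2, M * (2.0 * abs_u - M))

# Example: small residuals are penalized quadratically, large ones linearly.
print(huber_penalty(np.array([-3.0, -0.5, 0.0, 0.5, 3.0]), M=1.0))
# -> [5.   0.25 0.   0.25 5.  ]
```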

3.6. Final classification model

Figure 3 depicts a schematic of our final classification model. The regression model suffers from a dimming problem because it minimizes some variant of the ℓp norm, which motivates the model to choose an average or intermediate color when multiple distinct color choices are possible. To address this issue, we remodeled our problem as a classification problem.

In order to perform classification on continuous data, we must discretize the domain. The targets U and V from the CIELUV color space take on values in the interval [−100, 100]. We implicitly discretize this space into 50 equi-width bins by applying a binning function (denoted bin()) to each input image prior to feeding it to the input of the network. The function returns an array of the same shape as the original image with each U and V value mapped to some value in the interval [0, 49]. Then, instead

Figure 3. Classification network schematic.

of directly predicting numeric values for U and V, the network outputs two separate sets of the most probable bin numbers for the pixels, one for each channel. We used the sum of the cross-entropy losses on the two channels as our minimization objective.
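As an illustration, a binning function of this kind could be written as below; the bin count and value range follow the description above, but the exact implementation details of our bin() are not reproduced here.

```python
import numpy as np

NUM_BINS = 50
VALUE_RANGE = (-100.0, 100.0)  # assumed range of the U and V channels

def bin_channel(values, num_bins=NUM_BINS, value_range=VALUE_RANGE):
    """Map continuous U or V values to integer bin indices in [0, num_bins - 1]."""
    lo, hi = value_range
    width = (hi - lo) / num_bins
    indices = np.floor((np.asarray(values) - lo) / width).astype(np.int64)
    return np.clip(indices, 0, num_bins - 1)   # guard against values exactly at hi
```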

In terms of the architecture, we introduced a concatenation layer concat, which is inspired by segmentation methods. Combining multiple intermediate feature maps in this fashion has been shown to increase prediction quality in segmentation problems, producing finer details and cleaner edges [4]. Although there is no explicit segmentation step in our setup, this approximate approach allows our system to minimize the amount of visual noise that is generated along object edges in the output image.

We experimented with placing various model structures between the concatenation layer and the output. In our final model, the concatenation layer is followed by three 3 × 3 convolutional layers, which are in turn followed by two final parallel 1 × 1 convolutional layers corresponding to the U and V channels. These 1 × 1 convolutional layers act as fully-connected layers to produce 50 class scores for each channel for each pixel of the image. The classes with the largest scores on each channel are then selected as the predicted bin numbers.
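A Lasagne-style sketch of this classification head follows; concat stands in for the concatenation layer described above, and the intermediate filter counts and output nonlinearity are placeholders rather than our actual settings.

```python
from lasagne.layers import Conv2DLayer
from lasagne.nonlinearities import rectify, linear

def classification_head(concat, num_bins=50):
    """Three 3x3 conv layers followed by two parallel 1x1 conv layers that
    emit per-pixel class scores for the U and V channels."""
    net = concat
    for num_filters in (128, 128, 128):        # placeholder depths
        net = Conv2DLayer(net, num_filters=num_filters, filter_size=(3, 3),
                          pad='same', nonlinearity=rectify)
    u_scores = Conv2DLayer(net, num_filters=num_bins, filter_size=(1, 1),
                           nonlinearity=linear)  # 50 class scores per pixel (U)
    v_scores = Conv2DLayer(net, num_filters=num_bins, filter_size=(1, 1),
                           nonlinearity=linear)  # 50 class scores per pixel (V)
    return u_scores, v_scores
```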


Dataset                  Training   Test
McGill                   896        150
MIT CVCL                 361        50
ILSVRC 2015 CLS-LOC      12486      494
MIRFLICKR                7500       1000

Table 1. Number of training and test images in datasets.

Figure 4. Sample images from the MIT CVCL Open Country dataset.

Via an un-binning function, we then convert the predicted bins back to numerical U and V values using the means of the selected bins.
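Under the same 50-bin assumption as the binning sketch above, the un-binning step can be written as:

```python
import numpy as np

NUM_BINS = 50
VALUE_RANGE = (-100.0, 100.0)

def unbin_channel(indices, num_bins=NUM_BINS, value_range=VALUE_RANGE):
    """Map predicted bin indices back to numeric U or V values (bin centers)."""
    lo, hi = value_range
    width = (hi - lo) / num_bins
    return lo + (np.asarray(indices) + 0.5) * width   # mean of each equi-width bin
```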

4. Dataset

We tested our system on several datasets; Table 1 provides a summary of the datasets we considered.

The MIT CVCL Urban and Natural Scene Categories dataset contains several thousand images partitioned into eight categories [10]. We experimented with 411 images in the “Open Country” category to measure our system’s ability to generate images pertaining to a specific class of images; Figure 4 shows some sample images from the dataset.

To gauge how well our system generalizes to diverse images, we experimented with larger datasets encompassing broader classes of photos. The McGill Calibrated Colour Image Database contains more than a thousand images of natural scenes organized by categories [11]. We chose to experiment with samples from each of the categories. The ILSVRC 2015 CLS-LOC dataset is the dataset used for the ImageNet challenge in 2015 [12]. We sampled images from the following categories: spatula, school bus, bear, book shelf, armor, kangaroo, spider, sweater, hair dryer, and bird. The MIRFLICKR dataset comprises 25000 Creative Commons images downloaded from the community photo-sharing website Flickr [6]. The images span a vast range of categories, artistic styles, and subject matter.

We preprocess each image in our dataset prior to forwarding it to our network. We scale each image to dimensions of 224 × 224 × 3 and generate a grayscale version of the image of dimensions 224 × 224 × 1. Since the input of our network is the input of the ImageNet-trained VGG16, which expects its input images to be zero-centered and of dimensions 224 × 224 × 3, we duplicate the grayscale image three times to form a (224 × 224 × 3)-sized image and subtract the mean R, G, and B value across all the pictures in the ImageNet dataset. The resulting final image serves as the black-and-white input image for the network.
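A sketch of this preprocessing, assuming scikit-image for resizing and grayscale conversion and the commonly cited ImageNet channel means (an assumption; the exact constants we used are not listed here), is shown below.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.transform import resize

# Commonly cited ImageNet per-channel means (R, G, B); assumed values.
IMAGENET_MEAN_RGB = np.array([123.68, 116.779, 103.939])

def prepare_vgg_input(rgb_image):
    """Build the zero-centered, 3-channel grayscale input expected by VGG16."""
    rgb = resize(rgb_image, (224, 224, 3))              # scale to 224 x 224 x 3
    gray = rgb2gray(rgb)                                # 224 x 224 luminance in [0, 1]
    gray3 = np.stack([gray, gray, gray], axis=2) * 255  # duplicate into 3 channels
    return gray3 - IMAGENET_MEAN_RGB                    # zero-center per channel
```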

5. Experiments

5.1. Evaluation metrics

For regression, we quantify the closeness of the generated image to the actual image as the sum of the squared ℓ2 norms of the differences between the generated image pixels and the actual image pixels in the U and V channels:

L_{\mathrm{reg}} = \|U_p - U_a\|_2^2 + \|V_p - V_a\|_2^2

Likewise, for classification, we measure the closeness of the generated image to the actual image by the percent of binned pixel values that match between the generated image and actual image for each channel U and V:

\mathrm{Acc.}_U = \frac{1}{N^2} \sum_{(i,j)}^{(N,N)} \mathbf{1}\{\mathrm{bin}(U_p)_{ij} = \mathrm{bin}(U_a)_{ij}\}

\mathrm{Acc.}_V = \frac{1}{N^2} \sum_{(i,j)}^{(N,N)} \mathbf{1}\{\mathrm{bin}(V_p)_{ij} = \mathrm{bin}(V_a)_{ij}\}

where bin : ℝ → {0, 1, ..., 49} is the color binning function described in Section 3.6. We emphasize that classification accuracy alone is not the ideal metric by which to judge our system, since the accuracy of color matching against target images does not directly relate to the aesthetic quality of an image. For example, for a still-life painting, it may be the case that virtually none of the colors actually match the corresponding real-life scene; nevertheless, the painting may still be regarded as artistically impressive. We report it as one possible measure, however, because we do believe that there exists some correlation between the two.

We can also apply these formulae to the regression results to compare with the classification results.

Finally, we track the percent deviation in average color saturation between pixels in the generated image and in the actual image:

\mathrm{Sat.\ diff.} = \frac{\left| \sum_{(i,j)}^{(N,N)} S_{p_{ij}} - \sum_{(i,j)}^{(N,N)} S_{a_{ij}} \right|}{\sum_{(i,j)}^{(N,N)} S_{a_{ij}}}

Generally speaking, the degree of color saturation in a given image strongly influences its aesthetic appeal. Ideally, then, the saturation levels present in the training images should be replicated at the system’s output, even when the exact hues and tones are not matched perfectly. This metric allows us to quantify the faithfulness of this replication.
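A NumPy sketch of these three metrics is given below; it reuses the bin_channel() helper from the earlier binning sketch and assumes the per-pixel saturation maps S are computed separately (for example, from an HSV conversion).

```python
import numpy as np

def regression_loss(U_p, U_a, V_p, V_a):
    """Sum of squared L2 norms of the U and V differences."""
    return np.sum((U_p - U_a) ** 2) + np.sum((V_p - V_a) ** 2)

def channel_accuracy(pred, actual):
    """Fraction of pixels whose predicted bin matches the target bin."""
    # bin_channel() is the 50-bin discretization sketched in Section 3.6.
    return np.mean(bin_channel(pred) == bin_channel(actual))

def saturation_difference(S_pred, S_actual):
    """Percent deviation in total (equivalently, average) saturation."""
    return np.abs(S_pred.sum() - S_actual.sum()) / S_actual.sum()
```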


5.2. Experiment setup and alternative structures

Our networks were implemented with Lasagne [1] and were trained on an AWS instance running an NVIDIA GRID K520 GPU.

We started by trying to overfit our model on a 270-image random subset of ImageNet data.

To determine a suitable learning rate, we ran multiple trials of training with minibatch updates to see which learning rate yielded faster convergence over a fixed number of iterations. Within the set of learning rates sampled on a logarithmic scale, we found that a learning rate of 0.001 achieved one of the largest per-iteration decreases in training loss as well as the lowest final training loss among the rates sampled. Using that as a starting point, we moved on to training with the entire training set. With a hold-out proportion of 10% as the validation set, we observed the fastest convergence with a learning rate of 0.0003.

We also experimented with different update rules, namely Adam [8] and Nesterov momentum [14]. For Adam, we followed the recommended β1 = 0.9 and tried β2 ∈ {0.99, 0.999}. For Nesterov momentum, we used a momentum of 0.9. Among these options, the Adam update rule with β1 = 0.9 and β2 = 0.999 produced slightly faster convergence than the others, so we used the Adam update rule with these hyperparameters for our final model.
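In Lasagne, wiring up these hyperparameters might look roughly as follows; train_loss and network are placeholders for the training objective (a Theano scalar expression) and the model's output layer.

```python
import lasagne

def build_updates(train_loss, network, learning_rate=0.0003):
    """Return Adam parameter updates for the given loss and output layer."""
    params = lasagne.layers.get_all_params(network, trainable=True)
    # Adam with the hyperparameters chosen for the final model.
    return lasagne.updates.adam(train_loss, params,
                                learning_rate=learning_rate,
                                beta1=0.9, beta2=0.999)
    # Alternative we compared against:
    # lasagne.updates.nesterov_momentum(train_loss, params,
    #                                   learning_rate=learning_rate, momentum=0.9)
```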

In terms of minibatch sizes, we experimented with batches of four, six, eight, and twelve images depending on the network architecture. Some alternative structures we tried required less memory, so we tested those with all four options. The model shown in Figure 3, however, is memory-intensive; due to limited access to computational resources, we were only able to test it with batch sizes of four and six on the GPU instance. Nevertheless, this model with a batch size of six demonstrated faster and more stable convergence than the other combinations.

For weight initialization, since our model uses the rectified linear unit as its activation function, we followed the Xavier initialization scheme proposed in [3] for our original trainable layers in the decoding, “creating” phase of the network.

We also developed several alternative network structures before we arrived at our final classification model. The following are some design elements and decisions we weighed:

1. Multilayer aggregation – elementwise sum versus concatenation: we experimented with performing layer aggregation using an elementwise sum layer in place of the concatenation layer. An elementwise sum layer reduces memory usage, but in our experiments, it turned out to harm training and prediction performance.

2. Presence or absence of residual encoder units: a residual encoder unit refers to a joint convolution-elementwise-sum step on a feature map in the “summarizing” process and an upscaled feature map in the “creating” process, as described in Section 3. We experimented with trimming away the residual encoder units and applying aggregation layers directly on top of the maxpooling layers inherited from VGG16. However, the capacity of the resulting model is much smaller, and it showed poorer quality of results when overfitting to the 300-image subset.

3. The final sequence of convolutional layers before the network output: we experimented with one and two convolutional layers with various depths, but the three-layer structure with the current choice of depths yielded the best results.

4. Color space: initially, we experimented with the HSV color space to address the under-saturation problem. In HSV, saturation is explicitly modeled as the individual S channel. Unfortunately, the results were not satisfying. The main issue lies in HSV's exact potential merit: since saturation is directly estimated by the model, any prediction error becomes extremely noticeable, making the images noisy.

5.3. Results and discussion

Figure 5 depicts two sets of regression and classification network outputs along with their associated black-and-white input images. The model that generated these images was trained on the MIT CVCL Open Country dataset.

Figure 5. Test set input images (left column), regression network output (center column), and classification network output (right column).

The regression network outputs are somewhat reasonable. Green tones are restricted to areas of the image with foliage, and there seems to be a slight amount of color tinting in the sky. We note, however, that the images are severely desaturated and generally unattractive. These results are expected given their similarity to Dahl’s sample outputs, and they are consistent with our hypothesis.


In contrast, the classification network outputs are amazingly colorful yet realistic. Colors are lively, nicely saturated, and generally tightly restricted to the regions they correspond to. For example, for the top image, the system managed to infer the reflection of the sky in the water and colorized both with bright swaths of blue. The foliage on the mountain is colorized with deep tones of green. Overall, the output is highly aesthetically pleasing.

Note, however, that there exists a noticeable amount of noise in the classification results, with blobs of various colors interspersed throughout the image. This may result from several factors. It may be the case that the system discretizes the U and V channels into bins that are too large, which means that image regions that contain color gradients may appear choppier. Also, the system performs classification on a per-pixel basis without explicitly accounting for the color values of surrounding pixels. Finally, it may simply be the case that a particular patch containing a shape or pattern has many possible color matches in the real world, and the system does not have the capacity to choose a particular color consistently. For instance, considering the second classification-network-generated image, we see that the brush in the foreground takes on varying shades of green, yellow, and brown. In the wild, grass does take on many colors depending on seasonality and geography. With the little context present in the corresponding black and white image, it can be difficult for even a human to discern the most probable color for the grass.

System          Acc. (U)   Acc. (V)   Sat. diff.
Classification  34.64%     24.14%     6.5%
Regression      12.92%     19.02%     85.8%

Table 2. Performances on the MIT CVCL Open Country test set.

Despite the aforementioned flaws in its output, all in all the classification system performs very well. Via our evaluation metrics, we can quantify the superiority of the classification network over the regression network for this particular dataset (Table 2). For the landscape data test set, using the classification network, we find that the U and V channel prediction accuracies are 34.64% and 24.14%, respectively, and that the average percent difference in saturation is 6.5%. In comparison, using the regression network on the same test set, we find that the U and V channel prediction accuracies are 12.92% and 19.02%, and that the average percent difference in saturation is a whopping 85.8%. Evidently, not only is the classification network able to correctly classify the color of a particular pixel more effectively, but it is also much more likely to predict colors that match the saturation levels of images in the training set.

In Figure 6, we present sample colored test images along with the black and white input.

Figure 6. Sample test set input and output from ImageNet.

The model generally yields more convincing images on nature themes. This is because the candidate colors for objects in nature are better defined than those for some man-made objects. For example, the sky might be blue, grey, or even pink at sunset, but it is almost never green. Chair cushions, on the other hand, may bear almost any color. The model needs to see more cushion images than sky images before it learns to color them convincingly. In the bottom-right image, for example, the table is colored partially red; since we did not sample the table class, there are not many images with tables in the training set from which the model could learn how to color table objects properly.

Figure 7. Sample outputs exhibiting color inconsistency issue.

The main challenge our model faces is inconsistency in colors within individual objects. Figure 7 shows two test input and output pairs that suffer from this issue. In the first example, the model colored parts of the sweatshirt red and other parts of it grey. As human beings, we can imagine a sweatshirt being red or grey as a whole. Our current system, on the other hand, makes one color prediction for each pixel in the hope that nearby pixels receive similar color assignments, which is not always the case. Even though small local regions are examined together given the nature of convolutional layers, there is no explicit enforcement of consistency at the object level. We experimented with applying Gaussian smoothing to the class scores to address this issue. This smoothing performed only slightly better and, unfortunately, introduced another issue: it significantly increased visual noise along object edges. Accordingly, we left the smoothing out of our final model.
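For reference, smoothing of this kind could be applied spatially to the per-pixel class-score volume as sketched below, assuming SciPy; the sigma values are illustrative and not the settings we tested.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smooth_class_scores(scores, sigma=1.0):
    """Spatially smooth a (height, width, num_bins) class-score volume.

    The filter is applied only along the two spatial axes (sigma 0 on the
    bin axis), so each color bin's score map is blurred independently."""
    return gaussian_filter(scores, sigma=(sigma, sigma, 0.0))

# Example: pick the smoothed argmax bin per pixel.
scores = np.random.rand(224, 224, 50)
predicted_bins = np.argmax(smooth_class_scores(scores, sigma=1.5), axis=2)
```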

Figure 8 shows two under-colored images. As we observed earlier, man-made objects with large intra-domain color variation are generally more challenging. The model was not able to give a good prediction for the sweater, most likely because of the wide range of possible color choices.


Figure 8. Under-colored output examples.

Upon close examination, we noticed that the model even painted part of the sweater slightly green. Similarly, for the crowd picture, the model did not provide much color to the non-white clothes.

6. Conclusion and future work

Through our experiments, we have demonstrated the efficacy and potential of using deep convolutional neural networks to colorize black and white images. In particular, we have empirically shown that formulating the task as a classification problem can yield colorized images that are arguably much more aesthetically pleasing than those generated by a baseline regression-based model, and that the approach thus shows much promise for further development.

Our work therefore lays a solid foundation for future work. Moving forward, we have identified several avenues for improving our current system. To address the issue of color inconsistency, we can consider incorporating segmentation to enforce uniformity in color within segments. We can also utilize post-processing schemes such as total variation minimization and conditional random fields to achieve a similar end. Finally, redesigning the system around an adversarial network may yield improved results, since instead of focusing on minimizing the cross-entropy loss on a per-pixel basis, the system would learn to generate pictures that compare well with real-world images. Based on the quality of results we have produced, the network we have designed and built would be a prime candidate for being the generator in such an adversarial network.

References

[1] Lasagne. https://github.com/Lasagne, 2015.
[2] R. Dahl. Automatic colorization. http://tinyclouds.org/colorize/, 2016.
[3] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.
[4] B. Hariharan, P. Arbelaez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 447–456, 2015.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
[6] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, pages 39–43. ACM, 2008.
[7] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[8] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[10] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3):145–175, 2001.
[11] A. Olmos et al. A biologically inspired algorithm for the recovery of shading and reflectance images. Perception, 33(12):1463–1473, 2004.
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[13] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[14] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 1139–1147, 2013.
