CONTENT-AWARE COMPRESSION USING SALIENCY-DRIVEN IMAGE RETARGETING · 2015-12-11 · Id is then...

5
CONTENT-AWARE COMPRESSION USINGSALIENCY-DRIVEN IMAGE RETARGETING Fabio Z¨ und *, Yael Pritch * , Alexander Sorkine-Hornung * , Stefan Mangold * , Thomas Gross * Disney Research Zurich ETH Zurich ABSTRACT In this paper we propose a novel method to compress video content based on image retargeting. First, a saliency map is extracted from the video frames either automatically or ac- cording to user input. Next, nonlinear image scaling is per- formed which assigns a higher pixel count to salient image regions and fewer pixels to non-salient regions. The non- linearly downscaled images can then be compressed using existing compression techniques and decoded and upscaled at the receiver. To this end we introduce a non-uniform an- tialiasing technique that significantly improves the image re- sampling quality. The overall process is complementary to existing compression methods and can be seamlessly incor- porated into existing pipelines. We compare our method to JPEG 2000 and H.264/AVC-10 and show that, at the cost of visual quality in non-salient image regions, our method achieves a significant improvement of the visual quality of salient image regions in terms of Structural Similarity (SSIM) and Peak Signal-to-Noise-Ratio (PSNR) quality measures, in particular for scenarios with high compression ratios. Index Termsvideo compression, image retargeting 1. INTRODUCTION A large amount of live video content is consumed on mo- bile devices, typically via streaming over cellular networks. As the computational power of streaming servers and mo- bile devices is constantly increasing, the critical bottleneck remains the limited wireless channel capacity per device, in particular when each mobile device receives its own individ- ual content independently (different camera views of events, individually selected replays, etc.). Wireless video stream- ing does not only rely on potentially high compression ra- tios but also demands high error-resilience of the transmitted data. Compression has been employed in all modern codecs as images contain significant statistical and visually subjec- tive redundancy, and videos exhibit even more redundancy in between frames. This observation is the starting point for numerous approaches to reduce the size of an image while maintaining image quality [1]. Advanced video codecs such as H.264/AVC-10, which allow for high compression while maintaining high video quality, however, exhibit a raised sen- sitivity to data errors [2]. We propose content-aware com- pression using saliency-driven image retargeting (CCSIR) to integrate saliency maps into a compression pipeline. This method uses content-aware image retargeting to allocate more pixels to the important part of the image in a continuous, non- uniform way (see Fig. 1). The retargeting is followed by a multi-resolution approach in which different bands of the im- age are compressed with different ratios, using existing com- pression algorithms. An overview of current state of the art saliency compuation algorithms can be found in [11]. Note that the computation times are in the order of a few milisec- onds. Hence, computing the saliency does not significantly increase the processing time of our suggested pipeline. A basic form of CCSIR are existing region-of-interest (ROI) coding techniques which prioritize specific regions in an image. The JPEG 2000 standard [3] encodes certain re- gions at higher quality than the background. In the general ROI method, the wavelet transform coefficients in the ROI are scaled (shifted) so that their bits lie in higher bit-planes than the bits associated with the background. During the entropy coding the higher bit-planes are given higher priority and, therefore, the background is encoded in lower quality as the ROIs [4, 5, 6]. Further, [7] presents a method for ROI en- coding in H.264/AVC similar to ROI encoding in JPEG 2000, and [8] presents an approach that blurs less salient regions using a foveation filter. With the high image frequencies removed, the non-salient regions can be compressed stronger. These ROI methods are, however, strictly tied to a spe- cific codec, whereas for CCSIR an arbitrary codec can be em- ployed. Furthermore, our approach supports saliency maps representing arbitrary shapes rather than rectangular ROIs only, and the retargeting algorithm can be configured to gen- erate smooth transitions from salient to non-salient regions, which is typically not the case for ROI based compression. Even though our experiments showed the best efficiency for a combination of CCSIR and JPEG 2000, our method is agnostic to the employed compression technique and hence complementary to existing approaches. 2. THE PROPOSED TECHNIQUE Fig. 1 depicts an overview of the compression pipeline. First, the input image is downscaled (retargeted) to a smaller reso-

Transcript of CONTENT-AWARE COMPRESSION USING SALIENCY-DRIVEN IMAGE RETARGETING · 2015-12-11 · Id is then...

Page 1: CONTENT-AWARE COMPRESSION USING SALIENCY-DRIVEN IMAGE RETARGETING · 2015-12-11 · Id is then encoded using JPEG 2000. To create the difference image D, we decode Id, upscale it

CONTENT-AWARE COMPRESSION USING SALIENCY-DRIVEN IMAGE RETARGETING

Fabio Zund*†, Yael Pritch*, Alexander Sorkine-Hornung*, Stefan Mangold*, Thomas Gross†

*Disney Research Zurich†ETH Zurich

ABSTRACT

In this paper we propose a novel method to compress videocontent based on image retargeting. First, a saliency map isextracted from the video frames either automatically or ac-cording to user input. Next, nonlinear image scaling is per-formed which assigns a higher pixel count to salient imageregions and fewer pixels to non-salient regions. The non-linearly downscaled images can then be compressed usingexisting compression techniques and decoded and upscaledat the receiver. To this end we introduce a non-uniform an-tialiasing technique that significantly improves the image re-sampling quality. The overall process is complementary toexisting compression methods and can be seamlessly incor-porated into existing pipelines. We compare our method toJPEG 2000 and H.264/AVC-10 and show that, at the costof visual quality in non-salient image regions, our methodachieves a significant improvement of the visual quality ofsalient image regions in terms of Structural Similarity (SSIM)and Peak Signal-to-Noise-Ratio (PSNR) quality measures, inparticular for scenarios with high compression ratios.

Index Terms— video compression, image retargeting

1. INTRODUCTION

A large amount of live video content is consumed on mo-bile devices, typically via streaming over cellular networks.As the computational power of streaming servers and mo-bile devices is constantly increasing, the critical bottleneckremains the limited wireless channel capacity per device, inparticular when each mobile device receives its own individ-ual content independently (different camera views of events,individually selected replays, etc.). Wireless video stream-ing does not only rely on potentially high compression ra-tios but also demands high error-resilience of the transmitteddata. Compression has been employed in all modern codecsas images contain significant statistical and visually subjec-tive redundancy, and videos exhibit even more redundancyin between frames. This observation is the starting point fornumerous approaches to reduce the size of an image whilemaintaining image quality [1]. Advanced video codecs suchas H.264/AVC-10, which allow for high compression whilemaintaining high video quality, however, exhibit a raised sen-

sitivity to data errors [2]. We propose content-aware com-pression using saliency-driven image retargeting (CCSIR) tointegrate saliency maps into a compression pipeline. Thismethod uses content-aware image retargeting to allocate morepixels to the important part of the image in a continuous, non-uniform way (see Fig. 1). The retargeting is followed by amulti-resolution approach in which different bands of the im-age are compressed with different ratios, using existing com-pression algorithms. An overview of current state of the artsaliency compuation algorithms can be found in [11]. Notethat the computation times are in the order of a few milisec-onds. Hence, computing the saliency does not significantlyincrease the processing time of our suggested pipeline.

A basic form of CCSIR are existing region-of-interest(ROI) coding techniques which prioritize specific regions inan image. The JPEG 2000 standard [3] encodes certain re-gions at higher quality than the background. In the generalROI method, the wavelet transform coefficients in the ROIare scaled (shifted) so that their bits lie in higher bit-planesthan the bits associated with the background. During theentropy coding the higher bit-planes are given higher priorityand, therefore, the background is encoded in lower quality asthe ROIs [4, 5, 6]. Further, [7] presents a method for ROI en-coding in H.264/AVC similar to ROI encoding in JPEG 2000,and [8] presents an approach that blurs less salient regionsusing a foveation filter. With the high image frequenciesremoved, the non-salient regions can be compressed stronger.

These ROI methods are, however, strictly tied to a spe-cific codec, whereas for CCSIR an arbitrary codec can be em-ployed. Furthermore, our approach supports saliency mapsrepresenting arbitrary shapes rather than rectangular ROIsonly, and the retargeting algorithm can be configured to gen-erate smooth transitions from salient to non-salient regions,which is typically not the case for ROI based compression.Even though our experiments showed the best efficiency fora combination of CCSIR and JPEG 2000, our method isagnostic to the employed compression technique and hencecomplementary to existing approaches.

2. THE PROPOSED TECHNIQUE

Fig. 1 depicts an overview of the compression pipeline. First,the input image is downscaled (retargeted) to a smaller reso-

Page 2: CONTENT-AWARE COMPRESSION USING SALIENCY-DRIVEN IMAGE RETARGETING · 2015-12-11 · Id is then encoded using JPEG 2000. To create the difference image D, we decode Id, upscale it

Non-uniform downscale

-

Grid Coord.

Encode

Encode

TRANSMIT

+

Non-uniform upscale

Decode

Grid Coord.

Decode

Decode

Non-uniform upscale

I

S

D D

Id

Id

Ir

Fig. 1: Pipeline architecture: The input image I is downscaled to Id according to the saliency image S. From Id and I , adifference image D is created. Images are encoded to J2K and streamed. To decode, Id is upscaled and D is added.

lution in a non-uniform way. Hence, more pixels are assignedto more salient areas of the image. While in principle anyretargeting method can be employed, the recent axis-alignedretargeting algorithm of [9] is computationally particularly ef-ficient and its warping technique is guaranteed an overlap-freebijective mapping. The scaling is based on saliency in a non-uniform way: Most of the reduction in resolution occurs innon-salient regions.

The downscaled image is then encoded with an arbitraryimage encoder (JPEG 2000 in our example). To compensatefor information loss that occurs during downscaling and en-coding, a difference image is calculated, i.e., we compute alaplacian image pyramid [10] with a single level. The dif-ference image contains the differences between the originalinput image and the downscaled and encoded image after it isupscaled back to its original size. It is then encoded as well.The total file size of the encoded image comprises the file sizeof the downscaled image, the difference image, and the set ofgrid coordinates, which is required to upscale the downscaledimage back to its original shape.

From the input image I , we create a saliency map S (e.g.,using [11]). Alternatively, in an interactive encoding system,the saliency map could be created by a user. When encod-ing the input image I , we create a set of three componentsENC(I)S = {Id,D,C}, where Id = downscale(I) repre-sents the downscaled image, D is a difference image, and C isa set of grid coordinates. C is calculated solely from S by theretargeting algorithm. All three components are transmittedto the receiver. On the receiver, we decode the componentsinto the reconstructed image Ir with DEC({Id,D,C}) = Ir.If D is losslessy compressed, then I = Ir holds, that is, theoriginal image is perfectly reconstructed. By adjusting thecompression level of Id and D, we can control the quality ofthe compressions.

2.1. Encoding

Id is calculated by applying an image retargeting algorithm tothe original image I , using S. Following [9], we overlay an

uniform grid over the input image and we compute an axis-aligned deformed grid, which is calculation from the desiredtarget scaling factor s and the saliency map S. A bicubic inter-polation on the image is performed according to the deformedgrid to scale the image down to the new resolution. The de-formed grid coordinates C are saved along with the down-scaled image Id. Id is then encoded using JPEG 2000. Tocreate the difference image D, we decode Id, upscale it againand calculate D, comprising all missing image content thatwas lost during the downscaling process as well as during theJPEG 2000 compression, i.e. it contains the JPEG 2000 com-pression artifacts. HFCR (high frequency compression ratio)denotes the JPEG 2000 compression ratio for D and LFCR(low frequency compression ratio) denotes the JPEG 2000compression ratio for Id.

2.2. Decoding

To restore the original image, we first decode the JPEG 2000encoded images Id and D and perform a bicubic interpolationon Id, according to C, to upscale. Finally, we add D: Ir =dec(D) + upscaleC(dec(Id)). Note that even if we choose tohighly compress Id, we can, with a losslessy compression ofD, perfectly reconstruct I .

2.3. Non-uniform Anti-Aliasing

Sampling theory dictates that, when subsampling a signal, theShannon-Nyquist sampling theorem must be satisfied to pre-vent aliasing. We can prevent violation of the sampling the-orem by applying an anti-aliasing filter, i.e., by attenuatingthe high frequency components. In CCSIR, as the image issubsampled in a non-uniform way, we need to apply a non-uniform low-pass filter. We employ the 6-tap cubic splineinterpolation filter as described in [12], that approximates theLanczos-3 kernel. Each axis is filtered independently. Forevery grid column, we calculate the scaling factor sxi , wherei ∈ [0, N ], and syj , where j ∈ [0,M ], respectively, basedon the deformed grid with N columns and M rows. We di-

Page 3: CONTENT-AWARE COMPRESSION USING SALIENCY-DRIVEN IMAGE RETARGETING · 2015-12-11 · Id is then encoded using JPEG 2000. To create the difference image D, we decode Id, upscale it

(a) Input image (b) Saliency image

(c) Uniform AA (d) Non-uniform AA

(e) Uniform AA (f) Non-uniform AA

Fig. 2: Different anti-aliasing (AA) methods. (c): downscaledinput image using uniform AA. (d): downscaled input imageusing non-uniform AA. (e): upscaled image (c). (f): upscaledimage (d). Notice the artifacts in (c) and (e) that are avoidedin (d) and (f).

vide the scaling factors into segments, which correspond toa certain mean scaling factor value. A segment is marked ifthe absolute delta scaling factor exceeds a threshold, i.e., if|∆sxi | > e, holds, where ∆sxi = sxi+1 − sxi .

In our implementation, a threshold value of e = 0.05 isused. For every segment, the mean scaling factor is calculatedand the 6-tap smoothing filter is applied on the correspond-ing part of the image. All parts are then linearly blendedtogether. Fig. 2 compares our approach with uniform anti-aliasing. For this illustration we manually created the saliencyimage in Fig. 2b. The uniformly anti-aliased images Fig. 2cand Fig. 2e exhibit aliasing artifacts towards the middle of thestar pattern and too much smoothing in the salient regions.The non-uniformly anti-aliased images Fig. 2d and Fig. 2fshow significantly reduced aliasing.

3. RESULTS

A prototype Matlab implementation that performs downscal-ing and upscaling as described in Section 2 was created to

evaluate CCSIR. The implementation relies on [9] for calcu-lating the deformed grid. The evaluation images are retar-geted with a high non-uniformity to differentiate our methodfrom the other methods. In practice, a smoother transitionfrom salient to non-salient regions could be targeted and thena moderate non-uniformity would suffice. To compare CC-SIR we selected a compression ratio for the respective refer-ence method so that the resulting file size equals our encodedimage file size as close as possible. Given the same file size,we calculate two image quality metrics, Peak Signal-to-NoiseRatio (PSNR) and Structural Similarity (SSIM) index [13],for all pixels in the images and for the pixels in the salientregions only. Fig. 3 compares PSNR and SSIM values of CC-SIR to JPEG 2000 compressed images for different scalingfactors s for two given LFCRs and a constant HFCR = 1200.The PSNR and SSIM values are averaged for a collection offive video clips.

The compression ranges from 0.037 bpp to 0.857 bpp withan average of 0.313 bpp. As expected, for both PSNR andSSIM, CCSIR performs slightly worse than JPEG 2000 onthe overall image (Fig. 3a and Fig. 3c). However, it performsbetter in the salient areas of the images for moderate scalingfactors. For large scaling factors s > 0.75, Id is relativelylarge and the high total number of bits in Id and D allowJPEG 2000 to compress with a relatively low compressionratio, and thus achieve higher quality.

0.2 0.4 0.6 0.8

26

28

30

32

34

36

Scale factor s

PS

NR

[dB

]

CCSIR, LFCR=20

CCSIR, LFCR=60

J2k, LFCR=20

J2k, LFCR=60

(a) PSNR of the whole image

0.2 0.4 0.6 0.8

44

46

48

50

52

54

Scale factor s

PS

NR

[dB

]

CCSIR, LFCR=20

CCSIR, LFCR=60

J2k, LFCR=20

J2k, LFCR=60

(b) PSNR of salient regions

0.2 0.4 0.6 0.8

0.65

0.7

0.75

0.8

0.85

0.9

0.95

Scale factor s

SS

IM

CCSIR, LFCR=20

CCSIR, LFCR=60

J2k, LFCR=20

J2k, LFCR=60

(c) SSIM of the whole image

0.2 0.4 0.6 0.8

0.9955

0.996

0.9965

0.997

0.9975

0.998

0.9985

Scale factor s

SS

IM

CCSIR, LFCR=20

CCSIR, LFCR=60

J2k, LFCR=20

J2k, LFCR=60

(d) SSIM of salient regions

Fig. 3: PSNR and SSIM overall and in salient regions only,for different scaling factors.

Fig. 4 provides a visual comparison to other known meth-ods. One representative frame from each of two 100 framesvideo clips (man and marathon) is shown. The salient re-gion is marked in red in the original images. For the saliency

Page 4: CONTENT-AWARE COMPRESSION USING SALIENCY-DRIVEN IMAGE RETARGETING · 2015-12-11 · Id is then encoded using JPEG 2000. To create the difference image D, we decode Id, upscale it

(a) Original images (b) CCSIR (c) JPEG 2000

(d) JPEG 2000 ROI (e) H.264 Baseline (f) H.264 High

Fig. 4: Comparison to other methods. The red squares in (a) indicate the salient areas. All videos have equal file sizes. Eachframe is compressed at approx. 0.07 bpp.

in the man images we automatically detect faces using [14].The video encoding parameters are: LFCR = 60, HFCR =1800, s = 1/3 for the man video and LFCR = 20, HFCR =3000, s = 1/3 for the marathon video. The frames werecompressed to approx. 15 KB (man) and to approx. 37 KB(marathon). The H.264 videos were encoded once using theHigh profile and once using the Baseline profile. The targetbit rate for the H.264 videos was empirically determined toachieve a total file size that is equal to the total file size of thevideo encoded by CCSIR and the JPEG 2000 encoded video.Both H.264 videos were encoded using ffmpeg/x264 [15].The JPEG 2000 ROI enabled video was encoded in PhotoshopCS 5. The important regions are well recognizable in the CC-SIR compressed frames. In the JPEG 2000 ROI frames, theregions are more precisely preserved. However, the transitionfrom salient to non-salient regions is clearly visible, which isundesirable. One main advantage of CCSIR over JPEG 2000ROI coding is that we can encode the image with an arbitrarysmooth transition from salient to non-salient regions. We con-ducted a preliminary user study with 20 participants to assessthe subjective quality of the five different versions of the manvideo. 75% of the participants rated (b) as the visually mostappealing video, the next runner-up, (c) attracted 15% of thevotes, followed by (d), (e), and (f). This result, although pre-

liminary and subject to further validation, is encouraging.The prototype implementation ran on an Intel i5 3.3 GHz

Linux PC with 4 GB of RAM. We believe an implementa-tion of CCSIR that benefits from current tuned software im-plementations or hardware accelerations for bicubic scaling,anti-aliasing, and JPEG 2000 encoding [16], could encode24 fps 1080p video in real-time.

4. CONCLUSION

We presented an approach to content-aware saliency-drivenvideo compression based on image retargeting. The approachexploits non-uniform anti-aliasing, which prevents aliasing inthe highly scaled regions while avoiding over-smoothing inother regions. One attractive feature is that this approach canbe easily incorporated into any existing compression pipeline.

Our method has most noticeable benefits if the sourcevideo contains a large amount of changes (e.g., due to ob-ject or camera motion) that result in many high frequencydetails. As the increasing interest for video content or com-peting video streams face the rim of capacity limits of wire-less channels, content-aware compression provides a path tomaintain quality in the critical regions while reducing storageand bandwidth demands.

Page 5: CONTENT-AWARE COMPRESSION USING SALIENCY-DRIVEN IMAGE RETARGETING · 2015-12-11 · Id is then encoded using JPEG 2000. To create the difference image D, we decode Id, upscale it

5. REFERENCES

[1] T. Sikora, “MPEG digital video-coding standards,” Sig-nal Processing Magazine, IEEE, vol. 14, no. 5, pp. 82–100, Sept. 1997.

[2] A. Kostuch, K. Gierssowski, and J. Wozniak, “Perfor-mance Analysis of Multicast Video Streaming in IEEE802.11 b/g/n Testbed Environment,” in Wireless andMobile Networking, Jozef Wozniak, Jerzy Konorski,Ryszard Katulski, and Andrzej Pach, Eds., vol. 308of IFIP Advances in Information and CommunicationTechnology, pp. 92–105. Springer Boston, 2009.

[3] C. Christopoulos, A. Skodras, and T. Ebrahimi, “TheJPEG2000 still image coding system: an overview,”Consumer Electronics, IEEE Transactions on, vol. 46,no. 4, pp. 1103–1127, Nov. 2000.

[4] L. Liu and G. Fan, “A new JPEG2000 region-of-interest image coding method: partial significant bit-planes shift,” Signal Processing Letters, IEEE, vol. 10,no. 2, pp. 35–38, 2003.

[5] E. Atsumi and N. Farvardin, “Lossy/lossless region-of-interest image coding based on set partitioning in hierar-chical trees,” in Image Processing, 1998. ICIP 98. Pro-ceedings. 1998 International Conference on, Oct. 1998,vol. 1, pp. 87 –91 vol.1.

[6] Z. Wang and A. C. Bovik, “Bitplane-by-bitplane shift(BbBShift) - A suggestion for JPEG2000 region of in-terest image coding,” Signal Processing Letters, IEEE,vol. 9, no. 5, pp. 160–162, May 2002.

[7] Y. Liu, Z. G. Li, and Y. C. Soh, “Region-of-InterestBased Resource Allocation for Conversational VideoCommunication of H.264/AVC,” IEEE Transactions onCircuits and Systems for Video Technology, vol. 18, no.1, pp. 134–139, Jan. 2008.

[8] L. Itti, “Automatic Foveation for Video CompressionUsing a Neurobiological Model of Visual Attention,”IEEE Transactions on Image Processing, vol. 13, no.10, pp. 1304–1318, Oct. 2004.

[9] D. Panozzo, O. Weber, and O. Sorkine, “Robust im-age retargeting via axis-aligned deformation,” Com-puter Graphics Forum (proceedings of EUROGRAPH-ICS), vol. 31, no. 2, pp. 229–236, 2012.

[10] P. Burt and E. Adelson, “The laplacian pyramid as acompact image code,” Communications, IEEE Transac-tions on, vol. 31, no. 4, pp. 532 – 540, apr 1983.

[11] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung,“Saliency filters: Contrast based filtering for salient re-gion detection,” in CVPR, 2012, pp. 733–740.

[12] Joint Video Team (JVT) of ISO/IEC MPEG & ITU-TVCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16Q.6), “Upsampling Filter Design with Cubic Splines,”Apr. 2006.

[13] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simon-celli, “Image quality assessment: from error visibilityto structural similarity,” Image Processing, IEEE Trans-actions on, vol. 13, no. 4, pp. 600–612, Apr. 2004.

[14] L. Wolf, T. Hassner, and Y. Taigman, “Effective Uncon-strained Face Recognition by Combining Multiple De-scriptors and Learned Background Statistics,” PatternAnalysis and Machine Intelligence, IEEE Transactionson, vol. 33, no. 10, pp. 1978–1990, 2011.

[15] “x264,” www.videolan.org/developers/x264.html,viewed 2013-02-05.

[16] “Kakadu JPEG 2000 SDK,” www.kakadusoftware.com,viewed 2013-02-05.