
Attentive Deep Stitching and Quality Assessment for 360◦ Omnidirectional Images


Jia Li, Senior Member, IEEE, Yifan Zhao, Weihua Ye, Kaiwen Yu, and Shiming Ge, Senior Member, IEEE

Abstract—360◦ omnidirectional images are very helpful in creating immersive multimedia contents, which creates a huge demand for their efficient generation and effective assessment. In this paper, we leverage an attentive idea to meet this demand by addressing two concerns: how to generate a good omnidirectional image in a fast and robust way, and what makes a good omnidirectional image for humans. To this end, we propose an attentive deep stitching approach to facilitate the efficient generation of omnidirectional images, which is composed of two modules. The low-resolution deformation module learns the deformation rules from dual-fisheye to omnidirectional images with joint implicit and explicit attention mechanisms, while the high-resolution recurrence module enhances the resolution of the stitching results under high-resolution guidance in a recurrent manner. In this way, the stitching approach can efficiently generate high-resolution omnidirectional images that are highly consistent with human immersive experiences. Beyond efficient generation, we further present an attention-driven omnidirectional image quality assessment (IQA) method which jointly evaluates global and local metrics. In particular, the local metric mainly focuses on the stitching region and the attention region that most strongly affect the Mean Opinion Score (MOS), leading to an evaluation consistent with human perception. To verify the effectiveness of our proposed assessment and stitching approaches, we construct a hybrid benchmark evaluation with 7 stitching models and 8 IQA metrics. Qualitative and quantitative experiments show that our stitching approach generates results preferable to those of state-of-the-art models at a 6× faster speed, and that the proposed quality assessment approach surpasses other methods by a large margin and is highly consistent with human subjective evaluations.


Index Terms—360◦ omnidirectional image, image quality assessment (IQA), attentive deep stitching.

Manuscript received April 21, 2019; revised September 3, 2019; accepted November 8, 2019. This work was supported in part by the National Natural Science Foundation of China under Grant 61922006, Grant 61532003, and Grant 61772513, and in part by the Beijing Nova Program Z181100006218063. The work of S. Ge was supported by the Youth Innovation Promotion Association, Chinese Academy of Sciences. The guest editor coordinating the review of this manuscript and approving it for publication was Prof. Patrick Le Callet. (Corresponding author: Jia Li.)

J. Li is with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing 100191, China, with the Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing 100191, China, and also with the Peng Cheng Laboratory, Shenzhen 518000, China (e-mail: [email protected]).

Y. Zhao, W. Ye, and K. Yu are with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing 100191, China (e-mail: [email protected]; [email protected]; [email protected]).

S. Ge is with the Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100095, China (e-mail: [email protected]).

I. INTRODUCTION

IN THE rapid development of virtual reality (VR) over the last decade, high-quality 360◦ omnidirectional images have played an increasingly important role in producing multimedia contents, which require natural immersion in real-world scenarios through head-mounted displays. Along with the boom of omnidirectional acquisition devices, there is a huge demand for efficient omnidirectional image generation and accurate quality assessment, which can further be of great use in biology [1], [2], medicine [3], modeling [4] and virtual reality [5]. To obtain high-quality omnidirectional images, tens of models have been proposed to stitch dual-fisheye images into 360◦ omnidirectional images. The large number of stitched images and stitching models further raises two important concerns: how to generate a good omnidirectional image in a fast and robust way, and what makes a good omnidirectional image for humans?

Regarding the first concern, beyond manual calibration methods [4], [6], there exist two main categories of automatic stitching methods for generating omnidirectional images: direct stitching and feature-based stitching. Direct stitching approaches [1], [7], [8] have the advantage that they make full use of the available image data and hence can provide accurate registration, but they require a close initialization. In contrast, feature-based stitching methods [3], [9]–[11] do not require a complicated initialization procedure; they usually detect invariant local features automatically and construct matching correspondences instead of relying on manual registration. Classical feature detectors [12], [13] usually perform well on conventional planar natural images but may lack invariant properties when handling dual-fisheye images, which can cause distortions or shape breakage in stitching regions.

Regarding the second concern, many image quality assessment (IQA) methods [14]–[18] have been proposed. Early IQA research mainly focuses on common daily images with photometric quality indexes such as MSE [19], PSNR [20] and SSIM [21]. With the development of Convolutional Neural Networks (CNNs), several representative models [16], [22], [23] built on deep features have been proposed. However, these models usually focus on photometric quality indexes such as blurring, noise and color distortion, which may not be suitable for omnidirectional images. Moreover, there are only a few works [24], [25] on the quality assessment of panoramic images. For example, Yang et al. [17] proposed a light-weight model to evaluate stitched panoramic images based on ghosting and structure inconsistency.

These metrics are designed for the stitching of normal 2-D planar images and thus cannot handle 360◦ omnidirectional images, which are generated from dual-fisheye images and exhibit large distortions and information loss in the stitching areas.

Fig. 1. Dual-fisheye image and its stitched 360◦ panoramic image with human attention (white) and stitching region focusing (blue). The immersive experience of the image is mainly affected by the human attention mechanism, and distortions are most likely to happen in specific stitching regions.

In summary, these general stitching and assessment approaches usually treat every pixel of the stitched image equally. However, the immersive experience of omnidirectional images is affected by two attentive cues: the attention region and the stitching region. As shown in Fig. 1, the attention region is where human gaze mainly focuses, while the stitching region in the middle is where distortions or shape breakage most likely happen. To this end, we address the efficient generation and effective assessment of 360◦ omnidirectional images with two human perception-driven approaches, namely attentive deep stitching (ADS) and attentive quality assessment (AQA).

ADS generates 360◦ omnidirectional images efficiently in a progressive manner and is composed of two main modules, equipped with implicit-attention and explicit-attention mechanisms, respectively. The first, low-resolution deformation module learns deformation features from the dual-fisheye image with multiple implicit-attention blocks. By combining the learned deformation features and the high-resolution dual-fisheye image, the second, high-resolution recurrence module assigns the deformation relationship under high-resolution pixel guidance. With this recurrent refinement scheme, a high-resolution omnidirectional image is obtained. At the end of both modules, an explicit attention map of human gaze is introduced to regularize the consistency between the stitching results and human subjective experience.

AQA is a novel full-reference quality assessment approach designed to evaluate the quality of stitched 360◦ omnidirectional images in accordance with human perception. Based on the accurate cross-reference omnidirectional image dataset [26], we propose a joint approach that combines local and global metrics, where the global metric mainly considers environmental differences such as color chromatism and the blind zone phenomenon. For the local metric, we develop an attentive sampling strategy that focuses on the attention region and the stitching region, the two regions that most strongly affect the stitching quality in terms of attentive frequency and stitching distortions, respectively. We adopt sparse reconstruction and appearance difference to represent the local metric, and finally use a linear learning procedure to match human subjective evaluations.

The contributions of this paper can be summarized as follows: 1) We propose a novel attentive deep stitching approach that generates high-resolution 360◦ omnidirectional images from dual-fisheye images in an end-to-end deep manner and runs at a 6× faster speed than the state-of-the-art methods. 2) We propose an attentive quality assessment approach that automatically assesses the stitching quality of 360◦ omnidirectional images and provides an evaluation more consistent with human perception. 3) Qualitative and quantitative experiments demonstrate the effectiveness of the proposed stitching approach and show that the proposed quality assessment approach is highly consistent with human subjective evaluations.

The rest of this paper is organized as follows: Section II reviews related works and Section III proposes the attentive deep stitching method. In Section IV, the attentive quality assessment approach for omnidirectional stitching is proposed. We further conduct quantitative and qualitative experiments in Section V and finally conclude this paper in Section VI.

II. LITERATURE REVIEW

A. Omnidirectional Image Stitching

Classical stitching models: Image stitching techniques have become a research hotspot with wide applications [7], [9], [27]. Classical stitching models, such as Stereoscopic Vision Projection (SVP) [28], Isometric Projection (IP) [29] and Equidistant Projection (EP) [30], have been widely used to generate 360◦ omnidirectional images automatically. Most modern digital cameras, including many mobile devices, now provide a panoramic mode.

Classical image stitching methods can be roughly divided into two categories: camera-calibration-based image stitching and keypoint-based image stitching. Recently, several works [31]–[33] have made progress in improving the traditional image stitching algorithm [11]. Charles et al. [34] address the problem that the use of a single registration often leads to errors, especially in scenes with significant depth variation or object motion. With the portability and low cost of dual-fisheye cameras, research [35], [36] on fisheye image stitching has become more and more applicable. Lo et al. [37] stitched the dual-fisheye image into a 360◦ panoramic image following four basic steps: right-angle transformation, feature extraction, mesh deformation and blending.


Deep convolutional models: At present, most convolutional neural networks preserve the image layout and only extract specific information. Alongside the high demand for daily images, the traditional pixel-drift problems, such as image stitching and fisheye image distortion correction [35], [38], also need further study. In recent years, a few researchers have attempted to solve the pixel-drift issue with convolutional networks. Yin et al. [39] proposed an end-to-end multi-context collaborative deep network for removing distortions from single fisheye images, which learns high-level semantics and low-level appearance features simultaneously to estimate the distortion parameters. Deng et al. [40] proposed a restricted stitching convolution for semantic segmentation, which can effectively model geometric transformations by learning the shapes of convolutional filters.

B. Image Quality Assessment

Many IQA methods [41] have been proposed in the past decades, which can be roughly grouped into three categories. Some pioneering works on IQA [15], [22], [42] address both traditional IQA and common panoramic stitching IQA. In this paper, we mainly focus on the quality assessment of omnidirectional images, which is a less-explored task with increasing demand.

Classical IQA metrics: Most recent IQA research focuses on no-reference image quality assessment (NR-IQA) [43]–[48] and full-reference image quality assessment (FR-IQA) [18], [19], [49]–[51]. NR-IQA does not need a specific reference image, which is convenient for various image assessment tasks. In FR-IQA, the assessed quality is compared with that of a reference image, for instance using MSE [19], PSNR [20] or SSIM [52]. Full-reference assessment methods can also be adopted for immersive stitching IQA to some extent. However, the assessment results may be inaccurate and sometimes even opposite to human visual judgments.

Learnable IQA metrics: Due to the rapid development of deep learning in recent years, various existing problems can achieve better results on the basis of deep learning approaches, and many researchers use deep learning models to evaluate daily images in the field of image quality evaluation. For these deep learning models [15], [23], the biggest problem is that there is no suitable large dataset for training. Kang et al. [23] targeted this lack of training data and proposed to train on 32 × 32 patches. On the one hand, this method increases the amount of data through simplified image processing; on the other hand, the discontinuity of the main structure in the image may lead to inaccuracy. Liu et al. [16] trained a Siamese network to rank images and learned the relationship between images by sharing network weights. Some researchers adopted convolutional sparse coding to locate specific distortions [53]–[55] and designed kernels to quantify the mixed effects of multiple distortion types in local regions.

Stitching IQA metrics: There is little research on the quality assessment of stitched images. For example, Yang et al. [17] addressed the ghosting and structural discontinuity in image stitching by using a perceptual geometric error metric and a local structure-guided metric; however, for immersive images, this evaluation method is not comprehensive enough to detect global color differences and blind zones. Huang et al. [56] proposed a quality evaluation of immersive images, mainly focusing on resolution and compression rather than on the quality of the stitching itself. In [53], the authors adopted convolutional sparse coding and compound feature selection, focusing on the stitching region, for stitched image assessment. Moreover, some subjective omnidirectional video quality assessment methods [57], [58] have been proposed for this less-explored task.

III. ATTENTIVE DEEP STITCHING

A. Overview

The classical omnidirectional stitching problem is a transformation based on optical refraction: the pixels of dual-fisheye images are mapped onto the stretched two-dimensional equirectangular plane. Commonly, transforming a dual-fisheye image x to an omnidirectional image y can be formally represented as:

y = T(x; \theta),   (1)

where T: \mathbb{R}^{W \times H} \rightarrow \mathbb{R}^{W \times H} is the transformation function and \theta denotes the parameters of the stitching model. However, the transformation relationship varies significantly across different images and cameras.

Instead of manual calibration or automatic registration, we advocate using deep convolutional neural networks to learn this transformation rather than hand-crafted designs. To this end, we propose an attentive deep stitching approach which efficiently solves the transformation T in two phases:

f = F(\bar{x}; \theta_F), \quad \theta_F \subset \theta_L,   (2)

y = H(f, x; \theta_H),   (3)

where F is the low-resolution deformation module and H is the high-resolution recurrence module, \bar{x} is the down-sampled version of the dual-fisheye image x, f are the learnable deformation features of \bar{x}, and \theta_{\{L,H\}} are the learnable parameters of the low-resolution deformation and high-resolution recurrence phases, respectively. \theta_F, a subset of \theta_L, is used to extract the deformation features.
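To make the two-phase decomposition of Eqs. (2)-(3) concrete, the following is a minimal Python sketch of the inference flow; the module names `deform_net` and `recur_net`, the low-resolution size, and the use of bilinear down-sampling are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F_t

def attentive_deep_stitching(x_hi, deform_net, recur_net, low_size=(256, 512)):
    """x_hi: high-resolution dual-fisheye tensor of shape (B, 3, H, W)."""
    # Phase 1 (Eq. 2): learn deformation features from a down-sampled input.
    x_lo = F_t.interpolate(x_hi, size=low_size, mode='bilinear', align_corners=False)
    f = deform_net(x_lo)            # deformation features / coarse panorama
    # Phase 2 (Eq. 3): refine with high-resolution pixel guidance.
    y = recur_net(f, x_hi)          # high-resolution omnidirectional image
    return y
```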

B. Low-Resolution Deformation

As shown in Fig. 2, the low-resolution deformation module aims to decode the transformation information F(·) in a learning procedure. As verified by many representative studies on panoramic attention, human gaze [59], [60] usually focuses on special regions which contain the most attractive information. Keeping these cues in mind, we develop an attention-based deformation learning process, which is jointly optimized with implicit-attention and explicit-attention mechanisms.

Considering the computation cost and stitching efficiency, we resort to the low-resolution image \bar{x} to learn the deformation information. Inspired by the successful U-Net [61] architecture, we develop a light-weight deformation module which consists of a contracting path and an expansive path.

Fig. 2. Framework of the proposed Attentive Deep Stitching (ADS). Our framework is mainly composed of two modules. The first, deformation module learns the transformation rules from dual-fisheye to omnidirectional images with the joint human-supervised explicit attention and implicit attention mechanism. The second, recurrence module utilizes the high-resolution fisheye image as guidance for the stitching results in a recurrent manner.

Fig. 3. Implicit-attention block. σ: element summation in the channel dimension followed by a softmax operation. ⊗: scalar product. ⊕: element-wise sum.

Based on this design, we add multiple implicit-attention blocks as transition layers to pass low-level attention to the high-level decoders, which attaches more importance to the attention region and can further alleviate gradient loss.

The detailed architecture of the implicit-attention block is shown in Fig. 3, where σ denotes the channel-wise summation and softmax operation. With the input feature P \in \mathbb{R}^{W \times H \times C}, this block can be formally represented as:

V = \tanh(w_v P + b_v), \quad K = \tanh(w_k P + b_k),   (4)

where w_k, b_k, w_v, b_v are learnable parameters and ⊗ denotes the scalar-product operation. After obtaining these transformed features, the final output S(P) with the implicit attention is formulated as:

S(P_{i,j,k}) = \frac{e^{M_{i,j}}}{\sum_{i,j} e^{M_{i,j}}} \otimes V_{i,j,k} + P_{i,j,k}, \quad M_{i,j} = \sum_{k} K_{i,j,k}.   (5)


After that, the outputs S(P) are passed from these implicit-attention blocks to the corresponding features in the decoder with a concatenation operation.
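The block in Fig. 3 and Eqs. (4)-(5) can be read as the following PyTorch-style sketch; realizing w_v, b_v and w_k, b_k as 1×1 convolutions and the exact tensor layout are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class ImplicitAttentionBlock(nn.Module):
    """Sketch of Eqs. (4)-(5): a spatial softmax mask built from K modulates V,
    with a residual connection back to the input feature P."""
    def __init__(self, channels):
        super().__init__()
        # w_v, b_v and w_k, b_k realized as 1x1 convolutions (an assumption).
        self.proj_v = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_k = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, p):                      # p: (B, C, H, W)
        v = torch.tanh(self.proj_v(p))         # Eq. (4), value branch
        k = torch.tanh(self.proj_k(p))         # Eq. (4), key branch
        m = k.sum(dim=1, keepdim=True)         # M_{i,j} = sum_k K_{i,j,k}
        attn = torch.softmax(m.flatten(2), dim=-1).view_as(m)  # spatial softmax
        return attn * v + p                    # Eq. (5): mask * value + residual
```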

In addition to the implicit-attention mechanism, explicit attention is generated by human gaze estimation, which provides reliable regional attention information. To this end, we resort to the state-of-the-art SALICON [62] approach to estimate human gaze. We train the gaze model on large daily-image datasets and fine-tune the network with omnidirectional image annotations.

With this reliably generated explicit attention, we develop a region-sensitive MSE loss L_L that attaches more importance to attention regions:

L_L(\theta_L) = \frac{1}{2N} \sum_{i=1}^{N} W_i \left\| F(\bar{x}_i; \theta_L) - \bar{y}_i \right\|^2, \quad W_i = \frac{1}{1 + e^{-A(\bar{y}_i)}},   (6)

where N is the total number of elements and F(\bar{x}_i; \theta_L) denotes the low-resolution prediction of the deformation module. More precisely, \theta_L parameterizes the whole deformation module up to its final prediction, while \theta_F is the subset used to obtain the transformation features. W_i, \bar{x}_i and \bar{y}_i are the i-th elements of the explicit-attention weight W, the low-resolution dual-fisheye image \bar{x} and the low-resolution omnidirectional image \bar{y}, respectively. A(\bar{y}_i) is the i-th attention value of the omnidirectional image \bar{y}, normalized to [0, 1].
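A minimal sketch of the region-sensitive loss in Eq. (6), assuming the attention map is already normalized to [0, 1] and spatially aligned with the prediction; tensor names are illustrative.

```python
import torch

def region_sensitive_mse(pred, target, attention):
    """Eq. (6): MSE weighted by a sigmoid of the (normalized) attention map.
    pred, target: (B, C, H, W); attention: (B, 1, H, W) with values in [0, 1]."""
    w = 1.0 / (1.0 + torch.exp(-attention))          # W_i = 1 / (1 + e^{-A(y_i)})
    return 0.5 * (w * (pred - target) ** 2).mean()   # 1/(2N) * sum_i W_i ||.||^2
```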

Regularized with these two attention mechanisms, our deformation module performs d = 32 times down-sampling with chained max-pooling. In this manner, each pixel at the highest level is responsible for learning the transformation from k^2 × d^2 pixels of the original input, where k is the kernel size of the current feature map. Moreover, we adopt a chained max-pooling operation in each down-sampling step and a tanh(·) activation at the end of the network, which emphasize local extrema and largely preserve the transformation information. The tanh(·) function also accelerates the convergence of our network.

C. High-Resolution Recurrence

After the initial deformation, we develop a progressive high-resolution generation module that operates in a recurrent manner. As shown in Fig. 2, the high-resolution fisheye image passes through a feature encoder (shown in red) to obtain accurate pixel guidance. In parallel, the second-to-last feature map of the deformation module (the features before the output) is up-sampled to serve as deformation guidance. The high-resolution pixel guidance and the deformation guidance are then concatenated and fused with 1 × 1 convolutions. Finally, these fused features pass through an hourglass network without down-sampling operations to produce the higher-resolution output.

In this manner, the deformation branch provides the transformation regulation F and the high-resolution input provides the pixel-level guidance G. At the s-th stage, we concatenate the up-sampled transformation regulation F and the high-resolution guidance G and decode them with a high-resolution hourglass decoder \varphi_s(\cdot), which is composed of 8 convolutional layers with 3 × 3 kernels. The recurrence at the s-th stage can be formulated as:

H_s(x_s, x_{s+1}) = \begin{cases} \varphi_s\!\left([\,G(x_{s+1}),\, F(x_s; \theta_F)\,]\right), & \text{if } s = 1 \\ \varphi_s\!\left([\,G(x_{s+1}),\, H_{s-1}(\cdot)\,]\right), & \text{if } 1 < s \le S, \end{cases}   (7)

where x_s is the down-sampled input at the s-th scale and [\cdot,\cdot] denotes the feature concatenation operation followed by 1 × 1 convolutions. H_s(\cdot) is the output feature of the s-th iteration, and S is the maximum stage with the largest input resolution, which is set to 3 in our experiments. The first iteration adopts the deformation features from the first module, while the following iterations adopt the high-resolution features of the previous stage as guidance. At the end of the third iteration, the width and height of the recurrence output are fixed due to computational limitations. We recommend using this result as the final prediction considering time efficiency.

Our recurrence network follows a common stage-wise training process, and the loss at the s-th stage can be represented as:

L_H(\theta_H^s) = \frac{1}{2N} \sum_{i=1}^{N} W_i \left\| H(x_{i,s}, x_{i,s+1}; \theta_H^s) - y_{i,s+1} \right\|^2,   (8)

where s = 1, \ldots, S denotes the iteration stage and the weight W_i is generated by the explicit attention map as in Eq. (6). With recurrent iterations from low resolution to high resolution, a finer stitching result is obtained with an end-to-end inference.
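The recurrence of Eq. (7) can be read as the loop below, a hedged sketch in which `decoders`, `guide_encoder`, and `fuse` stand in for the stage decoders \varphi_s, the pixel-guidance encoder G, and the 1×1-convolution fusion after concatenation; the explicit up-sampling of the previous stage's features follows the description above, while the remaining details are assumptions.

```python
import torch
import torch.nn.functional as F_t

def high_resolution_recurrence(f_deform, x_pyramid, guide_encoder, fuse, decoders):
    """Eq. (7): refine stage by stage, using the next-scale fisheye input as pixel
    guidance. x_pyramid[s] is the dual-fisheye image at scale s (coarse to fine);
    len(decoders) equals the number of stages S."""
    h = f_deform                                     # stage 1 uses deformation features
    for s, decoder in enumerate(decoders):
        g = guide_encoder(x_pyramid[s + 1])          # pixel guidance G(x_{s+1})
        h_up = F_t.interpolate(h, size=g.shape[-2:], mode='bilinear',
                               align_corners=False)  # up-sample previous features
        h = decoder(fuse(torch.cat([g, h_up], dim=1)))  # concat + 1x1 conv, hourglass
    return h                                         # finest-stage output
```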

IV. ATTENTIVE QUALITY ASSESSMENT

Given these stitching models, the next concern is how to evaluate them on the stitched images. In this section, we propose a novel approach to evaluate the less-explored omnidirectional stitching task with joint local and global quality assessment metrics. Both the global and local indexes are full-reference metrics evaluated against the cross-reference ground truth. The local indexes mainly focus on the attention and stitching-seam regions, while the global ones mainly focus on the environmental immersion experience.

A. Local Attentive Assessment

The main distortions in omnidirectional images are most likely to occur in the regions near the stitching seam, while regions far from the seam usually contain fewer distortions. On the other hand, human gaze [63]–[65] usually focuses on regions with special patterns, which drives us not to treat every pixel equally. In particular, the stitched image and the ground-truth reference may differ by slight angular shifts, so pixel-wise alignment between them is unreliable.

Fig. 4. Illustration of local attentive sampling. First row: Gaussian-based stitching region sampling. Second row: attention-based region sampling.

To this end, we sample patches instead of performing per-pixel matching in both the stitched image and the ground-truth image. The stitching regions \Omega_{sti} are sampled with a Gaussian sigma criterion as in Eq. (9):

R_s(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \quad x \in \Omega_{sti},   (9)

where the region indicator \mu is set to 0.5 times the width of the stitching region. To sample more patches in the stitched images and reduce the effect of pixel shifting, we set \sigma for the stitching regions and the reference regions to 220 and 350, respectively. The patch size is set to 8 × 8 and patches are sampled using a sliding-window strategy. Moreover, the human-attention patches R_a(x) are sampled at the highest-brightness scores of the attention map, which is calculated by the same SALICON [62] model fine-tuned on omnidirectional images. The ground-truth patches D_a(x) and D_s(x) are obtained in the same way. Based on these attentively sampled patches, two local metrics are further proposed.
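A small NumPy sketch of the Gaussian-weighted sampling of Eq. (9); the image width, the seam position, and the conversion of weights into sampled patch locations are assumptions for illustration.

```python
import numpy as np

def gaussian_seam_weights(width, seam_center, sigma):
    """Eq. (9): weight each column by its Gaussian distance to the seam center."""
    x = np.arange(width)
    return np.exp(-(x - seam_center) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def sample_patch_centers(weights, num_patches, rng=np.random.default_rng(0)):
    """Draw patch-center columns with probability proportional to the seam weights."""
    p = weights / weights.sum()
    return rng.choice(len(weights), size=num_patches, p=p)

# Example: 2048-wide equirectangular image, seam near the middle, sigma = 220.
w = gaussian_seam_weights(width=2048, seam_center=1024, sigma=220)
centers = sample_patch_centers(w, num_patches=64)
```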

1) Sparse reconstruction: To robustly measure region similarity at various levels of detail, we adopt sparse reconstruction errors as a local metric. The foundation of this metric is that similar patches can represent each other with a sparse code of minimal length. To this end, an over-complete sparse dictionary D = \{D_a, D_s\} is constructed from the ground-truth patches, and the stitched patches are stacked as R = \{R_a, R_s\}. The minimal reconstruction code X^* is obtained by solving:

X^* = \arg\min_{X} \frac{1}{2} \| R - DX \|_F^2 + \lambda \| X \|_1.   (10)

This can be efficiently solved with Online Dictionary Learning [66]. With the optimized sparse representation X^*, we further apply an SVD decomposition and keep the principal components with F_{PCA}. The final score is evaluated with the L1-norm:

X^* = \sum_{i=1}^{r} U_i \Sigma_i V_i^T, \qquad E_{sparse} = -\sum_{i=1}^{r} \| F_{PCA}(\Sigma_i) \|_1,   (11)

where \Sigma_i contains the singular values that represent the sparse reconstructions.
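A hedged sketch of the sparse-reconstruction score of Eqs. (10)-(11) using scikit-learn's dictionary learning and sparse coding; the patch extraction, the PCA-style truncation of singular values, and the sign convention are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

def sparse_reconstruction_score(gt_patches, stitched_patches, n_atoms=64, lam=0.1, r=8):
    """gt_patches, stitched_patches: arrays of flattened patches, shape (n, d).
    Learns a dictionary D from ground-truth patches, sparse-codes the stitched
    patches (Eq. 10), then scores the leading singular values of the code (Eq. 11)."""
    dico = DictionaryLearning(n_components=n_atoms, alpha=lam, max_iter=100, random_state=0)
    D = dico.fit(gt_patches).components_                  # dictionary atoms
    X = sparse_encode(stitched_patches, D, alpha=lam)     # sparse codes X*
    s = np.linalg.svd(X, compute_uv=False)                # singular values of X*
    return -np.abs(s[:r]).sum()                           # negative L1 of top-r values
```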

2) Appearance similarity: To evaluate appearance similarity, we resort to the commonly used Gray-Level Co-occurrence Matrix (GLCM) [67] at angles of [45, 90, 135, 180] degrees to extract texture features. With these features, we adopt a histogram calculation to measure the texture similarity between the sampled patches. In this manner, we calculate the cosine similarity between the divided bins as in Eq. (12):

E_{app} = \sum_{d=1}^{4} \sum_{i}^{n} \sum_{j}^{n} \cos(h_i, h_j)\, \| h_i \|_F^2 \, \| h_j \|_F^2,   (12)

where h_i is the i-th histogram bin of the GLCM with n = 10 divisions and d indexes the four angles of the GLCM.
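A rough sketch of the appearance-similarity idea behind Eq. (12): a toy gray-level co-occurrence count at one offset, followed by a cosine comparison of binned histograms. The offset, bin count, and aggregation are simplified assumptions rather than the paper's exact configuration.

```python
import numpy as np

def glcm_histogram(patch, levels=16, offset=(0, 1), bins=10):
    """Quantize a grayscale patch, count co-occurring gray-level pairs at the given
    offset, and return a normalized histogram of the co-occurrence values."""
    q = np.floor(patch.astype(np.float64) / 256.0 * levels).astype(int).clip(0, levels - 1)
    dy, dx = offset
    a = q[max(0, -dy):q.shape[0] - max(0, dy), max(0, -dx):q.shape[1] - max(0, dx)]
    b = q[max(0, dy):, max(0, dx):][:a.shape[0], :a.shape[1]]
    glcm = np.zeros((levels, levels))
    np.add.at(glcm, (a.ravel(), b.ravel()), 1)            # accumulate co-occurrences
    hist, _ = np.histogram(glcm.ravel(), bins=bins)
    return hist / max(hist.sum(), 1)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Example: compare the texture statistics of a stitched patch and a reference patch.
rng = np.random.default_rng(0)
p1, p2 = rng.integers(0, 256, (8, 8)), rng.integers(0, 256, (8, 8))
score = cosine(glcm_histogram(p1), glcm_histogram(p2))
```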

B. Global Environmental Assessment

To evaluate the environmental immersion experience, we further propose two global metrics that assess the stitched images as a whole, summarized as follows:

1) Color chromatism: Most stitching methods adjust some optical parameters to match the two images, which can introduce chromatic aberrations. To evaluate the point-wise color difference, we adopt SIFT [68] matching to find pixel correspondences between the stitched image and the ground-truth image. For each matched pair of points, we apply a K-nearest-neighbor check to eliminate mismatches, and we denote by S and G the sampled patches of the stitching regions and the referenced ground-truth regions, respectively. The SIFT matching procedure can be formulated as:

G^* = \arg\min_{G} \| S - H_{sift}(G_i) \|_F^2, \quad i = 1, \ldots, K.   (13)

After matching each S with its nearest G, the color chromatism score can be calculated as:

E_{color} = -\sum_{i=1}^{M} \sum_{k=1}^{C} \frac{\lambda \| S_{ik} - G^*_{ik} \|_F^2}{M \times C},   (14)

where M is the number of corresponding pairs and C is the number of channels. \lambda is set to 100 to balance the final score.
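A hedged sketch of the color-chromatism score of Eq. (14) using OpenCV's SIFT; a standard ratio test stands in for the K-nearest-neighbor mismatch filtering described above, and comparing the colors at matched keypoints (rather than full patches) is an illustrative simplification.

```python
import cv2
import numpy as np

def color_chromatism(stitched_bgr, reference_bgr, lam=100.0, ratio=0.75):
    """Eq. (14): negative mean (scaled) color difference over SIFT-matched points."""
    sift = cv2.SIFT_create()
    gray_s = cv2.cvtColor(stitched_bgr, cv2.COLOR_BGR2GRAY)
    gray_r = cv2.cvtColor(reference_bgr, cv2.COLOR_BGR2GRAY)
    kp_s, des_s = sift.detectAndCompute(gray_s, None)
    kp_r, des_r = sift.detectAndCompute(gray_r, None)
    matches = cv2.BFMatcher().knnMatch(des_s, des_r, k=2)
    good = [p[0] for p in matches if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    if not good:
        return 0.0
    diffs = []
    for m in good:
        xs, ys = map(int, kp_s[m.queryIdx].pt)
        xr, yr = map(int, kp_r[m.trainIdx].pt)
        d = stitched_bgr[ys, xs].astype(np.float64) - reference_bgr[yr, xr].astype(np.float64)
        diffs.append(lam * (d ** 2))           # per-channel squared difference
    diffs = np.stack(diffs)                    # shape (M, C)
    return -diffs.sum() / diffs.size           # -sum / (M * C)
```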

2) Blind zone: Blind zones are blank areas caused by information loss during the transformation process, which harm the visual comfort of immersive experiences. To accurately measure the impact of blind zones, we propose an attention-weighted blind-zone evaluation metric. We adopt the same SALICON model [62] to calculate the attentive regions of the ground truth: the attention map w^b used for the blind zone is generated from the ground-truth omnidirectional image, and the values of w^b_i are normalized to [0, 1]. The score E_{blind} can be calculated as:

E_{blind} = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{B_i}{1 + e^{-w^b_i}},   (15)

where B_i denotes the i-th pixel of the blind-zone mask, which is set to 1 when the pixel lies in a blind zone, and N denotes the number of pixels in the stitched image. The mask B can be simply computed from the continuous blank (zero-valued) areas at the top and bottom of the stitching region. If attentive regions with important content are missing, E_{blind} yields lower scores.
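A compact NumPy sketch of Eq. (15), assuming the blind-zone mask is detected as exactly-zero pixels and that the attention map is already normalized to [0, 1]; both assumptions simplify the procedure described above.

```python
import numpy as np

def blind_zone_score(stitched_gray, attention):
    """Eq. (15): penalize blank (blind-zone) pixels, weighted by ground-truth attention.
    stitched_gray, attention: 2-D arrays of the same shape; attention in [0, 1]."""
    blind_mask = (stitched_gray == 0).astype(np.float64)     # B_i
    weights = 1.0 / (1.0 + np.exp(-attention))               # 1 / (1 + e^{-w^b_i})
    return 1.0 - np.mean(blind_mask * weights)               # 1 - (1/N) * sum
```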

C. Joint Assessment With Human-Guided Classifier

With the two proposed local metrics and two global ones, we further introduce human subjective evaluations to supervise a linear classifier. We use the aggregated pair-wise evaluation scores, i.e., the Mean Opinion Scores (MOS) collected in the CROSS dataset [26], as the supervision target.

The aim of the classifier is to learn weights that make the combined metric consistent with human subjective assessment. To this end, we adopt multiple linear regression (MLR) [69], [70] to fit the human subjective ground truth. Stacking the proposed metrics into a vector x = \{E_{app}, E_{sparse}, E_{color}, E_{blind}\} and adopting the MOS scores as the ground truth M, the weight-balance parameters \beta can be learned by generalized least squares estimation, as shown in Eq. (16):

M = \beta \cdot x, \qquad \beta^* = \left( x^T \Omega^{-1} x \right)^{-1} x^T \Omega^{-1} M,   (16)

where \Omega is the covariance matrix of the residual error. Finally, the assessment score is calculated as:

R = \beta^* \cdot x,   (17)

which can be further used to rank different stitching results.
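A minimal NumPy sketch of the generalized-least-squares fit in Eqs. (16)-(17); here X is a matrix whose rows hold the four metric values of individual stitched images, and estimating the residual covariance Ω from an ordinary-least-squares pass is one common choice and an assumption rather than the paper's stated procedure.

```python
import numpy as np

def fit_gls_weights(X, mos):
    """X: (n_images, 4) metric matrix [E_app, E_sparse, E_color, E_blind];
    mos: (n_images,) mean opinion scores. Returns GLS weights beta*."""
    beta_ols, *_ = np.linalg.lstsq(X, mos, rcond=None)        # first pass (OLS)
    resid = mos - X @ beta_ols
    omega = np.diag(resid ** 2 + 1e-8)                        # diagonal residual covariance
    omega_inv = np.linalg.inv(omega)
    beta = np.linalg.solve(X.T @ omega_inv @ X, X.T @ omega_inv @ mos)  # Eq. (16)
    return beta

def assessment_score(x_metrics, beta):
    """Eq. (17): final quality score for one stitched image."""
    return float(np.dot(beta, x_metrics))
```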

V. EXPERIMENTS

A. Experimental Settings

1) Cross-Reference Dataset: We conduct our experiments on the CROSS dataset [26], which contains 292 fisheye images organized as quaternions. Each quaternion is composed of images captured at the standard quarters of 0, 90, 180 and 270 degrees. When two images in opposite directions are taken for stitching, the other two images provide high-quality ground-truth references. In this manner, a high-quality ground-truth stitching image is obtained.

To make a fair comparison with the state-of-the-art models, we randomly select 192 fisheye images from 12 different scenarios as the training set and use the remaining 100 images as the test set. Moreover, all dual-fisheye images and corresponding 360◦ panoramic images are mirrored horizontally and vertically to obtain four times the original number of images. To verify the robustness of our network, we only use the original and horizontally mirrored images during training, but add the vertically mirrored images at test time.

2) Evaluation Criterion: Our evaluation criterion is composed of two systems. The first system consists of the five most commonly used quality assessment evaluation metrics for matching the mean opinion scores (MOS) provided by [26], which were obtained from pair-wise ranking with 14,847 comparisons. The evaluation metrics include Cosine Similarity (CS), the Pearson Rank Correlation Coefficient (PRCC), the Spearman Rank Order Correlation Coefficient (SROCC), the Kendall Rank Correlation Coefficient (KRCC) and the Root Mean Square Error (RMSE). We also adopt the quality metric evaluation framework of [74] to assess our AQA approach. The detailed formulations of these evaluation metrics can be found in Table II. The MOS scores are stacked as vectors to calculate correlation similarities with the IQA methods.
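For reference, the five correlation criteria can be computed as in the following sketch with NumPy and SciPy; `pred` holds the scores of an IQA method and `mos` the corresponding mean opinion scores (illustrative variable names).

```python
import numpy as np
from scipy import stats

def correlation_criteria(pred, mos):
    pred, mos = np.asarray(pred, float), np.asarray(mos, float)
    cs = np.dot(pred, mos) / (np.linalg.norm(pred) * np.linalg.norm(mos))  # Cosine Similarity
    prcc, _ = stats.pearsonr(pred, mos)      # Pearson correlation
    srocc, _ = stats.spearmanr(pred, mos)    # Spearman rank-order correlation
    krcc, _ = stats.kendalltau(pred, mos)    # Kendall rank correlation
    rmse = float(np.sqrt(np.mean((pred - mos) ** 2)))
    return {"CS": cs, "PRCC": prcc, "SROCC": srocc, "KRCC": krcc, "RMSE": rmse}
```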

The second criterion system is adopted to evaluate the quality of the stitched images. Besides the proposed omnidirectional quality assessment, 9 widely used IQA methods are adopted for our benchmark, including the classical methods MSE [19], PSNR [20] and SSIM [49], the no-reference quality assessment metrics BRISQUE [43], NIQE [73], PIQE [14], CCF [47] and CEIQ [48], and a recent machine-learning-based method, CNNIQA [15].

3) Implementation Details: Our deep deformation model is built with 10 convolutional blocks (3 × 3, ReLU, batch norm), each followed by a 2 × 2 max-pooling in the encoder and an upsampling in the decoder. The high-resolution recurrence module is built with the same convolutional blocks and ends with a tanh function. The whole model is trained on a single NVIDIA GeForce GTX 1080 GPU and a single Intel i7-6700 CPU. The learning rates of the deformation module and the recurrence module both start at 0.001 and are halved when the validation loss reaches a plateau. The deformation module is trained at a resolution of 512 × 256. Owing to the limitation of GPU memory, we adopt 2048 × 1024 as the final high-resolution output and use 3 recurrence iterations to obtain the final output in our experiments. The deformation stage is trained for 30 k iterations and each recurrent stage is trained for 15 k iterations. The adopted attention model SALICON [62] is pre-trained on the SALICON fixation dataset [75], which contains 20,000 annotations of daily images [76]. We then fine-tune it on the fixation annotations of our training set for 10 k iterations with a learning rate of 1e-6. The ground-truth annotation follows the original dataset [75].
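For convenience, the training settings stated above can be collected in a small configuration dictionary; this is only a restatement of the reported hyperparameters, and the key names are our own.

```python
# Hyperparameters as reported in the text (key names are illustrative).
ADS_TRAINING_CONFIG = {
    "deformation": {"input_size": (512, 256), "lr": 1e-3, "iterations": 30_000},
    "recurrence": {"stages": 3, "output_size": (2048, 1024), "lr": 1e-3,
                   "iterations_per_stage": 15_000},
    "lr_schedule": "halve on validation-loss plateau",
    "attention_finetune": {"model": "SALICON", "lr": 1e-6, "iterations": 10_000},
}
```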

B. The Omnidirectional Stitching Benchmark

We construct our benchmark with 7 stitching methods, including the widely used Samsung Gear 360 [71], OpenSource [72], Stereoscopic Vision Projection (SVP) [28], Isometric Projection (IP) [29], Equidistant Projection (EP) [30] and a manual method (ManMethod), as well as our proposed Attentive Deep Stitching (ADS), which finally yields 1344 stitched images in total for comparison.

We first use the 7 compared IQA methods to evaluate the results of the 7 selected stitching models, which yields 49 scores in total, as shown in Table I. To compare these IQA indexes, we use the ranking order (shown in blue) to evaluate the methods. Under most of the indexes, our deep stitching method generates preferable results compared with the time-consuming or labor-intensive methods, and it ranks first under the full-reference IQA metrics, which demonstrates the quality of its stitching deformation. However, our method loses details under the no-reference IQA metrics such as BRISQUE [43], NIQE [73] and PIQE [14] because of the resolution limitation: most of the stitching methods operate at a resolution of 5792 × 2896, whereas our stitching result is generated by upsampling from 2048 × 1024, which may lose many details under the no-reference IQA methods. We will further expound on this in Section V-D.

The qualitative results are shown in Fig. 5: our model generates favorable results with a fast inference procedure. The proposed evaluation scores are shown in the top left corner (in blue), and the NIQE [73] and PIQE [14] scores are shown in green in the second and third rows, respectively. On the one hand, the proposed AQA method is sensitive to local distortions and global color chromatism, while NIQE [73] and PIQE [14] do not produce results that match the visual experience. On the other hand, our attentive deep stitching approach shows less breakage and color chromatism, while other methods show apparent stitching errors in the stitching regions. Moreover, benefiting from our network architecture, the blind zones in our results are also smaller than those of the state-of-the-art models.

TABLE I. JOINT BENCHMARKING OF 8 STITCHING MODELS WITH 10 IQA METRICS. ADS: ATTENTIVE DEEP STITCHING METHOD. AQA: PROPOSED ATTENTIVE QUALITY ASSESSMENT. THE ASSESSMENT RANKS OF THE 1ST AND 2ND PLACE IN EACH ROW ARE SHOWN IN BOLD AND UNDERLINED. ↑: THE HIGHER THE BETTER. ↓: THE LOWER THE BETTER.

Fig. 5. Benchmark evaluations. The proposed evaluation scores are shown in the top left corner (in blue). The NIQE [73] and PIQE [14] scores are shown in green in the second and third rows, respectively. Our proposed deep stitching method, with the second highest scores, shows fewer distortions and less color chromatism, especially in the attention regions in the second row.

C. Analysis of Image Quality Assessment

To further evaluate the effectiveness of the proposed quality assessment metric, we use the five commonly used criteria (CS, PRCC, SROCC, KRCC and RMSE) to evaluate our IQA scores, adopting the pair-wise ranking of 14,847 image comparisons as the MOS results. As shown in Table III, our evaluation surpasses the second best result, PIQE [14], by a large margin in CS, PRCC, SROCC and KRCC. Where our method does not lead, it remains comparable to PIQE, being only about 3% lower, while most of the classical metrics fail to handle this kind of image. Besides the evaluation criteria in Table II, we adopt the MOS evaluation framework of [74] to transform the subjective MOS scores into pair-wise significance. As shown in Table IV, our proposed AQA method also matches human subjective evaluations closely, with an AUC value of 0.965, and surpasses the state-of-the-art methods.

TABLE II. MATHEMATICAL FORMULATION OF THE 5 CORRELATION METRICS.

To verify the robustness of the proposed AQA algorithm, we split our dataset into different parts and test our algorithm under various scenarios. The results are shown in Table V, where our model shows robustness under the various conditions. Analyzed jointly with Table III, even the lowest results of our proposed algorithm still show superiority over the state-of-the-arts.

TABLE III. MOS SIMILARITIES OF 10 STATE-OF-THE-ART IQA METHODS WITH 5 CORRELATION CRITERIA. ↑: THE HIGHER THE BETTER. ↓: THE LOWER THE BETTER.

TABLE IV. MOS SIMILARITIES OF 8 STATE-OF-THE-ART IQA METHODS USING THE PAIR-WISE MOS FRAMEWORK [74].

TABLE V. COMPARISONS WITH HUMAN SUBJECTIVE EVALUATIONS. EVALUATION SCORES: TOTAL WINNING PROPORTIONS VIA PAIR-WISE COMPARISONS.

TABLE VI. MOS SIMILARITY EVALUATIONS WITH LOCAL INDICATORS. AQA (ALL): AQA METHOD WITH ALL LOCAL AND GLOBAL METRICS.

TABLE VII. ABLATION STUDY OF ADS WITH STATE-OF-THE-ART IQA METRICS.

From Table V, we can conclude that indoor scenes are more challenging than outdoor ones, mainly because of the varied objects and lighting changes. To further examine the lighting factor in our assessment metric, we divide the test images into two groups, natural light and non-natural light (e.g., indoor light, streetlight). Our assessment still faces a challenge in handling these conditions, with a sharp 17% drop in the sensitive PRCC and SROCC metrics, while obtaining acceptable results in the CS metric (3% lower).

We also evaluate each indicator individually by computing its CS similarity with the MOS scores; the results can be found in Table VI. With the single appearance-similarity indicator, the CS similarity with MOS reaches 0.811, which is higher than the classical MSE and SSIM indicators, and the single global color-chromatism indicator also shows a similarity of 0.803 with MOS. With the full-reference AQA algorithm, the full model reaches a performance of 0.948. This also verifies that our local and global modules are complementary and boost the performance together.

TABLE VIII. TIME COST OF ADS AND STATE-OF-THE-ART MODELS WITH THE SAME RESOLUTION OF 2048 × 1024 STITCHED OUTPUT.

We further evaluate the time efficiency and compare our method with state-of-the-art IQA approaches. The execution time of our method is evaluated on a single Intel i7-6700 CPU. For a single 2048 × 1024 image, our proposed AQA method costs 5.38 seconds and CCF [47] costs 6.98 seconds per image, while the classical SSIM and PSNR indexes, with lower performance, cost 0.604 s and 0.275 s per image, respectively.

D. Analysis of Attentive Deep Stitching

To evaluate the effectiveness of our proposed Attentive Deep Stitching (ADS), we compare its stitching time with four state-of-the-art automatic stitching methods. As shown in Table VIII, our stitching method runs 15 times faster than the classical Stereoscopic Vision Projection (SVP) [28], Isometric Projection (IP) [29] and Equidistant Projection (EP) [30], and, thanks to the two-stage lightweight design, over 6 times faster than the state-of-the-art OpenSource [72] toolbox.

To verify the design of our proposed network architecture, we conduct an ablation study on the proposed ADS. We randomly select half of the test set for this validation with state-of-the-art IQA metrics. As shown in Table VII, the first row, Progressive-1, denotes the first iterated output of the recurrence network. With 3 recurrence iterations, the stitching results boost the PSNR from 15.52 to 16.29, while our model without the attention mechanism generates much coarser results than the final model in the third row. Our proposed method is consistent with human subjective evaluations, while NIQE [73] is not sensitive to this kind of image quality. The proposed AQA is also sensitive to the image quality improvement, varying from 38.36 to 51.56 over the progressive iterations. The visualized ablation results of our attention mechanism can be found in Fig. 7, with the corresponding attention maps in the third column. Compared with the second column produced by our full model, the results without the attention mechanism lose many details in local regions, especially the attentive regions.

Fig. 6. Results at different iterative resolutions. The first row shows the stitching results at the iteration of 512 × 256. The second row shows the final results at a resolution of 2048 × 1024. The proposed progressive module provides increasing details, especially in the attention regions, e.g., the cars in the first row and the windows in the second and third rows.

Fig. 7. Visualized results of our ADS algorithm. ADS w/o Attention: ADS algorithm without the attention module. ADS-Final: the final results with the proposed attention mechanism.

The qualitative results are shown in Fig. 6. Comparing the first row, with fewer iterations, against the second row, the proposed progressive module provides increasing details, especially in the attention regions (e.g., the cars in the first row and the windows in the second and third rows). However, due to the limitation of GPU memory, obtaining results at higher resolutions is still challenging, which is the main limitation of our module.

TABLE IX. ADS WITH DIFFERENT ATTENTION METHODS. THE BEST PERFORMANCES ARE IN BOLD. ↑: THE HIGHER THE BETTER. ↓: THE LOWER THE BETTER. †: TRAINED WITH THE SALIENT 360 DATASET [78].

The last point we want to emphasize is the choice of attention algorithm. As shown in Table IX, we adopt three different ways of generating attention regions for the stitching algorithm. The first is the super-pixel based attention generation method [77] designed for 360◦ omnidirectional images; the second and third rows indicate the SALICON model trained with or without the omnidirectional benchmark [78] images. It can be easily concluded that the results with the deep SALICON model [62] show better performance than the classical method [77]. With more accurate saliency annotations in 360◦ scenarios, most of the IQA metrics show a slight performance boost.


VI. CONCLUSIONS

In this paper, we address two concerns of increasing importance for 360◦ omnidirectional images: how to generate a good omnidirectional image in a fast and robust way, and what makes a good omnidirectional image for humans. To address these two concerns, we develop two human perception-driven approaches for omnidirectional images, attentive deep stitching (ADS) and attentive quality assessment (AQA). First, our progressive attentive deep stitching model consists of two modules, the first learning the deformation information and the second progressively enhancing the resolution. To achieve this, we propose a joint implicit and explicit attention mechanism to make our results consistent with human subjective evaluations. Second, to accurately evaluate the stitching results, we develop a novel attentive quality assessment approach for 360◦ omnidirectional images, which consists of two locally sensitive metrics focusing on the human attention and stitching regions and two global ones focusing on environmental immersion. Qualitative and quantitative experiments show that our stitching approach generates results preferable to those of the state-of-the-arts at a 6× faster speed. Moreover, the proposed attentive quality assessment approach for omnidirectional images surpasses the state-of-the-art methods by a large margin and is highly consistent with human subjective evaluations.

REFERENCES

[1] J. Chalfoun et al., “MIST: Accurate and scalable microscopy image stitching tool with stage modeling and error minimization,” Sci. Rep., vol. 7, 2017, Art. no. 4988.
[2] E. A. Semenishchev, V. V. Voronin, V. I. Marchuk, and I. V. Tolstova, “Method for stitching microbial images using a neural network,” Proc. SPIE, vol. 10221, 2017, Art. no. 102210.
[3] D. Li, Q. He, C. Liu, and H. Yu, “Medical image stitching using parallel SIFT detection and transformation fitting by particle swarm optimization,” J. Med. Imag. Health Informat., vol. 7, no. 6, pp. 1139–1148, Oct. 2017.
[4] L. Barazzetti, M. Previtali, and F. Roncoroni, “3D modelling with the Samsung Gear 360,” ISPRS Int. Archives Photogramm., Remote Sens. Spatial Inf. Sci., vol. XLII-2/W3, pp. 85–90, 2017.
[5] D. Chapman and A. Deacon, “Panoramic imaging and virtual reality – filling the gaps between the lines,” ISPRS J. Photogramm. Remote Sens., vol. 53, no. 6, pp. 311–319, 1998.
[6] G. Payen de La Garanderie, A. Atapour Abarghouei, and T. P. Breckon, “Eliminating the blind spot: Adapting 3D object detection and monocular depth estimation to 360-degree panoramic imagery,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 789–807.
[7] T. Z. Xiang, G. S. Xia, X. Bai, and L. Zhang, “Image stitching by line-guided local warping with global similarity constraint,” Pattern Recognit., vol. 83, pp. 481–497, 2018.
[8] W. Ye, K. Yu, Y. Yu, and J. Li, “Logical stitching: A panoramic image stitching method based on color calibration box,” in Proc. IEEE Int. Conf. Signal Process., 2018, pp. 1139–1143.
[9] I. C. Lo, K. T. Shih, and H. H. Chen, “Image stitching for dual fisheye cameras,” in Proc. IEEE Int. Conf. Image Process., Oct. 2018, pp. 3164–3168.
[10] T. Ho and M. Budagavi, “Dual-fisheye lens stitching for 360-degree imaging,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Mar. 2017, pp. 2172–2176.
[11] M. Brown and D. G. Lowe, “Automatic panoramic image stitching using invariant features,” Int. J. Comput. Vis., vol. 74, no. 1, pp. 59–73, 2007.
[12] P. C. Ng and S. Henikoff, “SIFT: Predicting amino acid changes that affect protein function,” Nucleic Acids Res., vol. 31, no. 13, pp. 3812–3814, 2003.
[13] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: Speeded up robust features,” in Proc. Eur. Conf. Comput. Vis., May 2006, pp. 404–417.
[14] V. N., P. D., M. C. Bh., S. S. Channappayya, and S. S. Medasani, “Blind image quality evaluation using perception based features,” in Proc. 21st Nat. Conf. Commun., Feb./Mar. 2015, pp. 1–6.
[15] L. Kang, P. Ye, Y. Li, and D. S. Doermann, “Convolutional neural networks for no-reference image quality assessment,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Columbus, OH, USA, Jun. 2014, pp. 1733–1740.
[16] X. Liu, J. van de Weijer, and A. D. Bagdanov, “RankIQA: Learning from rankings for no-reference image quality assessment,” in Proc. IEEE Int. Conf. Comput. Vis., Venice, Italy, Oct. 2017, pp. 1040–1049.
[17] G. Cheung, L. Yang, Z. Tan, and Z. Huang, “A content-aware metric for stitched panoramic image quality assessment,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops, Venice, Italy, Oct. 2017, pp. 2487–2494.
[18] Y. Niu, H. Zhang, W. Guo, and R. Ji, “Image quality assessment for color correction based on color contrast similarity and color value difference,” IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 4, pp. 849–862, Apr. 2018.
[19] W. Xue, L. Zhang, X. Mou, and A. C. Bovik, “Gradient magnitude similarity deviation: A highly efficient perceptual image quality index,” IEEE Trans. Image Process., vol. 23, no. 2, pp. 684–695, Feb. 2014.
[20] A. Tanchenko, “Visual-PSNR measure of image quality,” J. Visual Commun. Image Representation, vol. 25, no. 5, pp. 874–878, 2014.
[21] Z. Wang et al., “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[22] S. Bosse, D. Maniry, K.-R. Müller, T. Wiegand, and W. Samek, “Deep neural networks for no-reference and full-reference image quality assessment,” IEEE Trans. Image Process., vol. 27, no. 1, pp. 206–219, Jan. 2018.
[23] L. Kang, P. Ye, Y. Li, and D. S. Doermann, “Simultaneous estimation of image quality and distortion via multi-task convolutional neural networks,” in Proc. IEEE Int. Conf. Image Process., Quebec City, QC, Canada, Sep. 2015, pp. 2791–2795.
[24] G. Cheung, L. Yang, Z. Tan, and Z. Huang, “A content-aware metric for stitched panoramic image quality assessment,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops, Venice, Italy, Oct. 2017, pp. 2487–2494.
[25] H. Duan, G. Zhai, X. Min, Y. Zhu, Y. Fang, and X. Yang, “Perceptual quality assessment of omnidirectional images,” in Proc. IEEE Int. Symp. Circuits Syst., 2018, pp. 1–5.
[26] J. Li, K. Yu, Y. Zhao, Y. Zhang, and L. Xu, “Cross-reference stitching quality assessment for 360 omnidirectional images,” in Proc. 27th ACM Int. Conf. Multimedia, 2019, pp. 2360–2368.
[27] O. S. Vaidya and S. T. Gandhe, “The study of preprocessing and postprocessing techniques of image stitching,” in Proc. Int. Conf. Adv. Commun. Comput. Technol., Feb. 2018, pp. 431–435.
[28] B. Maneshgar, L. Sujir, S. P. Mudur, and C. Poullis, “A long-range vision system for projection mapping of stereoscopic content in outdoor areas,” in Proc. 12th Int. Joint Conf. Comput. Vis., Imag. Comput. Graph. Theory Appl., Porto, Portugal, Feb./Mar. 2017, pp. 290–297.
[29] D. Cai, X. He, and J. Han, “Isometric projection,” in Proc. 22nd AAAI Conf. Artif. Intell., Vancouver, BC, Canada, Jul. 2007, pp. 528–533.
[30] D. Schneider, E. Schwalbe, and H. G. Maas, “Validation of geometric models for fisheye lenses,” ISPRS J. Photogramm. Remote Sens., vol. 64, no. 3, pp. 259–266, 2009.
[31] W. Li, C.-B. Jin, M. Liu, H. Kim, and X. Cui, “Local similarity refinement of shape-preserved warping for parallax-tolerant image stitching,” Institution Eng. Technol., vol. 12, no. 5, pp. 661–668, 2017.
[32] T. Xiang, G.-S. Xia, and L. Zhang, “Image stitching with perspective-preserving warping,” 2016, arXiv:1605.05019.
[33] T. Ho, I. D. Schizas, K. Rao, and M. Budagavi, “360-degree video stitching for dual-fisheye lens cameras based on rigid moving least squares,” in Proc. IEEE Int. Conf. Image Process., 2017, pp. 51–55.
[34] C. Herrmann et al., “Robust image stitching with multiple registrations,” in Proc. Eur. Conf. Comput. Vis., 2018.
[35] L. Yao, Y. Lin, C. Zhu, and Z. Wang, “An effective dual-fisheye lens stitching method based on feature points,” in Proc. Int. Conf. Multimedia Model., Jan. 2019, pp. 665–677.
[36] J. Tan, G. Cheung, and R. Ma, “360-degree virtual-reality cameras for the masses,” IEEE Multimedia, vol. 25, no. 1, pp. 87–94, Jan./Mar. 2018.
[37] I.-C. Lo, K.-T. Shih, and H. H. Chen, “Image stitching for dual fisheye cameras,” in Proc. Int. Conf. Image Process., Oct. 2018, pp. 3164–3168.
[38] X. Li, Y. Pi, Y. Jia, Y. Yang, Z. Chen, and W. Hou, “Fisheye image rectification using spherical and digital distortion models,” Multispectral Image Acquisition Process. Anal., 2018, Art. no. 106070G.
[39] X. Yin, X. Wang, J. Yu, P. Fua, M. Zhang, and D. Tao, “FishEyeRecNet: A multi-context collaborative deep network for fisheye image rectification,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 469–484.


[40] L. Deng, M. Yang, H. Li, T. Li, B. Hu, and C. Wang, “Restricted deformable convolution based road scene semantic segmentation using surround view cameras,” 2018, arXiv:1801.00708.
[41] C. Li, M. Xu, S. Zhang, and P. L. Callet, “State-of-the-art in 360° video/image processing: Perception, assessment and compression,” 2019, arXiv:1905.00161.
[42] Y. Qian, D. Liao, and J. Zhou, “Manifold alignment based color transfer for multiview image stitching,” in Proc. IEEE Int. Conf. Image Process., Melbourne, Australia, Sep. 2013, pp. 1341–1345.
[43] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Trans. Image Process., vol. 21, no. 12, pp. 4695–4708, Dec. 2012.
[44] C. Richardt, Y. Pritch, H. Zimmer, and A. Sorkine-Hornung, “Megastereo: Constructing high-resolution stereo panoramas,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Portland, OR, USA, Jun. 2013, pp. 1256–1263.
[45] J. Zaragoza, T. Chin, Q. Tran, M. S. Brown, and D. Suter, “As-projective-as-possible image stitching with moving DLT,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1285–1298, Jul. 2014.
[46] P. Ye, J. Kumar, L. Kang, and D. Doermann, “Unsupervised feature learning framework for no-reference image quality assessment,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 1098–1105.
[47] Y. Wang et al., “An imaging-inspired no-reference underwater color image quality assessment metric,” Comput. Elect. Eng., vol. 70, pp. 904–913, 2018.
[48] Y. Fang, K. Ma, Z. Wang, W. Lin, Z. Fang, and G. Zhai, “No-reference quality assessment of contrast-distorted images based on natural scene statistics,” IEEE Signal Process. Lett., vol. 22, no. 7, pp. 838–842, Jul. 2015.
[49] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[50] Z. Wang and Q. Li, “Information content weighting for perceptual image quality assessment,” IEEE Trans. Image Process., vol. 20, no. 5, pp. 1185–1198, May 2011.
[51] L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: A feature similarity index for image quality assessment,” IEEE Trans. Image Process., vol. 20, no. 8, pp. 2378–2386, Aug. 2011.
[52] L. Liu, H. Dong, H. Huang, and A. C. Bovik, “No-reference image quality assessment in curvelet domain,” Signal Process. Image Commun., vol. 29, pp. 494–505, 2014.
[53] S. Ling, G. Cheung, and P. L. Callet, “No-reference quality assessment for stitched panoramic images using convolutional sparse coding and compound feature selection,” in Proc. IEEE Int. Conf. Multimedia Expo, San Diego, CA, USA, Jul. 2018, pp. 1–6.
[54] Y. Yuan, Q. Guo, and X. Lu, “Image quality assessment: A sparse learning way,” Neurocomputing, vol. 159, pp. 227–241, 2015.
[55] C. Zhang, J. Pan, S. Chen, T. Wang, and D. Sun, “No reference image quality assessment using sparse feature representation in two dimensions spatial correlation,” Neurocomputing, vol. 173, pp. 462–470, 2016.
[56] M. Huang et al., “Modeling the perceptual quality of immersive images rendered on head mounted displays: Resolution and compression,” IEEE Trans. Image Process., vol. 27, no. 12, pp. 6039–6050, Dec. 2018.
[57] M. Xu, C. Li, Z. Chen, Z. Wang, and Z. Guan, “Assessing visual quality of omnidirectional videos,” IEEE Trans. Circuits Syst. Video Technol., to be published.
[58] M. Xu, C. Li, Y. Liu, X. Deng, and J. Lu, “A subjective visual quality assessment method of panoramic videos,” in Proc. IEEE Int. Conf. Multimedia Expo, 2017, pp. 517–522.
[59] Y. Rai, P. L. Callet, and P. Guillotel, “Which saliency weighting for omni directional image quality assessment?” in Proc. 9th Int. Conf. Quality Multimedia Experience, Erfurt, Germany, May/Jun. 2017, pp. 1–6.
[60] E. Upenik, M. Rerábek, and T. Ebrahimi, “Testbed for subjective evaluation of omnidirectional visual content,” in Proc. Picture Coding Symp., Nuremberg, Germany, Dec. 2016, pp. 1–5.
[61] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput. Assisted Intervention, Oct. 2015, pp. 234–241.
[62] X. Huang, C. Shen, X. Boix, and Q. Zhao, “SALICON: Reducing the semantic gap in saliency prediction by adapting deep neural networks,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 262–270.
[63] M. Assens Reina, X. Giro-i-Nieto, K. McGuinness, and N. E. O’Connor, “SaltiNet: Scan-path prediction on 360 degree images using saliency volumes,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2331–2338.
[64] Y. Rai, J. Gutiérrez, and P. Le Callet, “A dataset of head and eye movements for 360 degree images,” in Proc. 8th ACM Multimedia Syst. Conf., 2017, pp. 205–210.
[65] M. Startsev and M. Dorr, “360-aware saliency estimation with conventional image saliency predictors,” Signal Process., Image Commun., vol. 69, pp. 43–52, 2018.
[66] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” J. Mach. Learn. Res., vol. 11, pp. 19–60, 2010.
[67] R. M. Haralick, K. Shanmugam, and I. Dinstein, “Textural features for image classification,” IEEE Trans. Syst., Man, Cybern., vol. SMC-3, no. 6, pp. 610–621, Nov. 1973.
[68] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proc. 7th IEEE Int. Conf. Comput. Vis., 1999, pp. 1150–1157.
[69] C. Li, A. C. Bovik, and X. Wu, “Blind image quality assessment using a general regression neural network,” IEEE Trans. Neural Netw., vol. 22, no. 5, pp. 793–799, May 2011.
[70] K. J. Preacher, P. J. Curran, and D. J. Bauer, “Computational tools for probing interactions in multiple linear regression, multilevel modeling, and latent curve analysis,” J. Educational Behav. Statist., vol. 31, no. 4, pp. 437–448, 2006.
[71] Samsung Group, “Samsung Gear 360 stitching toolbox,” 2017. [Online]. Available: https://www.samsung.com/global/galaxy/gear-360
[72] Github.com, “DualFisheye,” 2016. [Online]. Available: https://github.com/ooterness/DualFisheye
[73] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a ‘completely blind’ image quality analyzer,” IEEE Signal Process. Lett., vol. 20, no. 3, pp. 209–212, Mar. 2013.
[74] L. Krasula, K. Fliegel, P. Le Callet, and M. Klíma, “On the accuracy of objective image and video quality models: New methodology for performance evaluation,” in Proc. IEEE 8th Int. Conf. Quality Multimedia Experience, 2016, pp. 1–6.
[75] M. Jiang, S. Huang, J. Duan, and Q. Zhao, “SALICON: Saliency in context,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1072–1080.
[76] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis., Springer, 2014, pp. 740–755.
[77] Y. Fang, X. Zhang, and N. Imamoglu, “A novel superpixel-based saliency detection model for 360-degree images,” Signal Process., Image Commun., vol. 69, pp. 1–7, 2018.
[78] J. Gutiérrez, E. J. David, A. Coutrot, M. P. Da Silva, and P. Le Callet, “Introducing UN Salient360! benchmark: A platform for evaluating visual attention models for 360 contents,” in Proc. IEEE 10th Int. Conf. Quality Multimedia Experience, 2018, pp. 1–3.

Jia Li (M’12–SM’15) received the B.E. degree from Tsinghua University, Beijing, China, in 2005, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2011. He is currently an Associate Professor with the School of Computer Science and Engineering, Beihang University, Beijing, China. Before joining Beihang University in June 2014, he conducted research at Nanyang Technological University, Peking University, and Shanda Innovations. He is the author or coauthor of more than 70 technical articles in refereed journals and conferences such as TPAMI, IJCV, TIP, CVPR, and ICCV. His research interests include computer vision and multimedia big data, especially learning-based visual content understanding. He is a senior member of CIE and CCF. More information can be found at http://cvteam.net

Yifan Zhao received the B.E. degree from the Harbin Institute of Technology, Harbin, China, in Jul. 2016. He is currently working toward the Ph.D. degree with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China. His research interests include computer vision and image understanding.

Weihua Ye is currently working toward the M.S. degree with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China. His research interests include computer vision and virtual reality.

Kaiwen Yu is currently working toward the M.S. degree with the State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing, China. His research interests include image quality assessment and deep learning.

Shiming Ge (M’13–SM’15) received the B.S. and Ph.D. degrees in electronic engineering from the University of Science and Technology of China, Hefei, China, in 2003 and 2008, respectively. He is currently an Associate Professor with the Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China. He is also a member of the Youth Innovation Promotion Association, Chinese Academy of Sciences. Prior to that, he was a Senior Researcher and Project Manager at Shanda Innovations and a Researcher at Samsung Electronics and Nokia Research Center. His research mainly focuses on computer vision, data analysis, machine learning, and AI security, especially efficient learning models and solutions toward scalable applications.