

SN Computer Science (2020) 1:139 https://doi.org/10.1007/s42979-020-00132-z


ORIGINAL RESEARCH

Benchmarking Deep Learning Models for Classification of Book Covers

Adriano Lucieri 1 · Huzaifa Sabir 1 · Shoaib Ahmed Siddiqui 1 · Syed Tahseen Raza Rizvi 1 · Brian Kenji Iwana 2 · Seiichi Uchida 2 · Andreas Dengel 1 · Sheraz Ahmed 1

Received: 20 January 2020 / Accepted: 30 March 2020 / Published online: 24 April 2020 © Springer Nature Singapore Pte Ltd 2020

Abstract
Book covers usually provide a good depiction of a book's content and its central idea. The classification of books into their respective genres usually involves subjectivity and contextuality. Book retrieval systems would greatly benefit from an automated framework that is able to classify a book's genre based on an image, specifically for archival documents where digitization of the complete book for the purpose of indexing is an expensive task. While various modalities are available (e.g., cover, title, author, abstract), benchmarking image-based classification systems that use minimal information is a particularly exciting field due to the recent advancements in the domain of image-based deep learning and its applicability. For that purpose, a natural question arises regarding the plausibility of solving the problem of book classification by only utilizing an image of its cover along with the current state-of-the-art deep learning models. To answer this question, this paper makes a three-fold contribution. First, the publicly available book cover dataset comprising 57k book covers belonging to 30 different categories is thoroughly analyzed and corrected. Second, it benchmarks the performance of a battery of state-of-the-art image classification models for the task of book cover classification. Third, it uses explicit attention mechanisms to identify the regions that the network focused on in order to make the prediction. All of our evaluations were performed on a subset of the mentioned public book cover dataset. Analysis of the results revealed the inefficacy of the most powerful models for solving the classification task. With the obtained results, it is evident that significant efforts need to be devoted in order to solve this image-based classification task to a satisfactory level.

Keywords CNN · Book cover · Book cover classification

Introduction

Books have been the most prevalent medium for imparting knowledge for the past few centuries. Book covers provide the first impression of a book's content, subject and its central idea. This information is depicted by a combination of visual and textual information [14].

Adriano Lucieri, Huzaifa Sabir, Shoaib Ahmed Siddiqui and Syed Tahseen Raza Rizvi have contributed equally to this work.

This article is part of the topical collection "Document Analysis and Recognition" guest edited by Michael Blumenstein, Seiichi Uchida and Cheng-Lin Liu.

The source code and the models are available at https://github.com/adriano-lucieri/BookCoverClassification.

* Adriano Lucieri [email protected]

Shoaib Ahmed Siddiqui [email protected]

Syed Tahseen Raza Rizvi [email protected]

Brian Kenji Iwana [email protected]

Seiichi Uchida [email protected]

Andreas Dengel [email protected]

Sheraz Ahmed [email protected]

1 German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany

2 Department of Advanced Information Technology, Kyushu University, Fukuoka, Japan


However, this visual interpretation is subjective and varies from person to person, depending upon his/her background and perspective. This makes the interpretation of book covers based on just the visual content extremely challenging, even for humans. Figure 1 shows some randomly selected book covers after blurring out the text. The lack of textual information makes it hard to guess the correct category for these covers. It is worth mentioning that these books are visually descriptive. Having a very abstract or plain background is also very common in book covers, making the task almost impossible to solve without textual aid [16]. Therefore, it is particularly interesting to analyze the efficacy of state-of-the-art image classification models for the identification of book cover genres based on just an image of the book cover. Having the ability to automatically categorize and classify book covers without explicit human intervention could significantly improve the performance of current generation book retrieval systems. Relying on only the book cover image is a significantly harder problem compared to explicitly taking the whole textual content into account, while being better suited for end-users.

Deep learning has been applied to a wide variety of problems since its resurgence in 2012, when Krizhevsky et al. [21] were able to reduce the error rate to half on a standard image classification benchmark challenge, comprising millions of images [33], by just employing a deep model. These applications include image classification [9, 21, 37], image synthesis [8], image captioning [2], semantic segmentation [5], voice recognition [6], audio synthesis [39], document classification and understanding [1, 40], as well as playing Atari games [26].

With these advances in the domain of computer vision, an implicit assumption is made regarding the ability to directly solve the problem of book cover classification by employing state-of-the-art image classification models. Therefore, we try to answer this question by employing the most powerful image recognition models (NASNet, SE-ResNeXt-50, SE-ResNet-50, Inception ResNet v2, DenseNet-161, ResNet-152, ResNet-50 and VGG-16) to date in order to automatically classify these book covers. Finally, we also consider the impact of employing textual information along with the visual modality in order to quantify the gains of using this visual representation.

The main contributions of this paper are threefold:

1. Detailed insights into the book cover classification dataset introduced by Iwana et al. [14] and its complexity.

2. A detailed evaluation of the state-of-the-art classification models for the task of book cover classification. This helps in establishing a benchmark on this problem. Furthermore, it also outlines the challenges due to which even the state-of-the-art models fail to solve this problem.

3. Identification of the regions of the input that the network focused on in order to make the prediction, by equipping the model with an explicit attention mechanism. This attention mechanism also helps the network attain minor improvements in the computed metrics.

The rest of the paper is structured as follows. We first present a brief recapitulation of the previous work done in the direction of book cover classification in "Related Works" section. We then provide details of the dataset used in this study in "Dataset" section. State-of-the-art deep learning models for image classification are evaluated on the task of book cover classification in "State-of-the-Art Models for Book Cover Classification" section. Besides, the dataset distribution and associated challenges are analyzed in depth. Quantification of the impact of dataset cleansing on the classification accuracy is then provided in "Cleansed Dataset Evaluation" section, followed by an extensive tweaking of the model architecture to unveil the task's difficulties in "Extensive Model Tweaking" section. Finally, we close the paper with concluding remarks in "Conclusion" section.

Related Works

The classification of artistic book covers is a sophisticated task, due to its subjectivity and the fuzzy nature of class affiliation. However, exploration and analysis of underlying patterns can reveal interesting coherencies that could be of use both for understanding art and for aiding artists as inspirational influences.


Fig. 1 Book covers containing no textual cues, highlighting the difficulty of the task when solely relying on the visual content


In the field of book genre classification, Iwana et al. [14] have proposed a publicly available dataset of book covers comprising 30 categories. They tried genre classification using LeNet [22] and AlexNet [21] architectures with the book cover image as input and achieved a baseline accuracy of 24.5%. Kjartansson and Ashavsky [20] approached a subset of the dataset from [14]. They chose only ten of the original 30 categories, reducing the dataset to 19k samples. By applying several image-based and text-based approaches, they achieved higher accuracies on this subset compared to Iwana et al. [14] on the original dataset. For their image-based approaches, older architectures like VGG-16 [34], SqueezeNet [13] and ResNet-50 [9] were used. Buczkowski et al. [4, 35] similarly approached genre classification using book covers and short textual descriptions. They focused on a completely different dataset, crawled from Goodreads.com, comprising 14 categories. On this dataset, they evaluated a simple convolutional neural network (CNN) architecture as well as a VGG-like architecture. Recently, Jolly et al. [16] applied the layer-wise relevance propagation (LRP) [3] method to a trained CNN [22] for book cover classification to explain the cover image's pixel-wise contributions and spot the most relevant elements of the artworks for genre classification.

Besides, other types of genre classification have also been a focus of research. Oramas et al.  [30, 31] performed genre classification of album covers using textual review information as well as using a multi-modal approach, where textual, acoustic and visual information of music albums were also leveraged. Libeks and Turnbull  [24] annotated music albums with genre tags using only the album cover artwork as well as promotional photographs. Similarly, the classification of painting styles has been studied in [18, 43].

Dataset

We used the publicly available dataset of book covers proposed by Iwana et al. [14]. The raw dataset contains information regarding the book cover title, authors, main category, multiple subcategories and a link to the images of over 207k book covers from Amazon.com. The book covers are classified into 32 main categories and over 3200 subcategories. In cases where the book cover was assigned to multiple categories, one category has been randomly selected. Iwana et al. [14] employed a subset of around 57k books from the original dataset for the corresponding experimentation. This subset was equalized to contain 1900 samples per class. Two classes (Gay & Lesbian and Education & Teaching) were discarded as they comprised a limited number of samples. This dataset will be referred to as the 30cat dataset in this paper.

The book cover dataset is quite complex when we compare it to standard image classification datasets like Caltech [38], MS-COCO [25], Oxford-102 [29], LSUN [41] and even ImageNet [33]. It is significantly difficult even for humans to classify. Most of the categories are distinguishable based on the textual description, while some categories contain significant visual cues (recognizable objects) for their discrimination [16].

State‑of‑the‑Art Models for Book Cover Classification

Automated classification of book covers is an interesting philosophical question along with its practical implications, and is thus beneficial in a variety of applications. Genre classification could be useful in reducing the time and costs invested for indexing books in big libraries or even e-commerce platforms by utilizing only a single image of the book cover. Book cover image classification by means of machine learning methods has already been approached by some studies [4, 14], as has classification leveraging textual information like book titles or descriptions [20, 35]. These studies vary significantly in terms of the employed dataset for the corresponding experimentation along with the computed metrics. Despite these efforts, none of the mentioned approaches for book cover image classification has employed state-of-the-art networks for this task. Iwana et al. [14] used AlexNet [21] and LeNet [22] architectures, whereas Buczkowski et al. [4] employed a shallow CNN similar to the VGG architecture in their work. However, no method has achieved any significant improvements in terms of the computed metrics. Therefore, a natural question arises regarding the task itself: is book cover classification extremely difficult for the current generation of machine learning models, or were the models employed in those studies simply not adequate for the task? In order to answer this question, we employed some of the most recent state-of-the-art image classification models for the task of book cover classification.

Experiments

The models evaluated in this paper include NASNet [42], SE-ResNeXt-50 [11], SE-ResNet-50 [11], Inception ResNet v2 [36], DenseNet-161 [12], ResNet-152 [9], ResNet-50 [9] and VGG-16 [34], with NASNet achieving the best corresponding top-1 accuracy of 82.7% on the ImageNet [33] test set. We initialized our models using the pretrained weights from the ImageNet [33] models in order to benefit from transfer learning. The input samples from the 30cat dataset are scaled up by a factor of 1.15, randomly cropped to the input size and then randomly flipped in the horizontal and vertical direction as part of data augmentation. To provide comparable circumstances, all experiments were conducted with a fixed training time of 10 epochs, a batch size of 20 samples, a learning rate of 1e-4 and image sizes of 224 × 224, 299 × 299 and 331 × 331 pixels depending upon the model in question. Data from empirical trials showed that despite sophisticated hyperparameter tuning, all models tend to overfit within a few epochs, resulting in a stagnant test set accuracy with an increasing number of epochs. We therefore selected a fixed training time of 10 epochs for all models. It is possible to obtain marginal gains in accuracy by employing a more sophisticated hyperparameter tuning strategy; however, the accuracy is very low to begin with for any useful real-world application.
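The following minimal sketch illustrates this training configuration. It is written with PyTorch/torchvision as an assumption (the paper does not name a framework, optimizer or dataset layout); only the number of epochs, batch size, learning rate, input sizes and the scale/crop/flip augmentation are taken from the text above.

```python
import torch
from torch import nn, optim
from torchvision import datasets, models, transforms

INPUT_SIZE = 299   # 224 / 299 / 331 depending on the model in question
NUM_CLASSES = 30   # 30cat dataset

train_tf = transforms.Compose([
    transforms.Resize(int(INPUT_SIZE * 1.15)),  # scale up by a factor of 1.15
    transforms.RandomCrop(INPUT_SIZE),          # random crop to the input size
    transforms.RandomHorizontalFlip(),          # random flips in both directions
    transforms.RandomVerticalFlip(),
    transforms.ToTensor(),
])

# Hypothetical folder layout: one sub-directory per genre.
train_set = datasets.ImageFolder("book-covers/train", transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=20, shuffle=True)

# Any ImageNet-pretrained backbone can stand in here for the models listed above.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)  # optimizer choice is an assumption

for epoch in range(10):  # fixed training time of 10 epochs
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```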

The obtained results highlight only marginal gains even when employing some of the most sophisticated models to date, indicating that the task itself is hard for the current generation of deep learning models. Since all of the classes had the same number of examples in the test set, except for one, accuracy served as a good metric for performance. The computed metrics (top-1 and top-3 accuracies) are presented in Table 1. Results from Iwana et al. [14] are included for comparison. We achieved an absolute 6% gain in the top-1 and 10% gain in the top-3 accuracy over the baseline established by Iwana et al. [14] by employing NASNet, with a top-1 accuracy of 30.5%, while ResNet-152 followed with an accuracy of 25.6%. Despite this improvement, the models trained for the task overfit to the training set, primarily because of the high intra-class variance, which made it extremely hard for the network to decipher the correct features (textual features) using a purely end-to-end training strategy.

Table 2 shows the results reported by Kjartansson et al. [20] for top-1 and top-3 per-class accuracies on their best performing image-based ResNet-50 ensemble. In comparison, the per-class accuracies of single Inception ResNet v2 and NASNet models, trained and tested on the same subset of ten classes, are presented. It can be seen that both single state-of-the-art models slightly outperform the ensemble method used in Kjartansson et al. [20], with Inception ResNet v2 yielding the highest combined top-1 accuracy of 59.6%. However, considering the per-class accuracies, every model still has its fortes, indicating that an ensemble can result in further gains.

Buczkowski et al. [4] reported their results on an unpublished dataset comprising different categories and numbers of samples, precluding the possibility of a direct comparison. Their dataset comprises 14 categories from Goodreads.com, where one of the categories, named Others, was obtained by merging together all the categories comprising small numbers of examples. A comparison of our results to the results of [4] is therefore not possible and is thus not included.

Discussion and Analysis

To better understand the nature of the classification problem, we analyzed the category distribution of the book cover dataset. Figure 2 shows the co-occurrence matrix representing the number of simultaneous occurrences of two main classes in the whole raw dataset containing 207k samples.

Table 1 Accuracy comparison of state-of-the-art models to LeNet and AlexNet from [14] on the original 30cat dataset

Architecture          Train (%)   Top1 (%)   Top3 (%)
NASNet                40          30.5       50.2
SE-ResNeXt-50         80          27.7       45.9
SE-ResNet-50          50          27.1       46.7
Inception ResNet v2   60          26.7       45.3
ResNet-152            60          25.6       43.7
ResNet-50             40          25.5       44.3
VGG-16                25          25.1       46.3
DenseNet-161          70          23.9       44.3
AlexNet               –           24.7       40.3
LeNet                 –           13.5       27.8

Table 2 Accuracy comparison of IncResV2 and NASNet to the best performing image-based architecture ResNet50 ensemble from [20] on a subset of the 30cat dataset

Genre                       Kjartansson et al. [20]   IncResV2              NASNet
                            Top1 (%)   Top3 (%)       Top1 (%)   Top3 (%)   Top1 (%)   Top3 (%)
Children's books            66         86             67.9       86.3       55.8       82.6
Comics & graphic novels     62         85             66.3       92.1       73.1       90.0
Computers & technology      62         83             56.3       84.2       55.2       78.4
Cookbooks, food & wine      61         81             70.5       87.4       67.3       77.8
Romance                     66         91             55.8       80.5       67.3       91.5
Science & math              40         76             45.8       74.2       33.1       70.0
Science fiction & fantasy   48         80             59.5       78.9       67.3       89.4
Sports & outdoors           49         81             44.2       74.2       47.3       76.8
Test preparation            73         89             69.5       85.8       71.5       89.4
Travel                      49         76             60.0       89.5       53.6       83.6
Average                     57.6       82.8           59.6       83.3       59.2       83.0


From all classes, one is specifically prominent in this figure: the Reference class rarely occurs exclusively and is mixed with almost all other classes. This makes sense, as reference books are very common in scientific literature like the natural sciences, law and economics. On the other hand, they are very uncommon in literature like comics, thrillers and romances. Another very prominent mutual occurrence is that of Religion & Spirituality together with Christian Books & Bibles. This again is understandable, as Christian books are a subset of religious books. As the data have been collected from the American Amazon.com page, it is most likely that the subset of Christian books has been treated as a separate main class, as it specifically addresses the majority of the American customers. Moreover, the set of main categories is also given by Amazon.com's system and is not necessarily optimal for classification. By looking at the co-occurrence matrix in detail, other overlapping classes can also be observed. For example, History seems to overlap with many classes, particularly Arts & Photography as well as Religion & Spirituality. Another striking overlap is that of Literature & Fiction with Children's Books. However, these specific mutual appearances are comprehensible.

Fig. 2 Co-occurrence matrix, representing the number of mutual and exclusive occurrences of labels in the original dataset of 207k images and 32 categories


Also, it is still in the nature of book genres to be overlapping, as books can consist of broad content and genres are very subjective.
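For illustration, a co-occurrence matrix like the one in Fig. 2 can be computed along the following lines; the file name and the multi-label column format are assumptions rather than the released layout of the raw listing.

```python
import itertools
import numpy as np
import pandas as pd

df = pd.read_csv("amazon-book-listing.csv")  # hypothetical name for the raw 207k listing
# Assumed format: a "categories" column with pipe-separated main-category labels.
all_labels = sorted({c for cats in df["categories"] for c in cats.split("|")})
index = {c: i for i, c in enumerate(all_labels)}

cooc = np.zeros((len(all_labels), len(all_labels)), dtype=int)
for cats in df["categories"]:
    labels = sorted(set(cats.split("|")))
    if len(labels) == 1:
        cooc[index[labels[0]], index[labels[0]]] += 1   # exclusive occurrence (diagonal)
    for a, b in itertools.combinations(labels, 2):      # mutual occurrences (off-diagonal)
        cooc[index[a], index[b]] += 1
        cooc[index[b], index[a]] += 1
```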

Another factor that adds complexity to this specific classification task is that the dataset exhibits low inter-class variance and high intra-class variance, which makes it extremely difficult for any image classification method to deal with. High intra-class variance pertains to the fact that there is a huge variety of different book covers present in a single category. Low inter-class variance, on the other hand, pertains to the fact that book covers belonging to different categories are strikingly similar. Figures 3 and 4 provide an insight into the low inter-class variance issue, where it can be seen that book covers containing very similar visual content belong to different classes. In many cases, a plain book cover (Fig. 3) or a specifically designed book cover (Fig. 4) occurred in 5–6 classes, where the only differentiating factor was the title, which justified the assignment of that particular category. This means that if the textual information is discarded, it is impossible even for humans to assign the corresponding book cover to a particular category.

In contrast to the inter-class and intra-class variances, which are an inherent problem of book cover classification, the findings from the category distribution analysis motivated us to cleanse the dataset. This should clarify the task definition and therefore reduce confusion of the network during training, which ultimately leads to better accuracies. A subset was extracted from the 30cat dataset in which the class Reference is removed and the class Christian Books & Bibles is merged with the class Religion & Spirituality, resulting in 28 classes and only 55.1k samples. This dataset will be referred to as the 28cat dataset in this paper.¹
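A minimal sketch of this cleansing step is shown below; the file and column names are assumptions about the dataset listing, not the released preprocessing script.

```python
import pandas as pd

df = pd.read_csv("book30-listing.csv")  # hypothetical file and column names
df = df[df["category"] != "Reference"]  # drop the Reference class entirely
df["category"] = df["category"].replace(
    {"Christian Books & Bibles": "Religion & Spirituality"})  # merge the two religion classes
df.to_csv("book28-listing.csv", index=False)  # 28cat subset with ~55.1k samples
```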

Figure 5 visualizes the embeddings of a pretrained Inception ResNet v2 on the cleansed 28cat book cover dataset. Despite employing state-of-the-art image classification models, the embedding space is still highly overlapping, highlighting the complexity of the problem. There are also categories that seem to be well segregated, such as the Test Preparation class, since the covers are highly distinctive in that case.

Cleansed Dataset Evaluation

Based on the insights from "Discussion and Analysis" section, the impact of cleansing the dataset is quantified in the following. Two separate experiments have been conducted to simplify the classification problem with the 30cat dataset and to reduce the resulting confusion of the models. All further experiments have been conducted using the best performing Inception ResNet v2 architecture, pretrained on the ImageNet dataset, with the hyperparameters highlighted in "Experiments" section. For training, image data are scaled up by a factor of 1.15, randomly cropped to the input size and randomly flipped in the horizontal and vertical direction. One model corresponds to the 30cat baseline from "Experiments" section; the other model is trained on the 28cat subset. Table 3 shows that by removing the Reference class and by merging both classes related to religion, an increase of 1.1% compared to the initial 30 classes was observed. This increase, although a minor one, reinforces the assumption that the occurrence of these subclasses caused confusion in the classification task. It needs to be mentioned that the accuracy is at least partly increasing naturally due to the simplification of the classification problem, as the number of classes is reduced. However, the choices of cleaning and merging classes from the dataset are justifiable, as shown in Fig. 2, and improve the problem definition.

Fig. 3 Plain book covers of books belonging to different categories

Fig. 4 Specifically designed series of book covers belonging to different categories

¹ https://github.com/adriano-lucieri/book-dataset


Merging and redefining more subclasses could further increase the quality of the classification problem's definition. This would enhance the utility of the resulting classifiers and might further increase classification accuracies. However, finding proper super-classes is a very subjective and complex task, requiring in-depth domain knowledge. Due to the positive effect of confusion reduction, all subsequent experiments are conducted on the cleansed 28cat subset.

Fig. 5 T-SNE plot of 28cat dataset using softmax activations, obtained from an Inception ResNet v2 classifier

Table 3 Accuracies of Inception ResNet v2 architecture on original 30cat dataset and on 28cat subset

Dataset   Test accuracy (%)
30cat     26.7
28cat     27.8


Extensive Model Tweaking

We now extend the initial experiments with the Inception ResNet v2 architecture in several new ways in order to better understand the complexity of the problem and unveil key obstacles in the task of book cover classification. We first analyzed the effect of exhaustive data augmentation on the resulting classifier. We also benchmarked several attention mechanisms allowing the network to explicitly focus on parts of the book cover that were actually influential for a particular prediction. To this end, we also analyzed the impact of incorporating spatial transformer networks (STNs) [15], where the network can learn the full affine transformation of the input through backpropagation. We then assessed the impact of fusing textual information along with the visual cues in order to identify the gains through the two different information streams. Finally, we ensembled all the different models trained for the different experiments to highlight the possible gains through an ensembling scheme. All experiments are conducted on the 28cat dataset with hyperparameters as mentioned in "Experiments" section.

Data Augmentation

Data augmentation is a common technique used to artificially increase dataset sizes and ultimately avoid overfitting. Especially in image classification, plenty of different augmentation techniques have been proposed in the past.

The experiment was conducted by training the Inception ResNet v2 model on augmented input data. The 55.1k samples from the 28cat subset were augmented by random flips on the horizontal and vertical axis, by randomly changing contrast, hue and saturation, and by random blurring, translation and rotation of the book cover images. The model, pretrained on ImageNet, is further fine-tuned for 10 epochs. The hyperparameters were kept constant as specified in "Experiments" section.
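A possible torchvision formulation of this heavier augmentation pipeline is sketched below; the parameter ranges are illustrative assumptions, not the exact values used in the experiment.

```python
from torchvision import transforms

augment_tf = transforms.Compose([
    transforms.Resize(int(299 * 1.15)),
    transforms.RandomCrop(299),
    transforms.RandomHorizontalFlip(),                               # random flips
    transforms.RandomVerticalFlip(),
    transforms.ColorJitter(contrast=0.3, hue=0.05, saturation=0.3),  # contrast/hue/saturation
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),        # random blurring
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1)),       # rotation and translation
    transforms.ToTensor(),
])
```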

The computed metrics from the augmented network are reported in Table 4. As a reference, the accuracy of the baseline Inception ResNet v2 model on the 28 category dataset is given in the first row. Unfortunately, the test accuracy decreased in comparison with the previous experiments. Since we fine-tune the network for a fixed number of epochs (10 in all our experiments), the introduction of additional noise into the network can increase the required training time. As we kept the training time the same, this might have been the reason for the drop in performance. However, there is a possibility that the used augmentation hampered the performance of the original network. More sophisticated strategies like AutoAugment [7], where augmentation policies are learned from the data itself and which achieves state-of-the-art performance on the ImageNet [33] dataset, could also be introduced.

Attention Module

The use of attention mechanisms in book cover classification was already recommended by Kjartansson et al. [20]. Jolly et al. [16] observed that CNNs seem to heavily rely on objects in the book covers for classification. In addition, they found that smaller textual content, which is often crucial for the classification of book covers by humans, is of less relevance to the networks. Additionally, many book covers consist of mostly planar regions that do not contribute to the classification. Focusing on these regions could potentially result in severe overfitting.

Table 4 Inception ResNet v2 results on 28cat dataset: with & without augmentation

Bold value indicates the experiment with the best result in a set of experiments in terms of test accuracy

Experiment             Accuracy (%)
Without augmentation   27.8
With augmentation      24.4

Fig. 6 Inception ResNet v2 additionally equipped with different attention strategies


Therefore, to further investigate these findings, we experimented with several different variations of attention mechanisms on the basic structure of Inception ResNet v2 to identify their plausibility for the task of book cover classification. A basic schematic of the different modules is presented in Fig. 6. The same training parameters as in the previous sections were used. We now briefly explain the different methodologies employed to incorporate explicit attention into the network. The obtained results are presented in Table 5.

Simple Attention Initially, we implemented a simple attention mechanism as proposed by Rodriguez et al. [32]. A single 1 × 1 filter was applied to the model's last convolutional feature map of size 8 × 8. The output is then normalized using the softmax activation function, which serves as the attention over spatial locations. The resulting tensor is then element-wise multiplied with the initial feature map to exert attention. The simple attention mechanism with softmax activation yielded only 17.1% accuracy. By inspecting the resulting attention masks, it was found that the network drew all of its attention onto one specific spot of the feature map, which led to intense overfitting to the training data.
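The following sketch shows one way to re-implement this spatial attention module (including the sigmoid and temperature-augmented softmax variants discussed next); it is an illustrative PyTorch formulation, not the authors' code, and the channel count and temperature in the example are placeholders.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """1x1 attention over the 8x8 spatial locations of the last feature map."""

    def __init__(self, in_channels: int, mode: str = "softmax", temperature: float = 1.0):
        super().__init__()
        self.score = nn.Conv2d(in_channels, 1, kernel_size=1)  # a single 1x1 filter
        self.mode = mode
        self.temperature = temperature

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        n, _, h, w = feats.shape
        scores = self.score(feats).view(n, 1, h * w)  # one score per spatial location
        if self.mode == "sigmoid":
            attn = torch.sigmoid(scores)
        else:  # plain softmax (temperature = 1) or temperature-augmented softmax
            attn = F.softmax(scores / self.temperature, dim=-1)
        return feats * attn.view(n, 1, h, w)  # element-wise re-weighting of the feature map

# Example (placeholder values): SpatialAttention(2080, mode="softmax", temperature=4.0)(last_feature_map)
```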

In order to enforce the diffusion of the attention mask, we augmented the attention mechanism by employing sigmoid and temperature-augmented softmax functions. The attention masks indeed showed a diffusion of attention onto specific areas that mostly contained objects, persons or big lettering. The temperature-augmented softmax resulted in a slight improvement over the sigmoid, achieving an accuracy of 31.0%. We visualize the attention maps computed from the temperature-augmented softmax in Fig. 7. Figure 7g shows the scale, indicating the respective attention of the network. Figure 7a highlights examples from the Cookbooks, Food & Wine class where the network correctly focused on the food items in order to come up with the correct prediction. The network only focused on the big lettering in order to identify the Test Preparation category, as highlighted in Fig. 7b. For the Comics & Graphic Novels category, the network interestingly learnt to focus on the faces of the characters (Fig. 7c). For the Engineering & Transportation category, the network learned to attend to cars and bikes, which were very common (Fig. 7d). Italic and stylish fonts were quite common in the Romance category. Therefore, the network learnt to attend to these stylish fonts, along with the faces, in order to correctly tell the class apart (Fig. 7e), which is consistent with the findings of the previous work [16]. Finally, since the Law category mainly comprised textual content on the cover, the network learned to keep the text in focus, as highlighted in Fig. 7f.

Saliency-Based Attention As a follow-up, we implemented an attention mechanism based on saliency maps of the network's input. This attention mechanism is meant to focus the network's attention on salient regions containing text or objects, so that areas containing only irrelevant details do not lead to confusion. The input image is used to calculate an attention mask using Hou and Zhang's method of spectral residual saliency detection [10]. This attention mask is then converted to a binary map using a manual threshold value of 10 (for values in the range [0, 255]) and finally resized to 8 × 8. The resulting mask is again element-wise multiplied with the last convolutional feature map. Some examples of the upscaled attention masks are visualized in Fig. 8a and b.

The saliency-based attention mechanism focused strongly on objects as well as on textual content of all types. However, the resulting accuracy of 30.8% lies slightly below the previously mentioned approaches. Figure 8a shows some good examples on the first row and bad examples on the second. The good examples highlight a very precise focus on objects, symbols and important textual regions. However, sometimes important contexts like landscapes or big objects have been neglected due to their modest appearance, e.g., smooth color gradients.
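A self-contained sketch of the saliency-mask computation is given below. It re-implements the spectral residual method [10] with NumPy/SciPy; the intermediate resolution and smoothing parameters are assumptions, while the threshold of 10 and the final 8 × 8 mask follow the description above.

```python
import numpy as np
from PIL import Image
from scipy.ndimage import gaussian_filter, uniform_filter

def spectral_residual_saliency(img: Image.Image, size: int = 64) -> np.ndarray:
    gray = np.asarray(img.convert("L").resize((size, size)), dtype=np.float64)
    spectrum = np.fft.fft2(gray)
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)
    phase = np.angle(spectrum)
    residual = log_amplitude - uniform_filter(log_amplitude, size=3)  # spectral residual
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    saliency = gaussian_filter(saliency, sigma=2.5)
    return 255.0 * saliency / (saliency.max() + 1e-12)  # saliency map in [0, 255]

def saliency_attention_mask(img: Image.Image, threshold: float = 10.0) -> np.ndarray:
    binary = (spectral_residual_saliency(img) > threshold).astype(np.float32)  # binarize at 10
    # Downscale to the 8x8 grid of the last convolutional feature map; the mask is
    # then multiplied element-wise with that feature map inside the network.
    return np.asarray(Image.fromarray(binary).resize((8, 8)), dtype=np.float32)
```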

Residual Attention The previous architecture was then modified to a combined approach. The saliency map, computed from the input image, is element-wise multiplied with the last convolutional feature map of shape 8 × 8. A residual attention mechanism with a 1 × 1 convolution, followed by a Tanh activation function, is element-wise multiplied with the last feature map as well. Those two tensors are then summed and passed to the output block of Inception ResNet v2. By applying the Tanh activation, the attention mechanism is able to dynamically adapt the attention that is given by the saliency input map. In a second experiment (Residual Stacked), an extension of this architecture was tested, where the trainable attention's 1 × 1 convolution is preceded by a convolutional layer with 32 filters of kernel size 3 × 3. This experiment was conducted with the aim that the trainable attention mechanism also takes context into account when deciding about the salient regions, as opposed to the case where only a 1 × 1 convolution is applied pixel-wise. The accuracies achieved by the two approaches vary significantly.
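The wiring of this residual attention variant could look as follows; this is a sketch rather than the authors' implementation, and the activation between the two convolutions of the stacked variant is an assumption, as the text does not specify it.

```python
import torch
from torch import nn

class ResidualAttention(nn.Module):
    def __init__(self, in_channels: int, stacked: bool = False):
        super().__init__()
        if stacked:
            # "Residual Stacked": a 3x3 convolution with 32 filters precedes the 1x1
            # convolution; the ReLU in between is an assumption.
            self.learned = nn.Sequential(
                nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(32, 1, kernel_size=1), nn.Tanh())
        else:
            self.learned = nn.Sequential(
                nn.Conv2d(in_channels, 1, kernel_size=1), nn.Tanh())

    def forward(self, feats: torch.Tensor, saliency_mask: torch.Tensor) -> torch.Tensor:
        # feats: (N, C, 8, 8); saliency_mask: (N, 1, 8, 8) from the spectral residual step
        saliency_branch = feats * saliency_mask
        learned_branch = feats * self.learned(feats)
        return saliency_branch + learned_branch  # summed and passed to the output block
```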

Table 5 Inception ResNet v2 results on 28cat dataset: with and without attention

Bold value indicates the experiment with the best result in a set of experiments in terms of test accuracy

Experiment                       Accuracy (%)
Inception ResNet v2              27.8
Attention—softmax                17.1
Attention—sigmoid                29.7
Attention—temperatured softmax   31.0
Attention—saliency               30.8
Attention—residual               31.1
Attention—residual stacked       28.3
Attention—STN                    21.4


Fig. 7 Examples of attention maps from an Inception ResNet v2 model using an attention mechanism based on the temperature-augmented softmax function



Spatial Transformer Networks Finally, we implemented two spatial transformer networks (STNs) [15] to incorporate hard attention. The first one is a conventional STN, using a separate localization network made up of three blocks of max pooling, convolution and batch normalization layers. An intermediate feature map is again max-pooled and concatenated to the final feature map, flattened and further fed into a dense layer of 512 units and an output layer of six units for the affine transform parameters. The affine transform is then applied to the input image, which is then fed into the classification network. The second approach used an intermediate feature map of the classification network to produce the six affine transform parameters. A feature map of shape 8 × 8 × 2080 is flattened and fed into a dense layer with six units. The output is used to transform the input image and feed it back into the Inception ResNet v2 architecture.
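A compact sketch of the first, conventional STN variant is given below; the exact layer sizes are assumptions, and only the overall structure (localization network, six affine parameters, warped input fed to the classifier) follows the description above.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.localization = nn.Sequential(
            nn.MaxPool2d(2), nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
            nn.MaxPool2d(2), nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2), nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, 512), nn.ReLU(),
            nn.Linear(512, 6),  # the six affine transform parameters
        )
        # Initialize to the identity transform so training starts from the unwarped image.
        self.localization[-1].weight.data.zero_()
        self.localization[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        theta = self.localization(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)  # warped image for the classifier
```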

The STN transformations turned out to be very unstable. Initially, the transformation layer extensively zoomed in, out or rotated the images. Most experiments led to transform parameters that would let the input image disappear completely, which resulted in accuracies almost at the level of random guessing. Examples of the resulting transformations of the input images are presented in Fig. 9.

Fig. 8 Examples of attention maps from an Inception ResNet v2 model using an attention mechanism based on saliency maps

Fig. 9 Examples of input images transformed by the transformer layer of a modified Inception ResNet v2 model


The results summarized in Table 5 indicate that the non-STN variants achieved higher accuracies. The common STN approach with a separate localization network resulted in a zoomed-out view of the book covers, in which the actual book cover covered only a small proportion of the input. Jaderberg et al. [15] demonstrated their method on MNIST [23], SVHN [28] and CUB-200-2011 [38].

An STN provides the ability to transform the image into its canonical pose by optimizing the affine transform parameters so as to minimize the overall objective. In order to correct the warping of the image, it is assumed that the same set of transform parameters generalizes to the complete class, or that the canonical pose for the class is the same. In the case of book covers, the intra-class variance is significantly high, preventing the STN module from extracting any generalizable transformation parameters from the dataset, as the relevant features vary widely from image to image.

Although the incorporation of attention in the Inception ResNet v2 resulted in modest gains in accuracy (27.8% vs. 31.1%), the main aim of this attention mechanism was to get a better understanding of what the network learned during its training phase. The obtained results indicate that the network learned to focus on the correct regions in some cases, but in most of the cases, there were no visual cues which the network could consistently exploit for classification. This makes the book cover classification task distinct from other classification problems like the ImageNet large-scale visual recognition challenge [33].

Loss Metric

According to the recent findings in [27], it appears that the cross-entropy loss could be problematic in some cases. For this experiment, the Inception ResNet v2 model was trained using the mean-squared error (MSE) instead of the cross-entropy loss. The obtained results are presented in Table 6. The model yielded an accuracy of 30.1% after 10 epochs, which is an increase of 2.3% over the experiment with the cross-entropy loss. However, the increase is not significant and the loss metric does not seem to be the problem's origin.
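The loss swap amounts to something like the following snippet; applying the softmax before the MSE on one-hot targets is an assumption about the exact formulation, which the text does not specify.

```python
import torch
import torch.nn.functional as F

def mse_classification_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    targets = F.one_hot(labels, num_classes=logits.size(1)).float()  # one-hot ground truth
    return F.mse_loss(torch.softmax(logits, dim=1), targets)         # MSE instead of cross-entropy
```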

GAN Pretraining

In this experiment, GAN-generated images were used for pretraining the basic Inception ResNet v2 model that was previously pretrained on ImageNet [33]. For generating the book cover samples, we modified the base architecture of the Progressive GAN [19] framework. 551k generated samples from 28 categories were used. After pretraining this model on the generated samples, it was further fine-tuned on the real images from the 28cat subset for 10 epochs. Figure 10 shows generated samples from the trained GAN, where Fig. 10a, b and c show generated book covers for the Children's Books, Mystery and Medical Books genres. Two experiments were conducted, with the pretraining stopped after seven and ten epochs, in order to examine the influence of pretraining.

Table 7 indicates that by using the models pretrained on GAN images, the accuracies decreased by 1.3% and 2.2% after pretraining the model on GAN images for 7 and 10 epochs, respectively.

Table 6 Inception ResNet v2 results on 28cat dataset: with different loss metrics

Bold value indicates the experiment with the best result in a set of experiments in terms of test accuracy

Experiment           Accuracy (%)
Cross-entropy        27.8
Mean-squared error   30.1

Table 7 Inception ResNet v2 results on 28cat dataset: with and without GAN pretraining

Bold value indicates the experiment with the best result in a set of experiments in terms of test accuracy

Experiment                         Accuracy (%)
Without GAN pretraining            27.8
With GAN pretraining (7 epochs)    26.5
With GAN pretraining (10 epochs)   25.6

Fig. 10 Generated book cover samples from trained GAN


It also appears that longer pretraining on GAN images worsens the result, as pretraining for three more epochs resulted in a further decrease of 1% in accuracy. One very plausible explanation for this drop in accuracy is the poor quality of the conditioning of the generated samples.

Incorporation of Textual Modality

As textual content on the book cover is usually extremely important for humans to classify a particular book cover, we evaluated the relevance of the books' titles for the corresponding classification problem, with the assumption that the text present on a book cover is much more descriptive for the classification than the visual cues. We used the title information available in the dataset to implement a text-based classifier in order to examine the potential of text incorporation. The text-based classifier is implemented using sentence vectors from FastText [17]. In addition, different ensembles of text and image classifiers were evaluated. The text embeddings for the ensembles were obtained using FastText as well. For the image-based classifier, the basic Inception ResNet v2 architecture pretrained on ImageNet was used. Three different variants of the network were tested. We first evaluated the early-fusion scheme, where the sentence embeddings are concatenated to the channel axis of the CNN's input. The sentence vectors are broadcast in the height and width dimensions. In the next experiment, we tested the late-fusion scheme, where the sentence embeddings are concatenated to the flattened tensor from the last convolutional layer and fed into an additional dense layer with 4096 units and a ReLU activation function. Finally, the last variant was tested, combining both early and late fusion in one network (dual fusion). A scheme of the different variants is presented in Fig. 11. Furthermore, we used the embeddings from the pretrained text and image ensemble classifiers to train a support vector machine (SVM) classifier. For this experiment, the previously mentioned ensemble with late fusion was used.
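The early-, late- and dual-fusion variants can be sketched as follows; the 300-dimensional FastText sentence vector, the backbone wrapper and all layer names are assumptions, while the 4096-unit ReLU fusion layer and the concatenation points follow the description above.

```python
import torch
from torch import nn

class TextImageFusion(nn.Module):
    def __init__(self, cnn: nn.Module, cnn_feat_dim: int, text_dim: int = 300,
                 num_classes: int = 28, mode: str = "late"):
        super().__init__()
        assert mode in ("early", "late", "dual")
        self.cnn = cnn    # feature extractor; for "early"/"dual" its first convolution
        self.mode = mode  # must accept 3 + text_dim input channels
        head_in = cnn_feat_dim + (text_dim if mode in ("late", "dual") else 0)
        self.head = nn.Sequential(nn.Linear(head_in, 4096), nn.ReLU(),
                                  nn.Linear(4096, num_classes))

    def forward(self, image: torch.Tensor, title_vec: torch.Tensor) -> torch.Tensor:
        if self.mode in ("early", "dual"):
            # Early fusion: broadcast the sentence vector over height and width and
            # concatenate it to the input channels.
            n, _, h, w = image.shape
            planes = title_vec[:, :, None, None].expand(n, title_vec.size(1), h, w)
            image = torch.cat([image, planes], dim=1)
        feats = torch.flatten(self.cnn(image), start_dim=1)
        if self.mode in ("late", "dual"):
            feats = torch.cat([feats, title_vec], dim=1)  # late fusion before the dense layer
        return self.head(feats)

# title_vec could, for example, come from fastText: ft.get_sentence_vector(title).
```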

It is evident from Table 8 that the classifier trained on just the title of the book (embedded using FastText embeddings) is significantly superior in terms of accuracy (55.6%) compared to the image-based classifier (27.8%). This is consistent with our understanding of the problem. It is also evident from the table that fusing both the textual and visual information results in a deterioration of performance in almost all of the cases, instead of improving it. In the case of late fusion, the improvement is statistically insignificant.

Model Ensembles

An ensemble is a set of multiple, mutually complementary classifiers whose predictions are combined in order to benefit from their varying distributions.

Fig. 11 Architectural diagram for experiments conducted on text–image ensembles

Table 8 Inception ResNet v2 results on 28cat dataset: Multimodal (Text & Image)

Bold value indicates the experiment with the best result in a set of experiments in terms of test accuracy

Experiment                                Accuracy (%)
FastText                                  55.6
Text & Image ensemble—early fusion        44.4
Text & Image ensemble—late fusion         55.7
Text & Image ensemble—dual fusion         53.4
Text & Image ensemble—late fusion & SVM   12.5


By combining several classifiers of lower complexity, a more complex representation can be achieved, suitable for difficult problems. Ensembling is commonly employed to further boost the classification performance of the individual classifiers. Given the high number of models trained in the previous sections, it is a natural approach to apply ensembling to examine the potential gains.

For ensemble learning, many different approaches are available. In the following, a simple voting scheme is used to combine the predictions of different models. From the union of all models' predictions, the label is chosen that has the most votes. In case two or more labels have an equal number of votes, one is chosen at random. Different combinations have been evaluated. The first combination (referred to as Ensemble 1) includes the strongest models trained on 28cat so far, including Inception ResNet v2 with MSE loss, attention with temperature-augmented softmax, saliency-based attention and residual attention. All of these models yielded test accuracies of more than 30%. The ensembling approach resulted in an accuracy of 33.9%, surpassing the best model by 2.8%. Ensemble 2 additionally included the models fine-tuned on the GAN-generated images for seven and ten epochs. Both models yield significantly lower accuracies of 26.5% and 25.6%, respectively. Table 9 shows that even adding those two weaker models increased the test accuracy again by 1.1%. Given the performance boost obtained by employing new models, another combination (Ensemble 3) was evaluated, consisting of nine different models. In addition to the models from Ensemble 2, the augmented model and two differently initialized models based on Inception ResNet v2 were included. All of them yielded test accuracies lower than 27%. The resulting test accuracy of 36.6% outperformed the best single model by 5.5%. Further inclusion of models resulted in a negligible gain in classification accuracy and is hence omitted for clarity.
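The voting scheme itself is straightforward; a minimal sketch is given below (which models enter each ensemble is described above).

```python
import random
from collections import Counter
from typing import List, Sequence

def majority_vote(per_model_predictions: Sequence[Sequence[int]]) -> List[int]:
    """per_model_predictions[m][i] is model m's predicted label for sample i."""
    ensemble = []
    for sample_preds in zip(*per_model_predictions):
        counts = Counter(sample_preds)
        best = max(counts.values())
        winners = [label for label, count in counts.items() if count == best]
        ensemble.append(random.choice(winners))  # ties are broken at random
    return ensemble

# Example: majority_vote([[3, 7], [3, 2], [5, 2]]) -> [3, 2]
```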

The obtained results suggest that ensembling increases the obtained accuracies for the book cover genre classification task despite the low independent accuracies of some models. This is plausible, as ensembling benefits from a high variance in model initialization, architectures and thus combinations of various local minima, resulting in slightly different fortes in mapping the given input data distribution to the target distribution. By systematically choosing models with complementary fortes, performance could be further improved. This is consistent with the findings by Kjartansson et al. [20]. However, solving the book cover classification problem requires solving an interim task, i.e., OCR, and provides poor textural cues, both of which are significant impediments for the current generation of deep models.

Conclusion

Book cover classification is an intriguing research question that also has practical value. We, therefore, evaluated the efficacy of employing state-of-the-art deep learning models in the direction of classification of book covers. Despite the range of experiments performed, the obtained results for image-based classification significantly underperformed the simple text-based classifier.

Table 9 Inception ResNet v2 Results on 28cat dataset: Ensemble

Experiment   Models used                                        Accuracy (%)
Ensemble 1   IncResV2 (MSE)                                     33.9
             IncResV2—attention (temperatured softmax)
             IncResV2—attention (saliency-based)
             IncResV2—attention (residual)
Ensemble 2   IncResV2 (MSE)                                     35.0
             IncResV2—attention (temperatured softmax)
             IncResV2—attention (saliency-based)
             IncResV2—attention (residual)
             IncResV2 after GAN pretraining for 7 epochs
             IncResV2 after GAN pretraining for 10 epochs
Ensemble 3   IncResV2 (MSE)                                     36.6
             IncResV2—attention (temperatured softmax)
             IncResV2—attention (saliency-based)
             IncResV2—attention (residual)
             IncResV2 after GAN pretraining for 7 epochs
             IncResV2 after GAN pretraining for 10 epochs
             IncResV2 with augmentation
             IncResV2 (cross-entropy) random initialization


A plausible explanation of this poor performance can be the violation of the i.i.d. (independent and identically distributed) assumption. Although the generated samples are independent, the samples are not exactly identically distributed, since they are only limited by the imagination of the artists, unlike natural images, which are usually identically distributed.

With the obtained results and analysis, it is evident that the current generation of state-of-the-art models is unable to solve this task to a satisfactory level of performance. Therefore, significant efforts need to be invested in order to solve this task. These advances will cover the development of more sophisticated deep learning models as well as specific strategies to improve their applicability to this problem. A particularly important direction in this regard would be the development of advanced feature extraction techniques which can learn the correct set of invariants for the task, which is in itself a very hard problem to solve.

Acknowledgements This work was supported by the BMBF project DeFuseNN (Grant 01IW17002) and partially supported by JSPS KAKENHI (Grant JP17H06100). We thank all members of the Deep Learning Competence Center at the DFKI for their comments and support.

Compliance with ethical standards

Conflict of Interest On behalf of all authors, the corresponding author states that there is no conflict of interest.

References

1. Afzal MZ, Capobianco S, Malik MI, Marinai S, Breuel TM, Dengel A, Liwicki M. Deepdocclassifier: document classification with deep convolutional neural network. In: 2015 13th international conference on document analysis and recognition (ICDAR); 2015. p. 1111–5. https://doi.org/10.1109/ICDAR.2015.7333933.

2. Anderson P, He X, Buehler C, Teney D, Johnson M, Gould S, Zhang L. Bottom-up and top-down attention for image captioning and visual question answering. In: The IEEE conference on computer vision and pattern recognition (CVPR); 2018. vol. 3, p. 6.

3. Bach S, Binder A, Montavon G, Klauschen F, Müller KR, Samek W. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE. 2015;10(7):2015.

4. Buczkowski P, Sobkowicz A, Kozlowski M. Deep learning approaches towards book covers classification. In: International conference on pattern recognition applications and methods (ICPRAM); 2018. p. 309–16. https://doi.org/10.5220/0006556103090316.

5. Chen LC, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation; 2017. arXiv:1706.05587.

6. Chiu CC, Sainath TN, Wu Y, Prabhavalkar R, Nguyen P, Chen Z, Kannan A, Weiss RJ, Rao K, Gonina E, et al. State-of-the-art speech recognition with sequence-to-sequence models. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2018. p. 4774–8.

7. Cubuk ED, Zoph B, Mané D, Vasudevan V, Le QV. Autoaugment: learning augmentation policies from data. CoRR; 2018. arXiv:1805.09501.

8. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in neural information processing systems (NIPS); 2014. p. 2672–80.

9. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: The IEEE conference on computer vision and pattern recognition (CVPR); 2016. p. 770–8.

10. Hou X, Zhang L. Saliency detection: a spectral residual approach. In: The IEEE conference on computer vision and pattern recognition (CVPR). IEEE; 2007. p. 1–8.

11. Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 7132–41.

12. Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 4700–8.

13. Iandola FN, Han S, Moskewicz MW, Ashraf K, Dally WJ, Keutzer K. Squeezenet: alexnet-level accuracy with 50× fewer parameters and <0.5 MB model size; 2016. arXiv:1602.07360.

14. Iwana BK, Rizvi STR, Ahmed S, Dengel A, Uchida S. Judging a book by its cover; 2016. arXiv:1610.09204.

15. Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks. In: Advances in neural information processing systems (NIPS); 2015. p. 2017–25.

16. Jolly S, Iwana BK, Kuroki R, Uchida S. How do convolutional neural networks learn design? In: 2018 24th international conference on pattern recognition (ICPR). IEEE; 2018. p. 1085–90.

17. Joulin A, Grave E, Bojanowski P, Mikolov T. Bag of tricks for efficient text classification. In: Proceedings of the 15th conference of the European chapter of the association for computational linguistics: volume 2, short papers. Association for Computational Linguistics; 2017. p. 427–31.

18. Karayev S, Trentacoste M, Han H, Agarwala A, Darrell T, Hertzmann A, Winnemoeller H. Recognizing image style; 2013. arXiv:1311.3715.

19. Karras T, Aila T, Laine S, Lehtinen J. Progressive growing of GANs for improved quality, stability, and variation; 2017. arXiv:1710.10196.

20. Kjartansson S, Ashavsky A. Can you judge a book by its cover? Stanford CS231N; 2017. http://cs231n.stanford.edu/reports/2017/pdfs/814.pdf.

21. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems (NIPS); 2012. p. 1097–105.

22. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.

23. LeCun Y, Cortes C. MNIST handwritten digit database; 2010. http://yann.lecun.com/exdb/mnist/.

24. Libeks J, Turnbull D. You can judge an artist by an album cover: using images for music annotation. IEEE MultiMedia. 2011;18(4):30–7.

25. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL. Microsoft COCO: common objects in context. In: European conference on computer vision. Springer; 2014. p. 740–55.

26. Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M. Playing Atari with deep reinforcement learning; 2013. arXiv:1312.5602.

27. Nar K, Ocal O, Sastry SS, Ramchandran K. Cross-entropy loss leads to poor margins. OpenReview; 2019. https://openreview.net/forum?id=ByfbnsA9Km.


28. Netzer Y, Wang T, Coates A, Bissacco A, Wu B, Ng AY. Reading digits in natural images with unsupervised feature learning. In: NIPS workshop on deep learning and unsupervised feature learning; 2011. vol. 2011, p. 5.

29. Nilsback ME, Zisserman A. Automated flower classification over a large number of classes. In: Sixth Indian conference on computer vision, graphics & image processing, 2008. ICVGIP'08. IEEE; 2008. p. 722–9.

30. Oramas S, Barbieri F, Nieto O, Serra X. Multimodal deep learning for music genre classification. Trans Int Soc Music Inf Retr. 2018;1(1):4–21.

31. Oramas S, Nieto O, Barbieri F, Serra X. Multi-label music genre classification from audio, text, and images using deep features; 2017. arXiv:1707.04916.

32. Rodríguez P, Cucurull G, Gonzàlez J, Gonfaus JM, Roca X. A painless attention mechanism for convolutional neural networks; 2018. https://openreview.net/forum?id=rJe7FW-Cb.

33. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, Berg AC, Fei-Fei L. ImageNet large scale visual recognition challenge. Int J Comput Vis (IJCV). 2015;115(3):211–52. https://doi.org/10.1007/s11263-015-0816-y.

34. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition; 2014. arXiv:1409.1556.

35. Sobkowicz A, Kozłowski M, Buczkowski P. Reading book by the cover - book genre detection using short descriptions. In: International conference on man–machine interactions. Springer; 2017. p. 439–48.

36. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI conference on artificial intelligence; 2017. vol. 4, p. 12.

37. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: The IEEE conference on computer vision and pattern recognition (CVPR); 2015.

38. Wah C, Branson S, Welinder P, Perona P, Belongie S. The Caltech-UCSD Birds-200-2011 dataset; 2011.

39. Wang Y, Skerry-Ryan R, Stanton D, Wu Y, Weiss RJ, Jaitly N, Yang Z, Xiao Y, Chen Z, Bengio S, et al. Tacotron: a fully end-to-end text-to-speech synthesis model; 2017. arXiv:1703.10135.

40. Yao JG, Wan X, Xiao J. Recent advances in document summarization. Knowl Inf Syst. 2017;53(2):297–336. https://doi.org/10.1007/s10115-017-1042-4.

41. Yu F, Seff A, Zhang Y, Song S, Funkhouser T, Xiao J. LSUN: construction of a large-scale image dataset using deep learning with humans in the loop; 2015. arXiv:1506.03365.

42. Zoph B, Vasudevan V, Shlens J, Le QV. Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 8697–710.

43. Zujovic J, Gandy L, Friedman S, Pardo B, Pappas TN. Classifying paintings by artistic genre: an analysis of features & classifiers. In: IEEE international workshop on multimedia signal processing, 2009. MMSP'09. IEEE; 2009. p. 1–5.

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.