



Recognizing Image Style: Extended Abstract

Sergey Karayev1

Matthew Trentacoste2

Helen Han1

Aseem Agarwala2

Trevor Darrell1

Aaron Hertzmann2

Holger Winnemoeller2

1 University of California, Berkeley    2 Adobe

Deliberately-created images convey meaning, and visual style is often a significant component of image meaning. For example, a political candidate portrait made in the lush colors of a Renoir painting tells a different story than if it were in the harsh, dark tones of a horror movie. While understanding style is crucial to image understanding, very little research in computer vision has explored visual style.

We present two novel datasets of image style, describe an approach to predicting the style of images, and perform a thorough evaluation of different image features for these tasks. We find that features learned in a multi-layer network generally perform best, even when trained with object class (not style) labels. Our approach shows excellent classification performance on both datasets, and we use the learned classifiers to extend traditional tag-based image search to consider stylistic constraints.

Flickr Style  Using curated Flickr Groups, we gather 80K photographs annotated with 20 style labels, ranging from photographic techniques ("Macro," "HDR"), composition styles ("Minimal," "Geometric"), moods ("Serene," "Melancholy"), and genres ("Vintage," "Romantic," "Horror"), to types of scenes ("Hazy," "Sunny").

Top five predictions on the test set for a selection of styles:

[Figure: example results for the styles Minimal, Depth of Field, Melancholy, HDR, and Vintage]

Wikipaintings  Using community-annotated data, we gather 85K paintings annotated with 25 style/genre labels.

Top five predictions on the test set for a selection of styles:

[Figure: example results for the styles Baroque, Impressionism, Ukiyo-e, Cubism, and Color Field]

Features and Learning  We test the following features: L*a*b color histogram, GIST descriptor, graph-based visual saliency, meta-class binary (MC-bit) object features, and deep convolutional neural network (CNN) features, using the Caffe implementation of Krizhevsky's ImageNet architecture (referred to as the DeCAF feature, with a subscript denoting the network layer). Notably, the last two of these are features designed and trained for object recognition.
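The abstract does not specify the classifier trained on top of these features; purely as an illustration, here is a minimal scikit-learn sketch of one-vs-rest style classification on precomputed features. All arrays are synthetic stand-ins; the only detail taken from the text is that DeCAF_6 activations are 4096-dimensional.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    rng = np.random.default_rng(0)
    n_images, n_dims, n_styles = 1000, 4096, 20      # DeCAF_6 is 4096-D
    X = rng.normal(size=(n_images, n_dims))          # stand-in for real features
    Y = rng.integers(0, 2, size=(n_images, n_styles))  # one binary column per style

    # One independent binary classifier per style label.
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
    scores = clf.decision_function(X)                # (n_images, n_styles) confidences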

As we hypothesize that style features may be content dependent, we also train Content classifiers using the CNN features and an aggregated version of the PASCAL VOC dataset, and use them in second-stage fusion with other features.
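The exact fusion mechanism is not spelled out in this abstract; a minimal sketch, under the assumption that second-stage fusion means training another classifier on concatenated first-stage outputs (all names below are hypothetical):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fuse_second_stage(style_score_blocks, content_scores, y):
        """Train a second-stage classifier on concatenated first-stage outputs.

        style_score_blocks: list of (n_images, n_styles) score arrays, one per
        first-stage feature channel; content_scores: (n_images, n_content_classes)
        outputs of the PASCAL VOC content classifiers; y: binary labels for one
        style. All inputs are assumed precomputed.
        """
        Z = np.hstack(style_score_blocks + [content_scores])
        return LogisticRegression(max_iter=1000).fit(Z, y)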

Evaluation  Mean APs on three datasets for the considered single-channel features and their second-stage combination. Only the clearly superior features are evaluated on the Flickr and Wikipaintings datasets.

                   Fusion x Content   DeCAF_6   MC-bit   L*a*b* Hist   GIST    random
    AVA Style           0.581          0.579     0.539      0.288      0.220    0.132
    Flickr              0.368          0.336     0.328        -          -      0.052
    Wikipaintings       0.473          0.356     0.441        -          -      0.043
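For reference, mean AP here is the average precision of each label's ranking, averaged over labels; a minimal sketch with scikit-learn (the arrays are hypothetical):

    import numpy as np
    from sklearn.metrics import average_precision_score

    def mean_ap(Y_true, Y_score):
        """Y_true: (n_images, n_labels) binary; Y_score: real-valued scores."""
        aps = [average_precision_score(Y_true[:, k], Y_score[:, k])
               for k in range(Y_true.shape[1])]
        return float(np.mean(aps))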

We compare our predictors to human observers, using Amazon Mechanical Turk experiments, and find that our classifiers predict Group membership at essentially the same level of accuracy as Turkers. We also test on the AVA aesthetic prediction task, and show that using the "deep" object recognition features improves over state-of-the-art results.

Applications  Example of filtering image search results by style. Our Flickr Style classifiers are applied to images found on Pinterest. The images are searched by the text contents of their captions, then filtered by the response of the style classifiers. Here we show the top five results for the query "Dress."

[Figure: top results for the query "Dress" in the styles Pastel, Ethereal, and Noir]
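A minimal sketch of this style-filtered search, assuming a plain substring match over captions and a scoring function wrapping a trained style classifier (both style_score and the image records are hypothetical stand-ins):

    def style_filtered_search(query, style, images, style_score, k=5):
        """Return the top-k caption matches for `query`, ranked by style score.

        images: list of dicts with a "caption" key; style_score(image, style):
        the chosen style classifier's confidence for that image.
        """
        hits = [im for im in images if query.lower() in im["caption"].lower()]
        hits.sort(key=lambda im: style_score(im, style), reverse=True)
        return hits[:k]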

Code & Data  All data, trained predictors, and code are available at http://sergeykarayev.com/recognizing-image-style/.