Automatic Image Annotation (AIA)

Seminar Report Presented to:

Dr. Shanbehzadeh

Presented by: Farzaneh Rezaei

November 2015

2

What is the goal of computer vision?

Perceive the story behind the picture

See the world! But what exactly does it mean to see?

Source: Wall-e Movie: Pixar, Walt Disney Pictures

3

Outline

Introduction to Image Annotation
• What?
• Why?

Story Behind AIA
• Components of AIA
• Progress of AIA
• Issues & Conclusions

Going Deeper!
• Feature Extraction
• Learning Methods
• Deep Learning
• Conclusions

Useful Information
• Recent Articles
• Toolbox
• Databases
• Authors

Conclusions
• References

5

What is Automatic Image Annotation?

Automatic image annotation is the task of automatically assigning words to an image that describe the content of the image.

Munirathnam Srikanth, et al. Exploiting ontologies for automatic image annotation

Source: Personalizing Automated Image Annotation Using Cross-Entropy: https://ivi.fnwi.uva.nl/isis/publications/bibtexbrowser.php?key=LiICM2011&bib=all.bib

http://academic.research.microsoft.com/Author/1168641/munirathnam-srikanth



http://academic.research.microsoft.com/Publication/1842477/exploiting-ontologies-for-automatic-image-annotation



6

What is Automatic Image Annotation? (Cont.)

Source: MS COCO Captioning Challenge: http://mscoco.org/dataset/#captions-challenge2015

7

Why is image annotation important?

Recently, we have witnessed an exponential growth of user-generated videos and images, due to the boom of social networks such as Facebook and Flickr.

3,000 Photos Are Uploaded Every Second to Facebook

Source: http://petapixel.com/2012/02/01/3000-photos-are-uploaded-every-second-to-facebook/

8

Why is image annotation important? (Cont.)

Source: Barriuso, A., & Torralba, A. (2012). Notes on image annotation

• Applications, e.g. photo organizer apps
• Image classification systems

9

[Bar chart: number of articles per year with “Automatic Image Annotation” in the title, 2000 to 2015; vertical axis from 0 to 70. Reported by: Google Scholar]

11

How do you annotate these images?

12

What are the components of an Automatic Image Annotation system?

13

How to classify images?

14

Feature Extraction

Classification Methods

15




16



Feature Extraction

Pattern Recognition!

17

Slide Credit

18

An Example of classical approaches in AIA

Source: Zhang, D., Islam, M. M., & Lu, G. (2012). A review on automatic image annotation techniques. Pattern Recognition, 45(1), 346–362. doi:10.1016/j.patcog.2011.05.013

19

Issues of classical approaches

Theoretical Limitations of Shallow Architectures*

Functions that can be compactly represented by a depth k architecture might require an exponential number of computational elements to be represented by a depth k − 1 architecture.

*Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends® in Machine Learning

20

Issues of classical approaches (Cont.)

Theoretical Limitations of Shallow Architectures

• Shallow? Deep?

• Functions?

• Compact?

• Depth?

• Computational Elements?

logic circuit

21

Issues of classical approaches (Cont.)

Picture Source: Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends® in Machine Learning

[Figure panels: Depth 4, Depth 3]

22


• Linear regression and logistic regression have depth 1, i.e., a single level.
• Ordinary multi-layer neural networks, with the most common choice of one hidden layer, have depth two.
• Decision trees can also be seen as having two levels.
• Boosting (Freund & Schapire, 1996) usually adds one level to its base learners: that level computes a vote or linear combination of the outputs of the base learners.

23


• Shallow? Deep?

• Functions

• Compact

• Depth

• Computational Elements

25

• A two-layer circuit of logic gates can represent any boolean function (Mendelson, 1997).
• With depth-two logical circuits, most boolean functions require an exponential number of logic gates (Wegener, 1987) to be represented (with respect to input size).
• There are functions computable with a polynomial-size logic gate circuit of depth k that require exponential size when restricted to depth k − 1 (Håstad, 1986). The proof of this theorem relies on earlier results (Yao, 1985) showing that d-bit parity circuits of depth 2 have exponential size.
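The parity result can be illustrated with a small sketch. It is not taken from the cited papers: the function names are hypothetical, and the depth-2 circuit is assumed to be the plain OR-of-ANDs form with one AND term per odd-parity input. It compares that term count with a chain of d − 1 two-input XOR gates, which is a deep circuit rather than a fixed-depth one.

```python
from itertools import product


def parity_dnf_terms(d):
    """Inputs on which d-bit parity is 1; each one is an AND term of the OR-of-ANDs form."""
    return [bits for bits in product((0, 1), repeat=d) if sum(bits) % 2 == 1]


def parity_depth2(bits, terms):
    """Evaluate the OR-of-ANDs circuit: 1 if the input matches any term."""
    return int(any(bits == term for term in terms))


def parity_deep(bits):
    """Evaluate parity with a chain of d - 1 two-input XOR gates."""
    acc = bits[0]
    for b in bits[1:]:
        acc ^= b
    return acc


for d in (2, 4, 8):
    terms = parity_dnf_terms(d)
    # Columns: input size, AND terms in the depth-2 form, XOR gates in the chain.
    print(d, len(terms), d - 1)
```

The number of AND terms doubles with every extra input bit, while the XOR chain grows by one gate.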


26

• One might wonder whether these computational complexity results for boolean circuits are relevant to machine learning.
• See Orponen (1994) for an early survey of theoretical results in computational complexity relevant to learning algorithms.
• Interestingly, many of the results for boolean circuits can be generalized to architectures whose computational elements are linear threshold units (also known as artificial neurons (McCulloch & Pitts, 1943)), which compute:

$$f(x) = \mathbf{1}_{w^{\top} x + b \ge 0} \qquad (1)$$

with parameters w and b.
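As a minimal sketch of equation (1), the unit below outputs 1 when the weighted sum of its inputs plus the bias is non-negative, and 0 otherwise. The function name and the AND-gate weights are illustrative assumptions, not part of the original slides.

```python
import numpy as np


def threshold_unit(x, w, b):
    """Linear threshold unit: 1 if w . x + b >= 0, else 0."""
    return int(np.dot(w, x) + b >= 0)


# Illustrative parameters: with w = (1, 1) and b = -1.5 the unit acts as an AND gate.
w_and = np.array([1.0, 1.0])
b_and = -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, threshold_unit(np.array(x), w_and, b_and))
```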


27


1 Theoretical Limitations of Shallow Architectures

2 Theoretical Advantages of Deep Architectures

Which one ?? !

28

Slide Credit

29

Slide Credit

30

How to assign a word to an image?

• Feature Extraction
• Components of AIA
• Classical or Shallow Structure Issues

31

http://graffiti-artist.net/corporate-offices/ny-facebook-office-graffiti/

33

Going Deeper!

Feature Extraction & Representation
• Color
• Texture
• Shape
• Segmentation

Learning Methods
• ANN
• SVM
• Bayes
• Metadata

34

Feature Extraction

Color
• Color Histogram
• Color Moments
• Color Coherence Vector
• Color Correlogram
• Scalable Color Descriptor
• Color Structure Descriptor
• Dominant Color Descriptor

Texture
• Spatial: statistical, structural, model-based
• Spectral: FT, DCT, Wavelet, ...
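Two of the color features listed above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the histogram is computed per channel rather than jointly, the image is a random stand-in, and the function names are hypothetical.

```python
import numpy as np


def color_histogram(image, bins=8):
    """Per-channel histogram of an H x W x 3 uint8 image, normalized to sum to 1."""
    hists = []
    for c in range(3):
        h, _ = np.histogram(image[..., c], bins=bins, range=(0, 256))
        hists.append(h)
    hist = np.concatenate(hists).astype(float)
    return hist / hist.sum()


def color_moments(image):
    """First three moments per channel: mean, standard deviation, cube root of the third central moment."""
    pixels = image.reshape(-1, 3).astype(float)
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)
    skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))
    return np.concatenate([mean, std, skew])


# Random stand-in image (assumption: any RGB image loaded as uint8 would work the same way).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
hist = color_histogram(image)
moments = color_moments(image)
print(hist.shape, moments.shape)
```

The histogram has 3 × bins values, while the moments vector always has 9, which reflects the "high dimension" versus "compact" trade-off noted in the comparison of color methods.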

35

Color

36

Color

37

Color: Comparisons

Color method | Pros | Cons
Histogram | Simple to compute, intuitive | High dimension, no spatial info, sensitive to noise
CM | Compact, robust | Not enough to describe all colors, no spatial info
CCV | Spatial info | High dimension, high computation cost
Correlogram | Spatial info | Very high computation cost, sensitive to noise, rotation and scale

38

Color: Comparisons (Cont.)

Color method | Pros | Cons
DCD | Compact, robust, perceptual meaning | Need post-processing for spatial info
CSD | Spatial info | Sensitive to noise, rotation and scale
SCD | Compact on need, scalability | No spatial info, less accurate if compact

39

Spatial Texture: Comparisons

Texture method | Pros | Cons
Texton | Intuitive | Sensitive to noise, rotation and scale, difficult to define textons
GLCM-based method | Intuitive, compact, robust | High computation cost, not enough to describe all textures
Tamura | Perceptually meaningful | Too few features
SAR | Compact, robust, rotation invariant | High computation cost, difficult to define pattern size
FD | Compact, perceptually meaningful | High computation cost, sensitive to scale
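To make the GLCM-based method more concrete, here is a minimal sketch of a gray-level co-occurrence matrix and three statistics commonly derived from it. The quantization to 4 levels, the non-negative offset, and the function names are assumptions chosen for brevity.

```python
import numpy as np


def glcm(gray, levels=4, dx=1, dy=0):
    """Normalized co-occurrence matrix of a uint8 image for a non-negative offset (dx, dy)."""
    q = (gray.astype(int) * levels) // 256  # quantize 0..255 into 0..levels-1
    h, w = q.shape
    src = q[: h - dy, : w - dx]  # reference pixels
    dst = q[dy:, dx:]            # neighbours at the given offset
    m = np.zeros((levels, levels), dtype=float)
    np.add.at(m, (src.ravel(), dst.ravel()), 1.0)  # count each (reference, neighbour) pair
    return m / m.sum()


def glcm_features(p):
    """Contrast, energy, and homogeneity of a normalized co-occurrence matrix."""
    i, j = np.indices(p.shape)
    contrast = float(((i - j) ** 2 * p).sum())
    energy = float((p ** 2).sum())
    homogeneity = float((p / (1.0 + np.abs(i - j))).sum())
    return contrast, energy, homogeneity


# Two toy textures: a flat patch and vertical black/white stripes.
flat = np.zeros((8, 8), dtype=np.uint8)
stripes = np.tile(np.array([0, 255], dtype=np.uint8), (8, 4))
print(glcm_features(glcm(flat)))
print(glcm_features(glcm(stripes)))
```

The feature vector stays at three numbers regardless of image size, which matches the "compact" entry in the table, while the pair counting over every pixel hints at the computation cost.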

40

Spectral Texture: Comparisons (Cont.)

Texture method | Pros | Cons
FT/DCT | Fast computation | Sensitive to scale and rotation
Wavelet | Fast computation, multi-resolution | Sensitive to rotation, limited orientations
Gabor | Multi-scale, multi-orientation, robust | Need rotation normalisation, loss of spectral information due to incomplete cover of spectrum plane
Curvelet | Multi-resolution, multi-orientation, robust | Need rotation normalisation

41

Shape

Chart Source: [Zhang and Lu 2004]

42

Chart Source: [M. Yang, K. Kpalma, J. Ronsin 2008]

Shape (Cont.)

43

Shape (Cont.)

Contour Based
• Calculate shape features only from the boundary of the shape

Region Based
• Extract features from the entire region
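The difference between the two families can be sketched on a binary mask. This is an illustration only: the 4-neighbour boundary definition, the chosen features (area and centroid for the region, boundary pixel count for the contour), and the function names are assumptions.

```python
import numpy as np


def region_features(mask):
    """Region-based: area and centroid computed from every pixel of the shape."""
    ys, xs = np.nonzero(mask)
    return len(ys), (float(ys.mean()), float(xs.mean()))


def contour_pixels(mask):
    """Contour-based: keep only shape pixels that have a 4-neighbour outside the shape."""
    m = mask.astype(bool)
    p = np.pad(m, 1, constant_values=False)
    # A pixel is interior when its up, down, left, and right neighbours are all in the shape.
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return m & ~interior


# Toy shape: a 5 x 5 square inside a 7 x 7 image.
mask = np.zeros((7, 7), dtype=np.uint8)
mask[1:6, 1:6] = 1
area, centroid = region_features(mask)
boundary = contour_pixels(mask)
print(area, centroid, int(boundary.sum()))
```

Flipping a single boundary pixel changes the contour directly but shifts the area and centroid only slightly.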

44

Shape (Cont.)

• Contour-based techniques are more sensitive to noise than region-based techniques.
• Therefore, color image retrieval usually employs region-based shape features.

45

Learning Methods
• SVM
• ANN
• Tree
• Parametric
• Non-Parametric

46

Learning Methods: Comparisons

Annotation method | Pros | Cons
SVM | Small sample, optimal class boundary, non-linear classification | Single labelling, one class per time, expensive trial and run, sensitive to noisy data, prone to over-fitting
ANN | Multiclass outputs, non-linear classification, robust to noisy data, suitable for complex problems | Single labelling, sub-optimal, expensive training, complex and black-box classification
DT | Intuitive, semantic rules, multiclass outputs, fast, allows missing values, handles both categorical and numerical values | Single labelling, sub-optimal, needs pruning, can be unstable

47

Learning Methods: Comparisons (Cont.)

Annotation method | Pros | Cons
Non-parametric | Multi-labelling, model free, fast | Large number of parameters, large sample, sensitive to noisy data
Parametric | Multi-labelling, small sample, good approximation of unknown distribution | Predefined distribution, expensive training, approximated boundary
Metadata | Use of both textual and visual features | Difficult to relate visual features with textual features, difficult textual feature extraction
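As one possible illustration of a non-parametric, multi-labelling approach, the sketch below transfers keywords from the nearest training images to a query image. It is a toy example, not a method from the cited review: the features, vocabulary, and function name are made up, and plain Euclidean distance with unweighted voting is assumed.

```python
import numpy as np


def knn_annotate(query, train_feats, train_labels, k=3, n_words=2):
    """Return indices of the n_words keywords most frequent among the k nearest training images."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = train_labels[nearest].sum(axis=0)  # one vote per neighbour per keyword
    return np.argsort(-votes, kind="stable")[:n_words]


# Hypothetical vocabulary and toy 2-D features for six annotated training images.
vocab = ["sky", "sea", "tree", "grass"]
train_feats = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                        [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]])
train_labels = np.array([[1, 1, 0, 0],
                         [1, 0, 0, 0],
                         [1, 1, 0, 0],
                         [0, 0, 1, 1],
                         [0, 0, 1, 0],
                         [0, 0, 1, 1]])
words = knn_annotate(np.array([0.05, 0.05]), train_feats, train_labels)
print([vocab[i] for i in words])
```

There is no training step, and several keywords come out at once, but every query is compared against the whole training set, which is why such methods need a large sample.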

48

Deep Learning
• Deep belief networks
• Deep Boltzmann machines
• Deep Convolutional neural networks
• Deep Recurrent neural networks
• Hierarchical temporal memory

Source: https://en.wikipedia.org/wiki/List_of_machine_learning_concepts

https://en.wikipedia.org/wiki/Deep_belief_network

https://en.wikipedia.org/wiki/Boltzmann_machine

https://en.wikipedia.org/wiki/Convolutional_neural_network

https://en.wikipedia.org/wiki/Recurrent_neural_network

https://en.wikipedia.org/wiki/Hierarchical_temporal_memory

49

Deep Learning (Cont.)

Source: Ranzato, 4 October 2013, Slides

50

Deep Learning (Cont.)

• A potential problem with deep learning?*
• Optimization task
• See:
• Bengio’s articles
• Popular videos about deep learning on YouTube
• Ranzato, 4 October 2013: https://www.youtube.com/watch?v=clgMTk5V2Sk

*: Ranzato, 4 October 2013, Slides

52

Useful Information: Recent Articles

2009, Shallow

Source: Venkatesh N. Murthy, S. Maji, R. Manmatha, Automatic Image Annotation using Deep Learning Representations, 2015

53

Which one ?? !

1 Theoretical Limitations of Shallow Architectures

2 Theoretical Advantages of Deep Architectures

54

Useful Information: Recent Articles (Cont.)

Source: B. Klein, G. Lev, G. Sadeh, and L. Wolf, Fisher Vectors Derived from Hybrid Gaussian-Laplacian Mixture Models for Image Annotation, 2015

55

Useful Information: Toolbox

MatConvNet
• MatConvNet is a MATLAB toolbox implementing Convolutional Neural Networks (CNNs) for computer vision applications. It is simple, efficient, and can run and learn state-of-the-art CNNs. Several example CNNs are included to classify and encode images.

Caffe
• Caffe is a deep learning framework made with expression, speed, and modularity in mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by community contributors. Yangqing Jia created the project during his PhD at UC Berkeley. Caffe is released under the BSD 2-Clause license.

http://bvlc.eecs.berkeley.edu/

http://daggerfs.com/



https://github.com/BVLC/caffe/blob/master/LICENSE

56

Useful Information: Databases

Corel5k: An important benchmark for keyword-based image retrieval and image annotation. It contains 5,000 images manually annotated with 1 to 5 keywords. The vocabulary contains 260 words.

ESP Game: This data set is obtained from an online game where two players, who cannot communicate outside the game, gain points by agreeing on words describing the image.

IAPR TC12: This set of 20,000 images, accompanied by descriptions in several languages, was initially published for cross-lingual retrieval.

57

Useful Information: Databases (Cont.)
• Other databases:
• Flickr8,10,30

Table Source: M. Guillaumin, T. Mensink, J. Verbeek and C. Schmid, TagProp: Discriminative Metric Learning in Nearest Neighbor Models for Image Auto-Annotation

58

Useful Information: Authors

Cordelia Schmid
• Research director, INRIA
• Computer vision, object recognition, video recognition, learning

Li Fei-Fei
• Professor, Stanford University
• Artificial Intelligence, Machine Learning, Computer Vision, Neuroscience

Yoshua Bengio
• Professor, U. Montreal, Computer Sc.
• Machine learning, deep learning, artificial intelligence

Reported by: Google Scholar


59

Useful Information: Authors (Cont.)

Richard Socher
• MetaMind
• Deep learning, machine learning, natural language processing, computer vision
• Recursive Deep Learning for Natural Language Processing and Computer Vision, PhD Thesis, Computer Science Department, Stanford University
• 2014 Arthur L. Samuel Best Computer Science PhD Thesis Award

Reported by: Google Scholar

http://nlp.stanford.edu/~socherr/thesis.pdf

61

How to assign a word to an image?

• Feature Extraction
• Components of AIA
• Classical or Shallow Structure Issues
• Conclusions!

62

Conclusions (Cont.)

1. High-dimensional feature analysis.
2. How to build an effective annotation model?
3. Annotation and ranking are currently done online simultaneously in the multiple-labelling annotation approaches. This is not efficient for image retrieval.
4. Lack of standard vocabulary and taxonomy.
5. There is no commonly accepted image database.
6. Insufficient depth of architectures, and locality of estimators [Bengio, 2009].

Picture Source: Bengio, Y. (2009). Learning Deep Architectures for AI. Foundations and Trends® in Machine Learning

Source: Zhang, D., Islam, M. M., & Lu, G. (2012). A review on automatic image annotation techniques. Pattern Recognition, 45(1), 346–362. doi:10.1016/j.patcog.2011.05.013

63

References
