Novel Approaches to Natural Scene Categorization
Amit Prabhudesai
Roll No. 04307002
M.Tech Thesis Defence
Under the guidance of
Prof. Subhasis Chaudhuri
Indian Institute of Technology, Bombay
Natural Scene Categorization – p.1/32
Overview of topics to be covered
• Natural Scene Categorization: Challenges
• Our contribution
  ◦ Qualitative visual environment description
    • Portable, real-time system to aid the visually impaired
    • System has peripheral vision!
  ◦ Model-based approaches
    • Use of stochastic models to capture semantics
    • pLSA and maximum entropy models
• Conclusions and Future Work
Natural Scene Categorization
• Interesting application of a CBIR system
• Images from a broad image domain: diverse and often ambiguous
• Bridging the semantic gap
• Grouping scenes into semantically meaningful categories could aid further retrieval
• Efficient schemes for grouping images into semantic categories
Qualitative Visual Environment Retrieval
[Figure: an omnidirectional view partitioned into sectors (LT, RT, LB, RB, FR) labelled Sky, Building, Woods, Lawn and Water body, with sample points P1-P3]

• Use of omnidirectional images
• Challenges
  ◦ Unstructured environment
  ◦ No prior learning (unlike navigation/localization)
• Target application and objective
  ◦ Wearable computing community, emphasis on visually challenged people
  ◦ Real-time operation
Qualitative Visual Environment System: Overview
• Environment representation
• Environment retrieval
  ◦ View partitioning
  ◦ Feature extraction
  ◦ Node annotation
  ◦ Dynamic node annotation
  ◦ Real-time operation
• Results
System Overview (contd.)
• Environment representation
  ◦ Image database containing images belonging to 6 classes: Lawns (L), Woods (W), Buildings (B), Water-bodies (H), Roads (R) and Traffic (T)
  ◦ Moderately large intra-class variance (in the feature space) in images of each category
  ◦ Description relative to the person using the system: e.g., 'to left of', 'in the front', etc.
  ◦ Topological relationships indicated by a graph
  ◦ Each node annotated by an identifier associated with a class
System Overview (contd.)
• Environment Retrieval
  ◦ View Partitioning

[Figure: the omnicam view partitioned into sectors (LT, RT, LB, RB, FR, BS, XX) along the forward and backward directions, shown alongside its graphical representation]

  ◦ Feature Extraction
    • Feature invariant to scaling, viewpoint, illumination changes, and geometric warping introduced by omnicam images
    • Colour histogram selected as the feature for performing CBIR
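The feature-extraction step can be sketched as follows; the choice of 8 bins per channel and histogram intersection as the similarity measure are illustrative assumptions, since the slides fix neither the quantization nor the comparison metric.

```python
import numpy as np

def colour_histogram(image, bins=8):
    """Normalized RGB colour histogram of an image.
    `bins` per channel is an assumed quantization, not from the thesis."""
    # image: H x W x 3 uint8 array; flatten to a list of RGB triples
    hist, _ = np.histogramdd(
        image.reshape(-1, 3).astype(float),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)),
    )
    hist = hist.ravel()
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; a common CBIR measure for colour histograms."""
    return np.minimum(h1, h2).sum()
```

The compact feature vector (512 entries for 8 bins per channel) is what makes the later real-time linear scan over precomputed database histograms feasible.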
System Overview (contd.)
• Environment Retrieval
  ◦ Node annotation
    • Objective: robust retrieval against illumination changes and intra-class variations
    • Solution: annotation decided by a simple voting scheme
  ◦ Dynamic node annotation
    • Temporal evolution of graph Gn with time tn
    • Complete temporal evolution of the graph given by G, obtained by concatenating the subgraphs Gn, i.e., G = {G1, G2, . . . , Gk, . . .}
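The slides only say "a simple voting scheme", so the following is one plausible sketch: retrieve the k best database matches for a view and label the node by majority vote. The value k=5 and histogram intersection as the similarity are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def annotate_node(query_hist, database, k=5):
    """Label a view by a majority vote over its k best database matches.
    database: list of (class_label, precomputed_histogram) pairs.
    k and the similarity measure are illustrative assumptions."""
    sim = lambda h1, h2: np.minimum(h1, h2).sum()  # histogram intersection
    ranked = sorted(database, key=lambda lh: sim(query_hist, lh[1]), reverse=True)
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]
```

Voting over several neighbours, rather than taking the single best match, is what buys robustness against illumination changes and intra-class variation.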
System Overview (contd.)
• Environment Retrieval
  ◦ Real-time operation
    • Colour histogram: compact feature vector
    • Pre-computed histograms of all the database images
    • Linear time complexity (O(N)): on a P-IV 2.0 GHz, ∼100 ms for a single omnicam image
  ◦ Portable, low-cost system for the visually impaired
    • Modest hardware and software requirements
    • Easily put together using off-the-shelf components
System Overview (contd.)
• Results
◦ Cylindrical concentric mosaics
System Overview (contd.)
• Results
◦ Still omnicam image
System Overview (contd.)
• Results
  ◦ Omnivideo sequence

[Figure: frame-by-frame class labels (W, B, X, R, L) along the forward and backward directions of the omnivideo sequence, plotted against the frame index n]
Analyzing our results
• System accuracy: close to 70% – this is not enough!
• Some scenes are inherently ambiguous!
• Often the second-best class is the correct class
• Limitations
  1. Limited discriminating power of the global colour histogram (GCH)
  2. Local colour histogram (LCH) based on tiling cannot be used
  3. Each frame analyzed independently
• Possible solutions
  1. Adding memory to the system
  2. Clustering scheme before computing the similarity measure
Method I. Adding memory to the system
• System uses only the current observation in labeling
• Good idea to use all observations up to the current one
• Desired: a recursive implementation to calculate the posterior (should be able to do it in real-time!)
• Hidden Markov Model: parameter estimation using Kevin Murphy's HMM toolkit
• Challenges
  1. Estimation of the transition matrix – a possible solution is to use limited classes
  2. Enormous training data required
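The recursive posterior the slide asks for is the standard HMM filtering update, sketched below. The transition matrix A and the per-frame likelihoods (e.g. from histogram matching) are placeholders; the thesis estimates the actual model with Kevin Murphy's toolkit rather than hand-rolling the filter.

```python
import numpy as np

def filter_step(prior, A, likelihood):
    """One recursive update of P(class_t | observations up to t).
    prior: posterior from the previous frame, shape (K,)
    A: class-transition matrix, A[i, j] = P(class j | class i)
    likelihood: P(current observation | class), shape (K,)
    All quantities here are illustrative, not taken from the thesis."""
    predicted = A.T @ prior           # propagate through the transition model
    posterior = predicted * likelihood  # weight by the current observation
    return posterior / posterior.sum()
```

Because each step reuses only the previous posterior, the cost per frame is constant, which is what makes real-time operation plausible.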
Adding memory. . . (Results)
• Improved confidence in the results; however, negligible improvement in the accuracy
• Reasons for poor performance
  ◦ Limited number of transitions in categories (as opposed to locations)
  ◦ Typical training data for HMMs is thousands of labels: difficult to collect such vast data
• Limitation: makes the system dependent on the training sequence
Method II. Preclustering the image
• Presence of clutter, images from a broad domain
• Premise: the part of the image indicative of the semantic category forms a distinct part in the feature space

Some test images belonging to the 'Water-bodies' category

• Possible solution: segment out the clutter in the scene
Preclustering the image. . .
• K-means clustering of the image
• Use only pixels from the largest cluster to compute the colour histogram

Results of K-means clustering on the test images

• Results
  ◦ Accuracy improves significantly – for the 'Water-bodies' class, improvement from 25% to about 72%
• Limitation: what about, say, a traffic scene?!
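The preclustering step above can be sketched as plain Lloyd-style K-means on RGB pixels, keeping only the dominant cluster. The value k=3, the iteration count, and the RGB feature space are illustrative choices, not specified on the slide.

```python
import numpy as np

def dominant_cluster_pixels(image, k=3, iters=10, seed=0):
    """K-means on RGB pixels; returns the pixels of the largest cluster,
    which would then feed the colour histogram. k, iters and the seed
    are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, 3).astype(float)
    # initialize centres from randomly chosen pixels
    centres = pixels[rng.choice(len(pixels), k, replace=False)]
    for _ in range(iters):
        # assign each pixel to its nearest centre (squared distance)
        d = ((pixels[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # move each centre to the mean of its assigned pixels
        for j in range(k):
            if (labels == j).any():
                centres[j] = pixels[labels == j].mean(0)
    largest = np.bincount(labels, minlength=k).argmax()
    return pixels[labels == largest]
```

The traffic-scene objection on the slide is precisely where this breaks down: when no single cluster corresponds to the semantic content, the largest cluster may be clutter.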
Model-based approaches
• Stochastic models used to learn semantic concepts from training images
• Use of normal perspective images
• Use of local image features
• Two models examined
  1. probabilistic Latent Semantic Analysis (pLSA)
  2. Maximum entropy models
• Use of the 'bag of words' approach
Bag of words approach
• Local features more robust to occlusions and spatial variations
• Image represented as a collection of local patches
• Image patches are members of a learned (visual) vocabulary
• Positional relationships not considered!
• Data representation by a co-occurrence matrix
• Notation
  ◦ D = {d1, . . . , dN}: corpus of documents
  ◦ W = {w1, . . . , wM}: dictionary of words
  ◦ Z = {z1, . . . , zK}: (latent) topic variables
  ◦ N = {n(w, d)}: co-occurrence table
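Given the notation above, building the co-occurrence table n(w, d) is straightforward once each patch has been mapped to its nearest codebook entry; the list-of-word-indices input format is our assumption.

```python
import numpy as np

def cooccurrence_matrix(images_as_words, vocab_size):
    """Build the n(w, d) table for the bag-of-words representation.
    images_as_words: one list of visual-word indices per image (document).
    vocab_size: size of the learned visual vocabulary (125 in the thesis)."""
    N = np.zeros((vocab_size, len(images_as_words)), dtype=int)
    for d, words in enumerate(images_as_words):
        for w in words:
            N[w, d] += 1
    return N
```

This table, which discards all positional relationships by construction, is the only input both pLSA and the maximum entropy model see.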
pLSA model . . .
• Generative model
  ◦ select a document d with probability P(d)
  ◦ select a latent class z with probability P(z|d)
  ◦ select a word w with probability P(w|z)
• Joint observation probability
  P(d, w) = P(d) P(w|d), where P(w|d) = Σ_{z∈Z} P(w|z) P(z|d)
• Modeling assumptions
  1. Observation pairs (d, w) generated independently
  2. Conditional independence assumption: P(w, d|z) = P(w|z) P(d|z)
pLSA model . . .
• Model fitting
  ◦ Maximize the log-likelihood function L = Σ_{d∈D} Σ_{w∈W} n(d, w) log P(d, w)
  ◦ Equivalent to minimizing the KL divergence between the empirical distribution and the model
  ◦ EM algorithm to learn model parameters
• Evaluating the model on unseen test images
  ◦ P(w|z) and P(z|d) learned from the training dataset
  ◦ 'Fold-in' heuristic for categorization: learned factors P(w|z) are kept fixed, mixing coefficients P(z|d_test) are estimated using the EM iterations
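One EM iteration for pLSA can be sketched in matrix form as below; the shapes and variable names are ours, and the updates follow the standard derivation (E-step computes P(z|d, w), M-step re-estimates P(w|z) and P(z|d) from the weighted counts).

```python
import numpy as np

def plsa_em_step(N, Pw_z, Pz_d):
    """One EM iteration of pLSA on the co-occurrence table N (words x docs).
    Pw_z: P(w|z), shape (W, K); Pz_d: P(z|d), shape (K, D).
    A compact illustrative form of the standard updates."""
    # E-step: P(z|d,w) ∝ P(w|z) P(z|d), for every (w, d) pair
    Pz_dw = Pw_z[:, None, :] * Pz_d.T[None, :, :]     # shape (W, D, K)
    Pz_dw /= Pz_dw.sum(axis=2, keepdims=True) + 1e-12
    # M-step: re-estimate factors from the expected counts n(w,d) P(z|d,w)
    weighted = N[:, :, None] * Pz_dw
    Pw_z_new = weighted.sum(axis=1)                    # (W, K)
    Pw_z_new /= Pw_z_new.sum(axis=0, keepdims=True) + 1e-12
    Pz_d_new = weighted.sum(axis=0).T                  # (K, D)
    Pz_d_new /= Pz_d_new.sum(axis=0, keepdims=True) + 1e-12
    return Pw_z_new, Pz_d_new
```

The fold-in heuristic reuses exactly this loop on a test image, except that Pw_z is frozen and only the P(z|d_test) column is updated; the random initialization of these factors is the source of the convergence problems analyzed later.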
pLSA model . . .
• Details of the experiment to evaluate the model
  ◦ 5 categories: houses, forests, mountains, streets and beaches
  ◦ Image dataset: COREL photo CDs, images from internet search engines, and personal image collections
  ◦ 100 images of each category
  ◦ Modifications in Rob Fergus's code for the experiments
  ◦ 128-dim SIFT feature used to represent a patch
  ◦ Visual codebook with 125 entries
• Image annotation: z = arg max_i P(z_i | d_test)
pLSA model. . . Results
• 50 runs of the experiment, with random partitioning on each run
• Vastly different accuracy on different runs: best case ∼46%, worst case 5%
• Analysis of the results
  ◦ The confusion matrix gives us further insights
  ◦ Most of the labeling errors occur between houses and streets
  ◦ Ambiguity between mountains and forests
Results using the pLSA model
Figure: Some images that were wrongly annotated by our system
Results of the pLSA model . . .
• Comparison with the naive Bayes classifier

Figure: Confusion matrices for the pLSA and naive Bayes models

• 10-fold cross-validation test on the same dataset: mean accuracy ∼66%
Analysis of our results
• Reasons for poor performance
  ◦ Model convergence!
  ◦ Local optima problem in the EM algorithm
  ◦ The optimum value of the objective function depends on the initialized values
  ◦ We initialize the algorithm randomly at each run!
• Possible solution: deterministic annealing EM (DAEM) algorithm
• Even with DAEM, no guarantee of converging to the globally optimal solution
Maximum entropy models
• Maximum entropy prefers a uniform distribution when no data are available
• The best model is the one that:
  1. Is consistent with the constraints imposed by the training data
  2. Makes as few assumptions as possible
• Training dataset: {(x1, y1), (x2, y2), . . . , (xN, yN)}, where xi represents an image and yi represents a label
• Predicate functions
  ◦ Unigram predicate: co-occurrence statistics of a word and a label
    f_{v1,LABEL}(x, y) = 1 if y = LABEL and v1 ∈ x, and 0 otherwise
Maximum entropy models . . .
• Notation
  ◦ f: predicate function
  ◦ p̃(x, y): empirical distribution of the observed pairs
  ◦ p(y|x): stochastic model to be learnt
• Model fitting: the expected value of the predicate function w.r.t. the stochastic model should equal the expected value of the predicate measured from the training data
• Constrained optimization problem
  Maximize H(p) = −Σ_{x,y} p̃(x) p(y|x) log p(y|x)
  s.t. Σ_{x,y} p̃(x, y) f(x, y) = Σ_{x,y} p̃(x) p(y|x) f(x, y)
• Solution: p(y|x) = (1/Z(x)) exp(Σ_{i=1}^{k} λ_i f_i(x, y))
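Given trained weights λ_i, evaluating the exponential-form solution p(y|x) above reduces to a softmax over weighted predicate sums. The sketch below, including the `unigram` predicate constructor, is illustrative; in the thesis the weights come from Zhang Le's toolkit rather than being set by hand.

```python
import numpy as np

def unigram(word, label):
    """Unigram predicate f_{word,label}(x, y): fires (returns 1) exactly
    when the candidate label is `label` and `word` occurs in the image x."""
    return lambda x, y: 1 if (y == label and word in x) else 0

def maxent_predict(x_words, y_labels, lambdas, predicates):
    """Evaluate p(y|x) = exp(Σ_i λ_i f_i(x, y)) / Z(x) over candidate labels.
    x_words: visual-word indices of the image; lambdas: trained weights."""
    scores = np.array([
        sum(l * f(x_words, y) for l, f in zip(lambdas, predicates))
        for y in y_labels
    ])
    expd = np.exp(scores - scores.max())   # subtract max for numerical stability
    return expd / expd.sum()               # normalize by Z(x)
```

Note that Z(x) is a per-image normalizer, so prediction never needs the full joint distribution, only the predicate sums for each candidate label.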
Results for the maximum entropy model
• Same dataset, feature and codebook as used for the pLSA experiment
• Evaluation using Zhang Le's maximum entropy toolkit
• 25-fold cross-validation accuracy: ∼70%
• The second-best label is often the correct label: accuracy improves to 85%

Figure: Confusion matrices for the maximum entropy and naive Bayes models
A comparative study
Method                    # of categories   Training # per category   Perf. (%)
Maximum entropy                  5                    50                 70
pLSA                             5                    50                 46
Naive Bayes classifier           5                    50                 66
Fei-Fei                         13                   100                 64
Vogel                            6                  ∼100               89.3
Vogel                            6                  ∼100               67.2
Oliva                            8               250-300                 89

Table: A performance comparison with other studies reported in the literature.
Future Work
• Further investigations into the pLSA model
• The issue of model convergence
• The DAEM algorithm is not the ideal solution
• Using a richer feature set, e.g., a bank of Gabor filters
• For maximum entropy models, ways to define predicates that will capture semantic information better
THANK YOU