Matching Words and Pictures · 2008. 3. 25. · know exactly the right search terms to get useful...

Matching Words and Pictures

Rongzhou Shen, Danyi Feng, Qing Xia

March 4, 2008

Rongzhou Shen, Danyi Feng, Qing Xia () Matching Words and Pictures March 4, 2008 1 / 54

Introduction

Text and images are separately ambiguous


Introduction


Figure: Ubuntu 7.10 with Compiz Fusion


Introduction


However:They tend not to be when jointly displayed

Figure: Ubuntu 7.10 with Compiz Fusion


Outline

Applications

Data Preprocessing

Annotation Models

Correspondence Models

Model Integration

Evaluation

Experiments

Conclusion


Applications


Applications - Real World

Browsing and searching

Identify faces, naked people, pedestrians or cars


Applications - Real World

Browsing and searching

Identify faces, naked people, pedestrians or cars

However, there is a large disparity between user needs and whattechnology supplies


Applications - User Requests

A request to a stock photo library

Pretty girl doing something active, sporty in a summery setting, beach -not wearing lycra, exercise clothes - more relaxed in tee-shirt. Feature isabout deodorant so girl should look active - not sweaty but happy, healthy,carefree - nothing too posed or set up - nice and natural looking.



A request to a stock photo library

Pretty girl doing something active, sporty in a summery setting, beach -not wearing lycra, exercise clothes - more relaxed in tee-shirt. Feature isabout deodorant so girl should look active - not sweaty but happy, healthy,carefree - nothing too posed or set up - nice and natural looking.

So, what do users request from an image matching application?



Images both by object kinds and identities

Images by what they depict and by what they are about

Queries based on image histograms, texture, overall appearance, etc.are vanishingly uncommon

Text associated with images is extremely useful in practice



Images both by object kinds and identities

Images by what they depict and by what they are about

Queries based on image histograms, texture, overall appearance, etc.are vanishingly uncommon

Text associated with images is extremely useful in practice

Given the user requests, how do we fulfil these requests?


Linking Text and Images

BooleanQueries: Does the text link to the image? YES or NO

ProbabilityModels: Does the text link to the image? 83.74% possibility


Linking Text and Images

BooleanQueries: Does the text link to the image? YES or NO

ProbabilityModels: Does the text link to the image? 83.74% possibility

A Probability Model would be more beneficial: one doesn’t need toknow exactly the right search terms to get useful results.


Annotation and Correspondence

One possible way of using probability models is to predict text givenimages.

Annotation Predict annotations of entire images using all informationpresent.

Correspondence Associate particular words with particular imagesubstructures.


Annotation and Correspondence

One possible way of using probability models is to predict text givenimages.

Annotation Predict annotations of entire images using all informationpresent.

Correspondence Associate particular words with particular imagesubstructures.

So, what specific models are there and how are Annotation andCorrespondence done?


Data Preprocessing


Input Preprocessing

Each image is segmented using Normalised Cuts (Shi and Malik,2000).

Represent 8 largest regions in each image by computing a set of 40features for each region.


Input Preprocessing

Each image is segmented using Normalised Cuts (Shi and Malik,2000).

Represent 8 largest regions in each image by computing a set of 40features for each region.

What are the features?


Region features

Size Portion of the image covered by the region

Position Coordinates of the region center of mass normalized by theimage dimensions.

Color The average and standard deviation of (R,G,B), (L,a,b) and(r=R/(R+G+B), g=G/(R+G+B)) over the region

Texture The average and variance of 16 filter responses

Shape The ratio of the area to the perimeter squared, the momentof inertia and the ratio of the region area to that of itsconvex hull

We will refer to a region, together with the features, as a blob.


Annotation Models


Annotation Models - Introduction

Multi-Modal Hierarchical Aspect Model

Mixture of Multi-Modal Latent Dirichlet Allocation


Multi-Modal Hierarchical Aspect Models

Figure: Multi-modal extension of a hierarchical model for text.

Image regions are generated using a Gaussian distribution.

Words are generated using a multinomial distribution.



p(D|d) =∑

c

p(c)∏

w∈W

[∑

l

p(w |l , c)p(l |d)]Nw

Nw,d

∏

b∈B

[∑

l


Nb,d

l, c Levels, Clusters

w Words in document d

b The image regions in document d

D The set of observations for document

W, B The set of words and blobs for the document D = W ∪ BNw

Nw,d, Nb

Nb,dNormalisation constants for differing numbers of words and

blobs in each image.

We will refer to this as model I-0 in the results later on.



We also experimented with allowing a cluster dependent level structurewhere p(l |d) is replaced with p(l |c , d), and refered this as model I-1:

p(D|d) =∑

c

p(c)∏

w∈W

[∑

l

p(w |l , c)p(l |c , d)]Nw

Nw,d

∏

b∈B

[∑

l

p(b|l , c)p(l |c , d)]Nb

Nb,d

Both model I-0 and model I-1 are not generative, they are specificto the documents in the training set.



Two alternatives for generalization:

Marginalize out the training data

Estimate the mixing weights using a cluster specific averagecomputed during training (model I-2).



Two alternatives for generalization:

Marginalize out the training data

Estimate the mixing weights using a cluster specific averagecomputed during training (model I-2).

p(D) =∑

c

p(c)∏

w∈W

[∑

l

p(w |l , c)p(l |c)]Nw

Nw,d

∏

b∈B

[∑

l

p(b|l , c)p(l |c)]Nb

Nb,d



Latent Dirichlet Allocation(LDA)

A generative probabilistic model for independent collections of data whereeach collection is modelled by a randomly generated mixture over latentfactors.



Latent Dirichlet Allocation(LDA)

A generative probabilistic model for independent collections of data whereeach collection is modelled by a randomly generated mixture over latentfactors.

Example

Text modelling: A LDA model with two topics - CAT, DOGCAT: milk, meow, kittenDOG: puppy, bark, boneA document is generated by picking a distribution over topics, and giventhis distribution, picking the topic of each specific word. Then words aregenerated given their topics.



1 Choose one of J mixture components c ∼ Multinomial(η).




2 Conditioned on the mixture component, choose a mixture over Kfactors, θ ∼ Dir(αc).





3 For each of the N words:1 Choose one of K factors zn ∼ Multinomial(θ).2 Choose one of V words wn from p(wn|zn, c , β), the conditional

probability of wn given the mixture component and latent factor.





3 For each of the N words:1 Choose one of K factors zn ∼ Multinomial(θ).2 Choose one of V words wn from p(wn|zn, c , β), the conditional

probability of wn given the mixture component and latent factor.

4 For each of the M blobs:1 Choose a factor sm ∼ Multinomial(θ).2 Choose a blob bm from p(bm|sm, c , µ, Σ), a multivariate Gaussian

distribution with diagonal covariance, conditioned on the factor sm andthe mixture component c .


Correspondence Models - Introduction

It’s natural to want to build models that can predict words for specificimage regions rather than for a whole image, there are several simple ways:

The simplest : Discrete Data Translation

Hierarchical Clustering Model

MoM-LDA


Discrete Data Translation

It is inspired by machine translation: a lexicon links discrete objects (wordsin one language) to discrete objects(words in the other language)

Use K-mean to Vector-quantize representations of image regions Eachregion then get a single label( blob token).

Predict words :construct a joint probability table linking word tokento blob token


Correspondence from a Hierarchical Clustering Model

Hierarchical clustering models encode this correspondence throughCo-occurrence

Co-occurrence example

The word ”tiger” always co-occurs with an orange stripy region and neverotherwise.


Correspondence from a Hierarchical Clustering Model

Hierarchical clustering models encode this correspondence throughCo-occurrence

Co-occurrence example

The word ”tiger” always co-occurs with an orange stripy region and neverotherwise.

Word prediction

p(w |b) ∝∑

c

p(c)∑

l

p(l)p(w |l , c)p(b|l , c) (Region only)

p(w |b) ∝∑

c

p(c |B)∑

l

p(l)p(w |l , c)p(b|l , c) (Region cluster)


Integrating correspondence and

Hierarchical clustering


Integrating Correspondence and Hierarchical Cluster

Problem in translating from image region to word is similar with machinetranslation.

Problem in translating from English to French

sun → soleilBut if the surrounding words are computer relatedsun → sun


Integrating Correspondence and Hierarchical Cluster

Problem in translating from image region to word is similar with machinetranslation.

Problem in translating from English to French

sun → soleilBut if the surrounding words are computer relatedsun → sun

Solution

Build explicit correspondence information into existing hierarchicalclustering model

1 Linking Word Emission and Region Emission Probabilities withMixture Weights

2 Paired Word and Region Emission at Nodes


Linking Word Emission and Region Emission Probabilities

with Mixture Weights

To implement this strategy by having the vertical mixture weights for theregions carries over to that for words.

p(D|d) =∑

c

p(c)∏

w∈W

[∑

l

p(w |l , c)p(l |B , c , d)]Nw

Nw,d

∏

b∈B

[∑

l


Nb,d

wherep(l |B , c , d) ∝

∑

b∈B

p(l |b, c , d)


Linking Word Emission and Region Emission Probabilities

with Mixture Weights

To implement this strategy by having the vertical mixture weights for theregions carries over to that for words.

p(D|d) =∑

c

p(c)∏

w∈W

[∑

l

p(w |l , c)p(l |B , c , d)]Nw

Nw,d

∏

b∈B

[∑

l


Nb,d

wherep(l |B , c , d) ∝

∑

b∈B

p(l |b, c , d)

Dependence Models

This model is Model D-0p(l |c , d) =⇒ p(l |d)get D-1 cluster dependent level distributionsp(l) =⇒ p(l |d) get D-2 drop the dependency on training set


Paired Word and Region Emission at Nodes

This method further tightens the relationship between the regions andwords. Then the observed words and regions are emitted in pairsD = {(w , b)}

p(D|d) =∑

c

p(c)∏

(w ,b)∈D

[∑

l

p((w , b)|l , c)p(l |d)]




p(D|d) =∑

c

p(c)∏

(w ,b)∈D

[∑

l

p((w , b)|l , c)p(l |d)]


This model is Model C-0Model I-1 and I-2 are similarly modified to get C-1 and C-2




p(D|d) =∑

c

p(c)∏

(w ,b)∈D

[∑

l

p((w , b)|l , c)p(l |d)]


This model is Model C-0Model I-1 and I-2 are similarly modified to get C-1 and C-2

The correspondence is not provide in training data, we need toestimate correspondence as part of the training process


Estimate the correspondence

All methods for computing the correspondence assume that the probabilitythat a word and a segment correspond can be estimated by the probabilitythat they are emitted from the same node.Using w ⇔ b to denote that the word,w , and the region, b, correspond, incase C-0

p(w ⇔ b) ≈∑

c

p(c)∑

l

p((w , b)|l , c)p(l |d)


Correspondence Issues

Issues

The choice of correspondence model, Should there be a one-one mapbetween regions and words(Usually impossible.), there is no option ofdeciding that a region corresponds to no word.

Solution

A special word NULL

Fertility: one to many

Refusal to predict: when the annotation with the highest probabilitygiven the region has too low a probability


Evaluation


Evaluation

Measuring Annotation Performance:Compare the words predicted by models with words actually presentfor held-out data

Measuring Correspondence Performance:Examine whether we predict appropriate words for each particularregion


Evaluation-Annotation Models

Empirical word frequency

Relative to prediction performance

Matching the performance empirical density

A sensible indicator




Quality of word posterior distribution





Estimate the Kullback-Leibler divergence between predictivedistribution q(w |B),and target distribution p(w).






E(model)KL =

∑w∈vocabulary p(w)log p(w)

q(w |B






E(model)KL =

∑w∈vocabulary p(w)log p(w)

q(w |B

= constant − 1K

∑w∈observed logq(w |B).




Quality of word posterior distribtion

Goodness of the model:





Goodness of the model: loss function






Tradintinoal zero-one loss is highly misleading







EmodelNS = r/n − w/(N − n)

N the vocabulary sizen the number of actual words for the imager the number of words predicted correctly

w the number of words predicted incorrectly








N the vocabulary sizen the number of actual words for the imager the number of words predicted correctly

w the number of words predicted incorrectly

EmodelPR = r/n, based on the n best words.





E(model)KL = constant − 1

K


Goodness of the model:loss function







E(model)KL = constant − 1

K


Goodness of the model:loss function



EKL = 1K

∑data (E empirical

KL − EmodelKL )

ENS = 1K


NS − EmodelNS )

EPR = 1K


PR − EmodelPR )

Negtive → worse; positive → better


Evaluation-Correspondence Models

Manual correspondence scoring




Limit the size of the dataset

Contain significant noise






Assumption

A method that cannot predict annotations accurately is unlikely to predictcorrespondence well.






Using annotation as a proxy.

Assumption

A method that cannot predict annotations accurately is unlikely to predictcorrespondence well.



Using annotation as a proxy

Image based methods

Region based methods




Image based methodsCompute the annotation in the natural way.

Region based methods




Image based methodsCompute the annotation in the natural way.

Region based methodsConsider each region seperately.Use image based annotation method to computre region wordposteriors.Marginalize out the correspondence of each region from the model,and test it as an annotation model.


Experiments


Experiments

Use images from 160 CD’s from Corel image data set.

75% training sets, 25% “standard” held out sets.

Exclude words occur less than 20 times in test set.


Experiments

Annotation Results.

Correspondence Results


Experiments

Annotation Results.

KL-divergence score measure

Normalized score measure

Comparison of models using different scores

Correspondence Results


Annotation Results


Normalized score measure.

Comparison of models using different scores.


Annotation Results


Check for overfitting.Number of iteration.

Normalized score measure.



Annotation Results


Annotation Results


Check for overfitting:Number of iteration:


Annotation Results


Check for overfitting: not a big problemNumber of iteration:


Annotation Results


Check for overfitting: not a big problemNumber of iteration: performance on training set improved withnumber of iteratrions.


Annotation Results





Annotation Results



Normalized score measureGo from zero to some peak, and then to drop down to zero again.


Annotation Results

Annotation results are compared between models.Measure PR:


Annotation Results

Annotation results are compared between models.Measure NS:


Annotation Results

Annotation results are compared between models.Measure KL:


Annotation Results




Words are generallyassociated with pieces of images.Methods using clustering are reliant on having images which are closeto the training data.The result for the MoM-LDA model are worth nothing.


Correspondence Result

Study the keyword prediction error,E empiricalPR − Emodel

PR


Correspondence Result

Study the keyword prediction error,E empiricalPR − Emodel

PR

Paired word-blob emission approach improves correspondenceperformance over annotation performance.

For both correspondence and annotation, linear-D-0-region-onlymethod appears to be the best overall choice.


Discussion


Discussion

The Corel data set is simple

annotations are nounsvocabulary is small

Performance is almost certainly affected by the correspondence modelused

Have little information about the effect of supervision

Large scale evaluation of correspondence models is genuimely difficultthe key issue seems to be the entropy of the labels.


Thank you!


Any Questions?


Matching Words and Pictures · 2008. 3. 25. · know exactly the right search terms to get useful...

Documents

Transcript of Matching Words and Pictures · 2008. 3. 25. · know exactly the right search terms to get useful...