Transcript: "Synthesis for understanding and evaluating vision systems," Eero Simoncelli, Frontiers in Computer Vision Workshop, MIT, 21-24 Aug 2011

Page 1

Synthesis for understanding and evaluating vision systems

Eero Simoncelli

Howard Hughes Medical Institute, Center for Neural Science, and

Courant Institute of Mathematical Sciences, New York University

Frontiers in Computer Vision Workshop, MIT, 21-24 Aug 2011

Page 2

Computer vision

Robotics

Optics/imaging

Machine learning

Image processing

Computer graphics

Visual neuroscience

Visual perception

Page 4

Why should computer vision care about biological vision?

[Diagram of the visual pathway: Retina → Optic Nerve → LGN → Optic Tract → Visual Cortex]

Page 7

Why should computer vision care about biological vision?

•Optimized for general-purpose vision

[Diagram of the visual pathway: Retina → Optic Nerve → LGN → Optic Tract → Visual Cortex]

•Determines/limits what is perceived

•Useful scientific testing methodologies

Page 8

Illustrative example: building a classifier

1. Transform input to some feature space

2. Use ML to learn parameters on a large (labelled) data set

3. Test on another data set

4. Repeat
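As a minimal sketch of this recipe (not from the talk; the data, the feature transform, and the use of scikit-learn are illustrative assumptions):

```python
# Hypothetical illustration of the 4-step recipe: feature transform,
# learn parameters on a labelled training set, test on a held-out set, repeat.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

def feature_transform(images):
    """Step 1: map raw inputs into some feature space.
    Placeholder: flatten and standardize pixel intensities per image."""
    X = images.reshape(len(images), -1).astype(float)
    return (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + 1e-8)

# Placeholder data: 200 random "images" with binary labels.
rng = np.random.default_rng(0)
images = rng.normal(size=(200, 32, 32))
labels = rng.integers(0, 2, size=200)

X = feature_transform(images)                                   # step 1
X_train, X_test, y_train, y_test = train_test_split(X, labels)  # labelled split
clf = LinearSVC().fit(X_train, y_train)                         # step 2: learn parameters
print("held-out accuracy:", clf.score(X_test, y_test))          # step 3: test
# step 4: repeat, changing the feature transform and/or classifier.
```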

Page 10

Which features?

[Adelson & Bergen, 1985]

Page 11

Which features? Oriented filters capture the stimulus dependency of neural responses in primary visual cortex (area V1)

Simple cell

Complex cell +

[Adelson & Bergen, 1985]

Page 12

Which features? Oriented filters capture the stimulus dependency of neural responses in primary visual cortex (area V1)

Simple cell

Complex cell +

Simple cell

Complex cell +

[Adelson & Bergen, 1985]

Page 14

[Carandini, Heeger, and Movshon, 1996]

[Model diagrams: the linear model of simple cells (retinal image → firing rate); the normalization model of simple cells (retinal image → firing rate, with divisive input from the pooled responses of other cortical cells); and an RC circuit implementation.]
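A minimal numerical sketch of the divisive-normalization idea (a generic illustration, not the authors' RC-circuit implementation): each cell's half-squared linear response is divided by a constant plus the pooled responses of the other cortical cells.

```python
import numpy as np

def normalization_model(linear_responses, sigma=1.0, exponent=2.0):
    """Sketch of the normalization model of simple cells.
    linear_responses: array of linear receptive-field outputs, one per cell.
    Each rectified response is divided by sigma**n plus the pooled activity
    of the population (divisive normalization)."""
    rectified = np.maximum(linear_responses, 0.0) ** exponent   # half-squaring
    pool = rectified.sum()                                      # pooled cortical activity
    return rectified / (sigma ** exponent + pool)

print(normalization_model(np.array([3.0, 1.0, -0.5, 0.2])))
```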

Page 16

[Embedded page from the cited paper, including Figure 3, "The Spatial Footprint of Fast Adaptation": local luminance is the average luminance falling over the receptive field in a recent period of time; local contrast is computed by a suppressive field, by taking the square root of the squared and integrated responses of a pool of subunits, and this measure of local contrast sets the conductance of the contrast gain control stage. Increasing grating diameter reduces gain and integration time just as increasing contrast does, and across the population the suppressive field is roughly twice the diameter of the receptive-field center.]

[Mante, Bonin & Carandini 2008]

Dynamic retina/LGN model

Page 17

[Model diagram: two stages, each consisting of Linear Receptive Field → Half-squaring Rectification → Divisive Normalization.

Stage 1: input is image intensities; output is V1 neurons tuned for spatio-temporal orientation.

Stage 2: input is V1 afferents; output is MT neurons tuned for local image velocity.]

[Simoncelli & Heeger, 1998]

2-stage MT model

Page 19

Biology uses cascades of canonical operations....

• Linear filters (local integrals and derivatives): selectivity/invariance

• Static nonlinearities (rectification, exponential, sigmoid): dynamic range control

• Pooling (sum of squares, max, etc): invariance

• Normalization: preservation of tuning curves, suppression by non-optimal stimuli
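A hedged sketch of how these canonical operations can be cascaded, loosely in the spirit of the two-stage V1-to-MT model above (filter shapes, pooling size, and constants are placeholders, not a published model):

```python
import numpy as np
from scipy.signal import convolve2d

def canonical_stage(image, filters, sigma=0.1):
    """One canonical stage: linear filtering, half-squaring rectification,
    local pooling, and divisive normalization across channels."""
    # Linear filters: selectivity (each filter defines one response channel).
    responses = np.stack([convolve2d(image, f, mode="same") for f in filters])
    # Static nonlinearity: half-squaring for dynamic-range control.
    rectified = np.maximum(responses, 0.0) ** 2
    # Pooling: local sum over a small spatial neighborhood (invariance).
    box = np.ones((5, 5)) / 25.0
    pooled = np.stack([convolve2d(r, box, mode="same") for r in rectified])
    # Normalization: divide each channel by the summed activity of all channels.
    return pooled / (sigma + pooled.sum(axis=0, keepdims=True))

# Toy cascade with random filters (placeholders for oriented receptive fields).
rng = np.random.default_rng(1)
filters = [rng.normal(size=(7, 7)) for _ in range(4)]
image = rng.random((64, 64))
stage1 = canonical_stage(image, filters)                 # "V1-like" responses
stage2 = canonical_stage(stage1.mean(axis=0), filters)   # crude second stage on stage-1 output
```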

Page 20

Improved object recognition? “In many recent object recognition systems, feature extraction stages are generally composed of a filter bank, a non-linear transformation, and some sort of feature pooling layer [...] We show that using non-linearities that include rectification and local contrast normalization is the single most important ingredient for good accuracy on object recognition benchmarks. We show that two stages of feature extraction yield better accuracy than one....”

- From the abstract of “What is the Best Multi-Stage Architecture for Object Recognition?”, Kevin Jarrett, Koray Kavukcuoglu, Marc’Aurelio Ranzato and Yann LeCun, ICCV 2009

Page 21

Using synthesis to test models I: Gender classification

• 200 face images (100 male, 100 female)
• Labeled by 27 human subjects
• Four linear classifiers trained on subject data

[Graf & Wichmann, NIPS*03]

Page 22

Linear classifiers: SVM, RVM, Prot, FLD

Page 24

Linear classifiers: SVM, RVM, Prot, FLD

Classifier vectors may be visualized as images:

[Image grid: weight vectors w for SVM, RVM, Prot, and FLD, trained on the true labels and on the subject labels]

Page 25

[Image grid of gender-morphed faces for SVM, RVM, Prot, and FLD: the classifier image is subtracted or added in steps λ = −21, −14, −7, 0, 7, 14, 21]

[Wichmann, Graf, Simoncelli, Bülthoff, Schölkopf, NIPS*04]

Validation by “gender-morphing”
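A minimal sketch of the gender-morphing manipulation (the unit-norm scaling is an assumption, not necessarily the paper's convention): add or subtract a multiple of a classifier's weight image from a face, then ask observers to classify the result.

```python
import numpy as np

def gender_morph(face, weight_image, lam):
    """Add (lam > 0) or subtract (lam < 0) the classifier image from a face.
    The weight image is unit-normalized so lam is comparable across
    classifiers (an assumption, not necessarily the paper's units)."""
    w = weight_image / np.linalg.norm(weight_image)
    return face + lam * w

face = np.random.rand(64, 64)
weight_image = np.random.randn(64, 64)   # placeholder for an SVM/RVM/Prot/FLD weight image
morphs = [gender_morph(face, weight_image, lam) for lam in (-21, -14, -7, 0, 7, 14, 21)]
```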

Page 26

Perceptual validation: human subject responses

[Plot: percent correct (50 to 100%) vs. amount of classifier image added/subtracted (arbitrary units, 0.25 to 8.0), for SVM, RVM, Proto, and FLD]

[Wichmann, Graf, Simoncelli, Bülthoff, Schölkopf, NIPS*04]

Page 27

[Text and Figure 2 from the cited paper. Key points: many individual IT neurons respond selectively to particular object classes yet tolerate changes in position, size, pose, and illumination; simple linear classifiers can rapidly (within <300 ms of image onset) and accurately read out object category from a population of ~200 IT neurons despite variation in position and size, whereas the same classifiers fail on a simulated V1 population of equal size; object manifolds that are tangled in retina-like and V1-like representations become flattened ("untangled") in the IT representation. Figure 2 depicts the neuronal populations along the macaque ventral visual processing stream (retina/RGC, LGN, V1, V2, V4, IT), each stage re-representing the image, perhaps via a common transformation.]

Using synthesis to test models II: Ventral stream representation

[DiCarlo & Cox, 2007]

Page 28

[Figure panels for V1, V2, V4; axes: receptive field center / eccentricity (deg) vs. receptive field size (deg).]

Figure 1. Physiological measurements of receptive field size in macaque. (a) Receptive field size (diameter) as a function of receptive field center (eccentricity) for visual areas V1, V2, and V4. Data adapted from Gattass et al. (1981) and Gattass et al. (1988). The size-to-eccentricity relationship in each area is well described by a “hinged” line (see Methods). (b) Cartoon depiction of receptive fields with sizes based on physiological measurements. The center of each array is the fovea. The size of each circle is proportional to its eccentricity, based on the corresponding scaling parameter (slope of the fitted line in a). At a given eccentricity, a larger scaling parameter implies larger receptive fields. In our model, we use overlapping pooling regions that uniformly tile the image and are separable and of constant size in polar angle and log eccentricity (Supplementary Fig. 1).

[Gattass et al., 1981; Gattass et al., 1988]
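As a hedged illustration of the “hinged line” description (one plausible parameterization; the slopes and hinge value below are made up, not the fitted values from Gattass et al.):

```python
import numpy as np

def hinged_line(eccentricity, size_at_fovea, hinge, slope):
    """Receptive-field diameter as a 'hinged' function of eccentricity:
    roughly constant up to the hinge, then growing linearly (slope in deg/deg).
    This parameterization is an assumption for illustration only."""
    ecc = np.asarray(eccentricity, dtype=float)
    return size_at_fovea + slope * np.maximum(ecc - hinge, 0.0)

ecc = np.linspace(0, 50, 6)
# Illustrative parameters only; larger slopes for higher areas mimic the trend in the figure.
for area, slope in [("V1", 0.1), ("V2", 0.2), ("V4", 0.4)]:
    print(area, hinged_line(ecc, size_at_fovea=1.0, hinge=1.0, slope=slope))
```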

Page 29

[Image panels labeled V1, V2, V4, IT]

[Freeman & Simoncelli, Nature Neurosci, Sep 2011]

Page 30

[Model diagram: V1 cells combined by a canonical computation over ventral stream receptive fields to form a ventral stream “complex” cell]

[Freeman & Simoncelli, Nature Neurosci, Sep 2011]

Page 32

How do we test this?

[Model diagram, as on the previous slide: V1 cells combined by a canonical computation over ventral stream receptive fields to form ventral stream “complex” cells, yielding a vector of model responses (3.1, 1.4, 12.5, ...)]

[Freeman & Simoncelli, Nature Neurosci, Sep 2011]

Page 33

[Freeman & Simoncelli, Nature Neurosci, Sep 2011]

[Diagram: Original image → Model responses (3.1, 1.4, 12.5, ...) → Synthesized image]

Scientific prediction: such images should look the same (“Metamers”)

Idea: synthesize random samples from the equivalence class of images with identical model responses
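A hedged sketch of this synthesis idea (a toy pooled-energy model and a generic optimizer, not the authors' statistics or algorithm): start from noise and adjust the image until its model responses match those of the original, giving one sample from the equivalence class.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.signal import convolve2d

def model_responses(image, filters):
    """Toy 'model': spatially pooled energy of a few linear filters
    (a stand-in for the texture statistics used in the paper)."""
    return np.array([np.mean(convolve2d(image, f, mode="same") ** 2) for f in filters])

def synthesize(original, filters, shape=(16, 16), iters=100, seed=0):
    """Search for a new image whose model responses match the original's."""
    target = model_responses(original, filters)

    def mismatch(flat):
        diff = model_responses(flat.reshape(shape), filters) - target
        return float(np.sum(diff ** 2))

    start = np.random.default_rng(seed).random(shape).ravel()   # start from noise
    result = minimize(mismatch, start, method="L-BFGS-B", options={"maxiter": iters})
    return result.x.reshape(shape)

rng = np.random.default_rng(1)
filters = [rng.normal(size=(5, 5)) for _ in range(3)]
original = rng.random((16, 16))
metamer = synthesize(original, filters)   # different pixels, approximately matched responses
```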

Page 36

original image

Page 37

Page 38

synthesized image: should look the same when you fixate on the red dot

Page 39

Reading

[Freeman & Simoncelli, Nature Neurosci, Sep 2011]b

c

aFigure 7. Effects of crowding

on reading and searching.

!"#$Two metamers, matched

to the model responses of a

page of text from the first

paragraph of Herman

Melville’s “Moby Dick”. Each

metamer was synthesized

using a different foveal

location (the letter above each

red dot). These locations are

separated by the distance

readers typically traverse

between fixations49. In each

metamer, the central word is

largely preserved; farther in

the periphery the text is

letter-like but scrambled, as if

printed with non-latin

characters. Note that the

boundary of readability in the

first image roughly coincides

with the location of the fixation

in the second image. We

emphasize that these are

samples drawn from the set of

images that are perceptually

metameric; although they

illustrate the kinds of

distortions that result from the

model, no single example

represents “what an observer

sees” in the periphery. (b) The

notoriously hard-to-find

“Waldo” (character with the

red and white striped shirt)

blends into the distracting

background, and is only

recognizable when we (or the

model) look right at him.

Cross-hairs surrounding each

image indicate the location of

the model fovea. (c) A soldier

in Afghanistan wears

sandy-stone patterned

clothing to match the stoney

texture of the street, and

similarly blends into the

background.

Page 40

Camouflage


[Freeman & Simoncelli, Nature Neurosci, Sep 2011]

Page 41

Cascades of linear filtering, squaring/products, averaging over local regions....

Page 42

Cascades of linear filtering, squaring/products, averaging over local regions....

Can this really lead to object recognition?

Page 43

“Perhaps texture, somewhat redefined, is the primitive stuff out of which form is constructed”

- Jerome Lettvin, 1976

Cascades of linear filtering, squaring/products, averaging over local regions....

Can this really lead to object recognition?