Pay Attention to the Pixel, Understand the Scene Betterskong2/slides/20180514_AIML_UCI.pdf · When...

Shu KongCS, ICS, UCI

Pay Attention to the Pixel, Understand the Scene Better

pixel level labelingsemantic segmentation -- assigning a class label to each pixel

Background: Scene Parsing

wall painting pillow sofa cabinet towel


old days (before 2013), extracting features to represent pixels,unary pixel classification and pixel pairs for CRF

features, transforms,grouping, etc.


nowadays, deep learning

Convolutional Neural Network (CNN)

aggregating local information

avoid computing over whole image directly

Background: Convolutional Neural Net

Y. Handwritten digit recognition: Applications of neural net chips and automatic learning, Neural Computation, 1989.

Receptive Field @ high-level layers

Background: Receptive Field in CNN

photo credit to Honglak Lee

Image Classification

Background: CNN for Scene Parsing

cat vs. dog

Dense Prediction

Background: CNN for Scene Parsing

wall painting pillow sofa cabinet towel

Background: Perspective and Scale

size(car) > size(train)?

size(chair) > size(whiteboard)?

1. Perspective-aware Pooling

2. Pixel-wise Attentional Gating (PAG)

1. Attentional Pooling for the “right” receptive field

2. pixel-level dynamic routing

3. Discussion

Outline

Goal: deciding for each pixel the size of receptive field (RF) to aggregate information

Perspective-aware Pooling


The closer the object is to the camera, the larger size it appears in the image, the larger RF the model should use to aggregate information.



The closer the object is to the camera, the larger size it appears in the image, the larger RF the model should use to aggregate information..

Depth conveys the scale information.


Idea: making the pooling size adaptive w.r.t depth



dilated convolution (Atrous Convolution).


2D atrous convolution of different dilate rates.

allowing for larger RF to aggregate more contextual information


DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs


quantize the depth into five scales with dilate rates {1, 2, 4, 8, 16}





Multiplicative gating

When depth is not available in inference --


S. Kong, C. Fowlkes, Recurrent Scene Parsing with Perspective Understanding in the Loop, CVPR, 2018


Idea: train a depth estimator




Idea: train a depth estimator



Why better?

capacity, representation power

Consistent improvement



Recurrent refinement by adapting the predicted depth



Recurrently refining by adapting the predicted depth



Recurrent refinement by adapting the predicted depth



No depth at all?

Idea: train unsupervised attention map



No depth at all?

Idea: train unsupervised attention map



gt-depth

pred-depth

attention





3. Discussion

Pixel-wise Attentional Gating

Improvement

1. attention vs. depth

general, learning rule


Improvement



2. binary gating vs. weighted gating


S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

Improvement




computational saving



PAG produces binary masks, e.g., using argmax on softmax.



How to output binary maps but still allowing end-to-end training?



How to output binary maps but still allowing end-to-end training?

Idea: Gumbel-softmax trick[1,2]


[1] Categorical reparameterization with gumbel-softmax, ICLR, 2017[2] The concrete distribution: A continuous relaxation of discrete random variables, ICLR, 2017

The normal way for sampling is to use argmax operator on softmax.


Gumbel, E.J.: Statistics of extremes. Courier Corporation (2012)


Then,




Then,

Here is an alternative that

where




Then,

Here is an alternative that

where

random variable m follows



but not continuous, not differentiable


Categorical reparameterization with gumbel-softmax, ICLR, 2017The concrete distribution: A continuous relaxation of discrete random variables, ICLR, 2017


expressing a discrete random variable as a one-hot vector




expressing a discrete random variable as a one-hot vector

the Gumbel softmax relaxation is



one-hot-encoded categorical distribution

from argmax to uniform controlled by = 0~inf


Categorical reparameterization with gumbel-softmax, ICLR, 2017

PAG produces binary masks.

Two applications

1. parallel pooling branch for deciding the “right” receptive field







3. Discussion

Attentional Pooling for the “Right” Receptive Field

Improvement




computational saving



Visual summary of three tasks on three different datasets

Consistent result:

PAG improves baseline, and is comparable to Multiplicative gating


S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, arxiv 1805.01556, 2018

semantic segmentation







3. Discussion

Pixel-Level Dynamic Routing

dynamic computation time for each pixel of an image

It is useful with limited computation budget.


sparse PAG at each layer



sparse binary mask

Using KL-divergence term for sparse masks.



experiment of semantic segmentation



learning to skip layers for dynamic routing


[1] BlockDrop: Dynamic Inference Paths in Residual Networks[2] Convolutional Networks with Adaptive Computation Graphs[3] SkipNet: Learning Dynamic Routing in Convolutional Networks

experiment of semantic segmentation



Semantic segmentation on NYU-depth-v2 dataset

Boundary detection on BSDS500 dataset



Visualization of sparse binary masks

ponder map = pixel-level computation depth



indoor image panorama street scene






3. Discussion

Discussion

1. Pixel-wise Attentional Gating unit (PAG) allocates dynamic computation for pixels; it is general, agnostic to architectures and problems.

Discussion


2. PAG scales back computation with little cost in performance.

Discussion



enough savings? real-time?

Discussion



3. More attention to pixels for scene parsing.

Discussion

More Attention to Pixels for Scene Parsing

photo credit to Nathan Silberman, Wongun Choi, Sean Bell, Edward Hsiao, iRobot


2. PAG scales back computation with little cost in performance. real-time?

3. More attention to pixels for scene parsing.

4. Potentially a unified model for all these tasks. How?

Discussion

Thanks

Q&A

Charless FowlkesShu Kong

Pay Attention to the Pixel, Understand the Scene Betterskong2/slides/20180514_AIML_UCI.pdf · When...

Documents

Transcript of Pay Attention to the Pixel, Understand the Scene Betterskong2/slides/20180514_AIML_UCI.pdf · When...