Introduction to Semantic Segmentation using...

Introduction to Semantic Segmentation using Convolutional Neural NetworksRADOVIC MILOS, [email protected]

OutlineIntroduction Common computer vision tasks

Semantic segmentation - problem definition

Semantic segmentation vs instance segmentation

Applications

How does it work?

Representing the semantic segmentation task

Constructing an architecture

Resolving computational burden

Methods for upsampling

Transposed convolution

Defining a loss function

Network architectures

Fully Convolutional Networks for Semantic Segmentation (FCN)

U-net

The One Hundred Layers Tiramisu: Fully Convolutional DenseNets (FC-DenseNet) for Semantic Segmentation

Common computer vision tasksClassification

In classification we predict the class of an image.

Classification with localization

Here we predict the class label of an image as well as position of a bounding box surrounding the object.

Detection Segmentation

Object detection involves localization of multiple objects (that doesn’t have to belong to the same class).

Image segmentation is the process of partitioning an image into multiple segments.

Semantic segmentation - problem definition❏ The goal of semantic segmentation of an image is to label each and every pixel of an

image with a corresponding class of what is being represented.

❏ Because we're predicting for every pixel in the image, this task is commonly referred to as dense prediction.

Input Output Input Output

Semantic segmentation vs instance segmentation❏ Semantic segmentation does not separate instances of the same class. It only predicts

the category of each pixel.

❏ Instance segmentation is another approach for segmentation which does distinguish between separate objects of the same class (an example would be Mask R-CNN[1]).

Semantic segmentation Instance segmentation

[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Mask R-CNN, CVPR 2017.

Applications❏ Autonomous vehicles

❏ Medicine

❏ Fashion industry, scene understanding, etc...Organ segmentation Substructure segmentation Lesion segmentation

❏ Satellite (Or Aerial) Image Processing

Representing the semantic segmentation task❏ The goal of semantic segmentation is to take an image as input and output a

segmentation map where each pixel contains a class label represented as an integer.

❏ Target can be created by one-hot encoding the class labels - essentially creating an output channel for each of the possible classes.

Inputheight×width×3 (RGB color image)height×width×1 (grayscale image)

Targetheight×width×N_classes

argmax

Segmentation mapheight×width×1

Constructing an architecture

InputH x W x 3 Convolutions

H x W x Dl

OutputH x W x N_classes

Segmentation mapH x W x 1

❏ A naive approach: simply stack a number of convolutional layers (with same padding to preserve dimensions) and output a final segmentation map.

❏ This directly learns a mapping from the input image to its corresponding segmentation through the successive transformation of feature mappings; however, it's quite computationally expensive to preserve the full resolution throughout the network.

Resolving computational burden❏ In order to maintain expressiveness, we typically need to reduce height and width of

feature maps as we get deeper in the network. This is fine for classification but not for semantic segmentation!

❏ Resolution: encoder/decoder structure where we downsample the spatial resolution of the input, developing lower-resolution feature mappings which are learned to be highly efficient at discriminating between classes, and the upsample the feature representations into a full-resolution segmentation map.

InputH x W x 3

High-resH/2 x W/2 x D

1

High-resH/2 x W/2 x D

1

Med-resH/4 x W/4 x D

2

Med-resH/4 x W/4 x D

2

Low-resH/8 x W/8 x D

3 Segmentation mapH x W x 1

Methods for upsampling❏ Whereas pooling operations downsample the resolution by summarizing a local area with

a single value, "unpooling" operations upsample the resolution by distributing a single value into a higher resolution.

Transposed convolution❏ Transposed convolution is by far the most popular approach as it allows us to develop a

learned upsampling.

❏ In transposed convolution, we take a single value from the low-resolution feature map and multiply all of the weights in our filter by this value, projecting those weighted values into the output feature map.

Transposed convolution - 2d example Transposed convolution - 1d example

Transposed convolution❏ Transposed convolution can be represented as a convolution with some modification of

the input.

Conv2DTranspose with 3x3 kernel and no padding applied to a 4x4

input to give a 6x6 output.

Conv2D with stride 1 and 2x2 padding

A Conv2DTranspose with 3x3 kernel, stride of 2x2 and no padding applied to a 2x2 input to give a 5x5 output.

Conv2D with stride 1 and 2x2 padding

Defining a loss function❏ The most commonly used loss

function for the task of image segmentation is a pixel-wise cross entropy loss.

❏ Problem: Model is biased towards the most prevalent class.

Soft Dice loss❏ Dice coefficient is essentially a measure of overlap

between two samples.

❏ The soft Dice loss is normalized according to the size of the target mask so it does not struggle learning from classes with lesser spatial representation in an image.

Fully Convolutional Networks for Semantic Segmentation (FCN)❏ In FCN paper[2] authors utilized classification networks to serve as the encoder module of the network,

appending a decoder module to upsample the coarse feature maps into a full-resolution segmentation map.

❏ Adding layers and a spatial loss (pixel-wise cross entropy loss) produces an efficient architecture for end-to-end dense learning.

[2] Jonathan Long, Evan Shelhamer, Trevor Darrell, Fully Convolutional Models for Semantic Segmentation, CVPR 2015.

The FCN end-to-end dense prediction pipeline

FCN: From classifier to dense prediction❏ Encoder: AlexNet, VGG, and GoogLeNet

classification networks.

❏ Decoder: Transposed convolution layers.

Transforming fully connected layers into convolution layers enables a classification net to output a heatmap

Results on the on the validation set of PASCAL VOC 2011[3]

[3] http://host.robots.ox.ac.uk/pascal/VOC/voc2011/index.html

FCN: Combining what and where

FCN-8s

FCN-16sFCN-32s

FCN architectures - Image from the original paper

FCN: results

Comparison of skip FCNs on a PASCAL VOC 2011 dataset.

Refining fully convolutional nets by fusing information from layers with different strides improves segmentation detail.

Results on the PASCAL VOC 2011 and 2012 test sets.

U-net❏ Initially developed for the segmentation

of biomedical images.

❏ The U-net[4] architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization.

❏ Won the segmenting and tracking moving cells challenge at ISBI 2015.

[4] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015.[5] ISBI 2015 : International Symposium on Biomedical Imaging, https://biomedicalimaging.org/2015/.

Contaction

Expa

nsio

n

Increase the “What”Reduce the “Where”

Create high resolution segmentation mask

U-net: loss function❏ Authors used weighted pixel-wise cross entropy loss which force the network to learn the border pixels.

(a) raw image (c) mask for semantic segmentation

(d) weight map(b) overlay with ground truth segmentation

- activation in feature channel k at the pixel position x

K - number of classes

- total weight at the pixel position x

- weight value to balance the class frequencies

d1 - distance to the border of the nearest cell

d2 - distance to the border of the second nearest cell

- ground truth label at position x

U-net: overlap-tile strategy for seamless segmentation❏ Overlap-tile strategy allows the seamless segmentation of arbitrarily large images by splitting input

images into tiles (this may help us to overcome GPU memory limitations).

❏ To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image.

Result of segmentation within the yellow lines requires image data within the blue lines as input. Solid lines refer to the first segmentation and dashed line refers to the second segmentation. By

repeating this process, an entire segmentation mask will be obtained for each slice.

U-net: results

Segmentation results (IoU) on the ISBI cell tracking challenge 2015

Part of an input image of the “PhC-U373” data set (left). Segmentation result (cyan mask) with manual ground

truth (yellow border) - right.

Input image of the “DIC-HeLa” data set (left). Segmentation result (random colored masks) with

manual ground truth (yellow border) - right.

The One Hundred Layers Tiramisu: Fully Convolutional DenseNets (FC-DenseNet) for Semantic Segmentation

❏ In this paper[6] authors utilized ideas from the DenseNets[7] to deal with the problem of semantic segmentation.

❏ The network is composed of a downsampling path responsible for extracting coarse semantic features, (consisting from convolution, transition down and dense blocks) and an upsampling path trained to recover the input image resolution (consisting from convolution, transition up and dense blocks).

[6] Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, Yoshua Bengio, The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation, CVPR 2017[7] Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger, Densely Connected Convolutional Networks, CVPR 2017

FC-DenseNet: from standard convolution to dense block

Standard ConvNet Concept

ResNet Concept

Dense Block in DenseNet with Growth Rate k Concept

FC-DenseNet: Dense blockLet x

l be the output of the lth layer.

In a standard CNN, xl is computed by applying a non-linear transformation H (usually convolution

followed by a ReLU and often dropout) to the output of the previous layer xl-1

:

Pushing this idea further, DenseNets design a more sophisticated connectivity pattern that iteratively concatenates all feature outputs in a feedforward fashion. Thus, the output of the lth layer is defined as:

where [ ... ] represents the concatenation operation. In this case, H is defined as batch normalization, followed by ReLU, a convolution and dropout.

Diagram of a dense block of 4 layers.

K channels

K channels

K channels

K channels

4K channels

m channels

FC-DenseNet: resolving computational burden❏ Problem: Since the upsampling path increases the feature

maps spatial resolution, the linear growth in the number of features would be too memory demanding, especially for the full resolution features in the pre-softmax layer.

❏ Resolution: In order to overcome this limitation, the input of a dense block is not concatenated with its output. Thus, the transposed convolution is applied only to the feature maps obtained by the last dense block and not to all feature maps concatenated so far.

Inp

ut

of

a d

ense

blo

ck is

no

t co

nca

ten

ated

to

it’s

ou

tpu

t

FC-DenseNet: network architecture

Architecture details of FC-DenseNet103 model

Building blocks of fully convolutional DenseNets

Bottleneck

Do

wn

sam

plin

g p

ath

Up

sam

plin

g p

ath

FC-DenseNet: results

Results on CamVid[8] dataset (11 classes, 367 training frames, 101 validation frames, 233 testing frames, 360x480 resolution). Architectures: (1) 56 layers (FC-DenseNet56), with 4 layers per dense block and a growth rate of 12; (2) 67 layers (FC-DenseNet67) with 5 layers per dense block and a growth rate of 16; (3) 103 layers (FC-DenseNet103) with a growth rate k = 16; (4) Classic Upsampling, an architecture using standard convolutions in the upsampling path instead of dense blocks.

[7] http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/[8] http://www.cc.gatech.edu/cpl/projects/videogeometriccontext/

Results on Gatech[9] dataset (63 videos for training/validation and 38 for testing with 190 frames per video on average. There are 8 classes in the dataset.)

[email protected]

Introduction to Semantic Segmentation using...

Documents

Transcript of Introduction to Semantic Segmentation using...