Introduction to Semantic Segmentation using...
Transcript of Introduction to Semantic Segmentation using...
Introduction to Semantic Segmentation using Convolutional Neural NetworksRADOVIC MILOS, [email protected]
OutlineIntroduction Common computer vision tasks
Semantic segmentation - problem definition
Semantic segmentation vs instance segmentation
Applications
How does it work?
Representing the semantic segmentation task
Constructing an architecture
Resolving computational burden
Methods for upsampling
Transposed convolution
Defining a loss function
Network architectures
Fully Convolutional Networks for Semantic Segmentation (FCN)
U-net
The One Hundred Layers Tiramisu: Fully Convolutional DenseNets (FC-DenseNet) for Semantic Segmentation
Common computer vision tasksClassification
In classification we predict the class of an image.
Classification with localization
Here we predict the class label of an image as well as position of a bounding box surrounding the object.
Detection Segmentation
Object detection involves localization of multiple objects (that doesn’t have to belong to the same class).
Image segmentation is the process of partitioning an image into multiple segments.
Semantic segmentation - problem definition❏ The goal of semantic segmentation of an image is to label each and every pixel of an
image with a corresponding class of what is being represented.
❏ Because we're predicting for every pixel in the image, this task is commonly referred to as dense prediction.
Input Output Input Output
Semantic segmentation vs instance segmentation❏ Semantic segmentation does not separate instances of the same class. It only predicts
the category of each pixel.
❏ Instance segmentation is another approach for segmentation which does distinguish between separate objects of the same class (an example would be Mask R-CNN[1]).
Semantic segmentation Instance segmentation
[1] Kaiming He, Georgia Gkioxari, Piotr Dollár, Mask R-CNN, CVPR 2017.
Applications❏ Autonomous vehicles
❏ Medicine
❏ Fashion industry, scene understanding, etc...Organ segmentation Substructure segmentation Lesion segmentation
❏ Satellite (Or Aerial) Image Processing
Representing the semantic segmentation task❏ The goal of semantic segmentation is to take an image as input and output a
segmentation map where each pixel contains a class label represented as an integer.
❏ Target can be created by one-hot encoding the class labels - essentially creating an output channel for each of the possible classes.
Inputheight×width×3 (RGB color image)height×width×1 (grayscale image)
Targetheight×width×N_classes
argmax
Segmentation mapheight×width×1
Constructing an architecture
InputH x W x 3 Convolutions
H x W x Dl
OutputH x W x N_classes
Segmentation mapH x W x 1
❏ A naive approach: simply stack a number of convolutional layers (with same padding to preserve dimensions) and output a final segmentation map.
❏ This directly learns a mapping from the input image to its corresponding segmentation through the successive transformation of feature mappings; however, it's quite computationally expensive to preserve the full resolution throughout the network.
Resolving computational burden❏ In order to maintain expressiveness, we typically need to reduce height and width of
feature maps as we get deeper in the network. This is fine for classification but not for semantic segmentation!
❏ Resolution: encoder/decoder structure where we downsample the spatial resolution of the input, developing lower-resolution feature mappings which are learned to be highly efficient at discriminating between classes, and the upsample the feature representations into a full-resolution segmentation map.
InputH x W x 3
High-resH/2 x W/2 x D
1
High-resH/2 x W/2 x D
1
Med-resH/4 x W/4 x D
2
Med-resH/4 x W/4 x D
2
Low-resH/8 x W/8 x D
3 Segmentation mapH x W x 1
Methods for upsampling❏ Whereas pooling operations downsample the resolution by summarizing a local area with
a single value, "unpooling" operations upsample the resolution by distributing a single value into a higher resolution.
Transposed convolution❏ Transposed convolution is by far the most popular approach as it allows us to develop a
learned upsampling.
❏ In transposed convolution, we take a single value from the low-resolution feature map and multiply all of the weights in our filter by this value, projecting those weighted values into the output feature map.
Transposed convolution - 2d example Transposed convolution - 1d example
Transposed convolution❏ Transposed convolution can be represented as a convolution with some modification of
the input.
Conv2DTranspose with 3x3 kernel and no padding applied to a 4x4
input to give a 6x6 output.
Conv2D with stride 1 and 2x2 padding
A Conv2DTranspose with 3x3 kernel, stride of 2x2 and no padding applied to a 2x2 input to give a 5x5 output.
Conv2D with stride 1 and 2x2 padding
Defining a loss function❏ The most commonly used loss
function for the task of image segmentation is a pixel-wise cross entropy loss.
❏ Problem: Model is biased towards the most prevalent class.
Soft Dice loss❏ Dice coefficient is essentially a measure of overlap
between two samples.
❏ The soft Dice loss is normalized according to the size of the target mask so it does not struggle learning from classes with lesser spatial representation in an image.
Fully Convolutional Networks for Semantic Segmentation (FCN)❏ In FCN paper[2] authors utilized classification networks to serve as the encoder module of the network,
appending a decoder module to upsample the coarse feature maps into a full-resolution segmentation map.
❏ Adding layers and a spatial loss (pixel-wise cross entropy loss) produces an efficient architecture for end-to-end dense learning.
[2] Jonathan Long, Evan Shelhamer, Trevor Darrell, Fully Convolutional Models for Semantic Segmentation, CVPR 2015.
The FCN end-to-end dense prediction pipeline
FCN: From classifier to dense prediction❏ Encoder: AlexNet, VGG, and GoogLeNet
classification networks.
❏ Decoder: Transposed convolution layers.
Transforming fully connected layers into convolution layers enables a classification net to output a heatmap
Results on the on the validation set of PASCAL VOC 2011[3]
[3] http://host.robots.ox.ac.uk/pascal/VOC/voc2011/index.html
FCN: Combining what and where
FCN-8s
FCN-16sFCN-32s
FCN architectures - Image from the original paper
FCN: results
Comparison of skip FCNs on a PASCAL VOC 2011 dataset.
Refining fully convolutional nets by fusing information from layers with different strides improves segmentation detail.
Results on the PASCAL VOC 2011 and 2012 test sets.
U-net❏ Initially developed for the segmentation
of biomedical images.
❏ The U-net[4] architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization.
❏ Won the segmenting and tracking moving cells challenge at ISBI 2015.
[4] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, MICCAI 2015.[5] ISBI 2015 : International Symposium on Biomedical Imaging, https://biomedicalimaging.org/2015/.
Contaction
Expa
nsio
n
Increase the “What”Reduce the “Where”
Create high resolution segmentation mask
U-net: loss function❏ Authors used weighted pixel-wise cross entropy loss which force the network to learn the border pixels.
(a) raw image (c) mask for semantic segmentation
(d) weight map(b) overlay with ground truth segmentation
- activation in feature channel k at the pixel position x
K - number of classes
- total weight at the pixel position x
- weight value to balance the class frequencies
d1 - distance to the border of the nearest cell
d2 - distance to the border of the second nearest cell
- ground truth label at position x
U-net: overlap-tile strategy for seamless segmentation❏ Overlap-tile strategy allows the seamless segmentation of arbitrarily large images by splitting input
images into tiles (this may help us to overcome GPU memory limitations).
❏ To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image.
Result of segmentation within the yellow lines requires image data within the blue lines as input. Solid lines refer to the first segmentation and dashed line refers to the second segmentation. By
repeating this process, an entire segmentation mask will be obtained for each slice.
U-net: results
Segmentation results (IoU) on the ISBI cell tracking challenge 2015
Part of an input image of the “PhC-U373” data set (left). Segmentation result (cyan mask) with manual ground
truth (yellow border) - right.
Input image of the “DIC-HeLa” data set (left). Segmentation result (random colored masks) with
manual ground truth (yellow border) - right.
The One Hundred Layers Tiramisu: Fully Convolutional DenseNets (FC-DenseNet) for Semantic Segmentation
❏ In this paper[6] authors utilized ideas from the DenseNets[7] to deal with the problem of semantic segmentation.
❏ The network is composed of a downsampling path responsible for extracting coarse semantic features, (consisting from convolution, transition down and dense blocks) and an upsampling path trained to recover the input image resolution (consisting from convolution, transition up and dense blocks).
[6] Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, Yoshua Bengio, The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation, CVPR 2017[7] Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger, Densely Connected Convolutional Networks, CVPR 2017
FC-DenseNet: from standard convolution to dense block
Standard ConvNet Concept
ResNet Concept
Dense Block in DenseNet with Growth Rate k Concept
FC-DenseNet: Dense blockLet x
l be the output of the lth layer.
In a standard CNN, xl is computed by applying a non-linear transformation H (usually convolution
followed by a ReLU and often dropout) to the output of the previous layer xl-1
:
Pushing this idea further, DenseNets design a more sophisticated connectivity pattern that iteratively concatenates all feature outputs in a feedforward fashion. Thus, the output of the lth layer is defined as:
where [ ... ] represents the concatenation operation. In this case, H is defined as batch normalization, followed by ReLU, a convolution and dropout.
Diagram of a dense block of 4 layers.
K channels
K channels
K channels
K channels
4K channels
m channels
FC-DenseNet: resolving computational burden❏ Problem: Since the upsampling path increases the feature
maps spatial resolution, the linear growth in the number of features would be too memory demanding, especially for the full resolution features in the pre-softmax layer.
❏ Resolution: In order to overcome this limitation, the input of a dense block is not concatenated with its output. Thus, the transposed convolution is applied only to the feature maps obtained by the last dense block and not to all feature maps concatenated so far.
Inp
ut
of
a d
ense
blo
ck is
no
t co
nca
ten
ated
to
it’s
ou
tpu
t
FC-DenseNet: network architecture
Architecture details of FC-DenseNet103 model
Building blocks of fully convolutional DenseNets
Bottleneck
Do
wn
sam
plin
g p
ath
Up
sam
plin
g p
ath
FC-DenseNet: results
Results on CamVid[8] dataset (11 classes, 367 training frames, 101 validation frames, 233 testing frames, 360x480 resolution). Architectures: (1) 56 layers (FC-DenseNet56), with 4 layers per dense block and a growth rate of 12; (2) 67 layers (FC-DenseNet67) with 5 layers per dense block and a growth rate of 16; (3) 103 layers (FC-DenseNet103) with a growth rate k = 16; (4) Classic Upsampling, an architecture using standard convolutions in the upsampling path instead of dense blocks.
[7] http://mi.eng.cam.ac.uk/research/projects/VideoRec/CamVid/[8] http://www.cc.gatech.edu/cpl/projects/videogeometriccontext/
Results on Gatech[9] dataset (63 videos for training/validation and 38 for testing with 190 frames per video on average. There are 8 classes in the dataset.)