Torr Vision Group, Engineering Department
Semantic Image Segmentation with Deep Learning
Sadeep Jayasumana
07/10/2015
Collaborators: Bernardino Romera-Paredes, Shuai Zheng, Philip Torr
Live Demo - http://crfasrnn.torr.vision/
Outline
Semantic segmentation
Why?
CNNs for pixel-wise prediction
CRFs
CRF as RNN
Conclusion
Semantic Segmentation
• Recognizing and delineating objects in an image → classifying each pixel in the image
Why Semantic Segmentation?
• To help partially sighted people by highlighting important objects in their glasses
Why Semantic Segmentation?
• To let robots segment objects so that they can grasp them
Why Semantic Segmentation?
• Road scene understanding
• Useful for autonomous navigation of cars and drones
Image taken from the Cityscapes dataset.
Why Semantic Segmentation?
• Useful tool for editing images
Why Semantic Segmentation?
• Medical purposes: e.g. segmenting tumours, dental cavities, ...
Image taken from Mauricio Reyes.
ISBI Challenge 2015, dental X-ray images.
But How?
• Deep convolutional neural networks are successful at learning good representations of visual inputs.
• However, semantic segmentation requires a structured output: a label for every pixel.
CNN for Pixel-wise Labelling
• Usual convolutional networks
• Fully convolutional networks
Long et al., Fully Convolutional Networks for Semantic Segmentation, CVPR 2015.
Fully Convolutional Networks [Long et al., CVPR 2015]
Fully Convolutional Networks [Long et al., CVPR 2015]
+ Significantly improved the state of the art in semantic segmentation.
- Poor object delineation: e.g. spatial consistency is neglected.
[Figure: Image | FCN result | Ground truth]
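The poor delineation is easy to picture: the network predicts scores on a heavily downsampled grid and upsamples them back, so whole blocks of pixels share one score. A toy numpy sketch of the effect (nearest-neighbour upsampling for brevity and made-up scores; FCN actually uses learned, bilinearly initialised deconvolution):

```python
import numpy as np

# Coarse 2x2 score map for one class, as produced after 4x downsampling.
coarse = np.array([[0.9, 0.1],
                   [0.8, 0.2]])

# Upsample back to the 8x8 input resolution: every pixel in a 4x4 block
# gets the same score, so fine object boundaries cannot be recovered.
full = np.kron(coarse, np.ones((4, 4)))
assert full.shape == (8, 8)
```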
Conditional Random Fields (CRFs)
• A CRF can account for contextual information in the image.
[Figure: coarse output from the pixel-wise classifier → MRF/CRF modelling → output after CRF inference]
Conditional Random Fields (CRFs)
• Define a discrete random variable Xi for each pixel i.
• Each Xi can take a value from the label set: Xi ∈ {bg, cat, tree, person, …}, e.g. Xi = cat, Xj = bg.
• Connect the random variables to form a random field (MRF).
• The most probable assignment given the image → the segmentation.
Finding the Best Assignment
Pr(X1 = x1, X2 = x2, …, Xn = xn | I) = Pr(X = x | I)
Pr(X = x | I) ∝ exp(−E(x | I))
• Maximize Pr(X = x | I) → minimize E(x | I).
• So we have formulated the problem as an energy minimization.
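The equivalence between maximizing the probability and minimizing the energy can be checked numerically; a minimal sketch with made-up energies for four candidate labelings:

```python
import numpy as np

# Made-up energies E(x|I) for four candidate labelings x.
energies = np.array([3.2, 1.1, 4.0, 2.5])

# Pr(X = x | I) = exp(-E(x|I)) / Z, where Z normalises over the candidates.
unnormalized = np.exp(-energies)
probs = unnormalized / unnormalized.sum()

# The most probable labeling is exactly the minimum-energy one.
assert np.argmax(probs) == np.argmin(energies)
```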
E(x | I) = unary cost + pairwise cost

Unary energy
ψi(Xi = xi): your label doesn't agree with the initial classifier → you pay a penalty.

Pairwise energy
ψij(Xi = xi, Xj = xj): you assign different labels to two very similar pixels → you pay a penalty.
How do you measure similarity?
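To make the two terms concrete, here is a toy sketch on a 2×2 image with labels {bg, cat} (all penalty values are made up, and only 4-neighbour pairs are used for brevity):

```python
import numpy as np

# Initial classifier costs (e.g. negative log-probabilities) for labels
# {0: bg, 1: cat} at each pixel of a 2x2 image -- made-up numbers.
unary = np.array([[[0.2, 1.6], [0.3, 1.2]],
                  [[1.5, 0.4], [1.4, 0.3]]])   # shape (2, 2, n_labels)

# Pixel intensities; similar intensities -> high similarity weight.
intensity = np.array([[0.9, 0.8],
                      [0.1, 0.2]])

NEIGHBOURS = [((0, 0), (0, 1)), ((1, 0), (1, 1)),
              ((0, 0), (1, 0)), ((0, 1), (1, 1))]

def energy(labels, theta=2.0):
    """E(x | I) = sum_i psi_i(x_i) + sum_{i~j} psi_ij(x_i, x_j)."""
    e = sum(unary[i, j, labels[i, j]] for i in range(2) for j in range(2))
    for a, b in NEIGHBOURS:                  # Potts-style pairwise term:
        if labels[a] != labels[b]:           # differing labels cost more
            similarity = np.exp(-abs(intensity[a] - intensity[b]) / 0.5)
            e += theta * similarity          # when the pixels look similar
    return e

smooth = np.array([[0, 0], [1, 1]])  # labelling follows the intensity edge
noisy  = np.array([[0, 1], [1, 0]])  # labels disagree with similar neighbours
assert energy(smooth) < energy(noisy)
```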
Dense CRF Formulation
• Pairwise energies are defined for every pixel pair in the image.
E(x) = Σi ψi(xi) + Σi<j ψij(xi, xj)
• Exact inference is not feasible.
• Use approximate mean-field inference: approximate the distribution exp(−E(x)) ≈ Q(x) = Πi Qi(xi), a product of independent per-pixel marginals.
[Krähenbühl & Koltun, NIPS 2011.]
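A naive numpy sketch of the mean-field update (Gaussian pairwise affinities, Potts label compatibility; all numbers in the usage example are made up). This is O(N²) per iteration; the NIPS 2011 paper makes the message passing fast with high-dimensional filtering:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mean_field(unary, features, w=1.0, sigma=1.0, n_iters=5):
    """Naive mean-field inference for a fully connected CRF.
    unary: (N, L) label costs; features: (N, D) per-pixel features."""
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    kernel = np.exp(-d2 / (2.0 * sigma ** 2))   # Gaussian pixel affinity
    np.fill_diagonal(kernel, 0.0)               # no message to oneself

    q = softmax(-unary)                  # initialise Q_i from the unaries
    for _ in range(n_iters):
        msg = kernel @ q                 # messages from all other pixels
        # Potts compatibility: label l at pixel i is penalised by the mass
        # its neighbours place on every other label.
        pairwise = w * (msg.sum(axis=1, keepdims=True) - msg)
        q = softmax(-unary - pairwise)   # local update and normalisation
    return q

# Toy usage: two confident "bg" pixels pull a similar-looking,
# weakly "cat" pixel over to bg.
unary = np.array([[0.0, 2.0],
                  [0.0, 2.0],
                  [0.6, 0.4]])
features = np.array([[0.0], [0.1], [0.05]])
q = mean_field(unary, features)
print(q.argmax(axis=1))  # all three pixels end up labelled 0 (bg)
```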
Fully Connected CRFs as a CNN
[Block diagram: one mean-field iteration expressed with CNN operations (bilateral filtering, convolution, a second convolution for the compatibility transform, adding the unaries, SoftMax), taking the current marginals Q, the image I, and the unaries U as inputs.]
CRF as a Recurrent Neural Network
[Block diagram: the mean-field iteration (Bilateral → Conv → Conv → + → SoftMax) is applied repeatedly, taking the image, the unaries, and the previous marginals Q as inputs and producing the segmentation output; unrolling these CRF iterations gives CRF as RNN.]
• Each of these blocks is differentiable → we can backprop through the whole pipeline.
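The unrolling can be sketched in a few lines of numpy (forward pass only; the kernel, compatibility matrix, and unaries here are illustrative stand-ins, not the paper's learned parameters):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def crf_rnn_forward(unary, kernel, compat, n_iters=5):
    """Unrolled mean-field inference as a recurrent net: every iteration is
    message passing (a filtering step), a compatibility transform (like a
    1x1 convolution), adding the unaries, and a SoftMax. Each step is
    differentiable, so an autograd framework could backprop through all
    iterations into the CNN that produced the unaries."""
    q = softmax(-unary)                 # initial marginals from the unaries
    for _ in range(n_iters):
        msg = kernel @ q                # filtering / message passing
        pairwise = msg @ compat         # label compatibility transform
        q = softmax(-unary - pairwise)  # add unaries, renormalise
    return q

# Toy usage: three pixels, two labels; the third pixel's weak prediction
# is corrected by its (fully connected) neighbours.
unary = np.array([[0.0, 2.0], [0.0, 2.0], [0.6, 0.4]])
kernel = 1.0 - np.eye(3)    # uniform affinity between distinct pixels
compat = 1.0 - np.eye(2)    # Potts: penalise differing labels
q = crf_rnn_forward(unary, kernel, compat)
```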
Putting Things Together
[Diagram: FCN followed by the CRF-RNN module, trained end-to-end.]
Experiments
FCN [Long et al., 2015]: 68.3
FCN + CRF [Chen et al., 2015]: 69.5
CRF-RNN (ours): 72.9
Try our demo: http://crfasrnn.torr.vision
Code & model: https://github.com/torrvision/crfasrnn
Shuai Zheng
Bernardino Romera-Paredes
Philip Torr
Examples
http://pp.vk.me/c622119/v622119584/20dc3/7lS5BU2Bp_k.jpg
Examples
http://media1.fdncms.com/boiseweekly/imager/mountain-bikers-are-advised-to-dism/u/original/3446917/walk_thru_sheep_1_.jpg
Examples
http://img.rtvslo.si/_up/upload/2014/07/22/65129194_tour-3.jpg
Examples
http://www.toxel.com/wp-content/uploads/2010/11/bike05.jpg
Not-so-good examples
http://www.independent.co.uk/incoming/article10335615.ece/alternates/w620/planecat.jpg
http://i1.wp.com/theverybesttop10.files.wordpress.com/2013/02/the-world_s-top-10-best-images-of-camouflage-cats-5.jpg?resize=375,500
Not-so-good examples
Tricky examples
http://se-preparer-aux-crises.fr/wp-content/uploads/2013/10/Golum.png
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRf4J7Hszkc8Wf6riVUX-cV_K-un8LJy5dYIBW1KDIn6i7UCzGHpg
Tricky examples
http://i.huffpost.com/gen/1478236/thumbs/s-DIRD6-large640.jpg
Tricky examples
Conclusion
• CNNs yield a coarse prediction on pixel-labelling tasks.
• CRFs improve the result by accounting for contextual information in the image.
• Learning the whole pipeline (CNN + CRF) end-to-end significantly improves the results.
Thank You!