[unofficial] Pyramid Scene Parsing Network (CVPR 2017)
Pyramid Scene Parsing Network
Hengshuang Zhao1, Jianping Shi2, Xiaojuan Qi1, Xiaogang Wang1, Jiaya Jia1
1The Chinese University of Hong Kong, 2SenseTime Group Limited
Presentation: Shunta Saito
Slide: Powered by Deckset
(c) Preferred Networks 1
Summary
• Introduce the Pyramid Pooling Module to better capture context with sub-region awareness
Why did I choose this paper?
• Presented in CVPR 2017
• 1st place in ImageNet Scene Parsing Challenge 2016 (ADE20K)
• Was in 1st place on the Cityscapes leaderboard
• It is now in 2nd place (I noticed this last week!)
Agenda
1. Common building blocks in semantic segmentation
2. Major Issue
3. Prior Work
4. Pyramid Pooling Module
5. Experiment results
Semantic Segmentation
• Predict pixel-wise labels from natural images
• Each pixel in an image belongs to an object class
• So it's not instance-aware!
Common Building Blocks (1)
Fully convolutional network (FCN)1
• A deep convolutional neural network which doesn't include any fully-connected layers
• Almost all recent methods are based on FCN
• Typically pre-trained on ImageNet in a classification setting
1 "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html
Common Building Blocks (2)
Dilated convolution1
• Widen receptive field without reducing feature map resolution
• Important for leveraging global context prior efficiently
1 "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html
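As a rough illustration of how dilation widens the receptive field, here is a minimal 1-D sketch in plain Python. The function names are hypothetical and it uses a "valid" convolution without padding, so the output is slightly shorter than the input. With padding, as in a real network, the resolution would be preserved.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid 1-D cross-correlation with `dilation - 1` skipped inputs between taps."""
    span = (len(kernel) - 1) * dilation + 1  # inputs covered by one output value
    return [
        sum(k * signal[start + i * dilation] for i, k in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 layers with the given dilation rates."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


signal = list(range(10))
print(dilated_conv1d(signal, [1, 1, 1], dilation=1))  # each output covers 3 inputs
print(dilated_conv1d(signal, [1, 1, 1], dilation=2))  # each output covers a span of 5
print(receptive_field(3, [1, 2, 4]))                  # three stacked 3-tap layers
```

The same 3-tap kernel covers a wider span as the dilation rate grows, with no pooling or striding involved.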
Common Building Blocks (3)
Multi-scale feature ensemble
• Higher-layer features contain more semantic meaning and less location information
• Combining multi-scale features can improve performance1
1 "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html
Common Building Blocks (4)
Conditional random field (CRF)
• Post-processing to refine the segmentation result (DeepLab1)
• Some follow-up methods refine the network via end-to-end modeling (DPN2, CRF as RNN3, Detections and Superpixels4)
4 "Higher order conditional random fields in deep neural networks", ECCV 2016
3 "Conditional random fields as recurrent neural networks", ICCV 2015
2 "Semantic image segmentation via deep parsing network", ICCV 2015
1 "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html
Common Building Blocks (5)
Global average pooling (GAP)
• ParseNet1 showed that global average pooling with FCN can improve semantic segmentation results
• But the global descriptors used in the paper are not representative enough for some challenging datasets like ADE20K
1 "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html
Major Issue (1)
Mismatched relationship
• Co-occurrent visual patterns imply some contexts
• e.g., an airplane is likely to be flying in the sky, not over a road
• Being unable to collect contextual information increases the chance of misclassification
• In the right figure, FCN predicts the boat in the yellow box as a "car" based on its appearance
Major Issue (2)
Confusing Classes
• There are confusing classes in major datasets: field and earth; mountain and hill; wall, house, building and skyscraper, etc.
• Even an expert human annotator still makes 17.6% pixel error on ADE20K1
• FCN predicts the object in the box as part skyscraper and part building, but the whole object should be either skyscraper or building, not both
• Utilizing the relationship between classes is important
1 "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html
Major Issue (3)
Inconspicuous Classes
• Small objects like streetlights and signboards are inconspicuous and hard to find, although they may be important
• Predictions for big objects may be discontinuous; e.g., FCN couldn't correctly label the pillow, which has a similar appearance to the sheet
• To improve performance for small or very big objects, more attention should be paid to sub-regions
Summary of Issues
• Use co-occurrent visual patterns as context
• Consider relationship between classes
• Pay more attention to sub-regions
Prior Work
Global Average Pooling (GAP)1
• The receptive field of ResNet is already larger than the input image, so GAP sounds like a good way to summarize all the information
• But the pixels in an image may belong to various objects of different sizes, so directly fusing them into a single vector may lose the spatial relation and cause ambiguity
1 "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html
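The loss of spatial relation can be seen in a toy example. This is a minimal sketch on a single-channel 2x2 map, with hypothetical names, where two different layouts collapse to the same global descriptor:

```python
def global_average_pool(fmap):
    """Collapse an H x W map (list of lists) into a single average value."""
    cells = [value for row in fmap for value in row]
    return sum(cells) / len(cells)


# Two maps with the same content in different places.
sky_on_top = [[1.0, 1.0], [0.0, 0.0]]
sky_at_bottom = [[0.0, 0.0], [1.0, 1.0]]

# GAP cannot tell them apart: both descriptors are identical.
print(global_average_pool(sky_on_top) == global_average_pool(sky_at_bottom))
```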
Prior Work
Spatial Pyramid Pooling (SPP)1
• Apply pooling with different kernel/stride sizes to the feature maps
• Then flatten and concatenate the pooling results to make a fixed-length representation
• Context information is still lost
1 "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html
Pyramid Pooling Module
• A hierarchical global prior, containing information at different scales and varying among different sub-regions
• The Pyramid Pooling Module builds a global scene prior on top of the final-layer feature map
Pyramid Pooling Module
• Use 1x1 conv to reduce the number of channels
• Then upsample them (bilinear) to the same size and concatenate them all
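The pool, upsample, and concatenate flow can be sketched in plain Python on a single-channel map. This is only an illustration under simplifying assumptions: the 1x1 conv is omitted because there is only one channel, nearest-neighbour upsampling stands in for bilinear, and all function names are hypothetical.

```python
import math


def adaptive_avg_pool(fmap, bins):
    """Average-pool an H x W map (list of lists) into a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for by in range(bins):
        y0, y1 = (by * h) // bins, math.ceil((by + 1) * h / bins)
        row = []
        for bx in range(bins):
            x0, x1 = (bx * w) // bins, math.ceil((bx + 1) * w / bins)
            cells = [fmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def upsample_nearest(pooled, h, w):
    """Nearest-neighbour upsampling back to H x W (the slides use bilinear)."""
    bins = len(pooled)
    return [[pooled[y * bins // h][x * bins // w] for x in range(w)] for y in range(h)]


def pyramid_pooling(fmap, levels=(1, 2, 3, 6)):
    """Return the input map plus one upsampled context map per pyramid level."""
    h, w = len(fmap), len(fmap[0])
    channels = [fmap]
    for bins in levels:
        channels.append(upsample_nearest(adaptive_avg_pool(fmap, bins), h, w))
    return channels


fmap = [[float(y * 6 + x) for x in range(6)] for y in range(6)]
out = pyramid_pooling(fmap)
print(len(out))      # the input plus 4 pyramid levels
print(out[1][0][0])  # 1x1 level: the global average, repeated at every position
```

The coarsest level behaves like global average pooling, while the finer levels keep a separate summary for each sub-region.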
Implementation details (1)
• Average pooling is applied at four pyramid levels with bin sizes of 1x1, 2x2, 3x3, and 6x6 (which determine the pooling ksize and stride)
• A pre-trained ResNet model with dilated convolution is used as the feature extractor (the output size is 1/8 of the input image)
• They use two losses:
1. softmax loss between final layer and labels
2. softmax loss between an intermediate output of ResNet and labels1 (weighted by 0.4)
1 "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html
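How the two losses combine can be sketched as follows. This is a minimal plain-Python illustration, assuming each prediction is a flat list of per-pixel class scores. The function names are hypothetical and only the 0.4 weight comes from the slide.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one pixel's class scores against its integer label."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


def mean_pixel_loss(logit_map, labels):
    """Average the per-pixel loss over a flat list of pixels."""
    losses = [softmax_cross_entropy(l, t) for l, t in zip(logit_map, labels)]
    return sum(losses) / len(losses)


def total_loss(main_logits, aux_logits, labels, aux_weight=0.4):
    """Main loss on the final output plus the down-weighted auxiliary loss."""
    main = mean_pixel_loss(main_logits, labels)
    aux = mean_pixel_loss(aux_logits, labels)
    return main + aux_weight * aux


labels = [0, 1]
main_logits = [[2.0, 0.0], [0.0, 2.0]]  # final-layer scores for two pixels
aux_logits = [[1.0, 0.0], [0.0, 1.0]]   # intermediate-output scores
print(total_loss(main_logits, aux_logits, labels))
```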
Implementation details (2)
Optimization
• MomentumSGD with weight decay
• Momentum: 0.9
• Weight decay: 0.0001
LR Scheduling
• (formula not preserved in this transcript)
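For the LR scheduling, the paper describes a "poly" policy in which the learning rate is the base rate multiplied by (1 - iter/max_iter)^power. A minimal sketch follows, assuming the paper's reported base rate of 0.01 and power of 0.9. Neither value is readable from this slide, and the function name is hypothetical.

```python
def poly_lr(iteration, max_iter, base_lr=0.01, power=0.9):
    """"Poly" schedule: decay from base_lr to 0 over max_iter iterations."""
    return base_lr * (1.0 - iteration / max_iter) ** power


max_iter = 90000  # e.g. the Cityscapes iteration count from the slides
for it in (0, 30000, 60000, 90000):
    print(it, poly_lr(it, max_iter))
```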
Implementation details (3)
Training iteration
• ADE20K: 150K
• PASCAL VOC: 30K
• Cityscapes: 90K
Dataset augmentation
• Random mirror
• Random resize between 0.5 and 2
• Random rotation between -10 and 10 degrees
• Random Gaussian blur for ADE20K and PASCAL VOC
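Sampling these augmentation parameters for each training image could look like the sketch below. The ranges come from the slide. The uniform distributions, the 50% probabilities for mirror and blur, and all names are assumptions made for illustration.

```python
import random


def sample_augmentation(rng, use_blur):
    """Draw one set of augmentation parameters for a training image."""
    return {
        "mirror": rng.random() < 0.5,               # assumed 50% chance
        "scale": rng.uniform(0.5, 2.0),             # random resize factor
        "rotation_deg": rng.uniform(-10.0, 10.0),   # random rotation angle
        "gaussian_blur": use_blur and rng.random() < 0.5,  # assumed 50% chance
    }


rng = random.Random(0)
print(sample_augmentation(rng, use_blur=True))   # ADE20K / PASCAL VOC
print(sample_augmentation(rng, use_blur=False))  # Cityscapes (no blur listed)
```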
Implementation details (4)
• An appropriately large "cropsize" can yield good performance
• "batchsize" in the batch normalization layer is of great importance
Cropsize
• ADE20K: 473 x 473
• PASCAL VOC: 473 x 473
• Cityscapes: 713 x 713
Batchsize
• 16 for all datasets
Implementation details (5)
Distributed Batch Normalization
• To increase the "batchsize" in batch normalization layers, they used a custom BN layer applied to data gathered from multiple GPUs using OpenMPI
• We have Akiba-san's implementation of distributed batch normalization!
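The idea behind gathering statistics across GPUs can be sketched in plain Python: each worker reports its count, sum, and sum of squares, and the combined values give the same mean and variance as one large batch. This is not the authors' OpenMPI layer nor the implementation mentioned above. The names are hypothetical and an ordinary Python sum stands in for the all-reduce step.

```python
def local_stats(values):
    """Per-GPU statistics: (count, sum, sum of squares) for one channel."""
    return len(values), sum(values), sum(v * v for v in values)


def global_mean_var(per_gpu_stats):
    """Combine per-GPU statistics into the mean/variance of the full batch."""
    count = sum(s[0] for s in per_gpu_stats)     # stands in for an all-reduce
    total = sum(s[1] for s in per_gpu_stats)
    total_sq = sum(s[2] for s in per_gpu_stats)
    mean = total / count
    var = total_sq / count - mean * mean         # E[x^2] - E[x]^2
    return mean, var


gpu0 = [1.0, 2.0]  # activations seen by the first GPU
gpu1 = [3.0, 4.0]  # activations seen by the second GPU
print(global_mean_var([local_stats(gpu0), local_stats(gpu1)]))
```

Each GPU alone would normalize with a different mean, whereas the combined statistics match what a single batch of all four values would produce.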
ImageNet Scene Parsing Challenge 2016
• Dataset: ADE20K
• 150 classes and 1,038 image-level labels
• 20,000/2,000/3,000 pixel-level labels for train/val/test
Ablation Study for Pyramid Pooling Module
• Average pooling works better than max pooling in all settings
• Pyramid pooling outperforms global pooling
• With dimension reduction (DR; reducing the number of channels after pyramid pooling), the performance is further enhanced
Ablation Study for Auxiliary Loss
• They varied the auxiliary loss weight between 0 and 1 and compared the final results
• A weight of 0.4 yields the best performance
Ablation Study for the ResNet Part
Deeper is better
More Detailed Performance Analysis
Improvement (% in mIoU) from each additional processing step
• Data augmentation (DA): +1.54
• Auxiliary loss (AL): +1.41
• Pyramid pooling module (PSP): +4.45
• Use deeper ResNet (50 to 269): +2.13
• Multi-scale testing (MS): +1.13
• For multi-scale testing, they create predictions at 6 different scales (0.5, 0.75, 1, 1.25, 1.5, and 1.75) and average them
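Multi-scale averaging can be sketched as below. It assumes a hypothetical `predict_at_scale(image, scale)` callback that resizes the image, runs the network, and returns per-pixel class probabilities already resized back to the original resolution. That callback is not part of any real library.

```python
SCALES = (0.5, 0.75, 1.0, 1.25, 1.5, 1.75)


def multi_scale_predict(image, predict_at_scale, scales=SCALES):
    """Sum per-pixel class probabilities over scales, then take the argmax.

    Dividing by len(scales) would give the average, but it does not change
    which class has the highest score, so the sum is used directly.
    """
    summed = None
    for scale in scales:
        probs = predict_at_scale(image, scale)  # one probability list per pixel
        if summed is None:
            summed = [list(p) for p in probs]
        else:
            for acc, p in zip(summed, probs):
                for c, value in enumerate(p):
                    acc[c] += value
    return [max(range(len(acc)), key=acc.__getitem__) for acc in summed]


def fake_predict(image, scale):
    """Stand-in model: the first pixel is misjudged at the small scales only."""
    first = [0.2, 0.8] if scale < 1.0 else [0.7, 0.3]
    return [first, [0.1, 0.9]]


print(multi_scale_predict(None, fake_predict))
```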
Results on PASCAL VOC 2012
• Extended with the Semantic Boundaries Dataset (SBD)1, they used 10582, 1449, and 1456 images for train/val/test
• Mismatched relationship: For "aeroplane" and "sky" in the second and third rows, PSPNet finds the missing parts
• Confusing classes: For "cows" in row one, the baseline model treats it as "horse" and "dog" while PSPNet corrects these errors
• Inconspicuous classes: For "person", "bottle" and "plant" in the following rows, PSPNet performs well on these small-object classes compared to the baseline model
1 "Semantic Contours from Inverse Detectors", ICCV 2011, http://home.bharathh.info/pubs/codes/SBD/download.html
Results on PASCAL VOC 2012
• Comparing PSPNet with previous best-performing methods on the testing set based on two settings, i.e., with or without pre-training on the MS-COCO dataset
Results on Cityscapes
• The Cityscapes dataset consists of 2975, 500, and 1525 train/val/test images (19 classes)
• 20000 coarsely annotated images are also available (in the table below, ‡ means they are used)
Thank you for your attention
• The official repository doesn't include any training code
• My own implementation for both training and testing is ready:
• mitmul/segmentation: https://github.pfidev.jp/mitmul/segmentation
• Now I'm training a model to ensure the reproducibility
• Once the reproduction work is finished, I'll send the code to ChainerCV
• The training on Cityscapes dataset takes over 20 days using 8 GPUs even with ResNet50-based PSPNet (They used ResNet101 for Cityscapes)
• ChainerMN is now a necessary tool for such large-scale datasets and deep models
• So, we need more GPU machines connected to each other with InfiniBand