Automatic Portrait Segmentation and Matting
Xiaoyong Shen, The Chinese University of Hong Kong
Research on CV
• Pixel based (low level / early vision)
  • Filtering, restoration, denoising, enhancement, deblurring, editing, dehazing, etc.
• Region/patch based (middle level vision)
  • Matching, optical flow, stereo matching, tracking, segmentation, etc.
• Object/semantic based (high level vision)
  • Semantic segmentation, object detection, image classification, recognition, etc.
My Research on CV
• Pixel based (low level vision)
  • Filtering, restoration, denoising, enhancement, deblurring, editing, dehazing, etc.
• Region/patch based (middle level vision)
  • Matching, optical flow, stereo matching, tracking, segmentation, etc.
• Object based (high level vision)
  • Semantic segmentation, object detection, image classification, recognition, etc.
Multi-Spectral Image Restoration
• Input
  • Noisy RGB image I0, e.g. captured at night
  • Clean guidance image G, e.g. dark-flashed NIR or flashed RGB images
• Output
  • Denoised image I
  • Structures are as clear as in the guidance G.
  • Appearance is the same as in image I0.
  • Shadows and highlights do not affect the result.
[TPAMI 2015]
Scale Map
• Given $I^*$, the expected noise-free ground-truth image, our scale map $s$ is defined by the condition
$$\min_s \|\nabla I^* - s \cdot \nabla G\|$$
• It adapts the structures of $G$ to those of $I^*$.
• It is an ideal ratio map between $\nabla G$ and $\nabla I^*$.
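To make the "ideal ratio map" reading concrete, the sketch below computes a per-sample ratio between the gradients of a clean signal and a guidance signal. It is a toy 1-D illustration only: in practice I* is unknown and the scale map has to be estimated. The function name `ideal_scale_map` and the `eps` threshold are assumptions made for this example, not part of the talk.

```python
import numpy as np

def ideal_scale_map(clean, guide, eps=1e-6):
    """Ratio s such that grad(clean) ~= s * grad(guide), toy 1-D version."""
    g_clean = np.diff(clean)
    g_guide = np.diff(guide)
    s = np.zeros_like(g_clean, dtype=float)
    ok = np.abs(g_guide) > eps          # ratio is undefined where the guidance is flat
    s[ok] = g_clean[ok] / g_guide[ok]
    return s

# Guidance and clean signal share one edge; the last edge has reversed contrast.
guide = np.array([0.0, 0.0, 1.0, 1.0, 0.5])
clean = np.array([0.2, 0.2, 0.7, 0.7, 0.9])
s = ideal_scale_map(clean, guide)
```

A negative entry in `s` marks an edge whose contrast direction differs between the two images, and a zero entry marks a place where the guidance carries no gradient to scale.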
Result
[Figure panels: Input Noisy Image; Input NIR Image; Our Result; Ground Truth]
[Full-size comparison: RGB Input I; NIR Input G; BM3D; Our Result]
Mutual-Structure Filter
[ICCV 2015 Oral Presentation]
Depth/RGB Restoration
[Figure panels: Noisy Depth; Noisy RGB Image; Ground Truth; Ours (PSNR = 37.19)]
Rolling Guidance Filter
One line of code only: $I_{t+1} = JF(I_0, I_t)$
[ECCV 2014 Oral Presentation]
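The iteration above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming JF is a joint bilateral filter that smooths the input I_0 using range weights taken from the current estimate I_t, and assuming the iteration starts from a constant image so that the first pass is a plain Gaussian blur. The parameter names and default values are hypothetical.

```python
import numpy as np

def joint_bilateral(image, guide, radius=3, sigma_s=2.0, sigma_r=0.1):
    """Smooth `image` with spatial weights and range weights from `guide`."""
    h, w = image.shape
    pad_i = np.pad(image, radius, mode="edge")
    pad_g = np.pad(guide, radius, mode="edge")
    num = np.zeros((h, w))
    den = np.zeros((h, w))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y0, x0 = radius + dy, radius + dx
            shifted_i = pad_i[y0:y0 + h, x0:x0 + w]
            shifted_g = pad_g[y0:y0 + h, x0:x0 + w]
            w_s = np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2))
            w_r = np.exp(-((guide - shifted_g) ** 2) / (2 * sigma_r ** 2))
            num += w_s * w_r * shifted_i
            den += w_s * w_r
    return num / den                     # den >= 1 because the centre weight is 1

def rolling_guidance(image, iterations=4, **params):
    current = np.zeros_like(image)       # constant start (assumption)
    for _ in range(iterations):
        # I_{t+1} = JF(I_0, I_t): always filter the input, guided by the estimate
        current = joint_bilateral(image, current, **params)
    return current
```

Small-scale texture is averaged away in the first pass, and later passes only sharpen edges that survive in the guidance, which is why the loop body really is a single line.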
Texture Removal
Halftone Image
De-Filter
One line of code only: $I_{t+1} = I_t + (I_0 - F(I_t))$
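As a rough illustration of this update rule, the sketch below reverses a simple known blur on a 1-D signal. The filter `smooth` (a circular 3-tap blur) is a stand-in chosen for the example, not a filter named in the talk, and whether the iteration converges depends on the filter being reversed.

```python
import numpy as np

def smooth(x):
    """Example filter F: circular 3-tap blur with weights [0.25, 0.5, 0.25]."""
    return 0.25 * np.roll(x, 1) + 0.5 * x + 0.25 * np.roll(x, -1)

def defilter(filtered, f, iterations=50):
    estimate = filtered.copy()                      # start from the filtered image I_0
    for _ in range(iterations):
        # I_{t+1} = I_t + (I_0 - F(I_t))
        estimate = estimate + (filtered - f(estimate))
    return estimate

t = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
original = np.sin(2 * t) + 0.3 * np.sin(8 * t)
filtered = smooth(original)
restored = defilter(filtered, smooth)
```

Each step adds back the part of I_0 that the current estimate fails to reproduce after filtering, so the estimate moves toward an image whose filtered version matches I_0.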
Reverse Skin Retouch
[Figure panels: Retouched input; Reversed; Before retouch]
Multi-Spectral Matching
• Match general multi-spectral images with significant displacement and obvious structure inconsistency
[Figure panels: Different Exposures; RGB/Depth; RGB/NIR; Flash/No-flash]
Result
• Match RGB/NIR image pair
[Figure panels: Inputs; Our Result; Blended]
Applications
• HDR construction
[Figure panels: Without Alignment; With Alignment; Constructed HDR]
Internet Image Matching
Reference Input
Dense Correspondences ?
Correspondence Exists
No Correspondence
[SIGGRAPH ASIA 2016]
Our Motivation
Reference Input
Dense Correspondences ?
Foremost Region Matching
Time-lapse Generation
Automatic Morphing
Object-based Matching
• Achieve higher accuracy with the help of the object (person)
[Figure panels: State-of-the-art; Ours]
Classification and Segmentation
• Fine-grained classification
  • DeepLAC (CVPR 2015)
• Text detection and recognition
• Semantic object segmentation
  • Portrait segmentation and matting
  • VOC challenge
Automatic Portrait Segmentation
Motivation
• Abundant portraits in smartphone photos
  • Samsung UK: portraits 30%, others 70%
  • Symon Whitehorn from HTC: portraits 90%, others 10%
Portrait Post-processing
Foreground Selection
Quick Selection
Automatic Segmentation
Automatic?
Challenges
• Similar color
• Complex background
• Various accessories
• Low contrast
• Diverse pose
• Complicated edges
Possible Solutions
• Graph-cut with face tracker
• CNNs for semantic segmentation
Most Related Work
• Interactive Image Selection
  • Lazy snapping [Li et al. 2004]
  • Grabcut [Rother et al. 2004]
  • Paint Selection [Liu et al. 2009]
• CNNs for Semantic Object Segmentation
  • FCN [Long et al. 2014]
  • DeepLab [Chen et al. 2014]
  • CRFasRNN [Zheng et al. 2015]
• Image Matting
  • Bayesian matting [Chuang et al. 2001]
  • Closed-form matting [Levin et al. 2008]
  • KNN matting [Chen et al. 2013]
Our Approach
PortraitFCN and PortraitFCN+
Our System
[Diagram: Detector → PortraitFCN Model (Conv, ReLU, Pooling, ..., DeConv) [Long et al. 2015] → Mask; input: RGB channels; 2 outputs]
PortraitFCN
• Fine-tuned from the original FCN-8s model
Portrait Knowledge
PortraitFCN+
[Diagram: Detector → PortraitFCN+ Model (Conv, ReLU, Pooling, ..., DeConv) [Long et al. 2015] → Mask; input: RGB + Shape + Position channels; 2 outputs]
Shape Channel
[Diagram: Labeled Masks → Align → Canonical Pose → Mean → Align → Test Image → Shape Channel]
$$M = \frac{\sum_i w_i \circ T_i(M_i)}{\sum_i w_i}$$
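The weighted mean of aligned masks can be sketched as follows. This is a simplified illustration under two assumptions: the alignment T_i is replaced by an integer translation (the slide only says the masks are aligned to a canonical pose), and w_i is taken to be a validity map that is 1 where the transformed mask has data and 0 where it falls outside the image. The helper names are hypothetical.

```python
import numpy as np

def shift_with_validity(mask, dy, dx):
    """Translate `mask` by (dy, dx); return the warped mask and its validity map."""
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y, src_x = ys - dy, xs - dx
    inside = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    warped = np.zeros((h, w))
    warped[inside] = mask[src_y[inside], src_x[inside]]
    return warped, inside.astype(float)

def mean_mask(masks, shifts):
    """M = sum_i w_i * T_i(M_i) / sum_i w_i, with translations standing in for T_i."""
    num = np.zeros(masks[0].shape)
    den = np.zeros(masks[0].shape)
    for mask, (dy, dx) in zip(masks, shifts):
        warped, valid = shift_with_validity(mask, dy, dx)
        num += valid * warped
        den += valid
    # Pixels covered by no mask are left at 0.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

masks = [np.ones((4, 4)), np.zeros((4, 4))]
M = mean_mask(masks, shifts=[(0, 0), (0, 1)])
```

Dividing by the summed weights rather than the number of masks keeps pixels near the border from being biased toward background when some aligned masks do not cover them.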
Position Channel
[Diagram: Canonical Pose (x-coordinate, y-coordinate) → Align → Test Image → Position channels]
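One simple way to picture the position channels is as two coordinate maps expressed relative to the detected face. The sketch below assumes the alignment reduces to a translation to the face centre and a division by the face size; the function name and its parameters are hypothetical and only illustrate the idea of normalized x and y inputs.

```python
import numpy as np

def position_channels(height, width, face_center, face_size):
    """Return (x, y) maps: pixel coordinates relative to the face, in face-size units."""
    cy, cx = face_center
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    x_channel = (xs - cx) / face_size
    y_channel = (ys - cy) / face_size
    return x_channel, y_channel

x_ch, y_ch = position_channels(4, 6, face_center=(1, 2), face_size=2.0)
```

With this encoding a given value in the x or y channel always refers to the same place relative to the face, regardless of where the face sits in the photo.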
Effectiveness
[Figure panels: Input; PortraitFCN; PortraitFCN+]
Experiments and Applications
Our Dataset
• 1,800 portraits from Flickr with labeled masks
  • 1,500 portraits as the training data
  • 300 for testing
• Large variations in portrait types
  • Age, color, background, clothing, accessories, head position, hair style, lighting, etc.
Training
• Fine-tune the model starting from FCN-8s
  • Synthesize more data with different transforms
  • Use the person-class and background weights
• Find the best learning rate (LR)
  • Loss
  • Accuracy
Find the Best LR
Evaluation
Methods                                       Mean IoU (%)
Graph-cut                                     80.02
FCN (Person Class)                            73.09
PortraitFCN                                   94.20
PortraitFCN+ (Only with Mean Mask)            94.89
PortraitFCN+ (Only with Normalized x and y)   94.61
PortraitFCN+                                  95.91
$$\mathrm{IoU} = \frac{\mathrm{area}(\text{output} \cap \text{ground truth})}{\mathrm{area}(\text{output} \cup \text{ground truth})}$$
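For reference, the IoU measure used in the evaluation can be computed on binary masks as in the sketch below. The function names are illustrative, and returning 1.0 when both masks are empty is a convention chosen for this example rather than something stated in the talk.

```python
import numpy as np

def iou(output, ground_truth):
    """Intersection over union of two binary masks."""
    output = np.asarray(output, dtype=bool)
    ground_truth = np.asarray(ground_truth, dtype=bool)
    union = np.logical_or(output, ground_truth).sum()
    if union == 0:
        return 1.0                       # both masks empty: treat as a perfect match
    return np.logical_and(output, ground_truth).sum() / union

def mean_iou(outputs, ground_truths):
    """Average IoU over a test set, as a percentage."""
    scores = [iou(o, g) for o, g in zip(outputs, ground_truths)]
    return 100.0 * float(np.mean(scores))

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
score = iou(pred, gt)
```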
Comparisons
[Figure panels: Input; Ground Truth; Graph-cut; FCN-8s (Person); PortraitFCN; PortraitFCN+]
Comparisons
Input Ground Truth
IoU = 0.83 IoU = 0.42
IoU = 0.91 IoU = 0.85
FCN-8s Graph-cut
IoU = 0.99
IoU = 0.98
Ours
Comparisons
Input Ground Truth
IoU = 0.77 IoU = 0.95
IoU = 0.38 IoU = 0.84
FCN-8s Graph-cut
IoU = 0.98
IoU = 0.98
Ours
Comparisons
Input Ground Truth
IoU = 0.83 IoU = 0.53
IoU = 0.81 IoU = 0.89
FCN-8s Graph-cut
IoU = 0.99
IoU = 0.98
Ours
Robustness
[Figure panels: Color; Scale; Rotation; Occlusion]
User Study
• Our result provides a very good initialization for further refinement
Segmentation Is Not Enough: Automatic Portrait Matting
Portrait Matting
Input Image Alpha Matte
Color transform Depth-of-field Portrait
Stylization Cartoon
Background Edit
Problem Definition
$$I = \alpha F + (1 - \alpha) B$$
where $I$ is the image, $\alpha$ is the alpha (foreground opacity), $F$ is the foreground, and $B$ is the background.
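The compositing model behind this definition is easy to state in code. The sketch below blends a foreground and a background with a per-pixel alpha; the function name `composite` and the array layout (height × width × channels, values in [0, 1]) are assumptions for the example. Matting is the inverse problem: recovering alpha from the image alone.

```python
import numpy as np

def composite(alpha, foreground, background):
    """I = alpha * F + (1 - alpha) * B, applied per pixel and per channel."""
    a = alpha[..., None]                 # broadcast alpha over the colour channels
    return a * foreground + (1.0 - a) * background

alpha = np.array([[1.0, 0.0, 0.5]])
fg = np.ones((1, 3, 3))                  # white foreground
bg = np.zeros((1, 3, 3))                 # black background
image = composite(alpha, fg, bg)
```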
Natural Image Matting
• Color sampling methods
  • Given a manually labeled trimap
  • Bayesian Matting [Y-Y Chuang, 2001], etc.
[Figure panels: Image; Trimap; Alpha matte]
Natural Image Matting
• Propagation approaches
  • Given manually labeled strokes and trimap
  • Closed-form Matting [Levin, 2008], etc.
$$\alpha = \arg\min_{\alpha} \; \alpha^T L \alpha + \lambda (\alpha - b_s)^T D (\alpha - b_s)$$
where $L$ is the matting Laplacian, $b_s$ holds the user-provided strokes, and $D$ is the diagonal stroke mask.
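Because the objective is quadratic, setting its gradient to zero gives the linear system (L + λD) α = λ D b_s. The sketch below solves that system on a toy problem. The Laplacian of a 1-D chain of pixels is used as a stand-in for the real matting Laplacian, which is built from local colour statistics and is not reproduced here; all function names are illustrative.

```python
import numpy as np

def chain_laplacian(n):
    """Graph Laplacian of n pixels in a row (stand-in for the matting Laplacian)."""
    L = np.zeros((n, n))
    for i in range(n - 1):
        L[i, i] += 1.0
        L[i + 1, i + 1] += 1.0
        L[i, i + 1] -= 1.0
        L[i + 1, i] -= 1.0
    return L

def solve_alpha(L, stroke_mask, stroke_values, lam=1e6):
    """Solve (L + lam * D) alpha = lam * D * b_s for alpha."""
    D = np.diag(stroke_mask.astype(float))
    return np.linalg.solve(L + lam * D, lam * D @ stroke_values)

L = chain_laplacian(5)
stroke_mask = np.array([1, 0, 0, 0, 1])              # strokes on the two end pixels
stroke_values = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # left = foreground, right = background
alpha = solve_alpha(L, stroke_mask, stroke_values)
```

The strokes pin the alpha values where the user scribbled, and the Laplacian term propagates them smoothly to the unlabeled pixels in between.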
Motivation
• It is very hard to specify a trimap or strokes
[Figure panels: Input; Labeled Strokes; Closed-form Matting (with errors)]
[Figure panels: Input; Labeled Trimap; Closed-form Matting (with errors)]
• Usually we need to refine the trimap many times to get a good alpha matte.
Segmentation to Matting
Learning for Automatic Matting
• Challenges
  • Data preparation
  • Learning framework
• We propose end-to-end Convolutional Neural Networks (CNNs) for portrait matting
Learning Data Collection
• 2,000 portraits from Flickr with large variation
  • Keywords…
  • Different age, gender, pose, hairstyle, background…
  • Different camera types…
• Data example
Data Labeling
• Apply closed-form matting and robust matting
  • Gradually refine the input trimap
  • Choose the better result from closed-form or robust matting
• User interface
• Ground truth example
Learn Automatic Matting
Our Method
• Trimap labeling
  • Input: RGB image
  • Output: trimap
  • Network: fine-tuned from FCN
• Image matting layer
  • Input: trimap
  • Output: alpha matte
  • Newly designed structure
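Our Method
• Image matting layer
  • Feed-forward:
$$\min_A \; \lambda A^T B A + \lambda (A - 1)^T F (A - 1) + A^T L A$$
  • Backward:
$$\frac{\partial f}{\partial B} = -\lambda D^{-1} \mathrm{diag}(D^{-1} F)$$
$$\frac{\partial f}{\partial F} = \frac{\partial f}{\partial B} + D^{-1}$$
$$\frac{\partial f}{\partial \lambda} = -\lambda D^{-1} \mathrm{diag}(F + B) D^{-1} F$$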
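The feed-forward objective is quadratic in A, so its minimizer has a closed form: with B and F treated as diagonal matrices of background and foreground probabilities, setting the gradient to zero gives (λB + λF + L) A = λF. The sketch below evaluates this on a toy example. It assumes a 1-D chain Laplacian in place of the real matting Laplacian, and it names the system matrix `D` on the assumption that this is what D denotes in the backward formulas; only the forward pass is shown.

```python
import numpy as np

def chain_laplacian(n):
    """Graph Laplacian of n pixels in a row (stand-in for the matting Laplacian)."""
    L = np.zeros((n, n))
    for i in range(n - 1):
        L[i, i] += 1.0
        L[i + 1, i + 1] += 1.0
        L[i, i + 1] -= 1.0
        L[i + 1, i] -= 1.0
    return L

def matting_layer_forward(F, B, L, lam=100.0):
    """Minimize lam*A'BA + lam*(A-1)'F(A-1) + A'LA; F and B are per-pixel vectors."""
    D = lam * np.diag(B) + lam * np.diag(F) + L      # system matrix (assumed meaning of D)
    return np.linalg.solve(D, lam * F)

F = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # foreground probability per pixel
B = np.array([0.0, 0.0, 0.0, 0.0, 1.0])   # background probability per pixel
A = matting_layer_forward(F, B, chain_laplacian(5))
```

Pixels with a confident foreground or background score are pulled toward 1 or 0, while uncertain pixels are filled in by the Laplacian term, which is what lets the layer turn a soft trimap into an alpha matte.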
Our Method
• Image matting layer
  • Loss function:
$$L(A, A^{gt}) = \sum_i w(A_i^{gt}) \, |A_i - A_i^{gt}|, \qquad w(A_i^{gt}) = -\log\big(p(A = A_i^{gt})\big)$$
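The weighted L1 loss can be sketched as below. The slide does not say how the probability p is obtained; this example assumes it is estimated from a histogram of the ground-truth alpha values, so that rare values (typically the fractional ones around hair and edges) receive larger weights. The bin count and the small `eps` guard are hypothetical choices.

```python
import numpy as np

def alpha_weights(alpha_gt, bins=10, eps=1e-8):
    """w = -log p(alpha), with p estimated from a histogram (assumption)."""
    hist, edges = np.histogram(alpha_gt, bins=bins, range=(0.0, 1.0))
    prob = hist / hist.sum()
    # Map each value to its histogram bin using the interior bin edges.
    idx = np.digitize(alpha_gt, edges[1:-1])
    return -np.log(prob[idx] + eps)

def matting_loss(alpha, alpha_gt, bins=10):
    """Sum over pixels of w(alpha_gt) * |alpha - alpha_gt|."""
    w = alpha_weights(alpha_gt, bins=bins)
    return float(np.sum(w * np.abs(alpha - alpha_gt)))

alpha_gt = np.array([0.0, 0.0, 0.0, 1.0])
alpha_pred = np.array([0.0, 0.0, 0.0, 0.5])
loss = matting_loss(alpha_pred, alpha_gt, bins=2)
```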
Model Training
• Data augmentation
  • 4 scales {0.6, 0.8, 1.2, 1.5}
  • 4 rotations {-45, -22, 22, 45} degrees
  • Gamma values {0.5, 0.8, 1.2, 1.5}
• Network initialization
  • Fine-tuned from the FCN-8s model [J. Long, 2015]
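The augmentation settings listed above can be written down as plain constants. The sketch below shows only the gamma transform, since it needs nothing beyond NumPy; scaling and rotation would require an image-resampling routine and are left out. How the transforms were combined during training is not stated in the talk, so the helper simply produces one variant per gamma value.

```python
import numpy as np

SCALES = (0.6, 0.8, 1.2, 1.5)
ROTATIONS_DEG = (-45, -22, 22, 45)
GAMMAS = (0.5, 0.8, 1.2, 1.5)

def gamma_adjust(image, gamma):
    """Apply I ** gamma to an image with values in [0, 1]."""
    return np.clip(image, 0.0, 1.0) ** gamma

def gamma_variants(image):
    """One augmented copy per gamma value; the alpha labels stay unchanged."""
    return [gamma_adjust(image, g) for g in GAMMAS]

sample = np.array([[0.25, 1.0], [0.0, 0.64]])
variants = gamma_variants(sample)
```

A gamma below 1 brightens the image and a gamma above 1 darkens it, which mimics exposure differences without moving any pixel, so the ground-truth matte can be reused as is.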
Experiments
• Running time
  • Training time: 20k iterations, one day on a Titan X GPU
  • Testing time: 0.6 s for a 600×800 color image
• Comparisons
  • Graph-cut
  • FCN baseline: direct FCN segmentation followed by closed-form matting
Results
[Figure panels: Input; Graph-cut; FCN; Ours]
Failure Cases
Input Alpha Matte Input Alpha Matte
Applications
Input Stylization PS GS Stick PS Fresco Stylization
Input Stylization Depth-of-Field PS Fresco Stylization
Input Stylization PS Palette Knife PS GS Stick PS Sketch
Input PS Oil Paint Depth-of-Field PS GS Stick Stylization
Input Stylization PS Palette Knife Depth-of-Field Stylization
Input Stylization PS Palette Knife PS Dark Stroke PS Paint Daubs
Conclusions
• High-accuracy automatic portrait segmentation and matting approach
  • A novel CNN framework
  • Training and testing dataset
  • Benefits many applications
• Future work
  • Video segmentation
  • Human segmentation
  • Single portrait image depth estimation
  • Weakly supervised version
Q & A
Thanks