Self-supervised Learning for Visual Recognition
![Page 1: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/1.jpg)
Self-supervised Learning for Visual Recognition
Hamed Pirsiavash
University of Maryland, Baltimore County
![Page 2: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/2.jpg)
Significant progress in recognition is due to large annotated datasets:
• 14 million images
• 10 million images
• 450 hours of video
• 1.7 million question/answer pairs
![Page 3: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/3.jpg)
Self-supervised learning
Zhang et al. ECCV’16
[Figure: example pretext task — input image and predicted output (colorization).]
![Page 4: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/4.jpg)
Supervised learning (classification)
Input image → Label: Chair: 0, Dog: 1, Car: 0, …
![Page 5: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/5.jpg)
Supervised learning (classification)
Input image → Label: Chair: 0, Dog: 1, Car: 0, …
![Page 6: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/6.jpg)
Supervised learning (classification)
Input image → Label: Chair: 1, Dog: 0, Car: 0, …
![Page 7: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/7.jpg)
Supervised learning (classification)
Input image → Label: Chair: 1, Dog: 0, Car: 0, …
Transfer to another task
![Page 8: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/8.jpg)
Supervised learning (counting)
Input image → Label: Chair: 0, Dog: 2, Car: 0, …
![Page 9: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/9.jpg)
Inference on counting network
![Page 10: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/10.jpg)
Constraint in the output
![Page 11: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/11.jpg)
Constraint in the output
![Page 12: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/12.jpg)
Constraint in the output
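In the notation of the counting diagram a few slides ahead: if φ(·) outputs a vector of visual-primitive counts, downsampling an image x (written D ◦ x) leaves its counts unchanged, while tiling it into four quadrants T1 ◦ x, …, T4 ◦ x splits them, so the output should satisfy

φ(D ◦ x) = φ(T1 ◦ x) + φ(T2 ◦ x) + φ(T3 ◦ x) + φ(T4 ◦ x)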
![Page 13: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/13.jpg)
Two constraints in learning
Annotation...
![Page 14: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/14.jpg)
Two constraints in learning
Annotation...
![Page 15: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/15.jpg)
Self-supervised learning
[Figure: the counting pretext task. An image x is split into four tiles T1 ◦ x, …, T4 ◦ x and also downsampled as D ◦ x; a second image y gives D ◦ y. All crops pass through a shared network φ, whose outputs are drawn as histograms on a 0–4.5 scale. With t = φ(T1 ◦ x) + φ(T2 ◦ x) + φ(T3 ◦ x) + φ(T4 ◦ x), d = φ(D ◦ x), and c = φ(D ◦ y), the training loss is |d − t|² + max{0, M − |c − t|²}.]
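A minimal PyTorch sketch of this loss, assuming φ maps a batch of images to a (batch, features) vector of "counts"; the margin M, the bilinear downsampling, and the 2×2 tiling below are illustrative stand-ins, not the talk's exact implementation:

```python
import torch
import torch.nn.functional as F

def tiles(x):
    """Split a batch of images (B, C, H, W) into its four quadrants T1..T4."""
    B, C, H, W = x.shape
    h, w = H // 2, W // 2
    return [x[:, :, :h, :w], x[:, :, :h, w:], x[:, :, h:, :w], x[:, :, h:, w:]]

def counting_loss(phi, x, y, M=10.0):
    """|d - t|^2 + max{0, M - |c - t|^2}, with t = sum_i phi(Ti . x),
    d = phi(D . x), and c = phi(D . y) for a different image y."""
    down = lambda im: F.interpolate(im, scale_factor=0.5, mode="bilinear",
                                    align_corners=False)        # D
    t = sum(phi(tile) for tile in tiles(x))   # counts of the four tiles, summed
    d = phi(down(x))                          # count of the downsampled image
    c = phi(down(y))                          # count of an unrelated image
    match = ((d - t) ** 2).sum(dim=1)         # tiles must add up to the whole
    contrast = F.relu(M - ((c - t) ** 2).sum(dim=1))  # rules out phi == 0
    return (match + contrast).mean()
```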
![Page 16: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/16.jpg)
Self-supervised learning
[Figure: the same counting diagram, built up step by step (see page 15).]
![Page 17: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/17.jpg)
Self-supervised learning
[Figure: the same counting diagram, built up step by step (see page 15).]
![Page 18: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/18.jpg)
Self-supervised learning
[Figure: the same counting diagram, built up step by step (see page 15).]
![Page 19: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/19.jpg)
Self-supervised learning
[Figure: the same counting diagram, built up step by step (see page 15).]
![Page 20: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/20.jpg)
Self-supervised learning
[Figure: the same counting diagram, built up step by step (see page 15).]
![Page 21: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/21.jpg)
Self-supervised learning
[Figure: the same counting diagram, built up step by step (see page 15).]
![Page 22: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/22.jpg)
Trained on ImageNet without annotation
[Figure: for each of Unit 1, Unit 2, and Unit 3, the images with the largest activation.]
![Page 23: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/23.jpg)
Trained on COCO without annotation
[Figure: for each of Unit 1, Unit 2, and Unit 3, the images with the largest activation.]
![Page 24: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/24.jpg)
Trained on ImageNet without annotation
Nearest neighbor search
[Figure: query images alongside their retrieved nearest neighbors in feature space.]
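A small sketch of the retrieval shown here, under an assumed setup: embed every image with the self-supervised features, then rank the gallery by cosine similarity to the query. The random matrices stand in for real features from the trained network:

```python
import numpy as np

def nearest_neighbors(query_feat, gallery_feats, k=5):
    """Indices of the k gallery features closest to the query (cosine)."""
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]           # top-k most similar

gallery = np.random.randn(1000, 256)          # stand-in feature matrix
query = gallery[42] + 0.01 * np.random.randn(256)
print(nearest_neighbors(query, gallery))      # index 42 should come out first
```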
![Page 25: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/25.jpg)
Trained on COCO without annotation
Nearest neighbor search
[Figure: query images alongside their retrieved nearest neighbors in feature space.]
![Page 26: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/26.jpg)
[Diagram: Dataset (no labels) → Feature network (e.g., AlexNet) → Pretext task (e.g., counting).]
![Page 27: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/27.jpg)
Fine-tuning
[Diagram: Dataset (no labels) → Feature network (e.g., AlexNet) → Pretext task (e.g., counting); the pre-trained feature network is then fine-tuned on a Dataset (with labels) for the Target task (e.g., object detection).]
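A hedged sketch of this two-stage recipe; the head sizes, the 1000-way pretext output, and the VOC-style class count below are illustrative assumptions, not the talk's exact setup:

```python
import torch.nn as nn
from torchvision.models import alexnet

# Stage 1: backbone + pretext head, trained on the unlabeled dataset.
backbone = alexnet(weights=None).features           # start from scratch
pretext_head = nn.Sequential(nn.Flatten(), nn.LazyLinear(1000))
pretext_model = nn.Sequential(backbone, pretext_head)
# ... train pretext_model on the pretext task (e.g., counting) ...

# Stage 2: keep the pre-trained backbone, attach a target-task head, and
# fine-tune the whole thing on the smaller labeled dataset.
target_head = nn.Sequential(nn.Flatten(), nn.LazyLinear(20))  # e.g., VOC classes
model = nn.Sequential(backbone, target_head)
# ... fine-tune `model` on the labeled data, often with a lower backbone LR ...
```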
![Page 28: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/28.jpg)
Results on transfer learning
Fine-tuning on PASCAL VOC07

| Method | Classification | Detection | Segmentation |
|---|---|---|---|
| Supervised | 79.9 | 57.1 | 48.0 |
| Random | 53.3 | 43.4 | - |
| Sound | 54.4 | 44.0 | - |
| Video | 63.1 | 47.2 | - |
| Split-Brain | 67.1 | 46.7 | 36.0 |
| Watching-Objects | 61.0 | 52.2 | - |
| Jigsaw (new version) | 67.6 | 53.2 | 37.6 |
| Counting (ours) | 67.7 | 52.4 | 36.6 |

(Random: 53.3 / 43.4 / 19.8.)
![Page 29: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/29.jpg)
Doersch et al. ICCV’15
Noroozi and Favaro ECCV’16
Zhang et al. ECCV’16
Pathak et al. CVPR’16
Wang and Gupta ICCV’15
Pathak et al. CVPR’17
Jayaraman and Grauman ICCV’15
Agrawal et al. ICCV’15
Owens et al. ECCV’16
Misra et al. ECCV’16
![Page 30: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/30.jpg)
![Page 31: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/31.jpg)
![Page 32: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/32.jpg)
![Page 33: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/33.jpg)
![Page 34: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/34.jpg)
![Page 35: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/35.jpg)
Agenda
• Self-supervised learning by counting
• Boosting self-supervised learning by knowledge transfer
![Page 36: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/36.jpg)
Fine-tuning
[Diagram: the pipeline from page 27 — pretext training (e.g., counting) on the unlabeled dataset, then fine-tuning the feature network (e.g., AlexNet) on the labeled dataset for the target task (e.g., object detection).]
![Page 37: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/37.jpg)
Fine-tuning
[Diagram: a more complicated pretext task trained on a larger dataset (no labels); the feature network (e.g., AlexNet) is still fine-tuned on the labeled dataset for the target task (e.g., object detection).]
![Page 38: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/38.jpg)
[Diagram: a more complicated feature network (e.g., VGG) handles the more complicated pretext task on the larger dataset (no labels); fine-tuning on the labeled dataset still uses the smaller feature network (e.g., AlexNet) for the target task (e.g., object detection).]
![Page 39: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/39.jpg)
Transferring
[Diagram: the representation learned by the more complicated feature network (e.g., VGG) on the larger dataset (no labels) is transferred to the smaller feature network (e.g., AlexNet), which is fine-tuned on the labeled dataset for the target task (e.g., object detection).]
![Page 40: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/40.jpg)
Transferring
[Diagram: the same transfer step as on page 39.]
![Page 41: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/41.jpg)
[Diagram: more complicated feature network (e.g., VGG) and more complicated pretext task on the larger dataset (no labels); target task (e.g., object detection) on the labeled dataset.]
![Page 42: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/42.jpg)
[Diagram: the same pipeline, with the more complicated pretext task now trained on the dataset (no labels) itself.]
![Page 43: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/43.jpg)
[Diagram: pseudo-labels are extracted on the dataset (no labels) using the more complicated feature network (e.g., VGG) trained on the pretext task; the target task (e.g., object detection) uses the dataset (with labels).]
![Page 44: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/44.jpg)
[Diagram: full pipeline — the more complicated feature network (e.g., VGG) is trained on the pretext task, pseudo-labels are extracted on the dataset (no labels), and the resulting network is fine-tuned on the dataset (with labels) for the target task (e.g., object detection).]
![Page 45: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/45.jpg)
Jigsaw
Permute and then predict the permutation
Noroozi, Mehdi, and Paolo Favaro. "Unsupervised learning of visual representations by solving jigsaw puzzles." ECCV 2016.
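A minimal sketch of jigsaw puzzle construction: cut a square crop into a 3×3 grid, shuffle the tiles with one permutation from a fixed set, and let the permutation's index be the classification target. The random permutation set and image size below are stand-ins; the actual method pre-selects a set of maximally distinct permutations:

```python
import numpy as np

def make_jigsaw(img, perm, grid=3):
    """img: (H, W, C) with H, W divisible by grid. Returns the shuffled tiles."""
    H, W, _ = img.shape
    h, w = H // grid, W // grid
    tiles = [img[r*h:(r+1)*h, c*w:(c+1)*w] for r in range(grid) for c in range(grid)]
    return [tiles[i] for i in perm]

perms = [np.random.permutation(9) for _ in range(100)]  # stand-in permutation set
img = np.zeros((225, 225, 3), dtype=np.uint8)           # dummy 225x225 crop
label = 7                                               # index the network must predict
shuffled_tiles = make_jigsaw(img, perms[label])
```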
![Page 46: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/46.jpg)
Jigsaw++
From CVPR 2018 Submission #1024 (shown on the slide):

(a) Self-Supervised Learning Pre-Training. Suppose that we are given a pretext task, a model and a dataset. Our first step in SSL is to train our model on the pretext task with the given dataset (see Fig. 2 (a)). Typically, the models of choice are convolutional neural networks, and one considers as feature the output of some intermediate layer (shown as a grey rectangle in Fig. 2 (a)).

(b) Clustering. Our next step is to compute feature vectors for all the images in our dataset. Then, we use the k-means algorithm with the Euclidean distance to cluster the features (see Fig. 2 (b)). Ideally, when performing this clustering on ImageNet images, we want the cluster centers to be aligned with object categories. In the experiments, we typically use 2,000 clusters.

(c) Extracting Pseudo-Labels. The cluster centers computed in the previous section can be considered as virtual categories. Indeed, we can assign feature vectors to the closest cluster center to determine a pseudo-label associated to the chosen cluster. This operation is illustrated in Fig. 2 (c). Notice that the dataset used in this operation might be different from that used in the clustering step or in the SSL pre-training.

(d) Cluster Classification. Finally, we train a simple classifier using the architecture of the target task so that, given an input image (from the dataset used to extract the pseudo-labels), it predicts the corresponding pseudo-label (see Fig. 2 (d)). This classifier learns a new representation in the target architecture that maps images that were originally close to each other in the pre-trained feature space to close points.

4. The Jigsaw++ Pretext Task

Recent work [7, 31] has shown that deeper architectures can help in SSL with PASCAL recognition tasks (e.g., ResNet). However, those methods use the same deep architecture for both SSL and fine-tuning. Hence, they are not comparable with previous methods that use a simpler AlexNet architecture in fine-tuning. We are interested in knowing how far one can improve the SSL pre-training of AlexNet for PASCAL tasks. Since in our framework the SSL task is not restricted to use the same architecture as in the final supervised task, we can increase the difficulty of the SSL task along with the capacity of the architecture and still use AlexNet at the fine-tuning stage.

To this aim, we extend the jigsaw [20] task and call it the jigsaw++ task. The original pretext task [20] is to find a reordering of tiles from a 3×3 grid of a square region cropped from an image. In jigsaw++, we replace a random number of tiles in the grid (up to 2) with (occluding) tiles from another random image (see Fig. 3). The number of tiles (0, 1 or 2 in our experiments) as well as their location are randomly selected. The occluding tiles make the task remarkably more complex. First, the model needs to detect the occluding tiles and, second, it needs to solve the jigsaw problem by using only the remaining patches. To make sure we are not adding ambiguities to the task, we remove similar permutations so that the minimum Hamming distance between any two permutations is at least 3. In this way, there is a unique solution to the jigsaw task for any number of occlusions in our training setting. Our final training permutation set includes 701 permutations, in which the average and minimum Hamming distance is .86 and 3 respectively. In addition to the mean and std normalization of each patch independently, as it was done in the original paper, we train the network 70% of the time on the grayscale images. In this way, we prevent the network from using low-level statistics to detect occlusions and solve the jigsaw task.

Figure 3: The jigsaw++ task. (a) the main image. (b) a random image. (c) a puzzle from the original formulation of [20], where all tiles come from the same image. (d) a puzzle in the jigsaw++ task, where at most 2 tiles can come from a random image.

We train the jigsaw++ task on both VGG16 and AlexNet architectures. By having a larger capacity with VGG16, the network is better equipped to handle the increased complexity of the jigsaw++ task and is capable of extracting better representations from the data.

Following our pipeline in Fig. 2, we train our models…
• Add distracting patches
• Increase number of permutations
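The permutation-set construction described in the excerpt above can be sketched as a distance filter over all 3×3 permutations; greedy lexicographic selection is an assumption here, since the paper states only the minimum-distance constraint and the final count of 701:

```python
import itertools
import numpy as np

def hamming(p, q):
    """Number of positions where two permutations disagree."""
    return int(np.sum(np.array(p) != np.array(q)))

# Keep permutations whose Hamming distance to every kept one is >= 3, so
# occluding up to 2 tiles still leaves a unique answer.
selected = []
for perm in itertools.permutations(range(9)):
    if all(hamming(perm, s) >= 3 for s in selected):
        selected.append(perm)
    if len(selected) == 701:          # stop at the paper's set size
        break
print(len(selected), "permutations kept")
```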
![Page 47: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/47.jpg)
Clusters on Jigsaw++
![Page 48: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/48.jpg)
Results on transfer learning
Fine-tuning on PASCAL VOC07

| Method | Classification | Detection | Segmentation |
|---|---|---|---|
| Supervised | 79.9 | 57.1 | 48.0 |
| Random | 53.3 | 43.4 | 19.8 |
| Sound | 54.4 | 44.0 | - |
| Video | 63.1 | 47.2 | - |
| Split-Brain | 67.1 | 46.7 | 36.0 |
| Watching-Objects | 61.0 | 52.2 | - |
| Jigsaw (new version) | 67.6 | 53.2 | 37.6 |
| Counting (ours) | 67.7 | 52.4 | 36.6 |
![Page 49: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/49.jpg)
Results on transfer learning
Fine-tuning on PASCAL VOC07

| Method | Classification | Detection | Segmentation |
|---|---|---|---|
| Supervised | 79.9 | 57.1 | 48.0 |
| Random | 53.3 | 43.4 | 19.8 |
| Sound | 54.4 | 44.0 | - |
| Video | 63.1 | 47.2 | - |
| Split-Brain | 67.1 | 46.7 | 36.0 |
| Watching-Objects | 61.0 | 52.2 | - |
| Jigsaw (new version) | 67.6 | 53.2 | 37.6 |
| Counting (ours) | 67.7 | 52.4 | 36.6 |
| Jigsaw++ (ours) | 72.5 | 56.5 | 42.6 |
![Page 50: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/50.jpg)
Results on transfer learning
Fine-tuning on PASCAL VOC07

| Method | Classification | Detection | Segmentation |
|---|---|---|---|
| Supervised | 79.9 | 57.1 | 48.0 |
| Random | 53.3 | 43.4 | 19.8 |
| Sound | 54.4 | 44.0 | - |
| Video | 63.1 | 47.2 | - |
| Split-Brain | 67.1 | 46.7 | 36.0 |
| Watching-Objects | 61.0 | 52.2 | - |
| Jigsaw (new version) | 67.6 | 53.2 | 37.6 |
| Counting (ours) | 67.7 | 52.4 | 36.6 |
| Jigsaw++ (ours) | 72.5 | 56.5 | 42.6 |
| RotNet (ICLR’18) | 72.9 | 54.4 | 39.1 |
| Deep clustering (ECCV’18) | 73.7 | 55.4 | 45.1 |
![Page 51: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/51.jpg)
[Diagram: the full pipeline again — the more complicated feature network (e.g., VGG) is trained on the pretext task, pseudo-labels are extracted on the dataset (no labels), and the resulting network is fine-tuned on the dataset (with labels) for the target task (e.g., object detection).]
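The pseudo-label step in this pipeline, following steps (b)–(c) of the excerpt above, might look like the sketch below; the random matrix stands in for features from the pre-trained network:

```python
import numpy as np
from sklearn.cluster import KMeans

features = np.random.randn(10000, 512).astype(np.float32)   # stand-in features

kmeans = KMeans(n_clusters=2000, n_init=1).fit(features)    # (b) clustering
pseudo_labels = kmeans.predict(features)                    # (c) pseudo-labels

# (d) A simple classifier in the target architecture (e.g., AlexNet) is then
# trained to predict these pseudo-labels before fine-tuning on real labels.
```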
![Page 52: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/52.jpg)
[Diagram: the same pipeline with HOG in place of the learned feature network — pseudo-labels are computed from HOG descriptors on the dataset (no labels), followed by fine-tuning on the dataset (with labels) for the target task (e.g., object detection).]
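A sketch of this HOG variant, with skimage's hog descriptor and k-means as assumed stand-ins for the exact choices in the talk:

```python
import numpy as np
from skimage.feature import hog
from sklearn.cluster import KMeans

images = np.random.rand(500, 128, 128)        # stand-in grayscale images
feats = np.stack([hog(im, pixels_per_cell=(16, 16)) for im in images])
pseudo_labels = KMeans(n_clusters=50, n_init=1).fit_predict(feats)
# A network trained to predict these pseudo-labels yields the "HOG (ours)" row.
```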
![Page 53: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/53.jpg)
Results on transfer learning
Fine-tuning on PASCAL VOC07

| Method | Classification | Detection | Segmentation |
|---|---|---|---|
| Supervised | 79.9 | 57.1 | 48.0 |
| Random | 53.3 | 43.4 | 19.8 |
| Sound | 54.4 | 44.0 | - |
| Video | 63.1 | 47.2 | - |
| Split-Brain | 67.1 | 46.7 | 36.0 |
| Watching-Objects | 61.0 | 52.2 | - |
| Jigsaw (new version) | 67.6 | 53.2 | 37.6 |
| Counting (ours) | 67.7 | 52.4 | 36.6 |
| Jigsaw++ (ours) | 72.5 | 56.5 | 42.6 |
| HOG (ours) | 70.2 | 53.2 | 39.2 |

Kaiming He, Ross Girshick, and Piotr Dollár, “Rethinking ImageNet Pre-training,” arXiv, Nov 2018.
![Page 54: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/54.jpg)
Visualization of conv1 filters
[Figure: conv1 filters — from scratch; CC on VGG-Jigsaw++; CC on HOG.]
![Page 55: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/55.jpg)
Thanks to
Mehdi Noroozi, Paolo Favaro, Ananth Kavalkazhani
![Page 56: Self-supervised Learning for Visual Recognition](https://reader033.fdocuments.net/reader033/viewer/2022053104/6291c21c126b7b3fd002158c/html5/thumbnails/56.jpg)
Thanks!