Learning Transferable Visual Models From Natural Language Supervision
ICML 2021
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
OpenAI
Contrastive learning
Panda, Hippo, Camel, Tiger, Pig
CLIP: Contrastive Language-Image Pre-training
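As a rough illustration of the contrastive objective named in this slide title, the sketch below computes a symmetric cross-entropy loss over the image-text similarity matrix of a batch, where the i-th image and the i-th text form the matching pair. The function names, the toy 2-d embeddings, and the fixed temperature of 0.07 are assumptions for illustration, not details taken from the slides.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct entry."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image i should match text i, and text i should match image i."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] is the temperature-scaled cosine similarity of image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Rows classify the right text for each image; columns do the reverse.
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    loss_texts = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_images + loss_texts) / 2


# Toy batch (hypothetical embeddings): matched pairs versus deliberately swapped pairs.
toy = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = contrastive_loss(toy, toy)
shuffled_loss = contrastive_loss(toy, toy[::-1])
print(aligned_loss < shuffled_loss)
```

The loss is small when each image is most similar to its own text and large when the pairs are swapped, which is the behaviour the pre-training objective rewards.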
Zero-shot image classification
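A minimal sketch of how zero-shot classification can work with a shared image-text embedding space: each class name is turned into a text prompt, and the image is assigned to the class whose prompt embedding is most similar. The toy 3-d vectors stand in for encoder outputs, and the prompt template, the `zero_shot_classify` helper, and the scale factor of 100 are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, text_embs_by_label, scale=100.0):
    """Return a probability per label from scaled image-text cosine similarities."""
    labels = list(text_embs_by_label)
    sims = [scale * cosine(image_emb, text_embs_by_label[name]) for name in labels]
    return dict(zip(labels, softmax(sims)))


labels = ["panda", "hippo", "camel"]
# Each class name becomes a natural-language prompt for the text encoder.
prompts = [f"a photo of a {name}" for name in labels]

# Hypothetical stand-ins for the text-encoder output of each prompt.
toy_text_embs = {
    "panda": [1.0, 0.1, 0.0],
    "hippo": [0.0, 1.0, 0.1],
    "camel": [0.1, 0.0, 1.0],
}
# Hypothetical stand-in for the image-encoder output of one image.
toy_image_emb = [0.9, 0.2, 0.1]

probs = zero_shot_classify(toy_image_emb, toy_text_embs)
prediction = max(probs, key=probs.get)
print(prediction)
```

No classifier is trained here: changing the label set only means writing new prompts and embedding them.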
Zero-shot CLIP is much more robust
Why contrastive?
Some CLIP details
Training
- Trained on 400M image-text pairs from the internet
- Batch size of 32,768
- 32 epochs over the dataset
- Cosine learning rate decay
Architecture
- ResNet-based or ViT-based image encoder
- Transformer-based text encoder
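The training figures on this slide imply a total number of optimizer steps, and a cosine decay schedule can be written in a few lines. The sketch below works through that arithmetic; the `cosine_lr` helper and its `peak_lr` argument are hypothetical, and warmup is left out because the slide does not mention it.

```python
import math

DATASET_SIZE = 400_000_000  # image-text pairs
BATCH_SIZE = 32_768
EPOCHS = 32

# Total optimizer steps implied by the three numbers above.
total_steps = DATASET_SIZE * EPOCHS // BATCH_SIZE


def cosine_lr(step, total, peak_lr):
    """Cosine decay from peak_lr at step 0 down to 0 at the final step."""
    progress = min(step, total) / total
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


print(total_steps)
# Learning rate at the start, middle, and end, as a fraction of the peak value.
for step in (0, total_steps // 2, total_steps):
    print(step, round(cosine_lr(step, total_steps, peak_lr=1.0), 4))
```

With these numbers one pass over the data is roughly 12,207 steps, so the full run is 390,625 steps.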
Representation Learning
Linear probe
Logistic regression classifier on image features
- L-BFGS
- Only one hyperparameter
- Allows “fair” comparisons with other vision models
- Provides lower bound for fine-tuned models
Evaluated on 27 image datasets × 65 vision models
satellite images, car models, medical images, city classification, rendered text, aircraft, birds, memes, ...
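A linear probe of the kind described on this slide can be sketched with scikit-learn's `LogisticRegression` and its L-BFGS solver. The features below are synthetic Gaussian clusters standing in for frozen image-encoder outputs, and treating `C` (inverse regularization strength) as the single hyperparameter is an assumption about what the slide refers to; the values are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for frozen image features: two Gaussian clusters in 16 dimensions.
n_per_class, dim = 200, 16
features = np.concatenate([
    rng.normal(-1.0, 1.0, size=(n_per_class, dim)),
    rng.normal(1.0, 1.0, size=(n_per_class, dim)),
])
labels = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle, then split into train and held-out sets.
order = rng.permutation(len(labels))
features, labels = features[order], labels[order]
train_x, test_x = features[:300], features[300:]
train_y, test_y = labels[:300], labels[300:]

# The encoder stays frozen; only this linear classifier is fit.
# C is the one hyperparameter that would be swept on a validation set.
probe = LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000)
probe.fit(train_x, train_y)

accuracy = probe.score(test_x, test_y)
print(accuracy)
```

Because only a linear layer is trained, the score mostly reflects the quality of the features, which is what makes comparisons across vision models reasonably even-handed.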
Linear probe performance vs SOTA vision models
vs ImageNet score
Zero-Shot Transfer
Zero-shot vs Linear-probe ResNet-50
Zero-shot CLIP matches fully supervised ResNet-50 across eval suite
Zero-shot CLIP vs Few-shot linear probes
Zero-shot CLIP is as good as
- 4-shot linear-probe CLIP
- 16-shot BiT-M
Zero-shot vs Linear-probe CLIP
Zero-shot performance vs model size
Prompt engineering
Robustness to Natural Distribution Shift
Robustness to natural distribution shift
Zero-Shot CLIP is much more robust!
7 ImageNet-like Datasets (Taori et al.)
- ImageNetV2
- ImageNet-A
- ImageNet-R
- ImageNet Sketch
- ObjectNet
- ImageNet Vid
- YouTube-BB
Adapting to ImageNet does not help robustness
Robustness of few-shot linear probes
Limitations and Broader Impacts
Limitations of CLIP
- Zero-shot performance is well below the SOTA
- Especially weak on abstract tasks such as counting
- Poor on out-of-distribution data such as MNIST
- Susceptible to adversarial attacks
- Dataset selection in the eval suite, use of large validation sets for prompt engineering
- Social biases
Quantifying the (un)safety of CLIP models
- Class design can heavily influence bias

| Category Label Set | 0-2 | 3-9 | 10-19 | 20-29 | 30-39 | 40-49 | 50-59 | 60-69 |
|---|---|---|---|---|---|---|---|---|
| Default Label Set | 30.3 | 35.0 | 29.5 | 16.3 | 13.9 | 18.5 | 19.1 | 16.2 |
| Default Label Set + ‘child’ | 2.3 | 4.3 | 14.7 | 15.0 | 13.4 | 18.2 | 18.6 | 15.5 |

Percent of images classified into crime-related and non-human categories by FairFace age category, comparing results obtained with a default label set against a label set to which the label ‘child’ has been added.
Quantifying the (un)safety of CLIP models
- Enables niche tasks which lack training data
- Not comprehensive; research is continuing to ensure safety

CelebA Zero-Shot Top 1 Identity Recognition Results

| Model | 100 Classes | 1k Classes | 2k Classes |
|---|---|---|---|
| CLIP L/14 | 59.2 | 43.3 | 42.2 |
| CLIP RN50x64 | 56.4 | 39.5 | 38.4 |
| CLIP RN50x16 | 52.7 | 37.4 | 36.3 |
| CLIP RN50x4 | 52.8 | 38.1 | 37.3 |
Related Work
Prior Related Work
Natural language supervision:
- YFCC100M WSL (Joulin et al.)
- VirTex (Desai and Johnson)
- ICMLM (Sariyildiz et al.)
- ConVIRT (Zhang et al.)
Zero-Shot Transfer:
- Visual N-Grams (Li et al.)
Broad Evaluation and Robustness:
- VTAB (Zhai et al.)
- ImageNet Testbed (Taori et al.)
Multimodal Neurons in CLIP (Goh et al. Distill)
Typographic Attacks
Applications of CLIP
- StyleCLIP (Patashnik et al.): steering a GAN using CLIP
- CLIP4Clip (Luo & Ji, et al.): video retrieval using CLIP features
More text-based image generations using CLIP
- “Dogs playing poker”
- “Geoffrey Hinton”
- “A banquet hall”
© Gene Kogan, Ryan Murdock
Try CLIP today!
https://github.com/openai/CLIP
- PyTorch implementation
- Colab notebook
- Zero-Shot prediction reference
- Linear probe reference
- YFCC100M dataset
- Released models
Thank You
Visit openai.com for more information.
Follow @OpenAI on Twitter
We are hiring!