

Assessing Generative Models via Precision and Recall

M. S. M. Sajjadi 1,2,3, O. Bachem 3, M. Lučić 3, O. Bousquet 3, S. Gelly 3

1 Max Planck Institute for Intelligent Systems, 2 Max Planck ETH Center for Learning Systems, 3 Google Brain

msajjadi.com

Overview

• Evaluating generative models is a challenging problem. Inception Score (IS) and Fréchet Inception Distance (FID) correlate with the perceptual sample quality, but it is easy to find models with the same IS or FID that have highly different characteristics, as these are only one-dimensional scores (Figure below).

• We propose a score that disentangles precision (quality of generated samples) from recall (proportion of target distribution that is covered by the generator).

• Our contributions: a novel definition of precision and recall for distributions (PRD) with desirable properties, an efficient algorithm to compute it, and a demonstration that PRD distinguishes mode dropping from mode inventing on real-world data sets (image and text data).

[Figure: sample panels titled "High precision, low recall" and "Low precision, high recall", with a PRD plot between them (x-axis: Recall, y-axis: Precision, both from 0.0 to 1.0; curves: MNIST left, MNIST right, CelebA left, CelebA right).]

• Figure above: Models with similar FID (MNIST: 32/29, CelebA: 65/62) but highly different characteristics. The PRD curves (middle) successfully distinguish between precision and recall.

Definition: Precision and Recall for Distributions (PRD)

• Goal: Evaluate distribution Q w.r.t. a reference distribution P in terms of precision (how much of Q is covered by P) and recall (how much of P is covered by Q).

• Key idea: Decompose P and Q into mixtures of a shared component $\mu$ and noise distributions $\nu_P$ (loss in recall) and $\nu_Q$ (loss in precision).

• Formal definition: For $\alpha, \beta \in (0, 1]$, the distribution Q has precision $\alpha$ at recall $\beta$ w.r.t. P if there exist distributions $\mu, \nu_P, \nu_Q$ such that

$$P = \beta\mu + (1 - \beta)\nu_P \quad\text{and}\quad Q = \alpha\mu + (1 - \alpha)\nu_Q.$$
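As a small illustration of the definition (a hypothetical example that is not on the poster), take a three-element set $\Omega = \{a, b, c\}$ with P uniform on $\{a, b\}$ and Q uniform on $\{b, c\}$. Choosing the point masses $\mu = \delta_b$, $\nu_P = \delta_a$, $\nu_Q = \delta_c$ and $\alpha = \beta = 1/2$ gives

$$P = \tfrac{1}{2}\delta_b + \tfrac{1}{2}\delta_a \quad\text{and}\quad Q = \tfrac{1}{2}\delta_b + \tfrac{1}{2}\delta_c,$$

so under these assumptions Q has precision 1/2 at recall 1/2 w.r.t. P: half of the mass of Q lies where P has mass, and half of the mass of P is covered by Q.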

[Figure: toy distribution pairs P and Q, panels (a) to (f) (top row), and their PRD curves (bottom row; axes $\alpha$ and $\beta$, each from 0 to 1).]

• Toy examples for distributions (left, top row) and their respective PRD curves (left, bottom row).

Algorithm

• How to compute PRD(Q, P)? There is an infinite number of ways to decompose P and Q. It is therefore not immediately clear how to compute the optimal set of precision $\alpha$ and recall $\beta$.

• Key insight: We can fix $\alpha = \lambda\beta$ and iterate over different values for $\lambda$.

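[Figure: PRD curve with x-axis Recall $\beta$ and y-axis Precision $\alpha$ (both from 0.0 to 1.0), with lines labeled $\lambda = 2$, $\lambda = 1$, and $\lambda = 0.5$.]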

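• We define

$$\alpha(\lambda) = \sum_{\omega \in \Omega} \min\bigl(\lambda P(\omega), Q(\omega)\bigr) \quad\text{and}\quad \beta(\lambda) = \sum_{\omega \in \Omega} \min\bigl(P(\omega), Q(\omega)/\lambda\bigr).$$

• For a resolution m, let

$$\Lambda = \left\{ \tan\left(\frac{i}{m+1} \cdot \frac{\pi}{2}\right) \;\middle|\; i = 1, 2, \ldots, m \right\}.$$

We then have:

$$\mathrm{PRD}(Q, P) = \{(\alpha(\lambda), \beta(\lambda)) \mid \lambda \in \Lambda\}.$$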

• Visualization of the PRD algorithm (left). For each precision-recall ratio $\lambda$, we compute the optimal pair of precision and recall via the equation above and add a point to the curve. This gives rise to a PRD curve which shows the relationship between the distributions in terms of precision and recall.
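The algorithm in this section can be sketched in a few lines of Python for distributions over a finite set. This is a minimal illustration rather than the authors' implementation: the function name `compute_prd`, the parameter `num_angles` (playing the role of the resolution m), and the toy distributions are assumptions made for the example.

```python
import math

def compute_prd(p, q, num_angles=1001):
    """Return (precisions, recalls) for discrete distributions p (reference) and q.

    p and q are sequences of probabilities over the same finite set.
    The name and defaults are illustrative assumptions.
    """
    if len(p) != len(q):
        raise ValueError("p and q must be defined over the same finite set")
    precisions, recalls = [], []
    for i in range(1, num_angles + 1):
        # lambda = tan(i / (m + 1) * pi / 2) sweeps the ratio alpha / beta over (0, inf).
        lam = math.tan(i / (num_angles + 1) * math.pi / 2)
        # alpha(lambda) = sum over omega of min(lambda * P(omega), Q(omega))
        alpha = sum(min(lam * p_w, q_w) for p_w, q_w in zip(p, q))
        # beta(lambda) = sum over omega of min(P(omega), Q(omega) / lambda)
        beta = sum(min(p_w, q_w / lam) for p_w, q_w in zip(p, q))
        precisions.append(alpha)
        recalls.append(beta)
    return precisions, recalls

# Toy example: P and Q share half of their mass.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
precisions, recalls = compute_prd(p, q, num_angles=3)
for alpha, beta in zip(precisions, recalls):
    print(f"precision={alpha:.3f} recall={beta:.3f}")
```

Each value of $\lambda$ contributes one point, and by construction every point satisfies $\alpha = \lambda\beta$. A larger `num_angles` gives a finer curve at a cost that is linear in the number of angles and in the size of the set.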

Application to Generative Models

• Applying the PRD algorithm in the original space is infeasible as P and Q are both only defined through a finite set of samples in the common case of evaluating generative models.

• We follow this procedure:

1. Feature extraction: Embed both real and generated samples into the Pool3 layer of a pre-trained Inception network, yielding 2048-dimensional feature vectors.

2. Clustering: Cluster the union of both distributions in feature space. The cluster assignment histogram for each one of P and Q captures the characteristics of the distribution.

3. PRD algorithm: Finally, apply the PRD algorithm on these one-dimensional discrete distributions.

• Clustered samples from CIFAR-10 (above) show that the cluster assignments in feature space are meaningful. Each row is sampled from real (left) and generated (right) images of the same cluster.
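The clustering step of the procedure can be sketched as follows. This is only an illustration under stated assumptions: random Gaussian vectors stand in for the Inception Pool3 features, the plain Lloyd's k-means written here is a stand-in for whatever clustering method is actually used, and the names `kmeans_labels` and `cluster_histograms` as well as the default of 20 clusters are hypothetical choices.

```python
import numpy as np

def kmeans_labels(features, num_clusters, num_iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative); returns one cluster index per row."""
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=num_clusters, replace=False)
    centers = features[start].copy()

    def assign(current_centers):
        # Squared distances via |x|^2 - 2 x.c + |c|^2, avoiding an N x K x D array.
        dists = ((features ** 2).sum(axis=1)[:, None]
                 - 2.0 * features @ current_centers.T
                 + (current_centers ** 2).sum(axis=1)[None, :])
        return dists.argmin(axis=1)

    for _ in range(num_iters):
        labels = assign(centers)
        for k in range(num_clusters):
            members = features[labels == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)
    return assign(centers)

def cluster_histograms(real_features, generated_features, num_clusters=20, seed=0):
    """Cluster the union of both sample sets and return the histograms (p, q)."""
    union = np.concatenate([real_features, generated_features], axis=0)
    labels = kmeans_labels(union, num_clusters, seed=seed)
    real_labels = labels[:len(real_features)]
    gen_labels = labels[len(real_features):]
    p = np.bincount(real_labels, minlength=num_clusters) / len(real_labels)
    q = np.bincount(gen_labels, minlength=num_clusters) / len(gen_labels)
    return p, q

# Stand-in features; real usage would pass 2048-dimensional Inception embeddings.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))
generated = rng.normal(loc=0.5, size=(150, 8))
p, q = cluster_histograms(real, generated, num_clusters=5)
overlap = np.minimum(p, q).sum()  # alpha(1) = beta(1): shared mass at lambda = 1
print(p, q, overlap)
```

The two histograms are discrete distributions over cluster indices, so the formulas for $\alpha(\lambda)$ and $\beta(\lambda)$ apply to them directly. Because k-means depends on its initialization, averaging the resulting curves over several seeds is a reasonable way to reduce variance.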

Disentangling Mode Drop from Mode Inventing

• Setting: P contains the first 5 classes, $Q_i$ has the first i classes from the CIFAR-10 dataset.

• Inception Score (left, blue) increases linearly as we add more classes.

• FID (left, green) drops in both cases, but the two cases are not distinguished (e.g., $Q_4 \approx Q_6$).

• PRD (middle) clearly shows a drop in recall for i < 5 and a drop in precision for i > 5.

• Applying PRD on NLP (right) shows similar behavior. The MultiNLI dataset used here consists of sentences from 5 topics. The embedding is based on a BiLSTM model.
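The qualitative behavior in this setting can be reproduced on an idealized toy version, sketched below. The assumptions are strong and deliberate: class labels are used directly as clusters (no features, no clustering), each distribution is uniform over its classes, and the helper names `max_precision_recall` and `uniform_over_first` are hypothetical.

```python
import math

def max_precision_recall(p, q, num_angles=1001):
    """Largest precision and recall attained on the PRD grid (illustrative helper)."""
    best_alpha, best_beta = 0.0, 0.0
    for i in range(1, num_angles + 1):
        lam = math.tan(i / (num_angles + 1) * math.pi / 2)
        alpha = sum(min(lam * p_w, q_w) for p_w, q_w in zip(p, q))
        beta = sum(min(p_w, q_w / lam) for p_w, q_w in zip(p, q))
        best_alpha = max(best_alpha, alpha)
        best_beta = max(best_beta, beta)
    return best_alpha, best_beta

def uniform_over_first(i, num_classes=10):
    """Uniform distribution over the first i of num_classes classes."""
    return [1.0 / i if k < i else 0.0 for k in range(num_classes)]

p = uniform_over_first(5)  # reference: first 5 classes
for i in range(1, 11):
    q = uniform_over_first(i)  # model: first i classes
    precision, recall = max_precision_recall(p, q)
    print(f"Q_{i}: max precision {precision:.2f}, max recall {recall:.2f}")
```

In this idealized setup the maximum recall is i/5 for i < 5 (mode dropping) while precision stays at 1, and the maximum precision is 5/i for i > 5 (mode inventing) while recall stays at 1. With real images the histograms come from clustered features, so the curves are noisier than this toy version suggests.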

[Figure, left: Inception score (left axis, 4 to 11) and FID (right axis, 0 to 100) against the number of classes in Q (1 to 10). Middle: PRD curves (x-axis: Recall, y-axis: Precision) for $Q_1$ to $Q_{10}$. Right: PRD curves (x-axis: Recall, y-axis: Precision) for $Q_1$ to $Q_5$.]

Large-Scale Evaluation

• Inspecting PRD for 800 generative models (7 GAN variants, VAE). To summarize each PRD curve into a pair of values for visualization, we show the $F_\beta$ scores (left):

$$F_\beta = (1 + \beta^2)\,\frac{p \cdot r}{(\beta^2 p) + r}.$$

• A glance at different models trained on Fashion-MNIST (right) confirms that the results correlate well with the perceived precision and recall of the samples.

• In this experiment, GANs generally have a higher precision and lower recall than VAEs. This follows the folklore that GANs produce higher-quality samples than VAEs but they often collapse to generating only part of the dataset.
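The $F_\beta$ summary used in this section can be sketched as follows. Two details are assumptions of this sketch and are not stated on the poster: each score is taken as the maximum over the points of the PRD curve, and a small `epsilon` is added to the denominator to avoid division by zero. The names `f_beta` and `summarize_prd` are hypothetical.

```python
def f_beta(precision, recall, beta=1.0, epsilon=1e-10):
    """F_beta = (1 + beta^2) * p * r / (beta^2 * p + r); epsilon guards against 0/0."""
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall + epsilon)

def summarize_prd(precisions, recalls, beta=8.0):
    """Reduce a PRD curve to the pair (max F_beta, max F_{1/beta})."""
    pairs = list(zip(precisions, recalls))
    # beta > 1 weights recall more heavily; 1/beta weights precision more heavily.
    f_recall = max(f_beta(p, r, beta) for p, r in pairs)
    f_precision = max(f_beta(p, r, 1.0 / beta) for p, r in pairs)
    return f_recall, f_precision

# Three hand-picked points standing in for a PRD curve.
f8, f_one_eighth = summarize_prd([1.0, 0.5, 0.2], [0.2, 0.5, 1.0])
print(f"F_8 = {f8:.3f}, F_1/8 = {f_one_eighth:.3f}")
```

With $\beta = 8$ this yields the two values plotted as F8 (Recall) and F1/8 (Precision). A model with high $F_{1/8}$ but low $F_8$ would correspond to the GAN-like regime of high precision and low recall described above.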

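[Figure, left: scatter plot of the models with x-axis $F_8$ (Recall) and y-axis $F_{1/8}$ (Precision), both from 0.0 to 1.0; legend: GAN, VAE; four models marked A, B, C, D. Right: sample panels for models A, B, C, D.]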