
Zero-Shot Knowledge Distillation in Deep Networks

Gaurav Kumar Nayak *1, Konda Reddy Mopuri *2, Vaisakh Shaj *3, R. Venkatesh Babu 1, Anirban Chakraborty 1

Abstract

Knowledge distillation deals with the problem of training a smaller model (Student) from a high capacity source model (Teacher) so as to retain most of its performance. Existing approaches use either the training data or meta-data extracted from it in order to train the Student. However, accessing the dataset on which the Teacher has been trained may not always be feasible if the dataset is very large or it poses privacy or safety concerns (e.g., biometric or medical data). Hence, in this paper, we propose a novel data-free method to train the Student from the Teacher. Without even using any meta-data, we synthesize Data Impressions from the complex Teacher model and utilize these as surrogates for the original training data samples to transfer its learning to the Student via knowledge distillation. We, therefore, dub our method "Zero-Shot Knowledge Distillation" and demonstrate that our framework achieves generalization performance competitive with distillation using the actual training data samples on multiple benchmark datasets.

1. Introduction

Knowledge Distillation (Hinton et al., 2015) enables the transfer of the complex mapping functions learned by cumbersome models to relatively simpler models. The cumbersome model can be an ensemble of multiple large models or a single model with large capacity and strong regularizers such as Dropout (Srivastava et al., 2014), BatchNorm (Ioffe & Szegedy, 2015), etc. Typically, the complex and small models are referred to as the Teacher (T) and Student (S) models respectively.

*Equal contribution. 1Department of Computational and Data Sciences, Indian Institute of Science, Bangalore, India. 2School of Informatics, University of Edinburgh, United Kingdom. 3University of Lincoln, United Kingdom. Correspondence to: Gaurav Kumar Nayak <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Generally, Teacher models deliver excellent performance, but they can be huge and computationally expensive. Hence, these models cannot be deployed in limited-resource environments or when real-time inference is expected. On the other hand, a Student model has a substantially smaller memory footprint, requires less computation, and thereby often results in a much faster inference time than that of the much larger Teacher model.

The latent information hidden in the confidences assigned by the Teacher to the incorrect categories, referred to as 'dark knowledge', is transferred to the Student via the distillation process. It is this knowledge that helps the Teacher generalize better, and it is transferred to the Student by matching their soft labels (the output of the softmax layer) instead of the one-hot encoded labels. Matching the soft labels produced by the Teacher is the natural way to transfer its generalization ability. For performing the knowledge distillation, one can use the training data from the target distribution or arbitrary data. Typically, the data used to perform the distillation is called the 'Transfer set'. In order to maximize the information provided per sample, we can make the soft targets have high entropy (non-peaky). This is generally achieved by using a high temperature at the softmax layer (Hinton et al., 2015). Also, because of the non-peaky soft labels, the training gradients computed on the loss have lower variance, which allows the use of higher learning rates and leads to quicker convergence.
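As a concrete illustration (ours, not taken from the paper), a minimal PyTorch sketch of how the softmax temperature τ controls the entropy of the soft targets:

```python
import torch
import torch.nn.functional as F

def soft_targets(logits: torch.Tensor, tau: float) -> torch.Tensor:
    """Temperature-scaled softmax: a higher tau yields softer (higher-entropy) targets."""
    return F.softmax(logits / tau, dim=-1)

logits = torch.tensor([[8.0, 2.0, 0.5]])
print(soft_targets(logits, tau=1.0))   # peaky: almost all mass on the first class
print(soft_targets(logits, tau=20.0))  # non-peaky: mass spread over the classes
```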

The existing approaches use natural data, either from the target data distribution or a different transfer set, to perform the distillation. (Hinton et al., 2015) found that using the original training data performs relatively better. They also suggest adding a term to the objective for the Student to predict the correct labels on the training data along with matching the soft labels from the Teacher (as shown in eq. (1)). However, accessing the samples over which the Teacher had been trained may not always be feasible. Often the training datasets are too large (e.g., ImageNet (Russakovsky et al., 2015)). More importantly, most datasets are proprietary and not shared publicly due to privacy or confidentiality concerns, especially when dealing with the biometric data of large populations, the healthcare data of patients, etc. Also, companies would often not want their proprietary data to be potentially accessed by competitors.


In summary, data is more precious than anything else in the era of deep learning, and hence access to premium data (used in training a model) may not always be realistic.

Therefore, in this paper, we present a novel data-free framework to perform knowledge distillation. Since we do not use any data samples (either from the target dataset or a different transfer set) to perform the knowledge transfer, we name our approach "Zero-Shot Knowledge Distillation" (ZSKD). With no prior knowledge about the target data, we perform pseudo data synthesis from the Teacher model, and the synthesized data act as the transfer set to perform the distillation. Our approach obtains useful prior information about the underlying data distribution in the form of Class Similarities from the model parameters of the Teacher. Further, we successfully utilize this prior in the crafting process by modelling the output space of the Teacher model as a Dirichlet distribution. We name the crafted samples Data Impressions (DI) as these are the impressions of the training data as understood by the Teacher model. Thus, the contributions of this work can be listed as follows:

• Unlike the existing methods that use either data samples or extracted meta-data to perform Knowledge Distillation, we present, for the first time, the idea of Zero-Shot Knowledge Distillation (ZSKD), with no data samples and no extracted prior information.

• In order to compose a transfer set for performing distillation, we present a sample extraction mechanism that models the softmax space as a Dirichlet distribution and crafts Data Impressions (DI) from the parameters of a Teacher model.

• We present a simple, yet powerful procedure to extract a useful prior in the form of Class Similarities (sec. 3.2), which enables better modelling of the data distribution and is utilized in the Dirichlet-sampling-based DI generation framework.

• We demonstrate the effectiveness of our ZSKD approach via an empirical evaluation over multiple benchmark datasets and model architectures (sec. 4).

The rest of the paper is organized as follows: section 2 presents a brief account of existing research related to this work, section 3 discusses the proposed framework in detail, section 4 presents the empirical evaluation, and section 5 provides a discussion on the proposed method and concludes the paper.

2. Related Works

Teacher models generally have high complexity and are not preferred for real-time embedded platforms due to their large memory and computational requirements.

In practice, networks of smaller size, which are compact and deployable, are required. Several techniques have been proposed in the past to transfer knowledge from the teacher to the student model without much compromise in performance. We can categorize them broadly into three types based on the amount of data used for knowledge distillation:

• Using entire training data or similar data: In (Bucilu et al., 2006), a model compression technique is used. The target network is trained using the pseudo labels obtained from the larger model, with an objective to match the pre-softmax values (called logits). In (Hinton et al., 2015), the softmax distribution over classes produced by the teacher model using a high temperature in its softmax (called "soft targets") is used to train the student model, as the knowledge contained in the incorrect class probabilities tends to capture the teacher's generalization ability better than hard labels. Matching the logits is a special case of this general method. In (Furlanello et al., 2018), knowledge transfer is performed across several generations, where the student of the current generation learns from its previous generation. The final predictions are made from the ensemble of student models using the mean of the predictions from each student.

• Using few samples of original data: In (Kimura et al., 2018), knowledge distillation is performed using a few original samples of the training data, which are augmented by "pseudo training examples". These pseudo examples are obtained using the inducing point method (Snelson & Ghahramani, 2006) via an iterative, adversarial optimization technique, which makes the training procedure complicated.

• Using meta-data: In (Lopes et al., 2017), activation records are stored at each layer after the training of the teacher model and used as meta-data to reconstruct training samples, which are then utilized to train the student model. Although this method does consider knowledge distillation in the absence of training data, the meta-data is formed using the training data itself. So, the meta-data depends on the training samples and hence this is not a completely data-free approach.

To the best of our knowledge, we are the first to demonstrate knowledge distillation in the case where no training data is available in any form. It has been shown by (Mopuri et al., 2018) that pretrained models have memory in terms of their learned parameters, which can be used to extract class-representative samples. Although this was used in the context of an adversarial perturbation task, we argue that carefully synthesized samples can be used as pseudo training data for knowledge distillation.


3. Proposed Method

In this section, we briefly introduce the process of Knowledge Distillation (KD) and present the proposed framework for performing Zero-Shot Knowledge Distillation in detail.

3.1. Knowledge Distillation

Transferring the generalization ability of a large, complex Teacher (T) deep neural network to a less complex Student (S) network can be achieved using the class probabilities produced by the Teacher as "soft targets" (Hinton et al., 2015) for training the Student. For this transfer, existing approaches require access to the original training data, consisting of tuples of input data and targets (x, y) ∈ D. Let T be the Teacher network with learned parameters θT and S be the Student with parameters θS; note that in general |θS| ≪ |θT|. Knowledge distillation methods train the Student by minimizing the following objective (L) with respect to the parameters θS over the training samples (x, y) ∈ D:

L = Σ_{(x,y)∈D} L_KD(S(x, θS, τ), T(x, θT, τ)) + λ·L_CE(ŷS, y)    (1)

L_CE is the cross-entropy loss computed on the labels ŷS predicted by the Student and the corresponding ground truth labels y. L_KD is the distillation loss (e.g., cross-entropy or mean squared error) comparing the soft labels (softmax outputs) predicted by the Student against the soft labels predicted by the Teacher. T(x, θT) represents the softmax output of the Teacher and S(x, θS) denotes the softmax output of the Student. Note that, unless mentioned otherwise, we use a softmax temperature of 1. If we use a temperature value (τ) different from 1, we write S(x, θS, τ) and T(x, θT, τ) for the remainder of the paper. λ is the hyper-parameter that balances the two objectives.
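For illustration, a PyTorch-style sketch (our own, not the authors' code) of the objective in eq. (1), choosing cross-entropy between the temperature-scaled soft labels as the distillation loss L_KD; the function and argument names are assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits, labels, tau=20.0, lam=1.0):
    """Objective of eq. (1): soft-label matching (L_KD) plus cross-entropy (L_CE)."""
    p_teacher = F.softmax(teacher_logits / tau, dim=1)        # Teacher soft labels
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    l_kd = -(p_teacher * log_p_student).sum(dim=1).mean()     # cross-entropy between soft labels
    l_ce = F.cross_entropy(student_logits, labels)            # standard supervised term
    return l_kd + lam * l_ce
```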

3.2. Modelling the Data in Softmax Space

However, in this work, we deal with the scenario where we have no access to (i) any training data samples (either from the target distribution or otherwise), or (ii) meta-data extracted from them (e.g., (Lopes et al., 2017)). In order to tackle this, our approach taps the memory (learned parameters) of the Teacher and synthesizes pseudo samples from the underlying data distribution on which it was trained. Since these are impressions of the training data extracted from the trained model, we name these synthesized input representations Data Impressions. We argue that these can serve as representative samples from the training data distribution, which can then be used as a transfer set in order to perform the knowledge distillation to a desired Student model.

Thus, in order to craft the Data Impressions, we model the output (softmax) space of the Teacher model. Let s ∼ p(s) be the random vector that represents the softmax outputs of the Teacher, T(x, θT). We model p(s^k), belonging to each class k, using a Dirichlet distribution, which is a distribution over vectors whose components are in the [0, 1] range and sum to 1. Thus, the distribution representing the softmax outputs s^k of class k is modelled as Dir(K, α^k), where k ∈ {1, ..., K} is the class index, K is the dimension of the output probability vector (the number of categories in the recognition problem), and α^k is the concentration parameter of the distribution modelling class k. The concentration parameter α^k is a K-dimensional vector of positive reals, i.e., α^k = [α^k_1, α^k_2, ..., α^k_K], with α^k_i > 0 ∀i.

Concentration Parameter (α): Since the sample space of the Dirichlet distribution is interpreted as a discrete probability distribution (over the labels), intuitively, the concentration parameter (α) can be thought of as determining how "concentrated" the probability mass of a sample from a Dirichlet distribution is likely to be. With a value much less than 1, the mass will be highly concentrated in only a few components, and all the rest will have almost zero mass. On the other hand, with a value much greater than 1, the mass will be dispersed almost equally among all the components.
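A small NumPy sketch (ours, purely illustrative) of this behaviour of the concentration parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10

# Concentrations well below 1: each sample is peaky (mass on a few components).
peaky = rng.dirichlet(np.full(K, 0.1), size=3)
# Concentrations well above 1: each sample is spread almost evenly over components.
spread = rng.dirichlet(np.full(K, 10.0), size=3)

print(peaky.round(2))
print(spread.round(2))
```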

Obtaining prior information for the concentration parameter is not straightforward. The parameter cannot be the same for all components, since this results in all sets of probabilities being equally likely, which is not a realistic scenario. For instance, in the case of the CIFAR-10 dataset, it would not be meaningful to have a softmax output in which the dog class and the plane class have the same confidence (since they are visually dissimilar). Also, identical αi values denote the lack of any prior information to favour one component of the sampled softmax vector over another. Hence, the concentration parameters should be assigned so as to reflect the similarities across the components of the softmax vector. Since these components denote the underlying categories in the recognition problem, α should reflect the visual similarities among them.

Thus, we resort to the Teacher network to extract this information. We compute a normalized class similarity matrix (C) using the weights W connecting the final (softmax) and the pre-final layers. The element C(i, j) of this matrix denotes the visual similarity between categories i and j in [0, 1]. Thus, a row c_k of the class similarity matrix (C) gives the similarity of class k with each of the K categories (including itself). Each row c_k can be treated as the concentration parameter (α) of the Dirichlet distribution (Dir), which models the distribution of output probability vectors belonging to class k.

Class Similarity Matrix: The class similarity matrix C is calculated as follows. The final layer of a typical recognition model is a fully connected layer with a softmax non-linearity. Each neuron in this layer corresponds to a class (k) and its activation is treated as the probability predicted by the model for that class.


[Figure 1: heatmap of the 10 × 10 class similarity matrix over the CIFAR-10 classes (plane, car, bird, cat, deer, dog, frog, horse, ship, truck), with values ranging from 0.0 to 1.0.]

Figure 1. Class similarity matrix computed for the Teacher model trained over the CIFAR-10 dataset. Note that the class labels are shown and the learned similarities are meaningful.

The weights connecting the previous layer to this neuron (w_k) can be considered as the template of the class k learned by the Teacher network. This is because the predicted class probability is proportional to the alignment of the pre-final layer's output with the template (w_k). The predicted probability peaks when the pre-final layer's output is a positively scaled version of this template (w_k). On the other hand, if the output of the pre-final layer is misaligned with the template w_k, the confidence predicted for class k is reduced. Therefore, we treat the weights w_k as the class template for class k and compute the similarity between classes i and j as:

C(i, j) = (w_i^T w_j) / (‖w_i‖ ‖w_j‖)    (2)

Since the elements of the concentration parameter have to be positive real numbers, we further perform a min-max normalization over each row of the class similarity matrix. The visualization of the class similarity matrix calculated from a CIFAR-10 trained model is shown in Figure 1.
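A PyTorch-style sketch (our own) of computing this matrix from the final fully connected layer's weights; the function name is illustrative, and a small epsilon may be needed in practice to keep the normalized entries strictly positive:

```python
import torch
import torch.nn.functional as F

def class_similarity_matrix(fc_weights: torch.Tensor) -> torch.Tensor:
    """Normalized class similarity matrix C from the final layer weights (shape K x D)."""
    w = F.normalize(fc_weights, dim=1)   # unit-norm class templates w_k
    C = w @ w.t()                        # cosine similarities, eq. (2)
    # Row-wise min-max normalization so the concentration values lie in [0, 1].
    C_min = C.min(dim=1, keepdim=True).values
    C_max = C.max(dim=1, keepdim=True).values
    return (C - C_min) / (C_max - C_min)
```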

3.3. Crafting Data Impressions via Dirichlet Sampling

Once the parameters K and α^k of the Dirichlet distribution are obtained for each class k, we can sample class probability (softmax) vectors which respect the class similarities as learned by the Teacher network. Using the optimization procedure in eq. (3), we obtain the input representations corresponding to these sampled output class probabilities. Let Y^k = [y^k_1, y^k_2, ..., y^k_N] ∈ R^{K×N} be the N softmax vectors corresponding to class k, sampled from the Dir(K, α^k) distribution. Corresponding to each sampled softmax vector y^k_i, we can craft a Data Impression x^k_i for which the Teacher predicts a similar softmax output. We achieve this by optimizing the objective shown in eq. (3). We initialize x^k_i as a random noisy image and update it over multiple iterations till the cross-entropy loss between the sampled softmax vector (y^k_i) and the softmax output predicted by the Teacher is minimized.

x^k_i = argmin_x L_CE(y^k_i, T(x, θT, τ))    (3)

where τ is the temperature used in the softmax layer. The process is repeated for each of the N sampled softmax probability vectors in Y^k, k ∈ {1, ..., K}.
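A minimal sketch (ours, not the authors' implementation) of the optimization in eq. (3) for a single Data Impression; the optimizer, step count, learning rate, and the pixel-range clamp are assumptions:

```python
import torch
import torch.nn.functional as F

def craft_data_impression(teacher, y_target, tau=20.0, steps=1500, lr=0.01,
                          shape=(1, 3, 32, 32)):
    """Optimize a random input so the Teacher's softmax (at temperature tau)
    matches the sampled target vector y_target, as in eq. (3)."""
    x = torch.rand(shape, requires_grad=True)          # random noisy image
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        log_p = F.log_softmax(teacher(x) / tau, dim=1)
        loss = -(y_target * log_p).sum()               # cross-entropy with the soft target
        loss.backward()
        opt.step()
        x.data.clamp_(0.0, 1.0)                        # keep pixels in a valid range (assumption)
    return x.detach()
```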

Scaling Factor (β): The probability density function of the Dirichlet distribution over K random variables is a (K − 1)-dimensional probability simplex that exists in a K-dimensional space. In addition to the parameters K and α discussed in section 3.2, it is important to discuss the significance of the range of the αi ∈ α in controlling the density of the distribution. When αi < 1, ∀i ∈ [1, K], the density congregates at the edges of the simplex (Balakrishnan & Nevzorov, 2004; Lin, 2016). As their values increase (when αi > 1, ∀i ∈ [1, K]), the density becomes more concentrated on the center of the simplex (Balakrishnan & Nevzorov, 2004; Lin, 2016). Thus, we define a scaling factor (β) which can control the range of the individual elements of the concentration parameter, which in turn decides the regions in the simplex from which sampling is performed. This becomes a hyper-parameter for the algorithm. Thus, the actual sampling of the probability vectors happens from p(s) = Dir(K, β × α). β intuitively models the spread of the Dirichlet distribution and acts as a scaling parameter atop α to yield the final concentration parameter (prior). β controls the l1-norm of the final concentration parameter which, in turn, is inversely related to the variance of the distribution. The variance of the sampled simplexes is high for smaller values of β. However, very low values of β (e.g., 0.01), in conjunction with the chosen α, result in highly sparse softmax vectors concentrated on the extreme corners of the simplex, which is equivalent to generating class impressions (see Fig. 3). As per the ablation studies, β values of 0.1, 1.0, or a mix of these are in general favorable, since they encourage higher diversity (variance) while not resulting in highly sparse vectors.
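A small NumPy sketch (ours) of the β-scaled Dirichlet sampling that produces the target softmax vectors for one class; it assumes the rows of the similarity matrix C are strictly positive:

```python
import numpy as np

def sample_soft_targets(C, k, beta, n, seed=0):
    """Sample n target softmax vectors for class k from Dir(K, beta * c_k),
    where c_k is row k of the class similarity matrix C (strictly positive)."""
    rng = np.random.default_rng(seed)
    alpha_k = beta * C[k]                  # scaled concentration parameter
    return rng.dirichlet(alpha_k, size=n)  # shape: (n, K)
```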

3.4. Zero-Shot Knowledge Distillation

Once we craft the Data Impressions (DI) X from the Teacher model, we treat them as the 'Transfer set' and perform the knowledge distillation. Note that we use only the distillation loss L_KD, as shown in eq. (4). We ignore the cross-entropy loss from the general distillation objective (eq. (1)), since there is only minor to no improvement in performance, and doing so removes the burden of the hyper-parameter λ. The proposed ZSKD approach is detailed in Algorithm 1.

θS = argmin_{θS} Σ_{x∈X} L_KD(T(x, θT, τ), S(x, θS, τ))    (4)
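As an illustration (our own sketch, not the released code), a PyTorch-style training loop for eq. (4) over a loader of Data Impressions; the optimizer and schedule are assumptions:

```python
import torch
import torch.nn.functional as F

def distill_on_impressions(student, teacher, transfer_loader, tau=20.0, epochs=100, lr=1e-3):
    """Train the Student on the Data Impressions using only L_KD, as in eq. (4)."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    teacher.eval()
    for _ in range(epochs):
        for x in transfer_loader:                          # batches of Data Impressions
            with torch.no_grad():
                p_t = F.softmax(teacher(x) / tau, dim=1)   # Teacher soft labels
            log_p_s = F.log_softmax(student(x) / tau, dim=1)
            loss = -(p_t * log_p_s).sum(dim=1).mean()      # cross-entropy as L_KD
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```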


Algorithm 1 Zero-Shot Knowledge Distillation
Input: Teacher model T;
    N: number of DIs crafted per category;
    [β1, β2, ..., βB]: B scaling factors;
    τ: temperature for distillation
Output: Learned Student model S(θS); X: Data Impressions

1: Obtain K, the number of categories, from T
2: Compute the class similarity matrix C = [c_1^T, c_2^T, ..., c_K^T] as in eq. (2)
3: X ← ∅
4: for k = 1 : K do
5:   Set the concentration parameter α^k = c_k
6:   for b = 1 : B do
7:     for n = 1 : ⌊N/B⌋ do
8:       Sample y^k_n ∼ Dir(K, β_b × α^k)
9:       Initialize x^k_n to random noise and craft x^k_n = argmin_x L_CE(y^k_n, T(x, θT, τ))
10:      X ← X ∪ {x^k_n}
11:    end for
12:  end for
13: end for
14: Transfer the Teacher's knowledge to the Student using the DIs via θS = argmin_{θS} Σ_{x∈X} L_KD(T(x, θT, τ), S(x, θS, τ))

Thus, via Dirichlet sampling, we generate a diverse set of pseudo training examples that can provide enough information to train the Student model. Some of the Data Impressions for the CIFAR-10 dataset are presented in Figure 4. Note that the figure shows 3 DIs per category, and that the top-2 confidences in the sampled softmax corresponding to each DI are mentioned on top. We observe that the DIs are visually far away from the actual data samples of the dataset. However, some of the DIs synthesized from peaky softmax vectors (e.g., the bird, cat, car, and deer in the first row) contain clearly visible patterns of the corresponding objects. The observation that the DIs are visually far from the actual data samples is understandable, since the objective used to synthesize them (eq. (3)) pays no explicit attention to visual detail.

4. Experiments

In this section, we discuss the experimental evaluation of the proposed data-free knowledge transfer framework over a set of benchmark object recognition datasets: MNIST (LeCun et al., 1998), Fashion MNIST (FMNIST) (Xiao et al., 2017), and CIFAR-10 (Krizhevsky & Hinton, 2009). As all the experiments on these three datasets deal with classification problems with 10 categories each, the value of the parameter K is 10 in all our experiments. For each dataset, we first train the Teacher model over the available training data using the cross-entropy loss.

Table 1. Performance of the proposed ZSKD framework on the MNIST dataset.

Model | Performance (%)
Teacher-CE | 99.34
Student-CE | 98.92
Student-KD (Hinton et al., 2015), 60K original data | 99.25
(Kimura et al., 2018), 200 original data | 86.70
(Lopes et al., 2017), uses meta-data | 92.47
ZSKD (Ours), 24000 DIs and no original data | 98.77

Then we extract a set of Data Impressions (DI) from it by modelling its softmax output space, as explained in sections 3.2 and 3.3. Finally, we choose a (lightweight) Student model and train it over the transfer set (DI) using eq. (4).

We consider two (B = 2) scaling factors, β1 = 1.0 and β2 = 0.1, across all the datasets, i.e., for each dataset, half the Data Impressions are generated with β1 and the other half with β2. However, we observed that one can get fairly decent performance with β equal to either 0.1 or 1 (even without using the mixture of Dirichlet distributions) across the datasets. A temperature value (τ) of 20 is used across all the datasets. We investigate (in sec. 4.4) the effect of the transfer set size, i.e., the number of Data Impressions, on the performance of the Student model. Also, since the proposed approach aims to achieve better generalization, it is a natural choice to augment the crafted Data Impressions while performing the distillation. We augment the samples using regular operations such as scaling, translation, rotation, flipping, etc., which have proven useful in further boosting model performance (Dao et al., 2018).
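One possible augmentation pipeline is sketched below (torchvision-based, our own; the specific transforms and magnitudes are assumptions, not taken from the paper):

```python
import torchvision.transforms as T

# Random scaling, translation, rotation, and flipping of the crafted Data Impressions.
augment = T.Compose([
    T.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.9, 1.1)),
    T.RandomHorizontalFlip(p=0.5),
])
```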

4.1. MNIST

The MNIST dataset has 60000 training images and 10000 test images of handwritten digits. We consider LeNet-5 as the Teacher model and LeNet-5-Half as the Student model, similar to (Lopes et al., 2017). The LeNet-5 model contains 2 convolution layers with pooling, followed by three fully connected layers. LeNet-5 is modified into LeNet-5-Half by taking half the number of filters in each of the convolutional layers. The Teacher and Student models have 61706 and 35820 parameters respectively. Input images are resized from 28 × 28 to 32 × 32 and the pixel values are normalized to be in [0, 1] before being fed into the models.
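For reference, a hedged PyTorch sketch of such a LeNet-5 / LeNet-5-Half pair; the exact activations and pooling the authors used are not specified here, but this configuration reproduces the stated parameter counts:

```python
import torch.nn as nn

def lenet5(width=1.0, num_classes=10):
    """A LeNet-5-style network for 32x32 single-channel inputs; width=0.5 gives LeNet-5-Half."""
    c1, c2 = int(6 * width), int(16 * width)
    return nn.Sequential(
        nn.Conv2d(1, c1, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 32x32 -> 14x14
        nn.Conv2d(c1, c2, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),  # 14x14 -> 5x5
        nn.Flatten(),
        nn.Linear(c2 * 5 * 5, 120), nn.ReLU(),
        nn.Linear(120, 84), nn.ReLU(),
        nn.Linear(84, num_classes),
    )

teacher = lenet5(width=1.0)   # 61,706 parameters
student = lenet5(width=0.5)   # 35,820 parameters
```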

The performance of our Zero-Shot Knowledge Distillation on the MNIST dataset is presented in Table 1. Note that, in order to understand the effectiveness of the proposed ZSKD, the table also shows the performance of the Teacher and Student models trained over the actual data samples, along with a comparison against existing distillation approaches.


Table 2. Performance of the proposed ZSKD framework on the Fashion MNIST dataset.

Model | Performance (%)
Teacher-CE | 90.84
Student-CE | 89.43
Student-KD (Hinton et al., 2015), 60K original data | 89.66
(Kimura et al., 2018), 200 original data | 72.50
ZSKD (Ours), 48000 DIs and no original data | 79.62

Teacher-CE denotes the classification accuracy of the Teacher model trained using the cross-entropy (CE) loss, and Student-CE denotes the performance of the Student model trained with all the training samples and their ground truth labels using the cross-entropy loss. Student-KD denotes the accuracy of the Student model trained using the actual training samples through Knowledge Distillation (KD) from the Teacher. Note that this result can act as a loose upper bound for the data-free distillation approaches.

It is clear that the proposed Zero-Shot Knowledge Distillation (ZSKD) outperforms the existing few-data (Kimura et al., 2018) and data-free (Lopes et al., 2017) counterparts by a large margin. Also, it performs close to the full-data (classical) Knowledge Distillation while using only 24000 DIs, i.e., 40% of the original training set size.

4.2. Fashion MNIST

In comparison to MNIST, this dataset is more challenging and contains images of fashion products. The training and test sets have 60000 and 10000 images respectively. Similar to MNIST, we consider LeNet-5 and LeNet-5-Half as the Teacher and Student models respectively, where each input image is resized from 28 × 28 to 32 × 32.

Table 2 presents our results and compares them with existing approaches. Similar to MNIST, ZSKD outperforms the existing few-data knowledge distillation approach (Kimura et al., 2018) by a large margin, and performs close to the classical knowledge distillation scenario (Hinton et al., 2015) with all the training samples.

4.3. CIFAR-10

Unlike MNIST and Fashion MNIST, this dataset contains RGB images of dimension 32 × 32 × 3. The dataset contains 60000 images from 10 classes, where each class has 6000 images. Among them, 50000 images form the training set and the remaining 10000 images compose the test set. We take AlexNet (Krizhevsky et al., 2012) as the Teacher model, which is relatively large in comparison to LeNet-5.

Table 3. Performance of the proposed ZSKD framework on the CIFAR-10 dataset.

Model | Performance (%)
Teacher-CE | 83.03
Student-CE | 80.04
Student-KD (Hinton et al., 2015), 50K original data | 80.08
ZSKD (Ours), 40000 DIs and no original data | 69.56

Since the standard AlexNet model is designed to process inputs of dimension 227 × 227 × 3, we would need to resize the input images to this large dimension. To avoid that, we modified the standard AlexNet to accept 32 × 32 × 3 input images. The modified AlexNet contains 5 convolution layers with BatchNorm (Ioffe & Szegedy, 2015) regularization. Pooling is also applied after convolution layers 1, 2, and 5. The deepest three layers are fully connected. AlexNet-Half is derived from AlexNet by taking half of the convolutional filters and half of the neurons in the fully connected layers, except in the classification layer, which has as many neurons as there are classes. The AlexNet-Half architecture is used as the Student model. The Teacher and Student models have 1.65 × 10^6 and 7.23 × 10^5 parameters respectively. For architectural details of the Teacher and Student nets, please refer to the supplementary document.

Table 3 presents the results on the CIFAR-10 dataset. It can be observed that the proposed ZSKD approach achieves knowledge distillation with the Data Impressions that is competitive with that realized using the actual data samples. Since the underlying target dataset is relatively more complex, we use a bigger transfer set containing 40000 DIs. However, this transfer set of DIs is still 20% smaller than the original training set used for classical knowledge distillation (Hinton et al., 2015).

4.4. Size of the Transfer Set

In this subsection, we investigate the effect of the transfer set size on the performance of the distilled Student model. We perform the distillation with different numbers of Data Impressions, namely {1%, 5%, 10%, ..., 80%} of the training set size. Figure 2 shows the performance of the resulting Student model on the test set for all the datasets. For comparison, the plots also present the performance of models distilled with an equal number of actual training samples from the dataset. It is observed that, as one would expect, the performance increases with the size of the transfer set. Interestingly, even a small number of Data Impressions (e.g., 20% of the training set size) is sufficient to provide competitive performance, though the improvement in performance quickly saturates.


[Figure 2: three panels (MNIST, FMNIST, CIFAR-10) plotting test accuracy against the transfer set size (from 1% up to 80% of the training set), comparing distillation with Data Impressions and with original training data.]

Figure 2. Performance (test accuracy) comparison of data samples versus Data Impressions (without augmentation). Note that the x-axis denotes the number of DIs or original training samples (in %) used for performing Knowledge Distillation, relative to the size of the training data.

[Figure 3: three panels (MNIST, FMNIST, CIFAR-10) plotting test accuracy against the transfer set size, comparing ZSKD with Class Impressions and with the proposed Data Impressions.]

Figure 3. Performance (test accuracy) comparison of ZSKD with Class Impressions (Mopuri et al., 2018) and the proposed Data Impressions (without augmentation). Note that the x-axis denotes the number of DIs or CIs (in %) used for performing Knowledge Distillation, relative to the training data size.

[Figure 4: a grid of Data Impressions for the 10 CIFAR-10 classes, 3 per class, synthesized from sampled softmax vectors of decreasing peakiness (e.g., plane=1.0, bird=0.0; plane=0.98, bird=0.01; plane=0.76, bird=0.21), with the top-2 sampled confidences shown above each impression.]

Figure 4. Visualizing the DIs synthesized from the Teacher model trained on the CIFAR-10 dataset for different choices of output softmax vectors (i.e., output class probabilities). Note that the figure shows 3 DIs per class in each column, each having a different spread over the labels. However, only the top-2 confidences in the sampled softmax corresponding to each DI are mentioned on top for clarity. Please note that there is no explicit objective encouraging these pseudo samples to be visually close to the actual training data samples, and yet some of the samples show striking patterns visually very similar to actual object shapes (e.g., bird, car, cat, dog, and deer in the first/second rows).


Also, note that the initial performance (with a smaller transfer set) reflects the complexity of the task (dataset). For simpler datasets such as MNIST, smaller transfer sets are sufficient to achieve competitive performance. In other words, a small number of Data Impressions can do the job of representing the patterns in the dataset. As the dataset becomes more complex, more Data Impressions need to be generated to capture the underlying patterns. Note that similar trends are observed in the distillation with the actual training samples as well.

4.5. Class Versus Data Impressions

Feature visualization works such as (Simonyan et al., 2014; Springenberg et al., 2015; Olah et al., 2017; Mordvintsev et al., 2015) attempt to understand the patterns learned by deep neural networks in order to recognize objects. These works reconstruct a chosen neural activation in the input space as one way to explain the model's inference.

One of the recent works, (Mopuri et al., 2018), reconstructs samples of a given class for the downstream task of adversarial fooling. They optimize random noise in the input space till it results in a one-hot (softmax) output. That is, their optimization for crafting the representative samples expects a one-hot vector in the output space. Hence, they call the reconstructions Class Impressions. Our reconstruction (eq. (3)) is inspired by this, though we model the output space using the class similarities perceived by the Teacher model. Because of this, we argue that our modelling is closer to the original distribution and results in better patterns in the reconstructions, and we therefore call them Data Impressions of the Teacher model.

In this subsection, we compare these two varieties of reconstructions for the application of distillation. Figure 3 demonstrates the effectiveness of Class and Data Impressions over the three datasets. It is observed that the proposed Dirichlet modelling of the output space and the resulting impressions consistently outperform their class counterparts by a large margin. Also, in the case of Class Impressions, the increase in performance due to a larger transfer set is relatively small compared to that of Data Impressions. Note that, for better understanding, the results are shown without any data augmentation while conducting the distillation.

5. Discussion and Conclusion

Knowledge Distillation (Hinton et al., 2015) and few-shot learning hold a great deal of potential in terms of both the challenges they pose and the applications that can be realised. Data-free learning presented in recent works such as (Lopes et al., 2017; Mopuri et al., 2018) can be treated as a type of zero-shot learning, where the aim is to extract or reconstruct samples of the underlying data distribution from a trained model in order to realize a target application.

It is easy to see that this line of research has significant practical implications. For instance, a deep learned model can be obtained (i) from commercial products with deployed models (e.g., a mobile phone or an autonomous driving vehicle), or (ii) via hacking a deployment setup. In such cases, only the trained model is available, without the training data. Also, this line of work can help mitigate the absence of training data in scenarios such as medical diagnosis, where it is often the case that patients' privacy prohibits distribution of the training data; in those cases, only the trained models can be made available. Further, given (i) the cost of annotating data, and (ii) the competitive advantage leveraged from more training data, it is quite possible that trained models will be made available but not the actual training data. For example, models trained by Google and Facebook might utilize proprietary data such as JFT-300M and SFC.

In this work, we presented, for the first time, a complete framework called Zero-Shot Knowledge Distillation (ZSKD) to perform knowledge distillation without utilizing any data samples or meta-data extracted from them. We proposed a sample extraction mechanism via modelling the data distribution in the softmax space. As a useful prior, our model utilizes class similarity information extracted from the learned model and attempts to synthesize the underlying data samples. Further, we investigated the effectiveness of these synthesized samples, named Data Impressions, for the downstream task of training a substitute model via distillation.

A set of recent works that attempt to extract the training data from a learned model drive a downstream task such as crafting adversarial perturbations or training a substitute model. However, in the current setup, the extracted samples are not influenced by the target task, so they cannot be called task-driven. Besides, these extractions do not appear to utilize any strong prior about the data distribution during the reconstruction. In that sense, our Dirichlet modelling of the output space, which incorporates the visual similarity prior among the categories, can be considered a step towards a faithful extraction of the underlying patterns in the distribution. However, we believe that there is a lot of scope for imbibing additional and better priors, particularly in the task-driven scenario. For instance, utilizing multiple Teacher models trained on different tasks could enable better extraction of the data patterns. Also, while estimating the impressions, we could formulate objectives that explicitly encourage diversity in the extracted samples. Some of these ideas will be considered as our future research directions.

References

Balakrishnan, N. and Nevzorov, V. B. A primer on statistical distributions. John Wiley & Sons, 2004.


Bucilu, C., Caruana, R., and Niculescu-Mizil, A. Model compression. In International Conference on Knowledge Discovery and Data Mining. ACM, 2006.

Dao, T., Gu, A., Ratner, A. J., Smith, V., De Sa, C., and Re, C. A kernel theory of modern data augmentation. arXiv preprint arXiv:1803.06084, 2018.

Furlanello, T., Lipton, Z. C., Tschannen, M., Itti, L., and Anandkumar, A. Born again neural networks. arXiv preprint arXiv:1805.04770, 2018.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.

Kimura, A., Ghahramani, Z., Takeuchi, K., Iwata, T., and Ueda, N. Few-shot learning of neural networks from scratch by pseudo example optimization. In British Machine Vision Conference (BMVC), 2018.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Canadian Institute for Advanced Research, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lin, J. On the Dirichlet distribution. Master's thesis, Department of Mathematics and Statistics, Queen's University, Kingston, Ontario, Canada, 2016.

Lopes, R. G., Fenu, S., and Starner, T. Data-free knowledge distillation for deep neural networks. In LLD Workshop at Neural Information Processing Systems (NIPS), 2017.

Mopuri, K. R., Krishna, P., and Babu, R. V. Ask, acquire, and attack: Data-free UAP generation using class impressions. In European Conference on Computer Vision (ECCV), 2018.

Mordvintsev, A., Tyka, M., and Olah, C. Google Deep Dream. 2015. URL https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html.

Olah, C., Mordvintsev, A., and Schubert, L. Feature visualization. Distill, 2017. URL https://distill.pub/2017/feature-visualization.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3), 2015.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. In International Conference on Learning Representations (ICLR) Workshops, 2014.

Snelson, E. and Ghahramani, Z. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems (NIPS), 2006.

Springenberg, J., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. In International Conference on Learning Representations (ICLR) Workshops, 2015.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.