Compacting, Picking and Growing for Unforgetting Continual Learning

Steven C. Y. Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen, Yi-Ming Chan, and Chu-Song Chen

Institute of Information Science, Academia Sinica, Taipei, Taiwan
MOST Joint Research Center for AI Technology and All Vista Healthcare

{brent12052003, andytu455176}@gmail.com, {chengen, redsword26, yiming, song}@iis.sinica.edu.tw

Abstract

Continual lifelong learning is essential to many applications. In this paper, we propose a simple but effective approach to continual deep learning. Our approach leverages the principles of deep model compression, critical weights selection, and progressive networks expansion. By enforcing their integration in an iterative manner, we introduce an incremental learning method that is scalable to the number of sequential tasks in a continual learning process. Our approach is easy to implement and has several favorable characteristics. First, it can avoid forgetting (i.e., learn new tasks while remembering all previous tasks). Second, it allows model expansion but can maintain the model compactness when handling sequential tasks. Besides, through our compaction and selection/expansion mechanism, we show that the knowledge accumulated through learning previous tasks is helpful to build a better model for the new tasks compared to training the models independently with tasks. Experimental results show that our approach can incrementally learn a deep model tackling multiple tasks without forgetting, while the model compactness is maintained and the performance is more satisfactory than individual task training.

1 Introduction

Continual lifelong learning [42, 28] has received much attention in recent deep learning studies. In this research track, we hope to learn a model capable of handling unknown sequential tasks while keeping the performance of the model on previously learned tasks. In continual lifelong learning, the training data of previous tasks are assumed non-available for the newly coming tasks. Although the model learned can be used as a pre-trained model, fine-tuning a model for the new task will force the model parameters to fit new data, which causes catastrophic forgetting [24, 31] on previous tasks.

To lessen the effect of catastrophic forgetting, techniques leveraging regularization of gradients or weights during training have been studied [14, 49, 19, 35]. In Kirkpatrick et al. [14] and Zenke et al. [49], the proposed algorithms regularize the network weights and hope to search for a common convergence for the current and previous tasks. Schwarz et al. [40] introduce a network-distillation method for regularization, which imposes constraints on the neural weights adapted from the teacher to the student network and applies elastic weight consolidation (EWC) [14] for incremental training. The regularization-based approaches reduce the effect of catastrophic forgetting. However, as the training data of previous tasks are missing during learning and the network capacity is fixed (and limited), the regularization approaches often forget the learned skills gradually. Earlier tasks tend to be forgotten more catastrophically in general. Hence, they would not be a favorable choice when the number of sequential tasks is unlimited.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Compacting, Picking, and Growing (CPG) continual learning. Given a well-trained model, gradual pruning is applied to compact the model to release redundant weights. The compact model weights are kept to avoid forgetting. Then a learnable binary weight-picking mask is trained along with the previously released space for new tasks to effectively reuse the knowledge of previous tasks. The model can be expanded for new tasks if it does not meet the performance goal. Best viewed in color.

To address the data-missing issue (i.e., the lack of training data for old tasks), data-preserving and memory-replay techniques have been introduced. Data-preserving approaches (such as [32, 3, 11, 34]) are designed to directly save important data or latent codes in an efficient form, while memory-replay approaches [41, 46, 13, 45, 27] introduce additional memory models such as GANs for keeping data information or distribution in an indirect way. The memory models have the ability to replay previous data. Based on past data information, we can then train a model such that the performance can be recovered to a considerable extent for the old tasks. However, a general issue of memory-replay approaches is that they require explicit re-training using the old information accumulated, which leads to either large working memory or a compromise between the information memorized and forgetting.

This paper introduces an approach for learning sustainable but compact deep models, which can handle an unlimited number of sequential tasks while avoiding forgetting. As a limited architecture cannot guarantee remembering the skills incrementally learned from unlimited tasks, our method allows growing the architecture to some extent. However, we also remove the model redundancy during continual learning, and thus can increasingly compact multiple tasks with very limited model expansion.

Besides, pre-training or gradually fine-tuning the models from a starting task only incorporates prior knowledge at initialization; hence, the knowledge base gradually diminishes as more tasks are learned. As humans have the ability to continually acquire, fine-tune and transfer knowledge and skills throughout their lifespan [28], in lifelong learning we would hope that the experience accumulated from previous tasks is helpful to learn a new task. As the model incrementally learned by our method serves as a compact, unforgetting base, it generally yields a better model for the subsequent tasks than training the tasks independently. Experimental results reveal that our lifelong learning method can leverage the knowledge accumulated from the past to enhance the performance of new tasks.

Motivation of Our Method Design: Our method is designed by combining the ideas of deep model compression via weights pruning (Compacting), critical weights selection (Picking), and ProgressiveNet extension (Growing). We refer to it as CPG; its rationale is given below.

As stated above, although the regularization or memory-replay approaches lessen the effect of forgetting, they often do not guarantee to preserve the performance for previous tasks. To exactly avoid forgetting, a promising way is to keep the old-task weights already learned [38, 47] and enlarge the network by adding nodes or weights for training new tasks. In ProgressiveNet [38], to ease the training of new tasks, the old-task weights are shared with the new ones but remain fixed, where only the new weights are adapted for the new task. As the old-task weights are kept, it ensures the performance of learned tasks. However, as the complexity of the model architecture is proportional to the number of tasks, it yields a highly redundant structure for keeping multiple models.

Motivated by ProgressiveNet, we design a method allowing the sustainability of architecture too. To avoid constructing a complex and huge structure like ProgressiveNet, we perform model compression for the current task every time so that a condensed model is established for the old tasks. According to deep-net compression [10], much redundancy is contained in a neural network, and removing the redundant weights does not affect the network performance. Our approach exploits this property and compresses the current task by deleting negligible weights. This yields a compressing-and-growing loop for a sequence of tasks. Following the idea of ProgressiveNet, the weights preserved for the old tasks are set as invariant to avoid forgetting in our approach. However, unlike ProgressiveNet, where the architecture is always grown for a new task, the weights deleted from the current task can be released for use by the new tasks, so we do not have to grow the architecture every time but can employ the weights released previously for learning the next task. Therefore, in the growing step of our CPG approach, two possible choices are provided. The first is to use the previously released weights for the new task. If the performance goal is not fulfilled yet when all the released weights are used, we then proceed to the second choice, where the architecture is expanded and both the released and expanded weights are used for the new-task training.

Another distinction of our approach is the “picking” step. The idea is motivated below. In ProgressiveNet, the old-task weights preserved are all co-used (yet remain fixed) for learning the new tasks. However, as the number of tasks increases, the amount of old-task weights grows as well. When all of them are co-used with the weights newly added in the growing step, the old weights (that are fixed) act like inertia since only the fewer new weights are allowed to be adapted, which tends to slow down the learning process and, in our experience, makes the solution found immature. To address this issue, we do not employ all the old-task weights but pick only some critical ones from them via a differentiable mask. In the picking step of our CPG approach, the old weights' picking mask and the new weights added in the growing step are both adapted to learn an initial model for the new task. Then, likewise, the initial model obtained is compressed and preserved for the new task as well.

To compress the weights for a task, a main difficulty is the lack of prior knowledge to determine the pruning ratio. To solve this problem, in the compacting step of our CPG approach, we employ the gradual pruning procedure [51] that iteratively prunes a small portion of weights and retrains the remaining weights to restore the performance. The procedure stops when meeting a pre-defined accuracy goal. Note that only the newly added weights (from the released and/or expanded ones in the growing step) are allowed to be pruned, whereas the old-task weights remain unchanged.
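As a concrete illustration of this compacting step, the snippet below is a minimal PyTorch sketch of the gradual prune-and-retrain loop: it repeatedly removes the smallest fraction of the current task's own ("free") weights, retrains the survivors, and rolls back the last round once the accuracy goal is no longer met. The helpers `evaluate` and `finetune`, the 10% per-round fraction, and the round limit are illustrative assumptions, not the exact schedule of [51].

```python
import torch

def gradual_prune(model, free_mask, accuracy_goal, evaluate, finetune,
                  prune_fraction=0.1, max_rounds=20):
    """Iteratively prune a small portion of the task's own ("free") weights and retrain.

    free_mask: dict name -> bool tensor marking weights owned by the current task;
    weights outside free_mask (old-task weights) are never touched.
    evaluate(model) -> float accuracy; finetune(model, keep) retrains only the
    surviving free weights. Both are assumed helper routines.
    """
    keep = {n: m.clone() for n, m in free_mask.items()}   # still-alive free weights
    for _ in range(max_rounds):
        # Collect magnitudes of the currently alive free weights.
        alive = [p.detach().abs()[keep[n]]
                 for n, p in model.named_parameters() if n in keep]
        alive = torch.cat(alive) if alive else torch.empty(0)
        if alive.numel() == 0:
            break
        threshold = torch.quantile(alive, prune_fraction)  # smallest fraction of alive weights
        proposal = {n: keep[n] & (p.detach().abs() > threshold)
                    for n, p in model.named_parameters() if n in keep}
        backup = {n: p.detach().clone() for n, p in model.named_parameters()}
        with torch.no_grad():
            for n, p in model.named_parameters():
                if n in proposal:
                    # Zero only the free weights pruned in this round.
                    p.masked_fill_(free_mask[n] & ~proposal[n], 0.0)
        finetune(model, proposal)          # retrain survivors to restore performance
        if evaluate(model) < accuracy_goal:
            # Roll back the last round; the previous state still met the goal.
            with torch.no_grad():
                for n, p in model.named_parameters():
                    p.copy_(backup[n])
            break
        keep = proposal
    released = {n: free_mask[n] & ~keep[n] for n in keep}  # weights freed for future tasks
    return keep, released
```

The returned partition corresponds to the paper's W^P (kept, then frozen) and W^E (released) weights for the current task.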

Method Overview: Our method overview is depicted as follows. The CPG method alternately compresses the deep model and (selectively) expands the architecture. First, a compressed model is built from pruning. Given a new task, the weights of the old-task models are fixed. Next, we pick and re-use some of the old-task weights critical to the new task via a differentiable mask, and use the previously released weights for learning together. If the accuracy goal is not attained yet, the architecture can be expanded by adding filters or nodes to the model and the procedure is resumed. Then we repeat the gradual pruning [51] (i.e., iteratively removing a portion of weights and retraining) to compact the model of the new task. An overview of our CPG approach is given in Figure 1. The new-task weights are formed by a combination of two parts: the first part is picked via a learnable mask on the old-task weights, and the second part is learned by gradual pruning/retraining of the extra weights. As the old-task weights are only picked but fixed, we can integrate the required function mappings in a compact model without affecting their accuracy in inference. The main characteristics of our approach are summarized as follows.

Avoid forgetting: Our approach ensures unforgetting. The function mappings previously built are maintained exactly the same when new tasks are incrementally added.

Expand with shrinking: Our method allows expansion but keeps the compactness of the architecture, which can potentially handle unlimited sequential tasks. Experimental results reveal that multiple tasks can be condensed in a model with slight or no architecture growing.

Compact knowledge base: Experimental results show that the condensed model recorded for previous tasks serves as a knowledge base with accumulated experience for weights picking in our approach, which yields performance enhancement for learning new tasks.

2 Related Work

Continual lifelong learning [28] can be divided into three main categories: network regularization, memory or data replay, and dynamic architecture. Besides, works on task-free learning [2] or on lifelong learning as program synthesis [43] have also been studied recently. In the following, we give a brief review of works in the main categories, and readers are suggested to refer to a recent survey paper [28] for more studies.

Network regularization: The key idea of network regularization approaches [14, 49, 19, 35, 40, 5, 6] is to restrictively update the learned model weights. To keep the learned task information, some penalties are added to the change of weights. EWC [14] uses Fisher information to evaluate the importance of weights for old tasks, and updates weights according to the degree of importance. Based on similar ideas, the method in [49] calculates the importance by the learning trajectory. Online EWC [40] and EWC++ [5] improve the efficiency issues of EWC. Learning without Memorizing (LwM) [6] presents an information-preserving penalty. The approach builds an attention map, and hopes that the attention regions of the previous and concurrent models are consistent. These works alleviate catastrophic forgetting but cannot guarantee the previous-task accuracy exactly.

Memory replay: Memory or data replay methods [32, 41, 13, 3, 46, 45, 11, 34, 33, 27] use additional models to remember data information. Generative Replay [41] introduces GANs to lifelong learning. It uses a generator to sample fake data which have a similar distribution to previous data. New tasks can then be trained with these generated data. Memory Replay GANs (MeRGANs) [45] shows that the forgetting phenomenon still exists in a generator, and the quality of generated data becomes worse with incoming tasks; replay data are used to enhance the generator quality. Dynamic Generative Memory (DGM) [27] uses neural masking to learn connection plasticity in conditional generative models, and sets a dynamic expansion mechanism in the generator for sequential tasks. Although these methods can exploit data information, they still cannot guarantee the exact performance of past tasks.

Dynamic architecture: Dynamic-architecture approaches [38, 20, 36, 29, 48] adapt the architecture to a sequence of tasks. ProgressiveNet [38] expands the architecture for new tasks and keeps the function mappings by preserving the previous weights. LwF [20] divides the model layers into two parts, shared and task-specific, where the former are co-used by tasks and the latter are grown with further branches for new tasks. DAN [36] extends the architecture per new task, while each layer in the new-task model is a sparse linear combination of the original filters in the corresponding layer of a base model. Architecture expansion has also been adopted in a recent memory-replay approach [27] on GANs. These methods can considerably lessen or avoid catastrophic forgetting via architecture expansion, but the model is monotonically increased and a redundant structure is yielded.

As continually growing the architecture retains model redundancy, some approaches perform model compression before expansion [48] so that a compact model can be built. The most related method to ours is the Dynamic-expansion Net (DEN) [48]. DEN reduces the weights of the previous tasks via sparse regularization. Newly added weights and old weights are both adapted for the new task with sparse constraints. However, DEN does not ensure non-forgetting. As the old-task weights are jointly trained with the new weights, part of the old-task weights is selected and modified. Hence, a "Split & Duplication" step is introduced to further 'restore' some of the modified old weights for lessening the forgetting effect. Pack and Expand (PAE) [12] is our previous approach that takes advantage of PackNet [23] and ProgressiveNet [38]. It can avoid forgetting, maintain model compactness, and allow dynamic model expansion. However, as it uses all weights of previous tasks for sharing, the performance becomes less favorable when learning a new task.

Our approach (CPG) is accomplished by a compacting→picking(→growing) loop, which selects critical weights from old tasks without modifying them, and thus avoids forgetting. Besides, our approach does not have to restore the old-task performance like DEN, as the performance is already kept; it thus avoids a tedious "Split & Duplication" process, which takes extra time for model adjustment and affects the new-task performance. Our approach is hence simpler and easier to implement. In the experimental results, we show that our approach also outperforms DEN and PAE.

3 The CPG approach for Continual Lifelong Learning

Without loss of generality, our work follows a task-based sequential learning setup that is a common setting in continual learning. In the following, we present our method in the sequential-task manner.

Task 1: Given the first task (task-1) and an initial model trained via its dataset, we perform gradual pruning [51] on the model to remove its redundancy while keeping the performance. Instead of pruning the weights to the pruning-ratio goal in one pass, gradual pruning iteratively removes a portion of the weights and retrains the model to restore the performance, until the pruning criteria are met. Thus, we compact the current model so that the redundancy among the model weights is removed (or released). The weights in the compact model are then set unalterable and remain fixed to avoid forgetting. After gradual pruning, the model weights can be divided into two parts: one is preserved for task 1; the other is released and able to be employed by the subsequent tasks.

Task k to k+1: Assume that in task-k, a compact model that can handle tasks 1 to k has been built and is available. The model weights preserved for tasks 1 to k are denoted as W^P_{1:k}. The released (redundant) weights associated with task-k are denoted as W^E_k; they are extra weights that can be used for subsequent tasks. Given the dataset of task-(k+1), we apply a learnable mask M to pick the old weights W^P_{1:k}, with M ∈ {0, 1}^D and D the dimension of W^P_{1:k}. The weights picked are then represented as M ⊙ W^P_{1:k}, the element-wise product of the 0-1 mask M and W^P_{1:k}. Without loss of generality, we use the piggyback approach [22] that learns a real-valued mask M̂ and applies a threshold for binarization to construct M. Hence, given a new task, we pick a set of weights (known as the critical weights) from the compact model via a learnable mask. Besides, we also use the released weights W^E_k for the new task. The mask M and the additional weights W^E_k are learned together on the training data of task-(k+1) with the loss function of task-(k+1) via back-propagation. Since the binarized mask M is not differentiable, when training the binary mask M, we update the real-valued mask M̂ in the backward pass; then M is quantized with a threshold on M̂ and applied in the forward pass. If the performance is still unsatisfactory, the model architecture can be grown to include more weights for training. That is, W^E_k can be augmented with additional weights (such as new filters in convolutional layers and nodes in fully-connected layers), and then the training of both M and W^E_k resumes. Note that during training, the mask M and the new weights W^E_k are adapted, but the old weights W^P_{1:k} are "picked" only and remain fixed. Thus, old tasks can be exactly recalled.
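For concreteness, the following is a minimal PyTorch sketch of the picking step for a single convolutional layer: a real-valued mask M̂ is hard-thresholded into M in the forward pass while gradients flow straight through to M̂ in the backward pass, and the effective weight is M ⊙ W^P_{1:k} plus new weights W^E_k confined to the released slots. The threshold value, mask initialization, and padding are assumptions for illustration; the paper follows the piggyback masking scheme [22].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Binarize(torch.autograd.Function):
    """Forward: hard-threshold the real-valued mask into {0, 1}.
    Backward: pass the gradient straight through to the real-valued mask."""
    @staticmethod
    def forward(ctx, m_real, threshold):
        return (m_real > threshold).float()

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # no gradient for the threshold


class PickingConv2d(nn.Module):
    """Conv layer whose effective weight is  M ⊙ W_old  +  W_new,
    where W_old (old-task weights) stays frozen, M is a learnable binary mask,
    and W_new is trainable only in the released/expanded ("free") slots."""
    def __init__(self, old_weight, free_slots, threshold=5e-3):
        super().__init__()
        self.register_buffer("w_old", old_weight.detach().clone())  # frozen W^P_{1:k}
        self.register_buffer("free", free_slots.float())            # 1 where W^E_k may live
        self.m_real = nn.Parameter(torch.full_like(old_weight, 1e-2))  # real-valued mask M̂
        self.w_new = nn.Parameter(torch.zeros_like(old_weight))        # trainable W^E_k
        self.threshold = threshold

    def forward(self, x):
        mask = Binarize.apply(self.m_real, self.threshold)  # binary M with STE gradients
        weight = mask * self.w_old + self.free * self.w_new
        return F.conv2d(x, weight, padding=1)  # padding=1 assumes a 3x3 kernel
```

Note that only `m_real` and `w_new` are parameters; `w_old` is a buffer and never receives gradients, which is what keeps the old tasks exactly recallable.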

Compaction of task k+1: After M and W^E_k are learned, an initial model of task-(k+1) is obtained. Then, we fix the mask M and apply gradual pruning to compress W^E_k, so as to get the compact model W^P_{k+1} and the redundant (released) weights W^E_{k+1} for task-(k+1). The compact model of old tasks then becomes W^P_{1:(k+1)} = W^P_{1:k} ∪ W^P_{k+1}. The compacting and picking/growing loop is repeated from task to task. Details of CPG continual learning are listed in Algorithm 1.

Algorithm 1: Compacting, Picking and Growing Continual Learning

Input: task 1 and an original model trained on task 1.
Set an accuracy goal for task 1;
Alternately remove small weights and re-train the remaining weights for task 1 via gradual pruning [51], as long as the accuracy goal still holds;
Let the model weights preserved for task 1 be W^P_1 (referred to as the task-1 weights), and those removed by the iterative pruning be W^E_1 (referred to as the released weights);
for task k = 2 ... K (let the released weights of task k be W^E_k) do
    Set an accuracy goal for task k;
    Apply a mask M to the weights W^P_{1:k-1}; train both M and W^E_{k-1} for task k, with W^P_{1:k-1} fixed;
    If the accuracy goal is not achieved, expand the number of filters (weights) in the model, reset W^E_{k-1}, and go to the previous step;
    Gradually prune W^E_{k-1} to obtain W^E_k (with W^P_{1:k-1} fixed) for task k, until meeting the accuracy goal;
    W^P_k = W^E_{k-1} \ W^E_k and W^P_{1:k} = W^P_{1:k-1} ∪ W^P_k;
end
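The control flow of Algorithm 1 can also be summarized by the schematic Python sketch below. The routines `train_task`, `gradual_prune`, `expand`, and `accuracy` are assumed placeholders for the steps described above (mask/free-weight training, iterative pruning, filter growth, and validation), so the sketch conveys the loop structure rather than a complete training pipeline.

```python
def cpg_continual_learning(model, tasks, goals,
                           train_task, gradual_prune, expand, accuracy):
    """Schematic CPG loop. preserved/released/kept are index sets of weights;
    free=None means all weights are trainable (task 1). train_task learns the
    picking mask M and the free weights with the preserved weights frozen;
    gradual_prune returns (kept, released); expand enlarges the model."""
    # Task 1: train on its own data, then compact by gradual pruning.
    train_task(model, tasks[0], preserved=set(), free=None)
    preserved, released = gradual_prune(model, free=None, goal=goals[0])
    masks = {}                                    # per-task picking masks
    for k, task in enumerate(tasks[1:], start=2):
        goal = goals[k - 1]
        # Pick critical old weights via a learnable mask and reuse released weights.
        masks[k] = train_task(model, task, preserved=preserved, free=released)
        while accuracy(model, task) < goal:
            # Grow: add filters/nodes, enlarging the pool of free weights, and retrain.
            released = expand(model, released)
            masks[k] = train_task(model, task, preserved=preserved, free=released)
        # Compact: gradually prune the free weights used by task k.
        kept, released = gradual_prune(model, free=released, goal=goal)
        preserved = preserved | kept              # W^P_{1:k} = W^P_{1:k-1} ∪ W^P_k
    return model, masks, preserved, released
```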

4 Experiments and Results

We perform three experiments to verify the effectiveness of our approach. The first experiment contains 20 tasks organized from the CIFAR-100 dataset [16]. In the second experiment, we follow the same settings as the PackNet [23] and Piggyback [22] approaches, where several fine-grained datasets are chosen for classification in an incremental manner. In the third experiment, we start from face verification and compact three further facial-informatic tasks (expression, gender, and age) incrementally to examine the performance of our continual learning approach in a realistic scenario. We implement our CPG approach¹ and independent task learning (from scratch or fine-tuning) via PyTorch [30] in all experiments, but implement DEN [48] via Tensorflow [1] with its official code.

¹ Our codes are available at https://github.com/ivclab/CPG.

Figure 2: The accuracy of DEN, Finetune and CPG for the sequential tasks 1, 5, 10, 15 on CIFAR-100. Panels: (a) Task-1, (b) Task-5, (c) Task-10, (d) Task-15.

4.1 Twenty Tasks of CIFAR-100

We divide the CIFAR-100 dataset into 20 tasks. Each task has 5 classes, 2,500 training images, and 500 testing images. In this experiment, the VGG16-BN model (VGG16 with batch normalization layers) is employed to train the 20 tasks sequentially. First, we compare our approach with DEN [48] (as it also uses an alternating mechanism of compression and expansion) and fine-tuning. To implement fine-tuning, we train task-1 from scratch by using VGG16-BN; then, assuming the models of task 1 to task k are available, we train the model of task-(k+1) by fine-tuning one of the models randomly selected from tasks 1 to k. We repeat this process 5 times and report the average accuracy (referred to as Finetune Avg). To implement our CPG approach, task-1 is also trained by using VGG16-BN, and this initial model is adapted for the sequential tasks following Algorithm 1. DEN is implemented via the official codes provided by the authors and modified for VGG16-BN.

Figure 2 shows the classification accuracy of DEN, fine-tuning, and our CPG. Figure 2(a) is the accuracy of task-1 when all of the 20 tasks have been trained. Initially, the accuracy of DEN is higher than that of CPG and fine-tuning although the same model is trained from scratch. We conjecture that this is because they are implemented on different platforms (Tensorflow vs. PyTorch). Nevertheless, the performance of task-1 gradually drops as the other tasks (2 to 20) are increasingly learned in DEN, as shown in Figure 2(a), and the drops are particularly significant for tasks 15 to 20. In Figure 2(b), the initial accuracy of DEN on task-5 becomes a little worse than that of CPG and fine-tuning. It reveals that DEN could not employ the previously learned model (tasks 1-4) to enhance the performance of the current task (task 5). Besides, the accuracy of task-5 still drops when new tasks (6-20) are learned. Similarly, for tasks 10 and 15, respectively shown in Figures 2(c) and (d), DEN has a large performance gap on the initial model, with increasing accuracy drops as well.

We attribute this phenomenon as follows. As DEN does not guarantee unforgetting, a "Split & Duplication" step is enforced to recover the old-task performance. Though DEN tries to preserve the learned tasks as much as possible via optimizing weight sparsity, the tuning of hyperparameters in its loss function makes it non-intuitive for DEN to balance learning the current task and remembering the previous tasks. The performance thus drops although we have tried our best to tune it. On the other hand, fine-tuning and our CPG have roughly the same accuracy initially on task-1 (both are trained from scratch), whereas CPG gradually outperforms fine-tuning on tasks 5, 10, and 15 in Figure 2. The results suggest that our approach can exploit the accumulated knowledge base to enhance the new-task performance. After model growing for 20 tasks, the final amount of weights is increased by 1.09 times (compared to VGG16-BN) for both DEN and CPG. Hence, our approach can not only ensure maintaining the old-task performance (as the horizontal lines shown in Figure 2), but also effectively accumulate the weights for knowledge picking.

Unlike ProgressiveNet, which uses all of the weights kept for the old tasks when training the new task, our method only picks the old-task weights critical to the new tasks. To evaluate the effectiveness of the weights-picking mechanism, we compare CPG with PAE [12] and PackNet [23]. In our method, if all of the old weights are always picked, it is referred to as the pack-and-expand (PAE) approach. If we further restrict PAE such that architecture expansion is forbidden, it degenerates to an existing approach, PackNet [23]. Note that both PAE and PackNet ensure unforgetting. As shown in Table 1, besides the first two tasks, CPG performs more favorably than PAE and PackNet consistently. The results reveal that the critical-weights picking mechanism in CPG not only reduces the unnecessary weights but also boosts the performance for new tasks. As PackNet does not allow model expansion, its weights amount remains the same (1×). However, when proceeding with more tasks, the available space in PackNet gradually reduces, which limits the effectiveness of PackNet in learning new tasks. PAE uses all the previous weights during learning. With more tasks, the weights from previous tasks would dominate the whole network and become a burden in learning new tasks. Finally, as shown in the Expand (Exp.) field in Table 1, PAE grows the model and uses 2× the weights for the 20 tasks. Our CPG expands to 1.5× of weights (with 0.41× redundant weights that can be released to future tasks). Hence, CPG finds a more compact and sustainable model with better accuracy when the picking mechanism is enforced.

Table 1: The performance of PackNet, PAE and CPG on the CIFAR-100 twenty tasks. We use Avg., Exp. and Red. as abbreviations for average accuracy, expansion weights and redundant weights.

Methods |  1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20 | Avg. | Exp.(×) | Red.(×)
PackNet | 66.4 80.0 76.2 78.4 80.0 79.8 67.8 61.4 68.8 77.2 79.0 59.4 66.4 57.2 36.0 54.2 51.6 58.8 67.8 83.2 | 67.5 | 1   | 0
PAE     | 67.2 77.0 78.6 76.0 84.4 81.2 77.6 80.0 80.4 87.8 85.4 77.8 79.4 79.6 51.2 68.4 68.6 68.6 83.2 88.8 | 77.1 | 2   | 0
CPG     | 65.2 76.6 79.8 81.4 86.6 84.8 83.4 85.0 87.2 89.2 90.8 82.4 85.6 85.2 53.2 74.4 70.0 73.4 88.8 94.8 | 80.9 | 1.5 | 0.41

Table 2: The performance of CPGs and individual models on the CIFAR-100 twenty tasks. We use fine-Avg and fine-Max as abbreviations for the average and max accuracy of the 5 fine-tuning models.

Methods  |  1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20 | Avg. | Exp.(×) | Red.(×)
Scratch  | 65.8 78.4 76.6 82.4 82.2 84.6 78.6 84.8 83.4 89.4 87.8 80.2 84.4 80.2 52.0 69.4 66.4 70.0 87.2 91.2 | 78.8 | 20  | 0
fine-Avg | 65.2 76.1 76.1 77.8 85.4 82.5 79.4 82.4 82.0 87.4 87.4 81.5 84.6 80.8 52.0 72.1 68.1 71.9 88.1 91.5 | 78.6 | 20  | 0
fine-Max | 65.8 76.8 78.6 80.0 86.2 84.8 80.4 84.0 83.8 88.4 89.4 83.8 87.2 82.8 53.6 74.6 68.8 74.4 89.2 92.2 | 80.2 | 20  | 0
CPG avg  | 65.2 76.6 79.8 81.4 86.6 84.8 83.4 85.0 87.2 89.2 90.8 82.4 85.6 85.2 53.2 74.4 70.0 73.4 88.8 94.8 | 80.9 | 1.5 | 0.41
CPG max  | 67.0 79.2 77.2 82.0 86.8 87.2 82.0 85.6 86.4 89.6 90.0 84.0 87.2 84.8 55.4 73.8 72.0 71.6 89.6 92.8 | 81.2 | 1.5 | 0
CPG top  | 66.6 77.2 78.6 83.2 88.2 85.8 82.4 85.4 87.6 90.8 91.0 84.6 89.2 83.0 56.2 75.4 71.0 73.8 90.6 93.6 | 81.7 | 1.5 | 0

Table 2 shows the performance of different settings of our CPG method, together with their comparison to independent task learning (including learning from scratch and fine-tuning from a pre-trained model). In this table, 'Scratch' means learning each task independently from scratch via the VGG16-BN model. As described before, 'fine-Avg' means the average accuracy of fine-tuning from a randomly selected previous model, with the process repeated 5 times; 'fine-Max' means the maximum accuracy of these 5 random trials. In the implementation of our CPG algorithm, an accuracy goal has to be set for the gradual-pruning and model-expansion steps. In this table, 'avg', 'max', and 'top' correspond to setting the accuracy goal as fine-Avg, fine-Max, and a slight increment over the maximum of both, respectively. The upper bound of model-weights expansion is set as 1.5× in this experiment. As can be seen in Table 2, CPG gets better accuracy than both the average and maximum of fine-tuning in general. CPG also performs more favorably than learning from scratch on average. This reveals again that the knowledge previously learned with our CPG can help learn new tasks.

Besides, the results show that a higher accuracy goal yields better performance and more consumption of weights in general. In Table 2, the accuracy achieved by 'CPG avg', 'CPG max', and 'CPG top' increases in that order. The former still has 0.41× redundant weights saved for future use, whereas the latter two consume all weights. The model size includes not only the backbone model weights, but also the overhead of the final layers that grow with new classes, the batch-normalization parameters, and the binary masks. Including all overheads, the model sizes of CPG for the three settings are 2.16×, 2.40× and 2.41× of the original VGG16-BN, as shown in Table 3. Compared to independent models (learning-from-scratch or fine-tuning) that require 20× the weights for maintaining the old-task accuracy, our approach yields a far smaller model while achieving exact unforgetting.

4.2 Fine-grained Image Classification Tasks

In this experiment, following the same settings as in the works of PackNet [23] and Piggyback [22], six image classification datasets are used. The statistics are summarized in Table 4, where ImageNet [17] is the first task, followed by the fine-grained classification tasks CUBS [44], Stanford Cars [15] and Flowers [26], and finally WikiArt [39] and Sketch [8], which are artistic images drawn in various styles and of various objects. Unlike the previous experiment, where the first task consists of only five classes from CIFAR-100, in this experiment the first-task classifier is trained on ImageNet, which is a strong base for fine-tuning. Hence, in the fine-tuning setting of this experiment, tasks 2 to 6 are all fine-tuned from task-1, instead of selecting a previous task randomly. For all tasks, the image size is 224 × 224, and the architecture used in this experiment is ResNet50.

Table 3: Model sizes on the CIFAR-100 twenty tasks.

Methods           | Model Size (MB)
VGG16-BN          | 128.25
Individual Models | 2565
CPG avg           | 278
CPG max           | 308
CPG top           | 310

Table 4: Statistics of the fine-grained datasets.

Dataset       | #Train    | #Eval  | #Classes
ImageNet      | 1,281,167 | 50,000 | 1,000
CUBS          | 5,994     | 5,794  | 200
Stanford Cars | 8,144     | 8,041  | 196
Flowers       | 2,040     | 6,149  | 102
WikiArt       | 42,129    | 10,628 | 195
Sketch        | 16,000    | 4,000  | 250

Table 5: Statistics of the facial-informatic datasets.

Dataset   | #Train    | #Eval  | #Classes
VGGFace2  | 3,137,807 | 0      | 8,630
LFW       | 0         | 13,233 | 5,749
FotW      | 6,171     | 3,086  | 3
IMDB-Wiki | 216,161   | 0      | 3
AffectNet | 283,901   | 3,500  | 7
Adience   | 12,287    | 3,868  | 8

Table 6: Accuracy on the fine-grained datasets.

Dataset         | Train from Scratch | Finetune | Prog. Net | PackNet | Piggyback | CPG
ImageNet        | 76.16 | -     | 76.16 | 75.71 | 76.16 | 75.81
CUBS            | 40.96 | 82.83 | 78.94 | 80.41 | 81.59 | 83.59
Stanford Cars   | 61.56 | 91.83 | 89.21 | 86.11 | 89.62 | 92.80
Flowers         | 59.73 | 96.56 | 93.41 | 93.04 | 94.77 | 96.62
WikiArt         | 56.50 | 75.60 | 74.94 | 69.40 | 71.33 | 77.15
Sketch          | 75.40 | 80.78 | 76.35 | 76.17 | 79.91 | 80.33
Model Size (MB) | 554   | 554   | 563   | 115   | 121   | 121

Table 7: Accuracy on the facial-informatic tasks.

Task       | Train from Scratch | Finetune | CPG
Face       | 99.417 ± 0.367     | -        | 99.300 ± 0.384
Gender     | 83.70              | 90.80    | 89.66
Expression | 57.64              | 62.54    | 63.57
Age        | 46.14              | 57.27    | 57.66
Exp. (×)   | 4                  | 4        | 1
Red. (×)   | 0                  | 0        | 0.003

The performance is shown in Table 6. Five methods are compared with CPG: training from scratch, fine-tuning, ProgressiveNet, PackNet, and Piggyback. For the first task (ImageNet), CPG and PackNet perform slightly worse than the others, since both methods have to compress the model (ResNet50) via pruning. Then, for tasks 2 to 6, CPG outperforms the others in almost all cases, which shows the superiority of our method in building a compact and unforgetting base for continual learning. As for the model size, ProgressiveNet increases the model per task. Learning-from-scratch and fine-tuning need 6 models to achieve unforgetting; their model sizes are thus large. CPG yields a small model size comparable to Piggyback, which is favorable when considering both accuracy and model size.

4.3 Facial-informatic Tasks

In a realistic scenario, four facial-informatic tasks, face verification, gender, expression and age classification, are used with the datasets summarized in Table 5. For face verification, we use VGGFace2 [4] for training and LFW [18] for testing. For gender classification, we combine the FotW [9] and IMDB-Wiki [37] datasets and classify faces into three categories, male, female and other. The AffectNet dataset [25] is used for expression classification, which classifies faces into seven primary emotions. Finally, the Adience dataset [7] contains faces with labels of eight different age groups. For these datasets, faces are aligned using MTCNN [50] with an output size of 112 × 112. We use the 20-layer CNN in SphereFace [21] and train a model for the face verification task accordingly. We compare CPG with the models fine-tuned from the face verification task. The results are reported in Table 7. Compared with individual models (training-from-scratch and fine-tuning), CPG achieves comparable or more favorable results without additional expansion. After learning the four tasks, CPG still has 0.003× of released weights that can be used for new tasks.

5 Conclusion and Future Work

We introduce a simple but effective method, CPG, for continual learning that avoids forgetting. Compacting a model prevents the model complexity from becoming unaffordable as the number of tasks increases. Picking learned weights using binary masks and training them together with newly added weights is an effective way to reuse previous knowledge. The weights for old tasks are preserved, which prevents them from being forgotten. Growing the model for new tasks allows the model to learn an unlimited number of unknown or unrelated tasks. CPG is easy to realize and applicable to real situations. Experiments show that CPG can achieve similar or better accuracy with limited additional space. Currently, our method compacts a model by weights pruning, and we plan to include channel pruning in the future. Besides, we assume that clear boundaries exist between the tasks, and we will extend our approach to handle continual learning problems without task boundaries. In the future, we also plan to provide a mechanism for "selectively forgetting" some previous tasks via the masks recorded.

Acknowledgments

We thank the anonymous reviewers and area chair for their constructive comments. This work is supported in part under contract MOST 108-2634-F-001-004.

References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv, 2016.

[2] Rahaf Aljundi, Klaas Kelchtermans, and Tinne Tuytelaars. Task-free continual learning. In Proceedings of CVPR, 2019.

[3] Pratik Prabhanjan Brahma and Adrienne Othon. Subset replay based continual learning for scalable improvement of autonomous systems. In Proceedings of the IEEE CVPRW, 2018.

[4] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. Vggface2: A dataset for recognising faces across pose and age. In Proceedings of IEEE FG, 2018.

[5] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of ECCV, 2018.

[6] Prithviraj Dhar, Rajat Vikram Singh, Kuan-Chuan Peng, Ziyan Wu, and Rama Chellappa. Learning without memorizing. In Proceedings of CVPR, 2019.

[7] Eran Eidinger, Roee Enbar, and Tal Hassner. Age and gender estimation of unfiltered faces. IEEE TIFS, 9(12):2170–2179, 2014.

[8] Mathias Eitz, James Hays, and Marc Alexa. How do humans sketch objects? ACM Trans. Graph., 31(4):44–1, 2012.

[9] Sergio Escalera, Mercedes Torres Torres, Brais Martinez, Xavier Baró, Hugo Jair Escalante, Isabelle Guyon, Georgios Tzimiropoulos, Ciprian Corneou, Marc Oliu, Mohammad Ali Bagheri, et al. Chalearn looking at people and faces of the world: Face analysis workshop and challenge 2016. In Proceedings of IEEE CVPRW, 2016.

[10] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In Proceedings of ICLR, 2016.

[11] Wenpeng Hu, Zhou Lin, Bing Liu, Chongyang Tao, Zhengwei Tao, Jinwen Ma, Dongyan Zhao, and Rui Yan. Overcoming catastrophic forgetting via model adaptation. In Proceedings of ICLR, 2019.

[12] Steven CY Hung, Jia-Hong Lee, Timmy ST Wan, Chein-Hung Chen, Yi-Ming Chan, and Chu-Song Chen. Increasingly packing multiple facial-informatics modules in a unified deep-learning model via lifelong learning. In Proceedings of International Conference on Multimedia Retrieval (ICMR), 2019.

[13] Ronald Kemker and Christopher Kanan. Fearnet: Brain-inspired model for incremental learning. In Proceedings of ICLR, 2018.

[14] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.

[15] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.

[16] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[18] Erik Learned-Miller, Gary B Huang, Aruni RoyChowdhury, Haoxiang Li, and Gang Hua. Labeled faces in the wild: A survey. In Advances in Face Detection and Facial Image Analysis. Springer, 2016.

[19] Sang-Woo Lee, Jin-Hwa Kim, JungWoo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In NIPS, 2017.

[20] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:2935–2947, 2018.

[21] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of IEEE CVPR, 2017.

[22] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of ECCV, 2018.

[23] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE CVPR, 2018.

[24] James L McClelland, Bruce L McNaughton, and Randall C O'Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 1995.

[25] Ali Mollahosseini, Behzad Hasani, and Mohammad H Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affective Comput., 2017.

[26] M-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.

[27] Oleksiy Ostapenko, Mihai Puscas, Tassilo Klein, Patrick Jähnichen, and Moin Nabi. Learning to remember: A synaptic plasticity driven framework for continual learning. In Proceedings of CVPR, 2019.

[28] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.

[29] German Ignacio Parisi, Xu Ji, and Stefan Wermter. On the role of neurogenesis in overcoming catastrophic forgetting. In Proceedings of NeurIPS Workshop on Continual Learning, 2018.

[30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In Proceedings of NeurIPS, 2017.

[31] B. Pfulb and A. Gepperth. A comprehensive, application-oriented study of catastrophic forgetting in DNNs. In ICLR, 2019.

[32] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of IEEE CVPR, 2017.

[33] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to learn without forgetting by maximizing transfer and minimizing interference. In Proceedings of ICLR, 2019.

[34] Matthew Riemer, Tim Klinger, Djallel Bouneffouf, and Michele Franceschini. Scalable recollections for continual lifelong learning. In AAAI, 2019.

[35] Hippolyt Ritter, Aleksandar Botev, and David Barber. Online structured laplace approximations for overcoming catastrophic forgetting. In NeurIPS, 2018.

[36] Amir Rosenfeld and John K. Tsotsos. Incremental learning through deep adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Early Access, 2018.

[37] Rasmus Rothe, Radu Timofte, and Luc Van Gool. Dex: Deep expectation of apparent age from a single image. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 10–15, 2015.

[38] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv, 2016.

[39] Babak Saleh and Ahmed Elgammal. Large-scale classification of fine-art paintings: Learning the right metric on the right feature. In ICDMW, 2015.

[40] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In Proceedings of ICML, 2018.

[41] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Proceedings of NeurIPS, 2017.

[42] Sebastian Thrun. A lifelong learning perspective for mobile robot control. In Intelligent Robots and Systems. Elsevier, 1995.

[43] Lazar Valkov, Dipak Chaudhari, Akash Srivastava, Charles A. Sutton, and Swarat Chaudhuri. Houdini: Lifelong learning as program synthesis. In NeurIPS, 2018.

[44] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

[45] Chenshen Wu, Luis Herranz, Xialei Liu, Yaxing Wang, Joost van de Weijer, and Bogdan Raducanu. Memory replay gans: Learning to generate new categories without forgetting. In Proceedings of NeurIPS, 2018.

[46] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, Zhengyou Zhang, and Yun Fu. Incremental classifier learning with generative adversarial networks. CoRR, abs/1802.00853, 2018.

[47] Tianjun Xiao, Jiaxing Zhang, Kuiyuan Yang, Yuxin Peng, and Zheng Zhang. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proceedings of ACM-MM, 2014.

[48] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. In Proceedings of ICLR, 2018.

[49] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of ICML, 2017.

[50] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23:1499–1503, 2016.

[51] Michael Zhu and Suyog Gupta. To prune, or not to prune: Exploring the efficacy of pruning for model compression. In Proceedings of ICLR Workshop, 2018.
