
Measuring Catastrophic Forgetting In Neural Networks

Ronald Kemker¹, Marc McClure¹, Angelina Abitino², Tyler Hayes¹, Christopher Kanan¹

1. Rochester Institute of Technology, Rochester NY

{rmk6217, mcm5756, tlh6792, kanan}@rit.edu

2. Swarthmore College, Swarthmore, PA

[email protected]

Mechanisms for Mitigating Catastrophic Forgetting

$$L(\theta) = L_t(\theta) + \sum_i \frac{\lambda}{2} F_i \left(\theta_i - \theta_{A,i}^{*}\right)^2$$

#1 Regularization

Model adds constraints to the weight updates to protect previously learned knowledge. Google DeepMind's Elastic Weight Consolidation (EWC) model uses a Fisher information matrix $F_i$ to redirect plasticity towards the weights that are least important for retaining old information [1].
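As a rough illustration of the loss above, here is a minimal PyTorch-style sketch of the EWC objective; the `fisher` and `theta_old` dictionaries and the `task_loss` tensor are assumed inputs saved after training on the previous task, not the authors' implementation.

```python
import torch

def ewc_loss(task_loss, model, fisher, theta_old, lam=1.0):
    """EWC objective: the new task's loss plus a quadratic penalty that
    anchors each parameter to its post-task-A value, weighted by its
    Fisher importance F_i. `fisher` and `theta_old` map parameter names
    to tensors saved after the previous task (assumed inputs)."""
    penalty = torch.zeros(())
    for name, theta in model.named_parameters():
        penalty = penalty + (fisher[name] * (theta - theta_old[name]) ** 2).sum()
    return task_loss + (lam / 2.0) * penalty
```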

#2 Ensembling

Model implicitly or explicitly trains multiple classifiers and combines them to make a prediction. Google's PathNet model uses a genetic algorithm to find the optimal path through a large DCNN and then locks that path to preserve it [2].

#3 Rehearsal

Model revisits previous training examples to prevent forgetting of previously trained knowledge. GeppNet stores all previous training examples and then replays them during incremental learning stages [3].
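To make the rehearsal idea concrete, here is a minimal Python sketch of a replay buffer mixed into incremental updates; the class and method names are illustrative, not GeppNet's implementation (a similar buffer also underlies the dual-memory GeppNet+STM model below).

```python
import random

class RehearsalBuffer:
    """Store past (x, y) examples and replay them alongside new data so
    previously learned classes keep appearing in the training stream.
    Illustrative names only; not GeppNet's actual implementation."""

    def __init__(self):
        self.memory = []

    def store(self, examples):
        self.memory.extend(examples)

    def mixed_batch(self, new_examples, n_replay=32):
        # Mix the incoming examples with a random sample of stored ones.
        replay = random.sample(self.memory, min(n_replay, len(self.memory)))
        return list(new_examples) + replay
```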

#4 Dual-Memory

Model has separate processing centers for the fast acquisition of new information and the long-term storage of pre-trained knowledge. GeppNet+STM uses a short-term memory buffer to store and recall previous training examples, and then consolidates these samples during sleep phases.

#5 Sparse-Coding

Model makes sparse updates to the network to prevent the disruption of pre-trained knowledge. The fixed expansion layer (FEL) model uses a large hidden layer that is sparsely populated with excitatory and inhibitory weights [4].
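As one way to picture the FEL idea, here is a hedged NumPy sketch of a fixed, sparse expansion layer with mixed excitatory and inhibitory weights; the sizes, sparsity level, and names are illustrative assumptions, not details from [4].

```python
import numpy as np

rng = np.random.default_rng(0)

def fixed_expansion_weights(n_in, n_hidden, density=0.05):
    """Build a large expansion layer whose weights are fixed, sparse, and
    split between excitatory (+1) and inhibitory (-1) connections. These
    weights are never updated, so only the few hidden units that respond
    to a given input participate in later, trainable layers. Sizes and
    sparsity here are illustrative, not taken from the FEL paper."""
    mask = rng.random((n_in, n_hidden)) < density
    signs = rng.choice([-1.0, 1.0], size=(n_in, n_hidden))
    return mask * signs

W_fel = fixed_expansion_weights(n_in=784, n_hidden=4000)
x = rng.random(784)              # dummy flattened input image
h = np.maximum(0.0, x @ W_fel)   # sparse hidden response
```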

Incremental Learning Paradigms

1. Data Permutation Experiment

This experiment measures how well a model can incrementally learn datasets with similar feature representations. We randomly permute the pixel locations of each image; a sketch of this setup follows the list below.

2. Incremental Class Learning Experiment

First, we train the model on some base knowledge (half of the classes). Then, we train the remaining classes one-by-one.

3. Multi-Modal Experiment

We measure how well a model can incrementally learn datasets with dissimilar feature representations. First, we train the model on image classification, and then we train that model on audio classification (and vice versa).
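To make the data-permutation setup above concrete, here is a minimal NumPy sketch of generating one permuted training session; the array shape and function name are illustrative assumptions, not the authors' code.

```python
import numpy as np

def make_permuted_task(images, seed):
    """Apply one fixed random pixel permutation to every image, yielding a
    new task with the same labels but scrambled spatial structure. Each
    seed defines a different incremental training session (sketch only;
    `images` is assumed to be an (n_samples, n_pixels) flattened array)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(images.shape[1])  # one permutation per task
    return images[:, perm]

# e.g., sessions 2..T each train on a differently permuted copy:
# task_k = make_permuted_task(train_images, seed=k)
```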

Catastrophic Forgetting

Neural networks are incapable of learning new information without disturbing the weights important for retaining existing memories, a phenomenon known as catastrophic forgetting. Although many mitigation techniques have been proposed, the only reliable way to prevent it is to combine the old and new data and retrain the model from scratch. State-of-the-art frameworks can take weeks or months to train, so this is extremely inefficient.

Motivation

Researchers have proposed many strategies for mitigating catastrophic forgetting, but these methods have all failed to scale up to real-world problems. Kirkpatrick et al. (2017) claimed to solve catastrophic forgetting, but they only evaluated their framework on a toy dataset with only a few object classes (i.e., MNIST).

We scaled up some of these mechanisms to large-scale image and audio classification datasets with 100-200 object classes, and evaluated their performance on three different incremental learning paradigms using new metrics that we established.


Metrics

We established three metrics designed to measure how well a model retains existing memories ($\Omega_{base}$), assimilates new data ($\Omega_{new}$), and does both at once ($\Omega_{all}$). We track mean-class test accuracy on the base knowledge ($\alpha_{base}$), the most recently learned class ($\alpha_{new}$), and all classes seen to that point ($\alpha_{all}$). We normalize the results by the accuracy obtained by training the model offline ($\alpha_{ideal}$) so that we can compare fairly across datasets.

$$\Omega_{base} = \frac{1}{T-1}\sum_{i=2}^{T}\frac{\alpha_{base,i}}{\alpha_{ideal}} \qquad \Omega_{new} = \frac{1}{T-1}\sum_{i=2}^{T}\alpha_{new,i} \qquad \Omega_{all} = \frac{1}{T-1}\sum_{i=2}^{T}\frac{\alpha_{all,i}}{\alpha_{ideal}}$$
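A minimal sketch of computing the three metrics from logged per-session accuracies, assuming accuracies are stored as arrays indexed by training session; the names are illustrative.

```python
import numpy as np

def forgetting_metrics(alpha_base, alpha_new, alpha_all, alpha_ideal):
    """Compute Omega_base, Omega_new, Omega_all from mean-class test
    accuracies logged for sessions 1..T (index 0 is the base session, so
    sessions 2..T enter the averages, matching the formulas above).
    Note that Omega_new is not normalized by the offline accuracy."""
    omega_base = np.mean(np.asarray(alpha_base)[1:] / alpha_ideal)
    omega_new = np.mean(np.asarray(alpha_new)[1:])
    omega_all = np.mean(np.asarray(alpha_all)[1:] / alpha_ideal)
    return omega_base, omega_new, omega_all
```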

Discussion/Conclusion

• No mechanism works well on every paradigm and data type.

• The regularization and ensembling mechanisms work well for incrementally learning datasets with similar feature representations.

• Incremental class learning models benefit from the rehearsal and dual-memory mechanisms; however, storing past training examples is memory inefficient.

• Models that employ sparsity as their mitigation strategy are too memory inefficient to be deployed in a real-world scenario.

Summary of Experimental Results

Fig. Mean of $\Omega_{all}$ across datasets.

Acknowledgements

Angelina Abitino was supported by NSF Research Experiences for Undergraduates (REU) award #1359361 to Roger Dube. We also thank NVIDIA for the generous donation of a Titan X GPU.

References

1. Kirkpatrick, James, et al. "Overcoming catastrophic forgetting in neural networks." Proceedings of the National Academy of Sciences (2017): 201611835.

2. Fernando, Chrisantha, et al. "PathNet: Evolution channels gradient descent in super neural networks." arXiv preprint arXiv:1701.08734 (2017).

3. Gepperth, Alexander, and Cem Karaoguz. "A bio-inspired incremental learning architecture for applied perceptual problems." Cognitive Computation 8.5 (2016): 924-934.

4. Coop, Robert, Aaron Mishtal, and Itamar Arel. "Ensemble learning in fixed expansion layer networks for mitigating catastrophic forgetting." IEEE Transactions on Neural Networks and Learning Systems 24.10 (2013): 1623-1634.

Experimental Results

Fig 1. Permuted MNIST image.

Fig 2. Mean-class test accuracy of the incremental class learning experiment.