
Page 1

Bernard Ans, Stéphane Rousset, Robert M. French & Serban Musca
(European Commission grant HPRN-CT-1999-00065)

Preventing Catastrophic Interference in Multiple-Sequence Learning Using Coupled Reverberating Elman Networks

Page 2

The Problem of Multiple-Sequence Learning

• Real cognition requires the ability to learn sequences of patterns (or actions). (This is why SRNs, i.e., Elman networks, were originally developed.)

• But learning sequences really means being able to learn multiple sequences without the most recently learned ones erasing the previously learned ones.

• Catastrophic interference is a serious problem for the sequential learning of individual patterns. It is far worse when multiple sequences of patterns have to be learned consecutively.

Page 3

The Solution

• We have developed a “dual-network” system using coupled Elman networks that completely solves this problem.

• These two separate networks exchange information by means of “reverberated pseudopatterns.”

Page 4

Pseudopatterns

[Figure: a neural network ("network-in-a-box") that has learned f(x), with its inputs and outputs.]

• Assume a network-in-a-box learns a series of patterns produced by a function f(x).

• These original patterns are no longer available.

How can you approximate f(x)?

Page 5

Random input: 1 0 0 1 1

Page 6

Random input: 1 0 0 1 1

Associated output: 1 1 0

Page 7

Random input: 1 0 0 1 1

Associated output: 1 1 0

This creates a pseudopattern: 1 0 0 1 1 → 1 1 0

Page 8

A large enough collection of these pseudopatterns:

Pseudopattern 1: 1 0 0 1 1 → 1 1 0

Pseudopattern 2: 1 1 0 0 0 → 0 1 1

Pseudopattern 3: 0 0 0 1 0 → 1 0 0

Pseudopattern 4: 0 1 1 1 1 → 0 0 0

Etc.

will approximate the originally learned function.
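Below is a minimal sketch of this idea in Python/NumPy (not the authors' code; names such as `TinyNet` and `make_pseudopatterns` are illustrative). A fixed random-weight network stands in for the trained "network-in-a-box", and pseudopatterns are built by pairing random binary inputs with whatever outputs the network produces for them.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyNet:
    """Stand-in for a network that has already learned some function f(x)."""
    def __init__(self, n_in=5, n_hidden=8, n_out=3):
        self.W1 = rng.normal(size=(n_in, n_hidden))
        self.W2 = rng.normal(size=(n_hidden, n_out))

    def forward(self, x):
        return sigmoid(sigmoid(x @ self.W1) @ self.W2)

def make_pseudopatterns(net, n_patterns, n_in=5):
    """Pair random binary inputs with the net's outputs: each pair is one pseudopattern."""
    pseudo = []
    for _ in range(n_patterns):
        x = rng.integers(0, 2, size=n_in).astype(float)  # random input, e.g. 1 0 0 1 1
        y = net.forward(x)                                # associated output
        pseudo.append((x, y))
    return pseudo

net = TinyNet()
for k, (x, y) in enumerate(make_pseudopatterns(net, n_patterns=4), start=1):
    print(f"pseudopattern {k}: {x.astype(int)} -> {np.round(y, 2)}")
```

Each printed pair plays the same role as the examples above: together, a large enough collection of them approximates the function the boxed network has learned.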

Page 9

Transferring information from Net 1 to Net 2 with pseudopatterns

[Figure: a random input (1 0 0 1 1) is presented to Net 1, which produces an associated output (1 1 0); this input-output pair then serves as the input and target for training Net 2.]

Page 10

Information transfer by pseudopatterns in dual-network systems

• New information is presented to one network (Net 1).

• Pseudopatterns are generated by Net 2 where previously learned information is stored.

• Net 1 then trains not only on the new pattern(s) to be learned, but also on the pseudopatterns produced by Net 2.

• Once Net 1 has learned the new information, it generates (lots of) pseudopatterns that train Net 2.

This is why we say that information is continually transferred between the two networks by means of pseudopatterns.
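A schematic sketch of this exchange follows, under stated assumptions: tiny two-layer sigmoid networks trained by plain backpropagation, with illustrative names (`MLP`, `pseudopatterns`) that are not from the presentation. In the full system Net 2 would already hold previously learned patterns.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MLP:
    def __init__(self, n_in, n_hid, n_out, lr=0.5):
        self.W1 = rng.normal(scale=0.5, size=(n_in, n_hid))
        self.W2 = rng.normal(scale=0.5, size=(n_hid, n_out))
        self.lr = lr

    def forward(self, x):
        self.h = sigmoid(x @ self.W1)
        return sigmoid(self.h @ self.W2)

    def train_step(self, x, target):
        """One feedforward-backprop (FF-BP) pass on a single pattern."""
        y = self.forward(x)
        d_out = (y - target) * y * (1 - y)
        d_hid = (d_out @ self.W2.T) * self.h * (1 - self.h)
        self.W2 -= self.lr * np.outer(self.h, d_out)
        self.W1 -= self.lr * np.outer(x, d_hid)

def pseudopatterns(net, n, n_in):
    """Random binary inputs paired with the net's current outputs."""
    xs = rng.integers(0, 2, size=(n, n_in)).astype(float)
    return [(x, net.forward(x)) for x in xs]

net1 = MLP(5, 10, 3)   # new-learning network
net2 = MLP(5, 10, 3)   # long-term storage network (assumed to hold prior knowledge)

new_patterns = [(rng.integers(0, 2, 5).astype(float),
                 rng.integers(0, 2, 3).astype(float)) for _ in range(3)]

# Net 1 learns the new patterns interleaved with pseudopatterns from Net 2,
# so that the information stored in Net 2 is not overwritten in Net 1.
for epoch in range(200):
    for x, t in new_patterns + pseudopatterns(net2, n=3, n_in=5):
        net1.train_step(x, t)

# Once Net 1 has learned, it generates many pseudopatterns that retrain Net 2.
for x, t in pseudopatterns(net1, n=2000, n_in=5):
    net2.train_step(x, t)
```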

Page 11

Are all pseudopatterns created equal? No.

Even though the simple dual-network system (i.e., new learning in one network; long-term storage in the other) using simple pseudopatterns does eliminate catastrophic interference, we can do better using “reverberated” pseudopatterns.

Page 12

Building a Network that uses “reverberated” pseudopatterns.

Start with a standard backpropagation network.

[Figure: a three-layer backpropagation network: input layer, hidden layer, output layer.]

Page 13

Add an autoassociator.

[Figure: the same network with autoassociative nodes added to the output layer, trained to reproduce the input.]

Page 14

A new pattern to be learned, P: Input → Target, is learned as shown below.

[Figure: the pattern's Input is presented on the input layer; the output layer is trained to produce the Target on the target nodes and a copy of the Input on the autoassociative nodes.]
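A short sketch of this architecture (illustrative dimensions, sigmoid units assumed, not the authors' code): the output layer is split into an autoassociative group, trained to reproduce the input, and a target group.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_in, n_hid, n_target = 5, 12, 3
W1 = rng.normal(scale=0.5, size=(n_in, n_hid))
W2 = rng.normal(scale=0.5, size=(n_hid, n_in + n_target))  # output = [autoassoc | target]

def forward(x):
    h = sigmoid(x @ W1)
    out = sigmoid(h @ W2)
    return out[:n_in], out[n_in:]        # (reconstruction of x, target output)

# Learning a pattern P (Input -> Target): the teacher for the whole output layer
# is the concatenation [Input, Target], so one backprop pass trains both parts.
x = np.array([1, 0, 0, 1, 1], dtype=float)
t = np.array([1, 1, 0], dtype=float)
teacher = np.concatenate([x, t])
reconstruction, prediction = forward(x)  # compared against `teacher` during training
```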

Page 15

What are “reverberated pseudopatterns” and how are they generated?

Page 16

We start with a random input î0, feed it through the network, and collect the output on the autoassociative side of the network. This output is fed back into the input layer (“reverberated”) and, again, the output on the autoassociative side is collected. This is done R times.

[Figure: î0 is presented on the input layer; the autoassociative output î1 is collected.]

Page 17

[Figure: î1 is fed back into the input layer.]

Page 18

[Figure: î1 produces the autoassociative output î2.]

Page 19

[Figure: î2 is fed back into the input layer.]

Page 20

[Figure: î2 produces the autoassociative output î3, and so on.]

Page 21

After R reverberations, we associate the reverberated input îR with the “target” output t̂. This forms the reverberated pseudopattern:

îR → t̂
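A sketch of this generation procedure (same illustrative architecture as in the earlier sketch; the exact step at which the target output t̂ is read out is an assumption on my part):

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_in, n_hid, n_target, R = 5, 12, 3, 5
W1 = rng.normal(scale=0.5, size=(n_in, n_hid))
W2 = rng.normal(scale=0.5, size=(n_hid, n_in + n_target))

def forward(x):
    out = sigmoid(sigmoid(x @ W1) @ W2)
    return out[:n_in], out[n_in:]            # (autoassociative output, target output)

def reverberated_pseudopattern():
    i = rng.integers(0, 2, size=n_in).astype(float)   # random input i_0
    for _ in range(R):
        i, _ = forward(i)                    # feed the autoassociative output back in
    _, t_hat = forward(i)                    # target output associated with i_R (assumed readout)
    return i, t_hat                          # the reverberated pseudopattern i_R -> t_hat

i_R, t_hat = reverberated_pseudopattern()
```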

Page 22

This dual-network approach, using reverberated pseudopattern information transfer between the two networks, effectively overcomes catastrophic interference in multiple-pattern learning.

[Figure: Net 1 (the new-learning network) and Net 2 (the storage network) exchange reverberated pseudopatterns.]

Page 23

But what about multiple-sequence learning?

• Elman networks are designed to learn sequences of patterns. But they forget catastrophically when they attempt to learn multiple sequences.

• Can we generalize the dual-network, reverberated pseudopattern technique to dual Elman networks and eliminate catastrophic interference in multiple-sequence learning? Yes.

Page 24

Elman networks (a.k.a. Simple Recurrent Networks)

[Figure: an Elman network. The input layer holds the standard input S(t) plus a context layer containing H(t-1), a copy of the hidden-unit activations from the previous time step; the hidden layer computes H(t); the output layer predicts the next item S(t+1).]

Learning a sequence S(1), S(2), …, S(n).
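A minimal sketch (illustrative sizes, weight updates omitted, not the authors' code) of how an Elman network steps through a sequence, copying its hidden state into the context layer at each step:

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_item, n_hid = 100, 50
W_in = rng.normal(scale=0.3, size=(n_item + n_hid, n_hid))   # [S(t), H(t-1)] -> H(t)
W_out = rng.normal(scale=0.3, size=(n_hid, n_item))          # H(t) -> S(t+1)

sequence = rng.integers(0, 2, size=(11, n_item)).astype(float)   # S(0), ..., S(10)

context = np.zeros(n_hid)                 # H(-1): the context is reset at sequence start
for t in range(len(sequence) - 1):
    x = np.concatenate([sequence[t], context])    # standard input + context
    hidden = sigmoid(x @ W_in)                    # H(t)
    prediction = sigmoid(hidden @ W_out)          # the network's guess for S(t+1)
    target = sequence[t + 1]                      # teacher; the error drives backprop
    context = hidden.copy()                       # copy H(t) into the context layer
```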

Page 25

A “Reverberated Simple Recurrent Network” (RSRN): an Elman network with an autoassociative part

[Figure: an RSRN. The input layer holds the standard input S(t) and the context H(t-1); the hidden layer computes H(t). The output layer contains “autoassociative” (input) nodes, which reproduce S(t) and H(t-1), and “target” nodes for the next item S(t+1); the teacher/error signal is applied to both groups.]
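A sketch of the corresponding RSRN forward pass (illustrative sizes, training omitted): the output layer both reproduces the input [S(t), H(t-1)] on its autoassociative nodes and predicts S(t+1) on its target nodes.

```python
import numpy as np

rng = np.random.default_rng(5)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_item, n_hid = 100, 50
n_in = n_item + n_hid                 # input layer: [S(t), H(t-1)]
n_out = n_in + n_item                 # output layer: [autoassoc S(t), autoassoc H(t-1), S(t+1)]
W_in = rng.normal(scale=0.3, size=(n_in, n_hid))
W_out = rng.normal(scale=0.3, size=(n_hid, n_out))

def rsrn_forward(s_t, context):
    x = np.concatenate([s_t, context])
    hidden = sigmoid(x @ W_in)                     # H(t)
    out = sigmoid(hidden @ W_out)
    autoassoc, target = out[:n_in], out[n_in:]     # reconstruction of x, prediction of S(t+1)
    return hidden, autoassoc, target

s_t = rng.integers(0, 2, size=n_item).astype(float)
hidden, autoassoc, target = rsrn_forward(s_t, context=np.zeros(n_hid))
```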

Page 26

RSRN technique for sequentially learning two sequences A(t) and B(t).

• Net 1 learns A(t) completely.
• Reverberated pseudopattern transfer to Net 2.
• Net 1 makes one weight-change pass through B(t).
• Net 2 generates a few “static” reverberated pseudopatterns.
• Net 1 does one learning epoch on these pseudopatterns from Net 2.
• Continue until Net 1 has learned B(t).
• Test how well Net 1 has retained A(t).

(A control-flow sketch of this procedure is given below.)
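The sketch below renders the interleaving stage of this procedure; the callables passed in are assumed helpers, not functions from the paper, and the earlier stages (Net 1 learning A(t) completely, then pseudopattern transfer to Net 2) are taken as already done.

```python
def learn_second_sequence(net1, net2, sequence_B,
                          train_epoch_on_sequence,
                          generate_pseudopatterns,
                          train_on_pseudopatterns,
                          sequence_is_learned,
                          n_pseudo=4, max_epochs=1000):
    """Interleave learning of B(t) in Net 1 with reverberated pseudopatterns
    drawn from Net 2, which stores what was previously learned (sequence A)."""
    for _ in range(max_epochs):
        # 1. One weight-change pass through the new sequence B(t).
        train_epoch_on_sequence(net1, sequence_B)
        # 2. Net 2 generates a few "static" reverberated pseudopatterns.
        pseudos = generate_pseudopatterns(net2, n_pseudo)
        # 3. Net 1 does one learning epoch on those pseudopatterns.
        train_on_pseudopatterns(net1, pseudos)
        # 4. Stop once Net 1 has learned B(t); retention of A(t) is then tested.
        if sequence_is_learned(net1, sequence_B):
            break
    return net1
```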

Page 27

Two sequences to be learned: A(0), A(1), …, A(10) and B(0), B(1), …, B(10)

Net 1 learns (completely) sequence A(0), A(1), …, A(10)

Page 28

Transferring the learning to Net 2

Net 1 produces 10,000 pseudopatterns, each of the form 010110100110010 → 1110010011010.

[Figure: each pseudopattern produced by Net 1 is presented to Net 2, its first part (010110100110010) as the input and its second part (1110010011010) as the teacher.]

Page 29

Transferring the learning to Net 2

[Figure: the pseudopattern input (010110100110010) is fed forward through Net 2; the pseudopattern output (1110010011010) serves as the teacher.]

Page 30

Transferring the learning to Net 2

For each of the 10,000 pseudopatterns produced by Net 1, Net 2 makes one feedforward-backpropagation (FF-BP) pass, changing its weights by backpropagation.

[Figure: Net 2 updates its weights on the pseudopattern input (010110100110010) and teacher (1110010011010).]

Page 31

Learning B(0), B(1), …, B(10) by Net 1

1. Net 1 does ONE learning epoch on sequence B(0), B(1), …, B(10).
2. Net 2 generates a few pseudopatterns.
3. Net 1 does one FF-BP pass on each Net 2 pseudopattern.

Page 32

Learning B(0), B(1), …, B(10) by Net 1

1. Net 1 does ONE learning epoch on sequence B(0), B(1), …, B(10).
2. Net 2 generates a few pseudopatterns.
3. Net 1 does one FF-BP pass on each Net 2 pseudopattern.

Continue until Net 1 has learned B(0), B(1), …, B(10).

Page 33

Sequences chosen

• Twenty-two distinct random binary vectors of length 100 are created.

• Half of these vectors are used to produce the first ordered sequence of items, A, denoted by A(0), A(1), …, A(10).

• The remaining 11 vectors are used to create a second sequence of items, B, denoted by B(0), B(1), …, B(10).

• In order to introduce a degree of ambiguity into each sequence (so that a simple BP network would not be able to learn them), we modify each sequence so that A(8) = A(5) and B(5) = B(1).
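A sketch of this stimulus construction (seed and variable names arbitrary; with length-100 vectors the chance of accidental duplicates is negligible):

```python
import numpy as np

rng = np.random.default_rng(6)

vectors = rng.integers(0, 2, size=(22, 100))       # 22 random binary vectors of length 100
A = [vectors[i].copy() for i in range(11)]          # A(0) ... A(10)
B = [vectors[11 + i].copy() for i in range(11)]     # B(0) ... B(10)

A[8] = A[5].copy()   # the item at position 8 repeats the item at position 5
B[5] = B[1].copy()   # likewise for sequence B

# Because A(8) = A(5), the correct successor of the repeated item depends on where
# it occurs in the sequence, so context (an SRN) is needed to learn the sequence.
```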

Page 34

Test method

• First, sequence A is completely learned by the network.

• Then sequence B is learned.

• During the course of learning, we monitor at regular intervals how much of sequence A has been forgotten by the network.
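A small sketch of one way to score this forgetting (the 0.5 threshold is my assumption): count, for each serial position of sequence A, how many of the 100 output units fall on the wrong side of the threshold.

```python
import numpy as np

def incorrect_units(prediction, target, threshold=0.5):
    """Number of output units on the wrong side of the threshold for one serial position."""
    return int(np.sum((prediction > threshold) != (target > threshold)))
```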

Page 35

Normal Elman networks: Catastrophic forgetting

[Figure: two panels, (a) and (b); x-axis: training epochs (0-450); curves: serial positions 1-10; y-axis: 0-100. See captions below.]

(a): Learning of sequence B (after having previously learned sequence A). By 450 epochs (an epoch corresponds to one pass through the entire sequence), sequence B has been completely learned.

(b): The number of incorrect units (out of 100) for each serial position of sequence A during learning of sequence B. After 450 epochs, the SRN has, for all intents and purposes, completely forgotten the previously learned sequence A.

Page 36

Dual RSRNs: Catastrophic forgetting is eliminated

[Figure: two panels, (a) and (b); x-axis: training epochs (0-400); curves: serial positions 1-10; y-axis: 0-100. See captions below.]

Recall performance for sequences B and A during learning of sequence B by a dual-network RSRN.

(a): By 400 epochs, the second sequence B has been completely learned.

(b): The previously learned sequence A shows virtually no forgetting. Catastrophic forgetting of the previously learned sequence A has been completely overcome.

Page 37

[Figure: summary comparison while sequence B is being learned. Normal Elman network: massive forgetting (% error on sequence A). Dual RSRN: no forgetting of sequence A.]

Page 38

Cognitive/Neurobiological plausibility?

• The brain, somehow, does not forget catastrophically.

• Separating new learning from previously learned information seems necessary.

• McClelland, McNaughton, and O’Reilly (1995) have suggested that the hippocampal-neocortical separation may be Nature’s way of solving this problem.

• Pseudopattern transfer is not so far-fetched if we accept results claiming that neocortical memory consolidation is due, at least in part, to REM sleep.

Page 39

Conclusions

• The RSRN reverberating dual-network architecture (Ans & Rousset, 1997, 2000) can be generalized to sequential learning of multiple temporal sequences.

• When learning multiple sequences of patterns, interleaving simple reverberated input-output pseudopatterns, each of which reflects the entire previously learned sequence(s), reduces (or entirely eliminates) forgetting of the initially learned sequence(s).