Elman Ch_Mind and Motion


Transcript of Elman Ch_Mind and Motion

  • 8/13/2019 Elman Ch_Mind and Motion


8 Language as a Dynamical System

Jeffrey L. Elman

EDITORS' INTRODUCTION

One of the most distinctive features of human cognition is linguistic performance, which exhibits unique kinds of complexity. An important argument in favor of the computational approach to cognition has been that only computational machines of a certain kind can exhibit behavior of that complexity. Can the dynamical approach offer an alternative account of this central aspect of cognition?

In this chapter, Jeff Elman confronts this problem head on. He proposes the bold hypothesis that human natural language processing is actually the behavior of a dynamical system, a system quite different than the standard kinds of computational machine traditionally deployed in cognitive science. Demonstrating the truth of this hypothesis requires showing how various aspects of language processing can be successfully modeled using dynamical systems. The particular aspect that Elman focuses on here is the ability to predict the next word in a sentence after being presented with the first n words in a sequence. Success in this task requires both memory for what has already been presented and an understanding of the distinctive complexity of natural language. The model he deploys is a kind of dynamical system that is popular in connectionist work, the SRN (simple recurrent network).

It turns out that it is possible to develop (by training using backpropagation, a connectionist method for determining parameter settings of a dynamical system) networks that exhibit a high degree of success, and whose failings are interestingly reminiscent of difficulties that humans encounter. Perhaps the most important outcome of this work, however, is that it suggests ways to dramatically reconceptualize the basic mechanisms underlying linguistic performance using terms and concepts of dynamics. Thus internal representations of words are not symbols but locations in state space, the lexicon or dictionary is the structure in this space, and processing rules are not symbolic specifications but the dynamics of the system, which push the system state in certain directions rather than others.

There are, of course, many aspects of language processing that Elman's models do not even begin to address. At this stage, these aspects are interesting open problems for the dynamical approach. In the meantime, it is clear that, contrary to the suspicions of some, dynamics does indeed provide a fruitful framework for the ongoing study of high-level aspects of cognition such as language.



8.1 INTRODUCTION

Despite considerable diversity among theories about how humans process language, there are a number of fundamental assumptions that are shared by most such theories. This consensus extends to the very basic question about what counts as a cognitive process. So although many cognitive scientists are fond of referring to the brain as a "mental organ" (e.g., Chomsky, 1975), implying a similarity to other organs such as the liver or kidneys, it is also assumed that the brain is an organ with special properties which set it apart. Brains "carry out computation" (it is argued); they "entertain propositions"; and they "support representations." Brains may be organs, but they are very different from the other organs in the body.

Obviously, there are substantial differences between brains and kidneys, just as there are between kidneys and hearts and the skin. It would be silly to minimize these differences. On the other hand, a cautionary note is also in order. The domains over which the various organs operate are quite different, but their common biological substrate is quite similar. The brain is indeed quite remarkable, and does some things which are very similar to human-made symbol processors, but there are also profound differences between the brain and digital symbol processors, and attempts to ignore these on grounds of simplification or abstraction run the risk of fundamentally misunderstanding the nature of neural computation (Churchland and Sejnowski, 1992). In a larger sense, I raise the more general warning that (as Ed Hutchins has suggested) "cognition may not be what we think it is." Among other things, I suggest in this chapter that language (and cognition in general) may be more usefully understood as the behavior of a dynamical system. I believe this is a view which both acknowledges the similarity of the brain to other bodily organs and respects the evolutionary history of the nervous system, while also acknowledging the very remarkable properties possessed by the brain.

In the view I shall outline, representations are not abstract symbols but rather regions of state space. Rules are not operations on symbols but rather embedded in the dynamics of the system, a dynamics which permits movement from certain regions to others while making other transitions difficult. Let me emphasize from the beginning that I am not arguing that language behavior is not rule-governed. Instead, I suggest that the nature of the rules may be different from what we have conceived them to be.

The remainder of this chapter is organized as follows. In order to make clear how the dynamical approach (instantiated concretely here as a connectionist network) differs from the standard approach, I begin by summarizing some of the central characteristics of the traditional approach to language processing. I then describe a connectionist model which embodies different operating principles from the classic approach to symbolic computation. The results of several simulations using that architecture are presented and discussed. Finally, I discuss some of the results which may be yielded by this perspective.

Jeffrey L. Elman 96


8.2 GRAMMAR AND THE LEXICON: THE TRADITIONAL APPROACH

Language processing is traditionally assumed to involve a lexicon, which is the repository of facts concerning individual words, and a set of rules which constrain the ways those words can be combined to form sentences. From the point of view of a listener attempting to process spoken language, the initial problem involves taking acoustic input and retrieving the relevant word from the lexicon. This process is often supposed to involve separate stages of lexical access (in which contact is made with candidate words based on partial information), and lexical recognition (or retrieval, or selection, in which a choice is made of a specific word), although finer-grained distinctions may also be useful (e.g., Tyler and Frauenfelder, 1987). Subsequent to recognition, the retrieved word must be inserted into a data structure that will eventually correspond to a sentence; this procedure is assumed to involve the application of rules.

As described, this scenario may seem simple, straightforward, and not likely to be controversial. But in fact, there is considerable debate about a number of important details. For instance:

Is the lexicon passive or active? In some models, the lexicon is a passive data structure (Forster, 1976). In other models, lexical items are active (Marslen-Wilson, 1980; McClelland and Elman, 1986; Morton, 1979), in the style of Selfridge's "demons" (Selfridge, 1958).

How is the lexicon organized, and what are its entry points? In active models, the internal organization of the lexicon is less an issue, because the lexicon is also usually content-addressable, so that there is direct and simultaneous contact between an unknown input and all relevant lexical representations. With passive models, an additional look-up process is required, and so the organization of the lexicon becomes more important for efficient and rapid search. The lexicon may be organized along dimensions which reflect phonological or orthographic or syntactic properties, or it may be organized along usage parameters such as frequency (Forster, 1976). Other problems include how to catalog morphologically related elements (e.g., are "telephone" and "telephonic" separate entries? "girl" and "girls"? "ox" and "oxen"?); how to represent words with multiple meanings (the various meanings of "bank" may be different enough to warrant distinct entries, but what about the various meanings of "run," some of which are only subtly different, while others have more distant but still clearly related meanings?); whether the lexicon includes information about argument structure; and so on.

Is recognition all-or-nothing or graded? In some theories, recognition occurs at the point where a spoken word becomes uniquely distinguished from its competitors (Marslen-Wilson, 1980). In other models, there may be no consistent point where recognition occurs; rather, recognition is a graded process, subject to interactions which may hasten or slow down the retrieval of a word in a given context. The recognition point is a strategically controlled threshold (McClelland and Elman, 1986).

Language as a Dynamical System 97


How do lexical competitors interact? If the lexicon is active, there is the potential for interactions between lexical competitors. Some models build in inhibitory interactions between words (McClelland and Elman, 1986); others have suggested that the empirical evidence rules out word-word inhibitions (Marslen-Wilson, 1980).

How are sentence structures constructed from words? This single question has given rise to a vast and complex literature. The nature of the sentence structures themselves is fiercely debated, reflecting the diversity of current syntactic theories. There is in addition considerable controversy around the sort of information which may play a role in the construction process, or the degree to which at least a first-pass parse is restricted to the purely syntactic information available to it (Frazier and Rayner, 1982; Trueswell, Tanenhaus, and Kello, 1992).

There are thus a considerable number of questions which remain open. Nonetheless, I believe it is accurate to say that there is also considerable consensus regarding certain fundamental principles. I take this consensus to include the following.

1. A commitment to discrete and context-free symbols. This is more readily obvious in the case of the classic approaches, but many connectionist models utilize localist representations in which entities are discrete and atomic (although graded activations may be used to reflect uncertain hypotheses). A central feature of all of these forms of representation (localist connectionist as well as symbolic) is that they are intrinsically context-free. The symbol for a word, for example, is the same regardless of its usage. This gives such systems great combinatorial power, but it also limits their ability to reflect idiosyncratic or contextually specific behaviors. This assumption also leads to a distinction between types and tokens, and motivates the need for variable binding. Types are the canonical context-free versions of symbols; tokens are the versions that are associated with specific contexts; and binding is the operation that enforces the association (e.g., by means of indices, subscripts, or other diacritics).

2. The view of rules as operators and the lexicon as operands. Words in most models are conceived of as the objects of processing. Even in models in which lexical entries may be active, once a word is recognized it becomes subject to grammatical rules which build up higher-level structures.

3. The static nature of representations. Although the processing of language clearly unfolds over time, the representations that are produced by traditional models typically have a curiously static quality. This is revealed in several ways. For instance, it is assumed that the lexicon preexists as a data structure, in much the same way that a dictionary exists independently of its use. Similarly, the higher-level structures created during sentence comprehension are built up through an accretive process, and the successful product of comprehension will be a mental structure in which all the constituent parts (words, categories, relational information) are simultaneously present. (Presumably these become inputs to some subsequent interpretive process which constructs



discourse structures.) That is, although processing models ("performance models") often take seriously the temporal dynamics involved in computing target structures, the target structures themselves are inherited from theories which ignore temporal considerations ("competence models").

4. The building metaphor. In the traditional view, the act of constructing mental representations is similar to the act of constructing a physical edifice. Indeed, this is precisely what is claimed in the physical symbol system hypothesis (Newell, 1980). In this view, words and more abstract constituents are like the bricks in a building; rules are the mortar which binds them together. As processing proceeds, the representation grows, much as does a building under construction. Successful processing results in a mental edifice that is a complete and consistent structure, again much like a building.

I take these assumptions to be widely shared among researchers in the field of language processing, although they are rarely stated explicitly. Furthermore, these assumptions have formed the basis for a large body of empirical literature; they have played a role in the framing of the questions that are posed, and later in interpreting the experimental results. Certainly it is incumbent on any theory that is offered as a replacement to at least provide the framework for describing the empirical phenomena, as well as improving our understanding of the data.

Why might we be interested in another theory? One reason is that this view of our mental life, which I have just described (that is, a view that relies on discrete, static, passive, and context-free representations) appears to be sharply at variance with what is known about the computational properties of the brain (Churchland and Sejnowski, 1992). It must also be acknowledged that while the theories of language which subscribe to the assumptions listed above do provide a great deal of coverage of data, that coverage is often flawed, internally inconsistent and ad hoc, and highly controversial. So it is not unreasonable to raise the question: Do the shortcomings of the theories arise from assumptions that are basically flawed? Might there be other, better ways of understanding the nature of the mental processes and representations that underlie language? In the next section, I suggest an alternative view of computation in which language processing is seen as taking place in a dynamical system. The lexicon is viewed as consisting of regions of state space within that system; the grammar consists of the dynamics (attractors and repellers) which constrain movement in that space. As we shall see, this approach entails representations that are highly context-sensitive, continuously varied, and probabilistic (but, of course, 0.0 and 1.0 are also probabilities), and in which the objects of mental representation are better thought of as trajectories through mental space rather than things constructed.

An entry point to describing this approach is the question of how one deals with time and the problem of serial processing. Language, like many other behaviors, unfolds and is processed over time. This simple fact (so simple it seems trivial) turns out to be problematic when explored in detail. Therefore, I turn now to the question of time. I describe a connectionist


approach to temporal processing and show how it can be applied to several linguistic phenomena. In the final section I turn to the payoff and attempt to show how this approach leads to useful new views about the lexicon and about grammar.

8.3 THE PROBLEM OF TIME

Time is the medium in which all our behaviors unfold; it is the context within which we understand the world. We recognize causality because causes precede effects; we learn that coherent motion over time of points on the retinal array is a good indicator of objecthood; and it is difficult to think about phenomena such as language, or goal-directed behavior, or planning, without some way of representing time. Time's arrow is such a central feature of our world that it is easy to think that, having acknowledged its pervasive presence, little more needs to be said.

But time has been the stumbling block to many theories. An important issue in models of motor activity, for example, has been the nature of the motor intention. Does the action plan consist of a literal specification of output sequences (probably not), or does it represent serial order in a more abstract manner (probably so, but how?; e.g., Fowler, 1977; Jordan and Rosenbaum, 1988; Kelso, Saltzman, and Tuller, 1986; MacNeilage, 1970)? Within the realm of natural language processing, there is considerable controversy about how information accumulates over time and what information is available when (e.g., Altmann and Steedman, 1988; Ferreira and Henderson, 1990; Trueswell, Tanenhaus, and Kello, 1992).

Time has been a challenge for connectionist models as well. Early models, perhaps reflecting the initial emphasis on the parallel aspects of these models, typically adopted a spatial representation of time (e.g., McClelland and Rumelhart, 1981). The basic approach is illustrated in figure 8.1. The temporal order of input events (first to last) is represented by the spatial order (left to right) of the input vector. There are a number of problems with this approach (see Elman, 1990, for discussion). One of the most serious is that the left-to-right spatial ordering has no intrinsic significance at the level of computation which is meaningful for the network. All input dimensions are orthogonal to one another in the input vector space. The human eye tends to see patterns such as 01110000 and 00001110 as having undergone a spatial (or temporal, if we understand these as representing an ordered sequence) translation, because the notation suggests a special relationship may exist between adjacent bits. But this relationship is the result of considerable processing by the human visual system, and is not intrinsic to the vectors themselves. The first element in a vector is not "closer" in any useful sense to the second element than it is to the last element. Most important, it is not available to simple networks of the form shown in figure 8.1. A particularly unfortunate consequence is that there is no basis in such architectures for generalizing what has been learned about spatial or temporal stimuli to novel patterns.
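The point about orthogonality can be checked directly. In a spatial encoding of time, each time slot is a separate input dimension, so "adjacent" slots are no closer to one another than maximally distant ones. A minimal illustration (the 8-slot patterns are the ones mentioned above; everything else is invented for the demonstration):

```python
import math

# Hypothetical 8-slot "spatial" encoding: dimension i stands for time step i.
def basis(i, n=8):
    return [1.0 if j == i else 0.0 for j in range(n)]

def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# "Adjacent" time slots are exactly as far apart as maximally distant ones:
print(euclid(basis(0), basis(1)) == euclid(basis(0), basis(7)))  # True

# The patterns 01110000 and 00001110, which the eye reads as a translation
# of one another, share no active dimensions at all -- they are orthogonal:
p1 = [0, 1, 1, 1, 0, 0, 0, 0]
p2 = [0, 0, 0, 0, 1, 1, 1, 0]
print(dot(p1, p2))  # 0
```

Nothing in the vector space itself encodes the adjacency the notation suggests, which is why such networks cannot generalize a learned pattern to its temporal translations.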


Figure 8.1 The network is feedforward because activations at each level depend only on the input received from below. At the conclusion of processing an input, all activations are thus lost. A sequence of inputs can be represented in such an architecture by associating the first node (on the left) with the first element in the sequence, the second node with the second element, and so on.

More recent models have explored what is intuitively a more appropriate idea: let time be represented by the effects it has on processing. If network connections include feedback loops, then this goal is achieved naturally. The state of the network will be some function of the current inputs plus the network's prior state. Various algorithms and architectures have been developed which exploit this insight (e.g., Elman, 1990; Jordan, 1986; Mozer, 1989; Pearlmutter, 1989; Rumelhart, Hinton, and Williams, 1986). Figure 8.2 shows one architecture, the simple recurrent network (SRN), which was used for the studies to be reported here.

Figure 8.2 A simple recurrent network (SRN). Solid lines indicate full connectivity between layers, with weights which are trainable. The dashed line indicates a fixed one-to-one connection between hidden and context layers. The context units are used to save the activations of the hidden units on any time step. Then, on the next time step, the hidden units are activated not only by new input but by the information in the context units, which is just the hidden units' own activations on the prior time step. An input sequence is processed by presenting each element in the sequence one at a time, allowing the network to be activated at each step in time, and then proceeding to the next element. Note that although hidden unit activations may depend on prior inputs, by virtue of prior inputs' effects on the recycled hidden unit / context unit activations, the hidden units do not record the input sequence in any veridical manner. Instead, the task of the network is to learn to encode temporal events in some more abstract manner which allows the network to perform the task at hand.

In the SRN architecture, at time t, hidden units receive external input, and also collateral input from themselves at time t - 1 (the context units are simply used to implement this delay). The activation function for any given hidden unit h_i is the familiar logistic,

f(net_i) = 1 / (1 + e^(-net_i)),

but where the net input to the unit at time t, net_i(t), is now

net_i(t) = Σ_j w_ij a_j(t) + b_i + Σ_k w_ik h_k(t - 1).

That is, the net input on any given tick of the clock t includes not only the weighted sum of inputs and the node's bias but also the weighted sum of the hidden unit vector at the prior time step. (Henceforth, when referring to the state space of this system, I shall refer specifically to the k-dimensional space defined by the k hidden units.)

In the typical feedforward network, hidden units develop representations which enable the network to perform the task at hand (Rumelhart et al., 1986). These representations may be highly abstract, and are function-based. That is, the similarity structure of the internal representations reflects the demands of the task being learned, rather than the similarity of the inputs' form. When recurrence is added, the hidden units assume an additional function.
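As a purely illustrative sketch, the SRN update rule can be written out directly. The layer sizes and weights below are hypothetical and untrained, and the output layer and backpropagation training are omitted:

```python
import math
import random

random.seed(0)

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical dimensions: 5 input units, 3 hidden units.
n_in, n_hid = 5, 3

# Randomly initialized (untrained) weights and biases.
W_in  = [[random.uniform(-1, 1) for _ in range(n_in)]  for _ in range(n_hid)]
W_ctx = [[random.uniform(-1, 1) for _ in range(n_hid)] for _ in range(n_hid)]
bias  = [random.uniform(-1, 1) for _ in range(n_hid)]

def srn_step(x, h_prev):
    """One tick of the clock:
    net_i(t) = sum_j w_ij a_j(t) + b_i + sum_k w_ik h_k(t-1)."""
    h = []
    for i in range(n_hid):
        net = bias[i]
        net += sum(W_in[i][j] * x[j] for j in range(n_in))         # external input
        net += sum(W_ctx[i][k] * h_prev[k] for k in range(n_hid))  # context units
        h.append(logistic(net))
    return h

# Process a sequence of localist (one-hot) inputs one element at a time;
# the hidden state is a function of the current input AND the prior state.
h = [0.0] * n_hid
for word in [0, 2, 4]:  # hypothetical word indices
    x = [1.0 if j == word else 0.0 for j in range(n_in)]
    h = srn_step(x, h)
print(len(h), all(0.0 < v < 1.0 for v in h))
```

Because the context units simply copy the prior hidden state, the same code realizes the delay loop of figure 8.2 without any separate context-layer data structure.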


Language is a domain in which the ordering of elements is particularly complex. Word order, for instance, reflects the interaction of multiple factors. These include syntactic constraints, semantic and pragmatic goals, discourse considerations, and processing constraints (e.g., verb-particle constructions such as "run up" may be split by a direct object, but not when the noun phrase is long enough to disrupt the processing of the discontinuous verb as a unit). Whether or not one subscribes to the view that these knowledge sources exert their effects autonomously or interactively, there is no question that the final output, the word stream, reflects their joint interplay.

We know also that the linear order of linguistic elements provides a poor basis for characterizing the regularities that exist within a sentence. A noun may agree in number with a verb which immediately follows it, as in 1(a), or which is separated by an arbitrarily great distance, as in 1(b) (the subscripts "pl" and "sg" refer to plural and singular):

1. (a) The children[pl] like[pl] ice cream.

(b) The girl[sg] who Emily baby-sits for every other Wednesday while her parents go to night school likes[sg] ice cream.

Such considerations led Miller and Chomsky (1963) to argue that statistically based algorithms are infeasible for language learning, since the number of sentences a listener would need to hear in order to know precisely which of the 15 words which precede likes in 1(b) determines the correct number for likes would vastly outnumber the data available (in fact, even conservative estimates suggest that more time would be needed than is available in an individual's entire lifetime). On the other hand, recognition that the dependencies respect an underlying hierarchical structure vastly simplifies the problem: subject nouns in English agree in number with their verbs; embedded clauses may intervene but do not participate in the agreement process.

One way to challenge an SRN with a problem which has some relevance to language would therefore be to attempt to train it to predict the successive words in sentences. We know that this is a hard problem which cannot be solved in any general way by simple recourse to linear order. We know also that this is a task which has some psychological validity. Human listeners are able to predict word endings from beginnings; listeners can predict grammaticality from partial sentence input; and sequences of words which violate expectations (i.e., which are unpredictable) result in distinctive electrical activity in the brain. An interesting question is whether a network could be trained to predict successive words. In the following two simulations we shall see how, in the course of solving this task, the network develops novel representations of the lexicon and of grammatical rules.
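The target behavior for the prediction task can be made concrete. Because a natural corpus is nondeterministic, the best any predictor can do at each step is to produce the conditional distribution over possible successors, the "cohort" of words that can follow in that context. A sketch of that empirical distribution, using an invented toy corpus (not the actual training corpus described below):

```python
from collections import Counter, defaultdict

# An invented toy corpus of concatenated short sentences, in the spirit of
# the small-lexicon simulations described in this chapter.
corpus = ("boy chases dog . girl sees cat . boy sees dog . "
          "girl chases cat . dog sees boy .").split()

# Tally successors of each word to estimate P(next word | current word).
successors = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    successors[cur][nxt] += 1

def cond_prob(cur):
    """Empirical conditional distribution over the successor cohort."""
    total = sum(successors[cur].values())
    return {w: c / total for w, c in successors[cur].items()}

# "boy" is followed by "chases", "sees", or "." with equal frequency here,
# so a perfectly trained predictor should spread activation over that cohort.
print(sorted(cond_prob("boy").items()))
```

This is the yardstick used later in the chapter: a trained network's output vector is compared (by cosine similarity) against exactly this kind of empirically derived distribution.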

The Lexicon as Structured State Space

Words may be categorized with respect to many factors. These include such traditional notions as noun, verb, etc.; the argument structures they are associated with; and semantic features. Many of these characteristics are predictive of a word's syntagmatic properties. But is the reverse true? Can distributional facts be used to infer something about a word's semantic or categorial features? The goal of the first simulation was to see if a network could work backward in just this sense.

A small lexicon of 29 nouns and verbs was used to form simple sentences (see Elman, 1990, for details). Each word was represented as a localist vector in which a single randomly assigned bit was turned on. This input representation insured that there was nothing about the form of the word that was correlated with its properties, and thus that any classifications would have to be discovered by the network based solely on distributional behavior.

A network similar to the one shown in figure 8.2 was trained on a set of 10,000 sentences, with each word presented in sequence to the network and each sentence concatenated to the preceding sentence. The task of the network was to predict the successive word. After each word was input, the output (which was the prediction of the next input) was compared with the actual next word, and weights were adjusted by the backpropagation-of-error learning algorithm.

At the conclusion of training, the network was tested by comparing its predictions against the corpus. Since the corpus was nondeterministic, it was not reasonable to expect that the network (short of memorizing the sequence) would be able to make exact predictions. Instead, the network predicted the cohort of potential word successors in each context. The activation of each cohort turned out to be highly correlated with the conditional probability of each word in that context (the mean cosine of the output vector with the empirically derived probability distribution was 0.916).

This behavior suggests that in order to maximize performance at prediction, the network identifies inputs as belonging to classes of words based on distributional properties and co-occurrence information. These classes were not represented in the overt form of the words, since these were all orthogonal to one another. However, the network is free to learn internal representations at the hidden unit layer which might capture this implicit information.

To test this possibility, the corpus of sentences was run through the network a final time. As each word was input, the hidden unit activation pattern that was produced by the word, plus the context layer, was saved. For each of the 29 words, a mean vector was computed, averaging across all instances of the word in all contexts. These mean vectors were taken to be prototypes, and were subjected to hierarchical clustering. The point of this was to see whether the intervector distances revealed anything about the similarity structure of the hidden unit representation space (Euclidean distance was taken as a measure of similarity). The tree in figure 8.3 was then constructed from that hierarchical clustering.

The similarity structure revealed in this tree indicates that the network discovered several major categories of words. The two largest categories correspond to the input vectors which are verbs and nouns. The verb category

    Languagesa Dynamical ystem05

  • 8/13/2019 Elman Ch_Mind and Motion

    12/31

    DOoIUG

    ANIMATES

    NOUNS

    FOODBREAKABLB

is subdivided into those verbs which require a direct object, those which are intransitive, and those for which (in this corpus) a direct object was optional. The noun category is broken into animates and inanimates. The animates contain two classes, human and nonhuman, with nonhumans subdivided into large animals and small animals. The inanimates are divided into breakables, edibles, and miscellaneous.

Figure 8.3 Hierarchical clustering diagram of hidden unit activations in the simulation with simple sentences. After training, sentences are passed through the network, and the hidden unit activation pattern for each word is recorded. The clustering diagram indicates the similarity structure among these patterns. This structure, which reflects the grammatical factors that influence word position, is inferred by the network; the patterns which represent the actual inputs are orthogonal and carry none of this information.

First, it must be said that the network obviously knows nothing about the real semantic content of these categories. It has simply inferred that such a category structure exists. The structure is inferred because it provides the best


basis for accounting for distributional properties. Obviously, a full account of language would require an explanation of how this structure is given content (grounded in the body and in the world). But it is interesting that the evidence for the structure can be inferred so easily on the basis only of form-internal evidence, and this result may encourage caution about just how much information is implicit in the data and how difficult it may be to use this information to construct a framework for conceptual representation.

However, my main point is not to suggest that this is the primary way in which grammatical categories are acquired by children, although I believe that co-occurrence information may indeed play a role in such learning. The primary thing I would like to focus on is what this simulation suggests about the nature of representation in systems of this sort. That is, I would like to consider the representational properties of such networks, apart from the specific conditions which give rise to those representations.

Where is the lexicon in this network?
Recall the earlier assumptions. The lexicon is typically conceived of as a passive data structure. Words are objects of processing. They are first subject to acoustic and phonetic analysis, and then their internal representations must be accessed, recognized, and retrieved from permanent storage. Following this, the internal representations have to be inserted into a grammatical structure.

The status of words in a system of the sort described here is very different: words are not the objects of processing as much as they are inputs which drive the processor in a more direct manner. As Wiles and Bloesch (1992) suggest, it is more useful to understand inputs to networks of this sort as operators rather than as operands. Inputs operate on the network's internal state and move it to another position in state space. What the network learns over time is what response it should make to different words, taking context into account. Because words have reliable and systematic effects on behavior, it is not surprising that all instances of a given word should result in states which are tightly clustered, or that grammatically or semantically related words should produce similar effects on the network. We might choose to think of the internal state that the network is in when it processes a word as representing that word (in context), but it is more accurate to think of that state as the result of processing the word, rather than as a representation of the word itself.

Note that there is an implicitly hierarchical organization to the regions of state space associated with different words. This organization is achieved through the spatial structure. Conceptual similarity is realized through position in state space. Words that are conceptually distant produce hidden unit activation patterns that are spatially far apart.
Higher-level categories correspond to large regions of space; lower-level categories correspond to more restricted subregions. For example, dragon is a noun and causes the network to move into the noun region of the state space. It is also [+animate], which is reflected in the subregion of noun space which results. Because large animals typically are described in different terms and do different things than


small animals, the general region of space corresponding to dragon, monster, and lion is distinct from that occupied by mouse, cat, and dog. The boundaries between these regions may be thought of as hard in some cases (e.g., nouns are very far from verbs) or soft in others (e.g., sandwich, cookie, and bread are not very far from car, book, and rock). One even might imagine cases where, in certain contexts, tokens of one word might overlap with tokens of another. In such cases, one would say that the system has generated highly similar construals of the different words.
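The sense in which a word acts as an operator on the network's state can be sketched concretely. The following toy sketch uses random, untrained weights and a hypothetical four-word vocabulary (Elman's network had 70 hidden units and a much larger lexicon); it shows only the core state update of a simple recurrent network, in which each input word maps the current state to a new point in state space.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["boy", "dog", "chases", "sees"]   # hypothetical toy vocabulary
H = 8                                      # hidden units (70 in the simulation)

# Random weights stand in for trained ones; after training, these would
# encode the grammar as a mapping from (state, word) to next state.
W_in = rng.normal(0, 0.5, (H, len(VOCAB)))  # input -> hidden
W_rec = rng.normal(0, 0.5, (H, H))          # context (previous state) -> hidden

def step(h_prev, word):
    """Each word is an operator: it maps the current state to a new state."""
    x = np.zeros(len(VOCAB))
    x[VOCAB.index(word)] = 1.0              # localist (one-hot) input
    return np.tanh(W_in @ x + W_rec @ h_prev)

h = np.zeros(H)
trajectory = []
for w in ["boy", "chases", "dog"]:
    h = step(h, w)
    trajectory.append(h.copy())

# The "representation" of each word-in-context is just the state it produces.
for w, state in zip(["boy", "chases", "dog"], trajectory):
    print(w, np.round(state[:3], 2))
```

Because the new state depends on both the input word and the previous state, every occurrence of a word yields a slightly different vector, which is exactly the tokenization-by-context behavior described above.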

Rules as Attractors

If the lexicon is represented as regions of state space, what about rules? We have already seen that some aspects of grammar are captured in the tokenization of words, but this is a fairly limited sense of grammar. The well-formedness of sentences depends on relationships which are not readily stated in terms of simple linear order. Thus the proper generalization about why the main verb in 1(b) is in the singular is that the main subject is singular, and not that the word 15 words prior was a singular noun. The ability to express such generalizations would seem to require a mechanism for explicitly representing abstract grammatical structure, including constituent relationships (e.g., the notion that some elements are part of others). Notations such as phrase structure trees (among others) provide precisely this capability. It is not obvious how complex grammatical relations might be expressed using distributed representations. Indeed, it has been argued that distributed representations (of the sort exemplified by the hidden unit activation patterns in the previous simulation) cannot have constituent structure in any systematic fashion (Fodor and Pylyshyn, 1988). (As a backup, Fodor and Pylyshyn suggest that if distributed representations do have a systematic constituent structure, then they are merely implementations of what they call the "classical" theory, in this case the language of thought: Fodor, 1976.)

The fact that the grammar of the first simulation was extremely simple made it difficult to explore these issues. Sentences were all declarative and monoclausal. This simulation sheds little light on the grammatical potential of such networks.

A better test would be to train the network to predict words in complex sentences that contain long-distance dependencies. This was done in Elman (1991b) using a strategy similar to the one outlined in the prior simulation, except that sentences had the following characteristics:

1. Nouns and verbs agreed in number. Singular nouns required singular verbs; plural nouns selected plural verbs.
2. Verbs differed with regard to their verb argument structure. Some verbs were transitive, others were intransitive, and others were optionally transitive.
3. Nouns could be modified by relative clauses. Relative clauses could either be object-relatives (the head had the object role in the clause) or subject-relatives (the head was the subject of the clause), and either subject or object nouns could be relativized.

As in the previous simulation, words were represented in localist fashion, so that information about neither the grammatical category (noun or verb) nor the number (singular or plural) was contained in the form of the word. The network also only saw positive instances; only grammatical sentences were presented.

The three properties interact in ways designed to make the prediction task difficult. The prediction of number is easy in a sentence such as 2(a), but harder in 2(b).

2. (a) The boys[pl] chase[pl] the dogs.
(b) The boys[pl] who the dog[sg] chases[sg] run[pl] away.

In the first case the verb follows immediately. In the second case, the first noun agrees with the second verb (run) and is plural; the verb which is actually closest to it (chases) is in the singular because it agrees with the intervening word (dog).

Relative clauses cause similar complications for verb argument structure. In 3(a-c), it is not difficult for the network to learn that chase requires a direct object, see permits (but does not require) one, and lives is intransitive.

3. (a) The cats chase the dog.
(b) The girls see. The girls see the car.
(c) The patient lives.

On the other hand, consider 4.

4. The dog who the cats chase runs away.

The direct object of the verb chase in the relative clause is dog. However, dog is also the head of the clause, as well as the subject of the main clause. Chase in this grammar is obligatorily transitive, but the network must learn that when it occurs in such structures the object position is left empty (gapped) because the direct object has already been mentioned (filled) as the clause head.
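To make the shape of such a training environment concrete, here is a minimal sketch of a sentence generator enforcing the first two properties (number agreement and verb argument structure). The lexicon and probabilities are hypothetical stand-ins, and relative clauses are omitted for brevity.

```python
import random

random.seed(1)

# Hypothetical toy lexicon loosely modeled on the simulation's categories.
NOUNS = {"boy": "sg", "boys": "pl", "dog": "sg", "dogs": "pl"}
VERBS = {  # lemma -> (argument structure, {number: inflected form})
    "chase": ("transitive", {"sg": "chases", "pl": "chase"}),
    "live":  ("intransitive", {"sg": "lives", "pl": "live"}),
    "see":   ("optional", {"sg": "sees", "pl": "see"}),
}

def sentence():
    subj = random.choice(list(NOUNS))
    verb, (argstruct, forms) = random.choice(list(VERBS.items()))
    words = [subj, forms[NOUNS[subj]]]          # verb agrees with subject number
    needs_obj = argstruct == "transitive" or (
        argstruct == "optional" and random.random() < 0.5)
    if needs_obj:
        words.append(random.choice(list(NOUNS)))
    return words

for _ in range(3):
    print(" ".join(sentence()))
```

Only grammatical sentences are ever emitted, mirroring the positive-instances-only regime described in the text; the prediction task is hard precisely because the agreement and transitivity facts are never marked in the input forms themselves.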

These data illustrate the sorts of phenomena which have been used by linguists to argue for abstract representations with constituent structure (Chomsky, 1975); they have also been used to motivate the claim that language processing requires some form of pushdown store or stack mechanism. They therefore impose a difficult set of demands on a recurrent network. However, after training a network on such stimuli (Elman, 1991b), it appeared the network was able to make correct predictions (mean cosine between outputs and empirically derived conditional probability distributions: 0.852; a perfect performance would have been 1.0). These predictions honored the grammatical constraints that were present in the training data. The network was able to correctly predict the number of a main sentence verb even in the presence of intervening clauses (which might have the same


or conflicting number agreement between nouns and verbs). The network also not only learned about verb argument structure differences but correctly "remembered" when an object-relative head had appeared, so that it would not predict a noun following an embedded transitive verb. Figure 8.4 shows the predictions made by the network during testing with a novel sentence.

How is this behavior achieved? What is the nature of the underlying knowledge possessed by the network which allows it to perform in a way which conforms with the grammar? It is not likely that the network simply memorized the training data, because the network was able to generalize its performance to novel sentences and structures it had never seen before. But just how general was the solution, and just how systematic?

In the previous simulation, hierarchical clustering was used to measure the similarity structure between internal representations of words. This gives us an indirect means of determining the spatial structure of the representation space. It does not let us actually determine what that structure is.¹ So one would like to be able to visualize the internal state space more directly. This is also important because it would allow us to study the ways in which the network's internal state changes over time as it processes a sentence. These trajectories might tell us something about how the grammar is encoded.

One difficulty which arises in trying to visualize movement in the hidden unit activation space over time is that it is an extremely high-dimensional space (70 dimensions, in the current simulation). These representations are distributed, which typically has the consequence that interpretable information cannot be obtained by examining activity of single hidden units. Information is more often encoded along dimensions that are represented across multiple hidden units.

This is not to say, however, that the information is not there, of course, but simply that one needs to discover the proper viewing perspective to get at it. One way of doing this is to carry out a principal components analysis (PCA) over the hidden unit activation vectors. PCA allows us to discover the dimensions along which there is variation in the vectors; it also makes it possible to visualize the vectors in a coordinate system aligned with this variation. This new coordinate system has the effect of giving a somewhat more localized description to the hidden unit activation patterns. Since the dimensions are ordered with respect to amount of variance accounted for, we can now look at the trajectories of the hidden unit patterns along selected dimensions of the state space.²

In figures 8.5, 8.6, and 8.7 we see the movement over time through various planes in the hidden unit state space as the trained network processes various test sentences. Figure 8.5 compares the path through state space (along the second principal component, or PCA 2) as the network processes the sentences boys hear boys and boy hears boy. PCA 2 encodes the number of the main clause subject noun, and the difference in the position along this dimension correlates with whether the subject is singular or plural.
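The PCA step itself is simple to sketch. In this toy version, random vectors stand in for the recorded hidden-unit activations; the real analysis would use the states logged while the trained network processes test sentences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for hidden-unit activation vectors logged during processing:
# 50 time steps in a 70-dimensional state space, as in the simulation.
states = rng.normal(size=(50, 70))

def pca(X, k=2):
    """Project the rows of X onto their first k principal components."""
    Xc = X - X.mean(axis=0)                    # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T, S**2 / (len(X) - 1)  # projections, component variances

proj, var = pca(states, k=2)
# Each state now has (PCA 1, PCA 2) coordinates; plotting successive rows
# of `proj` traces the network's trajectory through this plane.
print(proj.shape)
```

Because SVD returns singular values in descending order, the components come out ordered by variance accounted for, which is what licenses looking only at the first few dimensions of the rotated space.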

Figure 8.4 The predictions made by the network in the simulation with complex sentences as the network processes the sentence "boys who Mary chases feed cats." Each panel displays the activations of output units after successive words; outputs are summed across groups for purposes of displaying the data. V, verbs; N, nouns; sg, singular; pl, plural; prop, proper nouns; t, transitive verbs; i, intransitive verbs; t/i, optionally transitive verbs.


Figure 8.5 Trajectories through hidden unit state space as the network processes the sentences "boy hears boy" and "boys hear boy." The number (singular vs. plural) of the subject is indicated by the position in state space along the second principal component.


Figure 8.6 Trajectories through hidden unit state space as the network processes the sentences "boy chases boy," "boy sees boy," and "boy walks." Transitivity of the verb is encoded by its position along an axis which cuts across the first and third principal components. PCA, principal components analysis.


Figure 8.7 Trajectories through hidden unit state space (PCA 1 and 11) as the network processes the sentences "boy chases boy," "boy chases boy who chases boy," "boy who chases boy chases boy," and "boy chases boy who chases boy who chases boy" (to assist in reading the plots, the final word of each sentence is terminated with a "]S").


Figure 8.6 compares trajectories for sentences with verbs which have different argument expectations; chases requires a direct object, sees permits one, and walks precludes one. As can be seen, these differences in argument structure are reflected in a displacement in state space from upper left to lower right. Finally, figure 8.7 illustrates the path through state space for various sentences which differ in degree of embedding. The actual degree of embedding is captured by the displacement in state space of the embedded clauses; sentences with multiple embeddings appear somewhat as spirals.

These trajectories illustrate the general principle at work in this network. The network has learned to represent differences in lexical items as different regions in the hidden unit state space. The sequential dependencies which exist among words in sentences are captured by the movement over time through this space as the network processes successive words in the sentence. These dependencies are actually encoded in the weights which map inputs (i.e., the current state plus new word) to the next state. The weights may be thought of as implementing the grammatical rules which allow well-formed sequences to be processed and to yield valid expectations about successive words. Furthermore, the rules are general. The network weights create attractors in the state space, so that the network is able to respond sensibly to novel inputs, as when unfamiliar words are encountered in familiar contexts.
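The attractor idea can be illustrated directly. In the sketch below the weights are random and deliberately small (so that the state map is a contraction), and the bias vector stands in for the input contribution of one fixed word; iterating the map drives any starting state toward the same fixed point.

```python
import numpy as np

rng = np.random.default_rng(3)

H = 10
# Small weights make the state map a contraction, so it has a unique
# fixed point -- an attractor -- that iteration converges to.
W_rec = rng.normal(0, 0.1, (H, H))
bias = rng.normal(0, 0.5, H)        # stands in for W_in @ x for one fixed word

def settle(h, steps=200):
    """Present the same input repeatedly and return the settled state."""
    for _ in range(steps):
        h = np.tanh(W_rec @ h + bias)
    return h

a = settle(rng.normal(size=H))      # two different starting states...
b = settle(rng.normal(size=H))
print(np.allclose(a, b, atol=1e-8)) # ...settle into the same attractor
```

A trained network is of course not globally contractive in this way; the point of the sketch is only that recurrent weights can carve out basins in state space toward which trajectories are pulled, which is what makes responses to novel inputs systematic rather than arbitrary.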

8.5 DISCUSSION

The image of language processing just outlined does not look very much like the traditional picture we began with. Instead of a dictionary-like lexicon, we have a state space partitioned into various regions. Instead of symbolic rules and phrase structure trees, we have a dynamical system in which grammatical constructions are represented by trajectories through state space. Let me now consider what implications this approach might have for understanding several aspects of language processing.

Beyond Sentences

Although I have focused here on processing of sentences, obviously language processing in real situations typically involves discourse which extends over many sentences. It is not clear, in the traditional scheme, how information represented in sentence structures might be kept available for discourse purposes. The problem is just that on the one hand there are clearly limitations on how much information can be stored, so obviously not everything can be preserved; but on the other hand there are many aspects of sentence-level processing which may be crucially affected by prior sentences. These include not only anaphora, but also such things as argument structure expectations (e.g., the verb to give normally requires a direct object and an indirect object, but in certain contexts these need not appear overtly if understood: Do you plan to give money to the United Way? No, I gave last week).


The network's approach to language processing handles such requirements in a natural manner. The network is a system which might be characterized as highly opportunistic. It learns to perform a task, in this case prediction, doing just what it needs to do. Note that in figure 8.5, for example, the information about the number of the subject noun is maintained only until the verb that agrees with the subject has been processed. From that point on, the two sentences are identical. This happens because once the verb is encountered, subject number is no longer relevant to any aspect of the prediction task. (This emphasizes the importance of the task, because presumably tasks other than prediction could easily require that the subject number be maintained for longer.)

This approach to preserving information suggests that such networks would readily adapt to processing multiple sentences in discourse, since there is no particular reanalysis or re-representation of information required at sentence boundaries, and no reason why some information cannot be preserved across sentences. Indeed, St. John (1992) and Harris and Elman (1989) have demonstrated that networks of this kind readily adapt to processing paragraphs and short stories. (The emphasis on functionality is reminiscent of suggestions made by Agre and Chapman, 1987, and Brooks, 1989. These authors argue that animals need not perfectly represent everything that is in their environment, nor store it indefinitely. Instead, they need merely be able to process that which is relevant to the task at hand.)

Types and Tokens

Consider the first simulation and the network's use of state space to represent words. This is directly relevant to the way in which the system addresses the type/token problem which arises in symbolic systems. In symbolic systems, because representations are abstract and context-free, a binding mechanism is required to attach an instantiation of a type to a particular token.
In the network, on the other hand, tokens are distinguished from one another by virtue of producing small but potentially discriminable differences in the state space. John₂₃, John₄₃, and John₁₉₂ (using subscripts to indicate different occurrences of the same lexical item) will be physically different vectors. Their identity as tokens of the same type is captured by the fact that they are all located in a region which may be designated as the John space, and which contains no other vectors. Thus, one can speak of this bounded region as corresponding to the lexical type John. The differences in context, however, create differences in the state. Furthermore, these differences are systematic. The clustering tree in figure 8.3 was carried out over the mean vector for each word, averaged across contexts. If the actual hidden unit activation patterns are used, the tree is, of course, quite large, since there are hundreds of tokens of each word. Inspection of the tree reveals two important facts. First, all tokens of a type are more similar to one another than to any other type, so the arborization of tokens of boy and dog


do not mix (although, as was pointed out, such overlap is not impossible and may in some circumstances be desirable). Second, there is a substructure to the spatial distribution of tokens which is true of multiple types. Tokens of boy used as subject occur more closely to one another than to the tokens of boy as object. This is also true of the tokens of girl. Moreover, the spatial dimension along which subject-tokens vary from object-tokens is the same for all nouns. Subject-tokens of all nouns are positioned in the same region of this dimension, and object-tokens are positioned in a different region. This means that rather than proliferating an undesirable number of representations, this tokenization of types actually encodes grammatically relevant information. Note that the tokenization process does not involve creation of new syntactic or semantic atoms. It is, instead, a systematic process. The state-space dimensions along which token variation occurs may be interpreted meaningfully. The token's location in state space is thus at least functionally compositional (in the sense described by van Gelder, 1990).
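The type/token geometry just described is easy to sketch with synthetic data. Here each hypothetical word type gets a random center in a 70-dimensional state space and its tokens are context-perturbed copies; the type is recovered as the mean vector across contexts, the same averaging used to build the clustering tree of figure 8.3.

```python
import numpy as np

rng = np.random.default_rng(7)

H = 70
# Synthetic stand-ins: each word type occupies a region of state space, and
# each token (occurrence in context) is a small perturbation within it.
type_centers = {w: rng.normal(0, 1.0, H) for w in ["boy", "dog", "John"]}
tokens = {w: c + rng.normal(0, 0.1, (20, H)) for w, c in type_centers.items()}

# The "type" is recovered as the mean vector across contexts (cf. figure 8.3).
means = {w: t.mean(axis=0) for w, t in tokens.items()}

def nearest_type(vec):
    return min(means, key=lambda w: np.linalg.norm(vec - means[w]))

# Every token should fall in its own type's region of state space.
correct = sum(nearest_type(v) == w for w, toks in tokens.items() for v in toks)
print(correct, "of", 3 * 20, "tokens nearest their own type mean")
```

In a trained network the within-type perturbations are not noise but systematic (subject vs. object position, for example), which is what makes the token structure grammatically informative rather than merely proliferative.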

Polysemy and Accommodation


Polysemy refers to the case where a word has multiple senses. Accommodation is used to describe the phenomenon in which word meanings are contextually altered (Langacker, 1987). The network approach to language processing provides an account of both phenomena and shows how they may be related.

Although there are clear instances where the same phonological form has entirely different meanings (bank, for instance), in many cases polysemy is a matter of degree. There may be senses which are different, although metaphorically related, as in 5(a-c):

5. (a) Arturo Barrios runs very fast!
(b) This clock runs slow.
(c) My dad runs the grocery store down the block.

In other cases the differences are far more subtle, though just as real:

6. (a) Frank Shorter runs the marathon faster than I ever will.
(b) The rabbit runs across the road.
(c) The young toddler runs to her mother.

In 6(a-c), the construal of runs is slightly different, depending on who is doing the running. But just as in 5, the way in which the verb is interpreted depends on context. As Langacker (1987) described the process:

It must be emphasized that syntagmatic combination involves more than the simple addition of components. A composite structure is an integrated system formed by coordinating its components in a specific, often elaborate manner. In fact, it often has properties that go beyond what one might expect from its components alone. . . . [O]ne component may need to be adjusted in certain details when integrated to form a composite structure; I refer to this as accommodation. For example, the meaning of run as applied to humans must


be adjusted in certain respects when extended to four-legged animals such as horses, dogs, and cats . . . in a technical sense, this extension creates a new semantic variant of the lexical item. (pp. 76-77)

In figure 8.8 we see that the network's representations of words in context demonstrate just this sort of accommodation. Trajectories are shown for various sentences, all of which contain the main verb burn. The representation of the verb varies depending on the subject noun. The simulations shown here do not exploit the variants of the verb, but it is clear that this is a basic property of such networks.

"Leaky" Recursion and Processing Complex Sentences

The sensitivity to context that is illustrated in figure 8.8 also occurs across levels of organization. The network is able to represent constituent structure (in the form of embedded sentences), but it is also true that the representation of embedded elements may be affected by words at other syntactic levels. This means that the network does not implement a stack or pushdown machine of the classic sort, and would seem not to implement true recursion, in which information at each level of processing is encapsulated and unaffected by information at other levels. Is this good or bad?

If one is designing a programming language, this sort of "leaky" recursion is highly undesirable. It is important that the value of variables local to one call of a procedure not be affected by their value at other levels. True recursion provides this sort of encapsulation of information. I would suggest that the appearance of a similar sort of recursion in natural language is deceptive, however, and that while natural language may require one aspect of what recursion provides (constituent structure and self-embedding), it may not require the sort of informational firewalls between levels of organization.

Indeed, embedded material typically has an elaborative function. Relative clauses, for example, provide information about the head of a noun phrase (which is at a higher level of organization). Adverbial clauses perform a similar function for main clause verbs.
In general, then, subordination involves a conceptual dependence between clauses. Thus, it may be important that a language-processing mechanism facilitate rather than impede interactions across levels of information.

There are specific consequences for processing which may be observed in a system of this sort, which only loosely approximates recursion. First, the finite bound on precision means that right-branching sentences such as 7(a) will be processed better than center-embedded sentences such as 7(b):

7. (a) The woman saw the boy that heard the man that left.
(b) The man the boy the woman saw heard left.

It has been known for many years that sentences of the first sort are processed in humans more easily and accurately than sentences of the second kind, and a number of reasons have been suggested (e.g., Miller and Isard,


Figure 8.8 Trajectories through hidden unit state space (PCA 1 and 2) as the network processes the sentences "{john, mary, lion, tiger, boy, girl} burns house" as well as "{museum, house} burns" (the final word of each sentence is terminated with "]S"). The internal representation of the word "burns" varies slightly as a function of the verb's subject.


1964). In the case of the network, such an asymmetry arises because right-branching structures do not require that information be carried forward over embedded material, whereas in center-embedded sentences information from the matrix sentence must be saved over intervening embedded clauses.

But it is also true that not all center-embedded sentences are equally difficult to comprehend. Intelligibility may be improved in the presence of semantic constraints. Compare the following, in 8(a and b):

8. (a) The man the woman the boy saw heard left.
(b) The claim the horse he entered in the race at the last minute was a

ringer was absolutely false.

In 8(b) the three subject nouns create strong (and different) expectations about possible verbs and objects. This semantic information might be expected to help the hearer more quickly resolve the possible subject-verb-object associations and assist processing (Bever, 1970; King and Just, 1991). The verbs in 8(a), on the other hand, provide no such help. All three nouns might plausibly be the subject of all three verbs.

In a series of simulations, Weckerly and Elman (1992) demonstrated that a simple recurrent network exhibited similar performance characteristics. It was better able to process right-branching structures, compared to center-embedded sentences. And center-embedded sentences that contained strong semantic constraints were processed better compared to center-embedded sentences without such constraints. Essentially, the presence of constraints meant that the internal state vectors generated during processing were more distinct (further apart in state space) and therefore preserved information better than the vectors in sentences in which nouns were more similar.

The Immediate Use of Lexically Specific Information


One question which has generated considerable controversy concerns the time course of processing, and when certain information may be available and used in the process of sentence processing. One proposal is that there is a first-pass parse during which only category-general syntactic information is available (Frazier and Rayner, 1982). The other major position is that considerably more information, including lexically specific constraints on argument structure, is available and used in processing (Taraban and McClelland, 1990). Trueswell, Tanenhaus, and Kello (1993) present empirical evidence from a variety of experimental paradigms which strongly suggests that listeners are able to use subcategorization information in resolving the syntactic structure of a noun phrase which would otherwise be ambiguous. For example, in 9(a and b), the verb forgot permits both a noun phrase complement and also a sentential complement; at the point in time when the solution has been read, either 9(a) or 9(b) is possible.

9. (a) The student forgot the solution was in the back of the book.
(b) The student forgot the solution.


these networks is deficient in that it is solely reactive. These networks are input-output devices. Given an input, they produce the output appropriate for that training regime. The networks are thus tightly coupled with the world in a manner which leaves little room for endogenously generated activity. There is no possibility here for either spontaneous speech or for reflective internal language. Put most bluntly, these are networks that do not think!

These same criticisms may be leveled, of course, at many other current and more traditional models of language, so they should not be taken as inherent deficiencies of the approach. Indeed, I suspect that the view of linguistic behavior as deriving from a dynamical system probably allows for greater opportunities for remedying these shortcomings. One exciting approach involves embedding such networks in environments in which their activity is subject to evolutionary pressure, and viewing them as examples of artificial life (e.g., Nolfi, Elman, and Parisi, in press). But in any event, it is obvious that much remains to be done.

NOTES

1. For example, imagine a tree with two major branches, each of which has two subbranches. We can be certain that the items on the major branches occupy different regions of state space. More precisely, they lie along the dimension of major variation in that space. We can say nothing about other dimensions of variation, however. Two subbranches may divide in similar ways. For example, [+human] and [-human] animates may each have branches for [+large] and [-large] elements. But it is impossible to know whether [+large, +human] and [+large, -human] elements differ from their [-large] counterparts in exactly the same way, i.e., by lying along some common axis corresponding to size. Clustering tells us only about distance relationships, not about the organization of the space which underlies those relationships.

2. There are limitations to the use of PCA to analyze hidden unit vectors. PCA yields a rotation of the original coordinate system which requires that the new axes be orthogonal. However, it need not be the case that the dimensions of variation in the hidden unit space are orthogonal to one another; this is especially true if the output units that receive the hidden unit vectors as input are nonlinear. It would be preferable to carry out a nonlinear PCA or use some other technique which both relaxed the requirement of orthogonality and took into account the effect of the hidden-to-output nonlinearity.

REFERENCES

Agre, P. E., and Chapman, D. (1987). Pengi: an implementation of a theory of activity. In Proceedings of the AAAI-87. Los Altos, CA: Morgan Kaufmann.

Altmann, G. T. M., and Steedman, M. J. (1988). Interaction with context during human sentence processing. Cognition, 30, 191-238.

Bever, T. (1970). The cognitive basis for linguistic structure. In J. R. Hayes (Ed.), Cognition and the development of language. New York: Wiley.

Brooks, R. A. (1989). A robot that walks: emergent behaviors from a carefully evolved network. Neural Computation, 1, 253-262.

Browman, C. P., and Goldstein, L. (1985). Dynamic modeling of phonetic structure. In V. Fromkin (Ed.), Phonetic linguistics. New York: Academic Press.


Chomsky, N. (1975). Reflections on language. New York: Pantheon.

Churchland, P. S., and Sejnowski, T. J. (1992). The computational brain. Cambridge, MA: MIT Press.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.

Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195-225.

Ferreira, F., and Henderson, J. M. (1990). The use of verb information in syntactic parsing: a comparison of evidence from eye movements and word-by-word self-paced reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16, 555-568.

Fodor, J. (1976). The language of thought. Sussex: Harvester Press.

Fodor, J., and Pylyshyn, Z. (1988). Connectionism and cognitive architecture: a critical analysis. In S. Pinker and J. Mehler (Eds.), Connections and symbols. Cambridge, MA: MIT Press.

Forster, K. (1976). Accessing the mental lexicon. In R. J. Wales and E. Walker (Eds.), New approaches to language mechanisms. Amsterdam: North-Holland.

Fowler, C. (1977). Timing and control in speech production. Bloomington: Indiana University Linguistics Club.

Fowler, C. (1980). Coarticulation and theories of extrinsic timing control. Journal of Phonetics, 8, 113-133.

Frazier, L., and Rayner, K. (1982). Making and correcting errors during sentence comprehension: eye movements in the analysis of structurally ambiguous sentences. Cognitive Psychology, 14, 178-210.

Harris, C., and Elman, J. L. (1989). Representing variable information with simple recurrent networks. In Proceedings of the 10th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.

Jordan, M. I. (1986). Serial order: a parallel distributed processing approach (Institute for Cognitive Science Report 8604). University of California, San Diego.

Jordan, M. I., and Rosenbaum, D. A. (1988). Action (Technical Report 88-26). Department of Computer Science, University of Massachusetts at Amherst.

Kelso, J. A. S., Saltzman, E., and Tuller, B. (1986). The dynamical theory of speech production: data and theory. Journal of Phonetics, 14, 29-60.

King, J., and Just, M. A. (1991). Individual differences in syntactic processing: the role of working memory. Journal of Memory and Language, 30, 580-602.

Langacker, R. W. (1987). Foundations of cognitive grammar: theoretical perspectives, Vol. 1. Stanford, CA: Stanford University Press.

MacNeilage, P. F. (1970). Motor control of serial ordering of speech. Psychological Review, 77, 182-196.

Marslen-Wilson, W. D. (1980). Speech understanding as a psychological process. In J. C. Simon (Ed.), Spoken language understanding and generation. Dordrecht, Netherlands: Reidel.

McClelland, J. L., and Elman, J. L. (1986). The TRACE model of speech perception. Cognitive Psychology, 18, 1-86.

McClelland, J. L., and Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: part 1. An account of basic findings. Psychological Review, 88, 365-407.

Miller, G. A., and Chomsky, N. (1963). Finitary models of language users. In R. D. Luce, R. R. Bush, and E. Galanter (Eds.), Handbook of mathematical psychology. Vol. 2. New York: Wiley.
Miller, G., and Isard, S. (1964). Free recall of self-embedded English sentences. Information and Control, 7, 292-303.
Morton, J. (1979). Word recognition. In J. Morton and J. C. Marshall (Eds.), Psycholinguistics: structures and processes. Cambridge, MA: MIT Press.
Mozer, M. C. (1989). A focused back-propagation algorithm for temporal pattern recognition. Complex Systems, 3, 49-81.
Newell, A. (1980). Physical symbol systems. Cognitive Science, 4, 135-183.
Nolfi, S., Elman, J. L., and Parisi, D. (in press). Learning and evolution in neural networks. Adaptive Behavior.
Pearlmutter, B. A. (1989). Learning state space trajectories in recurrent neural networks. In Proceedings of the International Joint Conference on Neural Networks (II, 365). Washington, DC.
Pierrehumbert, J. B., and Pierrehumbert, R. T. (1990). On attributing grammars to dynamical systems. Journal of Phonetics, 18, 465-477.
Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7, 227-252.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland (Eds.), Parallel distributed processing: explorations in the microstructure of cognition. Vol. 1. Cambridge, MA: MIT Press.
Selfridge, O. G. (1958). Pandemonium: a paradigm for learning. In Mechanisation of thought processes: proceedings of a symposium held at the National Physical Laboratory, November 1958. London: HMSO.
Siegelmann, H. T., and Sontag, E. D. (1992). Neural networks with real weights: analog computational complexity (Report SYCON 92-05). Rutgers Center for Systems and Control, Rutgers University, New Brunswick, NJ.
St. John, M. F. (1992). The story gestalt: a model of knowledge-intensive processes in text comprehension. Cognitive Science, 16, 271-306.
Taraban, R., and McClelland, J. L. (1990). Constituent attachment and thematic role expectations. Journal of Memory and Language, 27, 597-632.
Trueswell, J. C., Tanenhaus, M. K., and Kello, C. (1992). Verb-specific constraints in sentence processing: separating effects of lexical preference from garden paths. Journal of Experimental Psychology: Learning, Memory, and Cognition.
Tyler, L. K., and Frauenfelder, U. H. (1987). The process of spoken word recognition: an introduction. In U. H. Frauenfelder and L. K. Tyler (Eds.), Spoken word recognition. Cambridge, MA: MIT Press.
Van Gelder, T. (1990). Compositionality: a connectionist variation on a classical theme. Cognitive Science, 14, 355-384.
Weckerly, J., and Elman, J. L. (1992). A PDP approach to processing center-embedded sentences. In Proceedings of the 14th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Wiles, J., and Bloesch, A. (1992). Operators and curried functions: training and analysis of simple recurrent networks. In J. E. Moody, S. J. Hanson, and R. P. Lippman (Eds.), Advances in neural information processing systems. San Mateo, CA: Morgan Kaufmann.

Guide to Further Reading

A good collection of recent work in connectionist models of language may be found in Sharkey (in press). The initial description of simple recurrent networks appears in Elman (1990). Additional studies with SRNs are reported in Cleeremans, Servan-Schreiber, and McClelland (1989) and Elman (1991). An important discussion of the recurrent-network-based dynamical systems approach to language is found in Pollack (1991).

Cleeremans, A., Servan-Schreiber, D., and McClelland, J. L. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1, 372-381.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.
Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195-225.
Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7, 227-252.
Sharkey, N. (Ed.). (in press). Connectionist natural language processing: readings from Connection Science. Oxford, England: Intellect.