Connectionist Models and Linguistic Theory
8/11/2019 Connectionist Models and Linguistic Theory
1/50
COGNITIVE SCIENCE 18, 1-50 (1994)

Connectionist Models and Linguistic Theory:
Investigations of Stress Systems in Language

PRAHLAD GUPTA
DAVID S. TOURETZKY
Carnegie Mellon University
We question the widespread assumption that linguistic theory should guide the formulation of mechanistic accounts of human language processing. We develop a pseudo-linguistic theory for the domain of linguistic stress, based on observation of the learning behavior of a perceptron exposed to a variety of stress patterns. There are significant similarities between our analysis of perceptron stress learning and metrical phonology, the linguistic theory of human stress. Both approaches attempt to identify salient characteristics of the stress systems under examination without reference to the workings of the underlying processor. Our theory and computer simulations exhibit some strikingly suggestive correspondences with metrical theory. We show, however, that our high-level pseudo-linguistic account bears no causal relation to processing in the perceptron, and provides little insight into the nature of this processing. Because of the persuasive similarities between the nature of our theory and linguistic theorizing, we suggest that linguistic theory may be in much the same position. Contrary to the usual assumption, it may not provide useful guidance in attempts to identify processing mechanisms underlying human language.
The fundamental questions to be answered about language are, as Noam Chomsky has pointed out: (1) What is the system of knowledge in the mind/brain of the speaker of English or Spanish or Japanese? (2) How does this knowledge come into being? (3) How is this knowledge used? and (4) What are the physical mechanisms serving as the basis for this system of knowledge, and for its use? (Chomsky, 1988, p. 3).
We would like to thank Deirdre Wheeler for introducing us to the literature on stress and providing guidance in the early stages of this work. We also thank Gary Cottrell, Elan Dresher, Jeff Elman, Dan Everett, Michael Gasser, Brian MacWhinney, Jay McClelland, Eric Nyberg, Brad Pritchett, Steve Small, Deirdre Wheeler, and an anonymous reviewer for helpful comments on earlier versions of this paper, and David Evans for access to computing facilities at Carnegie Mellon's Laboratory for Computational Linguistics. Of course, none of them is responsible for any errors in this work, nor do they necessarily agree with the opinions expressed here.
The second author was supported by a grant from Hughes Aircraft Corporation, and by the Office of Naval Research under contract number N00014-86-K-0678.
Correspondence and requests for reprints should be sent to Prahlad Gupta, Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15213.
Linguistic theory is viewed, in the Chomskyan tradition, as illuminating the nature of mind, and as guiding the search for lower-level processing. Thus, according to Chomsky, to the extent that the linguist can provide answers to the first three questions above, the brain scientist can begin to explore answers to the fourth question, that is, can begin to explore the physical mechanisms that exhibit the properties revealed in the linguist's abstract theory. In the absence of answers to the first three questions, brain scientists do not know what they are searching for (Chomsky, 1988, p. 6).
Chomsky distinguishes three levels of analysis in linguistic inquiry: observation, description, and explanation; these are the tasks for linguistics (Chomsky, 1988, pp. 60-62). Explanation is via the formulation of Universal Grammar, which is viewed as providing a genuine explanation of the phenomena (Chomsky, 1988, p. 62). Universal Grammar reveals the principles of the mind (Chomsky, 1988, p. 91), and thus constitutes the abstract theory that is needed to guide the search for actual brain mechanisms.
This view can be stated more generally as the claim that the abstract formulations of linguistic theory are the appropriate framework within which to organize the search for the lower-level processing mechanisms of language. Our aim in this paper is to draw attention to the fact that this is an assumption, and one with important consequences. To the extent that this claim is true, it suggests that formulation of accounts of language processing, as well as the search for the neural substrates of language, should be guided by the constructs of linguistic theory. To the extent that this claim is false, the use of linguistic constructs as the basis for processing accounts and neural investigation may be misguided, and misleading.
Our goal, then, is to examine whether it is necessarily true that the constructs of linguistic theory play a valuable role in formulation of accounts of lower-level processing. In this paper, we will develop a pseudo-linguistic theory for the domain of linguistic stress, based on observation of the learning behavior of a perceptron exposed to a variety of stress patterns. We will then compare this theory with metrical phonology, the actual linguistic theory of stress, showing how our theory exhibits many of the kinds of regularities and predictions of metrical theory. Thus, our theory will be analogous, in important ways, to metrical theory. It predicts, for example, that certain stress systems will be unlearnable. We will then analyze the behavior of the underlying perceptron model in terms of its actual processing. This will show that our theoretical predictions of non-learnability bear no relation to the low-level, causal explanation we develop as part of the processing account.
We call our theory pseudo-linguistic only because it is based on data generated by the perceptron rather than humans.
In short, we will demonstrate that a lower-level account may not be isomorphic (or even similar) to a higher-level account, and that under these circumstances, the higher-level account may provide little insight into the underlying phenomena. Furthermore, since the nature and the mode of development of our higher-level account are closely analogous to the nature and development of bodies of linguistic theory, this constitutes a strong argument against assuming that the high-level accounts of linguistic theory will necessarily illuminate lower-level processing mechanisms, and against assuming that linguistic theory is the appropriate conceptual framework within which to organize the search for these mechanisms. Such arguments have of course been made before (Pylyshyn, 1973; Rumelhart & McClelland, 1987; Smolensky, 1988). However, the present work substantiates the argument with a concrete demonstration of how an analytical framework that bears a close resemblance to linguistic theory can fail to have an analogue in the underlying processing mechanisms.
The organization of the paper is as follows. The first section provides some background information about the domain of linguistic stress, while the second section describes our perceptron simulations of stress learning. The third section develops our pseudo-linguistic theory, and discusses correspondences between aspects of the perceptron simulations, our theory, and metrical theory. In the fourth section, we show that our theory in reality offers only an abstract account that is unrelated to low-level processing characteristics. Having shown that a high-level account need not illuminate underlying processing, we turn in the fifth section to a consideration of some other reasons why the analyses of metrical theory might not be appropriate primitives for a processing model. In view of all this, we conclude that the Chomskyan assumption that linguistic theory should guide investigation of processing mechanisms may be unwarranted.
1. BACKGROUND: STRESS SYSTEMS IN LANGUAGE
1.1. Motivation for Choice of Domain
In order to examine a subsystem of linguistic theory in terms of its implications for lower-level processing, a prerequisite, of course, is that the theory be clearly specified. According to Dresher and Kaye (1990), stress systems are an attractive domain for investigation because: (a) the linguistic theory is well-developed, so that compared with syntax, there is a relatively complete description of the observed phenomena, and (b) stress systems can be studied relatively independently of other aspects of language (Dresher & Kaye, 1990, p. 138). We therefore chose to construct our pseudo-linguistic theory with respect to linguistic stress, since it is regarded as a well-defined domain for which theoretical analyses provide good coverage.
1.2. Evolution of the Linguistic Theory
The analysis of stress has evolved through a number of phases:

1. Linear analyses presented stress as a phonemic feature of individual vowels, with different levels of stress representing different levels of absolute prominence. This approach was taken in Trager & Smith (1951), and culminated in Chomsky and Halle's seminal The Sound Pattern of English (1968).
2. Metrical theory, as developed by Liberman & Prince (Liberman, 1975; Liberman & Prince, 1977), introduced both a non-linear analysis of stress patterns (in terms of metrical trees), and the treatment of stress as a relative property rather than an absolute one; however, the stress feature was retained in the analysis.
3. In subsequent developments (Prince, 1976; Selkirk, 1980), reliance on this feature was eliminated by incorporation of the idea that subtrees of metrical trees had an independent status (metrical feet), so that stress assignment rules could make reference to them.
4. The positing of internal structure for syllables (McCarthy, 1979a; McCarthy, 1979b; Vergnaud & Halle, 1978) provided a means of distinguishing light and heavy syllables, a distinction to which stress patterns are widely sensitive, but which had been problematic under previous analyses.
5. An analysis of metrical tree geometries (Hayes, 1980) provided an account of many aspects of stress systems in terms of a small number of parameters.
Through the development of metrical theory, there has been debate over whether the autosegmental representations for stress are metrical trees only (Hayes, 1980), metrical grids only (Prince, 1983; Selkirk, 1984), or some combination of the two (Halle & Vergnaud, 1987a; Halle & Vergnaud, 1987b; Hayes, 1984a; Hayes, 1984b; Liberman, 1975; Liberman & Prince, 1977).
1.3. Syllable Structure
A syllable is analyzed as being comprised of an onset, which contains the material before the vowel, and a rime. The rime is comprised of a nucleus, which contains the vocalic material, and a coda, which contains any remaining (non-vocalic) material. For further discussion, see Kaye (1989, pp. 54-58).
A syllable may be open (it ends in a vowel) or closed (it ends in a consonant). In terms of syllable structure, an open syllable has a non-branching rime (the rime has a nucleus, but not a coda), and a closed syllable has a branching rime (the rime has both a nucleus and a coda).
In many languages, stress tends to be placed on certain kinds of syllables rather than on others; the former are termed heavy syllables, and the latter light syllables. What counts as a heavy or a light syllable may differ across languages in which such a distinction is present, but, most commonly, a heavy syllable is one that can be characterized as having a branching rime, and a light syllable can be characterized as having a non-branching rime (Goldsmith, 1990, p. 113). Languages that involve such a distinction between heavy and light syllables (i.e., between the weights of syllables) are termed quantity-sensitive, and languages that do not, quantity-insensitive. Note that in quantity-insensitive languages syllables can occur both with and without branching rimes, but the distinction between these kinds of syllables has no relevance for the placement of stress.
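As an illustrative sketch of the rime-based weight distinction just described (the helper names are ours, not the paper's):

```python
# Illustrative sketch (hypothetical helper names): a rime is a nucleus plus an
# optional coda. A syllable whose rime branches (i.e., has a coda) is closed;
# under the most common convention, a branching rime makes a syllable heavy.

def make_rime(nucleus, coda=""):
    """Represent a rime as a dict with a vocalic nucleus and optional coda."""
    return {"nucleus": nucleus, "coda": coda}

def is_branching(rime):
    """A rime branches iff it has a coda in addition to its nucleus."""
    return bool(rime["coda"])

def syllable_weight(rime, quantity_sensitive=True):
    """Heavy iff the rime branches; in a quantity-insensitive language the
    structural distinction exists but is irrelevant to stress placement."""
    if not quantity_sensitive:
        return "irrelevant to stress"
    return "heavy" if is_branching(rime) else "light"
```

For example, a closed syllable like "pan" (rime "an") comes out heavy, while an open syllable like "pa" (rime "a") comes out light.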
1.4. Metrical Phonology
There seems to be theoretical agreement that stress patterns are sensitive to information about syllable structure, and in particular, to the structure of the syllable rime, and not the syllable onset. We follow this assumption. Thus rime structure is taken to be the basic level at which accounts of stress systems are formulated. (For an overview of metrical theory, see Goldsmith (1990, Chapter 4), Kaye (1989, pp. 139-145), van der Hulst & Smith (1982), or Dresher & Kaye (1990, pp. 1-8).) Stress patterns are controlled by metrical structures built on top of rime structures. The version of metrical structure adopted here is metrical feet. We assume the parameters formulated by Dresher and Kaye (1990, p. 142):

(P1) The word-tree is strong on the [Left/Right]
(P2) Feet are [Binary/Unbounded]
(P3) Feet are built from the [Left/Right]
(P4) Feet are strong on the [Left/Right]
(P5) Feet are Quantity-Sensitive (QS) [Yes/No]
(P6) Feet are QS to the [Rime/Nucleus]
(P7) A strong branch of a foot must itself branch [No/Yes]
(P8) There is an extrametrical syllable [Yes/No]
(P9) It is extrametrical on the [Left/Right]
(P10) A weak foot is defooted in clash [No/Yes]
(P11) Feet are non-iterative [No/Yes]
As an example of the application of these parameters, consider the stress pattern of Maranungku, in which primary stress falls on the first syllable of the word and secondary stress on alternate succeeding syllables. Figure 1
See, for example, Dresher & Kaye (1990, p. 141) or Goldsmith (1990, p. 170). Note, however, that some researchers have presented evidence that onsets may in fact be relevant to the placement of stress (Davis, 1988; Everett & Everett, 1984).
Figure 1. Metrical structures (feet and word-tree) for a six-syllable word in Maranungku.
shows an abstract representation of a six-syllable word, with each syllable represented as σ. The assignment of stress is characterized as follows. Binary, quantity-insensitive, left-dominant feet are constructed iteratively from the left edge of the word. Each foot has a "strong" and a "weak" branch (labeled S and W, respectively, in the figure). The strong, or dominant, branch assigns stress to the syllable it dominates. Since the feet are left-dominant, odd-numbered syllables are assigned stress. Over the roots of these metrical feet, a left-dominant word-tree is constructed, which assigns stress to the structure dominated by its leftmost branch. The third and fifth syllables are each dominated by the dominant branch of one metrical structure (a foot), while the first syllable is dominated by the dominant branches of two structures (a foot, and the word-tree). Even-numbered syllables are dominated only by nondominant branches of feet. The result is that even-numbered syllables receive no stress; the third and fifth syllables receive one degree of stress (secondary stress); and the first syllable receives two degrees of stress (primary stress). The parameter settings characterizing Maranungku are [P1 Left], [P2 Binary], [P3 Left], [P4 Left], [P5 No], [P7 No], [P8 No], [P10 No], [P11 No]. Parameters P6 and P9 do not apply because of the settings of parameters P5 and P8, respectively.
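The Maranungku derivation just described can be mimicked by a small procedural sketch (ours, not the paper's; it reproduces the outcome of foot and word-tree construction rather than building the trees themselves):

```python
def maranungku_stress(n):
    """Stress contour for an n-syllable word under the Maranungku pattern:
    2 = primary, 1 = secondary, 0 = unstressed. Binary left-dominant feet
    built from the left edge stress odd-numbered syllables; the left-dominant
    word-tree promotes the leftmost foot's strong syllable to primary."""
    contour = []
    for i in range(n):            # i = 0 is the first syllable
        if i % 2 != 0:
            contour.append(0)     # weak branch of a foot: no stress
        elif i == 0:
            contour.append(2)     # strong in its foot AND in the word-tree
        else:
            contour.append(1)     # strong in its foot only
    return contour
```

For the six-syllable word of Figure 1, this yields primary stress on the first syllable and secondary stress on the third and fifth, matching the description above.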
1.5. Principles and Parameters
Metrical theory illustrates the principles and parameters approach to language, one of whose central hypotheses is that language learning proceeds through the discovery of appropriate parameter settings. Every possible human language can be characterized in terms of parameter settings; once these settings are determined, the nature of structure-sensitive operations and the structures on which they operate is known, so that the details of language processing are automatically determined (at an abstract level). Subsequently, the assignment of stress in the actual production or processing of language is assumed to involve neural processes that correspond quite directly with the abstract process of application of these parameter settings as guidelines to the construction and manipulation of metrical feet.
1.6. Previous Computational Models of Stress Learning
Computational models of stress systems in language have been developed by Dresher & Kaye (1990) and by Nyberg (1990, 1992). The focus of these models is on the learning of the parameters specified by metrical theory; they therefore take as a starting point the constructs of that theory, and incorporate its assumptions. What they add to the linguistic theory is what Dresher & Kaye term a learning theory: a specification of how the data the language learner encounters in its environment are to be used to set parameters. The following features characterize these models:

1. They assume the existence of processes explicitly corresponding to the linguistic notion of parameter setting.
2. They propose a learning theory as an account of that parameter-setting process.
3. They assume that the process of production (i.e., of producing appropriate stress contours for input words after learning has occurred) involves explicit representational structures and structure-sensitive operations directly corresponding to metrical-theoretic trees and operations on those trees.
4. They assume no necessary relationship between the processing mechanisms involved in learning vs. those involved in production. That is, learning is accomplished by supplying values for the parameters defined by metrical theory; these values then form a knowledge base for stress assignment, whose processing involves, for example, the construction of binary trees from right to left, an operation having no necessary correspondence with those by which the parameter values were acquired.

The above models exemplify the assumption we are interested in examining in the present work: that abstract linguistic formulations are the appropriate starting point for processing models.
Processing in the perceptron will differ from the Dresher & Kaye and Nyberg models with regard to the above-noted characteristics:

1. There is no explicit incorporation of parameters in the perceptron model.
2. The learning theory employed consists of one of the general learning algorithms common in connectionist modeling, and is not an account of parameter-setting.
3. The process of production does not involve explicitly structured representations in the classical sense.
4. The processing mechanisms and structures involved in production are essentially the same as those involved in learning.

Thus, to the extent that we are successful in constructing an abstract, pseudo-linguistic theory that predicts behavior of the perceptron model, it will not be merely by virtue of having built in the constructs of metrical theory to begin with.
2. OVERVIEW OF MODEL AND SIMULATIONS
2.1. Nineteen Stress Systems
Nine quantity-insensitive (QI) languages and ten quantity-sensitive (QS) languages were examined in our experiments. The data, summarized in Table 1, were taken primarily from Hayes (1980). Note that the QI stress patterns of Latvian and French, Maranungku and Weri, Lakota and Polish, and Paiute and Warao are mirror images of each other. (The QS stress patterns of Malayalam and Yapese, Ossetic and Rotuman, Eastern Permyak Komi and Eastern Cheremis, and Khalkha Mongolian and Aguacatec Mayan are also mirror images.)
2.2. Structure of the Model
In separate experiments, we taught a perceptron to produce the stress pattern of each of the nineteen languages. The domain was limited to single words, as in the previous learning models of metrical phonology developed by Dresher and Kaye (1990) and Nyberg (1990). Again as in the other models, the effects of morpho-syntactic information such as lexical category were ignored, and the simplifying assumption was made that the only relevant information about syllables was their weight.
Two input representations were used. In the syllabic representation, used for QI patterns only, a syllable was represented as a [1, 1] vector, and [0, 0] represented no syllable. In the weight-string representation, which was necessary for QS languages, the input patterns used were [1, 0] for a light syllable, [0, 1] for a heavy syllable, and [0, 0] for no syllable. For stress systems with up to two levels of stress, the output targets used in training were 1.0 for primary stress, 0.5 for secondary stress, and 0 for no stress.
Descriptions somewhat more complex than ours have been reported for Polish (Halle & Vergnaud, 1987a, pp. 57-58) and Malayalam (Hayes, 1980, pp. 66, 109). However, this does not detract from our discussion in any way, as stress systems corresponding to our simplifications are reported to exist: Swahili (Halle & Clements, 1983, p. 17) and Gurkhali (Hayes, 1980, p. 66), corresponding to Polish and Malayalam, respectively.
TABLE 1
Stress Patterns of the Nineteen Languages Examined

[Table 1 is not recoverable from this transcription. Per the text, it summarizes the stress patterns of the nine QI and ten QS languages examined, with descriptions of primary and secondary stress placement and examples, taken primarily from Hayes (1980).]
Figure 2. Perceptron model used in simulations: a single output unit (the perceptron) over an input layer of 2 x 13 units, with buffer columns numbered -6 through +6.
For stress systems with three levels of stress, the output targets were 1.0 for primary stress, 0.6 for secondary, 0.35 for tertiary, and 0 for no stress. The input data set for all stress systems consisted of all word-forms of up to seven syllables. With the syllabic input representation there are seven of these, and with the weight-string input representation, there are 254 distinct patterns. The perceptron's input array was a buffer of 13 syllables; each word was processed one syllable at a time by sliding it through the buffer (see Figure 2). The desired output at each step was the stress level of the middle syllable of the buffer. Connection weights were adjusted at each step using the back-propagation learning algorithm (Rumelhart, Hinton, & Williams,
In practice, we used weight-string training sets in which there were an equal number of input patterns of each length. Thus, there was one instance of each of the 128 (= 2^7) seven-syllable patterns, and 64 instances of each of the two monosyllabic patterns. This length balancing was necessary for certain languages for successful training. We do not know how well it agrees with the actual frequency of forms in these languages, though as a general rule high-frequency forms are probably shorter.
Note that although the architecture of the model is two-layered, with a single output unit, as in a simple perceptron, we used the back-propagation algorithm (BP) rather than the Widrow-Hoff algorithm (WH) (Widrow & Hoff, 1960). BP adds to WH the scaling of the error for each output unit by a function (the derivative) of the activation of that output unit, and thus performs a more sensitive and example-tuned weight adjustment than WH. Note that BP and WH algorithms for two layers are guaranteed by the perceptron convergence procedure to be equivalent in terms of learning capabilities, for binary-valued outputs. However, outputs in the present simulations are not always binary; they sometimes take on intermediate values. As a result, the different computation of the error term in BP turned out to provide better learning than WH.
1986). One epoch consisted of one presentation of the entire training set. The network was trained for as many epochs as necessary to ensure that the stress value produced by the perceptron was within 0.1 of the target value, for each syllable of the word, for all words in the training set. A learning rate of 0.05 and momentum of 0.90 were used in all simulations. Initial weights were uniformly distributed random values in the range of +/- 0.5. Each simulation was run at least three times, and the learning times averaged.
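The input codes and training targets described above can be collected into a small sketch (the function names are ours; the vector codes and target values are those stated in the text):

```python
# Input codes from the text: syllabic representation marks any syllable as
# [1, 1]; the weight-string representation distinguishes light ([1, 0]) from
# heavy ([0, 1]); empty buffer positions are [0, 0].
SYLLABIC = {"S": [1, 1]}
WEIGHT_STRING = {"L": [1, 0], "H": [0, 1]}
NO_SYLLABLE = [0, 0]

def encode(word, codes):
    """Map each syllable symbol of `word` to its 2-unit input vector."""
    return [codes[s] for s in word]

def target(level, three_levels=False):
    """Training targets: 1.0 primary / 0.5 secondary / 0 none for systems
    with up to two stress levels; 1.0 / 0.6 / 0.35 / 0 for three levels."""
    two = {"primary": 1.0, "secondary": 0.5, "none": 0.0}
    three = {"primary": 1.0, "secondary": 0.6, "tertiary": 0.35, "none": 0.0}
    return (three if three_levels else two)[level]
```

A light-heavy-light word is thus presented as [[1, 0], [0, 1], [1, 0]], with one target value per syllable.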
2.3. Training the Perceptron
To enable precise description, let the input buffer be regarded as a 2 x 13 array as shown in Figure 2. Let the 7th, that is, center, column be numbered 0, let the columns to the left of the center be numbered negatively -1 through -6 going outwards from the center, and the columns to the right of the center be numbered positively +1 through +6, going outwards from the center.
As an example of processing, suppose that the next word in the training set for some quantity-insensitive language is the four-syllable pattern S S S S, and the associated stress contour (i.e., target) is [0 0 0 1], indicating that the final syllable receives stress, and all other syllables are unstressed. The first syllable enters the rightmost element of the input buffer (element +6). This is the first time step of processing. At the next time step, the first syllable (in the rightmost input buffer position, that is, element +6) is shifted left to buffer element +5, and the second syllable enters the input buffer in element +6.
After two further time steps, elements +3, +4, +5, and +6 of the input buffer contain the four syllables of the current word. At the next two time steps, the leftward flow of syllables through the input buffer continues, until at time step 6, the word's four syllables are in elements +1 through +4. At time step 7, the first syllable moves into element 0 of the input buffer, and the four syllables of the word occupy elements 0 through +3 of the buffer. At this time step, training of the network occurs for the first time: the perceptron is trained to associate the pattern in its input buffer (the four syllables of the word, in buffer elements 0 through +3) with the target stress level for the syllable currently in buffer element 0. At the next time step, the syllables of the word are shifted left again, and occupy buffer elements -1 through +2. The network is trained to associate this pattern with the target stress level for the second syllable, which is now in element 0. Similarly, at the next two time steps, the perceptron is trained to produce the stress levels appropriate for the third and fourth syllables of the word, as they come to occupy element 0 of the input buffer.
To simplify discussion here, each syllable is represented as an "S" token, rather than as an "H" or "L". For quantity-insensitive systems, this information suffices to determine placement of stress.
At this point, the buffer is flushed (all elements reset to zero), and one trial is over. The next word in the training set can now enter, beginning the next trial. Thus, the processing of one word, syllable by syllable, constitutes one trial. One pass through all the words in the training set constitutes an epoch.
It should be noted that sequential processing of syllables is not a necessary part of this model. Exactly the same effect can be obtained by a parallel scheme of 13 perceptron-like units whose weight vectors are tied together by weight-sharing (Hertz, Krogh, & Palmer, 1991, p. 140). All 13 units would learn in unison, and words could be processed in one parallel step.
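The trial procedure above, together with the learning parameters from Section 2.2 (learning rate 0.05, initial weights in +/- 0.5) and the derivative-scaled BP error term described in the footnote, can be sketched as follows. This is our own reconstruction under stated assumptions (a sigmoid output unit with a bias; momentum omitted for brevity), not the authors' code:

```python
import math
import random

BUFFER = 13   # columns of the 2 x 13 input array; column index 6 is "element 0"
LR = 0.05     # learning rate reported in the simulations

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_trial(w, syllables, targets, lr=LR):
    """One trial: slide the word leftward through the buffer, and at each step
    where syllable k occupies the center column, train the output unit on that
    syllable's target stress level. `w` is a flat weight list of length
    2*BUFFER + 1 (last entry is the bias). Returns the summed squared error."""
    err = 0.0
    for k in range(len(syllables)):      # syllable k is in the center column
        x = [0.0] * (2 * BUFFER)         # empty positions are [0, 0]
        for i, syl in enumerate(syllables):
            col = 6 - k + i              # word occupies columns 6-k ... onwards
            if 0 <= col < BUFFER:
                x[2 * col], x[2 * col + 1] = syl
        y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1])
        # BP error term: (target - output) scaled by the sigmoid derivative,
        # the "more sensitive" adjustment contrasted with Widrow-Hoff above.
        delta = (targets[k] - y) * y * (1.0 - y)
        for i in range(2 * BUFFER):
            w[i] += lr * delta * x[i]
        w[-1] += lr * delta              # bias update
        err += (targets[k] - y) ** 2
    return err
```

As in the text, no weight updates occur during the initial buffer-filling steps; training happens only once a syllable reaches the center column, so the loop ranges over syllables rather than raw time steps.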
3. PERCEPTRON STRESS LEARNING:
A PSEUDO-LINGUISTIC ACCOUNT
In this section, we analyze the characteristics of the various stress patterns to which the perceptron is exposed, and relate them to the perceptron's learning behavior, with a view to developing a pseudo-linguistic theory. In the same way that metrical phonology attempts to provide an account of linguistic stress in humans, our pseudo-linguistic theory will attempt to provide an account of the learning of stress in a perceptron.
We first present an analysis of factors affecting learnability for the QI stress systems. We develop an analytic scheme that enables us both to characterize the patterns and to predict the ease or difficulty of their learning. We then include QS systems and show that one analytical framework can take into account both QI and QS systems. This overall framework constitutes our pseudo-linguistic theory.
3.1. Learnability of QI Systems
We begin by noting that learning times differ considerably for {Latvian, French}, {Maranungku, Weri}, {Lakota, Polish} and Garawa, as shown in the last column of Table 2. Moreover, Paiute and Warao were unlearnable with this model.
These are learning times with the syllabic input representation.
A case could perhaps be made that monosyllables are special (Elan Dresher, personal communication, March 16, 1992), so that it might be plausible to assume that there are no stressed monosyllables in the data. We investigated this possibility in two ways. First, if monosyllables are treated as receiving no stress, then the stress patterns of Paiute and Warao become learnable within the current architecture: It takes 45 epochs for the network to learn the stress pattern of Paiute, and 36 epochs for Warao. Second, if monosyllables are removed from the data set, the learning times are 44 epochs for Paiute, and 46 epochs for Warao. Note also that both stress patterns were learnable, even with the stressed monosyllables, using either a two-layer architecture with two output units (targets were [0, 0] for no stress, [1, 0] for secondary stress, and [1, 1] for primary stress), or a three-layer architecture. We present the finding of nonlearnability within the present model in order to maintain a consistency of analysis in one model, and since the two-layer model with a single output unit facilitates analysis of connection weights and their contribution to processing.
TABLE 2
Prellminory Analysis of Learning Times for QI Stress Systems
IPS SCA Ah Mt's MSL QI LANGUAGES&F EPCXHS
(syllabic)
00 00 0 Latvian Ll 17
French L2 16
00 10 1 Maranungku L3 37
Weri L4 34
01 10 1 Garawa L5 165
10 00 0 Lakota L6 255
Polish L7
254
10 10 1 Paiute L8 **
Warao L9 **
Note. These learning times were obtained using the syllabic input representation.
IPS = Inconsistent Prlmory Stress; SCA = Stress Clash Avoidance; Alt = Alternotion ;
MP S = Mult iple Primary Stresses: ML = Mult iple Stress Levels. Symbols Ll-L9 refer to
Table 1.
Examination of the inherent features of these stress patterns suggests various factors as being relevant to learning:
Alternation of stresses (as opposed to a single stress) is suggested by the difference between learning times for {Latvian, French} and {Maranungku, Weri}, which also suggests that the number of stress levels may be relevant. Recall from Table 1 that, in Garawa, primary stress is placed on the first syllable, secondary stress on the penultimate syllable, and tertiary stress on alternate syllables preceding the penultimate, but that no stress appears on the second syllable. The primary, secondary, and tertiary stress patterns potentially lead to stress appearing on both the first and the second syllables; however, this is avoided (stress is never placed on the second syllable). This exemplifies the tendency in human languages to avoid the appearance of stress on adjacent syllables. The greater learning time for Garawa suggests that such stress clash avoidance is computationally expensive.
In languages such as Latvian, French, Maranungku, Weri, and Garawa, primary stress is always on a syllable at the edge of the word. In Lakota and Polish, for which learning times are substantially greater than those of the
CONNECTIONIST MODELS AND LINGUISTIC THEORY 15

other languages, primary stress is always at a non-edge syllable, except in mono- and disyllables. (Paiute and Warao are identical with respect to the placement of primary stress to Lakota and Polish, respectively, but are unlearnable.) Thus, placement of primary stress seems computationally relevant. In particular, it appears more difficult to learn patterns in which primary stress is assigned at the edges inconsistently.
To test these assumptions, and to determine what features of Paiute and Warao led to their nonlearnability, we constructed hypothetical stress patterns that exhibit more combinations of the above features than may actually exist. These stress patterns are described in Table 3. We trained networks on these variations in order to evaluate the effects of the various features. The following factors emerged as determinants of learnability for the range of QI patterns considered:
(I) Inconsistent Primary Stress (IPS): it is computationally expensive to learn the pattern if neither edge receives primary stress except in mono- and disyllables; this can be regarded as an index of computational complexity that takes the values {0, 1}: 1 if an edge receives primary stress inconsistently, and 0 otherwise.

(II) Stress Clash Avoidance (SCA): if the components of a stress pattern can potentially lead to stress clash, then the language may either actually permit such stress clash, or it may avoid it. This index takes the values {0, 1}: 0 if stress clash is permitted, and 1 if stress clash is avoided.

(III) Alternation (Alt): an index of learnability with value 0 if there is no alternation, and value 1 if there is. Alternation means a pattern of some kind that repeats on alternate syllables.

(IV) Multiple Primary Stresses (MPS): has value 0 if there is exactly one primary stress, and value 1 if there is more than one primary stress. It has been assumed that a repeating pattern of primary stresses will be on alternate, rather than adjacent, syllables. Thus, [Alternation = 0] implies [MPS = 0]. The hypothetical stress patterns examined include some with more than one primary stress; however, as far as is known, no actually occurring QI stress pattern has more than one primary stress.

(V) Multiple Stress Levels (MSL): has value 0 if there is a single level of stress (primary stress only), and value 1 otherwise.
It is possible to order these factors with respect to each other to form a five-digit binary string characterizing the ease/difficulty of learning. That is, the computational complexity of learning a stress pattern can be characterized as a 5-bit binary number whose bits represent the five factors above, in decreasing order of significance. Table 2 shows that this characterization captures the learning times of the QI patterns quite accurately. As an example
TABLE 3
Descriptions of Hypothetical QI Stress Patterns

[The body of this table is not legible in this copy. It gives, for each hypothetical pattern (h1-h30), a reference label, a name, and a description of the stress pattern; the hypothetical patterns are variants of the Latvian, French, Maranungku, Weri, Garawa, and Lakota patterns named in Table 4.]
of how to read Table 2, note that Garawa takes longer to learn than Latvian (165 vs. 17 epochs). This is reflected in the parameter setting for Garawa, 01101, being lexicographically greater than that for Latvian, 00000.
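The complexity characterization can be written out as a small function (a sketch under our own naming assumptions; the lexicographic comparison of bit strings is exactly the ordering the text describes):

```python
def complexity(ips, sca, alt, mps, msl):
    """5-bit complexity string, most significant factor first:
    IPS, SCA, Alt, MPS, MSL (each 0 or 1)."""
    return f"{ips}{sca}{alt}{mps}{msl}"

garawa = complexity(0, 1, 1, 0, 1)    # "01101"
latvian = complexity(0, 0, 0, 0, 0)   # "00000"

# Lexicographic comparison of the strings predicts that Garawa
# (165 epochs) should take longer to learn than Latvian (17 epochs).
assert garawa > latvian
```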
The analysis of learnability is summarized in Table 4 for all the QI stress patterns, both actual and hypothetical. It can be seen that the 5-bit characterization fits the learning times of various actual and hypothetical patterns reasonably well; there are, however, exceptions, indicating that this 5-bit characterization is only a heuristic. For example, the hypothetical stress patterns with reference numbers h21 through h25 have a higher 5-bit characterization than some other stress patterns, but lower learning times.
The effect of stress clash avoidance is seen in consistent learning time differentials between stress patterns of complexity less than or greater than binary 1000. Learning times with complexity 0001 are in the range 10 to 25 epochs, while complexity 1001 patterns are of the order of 170 epochs; complexity 0010 is of the order of 30 epochs, and 1010, 190 epochs, for patterns the only difference between which is the absence/presence of stress clash avoidance (Latvian2edge and Latvian2edgeSCA, references h5 and h19 respectively). A pattern with complexity 0011 (Latvian2edge2stress, reference h6) has a learning time of 37 epochs, while a pattern differing only in the addition of SCA (Latvian2edge2stressSCA, reference h20) takes 206 epochs. Complexity 0101 patterns are in the range 30 to 60 epochs, while complexity 1101 patterns are in the range 70 to 170 epochs; in particular, while Garawa (reference L5) has a learning time of 165 epochs, the same pattern without SCA has a learning time of 36 epochs (Garawa-SC, reference h10). A stress pattern of complexity 0111 takes 65 epochs to learn (Latvian2edge2stress-1alt, reference h16), while addition of stress clash avoidance results in a learning time of 129 epochs (Latvian2edge2stress-1alt-SCA, reference h25).

The effect of alternation is seen in a contrast between learning times for patterns of complexity 001 (range: 10 to 25 epochs) and 101 (range: 30 to 60 epochs); 010 and 110 (30 epochs vs. a range of 60 to 90 epochs); 011 and 111 (37 vs. 65 epochs).

The effect of multiple primary stresses is seen in the contrast between the stress patterns: Latvian2stress (reference h1, complexity 001, 21 epochs) and Latvian2edge2stress (reference h6, complexity 011, 37 epochs); Latvian2edge2stress-alt (reference h9, complexity 101, 56 epochs) and Latvian2edge2stress-1alt (reference h16, complexity 111, 65 epochs).

The effect of inconsistent primary stress is considerable: stress patterns with the most significant bit 1 are learnable in the perceptron model only if all the other bits are 0; such patterns (Lakota, Polish, references L6, L7, complexity 10000, 255 epochs) have a higher learning time than any of the patterns with most significant bit 0. All the examined stress patterns of complexity greater than 10000 were unlearnable in the perceptron model. Recall that Paiute and Warao were unlearnable; the present framework is consistent with that result, since under the present analysis, these two patterns have complexity 10101. Recall also from footnote 6 that if we assume no stressed monosyllables in Paiute and Warao, then these stress patterns become learnable, in approximately 35-45 epochs. Under these circumstances, the complexity index of the two stress systems is 00101. As can be seen from Table 2 and Table 4, a learning time of approximately 40 epochs is completely consistent with the learning times obtained for other stress patterns with this index. That is, our scheme for indexing complexity is consistent with the results of Paiute and Warao, whether or not there are assumed to be stressed monosyllables.
TABLE 4
Analysis of Quantity-Insensitive Learning

IPS SCA Alt MPS MSL   Language                             Ref   Epochs (syllabic)
 0   0   0   0   0    Latvian                              L1    17
                      French                               L2    16
 0   0   0   0   1    Latvian2stress                       h1    21
                      Latvian3stress                       h2    11
                      French2stress                        h3    23
                      French3stress                        h4    14
 0   0   0   1   0    Latvian2edge                         h5    30
 0   0   0   1   1    Latvian2edge2stress                  h6    37
 0   0   1   0   0    (impossible)
 0   0   1   0   1    Maranungku                           L3    37
                      Weri                                 L4    34
                      Paiute (unstressed monosyllables)    L8    45
                      Warao (unstressed monosyllables)     L9    36
                      Maranungku2stress                    h7    43
                      Weri3stress                          h8    41
                      Latvian2edge2stress-alt              h9    53
 0   0   1   1   0    Garawa-SC                            h10   33
                      Garawa2stressSC                      h11   50
                      Maranungku1stress                    h12   61
                      Weri1stress                          h13   65
                      Latvian2edge-alt                     h14   70
                      Garawa1stress-SC                     h15   33
 0   0   1   1   1    Latvian2edge2stress-1alt             h16   85
 0   1   0   0   0    (impossible)
 0   1   0   0   1    Garawa-non-alt                       h17   164
                      Latvian3stress2edgeSCA               h18   163
 0   1   0   1   0    Latvian2edgeSCA                      h19   194
 0   1   0   1   1    Latvian2edge2stressSCA               h20   206
 0   1   1   0   1    Garawa                               L5    165
                      Garawa2stress                        h21   71
 0   1   1   1   0    Latvian2edge2stress-alt-SCA          h22   91
                      Garawa1stress                        h23   121
                      Latvian2edge-alt-SCA                 h24   126
 0   1   1   1   1    Latvian2edge2stress-1alt-SCA         h25   129
 1   0   0   0   0    Lakota                               L6    255
                      Polish                               L7    254
 1   0   0   0   1    Lakota2stress                        h26   **
 1   0   0   1   0    Lakota2edge                          h27   **
 1   0   0   1   1    Lakota2edge2stress                   h28   **
 1   0   1   0   1    Paiute                               L8    **
                      Warao                                L9    **
 1   0   1   1   0    Lakota-alt                           h29   **
 1   0   1   1   1    Lakota1stress-alt                    h30   **

Note. These learning times were obtained using the syllabic input representation. ** indicates a pattern that was not learnable. IPS = Inconsistent Primary Stress; SCA = Stress Clash Avoidance; Alt = Alternation; MPS = Multiple Primary Stresses; MSL = Multiple Stress Levels. Symbols L1-L9 refer to Table 1, and h1-h30 to Table 3.
The impact of multiple stress levels is relatively smaller, and less uniform, both of which motivate this factor's being treated as least significant. Thus, though there are several instances where a stress pattern with a greater number of stress levels has a higher learning time (h1 vs. L1; h3 vs. L2; h6 vs. h5; h7 vs. L3; h8 vs. L4; h21 vs. L5), there are also cases in which a stress pattern with a higher number of stress levels has a lower learning time than one with fewer stress levels (h2 vs. h1; h4 vs. h3; h11 vs. h15; h21 vs. h23).

The effect of a particular factor seems to be reduced when a higher-order bit has a nonzero value. Thus, the effects of alternation are less clear when there is stress clash avoidance: without SCA, the range of learning times for patterns without alternation is 10 to 40 epochs, and with alternation 30 to 90 epochs; but with SCA, the range without alternation is 160 to 210 epochs, and with alternation 70 to 170 epochs.

In summary, the complexity measure suggested here appears to identify a number of factors relevant to the learnability of QI stress patterns within a minimal connectionist architecture. It also assesses their relative impacts. The analysis is undoubtedly a simplification, but it provides a framework within which to relate the various learning results.
3.2. Incorporating QS Systems

We now turn to a consideration of quantity-sensitive (QS) stress systems. For QS patterns, information about syllable weight needs to be included in the input representation: the input has to consist of (encoded) sequences of H and L tokens. A purely syllabic input representation is, by definition of quantity-sensitivity, inadequate. Weight-string representations were therefore adopted, as discussed in Section 2.2. To maintain consistency of analysis across QI and QS stress patterns, simulations for the QI languages were re-run using the weight-string representation. Note that the stress patterns for all possible weight strings of length n are the same for a QI language.
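The enumeration of weight-string inputs can be sketched as follows (an illustration; the generator and its maximum-length parameter are our own, not the authors' code). Since there are 2^n weight strings of length n, this training set is much larger than the corresponding set of unweighted syllabic inputs:

```python
from itertools import product

def weight_strings(max_len):
    """All words of heavy (H) / light (L) syllables, up to max_len syllables."""
    for n in range(1, max_len + 1):
        for s in product("HL", repeat=n):
            yield "".join(s)

# For a QI language the target stress pattern is identical for every
# weight string of a given length; for a QS language it can differ.
inputs = list(weight_strings(3))   # 2 + 4 + 8 = 14 input strings
```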
Learning times are shown in Table 5 for simulations for all QI and QS stress systems. The section on the left shows QI learning times, while the section on the right shows QS learning times. Note that in this table, learning times for all languages are reported in terms of the weight-string representation rather than the unweighted syllabic representation used for the previous QI studies.

The differences in learning times across QI patterns are less marked than the differentials in Table 2. This is the result of the increased training set size with the weight-string representation as compared with the syllabic representation. However, learning times are completely consistent with the overall learnability analysis developed in the previous section, so that the analysis of QI systems is in no way affected. The reader may wish to compare the learning times for QI systems in Table 5 (weight-string input representations) with those in Table 2 (syllabic input representations).
In order to incorporate QS systems, the learnability analysis proposed in Section 3.1 on the basis of QI patterns turns out to require some refinement (although the analysis of QI systems themselves will remain unchanged; see below). Inconsistent Primary Stress (IPS) was previously hypothesized as taking binary values; a value of 1 for IPS indicated that primary stress was assigned inconsistently at the edge of words; a value of 0 indicated that this was not the case. If this measure is modified so that its value indicates the proportion of cases in which primary stress is not assigned at the edge of a word, the learning results for both QI and QS patterns can be integrated, to a large extent, into a unified account. Recall that learning times for QS systems are those shown in the right-hand section of Table 5.

The learning times for Malayalam and Yapese are approximately 20 epochs, while those for Ossetic and Rotuman are approximately 30 epochs. The difference between these pairs of stress patterns is: for Malayalam and Yapese, primary stress is placed at the end except when the edge vowel is short and the next vowel long (i.e., except 0.25 of the time); for Ossetic and Rotuman, primary stress falls at the edge except when the edge vowel is short, that is, except in 0.5 of the cases.

The five factors discussed earlier were: Inconsistent Primary Stress (IPS); Stress Clash Avoidance (SCA); Alternation (Alt); Multiple Primary Stresses (MPS); and Multiple Stress Levels (MSL). The values of these indices, respectively, for both Malayalam and Yapese, are [0.25 0 0 1 0], and for both Ossetic and Rotuman, [0.5 0 0 1 0]. The difference between learning times for these pairs of otherwise identical patterns can then be accounted for in terms of differing values of the IPS measure. Note that the earlier analysis of QI languages remains unchanged: stress patterns that had IPS value 0 still do, and those that had IPS value 1 still do as well.
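The refined IPS measure can be computed by enumerating weight contexts (a sketch; the two edge-stress rules below are our paraphrases of the descriptions in the text, and the function names are ours):

```python
from itertools import product

def ips(edge_gets_stress, n=2):
    """Proportion of length-n weight contexts in which primary stress
    is NOT assigned at the edge of the word."""
    contexts = ["".join(c) for c in product("HL", repeat=n)]
    return sum(not edge_gets_stress(c) for c in contexts) / len(contexts)

# Malayalam/Yapese: edge stressed except when the edge vowel is short (L)
# and the next vowel long (H).
malayalam = lambda c: not (c[0] == "L" and c[1] == "H")
# Ossetic/Rotuman: edge stressed except when the edge vowel is short.
ossetic = lambda c: c[0] == "H"

assert ips(malayalam) == 0.25
assert ips(ossetic) == 0.5
```

The two assertions reproduce the 0.25 and 0.5 proportions cited in the text.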
The learning times of Komi and Cheremis are substantially higher than those of Koya, Eskimo, Malayalam, Yapese, Ossetic, and Rotuman. Recall the stress pattern of Komi: stress the first heavy syllable, or the last syllable if there are no heavy syllables. That is, for Komi, a particular syllable S receives primary stress under the following conditions: (1) there are no heavy syllables to the left of S in the syllable string; and (2) S is heavy or S is the last syllable. The second clause of the conditional involves single-positional information: information either about the syllable S itself (S is heavy), or about the absence/presence of a syllable right-adjacent to S in the weight-string. (If there is no syllable to the right of S in the weight-string, then S is the last syllable; if there is a syllable right-adjacent to S, then S is not the last syllable.) The first clause of the conditional, however, involves aggregative information: information about all the syllables to the left of S in the weight-string. We see that in assigning stress to syllable strings in Komi, one must somehow "scan" the input string to extract this aggregative information; similarly for Cheremis.
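The two clauses of the Komi conditional can be written out directly (a sketch under our own naming; `weights` is a string of H/L tokens as in the weight-string representation):

```python
def komi_stressed(weights, i):
    """True iff syllable i receives primary stress in Komi:
    no heavy syllable to its left (aggregative, leftward scan),
    and it is heavy or word-final (single-positional)."""
    no_heavy_left = "H" not in weights[:i]                 # aggregative information
    heavy_or_last = weights[i] == "H" or i == len(weights) - 1
    return no_heavy_left and heavy_or_last

# The first heavy syllable is stressed; all-light words stress the last syllable.
assert [komi_stressed("LLHLH", i) for i in range(5)] == [False, False, True, False, False]
assert [komi_stressed("LLL", i) for i in range(3)] == [False, False, True]
```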
TABLE 5
Summary Analysis of QI and QS Languages

[The body of this table is not legible in this copy. It lists, for each QI and QS stress system, the complexity indices IPS, SCA, Alt, MPS, and MSL (with the refined proportional IPS measure and, for QS systems, the Aggregative Information index), together with learning times in epochs; QI learning times appear in the left-hand section and QS learning times in the right-hand section.]

Note. These learning times were obtained using the weight-string input representation. IPS = Inconsistent Primary Stress; SCA = Stress Clash Avoidance; Alt = Alternation; MPS = Multiple Primary Stresses; MSL = Multiple Stress Levels. Symbols refer to Table 1.
Komi and Cheremis can therefore be analyzed as stress patterns that require aggregative information for the determination of stress placement; none of the other stress patterns require such information. For example, for Koya, a syllable S should receive stress if it is the first syllable (which can be determined from information about the presence/absence of a syllable in the left-adjacent weight-string position), or if it is heavy, both of which are single-positional kinds of information. For Ossetic, a syllable S should be stressed if (a) it is the first syllable and it is heavy (which requires single-positional information about the left-adjacent weight-string position, and about S itself); or if (b) it is the second syllable (single-positional information about the weight-string element two positions to the left of S) and the syllable in the left-adjacent position is light (also single-positional information).

The difference in learning times between Komi and Cheremis on the one hand, and Koya, Eskimo, Malayalam, Yapese, Ossetic, and Rotuman, on the other, can now be analyzed in terms of differing informational requirements. Whether or not aggregative information is needed therefore seems to be a further factor relevant to the learnability of stress patterns.

We further need to consider the fact that the patterns of Mongolian and Mayan have very much higher learning times than those of any other stress patterns, including Komi and Cheremis. Recall the stress pattern of Mongolian: stress the first heavy syllable, or the first syllable if there are no heavy syllables. Notice how it differs from that of Komi, whose pattern is: stress the first heavy syllable, or the last syllable if there are no heavy syllables.

Thus, for Mongolian, if the current syllable S is heavy it should receive stress if it is the first heavy syllable. If the current syllable S is light, then it should receive stress only if (a) there is no syllable to its left in the weight-string (which indicates that it is the first syllable), and (b) there is no heavy syllable to its right in the weight-string. That is, for Mongolian, there is aggregative information required about heavy syllables both to the left of the current syllable and to its right. Information about heavy syllables to the left of S is relevant if S is heavy; information about heavy syllables to the right is relevant if S is light. This requirement for keeping track of dual kinds of aggregative information seems to be what makes the pattern so difficult to learn. This contrasts with Komi, for which aggregative information is required only about syllables to the left of S.

A parallel analysis can be made for Mayan. For both Mongolian and Mayan, then, the very high learning times result from the requirement for two kinds of aggregative information, in contrast with only one kind for Komi and Cheremis. Acquiring this information requires bidirectional scanning of the input string for Mongolian and Mayan, as compared with unidirectional scanning, as discussed above, for Komi and Cheremis.
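The contrast with Komi can be made concrete (a sketch under our own naming; note the rightward scan in the light-syllable clause, which the Komi rule does not need):

```python
def mongolian_stressed(weights, i):
    """True iff syllable i receives primary stress in Mongolian:
    the first heavy syllable, or the first syllable if no syllable is heavy."""
    if weights[i] == "H":
        return "H" not in weights[:i]               # leftward aggregative scan
    # Light syllable: must be word-initial, with no heavy syllable to its right.
    return i == 0 and "H" not in weights[i + 1:]    # rightward aggregative scan

# Same word as the Komi example: the first heavy syllable is stressed,
# but an all-light word stresses the FIRST syllable rather than the last.
assert [mongolian_stressed("LLHLH", i) for i in range(5)] == [False, False, True, False, False]
assert [mongolian_stressed("LLL", i) for i in range(3)] == [True, False, False]
```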
The results from Komi, Cheremis, Mongolian, and Mayan thus suggest an additional factor that is relevant for the determination of learnability, but that comes into play only in the case of QS patterns: how much scanning is required for aggregative information. This can be treated as a sixth index of computational complexity that takes on the values {0, 1, 2}. We therefore have the following factor, in addition to the five previously discussed:

(VI) Aggregative Information (Agg): has value 0 if no aggregative information is required (single-positional information suffices); value 1 if unidirectional scanning is required (Komi, Cheremis); and 2 if bidirectional scanning is required (Mongolian, Mayan).

With these modifications (viz., refinement of the IPS measure, and addition of the Aggregative measure), the same parameter scheme can be used for both the QI and QS language classes, with good learnability predictions within each class, as shown in Table 5.

We note, moreover, that learning times for the QS stress patterns of Koya and Eskimo fit right in with the analysis. The characterization of Koya in terms of our complexity measure is 000001, while that of Eskimo is 000010. Learning times for these systems, accordingly, fit in between the learning times for the QI systems of Latvian and French (complexity index 000000) and Maranungku and Weri (complexity index 000101).

As can also be seen, differences in learning times between QS stress patterns also fit in more generally with the analysis developed earlier, and with the analysis of single-positional vs. aggregative informational requirements developed in this section.

Thus, both the QI and QS results fall into a single analysis within this generalized parameter scheme and weight-string representation, although with a less perfect fit than the within-class results. For example, the QI stress patterns of Lakota and Polish have higher complexity indexes than the QS stress patterns of Malayalam, Yapese, Ossetic, and Rotuman, but lower learning times. Quantity-sensitivity thus appears to affect learning times, as seems reasonable to expect, due to the distribution of the weight-string training set (see discussion in Section 2.2). However, no measure of its effect will be offered here. The analytical framework developed thus far appears to hold within QI languages, and within QS languages; further analysis would be needed to relate learning results across the two kinds of stress patterns.

Finally, it is worth noting that learning times for these stress patterns have also been obtained with a three-layer architecture. Differences between learning times for different stress patterns were somewhat reduced. However, learning times were ordered exactly as in the two-layer model results shown in Table 5, suggesting that these learning results are robust
with regard to architectural differences. In the discussion throughout this paper, however, we focus on results from the two-layer model only, since the two-layer model with a single output unit facilitates analysis of connection weights and their contribution to processing.
3.3. Markedness and Learning Times: Correspondences with Metrical Theory

So far, we have used our perceptron learning results to devise an analytical framework for stress. It should be clear how our theory is analogous to conventional linguistic theory: it is an account of stress systems based on observation of the behavior of a perceptron exposed to those systems, in very much the same way that metrical theory is an account of linguistic stress, based on observations of what human beings learn. The stress pattern descriptions given to the perceptron are at the same level of abstraction as those that form the starting point for metrical theory. But since our account is based on learning time rather than surface forms and distributional data, we refer to it as a pseudo-linguistic theory.

In this section, we will discuss the ways in which our learning time results are in good agreement with some of the markedness predictions of metrical theory. That is, our results exhibit a correspondence with theoretical predictions, strengthening the analogy between our theory and metrical theory. As we will show, learning results such as these could also provide an additional source of data for choosing between theoretical alternatives.9

Within the dominant linguistic tradition, a universal grammar of stress should incorporate a theory of markedness, so as to predict which features of stress systems are at the core of the human language faculty and which are at the periphery. The distributional approach to markedness treats as unmarked those linguistic forms that occur more frequently in the world's languages. Another approach to markedness is learnability theory, which examines the logical process of language acquisition.10 Thus, for example, Dresher & Kaye take iteration to be the default or unmarked setting for parameter P11, because there is evidence that can cause revision of this default if it turns out to be the incorrect setting: the absence of any secondary stresses serves as a diagnostic that feet are not iterative (Dresher &

9 Nyberg's parameter-based model also provides such data, but in terms of how many examples of a stress pattern have to be presented to the stress learner (Nyberg, 1990).

10 As an example, the Subset Principle (Berwick, 1985; Wexler & Manzini, 1987) has implications for markedness. Suppose that two possible settings a and b for parameter P result in the learner respectively accepting sets Sa and Sb of linguistic forms. If Sa is a subset of Sb, then, once P has been set to value b, no positive evidence can ever re-set it to a, even if that was the correct setting. Unmarked values for parameters should therefore be the ones yielding the most constrained system.
TABLE 6
Learning Times for QI and QS Stress Patterns, Grouped by Theoretical Analysis

Group   Language             Characterization                                Epochs
1       Latvian, French      Word-tree, no feet                              2
2       Koya                 Word-tree, iterative unbounded QS feet          2
3       Eskimo               No word-tree, iterative unbounded QS feet       3
4       Maranungku, Weri     Word-tree, iterative binary QI feet             3
5       Lakota, Polish       Word-tree, non-iterative binary QI feet         10
6       Malayalam, Yapese    Word-tree, non-iterative binary QS feet,        19
                             dominant node branches
7       Ossetic, Rotuman     Word-tree, non-iterative binary QS feet         29
8       Komi, Cheremis       Word-tree, non-iterative unbounded QS feet      214
9       Mongolian, Mayan     Word-tree, non-iterative unbounded QS feet,     2302
                             dominant node branches

Note. These learning times were obtained using weight-string input representations for both QI and QS patterns. Each figure is the average learning time for the languages in the group.
Kaye, 1990, p. 191). If non-iteration were the default, their learning system might not encounter evidence that would enable it to correct this default setting, if it were in fact incorrect. It should be noted that, while this is a representative application of subset theory, the choice of default parameter values depends on the particular learning algorithm employed.

Table 6 shows the stress systems grouped by their theoretical analyses in terms of the parameter scheme discussed in Section 1.4. The last column of the table shows the average learning time in epochs for each group of stress patterns (these are the same learning times shown in Table 5). As can be seen, there appears to be a fairly systematic differentiation of learning times for groups of stress patterns with different clusters of parameter settings.
11 As noted in Section 2.2, weight-string representations are necessary for QS stress patterns. For QI systems, syllabic representations are sufficient. Of course, weight-string representations can be used for QI systems, although the information about syllable weight will be redundant. To obtain learning times that might reflect differences between the stress patterns themselves, rather than merely reflecting differing input representations and training set sizes, we ran simulations for both QS and QI patterns using weight-string input representations. The learning times in Table 6 are based on use of this weight-string representation for both QI and QS patterns.

12 The stress system of Garawa, as described in Hayes (1980, pp. N-55), cannot be characterized in terms of the parameter scheme adopted here. The stress systems of Paiute and Warao cannot be learned by a perceptron, as already discussed in Section 3.1, and as will be discussed in more detail in Section 4.2. (Recall, however, that they can be learned (a) using a two-layer architecture with two output units, or (b) a three-layer architecture with hidden units, or (c) within the present perceptron architecture, if monosyllabic stress is not included.) Learning results for these three stress systems have therefore been excluded from the present analysis, as either the parameterized characterization or learning times cannot be established.
Learning times appear to be significantly higher for stress systems in groups 5 through 9, which have non-iterative feet, than for those in groups 1 through 4, which either do not have metrical feet at all, or else have iterative feet. This makes the interesting prediction that non-iterative feet are more difficult to learn, and hence marked. This prediction corresponds with both Halle & Vergnaud's Exhaustivity Condition,13 and with the choice of marked and unmarked settings in Dresher & Kaye's parameter scheme (Parameter P11).14

Comparison of learning times for group 1 vis-a-vis groups 2, 3, and 4 also suggests that a stress system with only a word-tree (i.e., with no metrical feet) is easier to learn than one with (iterative) metrical feet.

The dramatic difference in learning times between groups 8 and 9 suggests that it is marked for the dominant node to be obligatorily branching.15 Group 8 differs from group 9 only in not having obligatory branching, and average learning times were 214 epochs vs. 2302 epochs. This prediction agrees with the distributional view that obligatory branching is relatively marked,16 but runs counter to Dresher & Kaye's choice of default values (parameter P7).17

However, comparison of group 6 with group 7 suggests that systems with obligatory branching may be more easily learned: group 6, with obligatory branching, has a learning time of 19 epochs, compared with group 7, without obligatory branching, which has a learning time of 29 epochs. This runs counter to the distributional argument, but agrees with the learnability view.

Two points are worth noting. First, it is interesting that where there is a conflict between the distributional and learnability theory predictions of markedness, there is also conflicting evidence from the perceptron simulation. Second, these conflicting perceptron results highlight the fact that it may be infeasible to analyze the effects of different settings for individual parameters; it may only be possible to make broader analyses of the effects of clusters of parameter settings. Strong interactions between parameters have also been observed in other computational learning models of metrical phonology (Eric Nyberg, personal communication, August, 1991).
¹³ "The rules of constituent boundary construction apply exhaustively…" (Halle & Vergnaud, 1987a, p. 15).
¹⁴ In Dresher & Kaye's model, iteration is the default or unmarked parameter setting because there is evidence that can cause revision of this default. The absence of any secondary stresses serves as a diagnostic that feet are not iterative (Dresher & Kaye, 1990, p. 191).
¹⁵ This means that the strong branch of a foot must dominate a heavy syllable, and cannot dominate a light one.
¹⁶ Thus, Hayes (1980, p. 113): "the maximally unmarked labeling convention is that which makes all dominant nodes strong … the convention that wins second place is: label dominant nodes as strong if and only if they branch."
¹⁷ Obligatory branching is the default because evidence (the presence of any stressed light syllables that do not receive stress from the word-tree) can force its revision (Dresher & Kaye, 1990, p. 193).
In view of the greater differential in learning times between Groups 8 and 9 than between Groups 6 and 7, we conclude that the effect of obligatory branching is to increase learning time. That is, we view our learning results as supporting the markedness of obligatory branching. This raises the interesting possibility that learning results such as those from the present perceptron simulations can provide a new source of insight into questions of markedness. The distributional view of the markedness of obligatory branching (Hayes, 1980, p. 113) seems to conflict with the learnability view (Dresher & Kaye, 1990, p. 193). The present simulations seem to agree with the distributional view, and could perhaps serve as a fresh source of evidence.
This potential contribution to theoretical analysis can be further illustrated for the stress systems of Lakota and Polish, which are mirror images. Recall that in Lakota, primary stress falls on the second syllable of the word. The analysis so far adopted for Lakota is that it has non-iterative binary right-dominant QI feet constructed from left to right, with a left-dominant word-tree¹⁸. Let us call this Analysis A. As illustrated in Figure 3, this leads to the construction of one binary right-dominant QI foot at the left edge of the word. This, together with the left-dominant word-tree, results in the assignment of primary stress to the second syllable. As has been shown, under this analysis the perceptron learning results support the markedness of non-iteration (recall the differing learning times of Groups 1 through 4 vs. Groups 5 through 9).
However, an alternative analysis is that Lakota has a left-dominant word-tree with no metrical feet, and the first syllable is extrametrical (Dresher & Kaye, 1990, p. 143). Let us call this Analysis B. As illustrated in Figure 3, the leftmost syllable is treated as "invisible" to the stress rules, and the word-tree assigns primary stress to the leftmost of the "visible" syllables. The result is that the second syllable receives primary stress. Under this analysis, Lakota and Polish (Group 5 in Table 6) differ from Latvian and French (Group 1 in Table 6) only in having an extrametrical syllable. The differing learning times for the two groups (1 epoch vs. 10 epochs) then suggest that extrametricality is marked. However, this runs counter to both the distributional view (Hayes, 1980, p. 82)¹⁹ and the learnability theory view (Dresher & Kaye, 1990, pp. 189, 191)²⁰.
¹⁸ This is based on Hayes' analysis of penultimate stress (Hayes, 1980, p. 55).
¹⁹ Hayes argues for the importance of the device of extrametricality to the theory because of its role in accounting for a variety of stress phenomena. Thus, extrametricality is unmarked in the sense previously referred to: that of being attested in a fair variety of languages.
²⁰ In Dresher & Kaye's formulation, the presence of stress on a leftmost or rightmost syllable can rule out extrametricality. However, there is no positive cue that unambiguously determines the presence of extrametricality. Hence, the default value for parameter P8 is that there is extrametricality (PE[yes]). If the default were PE[No], but this was an incorrect setting, there would be no cue that could lead to detection that this is incorrect.
[Figure 3 diagram: tree structures for Analysis A (a binary right-dominant foot under a left-dominant word-tree) and Analysis B (a left-dominant word-tree over the visible syllables, with the first syllable extrametrical).]

Figure 3. Two metrical analyses for a four-syllable word in Lakota. Strong branches are labeled S, and weak branches W. Analysis A: construction of one binary right-dominant QI foot at the left edge of the word, together with a left-dominant word-tree, results in the assignment of primary stress to the second syllable. Analysis B: the leftmost syllable is treated as "invisible" to the stress rules (extrametrical), and the word-tree assigns primary stress to the leftmost of the "visible" syllables. The result is that the second syllable receives primary stress.
To summarize, Analysis A views Lakota and Polish as having non-iterative feet, which both the distributional/theoretical and learnability approaches treat as marked. Analysis B views these stress patterns as having an extrametrical syllable, which both approaches treat as unmarked. So far, there is nothing theory-external to help choose between the analyses. Simulation results such as these provide such a means: since the learning results are consistent with the theoretical markedness of non-iteration, but not with the unmarkedness of extrametricality, they provide at least weak support for preferring Analysis A over Analysis B.
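The sense in which the two analyses are surface-equivalent can be sketched computationally. The following fragment is a toy rendering of our own, not the notation of either theory: the function names, zero-based syllable indices, and the restriction to words of at least two syllables are all assumptions made here for illustration.

```python
def analysis_a(n_syllables):
    """Analysis A: one binary right-dominant QI foot at the left edge,
    plus a left-dominant word-tree."""
    foot = (0, 1)                 # foot spans syllables 1-2; right node strong
    strong_syllable = foot[1]
    # The left-dominant word-tree makes the leftmost constituent (the foot)
    # strong, so primary stress falls on the foot's strong syllable.
    return [1 if i == strong_syllable else 0 for i in range(n_syllables)]

def analysis_b(n_syllables):
    """Analysis B: no feet; the first syllable is extrametrical, and a
    left-dominant word-tree stresses the leftmost visible syllable."""
    visible = range(1, n_syllables)   # syllable 1 (index 0) is invisible
    primary = min(visible)
    return [1 if i == primary else 0 for i in range(n_syllables)]

for n in (2, 3, 4, 5):
    assert analysis_a(n) == analysis_b(n)
print(analysis_a(4))   # second-syllable stress: [0, 1, 0, 0]
```

Both functions yield identical stress assignments for every word length, which is precisely why nothing in the surface data alone can choose between the analyses.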
4. LOWER LEVEL PROCESSING: THE CAUSAL ACCOUNT
In the previous section, we developed our abstract pseudo-linguistic theory of the learning and assignment of stress by a perceptron. This high-level account was based on observation of perceptron learning results, without knowledge of processing mechanisms within the perceptron.
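The flavor of the lower-level account can be anticipated with a toy perceptron: once trained, the learned weights and bias are the entire causal story, with no mention of feet, word-trees, or markedness. This sketch is illustrative only (the toy rule "stress a syllable iff it is heavy" and the two-bit encoding are assumptions made here, not the encoding used in our simulations).

```python
def train(examples, n_inputs, max_epochs=100):
    """Perceptron rule; returns the learned weights and bias."""
    w, b = [0.0] * n_inputs, 0.0
    for _ in range(max_epochs):
        converged = True
        for x, t in examples:
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if o != t:
                converged = False
                w = [wi + (t - o) * xi for wi, xi in zip(w, x)]
                b += (t - o)
        if converged:
            break
    return w, b

# Toy rule: stress syllable 1 iff it is heavy. Inputs are the weights
# (1 = heavy) of syllables 1 and 2.
data = [((h1, h2), h1) for h1 in (0, 1) for h2 in (0, 1)]
w, b = train(data, 2)
print(w, b)   # the whole "causal account" is just these numbers
```

At this level of description, the question "does the network use extrametricality?" has no answer: there are only weighted sums compared against a threshold.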
In this section, we open up the "black box" of the perceptron and ex-
a