Single representations of multiple meanings in Latent Semantic Analysis
Thomas K Landauer
University of Colorado at Boulder
WW XX
YY ZZ
YY is kleeper than WW.
YY is kleeper than ZZ.
Is YY kleeper than XX?
Is "kleeper" ambiguous?
Latent Semantic Analysis (LSA) is a psychological model and computational simulation
intended to mimic and help explain the way that humans learn and represent the meaning of
words, text, and other knowledge. In this chapter I briefly describe the underlying theoretical and
computational machinery of LSA, review some of the surprising things it is able to do, and
discuss some of its limitations and possibilities for future development. I will concentrate on
what LSA has to say about multiple word meanings, where it succeeds and fails, and what is
needed to fix it.
For researchers and theorists concerned with word meanings and ambiguity, the most
important implication of the LSA theory is that it questions the idea that different senses of a
word have separate and discrete representations that are individually disambiguated. Instead, it
represents a word meaning as a single point in a very high dimensional semantic space. In LSA,
the acquisition of a word meaning is an irreversible mathematical melding of the meanings of all
the contexts in which it has been encountered. In comprehension, words are not disambiguated by
sense one at a time; their different effects in different contexts are merely determined by how they
combine. Thus, a word has a different "sense" every time it is used.
In LSA, the representation of a single word carries a very large number of independent
dimensions of meaning. Therefore, it can have quite different effects depending on whether its
various meaning components are consistent, inconsistent, or orthogonal, and strong or weak, relative
to the rest of the passage. Because the semantic space in which words and passages are
represented is extremely sparse, apparent ambiguity of one word in a passage does not necessarily
affect the meaning of the passage sufficiently to make the whole ambiguous.
The conspicuous deficiency of LSA referred to above is that it ignores dynamic changes
in the meaning of a sentence that depend on the order of the words, that is, many of the semantic
effects of grammatical and syntactic factors. These obviously have much to do with sentential
ambiguity. It turns out, empirically, that LSA's account substantially narrows the necessary role
of such factors, but it is nevertheless often critically hampered by their absence. For example,
it does a poor job of distinguishing between the meaning effects of antonyms. Thus, LSA should
be considered a baseline model of the learning and representation of meaning, not a theory of the
entire human processes of language production and comprehension.
Mathematically, what allows LSA to work at all is the representation of passage meaning
as a simple linear sum of word meanings. Clearly this is only a first-order approximation. While
LSA has been found to model human meaning much better than one would expect if that
assumption were often badly wrong, there are certainly higher-order dynamic processes that
appear as syntactic effects on meaning.
The main purpose of this chapter is to describe how LSA represents multiple senses in a
single representation and explore the theoretical and empirical consequences this has for
understanding verbal ambiguity. For those unfamiliar with LSA, its relevant foundations
will be sketched and the psycholinguistic phenomena that it has successfully modeled will be listed.
Then its representation of multiple meanings will be illustrated and discussed.
The basis of Latent Semantic Analysis.
LSA uses a fully automatic mathematical/statistical technique to extract and infer
semantic relations between the meanings of words from their contextual usage in large collections
of natural discourse. The analysis yields a mathematically well-defined representation of a word
that can be thought of as a kind of average meaning. LSA is not a traditional natural language
processing or artificial intelligence program; it uses no humanly constructed ontologies,
dictionaries, knowledge bases, semantic networks, grammars, syntactic parsers, part-of-speech
taggers, or morphologies. It takes as input large quantities of raw text that it parses into words and
passages such as sentences and paragraphs using the same formatting conventions that human
readers see. Although LSA uses as input data the patterns of how words are used in ordinary
language, its analysis is much deeper than the simple frequency, word-word co-occurrence,
sequential probability modeling, or keyword counting and matching techniques that have
sometimes been assumed to be the only purely computational alternatives to traditional linguistic
theory.
The theory and assumptions behind LSA. The problem of language learning, for a machine
or a human, can be represented as follows. The meaning of a passage (psg) is some function of the
meanings of its words plus other linguistic and non-linguistic context. For example:
the meaning of "husband and wife"
= ƒ{m("husband") > m("and") > m("wife") , m(context)};
in general,
m(psgi) = ƒ{m(wdi1) > m(wdi2) > ... > m(wdin) , m(contexti)},
where ">" means "followed by" and "," means "combined with".
The language learner's problem is to solve an enormous system of such simultaneous
equations for the meanings of all the words in the language and thus also the meaning of any
passage. To make an approximate solution feasible, LSA makes three strong simplifying
assumptions: (1) concurrent extra-passage context and non-verbal contextual learning experience are
ignored, (2) only textually represented language that a simulated person might have experienced is
used as data, and (3) the function by which word meanings are combined is addition.
Thus, the meaning of "husband and wife" becomes
m("husband and wife") = m("husband") + m("and") + m("wife"),
and, in general,
m(psgi) = m(wdi1) + m(wdi2) + ... + m(wdin).
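To make the additive assumption concrete, here is a minimal sketch in Python (an illustration, not part of the chapter's own analyses); the 300-dimensional word vectors are random stand-ins for vectors that a real LSA analysis would learn from a corpus:

    import numpy as np

    # Minimal sketch of LSA's additive composition assumption.
    # The word vectors here are random stand-ins; in a real LSA space
    # they would come from the SVD of a large text corpus.
    rng = np.random.default_rng(0)
    lexicon = {w: rng.normal(size=300) for w in ["husband", "and", "wife"]}

    def passage_vector(words, lexicon):
        """m(psg) = m(wd1) + m(wd2) + ... + m(wdn): a plain vector sum."""
        return np.sum([lexicon[w] for w in words], axis=0)

    m_husband_and_wife = passage_vector(["husband", "and", "wife"], lexicon)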
The simulations to date are all of word-to-word, word-to-passage, or passage-to-passage
relations, and are thus totally within a semantic space learned exclusively from word-passage
relations. If direct perceptual and intentional experiences were necessary for the simulated human
functions, LSA's simulations would fail entirely. They do not. Of course, they would probably be
more accurate if the model could also take advantage of these sources. Moreover, around three-
fourths of the word meanings known to adults, and of those learned by students from around age
ten, are learned solely from reading (see Landauer and Dumais, 1997), and there is no a priori
reason to believe that the mechanisms of learning from text and speech are fundamentally
different. How well text alone serves can best be determined by realistic simulations of human
understanding.
In LSA the representation of a word or passage is not a single number but a vector, a set
of real-numbered coefficients on a large number, typically 300, of orthogonal dimensions. Meanings
can also be represented geometrically, as a line or point in a very high dimensional space. The
representation of a passage is then the vector sum of its component words. Fig. 1 shows an
example. The word “and” has a very short vector that runs in a direction almost orthogonal to
“wife” and “husband” and thus has very little effect on the resultant parallelogram that gives a
point for the combined meaning.
To give an initial intuitive impression of how well LSA works, here are some examples
of its estimate of semantic similarity between pairs of words after learning from a corpus of the
same size and content as a typical first-year American college student’s lifetime reading. LSA’s
usual measure of semantic similarity is the cosine of the angle between meaning vectors for
words or passages in the semantic space. A cosine is a correlation-like measure that varies
between -1 and 1. On average, two randomly chosen words have cosines of about .02 ± .06. LSA
infers appropriate degrees of relation between synonyms (e.g., tiny-small, cos = .54), antonyms
(left-right, .72), singular and plural forms of nouns (house-houses, .54; child-children, .62),
present and past tenses of verbs (talk-talked, .67; run-ran, .57), whether regular or irregular, and
various other related forms (think-thinking, .67; quick-quickly, .62; thankful-thankless, .22;
birdcage-bird, .13; birdcage-cage, .23; chocolate-candy, .62; money-coins, .62; and, N.B., well-
hole, .22; well-healthy, .29; but hole-healthy, .04).
Fig. 1 about here.
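For readers who wish to compute this measure themselves, the cosine takes only a few lines of Python (an illustrative helper, not code from the chapter):

    import numpy as np

    def cosine(u, v):
        """Cosine of the angle between two meaning vectors: a correlation-like
        measure that varies between -1 and 1."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # With the stand-in vectors above: cosine(lexicon["husband"], lexicon["wife"]).
    # In a trained space, tiny-small would give about .54.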
The rationale behind singular value decomposition, the mathematical heart of
LSA, is that it can solve sets of simultaneous linear equations that are ill-conditioned or
"singular". Two or more equations may not be independent; that is, one or more of them can be
constructed as a linear combination of the others. There may not be enough equations to
solve for all of the variables. Two or more equations may imply different values for the same
variable. The psychological counterparts are obvious: a human language learner gets insufficient
experience for very large numbers of words and encounters many contexts of different meaning
that contain the same word.
Singular Value Decomposition (SVD) accomplishes several things in the face of such
difficulties. In the case of redundancy, it finds a smaller set of abstract variables to substitute for
the original, collapsing any two variables that always have the same implied value into one. It
partially overcomes incompleteness (not enough equations—that is, language experience—for the
number of variables) by deriving an approximate solution through a kind of transitive inference
procedure. It can always derive some number of independent variables (not derivable by linear
combinations of the others) such that every word is described by an equation containing
values for all the others. This representation relates every word to every other without establishing
an absolute value for any. A whole system of such relative values is extremely useful. A natural
language, for example, could be built only on relations, as has long been conjectured by some
philosophers and linguists. LSA explicitly models languages in this way.
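A small numerical sketch (a toy matrix, not a real corpus) makes the transitive-inference point concrete: "doctor" and "physician" below never occur in the same passage, yet after dimension reduction they receive nearly identical vectors because they share contexts:

    import numpy as np

    # Toy word-by-passage count matrix. Rows: words, columns: passages.
    # "doctor" and "physician" never co-occur, but both occur with "patient".
    words = ["doctor", "physician", "patient", "river"]
    X = np.array([
        [1, 0, 1, 0, 0],   # doctor
        [0, 1, 0, 1, 0],   # physician
        [1, 1, 1, 1, 0],   # patient
        [0, 0, 0, 0, 2],   # river
    ], dtype=float)

    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 2                                # ~300 in real LSA spaces
    W = U[:, :k] * s[:k]                 # reduced word vectors

    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    print(cos(W[0], W[1]))  # doctor-physician: high despite zero co-occurrence
    print(cos(W[0], W[3]))  # doctor-river: near zero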
By applying SVD to a language corpus, LSA deals with inconsistent contextual usages by
representing a word with values on more than one variable, usually on several hundred. Thus, in
the data from which LSA (and humans) infer meaning, each occurrence of a word in context
makes a unique contribution to its meaning. But through its high-dimensional representation, LSA
(and, by hypothesis, the humans it models) can capture all the independent aspects of a word's meaning in
a single mathematical entity.
For the linguistics and psychology of ambiguity, this is the most important point. In the
LSA model, the meaning of a word is just how it affects the meanings of the verbal contexts in
which it appears. Because every context is potentially different, there are an infinite number of
effects that any word can have. Thus, "sense" differences are continuous, not discrete. At a fine
enough level, a word has a different "sense" in every different passage in which it occurs. Except
for words that have been used so few times that each occasion can be enumerated, listing or
describing different senses in a dictionary is only an approximation, a way of partitioning the
various uses of the word into larger or smaller subsets. The variation in meanings may be small or
large, the tightness, amount of overlap, and average separation of clusters great or small,
but, according to LSA, the difference between one use and another is fundamentally
continuous.
If this characterization is correct, even if LSA's current formulation or solution method is
not, the idea of multiple separate and discrete meanings or senses is mistaken. It is replaced with
the idea that a word's meaning is a complex, that it combines into one representation many
different aspects. In mathematical terms, it is very high dimensional. Different meanings or
senses should not be thought of as stored separately in memory at all, and experiments
designed to determine how people "access" or choose between different senses, or how they
disambiguate individual words before computing clause or sentence meaning, are misguided.
What should take their place is research on how word meanings combine with each other.
How LSA is applied. A large corpus of text, as similar as possible to the sources from
which the humans whose performance is to be simulated would have acquired the knowledge to
be simulated, is divided into meaningful passages, such as paragraphs, which are then represented
as equations. An approximate solution to the system of equations is found by the matrix algebraic
technique of Singular Value Decomposition. (For mathematical and computational details see
Berry, 1992, or Landauer and Dumais, 1997). After the analysis, each word in the corpus, and any
passage, is represented as a high-dimensional (typically around 300) vector. This general, and
sometimes strongly non-monotonic, optimum in the number of dimensions is important. Dimension
reduction constitutes an inductive step by which words are represented by values on a smaller set
of abstract features rather than their raw pattern of observed occurrences. The effect of the
dimension reduction is analogous to the extra information about geographical locations that is
gained by plotting multitudes of point-to-point distances on a three-dimensional globe. One
important result of this step is that words of similar meaning that rarely or never appear in the
same passage, such as different terms for the same thing, get appropriately represented by similar
vectors. In a typical LSA generated “semantic space”, less than 2% of the word-word pairs for
which similarities are estimated ever occurred in the same passage.
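As a hedged sketch of that pipeline, the following Python uses scikit-learn for the counting and SVD steps; the three-passage "corpus" is invented for illustration, and a real space would require millions of words of text, a weighting step, and roughly 300 dimensions:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    # Tiny illustrative corpus standing in for a lifetime of reading.
    passages = [
        "the doctor examined the patient in the clinic",
        "the physician treated the patient",
        "the river overflowed its bank after the storm",
    ]
    counts = CountVectorizer().fit_transform(passages)  # passage-by-word counts
    svd = TruncatedSVD(n_components=2)                  # ~300 in practice
    passage_vectors = svd.fit_transform(counts)         # one vector per passage
    word_vectors = svd.components_.T                    # one vector per word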
How well LSA works. LSA yields good simulation of human verbal meaning across a
wide spectrum of verbal phenomena and test applications: (1) correct query-document topic
similarity judgments, even when there are no literal words in common between query and
document (Dumais, 1994); (2) correct mimicking of human word-word semantic relation and
category membership judgments (Landauer, Foltz and Laham, 1998), (3) correct choices on
vocabulary, and—after training on a textbook—subject-matter multiple choice tests (Landauer,
Foltz and Laham, 1998), (4) accurate measurement of conceptual coherence of text and resulting
comprehensibility (Foltz, Kintsch and Landauer, 1993, 1998), (5) correct prediction of word-
word and passage-word priming of word recognition in psycholinguistic experiments (Landauer
and Dumais, 1997), (6) accurate prediction of expert human holistic ratings and matching of the
conceptual content of student essays and textbook sections (Landauer, Foltz and Laham, 1998),
(7) optimal matching of instructional texts to learner knowledge as displayed in essays (Wolfe et
al., 1998), (8) mimicking of synonym, antonym, singular-plural, past-present, and compound-
component word relations, (Landauer, Foltz and Laham, 1998), (9) representing word ambiguity
and polysemy—the possession of two or more distinct senses or meanings by the same word
(more later), (10) correct mimicking of semantic categorization of words by children and adults
(Landauer, Foltz and Laham, 1998), (11) providing significant improvement for language
modeling in automatic speech recognition (Jurafsky and Coccaro, 1998; Bellegarda, 2000;
Hofmann, 1999), (12) matching textual personnel work histories to discursive job and task
descriptions (Laham, Bennett and Landauer, 2000), (13) estimating conceptual overlap among
large numbers of training courses by analysis of test contents (Laham, Bennett and Landauer,
2000), (14) accurately simulating the phenomenal rate of growth of human vocabulary during the
K-12 school years (Landauer and Dumais, 1997; Anglin, 1970, 1993), and several other
psycholinguistic and psychological phenomena (Kintsch, 1999, 2001).
Multiple meanings of a word. To repeat, LSA deals with inconsistent contextual usages
by representing a word as an equation containing many independent variables, usually several
hundred. Therefore, a word in LSA could, in principle, have as many entirely unrelated meanings
as there are dimensions in its representational space. Because of the high dimensionality of the
space, a word can be simultaneously close to any number of other words, some of which may
be close to each other while others may be relatively far apart. Therefore, a single LSA
representation of a word can, in the usual sense, be ambiguous. For example, it can represent two
entirely different meanings of "sage". We can see this in the cosines between the word sage and
the passage-length definitions of two of its senses in WordNet, as represented in a semantic space
based on a representative sample of the lifetime reading of a typical first-year college student.1

"sage" -- definition 1, "one as a profound philosopher distinguished for wisdom": cos = .23
"sage" -- definition 2, "a perennial herb or shrub sometimes grown as an ornamental… scarlet flowered salvia splendens": cos = .21
definition 1 -- definition 2: cos = .06
The word "sage" is similar to both definitions, but the two definitions are not very similar to each
other.
Unfortunately there is no easy way to choose sets of examples of ambiguous words in a
representative or random manner. However, we can examine a variety of common and obvious
cases. For an intuitive baseline, synonyms randomly picked from a synonym dictionary have
average cosines of around .25.
First, to illustrate the kind of analyses and results to be exhibited, here are cosine
similarities of "lead" to related words from two sets of five words each, the two sets representing
two quite different senses.
1 In all of these examples, the measures of semantic relations in LSA were provided by facilities
available at http://LSA.colorado.edu. Measures were based on the general reading, adult-level
semantic space, with 300 dimensions. The training corpus for this space was a collection of
~66,000 text samples, comprising ~12 million tokens of ~92 thousand word types,
representative of the lifetime reading of a
single typical American college freshman collected by Touchstone Applied Science Associates
(TASA) as the basis of their Educator's Word Frequency Guide. Any researcher may use this
facility, but careful reading of background material and detailed instructions before so doing is
strongly advised, as there are non-obvious technical and conceptual issues that can determine the
meaning and validity of obtained results.
lead
    metal     .34        follow     .36
    zinc      .46        pull       .12
    tin       .35        direct     .26
    solder    .26        guide      .17
    pipe      .25        harness    .19
    mean      .33        mean       .22
Within the two sets above, the "metal"-like and "guide"-like words have intra-set average cosines
of .48 and .18, respectively, whereas pairs of words taken from different sets have an average
cosine of .06.
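Summary numbers of this kind are straightforward to compute; here is a sketch, with random stand-in vectors where a trained LSA space would supply the real ones:

    import numpy as np
    from itertools import combinations, product

    # Stand-in vectors; real values would come from a trained LSA space.
    rng = np.random.default_rng(0)
    metal_set = ["metal", "zinc", "tin", "solder", "pipe"]
    guide_set = ["follow", "pull", "direct", "guide", "harness"]
    vec = {w: rng.normal(size=300) for w in metal_set + guide_set}

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def mean_cosine(pairs):
        """Average cosine over a collection of word pairs."""
        pairs = list(pairs)
        return sum(cosine(vec[a], vec[b]) for a, b in pairs) / len(pairs)

    within_metal = mean_cosine(combinations(metal_set, 2))    # reported: .48
    within_guide = mean_cosine(combinations(guide_set, 2))    # reported: .18
    across_sets  = mean_cosine(product(metal_set, guide_set)) # reported: .06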
In the remaining examples, I present three exhibits for each homographic word. (1) Its
relation to four other selected words, two related to one and two to another of its relatively
unrelated senses. To show the pattern, this starts with "lead". (2) Examples illustrating how the
LSA representation of a word can be close to two dictionary style definitions of two quite
different senses. The definitions were taken from the online version of WordNet (Fellbaum,
1998), edited to remove words with the same lemma as the homographic word as well as some
superfluous function words. (3) Summary statistics on the cosine similarity of the word to the
texts of all its senses defined in WordNet. Unfortunately, the mean and standard deviation of
cosines for random pairs of this kind cannot be calculated because we cannot specify the
population of passages.
The first case is the word "lead", for which analysis (1) gives:
          zinc     metal    follow    direct
lead      .46      .34      .36       .26
within-sense pairs: zinc-metal .57, follow-direct .30
cross-sense average: .20
The interpretation of the above is that "lead" has cosines of .46, .34, .36, and .26 with "zinc",
"metal", "follow", and "direct", respectively. "Zinc" and "metal", representing one sense, have a
cosine of .57 with each other; "follow" and "direct", representing another sense, have a cosine of
.30. The average of "zinc" with "follow", "zinc" with "direct", "metal" with "follow", and "metal"
with "direct" (the semantic similarities between words representing one sense and those
representing the other) is .20.
Analysis (2) for "lead" gives:

WordNet
1. "take, direct, conduct, guide": .41
2. "soft, heavy, toxic, dull gray metallic element": .47

Analysis (3) for "lead" gives:

Cosine with 32 senses: mean = .37, s.d. = .09
The other multiple meaning results follow the same pattern.
Fly:
          insect    mosquito    soar    pilot
fly       .26       .34         .54     .58
within-sense pairs: insect-mosquito .61, soar-pilot .27
cross-sense average: .09

WordNet
1. "travel through the air": .34
2. "two-winged insect": .36

Cosine with 21 senses: mean = .27, s.d. = .12
--------------------------------------------------------------------------------------
Run:
          stride    sprint    operate    manage
run       .45       .47       .24        .28
within-sense pairs: stride-sprint .47, operate-manage .31
cross-sense average: .08

WordNet
1. "a row of unravelled stitches": .30
2. "move fast by using feet, with one foot off the ground at any given time": .50

Cosine with 52 senses: mean = .35, s.d. = .05
--------------------------------------------------------------------------------------
Bank:
          money    saving    river    levee
bank      .33      .30       .17      .14
within-sense pairs: money-saving .52, river-levee .57
cross-sense average: .04

WordNet
1. "sloping land beside a river": .15
2. "deposit, put in an account": .57

Cosine with 16 senses: mean = .21, s.d. = .07
----------------------------------------------------------------------------------------
Swallow:
          bird    sparrow    gulp    chew
swallow   .23     .18        .31     .75
within-sense pairs: bird-sparrow .32, gulp-chew .24
cross-sense average: .07

WordNet
1. "the process of taking food into the body through the mouth by eating": .57
2. "small long-winged songbird noted for swift graceful flight and the regularity of its migrations": .30

Cosine with 11 senses: mean = .32, s.d. = .04
In each of these and several other cases examined: (1) the homographic word is
significantly (> 1.7 s.d. above chance) related to both members of each of two sets of words
representing different senses, and the two words in one set are less well related on average to the
two words in the other set, usually significantly so; (2) in every case, the homographic word is significantly
related to definitions of two quite different senses; and (3) in every case, the average cosine
between the homographic word and definitions for all its senses as given in WordNet is
substantial. The rationale for presenting both analyses (2) and (3) is that many of the WordNet
senses for a given homographic word are quite similar, so (3) does not by itself show that LSA
captures more than one sense. Analysis (2) demonstrates that phenomenon but, being based on a
selected example, does not show that the LSA representation captures senses well on average,
which (3) does (although, to repeat, there is no good way to establish a chance level for a
traditional significance test).
The effect of ambiguous words on passage meaning. Because a single word may
represent many different meanings, it should be obvious that adding an ambiguous word to a
passage could potentially make the LSA representation of the passage ambiguous as well. For
example, "He was impressed with the size of the ..." becomes an ambiguous sentence with the
addition of "bank" in LSA, just as it does in English. And as we will see later, and as discussed by
Kintsch (2001), in some sentences, such as "the hunter shot the elk", where one important aspect
of the overall sentence meaning—who shot whom—depends on a predication relation conveyed by
word order, the standard LSA representation does not properly resolve the passage ambiguity.
(However, note that such a set of words is essentially unambiguous in conveying the major
meaning—as compared to almost every other sentence ever uttered—that a shooting occurred
involving an elk and a hunter, no matter what order the words are given in. And, moreover, world
knowledge—which LSA has, but which there is as yet no automatic process for applying to
disambiguation—will very often resolve the apparent ambiguity.)
Nevertheless, there are several ways in which the same word can lend different meanings
to different utterances without producing significant ambiguity or confusion. That is, one of two
or more meanings of a word may change an utterance in a useful way, while others also change it
but in ways that don't matter for communicating the desired information. This is because it is the
direction of a passage vector (the vector sum of its components) relative to other possible
passages that defines its qualitative meaning.
A geographical landmark analogy may help. To go west on Interstate 80 from
Omaha you take the direction labeled "Denver". To go east from Salt Lake City you also take
the direction labeled "Denver". Thus the same word, "Denver", conveys the meaning "east" or
"west" depending on its context. And the fact that Denver is also slightly south of both Omaha
and Salt Lake City will not cause you to mistakenly take I-29 or I-15 north or south instead.
Nonetheless, in another pair of contexts, this time purely verbal, "Take I-25 in the Denver direction
from Colorado Springs" and "Take I-25 in the Denver direction from Cheyenne", "Denver" brings
the meanings north and south to the passage without causing any ambiguity about east and west.
The neologism "kleeper" that opened this chapter displays this potential multiple-effect
property of a word.
WW XX
YY ZZ
YY is kleeper than WW.
YY is kleeper than ZZ.
Is YY kleeper than XX?
In LSA "kleeper" would be represented singly as a word that distinguishes left from right, or up
from down, and / from \ , and in sentences where only one such difference is at issue, the
extraneous (ambiguous) meanings are irrelevant to the communication. Moreover, if I were to say
"Place AA and BB so that AA is kleeper than BB", you could make several different conforming
diagrams, in this way expressing "kleeper"'s ambiguity. Thus a word can sometimes be
ambiguous, and sometimes not, depending on its context.
Some other meanings of "Denver", e.g. that it is a large city, are even less important
because they are orthogonal to the message and unlikely to cause it to be misunderstood in any
important way. Still other meanings of "Denver", e.g. that it is near the Rocky Mountains, are
highly redundant within the context and also lead to no confusion. What “Denver” contributes to
the passage meaning depends on its relation to the rest: it does not need to be analyzed into
separate meanings first.
In the very sparsely populated, very high dimensional semantic spaces implied by LSA
analyses, moving a passage a small amount in either a nearly orthogonal or nearly collinear
direction is very unlikely to alter its relative meaning significantly, that is, to make the passage
more confusable with any other that is likely to occur.
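A quick numerical check of this claim, again with random stand-ins rather than a trained space:

    import numpy as np

    # In 300 dimensions, two random vectors are nearly orthogonal, so adding
    # a short random "extraneous meaning" vector barely turns the passage.
    rng = np.random.default_rng(0)
    passage = rng.normal(size=300)
    extra = rng.normal(size=300)
    extra *= 0.1 * np.linalg.norm(passage) / np.linalg.norm(extra)  # 10% as long

    perturbed = passage + extra
    cos = passage @ perturbed / (np.linalg.norm(passage) * np.linalg.norm(perturbed))
    print(round(cos, 3))  # ~0.995: the passage's direction is nearly unchanged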
Here are some examples of the effects a single word can have or not have in a passage.
Keep in mind that these are far from random examples. They are instead intentionally constructed
or chosen passages that serve as existence proofs of phenomena that can occur in LSA
representations. We start with two pairs of short phrases containing the word "lead" used in two
of its very different senses. In the first pair we see what happens when we substitute a relatively
unambiguous synonym of one of its meanings, the meaning related to mass. Doing so makes very
little difference in the LSA representation of the phrase, or of the relation of the phrase to one in
which an antonym of only one of the meanings of "lead" is substituted. The same pattern occurs
in the second pair, where the two phrases use a different meaning of "lead", the one having to do
with interpersonal behavior, and instead of substituting an antonym, we substitute a word with
semantics totally unrelated to the mass meaning. Across the two pairs we see at most only the
slightest distortion of the meaning of the phrase by substituting a word one of whose meanings is
consistent and another strongly inconsistent. (By examining two cases where different senses of
"lead" are involved we avoid the possibility that one meaning is so dominant that no other has
any effect.)
                    a lead weight    a light weight
a heavy weight          .92              .61
a lead weight                            .57
                     lead the army    desert the army
direct the army          .87               .68
lead the army                              .68
In the next example, we show the effect of leaving out or putting in a single highly
ambiguous word, "swallow", in a full sentence that is highly consistent with one or the other of
two meanings, one related to eating, the other naming a species of bird. The sentences are
definitions of "swallow" adapted from WordNet.
                                                        chew, gulp    bird, sparrow    swallow
To ... is the process of taking food into the body
through the mouth by eating.                               .38             .14            .57
To swallow is the process of taking food into the
body through the mouth by eating.                          .39             .15            .59
A ... is a small long-winged songbird noted for
swift graceful flight and the regularity of its
migrations.                                                .20             .30            .30
A swallow is a small long-winged songbird noted
for swift graceful flight and the regularity of its
migrations.                                                .24             .30            .39
Because the word "swallow" is so consistent with either definition, leaving it out of the definition
has almost no effect on the relation of the definition to the two words of related meaning, .38 vs.
.39 and .30 vs. .30. The ambiguous word has some effect on the relation of the definitions to the
two words of unrelated meaning, but mainly for the non-dominant bird-species meaning. Similarly,
including "swallow" itself in the definition makes the definition more like the word itself to a
greater degree for the non-dominant meaning. Dominance in LSA would presumably be reflected
in greater vector length along the dimensions that determine one meaning rather than the other.
However, I have not tried to analyze examples in a way that would demonstrate that.
Finally, here is an example of how a typical full sentence containing an ambiguous word,
"ring", is related to two of its meanings.
                                            symbolic gold    marriage    high-pitched    loud
                                            object           band        sound           noise
The housewife's face literally lit up as
the plumber extracted her lost wedding
ring from the sink.                             .11             .14          .03           -.04
The office walls were so thin that they
could hear the ring of their neighbor's
phone whenever a call came in.                  .02             .04          .18            .29
There is no sign that the ambiguity of "ring" distorts LSA's representation of the sentences. The
relation to the comparison phrases (composed of words not found in the sentences) displays
conveyance of the two meanings of "ring" with no significant interference from the alternative
meaning.
I want to avoid the impression that these examples are representative. Ostensibly similar
examples often fail to exhibit the same properties, and I have found no good way to construct a
set from which generalization to any rigorously defined population could be made. Rather, the
examples are existence proofs. Their point is just to illustrate the LSA conception of the meaning
of a word as its effect on a passage, that the effect can be different depending on the passage, and
that alternative meanings do not necessarily affect passage meaning to any great extent. In
addition, the examples are not meant to imply that such effects, or their absence, are universal.
Indeed, they are not; multiple meanings can and do produce sentence ambiguity in some cases.
However, that too is what LSA expects. The important implication, supported at least tentatively
by these examples, is that initial disambiguation of individual words before combining them into
passage meanings may not be needed to explain the effects of multiple meanings of words. LSA
does no such word-by-word disambiguation. However, obviously, much more work would be
necessary to determine just when the LSA process is sufficient and when it fails. I am sure, for
example, that there will be cases of failure in instances where word-order-based syntactic or
grammatical factors modify the meaning of an ambiguous word.
The overall argument of this section is that in most multiword passages the average
meaning of the words may be enough to capture almost all of what is going on, without the
unrelated meaning components of individual words making enough difference to render the total
passage meaning itself ambiguous. This helps to explain how the technique has achieved such
close simulations of human judgments and behaviors with respect to passage meaning.
Failure of standard LSA to account for syntactically governed ambiguity. Obviously,
ambiguous sentences—and, more rarely, ambiguous paragraphs—do occur, and much
use has been made of their explanation and implications in linguistics and psycholinguistics.
Thus, the fact that the vector representation of a word meaning may contain more than one
meaning is still both a matter of concern and an opportunity. The concern is exactly when and
how an ambiguous word settles into a definite meaning when alone, or has its appropriate effect
and not another, in a sentence or passage. The opportunity is that when a word representation
contains more than one sense or meaning, a mathematically precise way of explaining how word
meanings combine to form new meanings may be possible. In LSA representations, a passage is a
sum or average of the meanings of all the words in it, without modification of word meanings
themselves depending on the company they are currently keeping, and without assuming that
each sense is represented as a separate entity. The LSA representation allows a word to have a
quantitatively different effect in every utterance in which it takes part (just as it was embedded in
a different context each time it was encountered during learning). This makes it equivalent to
allowing every word to have an unlimited number of subtly different senses, a property that
common intuition and the differences between dictionaries seem to bear out, but avoids the
probably hopeless, and, I believe, fundamentally mistaken (although approximations are
practically useful) effort to exhaustively catalogue or separately model their mental
representations.
On the other hand, it is clear that combining words into a sentence, or placing them in a
particular order in a sentence, often generates different meanings, as for example in metaphorical
expressions, predications of the meaning of word A on word B rather than C, modifying word D,
not E, by the placement of word F, and so forth: "The red hen pecked the white rooster" vs. "The
red rooster pecked the white hen." No matter what kind of representation one posits for the
meaning of a word, it would be necessary to have some mechanism by which these dynamic
changes can occur. The vector representation of LSA turns out to offer at least one such
explanation. The heart of the explanation is that all of the meanings of a word need not
necessarily be brought into play at once. Kintsch (1999, 2001) gives clear examples of one way in
which this might be accomplished. In Kintsch’s model, a word does not enter into a passage
meaning simply as its vector. Rather it enters by producing an activation of neighboring words,
the closer, the more activated. The effect that these activated neighbors have on the meaning of
the passage in turn depends on what other words are present and on their relation to those
neighbors. The result is that a local selection among the possible combinations of the various
meaning dimensions of all the words is used.
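A loose computational sketch of this neighbor-activation idea follows. It is a simplification, not Kintsch's actual algorithm; the parameter values and the assumption of unit-length vectors are illustrative choices:

    import numpy as np

    def predication(arg_vec, pred_vec, lexicon_vecs, k=20, m=5):
        """Loose sketch of predication by neighbor activation: rather than
        adding the predicate's full vector, add only those of its nearest
        neighbors that are also relevant to the argument, so context selects
        among the predicate's meanings. Assumes rows of lexicon_vecs are
        unit length."""
        sims_to_pred = lexicon_vecs @ pred_vec           # closeness to the predicate
        neighbors = np.argsort(sims_to_pred)[-k:]        # its k nearest neighbors
        sims_to_arg = lexicon_vecs[neighbors] @ arg_vec  # re-rank by the argument
        selected = neighbors[np.argsort(sims_to_arg)[-m:]]
        return arg_vec + pred_vec + lexicon_vecs[selected].sum(axis=0)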
Kintsch’s model takes a big step towards showing us how LSA representations, or any
representations at all for that matter, can be dynamically combined through a computationally
specifiable process to produce correct emergent passage meanings. It does not yet avoid human
intervention, because to date Kintsch has had to tell the model which is the predicate and which
the argument.
How are we to attack this serious incompleteness in LSA in the mechanistic spirit we
desire? It seems likely that some of the dynamic effects are nonlinear, not simple arithmetic
averaging. This would mean that something different from simple addition needs to be embedded
in the function by which a passage meaning is generated from its constituent words. There are
significant theoretical, mathematical, and computational difficulties in the way of doing this, and
we have not yet found a successful way. The ability of SVD to solve the huge system of
simultaneous equations posed by the language learning problem depends on the simplifications
that we listed above, among which the addition function is probably as important as any. One
possible escape from this dilemma would be to find other more complex elementary features,
such as multiword combinations, to include as terms in the additive equations.
An even more difficult problem may be representing the effect of the order of appearance
of words. Order has a large bearing on such syntactic issues as which word is the predicate,
especially in highly order-dependent languages such as English. Again, a new internal function for
the equations will be required. And again, we have not yet found a way to proceed. Among the
nonlinear combining and order phenomena that will need a mechanism are ones that generate
structured relations such as phrases and anaphora. Some sort of dynamical systems approach, such
as those recently explored in neural net language modeling (e.g., Tabor and Tanenhaus, 1999), is
one appealing possibility. Another approach would be the addition of ordered multiword
combinations.
Obviously, much remains to be done. However, I see no reason to despair. The fact that
people learn and use languages means that there are such functions, and also implies that there
must be a general way to infer the particular form of the function used in a particular language
from its observation and social practice. Some might claim that the project is futile because the
observed evidence is insufficient for any learning system to infer the functions. I think this
assumption is already well on its way to crumbling under the weight of evidence. A recent article
by Christiansen and Chater (1999a), for example, has shown that a standard recurrent neural
network system can, with no direct human aid, produce recursive-appearing embedded structures
that mimic rather nicely the corresponding performance that humans actually exhibit. Other
neural net experiments have chipped away at the anti-learning assumption as well, as reviewed by
Seidenberg (1997) and recently exemplified by Hinton (2000) in a multiplicative components
model that learns to produce long-distance word order relations, albeit in a toy environment. And,
of course, I think that the LSA results on word and passage meaning, along with the additions that
Kintsch proposes are quite encouraging.
However, I believe that the most important implication from what has been done so far is
just that the concept of independent storage of discrete multiple senses of words, the standard
framework for thinking about lexical ambiguity, needs reconsideration. The theory and results I
have reviewed suggest that it might be more fruitful to focus on mechanisms by which a single
meaning representation can be learned from and applied in multiple contexts.
References
Berry, M. W. (1992). Large scale singular value computations. International Journal of Supercomputer Applications, 6(1), 13-49.
Christiansen, M. H., & Chater, N. (1999a). Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23, 157-205.
Christiansen, M. H., & Chater, N. (1999b). Connectionist natural language processing: The state of the art. Cognitive Science, 23, 417-437.
Collins, A. M., & Loftus, E. F. (1975). A spreading activation theory of semantic processing. Psychological Review, 82, 407-428.
Fellbaum, C. (Ed.) (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Jurafsky, D. (1996). A probabilistic model of lexical and syntactic access and disambiguation. Cognitive Science, 20, 137-194.
Kintsch, W. (1999). Metaphor comprehension: A computational theory. Psychonomic Bulletin & Review, 7, 257-266.
Kintsch, W. (2001). Predication. Cognitive Science.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
Landauer, T. K., Laham, D., & Foltz, P. W. (1998). Learning human-like knowledge by singular value decomposition: A progress report. In M. I. Jordan, M. J. Kearns, & S. Solla (Eds.), Advances in Neural Information Processing Systems 10 (pp. 45-51). Cambridge, MA: MIT Press.
Landauer, T. K., Laham, D., Rehder, B., & Schreiner, M. E. (1997). How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans. In M. G. Shafto & P. Langley (Eds.), Proceedings of the 19th annual meeting of the Cognitive Science Society (pp. 412-417). Mahwah, NJ: Erlbaum.
Lenat, D. B. (1995). CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38, 32-38.
Schvaneveldt, R. W. (Ed.) (1990). Pathfinder networks: Theory and applications. Norwood, NJ: Ablex.
Seidenberg, M. S. (1997). Language acquisition and use: Learning and applying probabilistic constraints. Science, 275, 1599-1603.