Corpus Linguistics meets Lexical Semantic Theory James Pustejovsky Brandeis University University of Pavia December 15, 2004


Corpus Linguistics meets Lexical Semantic Theory

James Pustejovsky

Brandeis University

University of Pavia, December 15, 2004


Background

• Joint work with Patrick Hanks and Anna Rumshisky

• Research funded by NSF

• References:

– Pustejovsky, J., P. Hanks, and A. Rumshisky (2004) “Automated Induction of Sense in Context”, Proceedings of COLING, Geneva.

– Pustejovsky, J. and P. Hanks (2001) “Very Large Lexical Databases”, Tutorial Notes from ACL, Toulouse.


Outline

• Corpus Linguistics needs Theory

– Linguistic Theory needs corpus data

• Assumptions about Lexicons

– Lexicons are for something

• Remarks on Lexical Architectures

– Possible versus Probable Meaning

• Encoding Context for a predicate

– Capturing word senses through context

• Semantic Induction from Corpora

– Theory guides clustering


Building Lexicons with Corpus Analysis

• Regarding Lexicons:

– Lexicons are for some purpose or task.

– There is no one lexicon but multiple lexicons.

• Regarding Senses (GL, 1995):

– Words have senses, but there is no finite number of senses independent of the contextualized use of words in composition.

• Words have meaning potentials:

– Words are active objects with functional behavior.


What is the relationship between corpus and lexicon?

• Corpus:

– an accumulation of tokens

• Lexicon:

– an ordered collection of word-types (lemmas), with data attached.


As corpora grow:

• There is a continuing flow of new nouns:

• There are very few new verbs and adjectives:

– but an increasing number of contexts for them.

• No new function words.


Content of Lexicons for real Applications

• Proper Names:

– humans, locations, institutions, brands, products

• Open class items:

– nouns, verbs, modifiers

• Multiword Expressions

– compounds, idioms, collocations, constructions


Things to hang on Words:

• Inflectional forms of the lemma

• Phonetic form

• Syntactic categorization

• Subcategorization

• Semantic Type

• Typical Contexts (phraseology)

• Co-specifications

• Implicatures (contextually determined)

• Translations

• Examples of usage

• Probabilities


Lexicon Design should:

• Enable the Possible:

• Be tempered with the Probable:

• Be embedded within a specific application:

– instances of actual.


Selectional Features from Corpora

Selection doesn’t specify exactly how a word is going to behave on all occasions. Rather:

• Selection specifies how words typically behave:

• The typical is the foundation for forecasting the probable:

• The probable comes from corpus and the cospecifications associated with words.


Lexical Acquisition

Goals:

- Acquisition of subcategorization using corpus analytics;

- Learning selectional associations;

- Clustering of complementation patterns.

All are necessary techniques, but:

- There must be an initial lexical architecture;

- Efficacy of the results depends on the application model and the corpus available.


Encoding Context

What is the Context of a Linguistic Utterance?

• Local context characterized as Strong Selection

• Broad context captured in part by Weak Selection

• Words encode context as types;

• Compositional rules refer to these types:

• Types can be selected;

• Types can be coerced.

• Types can be exploited.

• Composition can license new interpretations.


Basic Generative Lexicon

• Two classes of sortal constraints on a concept:

– Argument structure

– Event structure

• These bind into the Qualia Structure

• Compositional Rules invoke Type Selection

• Type Coercion: Inviolable Selection

• Type Exploitation: Subselection of type features


Qualia Structure

Formal: the basic category which distinguishes it within a larger domain;

Constitutive: the relation between an object and its constituent parts;

Telic: its purpose and function;

Agentive: factors involved in its origin or “bringing it about”.
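The four qualia roles can be pictured as fields on a lexical entry. The sketch below is only an illustration: the `Qualia` class, the `coerced_event` helper, and the values given for "novel" are assumptions made for this example and do not come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Qualia:
    formal: str        # basic category within a larger domain
    constitutive: str  # relation to constituent parts
    telic: str         # purpose and function
    agentive: str      # how the object comes about


# Hypothetical entry, invented for illustration only.
novel = Qualia(formal="book", constitutive="narrative",
               telic="read", agentive="write")


def coerced_event(noun_qualia: Qualia, role: str) -> str:
    """Return the event named by the requested quale ('telic' or 'agentive')."""
    if role not in ("telic", "agentive"):
        raise ValueError("only the telic and agentive qualia name events in this sketch")
    return getattr(noun_qualia, role)


print(coerced_event(novel, "telic"))     # read
print(coerced_event(novel, "agentive"))  # write
```

Under these assumptions, a phrase such as "begin a novel" could be read as "begin reading" or "begin writing", depending on which quale supplies the event.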


Types and Words Select Different Things

Types:

- Operation: Selection Restrictions (semantic typing)

- Result: Possible combinations

Tokens:

- Operation: Corpus Selection (cospecification)

- Result: Probable combinations


Sense of a word depends on its context

• Consider the word treat:

Peter treated Mary badly.

Peter treated Mary with antibiotics.

Peter treated Mary with respect.

Peter treated Mary for her asthma.

Peter treated Mary to a fancy dinner.

Peter treated Mary to his views on George W. Bush.

Peter treated the woodwork with creosote.

• Dictionaries do not provide the contexts that distinguish one sense of a word from another.


Problem: what context is relevant?

• The more senses a word has, the greater its lexical entropy.

• How to decide what context features determine the sense of a word?

• We want a data-driven sense definition.

– Sort contexts of use for a given word into “buckets” to reduce lexical entropy

– Analyze features typical for each “bucket”
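The slides do not define "lexical entropy" formally. One plausible reading, assumed here, is the Shannon entropy of a word's sense distribution, which drops once instances are sorted into buckets by a context feature. The instances, sense labels, and the `prep` feature below are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of sense labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def bucketed_entropy(instances, feature):
    """Weighted average entropy of the sense labels after bucketing by one feature."""
    buckets = {}
    for inst in instances:
        buckets.setdefault(inst[feature], []).append(inst["sense"])
    total = len(instances)
    return sum(len(b) / total * entropy(b) for b in buckets.values())


# Toy, hand-made instances of "treat" (not corpus data).
instances = [
    {"prep": "with", "sense": "medical"},
    {"prep": "with", "sense": "behave"},
    {"prep": "for", "sense": "medical"},
    {"prep": "for", "sense": "medical"},
    {"prep": "to", "sense": "give"},
    {"prep": "to", "sense": "give"},
]

before = entropy([inst["sense"] for inst in instances])
after = bucketed_entropy(instances, "prep")
print(f"{before:.3f} bits before bucketing, {after:.3f} bits after")
```

In this toy sample the "with" bucket stays ambiguous (as in "with antibiotics" versus "with respect"), which suggests that the preposition alone is not a sufficient discriminating feature.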


Corpus Pattern Analysis (CPA)

Corpus Pattern Analysis (CPA) is a corpus analytic and automated induction technique that:

1. Identifies the typical syntagmatic patterns for each word and determines discriminant context features.

2. Catalogs semantic types of arguments that are relevant for distinguishing between different senses.

3. Creates an inventory of syntactic and lexical realizations for relevant semantic types.


CPA (II)

• Word senses are linked to syntagmatic patterns.

• Selection contexts of a word are the typical syntagmatic patterns of its use.

• Selection contexts can be indexed on clauses and phrases, as well as single words.

• Selection contexts are captured in CPA patterns.

Current work focuses on CPA patterns for verbs.


Research Areas Impacted by CPA

• Selectional preference acquisition

– Resnik (1996), Briscoe & Carroll (1997), Abney & Light (1999), Korhonen (2002)

• Word sense disambiguation

– SENSEVAL efforts, Stevenson & Wilks (2001), Aguirre et al. (2002)

• Ontology construction

– EuroWordNet, SIMPLE


CPA Components

• Lexical discovery

– Manual discovery of selection context patterns for specific verbs through corpus analysis

• Automatic recognition of pattern use

– Sorting unseen instances of verb use according to nearest match to identified patterns

– Similar to conventional WSD

• Automatic pattern acquisition

– Acquisition of patterns for unanalyzed cases

• Discriminant feature selection

• Predicate-based argument clustering


CPA Pattern Elements

• Syntactic Parsing

– Phrase-level parsing (clause roles)

• Shallow Semantic Typing

– Generic semantic features

– Brandeis Shallow Ontology

• Minor Category Parsing

– Adverbial Phrases, Locatives, Purpose Clauses, Rationale Clauses, Temporal Adjuncts, etc.

• Subphrasal Syntactic Cue Recognition

– Genitives, partitives, bare plural/determiner distinction, infinitivals, negatives, past participles, etc.


CPA Pattern Grammar

• Pattern grammar (fragment)

CPA-Pattern -> Segment verb-lit Segment | CPA-Pattern ';' Rstr

Segment -> Element | Segment Segment | '(' Segment ')' | Segment '|' Segment

Element -> literal | '[' Rstr ArgType ']' | '[' Rstr literal ']' | '[' Rstr ']' | '[' NO Cue ']'

Rstr -> POS | Phrasal | Rstr '|' Rstr |

Cue -> POS | Phrasal | AdvCue

• For example,

[[Person]] assemble [[Artifact]]

[PLURAL[Person]] | [[Human Group]] assemble (in [[Location]])

[[Person 1]] treat [[Person 2]] ; NO Adv[Manner]
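To make the notation concrete, the sketch below pulls the `[[Type]]` slots out of a pattern string and checks the simplest subject-verb-object case against typed arguments. It is not a parser for the grammar fragment above: the regular expression and both function names are assumptions for illustration, and alternation, optional segments, and `NO` cues are not handled.

```python
import re

# Matches a semantic type written as [[Type]] in a pattern string.
ARG = re.compile(r"\[\[([^\]]+)\]\]")


def argument_types(pattern):
    """Return the semantic types that appear as [[Type]] in a CPA pattern string."""
    return [t.strip() for t in ARG.findall(pattern)]


def matches_svo(pattern, verb, subj_type, obj_type):
    """Check a plain '[[Subj]] verb [[Obj]]' pattern against typed arguments."""
    types = argument_types(pattern)
    literals = ARG.sub(" ", pattern).split()  # what remains once the slots are removed
    return literals == [verb] and types == [subj_type, obj_type]


print(argument_types("[[Person]] assemble [[Artifact]]"))
print(matches_svo("[[Person]] assemble [[Artifact]]", "assemble", "Person", "Artifact"))
```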


Corpus-Driven Type System

• Shallow Typing

– applying a shallow-type ontology to a parsed corpus

• Type Promotion

– promoting to type position lexical units breaking a particular statistical threshold

• Lexical Sets

– predicate-based groupings of similarly typed lexical elements from corpus

E.g. absorb: heat, light, energy, power, shock, wave, sound, impact, movement

– populated through type-filtered cluster analysis, in each argument position
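A much simplified stand-in for this step is sketched below. It groups nouns by verb, argument position, and shallow type, and keeps those seen often enough. Note that this is plain frequency filtering, not the cluster analysis the slide refers to; the triples, the type assignments, and the `min_count` threshold are all invented for illustration.

```python
from collections import defaultdict


def lexical_sets(triples, noun_type, min_count=2):
    """Group nouns by (verb, relation, type), keeping nouns seen at least min_count times."""
    counts = defaultdict(lambda: defaultdict(int))
    for verb, relation, noun in triples:
        shallow_type = noun_type.get(noun)
        if shallow_type is not None:  # type filter: skip nouns with no known type
            counts[(verb, relation, shallow_type)][noun] += 1
    return {
        key: sorted(n for n, c in nouns.items() if c >= min_count)
        for key, nouns in counts.items()
    }


# Toy (verb, relation, noun) triples, as might come from a parsed corpus.
triples = [
    ("absorb", "dobj", "heat"), ("absorb", "dobj", "heat"),
    ("absorb", "dobj", "shock"), ("absorb", "dobj", "shock"),
    ("absorb", "dobj", "water"),
    ("absorb", "dobj", "zzz"),  # untyped noise, dropped by the type filter
]
# Hypothetical type assignments.
noun_type = {"heat": "Energy", "shock": "Energy", "water": "Substance"}

sets = lexical_sets(triples, noun_type)
print(sets[("absorb", "dobj", "Energy")])  # ['heat', 'shock']
```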


Fine-tuning the Features

• Extending classification of minor categories, e.g. adverbials of manner/effect

– Peter treated Mary rudely.

– Peter treated Mary effectively.

• Semantic features defining lexical sets, e.g. Energy (argtype for absorb)

– heat, light, energy, power, shock, wave, sound, impact, movement


Implementation Details

• CPA patterns for an initial sampling of verbs are derived manually.

• A corpus is parsed (British National Corpus).

• A shallow type system is applied to the parsed corpus (Brandeis Shallow Ontology).

• A training sample is created.

• Machine learning techniques are applied to disambiguate the unseen instances using pattern features.


Brandeis Shallow Ontology

• BSO Noun Coverage

– 3400 type nodes total

– 20,000 noun entries

– 10,000 nominal collocation entries

• 65 Shallow Types

– ‘Abstract’, ‘Asset’, ‘Animate’, ‘Artifact’, ‘Document’, ‘Human Group’, ‘Information’, ‘Institution’, ‘Location’, ‘Person’, ‘PhysObj’, ‘Process’, ‘Substance’, ‘Surface’, ‘Time Period’, etc.

• A subset of 24 shallow types was used in the experiments.


RASP Statistical Parsing System (Briscoe & Carroll, 2002)

• input tokenized, POS-tagged, lemmatized

• generates forest of full parse trees for each sentence

• set of grammatical relations associated with each parse analysis

– named relation, head, dependent

subjects: ncsubj, clausal (csubj, xsubj)

objects: dobj, iobj, clausal complement

modifiers: adverbs, modifiers of event nominals

• pick the top-ranked tree for the sentences where full parse was a success


Selected context features from RASP/BSO implementation

• obj_institution: object belongs to the BSO type ‘Institution’

• subj_human_group: subject belongs to the BSO type ‘Human Group’

• mod_adv_ly: target verb has an adverbial modifier, with a -ly adverb

• clausal_like: target verb has a clausal argument introduced by ‘like’

• iobj_with: target verb has an indirect object introduced by ‘with’

• obj_PRP: direct object is a personal pronoun

• stem_VVG: the target verb stem is an -ing form
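The feature names above can be read as the output of a small mapping from a parsed, typed verb instance to a set of feature strings. The sketch below shows one way such a mapping might look. The input dictionary format and its keys (`subj_type`, `obj_type`, `obj_pos`, `adverbs`, `iobj_prep`) are hypothetical; they are not the RASP output format.

```python
def context_features(instance):
    """Map one parsed verb instance (a plain dict, hypothetical format) to feature names."""
    feats = set()
    if instance.get("subj_type"):
        feats.add("subj_" + instance["subj_type"].lower().replace(" ", "_"))
    if instance.get("obj_type"):
        feats.add("obj_" + instance["obj_type"].lower().replace(" ", "_"))
    if instance.get("obj_pos") == "PRP":
        feats.add("obj_PRP")
    if any(adv.endswith("ly") for adv in instance.get("adverbs", [])):
        feats.add("mod_adv_ly")
    if instance.get("iobj_prep"):
        feats.add("iobj_" + instance["iobj_prep"])
    return feats


example = {
    "subj_type": "Human Group",
    "obj_type": "Institution",
    "adverbs": ["badly"],
    "iobj_prep": "with",
}
print(sorted(context_features(example)))
# ['iobj_with', 'mod_adv_ly', 'obj_institution', 'subj_human_group']
```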


Disambiguation accuracy for sample predicates

verb     patterns   training set   decision tree   kNN
edit     2          100            87%             86%
treat    4          200            45%             52%
submit   4          100            59%             64%
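The kNN column refers to k-nearest-neighbour classification. The slides do not state the distance measure, the value of k, or the implementation used, so the sketch below is only a generic illustration over sets of context features. Jaccard similarity, `k=3`, and the pattern labels are all assumptions.

```python
from collections import Counter


def jaccard(a, b):
    """Similarity of two feature sets: shared features over all features."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def knn_classify(features, training, k=3):
    """Label an unseen instance by majority vote among its k most similar training items."""
    ranked = sorted(training, key=lambda ex: jaccard(features, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training sample: (feature set, pattern label); labels are invented.
training = [
    ({"iobj_with", "obj_person"}, "treat-with"),
    ({"iobj_with", "obj_PRP"}, "treat-with"),
    ({"iobj_to", "obj_person"}, "treat-to"),
    ({"mod_adv_ly", "obj_person"}, "treat-manner"),
]

print(knn_classify({"iobj_with", "obj_person"}, training))  # treat-with
```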


Experimental Results

• CPA appears to be as accurate as, or more accurate than, other techniques for WSD.

• Different types of ambiguities are resolved with different degrees of effectiveness.

• It will be tested on the latest SENSEVAL data.


Goals of CPA

• To create an inventory of semantically motivated syntagmatic patterns, so as to reduce the ‘lexical entropy’ of each word.

• To develop procedures for populating lexical sets by computational cluster analysis of text corpora.

• To collect evidence for the principles that govern the exploitations of norms.


Lexical Discovery (creating patterns)

• Create a sample concordance for each word

– 300-500 examples

– from a ‘balanced’ corpus (i.e. general language) [We use the British National Corpus, 100M words, and the Associated Press Newswire for 1991-3, 150M words]

• Identify statistically significant collocates

• Classify every line in the sample, on the basis of its context.

• Take further samples if necessary to establish that a particular phraseology is conventional

• Check results against corpus-based dictionaries.

• Use introspection to interpret data, but not to create data.
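The slides do not say which statistic is used to identify significant collocates. As a stand-in, the sketch below ranks sentence-level co-occurrences by pointwise mutual information (PMI), which is an association score rather than a significance test. The function name, the `min_count` cut-off, and the toy sentences are assumptions.

```python
import math
from collections import Counter


def pmi_collocates(sentences, node, min_count=2):
    """Rank words that share a sentence with `node` by pointwise mutual information."""
    n = len(sentences)
    word_df = Counter()  # number of sentences containing each word
    pair_df = Counter()  # number of sentences containing both the node and the word
    for sent in sentences:
        words = set(sent)
        word_df.update(words)
        if node in words:
            pair_df.update(words - {node})
    scores = {
        w: math.log2(joint * n / (word_df[node] * word_df[w]))
        for w, joint in pair_df.items()
        if joint >= min_count
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy tokenized sentences (invented).
sentences = [
    ["storm", "abate"], ["storm", "abate"], ["storm", "the"],
    ["the", "rain"], ["the", "wind"], ["the", "storm"],
]
ranked = pmi_collocates(sentences, "storm")
print(ranked[0][0])  # abate
```

Here "abate" outranks "the" because it occurs only alongside "storm", while "the" is spread across the whole sample.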


Every line in the sample must be classified

The classes are:

• Norms (normal uses in normal contexts)

• Exploitations (e.g. coercions and ad-hoc metaphors)

• Alternations – e.g. [[Doctor]] treat [[Patient]] <> [[Medicine]] treat [[Ailment]]

• Names (Sea Biscuit: name of a horse, not a cracker)

• Mentions (to mention a word or phrase is not to use it)

• Errors

• Unassignables


Lexical sets are contrastive sets

• Different lexical sets generate different meanings.

• The lexical sets associated with each sense of each verb are different.

– It remains to be discovered whether they are ‘transferable’.

• In principle, lexical sets are open-ended.

• In practice, a lexical set may have only 1 or 2 members, e.g. take a {look | glance}.

• No certainties in word meaning; only probabilities.

• … but probabilities can be measured.


A Simple CPA Entry

toast, verb

1. [[Person]] toast [[Food = bread, nuts, cheese]]

Implicature: cook or brown [[Food]] by exposure to radiant heat.

2. [[Person 1]] toast {[[Person 2]] | success | memory}

Implicature: honour [[Person 2]] by raising a glass of wine, then drinking some.
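An entry of this kind pairs each numbered pattern with its implicature, which suggests a simple record structure. The sketch below stores the toast entry that way; the `CPAPattern` class and the `implicature_for` helper are hypothetical names, not part of any published CPA tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CPAPattern:
    number: int
    pattern: str
    implicature: str


# The two patterns for "toast", copied from the entry above.
toast = [
    CPAPattern(1, "[[Person]] toast [[Food = bread, nuts, cheese]]",
               "cook or brown [[Food]] by exposure to radiant heat"),
    CPAPattern(2, "[[Person 1]] toast {[[Person 2]] | success | memory}",
               "honour [[Person 2]] by raising a glass of wine, then drinking some"),
]


def implicature_for(entry, number):
    """Look up the implicature attached to a numbered pattern; raise KeyError if absent."""
    for p in entry:
        if p.number == number:
            return p.implicature
    raise KeyError(number)


print(implicature_for(toast, 1))
```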


A more complicated verb: ‘take’

• 61 phrasal verb patterns, e.g.

[[Person]] take [[Garment]] off

[[Plane]] take off

[[Human Group]] take [[Business]] over

• 105 light verb uses (with specific objects), e.g.

[[Event]] take place

[[Person]] take {photograph | photo | snaps | picture}

[[Person]] take {the plunge}

• 18 ‘heavy verb’ uses, e.g.

[[Person]] take [[PhysObj]] [Adv[Direction]]

• 13 adverbial patterns, e.g.

[[Person]] take [[TopType]] seriously

[[Human Group]] take [[Child]] {into care}

• TOTAL: 197, and growing (but slowly)


Noun norms

• Norms for nouns are different in kind from norms for verbs.

• Adjectives and prepositions are more like verbs than nouns.

• A different analytical apparatus is required for nouns.

• Prototype statements for each true noun can be derived from a corpus.


What are the components of a normal context? – (2) Nouns

The apparatus for CPA (corpus pattern analysis) of nouns:

• Collocations.


Arranging collocates: storm (1)

WHAT DO STORMS DO?

• Storms blow.

• Storms rage.

• Storms lash coastlines.

• Storms batter ships and places.

• Storms hit ships and places.

• Storms ravage coastlines and other places. 


Arranging collocates: storm (2)

BEGINNING OF A STORM:

• Before it begins, a storm is brewing, gathering, or impending.

• There is often a calm or a lull before a storm.

• Storms last for a certain period of time.

• Storms break.

END OF A STORM:

• Storms abate.

• Storms subside.

• Storms pass.


Arranging collocates: storm (3)

WHAT HAPPENS TO PEOPLE IN A STORM?

• People can weather, survive, or ride (out) a storm.

• Ships and people may get caught in a storm.


Arranging collocates: storm (4)

WHAT KINDS OF STORMS ARE THERE?

• There are thunder storms, electrical storms, rain storms, hail storms, snow storms, winter storms, dust storms, sand storms, tropical storms…

• Storms are violent, severe, raging, howling, terrible, disastrous, fearful, ferocious…


Arranging collocates: storm (5)

TYPICAL QUALITIES OF STORMS:

• Storms, especially snow storms, may be heavy.

• An unexpected storm is a freak storm.

• The centre of a storm is called the eye of the storm.

• A major storm is remembered as the great storm (of [[Year]]).

____

• STORMS ARE ALSO ASSOCIATED WITH rain, wind, hurricanes, gales, and floods.


Why norms are important

These statements about abate and storm represent typical usage as well as typical meaning.

• They are empirically well founded (corpus-derived).

• This is where syntax meets semantics.


How is CPA different from FrameNet?

CPA:

• investigates syntagmatic criteria for distinguishing different meanings of polysemous words, in a “semantically shallow” way.

FrameNet:

• expresses the deep semantics of situations (frames);

• proceeds frame by frame, not word by word;

• analyses situations in terms of frame elements;

• studies meaning differences and similarities between different words in a frame;

• does not explicitly study meaning differences of polysemous words;

• does not analyze corpus data systematically, but goes fishing in corpora for examples in support of hypotheses;

• has problems grouping words into frames, and misses some;

• has no established inventory of frames;

• has no criteria for completeness of a lexical entry.


Challenges

• Extending the empirical discovery of lexical sets and other pattern features

• Learning to recognize all the features required by CPA patterns


Conclusions

• Creation of a selection context dictionary

• Development of a corpus-driven type system

• Identification of meaning by a richer set of criteria

• Basis for investigating the mechanisms of coercion and exploitation