Thesauruses for Natural Language Processing
Adam Kilgarriff
Lexicography MasterClass and University of Brighton

Outline
Definition
Uses for NLP
WASPS thesaurus
web thesauruses
Argument: words not word senses
Evaluation proposals
Cyborgs

What is a thesaurus?
a resource that groups words according to similarity

Manual and automatic
Manual
  – Roget, WordNets, many publishers
Automatic
  – Sparck Jones (1960s), Grefenstette (1994), Lin (1998), Lee (1999)
  – aka distributional
  – two words are similar if they occur in same contexts
Are they comparable?

Thesauruses in NLP
sparse data
does x go with y?
  – don’t know, they have never been seen together
New question: does x+friends go with y+friends?
  – indirect evidence for x and y
  – thesaurus tells us who friends are
  – “backing off”
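
The backing-off idea can be sketched in a few lines of Python. This is an illustration only: the `goes_with` helper, the toy co-occurrence counts and the toy thesaurus entries are all made up for the example and are not taken from the talk.

```python
from itertools import product

# Hypothetical toy data: observed co-occurrence counts and a tiny thesaurus.
PAIR_COUNTS = {("drink", "beer"): 12, ("sip", "wine"): 4}
FRIENDS = {
    "sip": ["drink", "gulp"],
    "stout": ["beer", "ale", "wine"],
}

def goes_with(x, y, counts=PAIR_COUNTS, friends=FRIENDS):
    """Return (evidence, source): the direct count if non-zero, else a backed-off count."""
    direct = counts.get((x, y), 0)
    if direct > 0:
        return direct, "direct"
    # Back off: pool the counts over x+friends and y+friends.
    xs = [x] + friends.get(x, [])
    ys = [y] + friends.get(y, [])
    pooled = sum(counts.get(pair, 0) for pair in product(xs, ys))
    return pooled, "backed-off"

print(goes_with("drink", "beer"))  # (12, 'direct')
print(goes_with("sip", "stout"))   # (16, 'backed-off')
```

"sip" and "stout" never occur together in the toy counts, but their friends do, so the pooled count gives indirect evidence.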

Relevant in:
Parsing
  – PP-attachment
  – conjunction scope
Bridging anaphors
Text cohesion
Word sense disambiguation (WSD)
Speech understanding
Spelling correction

Speech understanding
He’s as headstrong as an alleg***** in the upwaters of the Yangtze
allegory? in upwaters? No
alligator? in upwaters? No
allegory+friends in upwaters? No
alligator+friends in upwaters? Yes

PP-attachment
investigate stromatolite with microscope/speckles
  – microscope: verb attachment
  – speckles: noun attachment
inspect jasper with spectrometer
  – which?

PP attachment (cont)
compare frequencies of
  – <inspect, with, spectrometer>
  – <jasper, with, spectrometer>
both zero? Try
  – <inspect+friends, with, spectrometer+friends>
  – <jasper+friends, with, spectrometer+friends>
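
A minimal sketch of this comparison, assuming triple counts are held in a dictionary and the thesaurus is a word-to-friends mapping. The function names, the counts and the friends lists are hypothetical, and a tie is reported as "unknown" rather than guessed.

```python
def pooled_count(head, prep, noun, counts, friends):
    """Sum triple counts over head+friends and noun+friends."""
    heads = [head] + friends.get(head, [])
    nouns = [noun] + friends.get(noun, [])
    return sum(counts.get((h, prep, n), 0) for h in heads for n in nouns)

def attach(verb, obj, prep, noun, counts, friends):
    """Return 'verb', 'noun' or 'unknown' for a verb-object-prep-noun sequence."""
    v = counts.get((verb, prep, noun), 0)
    n = counts.get((obj, prep, noun), 0)
    if v == 0 and n == 0:
        # Both zero: back off to the friends of each word.
        v = pooled_count(verb, prep, noun, counts, friends)
        n = pooled_count(obj, prep, noun, counts, friends)
    if v == n:
        return "unknown"
    return "verb" if v > n else "noun"

# Hypothetical toy counts and thesaurus entries.
COUNTS = {("examine", "with", "microscope"): 7, ("rock", "with", "speckles"): 2}
FRIENDS = {
    "inspect": ["examine", "investigate"],
    "jasper": ["rock", "stromatolite"],
    "spectrometer": ["microscope", "telescope"],
}

print(attach("inspect", "jasper", "with", "spectrometer", COUNTS, FRIENDS))  # verb
```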

Conjunction scope
Compare
  – old boots and shoes
  – old boots and apples
Are the shoes old?
Are the apples old?
Hypothesis:
  – wide scope only when words are similar
hard problem: thesaurus might help
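
The hypothesis can be written down as a crude heuristic. The sketch below assumes a thesaurus given as a mapping from a word to a set of near neighbours; the `modifier_scope` name and the neighbour sets are invented for illustration, and the slide itself calls this a hard problem, so this is a starting point at best.

```python
def modifier_scope(noun1, noun2, neighbours):
    """Guess 'wide' if either noun lists the other as a neighbour, else 'narrow'."""
    similar = (
        noun2 in neighbours.get(noun1, set())
        or noun1 in neighbours.get(noun2, set())
    )
    return "wide" if similar else "narrow"

# Hypothetical toy neighbour sets.
NEIGHBOURS = {
    "boots": {"shoes", "trainers", "sandals"},
    "apples": {"pears", "plums"},
}

print(modifier_scope("boots", "shoes", NEIGHBOURS))   # wide
print(modifier_scope("boots", "apples", NEIGHBOURS))  # narrow
```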

Bridging anaphor resolution
  – Maria bought a large apple. The fruit was red and crisp.
fruit and apple co-refer
How to find co-referring terms?

Text cohesion
words on same theme
  – same segment
change in theme of words
  – new segment
same theme: same thesaurus class
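
As a rough illustration of "same theme: same thesaurus class", the sketch below opens a new segment whenever the thesaurus class of the incoming word differs from the previous one. The `segment` function and the word-to-class table are assumptions for the example; a practical segmenter would need smoothing over windows of words rather than reacting to every single class change.

```python
def segment(words, word_class):
    """Group words into segments, starting a new one when the thesaurus class changes."""
    segments = []
    previous = None
    for word in words:
        cls = word_class.get(word)
        if cls is None:
            continue  # words outside the toy thesaurus carry no theme signal
        if cls != previous:
            segments.append([])  # theme changed: open a new segment
            previous = cls
        segments[-1].append(word)
    return segments

# Hypothetical toy thesaurus classes.
WORD_CLASS = {
    "carp": "fishing", "bream": "fishing", "net": "fishing",
    "sword": "weapons", "spear": "weapons", "shield": "weapons",
}

words = ["carp", "bream", "the", "net", "sword", "spear", "shield"]
print(segment(words, WORD_CLASS))
# [['carp', 'bream', 'net'], ['sword', 'spear', 'shield']]
```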

Word Sense Disambiguation (WSD)
pike: fish or weapon
  – We caught a pike this afternoon
probably no direct evidence for
  – catch pike
probably is direct evidence for
  – catch {pike, carp, bream, cod, haddock, …}
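
A small sketch of the pike example, assuming each candidate sense comes with a thesaurus class of similar words and that verb-object counts are available. The `disambiguate` function, the class memberships beyond those on the slide, and all counts are hypothetical.

```python
def disambiguate(verb, sense_classes, obj_counts):
    """Pick the sense whose thesaurus class is seen most often as object of the verb."""
    scores = {
        sense: sum(obj_counts.get((verb, member), 0) for member in members)
        for sense, members in sense_classes.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical classes for the two uses of "pike".
PIKE_SENSES = {
    "fish": ["pike", "carp", "bream", "cod", "haddock"],
    "weapon": ["pike", "spear", "lance", "halberd"],
}
# Hypothetical verb-object counts; note there is no count for (catch, pike).
OBJ_COUNTS = {("catch", "carp"): 9, ("catch", "cod"): 14, ("catch", "spear"): 1}

best, scores = disambiguate("catch", PIKE_SENSES, OBJ_COUNTS)
print(best, scores)  # fish {'fish': 23, 'weapon': 1}
```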

WordNet, Roget
widely used for all the above

The WASPS thesaurus
  – credit: David Tugwell
  – EPSRC grant K8931
POS-tag, lemmatise and parse the BNC (100M words)
Find all grammatical relations
  – <obj, climb, bank>
  – <modifier, big, bank>
  – <subject, bank, refuse>
70 million triples

WASPS thesaurus (cont)
Similarity:
  – <obj, drink, beer>
  – <obj, drink, wine>
one point of similarity between beer and wine
count all points of similarity between all pairs of words
weight according to frequencies
  – product of MI: Lin (1998)
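
One possible reading of "count shared triples, weight by product of MI" is sketched below. It is not the WASPS implementation: the MI formula follows the general shape used in Lin (1998), the score is left unnormalised, only positively associated contexts are kept, and the word being described is assumed to sit in the last slot of each triple. The triple counts are toy values.

```python
import math
from collections import Counter, defaultdict

def mutual_information(triples):
    """Map each (rel, head, dep) triple to a pointwise MI score.

    MI = log( f(rel, head, dep) * f(rel) / (f(rel, head) * f(rel, dep)) )
    """
    rel_total, rel_head, rel_dep = Counter(), Counter(), Counter()
    for (rel, head, dep), n in triples.items():
        rel_total[rel] += n
        rel_head[rel, head] += n
        rel_dep[rel, dep] += n
    return {
        (rel, head, dep): math.log(
            n * rel_total[rel] / (rel_head[rel, head] * rel_dep[rel, dep])
        )
        for (rel, head, dep), n in triples.items()
    }

def similarity(word1, word2, triples):
    """Sum, over contexts shared by both words, the product of their MI scores."""
    contexts = defaultdict(dict)  # word -> {(rel, head): MI}
    for (rel, head, dep), score in mutual_information(triples).items():
        if score > 0:  # keep only positively associated contexts
            contexts[dep][rel, head] = score
    shared = contexts[word1].keys() & contexts[word2].keys()
    return sum(contexts[word1][c] * contexts[word2][c] for c in shared)

# Hypothetical toy triple counts.
TRIPLES = {
    ("obj", "drink", "beer"): 10, ("obj", "drink", "wine"): 8,
    ("obj", "climb", "bank"): 5, ("obj", "rob", "bank"): 6,
    ("modifier", "cold", "beer"): 4, ("modifier", "cold", "wine"): 2,
    ("modifier", "big", "bank"): 3,
}

print(similarity("beer", "wine", TRIPLES) > similarity("beer", "bank", TRIPLES))  # True
```

Beer and wine share two contexts (object of drink, modified by cold), while beer and bank share none, so the first pair scores higher.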

Word Sketches
one-page summary of a word’s grammatical and collocational behaviour
demo: http://wasps.itri.bton.ac.uk
the Sketch Engine
  – input any corpus
  – generate word sketches and thesaurus
  – just available now

Nearest neighbours
zebra: giraffe buffalo hippopotamus rhinoceros gazelle antelope cheetah hippo leopard kangaroo crocodile deer rhino herbivore tortoise primate hyena camel scorpion macaque elephant mammoth alligator carnivore squirrel tiger newt chimpanzee monkey
exception: exemption limitation exclusion instance modification restriction recognition extension contrast addition refusal example clause indication definition error restraint reference objection consideration concession distinction variation occurrence anomaly offence jurisdiction implication analogy
pot: bowl pan jar container dish jug mug tin tub tray bag saucepan bottle basket bucket vase plate kettle teapot glass spoon soup box can cake tea packet pipe cup

VERBS
measure: determine assess calculate decrease monitor increase evaluate reduce detect estimate indicate analyse exceed vary test observe define record reflect affect obtain generate predict enhance alter examine quantify relate adjust
boil: simmer heat cook fry bubble cool stir warm steam sizzle bake flavour spill soak roast taste pour dry wash chop melt freeze scald consume burn mix ferment scorch soften

ADJECTIVES
hypnotic: haunting piercing expressionless dreamy monotonous seductive meditative emotive comforting expressive mournful healing indistinct unforgettable unreadable harmonic prophetic steely sensuous soothing malevolent irresistible restful insidious expectant demonic incessant inhuman spooky
pink: purple yellow red blue white pale brown green grey coloured bright scarlet orange cream black crimson thick soft dark striped thin golden faded matching embroidered silver warm mauve damp

Nearest neighbours
crane: winch heron tractor truck swan
winch: crane mast rigging pump tractor
swan: heron crane gull tern curlew
heron: tern gull swan crane flamingo

no clustering (tho’ could be done)
no hierarchy (tho’ could be done)
rhythm
all on the web: http://wasps.itri.bton.ac.uk
  – registration required

The web
an enormous linguist’s playground
  – Computational Linguistics Special Issue, Kilgarriff and Grefenstette (eds), 29 (3)
    • (coming soon)

Google sets
http://labs.google.com/sets
Input: zebra giraffe buffalo
Output: kudu hyena impala leopard hippo waterbuck elephant cheetah eland
Input: harbin beijing nanking
Output: shanghai chengdu guangzhou hangzhou changchun zhejiang kunming dalian jinan fuzhou

Tree structure
Roget
  – all human knowledge as tree structure
  – 1000 top categories
    • subdivisions
      – like this
        » etc
        » etc

Directories and thesauruses
Yahoo, http://www.yahoo.com
Open directory project, http://dmoz.org
  – all human activity as tree structure
plus corpus at every node
  – gather corpus, identify domain vocabulary
    • Gonzalo and colleagues, Madrid, CL Special Issue
    • Agirre and colleagues, ‘topic signatures’

Words and word senses
automatic thesauruses
  – words
manual thesauruses
  – simple hierarchy is appealing
  – homonyms
  – “aha! objects must be word senses”

Problems
Theoretical
Practical

Theoretical

Wittgenstein
Don’t ask for the meaning, ask for the use

Practical

Problems
Practical
  – a thesaurus is a tool
  – if the tool organises word senses you must do WSD before you can use it
  – WSD: state of the art, optimal conditions: 80%
“To use this tool, first replace one fifth of your input with junk”

Avoid word senses
This word has three meanings/senses
This word has three kinds of use
  – well founded
  – empirical
  – we can study it

sorry, Roget

sorry, AI
AI model for NLP:
  – NLP turns text into meanings
  – AI reasons over meanings
  – word meanings are concepts in an ontology
  – a Roget-like thesaurus is (to a good approximation) an ontology
  – Guarino: “cleansing” WordNet
If a thesaurus groups words in their various uses (not meanings)
  – not the sort of thing AI can reason over

sorry, AI
“linguistic expressions prompt for meanings rather than express meanings”
  – Fauconnier and Turner 2003
It would be nice if …
But …

Evaluation
manual thesauruses
  – not done
automatic thesauruses: attempts
  – pseudo-disambiguation (Lee 1999)
  – with ref to manual ones (Lin 1998)

Task-based evaluation
Parsing
  – PP-attachment
  – conjunction scope
Bridging anaphors
Text cohesion
Word sense disambiguation (WSD)
Speech understanding
Spelling correction

What is performance at the task
  – with no thesaurus
  – with Roget
  – with WordNet
  – with WASPS
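
The shape of such a comparison can be sketched as a small harness that runs one task under each thesaurus and reports accuracy. Everything here is hypothetical: the `compare_thesauruses` and `scope_task` names, the stand-in task (a crude conjunction-scope guess), the toy thesaurus and the two gold items. The figures it prints say nothing about Roget, WordNet or WASPS.

```python
def accuracy(decide, test_items):
    """Fraction of (item, gold) pairs for which decide(item) equals gold."""
    correct = sum(1 for item, gold in test_items if decide(item) == gold)
    return correct / len(test_items)

def compare_thesauruses(task, thesauruses, test_items):
    """Score one task under each thesaurus; None stands for 'no thesaurus'."""
    return {
        name: accuracy(lambda item: task(item, thesaurus), test_items)
        for name, thesaurus in thesauruses.items()
    }

def scope_task(item, thesaurus):
    """Stand-in task: guess the scope of a modifier over 'noun1 and noun2'."""
    noun1, noun2 = item
    if thesaurus is None:
        return "narrow"  # baseline guess without a thesaurus
    return "wide" if noun2 in thesaurus.get(noun1, set()) else "narrow"

# Hypothetical gold data and thesauruses.
TEST_ITEMS = [(("boots", "shoes"), "wide"), (("boots", "apples"), "narrow")]
THESAURUSES = {
    "no thesaurus": None,
    "toy thesaurus": {"boots": {"shoes", "sandals"}},
}

results = compare_thesauruses(scope_task, THESAURUSES, TEST_ITEMS)
print(results)  # {'no thesaurus': 0.5, 'toy thesaurus': 1.0}
```

Keeping the task fixed and swapping only the thesaurus is the point of the design: any difference in the scores is then attributable to the resource.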

Plans
set up evaluation tasks
theseval
web-based thesaurus
  – Open Directory Project hierarchies
campaign

Cyborgs
Robots: will they take over?
Rod Brooks’s answer:
  – Wrong question: greatest advances are in what the human+computer ensemble can do

Cyborgs
A creature that is partly human and partly machine
  – Macmillan English Dictionary

Cyborgs and the Information Society
The thesaurus-making agent is part human (for precision), part computer (for recall).

Summary: Thesauruses for NLP
Definition
Uses for NLP
WASPS thesaurus
web thesauruses
Argument: words not word senses
Evaluation proposals
Cyborgs

Thesaurus-makers of the future?