Corpora, Dictionaries, and points in between in the
age of the web
Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass Ltd
Universities of Leeds & Sussex, UK
October 2009 Kilgarriff: FLTRP 2
Outline
• Precision and recall
• History of corpus lexicography
• Sketch Engine
  – demo
• Automatic Collocations Dictionary
  – demo
• Electronic dictionaries
Find me all the fat cats
a request for information
High recall
• Lots of responses
• Maybe not all good
High precision
• Fewer hits
• Higher confidence
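The trade-off on these two slides can be made concrete with a small sketch. The function name and the cat data below are hypothetical, chosen only to illustrate the "fat cats" request: precision is the share of responses that are good, recall is the share of good items that were found.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) of a set of responses against the truly relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: cats 1-4 are the fat ones.
fat_cats = {1, 2, 3, 4}
broad_search = {1, 2, 3, 4, 5, 6, 7, 8}  # lots of responses, not all good
narrow_search = {1, 2}                   # fewer hits, all of them good

print(precision_recall(broad_search, fat_cats))   # (0.5, 1.0)
print(precision_recall(narrow_search, fat_cats))  # (1.0, 0.5)
```

The broad search has high recall and lower precision; the narrow search is the reverse.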
Us precision, them recall
            Recall   Precision
Computers   good     bad
People      bad      good
Us precision, them recall
• True in many areas
  – web searching, Google
  – finding an image to illustrate a talk
• Nowhere more so than lexicography
Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• antonyms
• meanings
• translations
Outline
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs
Four ages of corpus lexicography
Age 1: Pre-computer
Oxford English Dictionary:
• 5 million index cards
Age 2: KWIC Concordances
• From 1980
• Computerised
• COBUILD project was innovator
• asian-kwic.html
• the coloured-pens method
Age 2: limitations
As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but it takes a long time
• 5000 lines: no
Age 3: Collocation statistics
• Problem: too much data - how to summarise?
• Solution: list of words occurring in neighbourhood of headword, with frequencies
• Sorted by salience
Collocation listing
For right collocates of save (>5 hits)

word       fr(x+y)  fr(y)    word        fr(x+y)  fr(y)
forests    6        170      life        36       4875
$1.2       6        180      dollars     8        1668
lives      37       1697     costs       7        1719
enormous   6        301      thousands   6        1481
annually   7        447      face        9        2590
jobs       20       2001     estimated   6        2387
money      64       6776     your        7        3141
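The order of this listing appears consistent with sorting by the ratio fr(x+y) / fr(y), reading down the left column and then the right. For a fixed headword that ratio gives the same ranking as a mutual-information score of the form log2(N · fr(x+y) / (fr(x) · fr(y))), because N and fr(x) are constants. The slide does not state which measure was used, so treat this as an assumption. The sketch below only re-derives the order from the figures shown; the function name is hypothetical, and the MI values themselves cannot be computed because N and fr(save) are not given.

```python
# (collocate, fr(x+y), fr(y)) copied from the listing, deliberately in alphabetical order.
rows = [
    ("$1.2", 6, 180), ("annually", 7, 447), ("costs", 7, 1719),
    ("dollars", 8, 1668), ("enormous", 6, 301), ("estimated", 6, 2387),
    ("face", 9, 2590), ("forests", 6, 170), ("jobs", 20, 2001),
    ("life", 36, 4875), ("lives", 37, 1697), ("money", 64, 6776),
    ("thousands", 6, 1481), ("your", 7, 3141),
]


def rank_by_ratio(rows):
    """Sort collocates by joint frequency relative to the collocate's own frequency."""
    return sorted(rows, key=lambda row: row[1] / row[2], reverse=True)


ranked = [word for word, _, _ in rank_by_ratio(rows)]
print(ranked[:3])  # ['forests', '$1.2', 'lives']
```

Note how "money", the most frequent collocate in raw terms (64 hits), drops to seventh place once its overall frequency is taken into account.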
Collocation statistics
• Which words?
  – next word
  – last word
  – window, +1 to +5; window, -5 to -1
• How sorted?
  – most common collocates, but for most nouns the most common collocate is "the"
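A minimal sketch of the window options listed above, assuming the corpus is already a flat list of tokens. The function name, its parameters, and the toy sentence are all hypothetical. It counts every word within a given number of positions before or after each hit of the headword.

```python
from collections import Counter


def window_collocates(tokens, headword, left=0, right=5):
    """Count words within `left` tokens before or `right` tokens after each hit of `headword`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != headword:
            continue
        start = max(0, i - left)
        end = min(len(tokens), i + right + 1)
        for j in range(start, end):
            if j != i:  # skip the headword occurrence itself
                counts[tokens[j]] += 1
    return counts


# Toy corpus, invented for illustration.
tokens = "we save money and they save the forests to save the jobs".split()

next_word = window_collocates(tokens, "save", left=0, right=1)
print(next_word.most_common())  # [('the', 2), ('money', 1)]
```

Even in this tiny example, sorting by raw frequency puts "the" at the top, which is the problem with the "most common collocates" ordering.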
Age-3 collocation statistics: limitations
• Lists contain junk
• Unsorted for type
  – MI lists mix adverbs, subjects, objects, prepositions
• What we really want:
  – noise-free lists
  – one list for each grammatical relation
Age 4: The word sketch
Automatic one-page summary of a word’s grammatical and collocational behaviour
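The idea of one list per grammatical relation can be sketched as follows. This is not the Sketch Engine's implementation: the function name and data are hypothetical, the input is assumed to be already-parsed triples of the form <relation, headword, collocate> (the notation used later in the talk), and collocates are ranked by raw frequency rather than by any salience statistic.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword, top=3):
    """Group a headword's collocates by grammatical relation, most frequent first."""
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    return {rel: counts.most_common(top) for rel, counts in by_relation.items()}


# Invented triples, for illustration only.
triples = [
    ("object", "drink", "wine"),
    ("object", "drink", "beer"),
    ("object", "drink", "wine"),
    ("subject", "drink", "guest"),
    ("object", "pour", "wine"),
]

print(word_sketch(triples, "drink"))
# {'object': [('wine', 2), ('beer', 1)], 'subject': [('guest', 1)]}
```

Objects and subjects land in separate lists instead of being mixed together as in a plain window-based listing.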
The Sketch Engine
• Input:
  – any corpus, any language
• Lemmatised, part-of-speech tagged
  – specification of grammatical relations
• Word sketches integrated with corpus query system
  – supports complex searching, sorting etc.
• First release early 2004
Recap: Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• meanings
• translations
Thesaurus
• Also near-synonyms
  – are there any true synonyms?
• Distributional: which words share the same distributions
  – if corpus contains <object, drink, wine>, <object, drink, beer>
  – 1 pt similarity between wine and beer
  – gather all points; find nearest neighbours
• Sparck Jones, Lin, Grefenstette
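The point-scoring scheme on this slide can be sketched directly. The function names and triples are hypothetical, and each shared <relation, headword> context counts as exactly one unweighted point, as the slide describes. Any weighting of contexts is beyond this sketch.

```python
from collections import defaultdict
from itertools import combinations


def similarity_points(triples):
    """Award one point to a pair of words for each (relation, headword) context they share."""
    contexts = defaultdict(set)
    for relation, head, word in triples:
        contexts[word].add((relation, head))
    points = {}
    for a, b in combinations(sorted(contexts), 2):
        shared = len(contexts[a] & contexts[b])
        if shared:
            points[(a, b)] = shared
    return points


def nearest_neighbours(points, word):
    """List the words that share points with `word`, highest score first."""
    scored = [(b if a == word else a, n) for (a, b), n in points.items() if word in (a, b)]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Invented triples, for illustration only.
triples = [
    ("object", "drink", "wine"), ("object", "drink", "beer"),
    ("object", "pour", "wine"), ("object", "pour", "beer"),
    ("object", "drink", "tea"), ("object", "read", "book"),
]

points = similarity_points(triples)
print(nearest_neighbours(points, "wine"))  # [('beer', 2), ('tea', 1)]
```

Here "beer" is the nearest neighbour of "wine" because both occur as objects of "drink" and "pour", while "book" shares no context with either and never appears.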
Electronic dictionaries
• Conference on them last week
• Rundell quotation
• On
  – PC
  – Handheld
  – Cellphone
  – Web
On PCs
• CD-ROMs as added extra
  – No income model
  – Large extra publishing cost
  – No extra income
Handhelds
• Students like them, teachers don’t
  – Subversive!
  – Fast to use: used even for conversation
• Many dictionaries on one device
  – Users usually do not know which
  – For publishers
    • Complex distribution channels
    • Dictionary publishers have little control
Cellphones
Web dictionaries
• Traditional publishers vs new players
• Business models
  – Free + premium
  – Advertising
• How many hits/month?
  – Macmillan 2.5m
  – Cambridge UP 30m
  – Leo 100m