Corpora by Web Services
Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass Ltd
Universities of Leeds and Sussex
Leeds, April 2010 Kilgarriff: Corpora by Web Services 2
Starting a PhD in NLP
  Then
    Prolog
    Type in a few grammar rules, lexical entries, example sentences
    We’re off!
Now
  Corpus
    Which?
    Budget/schedule
    How much can we afford?
    Hard disk space
  Access software
    Build: big job, making it fast is hard
    or: research, acquire, install, maintain …
Research question
  Morphology, syntax, discourse structure, semantics, anaphor
First six months at least
  Acquiring data, software
  Complications
If you’re not super-geeky
  Did I do it properly?
  Dumbing down
    Let’s choose an easier question
  Looking over shoulder
Corpora by web services
  Possible?
  Already available
Sketch Engine
  Corpus querying
    Fast
    Handles large corpora
    In use for lexicography at OUP, CUP, Macmillan, Collins, Le Robert
  Word sketches
    Data-driven summary of a word’s grammatical and collocational behaviour
Corpora
  (sizes apparently in millions of words)

  Language    Size   Language    Size   Language    Size
  Arabic       174   Hindi         31   Russian      188
  Chinese      456   Indonesian   102   Slovak       536
  Czech        800   Irish         34   Slovene      738
  Dutch        128   Italian     1910   Spanish      117
  English     5508   Japanese     409   Swedish      114
  French       126   Norwegian     95   Telugu         5
  German      1627   Persian        6   Thai         108
  Greek        149   Portuguese    66   Vietnamese   174
                     Romanian      53   Welsh         63
Big, high quality corpora
  Big
    Performance
      Banko and Brill 2004: there’s no data like more data
    Ample data for rare phenomena
    Big subcorpora
      5b; medical: 30m
Quality
  Bad data
    Spam, navigation bars, duplicates, lists, bungled formatting, wrong language …
  Less discussed
    Maybe a footnote
    I wonder why
  Quick fixes and run
The Google/Yahoo/Bing option
  Appeal
    No setup costs
    Start googling today
Very interesting work
  Keller and Lapata
    Validity of search engine (SE) counts vs BNC counts vs psycholinguistic validity of collocations
    36 queries per collocation: “fulfil obligation”, “fulfil ? obligation”, “fulfilling obligations” ...
  Nakov, Nakov and Hearst
    Great interest in query syntax
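The idea of issuing many query variants per collocation can be sketched roughly as below. This is an illustration only: the slides do not list the 36 variants, so the inflected forms, the two gap patterns (adjacent, and one "?" placeholder as written on the slide) and the function name `query_variants` are all assumptions, and this toy setup yields 16 queries rather than 36.

```python
from itertools import product


def query_variants(verb_forms, noun_forms, gaps=("", "?")):
    """Build quoted phrase queries for every verb form x noun form x gap filler.

    Hypothetical helper: an empty gap gives an adjacent pair, and "?" stands
    for one intervening word, as on the slide.
    """
    queries = []
    for verb, noun, gap in product(verb_forms, noun_forms, gaps):
        words = [verb, gap, noun] if gap else [verb, noun]
        queries.append('"' + " ".join(words) + '"')
    return queries


# Assumed inflections for the slide's example collocation.
verb_forms = ["fulfil", "fulfils", "fulfilled", "fulfilling"]
noun_forms = ["obligation", "obligations"]

variants = query_variants(verb_forms, noun_forms)
print(len(variants))
for query in variants[:4]:
    print(query)
```

Each query's hit count would then be summed or compared across variants. Whether a given search engine honours a placeholder inside a quoted phrase is exactly the kind of undocumented behaviour the next slides complain about.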
but
  Limited hits-per-query
  Limited hits-per-day
  Sort order
    Not documented
    'Unsorted' not possible
  Snippets too short for research
  No (documented) morphology
  Limited query syntax
and
  At mercy of commercial company
    Might change at any time
    Not replicable
So
  Appeal
    No setup costs
  Serious research
    Many difficult practical issues
    Not a tool designed for linguists
  Conclusion
    If only search engine indexes are big enough: yes
    Else: no
Strategy
  More languages
    Corpus Factory, as Sharoff
  Bigger
    Big Web Corpus (BiWeC)
    Currently 5.5b fully processed
    Target 20b
  Better
New Model Corpus
  BNC is past its sell-by
    Early 1990s
    Pre web
    Still dominant model
  New model needed
Model
  Small: model train
  Design: software model
  NMC
    1:100 for BiWeC-scale
    100m
    Update of BNC as design model
    Data from web but text type available
Open-source/collaboration
  We distribute
  You annotate
    POS-tags, parses, anaphor, discourse moves, semantics, multiwords, entity-types ...
    Domain, register, region ...
  Send us annotations
  We integrate
    And give access in SkE
Divide and rule
  Bigger (BiWeC)
  Better (NMC)
  Take best annotations from NMC, apply to BiWeC
    Accuracy
    Speed
    Usefulness
    Good collaboration
TEDDCLOG
  Taiwan English Data-Driven CLOze Generation
  with Simon Smith and colleagues, Taipei
  API case study
Cloze
  'Fill-the-gap'
    Several metals _____ violently with cold water
    A: behave  B: react  C: realise  D: respond
  Popular with students, teachers, testers
  Unpopular with theorists :-(
One objection
  Test item writers make them up
    Not naturally-occurring language
    The Sinclair-Johns critique
  Also: expensive
  TEDDCLOG
    Uses corpus sentences and distractors
[Diagram: TEDDCLOG modules, worked example]
  Key: react
  Thesaurus module: behave, interact, respond
  Diffs module: metals behave x, metals respond x, metals realise x, metals react √
  Concordance module: Several metals react violently with cold water.
  Text processing module: Several metals ___ violently with cold water. (a) behave (b) react (c) realise (d) respond
  Distractors: behave, realise, respond
API calls
  Find distractors
    Thesaurus
  Find key-only collocate
    Sketch diffs
    Needs optimising
  Find carrier sentence
    Concordance with GDEX module
    Good Dictionary Example Finder
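The three calls above can be pictured as a small pipeline. The sketch below is a toy illustration under stated assumptions, not the TEDDCLOG code and not the Sketch Engine API: the dictionaries `THESAURUS`, `ATTESTED` and `CONCORDANCE` are made-up in-memory stand-ins for the thesaurus, sketch-diff and concordance services, and all function names are hypothetical.

```python
import re

# Toy stand-ins for the three services (assumed data, not real corpus output).
THESAURUS = {"react": ["behave", "interact", "respond", "realise"]}
# (collocate, verb) pairs treated as attested in the toy corpus.
ATTESTED = {("metals", "react"), ("metals", "interact")}
CONCORDANCE = {"react": ["Several metals react violently with cold water."]}


def find_distractors(key):
    """Step 1: thesaurus lookup returns words similar to the key."""
    return THESAURUS.get(key, [])


def key_only(collocate, key, candidates):
    """Step 2: keep candidates that do not occur with a collocate of the key."""
    if (collocate, key) not in ATTESTED:
        return []
    return [c for c in candidates if (collocate, c) not in ATTESTED]


def make_item(key, collocate, n_distractors=3):
    """Step 3: blank the key in a carrier sentence and attach sorted options."""
    distractors = key_only(collocate, key, find_distractors(key))[:n_distractors]
    sentence = CONCORDANCE[key][0]
    stem = re.sub(rf"\b{re.escape(key)}\b", "_____", sentence, count=1)
    options = sorted(distractors + [key])
    return stem, options, key


stem, options, answer = make_item("react", "metals")
print(stem)
for label, option in zip("abcd", options):
    print(f"({label}) {option}")
```

In this toy data "interact" is dropped because it is marked as attested with "metals", which leaves the distractors shown on the diagram slide. A real version would replace the dictionaries with API calls and would rank carrier sentences, for instance with GDEX, rather than take the first one.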
Current status
  TEDDCLOG
    Next phase: producing decent results
  Corpora by Web Services
    Upping server capacity
    Looking for users (currently with UKWaC)
  New Model Corpus
    Nervous over copyright but
    Available in SkE, for download
Another announcement: DANTE
  Lexical database for English
    Detailed, accurate, extensive
    Highly corpus-driven
    3 yr project
    18 expert lexicographers
    Led by Sue Atkins
      BNC, FrameNet, Euralex, COBUILD...
  English side, New English-Irish dictionary
  Available for NLP research imminently