Auckland 2012 Kilgarriff: Web Corpora 1
Web Corpora
Adam Kilgarriff
You can’t help noticing
• Replaceable or replacable?
  – http://googlefight.com
• Very very large
  – 2006 estimates for duplicate-free, linguistic, Google-indexed web
    • German: 44 billion words
    • Italian: 25 billion words
    • English: 1,000 billion to 10,000 billion words
• Most languages
• Most language types
• Up-to-date
• Free
• Instant access
Overview
• Is the web a corpus?
• Representativeness
• What is out there?
  – Web1T
• Googleology
• Web corpus types
  – Targeted sites: Oxford English Corpus
  – General: WaC family
  – WebBootCaT
Is the web a corpus?
• Sinclair, in “Developing linguistic corpora, a guide to good practice. Corpus and Text – Basic Principles”:
  “…not a corpus because
    • dimensions unknown, constantly changing
    • not designed from a linguistic perspective”
• But
  – We can find out dimensions
  – Many corpora are not designed
    • “as much chatroom dialogue as I can get”
• Def: a corpus is a collection of texts
  – when viewed as an object of language research
Is the web a corpus?
Yes
but it’s not representative
Theory
A random sample of a population is representative of it.
Observations on sample support inferences about population
(within confidence bounds)
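To make "within confidence bounds" concrete, here is a minimal sketch that estimates a population proportion from a simple random sample and attaches a 95% normal-approximation interval. The function name `proportion_interval` and the example counts are illustrative assumptions, not taken from the talk.

```python
import math

def proportion_interval(hits, sample_size, z=1.96):
    """Return (estimate, low, high) for a population proportion.

    Uses the normal-approximation (Wald) interval, which is only
    reasonable for a genuinely random sample that is not too small.
    """
    if sample_size <= 0 or not 0 <= hits <= sample_size:
        raise ValueError("need 0 <= hits <= sample_size and sample_size > 0")
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    return p, max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical numbers: 120 of 1,000 randomly sampled documents show a feature.
estimate, low, high = proportion_interval(120, 1000)
print(f"{estimate:.3f} (95% interval {low:.3f} to {high:.3f})")
```

The interval is only meaningful if the sample really was drawn at random from a defined population.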
Theory
A random sample of a population is …
• What is the population?
  – production and reception
  – speech and text
  – copying
Theory
• Population not defined
• Representative sample not possible
sublanguage
• Language = core + sublanguages
• Options for corpus construction
  – none
  – some
  – all
• None
  – impoverished view of language
• Some: BNC
  – cake recipes and gastro-uterine disease
  – not car repair manuals or astronomy or …
• All: until recently, not viable
Representativeness
• The web is not representative
• but nor is anything else
• Text type variation
  – under-researched, lacking in theory
    • Atkins, Clear and Ostler 1993 on design brief for BNC; Biber 1988, Kilgarriff 2001, Sharoff 2006
• Text type is an issue across linguistics
  – Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there
What is out there?
• What text types are there on the web?
  – some are new: chatroom
  – proportions
    • is it overwhelmed by porn? How much?
• Hard question
Comparing frequency lists
• Web1T
  – Present from Google
  – All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
    • that’s 1,000,000,000,000
• Compare with BNC
  – Take top 50,000 items of each
  – 105 Web1T words not in BNC top 50k
  – 50 words with highest Web1T:BNC ratio
  – 50 words with lowest ratio
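The comparison above can be sketched roughly as follows. This is one possible reading of the procedure, not the code used for the talk: the function names are hypothetical, both inputs are assumed to be plain `{word: count}` dictionaries, and ratios are computed on relative frequencies for words present in both top-n lists.

```python
def top_items(freqs, n):
    """Return the n most frequent words of a {word: count} mapping."""
    return sorted(freqs, key=freqs.get, reverse=True)[:n]

def compare_lists(web, ref, n=50_000, k=50):
    """Compare two frequency lists.

    Returns (only_in_web, web_high, web_low):
      only_in_web - words in the web top-n that are missing from the ref top-n
      web_high    - the k shared words with the highest web:ref ratio
      web_low     - the k shared words with the lowest web:ref ratio
    Ratios use relative frequencies so that corpus size cancels out.
    """
    web_total = sum(web.values())
    ref_total = sum(ref.values())
    web_top = top_items(web, n)
    ref_top = set(top_items(ref, n))

    only_in_web = [w for w in web_top if w not in ref_top]
    shared = [w for w in web_top if w in ref_top]
    ratio = {w: (web[w] / web_total) / (ref[w] / ref_total) for w in shared}
    ranked = sorted(shared, key=ratio.get, reverse=True)
    return only_in_web, ranked[:k], ranked[-k:][::-1]

# Toy counts, invented purely to exercise the function.
web = {"the": 1000, "url": 50, "she": 20, "browser": 30}
ref = {"the": 1000, "she": 80, "url": 1, "walked": 40}
print(compare_lists(web, ref, n=4, k=1))
```

With real data, `web` would hold the Web1T unigram counts and `ref` the BNC counts.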
Web-high (155 terms)
• 61 web and computing
  – config browser spyware url www forum
• 38 porn
• 22 US English (incl Spanish influence – los)
• 18 business/products common on web
  – poker viagra lingerie ringtone dvd casino rental collectible tiffany
  – NB: BNC is old
• 4 legal
  – trademarks pursuant accordance herein
Web-low
• Exclude British English, transcription/tokenisation anomalies
  – herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him
Observations
• Pronouns and past tense verbs
  – Fiction
• Masc vs fem
• Yesterday
  – Probably daily newspapers
• Constancy of ratios:
  – He/him/himself
  – She/her/herself
• The web
  – a social, cultural, political phenomenon
  – new, little understood
  – a legitimate object of science
  – mostly language
• we are well placed
  – a lot of people will be interested
• Let’s
  – study the web
  – source of language data
  – apply our tools for web use (dictionaries, MT)
  – use the web as infrastructure
Using Search Engines
No setup costs
Start querying today
Methods
• Hit counts
• ‘snippets’
  – Metasearch engines, WebCorp
• Find pages and download
Googleology
• Google hit counts for language modelling
  – Example: (Keller & Lapata 2003)
  – 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista
• Very interesting work
• Great interest in query syntax
The Trouble with Google
• not enough instances
  – max 1000
• not enough queries
  – max 1000 per day with API
• not enough context
  – 10-word snippet around search term
• sort order
  – search term in titles and headings
• untrustworthy hit counts
• limited search options
• linguistically dumb, eg not lemmatised
  – aime/aimer/aimes/aimons/aimez/aiment …
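Because a search engine does not lemmatise, a lemma-level count has to be pieced together from one query per surface form. The sketch below shows the idea; `phrase_queries` and `lemma_frequency` are hypothetical helper names, `hit_count` is a caller-supplied function rather than any real search API, and the object phrases are invented for the example.

```python
from itertools import product

def phrase_queries(*slots):
    """Build one quoted phrase query per combination of word forms.

    Each slot is a list of alternative forms; an empty string means
    the slot may be left out (for example an optional determiner).
    """
    queries = []
    for combo in product(*slots):
        words = [w for w in combo if w]
        queries.append('"' + " ".join(words) + '"')
    return queries

def lemma_frequency(queries, hit_count):
    """Sum the counts of all surface-form queries.

    hit_count is a caller-supplied function (hypothetical here) that
    maps a query string to a number, e.g. a lookup in a local corpus.
    """
    return sum(hit_count(q) for q in queries)

# Forms of French "aimer" as listed on the slide, plus an invented object slot.
verb_forms = ["aime", "aimer", "aimes", "aimons", "aimez", "aiment"]
queries = phrase_queries(verb_forms, ["le chocolat", "les chocolats"])
print(len(queries))  # 6 verb forms x 2 object forms = 12 queries
```

The number of queries grows multiplicatively with each slot, which is how a single verb-object pair can end up costing dozens of queries.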
• Appeal
  – Zero-cost entry, just start googling
• Reality
  – High-quality work: high-cost methodology
Also:
• No replicability
• Methods, stats not published
• At mercy of commercial corporation
• Googleology is bad science
Web corpus types
• Large, general corpora
• Small, specialised corpora
  – Specially for translators
Basic steps
• Gather pages
  – Google hits
  – Select and gather whole sites
  – General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
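As one illustration of the de-duplicate step, the sketch below drops near-duplicate documents by comparing sets of word 5-grams ("shingles") with Jaccard similarity. The talk does not say which method was used; the technique, the function names and the 0.8 threshold are all assumptions chosen for the example.

```python
import re

def shingles(text, size=5):
    """Return the set of overlapping word n-grams ("shingles") in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def deduplicate(docs, threshold=0.8, size=5):
    """Keep a document only if it is not too similar to one already kept.

    Quadratic in the number of documents, so only suitable for small
    collections; large crawls need hashing schemes instead.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        current = shingles(doc, size)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept
```

Pairwise comparison like this is fine for a small specialised corpus but would not scale to a billion-word crawl.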
Oxford English Corpus
• Whole domains chosen and harvested
  – control over text type
• 2.3 billion words
WaC family (DeWaC, ItWaC)
• 1.5 billion words each
• Baroni and colleagues
• Seeds:
  – mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl
TenTen family (enTenTen, deTenTen)
• 2-10 billion words each, same methodology
• Lexical Computing
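The seed-selection step described for the WaC family (mid-frequency words drawn from existing lists and corpora) might be sketched like this. The rank band, the sample size and the function name are illustrative assumptions, not values from the talk.

```python
import random

def mid_frequency_seeds(freqs, low_rank=1000, high_rank=5000, k=20, seed=0):
    """Sample k words from a mid-frequency band of a {word: count} list.

    Rank 0 is the most frequent word. The band limits are illustrative
    defaults, not values from the talk.
    """
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    band = ranked[low_rank:high_rank]
    rng = random.Random(seed)  # fixed seed keeps runs repeatable
    return rng.sample(band, min(k, len(band)))
```

Skipping the very top of the list avoids function words that match every page, while skipping the tail avoids rare words that pull in only a few specialised pages.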
![Page 31: Auckland 2012Kilgarriff: Web Corpora1 Web Corpora Adam Kilgarriff.](https://reader036.fdocuments.net/reader036/viewer/2022081603/56649e735503460f94b732d7/html5/thumbnails/31.jpg)
Auckland 2012 Kilgarriff: Web Corpora 31
Filtering
• Non-text (sound, image etc) files
• Boilerplate (within file)
  – Copyright notices, navigation bars
  – “high markup” heuristic
• Not “text in sentences”
  – Look for function words
  – Lists?? Sports results?? Crossword puzzles??
• Spam, pornography
  – Tough
• De-duplication (also tough)
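Two of the heuristics named above, the "high markup" test and the function-word test, can be approximated in a few lines. This is a rough sketch under stated assumptions: the function-word list is a tiny English sample, the thresholds are guesses that would need tuning, and the regex-based tag matching is far cruder than a real HTML parser.

```python
import re

# A small illustrative list; a real filter would use a fuller,
# language-specific inventory of function words.
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "is", "that", "it", "for"}

def function_word_ratio(text):
    """Share of tokens that are function words (0.0 for empty text)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)

def markup_density(html):
    """Share of characters that sit inside <...> tags."""
    if not html:
        return 0.0
    tag_chars = sum(len(m) for m in re.findall(r"<[^>]*>", html))
    return tag_chars / len(html)

def looks_like_running_text(html, min_function=0.2, max_markup=0.5):
    """Apply both heuristics; the thresholds are guesses to be tuned."""
    text = re.sub(r"<[^>]*>", " ", html)
    return (markup_density(html) <= max_markup
            and function_word_ratio(text) >= min_function)
```

A navigation bar fails both tests (mostly tags, no function words), whereas a paragraph of prose passes both.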
Small, specialised corpora
• Terminologists
• Translators needing target-language domain-specific vocab
• Specialist dictionaries
  – Don’t exist
  – Expensive/inaccessible
  – Out of date
BootCaT (Bootstrapping Corpora and Terms)
– Put in seed terms
– Google/Yahoo/Bing search
– Retrieve Google/Yahoo/Bing hits
  • Remove duplicates, boilerplate
– Small instant corpora
– Baroni and Bernardini, LREC 2004
– Web version
  • WebBootCaT
  • At Sketch Engine site
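The first step, turning seed terms into search queries, is commonly done by combining the seeds into small random tuples. The sketch below assumes that approach; the function name, tuple size and query count are illustrative, and it stops short of calling any search engine.

```python
import random
from itertools import combinations

def seed_queries(seeds, tuple_size=3, max_queries=10, seed=0):
    """Combine seed terms into search queries, one random tuple each.

    Multi-word terms are quoted so they are searched as phrases.
    Tuple size and query count are illustrative defaults.
    """
    all_tuples = list(combinations(sorted(set(seeds)), tuple_size))
    rng = random.Random(seed)
    chosen = rng.sample(all_tuples, min(max_queries, len(all_tuples)))
    return [" ".join(f'"{t}"' if " " in t else t for t in combo)
            for combo in chosen]

# Example seed terms, invented for illustration.
terms = ["corpus", "concordance", "lemma", "part of speech", "collocation"]
for query in seed_queries(terms, max_queries=3):
    print(query)
```

Each query string would then be sent to a search engine and the returned pages downloaded, cleaned and de-duplicated.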
Task
• Choose area of specialist interest
  – Choose your language
• Select at least 5 seed terms
  – Specialist: good
• Build corpus
  – At least 100,000 words
  – Iterate if necessary
• Find at least six words/phrases/meanings you did not know before