Auckland 2012 Kilgarriff: Web Corpora 1
Web Corpora
Adam Kilgarriff
You can’t help noticing
• Replaceable or replacable?
– http://googlefight.com
• Very very large
– 2006 estimates for duplicate-free, linguistic, Google-indexed web
• German: 44 billion words
• Italian: 25 billion words
• English: 1,000 billion to 10,000 billion words
• Most languages
• Most language types
• Up-to-date
• Free
• Instant access
Overview
• Is the web a corpus?
• Representativeness
• What is out there?
– Web1T
• Googleology
• Web corpus types
– Targeted sites: Oxford English Corpus
– General: WaC family
– WebBootCaT
Is the web a corpus?
• Sinclair, in “Developing linguistic corpora, a guide to good practice. Corpus and Text – Basic Principles”:
“…not a corpus because
• dimensions unknown, constantly changing
• not designed from a linguistic perspective”
• But
– We can find out dimensions
– Many corpora are not designed
• “as much chatroom dialogue as I can get”
• Def: a corpus is a collection of texts
– when viewed as an object of language research
Is the web a corpus?
Yes
but it’s not representative
Theory
A random sample of a population is representative of it.
Observations on sample support inferences about population
(within confidence bounds)
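A minimal sketch of what "within confidence bounds" could mean for a word frequency, assuming (unrealistically for running text) that each token in the sample is an independent random draw from the population. The function name `wilson_interval` and the example counts are illustrative assumptions, not figures from the talk.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, sample_size: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95% confidence)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = hits / sample_size
    denom = 1 + z * z / sample_size
    centre = (p + z * z / (2 * sample_size)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / sample_size + z * z / (4 * sample_size ** 2)
    ) / denom
    return centre - half_width, centre + half_width


# Hypothetical observation: a word seen 30 times in a 10,000-token random sample.
low, high = wilson_interval(hits=30, sample_size=10_000)
print(f"population relative frequency plausibly between {low:.5f} and {high:.5f}")
```

The interval only licenses an inference if the sample really is random with respect to a defined population, which is exactly the condition questioned next.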
Theory
A random sample of a population is …
• What is the population?
– production and reception
– speech and text
– copying
Theory
• Population not defined
• Representative sample not possible
Sublanguage
• Language = core + sublanguages
• Options for corpus construction
– none
– some
– all
• None
– impoverished view of language
• Some: BNC
– cake recipes and gastro-uterine disease
– not car repair manuals or astronomy or …
• All: until recently, not viable
Representativeness
• The web is not representative
• but nor is anything else
• Text type variation
– under-researched, lacking in theory
• Atkins, Clear and Ostler 1993 on design brief for BNC; Biber 1988, Kilgarriff 2001, Sharoff 2006
• Text type is an issue across linguistics
– Web: issue is acute because, as against BNC or WSJ, we simply don’t know what is there
What is out there?
• What text types are there on the web?
– some are new: chatroom
– proportions
• is it overwhelmed by porn? How much?
• Hard question
Comparing frequency lists
• Web1T
– Present from Google
– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English: that’s 1,000,000,000,000
• Compare with BNC
– Take top 50,000 items of each
– 105 Web1T words not in BNC top 50k
– 50 words with highest Web1T:BNC ratio
– 50 words with lowest ratio
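The comparison above could be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure actually used: the function names, the toy counts, the per-million normalisation, the add-one smoothing and the choice to rank only words shared by both top lists are all assumptions.

```python
from collections import Counter


def top_items(freqs, n):
    """Return the n most frequent words of a frequency list as a set."""
    return {word for word, _ in Counter(freqs).most_common(n)}


def compare_lists(freqs_a, freqs_b, top_n, smoothing=1.0):
    """Compare two word-frequency dicts.

    Returns (words in A's top_n but not in B's top_n,
             shared words sorted from highest to lowest A:B ratio).
    """
    total_a = sum(freqs_a.values())
    total_b = sum(freqs_b.values())
    top_a = top_items(freqs_a, top_n)
    top_b = top_items(freqs_b, top_n)

    def ratio(word):
        # Normalise to frequency per million, then smooth (an assumption)
        # so that very rare words do not produce extreme ratios.
        per_million_a = freqs_a[word] * 1_000_000 / total_a
        per_million_b = freqs_b[word] * 1_000_000 / total_b
        return (per_million_a + smoothing) / (per_million_b + smoothing)

    only_in_a = sorted(top_a - top_b)
    ranked = sorted(top_a & top_b, key=ratio, reverse=True)
    return only_in_a, ranked


# Toy counts standing in for the two frequency lists (invented for illustration).
web = {"the": 1000, "browser": 50, "she": 5, "url": 40}
bnc = {"the": 1000, "she": 60, "browser": 2, "council": 30}

only_web, ranked = compare_lists(web, bnc, top_n=4)
print(only_web)    # words in the web top list but not the BNC top list
print(ranked[:1])  # highest web:BNC ratio
print(ranked[-1:]) # lowest web:BNC ratio
```

With real data, `top_n` would be 50,000 and the two ends of `ranked` would be cut at 50 words each.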
Web-high (155 terms)
• 61 web and computing
– config browser spyware url www forum
• 38 porn
• 22 US English (incl. Spanish influence – los)
• 18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
• 4 legal
– trademarks pursuant accordance herein
Web-low
• Exclude British English, transcription/tokenisation anomalies
– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him
Observations
• Pronouns and past tense verbs
– Fiction
• Masc vs fem
• Yesterday
– Probably daily newspapers
• Constancy of ratios:
– He/him/himself
– She/her/herself
• The web
– a social, cultural, political phenomenon
– new, little understood
– a legitimate object of science
– mostly language
• we are well placed
– a lot of people will be interested
• Let’s
– study the web
– source of language data
– apply our tools for web use (dictionaries, MT)
– use the web as infrastructure
Using Search Engines
No setup costs
Start querying today
Methods
• Hit counts
• ‘snippets’
– Metasearch engines, WebCorp
• Find pages and download
Googleology
• Google hit counts for language modelling
– Example: (Keller & Lapata 2003)
– 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista
• Very interesting work
• Great interest in query syntax
The Trouble with Google
• not enough instances
– max 1000
• not enough queries
– max 1000 per day with API
• not enough context
– 10-word snippet around search term
• sort order
– search term in titles and headings
• untrustworthy hit counts
• limited search options
• linguistically dumb, eg not lemmatised
• aime/aimer/aimes/aimons/aimez/aiment …
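Because a search engine matches literal strings rather than lemmas, counting one lemma pair means issuing one query per combination of surface forms. The sketch below shows how quickly that multiplies. The form lists are hand-written assumptions for illustration, `expand_queries` is a hypothetical helper, and this is not a reconstruction of the query set used by Keller & Lapata.

```python
from itertools import product


def expand_queries(verb_forms, determiners, noun_forms):
    """Build one quoted phrase query per (verb form, determiner, noun form)."""
    queries = []
    for verb, det, noun in product(verb_forms, determiners, noun_forms):
        # An empty determiner means the verb is directly followed by the noun.
        words = [w for w in (verb, det, noun) if w]
        queries.append('"' + " ".join(words) + '"')
    return queries


# Hand-listed surface forms (assumptions; a real study would need a morphological lexicon).
verb_forms = ["fulfil", "fulfils", "fulfilled", "fulfilling"]
determiners = ["", "the"]
noun_forms = ["obligation", "obligations"]

queries = expand_queries(verb_forms, determiners, noun_forms)
print(len(queries))  # 4 verb forms x 2 determiners x 2 noun forms = 16
print(queries[:3])
```

Each of these strings would cost one query against the daily limit, and the hit counts would then have to be summed by hand.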
• Appeal
– Zero-cost entry, just start googling
• Reality
– High-quality work: high-cost methodology
Also:
• No replicability
• Methods, stats not published
• At mercy of commercial corporation
• Googleology is bad science
Web corpus types
• Large, general corpora
• Small, specialised corpora
– Specially for translators
Basic steps
• Gather pages
– Google hits
– Select and gather whole sites
– General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
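As one illustration of the de-duplicate step above, here is a minimal sketch of near-duplicate removal using word shingles and Jaccard overlap. This is a generic technique swapped in for illustration, not the method used for any corpus named in this talk. The shingle length `k` and the `threshold` are assumed values, and the pairwise comparison is quadratic, so it only suits small collections.

```python
import re


def shingles(text, k=5):
    """Set of overlapping k-word sequences from lower-cased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (two empty sets count as identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, k=5, threshold=0.8):
    """Keep each document unless it is too similar to one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        doc_shingles = shingles(doc, k)
        if any(jaccard(doc_shingles, seen) >= threshold for seen in kept_shingles):
            continue  # near-duplicate of an earlier document
        kept.append(doc)
        kept_shingles.append(doc_shingles)
    return kept


pages = [
    "The cat sat on the mat today.",
    "the cat sat on the mat today",  # same words, different case and punctuation
    "A completely different page about corpora and words.",
]
unique_pages = deduplicate(pages, k=3)
print(len(unique_pages))
```

Web-scale corpora would need a hashing scheme instead of all-pairs comparison.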
Oxford English Corpus
• Whole domains chosen and harvested
– control over text type
• 2.3 billion words
WaC family (DeWaC, ItWaC)
• 1.5 billion words each
• Baroni and colleagues
• Seeds:
– mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl
TenTen family (enTenTen, deTenTen)
• 2-10 billion words each, same methodology
• Lexical Computing
Filtering
• Non-text (sound, image etc) files
• Boilerplate (within file)
– Copyright notices, navigation bars
– “high markup” heuristic
• Not “text in sentences”
– Look for function words
– Lists?? Sports results?? Crossword puzzles??
• Spam, pornography
– Tough
• De-duplication (also tough)
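Two of the filters above, the “high markup” heuristic and the function-word check, might be sketched as follows. The tiny function-word list, both thresholds and all function names are assumptions chosen for illustration; a production filter would be tuned per language and work on blocks within a page rather than whole pages.

```python
import re

# Assumed, deliberately tiny list of English function words.
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "is", "that", "it", "for"}

TAG_PATTERN = re.compile(r"<[^>]*>")


def markup_ratio(html):
    """Share of characters that sit inside tags."""
    if not html:
        return 0.0
    tag_chars = sum(len(tag) for tag in TAG_PATTERN.findall(html))
    return tag_chars / len(html)


def function_word_ratio(text):
    """Share of word tokens that are function words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(word in FUNCTION_WORDS for word in words) / len(words)


def looks_like_running_text(html, max_markup=0.5, min_function=0.2):
    """Keep a block only if it is low in markup and rich in function words."""
    text = TAG_PATTERN.sub(" ", html)
    return (markup_ratio(html) <= max_markup
            and function_word_ratio(text) >= min_function)


prose = ("<p>The corpus is a collection of texts that is viewed "
         "as an object of research in the field.</p>")
scores = "<td>Arsenal 2 Chelsea 1</td><td>Leeds 0 Everton 0</td>"

print(looks_like_running_text(prose))
print(looks_like_running_text(scores))
```

Sports results and similar lists fail the function-word test because they contain almost no grammatical words.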
Small, specialised corpora
• Terminologists
• Translators needing target-language domain-specific vocab
• Specialist dictionaries
– Don’t exist
– Expensive/inaccessible
– Out of date
BootCaT (Bootstrapping Corpora and Terms)
– Put in seed terms
– Google/Yahoo/Bing search
– Retrieve Google/Yahoo/Bing hits
• Remove duplicates, boilerplate
– Small instant corpora
– Baroni and Bernardini, LREC 2004
– Web version
• WebBootCaT
• At Sketch Engine site
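A minimal sketch of the first step, turning seed terms into search queries, assuming the seeds are combined into small random tuples with each tuple sent as one query. The tuple size, the number of queries, the function name `seed_queries` and the example seeds are all assumptions, and the search and download steps are left out because they depend on whichever search service is available.

```python
import random
from itertools import combinations


def seed_queries(seeds, tuple_size=3, n_queries=10, rng_seed=0):
    """Pick up to n_queries distinct random tuples of seed terms as query strings."""
    unique_seeds = sorted(set(seeds))
    all_tuples = list(combinations(unique_seeds, tuple_size))
    rng = random.Random(rng_seed)  # fixed seed so the selection is repeatable
    chosen = rng.sample(all_tuples, min(n_queries, len(all_tuples)))
    return [" ".join(terms) for terms in chosen]


# Hypothetical seed terms for a small specialised corpus.
seeds = ["corpus", "lemma", "collocation", "concordance", "tokenisation"]
bootcat_queries = seed_queries(seeds, tuple_size=3, n_queries=10)
for query in bootcat_queries:
    print(query)
```

Five seeds give ten possible three-term tuples, so more seeds are needed if more queries are wanted.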
Task
• Choose area of specialist interest
– Choose your language
• Select at least 5 seed terms
– Specialist: good
• Build corpus
– At least 100,000 words
– Iterate if necessary
• Find at least six words/phrases/meanings you did not know before