Corpora: Annotating and Searching
LING 5200 Computational Corpus Linguistics
Martha Palmer
LING 5200, 2006. Based on Kevin Cohen’s LING 5200
Features of corpora
Size (little/big/huge)
Plasticity (finite/monitor)
Metadata (none/lots)
Annotation (none, …, lots)
Balance
Features: size
Relative over time
  1960s: 1M words (Brown)
  1990s: 4.5M words (Penn Treebank)
  2000s: 415M words (BOE)
  2000s: 1000M words (English Gigaword)
Currently, micro/small/large/massive
Features
Finite
  size established in advance
  sample sizes adjusted accordingly
  doesn't change over time
Monitor
  allows diachronic analysis
  grows over time
Metadata
(practically) none
  language, at least
  document boundaries
some
  document attributes: title, body, author, date

PMID- 6509398
DP - 1984 Nov
TI - The natural history of Machado-Joseph disease. An analysis of 138 personally examined cases.
PG - 510-25
AB - We have examined 138 cases of a disorder previously described in people of Portuguese origin and which has received many names. By computer analysis of 46 different items of a standardized neurological examination carried out in each patient, we have been able to delineate the main components of
Metadata
Lots
  author characteristics: gender, age, mother tongue(s), dialect, educational level
  genre classification: news, scientific, personal
  topic relevance

MH - Aged
MH - Azores/ethnology
MH - Cerebellar Ataxia/diagnosis
MH - Gene Frequency
MH - Human
MH - Phenotype
MH - Portugal/ethnology
MH - Support, Non-U.S. Gov't
MH - Syndrome
MH - United States
MH - Variation (Genetics)
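
Tagged metadata like the record above is straightforward to read programmatically. The following is a minimal sketch, assuming one field per line in the "TAG - value" layout shown on the slides, with indented lines treated as continuations of the previous field. The sample string and the function name parse_record are illustrative only, not part of any official tool.

```python
import re
from collections import defaultdict

# Illustrative sample: a few fields copied from the record on the slides.
RECORD = """\
PMID- 6509398
DP - 1984 Nov
TI - The natural history of Machado-Joseph disease. An analysis of 138
     personally examined cases.
MH - Aged
MH - Azores/ethnology
MH - Gene Frequency
"""

# A field line starts with a 2-4 letter uppercase tag, optional padding, "- ".
FIELD = re.compile(r"^([A-Z]{2,4})\s*-\s(.*)$")

def parse_record(text):
    """Map each tag to the list of its values (tags such as MH repeat)."""
    fields = defaultdict(list)
    tag = None
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            tag = match.group(1)
            fields[tag].append(match.group(2).strip())
        elif tag and line.strip():
            # Continuation line: append to the most recent value.
            fields[tag][-1] += " " + line.strip()
    return dict(fields)

record = parse_record(RECORD)
print(record["PMID"], record["MH"])
```

Keeping every value in a list makes repeated tags (the MH topic headings) and single-valued tags (PMID, DP) behave the same way when searching.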
Balanced corpora
What are you balancing?
  Most common: genre
  Authors: gender, age, education, dialect
Balanced corpora
Composition of the International Corpus of English (adapted from Meyer 2002)
speech
writing
  unpublished
  published
    non-fiction
      informative: academic, popular, news
      instructional
      persuasive
    fiction
Balanced corpora
Composition of the International Corpus of English (adapted from Meyer 2002)
speech
  dialogue
  monologue
    scripted: talks, news, speeches
    unscripted
writing
Corpus length
Overall length
Sample size
  partial
    2,000 words (Brown, LOB, ICE)
    5,000 words (London-Lund)
  full
    takes up space
    copyright permission issues harder
Sample size
Motivating assumption: more important to maximize number of authors/genres than length of text from each
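
The fixed-sample design can be sketched as follows. This is a simplification under stated assumptions: tokens are whitespace-delimited and each sample is taken from the start of the text, whereas real corpus compilers may sample from other positions and may adjust sample boundaries. The names sample_words and build_samples are hypothetical.

```python
def sample_words(text, size=2000):
    """Return at most `size` whitespace-delimited tokens from the start of `text`."""
    return text.split()[:size]

def build_samples(texts, size=2000):
    """Take one fixed-size sample per source text (texts: name -> full text)."""
    return {name: sample_words(body, size) for name, body in texts.items()}

# Toy input: one long text and one short text.
texts = {"long_text": "word " * 3000, "short_text": "one two three"}
samples = build_samples(texts, size=2000)
print({name: len(tokens) for name, tokens in samples.items()})
```

Capping every text at the same length means a fixed word budget is spread over as many authors and genres as possible, which is the trade-off the slide describes.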
By purpose
Linguistic-y
  lexicon vs. other
NLP
  General purpose
  information retrieval
  information extraction
Foreign language instruction
  Native L2
  "Learner" L2
Annotation
None
  "collection"
Some
  POS
  lemmas

lemma(be) = {be, am, is, are, was, were, being, been}
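
Lemma annotation amounts to a mapping from surface forms to a shared lemma. Here is a minimal sketch using a hand-written table for "be" only; the table, the names LEMMAS and lemmatize, and the fallback of returning the lowercased token for unknown words are all assumptions made for illustration, not how any particular annotated corpus was built.

```python
# Hand-written table: lemma -> set of inflected forms (illustrative only).
LEMMAS = {
    "be": {"be", "am", "is", "are", "was", "were", "being", "been"},
}

# Invert the table so each surface form points at its lemma.
FORM_TO_LEMMA = {form: lemma for lemma, forms in LEMMAS.items() for form in forms}

def lemmatize(tokens):
    """Replace each token with its lemma; unknown tokens are only lowercased."""
    return [FORM_TO_LEMMA.get(tok.lower(), tok.lower()) for tok in tokens]

print(lemmatize(["Dogs", "are", "happy"]))
```

Searching a lemmatized corpus for "be" then finds every inflected form in one query.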
Annotation
Lots
  syntax (treebank, "bracketing")
  semantics
    predicate/argument structure
    ontological

<class (mammal, pet)>Dogs</class> make me happy.
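
Inline annotation of this kind can be searched with a regular expression. The sketch below assumes exactly the ad hoc tag syntax shown on the slide (a class tag with a parenthesized, comma-separated list of categories); it is not valid XML, so an XML parser would not accept it. The names extract_classes and strip_annotations are hypothetical.

```python
import re

# Matches <class (cat1, cat2)>text</class> as written on the slide.
ANNOTATION = re.compile(r"<class \(([^)]*)\)>(.*?)</class>")

def extract_classes(text):
    """Return (annotated text, list of categories) pairs."""
    return [
        (m.group(2), [c.strip() for c in m.group(1).split(",")])
        for m in ANNOTATION.finditer(text)
    ]

def strip_annotations(text):
    """Recover the raw sentence by dropping the tags."""
    return ANNOTATION.sub(r"\2", text)

sentence = "<class (mammal, pet)>Dogs</class> make me happy."
print(extract_classes(sentence))
print(strip_annotations(sentence))
```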
Diachronic
Historical (OE, ME, …)
Later sampling of earlier balanced corpus
Monitor
Spoken
Phonetically motivated (elicited)
Other ("natural")
Multilingual
Parallel
  L1 contents == L2 contents
  Parliamentary proceedings in English & French
  Shakespeare in English and German
Translation/comparable
  two L1's; genre == genre
  E.g., weather reports
Penn Treebank
treebank: corpus of syntactically-annotated data
first release: 4.5 million words, 3 years' work
currently 4.9 M
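
Syntactic annotation in a treebank is usually stored as labeled bracketing. The following is a minimal sketch of reading such a bracketing into nested lists. The example sentence and its labels are a simplified, hypothetical bracketing (actual Penn Treebank files also carry a POS tag on every word), and parse_brackets and leaves are illustrative names.

```python
def parse_brackets(text):
    """Parse '(LABEL child ...)' into nested lists: [LABEL, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = []
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # consume the closing ")"
            return node
        return tok  # a label or a word

    return read()

def leaves(tree):
    """Return the words of the tree, left to right (the first item is the label)."""
    if isinstance(tree, str):
        return [tree]
    return [word for child in tree[1:] for word in leaves(child)]

tree = parse_brackets("(S (NP Dogs) (VP make (NP me) (ADJP happy)))")
print(tree)
print(leaves(tree))
```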
Penn Treebank
Scientific abstracts  231K
Newspaper stories     3,066K
DOA bulletins         79K
Fiction               106K
MUC-3                 112K
Computer manuals      89K
Radio transcripts     12K
Flight-booking        20K
Brown corpus          1,172K
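
As a quick arithmetic check, the component sizes above can be summed and compared with the overall figure of about 4.9M words. The dictionary simply restates the slide's numbers in thousands of words; the variable names are illustrative.

```python
# Penn Treebank components, in thousands of words, as listed on the slide.
PTB_KWORDS = {
    "Scientific abstracts": 231,
    "Newspaper stories": 3066,
    "DOA bulletins": 79,
    "Fiction": 106,
    "MUC-3": 112,
    "Computer manuals": 89,
    "Radio transcripts": 12,
    "Flight-booking": 20,
    "Brown corpus": 1172,
}

total_k = sum(PTB_KWORDS.values())
print(f"Total: {total_k:,}K words (about {total_k / 1000:.1f}M)")

# Share of each component, largest first.
for name, size in sorted(PTB_KWORDS.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} {size / total_k:6.1%}")
```

The total comes to 4,887K words, which rounds to 4.9M; newspaper text and the Brown corpus together make up most of it.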
Penn Treebank
POS-tagged Switchboard data: http://www.cis.upenn.edu/~treebank/switch-samp-pos.html
Dysfluency-annotated Switchboard data: http://www.cis.upenn.edu/~treebank/switch-samp-dfl.html
Syntactically-annotated Switchboard data: http://www.cis.upenn.edu/~treebank/switch-samp-bkt.html
GENIA
2000 abstracts
red blood cell transcription factors
POS-tagged (HW2, #16)
semantic annotation with molecular biology ontology
Corpora/resources
Dictionaries, ontologies, ...
  CELEX
  WordNet
Corpora/resources
Dictionaries, ontologies, ...
"discovery procedure"
  phonology: contrasts, phonotactics
  morphology: term formation, inflectional