The Leo Corpus
• German L1 Learner Corpus
Overview
• corpora in child language research
• CHILDES project
• Leo corpus
• CLAN language analysis tools
Corpora in acquisition research?
• linguistic intuitions of native speakers?
– adult speakers’ intuitions fail
– child will not speak on demand
– child can’t judge own sentences
Leo (1;11,16): Leiter hoch. ‘ladder up/high’
– particle? (pull/push/...) up
– adjective? (the ladder is) long
– utterance context is needed!
Leo (1;11,16): Leiter hoch. (looking at a long ladder in a book)
– adult German: (The) ladder (is) long
– adjective (in adult terms!)
Corpora in acquisition research
• corpora contain utterances that were actually produced
• situated in natural contexts
• data are verifiable
• frequency analyses possible
Kinds of corpora:
• diary studies (e.g., Preyer 1882)
• experimental data
• spoken-language corpora (longitudinal / cross-sectional)
CHILDES
• Child Language Data Exchange System
• Brian MacWhinney / Catherine Snow
• founded in 1984
• part of TalkBank (adult corpora)
• > 1500 published articles
• 4500 members
CHILDES
3 parts:
• language data, i.e. corpora
• CHAT transcription system
• CLAN computer programs
CHILDES corpora
• 130 corpora publicly available (via www)
• 26 languages
• L1 normally developing
• L1 language disorders
• bi- and trilingual children and adults
TalkBank adult corpora:
• L2, aphasics (English, German, Hungarian, Chinese, Italian), ...
CHAT and CLAN
• CHAT:
– Codes for the Human Analysis of Transcripts
– ensure a standard format for all corpora
• CLAN:
– Computerized Language Analysis
– several commands for analyzing data in CHAT format
• single program interface
Leo corpus
• Leo, monolingual L1 German boy
• recorded 1999 – 2002
• Heike Behrens, MPI Leipzig
• transcribed in CHAT format
• analysable with CLAN programs
• not publicly available
Leo corpus
• 2;0 – 3;0: 5 x 1hr / week = 20-22hrs / month, plus diary for new structures
• 3;0 – 5;0: 5 x 1hr / month = 5hrs / month
• ca. 400hrs total recording time
• includes utterances of child and conversation partners
• spontaneous interaction (free play)
• no book-reading
• experimenter present
• some sessions videotaped
Leo corpus
• 1.8 million words of spoken language
• child: ca. 500,000 words
• BNC: largest balanced corpus
– 100 million words
– 10% spoken language: 10 million words
• “dense” corpus
“Dense” corpora
• longitudinal databases with denser recording intervals
– traditional: 0.5 – 1hr / week
– Leo: 1.25 – 5hrs / week
• assumption:
– child is awake and talks 10hrs / day
– traditional: ca. 1% of output
– Leo: 2% - 7% of output
(Tomasello / Stahl 2004)
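The percentages above follow from simple arithmetic: 10 hours of talk per day gives 70 hours per week, and the recorded hours are divided by that total. A minimal Python sketch of the calculation (the function name is illustrative and not part of CLAN):

```python
# Share of a child's weekly output captured by a recording schedule,
# assuming (as on the slide) 10 hours of talk per day.
HOURS_TALKING_PER_DAY = 10
DAYS_PER_WEEK = 7


def sampled_share(recorded_hours_per_week: float) -> float:
    """Return the recorded fraction of the child's weekly talking time."""
    weekly_output_hours = HOURS_TALKING_PER_DAY * DAYS_PER_WEEK  # 70 hours
    return recorded_hours_per_week / weekly_output_hours


for label, hours in [("traditional, low", 0.5), ("traditional, high", 1.0),
                     ("Leo, low", 1.25), ("Leo, high", 5.0)]:
    print(f"{label}: {sampled_share(hours):.1%}")
```

The unrounded shares come out slightly different from the rounded slide figures (ca. 1% and 2% - 7%), which is expected for a back-of-the-envelope estimate.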
“Dense” corpora
• advantages:
– capture of infrequent phenomena
– better estimate of vocabulary size
– age of emergence
– smoother developmental curves
– input / production frequency measures
“Dense” corpora
Likelihood of capturing a target token in a year of recording (Tomasello / Stahl 2004)
[Figure: curves for target tokens occurring 1/day and 10/day]
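How the chance of capturing a target depends on recording density can be sketched in Python. This is an illustration only and not the computation used by Tomasello / Stahl: it assumes that target tokens occur independently at a constant rate during 10 hours of talk per day (a Poisson assumption), and the function name is hypothetical.

```python
import math

HOURS_TALKING_PER_DAY = 10  # assumption taken from the "dense" corpora slide


def capture_probability(tokens_per_day: float,
                        recorded_hours_per_week: float,
                        weeks: int = 52) -> float:
    """P(at least one target token is recorded), under a Poisson assumption."""
    tokens_per_hour = tokens_per_day / HOURS_TALKING_PER_DAY
    expected_captured = tokens_per_hour * recorded_hours_per_week * weeks
    return 1.0 - math.exp(-expected_captured)


for tokens_per_day in (1, 10):
    for hours in (0.5, 1.0, 5.0):
        p = capture_probability(tokens_per_day, hours)
        print(f"{tokens_per_day}/day, {hours} h/week: {p:.3f}")
```

Lowering `weeks` or `tokens_per_day` shows where sparse schedules start to miss rare structures.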
Drawback: only 1 child!
• no generalizations possible
• drawback?
• usage-based approach
– child is believed to construct language individually
– based on personal experience with language
– no help from language-specific knowledge
Usage-based approach
• child moves gradually from lexically specific to abstract knowledge
• no adult categories
• input and frequency play a role (so a corpus is needed!)
• close studies of individuals highly valuable
• dense longitudinal vs. traditional cross-sectional corpora
“Control” corpora
Kerstin & Simone
• Max Miller, MPI Nijmegen
• 1;3 / 1;9 – 4;0
Kerstin:
– 0.5 – 2.7 recordings / month
– ca. 270,000 words (child: 55,000)
Simone:
– 1.25 – 3.5 recordings / month
– ca. 450,000 words (child: 86,000)
“Control” corpora
Pauline & Sebastian
• Prof. Rigol
Pauline:
– 0;0 – 7;11 / 1 – 2 recordings / month
– 340,000 words (child: 85,000)
Sebastian:
– 0;0 – 7;4 / 1 – 2 recordings / month
– 350,000 words (child: 75,000)
Leo corpus
• CHAT-format
• 1 transcription file per session
• txt-format
• no running text
@Headers (file explanations)
*Main tier lines (utterances)
%Dependent tiers (annotations of utterances)
@End
CHAT: Headers
@Begin
@Languages: de
@Participants: CHI Leo Target_Child, MUT Maren Mother, VAT Thorsten Father, MEC Mechthild Observer
@ID: de|mpi_evan|CHI|2;06.08|male|group|middle|Target_Child|education|
@ID: de|mpi_evan|MUT|30;00.00|female|group|middle|Mother|Abitur_Lehre|
@ID: de|mpi_evan|VAT|35;00.00|male|group|middle|Father|university|
@ID: de|mpi_evan|MEC|24;00.00|female|group|middle|Observer|university|
@Filename: le020608.cha
@Date: 11-SEP-1999
@Age of CHI: 2;06.08
@Comment: Dependent: exp, vrb, act, par,
@Comment: in der Wohnung, beim Einkaufen
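Because the headers are pipe-delimited, they are easy to read outside CLAN as well. A minimal Python sketch, assuming the field order visible in the @ID lines of this file; the field names below are labels chosen for illustration, not an official specification:

```python
# Field names are an assumption read off the example header in this file.
ID_FIELDS = ["language", "corpus", "code", "age", "sex",
             "group", "ses", "role", "education"]


def parse_id_line(line: str) -> dict:
    """Split one '@ID:' header line into a field-name -> value mapping."""
    if not line.startswith("@ID:"):
        raise ValueError(f"not an @ID header: {line!r}")
    values = line[len("@ID:"):].strip().rstrip("|").split("|")
    return dict(zip(ID_FIELDS, values))


def parse_age(age: str) -> tuple:
    """Convert a CHAT age such as '2;06.08' into (years, months, days)."""
    years, rest = age.split(";")
    months, days = rest.split(".")
    return int(years), int(months), int(days)


header = "@ID: de|mpi_evan|CHI|2;06.08|male|group|middle|Target_Child|education|"
info = parse_id_line(header)
print(info["code"], info["role"], parse_age(info["age"]))
```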
CHAT: Main tiers
• each utterance on its own line
• each line starts with a tier
• each speaker has their own tier: *CHI, *VAT, ...
• annotations on dependent tiers: %mor, %pho, ...
Child: Yes. Fish! – Father: Fish? – Child: Yes.
*CHI: ja .
%mor: $INTER|ja .
*CHI: Fisch !
%mor: $N:03:m:NOM:SG|Fisch !
*VAT: Fisch ?
%mor: $N:03:m:CAS|Fisch ?
*CHI: ja .
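The tier structure (one main tier per utterance, followed by its dependent tiers) can be read with a few lines of Python. This is a minimal sketch for illustration, not a full CHAT parser: it ignores headers and does not handle utterances that continue over several lines.

```python
def read_utterances(lines):
    """Group each main tier line (*SPK:) with its dependent tiers (%xxx:)."""
    utterances = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("*"):
            speaker, _, text = line[1:].partition(":")
            utterances.append({"speaker": speaker, "text": text.strip(), "tiers": {}})
        elif line.startswith("%") and utterances:
            name, _, content = line[1:].partition(":")
            utterances[-1]["tiers"][name] = content.strip()
        # header lines (@...) and blank lines are skipped in this sketch
    return utterances


sample = """\
*CHI: ja .
%mor: $INTER|ja .
*CHI: Fisch !
%mor: $N:03:m:NOM:SG|Fisch !
*VAT: Fisch ?
%mor: $N:03:m:CAS|Fisch ?
*CHI: ja .
""".splitlines()

for utt in read_utterances(sample):
    print(utt["speaker"], utt["text"], utt["tiers"].get("mor", "(no %mor tier)"))
```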
CHAT: Transcription
• orthographic or not?
• depends on purpose
• orthographic transcription: ease of retrieval
– additional information via dependent tiers (%pho)
– utterances can be linked to digitized sound files (Sonic-CHAT)
– or to video files
SONIC CHAT
CHAT: Transcription
• spoken language is not as orderly as written text
• coding scheme for spoken-language phenomena:
– overlaps
– trailing off
– noncompletions: is(t)
– retracing: Schrei [//] Scheibenwischer
– non-words: hm@o
– replacements: nix [: nichts]
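For analyses outside CLAN, these codes usually have to be resolved into plain word forms first. A minimal Python sketch covering only the four codes with examples above; the function name and the decision to keep the target form are assumptions, and multi-word retracings are not handled:

```python
import re


def normalize(utterance: str) -> str:
    """Turn a CHAT main-tier string into a plain 'target' word sequence."""
    text = utterance
    # replacement: 'nix [: nichts]' -> 'nichts'
    text = re.sub(r"\S+\s+\[:\s*([^\]]+)\]", r"\1", text)
    # retracing: 'Schrei [//] Scheibenwischer' -> 'Scheibenwischer'
    text = re.sub(r"\S+\s+\[//\]\s*", "", text)
    # noncompletion: 'is(t)' -> 'ist'
    text = re.sub(r"\((\w+)\)", r"\1", text)
    # special form marker: 'hm@o' -> 'hm'
    text = re.sub(r"@\w+", "", text)
    return " ".join(text.split())


for example in ["is(t)", "Schrei [//] Scheibenwischer", "hm@o", "nix [: nichts]"]:
    print(f"{example!r} -> {normalize(example)!r}")
```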
CHAT: Annotation
• annotations to an utterance (tagging) on the dependent tiers of that utterance
*CHI: [D] ist drin .
%mor: $VCOP:S:POS:PRES:3s|sein $ADV|drin .
%exp: es ist noch etwas Kakao im Becher . (there is still some cocoa in the cup)
• here:
– %mor: morphology
– %exp: explanation of utterance situation
CHAT: Annotation
• reading the %mor tag $VCOP:S:POS:PRES:3s|sein
– VCOP: copula
– S: suppletive
– POS: (empty)
– PRES: tense
– 3s: agreement
– sein: citation form
• ist is the 3rd person singular present tense of the suppletive copula verb sein
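A %mor tag of this shape can be taken apart mechanically: the part before "|" holds the category and its features, the part after it the citation form. A minimal Python sketch (the dictionary keys are labels chosen for illustration):

```python
def split_mor_tag(tag: str) -> dict:
    """Split '$VCOP:S:POS:PRES:3s|sein' into category, features and lemma."""
    coding, _, lemma = tag.lstrip("$").partition("|")
    category, *features = coding.split(":")
    return {"category": category, "features": features, "lemma": lemma}


for tag in "$VCOP:S:POS:PRES:3s|sein $ADV|drin".split():
    print(split_mor_tag(tag))
```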
CHAT: Annotation
• tagging is based on theoretical notions of adult language!
– e.g., when ist is tagged as VCOP etc., this doesn’t mean that it constitutes a VCOP for the child
*CHI: [D] Leiter hoch .
%mor: $N:02:f:AKK:SG|Leiter $PT|hoch .
• is hoch a verb particle for the child?
CHAT: Annotation
• transcription and annotation in CLAN editor
– converts txt- and SALT-format files to CHAT
– automatic tagging (%mor-tiers): lexicon file with word information, tag disambiguation (manual / probabilistic)
– computes coding reliability
– checks conformity with CHAT-conventions
• works on different workstations (unlike TRANSANA)
– access files on network drive
CLAN / CHAT interface
CLAN Commands
• search commands, e.g.
– simple and combined strings in utterances and annotations
– interaction blocks
– imitations, repetitions, overlaps
• computing commands, e.g.
– mlu / mlt (mean length of utterances / turns)
– longest words / utterances
– vocabulary diversity: TTR, measure D
– frequency of phonemes by position
CLAN Commands
• commands in DOS-like style
Examples
research question: WH-words
• emergence
• frequency
• use
Emergence of Interrogative Pronouns
coding:
*CHI: was machst du ? (lit.: what do you?)
%mor: $PRO:int|was [...]
search:
kwal +t*CHI +t%mor +s$pro:int* le020*.cha
– kwal: search
– +t*CHI +t%mor: in the child’s mor-tiers
– +s$pro:int*: for strings starting with $pro:int
– le020*.cha: in all files up to 2;9
Triple-click to access utterance!
Emergence of Interrogative Pronouns
• output:
kwal (08-Dec-2004) is conducting analyses on:
ONLY speaker main tiers matching: *CHI;
and those speakers' ONLY dependent tiers matching: %MOR;
[…]
From file <le020006.cha>
From file <le020007.cha>
From file <le020008.cha>
----------------------------------------
*** File "le020008.cha": line 2603. Keyword: $pro:int|wo
*CHI: Mama, wo bis(t) du ! (Mama, where are you!)
%mor: $N:01:f:VOC:SG|Mama $PRO:int|wo $VCOP:S:POS:PRES:2s|sein $PRO:pers:NOM:SG|du $N:01:f:VOC:SG|Mama !
Frequency of Interrogative Pronouns
• Does the child’s use match the input frequency?
Child:
freq +t*CHI +t%mor +s"$PRO:INT%|*" le020*.cha +u +o
Input:
freq -t*CHI +t%mor +s"$PRO:INT%|*" le020*.cha +u +o
– freq: give frequency count
– +t*CHI +t%mor: for child’s mor-tier
– -t*CHI +t%mor: for other people’s mor-tiers
– +s"$PRO:INT%|*": of $pro:int
– le020*.cha: up to 2;9
– +u: for all files together
– +o: sort output
Frequency of Interrogative Pronouns
• result:
child                  input
536 $pro:int|was       13162 $pro:int|was
305 $pro:int|wo        3486 $pro:int|wo
70 $pro:int|wie        1608 $pro:int|wer
31 $pro:int|wer        1255 $pro:int|wie
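Raw counts are hard to compare because the input is far larger than the child's output. One way to put both on the same scale is to convert them to proportions, sketched here in Python with the counts from the result above. The proportions are relative to these four pronouns only, not to all words in the corpus.

```python
# Counts copied from the freq result above.
child = {"was": 536, "wo": 305, "wie": 70, "wer": 31}
adults = {"was": 13162, "wo": 3486, "wer": 1608, "wie": 1255}


def shares(counts: dict) -> dict:
    """Convert raw counts into proportions of the listed tokens."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


child_share, input_share = shares(child), shares(adults)
for word in sorted(child, key=child.get, reverse=True):
    print(f"{word:4s} child {child_share[word]:6.1%}   input {input_share[word]:6.1%}")
```

The rank order matches for was and wo, while wie and wer swap places between child and input.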
Non-interrogative use of wh-words
• Make a file with all words for interrogative pronouns Leo uses:
freq […] +s"$PRO:INT%|*" +u +d1 >> Leowh.cut
– +u: for all files together
– +d1: without frequency count numbers
– >> Leowh.cut: direct output to file
Non-interrogative use of wh-words
Leowh.cut:
$pro:int|was
$pro:int|wo
$pro:int|wie
$pro:int|wer
• strip file of all $pro:int| , so that just the wh-words are left:
chstring +s"$pro:int|" "" +y leowh.cut
– chstring: change
– +s"$pro:int|" "": from $pro:int| to nothing
– +y: file not in CHAT-format
Non-interrogative use of wh-words
• then look for uses of these words in sentences that do not contain $pro:int
combo +t*CHI +t%mor le*.cha +s@leowh.cut^*^%mor:^*^!$pro:int*
– +s@leowh.cut: take words from file as search string for utterances
– ^*^%mor:: followed by %mor
– ^*^!$pro:int*: not containing $pro:int
• combo: search with Boolean operators
Word order in wh-questions
• German: verb has to follow wh-word directly – any errors?
• search for all utterances that do not follow this pattern:
combo +t*CHI +t%mor +s$pro:int*^!$v* le*.cha
search for child’s $pro:int not directly followed by any $v
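The same check can be sketched in Python on a single %mor tier. This only illustrates the search logic and does not reimplement combo; the second example tier is invented for illustration and is not corpus data. A wh-word at the very end of an utterance also counts as "not followed by a verb" here.

```python
def wh_not_followed_by_verb(mor_tier: str) -> bool:
    """True if some $PRO:int tag is not directly followed by a $V... tag."""
    tags = [t.lower() for t in mor_tier.split() if t.startswith("$")]
    for i, tag in enumerate(tags):
        if tag.startswith("$pro:int"):
            next_tag = tags[i + 1] if i + 1 < len(tags) else ""
            if not next_tag.startswith("$v"):
                return True
    return False


# tier taken from the kwal output example in this document
attested = ("$N:01:f:VOC:SG|Mama $PRO:int|wo $VCOP:S:POS:PRES:2s|sein "
            "$PRO:pers:NOM:SG|du $N:01:f:VOC:SG|Mama !")
# hypothetical tier with the pronoun before the verb (tags invented)
invented = "$PRO:int|was $PRO:pers:NOM:SG|du $V|machen ?"

print(wh_not_followed_by_verb(attested))
print(wh_not_followed_by_verb(invented))
```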
Co-occurrences of wh-words
• What words does was (what) co-occur with when used as an interrogative pronoun?
kwal +t*CHI +t%mor +s$pro:int%"|"was +d +o* le*.cha | cooccur +swas +t*CHI +u
• kwal looks for all uses of was as $pro:int
• the results are directed to cooccur (“piping”)
6 was da
32 was das
1 was denkst
2 was denn
Measuring lexical diversity
• traditional: type-token-ratio (TTR)
– number of different word types
– against total number of words
• every word is a new word: TTR 1.0
• the lower the TTR, the less lexical diversity
• problem: depends on sample size
– in a large sample, the total vocabulary will finally be exhausted
– TTR levels out because highly frequent words will increase the number of tokens disproportionately
– rarely occurring types will have little influence on TTR
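The sample-size problem is easy to see on a toy example. The word list in this Python sketch is invented for illustration and is not corpus data:

```python
def ttr(tokens):
    """Type-token ratio: number of distinct word types / number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# invented toy sample of 20 tokens
words = ("ja Fisch ja wo ist Fisch ja da ist Fisch "
         "ja wo ist Mama da ist Mama ja Fisch ja").split()

for n in (5, 10, 20):
    print(n, round(ttr(words[:n]), 2))
```

The same speaker and the same vocabulary give a lower TTR simply because the sample gets longer.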
measure D
• measure D is obtained by
– randomly sampling the corpus
– calculating the actual leveling out of the TTR rate
– and comparing this to theoretical models of TTR curves
• the probability of new types being introduced in the corpus is calculated, regardless of sample size
• In CLAN:– TTR: freq– Measure D: VOCD
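The idea behind measure D can be sketched in Python: sample the tokens at several sizes, average the TTR at each size, and find the D whose theoretical curve fits best. This is a rough illustration under stated assumptions and not CLAN's VOCD implementation: the curve TTR = (D/N)(sqrt(1 + 2N/D) - 1) is the model usually cited for D, while the sample sizes (35 to 50 tokens), the 100 trials per size and the simple grid search are choices made for this sketch.

```python
import math
import random


def model_ttr(n: float, d: float) -> float:
    """Theoretical TTR for a sample of n tokens given diversity parameter d."""
    return (d / n) * (math.sqrt(1.0 + 2.0 * n / d) - 1.0)


def mean_sample_ttr(tokens, n, trials=100, rng=None):
    """Average TTR over random n-token subsamples drawn without replacement."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(trials):
        sample = rng.sample(tokens, n)
        total += len(set(sample)) / n
    return total / trials


def fit_d(points, lo=1.0, hi=200.0, steps=2000):
    """Grid-search the d whose model curve best fits the (n, ttr) points."""
    best_d, best_err = lo, float("inf")
    for i in range(steps + 1):
        d = lo + (hi - lo) * i / steps
        err = sum((model_ttr(n, d) - t) ** 2 for n, t in points)
        if err < best_err:
            best_d, best_err = d, err
    return best_d


def estimate_d(tokens, sizes=range(35, 51)):
    """Empirical TTR curve over several sample sizes, then least-squares fit."""
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    points = [(n, mean_sample_ttr(tokens, n, rng=rng)) for n in sizes]
    return fit_d(points)


# sanity check on synthetic points generated from the model itself
synthetic = [(n, model_ttr(n, 50.0)) for n in range(35, 51)]
print(round(fit_d(synthetic), 1))
```

`estimate_d` expects a list of at least 50 word tokens, for example the words of one child's main tiers.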