The Leo Corpus

• German L1 Learner Corpus

Overview

• corpora in child language research

• CHILDES project

• Leo corpus

• CLAN language analysis tools

Corpora in acquisition research?

• linguistic intuitions of native speakers?
  – adult speakers’ intuitions fail
  – child will not speak on demand
  – child can’t judge own sentences

Leo (1;11,16): Leiter hoch. ‘ladder up/high’

? particle: (pull/push/...) up
? adjective: (the ladder is) long
utterance context is needed!

Leo (1;11,16): Leiter hoch. (looking at a long ladder in a book)

adult German: ‘(The) ladder (is) long’, i.e. adjective (in adult terms!)

Corpora in acquisition research

• corpora contain actually made utterances
• situated in natural contexts
• data are verifiable
• frequency analyses possible

Kinds of corpora:
• diary studies (e.g., Preyer 1882)
• experimental data
• spoken speech corpora (longitudinal / cross-sectional)

CHILDES

• Child Language Data Exchange System

• Brian MacWhinney / Catherine Snow

• founded in 1984

• part of TalkBank (adult corpora)

• > 1500 published articles

• 4500 Members

CHILDES

3 parts:

• language data, i.e. corpora

• CHAT transcription system

• CLAN computer programs

CHILDES corpora

• 130 corpora publicly available (via www)
• 26 languages
• L1 normally developing
• L1 language disorders
• bi- and trilingual children and adults

TalkBank adult corpora:
• L2, aphasics (English, German, Hungarian, Chinese, Italian), ...

CHAT and CLAN

• CHAT:
  – Codes for the Human Analysis of Transcripts
  – ensure a standard format for all corpora

• CLAN:
  – Computerized Language Analysis
  – several commands for analyzing data in CHAT format
    • single program interface

Leo corpus

• Leo, monolingual L1 German boy

• recorded 1999 – 2002

• Heike Behrens, MPI Leipzig

• transcribed in CHAT format

• analysable with CLAN programs

• not publicly available

Leo corpus

• 2;0 – 3;0: 5 x 1 hr / week = 20–22 hrs / month, + diary for new structures
• 3;0 – 5;0: 5 x 1 hr / month = 5 hrs / month
• ca. 400 hrs total recording time

• includes utterances of child and conversation partners

• spontaneous interaction (free play)
• no book-reading
• experimenter present
• some sessions videotaped

Leo corpus

• 1.8 million words of spoken speech

• child: ca. 500,000 words

• BNC: largest balanced corpus

• 100 million words

• 10% spoken speech: 10 million words

• “dense” corpus

“Dense” corpora

• longitudinal databases with denser recording intervals
  – traditional: 0.5 – 1 hr / week
  – Leo: 1.25 – 5 hrs / week

• assumption:
  – child is awake and talks 10 hrs / day
  – traditional: ca. 1% of output
  – Leo: 2% – 7% of output

(Tomasello / Stahl 2004)
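The percentages above follow from simple arithmetic. A minimal sketch, assuming (as the slide does) 10 hours of talk per day, which gives 70 hours per week; the function name is made up for illustration:

```python
# Assumption from the slide: the child is awake and talking 10 hrs / day.
TALK_HOURS_PER_WEEK = 10 * 7


def sampled_share(recorded_hours_per_week: float) -> float:
    """Percentage of the child's weekly output covered by the recordings."""
    return 100 * recorded_hours_per_week / TALK_HOURS_PER_WEEK


# Recording densities taken from the slide (hours per week).
for label, hours in [
    ("traditional, low", 0.5),
    ("traditional, high", 1.0),
    ("Leo, age 3;0-5;0", 1.25),
    ("Leo, age 2;0-3;0", 5.0),
]:
    print(f"{label}: {sampled_share(hours):.1f}% of output")
```

The traditional range comes out at roughly 1%, the Leo range at roughly 2% to 7%, matching the rounded figures on the slide.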

“Dense” corpora

• advantages:– capture of infrequent phenomena– better estimate of vocabulary size– age of emergence– smoother developmental curves– input / production frequency measures

“Dense” corpora

Likelihood to capture a target token in a year of recording (Tomasello / Stahl 2004)

[Figure: capture likelihood for target tokens occurring 1/day and 10/day]
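One simple way to reason about capture likelihood is sketched below. It is not taken from Tomasello / Stahl (2004): it assumes, purely for illustration, that target tokens occur independently at a constant daily rate (a Poisson assumption) and that recordings cover a fixed share of the child's output. The function and parameter names are hypothetical.

```python
import math


def capture_probability(tokens_per_day: float, sampled_share: float, days: int = 365) -> float:
    """Probability that at least one target token ends up on tape.

    Assumes Poisson-distributed tokens (an assumption of this sketch):
    the expected number of recorded tokens is rate * share * days.
    """
    expected_recorded = tokens_per_day * sampled_share * days
    return 1.0 - math.exp(-expected_recorded)


# Token rates from the slide (1/day, 10/day); shares of ca. 1% and 7% of output.
for rate in (1, 10):
    for share in (0.01, 0.07):
        p = capture_probability(rate, share)
        print(f"{rate:>2} tokens/day, {share:.0%} sampled: {p:.3f}")
```

Shortening `days` (for example to 7 or 30) shows how quickly the chance of catching a rare structure drops when the observation window or the sampling density shrinks.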

Drawback: only 1 child!

• no generalizations possible

• drawback?

• usage-based approach
  – child is believed to construct language individually
  – based on personal experience with language
  – no help from language-specific knowledge

Usage-based approach

• child moves gradually from lexically specific to abstract knowledge

• no adult categories

• input and frequency play a role (corpus needed!)

• close studies of individuals highly valuable

• dense longitudinal vs. traditional cross-sectional corpora

“Control” corpora

Kerstin & Simone
• Max Miller, MPI Nijmegen
• 1;3 / 1;9 – 4;0

Kerstin:
  – 0.5 – 2.7 recordings / month
  – ca. 270,000 words (child: 55,000)

Simone:
  – 1.25 – 3.5 recordings / month
  – ca. 450,000 words (child: 86,000)

“Control” corpora

Pauline & Sebastian

• Prof. Rigol

Pauline:
  – 0;0 – 7;11 / 1 – 2 recordings / month
  – 340,000 words (child: 85,000)

Sebastian:
  – 0;0 – 7;4 / 1 – 2 recordings / month
  – 350,000 words (child: 75,000)

Leo corpus

• CHAT-format

• 1 transcription file per session

• txt-format

• no running text

@Headers (file explanations)
*Main tier lines (utterances)
%Dependent tiers (annotations of utterances)
@End

CHAT: Main tiers

• Each utterance on own line
• Each line starts with a tier
• Each speaker has own tier: *CHI, *VAT, ...
• Annotations on dependent tier: %mor, %pho...

Child: Yes. Fish! – Father: Fish? – Child: Yes.

*CHI: ja .
%mor: $INTER|ja .
*CHI: Fisch !
%mor: $N:03:m:NOM:SG|Fisch !
*VAT: Fisch ?
%mor: $N:03:m:CAS|Fisch ?
*CHI: ja .
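The tier structure can be read mechanically. The following is a minimal sketch, not part of CLAN: it groups each main tier line with the dependent tiers that follow it. The sample reuses the exchange above; the `@Begin` header, the tab after the colon, and all Python names are assumptions of the sketch, and continuation lines are not handled.

```python
from dataclasses import dataclass, field

# Sample adapted from the slide; headers and tab separators are assumed.
SAMPLE = (
    "@Begin\n"
    "*CHI:\tja .\n"
    "%mor:\t$INTER|ja .\n"
    "*CHI:\tFisch !\n"
    "%mor:\t$N:03:m:NOM:SG|Fisch !\n"
    "*VAT:\tFisch ?\n"
    "%mor:\t$N:03:m:CAS|Fisch ?\n"
    "*CHI:\tja .\n"
    "@End\n"
)


@dataclass
class Utterance:
    speaker: str                                 # e.g. "CHI", "VAT"
    text: str                                    # main tier content
    tiers: dict = field(default_factory=dict)    # e.g. {"mor": "..."}


def parse_chat(text: str) -> list:
    """Group main tier lines (*XXX:) with their dependent tiers (%xxx:)."""
    utterances = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and @headers
        label, _, content = line.partition(":")
        content = content.strip()
        if label.startswith("*"):
            utterances.append(Utterance(label[1:], content))
        elif label.startswith("%") and utterances:
            utterances[-1].tiers[label[1:]] = content
    return utterances


for utt in parse_chat(SAMPLE):
    print(utt.speaker, "|", utt.text, "|", utt.tiers.get("mor", "(no %mor tier)"))
```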

CHAT: Transcription

• orthographic or not?

• depends on purpose

• orthographic transcription: ease of retrieval
  – additional information via dependent tiers (%pho)
  – utterances can be linked to digitized sound files (Sonic-CHAT)
  – or to video files

SONIC CHAT

CHAT: Transcription

• spoken speech not as orderly as written texts

• coding scheme for spoken speech phenomena:
  – overlaps
  – trailing off
  – noncompletions: is(t)
  – retracing: Schrei [//] Scheibenwischer
  – non-words: hm@o
  – replacements: nix [: nichts]

CHAT: Annotation

• annotations to an utterance (tagging) on the dependent tiers of that utterance

*CHI: [D] ist drin .

%mor: $VCOP:S:POS:PRES:3s|sein $ADV|drin .

%exp: es ist noch etwas Kakao im Becher .

• here:
  – %mor: morphology
  – %exp: explanation of utterance situation

Reading the tag $VCOP:S:POS:PRES:3s|sein field by field:

VCOP = copula
S = suppletive
POS = (empty)
PRES = tense
3s = agreement
sein = citation form

ist is the 3rd pers. sing. present tense of the suppletive copula verb sein
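A tag of this shape can be split into its parts with a few string operations. This is a minimal sketch that only assumes the layout visible on the slides ("$", colon-separated fields, "|", citation form); the function name and the dictionary keys are made up for illustration.

```python
def split_mor_token(token: str) -> dict:
    """Split one %mor token such as '$VCOP:S:POS:PRES:3s|sein'."""
    tag, _, lemma = token.lstrip("$").partition("|")
    category, *features = tag.split(":")
    return {"category": category, "features": features, "lemma": lemma}


print(split_mor_token("$VCOP:S:POS:PRES:3s|sein"))
print(split_mor_token("$ADV|drin"))
```

Keeping the category separate from the remaining fields makes it easy to search for a whole word class (all tags starting with VCOP, say) regardless of tense or agreement.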

CHAT: Annotation

• tagging is based on theoretical notions of adult language!
  – e.g., when ist is tagged as VCOP etc., this doesn’t mean that it constitutes a VCOP for the child

*CHI: [D] Leiter hoch .
%mor: $N:02:f:AKK:SG|Leiter $PT|hoch .

• hoch a verb particle for the child?

CHAT: Annotation

• transcription and annotation in CLAN editor
  – converts txt- and SALT-format files to CHAT
  – automatic tagging (%mor-tiers)
    • lexicon file with word information
    • tag disambiguation (manual / probabilistic)
  – computes coding reliability
  – checks conformity with CHAT-conventions

• works on different workstations (unlike TRANSANA)
  – access files on network drive

CLAN / CHAT interface

CLAN Commands

• search commands, e.g.
  – simple and combined strings in utterances and annotations
  – interaction blocks
  – imitations, repetitions, overlaps

• computing commands, e.g.
  – mlu / mlt (mean length of utterances / turns)
  – longest words / utterances
  – vocabulary diversity: TTR, measure D
  – frequency of phonemes by position

CLAN Commands

• commands in DOS-like style

Examples

research question: WH-words

• emergence

• frequency

• use

Emergence of Interrogative Pronouns

coding:
*CHI: was machst du ? (lit.: what do you?)
%mor: $PRO:int|was [...]

search:
kwal +t*CHI +t%mor +s$pro:int* le020*.cha

  kwal = search
  +t*CHI +t%mor = in the child’s mor-tiers
  +s$pro:int* = for strings starting with $pro:int
  le020*.cha = in all files up to 2;9

Triple-click to access utterance!

Emergence of Interrogative Pronouns

• output:

kwal (08-Dec-2004) is conducting analyses on:
  ONLY speaker main tiers matching: *CHI;
  and those speakers' ONLY dependent tiers matching: %MOR;
[…]
From file <le020006.cha>
From file <le020007.cha>
From file <le020008.cha>
----------------------------------------
*** File "le020008.cha": line 2603. Keyword: $pro:int|wo
*CHI: Mama, wo bis(t) du ! (Mama, where are you!)
%mor: $N:01:f:VOC:SG|Mama $PRO:int|wo $VCOP:S:POS:PRES:2s|sein $PRO:pers:NOM:SG|du $N:01:f:VOC:SG|Mama !

Frequency of Interrogative Pronouns

• Does the child’s use match with the input frequency?

Child:
freq +t*CHI +t%mor +s"$PRO:INT%|*" le020*.cha +u +o

  freq = give frequency count
  +t*CHI +t%mor = for child’s mor-tier
  +s"$PRO:INT%|*" = of $pro:int
  le020*.cha = for all files up to 2;9
  +u = together
  +o = sort output

Input (-t*CHI = for other people’s mor-tiers):
freq -t*CHI +t%mor +s"$PRO:INT%|*" le020*.cha +u +o
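The logic of the child / input comparison can be sketched outside CLAN as well. The example below is an illustration only, not a reimplementation of freq: it counts citation forms tagged $PRO:int separately for the child and for everyone else. The mini-transcript is invented, and tags such as $V or $PRO:dem are simplified placeholders rather than the corpus's actual tag set.

```python
from collections import Counter

# Invented (speaker, %mor tier) pairs for illustration only.
TIERS = [
    ("CHI", "$PRO:int|wo $VCOP:S:POS:PRES:2s|sein $PRO:pers:NOM:SG|du ?"),
    ("VAT", "$PRO:int|was $V|machen $PRO:pers:NOM:SG|du ?"),
    ("CHI", "$PRO:int|was $VCOP:S:POS:PRES:3s|sein $PRO:dem|das ?"),
    ("VAT", "$PRO:int|was $VCOP:S:POS:PRES:3s|sein $ADV|drin ?"),
    ("VAT", "$PRO:int|wo $VCOP:S:POS:PRES:3s|sein $N:03:m:NOM:SG|Fisch ?"),
]


def interrogative_counts(tiers, child="CHI"):
    """Count $PRO:int citation forms for the child and for the input."""
    child_counts, input_counts = Counter(), Counter()
    for speaker, mor in tiers:
        target = child_counts if speaker == child else input_counts
        for token in mor.split():
            tag, sep, lemma = token.partition("|")
            # Case-insensitive prefix match, like searching for $pro:int*.
            if sep and tag.lower().startswith("$pro:int"):
                target[lemma] += 1
    return child_counts, input_counts


child, adults = interrogative_counts(TIERS)
print("child:", child.most_common())
print("input:", adults.most_common())
```

With real data the two counters can be normalised by the number of utterances per speaker group before they are compared.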

Non-interrogative use of wh-words

• Make a file with all words for interrogative pronouns Leo uses:

freq […] +s"$PRO:INT%|*" +u +d1 >> Leowh.cut

  +u = for all files together
  +d1 = without frequency count numbers
  >> Leowh.cut = and direct output to file


• then look for uses of these words in sentences that do not contain $pro:int

combo +t*CHI +t%mor le*.cha +s@Leowh.cut^*^%mor:^*^!$pro:int*

  +s@Leowh.cut = take words from file as search string for utterances
  ^*^%mor: = followed by %mor
  ^*^!$pro:int* = not containing $pro:int

• combo: search with Boolean operators

Word order in wh-questions

• German: verb has to follow wh-word directly – any errors?

• search for all utterances that do not follow this pattern:

combo +t*CHI +t%mor +s$pro:int*^!$v* le*.cha

search for child’s $pro:int not directly followed by any $v
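The same check can be written as a small function over a %mor tier. This is a hedged sketch of the idea, not of combo's matching rules: it flags a tier when a $pro:int token is not immediately followed by a token whose tag starts with $v. A wh-word at the very end of the tier is also flagged, which is a choice made for this sketch.

```python
def wh_not_followed_by_verb(mor: str) -> bool:
    """True if some $pro:int token is not directly followed by a $v... token."""
    # Keep only tagged tokens; utterance delimiters such as '?' are dropped.
    tokens = [t.lower() for t in mor.split() if t.startswith("$")]
    for i, token in enumerate(tokens):
        if token.startswith("$pro:int"):
            following = tokens[i + 1] if i + 1 < len(tokens) else ""
            if not following.startswith("$v"):
                return True
    return False


# Target-like order (verb directly after the wh-word).
print(wh_not_followed_by_verb(
    "$PRO:int|wo $VCOP:S:POS:PRES:2s|sein $PRO:pers:NOM:SG|du ?"))
# Invented non-target order for illustration (pronoun intervenes).
print(wh_not_followed_by_verb(
    "$PRO:int|wo $PRO:pers:NOM:SG|du $VCOP:S:POS:PRES:2s|sein ?"))
```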

Co-occurrences of wh-words

• What words does was (what) co-occur with when used as an interrogative pronoun?

kwal +t*CHI +t%mor +s$pro:int%"|"was +d +o* le*.cha | cooccur +swas +t*CHI +u

• kwal looks for all uses of was as $pro:int
• the results are directed to cooccur (“piping”)

   6 was da
  32 was das
   1 was denkst
   2 was denn
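The counting step can be illustrated with adjacent word pairs. The sketch below is an assumption-laden stand-in for the piped command: it skips the tag filter that kwal applies and simply counts which word follows was. The utterances are invented.

```python
from collections import Counter

# Invented child utterances for illustration only.
UTTERANCES = ["was ist das ?", "was das ?", "was denn ?", "was das ?"]

DELIMITERS = {".", "?", "!"}


def following_words(utterances, target):
    """Count (target, next word) pairs across utterances."""
    pairs = Counter()
    for utterance in utterances:
        words = [w for w in utterance.lower().split() if w not in DELIMITERS]
        for first, second in zip(words, words[1:]):
            if first == target:
                pairs[(first, second)] += 1
    return pairs


for (first, second), count in following_words(UTTERANCES, "was").most_common():
    print(count, first, second)
```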

Measuring lexical diversity

• traditional: type-token-ratio (TTR)
  – number of different word types
  – against total number of words

• every word is a new word: TTR 1.0
• the lower the TTR, the less lexical diversity
• problem: depends on sample size
  – in a large sample, the total vocabulary will finally be exhausted
  – TTR levels out because highly frequent words will increase the number of tokens disproportionately
  – rarely occurring types will have little influence on TTR
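The sample-size problem is easy to see in a toy example. This is a minimal sketch with an invented token list; the function name is made up for illustration.

```python
def type_token_ratio(tokens):
    """Number of different word types divided by the total number of tokens."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty sample")
    return len(set(tokens)) / len(tokens)


# Invented toy sample: frequent words keep recurring as the sample grows.
tokens = "ja fisch ja da fisch ja das ist ein fisch ja da".split()

print(type_token_ratio(tokens[:4]))   # first 4 tokens: 3 types
print(type_token_ratio(tokens))       # all 12 tokens: 6 types
```

The first four tokens give 0.75, the full sample gives 0.5, although the speaker and the vocabulary are the same: the longer sample simply repeats frequent words more often.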

measure D

• measure D is obtained by
  – randomly sampling the corpus
  – calculating the actual leveling out of the TTR
  – and comparing this to theoretical models of TTR curves
• the probability of new types being introduced in the corpus is calculated, regardless of sample size

• In CLAN:
  – TTR: freq
  – Measure D: VOCD
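The three steps (random sampling, observed TTR curve, comparison with a model curve) can be sketched as follows. This illustrates the idea and does not describe VOCD's internals: the curve equation TTR = (D/N)(sqrt(1 + 2N/D) - 1) is the one usually cited for D, and the sample sizes (35 to 50 tokens), the 100 draws per size, the grid search, and all names are assumptions of the sketch.

```python
import math
import random


def model_ttr(n: int, d: float) -> float:
    """Model TTR for a sample of n tokens, given diversity parameter d."""
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)


def estimate_d(tokens, sizes=range(35, 51), trials=100, seed=0) -> float:
    """Fit d so that the model curve matches the averaged sampled TTRs."""
    if len(tokens) < max(sizes):
        raise ValueError("sample too small for the chosen subsample sizes")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = {}
    for n in sizes:
        # Average TTR over random subsamples drawn without replacement.
        ttrs = [len(set(rng.sample(tokens, n))) / n for _ in range(trials)]
        observed[n] = sum(ttrs) / trials

    def squared_error(d: float) -> float:
        return sum((model_ttr(n, d) - ttr) ** 2 for n, ttr in observed.items())

    candidates = [step / 10 for step in range(1, 2001)]  # grid 0.1 ... 200.0
    return min(candidates, key=squared_error)


repetitive = ["ja", "da", "fisch"] * 40           # 3 types, 120 tokens
diverse = [f"wort{i}" for i in range(120)]        # 120 types, 120 tokens
print(estimate_d(repetitive), estimate_d(diverse))
```

Because every subsample has the same size range regardless of how long the transcript is, the fitted value does not shrink simply because more speech was recorded, which is the point of the measure.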
