The Leo Corpus


Page 1: The Leo Corpus

The Leo Corpus

• German L1 Learner Corpus

Page 2: The Leo Corpus

Overview

• corpora in child language research

• CHILDES project

• Leo corpus

• CLAN language analysis tools

Page 3: The Leo Corpus

Corpora in acquisition research?

• linguistic intuitions of native speakers?
– adult speakers’ intuitions fail
– child will not speak on demand
– child can’t judge own sentences

Leo (1;11,16): Leiter hoch. ‘ladder up/high’

? particle: (pull/push/...) up
? adjective: (the ladder is) long

utterance context is needed!

Page 4: The Leo Corpus

Corpora in acquisition research?

• linguistic intuitions of native speakers?
– adult speakers’ intuitions fail
– child will not speak on demand
– child can’t judge own sentences

Leo (1;11,16): Leiter hoch.
context: looking at a long ladder in a book
adult German: (The) ladder (is) long
adjective (in adult terms!)

Page 5: The Leo Corpus

Corpora in acquisition research

• corpora contain actually made utterances
• situated in natural contexts
• data are verifiable
• frequency analyses possible

Kinds of corpora:
• diary studies (e.g., Preyer 1882)
• experimental data
• spoken speech corpora (longitudinal / cross-sectional)

Page 6: The Leo Corpus

CHILDES

• Child Language Data Exchange System

• Brian MacWhinney / Catherine Snow

• founded in 1984

• part of TalkBank (adult corpora)

• > 1500 published articles

• 4500 Members

Page 7: The Leo Corpus

CHILDES

3 parts:

• language data, i.e. corpora

• CHAT transcription system

• CLAN computer programs

Page 8: The Leo Corpus

CHILDES corpora

• 130 corpora publicly available (via www)
• 26 languages
• L1 normally developing
• L1 language disorders
• bi- and trilingual children and adults

TalkBank adult corpora:
• L2, aphasics (English, German, Hungarian, Chinese, Italian), ...

Page 9: The Leo Corpus

CHAT and CLAN

• CHAT:
– Codes for the Human Analysis of Transcripts
– ensure a standard format for all corpora

• CLAN:
– Computerized Language Analysis
– several commands for analyzing data in CHAT format

• single program interface

Page 10: The Leo Corpus

Leo corpus

• Leo, monolingual L1 German boy

• recorded 1999 – 2002

• Heike Behrens, MPI Leipzig

• transcribed in CHAT format

• analysable with CLAN programs

• not publicly available

Page 11: The Leo Corpus

Leo corpus

2;0 – 3;0: 5 x 1hr / week = 20-22hrs / month
+ diary for new structures

3;0 – 5;0: 5 x 1hr / month = 5hrs / month

ca. 400hrs total recording time

• includes utterances of child and conversation partners

• spontaneous interaction (free play)
• no book-reading
• experimenter present
• some sessions videotaped

Page 12: The Leo Corpus

Leo corpus

• 1.8 million words of spoken speech

• child: ca. 500,000 words

• BNC: largest balanced corpus

• 100 million words

• 10% spoken speech: 10 million words

• “dense” corpus

Page 13: The Leo Corpus

“Dense” corpora

• longitudinal databases with denser recording intervals
– traditional: 0.5 – 1hr / week
– Leo: 1.25 – 5hrs / week

• assumption:
– child is awake and talks 10hrs / day
– traditional: ca. 1% of output
– Leo: 2% - 7% of output

(Tomasello / Stahl 2004)
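The percentages on this slide follow from dividing weekly recording time by weekly talking time. A minimal Python sketch of that arithmetic, assuming 7 days of 10 talking hours each (the function name `sampled_share` is illustrative, not part of any tool mentioned here):

```python
def sampled_share(hours_recorded_per_week, talking_hours_per_day=10.0):
    """Return the share of a child's weekly output that ends up on tape."""
    talking_hours_per_week = talking_hours_per_day * 7  # assumption from the slide
    return hours_recorded_per_week / talking_hours_per_week


schedules = [
    ("traditional, 0.5 hr/week", 0.5),
    ("traditional, 1 hr/week", 1.0),
    ("Leo, 1.25 hrs/week", 1.25),
    ("Leo, 5 hrs/week", 5.0),
]
for label, hours in schedules:
    print(f"{label}: {sampled_share(hours):.1%}")
```

The two traditional schedules come out at roughly 1% and the two Leo schedules at roughly 2% and 7%, in line with the rounded figures above.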

Page 14: The Leo Corpus

“Dense” corpora

• advantages:
– capture of infrequent phenomena
– better estimate of vocabulary size
– age of emergence
– smoother developmental curves
– input / production frequency measures

Page 15: The Leo Corpus

“Dense” corpora

Likelihood to capture a target token in a year of recording (Tomasello / Stahl 2004)

[Figure: capture likelihood for target frequencies of 1 token/day and 10 tokens/day]
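The figure itself did not survive in this transcript. As a rough illustration of the idea only, the sketch below uses a simple independence model: every token the child produces is assumed to fall inside a recording with a probability equal to the sampled share of output. This model, the function name and the parameter values are assumptions made for illustration and may differ from the calculation in Tomasello / Stahl (2004).

```python
def capture_probability(tokens_per_day, sampled_share, days=365):
    """Chance that at least one token of the target lands in the recorded sample."""
    tokens_produced = tokens_per_day * days
    # Assumption: each produced token is recorded with probability
    # `sampled_share`, independently of all other tokens.
    return 1.0 - (1.0 - sampled_share) ** tokens_produced


for rate in (1, 10):              # target frequencies named on the slide
    for share in (0.01, 0.07):    # roughly traditional vs. dense sampling
        p = capture_probability(rate, share)
        print(f"{rate}/day, {share:.0%} sampled: {p:.4f}")
```

Under this model, the capture probability rises with both the frequency of the target and the sampling density, which is the qualitative point of the slide.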

Page 16: The Leo Corpus

Drawback: only 1 child!

• no generalizations possible

• drawback?

• usage-based approach
– child is believed to construct language individually
– based on personal experience with language
– no help from language-specific knowledge

Page 17: The Leo Corpus

Usage-based approach

• child moves gradually from lexically specific to abstract knowledge

• no adult categories

• input and frequency play a role (corpus needed!)

• close studies of individuals highly valuable

• dense longitudinal vs. traditional cross-sectional corpora

Page 18: The Leo Corpus

“Control” corpora

Kerstin & Simone
• Max Miller, MPI Nijmegen
• 1;3 / 1;9 – 4;0

Kerstin:
– 0.5 – 2.7 recordings / month
– ca. 270,000 words (child: 55,000)

Simone:
– 1.25 – 3.5 recordings / month
– ca. 450,000 words (child: 86,000)

Page 19: The Leo Corpus

“Control” corpora

Pauline & Sebastian

• Prof. Rigol

Pauline:
– 0;0 – 7;11
– 1 – 2 recordings / month
– 340,000 words (child: 85,000)

Sebastian:
– 0;0 – 7;4
– 1 – 2 recordings / month
– 350,000 words (child: 75,000)

Page 20: The Leo Corpus

Leo corpus

• CHAT-format

• 1 transcription file per session

• txt-format

• no running text

@Headers (file explanations)
*Main tier lines (utterances)
%Dependent tiers (annotations of utterances)
@End

Page 21: The Leo Corpus

CHAT: Headers

@Begin
@Languages: de
@Participants: CHI Leo Target_Child, MUT Maren Mother, VAT Thorsten Father, MEC Mechthild Observer
@ID: de|mpi_evan|CHI|2;06.08|male|group|middle|Target_Child|education|
@ID: de|mpi_evan|MUT|30;00.00|female|group|middle|Mother|Abitur_Lehre|
@ID: de|mpi_evan|VAT|35;00.00|male|group|middle|Father|university|
@ID: de|mpi_evan|MEC|24;00.00|female|group|middle|Observer|university|
@Filename: le020608.cha
@Date: 11-SEP-1999
@Age of CHI: 2;06.08
@Comment: Dependent: exp, vrb, act, par,
@Comment: in der Wohnung, beim Einkaufen

Page 22: The Leo Corpus

CHAT: Main tiers

• Each utterance on own line
• Each line starts with a tier
• Each speaker has own tier: *CHI, *VAT, ...
• Annotations on dependent tier: %mor, %pho...

Child: Yes. Fish! – Father: Fish? – Child: Yes.

*CHI: ja .
%mor: $INTER|ja .
*CHI: Fisch !
%mor: $N:03:m:NOM:SG|Fisch !
*VAT: Fisch ?
%mor: $N:03:m:CAS|Fisch ?
*CHI: ja .
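The three line types (@ headers, * main tiers, % dependent tiers) make a transcript easy to read programmatically. The following Python sketch groups each utterance with its dependent tiers. It is a simplification under stated assumptions: one tier per line, no continuation lines, and the function name `parse_chat` is illustrative rather than part of CLAN.

```python
def parse_chat(text):
    """Split CHAT-style text into header lines and utterances with their tiers."""
    headers, utterances = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("@"):
            headers.append(line)
        elif line.startswith("*"):
            # Main tier: "*CHI: ja ." -> speaker "CHI", text "ja ."
            speaker, _, words = line[1:].partition(":")
            utterances.append({"speaker": speaker, "text": words.strip(), "tiers": {}})
        elif line.startswith("%") and utterances:
            # Dependent tier: attach it to the most recent utterance.
            name, _, value = line[1:].partition(":")
            utterances[-1]["tiers"][name] = value.strip()
    return headers, utterances


sample = """@Begin
*CHI: ja .
%mor: $INTER|ja .
*CHI: Fisch !
%mor: $N:03:m:NOM:SG|Fisch !
*VAT: Fisch ?
%mor: $N:03:m:CAS|Fisch ?
*CHI: ja .
@End"""

headers, utterances = parse_chat(sample)
for utterance in utterances:
    print(utterance["speaker"], "|", utterance["text"], "|", utterance["tiers"])
```

Keeping the dependent tiers in a dictionary per utterance mirrors the CHAT idea that an annotation always belongs to the utterance directly above it.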

Page 23: The Leo Corpus

CHAT: Transcription

• orthographic or not?

• depends on purpose

• orthographic transcription: ease of retrieval
– additional information via dependent tiers (%pho)
– utterances can be linked to digitized sound files (Sonic-CHAT)
– or to video files

Page 24: The Leo Corpus

SONIC CHAT

Page 25: The Leo Corpus

CHAT: Transcription

• spoken speech not as orderly as written texts

• coding scheme for spoken speech phenomena:
– overlaps
– trailing off
– noncompletions: is(t)
– retracing: Schrei [//] Scheibenwischer
– non-words: hm@o
– replacements: nix [: nichts]
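Because these codes are regular, a transcript line can be reduced to its intended target words before counting or searching. The Python sketch below handles only the four coded examples given above and assumes that retracing and replacement codes apply to a single preceding word. The function name `normalize_utterance` is hypothetical and not a CLAN command.

```python
import re


def normalize_utterance(text):
    """Resolve a few CHAT codes so the utterance reads as plain target words."""
    # Replacement: "nix [: nichts]" -> "nichts"
    text = re.sub(r"\S+\s+\[:\s*([^\]]+)\]", r"\1", text)
    # Retracing: "Schrei [//] Scheibenwischer" -> "Scheibenwischer"
    text = re.sub(r"\S+\s+\[//\]\s*", "", text)
    # Noncompletion: "is(t)" -> "ist"
    text = re.sub(r"\((\w+)\)", r"\1", text)
    # Special-form marker: "hm@o" -> "hm"
    text = re.sub(r"@\w+", "", text)
    return text


for coded in ["is(t)", "Schrei [//] Scheibenwischer", "hm@o", "nix [: nichts]"]:
    print(f"{coded!r} -> {normalize_utterance(coded)!r}")
```

Whether the child's actual form or the target form should be kept depends on the research question, so a real pipeline would make each of these substitutions optional.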

Page 26: The Leo Corpus

CHAT: Annotation

• annotations to an utterance (tagging) on the dependent tiers of that utterance

*CHI: [D] ist drin .

%mor: $VCOP:S:POS:PRES:3s|sein $ADV|drin .

%exp: es ist noch etwas Kakao im Becher .

• here:
– %mor: morphology
– %exp: explanation of utterance situation

Page 27: The Leo Corpus

CHAT: Annotation

• annotations to an utterance (tagging) on the dependent tiers of that utterance

*CHI: [D] ist drin .

%mor: $VCOP:S:POS:PRES:3s|sein $ADV|drin .

VCOP = copula
S = suppletive
POS = (empty)
PRES = tense
3s = agreement
sein = citation form

ist is the 3rd pers. sing. present tense of the suppletive copula verb sein
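A %mor tag of this shape can be taken apart mechanically: the citation form follows the vertical bar, and the colon-separated fields before it carry the category and its features. A small Python sketch, with the function name `parse_mor_tag` chosen for illustration only:

```python
def parse_mor_tag(tag):
    """Split a tag such as '$VCOP:S:POS:PRES:3s|sein' into its parts."""
    analysis, _, lemma = tag.partition("|")
    # The first field is the word class, the remaining fields are features.
    category, *features = analysis.lstrip("$").split(":")
    return {"category": category, "features": features, "lemma": lemma}


print(parse_mor_tag("$VCOP:S:POS:PRES:3s|sein"))
print(parse_mor_tag("$ADV|drin"))
```

The sketch only separates the fields. What each field means (suppletive, tense, agreement and so on) still has to come from the tag set documentation of the corpus.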

Page 28: The Leo Corpus

CHAT: Annotation

• tagging is based on theoretical notions of adult language!
– e.g., when ist is tagged as VCOP etc., this doesn’t mean that it constitutes a VCOP for the child

*CHI: [D] Leiter hoch .

%mor: $N:02:f:AKK:SG|Leiter $PT|hoch .

• hoch a verb particle for the child?

Page 29: The Leo Corpus

CHAT: Annotation

• transcription and annotation in CLAN editor
– converts txt- and SALT-format files to CHAT
– automatic tagging (%mor-tiers)
  • lexicon file with word information
  • tag disambiguation (manual / probabilistic)
– computes coding reliability
– checks conformity with CHAT-conventions

• works on different workstations (unlike TRANSANA)
– access files on network drive

Page 30: The Leo Corpus

CLAN / CHAT interface

Page 31: The Leo Corpus

CLAN Commands

• search commands, e.g.
– simple and combined strings in utterances and annotations
– interaction blocks
– imitations, repetitions, overlaps

• computing commands, e.g.
– mlu / mlt (mean length of utterances / turns)
– longest words / utterances
– vocabulary diversity: TTR, measure D
– frequency of phonemes by position

Page 32: The Leo Corpus

CLAN Commands

• commands in DOS-like style

Examples

research question: WH-words

• emergence

• frequency

• use

Page 33: The Leo Corpus

Emergence of Interrogative Pronouns

coding:
*CHI: was machst du ? (lit.: what do you?)
%mor: $PRO:int|was [...]

search:
kwal +t*CHI +t%mor +s$pro:int* le020*.cha

kwal: search
+t*CHI: in the child’s
+t%mor: mor-tiers
+s$pro:int*: for strings starting with $pro:int
le020*.cha: in all files up to 2;9
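For readers without CLAN at hand, the logic of this kwal search can be imitated in a few lines of Python. This is a sketch of the idea, not a reimplementation of kwal: it assumes one tier per line, matches case-insensitively because the lowercase search string on the slide is meant to find the uppercase $PRO:int tags, and uses an illustrative function name.

```python
def find_keyword(lines, speaker="CHI", tier="mor", prefix="$pro:int"):
    """Return (utterance, tier line) pairs whose tier holds a tag starting with prefix."""
    hits = []
    current = None  # main tier of the selected speaker, if we are inside one
    for line in lines:
        if line.startswith("*"):
            current = line if line.startswith(f"*{speaker}:") else None
        elif current and line.startswith(f"%{tier}:"):
            tags = line.split(":", 1)[1].split()
            if any(tag.lower().startswith(prefix.lower()) for tag in tags):
                hits.append((current, line))
    return hits


transcript = [
    "*CHI: ja .",
    "%mor: $INTER|ja .",
    "*VAT: Fisch ?",
    "%mor: $N:03:m:CAS|Fisch ?",
    "*CHI: was machst du ?",
    "%mor: $PRO:int|was [...]",
]
for utterance, annotation in find_keyword(transcript):
    print(utterance)
    print(annotation)
```

Running the same function over files sorted by the child's age and stopping at the first hit would give the age of emergence for the tag in question.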

Page 34: The Leo Corpus

Emergence of Interrogative Pronouns

• output:

kwal (08-Dec-2004) is conducting analyses on:
  ONLY speaker main tiers matching: *CHI;
  and those speakers' ONLY dependent tiers matching: %MOR;
[…]
From file <le020006.cha>
From file <le020007.cha>
From file <le020008.cha>
----------------------------------------
*** File "le020008.cha": line 2603. Keyword: $pro:int|wo
*CHI: Mama, wo bis(t) du ! (Mama, where are you!)
%mor: $N:01:f:VOC:SG|Mama $PRO:int|wo $VCOP:S:POS:PRES:2s|sein $PRO:pers:NOM:SG|du $N:01:f:VOC:SG|Mama !

Triple-click to access utterance!

Page 35: The Leo Corpus

Frequency of Interrogative Pronouns

• Does the child’s use match with the input frequency?

Child:
freq +t*CHI +t%mor +s"$PRO:INT%|*" le020*.cha +u +o

Input:
freq -t*CHI +t%mor +s"$PRO:INT%|*" le020*.cha +u +o

freq: give frequency count
+t*CHI / -t*CHI: for child’s / for other people’s
+t%mor: mor-tier
+s"$PRO:INT%|*": of $pro:int
le020*.cha: for all files up to 2;9
+u: together
+o: sort output

Page 36: The Leo Corpus

Frequency of Interrogative Pronouns

• result:

child                  input
536 $pro:int|was       13162 $pro:int|was
305 $pro:int|wo        3486 $pro:int|wo
70 $pro:int|wie        1608 $pro:int|wer
31 $pro:int|wer        1255 $pro:int|wie
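Raw counts are hard to compare because the input contains far more tokens than the child's speech. The Python sketch below converts the counts from this slide into within-speaker shares and rank orders. The variable and function names are illustrative.

```python
child = {"was": 536, "wo": 305, "wie": 70, "wer": 31}
adult_input = {"was": 13162, "wo": 3486, "wer": 1608, "wie": 1255}


def shares(counts):
    """Return each word's share of all interrogative pronoun tokens."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def ranking(counts):
    """Return the words ordered from most to least frequent."""
    return sorted(counts, key=counts.get, reverse=True)


child_shares, input_shares = shares(child), shares(adult_input)
for word in ranking(child):
    print(f"{word}: child {child_shares[word]:.1%}, input {input_shares[word]:.1%}")

print("child ranking:", ranking(child))
print("input ranking:", ranking(adult_input))
```

The two rankings agree on was and wo and differ only in the order of wie and wer, which is visible directly in the table above.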

Page 37: The Leo Corpus

Non-interrogative use of wh-words

• Make a file with all words for interrogative pronouns Leo uses:

freq […] +s"$PRO:INT%|*" +u +d1 >> Leowh.cut

+u: for all files together
+d1: without frequency count and numbers
>> Leowh.cut: direct output to file

Page 38: The Leo Corpus

Non-interrogative use of wh-words

Leowh.cut:

$pro:int|was

$pro:int|wo

$pro:int|wie

$pro:int|wer

• strip file of all $pro:int| , so that just the wh-words are left:

chstring +s"$pro:int|" "" +y leowh.cut

+s"$pro:int|" "": change from "$pro:int|" to "" (nothing)
+y: file not in CHAT-format

Page 39: The Leo Corpus

Non-interrogative use of wh-words

• then look for uses of these words in sentences that do not contain $pro:int

combo +t*CHI +t%mor le*.cha [email protected]^*^%mor:^*^!$pro:int*

[email protected]: take words from file as search string for utterances
^*^%mor:: followed by %mor
^*^!$pro:int*: not containing $pro:int

• combo: search with Boolean operators

Page 40: The Leo Corpus

Word order in wh-questions

• German: verb has to follow wh-word directly – any errors?

• search for all utterances that do not follow this pattern:

combo +t*CHI +t%mor +s$pro:int*^!$v* le*.cha

search for child’s $pro:int not directly followed by any $v
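The same check can be expressed over a single %mor line in Python. This sketch flags a line when an interrogative pronoun tag is not immediately followed by a verb tag (any tag starting with $v, matched case-insensitively). The function name is illustrative and the second test line is an invented example, not corpus data.

```python
def wh_not_followed_by_verb(mor_line):
    """Return True if some $pro:int tag is not directly followed by a $v... tag."""
    tags = mor_line.split(":", 1)[1].split()
    for position, tag in enumerate(tags):
        if tag.lower().startswith("$pro:int"):
            following = tags[position + 1].lower() if position + 1 < len(tags) else ""
            if not following.startswith("$v"):
                return True
    return False


# Line from the kwal output shown earlier: "wo" is followed by the copula.
attested = ("%mor: $N:01:f:VOC:SG|Mama $PRO:int|wo $VCOP:S:POS:PRES:2s|sein "
            "$PRO:pers:NOM:SG|du $N:01:f:VOC:SG|Mama !")
# Invented line for illustration only: the pronoun "du" intervenes.
invented = "%mor: $PRO:int|wo $PRO:pers:NOM:SG|du $VCOP:S:POS:PRES:2s|sein ?"

print(wh_not_followed_by_verb(attested))
print(wh_not_followed_by_verb(invented))
```

Hits from such a search are only candidates for word order errors. Each one still needs to be read in context, since one-word questions and interrupted utterances are flagged as well.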

Page 41: The Leo Corpus

Co-occurrences of wh-words

• What words does was (what) cooccur with when used as an interrogative pronoun?

kwal +t*CHI +t%mor +s$pro:int%"|"was +d +o* le*.cha | cooccur +swas +t*CHI +u

• kwal looks for all uses of was as $pro:int
• the results are directed to cooccur (“piping”)

6 was da
32 was das
1 was denkst
2 was denn

Page 42: The Leo Corpus

Measuring lexical diversity

• traditional: type-token-ratio (TTR)
– number of different word types
– against total number of words

• every word is a new word: TTR 1.0
• the lower the TTR, the less lexical diversity
• problem: depends on sample size
– in a large sample, the total vocabulary will finally be exhausted
– TTR levels out because highly frequent words will increase the number of tokens disproportionately
– rarely occurring types will have little influence on TTR
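The sample size problem is easy to reproduce. The Python sketch below computes the TTR of a short toy token list (made up for illustration from words in the co-occurrence output) and of a longer stretch of the same list. The function name is illustrative.

```python
def type_token_ratio(tokens):
    """Number of different word types divided by the total number of tokens."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty sample")
    return len(set(tokens)) / len(tokens)


# Toy sample: the same few words keep recurring as the sample grows.
tokens = ["was", "das", "was", "da", "was", "denn", "was", "das"]

print(type_token_ratio(tokens[:4]))  # 3 types / 4 tokens
print(type_token_ratio(tokens))      # 4 types / 8 tokens
```

Doubling the sample adds only one new type, so the ratio drops from 0.75 to 0.5 even though the speaker's vocabulary has not changed.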

Page 43: The Leo Corpus

Measure D

• measure D is obtained by
– randomly sampling the corpus
– calculating the actual leveling out of the TTR
– and comparing this to theoretical models of TTR curves

• the probability of new types being introduced in the corpus is calculated, regardless of sample size

• In CLAN:
– TTR: freq
– Measure D: VOCD
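The sampling-and-fitting procedure can be sketched in Python. The sketch assumes the commonly cited model curve TTR = (D/N)(sqrt(1 + 2N/D) - 1), where N is the sample size in tokens. The sample sizes (35 to 50 tokens), the 100 trials per size, the grid search and all function names are illustrative assumptions, not a description of how the VOCD program is implemented.

```python
import math
import random


def model_ttr(n, d):
    """TTR predicted by the assumed model curve for sample size n and parameter d."""
    return (d / n) * (math.sqrt(1 + 2 * n / d) - 1)


def mean_sampled_ttr(tokens, n, trials, rng):
    """Average TTR over `trials` random samples of n tokens (without replacement)."""
    total = 0.0
    for _ in range(trials):
        sample = rng.sample(tokens, n)
        total += len(set(sample)) / n
    return total / trials


def estimate_d(tokens, sizes=range(35, 51), trials=100, seed=0):
    """Find the D whose model curve lies closest to the sampled TTR curve."""
    if len(tokens) < max(sizes):
        raise ValueError("need at least as many tokens as the largest sample size")
    rng = random.Random(seed)
    observed = {n: mean_sampled_ttr(tokens, n, trials, rng) for n in sizes}

    def squared_error(d):
        return sum((model_ttr(n, d) - ttr) ** 2 for n, ttr in observed.items())

    candidates = [step / 10 for step in range(1, 2001)]  # D from 0.1 to 200.0
    return min(candidates, key=squared_error)


# Two artificial token lists of equal length but different diversity.
low_diversity = ["a", "b", "c"] * 40
high_diversity = [f"w{i % 60}" for i in range(120)]

print(estimate_d(low_diversity), estimate_d(high_diversity))
```

Because the fit uses the whole curve across several sample sizes rather than a single ratio, the estimate is meant to be less sensitive to how much speech happens to be in the transcript.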