Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

28
Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1

description

(See [Odijk 2011, 2014] for more data and qualifications Introduction 3 Catinitmodifierpredicaterest AHij is daarHeel / erg /zeerblijmee glossHe is thereveryhappywith PHij is daar*heel / erg / zeerin zijn sasmee glossHe is thereveryhappywith V…omdat dat mij*heel / erg / zeerverbaast gloss…because that me verysurprises

Transcript of Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

Page 1: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

Using PaQu for language acquisition research

Jan OdijkCLARIN 2015 Conference

Wroclaw, 2015-10-16

1

Page 2: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Introduction• CHILDES Corpora• PaQu• Evaluation & Analysis• Conclusions• Future Work

Overview

2

Page 3: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

(See [Odijk 2011, 2014] for more data and qualifications

Introduction

3

Cat init modifier predicate rest

A Hij is daar Heel / erg /zeer blij mee

gloss He is there very happy with

P Hij is daar *heel / erg / zeer in zijn sas mee

gloss He is there very happy with

V …omdat dat mij *heel / erg / zeer verbaast

gloss …because that me

very surprises

Page 4: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

•Distinction is purely syntactic•Cannot be derived from semantic differences•Correlation with other known facts unlikely •Cannot be derived from general (universal) principles• must be acquired by L1 learners of Dutch

Introduction

4

Page 5: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Minimal pair in acquisition • Requires acquisition of negative property– No evidence in the input– No ‘correction’ or correction ignored

• May provide evidence for/against relevant hypotheses– E.g. Indirect Negative Evidence hypothesis• Absence of evidence evidence for absence

Introduction

5

Page 6: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Problem: Ambiguity– Heel 7-fold ambiguous– Erg 4-fold ambiguous– Zeer 3-fold ambiguous

• (as any decent natural language word)• For our purposes:– Morpho-syntactic and syntactic properties resolve

the ambuigities

Corpus Analysis

6

Page 7: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• [Odijk 2014] • Automatic Corpus analysis: GrETEL, OpenSONAR, COAVA ,

LWRS, CMD• These apply to specific corpora only • Manual Corpus analysis of CHILDES Van Kampen Corpus• How can I apply these applications to my own corpus?• request for PaQu (extends LWRS), AutoSearch (extends

CMD), …

Corpus Analysis

7

Page 8: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• PaQu= Parse and Query: https://dev.clarin.nl/node/4182 • Web application made by Groningen University• Upload corpus

– Plain text or in Alpino format• Plain Text is automatically parsed by Alpino• Resulting treebank can be searched and analyzed

– Search• Word relations interface and XPATH Queries

– Analysis • User-definable statistics on search results (and metadata)

PaQu

8

Page 9: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Take the Dutch CHILDES corpora• Select all utterances containing heel, erg or zeer• Clean the utterances, e.g.• ja , maar <we be> [//] we bewaren (he)t ook• ja , maar we bewaren het ook

• Upload it into PaQu• Gather statistics and draw conclusions

Experiments

9

Page 10: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Adult utterances of Van Kampen Corpus• Manual annotation used as gold standard (Acc)• Alpino makes finer distinctions: I mapped these• Annotation errors in the gold standard: revised gold

standard (Rev Acc)

Experiment 1

10

Page 11: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Accuracy

Experiment 1: Results

11

word Acc Rev Accheel 0.94 0.95erg 0.88 0.91zeer 0.21 0.21

Page 12: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Good for heel, erg• Bad for zeer, but:• Completely due to zeer doen (lit. pain(ful) do, ‘to hurt’)• Can be identified very easily in PaQu

• Generalisability: Limited• It concerns (cleaned) adult speech• It concerns relatively short sentences, explicitly separated• It mostly concerns a very local grammatical relation

Experiment 1: Interpretation

12

Page 13: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• All adults’ utterances:

Experiment 2:

13

Results mod A mod N Mod V mod P predc other unclear Total

heel 886 46 2 2 14 0 2 952

erg 347 27 109 0 187 5 0 675

zeer 7 1 83 0 19 21 7 138

Page 14: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Heel most frequent (almost 54%)

Experiment 2:Interpretation

14

Results mod A mod N Mod V mod P predc other unclear Total

heel 886 46 2 2 14 0 2 952

erg 347 27 109 0 187 5 0 675

zeer 7 1 83 0 19 21 7 138

Page 15: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Heel as mod A overwhelming: > 93%

Experiment 2:Interpretation

15

Results mod A mod N Mod V mod P predc other unclear Total

heel 886 46 2 2 14 0 2 952

erg 347 27 109 0 187 5 0 675

zeer 7 1 83 0 19 21 7 138

Page 16: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Heel as mod V, mod P wrong analysis

Experiment 2:Interpretation

16

Results mod A mod N Mod V mod P predc other unclear Total

heel 886 46 2 2 14 0 2 952

erg 347 27 109 0 187 5 0 675

zeer 7 1 83 0 19 21 7 138

Page 17: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Mod A and mod V more balanced for erg

Experiment 2:Interpretation

17

Results mod A mod N Mod V mod P predc other unclear Total

heel 886 46 2 2 14 0 2 952

erg 347 27 109 0 187 5 0 675

zeer 7 1 83 0 19 21 7 138

Page 18: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Evidence for zeer mostly lacking• Cases of Mod V are mostly wrong analyses

Experiment 2:Interpretation

18

Results mod A mod N Mod V mod P predc other unclear Total

heel 886 46 2 2 14 0 2 952

erg 347 27 109 0 187 5 0 675

zeer 7 1 83 0 19 21 7 138

Page 19: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Evidence for Mod P mostly lacking• Some evidence for erg, zeer (4 occurrences)

Experiment 2:Interpretation

19

Results mod A mod N Mod V mod P predc other unclear Total

heel 886 46 2 2 14 0 2 952

erg 347 27 109 0 187 5 0 675

zeer 7 1 83 0 19 21 7 138

Page 20: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Van Kampen Children’s speech: Accuracy• Similar to the Adults’ speech but slightly lower

Experiment 3:

20

Word Accheel 0.90erg 0.73zeer 0.17

Page 21: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• Linguistics:• No examples for mod P: how to explain heel v. erg, zeer?• Overwhelmingness of mod A for heel might be a relevant

factor• Current Dutch CHILDES corpora probably too small to draw

reliable conclusions

Conclusions

21

Page 22: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

• PaQu:• PaQu is very useful for doing better and more efficient

manual verification of hypotheses• In some cases its parses and their statistics can reliably be

used directly (though care is required!)• Several small details were improved, small additions to

functionality made through these experiments

Conclusions

22

Page 23: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

•More experiments for the children’s speech (cf. [Odijk 2014:34])• Similar experiments for other examples • te ‘too’ v. overmatig ‘excessively’; worden ‘become’v. raken ‘get’ and

others• Extend PaQu to include all relevant `metadata’• Extend PaQu to natively support common formats such as CHAT,

Folia, TEI, …•Make similar system for GrETEL, OpenSONAR•Manually verify (parts of) parses for CHILDES corpora(most is being done in CLARIAH-NL or UU AnnCor)

Future Work

23

Page 24: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

Thanks for Attention!Visit the Demo at 16:30!

Visit the Bazaar at 14:30 for a completely different use of PaQu!

24

Page 25: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

NO!

Correlation with other Differences?

25

Phenomenon Opposes VersusMod V,P heel erg, zeerMeaning erg heel, zeerInflection heel, erg zeerComparative, Superlative

erg heel, zeer

Modifiee erg heel, zeerPragmatics zeer heel, erg

Page 26: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

Ambiguity: HEEL

26

word Morpho- syntax

Syntax Meaning

heelA

Mod N (1)`whole’(2) ‘in one piece’(3)`large’

Predc ‘in one piece’Mod A `very’

Vf (1)`heal’ (2) `receive’

Page 27: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

Ambiguity: ERG

27

word Morpho-syntax

Syntax Meaning

erg

N utrum `erg’

N neutrum `evil’

AMod N, predc

‘bad’, ‘awful’

Mod A V P very

Page 28: Using PaQu for language acquisition research Jan Odijk CLARIN 2015 Conference Wroclaw, 2015-10-16 1.

Ambiguity: ZEER

28

word Morpho- Syntax

Syntax Meaning

zeer

N `pain’

AMod N, predc ‘painful’

Mod A V P ‘very’