The advantage of using a corpus rather than introspection
• empirical, reproducible: falsifiable science
• objective, neutral: the corpus is always (mostly) right; no interference from test persons' respect for textbooks
• definable observation space: diachronics, genre, text type
• statistics: observe linguistic tendencies (%) as opposed to (speaker-dependent) “stable” systems; quantify ?, ??, *, **
• context: all cases count, no “blind spots”
The Portuguese example
• Portuguese object pronouns need an “attractor” (negation, subject) in order to allow pre-verbal position
• More so in Portugal than in Brazil or Mozambique
• Diachronic fluctuation, sociolect / speaker status
• Introspection yields normative results
• Corpus yields true(er) results (NURC, Tycho Brahe, Folha vs. Público ....)
How to enrich a corpus
• Meta-information: source, time stamp etc.
• Grammatical annotation: part of speech (PoS), inflexion, syntactic function, syntactic structure, semantics ...
• Manual vs. automatic annotation
e.g. Korpus90 and Korpus2000
• mixed text, ca. 20 (28) million words each
• sentence-randomized “quote” corpus
• compiled by DSL (www.dsl.dk)
• grammatically annotated by VISL (visl.sdu.dk)
– a) automatically with the DanGram parser
– b) 1% manually revised (Arboretum treebank)
Other Danish corpora:
• Europarl (28M): true parallel multilingual
• Bergenholtz (3M), Parole (0.25M)
• Wikipedia etc.: the internet as corpus
• Specialised: BySoc, Folketing, e-mail ...
How to annotate
• All annotation is theory dependent, but some schemes less so than others. The higher the annotation level, the more theory dependent
• Double role of corpora: (a) as a goal in itself, (b) as (gold-standard annotated) data for machine learning; rule-based systems for bootstrapping
• PoS (tagging): needs a lexicon (“real” or corpus-based)
– (a) probabilistic: HMM baseline, DTT, TnT, Brill etc., F ca. 97+%
– (b) rule-based:
--- PoS disambiguation as a “side-effect” of syntax (PSG etc.)
--- PoS disambiguation as primary method (CG), F ca. 99%
• Syntax (parsing): function focus vs. form focus
– (a) primarily probabilistic: PCFG (constituent), MALT-parser (dependency F 90% after PoS)
– (b) primarily rule-based: HPSG, LFG (constituent trees), CG (syn. function F 96%, shallow dependency)
Constraint Grammar
A methodological rather than descriptive paradigm (Karlsson 1995)
Token-based assignment and contextual disambiguation of tag-encoded grammatical information
Grammars need lexicon/analyzer-based input and consist of thousands of MAP, SUBSTITUTE, REMOVE and SELECT rules.
The VISL project (SDU) uses Constraint Grammar parsers to add form and function tags to word tokens in corpora or running text
Form: e.g. N = noun, P = plural, GEN = genitive
Syntactic function: e.g. @SUBJ = subject, @ACC = direct object
Syntactic form: e.g. dependency markers (@SUBJ>, @<SUBJ), numbered dependency (e.g. #5->3) or secondary constituent trees
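To make the idea of contextual disambiguation concrete, here is a minimal Python sketch of how REMOVE and SELECT rules might prune ambiguous readings. It is a toy illustration, not the VISL implementation: the `Token` class, the helper `neighbour_is`, the two rules and the ambiguity classes assigned to "den" and "sælger" are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    readings: list  # each reading is a set of tags, e.g. {"N"}


def neighbour_is(offset, tag):
    """Context test: the token at relative position `offset` is unambiguously `tag`."""
    def check(tokens, i):
        j = i + offset
        return 0 <= j < len(tokens) and all(tag in r for r in tokens[j].readings)
    return check


def remove(tokens, target, context):
    """REMOVE readings carrying `target` where the context holds (never the last reading)."""
    for i, tok in enumerate(tokens):
        if len(tok.readings) > 1 and context(tokens, i):
            kept = [r for r in tok.readings if target not in r]
            if kept:
                tok.readings = kept


def select(tokens, target, context):
    """SELECT readings carrying `target` where the context holds, discarding the rest."""
    for i, tok in enumerate(tokens):
        if len(tok.readings) > 1 and context(tokens, i):
            kept = [r for r in tok.readings if target in r]
            if kept:
                tok.readings = kept


# Hypothetical analyzer output for "den gamle sælger kørte"
tokens = [
    Token("den", [{"ART"}, {"PERS"}]),   # article or pronoun (assumed ambiguity)
    Token("gamle", [{"ADJ"}]),
    Token("sælger", [{"N"}, {"V"}]),     # "salesman" or "sells" (assumed ambiguity)
    Token("kørte", [{"V"}]),
]

remove(tokens, "PERS", neighbour_is(+1, "ADJ"))  # no pronoun reading before an adjective
select(tokens, "N", neighbour_is(-1, "ADJ"))     # noun reading after an adjective

for tok in tokens:
    print(tok.word, tok.readings)
```

Both rules refuse to delete the last remaining reading, which mirrors the reductionist style of the paradigm: every token keeps at least one analysis.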
Running CG-annotation
1. Da (When) [da] KS @SUB #5
2. den (the) [den] ART UTR S DEF @>N #4
3. gamle (old) [gammel] ADJ nG S DEF NOM @>N #4
4. sælger (salesman) [sælger] N UTR S IDF NOM @SUBJ> #5
5. kørte (drove) [køre] <mv> V IMPF AKT @FS-ADVL> #11
6. hjem (home) [hjem] ADV DIR @<SA #5
7. i (in) [i] PRP @<ADVL #5
8. sin (his) [sin] <poss> <refl> DET UTR S @>N #9
9. bil (car) [bil] N UTR S IDF NOM @P< #7
10. , #5
11. kunne (could) [kunne] <aux> V IMPF AKT @FAUX #0
12. han (he) [han] PERS UTR 3S NOM @<SUBJ #11
13. se (see) [se] <mv> V INF AKT @AUX< #11
14. mange (many) [mange] <quant> DET nG P NOM @>N #15
15. rådyr (deer) [rådyr] N NEU P IDF NOM @<ACC #13
16. på (in) [på] PRP @<OA #13
17. de (the) [den] ART nG P DEF @>N #19
18. våde (wet) [våd] ADJ nG P nD NOM @>N #19
19. marker (fields) [mark] N UTR P IDF NOM @P< #16
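The numbered annotation above can be read as a dependency tree. The following Python sketch shows one way to do that, assuming each token sits on its own line in the form `id. word ... #head` and that the number after `#` is the id of the head (0 for the root). The function names and the shortened sample are illustrative choices, not part of the VISL tools.

```python
import re
from collections import defaultdict

# Shortened sample: a subset of the tokens from the example sentence
SAMPLE = """\
1. Da (When) [da] KS @SUB #5
2. den (the) [den] ART UTR S DEF @>N #4
3. gamle (old) [gammel] ADJ nG S DEF NOM @>N #4
4. sælger (salesman) [sælger] N UTR S IDF NOM @SUBJ> #5
5. kørte (drove) [køre] <mv> V IMPF AKT @FS-ADVL> #11
6. hjem (home) [hjem] ADV DIR @<SA #5
11. kunne (could) [kunne] <aux> V IMPF AKT @FAUX #0
12. han (he) [han] PERS UTR 3S NOM @<SUBJ #11
13. se (see) [se] <mv> V INF AKT @AUX< #11
"""

# token id, word form, anything in between, head id at the end of the line
LINE = re.compile(r"^(\d+)\.\s+(\S+).*?#(\d+)\s*$")


def parse(text):
    """Return {token_id: (word, head_id)} for every well-formed line."""
    tokens = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            tokens[int(match.group(1))] = (match.group(2), int(match.group(3)))
    return tokens


def children(tokens):
    """Invert the head links: {head_id: [dependent ids in text order]}."""
    kids = defaultdict(list)
    for token_id, (_, head) in tokens.items():
        kids[head].append(token_id)
    return kids


tokens = parse(SAMPLE)
kids = children(tokens)
root = kids[0][0]
print("root:", tokens[root][0])
print("dependents of 'kørte':", [tokens[t][0] for t in kids[5]])
```

With the head links inverted, questions such as "which tokens depend on the subclause verb" become simple dictionary lookups.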
DanGram
[Pipeline diagram] Raw text, preprocessing, morphological analysis, CG disambiguation (PoS/morph), CG syntax, NER and case roles, PSG grammar / dependency grammar, treebanks / CG corpora
Lexical resources: inflexion lexicon (100,000 lexemes), valency potential, semantic prototypes
CG results for Danish: PoS (~ 99% accuracy)
Class     recall  precision  F-score    Class   recall  precision  F-score
N         99.5    99.1       99.2       ART     99.3    99.3       99.3
PROP      100     100        100        DET     97.1    98.5       97.7
V PR      99.2    99.2       99.2       PERS    99.4    99.4       99.3
V IMPF    100     97.2       98.8       INDP    98.2    100        99.2
V INF     98.1    99.0       98.5       NUM     100     100        100
V PCP1    100     100        100        ADJ     96.8    94.4       95.5
V PCP2    94.9    97.4       96.1       ADV     95.8    98.0       96.8
INFM      100     100        100        PRP     99.4    99.1       99.2
KS        96.6    95.0       95.7       KC      100     99.1       99.5
CG-result for Danish: Syntactic function (~95% accuracy)
Class recall precision F-score Class recall precision F-score
@SUBJ> 96.7 95.2 95.9 @>N 97.3 98.2 97.7
@<SUBJ 90.1 96.8 93.3 @N< 90.9 96.1 93.4
@F-SUBJ> 86.6 86.6 86.6 @APP* 100 87.5 93.3
@F-<SUBJ 100 100 100 @N<PRED 100 80.0 88.8
@<ACC 94.6 95.3 94.9 @>A 88.6 95.9 92.1
@ACC>* 88.8 88.8 88.8 @A< 89.4 94.4 91.8
@<DAT* 100 75.0 85.7 @P< 98.1 98.1 98.1
@<PIV 93.5 87.8 90.5 @FS-<SUBJ* 77.7 77.7 77.7
@<SC 92.0 84.3 87.9 @FS-<ACC 100 72.7 84.1
@<OC* 83.3 100 90.8 @FS-ACC> 100 91.6 95.6
@<SA 83.3 86.9 85.0 @FS-<ADVL 90.3 96.5 93.2
@<OA* 100 75.0 86.7 @FS-ADVL> 84.6 78.5 81.4
@<ADVL 93.2 90.6 91.8 @FS-P< 90.9 100 95.2
@ADVL> 96.9 93.2 95.0 @ICL-<SUBJ* 100 100 100
@KOMP<* 100 100 100 @ICL-P< 96.1 100 98.0
@P< 98.1 98.1 98.1
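The slides do not say how the F-score column is computed. Assuming it is the usual balanced F-score (the harmonic mean of recall and precision), a row can be checked with a few lines of Python; the function name `f_score` is chosen here for illustration.

```python
def f_score(recall, precision):
    """Balanced F-score: harmonic mean of recall and precision (both in %)."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Rows @SUBJ> and @<SUBJ from the table above
print(round(f_score(96.7, 95.2), 1))
print(round(f_score(90.1, 96.8), 1))
```

Under that assumption the two rows come out at the values printed in the table for @SUBJ> and @<SUBJ.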
Corpus Eye
• internet-based (http://corp.hum.sdu.dk/), using CQP (Corpus Query Processor)
• menu-based category searches in context
• multi-token constituents, regular expressions and quantifiers
• sorting and quantification
• grammatically annotated corpora for 8 Germanic and Romance languages (about 1 billion words), mostly from the written language domain
The case for treebanks
• A treebank is a corpus annotated with full syntactic structure, attaching tokens to each other (dependency grammar) or to interconnected non-terminal nodes (constituent grammar)
• Treebanks contain more syntactic detail than tagged corpora
• Treebanks make it possible to train or evaluate automatic systems of analysis
• Treebanks allow searches for complex units and their relations, rather than individual tokens or their features. For instance, the sequence of NPs with certain functions can be queried directly, or conditioned on their being daughters of an embedded clause (subclause).
• Treebanks exist for a large number of languages (cf. CoNLL-X shared task), e.g. Negra/TIGER (German), Penn (English), Mamba (Swedish), Cast3LB (Spanish), PDT (Czech) ....
• The largest VISL treebank is the double-format Arboretum treebank for Danish, annotated in both dependency and constituent grammar
Indented PSG-notation
STA:fcl
fA:fcl
=SUB:conj-s('da') Da (When)
=S:np
==DN:art('den' UTR S DEF) den (the)
==DN:adj('gammel' nG S DEF NOM) gamle (old)
==H:n('sælger' UTR S IDF NOM) sælger (salesman)
=P:v-fin('køre' IMPF AKT) kørte (drove)
=As:adv('hjem' DIR) hjem (home)
=fA:pp
==H:prp('i') i (in)
==DP:np
===DN:pron-poss('sin' <refl> UTR S) sin (his)
===H:n('bil' UTR S IDF NOM) bil (car)
P:vp-
=Vaux:v-fin('kunne' IMPF AKT) kunne (could)
S:pron-pers('han' UTR 3S NOM) han (he)
-P:vp
=Vm:v-inf('se' AKT) se (see)
Od:np
=DN:pron-indef('mange' <quant> nG P NOM) mange (many)
=H:n('rådyr' NEU P IDF NOM) rådyr (deer)
Ao:pp
=H:prp('på') på (in)
=DP:np
==DN:art('den' nG P DEF) de (the)
==DN:adj('våd' nG nD NOM) våde (wet)
==H:n('mark' UTR P IDF NOM) marker (fields)
Notation: FUNCTION:form, i.e. EDGES:nodes/terminals
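The indented notation encodes tree depth in the number of leading `=` signs. The Python sketch below rebuilds the tree from such lines. It assumes one node per line, that the first line is the root, and that a line with k leading `=` signs sits k+1 levels below the root; the `Node` class and `parse_indented` are names invented for this example, and discontinuity markers such as `P:vp-` are not handled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # "FUNCTION:form", e.g. "S:np"
    word: str = ""                  # surface form for terminal nodes
    children: list = field(default_factory=list)


def parse_indented(text):
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    root = Node(lines[0])
    stack = [root]                  # stack[d] is the most recent node at depth d
    for line in lines[1:]:
        level = len(line) - len(line.lstrip("="))
        body = line[level:]
        close = body.find(")")      # terminals look like FUNCTION:form('lemma' TAGS) word
        if close == -1:
            label, word = body, ""
        else:
            label, word = body[:close + 1], body[close + 1:].strip()
        depth = level + 1
        node = Node(label, word)
        del stack[depth:]           # forget nodes at this depth or deeper
        stack[-1].children.append(node)
        stack.append(node)
    return root


SAMPLE = """\
STA:fcl
fA:fcl
=SUB:conj-s('da') Da
=S:np
==DN:art('den' UTR S DEF) den
==DN:adj('gammel' nG S DEF NOM) gamle
==H:n('sælger' UTR S IDF NOM) sælger
=P:v-fin('køre' IMPF AKT) kørte
"""

root = parse_indented(SAMPLE)
subclause = root.children[0]
subject_np = subclause.children[1]
print([child.label.split(":")[0] for child in subclause.children])
print([child.word for child in subject_np.children])
```

Splitting each label at the colon separates the edge label (function) from the node label (form), which is the distinction the notation line above spells out.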
Syntactic functions in Korpus2000: clause level
[Bar chart. Categories: SUBJ, F/S-SUBJ, ACC, DAT, PIV, SC/SA, OC/OA, ADVL, PRED. Series: <, >, FS, ICL. Axis: 0 to 3000.]
Syntactic functions in Korpus2000: group level
[Bar chart. Categories: >N, N<; >A, <A; P<, >P. Series: <, >, FS, ICL. Axis: 0 to 5000.]
Syntactic functions in Korpus2000: special functions
[Bar chart. Categories: >>P, N<PRED, KOMP<, ADVL, SUB, AUX<, INFM. Axis: 0 to 1400.]
Semantic restrictions for objects: Does semantic class play a role for object positions?
forflytte <Hprof>_2 (human professional)
forfægte <pp>_3 (thought product)
forfølge <ac>_8 <Hprof>_6 <H>_4 .... (activities and people)
forføre <H>_3 (people)
forgylde <H>_4 <Hprof>_3 (people)
forhale <act-c>_3 <act>_3 (actions and activities)
forhandle <ac>_17 <sem-r>_9 <conv>_8 .... (countable abstracts, "readables", agreements)
forhaste <pp>_3 <sem>_3 (thought products)
forhindre <act>_35 <Hprof>_23 <ac>_18 <act>_18 <H>_17 <HH>_14 <event>_9
forhøje <ac>_13 <mon>_7 <mon-c>_5 ... (abstracts and amounts of money)
forkaste <pp>_5 <Hprof>_4 <ac>_3 <conv>_3 .. (thought products, professionals, agreements)
forklare <ac>_39 <act-c>_7 <act>_6 ... (abstracts and actions)
forkorte <per>_4 (periods)
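Profiles like the ones above can be produced by counting, for each verb, the semantic prototype tags of its direct objects. The following Python sketch shows the counting step only. The input pairs are invented for illustration, and the cut-off `min_count=2` is an assumption; the slides do not state which threshold was used.

```python
from collections import Counter, defaultdict


def object_profiles(pairs):
    """Count semantic prototype tags of objects per verb lemma."""
    profiles = defaultdict(Counter)
    for verb, sem_tag in pairs:
        profiles[verb][sem_tag] += 1
    return profiles


def format_profile(verb, counts, min_count=2):
    """Render a profile as 'verb <tag>_n ...', most frequent tag first."""
    parts = [f"<{tag}>_{n}" for tag, n in counts.most_common() if n >= min_count]
    return " ".join([verb] + parts)


# Hypothetical (verb, object prototype) pairs extracted from an annotated corpus
pairs = (
    [("forgylde", "H")] * 4
    + [("forgylde", "Hprof")] * 3
    + [("forgylde", "ac")]          # a single hit, dropped by the cut-off
    + [("forkorte", "per")] * 4
)

profiles = object_profiles(pairs)
for verb in sorted(profiles):
    print(format_profile(verb, profiles[verb]))
```

In a real setting the pairs would come from a corpus query that links each main verb to the head of its @ACC object.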
Searches based on semantic prototype annotation: verb semantics vs. object semantics
han så aldrig igen @ADVL filmen <sem>+@ACC
han så aldrig vennen <H>+@ACC igen @ADVL
Direct/accusative objects in Danish
form type                  fronted (ACC>)       right of main verb (<ACC)
finite clause (FS)         5.2 % (quotes!)      12.8 %
non-finite clause (ICL)    0.0 % (1 case)       5.3 %
nouns (N)                  0.3 % (checked)      53.8 %
proper nouns (PROP)        0.0 % (12 cases)     3.4 %
relative pronouns          1.9 %                -
interrogative pronouns     0.5 %                - (4 adverbs)
personal pronouns          1.0 %                12.0 %
others                     0.4 %                4.4 %
all                        9.3 %                91.7 %
7.1 % in 1.1 million words from Korpus2000
Fronted nominal objects
Subtype                 n     frequency   example
interrogative           79    29.0 %      at se, hvilken interesse kineserne skulle have
topic                   74    27.2 %      Denne interesse overførte han på virksomheden
                                          De problemer har jeg slet ikke.
focus                   55    20.2 %      Blot 6-7 kr. vil sparekassen se som betaling
                                          Sin spillefilmsdebut fik han i 1962 med ...
fronted in verb chain   43    15.8 %      ... få tyvekosterne bragt hjem
                                          ... får man billeder at se gratis
                                          ... at lære de nødvendige redskaber at kende
raised                  12    4.4 %       Den slags er vi jo nogle stykker der kan lide
fixed                   7     2.6 %       Hvad udvalget af værker angår, har ...
vp-internal             2     0.7 %       ... at min søn ingen huller havde
                                          ... hun har ingen kage bagt
Pronoun ellipsis in relative clauses
(all: 938)
                der            som            zero           all
                n      %       n      %       n      %       n      %
SUBJ            421    44.9    175    18.7    (15)   (1.6)   611    65.1
  raised        -      -       3      0.3     -      -       3      0.3
  det-focus     33     3.5     10     1.1     -      -       43     4.6
ACC             -      -       34     3.6     37     3.9     71     7.6
  raised        -      -       7      0.7     2      0.2     9      1.0
  det-focus     -      -       -      -       6      0.6     6      0.6
>>P             4      0.4     16     1.7     12     1.3     32     3.4
  raised        -      -       7      0.7     1      0.1     8      0.9
  det-focus     -      -       -      -       5      0.5     5      0.5
DAT, CS, OC     -      -       5      0.5     -      -       5      0.5
subtotal        458    48.8    257    27.4    78     8.3     793    84.5

                hvor           når, da        zero
ADVL-adv        111    11.8    10     1.1     10     1.1     131    14.0
                hvorPRP        PRP+hvilken    88     9.4     924    98.5
P< (ADVL)       7      0.7     1      0.1                    8      0.9
                hvis           at             hvilket
>N, SUB, S<     1      0.1     4      0.1     1      0.1     6      0.6
total                                                        938    100.0
Preadverbial placement of definite np-objects: adverb types
(VFIN) (ADV @<ADVL) (N DEF @<ACC)
-> only main clauses?, -> same adverbs as intra-vp?
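A sequence pattern such as (VFIN) (ADV @<ADVL) (N DEF @<ACC) can be thought of as a sliding window over tagged tokens, where each slot lists the tags a token must carry. The Python sketch below illustrates this. The tag sets on the sample sentence are hand-assigned for the example (including VFIN as a cover tag for finite verbs), and `find_matches` is a hypothetical helper, not the CorpusEye search engine.

```python
def find_matches(tokens, pattern):
    """Return every token sequence whose tags include the tags required by each slot."""
    hits = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(required <= tags for (_, tags), required in zip(window, pattern)):
            hits.append(" ".join(word for word, _ in window))
    return hits


# Hand-tagged sample sentence: "han så aldrig vennen igen"
sentence = [
    ("han", {"PERS", "@SUBJ>"}),
    ("så", {"V", "VFIN"}),
    ("aldrig", {"ADV", "@<ADVL"}),
    ("vennen", {"N", "DEF", "@<ACC"}),
    ("igen", {"ADV", "@<ADVL"}),
]

# (VFIN) (ADV @<ADVL) (N DEF @<ACC)
pattern = [{"VFIN"}, {"ADV", "@<ADVL"}, {"N", "DEF", "@<ACC"}]

print(find_matches(sentence, pattern))
```

Collecting the adverb from the middle slot of every hit would then give the list of adverb types that the slide asks about.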
German, for comparison
(np-def @ACC) (adv) (AUX<)
(np-def @ACC) (adv) (>AUX)
(adv) (np-def @ACC) (AUX<)
Candidates for adverbs with an influence on object position: vp-inserted adverbs and their position specificity
red = attitudinal adverbs, blue = conjunctional adverbs
Pre-positioned adverbs in preposition-governed infinitives
(PRP) (ADV) (INF @ICL-P<)
red = focus adverbs, blue = time adverbs, green = inflected adverbs
Post-positioned "light objects"
VFIN (ADV @<ADVL) (PERS @<ACC)
either 1./2. person (speech!) or special cases ...
Cross language perspective
• VISL uses a uniform descriptive system, with consistent form and function categories, for 27 languages, handling special cases at the subcategory level
• CorpusEye offers 2 large CG-annotated multi-language corpora, allowing a certain degree of statistical standardisation (genre, lexicon etc.) across languages
– 1. Europarl parallel corpus (da, de, en, es, fr, it, pt)
– 2. Wikipedia corpus (da, de, en, eo, es, fr, it, pt)
• The annotation (e.g. np-types), the search system (e.g. different statistics) and the language inventory (e.g. se) can all be expanded in a project-driven way
Cross SL category distribution
GER = Germanic average, ROM = Romance average, Red = high values, Blue = low values
Notables: Sentence length, inflexion vs. aux chains, subjunctive and conditional, ROM-adj vs. GER-v, ROM-coord., DK vs. ES, xx-French (shorter than even GER), politeness vocative
                                             da     sv     de     en     nl    GER  xx/fr     es     it     pt    ROM     fi     el
words per sentence                         25.5   25.1   25.3   25.7   23.1   24.9   27.8   32.1   32.9   33.2   32.7   25.3   31.0
finite subclauses                          3.81   3.75   3.47   3.47   3.30   3.56   3.16   4.04   3.68   3.52   3.75   3.00   3.72
  relative clauses                         1.95   2.05   1.68   1.70   1.58   1.79   1.72   2.16   2.10   2.07   2.11   1.50   2.09
  direct object clauses                    1.11   1.04   1.02   1.03   0.95   1.03   0.85   1.10   0.90   0.81   0.94   0.78   0.94
  adverbial clauses                        0.63   0.54   0.67   0.61   0.63   0.62   0.52   0.70   0.63   0.55   0.63   0.57   0.62
participial adverbial subclauses (log-5)   2.92   2.15   3.20   4.35   4.52   3.43   3.96   3.82   4.09   4.71   4.21   3.31   4.78
auxiliary chain parts                      3.46   3.35   3.34   3.36   3.13   3.33   2.89   2.98   2.99   2.52   2.83   3.02   2.77
  passive pcp2                             0.47   0.45   0.42   0.45   0.44   0.45   0.41   0.33   0.34   0.39   0.35   0.44   0.39
  active pcp2                              1.17   1.14   1.15   1.33   1.07   1.17   1.12   1.22   1.20   0.95   1.12   1.04   1.17
  infinitive                               1.43   1.38   1.39   1.21   1.25   1.33   0.99   1.12   1.11   0.93   1.05   1.20   0.89
subjunctive/vfin                           4.99   5.58   4.76   4.53   4.40   4.85   4.19   4.76   4.26   4.79   4.60   5.55   4.35
conditional                                0.56   0.56   0.56   0.62   0.43   0.55   0.43   0.49   0.43   0.40   0.44   0.56   0.39
vocative                                   0.04   0.04   0.06   0.05   0.06   0.05   0.05   0.06   0.07   0.04   0.06   0.05   0.05
attributive                                6.70   6.98   7.02   7.01   7.29   7.00   7.26   7.37   7.64   8.13   7.71   7.65   7.62
common nouns                              20.90  21.26  21.00  21.33  21.35   21.2  22.07  21.37  21.09  22.14   21.5  22.66  21.71
finite verbs                               8.94   8.59   8.48   8.29   8.49   8.56   7.57   8.18   7.78   7.23   7.73   7.83   7.86
coordinating conjunction                   2.67   2.48   2.80   2.68   2.56   2.64   2.74   3.20   3.16   3.28   3.21   2.40   3.20
subordinating conjunct.                    2.33   2.16   2.22   2.17   2.13   2.20   1.84   2.35   2.01   1.87   2.08   1.88   2.06
demonstrative                              1.96   2.14   2.34   2.17   2.24   2.17   1.99   2.17   1.98   2.02   2.06   1.82   1.81
VISL
http://visl.sdu.dk
parsers: http://beta.visl.sdu.dk
corpus search: http://corp.hum.sdu.dk
teaching: http://visl.sdu.dk
Text typology: passive constructions
• Passive frequency as a style marker for officialese (kancellistil), level of abstraction etc.?
• 3.1% all passives, 2.3% finite forms incl. active participle, 5.9 infinitives
• s-passive or blive-passive
• lexeme-specific passive norms?
• (a) Børnene flokkedes omkring ismaskinen. *Børnene blev flokket. Lexicalised s-passive ("slås", "synes")
• (b) Løgene svitses. Løgene bliver svitset. High Spas/act, high Spas/Bpas
• (c) Aktieudbytte beskattes med 25%. Aktieudbytte bliver beskattet med 25%. High Spas/act, neutral Spas/Bpas
• (d) Minimælk fås kun fra Arla. *Minimælk bliver fået. Low Spas/act, high Spas/Bpas
• (e) Der arbejdes på en løsning. Der bliver arbejdet. *Den bliver arbejdet. Blive-passive only with formal subject.
• (f1) Bøgerne er solgt d. 10. oktober (=er blevet). *Bøgerne er solgte d. 10. oktober.
• (f2) Tallene er vist (=vises) med rød skrift. *Tallene er viste med rød skrift.
Være-passive either as s-passive or as blive-passive
Lexicographic work
e.g. lexemes that occur in particular syntactic sequences:
@SUBJ> (subject)   @MV (main verb)   @<ACC (object)
”hest”             ”æde”             ”hø”
annotation with semantic prototypes:
21 aflyse <occ> (arranged events)
19 aflyse <act-c> (countable actions and activities)
4 aflyse <ac> (countable abstracts)
4 aflyse <act> (actions and activities)
4 aflyse <sem-l> (musical works etc.)
3 aflyse <event> (events)
3 aflyse <sit> (situations)