Glossing CHAT files using the Bangor...

37
Glossing CHAT files using the Bangor Autoglosser Kevin Donnelly, Sarah Cooper, Margaret Deuchar ESRC Centre for Research on Bilingualism Bangor, Wales

Transcript of Glossing CHAT files using the Bangor...

Page 1: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Glossing CHAT filesusing the Bangor Autoglosser

Kevin Donnelly, Sarah Cooper, Margaret Deuchar

ESRC Centre for Research on BilingualismBangor, Wales

Page 2: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

A sample utterance

*SER: dw@1 i@1 (y)n@1 tynnu@1 llun@1 i@1 [/] i@1 (y)r@1 plant@1 <i@1plant@1> [//] <i@1 (y)r@1> [//] # i@1 er@0 &h Helen@0 a@1 Susanna@0a@1 +/. %snd:”deuchar1” 73881 79477

%gls: be.1S.PRES PRON.1S PRT take.NONFIN picture for for DET childrenfor children for DET for IM Helen and Susanna and

%eng: I draw a picture for . . . for the children, for, er, Helen and Susanna and. . .

(Siarad corpus, Deuchar1)

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 2 / 37

Page 3: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Utterance format

*SER dw@1 i@1 (y)n@1 hopeless@2 efo@1 tynnu@1 llun@1 . %snd:”deuchar1” 72848 73881Speaker *SERUtterance dw@1 i@1 (y)n@1 hopeless@2

efo@1 tynnu@1 llun@1 .Language tags 1=Welsh, 2=English, 0=indeter-

minateAudio location %snd:”deuchar1” 72848 73881Manual gloss be.1S.PRES PRON.1S PRT

hopeless with take.NONFINpicture

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 3 / 37

Page 4: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Glossing

Allows non-native speakers to parse theconversationLabour-intensiveTediousInconsistent: ychydig – “a bit”/“a little”Glossing tags difficult to revise later

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 4 / 37

Page 5: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Bangor corpora

Chats Hours Words DateWelsh-English 69 40 456k 2009

(Siarad)Welsh-Spanish 32 20 161k 2011

(Patagonia)Spanish-English 31 20 126k 2011

(Miami)132 80 743k

All available under the GPL.

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 5 / 37

Page 6: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Existing options

Spanish – CLAN → MOR + POSTWelsh – no tagger at all

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 6 / 37

Page 7: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Dictionaries

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 7 / 37

Page 8: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Dictionary format

Derived from GPL or PD resourcesOne database tableWords, not morphemesEasily presented in a spreadsheetEasy to updateEasy to get started

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 8 / 37

Page 9: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Welsh dictionary

surface lemma enlemma pos gender number tensebara bara bread n m sg

cathod cath cat n f plmynd mynd go v infinaeth mynd go v 3s past

hapus hapus happy adjrhywsut rhywsut somehow adv

heb heb without prep

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 9 / 37

Page 10: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Spanish dictionary

surface lemma enlemma pos gender number tenseperro perro dog n m sg

canciones cancion song n f plempezar empezar start v infinempieza empezar start v 23s presempieza empezar start v 2s imper

rojo rojo red adj m sgrojas rojo red adj f plpor por for prep

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 10 / 37

Page 11: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

The autoglossing process

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 11 / 37

Page 12: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Stages in theautoglossing process

Stage 1 – Import the unglossed fileStage 2 – Look up the words it containsStage 3 – Disambiguate between alternatives fora wordStage 4 – Output the glossed file

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 12 / 37

Page 13: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Stage 1Import the chat file

Read each line of the file into an utterances tableSelect the utterance and discard non-wordmaterialSplit the resulting utterance into wordsPut them into a words table

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 13 / 37

Page 14: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Sample import

*SOF: <y si> [/] y si entra algun camion ahıpor ejemplo a dejar muebles o cualquier cosa .%snd:”sastre1” 11733 15321Speaker *SOFUtterance y si entra algun camion ahı

por ejemplo a dejar muebles ocualquier cosa .

Audio location %snd:”sastre1” 11733 15321English And if some lorry goes in there,

for example, to leave off furnitureor whatever.

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 14 / 37

Page 15: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

The words table

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 15 / 37

Page 16: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Stage2Dictionary lookup

Using the language tag, look up each wordagainst the appropriate dictionaryDo basic segmentation (e.g clitic pronouns inSpanish, verb-tenses in English, mutation inWelsh)Write out all the dictionary entries (readings) forthat wordFeed these to the constraint grammar parser fordisambiguation

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 16 / 37

Page 17: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Constraint Grammar

Developed by Fred Karlsson in the 90sThird generation of the parser: visl-cg3Eckhard Bick, Tino DidriksenFree (GPL) licensebeta.visl.sdu.dk/constraint grammar.htmlEasily-understood rules

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 17 / 37

Page 18: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Stage 3Disambiguation

select (n) if (-1 (ord));Choose the noun (n) reading if the first word tothe left (-1) is an ordinal (ord)Welsh: yr ail dro (the second time)English: the third manSpanish: el primer viaje (the first journey)Verb readings for dro, man and viaje will bedeleted

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 18 / 37

Page 19: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Language-specific rules

Include that language’s tag in the rule toconstrain its applicationselect ([es] n) if (-1 ([es] ord));Now applies only to Spanish: el primer viaje

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 19 / 37

Page 20: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Before disambiguation"<ddim>"

"dim" {96,1} [cy] n m sg :nothing: [208789] + sm"dim" {96,1} [cy] adv :not: [204176] + sm

"<yn>""yn" {96,2} [cy] stat :stative: [200654]"yn" {96,2} [cy] prep :in: [204430]"gan" {96,2} [cy] prep :with: [196964] + sm

"<gynnar>""cynnar" {96,3} [cy] adj :early: [209212] + sm

"<iawn>""iawn" {96,4} [cy] adv :OK: [207540]"iawn" {96,4} [cy] adv :very: [203775]

(Patagonia corpus, Patagonia1)“not very early”

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 20 / 37

Page 21: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

After disambiguation

"<ddim>""dim" {96,1} [cy] adv :not: [204176] + sm

"<yn>""yn" {96,2} [cy] stat :stative: [200654]

"<gynnar>""cynnar" {96,3} [cy] adj :early: [209212] + sm

"<iawn>""iawn" {96,4} [cy] adv :very: [203775]

(Patagonia corpus, Patagonia1)“not very early”

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 21 / 37

Page 22: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Stage 4Output the glossed file

Read the disambiguated constraint grammaroutputInsert each lexeme and its part-of-speech tagsinto the words tableUse the utterances and words tables to write outan autoglossed file

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 22 / 37

Page 23: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

The words table

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 23 / 37

Page 24: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

The words table – glossed

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 24 / 37

Page 25: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Evaluation

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 25 / 37

Page 26: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Speed

900-1100 words per minute1 minute to autogloss 5 minutes of speechSiarad: 500,000 words in 8h27m

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 26 / 37

Page 27: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Comparison withother methods

Spanish – MOR glosser (part of the CLAN suite)Welsh – manual (human) glossingTwo sample files from each corpus glossed usingboth methodsAligned and then inspected manuallyTypos or missing lexemes not counted as errorsNames omitted from considerationFull data-bundle available for download

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 27 / 37

Page 28: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Comparison betweenautoglosser and MOR

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 28 / 37

Page 29: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Spanish files

Tested on Herring11, Sastre58,039 tokens, 1,638 types (TTR: 0.20)

Language mixSpanish 88%English 9%

indeterminate 3%

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 29 / 37

Page 30: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Welsh files

Tested on Stammers7, Stammers99,454 tokens, 1,376 types (TTR: 0.15)

Language mixWelsh 87%English 2%

indeterminate 11%

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 30 / 37

Page 31: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Comparison with MORglossing (Spanish)

Autoglosser MORCoverage 96.9% 95.7%Accuracy 97.4%* 97.6%†

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 31 / 37

* wrong lexeme 0.7%, wrong POS 0.2%, ambiguous 1.7%† wrong lexeme 1.6%, wrong POS 0.7%, ambiguous 0.1%

Page 32: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Comparison with manualglossing (Welsh)

Autoglosser HumanCoverage 98.3% 99.9%Accuracy 97.9%* 99.9%

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 32 / 37

* wrong lexeme 0.7%, wrong POS 0.1%, ambiguous 1.4%

Page 33: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Spin-off benefits

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 33 / 37

Page 34: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Typesetting – before

*SER: dw@1 i@1 (y)n@1 hopeless@2 efo@1 tynnu@1 llun@1 .%snd:”deuchar1” 72848 73881%gls: be.1S.PRES PRON.1S PRT hopeless with take.NONFIN picture%eng: I’m hopeless at drawing*MYF: +< &=laugh . %snd:”deuchar1” 73196 73881*SER: dw@1 i@1 (y)n@1 tynnu@1 llun@1 i@1 [/] i@1 (y)r@1 plant@1 <i@1plant@1> [//] <i@1 (y)r@1> [//] # i@1 er@0 &h Helen@0 a@1 Susanna@0a@1 +/. %snd:”deuchar1” 73881 79477%gls: be.1S.PRES PRON.1S PRT take.NONFIN picture for for DET childrenfor children for DET for IM Helen and Susanna and%eng: I draw a picture for . . . for the children, for, er, Helen and Susanna and. . .

(Siarad corpus, deuchar1)

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 34 / 37

Page 35: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Typesetting – after

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 35 / 37

Page 36: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

Resources

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 36 / 37

Page 37: Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S

bangortalk.org.ukData-bundle for this presentation

Link to Bangor Autoglosser code (Git repository)Licensed under GPL v3

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 37 / 37