Download - Glossing CHAT files using the Bangor Autoglosserbangortalk.org.uk/publications/Donnelly2011_Glossing_CHAT.pdf · using the Bangor Autoglosser Kevin Donnelly, ... be.1S.PRES PRON.1S


Glossing CHAT files using the Bangor Autoglosser

Kevin Donnelly, Sarah Cooper, Margaret Deuchar

ESRC Centre for Research on Bilingualism, Bangor, Wales


A sample utterance

*SER: dw@1 i@1 (y)n@1 tynnu@1 llun@1 i@1 [/] i@1 (y)r@1 plant@1 <i@1 plant@1> [//] <i@1 (y)r@1> [//] # i@1 er@0 &h Helen@0 a@1 Susanna@0 a@1 +/. %snd:"deuchar1" 73881 79477

%gls: be.1S.PRES PRON.1S PRT take.NONFIN picture for for DET children for children for DET for IM Helen and Susanna and

%eng: I draw a picture for . . . for the children, for, er, Helen and Susanna and. . .

(Siarad corpus, Deuchar1)

Donnelly, Cooper, Deuchar (Bangor) Autoglossing CHAT files 2 / 37


Utterance format

*SER: dw@1 i@1 (y)n@1 hopeless@2 efo@1 tynnu@1 llun@1 . %snd:"deuchar1" 72848 73881

Speaker         *SER
Utterance       dw@1 i@1 (y)n@1 hopeless@2 efo@1 tynnu@1 llun@1 .
Language tags   1=Welsh, 2=English, 0=indeterminate
Audio location  %snd:"deuchar1" 72848 73881
Manual gloss    be.1S.PRES PRON.1S PRT hopeless with take.NONFIN picture
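The fields above can be pulled out of a main-tier line mechanically. The following is a minimal Python sketch, not the Autoglosser's own import code: the function name `parse_utterance`, the regular expressions and the returned dictionary layout are all assumptions made for illustration.

```python
import re

# Language codes as listed on the slide.
LANGUAGES = {"0": "indeterminate", "1": "Welsh", "2": "English"}

# Main tier: speaker code, tagged words, then the audio location.
# Straight and typographic quotation marks are both accepted around the file name.
LINE_RE = re.compile(
    r'^\*(?P<speaker>[A-Z]{3}):?\s+'
    r'(?P<utterance>.*?)\s*'
    r'%snd:["“”](?P<file>[^"“”]+)["“”]\s+(?P<start>\d+)\s+(?P<end>\d+)\s*$'
)
WORD_RE = re.compile(r'(?P<form>.+)@(?P<tag>\d)')


def parse_utterance(line):
    """Split one main-tier line into speaker, tagged words and audio location."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("not a recognised utterance line")
    words = []
    for token in match.group("utterance").split():
        tagged = WORD_RE.fullmatch(token)
        if tagged:  # punctuation and markers carry no language tag
            language = LANGUAGES.get(tagged.group("tag"), "unknown")
            words.append((tagged.group("form"), language))
    return {
        "speaker": match.group("speaker"),
        "words": words,
        "audio": (match.group("file"),
                  int(match.group("start")),
                  int(match.group("end"))),
    }


sample = '*SER: dw@1 i@1 (y)n@1 hopeless@2 efo@1 tynnu@1 llun@1 . %snd:"deuchar1" 72848 73881'
parsed = parse_utterance(sample)
print(parsed["speaker"], parsed["audio"])
print(parsed["words"])
```

Only tokens of the form word@digit are kept as words, so the final full stop is skipped and hopeless is the one word tagged as English.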


Glossing

Allows non-native speakers to parse the conversation
Labour-intensive
Tedious
Inconsistent: ychydig – “a bit”/“a little”
Glossing tags difficult to revise later


Bangor corpora

Corpus                     Chats  Hours  Words  Date
Welsh-English (Siarad)        69     40   456k  2009
Welsh-Spanish (Patagonia)     32     20   161k  2011
Spanish-English (Miami)       31     20   126k  2011
Total                        132     80   743k

All available under the GPL.


Existing options

Spanish – CLAN → MOR + POST
Welsh – no tagger at all


Dictionaries


Dictionary format

Derived from GPL or PD resources
One database table
Words, not morphemes
Easily presented in a spreadsheet
Easy to update
Easy to get started


Welsh dictionary

surface   lemma     enlemma   pos   gender  number  tense
bara      bara      bread     n     m       sg
cathod    cath      cat       n     f       pl
mynd      mynd      go        v                     infin
aeth      mynd      go        v             3s      past
hapus     hapus     happy     adj
rhywsut   rhywsut   somehow   adv
heb       heb       without   prep


Spanish dictionary

surface    lemma    enlemma  pos   gender  number  tense
perro      perro    dog      n     m       sg
canciones  canción  song     n     f       pl
empezar    empezar  start    v                     infin
empieza    empezar  start    v             23s     pres
empieza    empezar  start    v             2s      imper
rojo       rojo     red      adj   m       sg
rojas      rojo     red      adj   f       pl
por        por      for      prep
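Because each dictionary is a single table keyed on surface form, retrieving every reading of a word is one query. The sketch below uses Python's built-in `sqlite3` module purely for illustration. The table name `dictionary`, the helper `readings` and the choice of SQLite are assumptions; the slides do not say which database the Autoglosser uses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dictionary ("
    "surface TEXT, lemma TEXT, enlemma TEXT, pos TEXT, "
    "gender TEXT, number TEXT, tense TEXT)"
)

# A few rows copied from the Spanish dictionary slide; empty cells are "".
ROWS = [
    ("perro", "perro", "dog", "n", "m", "sg", ""),
    ("empezar", "empezar", "start", "v", "", "", "infin"),
    ("empieza", "empezar", "start", "v", "", "23s", "pres"),
    ("empieza", "empezar", "start", "v", "", "2s", "imper"),
    ("por", "por", "for", "prep", "", "", ""),
]
conn.executemany("INSERT INTO dictionary VALUES (?, ?, ?, ?, ?, ?, ?)", ROWS)


def readings(surface):
    """Return every dictionary entry whose surface form matches exactly."""
    cursor = conn.execute(
        "SELECT lemma, enlemma, pos, gender, number, tense "
        "FROM dictionary WHERE surface = ? ORDER BY rowid",
        (surface,),
    )
    return cursor.fetchall()


for row in readings("empieza"):
    print(row)
```

The surface form empieza comes back with two readings (present and imperative), which is the kind of ambiguity the later disambiguation stage has to resolve.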


The autoglossing process


Stages in the autoglossing process

Stage 1 – Import the unglossed file
Stage 2 – Look up the words it contains
Stage 3 – Disambiguate between alternatives for a word
Stage 4 – Output the glossed file


Stage 1 – Import the CHAT file

Read each line of the file into an utterances table
Select the utterance and discard non-word material
Split the resulting utterance into words
Put them into a words table
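The import steps above can be sketched in a few lines of Python. This is an illustrative approximation rather than the Autoglosser's implementation: the function names, the dictionary-based "tables" and the exact set of markers treated as non-word material are assumptions. In particular, dropping a retraced span such as `<y si> [/]` follows the sample import slide and may not match every case.

```python
import re

NON_WORDS = {".", "?", "!", "#"}


def clean_utterance(text):
    """Discard non-word material and return the remaining words."""
    text = re.sub(r'%snd:\S+\s+\d+\s+\d+', ' ', text)   # audio location
    text = re.sub(r'<[^>]*>\s*\[/+\]', ' ', text)       # retraced span plus its marker
    text = re.sub(r'\[[^\]]*\]', ' ', text)             # any other bracketed marker
    text = re.sub(r'&\S+', ' ', text)                   # fragments and events (&h, &=laugh)
    text = re.sub(r'\+\S+', ' ', text)                  # linkers and terminators (+<, +/.)
    return [token for token in text.split() if token not in NON_WORDS]


def import_chat(lines):
    """Fill an utterances table and a words table from main-tier lines."""
    utterances, words = [], []
    for line in lines:
        match = re.match(r'\*([A-Z]{3}):\s*(.*)', line)
        if match is None:
            continue  # headers and dependent tiers are ignored in this sketch
        utterance_id = len(utterances) + 1
        utterances.append({"utterance_id": utterance_id,
                           "speaker": match.group(1),
                           "surface": match.group(2)})
        for location, word in enumerate(clean_utterance(match.group(2)), start=1):
            words.append({"utterance_id": utterance_id,
                          "location": location,
                          "surface": word})
    return utterances, words


chat_lines = [
    '@Begin',
    '*SOF: <y si> [/] y si entra algún camión ahí por ejemplo a dejar '
    'muebles o cualquier cosa . %snd:"sastre1" 11733 15321',
]
utterances, words = import_chat(chat_lines)
print(utterances[0]["speaker"], [w["surface"] for w in words])
```

Each word row keeps the utterance number and its position in the utterance, so that a gloss found later can be written back to the right place.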


Sample import

*SOF: <y si> [/] y si entra algún camión ahí por ejemplo a dejar muebles o cualquier cosa . %snd:"sastre1" 11733 15321

Speaker         *SOF
Utterance       y si entra algún camión ahí por ejemplo a dejar muebles o cualquier cosa .
Audio location  %snd:"sastre1" 11733 15321
English         And if some lorry goes in there, for example, to leave off furniture or whatever.


The words table


Stage 2 – Dictionary lookup

Using the language tag, look up each word against the appropriate dictionary
Do basic segmentation (e.g. clitic pronouns in Spanish, verb-tenses in English, mutation in Welsh)
Write out all the dictionary entries (readings) for that word
Feed these to the constraint grammar parser for disambiguation
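As an illustration of lookup combined with segmentation, the sketch below undoes Welsh soft mutation before consulting the dictionary, so that ddim is found under dim and gynnar under cynnar. It is a simplified approximation under stated assumptions: the names `DICTIONARIES`, `radical_candidates` and `lookup` are hypothetical, the dictionary is a tiny stand-in, and only soft mutation is handled.

```python
# Stand-in dictionaries keyed by language tag; each surface form maps to a
# list of readings (lemma, part of speech, English gloss).
DICTIONARIES = {
    "cy": {
        "dim": [("dim", "n", "nothing"), ("dim", "adv", "not")],
        "cynnar": [("cynnar", "adj", "early")],
        "iawn": [("iawn", "adv", "OK"), ("iawn", "adv", "very")],
    },
}

# Soft mutation reversed: mutated initial -> possible radical initials.
# "dd" is listed before "d" so that the digraph is tested first.
SOFT_MUTATION = [
    ("dd", ["d"]),
    ("b", ["p"]), ("d", ["t"]), ("g", ["c"]),
    ("f", ["b", "m"]), ("l", ["ll"]), ("r", ["rh"]),
]


def radical_candidates(word):
    """Return possible unmutated forms of a soft-mutated word."""
    candidates = []
    for mutated, radicals in SOFT_MUTATION:
        if word.startswith(mutated):
            rest = word[len(mutated):]
            candidates.extend(radical + rest for radical in radicals)
            break
    candidates.append("g" + word)  # soft mutation deletes an initial g
    return candidates


def lookup(word, lang):
    """Return all readings of a word as (lemma, pos, gloss, mutation)."""
    dictionary = DICTIONARIES.get(lang, {})
    found = [reading + ("",) for reading in dictionary.get(word, [])]
    if lang == "cy":
        for candidate in radical_candidates(word):
            found.extend(reading + ("sm",) for reading in dictionary.get(candidate, []))
    return found


for surface in ["ddim", "gynnar", "iawn"]:
    print(surface, lookup(surface, "cy"))
```

Every matching entry is returned, not just the first, because choosing between readings is left to the constraint grammar stage.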


Constraint Grammar

Developed by Fred Karlsson in the 90s
Third generation of the parser: visl-cg3
Eckhard Bick, Tino Didriksen
Free (GPL) license
beta.visl.sdu.dk/constraint_grammar.html
Easily-understood rules


Stage 3 – Disambiguation

select (n) if (-1 (ord));
Choose the noun (n) reading if the first word to the left (-1) is an ordinal (ord)
Welsh: yr ail dro (the second time)
English: the third man
Spanish: el primer viaje (the first journey)
Verb readings for dro, man and viaje will be deleted


Language-specific rules

Include that language’s tag in the rule to constrain its application
select ([es] n) if (-1 ([es] ord));
Now applies only to Spanish: el primer viaje
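To make the effect of the two select rules concrete, here is a small Python emulation of a single rule of that shape. It is not how visl-cg3 works internally and covers only this one pattern (one target tag, one tag on the word immediately to the left). The function `select`, the cohort representation and the verb tags given to viaje and man are assumptions for illustration.

```python
def select(cohorts, target, left, lang=None):
    """Emulate 'select (target) if (-1 (left))', optionally limited to one language.

    cohorts is a list of (surface, readings); each reading is a set of tags.
    """
    def matches(reading, tag):
        return tag in reading and (lang is None or lang in reading)

    result = []
    for index, (surface, readings) in enumerate(cohorts):
        previous = cohorts[index - 1][1] if index > 0 else []
        context_ok = any(matches(reading, left) for reading in previous)
        wanted = [reading for reading in readings if matches(reading, target)]
        if context_ok and wanted:
            readings = wanted  # keep the target readings, drop the rest
        result.append((surface, readings))
    return result


spanish = [
    ("el", [{"es", "det"}]),
    ("primer", [{"es", "ord"}]),
    ("viaje", [{"es", "n", "m", "sg"}, {"es", "v", "pres", "subj"}]),
]
english = [
    ("the", [{"en", "det"}]),
    ("third", [{"en", "ord"}]),
    ("man", [{"en", "n", "sg"}, {"en", "v", "infin"}]),
]

generic_es = select(spanish, "n", "ord")
generic_en = select(english, "n", "ord")
spanish_only_en = select(english, "n", "ord", lang="es")
print(generic_es[2], generic_en[2], spanish_only_en[2])
```

With the generic rule both viaje and man lose their verb readings. With the language tag added, the rule leaves the English phrase untouched.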


Before disambiguation

"<ddim>"
    "dim" {96,1} [cy] n m sg :nothing: [208789] + sm
    "dim" {96,1} [cy] adv :not: [204176] + sm
"<yn>"
    "yn" {96,2} [cy] stat :stative: [200654]
    "yn" {96,2} [cy] prep :in: [204430]
    "gan" {96,2} [cy] prep :with: [196964] + sm
"<gynnar>"
    "cynnar" {96,3} [cy] adj :early: [209212] + sm
"<iawn>"
    "iawn" {96,4} [cy] adv :OK: [207540]
    "iawn" {96,4} [cy] adv :very: [203775]

(Patagonia corpus, Patagonia1)
“not very early”


After disambiguation

"<ddim>"
    "dim" {96,1} [cy] adv :not: [204176] + sm
"<yn>"
    "yn" {96,2} [cy] stat :stative: [200654]
"<gynnar>"
    "cynnar" {96,3} [cy] adj :early: [209212] + sm
"<iawn>"
    "iawn" {96,4} [cy] adv :very: [203775]

(Patagonia corpus, Patagonia1)
“not very early”


Stage 4 – Output the glossed file

Read the disambiguated constraint grammar output
Insert each lexeme and its part-of-speech tags into the words table
Use the utterances and words tables to write out an autoglossed file
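A rough Python sketch of reading disambiguated constraint grammar output back into word rows follows. It makes several assumptions that the slides do not confirm: that the pair in braces is utterance number and word position, that the bracketed number is a dictionary entry id, and that a gloss can be rendered as gloss.TAGS with +SM for soft mutation. The names `parse_cg` and `gloss_line` and that output format are hypothetical.

```python
import re

READING_RE = re.compile(
    r'^\s*"(?P<lemma>[^"<>]+)"\s+\{(?P<utterance>\d+),(?P<location>\d+)\}\s+'
    r'\[(?P<lang>\w+)\]\s+(?P<tags>.*?)\s*:(?P<gloss>[^:]*):\s+\[(?P<entry>\d+)\]'
    r'(?P<extra>.*)$'
)
COHORT_RE = re.compile(r'^\s*"<(.+)>"\s*$')


def parse_cg(text):
    """Turn cohort/reading lines into one row per surviving reading."""
    rows, surface = [], None
    for line in text.splitlines():
        cohort = COHORT_RE.match(line)
        if cohort:
            surface = cohort.group(1)
            continue
        reading = READING_RE.match(line)
        if reading:
            row = reading.groupdict()
            row["surface"] = surface
            row["mutation"] = row.pop("extra").replace("+", "").strip()
            rows.append(row)
    return rows


def gloss_line(rows):
    """Join the rows into a single gloss string (hypothetical format)."""
    parts = []
    for row in rows:
        tags = ".".join(tag.upper() for tag in row["tags"].split())
        part = row["gloss"] + "." + tags
        if row["mutation"]:
            part += "+" + row["mutation"].upper()
        parts.append(part)
    return " ".join(parts)


CG_OUTPUT = '''"<ddim>"
    "dim" {96,1} [cy] adv :not: [204176] + sm
"<yn>"
    "yn" {96,2} [cy] stat :stative: [200654]
"<gynnar>"
    "cynnar" {96,3} [cy] adj :early: [209212] + sm
"<iawn>"
    "iawn" {96,4} [cy] adv :very: [203775]
'''
rows = parse_cg(CG_OUTPUT)
print(gloss_line(rows))
```

Keeping the utterance number and position on each row is what would allow the gloss to be matched back to the words table before the file is written out.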


The words table


The words table – glossed


Evaluation


Speed

900-1100 words per minute
1 minute to autogloss 5 minutes of speech
Siarad: 500,000 words in 8h27m
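The three figures above are mutually consistent, as a quick calculation shows. The sketch uses the 40 hours of recordings listed for Siarad on the corpora slide; the variable names are illustrative only.

```python
# Siarad run: 500,000 words in 8h27m.
processing_minutes = 8 * 60 + 27           # 507 minutes
words_per_minute = 500_000 / processing_minutes

# Siarad contains 40 hours of speech (from the corpora slide).
speech_minutes = 40 * 60
speech_per_processing_minute = speech_minutes / processing_minutes

print(round(words_per_minute))                 # words glossed per minute
print(round(speech_per_processing_minute, 1))  # minutes of speech per minute of processing
```

The rate comes out at roughly 986 words per minute, inside the quoted 900-1100 range, and about 4.7 minutes of speech per minute of processing, close to the quoted 5.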


Comparison with other methods

Spanish – MOR glosser (part of the CLAN suite)
Welsh – manual (human) glossing
Two sample files from each corpus glossed using both methods
Aligned and then inspected manually
Typos or missing lexemes not counted as errors
Names omitted from consideration
Full data-bundle available for download


Comparison between autoglosser and MOR


Spanish files

Tested on Herring11, Sastre5
8,039 tokens, 1,638 types (TTR: 0.20)

Language mix
Spanish 88%
English 9%
indeterminate 3%


Welsh files

Tested on Stammers7, Stammers9
9,454 tokens, 1,376 types (TTR: 0.15)

Language mix
Welsh 87%
English 2%
indeterminate 11%
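The type-token ratio (TTR) quoted for the test files is the number of distinct word forms divided by the total number of words. A short Python sketch, with hypothetical function names, reproduces both reported values from the token and type counts and shows the same calculation on a word list.

```python
def type_token_ratio(types, tokens):
    """TTR from precomputed counts."""
    return types / tokens


def ttr_from_words(words):
    """TTR computed directly from a list of word tokens."""
    return len(set(words)) / len(words)


spanish_ttr = type_token_ratio(types=1638, tokens=8039)
welsh_ttr = type_token_ratio(types=1376, tokens=9454)
print(round(spanish_ttr, 2), round(welsh_ttr, 2))

# Nine tokens, seven distinct forms.
example = ["dw", "i", "yn", "tynnu", "llun", "i", "i", "yr", "plant"]
print(round(ttr_from_words(example), 2))
```

A lower ratio means more repetition of the same word forms, so the Welsh sample files reuse their vocabulary slightly more than the Spanish ones.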


Comparison with MOR glossing (Spanish)

            Autoglosser   MOR
Coverage    96.9%         95.7%
Accuracy    97.4%*        97.6%†

* wrong lexeme 0.7%, wrong POS 0.2%, ambiguous 1.7%
† wrong lexeme 1.6%, wrong POS 0.7%, ambiguous 0.1%


Comparison with manual glossing (Welsh)

            Autoglosser   Human
Coverage    98.3%         99.9%
Accuracy    97.9%*        99.9%

* wrong lexeme 0.7%, wrong POS 0.1%, ambiguous 1.4%
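The footnoted error categories appear to account for the gap between each accuracy figure and 100%. The following Python check assumes that accuracy is simply 100% minus the sum of the three categories (wrong lexeme, wrong POS, ambiguous). That relationship is an inference from the numbers, not something the slides state.

```python
# Error breakdowns from the footnotes, in percent:
# (wrong lexeme, wrong POS, ambiguous)
ERRORS = {
    "Autoglosser (Spanish)": (0.7, 0.2, 1.7),
    "MOR (Spanish)": (1.6, 0.7, 0.1),
    "Autoglosser (Welsh)": (0.7, 0.1, 1.4),
}


def accuracy(errors):
    """Accuracy in percent, assuming it is 100 minus the summed error rates."""
    return round(100 - sum(errors), 1)


results = {name: accuracy(errors) for name, errors in ERRORS.items()}
for name, value in results.items():
    print(name, value)
```

Under that assumption the Spanish figures match the table exactly (97.4% and 97.6%), while the Welsh breakdown gives 97.8% against the reported 97.9%, a difference that is presumably due to rounding of the individual categories.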


Spin-off benefits


Typesetting – before

*SER: dw@1 i@1 (y)n@1 hopeless@2 efo@1 tynnu@1 llun@1 . %snd:"deuchar1" 72848 73881
%gls: be.1S.PRES PRON.1S PRT hopeless with take.NONFIN picture
%eng: I’m hopeless at drawing
*MYF: +< &=laugh . %snd:"deuchar1" 73196 73881
*SER: dw@1 i@1 (y)n@1 tynnu@1 llun@1 i@1 [/] i@1 (y)r@1 plant@1 <i@1 plant@1> [//] <i@1 (y)r@1> [//] # i@1 er@0 &h Helen@0 a@1 Susanna@0 a@1 +/. %snd:"deuchar1" 73881 79477
%gls: be.1S.PRES PRON.1S PRT take.NONFIN picture for for DET children for children for DET for IM Helen and Susanna and
%eng: I draw a picture for . . . for the children, for, er, Helen and Susanna and. . .

(Siarad corpus, deuchar1)


Typesetting – after


Resources


bangortalk.org.uk
Data-bundle for this presentation
Link to Bangor Autoglosser code (Git repository)
Licensed under GPL v3
