Pragmatic Annotation & Analysis in DART
Martin Weisser
School of English & Education
Guangdong University of Foreign Studies
weissermar@gmail.com
martinweisser.org
Outline
• Getting DART
• Design Background
• DART Annotation Scheme
• Basic Automated Annotation
• Speech-Act Analysis
• N-Gram Analysis
• Creating & Editing Resources
Getting DART
• go to http://martinweisser.org/ling_soft.html#DART
• download & run installer (currently 64-bit Windows only)
Design Background (1)
• 1997–1998: Expert Advisory Group on Language Engineering Standards (EAGLES) WP4
  – guidelines for the representation and annotation of dialogue
• 2001–2002: SPAAC (A Speech-Act Annotated Corpus of Dialogues) Project
  – annotation of some 1,200 task-oriented dialogue files (Trainline + BT)
  – need to annotate and post-edit corpus efficiently and consistently on multiple levels → SPAACy
Design Background (2)
colour coding helps to identify syntactic patterns
post-processing constrained through fixed options
resources loaded automatically
Design Background (3)
• flaws in SPAACy
  – monolithic, i.e. no separation of ‘linguistic intelligence’ & output display → hard to improve linguistic analysis
  – processing & editing of single files only
  – other interface issues, e.g. too many buttons, etc.
• development of DART
  – modularisation
  – strict separation of processing and linguistic analysis routines
  – enhanced options for analysis and creation of resources
DART Annotation Scheme (1) – Basic Input Format
optional stylesheet reference
text with optional punctuation ‘tags’ or embedded comments
basic skeleton can be created via ‘File→New’ (Ctrl + n)
DART Annotation Scheme (1) – Output Format
syntactic category
mode = semantico-pragmatic markers/‘IFIDs’
topic = semantic info
(surface) polarity
speech act(s)
speech act generally inferred from combination of syntax + mode
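As a rough illustration of how the parts of this output format fit together, the sketch below models one annotated unit as a small record and derives a speech act from a (syntax, mode) lookup. The category names, mode labels, speech-act labels, and the lookup table are all invented for illustration; they are not DART's actual inventory or inference logic.

```python
from dataclasses import dataclass, field

# Hypothetical rule table: (syntactic category, mode) -> speech act.
# None of these labels are taken from DART itself.
TOY_RULES = {
    ("q-yn", "query"): "reqInfo",
    ("decl", "intent"): "stateIntent",
    ("imp", "request"): "direct",
}


@dataclass
class Unit:
    category: str                 # syntactic category
    text: str
    mode: str = ""                # semantico-pragmatic marker / IFID
    topic: str = ""               # semantic info
    polarity: str = "positive"    # surface polarity
    speech_acts: list = field(default_factory=list)


def infer_speech_act(unit):
    """Fill in a speech act from syntax + mode; fall back to 'unclassified'."""
    act = TOY_RULES.get((unit.category, unit.mode), "unclassified")
    unit.speech_acts = [act]
    return unit


unit = infer_speech_act(
    Unit("q-yn", "do you have a railcard", mode="query", topic="railcard")
)
print(unit.category, unit.speech_acts)
```

A plain lookup table is used here only to make the "syntax + mode" combination visible; a real inference step would presumably need more context than a single pair of labels.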
Basic Automated Annotation
input files workspace
output files workspace
to load single file, press Ctrl + a (for whole directory, Ctrl + d)
single file loaded; to pre-edit, click hyperlink; to annotate pragmatically, press Ctrl + a
debugging output; ignore if annotation completes successfully
single file processed; to post-edit, click hyperlink
Speech-Act Analysis
• generate frequency list of syntactic category + speech act(s) from ‘Analysis→Speech act stats’
• click hyperlinked speech act (combination) to prime concordancer
• investigate results
• if necessary, correct speech act tag by clicking the hyperlink to the file and editing it
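The frequency list described above can be approximated outside the tool with a few lines of Python. This is only a sketch: the sample markup, the element names, and the `sp-act` attribute are assumptions made for the example, not a guaranteed description of DART's output files.

```python
import re
from collections import Counter

# Hypothetical annotated output; element and attribute names are assumptions.
SAMPLE = """
<decl n="1" sp-act="state" polarity="positive">i need a ticket to leeds</decl>
<q-wh n="2" sp-act="reqInfo" polarity="positive">when do you want to travel</q-wh>
<decl n="3" sp-act="state" polarity="positive">tomorrow morning</decl>
<q-yn n="4" sp-act="reqInfo" polarity="positive">do you have a railcard</q-yn>
"""

# Match an opening tag and capture its name and its sp-act value.
UNIT = re.compile(r'<([\w-]+)\b[^>]*\bsp-act="([^"]*)"')


def speech_act_stats(text):
    """Count (syntactic category, speech act) combinations."""
    return Counter(UNIT.findall(text))


stats = speech_act_stats(SAMPLE)
for (category, act), freq in stats.most_common():
    print(f"{category}\t{act}\t{freq}")
```

Each (category, speech act) pair in the resulting list could then serve as a search key for a concordance, which is roughly what the hyperlinks in the tool do.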
N-Gram Analysis
• useful for determining formulaic expressions for modes or topic patterns (or in general)
• predefined options for uni- to tri-grams
• optionally also freely definable n-grams
• frequency lists display abs. & rel. frequencies
• hyperlink again primes concordancer
  – for all n>1 with interpolated optional fillers
  – due to accommodating mixed-case data, sometimes ‘case insensitive’ flag required
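To make the two ideas above concrete (absolute and relative n-gram frequencies, and a concordance search that tolerates optional fillers between the words), here is a minimal sketch. The function names, the `max_fillers` parameter, and the sample sentence are hypothetical; DART's own implementation may differ.

```python
import re
from collections import Counter


def ngram_frequencies(tokens, n):
    """Return {ngram: (absolute, relative)} for all n-grams in tokens."""
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(grams)
    return {g: (c, c / total) for g, c in Counter(grams).items()}


def concordance_pattern(ngram, max_fillers=1, ignore_case=True):
    """Build a regex matching the n-gram's words with up to max_fillers
    optional intervening words between each adjacent pair."""
    words = [re.escape(w) for w in ngram.split()]
    gap = r"\s+(?:\w+\s+){0,%d}" % max_fillers
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(r"\b" + gap.join(words) + r"\b", flags)


tokens = "i would like to book a ticket and i would like a seat".split()
freqs = ngram_frequencies(tokens, 2)
pattern = concordance_pattern("would like")

print(freqs["would like"])
print(bool(pattern.search("I Would really like that")))
```

The `ignore_case` switch mirrors the ‘case insensitive’ flag mentioned above: with mixed-case data, a pattern built from a lowercased n-gram would otherwise miss capitalised occurrences.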
Creating & Editing Resources (1)
• mostly done via ‘Edit resources’ menu…
• … apart from creating new files
• to create new corpus
  – choose ‘Edit configuration’
  – click ‘Add corpus entry’
  – fill in corpus, lexicon, and topic file name (usually identical, apart from extension)
  – click ‘Save configuration’
• new resources created
  – data folder for corpus
  – three subfolders: ‘info’, ‘notes’, and ‘stats’
  – dummy lexicon & topics files (in relevant program folders)
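The set of resources created for a new corpus can be pictured with the following sketch, which builds an equivalent skeleton on disk. Only the three subfolder names come from the slide; the `data`, `lexica`, and `topics` folder names and the `.lex` / `.top` extensions are placeholders chosen for the example, and the function is not part of DART.

```python
import tempfile
from pathlib import Path


def create_corpus_skeleton(program_dir, corpus,
                           lexicon_ext=".lex", topics_ext=".top"):
    """Create a data folder with 'info', 'notes', 'stats' subfolders plus
    empty lexicon and topics files. Folder names and extensions other than
    the three subfolders are assumptions."""
    program_dir = Path(program_dir)
    data_dir = program_dir / "data" / corpus
    for sub in ("info", "notes", "stats"):
        (data_dir / sub).mkdir(parents=True, exist_ok=True)
    created = [data_dir]
    for folder, ext in (("lexica", lexicon_ext), ("topics", topics_ext)):
        target = program_dir / folder / (corpus + ext)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch(exist_ok=True)  # dummy (empty) resource file
        created.append(target)
    return created


with tempfile.TemporaryDirectory() as tmp:
    paths = create_corpus_skeleton(tmp, "trains")
    subfolders = sorted(p.name for p in paths[0].iterdir())
    print(subfolders)
```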
Creating & Editing Resources (2)
• existing resources can be edited…
  – generally via relevant entry in the ‘Edit resources’ menu
  – lexica & topic files via hyperlinks in configuration editor
• safest to edit only dialogue, lexica & topic files…
• … unless you really know what you’re doing
• lexica can also be ‘synthesised’ from corpus data
Creating & Editing Resources (3) – Lexica
• very simple format
  – word (base form) + space + tag + optional comment (preceded by #)
  – special DART tagset
• allows for lexical polysemy
  – uppercase tag name = unambiguous
  – lowercase tag name = predominantly tag X
• tooltips on tag buttons provide explanations while editing
• synthesising lexicon works by
  – creating word list from corpus
  – ‘subtracting’ items from general lexicon
  – suggesting possible candidates after morphological analysis
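The lexicon format and the first two synthesis steps are simple enough to sketch in code. In the example below, the tags `N` and `v` and the sample entries are invented (the actual DART tagset is not reproduced here), and the morphological-analysis step is left out entirely.

```python
def parse_lexicon_line(line):
    """Parse 'word tag # optional comment' into a dict, or return None for
    blank, comment-only, or malformed lines."""
    entry, _, comment = line.partition("#")
    parts = entry.split()
    if len(parts) != 2:
        return None
    word, tag = parts
    return {
        "word": word,
        "tag": tag,
        # Uppercase tag = unambiguous; lowercase = only the predominant tag.
        "unambiguous": tag.isupper(),
        "comment": comment.strip(),
    }


def unknown_words(corpus_tokens, general_lexicon):
    """Word list from the corpus minus items already in the general lexicon."""
    return sorted({t.lower() for t in corpus_tokens} - set(general_lexicon))


entry = parse_lexicon_line("ticket N # hypothetical tag and comment")
candidates = unknown_words(["Book", "a", "railcard"], {"book", "a"})
print(entry, candidates)
```

The words returned by `unknown_words` correspond to the items that would still need a tag; in the tool, candidates for them are suggested after morphological analysis.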
Creating & Editing Resources (4) – Topic Files
• syntax more complex than for lexica
• combination of topic labels, space, double colon, space, associated (representative) patterns
• patterns expressed as
  – regexes
  – individual sub-patterns separated by 3 underscores