Thomas Schmidt SFB 538 ‚Mehrsprachigkeit‘ University of Hamburg
KONVENS Wien, 15 Sep 2004
EXMARaLDA – A modeling and visualization framework for the computer-assisted transcription of spoken language
Thomas Schmidt
SFB 538 ‚Mehrsprachigkeit‘
University of Hamburg
Background
• Multilingual Database, SFB 538 „Mehrsprachigkeit“, University of Hamburg
• EXMARaLDA (Extensible Markup Language for Discourse Annotation)
• Dissertation project „Computer-based transcription of spoken language as a modelling and visualisation process“ (Supervisor: Angelika Storrer)
Background
• Transcription of spoken language
– Interviewer / child interaction
– Classroom interaction
– Interpreted doctor-patient discourse
– for discourse / conversation analysis
– for (child) language acquisition studies
Background
• Problem: Diversity of Transcription Data
– Theoretical diversity:
  • Entities of transcription (utterances, turns, non-verbal activities etc.)
  • Relations between entities (temporal, hierarchical, features, ...)
  • Presentation formats (partitur notation, column notation, ...)
– Technological diversity:
  • Storage formats (text, binary, RDB)
  • Software (syncWriter, HIAT-DOS, DBM-Systems, word processors, ...)
  • Operating Systems (Windows, Mac OS)
Background
• Problem: Diversity of Transcription Data
• Aim: A common platform for computer-assisted transcription
– Exchange, reuse, archive transcription data
– Merge corpora
– Use different software tools with one piece of data
Background
• Problem: Diversity of Transcription Data
• Aim: A common platform for computer-assisted transcription
• (Elements of a) Solution
– XML technology
– Three level architecture
– Separate form from content
– Separate logical from physical structure
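The separation of logical from physical structure can be sketched in a few lines: the logical structure (a timeline plus events that point into it) lives in memory, and XML is only one physical serialization of it. This is an illustrative sketch, not the actual EXMARaLDA file format; the element and attribute names (`transcription`, `timeline`, `point`, `tier`, `event`) and the sample utterances are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Logical structure: timeline points plus events anchored to them.
# All names and sample data are hypothetical, chosen for illustration.
timeline = ["T0", "T1", "T2"]
events = [
    {"tier": "DOC", "start": "T0", "end": "T1", "text": "How are you?"},
    {"tier": "INT", "start": "T1", "end": "T2", "text": "Wie geht es Ihnen?"},
]


def to_xml(timeline, events):
    """Serialize the logical structure to one possible physical form (XML)."""
    root = ET.Element("transcription")
    tl = ET.SubElement(root, "timeline")
    for point in timeline:
        ET.SubElement(tl, "point", id=point)
    tiers = {}
    for ev in events:
        if ev["tier"] not in tiers:
            tiers[ev["tier"]] = ET.SubElement(root, "tier", id=ev["tier"])
        node = ET.SubElement(tiers[ev["tier"]], "event",
                             start=ev["start"], end=ev["end"])
        node.text = ev["text"]
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_text):
    """Recover the same logical structure from the XML string."""
    root = ET.fromstring(xml_text)
    points = [p.get("id") for p in root.find("timeline")]
    evs = [
        {"tier": tier.get("id"), "start": e.get("start"),
         "end": e.get("end"), "text": e.text}
        for tier in root.findall("tier") for e in tier.findall("event")
    ]
    return points, evs


xml_text = to_xml(timeline, events)
assert from_xml(xml_text) == (timeline, events)  # round trip is lossless
```

Because the round trip is lossless, any tool that reads the XML sees the same logical structure, regardless of how it later chooses to display it.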
Topics of this talk
1. Some methodological considerations:
Linguistic methods / Computer science methods
„Computing in the humanities“
Interdisciplinary communication
2. Components of the developed system
Methodological considerations
[Diagram: established vs. modified view of transcription]
Established view: Transcription as... „Verschriftlichung“, Theory (Transcript)
Quality criteria: Readability, Adequacy
Modified view: Transcription as... Modelling, Visualisation (Computer)
Text technology view: Form, Content, Document
Database view: E/R model, View, Application vs. Logical layer
Model theory view: Symbolic model, Analogue model
Methodological considerations
Transcription as Modeling and Visualization of spoken language
• Accordance with text-technological concepts
• One model, different visualizations
• No tradeoff between readability and adequacy
• No tradeoff between human and computer processability
• No “Standardization” of models
– a common modelling framework, not a common model
– no ontological specifications
– XML = Standardization of physical representation
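The principle "one model, different visualizations" can be illustrated with a small sketch: the same list of events is rendered once in a score-like (partitur) layout and once as a line transcript. The `Event` class, the two rendering functions and the sample dialogue are hypothetical and only serve the illustration; they are not taken from the EXMARaLDA tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    speaker: str
    start: int   # index of the timeline point where the event begins
    end: int     # index of the timeline point where it ends
    text: str


# Hypothetical sample data: B's "yes" overlaps the end of A's turn.
# Simplifying assumption: every event spans exactly one timeline interval.
MODEL = [
    Event("A", 0, 1, "so you mean "),
    Event("A", 1, 2, "like this"),
    Event("B", 1, 2, "yes"),
    Event("B", 2, 3, "exactly"),
]


def as_partitur(events):
    """Score layout: one row per speaker, one column per timeline interval."""
    speakers = sorted({e.speaker for e in events})
    n_intervals = max(e.end for e in events)
    cell = {(e.speaker, e.start): e.text for e in events}
    widths = [
        max(len(cell.get((s, i), "")) for s in speakers)
        for i in range(n_intervals)
    ]
    rows = []
    for s in speakers:
        cols = [cell.get((s, i), "").ljust(widths[i]) for i in range(n_intervals)]
        rows.append(f"{s}: " + "|".join(cols))
    return "\n".join(rows)


def as_line_transcript(events):
    """Line layout: one line per event, ordered by start point, then speaker."""
    ordered = sorted(events, key=lambda e: (e.start, e.speaker))
    return "\n".join(f"{e.speaker}: {e.text.strip()}" for e in ordered)


print(as_partitur(MODEL))
print(as_line_transcript(MODEL))
```

In the partitur output, simultaneous events share a column, so overlap is visible at a glance; the line transcript trades that for a more compact, text-like reading. Neither function changes the model.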
Visualization to Model
Structural relations:
1. Temporal sequence
2. Simultaneity
3. Equivalence (Entity Feature)
4. Hierarchy (Containment)
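The four structural relations can be made explicit in a small data structure. The following is a minimal sketch under stated assumptions: the `Segment` class, its field names, the feature keys and the sample data are hypothetical, and time is reduced to integer timeline points.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    speaker: str
    start: int                 # timeline point where the segment begins
    end: int                   # timeline point where it ends
    text: str
    features: dict = field(default_factory=dict)   # 3. entity -> feature
    parts: list = field(default_factory=list)      # 4. contained segments


def precedes(a, b):
    """1. Temporal sequence: a is over before b starts."""
    return a.end <= b.start


def simultaneous(a, b):
    """2. Simultaneity: the two time spans overlap."""
    return a.start < b.end and b.start < a.end


def contains(a, b):
    """4. Hierarchy: b is listed as a part of a and lies inside its span."""
    return b in a.parts and a.start <= b.start and b.end <= a.end


# Hypothetical sample: an utterance made of two words, plus a non-verbal event.
word1 = Segment("A", 0, 1, "come", features={"pos": "verb"})
word2 = Segment("A", 1, 2, "here")
utterance = Segment("A", 0, 2, "come here", parts=[word1, word2])
nod = Segment("B", 1, 2, "((nods))", features={"type": "non-verbal"})

assert precedes(word1, word2)            # 1. temporal sequence
assert simultaneous(word2, nod)          # 2. simultaneity
assert word1.features["pos"] == "verb"   # 3. equivalence (entity, feature)
assert contains(utterance, word1)        # 4. hierarchy (containment)
```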
Modeling framework
• Relational? Sequence? Simultaneity?
• OHCO? Simultaneity?
• DAG: Annotation Graphs? Complexity? Transcription Graphs
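One possible reading of the question mark after "OHCO? Simultaneity?" is that a single ordered hierarchy (one tree) copes with sequence and containment but not with overlapping speech, because partially overlapping spans cannot be nested. The sketch below checks that property for a set of time spans; the function name and the timings are hypothetical, and the interpretation of the slide is an assumption.

```python
def fits_single_hierarchy(spans):
    """Return True if no two (start, end) spans partially overlap,
    i.e. every pair is either disjoint or one contains the other."""
    for i, (s1, e1) in enumerate(spans):
        for s2, e2 in spans[i + 1:]:
            disjoint = e1 <= s2 or e2 <= s1
            nested = (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2)
            if not (disjoint or nested):
                return False
    return True


# Hypothetical timings in seconds.
monologue = [(0.0, 4.0), (0.0, 1.5), (1.5, 4.0)]   # utterance and its two parts
overlapping_talk = [(0.0, 2.0), (1.5, 3.0)]        # B starts before A finishes

assert fits_single_hierarchy(monologue)
assert not fits_single_hierarchy(overlapping_talk)
```

A graph whose nodes are shared timeline points does not have this restriction, since both speakers' events simply attach to the same points.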
System architecture
Application: Input tools
• EXMARaLDA Partitur-Editor
• Simple EXMARaLDA Text file
• TASX annotator
• PRAAT
• EUDICO Linguistic Annotator (ELAN)
Application: Visualization
... as a wrapped partitur
... as a line transcript
... in column notation
Application: Corpus management
EXMARaLDA Corpus Manager (COMA)
Application: Query/Analysis
Search and Query Instrument for EXMARaLDA (SQUIRREL)
Project status
• Software past beta stage
• Five projects at our own institution use EXMARaLDA for their corpus work
• Around 800 users in research and teaching outside SFB
• Used at the IDS in Mannheim
• Submitted a suggestion for integration of data model into P5 of the TEI guidelines
Summary
• Transcription as theory and „Verschriftlichung“
• Computer-assisted transcription as modelling and visualisation
• Interdisciplinary bridge / Methodology of computational techniques in „classical“ linguistics
• Concrete practical improvements for work with transcription data
• EXMARaLDA and Database „Multilingualism“
• Data model, formats and tools building on the separation of model and visualisation
Fin.