
Cognitive corpus-based LSP lexicography – research and implementation issues – a case study on the Multilingual Glossary on Risk Management

Gerhard Budin
University of Vienna
Austrian Academy of Sciences

8 April 2011

Our empirical research landscape

The Making of… a Multilingual Glossary on Risk Management

Motivations and Methods: Terminologies for Risk Communication

• The role of LSP lexicography in domain communication
  – Increase the “transparency” of terms
  – Help negotiate a common understanding of terms in intra-, inter- and trans-disciplinary and transcultural discourse
  – Help increase the consistency of risk discourse (written and spoken) and increase understanding in target audiences
  – Reduce unnecessary synonyms, disambiguate polysemous terms, help separate homonyms
  – Help create risk terminologies in many languages
  – Support knowledge sharing and knowledge transfer in cooperative work environments
  – Support cross-cultural discourse (e.g. translation and parallel texts)

The Domains of Risk Management

• Multidisciplinary, diverse, and fragmented - or
• Transdisciplinary, overlapping, converging, integrated, and complementary
• The need for mediating between different approaches, cultures, and discourses:
  – Technological, engineering, research, science
  – Administration, legislation, monitoring
  – Social, sociological, political, cultural
  – Domain approaches (financial, ecological, chemical, safety, geographical, planning and forecast, health, etc.)

WIN Project (FP6 2004-2009): WP “Human Language Interoperability”

• Objectives
  – WP 2200 is designed to support international risk management and risk communication processes (within the WIN project and beyond)
• Achieved results (with ongoing work)
  – Large parallel corpora collection with risk-related texts and lexical resources (fr, en, de, es, ro, fi, hu, ru)
  – Multilingual index with conceptual structure
  – Bibliography and codes of sources
  – Risk Ontology
  – Multilingual online terminology database

Integrative R&D Approach

• A combination of theoretical approaches and their methods in order to achieve a result that is targeted towards the needs of the project consortium and the cooperation partners
  – Quantitative (computational) and qualitative (intellectual) methods of corpus analysis
  – Lexicographical and terminographical (word/text-oriented and concept/knowledge-oriented)
  – Text linguistics and translation studies
  – Cross-cultural comparative approach and knowledge system approach, multi-domain communication
  – Knowledge engineering, computational semantics/Web 2.0 (ontologies, frame semantics, etc.)
  – Cognitive Science approach (media pedagogy – eLearning, specific learner support, interactive approach (mental lexicon), usability engineering)

Motivation and Convergence of Research Interests and Contexts

• Interest in cognitive science research applied to terminology management, ontology engineering, translation technologies, E-Learning systems design and implementation

• Research Cluster 1 “Translation – Cognition – Technologies” at the Center for Translation Studies, University of Vienna

• Interdisciplinary Research Platform on Cognitive Science – Cluster on Cognitive Linguistics

• Research Priority 1 Lexicology, Terminology, and Parallel Corpora at the Institute for Corpus Linguistics and Text Technology at the Austrian Academy of Sciences

Research contexts in several projects

• Previous and ongoing projects
  – Dynamont
    • Methodology for Creating Dynamic Ontologies, BMVIT, national research programme “Semantic Systems” – multi-dimensional ontology modelling
  – WIN (Wide Area Information Network on Risk Management), MGRM (Multilingual Glossary on Risk Management)
    • IP (Integrated Project) in FP6, 2004-2008, focus on creating a multilingual terminology and ontology of risk management – risk ontology for natural hazards
  – Montific - Multilingual ONTology for Internal Financial Control, an LLP project (Leonardo da Vinci II)
    • Building a “learning ontology” for an eLearning environment
  – STABILITY AND ADAPTATION OF CLASSIFICATION SYSTEMS IN A CROSS-CULTURAL PERSPECTIVE - European Science Foundation: COST A 31 project
    • Cognitive linguistics – how “classifiers” are embodied in language incl. ontologies
  – TES4IP - Terminology Services for the Intellectual Property Domain (Bridge project funded by FFG, Austrian Research Agency)
    • Term extraction, multi-word term recognition, named entity recognition, legal vocabularies and legal ontologies
    • -> Ongoing study
  – Cognitive Ontologies
    • Designing, Generating and Using Domain Ontologies

Ontology Engineering and Cognitive Science

• Cognitive aspects have been of interest in a variety of ontology engineering approaches
  – Barry Smith
    • Epistemological focus combined with work on domain ontologies (mainly bio-medical)
    • Criticizing the epistemological foundations of terminology theory in elaborating his foundational theory of ontology
  – Aldo Gangemi
    • DOLCE: Descriptive Ontology for Linguistic and Cognitive Engineering
    • Foundational theory of ontology
    • Many projects, also on tools and on domain ontologies
  – But also many others (Guarino, Sheth, Obrst, Noy, et al.) have done research on these aspects
  – Some criticism that the focus in ontology evaluation is on syntactic evaluation for computational uses (only) – the classical scenario

“Cognitive Ontologies”

• Conceptual clarification:
  – Ontologies of cognitive processes
    • In neuroscience research, similar to other bio-medical ontologies (cognitive atlas, neuropsychiatric phenomena, ontology of cognitive objects, etc.)
  – Ontologies with a focus on their cognitive aspects
    • DOLCE and other cognitive-oriented approaches
    • Constructivist epistemology for ontology building, concerning the relation to “reality”
• Increasing convergence of these two concepts

Our own research

• Our previous and ongoing projects have been focusing on cognitive adequacy of domain ontologies and their use in knowledge acquisition in learning situations
  – Terminology studies as a contribution from this perspective (related research by Nistrup Madsen/Erdman Thomsen 2005, 2009, etc.)
• Using DOLCE design patterns for multi-dimensional conceptual modeling for ontology building – the DYNAMONT project
• From domain corpora to terminologies and from there to domain ontologies
  – For eLearning scenarios – the MONTIFIC project
  – For domain experts – the WIN/MULTH/MGRM project

Moving up (and down) the Ontology Spectrum

• The challenge: from linguistic-cultural diversity of discourse and free-form lexical structures to a unified, formalized, axiomatized ontology – and back, to support human understanding and social processes such as collaborative learning

• The method: an integrative, multi-level modelling approach specifying the steps in a process-oriented workflow framework (with variable, combinable steps depending on concrete needs) for
  – Gradual semantic enrichment
  – Gradual semantic formalization
  – Multi- and cross-lingual referencing/alignment for text management
  – Constant interaction between full texts and lex-term resources

• The technology: a multi-component workbench (i.e. Dynamont-WB incl. ProTerm as a central element), using XML, RDF, OWL, SKOS, WordNet + GlobalWordnet, MLIF (containing TBX, TMX, XLIFF, LMF, TMF, etc.), FrameNet, etc.

• The advantage: full exploitation of all types of language resources (LR) and knowledge organization systems (KOS), providing a framework not only for their semantic enrichment and formalization as ontologies but also for ontology-based multilingual authoring, text generation and translation

The global risk communication scenario

• Several projects since 1994 covering the following activities:
  – Thesaurus building
  – Creating multilingual terminology databases
  – Creating multilingual text corpora
  – Lexicographical glossary
  – Semantic enrichment (e.g. conceptual links, frame semantics)
  – Collection and analysis of relevant knowledge organization systems
  – Annotation of resources
  – Mark-up of resources (TBX, etc.)
  – Ontology building
  – Communication design

From texts and terminologies to ontologies – and back to texts

• Using the Risk scenario
  – Termbase
    • Export XML
    • Domain Models – meta-models -> patterns
  – Text corpus
    • Term extraction – comparative testing ProTerm, MultiTerm Extract, MultiCorpora
    • Aligning with termbase
    • Convert to RDF
  – Ontology import -> editor
  – Mappings (GMT, XML, RDF, OWL, UML, comma delimited, RDB, for different kinds of lex-term resources, FN->OWL, etc.)
• The MULTH-WIN Project as an example of methods integration
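As a rough illustration of the corpus-to-termbase-to-RDF steps listed above, the following Python sketch aligns extracted term candidates with termbase entries and serializes the termbase as SKOS-style triples. The miniature termbase, the concept identifiers, the `BASE` namespace, the candidate "flood wave" and all helper names are assumptions made for this example; it does not reproduce ProTerm, the Dynamont workbench, or the actual MGRM data model.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/risk/"  # hypothetical namespace, not the project's

# Hypothetical miniature termbase: concept id -> language -> terms
# (the first term per language is treated as the preferred one).
TERMBASE = {
    "C2-afflux": {"en": ["afflux"], "de": ["Hochwasser durch Aufstau"]},
    "C2-backwater": {"en": ["backwater"], "de": ["Rückstau"]},
}


def normalize(term):
    """Lower-case and collapse whitespace so surface variants compare equal."""
    return " ".join(term.lower().split())


def build_index(termbase):
    """Map (language, normalized term) to the concept id it designates."""
    index = {}
    for concept_id, by_lang in termbase.items():
        for lang, terms in by_lang.items():
            for term in terms:
                index[(lang, normalize(term))] = concept_id
    return index


def align(candidates, lang, index):
    """Split extracted candidates into known terms and new candidates."""
    matched, unmatched = {}, []
    for candidate in candidates:
        concept_id = index.get((lang, normalize(candidate)))
        if concept_id is None:
            unmatched.append(candidate)
        else:
            matched[candidate] = concept_id
    return matched, unmatched


def to_triples(termbase):
    """Emit N-Triples-like lines (literal escaping is omitted for brevity)."""
    triples = []
    for concept_id, by_lang in termbase.items():
        subject = f"<{BASE}{concept_id}>"
        for lang, terms in by_lang.items():
            for position, term in enumerate(terms):
                predicate = "prefLabel" if position == 0 else "altLabel"
                triples.append(f'{subject} <{SKOS}{predicate}> "{term}"@{lang} .')
    return triples


index = build_index(TERMBASE)
matched, unmatched = align(["Backwater", "flood wave"], "en", index)
triples = to_triples(TERMBASE)
```

Candidates that end up in `unmatched` would be the ones a terminologist reviews before they are added to the termbase; the triples could then be loaded into an ontology editor for further formalization.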

Terminological frame semantics

• INTERVENTION (ACTOR(S), ACTIVITIES/PHASES):
• RISK DETECTING (PRE-EVENT)
• - R-ASSESSMENT
• - R-PERCEPTION (X is risk)
• - EXPERIENCE (statistics, case studies)
• - OBSERVATION (monitoring)
• - METHOD
• - SATELLITE
• - PROGNOSES
• - R-ANALYSIS
• - R-FEATURES
• - SITUATION/CONTEXT (danger/hazard)
• - SIMULATION (course of events)
• - PROBABILISTIC METHODS (safety)
• - RELIABILITY
• - R-IDENTIFICATION (DAMAGE)
• - R-SOURCE
• - DAMAGE CAUSE
• - VULNERABILITY (DAMAGE TARGET)
• - SUSCEPTIBILITY (capacity/people)

Rothkegel

Terminological frame semantics

I. Pre-event B. Public awareness and planning, II. In-event: C. Events and response

afflux/Hochwasser durch Aufstau
BE [[TYPE=flood], [PLACE=], [TIME=]], HAVE [CAUSE [[ORIGIN=], [NIEDERSCHLAG [TYPE=]], [STAU [TYPE=Aufstau]]], DAMAGE [TARGET=, SOURCE=, DEGREE=]], HAPPEN [STATES=, PROCESSES=]]

backwater/Rückstau
BE [[TYPE=flood], [PLACE=], [TIME=]], HAVE [CAUSE [[ORIGIN=], [NIEDERSCHLAG [TYPE=]], [STAU [TYPE=Rückstau]]], DAMAGE [TARGET=, SOURCE=, DEGREE=]], HAPPEN [STATES=, PROCESSES=]]

Rothkegel
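To make the attribute-value notation of the two flood frames more concrete, here is a minimal Python sketch that encodes them as nested dictionaries and reports the slots in which they differ. The dictionary encoding, the use of `None` for unfilled slots, and the function names `make_flood_frame` and `frame_diff` are assumptions for illustration only; the slide does not specify a machine-readable format.

```python
from typing import Any


def make_flood_frame(stau_type: str) -> dict[str, Any]:
    """Build the BE / HAVE / HAPPEN frame; None marks a slot left open."""
    return {
        "BE": {"TYPE": "flood", "PLACE": None, "TIME": None},
        "HAVE": {
            "CAUSE": {
                "ORIGIN": None,
                "NIEDERSCHLAG": {"TYPE": None},
                "STAU": {"TYPE": stau_type},
            },
            "DAMAGE": {"TARGET": None, "SOURCE": None, "DEGREE": None},
        },
        "HAPPEN": {"STATES": None, "PROCESSES": None},
    }


def frame_diff(a: dict, b: dict, path: tuple = ()) -> list:
    """Return (slot path, value in a, value in b) for every differing slot."""
    diffs = []
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key), b.get(key)
        if isinstance(va, dict) and isinstance(vb, dict):
            diffs.extend(frame_diff(va, vb, path + (key,)))
        elif va != vb:
            diffs.append((".".join(path + (key,)), va, vb))
    return diffs


afflux = make_flood_frame("Aufstau")      # afflux / Hochwasser durch Aufstau
backwater = make_flood_frame("Rückstau")  # backwater / Rückstau
differences = frame_diff(afflux, backwater)
```

In this encoding the two concepts share every slot except the STAU type under CAUSE, which is the distinction the frame notation is meant to expose.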

Ordnance Survey

Dynamont architecture, tools and workflows

Phase 1: Identify the Problem

Phase 2: Structure the Problem

Phase 3: Identify Purpose and Scenario

Phase 4: Identify concepts of domain / subject matter

Phase 5: Create Knowledge Model

Phase 6: Create Application Profile

Phase 7: Create Acceptance

Phase 8: Create System

Phase 9: Implement System

Diagram components: Ontology Creation, Postgres, Visualization, Storage, Methodology, Collaboration, MDA-Component

The Glossary

• The paper version of the glossary is used by risk managers, civil engineers, but also teachers, students, translators, journalists, etc.
• Generally, the purpose of such multilingual conceptual glossaries is to improve domain communication and to facilitate mutual understanding across linguistic boundaries.
• The concepts of risk management and their definitions presented in this glossary were carefully selected from a large body of technical literature and authentic text corpora in the respective languages.
• These sources are referenced in the bibliography.
• The multilingual glossary presented here includes 8 languages: English and French as main pivot languages, as well as German, Spanish, Romanian, Finnish, Hungarian, and Russian.
• It comprises about 230 central concepts of risk management with about 400 definitions and about 1400 terms representing these concepts in each language (including synonyms and hyperonyms), indicating the conceptual relations between the entries.

The Glossary

• The following themes are used as the macro-structure of the glossary:
  – A. Risk assessment and technology assessment
  – B. Public perception of risk, planning, preparation and alarm
  – C0. Risk events, equipment and operations, general terms
  – C1. Fire - events, equipment and operations
  – C2. Floods - events, equipment and operations
  – C3. Oil spills - events, equipment and operations
• Each glossary entry follows the same micro-structure with the following information elements:
  – A conceptual number combined with a theme from the macro-structure
  – The equivalent terms in the 8 languages, accompanied by grammatical information
  – The definitions of the concept in each language, including multiple definitions that may differ from each other, accompanied by the textual source of the definition, also including structural semantic information on the concept
  – Related terms and expressions.
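The macro- and micro-structure described above could be modelled roughly as follows. This is a sketch under stated assumptions: the class and field names, the `entry_id` format, the concept number 1 and the grammatical label are hypothetical and are not taken from the glossary or its database schema.

```python
from dataclasses import dataclass, field

# Macro-structure themes and glossary languages as listed on the slides.
THEMES = {
    "A": "Risk assessment and technology assessment",
    "B": "Public perception of risk, planning, preparation and alarm",
    "C0": "Risk events, equipment and operations, general terms",
    "C1": "Fire - events, equipment and operations",
    "C2": "Floods - events, equipment and operations",
    "C3": "Oil spills - events, equipment and operations",
}
LANGUAGES = ("en", "fr", "de", "es", "ro", "fi", "hu", "ru")


@dataclass
class Term:
    text: str
    grammar: str = ""  # grammatical information, e.g. gender or number


@dataclass
class Definition:
    text: str
    source: str  # textual source of the definition


@dataclass
class GlossaryEntry:
    theme: str   # key into THEMES
    number: int  # conceptual number within the theme
    terms: dict[str, list[Term]] = field(default_factory=dict)
    definitions: dict[str, list[Definition]] = field(default_factory=dict)
    related: list[str] = field(default_factory=list)

    def __post_init__(self):
        if self.theme not in THEMES:
            raise ValueError(f"unknown theme: {self.theme}")

    @property
    def entry_id(self) -> str:
        """Conceptual number combined with its theme (hypothetical format)."""
        return f"{self.theme}-{self.number}"

    def missing_languages(self) -> list[str]:
        """Languages for which no equivalent term has been recorded yet."""
        return [lang for lang in LANGUAGES if not self.terms.get(lang)]


entry = GlossaryEntry(
    theme="C2",
    number=1,  # hypothetical concept number
    terms={
        "en": [Term("afflux")],
        "de": [Term("Hochwasser durch Aufstau", grammar="n.")],
    },
    related=["backwater"],
)
```

Keeping definitions as a list per language allows several diverging definitions, each with its own source, and `missing_languages()` gives a quick coverage check while entries are being extended to all 8 languages.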

Research issues

• Experimental settings
• User studies, user modelling
• Data modelling
• Corpus analysis
• Multilingual – multi-domain – cross-cultural
• Knowledge dynamics - Dynamic knowledge representations
• Cognitive studies

Conclusions and Outlook I

• Online terminology database is continuously used
• 8-language Glossary Version produced in February 2011
• Next steps in 2011:
  – Work in progress!
  – Database to be extended from 5 to 8 languages
  – Full text corpora to be extended
  – Promotion of the glossary in different user communities
  – Term extraction, research
  – Extension into more languages
  – More scientific publications

Conclusions and outlook II

• Research perspectives
  – Further research in
    • Cognitive ontologies
    • User modelling, usability of terminological databases and LSP dictionaries
    • Corpus-linguistic research – semantic annotation modelling
    • Multilingual, multi-domain, cross-cultural issues

Selected References

• Budin, G. Socio-terminology and computational terminology – toward an integrated, corpus-based research approach. In: De Cilia, Rudolf et al. (eds.). Discourse, Politics, Identity. Tübingen: Stauffenburg Verlag, 2010, 21-31.
• Budin, G. Semantic Systems supporting Cross-Disciplinary Environmental Communication. In: Hryniewicz, O.; Studzinski, J.; Szediw, A. (eds.). Environmental Informatics and Systems Research. Vol. 2: Workshop and application papers. EnviroInfo 2007. Aachen 2007, 23-30.
• CEDIM, Center for Disaster Management and Risk Reduction Technology, c/o University of Karlsruhe (2005). Glossar: Begriffe und Definitionen aus den Risikowissenschaften.
• Gangemi DOLCE
• Greciano, G. (2001). L'harmonisation de la terminologie en Sciences du Risque. In Proceedings of Security Conference, Montpellier XII. Council of Europe-FER. Strasbourg, France.
• Greciano, G. (2001). Les sciences du risque: convergences interculturelles. In Proceedings of Risk Conference, Strasbourg X. Council of Europe-FER. Strasbourg, France.
• Greciano, G. (2001). Pour un glossaire combinatoire plurilingue du Risque. Proceedings of Risk-Conference, Mèze V. Council of Europe-FER. Strasbourg, France.
• Massué, J.P. (2001). "Mobilisation de la Communauté scientifique au service de l'amélioration de la gestion des risques". Mèze, FER-EUR-OPA. Strasbourg.
• Nistrup Madsen/Erdman Thomsen 2005, 2009

Acknowledgements

GLOSSAIRE MULTILINGUE DE LA GESTION DU RISQUE
French / German / English / Spanish / Romanian / Finnish / Hungarian / Russian

Edited by Gertrud Gréciano, Gerhard Budin, Danielle Candel, John Humbley, with the support of the Commission of the European Union, the Universities of Strasbourg, Vienna and Helsinki, the Région Alsace, the Délégation générale à la langue française et aux langues de France, and the Austrian Academy of Sciences.

Authors: Gertrud Gréciano (Strasbourg), Gerhard Budin (Vienna), Annely Rothkegel (Chemnitz), Ulrike Hass (Essen)

Translators: Cornelia Cujba (Iasi), Attila Frigyer (Budapest), Luis Gonzalez (Caracas-Paris), Csilla Höfler-Bornemisza (Vienna), Annikii Liimatainen (Helsinki), Alexei Milko (Strasbourg-Moscow)

Scientific and technical cooperation: Steffi Baumann (Chemnitz), Aban Budin (Vienna), Christian Burghard (Chemnitz), Dimitrij Dobrovolskij (Moscow-Vienna), Eva Haas (Munich-Ispra), Natalia Jonkova (Moscow), Andra Moga (Iasi-Vienna), Maren Runte (Essen), Julia Steuber (Essen), Virginie Tombeux (Paris), Elena Volgina (Moscow)

Thank you for your attention

Gerhard Budin

Center for Translation Studies
University of Vienna

Institute of Corpus Linguistics and Text Technology
Austrian Academy of Sciences

mgrm.univie.ac.at