2009 11 04 tekom TBX presented · Wolf-Dietrich von Loeffelholz...


Terminology management made easier: a TBX-compliant terminology repository for a translation agency

Dave Calvert, TransForm Gesellschaft für Sprachen- und Mediendienste mbH

Wolf-Dietrich von Loeffelholz

Who’s who

• TransForm GmbH
  • Established 1994
  • Specializes in corporate image and science and technology
  • EN 15038 certified
• Wolf-Dietrich von Loeffelholz
  • Freelance software development and maintenance

Problem

• Terminological data
  • In-house legacy format: MultiTerm 5.5
  • In-house legacy format: intranet database
  • Freelancers’ preferred format: Wordfast
  • Customers’ terminology
    • All imaginable formats and conditions

Result:
• Restricted interoperability
• Need to run concurrent, incompatible systems
• Pressure to upgrade to extremely expensive server-based solutions

Concept

• Application to store and maintain terminological data
• Future-proof data format
• Import and export file formats currently in use at TransForm
• Define other import/export formats without the need for substantial programming
• Access via existing intranet

Result: web-based terminology repository with a non-proprietary data format

TBX: the way forward

http://www.lisa.org/Term-Base-eXchange.32.0.html

Why TBX?

• Substantial advantages to the user
  • Open standard: effectively future-proof
  • Open standard: pressure on tool vendors to support the format
  • Clearly defined
  • XML, so relatively easy to work with
  • Available for use without licensing fees

TBX-Basic

• TBX for small and medium-sized language industry applications
• TBX is too powerful for most LSP applications
• TBX-Basic: lightweight version of TBX
  • Developed by the LISA Terminology Special Interest Group
  • Specifically aimed at small and medium-sized language industry applications
  • Fully compliant with TBX
  • Restricted subset of TBX features

Our answer

• Store terminological data as TBX
  • Ensures future compatibility
  • Use of a standard will boost the quality of data in the medium term
  • TBX capability ensures TBX-Basic capability
  • Future changes to the terminological markup language (TML) are possible within the constraints of TBX

Now a mapping problem

• Mapping legacy database terminological data formats to TBX
• TBX has a three-level concept structure
  • Concept
  • Language
  • Term
• Information on all levels is constrained in terms of what may and what must be stored
• Both explicit and implicit information must be handled

Implicit terminological information

• Glossary stored as:
  • M:\Customers\LN\Leistungselektronik_2
• contains entry
  • Regelkreis control loop en.wikipedia.org
• Source term, target term and the source of the target term are explicitly recorded.
• Customer and project must be derived from path and filename.
• Languages are implied.
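
A minimal Python sketch of how the implicit information in the example above could be recovered. It assumes the path layout is Customers\<customer>\<project>, that glossary fields are tab-separated in the file, and that the implied languages (here "de" and "en", an assumption) are supplied by the user at import time. The function names are hypothetical.

```python
from pathlib import PureWindowsPath


def implicit_info(glossary_path, source_lang, target_lang, root="Customers"):
    """Derive customer and project from a path like <root>/<customer>/<project>."""
    path = PureWindowsPath(glossary_path)
    parts = path.parts
    customer = parts[parts.index(root) + 1]  # folder directly below the root folder
    return {
        "customer": customer,
        "project": path.stem,        # file name without extension
        "source_lang": source_lang,  # implied in the data, so supplied at import time
        "target_lang": target_lang,
    }


def parse_entry(line):
    """Split one glossary line into its three explicitly recorded fields."""
    source_term, target_term, term_source = line.rstrip("\n").split("\t")
    return {
        "source_term": source_term,
        "target_term": target_term,
        "term_source": term_source,
    }


info = implicit_info(r"M:\Customers\LN\Leistungselektronik_2", "de", "en")
entry = parse_entry("Regelkreis\tcontrol loop\ten.wikipedia.org")
print(info)
print(entry)
```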

Handling implicit information

• Input templates must define implicit information to be captured
  • Wordfast requires more data to be entered at import time
  • Intranet database records permit lookup of information

TBX-Basic structure

• Three levels
  • Concept “termEntry”
    • Subject
    • Definition and its source
    • Cross-reference and/or image
  • Language “langSet”
    • Definition and its source
  • Term “tig”
    • Term notes, linguistic usage labels
    • Context and its source
    • Term source, administrative usage labels
  • Any level
    • Administrative / transactional information
    • Notes
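
The three levels can be illustrated with a short Python sketch that builds one entry using the standard library. The element names termEntry, langSet and tig come from the slide. The descrip type "subjectField" and termNote type "partOfSpeech" are assumptions based on TBX-Basic and should be checked against the specification. The subject value is purely illustrative.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this namespaced attribute as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_term_entry(entry_id, subject, terms):
    """Build a concept entry; terms maps a language code to (term, part_of_speech) pairs."""
    entry = ET.Element("termEntry", {"id": entry_id})             # concept level
    ET.SubElement(entry, "descrip", {"type": "subjectField"}).text = subject
    for lang, items in terms.items():
        lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})  # language level
        for term, part_of_speech in items:
            tig = ET.SubElement(lang_set, "tig")                   # term level
            ET.SubElement(tig, "term").text = term
            ET.SubElement(tig, "termNote", {"type": "partOfSpeech"}).text = part_of_speech
    return entry


entry = build_term_entry(
    "c1",
    "power electronics",  # illustrative subject, not taken from the slides
    {"de": [("Regelkreis", "noun")], "en": [("control loop", "noun")]},
)
print(ET.tostring(entry, encoding="unicode"))
```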

Compliance issues

• Structural or syntactic compliance
  • Check using a validation program, e.g. tbxcheck: http://sourceforge.net/projects/tbxutil/
• Content compliance
  • Can depend on purpose of data
    • Machine processing requires part of speech (TBX-Basic)
    • Human use does not if either a definition or a context is provided (TBX-Basic)
  • TransForm data was collected without consideration of these issues
  • Full compliance with TBX-Basic only possible for new data
  • Careful use of implicit information will help to mitigate these issues
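
The content rule above could be expressed as a small Python check. This is a sketch only: the dictionary keys and the purpose values "machine" and "human" are hypothetical, and the rule is limited to what the slide states.

```python
def content_compliant(term, purpose):
    """Return True if a term record meets the content rule for the given purpose.

    purpose is "machine" or "human" (hypothetical labels).
    """
    if term.get("part_of_speech"):
        return True  # part of speech satisfies both purposes
    if purpose == "human":
        # Human use: a definition or a context can stand in for part of speech.
        return bool(term.get("definition") or term.get("context"))
    return False  # machine processing requires part of speech


legacy_term = {"term": "Regelkreis", "context": "Der Regelkreis wird geschlossen."}
print(content_compliant(legacy_term, "human"))
print(content_compliant(legacy_term, "machine"))
```

A check of this kind shows why legacy data collected without part of speech can still be usable for human readers when a context was recorded.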

What we intend to do with it

• Import existing terminological data from:
  • MultiTerm 5.5 databases
  • Wordfast glossaries
  • Intranet-based system
  • Customers
• Maintain existing data
• Replace existing terminology collection back end
  • Terminology captured direct to TBX format
• Export project-specific and customer-specific terminology in the form of:
  • Dictionaries
  • Glossaries
  • Databases

MultiTerm data format

• Last file-based version of MultiTerm
• Flexible concept-oriented system
  • Index fields: defined as languages and contain terms
  • System fields
  • Attribute and text fields
• Order, number and relationships of attribute and text fields are not constrained

MultiTerm data format

MultiTerm–TBX-Basic mapping

Wordfast glossaries

• Tab-delimited text glossaries
• Simple
• Open
• User-definable fields
• Fine for the translator

Wordfast glossaries: TransForm

• Source term
• Target term
• Note
• Term source
• Context sentence
• Context source
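
One possible way to map the six fields above onto TBX elements is sketched below in Python. The mapping itself is an assumption, not the one used at TransForm: term source as admin type "source", context as descrip type "context" inside a descripGrp, and note as a note element. The sketch does not guarantee that the element order validates against the TBX DTD.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
FIELDS = ["source_term", "target_term", "note",
          "term_source", "context", "context_source"]


def row_to_term_entry(line, source_lang, target_lang, entry_id):
    """Convert one tab-delimited glossary row into a termEntry element."""
    values = line.rstrip("\n").split("\t")
    values += [""] * (len(FIELDS) - len(values))  # pad trailing fields that are missing
    row = dict(zip(FIELDS, values))

    entry = ET.Element("termEntry", {"id": entry_id})
    source_set = ET.SubElement(entry, "langSet", {XML_LANG: source_lang})
    ET.SubElement(ET.SubElement(source_set, "tig"), "term").text = row["source_term"]

    target_set = ET.SubElement(entry, "langSet", {XML_LANG: target_lang})
    tig = ET.SubElement(target_set, "tig")
    ET.SubElement(tig, "term").text = row["target_term"]
    if row["term_source"]:
        ET.SubElement(tig, "admin", {"type": "source"}).text = row["term_source"]
    if row["context"]:
        group = ET.SubElement(tig, "descripGrp")
        ET.SubElement(group, "descrip", {"type": "context"}).text = row["context"]
        if row["context_source"]:
            ET.SubElement(group, "admin", {"type": "source"}).text = row["context_source"]
    if row["note"]:
        ET.SubElement(tig, "note").text = row["note"]
    return entry


wf_entry = row_to_term_entry(
    "Regelkreis\tcontrol loop\t\ten.wikipedia.org", "de", "en", "c1")
print(ET.tostring(wf_entry, encoding="unicode"))
```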

Wordfast glossary–TBX mapping

Intranet terminology capture

• Term entry screen
  • Simple term entry structure
  • To be expanded by the addition of a context and its source

Intranet terminology mapping

How it works: customers’ data

Excel data → convert to tab-delimited text → tab-delimited text (Wordfast style) → import in similar way to Wordfast glossary → TBX
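
The conversion step could look like the following Python sketch. It assumes the customer's spreadsheet has first been saved as CSV, because reading Excel files directly would need a third-party library. The function name is hypothetical.

```python
import csv
import io


def csv_to_tab_delimited(csv_text):
    """Rewrite CSV text as tab-delimited text, dropping rows that are entirely empty."""
    output = io.StringIO()
    writer = csv.writer(output, delimiter="\t", lineterminator="\n")
    for row in csv.reader(io.StringIO(csv_text)):
        cells = [cell.strip() for cell in row]
        if any(cells):  # skip rows in which every cell is blank
            writer.writerow(cells)
    return output.getvalue()


tab_text = csv_to_tab_delimited("Regelkreis,control loop,en.wikipedia.org\n,,\n")
print(tab_text)
```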

Implementation

• Wolf-Dietrich von Loeffelholz
• Freelance software development and maintenance

Structure

• System
• Converter
• Import
• Administrative tools
• Export
• Features: little helpers

System

• Web server with PHP 5 and Java
• Database to store metadata for search and management purposes
• Flex application to help with management of terminological data
• TBX storage on the file system with back matter
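
The split between a metadata database and TBX files on the file system can be sketched as follows. This sketch uses Python's sqlite3 module in place of the PHP database layer named in the slides, and the table layout and file path are hypothetical.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a small metadata index that points at TBX files on disk."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS term_index (
               term TEXT NOT NULL,
               lang TEXT NOT NULL,
               concept_id TEXT NOT NULL,
               tbx_file TEXT NOT NULL
           )"""
    )
    return conn


def index_term(conn, term, lang, concept_id, tbx_file):
    """Record where a term is stored; the TBX file itself stays on the file system."""
    conn.execute("INSERT INTO term_index VALUES (?, ?, ?, ?)",
                 (term, lang, concept_id, tbx_file))


def find_files(conn, term):
    """Return the TBX files that contain the given term."""
    rows = conn.execute(
        "SELECT DISTINCT tbx_file FROM term_index WHERE term = ? ORDER BY tbx_file",
        (term,))
    return [row[0] for row in rows]


conn = open_index()
index_term(conn, "Regelkreis", "de", "c1", "LN/Leistungselektronik_2.tbx")
index_term(conn, "control loop", "en", "c1", "LN/Leistungselektronik_2.tbx")
print(find_files(conn, "Regelkreis"))
```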

System: PHP 5

• PDO as abstract database layer
  • MySQL, Oracle, MS SQL, PostgreSQL, etc.
• DOM as document object model
  • Handles any XML processing that is needed
• PEAR HTML template engine

Converter

• The import function needs XML data as input
• Conversion to UTF-8 to conform to XML requirements
• MultiTerm glossary (5.5 and earlier)
  • Tagged format
• Wordfast glossary or any tab-delimited format
  • Definition of language combination and attribute fields
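
The UTF-8 conversion step might be sketched in Python as follows. The fallback encoding cp1252 is an assumption about the legacy files, and the function name is hypothetical. The sketch also removes control characters below U+0020 other than tab, line feed and carriage return, since XML 1.0 does not allow them.

```python
def to_utf8(raw, fallback_encoding="cp1252"):
    """Return the input bytes re-encoded as UTF-8 and safe to embed in XML 1.0."""
    try:
        text = raw.decode("utf-8-sig")  # already UTF-8; also drops a leading BOM
    except UnicodeDecodeError:
        # Assumed legacy encoding; undecodable bytes become U+FFFD.
        text = raw.decode(fallback_encoding, errors="replace")
    # Remove C0 control characters that XML 1.0 forbids.
    text = "".join(ch for ch in text if ch in "\t\n\r" or ord(ch) >= 0x20)
    return text.encode("utf-8")


legacy_bytes = "Regelgröße".encode("cp1252")
print(to_utf8(legacy_bytes).decode("utf-8"))
```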

Import function

• Import of XML data of unknown format
• Mapping filters
  • XML import filter
  • TBX mapping filter
  • TBX template

Import function: categories

• Definable based on XML import filter
• Required information
  • Term and language
• Expected information
  • Admin, user and date information
• Optional information
• Error logging
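
The categories above could drive a check like the following Python sketch. The field names and the choice of log levels are assumptions: missing required information is logged as an error and rejects the record, while missing expected information is only logged as a warning.

```python
import logging

logger = logging.getLogger("tbx_import")  # hypothetical logger name

REQUIRED = ("term", "language")
EXPECTED = ("admin", "user", "date")


def check_record(record):
    """Log missing fields by category; return True if the record can be imported."""
    missing_required = [name for name in REQUIRED if not record.get(name)]
    missing_expected = [name for name in EXPECTED if not record.get(name)]
    for name in missing_required:
        logger.error("required field %r missing in %r", name, record)
    for name in missing_expected:
        logger.warning("expected field %r missing in %r", name, record)
    return not missing_required


print(check_record({"term": "Regelkreis", "language": "de"}))
print(check_record({"term": "Regelkreis"}))
```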

Import function: import

• Import filter
• Import file
• Concept grouping
• TBX header information
• TBX back information
• Fully automated import
• Step-by-step with user validation

Administrative tools

• Grouping of terms
  • Fixed grouping during import: the so-called term ID
  • User-defined grouping into concepts
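
Both kinds of grouping can be sketched in Python. The row layout (term_id, lang, term) and the function names are hypothetical, and the sample terms are illustrative only.

```python
from collections import defaultdict


def group_by_term_id(rows):
    """Fixed grouping at import: rows sharing a term ID form one concept."""
    concepts = defaultdict(lambda: defaultdict(list))
    for row in rows:
        concepts[row["term_id"]][row["lang"]].append(row["term"])
    return {term_id: dict(langs) for term_id, langs in concepts.items()}


def merge_concepts(concepts, keep_id, merge_id):
    """User-defined grouping: fold one concept into another, in place."""
    merged = concepts.pop(merge_id)
    for lang, terms in merged.items():
        concepts[keep_id].setdefault(lang, []).extend(terms)
    return concepts


rows = [
    {"term_id": "1", "lang": "de", "term": "Regelkreis"},
    {"term_id": "1", "lang": "en", "term": "control loop"},
    {"term_id": "2", "lang": "en", "term": "feedback loop"},
]
concepts = group_by_term_id(rows)
merge_concepts(concepts, "1", "2")
print(concepts)
```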

Administrative tools

• RIA (rich internet application) to help with management of concepts and terms
• Access to stored terminological data using search masks
• One-click copying of concepts and all associated terminology into an export group

Export function

• Export of export groups
  • TBX
  • TMX
  • Any mapped XML format
  • MultiTerm 5.5 tagged
  • Wordfast
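
As an illustration of the simplest target, the Python sketch below writes an export group as tab-delimited text in the Wordfast style. The in-memory representation of a concept (a dictionary from language code to a list of terms) is an assumption, and only source and target terms are exported.

```python
def export_tab_delimited(export_group, source_lang, target_lang):
    """Write one line per source/target term pair for every concept in the group."""
    lines = []
    for concept in export_group:
        for source_term in concept.get(source_lang, []):
            for target_term in concept.get(target_lang, []):
                lines.append(f"{source_term}\t{target_term}")
    return "\n".join(lines) + ("\n" if lines else "")


export_group = [{"de": ["Regelkreis"], "en": ["control loop", "feedback loop"]}]
glossary_text = export_tab_delimited(export_group, "de", "en")
print(glossary_text)
```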

Features

• LDAP authentication
• DAV
  • Quick import of TBX files
  • Quick access to TBX files
• RSS feed
  • Subscribe and observe insert, update and delete on
    • User level
    • Concept, term and/or comment level

Current status

• Working title: TBX-Transform
• Self-certified HTTPS site
• In beta release
• Full integration into intranet system in progress
• External beta testers by year-end

Immediate objectives

• Fitness for productive use
  • Including work on user interface
• Complete integration into intranet system
• Start migrating existing data

Future strategies

• Import existing data, convert it and store it as TBX
  • Step-by-step validation where necessary
  • Consolidation where appropriate
  • Future-proofing
• Additional export forms for glossaries and dictionaries