Terminology Extraction tekom 2008 - tekom e.V


Terminology Extraction

tekom 2008

Methods and Tools

Angelika Zerfass

[email protected]

Agenda

What is terminology extraction?
What are the complications and difficulties associated with terminology extraction?
Comparison of different extraction strategies:
  Manual
  Concordance
  Statistical
  Linguistic
File formats for extraction and export


Terminology work

After translation memory systems and sentence-level recycling, terminology is now coming into focus as a source of possible savings.
Source language text needs to become even more consistent
  More accuracy in translation
  Faster translation

But … terminology work is WORK
  never-ending (new technologies, more languages)
  ever-changing (mergers, marketing innovations)
  time-consuming / resource-intensive

Automation???


Monolingual Extraction

Extraction of terms from documents in one language.

Creation of term lists…

Important terms
  Who defines what is important?
  How can a tool “know” what is important?

Frequent terms
  What is frequent? 3 times / 10 times…
  Are frequent terms also important?

New terms
  According to whose level of subject matter knowledge?
  Compared to which term list / term database?


Bilingual Extraction

Term extraction from bilingual sources like translation memory files or bilingual translation files

Creation of parallel lists of terms and their translation(s)

All forms of the term and all its translations
Only basic form
Most frequent translation of source term
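The last option, keeping only the most frequent translation of each source term, could look roughly like this. It is a minimal sketch that assumes the term pairs have already been identified in the bilingual source; the function name and the sample pairs are hypothetical, and it says nothing about how a real tool finds the pairs in the first place.

```python
from collections import Counter, defaultdict


def most_frequent_translations(term_pairs):
    """Map each source term to its most frequent translation.

    term_pairs: iterable of (source_term, target_term) tuples, assumed to
    have been collected already (e.g. from translation memory segments).
    """
    counts = defaultdict(Counter)
    for source, target in term_pairs:
        counts[source][target] += 1
    # most_common(1) returns a one-element list of (translation, count)
    return {source: c.most_common(1)[0][0] for source, c in counts.items()}


# Hypothetical sample data for illustration only
pairs = [
    ("printer", "Drucker"),
    ("printer", "Drucker"),
    ("printer", "Printer"),
    ("cable", "Kabel"),
]
result = most_frequent_translations(pairs)
```

Here "printer" maps to "Drucker" because that translation occurs twice and "Printer" only once; the less frequent variant is dropped from the parallel list.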


This is how a text looks to a statistical extraction tool…

Vot gnig harengoga fuor tok gnig nor shewerginhatz. Mirhon bortup tip trewshu gnig batbo loqtet. Bortup ter, bortup nofdas, semsel nih furpo ayano bliktreptat. Mirhon granbevtrov driktopret grig go wasbrekit mut mirkep taptro gnig suf. Aktrep zitpek nitnit bortup mil. Setrimb ak troptan bur metlatkento.


Term extraction issues

Terminology extraction is a highly individual process
  Goal of extraction, subject matter expertise, available time
Tools use different methods for terminology extraction
  Concordance, statistics, linguistics
Tools support different file formats for extraction and export
  Monolingual, bilingual, export formats
Tools sometimes don’t show the context from which the term was taken


Term Extraction Tools

Assistance for manual extraction
Concordance tools
  Extraction of all term combinations
Statistical extraction tools
  Frequent terms
  All languages
Linguistic extraction tools
  Extraction of noun phrases…
  Supported languages only


Manual Extraction

A human reads the text, understands the meaning and selects terms (or term pairs) according to previous knowledge of the subject matter and/or the goal of the extraction.

List of standard terms
List of company terms
List of new terms
Additional information like source, context example…


Tools assisting manual extraction

Tools that connect to an editor and allow the collection of terms or term pairs

Translation memory tools that save terms and term pairs directly into the term database component

Term checking tools that report missing terms / translations


Manual extraction

Time-consuming
Resource-intensive
Subject matter and language expertise required
Most accurate regarding the goal
Individual goals can be set


Concordance Tools

Automatic creation of a list of all terms and term combinations from a document
No term is missed
Long list of terms
Manual selection process necessary
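To illustrate why the resulting list gets long, here is a minimal sketch of listing every word and every combination of adjacent words. The function name, the regular-expression tokenization and the sample text are assumptions made for illustration; actual concordance tools may tokenize and combine words differently.

```python
import re


def all_word_combinations(text, max_words=2):
    """List every word and every run of up to max_words adjacent words."""
    combos = set()
    # Split into sentences first so combinations never cross a sentence end
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"\w+", sentence.lower())
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                combos.add(" ".join(words[i:i + n]))
    return sorted(combos)


# Hypothetical sample text for illustration only
sample = "Replace the toner cartridge. Close the cover."
candidates = all_word_combinations(sample, max_words=2)
```

Even these two short sentences yield 11 entries, including combinations such as "replace the" that are not terms, which is why the manual selection step cannot be skipped.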


Concordance results


TM tool

Term can consist of up to X words
Terms that already exist in the database are not extracted
Extraction from all files of a project (various file formats)
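The first two settings can be pictured as a simple filter over a candidate list. This is only a sketch under stated assumptions: the function name, the case-insensitive comparison and the sample data are hypothetical and do not describe how any particular TM tool implements the check.

```python
def filter_candidates(candidates, termbase, max_words=3):
    """Drop candidates that are too long or already in the term database."""
    known = {term.lower() for term in termbase}
    return [
        c for c in candidates
        if len(c.split()) <= max_words and c.lower() not in known
    ]


# Hypothetical data for illustration only
existing_terms = ["toner cartridge", "cover"]
found = ["Toner cartridge", "paper tray", "cover", "rear paper feed unit door"]
new_terms = filter_candidates(found, existing_terms, max_words=3)
```

Only "paper tray" remains: two candidates are already in the database and one exceeds the three-word limit.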


Extraction with TM tool

Export of term list for translation
Export of term list to term database


Statistical Extraction Tool

Monolingual and bilingual extraction
Terms that occur more than X times are extracted
List of frequent terms: frequent terms are seen as important
Important terms / new terms that appear in this document fewer than X times are not extracted
Can be used for any language
List of term candidates must be checked by a human with subject matter and language expertise
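The frequency principle can be sketched in a few lines. This is a simplified illustration, not the method of any specific product: the function name, the "at least min_count" threshold, the stop word handling and the sample text are all assumptions.

```python
import re
from collections import Counter


def frequent_terms(text, min_count=2, max_words=2, stop_words=()):
    """Return {candidate: count} for candidates seen at least min_count times."""
    stop = {w.lower() for w in stop_words}
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"\w+", sentence.lower())
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                gram = words[i:i + n]
                # Skip candidates that start or end with a stop word
                if gram[0] in stop or gram[-1] in stop:
                    continue
                counts[" ".join(gram)] += 1
    return {term: c for term, c in counts.items() if c >= min_count}


# Hypothetical sample text for illustration only
sample = (
    "Open the paper tray. Load paper into the paper tray. "
    "Do not touch the fuser unit."
)
terms = frequent_terms(sample, min_count=2, stop_words=["the", "into", "do", "not"])
```

The result contains "paper", "tray" and "paper tray", but not "fuser unit": it occurs only once, so it falls below the threshold even though it may be the most important term in the text. Because the code only counts strings, it works the same way for any language, and the candidates still need human review.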


Terminology Tool of TM Suite

Settings for number of words per term
Settings for frequency


Bilingual Extraction Results


Example: Japanese


Linguistic Extraction Tool

Tool knows about the structure of the language
Extracted terms can be reduced to their basic form with the help of dictionaries and rules
User can define the rules used for extraction
Extraction limited to supported languages
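Reduction to the basic form with "dictionaries and rules" can be pictured as a dictionary lookup followed by suffix rules. The sketch below is a toy for English only; the function name, the tiny dictionary and the two rules are hypothetical and far simpler than the resources a linguistic extraction tool would ship with.

```python
def base_form(word, dictionary, suffix_rules):
    """Reduce a word to its base form: dictionary first, then suffix rules."""
    lower = word.lower()
    if lower in dictionary:
        return dictionary[lower]
    for suffix, replacement in suffix_rules:
        # Require a few characters before the suffix to avoid mangling short words
        if lower.endswith(suffix) and len(lower) > len(suffix) + 2:
            return lower[:-len(suffix)] + replacement
    return lower


# Toy English resources, hypothetical and far from complete
irregular = {"mice": "mouse", "indices": "index"}
rules = [("ies", "y"), ("s", "")]  # tried in order, first match wins

forms = [
    base_form(w, irregular, rules)
    for w in ["Batteries", "cartridges", "mice", "printer"]
]
```

This gives "battery", "cartridge", "mouse" and "printer". The rules are language-specific (they would already fail on an English word like "access"), which illustrates why such tools are limited to the languages they support.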


Linguistic Settings

Extraction according to specific rules of the language
Frequency settings


Results of Extraction with Context Window


Bilingual Extraction Results SDL PhraseFinder

Translations of terms come from the extraction files and internal dictionaries
Each term is shown with its context and a grammatical analysis
Results of extraction
  List of one-word terms
  List of multi-word terms
  List of context sentences

Export and view can be filtered


Bilingual Extraction Results


File Formats

TM tools extract from every file format they support
Concordance tools are usually limited to text or Word RTF files, maybe also HTML
Bilingual extraction can be produced from bilingual file formats like translation memories, project files of a TM tool or bilingual translation files, but not from two separate files
Export usually in Excel, tab-delimited TXT or directly into the terminology component
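A tab-delimited export of a bilingual term list might look like the following sketch. The function name and the two-column layout with a "source" / "target" header row are assumptions; the column layout a given terminology component expects on import may differ.

```python
import csv
import io


def export_term_list(term_pairs, out):
    """Write (source, target) term pairs as tab-delimited text to out."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["source", "target"])  # header row, an assumed layout
    writer.writerows(term_pairs)


# In-memory buffer for demonstration; for a real file, open it with
# open(path, "w", newline="", encoding="utf-8") and pass the file object
buffer = io.StringIO()
export_term_list([("printer", "Drucker"), ("cable", "Kabel")], buffer)
exported = buffer.getvalue()
```

A file written this way can be opened in Excel or imported into a term database that accepts tab-delimited text.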


Conclusion

No one tool can do what a human can do, but depending on the goal, the tools can help to automate repetitive tasks and comparisons with stop word lists and/or term bases

Concordance tools extract all words and provide filter and search settings for the view of the term list
Statistical tools offer settings for frequencies, term length and comparison with stop word lists or existing term lists / term databases
Linguistic tools can be customized by rules for the extraction, which could be different for various languages and use language-specific dictionaries

Thank you very much for your attention


Some Terminology Extraction Tools

Concordance tools
  Simple Concordance Program (SCP), http://www.textworld.com/scp/
  ExtPhr32, http://publish.uwo.ca/~craven/freeware.htm

Term extraction tools / components of translation memory tools

Statistical Extraction
  MultiTerm Extract, Déjà Vu Lexicon, Heartsome Dictionary Editor, across, TermiDOG (www.dog-gmbh.de), Chamblon Terminology Extractor (http://www.chamblon.com/terminologyextractor.htm)…

Linguistic Extraction
  Synthema Terminology Wizard (http://www.synthema.it/english/servizi/traduzioni.html), SDL PhraseFinder…
