Open-source machine translation for Icelandic: the Apertium platform as an opportunity

Transcript of Open-source machine translation for Icelandic: the Apertium platform as an opportunity

Page 1: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Concepts | Opportunities from open-source MT systems | Challenges | The Apertium platform | Apertium for Icelandic? | Concluding remarks

Open-source machine translation for Icelandic: the Apertium platform as an opportunity

Mikel L. Forcada (1,2)

(1) Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, E-03071 Alacant (Spain)

(2) Prompsit Language Engineering, S.L., E-03690 St. Vicent del Raspeig (Spain)

April 18, 2008: Icelandic Language Technology Conference

Mikel L. Forcada Open-source MT for Icelandic

Page 2: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Contents

1 Concepts

2 Opportunities from open-source MT systems

3 Challenges

4 The Apertium platform

5 Apertium for Icelandic?

6 Concluding remarks

Page 3: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Open-source and free software

Open-source software is also called free software:

0 anyone can use it for any purpose
1 anyone can examine it to see how it works and modify it for any new purpose
2 anyone can freely distribute it
3 anyone may release an improved version so that everyone benefits

For conditions 1 and 3 to be met, anyone should be able to access the source code, hence the name open source.

Page 4: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Machine translation software/1

MT is special: it strongly depends on data

- rule-based MT (RBMT): dictionaries, rules
- corpus-based MT (CBMT): sentence-aligned parallel text, monolingual corpora

Three components in every MT system:

- The engine (also decoder, recombinator...)
- Data (linguistic data, corpora)
- Tools to maintain these data and convert them to the format used by the engine

Page 5: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Machine translation software/2

I will focus on RBMT. Reasons:

- CBMT requires massive amounts of sentence-aligned parallel text (is there such a resource for Icelandic?).
- RBMT may use linguistic data elicited by speakers without access to existing machine-readable resources.
- RBMT is more transparent: errors are easier to diagnose and debug.
- I am more familiar with RBMT!

Page 6: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

MT software/3: commercial machine translation

- Most commercial MT systems are RBMT (but: LanguageWeaver, Google Labs).
- They use proprietary technologies which are not disclosed (perceived as their main competitive advantage).
- Only partial modification (customization) of linguistic data is allowed.

Page 7: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

MT software/4: open-source machine translation

- For MT to be open-source, the engine, the data and the tools must all be open-source.
- In the case of CBMT this means that corpora must also be open.

Page 8: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Commercial MT systems and small languages: limited opportunities

- The main MT companies target major world languages. Not Icelandic...
- Some closed-source systems:
  - TranExp’s InterTran offers en↔is “interactive translation” (with limited lexical coverage): test at http://www.translation-guide.com/free_online_translators.php?from=Icelandic&to=English
  - Stefán Briem’s prototypes for is↔en or is↔da may be tested at tungutorg.is.
  - A company named ESTeam (www.esteam.gr) is also listed as offering MT for Icelandic.
- It is very hard to adapt closed, commercial MT systems to small languages.

Page 9: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Opportunities from open-source MT systems

Even if reasonable-quality closed-source MT is available, the development and use of open-source MT systems provides additional opportunities:

- Increases language expertise and resources
- Increases technological independence

Page 10: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Increasing expertise and language resources

- When building an open-source MT system for a small language, a variety of situations may occur.
- All of them involve building small-language expertise and resources through
  - reflection about the small language
  - elicitation of linguistic (monolingual and bilingual) knowledge about it
  - subsequent encoding of this knowledge
- The open-source setting makes new expertise and resources naturally available to the community.
- Three scenarios may occur:

Page 11: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Building data for an existing MT engine from scratch

- One needs:
  - A freely available (open-source or not) MT engine
  - Freely available (open-source or not) tools to manage linguistic data
  - Complete documentation on how to build linguistic data for use with the engine and tools
- This is a very unfavourable setting. Many decisions have to be made, e.g., defining the set of lexical categories and inflection indicators.
- The blank sheet syndrome may paralyze the project.
- If overcome, the expertise acquired and the resulting open-source data could be improved or used for other purposes: positive effect on the small language.

Page 12: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Building data for an existing MT engine from existing language-pair data

If free tools and engine and open-source data are available for another pair with a similar or related language, the blank sheet syndrome is drastically reduced. One could, for example:

- use the same set of lexical categories and inflection indicators
- build inflection paradigms on top of existing ones
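
A minimal sketch of what building on existing paradigms could look like: a new word is added by naming its stem and an existing paradigm instead of listing every form. The paradigm name `hest/ur`, the tag labels and the `expand` helper are illustrative assumptions, not the data format of any particular engine.

```python
# Toy inflection paradigm: suffix -> inflection tag (labels are hypothetical).
PARADIGMS = {
    "hest/ur": {"ur": "nom.sg", "": "acc.sg", "i": "dat.sg", "s": "gen.sg"},
}

def expand(stem, paradigm_name):
    """Return {surface form: inflection tag} for a stem under a paradigm."""
    paradigm = PARADIGMS[paradigm_name]
    return {stem + suffix: tag for suffix, tag in paradigm.items()}

# Adding a new noun needs only its stem and the name of an existing paradigm.
forms = expand("fisk", "hest/ur")
```

Here the singular forms of "fiskur" are obtained from the paradigm already written for "hestur"; only the stem is new.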

Page 13: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Adapting a new open-source engine or tools for a new language pair

- If source code is available for the engine and tools, experts could enhance or adapt them to address new features of the small language not dealt with adequately by the current code:
  - character sets
  - structural transfer not powerful enough, etc.
- More challenging than building new data
- But programmers do not need to have full command of the small language (abstract management of linguistic issues possible).
- Code rewriting would add expertise and resources to the language community.

Page 14: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Increasing technological independence

- Having an open-source engine, tools and data makes users of the small language less dependent on a single commercial, closed-source provider.
- This has an analogous effect, not only on machine translation, but also on other language technologies.

Page 15: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Organizing community development/1

- Assume we are just developing linguistic data.
- Open-source makes it possible for a small-language community to collaboratively develop machine translation for it.
- Some small languages have people with good linguistic and translation skills (this is the case of Icelandic).
- However, having people with language and translation skills is necessary but not sufficient.

Page 16: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Organizing community development/2

Some structure is necessary. Ideally:

- A coordinating team mastering the engine and tools used is needed to lead the effort, including:
  - code coordinators (installation, maintenance, modifications to the code)
  - linguistic coordinators (linguistic data maintenance)
- A project web server
  - to distribute the latest version of the system
  - to execute it online
  - for developers to contribute new linguistic data or code
- A group of skilled developers, certified in some sense by the coordinating team.

Page 17: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Eliciting linguistic knowledge

- Existing linguistic knowledge has to be made explicit (elicited) to contribute it to the system.
- Elicitation of lexical knowledge is possible through well-designed web form interfaces:
  - to provide the lemmas of the source and target word
  - to select the inflection paradigm of the source and target word
  - to establish the scope of the equivalence (bidirectional, left-to-right, right-to-left).
- Elicitation of other knowledge (e.g., structural transfer rules) is harder (a subject of research indeed).
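
A minimal sketch of the record such a web form might collect, assuming one record per bilingual equivalence. The class name `ElicitedEntry`, the `validate` helper and the paradigm names are hypothetical and only mirror the three items listed above (lemmas, paradigms, scope).

```python
from dataclasses import dataclass
from enum import Enum

class Scope(Enum):
    BIDIRECTIONAL = "bidirectional"
    LEFT_TO_RIGHT = "left-to-right"
    RIGHT_TO_LEFT = "right-to-left"

@dataclass(frozen=True)
class ElicitedEntry:
    source_lemma: str
    source_paradigm: str
    target_lemma: str
    target_paradigm: str
    scope: Scope = Scope.BIDIRECTIONAL

def validate(entry, source_paradigms, target_paradigms):
    """Return a list of problems found in a submitted form (empty if none)."""
    problems = []
    if not entry.source_lemma.strip():
        problems.append("missing source lemma")
    if not entry.target_lemma.strip():
        problems.append("missing target lemma")
    if entry.source_paradigm not in source_paradigms:
        problems.append("unknown source paradigm: " + entry.source_paradigm)
    if entry.target_paradigm not in target_paradigms:
        problems.append("unknown target paradigm: " + entry.target_paradigm)
    return problems

# Hypothetical submission: Icelandic "hestur" = English "horse".
entry = ElicitedEntry("hestur", "hest/ur", "horse", "regular-noun")
```

Restricting the paradigm fields to a known set is what lets contributors with basic grammar skills add entries without touching the data files directly.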

Page 18: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Simplicity of linguistic knowledge needed

To encourage and ease collaborative development, the level of linguistic knowledge necessary to start building a new MT system should be kept to a minimum (basic high-school grammar skills and concepts).

- This is rather easy in shallow-transfer MT systems.
- But it is very difficult (if not impossible) for deep-transfer systems.

Well-written documentation may be very helpful. Having someone available online to ask questions to is even better.

Page 19: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Standardization and documentation of linguistic data formats

- An adequate documentation of the format of linguistic data is crucial.
- The way: using XML. Why?
  - Each data item is explicitly labeled with a descriptive, named tag with a clear meaning attached
  - The structure of documents may easily be validated against DTDs or schemas
  - Many technologies exist for XML (converting from and to XML, interoperability).
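
A minimal sketch of why explicit labelling helps: a generic XML parser can read a dictionary entry without format-specific code. The element and attribute names below are made up for illustration and are not the actual Apertium dictionary format. Python's standard-library parser, used here, only checks well-formedness; validating against a DTD or schema would need a validating parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical bilingual dictionary fragment (tag names are assumptions).
ENTRY_XML = """
<dictionary>
  <e scope="bidirectional">
    <l lemma="hestur" paradigm="hest/ur"/>
    <r lemma="horse" paradigm="regular-noun"/>
  </e>
</dictionary>
"""

def read_entries(xml_text):
    """Return (source lemma, target lemma, scope) for each entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for e in root.findall("e"):
        left, right = e.find("l"), e.find("r")
        entries.append((left.get("lemma"), right.get("lemma"), e.get("scope")))
    return entries
```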

Page 20: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Modularity

- The emphasis of open-source is the reusability of code and linguistic data to build new MT systems or other language-technology applications.
- For that objective modularity is a must.
- A modular engine induces modularity in its data.
- For example, having an independent morphological analyser and an independent morphological dictionary
  - Makes it easier to build an MT system for a different target language
  - May be used to build an intelligent search engine (inflection-independent search)
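
A minimal sketch of inflection-independent search, assuming the morphological analyser can be reduced to a lookup from surface form to lemma. The `LEMMAS` table and the function names are illustrative assumptions; a real analyser would return full lexical forms and handle ambiguity.

```python
# Toy analyser output: inflected forms of Icelandic "hestur" (horse).
LEMMAS = {
    "hestur": "hestur", "hest": "hestur", "hesti": "hestur", "hests": "hestur",
    "hestar": "hestur", "hesta": "hestur", "hestum": "hestur",
}

def lemma_of(word):
    """Map a surface form to its lemma; unknown words stand for themselves."""
    word = word.lower()
    return LEMMAS.get(word, word)

def search(query, documents):
    """Return the documents containing any inflected form of the query word."""
    wanted = lemma_of(query)
    return [doc for doc in documents
            if wanted in {lemma_of(w) for w in doc.split()}]

DOCS = ["Ég sé hest", "Hestar hlaupa", "Kötturinn sefur"]
hits = search("hestur", DOCS)
```

The query is the dictionary form, yet documents containing the accusative singular and the nominative plural are both found.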

Page 21: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Background

Apertium is based on the technologies developed by the Transducens group at the Universitat d’Alacant during the development of two existing systems:

- interNOSTRUM (interNOSTRUM.com, Spanish–Catalan)
- Tradutor Universia (tradutor.universia.net, Spanish–Portuguese)

These technologies, initially designed for related-language pairs, have been extended to handle language pairs which are not so related.

Page 22: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Rationale /1

To generate translations which are

- reasonably intelligible and
- easy to correct

between related languages such as Spanish (es) and Catalan (ca) or Portuguese (pt), etc., or Nynorsk (nn), Bokmål (no) and Icelandic (is), one can just augment word-for-word translation with

- robust lexical processing (including multi-word units)
- lexical categorial disambiguation (part-of-speech tagging)
- local structural processing based on simple and well-formulated rules for frequent structural transformations (reordering, agreement)

Page 23: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Rationale /2

For harder, not so related, language pairs:

- It should be possible to build on that simple model.
- It should be possible to generalize its concepts so that complexity is kept as low as possible.

Page 24: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Rationale /3

- It should be possible to generate the whole system from linguistic data (monolingual and bilingual dictionaries, grammar rules) specified in a declarative way.
- This information should be provided in an interoperable format ⇒ XML. These are the different types of data:
  - (language-independent) rules to treat text formats
  - specification of the part-of-speech tagger
  - morphological and bilingual dictionaries and dictionaries of orthographical transformation rules
  - structural transfer rules

Page 25: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Rationale /4

- It should be possible to have a single generic (language-independent) engine reading language-pair data (“separation of algorithms and data”).
- Language-pair data should be preprocessed so that the system is fast (>10,000 words per second) and compact; for example, lexical transformations are performed by minimized finite-state transducers (FSTs).

Page 26: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Rationale /5

Reasons for the open-source development of Apertium:

- To give everyone free, unlimited access to the best possible machine-translation technologies.
- To establish a modular, documented, open platform for shallow-transfer machine translation and other human language processing tasks.
- To favour the interchange and reuse of existing linguistic data.
- To make integration with other open-source technologies easier.

Page 27: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Rationale /6

More reasons for open-source development of Apertium:

- To benefit from collaborative development
  - of the machine translation engine
  - of language-pair data for currently existing or new language pairs
  from industries, academia and small-language support organizations.
- To help shift MT business from the obsolescent licence-centered model to a service-centered model.
- To radically guarantee the reproducibility of machine translation and natural language processing research.
- Because it does not make sense to use public funds to develop non-free, closed-source software.

Page 28: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

The Apertium platform

Apertium is an open-source machine translation platform (http://www.apertium.org) providing:

1 An open-source modular shallow-transfer machine translation engine with:
  - text format management
  - finite-state lexical processing
  - statistical lexical disambiguation
  - shallow transfer based on finite-state pattern matching
2 Open-source linguistic data in well-specified XML formats for a variety of language pairs
3 Open-source tools: compilers to turn linguistic data into a fast and compact form used by the engine and software to learn disambiguation or structural transfer rules.

Page 29: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

The Apertium engine/1

SL text → De-formatter
  ↓
Morphological analyser [← FST]
  ↓
Categorial disambiguator [← FST + stat.]
  ↓
[rules →] Structural transfer ↔ Lexical transfer [← FST]
  ↓
Morphological generator [← FST]
  ↓
Post-generator [← FST]
  ↓
Re-formatter → TL text

Page 30: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

The Apertium engine/2

Communication between modules: text (Unix “pipelines”). Advantages:

- Simplifies diagnosis and debugging
- Allows the modification of data between two modules using, e.g., filters
- Makes it easy to insert alternative modules (interesting for research and development purposes)
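
A minimal sketch of the pipeline idea, assuming each module is a function from text to text. The stage functions and the `^word$` marking are stand-ins chosen for illustration, not the engine's actual modules. Because every intermediate result is plain text, it can be logged for debugging, and a filter can be dropped in between two stages without changing either of them.

```python
def run_pipeline(stages, text, trace=None):
    """Feed text through each stage in order, optionally recording each output."""
    for stage in stages:
        text = stage(text)
        if trace is not None:
            trace.append((stage.__name__, text))
    return text

def mark_words(text):
    """Stand-in for an analyser: wrap every word in delimiters."""
    return " ".join("^" + word + "$" for word in text.split())

def upper_filter(text):
    """A filter inserted between two modules."""
    return text.upper()

def unmark_words(text):
    """Stand-in for a generator: remove the delimiters again."""
    return text.replace("^", "").replace("$", "")

trace = []
result = run_pipeline([mark_words, upper_filter, unmark_words], "hello world", trace)
```

Inspecting `trace` shows what each module produced, which is the diagnostic advantage the slide refers to.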

Page 31: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

De-formatter

- Separates text from format information.
- Currently available for ISO-8859 or UTF-8 plain text, HTML, RTF, ODF, OpenOffice.org .sxw, etc.
- Based on finite-state techniques.
- Most of these filters are generated (using an XSLT stylesheet) from an XML de-formatter specification file.
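
A minimal sketch of separating text from format information, using a regular expression over HTML-like tags as a stand-in for the finite-state filters described above. The numbered `[n]` placeholders are an assumption made for this example (they would clash with input that already contains such brackets). The matching `reformat` function shows how the stored format can be restored after translation.

```python
import re

TAG = re.compile(r"<[^>]+>")
PLACEHOLDER = re.compile(r"\[(\d+)\]")

def deformat(marked_up):
    """Replace each tag by a numbered placeholder; return (text, formats)."""
    formats = []
    def stash(match):
        formats.append(match.group(0))
        return "[" + str(len(formats) - 1) + "]"
    return TAG.sub(stash, marked_up), formats

def reformat(text, formats):
    """Put the stored tags back in place of their placeholders."""
    return PLACEHOLDER.sub(lambda m: formats[int(m.group(1))], text)

text, formats = deformat("<b>góðan dag</b>")
translated = text.replace("góðan dag", "good day")  # stand-in for the translation steps
restored = reformat(translated, formats)
```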

Page 32: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Morphological analyser

- segments the source text into surface forms (SFs),
- assigns to each SF one or more lexical forms (LFs), each one with:
  - lemma
  - lexical category (part-of-speech)
  - morphological inflection information
- processes contractions (en: can’t=can+not; is: talarðu=talar+þú, ertu=ert+þú) and multi-word units which may be invariable (is: með öðrum orðum, við hlíðina á) or variable (is: brjóta af sér → braut af sér).
- reads finite-state transducers generated from a morphological dictionary in XML (using a compiler).
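
A minimal sketch of the analysis step, with plain dictionaries standing in for the compiled finite-state transducers. The lexicon contents, the tag labels and the `analyse` function are assumptions for illustration only. Multi-word units are tried first (longest match), then contractions are split, then each piece is looked up.

```python
# surface form -> list of lexical forms (lemma, category, inflection)
LEXICON = {
    "talar": [("tala", "verb", "pres.2sg"), ("tala", "verb", "pres.3sg")],
    "ert": [("vera", "verb", "pres.2sg")],
    "þú": [("þú", "pronoun", "2sg.nom")],
}
CONTRACTIONS = {"talarðu": ["talar", "þú"], "ertu": ["ert", "þú"]}
MULTIWORDS = {("með", "öðrum", "orðum"): [("með öðrum orðum", "adverb", "")]}

def analyse(text):
    """Return a list of (surface form, lexical forms); unknown words get []."""
    tokens = text.lower().split()
    longest = max(len(key) for key in MULTIWORDS)
    result, i = [], 0
    while i < len(tokens):
        # Try multi-word units first, longest candidate first.
        for n in range(min(longest, len(tokens) - i), 1, -1):
            key = tuple(tokens[i:i + n])
            if key in MULTIWORDS:
                result.append((" ".join(key), MULTIWORDS[key]))
                i += n
                break
        else:
            # Single token: split it if it is a known contraction.
            for part in CONTRACTIONS.get(tokens[i], [tokens[i]]):
                result.append((part, LEXICON.get(part, [])))
            i += 1
    return result
```

Note that "talar" receives two lexical forms; choosing between them is the job of the next module.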

Page 33: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Categorial disambiguator (part-of-speech tagger)

- picks one of the LFs corresponding to each ambiguous SF (about 30% of them) according to context
- uses hidden Markov models and hand-written constraint rules
- is trained using representative corpora for the source language (manually disambiguated or not) or, recently, using statistical models for the TL
- its behavior is completely specified by an XML file
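
A minimal sketch of how a first-order hidden Markov model combined with forbidden tag pairs can choose among candidate tags, using the Viterbi algorithm. All probabilities and tag names are invented for illustration, and lexical (emission) probabilities are left out to keep the example short; this is not the tagger's actual training or file format.

```python
def viterbi(candidates, start, trans, forbidden=frozenset()):
    """Pick the most probable tag sequence for a sentence.

    candidates: one list of possible tags per word (from the analyser)
    start:      {tag: probability of starting a sentence}
    trans:      {(previous tag, tag): transition probability}
    forbidden:  hand-written constraints, tag pairs that may never occur
    """
    floor = 1e-6  # small probability for events missing from the tables

    def step(prev, tag):
        if (prev, tag) in forbidden:
            return 0.0
        return trans.get((prev, tag), floor)

    # best[tag] = (probability, path) of the best sequence ending in tag
    best = {tag: (start.get(tag, floor), [tag]) for tag in candidates[0]}
    for tags in candidates[1:]:
        new_best = {}
        for tag in tags:
            prev = max(best, key=lambda p: best[p][0] * step(p, tag))
            prob, path = best[prev]
            new_best[tag] = (prob * step(prev, tag), path + [tag])
        best = new_best
    return max(best.values(), key=lambda entry: entry[0])[1]

START = {"det": 0.6, "noun": 0.3, "verb": 0.1}
TRANS = {("det", "noun"): 0.8, ("det", "verb"): 0.05, ("noun", "verb"): 0.6,
         ("noun", "noun"): 0.2, ("verb", "noun"): 0.3, ("verb", "verb"): 0.1}
# Three words: an unambiguous determiner followed by two noun/verb ambiguities.
tags = viterbi([["det"], ["noun", "verb"], ["noun", "verb"]],
               START, TRANS, forbidden={("det", "verb")})
```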

Page 34: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Structural transfer /1

- It is based on finite-state techniques (finite-state recognizers).
- The XML transfer-rule file is preprocessed for faster interpreting.
- Rules have a pattern–action form.
- It detects LF patterns to be processed using a left-to-right, longest-match strategy.
- It executes the actions associated to each pattern in the rule file to generate the corresponding LF pattern for the TL.
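
A minimal sketch of pattern–action rules applied left to right with longest match. Lexical forms are reduced to `(lemma, category)` pairs, and the two reordering rules are invented examples; real rules are written in XML and compiled, which this sketch does not attempt to reproduce.

```python
# pattern (sequence of categories) -> action returning the reordered LFs
RULES = {
    ("det", "adj", "noun"): lambda det, adj, noun: [det, noun, adj],
    ("adj", "noun"): lambda adj, noun: [noun, adj],
}

def transfer(lfs):
    """Apply the longest matching rule at each position, scanning left to right."""
    longest = max(len(pattern) for pattern in RULES)
    out, i = [], 0
    while i < len(lfs):
        for n in range(min(longest, len(lfs) - i), 0, -1):
            window = lfs[i:i + n]
            pattern = tuple(category for _, category in window)
            if pattern in RULES:
                out.extend(RULES[pattern](*window))
                i += n
                break
        else:
            out.append(lfs[i])  # no rule matched: copy the LF unchanged
            i += 1
    return out

sentence = [("the", "det"), ("white", "adj"), ("house", "noun"), ("sleeps", "verb")]
reordered = transfer(sentence)
```

At the first position both rules could eventually apply, but the three-word pattern wins because it is the longest match starting there.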

Page 35: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Structural transfer /2

For “harder” language pairs, a three-stage structural transfer is available:

1 Patterns of LFs (chunks) are detected, processed and marked
2 Patterns of chunks are detected and processed: this interchunk processing allows for longer-range (“inter-chunk”) syntactic transformations
3 The output chunks are finished and the resulting LFs are written.
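
A minimal sketch of the three stages, assuming very coarse chunk labels (`NP`, `VP`, `OTHER`) and a single invented inter-chunk rule that moves a verb chunk after the noun phrase that follows it. The function names and labels are illustrative assumptions.

```python
NP_CATEGORIES = {"det", "adj", "noun"}

def make_chunks(lfs):
    """Stage 1: group consecutive noun-phrase LFs into chunks and label them."""
    chunks = []
    for lemma, category in lfs:
        if category in NP_CATEGORIES:
            label = "NP"
        elif category == "verb":
            label = "VP"
        else:
            label = "OTHER"
        if chunks and chunks[-1][0] == label == "NP":
            chunks[-1][1].append((lemma, category))
        else:
            chunks.append((label, [(lemma, category)]))
    return chunks

def reorder_chunks(chunks):
    """Stage 2: inter-chunk rule, VP NP -> NP VP."""
    out, i = [], 0
    while i < len(chunks):
        if i + 1 < len(chunks) and chunks[i][0] == "VP" and chunks[i + 1][0] == "NP":
            out.extend([chunks[i + 1], chunks[i]])
            i += 2
        else:
            out.append(chunks[i])
            i += 1
    return out

def finish(chunks):
    """Stage 3: write out the LFs contained in the chunks."""
    return [lf for _, lfs in chunks for lf in lfs]

svo = [("the", "det"), ("dog", "noun"), ("sees", "verb"),
       ("a", "det"), ("big", "adj"), ("cat", "noun")]
sov = finish(reorder_chunks(make_chunks(svo)))
```

The verb moves across a three-word noun phrase, a longer-range change than a fixed-length word pattern would conveniently express.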

Page 36: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Lexical transfer module

- reads each SL LF and generates the corresponding TL LF
- reads finite-state transducers generated from bilingual dictionaries in XML (using a compiler).
- invoked by the structural transfer module
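
A minimal sketch of lexical transfer as a lookup from source-language LF to target-language LF, with a dictionary standing in for the compiled transducer. The entries and the `*` marker for untranslated lemmas are assumptions made for this example.

```python
# (SL lemma, category) -> TL lemma; a toy Icelandic-English bilingual dictionary
BILINGUAL = {
    ("hestur", "noun"): "horse",
    ("fiskur", "noun"): "fish",
    ("tala", "verb"): "speak",
}

def lexical_transfer(lexical_form):
    """Translate the lemma of an LF, keeping category and inflection."""
    lemma, category, inflection = lexical_form
    target = BILINGUAL.get((lemma, category))
    if target is None:
        return ("*" + lemma, category, inflection)  # mark unknown lemmas
    return (target, category, inflection)
```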

Page 37: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Morphological generator

- Generates from each TL LF, a TL SF, after adequately inflecting it
- It reads finite-state transducers generated from a morphological dictionary in XML (using a compiler)
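
A minimal sketch of generation as the inverse of analysis: a target-language LF is turned into a surface form by attaching the suffix its paradigm gives for the requested inflection. The paradigm names, tags and the `#` marker for forms that cannot be generated are hypothetical.

```python
PARADIGMS = {
    "regular-noun": {"sg": "", "pl": "s"},
    "regular-verb": {"inf": "", "pres.3sg": "s", "past": "ed"},
}
# (lemma, category) -> paradigm name
LEXICON = {("horse", "noun"): "regular-noun", ("talk", "verb"): "regular-verb"}

def generate(lexical_form):
    """Return the inflected surface form for a (lemma, category, inflection) LF."""
    lemma, category, inflection = lexical_form
    paradigm = LEXICON.get((lemma, category))
    if paradigm is None or inflection not in PARADIGMS[paradigm]:
        return "#" + lemma  # cannot generate: flag the lemma
    return lemma + PARADIGMS[paradigm][inflection]
```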

Page 38: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Post-generator

- Performs some TL orthographical transformations, such as contractions (ca: de + els → dels; en: can + not → cannot), inserting apostrophes (ca: de + amics → d’amics), etc.
- It is based on finite-state transducers generated from a post-generation rule dictionary (using a compiler).
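
A minimal sketch of post-generation as an ordered list of rewrite rules per target language, using regular expressions instead of compiled transducers. The rule table is an assumption covering only the examples above, and the Catalan elision rule is deliberately simplified. The contraction has to run before the elision, since "els" also begins with a vowel.

```python
import re

# Ordered rules per target language: contractions first, then apostrophes.
RULES = {
    "ca": [
        (re.compile(r"\bde els\b"), "dels"),
        (re.compile(r"\bde (?=[aeiouàèéíòóúh])", re.IGNORECASE), "d'"),
    ],
    "en": [
        (re.compile(r"\bcan not\b"), "cannot"),
    ],
}

def post_generate(text, lang):
    """Apply the orthographical rules of the target language in order."""
    for pattern, replacement in RULES.get(lang, []):
        text = pattern.sub(replacement, text)
    return text
```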

Page 39: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Re-formatter

- Integrates format information (plain ISO-8859 or UTF-8 text, HTML, RTF, ODT, OpenOffice.org .sxw, etc.) into the translated text.
- Also used to modify URLs in links for translate-as-you-surf.
- It is based on finite-state techniques.
- It is generated (using an XSLT stylesheet) from an XML de-formatter specification file

Page 40: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Language-pair data

The Apertium project hosts the development of a large number of language pairs:

- Stable language pairs include: es↔ca, es↔gl, es↔pt, en↔ca, en↔es, es↔fr, ca↔oc, ro→es, es→eo, ca→eo.
- There is also a growing number of language pairs under development. Some include Scandinavian languages (da, sv, nn, nb).

Page 41: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Project funding

Funded by

- The Ministry of Industry, Tourism and Commerce of Spain (also, the Ministries of Education and Science and of Science and Technology of Spain)
- The Secretariat for Technology and the Information Society of the Government of Catalonia
- The Ministry of Foreign Affairs of Romania
- The Universitat d’Alacant
- Companies: Prompsit Language Engineering, ABC Enciklopedioj, etc.

Page 42: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

The Apertium community/1

Not the ideal community development situation, but close. In addition to the original (funded) developers, a community has formed around the platform (instigated by Francis Tyers).

More than 60 developers in sourceforge.net/projects/apertium/, many outside the original group; code updated very frequently, hundreds of monthly SVN commits.
A collectively-maintained wiki shows the current development and tips for people building new language pairs or code.

Page 43: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

The Apertium community/2

Externally developed tools and code:
  a graphical user interface, apertium-tolk, and the diagnostic tool apertium-view
  plugins for OpenOffice.org or the Pidgin (previously Gaim) messaging program
  Windows ports, etc.

Many people gather and interact in the #apertium IRC channel (at freenode.net).
Stable packages ported to Debian GNU/Linux (and the next Ubuntu).

Page 44: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Apertium for Icelandic /1

To build, for instance, a GPL apertium-is-en prototype:
one could reuse the en dictionaries in apertium-en-ca or apertium-en-es (analysis and generation) and the part-of-speech taggers too
one should build an is dictionary:
  getting some inspiration from existing (incomplete) data in Apertium for sv, da, fo...
  using Wiktionary [an experiment by Francis Tyers: http://apertium.svn.sourceforge.net/viewvc/apertium/trunk/incubator/apertium-fo-is.is.dix?view=markup]
  convincing the authors of icemorphy or tungutorg to release (part of) their data under the GPL license.

Page 45: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Apertium for Icelandic /2

one could train an is part-of-speech tagger, perhaps with some help from icetagger or tungutorg
one should build a bilingual is–en dictionary, for instance:
  by completing the English and Icelandic dictionaries in Ergane
  by modifying bilingual dictionaries learned from a sentence-aligned bilingual corpus using Caseli et al.’s ReTraTos (sf.net/projects/retratos)
one could then use Sánchez-Martínez and Forcada’s method to learn an initial set of structural transfer rules using the same or a different corpus, and then refine it.

A prototype would be available in 1 person·year! Who dares?
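The idea of learning a first bilingual dictionary from a sentence-aligned corpus, to be revised by hand afterwards, can be illustrated with a very simple co-occurrence statistic. The sketch below is not the ReTraTos algorithm: it swaps in the Dice coefficient over sentence-level co-occurrence counts, and the function name candidate_pairs, the threshold, and the three-sentence toy corpus are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def candidate_pairs(aligned, min_dice=0.9):
    """Propose (source_word, target_word, dice) entries from aligned sentences.

    Dice = 2 * co-occurrences / (source occurrences + target occurrences),
    counted once per sentence pair. Not ReTraTos; a stand-in statistic.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in aligned:
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))

    scored = []
    for (s, t), both in pair_count.items():
        dice = 2 * both / (src_count[s] + tgt_count[t])
        if dice >= min_dice:
            scored.append((s, t, dice))
    # Highest-scoring candidates first, then alphabetical for stable output.
    return sorted(scored, key=lambda x: (-x[2], x[0], x[1]))


# Hypothetical toy corpus; a real one would have many thousands of pairs.
corpus = [
    ("hundurinn sefur", "the dog sleeps"),
    ("kötturinn sefur", "the cat sleeps"),
    ("hundurinn borðar", "the dog eats"),
]
for source, target, score in candidate_pairs(corpus):
    print(source, target, round(score, 2))
```

On a corpus this small the candidates are unreliable by construction; with real data the list would still need the manual modification step mentioned above before it could serve as a bilingual dictionary.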

Page 46: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Apertium for Icelandic /3

Is the time right? The Government of Iceland has agreed on a “Policy on Free and Open-source Software” (“Stefna um frjálsan og opinn hugbúnað”, Mar. 11, 2008).

“Giving access to the source code expands the opportunities for adapting and examining security aspects of the software, in addition to allowing for its further development if the producers discontinue it for some reason.”
“There is a great need to increase the return on public body investments in software design. [...] Once software has been prepared, it is important that it has the potential of being reused [...] Reusability can be achieved by [...] ensuring that it is free and open-source.”

Page 47: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Concluding remarks /1

Icelandic, like any other living language, however small, needs machine translation and has the right to it!
The development of open-source MT for Icelandic can have specific, additional effects (increasing expertise, contributing reusable resources, reducing technological dependency). Apertium eases this task.
Development of MT for a small language faces a number of challenges: elicitation of linguistic knowledge, need for standard formats, modularity. Apertium offers the last two.

Of course, I will be happy to discuss these conclusions!

Page 48: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

Takk fyrir!

Thanks, Hrafn Loftsson, and the rest of the colleagues at Reykjavík University and the University of Iceland for inviting me to this conference and making me feel at home.
Thank you all for your attention.

Page 49: Open-source machine translation for Icelandic:  the Apertium platform as an opportunity

I should practice what I preach. . .

This work may be distributed under the terms of:
  the Creative Commons Attribution–Share Alike license: http://creativecommons.org/licenses/by-sa/3.0/
  the GNU GPL v. 3.0 License: http://www.gnu.org/licenses/gpl.html

Dual license! E-mail me to get the sources: [email protected]
