Sd llod-15 apertium


Apertium RDF: an experience in generating linguistic linked open data

Jorge Gracia Ontology Engineering Group (OEG)

Universidad Politécnica de Madrid (UPM)

1st Summer Datathon on Linguistic Linked Open Data Cercedilla (Spain), 15-19 June 2015

16/06/2015 2 Jorge Gracia

Outline

• Motivation
• The Apertium platform
• Representing translations in RDF
• Building the Apertium RDF graph
• Traversing the graph
• Linking with external sources
• Conclusions


Motivation


Motivation

Current multilingual lexica and electronic dictionaries:
• Proprietary formats
• Non-standard APIs
• Disconnected from other resources


Motivation

GOAL: to expose translations contained in bilingual dictionaries as Linked Data on the Web

Joint effort by


The Apertium platform


Apertium

Apertium [http://www.apertium.org] is an open-source platform for machine translation. Its bilingual dictionaries are available in XML.


Apertium

Afrikaans <-> Dutch
Breton --> French
Catalan <-> Italian
Welsh <-> English
Danish <-- Norwegian
English <-> Catalan
English <-> Spanish
English <-> Galician
Esperanto <-- Catalan
Esperanto <-> English
Esperanto <-- Spanish
Esperanto <-- French
Spanish <-> Aragonese
Spanish <-> Asturian
Spanish <-> Catalan
Spanish <-> Galician
Spanish <-> Italian
Spanish <-> Portuguese
Spanish <-> Romanian
Basque --> English
Basque --> Spanish
French <-> Catalan
French <-> Spanish
Serbo-Croatian <-> English
Serbo-Croatian <-> Macedonian
Serbo-Croatian <-> Slovenian
Indonesian <-> Malaysian
Icelandic <-> Swedish
Icelandic --> English
Kazakh <-> Tatar
Macedonian <-> Bulgarian
Macedonian --> English
Norwegian Nynorsk <-> Norwegian Bokmål
Occitan <-> Catalan
Occitan <-> Spanish
Portuguese <-> Catalan
Portuguese <-> Galician
Northern Sami --> Norwegian Bokmål
Swedish <-> Danish
……

More than 40 language pairs

22 of them (the more stable ones) are available in LMF (Lexical Markup Framework)


Representing translations in RDF


lemon


The translation module

[Diagram: the Translation Module, http://purl.org/net/translation.owl]
• Classes: TranslationSet, Translation, LexicalSense, Resource
• Properties: trans, translationSource, translationTarget, translationCategory, context
• Attribute: translationConfidence (double)
• Translation Categories (http://purl.org/net/translation-categories): directEquivalent, culturalEquivalent, lexicalEquivalent


Translation example

[Diagram: the translation between “bench”@en and “banco”@es]
• lemon:Lexicon lexiconEN and lemon:Lexicon lexiconES each point to a lemon:LexicalEntry via lemon:entry
• Each lemon:LexicalEntry has a lemon:Form (lemon:lexicalForm) whose lemon:writtenRep is “bench”@en or “banco”@es
• Each lemon:LexicalSense is attached to its entry via lemon:isSenseOf
• A tr:Translation links the two senses via tr:translationSource and tr:translationTarget
• tr:TranslationSet translationSetEN-ES points to the translation via tr:trans
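The structure of this example can be written out as plain subject-predicate-object triples. The Python sketch below is illustrative only: the local names (`bench-n-en-sense`, `bench-banco-trans` and so on) are invented for the example, which end is the source and which the target is assumed, and the `lemon:` and `tr:` prefixes are used as labels rather than as resolved URIs.

```python
# Minimal sketch: one translation as (subject, predicate, object) triples.
# Local names are hypothetical; only classes and properties come from the slide.

def translation_triples(src_sense, tgt_sense, translation, translation_set):
    """Return the triples that attach one tr:Translation to its senses and set."""
    return [
        (translation, "rdf:type", "tr:Translation"),
        (translation, "tr:translationSource", src_sense),
        (translation, "tr:translationTarget", tgt_sense),
        (translation_set, "rdf:type", "tr:TranslationSet"),
        (translation_set, "tr:trans", translation),
    ]

triples = [
    ("lexiconEN", "lemon:entry", "bench-n-en"),
    ("lexiconES", "lemon:entry", "banco-n-es"),
    ("bench-n-en-sense", "lemon:isSenseOf", "bench-n-en"),
    ("banco-n-es-sense", "lemon:isSenseOf", "banco-n-es"),
] + translation_triples(
    "bench-n-en-sense", "banco-n-es-sense",
    "bench-banco-trans", "translationSetEN-ES",
)

for s, p, o in triples:
    print(s, p, o, ".")
```

Keeping the translation as its own node, rather than a direct link between entries, is what leaves room for attributes such as a category or a confidence value.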


Building the Apertium RDF graph


Methodology

1. Data analysis and vocabulary selection
2. Modelling
3. URIs design
4. RDF generation
5. Publication as linked data


Modelling

Mapping of data sources


URIs design

Following ISA recommendations [Archer et al.]:

# Apertium English lexicon:
http://linguistic.linkeddata.es/id/apertium/lexiconEN
# Apertium Spanish lexicon:
http://linguistic.linkeddata.es/id/apertium/lexiconES
# Apertium English-Spanish translation set:
http://linguistic.linkeddata.es/id/apertium/tranSetEN-ES

Archer, P., Goedertier, S., & Loutas, N. (2012). Study on persistent URIs. Tech. rep.
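The URI pattern shown in these examples can be captured in two small helper functions. This is a sketch under the assumption that every lexicon and translation set follows exactly the pattern of the three examples; the function names are hypothetical.

```python
# Sketch of the URI pattern; the base URI is taken from the examples on the slide.
BASE = "http://linguistic.linkeddata.es/id/apertium/"

def lexicon_uri(lang):
    """URI of the monolingual lexicon for a language code such as 'en'."""
    return f"{BASE}lexicon{lang.upper()}"

def transet_uri(source_lang, target_lang):
    """URI of the translation set for a language pair such as ('en', 'es')."""
    return f"{BASE}tranSet{source_lang.upper()}-{target_lang.upper()}"

print(lexicon_uri("en"))
print(transet_uri("en", "es"))
```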


RDF Generation

RDF generation based on OpenRefine
• E.g., RDF generated:

apertium:lexiconEN a lemon:Lexicon ;
    dc:source <http://hdl.handle.net/10230/17110> .
...
apertium:lexiconEN lemon:entry apertium:lexiconEN/bench-n-en .
apertium:lexiconEN/bench-n-en a lemon:LexicalEntry ;
    lemon:lexicalForm apertium:lexiconEN/bench-n-en-form ;
    lexinfo:partOfSpeech lexinfo:noun .
apertium:lexiconEN/bench-n-en-form a lemon:Form ;
    lemon:writtenRep "bench"@en .
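To make the shape of the generated RDF concrete, the following Python sketch produces the same abbreviated statements for one entry. It is not the OpenRefine workflow used in the talk, only plain string formatting; the function name and its parameters are assumptions. It mirrors the slide's shorthand (a `/` inside a prefixed name), so a real serializer would need full IRIs or escaping to emit strictly valid Turtle.

```python
# Sketch only: imitates the abbreviated output above with string formatting.
# pos_tag ("n") and lexinfo_pos ("noun") are passed separately because the
# mapping between the two is not given in the slides.

def entry_turtle(lemma, pos_tag, lexinfo_pos, lang):
    """Return the statements for one lexical entry and its written form."""
    lexicon = f"apertium:lexicon{lang.upper()}"
    entry = f"{lexicon}/{lemma}-{pos_tag}-{lang}"
    form = f"{entry}-form"
    return "\n".join([
        f"{lexicon} lemon:entry {entry} .",
        f"{entry} a lemon:LexicalEntry ;",
        f"    lemon:lexicalForm {form} ;",
        f"    lexinfo:partOfSpeech lexinfo:{lexinfo_pos} .",
        f"{form} a lemon:Form ;",
        f'    lemon:writtenRep "{lemma}"@{lang} .',
    ])

print(entry_turtle("bench", "n", "noun", "en"))
```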


Publication

• SPARQL endpoint: http://linguistic.linkeddata.es/apertium/sparql-editor/
• Web interface: http://linguistic.linkeddata.es/apertium/
• Datahub: http://datahub.io/dataset?q=apertium+rdf&organization=oeg-upm


Traversing the graph


22 generated datasets

Lang. pair   # triples   # trans.
CA-IT        180,851     7,869
EN-CA        759,601     33,029
EN-ES        576,316     25,830
EN-GL        425,117     20,034
EO-CA        426,301     19,964
EO-EN        617,772     31,474
EO-ES        380,198     17,212
EO-FR        726,281     35,791
ES-AN        71,997      3,110
ES-AST       825,54      36,096
ES-CA        730,501     31,291
ES-GL        206,284     8,985
ES-PT        279,245     12,054
ES-RO        400,366     17,318
EU-ES        262,336     11,838
EU-EN        265,466     13,089
FR-CA        152,002     6,550
FR-ES        495,614     21,475
OC-CA        346,346     15,983
OC-ES        317,162     14,561
PT-CA        163,149     7,111
PT-GL        234,065     10,144


Apertium RDF in the LLOD cloud


Direct translations

Direct translations for “bank”@en

Translated written repr.   Part of Speech
"banc"@ca                  http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"riba"@ca                  http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"banco"@es                 http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"orilla"@es                http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"ribera"@es                http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"beira"@gl                 http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"banco"@gl                 http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"ourela"@gl                http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"orela"@gl                 http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"banku"@eu                 http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"erribera"@eu              http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"ertz"@eu                  http://www.lexinfo.net/ontology/2.0/lexinfo#noun
"amuntegar"@ca             http://www.lexinfo.net/ontology/2.0/lexinfo#verb
"agolpar"@es               http://www.lexinfo.net/ontology/2.0/lexinfo#verb
"amontonar"@es             http://www.lexinfo.net/ontology/2.0/lexinfo#verb
"apelotonar"@es            http://www.lexinfo.net/ontology/2.0/lexinfo#verb
"hacinar"@es               http://www.lexinfo.net/ontology/2.0/lexinfo#verb
....                       ...
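A result list like this one comes from following the path written form, entry, sense, translation, sense, entry, written form. The Python sketch below only builds the text of such a SPARQL query; it does not contact the endpoint. The `lemon:` and `tr:` namespace URIs are assumptions (only the `lexinfo:` one appears in the slides), the function name is hypothetical, and the query follows the source-to-target direction only, so it would not by itself return translations stored in the opposite direction.

```python
# Sketch: build a SPARQL query string for the direct translations of a word.
# The lemon: and tr: namespace URIs below are assumptions, not taken from the slides.
PREFIXES = """\
PREFIX lemon: <http://lemon-model.net/lemon#>
PREFIX tr: <http://purl.org/net/translation#>
PREFIX lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#>
"""

def direct_translations_query(written_rep, lang):
    """Return query text; written_rep must not contain double quotes."""
    return PREFIXES + f"""
SELECT DISTINCT ?written_rep ?pos WHERE {{
  ?form_src lemon:writtenRep "{written_rep}"@{lang} .
  ?entry_src lemon:lexicalForm ?form_src .
  ?sense_src lemon:isSenseOf ?entry_src .
  ?trans tr:translationSource ?sense_src ;
         tr:translationTarget ?sense_tgt .
  ?sense_tgt lemon:isSenseOf ?entry_tgt .
  ?entry_tgt lemon:lexicalForm ?form_tgt ;
             lexinfo:partOfSpeech ?pos .
  ?form_tgt lemon:writtenRep ?written_rep .
}}
"""

print(direct_translations_query("bank", "en"))
```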

[Diagram: Apertium LMF to Apertium RDF. Each LMF bilingual dictionary (EN-ES, EN-CA) yields monolingual lexicons (Lexicon EN, Lexicon ES, Lexicon CA) and a translation set (Translation Set EN-ES, Translation Set EN-CA).]

[Diagram: example graph across three lexicons]
• LexiconEN: “bank”@en, “bench”@en
• LexiconES: “banco”@es, “orilla”@es, “ribera”@es
• LexiconPT: “banco”@pt, “orla”@pt
• TranslationSetEN-ES: bank-banco, bank-ribera, bank-orilla, bench-banco
• TranslationSetES-PT: banco-banco, orilla-orla

Indirect translations

Indirect translations for “bank”: EN -> ES -> PT

Pivot translation (written repr.)   Indirect translation (written repr.)
"banco"@es                          "banco"@pt
"orilla"@es                         "orla"@pt
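The pivot step behind this table can be sketched in a few lines of Python over in-memory word pairs instead of the RDF graph. The pairs are the ones from the EN-ES and ES-PT example; the function names are hypothetical.

```python
from collections import defaultdict

# Toy translation sets taken from the EN-ES / ES-PT example.
EN_ES = [("bank", "banco"), ("bank", "ribera"), ("bank", "orilla"), ("bench", "banco")]
ES_PT = [("banco", "banco"), ("orilla", "orla")]

def index(pairs):
    """Map each source word to the set of its translations."""
    table = defaultdict(set)
    for source, target in pairs:
        table[source].add(target)
    return table

def indirect_translations(word, first, second):
    """Return (pivot, target) pairs reached by chaining two translation sets."""
    first_idx, second_idx = index(first), index(second)
    results = []
    for pivot in sorted(first_idx.get(word, ())):
        for target in sorted(second_idx.get(pivot, ())):
            results.append((pivot, target))
    return results

print(indirect_translations("bank", EN_ES, ES_PT))
# [('banco', 'banco'), ('orilla', 'orla')]
```

“ribera” drops out because the toy ES-PT set has no entry for it. Note that “banco”@pt is returned for “bank” regardless of which sense of “banco”@es was meant, which is why indirect translations need a confidence measure.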

Apertium RDF graph

Dijkstra's algorithm is used to choose the shortest path through the graph
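As an illustration, the sketch below runs Dijkstra's algorithm over the 22 generated language pairs. Treating each pair as an undirected edge of weight 1 between two language nodes is an assumption made for this example; the slides do not state the edge weights, and the function names are hypothetical.

```python
import heapq

# The 22 generated datasets; each is treated here as an undirected edge of
# weight 1 (an assumption for this sketch).
PAIRS = """CA-IT EN-CA EN-ES EN-GL EO-CA EO-EN EO-ES EO-FR ES-AN ES-AST ES-CA
ES-GL ES-PT ES-RO EU-ES EU-EN FR-CA FR-ES OC-CA OC-ES PT-CA PT-GL""".split()

def build_graph(pairs):
    """Build an adjacency map {language: {neighbour: weight}}."""
    graph = {}
    for pair in pairs:
        a, b = pair.split("-")
        graph.setdefault(a, {})[b] = 1
        graph.setdefault(b, {})[a] = 1
    return graph

def shortest_path(graph, start, goal):
    """Dijkstra's algorithm; returns the languages from start to goal, or None."""
    queue = [(0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph[node].items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return None

graph = build_graph(PAIRS)
print(shortest_path(graph, "EU", "PT"))
# ['EU', 'ES', 'PT']
```

With unit weights this reduces to fewest hops; several equally short paths may exist for other pairs (for example EN to PT via ES, CA or GL), in which case the one returned depends on tie-breaking.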


How to measure confidence

[Diagram: entries in three lexicons]
• LexiconEN: bench, bank
• LexiconES: banco, orilla, ribera
• LexiconCA: banc, riba

One time inverse consultation (OTIC)

Given a lexical entry s:
1. Get the direct translations of s in the pivot language (Ps).
2. ∀ p ∈ Ps, get its translations in the target language (Tp).
3. For every t ∈ Tp:
   (a) get its set of translations in the pivot language (Pt);
   (b) calculate the score for t:

$$\mathrm{score}(t) = \frac{2 \times |P_s \cap P_t|}{|P_s| + |P_t|}$$

Tanaka, K., & Umemura, K. (1994). Construction of a bilingual dictionary intermediated by a third language. In COLING, pp. 297–303.


One time inverse consultation

s = “banco”@es
Pbanco = {“bank”@en, “bench”@en}
Tbank = {“banc”@ca, “riba”@ca}
Tbench = {“banc”@ca}
Pbanc = {“bank”@en, “bench”@en}
Priba = {“bank”@en}

score(“banc”@ca) = 2·2 / (2 + 2) = 1.0
score(“riba”@ca) = 2·1 / (2 + 1) ≈ 0.67
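The OTIC procedure can be sketched in Python with plain dictionaries of sets standing in for the translation sets. The score used is 2·|Ps ∩ Pt| / (|Ps| + |Pt|), as in Tanaka & Umemura; the function name and the dictionary layout are assumptions for this example. With the sets listed above it yields 1.0 for “banc” and 2·1/(2+1) ≈ 0.67 for “riba”.

```python
def otic_scores(source, to_pivot, pivot_to_target, target_to_pivot):
    """Score each candidate target t with 2*|Ps & Pt| / (|Ps| + |Pt|)."""
    ps = to_pivot[source]                      # step 1: pivot translations of s
    candidates = set()
    for p in ps:                               # step 2: targets reachable via each pivot
        candidates |= pivot_to_target.get(p, set())
    scores = {}
    for t in candidates:                       # step 3: inverse consultation
        pt = target_to_pivot[t]
        scores[t] = 2 * len(ps & pt) / (len(ps) + len(pt))
    return scores

# Toy dictionaries from the worked example (ES -> EN -> CA).
ES_EN = {"banco": {"bank", "bench"}}
EN_CA = {"bank": {"banc", "riba"}, "bench": {"banc"}}
CA_EN = {"banc": {"bank", "bench"}, "riba": {"bank"}}

scores = otic_scores("banco", ES_EN, EN_CA, CA_EN)
for target, score in sorted(scores.items()):
    print(f"{target}: {score:.2f}")
```

The more pivot words a candidate shares with the source entry, the closer its score is to 1, so candidates reached through a single ambiguous pivot are ranked lower.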


Linking with external sources


Linking to BabelNet

Around 130,000 links between Apertium RDF and BabelNet

Linking to BabelNet

Translations for “bank”@en

Translated written repr. | BabelSynset | BabelNet gloss
"banco"@es | http://babelnet.org/rdf/s00008371n | “A building in which the business of banking transacted”
"banco"@es | http://babelnet.org/rdf/s00008366n | “An arrangement of similar objects in a row or in tiers”
"banco"@es | http://babelnet.org/rdf/s15346085n | “An ocean bank, sometimes referred to as a fishing bank or simply bank, ...”
… | … | …
"orilla"@es | http://babelnet.org/rdf/s00008363n | “Sloping land (especially the slope beside a body of water)”
"ribera"@es | http://babelnet.org/rdf/s00008363n | “Sloping land (especially the slope beside a body of water)”


Conclusions


Conclusions

• Apertium data on the Web following Semantic Web standards
• Common entry point for all the Apertium dictionaries
• Direct and indirect translations can be easily obtained via SPARQL
• Confidence degree for indirect translations
• Linked with BabelNet

Conclusions

Related reading… http://kdictionaries.com/kdn/kdn23.pdf

Thanks for your attention!

http://linguistic.linkeddata.es/apertium/


Some results of applying OTIC

Language path   Threshold   Precision   Recall
EN-CA-ES        0.0         76%         48%
EN-CA-ES        0.5         77%         48%
EN-CA-ES        1.0         82%         43%
ES-EN-CA        0.0         53%         39%
ES-EN-CA        0.5         55%         39%
ES-EN-CA        1.0         61%         36%
EN-ES-CA        0.0         73%         38%
EN-ES-CA        0.5         76%         38%
EN-ES-CA        1.0         83%         33%
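The trade-off in this table, where a higher threshold raises precision and lowers recall, can be illustrated with the standard definitions of both measures. The slides do not describe the evaluation setup, so the sketch below is an assumption about its general shape, and the scored pairs and gold standard are invented toy data, not results from the talk.

```python
def precision_recall(scored, gold, threshold):
    """Keep candidates with score >= threshold and compare them with a gold set."""
    accepted = {pair for pair, score in scored.items() if score >= threshold}
    if not accepted or not gold:
        return 0.0, 0.0
    correct = accepted & gold
    return len(correct) / len(accepted), len(correct) / len(gold)

# Hypothetical toy data: (source, target) candidates with OTIC-style scores.
scored = {
    ("banco", "banc"): 1.0,
    ("banco", "riba"): 0.67,
    ("orilla", "riba"): 1.0,
    ("orilla", "banc"): 0.5,
}
gold = {("banco", "banc"), ("orilla", "riba"), ("ribera", "riba")}

for threshold in (0.0, 0.5, 1.0):
    p, r = precision_recall(scored, gold, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```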