Finding Concept Coverings in Aligning Ontologies of Linked Data (ceur-ws.org/Vol-868/paper1.pdf)

Finding Concept Coverings in Aligning Ontologies of Linked Data

Rahul Parundekar, Craig A. Knoblock, and José Luis Ambite

Information Sciences Institute and Department of Computer Science
University of Southern California

4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292
{parundek,knoblock,ambite}@usc.edu

Abstract. Despite the recent growth in the size of the Linked Data Cloud, the absence of links between the vocabularies of the sources has resulted in heterogeneous schemas. Our previous work tried to find conceptual mappings between two sources and was successful in finding alignments, such as equivalence and subset relations, using the instances that are linked as equal. By using existential concepts and their intersections to define specialized classes (restriction classes), we were able to find alignments where previously existing concepts in one source did not have corresponding equivalent concepts in the other source. Upon inspection, we found that though we were able to find a good number of alignments, we were unable to completely cover one source with the other. In many cases we observed that even though a larger class could be defined completely by the multiple smaller classes that it subsumed, we were unable to find these alignments because our definition of restriction classes did not contain the disjunction operator to define a union of concepts. In this paper we propose a method that discovers alignments such as these, where a (larger) concept of the first source is aligned to the union of the subsumed (smaller) concepts from the other source. We apply this new algorithm to the Geospatial, Biological Classification, and Genetics domains and show that this approach is able to discover numerous concept coverings, where (in most cases) the subsumed classes are disjoint. The resulting alignments are useful for determining the mappings between ontologies, refining existing ontologies, and finding inconsistencies that may indicate that some instances have been erroneously aligned.

1 Introduction

The Web of Linked Data has seen huge growth in the past few years. As of September 2011, the Linked Open Data Cloud has grown to a size of 31.6 billion triples. This includes a wide range of data sources belonging to the government (42%), geographic (19.4%), life sciences (9.6%) and other domains.^1 A common way that the instances in these sources are linked to others is the use of the owl:sameAs property. Though the size of the Linked Data Cloud seems to be increasing drastically (10% over the 28.5 billion triples in 2010), inspection of the sources at the ontology level reveals that only a few of them (15 out of the 190 sources) have some mapping of the vocabularies. For the success of the Semantic Web, it is important that these heterogeneous schemas be linked. As described in our previous papers on Linking and Building Ontologies of Linked Data [8] and Aligning Ontologies of Geospatial Linked Data [7], an extensional technique can be used to generate alignments between the ontologies behind these sources. In these previous papers, we introduced the concept of restriction classes, which are sets of instances that satisfy a conjunction of value restrictions on properties (property-value pairs).

^1 http://www4.wiwiss.fu-berlin.de/lodcloud/state/

Though our algorithm was able to identify a good number of alignments, it was unable to completely cover one source with the classes in the other source. Upon closer look, we found that most of the alignments that we missed did not have a corresponding restriction class in the other source, and instead subsumed multiple restriction classes. While reviewing these subset relations, we discovered that in many cases the union of the smaller classes completely covered the larger class. In this paper, we describe how we extend our previous work to discover such concept coverings by introducing a more expressive set of class descriptions (unions of value restrictions).^2 In most of these coverings, the smaller classes are also found to be disjoint. In addition, further analysis of the alignments of these coverings provides a powerful tool to discover incorrect links in the Web of Linked Data, which can potentially be used to point out and rectify the inconsistencies in the instance alignments.

This paper is organized as follows. First, we describe the Linked Open Data sources that we try to align in the paper. Second, we briefly review our alignment algorithm from [8] along with the limitations of the results that were generated. Third, we describe our approach to finding alignments between unions of restriction classes. Fourth, we describe how outliers in these alignments help to identify inconsistencies and erroneous links. Fifth, we describe the experimental results on union alignments over additional domains. Finally, we compare against related work, and discuss our contributions and future work.

2 Sources Used for Alignments

Linked Data, by definition, links the instances of multiple sources. Often, sources conform to different, but related, ontologies that can also be meaningfully linked [8]. In this section we describe some of these sources from different domains that we try to align, the instances of which are linked using an equivalence property like owl:sameAs.

Linking GeoNames with places in DBpedia: DBpedia (dbpedia.org) is a knowledge base that covers multiple domains including around 526,000 places and other geographical features from the Geospatial domain. We try to align the concepts in DBpedia with GeoNames (geonames.org), which is a geographic source with about 7.8 million things. It uses a flat-file like ontology, where all instances belong to a single concept of Feature. This makes the ontology rudimentary, with the type data (e.g. mountains, lakes, etc.) about these geographical features instead recorded in the Feature Class & Feature Code properties.

^2 This work is an extended version of our workshop paper [6]. We have extended the method to find coverings in the Biological Classification and Genetics domains.

Linking LinkedGeoData with places in DBpedia: We also try to find alignments between the ontologies behind LinkedGeoData (linkedgeodata.org) and DBpedia. LinkedGeoData is derived from the Open Street Map initiative with around 101,000 instances linked to DBpedia using the owl:sameAs property.

Linking species from Geospecies with DBpedia: The Geospecies (geospecies.org) knowledge base contains species belonging to plant, animal, and other kingdoms linked to species in DBpedia using the skos:closeMatch property. Since the instances in the taxonomies in both these sources are the same, the sources are ideal for finding the alignment between the vocabularies.

Linking genes from GeneID with MGI: The Bio2RDF (bio2rdf.org) project contains inter-linked life sciences data extracted from multiple data-sets that cover genes, chemicals, enzymes, etc. We consider two sources from the Genetics domain from Bio2RDF, GeneID (extracted from the National Center for Biotechnology Information database) and MGI (extracted from the Mouse Genome Informatics project), where the genes are marked equivalent.

Although we provide results for all four of the above alignments in Section 4, in the rest of this paper we explain our methodology by using the alignment of GeoNames with DBpedia as an example.

3 Aligning Ontologies on the Web of Linked Data

First, we briefly describe our previous work on finding subset and equivalent alignments between restriction classes from two ontologies. Then, we describe how to use the subset alignments to find more expressive union alignments. Finally, we discuss how outliers in these union alignments often identify incorrect links in the Web of Linked Data.

3.1 Our previous work on aligning ontologies of Linked Data

In [8] we introduced the concept of restriction classes to align extensional concepts in two sources. A restriction class is a concept that is derived extensionally and defined by a conjunction of value restrictions for properties (called property-value pairs) in a source. Such a definition helps overcome the problem of aligning rudimentary ontologies with more sophisticated ones. For example, GeoNames only has a single concept (Feature) to which all of its instances belong, while DBpedia has a rich ontology. However, Feature has several properties that can be used to define more meaningful classes. For example, the set of instances in GeoNames with the value PPL in the property featureCode nicely aligns with the instances of City in DBpedia.

Our algorithm explored the space of restriction classes from two ontologies and was able to find equivalent and subset alignments between these restriction classes. Fig. 1 illustrates the instance sets considered to score an alignment hypothesis. We first find the instances belonging to the restriction class r1 from the first source and r2 from the second source. We then compute the image of r1 (denoted by I(r1)), which is the set of instances from the second source linked to instances in r1 (dashed lines in the figure). By comparing r2 with the intersection of I(r1) and r2 (shaded region), we can determine the relationship between r1 and r2. We defined two metrics, P and R, as the ratio of |I(r1) ∩ r2| to |I(r1)| and |r2| respectively, to quantify set-containment relations. For example, two classes are equivalent if P = R = 1. In order to allow a certain margin of error induced by the data-set, we used the relaxed versions P' and R' as part of our scoring mechanism. In this case, two classes were considered equivalent if P' > 0.9 and R' > 0.9. For example, consider the alignment between restriction classes (lgd:gnis%3AST alpha=NJ) from LinkedGeoData and (dbpedia:Place#type=http://dbpedia.org/resource/City (New Jersey)) from DBpedia. Based on the extension sets, our algorithm finds |I(r1)| = 39, |r2| = 40, |I(r1) ∩ r2| = 39, R' = 0.97 and P' = 1.0. Based on our error margins, we assert the alignment as equivalent in an extensional sense. The exploration of the space of alignments and the scoring procedure is described in detail in [8].
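The scoring step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`image`, `containment_scores`, `is_equivalent`), the link dictionary and the toy identifiers are hypothetical, and the sketch uses the raw ratios with a 0.9 threshold rather than the exact relaxed P' and R' of [8], which are not reproduced here.

```python
def image(r1_instances, links):
    """Map the instances of r1 (source 1) to their linked instances in source 2."""
    return {links[i] for i in r1_instances if i in links}


def containment_scores(image_r1, r2_instances):
    """Return (P, R): |I(r1) ∩ r2| divided by |I(r1)| and by |r2| respectively."""
    overlap = image_r1 & r2_instances
    p = len(overlap) / len(image_r1) if image_r1 else 0.0
    r = len(overlap) / len(r2_instances) if r2_instances else 0.0
    return p, r


def is_equivalent(p, r, threshold=0.9):
    """Relaxed equivalence test: both scores must exceed the threshold."""
    return p > threshold and r > threshold


# Toy data that mirrors the set sizes of the New Jersey example:
# 39 linked instances in I(r1), 40 instances in r2, 39 of them shared.
links = {f"lgd:{k}": f"dbp:{k}" for k in range(39)}  # hypothetical owl:sameAs links
r1 = set(links)                                      # instances of r1 in source 1
r2 = {f"dbp:{k}" for k in range(40)}                 # instances of r2 in source 2

p, r = containment_scores(image(r1, links), r2)
print(p, r, is_equivalent(p, r))
```

With these sizes the raw ratios are P = 39/39 and R = 39/40, which is in line with the values reported for the example in the text.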

[Figure 1: Venn diagram of the instance sets r1, I(r1) and r2. Key: set of instances from O1 where r1 holds; set of instances from O2 where r2 holds; set of instances from O2 paired to instances from O1; instance pairs where r1 holds; set of instance pairs where both r1 and r2 hold.]

Fig. 1. Comparing the linked instances from two ontologies

Though the approach produced a large number of equivalent alignments, we were not able to find a complete coverage because some restriction classes did not have a corresponding equivalent restriction class and instead subsumed multiple smaller restriction classes. For example, in the GeoNames and DBpedia alignment, we found that {rdf:type=dbpedia:EducationalInstitution} from DBpedia subsumed {geonames:featureCode=S.SCH}, {geonames:featureCode=S.SCHC} and {geonames:featureCode=S.UNIV} (i.e. Schools, Colleges and Universities from GeoNames). We discovered that, taken together, the union of these three restriction classes completely defines rdf:type=dbpedia:EducationalInstitution. To find such previously undetected alignments we decided to extend the expressivity of our restriction classes by introducing a disjunction operator to detect concept coverings completely.

3.2 Identifying union alignments

In our current work, we use the subset and equivalent alignments generated by the previous work to try and align a larger class from one ontology with a union of smaller subsumed restriction classes in the other ontology. Since the problem of finding alignments with conjunctions and disjunctions of property-value pairs of restriction classes is combinatorial in nature, we focus only on subset and equivalence relations that map to restriction classes with a single property-value pair. This helps us find the simplest definitions of concepts and also makes the problem tractable. Alignments generated by our previous work that satisfy the single property-value pair constraint are first grouped according to the subsuming restriction classes and then according to the property of the smaller classes. Since restriction classes are constructed by forming a set of instances that have one of the properties restricted to a single value, aggregating restriction classes from the group according to their properties builds a more intuitive definition of the union. We can now define the disjunction operator that constructs the union concept from the smaller restriction classes in these sub-groups. The disjunction operator is defined for restriction classes, such that i) the concept formed by the disjunction of the classes represents the union of their sets of instances, ii) each of the classes that are aggregated contains only a single property-value pair and iii) the property for all those property-value pairs is the same. We then try to detect the alignment between the larger common restriction class and the union by using an extensional approach similar to our previous paper. We call such an alignment a hypothesis union alignment.
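The two-level grouping described above (first by subsuming class, then by the property of the smaller classes) can be sketched as follows. The input representation, a list of (larger class, (property, value)) pairs, and the function name `group_by_larger_and_property` are assumptions made for illustration; the paper does not specify its data structures.

```python
from collections import defaultdict


def group_by_larger_and_property(subset_alignments):
    """Group single property-value subset alignments into union hypotheses.

    subset_alignments: iterable of (larger_class, (prop, value)) pairs, where
    the restriction class {prop = value} was found to be a subset of larger_class.
    Returns {larger_class: {prop: set_of_values}}.
    """
    groups = defaultdict(lambda: defaultdict(set))
    for larger, (prop, value) in subset_alignments:
        groups[larger][prop].add(value)
    # Convert to plain dicts so the result is easy to inspect and compare.
    return {larger: dict(by_prop) for larger, by_prop in groups.items()}


# Subset alignments written in the notation used in the text.
subset_alignments = [
    ("rdf:type=dbpedia:EducationalInstitution", ("geonames:featureCode", "S.SCH")),
    ("rdf:type=dbpedia:EducationalInstitution", ("geonames:featureCode", "S.SCHC")),
    ("rdf:type=dbpedia:EducationalInstitution", ("geonames:featureCode", "S.UNIV")),
    ("rdf:type=dbpedia:Airport", ("geonames:featureCode", "S.AIRB")),
    ("rdf:type=dbpedia:Airport", ("geonames:featureCode", "S.AIRP")),
]

hypotheses = group_by_larger_and_property(subset_alignments)
for larger, by_prop in hypotheses.items():
    for prop, values in by_prop.items():
        print(f"{{{larger}}} ?= {{{prop} in {sorted(values)}}}")
```

Each (larger class, property) sub-group yields one hypothesis union alignment, which still has to be scored extensionally before it is accepted.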

We define $U_S$ as the set of instances in the union of the individual smaller restriction classes, $\mathrm{Union}(r_2)$; $U_L$ as the image of the larger class by itself, $\mathrm{Img}(r_1)$; and $U_A$ as the overlap between these sets, $\mathrm{Union}(\mathrm{Img}(r_1) \cap r_2)$. We check whether the larger restriction class is equivalent to the union concept by using scoring functions analogous to P' & R' from our previous paper. The new scoring mechanism defines

$$P'_U = \frac{|U_A|}{|U_S|}, \qquad R'_U = \frac{|U_A|}{|U_L|}$$

with relaxed scoring assumptions as in P' & R'. To accommodate errors in the data-set, we consider it a complete coverage when the score is greater than a relaxed score of 0.9. That is, the hypothesis union alignment is considered equivalent if $P'_U > 0.9$ and $R'_U > 0.9$. Since, by construction, each of the subsets already satisfies P' > 0.9, we are assured that $P'_U$ is always going to be greater than 0.9. Thus, a union alignment is equivalent if $R'_U > 0.9$.
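As a rough illustration of the union scoring just defined, the sketch below computes the three sets and both scores with Python sets. The function names and the toy instance identifiers are hypothetical, and the raw ratios stand in for the relaxed scores, whose exact form is given in [8] rather than here.

```python
def union_scores(image_larger, smaller_classes):
    """Return (P_U, R_U) for a hypothesis union alignment.

    image_larger:    Img(r1), the instances of source 2 linked to the larger class.
    smaller_classes: iterable of instance sets, one per smaller restriction class.
    """
    u_s = set().union(*smaller_classes)  # U_S: union of the smaller classes
    u_l = set(image_larger)              # U_L: image of the larger class
    u_a = u_l & u_s                      # U_A: overlap between the two
    p_u = len(u_a) / len(u_s) if u_s else 0.0
    r_u = len(u_a) / len(u_l) if u_l else 0.0
    return p_u, r_u


def is_covering(p_u, r_u, threshold=0.9):
    """Accept the hypothesis when both scores exceed the relaxed threshold."""
    return p_u > threshold and r_u > threshold


# Toy example: 50 instances in the image of the larger class,
# 48 of them accounted for by three smaller classes.
u_l = set(range(50))
smaller = [set(range(0, 20)), set(range(20, 40)), set(range(40, 48))]

p_u, r_u = union_scores(u_l, smaller)
print(p_u, r_u, is_covering(p_u, r_u))
```

In this toy case every instance of the smaller classes lies inside the image of the larger class, so only the recall-like score R_U decides whether the covering is accepted, matching the observation at the end of the paragraph above.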

Figure 2 provides an example of the approach. Our previous algorithm finds that {geonames:featureCode = S.SCH}, {geonames:featureCode = S.SCHC} and {geonames:featureCode = S.UNIV} are subsets of {rdf:type = dbpedia:EducationalInstitution}. As can be seen in the Venn diagram in Figure 2, U_L is Img({rdf:type = dbpedia:EducationalInstitution}), U_S is {geonames:featureCode = S.SCH} ∪ {geonames:featureCode = S.SCHC} ∪ {geonames:featureCode = S.UNIV}, and U_A is the intersection of the two. With the educational institutions example, R'_U for the alignment of dbpedia:EducationalInstitution to the union of S.SCH, S.SCHC & S.UNIV is 0.98. We can thus confirm the hypothesis and consider this union alignment equivalent. Section 4 shows additional examples of union alignments.

3.3 Using outliers in union mappings to identify linked data errors

The computation of union alignments allows for a margin of error in the subset computation. It turns out that the outliers, the instances that do not satisfy the restriction classes in the alignments, are often due to incorrect links. Thus, our algorithm also provides a novel method to curate the Web of Linked Data.

[Figure 2: Venn diagram of Img(r1) (EducationalInstitution) overlapping Union(r2) (S.SCH, S.SCHC, S.UNIV). Key: Img(r1): Educational Institutions from DBpedia; Union(r2): Schools, Colleges and Universities from GeoNames; Schools from GeoNames; Colleges from GeoNames; Universities from GeoNames; Outliers.]

Fig. 2. Spatial covering of Educational Institutions from DBpedia

Consider the outlier found in the {dbpedia:country = Spain} ≡ {geonames:countryCode = ES} alignment. Of the 3918 instances of dbpedia:country=Spain, 3917 have a link to a geonames:countryCode=ES. The one instance not having country code ES has an assertion of country code IT (Italy) in GeoNames. The algorithm would flag this situation as a possible linking error, since there is overwhelming support for ES being the country code of Spain. A more interesting case occurs in the alignment of {rdf:type = dbpedia:EducationalInstitution} to {geonames:featureCode ∈ {S.SCH, S.SCHC, S.UNIV}}. For {rdf:type = dbpedia:EducationalInstitution}, 396 instances out of the 404 Educational Institutions were accounted for as having their geonames:featureCode as one of S.SCH, S.SCHC or S.UNIV. Of the 8 outliers, 1 does not have a geonames:featureCode property asserted. The other 7 have their feature codes as either S.BLDG (3 buildings), S.EST (1 establishment), S.HSP (1 hospital), S.LIBR (1 library) or S.MUS (1 museum). This case requires more sophisticated curation, and the outliers may indicate a case for multiple inheritance. For example, the hospital instance in GeoNames may be a medical college that could be classified as a university.

Our union alignment algorithm is able to detect other similar outliers. It provides a powerful tool to quickly focus on links that require human curation, or that could be automatically flagged as problematic, and it provides evidence for the error.
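A minimal sketch of how such outliers could be collected once a union alignment has been accepted is shown below. The function name `find_outliers`, the dictionary that maps an instance to its property value, and the six toy instances are all hypothetical; the sketch only mirrors the idea that an outlier is an instance in the image of the larger class whose value falls outside the covered value set, or is missing altogether.

```python
from collections import Counter


def find_outliers(image_larger, value_of, covered_values):
    """Return {instance: value} for instances not explained by the union.

    image_larger:   instances of source 2 linked to the larger class.
    value_of:       dict from instance to its value for the union's property;
                    instances without an assertion are simply absent.
    covered_values: values that make up the accepted union alignment.
    """
    outliers = {}
    for inst in image_larger:
        value = value_of.get(inst)  # None when the property is not asserted
        if value not in covered_values:
            outliers[inst] = value
    return outliers


covered = {"S.SCH", "S.SCHC", "S.UNIV"}
value_of = {"a": "S.SCH", "b": "S.UNIV", "c": "S.HSP", "d": "S.BLDG", "e": "S.BLDG"}
image_larger = {"a", "b", "c", "d", "e", "f"}  # "f" has no featureCode asserted

outliers = find_outliers(image_larger, value_of, covered)
by_value = Counter(outliers.values())  # support per outlier value
print(sorted(outliers), dict(by_value))
```

Grouping the outliers by value, as `by_value` does, gives the kind of per-class support that a curator could use to decide whether a link or a property assertion is the more likely source of the error.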

4 Experimental Results

The results of the union alignment algorithm over the four pairs of sources we consider appear in Table 1. In total, the 7069 union alignments explained (covered) 77966 subset alignments, for a compression ratio of 90%.


Table 1. Union Alignments Found in the 4 Source Pairs

Source1        Source2   Union Alignments 12        Union Alignments 21        Total union
                         (Subset Alignments 12)     (Subset Alignments 21)     alignments
GeoNames       DBpedia   434 (2197)                 318 (7942)                 752
LinkedGeoData  DBpedia   2746 (12572)               3097 (48345)               5843
Geospecies     DBpedia   191 (1226)                 255 (2569)                 446
GeneID         MGI       6 (29)                     22 (3086)                  28
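The totals quoted with Table 1 can be re-derived from its rows, as in the short check below. The reading of "compression ratio" as one minus the number of union alignments divided by the number of subset alignments they cover is our assumption; the paper states the figure without giving a formula.

```python
# Rows of Table 1: (union alignments, subset alignments) in each direction.
table1 = {
    ("GeoNames", "DBpedia"): [(434, 2197), (318, 7942)],
    ("LinkedGeoData", "DBpedia"): [(2746, 12572), (3097, 48345)],
    ("Geospecies", "DBpedia"): [(191, 1226), (255, 2569)],
    ("GeneID", "MGI"): [(6, 29), (22, 3086)],
}

total_union = sum(u for rows in table1.values() for u, _ in rows)
total_subset = sum(s for rows in table1.values() for _, s in rows)

# Assumed definition: fraction of alignments saved by reporting unions.
compression = 1 - total_union / total_subset

print(total_union, total_subset, round(compression, 3))
```

Under that reading the ratio comes out slightly above 0.90, in line with the reported figure of about 90%.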

The resulting alignments were intuitive. Some interesting examples appear in Tables 2, 3 and 4. In the tables, for each union alignment, column 2 describes the large restriction class from ontology1 and column 3 describes the union of the (smaller) classes on ontology2 with the corresponding property and value set. The score of the union is noted in column 4 (R'_U = |U_A| / |U_L|), followed by |U_A| and |U_L| in columns 5 and 6. Column 7 describes the outliers, i.e. values of v2 that form restriction classes that are not direct subsets of the larger restriction class. Each of these outliers also has a fraction with the number of instances that belong to the intersection over the number of instances of the smaller restriction class (or |Img(r1) ∩ r2| / |r2|). It can be seen that the fraction is less than our relaxed subset score. If the value of this fraction were greater than the relaxed subset score (i.e. 0.9), the set would have been included in column 3 instead. The last column mentions how many of the total U_L instances we were able to explain using U_A and the outliers. For example, the union alignment #1 of Table 2 is the Educational Institution example described before. It shows how educational institutions from DBpedia can be explained by schools, colleges and universities in GeoNames. Columns 4, 5 and 6 give the alignment score R'_U (0.98), the size of U_A (396) and the size of U_L (404). Outliers (S.BLDG, S.EST, S.LIBR, S.MUS, S.HSP) along with their P' fractions appear in column 7. We were able to explain 403 of the total 404 instances (see column 8).
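The bookkeeping behind the last two columns can be checked for alignment #1 with a few lines of Python. The counts are the ones reported for that alignment (396 covered instances out of 404, and the five outlier values with their fractions); the variable names are ours, and the assumption that "explained" means covered instances plus outlier instances follows the column description above.

```python
u_a, u_l = 396, 404  # |U_A| and |U_L| for alignment #1

# Outlier value -> (instances in Img(r1) ∩ r2, instances in r2)
outliers = {
    "S.BLDG": (3, 122),
    "S.EST": (1, 13),
    "S.LIBR": (1, 7),
    "S.HSP": (1, 31),
    "S.MUS": (1, 43),
}

# Every outlier fraction stays below the relaxed subset score of 0.9,
# which is why none of these values was added to the union itself.
fractions = {value: hit / size for value, (hit, size) in outliers.items()}
all_below_threshold = all(f < 0.9 for f in fractions.values())

explained = u_a + sum(hit for hit, _ in outliers.values())
unexplained = u_l - explained

print(round(u_a / u_l, 2), explained, unexplained, all_below_threshold)
```

The single unexplained instance corresponds to the one educational institution that has no geonames:featureCode asserted, as noted in Section 3.3.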

We find other interesting alignments, a representative few of which are shown in the tables. In some cases, the union alignments found were intuitive because of an underlying hierarchical nature of the concepts involved, especially in the case of alignments of administrative divisions in geospatial sources and alignments in the biological classification taxonomy. For example, #3 highlights alignments that reflect the containment properties of administrative divisions. Other interesting types of alignment were also found. For example, #7 maps two dissimilar concepts: it explains the license plate codes found in the state (bundesland) of Saarland.^3 Due to lack of space, we explain the other union alignments alongside them in the tables. The complete set of alignments discovered by our algorithm is available on our group page.^4

Outliers. In alignments that had inconsistencies, we identified three main reasons: (i) Incorrect instance alignments - outliers arising out of a possible erroneous equivalence link between instances (e.g. #4, #8, etc.), (ii) Missing instance alignments - insufficient support for coverage due to missing links between instances or missing instances (e.g. #9, etc.), (iii) Incorrect values for properties - outliers arising out of a possible erroneous assertion for a property (e.g. #5, #6, etc.). In the tables, we also mention the classes that these inconsistencies belong to along with their support.

^3 http://www.europlates.com/publish/euro-plate-info/german-city-codes
^4 http://www.isi.edu/integration/data/UnionAlignments

5 Related Work

Ontology alignment and schema matching have been a well explored area of research since the early days of ontologies [3, 1] and have received renewed interest in recent years with the rise of the Semantic Web and Linked Data. Though most work done in the Web of Linked Data is on linking instances across different sources, an increasing number of authors have looked into aligning the source ontologies in the past couple of years. Jain et al. [4] describe the BLOOMS approach, which uses a central forest of concepts derived from topics in Wikipedia. An update to this is the BLOOMS+ approach [5], which aligns Linked Open Data ontologies with an upper-level ontology called Proton. BLOOMS & BLOOMS+ are unable to find alignments because of the small number of classes in GeoNames that have vague declarations. The advantage of our approach over these is that our use of restriction classes is able to find a large set of alignments in cases like aligning GeoNames with DBpedia, where Proton fails due to a rudimentary ontology. Cruz et al. [2] describe a dynamic ontology mapping approach called AgreementMaker that uses similarity measures along with a mediator ontology to find mappings using the labels of the classes. From the subset and equivalent alignments between GeoNames (10 concepts) and DBpedia (257 concepts), AgreementMaker was able to achieve a precision of 26% and a recall of 68%. We believe that since their approach did not consider unions of concepts, it would not have been able to find alignments like the Educational Institutions example (#1) by using only the labels and the structure of the ontology, though a thorough comparison is not possible. In our work, we find equivalent relations between a concept on one side and a union of concepts on another side. CSR [9] is a similar work to ours that tries to align a concept from one ontology to a union of concepts from the other ontology. In their approach, the authors describe how the similarity of properties is used as a feature in predicting the subsumption relationships. It differs from our approach in that it uses a statistical machine learning approach for detection of subsets rather than the extensional approach. An approach that uses statistical methods for finding alignments, similar to our work, has also been described in Völker et al. [10]. This work induces schemas for RDF data sources by generating OWL2 axioms using an intermediate associativity table of instances and concepts (called transaction data-sets) and mining associativity rules from it.

6 Conclusions and Future Work

We described an approach to identifying union alignments in data sources on the Web of Linked Data from the Geospatial, Biological Classification and Genetics domains. By extending our definition of restriction classes with the disjunction operator, we are able to find alignments of union concepts from one source to larger concepts from the other source. Our approach produces coverings where concepts at different levels in the ontologies of two sources can be mapped even when there is no direct equivalence. We are also able to find outliers that enable us to identify inconsistencies in the instances that are linked by looking at the alignment pattern. The results provide deeper insight into the nature of the alignments of Linked Data.

As part of our future work we want to try to find more complete descriptions for the sources. Our preliminary findings show that the results of this paper can be used to find patterns in the properties. For example, the countryCode property in GeoNames is closely associated with the country property in DBpedia, though their ranges are not exactly equal. We believe that an in-depth analysis of the alignment of ontologies of sources is warranted with the recent rise in the links in the Linked Data cloud. This is an extremely important step for the grand Semantic Web vision.

Table 2. Example alignments from GeoNames-DBpedia and LinkedGeoData-DBpedia

Columns: # ; {r1} ; p2 ∈ {v2} ; R'_U = |U_A| / |U_L| ; |U_A| ; |U_L| ; Outliers ; # Explained instances

DBpedia (larger) - GeoNames (smaller)

#1
  {r1}: {rdf:type = dbpedia:EducationalInstitution}
  p2 ∈ {v2}: geonames:featureCode ∈ {S.SCH, S.SCHC, S.UNIV}
  R'_U = 0.9801, |U_A| = 396, |U_L| = 404
  Outliers: S.BLDG (3/122), S.EST (1/13), S.LIBR (1/7), S.HSP (1/31), S.MUS (1/43)
  Explained instances: 403
  As described in Section 4, Schools, Colleges and Universities in GeoNames make Educational Institutions in DBpedia.

#2
  {r1}: {dbpedia:country = dbpedia:Spain}
  p2 ∈ {v2}: geonames:countryCode = ES
  R'_U = 0.9997, |U_A| = 3917, |U_L| = 3918
  Outliers: IT (1/7635)
  Explained instances: 3918
  The concepts for the country Spain are equal in both sources. The only outlier has its country as Italy, an erroneous assertion.

#3
  {r1}: {dbpedia:region = dbpedia:Basse-Normandie}
  p2 ∈ {v2}: geonames:parentADM2 ∈ {geonames:2989247, geonames:2996268, geonames:3029094}
  R'_U = 1.0, |U_A| = 754, |U_L| = 754
  Outliers: -
  Explained instances: 754
  We confirm the hierarchical nature of administrative divisions with alignments between administrative units at two different levels.

#4
  {r1}: {rdf:type = dbpedia:Airport}
  p2 ∈ {v2}: geonames:featureCode ∈ {S.AIRB, S.AIRP}
  R'_U = 0.9924, |U_A| = 1981, |U_L| = 1996
  Outliers: S.AIRF (9/22), S.FRMT (1/5), S.SCH (1/404), S.STNB (2/5), S.STNM (1/36), T.HLL (1/61)
  Explained instances: 1996
  In aligning airports, an airfield should have been an airport. However, there was not enough instance support.

GeoNames (larger) - DBpedia (smaller)

#5
  {r1}: {geonames:countryCode = NL}
  p2 ∈ {v2}: dbpedia:country ∈ {dbpedia:The Netherlands, dbpedia:Flag of the Netherlands.svg, dbpedia:Netherlands}
  R'_U = 0.9802, |U_A| = 1939, |U_L| = 1978
  Outliers: dbpedia:Kingdom of the Netherlands
  Explained instances: 1940
  The alignment for Netherlands should have been as straightforward as #2. However, we have possible alias names, such as The Netherlands and Kingdom of Netherlands, as well as a possible linkage error to Flag of the Netherlands.svg.

#6
  {r1}: {geonames:countryCode = JO}
  p2 ∈ {v2}: dbpedia:country ∈ {dbpedia:Jordan, dbpedia:Flag of Jordan.svg}
  R'_U = 0.95, |U_A| = 19, |U_L| = 20
  Outliers: -
  Explained instances: 20
  The error pattern in #5 seems to repeat systematically, as can be seen from this alignment for the country of Jordan.

DBpedia (larger) - LinkedGeoData (smaller)

#7
  {r1}: {dbpedia:bundesland = Saarland}
  p2 ∈ {v2}: lgd:OpenGeoDBLicensePlateNumber ∈ {HOM, IGB, MZG, NK, SB, SLS, VK, WND}
  R'_U = 0.93, |U_A| = 46, |U_L| = 49
  Outliers: -
  Explained instances: 46
  Our algorithm also produces interesting alignments between different properties. In this case, we find 8 of the 10 license plates in the state of Saarland.

Table 3. Example alignments from LinkedGeoData-DBpedia and Geospecies-DBpedia

Columns: # ; {r1} ; p2 ∈ {v2} ; R'_U = |U_A| / |U_L| ; |U_A| ; |U_L| ; Outliers ; # Explained instances

#8
  {r1}: {rdf:type = dbpedia:EducationalInstitution}
  p2 ∈ {v2}: rdf:type ∈ {lgd:Amenity, lgd:K2543, lgd:School, lgd:University, lgd:WaterTower}
  R'_U = 0.9901, |U_A| = 2609, |U_L| = 2610
  Outliers: -
  Explained instances: 2609
  Educational Institutions in DBpedia can be explained with classes in LinkedGeoData. As an example of an incorrect alignment, a water tower has been linked to an educational institution.

LinkedGeoData (larger) - DBpedia (smaller)

#9
  {r1}: {lgd:gnisSTalpha = NJ}
  p2 ∈ {v2}: dbpedia:subdivisionName ∈ {Atlantic, Burlington, Cape May, Hudson, Hunterdon, Monmoth, New Jersey, Ocean, Passaic}
  R'_U = 1.0, |U_A| = 214, |U_L| = 214
  Outliers: -
  Explained instances: 214
  Due to missing instance alignments, this union alignment incorrectly claims that the state of New Jersey is composed of 9 counties while it actually has 21.

#10
  {r1}: {rdf:type = lgd:Waterway}
  p2 ∈ {v2}: rdf:type ∈ {dbpedia:River, dbpedia:Stream}
  R'_U = 0.97, |U_A| = 33, |U_L| = 34
  Outliers: dbpedia:Place (1/94989)
  Explained instances: 34
  Waterways in LinkedGeoData are equal to the union of streams and rivers from DBpedia.

DBpedia (larger) - Geospecies (smaller)

#11
  {r1}: {rdf:type = dbpedia:Amphibian}
  p2 ∈ {v2}: geospecies:hasOrderName ∈ {Anura, Caudata, Gymnophionia}
  R'_U = 0.99, |U_A| = 90, |U_L| = 91
  Outliers: Testudines (1/7)
  Explained instances: 91
  Species from Geospecies with the order names Anura, Caudata & Gymnophionia are all Amphibians. We also find inconsistencies due to misaligned instances, e.g. one Turtle (Testudines) was classified as an amphibian.

#12
  {r1}: {rdf:type = dbpedia:Salamander}
  p2 ∈ {v2}: {geospecies:hasOrderName = Caudata}
  R'_U = 0.94, |U_A| = 16, |U_L| = 17
  Outliers: Testudines (1/7)
  Explained instances: 17
  Upon further inspection of #11, we find that the culprit is a Salamander.

Geospecies (larger) - DBpedia (smaller)

#13
  {r1}: {rdf:type = dbpedia:Plant}
  p2 ∈ {v2}: {geospecies:inKingdom = geospecies:kingdoms/Ab}
  R'_U = 0.99, |U_A| = 1874, |U_L| = 1876
  Outliers: geospecies:kingdoms/Ac (1/8)
  Explained instances: 1875
  The Kingdom Plantae, from both sources, almost matches perfectly. The only inconsistent instance happens to be a fungus.

Table 4. Example alignments from GeneID-MGI

Columns: # ; {r1} ; p2 ∈ {v2} ; R'_U = |U_A| / |U_L| ; |U_A| ; |U_L| ; Outliers ; # Explained instances

#14
  {r1}: {geospecies:inOrder = geospecies:orders/jtSaY}
  p2 ∈ {v2}: dbpedia:ordo ∈ {dbpedia:Carnivora, dbpedia:Carnivore}
  R'_U = 0.99, |U_A| = 247, |U_L| = 247
  Outliers: -
  Explained instances: 247
  Inconsistencies in the object values can also be seen: Carnivores from Geospecies are aligned with both Carnivora & Carnivore.

#15
  {r1}: {geospecies:hasOrderName = Chiroptera}
  p2 ∈ {v2}: dbpedia:ordo ∈ {Chiroptera@en, dbpedia:Bat}
  R'_U = 1, |U_A| = 111, |U_L| = 111
  Outliers: -
  Explained instances: 111
  We can detect that species with order Chiroptera correctly belong to the order of Bats. Unfortunately, due to values of the property being the literal "Chiroptera@en", the alignment is not clean.

GeneID (larger) - MGI (smaller)

#16
  {r1}: {bio2rdf:subType = pseudo}
  p2 ∈ {v2}: {bio2rdf:subType = Pseudogene}
  R'_U = 0.93, |U_A| = 5919, |U_L| = 6317
  Outliers: Gene (318/24692)
  Explained instances: 6237
  Due to the absence of a clear hierarchy, we found only a few hierarchical relations, for example, alignments of the class Pseudogenes.

#17
  {r1}: {bio2rdf:xTaxon = taxon:10090}
  p2 ∈ {v2}: bio2rdf:subType ∈ {Complex Cluster/Region, DNA Segment, Gene, Pseudogene}
  R'_U = 1, |U_A| = 30993, |U_L| = 30993
  Outliers: -
  Explained instances: 30993
  The Mus Musculus (house mouse) taxonomy is completely composed of complex clusters, DNA segments, Genes and Pseudogenes.

MGI (larger) - GeneID (smaller)

#18
  {r1}: {bio2rdf:subType = Pseudogene}
  p2 ∈ {v2}: bio2rdf:subType = pseudo
  R'_U = 0.94, |U_A| = 5919, |U_L| = 6297
  Outliers: other (4/230), protein-coding (351/39999), unknown (23/570)
  Explained instances: 6297
  Inconsistencies are also evident, as the values pseudo and Pseudogene are used to denote the same thing.

#19
  {r1}: {mgi:genomeStart = 1}
  p2 ∈ {v2}: geneid:location ∈ {1, 1 0.0 cM, 1 1.0 cM, 1 10.4 cM, ...}
  R'_U = 0.98, |U_A| = 1697, |U_L| = 1735
  Outliers: "" (37/1048), 5 (1/52)
  Explained instances: 1735

#20
  {r1}: {mgi:genomeStart = X}
  p2 ∈ {v2}: geneid:location ∈ {X, X 0.5 cM, X 0.8 cM, X 1.0 cM, ...}
  R'_U = 0.99, |U_A| = 1748, |U_L| = 1758
  Outliers: "" (10/1048)
  Explained instances: 1758
  We find interesting alignments like #19 & #20, which align the genome start position in MGI with the location in GeneID. As can be seen, the values of the locations (distances in centimorgans) in GeneID contain the genome start value as a prefix. Inconsistencies are also seen, e.g. in #19, where a gene that starts with 5 is misaligned, and in #20, where the value is an empty string.

References

1. Bernstein, P., Madhavan, J., Rahm, E.: Generic schema matching, ten years later. Proceedings of the VLDB Endowment 4(11) (2011)

2. Cruz, I., Palmonari, M., Caimi, F., Stroe, C.: Towards on the go matching of linked open data ontologies. In: Workshop on Discovering Meaning On The Go in Large Heterogeneous Data. p. 37 (2011)

3. Euzenat, J., Shvaiko, P.: Ontology matching. Springer-Verlag (2007)

4. Jain, P., Hitzler, P., Sheth, A., Verma, K., Yeh, P.: Ontology alignment for linked open data. The Semantic Web - ISWC 2010 pp. 402-417 (2010)

5. Jain, P., Yeh, P., Verma, K., Vasquez, R., Damova, M., Hitzler, P., Sheth, A.: Contextual ontology alignment of lod with an upper ontology: A case study with proton. The Semantic Web: Research and Applications pp. 80-92 (2011)

6. Parundekar, R., Ambite, J.L., Knoblock, C.A.: Aligning unions of concepts in ontologies of geospatial linked data. In: Proceedings of the Terra Cognita 2011 Workshop in Conjunction with the 10th International Semantic Web Conference. Bonn, Germany (2011)

7. Parundekar, R., Knoblock, C.A., Ambite, J.L.: Aligning geospatial ontologies on the linked data web. In: Proceedings of the GIScience Workshop on Linked Spatiotemporal Data. Zurich, Switzerland (2010)

8. Parundekar, R., Knoblock, C.A., Ambite, J.L.: Linking and building ontologies of linked data. In: Proceedings of the 9th International Semantic Web Conference (ISWC 2010). Shanghai, China (2010)

9. Spiliopoulos, V., Valarakos, A., Vouros, G.: Csr: discovering subsumption relations for the alignment of ontologies. The Semantic Web: Research and Applications pp. 418-431 (2008)

10. Völker, J., Niepert, M.: Statistical schema induction. The Semantic Web: Research and Applications pp. 124-138 (2011)
