Finding Concept Coverings in Aligning Ontologies of Linked Data
Rahul Parundekar, Craig A. Knoblock, and José Luis Ambite
Information Sciences Institute and Department of Computer Science, University of Southern California
4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292
{parundek,knoblock,ambite}@usc.edu
Abstract. Despite the recent growth in the size of the Linked Data Cloud, the absence of links between the vocabularies of the sources has resulted in heterogeneous schemas. Our previous work tried to find conceptual mappings between two sources and was successful in finding alignments, such as equivalence and subset relations, using the instances that are linked as equal. By using existential concepts and their intersections to define specialized classes (restriction classes), we were able to find alignments where previously existing concepts in one source did not have corresponding equivalent concepts in the other source. Upon inspection, we found that though we were able to find a good number of alignments, we were unable to completely cover one source with the other. In many cases we observed that even though a larger class could be defined completely by the multiple smaller classes that it subsumed, we were unable to find these alignments because our definition of restriction classes did not contain the disjunction operator to define a union of concepts. In this paper we propose a method that discovers alignments such as these, where a (larger) concept of the first source is aligned to the union of the subsumed (smaller) concepts from the other source. We apply this new algorithm to the Geospatial, Biological Classification, and Genetics domains and show that this approach is able to discover numerous concept coverings, where (in most cases) the subsumed classes are disjoint. The resulting alignments are useful for determining the mappings between ontologies, refining existing ontologies, and finding inconsistencies that may indicate that some instances have been erroneously aligned.
1 Introduction
The Web of Linked Data has seen huge growth in the past few years. As of September 2011, the Linked Open Data Cloud has grown to a size of 31.6 billion triples. This includes a wide range of data sources belonging to the government (42%), geographic (19.4%), life sciences (9.6%) and other domains¹. A common way that the instances in these sources are linked to others is the use of the owl:sameAs property. Though the size of the Linked Data Cloud seems to be increasing drastically (10% over the 28.5 billion triples in 2010), inspection of the sources at the ontology level reveals that only a few of them (15 out of the 190 sources) have some mapping of the vocabularies. For the success of the Semantic Web, it is important that these heterogeneous schemas be linked. As described in our previous papers on Linking and Building Ontologies of Linked Data [8] and Aligning Ontologies of Geospatial Linked Data [7], an extensional technique can be used to generate alignments between the ontologies behind these sources. In these previous papers, we introduced the concept of restriction classes, which are sets of instances that satisfy a conjunction of value restrictions on properties (property-value pairs).
¹ http://www4.wiwiss.fu-berlin.de/lodcloud/state/
Though our algorithm was able to identify a good number of alignments, it was unable to completely cover one source with the classes in the other source. On closer inspection, we found that most of the classes we failed to align did not have a corresponding restriction class in the other source, and instead subsumed multiple restriction classes. While reviewing these subset relations, we discovered that in many cases the union of the smaller classes completely covered the larger class. In this paper, we describe how we extend our previous work to discover such concept coverings by introducing a more expressive set of class descriptions (unions of value restrictions).² In most of these coverings, the smaller classes are also found to be disjoint. In addition, further analysis of the alignments of these coverings provides a powerful tool to discover incorrect links in the Web of Linked Data, which can potentially be used to point out and rectify the inconsistencies in the instance alignments.
This paper is organized as follows. First, we describe the Linked Open Data sources that we try to align in the paper. Second, we briefly review our alignment algorithm from [8] along with the limitations of the results that were generated. Third, we describe our approach to finding alignments between unions of restriction classes. Fourth, we describe how outliers in these alignments help to identify inconsistencies and erroneous links. Fifth, we describe the experimental results on union alignments over additional domains. Finally, we compare against related work, and discuss our contributions and future work.
2 Sources Used for Alignments
Linked Data, by definition, links the instances of multiple sources. Often, sources conform to different, but related, ontologies that can also be meaningfully linked [8]. In this section we describe some of the sources from different domains that we try to align, whose instances are linked using an equivalence property like owl:sameAs.
Linking GeoNames with places in DBpedia: DBpedia (dbpedia.org) is a knowledge base that covers multiple domains including around 526,000 places and other geographical features from the Geospatial domain. We try to align the concepts in DBpedia with GeoNames (geonames.org), which is a geographic source with about 7.8 million things. It uses a flat-file-like ontology, where all instances belong to a single concept of Feature. This makes the ontology rudimentary, with the type data (e.g. mountains, lakes, etc.) about these geographical features held instead in the Feature Class & Feature Code properties.
² This work is an extended version of our workshop paper [6]. We have extended the method to find coverings in the Biological Classification and Genetics domains.
Linking LinkedGeoData with places in DBpedia: We also try to find alignments between the ontologies behind LinkedGeoData (linkedgeodata.org) and DBpedia. LinkedGeoData is derived from the Open Street Map initiative with around 101,000 instances linked to DBpedia using the owl:sameAs property.
Linking species from Geospecies with DBpedia: The Geospecies (geospecies.org) knowledge base contains species belonging to plant, animal, and other kingdoms linked to species in DBpedia using the skos:closeMatch property. Since the instances in the taxonomies in both these sources are the same, the sources are ideal for finding the alignment between the vocabularies.
Linking genes from GeneID with MGI: The Bio2RDF (bio2rdf.org) project contains inter-linked life sciences data extracted from multiple data-sets that cover genes, chemicals, enzymes, etc. We consider two sources from the Genetics domain from Bio2RDF, GeneID (extracted from the National Center for Biotechnology Information database) and MGI (extracted from the Mouse Genome Informatics project), where the genes are marked equivalent.
Although we provide results for all four of the above alignments in Section 4, in the rest of this paper we explain our methodology by using the alignment of GeoNames with DBpedia as an example.
3 Aligning Ontologies on the Web of Linked Data
First, we briefly describe our previous work on finding subset and equivalent alignments between restriction classes from two ontologies. Then, we describe how to use the subset alignments to find more expressive union alignments. Finally, we discuss how outliers in these union alignments often identify incorrect links in the Web of Linked Data.
3.1 Our previous work on aligning ontologies of Linked Data
In [8] we introduced the concept of restriction classes to align extensional concepts in two sources. A restriction class is a concept that is derived extensionally and defined by a conjunction of value restrictions for properties (called property-value pairs) in a source. Such a definition helps overcome the problem of aligning rudimentary ontologies with more sophisticated ones. For example, GeoNames only has a single concept (Feature) to which all of its instances belong, while DBpedia has a rich ontology. However, Feature has several properties that can be used to define more meaningful classes. For example, the set of instances in GeoNames with the value PPL in the property featureCode nicely aligns with the instances of City in DBpedia.
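As a rough illustration of the definition above, the following Python sketch builds single-pair restriction classes from a list of triples and intersects them to form a conjunction. The function names, the triple layout, and the instance identifiers are assumptions made for this example; they are not taken from the implementation in [8].

```python
from collections import defaultdict


def restriction_classes(triples):
    """Map each (property, value) pair to the set of instances that satisfy it.

    `triples` is an iterable of (instance, property, value) tuples. Each key of
    the result is a single property-value pair; its value is the extension of
    the corresponding restriction class.
    """
    classes = defaultdict(set)
    for instance, prop, value in triples:
        classes[(prop, value)].add(instance)
    return dict(classes)


def conjunction(classes, *pairs):
    """Extension of the restriction class defined by a conjunction of pairs."""
    extensions = [classes.get(pair, set()) for pair in pairs]
    return set.intersection(*extensions) if extensions else set()


# Toy GeoNames-like data; the instance identifiers are hypothetical.
triples = [
    ("gn:1", "featureCode", "PPL"), ("gn:1", "countryCode", "ES"),
    ("gn:2", "featureCode", "PPL"), ("gn:2", "countryCode", "IT"),
    ("gn:3", "featureCode", "S.SCH"), ("gn:3", "countryCode", "ES"),
]
classes = restriction_classes(triples)
populated_places = classes[("featureCode", "PPL")]
spanish_populated_places = conjunction(
    classes, ("featureCode", "PPL"), ("countryCode", "ES")
)
```

Keeping the extensions as plain sets means that the later set-containment tests reduce to ordinary set intersection.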
Our algorithm explored the space of restriction classes from two ontologies and was able to find equivalent and subset alignments between these restriction classes. Fig. 1 illustrates the instance sets considered to score an alignment hypothesis. We first find the instances belonging to the restriction class r1 from the first source and r2 from the second source. We then compute the image of r1 (denoted by I(r1)), which is the set of instances from the second source linked to instances in r1 (dashed lines in the figure). By comparing r2 with the intersection of I(r1) and r2 (shaded region), we can determine the relationship between r1 and r2. We defined two metrics P and R, as the ratio of |I(r1) ∩ r2| to |I(r1)| and |r2| respectively, to quantify set-containment relations. For example, two classes are equivalent if P = R = 1. In order to allow a certain margin of error induced by the data-set, we used the relaxed versions P′ and R′ as part of our scoring mechanism. In this case, two classes were considered equivalent if P′ > 0.9 and R′ > 0.9. For example, consider the alignment between restriction classes (lgd:gnis%3AST_alpha=NJ) from LinkedGeoData and (dbpedia:Place#type=http://dbpedia.org/resource/City_(New_Jersey)) from DBpedia. Based on the extension sets, our algorithm finds |I(r1)| = 39, |r2| = 40, |I(r1) ∩ r2| = 39, R′ = 0.97 and P′ = 1.0. Based on our error margins, we assert the alignment as equivalent in an extensional sense. The exploration of the space of alignments and the scoring procedure is described in detail in [8].
[Figure 1: Venn diagram of r1 (the set of instances from O1 where r1 holds), I(r1) (the set of instances from O2 paired to instances from O1 where r1 holds) and r2 (the set of instances from O2 where r2 holds); the shaded overlap is the set of instance pairs where both r1 and r2 hold.]
Fig. 1. Comparing the linked instances from two ontologies
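The scoring step described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the links are modelled as a one-to-one dictionary, the plain ratios P and R stand in for the relaxed scores P′ and R′ (whose exact definition is given in [8], not here), and the subset rules in `relation` are one reading of the set-containment idea rather than the authors' code. The instance identifiers are invented so that the set sizes match the New Jersey example.

```python
def image(r1, links):
    """Instances of the second source that are linked to an instance in r1."""
    return {links[i] for i in r1 if i in links}


def score(r1, r2, links):
    """Return (P, R): the overlap of I(r1) and r2 relative to |I(r1)| and |r2|."""
    img = image(r1, links)
    overlap = img & r2
    p = len(overlap) / len(img) if img else 0.0
    r = len(overlap) / len(r2) if r2 else 0.0
    return p, r


def relation(p, r, threshold=0.9):
    """Classify an alignment hypothesis (the subset rules are an assumption)."""
    if p > threshold and r > threshold:
        return "equivalent"
    if p > threshold:
        return "r1 subset of r2"
    if r > threshold:
        return "r2 subset of r1"
    return "no alignment"


# Sizes taken from the New Jersey example: |I(r1)| = 39, |r2| = 40, overlap 39.
r1 = {f"lgd:{i}" for i in range(39)}
links = {f"lgd:{i}": f"db:{i}" for i in range(39)}
r2 = {f"db:{i}" for i in range(40)}

p, r = score(r1, r2, links)
verdict = relation(p, r)
```

With these sizes the sketch gives P = 1.0 and R = 0.975, so the hypothesis is classified as equivalent, in line with the scores reported in the text.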
Though the approach produced a large number of equivalent alignments, we were not able to find a complete coverage because some restriction classes did not have a corresponding equivalent restriction class and instead subsumed multiple smaller restriction classes. For example, in the GeoNames and DBpedia alignment, we found that {rdf:type=dbpedia:EducationalInstitution} from DBpedia subsumed {geonames:featureCode=S.SCH}, {geonames:featureCode=S.SCHC} and {geonames:featureCode=S.UNIV} (i.e. Schools, Colleges and Universities from GeoNames). We discovered that, taken together, the union of these three restriction classes completely defines {rdf:type=dbpedia:EducationalInstitution}. To find such previously undetected alignments, we decided to extend the expressivity of our restriction classes by introducing a disjunction operator to detect concept coverings completely.
3.2 Identifying union alignments
In our current work, we use the subset and equivalent alignments generated by the previous work to try to align a larger class from one ontology with a union of smaller subsumed restriction classes in the other ontology. Since the problem of finding alignments with conjunctions and disjunctions of property-value pairs of restriction classes is combinatorial in nature, we focus only on subset and equivalence relations that map to a restriction class with a single property-value pair. This helps us find the simplest definitions of concepts and also makes the problem tractable. Alignments generated by our previous work that satisfy the single property-value pair constraint are first grouped according to the subsuming restriction classes and then according to the property of the smaller classes. Since restriction classes are constructed by forming a set of instances that have one of the properties restricted to a single value, aggregating restriction classes from the group according to their properties builds a more intuitive definition of the union. We can now define the disjunction operator that constructs the union concept from the smaller restriction classes in these sub-groups. The disjunction operator is defined for restriction classes such that i) the concept formed by the disjunction of the classes represents the union of their sets of instances, ii) each of the classes that are aggregated contains only a single property-value pair and iii) the property for all those property-value pairs is the same. We then try to detect the alignment between the larger common restriction class and the union by using an extensional approach similar to our previous paper. We call such an alignment a hypothesis union alignment.
We define $U_S$ as the set of instances that is the union of the individual smaller restriction classes, $Union(r_2)$; $U_L$ as the image of the larger class by itself, $Img(r_1)$; and $U_A$ as the overlap between these sets, $Union(Img(r_1) \cap r_2)$. We check whether the larger restriction class is equivalent to the union concept by using scoring functions analogous to $P'$ and $R'$ from our previous paper. The new scoring mechanism defines $P'_U = \frac{|U_A|}{|U_S|}$ and $R'_U = \frac{|U_A|}{|U_L|}$, with relaxed scoring assumptions as in $P'$ and $R'$. To accommodate errors in the data-set, we consider it a complete coverage when the score is greater than a relaxed score of 0.9. That is, the hypothesis union alignment is considered equivalent if $P'_U > 0.9$ and $R'_U > 0.9$. Since, by construction, each of the subsets already satisfies $P' > 0.9$, we are assured that $P'_U$ is always going to be greater than 0.9. Thus, a union alignment is equivalent if $R'_U > 0.9$.
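The last step can be spelled out for the simplest case. The derivation below assumes, beyond what the text states, that the smaller classes $r_2^{(1)}, \dots, r_2^{(n)}$ are pairwise disjoint and that each relaxed score behaves like the plain ratio $|Img(r_1) \cap r_2^{(i)}| / |r_2^{(i)}|$; the superscript notation is introduced here only for illustration.

```latex
\[
P'_U \;=\; \frac{|U_A|}{|U_S|}
     \;=\; \frac{\sum_{i=1}^{n} \bigl| Img(r_1) \cap r_2^{(i)} \bigr|}
                {\sum_{i=1}^{n} \bigl| r_2^{(i)} \bigr|}
     \;>\; \frac{\sum_{i=1}^{n} 0.9 \, \bigl| r_2^{(i)} \bigr|}
                {\sum_{i=1}^{n} \bigl| r_2^{(i)} \bigr|}
     \;=\; 0.9 .
\]
```

Under these assumptions only $R'_U$ remains to be checked, which is why the equivalence test reduces to $R'_U > 0.9$.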
Figure 2 provides an example of the approach. Our previous algorithm finds that {geonames:featureCode = S.SCH}, {geonames:featureCode = S.SCHC} and {geonames:featureCode = S.UNIV} are subsets of {rdf:type = dbpedia:EducationalInstitution}. As can be seen in the Venn diagram in Figure 2, $U_L$ is Img({rdf:type = dbpedia:EducationalInstitution}), $U_S$ is {geonames:featureCode = S.SCH} ∪ {geonames:featureCode = S.SCHC} ∪ {geonames:featureCode = S.UNIV}, and $U_A$ is the intersection of the two. With the educational institutions example, $R'_U$ for the alignment of dbpedia:EducationalInstitution to the union of S.SCH, S.SCHC & S.UNIV is 0.98. We can thus confirm the hypothesis and consider this union alignment equivalent. Section 4 shows additional examples of union alignments.
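The grouping and scoring steps of this section can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function name, the tuple layout, and the plain ratio used in place of the relaxed $R'_U$ are assumptions. The toy data reproduces only the totals from the example (396 of 404 instances covered); how the 396 instances split across the three feature codes is invented.

```python
from collections import defaultdict


def union_alignments(images, subset_alignments, threshold=0.9):
    """Score one hypothesis union alignment per (larger class, property) group.

    images: maps a larger class to Img(r1), its image in the other source.
    subset_alignments: (larger, property, value, members) tuples, where
        `members` is the extension of the smaller class {property = value}.
    """
    # Group first by the subsuming class, then by the property of the
    # smaller classes, as described in the text.
    groups = defaultdict(dict)
    for larger, prop, value, members in subset_alignments:
        groups[(larger, prop)][value] = set(members)

    results = []
    for (larger, prop), by_value in groups.items():
        u_l = images[larger]                   # U_L = Img(r1)
        u_s = set().union(*by_value.values())  # U_S = union of smaller classes
        u_a = u_l & u_s                        # U_A = overlap of the two
        r_u = len(u_a) / len(u_l) if u_l else 0.0
        results.append({
            "larger": larger,
            "property": prop,
            "values": sorted(by_value),
            "U_A": len(u_a),
            "U_L": len(u_l),
            "R_U": r_u,
            "equivalent": r_u > threshold,
        })
    return results


edu = "rdf:type=dbpedia:EducationalInstitution"
images = {edu: {f"gn:{i}" for i in range(404)}}
subset_alignments = [
    (edu, "geonames:featureCode", "S.SCH", {f"gn:{i}" for i in range(0, 300)}),
    (edu, "geonames:featureCode", "S.SCHC", {f"gn:{i}" for i in range(300, 350)}),
    (edu, "geonames:featureCode", "S.UNIV", {f"gn:{i}" for i in range(350, 396)}),
]
result = union_alignments(images, subset_alignments)[0]
```

Grouping on the pair (larger class, property) enforces condition iii) of the disjunction operator, since every union is built from values of a single property.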
3.3 Using outliers in union mappings to identify linked data errors
The computation of union alignments allows for a margin of error in the subset computation. It turns out that the outliers, the instances that do not satisfy the restriction classes in the alignments, are often due to incorrect links. Thus, our algorithm also provides a novel method to curate the Web of Linked Data.
[Figure 2: Venn diagram of Img(r1), the Educational Institutions from DBpedia (EducationalInstitution), and Union(r2), the Schools (S.SCH), Colleges (S.SCHC) and Universities (S.UNIV) from GeoNames; the instances of Img(r1) that fall outside the union are marked as outliers.]
Fig. 2. Spatial covering of Educational Institutions from DBpedia
Consider the outlier found in the {dbpedia:country = Spain} ≡ {geonames:countryCode = ES} alignment. Of the 3918 instances of dbpedia:country=Spain, 3917 have a link to a geonames:countryCode=ES instance. The one instance not having country code ES has an assertion of country code IT (Italy) in GeoNames. The algorithm would flag this situation as a possible linking error, since there is overwhelming support for ES being the country code of Spain. A more interesting case occurs in the alignment of {rdf:type = dbpedia:EducationalInstitution} to {geonames:featureCode ∈ {S.SCH, S.SCHC, S.UNIV}}. For {rdf:type = dbpedia:EducationalInstitution}, 396 instances out of the 404 Educational Institutions were accounted for as having their geonames:featureCode as one of S.SCH, S.SCHC or S.UNIV. Of the 8 outliers, 1 does not have a geonames:featureCode property asserted. The other 7 have their feature codes as either S.BLDG (3 buildings), S.EST (1 establishment), S.HSP (1 hospital), S.LIBR (1 library) or S.MUS (1 museum). This case requires more sophisticated curation, and the outliers may indicate a case for multiple inheritance. For example, the hospital instance in GeoNames may be a medical college that could be classified as a university.
Our union alignment algorithm is able to detect other, similar outliers. It provides a powerful tool to quickly focus on links that require human curation, or that could be automatically flagged as problematic, and it provides evidence for the error.
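The outlier check described in this section can be sketched in Python as follows. The function name and the data layout (a dictionary from each instance in the image of the larger class to its value for the smaller property, with None when the property is not asserted) are assumptions for illustration. The counts follow the Educational Institutions example; the instance identifiers are hypothetical, and the 396 covered instances are all given the value S.SCH as a placeholder.

```python
from collections import Counter


def find_outliers(image_values, covered_values):
    """Return the outlier instances and a count of their property values.

    image_values: maps each instance in Img(r1) to its value for the property
        of the union (None if the property is not asserted).
    covered_values: the set of values that make up the union alignment.
    """
    outliers = {
        instance: value
        for instance, value in image_values.items()
        if value not in covered_values
    }
    return outliers, Counter(outliers.values())


# 396 covered instances (placeholder value) plus the 8 outliers from the text.
values = (
    ["S.SCH"] * 396
    + ["S.BLDG"] * 3
    + ["S.EST", "S.HSP", "S.LIBR", "S.MUS", None]
)
image_values = {f"db:{i}": v for i, v in enumerate(values)}

outliers, counts = find_outliers(image_values, {"S.SCH", "S.SCHC", "S.UNIV"})
```

The per-value counts are what a curator would inspect first: a value with a single supporting instance, such as the hospital here, is a candidate for either a wrong link or a legitimate second type.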
4 Experimental Results
The results of the union alignment algorithm over the four pairs of sources we consider appear in Table 1. In total, the 7069 union alignments explained (covered) 77966 subset alignments, for a compression ratio of 90%.
Table 1. Union Alignments Found in the 4 Source Pairs
| Source1 | Source2 | Union Alignments 12 (Subset Alignments 12) | Union Alignments 21 (Subset Alignments 21) | Total union alignments |
|---|---|---|---|---|
| GeoNames | DBpedia | 434 (2197) | 318 (7942) | 752 |
| LinkedGeoData | DBpedia | 2746 (12572) | 3097 (48345) | 5843 |
| Geospecies | DBpedia | 191 (1226) | 255 (2569) | 446 |
| GeneID | MGI | 6 (29) | 22 (3086) | 28 |
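The totals quoted before Table 1 can be checked with a few lines of Python. The counts are copied from the table; reading the compression ratio as 1 minus the number of union alignments divided by the number of subset alignments is an assumption, since the text does not define the ratio explicitly.

```python
# (union alignments, subset alignments) per direction, copied from Table 1.
rows = {
    ("GeoNames", "DBpedia"): ((434, 2197), (318, 7942)),
    ("LinkedGeoData", "DBpedia"): ((2746, 12572), (3097, 48345)),
    ("Geospecies", "DBpedia"): ((191, 1226), (255, 2569)),
    ("GeneID", "MGI"): ((6, 29), (22, 3086)),
}

total_unions = sum(u12 + u21 for (u12, _), (u21, _) in rows.values())
total_subsets = sum(s12 + s21 for (_, s12), (_, s21) in rows.values())

# Assumed definition: the share of subset alignments that no longer need to
# be listed once they are folded into union alignments.
compression = 1 - total_unions / total_subsets
```

Under this reading the ratio comes out slightly above 0.90, which agrees with the 90% figure in the text.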
The resulting alignments were intuitive. Some interesting examples appear in Tables 2, 3 and 4. In the tables, for each union alignment, column 2 describes the large restriction class from ontology 1 and column 3 describes the union of the (smaller) classes from ontology 2 with the corresponding property and value set. The score of the union is noted in column 4 ($R'_U = \frac{|U_A|}{|U_L|}$), followed by $|U_A|$ and $|U_L|$ in columns 5 and 6. Column 7 describes the outliers, i.e. values of $v_2$ that form restriction classes that are not direct subsets of the larger restriction class. Each of these outliers also has a fraction giving the number of instances that belong to the intersection over the number of instances of the smaller restriction class (or $\frac{|Img(r_1) \cap r_2|}{|r_2|}$). It can be seen that the fraction is less than our relaxed subset score. If the value of this fraction were greater than the relaxed subset score (i.e. 0.9), the set would have been included in column 3 instead. The last column states how many of the total $U_L$ instances we were able to explain using $U_A$ and the outliers. For example, union alignment #1 of Table 2 is the Educational Institution example described before. It shows how educational institutions from DBpedia can be explained by schools, colleges and universities in GeoNames. Columns 4, 5 and 6 give the alignment score $R'_U$ (0.98), the size of $U_A$ (396) and the size of $U_L$ (404). Outliers (S.BLDG, S.EST, S.LIBR, S.MUS, S.HSP) along with their $P'$ fractions appear in column 7. We were able to explain 403 of the total 404 instances (see column 8).
We find other interesting alignments, a representative few of which are shown in the tables. In some cases, the union alignments found were intuitive because of an underlying hierarchical nature of the concepts involved, especially in the case of alignments of administrative divisions in geospatial sources and alignments in the biological classification taxonomy. For example, #3 highlights alignments that reflect the containment properties of administrative divisions. Other interesting types of alignment were also found. For example, #7 tries to map two non-similar concepts. It explains the license plate codes found in the state (bundesland) of Saarland³. Due to lack of space, we explain the other union alignments alongside them in the tables. The complete set of alignments discovered by our algorithm is available on our group page.⁴
Outliers. In alignments that had inconsistencies, we identified three main reasons: (i) Incorrect instance alignments - outliers arising out of a possible erroneous equivalence link between instances (e.g. #4, #8, etc.), (ii) Missing instance alignments - insufficient support for coverage due to missing links between instances or missing instances (e.g. #9, etc.), (iii) Incorrect values for properties - outliers arising out of a possible erroneous assertion for a property (e.g. #5, #6, etc.). In the tables, we also mention the classes that these inconsistencies belong to along with their support.
³ http://www.europlates.com/publish/euro-plate-info/german-city-codes
⁴ http://www.isi.edu/integration/data/UnionAlignments
5 Related Work
Ontology alignment and schema matching have been a well-explored area of research since the early days of ontologies [3, 1] and have received renewed interest in recent years with the rise of the Semantic Web and Linked Data. Though most work done in the Web of Linked Data is on linking instances across different sources, an increasing number of authors have looked into aligning the source ontologies in the past couple of years. Jain et al. [4] describe the BLOOMS approach, which uses a central forest of concepts derived from topics in Wikipedia. An update to this is the BLOOMS+ approach [5], which aligns Linked Open Data ontologies with an upper-level ontology called Proton. BLOOMS and BLOOMS+ are unable to find alignments because of the small number of classes in GeoNames that have vague declarations. The advantage of our approach over these is that our use of restriction classes is able to find a large set of alignments in cases like aligning GeoNames with DBpedia, where Proton fails due to a rudimentary ontology. Cruz et al. [2] describe a dynamic ontology mapping approach called AgreementMaker that uses similarity measures along with a mediator ontology to find mappings using the labels of the classes. From the subset and equivalent alignments between GeoNames (10 concepts) and DBpedia (257 concepts), AgreementMaker was able to achieve a precision of 26% and a recall of 68%. We believe that since their approach did not consider unions of concepts, it would not have been able to find alignments like the Educational Institutions example (#1) by using only the labels and the structure of the ontology, though a thorough comparison is not possible. In our work, we find equivalent relations between a concept on one side and a union of concepts on another side. CSR [9] is a similar work to ours that tries to align a concept from one ontology to a union of concepts from the other ontology. In their approach, the authors describe how the similarity of properties is used as a feature in predicting the subsumption relationships. It differs from our approach in that it uses a statistical machine learning approach for detection of subsets rather than the extensional approach. An approach that uses statistical methods for finding alignments, similar to our work, has also been described in Völker et al. [10]. This work induces schemas for RDF data sources by generating OWL2 axioms using an intermediate associativity table of instances and concepts (called transaction data-sets) and mining associativity rules from it.
6 Conclusions and Future Work
We described an approach to identifying union alignments in data sources on the Web of Linked Data from the Geospatial, Biological Classification and Genetics domains. By extending our definition of restriction classes with the disjunction operator, we are able to find alignments of union concepts from one source to larger concepts from the other source. Our approach produces coverings where concepts at different levels in the ontologies of two sources can be mapped even when there is no direct equivalence. We are also able to find outliers that enable us to identify inconsistencies in the instances that are linked by looking at the alignment pattern. The results provide deeper insight into the nature of the alignments of Linked Data.
As part of our future work we want to try to find a more complete description of the sources. Our preliminary findings show that the results of this paper can be used to find patterns in the properties. For example, the countryCode property in GeoNames is closely associated with the country property in DBpedia, though their ranges are not exactly equal. We believe that an in-depth analysis of the alignment of ontologies of sources is warranted with the recent rise in the links in the Linked Data cloud. This is an extremely important step for the grand Semantic Web vision.
Table 2. Example alignments from GeoNames - DBpedia and LinkedGeoData - DBpedia
| # | {r1} | p2 ∈ {v2} | $R'_U$ | $\lvert U_A \rvert$ | $\lvert U_L \rvert$ | Outliers | # Explained Instances |
|---|---|---|---|---|---|---|---|
| | DBpedia (larger) - GeoNames (smaller) | | | | | | |
| 1 | {rdf:type = dbpedia:EducationalInstitution} | geonames:featureCode ∈ {S.SCH, S.SCHC, S.UNIV} | 0.9801 | 396 | 404 | S.BLDG (3/122), S.EST (1/13), S.LIBR (1/7), S.HSP (1/31), S.MUS (1/43) | 403 |
| 2 | {dbpedia:country = dbpedia:Spain} | geonames:countryCode = ES | 0.9997 | 3917 | 3918 | IT (1/7635) | 3918 |
| 3 | {dbpedia:region = dbpedia:Basse-Normandie} | geonames:parentADM2 ∈ {geonames:2989247, geonames:2996268, geonames:3029094} | 1.0 | 754 | 754 | | 754 |
| 4 | {rdf:type = dbpedia:Airport} | geonames:featureCode ∈ {S.AIRB, S.AIRP} | 0.9924 | 1981 | 1996 | S.AIRF (9/22), S.FRMT (1/5), S.SCH (1/404), S.STNB (2/5), S.STNM (1/36), T.HLL (1/61) | 1996 |
| | GeoNames (larger) - DBpedia (smaller) | | | | | | |
| 5 | {geonames:countryCode = NL} | dbpedia:country ∈ {dbpedia:The_Netherlands, dbpedia:Flag_of_the_Netherlands.svg, dbpedia:Netherlands} | 0.9802 | 1939 | 1978 | dbpedia:Kingdom_of_the_Netherlands | 1940 |
| 6 | {geonames:countryCode = JO} | dbpedia:country ∈ {dbpedia:Jordan, dbpedia:Flag_of_Jordan.svg} | 0.95 | 19 | 20 | | 20 |
| | DBpedia (larger) - LinkedGeoData (smaller) | | | | | | |
| 7 | {dbpedia:bundesland = Saarland} | lgd:OpenGeoDBLicensePlateNumber ∈ {HOM, IGB, MZG, NK, SB, SLS, VK, WND} | 0.93 | 46 | 49 | | 46 |
Comments on Table 2:
#1: As described in Section 4, Schools, Colleges and Universities in GeoNames make Educational Institutions in DBpedia.
#2: The concepts for the country Spain are equal in both sources. The only outlier has its country as Italy, an erroneous assertion.
#3: We confirm the hierarchical nature of administrative divisions with alignments between administrative units at two different levels.
#4: In aligning airports, an airfield should have been an airport. However, there was not enough instance support.
#5: The alignment for Netherlands should have been as straightforward as #2. However, we have possible alias names, such as The Netherlands and Kingdom of Netherlands, as well as a possible linkage error to Flag_of_the_Netherlands.svg.
#6: The error pattern in #5 seems to repeat systematically, as can be seen from this alignment for the country of Jordan.
#7: Our algorithm also produces interesting alignments between different properties. In this case, we find 8 of the 10 license plates in the state of Saarland.
Table 3. Example alignments from LinkedGeoData - DBpedia and Geospecies - DBpedia
| # | {r1} | p2 ∈ {v2} | $R'_U$ | $\lvert U_A \rvert$ | $\lvert U_L \rvert$ | Outliers | # Explained Instances |
|---|---|---|---|---|---|---|---|
| 8 | {rdf:type = dbpedia:EducationalInstitution} | rdf:type ∈ {lgd:Amenity, lgd:K2543, lgd:School, lgd:University, lgd:WaterTower} | 0.9901 | 2609 | 2610 | | 2609 |
| | LinkedGeoData (larger) - DBpedia (smaller) | | | | | | |
| 9 | {lgd:gnis%3AST_alpha = NJ} | dbpedia:subdivisionName ∈ {Atlantic, Burlington, Cape May, Hudson, Hunterdon, Monmoth, New Jersey, Ocean, Passaic} | 1.0 | 214 | 214 | | 214 |
| 10 | {rdf:type = lgd:Waterway} | rdf:type ∈ {dbpedia:River, dbpedia:Stream} | 0.97 | 33 | 34 | dbpedia:Place (1/94989) | 34 |
| | DBpedia (larger) - Geospecies (smaller) | | | | | | |
| 11 | {rdf:type = dbpedia:Amphibian} | geospecies:hasOrderName ∈ {Anura, Caudata, Gymnophionia} | 0.99 | 90 | 91 | Testudines (1/7) | 91 |
| 12 | {rdf:type = dbpedia:Salamander} | geospecies:hasOrderName = Caudata | 0.94 | 16 | 17 | Testudines (1/7) | 17 |
| | Geospecies (larger) - DBpedia (smaller) | | | | | | |
| 13 | {rdf:type = dbpedia:Plant} | geospecies:inKingdom = geospecies:kingdoms/Ab | 0.99 | 1874 | 1876 | geospecies:kingdoms/Ac (1/8) | 1875 |
Comments on Table 3:
#8: Educational Institutions in DBpedia can be explained with classes in LinkedGeoData. As an example of an incorrect alignment, a water tower has been linked as an educational institution.
#9: Due to missing instance alignments, this union alignment incorrectly claims that the state of New Jersey is composed of 9 counties while it actually has 21.
#10: Waterways in LinkedGeoData are equal to the union of streams and rivers from DBpedia.
#11: Species from Geospecies with the order names Anura, Caudata & Gymnophionia are all Amphibians. We also find inconsistencies due to misaligned instances, e.g. one Turtle (Testudine) was classified as an amphibian.
#12: Upon further inspection of #11, we find that the culprit is a Salamander.
#13: The Kingdom Plantae, from both sources, almost matches perfectly. The only inconsistent instance happens to be a fungus.
Table 4. Example alignments from GeneID - MGI
| # | {r1} | p2 ∈ {v2} | $R'_U$ | $\lvert U_A \rvert$ | $\lvert U_L \rvert$ | Outliers | # Explained Instances |
|---|---|---|---|---|---|---|---|
| 14 | {geospecies:inOrder = geospecies:orders/jtSaY} | dbpedia:ordo ∈ {dbpedia:Carnivora, dbpedia:Carnivore} | 0.99 | 247 | 247 | | 247 |
| 15 | {geospecies:hasOrderName = Chiroptera} | dbpedia:ordo ∈ {Chiroptera@en, dbpedia:Bat} | 1 | 111 | 111 | | 111 |
| | GeneID (larger) - MGI (smaller) | | | | | | |
| 16 | {bio2rdf:subType = pseudo} | bio2rdf:subType = Pseudogene | 0.93 | 5919 | 6317 | Gene (318/24692) | 6237 |
| 17 | {bio2rdf:xTaxon = taxon:10090} | bio2rdf:subType ∈ {Complex Cluster/Region, DNA Segment, Gene, Pseudogene} | 1 | 30993 | 30993 | | 30993 |
| | MGI (larger) - GeneID (smaller) | | | | | | |
| 18 | {bio2rdf:subType = Pseudogene} | bio2rdf:subType = pseudo | 0.94 | 5919 | 6297 | other (4/230), protein-coding (351/39999), unknown (23/570) | 6297 |
| 19 | {mgi:genomeStart = 1} | geneid:location ∈ {1, 1 0.0 cM, 1 1.0 cM, 1 10.4 cM, ...} | 0.98 | 1697 | 1735 | "" (37/1048), 5 (1/52) | 1735 |
| 20 | {mgi:genomeStart = X} | geneid:location ∈ {X, X 0.5 cM, X 0.8 cM, X 1.0 cM, ...} | 0.99 | 1748 | 1758 | "" (10/1048) | 1758 |
Comments on Table 4:
#14: Inconsistencies in the object values can also be seen - Carnivores from Geospecies are aligned with both Carnivora & Carnivore.
#15: We can detect that species with order Chiroptera correctly belong to the order of Bats. Unfortunately, due to values of the property being the literal "Chiroptera@en", the alignment is not clean.
#16: Due to the absence of a clear hierarchy, we found only a few hierarchical relations, for example alignments of the Pseudogene classes.
#17: The Mus Musculus (house mouse) taxonomy is completely composed of complex clusters, DNA segments, Genes and Pseudogenes.
#18: Inconsistencies are also evident as the values pseudo and Pseudogene are used to denote the same thing.
#19, #20: We find interesting alignments like #19 & #20, which align the genome start position in MGI with the location in GeneID. As can be seen, the values of the locations (distances in centimorgans) in GeneID contain the genome start value as a prefix. Inconsistencies are also seen, e.g. in #19 a gene that starts with 5 is misaligned, and in #20 the value is an empty string.
References
1. Bernstein, P., Madhavan, J., Rahm, E.: Generic schema matching, ten years later. Proceedings of the VLDB Endowment 4(11) (2011)
2. Cruz, I., Palmonari, M., Caimi, F., Stroe, C.: Towards on the go matching of linked open data ontologies. In: Workshop on Discovering Meaning On The Go in Large Heterogeneous Data. p. 37 (2011)
3. Euzenat, J., Shvaiko, P.: Ontology matching. Springer-Verlag (2007)
4. Jain, P., Hitzler, P., Sheth, A., Verma, K., Yeh, P.: Ontology alignment for linked open data. The Semantic Web–ISWC 2010 pp. 402–417 (2010)
5. Jain, P., Yeh, P., Verma, K., Vasquez, R., Damova, M., Hitzler, P., Sheth, A.: Contextual ontology alignment of lod with an upper ontology: A case study with proton. The Semantic Web: Research and Applications pp. 80–92 (2011)
6. Parundekar, R., Ambite, J.L., Knoblock, C.A.: Aligning unions of concepts in ontologies of geospatial linked data. In: Proceedings of the Terra Cognita 2011 Workshop in Conjunction with the 10th International Semantic Web Conference. Bonn, Germany (2011)
7. Parundekar, R., Knoblock, C.A., Ambite, J.L.: Aligning geospatial ontologies on the linked data web. In: Proceedings of the GIScience Workshop on Linked Spatiotemporal Data. Zurich, Switzerland (2010)
8. Parundekar, R., Knoblock, C.A., Ambite, J.L.: Linking and building ontologies of linked data. In: Proceedings of the 9th International Semantic Web Conference (ISWC 2010). Shanghai, China (2010)
9. Spiliopoulos, V., Valarakos, A., Vouros, G.: Csr: discovering subsumption relations for the alignment of ontologies. The Semantic Web: Research and Applications pp. 418–431 (2008)
10. Völker, J., Niepert, M.: Statistical schema induction. The Semantic Web: Research and Applications pp. 124–138 (2011)