Korea Advanced Institute of Science and Technology Scalable Access and Process of Linked Open Data A Semantic Cloud Generation Approach based on Linked Data for Efficient Semantic Annotation In-Young Ko August 16, 2012 Adapted from the material of Hangyu Ko, a Ph.D. student at WebEng Lab. 2012 International Asian Summer School on Linked Data (IASLOD 2012) August 13 – 17, 2012, KAIST, Daejeon, Korea



Contents
• Introduction
  • Objectives and Motivations
  • Challenges
    • SPARQL Performance
    • Dynamic Access to Linked Data
    • Forming effective semantic clouds
• Similarity-Link-based Semantic Cloud Generation
  • Similarity-Link Analysis for Concept Grouping
  • Centrality Measurement
  • Incremental Traversing and Grouping
• Evaluation
  • Performance of Incremental Traversing
  • Candidate Reduction of Similarity-Link Analysis
• Conclusion


Copyright (c) In-Young Ko, KAIST


Objectives
• To provide a semantic-cloud-based annotation scheme
  • Use semantic clouds as the primary interface
  • Make semantic annotation easy in resource-constrained environments (e.g., smartphones and IPTVs)
• To propose a framework for generating efficient semantic clouds
  • Allow users to intuitively recognize candidate concepts while resolving semantic ambiguity
  • Utilize Linked Data to dynamically generate semantic clouds


Semantic Annotation
• Tagging with terms that map onto ontology classes
  • Enhanced information retrieval
  • e.g., resolving ambiguity in search: “Apple” the fruit vs. “Apple” the company
• Types of semantic annotation
  • Manual semantic annotation
    • Human annotation is error-prone
    • Knowledge acquisition bottleneck
  • Automatic semantic annotation
    • Impossible to automatically identify and classify all entities in source documents with complete accuracy
  • Semi-automatic semantic annotation
    • All existing semantic annotation systems rely on human intervention at some point in the annotation process


Usability and Scalability in Semantic Annotation
• Problems of previous efforts on semantic annotation
  • Use terms from ontologies created by domain experts
    • Do not provide sufficient options to cover various kinds of semantics
    • Do not necessarily reflect newly created knowledge in an up-to-date manner


Motivating Scenario


Expected Results
• Example: ‘Apple’
• Additional relevant terms that do not contain the keyword, e.g., ‘iTunes’ and ‘Macintosh’ in the pink cloud


Apple Corp.


What is Linked Data?
• Linked Data is large-scale and heterogeneous Semantic Web data
• More than 31 billion RDF triples from 295 different datasets


Problems in Retrieving Linked Data


• SPARQL
  • Need to know all the endpoints of datasets
  • Slow response time
• Linked Data search engines (Sindice, SWSE, Falcons, etc.)
  • Only 0.6% to 30% of the results come from Linked Data datasets
  • Limited number of results: at most 1,000 URIs per query
• Billion Triple Challenge (BTC) dataset (2009 to 2011)
  • Aggregation and organization of dumps from Linked Data search engines
  • Only about 30% of the triples belong to Linked Data datasets (984,611,067 / 3,173,563,606)
• LDSpider
  • Need to know seed URIs


Challenges in Generating Semantic Clouds


• Dynamic access to Linked Data
  • Too many responses for each query
  • Not feasible to ask users to choose the most appropriate one
• Forming effective semantic clouds for annotation
  • Relevant terms are grouped into a small number of clouds
  • The semantics of a cloud should be intuitively recognizable
  • Semantic ambiguity between clouds should be minimized

No.  Keyword     Triples
1    Animal       67,054
2    Apple        49,443
3    California  149,785
4    Cloud        63,939
5    Music       256,352
6    New York     17,291
7    Sky         153,716
8    Tiger        27,901
9    Travel       65,871
10   Wedding      22,624


Semantic Cloud Generation Framework
• An incremental and iterative access method
• Three phases of semantic cloud generation


Semantic Cloud Generation Steps

1. Find Spotting Points
  • Find the initial set of RDF nodes by using a LOD search engine or BTC
  • Retrieve and group similar RDF concepts via SPARQL endpoints
  • Prioritize the nodes within a group by using centrality analysis
2. Select Links to Traverse
  • Consider popular relations such as FOAF, DC, SKOS and SIOC
  • Selectively traverse the Linked Data graph based on user or task context
3. Generate Concept Clouds
  • Check semantic similarity to merge RDF nodes
  • Minimize semantic ambiguity to make clusters more distinguishable
  • Increase the number of hops and the number of common terms to cover more RDF nodes

[Flowchart: start, steps 1 to 3, then a "User Selects a Cloud" decision with branches labeled Y and N, end]


Finding Spotting Points
• Concept Search
  • Keyword-based search for relevant concepts
• Similarity-Link Analysis
  • owl:sameAs parsing for grouping semantically identical concepts
  • skos:broader parsing for grouping semantically relevant concepts
• Centrality Measurements
  • Importance of each node, connections of each node


Concept Search


Keyword-based Search
• Triple pattern: Subject, Predicate, Object
• Label predicates matched against the keyword:
  • http://www.w3.org/2000/01/rdf-schema#label
  • http://www.w3.org/2004/02/skos/core#prefLabel
  • http://purl.org/dc/elements/1.1/title
  • http://purl.org/dc/terms/title
  • http://sw.cyc.com/CycAnnotations_v1#label
  • http://rdf.freebase.com/ns/type.object.name
  • http://www.geonames.org/ontology#name
  • http://www.w3.org/2004/02/skos/core#altLabel
• Datasets: DBpedia, Freebase, ∙∙∙
• Each dataset uses a different ontology
• Collect the ‘Subject’ concepts

Concept Retrieval
• Concept 1, Concept 2, …, Concept n
• Concept 1, Concept 2, …, Concept k
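The keyword-based search above can be sketched as a query builder. This is a minimal illustration, not the authors' implementation: the function name `build_concept_search_query` and the `limit` parameter are hypothetical, and the query uses the SPARQL 1.1 `VALUES` and `CONTAINS` features, which the slides do not mention. Only the eight label predicates come from the slide.

```python
# Label predicates listed on the slide.
LABEL_PREDICATES = [
    "http://www.w3.org/2000/01/rdf-schema#label",
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://purl.org/dc/elements/1.1/title",
    "http://purl.org/dc/terms/title",
    "http://sw.cyc.com/CycAnnotations_v1#label",
    "http://rdf.freebase.com/ns/type.object.name",
    "http://www.geonames.org/ontology#name",
    "http://www.w3.org/2004/02/skos/core#altLabel",
]


def build_concept_search_query(keyword: str, limit: int = 100) -> str:
    """Build a SPARQL query that collects the subjects whose label contains the keyword.

    Hypothetical helper: the name, the LIMIT, and the case-insensitive match
    are assumptions made for illustration.
    """
    # Escape backslashes and double quotes so the keyword is a valid string literal.
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    values = " ".join(f"<{p}>" for p in LABEL_PREDICATES)
    return (
        "SELECT DISTINCT ?subject ?label WHERE {\n"
        f"  VALUES ?predicate {{ {values} }}\n"
        "  ?subject ?predicate ?label .\n"
        f'  FILTER(CONTAINS(LCASE(STR(?label)), LCASE("{escaped}")))\n'
        "}\n"
        f"LIMIT {limit}"
    )


query = build_concept_search_query("Apple")
print(query)
```

The same query text could be sent to each known endpoint; because each dataset uses a different ontology, matching on several label predicates at once is what lets one keyword reach concepts in all of them.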


Similarity-Link Analysis Model


[Model diagram. Nodes: Concept, OutLink, InLink, Integer, Literal. Properties: hasLabel, hasURI, hasInLinks, hasOutLinks, has # of Links, hasOwlsameAs, hasSkosExactMatch, hasSkosNarrower/Broader. Cardinalities shown: 0…n and 1…1.]


Similarity-Link Analysis – owl:sameAs

[Figure: concept nodes (C) connected by owl:sameAs links]

Similarity-Link Analysis – skos:broader

[Figure: concept nodes (C) connected by skos:broader links]

Similarity-Link Analysis – an example

[Figure: similarity-link analysis of the following concepts]
• owl:sameAs link between
  • <http://rdf.freebase.com/ns/m/0k8z>
  • <http://dbpedia.org/data/Category:Apple_Inc.>
• skos:broader links among
  • <http://dbpedia.org/data/Category:Apple_Inc._hardware>
  • <http://dbpedia.org/data/Category:Apple_IIGS>
  • <http://dbpedia.org/data/Category:Apple_Lisa>

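One way to group concepts by similarity links is a union-find pass over the link pairs. The sketch below is an assumption about how such grouping could be done, not the method used in the talk: `group_by_links` is a hypothetical helper, links are treated as undirected for grouping purposes, and the exact pairing of the skos:broader links between the example URIs is guessed for illustration.

```python
from collections import defaultdict


def group_by_links(links):
    """Group URIs that are connected by similarity links, using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


FREEBASE_APPLE = "http://rdf.freebase.com/ns/m/0k8z"
DBP = "http://dbpedia.org/data/Category:"

# owl:sameAs pair taken from the example slide.
same_as = [(FREEBASE_APPLE, DBP + "Apple_Inc.")]

# skos:broader pairs: the URIs are from the slide, the pairing is assumed.
broader = [
    (DBP + "Apple_Inc._hardware", DBP + "Apple_Inc."),
    (DBP + "Apple_IIGS", DBP + "Apple_Inc._hardware"),
    (DBP + "Apple_Lisa", DBP + "Apple_Inc._hardware"),
]

identical = group_by_links(same_as)           # semantically identical concepts
related = group_by_links(same_as + broader)   # semantically relevant concepts
print(len(identical[0]), len(related[0]))
```

Running the grouping twice, first with owl:sameAs only and then with skos:broader added, mirrors the distinction between semantically identical and semantically relevant concepts.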

Centrality Analysis
• Connections to other concepts
  • Degree centrality
    • The number of links incident upon a node
    • Indegree (popularity), outdegree (gregariousness)
  • Eigenvector centrality
    • Connections to high-scoring nodes contribute more than connections to low-scoring nodes
  • Katz centrality & PageRank
    • Generalization of degree centrality (all nodes connected through a path)
    • A variant of eigenvector centrality
• Closeness to other concepts
  • Closeness centrality
    • Farness = sum of distances to all other nodes
    • Closeness = the inverse of the farness
  • Betweenness centrality
    • A node that has a high probability of occurring on a randomly chosen shortest path between two randomly chosen nodes has high betweenness

Source: http://en.wikipedia.org/wiki/Centrality
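Two of the measures above, degree centrality and closeness centrality, are small enough to sketch directly. This is an illustrative sketch on a made-up four-node graph; the function names and the toy edges are assumptions, and closeness is computed over links treated as undirected.

```python
from collections import deque


def degree_centrality(edges):
    """Return {node: (indegree, outdegree)} for directed (source, target) edges."""
    degrees = {}
    for source, target in edges:
        in_s, out_s = degrees.get(source, (0, 0))
        degrees[source] = (in_s, out_s + 1)
        in_t, out_t = degrees.get(target, (0, 0))
        degrees[target] = (in_t + 1, out_t)
    return degrees


def closeness_centrality(edges, node):
    """Closeness = 1 / farness, where farness is the sum of shortest-path
    distances from `node` to every other reachable node."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Breadth-first search gives shortest distances in an unweighted graph.
    distance = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for nxt in neighbors.get(current, ()):
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    farness = sum(distance.values())
    return 1.0 / farness if farness else 0.0


# Toy graph (hypothetical): A -> B, A -> C, B -> C, C -> D
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
print(degree_centrality(edges)["C"])     # (indegree, outdegree) of C
print(closeness_centrality(edges, "C"))  # C is one hop from every other node
```

In this toy graph C has the highest indegree and the highest closeness, so it would be prioritized within its group.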


Concept Groups

Keyword: ‘Apple’

No.  Label                  Dataset
1    Apple@cs               Freebase
2    Apple_Corp.            Freebase
3    Sugared Apple          Freebase
4    R.W. Apple Jr.         DBpedia
5    Apple Pie              DBpedia
6    Apple_Pink             Freebase
7    Barton Brands          Freebase
8    Apple MacBook          Freebase
9    H.W. Longfellow        Freebase
10   Apple_Developer_Tools  Freebase


Concept Clusters (1-hop traversal)


Incremental Traversal & Grouping

• Problems of accessing Linked Data via SPARQL endpoints
  • Slow response time
  • Exponentially increasing number of concepts to traverse
• Approach to solve the problems
  • Traverse incrementally (80% of relevant concepts can be retrieved within 2 hops)
  • Wait for the result from an endpoint only up to the threshold time

[Figure: traversal at 0 hops, 1 hop, and 2 hops]
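The hop-by-hop traversal can be sketched as a breadth-first generator that hands back each hop's newly found concepts before going deeper. This is a sketch under stated assumptions: `traverse_incrementally` and the `fetch_neighbors` callback are hypothetical names, and the callback stands in for whatever SPARQL lookup returns the concepts linked to a given concept. The toy graph is invented.

```python
def traverse_incrementally(seeds, fetch_neighbors, max_hops=2):
    """Yield (hop, newly_found_concepts) one hop at a time, starting at hop 0."""
    seen = set(seeds)
    frontier = set(seeds)
    yield 0, set(frontier)
    for hop in range(1, max_hops + 1):
        next_frontier = set()
        for concept in frontier:
            for neighbor in fetch_neighbors(concept):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.add(neighbor)
        if not next_frontier:
            return  # nothing new to traverse
        yield hop, next_frontier
        frontier = next_frontier


# Toy link structure (hypothetical) standing in for endpoint lookups.
TOY_GRAPH = {
    "apple": ["apple_inc", "apple_pie"],
    "apple_inc": ["macintosh", "itunes"],
    "macintosh": ["apple_lisa"],
}

hops = list(traverse_incrementally(["apple"], lambda c: TOY_GRAPH.get(c, []), max_hops=2))
for hop, found in hops:
    print(hop, sorted(found))
```

Because the generator yields after every hop, a caller can build clouds from the 0-hop and 1-hop results right away and only request the next hop if the user has not yet selected a cloud.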


Evaluation
• Evaluation items
  • Performance of semantic cloud generation
  • Concept reduction ratio
  • User study
• Data preparation
  • CKAN Data Hub to obtain endpoints: 173 endpoints
  • Jena ARQ for executing SPARQL queries
  • Test keywords: top 30 tags on Flickr


Performance of Incremental Traversal

• Threshold (5 seconds) for SPARQL queries
  • 153 out of 173 endpoints responded within the threshold
  • Coverage: 88.44%

[Figure: response time (msec) by endpoint, with the 5000 msec threshold marked; axis labels 20 and 153]
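Waiting for each endpoint only up to a threshold can be sketched with a thread pool and a bounded wait. This is an illustration rather than the evaluated system, which used Jena ARQ: `query_with_threshold`, `fake_query`, the endpoint names, and the delays are all hypothetical, and the threshold is shortened from 5 seconds so the example runs quickly. The coverage figure at the end uses the numbers from the slide.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def query_with_threshold(endpoints, run_query, threshold_s):
    """Query all endpoints in parallel and keep only the answers that
    arrive within threshold_s seconds."""
    executor = ThreadPoolExecutor(max_workers=max(1, len(endpoints)))
    futures = {executor.submit(run_query, ep): ep for ep in endpoints}
    done, _late = wait(futures, timeout=threshold_s)
    executor.shutdown(wait=False)  # do not block on slow endpoints
    return {futures[f]: f.result() for f in done if f.exception() is None}


def coverage_percent(answered, total):
    """Share of endpoints that answered within the threshold, in percent."""
    return 100.0 * answered / total


# Hypothetical endpoints with simulated response delays in seconds.
DELAYS = {"fast-endpoint": 0.01, "slow-endpoint": 1.0}


def fake_query(endpoint):
    time.sleep(DELAYS[endpoint])
    return f"results from {endpoint}"


results = query_with_threshold(list(DELAYS), fake_query, threshold_s=0.2)
print(sorted(results))
print(round(coverage_percent(153, 173), 2))  # figures from the slide
```

Late answers are simply ignored here; a fuller design might cache them for a later hop instead of discarding them.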


Performance of Incremental Traversing


Concept Reduction of Similarity-link Analysis


Keyword     # of Concepts  # of Concepts (sameAs)  # of Concepts (SKOS)  Reduction Ratio (%)
Newyork     895            296                     0                     33.07263
Animal      1524           484                     1                     31.82415
California  2911           648                     175                   28.27207
Wedding     164            40                      0                     24.39024
Music       8264           1839                    16                    22.44676
Sky         2741           242                     278                   18.97118
Tiger       459            42                      41                    18.08279
Apple       877            772                     729                   16.87571

Avg. 14.25%

[Figure: bar chart of reduction ratio (%) by keyword]
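The slide does not state how the reduction ratio is computed. One reading that reproduces the listed percentages is (sameAs concepts + SKOS concepts) / total concepts × 100; the sketch below checks that assumption against the rows of the table. The Apple row is left out because its extracted counts do not fit this formula, and `reduction_ratio` is a hypothetical helper name.

```python
# (total concepts, sameAs concepts, SKOS concepts, ratio reported on the slide)
ROWS = {
    "Newyork": (895, 296, 0, 33.07263),
    "Animal": (1524, 484, 1, 31.82415),
    "California": (2911, 648, 175, 28.27207),
    "Wedding": (164, 40, 0, 24.39024),
    "Music": (8264, 1839, 16, 22.44676),
    "Sky": (2741, 242, 278, 18.97118),
    "Tiger": (459, 42, 41, 18.08279),
}


def reduction_ratio(total, same_as, skos):
    """Assumed formula: share of concepts covered by sameAs or SKOS links, in percent."""
    return 100.0 * (same_as + skos) / total


for keyword, (total, same_as, skos, reported) in ROWS.items():
    computed = reduction_ratio(total, same_as, skos)
    print(f"{keyword:<11}{computed:9.5f}  (slide: {reported})")
```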


User Study
• Top 30 popular tags from the Web (Flickr)
  • Apple, Mouse, Tiger, Paris, Bank, Health, Web, Art, Nature, Park
  • Beach, California, Canon, Music, London, Travel, Wedding, Festival, Square, Party
  • Newyork, Water, Sky, Snow, Portrait, Nikon, Cloud, Green, Spring, Animal


Implementation in IPTV domain

[Figure: IPTV annotation flow in five numbered steps. Labels: Annotation Timing, Start Button, Keyword (User input), Cloud Generation, Selected Linked Data, Semantic Cloud]


Conclusion
• Contributions
  • Efficient handling of large-scale Linked Data
  • Generating semantic clouds that enable users to
    • Specify semantics simply by using keywords
    • Intuitively recognize semantic options for annotation
    • Easily resolve semantic ambiguity
• Future Work
  • User studies to measure the usability of the proposed approach
    • Considering semantically ambiguous situations
  • Empirical studies to determine the following
    • Optimal number of spotting points
    • Maximum number of hops to traverse
    • Threshold value to decide the optimal set of SPARQL endpoints for initial generation


Questions?

