ADLUG 2012: Linking Linked Data

Copyright 2009-2010 @CULT. All rights reserved Linking Linked Data Andrea Gazzarini Software Architect 31st ADLUG ANNUAL MEETING 2012 Sala Brunelleschi of the OPA – CESVOT - Firenze 19 – 21 September 2012

description

A proposal for combining two different technologies, Solr and a triple store, in order to improve the (user) search experience by decoupling the “search” from the “view” perspective.

Transcript of ADLUG 2012: Linking Linked Data

Page 1: ADLUG 2012: Linking Linked Data

Linking Linked Data

Andrea Gazzarini, Software Architect

31st ADLUG ANNUAL MEETING 2012
Sala Brunelleschi of the OPA – CESVOT - Firenze
19 – 21 September 2012

Page 2: ADLUG 2012: Linking Linked Data

Agenda

Goals

Information Retrieval

Triple store

Proof of concept

Q&A

Page 4: ADLUG 2012: Linking Linked Data

Goals

1) Combine two different technologies in order to improve the (user) search experience by decoupling the “search” from the “view” perspective.

2) Provide a fast, full-featured full-text search that is able to scale over billions of records, providing typical search features like faceting, stemming, autocompletion and so on...

3) Provide a system that is able to benefit from the extensibility of Linked Data.

Page 5: ADLUG 2012: Linking Linked Data

Le avventure di Pinocchio

This is a record extracted from the recordset we will use during this presentation.

000    00694nam a2200241 i 4500
008    971205s1997 it j 000 0 ita c
020    a 880921191X
082 1  a 853.8
100 1  a Collodi, Carlo.
245 13 a Le avventure di Pinocchio / c C. Collodi ; illustrazioni di Attilio Mussino.
260    a Firenze : b Giunti, c 1997.
440 0  a Collana favolosa / [Giunti]
521    a Letteratura per ragazzi
700 1  a Mussino, Attilio.

Page 7: ADLUG 2012: Linking Linked Data

Information Retrieval (1/2)

For our purposes we will (simplistically) define an Information Retrieval (IR) system as a full-text search framework that indexes textual data and manipulates it in order to enable search features that matter to end users, such as:

» Relevance computation and boosting
» Autocompletion
» Faceting
» Stemming
» Did you mean?
» Search by phoneme (i.e. Sounds Like)
» More like this
» ...and many, many others...

But there's a price to pay for that...

Page 8: ADLUG 2012: Linking Linked Data

Inverted index

In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents. The purpose of an inverted index is to allow fast full text searches, at a cost of increased processing when a document is added to the database. The inverted file may be the database file itself, rather than its index. It is the most popular data structure used in document retrieval systems

http://en.wikipedia.org/wiki/Inverted_index

An inverted index is an optimized structure that allows fast searches, but it is meant to be immutable: if you need to change something in your data, you need to rebuild your index.
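To make the idea concrete, here is a minimal sketch of an inverted index in Python. It is a toy, not how Solr or Lucene store their postings; the second title in the collection is a made-up record used only for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the sorted ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


# Hypothetical two-record collection, used only for illustration.
documents = {
    1: "Le avventure di Pinocchio",
    2: "Le avventure di Tom Sawyer",
}
index = build_inverted_index(documents)

# In this toy version, changing a record means rebuilding the whole index:
# the price of fast reads is paid on the write side.
documents[2] = "Le avventure di Huckleberry Finn"
index = build_inverted_index(documents)
```

A lookup such as index["avventure"] is a single dictionary access, which is why searches are fast; the cost shows up whenever the underlying data changes.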

Page 9: ADLUG 2012: Linking Linked Data

Semantic destruction (1/3)

A search engine doesn't care how accurate you were or how much time you spent cataloguing a bibliographic resource... once indexed, it will lose any semantic meaning!

[Figure: the sample text "...ipsum dolor sit amet, consectetur adipiscing..." reduced to scattered letters]

Page 10: ADLUG 2012: Linking Linked Data

Semantic destruction (2/3)

Step               Output
(input)            The adventures of Pinocchio
Tokenization       The adventures of Pinocchio
Stopwords          adventures Pinocchio
Lowercase          adventures pinocchio
Stemming (light)   adventure pinocchio
Phoneme (!)        ATFN PNX

These are the only tokens that will be indexed!
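The chain above can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the stopword list and the plural-stripping rule are illustrative stand-ins, not the analyzers of any real search engine, and the phonetic step is left out because it needs a dedicated algorithm.

```python
STOPWORDS = {"the", "of", "a", "an", "and"}  # illustrative list, an assumption


def light_stem(token):
    """Very rough plural stripping; real light stemmers are language-aware."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Apply the steps in the same order as the slide."""
    tokens = text.split()                                       # tokenization
    tokens = [t for t in tokens if t.lower() not in STOPWORDS]  # stopwords
    tokens = [t.lower() for t in tokens]                        # lowercase
    return [light_stem(t) for t in tokens]                      # stemming (light)


terms = analyze("The adventures of Pinocchio")
# terms == ['adventure', 'pinocchio']
```

Nothing in the resulting terms says that the text was a title, or which record field it came from: that is the loss of meaning the slide refers to.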

Page 11: ADLUG 2012: Linking Linked Data

Semantic destruction (3/3)

ATFN PNX

KRL KLT

Page 13: ADLUG 2012: Linking Linked Data

Triple store (1/2)

A triplestore is a purpose-built database for the storage and retrieval of triples, a triple being a data entity composed of subject-predicate-object, like "Bob is 35" or "Bob knows Fred".

http://en.wikipedia.org/wiki/Triplestore

Subject   Predicate      Object
book      hasTitle       The adventures of Pinocchio
book      hasAuthor      Collodi, Carlo
book      hasPublisher   Giunti

Of course, a triple store is much more similar to a database and has basically nothing to do with an inverted index.
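As a rough illustration of the data model, the three triples above can be held in a tiny in-memory store with pattern matching. The class and method names are hypothetical and exist only for this sketch; a real triple store adds persistence, indexing and SPARQL on top.

```python
class TripleStore:
    """Toy store: a set of (subject, predicate, object) tuples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


store = TripleStore()
store.add("book", "hasTitle", "The adventures of Pinocchio")
store.add("book", "hasAuthor", "Collodi, Carlo")
store.add("book", "hasPublisher", "Giunti")

authors = store.match(subject="book", predicate="hasAuthor")
```

Unlike the analyzed tokens of a search index, every value here keeps its role: the store still knows that "Collodi, Carlo" is the author and "Giunti" the publisher.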

Page 14: ADLUG 2012: Linking Linked Data

Triple store (2/2)

Using a triple store you can have

1) a standard query language (SPARQL) to query the store;

2) a standard format for exchanging data (RDF);

3) a storage where you are free to change your data in real time without any reindexing.

But, most importantly, you cannot have any of the search features described in the previous slides: some of them (e.g. faceting) are practically impossible, while for others (e.g. autocompletion) the main problem is response time.

Page 16: ADLUG 2012: Linking Linked Data

Proof of Concept

Our system combines the two technologies described above, trying to keep the advantages of each while minimizing the disadvantages.

[Diagram] Input formats: RDF / XML, N3, Turtle, NTriples, MARC XML, MARC (Binary)

Information Retrieval → Search

Triple store → View
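The split between "search" and "view" can be sketched as follows. This is a toy model under loud assumptions: the Catalog class and its methods are invented for illustration, a plain dictionary stands in for the search index and another one for the triple store, and nothing here reflects the actual implementation of the system.

```python
class Catalog:
    """Toy stand-in for the two components: a search index and a triple store."""

    def __init__(self):
        self.index = {}    # term -> set of record URIs (the "search" side)
        self.triples = {}  # URI -> {predicate: value}   (the "view" side)

    def index_record(self, uri, searchable_text):
        for term in searchable_text.lower().split():
            self.index.setdefault(term, set()).add(uri)

    def describe(self, uri, predicate, value):
        self.triples.setdefault(uri, {})[predicate] = value

    def search(self, term):
        """Search returns identifiers only."""
        return sorted(self.index.get(term.lower(), set()))

    def view(self, uri):
        """View resolves an identifier to whatever the store holds right now."""
        return dict(self.triples.get(uri, {}))


catalog = Catalog()
book = "http://www.cbt.trentinocultura.net/biblio/000002577949"
catalog.index_record(book, "Le avventure di Pinocchio")
catalog.describe(book, "dcterms:title", "Le avventure di Pinocchio")

hits = catalog.search("pinocchio")
index_before = {term: set(uris) for term, uris in catalog.index.items()}
catalog.describe(book, "dcterms:issued", "1997")  # view-side change only
```

The point of the design is visible in the last line: what is displayed for a record can change at any time, while the index that answers searches stays exactly as it was.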

Page 17: ADLUG 2012: Linking Linked Data

Concretely...

Page 18: ADLUG 2012: Linking Linked Data

Le avventure di Pinocchio (MARC)

000    00694nam a2200241 i 4500
008    971205s1997 it j 000 0 ita c
020    a 880921191X
082 1  a 853.8
100 1  a Collodi, Carlo.
245 13 a Le avventure di Pinocchio / c C. Collodi ; illustrazioni di Attilio Mussino.
260    a Firenze : b Giunti, c 1997.
440 0  a Collana favolosa / [Giunti]
521    a Letteratura per ragazzi
700 1  a Mussino, Attilio.

Page 19: ADLUG 2012: Linking Linked Data

Le avventure di Pinocchio (RDF / XML)

The book...

<bibo:Book rdf:about="http://www.cbt.trentinocultura.net/biblio/000002577949">
  <dcterms:identifier>000002577949</dcterms:identifier>
  <bibo:isbn10>880921191X</bibo:isbn10>
  <dcterms:shortTitle>Le avventure di Pinocchio</dcterms:shortTitle>
  <dcterms:title>Le avventure di Pinocchio / C. Collodi ; illustrazioni di Attilio Mussino</dcterms:title>
  <dc:creator rdf:resource="http://www.cbt.trentinocultura.net/person/collodi_carlo"/>
  <dcterms:language>ita</dcterms:language>
  <dcterms:audience rdf:resource="http://www.cbt.trentinocultura.net/subject/opera_per_bambini"/>
  <dcterms:isPartOf rdf:resource="http://www.cbt.trentinocultura.net/biblio/2378129373323"/>
  <dcterms:extent>186 p.</dcterms:extent>
  <isbd:hasPlaceOfPublicationProductionDistribution>Firenze</isbd:hasPlaceOfPublicationProductionDistribution>
  <dcterms:issued>1997</dcterms:issued>
  <dcterms:publisher rdf:resource="http://www.cbt.trentinocultura.net/organisations/giunti"/>
</bibo:Book>

...the author...

<foaf:Person rdf:about="http://www.cbt.trentinocultura.net/person/collodi_carlo">
  <foaf:name>Collodi, Carlo</foaf:name>
</foaf:Person>

...and the publisher

<foaf:Organization rdf:about="http://www.cbt.trentinocultura.net/organisations/giunti">
  <foaf:name>Giunti</foaf:name>
</foaf:Organization>

Page 20: ADLUG 2012: Linking Linked Data

Step 1: transform MARC into RDF

As a first step we need to transform MARC records into their corresponding RDF representation.

This presentation is not focused on this advanced topic; we will just index ten MARC records, only to demonstrate the capabilities of the system.

We chose the RDF / XML format for expressing the resulting triples. This will be the input data of the system.

MARC 21 → RDF / XML
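Although the transformation itself is out of scope here, a minimal sketch may help to picture it. Everything in it is an assumption: the field-to-predicate mapping is inferred only from the sample record and its RDF shown in the previous slides, and the function name and input layout are hypothetical. A real conversion needs a far richer mapping.

```python
# Hypothetical mapping (MARC tag, subfield) -> predicate, inferred from the sample only.
MAPPING = {
    ("020", "a"): "bibo:isbn10",
    ("245", "a"): "dcterms:shortTitle",
    ("260", "a"): "isbd:hasPlaceOfPublicationProductionDistribution",
    ("260", "c"): "dcterms:issued",
}


def marc_to_triples(subject, fields):
    """fields: list of (tag, {subfield_code: value}) pairs."""
    triples = []
    for tag, subfields in fields:
        for code, value in subfields.items():
            predicate = MAPPING.get((tag, code))
            if predicate:
                # Strip trailing ISBD punctuation such as " /", " :" or "."
                triples.append((subject, predicate, value.rstrip(" /:;,.")))
    return triples


book = "http://www.cbt.trentinocultura.net/biblio/000002577949"
fields = [
    ("020", {"a": "880921191X"}),
    ("245", {"a": "Le avventure di Pinocchio /",
             "c": "C. Collodi ; illustrazioni di Attilio Mussino."}),
    ("260", {"a": "Firenze :", "b": "Giunti,", "c": "1997."}),
]
triples = marc_to_triples(book, fields)
```

Subfields without an entry in the mapping (245 c and 260 b here) are simply skipped in this sketch; in the sample RDF the publisher is a linked resource rather than a literal, which a plain lookup table like this one does not cover.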

Page 21: ADLUG 2012: Linking Linked Data

Step 2: submit RDF data

The RDF data created in the previous step needs to be submitted to the system.

RDF / XML

Page 22: ADLUG 2012: Linking Linked Data

Step 3: make a search...

Autocompletion

Faceting

Page 23: ADLUG 2012: Linking Linked Data

Step 4: more publisher data...

It would be great if my users could see additional data in search results.

For example, I could ask publishers for data (logo, homepage and so on). For them it could be a kind of advertisement, while for my users it would be additional information displayed in my catalog.

But

1) I don't want that data to be part of my search index;
2) I don't want to include that data in my bibliographic database;
3) I don't want to reindex my data when some publisher information changes;
4) I would like to manage and improve that data without affecting searches.

Page 24: ADLUG 2012: Linking Linked Data

Step 6: Our sample publisher

Before...

<foaf:Organization rdf:about="http://www.cbt.trentinocultura.net/organisations/giunti">
  <foaf:name>Giunti</foaf:name>
</foaf:Organization>

...and after

<foaf:Organization rdf:about="http://www.cbt.trentinocultura.net/organisations/giunti">
  <foaf:name>Giunti</foaf:name>
  <foaf:logo rdf:resource="http://www.giunti.it/custom/src/@css/images/logo_Giunti.jpg"/>
  <rdfs:comment>Fondata nel pieno delle battaglie risorgimentali...</rdfs:comment>
  <foaf:mbox rdf:resource="mailto:[email protected]"/>
  <foaf:homepage rdf:resource="http://www.giunti.it"/>
</foaf:Organization>

As you can see, we added a logo, a brief description of the publisher, a mailbox and a homepage. We got the data directly from the publisher's website.

This data will be submitted to the search system again, but without rebuilding the search index.

As a consequence, changes made to the publishers are immediately available.
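A minimal sketch of what such a submission might look like on the receiving side: the RDF / XML is parsed into triples and added to the store, and no search index is involved at any point. This is an assumption-laden illustration, not the system's code: the rdf:RDF wrapper with namespace declarations is added here so that the fragment parses on its own, only two of the publisher properties are included, the extraction handles only this flat node/property shape, and a Python set stands in for the triple store.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"

# Shortened version of the enriched publisher description.
ENRICHED = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:foaf="{FOAF}">
  <foaf:Organization rdf:about="http://www.cbt.trentinocultura.net/organisations/giunti">
    <foaf:name>Giunti</foaf:name>
    <foaf:homepage rdf:resource="http://www.giunti.it"/>
  </foaf:Organization>
</rdf:RDF>
"""


def extract_triples(rdf_xml):
    """Flat (subject, predicate, object) extraction for simple RDF/XML."""
    triples = []
    for node in ET.fromstring(rdf_xml):
        subject = node.get(f"{{{RDF}}}about")
        for prop in node:
            # A property is either a link (rdf:resource) or a literal (text).
            obj = prop.get(f"{{{RDF}}}resource") or (prop.text or "").strip()
            triples.append((subject, prop.tag, obj))
    return triples


store = set()  # stand-in for the triple store
store.update(extract_triples(ENRICHED))
```

Because the new statements only reach the store, they are visible the next time the publisher is displayed, with no reindexing step in between.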

Page 25: ADLUG 2012: Linking Linked Data

Step 7: see additional data...

Page 26: ADLUG 2012: Linking Linked Data

Step 7 bis: another publisher...

Page 27: ADLUG 2012: Linking Linked Data

Step 8: still more (linked) data... (1/3)

Great! My users were enthusiastic! So I'd like more, and not only publishers... but what else?

Sir, I think it would be very useful to show author information beside each record.

Yes, it definitely would, but you have no idea how much work it took to insert all the publisher data, and I don't want to do the same for authors... too much work!

If I remember correctly, your system is using Linked Data, isn't it?

Yes

So in this case the right question is not “What can I do? I have no data”, but “What kind of data would I like to show?”

???

Page 28: ADLUG 2012: Linking Linked Data

Step 8: still more (linked) data...(2/3)

There are a lot of authoritative RDF endpoints that expose their data free of charge. The main advantage is that you can link this information to your system without having to worry about its maintenance: it's not your data! See http://viaf.org or http://dbpedia.org

By linking those resources, you get data in a standardized way, because the sources share one or more (accepted) ontologies for describing authors, subjects, things and so on...

For the example above we need to gather additional information about people (authors), and fortunately there's an ontology called Friend of a Friend (FOAF) that fits our needs exactly. This ontology is widely used by RDF sources describing persons (such as VIAF and DBpedia).

In our example, instead of copying all the information about Carlo Collodi, the author of “The adventures of Pinocchio”, into our triple store (as we did for publishers), we will simply link our internal representation to the same resource as defined in DBpedia.

Page 29: ADLUG 2012: Linking Linked Data

Step 8: still more (linked) data...(3/3)

Page 30: ADLUG 2012: Linking Linked Data

Step 9: Our sample author

Before...

<foaf:Person rdf:about="http://www.cbt.trentinocultura.net/person/collodi_carlo">
  <foaf:name>Collodi, Carlo</foaf:name>
</foaf:Person>

...and after

<foaf:Person rdf:about="http://www.cbt.trentinocultura.net/person/collodi_carlo">
  <foaf:name>Collodi, Carlo</foaf:name>
  <owl:sameAs rdf:resource="http://dbpedia.org/resource/Carlo_Collodi"/>
</foaf:Person>

As you can see, we didn't add any information, just a “link” expressed with the sameAs predicate.

The URI (http://dbpedia.org/resource/Carlo_Collodi) points to a web resource describing Carlo Collodi, so we can gather this data and, for example, display it to the end user.
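One possible way to use the link at display time is sketched below. It is only an illustration: the describe function and the fetch_remote callback are hypothetical, and the remote lookup is replaced by a stub returning placeholder values, since how the remote source is actually queried (dereferencing the URI, a SPARQL endpoint, caching) is not covered by these slides.

```python
SAME_AS = "owl:sameAs"


def describe(uri, local_triples, fetch_remote):
    """Collect the local properties of uri plus those of every owl:sameAs target."""
    description = {}
    for subject, predicate, obj in local_triples:
        if subject != uri:
            continue
        if predicate == SAME_AS:
            # Remote data is read at display time and never copied into our store.
            for remote_predicate, remote_obj in fetch_remote(obj):
                description.setdefault(remote_predicate, remote_obj)
        else:
            description[predicate] = obj  # local values take precedence
    return description


collodi = "http://www.cbt.trentinocultura.net/person/collodi_carlo"
local_triples = [
    (collodi, "foaf:name", "Collodi, Carlo"),
    (collodi, SAME_AS, "http://dbpedia.org/resource/Carlo_Collodi"),
]


def stub_fetch(remote_uri):
    """Stand-in for a real lookup; the values are placeholders, not DBpedia data."""
    return [("foaf:name", "<name from remote>"),
            ("rdfs:comment", "<abstract from remote>")]


author = describe(collodi, local_triples, stub_fetch)
```

In this sketch a value we hold locally wins over the remote one, and everything else comes from the linked resource, so its maintenance stays with whoever publishes it.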

Page 31: ADLUG 2012: Linking Linked Data

Step 10: again the same search...

Page 32: ADLUG 2012: Linking Linked Data

Step 10 bis: another author...

Page 33: ADLUG 2012: Linking Linked Data

Step 11: still more data??? yes!

Wow!! And now? Is there some other content I could “link”?

Yes sir, subjects for example... are you using subjects coming from the “Nuovo Soggettario”?

Yes

So in this case you can link those subjects directly to the concepts of the thesaurus, thereby providing end users with information like scope notes, history notes, term relationships and so on.

And, as another example, for places you can link “Geonames” resources, which provide RDF descriptions of cities and countries.

Page 34: ADLUG 2012: Linking Linked Data

Step 12: Linking the “Nuovo Soggettario”

Page 35: ADLUG 2012: Linking Linked Data

Step 13: Linking Firenze with Geonames

Page 37: ADLUG 2012: Linking Linked Data

Thank You!