THGenius, RDF and open linked data for thesaurus management

Copyright 2010 @CULT. All rights reserved

From SubjectA to THGenius: the Semantic Web searching

29th ADLUG Annual Meeting 2010, Centro Congressi Panorama, Trento, Provincia Autonoma di Trento, 22-24 September 2010

RDF and Open Linked Data, a first approach (part II)


Library data in a modern context

The library catalogue (as traditional catalogue or as OPAC) has been the only context for library data since its inception. The purposes of the library catalogue:

a. Identifying the library’s holdings
b. Supporting management of those holdings
c. Providing entry and discovery points for librarian and non-librarian users

Are the efforts librarians put into creating and maintaining the catalogue rewarded by users? For a variety of reasons, users favor the Web as an information platform over the library.

The question for librarians and vendors has to be: how can we strengthen the relationship between libraries and users?

The question that we must face, and that we must face sooner rather than later, is how we can best transform our data so that it can become part of the dominant information environment that is the Web


The Web as context

Current scenario: a change is under way

The web is more and more the source of information for searchers and researchers, and the library needs to be interconnected with that web of data.

The library catalog data must be transformed from the current ‘textual description’ into a set of data elements to which machine processes can be applied.

These data elements must be compatible with the current technology, that is, the World Wide Web.

This process is what we can define as the evolution from ‘library catalog’ to Semantic Web.

For us as a vendor, this process means the evolution from SubjectA to THGenius.


Data in the traditional catalogue

=LDR 00688nz a2200265n 4500
=001 000000008238
=005 20100519190730.0
=008 100519nn\ano\\ba\n\\\\\\\\\\\n\ana\\\\\d
=040 \\$aOSZK$bhun$fKöztaurusz
=151 \\$aAsia Minor Occidentalis
=551 \\$wgnnn$aókori történeti táj
=551 \\$whnnn$aBithynia
=551 \\$whnnn$aCaria
=551 \\$whnnn$aIonia
=551 \\$whnnn$aLycaonia
=551 \\$whnnn$aLycia
=551 \\$whnnn$aLydia
=551 \\$whnnn$aMysia
=551 \\$whnnn$aPamphylia
=551 \\$whnnn$aPhrygia
=551 \\$whnnn$aPisidia
=551 \\$wjnnn$aAsia Minor
=551 \\$wpnnn$aTörökország
=551 \\$wmnnn$aAsia Minor Orientalis
=751 \4$a(392)

The ‘Asia Minor Occidentalis’ as MARC21 authority record
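To make the difference between a "textual description" and machine-processable data concrete, here is a minimal Python sketch that reads a few of the 551 tracings from the record above and groups them by the first character of their $w control subfield. It is only an illustration, not how THGenius works. The function names are hypothetical. In MARC 21, $w position 0 values `g` and `h` mean broader and narrower term; the other codes in this record (`j`, `p`, `m`) appear to be local, so the sketch keeps them under their raw code instead of guessing a meaning.

```python
from collections import defaultdict

# A few fields copied from the record above, one field per line.
RECORD = r"""
=151 \\$aAsia Minor Occidentalis
=551 \\$wgnnn$aókori történeti táj
=551 \\$whnnn$aBithynia
=551 \\$whnnn$aCaria
=551 \\$wjnnn$aAsia Minor
=551 \\$wpnnn$aTörökország
"""

# Standard MARC 21 meanings for $w position 0. Anything else is treated
# as a local code (an assumption made for this sketch).
RELATION_NAMES = {"g": "broader", "h": "narrower"}


def parse_subfields(field_body):
    """Return {subfield code: value} for the text that follows the tag."""
    subfields = {}
    # The text before the first '$' holds the indicators, so it is skipped.
    for chunk in field_body.split("$")[1:]:
        subfields[chunk[0]] = chunk[1:].strip()
    return subfields


def group_tracings(record_text):
    """Group 551 headings by the first character of their $w subfield."""
    groups = defaultdict(list)
    for line in record_text.strip().splitlines():
        tag, _, body = line.partition(" ")
        if tag != "=551":
            continue
        sub = parse_subfields(body)
        code = sub.get("w", "n")[0]
        relation = RELATION_NAMES.get(code, "local code " + code)
        groups[relation].append(sub["a"])
    return dict(groups)


tracings = group_tracings(RECORD)
for relation, headings in tracings.items():
    print(relation, "->", ", ".join(headings))
```

Even in this small form, the relations are hidden inside one-letter codes that a program must already know how to interpret.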


The knowledge base for Web

<skos:Concept rdf:about="http://nektar.oszk.hu/resource/auth/Asia_Minor_Occidentalis">
  <skos:inScheme rdf:resource="http://www.oszk.hu/thesaurus/location"/>
  <dc:source>OSZK geotezaurusz</dc:source>
  <dc:type>location</dc:type>
  <skos:prefLabel xml:lang="hu">Asia Minor Occidentalis</skos:prefLabel>
  <skos:broader rdf:resource="http://nektar.oszk.hu/resource/auth/ókori_történeti_táj"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Bithynia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Caria"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Ionia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Lycaonia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Lycia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Lydia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Mysia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Pamphylia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Phrygia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Pisidia"/>
  <skos:broader rdf:resource="http://nektar.oszk.hu/resource/auth/Asia_Minor"/>
  <skos:related rdf:resource="http://nektar.oszk.hu/resource/auth/Törökország"/>
</skos:Concept>

The ‘Asia Minor Occidentalis’ as web resource (in RDF/SKOS format)
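Because the SKOS version names each relation explicitly, generic tools can follow the links without any library-specific knowledge. The sketch below is an illustration only: it uses Python's standard `xml.etree.ElementTree` module to read a shortened copy of the concept above. The `rdf:RDF` wrapper with namespace declarations is added here so that the fragment parses on its own, and the helper name `linked_concepts` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://nektar.oszk.hu/resource/auth/"

# Shortened copy of the concept above, wrapped in rdf:RDF so that the
# rdf: and skos: prefixes are declared.
DOCUMENT = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="{BASE}Asia_Minor_Occidentalis">
    <skos:prefLabel xml:lang="hu">Asia Minor Occidentalis</skos:prefLabel>
    <skos:narrower rdf:resource="{BASE}Bithynia"/>
    <skos:narrower rdf:resource="{BASE}Caria"/>
    <skos:broader rdf:resource="{BASE}Asia_Minor"/>
  </skos:Concept>
</rdf:RDF>"""


def linked_concepts(xml_text, relation):
    """Map each concept URI to the URIs it reaches through skos:<relation>."""
    root = ET.fromstring(xml_text)
    links = {}
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        about = concept.get(f"{{{RDF}}}about")
        targets = [
            element.get(f"{{{RDF}}}resource")
            for element in concept.findall(f"{{{SKOS}}}{relation}")
        ]
        links[about] = targets
    return links


narrower = linked_concepts(DOCUMENT, "narrower")
broader = linked_concepts(DOCUMENT, "broader")
for uri in narrower[BASE + "Asia_Minor_Occidentalis"]:
    print("narrower:", uri)
```

A dedicated RDF library would be the usual choice for real data; plain XML parsing is used here only to keep the example dependency-free.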


The Web as context

What we can manage now with THGenius (RDF, Resource Description Framework)

• RDF/SKOS objects: Simple Knowledge Organization System (for the representation of thesauri, classification schemes, taxonomies, subject-heading systems and so on)

• RDF/FOAF objects: acronym for Friend of a Friend (ontology describing persons, their activities and their relations with other people and objects)

• RDF/DC objects: acronym for RDF Dublin Core metadata (used to describe information resources, such as documents)

To reach the common goal: to publish our data on the web as linked entities


Library data in a modern context

http://semanticweb.org/wiki/SPARQL_endpoint

“A SPARQL endpoint enables users (human or other) to query a knowledge base via the SPARQL language. Results are typically returned in one or more machine-processable formats. Therefore, a SPARQL endpoint is mostly conceived as a machine-friendly interface towards a knowledge base”

“Both the formulation of the queries and the human-readable presentation of the results should typically be implemented by the calling software, and not be done manually by human users”
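As a rough illustration of "calling software" formulating a query, the following Python sketch builds a SPARQL SELECT for the narrower concepts of a SKOS concept and encodes it as an HTTP GET URL with a `query` parameter, as the SPARQL protocol allows. The endpoint address `http://example.org/sparql` is a placeholder, not the address of any THGenius installation, and the function names are hypothetical. No request is actually sent.

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption for this sketch, not a real service.
ENDPOINT = "http://example.org/sparql"


def narrower_query(concept_uri):
    """Build a SPARQL query listing the narrower concepts of one concept."""
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?narrower ?label WHERE {\n"
        f"  <{concept_uri}> skos:narrower ?narrower .\n"
        "  OPTIONAL { ?narrower skos:prefLabel ?label }\n"
        "}"
    )


def request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query})


query = narrower_query(
    "http://nektar.oszk.hu/resource/auth/Asia_Minor_Occidentalis"
)
url = request_url(ENDPOINT, query)
print(query)
print(url)
```

The calling application would then fetch this URL, ask for a machine-readable result format, and render the answer for the user, which is the division of labour the quotation describes.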

Our proposal: THGenius


THGenius: the SPARQL endpoint

The ‘Asia Minor Occidentalis’ in THGenius (that reads the SKOS concept)


THGenius: search the Semantic Web


The open search in THGenius


THGenius: different perspectives to see concepts


THGenius: also a new Thesaurus management system

WeCat: a traditional way to manage Thesauri


THGenius: also a new Thesaurus management system

THGenius: authorised people manage the thesaurus via the web


Keyword and Keyphrase Indexing

Keywords and keyphrases summarize and describe the content of single documents and provide additional semantic metadata that is useful for a lot of purposes.

The task of assigning keywords and keyphrases to a document is called keyphrase / keyword indexing.

In libraries, professional indexers select keyphrases and keywords from a controlled vocabulary (Subject Headings) according to defined cataloguing rules.

The idea behind the process described in the next slides is to automate the indexing task, so that a set of keywords / keyphrases, extracted using semantic relationships within a thesaurus (expressed in SKOS format), is added to our documents automatically.

This is another interesting advantage of having a thesaurus in SKOS format.



For our example we will use the following set of documents:

File                 Description
-------------------  ------------------------------------------------
Circulation.doc      Amicus Circulation Module user manual
Dubliners.pdf        The Dubliners (by J.Joyce)
Harry_Potter.pdf     Harry Potter and the Quest of Values (a thesis)
bondvaluation.xls    Bond Calc Spreadsheet
Moby-Dick.pdf        Moby Dick (by H.Melville)
Searching.odt        Amicus Search Module user manual
WeLoan.ppt           Amicus Circulation Web Module (by T.Possemato)



First of all we will extract metadata from our documents. Specifically we will get the “title” and “author” metadata attributes.

The following is what the process produces:

Metadata attribute : title=Bond Calculator
Metadata attribute : author=Robert Jones

Metadata attribute : title=Amicus Circulation Module - User Manual
Metadata attribute : author=Anneke

Metadata attribute : title=Dubliners
Metadata attribute : author=James Joyce

Metadata attribute : title=Harry Potter and the Quest for Values
Metadata attribute : author=Tony Lennard

Metadata attribute : title=Moby Dick
Metadata attribute : author=Herman Melville

Metadata attribute : title=WeLoan - The new circulation module
Metadata attribute : author=Tiziana Possemato

...



As a second step, we will proceed with text extraction.

Regardless of the file format, we will extract the textual content from each document.

Together with the previously extracted metadata, this is an important part of keyword indexing because later, using the extracted text, the system will be able to determine term occurrence, frequency and relevance within the documents.

Keep in mind that the file format is not important from this point of view. That means you can use doc, txt, pdf, rtf, xml, html, OpenOffice documents and, generally speaking, all formats that have (direct or indirect) textual content.



As an example, the following is a section of the text extracted from “Amicus Circulation Module user manual” (a Microsoft Word document):

...If the item is received in the requesting library, it needs to be checked in to make the item available for circulation. To do this, one has to follow the steps given below:

Click on the Check In button on the Circulation Main Menu.

Enter the barcode of the copy and press enter. A message appears that the item has arrived from transit. Click on the close button.

If one checks the status of the copy in the requesting library, one will find an additional field on the status of copy screen: “Original branch”, showing the owning branch of the transferred item.

After the check in of the item, the copy can be charged out to the borrower who requested the copy. The policies of the requesting library are valid as policies for this book.

A hold can be placed on an item of another library from the moment the book has been transferred by the owning library. See: charge out policies

Note that: The item can be charged out immediately when a borrower is present in the library at that moment. In this case, it is not necessary to check in the copy first before doing a charge out.

To return the copy

There are two options to return a transferred copy to the owning library. The first option is when the borrower comes to check in the copy:

Enter the Barcode number of the copy on the Check In screen and press enter....



After extracting the textual content from our documents, it is time to extract keywords and keyphrases.

In order to do that we need:

– Metadata attributes: see first step;
– Text: see second step;
– A controlled vocabulary (thesaurus) in SKOS format.

Regarding the last point, for this example we will use the Library of Congress Subject Headings (LCSH), but keep in mind that any thesaurus in SKOS format can be used.
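The simplest form of this step can be sketched in Python: take the preferred labels of a SKOS thesaurus, look for them in the extracted text, and rank them by frequency. This is a deliberately naive illustration, not the extraction method used for the results in these slides, which also exploits semantic relationships within the thesaurus. The label list is a handful of LCSH headings typed in by hand, the sample sentence is invented for the example, and the function name `extract_keyphrases` is hypothetical.

```python
import re
from collections import Counter

# A few LCSH headings, typed in by hand for the example. In practice these
# would be the skos:prefLabel values read from the thesaurus.
LABELS = ["Whaling", "Whales", "Ships", "Boats and boating", "Steam engines"]


def normalize(text):
    """Lower-case the text and reduce it to words separated by single spaces."""
    return " ".join(re.findall(r"\w+", text.lower()))


def extract_keyphrases(text, labels, top=5):
    """Count whole-word occurrences of each label and return the most frequent."""
    haystack = normalize(text)
    counts = Counter()
    for label in labels:
        needle = re.escape(normalize(label))
        hits = len(re.findall(rf"\b{needle}\b", haystack))
        if hits:
            counts[label] = hits
    return counts.most_common(top)


# Invented sample text, standing in for the output of the text extraction step.
SAMPLE = (
    "The ships sailed for the whaling grounds. Whaling was hard; "
    "the whales were harder. Few ships returned."
)

result = extract_keyphrases(SAMPLE, LABELS)
for label, hits in result:
    print(label, hits)
```

Pure string matching like this ignores synonyms, inflected forms and the broader/narrower structure of the thesaurus, which is where a SKOS vocabulary adds value over a flat keyword list.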



The following are keyphrases and keywords extracted from Moby Dick by Herman Melville using two different thesauri: Library of Congress Subject Headings (LCSH) and Medical Subject Headings (MeSH).

LCSH: Soils; Whaling; Whales; Hand; Boats and boating; History; Ships; Steam engines; Steam engineers; Poultry; Seas; Journalism; History; Emotions; Interest (Psychology); Fat; Steam engineering; Steam-engines; ...

MeSH: Dogs; Male; Smell; Simian Acquired Immunodeficiency Syndrome; Female; Spermatozoa; Animals; Animation [Publication Type]; Sleep; Leg; Cattle; Mouth; Monsters; Aged; Aging; Mortality; DNA Transposable Elements; Brain; ...



And finally, after indexing metadata, text, keywords and keyphrases, we can search those documents using our favourite search engine.


THGenius in a few concepts

What THGenius is:

• The best opportunity for a library to be attractive to modern, smart users

• The evolution from traditional library catalog to semantic web: not only from a vendor's but also from a library's point of view

• A very powerful and user-friendly way to produce, use and share library data, available on the web

• A simple way to ‘manage’ thesaurus and authority data in a very standard and reusable format

• A powerful and simple way to improve search functions, by adding full text and metadata from different file types