Copyright 2010 @CULT. All rights reserved
From SubjectA to THGenius: searching the Semantic Web
29th ADLUG ANNUAL MEETING 2010, Centro Congressi Panorama – Trento, Provincia Autonoma di Trento, 22-24 September 2010
RDF and Open Linked Data, a first approach (part II)
Library data in a modern context
The library catalogue (as traditional catalogue or as OPAC) has been the only context for library data since its inception.
The purposes of the library catalogue:
a. Identifying the library’s holdings
b. Supporting management of those holdings
c. Providing entry and discovery points for librarian and non-librarian users
Are the efforts of librarians in creating and maintaining the catalog rewarded by users? For various reasons, users favor the Web as an information platform over the library.
The question for librarians and vendors has to be: how can we strengthen the connection between libraries and users?
The question that we must face, and that we must face sooner rather than later, is how we can best transform our data so that it can become part of the dominant information environment that is the Web.
The Web as context
Current scenario: a change is under way
The Web is more and more the source of information for searchers and researchers, and the library needs to be interconnected with that web of data.
The library catalog data must be transformed from the current ‘textual description’ into a set of data elements to which machine processes can be applied.
These data elements must be compatible with the current technology, that is, the World Wide Web.
We can define this process as the evolution from ‘library catalog’ to Semantic Web.
For us as a vendor, this process means the evolution from SubjectA to THGenius.
Data in the traditional catalogue
=LDR 00688nz a2200265n 4500
=001 000000008238
=005 20100519190730.0
=008 100519nn\ano\\ba\n\\\\\\\\\\\n\ana\\\\\d
=040 \\$aOSZK$bhun$fKöztaurusz
=151 \\$aAsia Minor Occidentalis
=551 \\$wgnnn$aókori történeti táj
=551 \\$whnnn$aBithynia
=551 \\$whnnn$aCaria
=551 \\$whnnn$aIonia
=551 \\$whnnn$aLycaonia
=551 \\$whnnn$aLycia
=551 \\$whnnn$aLydia
=551 \\$whnnn$aMysia
=551 \\$whnnn$aPamphylia
=551 \\$whnnn$aPhrygia
=551 \\$whnnn$aPisidia
=551 \\$wjnnn$aAsia Minor
=551 \\$wpnnn$aTörökország
=551 \\$wmnnn$aAsia Minor Orientalis
=751 \4$a(392)
The ‘Asia Minor Occidentalis’ as MARC21 authority record
The knowledge base for the Web
<skos:Concept rdf:about="http://nektar.oszk.hu/resource/auth/Asia_Minor_Occidentalis">
  <skos:inScheme rdf:resource="http://www.oszk.hu/thesaurus/location"/>
  <dc:source>OSZK geotezaurusz</dc:source>
  <dc:type>location</dc:type>
  <skos:prefLabel xml:lang="hu">Asia Minor Occidentalis</skos:prefLabel>
  <skos:broader rdf:resource="http://nektar.oszk.hu/resource/auth/ókori_történeti_táj"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Bithynia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Caria"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Ionia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Lycaonia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Lycia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Lydia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Mysia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Pamphylia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Phrygia"/>
  <skos:narrower rdf:resource="http://nektar.oszk.hu/resource/auth/Pisidia"/>
  <skos:broader rdf:resource="http://nektar.oszk.hu/resource/auth/Asia_Minor"/>
  <skos:related rdf:resource="http://nektar.oszk.hu/resource/auth/Törökország"/>
</skos:Concept>
The ‘Asia Minor Occidentalis’ as web resource (in RDF/SKOS format)
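The step from the MARC21 record to the SKOS concept can be sketched in a few lines of Python. This is only an illustration, not the converter used by THGenius: the `$w` codes `g` (broader term) and `h` (narrower term) are standard MARC 21 authority values, while the readings of `j` and `p` are assumptions inferred from the RDF shown above, and the function names are hypothetical.

```python
# Mapping from the first character of subfield $w to a SKOS property.
# 'g' and 'h' are standard MARC 21 codes; 'j' and 'p' are ASSUMPTIONS
# inferred from the RDF output on the slide (they look like local codes).
W_CODE_TO_SKOS = {
    "g": "skos:broader",
    "h": "skos:narrower",
    "j": "skos:broader",   # assumption
    "p": "skos:related",   # assumption
}

BASE = "http://nektar.oszk.hu/resource/auth/"


def parse_subfields(field_body):
    """Turn a field body such as '\\\\$wgnnn$aBithynia' into {'w': 'gnnn', 'a': 'Bithynia'}."""
    subfields = {}
    for chunk in field_body.split("$")[1:]:
        if chunk:  # ignore an empty chunk after a trailing '$'
            subfields[chunk[0]] = chunk[1:]
    return subfields


def marc_551_to_skos(lines):
    """Return (property, target URI) pairs for every 551 field with a known $w code."""
    triples = []
    for line in lines:
        if not line.startswith("=551"):
            continue
        sub = parse_subfields(line[4:])
        prop = W_CODE_TO_SKOS.get(sub.get("w", "")[:1])
        if prop is None or "a" not in sub:
            continue  # unknown relationship code or no heading: skip the field
        triples.append((prop, BASE + sub["a"].replace(" ", "_")))
    return triples


record = [
    r"=151 \\$aAsia Minor Occidentalis",
    r"=551 \\$wgnnn$aókori történeti táj",
    r"=551 \\$whnnn$aBithynia",
    r"=551 \\$wjnnn$aAsia Minor",
    r"=551 \\$wpnnn$aTörökország",
    r"=551 \\$wmnnn$aAsia Minor Orientalis",
]

triples = marc_551_to_skos(record)
for prop, target in triples:
    print(prop, target)
```

The field with code `m` is skipped here because the RDF on the slide contains no corresponding statement, so its intended meaning is left open.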
The Web as context
What we can manage now with THGenius (RDF: Resource Description Framework)
• RDF/SKOS objects: Simple Knowledge Organization System (for the representation of thesauri, classification schemes, taxonomies, subject-heading systems and so on)
• RDF/FOAF objects: Friend of a Friend (an ontology describing persons, their activities and their relations with other people and objects)
• RDF/DC objects: RDF Dublin Core metadata (used to describe information resources, such as documents)
The common goal: to publish our data on the Web as linked entities
Library data in a modern context
http://semanticweb.org/wiki/SPARQL_endpoint
“A SPARQL endpoint enables users (human or other) to query a knowledge base via the SPARQL language. Results are typically returned in one or more machine-processable formats. Therefore, a SPARQL endpoint is mostly conceived as a machine-friendly interface towards a knowledge base”
“Both the formulation of the queries and the human-readable presentation of the results should typically be implemented by the calling software, and not be done manually by human users”
Our proposal: THGenius
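To make the quoted idea of "calling software" concrete, here is a minimal sketch of how a client might prepare a request for a SPARQL endpoint. The endpoint URL `http://example.org/sparql` is a placeholder, not the THGenius address, and the helper names are hypothetical; the only things taken as given are the SKOS namespace and the convention of passing the query text in a `query` parameter of a GET request, as defined by the SPARQL protocol.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://example.org/sparql"  # placeholder endpoint (assumption)
CONCEPT = "http://nektar.oszk.hu/resource/auth/Asia_Minor_Occidentalis"


def narrower_concepts_query(concept_uri):
    """Build a SPARQL SELECT query listing the narrower concepts of a SKOS concept."""
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT ?narrower WHERE {\n"
        f"  <{concept_uri}> skos:narrower ?narrower .\n"
        "}"
    )


def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


query = narrower_concepts_query(CONCEPT)
url = build_request_url(ENDPOINT, query)
print(query)
print(url)
```

The sketch stops before sending the request, since the real endpoint address and its supported result formats are not given in the slides.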
THGenius: the SPARQL endpoint
The ‘Asia Minor Occidentalis’ in THGenius (which reads the SKOS concept)
THGenius: search the Semantic Web
The open search in THGenius
THGenius: different perspectives to see concepts
THGenius: also a new Thesaurus management system
WeCat: a traditional way to manage Thesauri
THGenius: authorised people manage the thesaurus via the web
Keyword and Keyphrases Indexing (1/9)
Keywords and keyphrases summarize and describe the content of single documents and provide additional semantic metadata that is useful for many purposes.
The task of assigning keywords and keyphrases to a document is called keyphrase / keyword indexing.
In libraries, professional indexers select keyphrases and keywords from a controlled vocabulary (Subject Headings) according to defined cataloguing rules.
The idea behind the process described in the next slides is to automate the indexing task, so that a set of keywords / keyphrases, extracted using the semantic relationships within a thesaurus (expressed in SKOS format), is added to our documents automatically.
This is another interesting advantage of having a thesaurus in SKOS format.
Keyword and Keyphrases Indexing (2/9)
For our example we will use the following set of documents:
File                Description
Circulation.doc     Amicus Circulation Module user manual
Dubliners.pdf       The Dubliners (by J.Joyce)
Harry_Potter.pdf    Harry Potter and the Quest of Values (a thesis)
bondvaluation.xls   Bond Calc Spreadsheet
Moby-Dick.pdf       Moby Dick (by H.Melville)
Searching.odt       Amicus Search Module user manual
WeLoan.ppt          Amicus Circulation Web Module (by T.Possemato)
Keyword and Keyphrases Indexing (3/9)
First of all we will extract metadata from our documents. Specifically we will get the “title” and “author” metadata attributes.
The following is what the process produces:
Metadata attribute : title=Bond Calculator
Metadata attribute : author=Robert Jones

Metadata attribute : title=Amicus Circulation Module - User Manual
Metadata attribute : author=Anneke

Metadata attribute : title=Dubliners
Metadata attribute : author=James Joyce

Metadata attribute : title=Harry Potter and the Quest for Values
Metadata attribute : author=Tony Lennard

Metadata attribute : title=Moby Dick
Metadata attribute : author=Herman Melville

Metadata attribute : title=WeLoan - The new circulation module
Metadata attribute : author=Tiziana Possemato
...
Keyword and Keyphrases Indexing (4/9)
As a second step, we will proceed with text extraction.
Regardless of the file format, we will extract the textual content from each document.
Together with the previously extracted metadata, this is an important input for keyword indexing because later, using the extracted text, the system will be able to determine term occurrence, frequency and relevance within the documents.
Keep in mind that the file format is not important from this point of view. That means you can use doc, txt, pdf, rtf, xml, html, OpenOffice documents and, generally speaking, any format that has (direct or indirect) textual content.
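Once plain text is available, term occurrence and frequency can be computed independently of the original file format. The following is a minimal sketch using only the Python standard library; the tokenization rule and the function name are assumptions for illustration, not the method used by the system described here.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each lower-cased word occurs in the extracted text."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)


# A short fragment in the style of the extracted user manual text.
sample = (
    "Click on the Check In button. "
    "Enter the barcode of the copy and press enter."
)

freq = term_frequencies(sample)
print(freq.most_common(5))
```

A real indexer would normally also discard very common words such as "the" before ranking terms; that step is left out to keep the sketch short.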
Keyword and Keyphrases Indexing (5/9)
In order to give you an example, the following is a section of the text extracted from “Amicus Circulation Module user manual” (a Microsoft Word document)
...If the item is received in the requesting library, it needs to be checked in to make the item available for circulation. To do this, one has to follow the steps given below:
Click on the Check In button on the Circulation Main Menu.
Enter the barcode of the copy and press enter. A message appears that the item has arrived from transit. Click on the close button.
If one checks the status of the copy in the requesting library, one will find an additional field on the status of copy screen: “Original branch”, showing the owning branch of the transferred item.
After the check in of the item, the copy can be charged out to the borrower who requested the copy. The policies of the requesting library are valid as policies for this book.
A hold can be placed on an item of another library from the moment the book has been transferred by the owning library. See: charge out policies
Note that: The item can be charged out immediately when a borrower is present in the library at that moment. In this case, it is not necessary to check in the copy first before doing a charge out.
To return the copy
There are two options to return a transferred copy to the owning library. The first option is when the borrower comes to check in the copy:
Enter the Barcode number of the copy on the Check In screen and press enter....
Keyword and Keyphrases Indexing (6/9)
After extracting the textual content from our documents, it is time to extract keywords and keyphrases.
In order to do that we need:
– Metadata attributes: see first step;
– Text: see second step;
– A controlled vocabulary (thesaurus) in SKOS format.
Regarding the last point, for this example, we will use the Library of Congress Subject Headings (LCSH) but keep in mind that any Thesaurus in SKOS format can be used.
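How a controlled vocabulary can drive keyphrase extraction is sketched below. This is a deliberately simple, purely lexical approach under stated assumptions: the tiny label table is hypothetical (it only imitates the preferred and alternative labels a SKOS thesaurus such as LCSH would provide), the sample sentence is invented, and the ranking by raw match count is a stand-in for whatever relevance measure the real system applies.

```python
import re
from collections import Counter

# Hypothetical miniature thesaurus: every label (preferred or alternative)
# points to the preferred label of its concept.
LABEL_TO_CONCEPT = {
    "whale": "Whales",
    "whales": "Whales",
    "whaling": "Whaling",
    "ship": "Ships",
    "ships": "Ships",
    "boat": "Boats and boating",
    "boats": "Boats and boating",
}


def extract_keyphrases(text, label_to_concept, top_n=3):
    """Return (concept, match count) pairs for thesaurus labels found in the text."""
    # Normalize to lower-cased words separated by single spaces.
    normalized = " ".join(re.findall(r"\w+", text.lower()))
    counts = Counter()
    for label, concept in label_to_concept.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        counts[concept] += len(re.findall(pattern, normalized))
    return [(concept, n) for concept, n in counts.most_common(top_n) if n > 0]


text = (
    "The whale sank the ship. Whales were hunted from small boats, "
    "and whaling ships sailed for years."
)

print(extract_keyphrases(text, LABEL_TO_CONCEPT))
```

Because alternative labels resolve to the same concept, "whale" and "whales" both count towards "Whales"; swapping in a different thesaurus only means replacing the label table.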
Keyword and Keyphrases Indexing (7/9)
The following are keyphrases and keywords extracted from Moby Dick by Herman Melville using two different thesauri (Library of Congress Subject Headings and Medical Subject Headings).
LCSH: Soils, Whaling, Whales, Hand, Boats and boating, History, Ships, Steam engines, Steam engineers, Poultry, Seas, Journalism, History, Emotions, Interest (Psychology), Fat, Steam engineering, Steam-engines, ...
MeSH: Dogs, Male, Smell, Simian Acquired Immunodeficiency Syndrome, Female, Spermatozoa, Animals, Animation [Publication Type], Sleep, Leg, Cattle, Mouth, Monsters, Aged, Aging, Mortality, DNA Transposable Elements, Brain, ...
Keyword and Keyphrases Indexing (8/9)
And finally, after indexing metadata, text, keywords and keyphrases, we can search those documents using our favourite search engine.
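As a rough illustration of this last step, the sketch below builds a small in-memory inverted index over the metadata and keywords of two of the sample documents. It is an assumption-laden toy, not a replacement for a real search engine: the field names and functions are hypothetical, and only values that appear in the previous slides are used as data.

```python
import re
from collections import defaultdict


def tokenize(value):
    """Split a field value into lower-cased word tokens."""
    return re.findall(r"\w+", value.lower())


def build_index(documents):
    """Map each token to the set of (file name, field) pairs that contain it."""
    index = defaultdict(set)
    for name, fields in documents.items():
        for field, value in fields.items():
            for token in tokenize(value):
                index[token].add((name, field))
    return index


def search(index, term):
    """Return the sorted (file name, field) pairs matching a single term."""
    return sorted(index.get(term.lower(), set()))


documents = {
    "Moby-Dick.pdf": {
        "title": "Moby Dick",
        "author": "Herman Melville",
        "keywords": "Whaling; Whales; Ships",  # taken from the LCSH result
    },
    "Dubliners.pdf": {
        "title": "Dubliners",
        "author": "James Joyce",
    },
}

index = build_index(documents)
print(search(index, "whaling"))
print(search(index, "Joyce"))
```

Keeping the field name in each posting lets a search distinguish a hit in the assigned keywords from a hit in the title or author.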
Keyword and Keyphrases Indexing (9/9)
THGenius in a few concepts
What THGenius is:
• The best opportunity for a library to be attractive to modern and smart users
• The evolution from the traditional library catalog to the Semantic Web: not only from a vendor's but also from a library's point of view
• A very powerful and user-friendly way to produce, use and share library data, made available on the Web
• A simple way to ‘manage’ thesaurus and authority data in a standard and reusable format
• A powerful and simple way to improve search functions, adding full text and other file metadata