Dspace OAI-PMH

By Sem Gebresilassie, 13 May 2015

Transcript of Dspace OAI-PMH

Page 2: Dspace OAI-PMH

Harvesting Statistical Metadata from an Online Repository for Data Analysis and Visualization

Page 3: Dspace OAI-PMH

Outline

Goal and Motivation

Theseus.fi

Dspace

Getting Data out from Dspace

Dspace OAI-PMH as a Data provider for Theseus

Request Types (Verbs)

Flow Control

Harvesting Data from Theseus’s Data provider

Project Result

Final thoughts

Page 4: Dspace OAI-PMH

Goal

Harvest metadata of thesis documents from Theseus

author name, title, keywords, submission year, ...

Store the harvested data in a separate MySQL database.

Build a Web portal out of this stored data

Goal and Motivation

Why conduct this project?

Analyse thesis data and visualize overall statistics.

Compare thesis documents

Compare universities and departments

Analyse trending keywords used by students every year

Page 5: Dspace OAI-PMH

Theseus.fi

Digital libraries are now commonly used by academic institutions worldwide.

Theseus provides online access to theses and publications from Finnish universities of applied sciences.

End users can search, browse and upload thesis documents to Theseus.

Page 6: Dspace OAI-PMH

...

Theseus also has an API that third-party organizations can use to access thesis data.

Theseus is powered by Dspace, a pioneering open-source digital asset management system.

Functionalities and features of Theseus are inherited from Dspace.

Page 7: Dspace OAI-PMH

Dspace

Dspace is an open-source software platform that provides stable, long-term storage, most commonly for digital intellectual material.

Many academic institutions worldwide use Dspace to offer their users easy access to their digital resources.

Dspace can be freely downloaded and used or even modified to store digital materials.

Page 8: Dspace OAI-PMH

Abbreviations

OAI: Open Archives Initiative

PMH: Protocol for Metadata Harvesting

Page 9: Dspace OAI-PMH

Getting Data out from Dspace

OAI-PMH is an HTTP-based protocol that defines methods for sharing, publishing and archiving metadata from Dspace repositories.

Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is used to programmatically access data from Dspace.

Page 10: Dspace OAI-PMH

Dspace OAI-PMH as a Data provider for Theseus

Dspace repositories have an 'OAI Base URL' in addition to the URL for human users.

OAI Base URL: http://publications.theseus.fi/oai/request?

URL for human users: https://www.theseus.fi/

The OAI Base URL is used in machine-to-machine communication between data providers and data harvesters.

When a harvesting request is made using the OAI Base URL, Theseus’s data provider returns XML-formatted metadata of thesis documents.

Page 11: Dspace OAI-PMH

Theseus OAI-PMH exposes thesis documents in twelve unique metadata formats.

KansalliKirjasto format:

<kk:field schema="dc" element="contributor" qualifier="author" language="none" value=" Denut, Nicolae "/>

OAI Dublin Core format:

<dc:creator> Denut, Nicolae </dc:creator>

Each metadata format can be queried to get any data from Theseus’s data provider.
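The two formats above carry the same author name in different places: as element text in Dublin Core, and as a "value" attribute in the KansalliKirjasto format. A minimal Python sketch of reading both follows. The sample fragments are hand-written from the examples above, and the "kk" namespace URI is a placeholder assumption, since the slides do not show the real one.

```python
import xml.etree.ElementTree as ET

# The Dublin Core URI is the standard one. The "kk" URI is a placeholder
# assumption; the real KansalliKirjasto namespace is not shown on the slides.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "kk": "urn:example:kk-placeholder",
}

# Hand-written fragments modelled on the two examples above.
DC_SAMPLE = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:creator> Denut, Nicolae </dc:creator>"
    "</metadata>"
)
KK_SAMPLE = (
    '<metadata xmlns:kk="urn:example:kk-placeholder">'
    '<kk:field schema="dc" element="contributor" qualifier="author" '
    'language="none" value=" Denut, Nicolae "/>'
    "</metadata>"
)


def authors_from_dc(xml_text):
    """Return author names stored as the text of <dc:creator> elements."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.findall(".//dc:creator", NS) if el.text]


def authors_from_kk(xml_text):
    """Return author names stored in the value attribute of <kk:field>."""
    root = ET.fromstring(xml_text)
    return [
        el.get("value", "").strip()
        for el in root.findall(".//kk:field", NS)
        if el.get("element") == "contributor" and el.get("qualifier") == "author"
    ]


print(authors_from_dc(DC_SAMPLE))
print(authors_from_kk(KK_SAMPLE))
```

Both functions strip the surrounding whitespace, so the two formats yield the same author string.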

Page 12: Dspace OAI-PMH

Request Types (Verbs)

OAI-PMH defines six request types (verbs) that can be appended to an OAI Base URL to access different repository contents.

Theseus implements all six request types to provide thesis metadata to harvesters.

1. Identify: fetches information about the Theseus data provider itself

2. ListMetadataFormats: returns a list of the metadata formats supported by the Theseus data provider

3. ListIdentifiers: lists thesis record identifiers

Page 13: Dspace OAI-PMH

4. ListSets: retrieves the set structure (list of universities and departments)

5. ListRecords: gets a list of complete metadata records of thesis documents from Theseus

6. GetRecord: retrieves the metadata of an individual thesis document

By attaching any one of these request types to Theseus’s OAI Base URL, a request URL can be formed.

Page 14: Dspace OAI-PMH

OAI Base URL + Request type => Request URL

Page 15: Dspace OAI-PMH

http://publications.theseus.fi/oai/request?verb=ListSets
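The composition of OAI Base URL and request type can be sketched in Python as follows. The helper name build_request_url is hypothetical; "metadataPrefix" and "oai_dc" are standard OAI-PMH names, used here only to show how extra arguments are appended.

```python
from urllib.parse import urlencode

# OAI Base URL as given on the slides, without the trailing "?".
BASE_URL = "http://publications.theseus.fi/oai/request"

# The six OAI-PMH request types (verbs).
VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListIdentifiers",
    "ListSets",
    "ListRecords",
    "GetRecord",
}


def build_request_url(verb, **arguments):
    """Form a request URL from the base URL, a verb and optional arguments."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb, **arguments}
    return BASE_URL + "?" + urlencode(query)


print(build_request_url("ListSets"))
print(build_request_url("ListRecords", metadataPrefix="oai_dc"))
```

Rejecting unknown verbs locally avoids a round trip that would only return an error response from the data provider.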

Page 16: Dspace OAI-PMH

Flow control

The three request types ListIdentifiers, ListSets and ListRecords return large lists from Theseus.

In such cases, it is practical to partition them among a series of requests and responses.

Resumption tokens are an OAI-PMH mechanism that lets data providers split long list responses into parts.
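A harvester follows resumption tokens in a loop: request a part, process its items, and repeat with the returned token until no token (or an empty one) comes back. The Python sketch below illustrates this under stated assumptions. It is not the project's own code: the function names are hypothetical, and the demo uses two hand-written responses on a made-up example.org URL instead of contacting Theseus.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch_over_http(url):
    """Download one response body (used for a real harvest)."""
    with urlopen(url) as response:
        return response.read()


def harvest(base_url, verb, fetch=fetch_over_http, **arguments):
    """Yield every item of a list request, following resumption tokens."""
    query = {"verb": verb, **arguments}
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(query)))
        container = root.find(OAI_NS + verb)
        if container is None:  # for example, an <error> response
            return
        token = None
        for child in container:
            if child.tag == OAI_NS + "resumptionToken":
                token = (child.text or "").strip()
            else:
                yield child
        if not token:  # a missing or empty token marks the last part
            return
        # Follow-up requests carry only the verb and the token.
        query = {"verb": verb, "resumptionToken": token}


def _page(set_specs, token):
    """Build a hand-written ListSets response for the demo below."""
    sets = "".join(f"<set><setSpec>{s}</setSpec></set>" for s in set_specs)
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListSets>'
        f"{sets}<resumptionToken>{token}</resumptionToken>"
        "</ListSets></OAI-PMH>"
    )


# Two fake parts keyed by request URL; the second has an empty token.
FAKE_PAGES = {
    "http://example.org/oai?verb=ListSets": _page(["a", "b"], "t1"),
    "http://example.org/oai?verb=ListSets&resumptionToken=t1": _page(["c"], ""),
}

specs = [
    item.find(OAI_NS + "setSpec").text
    for item in harvest("http://example.org/oai", "ListSets", fetch=FAKE_PAGES.__getitem__)
]
print(specs)
```

Passing the fetch function as a parameter keeps the loop testable offline; for a real harvest the default fetch_over_http would be used with the OAI Base URL.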

Page 17: Dspace OAI-PMH

Resumption token workflow

Page 18: Dspace OAI-PMH

Harvesting Data from Theseus’s Data provider

Simple HTML DOM Parser is an open-source parser library written in PHP that reads, modifies, and returns structured content from external data sources.

This parser library can create a Document Object Model by loading structured data from a URL.

To get nodes of the DOM object, this library provides a method called “find()”.
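The same load-then-find pattern can be illustrated with Python's standard library, where findall plays a role analogous to “find()”. This is an analogue for illustration, not the PHP library used in the project. The sample response is hand-written, and its setSpec and setName values are placeholders.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written ListSets response; the set values are placeholders.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>com_1</setSpec><setName>Example University</setName></set>
    <set><setSpec>col_1_2</setSpec><setName>Example Department</setName></set>
  </ListSets>
</OAI-PMH>"""


def parse_sets(xml_text):
    """Build a tree from the XML text and collect every <set> node."""
    root = ET.fromstring(xml_text)
    return [
        {
            "setSpec": node.findtext("oai:setSpec", namespaces=NS),
            "setName": node.findtext("oai:setName", namespaces=NS),
        }
        for node in root.findall(".//oai:set", NS)
    ]


for entry in parse_sets(SAMPLE):
    print(entry["setSpec"], entry["setName"])
```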

Page 19: Dspace OAI-PMH

Summary of gathered theses metadata

Universities: identifier (setSpec), university name, ListSets request URLs, total number of papers

Departments: identifier (setSpec), department name, ListSets request URLs, total number of papers, university identifiers

Thesis documents: thesis identifier, author names, titles, GetRecord request URLs, department identifiers, university identifiers, keywords, subjects (official keywords), number of pages, year, language

Page 20: Dspace OAI-PMH

84,391

Page 21: Dspace OAI-PMH

Project Result

• How many thesis documents are in Theseus?

• How many papers does each school have in Theseus?

• How many papers is each school publishing every year?

• What departments are there in each school?

• How many papers belong to which department?

• How many pages does each paper have?

• In what language is the paper written?

• How many times has each paper been downloaded by Theseus visitors?

• What are the keywords of each thesis document?
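Once the metadata is stored in a relational database, questions like these map to aggregate queries. The sketch below uses Python's built-in sqlite3 as a stand-in for the project's MySQL database. The table layout, column names and rows are all made-up placeholders, not the project's schema or data.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")

# Hypothetical table layout; the project's real schema is not shown.
conn.execute(
    "CREATE TABLE thesis ("
    "identifier TEXT PRIMARY KEY, school TEXT, department TEXT, "
    "year INTEGER, language TEXT, pages INTEGER)"
)

# Made-up placeholder rows, for illustration only.
conn.executemany(
    "INSERT INTO thesis VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("id-1", "School A", "Dept X", 2013, "en", 40),
        ("id-2", "School A", "Dept Y", 2014, "fi", 55),
        ("id-3", "School B", "Dept Z", 2014, "en", 62),
        ("id-4", "School A", "Dept X", 2014, "en", 48),
    ],
)


def papers_per_school_and_year(connection):
    """Answer: how many papers is each school publishing every year?"""
    return connection.execute(
        "SELECT school, year, COUNT(*) FROM thesis "
        "GROUP BY school, year ORDER BY school, year"
    ).fetchall()


def papers_per_department(connection, school):
    """Answer: how many papers belong to which department of a school?"""
    return connection.execute(
        "SELECT department, COUNT(*) FROM thesis WHERE school = ? "
        "GROUP BY department ORDER BY department",
        (school,),
    ).fetchall()


print(papers_per_school_and_year(conn))
print(papers_per_department(conn, "School A"))
```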

Page 22: Dspace OAI-PMH

The front page of the Web portal aims to give better insight into each school’s contribution to Theseus.

Page 23: Dspace OAI-PMH

Web portal showing

Page 24: Dspace OAI-PMH

Departments versus number of Thesis documents in Metropolia UAS

Page 25: Dspace OAI-PMH

Analysing Keywords is also easy

I want to analyse keywords

Fill out a form

See results

Page 26: Dspace OAI-PMH

Keyword fetching form

Page 27: Dspace OAI-PMH

Thank you!

Any questions?
