IWIR-CRIS '06
Data retrieval in PURE
Data retrieval in the 4-year-old PURE CRIS project at 9 universities
atira
Niels Jernes Vej 10
DK-9220 Aalborg
+45 9635 6100
www.atira.dk
Agenda
■ Overview
■ Retrieval
  ■ Validated manual data gathering
  ■ Dynamic integration to local back-end systems
  ■ Aggregation, enrichment and import of historic data
  ■ Experiments with automated imports of historic data
■ Exposure
  ■ Two web services
  ■ OAI
  ■ Z39.50
  ■ Reports
  ■ Portal framework
■ Archiving
■ Near future
Overview
■ Brief overview
  ■ … in order to discuss ingestion, integration, conversion and import in a specific context
Overview
■ Brief overview
  ■ History
    ■ Development began in 2002
  ■ Users
    ■ 9 universities (DK + SE), several hospitals and other research institutions
  ■ Platform and architecture
    ■ J2EE enterprise application
    ■ Release management: all users run instances of the same release version, built from the same code base
  ■ Business model
    ■ Commercial software licenses, powerful user group, shared budgets
  ■ Modular
    ■ Basic module, Reporting module, Student thesis module, External publications module, Bibliometrics module, Press module
Overview
Retrieval
■ Manual data gathering
  ■ User roles/rights + workflow:
    = decentralized data gathering
    = validated data gathering
    = continuous data gathering
  ■ GUI example
  ■ Management focus is necessary
    ■ Reports and statistics, KPI management, etc.
  ■ Adding value for researchers is necessary
    ■ Instantly in Google indexes, instantly updated personal websites, instantly updated CVs, increased citations (source in paper), etc.
Retrieval
■ Dynamic integration
  ■ Dynamic integration to local back-end systems:
    ■ Personnel systems, payroll systems (for data retrieval)
    ■ LDAPs, Active Directories (for data retrieval + authentication)
    ■ Single sign-on systems (for authentication)
    ■ … to automatically create object types such as “person” or “organization”
  ■ … and yes, PURE hosts data, too
    ■ We need complete objects according to the meta-data model
  ■ Plug-in architecture in PURE:
    ■ Pro = individually adapted integration
    ■ Con = individually programmed plug-in necessary
    ■ Future = GUI, standardized plug-ins
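The plug-in idea above can be sketched roughly as follows. This is a minimal illustration in Python, not PURE's own J2EE code: the `BackendPlugin` interface, the `Person` fields and the directory-style attribute names (`uid`, `cn`, `ou`) are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    # Hypothetical minimal "person" object; the real meta-data model is richer.
    external_id: str
    name: str
    organization: str


class BackendPlugin(ABC):
    """Hypothetical plug-in contract: one implementation per local back-end system."""

    @abstractmethod
    def fetch_persons(self) -> list[Person]:
        ...


class LdapLikePlugin(BackendPlugin):
    """Maps directory-style records (plain dicts here) to Person objects."""

    def __init__(self, records: list[dict[str, str]]):
        # A real plug-in would query the directory; records are passed in
        # directly so that the sketch stays runnable.
        self._records = records

    def fetch_persons(self) -> list[Person]:
        return [
            Person(external_id=r["uid"], name=r["cn"], organization=r.get("ou", "unknown"))
            for r in self._records
        ]


def synchronize(plugins: list[BackendPlugin]) -> dict[str, Person]:
    """Collect persons from all plug-ins, keyed by external id (later plug-ins win)."""
    persons: dict[str, Person] = {}
    for plugin in plugins:
        for person in plugin.fetch_persons():
            persons[person.external_id] = person
    return persons
```

Each institution would supply its own `BackendPlugin` subclass, which mirrors the "individually programmed plug-in" trade-off named on the slide.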
Retrieval
■ Import
  ■ Historic data
  ■ Many sources
    ■ More or less useful data
    ■ More or less consistent use of formats :-)
  ■ The PXA format
    ■ PURE XML Archive format, .zip based
    ■ Meta-data, relations between entities, binary files
  ■ Aggregation > enrichment > conversion > import
    ■ The process is external to PURE
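To make the idea of a .zip-based archive carrying meta-data, relations and binary files concrete, here is a minimal sketch. The slides do not describe the actual PXA layout, so the entry names (`metadata.xml`, `files/`) and the XML element names are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_archive(records: list[dict], binaries: dict[str, bytes]) -> bytes:
    """Pack meta-data, relations and binary files into one in-memory .zip archive."""
    root = ET.Element("archive")  # hypothetical root element, not the real PXA schema
    for rec in records:
        obj = ET.SubElement(root, "object", id=rec["id"], type=rec["type"])
        ET.SubElement(obj, "title").text = rec["title"]
        for target in rec.get("relations", []):
            # A relation is expressed as a reference to another object's id.
            ET.SubElement(obj, "relation", ref=target)

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("metadata.xml", ET.tostring(root, encoding="unicode"))
        for name, data in binaries.items():
            zf.writestr(f"files/{name}", data)
    return buffer.getvalue()
```

Keeping meta-data and binaries in one archive means an external conversion step can hand over a single file per import batch.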
Retrieval
■ Experiments
  ■ Experiments with automated imports of historic data from specific, identified sources
  ■ [source format] > PXA conversion > import > enrichment/validation
  ■ Very poor data quality demands the concept of “draft objects” in PURE
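The "draft object" concept can be sketched as an import step that never rejects a record but marks it as a draft until validation passes. The required fields and the `status` values below are assumptions for illustration; the slides do not specify PURE's validation rules.

```python
from dataclasses import dataclass, field

# Assumed minimum for a valid object; the real meta-data model defines its own rules.
REQUIRED_FIELDS = ("title", "year", "authors")


@dataclass
class ImportedObject:
    data: dict
    status: str = "draft"  # stays "draft" until validation passes
    problems: list[str] = field(default_factory=list)


def import_record(raw: dict) -> ImportedObject:
    """Import a converted record; keep it as a draft if required fields are missing or empty."""
    obj = ImportedObject(data=dict(raw))
    obj.problems = [f"missing {name}" for name in REQUIRED_FIELDS if not raw.get(name)]
    if not obj.problems:
        obj.status = "valid"
    return obj
```

The recorded `problems` list gives the later enrichment/validation step a worklist per object.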
Exposure
■ Web services
  ■ RPC/encoded + document/literal
  ■ Rich libraries of methods
    ■ Including format-specific methods: APA, MLA, HARVARD, VANCOUVER and CBE
    ■ Free and near-instant adding of methods
  ■ WS code example (if time)
Exposure
■ OAI support
  ■ OAI-PMH data provider
  ■ OAI-PMH formats
    ■ DC
    ■ DDF-MXD (Danish national format)
    ■ SVEP (Swedish national format)
    ■ … more to come
  ■ Also used to harvest other PURE repositories for “external publications”
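Harvesting from an OAI-PMH data provider starts with a `ListRecords` request. The sketch below only builds the request URL. The verb and parameter names come from the OAI-PMH protocol and `oai_dc` is its standard Dublin Core prefix; the base URL in the usage note is a made-up placeholder, and the prefixes for DDF-MXD and SVEP are not given in the slides.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    from_date: Optional[str] = None,
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL for selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # UTC datestamp, e.g. "2006-01-01"
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"
```

For example, `list_records_url("https://pure.example.org/oai", from_date="2006-01-01")` asks a (hypothetical) repository for all Dublin Core records changed since the start of 2006.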
Exposure
■ Z39.50
  ■ Enabling of searches in PURE from library systems
  ■ SRW/SRU
Exposure
■ Reports
  ■ PURE reporting module
    ■ GUI example
Exposure
■ Reference manager
  ■ Export of data to local Reference Manager installation
  ■ Using RM-formatted export file
  ■ Promotes registering in the repository rather than in RM
  ■ GUI example
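An export file for Reference Manager could be produced along the following lines. The sketch assumes the RM-formatted file is RIS, the tagged text format that Reference Manager imports; the slides do not name the format. The field selection and the fixed `JOUR` type are simplifications, and the record in the usage note is invented sample data.

```python
def to_ris(publication: dict) -> str:
    """Serialize one publication as a RIS record (journal article assumed)."""
    lines = ["TY  - JOUR"]  # record type; a fuller export would map each publication type
    lines += [f"AU  - {author}" for author in publication["authors"]]
    lines.append(f"TI  - {publication['title']}")
    lines.append(f"PY  - {publication['year']}")
    lines.append("ER  - ")  # end-of-record tag
    return "\r\n".join(lines) + "\r\n"
```

Calling `to_ris({"authors": ["Doe, Jane"], "title": "Sample title", "year": 2006})` yields a five-line record that starts with `TY  - JOUR` and ends with `ER  - `.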
Exposure
■ Portal framework
  ■ PUREportal – free PURE-specific framework for custom development of research exhibition portals
    ■ Online example
  ■ Typical cost scenario: € 20,000
  ■ Typical delivery time: 1 month
  ■ Little need for requirements specification
  ■ Automatic PURE-API maintenance
Archiving
■ Data archiving – 2 levels
  ■ SQL environment
    ■ Meta-data and relations
    ■ Binary files just stored in server file system
  ■ FEDORA via connector (not PURE-specific, Open Source)
    ■ Facilitates:
      ■ Higher quality archival of binary files
      ■ Long term preservation in general
      ■ Adoption of PURE in institutions’ general FEDORA strategies
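The first archiving level, meta-data in SQL with binary files kept in the server file system, can be sketched as follows. SQLite stands in for the unnamed SQL environment, and the table layout and file naming are assumptions for illustration only.

```python
import sqlite3
import tempfile
from pathlib import Path


def store(conn: sqlite3.Connection, file_root: Path, obj_id: str, title: str, payload: bytes) -> Path:
    """Store meta-data in SQL and the binary file on disk, linked by its path."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects (id TEXT PRIMARY KEY, title TEXT, file_path TEXT)"
    )
    path = file_root / f"{obj_id}.bin"
    path.write_bytes(payload)
    conn.execute("INSERT INTO objects VALUES (?, ?, ?)", (obj_id, title, str(path)))
    conn.commit()
    return path


def demo() -> tuple[str, bytes]:
    """Round trip: store one object, then read its title from SQL and its bytes from disk."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(":memory:")
        store(conn, Path(tmp), "obj-1", "Sample title", b"binary payload")
        title, file_path = conn.execute(
            "SELECT title, file_path FROM objects WHERE id = ?", ("obj-1",)
        ).fetchone()
        data = Path(file_path).read_bytes()
        conn.close()
        return title, data
```

In this layout the database only holds a path, so the integrity of the binary file depends on the file system. That is the gap a dedicated repository such as FEDORA is meant to close.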
Near future
■ The near future regarding data retrieval
  ■ More automated imports using increasingly advanced converters
  ■ Automated data delivery (push and harvest) to:
    ■ Industry-specific search services (e.g. PubMed, Nordicom)
    ■ Documentary data collections (such as clinicaltrials.org), and national collections (such as DDF (DK), ForskDok (NO), etc.)
  ■ Temporary import objects
    ■ When imported data are not of sufficient quality to create valid objects
    ■ When data cannot be properly related to other objects upon import