Data Wrangling at Rice University Denis Galvin Rice University MetaArchive Annual Membership Meeting Houston Texas


Page 1:

Data Wrangling at Rice University

Denis Galvin, Rice University

MetaArchive Annual Membership Meeting

Houston, Texas

Page 2:

ETDs at Rice

• DSpace

• Collection in a database driven by programming

• 42,581 G

• Brief and Full records

Page 3:

ETD Structure

• Brief: http://scholarship.rice.edu/handle/1911/13401

• Full: http://scholarship.rice.edu/handle/1911/13401?show=full

• PDFs: http://scholarship.rice.edu/bitstream/handle/1911/13401/1338793.PDF?sequence=1
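
The three URL forms on this slide follow a regular pattern, which can be sketched in Python as below. This is only an illustration derived from the example URLs: the function names are hypothetical, and the assumption that every ETD follows the same pattern is not confirmed by the slides.

```python
from urllib.parse import urlencode

# Host taken from the example URLs on the slide.
BASE_URL = "http://scholarship.rice.edu"


def brief_url(handle: str) -> str:
    """Brief record page for a handle such as '1911/13401'."""
    return f"{BASE_URL}/handle/{handle}"


def full_url(handle: str) -> str:
    """Full record page: the brief URL plus '?show=full'."""
    return f"{brief_url(handle)}?{urlencode({'show': 'full'})}"


def bitstream_url(handle: str, filename: str, sequence: int = 1) -> str:
    """PDF bitstream URL; 'sequence=1' mirrors the slide's example."""
    query = urlencode({"sequence": sequence})
    return f"{BASE_URL}/bitstream/handle/{handle}/{filename}?{query}"


if __name__ == "__main__":
    print(brief_url("1911/13401"))
    print(full_url("1911/13401"))
    print(bitstream_url("1911/13401", "1338793.PDF"))
```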

Page 4:

Testing

• All testing done on CentOS using VMware

• Plugin tool testing

• Run One Daemon

• Copying other sites' plugins

Page 5:

Manifest Page

Page 6:

Dublin Core

request?verb=ListRecords&metadataPrefix=oai_dc&set=hdl_1911_8299

Page 7:

Sub-Manifest Page

• Links to ETDs within DSpace

Page 8:

Plugin

• Configuration parameters: Base URL

• For the sub-manifest pages: Part (integer)

Page 9:

Crawl Rules

Page 10:

Crawl rules explained

• Include master manifest page:

• Include sub-manifest page:

• Include items under /bitstream

• Include OAI-PMH link
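
The four include rules above can be pictured as a list of URL patterns, with a URL crawled only if it matches one of them. The sketch below is a rough Python illustration, not the plugin's actual rule syntax, which the transcript does not show. The manifest paths (`/manifest/master.html`, `/manifest/part<N>.html`) are hypothetical placeholders; only the `/bitstream` prefix and the `dspace-oai/request` path come from the slides.

```python
import re

BASE_URL = "http://scholarship.rice.edu"  # the plugin's Base URL parameter
PART = 1                                  # the plugin's integer Part parameter


def build_rules(base_url: str, part: int) -> list:
    """Return compiled include patterns for one part of the collection."""
    base = re.escape(base_url)
    return [
        # Master manifest page (hypothetical path, not given in the slides).
        re.compile(rf"^{base}/manifest/master\.html$"),
        # Sub-manifest page for this part (hypothetical path).
        re.compile(rf"^{base}/manifest/part{part}\.html$"),
        # Anything under /bitstream, i.e. the ETD files themselves.
        re.compile(rf"^{base}/bitstream/"),
        # The OAI-PMH request link.
        re.compile(rf"^{base}/dspace-oai/request\?"),
    ]


def is_included(url: str, rules: list) -> bool:
    """A URL is crawled only if at least one include rule matches it."""
    return any(rule.search(url) for rule in rules)


if __name__ == "__main__":
    rules = build_rules(BASE_URL, PART)
    pdf = BASE_URL + "/bitstream/handle/1911/13401/1338793.PDF?sequence=1"
    print(is_included(pdf, rules))
    print(is_included(BASE_URL + "/browse?type=author", rules))
```

Anchoring every pattern at the base URL keeps the crawl from wandering off-site, and tying the sub-manifest pattern to the Part value keeps each part's crawl from pulling in another part's manifest.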

Page 11:

Crawl rules explained

• Include full record

• OAI-PMH link on manifest master

• Pulls in Dublin Core: http://scholarship.rice.edu/dspace-oai/request?verb=ListRecords&metadataPrefix=oai_dc&set=hdl_1911_8299
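
As a rough sketch of what that OAI-PMH link does, the Python below rebuilds the same ListRecords request and pulls Dublin Core titles out of a response. The function names are hypothetical. The endpoint, set name, and `oai_dc` prefix are taken from the slide, and the XML namespace is the standard Dublin Core one. The sketch does not fetch anything over the network and does not follow `resumptionToken` paging, which a large set would normally require.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint as shown on the slide.
OAI_ENDPOINT = "http://scholarship.rice.edu/dspace-oai/request"

# Standard Dublin Core element namespace, in ElementTree's {uri} form.
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(set_spec: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a ListRecords request URL for one OAI-PMH set."""
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    })
    return f"{OAI_ENDPOINT}?{query}"


def dc_titles(xml_text: str) -> list:
    """Return the text of every dc:title element in an OAI-PMH response."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter(f"{DC}title")]


if __name__ == "__main__":
    print(list_records_url("hdl_1911_8299"))
```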

Page 12:

Page 13:

Collection Sizes

• Recommended AU between 1G and 10G

• 5 AUs between 7 and 10G

• Create new AUs as collection grows
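
One way to keep each AU under the recommended ceiling while the collection grows is to fill parts sequentially and open a new part whenever the next item would push the current one past the limit. The slides do not say how Rice actually assigned ETDs to AUs, so the function below is an assumed approach with hypothetical names, shown only to make the sizing rule concrete.

```python
def assign_parts(item_sizes_gb, max_au_gb=10.0):
    """Group items, in order, into parts that each stay at or under max_au_gb.

    Returns a list of parts; each part is a list of item indexes.
    An item larger than max_au_gb still gets a part of its own.
    """
    parts = []
    current, current_size = [], 0.0
    for index, size in enumerate(item_sizes_gb):
        # Close the current part if this item would overflow it.
        if current and current_size + size > max_au_gb:
            parts.append(current)
            current, current_size = [], 0.0
        current.append(index)
        current_size += size
    if current:
        parts.append(current)
    return parts


if __name__ == "__main__":
    # Made-up sizes in gigabytes, purely for illustration.
    print(assign_parts([4, 4, 4, 9, 1]))
```

Because items are never moved once placed, existing parts stay stable and new material only ever lands in the newest part.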

Page 14:

Tips

• Don’t trust testing with the plugin tool

• Read documentation

• Test with Run One Daemon

• Test on the caches

• Use expert mode to write plugin

Page 15:

Questions?