Overview of IU Digital Collections Search
Hui Zhang
Jon Dunn
Indiana University Digital Library Program
IU Digital Library Brown Bag
October 19, 2011

Outline
Introduction and motivation – Jon
Demo – Jon
Technical implementation – Hui
Next steps and future work – Jon

Why cross-collection search?
Support discovery across multiple content formats, collections, and repositories at IU
Use cases:
◦ Multiple formats/collections within a single thematic grouping (e.g. Hoagy Carmichael)
◦ Show off the richness and diversity of IU’s digital collections (PR – see open.iu.edu)
◦ Find digital content at IU for teaching or research use

Digital collections evolution: Discrete collection web sites

Digital collections evolution: Services
METS Navigator
Archives Online
PhotoCat
Video Streaming Service
Variations

Digital collections evolution: Services
Advantages
◦ Can develop workflows for content ingestion and description that are both optimized and scalable
◦ Content stored in a common repository (Fedora)
◦ Can develop discovery interfaces optimized for particular content (e.g. images vs. music)
◦ Common services to expose content into other platforms (e.g. Google)
Disadvantages
◦ “Siloing” discovery by content type can be an issue

Cross-collection search: First iteration
Only selected collections with metadata in Fedora
◦ Includes Archives Online and most image collections
◦ Not video streaming, Variations, encoded text, IUScholarWorks, various “legacy” collections
Metadata only (MODS)
◦ Stored natively as MODS in Fedora
◦ Disseminated on the fly from other formats (PhotoCat2)
◦ Transformed via XSLT from EAD (Archives Online)

Apache Solr Overview
• A Java-based, open source search server that runs as a web application, with Apache Lucene at its core
• Demonstration
• Solr vs. relational database
  ◦ Pros: full-text search, text analysis, flexible fields
  ◦ Cons: no relational operations on fields
• Solr vs. Lucene
  ◦ Pros: web application, centralized configuration, faceting
  ◦ Cons: security, slower than using Lucene directly
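
To make the faceting point concrete, here is a minimal sketch of how a client could build a faceted Solr query over HTTP. The endpoint URL is a placeholder, and the facet field names are borrowed from the sample document in this talk; `q`, `rows`, `wt`, `facet`, and `facet.field` are standard Solr request parameters. This is an illustration, not the query that the DLP interface actually issues.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; host, port, and path are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_facet_query(text, facet_fields, rows=10):
    """Return a Solr select URL that searches for `text` and asks for facet counts."""
    params = [
        ("q", text),
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),
    ]
    # One facet.field parameter per field whose value counts we want back.
    params.extend(("facet.field", name) for name in facet_fields)
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_facet_query("medical students", ["type_of_resource_facet", "genre_facet"])
print(url)
```

The response to such a request carries both the matching documents and, per facet field, a count of documents for each value, which is what a facet-browsing interface displays.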

Solr Schema and Configuration
Schema: specifies how the index is built
◦ field, field type
◦ dynamicField, copyField, uniqueKey
◦ Text analysis: stop words, stemming, synonyms, tokenization
Configuration: specifies settings for Solr itself, query handling, and data import
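
As a rough illustration of what dynamicField buys you, the sketch below mimics how a field name could be resolved to a type: explicitly declared fields first, then wildcard patterns. The `*_t` and `*_facet` suffixes follow the naming seen in the sample document of this talk, but the type assigned to each pattern is an assumption, and the first-match rule is a simplification of Solr's own resolution logic.

```python
from fnmatch import fnmatchcase

# Hypothetical schema: field names follow the sample document,
# the types attached to them are assumptions for illustration.
EXPLICIT_FIELDS = {"id": "string", "year": "int"}
DYNAMIC_FIELDS = [("*_t", "text"), ("*_facet", "string")]


def resolve_field_type(name):
    """Return the type for a field name, or None if the schema has no match."""
    # Explicitly declared fields take precedence over dynamic patterns.
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    # Otherwise use the first wildcard pattern that matches the name.
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type
    return None


for field_name in ("id", "title_t", "genre_facet", "item_id"):
    print(field_name, resolve_field_type(field_name))
```

With this convention, a value can be indexed twice: once in an analyzed `_t` field for searching and once in an unanalyzed `_facet` field so that facet values display as whole strings.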

Converting MODS to Solr XML
Solr XML
◦ <add><doc><field>…</field>…</doc></add>
◦ Can simply be POSTed to the Solr index
Translation of MODS to Solr XML
◦ Use XSLT
◦ Called by the indexing program
Extract facet values
◦ Format: MODS:typeOfResource
◦ Collection: customized based on item’s Fedora PID
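
The translation itself is done in XSLT; the sketch below is only a Python stand-in that shows the shape of the mapping, not the DLP stylesheet. The MODS snippet is an illustrative record, the function name is hypothetical, and the target field names follow the `_t` / `_facet` convention of the sample document. Only `titleInfo/title`, `typeOfResource`, and `genre` are handled.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Illustrative MODS record, not taken from the repository.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Women Medical Students</title></titleInfo>
  <typeOfResource>still image</typeOfResource>
  <genre>Photographs</genre>
</mods>"""


def mods_to_solr_xml(pid, mods_xml):
    """Build a Solr <add><doc> message from a few MODS elements."""
    mods = ET.fromstring(mods_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    def add_field(name, value):
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value

    # The Fedora PID serves as the unique key of the Solr document.
    add_field("id", pid)

    title = mods.findtext("mods:titleInfo/mods:title", namespaces=MODS_NS)
    if title:
        add_field("title_t", title)

    # Index these values twice: a searchable text field and a facet field.
    for source, target in (("typeOfResource", "type_of_resource"), ("genre", "genre")):
        for element in mods.findall("mods:" + source, MODS_NS):
            value = (element.text or "").strip()
            if value:
                add_field(target + "_t", value)
                add_field(target + "_facet", value)

    return ET.tostring(add, encoding="unicode")


print(mods_to_solr_xml("iudl:10000", SAMPLE_MODS))
```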

<add>
  <doc>
    <field name="id">iudl:10000</field>
    <field name="title_t">Women Medical Students</field>
    <field name="name_t">Photographic Services, Photographer</field>
    <field name="name_facet">Photographic Services</field>
    <field name="subject_topic_t">Medical students</field>
    <field name="subject_topic_facet">Medical students</field>
    <field name="subject_city_t">Bloomington</field>
    <field name="subject_city_facet">Bloomington</field>
    <field name="subject_state_t">Indiana</field>
    <field name="subject_state_facet">Indiana</field>
    <field name="type_of_resource_t">still image</field>
    <field name="type_of_resource_facet">still image</field>
    <field name="genre_t">Photographs</field>
    <field name="genre_facet">Photographs</field>
    <field name="w3c_taken_date">04-13-1956</field>
    <field name="year">1956</field>
    <field name="item_id">P0028020</field>
    <field name="coll_id_mods">/archives/photos/</field>
    …
  </doc>
</add>
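
A document like this is sent to Solr as the body of an HTTP POST. The sketch below only builds the request so that it can be inspected without a running server; the update URL is a placeholder and the helper name is hypothetical. The actual DLP indexing is done by Java programs, so this is an illustration of the protocol rather than their code.

```python
from urllib.request import Request

# Hypothetical Solr update endpoint; host, port, and path are assumptions.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(solr_xml):
    """Wrap a Solr XML message in an HTTP POST request (not sent here)."""
    return Request(
        SOLR_UPDATE_URL,
        data=solr_xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


message = '<add><doc><field name="id">iudl:10000</field></doc></add>'
request = build_update_request(message)
print(request.get_method(), request.full_url)

# Passing `request` to urllib.request.urlopen would send it to a live server.
# Added documents become searchable only after a commit, for example by
# posting a <commit/> message to the same endpoint.
```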

Solr Indexing
Carried out by two Java programs running under DLP’s Fedora Index Service framework
The service can be invoked by a RESTful HTTP request; Solr indexing is triggered based on conditions specified in the properties file
The MODS records are extracted from the Fedora repository (natively stored) or generated by the getMODS disseminator (PhotoCat2 collections)

Overview of Blacklight
An open source project developed for libraries, with several potential uses:
◦ As a library catalog
◦ As the discovery interface to a digital repository
Optimized to handle diverse content (facet browsing)
Originally developed by the University of Virginia; has a growing community of active contributors and users
Now part of the Hydra Project
Written in Ruby, runs on Rails, requires Solr

Customize Blacklight for DLP Collections
Integrate Blacklight with MODS-based index
◦ Blacklight by default expects MARC fields
New functions and features
◦ Render thumbnail in result view
◦ Use collection website as the landing page
Style and layout
◦ Standard IU banner and footer
◦ Color, font, and window size

Future Improvements
Automatic update of Solr index
◦ Fedora repository communicates with the Solr indexing program via JMS about item updates
Include full-text content
◦ It is challenging to have full-text content and metadata in one index
◦ Optimize the indexing and search algorithms
◦ Search against full-text and use metadata as facets

Future Improvements (cont’d)
Add more collections
◦ Other collections from Fedora
◦ Non-Fedora DLP collections
◦ Archives of Institutional Memory
◦ IUScholarWorks Repository?
◦ IUPUI Digital Collections (CONTENTdm)?
Conduct usability evaluation
Explore integration w/ new Blacklight-based discovery layer for IUCAT
Variations on Video IMLS grant
◦ Hydra/Blacklight-based discovery on PBCore