University of North Texas Libraries
Building Search Systems for Digital Library Collections
Mark E. Phillips
Texas Conference on Digital Libraries
May 31, 2007, Austin, Texas
University of North Texas Libraries - Digital Initiatives
Library Digital Collections = 31,000+ Digital Objects
• 3 “Systems”
– Congressional Research Service Archive
  • 9,500+ CRS Reports
– Portal to Texas History
  • 20,000+ records, 115,205 files
– UNT Libraries “Digital Collections”
  • 1,800+ records, 131,481 files
• Digital Object Types
– Images = 18,282
– Physical Objects = 1,019
– Texts = 11,668
– Websites = 46
– Sound Records = 20
Infrastructure
• UNT Libraries Digital Library Infrastructure
– Highly customized installation of IndexData’s Keystone Digital Library System
– OAIS based system
– Digital objects housed as xml files on filesystem
– One xml file per digital object
– Supports simple, complex and link records
– Custom workflow for batch ingest
– Manages web presentable files and descriptive and preservation metadata
– Digital masters stored in separate system
Search 1.0
• Keystone supplied search
– Zebra retrieval engine
– 1 index per “system”
– Highly customizable search system
– Vendor supplied search interface and functionality
Search 1.0 - Issues
• Difficult configuration
• Issues with large xml file retrieval (10MB+ xml files)
• Search grammar not functioning correctly
• Relevance ranking was “magic”
• No custom searching
• Only searching at the digital object level
Search 1.5
• MySQL database for page level searching
– In Document Searching (IDS)
– Two levels of granularity (Zebra=object and MySQL=page)
– Easy customization
– More documentation on relevance ranking
– Logical search grammars
Search 1.5 – Issues
• Different search grammars (Zebra vs. MySQL full-text)
• Scaling issues
• Search Performance
• System Resources
Search System Criteria
• Customizable relevance ranking
• Sorting
• Simple search syntax
– Fielded Searching
– Term Modifiers
  • Wildcard Searches
  • Fuzzy Searches
  • Proximity Searches
  • Range Searches
– Boolean Operators
– Grouping
• Caching
• Implemented as a web-service
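One syntax that covers the criteria above is the Lucene query parser syntax, which Solr builds on. The sketch below lists one query string per criterion. The field names (title, date) and the search terms are hypothetical, and the slides do not state that the criteria were written against this particular syntax.

```python
# Illustrative query strings in Lucene query parser syntax.
# Field names ("title", "date") and all terms are hypothetical.
EXAMPLE_QUERIES = {
    "fielded":   'title:railroad',
    "wildcard":  'rail*',
    "fuzzy":     'railrod~',
    "proximity": '"texas railroad"~5',
    "range":     'date:[1900 TO 1950]',
    "boolean":   'railroad AND NOT highway',
    "grouping":  '(railroad OR highway) AND title:texas',
}


def describe(queries):
    """Return one 'label: query' line per example."""
    return [f"{label}: {query}" for label, query in queries.items()]


for line in describe(EXAMPLE_QUERIES):
    print(line)
```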
Search 2.0
• Solr is an open source enterprise search server based on the Lucene Java search library.
• XML/HTTP based
• Hit highlighting
• Faceted search
• Caching
• Replication
• Web administration interface
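Because Solr is driven over HTTP, a search is just a GET request with query parameters. The sketch below builds such a URL with highlighting and faceting switched on. The host, path, and field names (subject, fulltext) are assumptions for illustration, not the UNT configuration; `q`, `rows`, `hl`, `hl.fl`, `facet`, and `facet.field` are standard Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust to the real deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_search_url(user_query, facet_field="subject",
                     highlight_field="fulltext", rows=10):
    """Build a GET URL asking Solr for hits, highlighting, and facet counts."""
    params = [
        ("q", user_query),
        ("rows", rows),
        ("hl", "true"),               # turn on hit highlighting
        ("hl.fl", highlight_field),   # field to take snippets from (assumed name)
        ("facet", "true"),            # turn on faceted search
        ("facet.field", facet_field), # field to count values of (assumed name)
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_search_url("texas railroad")
print(url)
```

Solr answers a request like this with an XML document containing the hits, the highlighted snippets, and the facet counts.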
Current Architecture
[Diagram components: Query; Digital Collections Server; Solr (Digital Object Index); Solr (Page Index); Spelling Suggestions; Results Page]
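The diagram names the components but not how they are wired together. One plausible reading, sketched below, is that the server sends the query to both indexes, groups page hits under their parent object, and falls back to spelling suggestions when nothing matches. The function, the three callables, the record shapes, and the zero-hit rule are all assumptions made for illustration.

```python
def search_collections(query, search_objects, search_pages, suggest_spelling):
    """Query both indexes and assemble the data a results page needs.

    The three callables are hypothetical stand-ins for requests to the
    digital object index, the page index, and a spelling source.
    """
    object_hits = search_objects(query)

    # Group page-level hits under the object they belong to.
    pages_by_object = {}
    for hit in search_pages(query):
        pages_by_object.setdefault(hit["object_id"], []).append(hit["page"])

    results = [
        {"object_id": oid, "matching_pages": pages_by_object.get(oid, [])}
        for oid in object_hits
    ]

    # Assumed policy: only offer suggestions when nothing matched.
    suggestions = suggest_spelling(query) if not results else []
    return {"results": results, "suggestions": suggestions}


# Stub indexes with made-up data so the sketch runs without a server.
page = search_collections(
    "railroad",
    search_objects=lambda q: ["obj1", "obj2"],
    search_pages=lambda q: [
        {"object_id": "obj1", "page": 4},
        {"object_id": "obj1", "page": 9},
    ],
    suggest_spelling=lambda q: [],
)
print(page)
```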
Customizable Relevance
• Combine Full-text AND descriptive metadata
– Positive Boost to Title (+20)
– Positive Boost to Subject (+15)
– Positive Boost to Creator (+14)
– Positive Boost to Metadata overall (+5)
– Full-text = Neutral boost
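A minimal sketch of how the boosts above could be expressed, assuming the slide's figures are Lucene-style `^` boost factors and that a neutral boost means a factor of 1. The field names are hypothetical; the slides do not show the actual schema or query.

```python
# Boost values from the slide; the field names themselves are hypothetical.
FIELD_BOOSTS = [
    ("title", 20),
    ("subject", 15),
    ("creator", 14),
    ("metadata", 5),
    ("fulltext", 1),  # a factor of 1 leaves the score unchanged (neutral)
]


def boosted_query(term, boosts=FIELD_BOOSTS):
    """OR the same term across every field, weighting each with ^boost."""
    clauses = [f"{field}:({term})^{boost}" for field, boost in boosts]
    return " OR ".join(clauses)


def dismax_qf(boosts=FIELD_BOOSTS):
    """The same weights as a value for Solr's DisMax `qf` parameter."""
    return " ".join(f"{field}^{boost}" for field, boost in boosts)


print(boosted_query("railroad"))
print(dismax_qf())
```

In real use the user's term would need escaping before being placed into the query string; that step is left out here to keep the sketch short.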
Better results
• Helps to overcome IDF’s effect on results
• Results order more logically
• Takes advantage of both metadata and full-text
• User defined relevance ranking?
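The slides do not spell out the IDF effect. One common reading is that inverse document frequency gives rare terms a much larger weight than common ones, so a rare string buried in full text can outrank a plain metadata match. The sketch below uses the IDF formula from Lucene's classic TF-IDF similarity, 1 + ln(N / (df + 1)); the document frequencies are made up, and the collection size is rounded from the figure given earlier.

```python
import math


def classic_idf(doc_freq, num_docs):
    """IDF as in Lucene's classic TF-IDF similarity: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 31000  # roughly the collection size quoted earlier

# Hypothetical document frequencies for a rare and a common term.
rare = classic_idf(doc_freq=3, num_docs=NUM_DOCS)
common = classic_idf(doc_freq=5000, num_docs=NUM_DOCS)

print(f"rare term idf:   {rare:.2f}")
print(f"common term idf: {common:.2f}")
```

Field boosts such as those on the previous slide give a separate lever for favoring metadata matches when weights like these would otherwise dominate the ranking.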
Questions?