University of North Texas Libraries

Building Search Systems for Digital Library Collections

Mark E. Phillips

Texas Conference on Digital Libraries

May 31, 2007, Austin Texas

University of North Texas Libraries - Digital Initiatives

Library Digital Collections = 31,000+ Digital Objects

• 3 “Systems”
  – Congressional Research Service Archive
    • 9,500+ CRS Reports
  – Portal to Texas History
    • 20,000+ records, 115,205 files
  – UNT Libraries “Digital Collections”
    • 1,800+ records, 131,481 files

• Digital Object Types
  – Images = 18,282
  – Physical Objects = 1,019
  – Texts = 11,668
  – Websites = 46
  – Sound Records = 20

Infrastructure

• UNT Libraries Digital Library Infrastructure
  – Highly customized installation of IndexData’s Keystone Digital Library System
  – OAIS-based system
  – Digital objects housed as XML files on filesystem
  – One XML file per digital object
  – Supports simple, complex and link records
  – Custom workflow for batch ingest
  – Manages web presentable files and descriptive and preservation metadata
  – Digital masters stored in separate system

Search 1.0

• Keystone supplied search
  – Zebra retrieval engine
  – 1 index per “system”
  – Highly customizable search system
  – Vendor supplied search interface and functionality

Search 1.0 - Issues

• Difficult configuration

• Issues with large xml file retrieval (10MB+ xml files)

• Search grammar not functioning correctly

• Relevance ranking was “magic”

• No custom searching

• Only searching at the digital object level

Search 1.5

• MySQL database for page level searching
  – In Document Searching (IDS)
  – Two levels of granularity (Zebra = object and MySQL = page)
  – Easy customization
  – More documentation on relevance ranking
  – Logical search grammars

Search 1.5 – Issues

• Different search grammars (Zebra vs. MySQL full-text)

• Scaling issues

• Search Performance

• System Resources

Search System Criteria

• Customizable relevance ranking
• Sorting
• Simple search syntax
  – Fielded Searching
  – Term Modifiers
    • Wildcard Searches
    • Fuzzy Searches
    • Proximity Searches
    • Range Searches
  – Boolean Operators
  – Grouping
• Caching
• Implemented as a web-service
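The syntax criteria above can be made concrete with a few sample queries. The sketch below assumes the Lucene standard query parser syntax (the library underlying the Solr server chosen for Search 2.0); the field names `title`, `creator`, and `date` are hypothetical and do not come from the slides.

```python
# One sample query per syntax criterion, written in Lucene's standard
# query parser syntax. Field names (title, creator, date) are hypothetical.
EXAMPLES = {
    "fielded": 'title:"texas history"',              # restrict a phrase to one field
    "wildcard": "title:librar*",                     # library, libraries, librarian, ...
    "fuzzy": "creator:phillips~",                    # tolerate small spelling differences
    "proximity": '"digital library"~5',              # both words within 5 positions
    "range": "date:[1900 TO 1950]",                  # inclusive range
    "boolean": "texas AND history NOT oklahoma",     # Boolean operators
    "grouping": "(texas OR oklahoma) AND title:history",
}

for feature, query in EXAMPLES.items():
    print(f"{feature:10} {query}")
```

Whether every form behaves as hoped depends on how each field is indexed; range queries in particular need values that sort correctly as indexed.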

Search 2.0

• Solr is an open source enterprise search server based on the Lucene Java search library.

• XML/HTTP based
• Hit highlighting
• Faceted search
• Caching
• Replication
• Web administration interface
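Because Solr is driven over HTTP, a search is just a URL with query parameters. The following is a minimal sketch of building such a URL, not the UNT implementation: the base URL and the field names `fulltext` and `subject` are assumptions, while `q`, `rows`, `hl`, `hl.fl`, `facet`, and `facet.field` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, rows=10, highlight_field=None, facet_field=None):
    """Build a Solr select URL; the request itself is not sent here."""
    params = [("q", query), ("rows", rows)]
    if highlight_field:
        # Hit highlighting on one field.
        params += [("hl", "true"), ("hl.fl", highlight_field)]
    if facet_field:
        # Faceted search on one field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url}?{urlencode(params)}"


# Hypothetical host and field names, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr/select",
    "title:texas",
    highlight_field="fulltext",
    facet_field="subject",
)
print(url)
```

Fetching this URL from a running Solr instance would return an XML response containing the matching documents, with highlighting and facet counts included when requested.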

Current Architecture

[Diagram; labels only: Query, Digital Collections Server, Digital Object Index, Page Index, Solr (shown twice), Spelling Suggestions, Results Page]

Customizable Relevance

• Combine Full-text AND descriptive metadata
  – Positive Boost to Title (+20)
  – Positive Boost to Subject (+15)
  – Positive Boost to Creator (+14)
  – Positive Boost to Metadata overall (+5)
  – Full-text = Neutral boost
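One way to express the boosts above is to expand the user's terms into a single Lucene query with a per-field boost. This is a sketch under assumptions: the slide's values (+20, +15, +14, +5) are read here as Lucene boost factors, and the field names `title`, `subject`, `creator`, `metadata`, and `fulltext` are hypothetical.

```python
# (field, boost) pairs taken from the slide; None means no explicit boost.
# Field names are hypothetical, and the values are assumed to be Lucene
# boost factors applied with the ^ operator.
BOOSTS = [
    ("title", 20),
    ("subject", 15),
    ("creator", 14),
    ("metadata", 5),
    ("fulltext", None),
]


def boosted_query(user_terms):
    """Search every field for the same terms, weighting metadata fields higher."""
    clauses = []
    for field, boost in BOOSTS:
        clause = f"{field}:({user_terms})"
        if boost is not None:
            clause += f"^{boost}"
        clauses.append(clause)
    return " OR ".join(clauses)


print(boosted_query("texas history"))
```

The sketch does not escape Lucene special characters in the user's input; a real front end would need to do so before building the query.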

Better results

• Helps to overcome IDF’s effect on results

• Results order more logically

• Takes advantage of both metadata and full-text

• User defined relevance ranking?
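The first bullet refers to inverse document frequency (IDF), which weights rare terms more heavily than common ones. As a rough illustration, the sketch below uses the IDF formula from Lucene's classic TF-IDF similarity, 1 + ln(N / (df + 1)); the document counts are made up for the example, and the slides do not say which similarity settings were in use.

```python
import math


def classic_idf(doc_freq, num_docs):
    """IDF as defined by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Made-up counts: a collection of 31,000 objects, one rare and one common term.
rare = classic_idf(doc_freq=10, num_docs=31_000)
common = classic_idf(doc_freq=10_000, num_docs=31_000)
print(f"rare term idf:   {rare:.2f}")
print(f"common term idf: {common:.2f}")
```

A rare word matched deep in a document's full text can therefore outscore a common word matched in a title, which is the effect that field boosts help counteract.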

Questions?