Introduction to the Xapian Search Engine
Sébastien François, EPrints Lead Developer
EPrints Developer Powwow, ULCC
Open Source Search Engine Library
Written in C++ (we use the Perl bindings)
Uses the BM25 ranking function to score the relevance of each
match
“Scales well”: 100+ million documents
Oh… code that we don’t need to maintain!
Presentation
Database
Document
◦ data
◦ terms
◦ Values
(Xapian) Metadata management
Searching
Are you ready for it?
Core Concepts
Collection of files storing indexes, positions, term
frequencies, …
One write-lock, multiple read-locks
Stored in archives/<id>/var/xapian/
Supports multiple DBs (unused in EPrints)
Can store arbitrary metadata
Core Concepts: Database
A Document is an item returned by a search
So it’s also the meaty bit of indexing
Maps to a single data-obj in EPrints
Has three main components:
◦ data
◦ terms
◦ values
Core Concepts: Document
Arbitrary blob of data
Un-processed by Xapian
Used to store information needed to display the results
Used to store the data-obj identifier in EPrints in order to
quickly build EPrints::List objects
Could be used to store more complex data: cached
citations, JSON/Perl representation of the data-obj
Limit ~100MB per Document
Core Concepts: Document Data
Basis of relevance search: a search is a process of
comparing the terms specified by a Query against the
terms in the DB
Three main types of terms:
◦ Un-prefixed terms: can be seen as a general pool of indexed terms
◦ Prefixed terms: allow searching a subset of the information (title,
authors…)
◦ Boolean terms: used to index identifiers (which don’t add any useful
information to the probabilistic indexes)
Core Concepts: Document Terms
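As a rough illustration of the three term types, the sketch below builds the term lists for one record in plain Python. It does not call Xapian. The prefixes (`S`, `A`, `XTYPE:`, `XSUBJ:`), the record layout and the function names are assumptions made up for the example, not necessarily what EPrints uses.

```python
import re

# Hypothetical prefixes, chosen for illustration only.
FIELD_PREFIXES = {"title": "S", "authors": "A"}
BOOLEAN_PREFIXES = {"type": "XTYPE:", "subjects": "XSUBJ:"}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_terms(record):
    """Return (unprefixed, prefixed, boolean) term sets for one record."""
    unprefixed, prefixed, boolean = set(), set(), set()
    for field, prefix in FIELD_PREFIXES.items():
        for token in tokenize(record.get(field, "")):
            unprefixed.add(token)          # general pool of indexed terms
            prefixed.add(prefix + token)   # per-field pool (title, authors...)
    for field, prefix in BOOLEAN_PREFIXES.items():
        for value in record.get(field, []):
            boolean.add(prefix + value)    # exact identifier, no text processing
    return unprefixed, prefixed, boolean


record = {
    "title": "Crab Nebula observations",
    "authors": "Smith",
    "type": ["article"],
    "subjects": ["PM"],
}
unprefixed, prefixed, boolean = build_terms(record)
```

The same word ends up in both the unprefixed and the prefixed pool, while identifiers such as the type only ever appear as boolean terms.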
Boolean terms are useful for filtering on exact values (e.g.
subjects:PM, type:article, …). No text processing is involved;
a value appears 0 or 1 times in a Document.
Textual data - TermGenerator class:
◦ Provides the Stemmer and Stopper classes (note: language-dependent)
◦ Spelling correction
◦ Exact matching (“hello world”) and the termpos joys
Core Concepts: Document Terms (2)
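Exact matching relies on term positions being stored at indexing time. The sketch below is a toy positional index in plain Python, not Xapian's implementation; the function names and the whitespace tokenizer are assumptions for the example.

```python
from collections import defaultdict


def index_positions(docs):
    """Map each term to {doc_id: [positions]} (positions start at 1)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return ids of documents containing the words of `phrase` consecutively."""
    words = phrase.lower().split()
    if not words:
        return set()
    matches = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            # Every following word must sit exactly one position further on.
            if all(
                start + offset in index.get(word, {}).get(doc_id, [])
                for offset, word in enumerate(words[1:], start=1)
            ):
                matches.add(doc_id)
                break
    return matches


docs = {1: "hello world", 2: "world hello", 3: "hello big world"}
index = index_positions(docs)
```

Here `phrase_match(index, "hello world")` only accepts document 1: documents 2 and 3 contain both words, but not next to each other in that order.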
Unprefixed terms used for the simple search
Prefixed terms used for a field-based search (such as the
advanced search)
Boolean terms used for any identifier-type field – this
includes facets (when searching)
Core Concepts: Document Terms (3)
“search helpers” – we use them for ordering and faceting
(occurrences & available facets)
Each value (e.g. an order-value, a facet value) is stored in
a numbered slot (32-bit integer)
Mappings between a meaningful string and a slot are
stored in the Xapian DB as metadata
eprint.creators_name.en (1000000) is the slot for the
order-value for the field “creators_name” on the dataset
“eprint” for English
Core Concepts: Document Values
eprint.facet.type.0 (1500300) is the 1st slot for a facet
“type” on the dataset eprint
Used by the MultiValueSorter class to order data (when
not ordered by relevance)
Used to find out available facets (after a search) and the
occurrences of the values e.g. there are 3 items of type
‘article’, 14 items of date ‘2013’
The Xapian documentation advises keeping the number of
values low (too many slow down searching)
We usually limit the number of slots for a facet to 5
Core Concepts: Document Values (2)
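To show how values stored in numbered slots can drive ordering, here is a minimal multi-key sort in plain Python. It only imitates the idea behind ordering by several value slots and is not the MultiValueSorter API. Slot 1000000 is the creators_name example from the slides; slot 1000001 for a date and the document layout are assumptions.

```python
NAME_SLOT = 1000000   # eprint.creators_name.en, as in the slides
DATE_SLOT = 1000001   # hypothetical slot for a date order-value


def sort_by_slots(documents, keys):
    """Order documents by several value slots.

    `keys` is a list of (slot, descending) pairs, most significant first.
    A missing value is treated as an empty string.
    """
    ordered = list(documents)
    # Python's sort is stable, so sorting by the least significant key
    # first and the most significant key last gives a multi-key ordering.
    for slot, descending in reversed(keys):
        ordered.sort(key=lambda doc: doc["values"].get(slot, ""), reverse=descending)
    return ordered


documents = [
    {"id": 1, "values": {NAME_SLOT: "smith", DATE_SLOT: "2013"}},
    {"id": 2, "values": {NAME_SLOT: "jones", DATE_SLOT: "2012"}},
    {"id": 3, "values": {NAME_SLOT: "smith", DATE_SLOT: "2011"}},
]
by_name_then_newest = sort_by_slots(documents, [(NAME_SLOT, False), (DATE_SLOT, True)])
```

Values are compared as strings here, which is why order-values need to be stored in a form that sorts correctly as text.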
We need to keep track of our slot mappings in the Xapian
Database (Xapian does not do this for us)
EPrints reserves 1 000 000 slots per dataset:
◦ 500 000 for order-values (1 per orderable field)
◦ 500 000 for facet slots (1 per facetable value)
EPrints also stores the current slot offsets to know:
◦ where the range for the next dataset starts
◦ where the next order-value slot is
EPrints also stores some other useful information as
Metadata
Core Concepts: Metadata management
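The slot bookkeeping can be sketched as a small registry. This is a toy model under stated assumptions, not the EPrints code: the class and method names are invented, a plain dict stands in for the Xapian metadata, and the first dataset's range is assumed to start at 1 000 000 because that is the slot shown for eprint.creators_name.en.

```python
ORDER_SLOTS = 500_000                      # first half: order-values
FACET_SLOTS = 500_000                      # second half: facet slots
SLOTS_PER_DATASET = ORDER_SLOTS + FACET_SLOTS


class SlotRegistry:
    """Toy model of the slot mappings kept as metadata."""

    def __init__(self, first_base=1_000_000):   # assumed starting offset
        self.metadata = {}            # stands in for the DB's metadata
        self.next_base = first_base   # where the next dataset's range starts
        self.bases = {}               # dataset -> start of its range
        self.next_order = {}          # dataset -> next free order-value slot

    def _base(self, dataset):
        """Reserve a range of slots for a dataset on first use."""
        if dataset not in self.bases:
            self.bases[dataset] = self.next_base
            self.next_order[dataset] = self.next_base
            self.next_base += SLOTS_PER_DATASET
        return self.bases[dataset]

    def order_slot(self, dataset, field, lang):
        """Return the slot for an order-value, allocating it on first use."""
        key = f"{dataset}.{field}.{lang}"
        if key not in self.metadata:
            base = self._base(dataset)
            slot = self.next_order[dataset]
            if slot >= base + ORDER_SLOTS:
                raise RuntimeError("order-value range exhausted")
            self.metadata[key] = slot
            self.next_order[dataset] = slot + 1
        return self.metadata[key]

    def facet_range_start(self, dataset):
        """Return the first slot of the dataset's facet range."""
        return self._base(dataset) + ORDER_SLOTS


registry = SlotRegistry()
creators_slot = registry.order_slot("eprint", "creators_name", "en")
```

With these assumptions the facet range of "eprint" covers 1500000 to 1999999, which is consistent with eprint.facet.type.0 sitting at 1500300. How slots are handed out inside that range is not modelled here.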
Core Concepts: Metadata management (2)
Reverse process of indexing
Composed of a tree of Query objects (and sometimes a
QueryParser object) linked by boolean operators
$query = new Query( "hello" )
$query = new Query( AND, $query, "world" )
Can be stringified to see how the query is interpreted
(easier to read than SQL!)
Core Concepts: Searching
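The idea of a query as a tree that can be stringified is easy to model. The sketch below is plain Python with made-up class names (`Term`, `Op`); it is not the Xapian Query class, and the string it prints is not Xapian's own output format.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Term:
    text: str

    def __str__(self):
        return self.text


@dataclass(frozen=True)
class Op:
    operator: str                              # "AND", "OR" or "AND_NOT"
    children: Tuple[Union["Op", Term], ...]

    def __str__(self):
        joined = f" {self.operator} ".join(str(child) for child in self.children)
        return f"({joined})"


def matches(query, terms):
    """Evaluate the tree against the set of terms of one document."""
    if isinstance(query, Term):
        return query.text in terms
    results = [matches(child, terms) for child in query.children]
    if query.operator == "AND":
        return all(results)
    if query.operator == "OR":
        return any(results)
    if query.operator == "AND_NOT":
        return results[0] and not any(results[1:])
    raise ValueError(f"unknown operator: {query.operator}")


# Same shape as the two-step example on the slide.
query = Term("hello")
query = Op("AND", (query, Term("world")))
```

Printing the tree shows how the query was put together, which is the debugging aid the slide refers to.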
Parses user queries
Supports:
◦ wildcards: wild* will match wildcat
◦ boolean op’s: pear AND (red OR green NOT blue)
◦ love/hate op’s: crab +nebula -crustacean
◦ exact match: "lorem ipsum"
◦ synonyms: colour/color, realise/realize
◦ stemming: happiness/happy -> happi
◦ suggestions: may provide a corrected query
Features can be turned on/off (all are enabled on EPrints)
Core Concepts: Searching - QueryParser
The object which runs the query
Alternative ordering methods can be applied
A MatchDecider may be provided to filter out
results (in fact, we use one to compute facets)
Returns an MSet (Match Set) which contains the actual
matching Documents
Core Concepts: Search - Enquire
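The trick of computing facets from a match decider can be sketched as follows: a callback that sees each candidate document, accepts all of them, and counts facet values on the way. This is plain Python with invented names (`FacetCountingDecider`, `run_search`), not the Xapian MatchDecider or Enquire API. Slot 1500300 is the eprint.facet.type.0 example from the slides; slot 1500301 as a second "type" slot is an assumption.

```python
from collections import Counter


class FacetCountingDecider:
    """Accepts every document, counting facet values as a side effect."""

    def __init__(self, facet_slots):
        self.facet_slots = facet_slots   # facet name -> list of slots
        self.counts = {name: Counter() for name in facet_slots}

    def __call__(self, doc_values):
        for name, slots in self.facet_slots.items():
            for slot in slots:
                value = doc_values.get(slot)
                if value:
                    self.counts[name][value] += 1
        return True   # never filters anything out; it only observes


def run_search(candidates, decider):
    """Keep the candidates the decider accepts (a stand-in for the search run)."""
    return [doc_id for doc_id, values in candidates if decider(values)]


decider = FacetCountingDecider({"type": [1500300, 1500301]})
candidates = [
    (1, {1500300: "article"}),
    (2, {1500300: "article"}),
    (3, {1500300: "book"}),
]
results = run_search(candidates, decider)
```

After the run, `decider.counts["type"]` holds the occurrences per value, which is the information needed to display lines such as "2 items of type article".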
http://xapian.org
◦ architecture overview
◦ documentation
◦ advice for implementation
Questions?
EPrints implementation…
Final words