Introduction to the Xapian Search Engine

Post on 15-Jan-2016


description

Introduction to the Xapian Search Engine. Sébastien François, EPrints Lead Developer. EPrints Developer Powwow, ULCC. An open source search engine library written in C++ (we use the Perl bindings) that uses the BM25 ranking function for relevance matching.

Transcript of Introduction to the Xapian Search Engine

Introduction to the Xapian Search Engine

Sébastien François, EPrints Lead Developer

EPrints Developer Powwow, ULCC

Open Source Search Engine Library

Written in C++ (we use the Perl bindings)

Uses the BM25 ranking function to rank matches by relevance

“Scales well”: 100+ million documents

Oh… code that we don’t need to maintain!

Presentation
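To make the BM25 mention concrete, here is a minimal sketch of the textbook BM25 formula in plain Python. It is not Xapian's implementation (Xapian's weighting scheme has additional parameters); the function name, the default values of `k1` and `b`, and the sample documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenised document against the query (textbook BM25)."""
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue  # a term absent from the document adds nothing
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Longer-than-average documents are penalised through b.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sample data.
docs = [
    "open source search engine library".split(),
    "search engine for open access repositories".split(),
    "a recipe for pear and apple crumble".split(),
]
scores = bm25_scores(["search", "engine"], docs)
```

With these inputs the two documents that mention both query terms score above zero, the shorter one ranks higher, and the unrelated document scores zero.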

Database

Document

◦ data

◦ terms

◦ values

(Xapian) Metadata management

Searching

Are you ready for it?

Core Concepts

Collection of files storing indexes, positions, term frequencies, …

One write-lock, multiple read-locks

Stored in archives/<id>/var/xapian/

Supports searching across multiple DBs (unused in EPrints)

Can store arbitrary metadata

Core Concepts: Database

A Document is an item returned by a search

So it’s also the meaty bit of indexing

Maps to a single data-obj in EPrints

Has three main components:

◦ data

◦ terms

◦ values

Core Concepts: Document

Arbitrary blob of data

Un-processed by Xapian

Used to store information needed to display the results

Used to store the data-obj identifier in EPrints in order to quickly build EPrints::List objects

Could be used to store more complex data: cached citations, JSON/Perl representation of the data-obj

Limit ~100MB per Document

Core Concepts: Document Data

Basis of relevance search: a search is a process of comparing the terms specified by a Query against the terms in the DB

Three main types of terms:

◦ Unprefixed terms: can be seen as a general pool of indexed terms

◦ Prefixed terms: allow searching a subset of the information (title, authors, …)

◦ Boolean terms: used to index identifiers (which don’t add any useful information to the probabilistic indexes)

Core Concepts: Document Terms

Boolean terms are useful for filtering on exact values (e.g. subjects:PM, type:article, …). No text processing is involved; a value appears 0 or 1 times in a Document.

Textual data - TermGenerator class:

◦ Works with the Stemmer and Stopper classes (note: language-dependent)

◦ Spelling correction

◦ Exact matching ("hello world") and the joys of term positions (termpos)

Core Concepts: Document Terms (2)

Unprefixed terms used for the simple search

Prefixed terms used for a field-based search (such as the advanced search)

Boolean terms used for any identifier-type of fields – this includes facets (when searching)

Core Concepts: Document Terms (3)
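The three kinds of terms can be illustrated with a toy indexer in plain Python. This is only a sketch of the idea, not Xapian's TermGenerator: there is no stemming or stopword handling, and the prefix strings, field names and sample record are hypothetical.

```python
import re

# Hypothetical choices for this sketch: which fields get a prefix and
# which are treated as identifiers (boolean terms).
PREFIXED_FIELDS = {"title", "creators"}
BOOLEAN_FIELDS = {"type", "subjects"}


def index_terms(record):
    """Return the set of terms a record would contribute to the index."""
    terms = set()
    for field, value in record.items():
        if field in BOOLEAN_FIELDS:
            # Boolean term: the exact value, no text processing at all.
            terms.add(f"{field}:{value}")
            continue
        words = re.findall(r"\w+", value.lower())
        # Unprefixed terms feed the general pool used by the simple search.
        terms.update(words)
        if field in PREFIXED_FIELDS:
            # Prefixed terms restrict a search to one field.
            terms.update(f"{field}:{word}" for word in words)
    return terms


record = {
    "title": "Introduction to Xapian",
    "abstract": "A search engine library",
    "type": "article",
}
terms = index_terms(record)
```

Here "xapian" is searchable both in the general pool and as "title:xapian", while "type:article" can only be matched as an exact filter.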

Values are “search helpers”: we use them for ordering and faceting (occurrences & available facets)

Each value (e.g. an order-value, a facet value) is stored in a numbered slot (32-bit integer)

Mappings between a meaningful string and a slot are stored in the Xapian DB as metadata

eprint.creators_name.en (1000000) is the slot for the order-value for the field “creators_name” on the dataset “eprint” for English

Core Concepts: Document Values

eprint.facet.type.0 (1500300) is the 1st slot for a facet “type” on the dataset “eprint”

Used by the MultiValueSorter class to order data (when not ordered by relevance)

Used to find out the available facets (after a search) and the occurrences of the values, e.g. there are 3 items of type ‘article’, 14 items of date ‘2013’

The Xapian documentation advises keeping the number of values low (too many slow down searching)

We usually limit the number of slots for a facet to 5

Core Concepts: Document Values (2)
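The facet counting described above can be sketched in plain Python. Each matched document is modelled as a dict mapping slot numbers to stored values. This is an illustration only, not the EPrints code. It assumes the 5 slots of a facet are numbered consecutively from its first slot, and the sample matches are made up.

```python
from collections import Counter

FACET_SLOTS = 5            # slots reserved per facet (from the slide)
TYPE_FACET_BASE = 1500300  # eprint.facet.type.0 (from the slide)


def facet_counts(matches, base_slot, n_slots=FACET_SLOTS):
    """Count how often each facet value occurs among the matched documents."""
    counts = Counter()
    for values in matches:
        # Assumption: slots base_slot .. base_slot + n_slots - 1 hold the
        # (possibly multiple) values of this facet for one document.
        for slot in range(base_slot, base_slot + n_slots):
            value = values.get(slot)
            if value:
                counts[value] += 1
    return counts


# Hypothetical search results: slot number -> stored value.
matches = [
    {1500300: "article"},
    {1500300: "article"},
    {1500300: "article", 1500301: "thesis"},
    {1500300: "book"},
]
counts = facet_counts(matches, TYPE_FACET_BASE)
```

For this sample the result corresponds to "3 items of type ‘article’", plus one ‘thesis’ and one ‘book’.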

We need to keep track of our slot mappings in the Xapian Database (Xapian does not do this for us)

EPrints reserves 1 000 000 slots per dataset:

◦ 500 000 for order-values (1 per orderable field)

◦ 500 000 for facet slots (1 per facetable value)

EPrints also stores the current slot offsets to know:

◦ where the range for the next dataset starts

◦ where the next order-value slot is

EPrints also stores some other useful information as Metadata

Core Concepts: Metadata management

Core Concepts: Metadata management (2)
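The slot bookkeeping can be sketched as a small registry in plain Python. This is a hypothetical model, not the EPrints implementation: the class and method names are invented, the first dataset is assumed to start at slot 1 000 000 (inferred from the eprint.creators_name.en example), and slots are handed out in request order, so the facet numbers will not match a real repository.

```python
SLOTS_PER_DATASET = 1_000_000  # reserved per dataset (from the slide)
ORDER_SLOTS = 500_000          # first half: order-values, second half: facets
FIRST_OFFSET = 1_000_000       # assumption inferred from the slide example


class SlotRegistry:
    """Maps meaningful strings to slot numbers, one range per dataset."""

    def __init__(self):
        self.next_dataset_offset = FIRST_OFFSET
        self.cursors = {}  # dataset -> next free slots and range limits
        self.mapping = {}  # meaningful string -> slot number

    def _cursor(self, dataset):
        if dataset not in self.cursors:
            base = self.next_dataset_offset
            self.cursors[dataset] = {
                "order": base,
                "order_end": base + ORDER_SLOTS,
                "facet": base + ORDER_SLOTS,
                "facet_end": base + SLOTS_PER_DATASET,
            }
            # Remember where the range for the next dataset starts.
            self.next_dataset_offset = base + SLOTS_PER_DATASET
        return self.cursors[dataset]

    def _allocate(self, dataset, kind, key):
        if key not in self.mapping:
            cursor = self._cursor(dataset)
            if cursor[kind] >= cursor[kind + "_end"]:
                raise RuntimeError(f"no {kind} slots left for {dataset}")
            self.mapping[key] = cursor[kind]
            cursor[kind] += 1
        return self.mapping[key]

    def order_slot(self, dataset, field, lang):
        return self._allocate(dataset, "order", f"{dataset}.{field}.{lang}")

    def facet_slots(self, dataset, field, count=5):
        return [
            self._allocate(dataset, "facet", f"{dataset}.facet.{field}.{i}")
            for i in range(count)
        ]


registry = SlotRegistry()
creators = registry.order_slot("eprint", "creators_name", "en")
type_slots = registry.facet_slots("eprint", "type")
user_name = registry.order_slot("user", "name", "en")
```

In a real setup the mapping and the current offsets would be persisted as Xapian database metadata, as the slides describe, so that indexing and searching agree on the slot numbers.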

Reverse process of indexing

Composed of a tree of Query objects (and sometimes a QueryParser object) linked by boolean operators

use Search::Xapian qw(:ops);

my $query = Search::Xapian::Query->new( 'hello' );

$query = Search::Xapian::Query->new( OP_AND, $query, Search::Xapian::Query->new( 'world' ) );

Can be stringified to see how the query is interpreted (easier to read than SQL!)

Core Concepts: Searching

Parses user queries

Supports:

◦ wildcards: wild* will match wildcat

◦ boolean operators: pear AND (red OR green NOT blue)

◦ love/hate operators: crab +nebula -crustacean

◦ exact match: "lorem ipsum"

◦ synonyms: colour/color, realise/realize

◦ stemming: happiness/happy -> happi

◦ suggestions: may provide a corrected query

Features can be turned on/off (all are enabled in EPrints)

Core Concepts: Searching - QueryParser

The object which runs the query

Alternative ordering methods can be applied

A MatchDecider object may be provided to filter out results (in fact, we use that to compute facets)

Returns an MSet (Match Set) which contains the actual matching Documents

Core Concepts: Searching - Enquire

http://xapian.org

◦ architecture overview

◦ documentation

◦ advice for implementation

Questions?

EPrints implementation…

Final words