Introduction to the Xapian Search Engine

Sébastien François, EPrints Lead Developer. EPrints Developer Powwow, ULCC

Page 1: Introduction to the  Xapian  Search Engine

Introduction to the Xapian Search Engine

Sébastien François, EPrints Lead Developer

EPrints Developer Powwow, ULCC

Page 2: Introduction to the  Xapian  Search Engine

Open Source Search Engine Library

Written in C++ (we use the Perl bindings)

Uses the BM25 ranking function, which provides relevance-ranked matching

“Scales well”: 100+ million documents

Oh… code that we don’t need to maintain!

Presentation
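
To make the BM25 bullet concrete, here is a minimal pure-Python sketch of the textbook Okapi BM25 formula. It is not Xapian's own weighting code, whose parameters and normalisation differ in detail; the function name, the defaults (k1 = 1.2, b = 0.75) and the toy corpus are assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query (textbook BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # term absent: contributes nothing
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical three-document corpus.
docs = [
    "open source search engine library".split(),
    "search engine search ranking".split(),
    "repository software for open access".split(),
]
scores = bm25_scores(["search", "ranking"], docs)
```

The second document mentions "search" twice and is the only one containing "ranking", so it scores highest; the third contains neither term and scores zero.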

Page 3: Introduction to the  Xapian  Search Engine

Database

Document

◦ data

◦ terms

◦ values

(Xapian) Metadata management

Searching

Are you ready for it?

Core Concepts

Page 4: Introduction to the  Xapian  Search Engine

Collection of files storing indexes, positions, term frequencies, …

One write lock (a single writer at a time), multiple concurrent readers

Stored in archives/<id>/var/xapian/

Supports searching across multiple databases (unused in EPrints)

Can store arbitrary metadata

Core Concepts: Database

Page 5: Introduction to the  Xapian  Search Engine

A Document is an item returned by a search

So it’s also the meaty bit of indexing

Maps to a single data-obj in EPrints

Has three main components:

◦ data

◦ terms

◦ values

Core Concepts: Document

Page 6: Introduction to the  Xapian  Search Engine

Arbitrary blob of data

Not processed by Xapian

Used to store information needed to display the results

Used in EPrints to store the data-obj identifier, in order to quickly build EPrints::List objects

Could be used to store more complex data: cached citations, a JSON/Perl representation of the data-obj

Limit ~100MB per Document

Core Concepts: Document Data

Page 7: Introduction to the  Xapian  Search Engine

Basis of relevance search: a search is a process of comparing the terms specified by a Query against the terms in the DB

Three main types of terms:

◦ Un-prefixed terms: can be seen as a general pool of indexed terms

◦ Prefixed terms: allow searching a subset of the information (title, authors, …)

◦ Boolean terms: used to index identifiers (which don’t add any useful information to the probabilistic indexes)

Core Concepts: Document Terms
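
The three kinds of terms can be sketched with a toy term generator in pure Python. This is not the Xapian or EPrints indexing code: the function name and record layout are hypothetical, no stemming is applied, and the prefixes (S for title, A for author, Q for a unique identifier) merely follow a common Xapian convention rather than anything stated in the slides.

```python
import re

def make_terms(record):
    """Return the set of index terms for one record (toy model, no stemming)."""
    terms = set()

    def words(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    for field, prefix in (("title", "S"), ("author", "A")):
        for word in words(record[field]):
            terms.add(word)            # un-prefixed: general pool of terms
            terms.add(prefix + word)   # prefixed: restricts a search to one field
    # Boolean term: an exact identifier, with no text processing at all.
    terms.add("Q" + str(record["id"]))
    return terms

# Hypothetical record standing in for a data-obj.
record = {"id": 42, "title": "Xapian Search", "author": "Francois"}
terms = make_terms(record)
```

A simple search for "xapian" would look up the un-prefixed term, a title-only search would look up "Sxapian", and a filter on the identifier would look up "Q42".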

Page 8: Introduction to the  Xapian  Search Engine

Boolean terms are useful for filtering on exact values (e.g. subjects:PM, type:article, …). No text processing is involved; a value appears 0 or 1 times in a Document.

Textual data - TermGenerator class:

◦ Works with the Stem (stemmer) and Stopper classes (note: language-dependent)

◦ Spelling correction

◦ Exact matching (“hello world”) and the termpos joys

Core Concepts: Document Terms (2)
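
Exact matching relies on term positions, which the following pure-Python sketch illustrates. It is a toy model and not the TermGenerator itself: the stop list and function names are made up, stemming is left out, and the choice to let a skipped stop word still consume a position is an assumption of this sketch (how Xapian treats stop words depends on how the TermGenerator is configured).

```python
STOPWORDS = {"the", "a", "of"}  # tiny illustrative stop list

def index_positions(text):
    """Map each term to the list of positions at which it occurs."""
    positions = {}
    for pos, word in enumerate(text.lower().split(), start=1):
        if word in STOPWORDS:
            continue  # not indexed, but its position is still consumed
        positions.setdefault(word, []).append(pos)
    return positions

def phrase_match(positions, phrase):
    """True if the words of `phrase` occur at consecutive positions."""
    words = phrase.lower().split()
    starts = positions.get(words[0], [])
    return any(
        all(start + i in positions.get(w, []) for i, w in enumerate(words))
        for start in starts
    )

index = index_positions("Hello world of search engines")
found = phrase_match(index, "hello world")
```

Here "hello world" matches because the two terms sit at positions 1 and 2, whereas "world hello" or "world search" would not match even though all those terms are present in the text.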

Page 9: Introduction to the  Xapian  Search Engine

Un-prefixed terms are used for the simple search

Prefixed terms are used for field-based searches (such as the advanced search)

Boolean terms are used for any identifier-type field; this includes facets (when searching)

Core Concepts: Document Terms (3)

Page 10: Introduction to the  Xapian  Search Engine

Values are “search helpers”: we use them for ordering and faceting (occurrences and available facets)

Each value (e.g. an order-value, a facet value) is stored in a numbered slot (the slot number is a 32-bit integer)

Mappings between a meaningful string and a slot are stored in the Xapian DB as metadata

eprint.creators_name.en (1000000) is the slot for the order-value for the field “creators_name” on the dataset “eprint” for English

Core Concepts: Document Values

Page 11: Introduction to the  Xapian  Search Engine

eprint.facet.type.0 (1500300) is the 1st slot for a facet “type” on the dataset “eprint”

Used by the MultiValueSorter class to order data (when not ordered by relevance)

Used to find out the available facets (after a search) and the occurrences of the values, e.g. there are 3 items of type ‘article’, 14 items of date ‘2013’

The Xapian documentation advises keeping the number of values low (they slow down searching)

We usually limit the number of slots for a facet to 5

Core Concepts: Document Values (2)
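
Ordering by values can be pictured as a multi-key sort over slots. The pure-Python sketch below only imitates the idea behind MultiValueSorter and does not use the Xapian API: the documents are made up, the slot 1000000 is taken from the slides, and the date slot 1000001 is a hypothetical number chosen for the example.

```python
ORDER_CREATORS = 1000000   # eprint.creators_name.en (slot number from the slides)
ORDER_DATE = 1000001       # hypothetical slot for a date order-value

# Each toy document carries a {slot_number: value} mapping.
documents = [
    {"id": 1, "values": {ORDER_CREATORS: "smith", ORDER_DATE: "2013"}},
    {"id": 2, "values": {ORDER_CREATORS: "adams", ORDER_DATE: "2012"}},
    {"id": 3, "values": {ORDER_CREATORS: "adams", ORDER_DATE: "2013"}},
]

def order_by_slots(docs, keys):
    """Sort by several slots; `keys` is a list of (slot, reverse) pairs."""
    ordered = list(docs)
    # Python's sort is stable, so apply the least significant key first.
    for slot, reverse in reversed(keys):
        ordered.sort(key=lambda d: d["values"].get(slot, ""), reverse=reverse)
    return ordered

# Newest first, then by creator name.
ordered_ids = [
    d["id"]
    for d in order_by_slots(documents, [(ORDER_DATE, True), (ORDER_CREATORS, False)])
]
```

Because the values are compared as plain strings, an order-value has to be stored in a form that sorts correctly as text, which is one reason a dedicated order-value is kept per orderable field.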

Page 12: Introduction to the  Xapian  Search Engine

We need to keep track of our slot mappings in the Xapian Database (not done by Xapian for us)

EPrints reserves 1 000 000 slots per dataset:

◦ 500 000 for order-values (1 per orderable field)

◦ 500 000 for facet slots (1 per facetable value)

EPrints also stores the current slot offsets to know:

◦ where the range for the next dataset starts

◦ where the next order-value slot is

EPrints also stores some other useful information as Metadata

Core Concepts: Metadata management
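
The bookkeeping described above can be sketched as a small allocator. This is a hypothetical pure-Python model, not the EPrints code: the class, its method names and its metadata keys are invented, a plain dict stands in for the Xapian DB metadata, and the starting offset of 1 000 000 for the first dataset is an assumption inferred from the eprint.creators_name.en example. Real slot numbers depend on registration order, so the sketch does not reproduce values such as 1500300.

```python
SLOTS_PER_DATASET = 1_000_000
ORDER_RANGE = 500_000          # first half: order-values; second half: facets
SLOTS_PER_FACET = 5            # "usually", per the slides

class SlotRegistry:
    """Toy slot allocator; `metadata` stands in for Xapian DB metadata."""

    def __init__(self, first_base=1_000_000):  # starting offset is an assumption
        self.metadata = {"next_dataset_base": first_base}

    def _dataset_base(self, dataset):
        key = f"{dataset}.base"
        if key not in self.metadata:
            # Reserve the next free block of slots for this dataset.
            self.metadata[key] = self.metadata["next_dataset_base"]
            self.metadata["next_dataset_base"] += SLOTS_PER_DATASET
            self.metadata[f"{dataset}.next_order"] = 0
            self.metadata[f"{dataset}.next_facet"] = 0
        return self.metadata[key]

    def order_slot(self, dataset, field, lang):
        base = self._dataset_base(dataset)
        key = f"{dataset}.{field}.{lang}"
        if key not in self.metadata:
            offset = self.metadata[f"{dataset}.next_order"]
            if offset >= ORDER_RANGE:
                raise RuntimeError("order-value slots exhausted")
            self.metadata[key] = base + offset
            self.metadata[f"{dataset}.next_order"] = offset + 1
        return self.metadata[key]

    def facet_slots(self, dataset, field):
        base = self._dataset_base(dataset) + ORDER_RANGE
        if f"{dataset}.facet.{field}.0" not in self.metadata:
            offset = self.metadata[f"{dataset}.next_facet"]
            for i in range(SLOTS_PER_FACET):
                self.metadata[f"{dataset}.facet.{field}.{i}"] = base + offset + i
            self.metadata[f"{dataset}.next_facet"] = offset + SLOTS_PER_FACET
        return [self.metadata[f"{dataset}.facet.{field}.{i}"]
                for i in range(SLOTS_PER_FACET)]

registry = SlotRegistry()
creators_slot = registry.order_slot("eprint", "creators_name", "en")
type_slots = registry.facet_slots("eprint", "type")
user_slot = registry.order_slot("user", "name", "en")
```

Asking for the same mapping twice returns the same slot, which is the property that matters: the mapping must survive between indexing and searching, hence its storage alongside the index.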

Page 13: Introduction to the  Xapian  Search Engine

Core Concepts: Metadata management (2)

Page 14: Introduction to the  Xapian  Search Engine

Reverse process of indexing

Composed of a tree of Query objects (and sometimes a QueryParser object) linked by boolean operators

use Search::Xapian qw(:ops);

my $query = Search::Xapian::Query->new( "hello" );

$query = Search::Xapian::Query->new( OP_AND, $query, "world" );

Can be stringified to see how the query is interpreted (easier to read than SQL!)

Core Concepts: Searching
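
The idea of a query tree that can be stringified is easy to model. The class below is a toy written in pure Python, not the Xapian Query class: its constructor, the `describe` and `matches` methods and the output format are all assumptions made for illustration, and only AND and OR are supported.

```python
class Query:
    """Toy query node: either a single term or an operator over sub-queries."""

    def __init__(self, *args):
        if len(args) == 1:
            self.op, self.term, self.children = None, args[0], []
        else:
            op, *operands = args
            self.op, self.term = op, None
            # Bare strings are wrapped as term queries.
            self.children = [o if isinstance(o, Query) else Query(o)
                             for o in operands]

    def describe(self):
        """Render the tree as a readable string."""
        if self.op is None:
            return self.term
        return "(" + f" {self.op} ".join(c.describe() for c in self.children) + ")"

    def matches(self, doc_terms):
        """Evaluate the tree against the set of terms of one document."""
        if self.op is None:
            return self.term in doc_terms
        results = [c.matches(doc_terms) for c in self.children]
        if self.op == "AND":
            return all(results)
        if self.op == "OR":
            return any(results)
        raise ValueError(f"unsupported operator: {self.op}")

query = Query("hello")
query = Query("AND", query, "world")
query = Query("OR", query, "xapian")
description = query.describe()
```

Printing `description` shows the nesting explicitly, which is the appeal of stringifying a query while debugging: the grouping of operators is visible at a glance.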

Page 15: Introduction to the  Xapian  Search Engine

Parses user queries

Supports:

◦ wildcards: wild* will match wildcat

◦ Boolean operators: pear AND (red OR green NOT blue)

◦ love/hate operators: crab +nebula -crustacean

◦ exact match: “lorem ipsum”

◦ synonyms: colour/color, realise/realize

◦ stemming: happiness/happy -> happi

◦ suggestions: may provide a corrected query

Features can be turned on/off (all are enabled in EPrints)

Core Concepts: Searching - QueryParser

Page 16: Introduction to the  Xapian  Search Engine

The object which runs the query

Alternative ordering methods can be applied

A MatchDecider method may be provided to filter out results (in fact, we use that to compute facets)

Returns an MSet (Match Set) which contains the actual matching Documents

Core Concepts: Search - Enquire
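
The trick of computing facets inside a match decider can be sketched as follows. This is a pure-Python toy and not the Enquire API: `run_query`, `FacetDecider` and the document layout are hypothetical, matching is reduced to "contains all required terms", and ranking and paging of the match set are left out. The slot number is the one quoted in the slides.

```python
from collections import Counter

class FacetDecider:
    """Accepts every document, counting facet values as a side effect."""

    def __init__(self, facet_slot):
        self.facet_slot = facet_slot
        self.counts = Counter()

    def __call__(self, doc):
        value = doc["values"].get(self.facet_slot)
        if value is not None:
            self.counts[value] += 1
        return True  # never filters anything out

def run_query(docs, required_terms, decider=None):
    """Return the 'match set': documents containing all required terms."""
    matches = []
    for doc in docs:
        if not required_terms <= doc["terms"]:
            continue  # does not match the query
        if decider is not None and not decider(doc):
            continue  # rejected by the decider
        matches.append(doc)
    return matches

FACET_TYPE_0 = 1500300  # eprint.facet.type.0, slot number from the slides

# Made-up documents with terms and value slots.
docs = [
    {"id": 1, "terms": {"xapian", "search"}, "values": {FACET_TYPE_0: "article"}},
    {"id": 2, "terms": {"xapian"}, "values": {FACET_TYPE_0: "book"}},
    {"id": 3, "terms": {"xapian", "search"}, "values": {FACET_TYPE_0: "article"}},
]
decider = FacetDecider(FACET_TYPE_0)
mset = run_query(docs, {"xapian", "search"}, decider)
```

Since the decider is only called for documents that match the query, its counters end up describing exactly the result set: here two items of type "article" and no "book".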

Page 17: Introduction to the  Xapian  Search Engine

http://xapian.org

◦ architecture overview

◦ documentation

◦ advice for implementation

Questions?

EPrints implementation…

Final words