Introduction to the Xapian Search Engine

Sébastien François, EPrints Lead Developer

EPrints Developer Powwow, ULCC

Presentation

Open Source Search Engine Library

Written in C++ (we use the Perl bindings)

Uses the BM25 ranking function, which provides relevance-ranked matching

“Scales well”: 100+ million documents

Oh… code that we don’t need to maintain!

Core Concepts

Database

Document

◦ data

◦ terms

◦ values

(Xapian) Metadata management

Searching

Are you ready for it?

Core Concepts: Database

Collection of files storing indexes, positions, term frequencies, …

One write-lock, multiple read-locks

Stored in archives/<id>/var/xapian/

Supports searching across multiple databases (unused in EPrints)

Can store arbitrary metadata

Core Concepts: Document

A Document is an item returned by a search

So it’s also the meaty bit of indexing

Maps to a single data-obj in EPrints

Has three main components:

◦ data

◦ terms

◦ values

Core Concepts: Document Data

Arbitrary blob of data

Not processed by Xapian

Used to store information needed to display the results

Used in EPrints to store the data-obj identifier, in order to quickly build EPrints::List objects

Could be used to store more complex data: cached citations, JSON/Perl representation of the data-obj

Limit ~100MB per Document

Core Concepts: Document Terms

Basis of relevance search: a search is a process of comparing the terms specified by a Query against the terms in the DB

Three main types of terms:

◦ Un-prefixed terms: can be seen as a general pool of indexed terms

◦ Prefixed terms: allow searching a subset of the information (title, authors…)

◦ Boolean terms: used to index identifiers (which don’t add any useful information to the probabilistic indexes)

Core Concepts: Document Terms (2)

Boolean terms are useful for filtering on exact values (e.g. subjects:PM, type:article, …). No text processing is involved; each value appears 0 or 1 times in a Document.

Textual data - TermGenerator class:

◦ Uses the Stem and Stopper classes (note: language-dependent)

◦ Spelling correction

◦ Exact matching (“hello world”) and the joys of term positions (termpos)

Core Concepts: Document Terms (3)

Un-prefixed terms are used for the simple search

Prefixed terms are used for a field-based search (such as the advanced search)

Boolean terms are used for any identifier-type field – this includes facets (when searching)
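The three kinds of terms can be illustrated with a small, self-contained Python sketch. It does not use the Xapian API or reproduce the EPrints indexer: the prefixes ("S", "A", "XTYPE"), the tokenizer and the `terms_for` helper are all assumptions made up for illustration.

```python
import re

# Hypothetical prefixes, chosen for illustration only
FIELD_PREFIXES = {"title": "S", "creators_name": "A"}
BOOLEAN_PREFIXES = {"type": "XTYPE"}


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def terms_for(record):
    """Return (free_text_terms, boolean_terms) for a dict-like record."""
    free_text = []
    boolean = set()
    for field, value in record.items():
        if field in BOOLEAN_PREFIXES:
            # Identifier-type field: one exact term, no text processing
            boolean.add(BOOLEAN_PREFIXES[field] + value)
        elif field in FIELD_PREFIXES:
            for word in tokenize(value):
                # Un-prefixed term: general pool, used by a simple search
                free_text.append(word)
                # Prefixed term: restricts a search to this field
                free_text.append(FIELD_PREFIXES[field] + word)
    return free_text, boolean


record = {"title": "Hello World", "type": "article"}
free_text, boolean = terms_for(record)
```

Here the title contributes both pooled and field-specific terms, while the type contributes a single exact term that can only be present or absent.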

Core Concepts: Document Values

“Search helpers” – we use them for ordering and faceting (occurrences & available facets)

Each value (e.g. an order-value, a facet value) is stored in a numbered slot (32-bit integer)

Mappings between a meaningful string and a slot are stored in the Xapian DB as metadata

eprint.creators_name.en (1000000) is the slot for the order-value for the field “creators_name” on the dataset “eprint” for English

Core Concepts: Document Values (2)

eprint.facet.type.0 (1500300) is the 1st slot for a facet “type” on the dataset “eprint”

Used by the MultiValueSorter class to order data (when not ordered by relevance)

Used to find out the available facets (after a search) and the occurrences of their values, e.g. there are 3 items of type ‘article’, 14 items of date ‘2013’

The Xapian documentation advises keeping the number of values low (many values slow down searching)

We usually limit the number of slots for a facet to 5
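How facet occurrences could be derived from value slots can be sketched in plain Python. This is a toy model rather than the EPrints MatchDecider code: each matched document is represented as a dict from slot number to stored value, and the assumption that the five slots of the “type” facet are contiguous from 1500300 is made only for the example.

```python
from collections import Counter

# Assumption: the 5 slots of the "type" facet are contiguous
FACET_SLOTS = {"type": [1500300, 1500301, 1500302, 1500303, 1500304]}


def facet_counts(matches, facet_slots):
    """Count, per facet, how many matched documents carry each value.

    matches: list of dicts mapping slot number -> stored value.
    """
    counts = {name: Counter() for name in facet_slots}
    for values in matches:
        for name, slots in facet_slots.items():
            # A set avoids counting a value twice within one document
            seen = {values[slot] for slot in slots if slot in values}
            counts[name].update(seen)
    return counts


matches = [
    {1500300: "article"},
    {1500300: "article", 1500301: "book"},
    {1500300: "article"},
]
counts = facet_counts(matches, FACET_SLOTS)
```

With these three toy matches, `counts["type"]` reports 3 items of type ‘article’ and 1 of type ‘book’, which is the kind of summary shown next to search results.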

Core Concepts: Metadata management

We need to keep track of our slot mappings in the Xapian Database (not done by Xapian for us)

EPrints reserves 1 000 000 slots per dataset:

◦ 500 000 for order-values (1 per orderable field)

◦ 500 000 for facet slots (1 per facetable value)

EPrints also stores the current slot offsets to know:

◦ where the range for the next dataset starts

◦ where the next order-value slot is

EPrints also stores some other useful information as Metadata
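The slot bookkeeping described here can be sketched as a small allocator. This is a hedged illustration, not the EPrints implementation: the `SlotRegistry` class, its method names, the “user” dataset, and the assumption that the first dataset's range starts at 1 000 000 (inferred from the eprint.creators_name.en example) are all hypothetical, and a plain dict stands in for the Xapian DB metadata.

```python
SLOTS_PER_DATASET = 1_000_000  # range reserved per dataset (from the slide)
ORDER_SLOTS = 500_000          # first half: order-values; second half: facets


class SlotRegistry:
    """Toy stand-in for the slot mappings kept as Xapian DB metadata."""

    def __init__(self, first_base=1_000_000):
        # Assumption: the first dataset's range starts at 1 000 000
        self.next_base = first_base
        self.offsets = {}    # dataset -> next free order / facet slot
        self.mappings = {}   # meaningful string -> slot number

    def _offsets(self, dataset):
        if dataset not in self.offsets:
            base = self.next_base
            self.offsets[dataset] = {"order": base, "facet": base + ORDER_SLOTS}
            self.next_base = base + SLOTS_PER_DATASET
        return self.offsets[dataset]

    def _allocate(self, dataset, kind, key):
        # Reuse an existing mapping, otherwise take the next free slot
        if key not in self.mappings:
            offsets = self._offsets(dataset)
            self.mappings[key] = offsets[kind]
            offsets[kind] += 1
        return self.mappings[key]

    def order_slot(self, dataset, field, lang):
        return self._allocate(dataset, "order", f"{dataset}.{field}.{lang}")

    def facet_slot(self, dataset, field, index):
        return self._allocate(dataset, "facet", f"{dataset}.facet.{field}.{index}")


registry = SlotRegistry()
creators_slot = registry.order_slot("eprint", "creators_name", "en")
type_slot = registry.facet_slot("eprint", "type", 0)
user_slot = registry.order_slot("user", "name", "en")
```

Looking up the same key twice returns the same slot, which is the property that matters: a slot number, once written into indexed documents, must keep its meaning.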

Core Concepts: Metadata management (2)

Core Concepts: Searching

Reverse process of indexing

Composed of a tree of Query objects (and sometimes a QueryParser object) linked by boolean operators

use Search::Xapian qw(:ops);
my $query = Search::Xapian::Query->new( 'hello' );
$query = Search::Xapian::Query->new( OP_AND, $query, Search::Xapian::Query->new( 'world' ) );

Can be stringified to see how the query is interpreted (easier to read than SQL!)
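The idea of a query as a tree that can be stringified and evaluated against a document's terms can be modelled in a few lines of Python. This is a toy `Query` class written for illustration, not the Xapian one: the string operators, the `matches` method and the output format of `str()` are assumptions, and no ranking is performed.

```python
class Query:
    """Toy query tree: a leaf holds a term, a node holds an operator."""

    def __init__(self, *args):
        if len(args) == 1:
            # Leaf: a single term
            self.op, self.term, self.children = None, args[0], []
        else:
            # Node: operator followed by sub-queries or bare terms
            op, *subs = args
            self.op, self.term = op, None
            self.children = [s if isinstance(s, Query) else Query(s) for s in subs]

    def __str__(self):
        if self.op is None:
            return self.term
        return "(" + f" {self.op} ".join(str(c) for c in self.children) + ")"

    def matches(self, doc_terms):
        """Boolean evaluation against a set of document terms."""
        if self.op is None:
            return self.term in doc_terms
        if self.op == "AND":
            return all(c.matches(doc_terms) for c in self.children)
        if self.op == "OR":
            return any(c.matches(doc_terms) for c in self.children)
        raise ValueError(f"unsupported operator: {self.op}")


query = Query("hello")
query = Query("AND", query, "world")
description = str(query)
```

Printing `description` shows the nesting of the tree, which is the debugging aid the slide refers to.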

Core Concepts: Searching - QueryParser

Parses user queries

Supports:

◦ wildcards: wild* will match wildcat

◦ boolean operators: pear AND (red OR green NOT blue)

◦ love/hate operators: crab +nebula -crustacean

◦ exact match: “lorem ipsum”

◦ synonyms: colour/color, realise/realize

◦ stemming: happiness/happy -> happi

◦ suggestions: may provide a corrected query

Features can be turned on/off (all are enabled in EPrints)

Core Concepts: Searching - Enquire

The object which runs the query

Alternative ordering methods can be applied

A MatchDecider object may be provided to filter out results (in fact, we use that to compute facets)

Returns an MSet (Match Set) which contains the actual matching Documents

Final words

http://xapian.org

◦ architecture overview

◦ documentation

◦ advice for implementation

Questions?

EPrints implementation…
