Open Source Search: An Analysis
Uploaded by justin-finkelstein
![Page 1: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/1.jpg)
Open Source Search: An analysis and comparison from a developer’s perspective
![Page 2: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/2.jpg)
The Contenders
![Page 3: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/3.jpg)
The Data
Report Buyer product catalogue:
• Text fields: title, subtitle, summary, toc
• Product code and ISBN
• Supplier, category, type and availability
• Publication date and price
![Page 4: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/4.jpg)
Apache Solr
• Enterprise class search engine
• Scalable and based on Apache Lucene
• REST-ful API or PECL extension
• Fast, transactional full-text indexing
• Faceted and geospatial search
• Rich document indexing
• Comes with simple web interface
• Built-in caching of queries and responses
• Numerous plug-ins
![Page 5: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/5.jpg)
Apache Solr: Installation
• Available as system packages
• Uses Tomcat or Jetty
• Requires a restart on configuration change
• Packages install as a service
![Page 6: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/6.jpg)
Apache Solr: Configuration
• Specify database location
• Memory settings
• Query caching options
• Request handler setup
• Search components and plug-ins
• Spell checker configuration
![Page 7: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/7.jpg)
Apache Solr: Data Schema
<!-- Report Buyer fields -->
<field name="item_guid" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text" indexed="true" stored="true" required="true" boost="75" omitNorms="false" />
<field name="subtitle" type="text" indexed="true" stored="true" required="false" boost="25" omitNorms="false" />
<field name="summary" type="text" indexed="true" stored="false" boost="1" omitNorms="false" />
<field name="toc" type="text" indexed="true" stored="false" boost="1" omitNorms="false" />
<field name="isbn" type="string" indexed="true" stored="false" boost="200" omitNorms="false" />
<field name="product_code" type="string" indexed="true" stored="true" boost="200" omitNorms="false" />
<field name="publish_date" type="tdate" indexed="true" stored="true" />
<field name="price" type="tfloat" indexed="true" stored="true" />
<field name="availability" type="boolean" indexed="true" stored="true" />
<field name="link" type="string" indexed="false" stored="true" />
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>

<copyField source="name" dest="text"/>
<copyField source="subtitle" dest="text"/>
<copyField source="summary" dest="text"/>
<copyField source="toc" dest="text"/>

<uniqueKey>item_guid</uniqueKey>
<defaultSearchField>text</defaultSearchField>
![Page 8: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/8.jpg)
Apache Solr: Indexing Options
• Data Import Handler
• REST-ful API
• PHP PECL Extension
• Third-party libraries, like Solarium
![Page 9: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/9.jpg)
Apache Solr: PHP PECL Indexer
<?php
// Connect to the local Solr instance through the PECL Solr extension.
$solr_options = array('secure' => false, 'hostname' => 'localhost', 'port' => 8080);
$solr = new SolrClient($solr_options);

// $result is the MySQL result set holding the catalogue rows.
while ($row = mysql_fetch_array($result, MYSQL_ASSOC)) {
    $doc = new SolrInputDocument();
    // Solr date fields expect ISO 8601 timestamps in UTC.
    $row['publish_date'] = strftime('%Y-%m-%dT00:00:01Z', strtotime($row['publish_date']));
    foreach ($row as $key => $value) {
        $doc->addField($key, $value);
    }
    $updateResponse = $solr->addDocument($doc);
    $response = $updateResponse->getResponse();
    if ($response->responseHeader->status != 0) {
        print "Error importing into Solr: ";
        print_r($response);
    }
}
// Commit once at the end so the new documents become searchable.
$solr->commit();
?>
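The indexer reformats `publish_date` because Solr date fields take ISO 8601 timestamps in UTC. As a minimal sketch of the same conversion without `strftime()`, the helper below uses `DateTimeImmutable`. The function name `solr_date` is hypothetical, and the fixed `00:00:01` time of day simply mirrors the slide.

```php
<?php
// Hypothetical helper (not part of the original slides): convert a date
// string such as "2008-07-22" into the ISO 8601 UTC form used for the
// publish_date field.
function solr_date(string $date): string
{
    $utc = new DateTimeZone('UTC');
    $parsed = new DateTimeImmutable($date, $utc);
    // Keep only the calendar date and append the fixed time used on the slide.
    return $parsed->format('Y-m-d') . 'T00:00:01Z';
}

print solr_date('2008-07-22') . "\n"; // prints 2008-07-22T00:00:01Z
```

Interpreting the input as UTC keeps the stored date independent of the server's local timezone setting.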
![Page 10: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/10.jpg)
Apache Solr: RESTful indexing
POST to http://localhost:8080/solr/update?commit=true
<add>
  <doc>
    <field name="item_guid">a34bbff9e17ada79658c72fde90c7369</field>
    <field name="name">Research Report on China's Corn Industry</field>
    <field name="price">1265</field>
    etc
  </doc>
</add>
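The XML body for the update request can be generated from a database row rather than written by hand. The sketch below is one way to do that under stated assumptions: the function name `solr_add_xml` is hypothetical, it handles a single flat document, and it only builds the payload (sending the POST is left out).

```php
<?php
// Hypothetical helper: build the <add><doc>...</doc></add> payload for one
// document from an associative array of field name => value.
function solr_add_xml(array $doc): string
{
    $xml = '<add><doc>';
    foreach ($doc as $name => $value) {
        $xml .= sprintf(
            '<field name="%s">%s</field>',
            // Attribute value: escape &, <, > and double quotes.
            htmlspecialchars((string) $name, ENT_XML1 | ENT_COMPAT, 'UTF-8'),
            // Element content: escape &, < and > only.
            htmlspecialchars((string) $value, ENT_XML1 | ENT_NOQUOTES, 'UTF-8')
        );
    }
    return $xml . '</doc></add>';
}

print solr_add_xml(array(
    'item_guid' => 'a34bbff9e17ada79658c72fde90c7369',
    'name'      => "Research Report on China's Corn Industry",
    'price'     => 1265,
)) . "\n";
```

Escaping matters here because catalogue titles and summaries can contain `&` or `<`, which would otherwise make the request body malformed XML.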
![Page 11: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/11.jpg)
Apache Solr: PHP Querying
$solr_options = array('secure' => false, 'hostname' => 'localhost', 'port' => 8080);
$solr = new SolrClient($solr_options);

$query_string = "research in china";
$query = new SolrQuery();
$query->setQuery($query_string);
$query->setFacet(true);
$query->addFacetField('availability');
// Fields to return for each matching document.
$query->addField('item_guid')->addField('name')->addField('publish_date')->addField('subtitle')->addField('product_code')->addField('availability')->addField('price');
$query->addSortField('publish_date', SolrQuery::ORDER_DESC);

$query_response = $solr->query($query);
$response = $query_response->getResponse();

print "Found ".$response->response->numFound." results, for {$query_string} in ".$response->responseHeader->QTime." ms:\n\n";
foreach ($response->response->docs as $position=>$doc_data) {
    $download = ($doc_data['availability'] == '1') ? 'Yes' : 'No';
    // Keep only the date part of the ISO 8601 timestamp.
    $pub_date = substr($doc_data['publish_date'], 0, 10);
    print "{$position} - Date:{$pub_date} - {$doc_data['product_code']} - D/L:{$download} £".sprintf("%5d", $doc_data['price'])." - {$doc_data['name']}\n";
}
print "Facets for instant ".$response->facet_counts->facet_fields->availability->false;
![Page 12: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/12.jpg)
Apache Solr: RESTful Queries
http://localhost:8080/solr/select/?q=research%20%in%20china&indent=on&hl=true&hl.fl=item_guid,name,publish_date,subtitle,product_code,availability,price&facet=true&facet.field=availability&wt=json
{
  "responseHeader":{
    "status":0,
    "QTime":20,
    "params":{
      "facet":"true",
      "indent":"on",
      "q":"research \u0000 china",
      "hl.fl":"item_guid,name,publish_date,subtitle,product_code,availability,price",
      "facet.field":"availability",
      "wt":"json",
      "hl":"true"}},
  "response":{"numFound":197481,"start":0,"docs":[
    {
      "item_guid":"e68cf64921a02e926137d78d2c52da35",
      "name":"Market Research Report on China Civil Aero Industry",
      "product_code":"SFC00076",
      "price":190.0,
      "availability":false,
      "type":10,
      "link": "/industry_manufacturing/plant_heavy_equipment/market_research_report_china_civil_aero_industry.html",
      "publish_date":"2008-07-22T00:00:01Z"}
}
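Query URLs like this are easy to mistype by hand, so it can help to let PHP do the percent-encoding. The following is a minimal sketch, assuming the same host and port as the slides; the function name `solr_select_url` is hypothetical, and it uses Solr's `fl` parameter for the list of fields to return, which is a choice made for this example rather than something taken from the slide.

```php
<?php
// Hypothetical helper: build a Solr select URL with correctly encoded
// parameters. $base is the select handler URL without a query string.
function solr_select_url(string $base, string $query, array $fields, string $facetField): string
{
    $params = array(
        'q'           => $query,
        'fl'          => implode(',', $fields), // fields to return
        'facet'       => 'true',
        'facet.field' => $facetField,
        'wt'          => 'json',
    );
    // PHP_QUERY_RFC3986 encodes spaces as %20 rather than "+".
    return $base . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

print solr_select_url(
    'http://localhost:8080/solr/select',
    'research in china',
    array('item_guid', 'name', 'price'),
    'availability'
) . "\n";
```

The response can then be fetched with any HTTP client and decoded with `json_decode()`.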
![Page 13: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/13.jpg)
Apache Solr: Comparison Points
• More features than other products
• Responsive, busy mailing list
• Large team of developers
• Good PHP libraries for integration
• Several books available
• Fairly heavy footprint
![Page 14: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/14.jpg)
Elasticsearch: Features
• Also built on Apache Lucene
• JSON-based
• Distributed, scalable server model
• Easy to configure, or configuration free
• Faceting and highlight support
• Auto type detection
• Multiple indexes
• CouchDB integration
![Page 15: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/15.jpg)
Elasticsearch: Installation
• Download and unpack zip file
• Run elasticsearch/bin/elasticsearch
![Page 16: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/16.jpg)
Elasticsearch: Configuration
• No schema is required - almost
• No configuration is required - almost
![Page 17: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/17.jpg)
Elasticsearch: Accessing the system
GET http://localhost:9200/ HTTP/1.0
{
  "ok" : true,
  "name" : "Test",
  "version" : {
    "number" : "0.18.7",
    "snapshot_build" : false
  },
  "tagline" : "You Know, for Search",
  "cover" : "DON'T PANIC",
  "quote" : {
    "book" : "The Hitchhiker's Guide to the Galaxy",
    "chapter" : "Chapter 27",
    "text1" : "\"Forty-two,\" said Deep Thought, with infinite majesty and calm.",
    "text2" : "\"The Answer to the Great Question, of Life, the Universe and Everything\""
  }
}
![Page 18: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/18.jpg)
Elasticsearch: Creating an Index
curl -XPUT http://localhost:9200/reports/ -d '{
  "index": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "my_stemmer"]
        }
      },
      "filter": {
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    }
  }
}'
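Since the rest of the examples drive Elasticsearch from PHP, the same settings body can be built as a nested array and passed through `json_encode()`, which avoids hand-written JSON mistakes. This is only a sketch: the function name `stemming_index_settings` is hypothetical, and it produces the request body without sending it.

```php
<?php
// Hypothetical helper: build the index settings shown above as a PHP array.
// The analyzer and filter names (my_analyzer, my_stemmer) are the ones
// used on the slide.
function stemming_index_settings(string $language): array
{
    return array(
        'index' => array(
            'analysis' => array(
                'analyzer' => array(
                    'my_analyzer' => array(
                        'tokenizer' => 'standard',
                        'filter'    => array('standard', 'lowercase', 'my_stemmer'),
                    ),
                ),
                'filter' => array(
                    'my_stemmer' => array(
                        'type' => 'stemmer',
                        'name' => $language,
                    ),
                ),
            ),
        ),
    );
}

// The encoded string is what would be sent as the body of the PUT request.
print json_encode(stemming_index_settings('english')) . "\n";
```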
![Page 19: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/19.jpg)
Elasticsearch: Mapping the data
<?php
require_once("ElasticSearch.php");
$es = new ElasticSearch;
$es->index = 'reports';
$type = 'report';
$mappings = array($type => array('properties' => array(
    '_id' => array('type' => 'string', 'path' => 'item_guid'),
    'item_guid' => array('type' => 'string', 'store' => 'yes', 'index' => 'not_analyzed'),
    'name' => array('type' => 'string', 'store' => 'no', 'boost' => 75),
    'subtitle' => array('type' => 'string', 'store' => 'yes', 'boost' => 25),
    'summary' => array('type' => 'string', 'store' => 'yes', 'boost' => 10),
    'toc' => array('type' => 'string', 'store' => 'no'),
    'product_code' => array('type' => 'string', 'store' => 'yes', 'boost' => 200, 'index' => 'not_analyzed'),
    'isbn' => array('type' => 'string', 'store' => 'yes', 'boost' => 200, 'index' => 'not_analyzed'),
)));
$json = json_encode($mappings);
$es->map($type, $json);
?>
![Page 20: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/20.jpg)
Elasticsearch: Indexing
<?php
require_once("ElasticSearch.php");
$es = new ElasticSearch;
$es->index = 'reports';
$type = 'report';

$sql = "SELECT `item_guid`, `name`, `subtitle`, `summary`, `toc`, `supplier`,`product_code`, `isbn`, `category`, `price`, `availibility` as `availability`, `type`, `link`, `publish_date` FROM `rb_search`";
$result = read_query($sql);

while ($row = mysql_fetch_array($result, MYSQL_ASSOC)) {
    // Each row is sent as a JSON document, keyed by its item_guid.
    $es->add($type, $row['item_guid'], json_encode($row));
}
?>
![Page 21: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/21.jpg)
Elasticsearch: Inspection
GET http://localhost:9200/reports/report/_count/
{"count":260349,"_shards":{"total":1,"successful":1,"failed":0}}
![Page 22: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/22.jpg)
Elasticsearch: Querying
<?php
require_once("ElasticSearch.php");
$es = new ElasticSearch;
$es->index = 'reports';
$type = 'report';

$query = array(
    'fields' => array('item_guid', 'name', 'subtitle'),
    'query' => array(
        'term' => array('name' => 'research'),
    ),
    'facets' => array(
        'availability' => array(
            'terms' => array('field' => 'availability')
        )
    )
);

$result = $es->query($type, json_encode($query));
?>
![Page 23: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/23.jpg)
Elasticsearch: PHP APIs
• Nicolas Ruflin's Elastica
• Raymond Julin's elasticsearch
• Niranjan Uma Shankar's elasticsearch-php
![Page 24: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/24.jpg)
Elasticsearch: Comparison Points
• Very fast indexing
• Auto-scaling architecture
• Elegant REST approach
• Flexible zero configuration model
• Poor documentation
• No feature list, conceptual model or introduction
• All data is stored, meaning large indices
![Page 25: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/25.jpg)
Sphinx
• Indexes MySQL, MSSQL, XML or ODBC
• Querying through Sphinx PHP API
• Searching through SQL queries or API
• Scalable to index 6TB of data in 16bn documents and 2000 queries/sec
• Used by Craigslist, Boardreader
• Runs as a storage engine in MySQL
![Page 26: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/26.jpg)
Sphinx: Installation
• Install from system packages or source
• Source tarball is needed to get PHP SphinxAPI
• No other software needed
• Runs as a service in Ubuntu
![Page 27: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/27.jpg)
Sphinx: Indexing Types
• Plain index - fast search, slow update
• Real-time index - fast update, less efficient
• Distributed - combination of both methods
![Page 28: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/28.jpg)
Sphinx: Configuration
index rb_test
{
    # index type
    type = rt
    path = /mnt/data_indexed/sphinx/rb_test

    # define the fields we're indexing
    rt_field = name
    rt_field = subtitle
    rt_field = summary
    rt_field = toc

    # define the fields we want to get back out
    rt_attr_string = item_guid
    rt_attr_string = supplier
    rt_attr_string = product_code
    rt_attr_string = isbn
    rt_attr_string = category
    rt_attr_uint = price
    rt_attr_string = link
    rt_attr_timestamp = publish_date

    # morphology preprocessors to apply
    morphology = stem_en
    html_strip = 1
    html_index_attrs = img=alt,title; a=title;
    html_remove_elements = style, script
}
![Page 29: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/29.jpg)
Sphinx: Indexing the data
<?php
require_once("mysql.inc.php");
// Sphinx document IDs must be unsigned integers, so derive one from the
// first 16 hex digits of the MD5 of item_guid.
$sql = "SELECT conv(mid(md5(`item_guid`), 1, 16), 16, 10) AS `id`, `item_guid`,
    `name`, `subtitle`, `summary`, `toc`, `supplier`, `product_code`, `isbn`,
    `category`, `price`, `availibility` as `availability`, `type`, `link`,
    UNIX_TIMESTAMP(`publish_date`) AS `publish_date` FROM `rb_search`";
$result = read_query($sql);
// searchd speaks the MySQL protocol (SphinxQL) on port 9306.
$sphinx = mysql_connect("127.0.0.1:9306", "", "", true);
while ($row = mysql_fetch_array($result, MYSQL_ASSOC)) {
    foreach ($row as $key=>$value) {
        $row[$key] = mysql_escape_string($value);
    }
    $sql = "REPLACE INTO `rb_search` (`id`, `title`, `subtitle`,`availability`, `type`, `price`, `publish_date`, `item_guid`, `supplier`, `product_code`, `isbn`, `category`, `link`, `summary`, `toc`)
        VALUES ('{$row['id']}', '{$row['name']}', '{$row['subtitle']}',
        '{$row['availability']}', '{$row['type']}','{$row['price']}', '{$row['publish_date']}', '{$row['item_guid']}', '{$row['supplier']}', '{$row['product_code']}', '{$row['isbn']}', '{$row['category']}', '{$row['link']}','{$row['summary']}', '{$row['toc']}')";
    mysql_query($sql, $sphinx);
}
?>
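The `conv(mid(md5(...), 1, 16), 16, 10)` expression turns the first 16 hex digits of an MD5 hash into an unsigned 64-bit decimal number for use as the document ID. A rough PHP equivalent is sketched below. Both function names are hypothetical, and because PHP integers are signed, the result is kept as a decimal string and computed digit by digit rather than with `hexdec()`.

```php
<?php
// Hypothetical helper: convert a hex string to its decimal representation
// using schoolbook arithmetic on decimal digits, so values above PHP_INT_MAX
// are not rounded.
function hex_to_decimal_string(string $hex): string
{
    $digits = array(0); // decimal digits, least significant first
    foreach (str_split(strtolower($hex)) as $char) {
        // Multiply the running value by 16 and add the next hex digit.
        $carry = hexdec($char);
        for ($i = 0; $i < count($digits); $i++) {
            $value = $digits[$i] * 16 + $carry;
            $digits[$i] = $value % 10;
            $carry = intdiv($value, 10);
        }
        while ($carry > 0) {
            $digits[] = $carry % 10;
            $carry = intdiv($carry, 10);
        }
    }
    return implode('', array_reverse($digits));
}

// Hypothetical helper: derive a document ID from an item_guid in the same
// way as the SQL expression in the indexing script.
function sphinx_doc_id(string $itemGuid): string
{
    return hex_to_decimal_string(substr(md5($itemGuid), 0, 16));
}

print sphinx_doc_id('a34bbff9e17ada79658c72fde90c7369') . "\n";
```

Truncating the hash to 64 bits means two different GUIDs could in principle map to the same ID, although that is very unlikely for a catalogue of this size.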
![Page 30: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/30.jpg)
Sphinx: Querying the Index
mysql --host=127.0.0.1 --port=9306
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 2.0.3-id64-release (r3043)

mysql> select item_guid, title, subtitle, price from rb_search where match('china pharmaceutical') and price > 100 and price < 300 limit 2\G
*************************** 1. row ***************************
       id: 5228810066049016302
   weight: 6671
    price: 220
item_guid: cc74cb075aa37696198e87850f033398
    title: North China Pharmaceutical Group Corp-Therapeutic Competitors Report
 subtitle:
*************************** 2. row ***************************
       id: 3548867347418583847
   weight: 6662
    price: 190
item_guid: 6ce04df0fb277aa3ff596c2ca00c81a9
    title: China Pharmaceutical Industry Report
 subtitle: 2006-2007
2 rows in set (0.01 sec)
![Page 31: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/31.jpg)
Sphinx: Comparison Points
• Fastest indexing of all engines
• Really simple interface via SQL
• Document IDs must be unsigned integers
• No faceting support
• Good support in forums
![Page 32: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/32.jpg)
Xapian
• Deployed as a C++ library
• Bindings provided to connect to PHP
• Available in most package repositories
• Bindings need to be compiled separately
• Query Parser, similar to other engines
• Stemming and faceted search
• Server replication
![Page 33: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/33.jpg)
Xapian: Installation
• Install from system packages
• Compile PHP bindings from source
• No other software needed
• Runs on demand
![Page 34: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/34.jpg)
Xapian: Configuration concepts
• No configuration required
• Define-and-go schema
• Documents
• Terms
• Values
• Document data
![Page 35: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/35.jpg)
Xapian: Indexing Data
<?php
$xapian_db = new XapianWritableDatabase($xapian, Xapian::DB_CREATE_OR_OVERWRITE);
$xapian_term_generator = new XapianTermGenerator();
$xapian_term_generator->set_stemmer(new XapianStem("english"));

while ($row = mysql_fetch_array($result, MYSQL_ASSOC)) {
    $doc = new XapianDocument();
    $xapian_term_generator->set_document($doc);
    foreach ($xapian_term_weights as $field => $weight) {
        $xapian_term_generator->index_text($row[$field], $weight);
    }
    $xapian_term_generator->index_text($row['name'], 75, 'S:');
    $doc->add_boolean_term('CODE:' . $row['product_code']);
    $doc->add_value($xapian_value_slots['price'], Xapian::sortable_serialise($row['price']));
    $doc->add_value($xapian_value_slots['publish_date'], strftime("%Y%m%d", strtotime($row['publish_date'])));
    // add in additional values that we're going to use for facets
    $doc->add_value($xapian_value_slots['availability'], $row['availability']);
    $doc->set_data(serialize($doc_data));
    $docid = 'Q'.$row['item_guid'];
    $xapian_db->replace_document($docid, $doc);
}
?>
![Page 36: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/36.jpg)
Xapian: Querying the Index
<?php
$xapian_db = new XapianDatabase($xapian);
$query_parser = new XapianQueryParser();
$query_parser->set_stemmer(new XapianStem("english"));
$query_parser->set_default_op(XapianQuery::OP_AND);

$dvrProcessor = new XapianDateValueRangeProcessor($xapian_value_slots['publish_date'], 'date:');
$query_parser->add_valuerangeprocessor($dvrProcessor);

$query_parser->add_prefix("code", "CODE:");
$query_parser->add_prefix("category", "CATEGORY:");
$query_parser->add_prefix("title", "S:");
$query = $query_parser->parse_query('"Medical devices" NEAR china NOT russian price:10..150 category:medical');

$enquire = new XapianEnquire($xapian_db);
$enquire->set_query($query);
$matches = $enquire->get_mset($offset, $pagesize);

// Walk the match set from its first to its last result.
$start = $matches->begin();
$end = $matches->end();
while (!($start->equals($end))) {
    $doc = $start->get_document();
    $price = Xapian::sortable_unserialise($doc->get_value($xapian_value_slots['price']));
    $start->next();
}
?>
![Page 37: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/37.jpg)
Xapian: PHP APIs
• Only one option available from Xapian
• Requires additional compilation due to licensing
• API is not very well documented
![Page 38: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/38.jpg)
Xapian: Comparison Points
• Reasonably fast indexing
• Very flexible implementation
• Faceting and range searching
• Good Quick Start guide
• Responsive mailing list
• Third-party paid support
![Page 39: Open Source Search: An Analysis](https://reader034.fdocuments.net/reader034/viewer/2022042614/558df34d1a28aba2598b4616/html5/thumbnails/39.jpg)
In Summary
• Every project has different needs
• No one search product fits all
• Fastest to index was Sphinx
• Most feature-rich: Solr
• The next steps are up to you