In Search Of: Integrating Site Search (IPC)
![Page 1: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/1.jpg)
In Search Of...
Ian Barber
@ianbarber
http://[email protected]
http://joind.in/2172
integrating site search
![Page 2: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/2.jpg)
How Search Works
Integrating Search
Improving Results
Using Search
Search Performance
Questions
![Page 3: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/3.jpg)
![Page 4: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/4.jpg)
[Diagram: Documents → Analyser → Index; Queries → Query Parser → Index → Results]
![Page 5: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/5.jpg)
Tokenisation
“With AT&T’s help, the F.B.I Miami-Dade office had recovered $1.1 million from O’Healy’s Ponzi scheme, 10-15% more than expected.”
![Page 6: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/6.jpg)
PHP Tokenisation

    function tokenise($string) {
        $string = strtolower($string);
        preg_match_all('/\w+/', $string, $matches, PREG_OFFSET_CAPTURE);
        return $matches[0];
    }
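A minimal sketch of what this tokeniser does to awkward input. The sample string is a shortened, ASCII-only stand-in for the sentence on the previous slide, chosen here for illustration.

```php
<?php
// Same approach as the slide: lowercase, then capture runs of word characters.
function tokenise(string $string): array
{
    $string = strtolower($string);
    preg_match_all('/\w+/', $string, $matches, PREG_OFFSET_CAPTURE);
    return $matches[0]; // each entry is [token, byte offset]
}

// Hypothetical sample input (not from the slides).
$tokens = tokenise('The F.B.I recovered $1.1 million');
$words = array_column($tokens, 0);

foreach ($tokens as [$word, $offset]) {
    echo $offset, "\t", $word, "\n";
}
```

Note how "F.B.I" and "$1.1" fall apart into single-character tokens, which is the problem the tokenisation slide points at.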
![Page 7: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/7.jpg)
Document Term Pairs

| Document ID | Term |
| --- | --- |
| 1 | the |
| 1 | best |
| 1 | of |
| 1 | the |
| ... | ... |
| 204 | and |
| 204 | what |
| 204 | would |
![Page 8: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/8.jpg)
Inverted Index

| Term | Documents |
| --- | --- |
| best | 1 (4, 16), 4 (422), 129 (344) ... |
| what | 24 (50, 98), 75 (33, 208) ... |
| would | 99 (32, 599), 201 (344) ... |
| ... | ... |
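Going from document/term pairs to an inverted index could look roughly like this. The two documents are made up, and positions here are word positions, which is an assumption (the slide does not say what its position numbers count).

```php
<?php
// Build term => [docID => [positions...]] from a small in-memory collection.
function buildIndex(array $documents): array
{
    $index = [];
    foreach ($documents as $docID => $text) {
        preg_match_all('/\w+/', strtolower($text), $m);
        foreach ($m[0] as $position => $term) {
            $index[$term][$docID][] = $position; // 0-based word position
        }
    }
    ksort($index); // keep the dictionary in term order
    return $index;
}

// Hypothetical documents, keyed by document ID.
$docs = [
    1 => 'the best of the best',
    2 => 'what would the best do',
];
$index = buildIndex($docs);
print_r($index['best']);
```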
![Page 9: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/9.jpg)
Boolean Query Merge

Query: Best Western Hotel

| Term | Documents |
| --- | --- |
| best | 1 4 129 298 305 338 |
| western | 4 95 194 204 298 305 |
| working | 4 298 305 |
| hotel | 2 40 200 298 355 402 |

Result: Document 298
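The merge on this slide can be sketched as a step-wise intersection of sorted posting lists, using the document IDs shown. The function name is illustrative.

```php
<?php
// Intersect two sorted posting lists by walking both in step.
function intersect(array $a, array $b): array
{
    $result = [];
    $i = $j = 0;
    while ($i < count($a) && $j < count($b)) {
        if ($a[$i] === $b[$j]) {
            $result[] = $a[$i]; // present in both lists
            $i++;
            $j++;
        } elseif ($a[$i] < $b[$j]) {
            $i++;
        } else {
            $j++;
        }
    }
    return $result;
}

$postings = [
    'best'    => [1, 4, 129, 298, 305, 338],
    'western' => [4, 95, 194, 204, 298, 305],
    'hotel'   => [2, 40, 200, 298, 355, 402],
];

$working = intersect($postings['best'], $postings['western']);
$result  = intersect($working, $postings['hotel']);
echo implode(', ', $result), "\n";
```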
![Page 10: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/10.jpg)
![Page 11: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/11.jpg)
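TF-IDF

    // $term holds the postings for one term: docID => list of positions
    function getWeight($docID, $term, $total) {
        $tf = count($term[$docID]);             // term frequency in this document
        $idf = log($total / count($term), 2);   // inverse document frequency
        return $tf * $idf;
    }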
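A worked example of the weight calculation, reusing the postings for "best" from the inverted index slide (treated as complete here) and an assumed collection size of 12 documents.

```php
<?php
// $postings maps docID => list of positions for ONE term.
function getWeight(int $docID, array $postings, int $total): float
{
    $tf  = count($postings[$docID]);          // occurrences in this document
    $idf = log($total / count($postings), 2); // rarer terms score higher
    return $tf * $idf;
}

$best  = [1 => [4, 16], 4 => [422], 129 => [344]];
$total = 12; // assumed collection size, not from the slides

$w1 = getWeight(1, $best, $total); // 2 * log2(12 / 3)
$w4 = getWeight(4, $best, $total); // 1 * log2(12 / 3)
echo $w1, " ", $w4, "\n";
```

Document 1 contains the term twice, so it gets double the weight of document 4.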
![Page 12: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/12.jpg)
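Document Vector

| | socket | what | heavy | steel | ... |
| --- | --- | --- | --- | --- | --- |
| Doc 1 | 0.02 | 0.3 | 0.001 | 0 | ... |
| Doc 2 | 0 | 0 | 0 | 0 | ... |
| Doc 3 | 0.001 | 0.2 | 0 | 0 | ... |
| Doc 4 | 0 | 0 | 0.002 | 0.003 | ... |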
![Page 13: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/13.jpg)
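Ranked Query Merge

| best | 23 | 42 | 179 | 246 | 333 | 703 |
| --- | --- | --- | --- | --- | --- | --- |
| weight | 0.008 | 0.002 | 0.023 | 0.039 | 0.014 | 0.001 |

| western | 42 | 88 | 120 | 179 | 246 | 798 |
| --- | --- | --- | --- | --- | --- | --- |
| weight | 0.003 | 0.004 | 0.023 | 0.001 | 0.034 | 0.004 |

1 - 246: 0.073
2 - 179: 0.024
3 - 120: 0.023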
![Page 14: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/14.jpg)
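PHP Similarity

    function score($queryString, $index) {
        $query = tokenise($queryString);
        $matches = array();
        foreach ($query as $qterm) {
            // tokenise() returns array(token, offset) pairs
            $term = $qterm[0];
            if (!isset($index[$term])) {
                continue;
            }
            foreach ($index[$term] as $id => $posting) {
                if (!isset($matches[$id])) {
                    $matches[$id] = 0;
                }
                $matches[$id] += $posting['score'];
            }
        }
        arsort($matches); // sorts in place and returns a boolean
        return $matches;
    }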
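The same summing approach, run against the weights from the Ranked Query Merge slide. The function name and the flat docID => weight layout are simplifications for this sketch.

```php
<?php
// Sum per-term weights for each document, then sort by total score.
function rankDocuments(array $queryTerms, array $index): array
{
    $scores = [];
    foreach ($queryTerms as $term) {
        foreach ($index[$term] ?? [] as $docID => $weight) {
            $scores[$docID] = ($scores[$docID] ?? 0.0) + $weight;
        }
    }
    arsort($scores); // highest score first, document IDs kept as keys
    return $scores;
}

$index = [
    'best'    => [23 => 0.008, 42 => 0.002, 179 => 0.023, 246 => 0.039, 333 => 0.014, 703 => 0.001],
    'western' => [42 => 0.003, 88 => 0.004, 120 => 0.023, 179 => 0.001, 246 => 0.034, 798 => 0.004],
];

$ranked = rankDocuments(['best', 'western'], $index);
$top3 = array_slice($ranked, 0, 3, true); // preserve document IDs
print_r($top3);
```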
![Page 15: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/15.jpg)
Integrating Search
![Page 16: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/16.jpg)
MySQL Full Text Search

    CREATE TABLE example (
        id INT(11) NOT NULL auto_increment,
        title VARCHAR(255),
        content TEXT,
        PRIMARY KEY(id),
        FULLTEXT(title, content)
    ) ENGINE=MyISAM;

    INSERT INTO example (title, content) VALUES
        ('Mikko & Bacon', 'Mikko loves bacon'),
        ('Marcello & Bacon', 'Marcello hates bacon'),
        ('Jo & Sausages', 'Johanna loves sausages'),
        ('Hollywood & Garlic', 'Lorenzo hates garlic'),
        ('James & Cheddar', 'James is keen on cheeses');
![Page 17: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/17.jpg)
MySQL FTI Query

    SELECT * FROM example
    WHERE MATCH(title, content) AGAINST('loves bacon');

    +----+------------------+------------------------+
    | id | title            | content                |
    +----+------------------+------------------------+
    |  1 | Mikko & Bacon    | Mikko loves bacon      |
    |  2 | Marcello & Bacon | Marcello hates bacon   |
    |  3 | Jo & Sausages    | Johanna loves sausages |
    +----+------------------+------------------------+
    3 rows in set (0.00 sec)
![Page 18: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/18.jpg)
Sphinx http://www.sphinxsearch.com
![Page 19: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/19.jpg)
Sphinx Configuration

    source posts
    {
        type     = mysql
        sql_host = localhost
        sql_user = user
        sql_pass = password
        sql_db   = search

        sql_query = \
            SELECT id, title, content FROM example;
        sql_attr_multi = uint tag from query; \
            SELECT example_id, tag_id FROM tags;
    }
![Page 20: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/20.jpg)
    index posts
    {
        source     = posts
        path       = /var/data/sphinx/example
        morphology = stem_en

        min_word_len   = 3
        min_prefix_len = 3
        min_infix_len  = 0
        enable_star    = 1
    }
![Page 21: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/21.jpg)
Stemming

happening → happen
happened → happen
happens → happen

http://tartarus.org/~martin/PorterStemmer
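To show the idea only: a crude suffix stripper that handles the three words on this slide. This is not the Porter algorithm linked above, which applies a much longer, ordered set of rules; the function name and the suffix list are assumptions for the sketch.

```php
<?php
// Crude suffix stripping for illustration only; NOT the Porter algorithm.
function naiveStem(string $word): string
{
    $word = strtolower($word);
    foreach (['ing', 'ed', 's'] as $suffix) {
        $len = strlen($suffix);
        // Only strip when at least three characters of stem are left behind.
        if (strlen($word) > $len + 2 && substr($word, -$len) === $suffix) {
            return substr($word, 0, -$len);
        }
    }
    return $word;
}

$stems = array_map('naiveStem', ['happening', 'happened', 'happens']);
echo implode(', ', $stems), "\n";
```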
![Page 22: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/22.jpg)
Command Line Searching

    indexer --config /etc/sphinx.conf --all
    search --config /etc/sphinx.conf love bacon

    displaying matches:
    1. document=1, weight=3, tag=(1,2)
        id=1
        title=Mikko & Bacon
        content=Mikko loves bacon
    words:
    1. 'love': 2 documents, 2 hits
    2. 'bacon': 2 documents, 4 hits

    searchd --config /etc/sphinx.conf
![Page 23: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/23.jpg)
Sphinx From PHP

    $cl = new SphinxClient();
    $cl->SetServer('localhost', 3312);
    $cl->SetMatchMode(SPH_MATCH_ANY);

    $result = $cl->Query('bac*');
    $docIDs = array_keys($result["matches"]);

    $cl->SetFilter('tag', array(1));
    $result = $cl->Query('bac*');
    $docIDs = array_keys($result["matches"]);
![Page 24: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/24.jpg)
Swish-E
http://swish-e.org
pecl install swish-beta
![Page 25: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/25.jpg)
Filesystem Index With Swish-E

    IndexDir /var/data/documents
    IndexFile fs-swish-e.index
    IndexOnly .doc .docx .pdf
    FuzzyIndexingMode Stemming_en1

    FileFilter .pdf /usr/local/bin/swish_filter.pl
    FileFilter .doc /usr/local/bin/swish_filter.pl

fs-swish-e.conf

    /usr/local/bin/swish-e -S fs -c fs-swish-e.conf
![Page 26: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/26.jpg)
Crawling Content

    IndexDir /usr/local/lib/swish-e/spider.pl
    IndexFile www-swish-e.index
    SwishProgParameters default http://phpir.com/

    FuzzyIndexingMode Stemming_en1
    DefaultContents HTML

www-swish-e.conf

    /usr/local/bin/swish-e -S prog -c www-swish-e.conf
![Page 27: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/27.jpg)
Swish-E With Multiple Indices

    $swish = new Swish('www-swish-e.index fs-swish-e.index');
    $search = $swish->prepare();

    $queryStr = 'search string goes here';
    $result = $search->execute($queryStr);
    $total = $result->hits;

    while ($r = $result->nextResult()) {
        echo $r->swishdocpath; // url
    }
![Page 28: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/28.jpg)
Lucene
![Page 29: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/29.jpg)
Build Index

    $index = Zend_Search_Lucene::create('idx');
    foreach ($documents as $title => $content) {
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(
            Zend_Search_Lucene_Field::Text('title', $title));
        $doc->addField(
            Zend_Search_Lucene_Field::UnStored('content', $content));
        $index->addDocument($doc);
    }
![Page 30: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/30.jpg)
Query Zend Search Lucene

    $results = $index->find('loves bacon');
    foreach ($results as $result) {
        echo $result->score, " ";
        echo $result->title, "\n";
    }

Output:

    0.81656279309067 Mikko and Bacon
    0.24800278854758 Marcello & Bacon
![Page 31: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/31.jpg)
Index HTML

    $file = file_get_contents($url);

    $doc = Zend_Search_Lucene_Document_Html::loadHTML($file);

    $doc->addField(
        Zend_Search_Lucene_Field::Text('url', $url));
    $index->addDocument($doc);
![Page 32: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/32.jpg)
Solr http://lucene.apache.org/solr/
![Page 33: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/33.jpg)
Solr Search Index

    $options = array(
        'hostname' => 'localhost',
        'port' => 8983
    );

    $client = new SolrClient($options);
    $doc = new SolrInputDocument();
    $doc->addField('id', $id);
    $doc->addField('cat', $category);
    $doc->addField('title', $title);
    $doc->addField('text', $text);
    $response = $client->addDocument($doc);
    $client->commit();
![Page 34: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/34.jpg)
Solr Search Client

    $client = new SolrClient($options);

    $query = new SolrQuery('bacon');
    $response = $client->query($query);
    $r = $response->getResponse();

    foreach ($r['response']['docs'] as $d) {
        echo $d->title[0] . "\n";
    }
![Page 35: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/35.jpg)
Xapian
http://xapian.org
![Page 36: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/36.jpg)
Xapian In PHP

    $db = new XapianWritableDatabase('idx', Xapian::DB_CREATE_OR_OPEN);
    $i = new XapianTermGenerator();
    $i->set_stemmer(new XapianStem("english"));

    $doc = new XapianDocument();
    $doc->set_data($content);
    $doc->add_value(1, $title);

    $i->set_document($doc);
    $i->index_text($content);
    $db->add_document($doc);
![Page 37: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/37.jpg)
Xapian Search In PHP

    $database = new XapianDatabase('idx');
    $enquire = new XapianEnquire($database);
    $qp = new XapianQueryParser();
    $qp->set_stemmer(new XapianStem("english"));
    $qp->set_database($database);
    $qp->set_stemming_strategy(XapianQueryParser::STEM_SOME);
    $query = $qp->parse_query($queryString);

    $enquire->set_query($query);
![Page 38: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/38.jpg)
    $matches = $enquire->get_mset(0, 10);

    $i = $matches->begin();
    while (!$i->equals($matches->end())) {
        $n = $i->get_rank() + 1;
        $data = $i->get_document()->get_data();
        $title = $i->get_document()->get_value(1);
        $score = $i->get_percent();
        $i->next();
    }
![Page 39: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/39.jpg)
Improving Results
![Page 40: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/40.jpg)
Anchor Text
![Page 41: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/41.jpg)
Parse Anchor Text

    $p = file_get_contents('http://phpir.com');

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($p); // call loadHTML() on an instance; the static form fails in PHP 8
    $links = $dom->getElementsByTagName('a');

    foreach ($links as $link) {
        $href = $link->getAttribute('href');
        $text = $link->nodeValue;
    }
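A self-contained variant of the same idea that groups link text by target URL, so it can be indexed alongside the target page. The HTML fragment and function name are made up for the sketch, and it assumes the DOM extension is available.

```php
<?php
// Collect the link text pointing at each URL in an HTML fragment.
function anchorText(string $html): array
{
    libxml_use_internal_errors(true); // tolerate messy real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $anchors = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        $text = trim($link->nodeValue);
        if ($href !== '' && $text !== '') {
            $anchors[$href][] = $text; // several links may describe one page
        }
    }
    return $anchors;
}

// Hypothetical page fragment.
$html = '<p><a href="/tfidf">TF-IDF weighting</a> and <a href="/tfidf">term weights</a>, '
      . 'plus <a href="/stem">stemming</a>.</p>';
$anchors = anchorText($html);
print_r($anchors);
```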
![Page 42: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/42.jpg)
Zone Weighting
[Figure: a page with zones numbered 1, 2 and 3]
![Page 43: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/43.jpg)
ZSL Zone Weighting

    $doc = new Zend_Search_Lucene_Document();

    $tfield = Zend_Search_Lucene_Field::Text('title', $title);
    $tfield->boost = 1.3;
    $doc->addField($tfield);

    $doc->addField(
        Zend_Search_Lucene_Field::UnStored('content', $content));

    $index->addDocument($doc);
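The general idea of zone weighting, independent of any library, might be sketched like this. The weights, the second document's content and the function name are assumptions, and this is not how Zend Search Lucene applies a field boost internally.

```php
<?php
// Score a document as the sum of the weights of every zone containing the term.
function zoneScore(array $zones, array $weights, string $term): float
{
    $score = 0.0;
    foreach ($zones as $zone => $text) {
        if (stripos($text, $term) !== false) {
            $score += $weights[$zone] ?? 0.0;
        }
    }
    return $score;
}

// Assumed weights: a title hit counts for more than a body hit.
$weights = ['title' => 0.6, 'content' => 0.4];
$docs = [
    1 => ['title' => 'Mikko & Bacon', 'content' => 'Mikko loves bacon'],
    3 => ['title' => 'Jo & Sausages', 'content' => 'Johanna loves sausages and bacon'],
];

$scores = [];
foreach ($docs as $id => $zones) {
    $scores[$id] = zoneScore($zones, $weights, 'bacon');
}
print_r($scores);
```

Document 1 matches in both zones and outranks document 3, which only matches in the body.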
![Page 44: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/44.jpg)
Document Authority
![Page 45: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/45.jpg)
Document Weights in ZSL

    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(
        Zend_Search_Lucene_Field::Text('title', $title));
    $doc->addField(
        Zend_Search_Lucene_Field::UnStored('content', $content));

    $doc->boost = 1 + ($numComments / 100);

    $index->addDocument($doc);
![Page 46: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/46.jpg)
Using Search
![Page 47: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/47.jpg)
Summaries & Highlighting
![Page 48: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/48.jpg)
Sphinx Extract & Highlight

    $cl = new SphinxClient();
    $cl->SetServer("localhost", 3312);
    $q = 'bacon';
    $r = $cl->Query($q);
    foreach ($r["matches"] as $doc => $info) {
        $text[$doc] = getTextFromDB($doc);
    }

    // Assign to the variable the loop below iterates over
    $extracts = $cl->BuildExcerpts($text, 'posts', $q);
    foreach ($extracts as $extract) {
        echo $extract;
    }
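For engines without a built-in excerpt builder, a rough stand-in could look like this. It works on bytes rather than characters and highlights a single term, both of which are simplifications; the function name and the radius parameter are illustrative.

```php
<?php
// Cut a window of text around the first hit and wrap every hit in <b> tags.
function excerpt(string $text, string $term, int $radius = 20): string
{
    $pos = stripos($text, $term);
    if ($pos === false) {
        // No hit: fall back to the start of the document.
        return htmlspecialchars(substr($text, 0, $radius * 2));
    }
    $start = max(0, $pos - $radius);
    $snippet = substr($text, $start, strlen($term) + $radius * 2);
    $pattern = '/' . preg_quote($term, '/') . '/i';
    // Escape first so the only markup in the result is the highlight itself.
    return preg_replace($pattern, '<b>$0</b>', htmlspecialchars($snippet));
}

$hit  = excerpt('Mikko loves bacon', 'bacon');
$miss = excerpt('Lorenzo hates garlic', 'bacon');
echo $hit, "\n", $miss, "\n";
```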
![Page 49: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/49.jpg)
![Page 50: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/50.jpg)
Xapian Spelling Correction

Indexer

    $indexer = new XapianTermGenerator();
    $indexer->set_database($database);
    $indexer->set_flags(XapianTermGenerator::FLAG_SPELLING);

Searcher

    $queryString = "strreplace or str_cmp";
    $q = new XapianQueryParser();
    $q->set_database($database);
    $query = $q->parse_query($queryString,
        XapianQueryParser::FLAG_SPELLING_CORRECTION);
    echo "Did you mean: " . $q->get_corrected_query_string() . "\n";
![Page 51: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/51.jpg)
Spelling Correction Output

    php xapsearch.php

    Did you mean: str_replace or strcmp

    4644 results found for “strreplace or str_cmp”:
    1: 2% docid=572 [phpdocs/html/cc.license.html]
    2: 2% docid=7169 [phpdocs/html/imagick.constants.html]
    3: 2% docid=10086 [phpdocs/html/sqlite3result.fetcharray.html]
    4: 2% docid=6132 [phpdocs/html/function.swf-posround.html]
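A simple way to get a similar "Did you mean" effect without Xapian is edit distance against the indexed vocabulary. This sketch uses PHP's built-in `levenshtein()`; it is a swapped-in technique, not Xapian's own spelling algorithm, and the vocabulary is hypothetical.

```php
<?php
// Suggest the closest vocabulary term within a small edit distance.
function suggest(string $word, array $vocabulary, int $maxDistance = 2): string
{
    $best = $word;
    $bestDistance = $maxDistance + 1;
    foreach ($vocabulary as $candidate) {
        $distance = levenshtein($word, $candidate);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return $best; // unchanged if nothing is close enough
}

// Hypothetical vocabulary; in practice this would come from the indexed terms.
$vocabulary = ['str_replace', 'strcmp', 'str_repeat', 'preg_replace'];

$corrected = [];
foreach (['strreplace', 'or', 'str_cmp'] as $word) {
    // Leave the boolean operator alone.
    $corrected[] = $word === 'or' ? $word : suggest($word, $vocabulary);
}
$didYouMean = implode(' ', $corrected);
echo "Did you mean: ", $didYouMean, "\n";
```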
![Page 52: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/52.jpg)
Results Sorting
![Page 53: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/53.jpg)
Sorting in ZSL

    $q = Zend_Search_Lucene_Search_QueryParser::parse('search string');

    $results = $index->find($q, 'title');
    foreach ($results as $result) {
        echo '<h3>', $result->title, "</h3>\n";
        $doc = getDocumentFromDB($result->did);
        echo $q->htmlFragmentHighlightMatches($doc);
    }
![Page 54: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/54.jpg)
Faceted Search
![Page 55: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/55.jpg)
Faceted Search In Solr

    $client = new SolrClient($options);
    $query = new SolrQuery('bacon');
    // Facet options must be set before the query is sent
    $query->setFacet(true);
    $query->addFacetField('cat');
    $response = $client->query($query);
    $r = $response->getResponse();
    $f = $r['facet_counts']['facet_fields'];
    foreach ($f['cat'] as $facet => $count) {
        echo $facet . " " . $count . "\n";
    }
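What the facet counts represent can be sketched in plain PHP: for the documents that matched, count how many fall under each value of a field. The result set and function name are invented for the example.

```php
<?php
// Count how many matching documents fall into each value of a field.
function facetCounts(array $matches, string $field): array
{
    $counts = [];
    foreach ($matches as $doc) {
        $value = $doc[$field] ?? null;
        if ($value !== null) {
            $counts[$value] = ($counts[$value] ?? 0) + 1;
        }
    }
    arsort($counts); // most common facet value first
    return $counts;
}

// Hypothetical result set for the query "bacon".
$matches = [
    ['id' => 1, 'cat' => 'food'],
    ['id' => 2, 'cat' => 'food'],
    ['id' => 7, 'cat' => 'recipes'],
];
$facets = facetCounts($matches, 'cat');

foreach ($facets as $facet => $count) {
    echo $facet, " ", $count, "\n";
}
```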
![Page 56: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/56.jpg)
More Like This
![Page 57: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/57.jpg)
More Like This

    $rset = new XapianRset();
    $rset->add_document(5959); // str_replace
    $e = $enquire->get_eset(40, $rset);

    $t = $e->begin();
    for ($t; !$t->equals($e->end()); $t->next()) {
        $qs[] = new XapianQuery($t->get_term(), intval($t->get_weight()));
    }

    $query = new XapianQuery(XapianQuery::OP_OR, $qs);
![Page 58: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/58.jpg)
More Like This Example

    php xapsim.php

    1656 results found:
    1: 100% docid=5959 [phpdocs/html/function.str-replace.html]
    2: 47% docid=5956 [phpdocs/html/function.str-ireplace.html]
    3: 24% docid=5328 [phpdocs/html/function.preg-replace.html]
    4: 18% docid=5958 [phpdocs/html/function.str-repeat.html]
![Page 59: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/59.jpg)
Search Performance
![Page 60: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/60.jpg)
Index Updates
[Diagram: existing Docs live in the Main index; New docs go into a Delta index; a Query searches Delta and Main together; Delta is later merged into Main]
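One way to read the main/delta scheme, as a rough sketch: search both indexes, and let any document that has been re-indexed into the delta hide its stale copy in main. The array layout and function name are assumptions, not any engine's actual format.

```php
<?php
// Query both indexes; a document present in the delta supersedes its copy in main.
function searchMainAndDelta(string $term, array $main, array $delta): array
{
    $results = $delta['postings'][$term] ?? [];
    foreach ($main['postings'][$term] ?? [] as $docID => $score) {
        // Skip stale main entries for documents re-indexed into the delta.
        if (!in_array($docID, $delta['docs'], true)) {
            $results[$docID] = $score;
        }
    }
    arsort($results); // best score first
    return $results;
}

$main = [
    'docs'     => [1, 2, 3],
    'postings' => ['bacon' => [1 => 0.5, 2 => 0.3]],
];
$delta = [
    'docs'     => [2, 4], // doc 2 was edited and no longer matches, doc 4 is new
    'postings' => ['bacon' => [4 => 0.4]],
];

$results = searchMainAndDelta('bacon', $main, $delta);
print_r($results);
```

The small delta is cheap to rebuild often, while the large main index only needs rebuilding when the two are merged.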
![Page 61: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/61.jpg)
Search Speed

Zend Search Lucene

    $index = Zend_Search_Lucene::open('index');
    $index->optimize();

Sphinx

    indexer --merge main delta --rotate

Solr

    $client = new SolrClient($options);
    $client->optimize();

Xapian

    xapian-compact xapindex xapindex2
![Page 62: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/62.jpg)
Distributing Search
[Diagram: Documents are spread across several Indexes, all queried by one Application]
![Page 63: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/63.jpg)
Large Scale Search
http://www.nutch.org
http://hadoop.apache.org
![Page 64: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/64.jpg)
Image Credits
Title: http://www.flickr.com/photos/generated/2084287794/
What Do You Want: http://www.flickr.com/photos/the_justified_sinner/2498066986/
You Are Here: http://www.flickr.com/photos/alecvuijlsteke/2692475420/
Integrating Search: http://www.flickr.com/photos/squeaks2569/3700355684/
Sphinx: http://www.flickr.com/photos/generated/2084287794/
Lucene: http://www.flickr.com/photos/mypanda/7731447/
Swish-e: http://www.flickr.com/photos/ryan_fung/2239687100/
Solr: http://www.flickr.com/photos/m-j-s/2724756177/
Xapian: http://www.flickr.com/photos/olibac/3522056495/
Using Search: http://www.flickr.com/photos/eneas/175027945/
Improving Search: http://www.flickr.com/photos/x-ray_delta_one/3928200642/
Search Performance: http://www.flickr.com/photos/maisonbisson/1634408/
Large Scale Search: http://www.flickr.com/photos/zedzap/3663508847/
![Page 65: In Search Of: Integrating Site Search (IPC)](https://reader033.fdocuments.net/reader033/viewer/2022052321/555821e1d8b42a25588b4c06/html5/thumbnails/65.jpg)
Questions?