Search Lucene
Transcript of Search Lucene
Can you be dynamic and fast?
“Miss Marple and the case of the Missing MIPS”
Zoë Slattery
Agenda
● Index and search applications
● The problem for PHP programmers
● Understanding execution times
● Conclusions
Index and search
● Problem of finding relevant information is not new.
  – 3000 years BC [1]
  – Vannevar Bush, As We May Think, 1945.
● Today applications that search the Web must be able to provide instant access to > 10 billion documents
● Many applications need some form of search, e.g. searching your hard drive, email....
1. Lagoze, C. Singhal, A. Information Discovery: Needles and Haystacks. IEEE Internet Computing. Volume 9(3), 16-18, 2005.
Options for information retrieval
● Search engines
  – Nutch, SearchBlox.....
● Information Retrieval libraries
  – Three with broadly similar features

         Implementation language  Language bindings             Language ports      License
Egothor  Java                     None                          None                BSD like
Xapian   C++                      Perl, Python, PHP, Java, TCL  None                GPL
Lucene   Java                     None                          C++, Perl, PHP, C#  Apache 2
Lucene [2]
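[Figure: a Lucene application. Data is gathered from the Web, a DB and the filesystem. Lucene indexes the documents into an index and searches the index. The application gets the user query and presents search results to the user.]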
2. Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich. 2005.
Lucene indexing
[Figure: the indexing flow, from start to end, in four stages]
1. Documents
   "Oh for a muse of fire that would ascend the brightest heaven of invention....."
   (Analysis)
2. Token stream
   [fire] [ascend] [bright] [heaven]
   (Index creation)
3. Inverted index
   Terms    Documents
   fire     Henry V, Scouting for boys...
   ascend   Aerospace, Henry V...
   ...
   (Optimise)
4. Optimised inverted index
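The stages above can be sketched in a few lines of PHP. This is only an illustration of the idea of an inverted index, not the Zend Search Lucene API: the function names, the stop word list and the two short documents other than the Henry V line are made up for the example, and no stemming is done (so "brightest" stays as it is rather than becoming [bright]).

```php
<?php
// Hypothetical helper names; this is not the Zend_Search_Lucene API.

/** Analysis: split text into lower-case word tokens and drop stop words. */
function analyse(string $text, array $stopWords): array
{
    preg_match_all('/[a-z]+/', strtolower($text), $matches);
    return array_values(array_diff($matches[0], $stopWords));
}

/** Index creation: map each term to the names of the documents containing it. */
function buildInvertedIndex(array $documents, array $stopWords): array
{
    $index = [];
    foreach ($documents as $name => $text) {
        foreach (array_unique(analyse($text, $stopWords)) as $term) {
            $index[$term][] = $name;
        }
    }
    ksort($index); // keep the terms sorted, as a term dictionary would be
    return $index;
}

$documents = [
    'Henry V'           => 'O for a muse of fire, that would ascend the brightest heaven of invention',
    'Scouting for boys' => 'How to light a fire',                  // made-up text
    'Aerospace'         => 'Rockets ascend through the atmosphere', // made-up text
];
$stopWords = ['a', 'for', 'o', 'of', 'that', 'the', 'to', 'would']; // assumed list

$index = buildInvertedIndex($documents, $stopWords);
print_r($index['fire']);   // documents containing "fire"
print_r($index['ascend']); // documents containing "ascend"
```

A search for a term is then a single array lookup, which is why the index is built once up front and only the lookup happens at query time.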
Indexing speed
                          Java + JIT  Java  PHP
Time to index/seconds     4           32    167
Time to optimise/seconds  0.3         3     43
Total time/seconds        4.3         35    210

Benchmark:
● 17.4 MB, 814 files of PHP source code
● Linux/Thinkpad T60
Ouch! Nearly 50 times as fast in Java
Why is the performance so bad?
First make sure we are comparing the same thing:
➢ Compare indexes using Luke
➢ Limits on terms
    ➢ Java stops looking at 10,000 terms
➢ Scoring
    ➢ Java rounds down, PHP rounds to closest
➢ Analyser
    ➢ Java Lucene has many analysers
Analysis - Java
Analyzing "A Quick Brown Fox jumped over the Lazy Dog"
  StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
  SimpleAnalyzer: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
  StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
Analyzing "XY&Z Corporation - xyz@example.com"
  StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
  SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
  StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
Analysis - PHP
Analysing "A Quick Brown Fox jumped over the Lazy Dog"
  Default (lower case) filter: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
  Stop words filter: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
  Short words filter: [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
Analysing "XY&Z Corporation - xyz@example.com"
  Default (lower case) filter: [xy] [z] [corporation] [xyz] [example] [com]
  Stop words filter: [xy] [z] [corporation] [xyz] [example] [com]
  Short words filter: [xy] [corporation] [xyz] [example] [com]
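The three PHP filters can be approximated as follows. This is a sketch under assumptions, not the Zend Search Lucene classes: the function names are hypothetical, the stop word list (just "a" and "the") and the minimum length of 2 are inferred from the outputs listed above, and the second sample string is taken to end in xyz@example.com, as its token output implies.

```php
<?php
// Hypothetical function names for illustration; not the Zend_Search_Lucene classes.

/** Default filter: tokenise on runs of letters, then lower-case each token. */
function lowerCaseTokens(string $text): array
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return array_map('strtolower', $matches[0]);
}

/** Stop words filter: drop tokens that appear in the stop word list. */
function stopWordsFilter(array $tokens, array $stopWords): array
{
    return array_values(array_diff($tokens, $stopWords));
}

/** Short words filter: drop tokens shorter than $minLength characters. */
function shortWordsFilter(array $tokens, int $minLength = 2): array
{
    return array_values(array_filter($tokens, function (string $t) use ($minLength): bool {
        return strlen($t) >= $minLength;
    }));
}

$fox   = lowerCaseTokens('A Quick Brown Fox jumped over the Lazy Dog');
$email = lowerCaseTokens('XY&Z Corporation - xyz@example.com');

echo implode(' ', stopWordsFilter($fox, ['a', 'the'])), "\n"; // quick brown fox jumped over lazy dog
echo implode(' ', shortWordsFilter($fox)), "\n";              // quick brown fox jumped over the lazy dog
echo implode(' ', $email), "\n";                              // xy z corporation xyz example com
echo implode(' ', shortWordsFilter($email)), "\n";            // xy corporation xyz example com
```

Because the tokeniser only keeps letters, "xy&z" and the e-mail address are split into pieces, which is the visible difference from Java's StandardAnalyzer.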
Compare indexes
[Figure: the Java and PHP indexes side by side in Luke. Both contain the same 663 terms.]
Agenda
● Index and search applications
● The problem for PHP programmers
● Understanding execution times
  – Part one
  – Part two
● Conclusions
Execution profiles
● Now that we are definitely comparing the same thing, look at execution profiles for the Java and PHP implementations
● Profiling tools (all open source)
  – Java: Eclipse TPTP
  – PHP: Xdebug, KCachegrind
  – System: Sysprof, vmstat, iostat
Java profile
Small problems with TPTP...

                          Java  Java + profile
Time to index/seconds     2.3   687258
Time to optimise/seconds  0.3   673851
% time in indexing        88    50

● Invasive and slow. Takes 600,000 times as long to execute
● Some problems getting it to run on Ubuntu (missing C++ libraries, ksh specific scripts)
● Output file is machine readable only
But it's free, open source and it works well enough.
Benchmark data:
● 39 files of PHP source code (php/Zend), 1.2 MB
PHP profile
No problems with this tool

                          PHP  PHP + profile
Time to index/seconds     5    70
Time to optimise/seconds  3    55
% time in indexing        63   56

● Not so invasive as the Java tool but still adds to time and distorts slightly
● Results easy to display with KCachegrind
● Output file is readable
Benchmark data:
● 39 files of PHP source code (php/Zend), 1.2 MB
look at the normalize() code
    public function normalize(Token $srcToken) {
        // Copy the source token into a new one with lower-cased term text
        $newToken = new Token(strtolower($srcToken->getTermText()),
                              $srcToken->getStartOffset(),
                              $srcToken->getEndOffset());
        $newToken->setPositionIncrement($srcToken->getPositionIncrement());
        return $newToken;
    }
The normalize() function
Sum (functions called from normalize()) = 2.92; 18.99 – 2.92 = 16.07
Micro benchmark
    <?php
    require_once "Token.php";
    require_once "LowerCase.php";

    $token = new Token("GO", 105, 107);
    $filter = new LowerCase();

    // Call normalize() ten million times so that per-call cost dominates
    for ($i = 0; $i < 10000000; $i++) {
        $norm_token = $filter->normalize($token);
    }
    ?>
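A self-contained variant of the micro benchmark is sketched below for readers without Token.php and LowerCase.php. The two classes are minimal stand-ins whose shape is assumed from the normalize() code shown earlier, the iteration count is reduced to keep the run short, and the second loop (a bare strtolower() call) is an addition to give a baseline for the only real work normalize() does. Timings will vary by machine and PHP version.

```php
<?php
// Minimal stand-ins for Token.php and LowerCase.php (assumed shapes, not the real files).
class Token
{
    private $termText;
    private $startOffset;
    private $endOffset;
    private $positionIncrement = 1;

    public function __construct(string $text, int $start, int $end)
    {
        $this->termText = $text;
        $this->startOffset = $start;
        $this->endOffset = $end;
    }
    public function getTermText(): string { return $this->termText; }
    public function getStartOffset(): int { return $this->startOffset; }
    public function getEndOffset(): int { return $this->endOffset; }
    public function getPositionIncrement(): int { return $this->positionIncrement; }
    public function setPositionIncrement(int $inc): void { $this->positionIncrement = $inc; }
}

class LowerCase
{
    public function normalize(Token $srcToken): Token
    {
        $newToken = new Token(
            strtolower($srcToken->getTermText()),
            $srcToken->getStartOffset(),
            $srcToken->getEndOffset()
        );
        $newToken->setPositionIncrement($srcToken->getPositionIncrement());
        return $newToken;
    }
}

$iterations = 100000; // far fewer than the 10,000,000 on the slide
$token = new Token('GO', 105, 107);
$filter = new LowerCase();

// Time the full normalize() path: six method calls plus one object creation.
$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $normToken = $filter->normalize($token);
}
$viaMethods = microtime(true) - $start;

// Baseline: only the string conversion itself.
$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $lower = strtolower('GO');
}
$bareCall = microtime(true) - $start;

printf("normalize(): %.4f s, bare strtolower(): %.4f s\n", $viaMethods, $bareCall);
```

The gap between the two timings gives a rough feel for how much of the cost is call and allocation overhead rather than the lower-casing.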
normalize() opcodes
compiled vars: !0 = $srcToken, !1 = $newToken
line   #  op                     ext  return  operands
  11   0  RECV                                1
  13   1  ZEND_FETCH_CLASS            :0      'Token'
       2  NEW                         $1      :0
       3  ZEND_INIT_METHOD_CALL               !0, 'getTermText'
       4  DO_FCALL_BY_NAME       0
       5  SEND_VAR_NO_REF                     $3
       6  DO_FCALL               1            'strtolower'
       7  SEND_VAR_NO_REF                     $4
  14   8  ZEND_INIT_METHOD_CALL               !0, 'getStartOffset'
       9  DO_FCALL_BY_NAME       0
      10  SEND_VAR_NO_REF                     $6
  15  11  ZEND_INIT_METHOD_CALL               !0, 'getEndOffset'
      12  DO_FCALL_BY_NAME       0
      13  SEND_VAR_NO_REF                     $8
      14  DO_FCALL_BY_NAME       3
      15  ASSIGN                              !1, $1
  16  ......
System profile
1. Convert to lower case
2. Look up opcodes
How Xdebug works
[Figure: script execution]
ZEND_INIT_METHOD_CALL
● Convert function name to lower case
● Look up function in function table
DO_FCALL_BY_NAME
● Call out to profiler: start time
● Execute function
● Call out to profiler: end time
The normalize() function
The remaining 16.07 (18.99 – 2.92) is consumed in setting up functions to be run
Why is function calling faster in Java?
● Java is a static language. VM structures are known at startup: code can't be added on the fly, and types are known at compile time.
● The first time a function is called, Java caches a reference to it in a virtual dispatch table. After that, function calls are fast.
● In PHP, code can be added during execution (for example with create_function()), and types are not known until the code is executed. This makes keeping virtual dispatch tables much more difficult.
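The last point can be seen in a small sketch. The function name below is built while the script runs and the function is defined during execution, so the call has to be resolved by name at call time. The names are made up for the example. create_function(), mentioned above, was removed in PHP 8.0, so eval() is used here instead.

```php
<?php
// The function name is only known at run time, so the call further down
// cannot be bound to a target before the script executes.
$name = 'shout_' . strlen('abc');        // 'shout_3', built while the script runs

$existedBefore = function_exists($name); // false: nothing is defined yet

// Add code during execution.
eval("function {$name}(\$s) { return strtoupper(\$s) . '!'; }");

$existsAfter = function_exists($name);   // true
$result = $name('fire');                 // resolved by name in the function table

var_dump($existedBefore, $existsAfter, $result);
```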
PHP profile
look at the call to normalize()
    $token = $this->normalize(new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

    public function normalize(Token $srcToken) {
        // A token is created by the caller, then copied into a second token here
        $newToken = new Token(strtolower($srcToken->getTermText()),
                              $srcToken->getStartOffset(),
                              $srcToken->getEndOffset());
        $newToken->setPositionIncrement($srcToken->getPositionIncrement());
        return $newToken;
    }
normalize() recoded....
    $token = $this->normalize(new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

    public function normalize(Token $srcToken) {
        // Lower-case the term text in place instead of building a second token
        $srcToken->setTermText(strtolower($srcToken->getTermText()));
        return $srcToken;
    }
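The difference between the two versions can be checked with a small sketch. CountingToken is a hypothetical class that only counts how many instances are constructed, and the two free functions stand in for the original and recoded normalize() methods. It is not the Zend Search Lucene code.

```php
<?php
// Hypothetical token class that counts how many instances get constructed.
class CountingToken
{
    public static $created = 0;
    private $termText;

    public function __construct(string $text)
    {
        $this->termText = $text;
        self::$created++;
    }
    public function getTermText(): string { return $this->termText; }
    public function setTermText(string $text): void { $this->termText = $text; }
}

/** Original style: build a second token to hold the lower-cased text. */
function normalizeByCopy(CountingToken $src): CountingToken
{
    return new CountingToken(strtolower($src->getTermText()));
}

/** Recoded style: lower-case the token in place and hand it back. */
function normalizeInPlace(CountingToken $src): CountingToken
{
    $src->setTermText(strtolower($src->getTermText()));
    return $src;
}

$words = ['Muse', 'Fire', 'Heaven'];

CountingToken::$created = 0;
$copied = array_map(function ($w) {
    return normalizeByCopy(new CountingToken($w))->getTermText();
}, $words);
$createdByCopy = CountingToken::$created;   // two objects per word

CountingToken::$created = 0;
$inPlace = array_map(function ($w) {
    return normalizeInPlace(new CountingToken($w))->getTermText();
}, $words);
$createdInPlace = CountingToken::$created;  // one object per word

var_dump($copied === $inPlace, $createdByCopy, $createdInPlace);
```

Changing the token in place is safe in the call shown on the slide because the caller passes a freshly constructed token that nothing else refers to. The recoded version also avoids the extra getter calls on every token.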
After fix
Performance improvement?
            Time to index/seconds  Time to optimise/seconds  Total time/seconds
PHP         167                    43                        210
PHP + fix   151                    43                        194
Java        32                     3                         35
Java + JIT  4                      0.3                       4.3

9.5 % improvement in time to index (167 to 151 seconds)
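As a check on these figures, the relative changes can be worked out from the table values alone (indexing time, total time, and the remaining gap to Java + JIT):

```latex
\[
\frac{167 - 151}{167} \approx 0.096, \qquad
\frac{210 - 194}{210} \approx 0.076, \qquad
\frac{194}{4.3} \approx 45
\]
```

So the fix saves a little under 10 % of the indexing time and about 7.6 % of the total time, and the fixed PHP version still takes roughly 45 times as long as Java + JIT on this benchmark.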
Conclusions
● Two reasons why the PHP implementation of Lucene is slow:
  – Function calling overhead in PHP
  – Inefficient code in the analyser [3]
  – These are the main two, there are others....
● Dynamic and fast?
  – Hard to get to the same execution speed as Java, but possible to get closer.
  – But development speed is much better [4]: what speed do you care about?
  – Better not to use Java coding style (lots of methods that do nothing)
● So which implementation of Lucene should I use?
  – It depends.....
3. http://framework.zend.com/issues/browse/ZF-3683
4. Prechelt, L. An empirical comparison of seven programming languages. Computer. Volume 33(10), 23-29, 2000.
Options for PHP
[Figure: decision flowchart with Y/N branches]
Questions:
● Do you care about speed?
● Only need basic features?
● Can support Java environment?
● Use a Web Service?
Outcomes:
● Use Zend Search Lucene
● Use Lucene via a Java bridge
● Use SOLR as web service
● No Lucene solution today [5]
5. http://pecl.php.net/package/clucene
Other useful links
● http://www.egothor.org/
● http://xapian.org/
● http://lucene.apache.org/
● http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
● http://www.derickrethans.nl/vld.php
● http://lucene.apache.org/nutch/
● http://www.searchblox.com/
● http://www.xdebug.org/
● http://www.eclipse.org/tptp/
● http://www.getopt.org/luke/
● http://www.projectzero.org
● http://www.ibm.com/developerworks/ (Publication due 24/09/08)
● http://php-java-bridge.sourceforge.net/doc/
● http://www.zend.com/en/products/platform/product-comparison/java-bridge
● http://lucene.apache.org/solr/
● http://www.ibm.com/developerworks/websphere/library/techarticles/0809_phillips/0809_phillips.html