Introduction to Open Source Search with Apache Lucene and Solr
8/14/2019 better search with apache lucene and solr
Better Search with
Apache Lucene and Solr
Grant Ingersoll
November 19, 2007
Triangle Java Users Group
-
Intro
  Background
  Search Concepts
Lucene
  Indexing
  Searching
  Analysis
  Miscellaneous
Solr: Lucene on 'roids
  Solr Setup
  Indexing/Searching
  Advanced Solr
Lucene v. Solr
-
Background
Lucene
  Created by Doug Cutting in 1999 as a SourceForge project
  Donated to the Apache Software Foundation (ASF) in 2001
Solr
  Created by Yonik Seeley while at CNET
  Based on Lucene
  Donated to ASF in 2006
Nutch, Hadoop, Tika
-
Users
Lucene
  IBM OmniFind Y! Edition
  Technorati
  Wikipedia
  Internet Archive
  LinkedIn
Solr
  Netflix
  CNET
  Smithsonian
  AOL: sports and music
  Many others
-
Search Concepts
User inputs one or more keywords along with some operators and expects to get back a ranked list of documents relevant to the keywords
User sorts through the documents, reading/using the most relevant
The user's relevant docs do not always equal the search engine's
-
Search Concepts
Several approaches developed over the years:
  Boolean Model
  Vector Space Model
  Probabilistic Model
  Language Modeling
  Latent Semantic Indexing
Vector Space Model is probably the most common and is generally fast
Search is not a solved problem, despite Google's success!
-
Vector Space Model
Goal: Identify documents that are similar to input query
Represent each word with a weight w
The words in the document and the query each define a Vector in an n-dimensional space
Common weighting scheme is called TF-IDF
  TF = Term Frequency
  IDF = Inverse Document Freq.
Intuition behind TF-IDF: A term that frequently occurs in a few documents relative to the collection is more important than one that occurs in a lot of documents
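$\mathrm{Sim}(q_1, d_1) = \cos\theta$, where $\theta$ is the angle between the vectors $q_1$ and $d_1$
$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$
$q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$
w = weight assigned to term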
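The weighting and cosine comparison above can be sketched in a few lines of Python. This is a minimal illustration, not Lucene's scoring formula: the function names are made up for the example, the IDF variant log(1 + N/df) is one common smoothing choice picked here as an assumption, and the query is simply weighted as if it were one more document in the collection.

```python
import math
from collections import Counter

def tf_idf_vectors(texts):
    """Return one {term: weight} dict per text, with weight = tf * idf."""
    tokenized = [text.lower().split() for text in texts]
    n_texts = len(tokenized)
    # Document frequency: how many texts contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed IDF variant: log(1 + N / df); rarer terms get larger weights.
        vectors.append({t: tf[t] * math.log(1 + n_texts / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

docs = ["little red riding hood", "robin hood", "little women"]
query = "robin hood"

# Simplification: weight the query together with the documents.
*doc_vecs, query_vec = tf_idf_vectors(docs + [query])
scores = [cosine(query_vec, d) for d in doc_vecs]
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best], scores)
```

The document sharing both query terms ranks first, the one sharing only "hood" gets a smaller positive score, and the one sharing no terms scores zero.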
-
Making Content Searchable
Search engines generally:
  Extract Tokens from Content
  Optionally transform said tokens depending on needs
    Stemming
    Expand with synonyms (usually done at query time)
    Remove token (stopword)
    Add metadata
  Store tokens and related metadata (position, etc.) in a data structure optimized for searching
    Called an Inverted Index
-
Inverted Index
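Documents:
  0: Little Red Riding Hood
  1: Robin Hood
  2: Little Women
Term dictionary with postings (IDs of the documents containing each term):
  aardvark: (not in these three documents)
  hood: 0, 1
  little: 0, 2
  red: 0
  riding: 0
  robin: 1
  women: 2
  zoo: (not in these three documents)
Graphic courtesy of Yonik Seeley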
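The structure in the graphic can be rebuilt with a short Python sketch. It is only an illustration of the idea: `build_inverted_index` and `search_all` are hypothetical helper names, tokenization is a plain lowercase whitespace split, and none of this reflects how Lucene stores its index on disk.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() so a term repeated within one document is posted only once.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)

def search_all(index, query):
    """IDs of documents that contain every term of the query (Boolean AND)."""
    postings = [set(index.get(term, ())) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["Little Red Riding Hood", "Robin Hood", "Little Women"]
index = build_inverted_index(docs)

print(index["hood"])                     # [0, 1]
print(search_all(index, "little hood"))  # [0]
```

Looking up a term is a dictionary access followed by a walk over a short list of document IDs, which is why the structure suits search better than scanning every document.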
-
On to Lucene
Provides modified Vector Space Model implementation of search
  Boolean + VSM
Written in Java, but has been ported to many languages:
  C/C++
  Ruby
  Python
  .NET
-
Lucene is
NOT a crawler
  See Nutch
NOT an application
  See PoweredBy on the Wiki
NOT a library for doing Google PageRank or other link analysis algorithms
  See Nutch and Hadoop
A library for enabling text based search
-
Vocab
A Lucene Index is a collection of Documents
A Document is a collection of Fields
A Field is content along with metadata describing the content
Field content can have several attributes
  Tokenized - Analyze the content, extracting Tokens and adding them to the inverted index
  Stored - Keep the content in a storage data structure for use by application
-
Getting Started
Download Lucene from http://lucene.apache.org/java
Indexing Side:
  Write code to add Documents to index
Search Side:
  Write code to transform user query into Lucene Query instances
  Submit Query to Lucene to Search
  Display Results
-
Basic Application
[Diagram] A Document goes through IndexWriter.addDocument() into the Lucene Index; a Query (powers:agility) goes through IndexSearcher.search() against the Lucene Index and comes back as Hits (Matching Docs)
Document:
  super_name: Spider-Man
  name: Peter Parker
  category: superhero
  powers: agility, spider-sense
Graphic courtesy of Yonik Seeley
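The flow in the diagram (add a Document, then run a fielded query and get hits back) can be mimicked with a toy in-memory index. The sketch below is written in Python and is not Lucene's API: `ToyIndex`, `add_document` and `search` are hypothetical names chosen to mirror the addDocument()/search() arrows, and every field is treated as both tokenized and stored.

```python
from collections import defaultdict

class ToyIndex:
    """Hypothetical stand-in for a Lucene index, for illustration only."""

    def __init__(self):
        self.stored = []                   # stored field values, one dict per doc
        self.postings = defaultdict(set)   # (field, token) -> set of doc IDs

    def add_document(self, fields):
        """Store the fields and post each token under its field name."""
        doc_id = len(self.stored)
        self.stored.append(dict(fields))
        for field, value in fields.items():
            for token in value.lower().replace(",", " ").split():
                self.postings[(field, token)].add(doc_id)
        return doc_id

    def search(self, field, term):
        """Return the stored documents whose field contains the term."""
        doc_ids = sorted(self.postings.get((field, term.lower()), ()))
        return [self.stored[i] for i in doc_ids]

index = ToyIndex()
index.add_document({
    "super_name": "Spider-Man",
    "name": "Peter Parker",
    "category": "superhero",
    "powers": "agility, spider-sense",
})

hits = index.search("powers", "agility")   # the powers:agility query
print([hit["super_name"] for hit in hits])
```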
-
Indexing
Process of preparing and adding text to Lucene
  Optimized for searching
Key Point: Lucene only indexes Strings
  What does this mean?
    Lucene doesn't care about XML, Word, PDF, etc.
    There are many good open source extractors available
    It's our job to convert whatever file format we have into something Lucene can use
-
How to Index
Look at Sample Code in QuickExampleTest.java
Available at http://www.lucenebootcamp.com under the Subversion repository
Other resources available there as well
-
Searching
Lucene Query Parser converts strings into Java objects that can be used for searching
  See http://lucene.apache.org/java/docs/queryparsersyntax.html
Query objects can also be constructed programmatically
Native support for many types of queries
  Keyword
  Phrase
  Wildcard
  Many more
-
Searching
Look again at QuickExampleTest.java for examples of how to search
-
Analysis
Analysis is the process of converting raw text into indexable Tokens
In Lucene, this is done by the Analyzer, Tokenizer and TokenFilter classes
The Tokenizer is responsible for chunking the input into Tokens
TokenFilters can further modify the Tokens produced by the Tokenizer, including:
  Removing them
  Stemming
  Other
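The Tokenizer-then-TokenFilters pipeline can be sketched as a chain of plain Python functions. This only illustrates the shape of the process; the function names, the tiny stopword list and the crude "strip a trailing s" stemmer are all assumptions made for the example and are far simpler than Lucene's real classes.

```python
def whitespace_tokenizer(text):
    """Tokenizer: chunk the raw text into tokens on whitespace."""
    return text.split()

def lowercase_filter(tokens):
    """Filter: normalize case so 'Hood' and 'hood' become the same term."""
    return [t.lower() for t in tokens]

def stop_filter(tokens, stopwords=frozenset({"a", "an", "the", "of"})):
    """Filter: remove tokens that are stopwords (expects lowercase input)."""
    return [t for t in tokens if t not in stopwords]

def plural_stem_filter(tokens):
    """Filter: very crude stemming, strips one trailing 's' from longer tokens."""
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]

def analyze(text):
    """Analyzer: one tokenizer followed by an ordered chain of filters."""
    tokens = whitespace_tokenizer(text)
    for token_filter in (lowercase_filter, stop_filter, plural_stem_filter):
        tokens = token_filter(tokens)
    return tokens

tokens = analyze("The Adventures of Robin Hood")
print(tokens)  # ['adventure', 'robin', 'hood']
```

The order of the filters matters: the stop filter here only works because the lowercase filter runs before it.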
-
Analysis
Lucene provides many Analyzers out of the box
  StandardAnalyzer
  WhitespaceAnalyzer
  Many others, including support for other languages
Easy to add your own
Analysis is done on both the content to be indexed and the query
  Same type of Analyzer should be used
-
Analysis and Search Relevancy
Document Indexing Analysis:
  Input: LexCorp BFG-9000
  WhitespaceTokenizer: LexCorp | BFG-9000
  WordDelimiterFilter catenateWords=1: Lex, Corp, LexCorp | BFG, 9000
  LowercaseFilter: lex, corp, lexcorp | bfg, 9000
Query Analysis:
  Input: Lex corp bfg9000
  WhitespaceTokenizer: Lex | corp | bfg9000
  WordDelimiterFilter catenateWords=0: Lex | corp | bfg, 9000
  LowercaseFilter: lex | corp | bfg, 9000
A Match!
Graphic courtesy of Yonik Seeley
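The two analysis chains in the graphic can be imitated in Python to see why the query matches. This is a rough approximation written for illustration: `word_delimiter` is a hypothetical function that splits on punctuation, lower-to-upper case changes and letter/digit boundaries with a regular expression, and its `catenate_words` flag simply joins all alphabetic parts of a token, which is simpler than Solr's actual WordDelimiterFilter.

```python
import re

# One sub-word: a capitalized or lowercase run, an all-caps run, or a digit run.
PART = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")

def whitespace_tokenize(text):
    return text.split()

def word_delimiter(tokens, catenate_words=False):
    """Split each token into sub-words; optionally also emit the joined words."""
    out = []
    for token in tokens:
        parts = PART.findall(token)
        out.extend(parts)
        words = [p for p in parts if p.isalpha()]
        # Simplification: join every alphabetic part of the token.
        if catenate_words and len(words) > 1:
            out.append("".join(words))
    return out

def lowercase(tokens):
    return [t.lower() for t in tokens]

# Index-time chain (catenateWords=1) and query-time chain (catenateWords=0).
index_terms = lowercase(word_delimiter(
    whitespace_tokenize("LexCorp BFG-9000"), catenate_words=True))
query_terms = lowercase(word_delimiter(
    whitespace_tokenize("Lex corp bfg9000"), catenate_words=False))

is_match = set(query_terms) <= set(index_terms)
print(index_terms, query_terms, is_match)
```

Because both sides end up as the same lowercase sub-words, every query term is present among the indexed terms even though the raw strings differ.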
-
Miscellaneous
Lucene comes with many contributed modules that are not part of the core, but solve common problems:
  Located in contrib directory in the distribution
Popular contribs:
  Highlighter
  Spell Checking
  Analyzers
  More Like This (finds similar pages)
Luke:
  http://www.getopt.org/luke/
-
Luke is your friend
-
Solr
"Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP APIs, caching, replication, and a web administration interface."
-
Why Solr?
Uses many Lucene best practices
Makes setup and configuration a snap
Easy to extend
Supports clients in:
  HTTP
  Java
  Ruby
  JSON
  PHP
  Python
-
Getting Started
http://lucene.apache.org/solr/tutorial.html
In the Solr directory:
  cd example
  java -jar start.jar
Indexing:
  cd exampledocs
  java -jar post.jar *.xml
Browse to http://localhost:8983/solr
-
Solr Setup
schema.xml
  Describe your data and how it is processed
  Adds semantics to Lucene Fields to handle types other than Strings (int, double, long, custom, etc.)
solrconfig.xml
  Describe how clients interact with your data
  Specifies:
    Data Location
    Performance Options
    Types of actions allowed
-
Indexing in Solr
Send XML add commands over HTTP
  Other options exist as well, but XML is most common
  Example document field values: canes, Carolina Hurricanes
Can also delete, update documents
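An XML add command of this kind can be assembled with Python's standard library, as in the sketch below. The `<add>`/`<doc>`/`<field name="...">` layout is Solr's XML update format; the field names `id` and `name` are assumptions, since only the two field values survive on the slide, and `build_add_command` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def build_add_command(doc_fields):
    """Build a Solr-style <add><doc>...</doc></add> message as a string."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")

# Field names are assumed; the values come from the slide.
xml_body = build_add_command({"id": "canes", "name": "Carolina Hurricanes"})
print(xml_body)
```

The resulting string would then be sent in the body of an HTTP POST to the server's update handler, followed by a commit so the document becomes searchable.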
-
Searching
Send HTTP GET or POST requests
  Query parameters specify options
  http://localhost:8983/solr/select/?q=iPod
Try http://localhost:8983/solr/admin/form.jsp
Solr supports:
  Faceting
  Highlighting
  More Like This
  Spell Checking
  Other
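Since options are passed as query parameters, a request URL can be built with Python's `urllib.parse`, as sketched here. The parameters `q`, `rows`, `facet`, `facet.field`, `hl` and `hl.fl` are standard Solr request parameters; the field names `cat` and `name` are assumptions borrowed from the example schema, and `build_select_url` is a hypothetical helper. The sketch only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

def build_select_url(base_url, params):
    """Append URL-encoded query parameters to a Solr select URL."""
    return base_url + "?" + urlencode(params)

url = build_select_url(
    "http://localhost:8983/solr/select/",
    {
        "q": "iPod",           # the user's query
        "rows": 5,             # number of results to return
        "facet": "true",       # turn on faceting ...
        "facet.field": "cat",  # ... over an assumed 'cat' field
        "hl": "true",          # turn on highlighting ...
        "hl.fl": "name",       # ... for an assumed 'name' field
    },
)
print(url)
```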
-
Faceting
-
Highlighting
From Digg.com
-
Advanced Solr
Replication
  Use case:
    Updates can wait (minutes instead of seconds)
    High query volume
    Load-balanced environment
Caching
  Solr provides intelligent caching of objects such as:
    Query Filters
    Documents
    Search Results
    User Defined
-
Advanced Solr
In Lucene, an IndexReader (used for searching by the IndexSearcher) represents a point-in-time view of the index
  Updates occurring after opening the IndexReader are not available for searching
  Applications must decide when to make changes available for search
Solr can open up new IndexSearcher instances and warm them before bringing them online
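The point-in-time behaviour and the warm-then-swap idea can be shown with a toy model in Python. Nothing here is Lucene or Solr code: `ToyIndex`, `ToySearcher`, `open_searcher` and `warm` are hypothetical names, the snapshot is just a copy of the document list, and "warming" means pre-running a few queries to fill a per-searcher cache.

```python
class ToySearcher:
    """Searches a frozen snapshot of the documents (point-in-time view)."""

    def __init__(self, docs):
        self._docs = docs   # immutable snapshot taken when the searcher opened
        self._cache = {}    # term -> list of matching doc IDs

    def search(self, term):
        if term not in self._cache:
            self._cache[term] = [
                i for i, text in enumerate(self._docs)
                if term in text.lower().split()
            ]
        return self._cache[term]

    def warm(self, terms):
        """Pre-run common queries so the cache is filled before going live."""
        for term in terms:
            self.search(term)

class ToyIndex:
    def __init__(self):
        self._docs = []

    def add(self, text):
        self._docs.append(text)

    def open_searcher(self):
        # Copy the documents so later adds are invisible to this searcher.
        return ToySearcher(tuple(self._docs))

index = ToyIndex()
index.add("Little Red Riding Hood")
live = index.open_searcher()

index.add("Robin Hood")                # added after the searcher was opened
stale_hits = live.search("hood")       # the old view does not see the new doc

new_searcher = index.open_searcher()   # open a fresh view ...
new_searcher.warm(["hood"])            # ... warm it ...
live = new_searcher                    # ... then bring it online
fresh_hits = live.search("hood")
print(stale_hits, fresh_hits)          # [0] [0, 1]
```

Queries keep being served by the old searcher while the new one warms up, so users never hit a cold cache.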
-
Lucene v. Solr
Lucene and Solr share many common features and terminologies
Both have friendly Apache Software Foundation license
Both have an extensive, knowledgeable community
Both are fast, scalable, thread-safe, etc.
Both have good documentation
Both are battle-tested
-
Lucene v. Solr
Lucene
  Embedded/lightweight
  No Container
  Want low-level control over all aspects of process
  Thick clients?
  Distributed?
  Need to use features not available in Solr
  JDK 1.4
Solr
  Server-side
  HTTP as lingua franca
  Want ease of setup and configuration
  Non-Java clients
  Replication/Caching OOTB
  JDK 1.5
-
Resources
http://lucene.apache.org
  /solr
  /java
Lucene In Action by Erik Hatcher and Otis Gospodnetić
Search Smarter with Apache Solr
  http://www.ibm.com/developerworks/java/library/j-solr1/
  http://www.ibm.com/developerworks/java/library/j-solr2/
-
Resources
Other presentations:
  http://people.apache.org/~gsingers
  http://people.apache.org/~yonik
  http://people.apache.org/~hossman
http://www.lucenebootcamp.com
My email:
  User: trainer
  Domain: lucenebootcamp.com