better search with apache lucene and solr


  • 8/14/2019 better search with apache lucene and solr

    1/37

    Better Search with Apache Lucene and Solr

    Grant Ingersoll

    November 19, 2007

    Triangle Java Users Group


    2/37

    Intro
        Background
        Search Concepts
    Lucene
        Indexing
        Searching
        Analysis
        Miscellaneous
    Solr: Lucene on 'roids
        Solr Setup
        Indexing/Searching
        Advanced Solr
    Lucene v. Solr


    3/37

    Background

    Lucene
        Created by Doug Cutting in 1999 as a SourceForge project
        Donated to the Apache Software Foundation (ASF) in 2001
    Solr
        Created by Yonik Seeley while at CNET
        Based on Lucene
        Donated to ASF in 2006
    Nutch, Hadoop, Tika


    4/37

    Users

    Lucene
        IBM OmniFind Y! Edition
        Technorati
        Wikipedia
        Internet Archive
        LinkedIn
    Solr
        Netflix
        CNET
        Smithsonian
        AOL: sports and music
        Many others


    5/37

    Search Concepts

    User inputs one or more keywords, along with some operators, and expects to get back a ranked list of documents relevant to the keywords
    User sorts through the documents, reading/using the most relevant
    The user's idea of relevant documents does not always match the search engine's


    6/37

    Search Concepts

    Several approaches developed over the years:
        Boolean Model
        Vector Space Model
        Probabilistic Model
        Language Modeling
        Latent Semantic Indexing
    Vector Space Model is probably the most common and is generally fast
    Search is not a solved problem, despite Google's success!


    7/37

    Vector Space Model

    Goal: identify documents that are similar to the input query
    Represent each word with a weight w
    The words in the document and the query each define a vector in an n-dimensional space
    Common weighting scheme is called TF-IDF
        TF = Term Frequency
        IDF = Inverse Document Frequency
    Intuition behind TF-IDF: a term that occurs frequently in a few documents, relative to the collection, is more important than one that occurs in a lot of documents
    Sim(q1, d1) = cos θ, where θ is the angle between the query vector q1 and the document vector d1
        d_j = <w_{1,j}, w_{2,j}, ..., w_{n,j}>
        q = <w_{1,q}, w_{2,q}, ..., w_{n,q}>
        w = weight assigned to term
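The TF-IDF weighting and cosine similarity described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Lucene's actual scoring formula: it assumes raw term counts for TF and log(N / df) for IDF, and the function names (`idf_table`, `weigh`, `cosine`, `rank`) are made up for the example.

```python
import math
from collections import Counter

def idf_table(docs):
    """Map each term to log(N / df), where df = number of docs containing it."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def weigh(tokens, idf):
    """Turn a token list into a sparse vector {term: tf * idf}."""
    tf = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Return (score, doc_id) pairs, best match first."""
    idf = idf_table(docs)
    q = weigh(query, idf)
    scores = [(cosine(q, weigh(doc, idf)), doc_id)
              for doc_id, doc in enumerate(docs)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))

docs = [
    ["little", "red", "riding", "hood"],
    ["robin", "hood"],
    ["little", "women"],
]
ranking = rank(["robin", "hood"], docs)
```

With these three toy documents, the query "robin hood" ranks document 1 first, document 0 second (it shares only the less distinctive term "hood"), and document 2 last with a score of zero.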


    8/37

    Making Content Searchable

    Search engines generally:
        Extract Tokens from content
        Optionally transform said tokens, depending on needs
            Stemming
            Expand with synonyms (usually done at query time)
            Remove token (stopword)
            Add metadata
        Store tokens and related metadata (position, etc.) in a data structure optimized for searching
            Called an Inverted Index


    9/37

    Inverted Index

    Documents:
        0: Little Red Riding Hood
        1: Robin Hood
        2: Little Women

    Term dictionary and postings (term -> document ids):
        aardvark
        hood     -> 0, 1
        little   -> 0, 2
        red      -> 0
        riding   -> 0
        robin    -> 1
        women    -> 2
        zoo

    Graphic courtesy of Yonik Seeley
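The structure in the graphic can be rebuilt with a short Python sketch. It assumes the simplest possible analysis (lowercase, split on whitespace) and stores only document ids in the postings; a real index such as Lucene's also keeps positions and frequencies. The helper names `build_inverted_index` and `search_all` are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of the documents containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() so a term repeated inside one document is posted only once
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)

def search_all(index, terms):
    """Ids of documents containing every term (a simple boolean AND)."""
    postings = [set(index.get(term.lower(), [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

titles = ["Little Red Riding Hood", "Robin Hood", "Little Women"]
index = build_inverted_index(titles)
```

Looking up a term is now a dictionary access rather than a scan over every title, and `search_all(index, ["little", "hood"])` intersects two postings lists to find the only title that contains both words.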


    10/37

    On to Lucene

    Provides a modified Vector Space Model implementation of search
        Boolean + VSM
    Written in Java, but has been ported to many languages:
        C/C++
        Ruby
        Python
        .NET


    11/37

    Lucene is

    NOT a crawler
        See Nutch
    NOT an application
        See PoweredBy on the Wiki
    NOT a library for doing Google PageRank or other link analysis algorithms
        See Nutch and Hadoop
    A library for enabling text-based search


    12/37

    Vocab

    A Lucene Index is a collection of Documents
    A Document is a collection of Fields
    A Field is content along with metadata describing the content
    Field content can have several attributes
        Tokenized - Analyze the content, extracting Tokens and adding them to the inverted index
        Stored - Keep the content in a storage data structure for use by the application


    13/37

    Getting Started

    Download Lucene from http://lucene.apache.org/java
    Indexing side:
        Write code to add Documents to the index
    Search side:
        Write code to transform the user query into Lucene Query instances
        Submit the Query to Lucene to search
        Display results


    14/37


    Basic Application

    [Figure: a Document (super_name: Spider-Man, name: Peter Parker, category: superhero, powers: agility, spider-sense) is passed to IndexWriter via addDocument() and written to the Lucene Index; a Query (powers:agility) is passed to IndexSearcher via search(), which returns Hits (matching docs).]

    Graphic courtesy of Yonik Seeley
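The flow in the figure can be mimicked with a toy, in-memory stand-in. This is not the Lucene API: `ToyIndex`, `add_document` and `search` are hypothetical names that only mirror the roles of IndexWriter.addDocument() and IndexSearcher.search(), every field is treated as both Tokenized and Stored, and the second document is invented purely for contrast.

```python
import re
from collections import defaultdict

class ToyIndex:
    """In-memory sketch of a fielded index (not Lucene's implementation)."""

    def __init__(self):
        self.stored = []                  # stored field values, by document id
        self.postings = defaultdict(set)  # (field, token) -> document ids

    def add_document(self, fields):
        """Store the fields and post each token under its field name."""
        doc_id = len(self.stored)
        self.stored.append(dict(fields))
        for name, value in fields.items():
            for token in re.findall(r"[a-z0-9-]+", value.lower()):
                self.postings[(name, token)].add(doc_id)
        return doc_id

    def search(self, field, term):
        """Return the stored fields of every document matching field:term."""
        ids = sorted(self.postings.get((field, term.lower()), ()))
        return [self.stored[i] for i in ids]

index = ToyIndex()
index.add_document({
    "super_name": "Spider-Man",
    "name": "Peter Parker",
    "category": "superhero",
    "powers": "agility, spider-sense",
})
# Invented second document, only here so that some queries miss.
index.add_document({
    "super_name": "Superman",
    "name": "Clark Kent",
    "category": "superhero",
    "powers": "flight, strength",
})
hits = index.search("powers", "agility")
```

The query `powers:agility` from the figure corresponds to `index.search("powers", "agility")`, which returns the stored fields of the Spider-Man document only.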


    15/37

    Indexing

    Process of preparing and adding text to Lucene
        Optimized for searching
    Key point: Lucene only indexes Strings
        What does this mean?
            Lucene doesn't care about XML, Word, PDF, etc.
            There are many good open source extractors available
            It's our job to convert whatever file format we have into something Lucene can use


    16/37

    How to Index

    Look at the sample code in QuickExampleTest.java
        Available at http://www.lucenebootcamp.com under the Subversion repository
        Other resources available there as well


    17/37

    Searching

    Lucene Query Parser converts strings into Java objects that can be used for searching
        See http://lucene.apache.org/java/docs/queryparsersyntax.html
    Query objects can also be constructed programmatically
    Native support for many types of queries
        Keyword
        Phrase
        Wildcard
        Many more


    18/37

    Searching

    Look again at QuickExampleTest.java for examples of how to search


    19/37

    Analysis

    Analysis is the process of converting raw text into indexable Tokens
    In Lucene, this is done by the Analyzer, Tokenizer and TokenFilter classes
    The Tokenizer is responsible for chunking the input into Tokens
    TokenFilters can further modify the Tokens produced by the Tokenizer, including:
        Removing them
        Stemming
        Other
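The Tokenizer-then-TokenFilters chain can be sketched as a pipeline of Python generators. These functions only imitate the roles of Lucene's classes; the stopword list and the crude plural-stripping "stemmer" are assumptions chosen to keep the example short, not what any Lucene Analyzer actually does.

```python
def whitespace_tokenizer(text):
    """Tokenizer role: chunk the raw input into tokens."""
    return text.split()

def lowercase_filter(tokens):
    """TokenFilter role: normalize case."""
    for token in tokens:
        yield token.lower()

def stop_filter(tokens, stopwords=frozenset({"a", "an", "the", "of"})):
    """TokenFilter role: remove tokens that carry little meaning."""
    for token in tokens:
        if token not in stopwords:
            yield token

def plural_stem_filter(tokens):
    """TokenFilter role: a deliberately crude stemmer that strips a final 's'."""
    for token in tokens:
        if len(token) > 3 and token.endswith("s"):
            yield token[:-1]
        else:
            yield token

def analyze(text):
    """Analyzer role: wire one tokenizer to a chain of filters."""
    tokens = whitespace_tokenizer(text)
    tokens = lowercase_filter(tokens)
    tokens = stop_filter(tokens)
    tokens = plural_stem_filter(tokens)
    return list(tokens)

indexed = analyze("The Adventures of the Little Women")
queried = analyze("adventure")
```

Because the same `analyze` function is applied to both the content and the query, the query token "adventure" lines up with the indexed token produced from "Adventures".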


    20/37

    Analysis

    Lucene provides many Analyzers out of the box
        StandardAnalyzer
        WhitespaceAnalyzer
        Many others, including support for other languages
    Easy to add your own
    Analysis is done on both the content to be indexed and the query
        Same type of Analyzer should be used


    21/37

    Analysis and Search Relevancy

    Document Indexing Analysis
        Input: LexCorp BFG-9000
        WhitespaceTokenizer -> LexCorp | BFG-9000
        WordDelimiterFilter catenateWords=1 -> Lex | Corp | LexCorp | BFG | 9000
        LowercaseFilter -> lex | corp | lexcorp | bfg | 9000

    Query Analysis
        Input: Lex corp bfg9000
        WhitespaceTokenizer -> Lex | corp | bfg9000
        WordDelimiterFilter catenateWords=0 -> Lex | corp | bfg | 9000
        LowercaseFilter -> lex | corp | bfg | 9000

    A Match!

    Graphic courtesy of Yonik Seeley
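The two pipelines in the graphic can be approximated in Python. The `word_delimiter` function below is a simplified imitation of WordDelimiterFilter, assumed to split only on case changes, letter/digit boundaries and punctuation; the real filter has more options and rules than this sketch covers.

```python
import re
from itertools import groupby

# A capitalized or lowercase word, an all-caps run, or a run of digits.
PARTS = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+")

def word_delimiter(tokens, catenate_words=False):
    """Split tokens into sub-words; optionally also emit joined word runs."""
    out = []
    for token in tokens:
        parts = PARTS.findall(token)
        out.extend(parts)
        if catenate_words:
            # Join each run of two or more adjacent alphabetic parts.
            for is_word, run in groupby(parts, key=str.isalpha):
                run = list(run)
                if is_word and len(run) > 1:
                    out.append("".join(run))
    return out

def index_analyze(text):
    """Indexing side: whitespace split, delimiter with catenation, lowercase."""
    return [t.lower() for t in word_delimiter(text.split(), catenate_words=True)]

def query_analyze(text):
    """Query side: same chain, but without catenation."""
    return [t.lower() for t in word_delimiter(text.split(), catenate_words=False)]

indexed = index_analyze("LexCorp BFG-9000")
query = query_analyze("Lex corp bfg9000")
is_match = set(query) <= set(indexed)
```

Catenating on the indexing side is what also lets a query typed as "lexcorp" find the document, since "lexcorp" was added to the index alongside "lex" and "corp".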


    22/37

    Miscellaneous

    Lucene comes with many contributed modules that are not part of the core, but solve common problems:
        Located in the contrib directory in the distribution
    Popular contribs:
        Highlighter
        Spell Checking
        Analyzers
        More Like This (finds similar pages)
    Luke:
        http://www.getopt.org/luke/


    23/37

    Luke is your friend


    24/37

    Solr

    "Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP APIs, caching, replication, and a web administration interface."


    25/37

    Why Solr?

    Uses many Lucene best practices

    Makes setup and configuration a snap

    Easy to extend

    Supports clients in:
        HTTP
        Java
        Ruby
        JSON
        PHP
        Python


    26/37

    Getting Started

    http://lucene.apache.org/solr/tutorial.html
    In the Solr directory:
        cd example
        java -jar start.jar
    Indexing:
        cd exampledocs
        java -jar post.jar *.xml
    Browse to http://localhost:8983/solr


    27/37

    Solr Setup

    schema.xml
        Describe your data and how it is processed
        Adds semantics to Lucene Fields to handle types other than Strings (int, double, long, custom, etc.)
    solrconfig.xml
        Describe how clients interact with your data
        Specifies:
            Data location
            Performance options
            Types of actions allowed


    28/37

    Indexing in Solr

    Send XML add commands over HTTP
        Other options exist as well, but XML is most common
        Example field values from an add command: canes, Carolina Hurricanes

    Can also delete, update documents
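As a rough sketch of what such commands look like, the Python below builds the XML payloads with the standard library. The `<add>/<doc>/<field>` and `<delete>/<id>` element shapes follow Solr's XML update format; the field names `id` and `name` are assumptions (the original markup did not survive extraction), and the helper names are hypothetical. The sketch only builds the strings and does not send anything to a server.

```python
import xml.etree.ElementTree as ET

def add_message(docs):
    """Build an <add> command; each doc is a {field name: value} dict."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def delete_message(doc_id):
    """Build a <delete> command that removes one document by its id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = str(doc_id)
    return ET.tostring(delete, encoding="unicode")

# Field names "id" and "name" are assumed for illustration.
add_xml = add_message([{"id": "canes", "name": "Carolina Hurricanes"}])
delete_xml = delete_message("canes")
```

In the example server these payloads would typically be POSTed to the update handler, followed by a commit so the changes become searchable.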


    29/37

    Searching

    Send HTTP GET or POST requests
        Query parameters specify options
        http://localhost:8983/solr/select/?q=iPod
        Try http://localhost:8983/solr/admin/form.jsp
    Solr supports:
        Faceting
        Highlighting
        More Like This
        Spell Checking
        Other
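Since a search is just an HTTP request, a client mostly needs to assemble the query string. The helper below is a hypothetical sketch using Python's standard library; `q`, `rows`, `facet` and `facet.field` are Solr request parameters, while the field name `cat` is an assumption about the schema in use. It only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode

def select_url(query, params=None,
               base="http://localhost:8983/solr/select/"):
    """Build a select URL; extra options go in as additional parameters."""
    all_params = {"q": query}
    all_params.update(params or {})
    return base + "?" + urlencode(all_params)

simple = select_url("iPod")
paged = select_url("ipod video", {"rows": 5})
# "cat" is an assumed field name; substitute a field from your own schema.
faceted = select_url("iPod", {"facet": "true", "facet.field": "cat"})
```

The resulting strings can be pasted into a browser or fetched with any HTTP client, which is what makes HTTP a convenient common denominator across client languages.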


    30/37

    Faceting


    31/37

    Highlighting


    From Digg.com


    32/37

    Advanced Solr

    Replication
        Use case:
            Updates can wait (minutes instead of seconds)
            High query volume
            Load-balanced environment
    Caching
        Solr provides intelligent caching of objects such as:
            Query Filters
            Documents
            Search Results
            User Defined


    33/37

    Advanced Solr

    In Lucene, an IndexReader (used for searching by the IndexSearcher) represents a point-in-time view of the index
        Updates occurring after opening the IndexReader are not available for searching
        Applications must decide when to make changes available for search
    Solr can open up new IndexSearcher instances and warm them before bringing them online
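The point-in-time behavior and the warm-then-swap idea can be illustrated with a toy model. `Snapshot` and `ToyServer` are hypothetical classes that only imitate the roles of an IndexReader/IndexSearcher pair and of Solr's searcher handling; the per-snapshot dictionary cache is an assumption standing in for Solr's real caches.

```python
class Snapshot:
    """Frozen view of the documents as they were when it was opened."""

    def __init__(self, docs):
        self._docs = tuple(docs)  # copy, so later additions are invisible
        self.cache = {}           # term -> results, filled on first use

    def search(self, term):
        term = term.lower()
        if term not in self.cache:
            self.cache[term] = [d for d in self._docs
                                if term in d.lower().split()]
        return self.cache[term]

class ToyServer:
    """Accepts updates at any time; exposes them only after commit()."""

    def __init__(self):
        self._docs = []
        self._searcher = Snapshot([])

    def add(self, doc):
        self._docs.append(doc)  # not visible to searches yet

    def commit(self, warm_queries=()):
        new_searcher = Snapshot(self._docs)
        for query in warm_queries:
            new_searcher.search(query)  # populate the cache before going live
        self._searcher = new_searcher   # swap the warmed searcher in

    def search(self, term):
        return self._searcher.search(term)

server = ToyServer()
server.add("Robin Hood")
before_commit = server.search("hood")
server.commit(warm_queries=["hood"])
after_commit = server.search("hood")
```

Searches issued before the commit keep seeing the old view, and the first user query after the swap is already answered from the warmed cache.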


    34/37

    Lucene v. Solr

    Lucene and Solr share many common features and terminologies
    Both have the friendly Apache Software Foundation license
    Both have an extensive, knowledgeable community
    Both are fast, scalable, thread-safe, etc.
    Both have good documentation
    Both are battle-tested


    35/37

    Lucene v. Solr

    Lucene
        Embedded/lightweight
        No container
        Want low-level control over all aspects of process
        Thick clients?
        Distributed?
        Need to use features not available in Solr
        JDK 1.4

    Solr
        Server-side
        HTTP as lingua franca
        Want ease of setup and configuration
        Non-Java clients
        Replication/Caching OOTB
        JDK 1.5


    36/37

    Resources

    http://lucene.apache.org
        /solr
        /java
    [email protected]
    [email protected]
    Lucene in Action by Erik Hatcher and Otis Gospodnetić
    Search Smarter with Apache Solr
        http://www.ibm.com/developerworks/java/library/j-solr1/
        http://www.ibm.com/developerworks/java/library/j-solr2/


    37/37

    Resources

    Other presentations:
        http://people.apache.org/~gsingers
        http://people.apache.org/~yonik
        http://people.apache.org/~hossman
        http://www.lucenebootcamp.com
    My email:
        User: trainer
        Domain: lucenebootcamp.com
