What’s New in Apache Lucene 2.9


What's New in Lucene 2.9

A Lucid Imagination Technical White Paper, October 2009

Abstract

Apache Lucene is a high-performance, cross-platform, full-featured information retrieval library in open source, suitable for nearly every application that requires full-text search features.

Since its introduction nearly 10 years ago, Apache Lucene has become a competitive player for developing extensible, high-performance full-text search solutions. The experience accumulated over time by the community of Lucene committers and contributors, and the innovations they have engineered, have delivered significant ongoing advances in Lucene's capabilities.

This white paper describes the new features and improvements in the latest version, Apache Lucene 2.9. It is intended mainly for programmers familiar with the broad base of Lucene's capabilities, though those new to Lucene should also find it a useful exploration of the newest features.

In the simplest terms, Lucene is now faster and more flexible than before. Historic weak points have been improved to open the way for innovative new features like near-real-time search, flexible indexing, and high-performance numeric range queries. Many new features have been added, new APIs introduced, and critical bugs fixed, all with the same goal: improving Lucene's state-of-the-art search capabilities.


Table of Contents

Introduction ... 1
Core Features and Improvements ... 3
    Numeric Capabilities and Numeric Range Queries ... 3
    New TokenStream API ... 7
    Per-Segment Search ... 11
    Near Realtime Search (NRS) ... 12
    MultiTermQuery-Related Improvements ... 13
    Payloads ... 14
Additions to Lucene Contrib ... 16
    New Contrib Analyzers ... 16
    Lucene Spatial (formerly known as LocalLucene) ... 16
    Lucene Remote and Java RMI ... 18
    New Flexible QueryParser ... 18
    Minor Changes and Improvements in Lucene 2.9 ... 19
Strategies for Upgrading to Lucene 2.9 ... 21
    Upgrade to 2.9: Recommended Actions ... 21
    Upgrade to 2.9: Optional Actions ... 22
References ... 23
Next Steps ... 24
APPENDIX: Choosing Lucene or Solr ... 25


    Introduction

Apache Lucene is a high-performance, cross-platform, full-featured information retrieval library, in open source, suitable for nearly every application that requires full-text search features. Lucene currently ranks among the top 15 open source projects and is one of the top 5 Apache projects, with installations at over 4,000 companies. Downloads of Lucene, and its server implementation Solr, have grown nearly tenfold over the past three years; Solr is the fastest-growing Lucene subproject. Lucene and Solr offer an attractive alternative to proprietary licensed search and discovery software vendors. [1] With the release of version 2.9 in September 2009, the Apache Lucene community delivered the latest upgrade of Lucene.

This white paper aims to address key issues for you if you have an Apache Lucene-based application and need to upgrade existing code to work well with this latest version, so that you may take advantage of the various improvements and prepare for the next major release. If you do not have a Lucene application, the paper should still give you a good overview of the innovations in this release.

Unlike the previous 2.4.1 release (March 2009), Lucene 2.9 is more than just a bug-fix release. It introduces multiple performance improvements, new features, better runtime behavior, API changes, and bug fixes at a variety of levels. The 2.9 release improves Lucene in several key aspects that make it an even more compelling alternative to other solutions. Most notably:

    • Improvements for near-real-time search capabilities make documents searchable almost instantaneously.

    • A new, straightforward API for handling numeric ranges both simplifies development and virtually wipes out performance overhead.

    • The analysis API has been replaced for more streamlined, flexible text handling.

[1] See the Appendix for a discussion of when to choose Lucene or Solr.


And, behind the scenes, the groundwork has been laid for yet more indexing flexibility in future releases.

Lucene Contrib also adds new utility packages, introduced with this release:

    • An extremely flexible query parser framework opens new possibilities for programmers to more easily create their own query parsing syntax.

    • Local-Lucene and its geo-search capabilities, now donated to Apache, provide this near-mandatory functionality for state-of-the-art search.

    • Various contributions have markedly improved support for languages like Arabic, Persian, and Chinese.

Some important notes on compatibility: because previous minor releases also contained performance improvements and bug fixes, programmers have become accustomed to upgrading to a new Lucene version just by replacing the JAR file in their classpath. And, in those past cases, Lucene-based apps could be upgraded flawlessly without recompiling the software components accessing or extending Apache Lucene. However, this may not be so with Lucene 2.9.

Lucene 2.9 introduces several back-compatibility-breaking changes that may well require changes in your code that uses the library. A drop-in library replacement is not guaranteed to be successful; at a minimum, it is not likely to be flawless. As a result, we recommend that if you are upgrading from a previous Lucene release, you should at least recompile any software components directly accessing or extending the library. In the latter case, recompilation alone will most likely not be sufficient. More details on these dependencies are discussed in the Upgrading Lucene section of the paper. We've also noted any significant compatibility issues with this label: [BACK-COMPATIBILITY].

Finally, it is important to note that Lucene 2.9 will be the last release supporting the Java 1.4 platform. While the majority of programmers are already running on either version 1.5 or 1.6 platforms (1.6 is our recommended JVM), Java 1.4 reached its end of service life in October 2008.

This document is not intended to be a comprehensive overview of Lucene 2.9 in all its functions, but rather of the key new features and capabilities. Always check the Lucid Imagination Certified distribution and the official Lucene website (http://lucene.apache.org) for the most up-to-date release information.


    Core Features and Improvements

Numeric Capabilities and Numeric Range Queries

One of Apache Lucene's basic properties is its representation of internal searchable values (terms) as UTF-8 encoded characters. Every value passed to Lucene must be converted into a string in order to be searchable. At the same time, Lucene is frequently applied to search numeric values and ranges, such as prices, dates, or other numeric field attributes. Historically, searching over numeric ranges has been a weak point of the library. However, the 2.9 release comes with a tremendous improvement for searching numeric values, especially for range queries.

Prior to Lucene 2.9, numeric values were encoded with leading zeros, essentially as full-precision values. Values stored with full precision ended up creating many unique terms in the index. Thus, if you needed to retrieve all documents in a certain range (e.g., from $1.50 to $1500.00), Lucene had to iterate through a lot of terms whenever many documents with unique values were indexed. Consequently, execution of queries with large ranges and lots of unique terms could be extremely slow as a result of this overhead.

Many workaround techniques have evolved over the years to improve the performance of range queries, such as encoding dates in multiple fields with separate fields for year, month, and day. But at the end of the day, every programmer had to roll his or her own way of searching ranges efficiently.

In Lucene 2.9, NumericUtils and its relatives (NumericRangeQuery/NumericRangeFilter) introduce native numeric encoding and search capabilities. Numeric Java primitives (long, int, float, and double) are transformed into prefix-encoded representations with increasing precision. Internally, each prefix precision is generated by stripping off the number of least-significant bits indicated by the precisionStep. Each value is subsequently converted to a sequence of 7-bit ASCII characters (due to the UTF-8 term encoding in the index, 8 or more bits would split into two or more bytes), resulting in a predictable number of prefix terms that can be calculated ahead of time. The figure below illustrates such a Prefix Tree.


Example of a Prefix Tree, where the leaves of the tree hold the actual term values and all the descendants of a node share the common prefix associated with that node. Bold circles mark all the nodes relevant to retrieving the range from 215 to 977.

The generated terms are indexed just like any other string values passed to Lucene. Under the hood, Lucene associates each distinct term with all documents containing it, so all documents containing a numeric value with the same prefix are grouped together, and the number of terms that need to be examined for a range is reduced tremendously. This stands in contrast to the less efficient encoding scheme in previous releases, where each unique numeric value was indexed as a distinct full-precision term, so range query cost grew with the number of unique terms in the index.
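The prefix generation described above can be sketched in plain Java. This is an illustration only, not the actual NumericUtils encoding (which packs the shift and the remaining value bits into compact ASCII terms); the helper name and the "shift:value" term format here are assumptions made for readability:

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of the trie/prefix encoding: each precision level is
// obtained by stripping `precisionStep` low-order bits off the value; the
// shift is made part of the term so that terms of different precision
// never collide in the index.
public class PrefixTermsSketch {

    static List<String> prefixTerms(int value, int precisionStep) {
        List<String> terms = new ArrayList<String>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            terms.add(shift + ":" + (value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        // a 32-bit int with precisionStep 8 produces exactly 4 prefix terms
        System.out.println(prefixTerms(305419896, 8));
        // -> [0:305419896, 8:1193046, 16:4660, 24:18]
    }
}
```

Note that a range query built with a different precisionStep would look for prefix terms at shift values that were never indexed, which is why index-time and query-time precision must match.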


    Directory directory = new RAMDirectory();
    Analyzer analyzer = new WhitespaceAnalyzer();
    IndexWriter writer = new IndexWriter(directory, analyzer,
        IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < 20000; i++) {
      Document doc = new Document();
      doc.add(new Field("id", String.valueOf(i), Field.Store.YES,
          Field.Index.NOT_ANALYZED_NO_NORMS));
      String num = Integer.toString(i);
      String paddedValue = "00000".substring(0, 5 - num.length()) + num;
      doc.add(new Field("oldNumeric", paddedValue, Field.Store.YES,
          Field.Index.NOT_ANALYZED_NO_NORMS));
      writer.addDocument(doc);
    }
    writer.close();

    Indexing a zero-padded numeric value for use with an ordinary RangeQuery.

You can also use the native encoding of numeric values beyond range searches. Numeric fields can be loaded into the internal FieldCache, where they are used for sorting. Zero-padding of numeric primitives (see the code example above) is no longer needed, as the trie encoding guarantees the correct ordering without requiring execution overhead or extra coding.
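A short stand-alone sketch (plain Java, not a Lucene API) shows why the zero-padding in the earlier listing was necessary in the first place: index terms are ordered like strings, so unpadded numbers sort in the wrong numeric order.

```java
import java.util.Arrays;

// Terms in a Lucene index are ordered lexicographically, like sorted strings.
// Unpadded decimal numbers therefore sort incorrectly; fixed-width
// zero-padding restores numeric order, at the cost of one unique term per
// distinct value.
public class PaddingSketch {

    static String pad(int value) {
        // same 5-digit padding as the indexing example above
        return String.format("%05d", value);
    }

    public static void main(String[] args) {
        String[] unpadded = { "9", "10", "100" };
        Arrays.sort(unpadded);
        System.out.println(Arrays.toString(unpadded)); // -> [10, 100, 9]

        String[] padded = { pad(9), pad(10), pad(100) };
        Arrays.sort(padded);
        System.out.println(Arrays.toString(padded));   // -> [00009, 00010, 00100]
    }
}
```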

The code listing below instead uses the new NumericField to index a numeric Java primitive using 4-bit precision. Like the straightforward NumericField, querying numeric ranges also provides a type-safe API: NumericRangeQuery instances are created using one of the provided static constructors for the corresponding Java primitive.


    Directory directory = new RAMDirectory();
    Analyzer analyzer = new WhitespaceAnalyzer();
    IndexWriter writer = new IndexWriter(directory, analyzer,
        IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 0; i < 20000; i++) {
      Document doc = new Document();
      doc.add(new Field("id", String.valueOf(i), Field.Store.YES,
          Field.Index.NOT_ANALYZED_NO_NORMS));
      doc.add(new NumericField("newNumeric", 4,
          Field.Store.YES, true).setIntValue(i));
      writer.addDocument(doc);
    }
    writer.close();

    Indexing numeric values with the new NumericField type

The example below shows a numeric range query using an int primitive with the same precision used in the indexing example. If different precision values are used at index and search time, numeric queries can yield unexpected behavior.


    IndexSearcher searcher = new IndexSearcher(directory, true);
    Query query = NumericRangeQuery.newIntRange("newNumeric", 4, 10,
        10000, true, false);
    TopDocs docs = searcher.search(query, null, 10);
    assertNotNull("Docs is null", docs);
    assertEquals(9990, docs.totalHits);
    for (int i = 0; i < docs.scoreDocs.length; i++) {
      ScoreDoc sd = docs.scoreDocs[i];
      assertTrue(sd.doc >= 10 && sd.doc < 10000);
    }

    Searching numeric values with the new NumericRangeQuery

Improvements resulting from the new Lucene numeric capabilities are equally significant in versatility and performance. Lucene can now cover almost every use-case related to numeric values. Moreover, everything from range searches or sorting on float or double values up to fast date searches (dates converted to timestamps) will execute in less than 100 milliseconds in most cases. By comparison, the old approach using padded full-precision values could take up to 30 seconds or more, depending on the underlying index.

New TokenStream API

Almost every programmer who has extended Lucene has worked with its analysis function. Text analysis is common to almost every use-case, and the analysis API is among the best-known Lucene APIs.

Since its early days, Lucene has used a Decorator pattern to provide a pluggable and flexible analysis API, allowing a combination of existing and customized analysis implementations. The central analysis class TokenStream enumerates a sequence of tokens from either a document's fields or a query. Commonly, multiple TokenStream instances are chained, each applying a separate analysis step to text terms represented by a Token class that encodes all relevant information about a term.

Prior to Lucene 2.9, TokenStream operated exclusively on Token instances transporting term information through the analysis chain. With this release, the token-based API has been marked as deprecated. It is completely replaced by an attribute-based API.


Here's how it has changed. Rather than receiving a Token instance from one of the two TokenStream.next() methods, the new API follows a stateful approach instead. To advance in the stream, consumers call TokenStream.incrementToken(), which returns a boolean result indicating whether the end of the stream has been reached. Information gathered during the analysis process is encoded in attributes accessible via the new TokenStream base class AttributeSource. In contrast to the older Token class, the attribute-based approach separates specific term characteristics from others not necessarily related. Each TokenStream adds the attributes it is specifically targeting at construction time (see the code listing below) and keeps a reference to them throughout its lifetime. This provides type-safe access to all attributes relevant for a particular TokenStream instance.

    protected CharReplacementTokenStream(TokenStream input) {
      super(input);
      termAtt = (TermAttribute) addAttribute(TermAttribute.class);
    }

    Adding a TermAttribute at construction time

Inside TokenStream.incrementToken(), a token stream only operates on attributes that have been declared in the constructor. For instance, if you have Lucene replacing a character like a German umlaut in a term, only the TermAttribute (declared at construction time in the code listing above) is used. Other attributes like PositionIncrementAttribute or PayloadAttribute are ignored by this TokenStream, as they are not needed in this particular use-case.


    public boolean incrementToken() throws IOException {
      if (input.incrementToken()) {
        final char[] termBuffer = termAtt.termBuffer();
        final int termLength = termAtt.termLength();
        if (replaceChar(termBuffer, termLength)) {
          // output and outputPos are fields filled by replaceChar(...)
          termAtt.setTermBuffer(output, 0, outputPos);
        }
        return true;
      }
      return false;
    }

Replacing characters using the new attribute-based API.

What the above example does not demonstrate is the full power of the new token API. There, we replaced one or more characters in the token and discarded the original. Yet in many use-cases, the original token should be preserved in addition to the modified one. Handling such a common use-case with the old API required a fair bit of work and logic.

In contrast, the new attribute-based approach allows capturing and restoring the state of attributes, which makes such use-cases almost trivial. The example below shows a version of the previous example improved for Lucene 2.9, in which the original term attribute is restored once the stream is advanced.


    public boolean incrementToken() throws IOException {
      if (state != null) {
        restoreState(state);
        state = null;
        return true;
      }
      if (input.incrementToken()) {
        final char[] termBuffer = termAtt.termBuffer();
        final int termLength = termAtt.termLength();
        if (replaceChar(termBuffer, termLength)) {
          state = captureState();
          termAtt.setTermBuffer(output, 0, outputPos);
        }
        return true;
      }
      return false;
    }

Replacing characters and additionally emitting the original term text using the new attribute-based API (position increments are omitted).

The separation of attributes makes it possible to add arbitrary properties to the analysis chain without using a customized Token class. Attributes are then made type-safely accessible to all subsequent TokenStream instances, and can eventually be used by the consumer. This way, you get a generic way to add various kinds of custom information, such as part-of-speech tags, payloads, or average document length, to the token stream. Unfortunately, Lucene 2.9 doesn't yet provide functionality to persist custom Attribute implementations to the underlying index. This improvement, part of what is often referred to as "flexible indexing," is under active development and is proposed for one of the upcoming Lucene releases.

Beyond the generalizability of this API, one of its most significant improvements is its effective reuse of Attribute instances across multiple iterations of analysis. Attribute implementations are created during TokenStream instantiation and are reused each time the stream advances to a successive increment. Even if a stream is used for another analysis, the same Attribute instances may be used, provided the stream is reusable. This greatly reduces the rate of object creation, streamlining execution and minimizing any required garbage collection.

While the new API provides full back-compatibility, it is strongly recommended to update any existing custom TokenStream implementations to exclusively use incrementToken() instead of one of the overhead-heavy next() methods.

If you are updating your custom TokenStream or one of its subclasses (TokenFilter and Tokenizer), it is recommended that you use the abstract BaseTokenStreamTestCase class, which provides various utility functions for testing against the new and old APIs. The test case is freely available in the source distribution of Apache Lucene 2.9.

Per-Segment Search

Since the early days of Apache Lucene, documents have been stored at the lowest level in a segment: a small but entirely independent index. At the highest abstraction level, Lucene combines segments into one large index and executes searches across all visible segments. As more and more documents are added to an index, Lucene buffers your documents in RAM and flushes them to disk periodically. Depending on a variety of factors, Lucene either incrementally adds documents to an existing segment or creates entirely new segments. To reduce the negative impact of an increasing number of segments on search performance, Lucene tries to combine/merge multiple segments into larger ones. For optimal search performance, Lucene can optimize an index, which essentially merges all existing segments into a single segment.

Prior to Lucene 2.9, search logic resided at the highest abstraction level, accessing a single IndexReader no matter how many segments the index was composed of. Similarly, the FieldCache was associated with the top-level IndexReader and had to be invalidated each time an index was reopened. With Lucene 2.9, the search logic and the FieldCache have moved to a per-segment level. While this has introduced a little more internal complexity, the benefit of the tradeoff is a new per-segment index behavior that yields a rich variety of performance improvements for unoptimized indexes.


In most applications, existing segments rarely change internally, and this property had not been effectively utilized in previous versions of Lucene. IndexReader.reopen(), first added in Lucene 2.4, now has the ability to add new or changed segments to an already existing top-level IndexReader instead of reloading all existing segments. The FieldCache also takes advantage of rarely changing segments: cache entries for unchanged segments can remain in memory, and only the entries for new or changed segments need to be rebuilt, instead of invalidating the FieldCache entirely. Depending on the number of changed index segments, this can heavily reduce I/O as well as garbage-collection costs compared to reopening the entire index.
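The per-segment caching idea can be illustrated with a stand-alone sketch. The class and method names below are illustrative assumptions, not Lucene's actual FieldCache internals; the point is only that entries are keyed by segment, so a reopen pays the load cost solely for segments it has not seen before.

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch of per-segment caching: each segment gets its own cache
// entry. After an index reopen, unchanged segments hit the existing entries,
// and only new or changed segments trigger a (potentially expensive) load.
public class PerSegmentCacheSketch {

    private final Map<String, int[]> cacheBySegment = new HashMap<String, int[]>();

    int[] getOrLoad(String segmentName, int docCount) {
        int[] values = cacheBySegment.get(segmentName);
        if (values == null) {
            values = new int[docCount]; // stand-in for loading field values from disk
            cacheBySegment.put(segmentName, values);
        }
        return values;
    }

    public static void main(String[] args) {
        PerSegmentCacheSketch cache = new PerSegmentCacheSketch();
        cache.getOrLoad("_0", 1000); // initial open: segment _0 is loaded
        cache.getOrLoad("_0", 1000); // reopen: _0 unchanged, entry is reused
        cache.getOrLoad("_1", 50);   // reopen: only the new segment _1 is loaded
        System.out.println(cache.cacheBySegment.size()); // -> 2
    }
}
```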

Previous versions of Lucene also suffered from long warming times for sorting and function queries. Those use-cases have been improved, as the warm-up of reopened searchers is now much faster.

It's worth mentioning that per-segment search doesn't yield improvements in all situations. If an IndexReader is opened on an optimized index, all pre-existing segments have been merged into a single one, which must then be loaded in its entirety. In other, perhaps more common situations, where some changes have been committed to the index and a new IndexReader instance is obtained by calling IndexReader.reopen() on a previously opened reader, the new per-segment capabilities can dramatically speed up reopening. In contrast, opening a new IndexReader using one of the overloaded static IndexReader.open() methods will create an entirely new reader instance and therefore can't take advantage of any per-segment capabilities.

Near Realtime Search (NRS)

More and more, Lucene programmers are pursuing real-time or near-real-time requirements with their search applications. Previous Lucene versions did a decent job with the incremental changes characteristic of this scenario, capturing those changes and making them available for searching. Lucene 2.9 adds significant new capabilities for addressing the requirements of high-change document environments.

First of all, the IndexWriter, which is generally responsible for modifying the underlying index and flushing documents to disk, now offers a way to obtain an IndexReader instance directly from the writer. The newly obtained reader not only reflects the documents already flushed to disk, but also makes all uncommitted documents still residing in memory almost instantly searchable.


The reader instance returned by IndexWriter.getReader() supports reopening as long as the writer that produced it has not been closed. Once the writer is closed, reopening the reader will result in an AlreadyClosedException.

It is important to understand why this feature is referred to as near real-time rather than real-time. When IndexWriter.getReader() is called for the very first time, Lucene needs to consume a reasonable amount of additional resources (i.e., RAM, CPU cycles, and file descriptors) to make uncommitted documents searchable. Due to this additional work, uncommitted documents will not always be available instantaneously. Nonetheless, in most cases, the performance gained with this feature will be better than just reopening the index, or the traditional, simpler approach of opening a brand-new reader instance.

To keep latency as low as possible, the IndexWriter offers optional pre-warmup functionality, by which newly merged segments can be prepared for real-time search. If you are new to this feature, you should be aware that the pre-warmup API is still marked experimental and might change in future releases.

MultiTermQuery-Related Improvements

In Lucene 2.4, many standard queries, such as FuzzyQuery, WildcardQuery, and PrefixQuery, were refactored and subclassed under MultiTermQuery. Lucene 2.9 adds some improvements under the hood, resulting in much better performance for those queries. [BACK-COMPATIBILITY] [2]

In Lucene 2.9, multi-term queries now use a constant score internally, based on the assumption that most programmers don't care about the interim score of the queries resulting from the term expansion that takes place during query rewriting.

[2] This could be a back-compatibility issue if one of those classes has been subclassed.


Although constant scoring is now the default behavior, the older scoring mode is still available for multi-term queries in 2.9. Overall, you can choose one of the following scoring modes:

    • Filtered constant score: rewrites the multi-term query into a ConstantScoreQuery in combination with a filter to match all relevant documents.

    • BooleanQuery constant score: rewrites the multi-term query into a ConstantScoreQuery based on a BooleanQuery by translating each term into an optional Boolean clause. This mode still has a maxClauseCount limitation and might raise an exception if the query has too many Boolean clauses.

    • Conventional scoring (not recommended): rewrites the multi-term query into an ordinary BooleanQuery.

    • Automatic constant score (default): tries to choose the best constant-score mode (Filter or BooleanQuery) based on the term and document counts of the query. If the number of terms and documents is small enough, BooleanQuery is chosen; otherwise the query rewrites to a filter-backed ConstantScoreQuery.

You can change the scoring mode by passing an implementation of RewriteMethod to MultiTermQuery.setRewriteMethod(), as shown in the code example below.

    PrefixQuery prefixQuery = new PrefixQuery(new Term("aField", "luc"));
    prefixQuery.setRewriteMethod(
        MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);

Explicitly setting a filtered constant-score RewriteMethod on a PrefixQuery

Payloads

The payloads feature, though originally added in a previous version of Lucene, remains fairly new to most programmers. A payload is essentially a byte array that is associated with a particular term in the index. Payloads can be attached to a single term during text analysis and are subsequently committed directly to the index. On the search side, these byte arrays are accessible to influence the scoring for a particular term, or even to filter out entire documents.


For instance, if your Lucene application is analyzing the phrase "Gangs of New York," payloads can encode information about the terms "New" and "York" together, so that they are treated as a paired term for the name of a city, or can specify that "Gangs" is a noun rather than a verb. Prior to 2.9, payloads were exposed via a query called BoostingTermQuery, which has now been renamed to PayloadTermQuery. By using this query type, you can query Lucene to find all occurrences where "New" is part of a city name like "New York" or "New Orleans."

In comparison with previous versions, Lucene 2.9 also provides more control and flexibility for payload scoring. You can pass a custom PayloadFunction to the constructor of a payload-aware query. Each payload is fed to the custom function, which calculates the score based on the cumulative outcomes of payload occurrences.

This improvement becomes even more useful when payloads are used in combination with span queries. Spans represent a range of term positions in a document, and payloads can in turn contribute to scoring based on the distance between terms. For instance, using a PayloadNearQuery, documents can be scored differently depending on whether terms appear in the same sentence or paragraph, provided that information is encoded in the payload.

At a higher abstraction level, another payload-aware TokenFilter has been added. DelimitedPayloadTokenFilter splits tokens on a predefined character delimiter, where the part before the delimiter is the token itself and the part after the delimiter represents the payload. For example, it can parse an e-mail address such as carol.smith@apache.org by making carol.smith the token and creating a payload to represent the domain name, apache.org. A customizable payload encoder takes care of encoding the values, while everything else happens inside the filter. Besides being a convenient way to add payloads to existing search functionality, this class also serves as a working example of how to use payloads during the analysis process.3

    3 See http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/ for more information.
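The splitting the filter performs can be illustrated with a small self-contained sketch. This is not the Lucene class itself; the class, delimiter choice, and field names below are all illustrative:

```java
import java.nio.charset.StandardCharsets;

// Simplified sketch of what a delimiter-based payload filter does (not
// the actual DelimitedPayloadTokenFilter): given raw token text such as
// "carol.smith|apache.org", everything before the delimiter is kept as
// the token and everything after it becomes the payload bytes.
public class DelimitedPayloadSplitter {

    public static final char DELIMITER = '|';

    public static class TokenAndPayload {
        public final String token;
        public final byte[] payload;
        TokenAndPayload(String token, byte[] payload) {
            this.token = token;
            this.payload = payload;
        }
    }

    // Split "token|payload" into the token text and its encoded payload.
    public static TokenAndPayload split(String raw) {
        int i = raw.indexOf(DELIMITER);
        if (i < 0) {
            return new TokenAndPayload(raw, new byte[0]); // no payload present
        }
        String token = raw.substring(0, i);
        byte[] payload = raw.substring(i + 1).getBytes(StandardCharsets.UTF_8);
        return new TokenAndPayload(token, payload);
    }

    public static void main(String[] args) {
        TokenAndPayload t = split("carol.smith|apache.org");
        System.out.println(t.token + " -> "
                + new String(t.payload, StandardCharsets.UTF_8));
        // prints: carol.smith -> apache.org
    }
}
```

In the real filter, a pluggable encoder decides how the payload half is turned into bytes (identity, float, integer, and so on).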


Additions to Lucene Contrib

So far, we've reviewed key new features and improvements introduced in the Apache Lucene core API. This section outlines the major additions and improvements to Lucene Contrib packages. Contrib packages are parts of Lucene which do not necessarily belong to the API core but are often helpful in building Lucene applications.

New Contrib Analyzers

The Analysis package in Lucene Contrib has always been a valuable resource for almost every Lucene programmer. The latest release brings several noteworthy improvements, especially in terms of language support.

Better support for Chinese: Chinese, like many Asian languages, does not use whitespace to delimit one word from another. Smart-CN provides an analyzer with improved tokenization that groups characters into words rather than splitting text into individual characters. While Smart-CN is part of the analyzers contrib module, it is distributed in its own JAR file because of the large (6MB) file resources it depends on.

Light10-based Arabic analysis: a new Analyzer based on a high-performance stemming algorithm (Light10) that applies lightweight prefix and suffix removal to Arabic text.

Persian Analyzer: applies character normalization and Persian stopword removal to Persian-only or mixed-language text.

Reverse String filter, as in leading wildcards: to support a search feature like leading wildcards efficiently, a common approach is to index terms in reverse order. A leading wildcard effectively becomes a trailing wildcard if searched against a field with reversed tokens.
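The reversed-token trick behind that last item can be shown in a few lines of plain Java. This is an illustration of the technique, not the contrib filter itself, and the method names are hypothetical:

```java
// Sketch of the reversed-token trick for leading wildcards (illustrative,
// not the contrib ReverseStringFilter): index each term reversed, then
// rewrite a leading-wildcard pattern like "*tion" into the trailing-
// wildcard prefix "noit" and run it against the reversed field.
public class ReverseWildcardTrick {

    public static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Turn a leading-wildcard pattern like "*tion" into the prefix
    // to search for in the reversed field ("noit").
    public static String leadingWildcardToReversedPrefix(String pattern) {
        if (!pattern.startsWith("*")) {
            throw new IllegalArgumentException("expected a leading wildcard");
        }
        return reverse(pattern.substring(1));
    }

    public static void main(String[] args) {
        String indexedTerm = reverse("imagination"); // stored as "noitanigami"
        String prefix = leadingWildcardToReversedPrefix("*tion"); // "noit"
        System.out.println(indexedTerm.startsWith(prefix)); // prints true
    }
}
```

A cheap prefix match on the reversed field thus replaces an expensive full-term scan for the leading wildcard.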

Lucene Spatial (formerly known as LocalLucene)

Geospatial search has become a very common use case, especially with the advent of mobile devices. Almost every new mobile platform supports a nearby search feature. End users seeking data on something near their current location (restaurants, movie theatres,


Lucene Remote and Java RMI

The historic dependency of the Lucene core on Java RMI has now been removed: Lucene Remote is now partitioned into an optional contrib package. While the package itself doesn't add any functionality to Lucene, the change introduces a critical back-compatibility issue likely to be relevant for many programmers. In prior versions, the core interface Searchable extended java.rmi.Remote to enable searches on remote indexes. If you had taken advantage of this convenience, you will now have to add the new Lucene-remote JAR file to the classpath and change your code to use the new remote base interface RMIRemoteSearchable, as shown below.

final RMIRemoteSearchable remoteObject = ...;
final String remoteObjectName = ...;
Naming.rebind(remoteObjectName, remoteObject);
Searchable searchable = (Searchable) Naming.lookup(remoteObjectName);

Using RMIRemoteSearchable with Lucene 2.9

New Flexible QueryParser

Lucene's built-in query parser has been a burden on developers trying to extend the default query syntax. While changing certain parts of it, such as query instantiation, could be readily achieved by subclassing the parser, changing the actual syntax required deep knowledge of the JavaCC parser generator.

The new contrib package QueryParser provides a complete query parser framework, which is fully compliant with the core parser but enables flexible customization through a modular architecture.

The basic idea of the new query parser is to separate the syntax from the semantics of a query, which is internally represented as a tree. The parser is ultimately split into three stages:

1. Parsing stage: transforms the query text (syntax) into a QueryNode tree. This stage is exposed through a single interface (SyntaxParser), which must be implemented to customize parsing.


2. Query-node processing stage: once the QueryNode tree is created, a chain of processors starts working on the tree. While walking down the tree, a processor can apply query optimizations, child reordering, or term tokenization even before the query is actually executed.

3. Building stage: the final stage builds the actual Lucene Query object by mapping QueryNode types to associated builders. Each builder subsequently applies the actual conversion into a Lucene query.

The snippet below, taken from the new standard QueryParser implementation, shows how the stages are exposed at the API's top level.

QueryNode queryTree = this.syntaxParser.parse(query, getField());
queryTree = this.processorPipeline.process(queryTree);
return (Query) this.builder.build(queryTree);

To provide a smooth transition from the existing core parser to the new API, this contrib package also contains an implementation fully compliant with the standard query syntax. This not only eases the switch to the new query parser but also serves as an example of how to use and extend the API. That said, the standard implementation is based on the new query parser API and therefore can't simply replace the core parser as-is. If your code builds on Lucene's current query parser, you can use QueryParserWrapper instead, which preserves the old query parser interface but calls the new parser framework. One final caveat: QueryParserWrapper is marked as deprecated, as the new query parser will be moved to the core in the upcoming release and will eventually replace the old API.
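The value of the three-stage split is easiest to see in a toy, fully self-contained pipeline. Everything below is hypothetical and greatly simplified (it is not the contrib API), but it shows how syntax, tree processing, and query building stay independent of each other:

```java
import java.util.Locale;

// Toy sketch of the three-stage parser design (illustrative names, not
// the contrib API): stage 1 turns text into a node, stage 2 rewrites the
// node, and stage 3 maps the processed node to the final "query".
public class ThreeStageParser {

    // Stage 1: syntax only - parse "field:text" into a two-element node.
    static String[] parse(String query) {
        int colon = query.indexOf(':'); // expects the "field:text" form
        return new String[] { query.substring(0, colon), query.substring(colon + 1) };
    }

    // Stage 2: processing - normalize the term before the query is built.
    static String[] process(String[] node) {
        return new String[] { node[0], node[1].toLowerCase(Locale.ROOT) };
    }

    // Stage 3: building - convert the processed node into a query object
    // (rendered here as a string for simplicity).
    static String build(String[] node) {
        return "TermQuery(" + node[0] + ":" + node[1] + ")";
    }

    public static String parseQuery(String query) {
        return build(process(parse(query)));
    }

    public static void main(String[] args) {
        System.out.println(parseQuery("title:Lucene")); // prints TermQuery(title:lucene)
    }
}
```

Because each stage only sees the tree, you can swap the syntax, add a processor, or change the built query type without touching the other two stages, which is exactly the flexibility the new framework is after.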

Minor Changes and Improvements in Lucene 2.9

Besides the improvements and entirely new features, Lucene 2.9 contains several minor improvements worth mentioning. The following points are a partial outline of those minor changes.


Term vector-based highlighter: a new highlighter implementation based on term vectors (essentially a view of the terms, offsets, and positions in a document's field). It supports features like n-gram fields and phrase-unit highlighting with slops, and yields good performance on large documents. The downside is that it requires considerably more disk space due to the stored term vectors.

[BACK-COMPATIBILITY] Collector replaces HitCollector: the low-level HitCollector was deprecated and replaced with a new Collector class. Collector offers a more efficient API to collect hits across sequential IndexReader instances. The most significant improvement is that score calculation is now decoupled from collecting hits, or skipped entirely if not needed, a welcome efficiency gain.

Improved String interning: Lucene 2.9 internally uses a custom String intern cache instead of Java's default String.intern(). The lockless implementation yields minor internal performance improvements.

New n-gram distance: a new n-gram-based distance measure was added to the contrib spellcheck package.

[BACK-COMPATIBILITY] Weight is now an abstract class: the Weight interface was refactored into an abstract class, including minor method signature changes.

ExtendedFieldCache marked deprecated: all methods and parsers from the interface ExtendedFieldCache have been moved into FieldCache. ExtendedFieldCache is now deprecated and contains only a few declarations for binary backwards compatibility.

[BACK-COMPATIBILITY] MergePolicy interface changed: MergePolicy now requires an IndexWriter instance to be passed upon instantiation. As a result, IndexWriter was removed as a method argument from all MergePolicy methods.
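Of the items above, the n-gram distance lends itself to a short illustration. The sketch below computes a generic Dice-style bigram similarity in plain Java; it is illustrative only and not necessarily the exact formula used by the contrib spellchecker:

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative n-gram similarity (not necessarily the contrib
// spellchecker's exact formula): score two strings by the fraction of
// character bigrams they share, a common basis for n-gram distances.
public class NGramSimilarity {

    // Collect the set of character n-grams of a string.
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    // Dice-style overlap of bigram sets: 2 * |common| / (|A| + |B|).
    public static double similarity(String a, String b) {
        Set<String> ga = ngrams(a, 2);
        Set<String> gb = ngrams(b, 2);
        if (ga.isEmpty() && gb.isEmpty()) {
            return 1.0; // two strings too short for bigrams count as equal
        }
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    public static void main(String[] args) {
        // "lucene" vs. "lucine" share the bigrams lu, uc, and ne.
        System.out.println(similarity("lucene", "lucine")); // prints 0.6
    }
}
```

A spellchecker can rank correction candidates by such a score: the misspelling "lucine" still scores 0.6 against "lucene", well above unrelated terms.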

For a complete list of improvements, bug fixes, compatibility notes, and runtime behavior changes, you should consult the CHANGES.txt file included in the Lucene distribution (http://lucene.apache.org/java/2_9_0/changes/Changes.html).


Strategies for Upgrading to Lucene 2.9

In the main, a Lucene-based application will benefit from the improvements in 2.9, even though new features such as the numeric capabilities and the new TokenStream API require code modifications, and may require reindexing, to take full advantage of them. Compared to previous version changes, an upgrade to version 2.9 therefore requires a more involved upgrade procedure.

True, there are many cases in which an upgrade won't require code changes, as changes limited to expert APIs won't affect applications that only use high-level functionality. All the same, even if an application compiles against Lucene 2.9, it is likely that some of the changes in runtime characteristics can introduce unexpected behaviors. In the sections below, we'll offer some brief suggestions for making the transition.

In the months ahead, work on the next major version of Apache Lucene will be underway. That said, one should not assume that an upgrade to 2.9 is not worthwhile just because of the numbering of the next release (i.e., version 3.0). Version 3.0 is slated to be a deprecation release, in which all code marked as deprecated in Lucene 2.9 will be removed. Some parts of the API might be modified to make use of Java generics, but in general the upgrade from 2.9 to 3.0 should be as seamless as earlier upgrades have been. Once you have replaced the usage of any deprecated APIs in your code, you should be able to upgrade next time simply by replacing the Lucene JAR file.

Upgrade to 2.9: Recommended Actions

At a minimum, if you plan an upgrade of your search application to Lucene 2.9, you should recompile your application against the new version before the application is rolled out in a production environment. The most critical issues will immediately raise a compile-time error once the new JAR is in the classpath.

If you use Lucene from a single location, for example the JRE's ext directory, you should make sure that 2.9 is the only Lucene version accessible. In cases where an application relies on extending Lucene in any particular way and the upgrade doesn't raise a compile-time error, it is recommended that you add a test case for the extension based on the behavior observed against the older version of Lucene.


It is also extremely important that you back up and archive your index before opening it with Lucene 2.9, as doing so will make changes to the index that may not be readable by previous versions.

Again, we strongly recommend a careful reading of the CHANGES.txt file included in every Lucene distribution, especially the sections on back-compatibility policy and on changes in runtime behavior. Careful study followed by proper planning and testing should prevent you from running into any surprises once the new Lucene 2.9-based application goes into production.

Upgrade to 2.9: Optional Actions

Lucene 2.9 includes many new features whose use is not required by the new release. Nevertheless, 2.9 has numerous parts of the API marked as deprecated, since they are to be removed in the next release. To prepare for the next release and further improvements in this direction, it is strongly recommended that you replace any deprecated API during the upgrade process.

Applications using any kind of numeric searches can improve their performance considerably by replacing custom solutions with Lucene's numeric capabilities, described earlier in this white paper.

Last but not least, the new TokenStream API will replace the older API entirely in the next release. Custom TokenStream, TokenFilter, and Tokenizer implementations should be updated to the attribute-based API. Here, the source distribution contains basic test cases that can help you upgrade safely.

Finally, to reiterate: you would do best to write newly added test cases against your current Lucene version, and to upgrade the tests and your code once you have gained enough confidence in the stability of the upgrade.


References

http://lucene.apache.org/java/2_9_0/index.html

    http://lucene.apache.org/java/2_9_0/changes/Changes.html

    http://lucene.apache.org/java/2_9_0/changes/Contrib-Changes.html

    http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/Interview-Uwe-Schindler

    http://wiki.apache.org/lucene-java/NearRealtimeSearch

    http://wiki.apache.org/lucene-java/Payloads

    http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/

    http://wiki.apache.org/lucene-java/ConceptsAndDefinitions

    http://wiki.apache.org/lucene-java/FlexibleIndexing

    http://wiki.apache.org/lucene-java/Java_1.5_Migration

    http://www.lucidimagination.com/How-We-Can-Help/webinar-Lucene-29

    http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene_v2.html

    http://www.lucidimagination.com/Community/Hear-from-the-Experts/Podcasts-and-Videos/Interview-Ryan-McKinley

    http://ocw.kfupm.edu.sa/user062/ICS48201/NLLight%20Stemming%20for%20Arabic%20Information%20Retrieval.pdf

    https://javacc.dev.java.net/


Next Steps

For more information on how Lucid Imagination can help search application developers, employees, customers, and partners find the information they need, please visit http://www.lucidimagination.com to access blog posts, articles, and reviews of dozens of successful implementations.

Certified Distributions from Lucid Imagination are complete, supported bundles of software which include additional bug fixes and performance enhancements, along with our free 30-day Get Started program. Coupled with one of our support subscriptions, a Certified Distribution can provide a complete environment to develop, deploy, and maintain commercial-grade search applications. Certified Distributions are available at www.lucidimagination.com/Downloads.

    Please e-mail specific questions to:

    Support and Service: [email protected]

    Sales and Commercial: [email protected]

    Consulting: [email protected]

    Or call: 1.650.353.4057


APPENDIX: Choosing Lucene or Solr

The great improvements in the capabilities of Lucene and Solr open source search technology have created rapidly growing interest in using them as alternatives for search applications. As is often the case with open source technology, online community documentation provides rich details on features and variations, but does little to provide explicit direction on which technology would be the best choice. So when is Lucene preferable to Solr, and vice versa?

There is in fact no single answer, as Lucene and Solr bring very similar underlying technology to bear on somewhat distinct problems. Solr is versatile and powerful: a full-featured, production-ready search application server requiring little formal software programming. Lucene presents a collection of directly callable Java libraries, with fine-grained control of machine functions and independence from higher-level protocols.

In choosing which might be best for your search solution, the key questions to consider are application scope, deployment environment, and software development preferences.

If you are new to developing search applications, you should start with Solr. Solr provides scalable search power out of the box, whereas Lucene requires solid information retrieval experience and some meaningful heavy lifting in Java to take advantage of its capabilities. In many instances, Solr doesn't even require any real programming.

Solr is essentially the "serverization" of Lucene, and many of its abstract functions are highly similar, if not just the same. If you are building an app for the enterprise sector, for instance, you will find Solr almost a 100% match to your business requirements: it comes ready to run in a servlet container such as Tomcat or Jetty, and ready to scale in a production Java environment. Its RESTful interfaces and XML-based configuration files can greatly accelerate application development and maintenance. In fact, Lucene programmers have often reported that they find Solr to contain "the same features I was going to build myself as a framework for Lucene, but already very well implemented." Once you start with Solr, and you find yourself using a lot of the features Solr provides out of the box, you will likely be better off using Solr's well-organized extension mechanisms instead of starting from scratch with Apache Lucene.


If, on the other hand, you do not wish to make any calls via HTTP, and wish to have all of your resources controlled exclusively by Java API calls that you write, Lucene may be a better choice. Lucene works best when constructing and embedding a state-of-the-art search engine, allowing programmers to assemble and compile it inside a native Java application. Some programmers set aside the convenience of Solr in order to more directly control the large set of sophisticated features with low-level access and data or state manipulation, and choose Lucene instead, for example for byte-level manipulation of segments or intervention in data I/O. Investment at the low level enables development of extremely sophisticated, cutting-edge text search and retrieval capabilities.

As for features, the latest version of Solr generally encapsulates the latest version of Lucene. As the two are in many ways functional siblings, spending time gaining a solid understanding of how Lucene works internally can help you understand Apache Solr and its extension of Lucene's workings.

No matter which you choose, the power of open source search is yours to harness. More information on both Lucene and Solr can be found at www.lucidimagination.com.