Introduction To Apache Lucene

Introduction to Apache Lucene Sumit Luthra

description

Apache Lucene™ is a free, open-source, high-performance, full-featured text search engine library written entirely in Java. It is well suited to any application that requires full-text search, especially cross-platform applications.

Transcript of Introduction To Apache Lucene

Page 1: Introduction To Apache Lucene

Introduction to Apache Lucene

Sumit Luthra

Page 2: Introduction To Apache Lucene

Agenda

What is Apache Lucene?

Focus of Apache Lucene

Lucene Architecture

Core Indexing Classes

Core Searching Classes

Demo

Questions & Answers

Page 3: Introduction To Apache Lucene

What is Apache Lucene?

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.

It is also described as an information retrieval library.

Lucene is specifically an API, not an application.

Open Source

Page 4: Introduction To Apache Lucene

Focus

Indexing Documents

Searching Documents

Note: You can use Lucene to provide consistent full-text indexing across both database objects and documents in various formats (Microsoft Office documents, PDF, HTML, text, emails, and so on).

Page 5: Introduction To Apache Lucene

Lucene Architecture

Indexing: Raw Content -> Acquire content -> Build document -> Analyze document -> Index document -> Index

Searching: Users -> Search UI -> Build query -> Run query (against the Index) -> Render results
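The stages on this slide can be walked through with a toy example. The sketch below is plain Java, not Lucene code: the class PipelineSketch, its methods, and the lowercase-and-split analysis step are all assumptions made for illustration. It only shows how content flows into an index on one side and how a query is run against that index on the other.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy walk through the stages named on the architecture slide.
// Plain Java, not Lucene; every name here is hypothetical.
class PipelineSketch {
    private final List<String> documents = new ArrayList<>();             // acquired raw content
    private final Map<String, TreeSet<Integer>> index = new TreeMap<>();  // term -> document ids

    // Analyze: lowercase and split on non-letters (an assumed, minimal analysis step).
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    // Acquire content, build a document, analyze it, and add it to the index.
    int indexDocument(String rawContent) {
        documents.add(rawContent);
        int docId = documents.size() - 1;
        for (String term : analyze(rawContent)) {
            index.computeIfAbsent(term, k -> new TreeSet<Integer>()).add(docId);
        }
        return docId;
    }

    // Build the query with the same analysis, run it, and render the matches.
    List<String> search(String userInput) {
        TreeSet<Integer> matches = new TreeSet<>();
        for (String term : analyze(userInput)) {
            matches.addAll(index.getOrDefault(term, new TreeSet<Integer>()));
        }
        List<String> rendered = new ArrayList<>();
        for (int docId : matches) {
            rendered.add(docId + ": " + documents.get(docId));
        }
        return rendered;
    }

    public static void main(String[] args) {
        PipelineSketch p = new PipelineSketch();
        p.indexDocument("Hello World");
        p.indexDocument("Apache Lucene is a search library");
        p.indexDocument("Lucene indexes documents");
        for (String line : p.search("Lucene")) {
            System.out.println(line);
        }
    }
}
```

Using the same analyze step for both indexing and querying is the design point worth noticing: a query term can only match if it is normalized the same way the indexed text was.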

Page 6: Introduction To Apache Lucene

Indexing Documents

// Older (Lucene 2.x-style) constructor; 'true' creates a new index
IndexWriter writer = new IndexWriter(directory, analyzer, true);

Document doc = new Document();
// Field.Index.TOKENIZED is the older name for Field.Index.ANALYZED
doc.add(new Field("content", "Hello World", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("name", "filename.txt", Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("path", "http://myfile/", Field.Store.YES, Field.Index.TOKENIZED));
// [...]

writer.addDocument(doc);
writer.close();

Page 7: Introduction To Apache Lucene

Core indexing classes

IndexWriter

Directory

Analyzer

Document

Field

Page 8: Introduction To Apache Lucene

IndexWriter construction

// Deprecated
IndexWriter(Directory d,
            Analyzer a, // default analyzer
            IndexWriter.MaxFieldLength mfl);

// Preferred
IndexWriter(Directory d,
            IndexWriterConfig c);

Page 9: Introduction To Apache Lucene

Directory

FSDirectory

RAMDirectory

DbDirectory

FileSwitchDirectory

JEDirectory

Page 10: Introduction To Apache Lucene

Analyzers

An Analyzer tokenizes the input text.

Common Analyzers

– WhitespaceAnalyzer: Splits tokens on whitespace

– SimpleAnalyzer: Splits tokens on non-letters, and then lowercases

– StopAnalyzer: Same as SimpleAnalyzer, but also removes stop words

– StandardAnalyzer: Most sophisticated analyzer that knows about certain token types, lowercases, removes stop words, ...

Page 11: Introduction To Apache Lucene

Analysis examples

• “The quick brown fox jumped over the lazy dog”

• WhitespaceAnalyzer

– [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]

• SimpleAnalyzer

– [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]

• StopAnalyzer

– [quick] [brown] [fox] [jumped] [over] [lazy] [dog]

• StandardAnalyzer

– [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
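The first three token lists above can be reproduced with a few lines of plain Java. This is only an imitation of the rules as the slides describe them, not Lucene's analyzers: the class AnalyzerSketch, its method names, and the short stop-word list are assumptions made for illustration. StandardAnalyzer is left out because its token-type rules are more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java imitation of the tokenizing rules described on the slides.
// This is NOT Lucene code; class and method names are hypothetical.
class AnalyzerSketch {

    // Assumed stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    // WhitespaceAnalyzer-like: split on runs of whitespace, keep case.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // SimpleAnalyzer-like: split on non-letters, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // StopAnalyzer-like: simple() followed by stop-word removal.
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>(simple(text));
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumped over the lazy dog";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(stop(text));
    }
}
```

Note how the stop-word step runs after lowercasing, so both "The" and "the" are removed.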

Page 12: Introduction To Apache Lucene

More analysis examples

• “XY&Z Corporation – xyz@example.com”

• WhitespaceAnalyzer

– [XY&Z] [Corporation] [-] [xyz@example.com]

• SimpleAnalyzer

– [xy] [z] [corporation] [xyz] [example] [com]

• StopAnalyzer

– [xy] [z] [corporation] [xyz] [example] [com]

• StandardAnalyzer

– [xy&z] [corporation] [xyz@example.com]

Page 13: Introduction To Apache Lucene

Document & Fields

A Document is the atomic unit of indexing and searching. It contains Fields.

Fields have a name and a value

– You have to translate raw content into Fields

– Examples: Title, author, date, abstract, body, URL, keywords, ...

– Different documents can have different fields

Page 14: Introduction To Apache Lucene

Field options

Field.Store

– NO : Don’t store the field value in the index

– YES : Store the field value in the index

Field.Index

– ANALYZED : Tokenize with an Analyzer

– NOT_ANALYZED : Do not tokenize

– NO : Do not index this field
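To make the difference between storing and indexing concrete, here is a toy model in plain Java. It is not how Lucene implements these options: the class FieldOptionsSketch, its Store and Index enums, and the methods add, search, and stored are hypothetical names that only mirror the options listed above, and the analysis step (lowercase, split on non-alphanumerics) is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of the Store/Index options; not Lucene's implementation.
// All names in this class are hypothetical.
class FieldOptionsSketch {
    enum Store { YES, NO }
    enum Index { ANALYZED, NOT_ANALYZED, NO }

    // "field:term" -> ids of documents containing that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // per document: field name -> stored value
    private final List<Map<String, String>> storedValues = new ArrayList<>();

    int newDocument() {
        storedValues.add(new HashMap<String, String>());
        return storedValues.size() - 1;
    }

    void add(int docId, String name, String value, Store store, Index index) {
        if (store == Store.YES) {
            storedValues.get(docId).put(name, value);   // value can be retrieved later
        }
        if (index == Index.ANALYZED) {
            // assumed analysis: lowercase, split on anything that is not a letter or digit
            for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) post(name, token, docId);
            }
        } else if (index == Index.NOT_ANALYZED) {
            post(name, value, docId);                   // the whole value is a single term
        }
        // Index.NO: nothing is posted, so the field cannot be searched
    }

    private void post(String field, String term, int docId) {
        postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<Integer>()).add(docId);
    }

    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, new TreeSet<Integer>());
    }

    String stored(int docId, String field) {
        return storedValues.get(docId).get(field);      // null when the field was not stored
    }

    public static void main(String[] args) {
        FieldOptionsSketch index = new FieldOptionsSketch();
        int doc = index.newDocument();
        index.add(doc, "content", "Hello World", Store.NO, Index.ANALYZED);
        index.add(doc, "path", "http://myfile/", Store.YES, Index.NOT_ANALYZED);
        System.out.println(index.search("content", "hello"));     // searchable...
        System.out.println(index.stored(doc, "content"));         // ...but not retrievable
        System.out.println(index.search("path", "http://myfile/"));
        System.out.println(index.stored(doc, "path"));
    }
}
```

In this model a field such as a URL or an identifier would typically be NOT_ANALYZED so that it matches only as an exact value, while body text would be ANALYZED and need not be stored if the original is available elsewhere.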

Page 15: Introduction To Apache Lucene

Searching an Index

IndexSearcher searcher = new IndexSearcher(directory);
// matchVersion is a Version constant (for example Version.LUCENE_30)
QueryParser parser = new QueryParser(matchVersion, field_name, analyzer);
Query query = parser.parse(WORD_SEARCHED);

TopDocs hits = searcher.search(query, noOfHits);
ScoreDoc[] scoreDocs = hits.scoreDocs;

// look at first match; doc() expects the document id held in ScoreDoc.doc
Document doc = searcher.doc(scoreDocs[0].doc);
System.out.println("name=" + doc.get("name"));
searcher.close();

Page 16: Introduction To Apache Lucene

Core searching classes

IndexSearcher

Query

QueryParser

TopDocs

ScoreDoc

Page 17: Introduction To Apache Lucene

IndexSearcher

Constructor:

– IndexSearcher(Directory d);

• // Deprecated

– IndexSearcher(IndexReader r);

• Construct an IndexReader with static method IndexReader.open(dir)

Page 18: Introduction To Apache Lucene

Query

• TermQuery

– Constructed from a Term

• TermRangeQuery

• NumericRangeQuery

• PrefixQuery

• BooleanQuery

• PhraseQuery

• WildcardQuery

• FuzzyQuery

• MatchAllDocsQuery

Page 19: Introduction To Apache Lucene

QueryParser

• Constructor

– QueryParser(Version matchVersion, String defaultField, Analyzer analyzer);

• Parsing methods

– Query parse(String query) throws ParseException;

– ... and many more

Page 20: Introduction To Apache Lucene

QueryParser syntax examples

Query expression                     Document matches if…
java                                 Contains the term java in the default field
java junit                           Contains the term java or junit or both in the default field (the default operator can be changed to AND)
java OR junit                        Same as java junit under the default OR operator
+java +junit                         Contains both java and junit in the default field
java AND junit                       Same as +java +junit
title:ant                            Contains the term ant in the title field
title:extreme -subject:sports        Contains extreme in the title and not sports in subject
(agile OR extreme) AND java          Boolean expression matches
title:"junit in action"              Phrase matches in title
title:"junit action"~5               Proximity matches (within 5) in title
java*                                Wildcard matches
java~                                Fuzzy matches
lastmodified:[1/1/09 TO 12/31/09]    Range matches
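The required (+), prohibited (-), and plain-term forms in these examples follow a simple matching rule, which the toy matcher below tries to capture. It is a plain-Java sketch under stated assumptions, not Lucene's QueryParser: the class ClauseSketch and its matches method are hypothetical, prohibited terms use an ASCII hyphen, and fields, phrases, wildcards, and the AND/OR keywords are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy matcher for the +term / -term / term forms.
// Not Lucene's QueryParser: no fields, phrases, wildcards, or AND/OR keywords.
class ClauseSketch {

    // Returns true if a document (given as its set of terms) matches the expression.
    static boolean matches(String expression, Set<String> docTerms) {
        List<String> required = new ArrayList<>();
        List<String> prohibited = new ArrayList<>();
        List<String> optional = new ArrayList<>();
        for (String clause : expression.trim().split("\\s+")) {
            if (clause.startsWith("+") && clause.length() > 1) {
                required.add(clause.substring(1));
            } else if (clause.startsWith("-") && clause.length() > 1) {
                prohibited.add(clause.substring(1));
            } else if (!clause.isEmpty()) {
                optional.add(clause);
            }
        }
        for (String term : prohibited) {
            if (docTerms.contains(term)) return false;   // any prohibited term rules the document out
        }
        for (String term : required) {
            if (!docTerms.contains(term)) return false;  // every required term must be present
        }
        if (!required.isEmpty()) return true;            // optional terms would then only affect ranking
        for (String term : optional) {
            if (docTerms.contains(term)) return true;    // default OR: one optional term is enough
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("java", "lucene"));
        System.out.println(matches("java junit", doc));
        System.out.println(matches("+java +junit", doc));
        System.out.println(matches("+java -junit", doc));
        System.out.println(matches("java -lucene", doc));
    }
}
```

For a document containing only java and lucene, the four calls in main illustrate, in order, an OR match, a failed AND, a required term with an absent prohibited term, and a rejection by a prohibited term.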

Page 21: Introduction To Apache Lucene

TopDocs: Class containing the top N ranked documents/results that match a given query.

ScoreDoc: TopDocs exposes an array of ScoreDoc (scoreDocs); each entry identifies one document/result that matches the query, together with its score.
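The idea of keeping only the top N ranked hits can be sketched in plain Java with a bounded priority queue. This is an illustration under assumptions, not Lucene's collector code: TopHitsSketch, its nested Hit class, and the topN method are hypothetical stand-ins for the (document id, score) pairs that ScoreDoc represents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy top-N selection; Hit is a hypothetical stand-in for a (doc id, score) pair.
class TopHitsSketch {

    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }

        @Override
        public String toString() {
            return "doc=" + doc + " score=" + score;
        }
    }

    // Returns at most n hits, highest score first.
    static List<Hit> topN(List<Hit> allHits, int n) {
        // Min-heap on score: the weakest retained hit sits at the head.
        PriorityQueue<Hit> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        for (Hit hit : allHits) {
            heap.offer(hit);
            if (heap.size() > n) {
                heap.poll();                 // drop the lowest-scoring hit
            }
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return best;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(0, 0.2f), new Hit(1, 0.9f), new Hit(2, 0.5f), new Hit(3, 0.7f));
        for (Hit hit : topN(hits, 2)) {
            System.out.println(hit);
        }
    }
}
```

Bounding the queue at n entries means memory stays proportional to the number of results requested rather than to the number of matching documents.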

Page 22: Introduction To Apache Lucene

You will require lucene-core-x.y.jar for this demo.

Demo of simple indexing and searching using Apache Lucene

Page 23: Introduction To Apache Lucene

Any Questions?

Thank You.