Lucene
Brian Nisonger
Feb 08, 2006

What is it?
- Doug Cutting’s grandmother’s middle name
- An open source set of Java classes
  - Search engine / document classifier / indexer
  - http://lucene.sourceforge.net/talks/pisa/
- Developed by Doug Cutting, 1996
  - Xerox/Apple/Excite/Nutch
  - Wrote several papers in IR

What is it: Nuts and Bolts
- Modules for IR
  - Analysis
    - Tokenization
    - Where tokens are indexed
  - Document
    - Where the Document ID is created
    - Date of document is extracted
    - Title of document is extracted

Nuts and Bolts II
- Modules (cont’d)
  - Index
    - Provides access to indexes
    - Maintains indexes
  - Query Parser
    - Where the magic of query happens
  - Search
    - Searches across indexes

Nuts and Bolts III
- Modules (cont’d)
  - Search Spans
    - Spans: K +/- words
    - Example: find me a document that has Rachael Ray and Alton Brown within 100 words of each other that also has the term cooking
  - Store/Util
    - Store the indexes and other housekeeping
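
The span example above can be sketched in plain Java. This is only an illustration of the "within K words" test, not Lucene's own span classes: the class name `SpanSketch`, the method `withinSlop`, and the token positions are all made up, and each name ("Rachael Ray", "Alton Brown") is treated as occupying a single position for simplicity.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is not Lucene's span API.
final class SpanSketch {
    // Returns true if some position in a and some position in b are at most k apart.
    // Both lists must be sorted ascending, as positions in a postings list normally are.
    static boolean withinSlop(List<Integer> a, List<Integer> b, int k) {
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int pa = a.get(i), pb = b.get(j);
            if (Math.abs(pa - pb) <= k) return true;
            // Advance the pointer that is behind; moving the other one only widens the gap.
            if (pa < pb) i++; else j++;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Integer> rachaelRay = Arrays.asList(12, 480);  // assumed token positions
        List<Integer> altonBrown = Arrays.asList(95, 900);  // assumed token positions
        boolean hasCooking = true;                           // assumed: "cooking" occurs in the document
        System.out.println(withinSlop(rachaelRay, altonBrown, 100) && hasCooking);
    }
}
```

The two-pointer walk visits each position at most once, so the check is linear in the number of occurrences of the two terms.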

Theory
- Space Optimization for Total Ranking
  - Cutting et al 1996
  - RIAO (Computer Assisted IR) 1997
  - http://lucene.sf.net/papers/riao97.ps
- Lucene lecture at Pisa
  - Doug Cutting
  - Slides from lecture at University of Pisa, 2004
  - See previous link

Vector
- Vectors give a mathematical measure of distance between terms
  - Uses a cosine measure to determine how close terms/documents are
  - This measure can then be used for WSD/Clustering/IR
- Example:
  - Bass, fishing: .6506
  - Bass, guitar: .000423
  - This tells us the document is about fishing, not about guitars
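
As a rough illustration of the cosine measure, here is a minimal sketch over sparse term-weight vectors, using cos(a, b) = (a · b) / (|a| |b|). The class name `CosineSketch` and the term counts are invented for the example, so the numbers it prints will not match the .6506 and .000423 on the slide.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; the weights below are made-up raw term counts.
final class CosineSketch {
    // Cosine of the angle between two sparse term-weight vectors.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;  // only shared terms contribute
        }
        double na = norm(a), nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Euclidean length of a vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double x : v.values()) sum += x * x;
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Double> doc = new HashMap<>();
        doc.put("bass", 2.0);
        doc.put("fishing", 3.0);
        doc.put("lake", 1.0);

        Map<String, Double> fishing = new HashMap<>();
        fishing.put("fishing", 1.0);
        Map<String, Double> guitar = new HashMap<>();
        guitar.put("guitar", 1.0);

        System.out.println("doc vs fishing: " + cosine(doc, fishing));
        System.out.println("doc vs guitar:  " + cosine(doc, guitar));
    }
}
```

A value near 1 means the two vectors point in nearly the same direction; a value of 0 means they share no terms.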

Vectors: IR
- “Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.”
  - http://www.perl.com/pub/a/2003/02/19/engine.html
- Intro to Comp Ling and its applications to IR
  - Nisonger 2005 :P

Inverted Index
- Term / Doc Id / Weight
- Term
  - “A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied.”
  - http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene-p2.html
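
To make the idea of "transformation before indexing" concrete, here is a small sketch that covers only two of the steps named in the quote: term normalization (lowercasing) and stop-word elimination. The class `AnalysisSketch` and its stop-word list are assumptions for illustration; it does no stemming and is not Lucene's analysis package.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; not Lucene's analysis classes.
final class AnalysisSketch {
    // A tiny, assumed stop-word list; real lists are much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Splits on anything that is not a letter or digit, lowercases, drops stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (token.isEmpty() || STOP_WORDS.contains(token)) continue;
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Art of Bass Fishing"));  // prints [art, bass, fishing]
    }
}
```

Whatever comes out of this step is what the index stores, so the same transformation has to be applied to query text for lookups to match.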

Inverted Index (cont’d)
- Doc Id
  - A unique “key” that identifies each document
- Weight
  - Binary
  - Freq count
  - Weighting algorithm
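
The term / doc id / weight triple can be sketched as a map from each term to a list of postings. This is a toy in-memory structure under assumed names (`InvertedIndexSketch`, `Posting`), using a frequency count as the weight; it says nothing about how Lucene lays out its index on disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of term -> (doc id, weight) postings.
final class InvertedIndexSketch {
    // One posting: which document, and how often the term occurs in it.
    static final class Posting {
        final int docId;
        int freq;
        Posting(int docId) { this.docId = docId; this.freq = 1; }
    }

    // Terms are kept sorted, as a term dictionary usually is.
    private final Map<String, List<Posting>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Adds a document and returns the unique doc id assigned to it.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            List<Posting> list = postings.computeIfAbsent(token, t -> new ArrayList<>());
            Posting last = list.isEmpty() ? null : list.get(list.size() - 1);
            // Doc ids only grow, so the current document can only be the last posting.
            if (last != null && last.docId == docId) last.freq++;
            else list.add(new Posting(docId));
        }
        return docId;
    }

    // Frequency-count weight; a binary weight would be (freq > 0 ? 1 : 0).
    int freq(String term, int docId) {
        List<Posting> list = postings.get(term);
        if (list == null) return 0;
        for (Posting p : list) {
            if (p.docId == docId) return p.freq;
        }
        return 0;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        int d0 = index.addDocument("Bass fishing: how to catch bass");
        int d1 = index.addDocument("Bass guitar lessons");
        System.out.println(index.freq("bass", d0) + " " + index.freq("bass", d1));
    }
}
```

A weighting algorithm, the third option on the slide, would replace the raw count with a computed score at index or query time.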

Index Merge
- Basic/Basket/Basketball
  - Only keeps track of the differences between words
- Periodically merges indexes
  - Allows new documents to be added easily
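
One way to read "only keeps track of the differences between words" is prefix compression of a sorted term list: each term is stored as the number of leading characters it shares with the previous term, plus the remaining suffix. The sketch below shows that reading on the slide's own three words; the class `FrontCodingSketch` and its `Entry` type are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of prefix compression over sorted terms.
final class FrontCodingSketch {
    // Shared-prefix length with the previous term, plus the characters that differ.
    static final class Entry {
        final int shared;
        final String suffix;
        Entry(int shared, String suffix) { this.shared = shared; this.suffix = suffix; }
        @Override public String toString() { return shared + ":" + suffix; }
    }

    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int n = 0;
            int max = Math.min(prev.length(), term.length());
            while (n < max && prev.charAt(n) == term.charAt(n)) n++;
            out.add(new Entry(n, term.substring(n)));
            prev = term;
        }
        return out;
    }

    // Rebuilds the full terms by reusing the prefix of the previously decoded term.
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String term = prev.substring(0, e.shared) + e.suffix;
            out.add(term);
            prev = term;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> coded = encode(Arrays.asList("basic", "basket", "basketball"));
        System.out.println(coded);          // prints [0:basic, 3:ket, 6:ball]
        System.out.println(decode(coded));  // prints [basic, basket, basketball]
    }
}
```

Because the encoding depends on sorted order, it fits naturally with writing small sorted indexes and merging them later, which is the second point on the slide.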

Query
- Boolean Search
  - Only searches documents with at least 1 term in query
  - “Boolean Search Engine”
- Parallel Search
  - Each term in query is searched in parallel
  - Partial scores added to queue of docs

Query II
- Threshold
  - If partial score is too low and will not be part of N-best, then the document is ignored even before search is complete
  - Example:
    - Potential New Doc [0,0,0,0,0,0,i]
    - Document ranked 14 [233,202,109,100,i]
    - Potential New Doc is ignored
- Small loss of recall greatly increases speed of search
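
The threshold idea can be sketched as term-at-a-time scoring with accumulators. The version below is a conservative variant and not necessarily the scheme from the cited paper: a document that has not been seen yet is only given an accumulator if the most it could still collect from the remaining terms exceeds the current N-th best partial score. All names (`ThresholdSketch`, `TermPostings`, `score`, `nthBest`) and the scores are assumptions; partial scores are assumed non-negative and `nBest` at least 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of pruning low-scoring candidates early.
final class ThresholdSketch {
    // Postings for one query term: docId -> partial score, plus the largest partial score.
    static final class TermPostings {
        final Map<Integer, Double> scores;
        final double max;
        TermPostings(Map<Integer, Double> scores) {
            this.scores = scores;
            double m = 0.0;
            for (double s : scores.values()) m = Math.max(m, s);
            this.max = m;
        }
    }

    // Term-at-a-time scoring; terms should be ordered from most to least important.
    static Map<Integer, Double> score(List<TermPostings> terms, int nBest) {
        Map<Integer, Double> acc = new HashMap<>();
        // remaining[i] = the most a document could still collect from terms i..end
        double[] remaining = new double[terms.size() + 1];
        for (int i = terms.size() - 1; i >= 0; i--) {
            remaining[i] = remaining[i + 1] + terms.get(i).max;
        }
        for (int i = 0; i < terms.size(); i++) {
            double threshold = nthBest(acc, nBest);
            for (Map.Entry<Integer, Double> p : terms.get(i).scores.entrySet()) {
                if (acc.containsKey(p.getKey())) {
                    acc.merge(p.getKey(), p.getValue(), Double::sum);
                } else if (remaining[i] > threshold) {
                    acc.put(p.getKey(), p.getValue());
                }
                // Otherwise the new document cannot reach the N-best and is ignored.
            }
        }
        return acc;
    }

    // The N-th largest partial score so far, or -infinity if fewer than N documents are tracked.
    static double nthBest(Map<Integer, Double> acc, int n) {
        if (acc.size() < n) return Double.NEGATIVE_INFINITY;
        List<Double> sorted = new ArrayList<>(acc.values());
        Collections.sort(sorted, Collections.reverseOrder());
        return sorted.get(n - 1);
    }

    public static void main(String[] args) {
        Map<Integer, Double> rare = new HashMap<>();
        rare.put(1, 10.0);
        rare.put(2, 9.0);
        Map<Integer, Double> common = new HashMap<>();
        common.put(1, 1.0);
        common.put(3, 0.5);
        List<TermPostings> terms = Arrays.asList(new TermPostings(rare), new TermPostings(common));
        System.out.println(score(terms, 2));  // document 3 never gets an accumulator
    }
}
```

This variant never drops a document that belongs in the N-best (ties aside); the speed-for-recall trade on the slide comes from pruning more aggressively than this bound allows.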

Evaluation of Lucene
- Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering
  - Tellex et al, MIT AI Lab 2003
- Compared Prise to Lucene for question and answer tasks
  - Question & Answer
    - <Who is the president?> <George W. Bush .76>

Evaluation II
- Prise
  - An IR system developed by NIST that, according to the paper, uses “modern” search engine techniques
- Findings
  - Found Prise was better than Lucene, since “Boolean” query engines are considered old school and its answers to questions were better

Evaluation III
- Lucene
  - Found that although Prise did better on correct answers, Lucene found more documents containing relevant information

Evaluation: Conclusion
- External Knowledge Sources for Question Answering
  - http://people.csail.mit.edu/gremio/publications/TREC2005.ps
  - Katz et al, MIT Lab 2005
- MIT used Lucene in their 2005 TREC submission, not Prise

Users
- Lucene is used widely
  - TREC
  - Document Retrieval Enterprise Systems
  - Part of Database/Web engine
  - Part of Nutch
- Used by academics for large projects
  - MIT, AI Lab
  - Know-It-All Project (UW)

Conclusions
- Lucene is a good set of classes
  - Designed to allow customization without having to “reinvent the wheel”
  - Robust
  - Fast
  - Large development groups
  - Used widely in academia and industry

Questions?
- Feel free to ask questions, make comments, tell jokes.