Hacking Lucene and Solr for Fun and Profit
HACKING LUCENE AND SOLR FOR FUN AND PROFIT
Grant Ingersoll
CTO, LucidWorks
@gsingers
• Search is a system building block
– text is only a part of the story
• If the algorithms fit, use them!
• Embrace fuzziness!
• Scoring features are everywhere
Keyword Search is so yesterday
• Classic: Fast, fuzzy text matching across a large document collection
• Data Quality and Analysis
– Faceting, slicing and dicing of numerical/enumerated data
– Spatial
– Spell checking, record linkage, highlighting
– Stats, Missing fields, etc.
• Top N problems
Lucene and Solr can do…
• Search Hacks
• “Trust me, I’m a mathematician”
• “I wish I had thought of that” Hack
Topics
Search Hacks
• SimpleTextCodec Example
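// conf is the IndexWriterConfig; SimpleTextCodec stores the index as human-readable text files
conf.setCodec(new SimpleTextCodec());
File simpleText = new File("simpletext");
Directory directory = new SimpleFSDirectory(simpleText);
IndexWriter writer = new IndexWriter(directory, conf);
// index(writer) is the demo's own helper (not shown) that adds the documents
index(writer);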
• Similarity:
BM25Similarity bm25Similarity = new BM25Similarity();
conf.setSimilarity(bm25Similarity);
• http://www.ibm.com/developerworks/java/library/j-solr-lucene/index.html
Learn IR
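To give a feel for what swapping in BM25Similarity changes, the sketch below computes a BM25 score for a single term in plain Java. It is an illustration only, not Lucene's implementation: the class Bm25Sketch and its method names are hypothetical, and k1 = 1.2 and b = 0.75 are assumed because they are the commonly used defaults.

```java
/** Hypothetical, illustration-only BM25 scorer; not Lucene's BM25Similarity. */
final class Bm25Sketch {
    static final double K1 = 1.2;   // assumed term-frequency saturation parameter
    static final double B = 0.75;   // assumed document-length normalization parameter

    /** Inverse document frequency: rarer terms get a larger weight. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Score of one term in one document.
     *
     * @param tf        occurrences of the term in the document
     * @param docFreq   number of documents containing the term
     * @param docCount  number of documents in the collection
     * @param docLen    length of this document in tokens
     * @param avgDocLen average document length in tokens
     */
    static double score(double tf, long docFreq, long docCount, double docLen, double avgDocLen) {
        // Longer-than-average documents are penalized through this norm.
        double norm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf(docCount, docFreq) * (tf * (K1 + 1.0)) / (tf + norm);
    }
}
```

The score grows with term frequency but saturates, and it drops for documents longer than the average, which is the main behavioral difference from plain TF/IDF scoring.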
http://localhost:8983/solr/answer?q=what+is+trimethylbenzene&defType=qa&qa=true&qa.qf=body
Simple QA Workflow
• Split into sentences
– Buffer tokens – see com.tamingtext.texttamer.solr.SentenceTokenizer
• Identify Names using OpenNLP
• Add Entity marker tokens at the same position as original token
– Could also be done with Payloads
• Index
• https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/texttamer/solr
• https://github.com/tamingtext/book/blob/master/apache-solr/solr-qa/conf/schema.xml
Analysis
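The idea of adding entity marker tokens at the same position as the original token can be sketched without Lucene. In the example below, a position increment of 0 means "stacked on the previous token", which mirrors how analysis chains usually express this. Everything here is hypothetical: the class EntityMarkerSketch, the "NE_" marker prefix, and the assumption that a name finder has already mapped word indexes to entity types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of stacking entity markers on name tokens; not the Taming Text code. */
final class EntityMarkerSketch {

    /** A token plus its position increment (0 = same position as the previous token). */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    /**
     * @param words       tokens of one sentence, in order
     * @param entityTypes word index to entity type, e.g. "person" (assumed name-finder output)
     */
    static List<Token> addMarkers(List<String> words, Map<Integer, String> entityTypes) {
        List<Token> out = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            out.add(new Token(words.get(i), 1));          // the original token advances the position
            String type = entityTypes.get(i);
            if (type != null) {
                out.add(new Token("NE_" + type, 0));      // the marker shares the token's position
            }
        }
        return out;
    }
}
```

Because the marker occupies the same position as the name, a positional query can require "a person near these keywords" without knowing which person.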
• Custom Query Parser takes in user’s natural language query, classifies it to find the Answer Type and generates Solr query
• Retrieve candidate passages that match keywords and expected answer type
• Unlike keyword search, we need to know exactly where matches occur
• https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/qa
Search Side
• Answer Type examples:
– Person (P), Location (L), Organization (O), Time Point (T), Duration (R), Money (M)
– See page 248 for more
• Train an OpenNLP classifier off of a set of previously annotated questions, e.g.:
– P Which French monarch reinstated the divine right of the monarchy to France and was known as `The Sun King' because of the splendour of his reign?
Answer Type Classification
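The annotated example above puts the answer type code first and the question after it. A minimal sketch of reading such lines before handing them to a classifier trainer might look like the following. The class AnswerTypeLineSketch is hypothetical, it assumes the code and the question are separated by whitespace, and it covers only the six codes listed on the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for annotated training questions; not the Taming Text or OpenNLP code. */
final class AnswerTypeLineSketch {

    /** Answer type codes listed on the slide. */
    static final Map<String, String> ANSWER_TYPES = new LinkedHashMap<>();
    static {
        ANSWER_TYPES.put("P", "Person");
        ANSWER_TYPES.put("L", "Location");
        ANSWER_TYPES.put("O", "Organization");
        ANSWER_TYPES.put("T", "Time Point");
        ANSWER_TYPES.put("R", "Duration");
        ANSWER_TYPES.put("M", "Money");
    }

    /** Splits a training line into { answer type code, question text }. */
    static String[] parse(String line) {
        String trimmed = line.trim();
        int space = trimmed.indexOf(' ');
        if (space < 0) {
            throw new IllegalArgumentException("Expected '<code> <question>': " + line);
        }
        String code = trimmed.substring(0, space);
        if (!ANSWER_TYPES.containsKey(code)) {
            throw new IllegalArgumentException("Unknown answer type code: " + code);
        }
        return new String[] { code, trimmed.substring(space + 1).trim() };
    }
}
```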
“Trust me, I’m a mathematician”
Classification
kNN and TF/IDF Classification w/ Lucene
https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/classifier/mlt
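As a rough picture of kNN classification with TF/IDF weights, here is a small in-memory sketch in plain Java. It is not the linked Taming Text classifier and does not use a Lucene index: the class KnnTfIdfSketch, the tokenization on non-word characters, the smoothed IDF formula, and the similarity-weighted vote are all assumptions made for illustration. A search engine does the same "find the k most similar labeled documents" step with an index instead of a linear scan.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory kNN text classifier using TF-IDF weights and cosine similarity. */
final class KnnTfIdfSketch {
    private final List<Map<String, Integer>> docs = new ArrayList<>(); // term counts per training doc
    private final List<String> labels = new ArrayList<>();             // label per training doc
    private final Map<String, Integer> docFreq = new HashMap<>();      // docs containing each term

    private static Map<String, Integer> termCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    void train(String label, String text) {
        Map<String, Integer> counts = termCounts(text);
        docs.add(counts);
        labels.add(label);
        for (String term : counts.keySet()) docFreq.merge(term, 1, Integer::sum);
    }

    /** Turns raw counts into TF-IDF weights (smoothed so unseen terms still get a weight). */
    private Map<String, Double> weigh(Map<String, Integer> counts) {
        Map<String, Double> vec = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 0);
            double idf = Math.log(1.0 + (double) docs.size() / (1.0 + df));
            vec.put(e.getKey(), e.getValue() * idf);
        }
        return vec;
    }

    private static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    /** Returns the label with the largest summed similarity among the k nearest docs, or null if untrained. */
    String classify(String text, int k) {
        Map<String, Double> query = weigh(termCounts(text));
        double[] sims = new double[docs.size()];
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            sims[i] = cosine(query, weigh(docs.get(i)));
            order.add(i);
        }
        order.sort((x, y) -> Double.compare(sims[y], sims[x])); // most similar first
        Map<String, Double> votes = new HashMap<>();
        for (int i = 0; i < Math.min(k, order.size()); i++) {
            int doc = order.get(i);
            votes.merge(labels.get(doc), sims[doc], Double::sum);
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : votes.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```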
• Builds classifier off of index information
• See the org.apache.lucene.classification package
• Naïve Bayes Classifier
• kNN Classifier
• Perceptron Classifier
Lucene Classification Module
• Cross recommendation as search
– with search used to build cross recommendation!
• Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
• (Ab)use of a search engine
– but not as a search engine for content
– more like a search engine for behavior
• See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
– http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
• Go get Mahout/Myrrix or just do it in y(our) search engine
Recommenders
• History:
Recommendation Basics
User  Thing
1     3
2     4
3     4
2     3
3     2
1     1
2     1
• History as matrix:
• t1+t3 cooccur 2 times; t1+t4, t2+t4, and t3+t4 once each
Recommendation Basics
     t1  t2  t3  t4
u1    1   0   1   0
u2    1   0   1   1
u3    0   1   0   1
• Cooccurrence
• More details at http://lucenerevolution.org/2013/Crowd-sourced-intelligence-built-into-Search-over-Hadoop
Recommendation Basics
     t1  t2  t3  t4
t1    2   0   2   1
t2    0   1   0   1
t3    2   0   2   1
t4    1   1   1   2

         t3  not t3
t1        2       1
not t1    1       1
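The step from history to cooccurrence can be sketched in a few lines of plain Java. The example below builds the user-by-thing matrix from (user, thing) pairs and multiplies it by its own transpose, so entry (i, j) counts the users who touched both thing i and thing j, and each diagonal entry counts the users who touched that thing. The class CooccurrenceSketch and its method names are hypothetical, and ids are assumed to be 1-based as in the history table.

```java
/** Hypothetical sketch: history pairs -> 0/1 history matrix -> cooccurrence counts. */
final class CooccurrenceSketch {

    /** Builds the user-by-thing 0/1 matrix from {user, thing} pairs with 1-based ids. */
    static int[][] historyMatrix(int[][] pairs, int users, int things) {
        int[][] history = new int[users][things];
        for (int[] p : pairs) {
            history[p[0] - 1][p[1] - 1] = 1;
        }
        return history;
    }

    /** Cooccurrence matrix: transpose(history) times history. */
    static int[][] cooccurrence(int[][] history) {
        int things = history.length == 0 ? 0 : history[0].length;
        int[][] counts = new int[things][things];
        for (int[] userRow : history) {
            for (int i = 0; i < things; i++) {
                for (int j = 0; j < things; j++) {
                    counts[i][j] += userRow[i] * userRow[j];
                }
            }
        }
        return counts;
    }
}
```

Feeding in the seven history pairs from the slide with 3 users and 4 things reproduces the history matrix shown above, with t1 and t3 cooccurring twice.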
“I wish I had thought of that”
Time Space Continuum
• Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges
– Useful for Open Hours, Shifts, etc.
• Key: multi-valued range data
• Query using rectangle intersections
– q = shift:"Intersects(0 19 23 365)"
• Credits to David Smiley and Hoss…
https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
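One way to read the rectangle trick, sketched here under stated assumptions: each time range is stored as a 2D point with x = start and y = end, and the four numbers in the query are taken as minX minY maxX maxY. Under that reading, the query above matches every range that starts at or before 23 and ends at or after 19, which is exactly the set of ranges overlapping [19, 23]. The class TimeRangeSketch is hypothetical and only mimics the geometry; it does not call Solr.

```java
/** Hypothetical sketch of "time ranges as points, overlap as rectangle intersection". */
final class TimeRangeSketch {

    /** True if the point (start, end) lies inside the rectangle minX minY maxX maxY. */
    static boolean intersects(int start, int end, int minX, int minY, int maxX, int maxY) {
        return start >= minX && start <= maxX && end >= minY && end <= maxY;
    }

    /**
     * True if any of a document's ranges overlaps [queryStart, queryEnd].
     *
     * @param ranges   multi-valued field: each entry is {start, end}
     * @param maxValue largest value a range end can take (365 in the slide's query)
     */
    static boolean anyOverlap(int[][] ranges, int queryStart, int queryEnd, int maxValue) {
        for (int[] r : ranges) {
            // Overlap means start <= queryEnd and end >= queryStart,
            // i.e. the point falls in the rectangle (0, queryStart) to (queryEnd, maxValue).
            if (intersects(r[0], r[1], 0, queryStart, queryEnd, maxValue)) {
                return true;
            }
        }
        return false;
    }
}
```

With this encoding, `anyOverlap(ranges, 19, 23, 365)` corresponds to the rectangle 0 19 23 365, and a document with several shifts matches as soon as one of them overlaps the queried window.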
Finance Example
[Figure: % change over time for AAPL, MSFT, and IBM]
• http://www.manning.com/ingersoll
– http://github.com/tamingtext/book
• http://www.tamingtext.com
• Me:
– @gsingers
Resources