© Copyright 2013
Open Source Search FTW!
Grant Ingersoll
Lucene/Solr Committer, Apache Software Foundation
CTO, LucidWorks
@gsingers
© 2013 LucidWorks
Preaching to the Converted!
• Embrace fuzziness!
• Search is a system building block
• If the algorithms fit, use them!
• Search use leads to search abuse
• Scoring features are everywhere
http://cheezburger.com/5243950080
Topics
• Quick Intro to Lucene and Solr
• What’s new in Lucene and Solr 4.x?
- Lucene/Solr for Info Retrieval
• (Ab)Using Search Engine Tech. for Fun and Profit
Quick Intro to Lucene and Solr
Relax, You’re Among Friends
• Large, diverse search community with many non-traditional search engine usages
- Object stores, Record linkage, Social, mobile -> web
• Open Dev. > Open Source
• “The Apache Way”
- Meritocracy – Those who do, decide!
• Always Be Testing
- Randomized system tests are all the rage
- http://vimeo.com/32087114
• Patches Welcome!
Lucene: Speed and Memory
• Native Near Real Time (NRT) support
- Per segment
- FieldCache can be controlled to only load new segments
- Soft commit: faster because it skips fsync, so updates become visible sooner
• DWPT (DocumentsWriterPerThread)
- Faster, more consistent indexing speed
• Faster fuzzy & wildcard query processing
• String -> BytesRef
- Much improved data structure
- … means less memory and less garbage collection effort
Up and to the Right
• http://people.apache.org/~mikemccand/lucenebench/indexing.html
Lucene: Flexibility
• Flexible Index Formats
- New posting list codecs: Block, Simple Text, Append (HDFS..), etc.
- Pulsing codec: improves performance of primary key searches by inlining docs, positions, and payloads, which saves disk seeks
• Pluggable Scoring
- Decoupled from TF/IDF
- Built-in alternatives include BM25, DFR, and others
» http://en.wikipedia.org/wiki/Okapi_BM25
» http://terrier.org/docs/v3.5/dfr_description.html
- Add your own
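The pluggable-scoring bullets name BM25 as a built-in alternative to TF/IDF. As a rough illustration of what such a scorer computes, here is a minimal Python sketch of Okapi BM25 in its textbook form (see the Wikipedia link above), using the non-negative `log(1 + ...)` IDF variant. It is not Lucene's implementation: the function name, the toy corpus, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against `query_terms` with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # term absent from this document, contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents need more occurrences to score as high.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, already tokenized on whitespace.
docs = [
    "open source search with lucene".split(),
    "solr is built on lucene".split(),
    "polar bears relax on ice".split(),
]
scores = bm25_scores(["lucene", "search"], docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Swapping in a different `bm25_scores` body while keeping the same inputs is the spirit of "add your own": the ranking function is a replaceable part, separate from the index.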
FS(A|T)
• Keys:
- byte[] – write-once
- Linear-time construction of minimal automata (n log n if the input is not sorted)
- Compression
- Reverse lookups
- Weights (used for auto-suggest)
- Pluggable Algebra
• Uses:
- Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
- FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
• More:
- http://slidesha.re/vKtpVA
- http://bit.ly/Pkjyu0
- “Smaller Representation of Finite State Automata”
» Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118–192.
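To make the "weights (used for auto-suggest)" idea concrete, the sketch below ranks completions of a prefix by weight over sorted byte keys. It deliberately swaps in a plain sorted array with binary search instead of a finite-state transducer, so it shows the lookup behavior only, not the compression or the minimal-automaton construction. The class name, terms, and weights are hypothetical.

```python
import bisect
import heapq

class SortedSuggester:
    """Weighted prefix suggestions over sorted byte keys (a stand-in for an FST)."""

    def __init__(self, weighted_terms):
        # Keys are stored as bytes and sorted once; the structure is write-once.
        self._entries = sorted(
            (term.encode("utf-8"), weight) for term, weight in weighted_terms.items()
        )
        self._keys = [key for key, _ in self._entries]

    def suggest(self, prefix, k=3):
        p = prefix.encode("utf-8")
        lo = bisect.bisect_left(self._keys, p)
        # 0xFF never occurs in UTF-8, so p + b"\xff" sorts after every key starting with p.
        hi = bisect.bisect_left(self._keys, p + b"\xff")
        best = heapq.nlargest(k, self._entries[lo:hi], key=lambda entry: entry[1])
        return [(key.decode("utf-8"), weight) for key, weight in best]

suggester = SortedSuggester(
    {"solr": 90, "solrcloud": 40, "search": 70, "lucene": 80, "soft commit": 10}
)
top_for_sol = suggester.suggest("sol", k=2)
top_for_s = suggester.suggest("s", k=2)
```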
Solr 4: New Features
• Search/Faceting/Relevance
- New Relevance Function Queries (tf, df, others)
- Pivot Faceting
- Pseudo-join
- Improved Spatial (more later)
- Full support for Lucene Codecs, pluggable scoring
• Indexing
- New Update Processors, including scripting option
- Near real time
• Codec/Similarity support from Lucene 4
• Other
- New Admin UI
Geospatial improvements
• Index shapes other than points (circles, polygons, etc.)
• More complex interactions than point in a circle
• Indexing:
- "geo":"43.17614,-90.57341"
- "geo":"Circle(4.56,1.23 d=0.0710)"
- "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
• Searching:
- fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
- fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
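The indexing and searching bullets above can be assembled into request payloads roughly as follows. This sketch only builds the JSON body and the query URL and sends nothing. The base URL, the core name `collection1`, the `id` values, and the `q=*:*` and `wt=json` parameters are assumptions that do not appear on the slide; the `geo` values and the `Intersects(...)` filter are copied from it.

```python
import json
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983/solr/collection1"  # assumed host and core name

# One document per value form shown on the slide; the ids are hypothetical.
docs = [
    {"id": "pt-1", "geo": "43.17614,-90.57341"},
    {"id": "circle-1", "geo": "Circle(4.56,1.23 d=0.0710)"},
    {"id": "poly-1", "geo": "POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"},
]
update_body = json.dumps(docs)  # JSON body for an update request (not sent here)

def intersects_url(shape, field="geo"):
    """Build a select URL that filters on field:"Intersects(shape)"."""
    params = {"q": "*:*", "fq": f'{field}:"Intersects({shape})"', "wt": "json"}
    return f"{SOLR_BASE}/select?{urlencode(params)}"

# Rectangle filter from the slide.
url = intersects_url("-74.093 41.042 -69.347 44.558")
```

Letting `urlencode` handle the quotes, spaces, and parentheses avoids the hand-escaping mistakes that are easy to make with shape strings.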
Scaling Solr
• Distributed/sharded indexing & search
- Auto distributes updates and queries to appropriate shards
- Near Real Time (NRT) indexing capable
• Dynamically scalable
- New SolrCloud instances add indexing and query capacity
- Supports re-balancing
• Reliable
- No single point of failure
- Transactions logged
- Robust, automatic recovery
• http://wiki.apache.org/solr/SolrCloud
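One way to picture "auto distributes updates to appropriate shards" is hash-range routing: hash the document id and send it to whichever shard owns that slice of the hash space. The sketch below is a simplified illustration under that assumption, not SolrCloud's router. It uses CRC32 from the standard library as a stand-in hash, and the function names and ids are hypothetical.

```python
import zlib

def shard_for(doc_id, num_shards):
    """Map a document id to a shard by splitting the 32-bit hash space into equal ranges."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # unsigned 32-bit value in Python 3
    range_size = 2**32 // num_shards
    # The last shard also absorbs the remainder when 2**32 is not evenly divisible.
    return min(h // range_size, num_shards - 1)

def route(doc_ids, num_shards):
    """Group document ids by the shard that would receive them."""
    shards = {s: [] for s in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards

doc_ids = [f"doc-{i}" for i in range(20)]
placement = route(doc_ids, num_shards=4)
```

Because the shard is a pure function of the id, any node can compute where a document belongs without consulting a central directory.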
New in 4.4 (just released)
• HDFS backed directory for storing index and transaction logs in Apache Hadoop
• New Core discovery capabilities
• Schemaless/External Schema/Field Guessing
• Schema APIs
• Add documents from the Admin UI
Hacking Search Engines for Fun and Profit
… Find your Keys, Store Your Content
• Lucene/Solr is a fast key-value store
- Bonus: search your values!
• NoSQL before NoSQL was cool
• Solr: distributed key/value
- Durable, Isolated, Redundant, Fast, Real-time
- Joins, Column Storage
• Solr or Tika + Lucene can index popular office formats
• Solr can back up, replicate, and scale as content grows
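The "key-value store whose values are searchable" idea can be shown with a toy in-memory version: a dictionary for get-by-key plus an inverted index over the stored text. This is a conceptual sketch only, with none of the durability, distribution, or analysis that Lucene/Solr provide. The class name, whitespace tokenization, and AND-only query semantics are assumptions made for brevity.

```python
from collections import defaultdict

class SearchableStore:
    """Key-value store with a tiny inverted index over the stored values."""

    def __init__(self):
        self._docs = {}                     # key -> stored text
        self._postings = defaultdict(set)   # token -> keys whose text contains it

    def put(self, key, text):
        if key in self._docs:
            self.delete(key)  # overwrite: drop the old postings first
        self._docs[key] = text
        for token in text.lower().split():
            self._postings[token].add(key)

    def delete(self, key):
        text = self._docs.pop(key)
        for token in set(text.lower().split()):
            self._postings[token].discard(key)

    def get(self, key):
        return self._docs.get(key)

    def search(self, query):
        """Return the keys whose text contains every query token."""
        tokens = query.lower().split()
        if not tokens:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(hits)

store = SearchableStore()
store.put("doc1", "Lucene is a fast key-value store")
store.put("doc2", "Solr scales as content grows")
```

`store.get("doc1")` is the plain key lookup; `store.search("fast store")` is the bonus of searching the values.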
… Find Love! Upsell! Cross-sell!
• Cross recommendation as search
- with search used to build cross recommendation!
• Recommend content to people who exhibit certain behaviors (clicks, query terms, other)
• (Ab)use of a search engine
- but not as a search engine for content
- more like a search engine for behavior
• See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms
- http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms
• Go get Mahout/Myrrix or just do it in y(our) search engine
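A minimal sketch of "a search engine for behavior": each item gets an indicator field listing the items that co-occur with it in user histories, and a user's recent behavior becomes the query against that field. This version scores by raw overlap count and keeps every co-occurrence. That is a simplification made here for brevity, not the method from the referenced talk, and the data and function names are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

def build_indicators(histories):
    """For each item, collect the items that co-occur with it in any user's history."""
    indicators = defaultdict(set)
    for items in histories.values():
        for a, b in permutations(set(items), 2):
            indicators[a].add(b)
    return indicators

def recommend(recent_items, indicators, top_n=3):
    """Use recent behavior as the query; score items by how many indicator terms match."""
    query = set(recent_items)
    scores = {
        item: len(query & terms)
        for item, terms in indicators.items()
        if item not in query and query & terms
    }
    # Highest overlap first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]

histories = {
    "alice": ["tardis", "scarf", "screwdriver"],
    "bob": ["tardis", "screwdriver"],
    "carol": ["scarf", "bow tie"],
}
indicators = build_indicators(histories)
for_tardis_fan = recommend(["tardis"], indicators)
for_tardis_and_scarf = recommend(["tardis", "scarf"], indicators)
```

In a real search engine the indicator sets would be an indexed multi-valued field, and `recommend` would be an ordinary OR query over it.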
… Avoid Delays
… Time travel?
• Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges
- Useful for Open Hours, Shifts, etc.
• Query using rectangle intersections
- q = shift:"Intersects(0 19 23 365)"
https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
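The trick behind the rectangle query can be checked in a few lines of Python. The sketch assumes each time range is indexed as a 2D point (x = start, y = end) and that the four query numbers are read as minX minY maxX maxY, so `Intersects(0 19 23 365)` would match ranges that start at or before 23 and end at or after 19, that is, ranges overlapping the window 19 to 23. That reading, the axis bounds, and the shift data are assumptions; the slide does not spell them out.

```python
MIN_T, MAX_T = 0, 365  # assumed bounds of the time axis, taken from the slide's query

def range_to_point(start, end):
    """Index a time range as a 2D point: x is the start, y is the end."""
    return (start, end)

def overlap_rectangle(q_start, q_end):
    """Rectangle (min_x, min_y, max_x, max_y) matching ranges that overlap [q_start, q_end]."""
    # A range overlaps the window when start <= q_end and end >= q_start.
    return (MIN_T, q_start, q_end, MAX_T)

def point_in_rect(point, rect):
    x, y = point
    min_x, min_y, max_x, max_y = rect
    return min_x <= x <= max_x and min_y <= y <= max_y

# Hypothetical shifts as (start, end) on the same 0..365 axis.
shifts = {"early": (5, 12), "mid": (10, 20), "late": (22, 40), "after": (30, 60)}
rect = overlap_rectangle(19, 23)  # the numbers used in the slide's query
matches = sorted(
    name for name, (start, end) in shifts.items()
    if point_in_rect(range_to_point(start, end), rect)
)
```

Here "mid" and "late" overlap the window, while "early" ends too soon and "after" starts too late, so a one-dimensional overlap test becomes a two-dimensional point-in-rectangle test that a spatial index can answer.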
Boldly go forth and rank!
• Faster
• More Flexible
• Easier than ever scaling
• More reliable than ever
• Reduced cost of experimentation
Where to Next?
• Lucene/Solr EU Conference:
- Dublin, IE, November 4-7: http://lucenerevolution.org/
- CFP Open Now
• Lucene/Solr
- http://lucene.apache.org
- {java-user|solr-user}@lucene.apache.org
- SIGIR ‘12 Open Source Workshop
» http://opensearchlab.otago.ac.nz/paper_10.pdf
• LucidWorks
- http://www.lucidworks.com
- Commercial support, products, etc. for Lucene/Solr
• Me
- [email protected]
- @gsingers on Twitter
- “Taming Text” – Engineer’s guide to open source search and NLP
» http://www.manning.com/ingersoll
Credits
• All of the Lucene/Solr committers and contributors
• Polar bear: http://gaijinexplorer.blogspot.ie/2012/12/its-all-just-relaxing.html
• Volunteers: http://www.poconohealthsystem.org/?id=228&sid=1
• Not Hiring: http://naijaguardianjobs.com/wp-content/uploads/2013/03/Not-Hiring-The-American.jpg
• Keys: http://www.flickr.com/photos/crazyneighborlady/355232758/
• Love: http://www.msruntheus.com/above-all-love-each-other-deeply/
• TARDIS: http://2.bp.blogspot.com/-ysN8JskY4WM/UEZNhBywQKI/AAAAAAAABdg/gXE0A9OO6Mk/s1600/13881_doctor_who.jpg