
Transcript
Page 1: Open Source Search FTW

© Copyright 2013

Open Source Search FTW!

Grant Ingersoll

Lucene/Solr Committer, Apache Software Foundation

CTO, LucidWorks

@gsingers

Page 2: Open Source Search FTW

© 2013 LucidWorks

Preaching to the Converted!

• Embrace fuzziness!

• Search is a system building block

• If the algorithms fit, use them!

• Search use leads to search abuse

• Scoring features are everywhere

http://cheezburger.com/5243950080

Page 3: Open Source Search FTW

Topics

• Quick Intro to Lucene and Solr

• What’s new in Lucene and Solr 4.x?

- Lucene/Solr for Info Retrieval

• (Ab)Using Search Engine Tech. for Fun and Profit

Page 4: Open Source Search FTW

Quick Intro to Lucene and Solr

Page 5: Open Source Search FTW

Relax, You’re Among Friends

• Large, diverse search community with many non-traditional search engine usages

- Object stores, Record linkage, Social, mobile -> web

• Open Dev. > Open Source

• “The Apache Way”

- Meritocracy – Those who do, decide!

• Always Be Testing

- Randomized system tests are all the rage

- http://vimeo.com/32087114

• Patches Welcome!

Page 6: Open Source Search FTW

Page 7: Open Source Search FTW

Page 8: Open Source Search FTW

Lucene: Speed and Memory

• Native Near Real Time (NRT) support

- Per segment

- FieldCache can be controlled to only load new segments

- Soft commit: faster without fsync, allows quicker update visibility

• DWPT (Document Writer per Thread)

- Faster, more consistent indexing speed

• Faster fuzzy & wildcard query processing

• String -> BytesRef

- Much improved data structure

- … means less memory and less garbage collection effort

Page 9: Open Source Search FTW

Up and to the Right

• http://people.apache.org/~mikemccand/lucenebench/indexing.html

Page 10: Open Source Search FTW

Lucene: Flexibility

• Flexible Index Formats

- New posting list codecs: Block, Simple Text, Append (HDFS..), etc.

- Pulsing codec: improves performance of primary key searches by inlining docs, positions, and payloads, which saves disk seeks

• Pluggable Scoring

- Decoupled from TF/IDF

- Built-in alternatives include BM25, DFR, and others

» http://en.wikipedia.org/wiki/Okapi_BM25

» http://terrier.org/docs/v3.5/dfr_description.html

- Add your own
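To make the BM25 alternative concrete, here is a minimal standalone sketch of Okapi BM25 scoring in Python. It is not Lucene's implementation: the function name, the pre-tokenized input, the parameter values k1=1.2 and b=0.75, and the non-negative IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Assumed IDF variant that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["solr", "search"], ["lucene", "search", "search"], ["hadoop"]]
scores = bm25_scores(["search"], docs)
```

Swapping the body of the inner loop for another formula is the whole idea behind pluggable scoring: the index statistics stay the same, only the combination changes.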

Page 11: Open Source Search FTW

FS(A|T)

• Keys:

- byte[] – write-once

- Linear time build of min. automata (nlogn if not sorted)

- Compression

- Reverse lookups

- Weights (used for auto-suggest)

- Pluggable Algebra

• Uses:

- Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others

- FuzzyQuery is 100x faster: http://bit.ly/hgO65c

• More:

- http://slidesha.re/vKtpVA

- http://bit.ly/Pkjyu0

- “Smaller Representation of Finite State Automata”

» Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118–192.
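The idea of building a minimal automaton directly from sorted keys can be sketched in a few lines of Python. This is an illustration of the general incremental technique for sorted input, not Lucene's FST code; it handles plain string keys only (no outputs or weights), and every name in it is hypothetical.

```python
class State:
    def __init__(self):
        self.edges = {}     # label -> State
        self.final = False

    def signature(self):
        # Equivalent states agree on finality and on (label, target) pairs.
        # Targets are already canonical whenever this is called.
        return (self.final, tuple(sorted((c, id(t)) for c, t in self.edges.items())))


def build_minimal_automaton(sorted_keys):
    root, register = State(), {}
    unchecked = []          # (parent, label, child) along the previous key's path
    prev = ""

    def minimize(down_to):
        # Merge states deeper than `down_to` with equivalent registered states.
        while len(unchecked) > down_to:
            parent, label, child = unchecked.pop()
            parent.edges[label] = register.setdefault(child.signature(), child)

    for key in sorted_keys:
        if key < prev:
            raise ValueError("keys must be sorted")
        common = 0
        while common < min(len(key), len(prev)) and key[common] == prev[common]:
            common += 1
        minimize(common)    # the previous key's suffix can no longer change
        node = unchecked[-1][2] if unchecked else root
        for label in key[common:]:
            child = State()
            node.edges[label] = child
            unchecked.append((node, label, child))
            node = child
        node.final = True
        prev = key
    minimize(0)
    return root


def accepts(root, key):
    node = root
    for label in key:
        node = node.edges.get(label)
        if node is None:
            return False
    return node.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)


automaton = build_minimal_automaton(["tap", "taps", "top", "tops"])
```

For these four keys a plain trie needs 8 nodes, while the automaton shares the common "p", "ps" suffixes and ends up with 5 states. That suffix sharing is where the compression comes from.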

Page 12: Open Source Search FTW

Page 13: Open Source Search FTW

Solr 4: New Features

• Search/Faceting/Relevance

- New Relevance Function Queries (tf, df, others)

- Pivot Faceting

- Pseudo-join

- Improved Spatial (more later)

- Full support for Lucene Codecs, pluggable scoring

• Indexing

- New Update Processors, including scripting option

- Near real time

• Codec/Similarity support from Lucene 4

• Other

- New Admin UI

Page 14: Open Source Search FTW

Geospatial improvements

• Index shapes other than points (circles, polygons, etc.)

• More complex interactions than point in a circle

• Indexing:

- "geo":"43.17614,-90.57341"

- "geo":"Circle(4.56,1.23 d=0.0710)"

- "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"

• Searching:

- fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"

- fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
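The filter queries above contain spaces, quotes, and parentheses, so they need URL encoding before being sent to Solr. A small Python sketch of that step follows; the helper name and the match-all `q=*:*` default are assumptions, and only the `geo` field and the shapes come from the slide.

```python
from urllib.parse import urlencode, parse_qs


def spatial_query_string(field, shape, q="*:*"):
    """Build URL-encoded parameters that filter on documents intersecting `shape`."""
    return urlencode({"q": q, "fq": f'{field}:"Intersects({shape})"'})


bbox = spatial_query_string("geo", "-74.093 41.042 -69.347 44.558")
polygon = spatial_query_string(
    "geo", "POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
)
# Either string can be appended to a select handler URL, for example a
# hypothetical http://localhost:8983/solr/collection1/select?<params>
```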

Page 15: Open Source Search FTW

Scaling Solr

• Distributed/sharded indexing & search

- Auto distributes updates and queries to appropriate shards

- Near Real Time (NRT) indexing capable

• Dynamically scalable

- New SolrCloud instances add indexing and query capacity

- Supports re-balancing

• Reliable

- No single point of failure

- Transactions logged

- Robust, automatic recovery

• http://wiki.apache.org/solr/SolrCloud
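One way to picture "auto distributes updates to appropriate shards" is hash-range routing on the document id. The Python sketch below is only an illustration of that idea: it uses `zlib.crc32` as a stand-in hash, and SolrCloud's actual router differs in its hash function and details. All names are hypothetical.

```python
import zlib

HASH_SPACE = 2 ** 32  # crc32 yields an unsigned 32-bit value


def shard_for(doc_id, num_shards):
    """Map a document id to a shard by splitting the hash space into equal ranges."""
    h = zlib.crc32(doc_id.encode("utf-8"))
    return h * num_shards // HASH_SPACE


def route(doc_ids, num_shards):
    """Group document ids by the shard that would index them."""
    shards = {i: [] for i in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


placement = route([f"doc-{i}" for i in range(100)], num_shards=4)
```

Because the shard depends only on the id, any node can compute where an update belongs and forward it, and a lookup by id only needs to ask one shard.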

Page 16: Open Source Search FTW

New in 4.4 (just released)

• HDFS backed directory for storing index and transaction logs in Apache Hadoop

• New Core discovery capabilities

• Schemaless/External Schema/Field Guessing

• Schema APIs

• Add documents from the Admin UI

Page 17: Open Source Search FTW

Hacking Search Engines for Fun and Profit

Page 18: Open Source Search FTW

… Find your Keys, Store Your Content

• Lucene/Solr is a fast key-value store

- Bonus: search your values!

• NoSQL before NoSQL was cool

• Solr: distributed key/value

- Durable, Isolated, Redundant, Fast, Real-time

- Joins, Column Storage

• Solr or Tika + Lucene can index popular office formats

• Solr can back up, replicate, and scale as content grows

Page 19: Open Source Search FTW

… Find Love! Upsell! Cross-sell!

• Cross recommendation as search

- with search used to build cross recommendation!

• Recommend content to people who exhibit certain behaviors (clicks, query terms, other)

• (Ab)use of a search engine

- but not as a search engine for content

- more like a search engine for behavior

• See Ted Dunning’s talk from Berlin Buzzwords on Multi-modal Recommendation Algorithms

- http://berlinbuzzwords.com/sessions/multi-modal-recommendation-algorithms

• Go get Mahout/Myrrix or just do it in y(our) search engine
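A rough Python sketch of "a search engine for behavior": each item gets an indicator field listing items that co-occur with it in user histories, and a user's recent activity becomes the query against that field. The raw co-occurrence threshold here is a simplifying assumption standing in for a proper statistical test, and all function names and data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_indicator_index(histories, min_cooccurrence=2):
    """Map each item to the items it co-occurs with at least `min_cooccurrence` times."""
    pair_counts = Counter()
    for items in histories:
        for a, b in permutations(set(items), 2):
            pair_counts[(a, b)] += 1
    indicators = defaultdict(set)
    for (a, b), n in pair_counts.items():
        if n >= min_cooccurrence:
            indicators[a].add(b)
    return indicators


def recommend(indicators, recent_items, top_n=3):
    """Use `recent_items` as the query; rank items by matching indicator terms."""
    recent = set(recent_items)
    scores = Counter()
    for item, terms in indicators.items():
        if item in recent:
            continue  # do not recommend what the user already touched
        overlap = len(terms & recent)
        if overlap:
            scores[item] = overlap
    return [item for item, _ in scores.most_common(top_n)]


histories = [["a", "b", "c"], ["a", "b"], ["a", "b", "d"], ["c", "d"]]
index = build_indicator_index(histories)
```

In a real deployment the indicator set would be an indexed field on each item document and the overlap count would be replaced by the engine's own relevance score.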

Page 20: Open Source Search FTW

… Avoid Delays

Page 21: Open Source Search FTW

… Time travel?

• Leverage Solr’s new spatial capabilities to index non-spatial data, such as time ranges

- Useful for Open Hours, Shifts, etc.

• Query using rectangle intersections

- q = shift:"Intersects(0 19 23 365)"

https://people.apache.org/~hossman/spatial-for-non-spatial-meetup-20130117/
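Why a rectangle query answers a time-range question can be shown with a little arithmetic. The sketch below assumes each range is indexed as a point (x = start, y = end) and that the four query numbers read as minX minY maxX maxY; the shift names and values are made up for illustration.

```python
def overlaps_via_rectangle(start, end, q_start, q_end, min_value=0, max_value=365):
    """True if [start, end] overlaps [q_start, q_end].

    With the range stored as the point (start, end), overlap is the same as
    the point lying inside the rectangle
    x in [min_value, q_end], y in [q_start, max_value].
    """
    return min_value <= start <= q_end and q_start <= end <= max_value


# Hypothetical shifts as (start, end) values on a 0..365 scale.
shifts = {"early": (1, 18), "mid": (15, 20), "late": (24, 30), "long": (10, 300)}

# Equivalent of Intersects(0 19 23 365): anything overlapping the range 19..23.
hits = sorted(
    name for name, (s, e) in shifts.items() if overlaps_via_rectangle(s, e, 19, 23)
)
```

A range overlaps the query exactly when it starts no later than the query ends and ends no earlier than the query starts, which is what the two sides of the rectangle encode.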

Page 22: Open Source Search FTW

Boldly go forth and rank!

• Faster

• More Flexible

• Easier than ever scaling

• More reliable than ever

• Reduced cost of experimentation

Page 23: Open Source Search FTW

Where to Next?

• Lucene/Solr EU Conference:

- Dublin, IE, November 4-7: http://lucenerevolution.org/

- CFP Open Now

• Lucene/Solr

- http://lucene.apache.org

- {java-user|solr-user}@lucene.apache.org

- SIGIR ’12 Open Source Workshop

» http://opensearchlab.otago.ac.nz/paper_10.pdf

• LucidWorks

- http://www.lucidworks.com

- Commercial support, products, etc. for Lucene/Solr

• Me

- [email protected]

- @gsingers on Twitter

- “Taming Text” – Engineer’s guide to open source search and NLP

» http://www.manning.com/ingersoll

Page 24: Open Source Search FTW

Credits

• All of the Lucene/Solr committers and contributors

• Polar bear: http://gaijinexplorer.blogspot.ie/2012/12/its-all-just-relaxing.html

• Volunteers: http://www.poconohealthsystem.org/?id=228&sid=1

• Not Hiring: http://naijaguardianjobs.com/wp-content/uploads/2013/03/Not-Hiring-The-American.jpg

• Keys: http://www.flickr.com/photos/crazyneighborlady/355232758/

• Love: http://www.msruntheus.com/above-all-love-each-other-deeply/

• TARDIS: http://2.bp.blogspot.com/-ysN8JskY4WM/UEZNhBywQKI/AAAAAAAABdg/gXE0A9OO6Mk/s1600/13881_doctor_who.jpg