Advanced Relevancy Ranking

Paul Nelson, Chief Architect / Search Technologies

description

Lucene and Solr provide a number of options for query parsing, and these are valuable tools for creating powerful search applications. This presentation, given at Lucene Revolution 2013, reviews the role that advanced query parsing can play in building search systems, including relevancy customization; taking input from user interface variables such as position on a website or geographical indicators; selecting which sources are to be searched; and drawing on third-party data sources. Query parsing can also enhance data security. Best practices for building and maintaining complex query parsing rules are discussed and illustrated. The presenter is Paul Nelson, Chief Architect at Search Technologies. Search Technologies provides relevancy tuning services for Solr. For further information, see http://www.searchtechnologies.com/solr-lucene-relevancy.html and http://www.searchtechnologies.com

Transcript of Advanced Relevancy Ranking

Page 1: Advanced Relevancy Ranking

Advanced Relevancy Ranking

Paul Nelson, Chief Architect / Search Technologies

Page 2: Advanced Relevancy Ranking

Search Technologies Overview

• Formed June 2005
• Over 100 employees and growing
• Over 400 customers worldwide
• Presence in US, Latin America, UK & Germany
• Deep enterprise search expertise
• Consistent revenue growth and profitability
• Search Engine Independent

Page 3: Advanced Relevancy Ranking

Lucene Relevancy: Simple Operators

• term(A) -> TF(A) * IDF(A)
  • Implemented with DefaultSimilarity / TermQuery
  • TF(A) = sqrt(termInDocCount)
  • IDF(A) = log(totalDocsInCollection/(docsWithTermCount+1)) + 1.0
• and(A, B) -> A * B
  • Implemented with BooleanQuery()
• or(A, B) -> A + B
  • Implemented with BooleanQuery()
• max(A, B) -> max(A, B)
  • Implemented with DisjunctionMaxQuery()
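The TF and IDF formulas on this slide can be tried out with a few lines of Python. This is a minimal sketch of the slide's simplified arithmetic only, not Lucene's scoring code: real Lucene scoring applies additional normalization factors that the slide leaves out, and the document counts used below are made-up numbers for illustration.

```python
import math

def tf(term_in_doc_count):
    """Term frequency factor: square root of the term's count in the document."""
    return math.sqrt(term_in_doc_count)

def idf(total_docs_in_collection, docs_with_term_count):
    """Inverse document frequency, using the natural logarithm."""
    return math.log(total_docs_in_collection / (docs_with_term_count + 1)) + 1.0

def term_score(term_in_doc_count, total_docs_in_collection, docs_with_term_count):
    """term(A) = TF(A) * IDF(A), as written on the slide."""
    return tf(term_in_doc_count) * idf(total_docs_in_collection, docs_with_term_count)

# Hypothetical collection of 1000 documents; both terms occur 4 times in the document.
rare = term_score(4, 1000, 9)      # term found in only 9 documents
common = term_score(4, 1000, 499)  # term found in 499 documents

print(rare > common)  # a rare term contributes more than a common one
```

With the term frequency held equal, the rarer term receives the higher score, which is the behavior the IDF factor is there to provide.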


Page 4: Advanced Relevancy Ranking

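Simple Operators - Example

Query tree: and(or(george, martha), max(washington, custis))

Term scores: george = 0.10, martha = 0.20, washington = 0.60, custis = 0.90

or(george, martha) = 0.10 + 0.20 = 0.30
max(washington, custis) = max(0.60, 0.90) = 0.90
and(or, max) = 0.30 * 0.90 = 0.27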
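The arithmetic of this example can be checked with a short Python sketch. The functions below model only the simplified operator semantics from the slides (and = product, or = sum, max = maximum); they are stand-ins for illustration and do not call Lucene.

```python
def and_(*scores):
    """and(A, B): product of the clause scores (slide model)."""
    product = 1.0
    for score in scores:
        product *= score
    return product

def or_(*scores):
    """or(A, B): sum of the clause scores (slide model)."""
    return sum(scores)

def max_(*scores):
    """max(A, B): the highest clause score."""
    return max(scores)

# Term scores taken from the example slide.
term = {"george": 0.10, "martha": 0.20, "washington": 0.60, "custis": 0.90}

left = or_(term["george"], term["martha"])        # 0.10 + 0.20
right = max_(term["washington"], term["custis"])  # max(0.60, 0.90)
total = and_(left, right)                         # left * right

print(round(left, 2), round(right, 2), round(total, 2))
```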

Page 5: Advanced Relevancy Ranking

Less Used Operators

• boost(f, A) -> (A * f)
  • Implemented with Query.setBoost(f)
• constant(f, A) -> if(A) then f else 0.0
  • Implemented with ConstantScoreQuery()
• boostPlus(A, B) -> if(A) then (A + B) else 0.0
  • Implemented with BooleanQuery()
• boostMul(f, A, B) -> if(B) then (A * f) else A
  • Implemented with BoostingQuery()
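The four definitions above translate directly into small functions. In this sketch a score of 0.0 stands for "the clause did not match", which is an assumption made for illustration; the function names mirror the slide, not any Lucene API.

```python
def boost(f, a):
    """boost(f, A): scale A's score by the factor f."""
    return a * f

def constant(f, a):
    """constant(f, A): a fixed score f whenever A matches, else 0.0."""
    return f if a > 0.0 else 0.0

def boost_plus(a, b):
    """boostPlus(A, B): A is required; B adds to the score when it also matches."""
    return a + b if a > 0.0 else 0.0

def boost_mul(f, a, b):
    """boostMul(f, A, B): multiply A's score by f only when B matches."""
    return a * f if b > 0.0 else a

print(boost(2.0, 0.5))           # A scaled by 2
print(constant(3.0, 0.2))        # match: fixed score
print(boost_plus(0.5, 0.25))     # both match: scores add
print(boost_plus(0.0, 0.25))     # A missing: no match at all
print(boost_mul(0.5, 0.8, 0.1))  # B matches: A is scaled by f
print(boost_mul(0.5, 0.8, 0.0))  # B absent: A unchanged
```

Note the asymmetry in boostPlus: B alone never produces a match, so B can raise the score of a result but cannot add results.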


Page 6: Advanced Relevancy Ranking

Problem: Need for More Flexibility

• Difficult / impossible to use all operators
  • Many not available in standard query parsers
• Complex expressions = string manipulation
  • This is messy
• Query construction is in the application layer
  • Your UI programmer is creating query expressions?
  • Seriously?
• Hard to create and use new operators
  • Requires modifying query parsers - yuck


Page 7: Advanced Relevancy Ranking

Query Processing Language

[Diagram. Labels: Solr, User Interface, QPL Engine, Search, QPL Script]

Page 8: Advanced Relevancy Ranking

Introducing: QPL

• Query Processing Language
  • Domain Specific Language for Constructing Queries
  • Built on Groovy
  • https://wiki.searchtechnologies.com/index.php/QPL_Home_Page
• Solr Plug-Ins
  • Query Parser
  • Search Component
• “The 4GL for Text Search Query Expressions”
  • Server-side Solr Access
    • Cores, Analyzers, Embedded Search, Results XML


Page 9: Advanced Relevancy Ranking

Solr Plug-Ins

Page 10: Advanced Relevancy Ranking

QPL Configuration – solrconfig.xml

Query Parser Configuration:

<queryParser name="qpl" class="com.searchtechnologies.qpl.solr.QPLSolrQParserPlugin">
  <str name="scriptFile">parser.qpl</str>
  <str name="defaultField">text</str>
</queryParser>

Search Component Configuration:

<searchComponent name="qplSearchFirst" class="com.searchtechnologies.qpl.solr.QPLSearchComponent">
  <str name="scriptFile">search.qpl</str>
  <str name="defaultField">text</str>
  <str name="isProcessScript">false</str>
</searchComponent>

Page 11: Advanced Relevancy Ranking

QPL Example #1

Tokenize:
myTerms = solr.tokenize(query);

Phrase Query:
phraseQ = phrase(myTerms);

And Query:
andQ = and(myTerms);

Or Query:
orQ = (myTerms.size() <= 2) ? null : orMin( (myTerms.size()+1)/2, myTerms);

Put It All Together:
return phraseQ^3.0 | andQ^2.0 | orQ;
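The "Or Query" step in this example is the least obvious one, so here is a rough Python model of it. It assumes that orMin(n, terms) matches a document when at least n of the terms are present (inferred from the operator's name; the slides do not define it) and that the threshold expression is truncated to an integer. Both function names are hypothetical.

```python
def or_min_threshold(num_terms):
    """Minimum number of terms that must match, or None for short queries.

    Mirrors the slide's expression: no or-query for two or fewer terms,
    otherwise about half the terms (assumed here to use integer division).
    """
    if num_terms <= 2:
        return None
    return (num_terms + 1) // 2

def matches_or_min(min_count, query_terms, doc_terms):
    """True when at least min_count of the query terms occur in the document."""
    present = sum(1 for term in query_terms if term in doc_terms)
    return present >= min_count

query_terms = ["advanced", "relevancy", "ranking", "solr", "lucene"]
threshold = or_min_threshold(len(query_terms))  # 3 of the 5 terms

doc = {"relevancy", "ranking", "lucene", "tutorial"}
print(threshold, matches_or_min(threshold, query_terms, doc))
```

Under these assumptions a two-word query relies on the phrase and and-queries alone, while longer queries also admit documents that contain at least half of the terms.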

Page 12: Advanced Relevancy Ranking

Thesaurus Example #2

Tokenize:
myTerms = solr.tokenize(query);

Load Thesaurus: (cached)
thes = Thesaurus.load("thesaurus.xml")

Thesaurus Expansion:
thesQ = thes.expand(0.8f, solr.tokenizer("text"), myTerms);

Put It All Together:
return and(thesQ);

Original Query: bathroom humor
[or(bathroom, loo^0.8, wc^0.8), or(humor, jokes^0.8)]
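To make the expansion step concrete, the sketch below reproduces the expanded list shown above in plain Python. The in-memory dictionary is a hypothetical stand-in for thesaurus.xml (whose format the slides do not show), the output is built as strings rather than query objects, and leaving terms without synonyms unchanged is an assumption.

```python
# Hypothetical thesaurus content: each term maps to its synonyms.
THESAURUS = {
    "bathroom": ["loo", "wc"],
    "humor": ["jokes"],
}

def expand(weight, thesaurus, terms):
    """Replace each term with an or() of the term and its down-weighted synonyms."""
    expanded = []
    for term in terms:
        synonyms = thesaurus.get(term, [])
        if not synonyms:
            expanded.append(term)  # assumption: terms without synonyms stay as-is
            continue
        alternatives = [term] + [f"{synonym}^{weight}" for synonym in synonyms]
        expanded.append("or(" + ", ".join(alternatives) + ")")
    return expanded

print(expand(0.8, THESAURUS, ["bathroom", "humor"]))
```

Because the synonyms carry a weight below 1.0, a document containing the user's original word scores higher than one that only contains a synonym.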

Page 13: Advanced Relevancy Ranking

More Operators

Boolean Query Parser:
pQ = parseQuery("(george or martha) near/5 washington")

Relevancy Ranking Operators:
q1 = boostPlus(query, optionalQ)
q2 = boostMul(0.5, query, optionalQ)
q3 = constant(0.5, query)

Composite Queries:
compQ = and(compositeMax(["title":1.5, "body":0.8], "george", "washington"))

Page 14: Advanced Relevancy Ranking

News Feed Use Case

Order  Documents           Date
1      markets+terms       Today
2      markets             Today
3      terms               Today
4      companies           Today
5      markets+terms       Yesterday
6      markets             Yesterday
7      terms               Yesterday
8      companies           Yesterday
9      markets, companies  older

Page 15: Advanced Relevancy Ranking

News Feed Use Case – Step 1

Segments:
markets = split(solr.markets, "\\s*;\\s*")
marketsQ = field("markets", or(markets));

Terms:
terms = solr.tokenize(query);
termsQ = field("body", or(thesaurus.expand(0.9f, terms)))

Companies:
compIds = split(solr.compIds, "\\s*;\\s*")
compIdsQ = field("companyIds", or(compIds))

Page 16: Advanced Relevancy Ranking

News Feed Use Case – Step 2

import java.text.SimpleDateFormat  // not among Groovy's default imports

sdf = new SimpleDateFormat("yyyy-MM-dd")
c = Calendar.getInstance()

Today:
todayDate = sdf.format(c.getTime())
todayQ = field("date_s", todayDate)

Yesterday:
c.add(Calendar.DAY_OF_MONTH, -1)
yesterdayDate = sdf.format(c.getTime())
yesterdayQ = field("date_s", yesterdayDate)

Page 17: Advanced Relevancy Ranking

News Feed Use Case

Order  Documents           Date
1      markets+terms       Today
2      markets             Today
3      terms               Today
4      companies           Today
5      markets+terms       Yesterday
6      markets             Yesterday
7      terms               Yesterday
8      companies           Yesterday
9      markets, companies  older

Page 18: Advanced Relevancy Ranking

News Feed Use Case – Step 3

Weighted Subject Queries:
sq1 = constant(4.0, and(marketsQ, termsQ))
sq2 = constant(3.0, marketsQ)
sq3 = constant(2.0, termsQ)
sq4 = constant(1.0, compIdsQ)
subjectQ = max(sq1, sq2, sq3, sq4)

Weighted Time Queries:
tq1 = constant(10.0, todayQ)
tq2 = constant(1.0, yesterdayQ)
timeQ = max(tq1, tq2)

Put it All Together:
recentQ = and(subjectQ, timeQ)
return max(recentQ, or(marketsQ, compIdsQ)^0.01)
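It is worth checking that these weights really produce the nine-row ordering from the use case table. The Python sketch below does so using the deck's simplified operator model (and = product, max = maximum, constant = fixed score on match). The document records are invented for illustration, and the raw match score of 1.0 used for the low-weight fallback clause is an assumption.

```python
def constant(f, matched):
    """constant(f, A): fixed score f when A matches, else 0.0."""
    return f if matched else 0.0

def score(doc):
    """Score one document with the Step 3 expression (slide model)."""
    markets, terms, companies = doc["markets"], doc["terms"], doc["companies"]
    subject = max(
        constant(4.0, markets and terms),
        constant(3.0, markets),
        constant(2.0, terms),
        constant(1.0, companies),
    )
    time = max(
        constant(10.0, doc["date"] == "today"),
        constant(1.0, doc["date"] == "yesterday"),
    )
    recent = subject * time  # and(subjectQ, timeQ) modeled as a product
    # or(marketsQ, compIdsQ)^0.01, assuming a raw score of 1.0 per matching clause
    fallback = 0.01 * ((1.0 if markets else 0.0) + (1.0 if companies else 0.0))
    return max(recent, fallback)

def make_doc(name, date, markets=False, terms=False, companies=False):
    return {"name": name, "date": date, "markets": markets,
            "terms": terms, "companies": companies}

# One invented document per row of the table, listed in the expected order.
docs = [
    make_doc("row1", "today", markets=True, terms=True),
    make_doc("row2", "today", markets=True),
    make_doc("row3", "today", terms=True),
    make_doc("row4", "today", companies=True),
    make_doc("row5", "yesterday", markets=True, terms=True),
    make_doc("row6", "yesterday", markets=True),
    make_doc("row7", "yesterday", terms=True),
    make_doc("row8", "yesterday", companies=True),
    make_doc("row9", "older", markets=True),
]

ranked = sorted(reversed(docs), key=score, reverse=True)
print([doc["name"] for doc in ranked])
```

The time weight of 10.0 is what keeps every "today" row above every "yesterday" row: the weakest subject match today scores higher than the strongest subject match yesterday. Older documents are reachable only through the fallback clause, whose 0.01 weight keeps them at the bottom as long as their raw scores stay small.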

Page 19: Advanced Relevancy Ranking

Embedded Search Example #1

Tokenize:
qTerms = solr.tokenize(query);

Execute an Embedded Search:
results = solr.search('subjectsCore', or(qTerms), 50)

Create a query from the results:
subjectsQ = or(results*.subjectId)

Put it all together:
return field("title", and(qTerms)) | subjectsQ^0.9;

Page 20: Advanced Relevancy Ranking

Embedded Search Example #2

Tokenize:
qTerms = solr.tokenize(query);

Execute an Embedded Search:
results = solr.search('categories', and(qTerms), 10)

Create a Solr named list:
myList = solr.newList();
myList.add("relatedCategories", results*.title);

Add it to the XML response:
solr.addResponse(myList)

Page 21: Advanced Relevancy Ranking

Other Features

• Embedded Grouping Queries
  • Oh yes they did!
• Proximity operators
  • ADJ, NEAR/#, BEFORE/#
• Reverse Lemmatizer
  • Prefers exact matches over variants
• Transformer
  • Applies transformations recursively to query trees


Page 22: Advanced Relevancy Ranking

Query Processing Language

[Diagram. Labels: Solr, User Interface, QPL Engine, Search, QPL Script, Data as entered by user, Boolean Query Expression, Application Dev Team, Search Team]

Page 23: Advanced Relevancy Ranking

QPL: Using External Sources to Build Queries

[Diagram. Labels: Solr, User Interface, QPL Engine, Search, QPL Script, RDBMS, Other Indexes, Thesaurus]

Page 24: Advanced Relevancy Ranking

CONTACT

Paul Nelson