Simple Fuzzy Name Matching in Solr

Simple Fuzzy Name Matching in Solr March 5, 2015 David Murgatroyd & Brian Sawyer (VP Engineering & Engineering Manager)

Transcript of Simple Fuzzy Name Matching in Solr

Page 1: Simple Fuzzy Name Matching in Solr

Simple Fuzzy Name Matching in Solr

March 5, 2015

David Murgatroyd & Brian Sawyer

(VP Engineering & Engineering Manager)

Page 2: Simple Fuzzy Name Matching in Solr

Quick survey: How many of us...

● Have ever indexed something into Solr?
● Have seen a Solr Admin interface?
● Regularly develop Solr applications?
● Develop Solr applications that include names?
● Have wondered how to fuzzy search those names?

Page 3: Simple Fuzzy Name Matching in Solr

Motivating Questions...

● How could CBP know whether you’re on a terrorist watch list?

● How does your bank know if you’re wiring money to a drug lord?

● How does Airbnb know that’s really your driver’s license?

Page 4: Simple Fuzzy Name Matching in Solr

Answer...

Name Matching (plus more)

Page 5: Simple Fuzzy Name Matching in Solr

Review of Basic Solr

● Indexing: Doc (name:"Robert Smith", dob:2/13/1987) → Index
● Query: q=name:"Bob Smitty" → Index → name:"Robert Smith", dob:2/13/1987, score: .79

Page 6: Simple Fuzzy Name Matching in Solr

Where does Solr fit?

● Indexing: Doc (Terrorist) → Index
● Query: Air Traveler Name → Index → Terrorist, score: .79

Page 7: Simple Fuzzy Name Matching in Solr

Where does Solr fit?

● Indexing: Doc (Sanctioned Drug Lord) → Index
● Query: Wire Transfer Beneficiary → Index → Drug Lord, score: .79

Page 8: Simple Fuzzy Name Matching in Solr

Where does Solr fit?

Name on your account

Name off your license

score: .79

Page 9: Simple Fuzzy Name Matching in Solr

What kinds of name variation?

Page 10: Simple Fuzzy Name Matching in Solr

Best Practice: field per variation type?

Page 11: Simple Fuzzy Name Matching in Solr

But what if variations co-occur?

“Jesus A. Lopez Diaz” v. “LobezDeaz, Chuy”

● Reordered.
● Missing initial.
● Two spelling differences.
● Nickname for first name.
● Missing space.

Page 12: Simple Fuzzy Name Matching in Solr

Can’t a name field type do this? Like…

● Contribute score that reflects phenomena.
● Be part of queries using many field types.
● Have multiple fields per document.
● Have multiple values per field.

Page 13: Simple Fuzzy Name Matching in Solr

Demo

Page 14: Simple Fuzzy Name Matching in Solr

How could you use such a Field?

● Plugin contains custom field type which does all the work behind the scenes

● Simple change to schema.xml to include new fieldType

<fieldType name="rni_name" class="com.basistech.rni.solr.NameField"/>

<field name="primaryName" type="rni_name" indexed="true" stored="true" multiValued="false"/>

<field name="aka" type="rni_name" indexed="true" stored="true" multiValued="true"/>
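As an illustration of how documents might be shaped for the two fields above, here is a minimal sketch that builds a JSON update payload with a single-valued primaryName and a multi-valued aka. The `id` field, the helper name `make_person_doc`, and the sample aliases are assumptions for illustration, not part of the plug-in.

```python
import json

def make_person_doc(doc_id, primary_name, aliases=()):
    """Build one document for the schema fields declared above.

    primaryName is single-valued (multiValued="false");
    aka is a list because it is declared multiValued="true".
    The "id" field is an assumed unique key, not taken from the slides.
    """
    return {"id": doc_id, "primaryName": primary_name, "aka": list(aliases)}

# A JSON array of documents, the shape Solr's JSON update handler accepts.
payload = json.dumps([
    make_person_doc("1", "Robert Smith", ["Bob Smith", "Rob Smith"]),
])
```

Because both fields share the rni_name type, each alias in aka would be matched the same way as the primary name.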

Page 15: Simple Fuzzy Name Matching in Solr

What happens at index time?

● NameField indexes keys for different phenomena in separate (sub) fields
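The slides do not say which keys NameField actually generates, so the following is only a toy sketch of the general idea: derive several kinds of keys from one name and store each kind in its own sub-field. The key types (tokens, initials, consonant skeletons), the sub-field naming, and both function names are hypothetical. The regex only handles unaccented Latin letters.

```python
import re

def name_keys(name):
    """Derive toy keys for three kinds of variation (all hypothetical)."""
    tokens = re.findall(r"[a-z]+", name.lower())
    skeletons = []
    for token in tokens:
        # Keep the first letter, drop later vowels, collapse repeated letters,
        # so that close spellings such as Smith/Smyth share a key.
        skeleton = token[0] + re.sub(r"[aeiouy]", "", token[1:])
        skeletons.append(re.sub(r"(.)\1+", r"\1", skeleton))
    return {
        "tokens": sorted(set(tokens)),               # order-independent match
        "initials": sorted({t[0] for t in tokens}),  # initial vs. full name
        "skeletons": sorted(set(skeletons)),         # spelling differences
    }

def expand_document(doc, field):
    """Return a copy of doc with one extra sub-field per key type."""
    expanded = dict(doc)
    for key_type, values in name_keys(doc[field]).items():
        expanded[f"{field}_{key_type}"] = values
    return expanded

indexed = expand_document({"name": "Robert Smith", "dob": "2/13/1987"}, "name")
```

The user's original fields are left untouched; only the extra sub-fields are added, which mirrors the name_Key1, name_Key2, name_Key3 layout in the indexing diagram.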

Page 16: Simple Fuzzy Name Matching in Solr

Indexing

● User Doc: name:"Robert Smith", dob:2/13/1987
● Plug-in Implementation: name:"Robert Smith", name_Key1:…, name_Key2:…, name_Key3:…, dob:2/13/1987 → Index

Page 17: Simple Fuzzy Name Matching in Solr

What happens at query time?

● Step #1: NameField generates analogous keys for a custom Lucene query that finds good candidates for re-ranking

Page 18: Simple Fuzzy Name Matching in Solr

What else happens at query time?

● Step #2: Uses Solr’s Rerank feature to rescore names in top documents and reorder accordingly
○ &rq={!rniRerank reRankQuery=$rrq} &rrq={!func}rniMatch(fieldName, "John Doe")
○ Tuned for high precision
○ Requires small addition to solrconfig.xml

<queryParser name="rniRerank" class="com.basistech.rni.solr.RNIReRankQParserPlugin"/>

<valueSourceParser name="rniMatch" class="com.basistech.rni.solr.NameMatchValueSourceParser"/>
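To show how the rq and rrq parameters from this slide could be assembled by a client, here is a small sketch that builds and URL-encodes them. The helper name `build_rerank_params`, the default main query, and the quote-escaping rule are assumptions. Only the `rniRerank` parser name, the `rniMatch` function, and the parameter layout come from the slide.

```python
from urllib.parse import urlencode, parse_qs

def build_rerank_params(field_name, query_name, main_query=None):
    """Assemble request parameters for a name query plus a rerank pass."""
    # Escape backslashes and double quotes so the name stays one quoted argument.
    quoted = query_name.replace("\\", "\\\\").replace('"', '\\"')
    return {
        # Main query: by default, search the name field directly (assumption).
        "q": main_query or f'{field_name}:"{quoted}"',
        # Rerank query parser, pointing at the rrq parameter by reference.
        "rq": "{!rniRerank reRankQuery=$rrq}",
        # Function query that computes the name-match score.
        "rrq": f'{{!func}}rniMatch({field_name}, "{quoted}")',
    }

params = build_rerank_params("primaryName", "John Doe")
query_string = urlencode(params)  # ready to append to a /select URL
```

Passing the rerank query through `$rrq` keeps the braces and quotes of the function query out of the local-params block, so only URL encoding is needed.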

Page 19: Simple Fuzzy Name Matching in Solr

● Indexing (User Doc): name:"Robert Smith", dob:2/13/1987
● Indexing (Plug-in Implementation): name:"Robert Smith", name_Key1:…, name_Key2:…, name_Key3:…, dob:2/13/1987 → Index
● Main Query (User Query): q=name:"Bob Smitty" → booleanQuery: name_Key1:..., name_Key2:..., name_Key3:...
● Rerank Query (Reranker): rniMatch(name, "Bob Smitty")
● Result: name:"Robert Smith", dob:2/13/1987, score: .79

Page 20: Simple Fuzzy Name Matching in Solr

Trading Off Accuracy for Speed

High-Recall Query (Solr) → Subset of High-Recall Results (Score > reRankScoreThreshold & Total < reRankDocs) → ReRank Rescoring Query → Scored Results

Page 21: Simple Fuzzy Name Matching in Solr

Rerank Params - Speed v. Accuracy

● reRankScoreThreshold - Added by Us
○ Score threshold top doc must meet to be rescored
○ Tradeoff accuracy vs speed
● reRankDocs
○ Controls how many of the top documents to rescore
○ Tradeoff accuracy vs speed
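The effect of these two parameters can be sketched as a plain two-stage ranking function. This is a simplified model, not the plug-in's code: the function name, the tuple layout, and the choice to list rescored documents ahead of the untouched tail are assumptions.

```python
def rerank(candidates, rescore, re_rank_docs, re_rank_score_threshold):
    """Rescore only the top first-pass hits that clear the threshold.

    candidates: list of (doc_id, first_pass_score) pairs.
    rescore:    callable doc_id -> expensive name-match score.
    Returns (ranked list, number of documents actually rescored).
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    head, tail = [], []
    for doc_id, score in ranked:
        # Both limits must hold: at most re_rank_docs documents are rescored,
        # and each must score above re_rank_score_threshold in the first pass.
        if len(head) < re_rank_docs and score > re_rank_score_threshold:
            head.append((doc_id, rescore(doc_id)))
        else:
            tail.append((doc_id, score))
    head.sort(key=lambda c: c[1], reverse=True)
    return head + tail, len(head)

first_pass = [("a", 0.9), ("b", 0.8), ("c", 0.5), ("d", 0.2)]
name_scores = {"a": 0.3, "b": 0.95, "c": 0.7, "d": 0.99}
results, rescored = rerank(first_pass, name_scores.get, 3, 0.6)
```

In this toy run only "a" and "b" are rescored. Document "d" would have scored highest on the name match but is never examined, which is the accuracy cost paid for fewer expensive comparisons.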

Page 22: Simple Fuzzy Name Matching in Solr

Rerank Params - Integration w/Query

● reRankQuery
○ Calls the NameMatch function to get score
○ Can query multiple names or other fields
● reRankWeight
○ Controls how much weight is given to name score vs main query
○ Allows user to include queries on other non-name fields
● reRankMode - Added by Us
○ Controls how the rerank score should be combined with main query score
○ Currently 'add' or 'replace'
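One plausible reading of how reRankWeight and reRankMode interact is sketched below. The 'add' branch follows the usual Solr rerank formula (main score plus weight times rerank score). The 'replace' branch, which keeps only the name score, is an assumption about the plug-in's behavior, and the function name is hypothetical.

```python
def combine_scores(main_score, name_score, re_rank_weight=1.0, re_rank_mode="add"):
    """Combine the main query score with the name-match score."""
    if re_rank_mode == "add":
        # Main query still contributes; the weight scales the name score.
        return main_score + re_rank_weight * name_score
    if re_rank_mode == "replace":
        # Assumed behavior: the name score alone decides the final ranking.
        return name_score
    raise ValueError(f"unsupported reRankMode: {re_rank_mode!r}")

added = combine_scores(0.4, 0.79, re_rank_weight=2.0, re_rank_mode="add")
replaced = combine_scores(0.4, 0.79, re_rank_mode="replace")
```

Under this reading, 'add' suits queries that also constrain non-name fields, since their contribution survives in the final score, while 'replace' yields a score that depends on the name match only.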

Page 23: Simple Fuzzy Name Matching in Solr

Summary: How it works

● Custom field type
○ Splits a single field into multiple fields covering different phenomena
○ Supports multiple name fields in a document as well as multivalued fields
○ Intercepts the query to inject a custom Lucene query
● Custom rerank function
○ Rescores documents with algorithm specific to name matching
○ Limits costly calculations to only top candidates
○ Highly configurable

Page 24: Simple Fuzzy Name Matching in Solr

Suggested Questions:

● Thank David Smiley for helping? (Yes!)
● What if the names are in other text fields?
● What about support in Solr 5.0?
● How did you implement multi-valued fields?
● What about support in ElasticSearch?
● How does it scale?
● How do you handle names not in English?
● How does this relate to the theme of Entity-Centric Search?
● How do the plug-in’s scores relate to Solr scores?