Simple Fuzzy Name Matching in Solr
March 5, 2015
David Murgatroyd & Brian Sawyer
(VP Engineering & Engineering Manager)
Quick survey: How many of us...
● Have ever indexed something into Solr?
● Have seen a Solr Admin interface?
● Regularly develop Solr applications?
● Develop Solr applications that include names?
● Have wondered how to fuzzy search those names?
Motivating Questions...
● How could CBP know whether you’re on a terrorist watch list?
● How does your bank know if you’re wiring money to a drug lord?
● How does Airbnb know that’s really your driver’s license?
Answer...
Name Matching (plus more)
Review of Basic Solr
[Diagram. Indexing: Doc (name:"Robert Smith", dob:2/13/1987) -> Index. Query: q=name:"Bob Smitty" -> Index -> result (name:"Robert Smith", dob:2/13/1987, score: .79)]
Where does Solr fit?
[Diagram. Indexing: Doc (Terrorist) -> Index. Query: Air Traveler Name -> Index -> result (Terrorist, score: .79)]
Where does Solr fit?
[Diagram. Indexing: Doc (Sanctioned Drug Lord) -> Index. Query: Wire Transfer Beneficiary -> Index -> result (Drug Lord, score: .79)]
Where does Solr fit?
[Diagram. Name on your account, Name off your license, score: .79]
What kinds of name variation?
Best Practice: field per variation type?
But what if variations co-occur?
“Jesus A. Lopez Diaz” v. “LobezDeaz, Chuy”
● Reordered.
● Missing initial.
● Two spelling differences.
● Nickname for first name.
● Missing space.
Can’t a name field type do this? Like…
● Contribute score that reflects phenomena.
● Be part of queries using many field types.
● Have multiple fields per document.
● Have multiple values per field.
Demo
How could you use such a Field?
● Plugin contains custom field type which does all the work behind the scenes
● Simple change to schema.xml to include new fieldType
<fieldType name="rni_name" class="com.basistech.rni.solr.NameField"/>
<field name="primaryName" type="rni_name" indexed="true" stored="true" multiValued="false"/>
<field name="aka" type="rni_name" indexed="true" stored="true" multiValued="true"/>
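To make the schema change concrete, here is a minimal sketch of a document that would fit those two fields, serialized as a JSON array of the kind Solr's JSON update handler accepts. Only the field names primaryName and aka come from the schema lines above; the "id" field, the sample names, and the helper to_update_payload are assumptions made up for illustration.

```python
import json

# Hypothetical record. Only primaryName and aka come from the schema snippet;
# the "id" field and the sample values are assumptions.
docs = [
    {
        "id": "1",
        "primaryName": "Robert Smith",      # multiValued="false": a single value
        "aka": ["Bob Smith", "Rob Smith"],  # multiValued="true": a list of values
    },
]


def to_update_payload(documents):
    """Serialize documents as a JSON array for posting to Solr's update handler."""
    for doc in documents:
        # Guard against sending several values to the single-valued field.
        if isinstance(doc.get("primaryName"), list):
            raise ValueError("primaryName is single-valued in the schema")
    return json.dumps(documents)


payload = to_update_payload(docs)
print(payload)
```

Nothing in the document itself is name-specific: the field type declared in schema.xml is what would trigger the plug-in's behavior.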
What happens at index time?
● NameField indexes keys for different phenomena in separate (sub) fields
[Diagram. User Doc (name:"Robert Smith", dob:2/13/1987) -> Indexing (Plug-in Implementation) -> Index entry (name:"Robert Smith", name_Key1:…, name_Key2:…, name_Key3:…, dob:2/13/1987)]
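The idea of expanding one name field into several key sub-fields can be sketched roughly as below. This is a toy illustration only: the slides do not say what the plug-in's keys contain, so the three key functions here (sorted tokens, initials, consonant skeletons) are invented stand-ins, and expand is a hypothetical helper, not part of NameField.

```python
import re


def key_tokens(name):
    """Hypothetical key 1: lowercased tokens, sorted so word order is ignored."""
    return sorted(re.findall(r"[a-z]+", name.lower()))


def key_initials(name):
    """Hypothetical key 2: first letter of each token."""
    return sorted(tok[0] for tok in key_tokens(name))


def key_skeleton(name):
    """Hypothetical key 3: each token with vowels after its first letter dropped."""
    return sorted(tok[0] + re.sub(r"[aeiou]", "", tok[1:]) for tok in key_tokens(name))


def expand(doc, field):
    """Return a copy of doc with <field>_Key1..3 sub-fields added next to field."""
    expanded = dict(doc)
    name = doc[field]
    expanded[field + "_Key1"] = key_tokens(name)
    expanded[field + "_Key2"] = key_initials(name)
    expanded[field + "_Key3"] = key_skeleton(name)
    return expanded


user_doc = {"name": "Robert Smith", "dob": "2/13/1987"}
index_doc = expand(user_doc, "name")
print(index_doc)
```

The original name and the non-name fields pass through unchanged; only the extra sub-fields are added, which is the shape the index entry in the slide suggests.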
What happens at query time?
● Step #1: NameField generates analogous keys for a custom Lucene query that finds good candidates for re-ranking
What else happens at query time?
● Step #2: Uses Solr’s Rerank feature to rescore names in top documents and reorder accordingly
  ○ &rq={!rniRerank reRankQuery=$rrq} &rrq={!func}rniMatch(fieldName, "John Doe")
  ○ Tuned for high precision
  ○ Requires small addition to solrconfig.xml
<queryParser name="rniRerank" class="com.basistech.rni.solr.RNIReRankQParserPlugin"/>
<valueSourceParser name="rniMatch" class="com.basistech.rni.solr.NameMatchValueSourceParser"/>
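A client might assemble the request parameters for the two query-time steps along these lines. This is a sketch under assumptions: the rq and rrq values follow the slide, while the helper build_params, the field name "name", and the sample query are illustrative, and the helper assumes the name contains no double quotes.

```python
from urllib.parse import parse_qs, urlencode


def build_params(field, query_name):
    """Build Solr request parameters for a name query plus the rerank step.

    Assumes query_name contains no double quotes (no escaping is attempted).
    """
    return {
        # Step 1: main query against the name field.
        "q": f'{field}:"{query_name}"',
        # Step 2: rerank the top documents using the plug-in's parser.
        "rq": "{!rniRerank reRankQuery=$rrq}",
        "rrq": f'{{!func}}rniMatch({field}, "{query_name}")',
    }


params = build_params("name", "Bob Smitty")
query_string = urlencode(params)  # percent-encodes braces, quotes, and spaces
print(query_string)
```

Encoding through urlencode matters here because the raw parameter values contain braces, quotes, "$" and spaces that are not safe in a URL.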
[Diagram. Indexing: User Doc (name:"Robert Smith", dob:2/13/1987) -> Plug-in Implementation (name:"Robert Smith", name_Key1:…, name_Key2:…, name_Key3:…, dob:2/13/1987) -> Index. Main Query: User Query q=name:"Bob Smitty" -> Plug-in Implementation (booleanQuery: name_Key1:... name_Key2:... name_Key3:...) -> Index. Rerank Query: Reranker rniMatch(name, "Bob Smitty") -> result (name:"Robert Smith", dob:2/13/1987, score: .79)]
[Diagram. High Recall Query (Solr) -> Subset of High Recall Results (Score > reRankScoreThreshold & Total < reRankDocs) -> ReRank Rescoring Query -> Scored Results]
Trading Off Accuracy for Speed
● reRankScoreThreshold (added by us)
  ○ Score threshold a top doc must meet to be rescored
  ○ Trades off accuracy vs. speed
● reRankDocs
  ○ Controls how many of the top documents to rescore
  ○ Trades off accuracy vs. speed
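One plausible reading of how these two parameters limit the expensive rescoring step is sketched below. The exact semantics inside the plug-in are not given in the slides, so the strict comparison against the threshold and the helper name select_for_rerank are assumptions.

```python
def select_for_rerank(results, rerank_docs, score_threshold):
    """Pick which first-pass hits get the costly name rescoring.

    results: list of (doc_id, score) pairs sorted by descending score.
    A hit is selected while fewer than rerank_docs have been chosen and its
    score is above score_threshold (assumed semantics).
    """
    selected = []
    for doc_id, score in results:
        if len(selected) >= rerank_docs:
            break  # cap on the number of rescored documents reached
        if score <= score_threshold:
            break  # results are sorted, so no later hit can qualify
        selected.append(doc_id)
    return selected


results = [("a", 9.1), ("b", 7.4), ("c", 3.2), ("d", 1.0)]
print(select_for_rerank(results, rerank_docs=3, score_threshold=3.0))
print(select_for_rerank(results, rerank_docs=2, score_threshold=3.0))
print(select_for_rerank(results, rerank_docs=3, score_threshold=5.0))
```

Raising the threshold or lowering reRankDocs shrinks the rescored set, which is faster but risks skipping a true match that scored poorly in the first pass.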
Rerank Params - Speed v. Accuracy
Rerank Params - Integration w/Query
● reRankQuery
  ○ Calls the NameMatch function to get score
  ○ Can query multiple names or other fields
● reRankWeight
  ○ Controls how much weight is given to name score vs main query
  ○ Allows user to include queries on other non-name fields
● reRankMode (added by us)
  ○ Controls how the rerank score should be combined with main query score
  ○ Currently 'add' or 'replace'
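The interaction of reRankWeight and reRankMode can be sketched as follows. The 'add' branch mirrors stock Solr re-ranking, where the rerank score times reRankWeight is added to the original score. The 'replace' branch is an assumption: the slides only name the mode, so whether the weight still applies there is a guess. combine and rerank are hypothetical helper names.

```python
def combine(main_score, name_score, rerank_weight=1.0, rerank_mode="add"):
    """Combine the main query score with the name-match score."""
    if rerank_mode == "add":
        # As in stock Solr re-ranking: original score + weight * rerank score.
        return main_score + rerank_weight * name_score
    if rerank_mode == "replace":
        # Assumption: the weighted name score replaces the main score entirely.
        return rerank_weight * name_score
    raise ValueError("rerank_mode must be 'add' or 'replace'")


def rerank(candidates, rerank_weight, rerank_mode):
    """candidates: list of (doc_id, main_score, name_score); returns sorted pairs."""
    scored = [
        (doc_id, combine(main, name, rerank_weight, rerank_mode))
        for doc_id, main, name in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = [("doc1", 2.0, 0.25), ("doc2", 1.0, 0.75)]
print(rerank(candidates, 1.0, "add"))      # main query still dominates
print(rerank(candidates, 1.0, "replace"))  # name score alone decides order
```

With 'add', a strong match on non-name fields can keep a document on top; with 'replace', only the name score decides the final order.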
Summary: How it works
● Custom field type
  ○ Splits a single field into multiple fields covering different phenomena
  ○ Supports multiple name fields in a document as well as multivalued fields
  ○ Intercepts the query to inject a custom Lucene query
● Custom rerank function
  ○ Rescores documents with algorithm specific to name matching
  ○ Limits costly calculations to only top candidates
  ○ Highly configurable
Suggested Questions:
● Thank David Smiley for helping? (Yes!)
● What if the names are in other text fields?
● What about support in Solr 5.0?
● How did you implement multi-valued fields?
● What about support in ElasticSearch?
● How does it scale?
● How do you handle names not in English?
● How does this relate to the theme of Entity-Centric Search?
● How do the plug-in’s scores relate to Solr scores?