Simple fuzzy name matching in elasticsearch

Simple Fuzzy Name Matching in Elasticsearch June 18, 2015 Brian Sawyer Engineering Manager [email protected]

Transcript of Simple fuzzy name matching in elasticsearch

Page 1: Simple fuzzy name matching in elasticsearch

Simple Fuzzy Name Matching in Elasticsearch

June 18, 2015

Brian Sawyer, Engineering Manager

Page 2: Simple fuzzy name matching in elasticsearch

Quick survey: How many of us...

● Regularly develop Elastic applications?
● Develop Elastic applications that include names of…
  ○ ...People?
  ○ ...Places?
  ○ ...Products?
  ○ ...Organizations?
  ○ …(other entity types)?
● Have names in languages besides English?
● Want to have better name search?
● Are Elasticsearch or plugin developers?

Page 3: Simple fuzzy name matching in elasticsearch

Motivating Questions...

● How could a border officer know whether you’re on a terrorist watch list?

● How does your bank know if you’re wiring money to a drug lord?

● How can an ecommerce site treat “Ho-medics Ultra sonic” and “Homedics Ultrasconic” as the same thing?

● How can a system search for mentions of people across news articles?

Page 4: Simple fuzzy name matching in elasticsearch

Answer...

Name Matching (plus more)

Page 5: Simple fuzzy name matching in elasticsearch

What kinds of name variation?

Page 6: Simple fuzzy name matching in elasticsearch

Real life example

David K. Murgatroyd

VP of Engineering

Boarding Pass

Page 7: Simple fuzzy name matching in elasticsearch

Current Best Practice?

● multi_field type with a field per possible variation (http://stackoverflow.com/questions/20632042/elasticsearch-searching-for-human-names)

"mappings": { ...
  "type": "multi_field",
  "fields": {
    "pty_surename": { "type": "string", "analyzer": "simple" },
    "metaphone": { "type": "string", "analyzer": "metaphone" },
    "porter": { "type": "string", "analyzer": "porter" }
    …

● Complex query against each field

● Generally gives high recall
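The "complex query against each field" might look roughly like the sketch below, which builds a bool query with one should clause per subfield. The class name `MultiFieldNameQuery` and the subfield names (`name`, `name.metaphone`, `name.porter`) are assumptions for illustration, not taken from the talk; a real application would follow its own mapping and would normally use a JSON library instead of string concatenation.

```java
import java.util.List;

/** Illustrative only: builds a bool/should query that tries a name against several subfields. */
final class MultiFieldNameQuery {

    // Escape backslashes and double quotes so the name can sit inside a JSON string.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // One match clause per subfield; a document scores higher the more variants agree.
    static String build(String name, List<String> fields) {
        StringBuilder sb = new StringBuilder("{\"query\":{\"bool\":{\"should\":[");
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append("{\"match\":{\"").append(fields.get(i)).append("\":\"")
              .append(escape(name)).append("\"}}");
        }
        return sb.append("]}}}").toString();
    }

    public static void main(String[] args) {
        // Hypothetical subfield names, assumed for this sketch.
        System.out.println(build("Robert Smith", List.of("name", "name.metaphone", "name.porter")));
    }
}
```

Every added variation means another subfield in the mapping and another clause in the query, which is the bookkeeping a dedicated name field type could take over.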

Page 8: Simple fuzzy name matching in elasticsearch

Can’t a name field type do this?

● Manage all the subfields

● Contribute score that reflects phenomena

● Be part of queries using many field types

● Have multiple fields per document

● Have multiple values per field (coming soon)

Page 9: Simple fuzzy name matching in elasticsearch

But what if variations co-occur?

“Jesus Alfonso Lopez Diaz” v.

“LobEzDiaS, Chuy”

● Reordered
● Missing token
● Two spelling differences
● Nickname for first name
● Missing space
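To make the difficulty concrete, here is a minimal sketch of a naive similarity measure: order-insensitive token matching with Levenshtein edit distance. It is an illustrative stand-in and not the matching technology described in this talk; the class `TokenNameSimilarity` and its scoring formula are assumptions chosen for brevity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Illustrative stand-in, not the plugin's algorithm: order-insensitive token matching. */
final class TokenNameSimilarity {

    // Lowercase, treat anything that is not a letter as a separator, and split into tokens.
    static List<String> tokens(String name) {
        return Arrays.stream(name.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // Classic dynamic-programming Levenshtein edit distance, two rows at a time.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    // Average, over the tokens of the shorter name, of the best per-token similarity in the other name.
    static double score(String left, String right) {
        List<String> a = tokens(left), b = tokens(right);
        if (a.isEmpty() || b.isEmpty()) return 0.0;
        List<String> shorter = a.size() <= b.size() ? a : b;
        List<String> longer = a.size() <= b.size() ? b : a;
        double total = 0.0;
        for (String s : shorter) {
            double best = 0.0;
            for (String l : longer) {
                int maxLen = Math.max(s.length(), l.length());
                best = Math.max(best, 1.0 - (double) editDistance(s, l) / maxLen);
            }
            total += best;
        }
        return total / shorter.size();
    }

    public static void main(String[] args) {
        System.out.println(score("Robert Smith", "Smith, Robert"));
        System.out.println(score("Jesus Alfonso Lopez Diaz", "LobEzDiaS, Chuy"));
    }
}
```

A pure reordering such as "Smith, Robert" scores 1.0 under this measure, but the pair on this slide scores well under 0.5: the missing space fuses two tokens and the nickname shares almost no letters with the given name, so handling each variation in isolation is not enough.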

Page 10: Simple fuzzy name matching in elasticsearch

Can we do better?

● Incorporates our proprietary name matching technology

● Provides similarity scores to name pairs
● Uses Elasticsearch's Rescore query
● Allows for higher precision ranking and thresholding
● Multi-lingual name search

Page 11: Simple fuzzy name matching in elasticsearch

Demo

Page 12: Simple fuzzy name matching in elasticsearch

How could you use such a Field?

● Plugin contains custom mapper which does all the work behind the scenes

PUT /ofac/ofac/_mapping
{
  "ofac" : {
    "properties" : {
      "name" : { "type" : "rni_name" },
      "aka" : { "type" : "rni_name" }
    }
  }
}

Page 13: Simple fuzzy name matching in elasticsearch

What happens at index time?

● NameMapper indexes keys for different phenomena in separate (sub) fields

@Override
public void parse(ParseContext context) throws IOException {
    Name name = NameBuilder.data(nameString).build();

    // Generate keys for name
    Collection<FieldSpec> fields = helper.deriveFieldsForName(name);

    // Parse each key with the appropriate Mapper
    for (FieldSpec field : fields) {
        Mapper mapper = keyMappers.get(field.getField().fieldName());
        context = context.createExternalValueContext(field.getStringValue());
        mapper.parse(context);
    }
}

Page 14: Simple fuzzy name matching in elasticsearch

Indexing

User Doc:
{ name: "Robert Smith", dob: "1987/02/13" }

↓ Plug-in Implementation

Index:
{ name: "Robert Smith", name.key1: …, name.key2: …, name.key3: …, dob: "1987/02/13" }
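As a rough illustration of this expansion, the sketch below turns one name value into the original field plus two derived subfields. The keys shown (`sorted_tokens`, `consonants`) and the class `NameKeySketch` are hypothetical and chosen only for the example; the slide does not say which keys the plugin actually generates.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

/** Illustrative only: expands one name value into several derived subfields. */
final class NameKeySketch {

    // Lowercased letter-only tokens in sorted order, so "Smith, Robert" and "Robert Smith" agree.
    static String sortedTokens(String name) {
        return Arrays.stream(name.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"))
                .filter(t -> !t.isEmpty())
                .sorted()
                .collect(Collectors.joining(" "));
    }

    // Crude consonant skeleton: drop ASCII vowels so some vowel misspellings collide.
    static String consonants(String sortedTokens) {
        return sortedTokens.replaceAll("[aeiou]", "");
    }

    // Hypothetical subfield names; the real keys are elided on the slide.
    static Map<String, String> deriveFields(String field, String value) {
        Map<String, String> out = new LinkedHashMap<>();
        String sorted = sortedTokens(value);
        out.put(field, value);
        out.put(field + ".sorted_tokens", sorted);
        out.put(field + ".consonants", consonants(sorted));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(deriveFields("name", "Robert Smith"));
    }
}
```

The user still sends a single `name` value; the derived entries exist only in the index, where each one targets a different kind of variation.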

Page 15: Simple fuzzy name matching in elasticsearch

What happens at query time?

● Step #1: NameMapper generates analogous keys for a custom Lucene query that finds good candidates for re-scoring

@Override
public Query termQuery(Object value, @Nullable QueryParseContext context) {
    // Parse name string
    Name name = NameBuilder.data(value.toString()).build();
    QuerySpec spec = helper.buildQuerySpec(new NameIndexQuery(name));

    // Build Lucene query
    Query query = spec.accept(new ESQueryVisitor(names.indexName() + "."));
    return query;
}

Page 16: Simple fuzzy name matching in elasticsearch

What else happens at query time?

● Step #2: Uses a Rescore query to score names in the best candidate documents and reorder accordingly
  ○ Tuned for high precision name matching
  ○ Computationally expensive

"rescore" : {
  "query" : {
    "rescore_query" : {
      "function_score" : {
        "name_score" : {
          "field" : "name",
          "query_name" : "LobEzDiaS, Chuy"
        }
        ...

Page 17: Simple fuzzy name matching in elasticsearch

What does that function do?

● The 'name_score' function matches the query name against the indexed name in every candidate document and returns the similarity score

@Override
public double score(int docId, float subQueryScore) {
    // Create a scorer for the query name
    CachedScorer cs = createCachedScorer(queryName);

    // Retrieve name data from doc values
    nameByteData.setDocument(docId);
    Name indexName = bytesToName(nameByteData.valueAt(i).bytes);

    // Score the query against the indexed name in this document
    return cs.score(indexName);
}

Page 18: Simple fuzzy name matching in elasticsearch

Indexing

User Doc:
{ name: "Robert Smith", dob: "1987/02/13" }

↓ Plug-in Implementation

Index:
{ name: "Robert Smith", name.Key1: …, name.Key2: …, name.Key3: …, dob: "1987/02/13" }

Query

User Query:
{ match : { name: "Bob Smitty" } }

↓ Plug-in Implementation

Main Query:
bool: name.Key1:... name.Key2:... name.Key3:...

Rescore Query:
name_score : { field : "name", name : "Bob Smitty" }

Result:
name: "Robert Smith", dob: 2/13/1987, score: .79

Page 19: Simple fuzzy name matching in elasticsearch

Rescore Params - Speed v. Accuracy

● window_size
  ○ Controls how many of the top documents to rescore
  ○ Tradeoff accuracy vs speed
● minScoreToCheck - (Added by Us)
  ○ Score threshold top doc must meet to be rescored
  ○ Tradeoff accuracy vs speed

Page 20: Simple fuzzy name matching in elasticsearch

Trading Off Accuracy for Speed

Query
→ High Recall Query (Elastic)
→ Subset of High Recall Results (Total < window_size & Score > minimumScoreThreshold)
→ Rescoring (for High Precision)
→ Scored Results
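The flow above can be sketched in a few lines. This is a simplified model, not the plugin's code: the `Hit` class, the `rescore` method, and the choice to leave hits below the threshold at their first-pass score and position are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Illustrative only: rescoring limited by a window size and a minimum first-pass score. */
final class RescoreWindowSketch {

    static final class Hit {
        final String id;
        double score;
        Hit(String id, double score) { this.id = id; this.score = score; }
    }

    /**
     * hits must already be sorted by first-pass score, best first.
     * Only hits inside the window whose first-pass score reaches minScoreToCheck
     * are passed to the expensive scorer; the rest keep their score and position.
     * Returns the number of expensive scorer calls made.
     */
    static int rescore(List<Hit> hits, int windowSize, double minScoreToCheck,
                       ToDoubleFunction<Hit> expensiveScorer) {
        int calls = 0;
        int limit = Math.min(windowSize, hits.size());
        for (int i = 0; i < limit; i++) {
            Hit hit = hits.get(i);
            if (hit.score < minScoreToCheck) break; // sorted input: nothing later can qualify
            hit.score = expensiveScorer.applyAsDouble(hit);
            calls++;
        }
        // The rescored hits form a prefix of the list; reorder just that prefix by the new scores.
        hits.subList(0, calls).sort((x, y) -> Double.compare(y.score, x.score));
        return calls;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("a", 5.0));
        hits.add(new Hit("b", 4.0));
        hits.add(new Hit("c", 0.5));
        // Stand-in for the expensive name comparison: a fixed value per call.
        int calls = rescore(hits, 2, 1.0, h -> h.id.equals("b") ? 0.9 : 0.2);
        System.out.println(calls + " expensive calls, top hit: " + hits.get(0).id);
    }
}
```

Raising the window size or lowering the threshold sends more candidates through the expensive scorer, which improves the chance of catching a good match that the first pass ranked low, at the cost of query time.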

Page 21: Simple fuzzy name matching in elasticsearch

Rescore Params - Integration w/Query

● rescore_query
  ○ Calls the name_score function to get score
  ○ Combine rescore_queries to query across multiple fields
● query_weight
  ○ Controls how much weight is given to main query
  ○ Allows user to include queries on other non-name fields
● rescore_query_weight
  ○ Controls how much weight is given to rescore query
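With Elasticsearch's default rescore score mode (total), the final score of a rescored document is the weighted sum of the two scores. The sketch below shows that arithmetic; the class `RescoreWeights` and the sample numbers are made up for illustration.

```java
/** Illustrative only: how the two weights blend scores under the default "total" score mode. */
final class RescoreWeights {

    // final = query_weight * main query score + rescore_query_weight * rescore query score
    static double combine(double queryScore, double rescoreScore,
                          double queryWeight, double rescoreQueryWeight) {
        return queryWeight * queryScore + rescoreQueryWeight * rescoreScore;
    }

    public static void main(String[] args) {
        // query_weight 0 lets the name similarity alone decide the ranking inside the window.
        System.out.println(combine(7.3, 0.79, 0.0, 1.0));
        // A non-zero query_weight keeps clauses on other fields (for example a date) in play.
        System.out.println(combine(7.3, 0.79, 0.1, 1.0));
    }
}
```

Because the main query score and the name similarity are on different scales, the weights usually need tuning against real data rather than being set once by intuition.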

Page 22: Simple fuzzy name matching in elasticsearch

What Challenges Were There?

● Design based on similar Solr plugin
● 1-2 months solo development time
● Nice plugin infrastructure
● Missing some useful javadocs/comments
● No (official) plugin development guide
● Used other plugin implementations as guides
  https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-plugins.html#_plugins

Page 23: Simple fuzzy name matching in elasticsearch

Summary: How it works

● Custom field type mapping
  ○ Splits a single field into multiple fields covering different phenomena
  ○ Supports multiple name fields in a document
  ○ Intercepts the query to inject a custom Lucene query
● Custom rescore function
  ○ Rescores documents with algorithm specific to name matching
  ○ Limits intense calculations to only top candidates
  ○ Highly configurable

Page 24: Simple fuzzy name matching in elasticsearch

Simple Fuzzy Name Matching in Elasticsearch

June 18, 2015

Brian Sawyer, Engineering Manager

Page 25: Simple fuzzy name matching in elasticsearch

Suggested Questions:

● What if the names are in other text fields?
● How did you implement multi-valued fields?
● How does it scale?
● How do you handle names not in English?
● How does this relate to the theme of Entity-Centric Search?
● How do the plug-in’s scores relate to Solr scores?