Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Ken Krugler | President, Scale Unlimited

Fuzzy Entity Matching

description

Presenter: Ken Krugler, President of Scale Unlimited

Early Warning has information on hundreds of millions of people and companies. When a person wants to open a new bank account, Early Warning needs to accurately find similar entities in this large dataset in order to provide a risk assessment. Using the combination of Cassandra & Solr via DSE, they can quickly find and evaluate all reasonable candidates.

Transcript of Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Page 1: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Ken Krugler | President, Scale Unlimited

Fuzzy Entity Matching

Page 2: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

whoami

•Ken Krugler, Scale Unlimited - Nevada City, CA

•Consulting on big data (workflows, search, etc)

•Training for Hadoop, Cascading, Solr & Cassandra

Page 3: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

The Problem

Page 4: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Should I Trust You?

•When opening a bank account...

•...what is the applicant's risk?

•Key is matching person...

•...to other account info

Page 5: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Matching people

•I have some information you've provided

•I need to match against ALL bank data

•But banks won't exchange their customer info

•So what can we do?

Page 6: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Early Warning Services

•Owned by the top 5 US banks

•Gets data from 800+ financial institutions

•So they have details on most US bank accounts

Page 7: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Fuzzy Matching

Page 8: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

What's a fuzzy match?

•Match everything that's equivalent

•Match nothing that's different

Page 9: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Why is it hard?

•Lots of gray areas in fuzzy matching

•Can't use exact key join

•So no easy lookup using C* row key

•Often computationally intensive

Page 10: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Matching People

•I've got information on lots of people

•I'm being asked about a specific person

•How to quickly find all good matches?

•Not doing batch matching

Page 11: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

What's a Good Match?

•Comparing field values between records

•Are these two people the same?

Name:     Bob Bogus     | Robert Bogus
Address:  220 3rd Ave   | 220 3rd Avenue
City:     Seattle       | Seattle
State:    WA            | WA
ZIP:      98104-2608    | 98104

Page 12: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

What about now?

•Normalization becomes critical

•How to focus on the important features?

Name:     Bob Bogus              | Robert H. Bogus
Address:  Apt 102, 220 3rd Ave   | 3220 3rd Avenue South
City:     Seattle                | Seattle
State:    Washington             | WA
ZIP:      98104
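A minimal sketch of the kind of normalization this slide calls for. The lookup tables and function names below are hypothetical and far smaller than anything a production system would need. The talk does not describe the actual rules used.

```python
import re

# Hypothetical lookup tables, for illustration only.
STREET_ABBREVIATIONS = {"ave": "avenue", "st": "street", "rd": "road"}
STATE_CODES = {"washington": "WA", "california": "CA"}
NICKNAMES = {"bob": "robert", "bill": "william"}


def normalize_name(name: str) -> str:
    """Lowercase, drop single-letter initials, expand known nicknames."""
    tokens = [t for t in re.findall(r"[a-z]+", name.lower()) if len(t) > 1]
    return " ".join(NICKNAMES.get(t, t) for t in tokens)


def normalize_address(address: str) -> str:
    """Lowercase, strip punctuation, expand known street abbreviations."""
    tokens = re.findall(r"[a-z0-9]+", address.lower())
    return " ".join(STREET_ABBREVIATIONS.get(t, t) for t in tokens)


def normalize_state(state: str) -> str:
    """Map a full state name to its two-letter code when it is known."""
    cleaned = state.strip()
    return STATE_CODES.get(cleaned.lower(), cleaned.upper())


def normalize_zip(zip_code: str) -> str:
    """Keep only the leading five digits of a ZIP or ZIP+4 value."""
    match = re.match(r"\d{5}", zip_code.strip())
    return match.group(0) if match else ""


print(normalize_name("Robert H. Bogus"))
print(normalize_address("220 3rd Ave"))
```

With these rules, "Bob Bogus" and "Robert H. Bogus" normalize to the same name, and "Washington" and "WA" to the same state. The two addresses on this slide still differ after normalization, which is where the similarity scoring has to take over.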

Page 13: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

How do you calc similarity?

•Calculate degree of similarity for each field (0 -> 1.0)

•Give each field a weight (these sum to 1.0)

•Score is sum(fieldN sim * fieldN weight)

•So score is 0 (nothing in common) to 1.0 (exact dup)
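The weighted sum described above can be sketched as follows. The field weights are hypothetical, and `difflib.SequenceMatcher` is only a stand-in for a per-field similarity measure. The talk does not say which measures or weights were actually used.

```python
from difflib import SequenceMatcher

# Hypothetical weights; as on the slide, they sum to 1.0.
FIELD_WEIGHTS = {"name": 0.4, "address": 0.3, "city": 0.1, "state": 0.1, "zip": 0.1}


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] for two field values (stand-in measure)."""
    if not a or not b:
        # Assumption: a missing value contributes nothing to the score.
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a: dict, rec_b: dict, weights: dict = FIELD_WEIGHTS) -> float:
    """Score = sum(field similarity * field weight) over the weighted fields."""
    return sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in weights.items()
    )


bob = {"name": "Bob Bogus", "address": "220 3rd Ave", "city": "Seattle",
       "state": "WA", "zip": "98104-2608"}
robert = {"name": "Robert Bogus", "address": "220 3rd Avenue", "city": "Seattle",
          "state": "WA", "zip": "98104"}

print(round(match_score(bob, robert), 2))
```

Because the weights sum to 1.0 and each field similarity lies in [0, 1], the score stays in the 0 to 1.0 range the slide describes.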

Page 14: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Does that scale?

•For a given person being matched...

•You need to compare to every other person

•Works for a few thousand people

•Doesn't scale for 100s of millions of people

Page 15: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Search to the Rescue

Page 16: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Search is (fast) similarity

•Find N most similar docs to this doc (my query)

•Each doc has multi-dimensional feature vector

•Each feature (dimension) is a unique word

•Feature weight is TF * IDF

Page 17: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Cosine Similarity

•Each document has a term vector

•E.g. three unique words x, y, z

•Weight is TF*IDF of each word

•Calc cosine of angle between 2 vectors

•That is the similarity score
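A small, self-contained sketch of cosine similarity over TF*IDF term vectors. The IDF variant used here, 1 + log(N / df), is one common choice picked for illustration. It is an assumption, not Lucene's exact scoring formula.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[list[str]]) -> list[dict[str, float]]:
    """Build one sparse TF*IDF vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: count * (1.0 + math.log(n_docs / doc_freq[term]))
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [["x", "y", "z"], ["x", "y", "y"], ["a", "b"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Documents that share no terms score 0, and documents with identical term vectors score 1, regardless of document length.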

Page 18: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Cosine sim ≢ match sim

•Doesn't have same level of sophistication

•So throw a bigger net to find candidates

•e.g. get top N*X, assuming at most X matches

•Then do match similarity calc on this (small) set
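The "bigger net" idea can be sketched as a small function. The names `two_step_match`, `search`, `score`, `max_matches`, `oversample`, and `threshold` are all hypothetical. Here `search` stands in for the Solr query and `score` for the more expensive match similarity calculation.

```python
from typing import Callable


def two_step_match(
    query: dict,
    search: Callable[[dict, int], list],
    score: Callable[[dict, dict], float],
    max_matches: int,    # X on the slide: most matches we expect to exist
    oversample: int,     # N on the slide: how much wider to cast the net
    threshold: float,    # hypothetical cutoff for calling something a match
) -> list:
    """Fetch N*X candidates via search, then keep those the matcher accepts."""
    candidates = search(query, max_matches * oversample)
    scored = [(score(query, candidate), candidate) for candidate in candidates]
    accepted = [pair for pair in scored if pair[0] >= threshold]
    accepted.sort(key=lambda pair: pair[0], reverse=True)
    return accepted
```

The expensive scoring only runs on the `max_matches * oversample` records that search returns, never on the full dataset.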

Page 19: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

So two-step process

Match: 0.90, 0.50, 0.10, 0.85, ...

Query: name="Bob Bogus"^3 and ssn="222447777"^10 and dob="19600723"^5

Solr Index

Name          SSN        DOB
Bob Bogus     222447777  19610603
Robert Bogus  193618919  19600723
Bob Smith     479385821  19600723
Sam Stealthy  222447777  19930523

Name          SSN        DOB
Bob Bogus     222447777  19600723
...           ...        ...
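The query on this slide is written informally. As a sketch, the helper below builds the equivalent string in Solr's standard query parser syntax, where a boosted phrase clause is written `field:"value"^boost`. The function names are hypothetical, and joining the clauses with OR (so that a record matching only some fields is still returned as a candidate) is an assumption rather than something the slide states.

```python
def escape_phrase(value: str) -> str:
    """Escape backslashes and double quotes inside a quoted phrase."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_solr_query(clauses: list, operator: str = "OR") -> str:
    """Join (field, value, boost) triples into one boosted query string."""
    parts = [
        f'{field}:"{escape_phrase(value)}"^{boost}'
        for field, value, boost in clauses
    ]
    return f" {operator} ".join(parts)


query = build_solr_query([
    ("name", "Bob Bogus", 3),
    ("ssn", "222447777", 10),
    ("dob", "19600723", 5),
])
print(query)
```

The boosts play the same role as the field weights in the match calculation: a hit on SSN counts for more than a hit on name.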

Page 20: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

How do you pick N?

•Can be small, if match sim ≈ search sim

•If N is too big, it's inefficient

•If N is too small, you miss matches

•Tune search to mimic match sim

•Right tradeoff depends on use case

Page 21: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

What is Solr?

•Enterprise search system, built on top of Lucene

•Open source project at Apache Software Foundation

•Scales to billions of documents

•Highly configurable & customizable

•Integrated with Cassandra in DSE

Page 22: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Solr Schema

•Defines set of fields in a document

•Direct one-to-one mapping with Cassandra columns

•Fields can be defined with synonyms, etc., etc.

<fields>
  <field name="key" type="string" indexed="true" stored="true" />
  <field name="name" type="text" indexed="true" stored="true" />
</fields>

Page 23: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

DSE Search with Solr

Page 24: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

What is DSE with Solr?

•DSE-specific enhancement to Cassandra

•Keeps a Solr index in sync with a C* table

•Indexes distributed to all nodes

[Diagram: three nodes, each running C* & Solr and each holding a C* table and its Solr index]

Page 25: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Handy replication & failover

•Implementation leverages C* replication

•So you get load balancing, reliability, scalability

•You can replicate from a regular C* DC to Solr DC

[Diagram: a Solr DC with three C* & Solr nodes, and a C* DC with three C* nodes]

Page 26: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Who builds the index?

•In background

•Much slower than C* updates

•Uses existing secondary index hook

[Diagram: indexing pipeline. Secondary Index Hook -> Distribute to indexing queues -> Indexing Queue -> Read C* storage row -> Create one Solr doc per entry -> Apply FieldInputTransformer -> Update Solr. Other labels shown: Logical Rows, max_solr_concurrency_per_core, back_pressure_threshold_per_core]

Page 27: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

How fast is it?

•Writing 170M records ≈ 2.5 hours

•8-node DSE 4.0 cluster, eight 1 TB SSDs on each node

•This is indexing during writes

•About 15% of index available when writes finish

•Complete index takes another 12 hours
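As a back-of-the-envelope check, the figures on this slide translate into the following average rates. This is plain arithmetic on the slide's numbers, and it assumes the rates are uniform over each period, which the talk does not claim.

```python
records = 170_000_000        # records written
write_hours = 2.5            # time to write them
nodes = 8                    # cluster size
indexed_at_write_end = 0.15  # fraction of the index ready when writes finish
remaining_index_hours = 12   # extra time to finish indexing

write_rate = records / (write_hours * 3600)
write_rate_per_node = write_rate / nodes
index_rate_during_writes = records * indexed_at_write_end / (write_hours * 3600)
index_rate_after_writes = records * (1 - indexed_at_write_end) / (remaining_index_hours * 3600)

print(f"writes:          {write_rate:,.0f} records/s ({write_rate_per_node:,.0f} per node)")
print(f"indexing (busy): {index_rate_during_writes:,.0f} docs/s")
print(f"indexing (idle): {index_rate_after_writes:,.0f} docs/s")
```

On these numbers, writes average roughly 19 thousand records per second across the cluster, while indexing runs at around 3 thousand documents per second, which is consistent with indexing being described as much slower than C* updates.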

Page 28: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

System Overview

Page 29: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

ETL Hadoop Workflow

•Extract, transform, load

•Built using Cascading API

•Parse data, simple normalization

•Other transformations happen in Solr

Page 30: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Cassandra ingress

•Reduce tasks in Hadoop talk to C* cluster

•Using DataStax Java driver for Cassandra

•Bottleneck is Solr indexing

•Inserts get throttled when this falls behind

•But total time less than with deferred indexing

Page 31: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Architectural Diagram

[Diagram: Hadoop Cluster, three C* + Solr nodes, Entity Matcher API]

Page 32: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Ingest performance

•For max performance, write without reads

•But how to avoid creating duplicate entries?

•Set the row key to the hash of searchable fields

•Accept "near duplicates" in search results

•Possible to push some Solr load into workflow
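A sketch of deriving the row key from a hash of the searchable fields, so that rewriting an identical record overwrites the same row instead of creating a duplicate, with no read needed first. The field list, the light normalization, and the choice of SHA-1 are all assumptions made for illustration.

```python
import hashlib

# Hypothetical set of searchable fields, in a fixed order.
SEARCHABLE_FIELDS = ("name", "ssn", "dob")


def row_key(record: dict, fields: tuple = SEARCHABLE_FIELDS) -> str:
    """Return a deterministic key built from the record's searchable fields."""
    parts = [str(record.get(field, "")).strip().lower() for field in fields]
    # Join with a separator that will not appear in the values, so that
    # ("ab", "c") and ("a", "bc") cannot produce the same payload.
    payload = "\x1f".join(parts).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


first = row_key({"name": "Bob Bogus", "ssn": "222447777", "dob": "19600723"})
again = row_key({"name": " bob bogus ", "ssn": "222447777", "dob": "19600723"})
other = row_key({"name": "Bob Bogus", "ssn": "222447777", "dob": "19610603"})
print(first == again, first == other)
```

Records that differ in any searchable field get different keys, which is why the slide accepts "near duplicates" showing up in search results.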

Page 33: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Summary

Page 34: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

Key points to remember

•This is for ad hoc requests, not batch deduplication

•Use search to reduce candidate set, then match

•Pain is in normalization, matching logic

•DSE + Solr simplifies architecture & adds goodness

Page 35: Cassandra Summit 2014: Fuzzy Entity Matching at Scale

More questions?

•Feel free to contact me

•http://www.scaleunlimited.com/contact/

•Get training on DSE with Solr

•http://www.datastax.com/what-we-offer/products-services/training