Real Time Fuzzy Matching With Spark and ElasticSearch
BFSI
Wilful Defaulters?
Sanctions Screening
PEP
HMT
OFAC SDN
..and many others
However ...
7TH OF TIR
7TH OF TIR COMPLEX
7TH OF TIR INDUSTRIAL COMPLEX
7TH OF TIR INDUSTRIES
7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN
SEVENTH of Tir
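The variants above show why exact string comparison fails for screening lists. As a minimal sketch (token-set Jaccard is an assumption chosen for illustration, not necessarily the measure the talk's system uses), the helper names `tokens` and `jaccard` below are hypothetical:

```python
import re

def tokens(name):
    """Lower-case a name and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def jaccard(a, b):
    """Token-set Jaccard similarity between two names, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

# Name variants taken from the slide.
variants = [
    "7TH OF TIR",
    "7TH OF TIR COMPLEX",
    "7TH OF TIR INDUSTRIAL COMPLEX",
    "7TH OF TIR INDUSTRIES",
    "SEVENTH of Tir",
]

query = "7TH OF TIR"
for v in variants:
    # Exact equality is True only for the first variant; Jaccard degrades gradually.
    print(f"{v!r}: exact={v == query}, jaccard={jaccard(query, v):.2f}")
```

Only the first variant matches exactly, while every variant keeps a non-zero similarity score. Note that a purely token-based score still treats "7TH" and "SEVENTH" as unrelated, which is one reason hand-written rules become hard to maintain.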
Entity Resolution
Directory Listings
Dew Drops, Shop no - A-152, super mart 1, Gurgaon - 122001, DLF Phase 4
DewDrop Florist, A 152, DLF City Phase 4, Near Galleria Market, Super Mart 1
Ecommerce
Cherry Mobile Amethyst Android 4.2 Jelly Bean (Black) with Free Smart and Globe SIM
Cherry Mobile Amethyst (White) with 1 Smart SIM
CHERRY MOBILE AMETHYST + 1 SMART SIM
Cherry Mobile Amethyst Android 4.2 Jelly Bean
Cherry Mobile Amethyst (White) with 1 Samsung Galaxy V
CHERRY MOBILE AMETHYST + 1 SAMSUNG GALAXY V. + 1 SMART AND GLOBE SIM
Government of ..
● Benefit rollouts
● Surveillance
● Licenses
● Linking NPR with Passport
360 view
ID     Company   Name            Project
12345  UBM Asia  Dave Chan       HK - Fine Jewellery
13222  UBM A     Dave C          HK - Fashion Jewellery
15656  UBM       Davechan        HK - Beauty
14456  ubmAsia   Mr. Dave CChan  HK - Fine Jewellery
“In order to be irreplaceable, one must always be different.”
― Coco Chanel
Other uses
● Cross selling
● Data Quality
● Vendor consolidation
● Master Data Management
● CRM Deduplication
Challenges
● Discovering and maintaining rules is extremely tough
● Custom coding and domain-specific logic make maintenance a nightmare
● No one size fits all: big custom implementations are needed every time, even when using existing tools
Challenges..
● High data volumes
● Each record has multiple dimensions
● Exact matches are rare
● Comparing each record with every other is not possible
● Languages have unique issues
Let's start wishing...
● Data variety
● Scalable
● No manual configuration of rules or algorithms
● Multi language
● Real time
Our Approach
- Learn from the data
- Divide the load
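One common way to "divide the load" is blocking: records are grouped by a cheap key and only compared within their group, which avoids the all-pairs comparison. The slides do not say how Reifier partitions the data, so the sketch below is an assumption. The key function `blocking_key` (first three alphanumeric characters) and the record ids 20001 and 20002 are hypothetical; the names come from the earlier slides.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(name):
    """Hypothetical blocking key: first three alphanumeric characters, lower-cased."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum())
    return cleaned[:3]

def candidate_pairs(records):
    """Yield pairs of record ids whose names share a blocking key."""
    blocks = defaultdict(list)
    for rec_id, name in records:
        blocks[blocking_key(name)].append(rec_id)
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        yield from combinations(sorted(ids), 2)

records = [
    (12345, "UBM Asia"),
    (13222, "UBM A"),
    (15656, "UBM"),
    (14456, "ubmAsia"),
    (20001, "Dew Drops"),        # hypothetical id
    (20002, "DewDrop Florist"),  # hypothetical id
]

pairs = list(candidate_pairs(records))
all_pairs = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

In a distributed setting the same idea maps onto grouping records by key so that each partition handles its own blocks. A fixed prefix key like this one misses variants that differ in their first characters, which is the kind of hand-tuned rule the approach above tries to avoid by learning from data.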
Reifier Workflow
Configure data
Reifier Interactive Learner
Linked Result
Have training data?
Reifier Match
Yes
No
1. Select Data
2. Field Selection and Stop Words
Strata Hadoop World Singapore 2015
3. Choose Training Set
4. Run the Spark Job
5. Enjoy the results
At the beginning (no Chinese stop words configured):
亚洲博闻有限公司 Dave Chan
亚洲华乐有限公司 David Chan
In this case, the similarity between the two records is very high.
What if we add the stop words 亚洲 ("Asia") and 有限公司 ("Company Limited")?
博闻 Dave Chan
华乐 David Chan
The company names in these records no longer match at all, so the system will not group them together.
Fuzzy Match in Reifier – Stop words
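The effect of stop words on the two company names can be reproduced with a small sketch. It uses Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure (an assumption; the slides do not say which measure Reifier uses), and the helper names `strip_stop_words` and `similarity` are hypothetical.

```python
from difflib import SequenceMatcher

def strip_stop_words(text, stop_words):
    """Remove every occurrence of each stop word from the text."""
    for word in stop_words:
        text = text.replace(word, "")
    return text

def similarity(a, b):
    """Character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()

# Company names and stop words from the slide.
company_a = "亚洲博闻有限公司"
company_b = "亚洲华乐有限公司"
stop_words = ["亚洲", "有限公司"]

raw_score = similarity(company_a, company_b)
stopped_score = similarity(
    strip_stop_words(company_a, stop_words),
    strip_stop_words(company_b, stop_words),
)

print(f"without stop-word removal: {raw_score:.2f}")
print(f"with stop-word removal:    {stopped_score:.2f}")
```

Under this measure the raw names score 0.75 because six of their eight characters are shared boilerplate, while the stripped names 博闻 and 华乐 score 0.0. Removing generic terms makes the comparison depend only on the distinguishing part of each name.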
Reifier Interactive Learner
Spark Benefits
● Distributed
● Scalable
● Fast
● Machine Learning
● Sampling
● No need to orchestrate multiple jobs
Real Time
Spark + ElasticSearch
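For the real-time side, one plausible building block is Elasticsearch's own fuzzy matching, used to fetch candidate records for an incoming name. The slides do not describe how Reifier queries Elasticsearch, so this is only a sketch of the standard `match` query with the `fuzziness` parameter. The function `fuzzy_name_query` and the field name `name` are hypothetical.

```python
import json

def fuzzy_name_query(field, text, size=10):
    """Build a search request body that tolerates small spelling differences.

    Uses the Elasticsearch `match` query with `fuzziness` set to "AUTO",
    which picks the allowed edit distance per term from the term length.
    """
    return {
        "size": size,
        "query": {
            "match": {
                field: {
                    "query": text,
                    "fuzziness": "AUTO",
                }
            }
        },
    }

# Hypothetical field name; the body would be sent to the index's _search endpoint.
body = fuzzy_name_query("name", "7TH OF TIR INDUSTRIES")
print(json.dumps(body, indent=2))
```

A query like this only retrieves candidates; deciding whether a candidate is a true match would still fall to the learned similarity model.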
Advantages
● Point and Shoot - Zero config
● Learning similarity definitions from data
■ No hard coding of business rules
■ Domain agnostic
■ Handle multiple languages (English, Chinese, Japanese, Thai)
Advantages
● Scalability
● Real time as well as batch
Thank You!
www.nubetech.co