Talend White Paper Matching Technology Improves Data Quality
How Matching Technology Improves Data Quality
White Paper
Table of Contents
How Matching Technology Improves Data Quality
What is Matching?
Benefits of Matching
Matching Use Cases
What is Matched?
Standardization before Matching
Matching Technology
Probabilistic
Deterministic versus Probabilistic
Blocking
Matching Process
What to do with Matches
Conclusion
About Talend Data Quality
How Matching Technology Improves Data Quality
Enterprise applications such as Master Data Management, Customer
Data Integration, and Data Warehouse projects rely on clean,
duplicate-free data to be effective business tools. Companies have
long sought the “single source of truth” in their customer data,
transactional data and even in their metadata to make these
applications most effective. Matching plays an important role in
achieving a single view of customers, parts, transactions and almost
any type of data.
For decades, software vendors and computer scientists have devised
strategies and technologies for finding relationships within data.
Some of the first published work on matching strategies dates to
1946, when Halbert Dunn, MD, then Chief of the National Office of
Vital Statistics for the U.S. Public Health Service, wrote a paper
describing how the pages of a person’s medical records could be
linked to create a “book of life”. It was an idea ahead of its time,
since the technology of the era was not up to the task.
In 1969, Ivan Fellegi and Alan Sunter formalized probabilistic
matching techniques in a breakthrough research paper. Over the
years, others have refined the algorithms used to match records. The
strategies for matching records are mature and well-documented,
although not always simple.
This white paper looks into the topic of matching, what it is, how it
works, and different methods for matching data.
What is Matching?
Matching is the process of bringing together similar or identical
records in order to identify or remove duplicates from the data.
Matching is also often used to link records that have some sort of
relationship. Since data doesn’t always tell us the relationship
between two data elements, matching technology lets us define rules
for items that might be related.
How relationships are interpreted and used, either in Business to
Business data or in Business to Consumer data, depends on the
context of the project and the needs of the business users.
Commonly, corporations use matching to remove duplicate customer
records and thereby optimize marketing programs, but there are
many uses for this optimized data beyond marketing. Data may
contain business names rather than households, in which case
relationships need to be created between “IBM Corp.”,
“International Business Machines”, and “I.B.M.”, for example.
Data to be matched may also come from supply chain and ERP systems,
where the matcher relies on patterns to match, for example, part
number XL-12345 to XL123-45. Data may also be descriptive, where
the matcher needs to find a relationship between “Frozen Carrots”
and “Car, Frzn”. Identifying these relationships contributes to the
organization’s data management strategy and effectiveness.
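As a rough illustration of pattern-based matching for part numbers, the sketch below reduces each part number to its letters and digits before comparing. The function names are hypothetical and the rule (ignore punctuation and case) is an assumption; real part-numbering schemes may need stricter rules.

```python
import re


def normalize_part_number(raw: str) -> str:
    """Keep only letters and digits, upper-cased, so punctuation is ignored."""
    return re.sub(r"[^A-Za-z0-9]", "", raw).upper()


def same_part(a: str, b: str) -> bool:
    """Treat two part numbers as the same item if their normalized forms agree."""
    return normalize_part_number(a) == normalize_part_number(b)


print(same_part("XL-12345", "XL123-45"))
```

Under this assumed rule, XL-12345 and XL123-45 both reduce to XL12345 and are treated as the same part.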
Matching is also called “Linking” because the end result of finding
two related records in your data is not necessarily to delete one
of the records or to combine two or more records. Rather, the way
to understand your customers may simply be to link the records
together (using keys). A family that lives in one household and
does business with one bank is an example: each member’s records
should remain separate, but linked by household.
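A minimal sketch of linking by key rather than merging follows. It assumes address data has already been standardized, and the record layout, field names and `household_key` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical customer records; addresses are assumed to be standardized.
customers = [
    {"id": 1, "name": "Ann Lee", "address": "25 MAIN ST", "postal_code": "02134"},
    {"id": 2, "name": "Tom Lee", "address": "25 MAIN ST", "postal_code": "02134"},
    {"id": 3, "name": "Raj Patel", "address": "9 ELM ST", "postal_code": "02134"},
]


def household_key(record: dict) -> str:
    """Build a linking key from postal code and address line."""
    return f'{record["postal_code"]}|{record["address"]}'


def link_households(records: list) -> dict:
    """Group record ids by household key; the records themselves stay separate."""
    groups = defaultdict(list)
    for record in records:
        groups[household_key(record)].append(record["id"])
    return dict(groups)


households = link_households(customers)
```

Each customer record is left intact; only the shared key ties the two members of the first household together.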
Benefits of Matching
Nearly any organization can benefit from matching. The top benefits
include:
Billing and Credit – Removal of duplicate records and
householding can lead to benefits such as unified billing,
accurate revenue accounting, accurate contract billings,
unified credit management and reduced mailing costs.
Direct mail/marketing – Companies can decrease the costs
associated with direct mail by mailing one and only one
promotion to any given household.
Relationships with customers – Organizations with more
accurate and duplicate-free records have better relationships
with their customers.
Supply Chain and Inventory Efficiency – By matching
inventory in a warehouse that is physically identical, but
seemingly unconnected in the database, the company can
lower carrying costs. There's less inventory to put away, less
space to rent, lower insurance and taxes on inventory, lower
costs to physically count inventory and lower risk from
obsolete inventory.
Vendor Cost Savings – With cleaner vendor and inventory
data, buyers have more accurate information on the amount
purchased from any given vendor. Armed with accurate
data, the buyer can apply pressure on the vendor to lower
costs.
Overall Corporate Efficiency – With duplicate-free data,
users are more likely to adopt systems that will improve
corporate efficiency. Storing fewer gigabytes of more
accurate data is much more efficient.
Matching Use Cases
Matching software is commonly used in several different
configurations, including but not limited to:
One-time matching project – where companies perform a
one-time removal of duplicates from a single database or a
one-time linking of two or more databases
Real time single database – often accomplished by first
identifying existing duplicates, then matching in real time to
ensure that no duplicate records are added to the database
Matching of multiple databases at regular intervals – most
commonly as part of a data warehouse, where data is loaded
nightly into the warehouse to support business intelligence
metrics
Linking multiple databases via a master index – where the
flow of data enters a central master data management hub.
In such a configuration, data quality is a service, called by
the master data management application
What is Matched?
Powerful matching technology will match a variety of types of data,
including but not limited to:
Individuals – Individuals living at the same address. The
software finds individuals, even if the data is mistyped or if
nicknames are involved. The software knows that Bob and
Rob are derived from the same root name.
Households – Members of the same household living at one
address. This is useful for identifying the head of household
and contacting only one person with marketing offers, for
example.
Businesses – Companies with the same or similar names, also
with the ability to recognize EMC, E.M.C and E M C Corp as
the same company, for example.
Inventory or Supply Chain Items – Companies looking to
consolidate parts and items with the same or similar names.
The software understands that a “bolt, one half inch” might
be a match to a “1/2" bolt” or that “carrots, frozen” might
be the same as “Frz Car”. Since this data varies so much from
industry to industry, some standardization of data may be
necessary for finding these types of matches.
Standardization before Matching
Standardization is widely regarded as a prerequisite for matching.
It is a process by which an agreed standard is defined for a given
type of data. The rules are expressed in a business rules engine and
applied to the data. For example, the postal services in various
countries offer standards for name and address data. One such
standard in the United States is that in addresses the word “street”
is always abbreviated “ST”, and not “Str” or “Street”.
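A standardization rule of this kind can be sketched as a lookup applied to the last word of a street line. The table below is a small illustrative subset rather than a complete postal standard, and the function name is hypothetical.

```python
# Illustrative subset of street-suffix variants; not a complete postal table.
SUFFIXES = {"STREET": "ST", "STR": "ST", "AVENUE": "AVE", "AV": "AVE"}


def standardize_street_line(line: str) -> str:
    """Upper-case the line, drop periods and commas, and abbreviate the suffix."""
    tokens = line.upper().replace(".", "").replace(",", "").split()
    if tokens and tokens[-1] in SUFFIXES:
        tokens[-1] = SUFFIXES[tokens[-1]]
    return " ".join(tokens)


print(standardize_street_line("25 Main Street"))
```

Once every variant is reduced to the same form, even an exact comparison will bring “25 Main Str.” and “25 Main Street” together.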
Users may also opt to standardize data shapes. In an ERP or
supply chain system, for example, a company may decide to always
designate part numbers as NN-AAAAA, where N is a digit and A is
an alphanumeric character. In this scenario, part number 12-HGAJS
would be valid, while 12HGAJS_2 would be subject to
standardization.
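The NN-AAAAA shape can be checked with a regular expression. This is a minimal sketch; the pattern is one reasonable reading of the shape (two digits, a hyphen, five letters or digits) and the function name is hypothetical.

```python
import re

# NN-AAAAA: two digits, a hyphen, then five alphanumeric characters.
PART_SHAPE = re.compile(r"\d{2}-[A-Za-z0-9]{5}")


def has_valid_shape(part_number: str) -> bool:
    """Return True only when the whole string matches the agreed shape."""
    return PART_SHAPE.fullmatch(part_number) is not None


for candidate in ("12-HGAJS", "12HGAJS_2"):
    status = "valid" if has_valid_shape(candidate) else "needs standardization"
    print(candidate, status)
```

Records that fail the check would be routed to a standardization step rather than rejected outright.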
For name data, nicknames can also be problematic. Matching “Steve”
with “Stephen”, for example, requires standardization. The strategy
here is to create a root name attribute in the database that stores
the root name of Steve, which is Stephen. This strategy keeps the
original names intact in the database, but allows “Steve Williams”
and “Stephen Williams” to match.
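The root name strategy can be sketched with a small lookup table. The table contents, the `root_first_name` attribute and the function names are illustrative assumptions; a production system would use a much larger reference list.

```python
# Tiny illustrative nickname table; a real one would be far larger.
ROOT_NAMES = {"steve": "stephen", "bob": "robert", "rob": "robert"}


def root_name(first_name: str) -> str:
    """Return the root form of a first name, or the name itself if unknown."""
    key = first_name.strip().lower()
    return ROOT_NAMES.get(key, key)


def add_root_name(record: dict) -> dict:
    """Return a copy of the record with a root name attribute added."""
    return {**record, "root_first_name": root_name(record["first_name"])}


steve = add_root_name({"first_name": "Steve", "last_name": "Williams"})
stephen = add_root_name({"first_name": "Stephen", "last_name": "Williams"})
```

The original `first_name` values are untouched; matching compares the added `root_first_name` attribute instead.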
Standardization also helps when data is misfielded, for example,
when a name is inadvertently typed into an address line. This
commonly occurs during data migration, especially from legacy
systems where data tends to be less structured. For example, some
billing system data may contain “Attn: Accounts Payable” on various
lines of the record, and it’s up to standardization to sort this
out. By the same token, similar records won’t come together when
comparing “Steven A. Smith” to “25 Main St.”. Profiling the data
ahead of time is the best way to ensure that the correct data exists
in the correct field and that apples are being compared to apples.
The standardization process improves matching results, even when
implemented along with very simple matching algorithms. More
exact matches will exist once the address has been standardized and
the root name has been found. More exact matches will also exist
once part numbers and descriptions have been standardized.
However, in combination with advanced matching techniques,
standardization can improve information quality even more.

Data standardization can be achieved with Talend Data Quality by
using your own business rules, regular expressions and even public
domain and government sources like the US Census, data.gov and
geonames.org.

This standardization is not only integral to data quality, it’s
integral to the effectiveness of master data management, CRM, ERP
and many business applications.
Matching Technology
The strength of matching technology depends on how well its
algorithms establish a match. Solutions include routines that are
specially designed to compare names, addresses, strings and partial
strings, business names, spelling errors, postal codes, tax ID
numbers, data that sounds similar such as “Phig” and “Fig”, and more.
There are two common types of matching technology on the market
today: deterministic and probabilistic.
Deterministic or “rules-based” matching compares records using
fuzzy algorithms. The various algorithms allow for a little bit of
“slop” in the data, so that if there are typos or phonetic
similarities (like ph & f), the algorithms can still identify
linkage. Ultimately, the user decides which fields to compare and
which algorithm to use on each. Each field can have a “weight”, so
that a user might decide that tax ID number (Social Security
number) has more weight than last name, for example. The user can
choose from these common algorithms:
Fuzzy Match Algorithm: Use Case/Description

Exact Match: You can use an exact matching algorithm to find exact
duplicates. “Smith” will match to “Smith” and only “Smith”, with no
fancy variations. After records have been standardized, a certain
number of new exact matches should come as a natural result.

SoundEx: Applied to US census records in the 1930s, SoundEx is a
phonetic algorithm for indexing names by sound, as pronounced in
English. The algorithm mainly encodes consonants; a vowel will not
be encoded unless it is the first letter. Improvements to SoundEx
are the basis for many of the modern phonetic algorithms that
follow.

Metaphone and Double Metaphone: To address the limitations of
SoundEx, Metaphone was developed in 1990, using a larger set of
rules for English pronunciation. Later, Double Metaphone was
developed to provide even more power. The algorithm returns both a
primary and a secondary code to account for many variations of
surnames with common ancestry. For example, encoding the name
"Smith" yields a primary code of SM0 and a secondary code of XMT,
while the name "Schmidt" yields a primary code of XMT and a
secondary code of SMT; both have XMT in common. The Double
Metaphone algorithm does a better job because it uses a much more
complex rule set for coding than SoundEx and the original
Metaphone.

Levenshtein: In the 1960s, the Russian scientist Vladimir
Levenshtein devised the Levenshtein distance algorithm. Levenshtein
distance is a measure of the similarity between two strings. The
distance is the number of deletions, insertions, or substitutions
required to transform one string into the other; users define the
maximum distance that still counts as a match. For example, the
distance between “Smith” and “Smith” is 0, because no
transformations are needed. However, the distance between “Smith”
and “Smyth” is 1 because one substitution is needed to transform
it. “Smith” and “Smythe” would have a distance of 2, and so on.

Jaro-Winkler: Jaro-Winkler is similar in function to Levenshtein,
since it measures how different two strings are. However,
characters at the beginning of the string are given more weight
than those at the end. This weighting of characters allows
Jaro-Winkler to deliver a score between zero and 1, with one being
a perfect match.

After standardization, Talend Data Quality matching uses algorithms
to determine when two or more records match. It identifies matching
records referring to the same business, household,
individual/contact, product, etc. and identifies relationships
linking a contact to a business, an individual to a household, a
product to a product class, or other.

Probabilistic
The second category of matching technology is called probabilistic.
It builds on the very same theories that Fellegi and Sunter wrote
about back in 1969. The intricacies of probabilistic matching run
well beyond the scope of this white paper. However, statistical
analysis and advanced algorithms are key to its success.
The algorithm is smart enough to know that a common last name like
“Jones” should play a smaller role in matching as compared to a less
common last name, like “Jimmerson”. How does it know?
Probabilistic matching technology performs statistical analysis on
the data and determines the frequency of items. It then uses that
analysis to weight the match, similar to the way that the user can
apply weight to the relevance of each field.
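To make the Levenshtein measure from the algorithm descriptions concrete, here is a minimal sketch of the standard dynamic-programming calculation. It is an illustration only and not the implementation used by any particular product.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the insertions, deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution, or no change
            ))
        previous = current
    return previous[-1]


for other in ("Smith", "Smyth", "Smythe"):
    print("Smith", other, levenshtein("Smith", other))
```

A matcher would typically compare the returned distance with a user-defined threshold to decide whether two values are close enough.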
Deterministic versus Probabilistic
Data quality solutions often offer both types of matching, since
neither is necessarily superior to the other. While deterministic
matching doesn’t take a holistic view of the data set, it will
produce a good many matches and is much easier for data management
professionals to understand and tune. Probabilistic matching may be
superior in its holistic view of the data, but its complex algorithm
makes it harder to understand and track why records matched and why
they didn’t. If you’re trying to do real-time matching, with one
incoming record matched against a master data set, deterministic
matching will also offer some performance benefits. Remember that
probabilistic matching relies on statistical analysis of the data,
which may slow performance on real-time jobs.
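The statistical analysis behind probabilistic matching can be hinted at with a simplified frequency-based weight: the rarer a value, the more an agreement on it counts. This sketch assumes a fixed `m_probability` (the chance that a field agrees when two records truly match) and uses each value's relative frequency as the chance of agreeing by coincidence. Both are simplifying assumptions, and the function name is hypothetical.

```python
import math
from collections import Counter


def agreement_weights(values, m_probability=0.95):
    """Return a weight per distinct value; rarer values earn larger weights."""
    counts = Counter(values)
    total = len(values)
    return {
        value: math.log2(m_probability / (count / total))
        for value, count in counts.items()
    }


# Hypothetical column of last names: two common values and one rare one.
last_names = ["Jones"] * 50 + ["Smith"] * 49 + ["Jimmerson"]
weights = agreement_weights(last_names)
```

With this sample, agreement on “Jimmerson” receives a much larger weight than agreement on “Jones”. Computing such frequencies over the whole data set is the kind of up-front analysis that can slow real-time jobs.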
Blocking
No matter which algorithms you decide to use, comparing every record
in a large data set against every other record is a daunting task,
both in the resources and in the time needed. If you have a million
rows of data, an exhaustive comparison means roughly 500 billion
record pairs. Even comparing a single record against your large
database would take significant time. The number of pairs grows
quadratically as your dataset grows. That’s why most software
vendors recommend first making blocks or grouping keys part of the
matching process. By creating a key so that only those records that
have some basic similarities will be compared, matching performance
improves with little effect on matching accuracy, provided the keys
are chosen well.
The key might consist of part of a last name, postal code, street
name, or sex. Only record pairs with identical keys will be grouped
for more in-depth matching.
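A grouping key of this kind can be sketched as follows. The key recipe (first three letters of the last name plus postal code), the record layout and the function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records used only to illustrate blocking.
records = [
    {"id": 1, "last_name": "Smith", "postal_code": "02134"},
    {"id": 2, "last_name": "Smithe", "postal_code": "02134"},
    {"id": 3, "last_name": "Smyth", "postal_code": "02134"},
    {"id": 4, "last_name": "Jones", "postal_code": "10001"},
]


def blocking_key(record: dict) -> str:
    """Key built from the first three letters of the last name and the postal code."""
    return f'{record["last_name"][:3].upper()}|{record["postal_code"]}'


def candidate_pairs(rows: list) -> list:
    """Return only the id pairs whose records share a blocking key."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[blocking_key(row)].append(row["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


pairs = candidate_pairs(records)
all_pairs = list(combinations([r["id"] for r in records], 2))
```

Here six possible pairs shrink to one candidate pair. Note that “Smyth” lands in a different block from “Smith”, so the choice of key determines which records ever get compared.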
Organizations often employ a multi-match strategy, where matching is
analyzed from various angles. For name and address data,
organizations might rely heavily on tax ID (Social Security) number
where it exists, while relying on other factors, such as address,
last name, city, and state, where the tax ID is missing from the
customer records.
Matching Process
If we were to take a journey through a matching system from a
record’s perspective, it would go something like this. The matcher
starts with the entire database and quickly whittles down the list of
possible matches by establishing the match key. Only those records
that are somewhat alike (those with the same match key, or window
key) are compared more precisely. This step is extremely fast. In a
100,000-record database, this step might reduce the list of match
possibilities to, say, fifty candidates for detailed matching. In
phase two, the fifty remaining records are scrutinized more carefully
with the solution’s powerful algorithms.
All of the major matching engines on the market use a similar
two-step process for matching. The wide variation seems to be in the
actual algorithms for detailed matching, more specifically in the
efficiency with which they find correct matches, the rate at which
they avoid bad matches, the ability of the matching solution to
handle a wide variety of data domains and types, and the speed at
which they complete the task.
Process: Detail

Profile: Profile the data to understand data quality issues. Issues
can be categorized, so that misfielded data can go through one
process, incomplete records through another, etc.

Standardize: Use a standardization process to optimize match
efficiency. Be certain that data conforms to standards, if they
exist, that data is fielded correctly, and that nicknames, data
shapes, and abbreviations are standardized.

Identify fields to compare (any/all field types): Perform the match
on fields that are unique. In this example, a straight name and
address match will be performed. Matching can use any data
available, however. This includes tax ID number, customer number,
e-mail address, etc.

Match Grouping Keys: To deliver matching that is both accurate and
high performance, Talend recommends first making grouping keys part
of the matching process. Comparing a single record against your
entire database can take significant time, and the time to compare
every record with every other grows quadratically as your dataset
grows. By creating a key so that only those records that have some
basic similarities will be compared, matching performance improves
with little effect on matching accuracy. For example, you might
generate a set of keys based on features of each customer record.
The key might consist of part of a last name, postal code, street
name, or sex. Only record pairs with identical keys will be grouped
for more in-depth matching.

Match: Talend provides algorithms that are specifically designed for
name comparison, address comparison, spelling errors, items that
sound alike such as “phish” and “fish”, and more. Match results are
then grouped by pass, suspect and fail patterns. These match
patterns allow users to know exactly why records were brought
together. This information is crucial for tuning matcher rules.
Users can experiment with different scores and weighting to find
more matches.
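Experimenting with scores and weighting can be pictured as a weighted average of per-field similarities. In this sketch the weights and field names are hypothetical, and Python's standard `difflib.SequenceMatcher` stands in for whichever comparison algorithm a real matcher would apply to each field.

```python
from difflib import SequenceMatcher

# Hypothetical field weights: the tax ID counts most, as a user might choose.
WEIGHTS = {"tax_id": 0.5, "last_name": 0.3, "postal_code": 0.2}


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity between 0 and 1 (1 means identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_score(r1: dict, r2: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of per-field similarities for two records."""
    total = sum(weights.values())
    return sum(w * similarity(r1[f], r2[f]) for f, w in weights.items()) / total


a = {"tax_id": "123-45-6789", "last_name": "Smith", "postal_code": "02134"}
b = {"tax_id": "123-45-6789", "last_name": "Smyth", "postal_code": "02134"}
score = weighted_score(a, b)
```

Changing the entries in `WEIGHTS` and re-scoring the same pairs is one simple way to see how tuning affects which records come together.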
What to do with Matches
Once data has been processed through the matcher, there are several
possible outcomes. Between any two given records in the same
match window, the matcher may find:
No relationship
Match – the matcher found a definite match based on the
criteria given
Suspect – the matcher thinks it found a match but is not
confident. The results should be manually reviewed.
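These three outcomes can be sketched as thresholds applied to a similarity score between 0 and 1. The threshold values and the function name are hypothetical and would be tuned for each project.

```python
def classify(score: float, match_threshold: float = 0.90,
             suspect_threshold: float = 0.75) -> str:
    """Map a similarity score in [0, 1] to one of the three outcomes."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= match_threshold:
        return "match"
    if score >= suspect_threshold:
        return "suspect"  # route to manual review
    return "no relationship"


for value in (0.97, 0.80, 0.40):
    print(value, classify(value))
```

Widening the gap between the two thresholds sends more pairs to manual review; narrowing it trades review effort for a higher risk of wrong automatic decisions.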
The matcher does not stop when it finds a match. In large data sets,
it is often the case that an individual may exist in many different
forms in the data.
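One way to collect the many forms of the same individual is to group matched pairs into clusters. The sketch below uses a simple union-find structure and assumes that matches are transitive (if A matches B and B matches C, all three belong together), which a real project may or may not want. The function name is hypothetical.

```python
def cluster_matches(record_ids, matched_pairs):
    """Group record ids into clusters connected by matched pairs."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(clusters.values())


clusters = cluster_matches([1, 2, 3, 4, 5], [(1, 2), (2, 3)])
```

Records 1, 2 and 3 end up in one cluster even though 1 and 3 were never matched directly, while records 4 and 5 remain on their own.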
Resolving the suspect matches is the most time-consuming follow-up
task after matching is complete. For this reason, some tools offer
utilities and strategies for dealing with them. These tools present
the suspect matches in a graphical user interface and allow users to
pick which relationships are accurate and which are not.
Conclusion
Matching is vital to providing data that is fit for use in enterprise
applications. This white paper has outlined several key strategies.
Be sure to standardize data first, making sure that addresses are
being compared to addresses and not to names, for example. Use
grouping keys to keep the number of comparisons manageable. Finally,
use powerful yet transparent routines to perform the match, so that
any data that has been brought together can be easily reconciled.
About Talend Data Quality
Talend offers a complete Data Quality solution composed of two
products. The open source data profiling tool, Talend Open Profiler,
is available now on the web site, ready for download and free to use.
Alternatively, you may choose the powerful Talend Data Quality suite
for the improvement and corporate management of data quality. The
suite includes the foundation tools for data quality, including data
profiling, correction, issue mitigation, advanced reporting and an
integrated Data Integration tool for quick and easy data
transformations. Talend Data Quality includes the following tools:
Data Profiling provides deep analysis of Data Quality problems
and measures the evolution of Data Quality over time. It
includes a report management framework that will compare
current and historical statistics to determine the data
improvement or degradation.
Data Explorer lets users directly drill down into the tables of
the analyzed databases to correlate data more precisely.
Data Cleansing improves Data Quality by using standards
reference data and cross-checking your data against other
databases and reference data. It also enriches data by
providing value-add information that actually improves the
quality and usefulness of existing data.
Data Matching helps you identify hidden duplicates in the
data, offering a single view of customers, part numbers or
almost any other data domain.
Data Quality Portal is an analytical web application that lets
business users share and capitalize on analysis results and
reports.
All functionality is completely integrated with Talend’s data
management solutions: Talend Integration Suite and Talend MDM.
Take what you’ve learned from profiling and use the analysis in your
Data Integration or MDM workflow. A single user interface,
repository and deployment environment provide all you need to
complete your data management tasks.
Talend Data Quality: Cleanse & track; specific components; reports; Data Quality Portal
Talend Open Profiler: Identify Data Quality problems; free, GPL, no limitations; custom indicators
For more information on Talend open source solutions: http://www.talend.com
Contact Talend in your region: http://www.talend.com/contact
© 2010 Talend. All rights reserved.