White Paper - Bitpipeviewer.media.bitpipe.com/1216309501_94/1285601573_481/Talend_WP... ·...

Talend White Paper Matching Technology Improves Data Quality

How Matching Technology Improves Data Quality

White Paper



Table of Contents

How Matching Technology Improves Data Quality
What is Matching?
Benefits of Matching
Matching Use Cases
What is Matched?
Standardization before Matching
Matching Technology
Probabilistic
Deterministic versus Probabilistic
Blocking
Matching Process
What to do with Matches
Conclusion
About Talend Data Quality



How Matching Technology Improves Data Quality

Enterprise applications such as Master Data Management, Customer Data Integration, and Data Warehouse projects rely on clean, duplicate-free data to be effective business tools. Companies have sought the “single source of truth” in their customer data, transactional data and even in their metadata to make these applications most effective. Matching plays an important role in achieving a single view of customers, parts, transactions and almost any type of data.

For decades, software vendors and computer scientists have devised strategies and technologies for finding relationships within data. One of the first published works on matching dates from 1946, when Halbert Dunn, MD, then Chief of the National Office of Vital Statistics at the U.S. Public Health Service, wrote a paper describing how the pages of a person’s medical records could be linked to create a “book of life”. It was an idea ahead of its time, since the technology of the era was not up to the task.

In 1969, Ivan Fellegi and Alan Sunter formalized probabilistic matching techniques in a breakthrough research paper. Over the years, others have refined the algorithms used to match records. The strategies for matching records are mature and well-documented, although not always simple.

This white paper explains what matching is, how it works, and the different methods for matching data.

What is Matching?

Matching is the process of bringing together identical or similar records in order to identify or remove duplicates from the data. Matching is also often used to link records that are related in some way. Since data doesn’t always state the relationship between two data elements explicitly, matching technology lets us define rules for identifying items that might be related.

How relationships are interpreted and used, whether in business-to-business or business-to-consumer data, depends on the context of the project and the needs of the business users. Commonly, corporations use matching to remove duplicate customer records and thereby optimize marketing programs, but there are many uses for this optimized data beyond marketing. Data may contain business names rather than households, in which case relationships need to be created between “IBM Corp.”, “International Business Machines”, and “I.B.M.”, for example.

Data to be matched may also come from supply chain and ERP systems, where the matcher relies on patterns to match, for example, part number XL-12345 to XL123-45. Data may also be descriptive, where the matcher needs to find a relationship between “Frozen Carrots” and “Car, Frzn”. Identifying these relationships contributes to the organization’s data management strategy and effectiveness.

Matching is also called “linking” because the end result of finding two related records in your data is not necessarily to delete one of the records or to combine them. Rather, the way to understand your customers may simply be to link the records together using keys. A family that lives at one address and does business with one bank is an example: each member’s records should remain separate, but linked by household.



Benefits of Matching

Nearly any organization can benefit from matching. The top benefits include:

Billing and Credit – Removal of duplicate records and

householding can lead to benefits such as unified billing,

accurate revenue accounting, accurate contract billings,

unified credit management and reduced mailing costs.

Direct mail/marketing – Companies can decrease the costs

associated with direct mail by mailing one and only one

promotion to any given household.

Relationships with customers – Organizations with more

accurate and duplicate-free records have better relationships

with those customers.

Supply Chain and Inventory Efficiency – By matching

inventory in a warehouse that is physically identical, but

seemingly unconnected in the database, the company can

lower carrying costs. There's less inventory to put away, less

space to rent, lower insurance and taxes on inventory, lower

costs to physically count inventory and lower risk from

obsolete inventory.

Vendor Cost Savings – With cleaner vendor and inventory

data, buyers have more accurate information on the amount

purchased from any given vendor. Armed with accurate

data, the buyer can apply pressure on the vendor to lower

costs.



Overall Corporate Efficiency – With duplicate-free data,

users are more likely to adopt systems that will improve

corporate efficiency. Storing fewer gigabytes of more

accurate data is much more efficient.

Matching Use Cases

Matching software is commonly used in several different

configurations, including but not limited to:

One-time matching project – where companies perform a

one-time removal of duplicates from a single database or a

one-time linking of two or more databases

Real time single database – often accomplished by first identifying existing duplicates, then using real-time matching to ensure that no new duplicate records are added to the database

Matching of multiple databases at regular intervals – most commonly as part of a data warehouse, where data is loaded nightly into the warehouse to support business intelligence metrics

Linking multiple databases via a master index – where the

flow of data enters a central master data management hub.

In such a configuration, data quality is a service, called by

the master data management application



What is Matched?

Powerful matching technology will match a variety of types of data,

including but not limited to:

Individuals – Individuals living at the same address. The software finds individuals even if the data is mistyped or if nicknames are involved. The software knows that Bob and Rob are derived from the same root name.

Households – Members of the same household living at one address. This is especially useful for finding the head of household and contacting only one person with marketing offers, for example.

Businesses – Companies with the same or similar names, with the ability to recognize EMC, E.M.C. and E M C Corp as the same company, for example.

Inventory or Supply Chain Items – Parts and items with the same or similar names that a company wants to consolidate. The software understands that a “bolt, one half inch” might be a match to a “1/2" bolt” or that “carrots, frozen” might be the same as “Frz Car”. Since this data varies so much from industry to industry, some standardization of data may be necessary for finding these types of matches.

Standardization before Matching

Standardization is widely considered a necessary step before matching. Standardization is a process by which an agreed standard is defined for a given type of data. The rules are expressed in a business rules engine and applied to the data. For example, the postal services in various countries publish standards for name and address data. One such standard in the United States is that, in addresses, the word “street” is always abbreviated “ST”, and not “Str” or “Street”.
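As a minimal sketch of how such an address rule might be applied, the Python below upper-cases an address line, strips punctuation, and replaces known street-type words with one agreed abbreviation. The `STREET_SUFFIXES` table and the `standardize_address` function are hypothetical names used for illustration only, and the table holds a tiny hand-picked subset rather than any postal authority's full list.

```python
import re

# Illustrative subset only; a real project would load the postal
# authority's complete abbreviation list instead.
STREET_SUFFIXES = {
    "STREET": "ST", "STR": "ST", "ST": "ST",
    "AVENUE": "AVE", "AV": "AVE", "AVE": "AVE",
}

def standardize_address(line: str) -> str:
    """Upper-case the line, drop punctuation, and normalize known suffix words."""
    cleaned = re.sub(r"[^\w\s]", "", line.upper())
    tokens = cleaned.split()
    # Every token is looked up, so this simple version does not try to
    # distinguish a street type from the same word used elsewhere.
    return " ".join(STREET_SUFFIXES.get(token, token) for token in tokens)
```

With this rule in place, “25 Main Street”, “25 Main Str.” and “25 main st” all standardize to the same string, so even an exact-match comparison will bring them together.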

Users may also opt to standardize data shapes. In an ERP or supply chain system, for example, a company may decide to always designate part numbers as NN-AAAAA, where N is a digit and A is an alphanumeric character. In this scenario, part number 12-HGAJS would be valid, while 12HGAJS_2 would be subject to standardization.
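A data shape like NN-AAAAA maps naturally onto a regular expression. The sketch below is one possible way to check and repair part numbers. It assumes that the alphanumeric positions are upper-case letters or digits, and that a value which cannot be coerced safely should be flagged for review rather than guessed at. The function names are hypothetical.

```python
import re
from typing import Optional

# NN-AAAAA: two digits, a hyphen, then five alphanumeric characters.
# Upper-case only is an assumption made for this illustration.
PART_SHAPE = re.compile(r"[0-9]{2}-[A-Z0-9]{5}")

def is_standard_part_number(value: str) -> bool:
    """True when the whole value already has the NN-AAAAA shape."""
    return PART_SHAPE.fullmatch(value) is not None

def standardize_part_number(value: str) -> Optional[str]:
    """Try to coerce a value to NN-AAAAA; return None when that is not safe."""
    compact = re.sub(r"[^0-9A-Za-z]", "", value).upper()
    candidate = f"{compact[:2]}-{compact[2:]}"
    return candidate if is_standard_part_number(candidate) else None
```

Here 12-HGAJS passes the check unchanged and 12hgajs can be repaired to 12-HGAJS. 12HGAJS_2 has one character too many, so it comes back as None and is left for a person or a further rule to resolve.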

For name data, nicknames can also be problematic. Matching “Steve” with “Stephen”, for example, requires standardization. The strategy here is to create a root name attribute in the database that stores the root name of Steve, which is Stephen. This strategy keeps the original names intact in the database, but allows “Steve Williams” and “Stephen Williams” to match.
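The root name strategy can be sketched in a few lines of Python. The lookup table below is a hypothetical, three-entry stand-in for a real nickname reference list, and the field names `first_name` and `root_first_name` are assumptions made for this example.

```python
# Hypothetical nickname table; real reference lists are far larger.
ROOT_NAMES = {
    "steve": "Stephen",
    "bob": "Robert",
    "rob": "Robert",
}

def add_root_name(record: dict) -> dict:
    """Return a copy of the record with a root_first_name attribute added.

    The original first_name is left untouched, so nothing is lost.
    """
    first = record["first_name"].strip()
    root = ROOT_NAMES.get(first.lower(), first)
    return {**record, "root_first_name": root}
```

Matching can then compare `root_first_name` instead of `first_name`, so that “Steve Williams” and “Stephen Williams” agree on the first name while both original values stay in the database.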

Standardization also helps when data is misfielded, for example when a name is inadvertently typed into an address line. This commonly occurs during data migration, especially from legacy systems where data tends to be less structured. For example, some billing system data may contain “Attn: Accounts Payable” on various lines of the record, and it’s up to standardization to sort this out. By the same token, similar records won’t come together when “Steven A. Smith” is compared to “25 Main St.”. Profiling the data ahead of time is the best way to ensure that the correct data exists in the correct field and that apples are being compared to apples.

The standardization process improves matching results, even when implemented along with very simple matching algorithms. More exact matches will exist once the address has been standardized and the root name has been found. More exact matches will also exist once part numbers and descriptions have been standardized. However, in combination with advanced matching techniques, standardization can improve information quality even more.

Data standardization can be achieved with Talend Data Quality by using your own business rules, regular expressions and even public domain and government sources like the US Census, data.gov and geonames.org.

This standardization is not only integral to data quality, it’s integral to the effectiveness of master data management, CRM, ERP and many business applications.

Matching Technology

The strength of a matching solution depends on the power of the algorithms it uses to establish a match. Solutions typically include routines that are specially designed to compare names, addresses, strings and partial strings, business names, spelling errors, postal codes, tax ID numbers, data that sounds similar such as “Phig” and “Fig”, and more.

After standardization, Talend Data Quality matching uses algorithms to determine when two or more records match. It identifies matching records referring to the same business, household, individual/contact, product, etc. and identifies relationships linking a contact to a business, an individual to a household, a product to a product class, or other.

There are two common types of matching technology on the market today: deterministic and probabilistic.

Deterministic or “rules-based” matching compares records using fuzzy algorithms. The various algorithms allow for a little bit of “slop” in the data, so that they can still identify a link when there are typos or phonetic similarities (like ph and f). Ultimately, the user decides which fields to compare and what algorithm to use on each. Each field can have a “weight”, so that a user might decide that tax ID number (Social Security number) carries more weight than last name, for example. The user can choose from these common algorithms:

Fuzzy Match Algorithm | Use Case/Description

Exact Match | Use an exact matching algorithm to find exact duplicates. “Smith” will match to “Smith” and only “Smith”, with no variations. After records have been standardized, a certain number of new exact matches should come as a natural result.

SoundEx | Applied to US census records in the 1930s, SoundEx is a phonetic algorithm for indexing names by sound, as pronounced in English. The algorithm mainly encodes consonants; a vowel is not encoded unless it is the first letter. Improvements to SoundEx are the basis for many of the modern phonetic algorithms that follow.

Metaphone and Double Metaphone | Metaphone was developed in the 1990s in response to the limitations of SoundEx, using a larger set of rules for English pronunciation. Double Metaphone was developed later to provide even more power. It returns both a primary and a secondary code to account for the many variations of surnames with common ancestry. For example, encoding the name “Smith” yields a primary code of SM0 and a secondary code of XMT, while the name “Schmidt” yields a primary code of XMT and a secondary code of SMT; both have XMT in common. Double Metaphone does a better job because it uses a much more complex rule set for coding than SoundEx and the original Metaphone.

Levenshtein | In the 1960s, the Russian scientist Vladimir Levenshtein devised the Levenshtein distance algorithm. Levenshtein distance is a measure of the similarity between two strings: the number of deletions, insertions, or substitutions required to transform one into the other. Users define the maximum distance that still counts as a match. For example, the distance between “Smith” and “Smith” is 0, because no transformations are needed. However, the distance between “Smith” and “Smyth” is 1, because one substitution is needed. “Smith” and “Smythe” would have a distance of 2, and so on.

Jaro-Winkler | Jaro-Winkler is similar in function to Levenshtein, since it measures the differences between strings. However, characters at the beginning of the string are given more weight than those at the end. Jaro-Winkler delivers a score between 0 and 1, with 1 being a perfect match.

Probabilistic

The second category of matching technology is probabilistic, based on the very same theories that Fellegi and Sunter wrote about in 1969. The intricacies of probabilistic matching run well beyond the scope of this white paper. However, statistical analysis and advanced algorithms are key to its success.

The algorithm is smart enough to know that a common last name like “Jones” should play a smaller role in matching than a less common last name like “Jimmerson”. How does it know? Probabilistic matching technology performs statistical analysis on the data to determine the frequency of items. It then uses that analysis to weight the match, similar to the way that the user can apply a weight to the relevance of each field.
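As a rough illustration of the frequency analysis described in the Probabilistic section, the Python sketch below counts how often each last name occurs and gives rarer names a larger agreement weight. The log-based weight is a simplification chosen for this example. It is not the full Fellegi and Sunter model, and the function name and sample data are hypothetical.

```python
import math
from collections import Counter

def frequency_weights(values):
    """Map each distinct value to log2(total / count).

    A value shared by many records gets a weight near zero, while a
    rare value gets a large weight, so agreement on it counts for more.
    """
    counts = Counter(values)
    total = len(values)
    return {value: math.log2(total / count) for value, count in counts.items()}

# Hypothetical sample: "Jones" is common, "Jimmerson" is rare.
last_names = ["Jones"] * 6 + ["Smith"] + ["Jimmerson"]
weights = frequency_weights(last_names)
```

In this sample, two records that agree on “Jimmerson” earn a much larger weight than two records that agree on “Jones”, which mirrors the behavior described for probabilistic matchers.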



Deterministic versus Probabilistic

Data quality solutions often offer both types of matching, since one is

not necessarily superior to the other. While deterministic matching

doesn’t take into account a holistic view of the data set, it will

produce a good many matches and is much easier for data

management professionals to understand and tune. Probabilistic may

be superior in its holistic view of the data, but the ability to

understand and track why records matched and why they didn’t is

hindered by a complex algorithm. If you’re trying to do real-time

matching, having one incoming record match up against a master data

set, deterministic will also offer some performance benefits.

Remember that probabilistic relies on statistical analysis of the data

and that may slow performance on real-time jobs.

Blocking

No matter what algorithms you decide to use, comparing a large number of records to one another to find matches is a daunting task, both in the resources and in the time needed. If you have a million rows of data, comparing every record with every other record would require roughly half a trillion comparisons. Even comparing a single record against a large database takes significant time. The number of comparisons grows quadratically as your dataset grows. That’s why most software vendors recommend making blocks, or grouping keys, the first part of the matching process. By creating a key so that only those records that have some basic similarities are compared, matching performance improves with little effect on matching accuracy, provided that records that truly match share the same key.

The key might consist of part of a last name, postal code, street name, or sex. Only record pairs with identical keys will be grouped for more in-depth matching.
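The effect of a blocking key can be seen in a short Python sketch. The key used here, the first three letters of the last name plus the postal code, is only one hypothetical choice, and the field names are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    """Hypothetical key: first three letters of last name plus postal code."""
    return (record["last_name"][:3].upper(), record["postal_code"])

def candidate_pairs(records):
    """Return index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[blocking_key(record)].append(index)
    pairs = []
    for members in blocks.values():
        # Detailed comparison is only needed inside each block.
        pairs.extend(combinations(members, 2))
    return pairs

records = [
    {"last_name": "Smith", "postal_code": "02134"},
    {"last_name": "Smithe", "postal_code": "02134"},
    {"last_name": "Jones", "postal_code": "02134"},
    {"last_name": "Smith", "postal_code": "90210"},
]
pairs_to_compare = candidate_pairs(records)
```

For these four records, comparing everything with everything would take six comparisons, while blocking leaves a single candidate pair for detailed matching. The choice of key matters: with this particular key, “Smyth” and “Smith” would land in different blocks and never be compared, which is why keys are usually built from coarse or phonetic features.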



Organizations often employ a multi-match strategy, where matching is analyzed from various angles. For name and address data, organizations might rely heavily on tax ID (Social Security) number where it exists, while relying on other factors, such as address, last name, city, and state, where tax ID is missing from the customer records.
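A multi-match strategy of this kind might look like the following Python sketch. It assumes the records have already been standardized, so that simple equality is meaningful, and it treats two differing tax IDs as decisive. That is a design choice made for this example rather than a rule stated by any particular product. The function and field names are hypothetical.

```python
FALLBACK_FIELDS = ("last_name", "address", "city", "state")

def match_pair(a: dict, b: dict) -> str:
    """Return the name of the rule that matched the pair, or 'no match'."""
    # Pass 1: when both records carry a tax ID, let it decide.
    if a.get("tax_id") and b.get("tax_id"):
        return "tax_id" if a["tax_id"] == b["tax_id"] else "no match"
    # Pass 2: tax ID missing on at least one side, fall back to other fields.
    if all(a.get(field) and a.get(field) == b.get(field) for field in FALLBACK_FIELDS):
        return "name_and_address"
    return "no match"
```

Recording which rule fired, rather than a bare yes or no, keeps the result easy to audit when match rules are tuned later.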

Matching Process

If we were to take a journey through a matching system from a record’s perspective, it would go something like this. The matcher starts with the entire database and quickly whittles down the list of possible matches by establishing the match key. Only those records that are somewhat alike (those with the same match key, and therefore in the same match window) are compared more precisely. This step is extremely fast. In a 100,000-record database, this step might reduce the list of match possibilities to, say, fifty possible detailed matches. In phase two, the fifty remaining records are scrutinized more carefully with the solution’s powerful algorithms.

All of the major matching engines on the market use a similar two-step process for matching. The wide variation seems to be in the actual algorithms for detailed matching: more specifically, in the efficiency with which they find correct matches, the rate at which they avoid bad matches, their ability to handle a wide variety of data domains and types, and the speed at which they complete the task.

PROCESS | DETAIL

Profile | Profile the data to understand data quality issues. Issues can be categorized, so that misfielded data can go through one process, incomplete records through another, etc.

Standardize | Use a standardization process to optimize match efficiency. Be certain that data conforms to standards, if they exist, that data is fielded correctly, and that nicknames, data shapes, and abbreviations are standardized.

Identify fields to compare (any/all field types) | Perform the match on fields that are unique. In this example, a straight name and address match will be performed. Matching can use any data available, however. This includes tax ID number, customer number, e-mail address, etc.

Match Grouping Keys | To deliver matching that is both accurate and high performance, Talend recommends first making grouping keys part of the matching process. Comparing a single record against your entire database can take significant time, and the time to execute grows rapidly as your dataset grows. By creating a key so that only those records that have some basic similarities are compared, matching performance improves with little effect on matching accuracy. For example, you might generate a set of keys based on features of each customer record. The key might consist of part of a last name, postal code, street name, or sex. Only record pairs with identical keys will be grouped for more in-depth matching.

Match | Talend provides algorithms specifically designed for name comparison, address comparison, spelling errors, items that sound alike such as “phish” and “fish”, and more. Match results are then grouped by pass, suspect and fail patterns. These match patterns allow users to know exactly why records were brought together. This information is crucial for tuning matcher rules. Users can experiment with different scores and weighting to find more matches.
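The Match step can be illustrated with a short Python sketch that scores each field with Levenshtein distance, combines the field scores using weights, and sorts the result into pass, suspect or fail. The weights, thresholds and field names below are hypothetical values chosen for this example. They do not describe Talend's actual scoring, and in practice they would be tuned against manually reviewed samples.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a: str, b: str) -> float:
    """Scale edit distance to the range 0..1, where 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest

# Hypothetical field weights and thresholds.
WEIGHTS = {"tax_id": 0.5, "last_name": 0.3, "first_name": 0.2}
PASS_THRESHOLD = 0.90
SUSPECT_THRESHOLD = 0.70

def classify(a: dict, b: dict):
    """Return (label, score) for a pair of records."""
    score = sum(weight * similarity(a[field], b[field])
                for field, weight in WEIGHTS.items())
    if score >= PASS_THRESHOLD:
        return "pass", score
    if score >= SUSPECT_THRESHOLD:
        return "suspect", score
    return "fail", score
```

The `levenshtein` function reproduces the distances quoted earlier for “Smith”, “Smyth” and “Smythe”. Returning the score alongside the label is what lets a user see why two records were brought together and adjust the weights or thresholds accordingly.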

What to do with Matches

Once data has been processed through the matcher, there are several

possible outcomes. Between any two given records in the same

match window, the matcher may find:

No relationship

Match – the matcher found a definite match based on the

criteria given

Suspect – the matcher thinks it found a match but is not

confident. The results should be manually reviewed.



The matcher does not stop when it finds a match. In large data sets,

it is often the case that an individual may exist in many different

forms in the data.
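One way to collect all the forms of the same individual is to treat each pairwise match as a link and merge linked records into groups. The union-find sketch below is an assumption about how this could be done, not a description of any specific product, and it treats matches as transitive (if A matches B and B matches C, all three are grouped), which may not suit every data set.

```python
def group_matches(record_ids, matched_pairs):
    """Merge pairwise matches into groups of record IDs using union-find."""
    parent = {record_id: record_id for record_id in record_ids}

    def find(x):
        # Follow parent links to the group representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for record_id in record_ids:
        groups.setdefault(find(record_id), []).append(record_id)
    return sorted(groups.values())

# Records 1, 2 and 3 are three forms of one individual; 4 and 5 stand alone.
groups = group_matches([1, 2, 3, 4, 5], [(1, 2), (2, 3)])
```

Each resulting group can then share a common key, which supports linking the records without deleting or merging any of them.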

Resolving the suspect matches is the most time-consuming follow-up task after matching is complete. For this reason, some tools offer utilities and strategies for dealing with them. These tools present the suspect matches in a graphical user interface and allow users to pick which relationships are accurate and which are not.

Conclusion

Matching is vital to providing data that is fit for use in enterprise applications. This white paper has outlined several key strategies. Be sure to standardize data first, so that addresses are compared to addresses and not to names, for example. Then use powerful yet transparent routines to perform the match, to ensure that any data that has been brought together can be easily reconciled.



About Talend Data Quality

Talend offers a complete Data Quality solution composed of two products. The open source data profiling tool, Talend Open Profiler, is available now on the web site, ready for download and free to use. Alternatively, you may choose the powerful Talend Data Quality suite for the improvement and corporate management of Data Quality. The suite includes the foundation tools for data quality, including data profiling, correction, issue mitigation, advanced reporting and an integrated Data Integration tool for quick and easy data transformations. Talend Data Quality includes the following tools:

Data Profiling provides deep analysis of Data Quality problems

and measures the evolution of Data Quality over time. It

includes a report management framework that will compare

current and historical statistics to determine the data

improvement or degradation.

Data Explorer lets users directly drill down into the tables of

the analyzed databases to correlate data more precisely.

Data Cleansing improves Data Quality by using standards

reference data and cross-checking your data against other

databases and reference data. It also enriches data by

providing value-add information that actually improves the

quality and usefulness of existing data.

Data Matching helps you identify hidden duplicates in the

data, offering a single view of customers, part numbers or

almost any other data domain.

Data Quality Portal is an analytical web application that lets

business users share and capitalize on analysis results and

reports.


All functionality is completely integrated with Talend’s data management solutions: Talend Integration Suite and Talend MDM. Take what you've learned from profiling and use the analysis in your Data Integration or MDM workflow. A single user interface, repository and deployment environment provide all you need to complete your data management tasks.

Talend Data Quality Cleanse & track - Specific components - Reports - Data Quality Portal

Talend Open Profiler Identify Data Quality problems - Free, GPL, no limitations - Custom indicators

For more information on Talend open source solutions: http://www.talend.com Contact Talend in your region: http://www.talend.com/contact

© 2010 Talend. All rights reserved.

