DataQualityManagement Predictive Analytics

8/12/2019 DataQualityManagement Predictive Analytics


Data Quality Management: The Key to Successful Predictive Analytics

A White Paper



Table of Contents

Introduction
Data Quality: How to Evaluate and Enhance It
  Finding the Problems
  Fixing the Problems
The Impact of Data Quality on Predictive Analytics
How Data Quality and Predictive Analytics Work Together
Comprehensive Data Quality Management and Analytical Solutions
  iWay Data Quality Center
  iWay Data Profiler
  WebFOCUS RStat
Conclusion



Information Builders

Introduction

Organizations of all types and sizes are enhancing their business intelligence (BI) strategies with predictive analytics. In fact, analyst firm IDC estimated that the business analytics market, which includes predictive modeling tools, would expand by at least 6.1 percent. [1] A similar report issued by Forrester Research named business analytics as the fastest-growing segment of global IT software, with close to 70 percent of companies polled expressing a strong interest. [2]

Traditional reporting and analysis provide you with a rear-view perspective of events that have

already occurred. To make forecasts and predictions, you have to rely on the experience, instinct,

and intuition of your analysts. Predictive analytics, on the other hand, fosters a more proactive

approach to decision-making. It applies sophisticated data mining and statistical analysis

techniques to large volumes of both current and historical data, which helps not only your

analysts, but also your business users to distinguish the patterns, trends, and outliers that can serve

as clear indicators of what events, actions, and conditions may occur in the future.

The creation of predictive models and the deployment of a predictive analysis tool, however, may

not be enough. The quality of the data that you apply the models and tools against must be fully

optimized to ensure the utmost accuracy in your outcomes.

A recent article published on Web Analytics World claims that, "In order to attain accurate business intelligence, companies must maintain quality data. Predictive analysis requires both past and current data about many different things, including customers, businesses, products, and the economy. All of this information is used to draw relationships and patterns between sets of data. If the data is accurate and well maintained, then the business intelligence produced will be high quality as well." [3]

That's where data quality management comes in. Data quality management is the process of

enhancing the accuracy, completeness, and consistency of your enterprise information. A variety

of techniques, such as profiling, standardizing, and cleansing, are leveraged to enrich and improve

the integrity of the data contained in your enterprise information sources.

This is particularly important when it comes to predictive analytics, as "garbage in" will most likely

lead to "garbage out." If the predictive results used for strategic planning and decision-making are

unreliable, then the predictive modeling efforts will end up harming an organization more than

helping it.

This white paper will focus on the importance of data quality management as it relates to predictive analytics. It will explore various methodologies for assessing and improving data quality, and highlight best practices in preparing data for predictive analysis. It will also present a real-world scenario of how poor data quality would negatively impact predictive modeling efforts.

[1] Kumar, Monika. "2010 State of the U.S. Business Analytics Market," IDC, March 2010.

[2] Evelson, Boris; Garbani, Jean-Pierre; Green, Charles; Kisker, Holger; Lisserman, Miroslaw. "The State of Business Intelligence Software and Emerging Trends: 2010," Forrester Research, May 2010.

[3] "Data Quality and Predictive Analysis," Web Analytics World, January 2009.




Finding the Problems

The quality of your data will drive the quality of your predictive results. How can you ensure data is clean enough before you apply a predictive model against it? Data quality needs to be looked at from four primary perspectives:

• Accuracy: Do data elements properly reflect the object being described by that particular field? Is a product's SKU number correct? Accuracy issues most often occur during manual data entry and updating processes, where human error can lead to such mistakes as transposed numbers or misspellings.

• Consistency: When certain information is analyzed or measured, does it produce the same results repeatedly? Reliability problems are frequently caused by similar information residing in disparate, unsynchronized systems. Multiple databases contain conflicting data, causing inconsistencies when that information is merged.

• Comprehensiveness: Have all fields been filled? Companies often find that customer records are missing information, such as e-mail addresses or zip codes. When a record has a high number of empty or incomplete fields, it should be considered null.

• Timeliness: How current is the information? Historical data, often used in predictive analytics, may be outdated, such as a criminal's arrest record with an old address. For a repeat offender, the most current address would be needed for effective predictive policing efforts.

One of the best ways to assess integrity from these four perspectives is through advanced

profiling techniques. Also commonly referred to as data discovery, profiling is the process of

gathering statistics about enterprise data. What are its primary characteristics and attributes?

How was it created and by whom? Which users access it most frequently? For what purposes is it

primarily used? Most importantly, what kind of shape is it in?

Profiling is one of the most effective means of obtaining an in-depth understanding of corporate

data, so it can be optimized for predictive modeling. It will deliver the insight needed to precisely

determine its overall health; identify, prioritize, and correct any issues or errors (some of which

may be expected, others may be surprises); and rectify the underlying causes of quality problems.

Additionally, once an initial profile has been created, you can perform ongoing monitoring of

profile-related metrics, taking a more proactive approach to detecting and fixing any future

quality problems.
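
To make the idea of profiling concrete, here is a minimal sketch in Python of gathering a few statistics about one field. It is an illustration only and does not represent any particular product; the function name `profile_column`, the field names, and the sample records are assumptions introduced for this example.

```python
from collections import Counter


def profile_column(records, field):
    """Return simple profiling statistics for one field across a list of dict records."""
    values = [rec.get(field) for rec in records]
    # Treat None and blank strings as missing values.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0] if counts else None,
    }


# Hypothetical sample data, for illustration only.
customers = [
    {"name": "John Smith", "zip": "10121"},
    {"name": "Jane Doe", "zip": ""},
    {"name": "John Smith", "zip": "10121"},
    {"name": "Ann Lee", "zip": None},
]

zip_profile = profile_column(customers, "zip")
print(zip_profile)
```

Saving such a profile at regular intervals is one simple way to support the ongoing monitoring described above, since later profiles can be compared with the initial one.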

What about custom data? While information like addresses and zip codes can be matched up

against a database to determine accuracy, that kind of validation simply isn't available for most

types of records. A large percentage of your data is likely custom, as in product details, and

requires some level of subject matter expertise to assess its quality.

In cases like these, you must have a programmatic way to apply rules to this type of information to

more proactively ensure its quality. These rules must be easy to define and implement, and should

be used in a way that they do more than just uncover and correct bad data; they must stop it

from entering the environment in the first place.
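
As a rough illustration of applying rules programmatically, the Python sketch below checks a record against a small set of rules before it is stored, so that a failing record never enters the target store. The rule set, the SKU pattern, the category list, and the function names `violations` and `accept` are all hypothetical and would in practice be defined by subject matter experts.

```python
import re

# Hypothetical rules for a product record: each maps a field name to a predicate.
RULES = {
    "sku": lambda v: isinstance(v, str) and re.fullmatch(r"[A-Z]{3}-\d{4}", v) is not None,
    "price": lambda v: isinstance(v, (int, float)) and v > 0,
    "category": lambda v: v in {"hardware", "software", "services"},
}


def violations(record, rules=RULES):
    """Return the names of fields that fail their rule (missing fields fail too)."""
    return [field for field, ok in rules.items() if not ok(record.get(field))]


def accept(record, store, rules=RULES):
    """Append the record to `store` only if it passes every rule; return the failed fields."""
    failed = violations(record, rules)
    if not failed:
        store.append(record)
    return failed


store = []
good = {"sku": "ABC-1234", "price": 19.99, "category": "software"}
bad = {"sku": "abc1234", "price": -5, "category": "software"}

good_result = accept(good, store)  # no failures, record is stored
bad_result = accept(bad, store)    # failures reported, record is rejected
print(good_result, bad_result, len(store))
```

Keeping the rules in a plain mapping is a design choice made here so that new rules can be added without changing the checking logic.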

Data Quality: How to Evaluate and Enhance It




Fixing the Problems

Now that you know where the problems exist in your data, how do you fix them before predictive modeling and analytics tools are applied? Some of the most common procedures for data quality management include:

Cleansing, Standardizing, Enriching, Matching, and Merging

These steps, while seemingly unrelated, are all rather important in achieving and sustaining

optimum levels of data quality.

Cleansing eliminates mistakes within databases and other information sources through the

alteration of existing data based on pre-defined business rules and criteria. For example, if

incorrect customer names are identified, cleansing would help to amend missing or incomplete

entries, while standardization would consistently format all completed entries based on pre-defined business rules.
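
The difference between the two steps can be sketched in Python as follows. This is a simplified illustration, assuming that cleansing only removes stray whitespace and flags unusable entries, and that the business rule for standardization is a "Last, First" format in title case. The function names and sample entries are hypothetical.

```python
def cleanse_name(raw):
    """Collapse stray whitespace; return None when nothing usable remains."""
    if raw is None:
        return None
    cleaned = " ".join(raw.split())
    return cleaned or None


def standardize_name(name):
    """Format a cleansed name consistently as 'Last, First' in title case."""
    if "," in name:
        # Already in 'last, first' order.
        last, first = (part.strip() for part in name.split(",", 1))
    else:
        # Assume the final word is the last name.
        first, _, last = name.rpartition(" ")
    last, first = last.title(), first.title()
    return f"{last}, {first}" if first else last


# Hypothetical raw entries, for illustration only.
raw_entries = ["  john   SMITH ", "Smith, john", "   ", None]
standardized = [standardize_name(c) for c in map(cleanse_name, raw_entries) if c]
print(standardized)
```

Entries that cleansing cannot repair (the blank and missing values here) are dropped from the output in this sketch; a real process would more likely route them for correction.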

Enrichment improves comprehensiveness, dynamically extending and enhancing information by

comparing it to third-party content such as consumer demographics or geographic distributors and

appending its attributes when appropriate. If customer records are lacking zip codes, for example, they

can be determined based on existing addresses, and added as a separate field in each record.
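
A minimal Python sketch of this kind of enrichment is shown below. The in-memory `REFERENCE_ZIPS` table stands in for third-party address content and is an assumption made for illustration, as are the field names and the `enrich_zip` function.

```python
# Hypothetical reference data standing in for a third-party address source.
REFERENCE_ZIPS = {
    ("two penn plaza", "new york", "ny"): "10121",
}


def enrich_zip(record, reference=REFERENCE_ZIPS):
    """Return a copy of the record with 'zip' filled in when it is missing and a match exists."""
    enriched = dict(record)
    if not enriched.get("zip"):
        key = tuple(
            str(enriched.get(field, "")).strip().lower()
            for field in ("street", "city", "state")
        )
        match = reference.get(key)
        if match:
            enriched["zip"] = match
            enriched["zip_source"] = "reference"  # note where the appended value came from
    return enriched


known = enrich_zip({"street": "Two Penn Plaza", "city": "New York", "state": "NY"})
unknown = enrich_zip({"street": "1 Elm St", "city": "Nowhere", "state": "ZZ", "zip": ""})
print(known, unknown)
```

Recording the source of an appended value, as the `zip_source` field does here, is one way to keep enriched attributes distinguishable from original data.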

Merging and matching promote consistency by automatically uncovering related entries within the same system or across multiple systems, and then linking, matching, or merging as needed. For example, entries for a customer, John Smith, exist in multiple databases. Although the records are similar, there are some inconsistencies. Advanced matching capabilities can closely assess the data in each record to determine if they are redundant, or separate and distinct. If the records are determined to be redundant, merging would then consolidate them into a single, comprehensive entry for John Smith, using the most frequently occurring data. Householding, a technique similar to merging where related information from disparate systems is collected and stored in a data warehouse or other central location for easy access, also falls into this category.
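
The John Smith example can be sketched in Python as follows. This is a simplified illustration rather than a description of any product's matching algorithm: it assumes that two records are redundant when their names are approximately equal (using the similarity ratio from Python's standard `difflib` module, with an arbitrary 0.85 threshold) and their e-mail addresses agree. The records and function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """Approximate string match using difflib's similarity ratio (threshold is an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def is_same_customer(r1, r2):
    """Treat two records as redundant when names are similar and e-mails agree."""
    return similar(r1["name"], r2["name"]) and r1["email"].lower() == r2["email"].lower()


def merge(records):
    """Consolidate redundant records, keeping the most frequently occurring value per field."""
    merged = {}
    fields = {field for rec in records for field in rec}
    for field in sorted(fields):
        values = [rec[field] for rec in records if rec.get(field)]
        if values:
            merged[field] = Counter(values).most_common(1)[0][0]
    return merged


# Hypothetical entries for the same customer held in three systems.
records = [
    {"name": "John Smith", "email": "jsmith@example.com", "city": "New York"},
    {"name": "Jon Smith", "email": "JSmith@example.com", "city": "New York"},
    {"name": "John Smith", "email": "jsmith@example.com", "city": "Newark"},
]

if all(is_same_customer(records[0], other) for other in records[1:]):
    merged = merge(records)
    print(merged)
```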

Scoring

Many organizations have begun to rely on scoring to more effectively evaluate data quality, and to better prioritize problems if and when they occur. With scoring, a number is assigned to every data record, providing insight into its quality. For example, you may give a pristine record a score of five, while a completely invalid record would receive a score of one. Any number in between would demonstrate the level of confidence that you have in the record's thoroughness and accuracy, and indicate if any action is needed (e.g., any record with a score of three or less would require human review).

Remember to be flexible when it comes to scoring procedures, applying different rules to different

types of data to convey a sense of urgency or non-urgency when problems arise. Critical data,

such as customer information, should be scored more strictly than, say, data about the inventory

of office supplies.
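
A one-to-five scoring scheme of this kind might be sketched in Python as shown below. The linear scaling from the share of valid required fields to a score is an assumption chosen for illustration, as are the field names, the zip code check, and the function names; only the five-point scale and the review threshold of three come from the example above.

```python
REVIEW_THRESHOLD = 3  # from the example: scores of three or less go to human review


def score_record(record, required_fields, validators=None):
    """Score a record from 1 (completely invalid) to 5 (pristine).

    The score is the share of required fields that are present and valid,
    scaled linearly onto 1..5. This scaling is an illustrative assumption.
    """
    validators = validators or {}
    passed = 0
    for field in required_fields:
        value = record.get(field)
        if value in (None, ""):
            continue  # missing or empty field earns nothing
        check = validators.get(field)
        if check is None or check(value):
            passed += 1
    return 1 + round(4 * passed / len(required_fields))


def needs_review(score, threshold=REVIEW_THRESHOLD):
    """Return True when the score is at or below the review threshold."""
    return score <= threshold


# Hypothetical customer fields and a single validity check.
FIELDS = ["name", "email", "zip", "phone"]
CHECKS = {"zip": lambda v: len(v) == 5 and v.isdigit()}

pristine = {"name": "John Smith", "email": "jsmith@example.com", "zip": "10121", "phone": "212-555-0100"}
partial = {"name": "John Smith", "email": "", "zip": "1012", "phone": "212-555-0100"}

pristine_score = score_record(pristine, FIELDS, CHECKS)
partial_score = score_record(partial, FIELDS, CHECKS)
print(pristine_score, partial_score, needs_review(partial_score))
```

Because the threshold is a parameter, critical data can be held to a stricter standard than less important data simply by passing a higher threshold to `needs_review`.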




Predictive analytics has been defined as a branch of data mining related to the prediction of

future probabilities, behaviors, and trends. When deploying predictive analytics, many companies

will skip important steps in the process. The most commonly overlooked task is data preparation.

Getting the data in proper shape for predictive analysis, which includes gathering it from various

sources and compiling it into a final set that will be fed to the predictive model, should be a key

activity. In the most successful predictive modeling scenarios, data preparation will account for

approximately 60 to 80 percent of the cost of the initiative.

Effective data preparation requires more than just pulling data from back-end systems and

moving it into a centralized location, such as a data mart or data warehouse. Failing to properly

cleanse and enhance it can prevent you from making it truly analytics-ready.

Companies don't intentionally ignore the need to fix their data before applying a predictive model

to it. They simply don't realize how incomplete or inaccurate the information in their enterprise

systems actually is.

As one of the very first steps in any predictive analytics project, invalid or erroneous records must

be located and corrected, and any missing data must be filled in. Otherwise, the information

feeding the model will lack integrity, and the "garbage in, garbage out" rule will apply. In other

words, poor information will lead to poor results, and poor results will undoubtedly lead to poor

decisions.

The best way to identify bad data is through the use of a comprehensive data quality

management tool (preferably one that is fully integrated with the predictive modeling solution),

which can profile, transform, and standardize information, while filling in any missing data. This will help ensure that data preparation is addressed properly, instead of becoming a stumbling block

that causes significant delays in model creation and deployment.

There are also other elements of data preparation that must be considered to ensure the most

precise results. IT organizations must also select tables, records, and attributes from various

sources across the business as well as transform, merge, aggregate, derive, sample, and weigh

(when required) the information. It is important to note that these steps may often need to be

performed multiple times to make the data truly ready for the modeling tool.

The Impact of Data Quality on Predictive Analytics




Here is an example of the impact that data quality can have on predictive analysis efforts.

Company XYZ has purchased a predictive analytics package to expand customer wallet share

and profitability through more targeted and effective up-sell and cross-sell activities. The goal is

to determine the factors that influence the purchase of complementary products, and based on

those factors, identify customers who are most likely to buy certain additional products.

The data set is compiled from various customer relationship management (CRM) and sales force

automation (SFA) systems. Once the model is built and deployed, the results prove to be poor.

Customers with a high likelihood of future purchases are missed, while customers who are unlikely

to spend any more money with the company are mistakenly identified as potential targets.

Unknowingly, the company uses those faulty results to launch an aggressive, multi-touch

up-sell campaign that costs approximately $600,000. If, for example, 20 percent of the data that

the predictive model was based on were erroneous, it would be safe to assume that 20 percent

of the results (the list of customers who are most likely to buy a certain product) were also

unreliable. Therefore, the company would have wasted 20 percent of its investment in the

campaign, or $120,000.
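
The arithmetic behind this estimate can be written out in a few lines of Python. The sketch relies on the same simplifying assumption as the example, namely that the share of unreliable results equals the share of erroneous input data; the function name is hypothetical.

```python
def estimate_wasted_spend(campaign_cost, error_rate_percent):
    """Estimate wasted spend, assuming unreliable results scale one-to-one with bad input data."""
    # Integer arithmetic keeps the dollar figure exact.
    return campaign_cost * error_rate_percent // 100


campaign_cost = 600_000   # campaign cost in dollars, from the example
error_rate_percent = 20   # share of erroneous input data, from the example

wasted_spend = estimate_wasted_spend(campaign_cost, error_rate_percent)
print(f"Estimated wasted spend: ${wasted_spend:,}")
```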

If the company had instead employed a data quality management solution that provided the

aforementioned data cleansing and enhancement techniques before the predictive model was

applied, the erroneous or incomplete records would have been corrected in advance. The increased

accuracy in the raw data would substantially improve the results.

As a result, the company would not only save the potentially wasted $120,000, it would also see

a sharper increase in the revenues that result from the program, since the list of target customers

would be more precise.

If the company continues to use the tool to monitor the quality of the inputs for this model, or

any future models it creates, it can ensure that predictive modeling results are always as accurate

as possible, and yield the highest returns.

How Data Quality and Predictive Analytics Work Together




iWay Data Quality Center

iWay Data Quality Center (DQC) delivers a broad array of cutting-edge features in a single, affordable, intuitive solution.

Key capabilities include:

• Centralized management of all data quality activities, including business rules and data flows, from a single, unified platform

• Bundled administration tools that allow for easy configuration, without the need for external applications

• A platform-independent architecture based on open standards

• Parallel processing methods that ensure scalability, support both batch and on-demand modes, and accelerate data quality procedures, performing the entire data quality process in less than 0.1 second, and processing more than 5 million records per hour

• Advanced semantic profiling, for fast and accurate information analysis

• Seamless integration into any B2B, A2A, or portal application, as well as popular ESB, SOA, and ETL tools

• The ability to easily tap into external data sources, such as national address or name registries, as well as third-party dictionaries and custom lists for the purposes of parsing, cleansing, and validation

• A set of powerful algorithms that efficiently perform approximate matching in record unification, regardless of internal data structures

iWay Data Profiler

The iWay Data Profiler integrates output from iWay DQC with business intelligence (BI) technology

in a simple yet powerful way. Administrators can view, monitor, compare, and report on any mission-critical data with no additional client software, plug-ins, or report viewers required.

It provides sophisticated integration capabilities bolstered by mature tools for data quality

monitoring, reporting, and analytics. Users are able to query, analyze, deliver, and display

electronic profiling data in an almost unlimited number of ways.




Advanced data profiling information, generated via iWay Data Quality Center's semantic analytics

and complex business rules, provides basic data statistics, such as uniqueness and frequency,

and uncovers relationships between data using primary and foreign keys. This profiling data can

then be further analyzed using intuitive and graphical reporting tools, helping users to uncover

variances in data profiles over different periods of time. Users can also drill down on profiled

categories to reveal the details of the exact records within that group.
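
The idea of uncovering variances between data profiles from different periods can be illustrated with a short, generic Python sketch. It does not reflect how iWay Data Profiler is implemented or accessed; the metric names, the sample snapshots, the 5 percent tolerance, and the `compare_profiles` function are all assumptions made for illustration.

```python
def compare_profiles(before, after, tolerance=0.05):
    """Return metrics whose relative change between two profile snapshots exceeds `tolerance`."""
    variances = {}
    for metric in sorted(set(before) & set(after)):
        old, new = before[metric], after[metric]
        if old == 0:
            # Any move away from zero counts as a variance.
            changed = new != 0
        else:
            changed = abs(new - old) / abs(old) > tolerance
        if changed:
            variances[metric] = (old, new)
    return variances


# Hypothetical profile snapshots taken one month apart.
january = {"rows": 1000, "missing_zip": 20, "distinct_customers": 950}
february = {"rows": 1020, "missing_zip": 80, "distinct_customers": 960}

variances = compare_profiles(january, february)
print(variances)
```

In this made-up data, only the jump in missing zip codes exceeds the tolerance, which is the kind of change a reviewer would then drill into at the record level.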

The iWay Data Profiler provides a wide array of powerful capabilities, including:

• Customizable data quality indicators (DQIs) that allow companies to define various levels of validity. These DQIs can then be applied to data to provide immediate insight into the integrity of specific records

• Dynamic collection of profiling data from iWay DQC

• Tagging and archiving of profiling data as sets within an associated RDBMS for easy retrieval

• Advanced data manipulation and graphics

• Comparison of multiple data profiling sets for more rapid variance discovery

• Printing and exporting of any data profiling view into HTML, PDF, Excel, and other industry-standard formats

• Portable analytical capabilities embedded directly within the profiling report that allow users to view and analyze profiling data in an almost unlimited number of ways

Additionally, iWay Data Profiler is available as a software-as-a-service (SaaS) application. This offers

many significant benefits, including:

• Accelerated deployment and setup

• Increased budget-friendliness through a convenient pay-per-use model that eliminates the high upfront expenditures associated with on-premise tools

• The ability for detailed profiling information to be more easily shared with those who own the data being profiled: non-technical users working across various divisions and lines of business

• Immediate, cost-efficient scalability whenever it's needed to satisfy changing requirements and emerging needs

WebFOCUS RStat

WebFOCUS RStat is the market's first fully integrated BI and predictive analytics environment, seamlessly bridging the gap between backward- and forward-facing views of business operations. With WebFOCUS RStat, companies can easily and cost-effectively deploy predictive models as intuitive scoring applications, so business users at all levels can make decisions based on accurate, validated future predictions, instead of relying solely on instinct.

WebFOCUS RStat provides a single platform for data access and preparation, BI, predictive model

building and testing, and deploying results to end users as scoring applications. This eliminates

the need to purchase and maintain multiple tools, and frees analysts and other statisticians

from spending countless hours extracting and querying data. At the same time, it reduces costs,

simplifies maintenance, and optimizes IT resources.




WebFOCUS RStat's greatest benefit is its significantly increased accuracy. With the R engine, a powerful and flexible open source statistical programming language, as its underlying analysis tool, WebFOCUS RStat can deliver results that are always consistent, complete, and correct.

Figure: WebFOCUS RStat can generate a variety of outputs that display variable relationships and distributions for exploratory analysis.

Figure: This decision tree predictive model shows the graphical display of the tree and how the data was classified into the terminal nodes.




WebFOCUS RStat provides:

• A single tool, fully integrated with Developer Studio and WebFOCUS Reporting Servers with access to more than 300 data sources for both BI developers and data miners

• Comprehensive data exploration, descriptive statistics, and interactive graphs

• In-depth data visualization and transformation

• Hypothesis testing, clustering, and correlation analysis

• The ability to build and export predictive models for estimation and classification of likely future behavior

• Comprehensive predictive model evaluation

• Rapid application creation through easy incorporation of scoring routines into WebFOCUS reports




Corporate Headquarters: Two Penn Plaza, New York, NY 10121-2898, (212) 736-4433, Fax (212) 967-6406. DN3601489.1011

Connect With Us: informationbuilders.com, askinfo@informationbuilders.com
