8/12/2019 DataQualityManagement Predictive Analytics
Data Quality Management: The Key to
Successful Predictive Analytics
A White Paper
Table of Contents
Introduction
Data Quality: How to Evaluate and Enhance It
  Finding the Problems
  Fixing the Problems
The Impact of Data Quality on Predictive Analytics
How Data Quality and Predictive Analytics Work Together
Comprehensive Data Quality Management and Analytical Solutions
  iWay Data Quality Center
  iWay Data Profiler
  WebFOCUS RStat
Conclusion
Information Builders
Introduction
Organizations of all types and sizes are enhancing their business intelligence (BI) strategies with
predictive analytics. In fact, analyst firm IDC estimated that the business analytics market, which
includes predictive modeling tools, would expand by at least 6.1 percent. [1] A similar report issued
by Forrester Research named business analytics as the fastest-growing segment of global IT
software, with close to 70 percent of companies polled expressing a strong interest. [2]
Traditional reporting and analysis provide you with a rear-view perspective of events that have
already occurred. To make forecasts and predictions, you have to rely on the experience, instinct,
and intuition of your analysts. Predictive analytics, on the other hand, fosters a more proactive
approach to decision-making. It applies sophisticated data mining and statistical analysis
techniques to large volumes of both current and historical data, which helps not only your
analysts, but also your business users to distinguish the patterns, trends, and outliers that can serve
as clear indicators of what events, actions, and conditions may occur in the future.
The creation of predictive models and the deployment of a predictive analysis tool, however, may
not be enough. The quality of the data that you apply the models and tools against must be fully
optimized to ensure the utmost accuracy in your outcomes.
A recent article published on Web Analytics World claims that, "In order to attain accurate business
intelligence, companies must maintain quality data. Predictive analysis requires both past and
current data about many different things, including customers, businesses, products, and the
economy. All of this information is used to draw relationships and patterns between sets of data.
If the data is accurate and well maintained, then the business intelligence produced will be high
quality as well." [3]
That's where data quality management comes in. Data quality management is the process of
enhancing the accuracy, completeness, and consistency of your enterprise information. A variety
of techniques, such as profiling, standardizing, and cleansing, are leveraged to enrich and improve
the integrity of the data contained in your enterprise information sources.
This is particularly important when it comes to predictive analytics, as "garbage in" will most likely
lead to "garbage out." If the predictive results used for strategic planning and decision-making are
unreliable, then the predictive modeling efforts will end up harming an organization more than
helping it.
This white paper will focus on the importance of data quality management as it relates to
predictive analytics. It will explore various methodologies for assessing and improving data quality,
and highlight best practices in preparing data for predictive analysis. It will also present a
real-world scenario of how poor data quality would negatively impact predictive modeling efforts.
[1] Kumar, Monika. "2010 State of the U.S. Business Analytics Market," IDC, March 2010.
[2] Evelson, Boris; Garbani, Jean-Pierre; Green, Charles; Kisker, Holger; Lisserman, Miroslaw. "The State of Business
Intelligence Software and Emerging Trends: 2010," Forrester Research, May 2010.
[3] "Data Quality and Predictive Analysis," Web Analytics World, January 2009.
Data Quality: How to Evaluate and Enhance It
Finding the Problems
The quality of your data will drive the quality of your predictive results. How can you ensure data is
clean enough before you apply a predictive model against it? Data quality needs to be looked at
from four primary perspectives:
• Accuracy: Do data elements properly reflect the object being described by that particular
  field? Is a product's SKU number correct? Accuracy issues most often occur during manual data
  entry and updating processes, where human error can lead to such mistakes as transposed
  numbers or misspellings.
• Consistency: When certain information is analyzed or measured, does it produce the same
  results repeatedly? Reliability problems are frequently caused by similar information residing
  in disparate, unsynchronized systems. Multiple databases contain conflicting data, causing
  inconsistencies when that information is merged.
• Comprehensiveness: Have all fields been filled? Companies often find that customer records
  are missing information, such as e-mail addresses or zip codes. When a record has a high
  number of empty or incomplete fields, it should be considered null.
• Timeliness: How current is the information? Historical data, often used in predictive analytics,
  may be outdated, such as a criminal's arrest record with an old address. For a repeat offender, the
  most current address would be needed for effective predictive policing efforts.
One of the best ways to assess integrity from these four perspectives is through advanced
profiling techniques. Also commonly referred to as data discovery, profiling is the process of
gathering statistics about enterprise data. What are its primary characteristics and attributes?
How was it created and by whom? Which users access it most frequently? For what purposes is it
primarily used? Most importantly, what kind of shape is it in?
Profiling is one of the most effective means of obtaining an in-depth understanding of corporate
data, so it can be optimized for predictive modeling. It will deliver the insight needed to precisely
determine its overall health; identify, prioritize, and correct any issues or errors (some of which
may be expected, others may be surprises); and rectify the underlying causes of quality problems.
Additionally, once an initial profile has been created, you can perform ongoing monitoring of
profile-related metrics, taking a more proactive approach to detecting and fixing any future
quality problems.
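As a rough illustration of the statistics a profile might gather, the following minimal Python sketch measures comprehensiveness (the share of filled values per field) and timeliness (records not updated within a chosen window). The record layout, the `updated` field, and the two-year window are assumptions made for this example; they do not describe any particular profiling product.

```python
from datetime import date

def completeness(records, fields):
    """Share of records with a non-blank value, per field (0.0 to 1.0)."""
    total = len(records)
    result = {}
    for field in fields:
        # None and whitespace-only strings both count as missing.
        filled = sum(1 for r in records if str(r.get(field) or "").strip())
        result[field] = filled / total if total else 0.0
    return result

def stale(records, as_of, max_age_days):
    """Records whose last update is more than max_age_days before as_of."""
    return [r for r in records if (as_of - r["updated"]).days > max_age_days]

# Hypothetical customer records, invented for illustration.
records = [
    {"name": "John Smith", "zip": "10121", "email": "john@example.com", "updated": date(2010, 6, 1)},
    {"name": "Jane Doe", "zip": "", "email": None, "updated": date(2004, 3, 15)},
    {"name": "Ann Lee", "zip": "60601", "email": "", "updated": date(2010, 1, 20)},
    {"name": "Raj Patel", "zip": "94105", "email": "raj@example.com", "updated": date(2009, 11, 2)},
]

field_completeness = completeness(records, ["name", "zip", "email"])
outdated = stale(records, as_of=date(2010, 12, 31), max_age_days=730)

for field, share in field_completeness.items():
    print(f"{field}: {share:.0%} complete")
print("stale records:", [r["name"] for r in outdated])
```

Running the same functions on a schedule and comparing the numbers over time is one simple way to approximate the ongoing monitoring described above.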
What about custom data? While information like addresses and zip codes can be matched up
against a database to determine accuracy, that kind of validation simply isn't available for most
types of records. A large percentage of your data is likely custom, as in product details, and
requires some level of subject matter expertise to assess its quality.
In cases like these, you must have a programmatic way to apply rules to this type of information to
more proactively ensure its quality. These rules must be easy to define and implement, and should
be used in a way that they do more than just uncover and correct bad data; they must stop it
from entering the environment in the first place.
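One way to picture such rules is as small checks that run before a record is stored, so that a failing record never enters the environment. The sketch below is only an assumption-laden example: the SKU pattern, the field names, and the `accept` helper are hypothetical, and real rules would come from subject matter experts.

```python
import re

# Hypothetical rules for a product record; each maps a field to a check.
RULES = {
    "sku": lambda v: isinstance(v, str) and re.fullmatch(r"[A-Z]{3}-\d{4}", v) is not None,
    "price": lambda v: isinstance(v, (int, float)) and v > 0,
    "name": lambda v: isinstance(v, str) and v.strip() != "",
}

def violations(record, rules=RULES):
    """Return the sorted names of the fields that fail their rule."""
    return sorted(field for field, check in rules.items() if not check(record.get(field)))

def accept(record, store):
    """Append the record to the store only if it passes every rule."""
    failed = violations(record)
    if failed:
        return failed  # rejected: the caller learns which rules failed
    store.append(record)
    return []

products = []
ok = accept({"sku": "ABC-1234", "price": 19.99, "name": "Widget"}, products)
bad = accept({"sku": "abc1234", "price": -5, "name": "Widget"}, products)
print(ok, bad, len(products))
```

Keeping the rules in a plain dictionary is a deliberate choice here: it makes them easy to define and extend, which is the property the text asks for.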
Fixing the Problems
Now that you know where the problems exist in your data, how do you fix them before predictive
modeling and analytics tools are applied? Some of the most common procedures for data quality
management include:
Cleansing, Standardizing, Enriching, Matching, and Merging
These steps, while seemingly unrelated, are all rather important in achieving and sustaining
optimum levels of data quality.
Cleansing eliminates mistakes within databases and other information sources through the
alteration of existing data based on pre-defined business rules and criteria. For example, if
incorrect customer names are identified, cleansing would help to amend missing or incomplete
entries, while standardization would consistently format all completed entries based on pre-defined business rules.
Enrichment improves comprehensiveness, dynamically extending and enhancing information by
comparing it to third-party content such as consumer demographics or geographic distributors and
appending its attributes when appropriate. If customer records are lacking zip codes, for example, they
can be determined based on existing addresses, and added as a separate field in each record.
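The standardization and enrichment steps can be sketched in a few lines of Python. This is an illustration under stated assumptions: the lookup table stands in for third-party reference data, the city-to-zip mapping is deliberately simplistic, and the function names are invented for this example.

```python
# Hypothetical reference table standing in for third-party address data.
ZIP_BY_CITY_STATE = {("new york", "ny"): "10121", ("chicago", "il"): "60601"}

def standardize_name(raw):
    """Collapse extra whitespace and apply consistent capitalization."""
    return " ".join(part.capitalize() for part in raw.split())

def enrich_zip(record, lookup=ZIP_BY_CITY_STATE):
    """Return a copy of the record with a zip code added when one can be derived."""
    result = dict(record)
    if not result.get("zip"):
        key = (result.get("city", "").strip().lower(), result.get("state", "").strip().lower())
        if key in lookup:
            result["zip"] = lookup[key]
    return result

raw = {"name": "  jOHN   smith ", "city": "New York", "state": "NY", "zip": ""}
clean = enrich_zip({**raw, "name": standardize_name(raw["name"])})
print(clean)
```

Simple capitalization rules like this one mishandle names such as "McDonald", which is one reason production tools rely on dictionaries and pre-defined business rules instead.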
Merging and matching promote consistency by automatically uncovering related entries within
the same system or across multiple systems, and then linking, matching, or merging as needed.
For example, entries for a customer, John Smith, exist in multiple databases. Although
the records are similar, there are some inconsistencies. Advanced matching capabilities can closely
assess the data in each record to determine if they are redundant, or separate and distinct. If the
records are determined to be redundant, merging would then consolidate them into a single,
comprehensive entry for John Smith, using the most frequently occurring data. Householding, a
technique similar to merging where related information from disparate systems is collected and
stored in a data warehouse or other central location for easy access, also falls into this category.
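The John Smith example can be sketched with approximate string matching followed by a merge that keeps the most frequently occurring value per field. The similarity measure (Python's `difflib.SequenceMatcher`), the 0.85 threshold, and the sample records are all assumptions chosen for illustration; commercial matching engines use more sophisticated techniques.

```python
from collections import Counter
from difflib import SequenceMatcher

def similarity(a, b):
    """Average string similarity over name and address, between 0 and 1."""
    fields = ("name", "address")
    ratios = [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in fields]
    return sum(ratios) / len(ratios)

def merge(records):
    """Consolidate records, keeping the most frequent non-empty value per field."""
    merged = {}
    for field in {f for r in records for f in r}:
        values = [r[field] for r in records if r.get(field)]
        merged[field] = Counter(values).most_common(1)[0][0] if values else ""
    return merged

# Hypothetical entries for the same customer in different systems.
crm = {"name": "John Smith", "address": "12 Main St", "phone": "555-0100"}
billing = {"name": "Jon Smith", "address": "12 Main St", "phone": "555-0100"}
support = {"name": "John Smith", "address": "12 Main St.", "phone": ""}
other = {"name": "Mary Jones", "address": "9 Oak Ave", "phone": "555-0199"}

THRESHOLD = 0.85  # assumed cutoff; it would need tuning on real data
matches = [r for r in (billing, support, other) if similarity(crm, r) >= THRESHOLD]
golden = merge([crm] + matches)
print(golden)
```

The misspelled "Jon Smith" and the address with a trailing period are both treated as matches, while the unrelated Mary Jones record is left separate and distinct.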
Scoring
Many organizations have begun to rely on scoring to more effectively evaluate data quality, and
to better prioritize problems if and when they occur. With scoring, a number is assigned to every
data record, providing insight into its quality. For example, you may give a pristine record a score
of five, while a completely invalid record would receive a score of one. Any number in between
would demonstrate the level of confidence that you have in the record's thoroughness and
accuracy, and indicate if any action is needed (e.g., any record with a score of three or less would
require human review).
Remember to be flexible when it comes to scoring procedures, applying different rules to different
types of data to convey a sense of urgency or non-urgency when problems arise. Critical data,
such as customer information, should be scored more strictly than, say, data about the inventory
of office supplies.
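A minimal sketch of such a scoring scheme follows. It scores only completeness, on the one-to-five scale described above, and applies a stricter review threshold to customer data than to office supply data. The field lists and the threshold for office supplies are assumptions for this example; a fuller scorer would also weigh accuracy checks.

```python
def score(record, required):
    """Map the share of filled required fields to a 1-5 score (rounded down)."""
    filled = sum(1 for f in required if str(record.get(f) or "").strip())
    return 1 + (4 * filled) // len(required)

# A record at or below its threshold is routed to human review.
# Customer data is scored more strictly than office supply data.
REVIEW_THRESHOLD = {"customer": 3, "office_supplies": 2}

def needs_review(record, required, data_type):
    return score(record, required) <= REVIEW_THRESHOLD[data_type]

customer = {"name": "John Smith", "email": "", "zip": "10121", "phone": ""}
supply = {"item": "Stapler", "qty": "12", "bin": "", "vendor": ""}

customer_fields = ["name", "email", "zip", "phone"]
supply_fields = ["item", "qty", "bin", "vendor"]

print(score(customer, customer_fields), needs_review(customer, customer_fields, "customer"))
print(score(supply, supply_fields), needs_review(supply, supply_fields, "office_supplies"))
```

Both records are half complete and receive the same score, but only the customer record is flagged, which reflects the idea of applying different rules to different types of data.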
The Impact of Data Quality on Predictive Analytics
Predictive analytics has been defined as a branch of data mining related to the prediction of
future probabilities, behaviors, and trends. When deploying predictive analytics, many companies
will skip important steps in the process. The most commonly overlooked task is data preparation.
Getting the data in proper shape for predictive analysis, which includes gathering it from various
sources and compiling it into a final set that will be fed to the predictive model, should be a key
activity. In the most successful predictive modeling scenarios, data preparation will account for
approximately 60 to 80 percent of the cost of the initiative.
Effective data preparation requires more than just pulling data from back-end systems and
moving it into a centralized location, such as a data mart or data warehouse. Failing to properly
cleanse and enhance it can prevent you from making it truly analytics-ready.
Companies don't intentionally ignore the need to fix their data before applying a predictive model
to it. They simply don't realize how incomplete or inaccurate the information in their enterprise
systems actually is.
As one of the very first steps in any predictive analytics project, invalid or erroneous records must
be located and corrected, and any missing data must be filled in. Otherwise, the information
feeding the model will lack integrity, and the "garbage in, garbage out" rule will apply. In other
words, poor information will lead to poor results, and poor results will undoubtedly lead to poor
decisions.
The best way to identify bad data is through the use of a comprehensive data quality
management tool (preferably one that is fully integrated with the predictive modeling solution),
which can profile, transform, and standardize information, while filling in any missing data. This will
help ensure that data preparation is addressed properly, instead of becoming a stumbling block
that causes significant delays in model creation and deployment.
There are also other elements of data preparation that must be considered to ensure optimum
precision in the results. IT organizations must also select tables, records, and attributes from various
sources across the business, as well as transform, merge, aggregate, derive, sample, and weight
(when required) the information. It is important to note that these steps may often need to be
performed multiple times to make the data truly ready for the modeling tool.
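Several of these preparation steps (aggregate, merge, derive, and sample) can be illustrated with a short Python sketch. The two source extracts, the field names, and the sample size are hypothetical; the point is only to show how data from separate systems is compiled into one modeling-ready set.

```python
import random
from collections import defaultdict

# Hypothetical extracts from two source systems.
crm_rows = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "wholesale"},
    {"customer_id": 3, "segment": "retail"},
]
sales_rows = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 500.0},
]

# Aggregate: total spend and order count per customer.
totals = defaultdict(lambda: {"total": 0.0, "orders": 0})
for row in sales_rows:
    totals[row["customer_id"]]["total"] += row["amount"]
    totals[row["customer_id"]]["orders"] += 1

# Merge and derive: join the aggregates onto the CRM rows and add
# an average order value as a derived attribute.
prepared = []
for row in crm_rows:
    agg = totals.get(row["customer_id"], {"total": 0.0, "orders": 0})
    avg = agg["total"] / agg["orders"] if agg["orders"] else 0.0
    prepared.append({**row, **agg, "avg_order": avg})

# Sample: draw a reproducible subset for model building.
rng = random.Random(42)
training = rng.sample(prepared, k=2)
print(prepared)
```

Customers with no sales still appear in the prepared set with zero totals, so the model sees them rather than silently losing them in the join.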
How Data Quality and Predictive Analytics Work Together
Here is an example of the impact that data quality can have on predictive analysis efforts.
Company XYZ has purchased a predictive analytics package to expand customer wallet share
and profitability through more targeted and effective up-sell and cross-sell activities. The goal is
to determine the factors that influence the purchase of complementary products, and based on
those factors, identify customers who are most likely to buy certain additional products.
The data set is compiled from various customer relationship management (CRM) and sales force
automation (SFA) systems. Once the model is built and deployed, the results prove to be poor.
Customers with a high likelihood of future purchases are missed, while customers who are unlikely
to spend any more money with the company are mistakenly identified as potential targets.
Unknowingly, the company uses those faulty results to launch an aggressive, multi-touch
up-sell campaign that costs approximately $600,000. If, for example, 20 percent of the data that
the predictive model was based on were erroneous, it would be safe to assume that 20 percent
of the results (the list of customers who are most likely to buy a certain product) were also
unreliable. Therefore, the company would have wasted 20 percent of its investment in the
campaign, or $120,000.
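The arithmetic behind this estimate is simple enough to write out. The sketch below uses only the figures from the example and makes the same simplifying assumption as the text: the share of wasted spend equals the share of erroneous input data.

```python
campaign_cost = 600_000  # campaign spend from the example, in dollars
error_pct = 20           # assumed share of erroneous input data, in percent

# Simplifying assumption from the text: unreliable results, and therefore
# wasted spend, are proportional to the share of erroneous data.
wasted = campaign_cost * error_pct // 100
effective = campaign_cost - wasted

print(f"Estimated wasted spend: ${wasted:,}")
print(f"Spend on reliable targets: ${effective:,}")
```

In practice the relationship between input errors and model errors is rarely this linear, so the figure is best read as a rough estimate rather than a precise loss.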
If the company had instead employed a data quality management solution that provided the
aforementioned data cleansing and enhancement techniques before the predictive model was
applied, the erroneous or incomplete records would have been corrected in advance. The increased
accuracy in the raw data would have substantially improved the results.
As a result, the company would not only save the potentially wasted $120,000, it would also see
a sharper increase in the revenues that result from the program, since the list of target customers
would be more precise.
If the company continues to use the tool to monitor the quality of the inputs for this model (or
any future models it creates), it can ensure that predictive modeling results are always as accurate
as possible, and yield the highest returns.
Comprehensive Data Quality Management and Analytical Solutions
iWay Data Quality Center
iWay Data Quality Center (DQC) delivers a broad array of cutting-edge features in a single,
affordable, intuitive solution. Key capabilities include:
• Centralized management of all data quality activities, including business rules and data flows,
  from a single, unified platform
• Bundled administration tools that allow for easy configuration, without the need for external
  applications
• A platform-independent architecture based on open standards
• Parallel processing methods that ensure scalability, support both batch and on-demand modes,
  and accelerate data quality procedures, performing the entire data quality process in less than
  0.1 second, and processing more than 5 million records per hour
• Advanced semantic profiling, for fast and accurate information analysis
• Seamless integration into any B2B, A2A, or portal application, as well as popular ESB, SOA, and
  ETL tools
• The ability to easily tap into external data sources, such as national address or name registries,
  as well as third-party dictionaries and custom lists for the purposes of parsing, cleansing, and
  validation
• A set of powerful algorithms that efficiently perform approximate matching in record unification,
  regardless of internal data structures
iWay Data Profiler
The iWay Data Profiler integrates output from iWay DQC with business intelligence (BI) technology
in a simple yet powerful way. Administrators can view, monitor, compare, and report on any
mission-critical data with no additional client software, plug-ins, or report viewers required.
It provides sophisticated integration capabilities bolstered by mature tools for data quality
monitoring, reporting, and analytics. Users are able to query, analyze, deliver, and display
electronic profiling data in an almost unlimited number of ways.
Advanced data profiling information, generated via iWay Data Quality Center's semantic analytics
and complex business rules, provides basic data statistics, such as uniqueness and frequency,
and uncovers relationships between data using primary and foreign keys. This profiling data can
then be further analyzed using intuitive and graphical reporting tools, helping users to uncover
variances in data profiles over different periods of time. Users can also drill down on profiled
categories to reveal the details of the exact records within that group.
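To make the terms concrete, here is a generic Python sketch of the basic statistics mentioned above: uniqueness, frequency, and a primary/foreign key relationship check. It is not how iWay Data Profiler is implemented; the tables, column names, and helper functions are hypothetical.

```python
from collections import Counter

def is_unique(rows, column):
    """True when no value in the column repeats (a candidate primary key)."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

def orphans(child_rows, child_col, parent_rows, parent_col):
    """Child values with no matching parent value (foreign key violations)."""
    parent_keys = {row[parent_col] for row in parent_rows}
    return sorted({row[child_col] for row in child_rows} - parent_keys)

# Hypothetical tables, invented for illustration.
customers = [{"id": 1, "state": "NY"}, {"id": 2, "state": "NJ"}, {"id": 3, "state": "NY"}]
orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 4}]

state_frequency = Counter(row["state"] for row in customers)
missing_parents = orphans(orders, "customer_id", customers, "id")

print("id is unique:", is_unique(customers, "id"))
print("state frequency:", dict(state_frequency))
print("orders without a customer:", missing_parents)
```

Listing the orphaned key values, rather than only counting them, mirrors the drill-down idea of revealing the exact records behind a profiled category.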
The iWay Data Profiler provides a wide array of powerful capabilities, including:
• Customizable data quality indicators (DQIs) that allow companies to define various levels of
  validity. These DQIs can then be applied to data to provide immediate insight into the integrity
  of specific records
• Dynamic collection of profiling data from iWay DQC
• Tagging and archiving of profiling data as sets within an associated RDBMS for easy retrieval
• Advanced data manipulation and graphics
• Comparison of multiple data profiling sets for more rapid variance discovery
• Printing and exporting of any data profiling view into HTML, PDF, Excel, and other
  industry-standard formats
• Portable analytical capabilities embedded directly within the profiling report that allow users
  to view and analyze profiling data in an almost unlimited number of ways
Additionally, iWay Data Profiler is available as a software-as-a-service (SaaS) application. This offers
many significant benefits, including:
• Accelerated deployment and setup
• Increased budget-friendliness through a convenient pay-per-use model that eliminates the high
  upfront expenditures associated with on-premise tools
• The ability for detailed profiling information to be more easily shared with those who own the
  data being profiled: non-technical users working across various divisions and lines of business
• Immediate, cost-efficient scalability whenever it's needed to satisfy changing requirements
  and emerging needs
WebFOCUS RStat
WebFOCUS RStat is the market's first fully integrated BI and predictive analytics environment,
seamlessly bridging the gap between backward- and forward-facing views of business operations.
With WebFOCUS RStat, companies can easily and cost-effectively deploy predictive models as
intuitive scoring applications, so business users at all levels can make decisions based on accurate,
validated future predictions, instead of relying solely on instinct.
WebFOCUS RStat provides a single platform for data access and preparation, BI, predictive model
building and testing, and deploying results to end users as scoring applications. This eliminates
the need to purchase and maintain multiple tools, and frees analysts and other statisticians
from spending countless hours extracting and querying data. At the same time, it reduces costs,
simplifies maintenance, and optimizes IT resources.
WebFOCUS RStat's greatest benefit is its significantly increased accuracy. With the R engine (a
powerful and flexible open source statistical programming language) as its underlying analysis
tool, WebFOCUS RStat can deliver results that are always consistent, complete, and correct.
Figure: Using WebFOCUS RStat enables a variety of outputs that can be generated to display variable
relationships and distributions for exploratory analysis.
Figure: This Decision Tree predictive model shows the graphical display of the tree and how the
data was classified into the terminal nodes.
WebFOCUS RStat provides:
• A single tool, fully integrated with Developer Studio and WebFOCUS Reporting Servers, with access to
  more than 300 data sources for both BI developers and data miners
• Comprehensive data exploration, descriptive statistics, and interactive graphs
• In-depth data visualization and transformation
• Hypothesis testing, clustering, and correlation analysis
• The ability to build and export predictive models for estimation and classification of likely future behavior
• Comprehensive predictive model evaluation
• Rapid application creation through easy incorporation of scoring routines into WebFOCUS reports
Corporate Headquarters: Two Penn Plaza, New York, NY 10121-2898, (212) 736-4433, Fax (212) 967-6406. DN3601489.1011
Connect With Us: informationbuilders.com, askinfo@informationbuilders.com