  • 8/12/2019 DataQualityManagement Predictive Analytics

    Data Quality Management: The Key to

    Successful Predictive Analytics

    A White Paper

    Table of Contents

    Introduction

    Data Quality: How to Evaluate and Enhance It

    Finding the Problems

    Fixing the Problems

    The Impact of Data Quality on Predictive Analytics

    How Data Quality and Predictive Analytics Work Together

    Comprehensive Data Quality Management and Analytical Solutions

    iWay Data Quality Center

    iWay Data Profiler

    WebFOCUS RStat

    Conclusion

    Information Builders

    Introduction

    Organizations of all types and sizes are enhancing their business intelligence (BI) strategies with

    predictive analytics. In fact, analyst firm IDC estimated that the business analytics market, which

    includes predictive modeling tools, would expand by at least 6.1 percent [1]. A similar report issued

    by Forrester Research named business analytics as the fastest-growing segment of global IT

    software, with close to 70 percent of companies polled expressing a strong interest [2].

    Traditional reporting and analysis provide you with a rear-view perspective of events that have

    already occurred. To make forecasts and predictions, you have to rely on the experience, instinct,

    and intuition of your analysts. Predictive analytics, on the other hand, fosters a more proactive

    approach to decision-making. It applies sophisticated data mining and statistical analysis

    techniques to large volumes of both current and historical data, which helps not only your

    analysts, but also your business users to distinguish the patterns, trends, and outliers that can serve

    as clear indicators of what events, actions, and conditions may occur in the future.

    The creation of predictive models and the deployment of a predictive analysis tool, however, may

    not be enough. The quality of the data that you apply the models and tools against must be fully

    optimized to ensure the utmost accuracy in your outcomes.

    A recent article published on Web Analytics World claims that, "In order to attain accurate business

    intelligence, companies must maintain quality data. Predictive analysis requires both past and

    current data about many different things, including customers, businesses, products, and the

    economy. All of this information is used to draw relationships and patterns between sets of data.

    If the data is accurate and well maintained, then the business intelligence produced will be high

    quality as well." [3]

    That's where data quality management comes in. Data quality management is the process of

    enhancing the accuracy, completeness, and consistency of your enterprise information. A variety

    of techniques, such as profiling, standardizing, and cleansing, are leveraged to enrich and improve

    the integrity of the data contained in your enterprise information sources.

    This is particularly important when it comes to predictive analytics, as "garbage in" will most likely

    lead to "garbage out." If the predictive results used for strategic planning and decision-making are

    unreliable, then the predictive modeling efforts will end up harming an organization more than

    helping it.

    This white paper will focus on the importance of data quality management as it relates to

    predictive analytics. It will explore various methodologies for assessing and improving data quality,

    and highlight best practices in preparing data for predictive analysis. It will also present a real-world

    scenario of how poor data quality would negatively impact predictive modeling efforts.

    [1] Kumar, Monika. "2010 State of the U.S. Business Analytics Market," IDC, March 2010.

    [2] Evelson, Boris; Garbani, Jean-Pierre; Green, Charles; Kisker, Holger; Lisserman, Miroslaw. "The State of Business Intelligence Software and Emerging Trends: 2010," Forrester Research, May 2010.

    [3] "Data Quality and Predictive Analysis," Web Analytics World, January 2009.

    Data Quality: How to Evaluate and Enhance It

    Finding the Problems

    The quality of your data will drive the quality of your predictive results. How can you ensure data is

    clean enough before you apply a predictive model against it? Data quality needs to be looked at

    from four primary perspectives:

    Accuracy: Do data elements properly reflect the object being described by that particular field? Is a product's SKU number correct? Accuracy issues most often occur during manual data entry and updating processes, where human error can lead to such mistakes as transposed numbers or misspellings.

    Consistency: When certain information is analyzed or measured, does it produce the same results repeatedly? Reliability problems are frequently caused by similar information residing in disparate, unsynchronized systems. Multiple databases contain conflicting data, causing inconsistencies when that information is merged.

    Comprehensiveness: Have all fields been filled? Companies often find that customer records are missing information, such as e-mail addresses or zip codes. When a record has a high number of empty or incomplete fields, it should be considered null.

    Timeliness: How current is the information? Historical data, often used in predictive analytics, may be outdated, such as a criminal's arrest record with an old address. For a repeat offender, the most current address would be needed for effective predictive policing efforts.

    One of the best ways to assess integrity from these four perspectives is through advanced

    profiling techniques. Also commonly referred to as data discovery, profiling is the process of

    gathering statistics about enterprise data. What are its primary characteristics and attributes?

    How was it created and by whom? Which users access it most frequently? For what purposes is it

    primarily used? Most importantly, what kind of shape is it in?

    Profiling is one of the most effective means of obtaining an in-depth understanding of corporate

    data, so it can be optimized for predictive modeling. It will deliver the insight needed to precisely

    determine its overall health; identify, prioritize, and correct any issues or errors (some of which

    may be expected, others may be surprises); and rectify the underlying causes of quality problems.

    Additionally, once an initial profile has been created, you can perform ongoing monitoring of

    profile-related metrics, taking a more proactive approach to detecting and fixing any future

    quality problems.
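
    As a rough illustration of what a profile can contain, the following Python sketch gathers a few statistics for a single field: how many values are missing, how many distinct values exist, and which values occur most often. The function name profile_column and the sample customer records are assumptions made for this example; they are not taken from any product described in this paper.

```python
from collections import Counter

def profile_column(records, field):
    """Return basic profiling statistics for one field across a list of dict records."""
    values = [record.get(field) for record in records]
    # Treat None and blank strings as missing values.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    counts = Counter(present)
    total = len(values)
    return {
        "field": field,
        "total": total,
        "missing": total - len(present),
        "completeness": len(present) / total if total else 0.0,
        "distinct": len(counts),
        "most_common": counts.most_common(3),
    }

# Illustrative sample data (hypothetical customer records).
customers = [
    {"id": 1, "zip": "10121", "email": "a@example.com"},
    {"id": 2, "zip": "", "email": "b@example.com"},
    {"id": 3, "zip": "10121", "email": None},
    {"id": 4, "zip": "30301", "email": "b@example.com"},
]

zip_profile = profile_column(customers, "zip")
email_profile = profile_column(customers, "email")
```

    Running the same function on a schedule and comparing the results against an earlier profile would be one simple way to monitor profile-related metrics over time.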

    What about custom data? While information like addresses and zip codes can be matched up

    against a database to determine accuracy, that kind of validation simply isn't available for most

    types of records. A large percentage of your data is likely custom, as in product details, and

    requires some level of subject matter expertise to assess its quality.

    In cases like these, you must have a programmatic way to apply rules to this type of information to

    more proactively ensure its quality. These rules must be easy to define and implement, and should

    be used to do more than just uncover and correct bad data; they must stop it

    from entering the environment in the first place.
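
    One way to picture such rules is as a small table of named checks that every incoming record must pass before it is loaded. The sketch below is only an illustration under assumed conditions: the SKU pattern, the field names, and the helper functions validate and load are hypothetical, and real rules would come from subject matter experts.

```python
import re

# Hypothetical rules: each field maps to a predicate that returns True for a valid value.
RULES = {
    "sku": lambda v: isinstance(v, str) and re.fullmatch(r"[A-Z]{3}-\d{4}", v) is not None,
    "price": lambda v: isinstance(v, (int, float)) and v > 0,
    "name": lambda v: isinstance(v, str) and v.strip() != "",
}

def validate(record, rules=RULES):
    """Return the names of the fields that violate a rule (an empty list means valid)."""
    return [field for field, is_valid in rules.items() if not is_valid(record.get(field))]

def load(records, rules=RULES):
    """Split incoming records into accepted rows and rejected rows with their failures."""
    accepted, rejected = [], []
    for record in records:
        failures = validate(record, rules)
        if failures:
            rejected.append((record, failures))
        else:
            accepted.append(record)
    return accepted, rejected

# Illustrative product records: one valid, two that break at least one rule.
incoming = [
    {"sku": "ABC-1234", "price": 19.99, "name": "Widget"},
    {"sku": "abc-12", "price": 19.99, "name": "Widget"},
    {"sku": "XYZ-0001", "price": -5, "name": ""},
]
accepted, rejected = load(incoming)
```

    Because rejected records never reach the accepted list, the same rules that detect bad data also keep it out of the environment, and the recorded failures tell a reviewer which fields to fix.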

    Fixing the Problems

    Now that you know where the problems exist in your data, how do you fix them before predictive

    modeling and analytics tools are applied? Some of the most common procedures for data quality

    management include:

    Cleansing, Standardizing, Enriching, Matching, and Merging

    These steps, while seemingly unrelated, are all rather important in achieving and sustaining

    optimum levels of data quality.

    Cleansing eliminates mistakes within databases and other information sources through the

    alteration of existing data based on pre-defined business rules and criteria. For example, if

    incorrect customer names are identified, cleansing would help to amend missing or incomplete

    entries, while standardization would consistently format all completed entries based on pre-defined business rules.

    Enrichment improves comprehensiveness, dynamically extending and enhancing information by

    comparing it to third-party content such as consumer demographics or geographic distributors and

    appending its attributes when appropriate. If customer records are lacking zip codes, for example, they

    can be determined based on existing addresses, and added as a separate field in each record.

    Merging and matching promote consistency by automatically uncovering related entries within

    the same system or across multiple systems, and then linking, matching, or merging as needed.

    For example, entries for a customer, John Smith, exist in multiple databases. Although

    the records are similar, there are some inconsistencies. Advanced matching capabilities can closely

    assess the data in each record to determine if they are redundant, or separate and distinct. If the

    records are determined to be redundant, merging would then consolidate them into a single,

    comprehensive entry for John Smith, using the most frequently occurring data. Householding, a

    technique similar to merging where related information from disparate systems is collected and

    stored in a data warehouse or other central location for easy access, also falls into this category.
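
    The John Smith example can be sketched in a few lines of Python. This is a simplified stand-in for the matching capabilities described above, not a description of how any particular tool works: it uses the standard library's difflib.SequenceMatcher for approximate string comparison, the 0.85 threshold is an arbitrary assumption, and the sample records and function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so formatting differences do not block a match."""
    return " ".join(str(text).lower().split())

def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def is_match(rec_a, rec_b, threshold=0.85):
    """Treat two records as redundant when both name and address are similar enough."""
    return (similarity(rec_a["name"], rec_b["name"]) >= threshold
            and similarity(rec_a["address"], rec_b["address"]) >= threshold)

def merge(records):
    """Consolidate records into one entry, keeping the most frequent non-empty value per field."""
    merged = {}
    for field in {key for rec in records for key in rec}:
        values = [rec[field] for rec in records if rec.get(field)]
        if values:
            merged[field] = Counter(values).most_common(1)[0][0]
    return merged

# Hypothetical entries for the same customer held in three systems, plus an unrelated one.
crm = {"name": "John Smith", "address": "12 Main St", "phone": "555-0100"}
sfa = {"name": "John Smith", "address": "12 Main  St", "phone": ""}
billing = {"name": "Jon Smith", "address": "12 Main St", "phone": "555-0100"}
other = {"name": "Jane Smythe", "address": "99 Oak Ave", "phone": "555-0199"}

# Compare every record against the CRM entry, then merge the ones judged redundant.
redundant = [rec for rec in (crm, sfa, billing, other) if is_match(crm, rec)]
golden_record = merge(redundant)
```

    Requiring both name and address to clear the threshold is a deliberately conservative choice; a looser rule would merge more records but would also raise the risk of combining two distinct customers.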

    Scoring

    Many organizations have begun to rely on scoring to more effectively evaluate data quality, and

    to better prioritize problems if and when they occur. With scoring, a number is assigned to every

    data record, providing insight into its quality. For example, you may give a pristine record a score

    of five, while a completely invalid record would receive a score of one. Any number in between

    would demonstrate the level of confidence that you have in the record's thoroughness and

    accuracy, and indicate if any action is needed (e.g., any record with a score of three or less would

    require human review).

    Remember to be flexible when it comes to scoring procedures, applying different rules to different

    types of data to convey a sense of urgency or non-urgency when problems arise. Critical data,

    such as customer information, should be scored more strictly than, say, data about the inventory

    of office supplies.
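
    A minimal sketch of such a scoring scheme follows. It assumes, purely for illustration, that a score is driven by how many required fields are filled, that critical data types simply list more required fields, and that a score of three or less triggers human review. The field lists and function names are hypothetical, and a real scheme would also weigh accuracy, not only completeness.

```python
# Hypothetical required fields per data type; critical data is checked against more fields.
REQUIRED_FIELDS = {
    "customer": ["name", "email", "zip", "phone"],
    "office_supply": ["item", "quantity"],
}
REVIEW_THRESHOLD = 3  # scores at or below this value are routed to human review

def is_filled(value):
    """A value counts as filled when it is not None and not a blank string."""
    return value is not None and str(value).strip() != ""

def score_record(record, data_type):
    """Map the share of filled required fields onto a scale from 1 (invalid) to 5 (pristine)."""
    required = REQUIRED_FIELDS[data_type]
    filled = sum(1 for field in required if is_filled(record.get(field)))
    return 1 + round(4 * filled / len(required))

def needs_review(record, data_type):
    """Flag records whose score is at or below the review threshold."""
    return score_record(record, data_type) <= REVIEW_THRESHOLD

# Illustrative customer records at three levels of completeness.
pristine = {"name": "John Smith", "email": "john@example.com", "zip": "10121", "phone": "555-0100"}
partial = {"name": "John Smith", "email": "", "zip": "10121", "phone": None}
empty = {}
```

    Here the pristine record scores five, the record with two of four fields filled scores three and is flagged for review, and the empty record scores one.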

    The Impact of Data Quality on Predictive Analytics

    Predictive analytics has been defined as a branch of data mining related to the prediction of

    future probabilities, behaviors, and trends. When deploying predictive analytics, many companies

    will skip important steps in the process. The most commonly overlooked task is data preparation.

    Getting the data in proper shape for predictive analysis, which includes gathering it from various

    sources and compiling it into a final set that will be fed to the predictive model, should be a key

    activity. In the most successful predictive modeling scenarios, data preparation will account for

    approximately 60 to 80 percent of the cost of the initiative.

    Effective data preparation requires more than just pulling data from back-end systems and

    moving it into a centralized location, such as a data mart or data warehouse. Failing to properly

    cleanse and enhance it can prevent it from being truly analytics-ready.

    Companies don't intentionally ignore the need to fix their data before applying a predictive model

    to it. They simply don't realize how incomplete or inaccurate the information in their enterprise

    systems actually is.

    As one of the very first steps in any predictive analytics project, invalid or erroneous records must

    be located and corrected, and any missing data must be filled in. Otherwise, the information

    feeding the model will lack integrity, and the "garbage in, garbage out" rule will apply. In other

    words, poor information will lead to poor results, and poor results will undoubtedly lead to poor

    decisions.

    The best way to identify bad data is through the use of a comprehensive data quality

    management tool (preferably one that is fully integrated with the predictive modeling solution),

    which can profile, transform, and standardize information, while filling in any missing data. This will help ensure that data preparation is addressed properly, instead of becoming a stumbling block

    that causes significant delays in model creation and deployment.

    There are also other elements of data preparation that must be considered to ensure optimum

    precision in the results. IT organizations must also select tables, records, and attributes from various

    sources across the business as well as transform, merge, aggregate, derive, sample, and weigh

    (when required) the information. It is important to note that these steps may often need to be

    performed multiple times to make the data truly ready for the modeling tool.

    How Data Quality and Predictive Analytics Work Together

    Here is an example of the impact that data quality can have on predictive analysis efforts.

    Company XYZ has purchased a predictive analytics package to expand customer wallet share

    and profitability through more targeted and effective up-sell and cross-sell activities. The goal is

    to determine the factors that influence the purchase of complementary products, and based on

    those factors, identify customers who are most likely to buy certain additional products.

    The data set is compiled from various customer relationship management (CRM) and sales force

    automation (SFA) systems. Once the model is built and deployed, the results prove to be poor.

    Customers with a high likelihood of future purchases are missed, while customers who are unlikely

    to spend any more money with the company are mistakenly identified as potential targets.

    Unknowingly, the company uses those faulty results to launch an aggressive, multi-touch

    up-sell campaign that costs approximately $600,000. If, for example, 20 percent of the data that

    the predictive model was based on were erroneous, it would be safe to assume that 20 percent

    of the results (the list of customers who are most likely to buy a certain product) were also

    unreliable. Therefore, the company would have wasted 20 percent of its investment in the

    campaign, or $120,000.
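
    The arithmetic behind this estimate is a simple proportion, written out below as a short Python sketch. It follows the paper's simplifying assumption that the share of wasted spend equals the share of erroneous data; the function name wasted_spend is hypothetical, and a real campaign would rarely behave this linearly.

```python
def wasted_spend(campaign_cost, error_rate):
    """Estimate wasted spend, assuming waste is proportional to the share of erroneous data."""
    if not 0 <= error_rate <= 1:
        raise ValueError("error_rate must be between 0 and 1")
    return campaign_cost * error_rate

# The scenario from the text: a $600,000 campaign built on data that is 20 percent erroneous.
estimate = wasted_spend(600_000, 0.20)
print(f"Estimated wasted spend: ${estimate:,.0f}")
```

    Halving the error rate to 10 percent under the same assumption would halve the estimated waste to about $60,000, which gives a rough sense of what each improvement in data quality is worth.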

    If the company had instead employed a data quality management solution that provided the

    aforementioned data cleansing and enhancement techniques before the predictive model was

    applied, the erroneous or incomplete records would have been corrected in advance. The increased

    accuracy in the raw data would have substantially improved the results.

    As a result, the company would not only save the potentially wasted $120,000, it would also see

    a sharper increase in the revenues that result from the program, since the list of target customers

    would be more precise.

    If the company continues to use the tool to monitor the quality of the inputs for this model, or

    any future models it creates, it can ensure that predictive modeling results are always as accurate

    as possible, and yield the highest returns.

    Comprehensive Data Quality Management and Analytical Solutions

    iWay Data Quality Center

    iWay Data Quality Center (DQC) delivers a broad array of cutting-edge features in a single, affordable, intuitive solution.

    Key capabilities include:

    Centralized management of all data quality activities, including business rules and data flows, from a single, unified platform

    Bundled administration tools that allow for easy configuration, without the need for external applications

    A platform-independent architecture based on open standards

    Parallel processing methods that ensure scalability, support both batch and on-demand modes, and accelerate data quality procedures, performing the entire data quality process in less than 0.1 second, and processing more than 5 million records per hour

    Advanced semantic profiling, for fast and accurate information analysis

    Seamless integration into any B2B, A2A, or portal application, as well as popular ESB, SOA, and ETL tools

    The ability to easily tap into external data sources, such as national address or name registries, as well as third-party dictionaries and custom lists for the purposes of parsing, cleansing, and validation

    A set of powerful algorithms that efficiently perform approximate matching in record unification, regardless of internal data structures

    iWay Data Profiler

    The iWay Data Profiler integrates output from iWay DQC with business intelligence (BI) technology

    in a simple yet powerful way. Administrators can view, monitor, compare, and report on any mission-critical data with no additional client software, plug-ins, or report viewers required.

    It provides sophisticated integration capabilities bolstered by mature tools for data quality

    monitoring, reporting, and analytics. Users are able to query, analyze, deliver, and display

    electronic profiling data in an almost unlimited number of ways.

    Advanced data profiling information, generated via iWay Data Quality Center's semantic analytics

    and complex business rules, provides basic data statistics, such as uniqueness and frequency,

    and uncovers relationships between data using primary and foreign keys. This profiling data can

    then be further analyzed using intuitive and graphical reporting tools, helping users to uncover

    variances in data profiles over different periods of time. Users can also drill down on profiled

    categories to reveal the details of the exact records within that group.

    The iWay Data Profiler provides a wide array of powerful capabilities, including:

    Customizable data quality indicators (DQIs) that allow companies to define various levels of validity. These DQIs can then be applied to data to provide immediate insight into the integrity of specific records

    Dynamic collection of profiling data from iWay DQC

    Tagging and archiving of profiling data as sets within an associated RDBMS for easy retrieval

    Advanced data manipulation and graphics

    Comparison of multiple data profiling sets for more rapid variance discovery

    Printing and exporting of any data profiling view into HTML, PDF, Excel, and other industry-standard formats

    Portable analytical capabilities embedded directly within the profiling report that allow users to view and analyze profiling data in an almost unlimited number of ways

    Additionally, iWay Data Profiler is available as a software-as-a-service (SaaS) application. This offers

    many significant benefits, including:

    Accelerated deployment and setup

    Increased budget-friendliness through a convenient pay-per-use model that eliminates the high upfront expenditures associated with on-premise tools

    The ability for detailed profiling information to be more easily shared with those who own the data being profiled: non-technical users working across various divisions and lines of business

    Immediate, cost-efficient scalability whenever it's needed to satisfy changing requirements and emerging needs

    WebFOCUS RStat

    WebFOCUS RStat is the market's first fully integrated BI and predictive analytics environment,

    seamlessly bridging the gap between backward- and forward-facing views of business operations.

    With WebFOCUS RStat, companies can easily and cost-effectively deploy predictive models as

    intuitive scoring applications, so business users at all levels can make decisions based on accurate,

    validated future predictions, instead of relying solely on instinct.

    WebFOCUS RStat provides a single platform for data access and preparation, BI, predictive model

    building and testing, and deploying results to end users as scoring applications. This eliminates

    the need to purchase and maintain multiple tools, and frees analysts and other statisticians

    from spending countless hours extracting and querying data. At the same time, it reduces costs,

    simplifies maintenance, and optimizes IT resources.

    WebFOCUS RStat's greatest benefit is its significantly increased accuracy. With the R engine, a

    powerful and flexible open source statistical programming language, as its underlying analysis

    tool, WebFOCUS RStat can deliver results that are always consistent, complete, and correct.

    [Figure: Using WebFOCUS RStat enables a variety of outputs that can be generated to display variable relationships and distributions for exploratory analysis.]

    [Figure: This Decision Tree predictive model shows the graphical display of the tree and how the data was classified into the terminal nodes.]

    WebFOCUS RStat provides:

    A single tool, fully integrated with Developer Studio and WebFOCUS Reporting Servers, with access to more than 300 data sources for both BI developers and data miners

    Comprehensive data exploration, descriptive statistics, and interactive graphs

    In-depth data visualization and transformation

    Hypothesis testing, clustering, and correlation analysis

    The ability to build and export predictive models for estimation and classification of likely future behavior

    Comprehensive predictive model evaluation

    Rapid application creation through easy incorporation of scoring routines into WebFOCUS reports

    Corporate Headquarters: Two Penn Plaza, New York, NY 10121-2898, (212) 736-4433, Fax (212) 967-6406. DN3601489.1011

    Connect With Us: informationbuilders.com, askinfo@informationbuilders.com