Recent Developments in Data Warehousing: A Tutorial Hugh J. Watson Terry College of Business University of Georgia [email protected] dw_tutorial.ppt

Recent Developments in Data Warehousing: A Tutorial

Hugh J. Watson, Terry College of Business, University of Georgia

Tutorial Objectives

Provide an overview of data warehousing

Provide materials to support the teaching of data warehousing

Discuss recent developments in data warehousing


Topics Covered

Definitions and concepts

The data mart and enterprise-wide data warehouse strategies

Data extraction, cleansing, transformation and loading

Meta data

Data stores

Online analytical processing (OLAP)

Warehouse users, tools, and applications

Case study: Harrah’s Entertainment

The Importance of Data Warehousing

Provide a “single version of the truth”

Improve decision making

Support key corporate initiatives such as performance management, B2C and B2B e-commerce, and customer relationship management

Estimated to be a $113.5 billion market in 2002 for systems, software, services, and in-house expenditures (Palo Alto Management Group)

A Simple Definition

A data warehouse is a collection of data created to support decision-making applications.

Data Warehouse Characteristics

Subject oriented -- data are organized around sales, products, etc.

Integrated -- data are integrated to provide a comprehensive view

Time variant -- historical data are maintained

Nonvolatile -- data are not updated by users


Another Definition

Data warehousing is the entire process of data extraction, transformation, and loading of data to the warehouse and the access of the data by end users and applications.

Data Mart

A data mart stores data for a limited number of subject areas, such as marketing and sales data. It is used to support specific applications.

An independent data mart is created directly from source systems.

A dependent data mart is populated from a data warehouse.

Operational Data Store

An operational data store consolidates data from multiple source systems and provides a near real-time, integrated view of volatile, current data.

Its purpose is to provide integrated data for operational purposes. It has add, change, and delete functionality.

It may be created to avoid a full-blown ERP implementation.

[Figure: data warehousing architecture. Data sources (transaction data, other internal data, clickstream and other Web data, external data such as demographics) feed ETL software (clean/scrub, transform), which loads the data stores (data warehouse, data marts, meta data). Data analysis tools and applications (queries, reporting, DSS/EIS, data mining) draw on the data stores. Vendor names shown on the slide: Informix, Harte-Hanks, Firstlogic, Teradata, IBM, MicroStrategy.]

Two Data Warehousing Strategies

Enterprise-wide warehouse, top down, the Inmon methodology

Data mart, bottom up, the Kimball methodology

When properly executed, both result in an enterprise-wide data warehouse

The Data Mart Strategy

The most common approach

Begins with a single mart and architected marts are added over time for more subject areas

Relatively inexpensive and easy to implement

Can be used as a proof of concept for data warehousing

Can perpetuate the “silos of information” problem

Can postpone difficult decisions and activities

Requires an overall integration plan

The Enterprise-wide Strategy

A comprehensive warehouse is built initially

An initial dependent data mart is built using a subset of the data in the warehouse

Additional data marts are built using subsets of the data in the warehouse

Like all complex projects, it is expensive, time consuming, and prone to failure

When successful, it results in an integrated, scalable warehouse

Data Sources and Types

Primarily from legacy, operational systems

Almost exclusively numerical data at the present time

External data may be included, often purchased from third-party sources

Technology exists for storing unstructured data, and this is expected to become more important over time

Extraction, Transformation, and Loading (ETL) Processes

The “plumbing” work of data warehousing

Data are moved from source to target databases

A very costly, time-consuming part of data warehousing

Recent Development: More Frequent Updates

Updates can be done in bulk and trickle modes

Business requirements, such as trading partner access to a Web site, require current data

For international firms, there is no good time to load the warehouse
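The difference between bulk and trickle updates can be illustrated with a minimal sketch. It assumes an in-memory SQLite database standing in for the warehouse and a hypothetical `sales` table; the function names are illustrative and do not come from any ETL product.

```python
import sqlite3


def bulk_load(conn, rows):
    """Load an accumulated batch in a single transaction (e.g. a nightly run)."""
    with conn:  # one commit for the whole batch
        conn.executemany(
            "INSERT INTO sales (sale_id, amount) VALUES (?, ?)", rows
        )


def trickle_load(conn, row):
    """Apply one change as soon as it arrives (near real time)."""
    with conn:  # one commit per row
        conn.execute("INSERT INTO sales (sale_id, amount) VALUES (?, ?)", row)


# Hypothetical warehouse table used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

bulk_load(conn, [(1, 10.0), (2, 25.5)])  # periodic batch
trickle_load(conn, (3, 7.25))            # single late-arriving row

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Bulk mode amortizes commit overhead over many rows but leaves the warehouse stale between loads; trickle mode keeps data current at a higher cost per row.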

Recent Development: Clickstream Data

Results from clicks at web sites

A dialog manager handles user interactions. An ODS helps to custom tailor the dialog

The clickstream data is filtered and parsed and sent to a data warehouse where it is analyzed

Software is available to analyze the clickstream data
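The filter-and-parse step might look like the following sketch. It assumes the web server writes Common Log Format lines and that requests for images and stylesheets are noise to be dropped; the pattern, the suffix list, and the sample lines are illustrative assumptions, not part of the tutorial.

```python
import re

# Assumed input: Common Log Format lines from a web server.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+'
)
# Requests for page furniture rather than content (illustrative list).
IGNORED_SUFFIXES = (".gif", ".jpg", ".css", ".js")


def parse_clicks(lines):
    """Filter out noise and parse each remaining line into named fields."""
    clicks = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m is None:
            continue  # unparseable line: drop it
        if m.group("path").lower().endswith(IGNORED_SUFFIXES):
            continue  # image/stylesheet request: not a meaningful click
        clicks.append({
            "host": m.group("host"),
            "time": m.group("time"),
            "path": m.group("path"),
            "status": int(m.group("status")),
        })
    return clicks


sample = [
    '10.0.0.1 - - [15/Jan/2001:10:00:00 -0500] "GET /products/tv HTTP/1.0" 200 5120',
    '10.0.0.1 - - [15/Jan/2001:10:00:01 -0500] "GET /images/logo.gif HTTP/1.0" 200 900',
    'malformed line',
]
clicks = parse_clicks(sample)
```

The resulting records are what would be loaded into the warehouse for later analysis.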

Data Extraction

Often performed by COBOL routines (not recommended because of high program maintenance and no automatically generated meta data)

Sometimes source data is copied to the target database using the replication capabilities of a standard RDBMS (not recommended because of “dirty data” in the source systems)

Increasingly performed by specialized ETL software

Sample ETL Tools

Teradata Warehouse Builder from Teradata

DataStage from Ascential Software

SAS System from SAS Institute

PowerMart/PowerCenter from Informatica

Sagent Solution from Sagent Software

Hummingbird Genio Suite from Hummingbird Communications

Reasons for “Dirty” Data

Dummy Values

Absence of Data

Multipurpose Fields

Cryptic Data

Contradicting Data

Inappropriate Use of Address Lines

Violation of Business Rules

Reused Primary Keys, Non-Unique Identifiers

Data Integration Problems
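A few of these problems can be detected mechanically. The sketch below checks for three of them (dummy values, absent data, and non-unique identifiers) on a handful of made-up records; the dummy-value list and field names are assumptions chosen for illustration.

```python
from collections import Counter

# Placeholder values that often stand in for "unknown" (illustrative list).
DUMMY_VALUES = {"999-99-9999", "N/A", "UNKNOWN", "XXXXX"}


def profile(records, key="customer_id"):
    """Return (record index, field, problem) tuples for simple dirty-data checks."""
    key_counts = Counter(r[key] for r in records)
    issues = []
    for i, record in enumerate(records):
        for field, value in record.items():
            if value is None or value == "":
                issues.append((i, field, "absent"))
            elif value in DUMMY_VALUES:
                issues.append((i, field, "dummy"))
        if key_counts[record[key]] > 1:
            issues.append((i, key, "non-unique"))
    return issues


# Made-up source records.
records = [
    {"customer_id": 1, "ssn": "999-99-9999", "city": "Athens"},
    {"customer_id": 2, "ssn": "123-45-6789", "city": ""},
    {"customer_id": 2, "ssn": "987-65-4321", "city": "Atlanta"},
]
issues = profile(records)
```

Problems such as multipurpose fields or business-rule violations need domain knowledge and are not covered by checks this simple.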

Data Cleansing

Source systems contain “dirty data” that must be cleansed

ETL software contains rudimentary data cleansing capabilities

Specialized data cleansing software is often used. It is important for performing name and address correction and householding functions

Leading data cleansing vendors include Vality (Integrity), Harte-Hanks (Trillium), and Firstlogic (i.d.Centric)

Steps in Data Cleansing

Parsing

Correcting

Standardizing

Matching

Consolidating

Parsing

Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.

Examples include parsing the first, middle, and last name; street number and street name; and city and state.
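The three parsing examples can be sketched as follows. This is a deliberately naive illustration: it assumes names are written "First Middle Last", street lines start with a number, and city and state are separated by a comma. Commercial cleansing tools handle far messier input.

```python
import re

# Assumes the street line begins with a house number.
ADDRESS_PATTERN = re.compile(r"^(?P<number>\d+)\s+(?P<street>.+)$")


def parse_name(full_name):
    """Split a name into first, middle, and last (naive whitespace split)."""
    parts = full_name.split()
    return {
        "first": parts[0],
        "middle": " ".join(parts[1:-1]),
        "last": parts[-1],
    }


def parse_street(line):
    """Separate the street number from the street name."""
    m = ADDRESS_PATTERN.match(line.strip())
    if m is None:
        return {"number": "", "street": line.strip()}
    return {"number": m.group("number"), "street": m.group("street")}


def parse_city_state(text):
    """Split 'City, ST' into its two elements."""
    city, _, state = text.partition(",")
    return {"city": city.strip(), "state": state.strip()}


name = parse_name("Hugh J. Watson")
street = parse_street("123 Main Street")
place = parse_city_state("Athens, GA")
```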

Correcting

Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.

Examples include replacing a vanity address and adding a zip code.
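A correction step driven by secondary data sources might be sketched like this. Both lookup tables are hypothetical stand-ins for a postal reference file; real tools use licensed address databases and more sophisticated algorithms.

```python
# Hypothetical secondary sources, invented for this illustration.
ZIP_LOOKUP = {("Athens", "GA"): "30602"}
VANITY_ADDRESSES = {"One Corporate Plaza": "100 Main Street"}


def correct(record):
    """Replace a vanity address and fill in a missing zip code."""
    out = dict(record)
    out["street"] = VANITY_ADDRESSES.get(record["street"], record["street"])
    if not out.get("zip"):
        out["zip"] = ZIP_LOOKUP.get((record["city"], record["state"]), "")
    return out


corrected = correct(
    {"street": "One Corporate Plaza", "city": "Athens", "state": "GA", "zip": ""}
)
```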

Standardizing

Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.

Examples include adding a pre name, replacing a nickname, and using a preferred street name.
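Two of these conversions, nickname replacement and preferred street names, can be sketched with simple lookup tables. The tables below are small illustrative assumptions; adding a pre name (such as Mr. or Ms.) is omitted because it needs information the record does not carry.

```python
# Illustrative conversion tables; a real rule set would be far larger.
NICKNAMES = {"Bill": "William", "Bob": "Robert", "Beth": "Elizabeth"}
STREET_WORDS = {
    "St": "Street", "St.": "Street",
    "Ave": "Avenue", "Ave.": "Avenue",
    "Rd": "Road", "Rd.": "Road",
}


def standardize(record):
    """Apply conversion rules so equivalent values share one preferred form."""
    out = dict(record)
    out["first"] = NICKNAMES.get(record["first"], record["first"])
    out["street"] = " ".join(
        STREET_WORDS.get(word, word) for word in record["street"].split()
    )
    return out


standardized = standardize(
    {"first": "Bill", "last": "Smith", "street": "12 Oak St."}
)
```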

Matching

Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.

Examples include identifying similar names and addresses.

Consolidating

Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
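The matching and consolidating steps can be sketched together. The example below uses the standard library's `difflib.SequenceMatcher` as a stand-in for a vendor's matching algorithm, with an assumed similarity threshold of 0.85 and a "first non-empty value wins" merge rule. Both choices are illustrative, not the method used by any particular cleansing product.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    """Treat two strings as a match when their similarity ratio is high enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def match(records):
    """Group records whose name and address both resemble a group's first record."""
    groups = []
    for rec in records:
        for group in groups:
            head = group[0]
            if similar(rec["name"], head["name"]) and similar(
                rec["address"], head["address"]
            ):
                group.append(rec)
                break
        else:
            groups.append([rec])  # no existing group matched
    return groups


def consolidate(group):
    """Merge matched records into ONE record; first non-empty value wins."""
    merged = {}
    for rec in group:
        for field, value in rec.items():
            if value and not merged.get(field):
                merged[field] = value
    return merged


# Made-up, already standardized records.
records = [
    {"name": "William Smith", "address": "12 Oak Street", "phone": ""},
    {"name": "William Smyth", "address": "12 Oak Street", "phone": "555-0100"},
    {"name": "Mary Jones", "address": "48 Elm Avenue", "phone": "555-0199"},
]
groups = match(records)
customers = [consolidate(group) for group in groups]
```

Here the two William records collapse into a single customer that keeps the phone number found only on the second record.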

Data Staging

Often used as an interim step between data extraction and later steps

Accumulates data from asynchronous sources using native interfaces, flat files, FTP sessions, or other processes

At a predefined cutoff time, data in the staging file is transformed and loaded to the warehouse

There is usually no end user access to the staging file

An operational data store may be used for data staging

Data Transformation

Transforms the data in accordance with the business rules and standards that have been established

Examples include format changes, deduplication, splitting up fields, replacement of codes, derived values, and aggregates
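Several of these transformations are shown in the sketch below: a date format change, replacement of a code, a derived value, and an aggregate. The field names and the gender code table are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical source-system code table.
GENDER_CODES = {"1": "M", "2": "F"}


def transform(row):
    """Apply row-level business rules to one source record."""
    return {
        # Format change: MM/DD/YYYY -> ISO 8601
        "sale_date": datetime.strptime(row["sale_date"], "%m/%d/%Y")
        .date()
        .isoformat(),
        # Replacement of codes ("U" for unknown codes)
        "gender": GENDER_CODES.get(row["gender_code"], "U"),
        # Derived value
        "revenue": row["units"] * row["unit_price"],
        "store": row["store"],
    }


def aggregate_by_store(rows):
    """Aggregate: total revenue per store."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["store"]] += row["revenue"]
    return dict(totals)


source = [
    {"sale_date": "01/15/2001", "gender_code": "1", "units": 3, "unit_price": 200.0, "store": "A"},
    {"sale_date": "01/15/2001", "gender_code": "2", "units": 1, "unit_price": 150.0, "store": "A"},
    {"sale_date": "01/16/2001", "gender_code": "9", "units": 2, "unit_price": 100.0, "store": "B"},
]
transformed = [transform(row) for row in source]
totals = aggregate_by_store(transformed)
```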

Data Loading

Data are physically moved to the data warehouse

The loading takes place within a “load window”

The trend is to near real time updates of the data warehouse as the warehouse is increasingly used for operational applications

Meta Data

Data about data

Needed by both information technology personnel and users

IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc.

Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.

Recent Development: Meta Data Integration

A growing realization that meta data is critical to data warehousing success

Progress is being made on getting vendors to agree on standards and to incorporate the sharing of meta data among their tools

Vendors like Microsoft, Computer Associates, and Oracle have entered the meta data marketplace with significant product offerings

Database Vendors

High end (i.e., terabyte plus) vendors include NCR-Teradata (Teradata) and IBM (DB2)

Oracle (8i) and Microsoft (SQL Server 7) are major players for smaller databases

On-line Analytical Processing (OLAP)

A set of functionality that facilitates multidimensional analysis

Allows users to analyze data in ways that are natural to them

Comes in many varieties, including ROLAP and MOLAP
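Multidimensional analysis can be illustrated with a tiny in-memory "cube". The sketch assumes three dimensions (product, region, quarter) and one measure (sales amount), all made up, and shows two typical OLAP operations: rolling up to a coarser level and slicing on one dimension value.

```python
from collections import defaultdict

# Made-up facts: (product, region, quarter, sales amount).
FACTS = [
    ("TV", "East", "Q1", 100.0),
    ("TV", "West", "Q1", 80.0),
    ("Radio", "East", "Q1", 40.0),
    ("TV", "East", "Q2", 120.0),
]
DIMENSIONS = ("product", "region", "quarter")


def roll_up(facts, keep):
    """Total the measure over every dimension not named in `keep`."""
    positions = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(float)
    for fact in facts:
        totals[tuple(fact[i] for i in positions)] += fact[3]
    return dict(totals)


def slice_cube(facts, dimension, value):
    """Keep only the facts where one dimension has a fixed value."""
    i = DIMENSIONS.index(dimension)
    return [fact for fact in facts if fact[i] == value]


by_product = roll_up(FACTS, ["product"])
east_by_quarter = roll_up(slice_cube(FACTS, "region", "East"), ["quarter"])
```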


ROLAP

Relational OLAP

Uses an RDBMS to implement an OLAP environment

Typically involves a star schema to provide the multidimensional capabilities

OLAP tool manipulates RDBMS star schema data

Called slowlap by MOLAP vendors

MOLAP

Multidimensional OLAP

Uses an MDDBS (e.g., Essbase) to store and access data

Usually requires proprietary (non SQL) data access tools

Provides exceptionally fast response


Star Schema

Creates non-normalized data structures

Easier for users to understand

Optimized for OLAP

Uses fact (facts or measures in the business) and dimension (establishes the context of the facts) tables

OLAP Tools

Products come from vendors such as Brio, Cognos, Hyperion, and BusinessObjects

Typically available as a fat or thin (i.e., browser) client

In a web environment, the browser communicates with a web server, which talks to an application server, which connects to backend databases

The application server provides query, reporting, and OLAP analysis functionality over the web

Java applets or downloaded components augment the thin client

A broadcast server may be used to schedule, run, publish, and broadcast reports, alerts, and responses over the LAN, email, or personal digital assistant.

Star Schema (example; # marks a key column)

Claim (fact table): #Physician ID, #Patient ID, #Service Code, #Payer ID, #Claim Number, #Line Item Number, #Claim Date, Date of Services, Amount of Charge, Unit of Services

Service: #Service Code, Service Description, #Category Code

Time Periods: #Claim Date, Year, Month, Quarter, Week

Payer: #Payer ID, Name, Address, Phone Number, EDI Number

Patient: #Patient ID, Patient Name, Address, Age, Sex, Insurance ID

Physician: #Physician ID, Physician Name, Specialty ID, Credential ID
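A cut-down version of this schema can be built and queried to show how the fact table and dimension tables work together. The sketch uses an in-memory SQLite database, keeps only two of the dimensions and a few columns, and loads made-up rows; the snake_case column names and all data values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: the context of the facts
CREATE TABLE physician (
    physician_id   INTEGER PRIMARY KEY,
    physician_name TEXT,
    specialty_id   TEXT
);
CREATE TABLE payer (
    payer_id INTEGER PRIMARY KEY,
    name     TEXT
);
-- Fact table: measures, keyed by the dimension keys
CREATE TABLE claim (
    physician_id     INTEGER REFERENCES physician(physician_id),
    payer_id         INTEGER REFERENCES payer(payer_id),
    claim_number     INTEGER,
    line_item_number INTEGER,
    amount_of_charge REAL,
    PRIMARY KEY (physician_id, payer_id, claim_number, line_item_number)
);
-- Made-up sample rows
INSERT INTO physician VALUES (1, 'Dr. Adams', 'CARD'), (2, 'Dr. Baker', 'ORTH');
INSERT INTO payer VALUES (10, 'Acme Health');
INSERT INTO claim VALUES
    (1, 10, 500, 1, 120.0),
    (1, 10, 500, 2, 30.0),
    (2, 10, 501, 1, 75.0);
""")

# Total charges by physician specialty: join the fact table to one dimension.
charges_by_specialty = conn.execute("""
    SELECT p.specialty_id, SUM(c.amount_of_charge)
    FROM claim AS c
    JOIN physician AS p ON p.physician_id = c.physician_id
    GROUP BY p.specialty_id
    ORDER BY p.specialty_id
""").fetchall()
```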

Dimension Table Examples

Retail -- store name, zip code, product name, product category, day of week

Telecommunications -- call origin, call destination

Banking -- customer name, account number, branch, account officer

Insurance -- policy type, insured party

Fact Table Examples

Retail -- number of units sold, sales amount

Telecommunications -- length of call in minutes, average number of calls

Banking -- average monthly balance

Insurance -- claims amount

The Fact Table Key Concatenates the Dimension Keys

Assume that you want to know the number of television sets sold to Best Buys on January 15, 2001. The query might be:

SELECT CLIENT.CUSNAME, SALES.NOSOLD
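The rest of the slide's query is not preserved in this transcript. One way such a query could be completed is sketched below against an in-memory SQLite database. Only `CLIENT.CUSNAME` and `SALES.NOSOLD` come from the slide; the `PRODUCT` table, the key columns (`CUSKEY`, `PRODKEY`, `DATEKEY`), and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE CLIENT  (CUSKEY  INTEGER PRIMARY KEY, CUSNAME  TEXT);
CREATE TABLE PRODUCT (PRODKEY INTEGER PRIMARY KEY, PRODNAME TEXT);
-- The fact table key is the concatenation of the dimension keys.
CREATE TABLE SALES (
    CUSKEY  INTEGER REFERENCES CLIENT(CUSKEY),
    PRODKEY INTEGER REFERENCES PRODUCT(PRODKEY),
    DATEKEY TEXT,
    NOSOLD  INTEGER,
    PRIMARY KEY (CUSKEY, PRODKEY, DATEKEY)
);
-- Made-up sample rows
INSERT INTO CLIENT  VALUES (1, 'Best Buys'), (2, 'Other Store');
INSERT INTO PRODUCT VALUES (1, 'Television'), (2, 'Radio');
INSERT INTO SALES VALUES
    (1, 1, '2001-01-15', 40),
    (1, 2, '2001-01-15', 12),
    (2, 1, '2001-01-15', 9),
    (1, 1, '2001-01-16', 25);
""")

# Each dimension constraint narrows the search to one fact row.
result = conn.execute(
    """
    SELECT CLIENT.CUSNAME, SALES.NOSOLD
    FROM SALES
    JOIN CLIENT  ON CLIENT.CUSKEY   = SALES.CUSKEY
    JOIN PRODUCT ON PRODUCT.PRODKEY = SALES.PRODKEY
    WHERE CLIENT.CUSNAME   = ?
      AND PRODUCT.PRODNAME = ?
      AND SALES.DATEKEY    = ?
    """,
    ("Best Buys", "Television", "2001-01-15"),
).fetchall()
```

Because the three dimension keys together form the fact table's primary key, fixing a customer, a product, and a date identifies exactly one fact row.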







Warehouse Users

Analysts

Managers

Executives

Operational personnel

Customers and suppliers

Warehouse Tools and Applications

SQL queries

Managed query environments

Structured and ad hoc reports

DSS/EIS

Portals

Data mining

Packaged applications

Custom-built applications

Recent Development: Enterprise Intelligence Portals

Offers users an effective way to access information scattered across networked enterprise systems through a simple and personalized Web interface

Provides access to structured and unstructured data

Potentially integrates data warehousing and knowledge management

Harrah’s Entertainment

Harrah’s Entertainment -- data warehousing supported a successful shift to a CRM oriented corporate strategy. Winner of the 2000 TDWI Leadership Award

Operates 21 casinos across the country

In 1993, the gaming laws changed, which allowed Harrah’s to expand

Harrah’s decided to compete using a brand strategy supported by information technology

Needed to know their customers exceptionally well


Harrah’s Data Warehousing Architecture

WINet sources data from the casino, hotel, and event systems

The patron data base (PDB) serves as an operational data store

The marketing workbench (MWB) serves as the data warehouse

Sample Applications

Operational personnel use PDB to check the preferences, history, and value of customers

Analysts use PDB and MWB to create offers to visit a Harrah’s casino

Analysts use MWB to support predictive modeling efforts

[Figure: Harrah’s marketing cycle (“Right Offer, Right Message, Right Time”). Steps shown: predict the value of a customer; market based on that expected value; track transactions that are linked to marketing initiatives; evaluate the effectiveness; track profitability; refine marketing approaches. Define: objectives, tests, control cells. Measure: profit and loss, behavior change, new test report.]

[Figure: Customer Relationship Lifecycle. Axes: annual revenue versus length of relationship. Stages shown: Establish, Reinvigorate.]


Cooper, B.L., H.J. Watson, B.H. Wixom, and D.L. Goodhue, "Data Warehousing Supports Corporate Strategy at First American Corporation," MIS Quarterly, (December 2000), pp. 547-567. Provides a case study of how the First American Corporation turned their strategy and fortunes around through the use of data warehousing.

Stoller, Wixom, and Watson, “WISDOM Provides Competitive Advantage at Owens & Minor.” Provides a case study of how data warehousing can support supply chain integration.

Watson, Wixom, Buonamica, and Revak, “Sherwin-Williams' Data Mart Strategy: Creating Intelligence Across the Supply Chain,” Communications of ACIS, April 2001. Provides a textbook example of how to implement a data mart strategy.

Watson, H.J., D.A. Annino, B.H. Wixom, K.L. Avery, and M. Rutherford, “Current Practices in Data Warehousing,” Information Systems Management, (Winter, 2001), pp. 47-55. Provides data on companies’ data warehousing experiences, with an emphasis on the benefits being realized.

Watson, H.J. and L. Volonino, “Harrah’s High Payoff from Customer Information.” Provides a case study of how Harrah’s Entertainment has implemented a CRM strategy facilitated by data warehousing.

Devlin, Data Warehouse -- Architecture to Implementation, Addison-Wesley, 1997.

Gray and Watson, Decision Support in the Data Warehouse, Prentice-Hall,


Kimball, The Data Warehouse Toolkit, Wiley, 1996.

Kimball and Merz, The Data Webhouse Toolkit, Wiley, 2000.

Inmon, Building the Operational Data Store, second edition, Wiley, 1999.

Inmon, Imhoff, and Sousa, Corporate Information Factory, Wiley, 1999.

(provides detailed information about the OLAP market, products, and applications)

(includes an interactive demo of their data cleansing tool)

(a wealth of current information from “the father of data warehousing”)

(illustrates recent advances in ETL tools)

(excellent materials from one of the leading DSS vendors)
