Dwdm 2(data warehouse)

Shanu Sharma, CSE-ASET DATA WAREHOUSE- THE BUILDING BLOCKS

Transcript of Dwdm 2(data warehouse)

Page 1: Dwdm 2(data warehouse)

Shanu Sharma, CSE-ASET

DATA WAREHOUSE: THE BUILDING BLOCKS

Page 2: Dwdm 2(data warehouse)

TOPICS COVERED

- Definition of data warehouse
- Characteristics of data warehouse
- Data mart
- Components of data warehouse
- Metadata
- Applications of data warehouse
- OLTP vs. data warehouse

Page 3: Dwdm 2(data warehouse)

CONCEPT OF DATA WAREHOUSE

Take all the data you already have in the organization, clean and transform it, and then provide useful strategic information.

Page 4: Dwdm 2(data warehouse)

DEFINITION OF DATA WAREHOUSE

Bill Inmon, considered to be the father of data warehousing, stated (1996):

“A DW is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of decision-making.”

Sean Kelly said that data in the data warehouse is “separate, available, integrated, time-stamped, subject-oriented, non-volatile, accessible.”

Page 5: Dwdm 2(data warehouse)

CHARACTERISTICS OF DATA WAREHOUSE

Subject Oriented

Integrated

Time Variant

Non Volatile

Page 6: Dwdm 2(data warehouse)

1. SUBJECT ORIENTED DATA

In operational systems, data is stored by individual application or business process, such as data about an individual order or customer.

For example, in the banking industry, the data sets for savings or checking accounts contain data about that particular application.

But in a DW, data is stored by real-world business objects or events, not by applications.

Page 7: Dwdm 2(data warehouse)

In a DW, the subject is the method of organization.

Subjects vary from enterprise to enterprise.

Page 8: Dwdm 2(data warehouse)

2. INTEGRATED DATA

Data in DW comes from several operational systems.

Different datasets have different file formats.

Example: Data for the subject Account comes from 3 different data sources.

So variations are possible:

- Naming conventions could be different.
- Attributes for data items could be different.

For example, a savings account number could be 8 bytes long, while a checking account number is only 6 bytes.

Page 9: Dwdm 2(data warehouse)

Before moving the data into the data warehouse, you have to go through a process of transformation, consolidation, and integration of the source data.

Here are some of the items that would need standardization:

- Naming conventions
- Codes
- Data attributes
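The standardization step can be sketched in Python. This is only an illustration: the two source layouts, the field names, and the gender codes are hypothetical, and only the 8-byte versus 6-byte account number difference is taken from the example above.

```python
# Hypothetical rows from two operational systems that use different
# naming conventions, codes, and attribute lengths.
savings_rows = [{"acct_no": "12345678", "cust_gender": "M"}]
checking_rows = [{"ACCOUNT": "654321", "sex": "1"}]

# Assumed code mapping: each source encodes the same attribute differently.
GENDER_CODES = {"M": "male", "F": "female", "1": "male", "2": "female"}


def standardize(row, account_key, gender_key, account_type):
    """Map one source row onto a common warehouse layout."""
    return {
        # Data attributes: pad every account number to one common width.
        "account_number": row[account_key].zfill(8),
        "account_type": account_type,
        # Codes: translate source-specific codes to one shared vocabulary.
        "gender": GENDER_CODES[row[gender_key]],
    }


def integrate(savings, checking):
    """Consolidate both sources under one naming convention."""
    unified = [standardize(r, "acct_no", "cust_gender", "savings") for r in savings]
    unified += [standardize(r, "ACCOUNT", "sex", "checking") for r in checking]
    return unified


accounts = integrate(savings_rows, checking_rows)
```

After integration, every row carries the same field names, the same code values, and an account number of the same length, whichever source it came from.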

Page 11: Dwdm 2(data warehouse)

3. TIME VARIANT DATA

In operational systems, the stored data contains current values. For example, in a savings account system, the balance is the customer’s current balance.

But the data in the DW is meant for analysis and decision making.

Comparative analysis is one of the best techniques for business performance evaluation.

Time is a critical factor for comparative analysis.

Every data structure in the DW contains a time element.

Page 12: Dwdm 2(data warehouse)

So, DW has to contain historical data and current values.

Data is stored as snapshots over past and current periods.

The time-variant nature of the data in a data warehouse:

- Allows for analysis of the past
- Relates information to the present
- Enables forecasts for the future
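A minimal Python sketch of snapshot storage, assuming month-end balance snapshots for a single account. The account identifier, dates, and amounts are made up for illustration.

```python
from datetime import date

# Hypothetical month-end balance snapshots for one savings account.
snapshots = [
    {"account": "A1", "as_of": date(2023, 1, 31), "balance": 1000.0},
    {"account": "A1", "as_of": date(2023, 2, 28), "balance": 1200.0},
    {"account": "A1", "as_of": date(2023, 3, 31), "balance": 900.0},
]


def balance_as_of(rows, account, day):
    """Return the most recent snapshot balance on or before `day`."""
    eligible = [r for r in rows if r["account"] == account and r["as_of"] <= day]
    if not eligible:
        return None  # no snapshot exists that early
    return max(eligible, key=lambda r: r["as_of"])["balance"]


def change_between(rows, account, start, end):
    """Comparative analysis: balance change between two points in time."""
    return balance_as_of(rows, account, end) - balance_as_of(rows, account, start)


current = balance_as_of(snapshots, "A1", date(2023, 3, 31))
change = change_between(snapshots, "A1", date(2023, 1, 31), date(2023, 3, 31))
```

An operational system would hold only the latest balance. Because the warehouse keeps every snapshot, the same table can answer questions about both past and current periods.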

Page 13: Dwdm 2(data warehouse)

4. NON VOLATILE DATA

- Data from operational systems is moved into the DW at specific intervals.
- Not every business transaction updates the DW.
- Data in the DW is not deleted.
- Data is not changed by individual transactions.

Page 14: Dwdm 2(data warehouse)

Subject Oriented

Organized along the lines of the subjects of the corporation. Typical subjects are customer, product, vendor and transaction.

Time-Variant

Every record in the data warehouse has some form of time variancy attached to it.

Non-Volatile

Refers to the inability of data to be updated. Every record in the data warehouse is time stamped in one form or another.

Page 15: Dwdm 2(data warehouse)

DATA GRANULARITY

Data granularity refers to the level of detail of the data in the data warehouse.

The lower the level of detail (for example, individual transactions rather than monthly summaries), the finer the data granularity.
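The difference between fine and coarse granularity can be shown with a small Python sketch. The transaction rows are hypothetical, and the monthly roll-up is just one possible summary level.

```python
from collections import defaultdict

# Hypothetical fine-grained data: one row per individual transaction.
transactions = [
    {"date": "2023-01-05", "amount": 50.0},
    {"date": "2023-01-20", "amount": 25.0},
    {"date": "2023-02-03", "amount": 40.0},
]


def summarize_by_month(rows):
    """Roll transactions up to a coarser granularity: one total per month."""
    totals = defaultdict(float)
    for row in rows:
        month = row["date"][:7]  # "YYYY-MM" prefix of an ISO date string
        totals[month] += row["amount"]
    return dict(totals)


monthly = summarize_by_month(transactions)
```

The summary is smaller and faster to query, but the individual transactions cannot be recovered from it. That is why the level of granularity has to be chosen deliberately.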

Page 16: Dwdm 2(data warehouse)

DATA WAREHOUSES AND DATA MARTS

In 1998, Bill Inmon stated:

“The single most important issue facing the IT manager this year is whether to build the data warehouse first or the data mart first.”

How are they different?

Page 18: Dwdm 2(data warehouse)

In any organization, there are basically two approaches to managing data for analysis.

1. Top Down Approach

The centralized data warehouse would feed the dependent data marts that may be designed based on a dimensional data model.

In this approach data in the data warehouse is stored at the lowest level of granularity based on a normalized data model.

Page 19: Dwdm 2(data warehouse)

Advantages:

- An enterprise view of data
- Not a union of disparate data marts
- Centralized rules and control

Disadvantages:

- Slow approach
- High exposure to risk of failure

Page 20: Dwdm 2(data warehouse)

2. Bottom Up Approach

In this approach, data marts are created first to provide analytical capability for specific business subjects, based on a dimensional data model.

Then these data marts are joined or unioned by conforming the dimensions to create a DW.

Advantages:

- Faster and easier implementation
- Less risk of failure
- Allows the project team to learn and grow

Disadvantages:

- Redundant data in every data mart
- Inconsistent data

Page 21: Dwdm 2(data warehouse)

DW: BUILDING BLOCKS OR COMPONENTS

Page 22: Dwdm 2(data warehouse)

1. SOURCE DATA COMPONENT

Production data: Comes from the various operational systems of the enterprise.

Internal data: Private documents, customer profiles, departmental databases, etc.

External data: Statistical data produced by external agencies. Used for comparing performance against other organizations.

Archived data: In every operational system, old data is periodically stored in archive files or on disk storage. This data is also required, because the data warehouse keeps historical snapshots of data.

Page 23: Dwdm 2(data warehouse)

2. DATA STAGING COMPONENT

After data is extracted, it has to be prepared.

Data extracted from the sources needs to be changed, converted, and made ready in a suitable format.

Three major functions make the data ready:

- Extract
- Transform
- Load

The staging area provides a place with a set of functions to:

- Clean
- Change
- Combine
- Convert

Page 24: Dwdm 2(data warehouse)

Different techniques are used for extracting data from different data sources.

Data transformation includes:

Data cleaning: correction of misspellings, resolution of conflicts, providing default values for missing data elements, removal of duplicates, etc.

Standardization of data: standardizing data types and field lengths. Semantic standardization, such as resolving synonyms and homonyms.

Sorting, merging, etc.
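The cleaning functions listed above can be sketched as a small transform-and-load step in Python. All rows, the spelling correction table, and the default value are assumptions made for the example. A real staging area would rely on dedicated ETL tooling.

```python
# Hypothetical extracted rows; None marks a missing data element.
extracted = [
    {"cust_id": 2, "city": "Dehli", "segment": None},
    {"cust_id": 1, "city": "Mumbai", "segment": "retail"},
    {"cust_id": 2, "city": "Dehli", "segment": None},  # duplicate row
]

SPELLING_FIXES = {"Dehli": "Delhi"}  # assumed correction table
DEFAULT_SEGMENT = "unknown"          # assumed default for missing values


def transform(rows):
    """Clean the rows: fix misspellings, fill defaults, drop duplicates, sort."""
    cleaned = {}
    for row in rows:
        fixed = {
            "cust_id": row["cust_id"],
            "city": SPELLING_FIXES.get(row["city"], row["city"]),
            "segment": row["segment"] or DEFAULT_SEGMENT,
        }
        # Keeping one row per customer key removes duplicates.
        cleaned[fixed["cust_id"]] = fixed
    return sorted(cleaned.values(), key=lambda r: r["cust_id"])


def load(rows, warehouse):
    """Append the prepared rows to the (in-memory) warehouse table."""
    warehouse.extend(rows)
    return warehouse


warehouse_table = load(transform(extracted), [])
```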

Page 25: Dwdm 2(data warehouse)

Data Loading: Data Movement to the Data Warehouse

Page 26: Dwdm 2(data warehouse)

3. DATA STORAGE COMPONENTS

- Separate repository
- Data structured for efficient processing
- Updated after specific periods
- Read-only

Page 27: Dwdm 2(data warehouse)

4. INFORMATION DELIVERY COMPONENT

It includes various methods of delivering information, depending on the type of user. For example:

Ad hoc reports or predefined reports for novice and casual users.

Statistical analysis for business analysts.

It also provides information to data mining applications.

Page 29: Dwdm 2(data warehouse)

METADATA COMPONENT

The metadata component is the data about the data in the data warehouse.

Metadata in a data warehouse contains the answers to questions about the data in the data warehouse.

It serves as a directory of the contents of the data warehouse.

Page 30: Dwdm 2(data warehouse)

TYPES OF METADATA

Operational Metadata: Contains information about the operational data sources, like field lengths, data types, etc.

Extraction and Transformation Metadata: Extraction frequencies, extraction methods, etc.

End-User Metadata
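As a rough illustration, the three types of metadata can be pictured as a small catalog. Every field name, frequency, method, and description in this Python sketch is hypothetical. The end-user entry assumes that end-user metadata describes warehouse contents in business terms.

```python
# Hypothetical metadata catalog; all names and values are illustrative.
metadata = {
    "operational": {
        "savings.acct_no": {"data_type": "char", "length": 8},
        "checking.ACCOUNT": {"data_type": "char", "length": 6},
    },
    "extraction_transformation": {
        "account_number": {
            "sources": ["savings.acct_no", "checking.ACCOUNT"],
            "extraction_frequency": "daily",
            "extraction_method": "full dump",
        },
    },
    "end_user": {
        "account_number": "Unique number identifying a customer account",
    },
}


def describe(field):
    """Answer 'what is this field and where does it come from?' from metadata."""
    etl = metadata["extraction_transformation"][field]
    sources = [
        f"{name} ({metadata['operational'][name]['length']} chars)"
        for name in etl["sources"]
    ]
    return (
        f"{field}: {metadata['end_user'][field]}; "
        f"loaded {etl['extraction_frequency']} from " + ", ".join(sources)
    )


summary = describe("account_number")
```

Used this way, the catalog acts as a directory: a question about a warehouse field is answered from the metadata, without inspecting the data itself.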

Page 31: Dwdm 2(data warehouse)

TYPES & TYPICAL APPLICATIONS OF DWH

Page 32: Dwdm 2(data warehouse)

APPLICATION AREAS

Industry                  Application
Finance                   Credit card analysis
Insurance                 Claims, fraud analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data service providers    Value added data
Utilities                 Power usage analysis

Page 33: Dwdm 2(data warehouse)

TYPICAL APPLICATIONS

The impact on the organization’s core business is to streamline operations and maximize profitability.

- Fraud detection
- Profitability analysis
- Direct mail/database marketing
- Credit risk prediction
- Yield management
- Inventory management

Page 34: Dwdm 2(data warehouse)

TYPICAL APPLICATIONS

Fraud detection

- By observing data usage patterns.
- People have typical purchase patterns.
- Deviation from patterns.
- Certain cities notorious for fraud.
- Certain items bought by stolen cards.
- Similar behavior for stolen phone cards.
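One simple way to test for a deviation from a customer’s typical purchase pattern is sketched below in Python. The standard-deviation threshold is an illustrative technique chosen for this sketch, not a method named in the slides, and the purchase history is made up.

```python
from statistics import mean, stdev

# Hypothetical purchase amounts from one card holder's history.
history = [42.0, 38.5, 45.0, 40.0, 41.5, 39.0]


def is_deviation(past_amounts, new_amount, threshold=3.0):
    """Flag a purchase that lies far outside the customer's usual range.

    The purchase is flagged when it is more than `threshold` sample
    standard deviations away from the historical mean (an assumed rule).
    """
    avg = mean(past_amounts)
    spread = stdev(past_amounts)
    if spread == 0:
        return new_amount != avg
    return abs(new_amount - avg) / spread > threshold


suspicious = is_deviation(history, 400.0)
ordinary = is_deviation(history, 43.0)
```

A check like this depends on the historical data that a warehouse keeps. With only the current transaction, there is no pattern to compare against.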

Page 35: Dwdm 2(data warehouse)

TYPICAL APPLICATIONS

Profitability Analysis

- Banks know if they are profitable or not.
- Don’t know which customers are profitable.
- Typically more than 50% are NOT profitable.
- Don’t know which ones.
- Balance is not enough, transactional behavior is the key.
- Restructure products and pricing strategies.
- Life-time profitability models (next 3-5 years).

Page 36: Dwdm 2(data warehouse)

TYPICAL APPLICATIONS

Direct mail marketing

Targeted marketing.

Offer the high-bandwidth package to selected users, NOT to all users.

Identify those users from call detail records of web surfing.

Saves marketing expense, saving pennies.

Knowing your customers better.

Page 37: Dwdm 2(data warehouse)

TYPICAL APPLICATIONS

Credit risk prediction

- Who should get a loan?
- Qualitative decision making, NOT subjective.
- Different interest rates for different customers.
- Do not subsidize bad customers on the basis of good ones.

Page 38: Dwdm 2(data warehouse)

TYPICAL APPLICATIONS

Yield Management

- Works for fixed inventory businesses.
- Item prices vary for varying customers.
- Example: airlines, hotels, etc.
- Price of (say) an air ticket depends on:
  - How far in advance was the ticket bought?
  - How many vacant seats were present?
  - How profitable is the customer?
  - Is the ticket one-way or return?

Page 39: Dwdm 2(data warehouse)

RECENT APPLICATION

Agriculture Systems

- Agri and related data collected for decades.
- Decision making based on expert judgment.
- Lack of integration results in underutilization.
- What is required, in which amount and when?

Page 40: Dwdm 2(data warehouse)

DATA WAREHOUSE VS. OLTP

OLTP (On Line Transaction Processing)

SELECT tx_date, balance
FROM tx_table
WHERE account_ID = 23876;

Page 41: Dwdm 2(data warehouse)

DATA WAREHOUSE VS. OLTP

DWH

SELECT balance, age, sal, gender
FROM customer_table, tx_table
WHERE age BETWEEN 30 AND 40
  AND Education = 'graduate'
  AND customer_table.CustID = tx_table.Customer_ID;
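To see the contrast in practice, the OLTP-style and DWH-style queries can be run against a tiny in-memory SQLite database from Python. The table layouts and sample rows are assumptions, since the slides give only the column names used in the queries.

```python
import sqlite3

# In-memory database with assumed schemas and made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_table (
    CustID INTEGER PRIMARY KEY, age INTEGER, sal REAL,
    gender TEXT, Education TEXT
);
CREATE TABLE tx_table (
    account_ID INTEGER, Customer_ID INTEGER, tx_date TEXT, balance REAL
);
INSERT INTO customer_table VALUES
    (1, 35, 50000, 'F', 'graduate'),
    (2, 52, 70000, 'M', 'graduate'),
    (3, 31, 40000, 'M', 'graduate');
INSERT INTO tx_table VALUES
    (23876, 1, '2023-01-10', 1500.0),
    (23876, 1, '2023-02-10', 1700.0),
    (11111, 3, '2023-01-15', 300.0);
""")

# OLTP style: look up one known account in a single table.
oltp_rows = conn.execute(
    "SELECT tx_date, balance FROM tx_table WHERE account_ID = ?", (23876,)
).fetchall()

# DWH style: join two tables and filter on descriptive attributes.
dwh_rows = conn.execute("""
    SELECT t.balance, c.age, c.sal, c.gender
    FROM customer_table AS c
    JOIN tx_table AS t ON c.CustID = t.Customer_ID
    WHERE c.age BETWEEN 30 AND 40 AND c.Education = 'graduate'
""").fetchall()
```

The first query targets a single account by its identifier. The second scans for every customer that matches a profile, so on realistic data volumes it returns far more rows.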

Page 42: Dwdm 2(data warehouse)

DATA WAREHOUSE VS. OLTP

OLTP                                DWH
Primary key used                    Primary key NOT used
No concept of primary index         Primary index used
Few rows returned                   Many rows returned
May use a single table              Uses multiple tables
High selectivity of query           Low selectivity of query
Indexing on primary key (unique)    Indexing on primary index (non-unique)

Page 43: Dwdm 2(data warehouse)

COMPARISON OF RESPONSE TIMES

- On-line analytical processing (OLAP) queries must be executed in a small number of seconds. This often requires denormalization and/or sampling.
- Complex query scripts and large list selections can generally be executed in a small number of minutes.
- Sophisticated clustering algorithms (e.g., data mining) can generally be executed in a small number of hours (even for hundreds of thousands of customers).

Page 44: Dwdm 2(data warehouse)

DATA WAREHOUSE FOR DECISION SUPPORT & OLAP

Putting information technology to work to help the knowledge worker make faster and better decisions:

- Which of my customers are most likely to go to the competition?
- What product promotions have the biggest impact on revenue?
- How did the share price of software companies correlate with profits over the last 10 years?