Dwdm 2(data warehouse)
Shanu Sharma, CSE-ASET
DATA WAREHOUSE: THE BUILDING BLOCKS
TOPICS COVERED
Definition of data warehouse
Characteristics of data warehouse
Data mart
Components of data warehouse
Metadata
Applications of data warehouse
OLTP vs. data warehouse
CONCEPT OF DATA WAREHOUSE
Take all the data you already have in the organization, clean and transform it, and then provide useful strategic information.
DEFINITION OF DATA WAREHOUSE
Bill Inmon, considered the father of data warehousing, stated in 1996:
“A DW is a subject-oriented, integrated, non-volatile, time-variant collection of data in support of decision-making.”
Sean Kelly described the data in the data warehouse as “separate, available, integrated, time-stamped, subject-oriented, non-volatile, accessible.”
CHARACTERISTICS OF DATA WAREHOUSE
Subject Oriented
Integrated
Time Variant
Non Volatile
1. SUBJECT ORIENTED DATA
In operational systems, data is stored by individual application or business process, for example data about an individual order or customer.
In the banking industry, for example, the data sets for savings or checking accounts contain data about that particular application only.
In a DW, by contrast, data is stored by real-world business subjects or events, not by applications.
In a DW, the subject is the organizing method.
Subjects vary with the enterprise.
2. INTEGRATED DATA
Data in DW comes from several operational systems.
Different datasets have different file formats.
Example: data for the subject Account comes from 3 different data sources.
So variations are likely, such as:
Naming conventions could be different.
Attributes for data items could be different.
For example, a savings account number could be 8 bytes long while a checking account number is only 6 bytes.
Before moving the data into the data warehouse, you have to go through a process of transformation, consolidation, and integration of the source data.
Here are some of the items that would need standardization:
Naming conventions
Codes
Data attributes
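The standardization items above can be sketched in Python. This is only an illustration under stated assumptions: the three source systems, their field names (`acct_no`, `AccountNumber`, `acc`), the type codes, and the 8-character zero-padded account number are all invented for the example and do not come from the slides.

```python
# Hypothetical records for the subject "Account" from three source systems.
# Field names, codes, and widths differ per source (all invented here).
savings = [{"acct_no": "12345678", "type": "S", "bal": "1500.50"}]
checking = [{"AccountNumber": "654321", "acct_type": "CHK", "balance": 320.0}]
loans = [{"acc": "77", "kind": "loan", "amount_due": "9000"}]

# Naming conventions: source field name -> warehouse field name.
FIELD_MAPS = {
    "savings": {"acct_no": "account_id", "type": "account_type", "bal": "balance"},
    "checking": {"AccountNumber": "account_id", "acct_type": "account_type", "balance": "balance"},
    "loans": {"acc": "account_id", "kind": "account_type", "amount_due": "balance"},
}

# Codes: source-specific codes -> one standard code set.
TYPE_CODES = {"S": "SAVINGS", "CHK": "CHECKING", "loan": "LOAN"}

# Data attributes: assumed standard length for the warehouse account number.
ACCOUNT_ID_WIDTH = 8


def standardize(record, source):
    """Rename fields, unify codes, and normalize data attributes."""
    renamed = {FIELD_MAPS[source][k]: v for k, v in record.items()}
    return {
        "account_id": str(renamed["account_id"]).zfill(ACCOUNT_ID_WIDTH),
        "account_type": TYPE_CODES[renamed["account_type"]],
        "balance": float(renamed["balance"]),
        "source": source,
    }


integrated = (
    [standardize(r, "savings") for r in savings]
    + [standardize(r, "checking") for r in checking]
    + [standardize(r, "loans") for r in loans]
)

for row in integrated:
    print(row)
```

After this step every account row has the same field names, the same code set, and the same account number length, whichever system it came from.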
3. TIME VARIANT DATA
In operational systems, the stored data contains current values. In a savings account system, for example, the balance is the customer's current balance.
But the data in the DW is meant for analysis and decision making.
Comparative analysis is one of the best techniques for business performance evaluation.
Time is a critical factor for comparative analysis.
Every data structure in the DW contains a time element.
So, the DW has to contain historical data as well as current values.
Data is stored as snapshots over past and current periods.
The time-variant nature of the data in a data warehouse:
Allows for analysis of the past
Relates information to the present
Enables forecasts for the future
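A small Python sketch may make the snapshot idea concrete. The month-end snapshot rows, the account number, and the helper names `balance_as_of` and `period_changes` are assumptions for illustration only.

```python
from datetime import date

# Hypothetical month-end balance snapshots for one savings account.
# Every row carries a time element (snapshot_date).
snapshots = [
    {"account_id": "00012345", "snapshot_date": date(2023, 1, 31), "balance": 1000.0},
    {"account_id": "00012345", "snapshot_date": date(2023, 2, 28), "balance": 1200.0},
    {"account_id": "00012345", "snapshot_date": date(2023, 3, 31), "balance": 900.0},
]


def balance_as_of(rows, account_id, as_of):
    """Return the balance from the latest snapshot on or before as_of."""
    candidates = [
        r for r in rows
        if r["account_id"] == account_id and r["snapshot_date"] <= as_of
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["snapshot_date"])["balance"]


def period_changes(rows, account_id):
    """Compare each snapshot with the previous one (comparative analysis)."""
    ordered = sorted(
        (r for r in rows if r["account_id"] == account_id),
        key=lambda r: r["snapshot_date"],
    )
    return [
        (later["snapshot_date"], later["balance"] - earlier["balance"])
        for earlier, later in zip(ordered, ordered[1:])
    ]


print(balance_as_of(snapshots, "00012345", date(2023, 2, 15)))  # 1000.0
print(period_changes(snapshots, "00012345"))
```

An operational system could answer only the current-balance question; the time element on every row is what allows the "as of" lookup and the period-to-period comparison.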
4. NON VOLATILE DATA
Data from operational systems is moved into the DW at specific intervals.
Not every business transaction updates the DW.
Data in the DW is not deleted.
Data is not changed by individual transactions.
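One way to picture non-volatility is an append-only store that is loaded in periodic batches. The class `WarehouseTable`, its methods, and the sample balances below are hypothetical names and values chosen for this sketch.

```python
from datetime import date


class WarehouseTable:
    """Append-only store: rows are added in periodic batches, never edited."""

    def __init__(self):
        self._rows = []

    def load_batch(self, rows, load_date):
        """Append one periodic batch, stamping each row with its load date."""
        for row in rows:
            self._rows.append({**row, "load_date": load_date})

    def rows(self):
        """Return copies so callers cannot change stored data."""
        return [dict(r) for r in self._rows]


# Operational side: transactions change the current balance many times.
operational_balance = {"00012345": 1000.0}
operational_balance["00012345"] -= 200.0   # withdrawal
operational_balance["00012345"] += 50.0    # deposit

# Warehouse side: only the state at load time is captured.
dw = WarehouseTable()
dw.load_batch(
    [{"account_id": k, "balance": v} for k, v in operational_balance.items()],
    load_date=date(2023, 1, 31),
)

operational_balance["00012345"] -= 100.0   # later transaction, not in the DW yet
dw.load_batch(
    [{"account_id": k, "balance": v} for k, v in operational_balance.items()],
    load_date=date(2023, 2, 28),
)

print(dw.rows())
```

The individual withdrawal and deposit never touch the warehouse; each load adds a new row and leaves the earlier rows as they were.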
Subject Oriented
Organized along the lines of the subjects of the corporation. Typical subjects are customer, product, vendor and transaction.
Time-Variant
Every record in the data warehouse has some form of time variancy attached to it.
Non-Volatile
Refers to the inability of data to be updated. Every record in the data warehouse is time stamped in one form or another.
DATA GRANULARITY
Data granularity refers to the level of detail of the data in the data warehouse.
The lower the level of detail, the finer the data granularity: daily figures are at a lower level, and therefore a finer grain, than monthly summaries.
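The effect of granularity can be sketched by rolling the same data up to coarser grains. The transaction rows and the helper `summarize` are invented for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sales at the finest grain: one row per transaction.
transactions = [
    {"day": date(2023, 1, 5), "product": "A", "amount": 10.0},
    {"day": date(2023, 1, 5), "product": "A", "amount": 15.0},
    {"day": date(2023, 1, 20), "product": "A", "amount": 5.0},
    {"day": date(2023, 2, 3), "product": "A", "amount": 20.0},
]


def summarize(rows, key):
    """Roll rows up to a coarser grain defined by key(row)."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row["amount"]
    return dict(totals)


daily = summarize(transactions, lambda r: (r["day"], r["product"]))
monthly = summarize(transactions, lambda r: (r["day"].year, r["day"].month, r["product"]))

print(len(transactions), len(daily), len(monthly))  # 4 3 2
print(monthly)
```

Finer grain keeps more rows and answers more detailed questions; coarser grain needs less storage but cannot be broken back down into days or transactions.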
DATA WAREHOUSES AND DATA MARTS
In 1998 Bill Inmon stated:
“The single most important issue facing the IT manager this year is whether to build the data warehouse first or the data mart first.”
How are they different?
An organization that manages data for analysis has basically two approaches.
1. Top Down Approach
In this approach, data in the data warehouse is stored at the lowest level of granularity, based on a normalized data model.
The centralized data warehouse then feeds the dependent data marts, which may be designed on a dimensional data model.
Advantages:
An enterprise view of data
Not a union of disparate data marts
Centralized rules and control
Disadvantages:
Slow approach
High exposure to risk of failure
2. Bottom Up Approach
In this approach, data marts are created first to provide analytical capability for specific business subjects, based on a dimensional data model.
These data marts are then joined or unioned by conforming their dimensions to create a DW.
Advantages:
Faster and easier implementation
Less risk of failure
Allows the project team to learn and grow
Disadvantages:
Redundant data in every data mart
Inconsistent data
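What "conforming the dimensions" buys can be sketched as follows. The two marts, the shared `customer_dim`, and all values are assumptions invented for this example; real marts would be database tables rather than Python lists.

```python
# A shared (conformed) customer dimension: both marts use the same keys
# and the same attribute definitions. All names and values are invented.
customer_dim = {
    1: {"name": "Asha", "region": "North"},
    2: {"name": "Ravi", "region": "South"},
}

# Two independently built data marts, each with its own fact rows.
sales_mart = [
    {"customer_key": 1, "sales": 500.0},
    {"customer_key": 2, "sales": 300.0},
]
support_mart = [
    {"customer_key": 1, "tickets": 2},
    {"customer_key": 1, "tickets": 1},
]


def by_region(facts, measure):
    """Aggregate one mart's measure by the conformed 'region' attribute."""
    totals = {}
    for row in facts:
        region = customer_dim[row["customer_key"]]["region"]
        totals[region] = totals.get(region, 0) + row[measure]
    return totals


sales = by_region(sales_mart, "sales")
tickets = by_region(support_mart, "tickets")

# Because both results share the same dimension values, they line up.
combined = {
    region: {"sales": sales.get(region, 0), "tickets": tickets.get(region, 0)}
    for region in sorted(set(sales) | set(tickets))
}
print(combined)
```

If each mart kept its own customer keys or its own definition of region, the two results could not be combined this way, which is the inconsistency risk listed among the disadvantages.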
DW: BUILDING BLOCKS OR COMPONENTS
1. SOURCE DATA COMPONENT
Production data: comes from the various operational systems of the enterprise.
Internal data: private documents, customer profiles, departmental databases, etc.
External data: statistics produced by external agencies, used for comparing performance against other organizations.
Archived data: in every operational system, old data is periodically stored in archive files or on disk storage. This data is also required, because the data warehouse keeps historical snapshots of data.
2. DATA STAGING COMPONENT
After data is extracted, it has to be prepared.
Data extracted from the sources needs to be changed, converted, and made ready in a suitable format.
Three major functions make the data ready: extract, transform, and load.
The staging area provides a place with a set of functions to clean, change, combine, and convert data.
Different techniques are used for extracting data from different data sources.
Data transformation includes:
Data cleaning: correcting misspellings, resolving conflicts, providing default values for missing data elements, removing duplicates, etc.
Standardization of data: standardizing data types and field lengths; semantic standardization such as resolving synonyms and homonyms.
Sorting, merging, etc.
Data Loading: Data Movement to the Data Warehouse
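The extract, transform, and load functions of the staging component can be sketched end to end in a few lines of Python. Everything concrete here is an assumption for illustration: the raw rows, the `SPELLING_FIXES` correction table, the default amount, and the in-memory list standing in for the warehouse.

```python
# Hypothetical raw rows as they might arrive from a source system.
RAW_ROWS = [
    {"cust": " alice ", "city": "Dehli", "amount": "100"},
    {"cust": "Bob", "city": "Mumbai", "amount": None},
    {"cust": " alice ", "city": "Dehli", "amount": "100"},  # duplicate
]

SPELLING_FIXES = {"Dehli": "Delhi"}  # assumed correction table
DEFAULT_AMOUNT = 0.0                 # assumed default for missing values


def extract():
    """Extract: read rows from the source (here, an in-memory list)."""
    return list(RAW_ROWS)


def transform(rows):
    """Transform: clean, standardize, de-duplicate, and sort."""
    cleaned = []
    seen = set()
    for row in rows:
        record = {
            "customer": row["cust"].strip().title(),
            "city": SPELLING_FIXES.get(row["city"], row["city"]),
            "amount": float(row["amount"]) if row["amount"] is not None else DEFAULT_AMOUNT,
        }
        key = tuple(record.values())
        if key not in seen:          # remove duplicates
            seen.add(key)
            cleaned.append(record)
    return sorted(cleaned, key=lambda r: r["customer"])


def load(rows, warehouse):
    """Load: append the prepared rows to the warehouse store."""
    warehouse.extend(rows)
    return len(rows)


warehouse = []
loaded = load(transform(extract()), warehouse)
print(loaded, warehouse)
```

Keeping the three functions separate mirrors the staging area: the sources are only read, all changes happen in the transform step, and the warehouse receives rows that are already clean.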
3. DATA STORAGE COMPONENT
Separate repository
Data structured for efficient processing
Updated after specific periods
Read-only
4. INFORMATION DELIVERY COMPONENT
It includes various methods of delivering information, depending on the type of user. For example:
Ad hoc or predefined reports for novice and casual users.
Statistical analysis for business analysts.
It also provides information to data mining applications.
5. METADATA COMPONENT
The metadata component holds the data about the data in the data warehouse.
Metadata in a data warehouse contains the answers to questions about the data in the data warehouse.
It serves as a directory of the contents of the data warehouse.
TYPES OF METADATA
Operational metadata: contains information about the operational data sources, such as field lengths and data types.
Extraction and transformation metadata: extraction frequencies, extraction methods, etc.
End-user metadata
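As a rough picture of metadata acting as a directory, the three types can be held in a small lookup structure. All field names and values are hypothetical. The end-user entry (a business-friendly name) is only a guess at what such metadata might hold, since the slide does not describe it.

```python
# Hypothetical metadata entries; every name and value here is illustrative.
METADATA = {
    "operational": {
        "savings.acct_no": {"data_type": "char", "length": 8},
        "checking.AccountNumber": {"data_type": "char", "length": 6},
    },
    "extraction_transformation": {
        "account_id": {
            "sources": ["savings.acct_no", "checking.AccountNumber"],
            "extraction_frequency": "daily",
            "extraction_method": "full file extract",
            "rule": "left-pad with zeros to 8 characters",
        },
    },
    "end_user": {
        # Assumed content: the slide gives no detail for this type.
        "account_id": {"business_name": "Account number"},
    },
}


def describe(field):
    """Answer 'where does this warehouse field come from?' from metadata."""
    etl = METADATA["extraction_transformation"][field]
    sources = {s: METADATA["operational"][s] for s in etl["sources"]}
    return {
        "business_name": METADATA["end_user"][field]["business_name"],
        "extraction_frequency": etl["extraction_frequency"],
        "rule": etl["rule"],
        "sources": sources,
    }


info = describe("account_id")
print(info["business_name"], info["extraction_frequency"])
for name, attrs in info["sources"].items():
    print(name, attrs)
```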
TYPES & TYPICAL APPLICATIONS OF DWH
APPLICATION AREAS
Industry                  Application
Finance                   Credit card analysis
Insurance                 Claims, fraud analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data service providers    Value-added data
Utilities                 Power usage analysis
TYPICAL APPLICATIONS
The impact on the organization's core business is to streamline operations and maximize profitability.
Fraud detection
Profitability analysis
Direct mail/database marketing
Credit risk prediction
Yield management
Inventory management
TYPICAL APPLICATIONS: FRAUD DETECTION
Fraud is detected by observing data usage patterns.
People have typical purchase patterns.
Fraud shows up as a deviation from those patterns.
Certain cities are notorious for fraud.
Certain items are typically bought with stolen cards.
Similar behavior is seen for stolen phone cards.
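The "deviation from patterns" idea can be illustrated with a very simple check. This is a toy sketch, not a real fraud model: the purchase history is invented, and the rule of flagging anything more than three standard deviations from the mean is an assumption chosen for the example.

```python
from statistics import mean, stdev

# Hypothetical purchase history for one card holder (amounts in any currency).
history = [42.0, 38.5, 51.0, 45.0, 40.0, 47.5, 39.0, 44.0]


def is_deviation(amount, past_amounts, threshold=3.0):
    """Flag a purchase that lies more than `threshold` standard deviations
    from the card holder's usual spending. The threshold is an assumption."""
    mu = mean(past_amounts)
    sigma = stdev(past_amounts)
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > threshold


print(is_deviation(46.0, history))   # within the usual pattern
print(is_deviation(900.0, history))  # far outside the usual pattern
```

The check depends on having the card holder's history available, which is the kind of data a warehouse keeps and an operational system usually does not.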
TYPICAL APPLICATIONS: PROFITABILITY ANALYSIS
Banks know whether they are profitable overall.
They don't know which customers are profitable.
Typically more than 50% are NOT profitable, but which ones?
Balance is not enough; transactional behavior is the key.
Restructure products and pricing strategies.
Life-time profitability models (next 3-5 years).
TYPICAL APPLICATIONS: DIRECT MAIL MARKETING
Targeted marketing.
Offer the high-bandwidth package NOT to all users, only to likely buyers.
Identify them from call detail records of web surfing.
Saves marketing expense, saving pennies.
Know your customers better.
TYPICAL APPLICATIONS: CREDIT RISK PREDICTION
Who should get a loan?
Qualitative decision making, NOT subjective.
Different interest rates for different customers.
Do not subsidize bad customers at the expense of good ones.
TYPICAL APPLICATIONS: YIELD MANAGEMENT
Works for fixed-inventory businesses.
Item prices vary for varying customers.
Examples: airlines, hotels, etc.
The price of (say) an air ticket depends on:
How far in advance was the ticket bought?
How many vacant seats were there?
How profitable is the customer?
Is the ticket one-way or return?
RECENT APPLICATION: AGRICULTURE SYSTEMS
Agricultural and related data has been collected for decades.
Decision making is based on expert judgment.
Lack of integration results in underutilization.
What is required, in which amount, and when?
DATA WAREHOUSE VS. OLTP
OLTP (On Line Transaction Processing)
SELECT tx_date, balance FROM tx_table
WHERE account_ID = 23876;
DWH
SELECT balance, age, sal, gender
FROM customer_table, tx_table
WHERE age BETWEEN 30 AND 40
AND education = 'graduate'
AND customer_table.CustID = tx_table.Customer_ID;
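Both query styles can be tried on a toy database with Python's built-in `sqlite3` module. The table and column names follow the two queries; the column types, the extra `tx_id` column, and all sample rows are assumptions made up so the queries have something to run against.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer_table (
        CustID INTEGER PRIMARY KEY,
        age INTEGER, sal REAL, gender TEXT, education TEXT
    );
    CREATE TABLE tx_table (
        tx_id INTEGER PRIMARY KEY,
        account_ID INTEGER, Customer_ID INTEGER,
        tx_date TEXT, balance REAL
    );
    INSERT INTO customer_table VALUES
        (1, 35, 50000, 'F', 'graduate'),
        (2, 52, 70000, 'M', 'graduate'),
        (3, 31, 42000, 'M', 'graduate');
    INSERT INTO tx_table VALUES
        (1, 23876, 1, '2023-01-05', 1200.0),
        (2, 23876, 1, '2023-01-09', 900.0),
        (3, 11111, 2, '2023-01-07', 5000.0),
        (4, 22222, 3, '2023-01-08', 300.0);
    """
)

# OLTP-style query: one account, few rows, highly selective.
oltp_rows = conn.execute(
    "SELECT tx_date, balance FROM tx_table WHERE account_ID = ? ORDER BY tx_date",
    (23876,),
).fetchall()

# DWH-style query: a join across tables, filtered on non-key attributes.
dwh_rows = conn.execute(
    """
    SELECT balance, age, sal, gender
    FROM customer_table
    JOIN tx_table ON customer_table.CustID = tx_table.Customer_ID
    WHERE age BETWEEN 30 AND 40 AND education = 'graduate'
    """
).fetchall()

print(oltp_rows)
print(len(dwh_rows))
conn.close()
```

On a table this small both queries return a handful of rows. On real data the first touches the few rows of a single account, while the second scans and joins rows for every customer in the age band.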
OLTP                                DWH
Primary key used                    Primary key NOT used
No concept of primary index         Primary index used
Few rows returned                   Many rows returned
May use a single table              Uses multiple tables
High selectivity of query           Low selectivity of query
Indexing on primary key (unique)    Indexing on primary index (non-unique)
COMPARISON OF RESPONSE TIMES
On-line analytical processing (OLAP) queries must be executed in a small number of seconds. This often requires denormalization and/or sampling.
Complex query scripts and large list selections can generally be executed in a small number of minutes.
Sophisticated clustering algorithms (e.g., data mining) can generally be executed in a small number of hours (even for hundreds of thousands of customers).
DATA WAREHOUSE FOR DECISION SUPPORT & OLAP
Putting information technology to work to help the knowledge worker make faster and better decisions:
Which of my customers are most likely to go to the competition?
What product promotions have the biggest impact on revenue?
How did the share price of software companies correlate with profits over the last 10 years?