Ch 1 intro_dw
SUSHIL KULKARNI
DATA WAREHOUSING
A producer wants to know ...
• Which are our lowest/highest margin customers?
• Who are my customers and what products are they buying?
• Which customers are most likely to go to the competition?
• What impact will new products/services have on revenue and margins?
• What product promotions have the biggest impact on revenue?
• What is the most effective distribution channel?
Lots of data everywhere, yet ...
• I can’t find the data I need
– data is scattered over the network
– many versions, subtle differences
• I can’t get the data I need
– need an expert to get the data
• I can’t understand the data I found
– available data poorly documented
• I can’t use the data I found
– results are unexpected
– data needs to be transformed from one form to another
What is a Data Warehouse?
A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a way they can understand and use in a business context.
[Barry Devlin]
What users say ...
• Data should be integrated across the enterprise
• Summary data has a real value to the organization
• Historical data holds the key to understanding data over time
• What-if capabilities are required
What is Data Warehousing?
A process of transforming data into information and making it available to users in a timely enough manner to make a difference
[Forrester Research, April 1996]
Evolution
• 60’s: Batch reports
– hard to find and analyze information
– inflexible and expensive, reprogram every new request
• 70’s: Terminal-based DSS and EIS (executive information systems)
– still inflexible, not integrated with desktop tools
• 80’s: Desktop data access and analysis tools
– query tools, spreadsheets, GUIs
– easier to use, but only access operational databases
• 90’s: Data warehousing with integrated OLAP engines and tools
Warehouses are Very Large Databases
[Figure: percentage of respondents (0% to 35%) by warehouse size, in buckets from 5GB up to 500GB-1TB; two series, Initial and Projected 2Q96. Source: META Group, Inc.]
Very Large Data Bases
• Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
• Petabytes -- 10^15 bytes: Geographic Information Systems
• Exabytes -- 10^18 bytes: National Medical Records
• Zettabytes -- 10^21 bytes: Weather images
• Yottabytes -- 10^24 bytes: Intelligence Agency Videos
Data Warehousing -- It is a process
• Technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions possible that were not previously possible
• A decision support database maintained separately from the organization’s operational database
Data Warehouse
• A data warehouse is a
– subject-oriented
– integrated
– time-varying
– non-volatile
collection of data that is used primarily
in organizational decision making.
-- Bill Inmon, Building the Data Warehouse 1996
Data Warehouse: Subject-oriented
• Customers: Get information on the different prices of a beer
• Farmers: Harvest information from known access paths
• Students: Get information about various universities in the U.K.
• Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
• Focusing on the modelling and analysis of data for decision makers, not on daily operations or transaction processing
• Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process
[Figure: data from the enterprise “database” (Customers, Vendors, Orders, Transactions, etc.) is copied, organized, and summarized into the data warehouse]
Data Mining
Data Miners:
• “Farmers” – they know
• “Explorers” – unpredictable
Data Warehouse: Time-variant
• Used to study trends and changes
• The time horizon for the data warehouse is significantly longer than that of operational systems
– Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
– Operational database: current value data
• Every key structure in the data warehouse
– Contains an element of time explicitly or implicitly, while the key of operational data may or may not contain a “time element”
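A minimal sketch of the explicit time element in a warehouse key. The table and column names (`op_balance`, `dw_balance`, `snapshot_date`) are hypothetical, and SQLite only stands in for a real warehouse engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Operational table: one row per customer, holding only the current value.
conn.execute("CREATE TABLE op_balance (customer_id INTEGER PRIMARY KEY, balance REAL)")
# Warehouse table: the key includes an explicit time element (snapshot_date),
# so each customer can have many rows, one per point in time.
conn.execute(
    "CREATE TABLE dw_balance ("
    " customer_id INTEGER, snapshot_date TEXT, balance REAL,"
    " PRIMARY KEY (customer_id, snapshot_date))"
)
conn.execute("INSERT INTO op_balance VALUES (1, 120.0)")
conn.executemany(
    "INSERT INTO dw_balance VALUES (?, ?, ?)",
    [(1, "2023-12-31", 80.0), (1, "2024-12-31", 120.0)],
)
# The warehouse answers "how did the balance change over time?"
history = conn.execute(
    "SELECT snapshot_date, balance FROM dw_balance"
    " WHERE customer_id = 1 ORDER BY snapshot_date"
).fetchall()
# The operational database answers only "what is the balance now?"
current = conn.execute(
    "SELECT balance FROM op_balance WHERE customer_id = 1"
).fetchone()[0]
```

Here `history` holds one row per snapshot, while `current` is a single value with no time dimension.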
Data Warehouse: Non-volatile
• Cannot be updated by end users
Data Warehouse Architecture
[Figure: sources (relational databases, legacy data, purchased data, ERP systems) pass through extraction and cleansing into an optimized loader and the data warehouse engine, supported by a metadata repository; users analyze and query the warehouse]
Data Mart
• A Data Mart is a smaller, more focused Data Warehouse – a mini-warehouse.
• A Data Mart typically reflects the business rules of a specific business unit within an enterprise.
Data Warehouse to Data Mart
[Figure: the data warehouse feeds three data marts, each of which provides decision support information]
DATA MARTS
• Create many DM’s
• Limited scope
Examples:
1. Financial DM
2. Marketing DM
3. Supply chain DM
Generic Architecture of Data
(synonym) Transaction data
Transaction (Operational) Data
• Operational (production) systems create a massive number of transactions, such as sales, purchases, deposits, withdrawals, returns, refunds, phone calls, toll roads, web site “hits”, etc.
• Transactions are the base level of data: the raw material for understanding customer behavior
• Unfortunately, operational systems change due to changing business needs
• Fortunately, operational systems can usually be changed to support changing business needs
• Data warehousing strategies need to be aware of operational system changes
Operational Summary Data
Summaries are for a specific time period and utilize the transaction data for that time period
Other Examples???
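As an illustration, a summary for a time period can be derived from that period's transactions roughly as follows. This is a sketch only; the sample records and the `monthly_sales` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical transaction records: (date, store, amount).
transactions = [
    ("2024-01-03", "Pune", 120.0),
    ("2024-01-17", "Pune", 80.0),
    ("2024-02-02", "Pune", 50.0),
    ("2024-01-09", "Mumbai", 200.0),
]

def monthly_sales(rows):
    """Total the amounts per (month, store); the month is the YYYY-MM prefix of the date."""
    totals = defaultdict(float)
    for date, store, amount in rows:
        totals[(date[:7], store)] += amount
    return dict(totals)

summary = monthly_sales(transactions)
```

Each entry of `summary` covers one month and one store, and uses only the transactions that fall in that month.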
Decision Support Summary Data
• The data that are used to help make decisions about the business
– Financial Data, such as:
• Income Statements (Profit & Loss)
• Balance Sheets (Assets – Liabilities = Net Worth)
– Sales summaries
– Other examples???
• Data warehouses maintain this type of data; however, financial data “of record” (for audit purposes) usually comes from databases and not the data warehouse (confusing???)
• Generally, it is a bad idea to use the same system for analytic and operational purposes
Data Warehouse for Decision Support
• Putting information technology to work to help the knowledge worker make faster and better decisions
– Which of my customers are most likely to go to the competition?
– What product promotions have the biggest impact on revenue?
– How did the share price of software companies correlate with profits over the last 10 years?
Decision Support
• Used to manage and control business
• Data is historical or point-in-time
• Optimized for inquiry rather than update
• Use of the system is loosely defined and can be ad-hoc
• Used by managers and end-users to understand the business and make judgements
Database Schema
• Database schema defines the structure of data, not the values of the data (e.g., first name, last name = structure; Ron Norman = values of the data)
• In RDBMS:
– Columns = fields = attributes (A,B,C)
– Rows = records = tuples (1-7)
Logical Database Schema
• Describes data in a way that is familiar to business users
Physical Database Schema
• Describes the data the way it will be stored in an RDBMS, which might be different from the way the logical schema shows it
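The split between structure and values can be seen in a small sketch. The `person` table is hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema: column names and types, with no data yet.
conn.execute("CREATE TABLE person (first_name TEXT, last_name TEXT)")
# The values: one row (record, tuple) stored under that schema.
conn.execute("INSERT INTO person VALUES ('Ron', 'Norman')")

# Structure: PRAGMA table_info returns one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(person)")]
# Values: the rows held in the table.
rows = conn.execute("SELECT first_name, last_name FROM person").fetchall()
```

`columns` describes the structure (fields), while `rows` holds the values of the data.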
Metadata
• General definition: Data about data !!!
– Examples:
• A library’s card catalog (metadata) describes publications (data)
• A file system maintains permissions (metadata) about files (data)
• A form of system documentation including:
– Values legally allowed in a field (e.g., AZ, CA, OR, UT, WA, etc.)
– Description of the contents of each field (e.g., start date)
– Date when data were loaded
– Indication of currency of the data (last updated)
– Mappings between systems (e.g., A.this = B.that)
• Invaluable; otherwise you have to research to find it
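A rough sketch of how such documentation might be recorded for one field and used to check a value. The field, dates, and mapping names are hypothetical, and the allowed set is limited to the five codes listed above.

```python
# Hypothetical metadata for one field of a hypothetical customer table.
state_metadata = {
    "description": "Two-letter state code of the customer's address",
    "allowed_values": {"AZ", "CA", "OR", "UT", "WA"},
    "loaded_on": "2024-01-31",          # date when the data were loaded
    "last_updated": "2024-01-15",       # currency of the data
    "mapping": {"A.state": "B.st_cd"},  # mapping between systems A and B
}

def is_valid(value, metadata):
    """Check a data value against the legal values recorded in the metadata."""
    return value in metadata["allowed_values"]

checks = {code: is_valid(code, state_metadata) for code in ("CA", "TX")}
```

The metadata describes the data; the check itself never needs to look at any customer row.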
Business Rules
• Highest level of abstraction from operational (transaction) data
• Describes why relationships exist and how they are applied
• Examples:
– Need to have 3 forms of ID for credit
– Only allow a maximum daily withdrawal of $200
– After the 3rd log-in attempt, lock the log-in screen
– Accept no bills larger than $20
– Others???
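Three of the example rules can be written as simple checks. This is a sketch under stated assumptions: the function names are hypothetical, and "3rd log-in attempt" is read here as the third failed attempt.

```python
MAX_DAILY_WITHDRAWAL = 200   # dollars, from the withdrawal rule
MAX_LOGIN_ATTEMPTS = 3       # assumption: counts failed attempts
LARGEST_BILL_ACCEPTED = 20   # dollars

def can_withdraw(withdrawn_today, amount):
    """Allow a withdrawal only if the daily total stays within the limit."""
    return withdrawn_today + amount <= MAX_DAILY_WITHDRAWAL

def is_locked(failed_attempts):
    """Lock the log-in screen once the third failed attempt has happened."""
    return failed_attempts >= MAX_LOGIN_ATTEMPTS

def accepts_bill(denomination):
    """Reject any bill larger than the accepted maximum."""
    return denomination <= LARGEST_BILL_ACCEPTED
```

Keeping the limits as named constants separates the rule (why) from the transaction code that applies it (how).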
General Architecture for Data Warehousing
• Source systems
• Extraction, (Clean), Transformation, & Load (ETL)
• Central repository
• Metadata repository
• Data marts
• Operational feedback
• End users (business)
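The ETL step in this architecture can be sketched as a chain of small functions. All names and sample records below are hypothetical, and a plain list stands in for the central repository.

```python
def extract(source_rows):
    """Extract: pull raw records from a (here in-memory) source system."""
    return list(source_rows)

def clean(rows):
    """Clean: drop records with a missing amount."""
    return [r for r in rows if r.get("amount") is not None]

def transform(rows):
    """Transform: standardize city names and convert amounts to float."""
    return [{"city": r["city"].strip().title(), "amount": float(r["amount"])} for r in rows]

def load(rows, repository):
    """Load: append the prepared rows to the central repository."""
    repository.extend(rows)
    return repository

source = [
    {"city": " pune ", "amount": "120.50"},
    {"city": "MUMBAI", "amount": None},
    {"city": "mumbai", "amount": "80"},
]
warehouse = load(transform(clean(extract(source))), [])
```

Only the two complete records reach `warehouse`, with consistent city names and numeric amounts.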
DATA WAREHOUSE SCOPE
Broad: Required for companies; very costly; may be divided according to departments.
Narrow: Required for personal information.
Design of a Data Warehouse: A Business Analysis Framework
• Four views regarding the design of a data warehouse
– Top-down view
• allows selection of the relevant information necessary for the data warehouse
– Data source view
• exposes the information being captured, stored, and managed by operational systems
– Data warehouse view
• consists of fact tables and dimension tables
– Business query view
• sees the perspectives of data in the warehouse from the view of the end-user
Data Warehouse Design Process
• Top-down, bottom-up approaches or a combination of both
– Top-down: Starts with overall design and planning
– Bottom-up: Starts with experiments and prototypes (rapid)
• From software engineering point of view
– Waterfall: structured and systematic analysis at each step before proceeding to the next
– Spiral: rapid generation of increasingly functional systems, short turnaround time
• Typical data warehouse design process
– Choose a business process to model, e.g., orders, invoices, etc.
– Choose the grain (atomic level of data) of the business process
– Choose the dimensions that will apply to each fact table record
– Choose the measure that will populate each fact table record
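The four choices in the typical design process can be sketched as a tiny star schema. The table names, columns, and sample rows are hypothetical; SQLite stands in for the warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions that apply to each fact table record.
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
-- Business process: orders. Grain: one row per product per order.
-- Measures: quantity and amount.
CREATE TABLE fact_orders (
    order_id INTEGER, date_id INTEGER, product_id INTEGER,
    quantity INTEGER, amount REAL
);
INSERT INTO dim_date VALUES (1, '2024-01'), (2, '2024-02');
INSERT INTO dim_product VALUES (10, 'Tea'), (11, 'Coffee');
INSERT INTO fact_orders VALUES
    (100, 1, 10, 2, 8.0), (100, 1, 11, 1, 6.0), (101, 2, 10, 5, 20.0);
""")
# Aggregate the amount measure along the product dimension.
revenue_by_product = conn.execute(
    "SELECT p.name, SUM(f.amount) FROM fact_orders f"
    " JOIN dim_product p ON p.product_id = f.product_id"
    " GROUP BY p.name ORDER BY p.name"
).fetchall()
```

The grain decides what one fact row means; the dimensions decide how the measures can be grouped.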
Multi-Tiered Architecture
[Figure: data sources (operational DBs, other sources) are extracted, transformed, loaded, and refreshed through a monitor & integrator into the data storage tier (data warehouse, data marts, metadata); the OLAP server/engine serves the front-end tools for analysis, query, reports, and data mining]
Three Data Warehouse Models
• Enterprise warehouse
– collects all of the information about subjects spanning the entire organization
• Data Mart
– a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
• Independent vs. dependent (directly from warehouse) data mart
• Virtual warehouse
– A set of views over operational databases
– Only some of the possible summary views may be materialized
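A virtual warehouse can be pictured with a view over an operational table. This is a sketch with hypothetical names; since SQLite has no materialized views, a table filled from the view stands in for a materialized summary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE op_sales (sale_id INTEGER PRIMARY KEY, region TEXT, amount REAL);
INSERT INTO op_sales (region, amount) VALUES ('North', 10.0), ('North', 15.0), ('South', 7.5);
-- Virtual warehouse: a view computed from the operational table on demand.
CREATE VIEW v_sales_by_region AS
    SELECT region, SUM(amount) AS total FROM op_sales GROUP BY region;
-- Stand-in for a materialized summary: a snapshot table filled from the view.
CREATE TABLE m_sales_by_region AS SELECT * FROM v_sales_by_region;
-- A new operational transaction arrives after the snapshot was taken.
INSERT INTO op_sales (region, amount) VALUES ('South', 2.5);
""")
virtual = dict(conn.execute("SELECT region, total FROM v_sales_by_region"))
materialized = dict(conn.execute("SELECT region, total FROM m_sales_by_region"))
```

The view reflects the new sale immediately but is recomputed on every query; the snapshot is cheap to read but stays stale until refreshed.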
Data Mining works with Warehouse Data
• Data Warehousing provides the Enterprise with a memory
• Data Mining provides the Enterprise with intelligence
We want to know ...
• Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
• Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?
• If I raise the price of my product by Rs. 2, what is the effect on my ROI?
• If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?
• If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?
• Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
Application Areas
Industry | Application
Finance | Credit Card Analysis
Insurance | Claims, Fraud Analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data Service providers | Value added data
Utilities | Power usage analysis
Data Mining in Use
• Data Mining can be used to track fraud
• A Supermarket becomes an information broker
• Basketball teams use it to track game strategy
• Cross Selling
• Warranty Claims Routing
• Holding on to Good Customers
• Weeding out Bad Customers
Two Systems
• Operational System
• Information System
Operational Systems
• Run the business in real time
• Based on up-to-the-second data
• Optimized to handle large numbers of simple read/write transactions
• Optimized for fast response to predefined transactions
• Used by people who deal with customers and products: clerks, salespeople, etc.
• They are increasingly used by customers
On Line Transaction Processing (OLTP)
It refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing.
OLTP technology is used in a number of industries, including banking, airlines, mail order, supermarkets, and manufacturing. Applications include electronic banking, order processing, employee time clock systems, e-commerce, and eTrading. The most widely used OLTP system is probably IBM's CICS.
What are Operational Systems?
• They are OLTP systems
• Run mission critical applications
• Need to work with stringent performance requirements for routine tasks
• Used to run a business!
RDBMS used for OLTP
• Database Systems have been used traditionally for OLTP
– clerical data processing tasks
– detailed, up to date data
– structured repetitive tasks
– read/update a few records
– isolation, recovery and integrity are critical
Examples of Operational Data
Data | Industry | Usage | Technology | Volumes
Customer File | All | Track customer details | Legacy application, flat files, mainframes | Small-medium
Account Balance | Finance | Control account activities | Legacy applications, hierarchical databases, mainframe | Large
Point-of-Sale data | Retail | Generate bills, manage stock | ERP, Client/Server, relational databases | Very Large
Call Record | Telecommunications | Billing | Legacy application, hierarchical database, mainframe | Very Large
Production Record | Manufacturing | Control production | ERP, relational databases, AS/400 | Medium
So, what’s different?
Application-Orientation vs. Subject-Orientation
• Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
• Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity
OLTP vs. Data Warehouse
• OLTP systems are tuned for known transactions and workloads, while the workload is not known a priori in a data warehouse
• Special data organization, access methods and implementation methods are needed to support data warehouse queries (typically multidimensional queries)
– e.g., average amount spent on phone calls between 9AM-5PM in Pune during the month of December
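The example query filters on three dimensions at once: location, month, and time of day. A plain-Python sketch follows; the call records and the `average_amount` helper are hypothetical, and a real warehouse would use specialized storage and indexes instead of a list scan.

```python
from datetime import datetime

# Hypothetical call records: (start time, city, amount spent).
calls = [
    (datetime(2023, 12, 4, 10, 30), "Pune", 12.0),
    (datetime(2023, 12, 9, 16, 59), "Pune", 8.0),
    (datetime(2023, 12, 9, 18, 15), "Pune", 30.0),    # outside 9AM-5PM
    (datetime(2023, 11, 20, 11, 0), "Pune", 50.0),    # not December
    (datetime(2023, 12, 11, 12, 0), "Mumbai", 40.0),  # not Pune
]

def average_amount(records, city, month, start_hour, end_hour):
    """Average amount over calls matching the city, month, and hour range."""
    amounts = [
        amount
        for started, where, amount in records
        if where == city and started.month == month
        and start_hour <= started.hour < end_hour
    ]
    return sum(amounts) / len(amounts) if amounts else None

pune_december_daytime = average_amount(calls, "Pune", 12, 9, 17)
```

Only the first two records satisfy all three conditions, so the average is taken over those two amounts.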
OLTP vs. Data Warehouse
• OLTP
– Application Oriented
– Used to run business
– Detailed data
– Current, up to date
– Isolated Data
– Repetitive access
– Clerical User
• Warehouse (DSS)
– Subject Oriented
– Used to analyze business
– Summarized and refined
– Snapshot data
– Integrated Data
– Ad-hoc access
– Knowledge User (Manager)
OLTP vs. Data Warehouse
• OLTP
– Performance Sensitive
– Few Records accessed at a time (tens)
– Read/Update Access
– No data redundancy
– Database Size 100 MB - 100 GB
• Data Warehouse
– Performance relaxed
– Large volumes accessed at a time (millions)
– Mostly Read (Batch Update)
– Redundancy present
– Database Size 100 GB - few terabytes
OLTP vs. Data Warehouse
• OLTP
– Transaction throughput is the performance metric
– Thousands of users
– Managed in entirety
• Data Warehouse
– Query throughput is the performance metric
– Hundreds of users
– Managed by subsets
To summarize ...
• OLTP Systems are used to “run” a business
• The Data Warehouse helps to “optimize” the business
Why Separate Data Warehouse?
• Performance
– Op dbs designed & tuned for known txs & workloads.
– Complex OLAP queries would degrade perf. for op txs.
– Special data organization, access & implementation methods needed for multidimensional views & queries.
• Function
– Missing data: Decision support requires historical data, which op dbs do not typically maintain.
– Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources.
– Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.
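The data quality point can be illustrated with a small reconciliation step. This is a sketch only; the two sources, their codes, and the `reconcile` helper are hypothetical.

```python
# Two hypothetical sources that encode the same attribute differently.
source_a = [{"customer": "C1", "gender": "M"}, {"customer": "C2", "gender": "F"}]
source_b = [{"customer": "C3", "gender": "1"}, {"customer": "C4", "gender": "0"}]

# Reconciliation tables mapping each source's codes to one warehouse code.
CODE_MAPS = {
    "a": {"M": "male", "F": "female"},
    "b": {"1": "male", "0": "female"},
}

def reconcile(rows, source):
    """Rewrite source-specific codes into the warehouse representation."""
    mapping = CODE_MAPS[source]
    return [{"customer": r["customer"], "gender": mapping[r["gender"]]} for r in rows]

# Consolidation: merge the reconciled sources, then summarize.
consolidated = reconcile(source_a, "a") + reconcile(source_b, "b")
counts = {}
for row in consolidated:
    counts[row["gender"]] = counts.get(row["gender"], 0) + 1
```

Without the reconciliation step, a count over the merged data would treat "M" and "1" as different values.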
INFORMATION SYSTEMS
• Designed to support decision-making based on
1. Historical data
2. Prediction data
• Designed for complex queries or data-mining applications.
Examples:
1. Sales trend analysis
2. Customer segmentation
3. Human resources planning
DIFFERENCE
Characteristics | Operational Systems | Informational Systems
Purpose | Real time data entry | Read and analyze historical data
Primary users | Clerks, salespersons, administrators | Managers, business analysts, customers
Scope of usage | Narrow, planned, and simple updates and queries | Broad, ad hoc, complex queries and analysis
Design goal | Performance throughput, availability | Ease of flexible access and use
Volume | Many, constant updates and queries on one or a few table rows | Periodical batch updates and queries requiring many or all rows
T H A N K S !