Krithi Talk Impact
-
Upload
idhayasakthi -
Category
Documents
-
view
221 -
download
0
Transcript of Krithi Talk Impact
-
7/31/2019 Krithi Talk Impact
1/169
DATA WAREHOUSINGAND
DATA MINING
S. SudarshanKrithi Ramamritham
IIT Bombay
[email protected]@cse.iitb.ernet.in
-
7/31/2019 Krithi Talk Impact
2/169
2
Course OverviewThe course:what and how
0. IntroductionI. Data WarehousingII. Decision Support
and OLAPIII. Data MiningIV. Looking Ahead
Demos and Labs
-
7/31/2019 Krithi Talk Impact
3/169
3
0. Introduction
Data Warehousing,OLAP and data mining:what and why (now)?Relation to OLTPA case study
demos, labs
-
7/31/2019 Krithi Talk Impact
4/169
4
Which are ourlowest/highest margin
customers ? Who are my customers
and what productsare they buying?
Which customersare most likely to goto the competition ?
What impact willnew products/services
have on revenueand margins?
What product prom--otions have the biggest
impact on revenue?
What is the mosteffective distribution
channel?
A producer wants to know.
-
7/31/2019 Krithi Talk Impact
5/169
5
Data, Data everywhere yet ... I cant find the data I need
data is scattered over thenetworkmany versions, subtledifferences
I cant get the data I need need an expert to get the data
I cant understand the data Ifound
available data poorly documented
I cant use the data I found results are unexpected
data needs to be transformedfrom one form to other
-
7/31/2019 Krithi Talk Impact
6/169
-
7/31/2019 Krithi Talk Impact
7/169
7
What are the users saying...
Data should be integratedacross the enterpriseSummary data has a realvalue to the organizationHistorical data holds thekey to understanding data
over timeWhat-if capabilities arerequired
-
7/31/2019 Krithi Talk Impact
8/169
8
What is Data Warehousing?
A process of transforming data intoinformation andmaking it available tousers in a timelyenough manner to
make a difference
[Forrester Research, April1996]Data
Information
-
7/31/2019 Krithi Talk Impact
9/169
9
Evolution
60s: Batch reports hard to find and analyze informationinflexible and expensive, reprogram every newrequest
70s: Terminal -based DSS and EIS (executiveinformation systems)
still inflexible, not integrated with desktop tools
80s: Desktop data access and analysis tools query tools, spreadsheets, GUIseasier to use, but only access operational databases
90s: Data warehousing with integrated OLAP
engines and tools
-
7/31/2019 Krithi Talk Impact
10/169
10
Warehouses are Very LargeDatabases
35%
30%
25%
20%
15%
10%
5%
0%5GB
5-9GB
10-19GB 50-99GB 250-499GB
20-49GB 100-249GB 500GB-1TB
InitialProjected 2Q96
Source: META Group, Inc.
R e s p o n
d e n t s
-
7/31/2019 Krithi Talk Impact
11/169
11
Very Large Data Bases
Terabytes -- 10^12 bytes:
Petabytes -- 10^15 bytes:
Exabytes -- 10^18 bytes:
Zettabytes -- 10^21
bytes:
Zottabytes -- 10^24bytes:
Walmart -- 24 Terabytes
Geographic Information
SystemsNational Medical Records
Weather images
Intelligence AgencyVideos
-
7/31/2019 Krithi Talk Impact
12/169
12
Data Warehousing --It is a process
Technique for assembling andmanaging data from varioussources for the purpose of
answering businessquestions. Thus makingdecisions that were notprevious possibleA decision support databasemaintained separately fromthe organizations operational
database
-
7/31/2019 Krithi Talk Impact
13/169
13
Data Warehouse
A data warehouse is asubject-oriented
integrated
time-varying
non-volatile
collection of data that is used primarily inorganizational decision making.
-- Bill Inmon, Building the Data Warehouse 1996
-
7/31/2019 Krithi Talk Impact
14/169
14
Explorers, Farmers and Tourists
Explorers: Seek out the unknown andpreviously unsuspected rewards hiding inthe detailed data
Farmers: Harvest informationfrom known access paths
Tourists: Browse informationharvested by farmers
-
7/31/2019 Krithi Talk Impact
15/169
15
Data Warehouse Architecture
Data WarehouseEngine
Optimized Loader
ExtractionCleansing
AnalyzeQuery
Metadata Repository
RelationalDatabases
LegacyData
Purchased
Data
ERPSystems
-
7/31/2019 Krithi Talk Impact
16/169
16
Data Warehouse for DecisionSupport & OLAP
Putting Information technology to help theknowledge worker make faster and betterdecisions
Which of my customers are most likely to goto the competition?What product promotions have the biggest
impact on revenue?How did the share price of softwarecompanies correlate with profits over last 10years?
-
7/31/2019 Krithi Talk Impact
17/169
17
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than updateUse of the system is loosely defined andcan be ad-hoc
Used by managers and end-users tounderstand the business and make
judgements
-
7/31/2019 Krithi Talk Impact
18/169
18
Data Mining works with WarehouseData
Data Warehousingprovides the Enterprisewith a memory
Data Mining providesthe Enterprise withintelligence
-
7/31/2019 Krithi Talk Impact
19/169
19
We want to know ...Given a database of 100,000 names, which persons are theleast likely to default on their credit cards?
Which types of transactions are likely to be fraudulentgiven the demographics and transactional history of aparticular customer?
If I raise the price of my product by Rs. 2, what is theeffect on my ROI?
If I offer only 2,500 airline miles as an incentive topurchase rather than 5,000, how many lost responses willresult?
If I emphasize ease-of-use of the product as opposed to itstechnical capabilities, what will be the net effect on myrevenues?
Which of my customers are likely to be the most loyal?
Data Mining helps extract such information
-
7/31/2019 Krithi Talk Impact
20/169
-
7/31/2019 Krithi Talk Impact
21/169
21
Data Mining in Use
The US Government uses Data Mining totrack fraudA Supermarket becomes an informationbrokerBasketball teams use it to track gamestrategy
Cross SellingWarranty Claims RoutingHolding on to Good Customers
Weeding out Bad Customers
-
7/31/2019 Krithi Talk Impact
22/169
22
What makes data mining possible?
Advances in the following areas aremaking data mining deployable:
data warehousingbetter and more data (i.e., operational,
behavioral, and demographic)the emergence of easily deployed data
mining tools andthe advent of new data mining
techniques. -- Gartner Group
-
7/31/2019 Krithi Talk Impact
23/169
23
Why Separate Data Warehouse?
PerformanceOp dbs designed & tuned for known txs & workloads.Complex OLAP queries would degrade perf. for op txs.Special data organization, access & implementationmethods needed for multidimensional views & queries.
FunctionMissing data: Decision support requires historical data, which
op dbs do not typically maintain.Data consolidation: Decision support requires consolidation(aggregation, summarization) of data from manyheterogeneous sources: op dbs, external sources.Data quality: Different sources typically use inconsistent data
representations, codes, and formats which have to bereconciled.
-
7/31/2019 Krithi Talk Impact
24/169
24
What are Operational Systems?
They are OLTP systemsRun mission criticalapplicationsNeed to work withstringent performancerequirements for
routine tasksUsed to run abusiness!
-
7/31/2019 Krithi Talk Impact
25/169
25
RDBMS used for OLTP
Database Systems have been usedtraditionally for OLTP
clerical data processing tasksdetailed, up to date datastructured repetitive tasksread/update a few recordsisolation, recovery and integrity are
critical
-
7/31/2019 Krithi Talk Impact
26/169
26
Operational Systems
Run the business in real timeBased on up-to-the-second dataOptimized to handle largenumbers of simple read/writetransactionsOptimized for fast response topredefined transactionsUsed by people who deal with
customers, products -- clerks,salespeople etc.They are increasingly used bycustomers
-
7/31/2019 Krithi Talk Impact
27/169
27
Examples of Operational DataData Industry Usage Technology Volumes
Customer File
All Track Customer Details
Legacy application, flat files, main frames
Small-medium
Account Balance Finance Control account activities
Legacy applications, hierarchical databases, mainframe
Large
Point-of- Sale data
Retail Generate bills, manage stock
ERP, Client/Server, relational databases
Very Large
Call Record Telecomm- unications Billing Legacy application, hierarchical database, mainframe
Very Large
Production Record
Manufact- uring
Control Production
ERP, relational databases,
AS/400
Medium
-
7/31/2019 Krithi Talk Impact
28/169
So, whats different?
-
7/31/2019 Krithi Talk Impact
29/169
29
Application-Orientation vs.Subject-Orientation
Application-Orientation
Operational
Database
LoansCreditCard
Trust
Savings
Subject-Orientation
Data
Warehouse
Customer
VendorProduct
Activity
-
7/31/2019 Krithi Talk Impact
30/169
30
OLTP vs. Data Warehouse
OLTP systems are tuned for knowntransactions and workloads whileworkload is not known a priori in a data
warehouseSpecial data organization, access methodsand implementation methods are neededto support data warehouse queries(typically multidimensional queries)
e.g ., average amount spent on phone callsbetween 9AM-5PM in Pune during the monthof December
-
7/31/2019 Krithi Talk Impact
31/169
31
OLTP vs Data Warehouse
OLTPApplicationOriented
Used to runbusinessDetailed dataCurrent up to date
Isolated DataRepetitive accessClerical User
Warehouse (DSS)Subject OrientedUsed to analyze
businessSummarized andrefinedSnapshot data
Integrated DataAd-hoc accessKnowledge User(Manager)
-
7/31/2019 Krithi Talk Impact
32/169
32
OLTP vs Data Warehouse
OLTPPerformance SensitiveFew Records accessed at
a time (tens)
Read/Update Access
No data redundancy
Database Size 100MB-100 GB
Data WarehousePerformance relaxedLarge volumes accessed
at a time(millions)Mostly Read (BatchUpdate)Redundancy presentDatabase Size100 GB - few terabytes
-
7/31/2019 Krithi Talk Impact
33/169
33
OLTP vs Data Warehouse
OLTPTransactionthroughput is the
performance metricThousands of usersManaged inentirety
Data WarehouseQuery throughputis the performance
metricHundreds of usersManaged bysubsets
-
7/31/2019 Krithi Talk Impact
34/169
34
To summarize ...
OLTP Systems areused to run abusiness
The DataWarehouse helpsto optimize thebusiness
-
7/31/2019 Krithi Talk Impact
35/169
35
Why Now?
Data is being producedERP provides clean data
The computing power is availableThe computing power is affordableThe competitive pressures are
strongCommercial products are available
h d
-
7/31/2019 Krithi Talk Impact
36/169
36
Myths surrounding OLAP Serversand Data Marts
Data marts and OLAP servers are departmentalsolutions supporting a handful of usersMillion dollar massively parallel hardware is
needed to deliver fast time for complex queriesOLAP servers require massive and unwieldyindicesComplex OLAP queries clog the network withdataData warehouses must be at least 100 GB to beeffective
Source -- Arbor Software Home Page
-
7/31/2019 Krithi Talk Impact
37/169
37
Wal*Mart Case Study
Founded by Sam WaltonOne the largest Super Market Chains
in the US
Wal*Mart: 2000+ Retail Stores
SAM's Clubs 100+WholesalersStores
This case study is from Felipe Carinos (NCR
Teradata) presentation made at Stanford DatabaseSeminar
-
7/31/2019 Krithi Talk Impact
38/169
38
Old Retail Paradigm
Wal*MartInventoryManagement
Merchandise AccountsPayablePurchasingSupplier Promotions:
National, Region,Store Level
SuppliersAccept OrdersPromote Products
Provide specialIncentivesMonitor and TrackThe Incentives
Bill and CollectReceivablesEstimate RetailerDemands
N (J I Ti ) R il
-
7/31/2019 Krithi Talk Impact
39/169
39
New (Just-In-Time) RetailParadigm
No more dealsShelf-Pass Through (POS Application)
One Unit PriceSuppliers paid once a week on ACTUAL items sold
Wal*Mart ManagerDaily Inventory RestockSuppliers (sometimes SameDay) ship to Wal*Mart
Warehouse-Pass ThroughStock some Large Items
Delivery may come from supplierDistribution Center
Suppliers merchandise unloaded directly onto Wal*MartTrucks
-
7/31/2019 Krithi Talk Impact
40/169
40
Wal*Mart System
NCR 5100M 96Nodes;Number of Rows:Historical Data:New Daily Volume:
Number of Users:Number of Queries:
24 TB Raw Disk; 700 -1000 Pentium CPUs
> 5 Billions65 weeks (5 Quarters)Current Apps: 75 MillionNew Apps: 100 Million +Thousands60,000 per week
-
7/31/2019 Krithi Talk Impact
41/169
41
Course Overview
0. IntroductionI. Data Warehousing
II. Decision Supportand OLAPIII. Data Mining
IV. Looking Ahead
Demos and Labs
I D W h
-
7/31/2019 Krithi Talk Impact
42/169
42
I. Data Warehouses:Architecture, Design & Construction
DW ArchitectureLoading, refreshingStructuring/ModelingDWs and Data MartsQuery Processing
demos, labs
-
7/31/2019 Krithi Talk Impact
43/169
43
Data Warehouse Architecture
Data WarehouseEngine
Optimized Loader
ExtractionCleansing
AnalyzeQuery
Metadata Repository
RelationalDatabases
LegacyData
Purchased
Data
ERPSystems
-
7/31/2019 Krithi Talk Impact
44/169
44
Components of the Warehouse
Data Extraction and LoadingThe Warehouse
Analyze and Query -- OLAP ToolsMetadata
Data Mining tools
-
7/31/2019 Krithi Talk Impact
45/169
Loading the Warehouse
Cleaning the data
before it is loaded
-
7/31/2019 Krithi Talk Impact
46/169
46
Source Data
Typically host based, legacy
applicationsCustomized applications,COBOL, 3GL, 4GL
Point of Contact DevicesPOS, ATM, Call switches
External SourcesNielsens, Acxiom, CMIE,
Vendors, Partners
Sequential Legacy Relational ExternalOperational/ Source Data
-
7/31/2019 Krithi Talk Impact
47/169
47
Data Quality - The Reality
Tempting to think creating a datawarehouse is simply extractingoperational data and entering into adata warehouse
Nothing could be farther from thetruthWarehouse data comes fromdisparate questionable sources
-
7/31/2019 Krithi Talk Impact
48/169
48
Data Quality - The Reality
Legacy systems no longer documentedOutside sources with questionable qualityproceduresProduction systems with no built inintegrity checks and no integration
Operational systems are usually designed to
solve a specific business problem and arerarely developed to a a corporate plan
And get it done quickly, we do not have time toworry about corporate standards...
-
7/31/2019 Krithi Talk Impact
49/169
49
Data Integration Across Sources
Trust Credit cardSavings Loans
Same datadifferent name
Different dataSame name
Data found herenowhere else
Different keyssame data
-
7/31/2019 Krithi Talk Impact
50/169
50
Data Transformation Example
appl A - balanceappl B - balappl C - currbalappl D - balcurr
appl A - pipeline - cmappl B - pipeline - inappl C - pipeline - feet
appl D - pipeline - yds
appl A - m,f appl B - 1,0appl C - x,yappl D - male, female
Data Warehouse
-
7/31/2019 Krithi Talk Impact
51/169
51
Data Integrity Problems
Same person, different spellingsAgarwal, Agrawal, Aggarwal etc...
Multiple ways to denote company namePersistent Systems, PSPL, Persistent Pvt.LTD.
Use of different namesmumbai, bombay
Different account numbers generated bydifferent applications for the same customerRequired fields left blankInvalid product codes collected at point of sale
manual entry leads to mistakes
in case of a problem use 9999999
-
7/31/2019 Krithi Talk Impact
52/169
52
Data Transformation Terms
ExtractingConditioning
ScrubbingMergingHouseholding
EnrichmentScoring
LoadingValidatingDelta Updating
-
7/31/2019 Krithi Talk Impact
53/169
53
Data Transformation Terms
ExtractingCapture of data from operational source in
as is status
Sources for data generally in legacymainframes in VSAM, IMS, IDMS, DB2; moredata today in relational databases on Unix
ConditioningThe conversion of data types from the sourceto the target data store (warehouse) --always a relational database
-
7/31/2019 Krithi Talk Impact
54/169
54
Data Transformation Terms
HouseholdingIdentifying all members of a household
(living at the same address)Ensures only one mail is sent to a
householdCan result in substantial savings: 1
lakh catalogues at Rs. 50 each costs Rs.50 lakhs. A 2% savings would save Rs.1 lakh.
-
7/31/2019 Krithi Talk Impact
55/169
55
Data Transformation Terms
EnrichmentBring data from external sources to
augment/enrich operational data. Data
sources include Dunn and Bradstreet, A.C. Nielsen, CMIE, IMRA etc...Scoring
computation of a probability of anevent. e.g..., chance that a customerwill defect to AT&T from MCI, chancethat a customer is likely to buy a newproduct
-
7/31/2019 Krithi Talk Impact
56/169
56
Loads
After extracting, scrubbing, cleaning,validating etc. need to load the datainto the warehouse
Issueshuge volumes of data to be loadedsmall time window available when warehouse can betaken off line (usually nights)when to build index and summary tablesallow system administrators to monitor, cancel, resume,change load ratesRecover gracefully -- restart after failure from whereyou were and without loss of data integrity
-
7/31/2019 Krithi Talk Impact
57/169
57
Load Techniques
Use SQL to append or insert newdata
record at a time interfacewill lead to random disk I/Os
Use batch load utility
-
7/31/2019 Krithi Talk Impact
58/169
-
7/31/2019 Krithi Talk Impact
59/169
59
Refresh
Propagate updates on source data tothe warehouseIssues:
when to refreshhow to refresh -- refresh techniques
-
7/31/2019 Krithi Talk Impact
60/169
60
When to Refresh?
periodically (e.g., every night, everyweek) or after significant eventson every update: not warranted unlesswarehouse data require current data (upto the minute stock quotes)refresh policy set by administrator based
on user needs and trafficpossibly different policies for differentsources
-
7/31/2019 Krithi Talk Impact
61/169
61
Refresh Techniques
Full Extract from base tablesread entire source table: too expensivemaybe the only choice for legacy
systems
-
7/31/2019 Krithi Talk Impact
62/169
62
How To Detect Changes
Create a snapshot log table to recordids of updated rows of source dataand timestampDetect changes by:
Defining after row triggers to updatesnapshot log when source table
changesUsing regular transaction log to detect
changes to source data
-
7/31/2019 Krithi Talk Impact
63/169
63
Data Extraction and Cleansing
Extract data from existingoperational and legacy dataIssues:
Sources of data for the warehouseData quality at the sourcesMerging different data sourcesData Transformation
How to propagate updates (on the sources) tothe warehouseTerabytes of data to be loaded
-
7/31/2019 Krithi Talk Impact
64/169
64
Scrubbing Data
Sophisticatedtransformation tools.Used for cleaning the
quality of dataClean data is vital for thesuccess of thewarehouse
ExampleSeshadri, Sheshadri,Sesadri, Seshadri S.,Srinivasan Seshadri, etc.are the same person
-
7/31/2019 Krithi Talk Impact
65/169
65
Scrubbing Tools
Apertus -- Enterprise/IntegratorVality -- IPE
Postal Soft
-
7/31/2019 Krithi Talk Impact
66/169
Structuring/Modeling Issues
Data Heart of the Data
-
7/31/2019 Krithi Talk Impact
67/169
67
Data -- Heart of the DataWarehouse
Heart of the data warehouse is thedata itself!Single version of the truthCorporate memoryData is organized in a way thatrepresents business -- subjectorientation
-
7/31/2019 Krithi Talk Impact
68/169
68
Data Warehouse Structure
Subject Orientation -- customer,product, policy, account etc... Asubject may be implemented as aset of related tables. E.g.,customer may be five tables
-
7/31/2019 Krithi Talk Impact
69/169
69
Data Warehouse Structure
base customer (1985-87)custid, from date, to date, name, phone, dob
base customer (1988-90)custid, from date, to date, name, credit rating,employer
customer activity (1986-89) -- monthlysummarycustomer activity detail (1987-89)
custid, activity date, amount, clerk id, order no
customer activity detail (1990-91)custid, activity date, amount, line item no, order no
Time is part of key of each table
-
7/31/2019 Krithi Talk Impact
70/169
70
Data Granularity in Warehouse
Summarized data storedreduce storage costsreduce cpu usageincreases performance since smaller
number of records to be processeddesign around traditional high level
reporting needstradeoff with volume of data to be
stored and detailed usage of data
-
7/31/2019 Krithi Talk Impact
71/169
71
Granularity in Warehouse
Can not answer some questions withsummarized data
Did Anand call Seshadri last month?Not possible to answer if total durationof calls by Anand over a month is onlymaintained and individual call detailsare not.
Detailed data too voluminous
-
7/31/2019 Krithi Talk Impact
72/169
72
Granularity in Warehouse
Tradeoff is to have dual level of granularity
Store summary data on disks95% of DSS processing done against this
data
Store detail on tapes
5% of DSS processing against this data
-
7/31/2019 Krithi Talk Impact
73/169
73
Vertical Partitioning
Frequentlyaccessed Rarelyaccessed
Smaller tableand so less I/O
Acct.No Name Balance Date Opened
InterestRate Address
Acct.No Balance
Acct.No Name Date Opened
InterestRate Address
-
7/31/2019 Krithi Talk Impact
74/169
74
Derived Data
Introduction of derived (calculateddata) may often helpHave seen this in the context of duallevels of granularityCan keep auxiliary views andindexes to speed up queryprocessing
h
-
7/31/2019 Krithi Talk Impact
75/169
75
Schema Design
Database organizationmust look like businessmust be recognizable by business user
approachable by business userMust be simple
Schema Types
Star SchemaFact Constellation SchemaSnowflake schema
bl
-
7/31/2019 Krithi Talk Impact
76/169
76
Dimension Tables
Dimension tablesDefine business in terms already
familiar to users
Wide rows with lots of descriptive textSmall tables (about a million rows)Joined to fact table by a foreign keyheavily indexedtypical dimensions
time periods, geographic region (markets,cities), products, customers, salesperson,etc.
F T bl
-
7/31/2019 Krithi Talk Impact
77/169
77
Fact Table
Central tablemostly raw numeric itemsnarrow rows, a few columns at mostlarge number of rows (millions to a
billion)Access via dimensions
S S h
-
7/31/2019 Krithi Talk Impact
78/169
78
Star Schema
A single fact table and for eachdimension one dimension tableDoes not capture hierarchies directly
T i
m e
p r o d
c u s t
c i t y
f
a c t
date, custno, prodno, cityname, ...
-
7/31/2019 Krithi Talk Impact
79/169
F C ll i
-
7/31/2019 Krithi Talk Impact
80/169
80
Fact Constellation
Fact ConstellationMultiple fact tables that share many
dimension tables
Booking and Checkout may share manydimension tables in the hotel industry
Hotels
Travel Agents
Promotion
Room Type
Customer
Booking
Checkout
-
7/31/2019 Krithi Talk Impact
81/169
-
7/31/2019 Krithi Talk Impact
82/169
82
Creating Arrays
Many times each occurrence of a sequence of data is in a different physical locationBeneficial to collect all occurrences togetherand store as an array in a single rowMakes sense only if there are a stablenumber of occurrences which are accessedtogetherIn a data warehouse, such situations arisenaturally due to time based orientation
can create an array by month
-
7/31/2019 Krithi Talk Impact
83/169
83
Selective Redundancy
Description of an item can be storedredundantly with order table --most often item description is alsoaccessed with order tableUpdates have to be careful
-
7/31/2019 Krithi Talk Impact
84/169
84
Partitioning
Breaking data into severalphysical units that can behandled separatelyNot a question of whether to do it in datawarehouses but how to doitGranularity andpartitioning are key toeffective implementationof a warehouse
-
7/31/2019 Krithi Talk Impact
85/169
-
7/31/2019 Krithi Talk Impact
86/169
86
Criterion for Partitioning
Typically partitioned bydateline of businessgeographyorganizational unitany combination of above
-
7/31/2019 Krithi Talk Impact
87/169
87
Where to Partition?
Application level or DBMS levelMakes sense to partition atapplication level
Allows different definition for each yearImportant since warehouse spans many
years and as business evolves definitionchanges
Allows data to be moved betweenprocessing complexes easily
-
7/31/2019 Krithi Talk Impact
88/169
Data Warehouse vs. Data Marts
What comes first
From the Data Warehouse to Data
-
7/31/2019 Krithi Talk Impact
89/169
89
Marts
DepartmentallyStructured
IndividuallyStructured
Data WarehouseOrganizationallyStructured
Less
More
HistoryNormalizedDetailed
Data
Information
-
7/31/2019 Krithi Talk Impact
90/169
90
Data Warehouse and Data Marts
OLAPData MartLightly summarizedDepartmentally structured
Organizationally structured AtomicDetailed Data Warehouse Data
Characteristics of the
-
7/31/2019 Krithi Talk Impact
91/169
91
Departmental Data Mart
OLAPSmallFlexible
Customized byDepartmentSource is
departmentallystructured datawarehouse
Techniques for Creating
-
7/31/2019 Krithi Talk Impact
92/169
92
Departmental Data Mart
OLAP
Subset
SummarizedSuperset
Indexed
Arrayed
Sales Mktg.Finance
-
7/31/2019 Krithi Talk Impact
93/169
93
Data Mart Centric
Data Marts
Data Sources
Data Warehouse
Problems with Data Mart Centric
-
7/31/2019 Krithi Talk Impact
94/169
94
Solution
If you end up creating multiple warehouses,integrating them is a problem
-
7/31/2019 Krithi Talk Impact
95/169
95
True Warehouse
Data Marts
Data Sources
Data Warehouse
-
7/31/2019 Krithi Talk Impact
96/169
96
Query Processing
Indexing
Pre computedviews/aggregatesSQL extensions
-
7/31/2019 Krithi Talk Impact
97/169
97
Indexing Techniques
Exploiting indexes to reducescanning of data is of crucialimportanceBitmap IndexesJoin IndexesOther Issues
Text indexingParallelizing and sequencing of index
builds and incremental updates
Indexing Techniques
-
7/31/2019 Krithi Talk Impact
98/169
98
Indexing Techniques
Bitmap index:A collection of bitmaps -- one for each
distinct value of the column
Each bitmap has N bits where N is thenumber of rows in the tableA bit corresponding to a value v for a
row r is set if and only if r has the valuefor the indexed attribute
-
7/31/2019 Krithi Talk Impact
99/169
d
-
7/31/2019 Krithi Talk Impact
100/169
100
Customer Query : select * from customer where
gender = F and vote = Y
0
0
0
0
0
0
0
0
0
1
1
1
1
1
1
1
1
1
Bitmap Index
M
F
F
F
F
M
Y
Y
Y
N
N
N
d
-
7/31/2019 Krithi Talk Impact
101/169
101
Bit Map Index
Cust Region RatingC1 N HC2 S MC3 W LC4 W HC5 S LC6 W L
C7 N H
Base Table
Row ID N S E W
1 1 0 0 0
2 0 1 0 0
3 0 0 0 1
4 0 0 0 1
5 0 1 0 0
6 0 0 0 1
7 1 0 0 0
Row ID H M L
1 1 0 0
2 0 1 0
3 0 0 0
4 0 0 0
5 0 1 0
6 0 0 0
7 1 0 0
Rating Index Region Index
Customers where Region = W Rating = M And
Bi M I d
-
7/31/2019 Krithi Talk Impact
102/169
102
BitMap Indexes
Comparison, join and aggregation operationsare reduced to bit arithmetic with dramaticimprovement in processing time
Significant reduction in space and I/O (30:1)Adapted for higher cardinality domains as well.Compression (e.g., run-length encoding)exploitedProducts that support bitmaps: Model 204,TargetIndex (Redbrick), IQ (Sybase), Oracle7.3
-
7/31/2019 Krithi Talk Impact
103/169
J i I d
-
7/31/2019 Krithi Talk Impact
104/169
104
Join Indexes
Join indexes can also span multipledimension tables
e.g., a join index on city and time dimension of calls fact table
Star Join Processing
-
7/31/2019 Krithi Talk Impact
105/169
105
Star Join Processing
Use join indexes to join dimensionand fact table
Calls C+T
C+T+L
C+T+L +P
Time
Loca- tion
Plan
Optimized Star Join Processing
-
7/31/2019 Krithi Talk Impact
106/169
106
Optimized Star Join Processing
Time
Loca- tion
Plan
Calls
Virtual Cross Product of T, L and P
Apply Selections
Bitmapped Join Processing
-
7/31/2019 Krithi Talk Impact
107/169
107
Bitmapped Join Processing
AND
Time
Loca- tion
Plan
Calls
Calls
Calls
Bitmaps 1 0
1
0 0 1
1 1 0
-
7/31/2019 Krithi Talk Impact
108/169
-
7/31/2019 Krithi Talk Impact
109/169
Parallel Query Processing
-
7/31/2019 Krithi Talk Impact
110/169
110
Parallel Query Processing
Partitioned DataParallel scansYields I/O parallelism
Parallel algorithms for relational operatorsJoins, Aggregates, Sort
Parallel UtilitiesLoad, Archive, Update, Parse, Checkpoint,Recovery
Parallel Query Optimization
Pre-computed Aggregates
-
7/31/2019 Krithi Talk Impact
111/169
111
Pre computed Aggregates
Keep aggregated data forefficiency (pre-computed queries)
QuestionsWhich aggregates to compute?How to update aggregates?
How to use pre-computedaggregates in queries?
P t d Agg g t
-
7/31/2019 Krithi Talk Impact
112/169
112
Pre-computed Aggregates
Aggregated table can be maintainedby the
warehouse server
middle tierclient applications
Pre-computed aggregates -- specialcase of materialized views -- samequestions and issues remain
SQL Extensions
-
7/31/2019 Krithi Talk Impact
113/169
113
SQL Extensions
Extended family of aggregatefunctions
rank (top 10 customers)percentile (top 30% of customers)median, modeObject Relational Systems allow
addition of new aggregate functions
SQL Extensions
-
7/31/2019 Krithi Talk Impact
114/169
114
SQL Extensions
Reporting featuresrunning total, cumulative totals
Cube operatorgroup by on all subsets of a set of
attributes (month,city)redundant scan and sorting of data can
be avoided
Red Brick has Extended set ofAggregates
-
7/31/2019 Krithi Talk Impact
115/169
115
Aggregates
Select month, dollars, cume(dollars) asrun_dollars, weight, cume(weight) asrun_weightsfrom sales, market, product, period t
where year = 1993and product like Columbian% and city like San Fr% order by t.perkey
RISQL (Red Brick Systems)Extensions
-
7/31/2019 Krithi Talk Impact
116/169
116
Extensions
AggregatesCUMEMOVINGAVG
MOVINGSUMRANKTERTILERATIOTOREPORT
Calculating RowSubtotals
BREAK BY
Sophisticated DateTime SupportDATEDIFF
Using SubQueriesin calculations
Using SubQueries in Calculations
-
7/31/2019 Krithi Talk Impact
117/169
117
Using SubQueries in Calculations
select product, dollars as jun97_sales,(select sum(s1.dollars)from market mi, product pi, period, ti, sales si
where pi.product = product.productand ti.year = period.yearand mi.city = market.city) as total97_sales,100 * dollars/
(select sum(s1.dollars)from market mi, product pi, period, ti, sales si where pi.product = product.product
and ti.year = period.yearand mi.city = market.city) as percent_of_yr
from market, product, period, sales
where year = 1997and month = June and city like Ahmed% order by product;
Course Overview
-
7/31/2019 Krithi Talk Impact
118/169
118
Course Overview
The course:what and how
0. IntroductionI. Data WarehousingII. Decision Supportand OLAP
III. Data MiningIV. Looking Ahead
Demos and Labs
-
7/31/2019 Krithi Talk Impact
119/169
II. On-Line Analytical Processing (OLAP)
Making DecisionSupport Possible
Limitations of SQL
-
7/31/2019 Krithi Talk Impact
120/169
120
Limitations of SQL
A Freshman inBusiness needs
a Ph.D. in SQL
-- Ralph Kimball
Typical OLAP Queries
-
7/31/2019 Krithi Talk Impact
121/169
121
Typical OLAP Queries
Write a multi-table join to compare sales for eachproduct line YTD this year vs. last year.
Repeat the above process to find the top 5
product contributors to margin.Repeat the above process to find the sales of aproduct line to new vs. existing customers.
Repeat the above process to find the customersthat have had negative sales growth.
What Is OLAP?
-
7/31/2019 Krithi Talk Impact
122/169
122
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
Online Analytical Processing - coined byEF Codd in 1994 paper contracted byArbor Software * Generally synonymous with earlier terms such asDecisions Support, Business Intelligence, ExecutiveInformation SystemOLAP = Multidimensional DatabaseMOLAP: Multidimensional OLAP (Arbor Essbase,Oracle Express)ROLAP: Relational OLAP (Informix MetaCube,Microstrategy DSS Agent)
The OLAP Market
-
7/31/2019 Krithi Talk Impact
123/169
123
The OLAP Market
Rapid growth in the enterprise market1995: $700 Million1997: $2.1 Billion
Significant consolidation activity among
major DBMS vendors10/94: Sybase acquires ExpressWay7/95: Oracle acquires Express11/95: Informix acquires Metacube1/97: Arbor partners up with IBM10/96: Microsoft acquires Panorama
Result: OLAP shifted from small verticalniche to mainstream DBMS category
Strengths of OLAP
-
7/31/2019 Krithi Talk Impact
124/169
124
Strengths of OLAP
It is a powerful visualization paradigm
It provides fast, interactive responsetimes
It is good for analyzing time series
It can be useful to find some clusters and
outliersMany vendors offer OLAP tools
OLAP Is FASMI
-
7/31/2019 Krithi Talk Impact
125/169
125
Nigel Pendse, Richard Creath - The OLAP Report
OLAP Is FASMI
FastAnalysisSharedMultidimensionalInformation
M l i di i l D
-
7/31/2019 Krithi Talk Impact
126/169
126Month
1 2 3 4 765
P r o
d u c
t
Toothpaste
JuiceColaMilk
Cream
Soap
WSN
Dimensions: Product, Region, TimeHierarchical summarization paths
Product Region Time Industry Country Year
Category Region Quarter
Product City Month Week
Office Day
Multi-dimensional Data
HeyI sold $100M worth of goods
Data Cube Lattice
-
7/31/2019 Krithi Talk Impact
127/169
127
Data Cube Lattice
Cube latticeABC
AB AC BCA B C
noneCan materialize some groupbys, compute otherson demandQuestion: which groupbys to materialze?
Question: what indices to createQuestion: how to organize data (chunks, etc)
Visualizing Neighbors is simpler
-
7/31/2019 Krithi Talk Impact
128/169
128
Visualizing Neighbors is simpler
1 2 3 4 5 6 7 8 AprMayJun
Jul AugSepOctNov
DecJanFebMar
Month Store Sales Apr 1 Apr 2 Apr 3 Apr 4 Apr 5 Apr 6 Apr 7 Apr 8May 1May 2May 3May 4
May 5May 6May 7May 8Jun 1Jun 2
A Visual Operation: Pivot (Rotate)
-
7/31/2019 Krithi Talk Impact
129/169
129
A Visual Operation: Pivot (Rotate)
10
47
30
12
JuiceCola
Milk
Cream
3/1 3/2 3/3 3/4
Date
Product
Slicing and Dicing
-
7/31/2019 Krithi Talk Impact
130/169
130
Slicing and Dicing
Product
Sales Channel Retail Direct Special
Household
Telecomm
Video
Audio IndiaFar East
Europe
The Telecomm Slice
-
7/31/2019 Krithi Talk Impact
131/169
Nature of OLAP Analysis
-
7/31/2019 Krithi Talk Impact
132/169
132
Nature of OLAP Analysis
Aggregation -- (total sales,percent-to-total)Comparison -- Budget vs.Expenses
Ranking -- Top 10, quartileanalysisAccess to detailed and
aggregate dataComplex criteriaspecificationVisualization
-
7/31/2019 Krithi Talk Impact
133/169
Multidimensional Spreadsheets
-
7/31/2019 Krithi Talk Impact
134/169
134
Multidimensional Spreadsheets
Analysts needspreadsheets that supportpivot tables (cross-tabs)drill-down and roll-up
slice and dicesortselectionsderived attributes
Popular in retail domain
OLAP - Data Cube
-
7/31/2019 Krithi Talk Impact
135/169
135
OLAP Data Cube
Idea: analysts need to group data in manydifferent ways
eg. Sales(region, product, prodtype,prodstyle, date, saleamount)
saleamount is a measure attribute, rest aredimension attributesgroupby every subset of the other attributes
materialize (precompute and store)groupbys to give online response
Also: hierarchies on attributes: date ->weekday,date -> month -> quarter -> year
SQL Extensions
-
7/31/2019 Krithi Talk Impact
136/169
136
SQL Extensions
Front-end tools requireExtended Family of Aggregate Functionsrank, median, mode
Reporting Featuresrunning totals, cumulative totals
Results of multiple group bytotal sales by month and total sales by
productData Cube
Relational OLAP: 3 Tier DSS
-
7/31/2019 Krithi Talk Impact
137/169
137
Relational OLAP: 3 Tier DSSData Warehouse ROLAP Engine Decision Support Client
Database Layer Application Logic Layer Presentation Layer
Store atomic
data in industrystandardRDBMS.
Generate SQL
execution plans inthe ROLAP engineto obtain OLAPfunctionality.
Obtain multi-
dimensionalreports from theDSS Client.
-
7/31/2019 Krithi Talk Impact
138/169
Typical OLAP ProblemsD E l i
-
7/31/2019 Krithi Talk Impact
139/169
139
Data Explosion Syndrome
Number of Dimensions
N u m b
e r o
f A g g r e g a
t i o n s
(4 levels in each dimension)
Data Explosion
Microsoft TechEd98
Metadata Repository
-
7/31/2019 Krithi Talk Impact
140/169
140
etadata epos to y
Administrative metadatasource databases and their contentsgateway descriptionswarehouse schema, view & derived data definitions
dimensions, hierarchiespre-defined queries and reportsdata mart locations and contentsdata partitionsdata extraction, cleansing, transformation rules,defaultsdata refresh and purging rulesuser profiles, user groupssecurity: user authorization, access control
Metdata Repository .. 2
-
7/31/2019 Krithi Talk Impact
141/169
141
p y
Business databusiness terms and definitionsownership of data
charging policiesoperational metadata
data lineage: history of migrated data andsequence of transformations appliedcurrency of data: active, archived, purgedmonitoring information: warehouse usagestatistics, error reports, audit trails.
Recipe for a Successful
-
7/31/2019 Krithi Talk Impact
142/169
pWarehouse
-
7/31/2019 Krithi Talk Impact
143/169
For a Successful Warehouse
-
7/31/2019 Krithi Talk Impact
144/169
144
Look closely at the data extracting,cleaning, and loading toolsImplement a user accessible automated
directory to information stored in thewarehouseDetermine a plan to test the integrity of the data in the warehouseFrom the start get warehouse users in thehabit of 'testing' complex queries
For a Successful Warehouse
-
7/31/2019 Krithi Talk Impact
145/169
145
Coordinate system roll-out with networkadministration personnelWhen in a bind, ask others who have
done the same thing for adviceBe on the lookout for small, but strategic,projectsMarket and sell your data warehousingsystems
Data Warehouse Pitfalls
-
7/31/2019 Krithi Talk Impact
146/169
146
You are going to spend much time extracting,cleaning, and loading data
Despite best efforts at project management, datawarehousing project scope will increase
You are going to find problems with systemsfeeding the data warehouse
You will find the need to store data not beingcaptured by any existing system
You will need to validate data not being validatedby transaction processing systems
Data Warehouse Pitfalls
-
7/31/2019 Krithi Talk Impact
147/169
147
Some transaction processing systems feeding thewarehousing system will not contain detail
Many warehouse end users will be trained andnever or seldom apply their training
After end users receive query and report tools,requests for IS written reports may increase
Your warehouse users will develop conflictingbusiness rules
Large scale data warehousing can become anexercise in data homogenizing
Data Warehouse Pitfalls
-
7/31/2019 Krithi Talk Impact
148/169
148
'Overhead' can eat up great amounts of diskspaceThe time it takes to load the warehouse willexpand to the amount of the time in theavailable window... and then someAssigning security cannot be done with atransaction processing system mindsetYou are building a HIGH maintenance system
You will fail if you concentrate on resourceoptimization to the neglect of project, data, andcustomer management issues and anunderstanding of what adds value to thecustomer
DW and OLAP Research Issues
-
7/31/2019 Krithi Talk Impact
149/169
149
Data cleaningfocus on data inconsistencies, not schema differencesdata mining techniques
Physical Designdesign of summary tables, partitions, indexestradeoffs in use of different indexes
Query processingselecting appropriate summary tablesdynamic optimization with feedbackacid test for query optimization: cost estimation, use of transformations, search strategiespartitioning query processing between OLAP server andbackend server.
DW and OLAP Research Issues .. 2
-
7/31/2019 Krithi Talk Impact
150/169
150
Warehouse Managementdetecting runaway queriesresource managementincremental refresh techniquescomputing summary tables during loadfailure recovery during load and refreshprocess management: scheduling queries,
load and refreshQuery processing, cachinguse of workflow technology for processmanagement
-
7/31/2019 Krithi Talk Impact
151/169
Products, References, Useful Links
Reporting Tools
-
7/31/2019 Krithi Talk Impact
152/169
152
g
Andyne Computing -- GQLBrio -- BrioQueryBusiness Objects -- Business ObjectsCognos -- ImpromptuInformation Builders Inc. -- Focus for WindowsOracle -- Discoverer2000Platinum Technology -- SQL*Assist, ProReportsPowerSoft -- InfoMakerSAS Institute -- SAS/AssistSoftware AG -- EsperantSterling Software -- VISION:Data
OLAP and Executive InformationSystems
-
7/31/2019 Krithi Talk Impact
153/169
153
Andyne Computing -- PabloArbor Software -- Essbase
Cognos -- PowerPlay
Comshare -- Commander
OLAPHolistic Systems -- Holos
Information Advantage --AXSYS, WebOLAP
Informix -- MetacubeMicrostrategies --DSS/Agent
Microsoft -- PlatoOracle -- Express
Pilot -- LightShip
Planning Sciences --
GentiumPlatinum Technology --ProdeaBeacon, Forest & Trees
SAS Institute -- SAS/EIS,OLAP++
Speedware -- Media
Other Warehouse RelatedProducts
-
7/31/2019 Krithi Talk Impact
154/169
154
Data extract, clean, transform,refresh
CA-Ingres replicator
Carleton PassportPrism Warehouse ManagerSAS Access
Sybase Replication ServerPlatinum Inforefiner, Infopump
Extraction and TransformationTools
-
7/31/2019 Krithi Talk Impact
155/169
155
Carleton Corporation -- PassportEvolutionary Technologies Inc. -- Extract
Informatica -- OpenBridge
Information Builders Inc. -- EDA Copy Manager
Platinum Technology -- InfoRefiner
Prism Solutions -- Prism Warehouse Manager
Red Brick Systems -- DecisionScape Formation
Scrubbing Tools
-
7/31/2019 Krithi Talk Impact
156/169
156
Apertus -- Enterprise/IntegratorVality -- IPEPostal Soft
Warehouse Products
-
7/31/2019 Krithi Talk Impact
157/169
157
Computer Associates -- CA-IngresHewlett-Packard -- Allbase/SQLInformix -- Informix, Informix XPS
Microsoft -- SQL ServerOracle -- Oracle7, Oracle Parallel ServerRed Brick -- Red Brick Warehouse
SAS Institute -- SASSoftware AG -- ADABASSybase -- SQL Server, IQ, MPP
Warehouse Server Products
-
7/31/2019 Krithi Talk Impact
158/169
158
Oracle 8Informix
Online Dynamic ServerXPS --Extended Parallel ServerUniversal Server for object relational
applicationsSybase
Adaptive Server 11.5Sybase MPPSybase IQ
Warehouse Server Products
-
7/31/2019 Krithi Talk Impact
159/169
159
Red Brick WarehouseTandem NonstopIBM
DB2 MVSUniversal ServerDB2 400
Teradata
Other Warehouse RelatedProducts
-
7/31/2019 Krithi Talk Impact
160/169
160
Connectivity to SourcesApertusInformation Builders EDA/SQL
Platimum InfohubSAS ConnectIBM Data Joiner
Oracle Open ConnectInformix Express Gateway
Other Warehouse RelatedProducts
-
7/31/2019 Krithi Talk Impact
161/169
161
Query/Reporting EnvironmentsBrio/QueryCognos Impromptu
Informix ViewpointCA Visual ExpressBusiness Objects
Platinum Forest and Trees
4GL's, GUI Builders, and PCDatabases
-
7/31/2019 Krithi Talk Impact
162/169
162
Information Builders -- FocusLotus -- ApproachMicrosoft -- Access, Visual BasicMITI -- SQR/WorkbenchPowerSoft -- PowerBuilder
SAS Institute -- SAS/AF
Data Mining Products
-
7/31/2019 Krithi Talk Impact
163/169
163
DataMind -- neurOagentInformation Discovery -- IDISSAS Institute -- SAS/Neuronets
Data Warehouse
-
7/31/2019 Krithi Talk Impact
164/169
164
W.H. Inmon, Building the DataWarehouse, Second Edition, John Wileyand Sons, 1996W.H. Inmon, J. D. Welch, Katherine L.Glassey, Managing the Data Warehouse,John Wiley and Sons, 1997Barry Devlin, Data Warehouse from
Architecture to Implementation, AddisonWesley Longman, Inc 1997
Data Warehouse
-
7/31/2019 Krithi Talk Impact
165/169
165
W.H. Inmon, John A. Zachman, JonathanG. Geiger, Data Stores Data Warehousingand the Zachman Framework, McGraw HillSeries on Data Warehousing and DataManagement, 1997Ralph Kimball, The Data WarehouseToolkit, John Wiley and Sons, 1996
OLAP and DSS
-
7/31/2019 Krithi Talk Impact
166/169
166
Erik Thomsen, OLAP Solutions, John Wileyand Sons 1997Microsoft TechEd Transparencies fromMicrosoft TechEd 98Essbase Product LiteratureOracle Express Product LiteratureMicrosoft Plato Web SiteMicrostrategy Web Site
Data Mining
-
7/31/2019 Krithi Talk Impact
167/169
167
Michael J.A. Berry and Gordon Linoff, DataMining Techniques, John Wiley and Sons1997Peter Adriaans and Dolf Zantinge, DataMining, Addison Wesley Longman Ltd.1996KDD Conferences
Other Tutorials
-
7/31/2019 Krithi Talk Impact
168/169
168
Donovan Schneider, Data Warehousing Tutorial,Tutorial at International Conference forManagement of Data (SIGMOD 1996) andInternational Conference on Very Large Data
Bases 97Umeshwar Dayal and Surajit Chaudhuri, DataWarehousing Tutorial at International Conferenceon Very Large Data Bases 1996Anand Deshpande and S. Seshadri, Tutorial onDatawarehousing and Data Mining, CSI-97
Useful URLs
-
7/31/2019 Krithi Talk Impact
169/169
Ralph Kimballs home page http://www.rkimball.com
Larry Greenfields Data WarehouseInformation Center
http://pwp.starnetinc.com/larryg/
Data Warehousing Institutehttp://www.dw-institute.com/
OLAP Council
http://www.rkimball.com/http://pwp.starnetinc.com/larryg/http://www.dw-institute.com/http://www.dw-institute.com/http://www.dw-institute.com/http://www.dw-institute.com/http://www.dw-institute.com/http://pwp.starnetinc.com/larryg/http://www.rkimball.com/