Post on 14-Aug-2020
Business Intelligence Data Warehoue & OLAP
2015.06
충북대학교 조 완섭wscho@cbnu.ac.kr
Ch8-2
Contents
1. DW와 OLAP
2. Models과 operations
3. Data warehouse 구축
4. 실무사례
- AOL
2015-07-26 CBU / MIS 2
DATA WAREHOUSE AND OLAP
Section 1
2015-07-26 CBU / MIS 3
Contents
• DW and OLAP
– Background
– What is a data warehouse?
– Why a warehouse?
– DW, OLAP, OLTP
2015-07-26 CBU / MIS 4
What is a data warehouse?
• Collection of diverse data– aimed at executive, decision maker
– subject oriented
– often a copy of operational data
– with value-added data (e.g., summaries, history)
– Integrated, historical, and non-volatile data
• Collection of tools– gathering data
– cleansing, integrating, ...
– querying, reporting, analysis
– data mining
– monitoring, administering warehouse
2015-07-26 CBU / MIS 5
Why a warehouse?
• Huge databases, but few analysis
• Analysis requires integration of diverse databases
– Construct a data warehouse
2015-07-26 CBU / MIS 6
DW, OLAP, OLTP - DW : 2 approaches
• Two approaches in the analysis
– Query-Driven (Lazy) ; Limitation !
– Warehouse (Eager)
2015-07-26 CBU / MIS 7
Source Source….SourceSource
SourceSource
Query-driven ?
or
Data warehouse ?
DW, OLAP, OLTP - DW : 2 approaches
• Query-Driven
2015-07-26 CBU / MIS 8
Limitation
- Historical data ?
- Overhead in the OLTP systems
Client Client
Wrapper Wrapper Wrapper
Mediator
Source Source Source
Analysis
Query &
Results
OLTP
OLAP
OLTP
DW, OLAP, OLTP - DW : 2 approaches
• Adv. Of Query-Driven
– No need to copy data
– Less storage
– No need to purchase data
– More up-to-date data
– Query needs can be unknown (ad-hoc queries)
– Only query interface needed at sources
2015-07-26 CBU / MIS 9
DW, OLAP, OLTP - DW : 2 approaches
• Data Warehouse
2015-07-26 CBU / MIS 10
Client Client
Warehouse
Source Source Source
Query & Analysis
Integration
MetadataExtraction
Transformation
Cleansing
Loading
OLAP & Data Mining
DW, OLAP, OLTP - DW : 2 approaches
• Adv. Of DW Approach
– High query performance (specialized indexes)
– Queries not visible outside warehouse
– Local processing at sources (OLTP) unaffected
– Can operate when sources unavailable
– Can query data not stored in a DBMS (queries on the integrated database)
– Extra information at warehouse
– Summarize (store aggregates)
– Add historical information
– Sufficient meta data
2015-07-26 CBU / MIS 11
DW, OLAP, OLTP
• OLTP: On Line Transaction Processing
– Describes processing at operational sites
• OLAP: On Line Analytical Processing
– Describes processing at warehouse
2015-07-26 CBU / MIS 12
Mostly reads
Queries long, complex
Gb-Tb of data
Summarized, consolidated data
Decision-makers, analysts as users
OLTP OLAP Mostly updates
Many small transactions
Mb-Tb of data
Raw data
Clerical users
Up-to-date data
Consistency, recoverability critical
DW, OLAP, OLTP
• Data Mart
– Smaller warehouses
– Support part of organization• e.g., marketing (customers, products, sales)
– Do not require enterprise-wide consensus• but long term integration problems?
2015-07-26 CBU / MIS 13
WAREHOUSE MODELS & OPERATORS
Section 2
2015-07-26 CBU / MIS 14
DW Data Model and Operators
• Data Models
– relations
– Star schema & snowflakes
– cubes
• Operators
– aggregations
– grouping
– slice & dice
– roll-up, drill down
– pivoting
– other
2015-07-26 CBU / MIS 15
Star Schema
• Fact Table & Dimension Tables & Measures
2015-07-26 CBU / MIS 16
Measures
Fact Table
Dimension TableDimension Table
Dimension Table
Star Schema
• Contents
2015-07-26 CBU / MIS 17
customer custId name address city
53 joe 10 main sfo
81 fred 12 main sfo
111 sally 80 willow la
product prodId name price
p1 bolt 10
p2 nut 5
store storeId city
c1 nyc
c2 sfo
c3 la
sale oderId date custId prodId storeId qty amt
100 1/7/97 53 p1 c1 1 12
102 2/7/97 53 p2 c1 2 11
105 3/8/97 111 p1 c3 5 50
Star Schema
• Snow-flake schema
2015-07-26 CBU / MIS 18
sType
cidspopregID
City
cidssizelocation
sType
regIDregName
Region
Star Schema
• Contents
• Dimension Hierarchy
2015-07-26 CBU / MIS 19
store storeId cityId stId mgr
s5 sfo t1 joe
s7 sfo t2 fred
s9 la t1 nancycity cityId pop regId
sfo 1M north
la 5M south
region regId name
north cold region
south warm region
sType stId size location
t1 small downtown
t2 large suburbs
snowflake schema
store
sType
city region
week
year
quarter month day
Cube Data Model
• Star Schema vs. Cube data model
2015-07-26 CBU / MIS 20
sale prodId storeId amt
p1 c1 12
p2 c1 11
p1 c3 50
p2 c2 8
c1 c2 c3
p1 12 50
p2 11 8
Fact table view:Multi-dimensional cube:
dimensions = 2
Cube Data Model
• Star Schema vs. Cube data model
2015-07-26 CBU / MIS 21
sale oderId date custId prodId storeId qty amt
100 1/7/97 53 p1 c1 1 12
102 2/7/97 53 p2 c100 2 11
105 3/8/97 111 p1 c3 5 50
Customer
Sto
re
P1
p50
P2…
1111 … 53 …
C100
….C3
C2
C1
60
50
40
150
Store
<1,12,…>
<5,50,…>
<2,11,…>
<qty, amt, date>
Cube Data Model
• Another example
2015-07-26 CBU / MIS 22
sale prodId storeId date amt
p1 c1 1 12
p2 c1 1 11
p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4
day 2c1 c2 c3
p1 44 4
p2 c1 c2 c3
p1 12 50
p2 11 8
day 1
Multi-dimensional cube:Fact table view:
3D Cube
MOLAP - CUBE Operators• Pivot
– 행과 열의 위치를 바꾸는 연산자
• Roll-up – 차원 계층을 따라서 상위 레벨에 대한 집계치를 계산하는 연산– 예를들어, 날짜별 판매량에서 주별, 혹은 월별 판매량을 구할 때 사용됨
• Drill-down – 차원 계층을 따라서 하위 레벨에 대한 집계치를 계산하는 연산– 예를들어, 월별 판매량에서 주별, 혹은 날짜별 판매량을 구하는데 사용됨
• Slice– 한 차원에 대하여 주어진 값으로 selection한 결과 (subcube)를 산출하는 연산– 예를들어 slice for time = "Q1"은 time 차원에서 하나의 값 (Q1)으로 셀렉션하여 생성됨
• Dice – 두 차원 이상에 대하여 주어진 값들로 selection한 결과 (subcube)– location, time, item 세 차원에서 selection 된 결과 subcube를 만들 때 사용됨
• Drill-across– 두 개 이상의 사실 테이블을 포함한 질의를 수행할 때 사용되는 연산자
• Drill-through– 큐브에서 해당 소스 데이터베이스 (관계형 데이터베이스 테이블)까지 접근하여
raw data를 검색하는 연산
2015-07-26 CBU / MIS 23
MOLAP - CUBE Operators
• Ranking (Sorting) the top/bottom N items
– 최고치부터 차례로 N개를 선택하거나 혹은 최저치부터 차례로 N개를 선택하는 연산
– 매출량이 가장 높은 10개 상점은 ?
• 이 밖에도 큐브에 대한 다양한 연산자들이 제안되고 있음
– Moving average
– Growth rate
– Interests
– Internal rate of return
– Depreciation
– Currency conversion
– Statistical functions
2015-07-26 CBU / MIS 24
MOLAP - CUBE Operators
• Pivoting
2015-07-26 CBU / MIS 25
sale prodId storeId date amt
p1 c1 1 12
p2 c1 1 11
p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4
day 2c1 c2 c3
p1 44 4
p2 c1 c2 c3
p1 12 50
p2 11 8
day 1
Multi-dimensional cube:Fact table view:
c1 c2 c3
p1 56 4 50
p2 11 8
Pivoting
- change the orientation of axis
p1 p2
C1 56 11
C2 4 8
C3 50
cube
roll-up
MOLAP - CUBE Operators
• Cube Aggregation : roll-up, drill down
2015-07-26 CBU / MIS 26
day 2c1 c2 c3
p1 44 4
p2 c1 c2 c3
p1 12 50
p2 11 8
day 1
c1 c2 c3
p1 56 4 50
p2 11 8
c1 c2 c3
sum 67 12 50
sum
p1 110
p2 19
129
. . .
sale(c1,*,*)
sale(*,*,*)sale(c2,p2,*)
day1, day2
Drill-dwonRoll-up
MOLAP - CUBE Operators
• Slice, dice
2015-07-26 CBU / MIS 27
day 2c1 c2 c3
p1 44 4
p2 c1 c2 c3
p1 12 50
p2 11 8
day 1
c1 c2 c3
p1 12 50
p2 11 8
day 1
slice
dice
subcube
cube
MOLAP - CUBE Operators
• Han’s book 그림
2015-07-26 CBU / MIS 28
OLAP 연산자
ROLAP - SQL
• Star Schema
2015-07-26 CBU / MIS 29
의사결정지원용 질의 예• 날짜별 avg_sales ?• 날짜별, branch_name별 avg_sales ?• 날짜별, branch_name별, item_name별 avg_sales ?• 월별, branch_name별, item_name별 avg_sales ?• 분기별, branch_name별, item_name별 avg_sales ?• 년도별, branch_name별, item_name별 avg_sales ?• 2009년, branch_name별, item_name별 avg_sales ?• 2009년, 1분기, branch_name별, item_name별 avg_sales ?• 2009년, 1분기, branch_name별, item_name별, city별 avg_sales ?• item_name별, month별 avg_sales ?• item_name별, quarter별 avg_sales ?• item_name별, quarter별, item_name별, city별 avg_sales ?• brand별, quarter별, item_name별, city별 avg_sales• type별, quarter별, item_name별, city별 avg_sales• supplier_type별, quarter별, item_name별, city별 avg_sales• (조건에 따라서 무수히 많은 질문이 가능하며, 이러한 질의를 효과적으로 작
성하고, 빠르게 분석할 수 있어야 함)=> SQL vs. OLAP ?
2015-07-26 CBU / MIS 30
ROLAP - SQL
• Grouping
Q1) year, month별로 SUM(Amount)를 집계한 SQL SELECT year, month, SUM(dollars_sold)FROM Sales S, Time TWHERE S.time_key = T.time_keyGROUP BY year, month;
2015-07-26 CBU / MIS 31
year month SUM(dollars_sold)
2005 1 200
2005 2 250
2005 3 530
... ... ...
2006 1 330
2006 2 420
... ... ...
ROLAP - SQL
• Q2) 질의 Q1에 대한 roll-up / drill-down 연산
SELECT year, SUM(dollars_sold) => year, month, SUM(dollars_sold)
FROM Sales S, time T
WHERE S.time_key = T.time_key
GROUP BY year; => year, month
2015-07-26 CBU / MIS 32
year SUM(dollars_sold)
2005 2200
2006 3250
2007 6530
... ...
year month SUM(dollars_sold)
2005 1 200
2005 2 250
2005 3 530
... ... ...
2006 1 330
2006 2 420
... ... ...
Rollup
Drilldown
ROLAP - SQL
• Q3) Slice for time = "Q1"SELECT S.*
FROM sales S, time T
WHERE S.time_key = T.time_key AND T.quarter = "Q1";
• Q4) Dice for (location.city = "Toronto" or "Vancouver") AND
(time.quarter = "Q1" or "Q2") AND
(item.item_name = "entertainment" or "PC")
SELECT S.*
FROM sales S, time T, item I, lication L
WHERE S.time_key = T.time_key AND
S.lication_key = L.location_key AND
S.item_key = I.item_key AND
(T.quarter = "Q1" or T.quarter = "Q2") AND
(L.city = "Toronto" or L.city = "Vancouver") AND
(I.item_name = "entertainment" or I.item_name = "PC");
2015-07-26 CBU / MIS 33
ROLAP – SQL 확장 [Gra97]
• 확장된 집계 연산자– rank (e.g., top ten), percentile (top 10%), window queries
• GROUP BY 절의 확장– GROUPING SETS, ROLLUP, CUBE 등의 옵션 추가
• Assumed Table
2015-07-26 CBU / MIS 34
shipment (fact table)
S# P# Qty Date
S1 P1 20 2004-01-01
S1 P2 10 2004-01-01
S1 P3 50 2003-03-07
S2 P1 20 2004-05-02
S2 P2 5 2004-05-02
S2 P3 50 2004-05-02
S1 P1 45 2004-01-02
… … … …
ROLAP – SQL 확장 [Gra97]
2015-07-26 CBU / MIS 35
S# TotalS1 125S2 75
S# P# Total
S1 P1 65
S1 P2 10
S1 P3 50
S2 P1 20
S2 P2 5
S2 P3 50
Total shipment quantities by supplier
SELECT S#, SUM(QTY) AS TotalFROM SHIPMENTGROUP BY (S#);
Total shipment quantities by supplier and product
SELECT S#, P#, SUM(QTY) AS TotalFROM SHIPMENTGROUP BY (S#, P#);
ROLAP – SQL 확장 [Gra97]
2015-07-26 CBU / MIS 36
GROUPING SET으로 확장한 SQL
S# P# Total
S1 NULL 125
S2 NULL 75
NULL P1 85
NULL P2 15
NULL P3 100
SELECT S#, P#, SUM(QTY) AS TotalFROM SHIPMENTGROUP BY
GROUPING SETS (S#, P#);
S#
P#
ROLAP – SQL 확장 [Gra97]
2015-07-26 CBU / MIS 37
S#, P#
S#
()
1) Aggregate with the finest granularity (Group By S#, P#)2) Then with the next level of granularity (Group By S#)3) Then the grand total (with no Group By clause)
ROLAP – SQL 확장 [Gra97]
2015-07-26 CBU / MIS 38
Aggregate with the finest granularity (GROUP BY S#, P#)Then with all subsets (GROUP BY S#, GROUP BY P#Then the grand total (with no GROUP BY clause)
ROLAP – SQL 확장 [Gra97]
2015-07-26 CBU / MIS 39
Null Null
Null
Null
ROLAP – SQL 확장 [Gra97]
2015-07-26 CBU / MIS 40
NullNullNullNull
Null
Null
Null Null
IMPLEMENTING A DW
Section 3
2015-07-26 CBU / MIS 41
OLAP & DW Architecture
• OLAP
– Analysis tool on the data warehouse
• ROLAP:
– Relational On-Line Analytical Processing
– Relational DBMS is used for storing & accessing data warehouses
• MOLAP:
– Multi-Dimensional On-Line Analytical Processing
– Specialized data management systems
2015-07-26 CBU / MIS 42
OLAP & DW Architecture
• Logical Architecture
2015-07-26 CBU / MIS 43
Data model transformation between conceptual multidimensional schema and logical schema
Data model transformation between external sources and logical schema
OLAP & DW Architecture
• The physical architecture
– Data warehouse architecture for multidimensional OLAP systems
• Multidimensional DB systems serves as both the Data Store Layer and the Data Management Layer
• Data from sources are conformed to the multidimensional model, summaries along all dimensions are precomputed for performance
• Data accesses are directly submitted
2015-07-26 CBU / MIS 44
OLAP & DW Architecture
• Multidimensional OLAP
2015-07-26 CBU / MIS 45
OLAP & DW Architecture
• The physical architecture (cont’)
– Data warehouse architecture for relational OLAP • Both the Data Store Layer and Data Management Layer are rela
tional
• Required to support the multidimensional analysis requests from the Application Interface Layer => relational OLAP engine
• Relational DBMSs cannot be directly transplanted into a DW system
– Data marts• Small data volumes (< 10 GB)
• Data with less than 10 dimensions
– Virtual data warehouse• Collection of views
2015-07-26 CBU / MIS 46
OLAP & DW Architecture
• Relational OLAP
2015-07-26 CBU / MIS 47
OLAP & DW Architecture
2015-07-26 CBU / MIS 48
Data Warehouse 구축과 OLAP 분석과정
DW 구축 절차
• Data extracted from operational databases (data sources)
• Cleaned to minimize errors, fill in missing information
• Transformed (e.g., relational views) to reconcile semantic mismatches
• Loading data: Materializing views, storing them locally into warehouse
• Data is periodically refreshed to reflect updates in data sources
• Periodically purge old data
• System catalogs: Very large, often stored in separate database called metadata
repository
2015-07-26 CBU / MIS 49
OLAP & DW Architecture
2015-07-26 CBU / MIS 50
multi-
dimensional
server
M.D. tools
utilities
could also
sit on
relational
DBMS
Pro
du
ct
Date1 2 3 4
milk
soda
eggs
soap
AB
Sales
• 2가지 구현방식- MOLAP- ROLAP
OLAP & DW Architecture
2015-07-26 CBU / MIS 51
relational
DBMS
ROLAP
server
tools
utilitiessale prodId date sum
p1 1 62
p2 1 19
p1 2 48
Special indices, tuning;
Schema is “denormalized”
Index
• Inverted Index
2015-07-26 CBU / MIS 52
20
23
18
19
20
21
22
23
25
26
r4
r18
r34
r35
r5
r19
r37
r40
rId name age
r4 joe 20
r18 fred 20
r19 sally 21
r34 nancy 20
r35 tom 20
r36 pat 25
r5 dave 21
r41 jeff 26
. . .
age
index
inverted
lists data
recordsQuery:
Get people with age = 20 and name = “fred”
List for age = 20: r4, r18, r34, r35
List for name = “fred”: r18, r52
Answer is intersection: r18
Index
• Bit-map index
2015-07-26 CBU / MIS 53
id name age
1 joe 20
2 fred 20
3 sally 21
4 nancy 20
5 tom 20
6 pat 25
7 dave 21
8 jeff 26
. . .
data
records
20
23
18
19
20
21
22
23
25
26
age
indexbit
maps
1
1
0
1
1
0
0
0
0
0
0
1
0
0
0
1
0
1
1
Query:
Get people with age = 20 AND
name = “fred”
List for age = 20 : 1101100000
List for name = “fred” : 0100000001
Answer is intersection : 0100000000
Index
• Bit-map index
– Good if domain cardinality small
– Bit vectors can be compressed
2015-07-26 CBU / MIS 54
Join index
• SELECT * FROM sales, product WHERE prodId=id ;
2015-07-26 CBU / MIS 55
sale prodId storeId date amt
p1 c1 1 12
p2 c1 1 11
p1 c3 1 50
p2 c2 1 8
p1 c1 2 44
p1 c2 2 4
product id name price
p1 bolt 10
p2 nut 5
joinTb prodId name price storeId date amt
p1 bolt 10 c1 1 12
p2 nut 5 c1 1 11
p1 bolt 10 c3 1 50
p2 nut 5 c2 1 8
p1 bolt 10 c1 2 44
p1 bolt 10 c2 2 4
HUGE TABLE
Join index
• Join index on product.id = sales.prodId
2015-07-26 CBU / MIS 56
product id name price jIndex
p1 bolt 10 r1,r3,r5,r6
p2 nut 5 r2,r4
sale rId prodId storeId date amt
r1 p1 c1 1 12
r2 p2 c1 1 11
r3 p1 c3 1 50
r4 p2 c2 1 8
r5 p1 c1 2 44
r6 p1 c2 2 4
join index
Materialization
• 자주 요구하는 요약 정보를 미리 계산하여 DW에저장함으로써 분석의 성능을 높임
2015-07-26 CBU / MIS 57
day 2 c1 c2 c3
p1 44 4
p2 c1 c2 c3
p1 12 50
p2 11 8
day 1
c1 c2 c3
p1 56 4 50
p2 11 8
c1 c2 c3
p* 67 12 50
c*
p1 110
p2 19
129
. . .
CUBE Sales (Product, City, Day)
Sales(Product, City, *)
Sales(*, City, *)
Sales(Product, *, *)
Sales(*, * , *)
Materialization
• Cube Aggregation Lattice
2015-07-26 CBU / MIS 58
city, product, date
city, product city, date product, date
city product date
all
use greedy
algorithm to
decide what
to materializecity, product, date
city, product city, date product, date
city product date
all129
Materialized View
• Query Processing Using MVs
2015-07-26 CBU / MIS 59
Materialized View
• MV 생성
2015-07-26 CBU / MIS 60
Materialized View
• Query Rewriting
2015-07-26 CBU / MIS 61
Metadata
• Administrative– definition of sources, tools, ...
– schemas, dimension hierarchies, …
– rules for extraction, cleaning, …
– refresh, purging policies
– user profiles, access control, ...
• Business– business terms & definition
– data ownership, charging
• Operational– data lineage (data sources and path way)
– data currency (e.g., active, archived, purged)
– useage stats, error reports, audit trails
2015-07-26 CBU / MIS 62
2015-07-26 CBU / MIS 63
구 분 정 보 내 용 탐 색 유 형
데이터관리용도
- 데이터 모델 정보- 엔티티 정보- 속성정보- 도메인 정보- KEY 정보- Alias 정보- Entity별 비즈니스룰- 약어정보- 논리·물리모델 매핑- 비즈니스용어·물리적이름의 매핑
- 관리자 정보
- 회사내의 비즈니스엔티티 유형은 어떤 것이고 어떻게 정의되었는가?- 비즈니스 엔티티의 속성은?- 속성들이 가지는 코드값의 의미는?- 비즈니스 엔티티에 관련된 다른 정보는 어떤 것들이 있는가?- 비즈니스 변화가 어떤 엔티티에 변화를 주는가?- 데이터 품질에 대한 책음은 누가 지고있는가?- 데이터와 관련된 비즈니스 정의에 대해 누구에게 질문할 것인가?
데이터베이스정보
- 데이터베이스명- 테이블/뷰명- COPYBOOK명- Column명- Index 명- Dimensions / Facts- Connectivity 정보- Network, Server 정보- 권한 정보- 관리자 정보
- 데이터베이스를 구성하는 테이블 목록은?- 데이터의 물리적 속성은 어떻게 되는가?- 데이터는 어떻게 인덱싱되어있는가?- 데이터가 최종적으로 갱신된 시기는?- 이 데이터는 누가 사용하는가?- 이 데이터는 언제 자주 사용하는가?- 데이터베이스에 문제가 생기면 누구와 연락을 취해야 하는가?
데이터 이동정보
-소스 및 타겟정보(컬럼, 테이블, 데이터베이스)
- 소스/타겟 매핑- 변환로직- 변환 버전관리- 매핑 Value- 연산, 파생규칙- 이동시간, 주기- 검증정보- 관리자정보
- 이 데이터의 원천은 어디인가?- 소스시스템은 어떤 종류가 있는가?- 데이터 추출방법은 어떠한가/어떤 조건에 의해 추출되었는가?- 소스에서 타겟으로 오면서 어떤 변환과정을 거쳤는가?- 데이터 추출작업이 제대로 수행되었는가?- 예외데이터에 대해 어떤 처리가 이루어졌는가?- 데이터 이동은 누가 책임지는가?
BusinessIntelligence
- Connectivity 정보- 보안 절차 정보- 리포트, 질의유형 정보- 리포트, 질의검증 정보- 리포트 분배정보- 리포트 원천데이터정보- Help Desk 정보- 관리자정보
- 어떤 리포트, 질의를 사용하여 원하는 결과를 얻을수 있을까?- 내 리포트와 질의를 사용하는 사람은 누구인가?- 이 리포트의 데이터는 어디서 오는가?- 리포트, 질의의 결과로 얻을 수 있는 결론은 무었인가?- 이 리포트와 질의는 품질이 검증된 것인가?- 누가 이 질의를 만들었고 누가 사용하는가?- 특정데이터에 권한이 없을 때 누구와 연락을 취해야 하는가?- 데이터베이스에 문제가 생기면 누구와 연락을 취해야 하는가?