Introduction to Data Warehousing, Profiling and Cleansing

29
COMP33111, 2011/2012 1 2 2 COMP33111 Lecture 2 Introduction to Data Warehousing, Profiling and Cleansing Goran Nenadic School of Computer Science 3 3 Aims Understand the need for data warehousing Learn basic principles of data warehousing Understand data warehousing models and architectures Review the basic approaches to data profiling and cleansing 4 4 Plan Lecture today Tutorial and lab exercise 2 covers practical aspects and examples Lab test 1: 22 October 2012, 14-15 (3 rd year Lab)

Transcript of Introduction to Data Warehousing, Profiling and Cleansing

Page 1: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 1

2 2

COMP33111 Lecture 2

Introduction to

Data Warehousing,

Profiling and Cleansing

Goran Nenadic

School of Computer Science

3 3

Aims

Understand the need for data warehousing

Learn basic principles of data warehousing

Understand data warehousing models and

architectures

Review the basic approaches to data

profiling and cleansing

4 4

Plan

Lecture today

Tutorial and lab exercise 2 covers practical aspects and examples

Lab test 1: 22 October 2012, 14-15 (3rd year Lab)

Page 2: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 2

5 5

Need for data analysis

Modern enterprise environment competition is more intense than ever

quantity of information is increasing

In order to succeed, an organisation must have a comprehensive view of all of its aspects

-> data integration

make informed and reliable decisions

-> data analysis

take timely actions and accurate predictions

recall from the Intro lecture

6 6

Online transaction processing

Traditional “transactional” information, i.e.

operational data that documents everyday life

in an enterprise/organisation

retail (e.g. sales in supermarket stores)

financial services (e.g. ATM withdrawals)

transport (e.g. flight bookings)

telecommunications (e.g. mobile billing, Internet)

healthcare (e.g. drug prescriptions)

This is OLTP or operational or transactional

data

recall from the Intro lecture

7 7

Online transaction processing OLTP: processing and recording transactions

that create new data and/or update existing

information in operational DBs

insertions, updates, deletions

Typically a small number of rows are affected in

each transaction

Traditional DBMS optimised to perform well in

OLTP, but not in comprehensive exploration,

aggregation and decision making

recall from the Intro lecture

Page 3: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 3

8

Data management

Vertical fragmentation of information systems

into numerous, heterogeneous OLTP systems

8

9

Goal: unified access to data

Collect and combine data

Provide integrated view

Support data sharing

9

10

Traditional query-driven approach

Translate a query into a series of queries to the

different operational sources and then combine

the answers

10

Page 4: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 4

11

Traditional query-driven approach

PROS

• Complete data: • all data available at sources

can be used to answer query

• Always “fresh” data

CONS

• Incomplete data: sometimes

data source unavailable

• e.g. broken connection

• Delay in query processing

• slow information sources

• complex filtering and

integration

• Inefficient and potentially

expensive for frequent queries

• Competes with/disturbs local

processing at sources

• Possibly problems with data

access (e.g. personal data)

11

continued

12

Data warehousing approach

Data integrated in advance

12

13 13

Data warehouse (DW)

DW: an integrated database designed to support data analysis, business intelligence, and better and faster decision making

DWs integrate and aggregate data from various operational and external DBs maintained by different units different data representations, inconsistencies etc.

DW needs to provide more complex aggregation and analysis of data

mining “new” data (e.g. spending trends)

Page 5: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 5

14

Data warehouse - definitions

14

“A data warehouse is a subject-oriented, integrated,

time-variant, and non-volatile collection of data in

support of the management’s decision-making

process.”

-- Bill Inmon (~1990)

15 15

Data warehouse - definition

Subject-oriented: the warehouse is organised

around the major subject(s) of the enterprise.

Integrated: it merges various source data from

different applications and systems.

Time-variant - data in the warehouse is only

accurate and valid at some point in time or over

some time interval.

Non-volatile - data is not updated in real time but

is refreshed from operational systems on a

regular basis.

continued

16 16

Data warehouse characteristics

DW typically integrates several resources

e.g. sales DBs from various regions/states/years

Requires more historical data than generally maintained

in operational DBs

Must be optimised for access to very large amounts of

data

Mostly read-accessed, rarely write-accessed

DW data may be more coarse grained than in

operational DBs;

e.g. agregated, summerised, de-personalised

Page 6: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 6

17 17

Data warehouse characteristics

DWs are maintained separately from operational data

functional reasons

“historical” data (e.g. sales over long periods of time)

consolidated data (aggregated, summarised from

various sources, including both internal and external)

performance reasons

avoid degradation of services (operational DBs are not

optimised for “non-transactional” applications)

DW updates may not be so frequent

Difference between real-time and derived data

continued

18 18

General DW role

OLAP

DSS

data mining

applications

data

warehouse

operational DBs

external

DBs

Extract

Transform

Load

ETL

derived data

19

Major DW issues

DW design

Data extraction

monitors (change detectors), updates

Integration

cleansing & merging

DW maintenance and evolution

Optimisations

19

Page 7: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 7

20 20

Data warehouse

modelling and design

21

DW modelling and design

How to organise a DW to facilitate

views/queries with aggregates, e.g., sum,

min, avg, etc

maintenance difficult and

computation costly

need to decide what to pre-compute and when

storage of large amounts of data

21

22 22

DW modelling and design

Modelling questions

What are the user requirements (types of analysis needed)?

Which data should be considered (measures)?

Which data is available (OLTP, integration)?

What is the context of the data?

What are the data inter-dependences (integration)?

What is the scale of the project?

Define data model and design the schema

traditional ER (entity-relationship) modelling techniques may not be appropriate

multidimensional model

continued

Page 8: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 8

23

Multidimensional data model

Designed for the analysis for multi-dimensional

data i.e. data organised along dimensions, e.g.,

time (days, months, weeks, quarters, ...),

space (shop, region, country,...),

product (food, non-food, dairy, imported, blue,...)

customers (private, public, large, old,...),...

Popular data model for DW

23

24 24

Multidimensional data model

Dimensionality modelling

logical design technique that aims to present the data

in a standard, intuitive form that allows for high-

performance access.

Uses the concepts of entity-relationship (ER)

modelling with some restrictions

Focused on key performance indicators that

need to be analysed

continued

25 25

Multidimensional data model

DW: a set of points in a multi-dimensional space e.g. (store, product, amount, time)

time is typically one of the dimensions, as it is of particular significance to decision support

Objects of analysis (key performance indicators) are typically numeric measures can be aggregated

e.g. sales, budget, number of applications, …

Dimensions describe characteristics of the measures, i.e. their “context”

continued

Page 9: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 9

26 26

Example

Sales

Time

Product

Shop

Location

Sales per item

Sales per location

Sales per month

measures

Time Characteristics

Time

Shop Characteristics

Shop

Product Characteristics

Product

Location Characteristics

Location

27

Cube representation

27

3 dimensions: store, product, time

1 measure: amount (amt)

28 28

Multidimensional data example

Dimensions: Product, City, Date

Hyper-cube representation

Date/Time

Pro

duct

Page 10: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 10

29 29

Example – cube

30 30

Example

Dec.

Mar.

Feb.

Apr.

May

Jun.

Jul.

Aug.

Sep.

Oct.

Nov.

Jan.

Mo

nth

Su

pe

rma

rke

t

Mis

ce

lla

ne

ou

s

Re

sta

ura

nt

Tra

ve

l

Re

tail

Ve

hic

le

Category

Region

One

FourThreeTwo

Month = Dec.

Count = 110

Amount = 6,720

Region = Two

Category = Vehicle

multidimensional

cube for credit

card purchases

31

Dimension modelling

Rule of thumb: dimensions involve attributes that

provide the context for measures

Each dimension should have a single base level,

i.e. the level of finest granularity.

“day” can be the base level for the “time” dimension

“products” can be the base level for the “product”

dimension

“store” can be the base level for “location dimension”

31

Page 11: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 11

32

Dimension modelling

Base level elements can then be aggregated or

summarised to more coarse-grained levels

“days” -> weeks, months, quarter, year, ....

“products” -> brand, food/non-food, supplier, ....

“store” -> district, region, country, urban/rural, ....

32

33 33

Dimension modelling - attributes

Each dimension has a set of attributes

e.g. product: category, year, manufacturer

e.g. store: town, county, manager, size

Attributes can be related

hierarchically

lattice

34

Lattice hierarchy

34

Page 12: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 12

35 35

DW data models

For multidimensional implementation in a relational DB, we use two types of tables:

dimension tables tuples of attributes of a dimension

one table for each of the dimensions

fact table contains the data – “facts”

typically only one fact table (potentially with multiple measures)

36 36

Example:

dimension table for product: product ID, name, manufacturer, type, etc.

dimension table for fiscal quarter quarter ID, quarter number, year, BEG/END dates

dimension table for region region ID, county, capital

fact table (product_ID, quarter_ID, region_ID, sales, qty)

continued

DW data models

37

DW data models

product_ID quarter_ID region_ID sales qty

numerical values key columns joining fact table

to dimension tables

continued

Product Info

Quarter Info

Region Info

Page 13: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 13

38 38

DW data models

Each dimension table has a simple primary key product_ID or quarter_ID

Fact table uses a composite primary key the primary key of the fact table is made up of two

or more foreign keys linking to dimension tables.

Such simple model structure is called a star schema

continued

39 39

Star

schema

example

[see separate sheet]

40

Storing aggregated data

Star schema: data of all granularity is kept in a

single fact table

need an aggregation level indicator

“Fact constellation” schema

aggregate tables are kept separate from fact tables

one table per combination of levels of dimensions

40

Page 14: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 14

41

Fact constellation schema

41

42

Snowflake schema

When dimensions have very high cardinality

(many attributes and tuples):

(partly) normalise the dimension tables by

attribute level, with each smaller dimension

table pointing to an appropriate aggregated

fact table

one table per dimension level

(star schema: 1 table per dimension)

different fact tables as in fact constellation

42

43 43

Snowflake schema example

Dimension Branch has its own

dimension City which in turn has a

dimension Region.

Dimension tables are organised in a hierarchy by normalisation

Page 15: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 15

44

Snowflake schema

For each level of a dimension, all tuples of this

level are stored in their own table together with

the attributes for this level plus the key(s) for the

parent “level(s)”, e.g.

time(day,month,week,quarter,year) is split into

time1(day, week, month),

time2a(week, year),

time2b(month, quarter),

time3(quarter, year)

44

continued

45 45

Data model schemas

Star schema: a logical structure that has a fact table containing factual data in the centre, surrounded by dimension tables (for each dimension) containing reference data.

Snowflake schema: a variant of the star schema where dimension tables can further have dimension tables, a kind of fractal structure.

Starflake schema: a hybrid structure that contains a mixture of star and snowflake schemas.

46

Data model schemas

Understand which summary levels exist i.e. will

be used:

understand dimensions

Start with a star schema and then create the

“snowflakes” with queries:

find the proper “snowflaked” attribute tables

46

continued

Page 16: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 16

47 47

Data warehouse

architectures

48 48

Data warehouse architecture

49 49

Architectural components - data

Operational data source data (operational DBs) for the DW

Operational data-store a repository of operational data

Meta-data description of data (data about data)

used for acquisition of data, DW management, and querying

Page 17: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 17

50 50

Architectural components - data

Detailed data aggregation of other data

Lightly and highly summarised data generated by the warehouse manager

e.g. “summary” tables that hold aggregations such as sums, averages, counts of values in other tables

other derived data (“calculated columns”)

Archive/backup data

continued

51 51

Architectural components - tools

Load manager extraction, acquisition and loading of data into the DW

Warehouse manager management of the DW, e.g. indexing, backup

Query manager management of user queries

End-user access tools reporting, querying, application development tools

executive information system (EIS) tools

online analytical processing (OLAP) tools

data mining tools

52 52

Load manager tasks

Data extraction acquisition of data from operational DBs

design changes in operational DBs require adjustments

Data cleansing detect data anomalies and rectify them

Data transformation/reformatting transforming and linking data

Loading loading, building indexes, etc.

Refreshing propagate updates from operational DBs to DW

Page 18: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 18

53

Data quality problems

Instance-level problems

errors and inconsistencies in the actual data contents

noisy, dirty data – e.g. missing values, duplicates, etc.

Schema-level problems

referential integrity

heterogeneous data models

53

54

Data quality problems

54

Rahm, E.; Do, H.H.

Data Cleaning: Problems and Current Approaches

IEEE Techn. Bulletin on Data Engineering, Dec. 2000

continued

55

Examples

55

Page 19: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 19

56

Data cleansing

Several steps:

Data analysis and profiling: data type, illegal values,

length, value ranges, variance, uniqueness, null

values, typical string patterns (e.g. phone numbers) ...

Mapping rules: define functions to clean as much of

inconsistencies as possible (e.g. attribute splitting,

approximate joins, spelling checks, etc)

Verification: has everything worked correctly?

Backflow to cleaned data: highlight the problems

56

Look at tutorials 1 and 2

57 57

Data transformation/reformatting

Link and consolidate data from different sources handle possibly different DBs schemes

parsing and fuzzy matching may be needed

avoid un-necessary redundancy

Reformatting/mapping as appropriate town region state

date month fiscal quarter year

Reconcile semantic mismatches transform into uniform measures (e.g. currency)

e.g. “gender” and “sex” should be the same

58

Data transformation/reformatting

Format revisions (data format of variables)

Decoding of fields (0 = disable; 1 = enable)

Character sets, units of measurement, date/time

conversions

Calculated and derived values

Splitting single fields, merging of multiple fields

Key restructuring (come up with keys for the fact and

dimension tables based on the keys in source systems)

Deduplication

58

typical tasks

Page 20: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 20

59 59

Loading data

“Materialise” prepared data and store it in DW

Large volumes of data to be loaded may take several days to load

can be performed in parallel

Building summary tables (various pre-calculated aggregations)

indexes (supporting most-likely queries)

meta-data (various information about the data)

60 60

Meta-data generation

Meta-data: data about data

Build a meta-data repository

Document the following source DBs

DW schema

loading process description (applied transformation rules, cleaning methods, etc.)

refreshing policy

security issues, user profiles/groups

statistics

etc.

61 61

ETL process

Previous tasks also known as ETL

extract

transform

load

data

warehouse

operational DBs

external

DBs

Extract

Transform

Load

ETL

Page 21: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 21

62

62

summary

63 63

Refreshing data in DW

Refreshing intervals e.g. every night, week, or quarterly

typically not so frequent (depending on the subject area; e.g. for stock DWs, up-the-minute data may be required)

Possibly different refreshing periods for different operational DBs

Approaches full extraction and loading (expensive, not efficient)

incremental detect and propagate only changes in operational DBs

logical and transactional correctness and integrity

purging out-of-date data

64

Page 22: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 22

65 65

Data warehouse data flows

66 66

Data warehouse data flows In-flow

extraction, cleansing and loading of the source data

Up-flow adding value to the data in the warehouse through

summarising, packaging and distribution of the data

Down-flow archiving and backing-up data in the warehouse

Out-flow making the data available to end-users

Meta-flow managing the meta-data

continued

67

67

Page 23: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 23

68 68

DW architecture types

Central warehouse

Distributed warehouse needs replication, partitioning, communication

replicated metadata repository on each site

Federated warehouse decentralised autonomous warehouses

(e.g. containing data marts)

Data web-house a distributed data warehouse that is implemented

over the Web with no central data repository

69 69

Data marts

Enterprise warehouse: collects all information

about subjects that span the entire organisation

products, sales, customers, locations, finance ...

requires extensive business modelling

may take years to design & build

Data marts: special-purpose warehouses

a subset of a data warehouse that supports the

requirements of a particular department or

business function.

70 70

Data marts

Can be standalone or linked centrally to the

corporate data warehouse

can be built without enterprise-DW: faster roll out, but

complex integration in the long run

Still, often used for building a DW

loading and consolidation from data marts

Good to control/allow access to data

continued

Page 24: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 24

71 71

Data warehousing:

trends and open issues

72 72

Multi-dimensional model

Benefits Predictable, standard framework

Respond well to changes in user reporting needs

Relatively easy to add data without reloading tables

Standard design approaches have been developed

There exist a number of products supporting the dimensional model

Denormalized and indexed relational models more flexible

Multidimensional models simpler to use and more efficient

73 73

Multidimensional model

Typically not in 3NF since performance can be truly awful. Most of the work that

is performed on denormalizing a data model is an attempt to reach performance objectives.

the structure can be overwhelmingly complex. We may wind up creating many small relations which the user might think of as a single relation or group of data.

continued

Page 25: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 25

74 74

Typical problems

Data homogenisation

Complexity of integration

Hidden problems with source systems

Required data not captured

Data ownership/access/privacy policies

Underestimation of resources for data loading

High demand for resources

Increased end-user demands

High maintenance costs

75 75

Challenges

Design and management of the DW project managing design, construction, implementation

long-term and large-scale project

monitoring utilisation patterns and design solutions

Administration of DW an intensive, major activity

includes possible evolutions of data models

Quality of data merging data from heterogonous sources

integration issues (different namings, etc.)

76 76

Open issues

Automating acquisition and loading of operational data

Self-maintainability

Intelligent meta-data extraction/generation

Performance optimisation quite large DBs: many terabytes, growing by as much

as 50% a year

Privacy and security depersonalisation of data, user authentication and

authorization, role-based access control

see [Elmasri & Navathe; sec. 23.3.2]

Page 26: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 26

77 77

Examples of

DW systems

78 78

Examples of DW systems

IBM DB2 Universal Data Warehouse Edition a business intelligence platform that includes DB2, federated data

access, data partitioning, enhanced extract, transform, and load (ETL), workload management etc.

see a case study on EDEKA

Microsoft SQL Server Data Transformation Services ETL data from source(s) to the data warehouse

see a case study on Barnes & Noble (web)

Oracle Business Intelligence Warehouse Builder dimensional design, ETL, deployment manager

see tutorials/demos/documents on the web

79 79

Page 27: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 27

80 80

81

Demos and tutorials PALO OLAP (Jedox)

open source, available for download

demos available

Dundas OLAP services demos available

Talend enterprise-ready open source data warehousing

and integration

see case studies (e.g. Virgin Mobile, Sony – gaming analytics)

Radar-soft OLAP and Visual Analysis

See Tutorial 3 for Dundas OLAP and Radar-soft

82 82

Dundas OLAP Services

See Tutorial 3

Page 28: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 28

83 83

Data warehousing:

wrapping up

84 84

London branch Sale branch Manchester branch

Census

data

Operational data

Detailed

transactional

data

Data warehouse Integrate

Clean

Summarise

Direct

Query

Reporting

tools

Mining

tools OLAP

Data analysis

GIS

data

85

General 3x3 DW architecture

85

Page 29: Introduction to Data Warehousing, Profiling and Cleansing

COMP33111, 2011/2012 29

86 86

Steps for building a DW

1. Choose relevant business processes

2. Choose the grain (granulation) of DW

3. Identify the dimensions and model the schema

4. Choose, extract, transform and load the facts

5. Store pre-calculations in the fact table(s)

6. Round out the dimension tables

7. Choose the duration/refresh of the database

8. Track slowly changing dimensions

9. Decide the query priorities and modes

87 87

Summary

DW: an integrated database for supporting higher-level analysis of data analysis using multi-dimensional and pre-aggregated data

data model: multidimensional

Integrates and aggregates operational DBs data profiling and cleansing

data quality

Access to huge amount of data optimised querying, indexing is needed

Huge benefits supports competitive business intelligence

also, scientific data warehouses

88 88

Reading for this lecture

Chapters 31 and 32 in [Connolly & Begg]

Also, chapter 28 (28.1–28.4, 28.7) in [Elmasri & Navathe] Rahm, E.; Do, H.H. Data Cleaning: Problems and Current Approaches

IEEE Techn. Bulletin on Data Engineering, Dec. 2000

Other on-line materials: Blackboard and

http://www.cs.manchester.ac.uk/ugt/COMP33111/

Additional reading (on the web) Chaudhuri, Dayal: An overview of data warehousing and OLAP technology

(http://portal.acm.org/citation.cfm?id=248616)

W.H. Inmon: Building the data warehouse, Wiley, 2002 ISBN: 0471081302

Complete Tutorial and lab exercises 2

Try Dundas OLAP services (Lab/tutorial exercise 3)