Data Warehouse Architecture.ppt

Transcript of Data Warehouse Architecture.ppt

Page 1: Data Warehouse Architecture.ppt

Data Warehouse Architecture

-Processes-

Page 2: Data Warehouse Architecture.ppt

Overview

• Architecture – a technical blueprint stage

• Must support three major driving forces:

  – Populating the warehouse

    • Data extraction, cleaning and loading

  – Day-to-day management of the warehouse

    • Large volumes of data; creating and deleting summaries

  – The ability to cope with requirement evolution

    • Cope with future changes in query profiles

Page 3: Data Warehouse Architecture.ppt

Typical Process Flow

• Extract and load the data

• Clean and transform data into a form that provides good query performance

• Backup and archive data

• Manage queries, and direct them to appropriate data sources

Page 4: Data Warehouse Architecture.ppt

Extract & Load Process

• Extract

  – Takes data from source systems and makes it available to the data warehouse

• Load

  – Takes extracted data and loads it into the data warehouse

• Data in operational systems is held in a form suitable for that system

• Before loading the data into the DW, information content must be reconstructed

• Data must become value-added business information

  – The extract & load process must take data and add context and meaning
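A minimal sketch of what "adding context and meaning" could look like during extract and load. The field names (`cust_no`, `stat`), the code table and the `add_context` helper are all hypothetical and chosen only for illustration; they are not from the slides.

```python
# Hypothetical code table: the operational system stores terse status codes.
STATUS_MEANING = {"A": "active", "C": "cancelled"}


def add_context(raw, source_system, extracted_on):
    """Turn an operational record into a self-describing warehouse record.

    Codes are decoded into business terms, and the record is stamped with
    where and when it was extracted.
    """
    return {
        "customer_id": raw["cust_no"],
        "status": STATUS_MEANING.get(raw["stat"], "unknown"),
        "source_system": source_system,
        "extracted_on": extracted_on,
    }
```

For example, `add_context({"cust_no": 7, "stat": "C"}, "billing", "2024-01-31")` yields a record whose status reads "cancelled" rather than "C".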

Page 5: Data Warehouse Architecture.ppt

Issues with the Extract & Load Process (ELP)

• When to start extracting the data, run transformations and consistency checks, and so on?

  – A controlling mechanism is essential to fire each module when appropriate

• When to extract?

  – Data must be in a consistent state

  – Start extracting data from a data source when it represents the same snapshot of time as all other data sources

    • E.g. customer database
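The controlling mechanism can be sketched roughly as follows. This is only an illustration under stated assumptions: `SourceStatus`, `sources_ready`, `lagging_sources` and `run_controller` are hypothetical names, and "same snapshot" is approximated here as "every source has closed out at least the target date".

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SourceStatus:
    """Hypothetical record of the latest date a source has fully closed out."""
    name: str
    snapshot_date: date


def sources_ready(statuses, target):
    """True only when every source has reached the target snapshot date."""
    return all(s.snapshot_date >= target for s in statuses)


def lagging_sources(statuses, target):
    """Names of sources that have not yet reached the target snapshot date."""
    return [s.name for s in statuses if s.snapshot_date < target]


def run_controller(statuses, target, modules):
    """Fire each module in order, but only once all sources are consistent.

    `modules` is a list of (name, callable) pairs such as extract,
    transform and consistency check. Returns the names of the modules
    that actually ran (an empty list if the sources were not ready).
    """
    if not sources_ready(statuses, target):
        return []
    fired = []
    for name, step in modules:
        step()
        fired.append(name)
    return fired
```

A real scheduler would also retry later and handle failures; the point here is only that no module fires while any source lags behind the chosen snapshot.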

Page 6: Data Warehouse Architecture.ppt

Issues…

• Loading the data

  – Extracted data are loaded into a temporary data store to perform clean-up and check for consistency

  – Do not execute consistency checks until all the data sources have been loaded into the temporary data store

    • E.g. customer canceling subscriptions

  – Error recovery must be an integral part of the design

  – The effort required to clean up the source systems increases exponentially with the number of overlapping data sources
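One way to enforce "no consistency checks until everything is loaded" is sketched below. The `StagingArea` class, the source names and the particular check (cancellations that refer to unknown customers) are assumptions made for illustration; the slides name the cancellation example but do not spell out the check.

```python
class StagingArea:
    """Hypothetical temporary data store that holds extracts before cleaning."""

    def __init__(self, expected_sources):
        self.expected = set(expected_sources)
        self.tables = {}

    def load(self, source, rows):
        """Store one source's extract; reject sources nobody registered."""
        if source not in self.expected:
            raise ValueError(f"unexpected source: {source}")
        self.tables[source] = list(rows)

    def all_loaded(self):
        """True once every expected source has delivered its extract."""
        return self.expected == set(self.tables)

    def orphaned_cancellations(self):
        """Cancellations whose customer_id is missing from the customer extract.

        Raises RuntimeError if called before every source is loaded, because
        a partial load would report inconsistencies that are not real.
        """
        if not self.all_loaded():
            missing = sorted(self.expected - set(self.tables))
            raise RuntimeError(f"sources not yet loaded: {missing}")
        known = {row["customer_id"] for row in self.tables["customers"]}
        return [
            c for c in self.tables["cancellations"]
            if c["customer_id"] not in known
        ]
```

Running the check with only the cancellations loaded would flag every cancellation as orphaned, which is why the sketch refuses to run early.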

Page 7: Data Warehouse Architecture.ppt

Issues…

• Copy management tools and clean-up

  – E.g. IBM’s Information Warehouse Framework

    • Data Refresher & Data Hub

  – Most copy management tools cannot perform consistency checks directly (the user must write the logic and code it)

  – Make a cost-benefit analysis before purchasing a copy management tool

    • If source systems do not overlap, then consistency checks are very simple

Page 8: Data Warehouse Architecture.ppt

Clean and Transform Data

• Steps involved are:

  – Clean and transform the loaded data into a structure that speeds up queries

  – Partition the data to speed up queries, optimize hardware performance and simplify DW management

  – Create aggregations to speed up the common queries
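The partitioning and aggregation steps can be illustrated with a small in-memory sketch. The row layout (`date`, `region`, `amount`) and both function names are hypothetical; a real warehouse would do this in the database rather than in Python.

```python
from collections import defaultdict


def partition_by_month(rows):
    """Group fact rows into partitions keyed by 'YYYY-MM' from an ISO date."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["date"][:7]].append(row)
    return dict(partitions)


def aggregate_sales(rows, key):
    """Pre-compute the total amount per distinct value of `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)
```

A query for one month then only touches that month's partition, and a common query such as "sales per region" reads the small pre-computed summary instead of scanning every row.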

Page 9: Data Warehouse Architecture.ppt

Clean and Transform Data…

• Data needs to be cleaned and checked in the following ways:

  – Make sure data is consistent with itself

  – Make sure data is consistent with other data within the same source

  – Make sure data is consistent with data in the other source systems

  – Make sure data is consistent with the information already in the DW

• Once data is cleaned, convert source data into a structure that is designed to balance query performance and operational cost

  – The structure must be suitable for long-term storage
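The four levels of consistency checking could be sketched as four small predicates. The concrete rules below (date ordering, unique ids, ids known to another source, unchanged start dates) are assumed examples of each level, not rules taken from the slides, and the row layout is hypothetical.

```python
def check_self(row):
    """Consistent with itself: the end date is not before the start date."""
    return row["start"] <= row["end"]  # ISO date strings compare correctly


def check_within_source(rows):
    """Consistent within the same source: no duplicate ids in one extract."""
    ids = [r["id"] for r in rows]
    return len(ids) == len(set(ids))


def check_across_sources(rows, other_ids):
    """Consistent with other sources: every id is known to the other system."""
    return all(r["id"] in other_ids for r in rows)


def check_against_warehouse(rows, warehouse_start_dates):
    """Consistent with the DW: a record already stored keeps its start date."""
    return all(
        warehouse_start_dates.get(r["id"], r["start"]) == r["start"]
        for r in rows
    )
```

Each check needs progressively more data to be available, which is one reason the later checks can only run after all sources are loaded.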

Page 10: Data Warehouse Architecture.ppt

Backup & archive

• Regular backup is essential to recover from data loss

• Archiving

  – Older data is removed from the system in a format that allows it to be quickly restored if required

  – Issue

    • As the DW evolves, all information may change

    • Hence, to ensure that a restored archive is valid, it becomes necessary to extract all related data and structures as well
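The archiving issue can be made concrete with a sketch that stores the related structures next to the data. The archive layout (column list plus lookup tables, serialized as JSON) and the names `build_archive` and `restore_archive` are assumptions for illustration only.

```python
import json


def build_archive(table, schema, rows, lookups):
    """Bundle rows with the structures needed to interpret them later.

    `schema` is the column list at archive time; `lookups` maps a column
    name to the code table that was valid when the data was archived.
    """
    return json.dumps(
        {"table": table, "schema": schema, "lookups": lookups, "rows": rows}
    )


def restore_archive(blob):
    """Rebuild records using the archived schema and lookup tables,
    not whatever the warehouse looks like today."""
    archive = json.loads(blob)
    columns = archive["schema"]
    restored = []
    for values in archive["rows"]:
        record = dict(zip(columns, values))
        for column, mapping in archive["lookups"].items():
            # JSON object keys are always strings, so normalise the code.
            record[column] = mapping[str(record[column])]
        restored.append(record)
    return restored
```

If region code 10 is later reassigned in the live warehouse, the archived rows still decode to the meaning they had when they were written.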

Page 11: Data Warehouse Architecture.ppt

Query Management Process

• It is a system process

  – Manages the queries

  – Speeds them up by directing queries to the most effective data source

  – Ensures that all system resources are used effectively

  – Monitors query profiles to determine which aggregations to generate

  – This process operates at all times
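A rough sketch of query routing and profile monitoring follows. The `QueryManager` class and the table names are hypothetical, and "most effective data source" is approximated as "the smallest aggregation whose columns cover the query", which assumes the measures can be summed further.

```python
from collections import Counter


class QueryManager:
    """Hypothetical router that prefers the smallest table able to answer."""

    def __init__(self, detail_table):
        self.detail_table = detail_table
        self.aggregations = {}     # name -> (frozenset of columns, row count)
        self.profiles = Counter()  # frozenset of columns -> times requested

    def register_aggregation(self, name, columns, row_count):
        self.aggregations[name] = (frozenset(columns), row_count)

    def route(self, columns):
        """Record the query profile and return the table that should serve it."""
        wanted = frozenset(columns)
        self.profiles[wanted] += 1
        candidates = [
            (rows, name)
            for name, (cols, rows) in self.aggregations.items()
            if wanted <= cols
        ]
        if candidates:
            return min(candidates)[1]  # fewest rows wins
        return self.detail_table

    def suggest_aggregations(self, min_hits):
        """Frequent profiles that no existing aggregation can serve."""
        return [
            sorted(cols)
            for cols, hits in self.profiles.items()
            if hits >= min_hits
            and not any(cols <= agg for agg, _ in self.aggregations.values())
        ]
```

Profiles that keep falling through to the detail table are the candidates for new aggregations; a fuller version would also retire aggregations that are never used.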