Data Warehouse 1


Data warehouse- introduction

"A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context." -- Barry Devlin, IBM Consultant

A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decision-making process.

A data warehouse is a collection of corporate information derived directly from operational systems and some external data sources. It is a relational database designed for query and analysis rather than for transaction processing.

Extraction, Transformation and Loading

To serve its purpose of facilitating business analysis, a data warehouse must be loaded regularly. To do this, data from one or more operational systems must be extracted and copied into the warehouse. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading.

Extraction-- select data using different methods.

Transformation--validate, clean, integrate, and time stamp data.

Loading--move data into the warehouse.
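
The three ETL steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source lists, the sales table, and the function names are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical rows as they might arrive from two operational systems.
SOURCE_A = [{"id": 1, "name": " Alice ", "amount": "120.50"},
            {"id": 2, "name": "Bob", "amount": "n/a"}]   # invalid amount
SOURCE_B = [{"id": 7, "name": "carol", "amount": "80"}]


def extract():
    """Extraction: select rows from each source and tag their origin."""
    for system, rows in (("A", SOURCE_A), ("B", SOURCE_B)):
        for row in rows:
            yield system, row


def transform(records, loaded_at):
    """Transformation: validate, clean, integrate, and time stamp."""
    for system, row in records:
        try:
            amount = float(row["amount"])      # validate
        except ValueError:
            continue                           # reject rows that fail validation
        name = row["name"].strip().title()     # clean
        key = f"{system}-{row['id']}"          # integrate into one key space
        yield key, name, amount, loaded_at     # time stamp


def load(conn, rows):
    """Loading: move the prepared rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(source_key TEXT, customer TEXT, amount REAL, loaded_at TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(), datetime.now(timezone.utc).isoformat()))
loaded = warehouse.execute(
    "SELECT source_key, customer, amount FROM sales ORDER BY source_key").fetchall()
print(loaded)
```

The row with the unparseable amount is dropped during transformation, so only validated and cleaned rows reach the warehouse table.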

Data warehouse- an Environment, not a product

A data warehouse is not a single software or hardware product you purchase to provide strategic information. It is, rather, a computing environment where users can find strategic information, an environment where users are put directly in touch with the data they need to make better decisions. It is a user-centric environment.

An ideal environment for data analysis and decision support

Flexible and interactive.

100% user-driven.

Very responsive and conducive to the ask-answer-ask again pattern.

Provides the ability to discover answers to complex, unpredictable questions.

Data warehouse- a Blend of many technologies

A data warehouse is a blend of many technologies, all of which work together. The end result is a new computing environment whose purpose is to provide the strategic information every organization needs. This environment must:

Take all the data from the operational systems.

Integrate all the data from the various sources.

Remove inconsistencies and transform the data.

Store the data in formats suitable for easy access for decision making.

Features of data warehouse

Subject Oriented:

Data is categorized and stored by business subject rather than by application.

Integrated:

When data resides in many separate applications in the operational environment, the encoding of data is often inconsistent. For instance, gender might be coded as "male" and "female" in one application and as 0 and 1 in another. The warehouse stores such data in a single, consistent encoding.
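
As a minimal sketch of this kind of integration, the mapping below converts source-specific gender codes into one warehouse code. The application names (crm, billing), the target codes M and F, and the assignment of 0 and 1 are all assumptions made for illustration; the source text does not say which number stands for which value.

```python
# Hypothetical per-application encodings of the same attribute.
# Which of 0/1 means male or female is an assumption for this sketch.
GENDER_CODES = {
    "crm": {"male": "M", "female": "F"},
    "billing": {"0": "M", "1": "F"},
}


def integrate_gender(source, raw_value):
    """Map a source-specific gender code to the warehouse code, or None if unknown."""
    mapping = GENDER_CODES.get(source, {})
    return mapping.get(str(raw_value).strip().lower())


print(integrate_gender("crm", "Male"))     # value from the first application
print(integrate_gender("billing", 1))      # value from the second application
```

Returning None for an unknown code, rather than guessing, lets the load step decide whether to reject the row or flag it for review.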

Time variant:

Data is stored as a series of snapshots, each representing a period of time. A data warehouse contains a place for storing data that is 8 to 10 years old or older, to be used for comparisons, trends, and forecasting.

Non-Volatile:

Typically, data in the data warehouse is not updated or deleted. Once data enters the warehouse it is not changed in any way; it is only loaded and accessed.

Advantages of data warehouse

High query performance, but not necessarily the most current information

Doesn't interfere with local processing at sources: complex queries run at the warehouse, OLTP at the information sources

Information copied at warehouse: can modify, annotate, summarize, restructure, etc.; can store historical information

Security, no auditing

Disadvantages of data warehouse

May rely too heavily on data generated only from TPS

May complicate business processes by institutionalising reports, data for data's sake

Learning curve too long - technical and business aspects

Culture of developing quick and dirty strategic applications

End-users may not have skills for building queries

Availability of data warehousing skills

Data warehouses require high maintenance

Cost of information may outweigh its benefit

Need for data warehouse

Real-time issues: your current systems aren't able to integrate disparate sources of data and keep historical records of those integrations in near real time.

Scalability issues: you have large amounts of historical data that you need to gather into an easily accessible place with common formats, common keys, and common access methods, and you need to ensure that the system is scalable over the next 3 to 5 years.

Avoidance of siloed solution sets: you have many different or disparate solutions already in existence, yet your corporation is unable to answer common questions requiring consistency across your enterprise.

Enterprise-class system of record across historical and integrated data sets: if you have a need for this, you probably need an enterprise data warehouse.

Disparate source systems along with internal and external data sets: if you need to integrate all of these for a single enterprise vision with history, then you need a data warehouse.

Self-service BI: if you eventually want to reach this goal, where users can visualize and construct their own reports, then you probably need an enterprise data warehouse, along with its highly integrated historical facts from all the different sources in your organization.

Consolidation of information resources

Improved query performance

Separate research and decision support functions from the operational systems

Foundation for data mining, data visualization, advanced reporting and OLAP tools

Operational DBMS designs are inadequate for decision support.

Data warehousing separates analytical processing from operational processing by providing a separate architecture for decision support.

A data warehouse makes it possible to analyze current business trends and helps in decision making.

Ref:-http://danlinstedt.com/datavaultcat/why-when-data-warehousing-is-it-relevant/

2. Architecture of Data warehouse

The structure that brings all the components of a data warehouse together is known as the architecture.

Data Warehouse Architectures: Conceptual View

Single-layer

Every data element is stored once only

Virtual warehouse

Two-layer

Real-time + derived data

Most commonly used approach in industry today

Three-layer

Transformation of real-time data to derived data really requires two steps.

Four views regarding the design of a data warehouse

Top-down view

allows selection of the relevant information necessary for the data warehouse

Data source view

exposes the information being captured, stored, and managed by operational systems

Data warehouse view

consists of fact tables and dimension tables

Business query view

sees the perspectives of data in the warehouse from the view of end-user

1)Data sources:

The data sources are all the places where data is stored; the data typically resides in many different locations. Data from operational databases and external sources is extracted using application program interfaces known as gateways.

2)Data storage:

The data storage tier of the architecture is the data warehouse database server, which is a relational database system. Back-end tools and utilities feed data into this bottom tier; they perform the extract, clean, transform, load, and refresh functions.

Data extraction

get data from multiple, heterogeneous, and external sources

Data cleaning

detect errors in the data and rectify them when possible

Data transformation

convert data from legacy or host format to warehouse format

Load

sort, summarize, consolidate, compute views, check integrity, and build indices and partitions

Refresh

propagate the updates from the data sources to the warehouse

Meta data is the data defining warehouse objects. It stores:

Description of the structure of the data warehouse

schema, view, dimensions, hierarchies, derived data definitions, data mart locations and contents

Operational meta-data

data lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)

The algorithms used for summarization

The mapping from operational environment to the data warehouse

Data related to system performance

warehouse schema, view and derived data definitions

Business data

business terms and definitions, ownership of data, charging policies

Data Mart

A subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart.

Independent vs. dependent (directly from warehouse) data mart.

3) OLAP server

Relational OLAP (ROLAP)

Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middleware

Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services

Greater scalability

Multidimensional OLAP (MOLAP)

Sparse array-based multidimensional storage engine

Fast indexing to pre-computed summarized data

Hybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)

Flexibility, e.g., low level: relational, high-level: array

Front-end tools - This tier is the front-end client layer. This layer holds the query tools and reporting tools, analysis tools and data mining tools.

Data Warehouse Models

From the perspective of data warehouse architecture, we have the following data warehouse models:

Virtual Warehouse

Data mart

Enterprise Warehouse

Virtual Warehouse

The view over an operational data warehouse is known as a virtual warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on operational database servers.
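
A virtual warehouse can be illustrated with an ordinary SQL view. The sketch below uses Python's sqlite3 module with an in-memory database; the orders and customers tables and the sales_by_region view are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical operational tables.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'East'), (2, 'West');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0);

    -- The "virtual warehouse": a view, so no data is copied.
    CREATE VIEW sales_by_region AS
        SELECT c.region AS region, SUM(o.total) AS total_sales
        FROM orders o JOIN customers c ON c.customer_id = o.customer_id
        GROUP BY c.region;
""")

QUERY = "SELECT region, total_sales FROM sales_by_region ORDER BY region"
region_totals = conn.execute(QUERY).fetchall()

# A new operational row is visible through the view immediately.
conn.execute("INSERT INTO orders VALUES (13, 2, 10.0)")
refreshed_totals = conn.execute(QUERY).fetchall()
print(region_totals, refreshed_totals)
```

Because every query against the view runs against the operational tables, the analytical workload lands on the operational server, which is why excess capacity is needed there.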

Data Mart

Data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups of an organization.

In other words, data marts contain data specific to a particular group. For example, the marketing data mart may contain data related to items, customers, and sales. Data marts are confined to subjects.

Enterprise Warehouse

An enterprise warehouse collects all the information and the subjects spanning an entire organization

It provides us enterprise-wide data integration.

The data is integrated from operational systems and external information providers.

This information can vary from a few gigabytes to hundreds of gigabytes, terabytes or beyond.

Ref:-http://www.tutorialspoint.com/dwh/dwh_architecture.htm

3. Explain in detail MDDM?

INTRODUCTION OF MDDM

The multidimensional data model is an integral part of On-Line Analytical Processing, or OLAP. Because OLAP is on-line, it must provide answers quickly; analysts pose iterative queries during interactive sessions, not in batch jobs that run overnight. And because OLAP is also analytic, the queries are complex. The multidimensional data model is designed to solve complex queries in real time.

The multidimensional data model is important because it enforces simplicity.

Multi-dimensional Data Models

Classical relations:

One-dimensional (not in the mathematical sense)

Relation maps key onto attributes

However, in many cases in data warehousing one is interested in multiple perspectives (dimensions)

Example: Sales based on product, time, region, customer, store, manager/employee

Cannot be represented with normal relations

Multi-dimensional data models

Multi-dimensional database systems

Component of MDDM

The two primary components of a dimensional model are dimensions and facts.

Dimensions: textual attributes used to analyze the data.

Facts: numeric values used to analyze the business.

Types of MDDM

Star Schema

Snowflake Schema

Fact Constellation Schema

Star Schema

Each dimension in a star schema is represented with only one dimension table.

This dimension table contains the set of attributes.

Consider the sales data of a company with respect to four dimensions, namely time, item, branch, and location.

There is a fact table at the center. It contains the keys to each of four dimensions.

The fact table also contains the attributes, namely dollars sold and units sold.

Note: Each dimension has only one dimension table and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, the cities "Vancouver" and "Victoria" are both in the Canadian province of British Columbia, so the entries for these cities repeat the same values for the attributes province_or_state and country.
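
The star schema described above can be sketched with Python's sqlite3 module. The table names (time_dim, item_dim, branch_dim, location_dim, sales_fact), the reduced column lists, and all sample rows are assumptions made for illustration; only the four dimensions, the two measures, and the location attributes come from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
    CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
    CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
    CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                               province_or_state TEXT, country TEXT);
    -- Fact table at the center: one key per dimension plus the two measures.
    CREATE TABLE sales_fact (
        time_key INTEGER REFERENCES time_dim(time_key),
        item_key INTEGER REFERENCES item_dim(item_key),
        branch_key INTEGER REFERENCES branch_dim(branch_key),
        location_key INTEGER REFERENCES location_dim(location_key),
        dollars_sold REAL, units_sold INTEGER);

    INSERT INTO time_dim VALUES (1, 15, 1, 1997);
    INSERT INTO item_dim VALUES (1, 'Laptop', 'Acme');
    INSERT INTO branch_dim VALUES (1, 'Downtown');
    -- province_or_state and country repeat for each city: the redundancy noted above.
    INSERT INTO location_dim VALUES (1, '1 Main St', 'Vancouver', 'British Columbia', 'Canada'),
                                    (2, '2 Bay St', 'Victoria', 'British Columbia', 'Canada');
    INSERT INTO sales_fact VALUES (1, 1, 1, 1, 1000.0, 2), (1, 1, 1, 2, 500.0, 1);
""")

# A typical star join: the fact table joined to one dimension, then aggregated.
by_province = conn.execute("""
    SELECT l.province_or_state, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact f JOIN location_dim l ON l.location_key = f.location_key
    GROUP BY l.province_or_state
""").fetchall()
print(by_province)
```

Each query needs at most one join per dimension, which is the main practical appeal of the star layout.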

Snowflake Schema

Some dimension tables in the Snowflake schema are normalized.

The normalization splits up the data into additional tables.

Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.

Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.

The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.
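
A minimal sketch of this split, again using sqlite3 with an in-memory database. The attribute names follow the text; the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Supplier attributes are stored once, in their own table.
    CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
    CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT, brand TEXT,
                       supplier_key INTEGER REFERENCES supplier(supplier_key));
    INSERT INTO supplier VALUES (1, 'wholesale');
    INSERT INTO item VALUES (10, 'Laptop', 'computer', 'Acme', 1),
                            (11, 'Mouse', 'accessory', 'Acme', 1);
""")

# Reading supplier_type now takes an extra join through supplier_key.
items_with_supplier = conn.execute("""
    SELECT i.item_name, s.supplier_type
    FROM item i JOIN supplier s ON s.supplier_key = i.supplier_key
    ORDER BY i.item_key
""").fetchall()
print(items_with_supplier)
```

The trade-off is visible here: supplier_type is stored in a single row instead of once per item, at the price of one more join in each query that needs it.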

Note: Due to normalization in the snowflake schema, redundancy is reduced; the schema therefore becomes easier to maintain and saves storage space.

Fact Constellation Schema

A fact constellation has multiple fact tables. It is also known as galaxy schema.

Consider two fact tables, namely sales and shipping.

The sales fact table is the same as that in the star schema.

The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, to_location.

The shipping fact table also contains two measures, namely dollars cost and units shipped.

It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

Schema Definition

Multidimensional schema is defined using Data Mining Query Language (DMQL). The two primitives, cube definition and dimension definition, can be used for defining the data warehouses and data marts.

Syntax for Cube Definition

define cube < cube_name > [ < dimension-list > ]: < measure_list >

Syntax for Dimension Definition

define dimension < dimension_name > as ( < attribute_or_dimension_list > )

Star Schema Definition

The star schema that we have discussed can be defined using Data Mining Query Language (DMQL) as follows:

define cube sales_star [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

Snowflake Schema Definition

Snowflake schema can be defined using DMQL as follows:

define cube sales_snowflake [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier (supplier_key, supplier_type))

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city (city_key, city, province_or_state, country))

Fact Constellation Schema Definition

Fact constellation schema can be defined using DMQL as follows:

define cube sales [time, item, branch, location]:

dollars_sold = sum(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)

define dimension item as (item_key, item_name, brand, type, supplier_type)

define dimension branch as (branch_key, branch_name, branch_type)

define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:

dollars_cost = sum(cost_in_dollars), units_shipped = count(*)

define dimension time as time in cube sales

define dimension item as item in cube sales

define dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)

define dimension from_location as location in cube sales

define dimension to_location as location in cube sales

Ref:- http://www.tutorialspoint.com/dwh/dwh_schemas.htm

Typical OLAP Operations

Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction

Drill down (roll down): reverse of roll-up from higher level summary to lower level summary or detailed data, or introducing new dimensions

Slice and dice: project and select

Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes

Other operations

Drill across: involving (across) more than one fact table

Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
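
The core operations listed above can be sketched in plain Python on a tiny cube. Everything in this example is hypothetical: the cell data, the dimension names, and the helper functions are assumptions chosen to show the idea, not the interface of any OLAP product.

```python
from collections import defaultdict

# Hypothetical base cuboid: (city, country, quarter, item, units_sold).
CELLS = [
    ("Vancouver", "Canada", "Q1", "phone", 10),
    ("Vancouver", "Canada", "Q2", "phone", 20),
    ("Toronto", "Canada", "Q1", "laptop", 5),
    ("Chicago", "USA", "Q1", "phone", 7),
]
DIMS = ("city", "country", "quarter", "item")


def aggregate(cells, keep):
    """Sum the measure over every dimension not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for cell in cells:
        totals[tuple(cell[i] for i in idx)] += cell[-1]
    return dict(totals)


def slice_(cells, dim, value):
    """Slice: select on a single dimension."""
    i = DIMS.index(dim)
    return [c for c in cells if c[i] == value]


def dice(cells, **allowed):
    """Dice: select on two or more dimensions at once."""
    return [c for c in cells
            if all(c[DIMS.index(d)] in vals for d, vals in allowed.items())]


def pivot(table):
    """Pivot: swap the two axes of a 2-D aggregate."""
    return {(b, a): v for (a, b), v in table.items()}


by_city = aggregate(CELLS, ["city", "quarter"])        # detailed level (drill-down target)
by_country = aggregate(CELLS, ["country", "quarter"])  # roll-up: climb city -> country
country_only = aggregate(CELLS, ["country"])           # roll-up by dimension reduction
q1_only = slice_(CELLS, "quarter", "Q1")
canada_phones = dice(CELLS, country={"Canada"}, item={"phone"})
rotated = pivot(by_country)
```

Drill-down is the reverse direction: moving from by_country back to by_city recovers the finer level of detail.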

Ref:-http://www.nyu.edu/classes/jcf/g22.3033-002/slides/session4/DataWarehousingAndOLAP.pdf
