ETL Process-Training.ppt
Building the Warehouse: Concepts
Satyam Computer Services Ltd.
Introduction to Building a Data Warehouse
Data Warehousing Architecture
Operational Systems
Information Transformation/Migration Infrastructure
External Systems
Enterprise Data Warehouse
Finance Datamart (Independent)
Sales Datamart (Dependent)
Marketing Datamart (Dependent)
Web Server
Light Clients
Replication Services
LAN Clients
Data Warehousing Architecture
Data Stores
Legacy System
Metadata Repository
Staging Area
Extraction/Transformation Server
To Warehouse/Datamart
• Metadata Design/Mgmt
• Scrubbing Tool
• Mapping Tool
• Extraction Mgmt Tool
• Transformation Tool
• Migration Mgmt Tool
Building a Datawarehouse
• Extracting, Transforming, and Transporting Data
• Extracting Data
• Extraction Techniques
• Extraction Tools
Steps Involved in Building a Datawarehouse
Extraction Phase
• Examine the source data and identify the extraction tool.
• Extracts are typically written in source-system code (e.g. PL/SQL, VB Script, or COBOL).
• The extraction tool also generates source-system code.
• Using a tool for extraction makes the process easier than hand-coding.
• Pre- and post-processes may exist. E.g. before the extract there may be a call to sort the data, or a call to a function that scores a record based on a formula.
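The pre-process idea above (a sort before the extract, plus a scoring function) can be sketched in Python; the record layout and scoring formula here are hypothetical, for illustration only:

```python
# Illustrative pre-extract steps: sort the records, then score each one
# with a formula before handing them to the extract. The fields and the
# scoring weights are hypothetical.
records = [
    {"id": 3, "revenue": 1200.0, "orders": 4},
    {"id": 1, "revenue": 800.0, "orders": 10},
    {"id": 2, "revenue": 500.0, "orders": 2},
]

def score(record):
    # Hypothetical formula: weight revenue against order count.
    return record["revenue"] * 0.7 + record["orders"] * 25.0

# Pre-process 1: sort by key so the extract reads records in a stable order.
records.sort(key=lambda r: r["id"])

# Pre-process 2: score each record before extraction.
for r in records:
    r["score"] = score(r)

print(records[0]["score"])  # 800*0.7 + 10*25.0 = 810.0
```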
Transformation Phase
• Importance of quality data
• Creating business rules
• Tools are available to create reusable transformation modules or objects
• Simple data transformations, including date, number, and character conversion
• Assigning surrogate keys
• Combining data from separate sources
• Validating one-to-one and one-to-many relationships
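A few of the transformations listed above (date and number conversion, surrogate key assignment, combining separate sources) can be sketched in Python; the source layouts and field names are illustrative assumptions, not from the original deck:

```python
from datetime import datetime
from itertools import count

# Two hypothetical sources with different date and number formats.
source_a = [{"cust": "John Istin", "order_date": "05-AUG-1998", "amount": "120.50"}]
source_b = [{"cust": "Mary Poe", "order_date": "1998/08/08", "amount": "75"}]

surrogate = count(1)  # surrogate keys are system-generated, not taken from a source

def transform(row, date_fmt):
    return {
        "order_key": next(surrogate),                                   # surrogate key
        "cust": row["cust"],
        "order_date": datetime.strptime(row["order_date"], date_fmt).date(),  # date conversion
        "amount": float(row["amount"]),                                 # number conversion
    }

# Combine the separately formatted sources into one uniform target set.
target = [transform(r, "%d-%b-%Y") for r in source_a] + \
         [transform(r, "%Y/%m/%d") for r in source_b]
print(target[0]["order_key"], target[1]["order_date"])  # 1 1998-08-08
```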
Transporting Phase
• Insert statements create logs.
• A bulk loader is advisable.
• Truncate target tables before a full refresh.
• Index management: drop and reindex.
Refresh Phase
• Process Slowly Changing Dimensions
• Automate the Extract-Transform-Load Cycle.
• Incremental Fact Table Extracts.
• Purging and Archiving Data.
Extracting Data
The process of getting data from a legacy system or any other data source. After extraction, the data is put in a staging area where it can be scrubbed and cleaned.
The data may come from a single source or from multiple sources. If it comes from multiple sources, a connector tool is required to connect them.
If the data comes from a single source, it can come from an OLTP system or from a flat file.
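A minimal Python sketch of the two single-source cases, using an in-memory sqlite3 database to stand in for an OLTP source; the file layout and table names are hypothetical:

```python
import csv
import io
import sqlite3

# Case 1: extract from a flat file (an in-memory CSV stands in for the file).
flat_file = io.StringIO("cust_id,name\n1,John\n2,Mary\n")
file_rows = list(csv.DictReader(flat_file))

# Case 2: extract from an OLTP-style table (sqlite3 stands in for the RDBMS).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (cust_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 99.0), (2, 45.5)])
db_rows = conn.execute("SELECT cust_id, total FROM orders").fetchall()

# Both extracts land in a staging structure before cleaning and scrubbing.
staging = {"customers": file_rows, "orders": db_rows}
print(len(staging["customers"]), len(staging["orders"]))  # 2 2
```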
Extraction Process in Detail
Extracting Data
• Tools have a well-defined, disciplined approach and documentation.
• Tools provide an easier way to perform extraction through click, drag, and drop features.
The extraction process can be done either by hand coding or by using tools. Each has advantages and disadvantages: custom-programmed extraction (PL/SQL scripts) versus tool-based extraction.
Extracting Data
• Hand-coded extraction is cost-effective, since the PL/SQL constructs are available with the RDBMS.
• Hand-coded extraction is used when the programmer clearly knows the data structures involved.
Extraction Techniques
• Bulk Extraction – The entire data warehouse is refreshed periodically by extractions from the source systems. All applicable data are extracted from the source systems for loading into the warehouse. This approach uses the network connection heavily when loading data from source to target databases, but it is easy to set up and maintain.
Extraction Methods.
Extraction Techniques
• Change-Based Replication – Only data that have been newly inserted or updated in the source systems are extracted and loaded into the warehouse. This approach uses the network less because of the smaller volume of data to be transported, but it involves complex programming to determine when a new warehouse record must be inserted and when an existing warehouse record must be updated.
Extraction Methods.
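The insert-versus-update decision that change-based replication requires can be sketched in Python; the keys and fields are hypothetical:

```python
# Existing warehouse rows, indexed by natural key (hypothetical layout).
warehouse = {101: {"cust_id": 101, "city": "Pune"}}

# Changes extracted from the source since the last load.
changes = [
    {"cust_id": 101, "city": "Mumbai"},   # updated in the source
    {"cust_id": 102, "city": "Chennai"},  # newly inserted in the source
]

inserted, updated = 0, 0
for row in changes:
    key = row["cust_id"]
    if key in warehouse:
        warehouse[key].update(row)   # existing warehouse record: update it
        updated += 1
    else:
        warehouse[key] = row         # new warehouse record: insert it
        inserted += 1

print(inserted, updated)  # 1 1
```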
Extraction Techniques
Hand Coding Development practices
• Set up header and comment fields for the code.
• Stick to the naming standards.
• Test everything: both unit testing and system testing.
• Document everything.
Extracting Data
• The source system platform and database: Tools cannot access all types of data sources on all types of computing platforms.
• Built-in extraction or duplication functionality: The availability of built-in extraction or duplication reduces the technical difficulties inherent in the data extraction process.
Criteria for Identifying Extraction Tool.
Extracting Data
• The batch windows of the operational systems: Some extraction mechanisms are faster or more efficient than others. The batch windows of the operational system determine the time frame available for extraction.
Criteria for Identifying Extraction Tool.
Extraction Tools
Extraction Tools include
– Apertus Carleton: Passport
– Evolutionary Technologies: ETL Extract
– Platinum: InfoPump
TRANSFORMING DATA
Transforming Data
• IMPORTANCE OF QUALITY DATA.
• TRANSFORMATION
• TRANSFORMING DATA : PROBLEMS AND
SOLUTIONS
• TRANSFORMATION TECHNIQUES
• TRANSFORMATION TOOLS
Importance of Quality Data
Quality Data:
Before the extracted data is transformed, the quality of the data has to be examined. Once quality data is transformed, there is minimal need to change the data at the target, which reduces inconsistencies between source and target.
Data Quality Assurance
Characteristic of Quality Data
• Accurate
• Complete
• Consistent
• Unique
• Timely
Data Quality Assurance
Data quality tools assist warehousing teams with the task of locating and correcting data errors.
Corrections can be made at the source or at the target. But when corrections are made at the target, inconsistencies arise between the source and target data, which creates synchronization problems.
Data Quality Tools
Though dirty data continues to be one of the biggest issues for data warehousing initiatives, research indicates that data quality investments are a small percentage of total warehouse spending.
– DataFlux: Data Quality Workbench
– Pine Cone Systems: Content Tracker
– Prism: Quality Manager
– Vality Technology: Integrity Data Reengineering
Transformation
Transformation :
Transformation is the process by which extracted data are converted into the appropriate format. The extracted data is put into the staging area, where cleaning and scrubbing take place, and is stored so that transformation of the clean data can follow. Data can also come to the transformation phase from a cleansing tool. After transformation, the data goes to the transportation stage.
Transforming Data: Problems and Solutions
The common problems of data that comes out of a legacy system are:
• Inconsistent or incorrect use of codes and special characters.
• A single field used for unofficial or undocumented purposes.
• Overloaded codes.
• Evolving data.
• Missing, incorrect, or duplicate values.
Transforming Data: Problems and Solutions
Different solutions are available to ensure that the data to be loaded is correct:
– Cross-Footing: A template of quality data norms can be used to identify erroneous data by comparing it with the norms in the template.
– Manual Examination: A sampling methodology can be selected and a manual examination made of the sampled data.
– Process Validation: Scripts can be generated that identify erroneous records and segregate them.
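A process-validation script of the kind described can be sketched in Python; the norms and the row layout are hypothetical:

```python
# Hypothetical quality norms: one check per field.
norms = {
    "age": lambda v: v is not None and 0 <= v <= 120,
    "country": lambda v: v in {"IN", "US", "GB"},
}

rows = [
    {"age": 34, "country": "IN"},
    {"age": 250, "country": "US"},   # out-of-range age
    {"age": 41, "country": "XX"},    # unknown country code
]

# Identify erroneous rows and segregate them from the clean ones.
clean, erroneous = [], []
for row in rows:
    if all(check(row[field]) for field, check in norms.items()):
        clean.append(row)
    else:
        erroneous.append(row)

print(len(clean), len(erroneous))  # 1 2
```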
Transformation Techniques
Field Splitting and Consolidation:
A single physical field in the source system may need to be split into more than one target warehouse field. Conversely, several source system fields may need to be consolidated and stored in one single warehouse field.
Example: the address field "# 123 ABC Street, DEF City, Republic of GH" splits into:
No: 123, Street: ABC Street, City: DEF, Country: GH
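The address split above can be sketched in Python; the parsing rule assumes the exact "# no street, city, Republic of country" layout shown and is illustrative only:

```python
# One free-form source field becomes several warehouse fields.
address = "# 123 ABC Street, DEF City, Republic of GH"

parts = [p.strip() for p in address.lstrip("# ").split(",")]
no, street = parts[0].split(" ", 1)
target = {
    "no": no,
    "street": street,
    "city": parts[1].removesuffix(" City"),
    "country": parts[2].removeprefix("Republic of ").strip(),
}
print(target)  # {'no': '123', 'street': 'ABC Street', 'city': 'DEF', 'country': 'GH'}
```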
Transformation Techniques
Standardization : Standards and conventions for abbreviations are applied to individual data items to improve uniformity in both source and target objects.
Before: System A, Order Date: 05 August 1998; System B, Order Date: 08-08-98
After: System A, Order Date: August 05 1998; System B, Order Date: August 08 1998
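The standardization example can be sketched in Python, assuming System A uses a "day month-name year" convention and System B uses "dd-mm-yy" (these conventions are inferred from the example):

```python
from datetime import datetime

# Normalize each system's date convention to one standard target format.
def standardize(value, source_fmt):
    return datetime.strptime(value, source_fmt).strftime("%B %d %Y")

print(standardize("05 August 1998", "%d %B %Y"))  # August 05 1998 (System A)
print(standardize("08-08-98", "%d-%m-%y"))        # August 08 1998 (System B)
```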
Transformation Techniques
Deduplication: Rules are defined to identify duplicate records of customers or products. When two or more repeated records are found, they are merged to form one warehouse record.
System A, Customer Name: John W Istin
System B, Customer Name: John William Istin
Merged warehouse record, Customer Name: John William Istin
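A toy deduplication rule for the example above can be sketched in Python; real data quality tools use far richer matching rules than this:

```python
# Two names refer to the same customer when first and last tokens match and
# the middle names are compatible (an initial matches the full middle name).
def same_customer(a, b):
    ta, tb = a.split(), b.split()
    if ta[0] != tb[0] or ta[-1] != tb[-1]:
        return False
    mid_a, mid_b = " ".join(ta[1:-1]), " ".join(tb[1:-1])
    # "W" is compatible with "William"; a missing middle name matches anything.
    return not mid_a or not mid_b or mid_a.startswith(mid_b) or mid_b.startswith(mid_a)

def merge(a, b):
    # Keep the longer (more complete) name as the surviving warehouse record.
    return max(a, b, key=len)

a, b = "John W Istin", "John William Istin"
record = merge(a, b) if same_customer(a, b) else None
print(record)  # John William Istin
```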
Transformation Tools
Some of the transformation tools include:
– Apertus Carleton: Enterprise/Integrator
– DataMirror: Transformation Server
– Informatica: PowerMart Designer
TRANSPORTATION
Transporting the Data
• TRANSPORTING DATA INTO WAREHOUSE
• BUILDING THE TRANSPORTATION PROCESS
• TRANSPORTING THE DATA
• POST PROCESSING OF LOADED DATA
Transporting Data into Warehouse
The transformed data is then transported into the data warehouse. The load images are transported through the loaders into the warehouse.
Data Loaders :
Data loaders load transformed data into the data warehouse.
Stored procedures can be used to handle the warehouse loading if the load images are available in the same RDBMS engine.
Transporting Data into Warehouse
Source Data --(Extract)--> Staging Area --(Load)--> Warehouse Schema
Transporting Data into Warehouse
Warehouse Schema: the dimensional model (dimensions and facts).
Staging Area: a workspace where data is made ready after cleaning. This minimizes the time required to prepare the data.
Source Data: can be a flat file, an Oracle table, or some other form.
First, the source data (flat file, Oracle table, or other) comes to the staging area. This extraction from the source, and loading into the staging area after cleaning, can be done through a tool, PL/SQL, or SQL*Loader.
In the staging area the data can be transformed into the required format. After transformation, the data can be moved to the warehouse through the tool or through PL/SQL scripts.
Transporting Data into Warehouse
Building the Transporting Process
For transporting data we can use: PL/SQL scripts, SQL*Loader routines for flat files, or an ETL tool.
With PL/SQL scripts we can load data into the warehouse from one or more source tables or files. We use PL/SQL for adding surrogate keys to the tables and for doing some transformations. Transformations are done based on the requirements, and also to store the data in a way that increases performance.
Building the Transporting Process: Using PL/SQL
Similarly, we can use SQL*Loader to load data directly from flat files into tables. We use this for bulk loading. SQL*Loader can load both variable-length and fixed-format files.
Building the Transporting Process: Using SQL*Loader
We can also use a tool for this purpose. Tools provide graphical features: you map the source to the target, add the required transformations, and the tool automatically generates the script for transporting the data to the target.
Tools: Oracle Warehouse Builder, Informatica
Building the Transporting Process: Using Tools
Transporting the Data
After building the process, data is loaded into the warehouse. For the PL/SQL process this is done by executing the procedures, and for SQL*Loader routines by running the routines.
Post Processing of Loaded Data
Scheduling of Jobs
Oracle Enterprise Manager (OEM) or an Oracle package (DBMS_JOB) can be used for this purpose. All jobs or procedures can be scheduled according to the loading requirements. In OEM you submit a job for scheduling and set its interval; you can alter this setting later.
OEM internally uses DBMS_JOB for all its scheduling purposes. DBMS_JOB is a package that can be used for scheduling: you schedule a job and set its interval by writing a procedure for it, and the job is then executed automatically at the interval set for it.
Post Processing of Loaded Data
create or replace procedure schedule_job is
  job_no number;
begin
  -- Submit the (assumed existing) procedure insert_temp as a job,
  -- starting now and repeating every 1/48 of a day, i.e. every 30 minutes.
  dbms_job.submit( job_no,
                   'insert_temp;',
                   sysdate,
                   'sysdate + 1/48' );
  -- DBMS_JOB changes take effect only after commit.
  commit;
  dbms_output.put_line('job ' || to_char(job_no));
end;
Post Processing of Loaded Data
Datawarehouse Building
[Diagram: transaction data from operational sources A, B, and C is extracted, transformed, and categorized into the data warehouse, giving the analytical user a single unified view across all three sources.]
ETVL Tools
The following are popular ETVL tools:
• Oracle Warehouse Builder.
• Informatica.
• Sagent.
• SAS Warehouse Administrator.
ETVL Tools
Oracle Warehouse Builder - Key Features
• Easy to Use - Graphical Design.
• Wizard driven Interface.
• Integrated metadata via the Common Warehouse Metamodel (CWM).
• Tightly Integrated with Oracle 8i.
• A library of pre-defined transformations is available.
ETVL Tools
Oracle Warehouse Builder - Key Features
• Graphical mapping and Transformation design.
• Automated Code Generation.
• Support for Heterogeneous Sources.
LEAVING A METADATA TRAIL
• DEFINING WAREHOUSE METADATA
• DEVELOPING A METADATA STRATEGY
• EXAMINING TYPES OF METADATA
• METADATA MANAGEMENT TOOLS
• COMMON WAREHOUSE METADATA
DEFINING WAREHOUSE METADATA
Metadata
What is Metadata?
• Traditionally defined as data about data
• A form of abstraction that describes the structure and contents of the data warehouse
Metadata
• Metadata is more comprehensive and transcends the data.
– Metadata provides the format and name of data items.
– It provides the context in which the data element exists.
– It gives the domain of possible values,
– the relation that a data element has to others,
– the data's business rules,
– and even the origin of the data.
Importance of Metadata
• Metadata establish the context of the warehouse data. Metadata help warehouse administrators and users locate and understand data items, both in the source systems and in the warehouse data structures.
E.g.: The date 02/05/98 could mean either May 2, 1998 or February 5, 1998 depending on the date convention used. Metadata describing the format of this date field could help determine the definite and unambiguous meaning of the data item.
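This use of metadata can be sketched in Python: a metadata entry records each source field's date convention, making the ambiguous string unambiguous (the field names and formats here are hypothetical):

```python
from datetime import datetime

# Hypothetical metadata: each source field's recorded date convention.
metadata = {
    "orders.ship_date": "%m/%d/%y",   # month-first convention
    "invoices.due_date": "%d/%m/%y",  # day-first convention
}

def interpret(field, value):
    # The field's metadata makes the ambiguous string definite.
    return datetime.strptime(value, metadata[field]).strftime("%B %d, %Y")

print(interpret("orders.ship_date", "02/05/98"))   # February 05, 1998
print(interpret("invoices.due_date", "02/05/98"))  # May 02, 1998
```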
• Metadata facilitate the analysis process. Metadata must provide warehouse end-users with the information they need to easily perform the analysis steps. It should thus allow users to quickly locate data that are in the warehouse.
Metadata should allow analysts to interpret data correctly by providing information about data formats and data definitions.
Importance of Metadata
• Metadata are a form of audit trail for data transformation. Metadata document the transformation of source data into warehouse data. Hence warehouse metadata must be capable of explaining how a particular piece of warehouse data was derived from the operational systems.
All business rules governing the transformation of data to new values or new formats are also documented as metadata.
Importance of Metadata
This kind of audit trail is required:
- to build the user’s confidence regarding the veracity and quality of warehouse data
- to know where the data came from so that the user has a good understanding of warehouse data
- by some warehousing products that use this type of metadata to generate extraction and transformation scripts for use in the warehouse back-end
Importance of Metadata
• Metadata improve or maintain data quality. Metadata can improve or maintain warehouse data quality through the definition of valid values for individual warehouse data items. Using a data quality tool prior to actual loading into the warehouse, the warehouse load images can be reviewed to check for compliance with valid values for key data items. Data errors are quickly highlighted for correction.
Metadata can be used as the basis for any error-correction processing that should be done if a data error is found. Error-correction rules are documented in the metadata repository and executed by program code on an as needed basis.
Importance of Metadata
DEVELOPING A METADATA STRATEGY
METADATA STRATEGY
Metadata organization and administration, which promotes sharing and central management of metadata in a distributed repository architecture;
Content creation and integrity, to maintain consistency of metadata that may be passed among various tools throughout the phases of the project;
METADATA STRATEGY
Component-based metadata sharing, which includes facilities for exchanging metadata among upstream design/modeling tools and downstream analytical tools;
Planning for the future, necessary for ensuring compatibility with emerging metadata and interoperability standards.
EXAMINING
TYPES OF METADATA
METADATA TYPES
• ADMINISTRATIVE METADATA
• END-USER METADATA
• OPTIMIZATION METADATA
Metadata falls into three major categories.
There is the metadata associated with the decision-support database.
– This metadata describes the database structures such as tables, columns and partitions, as well as security settings and operational information.
The second category of data warehouse metadata is used by the end user to navigate the database.
– A query and analysis tool, such as BusinessObjects from Business Objects Inc. or PowerPlay from Cognos Corp., usually creates and manages this metadata.
Metadata falls into three major categories.
The third category is the metadata created by the back-end extract/transformation tool that's used to move data from the source systems to the data warehouse.
– This metadata is primarily concerned with source data definitions, transformation logic and source-to-target data mappings. These tools also must be concerned with process scheduling, maintaining data integrity and error management.
ADMINISTRATIVE METADATA
These contain descriptions of the source databases and their contents, the data warehouse objects, and the business rules to transform data from the sources into the data warehouse.
• Data Sources: Descriptions of all data sources used by the warehouse, including information about the data ownership. Any relationships between different data sources (e.g., one provides data to the other) are also documented.
• Source-to-target field mapping: The mapping of source fields (in operational systems) to target fields (in the data warehouse) explains what fields are used to populate the data warehouse. It also documents transformations and formatting changes that were applied to the original, raw data to derive the warehouse data.
• Warehouse Schema Design: Describes the warehouse servers, databases, database tables, fields, and any hierarchies that may exist in the data. All referential tables, system codes, etc., are also documented.
• Warehouse back-end data structure: Model of the back-end of the warehouse, including staging tables, load image tables, and any other temporary data structures that are used during the data transformation process.
• Warehouse back-end tools or programs: A definition of each extraction, transformation, and quality assurance program or tool that is used to build or refresh the data warehouse.
ADMINISTRATIVE METADATA
• Warehouse architecture: If the warehouse architecture is one where an enterprise warehouse feeds many departmental or vertical data marts the warehouse architecture should be documented. If the data mart contains a logical subset of the data warehouse contents, this subset should also be defined.
• Business Rules and Policies: All applicable business rules and policies are documented. Examples include business formulae for computing costs or profits.
• Access and Security rules: Rules governing the flow of data across various users and their access limitations are documented.
ADMINISTRATIVE METADATA
END-USER METADATA
End-user metadata help users create their queries and interpret the results. They contain:
Warehouse Contents : Must describe the data structure and contents of the data warehouse in user friendly terms. Aliases, rules, summaries and precomputed totals are to be documented.
Predefined Queries & Reports : Queries & reports that have been predefined and documented to avoid duplication of effort.
Business Rules & Policies : All business rules, and changes to these rules over time, should be documented.
Hierarchy Definitions : Hierarchy definitions are important to support drilling up and down warehouse dimensions.
Status Information : Status information is required to inform warehouse users of the warehouse status at any point in time.
Data Quality : Known data quality problems in the warehouse should be clearly documented; this will prompt users to make careful use of warehouse data.
END-USER METADATA
Warehouse Load History : A history of data errors, data volumes, and load schedules should be available.
Warehouse Purging Rules : The rules that determine when data is removed from the warehouse should be known to end-users.
END-USER METADATA
OPTIMIZATION METADATA
Metadata are maintained to aid in the optimization of the data warehouse design and performance.
Aggregate Definitions : All warehouse aggregates should be documented so that front-end tools with aggregate navigation facilities can rely on this type of metadata.
Collection of Query Statistics : It is helpful to track the types of queries made against the warehouse. This helps with optimization and tuning, and also helps identify data that is largely unused.
METADATA MANAGEMENT TOOLS
METADATA MANAGEMENT TOOLS
Metadata Catalog is a generic descriptor for the overall set of metadata used in the warehouse.
Tools are needed for cataloging all of this metadata and keeping track of it. The tool probably can’t read and write all the metadata directly, but it will manage metadata stored in many locations.
The functions and services required for metadata catalog maintenance include:
1. Information catalog integration/merge, from the data model to the database to front-end tools.
2. Metadata management: removing old, unused entries.
3. Capturing existing metadata from the mainframe or other sources.
4. Managing and displaying graphical and tabular representations of the metadata catalog contents (a metadata browser).
5. Maintaining user profiles for application and security use.
6. Security for the metadata catalog.
7. Local or centralized metadata catalog support.
8. Creating remote procedure calls to provide
METADATA MANAGEMENT TOOLS
COMMON WAREHOUSE METADATA
The CWM Metamodel
The CWM metamodel is organized into 18 packages arranged in 4 layers on a UML base (figure next).
CWM's architecture defines each sub-metamodel as an individual package. Because CWM uses modeling techniques that minimize the number of dependencies between its packages, tool integrators can select only those metamodel services they need while avoiding problems common to large, monolithic metamodels.
The CWM Metamodel
[Figure: the CWM metamodel packages by layer, on a UML 1.3 base (Foundation, Behavioral_Elements, Model_Management):
– Management: Warehouse Process, Warehouse Operation
– Analysis: Transformation, OLAP, Data Mining, Information Visualization, Business Nomenclature
– Resource: Object (UML), Relational, Record, Multidimensional, XML
– Foundation: Business Information, Data Types, Expressions, Keys & Indexes, Software Deployment, Type Mapping]
Counts: CWM, 157 classes and 115 associations; CWMX, 130 classes and 77 associations; total, 287 classes and 192 associations.
The Four layers of the CWM collect together different sorts of metamodel packages:
• Base Layer contains the standard UML 1.3 notation and the extensions to support warehouse concepts
• The Foundation layer contains the metamodel shared by other packages (Business Information, Data Types, Expressions, Keys & Indexes, Software Deployment, Type Mapping).
• The Resource layer contains data models used for operational data sources and target data warehouses.
The CWM Metamodel Cont.
• The Analysis layer provides metamodels supporting logical services that may be mapped onto data stores defined by Resource layer packages. For example, the Transformation metamodel supports the definition of transformations between data warehouse sources and targets, and the OLAP metamodel allows data warehouses stored in either relational or multidimensional data engines to be viewed as dimensions and cubes.
The CWM Metamodel Cont.
• The Management layer metamodels support the operation of data warehouses by allowing the definition and scheduling of operational tasks (Warehouse Process package) and by recording the activity of warehouse processes and related statistics (Warehouse Operation package).
The CWM Metamodel Cont.
CWM Design Basis
In accordance with the solution framework, the metamodeling architecture consists of 4 layers:
– Metamodeling language (M3)
– Metamodels (M2)
– Metadata or Models (M1)
– Data or Objects (M0)
CWM Design Basis
Standard OMG Components:
Modeling Language: UML
Metadata Interchange: XMI
Metadata API: MOF IDL Mapping
[Diagram: the four layers, spanning application and middleware, with examples at each layer:
– Meta-metamodel Layer (M3) – MOF: Class, Attribute, Operation, Association
– Metamodel Layer (M2) – UML: Class, Attribute; CWM: Table, Column; ElementType, Attribute
– Metadata/Model Layer (M1) – Stock: name, price
– User Data/Object Layer (M0) – <Stock name="IBM" price="112"/>]
Our Vision…
Enable "Decisions@speed of thought"
Thank you
SATYAM - “Our People Make The Difference”