ETL Process-Training.ppt
Building the Warehouse: Concepts
Satyam Computer Services Ltd.
Introduction to Building a Data Warehouse
Data Warehousing Architecture
Operational Systems
Information Transformation/Migration Infrastructure
External Systems
Enterprise Data Warehouse
Finance Datamart (Independent)
Sales Datamart (Dependent)
Marketing Datamart (Dependent)
Web Server
Light Clients
Replication Services
LAN Clients
Data Warehousing Architecture
Data Stores
Legacy System
Metadata Repository
Staging Area
Extraction/Transformation Server
To Warehouse/Datamart
• Metadata Design/Mgmt
• Scrubbing Tool
• Mapping Tool
• Extraction Mgmt Tool
• Transformation Tool
• Migration Mgmt Tool
Building a Datawarehouse
• Extracting, Transforming, and Transporting Data
• Extracting Data
• Extraction Techniques
• Extraction Tools
Steps Involved in Building a Datawarehouse
Extraction Phase
• Examine the source data and identify the extraction tool.
• Extracts are typically written in source-system code (e.g. PL/SQL, VB Script, or COBOL).
• The extraction tool also generates source-system code.
• Using a tool for extraction makes the process easier than hand-coding.
• Pre- and post-processes may exist. E.g. before the extract there may be a call to sort the data, or a call to a function that scores a record based on a formula.
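The pre-process idea above (a sort before the extract, plus a scoring function) can be sketched in Python; the record layout and scoring formula here are hypothetical, for illustration only:

```python
# Illustrative pre-extract steps: sort the records, then score each one
# with a formula before handing them to the extract. The fields and the
# scoring weights are hypothetical.
records = [
    {"id": 3, "revenue": 1200.0, "orders": 4},
    {"id": 1, "revenue": 800.0, "orders": 10},
    {"id": 2, "revenue": 500.0, "orders": 2},
]

def score(record):
    # Hypothetical formula: weight revenue against order count.
    return record["revenue"] * 0.7 + record["orders"] * 25.0

# Pre-process 1: sort by key so the extract reads records in a stable order.
records.sort(key=lambda r: r["id"])

# Pre-process 2: score each record before extraction.
for r in records:
    r["score"] = score(r)

print(records[0]["score"])  # 800*0.7 + 10*25.0 = 810.0
```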
Transformation Phase
• Importance of quality data
• Creating business rules
• Tools are available to create reusable transformation modules or objects
• Simple data transformations, including date, number, and character conversion
• Assigning surrogate keys
• Combining data from separate sources
• Validating one-to-one and one-to-many relationships
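A few of the transformations listed above (date and number conversion, surrogate key assignment, combining separate sources) can be sketched in Python; the source layouts and field names are illustrative assumptions, not from the original deck:

```python
from datetime import datetime
from itertools import count

# Two hypothetical sources with different date and number formats.
source_a = [{"cust": "John Istin", "order_date": "05-AUG-1998", "amount": "120.50"}]
source_b = [{"cust": "Mary Poe", "order_date": "1998/08/08", "amount": "75"}]

surrogate = count(1)  # surrogate keys are system-generated, not taken from a source

def transform(row, date_fmt):
    return {
        "order_key": next(surrogate),                                   # surrogate key
        "cust": row["cust"],
        "order_date": datetime.strptime(row["order_date"], date_fmt).date(),  # date conversion
        "amount": float(row["amount"]),                                 # number conversion
    }

# Combine the separately formatted sources into one uniform target set.
target = [transform(r, "%d-%b-%Y") for r in source_a] + \
         [transform(r, "%Y/%m/%d") for r in source_b]
print(target[0]["order_key"], target[1]["order_date"])  # 1 1998-08-08
```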
Transporting Phase
• Insert statements create logs.
• A bulk loader is advisable.
• Truncate target tables before a full refresh.
• Index management: drop and reindex.
Refresh Phase
• Process Slowly Changing Dimensions
• Automate the Extract-Transform-Load Cycle.
• Incremental Fact Table Extracts.
• Purging and Archiving Data.
Extracting Data
The process of getting data from a legacy system or any other data source. After extraction, the data is put in a staging area where it can be scrubbed and cleaned.
The data may come from a single source or from multiple sources. If it comes from multiple sources, a connector tool is required to connect them.
If the data comes from a single source, it can come from an OLTP system or from a flat file.
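A minimal Python sketch of the two single-source cases, using an in-memory sqlite3 database to stand in for an OLTP source; the file layout and table names are hypothetical:

```python
import csv
import io
import sqlite3

# Case 1: extract from a flat file (an in-memory CSV stands in for the file).
flat_file = io.StringIO("cust_id,name\n1,John\n2,Mary\n")
file_rows = list(csv.DictReader(flat_file))

# Case 2: extract from an OLTP-style table (sqlite3 stands in for the RDBMS).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (cust_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 99.0), (2, 45.5)])
db_rows = conn.execute("SELECT cust_id, total FROM orders").fetchall()

# Both extracts land in a staging structure before cleaning and scrubbing.
staging = {"customers": file_rows, "orders": db_rows}
print(len(staging["customers"]), len(staging["orders"]))  # 2 2
```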
Extraction Process in Detail
Extracting Data
• Tools have a well-defined, disciplined approach and documentation.
• Tools provide an easier way to perform extraction through click, drag, and drop features.
The extraction process can be done either by hand coding or by using tools. Each has advantages and disadvantages: custom-programmed extraction (PL/SQL scripts) versus tool-based extraction.
Extracting Data
• Hand-coded extraction is cost-effective, since the PL/SQL constructs are available with the RDBMS.
• Hand-coded extraction is used when the programmer clearly knows the data structures involved.
Extraction Techniques
• Bulk Extraction – The entire data warehouse is refreshed periodically by extractions from the source systems. All applicable data are extracted from the source systems for loading into the warehouse. This approach uses the network connection heavily when loading data from source to target databases, but it is easy to set up and maintain.
Extraction Methods.
Extraction Techniques
• Change-Based Replication – Only data that have been newly inserted or updated in the source systems are extracted and loaded into the warehouse. This approach uses the network less because of the smaller volume of data to be transported, but it involves complex programming to determine when a new warehouse record must be inserted and when an existing warehouse record must be updated.
Extraction Methods.
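The insert-versus-update decision that change-based replication requires can be sketched in Python; the keys and fields are hypothetical:

```python
# Existing warehouse rows, indexed by natural key (hypothetical layout).
warehouse = {101: {"cust_id": 101, "city": "Pune"}}

# Changes extracted from the source since the last load.
changes = [
    {"cust_id": 101, "city": "Mumbai"},   # updated in the source
    {"cust_id": 102, "city": "Chennai"},  # newly inserted in the source
]

inserted, updated = 0, 0
for row in changes:
    key = row["cust_id"]
    if key in warehouse:
        warehouse[key].update(row)   # existing warehouse record: update it
        updated += 1
    else:
        warehouse[key] = row         # new warehouse record: insert it
        inserted += 1

print(inserted, updated)  # 1 1
```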
Extraction Techniques
Hand Coding Development practices
• Set up header and comment fields for the code.
• Stick to the naming standards.
• Test everything: both unit testing and system testing.
• Document everything.
Extracting Data
• The source system platform and database: Tools cannot access all types of data sources on all types of computing platforms.
• Built-in extraction or duplication functionality: The availability of built-in extraction or duplication reduces the technical difficulties inherent in the data extraction process.
Criteria for Identifying Extraction Tool.
Extracting Data
• The batch windows of the operational systems: Some extraction mechanisms are faster or more efficient than others. The batch windows of the operational system determine the time frame available for extraction.
Criteria for Identifying Extraction Tool.
Extraction Tools
Extraction Tools include
– Apertus Carleton: Passport
– Evolutionary Technologies: ETL Extract
– Platinum: InfoPump
TRANSFORMING DATA
Transforming Data
• IMPORTANCE OF QUALITY DATA.
• TRANSFORMATION
• TRANSFORMING DATA : PROBLEMS AND
SOLUTIONS
• TRANSFORMATION TECHNIQUES
• TRANSFORMATION TOOLS
Importance of Quality Data
Quality Data:
Before the extracted data is transformed, the quality of the data has to be examined. Once quality data is transformed, there is minimal need to change the data at the target, which reduces inconsistencies between source and target.
Data Quality Assurance
Characteristic of Quality Data
• Accurate
• Complete
• Consistent
• Unique
• Timely
Data Quality Assurance
Data quality tools assist warehousing teams with the task of locating and correcting data errors.
Corrections can be made at the source or at the target. But when corrections are made at the target, inconsistencies arise between the source and target data, which creates synchronization problems.
Data Quality Tools
Though dirty data continues to be one of the biggest issues for data warehousing initiatives, research indicates that data quality investments are a small percentage of total warehouse spending.
– DataFlux: Data Quality Workbench
– Pine Cone Systems: Content Tracker
– Prism: Quality Manager
– Vality Technology: Integrity Data Reengineering
Transformation
Transformation :
Transformation is the process by which extracted data are converted into the appropriate format. The extracted data is put into the staging area, where cleaning and scrubbing take place, and is stored so that transformation of the clean data can follow. Data can also come to the transformation phase from a cleansing tool. After transformation, the data goes to the transportation stage.
Transforming Data: Problems and Solutions
The common problems of data that comes out of a legacy system are:
• Inconsistent or incorrect use of codes and special characters.
• A single field used for unofficial or undocumented purposes.
• Overloaded codes.
• Evolving data.
• Missing, incorrect, or duplicate values.
Transforming Data: Problems and Solutions
Different solutions are available to ensure that the data to be loaded is correct:
– Cross-Footing: A template of quality data norms can be used to identify erroneous data by comparing it with the norms in the template.
– Manual Examination: A sampling methodology can be selected and a manual examination made of the sampled data.
– Process Validation: Scripts can be generated that identify erroneous records and segregate them.
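A process-validation script of the kind described can be sketched in Python; the norms and the row layout are hypothetical:

```python
# Hypothetical quality norms: one check per field.
norms = {
    "age": lambda v: v is not None and 0 <= v <= 120,
    "country": lambda v: v in {"IN", "US", "GB"},
}

rows = [
    {"age": 34, "country": "IN"},
    {"age": 250, "country": "US"},   # out-of-range age
    {"age": 41, "country": "XX"},    # unknown country code
]

# Identify erroneous rows and segregate them from the clean ones.
clean, erroneous = [], []
for row in rows:
    if all(check(row[field]) for field, check in norms.items()):
        clean.append(row)
    else:
        erroneous.append(row)

print(len(clean), len(erroneous))  # 1 2
```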
Transformation Techniques
Field Splitting and Consolidation:
A single physical field in the source system may need to be split into more than one target warehouse field. Conversely, several source system fields may need to be consolidated and stored in one single warehouse field.
Example: the address field "# 123 ABC Street, DEF City, Republic of GH" splits into:
No: 123, Street: ABC Street, City: DEF, Country: GH
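The address split above can be sketched in Python; the parsing rule assumes the exact "# no street, city, Republic of country" layout shown and is illustrative only:

```python
# One free-form source field becomes several warehouse fields.
address = "# 123 ABC Street, DEF City, Republic of GH"

parts = [p.strip() for p in address.lstrip("# ").split(",")]
no, street = parts[0].split(" ", 1)
target = {
    "no": no,
    "street": street,
    "city": parts[1].removesuffix(" City"),
    "country": parts[2].removeprefix("Republic of ").strip(),
}
print(target)  # {'no': '123', 'street': 'ABC Street', 'city': 'DEF', 'country': 'GH'}
```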
Transformation Techniques
Standardization : Standards and conventions for abbreviations are applied to individual data items to improve uniformity in both source and target objects.
Before: System A, Order Date: 05 August 1998; System B, Order Date: 08-08-98
After: System A, Order Date: August 05 1998; System B, Order Date: August 08 1998
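The standardization example can be sketched in Python, assuming System A uses a "day month-name year" convention and System B uses "dd-mm-yy" (these conventions are inferred from the example):

```python
from datetime import datetime

# Normalize each system's date convention to one standard target format.
def standardize(value, source_fmt):
    return datetime.strptime(value, source_fmt).strftime("%B %d %Y")

print(standardize("05 August 1998", "%d %B %Y"))  # August 05 1998 (System A)
print(standardize("08-08-98", "%d-%m-%y"))        # August 08 1998 (System B)
```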
Transformation Techniques
Deduplication: Rules are defined to identify duplicate records of customers or products. When two or more repeated records are found, they are merged to form one warehouse record.
System A, Customer Name: John W Istin
System B, Customer Name: John William Istin
Merged warehouse record, Customer Name: John William Istin
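A toy deduplication rule for the example above can be sketched in Python; real data quality tools use far richer matching rules than this:

```python
# Two names refer to the same customer when first and last tokens match and
# the middle names are compatible (an initial matches the full middle name).
def same_customer(a, b):
    ta, tb = a.split(), b.split()
    if ta[0] != tb[0] or ta[-1] != tb[-1]:
        return False
    mid_a, mid_b = " ".join(ta[1:-1]), " ".join(tb[1:-1])
    # "W" is compatible with "William"; a missing middle name matches anything.
    return not mid_a or not mid_b or mid_a.startswith(mid_b) or mid_b.startswith(mid_a)

def merge(a, b):
    # Keep the longer (more complete) name as the surviving warehouse record.
    return max(a, b, key=len)

a, b = "John W Istin", "John William Istin"
record = merge(a, b) if same_customer(a, b) else None
print(record)  # John William Istin
```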
Transformation Tools
Some of the transformation tools include:
– Apertus Carleton: Enterprise/Integrator
– DataMirror: Transformation Server
– Informatica: PowerMart Designer
TRANSPORTATION
Transporting the Data
• TRANSPORTING DATA INTO WAREHOUSE
• BUILDING THE TRANSPORTATION PROCESS
• TRANSPORTING THE DATA
• POST PROCESSING OF LOADED DATA
Transporting Data into Warehouse
The transformed data is then transported into the data warehouse. The load images are transported through the loaders into the warehouse.
Data Loaders :
Data loaders load transformed data into the data warehouse.
Stored procedures can be used to handle the warehouse loading if the load images are available in the same RDBMS engine.
Transporting Data into Warehouse
Source Data --(Extract)--> Staging Area --(Load)--> Warehouse Schema
Transporting Data into Warehouse
Warehouse Schema: the dimensional model (dimensions and facts).
Staging Area: a workspace where data is made ready after cleaning. This minimizes the time required to prepare the data.
Source Data: can be a flat file, an Oracle table, or some other form.
First, the source data (flat file, Oracle table, or other) comes to the staging area. This extraction from the source, and loading into the staging area after cleaning, can be done through a tool, PL/SQL, or SQL*Loader.
In the staging area the data can be transformed into the required format. After transformation, the data can be moved to the warehouse through the tool or through PL/SQL scripts.
Transporting Data into Warehouse
Building the Transporting Process
For transporting data we can use: PL/SQL scripts, SQL*Loader routines for flat files, or an ETL tool.
With PL/SQL scripts we can load data into the warehouse from one or more source tables or files. We use PL/SQL for adding surrogate keys to the tables and for doing some transformations. Transformations are done based on the requirements, and also to store the data in a way that increases performance.
Building the Transporting Process: Using PL/SQL
Similarly, we can use SQL*Loader to load data directly from flat files into tables. We use this for bulk loading. SQL*Loader can load both variable-length and fixed-format files.
Building the Transporting Process: Using SQL*Loader
We can also use a tool for this purpose. Tools provide graphical features: you map the source to the target, add the required transformations, and the tool automatically generates the script for transporting the data to the target.
Tools: Oracle Warehouse Builder, Informatica
Building the Transporting Process: Using Tools
Transporting the Data
After building the process, data is loaded into the warehouse. For the PL/SQL process this is done by executing the procedures, and for SQL*Loader routines by running the routines.
Post Processing of Loaded Data
Scheduling of Jobs
Oracle Enterprise Manager (OEM) or an Oracle package (DBMS_JOB) can be used for this purpose. All jobs or procedures can be scheduled according to the loading requirements. In OEM you submit a job for scheduling and set its interval; you can alter this setting later.
OEM internally uses DBMS_JOB for all its scheduling purposes. DBMS_JOB is a package that can be used for scheduling: you schedule a job and set its interval by writing a procedure for it, and the job is then executed automatically at the interval set for it.
Post Processing of Loaded Data
create or replace procedure schedule_job is
  job_no number;
begin
  -- Submit the (assumed existing) procedure insert_temp as a job,
  -- starting now and repeating every 1/48 of a day, i.e. every 30 minutes.
  dbms_job.submit( job_no,
                   'insert_temp;',
                   sysdate,
                   'sysdate + 1/48' );
  -- DBMS_JOB changes take effect only after commit.
  commit;
  dbms_output.put_line('job ' || to_char(job_no));
end;
Post Processing of Loaded Data
Datawarehouse Building
[Diagram: transaction data from operational sources A, B, and C is extracted, transformed, and categorized into the data warehouse, giving the analytical user a single unified view across all three sources.]
ETVL Tools
The following are popular ETVL tools:
• Oracle Warehouse Builder.
• Informatica.
• Sagent.
• SAS Warehouse Administrator.
ETVL Tools
Oracle Warehouse Builder - Key Features
• Easy to Use - Graphical Design.
• Wizard driven Interface.
• Integrated metadata via the Common Warehouse Metamodel (CWM).
• Tightly Integrated with Oracle 8i.
• A library of pre-defined transformations is available.
ETVL Tools
Oracle Warehouse Builder - Key Features
• Graphical mapping and Transformation design.
• Automated Code Generation.
• Support for Heterogeneous Sources.
LEAVING A METADATA TRAIL
• DEFINING WAREHOUSE METADATA
• DEVELOPING A METADATA STRATEGY
• EXAMINING TYPES OF METADATA
• METADATA MANAGEMENT TOOLS
• COMMON WAREHOUSE METADATA
DEFINING WAREHOUSE METADATA
Metadata
What is Metadata?
• Traditionally defined as data about data
• A form of abstraction that describes the structure and contents of the data warehouse
Metadata
• Metadata is more comprehensive and transcends the data.
– Metadata provides the format and name of data items.
– It provides the context in which the data element exists.
– It gives the domain of possible values,
– the relation that a data element has to others,
– the data's business rules,
– and even the origin of the data.
Importance of Metadata
• Metadata establish the context of the warehouse data. Metadata help warehouse administrators and users locate and understand data items, both in the source systems and in the warehouse data structures.
E.g.: The date 02/05/98 could mean either May 2, 1998 or February 5, 1998 depending on the date convention used. Metadata describing the format of this date field could help determine the definite and unambiguous meaning of the data item.
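This use of metadata can be sketched in Python: a metadata entry records each source field's date convention, making the ambiguous string unambiguous (the field names and formats here are hypothetical):

```python
from datetime import datetime

# Hypothetical metadata: each source field's recorded date convention.
metadata = {
    "orders.ship_date": "%m/%d/%y",   # month-first convention
    "invoices.due_date": "%d/%m/%y",  # day-first convention
}

def interpret(field, value):
    # The field's metadata makes the ambiguous string definite.
    return datetime.strptime(value, metadata[field]).strftime("%B %d, %Y")

print(interpret("orders.ship_date", "02/05/98"))   # February 05, 1998
print(interpret("invoices.due_date", "02/05/98"))  # May 02, 1998
```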
• Metadata facilitate the analysis process. Metadata must provide warehouse end-users with the information they need to easily perform the analysis steps. It should thus allow users to quickly locate data that are in the warehouse.
Metadata should allow analysts to interpret data correctly by providing information about data formats and data definitions.
Importance of Metadata
• Metadata are a form of audit trail for data transformation. Metadata document the transformation of source data into warehouse data. Hence warehouse metadata must be capable of explaining how a particular piece of warehouse data was derived from the operational systems.
All business rules governing the transformation of data to new values or new formats are also documented as metadata.
Importance of Metadata
This kind of audit trail is required:
- to build the user’s confidence regarding the veracity and quality of warehouse data
- to know where the data came from so that the user has a good understanding of warehouse data
- by some warehousing products that use this type of metadata to generate extraction and transformation scripts for use in the warehouse back-end
Importance of Metadata
• Metadata improve or maintain data quality. Metadata can improve or maintain warehouse data quality through the definition of valid values for individual warehouse data items. Using a data quality tool prior to actual loading into the warehouse, the warehouse load images can be reviewed to check for compliance with valid values for key data items. Data errors are quickly highlighted for correction.
Metadata can be used as the basis for any error-correction processing that should be done if a data error is found. Error-correction rules are documented in the metadata repository and executed by program code on an as needed basis.
Importance of Metadata
DEVELOPING A METADATA STRATEGY
METADATA STRATEGY
Metadata organization and administration, which promotes sharing and central management of metadata in a distributed repository architecture;
Content creation and integrity, to maintain consistency of metadata that may be passed among various tools throughout the phases of the project;
METADATA STRATEGY
Component-based metadata sharing, which includes facilities for exchanging metadata among upstream design/modeling tools and downstream analytical tools;
Planning for the future, necessary for ensuring compatibility with emerging metadata and interoperability standards.
EXAMINING
TYPES OF METADATA
METADATA TYPES
• ADMINISTRATIVE METADATA
• END-USER METADATA
• OPTIMIZATION METADATA
Metadata falls into three major categories.
There is the metadata associated with the decision-support database.
– This metadata describes the database structures such as tables, columns and partitions, as well as security settings and operational information.
The second category of data warehouse metadata is used by the end user to navigate the database.
– A query and analysis tool, such as BusinessObjects from Business Objects Inc. or PowerPlay from Cognos Corp., usually creates and manages this metadata.
Metadata falls into three major categories.
The third category is the metadata created by the back-end extract/transformation tool that's used to move data from the source systems to the data warehouse.
– This metadata is primarily concerned with source data definitions, transformation logic and source-to-target data mappings. These tools also must be concerned with process scheduling, maintaining data integrity and error management.
ADMINISTRATIVE METADATA
These contain descriptions of the source databases and their contents, the data warehouse objects, and the business rules to transform data from the sources into the data warehouse.
• Data Sources: Descriptions of all data sources used by the warehouse, including information about the data ownership. Any relationships between different data sources (e.g., one provides data to the other) are also documented.
• Source-to-target field mapping: The mapping of source fields (in operational systems) to target fields (in the data warehouse) explains what fields are used to populate the data warehouse. It also documents transformations and formatting changes that were applied to the original, raw data to derive the warehouse data.
• Warehouse Schema Design: Describes the warehouse servers, databases, database tables, fields, and any hierarchies that may exist in the data. All referential tables, system codes, etc., are also documented.
• Warehouse back-end data structure: Model of the back-end of the warehouse, including staging tables, load image tables, and any other temporary data structures that are used during the data transformation process.
• Warehouse back-end tools or programs: A definition of each extraction, transformation, and quality assurance program or tool that is used to build or refresh the data warehouse.
ADMINISTRATIVE METADATA
• Warehouse architecture: If the warehouse architecture is one where an enterprise warehouse feeds many departmental or vertical data marts the warehouse architecture should be documented. If the data mart contains a logical subset of the data warehouse contents, this subset should also be defined.
• Business Rules and Policies: All applicable business rules and policies are documented. Examples include business formulae for computing costs or profits.
• Access and Security rules: Rules governing the flow of data across various users and their access limitations are documented.
ADMINISTRATIVE METADATA
END-USER METADATA
End-user metadata help users create their queries and interpret the results. They contain:
Warehouse Contents : Must describe the data structure and contents of the data warehouse in user friendly terms. Aliases, rules, summaries and precomputed totals are to be documented.
Predefined Queries & Reports : Queries & reports that have been predefined and documented to avoid duplication of effort.
Business Rules & Policies : All business rules, and changes to these rules over time, should be documented.
Hierarchy Definitions : Hierarchy definitions are important to support drilling up and down warehouse dimensions.
Status Information : Status information is required to inform warehouse users of the warehouse status at any point in time.
Data Quality : Known data quality problems in the warehouse should be clearly documented; this will prompt users to make careful use of warehouse data.
END-USER METADATA
Warehouse Load History : A history of data errors, data volumes, and load schedules should be available.
Warehouse Purging Rules : The rules that determine when data is removed from the warehouse should be known to end-users.
END-USER METADATA
OPTIMIZATION METADATA
Metadata are maintained to aid in the optimization of the data warehouse design and performance.
Aggregate Definitions : All warehouse aggregates should be documented so that front-end tools with aggregate navigation facilities can rely on this type of metadata.
Collection of Query Statistics : It is helpful to track the types of queries made against the warehouse. This helps with optimization and tuning, and also helps identify data that is largely unused.
METADATA MANAGEMENT TOOLS
METADATA MANAGEMENT TOOLS
Metadata Catalog is a generic descriptor for the overall set of metadata used in the warehouse.
Tools are needed for cataloging all of this metadata and keeping track of it. The tool probably can’t read and write all the metadata directly, but it will manage metadata stored in many locations.
The functions and services required for metadata catalog maintenance include:
1. Information catalog integration/merge, from the data model to the database to front-end tools.
2. Metadata management: removing old, unused entries.
3. Capturing existing metadata from the mainframe or other sources.
4. Managing and displaying graphical and tabular representations of the metadata catalog contents (a metadata browser).
5. Maintaining user profiles for application and security use.
6. Security for the metadata catalog.
7. Local or centralized metadata catalog support.
8. Creating remote procedure calls to provide
METADATA MANAGEMENT TOOLS
COMMON WAREHOUSE METADATA
The CWM Metamodel
The CWM metamodel is organized into 18 packages arranged in 4 layers on a UML base (figure next).
CWM's architecture defines each sub-metamodel as an individual package. Because CWM uses modeling techniques that minimize the number of dependencies between its packages, tool integrators can select only those metamodel services they need while avoiding problems common to large, monolithic metamodels.
The CWM Metamodel
[Figure: the CWM metamodel packages by layer, on a UML 1.3 base (Foundation, Behavioral_Elements, Model_Management):
– Management: Warehouse Process, Warehouse Operation
– Analysis: Transformation, OLAP, Data Mining, Information Visualization, Business Nomenclature
– Resource: Object (UML), Relational, Record, Multidimensional, XML
– Foundation: Business Information, Data Types, Expressions, Keys & Indexes, Software Deployment, Type Mapping]
Counts: CWM, 157 classes and 115 associations; CWMX, 130 classes and 77 associations; total, 287 classes and 192 associations.
The Four layers of the CWM collect together different sorts of metamodel packages:
• Base Layer contains the standard UML 1.3 notation and the extensions to support warehouse concepts
• The Foundation layer contains the metamodel shared by other packages (Business Information, Data Types, Expressions, Keys & Indexes, Software Deployment, Type Mapping).
• The Resource layer contains data models used for operational data sources and target data warehouses.
The CWM Metamodel Cont.
• The Analysis layer provides metamodels supporting logical services that may be mapped onto data stores defined by Resource layer packages. For example, the Transformation metamodel supports the definition of transformations between data warehouse sources and targets, and the OLAP metamodel allows data warehouses stored in either relational or multidimensional data engines to be viewed as dimensions and cubes.
The CWM Metamodel Cont.
• The Management layer metamodels support the operation of data warehouses by allowing the definition and scheduling of operational tasks (Warehouse Process package) and by recording the activity of warehouse processes and related statistics (Warehouse Operation package).
The CWM Metamodel Cont.
CWM Design Basis
In accordance with the solution framework, the metamodeling architecture consists of 4 layers:
– Metamodeling language (M3)
– Metamodels (M2)
– Metadata or Models (M1)
– Data or Objects (M0)
CWM Design Basis
Standard OMG Components:
Modeling Language: UML
Metadata Interchange: XMI
Metadata API: MOF IDL Mapping
[Diagram: the four layers, spanning application and middleware, with examples at each layer:
– Meta-metamodel Layer (M3) – MOF: Class, Attribute, Operation, Association
– Metamodel Layer (M2) – UML: Class, Attribute; CWM: Table, Column; ElementType, Attribute
– Metadata/Model Layer (M1) – Stock: name, price
– User Data/Object Layer (M0) – <Stock name="IBM" price="112"/>]
Our Vision…
Enable "Decisions@speed of thought"
Thank you
SATYAM - “Our People Make The Difference”