ETL Concepts


Transcript of ETL Concepts

Page 1: ETL Concepts

ETL Concepts

Page 2: ETL Concepts

Scope of the Training
What is ETL?
Need for ETL
ETL Glossary
The ETL Process
• Data Extraction and Preparation
• Data Cleansing
• Data Transformation
• Data Load
• Data Refresh Strategies
ETL Solution Options
Characteristics of ETL Tools
Types of ETL Tools
ETL Tool - selection criteria
Key tools in the market

Page 3: ETL Concepts

What is ETL?

ETL stands for Extraction, Transformation and Load.

This is the most challenging, costly and time-consuming step in building any type of Data Warehouse.

This step usually determines the success or failure of a Data Warehouse, because any analysis places great importance on the data and on the quality of the data being analyzed.

Page 4: ETL Concepts

What is ETL?

Page 5: ETL Concepts

What is ETL? - Extraction

Extraction: the process of culling out the data required for the Data Warehouse from the source system

Can be to a file or to a database
Could involve some degree of cleansing or transformation
Can be automated since it becomes repetitive once established

Page 6: ETL Concepts

What is ETL? - Transformation & Cleansing

Transformation
Modification or transformation of data being imported into the Data Warehouse
Usually done with the purpose of ensuring ‘clean’ and ‘consistent’ data

Cleansing
The process of removing errors and inconsistencies from data being imported to a data warehouse
Could involve multiple stages
Feedback could go back to strengthen the OLTP data capture mechanism

Page 7: ETL Concepts

What is ETL? - Loading

After extracting, scrubbing, cleansing, validating etc., the data needs to be loaded into the warehouse

Issues:
huge volumes of data to be loaded
small time window available when the warehouse can be taken off line (usually nights)
when to build indexes and summary tables
allow system administrators to monitor, cancel, resume, and change load rates
recover gracefully -- restart after failure from where you were and without loss of data integrity

Page 8: ETL Concepts

What is ETL? - Loading Techniques

Techniques:
batch load utility: sort input records on clustering key and use sequential I/O; build indexes and derived tables
sequential loads still too long: use parallelism and incremental techniques

Page 9: ETL Concepts

The Need for ETL

Facilitates integration of data from various data sources for building a Data Warehouse
• Note: Mergers and acquisitions also create disparities in data representation and pose more difficult challenges in ETL
Businesses have data in multiple databases with different codification and formats
Transformation is required to convert and to summarize operational data into a consistent, business-oriented format
Pre-computation of any derived information
Summarization is also carried out to pre-compute summaries and aggregates
Makes data available in a queryable format

Page 10: ETL Concepts

The Need for ETL - Example

Field name: appl A - balance; appl B - bal; appl C - currbal; appl D - balcurr

Unit: appl A - pipeline - cm; appl B - pipeline - in; appl C - pipeline - feet; appl D - pipeline - yds

Encoding: appl A - m,f; appl B - 1,0; appl C - x,y; appl D - male, female

All four applications feed the same Data Warehouse.
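The disparities above are reconciled by mapping every source's field names, units, and codes onto one canonical warehouse form. A minimal sketch; the conversion factors and code mappings below are illustrative assumptions, not taken from the slides.

```python
# Harmonize field names, units, and encodings from the four source
# applications into one warehouse-canonical form (illustrative values).

FIELD_MAP = {"balance": "balance", "bal": "balance",
             "currbal": "balance", "balcurr": "balance"}

# Pipeline lengths: convert every source unit to centimetres.
TO_CM = {"cm": 1.0, "in": 2.54, "feet": 30.48, "yds": 91.44}

GENDER_MAP = {"m": "M", "f": "F", "1": "M", "0": "F",
              "x": "M", "y": "F", "male": "M", "female": "F"}

def harmonize(record: dict, unit: str) -> dict:
    out = {}
    for key, value in record.items():
        out[FIELD_MAP.get(key, key)] = value   # canonical field name
    out["pipeline_cm"] = out.pop("pipeline") * TO_CM[unit]
    out["gender"] = GENDER_MAP[str(out["gender"]).lower()]
    return out

# Application B stores "bal", inches, and 1/0 gender codes.
row_b = harmonize({"bal": 1200.0, "pipeline": 10, "gender": "1"}, unit="in")
```

The same `harmonize` call, fed each application's unit and codes, yields rows that can be loaded into one warehouse table.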

Page 11: ETL Concepts

Data Integrity Problems - Scenarios

Same person, different spellings: Agarwal, Agrawal, Aggarwal etc.
Multiple ways to denote a company name: Persistent Systems, PSPL, Persistent Pvt. LTD.
Use of different names: mumbai, bombay
Different account numbers generated by different applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale; manual entry leads to mistakes (“in case of a problem use 9999999”)

Page 12: ETL Concepts

ETL Glossary

Extracting
Conditioning
Householding
Enrichment
Scoring

Page 13: ETL Concepts

ETL Glossary

Extracting

Capture of data from the operational source in “as is” status

Sources for data are generally legacy mainframes in VSAM, IMS, IDMS, DB2; more data today is in relational databases on Unix

Conditioning

The conversion of data types from the source to the target data store (warehouse) -- always a relational database

Page 14: ETL Concepts

ETL Glossary

Householding
Identifying all members of a household (living at the same address)
Ensures only one mail is sent to a household
Can result in substantial savings: 1 lakh catalogues at Rs. 50 each cost Rs. 50 lakhs; a 2% saving would save Rs. 1 lakh

Enrichment
Bringing in data from external sources to augment/enrich operational data. Data sources include Dun & Bradstreet, A. C. Nielsen, CMIE, IMRA etc.

Scoring
Computation of the probability of an event, e.g., the chance that a customer will defect to AT&T from MCI, or the chance that a customer is likely to buy a new product
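Householding as defined above can be sketched as grouping records by a normalized address and sending one mailing per group. The names, addresses, and the crude normalization rule are illustrative.

```python
from collections import defaultdict

def normalize_address(addr: str) -> str:
    # Crude normalization: lowercase and collapse whitespace.
    return " ".join(addr.lower().split())

customers = [
    {"name": "A. Kumar", "address": "12 MG Road,  Pune"},
    {"name": "S. Kumar", "address": "12 mg road, pune"},
    {"name": "R. Iyer",  "address": "7 FC Road, Pune"},
]

# Group customers living at the same (normalized) address.
households = defaultdict(list)
for c in customers:
    households[normalize_address(c["address"])].append(c["name"])

# One catalogue per household instead of one per customer.
mailings = len(households)
```

Here three customers collapse into two households, so one mailing is saved.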

Page 15: ETL Concepts

The ETL Process

Access data dictionaries defining source files

Build logical and physical data models for target data

Identify sources of data from existing systems

Specify business and technical rules for data extraction, conversion and transformation

Perform data extraction and transformation

Load target databases
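The steps above culminate in an extract → transform → load chain, which can be sketched end to end in a few lines. The source rows and the validity rule (amounts must be positive) are invented for illustration.

```python
def extract(source_rows):
    # Pull raw rows from the source system (here, a plain list).
    return list(source_rows)

def transform(rows):
    # Apply business rules: trim names, drop rows that fail
    # a simple validity check (amount must be positive).
    cleaned = []
    for r in rows:
        if r["amount"] > 0:
            cleaned.append({"name": r["name"].strip(), "amount": r["amount"]})
    return cleaned

def load(rows, warehouse):
    # Append the transformed rows to the target table.
    warehouse.extend(rows)

warehouse = []
source = [{"name": " alice ", "amount": 10}, {"name": "bob", "amount": -5}]
load(transform(extract(source)), warehouse)
```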

Page 16: ETL Concepts

The ETL Process – Push vs. Pull

Pull: A Pull strategy is initiated by the Target System. As part of the Extraction Process, the source data can be pulled from the Transactional system into a staging area by establishing a connection to the relational/flat-file/ODBC sources.

• Advantage: no additional space required to store the data that needs to be loaded into the staging database
• Disadvantage: burden on the Transactional systems when we want to load data into the staging database

Push: A Push strategy is initiated by the Source System. As part of the Extraction Process, the source data can be pushed/exported or dumped to a file location, from where it can be loaded into a staging area.

• Advantage: no additional burden on the Transactional systems when we want to load data into the staging database
• Disadvantage: additional space required to store the data that needs to be loaded into the staging database

Page 17: ETL Concepts

The ETL Process – Push vs. Pull

With a PUSH strategy, the source system area maintains the application that reads the source and creates an interface file presented to your ETL. With a PULL strategy, the DW maintains the application that reads the source.

Page 18: ETL Concepts

The ETL Process - Data Extraction and Preparation

Stage I: Extract
Stage II: Analyze, Clean and Transform
Stage III: Data Movement and Load
Periodic Refresh/Update follows the initial load

Page 19: ETL Concepts

The ETL Process – A simplified picture

Multiple OLTP Systems --> Staging Area --> Data Warehouse

Stage I: Extract (from the OLTP systems into the staging area)
Stage II: Transform (within the staging area)
Stage III: Load (from the staging area into the warehouse)

Page 20: ETL Concepts

The ETL Process – Step 1

Capture = extract: obtaining a snapshot of a chosen subset of the source data for loading into the data warehouse

Static extract = capturing a snapshot of the source data at a point in time

Incremental extract = capturing changes that have occurred since the last static extract
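The static/incremental distinction can be sketched with a watermark: a static extract takes a full snapshot, while an incremental extract takes only rows modified since the last run. The `modified` timestamp column is an illustrative assumption.

```python
def static_extract(table):
    # Full snapshot of the source at this point in time.
    return list(table)

def incremental_extract(table, last_watermark):
    # Only rows modified after the previous extract's watermark.
    return [row for row in table if row["modified"] > last_watermark]

source = [
    {"id": 1, "modified": 100},
    {"id": 2, "modified": 250},
    {"id": 3, "modified": 300},
]

snapshot = static_extract(source)           # all three rows
delta = incremental_extract(source, 200)    # only rows changed since t=200
```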

Page 21: ETL Concepts

The ETL Process – Step 2

Scrub = cleanse: uses pattern recognition and AI techniques to upgrade data quality

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

Page 22: ETL Concepts

The ETL Process – Step 3

Transform = convert data from the format of the operational system to the format of the data warehouse

Record-level:
Selection – data partitioning
Joining – data combining
Aggregation – data summarization

Field-level:
Single-field – from one field to one field
Multi-field – from many fields to one, or one field to many
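The record-level and field-level transformations can be illustrated briefly: aggregation summarizes many records into one, and a multi-field transform splits one field into several. Field names and data are illustrative.

```python
from collections import defaultdict

sales = [
    {"region": "west", "amount": 100.0},
    {"region": "west", "amount": 50.0},
    {"region": "east", "amount": 70.0},
]

# Record-level aggregation: many rows summarized into one per region.
totals = defaultdict(float)
for row in sales:
    totals[row["region"]] += row["amount"]

# Field-level multi-field transform: one field split into many.
def split_name(full_name: str) -> dict:
    first, _, last = full_name.partition(" ")
    return {"first_name": first, "last_name": last}

parts = split_name("Srinivasan Seshadri")
```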

Page 23: ETL Concepts

The ETL Process – Step 4

Load/Index = place transformed data into the warehouse and create indexes

Refresh mode: bulk rewriting of target data at periodic intervals

Update mode: only changes in source data are written to the data warehouse
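The two load modes can be sketched against a dict standing in for the target table: refresh mode rewrites the whole target, while update mode writes only the changed rows (an upsert). Data is illustrative.

```python
def refresh_load(target: dict, source_rows):
    # Refresh mode: bulk-rewrite the target from scratch.
    target.clear()
    target.update({r["id"]: r for r in source_rows})

def update_load(target: dict, changed_rows):
    # Update mode: write only the changed rows (upsert).
    for r in changed_rows:
        target[r["id"]] = r

warehouse = {}
refresh_load(warehouse, [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
update_load(warehouse, [{"id": 2, "v": "b2"}, {"id": 3, "v": "c"}])
```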

Page 24: ETL Concepts

The ETL Process - Data Transformation

Transforms the data in accordance with the business rules and standards that have been established

Examples include: format changes, de-duplication, splitting up fields, replacement of codes, derived values, and aggregates

Page 25: ETL Concepts

Scrubbing/Cleansing Data

Sophisticated transformation tools are used for improving the quality of data
Clean data is vital for the success of the warehouse
Example:
• Seshadri, Sheshadri, Sesadri, Seshadri S., Srinivasan Seshadri, etc. are the same person

Page 26: ETL Concepts

Reasons for “Dirty” data

Dummy values
Absence of data
Multipurpose fields
Cryptic data
Contradicting data
Inappropriate use of address lines
Violation of business rules
Reused primary keys
Non-unique identifiers
Data integration problems

Page 27: ETL Concepts

The ETL Process - Data Cleansing

Source systems contain “dirty data” that must be cleansed

ETL software contains rudimentary data cleansing capabilities

Specialized data cleansing software is often used; it is important for performing name and address correction and householding functions

Leading data cleansing/Quality Technology vendors include IBM (QualityStage & ProfileStage), Harte-Hanks (Trillium Software), SAS (DataFlux) and Firstlogic

Page 28: ETL Concepts

Steps in Data Cleansing

Parsing

Correcting

Standardizing

Matching

Consolidating

Page 29: ETL Concepts

Parsing

Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.

Examples include parsing the first, middle, and last name; street number and street name; and city and state.
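Parsing of this kind can be sketched with simple patterns that isolate the individual elements; production tools use reference data and much richer grammars. The input formats assumed here ("City, ST" and "number street-name") are illustrative.

```python
import re

def parse_city_state(line: str):
    # Expect "City, ST" -- isolate the city and state elements.
    m = re.match(r"\s*(?P<city>[A-Za-z .]+?)\s*,\s*(?P<state>[A-Z]{2})\s*$", line)
    return m.groupdict() if m else None

def parse_street(line: str):
    # Expect "<street number> <street name>".
    m = re.match(r"\s*(?P<number>\d+)\s+(?P<street>.+?)\s*$", line)
    return m.groupdict() if m else None

city = parse_city_state("Pune, MH")
street = parse_street("221 Baker Street")
```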

Page 30: ETL Concepts

Correcting

Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.

Examples include replacing a vanity address and adding a zip code.

Page 31: ETL Concepts

Standardizing

Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.

Examples include adding a pre name, replacing a nickname, and using a preferred street name.

Page 32: ETL Concepts

Matching

Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.

Examples include identifying similar names and addresses.
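One classic way to find candidate matches among similar-sounding names is a phonetic key such as Soundex, which maps variant spellings to the same code so likely duplicates can be grouped before detailed comparison. The slides do not prescribe a specific algorithm; this compact Soundex is one illustrative choice.

```python
def soundex(name: str) -> str:
    # Classic Soundex: first letter + up to three digit codes;
    # similar-sounding consonants share a digit, vowels are dropped.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    out = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":   # h and w do not separate equal codes
            prev = code
    return (out + "000")[:4]

# Variant spellings of the same surname collapse to one key.
key = soundex("Agarwal")
```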

Page 33: ETL Concepts

Consolidating

Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.

Page 34: ETL Concepts

Data Quality Technology Tools (Vendors)

DataFlux Integration Server & dfPower® Studio (www.DataFlux.com)

Trillium Software Discovery & Trillium Software System (www.trilliumsoftware.com)

ProfileStage & QualityStage (www.ascential.com)

Page 35: ETL Concepts

MarketScope Update: Data Quality Technology ratings, 2005 (Source: Gartner - June 2005)

Page 36: ETL Concepts

The ETL Process - Data Loading

Data are physically moved to the data warehouse

The loading takes place within a “load window”

The trend is towards near-real-time updates of the data warehouse, as the warehouse is increasingly used for operational applications

Page 37: ETL Concepts

Data Loading - First Time Load

First load is a complex exercise

• Data extracted from tapes, files, archives etc.

• First time load might take a lot of time to complete

Page 38: ETL Concepts

Data Refresh

Issues: when to refresh?

on every update: too expensive; only necessary if OLAP queries need current data (e.g., up-to-the-minute stock quotes)

periodically (e.g., every 24 hours, every week) or after “significant” events

refresh policy set by the administrator based on user needs and traffic

possibly different policies for different sources

How to refresh?

Page 39: ETL Concepts

Data Refresh

Data refreshing can follow two approaches:

Complete Data Refresh

• Completely refresh the target table every time

Data Trickle Load

• Replicate only net changes and update the target database

Page 40: ETL Concepts

Data Refresh Techniques

Snapshot approach - full extract from base tables
read entire source table or database: expensive
may be the only choice for legacy databases or files

Incremental techniques (related to work on active DBs)
detect & propagate changes on base tables: replication servers (e.g., Sybase, Oracle, IBM Data Propagator)
snapshots & triggers (Oracle)
transaction shipping (Sybase)

Logical correctness
computing changes to star tables
computing changes to derived and summary tables
optimization: only significant changes

Transactional correctness: incremental load

Page 41: ETL Concepts

ETL Solution Options

ETL
• Custom Solution
• Generic Solution

Page 42: ETL Concepts

Custom Solution

Using RDBMS staging tables and stored procedures
Programming languages like C, C++, Perl, Visual Basic etc.
Building a code generator

Page 43: ETL Concepts

Custom Solution – Typical components

Control Program
• Time-window-based extraction
• Restart at point of failure
• High level of error handling
• Control metadata captured in Oracle tables
• Facility to launch failure recovery programs automatically

Extract From Source
• Snapshots for dimension tables
• PL/SQL extraction procedure
• Complex views for transformation
• Control table and highly parameterized/generic extraction process

Data Quality
• Control table driven
• Highly configurable process
• PL/SQL procedure
• Checks performed - referential integrity, Y2K, elementary statistics, business rules
• Mechanism to flag records as bad / reject

Generate Download Files
• Multiple Stars extracted as separate groups
• Pro*C programs using embedded SQL
• Surrogate key generation mechanism
• ASCII file downloads generated for load into warehouse
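One component above, the surrogate key generation mechanism, can be sketched as a lookup that hands each new natural (source) key the next sequential warehouse key and reuses it on later loads. Illustrative; a real implementation would persist the mapping in the database.

```python
class SurrogateKeyGenerator:
    """Map each natural (source) key to a stable warehouse key."""

    def __init__(self):
        self._keys = {}
        self._next = 1

    def key_for(self, natural_key):
        # Reuse the existing surrogate on repeat loads; otherwise
        # hand out the next sequential integer.
        if natural_key not in self._keys:
            self._keys[natural_key] = self._next
            self._next += 1
        return self._keys[natural_key]

gen = SurrogateKeyGenerator()
k1 = gen.key_for(("appl A", "CUST-0042"))
k2 = gen.key_for(("appl B", "42"))
k3 = gen.key_for(("appl A", "CUST-0042"))  # same natural key, same surrogate
```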

Page 44: ETL Concepts

Generic Solution

Addresses limitations (in scalability & complexity) of manual coding
The need to deliver quantifiable business value
Functionality, reliability and viability are no longer major issues

Page 45: ETL Concepts

Characteristics of ETL Tools

Provide a facility to specify a large number of transformation rules with a GUI
Generate programs to transform data
Handle multiple data sources
Handle data redundancy
Generate metadata as output
Most tools exploit parallelism by running on multiple low-cost servers in a multi-threaded environment
Support data extraction, cleansing, aggregation, reorganization, transformation, and load operations
Generate and maintain centralized metadata
Closely integrated with various RDBMSs
Filter data, convert codes, calculate derived values, map many source data fields to one target data field
Automatic generation of data extract programs
High-speed loading of target data warehouses

Page 46: ETL Concepts

Types of ETL Tools

First-generation
• Code-generation products
• Generate the source code

Second-generation
• Engine-driven products
• Generate directly executable code

Note: Due to a more efficient architecture, second-generation tools have a significant advantage over first-generation tools.

Page 47: ETL Concepts

Types of ETL Tools - First-Generation

Extraction, transformation, load process runs on server or host
GUI interface is used to define extraction/transformation processes
Detailed transformations require coding in COBOL or C
Extract program is generated automatically as source code; the source code is compiled, scheduled, and run in batch mode
Uses intermediate files
Program is single-threaded and cannot use parallel processors
Little metadata is generated automatically

Page 48: ETL Concepts

First-Generation ETL Tools – Strengths and Limitations

Strengths
Tools are mature
Programmers are familiar with code generation in COBOL or C

Limitations
High cost of products
Complex training
Extract programs have to be compiled from source
Many transformations have to be coded manually
Lack of parallel execution support
Most metadata has to be generated manually

Page 49: ETL Concepts

First-Generation ETL Tools – Examples

SAS/Warehouse Administrator
Prism from Prism Solutions
Passport from Apertus Carleton Corp
ETI-EXTRACT Tool Suite from Evolutionary Technologies
Copy Manager from Information Builders

Page 50: ETL Concepts

Types of ETL Tools - Second-Generation

Extraction/Transformation/Load runs on the server
Data directly extracted from the source and processed on the server
Data transformed in memory and written directly to the warehouse database; high throughput since intermediate files are not used
Directly executable code
Support for monitoring, scheduling, extraction, scrubbing, transformation, load, index, aggregation, metadata
Multi-threading with use of parallel processors
Automatic generation of a high percentage of metadata

Page 51: ETL Concepts

Second-Generation ETL Tools – Strengths and Limitations

Strengths
Lower-cost suites, platforms, and environment
Fast, efficient, and multi-threaded
ETL functions highly integrated and automated
Open, extensible metadata

Limitations
Not mature
Initial tools oriented only to RDBMS sources

Page 52: ETL Concepts

Second-Generation ETL Tools – Examples

PowerMart from Informatica
DataStage from Ardent
Data Mart Solution from Sagent Technology
Tapestry from D2K

Page 53: ETL Concepts

ETL Tools - Examples

DataStage from Ascential Software
SAS System from SAS Institute
PowerMart/PowerCenter from Informatica
Sagent Solution from Sagent Software
Hummingbird Genio Suite from Hummingbird Communications

Page 54: ETL Concepts

ETL Tool - General Selection criteria

Business vision/considerations
Overall IT strategy/architecture
Overall cost of ownership
Vendor positioning in the market
Performance
In-house expertise available
User friendliness
Training requirements for existing users
References from other customers

Page 55: ETL Concepts

ETL Tool – Specific Selection criteria

Support to retrieve, cleanse, transform, summarize, aggregate, and load data
Engine-driven products for fast, parallel operation
Generate and manage a central metadata repository
Open metadata exchange architecture
Provide end-users with access to metadata in business terms
Support development of logical and physical data models

Page 56: ETL Concepts

ETL Tool - Selection criteria

[Chart - Source: Gartner Report]

Tools compared: ETI Extract, SAS Warehouse Administrator, Informatica PowerCenter, Platinum Decision Base, Ardent DataStage, DataMirror Transformation Server, Ardent Warehouse Executive, Carleton Pureview

Criteria rated (high to low): Ease of Use / Development Capabilities; Target Database Loading; Data Transformation and Repair Complexity; Operations Management / Process Automation; Metadata Management and Administration; Data Extraction & Integration Complexity