Extract Transform Load (ETL)


Page 1: Extract Transform Load (ETL)


M2: Extract-Transform-Load (ETL)

The only way to do great work is to love what you do. -- Steve Jobs --

Worapot Jakkhupan, PhD
worapot.j@psu.ac.th, Room BSc.0406/7

Information and Communication Technology Programme, Faculty of Science, PSU

Page 2: Extract Transform Load (ETL)


Objectives

• Database vs. Data Warehouse

• ER diagrams for databases

• Data warehouse architecture

• ETL (Data Extraction, Transformation and Loading) definitions

• ETL design principles

• ETL functions

Page 3: Extract Transform Load (ETL)


Data warehouse architecture

Page 4: Extract Transform Load (ETL)


Difference between Databases and Data Warehouses

• Database
  • A database is made up of a collection of tables that store a specific set of structured data.
  • A table contains a collection of rows and columns.
  • Each column (also called a field or attribute) in the table is designed to store a certain type of information.

• OLTP
  • OLTP (On-line Transaction Processing) is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE).
  • The main emphasis for OLTP systems is very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured in transactions per second.

Page 5: Extract Transform Load (ETL)


Difference between Databases and Data Warehouses

• Data Warehouse
  • "A data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process."
  • A data warehouse is designed for OLAP.

• OLAP
  • OLAP (On-line Analytical Processing) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is an effectiveness measure. OLAP applications are widely used by data mining techniques.
  • An OLAP database holds aggregated, historical data, stored in multi-dimensional schemas.

Page 6: Extract Transform Load (ETL)


Data Warehousing Components

[Diagram: several operational databases feed the ETL process (Extract, Transform, Load), which populates the data warehouse; OLAP and data mining applications then draw on the warehouse.]

Page 7: Extract Transform Load (ETL)


Data Flow

Source: Connolly & Begg (2001), Database Systems: A Practical Approach to Design, Implementation, and Management (3rd Edition), Addison Wesley

Page 8: Extract Transform Load (ETL)


OLTP vs. OLAP

• IT systems can be divided into transactional (OLTP) and analytical (OLAP) systems. In general, we can assume that OLTP systems provide the source data to data warehouses, whereas OLAP systems help to analyse it.

Page 9: Extract Transform Load (ETL)


OLTP vs. OLAP

OLTP systems | OLAP systems
Holds current data | Holds historical data
Stores detailed data | Stores detailed and summarized data
Data is dynamic | Data is static
Repetitive processing | Ad hoc processing
Predictable pattern of usage | Unpredictable pattern of usage
Transaction-driven | Analysis-driven
Application-oriented | Subject-oriented
Supports day-to-day operations | Supports strategic decisions
Serves a large number of clerical/operational users | Serves a small number of managerial users

Source: Connolly & Begg (2001), Database Systems: A Practical Approach to Design, Implementation, and Management (3rd Edition), Addison Wesley

Page 10: Extract Transform Load (ETL)


Data Acquisition & Integration

• A process to populate a data warehouse.

• Three main functions:
  • Extract: retrieves data from a source system to produce new Source Data.
  • Transform: inspects, cleanses, and conforms the new Source Data for the data warehouse (the result is called Load Data).
  • Load: updates the data warehouse using the data provided in the Load Data.

• These three functions are more commonly known as ETL.

Page 11: Extract Transform Load (ETL)


DW Data Components

• Fact Table
  • Tells "one version of the truth" about the subject.
  • Holds numerical measurements; e.g., sales amount, total customers.
  • Holds key(s) to the dimension tables.

• Dimension Table
  • Identifies the key cells of the fact table.
  • Supports drill-down and roll-up.
  • Describes the subject, e.g.:
    • product name,
    • customer name,
    • store location.

Page 12: Extract Transform Load (ETL)


ETL Process

• The ETL process begins by defining a data source and identifying which data in that source you are interested in copying to a new destination.

• You may need to perform one or more transformations on the data for retrieval purposes.
  • E.g., you may need to transform "True" or "False" (string type) into "1" or "0" (Boolean type).

• You also need a load sequence to inject the transformed data into the appropriate destination (also called the target system; in this unit, always a data warehouse or a section of a data warehouse architecture, e.g., a data mart). A sketch of the three steps follows.
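As an illustration only (not from the original slides), here is a minimal Python sketch of the three steps; the table name, column names, and in-memory source rows are hypothetical:

```python
# Minimal ETL sketch: extract rows of interest, transform a string flag
# into a Boolean, then load the result into a warehouse table.
import sqlite3

def extract(source_rows):
    """Extract: retrieve the rows of interest from the source system."""
    return [row for row in source_rows if row["active"] in ("True", "False")]

def transform(rows):
    """Transform: convert the 'True'/'False' string flag into 1/0."""
    return [(row["customer"], 1 if row["active"] == "True" else 0)
            for row in rows]

def load(conn, load_data):
    """Load: insert the transformed rows into the warehouse table."""
    conn.executemany(
        "INSERT INTO dim_customer (name, is_active) VALUES (?, ?)", load_data)
    conn.commit()

# Hypothetical source data standing in for an operational system.
source = [{"customer": "Alice", "active": "True"},
          {"customer": "Bob", "active": "False"}]

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.execute("CREATE TABLE dim_customer (name TEXT, is_active INTEGER)")
load(conn, transform(extract(source)))
print(conn.execute("SELECT * FROM dim_customer").fetchall())
# [('Alice', 1), ('Bob', 0)]
```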

Page 13: Extract Transform Load (ETL)


Source System Analysis

• It provides the insight into and understanding of the enterprise and its data that a data warehouse needs in order to express the enterprise at any level.

• It examines the enterprise data for its informational content – the meaning of the data and how it captures and expresses that meaning.

• A data warehouse designer in this stage should focus on the enterprise and the analysis of its data.

• An early and common mistake in data warehouse design is to use source system analysis merely to search for source data within the enterprise that fits a predefined data warehouse design.

• The designer must be allowed to query and survey the enterprise data, not just a summary or description of the enterprise data.

Page 14: Extract Transform Load (ETL)


Source System Analysis Principles

• They explain what the data warehouse designer is looking for, including:
  • Multiple systems of record; e.g., selling products through a series of retail outlets (West Division, East Division).
  • Entity data, including physical and logical members, agents, facilities and resources:
    • Physical entities can be touched and uniquely identified.
    • Logical entities cannot be touched; e.g., concepts, constructs and hierarchies that organize and enhance the meaning of enterprise events and entities.
  • Entities can also describe and qualify each other by their associations; e.g., S Block can identify itself as a unique physical entity as well as identify the location of lecturer #123.

Page 15: Extract Transform Load (ETL)


Source System Analysis Principles

• Granularity. A designer must be aware of the grain of every source data element, which is determined by its level of detail, hierarchical depth, or measurement precision.

• Latency refers to the time gap between an enterprise event and the moment its data becomes available. It determines the earliest moment data will be available to the data warehouse.

• Transaction data; also known as event data, which identifies the moment when an enterprise performs its primary functions, e.g.:
  • Sales – the moment when a retail enterprise sells something.
  • Manufacturing – the moment when an assembly plant builds something.
  • Service – the moment when a consulting firm provides a service.

• Snapshot data expresses the cumulative effect of a series of transactions or events over a range of time, e.g., Web site hits per hour.

Page 16: Extract Transform Load (ETL)


Source System Analysis Methods

• They explain how the data warehouse designer examines the source system to understand how the enterprise and its data interact.

• System documentation is a good start, which provides information about how an enterprise system is intended and expected to behave.

• The interaction of enterprise data is a good baseline from which to start.

• We should also document how an enterprise system misbehaves, creating unexpected data and results (the anomalous data).

• Source system analysis is the first opportunity to protect the quality of the data in a data warehouse.

Page 17: Extract Transform Load (ETL)


Data Flow, State Diagram & System Record

• The Data Flow Diagram is used to identify where the data comes from, where it goes, and by what transport mechanism it moves.

• The Data State Diagram is used to capture the various business meanings of a data element as it flows through the data flow diagram.

• It also indicates the relevance of a data element to the enterprise.

• It also includes any physical indications of each state.

• The authoritative point of origin for each enterprise entity at any given state is the System of Record, from which the ETL application gets the data it loads into a data warehouse.

Page 18: Extract Transform Load (ETL)


Business Rules

• Business rules govern the data in the source system.

• The data profile, data flow diagram, data state diagram and system of record provide the best opportunity to identify the business rules.

• They come in three basic varieties (a sketch of checking each variety follows):
  • Intra-record business rules:
    • Column A + Column B = Column C
  • Intra-dataset business rules:
    • Row 1.Column A + Row 2.Column A = Row 3.Column B
  • Cross-dataset business rules:
    • File 1.Column A = Table 2.Column B
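A minimal Python sketch of checking the three rule varieties; the sample datasets and column names are hypothetical illustrations, not from the slides:

```python
# Sketch: checking the three varieties of business rule against sample data.

rows = [  # one dataset (e.g., a staged file)
    {"A": 10, "B": 5, "C": 15},
    {"A": 20, "B": 1, "C": 21},
    {"A": 30, "B": 30, "C": 99},  # violates the intra-record rule
]
other_table = [{"B": 10}, {"B": 20}, {"B": 30}]  # a second dataset

# Intra-record rule: within each row, Column A + Column B = Column C.
intra_record_violations = [r for r in rows if r["A"] + r["B"] != r["C"]]

# Intra-dataset rule: across rows of one dataset,
# Row 1.Column A + Row 2.Column A = Row 3.Column B.
intra_dataset_ok = rows[0]["A"] + rows[1]["A"] == rows[2]["B"]

# Cross-dataset rule: File 1.Column A = Table 2.Column B, row by row.
cross_dataset_ok = all(r["A"] == t["B"] for r, t in zip(rows, other_table))

print(intra_record_violations)        # the third row fails its check
print(intra_dataset_ok, cross_dataset_ok)  # True True
```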

Page 19: Extract Transform Load (ETL)


Target System Analysis

• The target system is a data warehouse, or a component of a data warehouse architecture.

• The designer needs to choose the data model, RDBMS, and business intelligence reporting architecture.

• It should also indicate how the data warehouse will reflect the enterprise of the source system (e.g., purchase orders, machines, people, etc.) as those entities cycle through their states (e.g., reviewed, approved, commissioned, hired, etc.).

• Target system analysis should reveal and clarify the expectations of both the data warehouse designer and the customers.

• It also provides an opportunity to recognize and resolve discrepancies between the designer and customers.

• Its goal is to create a set of expectations so explicit that these expectations can be compared directly to the data in the data warehouse.

Page 20: Extract Transform Load (ETL)


Data mapping

• It is the process by which an ETL analyst identifies the source data, specific to location, state, and timing, which will be used to satisfy the data requirements of a data warehouse.

• Transformations necessary to create the data elements, as they will be stored in a data warehouse, are also included in a Data Mapping.

• The Data Mapping document is an input into the Data Quality SLA and the Metadata SLA.

• The Data Mapping document must clearly and precisely identify the source data element that will be used, such that there is no ambiguity about the location, state, or timing of the extract of a data element.

• The Data Mapping document must clearly and precisely identify the target data element that will be populated, such that there is no ambiguity about the location and state of the data element as stored in the data warehouse.

• The Data Mapping document must clearly and precisely define the transformations necessary to create the data element as it will be stored in the data warehouse.

Page 21: Extract Transform Load (ETL)


Types of data mapping

1. Simple data mapping

Source data element | Transformation | Target data element
Length in kilometres | n/a | Length in kilometres

2. Derived data mapping

Source data element | Transformation | Target data element
Length in kilometres | × 1000 | Length in metres

Page 22: Extract Transform Load (ETL)


Types of data mapping cont.

3. Recursive data mapping

Source data element | Transformation | Target data element
Length in kilometres | n/a | Length in kilometres
Price per meter | n/a | Price per meter

Source data element | Transformation | Target data element
Price per meter | × 1000 | Price per kilometre

Source data element | Transformation | Target data element
Length in kilometres, Price per kilometre | × | Total price
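A small Python sketch of the three mapping types; the field names and values are hypothetical:

```python
# Sketch of simple, derived, and recursive data mapping.
source = {"length_km": 2.5, "price_per_meter": 4.0}
target = {}

# 1. Simple mapping: copy the source element unchanged.
target["length_km"] = source["length_km"]

# 2. Derived mapping: apply a transformation to a source element.
target["length_m"] = source["length_km"] * 1000
target["price_per_km"] = source["price_per_meter"] * 1000

# 3. Recursive mapping: derive a target element from elements that were
#    themselves produced by earlier mappings.
target["total_price"] = target["length_km"] * target["price_per_km"]

print(target)
# {'length_km': 2.5, 'length_m': 2500.0, 'price_per_km': 4000.0,
#  'total_price': 10000.0}
```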

Page 23: Extract Transform Load (ETL)


ETL vs. ELT

[Diagram: In ETL, the transaction application's source data is extracted into transaction source data, transformed into transaction load data, and then loaded into the transaction table in the data warehouse. In ELT, the transaction source data is extracted and loaded into the data warehouse first, transformed there into transaction load data, and then loaded into the transaction table holding the active and current data.]

Page 24: Extract Transform Load (ETL)


ETL vs. ELT cont.

• In an ETL application, data is extracted from an operational system. A transform performs all data modifications to the Source Data. A load application reads the Load Data and performs the necessary inserts, updates, and deletes to a data warehouse.

• An ELT application performs all the functions and purposes of an ETL application.

• The difference between an ETL application and an ELT application is the platform on which the application performs its functions.

• ELT has two advantages:
  • A data warehouse RDBMS platform is a powerful platform. All the resources (CPU seconds, throughput, etc.) of a data warehouse RDBMS platform are available to an ELT application.
  • A copy of look-up data need not be kept and maintained on the ELT platform, because the data warehouse RDBMS has access to all the data in the data warehouse.

• ELT has one disadvantage:
  • A portion of the data warehouse's resources (CPU seconds, throughput, etc.) is consumed by someone other than a data warehouse customer. Given sufficient data volumes and transformation complexity, this could adversely affect data warehouse customers.
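To make the contrast concrete, here is a minimal Python sketch of the ELT pattern, with the transformation performed inside the warehouse RDBMS itself; the table and column names are hypothetical:

```python
# Sketch of ELT: load the raw data first, then transform it *inside*
# the warehouse using the RDBMS's own SQL engine.
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse RDBMS
conn.execute("CREATE TABLE stage_sales (amount TEXT, active TEXT)")
conn.execute("CREATE TABLE fact_sales (amount REAL, is_active INTEGER)")

# Extract + Load: raw, untransformed rows go straight into a stage table.
raw = [("12.50", "True"), ("7.00", "False")]
conn.executemany("INSERT INTO stage_sales VALUES (?, ?)", raw)

# Transform: the warehouse platform's resources do the conversion work.
conn.execute("""
    INSERT INTO fact_sales (amount, is_active)
    SELECT CAST(amount AS REAL),
           CASE WHEN active = 'True' THEN 1 ELSE 0 END
    FROM stage_sales
""")
conn.commit()
print(conn.execute("SELECT * FROM fact_sales").fetchall())
# [(12.5, 1), (7.0, 0)]
```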

Page 25: Extract Transform Load (ETL)


ETL Design Principles

• ETL applications are subject to unexpected circumstances and, therefore, should expect the unexpected to occur.

• An ETL analyst must work hard to ensure an ETL application is bulletproof, so that each ETL application will behave as intended even if the source system does not.

• The ETL Process Principles (Principles 1 to 6) specifically address the executable part of an ETL application, i.e., the code that moves, copies, and transforms data.
  • This part is similar to a manufacturing plant: it converts and transforms raw data (i.e., materials) into a data warehouse (i.e., the finished product).

• ETL Staging Principles (Principles 7 to 11), provide design principles for managing and controlling the creation and use of stage data and structures.

Page 26: Extract Transform Load (ETL)


Principle 01: One Thing at a Time

• Multitasking conserves time and resources, but it runs contrary to the spirit of ETL: an ETL application assumes that nothing will go as planned and that some input values will be unreasonable or invalid.

• It is recommended to perform each action individually and then combine the separate result sets into one set of data.

• One Thing at a Time is basically a granular, modular approach. Benefits of using a granular, modular approach include:
  • Creating the opportunity for Data Quality and Metadata functions to integrate within an ETL application.
  • Creating the opportunity to isolate violated assumptions.
  • Removing any question about the sequence and precedence of ETL functions, regardless of the language or platform.

Page 27: Extract Transform Load (ETL)


Principle 02: Know When to Begin

• Operational systems rely on operational job schedulers to know when the conditions have been satisfied for a job to begin.

• ETL applications, however, rely on conditions within precedent data (i.e., Begin Conditions). When the precedent Begin Conditions have been satisfied, subsequent applications relying on those conditions can safely begin.
  • An Extract application will examine an operational source system prior to extracting data.
  • A Transform application will examine data provided by preceding Extract applications.
  • A Load application will examine data provided by preceding Transform applications to determine whether or not the Begin Conditions have been satisfied.

• Data Quality and Metadata information prove to be extremely helpful in these circumstances.

• Principle 02 is basically a backward-looking design principle.
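A minimal sketch of a Begin Condition check, assuming a hypothetical control record written by the preceding Extract application:

```python
# Sketch: a Transform step inspects the control metadata left behind by
# the preceding Extract before it begins (a backward-looking check).

extract_control = {           # hypothetical control record
    "job": "extract_sales",
    "status": "complete",     # set by the preceding Extract application
    "row_count": 1042,
}

def begin_conditions_met(control, min_rows=1):
    """Did the precedent job finish and produce a plausible amount of data?"""
    return control["status"] == "complete" and control["row_count"] >= min_rows

if begin_conditions_met(extract_control):
    print("Begin Conditions satisfied: the Transform can safely begin.")
else:
    print("Precedent data not ready: do not begin.")
```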

Page 28: Extract Transform Load (ETL)


Principle 03: Know When to End

• It is a forward-looking design that requires an ETL application to examine data it has created.

• An ETL application can verify, by examining its own output data, whether or not that ETL application has completed satisfactorily.

• Then, the results of that final review can be captured as Data Quality or Metadata information, and shared with subsequent ETL applications.

Page 29: Extract Transform Load (ETL)


Principle 04: Large to Medium to Small

• Large to Medium to Small design assembles all applicable data elements and entities.

• Data that is no longer required is dismissed. The final data set is a load-ready file that will be loaded to a data warehouse.

• At this initial stage, all applicable data is juxtaposed simultaneously. The decision to exclude data is made in the broadest context possible, which allows the greatest control of data exclusion.

Page 30: Extract Transform Load (ETL)


Principle 05: Stage Data Integrity

• It is a design principle by which precedent applications create (store) a set of stage data as it will be consumed by subsequent applications.

• Once created, a set of stage data can only be consumed as a single contiguous set by subsequent applications.

• It avoids unnecessary risk and increases the overall integrity of an ETL application.

• For example, suppose we have source raw-materials data from Companies A, B and C, and an application that extracts data describing raw materials from Company A. We may take one of the following approaches:
  • Create a single set of stage data, ABC (not a good solution; why?).
  • Create multiple sets of stage data: A, B, C and ABC.

Page 31: Extract Transform Load (ETL)


Principle 06: Know What You Have

• It prompts an ETL application to take inventory of inbound data, rather than assume inbound data contains all that is expected.

• Information describing the contents of inbound data is available through two sources: the Metadata and the data itself.

• The output of the comparison of inbound data and expected data includes lists of matches and mismatches (or missing data).

• Normally, a threshold is used for missing data to choose a response based on the history of data anomalies, as in the sketch below.

[Diagram: the inbound data (A, B, C) is compared with the expected data (A, B, C); the matches form the "what you have" list and the mismatches form the "what you don't have" list.]
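A minimal Python sketch of this inventory check; the dataset names and the 10% threshold are hypothetical:

```python
# Sketch of Principle 06: take inventory of the inbound data rather
# than assume it contains everything expected.

expected = {"A", "B", "C"}   # what the Metadata says should arrive
inbound = {"A", "C", "D"}    # what actually arrived

matches = inbound & expected      # what you have
missing = expected - inbound      # what you don't have
unexpected = inbound - expected   # arrived but was not expected

print(f"have={sorted(matches)} missing={sorted(missing)} "
      f"unexpected={sorted(unexpected)}")

# Threshold-based response: tolerate a historically normal amount of
# missing data; otherwise stop the run.
MISSING_THRESHOLD = 0.10          # hypothetical threshold
if len(missing) / len(expected) > MISSING_THRESHOLD:
    raise RuntimeError(f"{len(missing)} expected dataset(s) missing")
```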

Page 32: Extract Transform Load (ETL)


ETL Staging Principles

• Principle 07: Name the data – describes how to identify data and its features, origin and destination with an appropriate level of granularity and control.

• Principle 08: Own the data – describes how to secure data to prevent interference by other applications, including ETL and operational applications.

• Principle 09: Build the data – describes how to create a data set from its foundation.

• Principle 10: Type the data – describes how to protect ETL functions from incompatible data types.

• Principle 11: Land the data – describes the need to retain interim data beyond its immediate use.

Page 33: Extract Transform Load (ETL)


ETL Functions

• ETL functions are designed to discern what has happened in the enterprise, and bring that information to the data warehouse.

• Extract functions – retrieve data from a source system and store it as stage data in the ETL environment.

• Transform functions – are applied to staged datasets to derive the required information (sets of dimension data).

• Load functions – load data into the data warehouse.

Page 34: Extract Transform Load (ETL)


Extract functions

1. Extract data from a contiguous dataset – a simple extract function.

[Diagram: the Source data in the source system is extracted into a Stage dataset in the ETL environment.]

2. Extract data from a data flow – needs a control mechanism based on the bundles of data-flow records.

[Diagram: records flowing out of the source system are extracted into a Stage dataset in the ETL environment.]

Page 35: Extract Transform Load (ETL)


Data-level Transform functions

• Row-level transformation – is applied to every row in a staged dataset; the simplest kind of transformation.

• Dataset-level transformation
  • Is performed within the context of a whole set of data.
  • It must address the whole dataset at a time to derive the information necessary to update each individual row.

• Surrogate key generation: intra-dataset (see the sketch below)
  • Generates a sequential numeric value that uniquely identifies each row of a dataset.
  • A surrogate key is used here to uniquely identify each row in an ETL application, because sometimes the transformed data lacks a key.
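A tiny Python sketch of intra-dataset surrogate key generation; the rows are hypothetical:

```python
# Sketch: assign each row a sequential numeric surrogate key, since the
# transformed data may not carry a natural key of its own.
rows = [{"product": "bolt"}, {"product": "nut"}, {"product": "washer"}]

for key, row in enumerate(rows, start=1):
    row["surrogate_key"] = key   # unique within this dataset

print(rows)
# [{'product': 'bolt', 'surrogate_key': 1},
#  {'product': 'nut', 'surrogate_key': 2},
#  {'product': 'washer', 'surrogate_key': 3}]
```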

Page 36: Extract Transform Load (ETL)


Data warehouse-level transformation

• The functions must be performed within the context of the data warehouse.

• These functions do not, by themselves, have all the knowledge necessary to derive the required data; they have to use both the input data and data already in the data warehouse.

• Surrogate key generation: intra-data warehouse
  • The identifier should be unique throughout the data warehouse.
  • The best way is to retrieve the maximum identifier in the data warehouse and then assign max + 1 to the new row, as in the sketch below.
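A small sketch of the max + 1 approach; the table and column names are hypothetical:

```python
# Sketch: intra-data-warehouse surrogate key generation via max + 1.
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
conn.execute("CREATE TABLE dim_product (product_key INTEGER, name TEXT)")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "bolt"), (2, "nut")])

# Retrieve the current maximum key; COALESCE handles an empty table.
(max_key,) = conn.execute(
    "SELECT COALESCE(MAX(product_key), 0) FROM dim_product").fetchone()

# Assign max + 1 so the new key is unique throughout the warehouse.
conn.execute("INSERT INTO dim_product VALUES (?, ?)", (max_key + 1, "washer"))
print(conn.execute("SELECT * FROM dim_product").fetchall())
# [(1, 'bolt'), (2, 'nut'), (3, 'washer')]
```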

Page 37: Extract Transform Load (ETL)


Load Data

1. Load data from a stable and contiguous dataset – the simplest and most common method.

[Diagram: the Load data in the ETL environment is loaded into the data warehouse.]

2. Load data from a data flow – needs a control mechanism to ensure that each row is loaded only once, as in the sketch below.

[Diagram: Load data flowing through the ETL environment is loaded into the data warehouse as it arrives.]
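A minimal sketch of such a control mechanism; the primary key serving as the control, and the row identifiers, are hypothetical:

```python
# Sketch: loading from a data flow with a control mechanism that
# guarantees each row is loaded only once.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (row_id TEXT PRIMARY KEY, amount REAL)")

flow = [("r1", 10.0), ("r2", 20.0), ("r1", 10.0)]  # "r1" arrives twice

for row_id, amount in flow:
    # The primary key acts as the control: a row id that has already
    # been loaded is silently skipped.
    conn.execute("INSERT OR IGNORE INTO fact_sales VALUES (?, ?)",
                 (row_id, amount))
conn.commit()
print(conn.execute("SELECT * FROM fact_sales").fetchall())
# [('r1', 10.0), ('r2', 20.0)]
```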

Page 38: Extract Transform Load (ETL)


ETL: Beginning to End

[Diagram: Customer expectations feed the target system analysis; together with the source system analysis, this yields the ETL direct requirements, while the Data Quality SLA and Metadata SLA capture the ETL indirect requirements. Guided by the ETL principles, the Source Data passes through Data Mapping/Logical Design and Physical Design into the ETL Application, which delivers data to the Data Warehouse.]

Page 39: Extract Transform Load (ETL)


Closing remarks

• A data warehouse designer captures customers' expectations in the design of a data warehouse.

• A target system analysis captures the behaviour of data in a data warehouse design. These behaviours are expressed as direct requirements.

• Data mapping is a road map showing how an ETL application will achieve data behaviours.

• The Data Quality SLA and Metadata SLA capture the information necessary for customers to use the data in the data warehouse (the indirect requirements):
  • Is the data complete?
  • Are there any anomalies?
  • When is the data available?
  • What is the profile of today's data?

• The direct and indirect requirements meet together in a single physical design, which declares the physical hardware, platform, datasets and jobs that are the ETL application.

• The ETL application delivers data to a data warehouse that meets customer expectations.